A Brief Introduction to Big Data
Andrea Pasqua
Oxford Lectures Part I
July 22nd, 2016
Outline
● Why big data
● Statistics and data science
● Machine learning models
● Questions
Why Big Data
Our Times
● Data is plentiful and inexpensive for the first time in history
1 GB Storage
1980: $193K
2014: $0.02
Challenges and Opportunities
● Challenges
○ Ingesting data
■ Distributed computing
■ Efficiency is paramount
○ Organizing data
■ Structured versus Unstructured Data
■ Cleaning and Curating
○ Interpreting data
■ Which signals matter
■ Visualizing
○ Fighting spurious correlations
■ Highly adaptable models can mistake random correlations for systemic ones
■ Old problem magnified
Abundance of choices can be disorienting
Challenges and Opportunities
● Opportunities
○ Automated mining for signals and patterns
■ Closer to a multi-purpose selection machine
■ Sensory cortex of the brain
○ Ambitious models for ambitious goals
■ IBM Watson: superhuman Jeopardy contestant
■ Google Car: self-driving cars
■ MS COCO (Deep Learning): state-of-the-art image recognition
■ NLP (Word2Vec): glimpses of understanding in NLP
■ Radius Inc.: predicting business behavior
Statistics and Data Science
What is Data Science?
● Big data challenges
○ Ingesting data
○ Organizing data
○ Interpreting data
○ Fighting spurious correlations
● Technology-centered view
○ Data science is about ingesting, organizing and visualizing large amounts of data
○ Basically the same as data analysis
● Statistics-centered view
○ So what is new?
○ Data-rich situation allows for powerful models (millions of parameters)
○ Extreme danger of overfitting
What is Data Science?
● Data science is statistical thinking about
○ The construction of highly powerful, highly adaptable models
○ To predict complex phenomena
○ While keeping overfitting at bay
The Generalization Problem
[Figure: a machine model of “Man” is extrapolated to the wild]
● Lack of generalization
○ The machine overfit, and the error was underestimated
● Adequate generalization
○ The machine fit right, and the error was estimated correctly
So What is New?
● Photographic camera versus pencil drawings
○ Models can wrap around the training data and fit it precisely…
○ … too precisely (overfitting)
Hastie et al., The Elements of Statistical Learning, 2011
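As a rough illustration of a model that "wraps around" its training data, the sketch below uses a nearest-neighbor regressor as a stand-in for a highly adaptable model. The data-generating rule (y = x plus noise), the sample sizes and the helper names are assumptions made for this example, not part of the lecture.

```python
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error on `data` of a k-NN model built from `train`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

def make_data(n, rng):
    # Hypothetical ground truth: y = x plus Gaussian noise.
    return [(x, x + rng.gauss(0, 0.3)) for x in (rng.random() for _ in range(n))]

rng = random.Random(0)
train, test = make_data(50, rng), make_data(50, rng)

# k = 1 memorizes the training set: zero training error, nonzero error on fresh data.
for k in (1, 10):
    print(k, round(mse(train, train, k), 3), round(mse(train, test, k), 3))
```

With k = 1 every training point is its own nearest neighbor, so the training error is exactly zero while the error on fresh data is not: the training error underestimates the true error.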
Optimal Complexity
Search for Optimal Complexity
● Will models generalize to new data?
○ Set aside some data and use it only to test generalizability
● Separate the data into Train, Validate and Test portions
● Training: several models are all fitted
○ Fit models of varying complexity
● Validating: choose the model performing best
○ Choose the right amount of complexity
● Testing: evaluate the performance of the winning model
○ Evaluate the model on fresh data
[Diagram: data split into Train, Validate and Test]
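The three stages above can be sketched as follows. The model family (k-nearest neighbors, where "fitting" just means storing the training portion), the 60/20/20 split, the candidate values of k and the synthetic data are all assumptions chosen to keep the example short.

```python
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(1)
# Hypothetical data: y = x^2 plus noise.
data = [(x, x * x + rng.gauss(0, 0.2))
        for x in (rng.uniform(-1, 1) for _ in range(300))]
train, validate, test = data[:180], data[180:240], data[240:]

# Training + validating: one candidate per complexity level (smaller k = more adaptable).
candidates = [1, 3, 9, 27, 81]
val_errors = {k: mse(train, validate, k) for k in candidates}
best_k = min(val_errors, key=val_errors.get)

# Testing: the Test portion is touched once, only for the winning model.
test_error = mse(train, test, best_k)
print(best_k, round(test_error, 3))
```

The Test portion plays no role in choosing k, which is what keeps the final error estimate honest.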
Judging Models
● The special case of binary classification
○ Win or lose, pay back or default, buy or decline to buy
○ Only one metric of good performance
● Confusion matrix
                  Classified positive   Classified negative
Truly positive    TP                    FN
Truly negative    FP                    TN
○ TP: truly positive and classified as such
○ FP: truly negative but classified as positive
○ TN: truly negative and classified as such
○ FN: truly positive but classified as negative
● Other measures are derived
○ Accuracy: how often right
○ Precision: few false positives
○ Sensitivity: few false negatives
● Which are more damaging, FP or FN?
○ Depends on the use case
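A minimal sketch of how the derived measures follow from the four confusion-matrix counts, using the standard definitions of accuracy, precision and sensitivity. The labels and predictions are made-up illustration data.

```python
def confusion(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # how often right
        "precision": tp / (tp + fp),                  # high when FPs are few
        "sensitivity": tp / (tp + fn),                # high when FNs are few
    }

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(metrics(*confusion(y_true, y_pred)))
```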
Tunable models and ROC
● Some models are tunable
○ They can be made more precise and less sensitive, or vice versa
● ROC curve
○ Sensitivity vs. (1 - specificity), i.e. true positive rate vs. false positive rate
[Figure: ROC curve with annotated regions: few FP and FN; many FP and FN; few FPs but many FNs; few FNs but many FPs; reference curves: Perfect and Random]
● Area Under the Curve (AUC) is a good summary statistic
○ Ranges from ~50% (random) to 100% (perfect)
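To make the tuning idea concrete, the sketch below sweeps a decision threshold over model scores, records the resulting (false positive rate, true positive rate) points, and integrates them with the trapezoidal rule. The scores and labels are invented for illustration, and the code assumes all scores are distinct.

```python
def roc_points(y_true, scores):
    """Lower the threshold one example at a time; return (FPR, TPR) points."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in sorted(zip(scores, y_true), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(roc_points(y_true, scores)), 3))
```

A model that ranks every positive above every negative reaches an area of 1, while random scores hover around one half.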
Machine Learning Models
ML models
● A vast collection of tools
○ Random Forest and Gradient Boosted Trees (Netflix prize)
○ Collaborative filtering and matrix factorization methods: Netflix
○ … and more:
■ Linear and logistic regression, with different regularizers
■ Nonlinear models: e.g. SVM (Support Vector Machines)
■ Bayesian methods: naïve Bayes
■ Unsupervised clustering
■ … and many more
● A closer look at some tools
○ Neural Networks and Deep Learning
○ Word2Vec
Neural Networks
Neural Networks
● Origin: AI
○ To build an expert system, imitate the only expert system we know: the brain
○ So we need analogs of brain components
■ Neurons
■ Synapses
■ Axons
■ Sensory Inputs
[Figures: neuron anatomy and signal, from http://learn.genetics.utah.edu]
Neural Networks
● Signal Propagation
○ Linear weights for the synapses
○ Thresholding function in the body
[Figure: the signal reaches the neuron through synapses with weights w1, w2, …, wn-1, wn]
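A minimal sketch of one artificial neuron as described above: a weighted sum over the incoming signals followed by a hard threshold. The bias term and the particular weights (chosen so that the neuron behaves like a logical AND) are assumptions made for the example.

```python
def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs (the synapses), then a threshold (the body)."""
    signal = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if signal > 0 else 0

# Hypothetical weights: the neuron fires only when both inputs are active.
weights, bias = [1.0, 1.0], -1.5
for inputs in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(inputs, neuron(inputs, weights, bias))
```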
Neural Networks
● Architecture
○ Arrange in layers and propagate the signal forward
[Figure: the signal flows from the Input Layer through the Hidden Layers to the Output Layer, which yields a prediction, classification, ranking, scoring, …]
Neural Networks
● Learning
○ Supervised: compare output with labeled data
○ Comparison: penalize deviations from truth
■ Loss is often quadratic, but not necessarily
■ e.g. (estimated position - real position)^2
○ Learning step:
■ adjust weights to reduce penalty
■ back-propagate the adjustments towards earlier layers
○ Weight adjustment is analogous to synaptic reinforcement in the brain
[Figure: forward propagation of signals from input to output; back propagation of weight adjustments from output to input]
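The learning step can be sketched for a very small network: one tanh hidden layer and a linear output, trained with a quadratic loss on a single labeled example. The architecture, the initial weights, the example and the learning rate are all illustrative assumptions, and plain gradient descent is used as the weight-adjustment rule.

```python
import math

def forward(x, W1, b1, W2, b2):
    """Propagate the signal forward: tanh hidden layer, linear output."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = sum(w * hj for w, hj in zip(W2, h)) + b2
    return h, y

def train_step(x, t, W1, b1, W2, b2, lr):
    """One learning step; W1, b1, W2 are updated in place, b2 is returned."""
    h, y = forward(x, W1, b1, W2, b2)
    dy = 2 * (y - t)                      # gradient of the loss (y - t)^2
    # Back-propagate to the hidden layer using the current output weights.
    dz = [dy * w * (1 - hj * hj) for w, hj in zip(W2, h)]
    for j, hj in enumerate(h):
        W2[j] -= lr * dy * hj
    for j, dzj in enumerate(dz):
        b1[j] -= lr * dzj
        for i, xi in enumerate(x):
            W1[j][i] -= lr * dzj * xi
    return b2 - lr * dy, (y - t) ** 2

# Hypothetical starting weights and a single labeled example.
W1, b1 = [[0.5, -0.3], [0.2, 0.8]], [0.0, 0.0]
W2, b2 = [0.7, -0.4], 0.0
x, t = [1.0, 2.0], 1.0

losses = []
for _ in range(200):
    b2, loss = train_step(x, t, W1, b1, W2, b2, lr=0.01)
    losses.append(loss)
print(losses[0], losses[-1])
```

Each step computes the penalty at the output, then pushes the adjustment back to the earlier layer before any weights change, which is the order the chain rule requires.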
Neural Networks
● Too much learning?
○ Neural networks are rich and adaptable
■ Very many weights, especially if there are several layers
○ Minimizing the loss on the training data will likely result in overfitting
○ For the model to generalize
■ Take small incremental steps (how small? The learning rate)
■ Use cross-validation to determine the optimal learning rate
○ Higher training error but lower test error
Towards Deep Learning
● Humble beginnings
○ Just a few artificial neurons with simple learning (Perceptron Model, 1958)
● Fell into disrepute (70s)
○ AI winter
● Came back (late 80s)
○ Larger sizes and more sophisticated learning
● Exploded (2015)
○ Deep learning took the field by storm
○ Versatile and powerful
Deep Learning and Convolutional Neural Networks
Deep Learning
[Figure: Input Layer, Hidden Layers, …, Output Layer]
● Neural Networks used to be shallow, with one or a few hidden layers
● Then a deep hierarchy of hidden layers was introduced
CNN
● CNN are Convolutional Neural Networks
○ Problem: large input size leads to overfitting
○ Reason: large number of input weights (a high-res image has ~6M input weights)
○ Solution: look at the visual cortex (localization & translation invariance)
● Convolution layers are localized and identical
[Figure: Input Layer, Conv. Layer, Subs. Layer, Conv. Layer, Subs. Layer, …, Fully Conn. Layer, Output Layer]
● Deep indeed!
○ And very neural too
○ State of the art for image recognition…
○ … and several other applications
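The phrase "localized and identical" can be illustrated in one dimension: each output looks only at a small window of the input, and the same few weights are reused at every position. The kernel, the input signals and the choice of max-pooling as the subsampling layer are assumptions for this sketch.

```python
def conv1d(signal, kernel):
    """Apply the same kernel weights to every local window of the signal."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def subsample(values, size=2):
    """Max over non-overlapping windows (one possible subsampling layer)."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

kernel = [-1, 1]                       # hypothetical edge detector: 2 shared weights
signal = [0, 0, 0, 1, 1, 1, 0, 0]
shifted = [0, 0, 0, 0, 1, 1, 1, 0]     # same pattern, moved one step to the right

print(conv1d(signal, kernel))          # the detected edges...
print(conv1d(shifted, kernel))         # ...move along with the input
print(subsample(conv1d(signal, kernel)))
# A fully connected layer mapping these 8 inputs to 7 outputs would need 8 * 7 = 56 weights.
```

Because the weights are shared, shifting the input simply shifts the response, and the number of weights no longer grows with the input size.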
Word2Vec and Natural Language Processing
Natural Language Processing
● Natural Language: Human language as commonly expressed
○ Digest natural language…
○ … and process it for a variety of purposes
■ e.g. determine a course of action (imperative speech)
■ … or summarize/translate…
■ … or sentiment analysis (classification)
○ Crucial to smooth computer-human interactions
● But is it real understanding?
○ Semantic field of a word: e.g. “king” and “monarch”
○ Analogical thinking: e.g. “woman is to man as queen is to king”
○ Context resolution: e.g. “does pear fit in a conversation about fruits?”
Word2Vec and NLP
● Word2Vec
○ Tomas Mikolov et al. at Google
○ Associate a high-dimensional vector to a word or phrase
[Figure: words w1, …, wn and contexts C1, …, Cn are mapped to an internal representation of words]
Word2Vec and NLP
● Is it real understanding?
● Internal representation of words aware of context and analogies
● Potential to revolutionize computer-human interaction
Paris - France + Italy = Rome !!!
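The arithmetic behind such an analogy can be sketched with vector addition and cosine similarity. The vectors below are hand-made toy values (one coordinate per country plus a "capital" coordinate), not embeddings learned by Word2Vec; real embeddings have hundreds of dimensions and match only approximately.

```python
import math

# Toy vectors, invented for illustration: [france, italy, germany, is_capital].
vectors = {
    "France":  [1, 0, 0, 0], "Paris":  [1, 0, 0, 1],
    "Italy":   [0, 1, 0, 0], "Rome":   [0, 1, 0, 1],
    "Germany": [0, 0, 1, 0], "Berlin": [0, 0, 1, 1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("Paris", "France", "Italy"))
```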
Conclusions
Our Times
● Data is plentiful and inexpensive for the first time in history
● Statistical thinking is learning to cope with large data…
● … and with new, more ambitious goals
● First glimpses of usable AI
● Data science has disrupted many industries…
● … and will continue to do so.
Next
Predictive marketing
● An area of high impact for data science and big data
● Radius Intelligence
Thanks
Appendix I
Cross-validation
Search for Optimal Complexity
● We split the data to determine the optimal complexity and to test
● Good against overfitting, but data is not used fully
[Diagram: data split into Train, Validate and Test]
Why Cross-Validation?
● Two contrasting problems
○ Overfitting, a.k.a. the generalization problem
○ Full use of available data
● Contrasting because...
○ To ensure generalization, test on fresh data
○ Fresh data cannot be used for training
● Worse if we need to run multiple models
○ Choosing on Test data would be overfitting
Cross-Validation
● Nested N-fold Cross-Validation
○ Start with a split…
○ … then change it…
○ … and so on, until you have gone through all N(N-1) ~ N^2 combinations
○ All the data gets used to train, validate and test the model
○ We are still out-of-sample
○ For each validation and test we now have a distribution of values
[Diagram: five folds; the Validate and Test roles rotate over the folds while the remaining folds are used to Train]
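The N(N-1) count comes from choosing one fold for Test and a different fold for Validate, with the rest used to Train. A minimal sketch that enumerates these assignments over fold indices follows; the function name and the choice of N = 5 are assumptions for illustration.

```python
from itertools import permutations

def nested_splits(n_folds):
    """Yield (train_folds, validate_fold, test_fold) for every ordered choice."""
    for test, validate in permutations(range(n_folds), 2):
        train = [f for f in range(n_folds) if f not in (test, validate)]
        yield train, validate, test

splits = list(nested_splits(5))
print(len(splits))   # 5 * 4 ordered (test, validate) pairs
print(splits[0])
```

Each split yields one validation score and one test score, which is where the distribution of values comes from.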
Cross-Validation
● Simple N-fold Cross-Validation
○ As an alternative, we can also perform a simple N-fold cross-validation
○ Test is held out
○ N-fold for the validation stage
[Diagram: five folds; the Test fold stays fixed while the Validate role rotates over the remaining folds]
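For comparison, the simple variant can be sketched the same way: one fold is held out for Test throughout, and only the Validate role rotates. Holding out the last fold is an arbitrary assumption of this example.

```python
def simple_splits(n_folds):
    """Hold out the last fold as Test; rotate Validate over the remaining folds."""
    test = n_folds - 1
    for validate in range(n_folds - 1):
        train = [f for f in range(n_folds - 1) if f != validate]
        yield train, validate, test

for train, validate, test in simple_splits(5):
    print("train", train, "validate", validate, "test", test)
```

This produces N-1 splits instead of N(N-1), at the price of a single test score rather than a distribution.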