
Page 1: Oxford Lectures Part 1

A Brief Introduction to Big Data

Andrea Pasqua
Oxford Lectures Part I

July 22nd, 2016

Page 2: Oxford Lectures Part 1

Outline

● Why big data

● Statistics and data science

● Machine learning models

● Questions

Page 3: Oxford Lectures Part 1

Why Big Data

Page 4: Oxford Lectures Part 1

Our Times

● Data is plentiful and inexpensive for the first time in history

Cost of 1 GB of storage: $193K in 1980 vs. $0.02 in 2014

Page 5: Oxford Lectures Part 1

Challenges and Opportunities

● Challenges

○ Ingesting data

■ Distributed computing

■ Efficiency is paramount

Page 6: Oxford Lectures Part 1

Challenges and Opportunities

● Challenges

○ Ingesting data

○ Organizing data

■ Structured versus Unstructured Data

■ Cleaning and Curating

Page 7: Oxford Lectures Part 1

Challenges and Opportunities

● Challenges

○ Ingesting data

○ Organizing data

○ Interpreting data

■ Which signals matter

■ Visualizing

Page 8: Oxford Lectures Part 1

Challenges and Opportunities

● Challenges

○ Ingesting data

○ Organizing data

○ Interpreting data

○ Fighting spurious correlations

■ Highly adaptable models can mistake random correlations for systemic ones

■ An old problem, magnified by big data

Page 9: Oxford Lectures Part 1

Challenges and Opportunities

● Challenges

○ Ingesting data

○ Organizing data

○ Interpreting data

○ Fighting spurious correlations

The abundance of choices can be disorienting

Page 10: Oxford Lectures Part 1

Challenges and Opportunities

● Opportunities

○ Automated mining for signals and patterns

■ Closer to a multi-purpose selection machine

■ Analogous to the sensory cortex of the brain

Page 11: Oxford Lectures Part 1

Challenges and Opportunities

● Opportunities

○ Automated mining for signals and patterns

○ Ambitious models for ambitious goals

■ IBM Watson: superhuman Jeopardy! contestant

■ Google Car: self-driving cars

■ MS COCO (deep learning): state-of-the-art image recognition

■ Word2Vec: glimpses of understanding in NLP

Page 12: Oxford Lectures Part 1

Challenges and Opportunities

● Opportunities

○ Automated mining for signals and patterns

○ Ambitious models for ambitious goals

■ IBM Watson: superhuman Jeopardy! contestant

■ Google Car: self-driving cars

■ MS COCO (deep learning): state-of-the-art image recognition

■ Word2Vec: glimpses of understanding in NLP

■ Radius Inc.: predicting business behavior

Page 13: Oxford Lectures Part 1

Statistics and Data Science

Page 14: Oxford Lectures Part 1

What is Data Science?

● Big data challenges

○ Ingesting data

○ Organizing data

○ Interpreting data

○ Fighting spurious correlations

Page 15: Oxford Lectures Part 1

What is Data Science?

● Big data challenges

○ Ingesting data

○ Organizing data

○ Interpreting data

○ Fighting spurious correlations

● Technology-centered view

○ Data science is about ingesting, organizing, and visualizing large amounts of data

○ Basically the same as data analysis

Page 16: Oxford Lectures Part 1

What is Data Science?

● Big data challenges

○ Ingesting data

○ Organizing data

○ Interpreting data

○ Fighting spurious correlations

● Statistics-centered view

○ So what is new?

○ Data-rich situation allows for powerful models (millions of parameters)

○ Extreme danger of overfitting

Page 17: Oxford Lectures Part 1

What is Data Science?

● Data science is statistical thinking about

○ The construction of highly powerful, highly adaptable models

○ To predict complex phenomena

○ While keeping overfitting at bay

Page 18: Oxford Lectures Part 1

The Generalization Problem

Machine model of “Man”

Page 19: Oxford Lectures Part 1

The Generalization Problem

Machine model of “Man”

Extrapolation to the wild

Page 21: Oxford Lectures Part 1

The Generalization Problem

Machine model of “Man”

Extrapolation to the wild

Lack of generalization

The machine overfit, and the error was underestimated

Page 23: Oxford Lectures Part 1

The Generalization Problem

Machine model of “Man”

Extrapolation to the wild

Adequate generalization

The machine fit well, and the error was estimated correctly

Page 24: Oxford Lectures Part 1

So What is New?

● Photographic camera versus pencil drawings

○ Models can wrap around the training data and fit it precisely…

○ … too precisely (overfitting)

Hastie et al., The Elements of Statistical Learning, 2011

Page 25: Oxford Lectures Part 1

So What is New?

● Photographic camera versus pencil drawings

○ Models can wrap around the training data and fit it precisely…

○ … too precisely (overfitting)

Hastie et al., The Elements of Statistical Learning, 2011

Optimal Complexity
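To see the trade-off concretely, here is a minimal sketch (illustrative Python, not from the lecture) in which training error keeps falling as polynomial degree grows, while validation error turns back up past the optimal complexity:

import numpy as np

# Noisy samples of a smooth ground truth (illustrative data)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, 40))
y = np.sin(3 * x) + rng.normal(0, 0.2, x.size)

x_train, y_train = x[::2], y[::2]   # half the points for training
x_val, y_val = x[1::2], y[1::2]     # the other half for validation

for degree in (1, 3, 6, 12):        # increasing model complexity
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, validation MSE {val_mse:.3f}")

# Training error decreases monotonically with degree; validation error
# reaches a minimum at the optimal complexity and then rises (overfitting).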

Page 26: Oxford Lectures Part 1

Search for Optimal Complexity

● Will models generalize to new data?

○ Set aside some data and use it only to test generalizability

Page 27: Oxford Lectures Part 1

Search for Optimal Complexity

● Will models generalize to new data?

● Separate the data into Train, Validate, and Test portions

Train Validate Test

Page 28: Oxford Lectures Part 1

Search for Optimal Complexity

● Will models generalize to new data?

● Training: several candidate models are fitted

Train Validate Test

Fit models of varying complexity

Page 29: Oxford Lectures Part 1

Search for Optimal Complexity

● Will models generalize to new data?

● Validating: choose the best-performing model

Train Validate Test

Choose the right amount of complexity

Page 30: Oxford Lectures Part 1

Search for Optimal Complexity

● Will models generalize to new data?

● Testing: evaluate the performance of the winning model

Train Validate Test

Evaluate the model on fresh data
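A minimal sketch of the Train / Validate / Test protocol with scikit-learn (assumed available; the dataset and the decision-tree models are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)  # illustrative data

# 60% Train, 20% Validate, 20% Test
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# Train: fit models of varying complexity
candidates = [DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_train, y_train)
              for d in (2, 5, 10, None)]

# Validate: choose the right amount of complexity
best = max(candidates, key=lambda m: m.score(X_val, y_val))

# Test: evaluate the winning model on fresh data, once
print("test accuracy:", best.score(X_test, y_test))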

Page 31: Oxford Lectures Part 1

Judging Models

● The special case of binary classification

○ Win or lose, pay back or default, buy or decline to buy

○ Is there only one metric of good performance?

Page 32: Oxford Lectures Part 1

Judging Models

[Confusion matrix]
          Classified + | Classified -
Truly + :      TP      |      FN
Truly - :      FP      |      TN

● The special case of binary classification

● Confusion matrix

○ TP: truly positive and classified as such

○ FP: truly negative but classified as positive

○ TN: truly negative and classified as such

○ FN: truly positive but classified as negative

Page 33: Oxford Lectures Part 1

Judging Models

● The special case of binary classification

● Confusion matrix

● Other measures are derived

○ Accuracy: how often the classifier is right

○ Precision: high when there are few false positives

○ Sensitivity: high when there are few false negatives

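The derived measures follow directly from the four counts of the confusion matrix; a small sketch with made-up numbers:

# Made-up confusion-matrix counts (illustrative)
TP, FP, TN, FN = 80, 10, 95, 15

accuracy = (TP + TN) / (TP + FP + TN + FN)  # how often the classifier is right
precision = TP / (TP + FP)                  # high when there are few false positives
sensitivity = TP / (TP + FN)                # high when there are few false negatives

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, sensitivity={sensitivity:.2f}")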

Page 34: Oxford Lectures Part 1

Judging Models

● The special case of binary classification

● Confusion matrix

● Other measures are derived

● Which are more damaging, FPs or FNs?

○ It depends on the use case


Page 35: Oxford Lectures Part 1

Tunable models and ROC

● Some models are tunable

○ They can be made more precise and less sensitive, or vice versa

Page 36: Oxford Lectures Part 1

Tunable models and ROC

● Some models are tunable

● ROC curve

○ Sensitivity vs. (1 - specificity)

[Plot: ROC curve; y-axis: sensitivity, x-axis: 1 - specificity]

Page 37: Oxford Lectures Part 1

Tunable models and ROC

● Some models are tunable

● ROC curve

○ Sensitivity vs. (1 - specificity)

[Plot annotation: operating point with few FPs and FNs]

Page 38: Oxford Lectures Part 1

Tunable models and ROC

● Some models are tunable

● ROC curve

○ Sensitivity vs. (1 - specificity)

[Plot annotation: operating point with many FPs and FNs]

Page 39: Oxford Lectures Part 1

Tunable models and ROC

● Some models are tunable

● ROC curve

○ Sensitivity vs. (1 - specificity)

[Plot annotations: one end of the curve has few FPs but many FNs; the other has few FNs but many FPs]

Page 40: Oxford Lectures Part 1

Tunable models and ROC

● Some models are tunable

● ROC curve

○ Sensitivity vs. (1 - specificity)

[Plot annotations: a perfect classifier hugs the top-left corner; a random one follows the diagonal]

Page 41: Oxford Lectures Part 1

Tunable models and ROC

● Some models are tunable

● ROC curve

● Area Under the Curve (AUC) is a good summary statistic

○ Ranges from ~50% (random) to 100% (perfect)
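A minimal sketch of an ROC curve and its AUC with scikit-learn (assumed available), using synthetic labels and classifier scores:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)           # true binary labels (illustrative)
scores = y_true + rng.normal(0, 0.8, 500)  # noisy scores from a tunable classifier

# Sweeping the decision threshold traces the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, scores)

# AUC summarizes the curve: ~0.5 for a random classifier, 1.0 for a perfect one
print("AUC:", roc_auc_score(y_true, scores))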

Page 42: Oxford Lectures Part 1

Machine Learning Models

Page 43: Oxford Lectures Part 1

ML models

● A vast collection of tools

○ Random Forest and Gradient Boosted Trees (Netflix Prize)

○ Collaborative filtering and matrix factorization methods: Netflix

○ … and more:

■ Linear and logistic regression, with different regularizers

■ Nonlinear models: e.g. SVM (Support Vector Machines)

■ Bayesian methods: naïve Bayes

■ Unsupervised clustering

■ … and many more

Page 44: Oxford Lectures Part 1

ML models

● A vast collection of tools

○ Random Forest and Gradient Boosted Trees (Netflix Prize)

○ Collaborative filtering and matrix factorization methods: Netflix

○ … and more

● A closer look at some tools

○ Neural Networks and Deep Learning

○ Word2Vec

Page 45: Oxford Lectures Part 1

Neural Networks

Page 46: Oxford Lectures Part 1

Neural Networks

● Origin: AI

○ To build an expert system, imitate the only expert system we know: the brain

○ So we need analogs of brain components

■ Neurons

■ Synapses

■ Axons

■ Sensory Inputs

Page 47: Oxford Lectures Part 1

Neural Networks

● Origin: AI

○ To build an expert system, imitate the only expert system we know: the brain

○ So we need analogs of brain components

http://learn.genetics.utah.edu

Page 54: Oxford Lectures Part 1

Neural Networks

● Origin: AI

○ To build an expert system, imitate the only expert system we know: the brain

○ So we need analogs of brain components

http://learn.genetics.utah.edu

[Figure: a signal arriving at the neuron]

Page 55: Oxford Lectures Part 1

Neural Networks

● Signal Propagation

○ Linear weights for the synapses

○ Thresholding function in the body

[Diagram: the input signal weighted by w1, w2, …, wn at the synapses]
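A minimal sketch of one artificial neuron, assuming a sigmoid as the thresholding function:

import numpy as np

def neuron(inputs, weights, bias=0.0):
    """Weighted sum over the 'synapses', then a threshold in the cell 'body'."""
    activation = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-activation))  # sigmoid thresholding

signal = np.array([0.5, -1.2, 3.0])  # incoming signal
w = np.array([0.4, 0.1, -0.6])       # synaptic weights w1 ... wn
print(neuron(signal, w))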

Page 56: Oxford Lectures Part 1

Neural Networks

● Architecture

○ Arrange neurons in layers and propagate the signal forward

[Diagram: Signal → Input Layer → Hidden Layers → Output Layer → Prediction, Classification, Ranking, Scoring]

Page 57: Oxford Lectures Part 1

Neural Networks

● Learning

○ Supervised: compare the output with labeled data

○ Comparison: penalize deviations from the truth

■ Loss is often quadratic, but not necessarily

■ e.g. (estimated position - real position)²

○ Learning step:

■ Adjust weights to reduce the penalty

■ Back-propagate the adjustments towards earlier layers

○ Weight adjustment is analogous to synaptic reinforcement in the brain

[Diagram: signals propagate forward from Input to Output; weight adjustments propagate backward]
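A minimal sketch of the learning step for the one-layer special case, where back-propagation reduces to plain gradient descent on the quadratic loss (data and learning rate are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # inputs
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                       # labeled data ("truth")

w = np.zeros(3)                      # initial weights
lr = 0.1                             # learning rate: the size of each step
for _ in range(200):
    y_hat = X @ w                            # forward propagation of the signal
    grad = 2 * X.T @ (y_hat - y) / len(y)    # gradient of mean (estimated - real)^2
    w -= lr * grad                           # adjust weights to reduce the penalty

print(w)  # converges to w_true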

Page 58: Oxford Lectures Part 1

Neural Networks

● Too much learning?

○ Neural networks are rich and adaptable

■ Very many weights, especially if there are several layers

○ Fully minimizing the loss on the training data will likely result in overfitting

○ For the model to generalize

■ Take small incremental steps (how small? the learning rate)

■ Use cross-validation to determine the optimal learning rate

○ The result: higher training error but lower test error

Page 59: Oxford Lectures Part 1

Towards Deep Learning

● Humble beginnings

○ Just a few artificial neurons with simple learning (Perceptron model, 1958)

● Fell into disrepute (1970s)

○ AI winter

● Came back (late 1980s)

○ Larger sizes and more sophisticated learning

● Exploded (2015)

○ Deep learning stormed the field

○ Versatile and powerful

Page 60: Oxford Lectures Part 1

Deep Learning and Convolutional Neural Networks

Page 61: Oxford Lectures Part 1

Deep Learning

● Neural networks used to be shallow, with one or a few hidden layers

● Then a deep hierarchy of hidden layers was introduced

[Diagram: Input Layer → many Hidden Layers → Output Layer]

Page 62: Oxford Lectures Part 1

CNN

● CNNs are Convolutional Neural Networks

○ Problem: large input size → overfitting

○ Reason: large number of input weights (a high-res image gives ~6M input weights)

○ Solution: look to the visual cortex → localization & translation invariance

Page 63: Oxford Lectures Part 1

CNN

● CNNs are Convolutional Neural Networks

● Convolution layers are localized and identical

Page 64: Oxford Lectures Part 1

CNN

● CNNs are Convolutional Neural Networks

● Convolution layers are localized and identical

[Diagram: Input Layer → Conv. Layer]

Page 67: Oxford Lectures Part 1

CNN

● CNNs are Convolutional Neural Networks

● Convolution layers are localized and identical

[Diagram: Input Layer → Conv. Layer → Subs. Layer]

Page 68: Oxford Lectures Part 1

CNN

● CNNs are Convolutional Neural Networks

● Convolution layers are localized and identical

[Diagram: Input Layer → Conv. Layer → Subs. Layer → Conv. Layer → Subs. Layer → …]

Page 69: Oxford Lectures Part 1

CNN

● CNNs are Convolutional Neural Networks

● Convolution layers are localized and identical

[Diagram: Input Layer → Conv. Layer → Subs. Layer → Conv. Layer → Subs. Layer → … → Fully Conn. Layer]

Page 70: Oxford Lectures Part 1

CNN

● CNNs are Convolutional Neural Networks

● Convolution layers are localized and identical

[Diagram: Input Layer → Conv. Layer → Subs. Layer → Conv. Layer → Subs. Layer → … → Fully Conn. Layer → Output Layer]

Page 71: Oxford Lectures Part 1

CNN

● CNNs are Convolutional Neural Networks

● Convolution layers are localized and identical

● Deep indeed!

○ And very neural too

○ State of the art for image recognition…

○ … and several other applications
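A minimal sketch of such an architecture in Keras (assuming TensorFlow is installed; the layer sizes and the 10-class output are illustrative):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(64, 64, 1)),               # input layer: 64x64 grayscale image
    layers.Conv2D(16, (3, 3), activation="relu"),  # conv layer: localized, shared weights
    layers.MaxPooling2D((2, 2)),                   # subsampling layer
    layers.Conv2D(32, (3, 3), activation="relu"),  # conv layer
    layers.MaxPooling2D((2, 2)),                   # subsampling layer
    layers.Flatten(),
    layers.Dense(64, activation="relu"),           # fully connected layer
    layers.Dense(10, activation="softmax"),        # output layer: 10 classes
])
model.summary()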

Page 72: Oxford Lectures Part 1

Word2Vec and Natural Language Processing

Page 73: Oxford Lectures Part 1

Natural Language Processing

● Natural language: human language as commonly expressed

○ Digest natural language…

○ … and process it for a variety of purposes

■ e.g. determine a course of action (imperative speech)

■ … or summarize/translate…

■ … or sentiment analysis (classification)

○ Crucial to smooth computer-human interactions

Page 74: Oxford Lectures Part 1

Natural Language Processing

● Natural language: human language as commonly expressed

○ Digest natural language…

○ … and process it for a variety of purposes

○ Crucial to smooth computer-human interactions

● But is it real understanding?

○ Semantic field of a word: e.g. "king" and "monarch"

○ Analogical thinking: e.g. “woman is to man as queen is to king”

○ Context resolution: e.g. does "pear" fit in a conversation about fruit?

Page 75: Oxford Lectures Part 1

Word2Vec and NLP

● Word2Vec

○ Tomas Mikolov et al. at Google

○ Associate a high-dimensional vector with each word or phrase

[Diagram: an internal representation of words]

Page 76: Oxford Lectures Part 1

Word2Vec and NLP

● Word2Vec

○ Tomas Mikolov et al. at Google

○ Associate a high-dimensional vector with each word or phrase

[Diagram: words w1, …, wn and contexts C1, …, Cn; each context vector C interacts with a word vector w]

Page 77: Oxford Lectures Part 1

Word2Vec and NLP

● Is it real understanding?

● An internal representation of words, aware of context and analogies

● Potential to revolutionize computer-human interaction

Paris - France + Italy = Rome !!!
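A minimal sketch of this vector arithmetic with gensim (assumed installed), using the pretrained Google News vectors; the exact neighbors may vary:

import gensim.downloader

# Pretrained Word2Vec vectors trained on the Google News corpus (large download)
vectors = gensim.downloader.load("word2vec-google-news-300")

# "France is to Paris as Italy is to ...?"
print(vectors.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=1))
# Expected top neighbor: 'Rome'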

Page 78: Oxford Lectures Part 1

Conclusions

Page 79: Oxford Lectures Part 1

Our Times

● Data is plentiful and inexpensive for the first time in history

● Statistical Thinking is learning to cope with large data…

● … and with new more ambitious goals

Page 80: Oxford Lectures Part 1

Our Times

● Data is plentiful and inexpensive for the first time in history

● Statistical Thinking is learning to cope with large data…

● … and with new more ambitious goals

● First glimpses of usable AI

Page 81: Oxford Lectures Part 1

Our Times

● Data is plentiful and inexpensive for the first time in history

● Statistical Thinking is learning to cope with large data…

● … and with new more ambitious goals

● First glimpses of usable AI

● Data science has disrupted many industries…

● … and will continue to do so.

Page 82: Oxford Lectures Part 1

Next

Page 83: Oxford Lectures Part 1

Predictive marketing

● An area of high impact for data science and big data

● Radius Intelligence

Page 84: Oxford Lectures Part 1

Thanks

Page 85: Oxford Lectures Part 1

Appendix I: Cross-Validation

Page 86: Oxford Lectures Part 1

Search for Optimal Complexity

● We split the data to determine the optimal complexity and to test

● Good against overfitting, but the data is not fully used

Train Validate Test

Page 87: Oxford Lectures Part 1

Why Cross-Validation?

● Two contrasting problems

○ Overfitting, a.k.a. the generalization problem

○ Full use of available data

Page 88: Oxford Lectures Part 1

Why Cross-Validation?

● Two contrasting problems

○ Overfitting, a.k.a. the generalization problem

○ Full use of available data

● Contrasting because…

○ To ensure generalization, test on fresh data

○ Fresh data cannot be used for training.

Page 89: Oxford Lectures Part 1

Why Cross-Validation?

● Two contrasting problems

○ Overfitting, a.k.a. the generalization problem

○ Full use of available data

● Contrasting because…

○ To ensure generalization, test on fresh data

○ Fresh data cannot be used for training.

● Worse if we need to compare multiple models

○ Choosing on Test data would itself be overfitting

Page 90: Oxford Lectures Part 1

Cross-Validation

● Nested N-fold Cross-Validation

○ Start with a split …

[Fold diagram: Validate | Test | Train | Train | Train]

Page 91: Oxford Lectures Part 1

Cross-Validation

● Nested N-fold Cross-Validation

○ … then change it …

[Fold diagram: the same five folds, with the Validate and Test roles reassigned]

Page 92: Oxford Lectures Part 1

Cross-Validation

● Nested N-fold Cross-Validation

○ … and so on, until you have gone through all N(N-1) ~ N² combinations

[Fold diagram: Train | Train | Test | Train | Validate]

Page 93: Oxford Lectures Part 1

Cross-Validation

● Nested N-fold Cross-Validation

○ All the data gets used to train, validate, and test the model

○ We are still out-of-sample

○ For each validation and test we now have a distribution of values

[Fold diagram: Train | Train | Test | Train | Validate]
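A minimal sketch of nested cross-validation with scikit-learn (assumed available): an inner grid search plays the Validate role, the outer folds play the Test role:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # illustrative data

# Inner loop: choose the right amount of complexity (Validate)
inner = GridSearchCV(DecisionTreeClassifier(random_state=0),
                     param_grid={"max_depth": [2, 5, 10]}, cv=5)

# Outer loop: score the chosen model on held-out folds (Test)
scores = cross_val_score(inner, X, y, cv=5)
print("out-of-sample accuracy per outer fold:", scores)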

Page 94: Oxford Lectures Part 1

Cross-Validation

● Simple N-fold Cross-Validation

○ As an alternative, we can perform a simple N-fold cross-validation

○ The Test set is held out

○ N folds for the validation stage

[Fold diagram: Train | Test | Validate | Train | Train]

Page 95: Oxford Lectures Part 1

Cross-Validation

● Simple N-fold Cross-Validation

○ As an alternative, we can perform a simple N-fold cross-validation

○ The Test set is held out

○ N folds for the validation stage

[Fold diagram: Validate | Test | Train | Train | Train]

Page 96: Oxford Lectures Part 1

Cross-Validation

● Simple N-fold Cross-Validation

○ As an alternative, we can perform a simple N-fold cross-validation

○ The Test set is held out

○ N folds for the validation stage

[Fold diagram: Train | Test | Train | Train | Validate]
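A minimal sketch of the simple variant with scikit-learn (assumed available): the Test set is held out once, and the Validate fold rotates over the remainder:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # illustrative data

# Hold out the Test set once
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# N-fold validation stage: each fold takes a turn as Validate
val_scores = cross_val_score(DecisionTreeClassifier(max_depth=5, random_state=0),
                             X_rest, y_rest, cv=5)
print("validation accuracy per fold:", val_scores)

# Final one-shot evaluation on the held-out Test set
model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_rest, y_rest)
print("test accuracy:", model.score(X_test, y_test))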