
Page 1: Oxford Lectures Part 1

A Brief Introduction to Big Data

Andrea Pasqua
Oxford Lectures Part I

July 22nd, 2016

Page 2: Oxford Lectures Part 1

Outline

● Why big data

● Statistics and data science

● Machine learning models

● Questions

Page 3: Oxford Lectures Part 1

Why Big Data

Page 4: Oxford Lectures Part 1

Our Times

● Data is plentiful and inexpensive for the first time in history

Cost of 1 GB of storage: $193K in 1980 vs. $0.02 in 2014

Page 5: Oxford Lectures Part 1

Challenges and Opportunities

● Challenges

○ Ingesting data

■ Distributed computing

■ Efficiency is paramount

Page 6: Oxford Lectures Part 1

Challenges and Opportunities

● Challenges

○ Ingesting data

○ Organizing data

■ Structured versus Unstructured Data

■ Cleaning and Curating

Page 7: Oxford Lectures Part 1

Challenges and Opportunities

● Challenges

○ Ingesting data

○ Organizing data

○ Interpreting data

■ Which signals matter

■ Visualizing

Page 8: Oxford Lectures Part 1

Challenges and Opportunities

● Challenges

○ Ingesting data

○ Organizing data

○ Interpreting data

○ Fighting spurious correlations

■ Highly adaptable models can mistake random correlations for systemic ones

■ An old problem, magnified by big data

Page 9: Oxford Lectures Part 1

Challenges and Opportunities

● Challenges

○ Ingesting data

○ Organizing data

○ Interpreting data

○ Fighting spurious correlations

The abundance of choices can be disorienting

Page 10: Oxford Lectures Part 1

Challenges and Opportunities

● Opportunities

○ Automated mining for signals and patterns

■ Closer to a multi-purpose selection machine

■ Analogous to the sensory cortex of the brain

Page 11: Oxford Lectures Part 1

Challenges and Opportunities

● Opportunities

○ Automated mining for signals and patterns

○ Ambitious models for ambitious goals

■ IBM Watson: superhuman Jeopardy! contestant

■ Google Car: self-driving cars

■ MS COCO (deep learning): state-of-the-art image recognition

■ Word2Vec: glimpses of understanding in NLP

Page 12: Oxford Lectures Part 1

Challenges and Opportunities

● Opportunities

○ Automated mining for signals and patterns

○ Ambitious models for ambitious goals

■ IBM Watson: superhuman Jeopardy! contestant

■ Google Car: self-driving cars

■ MS COCO (deep learning): state-of-the-art image recognition

■ Word2Vec: glimpses of understanding in NLP

■ Radius Inc.: predicting business behavior

Page 13: Oxford Lectures Part 1

Statistics and Data Science

Page 14: Oxford Lectures Part 1

What is Data Science?

● Big data challenges

○ Ingesting data

○ Organizing data

○ Interpreting data

○ Fighting spurious correlations

Page 15: Oxford Lectures Part 1

What is Data Science?

● Big data challenges

○ Ingesting data

○ Organizing data

○ Interpreting data

○ Fighting spurious correlations

● Technology-centered view

○ Data science is about ingesting, organizing, and visualizing large amounts of data

○ Basically the same as data analysis

Page 16: Oxford Lectures Part 1

What is Data Science?

● Big data challenges

○ Ingesting data

○ Organizing data

○ Interpreting data

○ Fighting spurious correlations

● Statistics-centered view

○ So what is new?

○ Data-rich situation allows for powerful models (millions of parameters)

○ Extreme danger of overfitting

Page 17: Oxford Lectures Part 1

What is Data Science?

● Data science is statistical thinking about

○ The construction of highly powerful, highly adaptable models

○ To predict complex phenomena

○ While keeping overfitting at bay

Page 18: Oxford Lectures Part 1

The Generalization Problem

Machine model of “Man”

Page 19: Oxford Lectures Part 1

The Generalization Problem

Machine model of “Man”

Extrapolation to the wild

Page 21: Oxford Lectures Part 1

The Generalization Problem

Machine model of “Man”

Extrapolation to the wild

Lack of generalization

The machine overfit, and the error was underestimated

Page 23: Oxford Lectures Part 1

The Generalization Problem

Machine model of “Man”

Extrapolation to the wild

Adequate generalization

The machine fit well, and the error was estimated correctly

Page 24: Oxford Lectures Part 1

So What is New?

● Photographic camera versus pencil drawings

○ Models can wrap around the training data and fit it precisely…

○ … too precisely (overfitting)

Hastie et al., The Elements of Statistical Learning, 2011

Page 25: Oxford Lectures Part 1

So What is New?

● Photographic camera versus pencil drawings

○ Models can wrap around the training data and fit it precisely…

○ … too precisely (overfitting)

Hastie et al., The Elements of Statistical Learning, 2011

Optimal Complexity
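To see the trade-off concretely, here is a minimal sketch (illustrative Python, not from the lecture) in which training error keeps falling as polynomial degree grows, while validation error turns back up past the optimal complexity:

import numpy as np

# Noisy samples of a smooth ground truth (illustrative data)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, 40))
y = np.sin(3 * x) + rng.normal(0, 0.2, x.size)

x_train, y_train = x[::2], y[::2]   # half the points for training
x_val, y_val = x[1::2], y[1::2]     # the other half for validation

for degree in (1, 3, 6, 12):        # increasing model complexity
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, validation MSE {val_mse:.3f}")

# Training error decreases monotonically with degree; validation error
# reaches a minimum at the optimal complexity and then rises (overfitting).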

Page 26: Oxford Lectures Part 1

Search for Optimal Complexity

● Will models generalize to new data?

○ Set aside some data and use it only to test generalizability

Page 27: Oxford Lectures Part 1

Search for Optimal Complexity

● Will models generalize to new data?

● Separate the data into Train, Validate, and Test portions

Train Validate Test

Page 28: Oxford Lectures Part 1

Search for Optimal Complexity

● Will models generalize to new data?

● Training: several candidate models are fitted

Train Validate Test

Fit models of varying complexity

Page 29: Oxford Lectures Part 1

Search for Optimal Complexity

● Will models generalize to new data?

● Validating: choose the best-performing model

Train Validate Test

Choose the right amount of complexity

Page 30: Oxford Lectures Part 1

Search for Optimal Complexity

● Will models generalize to new data?

● Testing: evaluate the performance of the winning model

Train Validate Test

Evaluate the model on fresh data
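A minimal sketch of the Train / Validate / Test protocol with scikit-learn (assumed available; the dataset and the decision-tree models are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)  # illustrative data

# 60% Train, 20% Validate, 20% Test
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# Train: fit models of varying complexity
candidates = [DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_train, y_train)
              for d in (2, 5, 10, None)]

# Validate: choose the right amount of complexity
best = max(candidates, key=lambda m: m.score(X_val, y_val))

# Test: evaluate the winning model on fresh data, once
print("test accuracy:", best.score(X_test, y_test))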

Page 31: Oxford Lectures Part 1

Judging Models

● The special case of binary classification

○ Win or lose, pay back or default, buy or decline to buy

○ Is there only one metric of good performance?

Page 32: Oxford Lectures Part 1

Judging Models

[Confusion matrix]
          Classified + | Classified -
Truly + :      TP      |      FN
Truly - :      FP      |      TN

● The special case of binary classification

● Confusion matrix

○ TP: truly positive and classified as such

○ FP: truly negative but classified as positive

○ TN: truly negative and classified as such

○ FN: truly positive but classified as negative

Page 33: Oxford Lectures Part 1

Judging Models

● The special case of binary classification

● Confusion matrix

● Other measures are derived

○ Accuracy: how often the classifier is right

○ Precision: high when there are few false positives

○ Sensitivity: high when there are few false negatives

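The derived measures follow directly from the four counts of the confusion matrix; a small sketch with made-up numbers:

# Made-up confusion-matrix counts (illustrative)
TP, FP, TN, FN = 80, 10, 95, 15

accuracy = (TP + TN) / (TP + FP + TN + FN)  # how often the classifier is right
precision = TP / (TP + FP)                  # high when there are few false positives
sensitivity = TP / (TP + FN)                # high when there are few false negatives

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, sensitivity={sensitivity:.2f}")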

Page 34: Oxford Lectures Part 1

Judging Models

● The special case of binary classification

● Confusion matrix

● Other measures are derived

● Which are more damaging, FPs or FNs?

○ It depends on the use case


Page 35: Oxford Lectures Part 1

Tunable models and ROC

● Some models are tunable

○ They can be made more precise and less sensitive, or vice versa

Page 36: Oxford Lectures Part 1

Tunable models and ROC

● Some models are tunable

● ROC curve

○ Sensitivity vs. (1 - specificity)

[Plot: ROC curve; y-axis: sensitivity, x-axis: 1 - specificity]

Page 37: Oxford Lectures Part 1

Tunable models and ROC

● Some models are tunable

● ROC curve

○ Sensitivity vs. (1 - specificity)

[Plot annotation: operating point with few FPs and FNs]

Page 38: Oxford Lectures Part 1

Tunable models and ROC

● Some models are tunable

● ROC curve

○ Sensitivity vs. (1 - specificity)

[Plot annotation: operating point with many FPs and FNs]

Page 39: Oxford Lectures Part 1

Tunable models and ROC

● Some models are tunable

● ROC curve

○ Sensitivity vs. (1 - specificity)

[Plot annotations: one end of the curve has few FPs but many FNs; the other has few FNs but many FPs]

Page 40: Oxford Lectures Part 1

Tunable models and ROC

● Some models are tunable

● ROC curve

○ Sensitivity vs. (1 - specificity)

[Plot annotations: a perfect classifier hugs the top-left corner; a random one follows the diagonal]

Page 41: Oxford Lectures Part 1

Tunable models and ROC

● Some models are tunable

● ROC curve

● Area Under the Curve (AUC) is a good summary statistic

○ Ranges from ~50% (random) to 100% (perfect)
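A minimal sketch of an ROC curve and its AUC with scikit-learn (assumed available), using synthetic labels and classifier scores:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)           # true binary labels (illustrative)
scores = y_true + rng.normal(0, 0.8, 500)  # noisy scores from a tunable classifier

# Sweeping the decision threshold traces the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, scores)

# AUC summarizes the curve: ~0.5 for a random classifier, 1.0 for a perfect one
print("AUC:", roc_auc_score(y_true, scores))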

Page 42: Oxford Lectures Part 1

Machine Learning Models

Page 43: Oxford Lectures Part 1

ML models

● A vast collection of tools

○ Random Forest and Gradient Boosted Trees (Netflix Prize)

○ Collaborative filtering and matrix factorization methods: Netflix

○ … and more:

■ Linear and logistic regression, with different regularizers

■ Nonlinear models: e.g. SVM (Support Vector Machines)

■ Bayesian methods: naïve Bayes

■ Unsupervised clustering

■ … and many more

Page 44: Oxford Lectures Part 1

ML models

● A vast collection of tools

○ Random Forest and Gradient Boosted Trees (Netflix Prize)

○ Collaborative filtering and matrix factorization methods: Netflix

○ … and more

● A closer look at some tools

○ Neural Networks and Deep Learning

○ Word2Vec

Page 45: Oxford Lectures Part 1

Neural Networks

Page 46: Oxford Lectures Part 1

Neural Networks

● Origin: AI

○ To build an expert system, imitate the only expert system we know: the brain

○ So we need analogs of brain components

■ Neurons

■ Synapses

■ Axons

■ Sensory Inputs

Page 47: Oxford Lectures Part 1

Neural Networks

● Origin: AI

○ To build an expert system, imitate the only expert system we know: the brain

○ So we need analogs of brain components

http://learn.genetics.utah.edu

Page 54: Oxford Lectures Part 1

Neural Networks

● Origin: AI

○ To build an expert system, imitate the only expert system we know: the brain

○ So we need analogs of brain components

http://learn.genetics.utah.edu

[Figure: a signal arriving at the neuron]

Page 55: Oxford Lectures Part 1

Neural Networks

● Signal Propagation

○ Linear weights for the synapses

○ Thresholding function in the body

[Diagram: the input signal weighted by w1, w2, …, wn at the synapses]
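A minimal sketch of one artificial neuron, assuming a sigmoid as the thresholding function:

import numpy as np

def neuron(inputs, weights, bias=0.0):
    """Weighted sum over the 'synapses', then a threshold in the cell 'body'."""
    activation = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-activation))  # sigmoid thresholding

signal = np.array([0.5, -1.2, 3.0])  # incoming signal
w = np.array([0.4, 0.1, -0.6])       # synaptic weights w1 ... wn
print(neuron(signal, w))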

Page 56: Oxford Lectures Part 1

Neural Networks

● Architecture

○ Arrange neurons in layers and propagate the signal forward

[Diagram: Signal → Input Layer → Hidden Layers → Output Layer → Prediction, Classification, Ranking, Scoring]

Page 57: Oxford Lectures Part 1

Neural Networks

● Learning

○ Supervised: compare the output with labeled data

○ Comparison: penalize deviations from the truth

■ Loss is often quadratic, but not necessarily

■ e.g. (estimated position - real position)²

○ Learning step:

■ Adjust weights to reduce the penalty

■ Back-propagate the adjustments towards earlier layers

○ Weight adjustment is analogous to synaptic reinforcement in the brain

[Diagram: signals propagate forward from Input to Output; weight adjustments propagate backward]
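A minimal sketch of the learning step for the one-layer special case, where back-propagation reduces to plain gradient descent on the quadratic loss (data and learning rate are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # inputs
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                       # labeled data ("truth")

w = np.zeros(3)                      # initial weights
lr = 0.1                             # learning rate: the size of each step
for _ in range(200):
    y_hat = X @ w                            # forward propagation of the signal
    grad = 2 * X.T @ (y_hat - y) / len(y)    # gradient of mean (estimated - real)^2
    w -= lr * grad                           # adjust weights to reduce the penalty

print(w)  # converges to w_true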

Page 58: Oxford Lectures Part 1

Neural Networks

● Too much learning?

○ Neural networks are rich and adaptable

■ Very many weights, especially if there are several layers

○ Fully minimizing the loss on the training data will likely result in overfitting

○ For the model to generalize

■ Take small incremental steps (how small? the learning rate)

■ Use cross-validation to determine the optimal learning rate

○ The result: higher training error but lower test error

Page 59: Oxford Lectures Part 1

Towards Deep Learning

● Humble beginnings

○ Just a few artificial neurons with simple learning (Perceptron model, 1958)

● Fell into disrepute (1970s)

○ AI winter

● Came back (late 1980s)

○ Larger sizes and more sophisticated learning

● Exploded (2015)

○ Deep learning stormed the field

○ Versatile and powerful

Page 60: Oxford Lectures Part 1

Deep Learning and Convolutional Neural Networks

Page 61: Oxford Lectures Part 1

Deep Learning

● Neural networks used to be shallow, with one or a few hidden layers

● Then a deep hierarchy of hidden layers was introduced

[Diagram: Input Layer → many Hidden Layers → Output Layer]

Page 62: Oxford Lectures Part 1

CNN

● CNNs are Convolutional Neural Networks

○ Problem: large input size → overfitting

○ Reason: large number of input weights (a high-res image gives ~6M input weights)

○ Solution: look to the visual cortex → localization & translation invariance

Page 63: Oxford Lectures Part 1

CNN

● CNNs are Convolutional Neural Networks

● Convolution layers are localized and identical

Page 64: Oxford Lectures Part 1

CNN

● CNNs are Convolutional Neural Networks

● Convolution layers are localized and identical

[Diagram: Input Layer → Conv. Layer]

Page 67: Oxford Lectures Part 1

CNN

● CNNs are Convolutional Neural Networks

● Convolution layers are localized and identical

[Diagram: Input Layer → Conv. Layer → Subs. Layer]

Page 68: Oxford Lectures Part 1

CNN

● CNNs are Convolutional Neural Networks

● Convolution layers are localized and identical

[Diagram: Input Layer → Conv. Layer → Subs. Layer → Conv. Layer → Subs. Layer → …]

Page 69: Oxford Lectures Part 1

CNN

● CNNs are Convolutional Neural Networks

● Convolution layers are localized and identical

[Diagram: Input Layer → Conv. Layer → Subs. Layer → Conv. Layer → Subs. Layer → … → Fully Conn. Layer]

Page 70: Oxford Lectures Part 1

CNN

● CNNs are Convolutional Neural Networks

● Convolution layers are localized and identical

[Diagram: Input Layer → Conv. Layer → Subs. Layer → Conv. Layer → Subs. Layer → … → Fully Conn. Layer → Output Layer]

Page 71: Oxford Lectures Part 1

CNN

● CNNs are Convolutional Neural Networks

● Convolution layers are localized and identical

● Deep indeed!

○ And very neural too

○ State of the art for image recognition…

○ … and several other applications
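A minimal sketch of such an architecture in Keras (assuming TensorFlow is installed; the layer sizes and the 10-class output are illustrative):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(64, 64, 1)),               # input layer: 64x64 grayscale image
    layers.Conv2D(16, (3, 3), activation="relu"),  # conv layer: localized, shared weights
    layers.MaxPooling2D((2, 2)),                   # subsampling layer
    layers.Conv2D(32, (3, 3), activation="relu"),  # conv layer
    layers.MaxPooling2D((2, 2)),                   # subsampling layer
    layers.Flatten(),
    layers.Dense(64, activation="relu"),           # fully connected layer
    layers.Dense(10, activation="softmax"),        # output layer: 10 classes
])
model.summary()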

Page 72: Oxford Lectures Part 1

Word2Vec and Natural Language Processing

Page 73: Oxford Lectures Part 1

Natural Language Processing

● Natural language: human language as commonly expressed

○ Digest natural language…

○ … and process it for a variety of purposes

■ e.g. determine a course of action (imperative speech)

■ … or summarize/translate…

■ … or sentiment analysis (classification)

○ Crucial to smooth computer-human interactions

Page 74: Oxford Lectures Part 1

Natural Language Processing

● Natural language: human language as commonly expressed

○ Digest natural language…

○ … and process it for a variety of purposes

○ Crucial to smooth computer-human interactions

● But is it real understanding?

○ Semantic field of a word: e.g. "king" and "monarch"

○ Analogical thinking: e.g. “woman is to man as queen is to king”

○ Context resolution: e.g. does "pear" fit in a conversation about fruit?

Page 75: Oxford Lectures Part 1

Word2Vec and NLP

● Word2Vec

○ Tomas Mikolov et al. at Google

○ Associate a high-dimensional vector with each word or phrase

[Diagram: an internal representation of words]

Page 76: Oxford Lectures Part 1

Word2Vec and NLP

● Word2Vec

○ Tomas Mikolov et al. at Google

○ Associate a high-dimensional vector with each word or phrase

[Diagram: words w1, …, wn and contexts C1, …, Cn; each context vector C interacts with a word vector w]

Page 77: Oxford Lectures Part 1

Word2Vec and NLP

● Is it real understanding?

● An internal representation of words, aware of context and analogies

● Potential to revolutionize computer-human interaction

Paris - France + Italy = Rome !!!
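A minimal sketch of this vector arithmetic with gensim (assumed installed), using the pretrained Google News vectors; the exact neighbors may vary:

import gensim.downloader

# Pretrained Word2Vec vectors trained on the Google News corpus (large download)
vectors = gensim.downloader.load("word2vec-google-news-300")

# "France is to Paris as Italy is to ...?"
print(vectors.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=1))
# Expected top neighbor: 'Rome'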

Page 78: Oxford Lectures Part 1

Conclusions

Page 79: Oxford Lectures Part 1

Our Times

● Data is plentiful and inexpensive for the first time in history

● Statistical Thinking is learning to cope with large data…

● … and with new more ambitious goals

Page 80: Oxford Lectures Part 1

Our Times

● Data is plentiful and inexpensive for the first time in history

● Statistical Thinking is learning to cope with large data…

● … and with new more ambitious goals

● First glimpses of usable AI

Page 81: Oxford Lectures Part 1

Our Times

● Data is plentiful and inexpensive for the first time in history

● Statistical Thinking is learning to cope with large data…

● … and with new more ambitious goals

● First glimpses of usable AI

● Data science has disrupted many industries…

● … and will continue to do so.

Page 82: Oxford Lectures Part 1

Next

Page 83: Oxford Lectures Part 1

Predictive marketing

● An area of high impact for data science and big data

● Radius Intelligence

Page 84: Oxford Lectures Part 1

Thanks

Page 85: Oxford Lectures Part 1

Appendix I: Cross-Validation

Page 86: Oxford Lectures Part 1

Search for Optimal Complexity

● We split the data to determine the optimal complexity and to test

● Good against overfitting, but the data is not fully used

Train Validate Test

Page 87: Oxford Lectures Part 1

Why Cross-Validation?

● Two contrasting problems

○ Overfitting, a.k.a. the generalization problem

○ Full use of available data

Page 88: Oxford Lectures Part 1

Why Cross-Validation?

● Two contrasting problems

○ Overfitting, a.k.a. the generalization problem

○ Full use of available data

● Contrasting because…

○ To ensure generalization, test on fresh data

○ Fresh data cannot be used for training.

Page 89: Oxford Lectures Part 1

Why Cross-Validation?

● Two contrasting problems

○ Overfitting, a.k.a. the generalization problem

○ Full use of available data

● Contrasting because…

○ To ensure generalization, test on fresh data

○ Fresh data cannot be used for training.

● Worse if we need to compare multiple models

○ Choosing on Test data would itself be overfitting

Page 90: Oxford Lectures Part 1

Cross-Validation

● Nested N-fold Cross-Validation

○ Start with a split …

[Fold diagram: Validate | Test | Train | Train | Train]

Page 91: Oxford Lectures Part 1

Cross-Validation

● Nested N-fold Cross-Validation

○ … then change it …

[Fold diagram: the same five folds, with the Validate and Test roles reassigned]

Page 92: Oxford Lectures Part 1

Cross-Validation

● Nested N-fold Cross-Validation

○ … and so on, until you have gone through all N(N-1) ~ N² combinations

[Fold diagram: Train | Train | Test | Train | Validate]

Page 93: Oxford Lectures Part 1

Cross-Validation

● Nested N-fold Cross-Validation

○ All the data gets used to train, validate, and test the model

○ We are still out-of-sample

○ For each validation and test we now have a distribution of values

[Fold diagram: Train | Train | Test | Train | Validate]
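A minimal sketch of nested cross-validation with scikit-learn (assumed available): an inner grid search plays the Validate role, the outer folds play the Test role:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # illustrative data

# Inner loop: choose the right amount of complexity (Validate)
inner = GridSearchCV(DecisionTreeClassifier(random_state=0),
                     param_grid={"max_depth": [2, 5, 10]}, cv=5)

# Outer loop: score the chosen model on held-out folds (Test)
scores = cross_val_score(inner, X, y, cv=5)
print("out-of-sample accuracy per outer fold:", scores)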

Page 94: Oxford Lectures Part 1

Cross-Validation

● Simple N-fold Cross-Validation

○ As an alternative, we can perform a simple N-fold cross-validation

○ The Test set is held out

○ N folds for the validation stage

[Fold diagram: Train | Test | Validate | Train | Train]

Page 95: Oxford Lectures Part 1

Cross-Validation

● Simple N-fold Cross-Validation

○ As an alternative, we can perform a simple N-fold cross-validation

○ The Test set is held out

○ N folds for the validation stage

[Fold diagram: Validate | Test | Train | Train | Train]

Page 96: Oxford Lectures Part 1

Cross-Validation

● Simple N-fold Cross-Validation

○ As an alternative, we can perform a simple N-fold cross-validation

○ The Test set is held out

○ N folds for the validation stage

[Fold diagram: Train | Test | Train | Train | Validate]
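A minimal sketch of the simple variant with scikit-learn (assumed available): the Test set is held out once, and the Validate fold rotates over the remainder:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # illustrative data

# Hold out the Test set once
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# N-fold validation stage: each fold takes a turn as Validate
val_scores = cross_val_score(DecisionTreeClassifier(max_depth=5, random_state=0),
                             X_rest, y_rest, cv=5)
print("validation accuracy per fold:", val_scores)

# Final one-shot evaluation on the held-out Test set
model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_rest, y_rest)
print("test accuracy:", model.score(X_test, y_test))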