ilya-kuzovkin
ONE MACHINE LEARNING USE CASE
Can we ask a computer to create those patterns automatically?
Yes.
How?
A data sample (instance) = raw data + class (label)

Raw data: a 28 px × 28 px image of a handwritten digit. Class (label): “7”.

How to represent it in a machine-readable form? Feature extraction.

28 × 28 = 784 pixels in total.
Feature vector: (0, 0, 0, …, 28, 65, 128, 255, 101, 38, … 0, 0, 0)

Dataset:
(0, 0, 0, …, 28, 65, 128, 255, 101, 38, … 0, 0, 0) → “7”
(0, 0, 0, …, 13, 48, 102, 0, 46, 255, … 0, 0, 0) → “2”
(0, 0, 0, …, 17, 34, 12, 43, 122, 70, … 0, 7, 0) → “8”
(0, 0, 0, …, 98, 21, 255, 255, 231, 140, … 0, 0, 0) → “2”
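The feature extraction step above can be sketched in a few lines of NumPy: flatten the 2-D pixel grid into a single 784-dimensional feature vector. The stroke pattern below is a made-up stand-in for a real scanned digit.

```python
import numpy as np

# A toy 28x28 grayscale image standing in for one scanned digit.
image = np.zeros((28, 28), dtype=np.uint8)
image[10, 5:20] = 255   # a horizontal stroke
image[11:20, 18] = 128  # a vertical stroke

# Feature extraction: flatten the 2-D pixel grid into a
# 784-dimensional feature vector, one value per pixel.
feature_vector = image.flatten()
label = "7"  # the class assigned by a human annotator

print(feature_vector.shape)  # (784,)
```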
The data is in the right format — what’s next?

• C4.5 • Random forests • Bayesian networks • Hidden Markov models • Artificial neural network • Data clustering • Expectation-maximization algorithm • Self-organizing map • Radial basis function network • Vector quantization • Generative topographic map • Information bottleneck method • IBSEAD • Apriori algorithm • Eclat algorithm • FP-growth algorithm • Single-linkage clustering • Conceptual clustering • K-means algorithm • Fuzzy clustering • Temporal difference learning • Q-learning • Learning automata
• AODE • Artificial neural network • Backpropagation • Naive Bayes classifier • Bayesian network • Bayesian knowledge base • Case-based reasoning • Decision trees • Inductive logic programming • Gaussian process regression • Gene expression programming • Group method of data handling (GMDH) • Learning automata • Learning vector quantization • Logistic model tree • Decision tree • Decision graphs • Lazy learning • Monte Carlo method • SARSA
• Instance-based learning • Nearest neighbor algorithm • Analogical modeling • Probably approximately correct learning (PACL) • Symbolic machine learning algorithms • Subsymbolic machine learning algorithms • Support vector machines • Random forest • Ensembles of classifiers • Bootstrap aggregating (bagging) • Boosting (meta-algorithm) • Ordinal classification • Regression analysis • Information fuzzy networks (IFN) • Linear classifiers • Fisher's linear discriminant • Logistic regression • Naive Bayes classifier • Perceptron • Support vector machines • Quadratic classifiers • k-nearest neighbor • Boosting

Pick an algorithm
DECISION TREE

Two classes of samples, one vs. the other:

(0, …, 28, 65, …, 207, 101, 0, 0)
(0, …, 19, 34, …, 254, 54, 0, 0)
(0, …, 87, 59, …, 240, 52, 4, 0)
(0, …, 87, 52, …, 240, 19, 3, 0)

vs.

(0, …, 28, 64, …, 102, 101, 0, 0)
(0, …, 19, 23, …, 105, 54, 0, 0)
(0, …, 87, 74, …, 121, 51, 7, 0)
(0, …, 87, 112, …, 239, 52, 4, 0)

First split on PIXEL #417: samples with value > 200 go one way, samples with value < 200 go the other.
One sample from the second class (the one with 239 at that pixel) still lands on the wrong side, so add a second split on PIXEL #123: < 100 vs. > 100 separates it.
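A decision tree like the one above can be trained with scikit-learn. As a sketch, this uses the built-in `load_digits` dataset (8×8 digit images, 64 features) rather than the 28×28 images from the slides, but the idea is identical: each internal node asks a question like “is pixel #k above some threshold?”

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 8x8 digit images: 64 pixel features per sample instead of 784.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Each internal node of the tree thresholds a single pixel value.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
```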
ACCURACY

Confusion matrix: rows show the true class, columns the predicted class.

acc = correctly classified / total number of samples

Beware of an imbalanced dataset!
Consider the following model: “Always predict 2”.
On a dataset where 90% of the samples are 2s, it reaches accuracy 0.9 without learning anything.
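The imbalance trap is easy to demonstrate. On a made-up test set where 9 out of 10 samples are 2s, the degenerate “always predict 2” model scores 0.9 accuracy while its confusion matrix shows it never gets the minority class right:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy imbalanced test set: nine samples of class 2, one of class 7.
y_true = np.array([2, 2, 2, 2, 2, 2, 2, 2, 2, 7])
# The degenerate model: "always predict 2".
y_pred = np.full_like(y_true, 2)

print(accuracy_score(y_true, y_pred))   # 0.9 despite learning nothing
print(confusion_matrix(y_true, y_pred)) # rows: true class, cols: predicted
```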
DECISION TREE

“You said 100% accurate?! Every 10th digit your system detects is wrong!”
— Angry client

We trained our system on the data the client gave us, but it has never seen the new data the client applied it to. And in real life, it never will…
OVERFITTING

Simulate the real-life situation — split the dataset.

Underfitting (“too stupid”) | OK | Overfitting (“too smart”)

Our current decision tree has too much capacity: it has simply memorized all of the data. Let's make it less complex.

You probably did not notice, but we are overfitting again :(
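The "too smart" failure mode is visible directly in the scores. A sketch, again using scikit-learn's small `load_digits` set in place of the slides' 28×28 data: an unconstrained tree recalls the training set almost perfectly, yet does noticeably worse on data it has never seen.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# An unconstrained tree grows until it has memorized the training set.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train:", full.score(X_train, y_train))  # near-perfect recall of seen data
print("test: ", full.score(X_test, y_test))    # noticeably lower on unseen data
```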
THE WHOLE DATASET is split into:

TRAINING SET (60%): fit various models and parameter combinations on this subset.
VALIDATION SET (20%): evaluate the models created with different parameters; estimate overfitting.
TEST SET (20%): use only once, to get the final performance estimate.
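The 60/20/20 split can be produced with two calls to `train_test_split` (sketched here on scikit-learn's `load_digits` data): first carve off 20% as TEST, then split the remaining 80% so that a quarter of it (20% of the whole) becomes VALIDATION.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# First carve off 20% of the whole dataset as the TEST set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then split the remaining 80% into TRAINING (60% of the whole)
# and VALIDATION (20% of the whole): 0.25 * 80% = 20%.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train) / len(X), len(X_val) / len(X), len(X_test) / len(X))
```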
CROSS-VALIDATION

What if we got a too-optimistic validation set?

Merge TRAINING and VALIDATION back into one TRAINING SET (80%).
Fix the parameter value you need to evaluate, say msl=15.
Split the training set into TRAINING and VAL parts, each time holding out a different part as VAL. Repeat 10 times.
Take the average validation score over the 10 runs — it is a more stable estimate.
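The 10-fold procedure above is one call in scikit-learn. As a sketch on the `load_digits` data, with the slides' `msl=15` read as the tree's `min_samples_leaf` parameter:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Evaluate one fixed parameter setting (min_samples_leaf=15)
# with 10-fold cross-validation: each fold is held out once as VAL.
tree = DecisionTreeClassifier(min_samples_leaf=15, random_state=0)
scores = cross_val_score(tree, X, y, cv=10)

print(scores.mean())  # the average over the 10 held-out folds
```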
MACHINE LEARNING PIPELINE

Take raw data → extract features → split into TRAINING and TEST.
Pick an algorithm and parameters → train on the TRAINING data → evaluate on the TRAINING data with CV.
Try out different algorithms and parameters → fix the best ones → train on the whole TRAINING set.
Evaluate on TEST → report the final performance to the client.

“So it is ~87%… erm… Could you do better?”

Yes
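The whole pipeline fits in a short script. A sketch using scikit-learn's `GridSearchCV` on the `load_digits` data: the parameter grid and the choice of `min_samples_leaf` values are illustrative, not from the slides.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Split into TRAINING and TEST; TEST is touched exactly once at the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Try parameter combinations with 10-fold CV on TRAINING only;
# GridSearchCV then refits the best model on the whole TRAINING set.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"min_samples_leaf": [1, 5, 15, 50]},
    cv=10)
search.fit(X_train, y_train)

# Evaluate on TEST once and report the final performance.
print(search.best_params_, search.score(X_test, y_test))
```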
Pick another algorithm
RANDOM FOREST

Decision tree: pick the best split out of all features.
Random forest: each tree picks the best split out of a random subset of features; another tree uses another random subset, and yet another tree uses yet another subset.

To classify an instance, run it through every tree in the forest and take a majority vote over the predicted classes.
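This is exactly the `RandomForestClassifier` from the hands-on link. A sketch, again on the `load_digits` data rather than the slides' 28×28 images:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Each of the 100 trees sees a bootstrap sample of the data and
# considers a random subset of features at every split; the forest
# predicts by majority vote over the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))  # usually well above a single tree
```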
Happy client
ALL OTHER USE CASES

Raw data → features → class:

Sound: frequency components → genre
Text: bag of words → topic
Image: pixel values → cat or dog
Video: frame pixels → walking or running
Database records: biometric data, census data, average salary, … → dead or alive
HANDS-ON SESSION
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html