INTRO TO DATA SCIENCE
LECTURE 4: INTRO TO ML & KNN CLASSIFICATION
Francesco Mosconi
DAT10 SF // October 15, 2014


Page 1: INTRO TO DATA SCIENCE - GitHub Pages

INTRO TO DATA SCIENCE
LECTURE 4: INTRO TO ML & KNN CLASSIFICATION
Francesco Mosconi
DAT10 SF // October 15, 2014

Page 2: INTRO TO DATA SCIENCE - GitHub Pages

DATA SCIENCE IN THE NEWS

Page 3: INTRO TO DATA SCIENCE - GitHub Pages

DATA SCIENCE IN THE NEWS

Source: http://f1metrics.wordpress.com/2014/10/03/building-a-race-simulator/

Page 4: INTRO TO DATA SCIENCE - GitHub Pages

DATA SCIENCE IN THE NEWS

Source: http://www.pyimagesearch.com/2014/10/13/deep-learning-amazon-ec2-gpu-python-nolearn/

Page 5: INTRO TO DATA SCIENCE - GitHub Pages

RECAP

‣ Cleaning data
‣ Dealing with missing data
‣ Setting up GitHub for homework

LAST TIME

Page 6: INTRO TO DATA SCIENCE - GitHub Pages

QUESTIONS?

INTRO TO DATA SCIENCE

Page 7: INTRO TO DATA SCIENCE - GitHub Pages

I. WHAT IS MACHINE LEARNING?
II. CLASSIFICATION PROBLEMS
III. BUILDING EFFECTIVE CLASSIFIERS
IV. THE KNN CLASSIFICATION MODEL

EXERCISES:
V. LAB: KNN CLASSIFICATION IN PYTHON
VI. BONUS LAB: VISUALIZATION WITH MATPLOTLIB (IF TIME ALLOWS)

AGENDA

Page 8: INTRO TO DATA SCIENCE - GitHub Pages

I. WHAT IS MACHINE LEARNING?

INTRO TO DATA SCIENCE

Page 9: INTRO TO DATA SCIENCE - GitHub Pages

WHAT IS MACHINE LEARNING? 9

Page 10: INTRO TO DATA SCIENCE - GitHub Pages

WHAT IS MACHINE LEARNING? 10

Page 11: INTRO TO DATA SCIENCE - GitHub Pages

WHAT IS MACHINE LEARNING? 11

Page 12: INTRO TO DATA SCIENCE - GitHub Pages

WHAT IS MACHINE LEARNING? 12

from Wikipedia:
“Machine learning, a branch of artificial intelligence, is about the construction and study of systems that can learn from data.”

source: http://en.wikipedia.org/wiki/Machine_learning

Page 13: INTRO TO DATA SCIENCE - GitHub Pages

WHAT IS MACHINE LEARNING? 13

from Wikipedia:
“Machine learning, a branch of artificial intelligence, is about the construction and study of systems that can learn from data.”

“The core of machine learning deals with representation and generalization…”

source: http://en.wikipedia.org/wiki/Machine_learning

Page 14: INTRO TO DATA SCIENCE - GitHub Pages

WHAT IS MACHINE LEARNING? 14

from Wikipedia:
“Machine learning, a branch of artificial intelligence, is about the construction and study of systems that can learn from data.”

“The core of machine learning deals with representation and generalization…”

‣ representation – extracting structure from data

source: http://en.wikipedia.org/wiki/Machine_learning

Page 15: INTRO TO DATA SCIENCE - GitHub Pages

WHAT IS MACHINE LEARNING? 15

from Wikipedia:
“Machine learning, a branch of artificial intelligence, is about the construction and study of systems that can learn from data.”

“The core of machine learning deals with representation and generalization…”

‣ representation – extracting structure from data
‣ generalization – making predictions from data

source: http://en.wikipedia.org/wiki/Machine_learning

Page 16: INTRO TO DATA SCIENCE - GitHub Pages

REMEMBER THIS? 16

source: http://www.dataists.com/2010/09/the-data-science-venn-diagram/

Page 17: INTRO TO DATA SCIENCE - GitHub Pages

WE ARE NOW HERE 17

source: http://www.dataists.com/2010/09/the-data-science-venn-diagram/

Page 18: INTRO TO DATA SCIENCE - GitHub Pages

WE WANT TO GO HERE 18

source: http://www.dataists.com/2010/09/the-data-science-venn-diagram/

Page 19: INTRO TO DATA SCIENCE - GitHub Pages

WE WANT TO GO HERE 19

source: http://www.dataists.com/2010/09/the-data-science-venn-diagram/

QUESTION
What does it take to make this jump?

Page 20: INTRO TO DATA SCIENCE - GitHub Pages

ANSWER: PROBLEM SOLVING! 20

Page 21: INTRO TO DATA SCIENCE - GitHub Pages

ANSWER: PROBLEM SOLVING! 21

NOTE
Implementing solutions to ML problems is the focus of this course!

Page 22: INTRO TO DATA SCIENCE - GitHub Pages

THE STRUCTURE OF MACHINE LEARNING PROBLEMS

INTRO TO DATA SCIENCE

Page 23: INTRO TO DATA SCIENCE - GitHub Pages

REMEMBER WHAT WE SAID BEFORE? 23

Supervised → making predictions
Unsupervised → extracting structure

Page 24: INTRO TO DATA SCIENCE - GitHub Pages

REMEMBER WHAT WE SAID BEFORE? 24

Supervised → making predictions (generalization)
Unsupervised → extracting structure (representation)

Page 25: INTRO TO DATA SCIENCE - GitHub Pages

TYPES OF LEARNING PROBLEMS - SUPERVISED EXAMPLE 25

Page 26: INTRO TO DATA SCIENCE - GitHub Pages

TYPES OF LEARNING PROBLEMS - UNSUPERVISED EXAMPLE 26

Page 27: INTRO TO DATA SCIENCE - GitHub Pages

TYPES OF LEARNING PROBLEMS - UNSUPERVISED EXAMPLE 27

Page 28: INTRO TO DATA SCIENCE - GitHub Pages

TYPES OF DATA 28

Continuous → quantitative
Categorical → qualitative

Page 29: INTRO TO DATA SCIENCE - GitHub Pages

TYPES OF DATA 29

Continuous → quantitative
Categorical → qualitative

NOTE
The space where data live is called the feature space. Each point in this space is called a record.

Page 30: INTRO TO DATA SCIENCE - GitHub Pages

TYPES OF ML SOLUTIONS 30

              Continuous            Categorical
Supervised    regression            classification
Unsupervised  dimension reduction   clustering

Page 31: INTRO TO DATA SCIENCE - GitHub Pages

TYPES OF ML SOLUTIONS 31

              Continuous            Categorical
Supervised    regression            classification
Unsupervised  dimension reduction   clustering

NOTE
We will implement solutions using models and algorithms. Each will fall into one of these four buckets.

Page 32: INTRO TO DATA SCIENCE - GitHub Pages

WHAT IS THE GOAL OF MACHINE LEARNING?

QUESTION

Page 33: INTRO TO DATA SCIENCE - GitHub Pages

REMEMBER WHAT WE SAID BEFORE? 33

Supervised → making predictions
Unsupervised → extracting structure

ANSWER
The goal is determined by the type of problem.

Page 34: INTRO TO DATA SCIENCE - GitHub Pages

HOW DO YOU DETERMINE THE RIGHT APPROACH?

QUESTION

Page 35: INTRO TO DATA SCIENCE - GitHub Pages

TYPES OF ML SOLUTIONS 35

              Continuous            Categorical
Supervised    regression            classification
Unsupervised  dimension reduction   clustering

ANSWER
The right approach is determined by the desired solution.

Page 36: INTRO TO DATA SCIENCE - GitHub Pages

TYPES OF ML SOLUTIONS 36

              Continuous            Categorical
Supervised    regression            classification
Unsupervised  dimension reduction   clustering

ANSWER
The right approach is determined by the desired solution.

NOTE
All of this depends on your data!

Page 37: INTRO TO DATA SCIENCE - GitHub Pages

WHAT DO YOU DO WITH YOUR RESULTS?

QUESTION

Page 38: INTRO TO DATA SCIENCE - GitHub Pages

THE DATA SCIENCE WORKFLOW 38

source: http://benfry.com/phd/dissertation-110323c.pdf

ANSWER
Interpret them and react accordingly.

Page 39: INTRO TO DATA SCIENCE - GitHub Pages

THE DATA SCIENCE WORKFLOW 39

source: http://benfry.com/phd/dissertation-110323c.pdf

ANSWER
Interpret them and react accordingly.

NOTE
This also relies on your problem-solving skills!

Page 40: INTRO TO DATA SCIENCE - GitHub Pages

II. CLASSIFICATION PROBLEMS

INTRO TO DATA SCIENCE

Page 41: INTRO TO DATA SCIENCE - GitHub Pages

TYPES OF ML SOLUTIONS 41

              Continuous   Categorical
Supervised    ???          ???
Unsupervised  ???          ???

Page 42: INTRO TO DATA SCIENCE - GitHub Pages

CLASSIFICATION PROBLEMS 42

              Continuous            Categorical
Supervised    regression            classification
Unsupervised  dimension reduction   clustering

Page 43: INTRO TO DATA SCIENCE - GitHub Pages

CLASSIFICATION PROBLEMS 43

Here’s (part of) an example dataset:

Page 44: INTRO TO DATA SCIENCE - GitHub Pages

CLASSIFICATION PROBLEMS 44

Here’s (part of) an example dataset:

independent variables

Page 45: INTRO TO DATA SCIENCE - GitHub Pages

CLASSIFICATION PROBLEMS 45

Here’s (part of) an example dataset:

class labels (qualitative)
independent variables

Page 46: INTRO TO DATA SCIENCE - GitHub Pages

CLASSIFICATION PROBLEMS 46

Q: What does “supervised” mean?

Page 47: INTRO TO DATA SCIENCE - GitHub Pages

CLASSIFICATION PROBLEMS 47

Q: What does “supervised” mean?
A: We know the labels.

class labels (qualitative)

Page 48: INTRO TO DATA SCIENCE - GitHub Pages

CLASSIFICATION PROBLEMS 48

Q: How does a classification problem work?

Page 49: INTRO TO DATA SCIENCE - GitHub Pages

CLASSIFICATION PROBLEMS 49

Q: How does a classification problem work?
A: Data in, predicted labels out.

source: http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf
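To make “data in, predicted labels out” concrete, here is a minimal illustrative sketch in Python with scikit-learn (an assumption; the slides don't name a library). The tiny feature matrix X and label vector y are made up for illustration:

# "Data in, predicted labels out" -- a minimal scikit-learn sketch.
# X and y are made-up stand-ins for a real labeled dataset.
from sklearn.neighbors import KNeighborsClassifier

X = [[5.1, 3.5], [4.9, 3.0], [6.2, 3.4], [5.9, 3.0]]   # independent variables
y = ['setosa', 'setosa', 'virginica', 'virginica']      # class labels (qualitative)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)                      # labeled data in
print(model.predict([[5.0, 3.2]]))   # predicted label out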

Page 50: INTRO TO DATA SCIENCE - GitHub Pages

CLASSIFICATION PROBLEMS 50

Q: What steps does a classification problem require?

[diagram: dataset, model]

Page 51: INTRO TO DATA SCIENCE - GitHub Pages

CLASSIFICATION PROBLEMS 51

Q: What steps does a classification problem require?
1) split dataset

[diagram: dataset, model]

Page 52: INTRO TO DATA SCIENCE - GitHub Pages

CLASSIFICATION PROBLEMS 52

Q: What steps does a classification problem require?
1) split dataset
2) train model

[diagram: dataset → training set → model]

Page 53: INTRO TO DATA SCIENCE - GitHub Pages

CLASSIFICATION PROBLEMS 53

Q: What steps does a classification problem require?
1) split dataset
2) train model
3) test model

[diagram: dataset → training set + test set → model]

Page 54: INTRO TO DATA SCIENCE - GitHub Pages

CLASSIFICATION PROBLEMS 54

Q: What steps does a classification problem require?
1) split dataset
2) train model
3) test model
4) make predictions

[diagram: dataset → training set + test set → model; new data → model → predictions]

Page 55: INTRO TO DATA SCIENCE - GitHub Pages

CLASSIFICATION PROBLEMS 55

Q: What steps does a classification problem require?
1) split dataset
2) train model
3) test model
4) make predictions

[diagram: dataset → training set + test set → model; new data → model → predictions]

NOTE
This new data is called out-of-sample (OOS) data. We don’t know the labels for these OOS records!
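A minimal sketch of these four steps in Python with scikit-learn (an assumption; the built-in iris dataset is used here only for concreteness):

# Steps 1-4 with scikit-learn on the iris dataset (illustrative sketch).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 1) split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 2) train model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# 3) test model
print('test accuracy:', model.score(X_test, y_test))

# 4) make predictions on new, out-of-sample data (labels unknown)
new_data = [[5.0, 3.4, 1.5, 0.2]]
print('predicted label:', model.predict(new_data))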

Page 56: INTRO TO DATA SCIENCE - GitHub Pages

III. BUILDING EFFECTIVE CLASSIFIERS

INTRO TO DATA SCIENCE

Page 57: INTRO TO DATA SCIENCE - GitHub Pages

BUILDING EFFECTIVE CLASSIFIERS 57

Q: What types of prediction error will we run into?

[diagram: dataset → training set + test set → model; new data → model → predictions]

Page 58: INTRO TO DATA SCIENCE - GitHub Pages

BUILDING EFFECTIVE CLASSIFIERS 58

Q: What types of prediction error will we run into?
1) training error

[diagram: dataset → training set + test set → model; new data → model → predictions]

Page 59: INTRO TO DATA SCIENCE - GitHub Pages

BUILDING EFFECTIVE CLASSIFIERS 59

Q: What types of prediction error will we run into?
1) training error
2) generalization error

[diagram: dataset → training set + test set → model; new data → model → predictions]

Page 60: INTRO TO DATA SCIENCE - GitHub Pages

BUILDING EFFECTIVE CLASSIFIERS 60

Q: What types of prediction error will we run into?
1) training error
2) generalization error
3) OOS error

[diagram: dataset → training set + test set → model; new data → model → predictions]

Page 61: INTRO TO DATA SCIENCE - GitHub Pages

BUILDING EFFECTIVE CLASSIFIERS 61

Q: What types of prediction error will we run into?
1) training error
2) generalization error
3) OOS error

[diagram: dataset → training set + test set → model; new data → model → predictions]

NOTE
We want to estimate OOS prediction error so we know what to expect from our model.

Page 62: INTRO TO DATA SCIENCE - GitHub Pages

TRAINING ERROR 62

Q: Why should we use training & test sets?

Page 63: INTRO TO DATA SCIENCE - GitHub Pages

TRAINING ERROR 63

Q: Why should we use training & test sets?

Thought experiment: Suppose instead, we train our model using the entire dataset.

Page 64: INTRO TO DATA SCIENCE - GitHub Pages

TRAINING ERROR 64

Q: Why should we use training & test sets?

Thought experiment: Suppose instead, we train our model using the entire dataset.

Q: How low can we push the training error?

Page 65: INTRO TO DATA SCIENCE - GitHub Pages

TRAINING ERROR 65

Q: Why should we use training & test sets?

Thought experiment: Suppose instead, we train our model using the entire dataset.

Q: How low can we push the training error?
- We can make the model arbitrarily complex (effectively “memorizing” the entire training set).

Page 66: INTRO TO DATA SCIENCE - GitHub Pages

TRAINING ERROR 66

Q: Why should we use training & test sets?

Thought experiment: Suppose instead, we train our model using the entire dataset.

Q: How low can we push the training error?
- We can make the model arbitrarily complex (effectively “memorizing” the entire training set).
A: Down to zero!

Page 67: INTRO TO DATA SCIENCE - GitHub Pages

TRAINING ERROR 67

Q: Why should we use training & test sets?

Thought experiment: Suppose instead, we train our model using the entire dataset.

Q: How low can we push the training error?
- We can make the model arbitrarily complex (effectively “memorizing” the entire training set).
A: Down to zero!

NOTE
This phenomenon is called overfitting.

Page 68: INTRO TO DATA SCIENCE - GitHub Pages

OVERFITTING 68

source: Data Analysis with Open Source Tools, by Philipp K. Janert. O’Reilly Media, 2011.

Page 69: INTRO TO DATA SCIENCE - GitHub Pages

OVERFITTING - EXAMPLE 69

source: http://www.dtreg.com

Page 70: INTRO TO DATA SCIENCE - GitHub Pages

OVERFITTING - EXAMPLE 70

source: http://www.dtreg.com

Page 71: INTRO TO DATA SCIENCE - GitHub Pages

TRAINING ERROR 71

Q: Why should we use training & test sets?

Thought experiment: Suppose instead, we train our model using the entire dataset.

Q: How low can we push the training error?
- We can make the model arbitrarily complex (effectively “memorizing” the entire training set).
A: Down to zero!

A: Training error is not a good estimate of OOS accuracy.

NOTE
This phenomenon is called overfitting.
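An illustrative sketch of this trap in Python (scikit-learn and iris are assumptions, not from the slides): a 1-nearest-neighbor model trained and scored on the same data memorizes it and reports perfect training accuracy, which says nothing about OOS accuracy.

# Training error pushed to zero: k = 1 KNN memorizes the training set.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)                                  # train on the ENTIRE dataset
print('training accuracy:', model.score(X, y))   # 1.0 -- zero training error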

Page 72: INTRO TO DATA SCIENCE - GitHub Pages

GENERALIZATION ERROR 72

Suppose we do the train/test split.

Page 73: INTRO TO DATA SCIENCE - GitHub Pages

GENERALIZATION ERROR 73

Suppose we do the train/test split.

Q: How well does generalization error predict OOS accuracy?

Page 74: INTRO TO DATA SCIENCE - GitHub Pages

GENERALIZATION ERROR 74

Suppose we do the train/test split.

Q: How well does generalization error predict OOS accuracy?

Thought experiment: Suppose we had done a different train/test split.

Page 75: INTRO TO DATA SCIENCE - GitHub Pages

GENERALIZATION ERROR 75

Suppose we do the train/test split.

Q: How well does generalization error predict OOS accuracy?

Thought experiment: Suppose we had done a different train/test split.

Q: Would the generalization error remain the same?

Page 76: INTRO TO DATA SCIENCE - GitHub Pages

GENERALIZATION ERROR 76

Suppose we do the train/test split.

Q: How well does generalization error predict OOS accuracy?

Thought experiment: Suppose we had done a different train/test split.

Q: Would the generalization error remain the same?
A: Of course not!

Page 77: INTRO TO DATA SCIENCE - GitHub Pages

GENERALIZATION ERROR 77

Suppose we do the train/test split.

Q: How well does generalization error predict OOS accuracy?

Thought experiment: Suppose we had done a different train/test split.

Q: Would the generalization error remain the same?
A: Of course not!

A: On its own, not very well.

Page 78: INTRO TO DATA SCIENCE - GitHub Pages

GENERALIZATION ERROR 78

Suppose we do the train/test split.

Q: How well does generalization error predict OOS accuracy?

Thought experiment: Suppose we had done a different train/test split.

Q: Would the generalization error remain the same?
A: Of course not!

A: On its own, not very well.

NOTE
The generalization error gives a high-variance estimate of OOS accuracy.
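A quick sketch of that variance in Python (scikit-learn, iris, and the seed values are illustrative assumptions): the same model scored on several different random splits yields noticeably different generalization errors.

# Different train/test splits -> different generalization errors.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    print('split %d: test accuracy = %.3f' % (seed, model.score(X_test, y_test)))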

Page 79: INTRO TO DATA SCIENCE - GitHub Pages

GENERALIZATION ERROR 79

Something is still missing!

Page 80: INTRO TO DATA SCIENCE - GitHub Pages

GENERALIZATION ERROR 80

Something is still missing!

Q: How can we do better?

Page 81: INTRO TO DATA SCIENCE - GitHub Pages

GENERALIZATION ERROR 81

Something is still missing!

Q: How can we do better?

Thought experiment: Different train/test splits will give us different generalization errors.

Page 82: INTRO TO DATA SCIENCE - GitHub Pages

GENERALIZATION ERROR 82

Something is still missing!

Q: How can we do better?

Thought experiment: Different train/test splits will give us different generalization errors.

Q: What if we did a bunch of these and took the average?

Page 83: INTRO TO DATA SCIENCE - GitHub Pages

GENERALIZATION ERROR 83

Something is still missing!

Q: How can we do better?

Thought experiment: Different train/test splits will give us different generalization errors.

Q: What if we did a bunch of these and took the average?
A: Now you’re talking!

Page 84: INTRO TO DATA SCIENCE - GitHub Pages

GENERALIZATION ERROR 84

Something is still missing!

Q: How can we do better?

Thought experiment: Different train/test splits will give us different generalization errors.

Q: What if we did a bunch of these and took the average?
A: Now you’re talking!

A: Cross-validation.

Page 85: INTRO TO DATA SCIENCE - GitHub Pages

CROSS-VALIDATION 85

Steps for n-fold cross-validation:

Page 86: INTRO TO DATA SCIENCE - GitHub Pages

CROSS-VALIDATION 86

Steps for n-fold cross-validation:

1) Randomly split the dataset into n equal partitions.

Page 87: INTRO TO DATA SCIENCE - GitHub Pages

CROSS-VALIDATION 87

Steps for n-fold cross-validation:

1) Randomly split the dataset into n equal partitions.
2) Use partition 1 as test set & union of other partitions as training set.

Page 88: INTRO TO DATA SCIENCE - GitHub Pages

CROSS-VALIDATION 88

Steps for n-fold cross-validation:

1) Randomly split the dataset into n equal partitions.
2) Use partition 1 as test set & union of other partitions as training set.
3) Find generalization error.

Page 89: INTRO TO DATA SCIENCE - GitHub Pages

CROSS-VALIDATION 89

Steps for n-fold cross-validation:

1) Randomly split the dataset into n equal partitions.
2) Use partition 1 as test set & union of other partitions as training set.
3) Find generalization error.
4) Repeat steps 2-3 using a different partition as the test set at each iteration.

Page 90: INTRO TO DATA SCIENCE - GitHub Pages

CROSS-VALIDATION 90

Steps for n-fold cross-validation:

1) Randomly split the dataset into n equal partitions.
2) Use partition 1 as test set & union of other partitions as training set.
3) Find generalization error.
4) Repeat steps 2-3 using a different partition as the test set at each iteration.
5) Take the average generalization error as the estimate of OOS accuracy.
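In practice the loop in steps 2-4 is usually a single library call. An illustrative sketch with scikit-learn's cross_val_score (scikit-learn, 10 folds, and iris are assumptions for concreteness):

# n-fold cross-validation (n = 10): average fold accuracy estimates OOS accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=10)
print('per-fold accuracy:', scores)
print('estimated OOS accuracy:', scores.mean())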

Page 91: INTRO TO DATA SCIENCE - GitHub Pages

CROSS-VALIDATION 91

Features of n-fold cross-validation:

Page 92: INTRO TO DATA SCIENCE - GitHub Pages

CROSS-VALIDATION 92

Features of n-fold cross-validation:

1) More accurate estimate of OOS prediction error.

Page 93: INTRO TO DATA SCIENCE - GitHub Pages

CROSS-VALIDATION 93

Features of n-fold cross-validation:

1) More accurate estimate of OOS prediction error.
2) More efficient use of data than single train/test split.
   - Each record in our dataset is used for both training and testing.

Page 94: INTRO TO DATA SCIENCE - GitHub Pages

CROSS-VALIDATION 94

Features of n-fold cross-validation:

1) More accurate estimate of OOS prediction error.
2) More efficient use of data than single train/test split.
   - Each record in our dataset is used for both training and testing.
3) Presents tradeoff between efficiency and computational expense.
   - 10-fold CV is 10x more expensive than a single train/test split.

Page 95: INTRO TO DATA SCIENCE - GitHub Pages

CROSS-VALIDATION 95

Features of n-fold cross-validation:

1) More accurate estimate of OOS prediction error.
2) More efficient use of data than single train/test split.
   - Each record in our dataset is used for both training and testing.
3) Presents tradeoff between efficiency and computational expense.
   - 10-fold CV is 10x more expensive than a single train/test split.
4) Can be used for model selection.
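For example, model selection for KNN can be done by cross-validating each candidate k and keeping the best. A sketch under the same assumptions as above (scikit-learn, iris; the candidate values of k are arbitrary):

# Model selection via cross-validation: choose k with the best mean CV accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_score = None, 0.0
for k in [1, 3, 5, 7, 9, 15]:
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()
    print('k = %2d: mean CV accuracy = %.3f' % (k, score))
    if score > best_score:
        best_k, best_score = k, score
print('selected k:', best_k)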

Page 96: INTRO TO DATA SCIENCE - GitHub Pages

IV. KNN CLASSIFICATION

INTRO TO DATA SCIENCE

Page 97: INTRO TO DATA SCIENCE - GitHub Pages

KNN CLASSIFICATION - BASICS 97

Suppose we want to predict the color of the grey dot.

Page 98: INTRO TO DATA SCIENCE - GitHub Pages

KNN CLASSIFICATION 98

Suppose we want to predict the color of the grey dot.

1) Pick a value for k.

k = 3

Page 99: INTRO TO DATA SCIENCE - GitHub Pages

KNN CLASSIFICATION 99

Suppose we want to predict the color of the grey dot.

1) Pick a value for k.
2) Find colors of k nearest neighbors.

k = 3

Page 100: INTRO TO DATA SCIENCE - GitHub Pages

KNN CLASSIFICATION 100

Suppose we want to predict the color of the grey dot.

1) Pick a value for k.
2) Find colors of k nearest neighbors.
3) Assign the most common color to the grey dot.

Page 101: INTRO TO DATA SCIENCE - GitHub Pages

KNN CLASSIFICATION 101

Suppose we want to predict the color of the grey dot.

1) Pick a value for k.
2) Find colors of k nearest neighbors.
3) Assign the most common color to the grey dot.

OPTIONAL NOTE
Our definition of “nearest” implicitly uses the Euclidean distance function.
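A from-scratch sketch of the three steps, using the Euclidean distance and a majority vote (the function name, toy points, and labels are all illustrative; math.dist requires Python 3.8+):

# KNN from scratch: Euclidean distance + majority vote among k neighbors.
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    # 2) find the k nearest neighbors by Euclidean distance
    by_distance = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels))
    neighbors = [label for _, label in by_distance[:k]]
    # 3) assign the most common label among those neighbors
    return Counter(neighbors).most_common(1)[0][0]

# Toy data: three blue points, three orange points (step 1: pick k = 3).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ['blue', 'blue', 'blue', 'orange', 'orange', 'orange']
print(knn_predict(points, labels, query=(7.5, 8.5), k=3))   # -> orange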

Page 102: INTRO TO DATA SCIENCE - GitHub Pages

KNN CLASSIFICATION 102

Another example with k = 3.
Will our new example be blue or orange?

Page 103: INTRO TO DATA SCIENCE - GitHub Pages

LABS

INTRO TO DATA SCIENCE