
Lec 27 · July 8, 2019


Page 1

TODAY’S LECTURE

Data collection
Data processing
Exploratory analysis & Data viz
Analysis, hypothesis testing, & ML
Insight & Policy Decision

!1

Page 2

TODAY’S LECTURE

More nonlinear classification/regression methods:
• Decision trees recap
• Bootstrapping, bagging
• Random forests in Scikit-Learn
• Support Vector Machines (SVMs)

Thanks to: Hector Corrada Bravo (UMD), Panagiotis Tsaparas (U of I), Oliver Schulte (SFU)

!2

Page 3

DECISION TREES IN SCIKIT

Trains a decision tree using default hyperparameters (split attribute chosen via Gini impurity or entropy, no max depth, etc.)

!3

from sklearn.datasets import load_iris
from sklearn import tree

# Load a common dataset, fit a decision tree to it
iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

# Predict most likely class (iris has 4 features, so pass 4 values)
clf.predict([[2., 2., 2., 2.]])

# Predict PDF over classes (% of training samples in the leaf)
clf.predict_proba([[2., 2., 2., 2.]])

Page 4

VISUALIZING A DECISION TREE

!4

import pydotplus
from IPython.display import Image

# Export the fitted tree to Graphviz format and render it as an image
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())

Page 5

BOOTSTRAPPING

Bootstrapping lets us make inferences about an estimate of a statistic. Given:

• Observations x1, …, xn and the estimate y = (1/n) ∑_{i=1}^{n} xi
• What can we say about the standard error of y?
• Challenges:
  • Our estimation procedure is deterministic.
  • We should retain whatever properties of the estimate y result from obtaining it from n samples.

!5

Page 6

BOOTSTRAPPING

Bootstrapping is a randomization procedure that measures the variance of estimates of y. Procedure (a NumPy sketch follows this slide):

• Generate B random datasets by sampling with replacement from the dataset x1, …, xn. Denote randomized dataset b as x1^b, …, xn^b
• Construct an estimate from each dataset: y^b = (1/n) ∑_{i=1}^{n} xi^b
• Compute the center (mean) and spread (variance) of the estimates y^b

!6
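A minimal NumPy sketch of this procedure for the sample mean (not from the slides; the data and the value of B are invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=100)   # observations x1, ..., xn (invented)
B = 1000

# y^b: the mean of one resampled dataset x^b (drawn with replacement from x)
boot_estimates = np.array([rng.choice(x, size=len(x), replace=True).mean()
                           for _ in range(B)])

print(boot_estimates.mean())   # center (mean) of the estimates
print(boot_estimates.std())    # spread: bootstrap estimate of the standard error of y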

Page 7

RANDOM FORESTS

Decision trees are very interpretable, but may be brittle to changes in the training data, as well as noise.

Random forests are an ensemble method that:
• Resamples the training data;
• Builds many decision trees; and
• Averages the predictions of the trees to classify.

This is done through bagging and random feature selection.

!7

Page 8

BAGGING

Bagging: bootstrap aggregation

Resampling a training set of size n via the bootstrap: sample n elements with replacement.

General scheme for random forests (a code sketch follows this slide):
1. Create B bootstrap samples, {Z1, Z2, …, ZB}
2. Build B decision trees, {T1, T2, …, TB}, from {Z1, Z2, …, ZB}

Classification/Regression:
1. Each tree Tj predicts a class/value yj
2. Return the average (1/B) ∑_{j=1}^{B} yj for regression, or the majority vote for classification

!8
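As a hedged sketch (not from the original slides), scikit-learn's BaggingClassifier implements exactly this bag-of-trees scheme; the toy data below is invented for illustration:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy training data, invented for illustration
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]

# B = 10 trees, each fit to a bootstrap sample of the training set
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                        bootstrap=True, random_state=0)
bag.fit(X, y)
print(bag.predict([[0.9, 0.2]]))   # majority vote over the 10 trees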

Page 9

!9

Original training dataset (Z):

obs_id  ft_1  ft_2
1       12.2  puppy
2       34.5  dog
3       8.1   cat

B bootstrap samples Zj, each drawn with replacement from Z:

Z1 (obs_id, ft_1, ft_2): (3, 8.1, cat), (2, 34.5, dog), (3, 8.1, cat)
Z2 (obs_id, ft_1, ft_2): (1, 12.2, puppy), (2, 34.5, dog), (1, 12.2, puppy)
ZB (obs_id, ft_1, ft_2): (1, 12.2, puppy), (1, 12.2, puppy), (3, 8.1, cat)

Each bootstrap sample Zj trains a tree Tj; the trees' outputs are aggregated/voted to produce a class estimate or predicted value.

Page 10

RANDOM ATTRIBUTE SELECTION

We get some randomness via bootstrapping.
• We like this! Randomness increases the bias of the forest slightly at a huge decrease in variance (due to averaging).

We can further reduce correlation between trees by (a brief scikit-learn sketch follows this slide):
1. For each tree, at every split point …
2. … choosing a random subset of attributes …
3. … then splitting on the “best” (entropy, Gini) within only that subset

!10
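A brief, hedged sketch of how scikit-learn exposes this: the max_features argument of RandomForestClassifier controls the size of the random attribute subset tried at each split (the toy data is invented for illustration):

from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # toy data, invented
y = [0, 0, 1, 1]

# At each split, consider only a random subset of ~sqrt(#features) attributes
clf = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                             random_state=0)
clf = clf.fit(X, y)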

Page 11

RANDOM FORESTS IN SCIKIT-LEARN

Can we get even more random?! Extremely randomized trees (ExtraTreesClassifier) do bagging and random attribute selection, but also:
1. At each split point, choose random splits
2. Pick the best of those random splits

Similar bias/variance performance to RFs, but can be faster computationally. (An ExtraTreesClassifier sketch follows the random forest code below.)

!11

from sklearn.ensemble import RandomForestClassifier

# Train a random forest of 10 default decision trees
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X, Y)
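A parallel sketch for extremely randomized trees, reusing the same toy data as above (the settings are illustrative, not from the slide):

from sklearn.ensemble import ExtraTreesClassifier

# Same toy data as the random forest example above
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = ExtraTreesClassifier(n_estimators=10)
clf = clf.fit(X, Y)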

Page 12

!12

SUPPORT VECTOR MACHINES

Page 13

SUPPORT VECTOR MACHINES (SVM)

Find a linear hyperplane (decision boundary) that will separate the data

!13

Page 14

SUPPORT VECTOR MACHINES

One possible solution: hyperplane B1

!14

Page 15

SUPPORT VECTOR MACHINES

Other possible solutions, e.g., hyperplane B2

!15

Page 16

SUPPORT VECTOR MACHINES

Which one is better, B1 or B2? How do you define “better”?

!16

Page 17

SUPPORT VECTOR MACHINES

Find the hyperplane that maximizes the margin: B1 is better than B2

[Figure: hyperplanes B1 and B2 with their margin boundaries b11, b12 and b21, b22; B1 has the wider margin]

!17

Page 18

SUPPORT VECTOR MACHINES

[Figure: hyperplane B1 with margin boundaries b11 and b12]

Decision boundary:  w ⋅ x + b = 0
Margin boundaries:  w ⋅ x + b = −1  and  w ⋅ x + b = +1

f(x) = 1 if w ⋅ x + b ≥ 1;  −1 if w ⋅ x + b ≤ −1

Margin = 2 / ||w||²

!18

Page 19

SUPPORT VECTOR MACHINES

We want to maximize:  Margin = 2 / ||w||²

Which is equivalent to minimizing:  L(w) = ||w||² / 2

But subject to the following constraints:
  w ⋅ xi + b ≥ 1   if yi = 1
  w ⋅ xi + b ≤ −1  if yi = −1

This is a constrained optimization problem
• Numerical approaches to solve it (e.g., quadratic programming); a scikit-learn sketch follows this slide

!19
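In practice the QP is solved by a library; as a hedged sketch (toy data invented here), scikit-learn's SVC with a linear kernel solves it and exposes the learned w, b, and support vectors:

from sklearn import svm

# Toy linearly separable data, invented for illustration
X = [[0, 0], [0, 1], [2, 2], [2, 3]]
y = [-1, -1, 1, 1]

clf = svm.SVC(kernel='linear', C=1e6)   # very large C approximates a hard margin
clf.fit(X, y)

print(clf.coef_)              # w
print(clf.intercept_)         # b
print(clf.support_vectors_)   # the training points that define the margin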

Page 20

SUPPORT VECTOR MACHINES

What if the problem is not linearly separable?

[Figure: a point on the wrong side of the margin, with slack ξi measured relative to the boundary defined by w]

Apply some sort of penalty

!20

Page 21

SUPPORT VECTOR MACHINES

What if the problem is not linearly separable?
• Introduce slack variables ξi
• Need to minimize:  L(w) = ||w||² / 2 + C (∑_{i=1}^{N} ξi^k)
• Subject to:
  w ⋅ xi + b ≥ 1 − ξi   if yi = 1
  w ⋅ xi + b ≤ −1 + ξi  if yi = −1

(A sketch of the C trade-off in scikit-learn follows this slide.)

!21
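The penalty constant C above corresponds to scikit-learn's C parameter; a brief sketch of the trade-off, with data and values invented for illustration:

from sklearn import svm

# Overlapping (not linearly separable) toy data, invented for illustration
X = [[0, 0], [1, 1], [0.9, 0.9], [2, 2]]
y = [0, 0, 1, 1]

# Small C: slack is cheap, so the margin stays wide and tolerates violations
soft = svm.SVC(kernel='linear', C=0.01).fit(X, y)

# Large C: slack is expensive, so the fit tries hard to classify every point
hard = svm.SVC(kernel='linear', C=100.0).fit(X, y)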

Page 22

NONLINEAR SUPPORT VECTOR MACHINES

What if the decision boundary is not linear?

!22

Page 23

NONLINEAR SUPPORT VECTOR MACHINES

Transform the data into a higher dimensional space (a kernel SVM sketch follows this slide)

!23
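Kernels perform this transformation implicitly; a minimal sketch using an RBF kernel on data that no line can separate (the dataset choice is mine, not from the slides):

from sklearn import svm
from sklearn.datasets import make_circles

# Two concentric circles: not linearly separable in the original 2D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

clf = svm.SVC(kernel='rbf', gamma='scale')
clf.fit(X, y)
print(clf.score(X, y))   # should be near 1.0 on this easy toy dataset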

Page 24

SVMS IN SCIKIT-LEARN

Lots of defaults are used for the hyperparameters; cross validation can be used to search for good ones.

!24

from sklearn import svm

from sklearn import svm

# Fit a default SVM classifier to fake data
X = [[0, 0], [1, 1]]
y = [0, 1]
clf = svm.SVC()
clf.fit(X, y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape=None, degree=3, gamma='auto', kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)

Page 25

MODEL SELECTION IN SCIKIT-LEARN

!25

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# ... Load some raw data into X and y ...

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.5, random_state=0)

# Pick values of hyperparameters you want to consider
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

Page 26

MODEL SELECTION IN SCIKIT-LEARN

!26

# Perform a complete grid search + cross validation
# for each of the hyperparameter vectors
clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5,
                   scoring='precision')
clf.fit(X_train, y_train)

# Now that you've selected good hyperparameters via CV,
# and trained a model on your training data, get an
# estimate of the "true error" on your test set
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
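As a small follow-up (these are standard GridSearchCV attributes, not shown on the original slide), you can inspect which hyperparameters cross validation selected:

# Inspect the hyperparameters chosen by cross validation
print(clf.best_params_)
print(clf.best_score_)

# By default GridSearchCV refits on the full training set with the best
# parameters, so clf.predict(X_test) above already used the selected model.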