Decision Trees II and Bias/Variance and Cross-Validation



Nipun Batra and teaching staff, January 11, 2020

IIT Gandhinagar

Real Input Real Output

Example 1

Let us consider the dataset given below

[Figure: scatter of the dataset; Feature X from 1.0 to 6.0, Label Y from 0 to 2]

Example 1

What would be the prediction for a decision tree of depth 0?

[Figure: the dataset again; Feature X from 1.0 to 6.0, Label Y from 0 to 2]

Example 1

Prediction for a decision tree of depth 0. The horizontal dashed line shows the predicted Y value: the average of the Y values of all datapoints.

[Figure: the dataset with a horizontal dashed line at Y = 1.0, the depth-0 prediction]
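This prediction is just the mean of the labels. A minimal check in Python, assuming the plotted points are X = 1…6 with Y = 0, 0, 1, 1, 2, 2 (values read off the plot, not stated in the slides):

```python
# Depth-0 regression tree: a single leaf predicting the mean label.
# The datapoints below are assumed from the plot, not given in the slides.
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Y = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]

prediction = sum(Y) / len(Y)  # average of all Y values
print(prediction)  # 1.0 for these assumed values
```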

Example 1

What would the decision tree of depth 1 look like?

[Figure: the dataset; Feature X from 1.0 to 6.0, Label Y from 0 to 2]

Example 1

Decision tree with depth 1

X[0] <= 2.5 | mse = 0.667 | samples = 6 | value = 1.0
├─ True:  mse = 0.0  | samples = 2 | value = 0.0
└─ False: mse = 0.25 | samples = 4 | value = 1.5
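The tree above looks like scikit-learn output; a sketch reproducing it with DecisionTreeRegressor, using the same assumed dataset (the slide's exact data is not given):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Assumed dataset consistent with the slide's tree (6 samples, mean 1.0).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0.0, 0.0, 1.0, 1.0, 2.0, 2.0])

tree = DecisionTreeRegressor(max_depth=1).fit(X, y)
print(tree.tree_.threshold[0])       # chosen split threshold
print(tree.predict([[2.0], [4.0]]))  # leaf means on either side of the split
```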

Example 1

The Decision Boundary

[Figure: decision boundary of the depth-1 tree at X = 2.5, with piecewise-constant predictions]

Example 1

What would the decision tree of depth 2 look like?

[Figure: the dataset; Feature X from 1.0 to 6.0, Label Y from 0 to 2]

Example 1

Decision tree with depth 2

X[0] <= 2.5 | mse = 0.667 | samples = 6 | value = 1.0
├─ True:  mse = 0.0 | samples = 2 | value = 0.0
└─ False: X[0] <= 4.5 | mse = 0.25 | samples = 4 | value = 1.5
    ├─ True:  mse = 0.0 | samples = 2 | value = 1.0
    └─ False: mse = 0.0 | samples = 2 | value = 2.0

Example 1

The Decision Boundary

1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0Feature, X

0

1

2La

bel,

Y

9

Objective function

Here, the feature is denoted by X and the label by Y.
Let the "decision boundary" or "split" be at X = S.
Let the region X < S be region R1.
Let the region X > S be region R2.

Then, let

C1 = Mean(Yᵢ | Xᵢ ∈ R1)
C2 = Mean(Yᵢ | Xᵢ ∈ R2)

Loss = ∑_{i : Xᵢ ∈ R1} (Yᵢ − C1)² + ∑_{i : Xᵢ ∈ R2} (Yᵢ − C2)²

Our objective is to minimize the loss and find

min_S [ ∑_{i : Xᵢ ∈ R1} (Yᵢ − C1)² + ∑_{i : Xᵢ ∈ R2} (Yᵢ − C2)² ]

How to find optimal split "S"?

1. Sort all datapoints (X, Y) in increasing order of X.

2. Evaluate the loss function for every candidate split S = (Xᵢ + Xᵢ₊₁)/2 and select the S with minimum loss.
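The two steps above can be sketched in Python; the dataset values below are assumed from Example 1's plot:

```python
# Brute-force search for the best split S: sort by X, try every midpoint
# between consecutive X values, and keep the split with the lowest loss.
def best_split(X, Y):
    pts = sorted(zip(X, Y))
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    best_s, best_loss = None, float("inf")
    for i in range(len(xs) - 1):
        if xs[i] == xs[i + 1]:
            continue  # cannot split between identical X values
        s = (xs[i] + xs[i + 1]) / 2  # candidate split: midpoint
        left, right = ys[: i + 1], ys[i + 1 :]
        c1 = sum(left) / len(left)    # mean of Y in region R1
        c2 = sum(right) / len(right)  # mean of Y in region R2
        loss = sum((y - c1) ** 2 for y in left) + sum((y - c2) ** 2 for y in right)
        if loss < best_loss:
            best_s, best_loss = s, loss
    return best_s, best_loss

# Assumed Example 1 data:
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Y = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
print(best_split(X, Y))  # (2.5, 1.0) on this data
```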

A Question!

Draw a regression tree for Y = sin(X), 0 ≤ X ≤ 2π


A Question!

Dataset of Y = sin(X), 0 ≤ X ≤ 7 with 10,000 points

[Figure: scatter of Y = sin(X) for 0 ≤ X ≤ 7; Label Y from −1 to 1]

A Question!

Regression tree of depth 1

X[0] <= 3.033 | mse = 0.463 | samples = 10000 | value = 0.035
├─ True:  mse = 0.086 | samples = 4333 | value = 0.657
└─ False: mse = 0.23  | samples = 5667 | value = −0.441

A Question!

Decision Boundary

[Figure: sin(X) data with the depth-1 tree's piecewise-constant prediction; the split is near X = 3.03]

A Question!

The regression tree with no depth limit is too big to fit on a slide; it has depth 20. The decision boundaries are shown in the figure below.

[Figure: sin(X) data with the unrestricted tree's prediction tracking every point]
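A sketch of the same experiment with scikit-learn, assuming 10,000 evenly spaced points on [0, 7]:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# y = sin(x) on [0, 7] with 10,000 points, as on the slide.
X = np.linspace(0.0, 7.0, 10000).reshape(-1, 1)
y = np.sin(X).ravel()

shallow = DecisionTreeRegressor(max_depth=1).fit(X, y)  # one split only
deep = DecisionTreeRegressor().fit(X, y)                # no depth limit

print(shallow.tree_.node_count)  # 3 nodes: one split, two leaves
print(deep.get_depth())          # grows very deep to fit every point
```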

Summary

• Interpretability: an important goal
• Decision trees: well-known interpretable models
• Learning the optimal tree is hard
• Greedy approach: recursively split to maximize "performance gain"
• Issues:
  • Can overfit easily!
  • Empirically not as powerful as other methods

A Question!

What would be the decision boundary of a decision tree classifier?

[Figure: scatter of two classes in the (X1, X2) plane; X1 from 1 to 7, X2 from 1.0 to 4.0]

Decision Boundary for a tree with depth 1

[Figure (a): Decision Boundary: a single vertical split near X1 = 4.0 separating the two classes]

X[0] <= 4.0 | entropy = 1.0 | samples = 12 | value = [6, 6]
├─ True:  entropy = 0.65 | samples = 6 | value = [5, 1]
└─ False: entropy = 0.65 | samples = 6 | value = [1, 5]

(b) Decision Tree
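A depth-1 classifier (a "decision stump") can be sketched with scikit-learn's DecisionTreeClassifier using the entropy criterion, as in the slide's trees. The points below are hypothetical, not the slide's dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical, linearly separable 2-D points (not the slide's data).
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 1.5],
              [5.0, 3.0], [6.0, 2.5], [7.0, 3.5]])
y = np.array([0, 0, 0, 1, 1, 1])

stump = DecisionTreeClassifier(criterion="entropy", max_depth=1).fit(X, y)
# A single axis-aligned split separates the two classes here.
print(stump.tree_.feature[0], stump.tree_.threshold[0])
print(stump.score(X, y))  # 1.0: the stump classifies this data perfectly
```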

Decision Boundary for a tree with no depth limit

[Figure (c): Decision Boundary: axis-aligned rectangular regions carved out by the deeper tree]

X[0] <= 4.0 | entropy = 1.0 | samples = 12 | value = [6, 6]
├─ True:  X[1] <= 1.5 | entropy = 0.65 | samples = 6 | value = [5, 1]
│   ├─ entropy = 0.0 | samples = 3 | value = [3, 0]
│   └─ X[0] <= 1.5 | entropy = 0.918 | samples = 3 | value = [2, 1]
│       ├─ entropy = 0.0 | samples = 2 | value = [2, 0]
│       └─ entropy = 0.0 | samples = 1 | value = [0, 1]
└─ False: X[1] <= 1.5 | entropy = 0.65 | samples = 6 | value = [1, 5]
    ├─ entropy = 0.0 | samples = 3 | value = [0, 3]
    └─ X[0] <= 6.5 | entropy = 0.918 | samples = 3 | value = [1, 2]
        ├─ entropy = 0.0 | samples = 1 | value = [1, 0]
        └─ entropy = 0.0 | samples = 2 | value = [0, 2]

(d) Decision Tree

Are deeper trees always better?

As we saw, deeper trees learn more complex decision boundaries.

But sometimes this can lead to poor generalization.

An example

Consider the dataset below

[Figure (e): Train Set: scatter of two classes in the (X1, X2) plane]

[Figure (f): Test Set: scatter of two classes in the (X1, X2) plane]

Underfitting

Underfitting is also known as high bias, since the model makes a strongly biased, incorrect assumption.

[Figure (g): Decision Boundary: the underfit tree's boundary in the (X1, X2) plane]

X[1] <= 1.5 | entropy = 0.964 | samples = 36 | value = [22, 14]
├─ True:  entropy = 0.0 | samples = 6 | value = [6, 0]
└─ False: X[1] <= 5.5 | entropy = 0.997 | samples = 30 | value = [16, 14]
    ├─ entropy = 0.98 | samples = 24 | value = [10, 14]
    └─ entropy = 0.0  | samples = 6  | value = [6, 0]

(h) Decision Tree

Overfitting

Overfitting is also known as high variance, since very small changes in the data can lead to very different models. The decision tree learned here has a depth of 10.

[Figure: the overfit tree's highly irregular decision boundary in the (X1, X2) plane]

Intuition for Variance

A small change in data can lead to very different models.

[Figure: two nearly identical training datasets, Dataset 1 and Dataset 2, scattered in the (X1, X2) plane]

Intuition for Variance

[The two full decision trees learned from Dataset 1 and Dataset 2. Both share the same top of the tree (root split X[1] <= 1.5, entropy = 0.964, samples = 36, value = [22, 14]), but their lower splits differ (e.g. X[0] <= 2.5 in one tree vs X[1] <= 3.5 in the other), so the two learned models are very different.]

A Good Fit

[Figure: decision boundary of the pruned tree in the (X1, X2) plane]

X[1] <= 1.5 | entropy = 0.964 | samples = 36 | value = [22, 14]
├─ True:  entropy = 0.0 | samples = 6 | value = [6, 0]
└─ False: X[1] <= 5.5 | entropy = 0.997 | samples = 30 | value = [16, 14]
    ├─ X[0] <= 1.5 | entropy = 0.98 | samples = 24 | value = [10, 14]
    │   ├─ entropy = 0.0 | samples = 4 | value = [4, 0]
    │   └─ X[0] <= 5.5 | entropy = 0.881 | samples = 20 | value = [6, 14]
    │       ├─ entropy = 0.544 | samples = 16 | value = [2, 14]
    │       └─ entropy = 0.0   | samples = 4  | value = [4, 0]
    └─ entropy = 0.0 | samples = 6 | value = [6, 0]

Accuracy vs Depth Curve

[Figure: train and test accuracy (60 to 100) versus tree depth (0 to 10); train accuracy keeps rising while test accuracy peaks and then falls]

As depth increases, train accuracy improves.
As depth increases, test accuracy improves till a point.
At very high depths, test accuracy is not good (overfitting).

Accuracy vs Depth Curve : Underfitting

The highlighted region is the underfitting region: the model is too simple (low depth) to learn from the data.

[Figure: accuracy vs depth curve with the low-depth region highlighted; Test Accuracy and Train Accuracy]

Accuracy vs Depth Curve : Overfitting

The highlighted region is the overfitting region: the model is complex (high depth) and hence also learns the anomalies in the data.

[Figure: accuracy vs depth curve with the high-depth region highlighted; Test Accuracy and Train Accuracy]

Accuracy vs Depth Curve

The highlighted region is the good fit region. We want to maximize test accuracy while staying in this region.

[Figure: accuracy vs depth curve with the intermediate-depth region highlighted; Test Accuracy and Train Accuracy]

The big question!?

How to find the optimal depth for a decision tree?

Use cross-validation!

Our General Training Flow

[Flow: Train Data → (Training) → Model; Test Data → Model → (Predicting) → Predicted Labels; Predicted Labels + Actual Labels → Error Metric Calculation → Accuracy]

K-Fold cross-validation: Utilise full dataset for testing

FOLD 1: Train | Train | Train | Test
FOLD 2: Train | Train | Test | Train
FOLD 3: Train | Test | Train | Train
FOLD 4: Test | Train | Train | Train

Each of the K = 4 folds uses a different quarter of the data as the test set.
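With scikit-learn, cross_val_score runs exactly this loop; iris is used here as an assumed stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset (assumption)

# 4-fold cross-validation, mirroring the FOLD 1..4 picture above:
# each fold serves once as the held-out test part.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=4)
print(scores)        # one accuracy per fold
print(scores.mean()) # average accuracy over the folds
```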

The Validation Set

[Flow: the Train Data is further split into a training part and a validation part. Models of depth 1, 2 and 3 are trained on the training part and scored on the validation set; the Test Data is reserved for the final accuracy.]

Validation accuracy: Depth 1: 70%, Depth 2: 90%, Depth 3: 80%.

Select the model having the highest validation accuracy (here, depth 2).
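The validation-set procedure above can be sketched as follows; iris is an assumed stand-in dataset and the candidate depths 1, 2, 3 follow the slide:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset (assumption)

# Hold out a test set, then carve a validation set out of the training data.
X_trval, X_test, y_trval, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trval, y_trval, test_size=0.25, random_state=0)

val_acc = {}
for depth in (1, 2, 3):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    val_acc[depth] = clf.score(X_val, y_val)

# Select the model with the highest validation accuracy.
best_depth = max(val_acc, key=val_acc.get)
print(val_acc, best_depth)
```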

Nested Cross Validation

Divide your training set into K equal parts. Cyclically use one part as the "validation set" and the rest for training. Here K = 4.

FOLD 1: Train | Train | Train | Validation
FOLD 2: Train | Train | Validation | Train
FOLD 3: Train | Validation | Train | Train
FOLD 4: Validation | Train | Train | Train

Nested Cross Validation

Average the validation accuracy across all the folds.

Select the hyperparameter for which the model gives the highest average validation accuracy.
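scikit-learn's GridSearchCV bundles these two steps: for each hyperparameter value it averages validation accuracy over the folds and keeps the best. Iris again serves as an assumed stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset (assumption)

# Try each depth on 4 folds, average validation accuracy, keep the best.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [1, 2, 3, 4, 5]}, cv=4)
search.fit(X, y)
print(search.best_params_)  # depth with the highest average validation accuracy
print(search.best_score_)   # that average accuracy
```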

Next time: Ensemble Learning

• How to combine various models?
• Why combine multiple models?
• How can we reduce bias?
• How can we reduce variance?
