Lecture 19: Decision trees
Reading: Section 8.1
STATS 202: Data mining and analysis
Jonathan Taylor
November 7, 2017
Slide credits: Sergio Bacallado
Decision trees, 10,000 foot view
[Figure: the splits X1 ≤ t1, X2 ≤ t2, X1 ≤ t3, X2 ≤ t4 partition the (X1, X2) plane into regions R1, ..., R5, shown alongside the corresponding decision tree.]
1. Find a partition of the space of predictors.
2. Predict a constant in each set of the partition.
3. The partition is defined by splitting the range of one predictor at a time.
→ Not all partitions are possible.
Example: Predicting a baseball player’s salary
[Figure: a regression tree for the Hitters data with splits Years < 4.5 and Hits < 117.5 and leaf predictions 5.11, 6.00, 6.74 (log salary), shown with the corresponding regions R1, R2, R3 in the (Years, Hits) plane.]
The prediction for a point in Ri is the average of the training points in Ri.
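To make the leaf predictions concrete, here is a minimal sketch (with made-up numbers rather than the actual Hitters data) that computes the average log salary within each of the three regions above:

```python
import numpy as np

# Made-up (Years, Hits, log-Salary) triples; the real example uses the Hitters data.
years      = np.array([2, 3, 1, 5, 7, 10, 6, 8])
hits       = np.array([80, 150, 60, 100, 130, 90, 170, 110])
log_salary = np.array([4.9, 5.3, 4.7, 5.8, 6.9, 6.1, 7.0, 5.9])

# The three regions defined by the tree above.
r1 = years < 4.5                       # R1: Years < 4.5
r2 = (years >= 4.5) & (hits < 117.5)   # R2: Years >= 4.5, Hits < 117.5
r3 = (years >= 4.5) & (hits >= 117.5)  # R3: Years >= 4.5, Hits >= 117.5

# Each region predicts the mean log salary of its training points.
for name, mask in [("R1", r1), ("R2", r2), ("R3", r3)]:
    print(name, "prediction:", round(log_salary[mask].mean(), 3))
```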
How is a decision tree built?
- Start with a single region R1, and iterate:
  1. Select a region Rk, a predictor Xj, and a splitting point s, such that splitting Rk with the criterion Xj < s produces the largest decrease in RSS:
$$\sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{y}_{R_m})^2$$
  2. Redefine the regions with this additional split.
- Terminate when there are 5 observations or fewer in each region.
- This grows the tree from the root towards the leaves.
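A minimal sketch of the split search in step 1, assuming numeric predictors stacked in an array X and a response vector y (made-up data, not the course's code):

```python
import numpy as np

def rss(y):
    """Residual sum of squares around the region mean (0 for an empty region)."""
    return 0.0 if y.size == 0 else float(np.sum((y - y.mean()) ** 2))

def best_split(X, y):
    """Exhaustively search predictors j and thresholds s for the split Xj < s
    that minimizes the total RSS of the two resulting half-regions."""
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left = X[:, j] < s
            total = rss(y[left]) + rss(y[~left])
            if total < best_rss:
                best_j, best_s, best_rss = j, s, total
    return best_j, best_s, best_rss

# Toy data: the response jumps when the first predictor crosses 0.5,
# so the search should recover a threshold near 0.5 on predictor 0.
rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))
y = np.where(X[:, 0] < 0.5, 1.0, 3.0) + rng.normal(scale=0.1, size=50)
print(best_split(X, y))
```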
How is a decision tree built?
[Figure: the full, unpruned regression tree for the Hitters data, with internal splits on Years, Hits, RBI, Putouts, Walks, and Runs and 12 terminal nodes predicting log salary.]
How do we control overfitting?
- Idea 1: Find the optimal subtree by cross validation.
  → There are too many possibilities – harder than best subsets!
- Idea 2: Stop growing the tree when the RSS doesn't drop by more than a threshold with any new cut.
  → In our greedy algorithm, it is possible to find good cuts after bad ones.
How do we control overfitting?
Solution: Prune a large tree from the leaves to the root.
- Weakest link pruning:
  - Starting with T0, substitute a subtree with a leaf to obtain T1, by minimizing:
$$\frac{\mathrm{RSS}(T_1) - \mathrm{RSS}(T_0)}{|T_0| - |T_1|}$$
  - Iterate this pruning to obtain a sequence T0, T1, T2, ..., Tm, where Tm is the null tree.
  - Select the optimal tree Ti by cross validation.
How do we control overfitting?
... or an equivalent procedure
- Cost complexity pruning:
  - Solve the problem:
$$\underset{T}{\mathrm{minimize}} \;\; \sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha|T|$$
  - When α = ∞, we select the null tree.
  - When α = 0, we select the full tree.
  - The solution for each α is among T1, T2, ..., Tm from weakest link pruning.
  - Choose the optimal α (the optimal Ti) by cross validation.
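For illustration, scikit-learn implements this minimal cost-complexity (weakest link) pruning; the sketch below, on synthetic stand-in data rather than the Hitters data, computes the sequence of αs and the size of the pruned tree at each one:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the Hitters data).
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.sin(4 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Grow a large tree and compute the sequence of alphas at which subtrees are
# collapsed; each alpha corresponds to one tree T_1, ..., T_m in the sequence.
tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
path = tree.cost_complexity_pruning_path(X, y)

for alpha in path.ccp_alphas:
    pruned = DecisionTreeRegressor(min_samples_leaf=5, random_state=0,
                                    ccp_alpha=alpha).fit(X, y)
    print(f"alpha = {alpha:.4f}   leaves = {pruned.get_n_leaves()}")
```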
Cross validation, the wrong way!
1. Construct a sequence of trees T0, ..., Tm for a range of values of α.
2. Split the training points into 10 folds.
3. For k = 1, ..., 10:
   - For each tree Ti, use every fold except the kth to estimate the averages in each region.
   - For each tree Ti, calculate the RSS in the kth (held-out) fold.
4. For each tree Ti, average the 10 test errors, and select the value of α that minimizes the error.
The problem: the trees were constructed using all of the training data, so each held-out fold has already influenced the shape of the trees it is meant to evaluate.
Cross validation, the right way
1. Split the training points into 10 folds.
2. For k = 1, ..., 10, using every fold except the kth:
   - Construct a sequence of trees T1, ..., Tm for a range of values of α, and find the prediction for each region in each one.
   - For each tree Ti, calculate the RSS on the kth (held-out) fold.
3. Select the parameter α that minimizes the average test error.
Note: We are doing all fitting, including the construction of the trees, using only the training data.
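A sketch of this procedure with scikit-learn, on synthetic stand-in data: GridSearchCV regrows and prunes the trees inside every fold, so the held-out fold never influences tree construction; only the grid of candidate αs is shared across folds.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the Hitters data).
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.sin(4 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Candidate alphas taken from the pruning path of a tree grown on the training
# data; GridSearchCV then refits (grows and prunes) a tree inside every fold
# for each candidate alpha.
alphas = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas

search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid={"ccp_alpha": alphas},
                      cv=10, scoring="neg_mean_squared_error")
search.fit(X, y)
print("alpha chosen by 10-fold CV:", search.best_params_["ccp_alpha"])
```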
Example. Predicting baseball salaries
[Figure: the full, unpruned regression tree for the Hitters data (as above), together with the training, cross-validation, and test mean squared error as a function of tree size.]
Example. Predicting baseball salaries
[Figure: the pruned three-leaf tree (Years < 4.5, Hits < 117.5; leaf predictions 5.11, 6.00, 6.74), shown with the training, cross-validation, and test MSE as a function of tree size.]
Classification trees
- They work much like regression trees.
- We predict the response by majority vote, i.e. pick the most common class in every region.
- Instead of trying to minimize the RSS:
$$\sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{y}_{R_m})^2$$
  we minimize a classification loss function.
Classification losses
- The 0-1 loss or misclassification rate:
$$\sum_{m=1}^{|T|} \sum_{x_i \in R_m} 1(y_i \neq \hat{y}_{R_m})$$
- The Gini index:
$$\sum_{m=1}^{|T|} q_m \sum_{k=1}^{K} p_{mk}(1 - p_{mk}),$$
  where p_{mk} is the proportion of class k within Rm, and q_m is the proportion of samples in Rm.
- The cross-entropy:
$$-\sum_{m=1}^{|T|} q_m \sum_{k=1}^{K} p_{mk}\log(p_{mk}).$$
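These per-region impurities are easy to compute from the class proportions; the tree-wide losses above simply weight each region's contribution (by its sample count for the 0-1 loss, or by q_m for the other two). A small sketch with made-up proportions:

```python
import numpy as np

# Made-up class proportions p_mk for one region R_m with K = 3 classes.
p = np.array([0.7, 0.2, 0.1])

misclassification = 1.0 - p.max()           # 1 - max_k p_mk
gini              = np.sum(p * (1.0 - p))   # sum_k p_mk (1 - p_mk)
cross_entropy     = -np.sum(p * np.log(p))  # -sum_k p_mk log p_mk

print(misclassification, gini, cross_entropy)
```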
Classification losses
- The Gini index and cross-entropy are better measures of the purity of a region, i.e. they are low when the region is mostly one category.
- Motivation for the Gini index:
  If instead of predicting the most likely class, we predict a random sample from the distribution (p_{m1}, p_{m2}, ..., p_{mK}), the Gini index is the expected misclassification rate.
- It is typical to use the Gini index or cross-entropy for growing the tree, while using the misclassification rate when pruning the tree.
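For instance, a classification tree grown with the Gini index in scikit-learn (a sketch on synthetic stand-in data, not the Heart data below) predicts the majority class within each leaf:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data with a noisy decision boundary at x0 = 0.5.
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))
y = (X[:, 0] + 0.3 * rng.normal(size=300) > 0.5).astype(int)

# Grow the tree with the Gini index; each leaf predicts its majority class.
clf = DecisionTreeClassifier(criterion="gini", min_samples_leaf=5,
                             random_state=0).fit(X, y)
print("number of leaves:", clf.get_n_leaves())
print("training accuracy:", clf.score(X, y))
```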
Example. Heart dataset.
[Figure: a large classification tree for the Heart data (splits on Thal, Ca, MaxHR, RestBP, Chol, ChestPain, Sex, Slope, Age, Oldpeak, and RestECG), the training, cross-validation, and test error as a function of tree size, and the smaller pruned tree, whose splits involve Thal, Ca, MaxHR, and ChestPain.]
Some advantages of decision trees
- Very easy to interpret!
- Closer to human decision-making.
- Easy to visualize graphically.
- They easily handle qualitative predictors and missing data.
- Downside: they don't necessarily fit as well!