Lecture 19: Decision trees
Reading: Section 8.1
STATS 202: Data mining and analysis
Jonathan Taylor
November 7, 2017
Slide credits: Sergio Bacallado
Decision trees, 10,000 foot view
[Figure: the splits X1 ≤ t1, X2 ≤ t2, X1 ≤ t3, X2 ≤ t4 partition the (X1, X2) plane into regions R1, ..., R5, shown alongside the corresponding decision tree.]
1. Find a partition of the space of predictors.
2. Predict a constant in each set of the partition.
3. The partition is defined by splitting the range of one predictor at a time.
→ Not all partitions are possible.
Example: Predicting a baseball player’s salary
[Figure: a regression tree for the Hitters data with splits Years < 4.5 and Hits < 117.5 and leaf predictions 5.11, 6.00, 6.74 (log salary), shown with the corresponding regions R1, R2, R3 in the (Years, Hits) plane.]
The prediction for a point in Ri is the average of the training points in Ri.
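To make the leaf predictions concrete, here is a minimal sketch (with made-up numbers rather than the actual Hitters data) that computes the average log salary within each of the three regions above:

```python
import numpy as np

# Made-up (Years, Hits, log-Salary) triples; the real example uses the Hitters data.
years      = np.array([2, 3, 1, 5, 7, 10, 6, 8])
hits       = np.array([80, 150, 60, 100, 130, 90, 170, 110])
log_salary = np.array([4.9, 5.3, 4.7, 5.8, 6.9, 6.1, 7.0, 5.9])

# The three regions defined by the tree above.
r1 = years < 4.5                       # R1: Years < 4.5
r2 = (years >= 4.5) & (hits < 117.5)   # R2: Years >= 4.5, Hits < 117.5
r3 = (years >= 4.5) & (hits >= 117.5)  # R3: Years >= 4.5, Hits >= 117.5

# Each region predicts the mean log salary of its training points.
for name, mask in [("R1", r1), ("R2", r2), ("R3", r3)]:
    print(name, "prediction:", round(log_salary[mask].mean(), 3))
```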
How is a decision tree built?
- Start with a single region R1, and iterate:
  1. Select a region Rk, a predictor Xj, and a splitting point s, such that splitting Rk with the criterion Xj < s produces the largest decrease in RSS:
$$\sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{y}_{R_m})^2$$
  2. Redefine the regions with this additional split.
- Terminate when there are 5 observations or fewer in each region.
- This grows the tree from the root towards the leaves.
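A minimal sketch of the split search in step 1, assuming numeric predictors stacked in an array X and a response vector y (made-up data, not the course's code):

```python
import numpy as np

def rss(y):
    """Residual sum of squares around the region mean (0 for an empty region)."""
    return 0.0 if y.size == 0 else float(np.sum((y - y.mean()) ** 2))

def best_split(X, y):
    """Exhaustively search predictors j and thresholds s for the split Xj < s
    that minimizes the total RSS of the two resulting half-regions."""
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left = X[:, j] < s
            total = rss(y[left]) + rss(y[~left])
            if total < best_rss:
                best_j, best_s, best_rss = j, s, total
    return best_j, best_s, best_rss

# Toy data: the response jumps when the first predictor crosses 0.5,
# so the search should recover a threshold near 0.5 on predictor 0.
rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))
y = np.where(X[:, 0] < 0.5, 1.0, 3.0) + rng.normal(scale=0.1, size=50)
print(best_split(X, y))
```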
How is a decision tree built?
[Figure: the full, unpruned regression tree for the Hitters data, with internal splits on Years, Hits, RBI, Putouts, Walks, and Runs and 12 terminal nodes predicting log salary.]
How do we control overfitting?
- Idea 1: Find the optimal subtree by cross validation.
  → There are too many possibilities – harder than best subsets!
- Idea 2: Stop growing the tree when the RSS doesn't drop by more than a threshold with any new cut.
  → In our greedy algorithm, it is possible to find good cuts after bad ones.
How do we control overfitting?
Solution: Prune a large tree from the leaves to the root.
- Weakest link pruning:
  - Starting with T0, substitute a subtree with a leaf to obtain T1, by minimizing:
$$\frac{\mathrm{RSS}(T_1) - \mathrm{RSS}(T_0)}{|T_0| - |T_1|}$$
  - Iterate this pruning to obtain a sequence T0, T1, T2, ..., Tm, where Tm is the null tree.
  - Select the optimal tree Ti by cross validation.
How do we control overfitting?
... or an equivalent procedure
- Cost complexity pruning:
  - Solve the problem:
$$\underset{T}{\mathrm{minimize}} \;\; \sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha|T|$$
  - When α = ∞, we select the null tree.
  - When α = 0, we select the full tree.
  - The solution for each α is among T1, T2, ..., Tm from weakest link pruning.
  - Choose the optimal α (the optimal Ti) by cross validation.
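For illustration, scikit-learn implements this minimal cost-complexity (weakest link) pruning; the sketch below, on synthetic stand-in data rather than the Hitters data, computes the sequence of αs and the size of the pruned tree at each one:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the Hitters data).
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.sin(4 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Grow a large tree and compute the sequence of alphas at which subtrees are
# collapsed; each alpha corresponds to one tree T_1, ..., T_m in the sequence.
tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
path = tree.cost_complexity_pruning_path(X, y)

for alpha in path.ccp_alphas:
    pruned = DecisionTreeRegressor(min_samples_leaf=5, random_state=0,
                                    ccp_alpha=alpha).fit(X, y)
    print(f"alpha = {alpha:.4f}   leaves = {pruned.get_n_leaves()}")
```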
Cross validation, the wrong way!
1. Construct a sequence of trees T0, ..., Tm for a range of values of α.
2. Split the training points into 10 folds.
3. For k = 1, ..., 10:
   - For each tree Ti, use every fold except the kth to estimate the averages in each region.
   - For each tree Ti, calculate the RSS in the kth (held-out) fold.
4. For each tree Ti, average the 10 test errors, and select the value of α that minimizes the error.
The problem: the trees were constructed using all of the training data, so each held-out fold has already influenced the shape of the trees it is meant to evaluate.
Cross validation, the right way
1. Split the training points into 10 folds.
2. For k = 1, ..., 10, using every fold except the kth:
   - Construct a sequence of trees T1, ..., Tm for a range of values of α, and find the prediction for each region in each one.
   - For each tree Ti, calculate the RSS on the kth (held-out) fold.
3. Select the parameter α that minimizes the average test error.
Note: We are doing all fitting, including the construction of the trees, using only the training data.
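A sketch of this procedure with scikit-learn, on synthetic stand-in data: GridSearchCV regrows and prunes the trees inside every fold, so the held-out fold never influences tree construction; only the grid of candidate αs is shared across folds.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the Hitters data).
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.sin(4 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Candidate alphas taken from the pruning path of a tree grown on the training
# data; GridSearchCV then refits (grows and prunes) a tree inside every fold
# for each candidate alpha.
alphas = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas

search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid={"ccp_alpha": alphas},
                      cv=10, scoring="neg_mean_squared_error")
search.fit(X, y)
print("alpha chosen by 10-fold CV:", search.best_params_["ccp_alpha"])
```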
Example. Predicting baseball salaries
[Figure: the full, unpruned regression tree for the Hitters data (as above), together with the training, cross-validation, and test mean squared error as a function of tree size.]
Example. Predicting baseball salaries
[Figure: the pruned three-leaf tree (Years < 4.5, Hits < 117.5; leaf predictions 5.11, 6.00, 6.74), shown with the training, cross-validation, and test MSE as a function of tree size.]
Classification trees
- They work much like regression trees.
- We predict the response by majority vote, i.e. pick the most common class in every region.
- Instead of trying to minimize the RSS:
$$\sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{y}_{R_m})^2$$
  we minimize a classification loss function.
Classification losses
- The 0-1 loss or misclassification rate:
$$\sum_{m=1}^{|T|} \sum_{x_i \in R_m} 1(y_i \neq \hat{y}_{R_m})$$
- The Gini index:
$$\sum_{m=1}^{|T|} q_m \sum_{k=1}^{K} p_{mk}(1 - p_{mk}),$$
  where p_{mk} is the proportion of class k within Rm, and q_m is the proportion of samples in Rm.
- The cross-entropy:
$$-\sum_{m=1}^{|T|} q_m \sum_{k=1}^{K} p_{mk}\log(p_{mk}).$$
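These per-region impurities are easy to compute from the class proportions; the tree-wide losses above simply weight each region's contribution (by its sample count for the 0-1 loss, or by q_m for the other two). A small sketch with made-up proportions:

```python
import numpy as np

# Made-up class proportions p_mk for one region R_m with K = 3 classes.
p = np.array([0.7, 0.2, 0.1])

misclassification = 1.0 - p.max()           # 1 - max_k p_mk
gini              = np.sum(p * (1.0 - p))   # sum_k p_mk (1 - p_mk)
cross_entropy     = -np.sum(p * np.log(p))  # -sum_k p_mk log p_mk

print(misclassification, gini, cross_entropy)
```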
Classification losses
- The Gini index and cross-entropy are better measures of the purity of a region, i.e. they are low when the region is mostly one category.
- Motivation for the Gini index:
  If instead of predicting the most likely class, we predict a random sample from the distribution (p_{m1}, p_{m2}, ..., p_{mK}), the Gini index is the expected misclassification rate.
- It is typical to use the Gini index or cross-entropy for growing the tree, while using the misclassification rate when pruning the tree.
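For instance, a classification tree grown with the Gini index in scikit-learn (a sketch on synthetic stand-in data, not the Heart data below) predicts the majority class within each leaf:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data with a noisy decision boundary at x0 = 0.5.
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))
y = (X[:, 0] + 0.3 * rng.normal(size=300) > 0.5).astype(int)

# Grow the tree with the Gini index; each leaf predicts its majority class.
clf = DecisionTreeClassifier(criterion="gini", min_samples_leaf=5,
                             random_state=0).fit(X, y)
print("number of leaves:", clf.get_n_leaves())
print("training accuracy:", clf.score(X, y))
```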
Example. Heart dataset.
[Figure: a large classification tree for the Heart data (splits on Thal, Ca, MaxHR, RestBP, Chol, ChestPain, Sex, Slope, Age, Oldpeak, and RestECG), the training, cross-validation, and test error as a function of tree size, and the smaller pruned tree, whose splits involve Thal, Ca, MaxHR, and ChestPain.]
Some advantages of decision trees
- Very easy to interpret!
- Closer to human decision-making.
- Easy to visualize graphically.
- They easily handle qualitative predictors and missing data.
- Downside: they don't necessarily fit as well!