Classification and Regression Trees (CART)

Page 1

Classification and Regression Trees

(CART)

Page 2

Variety of approaches used

• CART, developed by Breiman, Friedman, Olshen and Stone: “Classification and Regression Trees”

• C4.5, a machine learning approach, by Quinlan

• Engineering approach by Sethi and Sarvarayudu

Page 3

Example

• University of California: a study of patients after admission for a heart attack

• 19 variables collected during the first 24 hours for 215 patients (those who survived the first 24 hours)

• Question: Can the high risk (will not survive 30 days) patients be identified?

Page 4

Answer

[Decision tree: Is the minimum systolic blood pressure over the first 24 hours > 91? If not, the patient is high risk (H). If so: Is age > 62.5? If not, low risk (L). If so: Is sinus tachycardia present? If yes, high risk (H); if no, low risk (L).]

Page 5

Features of CART

• Binary Splits

• Splits based on only one variable

Page 6

Plan for Construction of a Tree

• Selection of the Splits

• Decisions when to decide that a node is a terminal node (i.e. not to split it any further)

• Assigning a class to each terminal node

Page 7

Impurity of a Node

• Need a measure of impurity of a node to help decide on how to split a node, or which node to split

• The measure should be at a maximum when a node is equally divided amongst all classes

• The impurity should be zero if the node is all one class

Page 8

Measures of Impurity

• Misclassification Rate

• Information, or Entropy

• Gini Index

In practice the first is not used for the following reasons:

• Situations can occur where no split improves the misclassification rate

• The misclassification rate can be equal when one option is clearly better for the next step
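For a two-class node these measures are easy to compute directly. A minimal Python sketch (the deck itself works in spreadsheets and R; the function names here are illustrative):

```python
import math

def misclassification(p):
    # 1 minus the proportion of the majority class
    return 1 - max(p)

def entropy(p):
    # -sum p_j log p_j, taking 0 log 0 = 0
    return -sum(pj * math.log(pj) for pj in p if pj > 0)

def gini(p):
    # 1 - sum p_j^2
    return 1 - sum(pj * pj for pj in p)

# All three are maximal for an evenly divided node
# and zero for a pure node:
for measure in (misclassification, entropy, gini):
    assert measure([0.5, 0.5]) > measure([0.9, 0.1])
    assert measure([1.0, 0.0]) == 0.0
```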

Page 9

Problems with Misclassification Rate I

[Figure: a node containing cases of A and B, with two possible splits.]

Neither improves misclassification rate, but together give perfect classification!

Page 10

Problems with Misclassification Rate II

Parent node: 400 of A, 400 of B.

Split 1: (300 of A, 100 of B) and (100 of A, 300 of B).

OR?

Split 2: (200 of A, 400 of B) and (200 of A, 0 of B).

Both splits misclassify 200 cases, so the misclassification rate cannot choose between them, even though the second split produces a pure node.
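A quick numerical check of the two candidate splits (a Python sketch; each leaf is assumed to predict its majority class):

```python
def misclassification_rate(leaves):
    # leaves: list of (A_count, B_count) pairs; each leaf predicts
    # its majority class, so its minority count is misclassified.
    total = sum(a + b for a, b in leaves)
    return sum(min(a, b) for a, b in leaves) / total

split_1 = [(300, 100), (100, 300)]
split_2 = [(200, 400), (200, 0)]

# Both splits misclassify 200 of the 800 cases, so the
# misclassification rate cannot choose between them, even though
# the second split yields a pure node.
```

Both splits give a rate of 0.25, against 0.5 for the unsplit node.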

Page 11

Misclassification rate for two classes

[Plot: for two classes the misclassification rate is min(p1, 1 − p1), rising from 0 at p1 = 0 to a maximum of 1/2 at p1 = 0.5 and falling back to 0 at p1 = 1.]

Page 12

Information

• If a node has a proportion pj of each of the classes, then the information, or entropy, is:

i(p) = −Σj pj log pj

where 0 log 0 = 0.

Note: p = (p1, p2, …, pn)

Page 13

Gini Index

• This is the most widely used measure of impurity (at least by CART)

• Gini index is:

i(p) = Σi≠j pi pj = 1 − Σj pj²

Page 14

Scaled Impurity functions

[Plot: the scaled impurity functions plotted against p1 (0 to 1): misclassification rate, Gini index and information all peak at p1 = 0.5 and are zero at p1 = 0 and p1 = 1.]

Page 15

Tree Impurity

• We define the impurity of a tree to be the sum over all terminal nodes of the impurity of a node multiplied by the proportion of cases that reach that node of the tree

• Example (i): the impurity of a tree with one single node, with both A and B having 400 cases, using the Gini index:

Proportions of the two classes = 0.5. Therefore Gini index = 1 − (0.5)² − (0.5)² = 0.5

Page 16

Tree Impurity Calculations

Numbers of Cases   Proportion of Cases                      Gini Index
A      B           pA     pB     pA²     pB²                1 − pA² − pB²
400    400         0.5    0.5    0.25    0.25               0.5

Page 17

Number of Cases   Proportion of Cases                   Gini Index      Contrib. to Tree
A      B          pA      pB      pA²      pB²          1 − pA² − pB²
300    100        0.75    0.25    0.5625   0.0625       0.375           0.1875
100    300        0.25    0.75    0.0625   0.5625       0.375           0.1875
                                                        Total           0.375

200    400        0.33    0.67    0.1111   0.4444       0.4444          0.3333
200    0          1       0       1        0            0               0
                                                        Total           0.3333
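The table totals can be verified with a short Python sketch of the tree-impurity definition (weighted sum of leaf Gini indices):

```python
def tree_gini(leaves):
    # Impurity of a tree: sum over terminal nodes of the node's Gini
    # index weighted by the proportion of cases reaching that node.
    total = sum(a + b for a, b in leaves)
    impurity = 0.0
    for a, b in leaves:
        n = a + b
        p_a, p_b = a / n, b / n
        impurity += (n / total) * (1 - p_a ** 2 - p_b ** 2)
    return impurity

root    = [(400, 400)]              # single node: 0.5
split_1 = [(300, 100), (100, 300)]  # 0.375
split_2 = [(200, 400), (200, 0)]    # 0.3333...
```

The Gini index therefore prefers the second split (0.3333 against 0.375), which the misclassification rate could not distinguish.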

Page 18

Selection of Splits

• We select the split that most decreases the Gini index. This is done over all possible places for a split and all possible variables to split on.

• We keep splitting until the terminal nodes have very few cases or are all pure. This is an unsatisfactory answer to when to stop growing the tree, but it was realised that the best approach is to grow a larger tree than required and then to prune it!
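The search over all variables and all cut points can be sketched as a small exhaustive loop in Python (illustrative only; rpart does this far more efficiently):

```python
from collections import Counter

def gini_of(counts):
    # Gini index of a node given a Counter of class frequencies.
    n = sum(counts.values())
    return 1 - sum((c / n) ** 2 for c in counts.values())

def best_split(points, labels):
    # points: list of tuples (one value per variable); labels: classes.
    # Try every variable and every midpoint between adjacent distinct
    # values, keeping the split that most decreases the weighted Gini.
    n = len(points)
    parent = gini_of(Counter(labels))
    best = (0.0, None, None)  # (decrease, variable index, cut point)
    for var in range(len(points[0])):
        values = sorted({p[var] for p in points})
        for lo, hi in zip(values, values[1:]):
            c = (lo + hi) / 2
            left = Counter(l for p, l in zip(points, labels) if p[var] < c)
            right = Counter(l for p, l in zip(points, labels) if p[var] >= c)
            w = (sum(left.values()) / n) * gini_of(left) \
                + (sum(right.values()) / n) * gini_of(right)
            if parent - w > best[0]:
                best = (parent - w, var, c)
    return best
```

On a toy data set, for example points [(1, 5), (2, 6), (8, 5), (9, 6)] with labels A, A, B, B, the best split is on the first variable at c = 5.0, decreasing the Gini index by 0.5.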

Page 19

Example – The same one used for Nearest Neighbour classification

[Scatter plot “Classifying A or B”: the A points and B points plotted against x (0 to 8) and y (0 to 6), the same data used in the Nearest Neighbour example.]

Page 20

Possible Splits

• There are two possible variables to split on, and each can be split at a range of values of c, i.e.:

x<c or x≥c

And:

y<c or y≥c

Page 21

Split = 2.81 (on x)

x       y       Class   Left A   Left B   Right A   Right B
2.61    2.02    A       1        0        0         0
2.57    2.10    A       1        0        0         0
2.85    2.46    B       0        0        0         1
2.45    2.85    A       1        0        0         0
2.76    3.00    A       1        0        0         0
2.82    3.07    A       0        0        1         0
2.68    3.13    B       0        1        0         0
Etc.

Page 22

              A    B    n     pA     pB     pA²    pB²    Gini   Contrib.
Top           50   50   100   0.5    0.5    0.25   0.25   0.5
Split Left    44   7    51    0.86   0.14   0.74   0.02   0.24   0.12
Split Right   6    43   49    0.12   0.88   0.01   0.77   0.21   0.11

Sum = 0.23
Change in Gini Index = 0.5 − 0.23 = 0.27
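The 0.27 figure can be reproduced directly (a Python sketch of the same calculation):

```python
def node_gini(a, b):
    # Gini index of a node holding a cases of A and b cases of B.
    n = a + b
    return 1 - (a / n) ** 2 - (b / n) ** 2

top = node_gini(50, 50)                    # 0.5
after = (51 / 100) * node_gini(44, 7) \
        + (49 / 100) * node_gini(6, 43)    # about 0.23
change = top - after                       # about 0.27
```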

Page 23

Split   Change
2.81    0.27
1.5     0.10
1.6     0.11
1.7     0.12
1.8     0.13
1.9     0.16
2       0.16
2.1     0.17
2.2     0.17
2.3     0.15

Then use a spreadsheet Data Table to find the best value for a split.

Page 24

Improvement in Gini Index

[Plot: change in Gini index (0 to 0.3) against the split value on x (1.5 to 4).]

Page 25

Improvement in Gini Index

[Plot: change in Gini index (0 to 0.25) against the split value for y (1.5 to 4).]

Page 26

The Next Step

• You’d now need to develop a series of spreadsheets to work out the next best split

• This is easier in R!

[Tree: root split x < 2.808; the left branch then splits on y ≥ 2.343 and the right branch on y ≥ 3.442, giving leaves labelled A, B, A, B.]

Page 27

Developing Trees using R

• Need to load the package “rpart” which contains the set of functions for CART

• The function looks like: NNB.tree<-rpart(Type~., NNB[ , 1:2], cp = 1e-3)

This takes the data in Type (which contains the classes for the data, i.e. A or B) and builds a model on all the variables, indicated by “~.”. The data is in NNB[, 1:2], and cp is the complexity parameter (more on this later).

Page 28

Plot showing how the Tree works

[Scatter plot: the same A and B points against x (0 to 8) and y (0 to 6), partitioned by the fitted splits.]

Page 29

A More Complicated Example

• This is based on my own research.

• Wish to tell automatically, based on the data, which is the best method of exponential smoothing to use.

• The variables used are the differences of the fits for three different methods (SES, Holt’s and Damped Holt’s methods), and the alpha, beta and phi estimated for the Damped Holt method.

Page 30

This gives a very complicated tree!

[Tree diagram: 25 splits on Diff1, Diff2, alpha, beta and phi (root split Diff2 ≥ 5.229), with 26 terminal nodes labelled DHolt, Holt or SES.]

Page 31

Pruning the Tree I

• As I said earlier, it has been found that the best method of arriving at a suitable size for the tree is to grow an overly complex one and then to prune it back. The pruning is based on the misclassification rate. However, the error rate will always drop (or at least not increase) with every split. This does not mean, however, that the error rate on test data will improve.

Page 32

Misclassification Rates

[Plot: error rates (0 to 1) against the size of the tree (0 to 80): the misclassification rate on the training set keeps falling with tree size, while the misclassification rate on the test set levels off and then worsens.]

Source: CART by Breiman et al.

Page 33

Pruning the Tree II

• The solution to this problem is cross-validation. One version of the method carries out a 10-fold cross-validation: the data is divided at random into 10 subsets of equal size, the tree is grown leaving one subset out, and its performance is assessed on the subset that was left out. This is done for each of the 10 subsets, and the average performance is then assessed.
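The fold construction can be sketched in Python (illustrative only; rpart's xval argument handles this internally):

```python
import random

def ten_fold_indices(n, seed=0):
    # Shuffle the case indices and deal them into 10 near-equal folds.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

folds = ten_fold_indices(100)
# Each fold is held out once: the tree is grown on the other nine
# folds and assessed on the held-out fold, and the ten performance
# figures are averaged.
```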

Page 34

Pruning the Tree III

• This is all done by the command “rpart”, and the results can be accessed using “printcp” and “plotcp”.

• We can then use this information to decide how complex the tree needs to be (determined by the size of cp). The possible rules are to minimise the cross-validated relative error (“xerror”), or to use the “1-SE rule”, which takes the largest value of cp with “xerror” within one standard error of the minimum. The latter is preferred by Breiman et al. and by B. D. Ripley, who includes it as a dashed line in the “plotcp” output.

Page 35

> printcp(expsmooth.tree)

Classification tree:
rpart(formula = Model ~ Diff1 + Diff2 + alpha + beta + phi, data = expsmooth, cp = 0.001)

Variables actually used in tree construction:
[1] alpha beta Diff1 Diff2 phi

Root node error: 2000/3000 = 0.66667

n= 3000

          CP nsplit rel error xerror     xstd
1  0.4790000      0    1.0000 1.0365 0.012655
2  0.2090000      1    0.5210 0.5245 0.013059
3  0.0080000      2    0.3120 0.3250 0.011282
4  0.0040000      4    0.2960 0.3050 0.011022
5  0.0035000      5    0.2920 0.3115 0.011109
6  0.0025000      8    0.2810 0.3120 0.011115
7  0.0022500      9    0.2785 0.3085 0.011069
8  0.0020000     13    0.2675 0.3105 0.011096
9  0.0017500     16    0.2615 0.3075 0.011056
10 0.0016667     20    0.2545 0.3105 0.011096
11 0.0012500     23    0.2495 0.3175 0.011187
12 0.0010000     25    0.2470 0.3195 0.011213
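Both selection rules can be applied to this table programmatically. A Python sketch using the CP, xerror and xstd columns (note that plotcp labels its axis with geometric means of adjacent cp values, so its dashed-line pick can differ slightly):

```python
# Rows of the printcp output: (CP, xerror, xstd)
cp_table = [
    (0.4790000, 1.0365, 0.012655),
    (0.2090000, 0.5245, 0.013059),
    (0.0080000, 0.3250, 0.011282),
    (0.0040000, 0.3050, 0.011022),
    (0.0035000, 0.3115, 0.011109),
    (0.0025000, 0.3120, 0.011115),
    (0.0022500, 0.3085, 0.011069),
    (0.0020000, 0.3105, 0.011096),
    (0.0017500, 0.3075, 0.011056),
    (0.0016667, 0.3105, 0.011096),
    (0.0012500, 0.3175, 0.011187),
    (0.0010000, 0.3195, 0.011213),
]

# Rule 1: minimise the cross-validated error.
cp_min, xerr_min, xstd_min = min(cp_table, key=lambda r: r[1])

# Rule 2 (1-SE): largest cp whose xerror is within one SE of the minimum.
threshold = xerr_min + xstd_min
cp_1se = max(cp for cp, xerr, _ in cp_table if xerr <= threshold)
```

For this table both rules land on cp = 0.004: the minimum xerror is 0.3050, and no larger cp value stays within one standard error of it.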

Page 36

Relative Misclassification Errors

[Plot: relative error rate (0 to 1.2) against the number of splits (0 to 25), showing the relative (training) error, which keeps falling, and the relative CV error, which flattens out.]

This relative CV error tends to be very flat, which is why the “1-SE” rule is preferred.

Page 37

[Plot from plotcp: cross-validated relative error (0.2 to 1.0) against cp (from Inf down to 0.0011) and size of tree (1 to 26), with the 1-SE dashed line shown.]

Page 38

This suggests that a cp of 0.003 is about right for this tree, giving the tree shown.

[Pruned tree: root split Diff2 ≥ 5.229; the left branch splits on phi < 0.9732 and the right on Diff2 ≥ 1.557 and then Diff2 ≥ 3.109; the five leaves are labelled DHolt, Holt, DHolt, SES, SES.]

Page 39

Cost complexity

• Whilst we did not use the misclassification rate to decide where to split the tree, we do use it in the pruning. The key term is the relative error (which is normalised to one at the top of the tree). The standard approach is to choose a value of α, and then to choose a tree to minimise

Rα = R + α × size

where R is the number of misclassified points and the size of the tree is the number of end points. “cp” is α/R(root tree).
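The trade-off can be illustrated with hypothetical subtree numbers (the counts below are made up purely for illustration):

```python
def cost_complexity(misclassified, n_leaves, alpha):
    # R_alpha = R + alpha * size
    return misclassified + alpha * n_leaves

# Hypothetical candidate subtrees: (misclassified count, terminal nodes).
subtrees = [(200, 2), (120, 5), (100, 12)]

def best_tree(alpha):
    # The chosen tree minimises the penalised error R_alpha.
    return min(subtrees, key=lambda t: cost_complexity(t[0], t[1], alpha))

# alpha = 0 favours the largest tree; a large alpha favours the smallest.
```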

Page 40

Regression trees

Trees can be used to model functions, though each end point will result in the same predicted value: a constant for that end point. Thus regression trees are like classification trees, except that the end point will be a predicted function value rather than a predicted classification.

Page 41

Measures used in fitting Regression Tree

• Instead of using the Gini Index the impurity criterion is the sum of squares, so splits which cause the biggest reduction in the sum of squares will be selected.

• In pruning the tree the measure used is the mean square error on the predictions made by the tree.
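The split criterion can be sketched just like the Gini version, with the sum of squares about the node mean in place of the impurity (an illustrative Python sketch):

```python
def sse(values):
    # Sum of squared deviations from the node mean
    # (the mean is the node's prediction).
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_regression_split(x, y):
    # Try every midpoint between adjacent distinct x values and keep
    # the cut giving the biggest reduction in total sum of squares.
    xs = sorted(set(x))
    total = sse(y)
    best = (0.0, None)  # (reduction, cut point)
    for lo, hi in zip(xs, xs[1:]):
        c = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi < c]
        right = [yi for xi, yi in zip(x, y) if xi >= c]
        reduction = total - sse(left) - sse(right)
        if reduction > best[0]:
            best = (reduction, c)
    return best
```

On a toy series, for example x = [1, 2, 3, 10, 11, 12] with y = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9], the chosen cut is 6.5, separating the low responses from the high ones.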

Page 42

Regression Example

• In an effort to understand how computer performance is related to a number of variables describing the features of a PC, the following data was collected: the size of the cache, the cycle time of the computer, the memory size and the number of channels (the last two were not measured directly; minimum and maximum values were obtained).

Page 43

This gave the following tree:

[Tree: 16 splits on cach, mmax, syct, chmin and chmax (root split cach < 27), with 17 terminal nodes predicting values from 1.09 to 2.67.]

Page 44

[Plot from plotcp: cross-validated relative error (0.2 to 1.2) against cp (from Inf down to 0.0018) and size of tree (1 to 17).]

We can see that we need a cp value of about 0.008, to give a tree with 11 leaves or terminal nodes.

Page 45

[Pruned tree: 10 splits on cach, mmax, syct and chmin (root split cach < 27), with 11 terminal nodes predicting values from 1.09 to 2.67.]

This enables us to see that, at the top end, it is the size of the cache and the amount of memory that determine performance

Page 46

Advantages of CART

• Can cope with any data structure or type

• Classification has a simple form

• Uses conditional information effectively

• Invariant under transformations of the variables

• Is robust with respect to outliers

• Gives an estimate of the misclassification rate

Page 47

Disadvantages of CART

• CART does not use combinations of variables

• The tree can be deceptive: if a variable is not included, it may be because it was “masked” by another.

• Tree structures may be unstable – a change in the sample may give different trees

• Tree is optimal at each split – it may not be globally optimal.

Page 48

Exercises

• Implement Gini Index on a spreadsheet

• Have a go at the lecture examples using R and the script available on the web

• Try classifying the Iris data using CART.