Comparative Review of Classification Trees
by Leonardo Auslender,
leoldv12 ‘at’ gmail ‘dot’ com
Independent Statistical Research Consultant 2013
— 2 —
Contents
1) Trees/CART: varieties, algorithm
2) Model Deployment: scoring
3) Examples
4) Concluding Remarks: Brains, the future

Review of Trees: Auslender, L. (1998): "Alacart, Poor Man's Classification Trees", North Eastern SAS Users Group Conference.
— 3 —
A Field Guide to Tree Varieties
• CART, Tree (S+, R)
• AID, THAID, CHAID
• ID3, C4.5, C5.0
1) Varieties of Tree Methods
— 4 —
CART: Classification and Regression Trees
Source: Breiman L., Friedman J., Olshen R., Stone C.: Classification and Regression Trees, Wadsworth International Group, Belmont, CA, 1984.
— 5 —
Aim: separate two classes by using X1 and X2, producing more homogeneous rectangular regions.
— 6 —
CART: underlying classification algorithm, using misclassification.
Y X1 X2 X3 x4
0 1 10 21 1
1 1 30 8 1
0 2 0 8 0
0 3 10 8 0
Misscl (Y | X1 <= 1) = 0.5, Misscl (Y | X1 > 1) = 0; repeat for every value of X1 and for every other X variable, then select the optimal variable and split (CART actually uses Gini).
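As a small illustration (not the author's Alacart code), the candidate split X1 <= 1 on the four-row example above can be checked in SAS; the flag name LEFT and data set name TOY are mine.

data toy;
   input y x1 x2 x3 x4;
   left = (x1 <= 1);            /* candidate split: X1 <= 1          */
   datalines;
0 1 10 21 1
1 1 30 8 1
0 2 0 8 0
0 3 10 8 0
;
run;

proc means data=toy n mean;
   class left;                  /* left vs. right child of the split */
   var y;                       /* mean of a 0/1 target = event rate */
run;

The child event rates of 0.5 and 0.0 give misclassification rates of min(p, 1 - p) = 0.5 and 0.0, matching the slide; the full search repeats this for every cutoff of every input.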
— 7 —
Basic CART Algorithm: binary dependent variable or target (0,1)
[Figure: the target Y plotted against the range of continuous variable A (Xi); the original node holds 50% of '0's and '1's, and a splitting point divides the range into regions whose class mixes move to roughly 70% and 20%.]
— 8 —
Divide and Conquer: recursive partitioning
[Figure: root node with n = 5,000 and 10% HELOC; the split Debits < 19 (yes/no) produces child nodes of n = 3,350 and n = 1,650, with HELOC rates of 21% and 5%.]
— 9 —
Ideal SAS code to find splits

proc summary data = .... nway;
   class (all independent vars);
   var depvar;                   /* this is the 'target', 0/1 */
   output out = .... sum = ;
run;
For large data sets (large N, large NVAR), hardware and software constraints prevent completion.
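A filled-in sketch of the same idea, with hypothetical names (data set TRAIN, inputs X1-X3, 0/1 target DEPVAR): the NWAY summary returns, for every observed combination of input levels, the number of events (SUM=) and the cell size (_FREQ_), from which candidate-split impurities can then be computed.

proc summary data=train nway;
   class x1 x2 x3;                    /* all independent variables   */
   var depvar;                        /* the 0/1 target              */
   output out=cellsums sum=events;    /* events per cell; _FREQ_ = n */
run;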
— 10 —
Fitted Decision Tree: Interpretation and Structure
[Figure: fitted tree with splits on VAR A (< 19 / >= 19), VAR B (0-52 / > 52) and VAR C (0,1 / > 1), and node event rates of 5%, 21%, 25% and 45%.]
— 11 —
The Cultivation of Trees
• Split Search
– Which splits are to be considered?
• Splitting Criterion
– Which split is best?
• Stopping Rule
– When should the splitting stop?
• Pruning Rule
– Should some branches be lopped-off?
— 12 —
Possible Splits to Consider: the most common is binary, because...
[Figure: number of possible splits vs. number of input levels (2 to 20); the count explodes for nominal inputs (approaching 500,000 at 20 levels) but grows only linearly for ordinal inputs.]
If an ordinal input has 1,000 levels, there are 999 possible binary splits, 999 * 998 / 2 possible ternary splits, etc.
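For reference (standard counting, not from the slide): an ordinal or interval input with L distinct levels admits L - 1 binary splits, whereas a nominal input admits 2^(L-1) - 1, which is why the nominal curve explodes (about 524,000 binary splits at 20 levels) while the ordinal curve grows linearly.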
— 13 —
Splitting Criterion: Gini, twoing, misclassification, entropy...
A) Minimize the Gini impurity criterion (favors node homogeneity).
B) Maximize the twoing criterion (favors class separation).
Empirical results: for binary dependent variables, Gini and twoing are equivalent. For trinomial targets, Gini provides more accurate trees. Beyond three categories, twoing performs better.
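For reference, the standard CART definitions, in my notation (p(k|t) is the proportion of class k in node t; a split s sends fractions pL and pR of node t to children tL and tR):

Gini impurity:     i(t) = 1 - SUM_k p(k|t)^2
Twoing criterion:  D(s,t) = (pL * pR / 4) * [ SUM_k | p(k|tL) - p(k|tR) | ]^2

Gini is minimized within nodes (homogeneity), while twoing is maximized over splits (class separation).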
— 14 —
The Right-Sized Tree
• Stunting (stop growing before the tree becomes too large)
• Pruning (grow a large tree, then cut back weak branches)
— 15 —
— 16 —
— 17 —
— 18 —
Benefits of Trees
• Interpretability
– Tree structured presentation
• Mixed Measurement Scales
– Nominal, ordinal, interval
– Regression trees
• Robustness
• Missing Values
— 19 —
…Benefits
• Automatically
  – Detects interactions (as in AID) through a hierarchical conditioning search, not 'a la' regression analysis.
  – Selects input variables.
[Figure: fitted probability surface over two inputs; the tree is a multivariate step function.]
— 20 —
Drawbacks of Trees
• Unstable: small perturbations in the data can lead to big changes in the tree.
• Linear structures are approximated only in very rough form.
• Applications may require that rule descriptions for different categories not share the same attributes.
• It is a conditional structure, and interpretation often misses the conditioning effect.
— 21 —
Drawbacks of Trees (cont.)
• Tends to over-fit => overly optimistic accuracy.
• Large trees are very difficult to interpret.
• Tree size is conditioned by the data set size.
• No valid inferential procedures at present (if it matters).
• Greedy search algorithm.
— 22 —
Note on Missing Values.
1) Missingness NOT in Y (see Wang and Sheng, 2007, JMLR, for a semi-supervised method for missing Y).
2) Different methods of imputation:
   1) C4.5 probabilistic split: observations with a missing split variable are passed to the child nodes with weights equal to the proportions of non-missing values.
   2) Complete case: eliminate all observations with missing values, and train.
   3) Grand mode/mean: impute the mode if categorical, the mean if continuous.
   4) Separate class: appropriate for categorical variables. For continuous ones, create an extreme large value and thus separate missings from non-missings.
   5) Complete variable case: delete all variables with missing values.
   6) Surrogate (CART default): use surrogate variable(s) whenever the split variable is missing. At testing or scoring time, if the variable is missing, use the surrogate(s).
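A minimal sketch of how a surrogate behaves at scoring time, reusing the CURRDUE / PASTDUE names from the Alacart code shown later; the cutoffs, the choice of surrogate and the predictions are illustrative only, not the output of any actual run.

data scored;
   set newdata;                            /* hypothetical input data set           */
   if not missing(currdue) then
      go_left = (currdue <= 105.38);       /* primary splitter                      */
   else if not missing(pastdue) then
      go_left = (pastdue <= 90.36);        /* first surrogate, used when missing    */
   else
      go_left = 1;                         /* last resort: send to the larger child */
   pred = ifn(go_left, 0, 1);              /* illustrative node predictions         */
run;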
Tree Derivative: Random Forests. (Breiman, 1999)
— 23 —
Random Forests proceed in the following steps; notice that there is no need to create separate training, validation and test data sets:
1. Take a random sample of N observations with replacement ("bagging") from the data set. On average, this selects about 2/3 of the rows; the remaining 1/3 are called "out of bag (OOB)" observations. A new random selection is performed for each tree constructed (a SAS sketch of this step follows below).
2. Using the observations selected in step 1, construct a decision tree to its maximum size, without pruning. As the tree is built, allow only a subset of the predictor variables to be considered as possible splitters for each node, selecting that subset at random from the total set of available predictors. For example, if there are ten predictors, choose five of them randomly as candidate splitters, and perform a new random selection for each split. Some predictors (possibly the best one) will not be considered for a given split, but a predictor excluded from one split may be used for another split in the same tree.
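A minimal SAS sketch of step 1 (the bootstrap sample), assuming a training data set TRAIN; the seed and names are illustrative.

proc surveyselect data=train out=boot_sample
     method=urs      /* unrestricted random sampling = with replacement */
     samprate=1      /* draw as many observations as the data set has   */
     outhits         /* write one output row per draw                   */
     seed=12345;
run;

Rows of TRAIN that never appear in BOOT_SAMPLE form this tree's out-of-bag (OOB) set (recover them by merging back on a row identifier). Step 2 would then grow an unpruned tree on BOOT_SAMPLE, drawing a fresh random subset of the predictors at each split.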
— 24 —
No Overfitting or Pruning.
The "over-fitting" problem appears in large, single-tree models where the model fits noise in the data, which causes poor generalization power and is the basis for pruning those models. In nearly all cases, decision tree forests do not have a problem with over-fitting, and there is no need to prune the trees in the forest. Generally, the more trees in a forest, the better the fit.

Internal Measure of Test Set (Generalization) Error.
About 1/3 of the observations are excluded from each tree in the forest; these are called "out of bag (OOB)". Each tree has a different set of out-of-bag observations, so each OOB set constitutes an independent test sample.
To measure the generalization error of decision tree forests, the OOB set for each tree is run through that tree and its prediction error rate is computed. The error rates for the trees in the forest are then averaged to obtain the overall generalization error rate for the decision tree forest model.
There are several advantages to this method of computing the generalization error:
(1) All observations are used to construct the model, and none have to be held back as a separate test set;
(2) The testing is fast because only one forest has to be constructed (as compared with V-fold cross-validation, where additional trees have to be constructed).
— 25 —
2) Scoring: the workhorse of database marketing.
Model Deployment.
— 26 —
Scoring Recipe
• Model
– Formula
• Data Modifications
– Derived inputs
– Variable Transformations
– Missing value imputation
• Scoring Code
[Figure: new data are scored either through the original computation algorithm or through stand-alone scoring code, producing the scored data set.]
— 27 —
/* PROGRAM ALGOR8.PGM WITH 8 FINAL NODES*/
/* METHOD MISSCL ALACART TEST */
RETAIN ROOT 1;
IF ROOT & CURRDUE <= 105.38 & PASTDUE <= 90.36 & CURRDUE <= 12
THEN DO;
NODE = '4_1 ';
PRED = 0 ;
/* % NODE IMPURITY = 0.0399 ; */
/* BRANCH # = 1 ; */
/* NODE FREQ = 81 ; */
END;
ELSE IF ROOT & CURRDUE <= 105.38 & PASTDUE <= 90.36 & CURRDUE > 12
THEN DO;
NODE = '4_2 ';
PRED = 1 ;
/* % NODE IMPURITY = 0.4478 ; */
/* BRANCH # = 2 ; */
/* NODE FREQ = 212 ; */
END;
ELSE IF ROOT & CURRDUE <= 105.38 & PASTDUE > 90.36
THEN DO;
NODE = '3_2 ';
PRED = 0 ;
Scoring Recipe: example of scoring output generated by
Alacart
— 28 —
Scorability
[Figure: training data plotted on X1 (0 to 1) versus X2 (0 to 1) with the tree's rectangular regions; a new case is classified by the scoring code below.]
Scoring Code Classifier: if x1 < .47 & x2 < .18, or x1 > .47 & x2 > .29, then red.
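Deployed as code, the rule above becomes a short data step; the data set and output variable names are mine.

data scored;
   set new_cases;
   length class $5;
   if (x1 < .47 and x2 < .18) or (x1 > .47 and x2 > .29) then class = 'red';
   else class = 'other';
run;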
— 29 —
1st Data set: Titanic.
Titanic survival data, available on the web: 1,313 observations, of which 756 are complete because of missing "Age" values. Below, the variables available for analysis (the "*" variables are transformations to "help" the logistic).

Variable Name   Variable Label    Length
AGE             Yrs of Age        8
AGESEX          Age * Sex         8
AGESQ           Age * Age         8
PASSCLASS1      First Class       8
PASSCLASS2      Second Class      8
PASSCLASS3      Third Class       8
SEX             Sex Female = 1    8
SURVIVED        Survived = 1      8
Original Data

                              Did not survive      Survived          Total
Age       Gender   Class      Count    % Obs      Count    % Obs     Count
Present   female   1st            5     0.38         96     7.31       101
                   2nd           10     0.76         75     5.71        85
                   3rd           56     4.27         46     3.50       102
          male     1st           82     6.25         43     3.27       125
                   2nd          106     8.07         21     1.60       127
                   3rd          184    14.01         32     2.44       216
          All                   443    33.74        313    23.84       756
Missing   female   1st            4     0.30         38     2.89        42
                   2nd            3     0.23         19     1.45        22
                   3rd           76     5.79         34     2.59       110
          male     1st           38     2.89         16     1.22        54
                   2nd           42     3.20          4     0.30        46
                   3rd          257    19.57         26     1.98       283
          All                   420    31.99        137    10.43       557
All                             863    65.73        450    34.27      1313
— 32 —
— 33 —
Complete Data (without missing Age): Data Description, SURVIVED

                        Did not survive          Survived           Total
Gender   Class          Count    Mean Age      Count    Mean Age    Count
female   1st                5       35.20         96       37.91      101
         2nd               10       31.40         75       26.85       85
         3rd               56       23.82         46       23.72      102
         Total             71       24.90        217       30.87      288
male     1st               82       44.84         43       34.25      125
         2nd              106       31.70         21       14.84      127
         3rd              184       27.10         32       23.09      216
         Total            372       33.32         96       25.95      468
All                       443       31.13        313       29.36      756

(The original slide also reports cell and gender percentages, annotated to contrast Pr(Survived & Female) with Pr(Female | Survived).)
— 34 —
Logistic vs. Trees: Titanic, no missing values.

Consistency information: 756 obs. ('informs.titanic_no_missing'); 7 variables (3 continuous, 4 class).
Variables used: Trees 3, Forest 5, while Stepwise 5 plus the intercept.

Var Sel Type   Variable     Estimate        Pr > Chi-Sq   Importance   # Rules
FOREST         AGE                                            0.546       493
               PASSCLASS1                                     0.400        66
               PASSCLASS2                                     0.248        58
               PASSCLASS3                                     0.465        59
               SEX                                            1.000        97
STEPWISE       AGE          -0.039177938      0.000
               PASSCLASS1    1.2919799232     0.000
               PASSCLASS3   -1.229467857      0.000
               SEX           2.631357225      0.000
               Intercept    -0.163634963      0.550
TREES          AGE                                            0.408         1
               PASSCLASS3                                     0.556         2
               SEX                                            1.000         1
— 35 —
Training: Rates ('-' ==> misclassification & mis-precision), target Survived = 1

                                    Predicted 0            Predicted 1            Overall
Survived   Model Name           Class Rate  Prec Rate   Class Rate  Prec Rate   Class Rate  Prec Rate
0          FOREST                  96.39      78.35        -3.61      -7.58        96.39
           LOGISTIC_STEPWISE       83.97      80.35       -16.03     -24.23        83.97
           TREES                   96.61      77.12        -3.39      -7.46        96.61
1          FOREST                 -37.70     -21.65        62.30      92.42        62.30
           LOGISTIC_STEPWISE      -29.07     -19.65        70.93      75.77        70.93
           TREES                  -40.58     -22.88        59.42      92.54        59.42
Overall    FOREST                             78.35                   92.42        82.28       82.28
           LOGISTIC_STEPWISE                  80.35                   75.77        78.57       78.57
           TREES                              77.12                   92.54        81.22       81.22
Trees have the highest classification rate for non-survivors (96.61%) and an excellent precision rate for survivors (92.54%); Forest comes in a close second.
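These rates come from each model's training confusion matrix. As a minimal sketch (assuming a scored data set with the actual target SURVIVED and a model's 0/1 prediction PRED, names mine), the crosstab below yields them: row percentages give the classification rates and column percentages the precision rates.

proc freq data=scored;
   tables survived*pred;    /* row % = classification rate, column % = precision */
run;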
— 36 —
Gains Table

Percentile   Model Name          Events Rate   Cum Events Rate   % Events Captured   Cum % Events Captured   Lift    Cum Lift   Brier Score * 100
5            FOREST                 100.000        100.000             12.141               12.141           2.415     2.415          0.511
             LOGISTIC_STEPWISE       94.737         94.737             11.502               11.502           2.288     2.288          5.111
             TREES                   95.119         95.119             11.548               11.548           2.297     2.297          0.000
10           FOREST                  97.368         98.684             11.821               23.962           2.352     2.384          0.936
             LOGISTIC_STEPWISE       97.368         96.053             11.821               23.323           2.352     2.320          1.023
             TREES                   91.935         93.527             11.161               22.709           2.221     2.259          0.000
15           FOREST                  90.789         96.053             11.022               34.984           2.193     2.320          5.166
             LOGISTIC_STEPWISE       89.474         93.860             10.863               34.185           2.161     2.267          2.837
             TREES                   91.935         92.997             11.161               33.871           2.221     2.246          0.000
20           FOREST                  93.421         95.395             11.342               46.326           2.256     2.304          3.859
             LOGISTIC_STEPWISE       86.842         92.105             10.543               44.728           2.098     2.225          3.951
             TREES                   91.935         92.731             11.161               45.032           2.221     2.240          0.000
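A minimal sketch of how such a gains table can be built in SAS, assuming a scored data set SCORED with the 0/1 target SURVIVED and a predicted probability P_HAT (names mine): rank the scores into twenty 5% bins, then compare each bin's event rate with the overall rate.

proc rank data=scored out=ranked groups=20 descending;
   var p_hat;
   ranks ptile;                          /* ptile 0 = top 5% of scores */
run;

proc sql noprint;
   select mean(survived) into :overall_rate from scored;   /* overall event rate */
quit;

proc sql;
   select ptile,
          count(*)                        as n,
          mean(survived)                  as event_rate,
          mean(survived) / &overall_rate  as lift
   from ranked
   group by ptile
   order by ptile;
quit;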
— 37 —
Comparing the results.
1) Trees required fewer variables than logistic and are thus easier to interpret; Forests by definition use all the variables.
2) Trees obtained slightly larger lift measures than logistic, but who can beat forests?
3) Trees and Forest identify the most important variable, female sex, immediately at the top of the tree, while with logistic this is not very clear.
— 38 —
2nd Data set: Surendra Financial Data.
No information is available about meaning or measurement. All variables are called R1 – R84, with one binary dependent variable, "Newgroup". There are no missing values; they have apparently been imputed beforehand, but the method was not reported.
— 39 —
Data Mining Example: Just fit a model.
Consistency information: 45,175 obs. ('surendra.newsurendra'); 84 variables, all continuous (0 class variables).
Financial information with target = "newgroup" and variable names R1 – R84, without any information as to what anything means. Forest omitted from the exercise.
— 40 —
The larger 'p' (the number of predictors) is, the more difficult the fitted tree is to interpret.
— 41 —
Gains Table

Percentile   Model Name          Events Rate   Cum Events Rate   % Events Captured   Cum % Events Captured   Lift    Cum Lift   Brier Score * 100
5            LOGISTIC_STEPWISE       92.873         92.873             37.391               37.391           7.477     7.477          5.043
             TREES                   98.639         98.639             39.712               39.712           7.942     7.942          0.000
10           LOGISTIC_STEPWISE       49.270         71.071             19.836               57.227           3.967     5.722         25.486
             TREES                   50.476         74.558             20.322               60.034           4.064     6.003          0.000
15           LOGISTIC_STEPWISE       30.235         57.459             12.173               69.399           2.434     4.626         20.962
             TREES                   12.772         53.962              5.142               65.176           1.028     4.345          0.000
20           LOGISTIC_STEPWISE       19.088         47.869              7.681               77.081           1.537     3.854         15.217
Logistic selected 49 variables, Trees 12; 10 of the 12 were also used by Logistic.
— 42 —
Comparing the results.
1) Trees selected 12 variables and logistic 49.
2) The initial split on R73 produced almost perfectly pure nodes. R73 was also selected by Stepwise, but Stepwise does not stop soon enough.
3) Model performance, as evaluated by lift, favors Trees.
— 43 —
Very quick: Trees vs. Gradient Boosting.
Task: Classify into ‘5’ segments. Tools: Trees and Gradient Boosting (different versions). Compare by classification, precision and F1 rates.
Model descriptions

Obs   _MODEL_                          STUDY NUMBER
1     tree_equal_prob_CV_10            1
2     tree_origl_prob_CV_10            2
3     tree_origl_CV_10_5_split         3
4     tree_origl_CV_10_2_split_dec     4
5     Boost simple                     5
6     Boost equal Probs                6
7     Custs_Boost dec matrix           7
8     Boost_equal_2nd_stage            8
9     Boost_orig_2nd_stage             9

Models 1 through 4 are thus TREES, and 5 through 9 BOOSTING.
— 44 —
— 45 —
— 46 —
— 47 —
Quick summary conclusions for multi-classification.
1) All boosting methods perform well; only one of the tree models competes with them, and only in one case.
2) The performance of the different boosting methods was similar, so not much model specification search is required.
3) Boosting methods are very difficult to interpret.
4. Concluding Remarks
— 49 —
Different algorithms
1) Non-greedy algorithms and two- or three-step-ahead search.
2) Hybrid models, which combine regression and tree methods (not very popular after the 2000s).
3) Boosting or majority-voting methods, which generate a sequence of trees and classifications, with the outcome decided democratically.
4) Binned trees, in which split searches are conducted after discretizing all variables, thus allowing searches for possibly non-linear effects. ...
— 50 —
Avoid over-fitting / over-torturing the data... because...
Instead, in Sherlock Holmes' words: "I never guess. It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts." (A Scandal in Bohemia).
We should not act as Mark Twain says: "Get your facts first, and then you can distort them as much as you please."
— 51 —
Let us not be in haste …
Method comparison by way of two examples does not imply general method superiority. There are many examples in the literature in which logistic regression performed better.
— 52 —
The End