From Trees to Forests and Rule Sets: A Unified Overview of Ensemble Methods
13th Intl. Conf. on Knowledge Discovery and Data Mining (KDD 2007)
Giovanni Seni
PDF Solutions, Inc., Santa Clara University
John Elder
Elder Research, Inc.
© 2007 Seni & Elder, KDD 2007
Tutorial Goals
• Convey the essentials of an immensely useful modeling methodology
• Explain two recent developments
– Importance Sampling
– Rule Ensembles
• Provide the foundation for further study/research
• Reference resources in R
• Show real examples
Overview
• In a Nutshell, Examples & Timeline
• Predictive Learning
• Decision Trees
  – Regression tree induction
  – Desirable data mining properties
  – Limitations
• Model Selection
  – Bias−Variance Tradeoff
  – Cost-complexity pruning
  – Cross-Validation
  – Regularization via shrinkage (LASSO)
Overview (2)
• Ensemble Methods
  – Ensemble Learning & Importance Sampling (ISLE)
  – Generic Ensemble Generation
  – Bagging, Random Forest, AdaBoost, MART
• Rule Ensembles
• Interpretation
• Example from Industry
Ensemble Methods in a Nutshell
• "Algorithmic" statistical procedure
• Based on combining the fitted values from a number of fitting attempts
• Loosely related to:
  – Iterative procedures
  – Bootstrap procedures
• Original idea: a "weak" procedure can be strengthened if it can operate "by committee"
  – e.g., combining low-bias / high-variance procedures
Ensemble Methods In a Nutshell (2)
• Shown to perform extremely well in a variety of scenarios
• Shown to have desirable statistical properties
• Accompanied by interpretation methodology
Examples: Data Mining Products
[Screenshots: ensemble options in commercial data mining products, e.g., Model 1 and Predictive Dynamix.]
Examples: Algorithm Response Surfaces
[Response surfaces for five algorithms: Nearest Neighbor, Decision Tree, Delaunay Triangles (or Hinging Hyperplanes), Neural Network (or Polynomial Network), Kernel.]
Examples: Relative Performance
[Bar chart: error relative to peer techniques (lower is better) on six problems – Diabetes, Gaussian, Hypothyroid, German Credit, Waveform, Investment – for Neural Network, Logistic Regression, Linear Vector Quantization, Projection Pursuit Regression, and Decision Tree.]
Examples: "Bundling" Improves Performance
[Bar chart: on the same six problems (Diabetes, Gaussian, Hypothyroid, German Credit, Waveform, Investment), bundled estimators – Advisor Perceptron, AP weighted average, Vote, Average – show error relative to peer techniques (lower is better).]
Examples: Credit Scoring Performance
[Plot: number of defaulters missed (fewer is better) vs. number of models combined (averaging output rank, 0–5). Base models: Stepwise regression (S), Polynomial network (P), Neural network (N), MARS (M), bundled Trees (T); points mark combinations such as NT, MP, SMT, SPNT, and SMPNT.]
Timeline
• CART (Breiman, Friedman, Olshen, Stone, 1984)
• Bagging (Breiman, 1996)
  – Random Forest (Ho, 1995; Breiman, 2001)
• AdaBoost (Freund, Schapire, 1997)
• Boosting – a statistical view (Friedman, Hastie, Tibshirani, 2000)
  – Stochastic Gradient Boosting (Friedman, 1999)
  – Gradient Boosting (Friedman, 2001)
• Importance Sampling Learning Ensembles (ISLE) (Friedman, Popescu, 2003)
Timeline (2)
• Lasso (Tibshirani, 1996)
  – Garrote (Breiman, 1995)
  – LARS (Efron et al., 2004)
  – Gradient directed regularization for linear regression (Friedman, 2004)
  – Boosted Lasso (Zhao, 2005)
  – Elastic Net (Zou, Hastie, 2005)
• Rule Ensembles (Friedman, Popescu, 2005)
  – Stochastic Discrimination (Kleinberg, 2000)
Overview
• In a Nutshell & Timeline
Predictive Learning
• Decision Trees
  – Regression tree induction
  – Desirable data mining properties
• Model Selection
  – Bias−Variance Tradeoff
  – Cost-complexity pruning
  – Cross-Validation
  – Regularization via shrinkage (LASSO)
Predictive Learning: Example
• A simple data set:

    TI    PE    Response
    1.0   M2    good
    2.0   M1    bad
    4.5   M5    ?
    …     …     …

• Modeling goals:
  – Descriptive: summarize existing data in an understandable and actionable way
  – Predictive: what is the "response" (e.g., class) of a new point?
[Scatter plot of the data in the (TI, PE) plane; TI on the horizontal axis (marks at 2 and 5), PE categories M1–M9 on the vertical axis.]
Predictive Learning: Example (2)
• Ordinary Linear Regression (OLR)
• Model: F̂(x) = a_0 + Σ_{j=1}^{n} a_j x_j, with the class prediction obtained by thresholding, e.g., predict one class if F̂(x) ≥ 0, else the other
⇒ Not flexible enough
[Scatter plot: a single linear decision boundary in the (TI, PE) plane cannot separate the two classes.]
Predictive Learning: Example (3)
• Decision Trees
• Descriptive model:
  IF TI ∈ [2, 5] AND PE ∈ {M1, M2, M3} THEN bad, ELSE good
[Decision tree using splits on TI (at 2 and 5) and on PE ∈ {M1, M2, M3}, shown next to the (TI, PE) scatter plot.]
Predictive Learning: Procedure Summary
• Given "training" data D = {y_i, x_{i1}, x_{i2}, …, x_{in}}_1^N = {y_i, x_i}_1^N
  – y_i, x_{ij} are measured values of attributes (properties, characteristics) of an object
  – y_i is the "response" (or output) variable
  – x_{ij} are the "predictor" (or input) variables
  – D is a random sample from some unknown (joint) distribution
• Build a functional model ŷ = F̂(x_1, x_2, …, x_n) = F̂(x)
  – Offers an adequate and interpretable description of how the inputs affect the outputs
Predictive Learning: Procedure Summary (2)
• Model: underlying functional form sought from data
  F̂(x) = F̂(x; a) ∈ ℱ, a family of functions indexed by parameters a
• Score criterion: judges (lack of) quality of the fitted model
  R(a) = (1/N) Σ_{i=1}^{N} L( y_i, F̂(x_i; a) )
  e.g., OLR: (1/N) Σ_{i=1}^{N} ( y_i − a_0 − Σ_{j=1}^{n} a_j x_{ij} )²
• Search strategy: minimization procedure for the criterion
  â = argmin_a R(a)
  – OLR: direct matrix algebra
  – Trees, NN, Clustering: heuristic iterative algorithm
Overview
• In a Nutshell & Timeline
• Predictive Learning
Decision Trees
  – Regression tree induction
  – Desirable data mining properties
  – Limitations
• Model Selection
  – Bias−Variance Tradeoff
  – Cost-complexity pruning
  – Cross-Validation
  – Regularization via shrinkage (LASSO)
Decision Trees: Induction Overview
• Model: ŷ = T(x) = Σ_{m=1}^{M} ĉ_m I(x ∈ R_m)
  – R_m: (hyper)rectangles in input space induced by tree cuts
  – ĉ_m: estimated response within each rectangle is constant
  – I(·): indicator function; 1 if its argument is true
[Perspective plot of a piecewise-constant surface over regions R_1, …, R_5 in the (x_1, x_2) plane. Adapted from ESL 9.2]
Decision Trees: Induction Overview (2)
• Score criterion:
  – Regression "loss"
    • Squared error: L(y, ŷ) = (y − ŷ)²
    • Absolute error: L(y, ŷ) = |y − ŷ|
  – Prediction "risk"
    • Average loss over all predictions: R(T) = E_{y,x} L( y, T(x) )
• Goal: find the tree T with lowest prediction risk
Decision Trees: Induction Overview (3)
• Search
  – Goal (least squares):
    {ĉ_m, R̂_m}_1^M = argmin_{ {c_m, R_m}_1^M } Σ_{i=1}^{N} [ y_i − Σ_{m=1}^{M} c_m I(x_i ∈ R_m) ]²
  – Unrestricted optimization with respect to {R_m}_1^M is difficult!
  – Region restrictions:
    • Disjoint
    • Cover the space
    • "Simple" (axis-aligned rectangles)
  – Fitted values: ŷ_i = T(x_i)
[Illustration: arbitrary overlapping regions in (x_1, x_2) ✗ vs. a recursive axis-aligned partition into R_1–R_5 ✓]
Decision Trees: Growing Algorithm
• Greedy iterative procedure
  – Start with a single region, i.e., all given data
  – At the m-th iteration:

    for each region R
      for each attribute x_j in R
        for each possible split s_j of x_j
          record change in score when we partition R into R_l and R_r
    choose the (x_j, s_j) giving maximum improvement to the fit
    replace R with R_l; add R_r

  – i.e., a forward stagewise additive procedure
  – When should we stop?
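The split search above can be sketched in a few lines. This is an illustrative sketch (not the tutorial's code) for a single numeric attribute, scoring each candidate split by the drop in squared error when a region is partitioned into R_l and R_r; the function names are my own.

```python
# Greedy split search for one numeric attribute in a regression tree:
# score = reduction in sum of squared errors around the regions' means.

def sse(ys):
    """Sum of squared errors around the region's constant fit (the mean)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(xs, ys):
    """Return (split_point, score_improvement) for one attribute."""
    base = sse(ys)
    pairs = sorted(zip(xs, ys))
    best = (None, 0.0)
    for k in range(1, len(pairs)):
        s = (pairs[k - 1][0] + pairs[k][0]) / 2.0  # midpoint candidate
        left = [y for x, y in pairs if x <= s]
        right = [y for x, y in pairs if x > s]
        gain = base - sse(left) - sse(right)
        if gain > best[1]:
            best = (s, gain)
    return best

# A step in x: the best split should fall between x=2 and x=3.
split, gain = best_split([1, 2, 3, 4], [0.0, 0.0, 1.0, 1.0])
```

The full growing algorithm would run this over every attribute in every current region and take the overall winner.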
Decision Trees: Desirable Data Mining Properties
• Ability to deal with irrelevant inputs
  – i.e., automatic variable subset selection
  – Measure anything you can measure
  – Score provided for selected variables ("importance")
• No data preprocessing needed
  – Naturally handles all types of variables
    • numeric, binary, categorical
  – Invariant under monotone transformations x_j ← g_j(x_j):
    • Variable scales are irrelevant
    • Immune to bad x_j distributions (e.g., outliers)
Decision Trees: Desirable Data Mining Properties (2)
• Computational scalability
  – Relatively fast: O(nN log N)
• Missing value tolerant
  – Moderate loss of accuracy due to missing values
  – Handled via "surrogate" splits
• "Off-the-shelf" procedure
  – Few tunable parameters
• Interpretable model representation
  – Binary tree graphic
Decision Trees: Limitations
• Discontinuous piecewise-constant model
  – Many splits require a lot of data
  – In high dimensions, you often run out of data after a few splits
• Data fragmentation
  – Each split reduces the training data available for subsequent splits
[Plot: a step function F(x) approximating a smooth target.]
Decision Trees: Limitations (2)
• Not good for low-interaction targets F*(x)
  – e.g., an additive, no-interaction target is the worst case for trees:
    F*(x) = a_0 + Σ_{j=1}^{n} a_j x_j = Σ_{j=1}^{n} f_j*(x_j)
  – In order for a variable x_l to enter the model, the tree must split on it
    • Path from root to node is a product of indicators
• Not good for targets F*(x) that depend on many variables
Decision Trees: Limitations (3)
• High variance caused by greedy search strategy (local optima)
  – i.e., small changes in the data (sampling fluctuations) can cause big changes in the tree
  – Errors in upper splits propagate down to all splits below
  – Very deep trees might be questionable
  – Pruning is important
• What to do next?
  – Use other methods when possible
  – Fix up trees: use "ensembles"
    • Maintain advantages while dramatically increasing accuracy
Overview
• In a Nutshell & Timeline
• Predictive Learning
• Decision Trees
  – Regression tree induction
  – Desirable data mining properties
  – Limitations
Model Selection
  – Bias−Variance Tradeoff
  – Cost-complexity pruning
  – Cross-Validation
  – Regularization via shrinkage (LASSO)
Model Selection: What is the "right" size of a tree?
• Dilemma
  – If the tree (# of regions) is too small, the piecewise-constant approximation is too crude (bias) ⇒ increased errors
  – If the tree is too large, it fits the training data too closely (overfitting, increased variance) ⇒ increased errors
[Three scatter plots of the same noisy data against the target function: a one-region fit (too crude), a two-region fit (c_1, c_2), and a three-region fit (c_1, c_2, c_3) that begins to chase the noise.]
Model Selection: Bias−Variance Decomposition
• Suppose Y = f(X) + ε, where ε ~ N(0, σ²)
• Consider the "idealized" aggregate estimator f̄(X) = E_D f̂(X)
  – i.e., the fit f̂(X) averaged over all possible data sets D
• Prediction error of the fit at a point X = x_0, using squared-error loss, is decomposable:
  Err(x_0) = E[ (Y − f̂(x_0))² | X = x_0 ]
           = σ² + E[ f̂(x_0) − f(x_0) ]²
           = σ² + [ E f̂(x_0) − f(x_0) ]² + E[ f̂(x_0) − E f̂(x_0) ]²
           = σ² + Bias²( f̂(x_0) ) + Var( f̂(x_0) )
  – Typically, when bias is low, variance is high, and vice versa
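The decomposition can be checked numerically. This is a hypothetical sketch that treats two small finite sets as the population: fhat_vals are equally likely fits f̂(x_0) across data sets, eps_vals are equally likely zero-mean noise values, and f0 = f(x_0); all names and numbers are mine.

```python
# Verify E[(Y - f̂(x0))^2] = σ² + Bias² + Var over a tiny finite "population".

def decompose(f0, fhat_vals, eps_vals):
    n, m = len(fhat_vals), len(eps_vals)
    # Expected squared error over the product distribution of (fit, noise)
    err = sum((f0 + e - fh) ** 2 for fh in fhat_vals for e in eps_vals) / (n * m)
    fbar = sum(fhat_vals) / n                              # E f̂(x0)
    bias2 = (fbar - f0) ** 2                               # squared bias
    var = sum((fh - fbar) ** 2 for fh in fhat_vals) / n    # variance of the fit
    sigma2 = sum(e ** 2 for e in eps_vals) / m             # noise variance
    return err, sigma2 + bias2 + var

# Unbiased but variable fit: bias 0, variance 1, noise variance 1.
err, decomposed = decompose(f0=3.0, fhat_vals=[2.0, 4.0], eps_vals=[-1.0, 1.0])
```

Here both sides equal 2.0: the three terms account for the whole error exactly.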
Model Selection: Bias−Variance Schematic
[Schematic (adapted from ESL 7.2): truth f(x); realization Y = f(X) + ε; fits f̂(x) scattered within model space ℱ around their average f̄(X). The spread of the fits is the variance; the distance from f̄(X) to f(x) is the model bias.]
Model Selection: Bias−Variance Tradeoff
– Right-sized tree, M*, is where the test error is at a minimum
– Overfitting is exacerbated when the model (fitting procedure) is highly flexible
[Plot (adapted from ESL 7.1): prediction error vs. model complexity (e.g., tree size). Training-sample error decreases monotonically; test-sample error is U-shaped, with high bias / low variance at the left and low bias / high variance at the right; M* marks the minimum.]
Model Selection: Tree Pruning
• Augmented tree score criterion: R̂_α(T) = R̂(T) + α·|T|
  – R̂(T): "risk" of tree T – how well it fits the data; analogous to the residual sum of squares
  – |T|: size of tree – number of terminal nodes; analogous to the model degrees of freedom
  – α: complexity parameter – cost of adding another terminal node to the model
⇒ The complexity term penalizes the increased variance associated with a more complex model
• Goal: find T*_α = argmin_T R̂_α(T)
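The selection rule R̂_α(T) = R̂(T) + α·|T| can be sketched directly. In this illustrative sketch the nested subtree sequence is represented as (risk, number of terminal nodes) pairs; the numbers are made up.

```python
# Cost-complexity pruning: pick the subtree minimizing R̂(T) + α·|T|.

def best_subtree(subtrees, alpha):
    """subtrees: list of (risk, n_terminal_nodes); return the minimizer."""
    return min(subtrees, key=lambda t: t[0] + alpha * t[1])

# Nested sequence: bigger trees fit better (lower risk) but cost more nodes.
seq = [(10.0, 1), (4.0, 3), (2.5, 6), (2.4, 12)]
largest = best_subtree(seq, 0.0)    # alpha = 0 keeps the largest tree
pruned = best_subtree(seq, 1.0)     # moderate alpha trades fit for size
stump = best_subtree(seq, 100.0)    # huge alpha collapses to the root
```

Sweeping α from 0 to ∞ walks through the nested sequence of subtrees described on the next slide.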
Model Selection: Tree Pruning (2)
• Cost-complexity pruning (cont.)
  – α is a "meta" parameter that controls the degree of stabilization
    • Larger values provide increased regularization ⇒ more stable estimates
    • α = 0: largest tree (T_max) ⇒ least stable estimates
    • α = ∞: stump (root only) ⇒ deterministic
  – Varying α produces a nested (finite) sequence of subtrees:
    T_max = T_1 ⊃ T_2 ⊃ … ⊃ root, for α = 0 < α_1 < α_2 < α_3 < …
  – Choose the value α* (and thus the subtree T_{α*}) to minimize future risk R̂(T_α)
Model Selection: Cross−Validation
• Error on the training data is not a useful estimator
  – If a test set is not available, we need an alternative method
• Combine measures of fit to obtain an "honest" estimate of model "quality"
  – Each measure is constructed from a random, non-overlapping subset of the data
  – e.g., 3-fold CV:
    T^(1): train on D_2, D_3; test on D_1
    …
    T^(3): train on D_1, D_2; test on D_3
Model Selection: Cross−Validation (2)
– T^(v): tree built on D − D_v
– υ: {1, …, N} → {1, …, V}: indexing function mapping each observation to its fold
– Cross-validation estimate of prediction error:
  R̂_CV = (1/N) Σ_{i=1}^{N} L( y_i, T^(υ(i))(x_i) )
  i.e., average of the fit measures over the V splits
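The estimate R̂_CV can be sketched with the simplest possible "model". In this illustrative sketch the model trained on D − D_v is just the mean of the training responses, and υ(i) assigns observations to folds round-robin; the names are mine.

```python
# V-fold CV risk estimate: each point is scored by the model that never saw it.

def cv_risk(ys, V):
    N = len(ys)
    fold = [i % V for i in range(N)]          # indexing function υ(i)
    total = 0.0
    for v in range(V):
        train = [y for i, y in enumerate(ys) if fold[i] != v]
        fit = sum(train) / len(train)         # "model" built on D − D_v
        total += sum((y - fit) ** 2 for i, y in enumerate(ys) if fold[i] == v)
    return total / N                          # average held-out loss
```

With a constant response the CV risk is 0; with an alternating response the held-out fits are maximally wrong, so the estimate is honest about that.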
Model Selection: Cross−Validation (3)
• The set of models {T^(v)_α(x)} is indexed by the tuning parameter α (the nested subtree sequence T_max = T_1 ⊃ T_2 ⊃ … ⊃ root), and we have
  R̂_CV(α) = (1/N) Σ_{i=1}^{N} L( y_i, T^(υ(i))_α(x_i) )
• Idea generalization: can "averaging" over a set of trees help correct for overly optimistic trees?
  – i.e., combine fitted values instead of measures of fit ⇒ Ensembles
Model Selection: Regularization via Shrinkage (Lasso − Tibshirani, 1996)
• Consider the linear model F(x) = a_0 + Σ_{j=1}^{n} a_j x_j
• Standard LR coefficient estimation:
  {â_j} = argmin_{ {a_j} } Σ_{i=1}^{N} L( y_i, Σ_j a_j x_{ij} )
• Shrinkage: a continuous variable selection method
  {â_j^Lasso} = argmin_{ {a_j} } Σ_{i=1}^{N} L( y_i, Σ_j a_j x_{ij} ) + λ Σ_{j=1}^{n} |a_j|
  – The "penalty" term provides a (deterministic) value that is independent of the particular random sample
    ⇒ stabilizing influence on the criterion being minimized
  – Promotes reduced variance of the estimated coefficient values
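A minimal sketch of what the L1 penalty does: under the simplifying assumption of orthonormal predictors, the lasso solution is the least-squares coefficient vector with each entry soft-thresholded by λ, which is how shrinkage produces exact zeros. The function names and numbers are mine; thresholding constants depend on the loss-scaling convention.

```python
# Lasso under an orthonormal design: soft-threshold each LS coefficient.

def soft_threshold(a, lam):
    """Shrink a toward zero by lam; set it to zero inside [-lam, lam]."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0

def lasso_orthonormal(a_ls, lam):
    return [soft_threshold(a, lam) for a in a_ls]

# Large coefficients survive (shrunk); the small one is zeroed out.
coefs = lasso_orthonormal([2.5, -0.3, 1.0], lam=0.5)
```

This is the continuous analogue of subset selection: increasing λ drops predictors one by one instead of all-or-nothing.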
Model Selection: Regularization via Shrinkage (2)
• Lasso solutions are nonlinear in y; a quadratic programming algorithm is used to compute them
• The L1 penalty leads to sparse solutions, with many predictors having â_j = 0
  – Often improves prediction accuracy
  – Easier to interpret
  – "Bet on sparsity" principle
• λ is a "tuning" parameter
  – λ = 0: no regularization … least stable estimates
  – λ > 0: continuous coefficient "paths"
⇒ Typically estimated via cross-validation
Model Selection: Regularization via Shrinkage (3)
• Lasso 2D estimation illustration
  – L(·): least squares, i.e.,
    {â_0, â_1, â_2} = argmin Σ_{i=1}^{N} ( y_i − a_0 − a_1 x_{i1} − a_2 x_{i2} )² + λ( |a_1| + |a_2| )
  – [Plot: error-function contours around â^LS and the diamond-shaped constraint region in the (a_1, a_2) plane; â^Lasso sits where a contour first touches the region.]
  – If the first point where the contours hit the constraint region occurs at a corner, then one parameter a_j is zero
Model Selection: Regularization via Shrinkage (4)
• Forward stagewise linear regression (Hastie et al., 2001)
  – Iterative strategy producing solutions that closely approximate the effect of the lasso
  – See also LARS (Efron et al., 2004) and gradient directed regularization for linear regression (Friedman, 2005)

    Initialize a_j = 0 for all j; ε > 0 small; M large
    For m = 1 to M {
      // Select the predictor that best fits the current residuals
      (j*, α*) = argmin_{j, α} Σ_{i=1}^{N} [ y_i − Σ_{l=1}^{n} a_l x_{il} − α x_{ij} ]²
      // Increment/decrement by an infinitesimal amount
      a_{j*} ← a_{j*} + ε · sign(α*)
    }
    Write F̂(x) = Σ_{j=1}^{n} a_j x_j

  • After M < ∞ iterations, many â_j will be 0
  • In general, |â_j(M)| < |â_j^LS|
  • M ≈ 1/λ
Model Selection: Regularization via Shrinkage (5)
• Lasso coefficient "paths" example: coefficients plotted against s ≈ 1/λ
  – See R package "lars"
Model Selection: Regularization Summary
• What is regularization?
  – "any part of model building which takes into account – implicitly or explicitly – the finiteness and imperfection of the data and the limited information in it, which we can term 'variance' in an abstract sense" [Rosset, 2003]
• Forms of regularization
  – Explicit constraints on model complexity
    • Constrained models can equivalently be cast as penalized ones
  – Implicit, through incremental building of the model
  – Choice of robust loss functions
Model Selection: Regularization Summary (2)
[Schematic (adapted from ESL 7.2): as in the bias−variance schematic – truth f(x), realization Y = f(X) + ε, fits f̂(x) in model space ℱ – but with a restricted model space added. The regularized fit f̂^Lasso(X) has slightly larger model bias yet much smaller variance ⇒ smaller prediction error.]
Overview
Ensemble Methods
  – Ensemble Learning & Importance Sampling (ISLE)
  – Generic Ensemble Generation
  – Bagging, Random Forest, AdaBoost, MART
• Rule Ensembles
• Interpretation
Ensemble Methods: Learning
• Model: F(x) = c_0 + Σ_{m=1}^{M} c_m T_m(x)
  – {T_m(x)}_1^M: "basis" functions (or "base learners")
  – i.e., a linear model in a (very) high-dimensional space of derived variables
• Learner characterization: T_m(x) = T(x; p_m)
  – p_m: a specific set of joint parameter values – e.g., split definitions at internal nodes and predictions at terminal nodes
  – {T(x; p)}_{p ∈ P}: function class – i.e., the set of all base learners of the specified family
Ensemble Methods: Learning (2)
• Two-step process – approximate solution to
  {ĉ_m, p̂_m}_0^M = argmin_{ {c_m, p_m} } Σ_{i=1}^{N} L( y_i, c_0 + Σ_{m=1}^{M} c_m T(x_i; p_m) )
  – Step 1: Choose points {p_m}_1^M
    • i.e., select {T_m(x)}_1^M ⊂ {T(x; p)}_{p ∈ P}
  – Step 2: Determine weights {c_m}_0^M
    • e.g., via regularized linear regression
Ensemble Methods: Example
• 2 features, 2 classes
⇒ Another example in Appendix 1
Ensemble Methods: Importance Sampling (Friedman, 2003)
• How to judiciously choose the "basis" functions (i.e., {p_m}_1^M)?
• Goal: find "good" {p_m}_1^M so that F(x; {a_m}_1^M, {p_m}_1^M) ≅ F*(x)
• Connection with numerical integration:
  I = ∫ I(p) dp ≈ Σ_{m=1}^{M} w(p_m) I(p_m)
  vs.
  F(x) = ∫ a(p) T(x; p) dp ≅ Σ_{m=1}^{M} c_m T(x; p_m)
  – Accuracy improves when we choose more points p_m from the region that contributes most to the integral…
Ensemble Methods: Importance Sampling (2)
• Parameter importance measure r(p):
  – Indicative of the relevance of the point p
  – Naive approach: draw the p_m iid – i.e., uniform sampling
  – Better idea: r(p) inversely related to T(x; p)'s "risk"
    • i.e., p_m has high error ⇒ lack of relevance of p_m
  – Single-point vs. group importance
    • i.e., with/without knowledge of the other points that will be used
    • Sequential approximation: p's relevance judged in the context of the (fixed) previously selected points
  – Characterization of r(p): "center" and "width"
Ensemble Methods: Importance Sampling (3)
• Let p* = argmin_p Risk(p)
• Narrow r(p):
  – Ensemble of "strong" base learners {T(x; p_m)}_1^M – i.e., all with Risk(p_m) ≈ Risk(p*)
  – The T(x; p_m)'s yield similar, highly correlated predictions ⇒ unexceptional performance
• Broad r(p):
  – Diverse ensemble – i.e., predictions are not highly correlated with each other
  – However, many "weak" base learners – i.e., Risk(p_m) >> Risk(p*) ⇒ poor performance
⇒ The empirically observed strength−correlation tradeoff can be understood in terms of the width of r(p)
Ensemble Methods: Importance Sampling (4)
• Heuristic sampling strategy r̂(p): sample around p* by iteratively applying small perturbations to the existing problem structure
  – Generating ensemble members T_m(x) = T(x; p_m):
    For m = 1 to M {
      p_m = PERTURB_m { argmin_p E_{xy} L( y, T(x; p) ) }
    }
  – PERTURB{·} is a (random) modification of any of:
    • Data distribution – e.g., by re-weighting the observations
    • Loss function – e.g., by modifying its argument
    • Search algorithm (used to find the minimizing p)
  – The width of r̂(p) is controlled by the degree of perturbation
Ensemble Methods: Generic Ensemble Generation (Friedman, 2003)
• Step 1: Choose {p_m}_1^M via a forward stagewise fitting procedure:

    F_0(x) = 0
    For m = 1 to M {
      p_m = argmin_p Σ_{i ∈ S_m(η)} L( y_i, F_{m−1}(x_i) + T(x_i; p) )
      T_m(x) = T(x; p_m)
      F_m(x) = F_{m−1}(x) + υ·T_m(x)
    }
    Write {T_m(x)}_1^M

  – Algorithm controls: L, η, υ
    • S_m(η): random sub-sample of size η ≤ N ⇒ impacts ensemble "diversity" (a modification of the data distribution)
    • υ: "memory" function, 0 ≤ υ ≤ 1, through F_{m−1}(x) = υ · Σ_{k=1}^{m−1} T_k(x) (a modification of the loss function; the "sequential" approximation)
Ensemble Methods: Generic Ensemble Generation (2)
• Step 2: Choose quadrature coefficients {c_m}_0^M:
  {ĉ_m} = argmin_{ {c_m} } Σ_{i=1}^{N} L( y_i, c_0 + Σ_{m=1}^{M} c_m T_m(x_i) ) + λ Σ_{m=1}^{M} |c_m|
  – Given {T_m(x)}_1^M = {T(x; p_m)}_1^M, the coefficients are obtained by a regularized linear regression
  – Regularization here helps reduce bias (in addition to variance) of the model
  – New fast iterative algorithms exist for various loss functions
    • "Gradient Directed Regularization…", Friedman & Popescu, 2004
Ensemble Methods: Bagging (Breiman, 1996)
• Bagging = Bootstrap Aggregation
• As an ISLE, the generic algorithm with:
  – L(y, ŷ) = (y − ŷ)²
  – η = N/2
  – υ = 0 ⇒ no memory
  – T_m(x): large un-pruned trees
  – c_0 = 0, c_m = 1/M ⇒ simple average, i.e., coefficients not fit to the data
• i.e., perturbation of the data distribution only
• Parallelizable
• See R command ipred:ipredbagg
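The bagging recipe can be sketched end to end with a deliberately trivial base learner. This is an illustrative sketch (names and data are mine): bootstrap the training set, fit each replicate, and average the fitted values with equal weights c_m = 1/M; here the "base learner" is just the sample mean, standing in for an un-pruned tree.

```python
import random

# Bagging sketch: bootstrap, fit, average (c_m = 1/M).

def bagged_fit(ys, M, seed=0):
    rng = random.Random(seed)                  # fixed seed => reproducible
    fits = []
    for _ in range(M):
        boot = [rng.choice(ys) for _ in ys]    # bootstrap sample of size N
        fits.append(sum(boot) / len(boot))     # base learner: constant fit
    return sum(fits) / M                       # simple average of the fits

pred = bagged_fit([1.0, 2.0, 3.0, 4.0], M=200)
```

Each bootstrap fit varies; the average is far more stable than any single one, which is the variance reduction the slides describe.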
Ensemble Methods: Bagging – Example
• Simulated data (source: ESL 8.7.1)
  – Predictor vars: n = 5, x ~ N_5(0, Σ) with pairwise correlation 0.95
  – Response var: two-class, generated according to
    P(y = 1 | x_1 ≤ 0.5) = 0.2,  P(y = 1 | x_1 > 0.5) = 0.8  ⇒ Bayes error rate: 0.2
  – Training: N = 30; 200 bootstrap samples
  – Testing: N = 2000
Ensemble Methods: Bagging – Example (2)
• Bootstrap trees
[Figure: trees fit to successive bootstrap samples (Tree 1, Tree 2, …) with 0/1 leaf labels; the trees differ noticeably from sample to sample.]
Ensemble Methods: Bagging – Example (3)
• Test error
  – Trees have high variance on these data due to the correlation in the predictors
  – Bagging succeeds in smoothing out this variance and hence reducing test error
Ensemble Methods: Bagging – Why it helps?
• Under L(y, ŷ) = (y − ŷ)², averaging reduces variance and leaves bias unchanged
• Consider the "idealized" bagging (aggregate) estimator f̄(x) = E_Z f̂_Z(x)
  – f̂_Z: fit to bootstrap data set Z = {y_i, x_i}_1^N
  – Z is sampled from the actual population distribution (not the training data)
  – We can write:
    E[ (Y − f̂_Z(x))² ] = E[ (Y − f̄(x) + f̄(x) − f̂_Z(x))² ]
                        = E[ (Y − f̄(x))² ] + E[ (f̂_Z(x) − f̄(x))² ]
                        ≥ E[ (Y − f̄(x))² ]
⇒ true population aggregation never increases mean squared error!
⇒ Bagging will often decrease MSE…
Ensemble Methods: Random Forest (Breiman, 2001)
• Bagging + a randomized tree-growing algorithm
  – Subset splitting: as each tree is constructed…
    • Draw a random sample of n_s predictors before each node is split, e.g., n_s = ⌊log₂(n)⌋ + 1
    • Find the best split as usual, but selecting only from this subset of predictors
  ⇒ Increased diversity among {T_m(x)}_1^M – i.e., wider r̂(p)
    • Width (inversely) controlled by n_s
  – Speed improvement over Bagging
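The per-node randomization is tiny to sketch. This illustrative helper (names are mine) draws the n_s = ⌊log₂(n)⌋ + 1 candidate predictors that a single split would then be searched over.

```python
import math
import random

# Random Forest's subset splitting: sample n_s of the n predictors per node.

def split_candidates(n, seed=0):
    n_s = int(math.floor(math.log2(n))) + 1    # default subset size
    rng = random.Random(seed)
    return sorted(rng.sample(range(n), n_s))   # distinct predictor indices

cands = split_candidates(10)   # with n = 10 predictors, n_s = 4
```

Smaller n_s means less overlap between the candidate sets at different nodes and trees, hence wider r̂(p) and a more diverse ensemble.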
Ensemble Methods: Random Forest (2)
• Potential improvements:
  – Different data sampling strategy (not fixed)
  – Fit quadrature coefficients to the data
• See R command randomForest
[Bar chart (simulation study with 100 different target functions; Popescu, 2005): comparative RMS error for Bag, RF, Bag_6_5%_P, and RF_6_5%_P. xxx_6_5%_P: 6-terminal-node trees, 5% samples without replacement, post-processing – i.e., using estimated "optimal" quadrature coefficients. The xxx_6_5%_P variants are significantly faster to build!]
Ensemble Methods: AdaBoost (Freund & Schapire, 1997)
• AdaBoost = Adaptive Boosting
• As an ISLE, the generic algorithm with:
  – L(y, ŷ) = exp(−y·ŷ), y ∈ {−1, 1}
  – η = N ⇒ data perturbed via observation weights
  – υ = 1 ⇒ sequential sampling
  – T_m(x): any "weak" learner
  – c_0 = 0; the c_m are sequential partial regression coefficients

    F_0(x) = 0
    For m = 1 to M {
      (c_m, p_m) = argmin_{c, p} Σ_{i ∈ S_m(η)} L( y_i, F_{m−1}(x_i) + c·T(x_i; p) )
      T_m(x) = T(x; p_m)
      F_m(x) = F_{m−1}(x) + c_m·T_m(x)
    }
    Write {c_m, T_m(x)}_1^M

  – Prediction: ŷ = sign( F_M(x) ) = sign( Σ_{m=1}^{M} c_m T_m(x) )
  – "Real" AdaBoost: extension to the regression case (Friedman, 2000)
  – See R command boost:adaboost
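One AdaBoost round can be sketched numerically (this previews the weight-update equivalence shown on the next slide). The function and toy data are mine: given per-observation weights and a boolean flag for whether the weak learner got each point right, compute the weighted error, the confidence α_m = log((1 − err)/err), and the reweighting.

```python
import math

# One AdaBoost round: weighted error, confidence, and reweighting.

def adaboost_round(weights, correct):
    err = sum(w for w, c in zip(weights, correct) if not c) / sum(weights)
    alpha = math.log((1 - err) / err)
    # Misclassified points are up-weighted by e^alpha; the rest are unchanged.
    new_w = [w if c else w * math.exp(alpha) for w, c in zip(weights, correct)]
    return err, alpha, new_w

# 4 points, one misclassified, uniform starting weights 1/N.
err, alpha, w = adaboost_round([0.25] * 4, [True, True, True, False])
```

The misclassified point's weight triples (0.25 → 0.75), so the next weak learner is forced to concentrate on it.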
Ensemble Methods: AdaBoost (2)
• Equivalence to the forward stagewise fitting procedure (Appendix 2)
  – The AdaBoost algorithm:

    Initialize observation weights w_i^(0) = 1/N
    For m = 1 to M {
      a. Fit a classifier T_m(x) to the training data with weights w_i^(m)
      b. Compute err_m = Σ_{i=1}^{N} w_i^(m) I( y_i ≠ T_m(x_i) ) / Σ_{i=1}^{N} w_i^(m)
      c. Compute α_m = log( (1 − err_m) / err_m )
      d. Set w_i^(m+1) = w_i^(m) · exp( α_m · I( y_i ≠ T_m(x_i) ) )
    }
    Output sign( Σ_{m=1}^{M} α_m T_m(x) )

  – We need to show that p_m = argmin_p (·) in the generic procedure is equivalent to line a. above
  – That c_m = argmin_c (·) is equivalent to line c.
  – And how the weights w_i^(m) are derived/updated
Ensemble Methods: AdaBoost (3)
• Why exponential loss?
  – Implementation convenience
  – Meaningful population minimizer (Appendix 3):
    F*(x) = argmin_{F(x)} E_{Y|x} e^{−Y·F(x)} = (1/2) log [ Pr(Y = 1 | x) / Pr(Y = −1 | x) ]
    • i.e., half the log-odds of Pr(Y = 1 | x)
    • Equivalently, Pr(Y = 1 | x) = 1 / (1 + e^{−2F*(x)})
  – The same population minimizer is obtained with the (negative) binomial log-likelihood loss l(y, F) = log( 1 + e^{−2yF} )
  • However, exponential loss is less robust in noisy settings
  • No natural generalization to K classes
Ensemble Methods: Gradient Boosting (Friedman, 2001)
• Boosting with any differentiable loss criterion L(y, ŷ)
• General η and υ; as an ISLE:
  – c_0 = argmin_c Σ_{i=1}^{N} L(y_i, c)
  – η = N/2 ⇒ sequential sampling
  – υ = 0.1 ⇒ "shrunk" sequential partial regression coefficients
  – T_m(x): any "weak" learner

    F_0(x) = c_0
    For m = 1 to M {
      (c_m, p_m) = argmin_{c, p} Σ_{i ∈ S_m(η)} L( y_i, F_{m−1}(x_i) + c·T(x_i; p) )
      T_m(x) = T(x; p_m)
      F_m(x) = F_{m−1}(x) + υ·c_m·T_m(x)
    }
    Write {(υ·c_m), T_m(x)}_1^M

  – Prediction: ŷ = F_M(x)
  – See R command gbm (Appendix 4)
Ensemble Methods: Gradient Boosting / MART
• MART = Multiple Additive Regression Trees
  – T_m(x): J-terminal-node regression tree (J_m = J for all m)
    T( x; {b_j, R_j}_1^J ) = Σ_{j=1}^{J} b_j I(x ∈ R_j)
• J is a meta-parameter that controls the interaction order allowed by the approximation
  – Adding T_m(x) to the model is like adding J separate (basis) functions:
    F_m(x) = F_{m−1}(x) + c_m T_m(x) = F_{m−1}(x) + c_m Σ_{j=1}^{J} b_{jm} I(x ∈ R_{jm}) = F_{m−1}(x) + Σ_{j=1}^{J} γ_{jm} I(x ∈ R_{jm})
    with γ_{jm} = c_m·b_{jm}  ⇒ computationally efficient update
  – J = 2 ⇒ "main effects" only
  – J = 3 ⇒ two-variable interactions allowed (i.e., second-order effects)
  – … the value of J should reflect the dominant interaction level of the target function
(Appendix 5)
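The update F_m = F_{m−1} + υ·T_m can be sketched with the crudest possible base learner. In this illustrative sketch (names mine), the "tree" is a single constant fit to the current residuals, which are the negative gradient for squared-error loss; it shows how shrinkage υ makes the fit approach the target gradually.

```python
# Gradient boosting for squared error with shrinkage nu; base learner is
# a 1-node "tree" (the mean of the current residuals).

def boost(ys, nu=0.1, M=100):
    F = [0.0] * len(ys)
    for _ in range(M):
        resid = [y - f for y, f in zip(ys, F)]   # negative gradient for L2 loss
        t = sum(resid) / len(resid)              # constant base learner
        F = [f + nu * t for f in F]              # shrunk update F_m = F_{m-1} + nu*t
    return F

F = boost([4.0, 4.0, 4.0], nu=0.1, M=100)
```

After M rounds the fit is 4·(1 − (1 − υ)^M): close to, but never overshooting, the target, which is the regularizing effect of small υ.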
Ensemble Methods: Gradient Boosting / MART (2)
• GB/MART doesn't choose quadrature coefficients in a separate step
  – i.e., it omits the ISLE recipe:
    {ĉ_m} = argmin_{ {c_m} } Σ_{i=1}^{N} L( y_i, c_0 + Σ_{m=1}^{M} c_m T_m(x_i) ) + λ Σ_{m=1}^{M} |c_m|
• However, boosting is still understood as an "incremental forward stagewise regression" procedure with a Lasso penalty (shrinkage controlled by υ)
  – Note the similarity with the forward stagewise linear regression procedure, with {T_m(x)}_1^M as predictors
Ensemble Methods: Parallel vs. Sequential Ensembles
• Sequential ISLEs tend to perform better than parallel ones
  – Consistent with results observed in classical Monte Carlo integration
[Bar chart (simulation study with 100 different target functions; Popescu, 2005): comparative RMS error for "parallel" ensembles (Bag, RF, Bag_6_5%_P, RF_6_5%_P) vs. "sequential" ones (MART, Seq_0.01_20%_P, Seq_0.1_50%_P). xxx_6_5%_P: 6-terminal-node trees, 5% samples without replacement, post-processing. Seq_η_ν%_P: "sequential" ensemble, 6-terminal-node trees, "memory" factor η, ν% samples without replacement, post-processing.]
Overview
• Ensemble Methods
  – Ensemble Learning & Importance Sampling (ISLE)
  – Generic Ensemble Generation
  – Bagging, Random Forest, AdaBoost, MART
Rule Ensembles
• Interpretation
Ensemble Methods: Rule Ensembles (Friedman & Popescu, 2005)
• Ensemble recap:
  – Model: F(x) = a_0 + Σ_{m=1}^{M} a_m f_m(x)
  – {f_m(x)}_1^M: "basis" functions (or "base learners")
    • Derived predictors capture non-linearities and interactions
  – 2-stage fitting process:
    i. generate basis functions
    ii. post-fit to the data via regularized regression
  – The simple representation offered by a single tree is no longer available
Ensemble Methods: Rule Ensembles (2)
• Trees as collections of conjunctive rules:
  T_m(x) = Σ_{j=1}^{J} ĉ_{jm} I(x ∈ R̂_{jm})
  [Example tree on (x_1, x_2) with cut points at 0, 15, 22, and 27, inducing regions R_1–R_5 and rules such as:
    R_1 ⇒ r_1(x) = I(x_2 > 22) · I(x_1 > 27)
    R_2 ⇒ r_2(x) = I(x_2 > 22) · I(0 ≤ x_1 ≤ 27)
    … ]
  – These simple rules, r_m(x) ∈ {0, 1}, can be used as base learners
  – The main motivation is interpretability
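A rule really is just a product of split indicators along the path to a region. This illustrative sketch uses the two cut points of R_1 from the example above (the function name is mine):

```python
# A conjunctive rule as a product of indicator functions: r(x) ∈ {0, 1}.

def r1(x1, x2):
    return int(x2 > 22) * int(x1 > 27)   # both conditions must hold

inside = r1(30, 25)    # satisfies both splits -> rule fires
outside = r1(30, 20)   # fails the x2 > 22 split -> rule is off
```

In the rule ensemble, each such 0/1 function becomes one derived predictor in a linear model.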
Ensemble Methods: Rule Ensembles (3)
• Rule-based model: F(x) = a_0 + Σ_m a_m r_m(x)
  – Still a piecewise-constant model
    • Linear targets can still be problematic…
  – We can complement the non-linear rules with purely linear terms:
    F(x) = a_0 + Σ_m a_m r_m(x) + Σ_j b_j x_j
    • Original continuous variables x_j can be replaced by their "winsorized" versions
Ensemble Methods: Rule Ensembles (4)
• Rule generation
  – Use some approximate optimization approach to solve
    p_m = argmin_p Σ_{i ∈ S_m(η)} L( y_i, F_{m−1}(x_i) + r(x_i; p) )
    where p_m are the split definitions for rule r_m(x)
  – Take advantage of a decision tree ensemble
    • e.g., one rule for each terminal node in each tree
    • In the case of shallow trees (boosting), regions corresponding to non-terminal nodes can also be included
  – Each J-terminal-node tree T_m(x) generates 2 × (J − 1) rules
Ensemble Methods: Rule Ensembles (5)
• Rule fitting
  – Linear regularized procedure:
    ({â_k}, {b̂_j}) = argmin_{ {a_k}, {b_j} } Σ_{i=1}^{N} L( y_i, a_0 + Σ_{k=1}^{K} a_k r_k(x_i) + Σ_{j=1}^{p} b_j x_{ij} ) + λ ( Σ_{k=1}^{K} |a_k| + Σ_{j=1}^{p} |b_j| )
    • K = Σ_{m=1}^{M} 2 × (J_m − 1): total number of rules
    • p ≤ n: total number of linear terms
  – Tree size controls rule "complexity"
    • A J-terminal-node tree can generate rules involving up to (J − 1) factors
    • Modeling J-th order interactions requires rules with J or more factors
Overview
• Ensemble Methods
  – Ensemble Learning & Importance Sampling (ISLE)
  – Generic Ensemble Generation
  – Bagging, Random Forest, AdaBoost, MART
• Rule Ensembles
Interpretation
Ensemble Methods: Interpretation
• Importance scores
  – Which particular variables are the most influential?
    • i.e., relative influence or contribution in predicting the response
  – Higher-importance variables are more likely to be of interest
  – Detection of "masking"
• Interaction statistic
  – Which variables are involved in interactions with other variables?
  – Strength and degrees of those interactions
• Partial dependence plots
  – Nature of the dependence of the response on influential inputs
(Appendix 6)
Ensemble Methods: Interpretation – Example
• Simulated data (source: CART)
  – Predictor vars: n = 10
    p(x_1 = −1) = p(x_1 = 1) = 1/2
    p(x_m = −1) = p(x_m = 0) = p(x_m = 1) = 1/3, m = 2, …, 10
  – Response var:
    y = 1 + 3x_2 + 3x_3 + 2x_4 + z,  if x_1 = 1
    y = −1 + 3x_5 + 3x_6 + 2x_7 + z,  if x_1 = −1
    with z ~ N(0, 2); x_8, x_9, x_10 are "noise"
  – Training: N = 200
Ensemble Methods: Interpretation – Example (2)
• Single tree vs. rule ensemble – top rules:

    Imp   Coef    Supp   Rule
    100    1.80   0.30   X1 = 1 & X2 ∈ {0, 1}
     95   −2.25   0.14   X1 = −1 & X5 ∈ {−1}
     83   −1.59   0.24   X1 = 1 & X2 ∈ {−1, 0} & X3 ∈ {−1, 0}
     71   −1.00   0.35   X1 = −1 & X6 ∈ {−1}
     62    1.37   0.17   X1 = 1 & X2 ∈ {0, 1} & X3 ∈ {0, 1}
     57    1.25   0.35   X1 = −1 & X6 ∈ {−1, 0}
     46    1.00   0.11   X1 = −1 & X5 ∈ {1} & X7 ∈ {0, 1}
     42    0.91   0.51   X1 = 1 & X4 ∈ {1}
     40   −0.67   0.51   X1 = −1
Ensemble Methods: Interpretation – Example (3)
• Variable importance
  – See R package "rulefit"
82© 2007 Seni & Elder KDD07
Ensemble Methods
Interpretation − Example (4)
• Interaction statistic – “null” adjusted
83© 2007 Seni & Elder KDD07
Ensemble Methods
Interpretation − Example (5)
• Two-variable interaction statistics: $(X_5, \cdot)$ and $(X_1, \cdot)$
84© 2007 Seni & Elder KDD07
Ensemble Methods
Interpretation − Example (7)
• Two-Variable Partial Dependencies
– Effect of $x_j$ and $x_k$ on the response after accounting for the (average) effects of the other variables…

$$y = \begin{cases} 3 + 3x_2 + 2x_3 + x_4 + z & \text{if } x_1 = 1 \\ -3 + 3x_5 + 2x_6 + x_7 + z & \text{if } x_1 = -1 \end{cases}$$
Ŧ: Here translated to have a minimum value of zero
85© 2007 Seni & Elder KDD07
References
1. Breiman, L., Friedman, J.H., Olshen, R., and Stone, C. (1993). Classification and Regression Trees. Chapman & Hall/CRC.
2. Breiman, L. (1995). Better Subset Regression Using the Nonnegative Garrote. Technometrics, 37(4):373-384.
3. Breiman, L. (1996). Bagging Predictors. Machine Learning, 24(2):123-140.
4. Breiman, L. (1998). Arcing Classifiers. Annals of Statistics 26(2):801-849.
5. Breiman, L. (2001). Random Forests, random features. Technical Report, University of California, Berkeley.
6. Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least Angle Regression. Annals of Statistics, 32(2):407–499.
7. Elder, J. (2003). The Generalization Paradox of Ensembles. Journal of Computational and Graphical Statistics, 12(4):853-864.
8. Freund, Y. and Schapire, R.E. (1996). Experiments with a new boosting algorithm. Machine Learning: Proc. of the 13th Intl. Conference, Morgan Kauffman, San Francisco, 148-156.
9. Freund, Y. and Schapire, R.E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139.
10. Friedman, J.H. (1999). Stochastic gradient boosting. Technical Report Department of Statistics, Stanford University.
86© 2007 Seni & Elder KDD07
References (2)
11. Friedman, J., Hastie, T. and Tibshirani, R. (2000). Additive Logistic Regression: a Statistical View of Boosting. Annals of Statistics, 28(2):337-407.
12. Friedman, J. (2001). Greedy function approximation: the gradient boosting machine, Annals of Statistics, 29:1189–1232.
13. Friedman, J. and Popescu, B. E. (2003). Importance Sampled Learning Ensembles. Technical Report, Stanford University.
14. Friedman, J. and Popescu, B. E. (2004). Gradient directed regularization for linear regression and classification. Technical Report, Stanford University.
15. Friedman, J. and Popescu, B. E. (2005). Predictive learning via rule ensembles. Technical Report, Stanford University.
16. Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning (ESL) − Data Mining, Inference and Prediction. Springer.
17. Ho, T.K. (1995). Random Decision Forests. Proc. of the 3rd Intl. Conference on Document Analysis and Recognition, 278-282.
18. Kleinberg, E. (2000). On the Algorithmic Implementation of Stochastic Discrimination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(5):473-490.
19. Rosset, S. (2003). Topics in Regularization and Boosting. PhD Thesis, Stanford University.
87© 2007 Seni & Elder KDD07
References (3)
20. Seni, G., Yang, E. and Akar, S. (2007). Yield Modeling with Rule Ensembles. Proc. of the 18th IEEE/SEMI Advanced Semiconductor Manufacturing Conference.
21. Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society B, 58:267-288.
22. Zhao, P. and Yu, B. (2005). Boosted lasso. Proc. of the SIAM Intl. Conference on Data Mining, Newport Beach, CA, 35-44.
23. Zou, H. and Hastie, T. (2005). Regularization and Variable Selection via the Elastic Net. Keynote Address, SIAM workshop on Variable Selection, Newport Beach, CA.
88© 2007 Seni & Elder KDD07
Appendices
1. Visualizing Bagging and AdaBoost Decision Boundaries
2. On AdaBoost: Equivalence to Forward Stagewise Fitting Procedure
3. On AdaBoost: Prove the population minimizer of exponential loss is half the log-odds of Pr(Y = 1 | x)
4. On Gradient Boosting: Solving for a robust loss criterion
5. On MART: tree-specific optimization
• LAD-regression algorithm
6. Interpretation Statistics for Ensemble Methods
7. On Complexity and Generalized Degrees of Freedom
89© 2007 Seni & Elder KDD07
Appendix 1
Visualizing Bagging and AdaBoost (2-dimensional, 2-class example)
90© 2007 Seni & Elder KDD07
Appendix 1
Decision boundary of a single tree
46
91© 2007 Seni & Elder KDD07
Appendix 1
100 bagged trees lead to a smoother boundary
92© 2007 Seni & Elder KDD07
Appendix 1
AdaBoost, after one iteration (CART splits; larger points have greater weight)
47
93© 2007 Seni & Elder KDD07
Appendix 1
After 3 iterations of AdaBoost
94© 2007 Seni & Elder KDD07
Appendix 1
After 20 iterations of AdaBoost
95© 2007 Seni & Elder KDD07
Appendix 1
Decision boundary after 100 iterations of AdaBoost
96© 2007 Seni & Elder KDD07
Appendix 2
On AdaBoost – Equivalence to FSF Procedure

• We have

$$(c_m, \mathbf{p}_m) = \arg\min_{c,\mathbf{p}} \sum_{i=1}^{N} L\big(y_i,\; F_{m-1}(\mathbf{x}_i) + c \cdot T(\mathbf{x}_i; \mathbf{p})\big) \qquad (1)$$

With $L(y, \hat y) = \exp(-y \cdot \hat y)$:

$$(c_m, \mathbf{p}_m) = \arg\min_{c,\mathbf{p}} \sum_{i=1}^{N} \exp\big(-y_i F_{m-1}(\mathbf{x}_i)\big) \cdot \exp\big(-c \cdot y_i\, T(\mathbf{x}_i; \mathbf{p})\big) = \arg\min_{c,\mathbf{p}} \sum_{i=1}^{N} w_i^{(m)} \exp\big(-c \cdot y_i\, T(\mathbf{x}_i; \mathbf{p})\big)$$

– $w_i^{(m)} = e^{-y_i F_{m-1}(\mathbf{x}_i)}$ doesn't depend on $c$ or $\mathbf{p}$, thus can be regarded as an observation weight
– Solution to (1) can be obtained in two steps:

• Step 1: given $c$, solve for $T(\mathbf{x}; \mathbf{p}_m)$:

$$T_m = \arg\min_{T} \sum_{i=1}^{N} w_i^{(m)} \exp\big(-c \cdot y_i\, T(\mathbf{x}_i)\big) \;\Rightarrow\; T_m = \arg\min_{T} \sum_{i=1}^{N} w_i^{(m)}\, I\big(y_i \neq T(\mathbf{x}_i)\big)$$

• Step 2: given $T_m$, solve for $c$:

$$c_m = \arg\min_{c} \sum_{i=1}^{N} w_i^{(m)} \exp\big(-c \cdot y_i\, T_m(\mathbf{x}_i)\big) \;\Rightarrow\; c_m = \frac{1}{2}\log\frac{1 - err_m}{err_m}, \quad \text{where } err_m = \frac{\sum_{i=1}^{N} w_i^{(m)}\, I\big(y_i \neq T_m(\mathbf{x}_i)\big)}{\sum_{i=1}^{N} w_i^{(m)}}$$
97© 2007 Seni & Elder KDD07
Appendix 2
On AdaBoost – Equivalence to FSF Procedure (2)

• Step 1: given $c$, solve for $T(\mathbf{x}; \mathbf{p}_m)$:

$$T_m = \arg\min_{T} \sum_{i=1}^{N} w_i^{(m)} \exp\big(-c \cdot y_i\, T(\mathbf{x}_i)\big) = \arg\min_{T} \left[ e^{-c} \sum_{y_i = T(\mathbf{x}_i)} w_i^{(m)} + e^{c} \sum_{y_i \neq T(\mathbf{x}_i)} w_i^{(m)} \right]$$

$$= \arg\min_{T} \left[ \big(e^{c} - e^{-c}\big) \sum_{i=1}^{N} w_i^{(m)}\, I\big(y_i \neq T(\mathbf{x}_i)\big) + e^{-c} \sum_{i=1}^{N} w_i^{(m)} \right] \quad (\text{second term: constant, doesn't depend on } T)$$

$$= \arg\min_{T} \left[ \sum_{i=1}^{N} w_i^{(m)}\, I\big(y_i \neq T(\mathbf{x}_i)\big) \right]$$

– i.e., $T_m$ is the classifier that minimizes the weighted error rate (line a.)

• Step 2: given $T_m$, solve for $c$:

$$c_m = \arg\min_{c} \left[ \big(e^{c} - e^{-c}\big) \sum_{i=1}^{N} w_i^{(m)}\, I\big(y_i \neq T_m(\mathbf{x}_i)\big) + e^{-c} \sum_{i=1}^{N} w_i^{(m)} \right] \;\Rightarrow\; \text{compute the derivative with respect to } c \text{ and set it to zero}$$
98© 2007 Seni & Elder KDD07
Appendix 2
On AdaBoost – Equivalence to FSF Procedure (3)

• Step 2 (cont.): setting the derivative to zero, with $A_m = \sum_{i=1}^{N} w_i^{(m)} I(y_i \neq T_m(\mathbf{x}_i))$ and $W_m = \sum_{i=1}^{N} w_i^{(m)}$:

$$\frac{\partial}{\partial c}\left[\big(e^{c} - e^{-c}\big) A_m + e^{-c}\, W_m \right] = \big(e^{c} + e^{-c}\big) A_m - e^{-c}\, W_m = 0$$

$$\Rightarrow\; \big(e^{2c} + 1\big) A_m = W_m \qquad (\text{dividing by } e^{-c})$$

$$\Rightarrow\; e^{2c} = \frac{W_m - A_m}{A_m} \;\Rightarrow\; c_m = \frac{1}{2}\ln\frac{\sum_{i=1}^{N} w_i^{(m)} - \sum_{i=1}^{N} w_i^{(m)} I\big(y_i \neq T_m(\mathbf{x}_i)\big)}{\sum_{i=1}^{N} w_i^{(m)} I\big(y_i \neq T_m(\mathbf{x}_i)\big)}$$

• and

$$\frac{1 - err_m}{err_m} = \frac{\sum_{i=1}^{N} w_i^{(m)} - \sum_{i=1}^{N} w_i^{(m)} I\big(y_i \neq T_m(\mathbf{x}_i)\big)}{\sum_{i=1}^{N} w_i^{(m)} I\big(y_i \neq T_m(\mathbf{x}_i)\big)} \;\Rightarrow\; c_m = \frac{1}{2}\ln\frac{1 - err_m}{err_m}$$
99© 2007 Seni & Elder KDD07
Appendix 2
On AdaBoost – Equivalence to FSF Procedure (4)

• Finally, the weight update expression. With $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + c_m T_m(\mathbf{x})$:

$$w_i^{(m+1)} = e^{-y_i F_m(\mathbf{x}_i)} = e^{-y_i F_{m-1}(\mathbf{x}_i)} \cdot e^{-c_m\, y_i\, T_m(\mathbf{x}_i)} = w_i^{(m)} \cdot e^{-c_m\, y_i\, T_m(\mathbf{x}_i)}$$

Using $-y_i\, T_m(\mathbf{x}_i) = 2\, I\big(y_i \neq T_m(\mathbf{x}_i)\big) - 1$:

$$w_i^{(m+1)} = w_i^{(m)} \cdot e^{\alpha_m\, I(y_i \neq T_m(\mathbf{x}_i))} \cdot e^{-c_m}, \qquad \alpha_m = 2\, c_m$$

– $\alpha_m = 2 c_m$ is the quantity in line 2c
– $e^{-c_m}$ multiplies all weights… has no effect
– Equivalence to d. ✓

⇒ AdaBoost minimizes the exponential loss with a forward-stagewise additive modeling approach!
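One boosting round — the weighted error, $c_m$, and the weight update just derived — can be exercised on a toy case; a minimal sketch (the function name is ours):

```python
import numpy as np

def adaboost_round(w, y, pred):
    """One AdaBoost step for labels y in {-1,+1} and a weak classifier's
    predictions: returns c_m and the (normalized) updated weights."""
    miss = (y != pred).astype(float)
    err = np.sum(w * miss) / np.sum(w)              # err_m
    c = 0.5 * np.log((1 - err) / err)               # c_m = (1/2) log((1-err)/err)
    # w^(m+1) = w^(m) * exp(2*c*I(y != T(x))) * exp(-c); the exp(-c)
    # factor multiplies every weight equally, so it can be dropped
    w_new = w * np.exp(2 * c * miss)
    return c, w_new / np.sum(w_new)

y = np.array([1, 1, -1, -1])
pred = np.array([1, -1, -1, -1])                    # one mistake
c, w_new = adaboost_round(np.full(4, 0.25), y, pred)
```

The misclassified point's weight grows from 1/4 to 1/2 of the total, exactly as the update formula prescribes.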
100© 2007 Seni & Elder KDD07
Appendix 3
On AdaBoost – Population Minimizer

• Prove:

$$F^{*}(\mathbf{x}) = \arg\min_{F(\mathbf{x})} \mathrm{E}_{Y|\mathbf{x}}\big(e^{-Y \cdot F(\mathbf{x})}\big) = \frac{1}{2} \log \frac{\Pr(Y = 1 \mid \mathbf{x})}{\Pr(Y = -1 \mid \mathbf{x})}$$

• Derivation:

$$\mathrm{E}\big(e^{-y f(\mathbf{x})} \mid \mathbf{x}\big) = \Pr(Y = 1 \mid \mathbf{x}) \cdot e^{-f(\mathbf{x})} + \Pr(Y = -1 \mid \mathbf{x}) \cdot e^{f(\mathbf{x})}$$

$$\Rightarrow\; \frac{\partial\, \mathrm{E}\big(e^{-y f(\mathbf{x})} \mid \mathbf{x}\big)}{\partial f(\mathbf{x})} = -\Pr(Y = 1 \mid \mathbf{x}) \cdot e^{-f(\mathbf{x})} + \Pr(Y = -1 \mid \mathbf{x}) \cdot e^{f(\mathbf{x})} = 0$$

$$\Rightarrow\; e^{2 f(\mathbf{x})} = \frac{\Pr(Y = 1 \mid \mathbf{x})}{\Pr(Y = -1 \mid \mathbf{x})} \;\Rightarrow\; 2 f(\mathbf{x}) = \ln \Pr(Y = 1 \mid \mathbf{x}) - \ln \Pr(Y = -1 \mid \mathbf{x})$$

$$\Rightarrow\; f(\mathbf{x}) = \frac{1}{2} \ln \frac{\Pr(Y = 1 \mid \mathbf{x})}{\Pr(Y = -1 \mid \mathbf{x})}$$
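The half-log-odds result can be verified numerically by brute-force minimization over a grid; a quick sketch:

```python
import numpy as np

def expected_exp_loss(f, p):
    """E[exp(-Y f)] when Pr(Y=1|x) = p and Pr(Y=-1|x) = 1 - p."""
    return p * np.exp(-f) + (1 - p) * np.exp(f)

p = 0.8
f_star = 0.5 * np.log(p / (1 - p))        # claimed minimizer: half the log-odds
grid = np.linspace(-3, 3, 2001)
f_grid = grid[np.argmin(expected_exp_loss(grid, p))]   # brute-force minimum
```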
101© 2007 Seni & Elder KDD07
Appendix 4
On Gradient Boosting

• Solving

$$(c_m, \mathbf{p}_m) = \arg\min_{c,\mathbf{p}} \sum_{i=1}^{N} L\big(y_i,\; F_{m-1}(\mathbf{x}_i) + c \cdot T(\mathbf{x}_i; \mathbf{p})\big)$$

for a robust loss criterion (e.g., absolute loss, binomial deviance) requires use of a "surrogate", more convenient, loss $\tilde{L}(y, \hat y)$

• Like before, we solve in two steps:
– Step 1: find $T(\mathbf{x}; \mathbf{p}_m)$ … here we use $\tilde{L}$:

$$\mathbf{p}_m = \arg\min_{\mathbf{p}} \sum_{i=1}^{N} \tilde{L}\big(y_i,\; F_{m-1}(\mathbf{x}_i) + c \cdot T(\mathbf{x}_i; \mathbf{p})\big)$$

– Step 2: given $T_m$, solve for $c$:

$$c_m = \arg\min_{c} \sum_{i=1}^{N} L\big(y_i,\; F_{m-1}(\mathbf{x}_i) + c \cdot T(\mathbf{x}_i; \mathbf{p}_m)\big)$$
102© 2007 Seni & Elder KDD07
Appendix 4
On Gradient Boosting (2)

• $\tilde{L}(y, \hat y)$ is derived from an analogy to numerical optimization in function space
– Learning: $\hat{\mathbf{F}} = \arg\min_{\mathbf{F}} R(\mathbf{F})$ — i.e., minimum "risk"
– Each possible $\mathbf{F}$ is a "point" in $\mathbb{R}^N$ — i.e., $\mathbf{F} = \big(F(\mathbf{x}_1), F(\mathbf{x}_2), \ldots, F(\mathbf{x}_N)\big)$

[Figure: contours of $R(\mathbf{F})$ over coordinates $F(\mathbf{x}_1)$ and $F(\mathbf{x}_2)$]
103© 2007 Seni & Elder KDD07
Appendix 4
On Gradient Boosting (3)

• Steepest descent in function space ("non-parametric" view)

$$\mathbf{F}_0: \text{initial guess}; \qquad \mathbf{F}_M = \sum_{m=0}^{M} \mathbf{f}_m; \qquad \mathbf{f}_m = -\rho_m \nabla R(\mathbf{F}_{m-1})$$

– Gradient components:

$$\nabla R_m = \left[\frac{\partial R(\mathbf{F})}{\partial F(\mathbf{x}_i)}\right]_{\mathbf{F} = \mathbf{F}_{m-1}} = \left[\frac{\partial L\big(y_i, F(\mathbf{x}_i)\big)}{\partial F(\mathbf{x}_i)}\right]_{\mathbf{F} = \mathbf{F}_{m-1}}, \qquad i = 1, \ldots, N$$

– Step size (line search):

$$\rho_m = \arg\min_{\rho} R\big(\mathbf{F}_{m-1} - \rho \nabla R_m\big) \;\Rightarrow\; \text{like "Step 2" before}$$
104© 2007 Seni & Elder KDD07
Appendix 4
On Gradient Boosting (4)

• Problem: $\nabla R(\mathbf{F})$ is defined on the training data only, but $\hat{F}_M$ must generalize
– Select a parameterized function $T(\mathbf{x}; \mathbf{p})$ (defined for all $\mathbf{x}$)
– Choose $\mathbf{p}$ so that $T(\mathbf{x}; \mathbf{p})$ is most "parallel" to $-\nabla R(\mathbf{F})$
  • Geometric interpretation of correlation: $\cos\theta = \mathrm{corr}\big(\{-\nabla R(F(\mathbf{x}_i))\}_{i=1}^{N},\, \{T(\mathbf{x}_i; \mathbf{p})\}_{i=1}^{N}\big)$
  • The most highly correlated $T(\mathbf{x}; \mathbf{p})$ is found from the solution $(\mathbf{p}_m, \beta)$:

$$\mathbf{p}_m = \arg\min_{\mathbf{p},\beta} \sum_{i=1}^{N} \big({-\nabla R(F(\mathbf{x}_i))} - \beta \cdot T(\mathbf{x}_i; \mathbf{p})\big)^2$$

– Surrogate criterion $\tilde{L}$:

$$\tilde{y}_i = -\left[\frac{\partial L\big(y_i, F(\mathbf{x}_i)\big)}{\partial F(\mathbf{x}_i)}\right]_{\mathbf{F} = \mathbf{F}_{m-1}} \;\Rightarrow\; \text{least-squares minimization}$$
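Choosing the base learner most "parallel" to the negative gradient reduces to a least-squares fit of the pseudo-responses; a sketch with a one-dimensional stump (the helper name is ours):

```python
import numpy as np

def best_parallel_stump(x, neg_grad):
    """Choose the stump split most 'parallel' to the negative gradient:
    the least-squares fit of neg_grad by a two-region mean function."""
    best = (np.inf, None)
    for s in np.unique(x)[:-1]:
        left = x <= s
        fit = np.where(left, neg_grad[left].mean(), neg_grad[~left].mean())
        sse = np.sum((neg_grad - fit) ** 2)
        if sse < best[0]:
            best = (sse, s)
    return best[1]

x = np.array([0., 1., 2., 3.])
g = np.array([-1., -1., 2., 2.])     # pseudo-responses ~y_i
s = best_parallel_stump(x, g)        # split x <= 1 separates the two levels
```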
105© 2007 Seni & Elder KDD07
Appendix 5
On MART

• Adding $T_m(\mathbf{x})$ to the model is like adding $J$ separate (basis) functions:

$$T\big(\mathbf{x}; \{b_j, R_j\}_1^J\big) = \sum_{j=1}^{J} b_j\, I(\mathbf{x} \in R_j)$$

$$\Rightarrow\; F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + c_m T_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \sum_{j=1}^{J} c_m b_{jm}\, I(\mathbf{x} \in R_{jm}) = F_{m-1}(\mathbf{x}) + \sum_{j=1}^{J} \gamma_{jm}\, I(\mathbf{x} \in R_{jm})$$

with $\gamma_{jm} = c_m \cdot b_{jm}$

• Since the regions $R_{jm}$ are disjoint, we can do separate updates in each terminal node
– Step 2 was:

$$c_m = \arg\min_{c} \sum_{i=1}^{N} L\big(y_i,\; F_{m-1}(\mathbf{x}_i) + c \cdot T(\mathbf{x}_i; \mathbf{p}_m)\big)$$

  now:

$$\gamma_{jm} = \arg\min_{\gamma} \sum_{\mathbf{x}_i \in R_{jm}} L\big(y_i,\; F_{m-1}(\mathbf{x}_i) + \gamma\big)$$

  i.e., an optimal constant update in each terminal region
106© 2007 Seni & Elder KDD07
Appendix 5
On MART (2)

• Summary:

$$F_0(\mathbf{x}) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$$

For $m = 1$ to $M$ {

  // Step 1: find $T_m(\mathbf{x})$ using the surrogate criterion $\tilde{L}$

$$\tilde{y}_i = -\left[\frac{\partial L\big(y_i, F(\mathbf{x}_i)\big)}{\partial F(\mathbf{x}_i)}\right]_{\mathbf{F} = \mathbf{F}_{m-1}}$$

$$\{R_{jm}\}_1^J = \arg\min_{\{b_j, R_j\}_1^J} \sum_{i=1}^{N} \left[\tilde{y}_i - \sum_{j=1}^{J} b_j\, I(\mathbf{x}_i \in R_j)\right]^2$$

  // Step 2: find coefficients

$$\hat{\gamma}_{jm} = \arg\min_{\gamma} \sum_{\mathbf{x}_i \in R_{jm}} L\big(y_i,\; F_{m-1}(\mathbf{x}_i) + \gamma\big)$$

  // Update expansion ($\nu$ = shrinkage parameter)

$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \nu \cdot \sum_{j=1}^{J} \hat{\gamma}_{jm}\, I(\mathbf{x} \in R_{jm})$$

}
write $\hat{F}(\mathbf{x}) = F_M(\mathbf{x})$
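The loop above can be sketched for squared loss, where the pseudo-response is simply the residual and the optimal constant in each region is the mean residual (a stump stands in for the regression tree; names are ours):

```python
import numpy as np

def fit_stump_regions(x, t):
    """Least-squares stump on 1-d x: the split threshold minimizing SSE."""
    best = (np.inf, None)
    for s in np.unique(x)[:-1]:
        left = x <= s
        sse = t[left].var() * left.sum() + t[~left].var() * (~left).sum()
        if sse < best[0]:
            best = (sse, s)
    return best[1]

def mart_step(x, y, F, nu=0.1):
    """One MART iteration for squared loss L = (y - F)^2 / 2: the
    pseudo-response is the residual; the optimal constant update in
    each terminal region is the mean residual there."""
    resid = y - F                        # ~y_i = -dL/dF = y_i - F(x_i)
    s = fit_stump_regions(x, resid)
    left = x <= s
    gamma = np.where(left, resid[left].mean(), resid[~left].mean())
    return F + nu * gamma                # F_m = F_{m-1} + nu * sum_j gamma_jm I(.)

x = np.array([0., 1., 2., 3.])
y = np.array([0., 0., 1., 1.])
F = np.full_like(y, y.mean())            # F_0: best constant
for _ in range(200):
    F = mart_step(x, y, F)
```

With shrinkage ν = 0.1, the fit approaches the targets geometrically over the iterations.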
107© 2007 Seni & Elder KDD07
Appendix 5
On MART (3)

• LAD-regression: $L(y, F) = |y - F|$
– More robust than $(y - F)^2$
– Resistant to outliers in $y$ … trees already providing resistance to outliers in $\mathbf{x}$
– Algorithm:

$$F_0(\mathbf{x}) = \mathrm{median}\{y_i\}_1^N$$

For $m = 1$ to $M$ {

  // Step 1: find $T_m(\mathbf{x})$

$$\tilde{y}_i = \mathrm{sign}\big(y_i - F_{m-1}(\mathbf{x}_i)\big)$$

$$\{R_{jm}\}_1^J = \text{terminal nodes of a } J\text{-node LS-regression tree fit to } \{\tilde{y}_i, \mathbf{x}_i\}_1^N$$

  // Step 2: find coefficients

$$\hat{\gamma}_{jm} = \mathrm{median}_{\mathbf{x}_i \in R_{jm}}\big\{y_i - F_{m-1}(\mathbf{x}_i)\big\}, \qquad j = 1, \ldots, J$$

  // Update expansion

$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \nu \cdot \sum_{j=1}^{J} \hat{\gamma}_{jm}\, I(\mathbf{x} \in R_{jm})$$

}
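The LAD loop can be sketched as follows. This is a simplified illustration: the full algorithm fits an LS tree to sign(y − F) to choose the regions, while this sketch uses a fixed stump at the median of x instead (function name ours):

```python
import numpy as np

def lad_boost(x, y, M=100, nu=0.1):
    """Sketch of LAD gradient boosting: F_0 is the global median; each
    iteration adds nu * (median residual) inside each terminal region.
    Regions come from a fixed stump here, not a fitted tree."""
    F = np.full(len(y), np.median(y))     # F_0 = median(y)
    split = np.median(x)
    for _ in range(M):
        for region in (x <= split, x > split):
            if region.any():
                gamma = np.median(y[region] - F[region])  # median residual
                F[region] += nu * gamma
    return F

x = np.array([0., 1., 2., 3.])
y = np.array([0., 0., 10., 10.])
F = lad_boost(x, y)
```

The fit converges to the per-region medians, which is exactly what absolute loss rewards.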
108© 2007 Seni & Elder KDD07
Appendix 6
Interpretation − Variable Importance

• Single tree measure (CART):

$$\mathcal{I}(x_l) = \sum_{t \in T} \hat{I}(\tilde{s}_l, t) \cdot \mathrm{I}\big(v(t) = l\big)$$

– i.e., sum the "goodness of split" scores $\hat{I}(\tilde{s}_l, t)$ whenever $x_l$ is used in a surrogate split
– The sum is over all internal nodes $t \in T$ of the tree
– If $x_l$ was used as the primary split in some node, then the corresponding score is used in the summation
– Normalized measure:

$$\mathrm{Imp}(x_j) = 100 \cdot \mathcal{I}(x_j) \big/ \max_{1 \leq l \leq m} \mathcal{I}(x_l)$$

  • i.e., only the relative magnitudes of the $\mathcal{I}(x_j)$ are interesting

• Tree ensemble generalization
– Average over all of the trees:

$$\mathcal{I}(x_j) = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}(x_j; T_m)$$
109© 2007 Seni & Elder KDD07
Appendix 6
Interpretation − Variable Importance (2)

• Rule ensemble measure:
– Term (global) importance: absolute value of the coefficient of the standardized predictor
  • Rule term: $\mathcal{I}_k = |\hat{a}_k| \cdot \sqrt{s_k (1 - s_k)}$, where the rule's support is $s_k = \frac{1}{N} \sum_{i=1}^{N} r_k(\mathbf{x}_i)$
  • Linear term: $\mathcal{I}_j = |\hat{b}_j| \cdot \mathrm{std}(x_j)$
– Term (local) importance: at each point $\mathbf{x}$ … the absolute change in $\hat{F}(\mathbf{x})$ when the term is removed
  • Rule term: $\mathcal{I}_k(\mathbf{x}) = |\hat{a}_k| \cdot |r_k(\mathbf{x}) - s_k|$
  • Linear term: $\mathcal{I}_j(\mathbf{x}) = |\hat{b}_j| \cdot |x_j - \bar{x}_j|$
– Input variable (local) importance: the linear term's importance plus each covering rule's importance, shared equally among the $\mathrm{size}(r_k)$ variables in the rule:

$$\mathcal{I}_l(\mathbf{x}) = \mathcal{I}_{\mathrm{lin},l}(\mathbf{x}) + \sum_{x_l \in r_k} \mathcal{I}_k(\mathbf{x}) \big/ \mathrm{size}(r_k)$$
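The two global measures are one-liners; a sketch (function names are ours):

```python
import numpy as np

def rule_importance(a_k, rule_fires):
    """Global importance of a rule term: |a_k| * sqrt(s_k (1 - s_k)),
    where s_k is the rule's support (fraction of points it covers)."""
    s = rule_fires.mean()
    return abs(a_k) * np.sqrt(s * (1 - s))

def linear_importance(b_j, xj):
    """Global importance of a linear term: |b_j| * std(x_j)."""
    return abs(b_j) * np.std(xj)

# e.g., a rule with support 0.30 and coefficient 1.80
fires = np.array([1.0] * 30 + [0.0] * 70)
imp = rule_importance(1.80, fires)
```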
110© 2007 Seni & Elder KDD07
Appendix 6
Interpretation − Partial Dependence Plots (Friedman, 2001)

• Visualize the effect on $\hat{F}(\mathbf{x})$ of a small subset of the important input variables
– after accounting for the (average) effects of the other ("complement") variables
– $\mathbf{x}_S \subset \{x_1, x_2, \ldots, x_n\}$ and complement $\mathbf{x}_C$, with $\mathbf{x}_S \cup \mathbf{x}_C = \{x_1, x_2, \ldots, x_n\}$ — i.e., $\hat{F}(\mathbf{x}) = \hat{F}(\mathbf{x}_S, \mathbf{x}_C)$
– Partial dependence on $\mathbf{x}_S$:

$$\hat{F}_S(\mathbf{x}_S) = \mathrm{E}_{\mathbf{X}_C}\big[\hat{F}(\mathbf{x}_S, \mathbf{x}_C)\big]$$

– Approximated by:

$$\hat{F}_S(\mathbf{x}_S) = \frac{1}{N} \sum_{i=1}^{N} \hat{F}(\mathbf{x}_S, \mathbf{x}_{iC})$$

  where $\{\mathbf{x}_{1C}, \ldots, \mathbf{x}_{NC}\}$ are the values of $\mathbf{x}_C$ occurring in the training data
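The approximation above — clamp $\mathbf{x}_S$, average over the training values of $\mathbf{x}_C$ — is a few lines of numpy; a sketch for a single variable (names are ours):

```python
import numpy as np

def partial_dependence(model, X, j, grid):
    """F_S(v) ~ (1/N) * sum_i model(v, x_iC): clamp column j to each
    grid value and average predictions over the training rows."""
    pd = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v                      # fix x_j; other columns as observed
        pd.append(model(Xv).mean())
    return np.array(pd)

# toy model with an interaction: F(x) = x0 * x1
model = lambda X: X[:, 0] * X[:, 1]
X = np.array([[0., -1.], [0., 1.], [0., 3.]])
pd0 = partial_dependence(model, X, 0, grid=[-1., 0., 1.])
# averaging over x1 (whose training mean is 1) leaves a slope-1 line in x0
```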
111© 2007 Seni & Elder KDD07
Appendix 6
Interpretation − Partial Dependence Plots (2)
• Example
– The shape of the function on either variable is affected by the values of the other, suggesting the presence of an interaction
112© 2007 Seni & Elder KDD07
Appendix 6
Interpretation − Interaction Statistic (Friedman, 2005)

• If $x_j$ and $x_k$ do not interact, then $\hat{F}(\mathbf{x})$ can be expressed as the sum of two functions:

$$\hat{F}(\mathbf{x}) = f_{\backslash j}(\mathbf{x}_{\backslash j}) + f_{\backslash k}(\mathbf{x}_{\backslash k})$$

  i.e., $f_{\backslash j}(\mathbf{x}_{\backslash j})$ does not depend on $x_j$; $f_{\backslash k}(\mathbf{x}_{\backslash k})$ is independent of $x_k$
– Thus, the partial dependence on $\mathbf{x}_S = \{x_j, x_k\}$ can be decomposed:

$$\hat{F}_{j,k}(x_j, x_k) = \hat{F}_j(x_j) + \hat{F}_k(x_k)$$

  i.e., the sum of the respective partial dependencies
• Test for the presence of an interaction:

$$H_{jk}^2 = \sum_{i=1}^{N} \left[\hat{F}_{j,k}(x_{ij}, x_{ik}) - \hat{F}_j(x_{ij}) - \hat{F}_k(x_{ik})\right]^2 \Big/ \sum_{i=1}^{N} \hat{F}_{j,k}^2(x_{ij}, x_{ik})$$
113© 2007 Seni & Elder KDD07
Appendix 6
Interpretation − Interaction Statistic (2)

• If $x_j$ does not interact with any other variable
– $\hat{F}(\mathbf{x})$ can be expressed as the sum of two functions:

$$\hat{F}(\mathbf{x}) = f_j(x_j) + f_{\backslash j}(\mathbf{x}_{\backslash j})$$

  where $f_j(x_j)$ is a function only of $x_j$
– Thus,

$$\hat{F}(\mathbf{x}) = \hat{F}_j(x_j) + \hat{F}_{\backslash j}(\mathbf{x}_{\backslash j})$$

• Test whether $x_j$ interacts with any other variable:

$$H_{j}^2 = \sum_{i=1}^{N} \left[\hat{F}(\mathbf{x}_i) - \hat{F}_j(x_{ij}) - \hat{F}_{\backslash j}(\mathbf{x}_{i \backslash j})\right]^2 \Big/ \sum_{i=1}^{N} \hat{F}^2(\mathbf{x}_i)$$
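The two-variable statistic $H^2_{jk}$ can be computed directly from the partial-dependence approximation; a sketch on analytic toy models (names are ours; partial dependencies are centered before differencing):

```python
import numpy as np

def pd_vals(model, X, cols, points):
    """Partial dependence of `model` at each row of `points` (values for
    columns `cols`), averaging over the other columns' training values."""
    out = np.empty(len(points))
    for i, p in enumerate(points):
        Xv = X.copy()
        Xv[:, cols] = p
        out[i] = model(Xv).mean()
    return out

def h_statistic(model, X, j, k):
    """Friedman's two-variable interaction statistic H^2_jk, evaluated
    at the training points with centered partial dependencies."""
    Fjk = pd_vals(model, X, [j, k], X[:, [j, k]])
    Fj = pd_vals(model, X, [j], X[:, [j]])
    Fk = pd_vals(model, X, [k], X[:, [k]])
    num = (Fjk - Fjk.mean()) - (Fj - Fj.mean()) - (Fk - Fk.mean())
    return np.sum(num ** 2) / np.sum((Fjk - Fjk.mean()) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
h_add = h_statistic(lambda X: X[:, 0] + 2 * X[:, 1], X, 0, 1)   # additive: ~0
h_mult = h_statistic(lambda X: X[:, 0] * X[:, 1], X, 0, 1)      # interaction: large
```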
114© 2007 Seni & Elder KDD07
Appendix 7
Complexity of Ensembles

• Ensembles generalizing well was a theoretical surprise
– Importance Sampling helps us understand some of this behavior

Outline:
• Chief danger of data mining: overfit
• Occam's razor: regulate complexity to avoid overfit
• But does the razor work? Counter-evidence has been gathering
• What if complexity is measured incorrectly?
• Generalized degrees of freedom (Ye, 1998)
• Experiments: single vs. bagged decision trees
• Summary: factors that matter
115© 2007 Seni & Elder KDD07
Appendix 7
Overfit models generalize poorly

[Figure: Y = f(X) with test points (TestPnts) and polynomial fits of increasing degree, YhatX through YhatX9]
116© 2007 Seni & Elder KDD07
Appendix 7
Occam's Razor

• Nunquam ponenda est pluralitas sine necessitate — "Entities should not be multiplied beyond necessity." "If (nearly) as descriptive, the simpler model is more correct."
• But, gathering doubts:
– Ensemble methods, which employ multiple models (e.g., bagging, boosting, bundling, Bayesian model averaging)
– Nonlinear terms can have a higher (or lower) effective complexity than linear ones
– Much overfit is from excessive search (e.g., Jensen 2000), rather than over-parameterization
– Neural network structures are fixed, but their degree of fit grows with training time
• Domingos (1998) won KDD Best Paper arguing for its death
– What if complexity is measured incorrectly?
117© 2007 Seni & Elder KDD07
Appendix 7
Target Shuffling can measure "over-search" (e.g., Battery MTC, Monte Carlo Target in CART β)

• Break the link between target, Y, and features, X, by shuffling Y to form Ys.
• Model new Ys ~ f(X)
• Measure quality of resulting (random) model
• Repeat to build distribution
-> Best (or mean) shuffled (i.e., useless) model sets the baseline above which true model performance may be measured
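The shuffling loop is easy to sketch; here a stand-in "model quality" score is used for illustration (names and score are ours, not from the tutorial):

```python
import numpy as np

def shuffled_baseline(score, X, y, n_shuffles=50, seed=0):
    """Break the X-y link by shuffling y and re-scoring each time; the
    distribution of shuffled scores is the useless-model baseline."""
    rng = np.random.default_rng(seed)
    return np.array([score(X, rng.permutation(y)) for _ in range(n_shuffles)])

def score(X, y):
    # stand-in quality measure: best |correlation| of y with any one input
    return max(abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1]))

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
y = X[:, 0] + 0.1 * rng.normal(size=50)       # real signal in column 0
baseline = shuffled_baseline(score, X, y)
true_score = score(X, y)
```

The true model's score should clear the best shuffled score; performance is then measured above that baseline.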
118© 2007 Seni & Elder KDD07
Appendix 7
Complexity can be measured by the Flexibility of the Modeling Process
• Generalized Degrees of Freedom, GDF (Ye, JASA 3/1998)
– Perturb output, re-fit procedure, measure changes in estimates
• Covariance Inflation Criterion, CIC (Tibshirani & Knight, 1999)
– Shuffle output, re-fit procedure, measure covariance between new and old estimates.
• Key step (loop around modeling procedure) reminiscent of Regression Analysis Tool, RAT (Faraway, 1991) -- where resampling tests of a 2-second procedure took 2 days to run.
119© 2007 Seni & Elder KDD07
Appendix 7
Generalized Degrees of Freedom (GDF)

• #terms in Linear Regression (LR) = DoF, k
• Nonlinear terms (e.g., MARS) can have effect of ~3k (Friedman, Owen ‘91)
• Other parameters can have effects < 1 (e.g., under-trained neural networks)
Procedure (Ye, 1998):
• For LR, k = trace(Hat Matrix) = $\sum_i \partial \hat{y}_i / \partial y_i$
• Define GDF to be sum of sensitivity of each fitted value, yhat, to perturbations in the corresponding output, y. That is, instead of extrapolating from LR by counting terms, use alternate trace measure which is equivalent under LR.
• (Similarly, the effective degrees of freedom of a spline model is estimated by the trace of the projection matrix, S: yhat = Sy )
• Put a y-perturbation loop around the entire modeling process (which can involve multiple stages)
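Ye's perturbation loop can be sketched around any black-box fitting procedure; a minimal version (names are ours), sanity-checked on least squares, where GDF should recover trace(H) = number of parameters:

```python
import numpy as np

def gdf(fit_predict, X, y, sigma=0.5, n_runs=100, seed=0):
    """GDF (Ye, 1998): perturb y with N(0, sigma) noise, re-fit, and sum
    per-observation sensitivities (slope of yhat_i vs. perturbed y_i)."""
    rng = np.random.default_rng(seed)
    ys = np.array([y + rng.normal(0.0, sigma, size=len(y))
                   for _ in range(n_runs)])
    yhats = np.array([fit_predict(X, ye) for ye in ys])
    # slope of yhat_i against perturbed y_i, per observation, then summed
    return sum(np.polyfit(ys[:, i], yhats[:, i], 1)[0] for i in range(len(y)))

def ols(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
y = rng.normal(size=40)
k = gdf(ols, X, y)   # expect about 3 (intercept + 2 slopes)
```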
120© 2007 Seni & Elder KDD07
Appendix 7
GDF computation
[Diagram: the modeling process is run on X with perturbed targets yᵉ = y + e, e ~ N(0, σ); each run yields per-observation pairs (yᵉ, ŷᵉ)]
Ye robustness trick: average, across perturbation runs, the sensitivities for a given observation.
[Scatter plot of ŷᵉ vs. yᵉ for one observation; the fitted slope (here y = 0.2502x − 0.821) estimates that observation's sensitivity]
121© 2007 Seni & Elder KDD07
Appendix 7
Example: data surface is piecewise constant
122© 2007 Seni & Elder KDD07
Appendix 7
Additive N(0, .5) noise
123© 2007 Seni & Elder KDD07
Appendix 7
100 random training samples
124© 2007 Seni & Elder KDD07
Appendix 7
Estimation surfaces for bundles of 5 trees

[Left: 4-leaf trees; right: 8-leaf trees (some of the finer structure is real)]

Bagging produces gentler stair-steps than a raw tree (illustrating how it generalizes better for smooth functions)
125© 2007 Seni & Elder KDD07
Appendix 7
Equivalent tree for 8-leaf bundle (25% pruned)
So a bundled tree is still a tree. But is it as complex as it looks?
126© 2007 Seni & Elder KDD07
Appendix 7
Experiment: Introduce selection noise (an additional 8 candidate input variables)

[Left: estimation surface for a 5-bag of 4-leaf trees; right: for 8-leaf trees]

The main structure here is clear enough for simple models to avoid the noise inputs, but their eventual use leads to a distribution of estimates on the 2-d projection.
127© 2007 Seni & Elder KDD07
Appendix 7
Estimated GDF vs. #parameters

[Chart: estimated GDF (0–40) against number of parameters k (1–9) for five procedures — Tree 2d, Bagged Tree 2d, Tree 2d + 8 noise, Bagged Tree 2d + 8 noise, and Linear Regression 1d — with fitted slopes of 1 for linear regression and roughly 3, 4, 4, and 5 for the tree variants]

– Bagging reduces complexity
– Noise x's increase complexity
128© 2007 Seni & Elder KDD07
Appendix 7
Ensembles & Complexity Summary
• Bundling competing models improves generalization.
• Different model families are a good source of component diversity.
• If we measure complexity as flexibility (GDF) the classic relation between complexity and overfit is revived.
– The more a modeling process can match an arbitrary change made to its output, the more complex it is.
– Simplicity is not parsimony.
• Complexity increases with distracting variables.
• It is expected to increase with parameter power and search thoroughness, and decrease with priors, shrinking, and clarity of structure in data. Constraints (observations) may go either way…
• Model ensembles often have less complexity than their components.
• Diverse modeling procedures can be fairly compared using GDF.