From Trees to Forests and Rule Sets: A Unified Overview of Ensemble Methods
13th Intl. Conf. on Knowledge Discovery and Data Mining (KDD 2007)
Giovanni Seni
PDF Solutions, Inc., Santa Clara University
John Elder
Elder Research, Inc.
© 2007 Seni & Elder, KDD 2007
Tutorial Goals
• Convey the essentials of an immensely useful modeling methodology
• Explain two recent developments
– Importance Sampling
– Rule Ensembles
• Provide the foundation for further study/research
• Reference resources in R
• Show real examples
Overview
• In a Nutshell, Examples & Timeline
• Predictive Learning
• Decision Trees
  – Regression tree induction
  – Desirable data mining properties
  – Limitations
• Model Selection
  – Bias−Variance Tradeoff
  – Cost-complexity pruning
  – Cross-Validation
  – Regularization via shrinkage (LASSO)
Overview (2)
• Ensemble Methods
  – Ensemble Learning & Importance Sampling (ISLE)
  – Generic Ensemble Generation
  – Bagging, Random Forest, AdaBoost, MART
• Rule Ensembles
• Interpretation
• Example from Industry
Ensemble Methods in a Nutshell
• "Algorithmic" statistical procedure
• Based on combining the fitted values from a number of fitting attempts
• Loosely related to:
  – Iterative procedures
  – Bootstrap procedures
• Original idea: a "weak" procedure can be strengthened if it can operate "by committee"
  – e.g., combining low-bias / high-variance procedures
Ensemble Methods In a Nutshell (2)
• Shown to perform extremely well in a variety of scenarios
• Shown to have desirable statistical properties
• Accompanied by interpretation methodology
Examples: Data Mining Products
[Screenshots: ensemble options in commercial data mining products, e.g., Model 1 and Predictive Dynamix.]
Examples: Algorithm Response Surfaces
[Response surfaces for five algorithms: Nearest Neighbor, Decision Tree, Delaunay Triangles (or Hinging Hyperplanes), Neural Network (or Polynomial Network), Kernel.]
Examples: Relative Performance
[Bar chart: error relative to peer techniques (lower is better) on six problems – Diabetes, Gaussian, Hypothyroid, German Credit, Waveform, Investment – for Neural Network, Logistic Regression, Linear Vector Quantization, Projection Pursuit Regression, and Decision Tree.]
Examples: "Bundling" Improves Performance
[Bar chart: on the same six problems (Diabetes, Gaussian, Hypothyroid, German Credit, Waveform, Investment), bundled estimators – Advisor Perceptron, AP weighted average, Vote, Average – show error relative to peer techniques (lower is better).]
Examples: Credit Scoring Performance
[Plot: number of defaulters missed (fewer is better) vs. number of models combined (averaging output rank, 0–5). Base models: Stepwise regression (S), Polynomial network (P), Neural network (N), MARS (M), bundled Trees (T); points mark combinations such as NT, MP, SMT, SPNT, and SMPNT.]
Timeline
• CART (Breiman, Friedman, Olshen, Stone, 1984)
• Bagging (Breiman, 1996)
  – Random Forest (Ho, 1995; Breiman, 2001)
• AdaBoost (Freund, Schapire, 1997)
• Boosting – a statistical view (Friedman, Hastie, Tibshirani, 2000)
  – Stochastic Gradient Boosting (Friedman, 1999)
  – Gradient Boosting (Friedman, 2001)
• Importance Sampling Learning Ensembles (ISLE) (Friedman, Popescu, 2003)
Timeline (2)
• Lasso (Tibshirani, 1996)
  – Garrote (Breiman, 1995)
  – LARS (Efron et al., 2004)
  – Gradient directed regularization for linear regression (Friedman, 2004)
  – Boosted Lasso (Zhao, 2005)
  – Elastic Net (Zou, Hastie, 2005)
• Rule Ensembles (Friedman, Popescu, 2005)
  – Stochastic Discrimination (Kleinberg, 2000)
Overview
• In a Nutshell & Timeline
Predictive Learning
• Decision Trees
  – Regression tree induction
  – Desirable data mining properties
• Model Selection
  – Bias−Variance Tradeoff
  – Cost-complexity pruning
  – Cross-Validation
  – Regularization via shrinkage (LASSO)
Predictive Learning: Example
• A simple data set:

    TI    PE    Response
    1.0   M2    good
    2.0   M1    bad
    4.5   M5    ?
    …     …     …

• Modeling goals:
  – Descriptive: summarize existing data in an understandable and actionable way
  – Predictive: what is the "response" (e.g., class) of a new point?
[Scatter plot of the data in the (TI, PE) plane; TI on the horizontal axis (marks at 2 and 5), PE categories M1–M9 on the vertical axis.]
Predictive Learning: Example (2)
• Ordinary Linear Regression (OLR)
• Model: F̂(x) = a_0 + Σ_{j=1}^{n} a_j x_j, with the class prediction obtained by thresholding, e.g., predict one class if F̂(x) ≥ 0, else the other
⇒ Not flexible enough
[Scatter plot: a single linear decision boundary in the (TI, PE) plane cannot separate the two classes.]
Predictive Learning: Example (3)
• Decision Trees
• Descriptive model:
  IF TI ∈ [2, 5] AND PE ∈ {M1, M2, M3} THEN bad, ELSE good
[Decision tree using splits on TI (at 2 and 5) and on PE ∈ {M1, M2, M3}, shown next to the (TI, PE) scatter plot.]
Predictive Learning: Procedure Summary
• Given "training" data D = {y_i, x_{i1}, x_{i2}, …, x_{in}}_1^N = {y_i, x_i}_1^N
  – y_i, x_{ij} are measured values of attributes (properties, characteristics) of an object
  – y_i is the "response" (or output) variable
  – x_{ij} are the "predictor" (or input) variables
  – D is a random sample from some unknown (joint) distribution
• Build a functional model ŷ = F̂(x_1, x_2, …, x_n) = F̂(x)
  – Offers an adequate and interpretable description of how the inputs affect the outputs
Predictive Learning: Procedure Summary (2)
• Model: underlying functional form sought from data
  F̂(x) = F̂(x; a) ∈ ℱ, a family of functions indexed by parameters a
• Score criterion: judges (lack of) quality of the fitted model
  R(a) = (1/N) Σ_{i=1}^{N} L( y_i, F̂(x_i; a) )
  e.g., OLR: (1/N) Σ_{i=1}^{N} ( y_i − a_0 − Σ_{j=1}^{n} a_j x_{ij} )²
• Search strategy: minimization procedure for the criterion
  â = argmin_a R(a)
  – OLR: direct matrix algebra
  – Trees, NN, Clustering: heuristic iterative algorithm
Overview
• In a Nutshell & Timeline
• Predictive Learning
Decision Trees
  – Regression tree induction
  – Desirable data mining properties
  – Limitations
• Model Selection
  – Bias−Variance Tradeoff
  – Cost-complexity pruning
  – Cross-Validation
  – Regularization via shrinkage (LASSO)
Decision Trees: Induction Overview
• Model: ŷ = T(x) = Σ_{m=1}^{M} ĉ_m I(x ∈ R_m)
  – R_m: (hyper)rectangles in input space induced by tree cuts
  – ĉ_m: estimated response within each rectangle is constant
  – I(·): indicator function; 1 if its argument is true
[Perspective plot of a piecewise-constant surface over regions R_1, …, R_5 in the (x_1, x_2) plane. Adapted from ESL 9.2]
Decision Trees: Induction Overview (2)
• Score criterion:
  – Regression "loss"
    • Squared error: L(y, ŷ) = (y − ŷ)²
    • Absolute error: L(y, ŷ) = |y − ŷ|
  – Prediction "risk"
    • Average loss over all predictions: R(T) = E_{y,x} L( y, T(x) )
• Goal: find the tree T with lowest prediction risk
Decision Trees: Induction Overview (3)
• Search
  – Goal (least squares):
    {ĉ_m, R̂_m}_1^M = argmin_{ {c_m, R_m}_1^M } Σ_{i=1}^{N} [ y_i − Σ_{m=1}^{M} c_m I(x_i ∈ R_m) ]²
  – Unrestricted optimization with respect to {R_m}_1^M is difficult!
  – Region restrictions:
    • Disjoint
    • Cover the space
    • "Simple" (axis-aligned rectangles)
  – Fitted values: ŷ_i = T(x_i)
[Illustration: arbitrary overlapping regions in (x_1, x_2) ✗ vs. a recursive axis-aligned partition into R_1–R_5 ✓]
Decision Trees: Growing Algorithm
• Greedy iterative procedure
  – Start with a single region, i.e., all given data
  – At the m-th iteration:

    for each region R
      for each attribute x_j in R
        for each possible split s_j of x_j
          record change in score when we partition R into R_l and R_r
    choose the (x_j, s_j) giving maximum improvement to the fit
    replace R with R_l; add R_r

  – i.e., a forward stagewise additive procedure
  – When should we stop?
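The split search above can be sketched in a few lines. This is an illustrative sketch (not the tutorial's code) for a single numeric attribute, scoring each candidate split by the drop in squared error when a region is partitioned into R_l and R_r; the function names are my own.

```python
# Greedy split search for one numeric attribute in a regression tree:
# score = reduction in sum of squared errors around the regions' means.

def sse(ys):
    """Sum of squared errors around the region's constant fit (the mean)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(xs, ys):
    """Return (split_point, score_improvement) for one attribute."""
    base = sse(ys)
    pairs = sorted(zip(xs, ys))
    best = (None, 0.0)
    for k in range(1, len(pairs)):
        s = (pairs[k - 1][0] + pairs[k][0]) / 2.0  # midpoint candidate
        left = [y for x, y in pairs if x <= s]
        right = [y for x, y in pairs if x > s]
        gain = base - sse(left) - sse(right)
        if gain > best[1]:
            best = (s, gain)
    return best

# A step in x: the best split should fall between x=2 and x=3.
split, gain = best_split([1, 2, 3, 4], [0.0, 0.0, 1.0, 1.0])
```

The full growing algorithm would run this over every attribute in every current region and take the overall winner.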
Decision Trees: Desirable Data Mining Properties
• Ability to deal with irrelevant inputs
  – i.e., automatic variable subset selection
  – Measure anything you can measure
  – Score provided for selected variables ("importance")
• No data preprocessing needed
  – Naturally handles all types of variables
    • numeric, binary, categorical
  – Invariant under monotone transformations x_j ← g_j(x_j):
    • Variable scales are irrelevant
    • Immune to bad x_j distributions (e.g., outliers)
Decision Trees: Desirable Data Mining Properties (2)
• Computational scalability
  – Relatively fast: O(nN log N)
• Missing value tolerant
  – Moderate loss of accuracy due to missing values
  – Handled via "surrogate" splits
• "Off-the-shelf" procedure
  – Few tunable parameters
• Interpretable model representation
  – Binary tree graphic
Decision Trees: Limitations
• Discontinuous piecewise-constant model
  – Many splits require a lot of data
  – In high dimensions, you often run out of data after a few splits
• Data fragmentation
  – Each split reduces the training data available for subsequent splits
[Plot: a step function F(x) approximating a smooth target.]
Decision Trees: Limitations (2)
• Not good for low-interaction targets F*(x)
  – e.g., an additive, no-interaction target is the worst case for trees:
    F*(x) = a_0 + Σ_{j=1}^{n} a_j x_j = Σ_{j=1}^{n} f_j*(x_j)
  – In order for a variable x_l to enter the model, the tree must split on it
    • Path from root to node is a product of indicators
• Not good for targets F*(x) that depend on many variables
Decision Trees: Limitations (3)
• High variance caused by greedy search strategy (local optima)
  – i.e., small changes in the data (sampling fluctuations) can cause big changes in the tree
  – Errors in upper splits propagate down to all splits below
  – Very deep trees might be questionable
  – Pruning is important
• What to do next?
  – Use other methods when possible
  – Fix up trees: use "ensembles"
    • Maintain advantages while dramatically increasing accuracy
Overview
• In a Nutshell & Timeline
• Predictive Learning
• Decision Trees
  – Regression tree induction
  – Desirable data mining properties
  – Limitations
Model Selection
  – Bias−Variance Tradeoff
  – Cost-complexity pruning
  – Cross-Validation
  – Regularization via shrinkage (LASSO)
Model Selection: What is the "right" size of a tree?
• Dilemma
  – If the tree (# of regions) is too small, the piecewise-constant approximation is too crude (bias) ⇒ increased errors
  – If the tree is too large, it fits the training data too closely (overfitting, increased variance) ⇒ increased errors
[Three scatter plots of the same noisy data against the target function: a one-region fit (too crude), a two-region fit (c_1, c_2), and a three-region fit (c_1, c_2, c_3) that begins to chase the noise.]
Model Selection: Bias−Variance Decomposition
• Suppose Y = f(X) + ε, where ε ~ N(0, σ²)
• Consider the "idealized" aggregate estimator f̄(X) = E_D f̂(X)
  – i.e., the fit f̂(X) averaged over all possible data sets D
• Prediction error of the fit at a point X = x_0, using squared-error loss, is decomposable:
  Err(x_0) = E[ (Y − f̂(x_0))² | X = x_0 ]
           = σ² + E[ f̂(x_0) − f(x_0) ]²
           = σ² + [ E f̂(x_0) − f(x_0) ]² + E[ f̂(x_0) − E f̂(x_0) ]²
           = σ² + Bias²( f̂(x_0) ) + Var( f̂(x_0) )
  – Typically, when bias is low, variance is high, and vice versa
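The decomposition can be checked numerically. This is a hypothetical sketch that treats two small finite sets as the population: fhat_vals are equally likely fits f̂(x_0) across data sets, eps_vals are equally likely zero-mean noise values, and f0 = f(x_0); all names and numbers are mine.

```python
# Verify E[(Y - f̂(x0))^2] = σ² + Bias² + Var over a tiny finite "population".

def decompose(f0, fhat_vals, eps_vals):
    n, m = len(fhat_vals), len(eps_vals)
    # Expected squared error over the product distribution of (fit, noise)
    err = sum((f0 + e - fh) ** 2 for fh in fhat_vals for e in eps_vals) / (n * m)
    fbar = sum(fhat_vals) / n                              # E f̂(x0)
    bias2 = (fbar - f0) ** 2                               # squared bias
    var = sum((fh - fbar) ** 2 for fh in fhat_vals) / n    # variance of the fit
    sigma2 = sum(e ** 2 for e in eps_vals) / m             # noise variance
    return err, sigma2 + bias2 + var

# Unbiased but variable fit: bias 0, variance 1, noise variance 1.
err, decomposed = decompose(f0=3.0, fhat_vals=[2.0, 4.0], eps_vals=[-1.0, 1.0])
```

Here both sides equal 2.0: the three terms account for the whole error exactly.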
Model Selection: Bias−Variance Schematic
[Schematic (adapted from ESL 7.2): truth f(x); realization Y = f(X) + ε; fits f̂(x) scattered within model space ℱ around their average f̄(X). The spread of the fits is the variance; the distance from f̄(X) to f(x) is the model bias.]
Model Selection: Bias−Variance Tradeoff
– Right-sized tree, M*, is where the test error is at a minimum
– Overfitting is exacerbated when the model (fitting procedure) is highly flexible
[Plot (adapted from ESL 7.1): prediction error vs. model complexity (e.g., tree size). Training-sample error decreases monotonically; test-sample error is U-shaped, with high bias / low variance at the left and low bias / high variance at the right; M* marks the minimum.]
Model Selection: Tree Pruning
• Augmented tree score criterion: R̂_α(T) = R̂(T) + α·|T|
  – R̂(T): "risk" of tree T – how well it fits the data; analogous to the residual sum of squares
  – |T|: size of tree – number of terminal nodes; analogous to the model degrees of freedom
  – α: complexity parameter – cost of adding another terminal node to the model
⇒ The complexity term penalizes the increased variance associated with a more complex model
• Goal: find T*_α = argmin_T R̂_α(T)
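The selection rule R̂_α(T) = R̂(T) + α·|T| can be sketched directly. In this illustrative sketch the nested subtree sequence is represented as (risk, number of terminal nodes) pairs; the numbers are made up.

```python
# Cost-complexity pruning: pick the subtree minimizing R̂(T) + α·|T|.

def best_subtree(subtrees, alpha):
    """subtrees: list of (risk, n_terminal_nodes); return the minimizer."""
    return min(subtrees, key=lambda t: t[0] + alpha * t[1])

# Nested sequence: bigger trees fit better (lower risk) but cost more nodes.
seq = [(10.0, 1), (4.0, 3), (2.5, 6), (2.4, 12)]
largest = best_subtree(seq, 0.0)    # alpha = 0 keeps the largest tree
pruned = best_subtree(seq, 1.0)     # moderate alpha trades fit for size
stump = best_subtree(seq, 100.0)    # huge alpha collapses to the root
```

Sweeping α from 0 to ∞ walks through the nested sequence of subtrees described on the next slide.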
Model Selection: Tree Pruning (2)
• Cost-complexity pruning (cont.)
  – α is a "meta" parameter that controls the degree of stabilization
    • Larger values provide increased regularization ⇒ more stable estimates
    • α = 0: largest tree (T_max) ⇒ least stable estimates
    • α = ∞: stump (root only) ⇒ deterministic
  – Varying α produces a nested (finite) sequence of subtrees:
    T_max = T_1 ⊃ T_2 ⊃ … ⊃ root, for α = 0 < α_1 < α_2 < α_3 < …
  – Choose the value α* (and thus the subtree T_{α*}) to minimize future risk R̂(T_α)
Model Selection: Cross−Validation
• Error on the training data is not a useful estimator
  – If a test set is not available, we need an alternative method
• Combine measures of fit to obtain an "honest" estimate of model "quality"
  – Each measure is constructed from a random, non-overlapping subset of the data
  – e.g., 3-fold CV:
    T^(1): train on D_2, D_3; test on D_1
    …
    T^(3): train on D_1, D_2; test on D_3
Model Selection: Cross−Validation (2)
– T^(v): tree built on D − D_v
– υ: {1, …, N} → {1, …, V}: indexing function mapping each observation to its fold
– Cross-validation estimate of prediction error:
  R̂_CV = (1/N) Σ_{i=1}^{N} L( y_i, T^(υ(i))(x_i) )
  i.e., average of the fit measures over the V splits
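The estimate R̂_CV can be sketched with the simplest possible "model". In this illustrative sketch the model trained on D − D_v is just the mean of the training responses, and υ(i) assigns observations to folds round-robin; the names are mine.

```python
# V-fold CV risk estimate: each point is scored by the model that never saw it.

def cv_risk(ys, V):
    N = len(ys)
    fold = [i % V for i in range(N)]          # indexing function υ(i)
    total = 0.0
    for v in range(V):
        train = [y for i, y in enumerate(ys) if fold[i] != v]
        fit = sum(train) / len(train)         # "model" built on D − D_v
        total += sum((y - fit) ** 2 for i, y in enumerate(ys) if fold[i] == v)
    return total / N                          # average held-out loss
```

With a constant response the CV risk is 0; with an alternating response the held-out fits are maximally wrong, so the estimate is honest about that.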
Model Selection: Cross−Validation (3)
• The set of models {T^(v)_α(x)} is indexed by the tuning parameter α (the nested subtree sequence T_max = T_1 ⊃ T_2 ⊃ … ⊃ root), and we have
  R̂_CV(α) = (1/N) Σ_{i=1}^{N} L( y_i, T^(υ(i))_α(x_i) )
• Idea generalization: can "averaging" over a set of trees help correct for overly optimistic trees?
  – i.e., combine fitted values instead of measures of fit ⇒ Ensembles
Model Selection: Regularization via Shrinkage (Lasso − Tibshirani, 1996)
• Consider the linear model F(x) = a_0 + Σ_{j=1}^{n} a_j x_j
• Standard LR coefficient estimation:
  {â_j} = argmin_{ {a_j} } Σ_{i=1}^{N} L( y_i, Σ_j a_j x_{ij} )
• Shrinkage: a continuous variable selection method
  {â_j^Lasso} = argmin_{ {a_j} } Σ_{i=1}^{N} L( y_i, Σ_j a_j x_{ij} ) + λ Σ_{j=1}^{n} |a_j|
  – The "penalty" term provides a (deterministic) value that is independent of the particular random sample
    ⇒ stabilizing influence on the criterion being minimized
  – Promotes reduced variance of the estimated coefficient values
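A minimal sketch of what the L1 penalty does: under the simplifying assumption of orthonormal predictors, the lasso solution is the least-squares coefficient vector with each entry soft-thresholded by λ, which is how shrinkage produces exact zeros. The function names and numbers are mine; thresholding constants depend on the loss-scaling convention.

```python
# Lasso under an orthonormal design: soft-threshold each LS coefficient.

def soft_threshold(a, lam):
    """Shrink a toward zero by lam; set it to zero inside [-lam, lam]."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0

def lasso_orthonormal(a_ls, lam):
    return [soft_threshold(a, lam) for a in a_ls]

# Large coefficients survive (shrunk); the small one is zeroed out.
coefs = lasso_orthonormal([2.5, -0.3, 1.0], lam=0.5)
```

This is the continuous analogue of subset selection: increasing λ drops predictors one by one instead of all-or-nothing.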
Model Selection: Regularization via Shrinkage (2)
• Lasso solutions are nonlinear in y; a quadratic programming algorithm is used to compute them
• The L1 penalty leads to sparse solutions, with many predictors having â_j = 0
  – Often improves prediction accuracy
  – Easier to interpret
  – "Bet on sparsity" principle
• λ is a "tuning" parameter
  – λ = 0: no regularization … least stable estimates
  – λ > 0: continuous coefficient "paths"
⇒ Typically estimated via cross-validation
Model Selection: Regularization via Shrinkage (3)
• Lasso 2D estimation illustration
  – L(·): least squares, i.e.,
    {â_0, â_1, â_2} = argmin Σ_{i=1}^{N} ( y_i − a_0 − a_1 x_{i1} − a_2 x_{i2} )² + λ( |a_1| + |a_2| )
  – [Plot: error-function contours around â^LS and the diamond-shaped constraint region in the (a_1, a_2) plane; â^Lasso sits where a contour first touches the region.]
  – If the first point where the contours hit the constraint region occurs at a corner, then one parameter a_j is zero
Model Selection: Regularization via Shrinkage (4)
• Forward stagewise linear regression (Hastie et al., 2001)
  – Iterative strategy producing solutions that closely approximate the effect of the lasso
  – See also LARS (Efron et al., 2004) and gradient directed regularization for linear regression (Friedman, 2005)

    Initialize a_j = 0 for all j; ε > 0 small; M large
    For m = 1 to M {
      // Select the predictor that best fits the current residuals
      (j*, α*) = argmin_{j, α} Σ_{i=1}^{N} [ y_i − Σ_{l=1}^{n} a_l x_{il} − α x_{ij} ]²
      // Increment/decrement by an infinitesimal amount
      a_{j*} ← a_{j*} + ε · sign(α*)
    }
    Write F̂(x) = Σ_{j=1}^{n} a_j x_j

  • After M < ∞ iterations, many â_j will be 0
  • In general, |â_j(M)| < |â_j^LS|
  • M ≈ 1/λ
Model Selection: Regularization via Shrinkage (5)
• Lasso coefficient "paths" example: coefficients plotted against s ≈ 1/λ
  – See R package "lars"
Model Selection: Regularization Summary
• What is regularization?
  – "any part of model building which takes into account – implicitly or explicitly – the finiteness and imperfection of the data and the limited information in it, which we can term 'variance' in an abstract sense" [Rosset, 2003]
• Forms of regularization
  – Explicit constraints on model complexity
    • Constrained models can equivalently be cast as penalized ones
  – Implicit, through incremental building of the model
  – Choice of robust loss functions
Model Selection: Regularization Summary (2)
[Schematic (adapted from ESL 7.2): as in the bias−variance schematic – truth f(x), realization Y = f(X) + ε, fits f̂(x) in model space ℱ – but with a restricted model space added. The regularized fit f̂^Lasso(X) has slightly larger model bias yet much smaller variance ⇒ smaller prediction error.]
Overview
Ensemble Methods
  – Ensemble Learning & Importance Sampling (ISLE)
  – Generic Ensemble Generation
  – Bagging, Random Forest, AdaBoost, MART
• Rule Ensembles
• Interpretation
Ensemble Methods: Learning
• Model: F(x) = c_0 + Σ_{m=1}^{M} c_m T_m(x)
  – {T_m(x)}_1^M: "basis" functions (or "base learners")
  – i.e., a linear model in a (very) high-dimensional space of derived variables
• Learner characterization: T_m(x) = T(x; p_m)
  – p_m: a specific set of joint parameter values – e.g., split definitions at internal nodes and predictions at terminal nodes
  – {T(x; p)}_{p ∈ P}: function class – i.e., the set of all base learners of the specified family
Ensemble Methods: Learning (2)
• Two-step process – approximate solution to
  {ĉ_m, p̂_m}_0^M = argmin_{ {c_m, p_m} } Σ_{i=1}^{N} L( y_i, c_0 + Σ_{m=1}^{M} c_m T(x_i; p_m) )
  – Step 1: Choose points {p_m}_1^M
    • i.e., select {T_m(x)}_1^M ⊂ {T(x; p)}_{p ∈ P}
  – Step 2: Determine weights {c_m}_0^M
    • e.g., via regularized linear regression
Ensemble Methods: Example
• 2 features, 2 classes
⇒ Another example in Appendix 1
Ensemble Methods: Importance Sampling (Friedman, 2003)
• How to judiciously choose the "basis" functions (i.e., {p_m}_1^M)?
• Goal: find "good" {p_m}_1^M so that F(x; {a_m}_1^M, {p_m}_1^M) ≅ F*(x)
• Connection with numerical integration:
  I = ∫ I(p) dp ≈ Σ_{m=1}^{M} w(p_m) I(p_m)
  vs.
  F(x) = ∫ a(p) T(x; p) dp ≅ Σ_{m=1}^{M} c_m T(x; p_m)
  – Accuracy improves when we choose more points p_m from the region that contributes most to the integral…
Ensemble Methods: Importance Sampling (2)
• Parameter importance measure r(p):
  – Indicative of the relevance of the point p
  – Naive approach: draw the p_m iid – i.e., uniform sampling
  – Better idea: r(p) inversely related to T(x; p)'s "risk"
    • i.e., p_m has high error ⇒ lack of relevance of p_m
  – Single-point vs. group importance
    • i.e., with/without knowledge of the other points that will be used
    • Sequential approximation: p's relevance judged in the context of the (fixed) previously selected points
  – Characterization of r(p): "center" and "width"
Ensemble Methods: Importance Sampling (3)
• Let p* = argmin_p Risk(p)
• Narrow r(p):
  – Ensemble of "strong" base learners {T(x; p_m)}_1^M – i.e., all with Risk(p_m) ≈ Risk(p*)
  – The T(x; p_m)'s yield similar, highly correlated predictions ⇒ unexceptional performance
• Broad r(p):
  – Diverse ensemble – i.e., predictions are not highly correlated with each other
  – However, many "weak" base learners – i.e., Risk(p_m) >> Risk(p*) ⇒ poor performance
⇒ The empirically observed strength−correlation tradeoff can be understood in terms of the width of r(p)
Ensemble Methods: Importance Sampling (4)
• Heuristic sampling strategy r̂(p): sample around p* by iteratively applying small perturbations to the existing problem structure
  – Generating ensemble members T_m(x) = T(x; p_m):
    For m = 1 to M {
      p_m = PERTURB_m { argmin_p E_{xy} L( y, T(x; p) ) }
    }
  – PERTURB{·} is a (random) modification of any of:
    • Data distribution – e.g., by re-weighting the observations
    • Loss function – e.g., by modifying its argument
    • Search algorithm (used to find the minimizing p)
  – The width of r̂(p) is controlled by the degree of perturbation
Ensemble Methods: Generic Ensemble Generation (Friedman, 2003)
• Step 1: Choose {p_m}_1^M via a forward stagewise fitting procedure:

    F_0(x) = 0
    For m = 1 to M {
      p_m = argmin_p Σ_{i ∈ S_m(η)} L( y_i, F_{m−1}(x_i) + T(x_i; p) )
      T_m(x) = T(x; p_m)
      F_m(x) = F_{m−1}(x) + υ·T_m(x)
    }
    Write {T_m(x)}_1^M

  – Algorithm controls: L, η, υ
    • S_m(η): random sub-sample of size η ≤ N ⇒ impacts ensemble "diversity" (a modification of the data distribution)
    • υ: "memory" function, 0 ≤ υ ≤ 1, through F_{m−1}(x) = υ · Σ_{k=1}^{m−1} T_k(x) (a modification of the loss function; the "sequential" approximation)
Ensemble Methods: Generic Ensemble Generation (2)
• Step 2: Choose quadrature coefficients {c_m}_0^M:
  {ĉ_m} = argmin_{ {c_m} } Σ_{i=1}^{N} L( y_i, c_0 + Σ_{m=1}^{M} c_m T_m(x_i) ) + λ Σ_{m=1}^{M} |c_m|
  – Given {T_m(x)}_1^M = {T(x; p_m)}_1^M, the coefficients are obtained by a regularized linear regression
  – Regularization here helps reduce bias (in addition to variance) of the model
  – New fast iterative algorithms exist for various loss functions
    • "Gradient Directed Regularization…", Friedman & Popescu, 2004
Ensemble Methods: Bagging (Breiman, 1996)
• Bagging = Bootstrap Aggregation
• As an ISLE, the generic algorithm with:
  – L(y, ŷ) = (y − ŷ)²
  – η = N/2
  – υ = 0 ⇒ no memory
  – T_m(x): large un-pruned trees
  – c_0 = 0, c_m = 1/M ⇒ simple average, i.e., coefficients not fit to the data
• i.e., perturbation of the data distribution only
• Parallelizable
• See R command ipred:ipredbagg
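The bagging recipe can be sketched end to end with a deliberately trivial base learner. This is an illustrative sketch (names and data are mine): bootstrap the training set, fit each replicate, and average the fitted values with equal weights c_m = 1/M; here the "base learner" is just the sample mean, standing in for an un-pruned tree.

```python
import random

# Bagging sketch: bootstrap, fit, average (c_m = 1/M).

def bagged_fit(ys, M, seed=0):
    rng = random.Random(seed)                  # fixed seed => reproducible
    fits = []
    for _ in range(M):
        boot = [rng.choice(ys) for _ in ys]    # bootstrap sample of size N
        fits.append(sum(boot) / len(boot))     # base learner: constant fit
    return sum(fits) / M                       # simple average of the fits

pred = bagged_fit([1.0, 2.0, 3.0, 4.0], M=200)
```

Each bootstrap fit varies; the average is far more stable than any single one, which is the variance reduction the slides describe.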
Ensemble Methods: Bagging – Example
• Simulated data (source: ESL 8.7.1)
  – Predictor vars: n = 5, x ~ N_5(0, Σ) with pairwise correlation 0.95
  – Response var: two-class, generated according to
    P(y = 1 | x_1 ≤ 0.5) = 0.2,  P(y = 1 | x_1 > 0.5) = 0.8  ⇒ Bayes error rate: 0.2
  – Training: N = 30; 200 bootstrap samples
  – Testing: N = 2000
Ensemble Methods: Bagging – Example (2)
• Bootstrap trees
[Figure: trees fit to successive bootstrap samples (Tree 1, Tree 2, …) with 0/1 leaf labels; the trees differ noticeably from sample to sample.]
Ensemble Methods: Bagging – Example (3)
• Test error
  – Trees have high variance on these data due to the correlation in the predictors
  – Bagging succeeds in smoothing out this variance and hence reducing test error
Ensemble Methods: Bagging – Why it helps?
• Under L(y, ŷ) = (y − ŷ)², averaging reduces variance and leaves bias unchanged
• Consider the "idealized" bagging (aggregate) estimator f̄(x) = E_Z f̂_Z(x)
  – f̂_Z: fit to bootstrap data set Z = {y_i, x_i}_1^N
  – Z is sampled from the actual population distribution (not the training data)
  – We can write:
    E[ (Y − f̂_Z(x))² ] = E[ (Y − f̄(x) + f̄(x) − f̂_Z(x))² ]
                        = E[ (Y − f̄(x))² ] + E[ (f̂_Z(x) − f̄(x))² ]
                        ≥ E[ (Y − f̄(x))² ]
⇒ true population aggregation never increases mean squared error!
⇒ Bagging will often decrease MSE…
Ensemble Methods: Random Forest (Breiman, 2001)
• Bagging + a randomized tree-growing algorithm
  – Subset splitting: as each tree is constructed…
    • Draw a random sample of n_s predictors before each node is split, e.g., n_s = ⌊log₂(n)⌋ + 1
    • Find the best split as usual, but selecting only from this subset of predictors
  ⇒ Increased diversity among {T_m(x)}_1^M – i.e., wider r̂(p)
    • Width (inversely) controlled by n_s
  – Speed improvement over Bagging
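The per-node randomization is tiny to sketch. This illustrative helper (names are mine) draws the n_s = ⌊log₂(n)⌋ + 1 candidate predictors that a single split would then be searched over.

```python
import math
import random

# Random Forest's subset splitting: sample n_s of the n predictors per node.

def split_candidates(n, seed=0):
    n_s = int(math.floor(math.log2(n))) + 1    # default subset size
    rng = random.Random(seed)
    return sorted(rng.sample(range(n), n_s))   # distinct predictor indices

cands = split_candidates(10)   # with n = 10 predictors, n_s = 4
```

Smaller n_s means less overlap between the candidate sets at different nodes and trees, hence wider r̂(p) and a more diverse ensemble.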
Ensemble Methods: Random Forest (2)
• Potential improvements:
  – Different data sampling strategy (not fixed)
  – Fit quadrature coefficients to the data
• See R command randomForest
[Bar chart (simulation study with 100 different target functions; Popescu, 2005): comparative RMS error for Bag, RF, Bag_6_5%_P, and RF_6_5%_P. xxx_6_5%_P: 6-terminal-node trees, 5% samples without replacement, post-processing – i.e., using estimated "optimal" quadrature coefficients. The xxx_6_5%_P variants are significantly faster to build!]
Ensemble Methods: AdaBoost (Freund & Schapire, 1997)
• AdaBoost = Adaptive Boosting
• As an ISLE, the generic algorithm with:
  – L(y, ŷ) = exp(−y·ŷ), y ∈ {−1, 1}
  – η = N ⇒ data perturbed via observation weights
  – υ = 1 ⇒ sequential sampling
  – T_m(x): any "weak" learner
  – c_0 = 0; the c_m are sequential partial regression coefficients

    F_0(x) = 0
    For m = 1 to M {
      (c_m, p_m) = argmin_{c, p} Σ_{i ∈ S_m(η)} L( y_i, F_{m−1}(x_i) + c·T(x_i; p) )
      T_m(x) = T(x; p_m)
      F_m(x) = F_{m−1}(x) + c_m·T_m(x)
    }
    Write {c_m, T_m(x)}_1^M

  – Prediction: ŷ = sign( F_M(x) ) = sign( Σ_{m=1}^{M} c_m T_m(x) )
  – "Real" AdaBoost: extension to the regression case (Friedman, 2000)
  – See R command boost:adaboost
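One AdaBoost round can be sketched numerically (this previews the weight-update equivalence shown on the next slide). The function and toy data are mine: given per-observation weights and a boolean flag for whether the weak learner got each point right, compute the weighted error, the confidence α_m = log((1 − err)/err), and the reweighting.

```python
import math

# One AdaBoost round: weighted error, confidence, and reweighting.

def adaboost_round(weights, correct):
    err = sum(w for w, c in zip(weights, correct) if not c) / sum(weights)
    alpha = math.log((1 - err) / err)
    # Misclassified points are up-weighted by e^alpha; the rest are unchanged.
    new_w = [w if c else w * math.exp(alpha) for w, c in zip(weights, correct)]
    return err, alpha, new_w

# 4 points, one misclassified, uniform starting weights 1/N.
err, alpha, w = adaboost_round([0.25] * 4, [True, True, True, False])
```

The misclassified point's weight triples (0.25 → 0.75), so the next weak learner is forced to concentrate on it.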
Ensemble Methods: AdaBoost (2)
• Equivalence to the forward stagewise fitting procedure (Appendix 2)
  – The AdaBoost algorithm:

    Initialize observation weights w_i^(0) = 1/N
    For m = 1 to M {
      a. Fit a classifier T_m(x) to the training data with weights w_i^(m)
      b. Compute err_m = Σ_{i=1}^{N} w_i^(m) I( y_i ≠ T_m(x_i) ) / Σ_{i=1}^{N} w_i^(m)
      c. Compute α_m = log( (1 − err_m) / err_m )
      d. Set w_i^(m+1) = w_i^(m) · exp( α_m · I( y_i ≠ T_m(x_i) ) )
    }
    Output sign( Σ_{m=1}^{M} α_m T_m(x) )

  – We need to show that p_m = argmin_p (·) in the generic procedure is equivalent to line a. above
  – That c_m = argmin_c (·) is equivalent to line c.
  – And how the weights w_i^(m) are derived/updated
Ensemble Methods: AdaBoost (3)
• Why exponential loss?
  – Implementation convenience
  – Meaningful population minimizer (Appendix 3):
    F*(x) = argmin_{F(x)} E_{Y|x} e^{−Y·F(x)} = (1/2) log [ Pr(Y = 1 | x) / Pr(Y = −1 | x) ]
    • i.e., half the log-odds of Pr(Y = 1 | x)
    • Equivalently, Pr(Y = 1 | x) = 1 / (1 + e^{−2F*(x)})
  – The same population minimizer is obtained with the (negative) binomial log-likelihood loss l(y, F) = log( 1 + e^{−2yF} )
  • However, exponential loss is less robust in noisy settings
  • No natural generalization to K classes
Ensemble Methods: Gradient Boosting (Friedman, 2001)
• Boosting with any differentiable loss criterion L(y, ŷ)
• General η and υ; as an ISLE:
  – c_0 = argmin_c Σ_{i=1}^{N} L(y_i, c)
  – η = N/2 ⇒ sequential sampling
  – υ = 0.1 ⇒ "shrunk" sequential partial regression coefficients
  – T_m(x): any "weak" learner

    F_0(x) = c_0
    For m = 1 to M {
      (c_m, p_m) = argmin_{c, p} Σ_{i ∈ S_m(η)} L( y_i, F_{m−1}(x_i) + c·T(x_i; p) )
      T_m(x) = T(x; p_m)
      F_m(x) = F_{m−1}(x) + υ·c_m·T_m(x)
    }
    Write {(υ·c_m), T_m(x)}_1^M

  – Prediction: ŷ = F_M(x)
  – See R command gbm (Appendix 4)
Ensemble Methods: Gradient Boosting / MART
• MART = Multiple Additive Regression Trees
  – T_m(x): J-terminal-node regression tree (J_m = J for all m)
    T( x; {b_j, R_j}_1^J ) = Σ_{j=1}^{J} b_j I(x ∈ R_j)
• J is a meta-parameter that controls the interaction order allowed by the approximation
  – Adding T_m(x) to the model is like adding J separate (basis) functions:
    F_m(x) = F_{m−1}(x) + c_m T_m(x) = F_{m−1}(x) + c_m Σ_{j=1}^{J} b_{jm} I(x ∈ R_{jm}) = F_{m−1}(x) + Σ_{j=1}^{J} γ_{jm} I(x ∈ R_{jm})
    with γ_{jm} = c_m·b_{jm}  ⇒ computationally efficient update
  – J = 2 ⇒ "main effects" only
  – J = 3 ⇒ two-variable interactions allowed (i.e., second-order effects)
  – … the value of J should reflect the dominant interaction level of the target function
(Appendix 5)
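The update F_m = F_{m−1} + υ·T_m can be sketched with the crudest possible base learner. In this illustrative sketch (names mine), the "tree" is a single constant fit to the current residuals, which are the negative gradient for squared-error loss; it shows how shrinkage υ makes the fit approach the target gradually.

```python
# Gradient boosting for squared error with shrinkage nu; base learner is
# a 1-node "tree" (the mean of the current residuals).

def boost(ys, nu=0.1, M=100):
    F = [0.0] * len(ys)
    for _ in range(M):
        resid = [y - f for y, f in zip(ys, F)]   # negative gradient for L2 loss
        t = sum(resid) / len(resid)              # constant base learner
        F = [f + nu * t for f in F]              # shrunk update F_m = F_{m-1} + nu*t
    return F

F = boost([4.0, 4.0, 4.0], nu=0.1, M=100)
```

After M rounds the fit is 4·(1 − (1 − υ)^M): close to, but never overshooting, the target, which is the regularizing effect of small υ.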
Ensemble Methods: Gradient Boosting / MART (2)
• GB/MART doesn't choose quadrature coefficients in a separate step
  – i.e., it omits the ISLE recipe:
    {ĉ_m} = argmin_{ {c_m} } Σ_{i=1}^{N} L( y_i, c_0 + Σ_{m=1}^{M} c_m T_m(x_i) ) + λ Σ_{m=1}^{M} |c_m|
• However, boosting is still understood as an "incremental forward stagewise regression" procedure with a Lasso penalty (shrinkage controlled by υ)
  – Note the similarity with the forward stagewise linear regression procedure, with {T_m(x)}_1^M as predictors
Ensemble Methods: Parallel vs. Sequential Ensembles
• Sequential ISLEs tend to perform better than parallel ones
  – Consistent with results observed in classical Monte Carlo integration
[Bar chart (simulation study with 100 different target functions; Popescu, 2005): comparative RMS error for "parallel" ensembles (Bag, RF, Bag_6_5%_P, RF_6_5%_P) vs. "sequential" ones (MART, Seq_0.01_20%_P, Seq_0.1_50%_P). xxx_6_5%_P: 6-terminal-node trees, 5% samples without replacement, post-processing. Seq_η_ν%_P: "sequential" ensemble, 6-terminal-node trees, "memory" factor η, ν% samples without replacement, post-processing.]
Overview
• Ensemble Methods
  – Ensemble Learning & Importance Sampling (ISLE)
  – Generic Ensemble Generation
  – Bagging, Random Forest, AdaBoost, MART
Rule Ensembles
• Interpretation
Ensemble Methods: Rule Ensembles (Friedman & Popescu, 2005)
• Ensemble recap:
  – Model: F(x) = a_0 + Σ_{m=1}^{M} a_m f_m(x)
  – {f_m(x)}_1^M: "basis" functions (or "base learners")
    • Derived predictors capture non-linearities and interactions
  – 2-stage fitting process:
    i. generate basis functions
    ii. post-fit to the data via regularized regression
  – The simple representation offered by a single tree is no longer available
Ensemble Methods: Rule Ensembles (2)
• Trees as collections of conjunctive rules:
  T_m(x) = Σ_{j=1}^{J} ĉ_{jm} I(x ∈ R̂_{jm})
  [Example tree on (x_1, x_2) with cut points at 0, 15, 22, and 27, inducing regions R_1–R_5 and rules such as:
    R_1 ⇒ r_1(x) = I(x_2 > 22) · I(x_1 > 27)
    R_2 ⇒ r_2(x) = I(x_2 > 22) · I(0 ≤ x_1 ≤ 27)
    … ]
  – These simple rules, r_m(x) ∈ {0, 1}, can be used as base learners
  – The main motivation is interpretability
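A rule really is just a product of split indicators along the path to a region. This illustrative sketch uses the two cut points of R_1 from the example above (the function name is mine):

```python
# A conjunctive rule as a product of indicator functions: r(x) ∈ {0, 1}.

def r1(x1, x2):
    return int(x2 > 22) * int(x1 > 27)   # both conditions must hold

inside = r1(30, 25)    # satisfies both splits -> rule fires
outside = r1(30, 20)   # fails the x2 > 22 split -> rule is off
```

In the rule ensemble, each such 0/1 function becomes one derived predictor in a linear model.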
Ensemble Methods: Rule Ensembles (3)
• Rule-based model: F(x) = a_0 + Σ_m a_m r_m(x)
  – Still a piecewise-constant model
    • Linear targets can still be problematic…
  – We can complement the non-linear rules with purely linear terms:
    F(x) = a_0 + Σ_m a_m r_m(x) + Σ_j b_j x_j
    • Original continuous variables x_j can be replaced by their "winsorized" versions
Ensemble Methods: Rule Ensembles (4)
• Rule generation
  – Use some approximate optimization approach to solve
    p_m = argmin_p Σ_{i ∈ S_m(η)} L( y_i, F_{m−1}(x_i) + r(x_i; p) )
    where p_m are the split definitions for rule r_m(x)
  – Take advantage of a decision tree ensemble
    • e.g., one rule for each terminal node in each tree
    • In the case of shallow trees (boosting), regions corresponding to non-terminal nodes can also be included
  – Each J-terminal-node tree T_m(x) generates 2 × (J − 1) rules
Ensemble Methods: Rule Ensembles (5)
• Rule fitting
  – Linear regularized procedure:
    ({â_k}, {b̂_j}) = argmin_{ {a_k}, {b_j} } Σ_{i=1}^{N} L( y_i, a_0 + Σ_{k=1}^{K} a_k r_k(x_i) + Σ_{j=1}^{p} b_j x_{ij} ) + λ ( Σ_{k=1}^{K} |a_k| + Σ_{j=1}^{p} |b_j| )
    • K = Σ_{m=1}^{M} 2 × (J_m − 1): total number of rules
    • p ≤ n: total number of linear terms
  – Tree size controls rule "complexity"
    • A J-terminal-node tree can generate rules involving up to (J − 1) factors
    • Modeling J-th order interactions requires rules with J or more factors
Overview
• Ensemble Methods
  – Ensemble Learning & Importance Sampling (ISLE)
  – Generic Ensemble Generation
  – Bagging, Random Forest, AdaBoost, MART
• Rule Ensembles
Interpretation
Ensemble Methods: Interpretation
• Importance scores
  – Which particular variables are the most influential?
    • i.e., relative influence or contribution in predicting the response
  – Higher-importance variables are more likely to be of interest
  – Detection of "masking"
• Interaction statistic
  – Which variables are involved in interactions with other variables?
  – Strength and degrees of those interactions
• Partial dependence plots
  – Nature of the dependence of the response on influential inputs
(Appendix 6)
Ensemble Methods: Interpretation – Example
• Simulated data (source: CART)
  – Predictor vars: n = 10
    p(x_1 = −1) = p(x_1 = 1) = 1/2
    p(x_m = −1) = p(x_m = 0) = p(x_m = 1) = 1/3, m = 2, …, 10
  – Response var:
    y = 1 + 3x_2 + 3x_3 + 2x_4 + z,  if x_1 = 1
    y = −1 + 3x_5 + 3x_6 + 2x_7 + z,  if x_1 = −1
    with z ~ N(0, 2); x_8, x_9, x_10 are "noise"
  – Training: N = 200
Ensemble Methods: Interpretation – Example (2)
• Single tree vs. rule ensemble – top rules:

    Imp   Coef    Supp   Rule
    100    1.80   0.30   X1 = 1 & X2 ∈ {0, 1}
     95   −2.25   0.14   X1 = −1 & X5 ∈ {−1}
     83   −1.59   0.24   X1 = 1 & X2 ∈ {−1, 0} & X3 ∈ {−1, 0}
     71   −1.00   0.35   X1 = −1 & X6 ∈ {−1}
     62    1.37   0.17   X1 = 1 & X2 ∈ {0, 1} & X3 ∈ {0, 1}
     57    1.25   0.35   X1 = −1 & X6 ∈ {−1, 0}
     46    1.00   0.11   X1 = −1 & X5 ∈ {1} & X7 ∈ {0, 1}
     42    0.91   0.51   X1 = 1 & X4 ∈ {1}
     40   −0.67   0.51   X1 = −1
Ensemble Methods: Interpretation – Example (3)
• Variable importance
  – See R package "rulefit"
82© 2007 Seni & Elder KDD07
Ensemble Methods
Interpretation − Example (4)
• Interaction statistic – “null” adjusted
83© 2007 Seni & Elder KDD07
Ensemble Methods
Interpretation − Example (5)
• Two-variable interaction statistics: $(X_5, \cdot)$ and $(X_1, \cdot)$
84© 2007 Seni & Elder KDD07
Ensemble Methods
Interpretation − Example (7)
• Two-Variable Partial Dependencies
– Effect of $x_j$ and $x_k$ on the response after accounting for the (average) effects of the other variables…

$$y = \begin{cases} 3 + 3x_2 + 2x_3 + x_4 + z & \text{if } x_1 = 1 \\ -3 + 3x_5 + 2x_6 + x_7 + z & \text{if } x_1 = -1 \end{cases}$$
Ŧ: Here translated to have a minimum value of zero
85© 2007 Seni & Elder KDD07
References
1. Breiman, L., Friedman, J.H., Olshen, R., and Stone, C. (1993). Classification and Regression Trees. Chapman & Hall/CRC.
2. Breiman, L. (1995). Better Subset Regression Using the Nonnegative Garrote. Technometrics, 37(4):373-384.
3. Breiman, L. (1996). Bagging Predictors. Machine Learning, 24(2):123-140.
4. Breiman, L. (1998). Arcing Classifiers. Annals of Statistics 26(2):801-849.
5. Breiman, L. (2001). Random Forests, random features. Technical Report, University of California, Berkeley.
6. Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least Angle Regression. Annals of Statistics, 32(2):407–499.
7. Elder, J. (2003). The Generalization Paradox of Ensembles. Journal of Computational and Graphical Statistics, 12(4):853-864.
8. Freund, Y. and Schapire, R.E. (1996). Experiments with a new boosting algorithm. Machine Learning: Proc. of the 13th Intl. Conference, Morgan Kauffman, San Francisco, 148-156.
9. Freund, Y. and Schapire, R.E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139.
10. Friedman, J.H. (1999). Stochastic gradient boosting. Technical Report Department of Statistics, Stanford University.
86© 2007 Seni & Elder KDD07
References (2)
11. Friedman, J., Hastie, T. and Tibshirani, R. (2000). Additive Logistic Regression: a Statistical View of Boosting. Annals of Statistics, 28(2):337-407.
12. Friedman, J. (2001). Greedy function approximation: the gradient boosting machine, Annals of Statistics, 29:1189–1232.
13. Friedman, J. and Popescu, B. E. (2003). Importance Sampled Learning Ensembles. Technical Report, Stanford University.
14. Friedman, J. and Popescu, B. E. (2004). Gradient directed regularization for linear regression and classification. Technical Report, Stanford University.
15. Friedman, J. and Popescu, B. E. (2005). Predictive learning via rule ensembles. Technical Report, Stanford University.
16. Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning (ESL) − Data Mining, Inference and Prediction. Springer.
17. Ho, T.K. (1995). Random Decision Forests. Proc. of the 3rd Intl. Conference on Document Analysis and Recognition, 278-282.
18. Kleinberg, E. (2000). On the Algorithmic Implementation of Stochastic Discrimination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(5):473-490.
19. Rosset, S. (2003). Topics in Regularization and Boosting. PhD Thesis, Stanford University.
87© 2007 Seni & Elder KDD07
References (3)
20. Seni, G., Yang, E. and Akar, S. (2007). Yield Modeling with Rule Ensembles. Proc. of the 18th IEEE/SEMI Advanced Semiconductor Manufacturing Conference.
21. Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society B, 58:267-288.
22. Zhao, P. and Yu, B. (2005). Boosted lasso. Proc. of the SIAM Intl. Conference on Data Mining, Newport Beach, CA, 35-44.
23. Zou, H. and Hastie, T. (2005). Regularization and Variable Selection via the Elastic Net. Keynote Address, SIAM workshop on Variable Selection, Newport Beach, CA.
88© 2007 Seni & Elder KDD07
Appendices
1. Visualizing Bagging and AdaBoost Decision Boundaries
2. On AdaBoost: Equivalence to Forward Stagewise Fitting Procedure
3. On AdaBoost: Prove the population minimizer of exponential loss is half the log-odds of Pr(Y = 1 | x)
4. On Gradient Boosting: Solving for a robust loss criterion
5. On MART: tree-specific optimization
• LAD-regression algorithm
6. Interpretation Statistics for Ensemble Methods
7. On Complexity and Generalized Degrees of Freedom
89© 2007 Seni & Elder KDD07
Appendix 1
Visualizing Bagging and AdaBoost (2-dimensional, 2-class example)
90© 2007 Seni & Elder KDD07
Appendix 1
Decision boundary of a single tree
46
91© 2007 Seni & Elder KDD07
Appendix 1
100 bagged trees lead to a smoother boundary
92© 2007 Seni & Elder KDD07
Appendix 1
AdaBoost, after one iteration (CART splits; larger points have greater weight)
47
93© 2007 Seni & Elder KDD07
Appendix 1
After 3 iterations of AdaBoost
94© 2007 Seni & Elder KDD07
Appendix 1
After 20 iterations of AdaBoost
95© 2007 Seni & Elder KDD07
Appendix 1
Decision boundary after 100 iterations of AdaBoost
96© 2007 Seni & Elder KDD07
Appendix 2
On AdaBoost – Equivalence to FSF Procedure

• We have

$$(c_m, \mathbf{p}_m) = \arg\min_{c,\mathbf{p}} \sum_{i=1}^{N} L\big(y_i,\; F_{m-1}(\mathbf{x}_i) + c \cdot T(\mathbf{x}_i; \mathbf{p})\big) \qquad (1)$$

With $L(y, \hat y) = \exp(-y \cdot \hat y)$:

$$(c_m, \mathbf{p}_m) = \arg\min_{c,\mathbf{p}} \sum_{i=1}^{N} \exp\big(-y_i F_{m-1}(\mathbf{x}_i)\big) \cdot \exp\big(-c \cdot y_i\, T(\mathbf{x}_i; \mathbf{p})\big) = \arg\min_{c,\mathbf{p}} \sum_{i=1}^{N} w_i^{(m)} \exp\big(-c \cdot y_i\, T(\mathbf{x}_i; \mathbf{p})\big)$$

– $w_i^{(m)} = e^{-y_i F_{m-1}(\mathbf{x}_i)}$ doesn't depend on $c$ or $\mathbf{p}$, thus can be regarded as an observation weight
– Solution to (1) can be obtained in two steps:

• Step 1: given $c$, solve for $T(\mathbf{x}; \mathbf{p}_m)$:

$$T_m = \arg\min_{T} \sum_{i=1}^{N} w_i^{(m)} \exp\big(-c \cdot y_i\, T(\mathbf{x}_i)\big) \;\Rightarrow\; T_m = \arg\min_{T} \sum_{i=1}^{N} w_i^{(m)}\, I\big(y_i \neq T(\mathbf{x}_i)\big)$$

• Step 2: given $T_m$, solve for $c$:

$$c_m = \arg\min_{c} \sum_{i=1}^{N} w_i^{(m)} \exp\big(-c \cdot y_i\, T_m(\mathbf{x}_i)\big) \;\Rightarrow\; c_m = \frac{1}{2}\log\frac{1 - err_m}{err_m}, \quad \text{where } err_m = \frac{\sum_{i=1}^{N} w_i^{(m)}\, I\big(y_i \neq T_m(\mathbf{x}_i)\big)}{\sum_{i=1}^{N} w_i^{(m)}}$$
97© 2007 Seni & Elder KDD07
Appendix 2
On AdaBoost – Equivalence to FSF Procedure (2)

• Step 1: given $c$, solve for $T(\mathbf{x}; \mathbf{p}_m)$:

$$T_m = \arg\min_{T} \sum_{i=1}^{N} w_i^{(m)} \exp\big(-c \cdot y_i\, T(\mathbf{x}_i)\big) = \arg\min_{T} \left[ e^{-c} \sum_{y_i = T(\mathbf{x}_i)} w_i^{(m)} + e^{c} \sum_{y_i \neq T(\mathbf{x}_i)} w_i^{(m)} \right]$$

$$= \arg\min_{T} \left[ \big(e^{c} - e^{-c}\big) \sum_{i=1}^{N} w_i^{(m)}\, I\big(y_i \neq T(\mathbf{x}_i)\big) + e^{-c} \sum_{i=1}^{N} w_i^{(m)} \right] \quad (\text{second term: constant, doesn't depend on } T)$$

$$= \arg\min_{T} \left[ \sum_{i=1}^{N} w_i^{(m)}\, I\big(y_i \neq T(\mathbf{x}_i)\big) \right]$$

– i.e., $T_m$ is the classifier that minimizes the weighted error rate (line a.)

• Step 2: given $T_m$, solve for $c$:

$$c_m = \arg\min_{c} \left[ \big(e^{c} - e^{-c}\big) \sum_{i=1}^{N} w_i^{(m)}\, I\big(y_i \neq T_m(\mathbf{x}_i)\big) + e^{-c} \sum_{i=1}^{N} w_i^{(m)} \right] \;\Rightarrow\; \text{compute the derivative with respect to } c \text{ and set it to zero}$$
98© 2007 Seni & Elder KDD07
Appendix 2
On AdaBoost – Equivalence to FSF Procedure (3)

• Step 2 (cont.): setting the derivative to zero, with $A_m = \sum_{i=1}^{N} w_i^{(m)} I(y_i \neq T_m(\mathbf{x}_i))$ and $W_m = \sum_{i=1}^{N} w_i^{(m)}$:

$$\frac{\partial}{\partial c}\left[\big(e^{c} - e^{-c}\big) A_m + e^{-c}\, W_m \right] = \big(e^{c} + e^{-c}\big) A_m - e^{-c}\, W_m = 0$$

$$\Rightarrow\; \big(e^{2c} + 1\big) A_m = W_m \qquad (\text{dividing by } e^{-c})$$

$$\Rightarrow\; e^{2c} = \frac{W_m - A_m}{A_m} \;\Rightarrow\; c_m = \frac{1}{2}\ln\frac{\sum_{i=1}^{N} w_i^{(m)} - \sum_{i=1}^{N} w_i^{(m)} I\big(y_i \neq T_m(\mathbf{x}_i)\big)}{\sum_{i=1}^{N} w_i^{(m)} I\big(y_i \neq T_m(\mathbf{x}_i)\big)}$$

• and

$$\frac{1 - err_m}{err_m} = \frac{\sum_{i=1}^{N} w_i^{(m)} - \sum_{i=1}^{N} w_i^{(m)} I\big(y_i \neq T_m(\mathbf{x}_i)\big)}{\sum_{i=1}^{N} w_i^{(m)} I\big(y_i \neq T_m(\mathbf{x}_i)\big)} \;\Rightarrow\; c_m = \frac{1}{2}\ln\frac{1 - err_m}{err_m}$$
99© 2007 Seni & Elder KDD07
Appendix 2
On AdaBoost – Equivalence to FSF Procedure (4)

• Finally, the weight update expression. With $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + c_m T_m(\mathbf{x})$:

$$w_i^{(m+1)} = e^{-y_i F_m(\mathbf{x}_i)} = e^{-y_i F_{m-1}(\mathbf{x}_i)} \cdot e^{-c_m\, y_i\, T_m(\mathbf{x}_i)} = w_i^{(m)} \cdot e^{-c_m\, y_i\, T_m(\mathbf{x}_i)}$$

Using $-y_i\, T_m(\mathbf{x}_i) = 2\, I\big(y_i \neq T_m(\mathbf{x}_i)\big) - 1$:

$$w_i^{(m+1)} = w_i^{(m)} \cdot e^{\alpha_m\, I(y_i \neq T_m(\mathbf{x}_i))} \cdot e^{-c_m}, \qquad \alpha_m = 2\, c_m$$

– $\alpha_m = 2 c_m$ is the quantity in line 2c
– $e^{-c_m}$ multiplies all weights… has no effect
– Equivalence to d. ✓

⇒ AdaBoost minimizes the exponential loss with a forward-stagewise additive modeling approach!
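One boosting round — the weighted error, $c_m$, and the weight update just derived — can be exercised on a toy case; a minimal sketch (the function name is ours):

```python
import numpy as np

def adaboost_round(w, y, pred):
    """One AdaBoost step for labels y in {-1,+1} and a weak classifier's
    predictions: returns c_m and the (normalized) updated weights."""
    miss = (y != pred).astype(float)
    err = np.sum(w * miss) / np.sum(w)              # err_m
    c = 0.5 * np.log((1 - err) / err)               # c_m = (1/2) log((1-err)/err)
    # w^(m+1) = w^(m) * exp(2*c*I(y != T(x))) * exp(-c); the exp(-c)
    # factor multiplies every weight equally, so it can be dropped
    w_new = w * np.exp(2 * c * miss)
    return c, w_new / np.sum(w_new)

y = np.array([1, 1, -1, -1])
pred = np.array([1, -1, -1, -1])                    # one mistake
c, w_new = adaboost_round(np.full(4, 0.25), y, pred)
```

The misclassified point's weight grows from 1/4 to 1/2 of the total, exactly as the update formula prescribes.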
100© 2007 Seni & Elder KDD07
Appendix 3
On AdaBoost – Population Minimizer

• Prove:

$$F^{*}(\mathbf{x}) = \arg\min_{F(\mathbf{x})} \mathrm{E}_{Y|\mathbf{x}}\big(e^{-Y \cdot F(\mathbf{x})}\big) = \frac{1}{2} \log \frac{\Pr(Y = 1 \mid \mathbf{x})}{\Pr(Y = -1 \mid \mathbf{x})}$$

• Derivation:

$$\mathrm{E}\big(e^{-y f(\mathbf{x})} \mid \mathbf{x}\big) = \Pr(Y = 1 \mid \mathbf{x}) \cdot e^{-f(\mathbf{x})} + \Pr(Y = -1 \mid \mathbf{x}) \cdot e^{f(\mathbf{x})}$$

$$\Rightarrow\; \frac{\partial\, \mathrm{E}\big(e^{-y f(\mathbf{x})} \mid \mathbf{x}\big)}{\partial f(\mathbf{x})} = -\Pr(Y = 1 \mid \mathbf{x}) \cdot e^{-f(\mathbf{x})} + \Pr(Y = -1 \mid \mathbf{x}) \cdot e^{f(\mathbf{x})} = 0$$

$$\Rightarrow\; e^{2 f(\mathbf{x})} = \frac{\Pr(Y = 1 \mid \mathbf{x})}{\Pr(Y = -1 \mid \mathbf{x})} \;\Rightarrow\; 2 f(\mathbf{x}) = \ln \Pr(Y = 1 \mid \mathbf{x}) - \ln \Pr(Y = -1 \mid \mathbf{x})$$

$$\Rightarrow\; f(\mathbf{x}) = \frac{1}{2} \ln \frac{\Pr(Y = 1 \mid \mathbf{x})}{\Pr(Y = -1 \mid \mathbf{x})}$$
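The half-log-odds result can be verified numerically by brute-force minimization over a grid; a quick sketch:

```python
import numpy as np

def expected_exp_loss(f, p):
    """E[exp(-Y f)] when Pr(Y=1|x) = p and Pr(Y=-1|x) = 1 - p."""
    return p * np.exp(-f) + (1 - p) * np.exp(f)

p = 0.8
f_star = 0.5 * np.log(p / (1 - p))        # claimed minimizer: half the log-odds
grid = np.linspace(-3, 3, 2001)
f_grid = grid[np.argmin(expected_exp_loss(grid, p))]   # brute-force minimum
```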
101© 2007 Seni & Elder KDD07
Appendix 4
On Gradient Boosting

• Solving

$$(c_m, \mathbf{p}_m) = \arg\min_{c,\mathbf{p}} \sum_{i=1}^{N} L\big(y_i,\; F_{m-1}(\mathbf{x}_i) + c \cdot T(\mathbf{x}_i; \mathbf{p})\big)$$

for a robust loss criterion (e.g., absolute loss, binomial deviance) requires use of a "surrogate", more convenient, loss $\tilde{L}(y, \hat y)$

• Like before, we solve in two steps:
– Step 1: find $T(\mathbf{x}; \mathbf{p}_m)$ … here we use $\tilde{L}$:

$$\mathbf{p}_m = \arg\min_{\mathbf{p}} \sum_{i=1}^{N} \tilde{L}\big(y_i,\; F_{m-1}(\mathbf{x}_i) + c \cdot T(\mathbf{x}_i; \mathbf{p})\big)$$

– Step 2: given $T_m$, solve for $c$:

$$c_m = \arg\min_{c} \sum_{i=1}^{N} L\big(y_i,\; F_{m-1}(\mathbf{x}_i) + c \cdot T(\mathbf{x}_i; \mathbf{p}_m)\big)$$
102© 2007 Seni & Elder KDD07
Appendix 4
On Gradient Boosting (2)

• $\tilde{L}(y, \hat y)$ is derived from an analogy to numerical optimization in function space
– Learning: $\hat{\mathbf{F}} = \arg\min_{\mathbf{F}} R(\mathbf{F})$ — i.e., minimum "risk"
– Each possible $\mathbf{F}$ is a "point" in $\mathbb{R}^N$ — i.e., $\mathbf{F} = \big(F(\mathbf{x}_1), F(\mathbf{x}_2), \ldots, F(\mathbf{x}_N)\big)$

[Figure: contours of $R(\mathbf{F})$ over coordinates $F(\mathbf{x}_1)$ and $F(\mathbf{x}_2)$]
103© 2007 Seni & Elder KDD07
Appendix 4
On Gradient Boosting (3)

• Steepest descent in function space ("non-parametric" view)

$$\mathbf{F}_0: \text{initial guess}; \qquad \mathbf{F}_M = \sum_{m=0}^{M} \mathbf{f}_m; \qquad \mathbf{f}_m = -\rho_m \nabla R(\mathbf{F}_{m-1})$$

– Gradient components:

$$\nabla R_m = \left[\frac{\partial R(\mathbf{F})}{\partial F(\mathbf{x}_i)}\right]_{\mathbf{F} = \mathbf{F}_{m-1}} = \left[\frac{\partial L\big(y_i, F(\mathbf{x}_i)\big)}{\partial F(\mathbf{x}_i)}\right]_{\mathbf{F} = \mathbf{F}_{m-1}}, \qquad i = 1, \ldots, N$$

– Step size (line search):

$$\rho_m = \arg\min_{\rho} R\big(\mathbf{F}_{m-1} - \rho \nabla R_m\big) \;\Rightarrow\; \text{like "Step 2" before}$$
104© 2007 Seni & Elder KDD07
Appendix 4
On Gradient Boosting (4)

• Problem: $\nabla R(\mathbf{F})$ is defined on the training data only, but $\hat{F}_M$ must generalize
– Select a parameterized function $T(\mathbf{x}; \mathbf{p})$ (defined for all $\mathbf{x}$)
– Choose $\mathbf{p}$ so that $T(\mathbf{x}; \mathbf{p})$ is most "parallel" to $-\nabla R(\mathbf{F})$
  • Geometric interpretation of correlation: $\cos\theta = \mathrm{corr}\big(\{-\nabla R(F(\mathbf{x}_i))\}_{i=1}^{N},\, \{T(\mathbf{x}_i; \mathbf{p})\}_{i=1}^{N}\big)$
  • The most highly correlated $T(\mathbf{x}; \mathbf{p})$ is found from the solution $(\mathbf{p}_m, \beta)$:

$$\mathbf{p}_m = \arg\min_{\mathbf{p},\beta} \sum_{i=1}^{N} \big({-\nabla R(F(\mathbf{x}_i))} - \beta \cdot T(\mathbf{x}_i; \mathbf{p})\big)^2$$

– Surrogate criterion $\tilde{L}$:

$$\tilde{y}_i = -\left[\frac{\partial L\big(y_i, F(\mathbf{x}_i)\big)}{\partial F(\mathbf{x}_i)}\right]_{\mathbf{F} = \mathbf{F}_{m-1}} \;\Rightarrow\; \text{least-squares minimization}$$
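Choosing the base learner most "parallel" to the negative gradient reduces to a least-squares fit of the pseudo-responses; a sketch with a one-dimensional stump (the helper name is ours):

```python
import numpy as np

def best_parallel_stump(x, neg_grad):
    """Choose the stump split most 'parallel' to the negative gradient:
    the least-squares fit of neg_grad by a two-region mean function."""
    best = (np.inf, None)
    for s in np.unique(x)[:-1]:
        left = x <= s
        fit = np.where(left, neg_grad[left].mean(), neg_grad[~left].mean())
        sse = np.sum((neg_grad - fit) ** 2)
        if sse < best[0]:
            best = (sse, s)
    return best[1]

x = np.array([0., 1., 2., 3.])
g = np.array([-1., -1., 2., 2.])     # pseudo-responses ~y_i
s = best_parallel_stump(x, g)        # split x <= 1 separates the two levels
```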
105© 2007 Seni & Elder KDD07
Appendix 5
On MART

• Adding $T_m(\mathbf{x})$ to the model is like adding $J$ separate (basis) functions:

$$T\big(\mathbf{x}; \{b_j, R_j\}_1^J\big) = \sum_{j=1}^{J} b_j\, I(\mathbf{x} \in R_j)$$

$$\Rightarrow\; F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + c_m T_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \sum_{j=1}^{J} c_m b_{jm}\, I(\mathbf{x} \in R_{jm}) = F_{m-1}(\mathbf{x}) + \sum_{j=1}^{J} \gamma_{jm}\, I(\mathbf{x} \in R_{jm})$$

with $\gamma_{jm} = c_m \cdot b_{jm}$

• Since the regions $R_{jm}$ are disjoint, we can do separate updates in each terminal node
– Step 2 was:

$$c_m = \arg\min_{c} \sum_{i=1}^{N} L\big(y_i,\; F_{m-1}(\mathbf{x}_i) + c \cdot T(\mathbf{x}_i; \mathbf{p}_m)\big)$$

  now:

$$\gamma_{jm} = \arg\min_{\gamma} \sum_{\mathbf{x}_i \in R_{jm}} L\big(y_i,\; F_{m-1}(\mathbf{x}_i) + \gamma\big)$$

  i.e., an optimal constant update in each terminal region
106© 2007 Seni & Elder KDD07
Appendix 5
On MART (2)

• Summary:

$$F_0(\mathbf{x}) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$$

For $m = 1$ to $M$ {

  // Step 1: find $T_m(\mathbf{x})$ using the surrogate criterion $\tilde{L}$

$$\tilde{y}_i = -\left[\frac{\partial L\big(y_i, F(\mathbf{x}_i)\big)}{\partial F(\mathbf{x}_i)}\right]_{\mathbf{F} = \mathbf{F}_{m-1}}$$

$$\{R_{jm}\}_1^J = \arg\min_{\{b_j, R_j\}_1^J} \sum_{i=1}^{N} \left[\tilde{y}_i - \sum_{j=1}^{J} b_j\, I(\mathbf{x}_i \in R_j)\right]^2$$

  // Step 2: find coefficients

$$\hat{\gamma}_{jm} = \arg\min_{\gamma} \sum_{\mathbf{x}_i \in R_{jm}} L\big(y_i,\; F_{m-1}(\mathbf{x}_i) + \gamma\big)$$

  // Update expansion ($\nu$ = shrinkage parameter)

$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \nu \cdot \sum_{j=1}^{J} \hat{\gamma}_{jm}\, I(\mathbf{x} \in R_{jm})$$

}
write $\hat{F}(\mathbf{x}) = F_M(\mathbf{x})$
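The loop above can be sketched for squared loss, where the pseudo-response is simply the residual and the optimal constant in each region is the mean residual (a stump stands in for the regression tree; names are ours):

```python
import numpy as np

def fit_stump_regions(x, t):
    """Least-squares stump on 1-d x: the split threshold minimizing SSE."""
    best = (np.inf, None)
    for s in np.unique(x)[:-1]:
        left = x <= s
        sse = t[left].var() * left.sum() + t[~left].var() * (~left).sum()
        if sse < best[0]:
            best = (sse, s)
    return best[1]

def mart_step(x, y, F, nu=0.1):
    """One MART iteration for squared loss L = (y - F)^2 / 2: the
    pseudo-response is the residual; the optimal constant update in
    each terminal region is the mean residual there."""
    resid = y - F                        # ~y_i = -dL/dF = y_i - F(x_i)
    s = fit_stump_regions(x, resid)
    left = x <= s
    gamma = np.where(left, resid[left].mean(), resid[~left].mean())
    return F + nu * gamma                # F_m = F_{m-1} + nu * sum_j gamma_jm I(.)

x = np.array([0., 1., 2., 3.])
y = np.array([0., 0., 1., 1.])
F = np.full_like(y, y.mean())            # F_0: best constant
for _ in range(200):
    F = mart_step(x, y, F)
```

With shrinkage ν = 0.1, the fit approaches the targets geometrically over the iterations.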
107© 2007 Seni & Elder KDD07
Appendix 5
On MART (3)

• LAD-regression: $L(y, F) = |y - F|$
– More robust than $(y - F)^2$
– Resistant to outliers in $y$ … trees already providing resistance to outliers in $\mathbf{x}$
– Algorithm:

$$F_0(\mathbf{x}) = \mathrm{median}\{y_i\}_1^N$$

For $m = 1$ to $M$ {

  // Step 1: find $T_m(\mathbf{x})$

$$\tilde{y}_i = \mathrm{sign}\big(y_i - F_{m-1}(\mathbf{x}_i)\big)$$

$$\{R_{jm}\}_1^J = \text{terminal nodes of a } J\text{-node LS-regression tree fit to } \{\tilde{y}_i, \mathbf{x}_i\}_1^N$$

  // Step 2: find coefficients

$$\hat{\gamma}_{jm} = \mathrm{median}_{\mathbf{x}_i \in R_{jm}}\big\{y_i - F_{m-1}(\mathbf{x}_i)\big\}, \qquad j = 1, \ldots, J$$

  // Update expansion

$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \nu \cdot \sum_{j=1}^{J} \hat{\gamma}_{jm}\, I(\mathbf{x} \in R_{jm})$$

}
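The LAD loop can be sketched as follows. This is a simplified illustration: the full algorithm fits an LS tree to sign(y − F) to choose the regions, while this sketch uses a fixed stump at the median of x instead (function name ours):

```python
import numpy as np

def lad_boost(x, y, M=100, nu=0.1):
    """Sketch of LAD gradient boosting: F_0 is the global median; each
    iteration adds nu * (median residual) inside each terminal region.
    Regions come from a fixed stump here, not a fitted tree."""
    F = np.full(len(y), np.median(y))     # F_0 = median(y)
    split = np.median(x)
    for _ in range(M):
        for region in (x <= split, x > split):
            if region.any():
                gamma = np.median(y[region] - F[region])  # median residual
                F[region] += nu * gamma
    return F

x = np.array([0., 1., 2., 3.])
y = np.array([0., 0., 10., 10.])
F = lad_boost(x, y)
```

The fit converges to the per-region medians, which is exactly what absolute loss rewards.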
108© 2007 Seni & Elder KDD07
Appendix 6
Interpretation − Variable Importance

• Single tree measure (CART):

$$\mathcal{I}(x_l) = \sum_{t \in T} \hat{I}(\tilde{s}_l, t) \cdot \mathrm{I}\big(v(t) = l\big)$$

– i.e., sum the "goodness of split" scores $\hat{I}(\tilde{s}_l, t)$ whenever $x_l$ is used in a surrogate split
– The sum is over all internal nodes $t \in T$ of the tree
– If $x_l$ was used as the primary split in some node, then the corresponding score is used in the summation
– Normalized measure:

$$\mathrm{Imp}(x_j) = 100 \cdot \mathcal{I}(x_j) \big/ \max_{1 \leq l \leq m} \mathcal{I}(x_l)$$

  • i.e., only the relative magnitudes of the $\mathcal{I}(x_j)$ are interesting

• Tree ensemble generalization
– Average over all of the trees:

$$\mathcal{I}(x_j) = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}(x_j; T_m)$$
109© 2007 Seni & Elder KDD07
Appendix 6
Interpretation − Variable Importance (2)

• Rule ensemble measure:
– Term (global) importance: absolute value of the coefficient of the standardized predictor
  • Rule term: $\mathcal{I}_k = |\hat{a}_k| \cdot \sqrt{s_k (1 - s_k)}$, where the rule's support is $s_k = \frac{1}{N} \sum_{i=1}^{N} r_k(\mathbf{x}_i)$
  • Linear term: $\mathcal{I}_j = |\hat{b}_j| \cdot \mathrm{std}(x_j)$
– Term (local) importance: at each point $\mathbf{x}$ … the absolute change in $\hat{F}(\mathbf{x})$ when the term is removed
  • Rule term: $\mathcal{I}_k(\mathbf{x}) = |\hat{a}_k| \cdot |r_k(\mathbf{x}) - s_k|$
  • Linear term: $\mathcal{I}_j(\mathbf{x}) = |\hat{b}_j| \cdot |x_j - \bar{x}_j|$
– Input variable (local) importance: the linear term's importance plus each covering rule's importance, shared equally among the $\mathrm{size}(r_k)$ variables in the rule:

$$\mathcal{I}_l(\mathbf{x}) = \mathcal{I}_{\mathrm{lin},l}(\mathbf{x}) + \sum_{x_l \in r_k} \mathcal{I}_k(\mathbf{x}) \big/ \mathrm{size}(r_k)$$
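The two global measures are one-liners; a sketch (function names are ours):

```python
import numpy as np

def rule_importance(a_k, rule_fires):
    """Global importance of a rule term: |a_k| * sqrt(s_k (1 - s_k)),
    where s_k is the rule's support (fraction of points it covers)."""
    s = rule_fires.mean()
    return abs(a_k) * np.sqrt(s * (1 - s))

def linear_importance(b_j, xj):
    """Global importance of a linear term: |b_j| * std(x_j)."""
    return abs(b_j) * np.std(xj)

# e.g., a rule with support 0.30 and coefficient 1.80
fires = np.array([1.0] * 30 + [0.0] * 70)
imp = rule_importance(1.80, fires)
```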
110© 2007 Seni & Elder KDD07
Appendix 6
Interpretation − Partial Dependence Plots (Friedman, 2001)

• Visualize the effect on $\hat{F}(\mathbf{x})$ of a small subset of the important input variables
– after accounting for the (average) effects of the other ("complement") variables
– $\mathbf{x}_S \subset \{x_1, x_2, \ldots, x_n\}$ and complement $\mathbf{x}_C$, with $\mathbf{x}_S \cup \mathbf{x}_C = \{x_1, x_2, \ldots, x_n\}$ — i.e., $\hat{F}(\mathbf{x}) = \hat{F}(\mathbf{x}_S, \mathbf{x}_C)$
– Partial dependence on $\mathbf{x}_S$:

$$\hat{F}_S(\mathbf{x}_S) = \mathrm{E}_{\mathbf{X}_C}\big[\hat{F}(\mathbf{x}_S, \mathbf{x}_C)\big]$$

– Approximated by:

$$\hat{F}_S(\mathbf{x}_S) = \frac{1}{N} \sum_{i=1}^{N} \hat{F}(\mathbf{x}_S, \mathbf{x}_{iC})$$

  where $\{\mathbf{x}_{1C}, \ldots, \mathbf{x}_{NC}\}$ are the values of $\mathbf{x}_C$ occurring in the training data
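The approximation above — clamp $\mathbf{x}_S$, average over the training values of $\mathbf{x}_C$ — is a few lines of numpy; a sketch for a single variable (names are ours):

```python
import numpy as np

def partial_dependence(model, X, j, grid):
    """F_S(v) ~ (1/N) * sum_i model(v, x_iC): clamp column j to each
    grid value and average predictions over the training rows."""
    pd = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v                      # fix x_j; other columns as observed
        pd.append(model(Xv).mean())
    return np.array(pd)

# toy model with an interaction: F(x) = x0 * x1
model = lambda X: X[:, 0] * X[:, 1]
X = np.array([[0., -1.], [0., 1.], [0., 3.]])
pd0 = partial_dependence(model, X, 0, grid=[-1., 0., 1.])
# averaging over x1 (whose training mean is 1) leaves a slope-1 line in x0
```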
111© 2007 Seni & Elder KDD07
Appendix 6
Interpretation − Partial Dependence Plots (2)
• Example
– The shape of the function on either variable is affected by the values of the other, suggesting the presence of an interaction
112© 2007 Seni & Elder KDD07
Appendix 6
Interpretation − Interaction Statistic (Friedman, 2005)

• If $x_j$ and $x_k$ do not interact, then $\hat{F}(\mathbf{x})$ can be expressed as the sum of two functions:

$$\hat{F}(\mathbf{x}) = f_{\backslash j}(\mathbf{x}_{\backslash j}) + f_{\backslash k}(\mathbf{x}_{\backslash k})$$

  i.e., $f_{\backslash j}(\mathbf{x}_{\backslash j})$ does not depend on $x_j$; $f_{\backslash k}(\mathbf{x}_{\backslash k})$ is independent of $x_k$
– Thus, the partial dependence on $\mathbf{x}_S = \{x_j, x_k\}$ can be decomposed:

$$\hat{F}_{j,k}(x_j, x_k) = \hat{F}_j(x_j) + \hat{F}_k(x_k)$$

  i.e., the sum of the respective partial dependencies
• Test for the presence of an interaction:

$$H_{jk}^2 = \sum_{i=1}^{N} \left[\hat{F}_{j,k}(x_{ij}, x_{ik}) - \hat{F}_j(x_{ij}) - \hat{F}_k(x_{ik})\right]^2 \Big/ \sum_{i=1}^{N} \hat{F}_{j,k}^2(x_{ij}, x_{ik})$$
113© 2007 Seni & Elder KDD07
Appendix 6
Interpretation − Interaction Statistic (2)

• If $x_j$ does not interact with any other variable
– $\hat{F}(\mathbf{x})$ can be expressed as the sum of two functions:

$$\hat{F}(\mathbf{x}) = f_j(x_j) + f_{\backslash j}(\mathbf{x}_{\backslash j})$$

  where $f_j(x_j)$ is a function only of $x_j$
– Thus,

$$\hat{F}(\mathbf{x}) = \hat{F}_j(x_j) + \hat{F}_{\backslash j}(\mathbf{x}_{\backslash j})$$

• Test whether $x_j$ interacts with any other variable:

$$H_{j}^2 = \sum_{i=1}^{N} \left[\hat{F}(\mathbf{x}_i) - \hat{F}_j(x_{ij}) - \hat{F}_{\backslash j}(\mathbf{x}_{i \backslash j})\right]^2 \Big/ \sum_{i=1}^{N} \hat{F}^2(\mathbf{x}_i)$$
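The two-variable statistic $H^2_{jk}$ can be computed directly from the partial-dependence approximation; a sketch on analytic toy models (names are ours; partial dependencies are centered before differencing):

```python
import numpy as np

def pd_vals(model, X, cols, points):
    """Partial dependence of `model` at each row of `points` (values for
    columns `cols`), averaging over the other columns' training values."""
    out = np.empty(len(points))
    for i, p in enumerate(points):
        Xv = X.copy()
        Xv[:, cols] = p
        out[i] = model(Xv).mean()
    return out

def h_statistic(model, X, j, k):
    """Friedman's two-variable interaction statistic H^2_jk, evaluated
    at the training points with centered partial dependencies."""
    Fjk = pd_vals(model, X, [j, k], X[:, [j, k]])
    Fj = pd_vals(model, X, [j], X[:, [j]])
    Fk = pd_vals(model, X, [k], X[:, [k]])
    num = (Fjk - Fjk.mean()) - (Fj - Fj.mean()) - (Fk - Fk.mean())
    return np.sum(num ** 2) / np.sum((Fjk - Fjk.mean()) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
h_add = h_statistic(lambda X: X[:, 0] + 2 * X[:, 1], X, 0, 1)   # additive: ~0
h_mult = h_statistic(lambda X: X[:, 0] * X[:, 1], X, 0, 1)      # interaction: large
```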
114© 2007 Seni & Elder KDD07
Appendix 7
Complexity of Ensembles

• Ensembles generalizing well was a theoretical surprise
– Importance Sampling helps us understand some of this behavior

Outline:
• Chief danger of data mining: overfit
• Occam's razor: regulate complexity to avoid overfit
• But does the razor work? Counter-evidence has been gathering
• What if complexity is measured incorrectly?
• Generalized degrees of freedom (Ye, 1998)
• Experiments: single vs. bagged decision trees
• Summary: factors that matter
115© 2007 Seni & Elder KDD07
Appendix 7
Overfit models generalize poorly

[Figure: Y = f(X) with test points (TestPnts) and polynomial fits of increasing degree, YhatX through YhatX9]
116© 2007 Seni & Elder KDD07
Appendix 7
Occam's Razor

• Nunquam ponenda est pluralitas sine necessitate — "Entities should not be multiplied beyond necessity." "If (nearly) as descriptive, the simpler model is more correct."
• But, gathering doubts:
– Ensemble methods, which employ multiple models (e.g., bagging, boosting, bundling, Bayesian model averaging)
– Nonlinear terms can have a higher (or lower) effective complexity than linear ones
– Much overfit is from excessive search (e.g., Jensen 2000), rather than over-parameterization
– Neural network structures are fixed, but their degree of fit grows with training time
• Domingos (1998) won KDD Best Paper arguing for its death
– What if complexity is measured incorrectly?
117© 2007 Seni & Elder KDD07
Appendix 7
Target Shuffling can measure "over-search" (e.g., Battery MTC, Monte Carlo Target in CART β)

• Break the link between target, Y, and features, X, by shuffling Y to form Ys.
• Model new Ys ~ f(X)
• Measure quality of resulting (random) model
• Repeat to build distribution
-> Best (or mean) shuffled (i.e., useless) model sets the baseline above which true model performance may be measured
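The shuffling loop is easy to sketch; here a stand-in "model quality" score is used for illustration (names and score are ours, not from the tutorial):

```python
import numpy as np

def shuffled_baseline(score, X, y, n_shuffles=50, seed=0):
    """Break the X-y link by shuffling y and re-scoring each time; the
    distribution of shuffled scores is the useless-model baseline."""
    rng = np.random.default_rng(seed)
    return np.array([score(X, rng.permutation(y)) for _ in range(n_shuffles)])

def score(X, y):
    # stand-in quality measure: best |correlation| of y with any one input
    return max(abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1]))

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
y = X[:, 0] + 0.1 * rng.normal(size=50)       # real signal in column 0
baseline = shuffled_baseline(score, X, y)
true_score = score(X, y)
```

The true model's score should clear the best shuffled score; performance is then measured above that baseline.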
118© 2007 Seni & Elder KDD07
Appendix 7
Complexity can be measured by the Flexibility of the Modeling Process
• Generalized Degrees of Freedom, GDF (Ye, JASA 3/1998)
– Perturb output, re-fit procedure, measure changes in estimates
• Covariance Inflation Criterion, CIC (Tibshirani & Knight, 1999)
– Shuffle output, re-fit procedure, measure covariance between new and old estimates.
• Key step (loop around modeling procedure) reminiscent of Regression Analysis Tool, RAT (Faraway, 1991) -- where resampling tests of a 2-second procedure took 2 days to run.
119© 2007 Seni & Elder KDD07
Appendix 7
Generalized Degrees of Freedom (GDF)

• #terms in Linear Regression (LR) = DoF, k
• Nonlinear terms (e.g., MARS) can have effect of ~3k (Friedman, Owen ‘91)
• Other parameters can have effects < 1 (e.g., under-trained neural networks)
Procedure (Ye, 1998):
• For LR, k = trace(Hat Matrix) = $\sum_i \partial \hat{y}_i / \partial y_i$
• Define GDF to be sum of sensitivity of each fitted value, yhat, to perturbations in the corresponding output, y. That is, instead of extrapolating from LR by counting terms, use alternate trace measure which is equivalent under LR.
• (Similarly, the effective degrees of freedom of a spline model is estimated by the trace of the projection matrix, S: yhat = Sy )
• Put a y-perturbation loop around the entire modeling process (which can involve multiple stages)
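Ye's perturbation loop can be sketched around any black-box fitting procedure; a minimal version (names are ours), sanity-checked on least squares, where GDF should recover trace(H) = number of parameters:

```python
import numpy as np

def gdf(fit_predict, X, y, sigma=0.5, n_runs=100, seed=0):
    """GDF (Ye, 1998): perturb y with N(0, sigma) noise, re-fit, and sum
    per-observation sensitivities (slope of yhat_i vs. perturbed y_i)."""
    rng = np.random.default_rng(seed)
    ys = np.array([y + rng.normal(0.0, sigma, size=len(y))
                   for _ in range(n_runs)])
    yhats = np.array([fit_predict(X, ye) for ye in ys])
    # slope of yhat_i against perturbed y_i, per observation, then summed
    return sum(np.polyfit(ys[:, i], yhats[:, i], 1)[0] for i in range(len(y)))

def ols(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
y = rng.normal(size=40)
k = gdf(ols, X, y)   # expect about 3 (intercept + 2 slopes)
```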
120© 2007 Seni & Elder KDD07
Appendix 7
GDF computation
[Diagram: the modeling process is run on X with perturbed targets yᵉ = y + e, e ~ N(0, σ); each run yields per-observation pairs (yᵉ, ŷᵉ)]
Ye robustness trick: average, across perturbation runs, the sensitivities for a given observation.
[Scatter plot of ŷᵉ vs. yᵉ for one observation; the fitted slope (here y = 0.2502x − 0.821) estimates that observation's sensitivity]
121© 2007 Seni & Elder KDD07
Appendix 7
Example: data surface is piecewise constant
122© 2007 Seni & Elder KDD07
Appendix 7
Additive N(0, .5) noise
123© 2007 Seni & Elder KDD07
Appendix 7
100 random training samples
124© 2007 Seni & Elder KDD07
Appendix 7
Estimation surfaces for bundles of 5 trees

[Left: 4-leaf trees; right: 8-leaf trees (some of the finer structure is real)]

Bagging produces gentler stair-steps than a raw tree (illustrating how it generalizes better for smooth functions)
125© 2007 Seni & Elder KDD07
Appendix 7
Equivalent tree for 8-leaf bundle (25% pruned)
So a bundled tree is still a tree. But is it as complex as it looks?
126© 2007 Seni & Elder KDD07
Appendix 7
Experiment: Introduce selection noise (an additional 8 candidate input variables)

[Left: estimation surface for a 5-bag of 4-leaf trees; right: for 8-leaf trees]

The main structure here is clear enough for simple models to avoid the noise inputs, but their eventual use leads to a distribution of estimates on the 2-d projection.
127© 2007 Seni & Elder KDD07
Appendix 7
Estimated GDF vs. #parameters

[Chart: estimated GDF (0–40) against number of parameters k (1–9) for five procedures — Tree 2d, Bagged Tree 2d, Tree 2d + 8 noise, Bagged Tree 2d + 8 noise, and Linear Regression 1d — with fitted slopes of 1 for linear regression and roughly 3, 4, 4, and 5 for the tree variants]

– Bagging reduces complexity
– Noise x's increase complexity
128© 2007 Seni & Elder KDD07
Appendix 7
Ensembles & Complexity Summary
• Bundling competing models improves generalization.
• Different model families are a good source of component diversity.
• If we measure complexity as flexibility (GDF) the classic relation between complexity and overfit is revived.
– The more a modeling process can match an arbitrary change made to its output, the more complex it is.
– Simplicity is not parsimony.
• Complexity increases with distracting variables.
• It is expected to increase with parameter power and search thoroughness, and decrease with priors, shrinking, and clarity of structure in data. Constraints (observations) may go either way…
• Model ensembles often have less complexity than their components.
• Diverse modeling procedures can be fairly compared using GDF.