Improving Accuracy by Voting Classification Algorithms Ronny Kohavi Silicon Graphics, Inc. and Blue Martini LLC 28 Sept 1998 Joint work with Eric Bauer Silicon Graphics, Inc.




Ronny Kohavi Australia 1998

Outline

♦ Introduction to voting methods.

♦ Experimental design and the Bias-Variance decomposition.

♦ Bagging: pruning, using prob estimates, wagging, backfitting.

♦ Boosting: AdaBoost, Arc-x4. Numerical instabilities.

♦ Open questions.


Introduction to Voting Methods

♦ Main idea: build multiple models and combine them.

♦ Variants differ in:
  − How models are built (e.g., change the data or change the algorithm).
  − How predictions are combined (e.g., uniform vs. non-uniform weighting, multiple levels, i.e., stacking).

[Figure: Model 1, Model 2, Model 3, and Model 4 feed into a single Combiner.]


Key Ingredients

1. Low error rate for models.

2. Diversity, i.e., non-correlated (or anti-correlated) models.

3. Many models.

♦ It is easy to satisfy #2 and #3 by sacrificing #1: build bad models.

♦ It is easy to satisfy #1 and #3 by sacrificing #2: build small tweaks to a good model.
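To make the interplay of the three ingredients concrete, here is a minimal sketch of the error of an unweighted majority vote. It rests on an idealized assumption that is not from the slides: every model is a binary classifier that errs independently with the same probability. The function name `majority_vote_error` is hypothetical.

```python
from math import comb


def majority_vote_error(n_models: int, model_error: float) -> float:
    """Error of an unweighted majority vote over n independent binary models.

    Assumption (illustrative only): each model errs independently with
    probability `model_error`. Ties, possible for even n, count as errors.
    """
    # Smallest number of wrong votes that makes the vote wrong (or tied).
    k_min = n_models // 2 + 1 if n_models % 2 == 1 else n_models // 2
    return sum(
        comb(n_models, k) * model_error**k * (1 - model_error) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 25):
        print(n, majority_vote_error(n, 0.3), majority_vote_error(n, 0.6))
```

Under this assumption, voting helps only when the individual error is below 1/2 (ingredient #1) and the gain grows with the number of models (ingredient #3). Perfectly correlated models behave like a single model (the `n_models = 1` case), which is why ingredient #2 cannot be dropped.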


Examples of Voting Algorithms

♦ Bagging (modifies the data):
  − Use bootstrap samples (sample with replacement) to create different datasets.
  − Combiner uses uniform weighting.

♦ Wagging (modifies the data): similar to bagging, but
  − Reweight instances instead of sampling.

♦ Randomized splits in trees (modifies the algorithm):
  − Modify split selection: randomly select (e.g., uniformly) from the k best splits.

♦ Option trees (modifies the algorithm):
  − Select the top k splits and combine them (at multiple levels of the tree).


Examples of Voting Algorithms (II)

♦ Arc-x4 (modifies the data):
  − Increase the weight of misclassified instances.

♦ Boosting (modifies the data):
  − Increase the weight of misclassified instances.
  − Combine classifiers, giving low-error classifiers higher weight.

Disadvantage of the above adaptive resample-and-combine algorithms: they are hard to parallelize, because each classifier is created based on the previous ones.


(Dis)advantages of Voting Methods

Advantages:
♦ Lower error rate.
♦ Multiple models can give more insight (probably only for uniform combinations).

Disadvantages:
♦ Loss of comprehensibility:
  − Less structure (except for option trees).
  − Huge models.
♦ Slower induction. May exhaust hardware memory.
♦ Slower classification time.


Introduction to the Bias-Variance Decomposition

The B+V decomposition is a powerful tool for analyzing induction algorithms.

It holds for finite samples (not in asymptopia).

Given a target concept, a training set size, and an induction algorithm, it provides a decomposition of the error into:
  − Intrinsic noise (Bayes optimal).
  − Squared bias: how well the hypotheses match the target on average.
  − Variance: how much the hypotheses vary for different training sets.


The Decomposition

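$$E(C) = \sum_{x} P(x)\left(\mathrm{bias}^2_x + \mathrm{variance}_x + \sigma^2_x\right) \tag{1}$$

where

$$\mathrm{bias}^2_x \equiv \frac{1}{2} \sum_{y \in Y} \left[P(Y_F = y \mid x) - P(Y_H = y \mid x)\right]^2 \tag{2}$$

$$\mathrm{variance}_x \equiv \frac{1}{2} \left(1 - \sum_{y \in Y} P(Y_H = y \mid x)^2\right) \tag{3}$$

$$\sigma^2_x \equiv \frac{1}{2} \left(1 - \sum_{y \in Y} P(Y_F = y \mid x)^2\right) \tag{4}$$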

f and m in the conditioning events are implicit.
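As a sanity check on definitions (2) to (4), the three terms at a single point x can be computed directly from two label distributions. The sketch below reads Y_F as the target label and Y_H as the label predicted by a hypothesis drawn over training sets; the function name `decompose_at_x` and the dictionary representation are assumptions made for illustration.

```python
from typing import Dict, Hashable

Dist = Dict[Hashable, float]  # maps each label y to its probability at a fixed x


def decompose_at_x(p_target: Dist, p_hyp: Dist) -> Dict[str, float]:
    """Return bias^2, variance, and noise at one point x, per equations (2)-(4).

    p_target[y] plays the role of P(Y_F = y | x);
    p_hyp[y] plays the role of P(Y_H = y | x).
    """
    labels = set(p_target) | set(p_hyp)
    bias2 = 0.5 * sum(
        (p_target.get(y, 0.0) - p_hyp.get(y, 0.0)) ** 2 for y in labels
    )
    variance = 0.5 * (1.0 - sum(p_hyp.get(y, 0.0) ** 2 for y in labels))
    noise = 0.5 * (1.0 - sum(p_target.get(y, 0.0) ** 2 for y in labels))
    return {"bias2": bias2, "variance": variance, "noise": noise}


if __name__ == "__main__":
    # Label is 1 with probability 0.8; the hypothesis always predicts 1.
    print(decompose_at_x({1: 0.8, 0: 0.2}, {1: 1.0}))
```

The three terms add up to 1 − Σ_y P(Y_F = y | x) P(Y_H = y | x), which is the expected 0-1 loss at x when the target and hypothesis labels are independent given x.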


Example

Assume that the Boolean label is independent of the attributes (random concept). The label is 1 with probability (1−p) for p < 0.5.

Constant classifier: predict 1.
  − Bias²: p² (the average guess is off by p).
  − Variance: 0 (rock-stable guess).

Single rule: predict 1 if A_i = 1 (A_i is an attribute that leads to a pure split by chance).
  − Bias²: 0 (on average you predict well).
  − Variance: p(1−p) (unstable predictions because A_i is a "random" split).
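The values above follow from plugging the example into definitions (2) to (4) of the decomposition. The worked steps below assume that, over training sets, the single rule predicts 1 with probability 1−p, which is one reading of "on average you predict well".

```latex
\begin{align*}
\text{Target: } & P(Y_F = 1 \mid x) = 1-p, \quad P(Y_F = 0 \mid x) = p. \\[4pt]
\text{Constant classifier } & (P(Y_H = 1 \mid x) = 1): \\
\mathrm{bias}^2_x &= \tfrac12\left[\big((1-p) - 1\big)^2 + (p - 0)^2\right] = p^2, \\
\mathrm{variance}_x &= \tfrac12\left[1 - \left(1^2 + 0^2\right)\right] = 0. \\[4pt]
\text{Single rule } & (P(Y_H = 1 \mid x) = 1-p): \\
\mathrm{bias}^2_x &= \tfrac12\left[\big((1-p) - (1-p)\big)^2 + (p - p)^2\right] = 0, \\
\mathrm{variance}_x &= \tfrac12\left[1 - (1-p)^2 - p^2\right] = p(1-p). \\[4pt]
\text{Noise (both cases):}\quad
\sigma^2_x &= \tfrac12\left[1 - (1-p)^2 - p^2\right] = p(1-p).
\end{align*}
```

Adding the noise term gives a total error of p for the constant classifier and 2p(1−p) for the single rule, and p < 2p(1−p) exactly when p < 0.5.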


Tree Pruning / Overfitting

Option 1: the node is not pure, yet we stop and predict the majority (predict 1).
  − Bias² = p²
  − Var = 0
  − Error = Bias² + Var = p²

Option 2: we split the node on a test (branch probabilities p and 1−p) and get two pure nodes with the right probabilities (predict 0 and predict 1).
  − Bias² = 0
  − Var = p(1−p)
  − Error = Bias² + Var = p(1−p)

The intrinsic noise is the same for both options and is left out of these error figures.

p² < p(1−p) if p < 0.5, which we assumed.

In this case, it is better not to split. The variance hurts us because we built a structure that is too complex.

The previous example shows why pruning is useful.


Curse of Dimensionality

20-dimensional unit hypercube. 100,000 instances uniformly distributed. What is the expected distance of an instance to its closest neighbor?

Candidate answers: 0.1, 0.5, 0.7, 0.9, 0.99, 0.999, 1.5, 20.0

[Figure: unit cube with vertices labeled (0,0,0), (1,0,0), (0,1,1), (1,1,1), and (0,0,1).]
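The question above can be explored empirically. The following is a small Monte Carlo sketch using only the Python standard library; the function name `mean_nn_distance` and its parameters are hypothetical, and Euclidean distance is assumed since the slide does not name a metric.

```python
import math
import random


def mean_nn_distance(n_points: int, dim: int, n_queries: int, seed: int = 0) -> float:
    """Estimate the mean Euclidean distance from an instance to its nearest neighbor.

    Points are drawn uniformly from the unit hypercube [0, 1]^dim. Only
    `n_queries` randomly chosen instances are evaluated, to keep the
    brute-force search affordable.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    total = 0.0
    for i in rng.sample(range(n_points), n_queries):
        # Brute-force nearest neighbor, excluding the instance itself.
        total += min(
            math.dist(points[i], points[j]) for j in range(n_points) if j != i
        )
    return total / n_queries


if __name__ == "__main__":
    # Smaller than the slide's setting so that it runs in a moment.
    print(mean_nn_distance(n_points=2_000, dim=2, n_queries=20))
    print(mean_nn_distance(n_points=2_000, dim=20, n_queries=5))
```

Calling `mean_nn_distance(100_000, 20, 10)` reproduces the slide's setting; in pure Python it takes noticeably longer, and the estimate can then be compared with the candidate answers listed above.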


Experimental Design

Details of large experiment by Bauer and Kohavi (to appear in Machine Learning journal).

Desiderata for data sets and sampling sizes:
♦ Small confidence interval on estimated error. We chose files with >1000 instances.
♦ There should be room for improvement. Sample sizes chosen based on learning curves so that we know error is not optimal.

[Figure: learning curves, Error (%) vs. number of instances, for MC4 and naive-bayes on the letter dataset (0 to 20,000 instances) and on the segment dataset (0 to 2,500 instances).]


Induction Algorithms

♦ MC4: similar to C4.5, implemented in MLC++.
  − No pruning: deactivate pruning.
  − Probabilistic estimates: leaves predict a distribution (frequency counts).
  − (The actual paper has two versions of decision stumps.)

♦ NB: Naive-Bayes with discretized data.


Bagging

Input: training set S, inducer I, integer T (number of bootstrap samples).

1. for i = 1 to T {
2.   S′ = bootstrap sample from S (i.i.d. sample with replacement).
3.   C_i = I(S′)
4. }
5. $C^*(x) = \arg\max_{y \in Y} \sum_{i : C_i(x) = y} 1$ (the most often predicted label y)

Output: classifier C*.

In the experiments, T was set to 25.
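The pseudocode above can be sketched in Python as follows. This is a minimal illustration and not the MLC++ implementation used in the experiments: the names `bagging` and `majority_inducer` are hypothetical, an inducer is assumed to be any callable that maps a list of (attributes, label) pairs to a classifier function, and ties in the vote are broken by the first label seen, which is an arbitrary choice of this sketch.

```python
import random
from collections import Counter
from typing import Callable, Hashable, List, Sequence, Tuple

Instance = Tuple[Sequence[float], Hashable]  # (attributes, label)
Classifier = Callable[[Sequence[float]], Hashable]
Inducer = Callable[[List[Instance]], Classifier]


def bagging(S: List[Instance], inducer: Inducer, T: int = 25, seed: int = 0) -> Classifier:
    """Build T classifiers on bootstrap samples of S and combine by uniform vote."""
    rng = random.Random(seed)
    classifiers = []
    for _ in range(T):                                   # step 1
        sample = [rng.choice(S) for _ in range(len(S))]  # step 2: with replacement
        classifiers.append(inducer(sample))              # step 3

    def vote(x: Sequence[float]) -> Hashable:            # step 5: most frequent label
        return Counter(c(x) for c in classifiers).most_common(1)[0][0]

    return vote


def majority_inducer(sample: List[Instance]) -> Classifier:
    """Toy inducer for demonstration only: always predict the sample's majority label."""
    majority = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: majority


if __name__ == "__main__":
    data = [((0.0,), "a")] * 18 + [((1.0,), "b")] * 2
    bagged = bagging(data, majority_inducer, T=25)
    print(bagged((0.5,)))
```

Because each of the T classifiers depends only on its own bootstrap sample, the loop body could be run in parallel, in contrast to the boosting procedure.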


Bagging Observations

♦ Bagging was uniformly better on all 14 datasets!

♦ Error reduction was due to variance reduction. The average relative reduction in error was 29%.

♦ Trees were larger. Hypothesis: replicated instances seem like strong patterns and pruning is incorrect.

[Figure: stacked bias/variance bars (bias is below variance) for waveform-40 and letter, comparing MC4, bagged MC4, bagged MC4 without pruning with prob. estimates, and bagged MC4 without pruning with prob. estimates and backfitting.]


Bagging Observations − Pruning

♦ If tree pruning is disabled, then
  − Bagged trees are smaller (the training set is effectively smaller: 63.2% unique instances).
  − Average bias was reduced by 14% (relative).
  − Average variance grew by 11% (relative).

[Figure: stacked bias/variance bars (bias is below variance) for chess and nursery, comparing MC4, bagged MC4, bagged MC4 without pruning with prob. estimates, and bagged MC4 without pruning with prob. estimates and backfitting.]

♦ "No pruning" did not make an overall difference, but we suspect that with more replicates, it is better not to prune.
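The 63.2% figure for unique instances follows from the bootstrap itself: an instance is missed by all m draws with probability (1 − 1/m)^m, which approaches 1/e ≈ 0.368 as m grows. A short sketch that checks this both analytically and by simulation (the function names are hypothetical):

```python
import math
import random


def expected_unique_fraction(m: int) -> float:
    """Expected fraction of distinct instances in a bootstrap sample of size m."""
    return 1.0 - (1.0 - 1.0 / m) ** m


def simulated_unique_fraction(m: int, seed: int = 0) -> float:
    """Draw one bootstrap sample of m indices and count the distinct ones."""
    rng = random.Random(seed)
    return len({rng.randrange(m) for _ in range(m)}) / m


if __name__ == "__main__":
    print(expected_unique_fraction(5000))   # close to 1 - 1/e
    print(1.0 - math.exp(-1.0))
    print(simulated_unique_fraction(5000))
```

The complementary fraction, about 36.8%, is the share of the training set left out of each replicate.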


Bagging Variants

♦ Wagging (Weight Aggregation) perturbs the training set weights instead of sampling.

Results were similar to bagging.

♦ Backfitting takes the data left unused by each bagging replicate (about 36.8% of the instances) and uses it to update the counts at the leaves.

Average relative error decreased 3%, which was all due to variance reduction. Variances for all files improved!
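One way to perturb instance weights for wagging is sketched below. The slide does not say how the weights are perturbed; the choice here (start from weight 1, add zero-mean Gaussian noise, clip negative weights to zero) and the default standard deviation are assumptions for illustration, and the function name `wagging_weights` is hypothetical.

```python
import random
from typing import List


def wagging_weights(m: int, noise_std: float = 2.0, seed: int = 0) -> List[float]:
    """Return one perturbed weight per training instance.

    Assumed scheme (not stated on the slide): uniform weight 1 plus
    zero-mean Gaussian noise, with negative results clipped to 0.
    """
    rng = random.Random(seed)
    return [max(0.0, 1.0 + rng.gauss(0.0, noise_std)) for _ in range(m)]


if __name__ == "__main__":
    print(wagging_weights(5))
```

Each trial would pass a fresh weight vector, together with the full training set, to an inducer that accepts instance weights. Instances whose weight is clipped to zero are effectively dropped for that trial, much like instances missed by a bootstrap sample.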


Boosting

Input: training set S of size m, inducer I, integer T (number of trials).

1. S′ = S with instance weights assigned to be 1.
2. For i = 1 to T {
3.   C_i = I(S′)
4.   $\epsilon_i = \frac{1}{m} \sum_{x_j \in S' : C_i(x_j) \neq y_j} \mathrm{weight}(x_j)$ (weighted error on training set).
5.   If ε_i > 1/2, set S′ to a bootstrap sample from S with weight 1 for every instance and goto step 3 (this step is limited to 25 times, after which we exit the loop).
6.   β_i = ε_i / (1 − ε_i)
7.   For each x_j, divide weight(x_j) by 2ε_i if C_i(x_j) ≠ y_j and by 2(1 − ε_i) otherwise.
8. }
9. $C^*(x) = \arg\max_{y \in Y} \sum_{i : C_i(x) = y} \log \frac{1}{\beta_i}$

Output: classifier C*.
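The steps above can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the implementation used in the experiments. Assumptions introduced here: the inducer is a callable `weighted_inducer(instances, weights)` returning a classifier function; the limit of 25 restarts in step 5 is treated as a total budget; and a classifier with zero weighted error receives a large finite vote (controlled by the hypothetical `eps_floor`) and ends boosting, since the pseudocode does not cover that case. `stump_inducer` is a toy weak learner for 1-D data with labels 0/1.

```python
import math
import random
from collections import defaultdict


def weighted_error(clf, sample, weights, m):
    """Step 4: total weight of misclassified instances, divided by m."""
    return sum(w for (x, y), w in zip(sample, weights) if clf(x) != y) / m


def adaboost(S, weighted_inducer, T=25, max_restarts=25, seed=0, eps_floor=1e-10):
    rng = random.Random(seed)
    m = len(S)
    sample, weights = list(S), [1.0] * m                    # step 1
    ensemble, restarts = [], 0                              # (classifier, vote weight)
    for _ in range(T):                                      # step 2
        clf = weighted_inducer(sample, weights)             # step 3
        eps = weighted_error(clf, sample, weights, m)       # step 4
        while eps > 0.5 and restarts < max_restarts:        # step 5
            restarts += 1
            sample = [rng.choice(S) for _ in range(m)]
            weights = [1.0] * m
            clf = weighted_inducer(sample, weights)
            eps = weighted_error(clf, sample, weights, m)
        if eps > 0.5:
            break                                           # restart budget used up
        if eps == 0.0:
            # Assumption of this sketch: give a perfect classifier a large
            # finite vote and stop, avoiding log(1/0) and division by zero.
            ensemble.append((clf, math.log(1.0 / eps_floor)))
            break
        beta = eps / (1.0 - eps)                            # step 6
        ensemble.append((clf, math.log(1.0 / beta)))
        weights = [                                         # step 7
            w / (2 * eps) if clf(x) != y else w / (2 * (1 - eps))
            for (x, y), w in zip(sample, weights)
        ]

    def vote(x):                                            # step 9
        scores = defaultdict(float)
        for c, alpha in ensemble:
            scores[c(x)] += alpha
        return max(scores, key=scores.get) if scores else None

    return vote


def stump_inducer(sample, weights):
    """Toy weak learner: best weighted threshold rule on a 1-D attribute."""
    best = None
    for t in sorted({x for x, _ in sample}):
        for hi in (0, 1):                                   # label predicted when x >= t
            err = sum(
                w for (x, y), w in zip(sample, weights)
                if (hi if x >= t else 1 - hi) != y
            )
            if best is None or err < best[0]:
                best = (err, t, hi)
    _, t, hi = best
    return lambda x: hi if x >= t else 1 - hi


if __name__ == "__main__":
    data = list(zip(range(10), [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]))
    boosted = adaboost(data, stump_inducer, T=3)
    print([boosted(x) for x, _ in data])
```

On this toy dataset no single threshold rule fits the labels, but the weighted vote of three stumps does. Note that each trial needs the weights produced by the previous one, which is why the loop cannot be parallelized the way bagging can.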


Observations on Boosting

♦ Misclassified instances are reweighted by a factor inversely proportional to the training set error: 1/(2ε).

  A training set error of 0.1% will cause their weights to grow by a factor of 500. Without careful attention, numerical precision problems occur.

♦ After reweighting, the total weight of the misclassified instances is half the original training set weight.

  The correctly classified instances get the other half of the total weight.
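Both observations can be checked numerically. The sketch below applies the boosting weight update to a training set of 5,000 unit-weight instances with five mistakes; the function name `reweight` is hypothetical, and the sketch assumes the weights sum to m and that at least one instance is misclassified.

```python
from typing import List


def reweight(weights: List[float], wrong: List[bool]) -> List[float]:
    """Divide misclassified weights by 2*eps and the rest by 2*(1 - eps).

    Assumes the weights sum to m = len(weights) and that eps > 0.
    """
    m = len(weights)
    eps = sum(w for w, bad in zip(weights, wrong) if bad) / m
    return [
        w / (2 * eps) if bad else w / (2 * (1 - eps))
        for w, bad in zip(weights, wrong)
    ]


if __name__ == "__main__":
    m = 5000
    wrong = [i < 5 for i in range(m)]            # five mistakes: eps = 0.1%
    new = reweight([1.0] * m, wrong)
    print(new[0])                                # weight of a misclassified instance
    print(sum(new[:5]), sum(new[5:]))            # the two halves of the total weight
```

With ε = 0.001, each misclassified weight is divided by 0.002, i.e., multiplied by 500, and the five of them together carry a weight of about 2,500, half of the total of 5,000.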


Running Example − Shuttle (I)

Five misclassified examples on a training set of size 5,000 (0.1%) cause their weight to become 500.

Test-set error: 0.38%


Running Example − Shuttle (II)

One misclassified example (0.01%) that was not previously misclassified is reweighted from 0.5 to 2500.

Test-set error: 0.19%


Running Example − Shuttle (III)

Five mistakes again, all on instances previously correctly classified.

Test-set error: 0.21%


Running Example − Shuttle (IV)

12 mistakes are made.

Test-set error: 0.45%


Running Example − Shuttle (V)

One misclassified example with weight 0.063. Training set error is 0.0012%.

If the original AdaBoost is used, beta is 0.0000125, which causes weights to go below $10^{-6}$ prior to normalization. Underflow problems start...
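To see where the tiny intermediate values come from, the sketch below compares two algebraically equivalent updates: multiplying the weights of correctly classified instances by beta and then normalizing (the original AdaBoost formulation, normalized here to a total of m rather than 1 so that the results are comparable), and dividing directly by 2ε or 2(1−ε). The function names and the example weights are assumptions chosen to mirror the numbers on this slide.

```python
import math


def update_original(weights, wrong):
    """Multiply correct instances by beta, then normalize the total to m.

    Also returns the smallest intermediate (pre-normalization) weight.
    """
    m = len(weights)
    eps = sum(w for w, bad in zip(weights, wrong) if bad) / m
    beta = eps / (1.0 - eps)
    scaled = [w if bad else w * beta for w, bad in zip(weights, wrong)]
    total = sum(scaled)
    return [w * m / total for w in scaled], min(scaled)


def update_direct(weights, wrong):
    """Divide by 2*eps (misclassified) or 2*(1 - eps) (correct); no intermediate step."""
    m = len(weights)
    eps = sum(w for w, bad in zip(weights, wrong) if bad) / m
    return [
        w / (2 * eps) if bad else w / (2 * (1 - eps))
        for w, bad in zip(weights, wrong)
    ]


if __name__ == "__main__":
    m = 5000
    # One misclassified instance of weight 0.063, one correct instance of the
    # same weight, and the rest sharing the remaining weight (total = m).
    weights = [0.063, 0.063] + [(m - 0.126) / (m - 2)] * (m - 2)
    wrong = [True] + [False] * (m - 1)
    normalized, smallest = update_original(weights, wrong)
    direct = update_direct(weights, wrong)
    print(smallest)                      # intermediate value before normalization
    print(normalized[0], direct[0])      # both give the same final weight
```

The two updates agree up to rounding, but only the first passes through values below $10^{-6}$. Whether such values actually cause trouble depends on the floating-point precision in use, which the slide does not state.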


Running Example − Shuttle (VI)

Classifier makes no mistakes. Note that this is a single classifier, which is significantly better than the original one!

Test-set error: 0.08%


AdaBoost Observations

♦ AdaBoost slightly outperformed Bagging.

♦ Unlike Bagging, boosting did not uniformly reduce the error.

Hypothyroid, sick-euthyroid, adult, and LED-24 had higher errors.

♦ Average tree size was larger for most files. It was especially larger for files on which performance degraded.

♦ Problems with robustness to noise.


Boosting: Bias + Variance

♦ Boosting reduced both bias and variance:
  − Average bias reduced 32% (relative).
  − Average variance reduced 16% (relative).

[Figure: stacked bias/variance bars (bias is below variance) for DNA-nominal and chess, comparing MC4, bagged MC4 without pruning with prob. estimates and backfitting, boosted MC4 using Arc-x4-resample, and boosted MC4 using AdaBoost.]


Open Questions

♦ Can AdaBoost be made more robust to noise?

♦ Arc-x4 did not work with reweighting. Why?

♦ Can we learn a single model that is better (as happened with shuttle)?

♦ Bagging and Boosting build huge structures. What happened to Occam's razor? Is there a compact representation?

♦ Bagging worked better without pruning. AdaBoost did not. Why?

♦ Boosting is sequential. Can parallelism be used?


Summary

♦ AdaBoost reduced the error by 27% with MC4 and 24% with Naive-Bayes (relative).

  Note, however, that we knew improvement was possible on our datasets.

♦ Bagging reduces variance. AdaBoost reduces both bias and variance.

♦ Bagging benefits from no pruning, probabilistic variants, and backfitting.

♦ Be careful with numerical instabilities when implementing AdaBoost.


CPU #55 on Flurry

♦ We used about 4,000 CPU hours. Many runs were done on Flurry, a 128-CPU Origin 2000 with 30GB of RAM.

♦ We spent a lot of time trying to track down an assertion failure, where sometimes a normalized array did not add up to 1.0.

  After many experiments, we found that CPU #55 on Flurry was sometimes making arithmetic errors...

♦ Today the OS runs a program called paranoia on large machines to track such problems.