
Page 1

COMP 551 – Applied Machine Learning
Lecture 12: Ensemble learning

Associate Instructor: Herke van Hoof

([email protected])

Slides mostly by: Joelle Pineau ([email protected])

Class web page: www.cs.mcgill.ca/~jpineau/comp551

Unless otherwise noted, all material posted for this course is copyright of the instructor, and cannot be reused or reposted without the instructor’s written permission.

Page 2

Today’s quiz

1. Output of 1NN for A?

2. Output of 3NN for A?

3. Output of 3NN for B?

4. Explain in 1-2 sentences the difference between a "lazy" learner (such as nearest neighbour classifier) and an "eager" learner (such as logistic regression classifier).

Page 3

Project #2

• A note on the contest rules:

– You are allowed to use the built-in cross-validation methods from libraries like scikit-learn, for all parts.

– You are allowed to use NLTK or another library for preprocessing your data, for all parts.

– You can use an outside corpus to compute feature values (e.g., TF-IDF).

Page 4

Project #2

Page 5

Project #2


• Some features:

– Sub-word features (skiing: ski – kii – iin – ing) allow handling of out-of-vocabulary words and misspellings

– Organizing the languages in a hierarchical tree makes use of the imbalance in the classes

– K-means and feature selection to reduce model size

Page 6

Next topic: Ensemble methods

• Recently seen supervised learning methods:
  – Logistic regression, Naïve Bayes, LDA/QDA
  – Decision trees, instance-based learning

• Core idea of decision trees? Build complex classifiers from simpler ones. E.g., linear separators -> decision trees.

• Ensemble methods use this idea with other ‘simple’ methods.

• Several ways to do this:
  – Bagging
  – Random forests
  – Boosting

(Lectures 4–5 – Linear Classification; Lecture 7 – Decision Trees)

Page 7

Ensemble learning in general

• Key idea: Run a base learning algorithm multiple times, then combine the predictions of the different learners to get a final prediction.
  – What’s a base learning algorithm? Naïve Bayes, LDA, decision trees, SVMs, …

Page 8

Ensemble learning in general

• Key idea: Run a base learning algorithm multiple times, then combine the predictions of the different learners to get a final prediction.
  – What’s a base learning algorithm? Naïve Bayes, LDA, decision trees, SVMs, …

• First attempt: Construct several classifiers independently.
  – Bagging.
  – Randomizing the test selection in decision trees (random forests).
  – Using a different subset of input features to train different trees.

Page 9

Ensemble learning in general

• Key idea: Run a base learning algorithm multiple times, then combine the predictions of the different learners to get a final prediction.
  – What’s a base learning algorithm? Naïve Bayes, LDA, decision trees, SVMs, …

• First attempt: Construct several classifiers independently.
  – Bagging.
  – Randomizing the test selection in decision trees (random forests).
  – Using a different subset of input features to train different trees.

• More complex approach: Coordinate the construction of the hypotheses in the ensemble.

Page 10

Ensemble methods in general

• Training models independently on the same dataset tends to yield the same result!

• For an ensemble to be useful, the trained models need to be different:

1. Use slightly different (randomized) datasets

2. Use slightly different (randomized) training procedures

Page 11

Recall bootstrapping

• Given dataset D, construct a bootstrap replicate of D, called Dk, which has the same number of examples, by drawing samples from D with replacement.

• Use the learning algorithm to construct a hypothesis hk by training on Dk.

• Compute the prediction of hk on each of the remaining points, from the set Tk = D – Dk.

• Repeat this process K times, where K is typically a few hundred.
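As a concrete illustration (our sketch, not from the slides; the function name and defaults are arbitrary), the replicate construction in NumPy:

```python
import numpy as np

def bootstrap_replicates(X, y, K=200, seed=0):
    """Draw K bootstrap replicates D_k of (X, y), each with the same number
    of examples, sampling with replacement; also return the indices of the
    held-out points T_k = D - D_k."""
    rng = np.random.default_rng(seed)
    n = len(X)
    out = []
    for _ in range(K):
        idx = rng.integers(0, n, size=n)        # draw n examples with replacement
        oob = np.setdiff1d(np.arange(n), idx)   # points never drawn into D_k
        out.append(((X[idx], y[idx]), oob))
    return out
```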

Lecture 6 – Evaluation

Page 12

Estimating bias and variance

• For each point x, we have a set of estimates $h_1(x), \ldots, h_K(x)$, with $K \le B$ (since x might not appear in some replicates).

• The average empirical prediction of x is: $\hat{h}(x) = \frac{1}{K}\sum_{k=1}^{K} h_k(x)$.

• We estimate the bias as: $y - \hat{h}(x)$.

• We estimate the variance as: $\frac{1}{K-1}\sum_{k=1}^{K} \left( \hat{h}(x) - h_k(x) \right)^2$.
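To make these formulas concrete, a small sketch (ours, not from the slides), where preds holds the out-of-bag predictions $h_1(x), \ldots, h_K(x)$ for a single point x with true label y:

```python
import numpy as np

def bias_variance_at_point(preds, y):
    """Estimate bias and variance at one point x from the bootstrap
    predictions preds = [h_1(x), ..., h_K(x)]."""
    h_bar = preds.mean()               # average empirical prediction h^(x)
    bias = y - h_bar                   # estimated bias at x
    variance = ((h_bar - preds) ** 2).sum() / (len(preds) - 1)
    return bias, variance
```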

Page 13

Bagging: Bootstrap aggregation

• If we did all the work to get the hypotheses $h_k$, why not use all of them to make a prediction? (As opposed to just estimating bias/variance/error.)

• All hypotheses get to have a vote (sketched in code below).
  – For classification: pick the majority class.
  – For regression: average all the predictions.

• Which hypothesis classes would benefit most from this approach?
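As a sketch of bagging in practice (ours, not the slides'; the dataset and parameter values are illustrative, and the keyword is base_estimator rather than estimator in scikit-learn versions before 1.2):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 100 unpruned trees, each fit on a bootstrap replicate of the data;
# classification combines them by majority vote.
bagged_trees = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,
    random_state=0,
)
print(cross_val_score(bagged_trees, X, y, cv=5).mean())
```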

Page 14

Bagging

• For each point x, we have a set of estimates $h_1(x), \ldots, h_K(x)$, with $K \le B$ (since x might not appear in some replicates).

  – The average empirical prediction of x is: $\hat{h}(x) = \frac{1}{K}\sum_{k=1}^{K} h_k(x)$.

  – We estimate the bias as: $y - \hat{h}(x)$.

  – We estimate the variance as: $\frac{1}{K-1}\sum_{k=1}^{K} \left( \hat{h}(x) - h_k(x) \right)^2$.

Page 15

Bagging

• In theory, bagging eliminates variance altogether.

• In practice, bagging tends to reduce variance and increase bias.

• Use this with “unstable” learners that have high variance, e.g. decision trees, neural networks, nearest-neighbour.

Page 16

Random forests (Breiman, 2001)

• Basic algorithm:

– Use K bootstrap replicates to train K different trees.

– At each node, pick m variables at random (use m<M, the total number of features).

– Determine the best test (using normalized information gain).

– Recurse until the tree reaches maximum depth (no pruning).

Page 17

Random forests (Breiman, 2001)

• Basic algorithm:

– Use K bootstrap replicates to train K different trees.

– At each node, pick m variables at random (use m<M, the total number of features).

– Determine the best test (using normalized information gain).

– Recurse until the tree reaches maximum depth (no pruning).

• Comments:

– Each tree has high variance, but the ensemble uses averaging, which reduces variance.

– Random forests are very competitive in both classification and regression, but still subject to overfitting.
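For reference, a brief sketch (ours, not the slides') of the same recipe in scikit-learn, where n_estimators plays the role of K and max_features the role of m; parameter values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# K = 200 trees on bootstrap replicates; m = sqrt(M) features tried per
# node; trees are grown to maximum depth (no pruning).
forest = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",
    max_depth=None,
    random_state=0,
)
print(cross_val_score(forest, X, y, cv=5).mean())
```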

Page 18

Extremely randomized trees (Geurts et al., 2006)

• Basic algorithm:
  – Construct K decision trees.
  – Pick m attributes at random (without replacement) and pick a random test involving each attribute.
  – Evaluate all tests (using a normalized information gain metric) and pick the best one for the node.
  – Continue until a desired depth or a desired number of instances (nmin) at a leaf is reached.

Page 19

Extremely randomized trees (Geurts et al., 2006)

• Basic algorithm:
  – Construct K decision trees.
  – Pick m attributes at random (without replacement) and pick a random test involving each attribute.
  – Evaluate all tests (using a normalized information gain metric) and pick the best one for the node.
  – Continue until a desired depth or a desired number of instances (nmin) at a leaf is reached.

• Comments:
  – Very reliable method for both classification and regression.
  – The smaller m is, the more randomized the trees are; small m is best, especially with large levels of noise. Small nmin means less bias and more variance, but variance is controlled by averaging over trees.
  – Compared to single trees, can pick a smaller nmin (less bias); see the sketch below.
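A matching sketch (ours): scikit-learn's ExtraTreesClassifier implements this scheme, with min_samples_leaf playing the role of nmin; the values below are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# For each of m candidate attributes a split threshold is drawn at random,
# and the best of those random tests is kept for the node.
extra = ExtraTreesClassifier(
    n_estimators=200,
    max_features="sqrt",    # m attributes considered per node
    min_samples_leaf=2,     # n_min: smaller means less bias, more variance
    random_state=0,
)
print(cross_val_score(extra, X, y, cv=5).mean())
```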

Page 20

Randomization

• For an ensemble to be useful, the trained models need to be different:

1. Use slightly different (randomized) datasets
   • Bootstrap aggregation (bagging)

2. Use slightly different (randomized) training procedures
   • Extremely randomized trees, random forests

Page 21

Randomization in general

• Instead of searching very hard for the best hypothesis, generate lots of random ones, then average their results.

• Examples:
  – Random feature selection, random projections.

• Advantages?

• Disadvantages?

Page 22

Randomization in general

• Instead of searching very hard for the best hypothesis, generate lots of random ones, then average their results.

• Examples:
  – Random feature selection, random projections.

• Advantages?
  – Very fast, easy, can handle lots of data.
  – Can circumvent difficulties in optimization.
  – Averaging reduces the variance introduced by randomization.

• Disadvantages?

Page 23

Randomization in general

• Instead of searching very hard for the best hypothesis, generate lots of random ones, then average their results.

• Examples:
  – Random feature selection, random projections.

• Advantages?
  – Very fast, easy, can handle lots of data.
  – Can circumvent difficulties in optimization.
  – Averaging reduces the variance introduced by randomization.

• Disadvantages?
  – New predictions may be more expensive to evaluate (must go over all trees).
  – Still typically subject to overfitting.
  – Low interpretability compared to standard decision trees.

Page 24

Randomization

• For an ensemble to be useful, the trained models need to be different:

1. Use slightly different (randomized) datasets
   • Bootstrap aggregation (bagging)

2. Use slightly different (randomized) training procedures
   • Extremely randomized trees, random forests

3. An alternative to randomization?

Page 25

Additive models

• In an ensemble, the output on any instance is computed by averaging the outputs of several hypotheses.

• Idea: Don’t construct the hypotheses independently. Instead, new hypotheses should focus on instances that are problematic for existing hypotheses.
  – If an example is difficult, more components should focus on it.

Page 26

Boosting

• Boosting:

– Use the training set to train a simple predictor.

– Re-weight the training examples, putting more weight on examples that were not properly classified in the previous predictor.

– Repeat n times.

– Combine the simple hypotheses into a single, accurate predictor.

[Figure: the boosting pipeline. The original data is re-weighted into datasets D1, D2, …, Dn; a weak learner is trained on each, producing hypotheses H1, H2, …, Hn, which are combined into the final hypothesis F(H1, H2, …, Hn).]

Page 27

Notation

• Assume that examples are drawn independently from some probability distribution P on the set of possible data D.

• Let $J_P(h)$ be the expected error of hypothesis h when data is drawn from P:

  $J_P(h) = \sum_{\langle x, y \rangle} J(h(x), y)\, P(\langle x, y \rangle),$

  where $J(h(x), y)$ could be the squared error, or the 0/1 loss.
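Since P is typically unknown, $J_P(h)$ is approximated in practice by an empirical average over a sample drawn from P; a one-line sketch (ours), here with the 0/1 loss:

```python
import numpy as np

def empirical_error(h, X, y):
    """Monte-Carlo estimate of J_P(h) under 0/1 loss, given a sample (X, y)
    drawn from P and a hypothesis h mapping inputs to predicted labels."""
    return np.mean(h(X) != y)
```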

Page 28

Weak learners

• Assume we have some “weak” binary classifiers:
  – A decision stump: a single-node decision tree, testing $x_i > t$ (sketched in code below).
  – A single-feature Naïve Bayes classifier.
  – A 1-nearest-neighbour classifier.

• “Weak” means $J_P(h) < 1/2 - \gamma$ (assuming 2 classes), where $\gamma > 0$.
  – So the true error of the classifier is only slightly better than random guessing.

• Questions:
  – How do we re-weight the examples?
  – How do we combine many simple predictors into a single classifier?
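As a sketch of the first weak learner above (ours, not the slides'), a decision stump that respects example weights, which is exactly what boosting needs once it starts re-weighting the data:

```python
import numpy as np

def fit_stump(x, y, w):
    """Weighted decision stump on a single feature: choose the threshold t
    minimizing the weighted error of the rule "predict +1 iff x > t".
    Labels y are in {-1, +1}; w is a distribution over the examples."""
    best_t, best_err = None, np.inf
    for t in np.unique(x):
        pred = np.where(x > t, 1, -1)
        err = w[pred != y].sum()
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err
```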

Page 29

Example

[Figure: toy example.]

Page 30

Example: First step

[Figure: toy example, first step.]

Page 31

Example: Second step

[Figure: toy example, second step.]

Page 32

Example: Third step

[Figure: toy example, third step.]

Page 33

Example: Final hypothesis

[Figure: toy example, final hypothesis: the weighted combination of the weak hypotheses, compared with C4.5 (Quinlan's decision tree algorithm) and single-attribute decision stumps.]

Page 34

AdaBoost (Freund & Schapire, 1995)

Page 35

AdaBoost (Freund & Schapire, 1995)

Page 36

AdaBoost (Freund & Schapire, 1995)


weight of weak learner t

Page 37

AdaBoost (Freund & Schapire, 1995)


weight of weak learner t

Page 38

AdaBoost (Freund & Schapire, 1995)


weight of weak learner t
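The AdaBoost pseudocode shown on pages 34-38 was an image and did not survive the text extraction. As a substitute, here is a minimal from-scratch sketch (ours, not the slides') with depth-1 trees as the weak learners; labels are assumed to be in {-1, +1}:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Train up to T decision stumps; return them with their weights alpha_t."""
    n = len(X)
    w = np.full(n, 1.0 / n)                    # initial uniform example weights
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = w[pred != y].sum()               # weighted training error
        if eps >= 0.5:                         # no longer better than random
            break
        alpha = 0.5 * np.log((1 - eps) / eps)  # weight of weak learner t
        w *= np.exp(-alpha * y * pred)         # up-weight the mistakes
        w /= w.sum()                           # renormalize to a distribution
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Final hypothesis: sign of the weighted vote of the weak learners."""
    score = sum(a * h.predict(X) for h, a in zip(stumps, alphas))
    return np.sign(score)
```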

Page 39

Why these equations?

• Loss function:

  $L = \sum_{i=1}^{N} e^{-m_i}, \qquad m_i = y_i \sum_{k=1}^{K} \alpha_k h_k(x_i)$

• Has a gradient

• Upper bound on the classification (0/1) loss

• Stronger signal for wrong classifications

• Stronger signal if wrong and far from the boundary

Page 40

Why these equations?

• Loss function:

  $L = \sum_{i=1}^{N} e^{-m_i}, \qquad m_i = y_i \sum_{k=1}^{K} \alpha_k h_k(x_i)$

• The update equations are derived from this loss function.
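To spell out one step the slides leave implicit (a standard derivation, not shown on the slides): setting the derivative of $L$ with respect to the newest learner's weight $\alpha_t$ to zero, with the earlier learners' contributions absorbed into per-example weights $w_i$, gives

$$\frac{\partial L}{\partial \alpha_t} = -\sum_{i:\, h_t(x_i) = y_i} w_i\, e^{-\alpha_t} + \sum_{i:\, h_t(x_i) \neq y_i} w_i\, e^{\alpha_t} = 0 \quad\Longrightarrow\quad \alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t},$$

where $\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} w_i$ is the weighted error of $h_t$ (with the $w_i$ normalized to sum to 1). This is exactly the "weight of weak learner t" annotated on the preceding slides.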

Page 41

Properties of AdaBoost

• Compared to other boosting algorithms, the main insight is to automatically adapt the error rate at each iteration.

Page 42

Properties of AdaBoost

• Compared to other boosting algorithms, the main insight is to automatically adapt the weights at each iteration.

• Training error on the final hypothesis is at most

  $\prod_{t} \sqrt{1 - 4\gamma_t^2} \;\le\; \exp\big(-2 \textstyle\sum_{t} \gamma_t^2\big)$

  (recall: $\gamma_t$ is how much better than random $h_t$ is).

• AdaBoost reduces the training error exponentially fast.

Page 43

Real data set: Text categorization

[Figure: boosting on a real text-categorization task (Reuters and AP newswire datasets). The multiclass problem is reduced to binary questions for each example ("does or does not example x belong to class 1?", and so on), and the weak learners are decision stumps testing the presence of a word or short phrase, e.g. "If the word Clinton appears in the document, predict the document is about politics".]

Page 44

Boosting empirical evaluation

[Figure: boosting experimental results: comparison of C4.5, boosting C4.5, and boosting decision stumps (depth-1 trees) on 27 benchmark datasets (Freund & Schapire, 1996).]

C4.5: Lecture 7 – Decision Trees

Page 45

Bagging vs Boosting

[Figure: excerpt from Freund & Schapire (1996): scatterplots comparing bagging and boosting for each of three weak learners (FindAttrTest, FindDecRule, C4.5) on UCI benchmark datasets, and comparing C4.5 against the boosted and bagged variants. The excerpt summarizes the differences as follows: (1) bagging always uses resampling rather than reweighting; (2) bagging does not modify the distribution over examples or mislabels, but always uses the uniform distribution; (3) in forming the final hypothesis, bagging gives equal weight to each of the weak hypotheses. Empirically, with the simpler weak learners boosting did significantly and uniformly better than bagging (average error-rate improvements of roughly 53-55%, versus 8-19% for bagging); with C4.5 as the weak learner the two were more evenly matched, with boosting keeping a slight advantage (24.8% versus 20.0% average improvement).]

Page 46

Bagging vs Boosting

• Bagging is typically faster, but may get a smaller error reduction (not by much).

• Bagging works well with “reasonable” classifiers.

• Boosting works with very simple classifiers.
  E.g., BoosTexter does text classification using decision stumps based on single words.

• Boosting may have a problem if a lot of the data is mislabeled, because it will focus on those examples a lot, leading to overfitting.

Page 47

Why does boosting work?

Page 48

Why does boosting work?

• Weak learners have high bias. By combining them, we get more expressive classifiers. Hence, boosting is a bias-reduction technique.

Page 49

Why does boosting work?

• Weak learners have high bias. By combining them, we get more expressive classifiers. Hence, boosting is a bias-reduction technique.

• AdaBoost minimizes an upper bound on the misclassification error, within the space of functions that can be captured by a linear combination of the base classifiers.

• What happens as we run boosting longer? Intuitively, we get more and more complex hypotheses. How would you expect bias and variance to evolve over time?

Page 50

A naïve (but reasonable) analysis of error

• Expect the training error to continue to drop (until it reaches 0).

• Expect the test error to increase as we get more voters, and the final hypothesis $h_f$ becomes too complex.

[Figure: the expected error curves as the number of boosting rounds grows: training error falling to 0, test error eventually rising.]

Page 51

Actual typical run of AdaBoost

• Test error does not increase even after 1000 rounds (more than 2 million decision nodes!).

• Test error continues to drop even after training error reaches 0!

• These are consistent results through many sets of experiments!

• Conjecture: Boosting does not overfit!

[Figure: a typical AdaBoost run: training and test error versus number of boosting rounds (log scale).]

Page 52

What you should know

• Ensemble methods combine several hypotheses into one prediction.

• They work better than the best individual hypothesis from the same class because they reduce bias or variance (or both).

• Extremely randomized trees are a bias-reduction technique.

• Bagging is mainly a variance-reduction technique, useful for complex hypotheses.

• Its main idea is to sample the data repeatedly, train several classifiers, and average their predictions.

• Boosting focuses on harder examples, and gives a weighted vote to the hypotheses.

• Boosting works by reducing bias and increasing the classification margin.