Ensemble Methods: Boosting from Weak Learners
Eric Xing
Lecture 11, October 15, 2015
Machine Learning
10-701, Fall 2015
Reading: Chap. 14.3, C. Bishop book
Weak Learners: Fighting the Bias-Variance Tradeoff
Simple (a.k.a. weak) learners, e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees):
  Good: low variance, don't usually overfit
  Bad: high bias, can't solve hard learning problems
Can we make weak learners always good? No! But often yes…
Why boost weak learners?
Goal: automatically categorize the type of call requested (Collect, Calling card, Person-to-person, etc.)
It is easy to find "rules of thumb" that are "often" correct. E.g., if 'card' occurs in the utterance, then predict 'calling card'.
It is hard to find a single highly accurate prediction rule.
Voting (Ensemble Methods)
Instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space.
Output class: (weighted) vote of each classifier
  Classifiers that are most "sure" will vote with more conviction
  Classifiers will be most "sure" about a particular part of the space
  On average, do better than a single classifier!
(Figure: two weak classifiers h1(X) and h2(X), each confident (+1 or -1) on a different part of the input space and uncertain (?) elsewhere.)
H: X → Y ∈ {-1, +1}
H(X) = sign(∑_t α_t h_t(X)), where the α_t are the weights
In this two-classifier example, H(X) = h1(X) + h2(X). (A toy numeric illustration follows.)
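As a toy illustration (not from the slides), here is what such a weighted vote looks like in Python; the two stump-like classifiers, the weights, and the data points are all made up for the example.

import numpy as np

def weighted_vote(classifiers, alphas, X):
    """Combine the +/-1 predictions of several classifiers by a weighted vote."""
    votes = np.array([clf(X) for clf in classifiers])   # votes[t, i] = h_t(x_i)
    return np.sign(np.dot(alphas, votes))                # H(x) = sign(sum_t alpha_t h_t(x))

# two hypothetical weak classifiers, each looking at one input coordinate
h1 = lambda X: np.where(X[:, 0] > 0, 1, -1)
h2 = lambda X: np.where(X[:, 1] > 0, 1, -1)

X = np.array([[1.0, -2.0], [-1.0, 3.0], [2.0, 2.0]])
print(weighted_vote([h1, h2], alphas=[0.7, 0.3], X=X))   # -> [ 1. -1.  1.]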
Voting (Ensemble Methods), cont'd
But how do you
  force the classifiers h_t to learn about different parts of the input space?
  weigh the votes of the different classifiers (the α_t)?
Bagging
Recall decision trees (lecture 3):
Pros: interpretable, can handle discrete and continuous features, robust to outliers, low bias, etc.
Cons: high variance
Trees are perfect candidates for ensembles: consider averaging many (nearly) unbiased tree estimators; the bias stays similar, but the variance is reduced.
This is called bagging (bootstrap aggregating) (Breiman, 1996): train many trees on bootstrapped data, then take the average.
Bootstrap: draw n examples with replacement from the n training examples (statistically, "roll an n-faced die n times").
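A minimal sketch of bagging, assuming scikit-learn decision trees as the base learner (the function names and the number of trees are illustrative, not from the slides):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, B=50, seed=0):
    """Train B trees, each on a bootstrap sample (n draws with replacement)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                 # the "roll an n-faced die n times" bootstrap
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    """Average the trees: majority vote, assuming +/-1 labels (use the mean for regression)."""
    votes = np.stack([t.predict(X) for t in trees])
    return np.sign(votes.mean(axis=0))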
Random Forest
Reduce the correlation between trees by introducing randomness:
1. For b = 1, …, B:
   a. draw a bootstrap dataset D_b
   b. learn a tree on D_b; at each split, select m features at random out of the p features as candidate splitting variables
2. Output: for regression, average the B tree predictions; for classification, take a majority vote
Typically take m ≈ √p (for classification); a usage sketch follows.
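For reference, scikit-learn's RandomForestClassifier implements this recipe directly; a small usage sketch on synthetic data (the parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
# B = n_estimators bootstrapped trees; each split considers ~sqrt(p) randomly chosen features
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0).fit(X, y)
print(rf.score(X, y))   # training accuracy; evaluate on held-out data in practice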
Rationale: Combination of Methods
There is no algorithm that is always the most accurate.
We can select simple "weak" classification or regression methods and combine them into a single "strong" method.
Different learners use different:
  algorithms
  parameters
  representations (modalities)
  training sets
  subproblems
The problem: how to combine them
Boosting [Schapire'89]
Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote.
On each iteration t:
  weight each training example by how incorrectly it was classified
  learn a weak hypothesis h_t, and a strength α_t for this hypothesis
Final classifier: H(X) = sign(∑_t α_t h_t(X))
Practically useful, and theoretically interesting.
Important issues:
  what is the criterion that we are optimizing? (measure of loss)
  we would like to estimate each new component classifier in the same manner (modularity)
Combination of classifiers
Suppose we have a family of component classifiers (generating ±1 labels) such as decision stumps:
  h(x; θ) = sign(w x_k + b), where θ = {k, w, b}
Each decision stump pays attention to only a single component of the input vector.
Combination of classifiers, cont'd
We'd like to combine the simple classifiers additively so that the final classifier is the sign of
  ĥ(x) = α_1 h(x; θ_1) + … + α_m h(x; θ_m)
where the "votes" {α_i} emphasize component classifiers that make more reliable predictions than others.
Important issues:
  what is the criterion that we are optimizing? (measure of loss)
  we would like to estimate each new component classifier in the same manner (modularity)
AdaBoost
Input:
  N examples S_N = {(x_1, y_1), …, (x_N, y_N)}
  a weak base learner h = h(x, θ)
Initialize: equal example weights w_i = 1/N for all i = 1…N
Iterate for t = 1…T:
  1. train the base learner on the weighted example set (w^t, x) and obtain hypothesis h_t = h(x, θ_t)
  2. compute the hypothesis error ε_t
  3. compute the hypothesis weight α_t
  4. update the example weights for the next iteration, w^{t+1}
Output: final hypothesis as a linear combination of the h_t (a code sketch of these steps follows)
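A compact sketch of these four steps, assuming labels in {-1, +1} and scikit-learn decision stumps as the weak learner (the function and variable names are illustrative, not from the slides):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """y must be in {-1, +1}. Returns the weak hypotheses and their votes alpha_t."""
    N = len(y)
    w = np.full(N, 1.0 / N)                          # equal example weights
    hyps, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1)      # decision stump
        h.fit(X, y, sample_weight=w)                 # 1. train on the weighted examples
        pred = h.predict(X)
        eps = np.clip(np.sum(w * (pred != y)) / np.sum(w), 1e-10, 1 - 1e-10)  # 2. weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)        # 3. hypothesis weight
        w = w * np.exp(-alpha * y * pred)            # 4. reweight the examples
        w = w / w.sum()
        hyps.append(h)
        alphas.append(alpha)
    return hyps, alphas

def adaboost_predict(hyps, alphas, X):
    return np.sign(sum(a * h.predict(X) for a, h in zip(alphas, hyps)))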
AdaBoost
At the k-th iteration we find (any) classifier h(x; θ̂_k) for which the weighted classification error
  ε_k = ∑_{i=1}^n W_i^{k-1} I(y_i ≠ h(x_i; θ̂_k)) / ∑_{i=1}^n W_i^{k-1}
is better than chance. This is meant to be "easy": a weak classifier suffices.
Determine how many "votes" to assign to the new component classifier (a stronger classifier gets more votes):
  α_k = 0.5 log((1 − ε_k) / ε_k)
Update the weights on the training examples:
  W_i^k ∝ W_i^{k-1} exp(−y_i α_k h(x_i; θ̂_k))
Boosting Example (Decision Stumps)
What is the criterion that we are optimizing? (measure of loss)
Measurement of error
Loss function: ℓ(y, h(x)), e.g., the 0/1 loss I(y ≠ h(x))
Generalization error: L(h) = E[ℓ(y, h(x))]
Objective: find h with minimum generalization error
Main boosting idea: minimize the empirical error
  L̂(h) = (1/N) ∑_{i=1}^N ℓ(y_i, h(x_i))
Exponential Loss
Empirical loss: L̂(ĥ_m) = (1/N) ∑_{i=1}^N ℓ(y_i, ĥ_m(x_i))
Another possible measure of empirical loss is the exponential loss:
  L̂(ĥ_m) = ∑_{i=1}^n exp(−y_i ĥ_m(x_i))
Exponential Loss
One possible measure of empirical loss is the exponential loss. Recall that ĥ_m(x) = ĥ_{m-1}(x) + α_m h(x; θ_m), so
  L̂(ĥ_m) = ∑_{i=1}^n exp(−y_i ĥ_m(x_i))
          = ∑_{i=1}^n exp(−y_i ĥ_{m-1}(x_i)) exp(−y_i α_m h(x_i; θ_m))
          = ∑_{i=1}^n W_i^{m-1} exp(−y_i α_m h(x_i; θ_m)),  where W_i^{m-1} = exp(−y_i ĥ_{m-1}(x_i))
The combined classifier based on m − 1 iterations thus defines a weighted loss criterion for the next simple classifier to add: each training sample is weighted by its "classifiability" (or difficulty) as seen by the classifier we have built so far.
Linearization of the loss function
We can simplify the estimation criterion for the new component classifier a bit (assuming α_m is small):
  exp(−y_i α_m h(x_i; θ_m)) ≈ 1 − y_i α_m h(x_i; θ_m)
Now our empirical loss criterion reduces to
  ∑_{i=1}^n W_i^{m-1} exp(−y_i α_m h(x_i; θ_m))
    ≈ ∑_{i=1}^n W_i^{m-1} (1 − y_i α_m h(x_i; θ_m))
    = ∑_{i=1}^n W_i^{m-1} − α_m ∑_{i=1}^n W_i^{m-1} y_i h(x_i; θ_m)
where, as before, W_i^{m-1} = exp(−y_i ĥ_{m-1}(x_i)).
We could choose the new component classifier to optimize this weighted agreement, ∑_i W_i^{m-1} y_i h(x_i; θ_m).
A possible algorithm
At stage m we find θ* that maximizes (or at least gives a sufficiently high) weighted agreement:
  ∑_{i=1}^n W_i^{m-1} y_i h(x_i; θ*)
  each sample is weighted by its "difficulty" under the previously combined m − 1 classifiers
  more "difficult" samples receive heavier attention, as they dominate the total loss
Then we go back and find the "votes" α_m* associated with the new classifier by minimizing the original weighted (exponential) loss (see the derivation below):
  L̂(ĥ_m) = ∑_{i=1}^n W_i^{m-1} exp(−y_i α_m h(x_i; θ_m))
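As a side note that the slides do not spell out, minimizing this weighted exponential loss over α_m (with the weights W_i^{m-1} normalized to sum to one) yields the AdaBoost vote formula:

J(\alpha_m) = \sum_{i=1}^{n} W_i^{m-1} e^{-y_i \alpha_m h(x_i;\theta_m)}
            = e^{-\alpha_m}(1 - \varepsilon_m) + e^{\alpha_m}\varepsilon_m ,
\qquad
\frac{\partial J}{\partial \alpha_m} = 0 \;\Rightarrow\; \alpha_m = \tfrac{1}{2}\log\frac{1 - \varepsilon_m}{\varepsilon_m}

where ε_m is the weighted error of h(x; θ_m), i.e., the total (normalized) weight of the examples it misclassifies.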
The AdaBoost algorithm
At the k-th iteration we find (any) classifier h(x; θ̂_k) for which the weighted classification error
  ε_k = ∑_{i=1}^n W_i^{k-1} I(y_i ≠ h(x_i; θ̂_k)) / ∑_{i=1}^n W_i^{k-1}
is better than chance. This is meant to be "easy": a weak classifier suffices.
Determine how many "votes" to assign to the new component classifier (a stronger classifier gets more votes):
  α_k = 0.5 log((1 − ε_k) / ε_k)
Update the weights on the training examples:
  W_i^k ∝ W_i^{k-1} exp(−y_i α_k h(x_i; θ̂_k)),  where W_i^{k-1} = exp(−y_i ĥ_{k-1}(x_i))
The AdaBoost algorithm, cont'd
The final classifier after m boosting iterations is given by the sign of
  ĥ_m(x) = (α_1 h(x; θ_1) + … + α_m h(x; θ_m)) / ∑_{l=1}^m α_l
The votes here are normalized for convenience.
Boosting
We have basically derived a boosting algorithm that sequentially adds new component classifiers, each trained on reweighted training examples; each component classifier is presented with a slightly different problem.
AdaBoost preliminaries:
  we work with normalized weights W_i on the training examples, initially uniform (W_i = 1/n)
  the weights reflect the "degree of difficulty" of each datum for the latest classifier
AdaBoost: summary
Input:
  N examples S_N = {(x_1, y_1), …, (x_N, y_N)}
  a weak base learner h = h(x, θ)
Initialize: equal example weights w_i = 1/N for all i = 1…N
Iterate for t = 1…T:
  1. train the base learner on the weighted example set (w^t, x) and obtain hypothesis h_t = h(x, θ_t)
  2. compute the hypothesis error ε_t
  3. compute the hypothesis weight α_t
  4. update the example weights for the next iteration, w^{t+1}
Output: final hypothesis as a linear combination of the h_t
Base Learners
Weak learners used in practice:
  decision stumps (axis-parallel splits)
  decision trees (e.g., C4.5 by Quinlan, 1996)
  multi-layer neural networks
  radial basis function networks
Can base learners operate on weighted examples?
  In many cases they can be modified to accept weights along with the examples.
  In general, we can sample the examples (with replacement) according to the distribution defined by the weights, as in the sketch below.
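A minimal sketch of the resampling alternative, assuming numpy (the weight values are made up): the weight-unaware base learner is simply trained on a bootstrap sample drawn according to the current weight distribution.

import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.1, 0.1, 0.6, 0.2])      # current example weights (must sum to 1)
idx = rng.choice(len(w), size=len(w), replace=True, p=w)
# train the weight-unaware base learner on X[idx], y[idx];
# the "difficult" example with weight 0.6 tends to be drawn most often
print(idx)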
Boosting results – Digit recognition
Boosting is often robust to overfitting: the test set error keeps decreasing even after the training error reaches zero [Schapire, 1989]. But not always.
(Figure: training error and test error curves for digit recognition.)
Generalization Error Bounds [Freund & Schapire'95]
  T – number of boosting rounds
  d – VC dimension of the weak learner, measures the complexity of the classifier
  m – number of training examples
  T small: large bias, small variance
  T large: small bias, large variance
  a bias-variance tradeoff
Generalization Error Bounds [Freund & Schapire'95]
The bound (which holds with high probability) suggests that boosting can overfit if T is large.
This contradicts experimental results: boosting is often robust to overfitting, and the test set error decreases even after the training error is zero.
We need better analysis tools – margin-based bounds.
Why is it working?
You will need some learning theory (to be covered in the next two lectures) to understand this fully, but for now let's just go over some high-level ideas.
Generalization error: with high probability, the generalization error is less than a bound that grows with the number of rounds (sketched below).
As T goes up, the bound becomes worse: boosting should overfit!
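A standard form of the Freund–Schapire bound alluded to here (stated loosely, up to constants and log factors) is

\text{generalization error} \;\le\; \widehat{\Pr}\big[H(x) \ne y\big] \;+\; \tilde{O}\!\left(\sqrt{\frac{T\,d}{m}}\right)

so the complexity term grows with the number of rounds T, which is why this bound predicts overfitting.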
Experiments
(Figure: training error and test error curves, from "The Boosting Approach to Machine Learning" by Robert E. Schapire.)
Training Margins
When a vote is taken, the more predictors agreeing, the more confident you are in your prediction.
Margin for example (x_i, y_i):
  margin_h(x_i, y_i) = y_i (α_1 h(x_i; θ_1) + … + α_m h(x_i; θ_m)) / ∑_{l=1}^m α_l
The margin lies in [−1, 1] and is negative for all misclassified examples.
Successive boosting iterations improve the majority vote, i.e., the margin, of the training examples.
A Margin Bound
Robert E. Schapire, Yoav Freund, Peter Bartlett and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998.
For any θ > 0, the generalization error is less than
  Pr[margin(x, y) ≤ θ] + O(√(d / (m θ²)))
It does not depend on T!
Summary
Boosting takes a weak learner and converts it to a strong one.
Works by asymptotically minimizing the empirical error
Effectively maximizes the margin of the combined hypothesis
Some additional points for fun
Boosting and Logistic Regression
Logistic regression assumes a parametric form for P(y | x), and tries to maximize the likelihood of the (iid) data, which is equivalent to minimizing the log loss.
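For concreteness, the standard logistic-regression quantities meant here, with labels y ∈ {−1, +1} and linear score f(x) = w_0 + ∑_j w_j x_j, are

P(y = 1 \mid x) = \frac{1}{1 + e^{-f(x)}},
\qquad
\max_{w} \prod_{i=1}^{N} P(y_i \mid x_i; w)
\;\Longleftrightarrow\;
\min_{w} \sum_{i=1}^{N} \log\big(1 + e^{-y_i f(x_i)}\big)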
Boosting and Logistic Regression
Logistic regression is equivalent to minimizing the log loss.
Boosting minimizes a similar loss function (the exp loss), applied to a weighted average of weak learners.
Both are smooth approximations of the 0/1 loss!
(Figure: the 0/1 loss, log loss, and exp loss plotted against the margin y f(x).)
Boosting and Logistic Regression
Logistic regression: minimize the log loss
  define f(x) over predefined features x_j (a linear classifier)
  jointly optimize over all weights w_0, w_1, w_2, …
Boosting: minimize the exp loss
  define f(x) as a weighted average of weak learners h_t(x), where each h_t(x) is defined dynamically to fit the data (not a linear classifier)
  the weights α_t are learned incrementally, one per iteration t
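Concretely, in standard notation (the slide's own equations are not reproduced here), the two methods differ only in the function class for f and in the loss:

\text{Logistic regression:}\quad f(x) = w_0 + \sum_j w_j x_j, \qquad \min_{w} \sum_i \log\big(1 + e^{-y_i f(x_i)}\big)
\text{Boosting:}\quad f(x) = \sum_t \alpha_t h_t(x), \qquad \min_{\{\alpha_t,\, h_t\}} \sum_i e^{-y_i f(x_i)}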
Hard & Soft Decision
The combined output is a weighted average of weak learners, f(x) = ∑_t α_t h_t(x).
Hard decision / predicted label: sign(f(x)).
Soft decision: a probability estimate, by analogy with logistic regression.
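The soft-decision formula itself is not given above; under the usual logistic-regression analogy (the additive-logistic-regression view of AdaBoost due to Friedman, Hastie and Tibshirani), it would read

P(y = 1 \mid x) \approx \frac{1}{1 + e^{-2 f(x)}}, \qquad f(x) = \sum_t \alpha_t h_t(x)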
Effect of Outliers
Good: boosting can identify outliers, since it focuses on examples that are hard to categorize.
Bad: too many outliers can degrade classification performance and dramatically increase the time to convergence.
Gradient Boosting
Goal: find a nonlinear predictor f(x) that minimizes a chosen loss over the training data.
Gradient boosting generalizes AdaBoost (exponential loss) to any smooth loss function.
Gradient Boosting Decision Tree
Example losses:
  square loss (regression)
  logistic loss (classification)
  margin loss (ranking: prefer item i over item j)
  others…
Let's use a decision tree as the approximator. A J-leaf-node decision tree can be viewed as a partition of the input space together with a prediction value (weight) associated with each cell of the partition. We will learn the partition (tree structure) first, then the leaf weights. A small sketch for the square-loss case follows.
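A minimal sketch of the square-loss case, assuming scikit-learn regression trees (the function names, learning rate, and tree size are illustrative, not from the slides); each new tree is fit to the current residuals, i.e., the negative gradient of the square loss:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=100, lr=0.1, max_leaf_nodes=8):
    """Square-loss gradient boosting: each tree approximates the current residuals."""
    y = np.asarray(y, dtype=float)
    f = np.full(len(y), y.mean())                    # initial constant prediction
    trees = []
    for _ in range(M):
        residual = y - f                             # negative gradient of (1/2)(y - f)^2
        t = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, residual)
        f += lr * t.predict(X)                       # shrink each tree's contribution
        trees.append(t)
    return y.mean(), trees, lr

def gb_predict(f0, trees, lr, X):
    return f0 + lr * sum(t.predict(X) for t in trees)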