Ensemble Methods: Boosting from Weak Learners
Eric Xing
Lecture 11, October 15, 2015
Machine Learning
10-701, Fall 2015
Reading: Chap. 14.3, C. Bishop book
Weak Learners: Fighting the Bias-Variance Tradeoff
Simple (a.k.a. weak) learners, e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees):
  Good: low variance, don't usually overfit
  Bad: high bias, can't solve hard learning problems
Can we make weak learners always good? No! But often yes…
Why boost weak learners?
Goal: automatically categorize the type of call requested (Collect, Calling card, Person-to-person, etc.)
It is easy to find "rules of thumb" that are "often" correct. E.g., if 'card' occurs in the utterance, then predict 'calling card'.
It is hard to find a single highly accurate prediction rule.
Voting (Ensemble Methods)
Instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space.
Output class: (weighted) vote of each classifier
  Classifiers that are most "sure" will vote with more conviction
  Classifiers will be most "sure" about a particular part of the space
  On average, do better than a single classifier!
(Figure: two weak classifiers h1(X) and h2(X), each confident (+1 or -1) on a different part of the input space and uncertain (?) elsewhere.)
H: X → Y ∈ {-1, +1}
H(X) = sign(∑_t α_t h_t(X)), where the α_t are the weights
In this two-classifier example, H(X) = h1(X) + h2(X). (A toy numeric illustration follows.)
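As a toy illustration (not from the slides), here is what such a weighted vote looks like in Python; the two stump-like classifiers, the weights, and the data points are all made up for the example.

import numpy as np

def weighted_vote(classifiers, alphas, X):
    """Combine the +/-1 predictions of several classifiers by a weighted vote."""
    votes = np.array([clf(X) for clf in classifiers])   # votes[t, i] = h_t(x_i)
    return np.sign(np.dot(alphas, votes))                # H(x) = sign(sum_t alpha_t h_t(x))

# two hypothetical weak classifiers, each looking at one input coordinate
h1 = lambda X: np.where(X[:, 0] > 0, 1, -1)
h2 = lambda X: np.where(X[:, 1] > 0, 1, -1)

X = np.array([[1.0, -2.0], [-1.0, 3.0], [2.0, 2.0]])
print(weighted_vote([h1, h2], alphas=[0.7, 0.3], X=X))   # -> [ 1. -1.  1.]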
Voting (Ensemble Methods), cont'd
But how do you
  force the classifiers h_t to learn about different parts of the input space?
  weigh the votes of the different classifiers (the α_t)?
Bagging
Recall decision trees (lecture 3):
Pros: interpretable, can handle discrete and continuous features, robust to outliers, low bias, etc.
Cons: high variance
Trees are perfect candidates for ensembles: consider averaging many (nearly) unbiased tree estimators; the bias stays similar, but the variance is reduced.
This is called bagging (bootstrap aggregating) (Breiman, 1996): train many trees on bootstrapped data, then take the average.
Bootstrap: draw n examples with replacement from the n training examples (statistically, "roll an n-faced die n times").
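A minimal sketch of bagging, assuming scikit-learn decision trees as the base learner (the function names and the number of trees are illustrative, not from the slides):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, B=50, seed=0):
    """Train B trees, each on a bootstrap sample (n draws with replacement)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                 # the "roll an n-faced die n times" bootstrap
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    """Average the trees: majority vote, assuming +/-1 labels (use the mean for regression)."""
    votes = np.stack([t.predict(X) for t in trees])
    return np.sign(votes.mean(axis=0))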
Random Forest
Reduce the correlation between trees by introducing randomness:
1. For b = 1, …, B:
   a. draw a bootstrap dataset D_b
   b. learn a tree on D_b; at each split, select m features at random out of the p features as candidate splitting variables
2. Output: for regression, average the B tree predictions; for classification, take a majority vote
Typically take m ≈ √p (for classification); a usage sketch follows.
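For reference, scikit-learn's RandomForestClassifier implements this recipe directly; a small usage sketch on synthetic data (the parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
# B = n_estimators bootstrapped trees; each split considers ~sqrt(p) randomly chosen features
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0).fit(X, y)
print(rf.score(X, y))   # training accuracy; evaluate on held-out data in practice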
Rationale: Combination of Methods
There is no algorithm that is always the most accurate.
We can select simple "weak" classification or regression methods and combine them into a single "strong" method.
Different learners use different:
  algorithms
  parameters
  representations (modalities)
  training sets
  subproblems
The problem: how to combine them
Boosting [Schapire'89]
Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote.
On each iteration t:
  weight each training example by how incorrectly it was classified
  learn a weak hypothesis h_t, and a strength α_t for this hypothesis
Final classifier: H(X) = sign(∑_t α_t h_t(X))
Practically useful, and theoretically interesting.
Important issues:
  what is the criterion that we are optimizing? (measure of loss)
  we would like to estimate each new component classifier in the same manner (modularity)
Combination of classifiers
Suppose we have a family of component classifiers (generating ±1 labels) such as decision stumps:
  h(x; θ) = sign(w x_k + b), where θ = {k, w, b}
Each decision stump pays attention to only a single component of the input vector.
Combination of classifiers, cont'd
We'd like to combine the simple classifiers additively so that the final classifier is the sign of
  ĥ(x) = α_1 h(x; θ_1) + … + α_m h(x; θ_m)
where the "votes" {α_i} emphasize component classifiers that make more reliable predictions than others.
Important issues:
  what is the criterion that we are optimizing? (measure of loss)
  we would like to estimate each new component classifier in the same manner (modularity)
AdaBoost
Input:
  N examples S_N = {(x_1, y_1), …, (x_N, y_N)}
  a weak base learner h = h(x, θ)
Initialize: equal example weights w_i = 1/N for all i = 1…N
Iterate for t = 1…T:
  1. train the base learner on the weighted example set (w^t, x) and obtain hypothesis h_t = h(x, θ_t)
  2. compute the hypothesis error ε_t
  3. compute the hypothesis weight α_t
  4. update the example weights for the next iteration, w^{t+1}
Output: final hypothesis as a linear combination of the h_t (a code sketch of these steps follows)
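A compact sketch of these four steps, assuming labels in {-1, +1} and scikit-learn decision stumps as the weak learner (the function and variable names are illustrative, not from the slides):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """y must be in {-1, +1}. Returns the weak hypotheses and their votes alpha_t."""
    N = len(y)
    w = np.full(N, 1.0 / N)                          # equal example weights
    hyps, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1)      # decision stump
        h.fit(X, y, sample_weight=w)                 # 1. train on the weighted examples
        pred = h.predict(X)
        eps = np.clip(np.sum(w * (pred != y)) / np.sum(w), 1e-10, 1 - 1e-10)  # 2. weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)        # 3. hypothesis weight
        w = w * np.exp(-alpha * y * pred)            # 4. reweight the examples
        w = w / w.sum()
        hyps.append(h)
        alphas.append(alpha)
    return hyps, alphas

def adaboost_predict(hyps, alphas, X):
    return np.sign(sum(a * h.predict(X) for a, h in zip(alphas, hyps)))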
AdaBoost
At the k-th iteration we find (any) classifier h(x; θ̂_k) for which the weighted classification error
  ε_k = ∑_{i=1}^n W_i^{k-1} I(y_i ≠ h(x_i; θ̂_k)) / ∑_{i=1}^n W_i^{k-1}
is better than chance. This is meant to be "easy": a weak classifier suffices.
Determine how many "votes" to assign to the new component classifier (a stronger classifier gets more votes):
  α_k = 0.5 log((1 − ε_k) / ε_k)
Update the weights on the training examples:
  W_i^k ∝ W_i^{k-1} exp(−y_i α_k h(x_i; θ̂_k))
Boosting Example (Decision Stumps)
What is the criterion that we are optimizing? (measure of loss)
Measurement of error
Loss function: ℓ(y, h(x)), e.g., the 0/1 loss I(y ≠ h(x))
Generalization error: L(h) = E[ℓ(y, h(x))]
Objective: find h with minimum generalization error
Main boosting idea: minimize the empirical error
  L̂(h) = (1/N) ∑_{i=1}^N ℓ(y_i, h(x_i))
Exponential Loss
Empirical loss: L̂(ĥ_m) = (1/N) ∑_{i=1}^N ℓ(y_i, ĥ_m(x_i))
Another possible measure of empirical loss is the exponential loss:
  L̂(ĥ_m) = ∑_{i=1}^n exp(−y_i ĥ_m(x_i))
Exponential Loss
One possible measure of empirical loss is the exponential loss. Recall that ĥ_m(x) = ĥ_{m-1}(x) + α_m h(x; θ_m), so
  L̂(ĥ_m) = ∑_{i=1}^n exp(−y_i ĥ_m(x_i))
          = ∑_{i=1}^n exp(−y_i ĥ_{m-1}(x_i)) exp(−y_i α_m h(x_i; θ_m))
          = ∑_{i=1}^n W_i^{m-1} exp(−y_i α_m h(x_i; θ_m)),  where W_i^{m-1} = exp(−y_i ĥ_{m-1}(x_i))
The combined classifier based on m − 1 iterations thus defines a weighted loss criterion for the next simple classifier to add: each training sample is weighted by its "classifiability" (or difficulty) as seen by the classifier we have built so far.
Linearization of the loss function
We can simplify the estimation criterion for the new component classifier a bit (assuming α_m is small):
  exp(−y_i α_m h(x_i; θ_m)) ≈ 1 − y_i α_m h(x_i; θ_m)
Now our empirical loss criterion reduces to
  ∑_{i=1}^n W_i^{m-1} exp(−y_i α_m h(x_i; θ_m))
    ≈ ∑_{i=1}^n W_i^{m-1} (1 − y_i α_m h(x_i; θ_m))
    = ∑_{i=1}^n W_i^{m-1} − α_m ∑_{i=1}^n W_i^{m-1} y_i h(x_i; θ_m)
where, as before, W_i^{m-1} = exp(−y_i ĥ_{m-1}(x_i)).
We could choose the new component classifier to optimize this weighted agreement, ∑_i W_i^{m-1} y_i h(x_i; θ_m).
A possible algorithm
At stage m we find θ* that maximizes (or at least gives a sufficiently high) weighted agreement:
  ∑_{i=1}^n W_i^{m-1} y_i h(x_i; θ*)
  each sample is weighted by its "difficulty" under the previously combined m − 1 classifiers
  more "difficult" samples receive heavier attention, as they dominate the total loss
Then we go back and find the "votes" α_m* associated with the new classifier by minimizing the original weighted (exponential) loss (see the derivation below):
  L̂(ĥ_m) = ∑_{i=1}^n W_i^{m-1} exp(−y_i α_m h(x_i; θ_m))
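As a side note that the slides do not spell out, minimizing this weighted exponential loss over α_m (with the weights W_i^{m-1} normalized to sum to one) yields the AdaBoost vote formula:

J(\alpha_m) = \sum_{i=1}^{n} W_i^{m-1} e^{-y_i \alpha_m h(x_i;\theta_m)}
            = e^{-\alpha_m}(1 - \varepsilon_m) + e^{\alpha_m}\varepsilon_m ,
\qquad
\frac{\partial J}{\partial \alpha_m} = 0 \;\Rightarrow\; \alpha_m = \tfrac{1}{2}\log\frac{1 - \varepsilon_m}{\varepsilon_m}

where ε_m is the weighted error of h(x; θ_m), i.e., the total (normalized) weight of the examples it misclassifies.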
The AdaBoost algorithm
At the k-th iteration we find (any) classifier h(x; θ̂_k) for which the weighted classification error
  ε_k = ∑_{i=1}^n W_i^{k-1} I(y_i ≠ h(x_i; θ̂_k)) / ∑_{i=1}^n W_i^{k-1}
is better than chance. This is meant to be "easy": a weak classifier suffices.
Determine how many "votes" to assign to the new component classifier (a stronger classifier gets more votes):
  α_k = 0.5 log((1 − ε_k) / ε_k)
Update the weights on the training examples:
  W_i^k ∝ W_i^{k-1} exp(−y_i α_k h(x_i; θ̂_k)),  where W_i^{k-1} = exp(−y_i ĥ_{k-1}(x_i))
The AdaBoost algorithm, cont'd
The final classifier after m boosting iterations is given by the sign of
  ĥ_m(x) = (α_1 h(x; θ_1) + … + α_m h(x; θ_m)) / ∑_{l=1}^m α_l
The votes here are normalized for convenience.
Boosting
We have basically derived a boosting algorithm that sequentially adds new component classifiers, each trained on reweighted training examples; each component classifier is presented with a slightly different problem.
AdaBoost preliminaries:
  we work with normalized weights W_i on the training examples, initially uniform (W_i = 1/n)
  the weights reflect the "degree of difficulty" of each datum for the latest classifier
AdaBoost: summary
Input:
  N examples S_N = {(x_1, y_1), …, (x_N, y_N)}
  a weak base learner h = h(x, θ)
Initialize: equal example weights w_i = 1/N for all i = 1…N
Iterate for t = 1…T:
  1. train the base learner on the weighted example set (w^t, x) and obtain hypothesis h_t = h(x, θ_t)
  2. compute the hypothesis error ε_t
  3. compute the hypothesis weight α_t
  4. update the example weights for the next iteration, w^{t+1}
Output: final hypothesis as a linear combination of the h_t
Base Learners
Weak learners used in practice:
  decision stumps (axis-parallel splits)
  decision trees (e.g., C4.5 by Quinlan, 1996)
  multi-layer neural networks
  radial basis function networks
Can base learners operate on weighted examples?
  In many cases they can be modified to accept weights along with the examples.
  In general, we can sample the examples (with replacement) according to the distribution defined by the weights, as in the sketch below.
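A minimal sketch of the resampling alternative, assuming numpy (the weight values are made up): the weight-unaware base learner is simply trained on a bootstrap sample drawn according to the current weight distribution.

import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.1, 0.1, 0.6, 0.2])      # current example weights (must sum to 1)
idx = rng.choice(len(w), size=len(w), replace=True, p=w)
# train the weight-unaware base learner on X[idx], y[idx];
# the "difficult" example with weight 0.6 tends to be drawn most often
print(idx)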
Boosting results – Digit recognition
Boosting is often robust to overfitting: the test set error keeps decreasing even after the training error reaches zero [Schapire, 1989]. But not always.
(Figure: training error and test error curves for digit recognition.)
Generalization Error Bounds [Freund & Schapire'95]
  T – number of boosting rounds
  d – VC dimension of the weak learner, measures the complexity of the classifier
  m – number of training examples
  T small: large bias, small variance
  T large: small bias, large variance
  a bias-variance tradeoff
Generalization Error Bounds [Freund & Schapire'95]
The bound (which holds with high probability) suggests that boosting can overfit if T is large.
This contradicts experimental results: boosting is often robust to overfitting, and the test set error decreases even after the training error is zero.
We need better analysis tools – margin-based bounds.
Why is it working?
You will need some learning theory (to be covered in the next two lectures) to understand this fully, but for now let's just go over some high-level ideas.
Generalization error: with high probability, the generalization error is less than a bound that grows with the number of rounds (sketched below).
As T goes up, the bound becomes worse: boosting should overfit!
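A standard form of the Freund–Schapire bound alluded to here (stated loosely, up to constants and log factors) is

\text{generalization error} \;\le\; \widehat{\Pr}\big[H(x) \ne y\big] \;+\; \tilde{O}\!\left(\sqrt{\frac{T\,d}{m}}\right)

so the complexity term grows with the number of rounds T, which is why this bound predicts overfitting.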
Experiments
(Figure: training error and test error curves, from "The Boosting Approach to Machine Learning" by Robert E. Schapire.)
Training Margins
When a vote is taken, the more predictors agreeing, the more confident you are in your prediction.
Margin for example (x_i, y_i):
  margin_h(x_i, y_i) = y_i (α_1 h(x_i; θ_1) + … + α_m h(x_i; θ_m)) / ∑_{l=1}^m α_l
The margin lies in [−1, 1] and is negative for all misclassified examples.
Successive boosting iterations improve the majority vote, i.e., the margin, of the training examples.
A Margin Bound
Robert E. Schapire, Yoav Freund, Peter Bartlett and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998.
For any θ > 0, the generalization error is less than
  Pr[margin(x, y) ≤ θ] + O(√(d / (m θ²)))
It does not depend on T!
Summary
Boosting takes a weak learner and converts it to a strong one.
Works by asymptotically minimizing the empirical error
Effectively maximizes the margin of the combined hypothesis
Some additional points for fun
Boosting and Logistic Regression
Logistic regression assumes a parametric form for P(y | x), and tries to maximize the likelihood of the (iid) data, which is equivalent to minimizing the log loss.
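For concreteness, the standard logistic-regression quantities meant here, with labels y ∈ {−1, +1} and linear score f(x) = w_0 + ∑_j w_j x_j, are

P(y = 1 \mid x) = \frac{1}{1 + e^{-f(x)}},
\qquad
\max_{w} \prod_{i=1}^{N} P(y_i \mid x_i; w)
\;\Longleftrightarrow\;
\min_{w} \sum_{i=1}^{N} \log\big(1 + e^{-y_i f(x_i)}\big)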
Boosting and Logistic Regression
Logistic regression is equivalent to minimizing the log loss.
Boosting minimizes a similar loss function (the exp loss), applied to a weighted average of weak learners.
Both are smooth approximations of the 0/1 loss!
(Figure: the 0/1 loss, log loss, and exp loss plotted against the margin y f(x).)
Boosting and Logistic Regression
Logistic regression: minimize the log loss
  define f(x) over predefined features x_j (a linear classifier)
  jointly optimize over all weights w_0, w_1, w_2, …
Boosting: minimize the exp loss
  define f(x) as a weighted average of weak learners h_t(x), where each h_t(x) is defined dynamically to fit the data (not a linear classifier)
  the weights α_t are learned incrementally, one per iteration t
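Concretely, in standard notation (the slide's own equations are not reproduced here), the two methods differ only in the function class for f and in the loss:

\text{Logistic regression:}\quad f(x) = w_0 + \sum_j w_j x_j, \qquad \min_{w} \sum_i \log\big(1 + e^{-y_i f(x_i)}\big)
\text{Boosting:}\quad f(x) = \sum_t \alpha_t h_t(x), \qquad \min_{\{\alpha_t,\, h_t\}} \sum_i e^{-y_i f(x_i)}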
Hard & Soft Decision
The combined output is a weighted average of weak learners, f(x) = ∑_t α_t h_t(x).
Hard decision / predicted label: sign(f(x)).
Soft decision: a probability estimate, by analogy with logistic regression.
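The soft-decision formula itself is not given above; under the usual logistic-regression analogy (the additive-logistic-regression view of AdaBoost due to Friedman, Hastie and Tibshirani), it would read

P(y = 1 \mid x) \approx \frac{1}{1 + e^{-2 f(x)}}, \qquad f(x) = \sum_t \alpha_t h_t(x)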
Effect of Outliers
Good: boosting can identify outliers, since it focuses on examples that are hard to categorize.
Bad: too many outliers can degrade classification performance and dramatically increase the time to convergence.
Gradient Boosting
Goal: find a nonlinear predictor f(x) that minimizes a chosen loss over the training data.
Gradient boosting generalizes AdaBoost (exponential loss) to any smooth loss function.
Gradient Boosting Decision Tree
Example losses:
  square loss (regression)
  logistic loss (classification)
  margin loss (ranking: prefer item i over item j)
  others…
Let's use a decision tree as the approximator. A J-leaf-node decision tree can be viewed as a partition of the input space together with a prediction value (weight) associated with each cell of the partition. We will learn the partition (tree structure) first, then the leaf weights. A small sketch for the square-loss case follows.
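A minimal sketch of the square-loss case, assuming scikit-learn regression trees (the function names, learning rate, and tree size are illustrative, not from the slides); each new tree is fit to the current residuals, i.e., the negative gradient of the square loss:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=100, lr=0.1, max_leaf_nodes=8):
    """Square-loss gradient boosting: each tree approximates the current residuals."""
    y = np.asarray(y, dtype=float)
    f = np.full(len(y), y.mean())                    # initial constant prediction
    trees = []
    for _ in range(M):
        residual = y - f                             # negative gradient of (1/2)(y - f)^2
        t = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, residual)
        f += lr * t.predict(X)                       # shrink each tree's contribution
        trees.append(t)
    return y.mean(), trees, lr

def gb_predict(f0, trees, lr, X):
    return f0 + lr * sum(t.predict(X) for t in trees)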