
Introduction to Machine Learning (67577), Lecture 4

Shai Shalev-Shwartz

School of CS and Engineering, The Hebrew University of Jerusalem

Boosting


Outline

1 Weak learnability

2 Boosting the confidence

3 Boosting the accuracy using AdaBoost

4 AdaBoost as a learner for Halfspaces++

5 AdaBoost and the Bias-Complexity Tradeoff

6 Weak Learnability and Separability with Margin

7 AdaBoost for Face Detection


Weak Learnability

Definition ((ε, δ)-Weak-Learnability)

A class H is (ε, δ)-weak-learnable if there exists a learning algorithm, A, and a training set size, m ∈ N, such that for every distribution D over X and every f ∈ H,

D^m({S : L_{D,f}(A(S)) ≤ ε}) ≥ 1 − δ .

Remarks:

Almost identical to (strong) PAC learning, but we only need to succeed for specific ε, δ

Every class H is (1/2, 0)-weak-learnable

Intuitively, one can think of a weak learner as an algorithm that uses a simple 'rule of thumb' to output a hypothesis that performs just slightly better than a random guess


Example of weak learner

X = R, H is the class of 3-piece classifiers, e.g.

[Figure: the real line with thresholds θ1 < θ2; the three regions are labeled +, −, +]

Let B = { x ↦ sign(x − θ) · b : θ ∈ R, b ∈ {±1} } be the class of Decision Stumps

Claim: There is a constant m such that ERM_B over m examples is a (5/12, 1/2)-weak learner for H

Proof:

Observe that there is always a decision stump h with L_{D,f}(h) ≤ 1/3

Apply the VC bound for the class of decision stumps
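
As an illustration (this code is my own sketch, not part of the lecture), an ERM_B weak learner over decision stumps can be implemented by trying every threshold between consecutive sorted points. The sample is weighted by a distribution vector D, which is exactly the form AdaBoost will need later:

import numpy as np

def erm_decision_stump(x, y, D):
    """ERM over decision stumps h(z) = sign(z - theta) * b on a weighted sample.

    x : (m,) real-valued points, y : (m,) labels in {-1, +1},
    D : (m,) non-negative weights summing to 1 (a distribution over the sample).
    Returns (theta, b, err) minimizing the weighted error sum_i D_i * 1[y_i != h(x_i)].
    """
    # Candidate thresholds: midpoints between consecutive sorted points, plus the extremes.
    xs = np.sort(x)
    thetas = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0, [xs[-1] + 1.0]))
    best = (None, None, np.inf)
    for theta in thetas:
        for b in (+1, -1):
            preds = np.sign(x - theta) * b
            preds[preds == 0] = b          # break ties on the threshold itself
            err = np.sum(D * (preds != y))
            if err < best[2]:
                best = (theta, b, err)
    return best

Running it with the uniform distribution D = (1/m, . . . , 1/m) recovers plain ERM over B.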


The problem of boosting

Suppose we have an (ε0, δ0)-weak-learner algorithm, A, for some class H

Can we use A to construct a strong learner?

If A is computationally efficient, can we boost it efficiently?

Two questions:

Boosting the confidence

Boosting the accuracy


Boosting the confidence

Suppose A is an (ε0, δ0)-weak learner for H that requires m0 examples

For any δ, ε ∈ (0, 1) we show how to learn H to accuracy ε0 + ε with confidence δ

Step 1: Apply A on k = ⌈log(2/δ) / log(1/δ0)⌉ i.i.d. samples, each of m0 examples, to obtain h1, . . . , hk

Step 2: Take an additional validation sample V of size |V| ≥ 2 log(4k/δ) / ε², and output h ∈ argmin_{hi} L_V(hi)

Claim: W.p. at least 1 − δ, we have L_D(h) ≤ ε0 + ε
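
The two steps fit into a dozen lines of Python. In the sketch below, the helpers weak_learner, sample, and empirical_error are placeholders I introduce for illustration; they are not defined on the slides:

import math

def boost_confidence(weak_learner, sample, empirical_error, m0, delta0, eps, delta):
    """Boost the confidence of an (eps0, delta0)-weak learner.

    weak_learner(S) -> hypothesis, sample(m) -> m fresh i.i.d. labeled examples,
    empirical_error(h, V) -> fraction of V misclassified by h.
    Returns a hypothesis whose error is at most eps0 + eps w.p. >= 1 - delta.
    """
    # Step 1: run the weak learner on k independent samples of size m0.
    k = math.ceil(math.log(2.0 / delta) / math.log(1.0 / delta0))
    hypotheses = [weak_learner(sample(m0)) for _ in range(k)]

    # Step 2: pick the hypothesis with the smallest error on a fresh validation set.
    v_size = math.ceil(2.0 * math.log(4.0 * k / delta) / eps ** 2)
    V = sample(v_size)
    return min(hypotheses, key=lambda h: empirical_error(h, V))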


Proof

First, the validation procedure guarantees

P[L_D(h) > min_i L_D(hi) + ε] ≤ δ/2 .

Second,

P[min_i L_D(hi) > ε0] = P[∀i, L_D(hi) > ε0]

= ∏_{i=1}^{k} P[L_D(hi) > ε0]

≤ δ0^k ≤ δ/2 .

Apply the union bound to conclude the proof.


Boosting a learner that succeeds on expectation

Suppose that A is a learner that guarantees:

E_{S∼D^m}[L_D(A(S))] ≤ min_{h∈H} L_D(h) + ε .

Denote θ = L_D(A(S)) − min_{h∈H} L_D(h), so we obtain

E_{S∼D^m}[θ] ≤ ε .

Since θ is a non-negative random variable, we can apply Markov's inequality to obtain

P[θ ≥ 2ε] ≤ E[θ] / (2ε) ≤ 1/2 .

Corollary: A is a (2ε, 1/2)-weak learner.


Boosting the accuracy


AdaBoost (’Adaptive Boosting’)

Input: S = (x1, y1), . . . , (xm, ym), where for each i, yi = f(xi)

Output: hypothesis h with small error on S

We’ll later analyze L(D,f)(h) as well

AdaBoost calls the weak learner on distributions over S


The AdaBoost Algorithm

input: training set S = (x1, y1), . . . , (xm, ym), weak learner WL, number of rounds T

initialize D^(1) = (1/m, . . . , 1/m)

for t = 1, . . . , T:

invoke the weak learner: ht = WL(D^(t), S)

compute εt = L_{D^(t)}(ht) = Σ_{i=1}^{m} D^(t)_i · 1[yi ≠ ht(xi)]

let wt = (1/2) log(1/εt − 1)

update D^(t+1)_i = D^(t)_i exp(−wt yi ht(xi)) / Σ_{j=1}^{m} D^(t)_j exp(−wt yj ht(xj)) for all i = 1, . . . , m

output the hypothesis hS(x) = sign(Σ_{t=1}^{T} wt ht(x)) .
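
The pseudo-code above translates almost line for line into Python. The sketch below is illustrative only: the weak-learner interface WL(D, X, y), returning a callable hypothesis h with h(X) ∈ {−1, +1}^m, is an assumption of mine (for instance, a small wrapper around the decision-stump ERM sketched earlier), not something fixed by the slides.

import numpy as np

def adaboost(X, y, WL, T):
    """AdaBoost. X: the m training inputs, y: labels in {-1, +1},
    WL(D, X, y): weak learner returning a hypothesis h with h(X) in {-1, +1}^m,
    T: number of boosting rounds. Returns the weighted-majority hypothesis h_S."""
    m = len(y)
    D = np.full(m, 1.0 / m)                    # D^(1) is uniform over the sample
    hyps, ws = [], []
    for t in range(T):
        h = WL(D, X, y)                        # invoke the weak learner on (D^(t), S)
        preds = h(X)
        eps_t = np.sum(D * (preds != y))       # weighted training error eps_t of h_t
        w_t = 0.5 * np.log(1.0 / eps_t - 1.0)  # w_t = (1/2) log(1/eps_t - 1)
        D = D * np.exp(-w_t * y * preds)       # up-weight the examples h_t got wrong
        D = D / D.sum()                        # renormalize: this is D^(t+1)
        hyps.append(h)
        ws.append(w_t)
    def h_S(X_new):
        return np.sign(sum(w * h(X_new) for w, h in zip(ws, hyps)))
    return h_S

Note that wt is undefined if some εt is exactly 0 (and non-positive if εt ≥ 1/2); practical implementations clip εt away from these values or stop early.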


Intuition: AdaBoost forces WL to focus on problematic examples

Claim: The error of ht w.r.t. D^(t+1) is exactly 1/2

Proof:

Σ_{i=1}^{m} D^(t+1)_i 1[yi ≠ ht(xi)] = Σ_{i=1}^{m} D^(t)_i e^{−wt yi ht(xi)} 1[yi ≠ ht(xi)] / Σ_{j=1}^{m} D^(t)_j e^{−wt yj ht(xj)}

= e^{wt} εt / (e^{wt} εt + e^{−wt}(1 − εt))

= εt / (εt + e^{−2wt}(1 − εt))

= εt / (εt + (εt/(1 − εt)) · (1 − εt))

= 1/2 .


Theorem

If WL is a (1/2 − γ, δ)-weak learner then, with probability at least 1 − δT,

L_S(hS) ≤ exp(−2γ²T) .

Remarks:

For any ε > 0 and γ ∈ (0, 1/2), if T ≥ log(1/ε)/(2γ²), then AdaBoost will output a hypothesis hS with L_S(hS) ≤ ε.

Setting ε = 1/(2m), the hypothesis hS must have zero training error

Since the weak learner is invoked on a distribution over S, in many cases δ can be 0. In any case, by "boosting the confidence", we can assume w.l.o.g. that δ is very small.
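
To get a feel for the numbers (the specific values below are illustrative, not from the slides): with edge γ = 0.1 and m = 1000 training examples, the ε = 1/(2m) remark asks for roughly

T ≥ log(2m)/(2γ²) = log(2000)/(2 · 0.01) ≈ 7.6/0.02 = 380

rounds of boosting before the training error is guaranteed to be zero.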


AdaBoost as a Learner for Halfspaces++

Let B be the set of all hypotheses the WL may return

Observe that AdaBoost outputs a hypothesis from the class

L(B, T) = { x ↦ sign(Σ_{t=1}^{T} wt ht(x)) : w ∈ R^T, ∀t, ht ∈ B } .

Since WL is invoked only on distributions over S, we can assume w.l.o.g. that B = {g1, . . . , gd} for some d ≤ 2^m.

Denote ψ(x) = (g1(x), . . . , gd(x)). Therefore:

L(B, T) = { x ↦ sign(⟨w, ψ(x)⟩) : w ∈ R^d, ‖w‖_0 ≤ T } ,

where ‖w‖_0 = |{i : wi ≠ 0}|.

That is, AdaBoost learns a composition of the class of halfspaces with sparse coefficients over the mapping x ↦ ψ(x)


Expressiveness of L(B, T )

Suppose X = R and B is the class of Decision Stumps,

B = { x ↦ sign(x − θ) · b : θ ∈ R, b ∈ {±1} } .

Let GT be the class of piece-wise constant functions with T pieces

Claim: GT ⊆ L(B, T )

Composing halfspaces on top of simple classes can be very expressive!
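
A sketch of why the claim holds (my own reconstruction of the standard argument, not spelled out on the slides): if g takes the values b1, . . . , bT ∈ {±1} on the T consecutive intervals determined by thresholds θ1 < · · · < θ_{T−1}, then for every x

g(x) = sign( (b1 + bT)/2 + Σ_{t=1}^{T−1} ((b_{t+1} − b_t)/2) · sign(x − θt) ) ,

and the constant term is itself realizable as a stump whose threshold lies below all the points, so g ∈ L(B, T) using at most T base hypotheses.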


Bias-complexity

Recall:

[Figure: a bar showing the total error split into the approximation error εapp and the estimation error εest]
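
The decomposition being recalled (as in the earlier lectures) is

L_{D,f}(hS) = εapp + εest, where εapp = min_{h ∈ L(B,T)} L_{D,f}(h) and εest = L_{D,f}(hS) − εapp ;

the bullets below track how each term behaves as T grows.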

We have argued that the expressiveness of L(B, T ) grows with T

In other words, the approximation error decreases with T

We’ll show that the estimation error increases with T

Therefore, the parameter T of AdaBoost enables us to control the bias-complexity tradeoff


The Estimation Error of L(B, T )

Claim: VCdim(L(B, T)) ≤ O(T · VCdim(B))

Corollary: if m ≥ Ω(log(1/δ)/(γ²ε)) and T = log(m)/(2γ²), then with probability at least 1 − δ,

L_{D,f}(hS) ≤ ε .


Weak Learnability and Separability with Margin

We have essentially shown: if H is weak learnable, then H ⊆ L(B,∞)

What about the other direction?

Using von Neumann's minimax theorem, it can be shown that if L(B,∞) separates a training set with ℓ1 margin γ, then ERM_B is a γ-weak learner for H

This is beyond the scope of the course


Face Detection

Classify rectangles in an image as face or non-face


Weak Learner for Face Detection

Rules of thumb:

“eye region is often darker than the cheeks”

“bridge of the nose is brighter than the eyes”

Goal:

We want to combine a few rules of thumb to obtain a face detector

“Sparsity” gives both a small estimation error and speed!


Weak Learner for Face Detection

Each hypothesis in the base class is of the form h(x) = f(g(x)), where f is a decision stump and g : R^{24×24} → R is parameterized by:

An axis-aligned rectangle R. Since each image is of size 24 × 24, there are at most 24^4 axis-aligned rectangles.

A type, t ∈ {A, B, C, D}. Each type corresponds to a mask:

[Figure: the four rectangle masks, types A, B, C, D]
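
To make the rectangle features concrete, here is a small Python sketch in the spirit of Viola–Jones: an integral image lets any rectangle sum be read off with four array references, and a type-A feature is the difference between the sums of the two halves of its rectangle. The function names and the sign convention are my own choices, not taken from the slides.

import numpy as np

def integral_image(img):
    """Cumulative sums so that any rectangle sum costs four array references."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, height, width):
    """Sum of pixel values in a rectangle, read off the integral image ii."""
    def at(r, c):
        return ii[r, c] if r >= 0 and c >= 0 else 0.0   # treat index -1 as zero
    bottom, right = top + height - 1, left + width - 1
    return at(bottom, right) - at(top - 1, right) - at(bottom, left - 1) + at(top - 1, left - 1)

def feature_type_A(ii, top, left, height, width):
    """Type-A mask (assumed convention): left half minus right half of the rectangle."""
    half = width // 2
    return rect_sum(ii, top, left, height, half) - rect_sum(ii, top, left + half, height, half)

# A weak hypothesis is then a decision stump over this single real-valued feature:
# h(x) = sign(feature_type_A(integral_image(x), ...) - theta) * b.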


AdaBoost for Face Detection

The first and second features selected by AdaBoost, as implemented by Viola and Jones.

[Figure (from Viola and Jones): the two features shown on their own and overlaid on a typical training face. The first feature measures the difference in intensity between the eye region and the upper cheeks, exploiting that the eye region is often darker than the cheeks; the second compares the intensities of the eye regions to the intensity across the bridge of the nose.]


Summary

Boosting the confidence using validation

Boosting the accuracy using AdaBoost

The power of composing halfspaces over simple classes

The bias-complexity tradeoff

AdaBoost works in many practical problems!
