Support vector machines (SVMs), Lecture 5. David Sontag, New York University



Page 1

Support vector machines (SVMs), Lecture 5

David Sontag, New York University

Page 2

Soft margin SVM

[Figure: separating hyperplane w·x + b = 0 with margin boundaries w·x + b = +1 and w·x + b = −1]

Slack penalty C > 0:
•  C = ∞ → minimizes an upper bound on the 0-1 loss (no slack is tolerated)
•  C ≈ 0 → slack is nearly free, so the points with ξi = 0 can have a big margin
•  Select C using cross-validation

[Figure labels: “slack variables” ξ1, ξ2, ξ3, ξ4 on the points that violate the margin]

Support vectors: data points for which the constraints are binding.

Page 3

Soft margin SVM

QP form:
  min over w, b, ξ of  (1/2) ||w||² + C Σᵢ ξᵢ
  subject to  yᵢ(w · xᵢ + b) ≥ 1 − ξᵢ  and  ξᵢ ≥ 0  for all i

More “natural” form (regularization term + empirical loss):
  min over w of  (λ/2) ||w||² + (1/m) Σᵢ max{0, 1 − yᵢ w · xᵢ}

Equivalent if λ = 1/(mC) (up to a constant scaling of the objective, and ignoring how the bias b is handled).
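As a concrete reading of the “natural” form, here is a minimal NumPy sketch that evaluates the regularized hinge-loss objective for a bias-free linear classifier (the names X, y, w, lam and svm_objective are illustrative, not from the slides):

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """Soft-margin SVM objective: (lam/2)||w||^2 + average hinge loss."""
    margins = y * (X @ w)                    # y_i * (w . x_i) for every example
    hinge = np.maximum(0.0, 1.0 - margins)   # max{0, 1 - y_i w.x_i}
    return 0.5 * lam * np.dot(w, w) + hinge.mean()
```

With λ = 1/(mC) this is, up to a constant factor (and ignoring the bias term b), the same quantity the QP form minimizes.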

Page 4

Subgradient (for non-differentiable functions): a vector g is a subgradient of a convex function f at w if f(u) ≥ f(w) + g · (u − w) for all u. Where f is differentiable, the gradient is the only subgradient.
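A quick standalone example (a standard fact, not from the slides): f(x) = |x| is non-differentiable at 0, and every slope g between −1 and 1 satisfies the subgradient inequality there:

\[
\partial\,|x|\,\big|_{x=0} \;=\; \{\, g \in \mathbb{R} : |u| \ge g\,u \ \text{for all } u \,\} \;=\; [-1,\ 1].
\]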

Page 5

(Sub)gradient descent of the SVM objective

Update: wt+1 = wt − ηt gt, where gt is a subgradient of the objective at wt.

Step size: ηt
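A hedged sketch of the batch version (illustrative names; the slides move to the stochastic, single-example version next):

```python
import numpy as np

def svm_subgradient_descent(X, y, lam, steps=100):
    """Batch subgradient descent on (lam/2)||w||^2 + average hinge loss."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, steps + 1):
        margins = y * (X @ w)
        active = margins < 1                                  # examples with nonzero hinge loss
        # Subgradient of the objective: lam*w minus the average of y_i x_i over active examples
        g = lam * w - (y[active, None] * X[active]).sum(axis=0) / n
        eta = 1.0 / (lam * t)                                 # decaying step size
        w = w - eta * g                                       # subgradient step
    return w
```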

Page 6

The Pegasos Algorithm

Pegasos Algorithm (from homework):
  Initialize: w1 = 0, t = 0
  For iter = 1, 2, …, 20
    For j = 1, 2, …, |data|
      t = t + 1
      ηt = 1/(tλ)
      If yj(wt · xj) < 1
        wt+1 = (1 − ηtλ) wt + ηt yj xj
      Else
        wt+1 = (1 − ηtλ) wt
  Output: wt+1

General framework:
  Initialize: w1 = 0, t = 0
  While not converged
    t = t + 1
    Choose a stepsize, ηt
    Choose a direction, pt
    Go! (update)
    Test for convergence
  Output: wt+1
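Read literally, the homework pseudocode becomes a short NumPy routine. This is a sketch under my own naming (pegasos, lam, n_epochs) and, like the slides, it omits the bias term; it is not the official homework solution:

```python
import numpy as np

def pegasos(X, y, lam, n_epochs=20):
    """Pegasos: stochastic subgradient descent for the linear SVM.

    X: (n, d) array of examples, y: (n,) labels in {-1, +1},
    lam: regularization parameter lambda, n_epochs: passes over the data.
    """
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(n_epochs):
        for j in range(n):                    # visit points in order, as in the homework
            t += 1
            eta = 1.0 / (t * lam)             # step size eta_t = 1/(t*lambda)
            if y[j] * (w @ X[j]) < 1:         # margin violated: hinge loss is active
                w = (1 - eta * lam) * w + eta * y[j] * X[j]
            else:                             # margin satisfied: only shrink w
                w = (1 - eta * lam) * w
    return w
```

Prediction on a new point x is then sign(w · x).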

Page 7

The Pegasos Algorithm (continued). Same pseudocode and general framework as above, but with the update written as an explicit subgradient step:
  If yj(wt · xj) < 1:  wt+1 = wt − ηt(λwt − yj xj)
  Else:                wt+1 = wt − ηt λwt

Page 8

The Pegasos Algorithm (continued). Convergence choice: a fixed number of iterations, T = 20 · |data| (this plays the role of the convergence test in the general framework).

Page 9

The Pegasos Algorithm (continued). Stepsize choice: ηt = 1/(tλ), i.e. it starts at 1/λ and decays like 1/t.

Page 10

The Pegasos Algorithm (continued). Direction choice: a stochastic approximation to the subgradient, computed from a single data point (derived on the next pages).

Page 11

Subgradient calculation

Objective:  (λ/2) ||w||² + (1/m) Σᵢ max{0, 1 − yᵢ w · xᵢ}

Stochastic approximation, for a randomly chosen data point i:  (λ/2) ||w||² + max{0, 1 − yᵢ w · xᵢ}

(In the assignment the choice of i is not random; this makes it easier to debug and to compare between students.)
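In code, the stochastic approximation just replaces the average over all m examples with a single sampled example; a minimal sketch (illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_objective(w, X, y, lam):
    """Single-example approximation of the SVM objective."""
    i = rng.integers(len(y))                        # randomly chosen data point i
    hinge_i = max(0.0, 1.0 - y[i] * (X[i] @ w))     # hinge loss on example i alone
    return 0.5 * lam * np.dot(w, w) + hinge_i
```

In expectation over i this equals the full objective, which is what justifies descending along its subgradient.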

Page 12

Subgradient calculation (continued). A (sub)gradient of the stochastic approximation is

  λw + (d/dw) max{0, 1 − yᵢ w · xᵢ}

so what remains is to differentiate the hinge term.

Page 13

Subgradient calculation (continued). Differentiating the hinge term case by case in yᵢ w · xᵢ:

  yᵢ w · xᵢ        > 1      < 1               = 1
  hinge loss        0       1 − yᵢ w · xᵢ     0
  (sub)gradient     0       −yᵢ xᵢ            any point between 0 and −yᵢ xᵢ

Page 14

Subgradient calculation (continued). At the kink yᵢ w · xᵢ = 1 every choice between 0 and −yᵢ xᵢ is a valid subgradient; the slides (and Pegasos) simply pick 0.

Page 15

Subgradient calculation (continued). Putting the pieces together, the stochastic (sub)gradient of (λ/2) ||w||² + max{0, 1 − yᵢ w · xᵢ} is

  λw − yᵢ xᵢ   if yᵢ w · xᵢ < 1
  λw + 0       else
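As a one-function sketch (illustrative names), the case analysis is:

```python
import numpy as np

def stochastic_subgradient(w, x_i, y_i, lam):
    """Subgradient of (lam/2)||w||^2 + max{0, 1 - y_i w.x_i} with respect to w."""
    if y_i * (w @ x_i) < 1:
        return lam * w - y_i * x_i    # hinge term is active
    else:
        return lam * w                # hinge term contributes nothing at or beyond the margin
```

The Pegasos update on the following pages is exactly wt+1 = wt − ηt times this quantity.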

Page 16

The Pegasos Algorithm (continued). Direction choice: the stochastic approximation to the subgradient derived above. Plugging it into the update gives exactly the homework pseudocode:
  If yj(wt · xj) < 1:  wt+1 = wt − ηt(λwt − yj xj)
  Else:                wt+1 = wt − ηt(λwt + 0)

Page 17

The Pegasos Algorithm (continued). Go: the update step of the general framework is wt+1 = wt − ηt pt, with pt the chosen direction.

Page 18

Why is this algorithm interesting?

•  Simple to implement, state-of-the-art results.
   – Notice the similarity to the Perceptron algorithm! Algorithmic differences: it updates whenever the margin is insufficient (not only on mistakes), it scales the weight vector, and it has a learning rate (see the side-by-side sketch after this list).

•  Since it is based on stochastic gradient descent, its running-time guarantees are probabilistic.

•  Highlights interesting tradeoffs between running time and data.
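To make the Perceptron comparison concrete, here is a side-by-side sketch of the two update rules. These are my own minimal formulations (standard textbook forms, not code from the slides), both omitting the bias term:

```python
import numpy as np

def perceptron_update(w, x, y, eta=1.0):
    """Perceptron: updates only on mistakes, no shrinking, fixed step size."""
    if y * (w @ x) <= 0:              # mistake: point on the wrong side of the hyperplane
        w = w + eta * y * x
    return w

def pegasos_update(w, x, y, lam, t):
    """Pegasos: updates on any margin violation, shrinks w, step decays as 1/(lam*t)."""
    eta = 1.0 / (lam * t)
    if y * (w @ x) < 1:               # margin violation: hinge loss is active
        return (1 - eta * lam) * w + eta * y * x
    else:
        return (1 - eta * lam) * w    # still shrinks w toward 0 (regularization)
```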

Page 19

Much faster than previous methods

•  3 datasets (provided by Joachims):
   – Reuters CCAT (800k examples, 47k features)
   – Physics ArXiv (62k examples, 100k features)
   – Covertype (581k examples, 54 features)

Training time (in seconds):

                  Pegasos    SVM-Perf    SVM-Light
  Reuters             2          77        20,075
  Covertype           6          85        25,514
  Astro-Physics       2           5            80

Page 20

Approximate algorithms: error decomposition

•  Approximation error: the best error achievable by a large-margin predictor; the error of the population minimizer
     w0 = argmin E[f(w)] = argmin λ||w||² + E_{x,y}[loss(⟨w, x⟩; y)]
•  Estimation error: the extra error due to replacing E[loss] with the empirical loss,
     w* = argmin fn(w)
•  Optimization error: the extra error due to only optimizing to within finite precision

[Figure: prediction error levels err(w0), err(w*), err(w)]

From the ICML'08 presentation (available here). [Shalev-Shwartz, Srebro '08]

Note: w0 is redefined in this context (see the definition above); it does not refer to the initial weight vector.
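Spelled out as a single identity (my arrangement of the three bullets above, not an equation taken from the slides; w denotes the vector the optimizer actually returns):

\[
\mathrm{err}(w) \;=\; \underbrace{\mathrm{err}(w_0)}_{\text{approximation}}
\;+\; \underbrace{\mathrm{err}(w^{*}) - \mathrm{err}(w_0)}_{\text{estimation}}
\;+\; \underbrace{\mathrm{err}(w) - \mathrm{err}(w^{*})}_{\text{optimization}}
\]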

Page 21

Pegasos guarantees (on top of the same error decomposition as the previous page):

  After T = O(1 / (λδε)) updates, with probability 1 − δ:
    err(wT) ≤ err(w0) + ε

[Shalev-Shwartz, Srebro '08]

Page 22

Pegasos guarantees (continued). The running time does NOT depend on the number of training examples! It DOES depend on:
  - the dimensionality d (why? each of the T updates costs O(d) time)
  - the approximation parameters ε and δ
  - the difficulty of the problem (through λ)

[Shalev-Shwartz, Srebro '08]

Page 23

But how is that possible? The double-edged sword

•  When the data set size increases:
   – Estimation error decreases
   – Optimization error can be allowed to increase, i.e. we can optimize to within lesser accuracy ⇒ fewer iterations
   – But handling more data is expensive, e.g. the runtime of each iteration increases
•  Stochastic gradient descent, e.g. Pegasos (Primal Estimated sub-GrAdient SOlver for SVM) [Shalev-Shwartz, Singer, Srebro, ICML'07]:
   – Fixed runtime per iteration
   – Runtime to reach a fixed accuracy does not increase with n

[Figure: prediction error levels err(w0), err(w*), err(w) as a function of data set size n]

As the dataset grows, our approximations can be worse and still reach the same error!

[Shalev-Shwartz, Srebro '08]