Support vector machines (SVMs), Lecture 5. David Sontag, New York University



Page 1

Support vector machines (SVMs), Lecture 5

David Sontag, New York University

Page 2

Soft margin SVM

[Figure: separating hyperplane w·x + b = 0 with margin boundaries w·x + b = +1 and w·x + b = −1]

Slack penalty C > 0:
•  C = ∞ → minimizes an upper bound on the 0-1 loss (no slack is tolerated)
•  C ≈ 0 → slack is nearly free, so the points with ξi = 0 can have a big margin
•  Select C using cross-validation

[Figure labels: “slack variables” ξ1, ξ2, ξ3, ξ4 on the points that violate the margin]

Support vectors: data points for which the constraints are binding.

Page 3

Soft margin SVM

QP form:
  min over w, b, ξ of  (1/2) ||w||² + C Σᵢ ξᵢ
  subject to  yᵢ(w · xᵢ + b) ≥ 1 − ξᵢ  and  ξᵢ ≥ 0  for all i

More “natural” form (regularization term + empirical loss):
  min over w of  (λ/2) ||w||² + (1/m) Σᵢ max{0, 1 − yᵢ w · xᵢ}

Equivalent if λ = 1/(mC) (up to a constant scaling of the objective, and ignoring how the bias b is handled).
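As a concrete reading of the “natural” form, here is a minimal NumPy sketch that evaluates the regularized hinge-loss objective for a bias-free linear classifier (the names X, y, w, lam and svm_objective are illustrative, not from the slides):

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """Soft-margin SVM objective: (lam/2)||w||^2 + average hinge loss."""
    margins = y * (X @ w)                    # y_i * (w . x_i) for every example
    hinge = np.maximum(0.0, 1.0 - margins)   # max{0, 1 - y_i w.x_i}
    return 0.5 * lam * np.dot(w, w) + hinge.mean()
```

With λ = 1/(mC) this is, up to a constant factor (and ignoring the bias term b), the same quantity the QP form minimizes.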

Page 4

Subgradient (for non-differentiable functions): a vector g is a subgradient of a convex function f at w if f(u) ≥ f(w) + g · (u − w) for all u. Where f is differentiable, the gradient is the only subgradient.
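A quick standalone example (a standard fact, not from the slides): f(x) = |x| is non-differentiable at 0, and every slope g between −1 and 1 satisfies the subgradient inequality there:

\[
\partial\,|x|\,\big|_{x=0} \;=\; \{\, g \in \mathbb{R} : |u| \ge g\,u \ \text{for all } u \,\} \;=\; [-1,\ 1].
\]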

Page 5

(Sub)gradient descent of the SVM objective

Update: wt+1 = wt − ηt gt, where gt is a subgradient of the objective at wt.

Step size: ηt
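A hedged sketch of the batch version (illustrative names; the slides move to the stochastic, single-example version next):

```python
import numpy as np

def svm_subgradient_descent(X, y, lam, steps=100):
    """Batch subgradient descent on (lam/2)||w||^2 + average hinge loss."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, steps + 1):
        margins = y * (X @ w)
        active = margins < 1                                  # examples with nonzero hinge loss
        # Subgradient of the objective: lam*w minus the average of y_i x_i over active examples
        g = lam * w - (y[active, None] * X[active]).sum(axis=0) / n
        eta = 1.0 / (lam * t)                                 # decaying step size
        w = w - eta * g                                       # subgradient step
    return w
```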

Page 6

The Pegasos Algorithm

Pegasos Algorithm (from homework):
  Initialize: w1 = 0, t = 0
  For iter = 1, 2, …, 20
    For j = 1, 2, …, |data|
      t = t + 1
      ηt = 1/(tλ)
      If yj(wt · xj) < 1
        wt+1 = (1 − ηtλ) wt + ηt yj xj
      Else
        wt+1 = (1 − ηtλ) wt
  Output: wt+1

General framework:
  Initialize: w1 = 0, t = 0
  While not converged
    t = t + 1
    Choose a stepsize, ηt
    Choose a direction, pt
    Go! (update)
    Test for convergence
  Output: wt+1
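Read literally, the homework pseudocode becomes a short NumPy routine. This is a sketch under my own naming (pegasos, lam, n_epochs) and, like the slides, it omits the bias term; it is not the official homework solution:

```python
import numpy as np

def pegasos(X, y, lam, n_epochs=20):
    """Pegasos: stochastic subgradient descent for the linear SVM.

    X: (n, d) array of examples, y: (n,) labels in {-1, +1},
    lam: regularization parameter lambda, n_epochs: passes over the data.
    """
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(n_epochs):
        for j in range(n):                    # visit points in order, as in the homework
            t += 1
            eta = 1.0 / (t * lam)             # step size eta_t = 1/(t*lambda)
            if y[j] * (w @ X[j]) < 1:         # margin violated: hinge loss is active
                w = (1 - eta * lam) * w + eta * y[j] * X[j]
            else:                             # margin satisfied: only shrink w
                w = (1 - eta * lam) * w
    return w
```

Prediction on a new point x is then sign(w · x).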

Page 7

The Pegasos Algorithm (continued). Same pseudocode and general framework as above, but with the update written as an explicit subgradient step:
  If yj(wt · xj) < 1:  wt+1 = wt − ηt(λwt − yj xj)
  Else:                wt+1 = wt − ηt λwt

Page 8

The Pegasos Algorithm (continued). Convergence choice: a fixed number of iterations, T = 20 · |data| (this plays the role of the convergence test in the general framework).

Page 9

The Pegasos Algorithm (continued). Stepsize choice: ηt = 1/(tλ), i.e. it starts at 1/λ and decays like 1/t.

Page 10

The Pegasos Algorithm (continued). Direction choice: a stochastic approximation to the subgradient, computed from a single data point (derived on the next pages).

Page 11

Subgradient calculation

Objective:  (λ/2) ||w||² + (1/m) Σᵢ max{0, 1 − yᵢ w · xᵢ}

Stochastic approximation, for a randomly chosen data point i:  (λ/2) ||w||² + max{0, 1 − yᵢ w · xᵢ}

(In the assignment the choice of i is not random; this makes it easier to debug and to compare between students.)
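In code, the stochastic approximation just replaces the average over all m examples with a single sampled example; a minimal sketch (illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_objective(w, X, y, lam):
    """Single-example approximation of the SVM objective."""
    i = rng.integers(len(y))                        # randomly chosen data point i
    hinge_i = max(0.0, 1.0 - y[i] * (X[i] @ w))     # hinge loss on example i alone
    return 0.5 * lam * np.dot(w, w) + hinge_i
```

In expectation over i this equals the full objective, which is what justifies descending along its subgradient.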

Page 12

Subgradient calculation (continued). A (sub)gradient of the stochastic approximation is

  λw + (d/dw) max{0, 1 − yᵢ w · xᵢ}

so what remains is to differentiate the hinge term.

Page 13

Subgradient calculation (continued). Differentiating the hinge term case by case in yᵢ w · xᵢ:

  yᵢ w · xᵢ        > 1      < 1               = 1
  hinge loss        0       1 − yᵢ w · xᵢ     0
  (sub)gradient     0       −yᵢ xᵢ            any point between 0 and −yᵢ xᵢ

Page 14

Subgradient calculation (continued). At the kink yᵢ w · xᵢ = 1 every choice between 0 and −yᵢ xᵢ is a valid subgradient; the slides (and Pegasos) simply pick 0.

Page 15

Subgradient calculation (continued). Putting the pieces together, the stochastic (sub)gradient of (λ/2) ||w||² + max{0, 1 − yᵢ w · xᵢ} is

  λw − yᵢ xᵢ   if yᵢ w · xᵢ < 1
  λw + 0       else
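As a one-function sketch (illustrative names), the case analysis is:

```python
import numpy as np

def stochastic_subgradient(w, x_i, y_i, lam):
    """Subgradient of (lam/2)||w||^2 + max{0, 1 - y_i w.x_i} with respect to w."""
    if y_i * (w @ x_i) < 1:
        return lam * w - y_i * x_i    # hinge term is active
    else:
        return lam * w                # hinge term contributes nothing at or beyond the margin
```

The Pegasos update on the following pages is exactly wt+1 = wt − ηt times this quantity.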

Page 16

The Pegasos Algorithm (continued). Direction choice: the stochastic approximation to the subgradient derived above. Plugging it into the update gives exactly the homework pseudocode:
  If yj(wt · xj) < 1:  wt+1 = wt − ηt(λwt − yj xj)
  Else:                wt+1 = wt − ηt(λwt + 0)

Page 17

The Pegasos Algorithm (continued). Go: the update step of the general framework is wt+1 = wt − ηt pt, with pt the chosen direction.

Page 18

Why is this algorithm interesting?

•  Simple to implement, state-of-the-art results.
   – Notice the similarity to the Perceptron algorithm! Algorithmic differences: it updates whenever the margin is insufficient (not only on mistakes), it scales the weight vector, and it has a learning rate (see the side-by-side sketch after this list).

•  Since it is based on stochastic gradient descent, its running-time guarantees are probabilistic.

•  Highlights interesting tradeoffs between running time and data.
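To make the Perceptron comparison concrete, here is a side-by-side sketch of the two update rules. These are my own minimal formulations (standard textbook forms, not code from the slides), both omitting the bias term:

```python
import numpy as np

def perceptron_update(w, x, y, eta=1.0):
    """Perceptron: updates only on mistakes, no shrinking, fixed step size."""
    if y * (w @ x) <= 0:              # mistake: point on the wrong side of the hyperplane
        w = w + eta * y * x
    return w

def pegasos_update(w, x, y, lam, t):
    """Pegasos: updates on any margin violation, shrinks w, step decays as 1/(lam*t)."""
    eta = 1.0 / (lam * t)
    if y * (w @ x) < 1:               # margin violation: hinge loss is active
        return (1 - eta * lam) * w + eta * y * x
    else:
        return (1 - eta * lam) * w    # still shrinks w toward 0 (regularization)
```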

Page 19

Much faster than previous methods

•  3 datasets (provided by Joachims):
   – Reuters CCAT (800k examples, 47k features)
   – Physics ArXiv (62k examples, 100k features)
   – Covertype (581k examples, 54 features)

Training time (in seconds):

                  Pegasos    SVM-Perf    SVM-Light
  Reuters             2          77        20,075
  Covertype           6          85        25,514
  Astro-Physics       2           5            80

Page 20

Approximate algorithms: error decomposition

•  Approximation error: the best error achievable by a large-margin predictor; the error of the population minimizer
     w0 = argmin E[f(w)] = argmin λ||w||² + E_{x,y}[loss(⟨w, x⟩; y)]
•  Estimation error: the extra error due to replacing E[loss] with the empirical loss,
     w* = argmin fn(w)
•  Optimization error: the extra error due to only optimizing to within finite precision

[Figure: prediction error levels err(w0), err(w*), err(w)]

From the ICML'08 presentation (available here). [Shalev-Shwartz, Srebro '08]

Note: w0 is redefined in this context (see the definition above); it does not refer to the initial weight vector.
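Spelled out as a single identity (my arrangement of the three bullets above, not an equation taken from the slides; w denotes the vector the optimizer actually returns):

\[
\mathrm{err}(w) \;=\; \underbrace{\mathrm{err}(w_0)}_{\text{approximation}}
\;+\; \underbrace{\mathrm{err}(w^{*}) - \mathrm{err}(w_0)}_{\text{estimation}}
\;+\; \underbrace{\mathrm{err}(w) - \mathrm{err}(w^{*})}_{\text{optimization}}
\]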

Page 21

Pegasos guarantees (on top of the same error decomposition as the previous page):

  After T = O(1 / (λδε)) updates, with probability 1 − δ:
    err(wT) ≤ err(w0) + ε

[Shalev-Shwartz, Srebro '08]

Page 22

Pegasos guarantees (continued). The running time does NOT depend on the number of training examples! It DOES depend on:
  - the dimensionality d (why? each of the T updates costs O(d) time)
  - the approximation parameters ε and δ
  - the difficulty of the problem (through λ)

[Shalev-Shwartz, Srebro '08]

Page 23

But how is that possible? The double-edged sword

•  When the data set size increases:
   – Estimation error decreases
   – Optimization error can be allowed to increase, i.e. we can optimize to within lesser accuracy ⇒ fewer iterations
   – But handling more data is expensive, e.g. the runtime of each iteration increases
•  Stochastic gradient descent, e.g. Pegasos (Primal Estimated sub-GrAdient SOlver for SVM) [Shalev-Shwartz, Singer, Srebro, ICML'07]:
   – Fixed runtime per iteration
   – Runtime to reach a fixed accuracy does not increase with n

[Figure: prediction error levels err(w0), err(w*), err(w) as a function of data set size n]

As the dataset grows, our approximations can be worse and still reach the same error!

[Shalev-Shwartz, Srebro '08]