CS 5614: (Big) Data Management Systems
B. Aditya Prakash
Lecture #20: Machine Learning 2


Page 1:

CS 5614: (Big) Data Management Systems

B. Aditya Prakash
Lecture #20: Machine Learning 2

Page 2:

Supervised Learning

§ Example: Spam filtering
§ Instance space x ∈ X (|X| = n data points)
– Binary or real-valued feature vector x of word occurrences
– d features (words + other things, d ~ 100,000)
§ Class y ∈ Y
– y: Spam (+1), Ham (−1)
§ Goal: Estimate a function f(x) so that y = f(x)

Page 3:

More generally: Supervised Learning

§ Would like to do prediction: estimate a function f(x) so that y = f(x)
§ Where y can be:
– Real number: Regression
– Categorical: Classification
– Complex object:
• Ranking of items, parse tree, etc.
§ Data is labeled:
– Have many pairs {(x, y)}
• x … vector of binary, categorical, real-valued features
• y … class ({+1, −1}, or a real number)

Page 4:

Supervised Learning

§ Task: Given data (X, Y), build a model f() to predict Y' based on X'
§ Strategy: Estimate y = f(x) on (X, Y).
Hope that the same f(x) also works to predict the unknown Y'
– The "hope" is called generalization
• Overfitting: f(x) predicts Y well but is unable to predict Y'
– We want to build a model that generalizes well to unseen data
• But how can we do well on data we have never seen before?!?

[Figure: (X, Y) is the training data; (X', Y') is the test data]

Page 5:

Supervised Learning

§ Idea: Pretend we do not know the data/labels we actually do know
– Build the model f(x) on the training data; see how well f(x) does on the test data
• If it does well, then apply it also to X'
§ Refinement: Cross validation (see the sketch below)
– Splitting into a single training/validation set is brutal
– Let's split our data (X, Y) into 10 folds (buckets)
– Take out 1 fold for validation, train on the remaining 9
– Repeat this 10 times, report average performance

[Figure: (X, Y) split into a training set and a validation set; X' is the test set]
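A minimal sketch of the 10-fold procedure above in NumPy-style Python; `train_fn` and `eval_fn` are placeholder hooks for whatever model and metric are used, not anything prescribed by the slides:

```python
import numpy as np

def ten_fold_cv(X, y, train_fn, eval_fn, k=10, seed=0):
    """Split (X, y) into k folds; hold out one fold for validation,
    train on the remaining k-1, and report average performance."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train_idx], y[train_idx])
        scores.append(eval_fn(model, X[val_idx], y[val_idx]))
    return np.mean(scores)
```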

Page 6:

Linear models for classification

§ Binary classification:

f(x) = +1 if w1 x1 + w2 x2 + … + wd xd ≥ θ
       −1 otherwise

§ Input: Vectors x(j) and labels y(j)
– Vectors x(j) are real-valued, with ||x||2 = 1
§ Goal: Find vector w = (w1, w2, …, wd)
– Each wi is a real number

Note: we can absorb the threshold by mapping x ⇔ (x, 1) and w ⇔ (w, −θ), so the decision boundary becomes w ⋅ x = 0.

The decision boundary is linear.

[Figure: "−" points on one side of the hyperplane w ⋅ x = 0, with normal vector w]
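As a concrete illustration (a sketch, not code from the lecture), the decision rule and the threshold-absorbing trick look like this in Python:

```python
import numpy as np

def predict(w, x, theta=0.0):
    """Linear classifier: +1 if w.x >= theta, -1 otherwise."""
    return 1 if np.dot(w, x) >= theta else -1

def absorb_threshold(w, theta):
    """Map w -> (w, -theta); pairing this with x -> (x, 1) makes the rule
    sign(w.x), i.e., a decision boundary through the origin."""
    return np.append(w, -theta)
```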

Page 7:

Perceptron [Rosenblatt '58]

§ (very) Loose motivation: Neuron
§ Inputs are feature values
§ Each feature has a weight wi
§ Activation is the sum:
– f(x) = Σi wi xi = w ⋅ x
§ If the f(x) is:
– Positive: Predict +1
– Negative: Predict −1

[Figure: inputs x1…x4 (e.g., word features such as "viagra", "nigeria") with weights w1…w4 feed a sum Σ, thresholded at ≥ 0 to output Spam = +1 or Ham = −1; the decision boundary is w ⋅ x = 0]

Page 8:

Perceptron: Estimating w

§ Perceptron: y' = sign(w ⋅ x)
§ How to find parameters w?
– Start with w0 = 0
– Pick training examples x(t) one by one (from disk)
– Predict class of x(t) using current weights
• y' = sign(w(t) ⋅ x(t))
– If y' is correct (i.e., y(t) = y'):
• No change: w(t+1) = w(t)
– If y' is wrong: adjust w(t)
w(t+1) = w(t) + η ⋅ y(t) ⋅ x(t)
– η is the learning rate parameter
– x(t) is the t-th training example
– y(t) is the true t-th class label ({+1, −1})

[Figure: on a mistake with y(t) = 1, adding η ⋅ y(t) ⋅ x(t) rotates w(t) toward the misclassified example, giving w(t+1)]

Note that the Perceptron is a conservative algorithm: it ignores samples that it classifies correctly.
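A minimal sketch of this training loop (assuming in-memory NumPy arrays rather than the slide's one-by-one reads from disk):

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, epochs=10):
    """Mistake-driven Perceptron updates, as on the slide.
    X: (n, d) array of examples; y: labels in {+1, -1}."""
    w = np.zeros(X.shape[1])          # start with w0 = 0
    for _ in range(epochs):
        for xt, yt in zip(X, y):
            y_pred = 1 if np.dot(w, xt) >= 0 else -1
            if y_pred != yt:          # conservative: update only on mistakes
                w = w + eta * yt * xt
    return w
```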

Page 9:

Perceptron: The Good and the Bad

§ Good: Perceptron convergence theorem:
– If there exists a set of weights that is consistent (i.e., the data is linearly separable), the Perceptron learning algorithm will converge
§ Bad: Never converges: if the data is not separable, the weights dance around indefinitely
§ Bad: Mediocre generalization:
– Finds a "barely" separating solution

Page 10:

Updating the Learning Rate

§ Perceptron will oscillate and won't converge
§ So, when to stop learning?
§ (1) Slowly decrease the learning rate η
– A classic way is: η = c1/(t + c2)
• But we also need to determine the constants c1 and c2
§ (2) Stop when the training error stops changing
§ (3) Have a small test dataset and stop when the test set error stops decreasing
§ (4) Stop when we have reached some maximum number of passes over the data
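For example, the classic decay schedule from option (1), with c1 and c2 as tuning constants (their values here are placeholders, as the slide notes they must be determined):

```python
def learning_rate(t, c1=1.0, c2=10.0):
    """Classic decaying schedule eta = c1 / (t + c2) at step t."""
    return c1 / (t + c2)
```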

Page 11:

SUPPORT VECTOR MACHINES

Page 12:

Support Vector Machines

§ Want to separate "+" from "−" using a line

Data:
§ Training examples:
– (x1, y1) … (xn, yn)
§ Each example i:
– xi = (xi(1), …, xi(d))
• xi(j) is real-valued
– yi ∈ {−1, +1}
§ Inner product: w ⋅ x = Σ_{j=1..d} w(j) ⋅ x(j)

Which is the best linear separator (defined by w)?

[Figure: "+" and "−" points in the plane, separable by several candidate lines]

Page 13:

Largest Margin

§ Distance from the separating hyperplane corresponds to the "confidence" of the prediction
§ Example:
– We are more sure about the class of A and B than of C

[Figure: "+" and "−" points with a separating line; A and B lie far from the line, C lies close to it]

Page 14:

Largest Margin

§ Margin γ: distance of the closest example from the decision line/hyperplane

The reason we define the margin this way is theoretical convenience and the existence of generalization error bounds that depend on the value of the margin.

Page 15:

Why is maximizing γ a good idea?

§ Remember: Dot product A ⋅ B = ‖A‖ ⋅ ‖B‖ ⋅ cos θ

[Figure: the projection of A onto B has length ‖A‖ cos θ]

Page 16:

Why is maximizing γ a good idea?

Page 17:

What is the margin?

§ Let:
– Line L: w ⋅ x + b = w(1)x(1) + w(2)x(2) + b = 0
– w = (w(1), w(2))
– Point A = (xA(1), xA(2))
– Point M on the line = (xM(1), xM(2))

d(A, L) = |AH| = |(A − M) ⋅ w|
        = |(xA(1) − xM(1)) w(1) + (xA(2) − xM(2)) w(2)|
        = xA(1) w(1) + xA(2) w(2) + b
        = w ⋅ A + b

Remember: xM(1) w(1) + xM(2) w(2) = −b, since M belongs to line L.

[Figure: line L with normal vector w; H is the foot of the perpendicular from point A to L, and M is a point on L]
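As a sketch of the formula just derived (assuming, as the slide's geometry does, that w is a unit vector):

```python
import numpy as np

def distance_to_hyperplane(A, w, b):
    """Signed distance of point A from the line/hyperplane w.x + b = 0,
    assuming ||w|| = 1 as in the derivation above."""
    return np.dot(w, A) + b
```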

Page 18:

Largest Margin

Page 19:

Support Vector Machine

§ Maximize the margin:
– Good according to intuition, theory (VC dimension) & practice
– γ is the margin … the distance from the separating hyperplane

Maximizing the margin:

max_{w,γ} γ   s.t.  ∀i, yi ⋅ (w ⋅ xi + b) ≥ γ

[Figure: "+" and "−" points with the hyperplane w ⋅ x + b = 0 and margin γ on each side]

Page 20:

SUPPORT VECTOR MACHINES: DERIVING THE MARGIN

Page 21:

Support Vector Machines

§ The separating hyperplane is defined by the support vectors
– Points on the +/− margin planes of the solution
– If you knew these points, you could ignore the rest
– Generally, there are d+1 support vectors (for d-dimensional data)

Page 22:

Canonical Hyperplane: Problem

Page 23:

Canonical Hyperplane: Solution

Page 24:

Maximizing the Margin

Page 25:

Non-linearly Separable Data

§ If the data is not separable, introduce a penalty:

min_w ½‖w‖² + C ⋅ (# of training mistakes)
s.t. ∀i, yi ⋅ (w ⋅ xi + b) ≥ 1

– Minimize ‖w‖² plus the number of training mistakes
– Set C using cross validation

§ How to penalize mistakes?
– All mistakes are not equally bad!

[Figure: mostly separable "+" and "−" points, with one "+" and one "−" on the wrong side of the line]

Page 26:

Support Vector Machines

§ Introduce slack variables ξi
§ If point xi is on the wrong side of the margin, then it gets penalty ξi

min_{w,b,ξi≥0} ½‖w‖² + C Σ_{i=1..n} ξi
s.t. ∀i, yi ⋅ (w ⋅ xi + b) ≥ 1 − ξi

For each data point: if margin ≥ 1, don't care; if margin < 1, pay linear penalty.

[Figure: points with slacks ξi and ξj on the wrong side of the margin incur linear penalties]

Page 27:

Slack Penalty C

§ What is the role of the slack penalty C?
– C = ∞: Only want w, b that separate the data
– C = 0: Can set ξi to anything; then w = 0 (basically ignores the data)

min_w ½‖w‖² + C ⋅ (# of mistakes)
s.t. ∀i, yi ⋅ (w ⋅ xi + b) ≥ 1

[Figure: decision boundaries for big C, "good" C, and small C]

Page 28:

Support Vector Machines

§ SVM in the "natural" form:

arg min_{w,b} ½ w ⋅ w + C Σ_{i=1..n} max{0, 1 − yi(w ⋅ xi + b)}

The first term is the margin term; C is the regularization parameter; the sum is the empirical loss L (how well we fit the training data).

This is equivalent to:

min_{w,b} ½‖w‖² + C Σ_{i=1..n} ξi
s.t. ∀i, yi ⋅ (w ⋅ xi + b) ≥ 1 − ξi

§ SVM uses the "Hinge Loss": max{0, 1 − z}, where z = yi ⋅ (xi ⋅ w + b)

[Figure: penalty vs. z for the 0/1 loss and the hinge loss; the hinge loss is 0 for z ≥ 1 and grows linearly as z decreases below 1]
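A direct transcription of the natural-form objective (a sketch; dense NumPy arrays are assumed):

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Empirical hinge loss: sum_i max{0, 1 - y_i (w.x_i + b)}."""
    z = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - z).sum()

def svm_objective(w, b, X, y, C=1.0):
    """SVM in the 'natural' form: 0.5 * w.w + C * empirical hinge loss."""
    return 0.5 * np.dot(w, w) + C * hinge_loss(w, b, X, y)
```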

Page 29:

SUPPORT VECTOR MACHINES: HOW TO COMPUTE THE MARGIN?

Page 30:

SVM: How to estimate w?

§ Want to estimate w and b!
– Standard way: Use a solver!
• Solver: software for finding solutions to "common" optimization problems
§ Use a quadratic solver:
– Minimize a quadratic function
– Subject to linear constraints

min_{w,b} ½ w ⋅ w + C Σ_{i=1..n} ξi
s.t. ∀i, yi ⋅ (xi ⋅ w + b) ≥ 1 − ξi

§ Problem: Solvers are inefficient for big data!

Page 31:

SVM: How to estimate w?

§ Want to estimate w, b!
§ Alternative approach:
– Want to minimize f(w,b):

f(w,b) = ½ Σ_{j=1..d} w(j)² + C Σ_{i=1..n} max{0, 1 − yi(Σ_{j=1..d} w(j) xi(j) + b)}

§ Side note:
– How to minimize convex functions?
– Use gradient descent: min_z g(z)
– Iterate: z_{t+1} ← z_t − η ∇g(z_t)

[Figure: a convex function g(z); gradient steps move downhill toward the minimum]

Page 32:

SVM: How to estimate w?

§ Want to minimize f(w,b):

f(w,b) = ½ Σ_{j=1..d} w(j)² + C Σ_{i=1..n} max{0, 1 − yi(Σ_{j=1..d} w(j) xi(j) + b)}

The second term is the empirical loss, C Σ_i L(xi, yi).

§ Compute the gradient ∇f(j) w.r.t. w(j):

∇f(j) = ∂f(w,b)/∂w(j) = w(j) + C Σ_{i=1..n} ∂L(xi, yi)/∂w(j)

where

∂L(xi, yi)/∂w(j) = 0           if yi(w ⋅ xi + b) ≥ 1
                 = −yi xi(j)   otherwise
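In vectorized form, the batch gradient looks like the sketch below; the update for b is an assumption filled in by the same pattern (the slide only differentiates with respect to w(j)):

```python
import numpy as np

def svm_gradient(w, b, X, y, C=1.0):
    """Batch gradient of f(w,b): the regularizer contributes w, and each
    margin-violating example (y_i (w.x_i + b) < 1) contributes -C * y_i * x_i."""
    z = y * (X @ w + b)
    violated = z < 1.0                 # dL/dw = 0 where the margin is >= 1
    grad_w = w - C * (y[violated, None] * X[violated]).sum(axis=0)
    grad_b = -C * y[violated].sum()    # assumption: same pattern for b
    return grad_w, grad_b
```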

Page 33:

SVM: How to estimate w?

§ Gradient descent:

Iterate until convergence:
• For j = 1 … d
• Evaluate: ∇f(j) = ∂f(w,b)/∂w(j) = w(j) + C Σ_{i=1..n} ∂L(xi, yi)/∂w(j)
• Update: w(j) ← w(j) − η ∇f(j)

η … learning rate parameter
C … regularization parameter

§ Problem:
– Computing ∇f(j) takes O(n) time!
• n … size of the training dataset

Page 34:

SVM: How to estimate w?

§ Stochastic Gradient Descent
– Instead of evaluating the gradient over all examples, evaluate it for each individual training example

We just had: ∇f(j) = w(j) + C Σ_{i=1..n} ∂L(xi, yi)/∂w(j)

§ Stochastic gradient descent:

∇f(j)(xi) = w(j) + C ⋅ ∂L(xi, yi)/∂w(j)

Iterate until convergence:
• For i = 1 … n
• For j = 1 … d
• Compute: ∇f(j)(xi)
• Update: w(j) ← w(j) − η ∇f(j)(xi)

Notice: no summation over i anymore.
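Putting the pieces together, a minimal SGD-SVM loop might look like this (a sketch; the bias update mirrors the w update, which the slide does not spell out):

```python
import numpy as np

def sgd_svm(X, y, C=1.0, eta=0.01, epochs=10):
    """Stochastic gradient descent for the linear SVM objective:
    one update per training example, as on the slide."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            if y[i] * (np.dot(w, X[i]) + b) < 1.0:    # hinge term active
                w -= eta * (w - C * y[i] * X[i])
                b -= eta * (-C * y[i])                # assumed b update
            else:                                     # only the regularizer
                w -= eta * w
    return w, b
```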

Page 35:

SUPPORT VECTOR MACHINES: EXAMPLE

Page 36:

Example: Text categorization

§ Example by Léon Bottou:
– Reuters RCV1 document corpus
• Predict a category of a document
– One vs. the rest classification
– n = 781,000 training examples (documents)
– 23,000 test examples
– d = 50,000 features
• One feature per word
• Remove stop-words
• Remove low-frequency words

Page 37:

Example: Text categorization

§ Questions:
– (1) Is SGD successful at minimizing f(w,b)?
– (2) How quickly does SGD find the min of f(w,b)?
– (3) What is the error on a test set?

[Table: training time, value of f(w,b), and test error for Standard SVM, "Fast SVM", and SGD SVM]

(1) SGD-SVM is successful at minimizing the value of f(w,b)
(2) SGD-SVM is super fast
(3) SGD-SVM test set error is comparable

Page 38:

Optimization "Accuracy"

Optimization quality: |f(w,b) − f(wopt, bopt)|

[Figure: optimization quality vs. training time for Conventional SVM and SGD SVM]

For optimizing f(w,b) to within reasonable quality, SGD-SVM is super fast.

Page 39:

SGD vs. Batch Conjugate Gradient

§ SGD on the full dataset vs. Conjugate Gradient on a sample of n training examples

Bottom line: doing a simple (but fast) SGD update many times is better than doing a complicated (but slow) CG update a few times.

Theory says: gradient descent converges in time linear in k, while conjugate gradient converges in time proportional to √k, where k is the condition number.

Page 40:

Practical Considerations

§ Need to choose the learning rate η and t0:

w_{t+1} ← w_t − (η / (t + t0)) ⋅ (w_t + C ⋅ ∂L(xi, yi)/∂w)

§ Léon suggests:
– Choose t0 so that the expected initial updates are comparable with the expected size of the weights
– Choose η:
• Select a small subsample
• Try various rates η (e.g., 10, 1, 0.1, 0.01, …)
• Pick the one that most reduces the cost
• Use that η for the next 100k iterations on the full dataset
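A sketch of that rate-selection recipe, reusing the hypothetical `sgd_svm` and `svm_objective` helpers sketched earlier (they are illustrations, not code from the lecture):

```python
import numpy as np

def pick_eta(X, y, candidates=(10.0, 1.0, 0.1, 0.01), subsample=1000, seed=0):
    """Try each candidate rate on a small subsample and keep the one that
    most reduces the SVM cost, per Bottou's suggestion above."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(subsample, len(X)), replace=False)
    Xs, ys = X[idx], y[idx]
    costs = {}
    for eta in candidates:
        w, b = sgd_svm(Xs, ys, eta=eta, epochs=1)   # helper sketched earlier
        costs[eta] = svm_objective(w, b, Xs, ys)
    return min(costs, key=costs.get)
```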

Page 41:

Practical Considerations

§ Sparse Linear SVM:
– Feature vector xi is sparse (contains many zeros)
• Do not do: xi = [0,0,0,1,0,0,0,0,5,0,0,0,0,0,0,…]
• But represent xi as a sparse vector xi = [(4,1), (9,5), …]
– Can we do the SGD update
w ← w − η(w + C ⋅ ∂L(xi, yi)/∂w)
more efficiently?
– Approximate it in 2 steps:
w ← w − ηC ⋅ ∂L(xi, yi)/∂w   (cheap: xi is sparse, so only a few coordinates j of w will be updated)
w ← w(1 − η)                 (expensive: w is not sparse, so all coordinates need to be updated)
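The two-step update as code (a sketch; xi arrives as (index, value) pairs, and the bias b is treated as fixed within the step since the slide shows only the w update):

```python
def sparse_sgd_step(w, b, x_sparse, yi, eta, C):
    """Two-step sparse SGD update from the slide.
    x_sparse: nonzero entries of x_i as (index, value) pairs."""
    # Step 1 (cheap): the loss part touches only the nonzero coordinates of x_i.
    margin = yi * (sum(w[j] * v for j, v in x_sparse) + b)
    if margin < 1.0:                    # dL/dw = -y_i x_i when margin < 1
        for j, v in x_sparse:
            w[j] += eta * C * yi * v
    # Step 2 (expensive): shrink every coordinate of w by (1 - eta).
    for j in range(len(w)):
        w[j] *= (1.0 - eta)
    return w
```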

Page 42:

Practical Considerations

Page 43:

Practical Considerations

§ Stopping criteria: how many iterations of SGD?
– Early stopping with cross validation
• Create a validation set
• Monitor the cost function on the validation set
• Stop when the loss stops decreasing
– Early stopping
• Extract two disjoint subsamples A and B of the training data
• Train on A, stop by validating on B
• The number of epochs is an estimate of k
• Train for k epochs on the full dataset

Page 44:

What about multiple classes?

§ Idea 1: One against all. Learn 3 classifiers:
– + vs. {o, −}
– − vs. {o, +}
– o vs. {+, −}
Obtain: w+ b+, w− b−, wo bo
§ How to classify?
§ Return the class c with the highest score:
arg max_c wc ⋅ x + bc
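For instance, the one-against-all decision rule might be coded as follows (a sketch; `classifiers` maps each class label to its (w, b) pair):

```python
import numpy as np

def one_vs_all_predict(classifiers, x):
    """Return arg max_c (w_c . x + b_c) over the per-class classifiers."""
    return max(classifiers,
               key=lambda c: np.dot(classifiers[c][0], x) + classifiers[c][1])
```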

Page 45:

Learn 1 classifier: Multiclass SVM

§ Idea 2: Learn 3 sets of weights simultaneously!
– For each class c, estimate wc, bc
– Want the correct class to have the highest margin:

wyi ⋅ xi + byi ≥ 1 + wc ⋅ xi + bc   ∀c ≠ yi, ∀i

[Figure: example (xi, yi); the correct class's hyperplane scores it at least 1 higher than any other class's]

Page 46:

Multiclass SVM

§ Optimization problem:

min_{w,b} ½ Σ_c ‖wc‖² + C Σ_{i=1..n} ξi
s.t. wyi ⋅ xi + byi ≥ wc ⋅ xi + bc + 1 − ξi,  ∀c ≠ yi, ∀i
     ξi ≥ 0, ∀i

– To obtain the parameters wc, bc (for each class c) we can use similar techniques as for the 2-class SVM
§ SVM is widely perceived as a very powerful learning algorithm