
Page 1: Frank-Wolfe optimization insights in machine learning

Frank-Wolfe optimization insights in machine learning

Simon Lacoste-Julien
INRIA / École Normale Supérieure, SIERRA Project Team

SMILE – November 4th 2013

Page 2: Frank-Wolfe optimization insights in machine learning

Outline

- Frank-Wolfe optimization
- Frank-Wolfe for structured prediction
  - links with previous algorithms
  - block-coordinate extension
  - results for sequence prediction
- Herding as Frank-Wolfe optimization
  - extension: weighted herding
  - simulations for quadrature

Page 3: Frank-Wolfe optimization insights in machine learning

Frank-Wolfe algorithm [Frank, Wolfe 1956] (aka conditional gradient)

Algorithm for constrained optimization:  \min_{\alpha \in \mathcal{M}} f(\alpha),
where f is convex & continuously differentiable and \mathcal{M} is convex & compact.

FW algorithm – repeat:
1) Find a good feasible direction by minimizing the linearization of f:
   s_{t+1} \in \arg\min_{s \in \mathcal{M}} \langle \nabla f(\alpha_t), s \rangle
2) Take a convex step in that direction:
   \alpha_{t+1} = (1 - \gamma_t)\, \alpha_t + \gamma_t\, s_{t+1}

Properties:
- O(1/T) rate
- sparse iterates
- duality gap for free
- affine invariant
- rate holds even if the linear subproblem is solved approximately
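As a concrete illustration (not from the slides), here is a minimal NumPy sketch of this loop for a smooth f over the probability simplex, where the linear subproblem amounts to picking the vertex with the smallest gradient coordinate; `frank_wolfe_simplex` and `grad_f` are hypothetical names.

```python
# Minimal Frank-Wolfe sketch: minimize a smooth f over the probability simplex,
# whose linear subproblem is solved by picking the best vertex (coordinate).
import numpy as np

def frank_wolfe_simplex(grad_f, dim, T=200):
    alpha = np.ones(dim) / dim          # feasible starting point
    for t in range(T):
        g = grad_f(alpha)               # gradient of f at the current iterate
        s = np.zeros(dim)
        s[np.argmin(g)] = 1.0           # LMO over the simplex: best vertex
        gap = g.dot(alpha - s)          # Frank-Wolfe duality gap (free certificate)
        if gap < 1e-8:
            break
        gamma = 2.0 / (t + 2)           # standard step size; line search also works
        alpha = (1 - gamma) * alpha + gamma * s
    return alpha

# Example: least-squares projection of y onto the simplex,
# i.e. f(alpha) = 0.5 * ||alpha - y||^2 with grad_f(alpha) = alpha - y.
y = np.array([0.2, 1.5, -0.3, 0.6])
alpha_star = frank_wolfe_simplex(lambda a: a - y, dim=4)
print(alpha_star, alpha_star.sum())
```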

Page 4: Frank-Wolfe optimization insights in machine learning

Frank-Wolfe: properties

- convex steps => the iterate is a sparse convex combination:
  \alpha_T = \rho_0 \alpha_0 + \sum_{t=1}^{T} \rho_t s_t \quad\text{where}\quad \sum_{t=0}^{T} \rho_t = 1
- get a duality gap certificate for free
  (special case of the Fenchel duality gap); it also converges as O(1/T)!
- only need to solve the linear subproblem *approximately* (additive/multiplicative bound)
- affine invariant!

[see Jaggi ICML 2013]
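For reference, the gap certificate mentioned above has the following standard form (a byproduct of the linear subproblem, as in [Jaggi ICML 2013]):

```latex
% Frank-Wolfe duality gap: upper-bounds the suboptimality at no extra cost.
g(\alpha_t)
  = \max_{s \in \mathcal{M}} \langle \nabla f(\alpha_t),\, \alpha_t - s \rangle
  = \langle \nabla f(\alpha_t),\, \alpha_t - s_{t+1} \rangle
  \;\ge\; f(\alpha_t) - \min_{\alpha \in \mathcal{M}} f(\alpha).
```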

Page 5: Frank-Wolfe optimization insights in machine learning

Block-Coordinate Frank-Wolfe Optimization for Structured SVMs
[ICML 2013]

Simon Lacoste-Julien, Martin Jaggi, Patrick Pletscher, Mark Schmidt

Page 6: Frank-Wolfe optimization insights in machine learning

Structured SVM optimization

- learn a classifier for structured prediction: prediction requires decoding
- structured SVM primal (vs. the binary hinge loss): the structured hinge loss
  involves loss-augmented decoding
- structured SVM dual: exponential number of variables!
- primal-dual pair [formulas on slide]
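Since the slide's formulas were images, here is the usual n-slack formulation the talk presumably builds on (an assumption about the exact variant shown):

```latex
% Standard n-slack structured SVM primal (assumed form; the slide's own
% formulas were rendered as images):
\min_{w}\;\; \frac{\lambda}{2}\,\|w\|^2
  \;+\; \frac{1}{n} \sum_{i=1}^{n}
  \underbrace{\max_{y \in \mathcal{Y}_i}\Big[ L(y_i, y) - \langle w,\, \psi_i(y) \rangle \Big]}_{\text{structured hinge loss (loss-augmented decoding)}},
\qquad \psi_i(y) := \phi(x_i, y_i) - \phi(x_i, y).
```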

Page 7: Frank-Wolfe optimization insights in machine learning

Structured SVM optimization (2)

popular approaches:
- stochastic subgradient method [Ratliff et al. 07, Shalev-Shwartz et al. 10]
  - pros: online!  cons: sensitive to step-size; don't know when to stop
- cutting plane method (SVMstruct) [Tsochantaridis et al. 05, Joachims et al. 09]
  - pros: automatic step-size; duality gap  cons: batch! -> slow for large n

our approach: block-coordinate Frank-Wolfe on the dual
-> combines the best of both worlds:
- online!
- automatic step-size via analytic line search
- duality gap
- rates also hold for approximate oracles

rate after K passes through the data:

Page 8: Frank-Wolfe optimization insights in machine learning

Frank-Wolfe algorithm [Frank, Wolfe 1956] (aka conditional gradient)

Algorithm for constrained optimization:  \min_{\alpha \in \mathcal{M}} f(\alpha),
where f is convex & continuously differentiable and \mathcal{M} is convex & compact.

FW algorithm – repeat:
1) Find a good feasible direction by minimizing the linearization of f:
   s_{t+1} \in \arg\min_{s \in \mathcal{M}} \langle \nabla f(\alpha_t), s \rangle
2) Take a convex step in that direction:
   \alpha_{t+1} = (1 - \gamma_t)\, \alpha_t + \gamma_t\, s_{t+1}

Properties:
- O(1/T) rate
- sparse iterates
- duality gap for free
- affine invariant
- rate holds even if the linear subproblem is solved approximately

Page 9: Frank-Wolfe optimization insights in machine learning

Frank-Wolfe for structured SVM

Run the FW algorithm on the structured SVM dual \min_{\alpha \in \mathcal{M}} f(\alpha):
1) Find a good feasible direction by minimizing the linearization of f:
   key insight: this amounts to loss-augmented decoding on each example i
2) Take a convex step in that direction:
   \alpha_{t+1} = (1 - \gamma_t)\, \alpha_t + \gamma_t\, s_{t+1}

Using the primal-dual link, the FW step becomes a batch subgradient step in the primal;
choose \gamma_t by analytic line search on the quadratic dual.

Link between FW and the subgradient method: see [Bach 12]
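As a sketch of what "analytic line search on the quadratic dual" means: for a quadratic f with Hessian A, the exact minimizer along the FW segment has a closed form (generic version, not the slide's exact expression):

```latex
% Exact line search for a quadratic f along the FW segment from alpha_t to s_{t+1}
% (generic sketch; the talk specializes this to the structured SVM dual):
\gamma_t \;=\; \operatorname*{clip}_{[0,1]}\!\left(
  \frac{\langle \nabla f(\alpha_t),\, \alpha_t - s_{t+1} \rangle}
       {(\alpha_t - s_{t+1})^{\top} A\, (\alpha_t - s_{t+1})} \right).
```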

Page 10: Frank-Wolfe optimization insights in machine learning

FW for structured SVM: properties

- running FW on the dual ≡ batch subgradient on the primal,
  but with an adaptive step-size from analytic line search
  and a duality-gap stopping criterion
- 'fully corrective' FW on the dual ≡ cutting plane algorithm (SVMstruct)
  - still O(1/T) rate, but provides a simpler proof of SVMstruct convergence
    + guarantees with approximate oracles
  - not faster than simple FW in our experiments
- BUT: still batch => slow for large n...

Page 11: Frank-Wolfe optimization insights in machine learning

Block-Coordinate Frank-Wolfe (new!)

for constrained optimization over a compact product domain:
  \min_{\alpha \in \mathcal{M}^{(1)} \times \dots \times \mathcal{M}^{(n)}} f(\alpha)

pick i at random; update only block i with a FW step

we proved the same O(1/T) rate as batch FW
-> each step is n times cheaper, though
-> the constant can be the same (e.g. for the SVM)

Properties:
- O(1/T) rate
- sparse iterates
- duality gap guarantees
- affine invariant
- rate holds even if the linear subproblem is solved approximately

Page 12: Frank-Wolfe optimization insights in machine learning

Block-Coordinate Frank-Wolfe (new!)

for constrained optimization over a compact product domain:
pick i at random; update only block i with a FW step

for the structured SVM, the block-i linear subproblem is exactly
loss-augmented decoding on example i

we proved the same O(1/T) rate as batch FW
-> each step is n times cheaper, though
-> the constant can be the same (e.g. for the SVM)
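To make the block update concrete, here is a short Python sketch of BCFW for the structured SVM in the spirit of the paper's pseudocode; `max_oracle` is a hypothetical user-supplied loss-augmented decoder returning (psi_i(y*), L(y_i, y*)) for example i at the current w, and the bookkeeping should be checked against the paper.

```python
# Sketch of block-coordinate Frank-Wolfe for the structured SVM.
import numpy as np

def bcfw_ssvm(max_oracle, n, d, lam, num_passes=10, rng=np.random.default_rng(0)):
    w = np.zeros(d); l = 0.0                    # global iterate and its loss term
    w_i = np.zeros((n, d)); l_i = np.zeros(n)   # per-example blocks
    for _ in range(num_passes * n):
        i = rng.integers(n)
        psi, loss = max_oracle(i, w)            # loss-augmented decoding on example i
        w_s = psi / (lam * n)                   # FW corner for block i
        l_s = loss / n
        # analytic line search on the quadratic dual, clipped to [0, 1]
        num = lam * (w_i[i] - w_s).dot(w) - l_i[i] + l_s
        den = lam * np.sum((w_i[i] - w_s) ** 2)
        gamma = np.clip(num / den, 0.0, 1.0) if den > 0 else 1.0
        w_i_new = (1 - gamma) * w_i[i] + gamma * w_s
        l_i_new = (1 - gamma) * l_i[i] + gamma * l_s
        w += w_i_new - w_i[i]; l += l_i_new - l_i[i]   # keep the global sums in sync
        w_i[i] = w_i_new; l_i[i] = l_i_new
    return w
```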

Page 13: Frank-Wolfe optimization insights in machine learning

BCFW for structured SVM: properties

- each update requires 1 oracle call (vs. n for SVMstruct),
  so after K passes through the data we get the error bound shown on the slide
  (vs. the corresponding bound for SVMstruct)
- advantages over stochastic subgradient:
  - step-sizes by line search -> more robust
  - duality gap certificate -> know when to stop
  - guarantees hold for approximate oracles
- implementation: https://github.com/ppletscher/BCFWstruct
  - almost as simple as the stochastic subgradient method
  - caveat: need to store one parameter vector per example (or store the dual variables)
- for the binary SVM -> reduces to the DCA method [Hsieh et al. 08];
  interesting link with prox-SDCA [Shalev-Shwartz et al. 12]

Page 14: Frank-Wolfe optimization insights in machine learning

More info about the constants...

batch FW rate: governed by the "curvature" C_f
BCFW rate: governed by the "product curvature" C_f^{\otimes}

comparing the constants:
- for the structured SVM – same constants
- identity Hessian + cube constraint: same constants (no speed-up)
  -> removed with line search
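For reference, the curvature constants referred to here are usually defined as follows (following [Jaggi ICML 2013]; stated as an assumption since the slide's formulas are not in the transcript):

```latex
% Curvature constant of f over M and the product curvature used in the
% BCFW rate (sum of the per-block curvatures):
C_f \;=\; \sup_{\substack{x, s \in \mathcal{M},\; \gamma \in [0,1] \\ y = x + \gamma (s - x)}}
  \frac{2}{\gamma^2}\Big( f(y) - f(x) - \langle y - x,\, \nabla f(x) \rangle \Big),
\qquad
C_f^{\otimes} \;=\; \sum_{i=1}^{n} C_f^{(i)}.
```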

Page 15: Frank-Wolfe optimization insights in machine learning

Sidenote: weighted averaging

- it is standard to average the iterates of the stochastic subgradient method
- uniform averaging vs. t-weighted averaging [L.-J. et al. 12], [Shamir & Zhang 13]
- weighted averaging improves the duality gap for BCFW
- it also makes a big difference in test error!
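The two averaging schemes being compared are presumably the standard ones below (the t-weighted form follows [L.-J. et al. 12]):

```latex
% Uniform vs. t-weighted averaging of the iterates w_t:
\bar{w}_T^{\mathrm{unif}} = \frac{1}{T} \sum_{t=1}^{T} w_t,
\qquad
\bar{w}_T^{\mathrm{wavg}} = \frac{2}{T(T+1)} \sum_{t=1}^{T} t\, w_t.
```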

Page 16: Frank-Wolfe optimization insights in machine learning

Experiments

(plots: results on the OCR dataset and the CoNLL dataset)

Page 17: Frank-Wolfe optimization insights in machine learning

Surprising test error on the CoNLL dataset though!

(plots: optimization error vs. test error – the ordering of the methods is flipped!)

Page 18: Frank-Wolfe optimization insights in machine learning

Conclusions for 1st part

- applying FW on the dual of the structured SVM unified previous algorithms
  and provided a line-search version of batch subgradient
- new block-coordinate variant of the Frank-Wolfe algorithm:
  same convergence rate but with a cheaper iteration cost
- yields a robust & fast algorithm for the structured SVM
- future work:
  - caching tricks
  - non-uniform sampling
  - regularization path
  - explain weighted averaging
  - test error mystery

Page 19: Frank-Wolfe optimization insights in machine learning

On the Equivalence between Herding and Conditional Gradient Algorithms
[ICML 2012]

Simon Lacoste-Julien, Francis Bach, Guillaume Obozinski

Page 20: Frank-Wolfe optimization insights in machine learning

A motivation: quadrature

Approximating integrals:
  \int_{\mathcal{X}} f(x)\, p(x)\, dx \;\approx\; \frac{1}{T} \sum_{t=1}^{T} f(x_t)

- random sampling x_t \sim p(x) yields O(1/\sqrt{T}) error
- herding [Welling 2009] yields O(1/T) error! [Chen et al. 2010] (like quasi-MC)

This part:
- links herding with an optimization algorithm (conditional gradient / Frank-Wolfe)
- suggests extensions – e.g. a weighted version with O(e^{-cT}) error
- BUT the extensions are worse for learning??? -> yields interesting insights
  on the properties of herding...

Page 21: Frank-Wolfe optimization insights in machine learning

Outline

- Background: Herding / [conditional gradient algorithm]
- Equivalence between herding & conditional gradient
  - extensions
  - new rates & theorems
- Simulations
  - approximation of integrals with conditional gradient variants
  - learned distribution vs. max entropy

Page 22: Frank-Wolfe optimization insights in machine learning

Review of herding [Welling ICML 2009]

Learning in an MRF with feature map \Phi : \mathcal{X} \to \mathcal{F}:
  p_\theta(x) = \frac{1}{Z_\theta}\, \exp(\langle \theta, \Phi(x) \rangle)

Motivation (diagram on slide): data -> (approximate) ML learning / max. entropy,
i.e. moment matching -> parameter \theta_{ML} -> (approximate) inference by sampling
-> samples; (pseudo-)herding shortcuts this pipeline.

Page 23: Frank-Wolfe optimization insights in machine learning

Herding updates

Zero-temperature limit of the log-likelihood, the 'Tipi' function
(picture thanks to Max Welling):
  \lim_{\beta \to 0}\; \langle \theta, \mu \rangle
    - \beta \log\Big( \sum_{x \in \mathcal{X}} \exp(\langle \theta, \Phi(x) \rangle / \beta) \Big)
  \;=\; \langle \theta, \mu \rangle - \max_{x \in \mathcal{X}} \langle \theta, \Phi(x) \rangle

Herding updates = subgradient ascent updates on this function:
  x_{t+1} \in \arg\max_{x \in \mathcal{X}} \langle \theta_t, \Phi(x) \rangle
  \theta_{t+1} = \theta_t + \mu - \Phi(x_{t+1})

Properties:
1) weakly chaotic -> entropy?
2) moment matching (our focus):
   \big\| \mu - \tfrac{1}{T} \sum_{t=1}^{T} \Phi(x_t) \big\|^2 = O(1/T^2)
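A minimal Python sketch of these updates on a finite domain (illustrative only; `Phi` is a hypothetical matrix listing Φ(x) for every x in X, and `mu` is the target moment vector):

```python
# Minimal herding sketch on a finite domain: greedily pick the state maximizing
# <theta, Phi(x)>, then move theta toward the target moments mu; the running
# sample mean of Phi approaches mu.
import numpy as np

def herding(Phi, mu, T=500):
    # Phi: (num_states, dim) matrix of feature vectors Phi(x) for each x in X
    theta = mu.copy()                     # initialize theta (theta_0 = mu is one common choice)
    samples, running_mean = [], np.zeros_like(mu)
    for t in range(1, T + 1):
        x = int(np.argmax(Phi @ theta))   # x_{t+1} in argmax_x <theta_t, Phi(x)>
        theta = theta + mu - Phi[x]       # theta_{t+1} = theta_t + mu - Phi(x_{t+1})
        samples.append(x)
        running_mean += (Phi[x] - running_mean) / t   # incremental sample mean of Phi
    return samples, running_mean

# Example: independent bits in {-1,+1}^d with Phi(x) = x and a target mean mu.
d = 5
states = np.array([[int(b) * 2 - 1 for b in f"{i:0{d}b}"] for i in range(2 ** d)], float)
mu = np.full(d, 0.3)
_, mean_T = herding(states, mu)
print(np.abs(mean_T - mu).max())          # moment error shrinks roughly as 1/T
```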

Page 24: Frank-Wolfe optimization insights in machine learning

Approximating integrals in an RKHS

- Reproducing property: f \in \mathcal{F} \Rightarrow f(x) = \langle f, \Phi(x) \rangle
- Define the mean map: \mu = \mathbb{E}_{p(x)} \Phi(x)
- Want to approximate integrals of the form:
  \mathbb{E}_{p(x)} f(x) = \mathbb{E}_{p(x)} \langle f, \Phi(x) \rangle = \langle f, \mu \rangle
- Use a weighted sum to get the approximated mean:
  \hat{\mu} = \mathbb{E}_{\hat{p}(x)} \Phi(x) = \sum_{t=1}^{T} w_t \Phi(x_t)
- The approximation error is then bounded by:
  |\mathbb{E}_{p(x)} f(x) - \mathbb{E}_{\hat{p}(x)} f(x)| \le \|f\|\, \|\mu - \hat{\mu}\|

Controlling the moment discrepancy is enough to control the error of integrals in the RKHS \mathcal{F}.

Page 25: Frank-Wolfe optimization insights in machine learning

Conditional gradient algorithm (aka Frank-Wolfe)

Algorithm to optimize:  \min_{g \in \mathcal{M}} J(g),
with J convex & (twice) continuously differentiable and \mathcal{M} convex & compact.

Repeat:
1) Find a good feasible direction by minimizing the linearization
   J(g_t) + \langle J'(g_t), g - g_t \rangle of J:
   \bar{g}_{t+1} \in \arg\min_{g \in \mathcal{M}} \langle J'(g_t), g \rangle
2) Take a convex step in that direction:
   g_{t+1} = (1 - \rho_t)\, g_t + \rho_t\, \bar{g}_{t+1}, \quad \rho_t = 1/(t+1)

-> Converges in O(1/T) in general

Here we will use J(g) = \frac{1}{2} \|g - \mu\|^2.

Page 26: Frank-Wolfe optimization insights in machine learning

Herding & conditional gradient are equivalent

Trick: look at conditional gradient on the dummy objective
  \min_{g \in \mathcal{M}} \Big\{ J(g) = \tfrac{1}{2} \|g - \mu\|^2 \Big\},
  \qquad \mathcal{M} = \mathrm{conv}\{\Phi(x)\}

conditional gradient updates (with step size \rho_t = 1/(t+1)):
  \bar{g}_{t+1} \in \arg\min_{g \in \mathcal{M}} \langle g_t - \mu, g \rangle
  \quad (\text{attained at a vertex } \Phi(x_{t+1}))
  g_{t+1} = (1 - \rho_t)\, g_t + \rho_t\, \bar{g}_{t+1}

Do the change of variable g_t - \mu = -\theta_t / t; these become the herding updates:
  x_{t+1} \in \arg\max_{x \in \mathcal{X}} \langle \theta_t, \Phi(x) \rangle
  \theta_{t+1} = \theta_t + \mu - \Phi(x_{t+1})

With \rho_0 = 1, the convex step reads (t+1)\, g_{t+1} = t\, g_t + \Phi(x_{t+1}), so
  g_T = \frac{1}{T} \sum_{t=1}^{T} \Phi(x_t) = \hat{\mu}_T.

Subgradient ascent and conditional gradient are Fenchel duals of each other! (see also [Bach 2012])

Page 27: Frank-Wolfe optimization insights in machine learning

Extensions of herding

More general step sizes \rho_t give a weighted sum:
  g_T = \sum_{t=1}^{T} w_t \Phi(x_t)

Two extensions:
1) line search for \rho_t
2) min-norm point algorithm (minimize J(g) on the convex hull of the previously visited points)
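A sketch of extension 1): Frank-Wolfe with exact line search on J(g) = ½‖g − μ‖² over conv{Φ(x)}, which produces the weighted samples above (it reuses the finite-domain setup of the earlier herding sketch; `weighted_herding` is a hypothetical name, and the returned weights are the w_t in g_T = Σ_t w_t Φ(x_t)):

```python
import numpy as np

def weighted_herding(Phi, mu, T=500):
    # Frank-Wolfe with exact line search on J(g) = 0.5 * ||g - mu||^2
    # over conv{Phi(x)}; returns the atom weights w_t and the final g_T.
    weights = {}
    x0 = int(np.argmax(Phi @ mu))           # first atom: LMO at g = 0 (rho_0 = 1)
    g = Phi[x0].astype(float)
    weights[x0] = 1.0
    for _ in range(T):
        x = int(np.argmin(Phi @ (g - mu)))  # LMO: the same search as one herding step
        d = g - Phi[x]
        gap = (g - mu).dot(d)               # FW duality gap for this quadratic
        if gap < 1e-12 or d.dot(d) == 0:
            break
        rho = float(np.clip(gap / d.dot(d), 0.0, 1.0))   # exact line search
        g = (1 - rho) * g + rho * Phi[x]
        for k in weights:                   # all previous weights shrink by (1 - rho)
            weights[k] *= (1 - rho)
        weights[x] = weights.get(x, 0.0) + rho
    return weights, g

# e.g. with `states` and `mu` from the herding sketch above:
# w, g_T = weighted_herding(states, mu); print(np.linalg.norm(g_T - mu))
```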

Page 28: Frank-Wolfe optimization insights in machine learning

Rates of convergence & theorems

- No assumption: conditional gradient (\rho_t = 1/(t+1)) yields*
    \|g_t - \mu\|^2 = O(1/t)
- If \mu is assumed to lie in the relative interior of \mathcal{M} with radius r > 0,
  [Chen et al. 2010] yields \|g_t - \mu\|^2 = O(1/t^2) for herding,
  whereas the line-search version yields \|g_t - \mu\|^2 = O(e^{-ct})
  [Guélat & Marcotte 1986, Beck & Teboulle 2004]

Propositions (suppose \mathcal{X} compact and \Phi continuous):
1) \mathcal{F} finite-dimensional and p with full support implies \exists\, r > 0
2) \mathcal{F} infinite-dimensional implies r = 0 (i.e. [Chen et al. 2010] doesn't hold!)

Page 29: Frank-Wolfe optimization insights in machine learning

Simulation 1: approximating integrals

Kernel herding on \mathcal{X} = [0, 1]; use an RKHS with the Bernoulli polynomial kernel
(infinite dimensional; closed form), with
  p(x) \propto \Big( \sum_{k=1}^{K} a_k \cos(2k\pi x) + b_k \sin(2k\pi x) \Big)^2

(plot: \log_{10} \|\hat{\mu}_T - \mu\| versus T for the different variants)

Page 30: Frank-Wolfe optimization insights in machine learning

Simulation 2: max entropy?

Learning independent bits: \mathcal{X} = \{-1, 1\}^d, d = 10, \Phi(x) = x;
irrational \mu vs. rational \mu.

(plots: \log_{10} \|\hat{\mu}_T - \mu\| – error on the moments;
 \log_{10} \|\hat{p}_T - p\| – error on the distribution)

Page 31: Frank-Wolfe optimization insights in machine learning

Conclusions for 2nd part

- Equivalence of herding and conditional gradient:
  -> yields better algorithms for quadrature based on moments
  -> but highlights the max entropy / moment matching tradeoff!
- Other interesting points:
  - setting up fake optimization problems -> harvest properties of known algorithms
  - the conditional gradient algorithm is useful to know...
  - the duality of subgradient & conditional gradient is more general
- Recent related work:
  - link with Bayesian quadrature [Huszár & Duvenaud UAI 2012]
  - herded Gibbs sampling [Bornn et al. ICLR 2013]

Page 32: Frank-Wolfe optimization insights in machine learning

Thank you!