First-order methods: Convexity
10-725 Optimization, Fall 2012
Geoff Gordon, Ryan Tibshirani



Page 1

First-order methods: Convexity

10-725 Optimization
Geoff Gordon

Ryan Tibshirani

Page 2

Administrivia

• HW1 out, due 9/20
‣ in class, at beginning of class: no skipping lecture to keep working on it ;-)
‣ if you use late days, hand in to course assistant before 1:30PM on due date + k days

• Reminder: think about project teams and project proposals (due 9/25)


Page 3

Review

• Convex sets
‣ primal (convex hull) vs. dual (intersect hyperplanes)

‣ supporting, separating hyperplanes

‣ operations that preserve convexity

‣ affine fn; perspective

‣ open/closed/compact


Page 4

Review

• Convex functions
‣ epigraph

‣ domain

‣ sublevel sets; quasiconvexity

‣ first-order, second-order conditions

‣ operations that preserve convexity

‣ perspective; minimization over one argument


Page 5

Ex: structured classifier

[Figure: chain-structured model with inputs x_i and labels y_i; features φ_j(x_i), ψ_ijk(x_i, y_i), χ_ikl(y_i, y_{i+1})]

Classifier: L(x, y; v, w) =

Page 6

Learning structured classifier

• Get it right if:

• So, want:

• Where Δ(y, y′) = {

• RHS:

• RHS − LHS:

• Train: lots of pairs (x_t, y_t) (see the sketch below)
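The blanks above were filled in at the board; as a rough sketch of the standard construction they point at (all assumptions mine: a chain model scored with the node/edge features ψ and χ from the previous slide, Hamming task loss Δ, and brute-force enumeration standing in for a Viterbi-style max), "RHS − LHS" becomes a structured hinge loss:

```python
import numpy as np

def score(x, y, v, w, psi, chi):
    """Linear chain score: node terms v . psi(x_i, y_i) plus edge terms
    w . chi(y_i, y_{i+1}); psi/chi are the feature maps from the slide."""
    s = sum(v @ psi(x[i], y[i]) for i in range(len(y)))
    s += sum(w @ chi(y[i], y[i + 1]) for i in range(len(y) - 1))
    return s

def delta(y, y_true):
    """Hamming task loss: number of positions the labelings disagree on."""
    return sum(yi != ti for yi, ti in zip(y, y_true))

def structured_hinge(x, y_true, v, w, psi, chi, labelings):
    """max_y [score(x,y) + delta(y,y_true)] - score(x,y_true): zero exactly
    when the true labeling beats every alternative by a margin of delta."""
    worst = max(score(x, y, v, w, psi, chi) + delta(y, y_true) for y in labelings)
    return worst - score(x, y_true, v, w, psi, chi)
```

Enumerating `labelings` is exponential; in practice the inner max over y is computed by dynamic programming on the chain.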


Page 7

Strict convexity; strong convexity

• Strictly convex:
‣ f(tx + (1−t)y) < t f(x) + (1−t) f(y) for all x ≠ y in dom f and t ∈ (0, 1)

• k-strongly convex:
‣ f(x) − (k/2)‖x‖² is convex; equivalently, f(tx + (1−t)y) ≤ t f(x) + (1−t) f(y) − (k/2) t(1−t) ‖x − y‖²
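The slide leaves the definitions to be filled in at the board; the bullets above give the standard ones. A quick numeric sanity check of the strong-convexity inequality (a sketch; the choice f(x) = x², which is 2-strongly convex, is mine):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2
k = 2.0  # f(x) = x^2 is 2-strongly convex: f(x) - (k/2) x^2 = 0 is convex

for _ in range(1000):
    x, y = rng.normal(size=2)
    t = rng.uniform()
    lhs = f(t * x + (1 - t) * y)
    rhs = t * f(x) + (1 - t) * f(y) - (k / 2) * t * (1 - t) * (x - y) ** 2
    assert lhs <= rhs + 1e-9  # the k-strong-convexity inequality
print("inequality holds at 1000 random points")
```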


Page 8

Extended reals

• Suppose dom f ⊂ Rⁿ

• Define g(x) = { f(x) if x ∈ dom f; +∞ otherwise

• f convex ⟺ g convex
‣ f(tx + (1−t)y) ≤ t f(x) + (1−t) f(y) for x, y ∈ dom f
‣ g(tx + (1−t)y) ≤ t g(x) + (1−t) g(y) for all x, y ∈ Rⁿ
‣ cases: if x, y ∈ dom f the two inequalities agree; if g(x) or g(y) is +∞ the RHS is +∞, so the inequality holds trivially
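In code, the extension is just "return +∞ off the domain"; a minimal sketch (the example function −log x, with dom f = (0, ∞), is my choice):

```python
import math

def f(x):
    """f(x) = -log x, convex with dom f = (0, inf)."""
    return -math.log(x)

def g(x):
    """Extended-value extension: agrees with f on dom f, +inf elsewhere."""
    return f(x) if x > 0 else math.inf

# Case where a point leaves dom f: the RHS of the convexity inequality
# becomes +inf, so the inequality holds trivially.
x, y, t = -1.0, 2.0, 0.5
assert g(t * x + (1 - t) * y) <= t * g(x) + (1 - t) * g(y)
```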


Page 9

Lipschitz

• Function f(x) is Lipschitz (in norm ‖x‖) if:
‣ |f(x) − f(y)| ≤ L ‖x − y‖ for all x, y, for some constant L


[Three plots on x ∈ [−2, 2]: x^(2/3), |x|, and sgn(x).]
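A crude numeric illustration of the three plotted examples (a sketch; the grid and interval are arbitrary choices): estimate the largest slope |f(x) − f(y)| / |x − y| over a fine grid on [−2, 2].

```python
import numpy as np

def max_slope(f, lo=-2.0, hi=2.0, n=2001):
    """Lower bound on the Lipschitz constant of f on [lo, hi]: the largest
    slope between consecutive grid points."""
    x = np.linspace(lo, hi, n)
    return np.max(np.abs(np.diff(f(x))) / np.diff(x))

print(max_slope(np.abs))                     # ~1.0: |x| is Lipschitz with L = 1
print(max_slope(lambda x: np.cbrt(x) ** 2))  # grows as the grid refines: x^(2/3) is not Lipschitz
print(max_slope(np.sign))                    # blows up at the jump: sgn(x) is not Lipschitz
```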

Page 10

Back to gradient descent

• Suppose f(x) is convex, ∇f(x) exists

• Iterations to get to accuracy ϵ · (f(x0) − f(x*)), if:
‣ f Lipschitz: O(1/ϵ²)
‣ ∇f Lipschitz: O(1/ϵ)
‣ f strongly convex: O(ln(1/ϵ))

• Constant in O(…): conditioning of f (see the sketch below)
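A minimal sketch of the strongly convex case (my example: f(x) = ½ xᵀAx with A = diag(1, 4), so μ = 1, L = 4, fixed step 1/L): the gap f(x) − f(x*) shrinks by a constant factor per iteration, matching the O(ln(1/ϵ)) bound.

```python
import numpy as np

A = np.diag([1.0, 4.0])      # f(x) = 0.5 x^T A x, minimized at x* = 0
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
step = 1.0 / 4.0             # 1/L, L = largest eigenvalue of A
for k in range(1, 26):
    x = x - step * grad(x)
    if k % 5 == 0:
        print(k, f(x))       # gap drops geometrically: linear convergence
```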


Page 11

Conditioning


[Figures from Boyd & Vandenberghe, §9.4 Steepest descent method (pp. 482–483):

Figure 9.12: Steepest descent method, with quadratic norm ‖·‖_P2.

Figure 9.13: Error f(x^(k)) − p⋆ versus iteration k, for the steepest descent method with the quadratic norm ‖·‖_P1 and the quadratic norm ‖·‖_P2. Convergence is rapid for the norm ‖·‖_P1 and very slow for ‖·‖_P2.

Figure 9.14: The iterates of steepest descent with norm ‖·‖_P1, after the change of coordinates. This change of coordinates reduces the condition number of the sublevel sets, and so speeds up convergence.

Figure 9.15: The iterates of steepest descent with norm ‖·‖_P2, after the change of coordinates. This change of coordinates increases the condition number of the sublevel sets, and so slows down convergence.]
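The figures' point in miniature (a sketch; the quadratic, tolerance, and preconditioner P = A are assumed choices): gradient descent on a badly conditioned quadratic is slow, and the change of coordinates x̄ = P^{1/2} x that fixes the conditioning makes it fast.

```python
import numpy as np

def gd_iters(A, x, tol=1e-8):
    """Gradient descent on f(x) = 0.5 x^T A x with step 1/L;
    returns iterations until f(x) <= tol."""
    L = np.max(np.linalg.eigvalsh(A))
    k = 0
    while 0.5 * x @ A @ x > tol:
        x = x - (1.0 / L) * (A @ x)
        k += 1
    return k

A = np.diag([1.0, 100.0])                  # condition number 100
x0 = np.array([1.0, 1.0])
print(gd_iters(A, x0))                     # slow: hundreds of iterations

# Change of coordinates xbar = P^{1/2} x with P = A: the transformed
# Hessian P^{-1/2} A P^{-1/2} is the identity (condition number 1).
p_half = np.sqrt(np.diag(A))               # diagonal of P^{1/2}
A_bar = A / np.outer(p_half, p_half)
print(gd_iters(A_bar, p_half * x0))        # fast: one iteration
```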

Page 12

Extensions

• Subgradient descent

• Prox operator (e.g., )

• FISTA, mirror descent, conjugate gradient

• Nesterov’s smoothing

• Line search (BV sec 9.2)

• Stochastic GD (when f(x) = E(f_i(x) | i ~ P)) (see the sketch below)
‣ sample one i on each iter, use ∇f_i(x)
‣ or, minibatches: sample a few i’s, use mean ∇f_i(x)

g_t = argmin_g ‖g‖²_p + ∇f · g
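A minimal sketch of the stochastic/minibatch variant (assumptions mine: least-squares components f_i(x) = ½(a_iᵀx − b_i)², synthetic data, fixed step and batch size):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 5
a = rng.normal(size=(N, d))
b = a @ rng.normal(size=d) + 0.1 * rng.normal(size=N)   # noisy linear data

def minibatch_grad(x, idx):
    """Mean gradient of f_i(x) = 0.5 (a_i . x - b_i)^2 over the batch idx."""
    r = a[idx] @ x - b[idx]
    return a[idx].T @ r / len(idx)

x = np.zeros(d)
step, batch = 0.01, 10
for t in range(2000):
    idx = rng.integers(N, size=batch)   # sample a few i's ...
    x -= step * minibatch_grad(x, idx)  # ... use the mean gradient

print(0.5 * np.mean((a @ x - b) ** 2))  # should approach the noise floor (~0.005)
```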

Page 13

Comparison: stochastic GD

• Iteration bounds for stochastic GD

‣ f Lipschitz: O(1/ϵ2) (worse const, same O() as GD)

‣ f strongly convex: O(ln(1/ϵ)/ϵ) (much worse)

• f Lipschitz: stochastic GD often wins big

• Even if f strongly convex:
‣ plain GD: each iter O(N), #iters O(ln(1/ϵ))
‣ stochastic: each iter O(1), #iters O(ln(1/ϵ)/ϵ)
‣ stochastic can win if lots of data, loose tolerance
‣ could make sense to throw data away, use full GD (worked numbers below)
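Plugging in illustrative numbers (all assumed: N = 10⁶ examples, constants dropped) makes the trade-off concrete; the crossover sits near ϵ ≈ 1/N.

```python
import math

def total_work(N, eps):
    """Per-iteration cost times iteration bound, constants dropped."""
    gd = N * math.log(1 / eps)         # O(N) per iter, O(ln(1/eps)) iters
    sgd = math.log(1 / eps) / eps      # O(1) per iter, O(ln(1/eps)/eps) iters
    return gd, sgd

for eps in (1e-1, 1e-4, 1e-9):
    gd, sgd = total_work(1e6, eps)
    print(f"eps={eps:g}: GD ~{gd:.1e}, stochastic ~{sgd:.1e}")
# Loose tolerance: stochastic wins by orders of magnitude.
# Tight tolerance (eps << 1/N): full GD wins.
```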