First-order methods: Convexity
10-725 Optimization, Fall 2012
Geoff Gordon, Ryan Tibshirani



Page 1

First-order methods: Convexity

10-725 Optimization
Geoff Gordon

Ryan Tibshirani

Page 2

Administrivia

• HW1 out, due 9/20
‣ in class, at beginning of class: no skipping lecture to keep working on it ;-)
‣ if you use late days, hand in to course assistant before 1:30PM on due date + k days

• Reminder: think about project teams and project proposals (due 9/25)


Page 3

Review

• Convex sets
‣ primal (convex hull) vs. dual (intersect hyperplanes)

‣ supporting, separating hyperplanes

‣ operations that preserve convexity

‣ affine fn; perspective

‣ open/closed/compact


Page 4

Review

• Convex functions
‣ epigraph

‣ domain

‣ sublevel sets; quasiconvexity

‣ first-order, second-order conditions

‣ operations that preserve convexity

‣ perspective; minimization over one argument


Page 5

Ex: structured classifier

[Figure: chain-structured model with inputs x_i and labels y_i; features φ_j(x_i), ψ_ijk(x_i, y_i), χ_ikl(y_i, y_{i+1})]

Classifier: L(x, y; v, w) =

Page 6

Learning structured classifier

• Get it right if:

• So, want:

• Where Δ(y, y′) = {

• RHS:

• RHS − LHS:

• Train: lots of pairs (x_t, y_t) (see the sketch below)
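The blanks above were filled in at the board; as a rough sketch of the standard construction they point at (all assumptions mine: a chain model scored with the node/edge features ψ and χ from the previous slide, Hamming task loss Δ, and brute-force enumeration standing in for a Viterbi-style max), "RHS − LHS" becomes a structured hinge loss:

```python
import numpy as np

def score(x, y, v, w, psi, chi):
    """Linear chain score: node terms v . psi(x_i, y_i) plus edge terms
    w . chi(y_i, y_{i+1}); psi/chi are the feature maps from the slide."""
    s = sum(v @ psi(x[i], y[i]) for i in range(len(y)))
    s += sum(w @ chi(y[i], y[i + 1]) for i in range(len(y) - 1))
    return s

def delta(y, y_true):
    """Hamming task loss: number of positions the labelings disagree on."""
    return sum(yi != ti for yi, ti in zip(y, y_true))

def structured_hinge(x, y_true, v, w, psi, chi, labelings):
    """max_y [score(x,y) + delta(y,y_true)] - score(x,y_true): zero exactly
    when the true labeling beats every alternative by a margin of delta."""
    worst = max(score(x, y, v, w, psi, chi) + delta(y, y_true) for y in labelings)
    return worst - score(x, y_true, v, w, psi, chi)
```

Enumerating `labelings` is exponential; in practice the inner max over y is computed by dynamic programming on the chain.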


Page 7

Strict convexity; strong convexity

• Strictly convex:
‣ f(tx + (1−t)y) < t f(x) + (1−t) f(y) for all x ≠ y in dom f and t ∈ (0, 1)

• k-strongly convex:
‣ f(x) − (k/2)‖x‖² is convex; equivalently, f(tx + (1−t)y) ≤ t f(x) + (1−t) f(y) − (k/2) t(1−t) ‖x − y‖²
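The slide leaves the definitions to be filled in at the board; the bullets above give the standard ones. A quick numeric sanity check of the strong-convexity inequality (a sketch; the choice f(x) = x², which is 2-strongly convex, is mine):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2
k = 2.0  # f(x) = x^2 is 2-strongly convex: f(x) - (k/2) x^2 = 0 is convex

for _ in range(1000):
    x, y = rng.normal(size=2)
    t = rng.uniform()
    lhs = f(t * x + (1 - t) * y)
    rhs = t * f(x) + (1 - t) * f(y) - (k / 2) * t * (1 - t) * (x - y) ** 2
    assert lhs <= rhs + 1e-9  # the k-strong-convexity inequality
print("inequality holds at 1000 random points")
```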


Page 8

Extended reals

• Suppose dom f ⊂ Rⁿ

• Define g(x) = { f(x) if x ∈ dom f; +∞ otherwise

• f convex ⟺ g convex
‣ f(tx + (1−t)y) ≤ t f(x) + (1−t) f(y) for x, y ∈ dom f
‣ g(tx + (1−t)y) ≤ t g(x) + (1−t) g(y) for all x, y ∈ Rⁿ
‣ cases: if x, y ∈ dom f the two inequalities agree; if g(x) or g(y) is +∞ the RHS is +∞, so the inequality holds trivially
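In code, the extension is just "return +∞ off the domain"; a minimal sketch (the example function −log x, with dom f = (0, ∞), is my choice):

```python
import math

def f(x):
    """f(x) = -log x, convex with dom f = (0, inf)."""
    return -math.log(x)

def g(x):
    """Extended-value extension: agrees with f on dom f, +inf elsewhere."""
    return f(x) if x > 0 else math.inf

# Case where a point leaves dom f: the RHS of the convexity inequality
# becomes +inf, so the inequality holds trivially.
x, y, t = -1.0, 2.0, 0.5
assert g(t * x + (1 - t) * y) <= t * g(x) + (1 - t) * g(y)
```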


Page 9

Lipschitz

• Function f(x) is Lipschitz (in norm ‖x‖) if:
‣ |f(x) − f(y)| ≤ L ‖x − y‖ for all x, y, for some constant L


[Three plots on x ∈ [−2, 2]: x^(2/3), |x|, and sgn(x).]
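A crude numeric illustration of the three plotted examples (a sketch; the grid and interval are arbitrary choices): estimate the largest slope |f(x) − f(y)| / |x − y| over a fine grid on [−2, 2].

```python
import numpy as np

def max_slope(f, lo=-2.0, hi=2.0, n=2001):
    """Lower bound on the Lipschitz constant of f on [lo, hi]: the largest
    slope between consecutive grid points."""
    x = np.linspace(lo, hi, n)
    return np.max(np.abs(np.diff(f(x))) / np.diff(x))

print(max_slope(np.abs))                     # ~1.0: |x| is Lipschitz with L = 1
print(max_slope(lambda x: np.cbrt(x) ** 2))  # grows as the grid refines: x^(2/3) is not Lipschitz
print(max_slope(np.sign))                    # blows up at the jump: sgn(x) is not Lipschitz
```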

Page 10

Back to gradient descent

• Suppose f(x) is convex, ∇f(x) exists

• Iterations to get to accuracy ϵ · (f(x0) − f(x*)), if:
‣ f Lipschitz: O(1/ϵ²)
‣ ∇f Lipschitz: O(1/ϵ)
‣ f strongly convex: O(ln(1/ϵ))

• Constant in O(…): conditioning of f (see the sketch below)
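A minimal sketch of the strongly convex case (my example: f(x) = ½ xᵀAx with A = diag(1, 4), so μ = 1, L = 4, fixed step 1/L): the gap f(x) − f(x*) shrinks by a constant factor per iteration, matching the O(ln(1/ϵ)) bound.

```python
import numpy as np

A = np.diag([1.0, 4.0])      # f(x) = 0.5 x^T A x, minimized at x* = 0
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
step = 1.0 / 4.0             # 1/L, L = largest eigenvalue of A
for k in range(1, 26):
    x = x - step * grad(x)
    if k % 5 == 0:
        print(k, f(x))       # gap drops geometrically: linear convergence
```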


Page 11

Conditioning


[Figures from Boyd & Vandenberghe, §9.4 Steepest descent method (pp. 482–483):

Figure 9.12: Steepest descent method, with quadratic norm ‖·‖_P2.

Figure 9.13: Error f(x^(k)) − p⋆ versus iteration k, for the steepest descent method with the quadratic norm ‖·‖_P1 and the quadratic norm ‖·‖_P2. Convergence is rapid for the norm ‖·‖_P1 and very slow for ‖·‖_P2.

Figure 9.14: The iterates of steepest descent with norm ‖·‖_P1, after the change of coordinates. This change of coordinates reduces the condition number of the sublevel sets, and so speeds up convergence.

Figure 9.15: The iterates of steepest descent with norm ‖·‖_P2, after the change of coordinates. This change of coordinates increases the condition number of the sublevel sets, and so slows down convergence.]
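The figures' point in miniature (a sketch; the quadratic, tolerance, and preconditioner P = A are assumed choices): gradient descent on a badly conditioned quadratic is slow, and the change of coordinates x̄ = P^{1/2} x that fixes the conditioning makes it fast.

```python
import numpy as np

def gd_iters(A, x, tol=1e-8):
    """Gradient descent on f(x) = 0.5 x^T A x with step 1/L;
    returns iterations until f(x) <= tol."""
    L = np.max(np.linalg.eigvalsh(A))
    k = 0
    while 0.5 * x @ A @ x > tol:
        x = x - (1.0 / L) * (A @ x)
        k += 1
    return k

A = np.diag([1.0, 100.0])                  # condition number 100
x0 = np.array([1.0, 1.0])
print(gd_iters(A, x0))                     # slow: hundreds of iterations

# Change of coordinates xbar = P^{1/2} x with P = A: the transformed
# Hessian P^{-1/2} A P^{-1/2} is the identity (condition number 1).
p_half = np.sqrt(np.diag(A))               # diagonal of P^{1/2}
A_bar = A / np.outer(p_half, p_half)
print(gd_iters(A_bar, p_half * x0))        # fast: one iteration
```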

Page 12

Extensions

• Subgradient descent

• Prox operator (e.g., )

• FISTA, mirror descent, conjugate gradient

• Nesterov’s smoothing

• Line search (BV sec 9.2)

• Stochastic GD (when f(x) = E(f_i(x) | i ~ P)) (see the sketch below)
‣ sample one i on each iter, use ∇f_i(x)
‣ or, minibatches: sample a few i’s, use mean ∇f_i(x)

g_t = argmin_g ‖g‖²_p + ∇f · g
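A minimal sketch of the stochastic/minibatch variant (assumptions mine: least-squares components f_i(x) = ½(a_iᵀx − b_i)², synthetic data, fixed step and batch size):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 5
a = rng.normal(size=(N, d))
b = a @ rng.normal(size=d) + 0.1 * rng.normal(size=N)   # noisy linear data

def minibatch_grad(x, idx):
    """Mean gradient of f_i(x) = 0.5 (a_i . x - b_i)^2 over the batch idx."""
    r = a[idx] @ x - b[idx]
    return a[idx].T @ r / len(idx)

x = np.zeros(d)
step, batch = 0.01, 10
for t in range(2000):
    idx = rng.integers(N, size=batch)   # sample a few i's ...
    x -= step * minibatch_grad(x, idx)  # ... use the mean gradient

print(0.5 * np.mean((a @ x - b) ** 2))  # should approach the noise floor (~0.005)
```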

Page 13

Comparison: stochastic GD

• Iteration bounds for stochastic GD

‣ f Lipschitz: O(1/ϵ2) (worse const, same O() as GD)

‣ f strongly convex: O(ln(1/ϵ)/ϵ) (much worse)

• f Lipschitz: stochastic GD often wins big

• Even if f strongly convex:
‣ plain GD: each iter O(N), #iters O(ln(1/ϵ))
‣ stochastic: each iter O(1), #iters O(ln(1/ϵ)/ϵ)
‣ stochastic can win if lots of data, loose tolerance
‣ could make sense to throw data away, use full GD (worked numbers below)
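Plugging in illustrative numbers (all assumed: N = 10⁶ examples, constants dropped) makes the trade-off concrete; the crossover sits near ϵ ≈ 1/N.

```python
import math

def total_work(N, eps):
    """Per-iteration cost times iteration bound, constants dropped."""
    gd = N * math.log(1 / eps)         # O(N) per iter, O(ln(1/eps)) iters
    sgd = math.log(1 / eps) / eps      # O(1) per iter, O(ln(1/eps)/eps) iters
    return gd, sgd

for eps in (1e-1, 1e-4, 1e-9):
    gd, sgd = total_work(1e6, eps)
    print(f"eps={eps:g}: GD ~{gd:.1e}, stochastic ~{sgd:.1e}")
# Loose tolerance: stochastic wins by orders of magnitude.
# Tight tolerance (eps << 1/N): full GD wins.
```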