
  • L. Vandenberghe EE236C (Spring 2013-14)

    Smoothing

    introduction

    smoothing via conjugate

    examples

    1

  • First-order convex optimization methods

    complexity of finding an $\epsilon$-suboptimal point of $f$

    subgradient method: $f$ nondifferentiable with Lipschitz constant $G$

    $O((G/\epsilon)^2)$ iterations

    proximal gradient method: $f = g + h$, $h$ a simple nondifferentiable function, $g$ differentiable with $L$-Lipschitz continuous gradient

    $O(L/\epsilon)$ iterations

    fast proximal gradient methods (lecture 7)

    $O(\sqrt{L/\epsilon})$ iterations

    Smoothing 2

  • Nondifferentiable optimization by smoothing

    for nondifferentiable $f$ that cannot be handled by the proximal gradient method

    replace $f$ with a differentiable approximation $f_\mu$ (parametrized by $\mu > 0$)

    minimize $f_\mu$ by the (fast) gradient method

    complexity: #iterations for the (fast) gradient method depends on $L_\mu/\hat\epsilon$

    $L_\mu$ is the Lipschitz constant of $\nabla f_\mu$

    $\hat\epsilon$ is the accuracy with which the smooth problem is solved

    trade-off in amount of smoothing (choice of $\mu$)

    large $L_\mu$ (less smoothing) gives a more accurate approximation

    small $L_\mu$ (more smoothing) gives faster convergence

    Smoothing 3

  • Example: Huber penalty as smoothed absolute value

    $$\phi_\mu(z) = \begin{cases} z^2/(2\mu) & |z| \le \mu \\ |z| - \mu/2 & |z| \ge \mu \end{cases}$$

    [figure: graph of $\phi_\mu(z)$ versus $z$, quadratic on $[-\mu, \mu]$ with value $\mu/2$ at $z = \pm\mu$, linear beyond]

    $\mu$ controls accuracy and smoothness

    accuracy

    $$|z| - \frac{\mu}{2} \le \phi_\mu(z) \le |z|$$

    smoothness

    $$\phi_\mu''(z) \le \frac{1}{\mu}$$

    Smoothing 4
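The piecewise definition above translates directly into code; a minimal sketch (pure Python, the function name `huber` is our own) that also checks the accuracy bound $|z| - \mu/2 \le \phi_\mu(z) \le |z|$ numerically:

```python
def huber(z, mu):
    """Huber penalty: quadratic for |z| <= mu, linear with slope 1 beyond."""
    if abs(z) <= mu:
        return z * z / (2 * mu)
    return abs(z) - mu / 2

mu = 0.5
for z in [-3.0, -0.5, -0.1, 0.0, 0.2, 1.0, 4.0]:
    v = huber(z, mu)
    # accuracy bound from the slide: |z| - mu/2 <= phi_mu(z) <= |z|
    assert abs(z) - mu / 2 <= v <= abs(z) + 1e-12
```

The two pieces meet at $|z| = \mu$ with matching value ($\mu/2$) and slope ($\pm 1$), which is what makes $\phi_\mu$ continuously differentiable.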

  • Huber penalty approximation of 1-norm minimization

    $$f(x) = \|Ax - b\|_1, \qquad f_\mu(x) = \sum_{i=1}^m \phi_\mu(a_i^Tx - b_i)$$

    accuracy: from $f(x) - m\mu/2 \le f_\mu(x) \le f(x)$,

    $$f(x) - f^\star \le f_\mu(x) - f_\mu^\star + \frac{m\mu}{2}$$

    to achieve $f(x) - f^\star \le \epsilon$ we need $f_\mu(x) - f_\mu^\star \le \hat\epsilon$ with $\hat\epsilon = \epsilon - m\mu/2$

    Lipschitz constant of $\nabla f_\mu$ is $L = \|A\|_2^2/\mu$

    complexity: for $\mu = \epsilon/m$,

    $$\frac{L}{\hat\epsilon} = \frac{\|A\|_2^2}{\mu(\epsilon - m\mu/2)} = \frac{2m\|A\|_2^2}{\epsilon^2}$$

    i.e., $O(\sqrt{L/\hat\epsilon}) = O(1/\epsilon)$ iteration complexity for fast gradient method

    Smoothing 5
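The bookkeeping on this page is easy to verify numerically: with $\mu = \epsilon/m$ the ratio $L/\hat\epsilon$ collapses to $2m\|A\|_2^2/\epsilon^2$. A sketch (numpy assumed; the sizes and data are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, eps = 200, 50, 1e-3
A = rng.standard_normal((m, n))
nA2 = np.linalg.norm(A, 2) ** 2      # ||A||_2^2, squared spectral norm

mu = eps / m                         # smoothing parameter from the slide
eps_hat = eps - m * mu / 2           # accuracy for the smoothed problem (= eps/2)
ratio = nA2 / (mu * eps_hat)         # L / eps_hat with L = ||A||_2^2 / mu
assert np.isclose(ratio, 2 * m * nA2 / eps ** 2)
```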

  • Outline

    introduction

    smoothing via conjugate

    examples

  • Minimum of strongly convex function

    if $x$ is a minimizer of a strongly convex function $f$, then it is unique and

    $$f(y) \ge f(x) + \frac{\mu}{2}\|y - x\|_2^2 \quad \forall y \in \mathop{\bf dom} f$$

    ($\mu$ is the strong convexity constant of $f$; see page 1-17)

    proof: if some $y$ does not satisfy the inequality, then for small positive $\theta$

    $$\begin{aligned}
    f((1-\theta)x + \theta y)
    &\le (1-\theta)f(x) + \theta f(y) - \frac{\mu}{2}\theta(1-\theta)\|y - x\|_2^2 \\
    &= f(x) + \theta\left(f(y) - f(x) - \frac{\mu}{2}\|y - x\|_2^2\right) + \frac{\mu\theta^2}{2}\|x - y\|_2^2 \\
    &< f(x)
    \end{aligned}$$

    (for small enough $\theta$ the negative term linear in $\theta$ dominates the $\theta^2$ term), contradicting the optimality of $x$

    Smoothing 6

  • Conjugate of strongly convex function

    suppose $f$ is closed and strongly convex with constant $\mu$, and conjugate

    $$f^*(y) = \sup_{x \in \mathop{\bf dom} f} \left(y^Tx - f(x)\right)$$

    $f^*$ is defined and differentiable at all $y$, with gradient

    $$\nabla f^*(y) = \mathop{\rm argmax}_x \left(y^Tx - f(x)\right)$$

    $\nabla f^*$ is Lipschitz continuous with constant $1/\mu$

    $$\|\nabla f^*(u) - \nabla f^*(v)\|_2 \le \frac{1}{\mu}\|u - v\|_2$$

    Smoothing 7
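A numerical illustration of the $1/\mu$-Lipschitz claim, using the strongly convex $f(x) = |x| + (\mu/2)x^2$ in one dimension and a brute-force grid for the sup (numpy assumed; `grad_conj` is our own name, and this is a crude sketch, not how one would compute a conjugate in practice):

```python
import numpy as np

mu = 0.5
xs = np.linspace(-10, 10, 200001)
f = np.abs(xs) + (mu / 2) * xs ** 2   # strongly convex with constant mu

def grad_conj(y):
    # argmax_x (y*x - f(x)) equals the gradient of the conjugate f* at y
    return xs[np.argmax(y * xs - f)]

ys = np.linspace(-3, 3, 61)
g = np.array([grad_conj(y) for y in ys])
slopes = np.abs(np.diff(g) / np.diff(ys))
assert slopes.max() <= 1 / mu + 1e-2  # Lipschitz constant at most 1/mu
```

Here $\nabla f^*$ is the soft-thresholding map $y \mapsto (y - \mathop{\rm sign} y)/\mu$ for $|y| > 1$ and $0$ otherwise, whose steepest slope is exactly $1/\mu$.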

  • outline of proof

    $y^Tx - f(x)$ has a unique maximizer $x_y$ for every $y$ (follows from closedness and strong convexity of $f(x) - y^Tx$)

    from page 4-18, $\nabla f^*(y) = x_y$

    from strong convexity and page 6 (with $x_u = \nabla f^*(u)$, $x_v = \nabla f^*(v)$)

    $$f(x_u) - v^Tx_u \ge f(x_v) - v^Tx_v + \frac{\mu}{2}\|x_u - x_v\|_2^2$$

    $$f(x_v) - u^Tx_v \ge f(x_u) - u^Tx_u + \frac{\mu}{2}\|x_u - x_v\|_2^2$$

    adding the left- and right-hand sides of the inequalities gives

    $$\mu\|x_u - x_v\|_2^2 \le (x_u - x_v)^T(u - v)$$

    by the Cauchy-Schwarz inequality, $\mu\|x_u - x_v\|_2 \le \|u - v\|_2$

    Smoothing 8

  • Proximity function

    $d$ is a proximity function for a closed convex set $C$ if

    $d$ is continuous and strongly convex

    $C \subseteq \mathop{\bf dom} d$

    $d(x)$ measures distance of $x$ to the center $x_d = \mathop{\rm argmin}_{x \in C} d(x)$ of $C$

    normalization

    we will assume the strong convexity constant is 1 and $\inf_{x \in C} d(x) = 0$

    for a normalized proximity function

    $$d(x) \ge \frac{1}{2}\|x - x_d\|_2^2 \quad \forall x \in C$$

    Smoothing 9

  • common proximity functions

    $d(x) = \|x - u\|_2^2/2$ with $x_d = u \in C$

    $d(x) = \sum_{i=1}^n w_i(x_i - u_i)^2/2$ with $w_i \ge 1$ and $x_d = u \in C$

    $d(x) = \sum_{i=1}^n x_i \log x_i + \log n$ for $C = \{x \succeq 0 \mid \mathbf{1}^Tx = 1\}$, $x_d = (1/n)\mathbf{1}$

    example (probability simplex): entropy and $d(x) = (1/2)\|x - (1/n)\mathbf{1}\|_2^2$

    [figure: level curves of the entropy (left) and Euclidean (right) proximity functions on the probability simplex with vertices $(1,0,0)$, $(0,1,0)$, $(0,0,1)$]

    Smoothing 10
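The entropy example can be checked against the normalization inequality of page 9, $d(x) \ge \frac{1}{2}\|x - x_d\|_2^2$ (it holds via Pinsker's inequality, since $d$ is the relative entropy to the uniform distribution). A quick randomized check (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
for _ in range(1000):
    x = rng.uniform(0.01, 1.0, n)
    x /= x.sum()                       # random point of the probability simplex
    d = np.sum(x * np.log(x)) + np.log(n)
    # normalized proximity function bound: d(x) >= (1/2)||x - (1/n)1||_2^2
    assert d >= 0.5 * np.sum((x - 1 / n) ** 2) - 1e-12
```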

  • Smoothing via conjugate

    conjugate (dual) representation: suppose $f$ can be expressed as

    $$f(x) = \sup_{y \in \mathop{\bf dom} h} \left((Ax + b)^Ty - h(y)\right) = h^*(Ax + b)$$

    where $h$ is closed and convex with bounded domain

    smooth approximation: choose a proximity function $d$ for $C = \mathop{\bf cl} \mathop{\bf dom} h$

    $$f_\mu(x) = \sup_{y \in \mathop{\bf dom} h} \left((Ax + b)^Ty - h(y) - \mu d(y)\right) = (h + \mu d)^*(Ax + b)$$

    $f_\mu$ is differentiable because $h + \mu d$ is strongly convex

    Smoothing 11

  • Example: absolute value

    conjugate representation

    $$|x| = \sup_{-1 \le y \le 1} xy = h^*(x), \qquad h(y) = I_{[-1,1]}(y)$$

    proximity function: choosing $d(y) = y^2/2$ gives the Huber penalty

    $$f_\mu(x) = \sup_{-1 \le y \le 1} \left(xy - \mu y^2/2\right) = \begin{cases} x^2/(2\mu) & |x| \le \mu \\ |x| - \mu/2 & |x| > \mu \end{cases}$$

    proximity function: choosing $d(y) = 1 - \sqrt{1 - y^2}$ gives

    $$f_\mu(x) = \sup_{-1 \le y \le 1} \left(xy + \mu\sqrt{1 - y^2}\right) - \mu = \sqrt{x^2 + \mu^2} - \mu$$

    Smoothing 12

  • another conjugate representation of $|x|$

    $$|x| = \sup_{y_1 + y_2 = 1,\; y \succeq 0} x(y_1 - y_2)$$

    i.e., $|x| = h^*(Ax)$ for $h = I_C$,

    $$C = \{y \succeq 0 \mid y_1 + y_2 = 1\}, \qquad A = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$

    proximity function for $C$

    $$d(y) = y_1 \log y_1 + y_2 \log y_2 + \log 2$$

    smooth approximation

    $$f_\mu(x) = \sup_{y_1 + y_2 = 1,\; y \succeq 0} \left(xy_1 - xy_2 - \mu(y_1 \log y_1 + y_2 \log y_2 + \log 2)\right) = \mu \log\left(\frac{e^{x/\mu} + e^{-x/\mu}}{2}\right)$$

    Smoothing 13

  • comparison: three smooth approximations of absolute value

    [figure: the huber, sqrt, and log-sum-exp approximations $f_\mu(x)$ plotted against $x$ on $[-4, 4]$]

    Smoothing 14
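The three approximations being compared have the closed forms derived on the previous two pages, and each satisfies $f(x) - \mu D \le f_\mu(x) \le f(x)$ with $D = 1/2$, $1$, and $\log 2$ for the quadratic, sqrt, and entropy proximity functions respectively. A sketch checking this (pure Python; the function names are our own):

```python
import math

def huber(x, mu):
    return x * x / (2 * mu) if abs(x) <= mu else abs(x) - mu / 2

def sqrt_approx(x, mu):
    return math.sqrt(x * x + mu * mu) - mu

def logsumexp_approx(x, mu):
    return mu * math.log((math.exp(x / mu) + math.exp(-x / mu)) / 2)

mu = 0.3
for x in [k / 10 for k in range(-40, 41)]:
    for f_mu, D in [(huber, 0.5), (sqrt_approx, 1.0),
                    (logsumexp_approx, math.log(2))]:
        v = f_mu(x, mu)
        # accuracy bound: |x| - mu*D <= f_mu(x) <= |x|
        assert abs(x) - mu * D <= v <= abs(x) + 1e-12
```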

  • Gradient of smooth approximation

    $$f_\mu(x) = (h + \mu d)^*(Ax + b) = \sup_{y \in \mathop{\bf dom} h} \left((Ax + b)^Ty - h(y) - \mu d(y)\right)$$

    from the properties of the conjugate of a strongly convex function (page 7)

    $f_\mu$ is differentiable, with gradient

    $$\nabla f_\mu(x) = A^T \mathop{\rm argmax}_{y \in \mathop{\bf dom} h} \left((Ax + b)^Ty - h(y) - \mu d(y)\right)$$

    $\nabla f_\mu$ is Lipschitz continuous with constant

    $$L = \frac{\|A\|_2^2}{\mu}$$

    Smoothing 15

  • Accuracy of smooth approximation

    $$f(x) - \mu D \le f_\mu(x) \le f(x), \qquad D = \sup_{y \in \mathop{\bf dom} h} d(y)$$

    note $D < +\infty$ because $\mathop{\bf dom} h$ is bounded and $\mathop{\bf dom} h \subseteq \mathop{\bf dom} d$

    lower bound follows from

    $$f_\mu(x) = \sup_{y \in \mathop{\bf dom} h} \left((Ax + b)^Ty - h(y) - \mu d(y)\right) \ge \sup_{y \in \mathop{\bf dom} h} \left((Ax + b)^Ty - h(y) - \mu D\right) = f(x) - \mu D$$

    upper bound follows from

    $$f_\mu(x) \le \sup_{y \in \mathop{\bf dom} h} \left((Ax + b)^Ty - h(y)\right) = f(x)$$

    Smoothing 16

  • Complexity

    to find a solution of the nondifferentiable problem with accuracy $f(x) - f^\star \le \epsilon$

    solve the smoothed problem with accuracy $\hat\epsilon = \epsilon - \mu D$, so that

    $$f(x) - f^\star \le f_\mu(x) + \mu D - f_\mu^\star \le \hat\epsilon + \mu D = \epsilon$$

    Lipschitz constant of $\nabla f_\mu$ is $L = \|A\|_2^2/\mu$

    complexity: for $\mu = \epsilon/(2D)$,

    $$\frac{L}{\hat\epsilon} = \frac{\|A\|_2^2}{\mu(\epsilon - \mu D)} = \frac{4D\|A\|_2^2}{\epsilon^2}$$

    gives $O(1/\epsilon)$ iteration bound for fast gradient method

    efficiency in practice can be improved by decreasing $\mu$ gradually

    Smoothing 17

  • Outline

    introduction

    smoothing via conjugate

    examples

  • Piecewise-linear approximation

    $$f(x) = \max_{i=1,\ldots,m} \left(a_i^Tx + b_i\right)$$

    conjugate representation

    $$f(x) = \sup_{y \succeq 0,\; \mathbf{1}^Ty = 1} (Ax + b)^Ty$$

    proximity function

    $$d(y) = \sum_{i=1}^m y_i \log y_i + \log m$$

    smooth approximation

    $$f_\mu(x) = \mu \log \sum_{i=1}^m e^{(a_i^Tx + b_i)/\mu} - \mu \log m$$

    Smoothing 18
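The log-sum-exp formula above is easy to implement stably by shifting by the largest exponent. A sketch (numpy assumed, data arbitrary) that also checks the accuracy bound $f(x) - \mu\log m \le f_\mu(x) \le f(x)$ (here $D = \log m$):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, mu = 8, 3, 0.1
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)

def f(x):
    return np.max(A @ x + b)           # piecewise-linear max

def f_mu(x):
    z = (A @ x + b) / mu
    zmax = z.max()                     # shift for a stable log-sum-exp
    return mu * (zmax + np.log(np.exp(z - zmax).sum())) - mu * np.log(m)

x = rng.standard_normal(n)
assert f(x) - mu * np.log(m) <= f_mu(x) <= f(x) + 1e-12
```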

  • 1-Norm approximation

    $$f(x) = \|Ax - b\|_1$$

    conjugate representation

    $$f(x) = \sup_{\|y\|_\infty \le 1} (Ax - b)^Ty$$

    proximity function

    $$d(y) = \frac{1}{2}\sum_i w_i y_i^2 \qquad \text{(with } w_i \ge 1\text{)}$$

    smooth approximation: Huber approximation

    $$f_\mu(x) = \sum_{i=1}^m \phi_{\mu w_i}(a_i^Tx - b_i)$$

    Smoothing 19

  • Maximum eigenvalue

    conjugate representation: for $X \in \mathbf{S}^n$,

    $$f(X) = \lambda_{\max}(X) = \sup_{Y \succeq 0,\; \mathop{\bf tr} Y = 1} \mathop{\bf tr}(XY)$$

    proximity function: negative matrix entropy

    $$d(Y) = \sum_{i=1}^n \lambda_i(Y) \log \lambda_i(Y) + \log n$$

    smooth approximation

    $$f_\mu(X) = \sup_{Y \succeq 0,\; \mathop{\bf tr} Y = 1} \left(\mathop{\bf tr}(XY) - \mu d(Y)\right) = \mu \log \sum_{i=1}^n e^{\lambda_i(X)/\mu} - \mu \log n$$

    Smoothing 20
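The smoothed maximum eigenvalue reduces to a log-sum-exp of the eigenvalues, so it can be evaluated from an eigendecomposition; a sketch (numpy assumed, data arbitrary) that checks $\lambda_{\max}(X) - \mu\log n \le f_\mu(X) \le \lambda_{\max}(X)$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, mu = 6, 0.05
B = rng.standard_normal((n, n))
X = (B + B.T) / 2                      # random symmetric matrix

lam = np.linalg.eigvalsh(X)            # eigenvalues of X
lmax = lam.max()
# stabilized mu*log(sum_i exp(lambda_i/mu)) - mu*log(n)
f_mu = mu * (lmax / mu + np.log(np.exp((lam - lmax) / mu).sum())) - mu * np.log(n)
assert lmax - mu * np.log(n) <= f_mu <= lmax + 1e-12
```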

  • Nuclear norm

    nuclear norm $f(X) = \|X\|_*$ is the sum of the singular values of $X \in \mathbf{R}^{m \times n}$

    conjugate representation

    $$f(X) = \sup_{\|Y\|_2 \le 1} \mathop{\bf tr}(X^TY)$$

    proximity function

    $$d(Y) = \frac{1}{2}\|Y\|_F^2$$

    smooth approximation

    $$f_\mu(X) = \sup_{\|Y\|_2 \le 1} \left(\mathop{\bf tr}(X^TY) - \mu d(Y)\right) = \sum_i \phi_\mu(\sigma_i(X))$$

    the sum of the Huber penalties applied to the singular values of $X$

    Smoothing 21
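A sketch of the smoothed nuclear norm (numpy assumed, data arbitrary): apply the Huber penalty to each singular value, and check the accuracy bound $\|X\|_* - \mu D \le f_\mu(X) \le \|X\|_*$, where $D = \min(m,n)/2$ since $\|Y\|_2 \le 1$ implies $\|Y\|_F^2 \le \min(m,n)$:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, mu = 7, 4, 0.2
X = rng.standard_normal((m, n))

sig = np.linalg.svd(X, compute_uv=False)   # singular values of X

def huber(z):
    return z * z / (2 * mu) if abs(z) <= mu else abs(z) - mu / 2

nuc = sig.sum()                            # nuclear norm ||X||_*
f_mu = sum(huber(s) for s in sig)          # smoothed nuclear norm
assert nuc - mu * min(m, n) / 2 <= f_mu <= nuc + 1e-12
```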

  • Lagrange dual function

    $$\begin{array}{ll} \mbox{minimize} & f_0(x) \\ \mbox{subject to} & f_i(x) \le 0, \quad i = 1, \ldots, m \\ & x \in C \end{array}$$

    $f_i$ convex, $C$ closed and bounded

    smooth approximation of dual function: choose proximity function $d$ for $C$

    $$g_\mu(\lambda) = \inf_{x \in C} \left(f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \mu d(x)\right)$$

    this is equivalent to regularizing the primal problem

    $$\begin{array}{ll} \mbox{minimize} & f_0(x) + \mu d(x) \\ \mbox{subject to} & f_i(x) \le 0, \quad i = 1, \ldots, m \\ & x \in C \end{array}$$

    Smoothing 22

  • References

    D. Bertsekas, Nonlinear Programming (1995), §5.4.5

    Yu. Nesterov, Smooth minimization of non-smooth functions, Mathematical Programming (2005)

    Smoothing 23