
  • L. Vandenberghe EE236C (Spring 2013-14)

    Smoothing

    introduction

    smoothing via conjugate

    examples

    1

  • First-order convex optimization methods

    complexity of finding an $\epsilon$-suboptimal point of $f$

    subgradient method: $f$ nondifferentiable with Lipschitz constant $G$

    $O((G/\epsilon)^2)$ iterations

    proximal gradient method: $f = g + h$, $h$ a simple nondifferentiable function, $g$ differentiable with $L$-Lipschitz continuous gradient

    $O(L/\epsilon)$ iterations

    fast proximal gradient methods (lecture 7)

    $O(\sqrt{L/\epsilon})$ iterations

    Smoothing 2

  • Nondifferentiable optimization by smoothing

    for nondifferentiable $f$ that cannot be handled by the proximal gradient method

    replace $f$ with a differentiable approximation $f_\mu$ (parametrized by $\mu > 0$)

    minimize $f_\mu$ by the (fast) gradient method

    complexity: #iterations for the (fast) gradient method depends on $L_\mu/\hat\epsilon$

    $L_\mu$ is the Lipschitz constant of $\nabla f_\mu$

    $\hat\epsilon$ is the accuracy with which the smooth problem is solved

    trade-off in amount of smoothing (choice of $\mu$)

    large $L_\mu$ (less smoothing) gives a more accurate approximation

    small $L_\mu$ (more smoothing) gives faster convergence

    Smoothing 3

  • Example: Huber penalty as smoothed absolute value

    $$\phi_\mu(z) = \begin{cases} z^2/(2\mu) & |z| \le \mu \\ |z| - \mu/2 & |z| \ge \mu \end{cases}$$

    [figure: graph of $\phi_\mu(z)$ versus $z$, quadratic on $[-\mu, \mu]$ with value $\mu/2$ at $z = \pm\mu$, linear beyond]

    $\mu$ controls accuracy and smoothness

    accuracy

    $$|z| - \frac{\mu}{2} \le \phi_\mu(z) \le |z|$$

    smoothness

    $$\phi_\mu''(z) \le \frac{1}{\mu}$$

    Smoothing 4
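The piecewise definition above translates directly into code; a minimal sketch (pure Python, the function name `huber` is our own) that also checks the accuracy bound $|z| - \mu/2 \le \phi_\mu(z) \le |z|$ numerically:

```python
def huber(z, mu):
    """Huber penalty: quadratic for |z| <= mu, linear with slope 1 beyond."""
    if abs(z) <= mu:
        return z * z / (2 * mu)
    return abs(z) - mu / 2

mu = 0.5
for z in [-3.0, -0.5, -0.1, 0.0, 0.2, 1.0, 4.0]:
    v = huber(z, mu)
    # accuracy bound from the slide: |z| - mu/2 <= phi_mu(z) <= |z|
    assert abs(z) - mu / 2 <= v <= abs(z) + 1e-12
```

The two pieces meet at $|z| = \mu$ with matching value ($\mu/2$) and slope ($\pm 1$), which is what makes $\phi_\mu$ continuously differentiable.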

  • Huber penalty approximation of 1-norm minimization

    $$f(x) = \|Ax - b\|_1, \qquad f_\mu(x) = \sum_{i=1}^m \phi_\mu(a_i^Tx - b_i)$$

    accuracy: from $f(x) - m\mu/2 \le f_\mu(x) \le f(x)$,

    $$f(x) - f^\star \le f_\mu(x) - f_\mu^\star + \frac{m\mu}{2}$$

    to achieve $f(x) - f^\star \le \epsilon$ we need $f_\mu(x) - f_\mu^\star \le \hat\epsilon$ with $\hat\epsilon = \epsilon - m\mu/2$

    Lipschitz constant of $\nabla f_\mu$ is $L = \|A\|_2^2/\mu$

    complexity: for $\mu = \epsilon/m$,

    $$\frac{L}{\hat\epsilon} = \frac{\|A\|_2^2}{\mu(\epsilon - m\mu/2)} = \frac{2m\|A\|_2^2}{\epsilon^2}$$

    i.e., $O(\sqrt{L/\hat\epsilon}) = O(1/\epsilon)$ iteration complexity for fast gradient method

    Smoothing 5
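The bookkeeping on this page is easy to verify numerically: with $\mu = \epsilon/m$ the ratio $L/\hat\epsilon$ collapses to $2m\|A\|_2^2/\epsilon^2$. A sketch (numpy assumed; the sizes and data are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, eps = 200, 50, 1e-3
A = rng.standard_normal((m, n))
nA2 = np.linalg.norm(A, 2) ** 2      # ||A||_2^2, squared spectral norm

mu = eps / m                         # smoothing parameter from the slide
eps_hat = eps - m * mu / 2           # accuracy for the smoothed problem (= eps/2)
ratio = nA2 / (mu * eps_hat)         # L / eps_hat with L = ||A||_2^2 / mu
assert np.isclose(ratio, 2 * m * nA2 / eps ** 2)
```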

  • Outline

    introduction

    smoothing via conjugate

    examples

  • Minimum of strongly convex function

    if $x$ is a minimizer of a strongly convex function $f$, then it is unique and

    $$f(y) \ge f(x) + \frac{\mu}{2}\|y - x\|_2^2 \quad \forall y \in \mathop{\bf dom} f$$

    ($\mu$ is the strong convexity constant of $f$; see page 1-17)

    proof: if some $y$ does not satisfy the inequality, then for small positive $\theta$

    $$\begin{aligned}
    f((1-\theta)x + \theta y)
    &\le (1-\theta)f(x) + \theta f(y) - \frac{\mu}{2}\theta(1-\theta)\|y - x\|_2^2 \\
    &= f(x) + \theta\left(f(y) - f(x) - \frac{\mu}{2}\|y - x\|_2^2\right) + \frac{\mu\theta^2}{2}\|x - y\|_2^2 \\
    &< f(x)
    \end{aligned}$$

    (for small enough $\theta$ the negative term linear in $\theta$ dominates the $\theta^2$ term), contradicting the optimality of $x$

    Smoothing 6

  • Conjugate of strongly convex function

    suppose $f$ is closed and strongly convex with constant $\mu$, and conjugate

    $$f^*(y) = \sup_{x \in \mathop{\bf dom} f} \left(y^Tx - f(x)\right)$$

    $f^*$ is defined and differentiable at all $y$, with gradient

    $$\nabla f^*(y) = \mathop{\rm argmax}_x \left(y^Tx - f(x)\right)$$

    $\nabla f^*$ is Lipschitz continuous with constant $1/\mu$

    $$\|\nabla f^*(u) - \nabla f^*(v)\|_2 \le \frac{1}{\mu}\|u - v\|_2$$

    Smoothing 7
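A numerical illustration of the $1/\mu$-Lipschitz claim, using the strongly convex $f(x) = |x| + (\mu/2)x^2$ in one dimension and a brute-force grid for the sup (numpy assumed; `grad_conj` is our own name, and this is a crude sketch, not how one would compute a conjugate in practice):

```python
import numpy as np

mu = 0.5
xs = np.linspace(-10, 10, 200001)
f = np.abs(xs) + (mu / 2) * xs ** 2   # strongly convex with constant mu

def grad_conj(y):
    # argmax_x (y*x - f(x)) equals the gradient of the conjugate f* at y
    return xs[np.argmax(y * xs - f)]

ys = np.linspace(-3, 3, 61)
g = np.array([grad_conj(y) for y in ys])
slopes = np.abs(np.diff(g) / np.diff(ys))
assert slopes.max() <= 1 / mu + 1e-2  # Lipschitz constant at most 1/mu
```

Here $\nabla f^*$ is the soft-thresholding map $y \mapsto (y - \mathop{\rm sign} y)/\mu$ for $|y| > 1$ and $0$ otherwise, whose steepest slope is exactly $1/\mu$.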

  • outline of proof

    $y^Tx - f(x)$ has a unique maximizer $x_y$ for every $y$ (follows from closedness and strong convexity of $f(x) - y^Tx$)

    from page 4-18, $\nabla f^*(y) = x_y$

    from strong convexity and page 6 (with $x_u = \nabla f^*(u)$, $x_v = \nabla f^*(v)$)

    $$f(x_u) - v^Tx_u \ge f(x_v) - v^Tx_v + \frac{\mu}{2}\|x_u - x_v\|_2^2$$

    $$f(x_v) - u^Tx_v \ge f(x_u) - u^Tx_u + \frac{\mu}{2}\|x_u - x_v\|_2^2$$

    adding the left- and right-hand sides of the inequalities gives

    $$\mu\|x_u - x_v\|_2^2 \le (x_u - x_v)^T(u - v)$$

    by the Cauchy-Schwarz inequality, $\mu\|x_u - x_v\|_2 \le \|u - v\|_2$

    Smoothing 8

  • Proximity function

    $d$ is a proximity function for a closed convex set $C$ if

    $d$ is continuous and strongly convex

    $C \subseteq \mathop{\bf dom} d$

    $d(x)$ measures distance of $x$ to the center $x_d = \mathop{\rm argmin}_{x \in C} d(x)$ of $C$

    normalization

    we will assume the strong convexity constant is 1 and $\inf_{x \in C} d(x) = 0$

    for a normalized proximity function

    $$d(x) \ge \frac{1}{2}\|x - x_d\|_2^2 \quad \forall x \in C$$

    Smoothing 9

  • common proximity functions

    $d(x) = \|x - u\|_2^2/2$ with $x_d = u \in C$

    $d(x) = \sum_{i=1}^n w_i(x_i - u_i)^2/2$ with $w_i \ge 1$ and $x_d = u \in C$

    $d(x) = \sum_{i=1}^n x_i \log x_i + \log n$ for $C = \{x \succeq 0 \mid \mathbf{1}^Tx = 1\}$, $x_d = (1/n)\mathbf{1}$

    example (probability simplex): entropy and $d(x) = (1/2)\|x - (1/n)\mathbf{1}\|_2^2$

    [figure: level curves of the entropy (left) and Euclidean (right) proximity functions on the probability simplex with vertices $(1,0,0)$, $(0,1,0)$, $(0,0,1)$]

    Smoothing 10
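The entropy example can be checked against the normalization inequality of page 9, $d(x) \ge \frac{1}{2}\|x - x_d\|_2^2$ (it holds via Pinsker's inequality, since $d$ is the relative entropy to the uniform distribution). A quick randomized check (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
for _ in range(1000):
    x = rng.uniform(0.01, 1.0, n)
    x /= x.sum()                       # random point of the probability simplex
    d = np.sum(x * np.log(x)) + np.log(n)
    # normalized proximity function bound: d(x) >= (1/2)||x - (1/n)1||_2^2
    assert d >= 0.5 * np.sum((x - 1 / n) ** 2) - 1e-12
```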

  • Smoothing via conjugate

    conjugate (dual) representation: suppose $f$ can be expressed as

    $$f(x) = \sup_{y \in \mathop{\bf dom} h} \left((Ax + b)^Ty - h(y)\right) = h^*(Ax + b)$$

    where $h$ is closed and convex with bounded domain

    smooth approximation: choose a proximity function $d$ for $C = \mathop{\bf cl} \mathop{\bf dom} h$

    $$f_\mu(x) = \sup_{y \in \mathop{\bf dom} h} \left((Ax + b)^Ty - h(y) - \mu d(y)\right) = (h + \mu d)^*(Ax + b)$$

    $f_\mu$ is differentiable because $h + \mu d$ is strongly convex

    Smoothing 11

  • Example: absolute value

    conjugate representation

    $$|x| = \sup_{-1 \le y \le 1} xy = h^*(x), \qquad h(y) = I_{[-1,1]}(y)$$

    proximity function: choosing $d(y) = y^2/2$ gives the Huber penalty

    $$f_\mu(x) = \sup_{-1 \le y \le 1} \left(xy - \mu y^2/2\right) = \begin{cases} x^2/(2\mu) & |x| \le \mu \\ |x| - \mu/2 & |x| > \mu \end{cases}$$

    proximity function: choosing $d(y) = 1 - \sqrt{1 - y^2}$ gives

    $$f_\mu(x) = \sup_{-1 \le y \le 1} \left(xy + \mu\sqrt{1 - y^2}\right) - \mu = \sqrt{x^2 + \mu^2} - \mu$$

    Smoothing 12

  • another conjugate representation of $|x|$

    $$|x| = \sup_{y_1 + y_2 = 1,\; y \succeq 0} x(y_1 - y_2)$$

    i.e., $|x| = h^*(Ax)$ for $h = I_C$,

    $$C = \{y \succeq 0 \mid y_1 + y_2 = 1\}, \qquad A = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$

    proximity function for $C$

    $$d(y) = y_1 \log y_1 + y_2 \log y_2 + \log 2$$

    smooth approximation

    $$f_\mu(x) = \sup_{y_1 + y_2 = 1,\; y \succeq 0} \left(xy_1 - xy_2 - \mu(y_1 \log y_1 + y_2 \log y_2 + \log 2)\right) = \mu \log\left(\frac{e^{x/\mu} + e^{-x/\mu}}{2}\right)$$

    Smoothing 13

  • comparison: three smooth approximations of absolute value

    [figure: the huber, sqrt, and log-sum-exp approximations $f_\mu(x)$ plotted against $x$ on $[-4, 4]$]

    Smoothing 14
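The three approximations being compared have the closed forms derived on the previous two pages, and each satisfies $f(x) - \mu D \le f_\mu(x) \le f(x)$ with $D = 1/2$, $1$, and $\log 2$ for the quadratic, sqrt, and entropy proximity functions respectively. A sketch checking this (pure Python; the function names are our own):

```python
import math

def huber(x, mu):
    return x * x / (2 * mu) if abs(x) <= mu else abs(x) - mu / 2

def sqrt_approx(x, mu):
    return math.sqrt(x * x + mu * mu) - mu

def logsumexp_approx(x, mu):
    return mu * math.log((math.exp(x / mu) + math.exp(-x / mu)) / 2)

mu = 0.3
for x in [k / 10 for k in range(-40, 41)]:
    for f_mu, D in [(huber, 0.5), (sqrt_approx, 1.0),
                    (logsumexp_approx, math.log(2))]:
        v = f_mu(x, mu)
        # accuracy bound: |x| - mu*D <= f_mu(x) <= |x|
        assert abs(x) - mu * D <= v <= abs(x) + 1e-12
```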

  • Gradient of smooth approximation

    $$f_\mu(x) = (h + \mu d)^*(Ax + b) = \sup_{y \in \mathop{\bf dom} h} \left((Ax + b)^Ty - h(y) - \mu d(y)\right)$$

    from the properties of the conjugate of a strongly convex function (page 7)

    $f_\mu$ is differentiable, with gradient

    $$\nabla f_\mu(x) = A^T \mathop{\rm argmax}_{y \in \mathop{\bf dom} h} \left((Ax + b)^Ty - h(y) - \mu d(y)\right)$$

    $\nabla f_\mu$ is Lipschitz continuous with constant

    $$L = \frac{\|A\|_2^2}{\mu}$$

    Smoothing 15

  • Accuracy of smooth approximation

    $$f(x) - \mu D \le f_\mu(x) \le f(x), \qquad D = \sup_{y \in \mathop{\bf dom} h} d(y)$$

    note $D < +\infty$ because $\mathop{\bf dom} h$ is bounded and $\mathop{\bf dom} h \subseteq \mathop{\bf dom} d$

    lower bound follows from

    $$f_\mu(x) = \sup_{y \in \mathop{\bf dom} h} \left((Ax + b)^Ty - h(y) - \mu d(y)\right) \ge \sup_{y \in \mathop{\bf dom} h} \left((Ax + b)^Ty - h(y) - \mu D\right) = f(x) - \mu D$$

    upper bound follows from

    $$f_\mu(x) \le \sup_{y \in \mathop{\bf dom} h} \left((Ax + b)^Ty - h(y)\right) = f(x)$$

    Smoothing 16

  • Complexity

    to find a solution of the nondifferentiable problem with accuracy $f(x) - f^\star \le \epsilon$

    solve the smoothed problem with accuracy $\hat\epsilon = \epsilon - \mu D$, so that

    $$f(x) - f^\star \le f_\mu(x) + \mu D - f_\mu^\star \le \hat\epsilon + \mu D = \epsilon$$

    Lipschitz constant of $\nabla f_\mu$ is $L = \|A\|_2^2/\mu$

    complexity: for $\mu = \epsilon/(2D)$,

    $$\frac{L}{\hat\epsilon} = \frac{\|A\|_2^2}{\mu(\epsilon - \mu D)} = \frac{4D\|A\|_2^2}{\epsilon^2}$$

    gives $O(1/\epsilon)$ iteration bound for fast gradient method

    efficiency in practice can be improved by decreasing $\mu$ gradually

    Smoothing 17

  • Outline

    introduction

    smoothing via conjugate

    examples

  • Piecewise-linear approximation

    $$f(x) = \max_{i=1,\ldots,m} \left(a_i^Tx + b_i\right)$$

    conjugate representation

    $$f(x) = \sup_{y \succeq 0,\; \mathbf{1}^Ty = 1} (Ax + b)^Ty$$

    proximity function

    $$d(y) = \sum_{i=1}^m y_i \log y_i + \log m$$

    smooth approximation

    $$f_\mu(x) = \mu \log \sum_{i=1}^m e^{(a_i^Tx + b_i)/\mu} - \mu \log m$$

    Smoothing 18
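The log-sum-exp formula above is easy to implement stably by shifting by the largest exponent. A sketch (numpy assumed, data arbitrary) that also checks the accuracy bound $f(x) - \mu\log m \le f_\mu(x) \le f(x)$ (here $D = \log m$):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, mu = 8, 3, 0.1
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)

def f(x):
    return np.max(A @ x + b)           # piecewise-linear max

def f_mu(x):
    z = (A @ x + b) / mu
    zmax = z.max()                     # shift for a stable log-sum-exp
    return mu * (zmax + np.log(np.exp(z - zmax).sum())) - mu * np.log(m)

x = rng.standard_normal(n)
assert f(x) - mu * np.log(m) <= f_mu(x) <= f(x) + 1e-12
```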

  • 1-Norm approximation

    $$f(x) = \|Ax - b\|_1$$

    conjugate representation

    $$f(x) = \sup_{\|y\|_\infty \le 1} (Ax - b)^Ty$$

    proximity function

    $$d(y) = \frac{1}{2}\sum_i w_i y_i^2 \qquad \text{(with } w_i \ge 1\text{)}$$

    smooth approximation: Huber approximation

    $$f_\mu(x) = \sum_{i=1}^m \phi_{\mu w_i}(a_i^Tx - b_i)$$

    Smoothing 19

  • Maximum eigenvalue

    conjugate representation: for $X \in \mathbf{S}^n$,

    $$f(X) = \lambda_{\max}(X) = \sup_{Y \succeq 0,\; \mathop{\bf tr} Y = 1} \mathop{\bf tr}(XY)$$

    proximity function: negative matrix entropy

    $$d(Y) = \sum_{i=1}^n \lambda_i(Y) \log \lambda_i(Y) + \log n$$

    smooth approximation

    $$f_\mu(X) = \sup_{Y \succeq 0,\; \mathop{\bf tr} Y = 1} \left(\mathop{\bf tr}(XY) - \mu d(Y)\right) = \mu \log \sum_{i=1}^n e^{\lambda_i(X)/\mu} - \mu \log n$$

    Smoothing 20
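The smoothed maximum eigenvalue reduces to a log-sum-exp of the eigenvalues, so it can be evaluated from an eigendecomposition; a sketch (numpy assumed, data arbitrary) that checks $\lambda_{\max}(X) - \mu\log n \le f_\mu(X) \le \lambda_{\max}(X)$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, mu = 6, 0.05
B = rng.standard_normal((n, n))
X = (B + B.T) / 2                      # random symmetric matrix

lam = np.linalg.eigvalsh(X)            # eigenvalues of X
lmax = lam.max()
# stabilized mu*log(sum_i exp(lambda_i/mu)) - mu*log(n)
f_mu = mu * (lmax / mu + np.log(np.exp((lam - lmax) / mu).sum())) - mu * np.log(n)
assert lmax - mu * np.log(n) <= f_mu <= lmax + 1e-12
```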

  • Nuclear norm

    nuclear norm $f(X) = \|X\|_*$ is the sum of the singular values of $X \in \mathbf{R}^{m \times n}$

    conjugate representation

    $$f(X) = \sup_{\|Y\|_2 \le 1} \mathop{\bf tr}(X^TY)$$

    proximity function

    $$d(Y) = \frac{1}{2}\|Y\|_F^2$$

    smooth approximation

    $$f_\mu(X) = \sup_{\|Y\|_2 \le 1} \left(\mathop{\bf tr}(X^TY) - \mu d(Y)\right) = \sum_i \phi_\mu(\sigma_i(X))$$

    the sum of the Huber penalties applied to the singular values of $X$

    Smoothing 21
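A sketch of the smoothed nuclear norm (numpy assumed, data arbitrary): apply the Huber penalty to each singular value, and check the accuracy bound $\|X\|_* - \mu D \le f_\mu(X) \le \|X\|_*$, where $D = \min(m,n)/2$ since $\|Y\|_2 \le 1$ implies $\|Y\|_F^2 \le \min(m,n)$:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, mu = 7, 4, 0.2
X = rng.standard_normal((m, n))

sig = np.linalg.svd(X, compute_uv=False)   # singular values of X

def huber(z):
    return z * z / (2 * mu) if abs(z) <= mu else abs(z) - mu / 2

nuc = sig.sum()                            # nuclear norm ||X||_*
f_mu = sum(huber(s) for s in sig)          # smoothed nuclear norm
assert nuc - mu * min(m, n) / 2 <= f_mu <= nuc + 1e-12
```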

  • Lagrange dual function

    $$\begin{array}{ll} \mbox{minimize} & f_0(x) \\ \mbox{subject to} & f_i(x) \le 0, \quad i = 1, \ldots, m \\ & x \in C \end{array}$$

    $f_i$ convex, $C$ closed and bounded

    smooth approximation of dual function: choose proximity function $d$ for $C$

    $$g_\mu(\lambda) = \inf_{x \in C} \left(f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \mu d(x)\right)$$

    this is equivalent to regularizing the primal problem

    $$\begin{array}{ll} \mbox{minimize} & f_0(x) + \mu d(x) \\ \mbox{subject to} & f_i(x) \le 0, \quad i = 1, \ldots, m \\ & x \in C \end{array}$$

    Smoothing 22

  • References

    D. Bertsekas, Nonlinear Programming (1995), §5.4.5

    Yu. Nesterov, Smooth minimization of non-smooth functions, Mathematical Programming (2005)

    Smoothing 23