L. Vandenberghe EE236C (Spring 2013-14)
Smoothing
introduction
smoothing via conjugate
examples
First-order convex optimization methods
complexity of finding an ε-suboptimal point of f

• subgradient method: f nondifferentiable with Lipschitz constant G

  O((G/ε)²) iterations

• proximal gradient method: f = g + h, h a simple nondifferentiable function, g differentiable with L-Lipschitz continuous gradient

  O(L/ε) iterations

• fast proximal gradient methods (lecture 7)

  O(√(L/ε)) iterations
Nondifferentiable optimization by smoothing
for nondifferentiable f that cannot be handled by the proximal gradient method

• replace f with a differentiable approximation f_μ (parametrized by μ > 0)
• minimize f_μ by the (fast) gradient method

complexity: #iterations for the (fast) gradient method depends on L/ε̂

• L is the Lipschitz constant of ∇f_μ
• ε̂ is the accuracy with which the smooth problem is solved

trade-off in amount of smoothing (choice of μ)

• large L (less smoothing) gives a more accurate approximation
• small L (more smoothing) gives faster convergence
Example: Huber penalty as smoothed absolute value
  φ_μ(z) = { z²/(2μ)    |z| ≤ μ
           { |z| − μ/2  |z| ≥ μ

[Figure: graph of the Huber penalty φ_μ(z) versus z]
μ controls accuracy and smoothness

• accuracy:

  |z| − μ/2 ≤ φ_μ(z) ≤ |z|

• smoothness:

  φ_μ''(z) ≤ 1/μ
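The accuracy bounds above are easy to verify numerically. A minimal sketch (the function and parameter names are illustrative, not from the lecture):

```python
# Check the Huber accuracy bounds |z| - mu/2 <= phi_mu(z) <= |z|
# on a grid of points.

def huber(z, mu):
    """Huber penalty phi_mu(z)."""
    if abs(z) <= mu:
        return z * z / (2 * mu)
    return abs(z) - mu / 2

mu = 0.5
zs = [k / 10.0 for k in range(-30, 31)]          # grid on [-3, 3]
lower_ok = all(abs(z) - mu / 2 <= huber(z, mu) + 1e-12 for z in zs)
upper_ok = all(huber(z, mu) <= abs(z) + 1e-12 for z in zs)
```

Note that the two bounds touch: the lower bound is tight for |z| ≥ μ, the upper bound at z = 0.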
Huber penalty approximation of 1-norm minimization
  f(x) = ‖Ax − b‖₁,   f_μ(x) = Σ_{i=1}^m φ_μ(a_iᵀx − b_i)

accuracy: from f(x) − mμ/2 ≤ f_μ(x) ≤ f(x),

  f_μ(x) − f* ≤ ε̂  ⟹  f(x) − f* ≤ ε̂ + mμ/2

to achieve f(x) − f* ≤ ε we need f_μ(x) − f* ≤ ε̂ with ε̂ = ε − mμ/2

Lipschitz constant of ∇f_μ is L = ‖A‖₂²/μ

complexity: for μ = ε/m,

  L/ε̂ = ‖A‖₂² / (μ(ε − mμ/2)) = 2m‖A‖₂²/ε²

i.e., O(√(L/ε̂)) = O(1/ε) iteration complexity for the fast gradient method
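The sandwich f(x) − mμ/2 ≤ f_μ(x) ≤ f(x) used above can be spot-checked on random data; a sketch with illustrative names:

```python
# Verify f(x) - m*mu/2 <= f_mu(x) <= f(x) for the Huber-smoothed
# 1-norm objective on random data.
import random

def huber(z, mu):
    return z * z / (2 * mu) if abs(z) <= mu else abs(z) - mu / 2

random.seed(0)
m, n, mu = 20, 5, 0.1
A = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]
b = [random.gauss(0, 1) for _ in range(m)]
x = [random.gauss(0, 1) for _ in range(n)]

r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
f = sum(abs(ri) for ri in r)                 # f(x) = ||Ax - b||_1
f_mu = sum(huber(ri, mu) for ri in r)        # smoothed objective
ok = f - m * mu / 2 - 1e-12 <= f_mu <= f + 1e-12
```

The bound holds term by term, since each Huber term satisfies |r_i| − μ/2 ≤ φ_μ(r_i) ≤ |r_i|.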
Outline
introduction
smoothing via conjugate
examples
Minimum of strongly convex function
if x is a minimizer of a strongly convex function f, then it is unique and

  f(y) ≥ f(x) + (μ/2)‖y − x‖₂²   for all y ∈ dom f

(μ is the strong convexity constant of f; see page 1-17)

proof: if some y does not satisfy the inequality, then for small positive θ

  f((1−θ)x + θy) ≤ (1−θ)f(x) + θf(y) − θ(1−θ)(μ/2)‖y − x‖₂²
                 = f(x) + θ(f(y) − f(x) − (μ/2)‖y − x‖₂²) + θ²(μ/2)‖x − y‖₂²
                 < f(x)
Conjugate of strongly convex function
suppose f is closed and strongly convex with constant μ, with conjugate

  f*(y) = sup_{x ∈ dom f} (yᵀx − f(x))

• f* is defined and differentiable at all y, with gradient

  ∇f*(y) = argmax_x (yᵀx − f(x))

• ∇f* is Lipschitz continuous with constant 1/μ:

  ‖∇f*(u) − ∇f*(v)‖₂ ≤ (1/μ)‖u − v‖₂
outline of proof

• yᵀx − f(x) has a unique maximizer x_y for every y (follows from closedness and strong convexity of f(x) − yᵀx)

• from page 4-18, ∇f*(y) = x_y

• from strong convexity and page 6 (with x_u = ∇f*(u), x_v = ∇f*(v))

  f(x_u) − vᵀx_u ≥ f(x_v) − vᵀx_v + (μ/2)‖x_u − x_v‖₂²
  f(x_v) − uᵀx_v ≥ f(x_u) − uᵀx_u + (μ/2)‖x_u − x_v‖₂²

• adding the left- and right-hand sides of the inequalities gives

  μ‖x_u − x_v‖₂² ≤ (x_u − x_v)ᵀ(u − v)

• by the Cauchy–Schwarz inequality, μ‖x_u − x_v‖₂ ≤ ‖u − v‖₂
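The 1/μ bound can be illustrated in one dimension, where ∇f* has a closed form. The example below is our own sketch: for f(x) = (μ/2)x² + |x|, which is μ-strongly convex, the maximizer of yx − f(x) is the soft-thresholding map, and its Lipschitz constant is exactly 1/μ:

```python
# For f(x) = (mu/2) x^2 + |x| (strongly convex with constant mu),
# grad f*(y) = argmax_x (y x - f(x)) = sign(y) * max(|y| - 1, 0) / mu.
# Check numerically that this map is (1/mu)-Lipschitz.

def grad_fstar(y, mu):
    s = 1.0 if y >= 0 else -1.0
    return s * max(abs(y) - 1.0, 0.0) / mu

mu = 0.25
ys = [k / 10.0 for k in range(-40, 41)]
ratios = [
    abs(grad_fstar(u, mu) - grad_fstar(v, mu)) / abs(u - v)
    for u in ys for v in ys if u != v
]
lipschitz_ok = max(ratios) <= 1.0 / mu + 1e-9
```

The map is piecewise linear with maximum slope 1/μ, so the bound is attained for u, v ≥ 1.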
Proximity function
d is a proximity function for a closed convex set C if

• d is continuous and strongly convex
• C ⊆ dom d

d(x) measures distance of x to the center x_d = argmin_{x∈C} d(x) of C

normalization: we will assume the strong convexity constant is 1 and inf_{x∈C} d(x) = 0

for a normalized proximity function

  d(x) ≥ (1/2)‖x − x_d‖₂²   for all x ∈ C
common proximity functions

• d(x) = ‖x − u‖₂²/2 with x_d = u ∈ C

• d(x) = Σ_{i=1}^n w_i(x_i − u_i)²/2 with w_i ≥ 1 and x_d = u ∈ C

• d(x) = Σ_{i=1}^n x_i log x_i + log n for C = {x ⪰ 0 | 1ᵀx = 1}, x_d = (1/n)1

example (probability simplex): the entropy and d(x) = (1/2)‖x − (1/n)1‖₂²
[Figure: level sets of the entropy (left) and Euclidean (right) proximity functions on the probability simplex with vertices (1,0,0), (0,1,0), (0,0,1)]
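For the entropy proximity function on the simplex, the normalization d(x) ≥ (1/2)‖x − x_d‖₂² can be spot-checked numerically; a sketch with illustrative names:

```python
# Entropy proximity function on the probability simplex:
# d(x) = sum_i x_i log x_i + log n, with center x_d = (1/n) 1.
# Check d(x_d) = 0 and d(x) >= (1/2) ||x - x_d||_2^2 at random points.
import math, random

def entropy_prox(x):
    n = len(x)
    return sum(xi * math.log(xi) for xi in x if xi > 0) + math.log(n)

random.seed(1)
n = 4
center = [1.0 / n] * n
ok = True
for _ in range(1000):
    w = [random.expovariate(1.0) for _ in range(n)]
    s = sum(w)
    x = [wi / s for wi in w]                 # random point on the simplex
    lhs = entropy_prox(x)
    rhs = 0.5 * sum((xi - 1.0 / n) ** 2 for xi in x)
    ok = ok and lhs >= rhs - 1e-12

center_ok = abs(entropy_prox(center)) < 1e-12
```

The inequality actually holds with the stronger ℓ₁ norm (Pinsker's inequality), which implies the ℓ₂ version checked here.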
Smoothing via conjugate
conjugate (dual) representation: suppose f can be expressed as

  f(x) = sup_{y ∈ dom h} ((Ax + b)ᵀy − h(y)) = h*(Ax + b)

where h is closed and convex with bounded domain

smooth approximation: choose a proximity function d for C = cl dom h

  f_μ(x) = sup_{y ∈ dom h} ((Ax + b)ᵀy − h(y) − μd(y)) = (h + μd)*(Ax + b)

f_μ is differentiable because h + μd is strongly convex
Example: absolute value
conjugate representation

  |x| = sup_{−1 ≤ y ≤ 1} xy = h*(x),   h(y) = I_{[−1,1]}(y)

proximity function: choosing d(y) = y²/2 gives the Huber penalty

  f_μ(x) = sup_{−1 ≤ y ≤ 1} (xy − μy²/2) = { x²/(2μ)    |x| ≤ μ
                                            { |x| − μ/2  |x| > μ

proximity function: choosing d(y) = 1 − √(1 − y²) gives

  f_μ(x) = sup_{−1 ≤ y ≤ 1} (xy + μ√(1 − y²)) − μ = √(x² + μ²) − μ
another conjugate representation of |x|

  |x| = sup_{y₁+y₂=1, y ⪰ 0} x(y₁ − y₂)

i.e., |x| = h*(Ax) for h = I_C,

  C = {y ⪰ 0 | y₁ + y₂ = 1},   A = [1; −1]

proximity function for C

  d(y) = y₁ log y₁ + y₂ log y₂ + log 2

smooth approximation

  f_μ(x) = sup_{y₁+y₂=1, y⪰0} (xy₁ − xy₂ − μ(y₁ log y₁ + y₂ log y₂ + log 2))
         = μ log((e^{x/μ} + e^{−x/μ})/2)
comparison: three smooth approximations of the absolute value

[Figure: Huber, square-root (√(x² + μ²) − μ), and log-sum-exp approximations f_μ(x) plotted against x]
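The three approximations can also be compared numerically. All three satisfy f(x) − μD ≤ f_μ(x) ≤ f(x) with D = sup d equal to 1/2, 1, and log 2 respectively (a sketch; names are ours):

```python
# Three smooth approximations of |x| and their accuracy bounds
# |x| - mu*D <= f_mu(x) <= |x|, with D = sup_y d(y).
import math

mu = 0.5

def huber(x):        # d(y) = y^2/2,              D = 1/2
    return x * x / (2 * mu) if abs(x) <= mu else abs(x) - mu / 2

def sqrt_smooth(x):  # d(y) = 1 - sqrt(1 - y^2),  D = 1
    return math.sqrt(x * x + mu * mu) - mu

def lse_smooth(x):   # entropy prox,              D = log 2
    return mu * math.log((math.exp(x / mu) + math.exp(-x / mu)) / 2)

pairs = [(huber, 0.5), (sqrt_smooth, 1.0), (lse_smooth, math.log(2))]
xs = [k / 10.0 for k in range(-40, 41)]
bounds_ok = all(
    abs(x) - mu * D - 1e-9 <= f(x) <= abs(x) + 1e-9
    for f, D in pairs for x in xs
)
```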
Gradient of smooth approximation
  f_μ(x) = (h + μd)*(Ax + b) = sup_{y ∈ dom h} ((Ax + b)ᵀy − h(y) − μd(y))

from the properties of the conjugate of a strongly convex function (page 7)

• f_μ is differentiable, with gradient

  ∇f_μ(x) = Aᵀ argmax_{y ∈ dom h} ((Ax + b)ᵀy − h(y) − μd(y))

• ∇f_μ is Lipschitz continuous with constant

  L = ‖A‖₂²/μ
Accuracy of smooth approximation
  f(x) − μD ≤ f_μ(x) ≤ f(x),   D = sup_{y ∈ dom h} d(y)

note D < +∞ because dom h is bounded and dom h ⊆ dom d

lower bound follows from

  f_μ(x) = sup_{y ∈ dom h} ((Ax + b)ᵀy − h(y) − μd(y))
         ≥ sup_{y ∈ dom h} ((Ax + b)ᵀy − h(y) − μD) = f(x) − μD

upper bound follows from

  f_μ(x) ≤ sup_{y ∈ dom h} ((Ax + b)ᵀy − h(y)) = f(x)
Complexity
to find a solution of the nondifferentiable problem with accuracy f(x) − f* ≤ ε

• solve the smoothed problem with accuracy ε̂ = ε − μD, so that

  f(x) − f* ≤ f_μ(x) + μD − f* ≤ ε̂ + μD = ε

• Lipschitz constant of ∇f_μ is L = ‖A‖₂²/μ

• complexity: for μ = ε/(2D),

  L/ε̂ = ‖A‖₂² / (μ(ε − μD)) = 4D‖A‖₂²/ε²

• gives O(1/ε) iteration bound for the fast gradient method

• efficiency in practice can be improved by decreasing μ gradually
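Putting the pieces together, here is a minimal sketch of the scheme for min ‖Ax − b‖₁: Huber smoothing with μ = ε/m and a basic fast (Nesterov) gradient iteration with step 1/L. This is illustrative code under our own assumptions (e.g. the Frobenius norm is used as a cheap upper bound on ‖A‖₂), not the lecture's implementation:

```python
# Smoothing + fast gradient method for min ||Ax - b||_1.
import random

random.seed(2)
m, n, eps = 30, 8, 0.5
A = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]
b = [random.gauss(0, 1) for _ in range(m)]
mu = eps / m                             # smoothing parameter from slide 5

def matvec(M, v):
    return [sum(Mi[j] * v[j] for j in range(len(v))) for Mi in M]

def grad_fmu(x):
    # gradient of sum_i phi_mu(a_i^T x - b_i) is A^T clip(r/mu, -1, 1)
    r = [ri - bi for ri, bi in zip(matvec(A, x), b)]
    g = [max(-1.0, min(1.0, ri / mu)) for ri in r]
    return [sum(A[i][j] * g[i] for i in range(m)) for j in range(n)]

# upper bound on L = ||A||_2^2 / mu via the Frobenius norm
L = sum(aij * aij for row in A for aij in row) / mu

x = y = [0.0] * n
t = 1.0
for _ in range(1000):                    # fast gradient iterations
    g = grad_fmu(y)
    x_new = [yi - gi / L for yi, gi in zip(y, g)]
    t_new = (1 + (1 + 4 * t * t) ** 0.5) / 2
    y = [xn + (t - 1) / t_new * (xn - xo) for xn, xo in zip(x_new, x)]
    x, t = x_new, t_new

f_final = sum(abs(ri - bi) for ri, bi in zip(matvec(A, x), b))
f_zero = sum(abs(bi) for bi in b)        # objective at the starting point
```

In practice one would use the spectral norm of A for L and decrease μ gradually, as noted above.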
Outline
introduction
smoothing via conjugate
examples
Piecewise-linear approximation
  f(x) = max_{i=1,…,m} (a_iᵀx + b_i)

conjugate representation

  f(x) = sup_{y ⪰ 0, 1ᵀy = 1} (Ax + b)ᵀy

proximity function

  d(y) = Σ_{i=1}^m y_i log y_i + log m

smooth approximation

  f_μ(x) = μ log Σ_{i=1}^m e^{(a_iᵀx + b_i)/μ} − μ log m
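The smoothed maximum above satisfies f(x) − μ log m ≤ f_μ(x) ≤ f(x), which can be checked on random data (a sketch; names are ours):

```python
# Log-sum-exp smoothing of a piecewise-linear maximum, with its
# accuracy bounds f - mu*log(m) <= f_mu <= f.
import math, random

random.seed(3)
m, n, mu = 10, 4, 0.2
a = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]
b = [random.gauss(0, 1) for _ in range(m)]
x = [random.gauss(0, 1) for _ in range(n)]

vals = [sum(ai[j] * x[j] for j in range(n)) + bi for ai, bi in zip(a, b)]
f = max(vals)
# shift by the max for numerical stability of the exponentials
f_mu = mu * math.log(sum(math.exp((v - f) / mu) for v in vals)) + f \
       - mu * math.log(m)
ok = f - mu * math.log(m) - 1e-12 <= f_mu <= f + 1e-12
```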
1-Norm approximation
  f(x) = ‖Ax − b‖₁

conjugate representation

  f(x) = sup_{‖y‖_∞ ≤ 1} (Ax − b)ᵀy

proximity function

  d(y) = (1/2) Σ_i w_i y_i²   (with w_i ≥ 1)

smooth approximation: Huber approximation

  f_μ(x) = Σ_{i=1}^m φ_{μw_i}(a_iᵀx − b_i)
Maximum eigenvalue
conjugate representation: for X ∈ Sⁿ,

  f(X) = λ_max(X) = sup_{Y ⪰ 0, tr Y = 1} tr(XY)

proximity function: negative matrix entropy

  d(Y) = Σ_{i=1}^n λ_i(Y) log λ_i(Y) + log n

smooth approximation

  f_μ(X) = sup_{Y ⪰ 0, tr Y = 1} (tr(XY) − μd(Y))
         = μ log Σ_{i=1}^n e^{λ_i(X)/μ} − μ log n
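For a 2×2 symmetric matrix the eigenvalues have a closed form, so the smoothed maximum eigenvalue can be checked without a linear-algebra library (a sketch; larger matrices would use an eigenvalue routine such as numpy.linalg.eigvalsh):

```python
# Smoothed maximum eigenvalue of a 2x2 symmetric matrix X = [[a, c], [c, b]],
# with the bound lambda_max - mu*log(n) <= f_mu(X) <= lambda_max.
import math

a, b, c = 1.0, -0.5, 0.7
mu, n = 0.1, 2

mean = (a + b) / 2
rad = math.sqrt(((a - b) / 2) ** 2 + c ** 2)
lams = [mean - rad, mean + rad]      # eigenvalues of X
lam_max = lams[-1]

f_mu = mu * math.log(sum(math.exp(l / mu) for l in lams)) - mu * math.log(n)
ok = lam_max - mu * math.log(n) - 1e-12 <= f_mu <= lam_max + 1e-12
```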
Nuclear norm
nuclear norm f(X) = ‖X‖_* is the sum of the singular values of X ∈ R^{m×n}

conjugate representation

  f(X) = sup_{‖Y‖₂ ≤ 1} tr(XᵀY)

proximity function

  d(Y) = (1/2)‖Y‖_F²

smooth approximation

  f_μ(X) = sup_{‖Y‖₂ ≤ 1} (tr(XᵀY) − μd(Y)) = Σ_i φ_μ(σ_i(X))

the sum of the Huber penalties applied to the singular values of X
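A quick numerical illustration of this formula, for a 2×2 matrix whose singular values have a closed form (a sketch; names are ours):

```python
# Smoothed nuclear norm of a 2x2 matrix M: sum of Huber penalties of the
# singular values, with ||M||_* - 2*(mu/2) <= f_mu(M) <= ||M||_*.
import math

def huber(z, mu):
    return z * z / (2 * mu) if abs(z) <= mu else abs(z) - mu / 2

M = [[2.0, 0.3], [0.1, 0.05]]
mu = 0.2

# singular values of M from the eigenvalues of M^T M (2x2 closed form)
a = M[0][0] ** 2 + M[1][0] ** 2
b = M[0][1] ** 2 + M[1][1] ** 2
c = M[0][0] * M[0][1] + M[1][0] * M[1][1]
mean, rad = (a + b) / 2, math.sqrt(((a - b) / 2) ** 2 + c ** 2)
sigmas = [math.sqrt(max(mean - rad, 0.0)), math.sqrt(mean + rad)]

nuclear = sum(sigmas)                     # ||M||_*
f_mu = sum(huber(s, mu) for s in sigmas)  # smoothed nuclear norm
ok = nuclear - mu <= f_mu + 1e-12 and f_mu <= nuclear + 1e-12
```

Each Huber term lies in [σ_i − μ/2, σ_i], so with two singular values the total error is at most μ.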
Lagrange dual function
  minimize    f₀(x)
  subject to  f_i(x) ≤ 0,  i = 1, …, m
              x ∈ C

f_i convex, C closed and bounded

smooth approximation of the dual function: choose a proximity function d for C

  g_μ(λ) = inf_{x ∈ C} (f₀(x) + Σ_{i=1}^m λ_i f_i(x) + μd(x))

this is equivalent to regularizing the primal problem

  minimize    f₀(x) + μd(x)
  subject to  f_i(x) ≤ 0,  i = 1, …, m
              x ∈ C
References
• D. Bertsekas, Nonlinear Programming (1995), §5.4.5

• Yu. Nesterov, Smooth minimization of non-smooth functions, Mathematical Programming (2005)