Lecture 6 Preliminaries of Optimization
March 27 - April 3 2020
1 Basic definitions
The effective domain of $f:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ is defined as
$$\mathrm{dom}(f) = \{x \mid f(x) < +\infty\}.$$
A function $f:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ is called proper if there exists at least one $x \in \mathbb{R}^n$ such that $f(x) < +\infty$, meaning that $\mathrm{dom}(f) \neq \emptyset$.
The epigraph of $f:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ is defined by
$$\mathrm{epi}(f) = \{(x, y) : f(x) \le y,\; x \in \mathbb{R}^n,\; y \in \mathbb{R}\}.$$
A function $f:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ is called closed if $\mathrm{epi}(f)$ is closed.
2 Solutions of $\min_x f(x)$
$x_*$ is a local minimizer of $f$ if there is a neighborhood $\mathcal{N}$ of $x_*$ such that $f(x) \ge f(x_*)$ for all $x \in \mathcal{N}$.
$x_*$ is a strict local minimizer if it is a local minimizer on some neighborhood $\mathcal{N}$ and, in addition, $f(x) > f(x_*)$ for all $x \in \mathcal{N}$ with $x \neq x_*$.
$x_*$ is an isolated local minimizer if there is a neighborhood $\mathcal{N}$ of $x_*$ such that $f(x) \ge f(x_*)$ for all $x \in \mathcal{N}$ and, in addition, $\mathcal{N}$ contains no local minimizers other than $x_*$.
Strict local minimizers are not always isolated; for example,
$$f(x) = x^4\cos(1/x) + 2x^4, \quad f(0) = 0.$$
All isolated local minimizers are strict.
$x_*$ is a global minimizer of $f$ if $f(x) \ge f(x_*)$ for all $x \in \mathbb{R}^n$.
3 Convexity
A set $\Omega \subseteq \mathbb{R}^n$ is called convex if it has the property that
$$x, y \in \Omega \;\Rightarrow\; (1-\alpha)x + \alpha y \in \Omega \quad \forall\, \alpha \in [0, 1].$$
We usually deal with closed convex sets.
For a convex set $\Omega \subseteq \mathbb{R}^n$ we define the indicator function $I_\Omega$ as follows:
$$I_\Omega(x) = \begin{cases} 0 & \text{if } x \in \Omega, \\ +\infty & \text{otherwise.} \end{cases}$$
The constrained optimization problem
$$\min_{x \in \Omega} f(x)$$
can be restated equivalently (see the sketch below) as follows:
$$\min_{x \in \mathbb{R}^n} f(x) + I_\Omega(x).$$
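To make the restatement concrete, here is a minimal Python sketch; the one-dimensional objective and the interval $\Omega = [1, 3]$ are assumed for illustration, not taken from the lecture. Minimizing $f + I_\Omega$ over a grid of the whole line recovers the constrained minimizer.

```python
import numpy as np

def f(x):
    return (x - 0.5) ** 2          # unconstrained minimizer 0.5 lies outside Omega

def indicator(x, lo=1.0, hi=3.0):
    # I_Omega for the assumed interval Omega = [lo, hi]
    return 0.0 if lo <= x <= hi else np.inf

grid = np.linspace(-2.0, 5.0, 7001)
penalized = np.array([f(x) + indicator(x) for x in grid])
print(grid[np.argmin(penalized)])  # ~1.0: the constrained minimizer, on the boundary
```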
A convex function $f:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ has the following defining property: $\mathrm{dom}(f)$ is convex and, $\forall\, x, y \in \mathrm{dom}(f)$ and $\forall\, \alpha \in [0, 1]$,
$$f((1-\alpha)x + \alpha y) \le (1-\alpha)f(x) + \alpha f(y).$$
Theorem 1 (First-order convexity condition)
Differentiable $f$ is convex if and only if $\mathrm{dom}(f)$ is convex and
$$f(y) \ge f(x) + \nabla f(x)^T(y - x) \quad \forall\, x, y \in \mathrm{dom}(f).$$
Theorem 2 (Second-order convexity conditions)
Assume $f$ is twice continuously differentiable. Then $f$ is convex if and only if $\mathrm{dom}(f)$ is convex and
$$\nabla^2 f(x) \succeq 0 \quad \forall\, x \in \mathrm{dom}(f),$$
that is, $\nabla^2 f(x)$ is positive semidefinite.
Important properties for convex objective functions:
- Any local minimizer is also a global minimizer (see Theorem 12).
- The set of global minimizers is a convex set (easy to prove).
If there exists a value $\gamma > 0$ such that
$$f((1-\alpha)x + \alpha y) \le (1-\alpha)f(x) + \alpha f(y) - \frac{\gamma}{2}\alpha(1-\alpha)\|x - y\|_2^2$$
for all $x, y \in \mathbb{R}^n$ and $\alpha \in [0, 1]$, we say that $f$ is strongly convex with modulus of convexity $\gamma$.
For differentiable $f$, an equivalent definition of strongly convex with modulus of convexity $\gamma$ is
$$f(y) \ge f(x) + \nabla f(x)^T(y - x) + \frac{\gamma}{2}\|y - x\|^2.$$
4 Subgradient and subdifferential
Definition: A vector $v \in \mathbb{R}^n$ is a subgradient of $f$ at a point $x$ if for all $d \in \mathbb{R}^n$ it holds that
$$f(x + d) \ge f(x) + v^T d.$$
The subdifferential, denoted $\partial f(x)$, is the set of all subgradients of $f$ at $x$ (see FOMO for concrete examples).
Lemma 3 (Monotonicity of subdifferentials of convex functions)
For all convex $f$, if $a \in \partial f(x)$ and $b \in \partial f(y)$, we have $(a - b)^T(x - y) \ge 0$.
Proof. By the definition of subgradient we have
$$f(y) \ge f(x) + a^T(y - x) \quad \text{and} \quad f(x) \ge f(y) + b^T(x - y).$$
Adding these two inequalities yields the statement.
Theorem 4 (Fermat's lemma generalization to convex functions)
Let $f:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ be a proper convex function. Then the point $x_*$ is a minimizer of $f(x)$ if and only if
$$0 \in \partial f(x_*).$$
Proof. "$\Leftarrow$": Suppose that $0 \in \partial f(x_*)$. We have
$$f(x_* + d) \ge f(x_*) \quad \forall\, d \in \mathbb{R}^n,$$
which implies that $x_*$ is a minimizer of $f$. "$\Rightarrow$": by definition.
Theorem 5
Let $f:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ be proper convex, and let $x \in \mathrm{int}(\mathrm{dom}(f))$.
If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$. If $\partial f(x)$ is a singleton (a set containing a single vector), then $f$ is differentiable at $x$ with gradient equal to the unique subgradient.
5 Taylor's theorem
Taylor's theorem shows how smooth functions can be locally approximated by low-order (e.g., linear or quadratic) functions.
Theorem 6 (Taylor's theorem)
Given a continuously differentiable function $f:\mathbb{R}^n \to \mathbb{R}$ and given $x, p \in \mathbb{R}^n$, we have that
$$f(x + p) = f(x) + \int_0^1 \nabla f(x + tp)^T p \, dt,$$
$$f(x + p) = f(x) + \nabla f(x + \xi p)^T p \quad \text{for some } \xi \in (0, 1).$$
If $f$ is twice continuously differentiable, we have
$$\nabla f(x + p) = \nabla f(x) + \int_0^1 \nabla^2 f(x + tp)\, p \, dt,$$
$$f(x + p) = f(x) + \nabla f(x)^T p + \frac{1}{2} p^T \nabla^2 f(x + \xi p)\, p$$
for some $\xi \in (0, 1)$.
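The integral form is easy to check numerically. The sketch below uses an assumed smooth test function and the trapezoidal rule; the function and test point are illustrative only.

```python
import numpy as np

def f(x):
    return np.sin(x[0]) + x[1] ** 2

def grad(x):
    return np.array([np.cos(x[0]), 2.0 * x[1]])

x = np.array([0.3, -1.2])
p = np.array([0.7, 0.4])

# Trapezoidal-rule quadrature for the line integral of grad f(x + t p)^T p.
ts = np.linspace(0.0, 1.0, 10001)
vals = np.array([grad(x + t * p) @ p for t in ts])
integral = np.trapz(vals, ts)

print(f(x + p) - f(x))             # exact increment
print(integral)                    # matches to quadrature accuracy
```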
If $f:\mathbb{R}^n \to \mathbb{R}$ is differentiable and convex, then
$$\partial f(x) = \{\nabla f(x)\}$$
and
$$f(y) \ge f(x) + \nabla f(x)^T(y - x)$$
for all $x, y \in \mathbb{R}^n$.
$f$ is Lipschitz continuously differentiable with constant $L$ if
$$\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\| \quad \text{for all } x, y \in \mathbb{R}^n.$$
If $f$ is Lipschitz continuously differentiable with constant $L$, then by Taylor's theorem we have
$$f(y) - f(x) - \nabla f(x)^T(y - x) \le \frac{L}{2}\|y - x\|^2$$
for all $x, y \in \mathbb{R}^n$.
For differentiable $f$, an equivalent definition of strongly convex with modulus of convexity $\gamma$ is
$$f(y) \ge f(x) + \nabla f(x)^T(y - x) + \frac{\gamma}{2}\|y - x\|^2.$$
Lemma 7
Suppose $f$ is strongly convex with modulus of convexity $\gamma$ and $\nabla f$ is uniformly Lipschitz continuous with constant $L$. We have, $\forall\, x, y$, that
$$\frac{\gamma}{2}\|y - x\|^2 \le f(y) - f(x) - \nabla f(x)^T(y - x) \le \frac{L}{2}\|y - x\|^2.$$
Condition number: $\kappa = L/\gamma$ (e.g., for strictly convex quadratic $f$).
When $f$ is twice continuously differentiable, the inequalities in Lemma 7 are equivalent to
$$\gamma I \preceq \nabla^2 f(x) \preceq LI \quad \text{for all } x,$$
as the sketch below illustrates for a quadratic.
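For a quadratic the bounds are tight. A small sketch with an assumed positive definite matrix $A$, so that $f(x) = \frac{1}{2}x^T A x$ has constant Hessian $A$:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 2.0]])         # constant Hessian of f(x) = 0.5 x^T A x (assumed)
eigs = np.linalg.eigvalsh(A)       # eigenvalues, sorted ascending
gamma, L = eigs[0], eigs[-1]
print(gamma, L, L / gamma)         # modulus gamma, Lipschitz constant L, kappa
```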
Theorem 8
Let $f$ be differentiable and strongly convex with modulus of convexity $\gamma > 0$. Then the minimizer $x_*$ of $f$ exists and is unique.
Proof. (i) Compactness of the level set: Show that for any point $x^0$ the level set
$$\{x \mid f(x) \le f(x^0)\}$$
is closed and bounded, and hence compact.
(ii) Existence: Since $f$ is continuous, it attains its minimum on the compact level set, which is also the solution of $\min_x f(x)$.
(iii) Uniqueness: Suppose for contradiction that the minimizer is not unique, so that we have two points $x^1_*$ and $x^2_*$ that minimize $f$. By using the strong convexity property we can prove
$$f\left(\frac{x^1_* + x^2_*}{2}\right) < f(x^1_*) = f(x^2_*).$$
This is a contradiction.
6 Optimality conditions for smooth functions
Theorem 9 (First-order necessary condition)
If $x_*$ is a local minimizer of $f$ and $f$ is continuously differentiable in an open neighborhood of $x_*$, then $\nabla f(x_*) = 0$.
Proof. Suppose for contradiction that $\nabla f(x_*) \neq 0$. Define the vector $p = -\nabla f(x_*)$ and note that $p^T \nabla f(x_*) = -\|\nabla f(x_*)\|_2^2 < 0$. Because $\nabla f$ is continuous near $x_*$, there is a scalar $T > 0$ such that
$$p^T \nabla f(x_* + tp) < 0 \quad \text{for all } t \in [0, T].$$
For any $s \in (0, T]$ we have by Taylor's theorem that
$$f(x_* + sp) = f(x_*) + s\, p^T \nabla f(x_* + \xi s p) \quad \text{for some } \xi \in (0, 1).$$
Therefore $f(x_* + sp) < f(x_*)$ for all $s \in (0, T]$. We have found a direction leading away from $x_*$ along which $f$ decreases, so $x_*$ is not a local minimizer, and we have a contradiction.
Theorem 10 (Second-order necessary conditions)
If $x_*$ is a local minimizer of $f$ and $\nabla^2 f$ is continuous in an open neighborhood of $x_*$, then $\nabla f(x_*) = 0$ and $\nabla^2 f(x_*)$ is positive semidefinite.
Proof. We know from Theorem 9 that $\nabla f(x_*) = 0$. Assume that $\nabla^2 f(x_*)$ is not positive semidefinite. Then we can choose a vector $p$ such that $p^T \nabla^2 f(x_*)\, p < 0$, and because $\nabla^2 f$ is continuous near $x_*$, there is a scalar $T > 0$ such that
$$p^T \nabla^2 f(x_* + tp)\, p < 0 \quad \text{for all } t \in [0, T].$$
By doing a Taylor series expansion around $x_*$, we have for all $s \in (0, T]$ and some $\xi \in (0, 1)$ that
$$f(x_* + sp) = f(x_*) + s\, p^T \nabla f(x_*) + \frac{1}{2} s^2 p^T \nabla^2 f(x_* + \xi s p)\, p < f(x_*).$$
As in Theorem 9, we have found a direction from $x_*$ along which $f$ is decreasing, and so again $x_*$ is not a local minimizer.
Theorem 11 (Second-order sufficient conditions)
Suppose that $\nabla^2 f$ is continuous in an open neighborhood of $x_*$, that $\nabla f(x_*) = 0$, and that $\nabla^2 f(x_*)$ is positive definite. Then $x_*$ is a strict local minimizer of $f$.
Proof. Because the Hessian $\nabla^2 f$ is continuous and positive definite at $x_*$, we can choose a radius $r > 0$ so that $\nabla^2 f(x)$ remains positive definite for all $x$ in the open ball $B = \{z \mid \|z - x_*\| < r\}$. Taking any nonzero vector $p$ with $\|p\| < r$, we have $x_* + p \in B$ and
$$f(x_* + p) = f(x_*) + p^T \nabla f(x_*) + \frac{1}{2} p^T \nabla^2 f(x_* + \xi p)\, p = f(x_*) + \frac{1}{2} p^T \nabla^2 f(x_* + \xi p)\, p$$
for some $\xi \in (0, 1)$. Since $x_* + \xi p \in B$, we have $p^T \nabla^2 f(x_* + \xi p)\, p > 0$, and therefore $f(x_* + p) > f(x_*)$, giving the result.
A point $x$ is called a stationary point if
$$\nabla f(x) = 0.$$
A stationary point $x$ is called a saddle point if there exist $u$ and $v$ such that
$$f(x + \alpha u) < f(x) \quad \text{and} \quad f(x + \alpha v) > f(x)$$
for all sufficiently small $\alpha > 0$.
Stationary points are not necessarily local minimizers: stationary points can be local maximizers or saddle points.
If $\nabla f(x) = 0$ and $\nabla^2 f(x)$ has both strictly positive and strictly negative eigenvalues, then $x$ is a saddle point.
If $\nabla^2 f(x)$ is positive semidefinite or negative semidefinite, then $\nabla^2 f(x)$ alone is insufficient to classify $x$ (see the sketch below).
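A small sketch of this classification rule, using two assumed test functions: $x^2 - y^2$, whose Hessian at the origin is indefinite, and $x^4 + y^4$, whose Hessian at the origin is the zero matrix even though the origin is a minimizer.

```python
import numpy as np

def classify(H):
    eigs = np.linalg.eigvalsh(H)
    if eigs[0] > 0:
        return "strict local minimizer (Theorem 11)"
    if eigs[-1] < 0:
        return "strict local maximizer"
    if eigs[0] < 0 < eigs[-1]:
        return "saddle point"
    return "inconclusive from the Hessian alone"

print(classify(np.diag([2.0, -2.0])))  # Hessian of x^2 - y^2 at 0: saddle point
print(classify(np.zeros((2, 2))))      # Hessian of x^4 + y^4 at 0: inconclusive
```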
Theorem 12
(i) For all convex $f$, any local minimizer $x_*$ is a global minimizer of $f$.
(ii) If $f$ is convex and differentiable, then any stationary point $x_*$ is a global minimizer of $f$.
Proof. (i) Suppose that $x_*$ is a local but not a global minimizer. Then we can find a point $z \in \mathbb{R}^n$ with $f(z) < f(x_*)$. Consider the line segment that joins $x_*$ to $z$, that is,
$$x = \lambda z + (1 - \lambda)x_* \quad \text{for some } \lambda \in (0, 1].$$
By the convexity property for $f$, we have
$$f(x) \le \lambda f(z) + (1 - \lambda)f(x_*) < f(x_*).$$
Any neighborhood $\mathcal{N}$ of $x_*$ contains a piece of the line segment, so there will always be points $x \in \mathcal{N}$ at which the last inequality is satisfied. Hence $x_*$ is not a local minimizer.
(ii) Suppose that $x_*$ is not a global minimizer, and choose $z$ as above. Then from convexity we have
$$\nabla f(x_*)^T(z - x_*) = \frac{d}{d\lambda} f(x_* + \lambda(z - x_*))\Big|_{\lambda=0} = \lim_{\lambda \to 0^+} \frac{f(x_* + \lambda(z - x_*)) - f(x_*)}{\lambda} \le \lim_{\lambda \to 0^+} \frac{\lambda f(z) + (1 - \lambda)f(x_*) - f(x_*)}{\lambda} = f(z) - f(x_*) < 0.$$
Therefore $\nabla f(x_*) \neq 0$, and so $x_*$ is not a stationary point.
Remark: Theorems 9-12 provide the foundations for unconstrained optimization algorithms.
Numerical algorithms seek a point where $\nabla f$ vanishes.
7 Karush-Kuhn-Tucker conditions
Theorem 13 (KKT conditions)
Consider the minimization problem
$$\min_{x \in \mathbb{R}^n} f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \; i = 1, \dots, m,$$
where $f:\mathbb{R}^n \to \mathbb{R}$ and all $g_i:\mathbb{R}^n \to \mathbb{R}$ are convex functions.
Let $x_*$ be an optimal solution, and assume Slater's condition
$$\exists\, x \in \mathbb{R}^n \;\text{s.t.}\; g_i(x) < 0, \; i = 1, \dots, m,$$
holds. Then there exist $\lambda_1, \dots, \lambda_m \ge 0$ satisfying
$$0 \in \partial f(x_*) + \sum_{i=1}^m \lambda_i \partial g_i(x_*), \qquad \lambda_i g_i(x_*) = 0, \; i = 1, \dots, m.$$
Conversely, if $x_*$ satisfies the above conditions, called KKT conditions, then it is an optimal solution of the optimization problem (see the one-dimensional check below).
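A minimal one-dimensional check; the problem $f(x) = (x-2)^2$ with $g(x) = x - 1 \le 0$ is an assumed example, not from the lecture. In the smooth case the inclusion reduces to $\nabla f(x_*) + \lambda \nabla g(x_*) = 0$.

```python
x_star, lam = 1.0, 2.0              # candidate solution and multiplier

grad_f = 2.0 * (x_star - 2.0)       # f'(x*) = -2
grad_g = 1.0                        # g'(x*) = 1
g_val = x_star - 1.0                # g(x*) = 0: the constraint is active

print(grad_f + lam * grad_g == 0.0)       # stationarity: 0 = f' + lambda g'
print(lam >= 0.0 and lam * g_val == 0.0)  # dual feasibility, complementarity
```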
A constraint $g_i$ is active at $x$ if $g_i(x) = 0$, and inactive if $g_i(x) < 0$ (figure).
8 Moreau envelope and proximal operator
For a closed proper convex function $h:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ and a positive scalar $\lambda$, the Moreau envelope of $(\lambda, h)$ is
$$M_{\lambda h}(x) = \inf_u \left\{ h(u) + \frac{1}{2\lambda}\|u - x\|^2 \right\} = \frac{1}{\lambda} \inf_u \left\{ \lambda h(u) + \frac{1}{2}\|u - x\|^2 \right\}.$$
For a closed proper convex function $f:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$, the proximal operator of $f$ is
$$\mathrm{prox}_f(x) = \arg\min_u \left\{ f(u) + \frac{1}{2}\|u - x\|^2 \right\}.$$
From the optimality condition (see Theorem 4) we have
$$0 \in \partial f(\mathrm{prox}_f(x)) + (\mathrm{prox}_f(x) - x).$$
For a closed proper convex function $f:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$, the point $x_*$ is a minimizer of $f$ if and only if
$$x_* = \mathrm{prox}_f(x_*).$$
For a closed proper convex function $h:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ and a positive scalar $\lambda$, the Moreau envelope
$$M_{\lambda h}(x) = h(\mathrm{prox}_{\lambda h}(x)) + \frac{1}{2\lambda}\|\mathrm{prox}_{\lambda h}(x) - x\|^2$$
can be viewed as a kind of smoothing or regularization of the function $h$. We have, for all $x \in \mathbb{R}^n$,
$$-\infty < M_{\lambda h}(x) < +\infty,$$
even when $h(x) = +\infty$ for some $x \in \mathbb{R}^n$. Moreover, $M_{\lambda h}(x)$ is convex and differentiable everywhere, with gradient
$$\nabla M_{\lambda h}(x) = \frac{1}{\lambda}(x - \mathrm{prox}_{\lambda h}(x)).$$
Therefore $x_*$ is a minimizer of $h$ $\Leftrightarrow$ $x_*$ is a minimizer of $M_{\lambda h}(x)$; see the sketch below.
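As a sketch of this smoothing effect, the snippet below works out the one-dimensional case $h(x) = |x|$ (where $\mathrm{prox}_{\lambda h}$ is soft-thresholding, see example (2) below, and $M_{\lambda h}$ turns out to be the Huber function) and evaluates the gradient formula above; the value of $\lambda$ is an assumed choice.

```python
import numpy as np

lam = 0.5                           # assumed value of lambda

def prox_abs(x):
    # prox of lambda * |.| is soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def moreau(x):
    u = prox_abs(x)
    return np.abs(u) + (u - x) ** 2 / (2.0 * lam)

def grad_moreau(x):
    return (x - prox_abs(x)) / lam  # the gradient formula above

xs = np.linspace(-2.0, 2.0, 9)
print(moreau(xs))                   # quadratic near 0, linear in the tails (Huber)
print(grad_moreau(xs))              # continuous, saturates at -1 and +1
```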
Example: $h(x) = |x|$ and $H_\lambda(x) = M_{\lambda h}(x)$ (figure).
Lemma 14 (Nonexpansivity of the proximal operator)
For a closed proper convex function $f:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$, we have, for all $x, y \in \mathbb{R}^n$,
$$\|\mathrm{prox}_f(x) - \mathrm{prox}_f(y)\| \le \|x - y\|.$$
Proof. From the optimality conditions at the two points $x$ and $y$, we have
$$x - \mathrm{prox}_f(x) \in \partial f(\mathrm{prox}_f(x)) \quad \text{and} \quad y - \mathrm{prox}_f(y) \in \partial f(\mathrm{prox}_f(y)).$$
By applying monotonicity (see Lemma 3), we have
$$\big((x - \mathrm{prox}_f(x)) - (y - \mathrm{prox}_f(y))\big)^T (\mathrm{prox}_f(x) - \mathrm{prox}_f(y)) \ge 0.$$
Rearranging this and applying the Cauchy-Schwarz inequality yields
$$\|\mathrm{prox}_f(x) - \mathrm{prox}_f(y)\|^2 \le (x - y)^T(\mathrm{prox}_f(x) - \mathrm{prox}_f(y)) \le \|x - y\|\,\|\mathrm{prox}_f(x) - \mathrm{prox}_f(y)\|.$$
Examples of several proximal operators:
(1) $f(x) = 0$: $\mathrm{prox}_f(x) = x$.
(2) $f(x) = \lambda\|x\|_1$ with $\lambda > 0$:
$$\big[\mathrm{prox}_{\lambda\|\cdot\|_1}(x)\big]_i = \arg\min_{u \in \mathbb{R}} \left\{ \lambda|u| + \frac{1}{2}(u - x_i)^2 \right\} = \begin{cases} x_i - \lambda & \text{if } x_i > \lambda, \\ 0 & \text{if } x_i \in [-\lambda, \lambda], \\ x_i + \lambda & \text{if } x_i < -\lambda, \end{cases}$$
which is known as soft-thresholding (see the sketch below).
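A minimal implementation, plus a numerical spot-check of the nonexpansivity in Lemma 14; the random test points and the value $\lambda = 0.3$ are assumed for the demo.

```python
import numpy as np

def soft_threshold(x, lam):
    # componentwise: x_i - lam above lam, 0 on [-lam, lam], x_i + lam below -lam
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
lam = 0.3

lhs = np.linalg.norm(soft_threshold(x, lam) - soft_threshold(y, lam))
rhs = np.linalg.norm(x - y)
print(lhs <= rhs)                   # True: nonexpansivity (Lemma 14)
```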
(3) $f(x) = \lambda\|x\|_0$, the number of nonzero components (non-convex):
$$\big[\mathrm{prox}_{\lambda\|\cdot\|_0}(x)\big]_i = \begin{cases} x_i & \text{if } |x_i| > \sqrt{2\lambda}, \\ \{0, x_i\} & \text{if } |x_i| = \sqrt{2\lambda}, \\ 0 & \text{if } |x_i| < \sqrt{2\lambda}, \end{cases}$$
which is known as hard thresholding.
(4) $f(x) = I_\Omega(x)$:
$$\mathrm{prox}_{I_\Omega}(x) = \arg\min_u \left\{ I_\Omega(u) + \frac{1}{2}\|u - x\|^2 \right\} = \arg\min_{u \in \Omega} \|u - x\|,$$
which is simply the projection of $x$ onto the set $\Omega$. Both are sketched below.
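A sketch of both operators; the box $\Omega = [0, 1]^n$ is an assumed choice, for which the projection is a componentwise clip.

```python
import numpy as np

def hard_threshold(x, lam):
    # keep components with |x_i| > sqrt(2 lam), zero out the rest
    return np.where(np.abs(x) > np.sqrt(2.0 * lam), x, 0.0)

def project_box(x, lo=0.0, hi=1.0):
    # prox of I_Omega for the assumed box Omega = [lo, hi]^n
    return np.clip(x, lo, hi)

x = np.array([1.5, -0.2, 0.05, 0.9])
print(hard_threshold(x, lam=0.1))   # [1.5 0.  0.  0.9]  (threshold ~ 0.447)
print(project_box(x))               # [1.   0.   0.05 0.9]
```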
9 Convergence rates
An important measure for evaluating algorithms is the rate of convergence to zero of some measure of error.
For smooth $f$: $\|\nabla f(x^k)\|$.
For nonsmooth convex $f$: $\mathrm{dist}(0, \partial f(x^k))$ (the sequence of distances from $0$ to the subdifferential $\partial f(x^k)$).
Other error measures include
$$\|x^k - x_*\| \quad \text{and} \quad f(x^k) - f_*,$$
where $x_*$ is a solution and $f_*$ is the optimal value of $f$.
We denote by $\{\varphi_k\}$ the sequence of nonnegative scalars whose rate of convergence to $0$ we wish to find: $\varphi_k \ge 0$ and $\lim_{k \to \infty} \varphi_k = 0$.
We say that Q-linear convergence holds if there is some $\rho \in (0, 1)$ such that
$$\frac{\varphi_{k+1}}{\varphi_k} \le \rho$$
for all $k$ sufficiently large.
We say that R-linear convergence holds if there exist $\rho \in (0, 1)$ and $C > 0$ such that
$$\varphi_k \le C\rho^k, \quad k = 1, 2, \dots$$
Q-linear $\Rightarrow$ R-linear, but R-linear $\not\Rightarrow$ Q-linear. Example:
$$\varphi_k = \begin{cases} 2^{-k} & k \text{ even}, \\ 0 & k \text{ odd}, \end{cases}$$
satisfies R-linear convergence with $C = 1$ and $\rho = 1/2$ but is not Q-linear (see the sketch below).
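The example is easy to verify in code; the sketch below checks the R-linear bound and notes why the Q-linear ratio test fails.

```python
phi = [2.0 ** (-k) if k % 2 == 0 else 0.0 for k in range(12)]

# R-linear bound phi_k <= C rho^k with C = 1, rho = 1/2 holds at every k:
print(all(p <= 0.5 ** k for k, p in enumerate(phi)))   # True

# Q-linear fails: from any odd index, phi_{k+1} / phi_k divides by zero,
# so no single ratio bound rho < 1 can hold for all large k.
```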
Sublinear convergence is, as its name suggests, slower than linear. Several varieties of sublinear convergence:
$$\varphi_k \le \frac{C}{\sqrt{k}}, \qquad \varphi_k \le \frac{C}{k}, \qquad \varphi_k \le \frac{C}{k^2}, \quad k = 1, 2, \dots,$$
where in each case $C$ is some positive constant.
Superlinear convergence occurs when the constant $\rho \in (0, 1)$ in
$$\varphi_{k+1}/\varphi_k \le \rho$$
can be chosen arbitrarily close to $0$.
The sequence $\{\varphi_k\}$ converges Q-superlinearly to $0$ if
$$\lim_{k \to \infty} \frac{\varphi_{k+1}}{\varphi_k} = 0.$$
Q-quadratic if
$$\frac{\varphi_{k+1}}{\varphi_k^2} \le C, \quad k = 1, 2, \dots,$$
for some sufficiently large $C$.
R-superlinear if there is a Q-superlinearly convergent sequence $\{\nu_k\}$ that dominates $\{\varphi_k\}$, that is,
$$0 \le \varphi_k \le \nu_k, \quad k = 1, 2, \dots$$
R-quadratic convergence is defined similarly.
Quadratic and superlinear rates are associated with higher-order methods, such as Newton and quasi-Newton methods.
9.1 Complexity
For a sequence satisfying the R-linear convergence condition, a sufficient condition for $\varphi_k \le \epsilon < C$ is
$$C\rho^k \le \epsilon.$$
By using the estimate $\log \rho \le \rho - 1$ for all $\rho \in (0, 1)$, we have that
$$C\rho^k \le \epsilon \;\Leftrightarrow\; k \log \rho \le \log(\epsilon/C) \;\Leftarrow\; k \ge \frac{\log(C/\epsilon)}{1 - \rho}.$$
The complexity is $O\!\left(\frac{\log(1/\epsilon)}{1 - \rho}\right)$.
For an algorithm that satisfies the sublinear rates, sufficient conditions for $\varphi_k \le \epsilon < C$ are (see the sketch below):
$$C/\sqrt{k} \le \epsilon \;\Leftrightarrow\; k \ge (C/\epsilon)^2, \quad \text{complexity } O(1/\epsilon^2);$$
$$C/k \le \epsilon \;\Leftrightarrow\; k \ge C/\epsilon, \quad \text{complexity } O(1/\epsilon);$$
$$C/k^2 \le \epsilon \;\Leftrightarrow\; k \ge \sqrt{C/\epsilon}, \quad \text{complexity } O(1/\sqrt{\epsilon}).$$
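Plugging in assumed values $C = 10$ and $\epsilon = 10^{-3}$ makes the gap between these rates vivid.

```python
import math

C, eps = 10.0, 1e-3                   # assumed constants

print(math.ceil((C / eps) ** 2))      # C/sqrt(k) rate: 100000000 iterations
print(math.ceil(C / eps))             # C/k rate:       10000 iterations
print(math.ceil(math.sqrt(C / eps)))  # C/k^2 rate:     100 iterations
for rho in (0.9, 0.5, 0.1):           # R-linear: log(C/eps) / (1 - rho)
    print(rho, math.ceil(math.log(C / eps) / (1.0 - rho)))
```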
For Q-quadratically convergent methods, the complexity is doubly logarithmic in $\epsilon$, that is,
$$O(\log \log(1/\epsilon)).$$
To show this, one can consider the sequence
$$e_k = C\varphi_k,$$
which satisfies
$$e_{k+1} \le e_k^2,$$
since multiplying $\varphi_{k+1} \le C\varphi_k^2$ by $C$ gives $C\varphi_{k+1} \le (C\varphi_k)^2$.
Once the algorithm enters a neighborhood of quadratic convergence, just a few additional iterations are required for convergence to a solution of high accuracy, as the Newton sketch below illustrates.
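A sketch of this behavior with Newton's method on an assumed test problem, $\min f(x) = e^x - x$ with minimizer $x_* = 0$; the error roughly squares at every step.

```python
import math

x = 1.0                             # starting point
for k in range(6):
    grad = math.exp(x) - 1.0        # f'(x)
    hess = math.exp(x)              # f''(x)
    x -= grad / hess                # Newton step
    print(k, abs(x))                # error |x_k - 0| roughly squares each step
```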
1 Basic definitions
The effective domain of f Rn 983041rarr R cup +infin is defined as
dom(f) = x | f(x) lt +infin
A function f Rn 983041rarr R cup +infin is called proper if there exists atleast one x isin Rn such that f(x) lt +infin meaning that dom(f) ∕= emptyThe epigraph of f Rn 983041rarr R cup +infin is defined by
epi(f) = (x y) f(x) le yx isin Rn y isin R
A function f Rn 983041rarr R cup +infin is called closed if epi(f) is closed
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 2 34
2 Solutions of minx
f(x)
x983183 is a local minimizer of f if there is a neighborhood N of x983183
such that f(x) ge f(x983183) for all x isin N
x983183 is a strict local minimizer if it is a local minimizer on someneighborhood N and in addition f(x) gt f(x983183) for all x isin N withx ∕= x983183
x983183 is an isolated local minimizer if there is a neighborhood N of x983183
such that f(x) ge f(x983183) for all x isin N and in addition N containsno local minimizers other than x983183
Strict local minimizers are not always isolated for example
f(x) = x4 cos(1x) + 2x4 f(0) = 0
All isolated local minimizers are strict
x983183 is a global minimizer of f if f(x) ge f(x983183) for all x isin Rn
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 3 34
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 4 34
3 Convexity
A set Ω sube Rn is called convex if it has the property that
forall xy isin Ω rArr (1minus α)x+ αy isin Ω forall α isin [0 1]
We usually deal with closed convex sets
For a convex set Ω sube Rn we define the indicator function IΩ asfollows
IΩ(x) =
9830830 if x isin Ω
+infin otherwise
The constrained optimization problem
minxisinΩ
f(x)
can be restated equivalently as follows
minxisinRn
f(x) + IΩ(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 5 34
A convex function f Rn 983041rarr R cup +infin has the following definingproperty dom(f) is convex and forall xy isin dom(f) forall α isin [0 1]
f((1minus α)x+ αy) 983249 (1minus α)f(x) + αf(y)
Theorem 1 (First-order convexity condition)
Differentiable f is convex if and only if dom(f) is convex and
f(y) ge f(x) +nablaf(x)T(y minus x) forall xy isin dom(f)
Theorem 2 (Second-order convexity conditions)
Assume f is twice continuously differentiable Then f is convex if andonly if dom(f) is convex and
nabla2f(x) ≽ 0 forall x isin dom(f)
that is nabla2f(x) is positive semidefinite
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 6 34
Important properties for convex objective functions
983183 Any local minimizer is also a global minimizer (see Theorem 12)
983183 The set of global minimizers is a convex set (easy to prove)
If there exists a value γ gt 0 such that
f((1minus α)x+ αy) 983249 (1minus α)f(x) + αf(y)minus γ
2α(1minus α)983042xminus y98304222
for all xy isin Rn and α isin [0 1] we say that f is strongly convexwith modulus of convexity γ
For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ
f(y) ge f(x) +nablaf(x)T(y minus x) +γ
2983042y minus x9830422
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 7 34
4 Subgradient and subdifferential
Definition A vector v isin Rn is a subgradient of f at a point x if forall d isin Rn it holds
f(x+ d) ge f(x) + vTd
The subdifferential denoted partf(x) is the set of all subgradients off at x (see FOMO for concrete examples)
Lemma 3 (Monotonicity of subdifferentials of convex functions)
forall convex f if a isin partf(x) and b isin partf(y) we have (aminusb)T(xminus y) ge 0
Proof By the definition of subgradient we have
f(y) ge f(x) + aT(y minus x) and f(x) ge f(y) + bT(xminus y)
Adding these two inequalities yields the statement
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 8 34
Theorem 4 (Fermatrsquos lemma generalization in convex functions)
Let f Rn 983041rarr R cup +infin be a proper convex function Then the pointx983183 is a minimizer of f(x) if and only if
0 isin partf(x983183)
Proof ldquolArrrdquo Suppose that 0 isin partf(x983183) We have
f(x983183 + d) ge f(x983183) forall d isin Rn
which implies that x983183 is a minimizer of f ldquorArrrdquo by definition
Theorem 5
Let f Rn 983041rarr R cup +infin be proper convex and let x isin int(dom(f))
If f is differentiable at x then partf(x) = nablaf(x)If partf(x) is a singleton (a set containing a single vector) then f isdifferentiable at x with gradient equal to the unique subgradient
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 9 34
5 Taylorrsquos theorem
Taylorrsquos theorem shows how smooth functions can be locallyapproximated by low-order (eg linear or quadratic) functions
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 10 34
Theorem 6 (Taylorrsquos theorem)
Given a continuously differentiable function f Rn 983041rarr R and givenxp isin Rn we have that
f(x+ p) = f(x) +
983133 1
0nablaf(x+ tp)Tpdt
f(x+ p) = f(x) +nablaf(x+ ξp)Tp for some ξ isin (0 1)
If f is twice continuously differentiable we have
nablaf(x+ p) = nablaf(x) +
983133 1
0nabla2f(x+ tp)pdt
f(x+ p) = f(x) +nablaf(x)Tp+1
2pTnabla2f(x+ ξp)p
for some ξ isin (0 1)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 11 34
If f Rn 983041rarr R is differentiable and convex then
partf(x) = nablaf(x)
andf(y) ge f(x) +nablaf(x)T(y minus x)
for all xy isin Rn
Lipschitz continuously differentiable with constant L
983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042 for all xy isin Rn
If f is Lipschitz continuously differentiable with constant L thenby Taylorrsquos theorem we have
f(y)minus f(x)minusnablaf(x)T(y minus x) le L
2983042y minus x9830422
for all xy isin Rn
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 12 34
For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ
f(y) ge f(x) +nablaf(x)T(y minus x) +γ
2983042y minus x9830422
Lemma 7
Suppose f is strongly convex with modulus of convexity γ and nablaf isuniformly Lipschitz continuous with constant L We have forall xy that
γ
2983042y minus x9830422 le f(y)minus f(x)minusnablaf(x)T(y minus x) le L
2983042y minus x9830422
Condition number κ = Lγ (eg strictly convex quadratic f)
When f is twice continuously differentiable the inequalities inLemma 7 is equivalent to
γI ≼ nabla2f(x) ≼ LI for all x
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 13 34
Theorem 8
Let f be differentiable and strongly convex with modulus of convexityγ gt 0 Then the minimizer x983183 of f exists and is unique
Proof (i) Compactness of level set Show that for any point x0 thelevel set
x | f(x) le f(x0)
is closed and bounded and hence compact(ii) Existence Since f is continuous it attains its minimum on the
compact level set which is also the solution of minx f(x)(iii) Uniqueness Suppose for contradiction that the minimizer is
not unique so that we have two points x1983183 and x2
983183 that minimize f Byusing the strongly convex property we can prove
f
983061x1983183 + x2
983183
2
983062lt f(x1
983183) = f(x2983183)
This is a contradiction
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 14 34
6 Optimality conditions for smooth functions
Theorem 9 (First-order necessary condition)
If x983183 is a local minimizer of f and f is continuously differentiable inan open neighborhood of x983183 then nablaf(x983183) = 0
Proof Suppose for contradiction that nablaf(x983183) ∕= 0 Define the vectorp = minusnablaf(x983183) and note that pTnablaf(x983183) = minus983042nablaf(x983183)98304222 lt 0 Becausenablaf is continuous near x983183 there is a scalar T gt 0 such that
pTnablaf(x983183 + tp) lt 0 for all t isin [0 T ]
For any s isin (0 T ] we have by Taylorrsquos theorem that
f(x983183 + sp) = f(x983183) + spTnablaf(x983183 + ξsp) for some ξ isin (0 1)
Therefore f(x983183 + sp) lt f(x983183) for all s isin (0 T ] We have found adirection leading away from x983183 along which f decreases so x983183 is not alocal minimizer and we have a contradiction
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 15 34
Theorem 10 (Second-order necessary conditions)
If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite
Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that
pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]
By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that
f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1
2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)
As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34
Theorem 11 (Second-order sufficient conditions)
Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f
Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and
f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1
2pTnabla2f(x983183 + ξp)p
= f(x983183) +1
2pTnabla2f(x983183 + ξp)p
for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34
A point x is called a stationary point if
nablaf(x) = 0
A stationary point x is called a saddle point if there exist u and vsuch that
f(x+ αu) lt f(x) and f(x+ αv) gt f(x)
for all sufficiently small α gt 0
Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points
If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point
If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34
Theorem 12
(i) forall convex f any local minimizer x983183 is a global minimizer of f
(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f
Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is
x = λz+ (1minus λ)x983183 for some λ isin (0 1]
By the convexity property for f we have
f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)
Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34
(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have
nablaf(x983183)T(zminus x983183) =
d
dλf(x983183 + λ(zminus x983183))|λ=0
= limλrarr0+
f(x983183 + λ(zminus x983183))minus f(x983183)
λ
le limλrarr0+
λf(z) + (1minus λ)f(x983183)minus f(x983183)
λ
= f(z)minus f(x983183) lt 0
Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point
Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms
Numerical algorithms try to seek a point where nablaf vanishes
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34
7 Karush-Kuhn-Tucker conditions
Theorem 13 (KKT conditions)
Consider the minimization problem
minxisinRn
f(x) st gi(x) le 0 i = 1 m
where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions
Let x983183 be an optimal solution and assume Slaterrsquos condition
exist x isin Rn st gi(x) lt 0 i = 1 m
hold Then there exist λ1 middot middot middot λm ge 0 satisfying
0 isin partf(x983183) +
m983131
i=1
λipartgi(x983183) λigi(x983183) = 0 i = 1 m
If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34
Active gi(x) le 0
Inactive gi(x) le 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34
8 Moreau envelope and proximal operator
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is
Mλh(x) = infu
983069h(u) +
1
2λ983042uminus x9830422
983070
=1
λinfu
983069λh(u) +
1
2983042uminus x9830422
983070
For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is
proxf (x) = argminu
983069f(u) +
1
2983042uminus x9830422
983070
From the optimality condition (see Theorem 4) we have
0 isin partf(proxf (x)) + (proxf (x)minus x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34
For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if
x983183 = proxf (x983183)
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope
Mλh(x) = h(proxλh(x)) +1
2λ983042proxλh(x)minus x9830422
can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn
minusinfin lt Mλh(x) lt +infin
even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient
nablaMλh(x) =1
λ(xminus proxλh(x))
Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34
Example h(x) = |x| and Hλ(x) = Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34
Lemma 14 (Nonexpansivity of proximal operator)
For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn
983042proxf (x)minus proxf (y)983042 le 983042xminus y983042
Proof From the optimality conditions at two points x and y we have
xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))
By applying monotonicity (see Lemma 3) we have
((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0
Rearranging this and applying the Cauchy-Schwartz inequality yields
983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))
le 983042xminus y983042983042proxf (x)minus proxf (y)983042
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34
Examples of several proximal operators
(1) f(x) = 0proxf (x) = x
(2) f(x) = λ983042x9830421 with λ gt 0
983045proxλ983042middot9830421(x)
983046i= argmin
uisinR
983069λ|u|+ 1
2(uminus xi)
2
983070
=
983099983105983103
983105983101
xi minus λ if xi gt λ
0 if xi isin [minusλλ]
xi + λ if xi lt minusλ
which is known as soft-thresholding
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34
(3) f(x) = λ983042x9830420 the number of nonzero components non-convex
983045proxλ983042middot9830420(x)
983046i=
983099983105983103
983105983101
xi if |xi| gtradic2λ
0 xi if |xi| =radic2λ
0 if |xi| ltradic2λ
which is known as hard thresholding
(4) f(x) = IΩ(x)
proxIΩ(x) = argminu
983069IΩ(u) +
1
2983042uminus x9830422
983070= argmin
uisinΩ983042uminus x983042
which is simply the projection of x onto the set Ω
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34
9 Convergence rates
An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error
For smooth f 983042nablaf(xk)983042
For nonsmooth convex f
dist(0 partf(xk))
(the sequence of distances from 0 to the subdifferential partf(xk))
Other error measures include
983042xk minus x983183983042 and f(xk)minus f983183
where x983183 is a solution and f983183 is the optimal value of f
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34
We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim
krarrinfinφk = 0
We say that Q-linear convergence holds if there is some ρ isin (0 1)such that
φk+1
φkle ρ
for all k sufficiently large
We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that
φk le Cρk k = 1 2
Q-linear rArr R-linear But R-linear ⇏ Q-linear Example
φk =
9830832minusk k even
0 k odd
satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34
Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence
φk le Cradick φk le Ck φk le Ck2 k = 1 2
where in each case C is some positive constant
Superlinear convergence occurs when the constant ρ isin (0 1) in
φk+1φk le ρ
can be chosen arbitrarily close to 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34
The sequence φk converges Q-superlinearly to 0 if
limkrarrinfin
φk+1
φk= 0
Q-Quadratic ifφk+1
φ2k
le C k = 1 2
for some sufficiently large C
R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is
0 le φk le νk k = 1 2
R-quadratic convergence is defined similarly
Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34
91 Complexity
For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is
Cρk le 983171
By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that
Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)
The complexity is O( log(1983171)1minusρ )
For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C
Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)
Ck le 983171 hArr k ge C983171 complexity O(1983171)
Ck2 le 983171 hArr k ge983155
C983171 complexity O(1radic983171)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34
For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is
O(log log(1983171))
To show this one can consider the sequence
ek = Cφk
which satisfiesek+1 le e2k
Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34
2 Solutions of minx
f(x)
x983183 is a local minimizer of f if there is a neighborhood N of x983183
such that f(x) ge f(x983183) for all x isin N
x983183 is a strict local minimizer if it is a local minimizer on someneighborhood N and in addition f(x) gt f(x983183) for all x isin N withx ∕= x983183
x983183 is an isolated local minimizer if there is a neighborhood N of x983183
such that f(x) ge f(x983183) for all x isin N and in addition N containsno local minimizers other than x983183
Strict local minimizers are not always isolated for example
f(x) = x4 cos(1x) + 2x4 f(0) = 0
All isolated local minimizers are strict
x983183 is a global minimizer of f if f(x) ge f(x983183) for all x isin Rn
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 3 34
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 4 34
3 Convexity
A set Ω sube Rn is called convex if it has the property that
forall xy isin Ω rArr (1minus α)x+ αy isin Ω forall α isin [0 1]
We usually deal with closed convex sets
For a convex set Ω sube Rn we define the indicator function IΩ asfollows
IΩ(x) =
9830830 if x isin Ω
+infin otherwise
The constrained optimization problem
minxisinΩ
f(x)
can be restated equivalently as follows
minxisinRn
f(x) + IΩ(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 5 34
A convex function f Rn 983041rarr R cup +infin has the following definingproperty dom(f) is convex and forall xy isin dom(f) forall α isin [0 1]
f((1minus α)x+ αy) 983249 (1minus α)f(x) + αf(y)
Theorem 1 (First-order convexity condition)
Differentiable f is convex if and only if dom(f) is convex and
f(y) ge f(x) +nablaf(x)T(y minus x) forall xy isin dom(f)
Theorem 2 (Second-order convexity conditions)
Assume f is twice continuously differentiable Then f is convex if andonly if dom(f) is convex and
nabla2f(x) ≽ 0 forall x isin dom(f)
that is nabla2f(x) is positive semidefinite
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 6 34
Important properties for convex objective functions
983183 Any local minimizer is also a global minimizer (see Theorem 12)
983183 The set of global minimizers is a convex set (easy to prove)
If there exists a value γ gt 0 such that
f((1minus α)x+ αy) 983249 (1minus α)f(x) + αf(y)minus γ
2α(1minus α)983042xminus y98304222
for all xy isin Rn and α isin [0 1] we say that f is strongly convexwith modulus of convexity γ
For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ
f(y) ge f(x) +nablaf(x)T(y minus x) +γ
2983042y minus x9830422
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 7 34
4 Subgradient and subdifferential
Definition A vector v isin Rn is a subgradient of f at a point x if forall d isin Rn it holds
f(x+ d) ge f(x) + vTd
The subdifferential denoted partf(x) is the set of all subgradients off at x (see FOMO for concrete examples)
Lemma 3 (Monotonicity of subdifferentials of convex functions)
forall convex f if a isin partf(x) and b isin partf(y) we have (aminusb)T(xminus y) ge 0
Proof By the definition of subgradient we have
f(y) ge f(x) + aT(y minus x) and f(x) ge f(y) + bT(xminus y)
Adding these two inequalities yields the statement
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 8 34
Theorem 4 (Fermatrsquos lemma generalization in convex functions)
Let f Rn 983041rarr R cup +infin be a proper convex function Then the pointx983183 is a minimizer of f(x) if and only if
0 isin partf(x983183)
Proof ldquolArrrdquo Suppose that 0 isin partf(x983183) We have
f(x983183 + d) ge f(x983183) forall d isin Rn
which implies that x983183 is a minimizer of f ldquorArrrdquo by definition
Theorem 5
Let f Rn 983041rarr R cup +infin be proper convex and let x isin int(dom(f))
If f is differentiable at x then partf(x) = nablaf(x)If partf(x) is a singleton (a set containing a single vector) then f isdifferentiable at x with gradient equal to the unique subgradient
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 9 34
5 Taylorrsquos theorem
Taylorrsquos theorem shows how smooth functions can be locallyapproximated by low-order (eg linear or quadratic) functions
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 10 34
Theorem 6 (Taylorrsquos theorem)
Given a continuously differentiable function f Rn 983041rarr R and givenxp isin Rn we have that
f(x+ p) = f(x) +
983133 1
0nablaf(x+ tp)Tpdt
f(x+ p) = f(x) +nablaf(x+ ξp)Tp for some ξ isin (0 1)
If f is twice continuously differentiable we have
nablaf(x+ p) = nablaf(x) +
983133 1
0nabla2f(x+ tp)pdt
f(x+ p) = f(x) +nablaf(x)Tp+1
2pTnabla2f(x+ ξp)p
for some ξ isin (0 1)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 11 34
If f Rn 983041rarr R is differentiable and convex then
partf(x) = nablaf(x)
andf(y) ge f(x) +nablaf(x)T(y minus x)
for all xy isin Rn
Lipschitz continuously differentiable with constant L
983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042 for all xy isin Rn
If f is Lipschitz continuously differentiable with constant L thenby Taylorrsquos theorem we have
f(y)minus f(x)minusnablaf(x)T(y minus x) le L
2983042y minus x9830422
for all xy isin Rn
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 12 34
For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ
f(y) ge f(x) +nablaf(x)T(y minus x) +γ
2983042y minus x9830422
Lemma 7
Suppose f is strongly convex with modulus of convexity γ and nablaf isuniformly Lipschitz continuous with constant L We have forall xy that
γ
2983042y minus x9830422 le f(y)minus f(x)minusnablaf(x)T(y minus x) le L
2983042y minus x9830422
Condition number κ = Lγ (eg strictly convex quadratic f)
When f is twice continuously differentiable the inequalities inLemma 7 is equivalent to
γI ≼ nabla2f(x) ≼ LI for all x
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 13 34
Theorem 8
Let f be differentiable and strongly convex with modulus of convexityγ gt 0 Then the minimizer x983183 of f exists and is unique
Proof (i) Compactness of level set Show that for any point x0 thelevel set
x | f(x) le f(x0)
is closed and bounded and hence compact(ii) Existence Since f is continuous it attains its minimum on the
compact level set which is also the solution of minx f(x)(iii) Uniqueness Suppose for contradiction that the minimizer is
not unique so that we have two points x1983183 and x2
983183 that minimize f Byusing the strongly convex property we can prove
f
983061x1983183 + x2
983183
2
983062lt f(x1
983183) = f(x2983183)
This is a contradiction
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 14 34
6 Optimality conditions for smooth functions
Theorem 9 (First-order necessary condition)
If x983183 is a local minimizer of f and f is continuously differentiable inan open neighborhood of x983183 then nablaf(x983183) = 0
Proof Suppose for contradiction that nablaf(x983183) ∕= 0 Define the vectorp = minusnablaf(x983183) and note that pTnablaf(x983183) = minus983042nablaf(x983183)98304222 lt 0 Becausenablaf is continuous near x983183 there is a scalar T gt 0 such that
pTnablaf(x983183 + tp) lt 0 for all t isin [0 T ]
For any s isin (0 T ] we have by Taylorrsquos theorem that
f(x983183 + sp) = f(x983183) + spTnablaf(x983183 + ξsp) for some ξ isin (0 1)
Therefore f(x983183 + sp) lt f(x983183) for all s isin (0 T ] We have found adirection leading away from x983183 along which f decreases so x983183 is not alocal minimizer and we have a contradiction
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 15 34
Theorem 10 (Second-order necessary conditions)
If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite
Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that
pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]
By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that
f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1
2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)
As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34
Theorem 11 (Second-order sufficient conditions)
Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f
Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and
f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1
2pTnabla2f(x983183 + ξp)p
= f(x983183) +1
2pTnabla2f(x983183 + ξp)p
for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34
A point x is called a stationary point if
nablaf(x) = 0
A stationary point x is called a saddle point if there exist u and vsuch that
f(x+ αu) lt f(x) and f(x+ αv) gt f(x)
for all sufficiently small α gt 0
Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points
If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point
If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34
Theorem 12
(i) forall convex f any local minimizer x983183 is a global minimizer of f
(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f
Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is
x = λz+ (1minus λ)x983183 for some λ isin (0 1]
By the convexity property for f we have
f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)
Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34
(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have
nablaf(x983183)T(zminus x983183) =
d
dλf(x983183 + λ(zminus x983183))|λ=0
= limλrarr0+
f(x983183 + λ(zminus x983183))minus f(x983183)
λ
le limλrarr0+
λf(z) + (1minus λ)f(x983183)minus f(x983183)
λ
= f(z)minus f(x983183) lt 0
Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point
Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms
Numerical algorithms try to seek a point where nablaf vanishes
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34
7 Karush-Kuhn-Tucker conditions
Theorem 13 (KKT conditions)
Consider the minimization problem
minxisinRn
f(x) st gi(x) le 0 i = 1 m
where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions
Let x983183 be an optimal solution and assume Slaterrsquos condition
exist x isin Rn st gi(x) lt 0 i = 1 m
hold Then there exist λ1 middot middot middot λm ge 0 satisfying
0 isin partf(x983183) +
m983131
i=1
λipartgi(x983183) λigi(x983183) = 0 i = 1 m
If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34
Active gi(x) le 0
Inactive gi(x) le 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34
8 Moreau envelope and proximal operator
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is
Mλh(x) = infu
983069h(u) +
1
2λ983042uminus x9830422
983070
=1
λinfu
983069λh(u) +
1
2983042uminus x9830422
983070
For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is
proxf (x) = argminu
983069f(u) +
1
2983042uminus x9830422
983070
From the optimality condition (see Theorem 4) we have
0 isin partf(proxf (x)) + (proxf (x)minus x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34
For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if
x983183 = proxf (x983183)
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope
Mλh(x) = h(proxλh(x)) +1
2λ983042proxλh(x)minus x9830422
can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn
minusinfin lt Mλh(x) lt +infin
even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient
nablaMλh(x) =1
λ(xminus proxλh(x))
Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34
Example h(x) = |x| and Hλ(x) = Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34
Lemma 14 (Nonexpansivity of proximal operator)
For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn
983042proxf (x)minus proxf (y)983042 le 983042xminus y983042
Proof From the optimality conditions at two points x and y we have
xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))
By applying monotonicity (see Lemma 3) we have
((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0
Rearranging this and applying the Cauchy-Schwartz inequality yields
983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))
le 983042xminus y983042983042proxf (x)minus proxf (y)983042
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34
Examples of several proximal operators
(1) f(x) = 0proxf (x) = x
(2) f(x) = λ983042x9830421 with λ gt 0
983045proxλ983042middot9830421(x)
983046i= argmin
uisinR
983069λ|u|+ 1
2(uminus xi)
2
983070
=
983099983105983103
983105983101
xi minus λ if xi gt λ
0 if xi isin [minusλλ]
xi + λ if xi lt minusλ
which is known as soft-thresholding
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34
(3) f(x) = λ983042x9830420 the number of nonzero components non-convex
983045proxλ983042middot9830420(x)
983046i=
983099983105983103
983105983101
xi if |xi| gtradic2λ
0 xi if |xi| =radic2λ
0 if |xi| ltradic2λ
which is known as hard thresholding
(4) f(x) = IΩ(x)
proxIΩ(x) = argminu
983069IΩ(u) +
1
2983042uminus x9830422
983070= argmin
uisinΩ983042uminus x983042
which is simply the projection of x onto the set Ω
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34
9 Convergence rates
An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error
For smooth f 983042nablaf(xk)983042
For nonsmooth convex f
dist(0 partf(xk))
(the sequence of distances from 0 to the subdifferential partf(xk))
Other error measures include
983042xk minus x983183983042 and f(xk)minus f983183
where x983183 is a solution and f983183 is the optimal value of f
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34
We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim
krarrinfinφk = 0
We say that Q-linear convergence holds if there is some ρ isin (0 1)such that
φk+1
φkle ρ
for all k sufficiently large
We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that
φk le Cρk k = 1 2
Q-linear rArr R-linear But R-linear ⇏ Q-linear Example
φk =
9830832minusk k even
0 k odd
satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34
Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence
φk le Cradick φk le Ck φk le Ck2 k = 1 2
where in each case C is some positive constant
Superlinear convergence occurs when the constant ρ isin (0 1) in
φk+1φk le ρ
can be chosen arbitrarily close to 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34
The sequence φk converges Q-superlinearly to 0 if
limkrarrinfin
φk+1
φk= 0
Q-Quadratic ifφk+1
φ2k
le C k = 1 2
for some sufficiently large C
R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is
0 le φk le νk k = 1 2
R-quadratic convergence is defined similarly
Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34
91 Complexity
For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is
Cρk le 983171
By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that
Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)
The complexity is O( log(1983171)1minusρ )
For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C
Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)
Ck le 983171 hArr k ge C983171 complexity O(1983171)
Ck2 le 983171 hArr k ge983155
C983171 complexity O(1radic983171)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34
For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is
O(log log(1983171))
To show this one can consider the sequence
ek = Cφk
which satisfiesek+1 le e2k
Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34
Important properties for convex objective functions
983183 Any local minimizer is also a global minimizer (see Theorem 12)
983183 The set of global minimizers is a convex set (easy to prove)
If there exists a value γ gt 0 such that
f((1minus α)x+ αy) 983249 (1minus α)f(x) + αf(y)minus γ
2α(1minus α)983042xminus y98304222
for all xy isin Rn and α isin [0 1] we say that f is strongly convexwith modulus of convexity γ
For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ
f(y) ge f(x) +nablaf(x)T(y minus x) +γ
2983042y minus x9830422
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 7 34
4 Subgradient and subdifferential
Definition A vector v isin Rn is a subgradient of f at a point x if forall d isin Rn it holds
f(x+ d) ge f(x) + vTd
The subdifferential denoted partf(x) is the set of all subgradients off at x (see FOMO for concrete examples)
Lemma 3 (Monotonicity of subdifferentials of convex functions)
forall convex f if a isin partf(x) and b isin partf(y) we have (aminusb)T(xminus y) ge 0
Proof By the definition of subgradient we have
f(y) ge f(x) + aT(y minus x) and f(x) ge f(y) + bT(xminus y)
Adding these two inequalities yields the statement
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 8 34
Theorem 4 (Fermatrsquos lemma generalization in convex functions)
Let f Rn 983041rarr R cup +infin be a proper convex function Then the pointx983183 is a minimizer of f(x) if and only if
0 isin partf(x983183)
Proof ldquolArrrdquo Suppose that 0 isin partf(x983183) We have
f(x983183 + d) ge f(x983183) forall d isin Rn
which implies that x983183 is a minimizer of f ldquorArrrdquo by definition
Theorem 5
Let f Rn 983041rarr R cup +infin be proper convex and let x isin int(dom(f))
If f is differentiable at x then partf(x) = nablaf(x)If partf(x) is a singleton (a set containing a single vector) then f isdifferentiable at x with gradient equal to the unique subgradient
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 9 34
5 Taylorrsquos theorem
Taylorrsquos theorem shows how smooth functions can be locallyapproximated by low-order (eg linear or quadratic) functions
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 10 34
Theorem 6 (Taylorrsquos theorem)
Given a continuously differentiable function f Rn 983041rarr R and givenxp isin Rn we have that
f(x+ p) = f(x) +
983133 1
0nablaf(x+ tp)Tpdt
f(x+ p) = f(x) +nablaf(x+ ξp)Tp for some ξ isin (0 1)
If f is twice continuously differentiable we have
nablaf(x+ p) = nablaf(x) +
983133 1
0nabla2f(x+ tp)pdt
f(x+ p) = f(x) +nablaf(x)Tp+1
2pTnabla2f(x+ ξp)p
for some ξ isin (0 1)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 11 34
If f Rn 983041rarr R is differentiable and convex then
partf(x) = nablaf(x)
andf(y) ge f(x) +nablaf(x)T(y minus x)
for all xy isin Rn
Lipschitz continuously differentiable with constant L
983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042 for all xy isin Rn
If f is Lipschitz continuously differentiable with constant L thenby Taylorrsquos theorem we have
f(y)minus f(x)minusnablaf(x)T(y minus x) le L
2983042y minus x9830422
for all xy isin Rn
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 12 34
For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ
f(y) ge f(x) +nablaf(x)T(y minus x) +γ
2983042y minus x9830422
Lemma 7
Suppose f is strongly convex with modulus of convexity γ and nablaf isuniformly Lipschitz continuous with constant L We have forall xy that
γ
2983042y minus x9830422 le f(y)minus f(x)minusnablaf(x)T(y minus x) le L
2983042y minus x9830422
Condition number κ = Lγ (eg strictly convex quadratic f)
When f is twice continuously differentiable the inequalities inLemma 7 is equivalent to
γI ≼ nabla2f(x) ≼ LI for all x
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 13 34
Theorem 8
Let f be differentiable and strongly convex with modulus of convexityγ gt 0 Then the minimizer x983183 of f exists and is unique
Proof (i) Compactness of level set Show that for any point x0 thelevel set
x | f(x) le f(x0)
is closed and bounded and hence compact(ii) Existence Since f is continuous it attains its minimum on the
compact level set which is also the solution of minx f(x)(iii) Uniqueness Suppose for contradiction that the minimizer is
not unique so that we have two points x1983183 and x2
983183 that minimize f Byusing the strongly convex property we can prove
f
983061x1983183 + x2
983183
2
983062lt f(x1
983183) = f(x2983183)
This is a contradiction
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 14 34
6 Optimality conditions for smooth functions
Theorem 9 (First-order necessary condition)
If x983183 is a local minimizer of f and f is continuously differentiable inan open neighborhood of x983183 then nablaf(x983183) = 0
Proof Suppose for contradiction that nablaf(x983183) ∕= 0 Define the vectorp = minusnablaf(x983183) and note that pTnablaf(x983183) = minus983042nablaf(x983183)98304222 lt 0 Becausenablaf is continuous near x983183 there is a scalar T gt 0 such that
pTnablaf(x983183 + tp) lt 0 for all t isin [0 T ]
For any s isin (0 T ] we have by Taylorrsquos theorem that
f(x983183 + sp) = f(x983183) + spTnablaf(x983183 + ξsp) for some ξ isin (0 1)
Therefore f(x983183 + sp) lt f(x983183) for all s isin (0 T ] We have found adirection leading away from x983183 along which f decreases so x983183 is not alocal minimizer and we have a contradiction
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 15 34
Theorem 10 (Second-order necessary conditions)
If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite
Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that
pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]
By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that
f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1
2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)
As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34
Theorem 11 (Second-order sufficient conditions)
Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f
Proof. Because the Hessian $\nabla^2 f$ is continuous and positive definite at $x_*$, we can choose a radius $r > 0$ so that $\nabla^2 f(x)$ remains positive definite for all $x$ in the open ball $B = \{z \mid \|z - x_*\| < r\}$. Taking any nonzero vector $p$ with $\|p\| < r$, we have $x_* + p \in B$ and
\[
f(x_* + p) = f(x_*) + p^T \nabla f(x_*) + \frac{1}{2} p^T \nabla^2 f(x_* + \xi p)\, p = f(x_*) + \frac{1}{2} p^T \nabla^2 f(x_* + \xi p)\, p
\]
for some $\xi \in (0, 1)$. Since $x_* + \xi p \in B$, we have $p^T \nabla^2 f(x_* + \xi p)\, p > 0$, and therefore $f(x_* + p) > f(x_*)$, giving the result.
A point $x$ is called a stationary point if $\nabla f(x) = 0$.

A stationary point $x$ is called a saddle point if there exist $u$ and $v$ such that
\[
f(x + \alpha u) < f(x) \quad \text{and} \quad f(x + \alpha v) > f(x)
\]
for all sufficiently small $\alpha > 0$.

Stationary points are not necessarily local minimizers: stationary points can also be local maximizers or saddle points.

If $\nabla f(x) = 0$ and $\nabla^2 f(x)$ has both strictly positive and strictly negative eigenvalues, then $x$ is a saddle point.

If $\nabla^2 f(x)$ is positive semidefinite or negative semidefinite, then $\nabla^2 f(x)$ alone is insufficient to classify $x$ (see the sketch below).
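As a quick illustration of this classification (a sketch on an example chosen here, not from the slides), the following checks the Hessian eigenvalues of $f(x, y) = x^2 - y^2$ at its stationary point $(0, 0)$:

```python
import numpy as np

# Hessian of f(x, y) = x^2 - y^2 at the stationary point (0, 0).
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)
if np.all(eigvals > 0):
    print("positive definite: strict local minimizer")
elif np.all(eigvals < 0):
    print("negative definite: strict local maximizer")
elif eigvals.min() < 0 < eigvals.max():
    print("indefinite: saddle point")        # this branch fires here
else:
    print("semidefinite: second-order information is inconclusive")
```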
Theorem 12
(i) For any convex $f$, any local minimizer $x_*$ is a global minimizer of $f$.

(ii) If $f$ is convex and differentiable, then any stationary point $x_*$ is a global minimizer of $f$.
Proof. (i) Suppose that $x_*$ is a local but not a global minimizer. Then we can find a point $z \in \mathbb{R}^n$ with $f(z) < f(x_*)$. Consider the line segment that joins $x_*$ to $z$, that is,
\[
x = \lambda z + (1 - \lambda) x_* \quad \text{for some } \lambda \in (0, 1].
\]
By the convexity property for $f$, we have
\[
f(x) \le \lambda f(z) + (1 - \lambda) f(x_*) < f(x_*).
\]
Any neighborhood $N$ of $x_*$ contains a piece of the line segment, so there will always be points $x \in N$ at which the last inequality is satisfied. Hence $x_*$ is not a local minimizer.
(ii) Suppose that $x_*$ is not a global minimizer, and choose $z$ as above. Then from convexity we have
\[
\nabla f(x_*)^T (z - x_*)
= \frac{d}{d\lambda} f\big(x_* + \lambda (z - x_*)\big) \Big|_{\lambda = 0}
= \lim_{\lambda \to 0^+} \frac{f(x_* + \lambda (z - x_*)) - f(x_*)}{\lambda}
\le \lim_{\lambda \to 0^+} \frac{\lambda f(z) + (1 - \lambda) f(x_*) - f(x_*)}{\lambda}
= f(z) - f(x_*) < 0.
\]
Therefore $\nabla f(x_*) \ne 0$, and so $x_*$ is not a stationary point.
Remark: Theorems 9-12 provide the foundations for unconstrained optimization algorithms. Numerical algorithms seek a point where $\nabla f$ vanishes.
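As a minimal sketch of this idea (with an illustrative quadratic and step size chosen here, not from the lecture), gradient descent drives $\nabla f(x_k)$ toward zero:

```python
import numpy as np

# Gradient descent on the strongly convex quadratic
# f(x) = 1/2 x^T A x - b^T x, whose gradient is Ax - b.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])              # symmetric positive definite
b = np.array([1.0, -2.0])

L = np.linalg.eigvalsh(A).max()         # Lipschitz constant of the gradient
x = np.zeros(2)
for k in range(500):
    g = A @ x - b                       # gradient at the current iterate
    if np.linalg.norm(g) < 1e-10:       # stop when the gradient vanishes
        break
    x = x - g / L                       # steepest-descent step of length 1/L

print(k, x, np.linalg.norm(A @ x - b))  # x approaches the solution of Ax = b
```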
7 Karush-Kuhn-Tucker conditions
Theorem 13 (KKT conditions)
Consider the minimization problem
\[
\min_{x \in \mathbb{R}^n} f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \; i = 1, \dots, m,
\]
where $f: \mathbb{R}^n \mapsto \mathbb{R}$ and all $g_i: \mathbb{R}^n \mapsto \mathbb{R}$ are convex functions. Let $x_*$ be an optimal solution and assume Slater's condition
\[
\exists\, x \in \mathbb{R}^n \ \text{s.t.}\ g_i(x) < 0, \; i = 1, \dots, m,
\]
holds. Then there exist $\lambda_1, \dots, \lambda_m \ge 0$ satisfying
\[
0 \in \partial f(x_*) + \sum_{i=1}^m \lambda_i \, \partial g_i(x_*), \qquad \lambda_i g_i(x_*) = 0, \; i = 1, \dots, m.
\]
If $x_*$ satisfies the above conditions, called the KKT conditions, then it is an optimal solution of the optimization problem.
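As a small worked illustration (an example chosen here, not from the slides), consider $\min_{x \in \mathbb{R}} (x - 2)^2$ s.t. $x - 1 \le 0$. The KKT conditions read
\[
0 = 2(x_* - 2) + \lambda, \qquad \lambda (x_* - 1) = 0, \qquad \lambda \ge 0.
\]
If the constraint were inactive, stationarity would give $x_* = 2$ with $\lambda = 0$, which is infeasible. So the constraint is active: $x_* = 1$ and $\lambda = 2 \ge 0$. Hence $x_* = 1$ satisfies the KKT conditions and, by the theorem, is the optimal solution.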
A constraint is active at $x$ if $g_i(x) = 0$ and inactive if $g_i(x) < 0$. By the complementarity condition $\lambda_i g_i(x_*) = 0$, constraints that are inactive at $x_*$ have multiplier $\lambda_i = 0$.
8 Moreau envelope and proximal operator
For a closed proper convex function $h: \mathbb{R}^n \mapsto \mathbb{R} \cup \{+\infty\}$ and a positive scalar $\lambda$, the Moreau envelope of $(\lambda, h)$ is
\[
M_{\lambda h}(x) = \inf_u \left\{ h(u) + \frac{1}{2\lambda} \|u - x\|^2 \right\}
= \frac{1}{\lambda} \inf_u \left\{ \lambda h(u) + \frac{1}{2} \|u - x\|^2 \right\}.
\]
For a closed proper convex function $f: \mathbb{R}^n \mapsto \mathbb{R} \cup \{+\infty\}$, the proximal operator of $f$ is
\[
\operatorname{prox}_f(x) = \arg\min_u \left\{ f(u) + \frac{1}{2} \|u - x\|^2 \right\}.
\]
From the optimality condition (see Theorem 4), we have
\[
0 \in \partial f(\operatorname{prox}_f(x)) + (\operatorname{prox}_f(x) - x).
\]
For a closed proper convex function $f: \mathbb{R}^n \mapsto \mathbb{R} \cup \{+\infty\}$, the point $x_*$ is a minimizer of $f$ if and only if
\[
x_* = \operatorname{prox}_f(x_*).
\]
For a closed proper convex function $h: \mathbb{R}^n \mapsto \mathbb{R} \cup \{+\infty\}$ and a positive scalar $\lambda$, the Moreau envelope
\[
M_{\lambda h}(x) = h(\operatorname{prox}_{\lambda h}(x)) + \frac{1}{2\lambda} \|\operatorname{prox}_{\lambda h}(x) - x\|^2
\]
can be viewed as a kind of smoothing or regularization of the function $h$. We have, for all $x \in \mathbb{R}^n$,
\[
-\infty < M_{\lambda h}(x) < +\infty,
\]
even when $h(x) = +\infty$ for some $x \in \mathbb{R}^n$. Moreover, $M_{\lambda h}(x)$ is convex and differentiable everywhere, with gradient
\[
\nabla M_{\lambda h}(x) = \frac{1}{\lambda} \big(x - \operatorname{prox}_{\lambda h}(x)\big).
\]
Therefore $x_*$ is a minimizer of $h$ $\Leftrightarrow$ $x_*$ is a minimizer of $M_{\lambda h}(x)$.
Example: $h(x) = |x|$ and $H_\lambda(x) = M_{\lambda h}(x)$. Explicitly, $H_\lambda(x) = \frac{x^2}{2\lambda}$ if $|x| \le \lambda$ and $H_\lambda(x) = |x| - \frac{\lambda}{2}$ otherwise; that is, the Moreau envelope of $|x|$ is the Huber function, a smooth approximation of $|x|$.
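A minimal numerical sketch of this example (assuming the soft-thresholding formula for $\operatorname{prox}_{\lambda |\cdot|}$ given in example (2) below, with an illustrative $\lambda$):

```python
import numpy as np

lam = 0.5                                # illustrative choice of lambda

def prox_abs(x, lam):
    # prox of lam * |.|: soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def moreau_abs(x, lam):
    # M_{lam h}(x) = h(prox(x)) + ||prox(x) - x||^2 / (2 lam) for h = |.|
    p = prox_abs(x, lam)
    return np.abs(p) + (p - x) ** 2 / (2 * lam)

x = np.linspace(-2.0, 2.0, 9)
huber = np.where(np.abs(x) <= lam, x ** 2 / (2 * lam), np.abs(x) - lam / 2)
print(np.allclose(moreau_abs(x, lam), huber))   # True: the envelope is Huber
```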
Lemma 14 (Nonexpansivity of proximal operator)
For a closed proper convex function $f: \mathbb{R}^n \mapsto \mathbb{R} \cup \{+\infty\}$, we have for all $x, y \in \mathbb{R}^n$
\[
\|\operatorname{prox}_f(x) - \operatorname{prox}_f(y)\| \le \|x - y\|.
\]
Proof. From the optimality conditions at the two points $x$ and $y$, we have
\[
x - \operatorname{prox}_f(x) \in \partial f(\operatorname{prox}_f(x)) \quad \text{and} \quad y - \operatorname{prox}_f(y) \in \partial f(\operatorname{prox}_f(y)).
\]
By applying monotonicity (see Lemma 3), we have
\[
\big((x - \operatorname{prox}_f(x)) - (y - \operatorname{prox}_f(y))\big)^T \big(\operatorname{prox}_f(x) - \operatorname{prox}_f(y)\big) \ge 0.
\]
Rearranging this and applying the Cauchy-Schwarz inequality yields
\[
\|\operatorname{prox}_f(x) - \operatorname{prox}_f(y)\|^2 \le (x - y)^T \big(\operatorname{prox}_f(x) - \operatorname{prox}_f(y)\big) \le \|x - y\| \, \|\operatorname{prox}_f(x) - \operatorname{prox}_f(y)\|.
\]
Examples of several proximal operators
(1) $f(x) = 0$: $\operatorname{prox}_f(x) = x$.
(2) $f(x) = \lambda \|x\|_1$ with $\lambda > 0$:
\[
\big[\operatorname{prox}_{\lambda \|\cdot\|_1}(x)\big]_i
= \arg\min_{u \in \mathbb{R}} \left\{ \lambda |u| + \frac{1}{2} (u - x_i)^2 \right\}
= \begin{cases}
x_i - \lambda & \text{if } x_i > \lambda, \\
0 & \text{if } x_i \in [-\lambda, \lambda], \\
x_i + \lambda & \text{if } x_i < -\lambda,
\end{cases}
\]
which is known as soft-thresholding.
(3) $f(x) = \lambda \|x\|_0$, where $\|x\|_0$ is the number of nonzero components (non-convex):
\[
\big[\operatorname{prox}_{\lambda \|\cdot\|_0}(x)\big]_i
= \begin{cases}
x_i & \text{if } |x_i| > \sqrt{2\lambda}, \\
\{0, x_i\} & \text{if } |x_i| = \sqrt{2\lambda}, \\
0 & \text{if } |x_i| < \sqrt{2\lambda},
\end{cases}
\]
which is known as hard thresholding (set-valued in the boundary case).
(4) $f(x) = I_\Omega(x)$:
\[
\operatorname{prox}_{I_\Omega}(x) = \arg\min_u \left\{ I_\Omega(u) + \frac{1}{2} \|u - x\|^2 \right\} = \arg\min_{u \in \Omega} \|u - x\|,
\]
which is simply the projection of $x$ onto the set $\Omega$ (a small implementation sketch of all four examples follows).
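A minimal implementation sketch of the four proximal operators above. For the projection case, the box $\Omega = [\mathrm{lo}, \mathrm{hi}]^n$ is a hypothetical choice made here so that the projection has a closed form:

```python
import numpy as np

def prox_zero(x):
    return x                             # (1) f = 0

def prox_l1(x, lam):
    # (2) soft-thresholding: prox of lam * ||x||_1
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_l0(x, lam):
    # (3) hard thresholding: prox of lam * ||x||_0
    return np.where(np.abs(x) > np.sqrt(2.0 * lam), x, 0.0)

def prox_box(x, lo, hi):
    # (4) projection onto the box Omega = [lo, hi]^n
    return np.clip(x, lo, hi)

x = np.array([-1.5, -0.2, 0.0, 0.4, 2.0])
print(prox_l1(x, 0.5))                   # zeros the entries with |x_i| <= 0.5
print(prox_l0(x, 0.5))                   # keeps only entries with |x_i| > 1
print(prox_box(x, -1.0, 1.0))            # clips each entry into [-1, 1]
```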
9 Convergence rates
An important measure for evaluating algorithms is the rate of convergence to zero of some measure of error.

For smooth $f$: $\|\nabla f(x_k)\|$.

For nonsmooth convex $f$: $\operatorname{dist}(0, \partial f(x_k))$ (the sequence of distances from $0$ to the subdifferential $\partial f(x_k)$).

Other error measures include
\[
\|x_k - x_*\| \quad \text{and} \quad f(x_k) - f_*,
\]
where $x_*$ is a solution and $f_*$ is the optimal value of $f$.
We denote by $\{\varphi_k\}$ the sequence of nonnegative scalars whose rate of convergence to $0$ we wish to find: $\varphi_k \ge 0$ and $\lim_{k \to \infty} \varphi_k = 0$.

We say that Q-linear convergence holds if there is some $\rho \in (0, 1)$ such that
\[
\frac{\varphi_{k+1}}{\varphi_k} \le \rho
\]
for all $k$ sufficiently large.

We say that R-linear convergence holds if there exist $\rho \in (0, 1)$ and $C > 0$ such that
\[
\varphi_k \le C \rho^k, \quad k = 1, 2, \dots
\]
Q-linear $\Rightarrow$ R-linear, but R-linear $\not\Rightarrow$ Q-linear. Example:
\[
\varphi_k = \begin{cases} 2^{-k} & k \text{ even}, \\ 0 & k \text{ odd} \end{cases}
\]
satisfies the R-linear condition with $C = 1$ and $\rho = 1/2$, but is not Q-linear.
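A small numeric check of this example (a sketch, printing the first few terms explicitly):

```python
import numpy as np

k = np.arange(12)
phi = np.where(k % 2 == 0, 2.0 ** (-k), 0.0)   # 2^{-k} for even k, else 0

# R-linear: phi_k <= C * rho^k holds with C = 1, rho = 1/2.
print(np.all(phi <= 0.5 ** k))                  # True

# Not Q-linear: phi_{k+1} / phi_k alternates between 0 and an unbounded
# jump back up from an exact zero, so no fixed rho < 1 bounds the ratio.
print(phi[:6])                                  # [1. 0. 0.25 0. 0.0625 0.]
```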
Sublinear convergence is, as its name suggests, slower than linear. Several varieties of sublinear convergence:
\[
\varphi_k \le \frac{C}{\sqrt{k}}, \qquad \varphi_k \le \frac{C}{k}, \qquad \varphi_k \le \frac{C}{k^2}, \qquad k = 1, 2, \dots,
\]
where in each case $C$ is some positive constant.

Superlinear convergence occurs when the constant $\rho \in (0, 1)$ in $\varphi_{k+1} / \varphi_k \le \rho$ can be chosen arbitrarily close to $0$.
The sequence $\{\varphi_k\}$ converges Q-superlinearly to $0$ if
\[
\lim_{k \to \infty} \frac{\varphi_{k+1}}{\varphi_k} = 0.
\]

Q-quadratic if
\[
\frac{\varphi_{k+1}}{\varphi_k^2} \le C, \quad k = 1, 2, \dots,
\]
for some sufficiently large $C$.

R-superlinear if there is a Q-superlinearly convergent sequence $\{\nu_k\}$ that dominates $\{\varphi_k\}$, that is,
\[
0 \le \varphi_k \le \nu_k, \quad k = 1, 2, \dots
\]
R-quadratic convergence is defined similarly.

Quadratic and superlinear rates are associated with higher-order methods, such as Newton and quasi-Newton methods.
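As an illustration of Q-quadratic convergence (a sketch on an example chosen here), Newton's method for solving $g(x) = x - \cos x = 0$ roughly doubles the number of correct digits per iteration:

```python
import numpy as np

g = lambda x: x - np.cos(x)     # root near x = 0.739085
dg = lambda x: 1.0 + np.sin(x)

x = 1.0
for _ in range(6):
    x = x - g(x) / dg(x)        # Newton step
    print(abs(g(x)))            # residual shrinks roughly quadratically,
                                # then stalls at machine precision
```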
9.1 Complexity
For a sequence satisfying the R-linear convergence condition, a sufficient condition for $\varphi_k \le \epsilon < C$ is
\[
C \rho^k \le \epsilon.
\]
By using the estimate $\log \rho \le \rho - 1$ for all $\rho \in (0, 1)$, we have that
\[
C \rho^k \le \epsilon \;\Leftrightarrow\; k \log \rho \le \log(\epsilon / C) \;\Leftarrow\; k \ge \frac{\log(C / \epsilon)}{1 - \rho}.
\]
The complexity is $O\!\left(\frac{\log(1/\epsilon)}{1 - \rho}\right)$.
For an algorithm that satisfies the sublinear rates, sufficient conditions for $\varphi_k \le \epsilon < C$ are:
\[
\frac{C}{\sqrt{k}} \le \epsilon \;\Leftrightarrow\; k \ge (C/\epsilon)^2, \quad \text{complexity } O(1/\epsilon^2);
\]
\[
\frac{C}{k} \le \epsilon \;\Leftrightarrow\; k \ge C/\epsilon, \quad \text{complexity } O(1/\epsilon);
\]
\[
\frac{C}{k^2} \le \epsilon \;\Leftrightarrow\; k \ge \sqrt{C/\epsilon}, \quad \text{complexity } O(1/\sqrt{\epsilon}).
\]
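A quick numeric check of these iteration counts (with illustrative constants $C$, $\rho$, and $\epsilon$ chosen here):

```python
import numpy as np

C, rho, eps = 10.0, 0.9, 1e-6

# R-linear: k >= log(C/eps) / (1 - rho) guarantees C * rho^k <= eps.
k_lin = int(np.ceil(np.log(C / eps) / (1.0 - rho)))
print(k_lin, C * rho ** k_lin <= eps)           # 162, True

# Sublinear rates: smallest k making each bound fall below eps.
print(int(np.ceil((C / eps) ** 2)))             # C / sqrt(k): O(1/eps^2)
print(int(np.ceil(C / eps)))                    # C / k:       O(1/eps)
print(int(np.ceil(np.sqrt(C / eps))))           # C / k^2:     O(1/sqrt(eps))
```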
For Q-quadratically convergent methods, the complexity is doubly logarithmic in $\epsilon$, that is, $O(\log \log (1/\epsilon))$.
To show this, one can consider the sequence $e_k = C \varphi_k$, which satisfies
\[
e_{k+1} \le e_k^2, \quad \text{and hence} \quad e_k \le e_0^{2^k}.
\]
If $e_0 < 1$, then $\varphi_k \le \epsilon$ as soon as $2^k \log(1/e_0) \ge \log(1/(C\epsilon))$, which requires only $k = O(\log \log (1/\epsilon))$ iterations. Once the algorithm enters a neighborhood of quadratic convergence, just a few additional iterations are required for convergence to a solution of high accuracy.
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34
4 Subgradient and subdifferential
Definition A vector v isin Rn is a subgradient of f at a point x if forall d isin Rn it holds
f(x+ d) ge f(x) + vTd
The subdifferential denoted partf(x) is the set of all subgradients off at x (see FOMO for concrete examples)
Lemma 3 (Monotonicity of subdifferentials of convex functions)
forall convex f if a isin partf(x) and b isin partf(y) we have (aminusb)T(xminus y) ge 0
Proof By the definition of subgradient we have
f(y) ge f(x) + aT(y minus x) and f(x) ge f(y) + bT(xminus y)
Adding these two inequalities yields the statement
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 8 34
Theorem 4 (Fermatrsquos lemma generalization in convex functions)
Let f Rn 983041rarr R cup +infin be a proper convex function Then the pointx983183 is a minimizer of f(x) if and only if
0 isin partf(x983183)
Proof ldquolArrrdquo Suppose that 0 isin partf(x983183) We have
f(x983183 + d) ge f(x983183) forall d isin Rn
which implies that x983183 is a minimizer of f ldquorArrrdquo by definition
Theorem 5
Let f Rn 983041rarr R cup +infin be proper convex and let x isin int(dom(f))
If f is differentiable at x then partf(x) = nablaf(x)If partf(x) is a singleton (a set containing a single vector) then f isdifferentiable at x with gradient equal to the unique subgradient
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 9 34
5 Taylorrsquos theorem
Taylorrsquos theorem shows how smooth functions can be locallyapproximated by low-order (eg linear or quadratic) functions
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 10 34
Theorem 6 (Taylorrsquos theorem)
Given a continuously differentiable function f Rn 983041rarr R and givenxp isin Rn we have that
f(x+ p) = f(x) +
983133 1
0nablaf(x+ tp)Tpdt
f(x+ p) = f(x) +nablaf(x+ ξp)Tp for some ξ isin (0 1)
If f is twice continuously differentiable we have
nablaf(x+ p) = nablaf(x) +
983133 1
0nabla2f(x+ tp)pdt
f(x+ p) = f(x) +nablaf(x)Tp+1
2pTnabla2f(x+ ξp)p
for some ξ isin (0 1)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 11 34
If f Rn 983041rarr R is differentiable and convex then
partf(x) = nablaf(x)
andf(y) ge f(x) +nablaf(x)T(y minus x)
for all xy isin Rn
Lipschitz continuously differentiable with constant L
983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042 for all xy isin Rn
If f is Lipschitz continuously differentiable with constant L thenby Taylorrsquos theorem we have
f(y)minus f(x)minusnablaf(x)T(y minus x) le L
2983042y minus x9830422
for all xy isin Rn
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 12 34
For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ
f(y) ge f(x) +nablaf(x)T(y minus x) +γ
2983042y minus x9830422
Lemma 7
Suppose f is strongly convex with modulus of convexity γ and nablaf isuniformly Lipschitz continuous with constant L We have forall xy that
γ
2983042y minus x9830422 le f(y)minus f(x)minusnablaf(x)T(y minus x) le L
2983042y minus x9830422
Condition number κ = Lγ (eg strictly convex quadratic f)
When f is twice continuously differentiable the inequalities inLemma 7 is equivalent to
γI ≼ nabla2f(x) ≼ LI for all x
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 13 34
Theorem 8
Let f be differentiable and strongly convex with modulus of convexityγ gt 0 Then the minimizer x983183 of f exists and is unique
Proof (i) Compactness of level set Show that for any point x0 thelevel set
x | f(x) le f(x0)
is closed and bounded and hence compact(ii) Existence Since f is continuous it attains its minimum on the
compact level set which is also the solution of minx f(x)(iii) Uniqueness Suppose for contradiction that the minimizer is
not unique so that we have two points x1983183 and x2
983183 that minimize f Byusing the strongly convex property we can prove
f
983061x1983183 + x2
983183
2
983062lt f(x1
983183) = f(x2983183)
This is a contradiction
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 14 34
6 Optimality conditions for smooth functions
Theorem 9 (First-order necessary condition)
If x983183 is a local minimizer of f and f is continuously differentiable inan open neighborhood of x983183 then nablaf(x983183) = 0
Proof Suppose for contradiction that nablaf(x983183) ∕= 0 Define the vectorp = minusnablaf(x983183) and note that pTnablaf(x983183) = minus983042nablaf(x983183)98304222 lt 0 Becausenablaf is continuous near x983183 there is a scalar T gt 0 such that
pTnablaf(x983183 + tp) lt 0 for all t isin [0 T ]
For any s isin (0 T ] we have by Taylorrsquos theorem that
f(x983183 + sp) = f(x983183) + spTnablaf(x983183 + ξsp) for some ξ isin (0 1)
Therefore f(x983183 + sp) lt f(x983183) for all s isin (0 T ] We have found adirection leading away from x983183 along which f decreases so x983183 is not alocal minimizer and we have a contradiction
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 15 34
Theorem 10 (Second-order necessary conditions)
If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite
Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that
pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]
By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that
f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1
2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)
As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34
Theorem 11 (Second-order sufficient conditions)
Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f
Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and
f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1
2pTnabla2f(x983183 + ξp)p
= f(x983183) +1
2pTnabla2f(x983183 + ξp)p
for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34
A point x is called a stationary point if
nablaf(x) = 0
A stationary point x is called a saddle point if there exist u and vsuch that
f(x+ αu) lt f(x) and f(x+ αv) gt f(x)
for all sufficiently small α gt 0
Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points
If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point
If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34
Theorem 12
(i) forall convex f any local minimizer x983183 is a global minimizer of f
(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f
Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is
x = λz+ (1minus λ)x983183 for some λ isin (0 1]
By the convexity property for f we have
f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)
Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34
(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have
nablaf(x983183)T(zminus x983183) =
d
dλf(x983183 + λ(zminus x983183))|λ=0
= limλrarr0+
f(x983183 + λ(zminus x983183))minus f(x983183)
λ
le limλrarr0+
λf(z) + (1minus λ)f(x983183)minus f(x983183)
λ
= f(z)minus f(x983183) lt 0
Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point
Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms
Numerical algorithms try to seek a point where nablaf vanishes
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34
7 Karush-Kuhn-Tucker conditions
Theorem 13 (KKT conditions)
Consider the minimization problem
minxisinRn
f(x) st gi(x) le 0 i = 1 m
where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions
Let x983183 be an optimal solution and assume Slaterrsquos condition
exist x isin Rn st gi(x) lt 0 i = 1 m
hold Then there exist λ1 middot middot middot λm ge 0 satisfying
0 isin partf(x983183) +
m983131
i=1
λipartgi(x983183) λigi(x983183) = 0 i = 1 m
If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34
Active gi(x) le 0
Inactive gi(x) le 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34
8 Moreau envelope and proximal operator
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is
Mλh(x) = infu
983069h(u) +
1
2λ983042uminus x9830422
983070
=1
λinfu
983069λh(u) +
1
2983042uminus x9830422
983070
For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is
proxf (x) = argminu
983069f(u) +
1
2983042uminus x9830422
983070
From the optimality condition (see Theorem 4) we have
0 isin partf(proxf (x)) + (proxf (x)minus x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34
For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if
x983183 = proxf (x983183)
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope
Mλh(x) = h(proxλh(x)) +1
2λ983042proxλh(x)minus x9830422
can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn
minusinfin lt Mλh(x) lt +infin
even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient
nablaMλh(x) =1
λ(xminus proxλh(x))
Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34
Example h(x) = |x| and Hλ(x) = Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34
Lemma 14 (Nonexpansivity of proximal operator)
For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn
983042proxf (x)minus proxf (y)983042 le 983042xminus y983042
Proof From the optimality conditions at two points x and y we have
xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))
By applying monotonicity (see Lemma 3) we have
((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0
Rearranging this and applying the Cauchy-Schwartz inequality yields
983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))
le 983042xminus y983042983042proxf (x)minus proxf (y)983042
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34
Examples of several proximal operators
(1) f(x) = 0proxf (x) = x
(2) f(x) = λ983042x9830421 with λ gt 0
983045proxλ983042middot9830421(x)
983046i= argmin
uisinR
983069λ|u|+ 1
2(uminus xi)
2
983070
=
983099983105983103
983105983101
xi minus λ if xi gt λ
0 if xi isin [minusλλ]
xi + λ if xi lt minusλ
which is known as soft-thresholding
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34
(3) f(x) = λ983042x9830420 the number of nonzero components non-convex
983045proxλ983042middot9830420(x)
983046i=
983099983105983103
983105983101
xi if |xi| gtradic2λ
0 xi if |xi| =radic2λ
0 if |xi| ltradic2λ
which is known as hard thresholding
(4) f(x) = IΩ(x)
proxIΩ(x) = argminu
983069IΩ(u) +
1
2983042uminus x9830422
983070= argmin
uisinΩ983042uminus x983042
which is simply the projection of x onto the set Ω
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34
9 Convergence rates
An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error
For smooth f 983042nablaf(xk)983042
For nonsmooth convex f
dist(0 partf(xk))
(the sequence of distances from 0 to the subdifferential partf(xk))
Other error measures include
983042xk minus x983183983042 and f(xk)minus f983183
where x983183 is a solution and f983183 is the optimal value of f
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34
We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim
krarrinfinφk = 0
We say that Q-linear convergence holds if there is some ρ isin (0 1)such that
φk+1
φkle ρ
for all k sufficiently large
We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that
φk le Cρk k = 1 2
Q-linear rArr R-linear But R-linear ⇏ Q-linear Example
φk =
9830832minusk k even
0 k odd
satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34
Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence
φk le Cradick φk le Ck φk le Ck2 k = 1 2
where in each case C is some positive constant
Superlinear convergence occurs when the constant ρ isin (0 1) in
φk+1φk le ρ
can be chosen arbitrarily close to 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34
The sequence φk converges Q-superlinearly to 0 if
limkrarrinfin
φk+1
φk= 0
Q-Quadratic ifφk+1
φ2k
le C k = 1 2
for some sufficiently large C
R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is
0 le φk le νk k = 1 2
R-quadratic convergence is defined similarly
Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34
91 Complexity
For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is
Cρk le 983171
By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that
Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)
The complexity is O( log(1983171)1minusρ )
For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C
Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)
Ck le 983171 hArr k ge C983171 complexity O(1983171)
Ck2 le 983171 hArr k ge983155
C983171 complexity O(1radic983171)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34
For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is
O(log log(1983171))
To show this one can consider the sequence
ek = Cφk
which satisfiesek+1 le e2k
Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34
Theorem 4 (Fermatrsquos lemma generalization in convex functions)
Let f Rn 983041rarr R cup +infin be a proper convex function Then the pointx983183 is a minimizer of f(x) if and only if
0 isin partf(x983183)
Proof ldquolArrrdquo Suppose that 0 isin partf(x983183) We have
f(x983183 + d) ge f(x983183) forall d isin Rn
which implies that x983183 is a minimizer of f ldquorArrrdquo by definition
Theorem 5
Let f Rn 983041rarr R cup +infin be proper convex and let x isin int(dom(f))
If f is differentiable at x then partf(x) = nablaf(x)If partf(x) is a singleton (a set containing a single vector) then f isdifferentiable at x with gradient equal to the unique subgradient
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 9 34
5 Taylorrsquos theorem
Taylorrsquos theorem shows how smooth functions can be locallyapproximated by low-order (eg linear or quadratic) functions
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 10 34
Theorem 6 (Taylorrsquos theorem)
Given a continuously differentiable function f Rn 983041rarr R and givenxp isin Rn we have that
f(x+ p) = f(x) +
983133 1
0nablaf(x+ tp)Tpdt
f(x+ p) = f(x) +nablaf(x+ ξp)Tp for some ξ isin (0 1)
If f is twice continuously differentiable we have
nablaf(x+ p) = nablaf(x) +
983133 1
0nabla2f(x+ tp)pdt
f(x+ p) = f(x) +nablaf(x)Tp+1
2pTnabla2f(x+ ξp)p
for some ξ isin (0 1)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 11 34
If f Rn 983041rarr R is differentiable and convex then
partf(x) = nablaf(x)
andf(y) ge f(x) +nablaf(x)T(y minus x)
for all xy isin Rn
Lipschitz continuously differentiable with constant L
983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042 for all xy isin Rn
If f is Lipschitz continuously differentiable with constant L thenby Taylorrsquos theorem we have
f(y)minus f(x)minusnablaf(x)T(y minus x) le L
2983042y minus x9830422
for all xy isin Rn
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 12 34
For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ
f(y) ge f(x) +nablaf(x)T(y minus x) +γ
2983042y minus x9830422
Lemma 7
Suppose f is strongly convex with modulus of convexity γ and nablaf isuniformly Lipschitz continuous with constant L We have forall xy that
γ
2983042y minus x9830422 le f(y)minus f(x)minusnablaf(x)T(y minus x) le L
2983042y minus x9830422
Condition number κ = Lγ (eg strictly convex quadratic f)
When f is twice continuously differentiable the inequalities inLemma 7 is equivalent to
γI ≼ nabla2f(x) ≼ LI for all x
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 13 34
Theorem 8
Let f be differentiable and strongly convex with modulus of convexityγ gt 0 Then the minimizer x983183 of f exists and is unique
Proof (i) Compactness of level set Show that for any point x0 thelevel set
x | f(x) le f(x0)
is closed and bounded and hence compact(ii) Existence Since f is continuous it attains its minimum on the
compact level set which is also the solution of minx f(x)(iii) Uniqueness Suppose for contradiction that the minimizer is
not unique so that we have two points x1983183 and x2
983183 that minimize f Byusing the strongly convex property we can prove
f
983061x1983183 + x2
983183
2
983062lt f(x1
983183) = f(x2983183)
This is a contradiction
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 14 34
6 Optimality conditions for smooth functions
Theorem 9 (First-order necessary condition)
If x983183 is a local minimizer of f and f is continuously differentiable inan open neighborhood of x983183 then nablaf(x983183) = 0
Proof Suppose for contradiction that nablaf(x983183) ∕= 0 Define the vectorp = minusnablaf(x983183) and note that pTnablaf(x983183) = minus983042nablaf(x983183)98304222 lt 0 Becausenablaf is continuous near x983183 there is a scalar T gt 0 such that
pTnablaf(x983183 + tp) lt 0 for all t isin [0 T ]
For any s isin (0 T ] we have by Taylorrsquos theorem that
f(x983183 + sp) = f(x983183) + spTnablaf(x983183 + ξsp) for some ξ isin (0 1)
Therefore f(x983183 + sp) lt f(x983183) for all s isin (0 T ] We have found adirection leading away from x983183 along which f decreases so x983183 is not alocal minimizer and we have a contradiction
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 15 34
Theorem 10 (Second-order necessary conditions)
If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite
Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that
pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]
By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that
f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1
2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)
As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34
Theorem 11 (Second-order sufficient conditions)
Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f
Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and
f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1
2pTnabla2f(x983183 + ξp)p
= f(x983183) +1
2pTnabla2f(x983183 + ξp)p
for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34
A point x is called a stationary point if
nablaf(x) = 0
A stationary point x is called a saddle point if there exist u and vsuch that
f(x+ αu) lt f(x) and f(x+ αv) gt f(x)
for all sufficiently small α gt 0
Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points
If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point
If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34
Theorem 12
(i) forall convex f any local minimizer x983183 is a global minimizer of f
(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f
Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is
x = λz+ (1minus λ)x983183 for some λ isin (0 1]
By the convexity property for f we have
f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)
Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34
(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have
nablaf(x983183)T(zminus x983183) =
d
dλf(x983183 + λ(zminus x983183))|λ=0
= limλrarr0+
f(x983183 + λ(zminus x983183))minus f(x983183)
λ
le limλrarr0+
λf(z) + (1minus λ)f(x983183)minus f(x983183)
λ
= f(z)minus f(x983183) lt 0
Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point
Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms
Numerical algorithms try to seek a point where nablaf vanishes
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34
7 Karush-Kuhn-Tucker conditions
Theorem 13 (KKT conditions)
Consider the minimization problem
minxisinRn
f(x) st gi(x) le 0 i = 1 m
where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions
Let x983183 be an optimal solution and assume Slaterrsquos condition
exist x isin Rn st gi(x) lt 0 i = 1 m
hold Then there exist λ1 middot middot middot λm ge 0 satisfying
0 isin partf(x983183) +
m983131
i=1
λipartgi(x983183) λigi(x983183) = 0 i = 1 m
If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34
Active gi(x) le 0
Inactive gi(x) le 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34
8 Moreau envelope and proximal operator
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is
Mλh(x) = infu
983069h(u) +
1
2λ983042uminus x9830422
983070
=1
λinfu
983069λh(u) +
1
2983042uminus x9830422
983070
For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is
proxf (x) = argminu
983069f(u) +
1
2983042uminus x9830422
983070
From the optimality condition (see Theorem 4) we have
0 isin partf(proxf (x)) + (proxf (x)minus x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34
For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if
x983183 = proxf (x983183)
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope
Mλh(x) = h(proxλh(x)) +1
2λ983042proxλh(x)minus x9830422
can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn
minusinfin lt Mλh(x) lt +infin
even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient
nablaMλh(x) =1
λ(xminus proxλh(x))
Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34
Example h(x) = |x| and Hλ(x) = Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34
Lemma 14 (Nonexpansivity of proximal operator)
For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn
983042proxf (x)minus proxf (y)983042 le 983042xminus y983042
Proof From the optimality conditions at two points x and y we have
xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))
By applying monotonicity (see Lemma 3) we have
((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0
Rearranging this and applying the Cauchy-Schwartz inequality yields
983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))
le 983042xminus y983042983042proxf (x)minus proxf (y)983042
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34
Examples of several proximal operators
(1) f(x) = 0proxf (x) = x
(2) f(x) = λ983042x9830421 with λ gt 0
983045proxλ983042middot9830421(x)
983046i= argmin
uisinR
983069λ|u|+ 1
2(uminus xi)
2
983070
=
983099983105983103
983105983101
xi minus λ if xi gt λ
0 if xi isin [minusλλ]
xi + λ if xi lt minusλ
which is known as soft-thresholding
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34
(3) f(x) = λ983042x9830420 the number of nonzero components non-convex
983045proxλ983042middot9830420(x)
983046i=
983099983105983103
983105983101
xi if |xi| gtradic2λ
0 xi if |xi| =radic2λ
0 if |xi| ltradic2λ
which is known as hard thresholding
(4) f(x) = IΩ(x)
proxIΩ(x) = argminu
983069IΩ(u) +
1
2983042uminus x9830422
983070= argmin
uisinΩ983042uminus x983042
which is simply the projection of x onto the set Ω
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34
9 Convergence rates
An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error
For smooth f 983042nablaf(xk)983042
For nonsmooth convex f
dist(0 partf(xk))
(the sequence of distances from 0 to the subdifferential partf(xk))
Other error measures include
983042xk minus x983183983042 and f(xk)minus f983183
where x983183 is a solution and f983183 is the optimal value of f
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34
We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim
krarrinfinφk = 0
We say that Q-linear convergence holds if there is some ρ isin (0 1)such that
φk+1
φkle ρ
for all k sufficiently large
We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that
φk le Cρk k = 1 2
Q-linear rArr R-linear But R-linear ⇏ Q-linear Example
φk =
9830832minusk k even
0 k odd
satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34
Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence
φk le Cradick φk le Ck φk le Ck2 k = 1 2
where in each case C is some positive constant
Superlinear convergence occurs when the constant ρ isin (0 1) in
φk+1φk le ρ
can be chosen arbitrarily close to 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34
The sequence φk converges Q-superlinearly to 0 if
limkrarrinfin
φk+1
φk= 0
Q-Quadratic ifφk+1
φ2k
le C k = 1 2
for some sufficiently large C
R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is
0 le φk le νk k = 1 2
R-quadratic convergence is defined similarly
Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34
91 Complexity
For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is
Cρk le 983171
By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that
Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)
The complexity is O( log(1983171)1minusρ )
For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C
Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)
Ck le 983171 hArr k ge C983171 complexity O(1983171)
Ck2 le 983171 hArr k ge983155
C983171 complexity O(1radic983171)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34
For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is
O(log log(1983171))
To show this one can consider the sequence
ek = Cφk
which satisfiesek+1 le e2k
Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34
5 Taylorrsquos theorem
Taylorrsquos theorem shows how smooth functions can be locallyapproximated by low-order (eg linear or quadratic) functions
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 10 34
Theorem 6 (Taylorrsquos theorem)
Given a continuously differentiable function f Rn 983041rarr R and givenxp isin Rn we have that
f(x+ p) = f(x) +
983133 1
0nablaf(x+ tp)Tpdt
f(x+ p) = f(x) +nablaf(x+ ξp)Tp for some ξ isin (0 1)
If f is twice continuously differentiable we have
nablaf(x+ p) = nablaf(x) +
983133 1
0nabla2f(x+ tp)pdt
f(x+ p) = f(x) +nablaf(x)Tp+1
2pTnabla2f(x+ ξp)p
for some ξ isin (0 1)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 11 34
If f Rn 983041rarr R is differentiable and convex then
partf(x) = nablaf(x)
andf(y) ge f(x) +nablaf(x)T(y minus x)
for all xy isin Rn
Lipschitz continuously differentiable with constant L
983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042 for all xy isin Rn
If f is Lipschitz continuously differentiable with constant L thenby Taylorrsquos theorem we have
f(y)minus f(x)minusnablaf(x)T(y minus x) le L
2983042y minus x9830422
for all xy isin Rn
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 12 34
For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ
f(y) ge f(x) +nablaf(x)T(y minus x) +γ
2983042y minus x9830422
Lemma 7
Suppose f is strongly convex with modulus of convexity γ and nablaf isuniformly Lipschitz continuous with constant L We have forall xy that
γ
2983042y minus x9830422 le f(y)minus f(x)minusnablaf(x)T(y minus x) le L
2983042y minus x9830422
Condition number κ = Lγ (eg strictly convex quadratic f)
When f is twice continuously differentiable the inequalities inLemma 7 is equivalent to
γI ≼ nabla2f(x) ≼ LI for all x
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 13 34
Theorem 8
Let f be differentiable and strongly convex with modulus of convexityγ gt 0 Then the minimizer x983183 of f exists and is unique
Proof (i) Compactness of level set Show that for any point x0 thelevel set
x | f(x) le f(x0)
is closed and bounded and hence compact(ii) Existence Since f is continuous it attains its minimum on the
compact level set which is also the solution of minx f(x)(iii) Uniqueness Suppose for contradiction that the minimizer is
not unique so that we have two points x1983183 and x2
983183 that minimize f Byusing the strongly convex property we can prove
f
983061x1983183 + x2
983183
2
983062lt f(x1
983183) = f(x2983183)
This is a contradiction
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 14 34
6 Optimality conditions for smooth functions
Theorem 9 (First-order necessary condition)
If x983183 is a local minimizer of f and f is continuously differentiable inan open neighborhood of x983183 then nablaf(x983183) = 0
Proof Suppose for contradiction that nablaf(x983183) ∕= 0 Define the vectorp = minusnablaf(x983183) and note that pTnablaf(x983183) = minus983042nablaf(x983183)98304222 lt 0 Becausenablaf is continuous near x983183 there is a scalar T gt 0 such that
pTnablaf(x983183 + tp) lt 0 for all t isin [0 T ]
For any s isin (0 T ] we have by Taylorrsquos theorem that
f(x983183 + sp) = f(x983183) + spTnablaf(x983183 + ξsp) for some ξ isin (0 1)
Therefore f(x983183 + sp) lt f(x983183) for all s isin (0 T ] We have found adirection leading away from x983183 along which f decreases so x983183 is not alocal minimizer and we have a contradiction
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 15 34
Theorem 10 (Second-order necessary conditions)
If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite
Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that
pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]
By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that
f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1
2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)
As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34
Theorem 11 (Second-order sufficient conditions)
Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f
Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and
f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1
2pTnabla2f(x983183 + ξp)p
= f(x983183) +1
2pTnabla2f(x983183 + ξp)p
for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34
A point x is called a stationary point if
nablaf(x) = 0
A stationary point x is called a saddle point if there exist u and vsuch that
f(x+ αu) lt f(x) and f(x+ αv) gt f(x)
for all sufficiently small α gt 0
Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points
If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point
If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34
Theorem 12
(i) forall convex f any local minimizer x983183 is a global minimizer of f
(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f
Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is
x = λz+ (1minus λ)x983183 for some λ isin (0 1]
By the convexity property for f we have
f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)
Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34
(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have
nablaf(x983183)T(zminus x983183) =
d
dλf(x983183 + λ(zminus x983183))|λ=0
= limλrarr0+
f(x983183 + λ(zminus x983183))minus f(x983183)
λ
le limλrarr0+
λf(z) + (1minus λ)f(x983183)minus f(x983183)
λ
= f(z)minus f(x983183) lt 0
Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point
Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms
Numerical algorithms try to seek a point where nablaf vanishes
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34
7 Karush-Kuhn-Tucker conditions
Theorem 13 (KKT conditions)
Consider the minimization problem
minxisinRn
f(x) st gi(x) le 0 i = 1 m
where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions
Let x983183 be an optimal solution and assume Slaterrsquos condition
exist x isin Rn st gi(x) lt 0 i = 1 m
hold Then there exist λ1 middot middot middot λm ge 0 satisfying
0 isin partf(x983183) +
m983131
i=1
λipartgi(x983183) λigi(x983183) = 0 i = 1 m
If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34
Active gi(x) le 0
Inactive gi(x) le 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34
8 Moreau envelope and proximal operator
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is
Mλh(x) = infu
983069h(u) +
1
2λ983042uminus x9830422
983070
=1
λinfu
983069λh(u) +
1
2983042uminus x9830422
983070
For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is
proxf (x) = argminu
983069f(u) +
1
2983042uminus x9830422
983070
From the optimality condition (see Theorem 4) we have
0 isin partf(proxf (x)) + (proxf (x)minus x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34
For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if
x983183 = proxf (x983183)
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope
Mλh(x) = h(proxλh(x)) +1
2λ983042proxλh(x)minus x9830422
can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn
minusinfin lt Mλh(x) lt +infin
even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient
nablaMλh(x) =1
λ(xminus proxλh(x))
Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34
Example h(x) = |x| and Hλ(x) = Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34
Lemma 14 (Nonexpansivity of proximal operator)
For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn
983042proxf (x)minus proxf (y)983042 le 983042xminus y983042
Proof From the optimality conditions at two points x and y we have
xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))
By applying monotonicity (see Lemma 3) we have
((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0
Rearranging this and applying the Cauchy-Schwartz inequality yields
983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))
le 983042xminus y983042983042proxf (x)minus proxf (y)983042
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34
Examples of several proximal operators
(1) f(x) = 0proxf (x) = x
(2) f(x) = λ983042x9830421 with λ gt 0
983045proxλ983042middot9830421(x)
983046i= argmin
uisinR
983069λ|u|+ 1
2(uminus xi)
2
983070
=
983099983105983103
983105983101
xi minus λ if xi gt λ
0 if xi isin [minusλλ]
xi + λ if xi lt minusλ
which is known as soft-thresholding
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34
(3) f(x) = λ983042x9830420 the number of nonzero components non-convex
983045proxλ983042middot9830420(x)
983046i=
983099983105983103
983105983101
xi if |xi| gtradic2λ
0 xi if |xi| =radic2λ
0 if |xi| ltradic2λ
which is known as hard thresholding
(4) f(x) = IΩ(x)
proxIΩ(x) = argminu
983069IΩ(u) +
1
2983042uminus x9830422
983070= argmin
uisinΩ983042uminus x983042
which is simply the projection of x onto the set Ω
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34
9 Convergence rates
An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error
For smooth f 983042nablaf(xk)983042
For nonsmooth convex f
dist(0 partf(xk))
(the sequence of distances from 0 to the subdifferential partf(xk))
Other error measures include
983042xk minus x983183983042 and f(xk)minus f983183
where x983183 is a solution and f983183 is the optimal value of f
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34
We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim
krarrinfinφk = 0
We say that Q-linear convergence holds if there is some ρ isin (0 1)such that
φk+1
φkle ρ
for all k sufficiently large
We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that
φk le Cρk k = 1 2
Q-linear rArr R-linear But R-linear ⇏ Q-linear Example
φk =
9830832minusk k even
0 k odd
satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34
Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence
φk le Cradick φk le Ck φk le Ck2 k = 1 2
where in each case C is some positive constant
Superlinear convergence occurs when the constant ρ isin (0 1) in
φk+1φk le ρ
can be chosen arbitrarily close to 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34
The sequence φk converges Q-superlinearly to 0 if
limkrarrinfin
φk+1
φk= 0
Q-Quadratic ifφk+1
φ2k
le C k = 1 2
for some sufficiently large C
R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is
0 le φk le νk k = 1 2
R-quadratic convergence is defined similarly
Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34
91 Complexity
For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is
Cρk le 983171
By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that
Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)
The complexity is O( log(1983171)1minusρ )
For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C
Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)
Ck le 983171 hArr k ge C983171 complexity O(1983171)
Ck2 le 983171 hArr k ge983155
C983171 complexity O(1radic983171)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34
For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is
O(log log(1983171))
To show this one can consider the sequence
ek = Cφk
which satisfiesek+1 le e2k
Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34
Theorem 6 (Taylorrsquos theorem)
Given a continuously differentiable function f Rn 983041rarr R and givenxp isin Rn we have that
f(x+ p) = f(x) +
983133 1
0nablaf(x+ tp)Tpdt
f(x+ p) = f(x) +nablaf(x+ ξp)Tp for some ξ isin (0 1)
If f is twice continuously differentiable we have
nablaf(x+ p) = nablaf(x) +
983133 1
0nabla2f(x+ tp)pdt
f(x+ p) = f(x) +nablaf(x)Tp+1
2pTnabla2f(x+ ξp)p
for some ξ isin (0 1)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 11 34
If f Rn 983041rarr R is differentiable and convex then
partf(x) = nablaf(x)
andf(y) ge f(x) +nablaf(x)T(y minus x)
for all xy isin Rn
Lipschitz continuously differentiable with constant L
983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042 for all xy isin Rn
If f is Lipschitz continuously differentiable with constant L thenby Taylorrsquos theorem we have
f(y)minus f(x)minusnablaf(x)T(y minus x) le L
2983042y minus x9830422
for all xy isin Rn
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 12 34
For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ
f(y) ge f(x) +nablaf(x)T(y minus x) +γ
2983042y minus x9830422
Lemma 7
Suppose f is strongly convex with modulus of convexity γ and nablaf isuniformly Lipschitz continuous with constant L We have forall xy that
γ
2983042y minus x9830422 le f(y)minus f(x)minusnablaf(x)T(y minus x) le L
2983042y minus x9830422
Condition number κ = Lγ (eg strictly convex quadratic f)
When f is twice continuously differentiable the inequalities inLemma 7 is equivalent to
γI ≼ nabla2f(x) ≼ LI for all x
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 13 34
Theorem 8
Let f be differentiable and strongly convex with modulus of convexityγ gt 0 Then the minimizer x983183 of f exists and is unique
Proof (i) Compactness of level set Show that for any point x0 thelevel set
x | f(x) le f(x0)
is closed and bounded and hence compact(ii) Existence Since f is continuous it attains its minimum on the
compact level set which is also the solution of minx f(x)(iii) Uniqueness Suppose for contradiction that the minimizer is
not unique so that we have two points x1983183 and x2
983183 that minimize f Byusing the strongly convex property we can prove
f
983061x1983183 + x2
983183
2
983062lt f(x1
983183) = f(x2983183)
This is a contradiction
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 14 34
6 Optimality conditions for smooth functions
Theorem 9 (First-order necessary condition)
If x983183 is a local minimizer of f and f is continuously differentiable inan open neighborhood of x983183 then nablaf(x983183) = 0
Proof Suppose for contradiction that nablaf(x983183) ∕= 0 Define the vectorp = minusnablaf(x983183) and note that pTnablaf(x983183) = minus983042nablaf(x983183)98304222 lt 0 Becausenablaf is continuous near x983183 there is a scalar T gt 0 such that
pTnablaf(x983183 + tp) lt 0 for all t isin [0 T ]
For any s isin (0 T ] we have by Taylorrsquos theorem that
f(x983183 + sp) = f(x983183) + spTnablaf(x983183 + ξsp) for some ξ isin (0 1)
Therefore f(x983183 + sp) lt f(x983183) for all s isin (0 T ] We have found adirection leading away from x983183 along which f decreases so x983183 is not alocal minimizer and we have a contradiction
Theorem 10 (Second-order necessary conditions)

If $x_*$ is a local minimizer of $f$ and $\nabla^2 f$ is continuous in an open neighborhood of $x_*$, then $\nabla f(x_*) = 0$ and $\nabla^2 f(x_*)$ is positive semidefinite.

Proof. We know from Theorem 9 that $\nabla f(x_*) = 0$. Assume that $\nabla^2 f(x_*)$ is not positive semidefinite. Then we can choose a vector $p$ such that $p^T \nabla^2 f(x_*)\, p < 0$, and because $\nabla^2 f$ is continuous near $x_*$, there is a scalar $T > 0$ such that

$$p^T \nabla^2 f(x_* + tp)\, p < 0 \quad \text{for all } t \in [0, T].$$

By doing a Taylor series expansion around $x_*$, we have for all $s \in (0, T]$ and some $\xi \in (0, 1)$ that

$$f(x_* + sp) = f(x_*) + s\, p^T \nabla f(x_*) + \frac{1}{2} s^2 p^T \nabla^2 f(x_* + \xi s p)\, p < f(x_*).$$

As in Theorem 9, we have found a direction from $x_*$ along which $f$ is decreasing, and so again $x_*$ is not a local minimizer.
Theorem 11 (Second-order sufficient conditions)

Suppose that $\nabla^2 f$ is continuous in an open neighborhood of $x_*$, and that $\nabla f(x_*) = 0$ and $\nabla^2 f(x_*)$ is positive definite. Then $x_*$ is a strict local minimizer of $f$.

Proof. Because the Hessian $\nabla^2 f$ is continuous and positive definite at $x_*$, we can choose a radius $r > 0$ so that $\nabla^2 f(x)$ remains positive definite for all $x$ in the open ball $B = \{z \mid \|z - x_*\| < r\}$. Taking any nonzero vector $p$ with $\|p\| < r$, we have $x_* + p \in B$ and

$$f(x_* + p) = f(x_*) + p^T \nabla f(x_*) + \frac{1}{2}\, p^T \nabla^2 f(x_* + \xi p)\, p = f(x_*) + \frac{1}{2}\, p^T \nabla^2 f(x_* + \xi p)\, p$$

for some $\xi \in (0, 1)$. Since $x_* + \xi p \in B$, we have $p^T \nabla^2 f(x_* + \xi p)\, p > 0$, and therefore $f(x_* + p) > f(x_*)$, giving the result.
A point $x$ is called a stationary point if $\nabla f(x) = 0$.

A stationary point $x$ is called a saddle point if there exist $u$ and $v$ such that

$$f(x + \alpha u) < f(x) \quad \text{and} \quad f(x + \alpha v) > f(x)$$

for all sufficiently small $\alpha > 0$.

Stationary points are not necessarily local minimizers: stationary points can also be local maximizers or saddle points.

If $\nabla f(x) = 0$ and $\nabla^2 f(x)$ has both strictly positive and strictly negative eigenvalues, then $x$ is a saddle point.

If $\nabla^2 f(x)$ is positive semidefinite or negative semidefinite, then $\nabla^2 f(x)$ alone is insufficient to classify $x$.
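A minimal sketch of this classification (the example function is ours, not the lecture's): inspect the signs of the Hessian eigenvalues at a stationary point.

```python
import numpy as np

# Sketch: classify the stationary point (0, 0) of f(x, y) = x^2 - y^2
# from the signs of the Hessian eigenvalues.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])              # constant Hessian of x^2 - y^2
eigs = np.linalg.eigvalsh(H)             # sorted ascending
if eigs[0] > 0:
    print("positive definite: strict local minimizer")
elif eigs[-1] < 0:
    print("negative definite: strict local maximizer")
elif eigs[0] < 0 < eigs[-1]:
    print("indefinite: saddle point")    # this branch fires for H above
else:
    print("semidefinite: the Hessian alone cannot classify x")
```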
Theorem 12

(i) For convex $f$, any local minimizer $x_*$ is a global minimizer of $f$.

(ii) If $f$ is convex and differentiable, then any stationary point $x_*$ is a global minimizer of $f$.

Proof. (i) Suppose that $x_*$ is a local but not a global minimizer. Then we can find a point $z \in \mathbb{R}^n$ with $f(z) < f(x_*)$. Consider the line segment that joins $x_*$ to $z$, that is,

$$x = \lambda z + (1 - \lambda) x_* \quad \text{for some } \lambda \in (0, 1].$$

By the convexity property of $f$ we have

$$f(x) \le \lambda f(z) + (1 - \lambda) f(x_*) < f(x_*).$$

Any neighborhood $N$ of $x_*$ contains a piece of this line segment, so there will always be points $x \in N$ at which the last inequality holds. Hence $x_*$ is not a local minimizer.
(ii) Suppose that $x_*$ is not a global minimizer and choose $z$ as above. Then from convexity we have

$$\nabla f(x_*)^T (z - x_*) = \frac{d}{d\lambda} f(x_* + \lambda(z - x_*)) \Big|_{\lambda = 0}
= \lim_{\lambda \to 0^+} \frac{f(x_* + \lambda(z - x_*)) - f(x_*)}{\lambda}
\le \lim_{\lambda \to 0^+} \frac{\lambda f(z) + (1 - \lambda) f(x_*) - f(x_*)}{\lambda}
= f(z) - f(x_*) < 0.$$

Therefore $\nabla f(x_*) \ne 0$, and so $x_*$ is not a stationary point.

Remark: Theorems 9-12 provide the foundations for unconstrained optimization algorithms. Numerical algorithms therefore seek a point where $\nabla f$ vanishes, as in the sketch below.
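A minimal gradient-descent sketch of this idea (illustrative only; the quadratic test problem and the step size $1/L$, a standard choice for $L$-Lipschitz gradients, are our assumptions):

```python
import numpy as np

# Sketch: gradient descent with step 1/L on a strongly convex quadratic,
# iterating until the gradient (the stationarity measure) nearly vanishes.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])               # symmetric positive definite
b = np.array([1.0, -1.0])

grad = lambda x: A @ x - b               # f(x) = 0.5 x^T A x - b^T x
L = np.linalg.eigvalsh(A)[-1]            # Lipschitz constant of grad f

x = np.zeros(2)
while np.linalg.norm(grad(x)) > 1e-10:
    x = x - (1.0 / L) * grad(x)
print(x, np.linalg.solve(A, b))          # both give the unique minimizer
```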
7 Karush-Kuhn-Tucker conditions
Theorem 13 (KKT conditions)

Consider the minimization problem

$$\min_{x \in \mathbb{R}^n} f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \; i = 1, \dots, m,$$

where $f: \mathbb{R}^n \to \mathbb{R}$ and all $g_i: \mathbb{R}^n \to \mathbb{R}$ are convex functions.

Let $x_*$ be an optimal solution and assume Slater's condition

$$\exists\, x \in \mathbb{R}^n \; \text{s.t.} \; g_i(x) < 0, \; i = 1, \dots, m,$$

holds. Then there exist $\lambda_1, \dots, \lambda_m \ge 0$ satisfying

$$0 \in \partial f(x_*) + \sum_{i=1}^m \lambda_i\, \partial g_i(x_*), \qquad \lambda_i g_i(x_*) = 0, \; i = 1, \dots, m.$$

Conversely, if $x_*$ satisfies the above conditions, called the KKT conditions, then it is an optimal solution of the optimization problem.
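As an illustration (a made-up smooth problem, where the subdifferentials reduce to gradients and the inclusion becomes an equation): for $\min x_1^2 + x_2^2$ subject to $g(x) = 1 - x_1 - x_2 \le 0$, the solution is $x_* = (1/2, 1/2)$ with multiplier $\lambda = 1$, and the KKT conditions can be checked directly.

```python
import numpy as np

# Sketch: KKT conditions for  min x1^2 + x2^2  s.t.  g(x) = 1 - x1 - x2 <= 0.
x = np.array([0.5, 0.5])                 # candidate solution
lam = 1.0                                # candidate multiplier

grad_f = 2.0 * x
grad_g = np.array([-1.0, -1.0])
g = 1.0 - x.sum()

print(np.allclose(grad_f + lam * grad_g, 0.0))          # stationarity: True
print(lam >= 0.0, np.isclose(lam * g, 0.0), g <= 0.0)   # dual feasibility,
                                                        # complementarity,
                                                        # primal feasibility
```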
A constraint $g_i(x) \le 0$ is called active at $x$ if $g_i(x) = 0$, and inactive if $g_i(x) < 0$.
8 Moreau envelope and proximal operator
For a closed proper convex function $h: \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ and a positive scalar $\lambda$, the Moreau envelope of $(\lambda, h)$ is

$$M_{\lambda h}(x) = \inf_u \left\{ h(u) + \frac{1}{2\lambda} \|u - x\|^2 \right\} = \frac{1}{\lambda} \inf_u \left\{ \lambda h(u) + \frac{1}{2} \|u - x\|^2 \right\}.$$

For a closed proper convex function $f: \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$, the proximal operator of $f$ is

$$\operatorname{prox}_f(x) = \arg\min_u \left\{ f(u) + \frac{1}{2} \|u - x\|^2 \right\}.$$

From the optimality condition (see Theorem 4) we have

$$0 \in \partial f(\operatorname{prox}_f(x)) + (\operatorname{prox}_f(x) - x).$$
For a closed proper convex function $f: \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$, the point $x_*$ is a minimizer of $f$ if and only if

$$x_* = \operatorname{prox}_f(x_*).$$

For a closed proper convex function $h: \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ and a positive scalar $\lambda$, the Moreau envelope

$$M_{\lambda h}(x) = h(\operatorname{prox}_{\lambda h}(x)) + \frac{1}{2\lambda} \|\operatorname{prox}_{\lambda h}(x) - x\|^2$$

can be viewed as a kind of smoothing or regularization of the function $h$. We have, for all $x \in \mathbb{R}^n$,

$$-\infty < M_{\lambda h}(x) < +\infty,$$

even when $h(x) = +\infty$ for some $x \in \mathbb{R}^n$. Moreover, $M_{\lambda h}$ is convex and differentiable everywhere, with gradient

$$\nabla M_{\lambda h}(x) = \frac{1}{\lambda} \left(x - \operatorname{prox}_{\lambda h}(x)\right).$$

Therefore $x_*$ is a minimizer of $h$ $\Leftrightarrow$ $x_*$ is a minimizer of $M_{\lambda h}$.
Example: $h(x) = |x|$ with $H_\lambda(x) = M_{\lambda h}(x)$.
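In this example the envelope is the Huber function; a sketch (assuming matplotlib for the plot) that evaluates it via the soft-thresholding prox derived below:

```python
import numpy as np
import matplotlib.pyplot as plt

# Sketch: M_{lambda h} for h(x) = |x| is the Huber function; evaluate it
# via prox_{lambda h} (soft-thresholding) and the envelope formula above.
lam = 1.0
x = np.linspace(-3.0, 3.0, 601)

p = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)   # prox_{lam|.|}(x)
M = np.abs(p) + (p - x) ** 2 / (2.0 * lam)          # Moreau envelope

plt.plot(x, np.abs(x), label="h(x) = |x|")
plt.plot(x, M, label="Moreau envelope (Huber)")
plt.legend()
plt.show()
```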
Lemma 14 (Nonexpansivity of the proximal operator)

For a closed proper convex function $f: \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$, we have for all $x, y \in \mathbb{R}^n$

$$\|\operatorname{prox}_f(x) - \operatorname{prox}_f(y)\| \le \|x - y\|.$$

Proof. From the optimality conditions at the two points $x$ and $y$ we have

$$x - \operatorname{prox}_f(x) \in \partial f(\operatorname{prox}_f(x)) \quad \text{and} \quad y - \operatorname{prox}_f(y) \in \partial f(\operatorname{prox}_f(y)).$$

By applying monotonicity (see Lemma 3) we have

$$\big((x - \operatorname{prox}_f(x)) - (y - \operatorname{prox}_f(y))\big)^T (\operatorname{prox}_f(x) - \operatorname{prox}_f(y)) \ge 0.$$

Rearranging this and applying the Cauchy-Schwarz inequality yields

$$\|\operatorname{prox}_f(x) - \operatorname{prox}_f(y)\|^2 \le (x - y)^T (\operatorname{prox}_f(x) - \operatorname{prox}_f(y)) \le \|x - y\| \, \|\operatorname{prox}_f(x) - \operatorname{prox}_f(y)\|.$$
Examples of several proximal operators:

(1) $f(x) = 0$: $\operatorname{prox}_f(x) = x$.

(2) $f(x) = \lambda \|x\|_1$ with $\lambda > 0$:

$$\big[\operatorname{prox}_{\lambda \|\cdot\|_1}(x)\big]_i = \arg\min_{u \in \mathbb{R}} \left\{ \lambda |u| + \frac{1}{2} (u - x_i)^2 \right\}
= \begin{cases} x_i - \lambda & \text{if } x_i > \lambda, \\ 0 & \text{if } x_i \in [-\lambda, \lambda], \\ x_i + \lambda & \text{if } x_i < -\lambda, \end{cases}$$

which is known as soft-thresholding (see the sketch below).
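A direct vectorized sketch of this operator (NumPy, illustrative):

```python
import numpy as np

# Sketch: componentwise soft-thresholding, i.e. the prox of lam * ||x||_1.
def soft_threshold(x, lam):
    """prox_{lam ||.||_1}(x) = sign(x) * max(|x| - lam, 0), componentwise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

print(soft_threshold(np.array([-2.0, -0.3, 0.5, 3.0]), 1.0))
# components with |x_i| <= lam are zeroed; the rest shrink toward 0 by lam
```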
(3) $f(x) = \lambda \|x\|_0$, the number of nonzero components (non-convex):

$$\big[\operatorname{prox}_{\lambda \|\cdot\|_0}(x)\big]_i = \begin{cases} x_i & \text{if } |x_i| > \sqrt{2\lambda}, \\ \{0, x_i\} & \text{if } |x_i| = \sqrt{2\lambda}, \\ 0 & \text{if } |x_i| < \sqrt{2\lambda}, \end{cases}$$

which is known as hard thresholding. (In the middle case both $0$ and $x_i$ attain the minimum, so the operator is set-valued there.)

(4) $f(x) = I_\Omega(x)$:

$$\operatorname{prox}_{I_\Omega}(x) = \arg\min_u \left\{ I_\Omega(u) + \frac{1}{2} \|u - x\|^2 \right\} = \arg\min_{u \in \Omega} \|u - x\|,$$

which is simply the projection of $x$ onto the set $\Omega$ (see the sketch below).
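For instance (a sketch with an assumed box constraint, where the projection reduces to componentwise clipping):

```python
import numpy as np

# Sketch: for Omega = {u : lo <= u_i <= hi} (a box), the prox of I_Omega is
# the Euclidean projection onto Omega, computed by componentwise clipping.
def project_box(x, lo, hi):
    return np.clip(x, lo, hi)

print(project_box(np.array([-3.0, 0.2, 5.0]), -1.0, 1.0))   # [-1.  0.2  1.]
```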
9 Convergence rates
An important measure for evaluating algorithms is the rate of convergence to zero of some measure of error:

- for smooth $f$: $\|\nabla f(x_k)\|$;
- for nonsmooth convex $f$: $\operatorname{dist}(0, \partial f(x_k))$, the sequence of distances from $0$ to the subdifferential $\partial f(x_k)$.

Other error measures include

$$\|x_k - x_*\| \quad \text{and} \quad f(x_k) - f_*,$$

where $x_*$ is a solution and $f_*$ is the optimal value of $f$.
We denote by $\{\varphi_k\}$ the sequence of nonnegative scalars whose rate of convergence to $0$ we wish to find: $\varphi_k \ge 0$ and $\lim_{k \to \infty} \varphi_k = 0$.

We say that Q-linear convergence holds if there is some $\rho \in (0, 1)$ such that

$$\frac{\varphi_{k+1}}{\varphi_k} \le \rho$$

for all $k$ sufficiently large.

We say that R-linear convergence holds if there exist $\rho \in (0, 1)$ and $C > 0$ such that

$$\varphi_k \le C \rho^k, \quad k = 1, 2, \dots$$

Q-linear $\Rightarrow$ R-linear, but R-linear $\not\Rightarrow$ Q-linear. Example:

$$\varphi_k = \begin{cases} 2^{-k} & k \text{ even}, \\ 0 & k \text{ odd}, \end{cases}$$

satisfies the R-linear condition with $C = 1$ and $\rho = 1/2$, but is not Q-linear: the ratio $\varphi_{k+1}/\varphi_k$ is either $0$ or undefined (division by zero), so no fixed $\rho < 1$ bounds it for all large $k$. The sketch below illustrates this.
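```python
import numpy as np

# Sketch: the sequence above is R-linear (dominated by (1/2)^k) but its
# Q-ratios alternate between 0 and infinity, so it is not Q-linear.
k = np.arange(12)
phi = np.where(k % 2 == 0, 0.5 ** k, 0.0)
print(phi)                               # bounded by (1/2)^k  -> R-linear
with np.errstate(divide="ignore", invalid="ignore"):
    print(phi[1:] / phi[:-1])            # 0, inf, 0, inf, ... -> not Q-linear
```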
Sublinear convergence is, as its name suggests, slower than linear. Several varieties of sublinear convergence are

$$\varphi_k \le \frac{C}{\sqrt{k}}, \qquad \varphi_k \le \frac{C}{k}, \qquad \varphi_k \le \frac{C}{k^2}, \qquad k = 1, 2, \dots,$$

where in each case $C$ is some positive constant.

Superlinear convergence occurs when the constant $\rho \in (0, 1)$ in

$$\varphi_{k+1}/\varphi_k \le \rho$$

can be chosen arbitrarily close to $0$.
The sequence $\{\varphi_k\}$ converges Q-superlinearly to $0$ if

$$\lim_{k \to \infty} \frac{\varphi_{k+1}}{\varphi_k} = 0,$$

and Q-quadratically if

$$\frac{\varphi_{k+1}}{\varphi_k^2} \le C, \quad k = 1, 2, \dots,$$

for some sufficiently large constant $C$.

$\{\varphi_k\}$ converges R-superlinearly if there is a Q-superlinearly convergent sequence $\{\nu_k\}$ that dominates $\{\varphi_k\}$, that is,

$$0 \le \varphi_k \le \nu_k, \quad k = 1, 2, \dots$$

R-quadratic convergence is defined similarly.

Quadratic and superlinear rates are associated with higher-order methods, such as Newton and quasi-Newton methods.
9.1 Complexity

For a sequence satisfying the R-linear convergence condition, a sufficient condition for $\varphi_k \le \epsilon < C$ is

$$C \rho^k \le \epsilon.$$

By using the estimate $\log \rho \le \rho - 1$ for all $\rho \in (0, 1)$, we have that

$$C \rho^k \le \epsilon \;\Leftrightarrow\; k \log \rho \le \log(\epsilon/C) \;\Leftarrow\; k \ge \frac{\log(C/\epsilon)}{1 - \rho}.$$

The complexity is $O\!\left(\frac{\log(1/\epsilon)}{1 - \rho}\right)$.

For an algorithm that satisfies the sublinear rates, sufficient conditions for $\varphi_k \le \epsilon < C$ are:

$$C/\sqrt{k} \le \epsilon \;\Leftrightarrow\; k \ge (C/\epsilon)^2, \quad \text{complexity } O(1/\epsilon^2);$$
$$C/k \le \epsilon \;\Leftrightarrow\; k \ge C/\epsilon, \quad \text{complexity } O(1/\epsilon);$$
$$C/k^2 \le \epsilon \;\Leftrightarrow\; k \ge \sqrt{C/\epsilon}, \quad \text{complexity } O(1/\sqrt{\epsilon}).$$
For Q-quadratically convergent methods, the complexity is doubly logarithmic in $\epsilon$, that is,

$$O(\log \log (1/\epsilon)).$$

To show this, one can consider the sequence $e_k = C \varphi_k$, which satisfies $e_{k+1} \le e_k^2$, so that $e_k \le e_0^{2^k}$ once $e_0 < 1$.

Once the algorithm enters a neighborhood of quadratic convergence, just a few additional iterations are required for convergence to a solution of high accuracy, as the sketch below illustrates.
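For example (a sketch; Newton's method for $\sqrt{2}$ is a standard Q-quadratic example, not taken from the lecture):

```python
import numpy as np

# Sketch: Newton's method on x^2 - 2 = 0. The error roughly squares each
# iteration, so machine precision is reached in a handful of steps.
x = 1.5
for k in range(5):
    x = x - (x * x - 2.0) / (2.0 * x)    # Newton step
    print(k, abs(x - np.sqrt(2.0)))      # error ~ squares every iteration
```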
If f Rn 983041rarr R is differentiable and convex then
partf(x) = nablaf(x)
andf(y) ge f(x) +nablaf(x)T(y minus x)
for all xy isin Rn
Lipschitz continuously differentiable with constant L
983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042 for all xy isin Rn
If f is Lipschitz continuously differentiable with constant L thenby Taylorrsquos theorem we have
f(y)minus f(x)minusnablaf(x)T(y minus x) le L
2983042y minus x9830422
for all xy isin Rn
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 12 34
For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ
f(y) ge f(x) +nablaf(x)T(y minus x) +γ
2983042y minus x9830422
Lemma 7
Suppose f is strongly convex with modulus of convexity γ and nablaf isuniformly Lipschitz continuous with constant L We have forall xy that
γ
2983042y minus x9830422 le f(y)minus f(x)minusnablaf(x)T(y minus x) le L
2983042y minus x9830422
Condition number κ = Lγ (eg strictly convex quadratic f)
When f is twice continuously differentiable the inequalities inLemma 7 is equivalent to
γI ≼ nabla2f(x) ≼ LI for all x
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 13 34
Theorem 8
Let f be differentiable and strongly convex with modulus of convexityγ gt 0 Then the minimizer x983183 of f exists and is unique
Proof (i) Compactness of level set Show that for any point x0 thelevel set
x | f(x) le f(x0)
is closed and bounded and hence compact(ii) Existence Since f is continuous it attains its minimum on the
compact level set which is also the solution of minx f(x)(iii) Uniqueness Suppose for contradiction that the minimizer is
not unique so that we have two points x1983183 and x2
983183 that minimize f Byusing the strongly convex property we can prove
f
983061x1983183 + x2
983183
2
983062lt f(x1
983183) = f(x2983183)
This is a contradiction
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 14 34
6 Optimality conditions for smooth functions
Theorem 9 (First-order necessary condition)
If x983183 is a local minimizer of f and f is continuously differentiable inan open neighborhood of x983183 then nablaf(x983183) = 0
Proof Suppose for contradiction that nablaf(x983183) ∕= 0 Define the vectorp = minusnablaf(x983183) and note that pTnablaf(x983183) = minus983042nablaf(x983183)98304222 lt 0 Becausenablaf is continuous near x983183 there is a scalar T gt 0 such that
pTnablaf(x983183 + tp) lt 0 for all t isin [0 T ]
For any s isin (0 T ] we have by Taylorrsquos theorem that
f(x983183 + sp) = f(x983183) + spTnablaf(x983183 + ξsp) for some ξ isin (0 1)
Therefore f(x983183 + sp) lt f(x983183) for all s isin (0 T ] We have found adirection leading away from x983183 along which f decreases so x983183 is not alocal minimizer and we have a contradiction
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 15 34
Theorem 10 (Second-order necessary conditions)
If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite
Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that
pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]
By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that
f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1
2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)
As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34
Theorem 11 (Second-order sufficient conditions)
Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f
Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and
f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1
2pTnabla2f(x983183 + ξp)p
= f(x983183) +1
2pTnabla2f(x983183 + ξp)p
for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34
A point x is called a stationary point if
nablaf(x) = 0
A stationary point x is called a saddle point if there exist u and vsuch that
f(x+ αu) lt f(x) and f(x+ αv) gt f(x)
for all sufficiently small α gt 0
Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points
If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point
If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34
Theorem 12
(i) forall convex f any local minimizer x983183 is a global minimizer of f
(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f
Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is
x = λz+ (1minus λ)x983183 for some λ isin (0 1]
By the convexity property for f we have
f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)
Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34
(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have
nablaf(x983183)T(zminus x983183) =
d
dλf(x983183 + λ(zminus x983183))|λ=0
= limλrarr0+
f(x983183 + λ(zminus x983183))minus f(x983183)
λ
le limλrarr0+
λf(z) + (1minus λ)f(x983183)minus f(x983183)
λ
= f(z)minus f(x983183) lt 0
Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point
Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms
Numerical algorithms try to seek a point where nablaf vanishes
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34
7 Karush-Kuhn-Tucker conditions
Theorem 13 (KKT conditions)
Consider the minimization problem
minxisinRn
f(x) st gi(x) le 0 i = 1 m
where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions
Let x983183 be an optimal solution and assume Slaterrsquos condition
exist x isin Rn st gi(x) lt 0 i = 1 m
hold Then there exist λ1 middot middot middot λm ge 0 satisfying
0 isin partf(x983183) +
m983131
i=1
λipartgi(x983183) λigi(x983183) = 0 i = 1 m
If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34
Active gi(x) le 0
Inactive gi(x) le 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34
8 Moreau envelope and proximal operator
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is
Mλh(x) = infu
983069h(u) +
1
2λ983042uminus x9830422
983070
=1
λinfu
983069λh(u) +
1
2983042uminus x9830422
983070
For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is
proxf (x) = argminu
983069f(u) +
1
2983042uminus x9830422
983070
From the optimality condition (see Theorem 4) we have
0 isin partf(proxf (x)) + (proxf (x)minus x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34
For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if
x983183 = proxf (x983183)
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope
Mλh(x) = h(proxλh(x)) +1
2λ983042proxλh(x)minus x9830422
can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn
minusinfin lt Mλh(x) lt +infin
even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient
nablaMλh(x) =1
λ(xminus proxλh(x))
Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34
Example h(x) = |x| and Hλ(x) = Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34
Lemma 14 (Nonexpansivity of proximal operator)
For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn
983042proxf (x)minus proxf (y)983042 le 983042xminus y983042
Proof From the optimality conditions at two points x and y we have
xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))
By applying monotonicity (see Lemma 3) we have
((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0
Rearranging this and applying the Cauchy-Schwartz inequality yields
983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))
le 983042xminus y983042983042proxf (x)minus proxf (y)983042
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34
Examples of several proximal operators
(1) f(x) = 0proxf (x) = x
(2) f(x) = λ983042x9830421 with λ gt 0
983045proxλ983042middot9830421(x)
983046i= argmin
uisinR
983069λ|u|+ 1
2(uminus xi)
2
983070
=
983099983105983103
983105983101
xi minus λ if xi gt λ
0 if xi isin [minusλλ]
xi + λ if xi lt minusλ
which is known as soft-thresholding
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34
(3) f(x) = λ983042x9830420 the number of nonzero components non-convex
983045proxλ983042middot9830420(x)
983046i=
983099983105983103
983105983101
xi if |xi| gtradic2λ
0 xi if |xi| =radic2λ
0 if |xi| ltradic2λ
which is known as hard thresholding
(4) f(x) = IΩ(x)
proxIΩ(x) = argminu
983069IΩ(u) +
1
2983042uminus x9830422
983070= argmin
uisinΩ983042uminus x983042
which is simply the projection of x onto the set Ω
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34
9 Convergence rates
An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error
For smooth f 983042nablaf(xk)983042
For nonsmooth convex f
dist(0 partf(xk))
(the sequence of distances from 0 to the subdifferential partf(xk))
Other error measures include
983042xk minus x983183983042 and f(xk)minus f983183
where x983183 is a solution and f983183 is the optimal value of f
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34
We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim
krarrinfinφk = 0
We say that Q-linear convergence holds if there is some ρ isin (0 1)such that
φk+1
φkle ρ
for all k sufficiently large
We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that
φk le Cρk k = 1 2
Q-linear rArr R-linear But R-linear ⇏ Q-linear Example
φk =
9830832minusk k even
0 k odd
satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34
Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence
φk le Cradick φk le Ck φk le Ck2 k = 1 2
where in each case C is some positive constant
Superlinear convergence occurs when the constant ρ isin (0 1) in
φk+1φk le ρ
can be chosen arbitrarily close to 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34
The sequence φk converges Q-superlinearly to 0 if
limkrarrinfin
φk+1
φk= 0
Q-Quadratic ifφk+1
φ2k
le C k = 1 2
for some sufficiently large C
R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is
0 le φk le νk k = 1 2
R-quadratic convergence is defined similarly
Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34
91 Complexity
For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is
Cρk le 983171
By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that
Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)
The complexity is O( log(1983171)1minusρ )
For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C
Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)
Ck le 983171 hArr k ge C983171 complexity O(1983171)
Ck2 le 983171 hArr k ge983155
C983171 complexity O(1radic983171)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34
For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is
O(log log(1983171))
To show this one can consider the sequence
ek = Cφk
which satisfiesek+1 le e2k
Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34
For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ
f(y) ge f(x) +nablaf(x)T(y minus x) +γ
2983042y minus x9830422
Lemma 7
Suppose f is strongly convex with modulus of convexity γ and nablaf isuniformly Lipschitz continuous with constant L We have forall xy that
γ
2983042y minus x9830422 le f(y)minus f(x)minusnablaf(x)T(y minus x) le L
2983042y minus x9830422
Condition number κ = Lγ (eg strictly convex quadratic f)
When f is twice continuously differentiable the inequalities inLemma 7 is equivalent to
γI ≼ nabla2f(x) ≼ LI for all x
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 13 34
Theorem 8
Let f be differentiable and strongly convex with modulus of convexityγ gt 0 Then the minimizer x983183 of f exists and is unique
Proof (i) Compactness of level set Show that for any point x0 thelevel set
x | f(x) le f(x0)
is closed and bounded and hence compact(ii) Existence Since f is continuous it attains its minimum on the
compact level set which is also the solution of minx f(x)(iii) Uniqueness Suppose for contradiction that the minimizer is
not unique so that we have two points x1983183 and x2
983183 that minimize f Byusing the strongly convex property we can prove
f
983061x1983183 + x2
983183
2
983062lt f(x1
983183) = f(x2983183)
This is a contradiction
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 14 34
6 Optimality conditions for smooth functions
Theorem 9 (First-order necessary condition)
If x983183 is a local minimizer of f and f is continuously differentiable inan open neighborhood of x983183 then nablaf(x983183) = 0
Proof Suppose for contradiction that nablaf(x983183) ∕= 0 Define the vectorp = minusnablaf(x983183) and note that pTnablaf(x983183) = minus983042nablaf(x983183)98304222 lt 0 Becausenablaf is continuous near x983183 there is a scalar T gt 0 such that
pTnablaf(x983183 + tp) lt 0 for all t isin [0 T ]
For any s isin (0 T ] we have by Taylorrsquos theorem that
f(x983183 + sp) = f(x983183) + spTnablaf(x983183 + ξsp) for some ξ isin (0 1)
Therefore f(x983183 + sp) lt f(x983183) for all s isin (0 T ] We have found adirection leading away from x983183 along which f decreases so x983183 is not alocal minimizer and we have a contradiction
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 15 34
Theorem 10 (Second-order necessary conditions)
If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite
Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that
pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]
By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that
f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1
2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)
As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34
Theorem 11 (Second-order sufficient conditions)
Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f
Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and
f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1
2pTnabla2f(x983183 + ξp)p
= f(x983183) +1
2pTnabla2f(x983183 + ξp)p
for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34
A point x is called a stationary point if
nablaf(x) = 0
A stationary point x is called a saddle point if there exist u and vsuch that
f(x+ αu) lt f(x) and f(x+ αv) gt f(x)
for all sufficiently small α gt 0
Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points
If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point
If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34
Theorem 12
(i) forall convex f any local minimizer x983183 is a global minimizer of f
(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f
Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is
x = λz+ (1minus λ)x983183 for some λ isin (0 1]
By the convexity property for f we have
f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)
Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34
(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have
nablaf(x983183)T(zminus x983183) =
d
dλf(x983183 + λ(zminus x983183))|λ=0
= limλrarr0+
f(x983183 + λ(zminus x983183))minus f(x983183)
λ
le limλrarr0+
λf(z) + (1minus λ)f(x983183)minus f(x983183)
λ
= f(z)minus f(x983183) lt 0
Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point
Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms
Numerical algorithms try to seek a point where nablaf vanishes
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34
7 Karush-Kuhn-Tucker conditions
Theorem 13 (KKT conditions)
Consider the minimization problem
minxisinRn
f(x) st gi(x) le 0 i = 1 m
where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions
Let x983183 be an optimal solution and assume Slaterrsquos condition
exist x isin Rn st gi(x) lt 0 i = 1 m
hold Then there exist λ1 middot middot middot λm ge 0 satisfying
0 isin partf(x983183) +
m983131
i=1
λipartgi(x983183) λigi(x983183) = 0 i = 1 m
If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34
Active gi(x) le 0
Inactive gi(x) le 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34
8 Moreau envelope and proximal operator
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is
Mλh(x) = infu
983069h(u) +
1
2λ983042uminus x9830422
983070
=1
λinfu
983069λh(u) +
1
2983042uminus x9830422
983070
For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is
proxf (x) = argminu
983069f(u) +
1
2983042uminus x9830422
983070
From the optimality condition (see Theorem 4) we have
0 isin partf(proxf (x)) + (proxf (x)minus x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34
For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if
x983183 = proxf (x983183)
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope
Mλh(x) = h(proxλh(x)) +1
2λ983042proxλh(x)minus x9830422
can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn
minusinfin lt Mλh(x) lt +infin
even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient
nablaMλh(x) =1
λ(xminus proxλh(x))
Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34
Example h(x) = |x| and Hλ(x) = Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34
Lemma 14 (Nonexpansivity of proximal operator)
For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn
983042proxf (x)minus proxf (y)983042 le 983042xminus y983042
Proof From the optimality conditions at two points x and y we have
xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))
By applying monotonicity (see Lemma 3) we have
((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0
Rearranging this and applying the Cauchy-Schwartz inequality yields
983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))
le 983042xminus y983042983042proxf (x)minus proxf (y)983042
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34
Examples of several proximal operators
(1) f(x) = 0proxf (x) = x
(2) f(x) = λ983042x9830421 with λ gt 0
983045proxλ983042middot9830421(x)
983046i= argmin
uisinR
983069λ|u|+ 1
2(uminus xi)
2
983070
=
983099983105983103
983105983101
xi minus λ if xi gt λ
0 if xi isin [minusλλ]
xi + λ if xi lt minusλ
which is known as soft-thresholding
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34
(3) f(x) = λ983042x9830420 the number of nonzero components non-convex
983045proxλ983042middot9830420(x)
983046i=
983099983105983103
983105983101
xi if |xi| gtradic2λ
0 xi if |xi| =radic2λ
0 if |xi| ltradic2λ
which is known as hard thresholding
(4) f(x) = IΩ(x)
proxIΩ(x) = argminu
983069IΩ(u) +
1
2983042uminus x9830422
983070= argmin
uisinΩ983042uminus x983042
which is simply the projection of x onto the set Ω
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34
9 Convergence rates
An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error
For smooth f 983042nablaf(xk)983042
For nonsmooth convex f
dist(0 partf(xk))
(the sequence of distances from 0 to the subdifferential partf(xk))
Other error measures include
983042xk minus x983183983042 and f(xk)minus f983183
where x983183 is a solution and f983183 is the optimal value of f
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34
We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim
krarrinfinφk = 0
We say that Q-linear convergence holds if there is some ρ isin (0 1)such that
φk+1
φkle ρ
for all k sufficiently large
We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that
φk le Cρk k = 1 2
Q-linear rArr R-linear But R-linear ⇏ Q-linear Example
φk =
9830832minusk k even
0 k odd
satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34
Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence
φk le Cradick φk le Ck φk le Ck2 k = 1 2
where in each case C is some positive constant
Superlinear convergence occurs when the constant ρ isin (0 1) in
φk+1φk le ρ
can be chosen arbitrarily close to 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34
The sequence φk converges Q-superlinearly to 0 if
limkrarrinfin
φk+1
φk= 0
Q-Quadratic ifφk+1
φ2k
le C k = 1 2
for some sufficiently large C
R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is
0 le φk le νk k = 1 2
R-quadratic convergence is defined similarly
Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34
91 Complexity
For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is
Cρk le 983171
By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that
Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)
The complexity is O( log(1983171)1minusρ )
For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C
Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)
Ck le 983171 hArr k ge C983171 complexity O(1983171)
Ck2 le 983171 hArr k ge983155
C983171 complexity O(1radic983171)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34
For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is
O(log log(1983171))
To show this one can consider the sequence
ek = Cφk
which satisfiesek+1 le e2k
Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34
Theorem 8
Let f be differentiable and strongly convex with modulus of convexityγ gt 0 Then the minimizer x983183 of f exists and is unique
Proof (i) Compactness of level set Show that for any point x0 thelevel set
x | f(x) le f(x0)
is closed and bounded and hence compact(ii) Existence Since f is continuous it attains its minimum on the
compact level set which is also the solution of minx f(x)(iii) Uniqueness Suppose for contradiction that the minimizer is
not unique so that we have two points x1983183 and x2
983183 that minimize f Byusing the strongly convex property we can prove
f
983061x1983183 + x2
983183
2
983062lt f(x1
983183) = f(x2983183)
This is a contradiction
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 14 34
6 Optimality conditions for smooth functions
Theorem 9 (First-order necessary condition)
If x983183 is a local minimizer of f and f is continuously differentiable inan open neighborhood of x983183 then nablaf(x983183) = 0
Proof Suppose for contradiction that nablaf(x983183) ∕= 0 Define the vectorp = minusnablaf(x983183) and note that pTnablaf(x983183) = minus983042nablaf(x983183)98304222 lt 0 Becausenablaf is continuous near x983183 there is a scalar T gt 0 such that
pTnablaf(x983183 + tp) lt 0 for all t isin [0 T ]
For any s isin (0 T ] we have by Taylorrsquos theorem that
f(x983183 + sp) = f(x983183) + spTnablaf(x983183 + ξsp) for some ξ isin (0 1)
Therefore f(x983183 + sp) lt f(x983183) for all s isin (0 T ] We have found adirection leading away from x983183 along which f decreases so x983183 is not alocal minimizer and we have a contradiction
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 15 34
Theorem 10 (Second-order necessary conditions)
If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite
Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that
pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]
By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that
f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1
2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)
As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34
Theorem 11 (Second-order sufficient conditions)
Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f
Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and
f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1
2pTnabla2f(x983183 + ξp)p
= f(x983183) +1
2pTnabla2f(x983183 + ξp)p
for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34
A point x is called a stationary point if
nablaf(x) = 0
A stationary point x is called a saddle point if there exist u and vsuch that
f(x+ αu) lt f(x) and f(x+ αv) gt f(x)
for all sufficiently small α gt 0
Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points
If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point
If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34
Theorem 12
(i) forall convex f any local minimizer x983183 is a global minimizer of f
(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f
Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is
x = λz+ (1minus λ)x983183 for some λ isin (0 1]
By the convexity property for f we have
f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)
Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34
(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have
nablaf(x983183)T(zminus x983183) =
d
dλf(x983183 + λ(zminus x983183))|λ=0
= limλrarr0+
f(x983183 + λ(zminus x983183))minus f(x983183)
λ
le limλrarr0+
λf(z) + (1minus λ)f(x983183)minus f(x983183)
λ
= f(z)minus f(x983183) lt 0
Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point
Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms
Numerical algorithms try to seek a point where nablaf vanishes
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34
7 Karush-Kuhn-Tucker conditions
Theorem 13 (KKT conditions)
Consider the minimization problem
minxisinRn
f(x) st gi(x) le 0 i = 1 m
where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions
Let x983183 be an optimal solution and assume Slaterrsquos condition
exist x isin Rn st gi(x) lt 0 i = 1 m
hold Then there exist λ1 middot middot middot λm ge 0 satisfying
0 isin partf(x983183) +
m983131
i=1
λipartgi(x983183) λigi(x983183) = 0 i = 1 m
If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34
Active gi(x) le 0
Inactive gi(x) le 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34
8 Moreau envelope and proximal operator
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is
Mλh(x) = infu
983069h(u) +
1
2λ983042uminus x9830422
983070
=1
λinfu
983069λh(u) +
1
2983042uminus x9830422
983070
For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is
proxf (x) = argminu
983069f(u) +
1
2983042uminus x9830422
983070
From the optimality condition (see Theorem 4) we have
0 isin partf(proxf (x)) + (proxf (x)minus x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34
For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if
x983183 = proxf (x983183)
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope
Mλh(x) = h(proxλh(x)) +1
2λ983042proxλh(x)minus x9830422
can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn
minusinfin lt Mλh(x) lt +infin
even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient
nablaMλh(x) =1
λ(xminus proxλh(x))
Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34
Example h(x) = |x| and Hλ(x) = Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34
Lemma 14 (Nonexpansivity of proximal operator)
For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn
983042proxf (x)minus proxf (y)983042 le 983042xminus y983042
Proof From the optimality conditions at two points x and y we have
xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))
By applying monotonicity (see Lemma 3) we have
((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0
Rearranging this and applying the Cauchy-Schwartz inequality yields
983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))
le 983042xminus y983042983042proxf (x)minus proxf (y)983042
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34
Examples of several proximal operators
(1) f(x) = 0proxf (x) = x
(2) f(x) = λ983042x9830421 with λ gt 0
983045proxλ983042middot9830421(x)
983046i= argmin
uisinR
983069λ|u|+ 1
2(uminus xi)
2
983070
=
983099983105983103
983105983101
xi minus λ if xi gt λ
0 if xi isin [minusλλ]
xi + λ if xi lt minusλ
which is known as soft-thresholding
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34
(3) f(x) = λ983042x9830420 the number of nonzero components non-convex
983045proxλ983042middot9830420(x)
983046i=
983099983105983103
983105983101
xi if |xi| gtradic2λ
0 xi if |xi| =radic2λ
0 if |xi| ltradic2λ
which is known as hard thresholding
(4) f(x) = IΩ(x)
proxIΩ(x) = argminu
983069IΩ(u) +
1
2983042uminus x9830422
983070= argmin
uisinΩ983042uminus x983042
which is simply the projection of x onto the set Ω
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34
9 Convergence rates
An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error
For smooth f 983042nablaf(xk)983042
For nonsmooth convex f
dist(0 partf(xk))
(the sequence of distances from 0 to the subdifferential partf(xk))
Other error measures include
983042xk minus x983183983042 and f(xk)minus f983183
where x983183 is a solution and f983183 is the optimal value of f
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34
We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim
krarrinfinφk = 0
We say that Q-linear convergence holds if there is some ρ isin (0 1)such that
φk+1
φkle ρ
for all k sufficiently large
We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that
φk le Cρk k = 1 2
Q-linear rArr R-linear But R-linear ⇏ Q-linear Example
φk =
9830832minusk k even
0 k odd
satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34
Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence
φk le Cradick φk le Ck φk le Ck2 k = 1 2
where in each case C is some positive constant
Superlinear convergence occurs when the constant ρ isin (0 1) in
φk+1φk le ρ
can be chosen arbitrarily close to 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34
The sequence φk converges Q-superlinearly to 0 if
limkrarrinfin
φk+1
φk= 0
Q-Quadratic ifφk+1
φ2k
le C k = 1 2
for some sufficiently large C
R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is
0 le φk le νk k = 1 2
R-quadratic convergence is defined similarly
Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34
91 Complexity
For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is
Cρk le 983171
By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that
Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)
The complexity is O( log(1983171)1minusρ )
For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C
Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)
Ck le 983171 hArr k ge C983171 complexity O(1983171)
Ck2 le 983171 hArr k ge983155
C983171 complexity O(1radic983171)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34
For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is
O(log log(1983171))
To show this one can consider the sequence
ek = Cφk
which satisfiesek+1 le e2k
Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34
6 Optimality conditions for smooth functions
Theorem 9 (First-order necessary condition)
If x983183 is a local minimizer of f and f is continuously differentiable inan open neighborhood of x983183 then nablaf(x983183) = 0
Proof Suppose for contradiction that nablaf(x983183) ∕= 0 Define the vectorp = minusnablaf(x983183) and note that pTnablaf(x983183) = minus983042nablaf(x983183)98304222 lt 0 Becausenablaf is continuous near x983183 there is a scalar T gt 0 such that
pTnablaf(x983183 + tp) lt 0 for all t isin [0 T ]
For any s isin (0 T ] we have by Taylorrsquos theorem that
f(x983183 + sp) = f(x983183) + spTnablaf(x983183 + ξsp) for some ξ isin (0 1)
Therefore f(x983183 + sp) lt f(x983183) for all s isin (0 T ] We have found adirection leading away from x983183 along which f decreases so x983183 is not alocal minimizer and we have a contradiction
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 15 34
Theorem 10 (Second-order necessary conditions)
If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite
Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that
pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]
By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that
f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1
2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)
As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34
Theorem 11 (Second-order sufficient conditions)
Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f
Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and
f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1
2pTnabla2f(x983183 + ξp)p
= f(x983183) +1
2pTnabla2f(x983183 + ξp)p
for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34
A point x is called a stationary point if
nablaf(x) = 0
A stationary point x is called a saddle point if there exist u and vsuch that
f(x+ αu) lt f(x) and f(x+ αv) gt f(x)
for all sufficiently small α gt 0
Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points
If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point
If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34
Theorem 12
(i) forall convex f any local minimizer x983183 is a global minimizer of f
(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f
Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is
x = λz+ (1minus λ)x983183 for some λ isin (0 1]
By the convexity property for f we have
f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)
Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34
(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have
nablaf(x983183)T(zminus x983183) =
d
dλf(x983183 + λ(zminus x983183))|λ=0
= limλrarr0+
f(x983183 + λ(zminus x983183))minus f(x983183)
λ
le limλrarr0+
λf(z) + (1minus λ)f(x983183)minus f(x983183)
λ
= f(z)minus f(x983183) lt 0
Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point
Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms
Numerical algorithms try to seek a point where nablaf vanishes
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34
7 Karush-Kuhn-Tucker conditions
Theorem 13 (KKT conditions)
Consider the minimization problem
minxisinRn
f(x) st gi(x) le 0 i = 1 m
where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions
Let x983183 be an optimal solution and assume Slaterrsquos condition
exist x isin Rn st gi(x) lt 0 i = 1 m
hold Then there exist λ1 middot middot middot λm ge 0 satisfying
0 isin partf(x983183) +
m983131
i=1
λipartgi(x983183) λigi(x983183) = 0 i = 1 m
If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34
Active gi(x) le 0
Inactive gi(x) le 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34
8 Moreau envelope and proximal operator
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is
Mλh(x) = infu
983069h(u) +
1
2λ983042uminus x9830422
983070
=1
λinfu
983069λh(u) +
1
2983042uminus x9830422
983070
For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is
proxf (x) = argminu
983069f(u) +
1
2983042uminus x9830422
983070
From the optimality condition (see Theorem 4) we have
0 isin partf(proxf (x)) + (proxf (x)minus x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34
For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if
x983183 = proxf (x983183)
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope
Mλh(x) = h(proxλh(x)) +1
2λ983042proxλh(x)minus x9830422
can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn
minusinfin lt Mλh(x) lt +infin
even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient
nablaMλh(x) =1
λ(xminus proxλh(x))
Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34
Example h(x) = |x| and Hλ(x) = Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34
Lemma 14 (Nonexpansivity of proximal operator)
For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn
983042proxf (x)minus proxf (y)983042 le 983042xminus y983042
Proof From the optimality conditions at two points x and y we have
xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))
By applying monotonicity (see Lemma 3) we have
((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0
Rearranging this and applying the Cauchy-Schwartz inequality yields
983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))
le 983042xminus y983042983042proxf (x)minus proxf (y)983042
Examples of several proximal operators
(1) $f(x) = 0$: $\mathrm{prox}_f(x) = x$.
(2) $f(x) = \lambda \|x\|_1$ with $\lambda > 0$:
\[
\big[\mathrm{prox}_{\lambda \|\cdot\|_1}(x)\big]_i
= \arg\min_{u \in \mathbb{R}} \left\{ \lambda |u| + \frac{1}{2} (u - x_i)^2 \right\}
= \begin{cases}
x_i - \lambda & \text{if } x_i > \lambda, \\
0 & \text{if } x_i \in [-\lambda, \lambda], \\
x_i + \lambda & \text{if } x_i < -\lambda,
\end{cases}
\]
which is known as soft-thresholding.
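A minimal NumPy sketch of soft-thresholding (our own illustration; the function name is ours):

import numpy as np

def soft_threshold(x, lam):
    # componentwise prox of lam * ||.||_1 (the three cases above, vectorized)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

print(soft_threshold(np.array([1.5, 0.3, -0.7, -2.0]), lam=0.5))
# approximately [ 1.0  0.0  -0.2  -1.5 ]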
(3) $f(x) = \lambda \|x\|_0$, the number of nonzero components (non-convex):
\[
\big[\mathrm{prox}_{\lambda \|\cdot\|_0}(x)\big]_i =
\begin{cases}
x_i & \text{if } |x_i| > \sqrt{2\lambda}, \\
\{0, x_i\} & \text{if } |x_i| = \sqrt{2\lambda}, \\
0 & \text{if } |x_i| < \sqrt{2\lambda},
\end{cases}
\]
which is known as hard thresholding (set-valued at $|x_i| = \sqrt{2\lambda}$, where both choices attain the minimum).
(4) $f(x) = I_\Omega(x)$:
\[
\mathrm{prox}_{I_\Omega}(x) = \arg\min_u \left\{ I_\Omega(u) + \frac{1}{2} \|u - x\|^2 \right\} = \arg\min_{u \in \Omega} \|u - x\|,
\]
which is simply the projection of $x$ onto the set $\Omega$.
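For instance, when $\Omega$ is a box $\{z : l \le z_i \le u\}$, the projection is a componentwise clip (a sketch under that assumption, again using NumPy):

import numpy as np

def project_box(x, lo, hi):
    # prox of the indicator of the box = Euclidean projection onto it
    return np.clip(x, lo, hi)

print(project_box(np.array([-1.2, 0.4, 3.0]), lo=0.0, hi=1.0))
# approximately [ 0.0  0.4  1.0 ]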
9 Convergence rates
An important measure for evaluating algorithms is the rate of convergence to zero of some measure of error.
For smooth $f$: $\|\nabla f(x_k)\|$.
For nonsmooth convex $f$: $\mathrm{dist}(0, \partial f(x_k))$, the sequence of distances from $0$ to the subdifferential $\partial f(x_k)$.
Other error measures include
\[
\|x_k - x_\star\| \quad \text{and} \quad f(x_k) - f_\star,
\]
where $x_\star$ is a solution and $f_\star$ is the optimal value of $f$.
We denote by $\{\varphi_k\}$ the sequence of nonnegative scalars whose rate of convergence to $0$ we wish to find: $\varphi_k \ge 0$ and $\lim_{k \to \infty} \varphi_k = 0$.
We say that Q-linear convergence holds if there is some $\rho \in (0, 1)$ such that
\[
\frac{\varphi_{k+1}}{\varphi_k} \le \rho
\]
for all $k$ sufficiently large.
We say that R-linear convergence holds if there exist $\rho \in (0, 1)$ and $C > 0$ such that
\[
\varphi_k \le C \rho^k, \quad k = 1, 2, \dots
\]
Q-linear $\Rightarrow$ R-linear, but R-linear $\not\Rightarrow$ Q-linear. Example:
\[
\varphi_k = \begin{cases} 2^{-k} & k \text{ even}, \\ 0 & k \text{ odd}, \end{cases}
\]
satisfies the R-linear condition with $C = 1$ and $\rho = 1/2$ but is not Q-linear.
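A two-line check of this example (our own sketch in Python):

phi = [2.0 ** (-k) if k % 2 == 0 else 0.0 for k in range(1, 11)]
# R-linear bound phi_k <= C * rho^k holds with C = 1, rho = 1/2:
print(all(p <= 0.5 ** k for k, p in enumerate(phi, start=1)))  # True
# Q-linear fails: phi_{k+1}/phi_k alternates between 0 and division by zero,
# so no single rho in (0, 1) bounds the ratio for all large k.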
Sublinear convergence is, as its name suggests, slower than linear. Several varieties of sublinear convergence:
\[
\varphi_k \le \frac{C}{\sqrt{k}}, \qquad \varphi_k \le \frac{C}{k}, \qquad \varphi_k \le \frac{C}{k^2}, \quad k = 1, 2, \dots,
\]
where in each case $C$ is some positive constant.
Superlinear convergence occurs when the constant $\rho \in (0, 1)$ in $\varphi_{k+1}/\varphi_k \le \rho$ can be chosen arbitrarily close to $0$.
The sequence $\{\varphi_k\}$ converges Q-superlinearly to $0$ if
\[
\lim_{k \to \infty} \frac{\varphi_{k+1}}{\varphi_k} = 0.
\]
It converges Q-quadratically if
\[
\frac{\varphi_{k+1}}{\varphi_k^2} \le C, \quad k = 1, 2, \dots,
\]
for some positive constant $C$.
It converges R-superlinearly if there is a Q-superlinearly convergent sequence $\{\nu_k\}$ that dominates $\{\varphi_k\}$, that is,
\[
0 \le \varphi_k \le \nu_k, \quad k = 1, 2, \dots
\]
R-quadratic convergence is defined similarly.
Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods
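To see Q-quadratic convergence concretely, here is a small Python sketch (our own illustration, not from the slides): Newton's method applied to $f(x) = x^2 - 2$, whose error roughly squares at each step:

def newton_sqrt2(x0, iters=5):
    # Newton's method on f(x) = x^2 - 2; the root is sqrt(2)
    x = x0
    for _ in range(iters):
        x = x - (x * x - 2.0) / (2.0 * x)
        print(abs(x - 2.0 ** 0.5))  # phi_k: about 8.6e-2, 2.5e-3, 2.1e-6, 1.6e-12, ...
    return x

newton_sqrt2(2.0)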
9.1 Complexity
For a sequence satisfying the R-linear convergence condition, a sufficient condition for $\varphi_k \le \epsilon < C$ is
\[
C \rho^k \le \epsilon.
\]
By using the estimate $\log \rho \le \rho - 1$ for all $\rho \in (0, 1)$, we have that
\[
C \rho^k \le \epsilon \;\Leftrightarrow\; k \log \rho \le \log(\epsilon / C) \;\Leftarrow\; k \ge \frac{\log(C / \epsilon)}{1 - \rho}.
\]
The complexity is $O\!\left( \frac{\log(1/\epsilon)}{1 - \rho} \right)$.
For an algorithm that satisfies the sublinear rates, sufficient conditions for $\varphi_k \le \epsilon < C$ are:
\[
C / \sqrt{k} \le \epsilon \;\Leftrightarrow\; k \ge (C/\epsilon)^2, \qquad \text{complexity } O(1/\epsilon^2);
\]
\[
C / k \le \epsilon \;\Leftrightarrow\; k \ge C/\epsilon, \qquad \text{complexity } O(1/\epsilon);
\]
\[
C / k^2 \le \epsilon \;\Leftrightarrow\; k \ge \sqrt{C/\epsilon}, \qquad \text{complexity } O(1/\sqrt{\epsilon}).
\]
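As a quick numeric illustration (our own, with arbitrary values $C = 10$ and $\epsilon = 10^{-3}$), the three bounds give very different iteration counts:

import math

C, eps = 10.0, 1e-3
print(math.ceil((C / eps) ** 2))      # C/sqrt(k) <= eps: about 1e8 iterations
print(math.ceil(C / eps))             # C/k <= eps: 10000 iterations
print(math.ceil(math.sqrt(C / eps)))  # C/k^2 <= eps: 100 iterations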
For Q-quadratically convergent methods, the complexity is doubly logarithmic in $\epsilon$, that is, $O(\log \log (1/\epsilon))$.
To show this, one can consider the sequence
\[
e_k = C \varphi_k,
\]
which satisfies
\[
e_{k+1} \le e_k^2.
\]
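Filling in the remaining step (a standard computation, assuming $e_0 < 1$): iterating $e_{k+1} \le e_k^2$ gives $e_k \le e_0^{2^k}$, so $e_k \le \epsilon$ as soon as
\[
2^k \ge \frac{\log(1/\epsilon)}{\log(1/e_0)}, \quad \text{that is,} \quad k \ge \log_2 \frac{\log(1/\epsilon)}{\log(1/e_0)} = O(\log \log (1/\epsilon)).
\]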
Once the algorithm enters a neighborhood of quadratic convergence, just a few additional iterations are required to reach a solution of high accuracy.
Theorem 10 (Second-order necessary conditions)
If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite
Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that
pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]
By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that
f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1
2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)
As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34
Theorem 11 (Second-order sufficient conditions)
Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f
Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and
f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1
2pTnabla2f(x983183 + ξp)p
= f(x983183) +1
2pTnabla2f(x983183 + ξp)p
for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34
A point x is called a stationary point if
nablaf(x) = 0
A stationary point x is called a saddle point if there exist u and vsuch that
f(x+ αu) lt f(x) and f(x+ αv) gt f(x)
for all sufficiently small α gt 0
Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points
If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point
If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34
Theorem 12
(i) forall convex f any local minimizer x983183 is a global minimizer of f
(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f
Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is
x = λz+ (1minus λ)x983183 for some λ isin (0 1]
By the convexity property for f we have
f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)
Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34
(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have
nablaf(x983183)T(zminus x983183) =
d
dλf(x983183 + λ(zminus x983183))|λ=0
= limλrarr0+
f(x983183 + λ(zminus x983183))minus f(x983183)
λ
le limλrarr0+
λf(z) + (1minus λ)f(x983183)minus f(x983183)
λ
= f(z)minus f(x983183) lt 0
Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point
Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms
Numerical algorithms try to seek a point where nablaf vanishes
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34
7 Karush-Kuhn-Tucker conditions
Theorem 13 (KKT conditions)
Consider the minimization problem
minxisinRn
f(x) st gi(x) le 0 i = 1 m
where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions
Let x983183 be an optimal solution and assume Slaterrsquos condition
exist x isin Rn st gi(x) lt 0 i = 1 m
hold Then there exist λ1 middot middot middot λm ge 0 satisfying
0 isin partf(x983183) +
m983131
i=1
λipartgi(x983183) λigi(x983183) = 0 i = 1 m
If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34
Active gi(x) le 0
Inactive gi(x) le 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34
8 Moreau envelope and proximal operator
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is
Mλh(x) = infu
983069h(u) +
1
2λ983042uminus x9830422
983070
=1
λinfu
983069λh(u) +
1
2983042uminus x9830422
983070
For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is
proxf (x) = argminu
983069f(u) +
1
2983042uminus x9830422
983070
From the optimality condition (see Theorem 4) we have
0 isin partf(proxf (x)) + (proxf (x)minus x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34
For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if
x983183 = proxf (x983183)
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope
Mλh(x) = h(proxλh(x)) +1
2λ983042proxλh(x)minus x9830422
can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn
minusinfin lt Mλh(x) lt +infin
even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient
nablaMλh(x) =1
λ(xminus proxλh(x))
Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34
Example h(x) = |x| and Hλ(x) = Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34
Lemma 14 (Nonexpansivity of proximal operator)
For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn
983042proxf (x)minus proxf (y)983042 le 983042xminus y983042
Proof From the optimality conditions at two points x and y we have
xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))
By applying monotonicity (see Lemma 3) we have
((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0
Rearranging this and applying the Cauchy-Schwartz inequality yields
983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))
le 983042xminus y983042983042proxf (x)minus proxf (y)983042
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34
Examples of several proximal operators
(1) f(x) = 0proxf (x) = x
(2) f(x) = λ983042x9830421 with λ gt 0
983045proxλ983042middot9830421(x)
983046i= argmin
uisinR
983069λ|u|+ 1
2(uminus xi)
2
983070
=
983099983105983103
983105983101
xi minus λ if xi gt λ
0 if xi isin [minusλλ]
xi + λ if xi lt minusλ
which is known as soft-thresholding
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34
(3) f(x) = λ983042x9830420 the number of nonzero components non-convex
983045proxλ983042middot9830420(x)
983046i=
983099983105983103
983105983101
xi if |xi| gtradic2λ
0 xi if |xi| =radic2λ
0 if |xi| ltradic2λ
which is known as hard thresholding
(4) f(x) = IΩ(x)
proxIΩ(x) = argminu
983069IΩ(u) +
1
2983042uminus x9830422
983070= argmin
uisinΩ983042uminus x983042
which is simply the projection of x onto the set Ω
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34
9 Convergence rates
An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error
For smooth f 983042nablaf(xk)983042
For nonsmooth convex f
dist(0 partf(xk))
(the sequence of distances from 0 to the subdifferential partf(xk))
Other error measures include
983042xk minus x983183983042 and f(xk)minus f983183
where x983183 is a solution and f983183 is the optimal value of f
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34
We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim
krarrinfinφk = 0
We say that Q-linear convergence holds if there is some ρ isin (0 1)such that
φk+1
φkle ρ
for all k sufficiently large
We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that
φk le Cρk k = 1 2
Q-linear rArr R-linear But R-linear ⇏ Q-linear Example
φk =
9830832minusk k even
0 k odd
satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34
Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence
φk le Cradick φk le Ck φk le Ck2 k = 1 2
where in each case C is some positive constant
Superlinear convergence occurs when the constant ρ isin (0 1) in
φk+1φk le ρ
can be chosen arbitrarily close to 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34
The sequence φk converges Q-superlinearly to 0 if
limkrarrinfin
φk+1
φk= 0
Q-Quadratic ifφk+1
φ2k
le C k = 1 2
for some sufficiently large C
R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is
0 le φk le νk k = 1 2
R-quadratic convergence is defined similarly
Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34
91 Complexity
For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is
Cρk le 983171
By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that
Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)
The complexity is O( log(1983171)1minusρ )
For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C
Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)
Ck le 983171 hArr k ge C983171 complexity O(1983171)
Ck2 le 983171 hArr k ge983155
C983171 complexity O(1radic983171)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34
For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is
O(log log(1983171))
To show this one can consider the sequence
ek = Cφk
which satisfiesek+1 le e2k
Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34
Theorem 11 (Second-order sufficient conditions)
Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f
Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and
f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1
2pTnabla2f(x983183 + ξp)p
= f(x983183) +1
2pTnabla2f(x983183 + ξp)p
for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34
A point x is called a stationary point if
nablaf(x) = 0
A stationary point x is called a saddle point if there exist u and vsuch that
f(x+ αu) lt f(x) and f(x+ αv) gt f(x)
for all sufficiently small α gt 0
Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points
If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point
If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34
Theorem 12
(i) forall convex f any local minimizer x983183 is a global minimizer of f
(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f
Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is
x = λz+ (1minus λ)x983183 for some λ isin (0 1]
By the convexity property for f we have
f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)
Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34
(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have
nablaf(x983183)T(zminus x983183) =
d
dλf(x983183 + λ(zminus x983183))|λ=0
= limλrarr0+
f(x983183 + λ(zminus x983183))minus f(x983183)
λ
le limλrarr0+
λf(z) + (1minus λ)f(x983183)minus f(x983183)
λ
= f(z)minus f(x983183) lt 0
Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point
Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms
Numerical algorithms try to seek a point where nablaf vanishes
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34
7 Karush-Kuhn-Tucker conditions
Theorem 13 (KKT conditions)
Consider the minimization problem
minxisinRn
f(x) st gi(x) le 0 i = 1 m
where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions
Let x983183 be an optimal solution and assume Slaterrsquos condition
exist x isin Rn st gi(x) lt 0 i = 1 m
hold Then there exist λ1 middot middot middot λm ge 0 satisfying
0 isin partf(x983183) +
m983131
i=1
λipartgi(x983183) λigi(x983183) = 0 i = 1 m
If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34
Active gi(x) le 0
Inactive gi(x) le 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34
8 Moreau envelope and proximal operator
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is
Mλh(x) = infu
983069h(u) +
1
2λ983042uminus x9830422
983070
=1
λinfu
983069λh(u) +
1
2983042uminus x9830422
983070
For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is
proxf (x) = argminu
983069f(u) +
1
2983042uminus x9830422
983070
From the optimality condition (see Theorem 4) we have
0 isin partf(proxf (x)) + (proxf (x)minus x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34
For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if
x983183 = proxf (x983183)
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope
Mλh(x) = h(proxλh(x)) +1
2λ983042proxλh(x)minus x9830422
can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn
minusinfin lt Mλh(x) lt +infin
even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient
nablaMλh(x) =1
λ(xminus proxλh(x))
Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34
Example h(x) = |x| and Hλ(x) = Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34
Lemma 14 (Nonexpansivity of proximal operator)
For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn
983042proxf (x)minus proxf (y)983042 le 983042xminus y983042
Proof From the optimality conditions at two points x and y we have
xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))
By applying monotonicity (see Lemma 3) we have
((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0
Rearranging this and applying the Cauchy-Schwartz inequality yields
983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))
le 983042xminus y983042983042proxf (x)minus proxf (y)983042
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34
Examples of several proximal operators
(1) f(x) = 0proxf (x) = x
(2) f(x) = λ983042x9830421 with λ gt 0
983045proxλ983042middot9830421(x)
983046i= argmin
uisinR
983069λ|u|+ 1
2(uminus xi)
2
983070
=
983099983105983103
983105983101
xi minus λ if xi gt λ
0 if xi isin [minusλλ]
xi + λ if xi lt minusλ
which is known as soft-thresholding
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34
(3) f(x) = λ983042x9830420 the number of nonzero components non-convex
983045proxλ983042middot9830420(x)
983046i=
983099983105983103
983105983101
xi if |xi| gtradic2λ
0 xi if |xi| =radic2λ
0 if |xi| ltradic2λ
which is known as hard thresholding
(4) f(x) = IΩ(x)
proxIΩ(x) = argminu
983069IΩ(u) +
1
2983042uminus x9830422
983070= argmin
uisinΩ983042uminus x983042
which is simply the projection of x onto the set Ω
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34
9 Convergence rates
An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error
For smooth f 983042nablaf(xk)983042
For nonsmooth convex f
dist(0 partf(xk))
(the sequence of distances from 0 to the subdifferential partf(xk))
Other error measures include
983042xk minus x983183983042 and f(xk)minus f983183
where x983183 is a solution and f983183 is the optimal value of f
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34
We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim
krarrinfinφk = 0
We say that Q-linear convergence holds if there is some ρ isin (0 1)such that
φk+1
φkle ρ
for all k sufficiently large
We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that
φk le Cρk k = 1 2
Q-linear rArr R-linear But R-linear ⇏ Q-linear Example
φk =
9830832minusk k even
0 k odd
satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34
Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence
φk le Cradick φk le Ck φk le Ck2 k = 1 2
where in each case C is some positive constant
Superlinear convergence occurs when the constant ρ isin (0 1) in
φk+1φk le ρ
can be chosen arbitrarily close to 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34
The sequence φk converges Q-superlinearly to 0 if
limkrarrinfin
φk+1
φk= 0
Q-Quadratic ifφk+1
φ2k
le C k = 1 2
for some sufficiently large C
R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is
0 le φk le νk k = 1 2
R-quadratic convergence is defined similarly
Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34
91 Complexity
For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is
Cρk le 983171
By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that
Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)
The complexity is O( log(1983171)1minusρ )
For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C
Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)
Ck le 983171 hArr k ge C983171 complexity O(1983171)
Ck2 le 983171 hArr k ge983155
C983171 complexity O(1radic983171)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34
For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is
O(log log(1983171))
To show this one can consider the sequence
ek = Cφk
which satisfiesek+1 le e2k
Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34
A point x is called a stationary point if
nablaf(x) = 0
A stationary point x is called a saddle point if there exist u and vsuch that
f(x+ αu) lt f(x) and f(x+ αv) gt f(x)
for all sufficiently small α gt 0
Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points
If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point
If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34
Theorem 12
(i) forall convex f any local minimizer x983183 is a global minimizer of f
(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f
Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is
x = λz+ (1minus λ)x983183 for some λ isin (0 1]
By the convexity property for f we have
f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)
Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34
(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have
nablaf(x983183)T(zminus x983183) =
d
dλf(x983183 + λ(zminus x983183))|λ=0
= limλrarr0+
f(x983183 + λ(zminus x983183))minus f(x983183)
λ
le limλrarr0+
λf(z) + (1minus λ)f(x983183)minus f(x983183)
λ
= f(z)minus f(x983183) lt 0
Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point
Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms
Numerical algorithms try to seek a point where nablaf vanishes
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34
7 Karush-Kuhn-Tucker conditions
Theorem 13 (KKT conditions)
Consider the minimization problem
minxisinRn
f(x) st gi(x) le 0 i = 1 m
where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions
Let x983183 be an optimal solution and assume Slaterrsquos condition
exist x isin Rn st gi(x) lt 0 i = 1 m
hold Then there exist λ1 middot middot middot λm ge 0 satisfying
0 isin partf(x983183) +
m983131
i=1
λipartgi(x983183) λigi(x983183) = 0 i = 1 m
If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34
Active gi(x) le 0
Inactive gi(x) le 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34
8 Moreau envelope and proximal operator
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is
Mλh(x) = infu
983069h(u) +
1
2λ983042uminus x9830422
983070
=1
λinfu
983069λh(u) +
1
2983042uminus x9830422
983070
For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is
proxf (x) = argminu
983069f(u) +
1
2983042uminus x9830422
983070
From the optimality condition (see Theorem 4) we have
0 isin partf(proxf (x)) + (proxf (x)minus x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34
For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if
x983183 = proxf (x983183)
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope
Mλh(x) = h(proxλh(x)) +1
2λ983042proxλh(x)minus x9830422
can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn
minusinfin lt Mλh(x) lt +infin
even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient
nablaMλh(x) =1
λ(xminus proxλh(x))
Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34
Example h(x) = |x| and Hλ(x) = Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34
Lemma 14 (Nonexpansivity of proximal operator)
For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn
983042proxf (x)minus proxf (y)983042 le 983042xminus y983042
Proof From the optimality conditions at two points x and y we have
xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))
By applying monotonicity (see Lemma 3) we have
((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0
Rearranging this and applying the Cauchy-Schwartz inequality yields
983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))
le 983042xminus y983042983042proxf (x)minus proxf (y)983042
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34
Examples of several proximal operators
(1) f(x) = 0proxf (x) = x
(2) f(x) = λ983042x9830421 with λ gt 0
983045proxλ983042middot9830421(x)
983046i= argmin
uisinR
983069λ|u|+ 1
2(uminus xi)
2
983070
=
983099983105983103
983105983101
xi minus λ if xi gt λ
0 if xi isin [minusλλ]
xi + λ if xi lt minusλ
which is known as soft-thresholding
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34
(3) f(x) = λ983042x9830420 the number of nonzero components non-convex
983045proxλ983042middot9830420(x)
983046i=
983099983105983103
983105983101
xi if |xi| gtradic2λ
0 xi if |xi| =radic2λ
0 if |xi| ltradic2λ
which is known as hard thresholding
(4) f(x) = IΩ(x)
proxIΩ(x) = argminu
983069IΩ(u) +
1
2983042uminus x9830422
983070= argmin
uisinΩ983042uminus x983042
which is simply the projection of x onto the set Ω
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34
9 Convergence rates
An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error
For smooth f 983042nablaf(xk)983042
For nonsmooth convex f
dist(0 partf(xk))
(the sequence of distances from 0 to the subdifferential partf(xk))
Other error measures include
983042xk minus x983183983042 and f(xk)minus f983183
where x983183 is a solution and f983183 is the optimal value of f
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34
We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim
krarrinfinφk = 0
We say that Q-linear convergence holds if there is some ρ isin (0 1)such that
φk+1
φkle ρ
for all k sufficiently large
We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that
φk le Cρk k = 1 2
Q-linear rArr R-linear But R-linear ⇏ Q-linear Example
φk =
9830832minusk k even
0 k odd
satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34
Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence
φk le Cradick φk le Ck φk le Ck2 k = 1 2
where in each case C is some positive constant
Superlinear convergence occurs when the constant ρ isin (0 1) in
φk+1φk le ρ
can be chosen arbitrarily close to 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34
The sequence φk converges Q-superlinearly to 0 if
limkrarrinfin
φk+1
φk= 0
Q-Quadratic ifφk+1
φ2k
le C k = 1 2
for some sufficiently large C
R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is
0 le φk le νk k = 1 2
R-quadratic convergence is defined similarly
Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34
91 Complexity
For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is
Cρk le 983171
By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that
Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)
The complexity is O( log(1983171)1minusρ )
For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C
Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)
Ck le 983171 hArr k ge C983171 complexity O(1983171)
Ck2 le 983171 hArr k ge983155
C983171 complexity O(1radic983171)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34
For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is
O(log log(1983171))
To show this one can consider the sequence
ek = Cφk
which satisfiesek+1 le e2k
Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34
Theorem 12
(i) forall convex f any local minimizer x983183 is a global minimizer of f
(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f
Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is
x = λz+ (1minus λ)x983183 for some λ isin (0 1]
By the convexity property for f we have
f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)
Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34
(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have
nablaf(x983183)T(zminus x983183) =
d
dλf(x983183 + λ(zminus x983183))|λ=0
= limλrarr0+
f(x983183 + λ(zminus x983183))minus f(x983183)
λ
le limλrarr0+
λf(z) + (1minus λ)f(x983183)minus f(x983183)
λ
= f(z)minus f(x983183) lt 0
Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point
Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms
Numerical algorithms try to seek a point where nablaf vanishes
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34
7 Karush-Kuhn-Tucker conditions
Theorem 13 (KKT conditions)
Consider the minimization problem
minxisinRn
f(x) st gi(x) le 0 i = 1 m
where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions
Let x983183 be an optimal solution and assume Slaterrsquos condition
exist x isin Rn st gi(x) lt 0 i = 1 m
hold Then there exist λ1 middot middot middot λm ge 0 satisfying
0 isin partf(x983183) +
m983131
i=1
λipartgi(x983183) λigi(x983183) = 0 i = 1 m
If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34
Active gi(x) le 0
Inactive gi(x) le 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34
8 Moreau envelope and proximal operator
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is
Mλh(x) = infu
983069h(u) +
1
2λ983042uminus x9830422
983070
=1
λinfu
983069λh(u) +
1
2983042uminus x9830422
983070
For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is
proxf (x) = argminu
983069f(u) +
1
2983042uminus x9830422
983070
From the optimality condition (see Theorem 4) we have
0 isin partf(proxf (x)) + (proxf (x)minus x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34
For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if
x983183 = proxf (x983183)
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope
Mλh(x) = h(proxλh(x)) +1
2λ983042proxλh(x)minus x9830422
can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn
minusinfin lt Mλh(x) lt +infin
even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient
nablaMλh(x) =1
λ(xminus proxλh(x))
Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34
Example h(x) = |x| and Hλ(x) = Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34
Lemma 14 (Nonexpansivity of proximal operator)
For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn
983042proxf (x)minus proxf (y)983042 le 983042xminus y983042
Proof From the optimality conditions at two points x and y we have
xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))
By applying monotonicity (see Lemma 3) we have
((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0
Rearranging this and applying the Cauchy-Schwartz inequality yields
983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))
le 983042xminus y983042983042proxf (x)minus proxf (y)983042
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34
Examples of several proximal operators
(1) f(x) = 0proxf (x) = x
(2) f(x) = λ983042x9830421 with λ gt 0
983045proxλ983042middot9830421(x)
983046i= argmin
uisinR
983069λ|u|+ 1
2(uminus xi)
2
983070
=
983099983105983103
983105983101
xi minus λ if xi gt λ
0 if xi isin [minusλλ]
xi + λ if xi lt minusλ
which is known as soft-thresholding
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34
(3) f(x) = λ983042x9830420 the number of nonzero components non-convex
983045proxλ983042middot9830420(x)
983046i=
983099983105983103
983105983101
xi if |xi| gtradic2λ
0 xi if |xi| =radic2λ
0 if |xi| ltradic2λ
which is known as hard thresholding
(4) f(x) = IΩ(x)
proxIΩ(x) = argminu
983069IΩ(u) +
1
2983042uminus x9830422
983070= argmin
uisinΩ983042uminus x983042
which is simply the projection of x onto the set Ω
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34
9 Convergence rates
An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error
For smooth f 983042nablaf(xk)983042
For nonsmooth convex f
dist(0 partf(xk))
(the sequence of distances from 0 to the subdifferential partf(xk))
Other error measures include
983042xk minus x983183983042 and f(xk)minus f983183
where x983183 is a solution and f983183 is the optimal value of f
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34
We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim
krarrinfinφk = 0
We say that Q-linear convergence holds if there is some ρ isin (0 1)such that
φk+1
φkle ρ
for all k sufficiently large
We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that
φk le Cρk k = 1 2
Q-linear rArr R-linear But R-linear ⇏ Q-linear Example
φk =
9830832minusk k even
0 k odd
satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34
Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence
φk le Cradick φk le Ck φk le Ck2 k = 1 2
where in each case C is some positive constant
Superlinear convergence occurs when the constant ρ isin (0 1) in
φk+1φk le ρ
can be chosen arbitrarily close to 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34
The sequence φk converges Q-superlinearly to 0 if
limkrarrinfin
φk+1
φk= 0
Q-Quadratic ifφk+1
φ2k
le C k = 1 2
for some sufficiently large C
R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is
0 le φk le νk k = 1 2
R-quadratic convergence is defined similarly
Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34
91 Complexity
For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is
Cρk le 983171
By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that
Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)
The complexity is O( log(1983171)1minusρ )
For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C
Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)
Ck le 983171 hArr k ge C983171 complexity O(1983171)
Ck2 le 983171 hArr k ge983155
C983171 complexity O(1radic983171)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34
For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is
O(log log(1983171))
To show this one can consider the sequence
ek = Cφk
which satisfiesek+1 le e2k
Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34
(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have
nablaf(x983183)T(zminus x983183) =
d
dλf(x983183 + λ(zminus x983183))|λ=0
= limλrarr0+
f(x983183 + λ(zminus x983183))minus f(x983183)
λ
le limλrarr0+
λf(z) + (1minus λ)f(x983183)minus f(x983183)
λ
= f(z)minus f(x983183) lt 0
Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point
Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms
Numerical algorithms try to seek a point where nablaf vanishes
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34
7 Karush-Kuhn-Tucker conditions
Theorem 13 (KKT conditions)
Consider the minimization problem
minxisinRn
f(x) st gi(x) le 0 i = 1 m
where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions
Let x983183 be an optimal solution and assume Slaterrsquos condition
exist x isin Rn st gi(x) lt 0 i = 1 m
hold Then there exist λ1 middot middot middot λm ge 0 satisfying
0 isin partf(x983183) +
m983131
i=1
λipartgi(x983183) λigi(x983183) = 0 i = 1 m
If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34
Active gi(x) le 0
Inactive gi(x) le 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34
8 Moreau envelope and proximal operator
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is
Mλh(x) = infu
983069h(u) +
1
2λ983042uminus x9830422
983070
=1
λinfu
983069λh(u) +
1
2983042uminus x9830422
983070
For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is
proxf (x) = argminu
983069f(u) +
1
2983042uminus x9830422
983070
From the optimality condition (see Theorem 4) we have
0 isin partf(proxf (x)) + (proxf (x)minus x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34
For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if
x983183 = proxf (x983183)
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope
Mλh(x) = h(proxλh(x)) +1
2λ983042proxλh(x)minus x9830422
can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn
minusinfin lt Mλh(x) lt +infin
even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient
nablaMλh(x) =1
λ(xminus proxλh(x))
Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34
Example h(x) = |x| and Hλ(x) = Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34
Lemma 14 (Nonexpansivity of proximal operator)
For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn
983042proxf (x)minus proxf (y)983042 le 983042xminus y983042
Proof From the optimality conditions at two points x and y we have
xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))
By applying monotonicity (see Lemma 3) we have
((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0
Rearranging this and applying the Cauchy-Schwartz inequality yields
983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))
le 983042xminus y983042983042proxf (x)minus proxf (y)983042
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34
Examples of several proximal operators
(1) f(x) = 0proxf (x) = x
(2) f(x) = λ983042x9830421 with λ gt 0
983045proxλ983042middot9830421(x)
983046i= argmin
uisinR
983069λ|u|+ 1
2(uminus xi)
2
983070
=
983099983105983103
983105983101
xi minus λ if xi gt λ
0 if xi isin [minusλλ]
xi + λ if xi lt minusλ
which is known as soft-thresholding
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34
(3) f(x) = λ983042x9830420 the number of nonzero components non-convex
983045proxλ983042middot9830420(x)
983046i=
983099983105983103
983105983101
xi if |xi| gtradic2λ
0 xi if |xi| =radic2λ
0 if |xi| ltradic2λ
which is known as hard thresholding
(4) f(x) = IΩ(x)
proxIΩ(x) = argminu
983069IΩ(u) +
1
2983042uminus x9830422
983070= argmin
uisinΩ983042uminus x983042
which is simply the projection of x onto the set Ω
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34
9 Convergence rates
An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error
For smooth f 983042nablaf(xk)983042
For nonsmooth convex f
dist(0 partf(xk))
(the sequence of distances from 0 to the subdifferential partf(xk))
Other error measures include
983042xk minus x983183983042 and f(xk)minus f983183
where x983183 is a solution and f983183 is the optimal value of f
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34
We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim
krarrinfinφk = 0
We say that Q-linear convergence holds if there is some ρ isin (0 1)such that
φk+1
φkle ρ
for all k sufficiently large
We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that
φk le Cρk k = 1 2
Q-linear rArr R-linear But R-linear ⇏ Q-linear Example
φk =
9830832minusk k even
0 k odd
satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34
Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence
φk le Cradick φk le Ck φk le Ck2 k = 1 2
where in each case C is some positive constant
Superlinear convergence occurs when the constant ρ isin (0 1) in
φk+1φk le ρ
can be chosen arbitrarily close to 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34
The sequence φk converges Q-superlinearly to 0 if
limkrarrinfin
φk+1
φk= 0
Q-Quadratic ifφk+1
φ2k
le C k = 1 2
for some sufficiently large C
R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is
0 le φk le νk k = 1 2
R-quadratic convergence is defined similarly
Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34
91 Complexity
For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is
Cρk le 983171
By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that
Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)
The complexity is O( log(1983171)1minusρ )
For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C
Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)
Ck le 983171 hArr k ge C983171 complexity O(1983171)
Ck2 le 983171 hArr k ge983155
C983171 complexity O(1radic983171)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34
For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is
O(log log(1983171))
To show this one can consider the sequence
ek = Cφk
which satisfiesek+1 le e2k
Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34
7 Karush-Kuhn-Tucker conditions
Theorem 13 (KKT conditions)
Consider the minimization problem
minxisinRn
f(x) st gi(x) le 0 i = 1 m
where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions
Let x983183 be an optimal solution and assume Slaterrsquos condition
exist x isin Rn st gi(x) lt 0 i = 1 m
hold Then there exist λ1 middot middot middot λm ge 0 satisfying
0 isin partf(x983183) +
m983131
i=1
λipartgi(x983183) λigi(x983183) = 0 i = 1 m
If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34
Active gi(x) le 0
Inactive gi(x) le 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34
8 Moreau envelope and proximal operator
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is
Mλh(x) = infu
983069h(u) +
1
2λ983042uminus x9830422
983070
=1
λinfu
983069λh(u) +
1
2983042uminus x9830422
983070
For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is
proxf (x) = argminu
983069f(u) +
1
2983042uminus x9830422
983070
From the optimality condition (see Theorem 4) we have
0 isin partf(proxf (x)) + (proxf (x)minus x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34
For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if
x983183 = proxf (x983183)
For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope
Mλh(x) = h(proxλh(x)) +1
2λ983042proxλh(x)minus x9830422
can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn
minusinfin lt Mλh(x) lt +infin
even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient
nablaMλh(x) =1
λ(xminus proxλh(x))
Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34
Example h(x) = |x| and Hλ(x) = Mλh(x)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34
Lemma 14 (Nonexpansivity of proximal operator)
For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn
983042proxf (x)minus proxf (y)983042 le 983042xminus y983042
Proof From the optimality conditions at two points x and y we have
xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))
By applying monotonicity (see Lemma 3) we have
((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0
Rearranging this and applying the Cauchy-Schwartz inequality yields
983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))
le 983042xminus y983042983042proxf (x)minus proxf (y)983042
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34
Examples of several proximal operators
(1) f(x) = 0proxf (x) = x
(2) f(x) = λ983042x9830421 with λ gt 0
983045proxλ983042middot9830421(x)
983046i= argmin
uisinR
983069λ|u|+ 1
2(uminus xi)
2
983070
=
983099983105983103
983105983101
xi minus λ if xi gt λ
0 if xi isin [minusλλ]
xi + λ if xi lt minusλ
which is known as soft-thresholding
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34
(3) f(x) = λ983042x9830420 the number of nonzero components non-convex
983045proxλ983042middot9830420(x)
983046i=
983099983105983103
983105983101
xi if |xi| gtradic2λ
0 xi if |xi| =radic2λ
0 if |xi| ltradic2λ
which is known as hard thresholding
(4) f(x) = IΩ(x)
proxIΩ(x) = argminu
983069IΩ(u) +
1
2983042uminus x9830422
983070= argmin
uisinΩ983042uminus x983042
which is simply the projection of x onto the set Ω
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34
9 Convergence rates
An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error
For smooth f 983042nablaf(xk)983042
For nonsmooth convex f
dist(0 partf(xk))
(the sequence of distances from 0 to the subdifferential partf(xk))
Other error measures include
983042xk minus x983183983042 and f(xk)minus f983183
where x983183 is a solution and f983183 is the optimal value of f
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34
We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim
krarrinfinφk = 0
We say that Q-linear convergence holds if there is some ρ isin (0 1)such that
φk+1
φkle ρ
for all k sufficiently large
We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that
φk le Cρk k = 1 2
Q-linear rArr R-linear But R-linear ⇏ Q-linear Example
φk =
9830832minusk k even
0 k odd
satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34
Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence
φk le Cradick φk le Ck φk le Ck2 k = 1 2
where in each case C is some positive constant
Superlinear convergence occurs when the constant ρ isin (0 1) in
φk+1φk le ρ
can be chosen arbitrarily close to 0
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34
The sequence φk converges Q-superlinearly to 0 if
limkrarrinfin
φk+1
φk= 0
Q-Quadratic ifφk+1
φ2k
le C k = 1 2
for some sufficiently large C
R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is
0 le φk le νk k = 1 2
R-quadratic convergence is defined similarly
Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34
91 Complexity
For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is
Cρk le 983171
By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that
Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)
The complexity is O( log(1983171)1minusρ )
For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C
Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)
Ck le 983171 hArr k ge C983171 complexity O(1983171)
Ck2 le 983171 hArr k ge983155
C983171 complexity O(1radic983171)
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34
For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is
O(log log(1983171))
To show this one can consider the sequence
ek = Cφk
which satisfiesek+1 le e2k
Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy
Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34