Lecture 6: Preliminaries of Optimization
March 27 - April 3, 2020


1 Basic definitions

The effective domain of $f:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ is defined as
$$\mathrm{dom}(f) = \{x \mid f(x) < +\infty\}.$$

A function $f:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ is called proper if there exists at least one $x \in \mathbb{R}^n$ such that $f(x) < +\infty$, meaning that $\mathrm{dom}(f) \neq \emptyset$.

The epigraph of $f:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ is defined by
$$\mathrm{epi}(f) = \{(x, y) : f(x) \le y,\ x \in \mathbb{R}^n,\ y \in \mathbb{R}\}.$$

A function $f:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ is called closed if $\mathrm{epi}(f)$ is closed.


2 Solutions of $\min_x f(x)$

$x_*$ is a local minimizer of $f$ if there is a neighborhood $\mathcal{N}$ of $x_*$ such that $f(x) \ge f(x_*)$ for all $x \in \mathcal{N}$.

$x_*$ is a strict local minimizer if it is a local minimizer on some neighborhood $\mathcal{N}$ and, in addition, $f(x) > f(x_*)$ for all $x \in \mathcal{N}$ with $x \neq x_*$.

$x_*$ is an isolated local minimizer if there is a neighborhood $\mathcal{N}$ of $x_*$ such that $f(x) \ge f(x_*)$ for all $x \in \mathcal{N}$ and, in addition, $\mathcal{N}$ contains no local minimizers other than $x_*$.

Strict local minimizers are not always isolated; for example,
$$f(x) = x^4\cos(1/x) + 2x^4, \qquad f(0) = 0,$$
has a strict local minimizer at $x = 0$ that is not isolated (see the sketch below).

All isolated local minimizers are strict.

$x_*$ is a global minimizer of $f$ if $f(x) \ge f(x_*)$ for all $x \in \mathbb{R}^n$.
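To make the example concrete, here is a minimal numerical sketch (my own illustration, not from the lecture): it samples $f(x) = x^4\cos(1/x) + 2x^4$ on a fine grid to the right of $0$ and flags grid points that lie below both neighbors. Many such local minimizers accumulate near the strict local minimizer $x = 0$, so $x = 0$ is not isolated.

```python
# Sketch (illustrative, not from the slides): local minimizers of
# f(x) = x^4 cos(1/x) + 2 x^4 accumulate near the strict local minimizer x = 0.
import numpy as np

def f(x):
    x = np.asarray(x, dtype=float)
    safe = np.where(x == 0.0, 1.0, x)          # avoid dividing by zero at x = 0
    return np.where(x == 0.0, 0.0, x**4 * np.cos(1.0 / safe) + 2.0 * x**4)

x = np.linspace(1e-3, 2e-2, 200001)            # fine grid just to the right of 0
y = f(x)
# interior grid points lying below both neighbors approximate local minimizers
is_min = (y[1:-1] < y[:-2]) & (y[1:-1] < y[2:])
print("approximate local minimizers found:", is_min.sum())
print("the five closest to 0:", x[1:-1][is_min][:5])
```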


3 Convexity

A set $\Omega \subseteq \mathbb{R}^n$ is called convex if it has the property that
$$x, y \in \Omega \ \Rightarrow\ (1-\alpha)x + \alpha y \in \Omega \quad \forall\, \alpha \in [0, 1].$$
We usually deal with closed convex sets.

For a convex set $\Omega \subseteq \mathbb{R}^n$ we define the indicator function $I_\Omega$ as follows:
$$I_\Omega(x) = \begin{cases} 0 & \text{if } x \in \Omega, \\ +\infty & \text{otherwise.} \end{cases}$$

The constrained optimization problem
$$\min_{x \in \Omega} f(x)$$
can be restated equivalently as follows:
$$\min_{x \in \mathbb{R}^n} f(x) + I_\Omega(x).$$
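As a small illustration of this restatement (a sketch with my own choice of $f$ and $\Omega$, not taken from the slides), the indicator can be encoded with `np.inf`, so that evaluating $f + I_\Omega$ over candidate points automatically discards the infeasible ones. Here $\Omega = [0, 1]^2$ and $f$ is a simple quadratic.

```python
# Sketch: min_{x in Omega} f(x) rewritten as min_x f(x) + I_Omega(x),
# with I_Omega encoded via np.inf (f and Omega below are illustrative choices).
import numpy as np

def f(x):
    return np.sum((x - np.array([2.0, -1.0]))**2)        # quadratic objective

def indicator_box(x, lo=0.0, hi=1.0):
    return 0.0 if np.all((x >= lo) & (x <= hi)) else np.inf

# crude search over a grid of candidate points in [-2, 3]^2
grid = np.linspace(-2.0, 3.0, 101)
best_x, best_val = None, np.inf
for a in grid:
    for b in grid:
        x = np.array([a, b])
        val = f(x) + indicator_box(x)                    # +inf rules out infeasible x
        if val < best_val:
            best_x, best_val = x, val
print("approximate constrained minimizer:", best_x)      # expected near [1, 0]
```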


A convex function $f:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ has the following defining property: $\mathrm{dom}(f)$ is convex and, for all $x, y \in \mathrm{dom}(f)$ and all $\alpha \in [0, 1]$,
$$f((1-\alpha)x + \alpha y) \le (1-\alpha)f(x) + \alpha f(y).$$

Theorem 1 (First-order convexity condition)
Differentiable $f$ is convex if and only if $\mathrm{dom}(f)$ is convex and
$$f(y) \ge f(x) + \nabla f(x)^T(y - x) \quad \forall\, x, y \in \mathrm{dom}(f).$$

Theorem 2 (Second-order convexity conditions)
Assume $f$ is twice continuously differentiable. Then $f$ is convex if and only if $\mathrm{dom}(f)$ is convex and
$$\nabla^2 f(x) \succeq 0 \quad \forall\, x \in \mathrm{dom}(f),$$
that is, $\nabla^2 f(x)$ is positive semidefinite.

Important properties for convex objective functions:
• Any local minimizer is also a global minimizer (see Theorem 12).
• The set of global minimizers is a convex set (easy to prove).

If there exists a value $\gamma > 0$ such that
$$f((1-\alpha)x + \alpha y) \le (1-\alpha)f(x) + \alpha f(y) - \frac{\gamma}{2}\alpha(1-\alpha)\|x - y\|_2^2$$
for all $x, y \in \mathbb{R}^n$ and $\alpha \in [0, 1]$, we say that $f$ is strongly convex with modulus of convexity $\gamma$.

For differentiable $f$, an equivalent definition of strongly convex with modulus of convexity $\gamma$ is
$$f(y) \ge f(x) + \nabla f(x)^T(y - x) + \frac{\gamma}{2}\|y - x\|^2.$$

4 Subgradient and subdifferential

Definition: A vector $v \in \mathbb{R}^n$ is a subgradient of $f$ at a point $x$ if for all $d \in \mathbb{R}^n$ it holds that
$$f(x + d) \ge f(x) + v^T d.$$

The subdifferential, denoted $\partial f(x)$, is the set of all subgradients of $f$ at $x$ (see FOMO for concrete examples).

Lemma 3 (Monotonicity of subdifferentials of convex functions)
For all convex $f$, if $a \in \partial f(x)$ and $b \in \partial f(y)$, we have $(a - b)^T(x - y) \ge 0$.

Proof. By the definition of subgradient we have
$$f(y) \ge f(x) + a^T(y - x) \quad \text{and} \quad f(x) \ge f(y) + b^T(x - y).$$
Adding these two inequalities yields the statement. ∎
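As a concrete example (my own illustration, not in the slides), take $f(x) = |x|$ on $\mathbb{R}$: $\partial f(x) = \{\mathrm{sign}(x)\}$ for $x \neq 0$ and $\partial f(0) = [-1, 1]$. The sketch below checks the subgradient inequality and the monotonicity of Lemma 3 at random points.

```python
# Sketch (illustrative choice f(x) = |x|): check the subgradient inequality
# f(x + d) >= f(x) + v*d and the monotonicity (a - b)(x - y) >= 0 of Lemma 3.
import numpy as np

f = abs
def a_subgradient(x, rng):
    # any v in [-1, 1] is a subgradient at 0; sign(x) otherwise
    return rng.uniform(-1.0, 1.0) if x == 0.0 else np.sign(x)

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y, d = rng.normal(size=3)
    a, b = a_subgradient(x, rng), a_subgradient(y, rng)
    assert f(x + d) >= f(x) + a * d - 1e-12      # subgradient inequality at x
    assert (a - b) * (x - y) >= -1e-12           # monotonicity of Lemma 3
print("subgradient inequality and monotonicity hold at all sampled points")
```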


Theorem 4 (Fermat's lemma, generalized to convex functions)
Let $f:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ be a proper convex function. Then the point $x_*$ is a minimizer of $f(x)$ if and only if
$$0 \in \partial f(x_*).$$

Proof. "$\Leftarrow$": Suppose that $0 \in \partial f(x_*)$. The subgradient inequality with $v = 0$ gives
$$f(x_* + d) \ge f(x_*) \quad \forall\, d \in \mathbb{R}^n,$$
which implies that $x_*$ is a minimizer of $f$. "$\Rightarrow$": by definition, if $x_*$ is a minimizer then $f(x_* + d) \ge f(x_*) + 0^T d$ for all $d$, which says exactly that $0$ is a subgradient at $x_*$. ∎

Theorem 5
Let $f:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ be proper convex and let $x \in \mathrm{int}(\mathrm{dom}(f))$.
If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$. If $\partial f(x)$ is a singleton (a set containing a single vector), then $f$ is differentiable at $x$ with gradient equal to the unique subgradient.

5 Taylor's theorem

Taylor's theorem shows how smooth functions can be locally approximated by low-order (e.g., linear or quadratic) functions.

Theorem 6 (Taylor's theorem)
Given a continuously differentiable function $f:\mathbb{R}^n \to \mathbb{R}$ and given $x, p \in \mathbb{R}^n$, we have that
$$f(x + p) = f(x) + \int_0^1 \nabla f(x + tp)^T p\, dt,$$
$$f(x + p) = f(x) + \nabla f(x + \xi p)^T p \quad \text{for some } \xi \in (0, 1).$$
If $f$ is twice continuously differentiable, we have
$$\nabla f(x + p) = \nabla f(x) + \int_0^1 \nabla^2 f(x + tp)\, p\, dt,$$
$$f(x + p) = f(x) + \nabla f(x)^T p + \frac{1}{2} p^T \nabla^2 f(x + \xi p)\, p$$
for some $\xi \in (0, 1)$.
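The integral form can be checked numerically (a sketch with an arbitrary smooth test function of my own choosing): a trapezoidal approximation of $\int_0^1 \nabla f(x + tp)^T p\, dt$ should match $f(x + p) - f(x)$.

```python
# Sketch: numerical check of f(x+p) - f(x) = int_0^1 grad f(x + t p)^T p dt
# for an arbitrary smooth test function (the choice of f is illustrative).
import numpy as np

def f(x):
    return np.log(1.0 + np.sum(x**2)) + np.sin(x[0])

def grad_f(x):
    g = 2.0 * x / (1.0 + np.sum(x**2))
    g[0] += np.cos(x[0])
    return g

rng = np.random.default_rng(1)
x, p = rng.normal(size=3), rng.normal(size=3)

t = np.linspace(0.0, 1.0, 20001)
vals = np.array([grad_f(x + ti * p) @ p for ti in t])
dt = t[1] - t[0]
integral = np.sum(0.5 * (vals[:-1] + vals[1:])) * dt     # trapezoidal rule

print("f(x+p) - f(x):", f(x + p) - f(x))
print("integral     :", integral)                        # the two should agree
```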


If $f:\mathbb{R}^n \to \mathbb{R}$ is differentiable and convex, then
$$\partial f(x) = \{\nabla f(x)\}$$
and
$$f(y) \ge f(x) + \nabla f(x)^T(y - x)$$
for all $x, y \in \mathbb{R}^n$.

$f$ is Lipschitz continuously differentiable with constant $L$ if
$$\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\| \quad \text{for all } x, y \in \mathbb{R}^n.$$

If $f$ is Lipschitz continuously differentiable with constant $L$, then by Taylor's theorem we have
$$f(y) - f(x) - \nabla f(x)^T(y - x) \le \frac{L}{2}\|y - x\|^2$$
for all $x, y \in \mathbb{R}^n$.

For differentiable $f$, an equivalent definition of strongly convex with modulus of convexity $\gamma$ is
$$f(y) \ge f(x) + \nabla f(x)^T(y - x) + \frac{\gamma}{2}\|y - x\|^2.$$

Lemma 7
Suppose $f$ is strongly convex with modulus of convexity $\gamma$ and $\nabla f$ is uniformly Lipschitz continuous with constant $L$. Then for all $x, y$ we have
$$\frac{\gamma}{2}\|y - x\|^2 \le f(y) - f(x) - \nabla f(x)^T(y - x) \le \frac{L}{2}\|y - x\|^2.$$

Condition number: $\kappa = L/\gamma$ (e.g., for a strictly convex quadratic $f$).

When $f$ is twice continuously differentiable, the inequalities in Lemma 7 are equivalent to
$$\gamma I \preceq \nabla^2 f(x) \preceq LI \quad \text{for all } x.$$
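For a strictly convex quadratic (an illustrative choice, not from the slides), $f(x) = \frac{1}{2}x^T A x$ with $A \succ 0$, Lemma 7 holds with $\gamma = \lambda_{\min}(A)$ and $L = \lambda_{\max}(A)$, so $\kappa = L/\gamma$ is the usual matrix condition number. A quick numerical check:

```python
# Sketch: verify gamma/2 ||y-x||^2 <= f(y) - f(x) - grad f(x)^T (y-x) <= L/2 ||y-x||^2
# for the quadratic f(x) = 0.5 x^T A x with A positive definite (illustrative choice).
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)                       # symmetric positive definite
eigs = np.linalg.eigvalsh(A)                  # ascending eigenvalues
gamma, L = eigs[0], eigs[-1]

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    gap = f(y) - f(x) - grad(x) @ (y - x)
    sq = np.sum((y - x)**2)
    assert 0.5 * gamma * sq - 1e-9 <= gap <= 0.5 * L * sq + 1e-9
print("Lemma 7 bounds hold; condition number kappa =", L / gamma)
```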


Theorem 8
Let $f$ be differentiable and strongly convex with modulus of convexity $\gamma > 0$. Then the minimizer $x_*$ of $f$ exists and is unique.

Proof. (i) Compactness of the level set: show that, for any point $x_0$, the level set
$$\{x \mid f(x) \le f(x_0)\}$$
is closed and bounded, and hence compact.
(ii) Existence: since $f$ is continuous, it attains its minimum on the compact level set, which is also the solution of $\min_x f(x)$.
(iii) Uniqueness: suppose for contradiction that the minimizer is not unique, so that we have two points $x_*^1$ and $x_*^2$ that minimize $f$. By using the strong convexity property we can prove
$$f\!\left(\frac{x_*^1 + x_*^2}{2}\right) < f(x_*^1) = f(x_*^2).$$
This is a contradiction. ∎

6 Optimality conditions for smooth functions

Theorem 9 (First-order necessary condition)
If $x_*$ is a local minimizer of $f$ and $f$ is continuously differentiable in an open neighborhood of $x_*$, then $\nabla f(x_*) = 0$.

Proof. Suppose for contradiction that $\nabla f(x_*) \neq 0$. Define the vector $p = -\nabla f(x_*)$ and note that $p^T\nabla f(x_*) = -\|\nabla f(x_*)\|_2^2 < 0$. Because $\nabla f$ is continuous near $x_*$, there is a scalar $T > 0$ such that
$$p^T\nabla f(x_* + tp) < 0 \quad \text{for all } t \in [0, T].$$
For any $s \in (0, T]$ we have by Taylor's theorem that
$$f(x_* + sp) = f(x_*) + s\, p^T\nabla f(x_* + \xi s p) \quad \text{for some } \xi \in (0, 1).$$
Therefore $f(x_* + sp) < f(x_*)$ for all $s \in (0, T]$. We have found a direction leading away from $x_*$ along which $f$ decreases, so $x_*$ is not a local minimizer, and we have a contradiction. ∎

Theorem 10 (Second-order necessary conditions)
If $x_*$ is a local minimizer of $f$ and $\nabla^2 f$ is continuous in an open neighborhood of $x_*$, then $\nabla f(x_*) = 0$ and $\nabla^2 f(x_*)$ is positive semidefinite.

Proof. We know from Theorem 9 that $\nabla f(x_*) = 0$. Assume that $\nabla^2 f(x_*)$ is not positive semidefinite. Then we can choose a vector $p$ such that $p^T \nabla^2 f(x_*)\, p < 0$, and because $\nabla^2 f$ is continuous near $x_*$, there is a scalar $T > 0$ such that
$$p^T \nabla^2 f(x_* + tp)\, p < 0 \quad \text{for all } t \in [0, T].$$
By doing a Taylor series expansion around $x_*$, we have, for all $s \in (0, T]$ and some $\xi \in (0, 1)$,
$$f(x_* + sp) = f(x_*) + s\, p^T \nabla f(x_*) + \frac{1}{2} s^2 p^T \nabla^2 f(x_* + \xi s p)\, p < f(x_*).$$
As in Theorem 9, we have found a direction from $x_*$ along which $f$ is decreasing, and so again $x_*$ is not a local minimizer. ∎

Theorem 11 (Second-order sufficient conditions)
Suppose that $\nabla^2 f$ is continuous in an open neighborhood of $x_*$, that $\nabla f(x_*) = 0$, and that $\nabla^2 f(x_*)$ is positive definite. Then $x_*$ is a strict local minimizer of $f$.

Proof. Because the Hessian $\nabla^2 f$ is continuous and positive definite at $x_*$, we can choose a radius $r > 0$ so that $\nabla^2 f(x)$ remains positive definite for all $x$ in the open ball $B = \{z \mid \|z - x_*\| < r\}$. Taking any nonzero vector $p$ with $\|p\| < r$, we have $x_* + p \in B$ and
$$f(x_* + p) = f(x_*) + p^T \nabla f(x_*) + \frac{1}{2} p^T \nabla^2 f(x_* + \xi p)\, p = f(x_*) + \frac{1}{2} p^T \nabla^2 f(x_* + \xi p)\, p$$
for some $\xi \in (0, 1)$. Since $x_* + \xi p \in B$, we have $p^T \nabla^2 f(x_* + \xi p)\, p > 0$, and therefore $f(x_* + p) > f(x_*)$, giving the result. ∎

A point $x$ is called a stationary point if
$$\nabla f(x) = 0.$$

A stationary point $x$ is called a saddle point if there exist $u$ and $v$ such that
$$f(x + \alpha u) < f(x) \quad \text{and} \quad f(x + \alpha v) > f(x)$$
for all sufficiently small $\alpha > 0$.

Stationary points are not necessarily local minimizers: stationary points can also be local maximizers or saddle points.

If $\nabla f(x) = 0$ and $\nabla^2 f(x)$ has both strictly positive and strictly negative eigenvalues, then $x$ is a saddle point.

If $\nabla^2 f(x)$ is positive semidefinite or negative semidefinite, then $\nabla^2 f(x)$ alone is insufficient to classify $x$.
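The following sketch (my own example, not from the slides) classifies a stationary point from the eigenvalues of the Hessian; for $f(x, y) = x^2 - y^2$ the origin is a stationary point with eigenvalues $2$ and $-2$, hence a saddle point.

```python
# Sketch: classify a stationary point from the Hessian eigenvalues.
# The test function f(x, y) = x^2 - y^2 is an illustrative choice.
import numpy as np

def hessian(x):
    return np.array([[2.0, 0.0], [0.0, -2.0]])    # constant Hessian of x^2 - y^2

def classify_stationary(H, tol=1e-10):
    eig = np.linalg.eigvalsh(H)
    if eig.min() > tol:
        return "strict local minimizer (Theorem 11)"
    if eig.max() < -tol:
        return "strict local maximizer"
    if eig.min() < -tol and eig.max() > tol:
        return "saddle point"
    return "inconclusive: Hessian is only semidefinite"

x_stat = np.zeros(2)              # gradient (2x, -2y) vanishes at the origin
print(classify_stationary(hessian(x_stat)))       # -> saddle point
```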


Theorem 12
(i) For all convex $f$, any local minimizer $x_*$ is a global minimizer of $f$.
(ii) If $f$ is convex and differentiable, then any stationary point $x_*$ is a global minimizer of $f$.

Proof. (i) Suppose that $x_*$ is a local but not a global minimizer. Then we can find a point $z \in \mathbb{R}^n$ with $f(z) < f(x_*)$. Consider the line segment that joins $x_*$ to $z$, that is,
$$x = \lambda z + (1 - \lambda)x_* \quad \text{for some } \lambda \in (0, 1].$$
By the convexity property of $f$ we have
$$f(x) \le \lambda f(z) + (1 - \lambda)f(x_*) < f(x_*).$$
Any neighborhood $\mathcal{N}$ of $x_*$ contains a piece of the line segment, so there will always be points $x \in \mathcal{N}$ at which the last inequality is satisfied. Hence $x_*$ is not a local minimizer.

(ii) Suppose that $x_*$ is not a global minimizer, and choose $z$ as above. Then, from convexity, we have
$$\nabla f(x_*)^T(z - x_*) = \frac{d}{d\lambda} f(x_* + \lambda(z - x_*))\Big|_{\lambda = 0}
= \lim_{\lambda \to 0^+} \frac{f(x_* + \lambda(z - x_*)) - f(x_*)}{\lambda}
\le \lim_{\lambda \to 0^+} \frac{\lambda f(z) + (1 - \lambda)f(x_*) - f(x_*)}{\lambda}
= f(z) - f(x_*) < 0.$$
Therefore $\nabla f(x_*) \neq 0$, and so $x_*$ is not a stationary point. ∎

Remark: Theorems 9-12 provide the foundations for unconstrained optimization algorithms. Numerical algorithms try to seek a point where $\nabla f$ vanishes.

7 Karush-Kuhn-Tucker conditions

Theorem 13 (KKT conditions)
Consider the minimization problem
$$\min_{x \in \mathbb{R}^n} f(x) \quad \text{s.t.} \quad g_i(x) \le 0,\ i = 1, \dots, m,$$
where $f:\mathbb{R}^n \to \mathbb{R}$ and all $g_i:\mathbb{R}^n \to \mathbb{R}$ are convex functions.
Let $x_*$ be an optimal solution and assume that Slater's condition
$$\exists\, x \in \mathbb{R}^n \ \text{s.t.}\ g_i(x) < 0,\ i = 1, \dots, m,$$
holds. Then there exist $\lambda_1, \dots, \lambda_m \ge 0$ satisfying
$$0 \in \partial f(x_*) + \sum_{i=1}^m \lambda_i\, \partial g_i(x_*), \qquad \lambda_i g_i(x_*) = 0,\ i = 1, \dots, m.$$
Conversely, if $x_*$ satisfies the above conditions, called the KKT conditions, then it is an optimal solution of the optimization problem.
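To see the conditions in action, consider the toy problem (my own choice, not from the slides) $\min_x x^2$ subject to $g(x) = 1 - x \le 0$. Its solution is $x_* = 1$ with multiplier $\lambda = 2$, since $0 = 2x_* - \lambda$ and $\lambda g(x_*) = 0$; the sketch below just checks the four KKT requirements.

```python
# Sketch: check the KKT conditions for the toy problem
#   min x^2  s.t.  g(x) = 1 - x <= 0   (illustrative, not from the lecture),
# at the claimed solution x_* = 1 with multiplier lambda = 2.
f_grad = lambda x: 2.0 * x        # f is differentiable, so partial f(x) = {2x}
g_grad = lambda x: -1.0
g = lambda x: 1.0 - x

x_star, lam = 1.0, 2.0
print("stationarity   :", f_grad(x_star) + lam * g_grad(x_star))   # should be 0
print("primal feasible:", g(x_star) <= 0.0)
print("dual feasible  :", lam >= 0.0)
print("complementarity:", lam * g(x_star))                          # should be 0
```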


A constraint $g_i$ is called active at $x$ if $g_i(x) = 0$, and inactive if $g_i(x) < 0$.


8 Moreau envelope and proximal operator

For a closed proper convex function $h:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ and a positive scalar $\lambda$, the Moreau envelope of $(\lambda, h)$ is
$$M_{\lambda h}(x) = \inf_u \left\{ h(u) + \frac{1}{2\lambda}\|u - x\|^2 \right\} = \frac{1}{\lambda} \inf_u \left\{ \lambda h(u) + \frac{1}{2}\|u - x\|^2 \right\}.$$

For a closed proper convex function $f:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$, the proximal operator of $f$ is
$$\mathrm{prox}_f(x) = \arg\min_u \left\{ f(u) + \frac{1}{2}\|u - x\|^2 \right\}.$$

From the optimality condition (see Theorem 4) we have
$$0 \in \partial f(\mathrm{prox}_f(x)) + (\mathrm{prox}_f(x) - x).$$

For a closed proper convex function $f:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$, the point $x_*$ is a minimizer of $f$ if and only if
$$x_* = \mathrm{prox}_f(x_*).$$

For a closed proper convex function $h:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ and a positive scalar $\lambda$, the Moreau envelope
$$M_{\lambda h}(x) = h(\mathrm{prox}_{\lambda h}(x)) + \frac{1}{2\lambda}\|\mathrm{prox}_{\lambda h}(x) - x\|^2$$
can be viewed as a kind of smoothing or regularization of the function $h$. We have, for all $x \in \mathbb{R}^n$,
$$-\infty < M_{\lambda h}(x) < +\infty,$$
even when $h(x) = +\infty$ for some $x \in \mathbb{R}^n$. Moreover, $M_{\lambda h}(x)$ is convex and differentiable everywhere, with gradient
$$\nabla M_{\lambda h}(x) = \frac{1}{\lambda}\big(x - \mathrm{prox}_{\lambda h}(x)\big).$$
Therefore $x_*$ is a minimizer of $h$ $\Leftrightarrow$ $x_*$ is a minimizer of $M_{\lambda h}(x)$.

Example: $h(x) = |x|$ and $H_\lambda(x) = M_{\lambda h}(x)$ (the figure from the slide is omitted).
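Since the slide's figure is not reproduced here, the following sketch (my reconstruction, with an illustrative $\lambda$) computes $M_{\lambda h}$ for $h(x) = |x|$ via $\mathrm{prox}_{\lambda h}$ (soft-thresholding, example (2) below) and checks it against a brute-force minimization over a grid; the result is a smooth, everywhere-finite approximation of $|x|$ (the Huber function).

```python
# Sketch: Moreau envelope of h(x) = |x| computed from prox_{lambda h}
# (soft-thresholding) and cross-checked by brute-force minimization over a grid.
import numpy as np

lam = 0.5                                                  # illustrative value

def prox_abs(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)   # soft-thresholding

def envelope(x, lam):
    p = prox_abs(x, lam)
    return np.abs(p) + (p - x)**2 / (2.0 * lam)

xs = np.linspace(-2.0, 2.0, 9)
u = np.linspace(-5.0, 5.0, 100001)
brute = np.array([np.min(np.abs(u) + (u - x)**2 / (2.0 * lam)) for x in xs])
print(np.allclose(envelope(xs, lam), brute, atol=1e-6))    # True: the formulas agree
```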


Lemma 14 (Nonexpansivity of the proximal operator)
For a closed proper convex function $f:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ we have, for all $x, y \in \mathbb{R}^n$,
$$\|\mathrm{prox}_f(x) - \mathrm{prox}_f(y)\| \le \|x - y\|.$$

Proof. From the optimality conditions at the two points $x$ and $y$ we have
$$x - \mathrm{prox}_f(x) \in \partial f(\mathrm{prox}_f(x)) \quad \text{and} \quad y - \mathrm{prox}_f(y) \in \partial f(\mathrm{prox}_f(y)).$$
By applying monotonicity (see Lemma 3) we have
$$\big((x - \mathrm{prox}_f(x)) - (y - \mathrm{prox}_f(y))\big)^T(\mathrm{prox}_f(x) - \mathrm{prox}_f(y)) \ge 0.$$
Rearranging this and applying the Cauchy-Schwarz inequality yields
$$\|\mathrm{prox}_f(x) - \mathrm{prox}_f(y)\|^2 \le (x - y)^T(\mathrm{prox}_f(x) - \mathrm{prox}_f(y)) \le \|x - y\|\,\|\mathrm{prox}_f(x) - \mathrm{prox}_f(y)\|. \qquad \blacksquare$$

Examples of several proximal operators:

(1) $f(x) = 0$: $\mathrm{prox}_f(x) = x$.

(2) $f(x) = \lambda\|x\|_1$ with $\lambda > 0$:
$$\big[\mathrm{prox}_{\lambda\|\cdot\|_1}(x)\big]_i = \arg\min_{u \in \mathbb{R}} \left\{ \lambda|u| + \frac{1}{2}(u - x_i)^2 \right\}
= \begin{cases} x_i - \lambda & \text{if } x_i > \lambda, \\ 0 & \text{if } x_i \in [-\lambda, \lambda], \\ x_i + \lambda & \text{if } x_i < -\lambda, \end{cases}$$
which is known as soft-thresholding.
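A vectorized sketch of soft-thresholding (my own implementation, not from the slides), together with a componentwise check of the optimality condition from Theorem 4: $x_i - p_i$ must lie in $\lambda\,\partial|\cdot|(p_i)$, where $p = \mathrm{prox}_{\lambda\|\cdot\|_1}(x)$.

```python
# Sketch: vectorized soft-thresholding and a check of the optimality condition
# x_i - p_i in lambda * subdiff(|.|)(p_i) for each component.
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(3)
x, lam = rng.normal(size=10), 0.7
p = soft_threshold(x, lam)

for xi, pi in zip(x, p):
    r = xi - pi                                    # should lie in lam * subdiff(|.|)(pi)
    if pi != 0.0:
        assert np.isclose(r, lam * np.sign(pi))    # subdifferential is {sign(pi)}
    else:
        assert abs(r) <= lam + 1e-12               # subdifferential is [-1, 1]
print("optimality condition holds componentwise; prox =", np.round(p, 3))
```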


(3) $f(x) = \lambda\|x\|_0$, the number of nonzero components (non-convex):
$$\big[\mathrm{prox}_{\lambda\|\cdot\|_0}(x)\big]_i = \begin{cases} x_i & \text{if } |x_i| > \sqrt{2\lambda}, \\ \{0, x_i\} & \text{if } |x_i| = \sqrt{2\lambda}, \\ 0 & \text{if } |x_i| < \sqrt{2\lambda}, \end{cases}$$
which is known as hard-thresholding.

(4) $f(x) = I_\Omega(x)$:
$$\mathrm{prox}_{I_\Omega}(x) = \arg\min_u \left\{ I_\Omega(u) + \frac{1}{2}\|u - x\|^2 \right\} = \arg\min_{u \in \Omega} \|u - x\|,$$
which is simply the projection of $x$ onto the set $\Omega$.
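For instance (an illustrative choice of $\Omega$, not from the slides), when $\Omega$ is a box $\{x : l \le x \le u\}$, the projection, and hence $\mathrm{prox}_{I_\Omega}$, is just componentwise clipping:

```python
# Sketch: prox of the indicator of a box is the Euclidean projection,
# i.e. componentwise clipping (the box Omega is an illustrative choice).
import numpy as np

def proj_box(x, lo, hi):
    return np.minimum(np.maximum(x, lo), hi)

x = np.array([-1.3, 0.4, 2.2])
print(proj_box(x, lo=0.0, hi=1.0))     # -> [0.  0.4 1. ]
```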


9 Convergence rates

An important measure for evaluating algorithms is the rate of convergence to zero of some measure of error.

For smooth $f$: $\|\nabla f(x_k)\|$.

For nonsmooth convex $f$: $\mathrm{dist}(0, \partial f(x_k))$ (the sequence of distances from $0$ to the subdifferential $\partial f(x_k)$).

Other error measures include
$$\|x_k - x_*\| \quad \text{and} \quad f(x_k) - f_*,$$
where $x_*$ is a solution and $f_*$ is the optimal value of $f$.

We denote by $\{\phi_k\}$ the sequence of nonnegative scalars whose rate of convergence to $0$ we wish to find: $\phi_k \ge 0$ and $\lim_{k \to \infty} \phi_k = 0$.

We say that Q-linear convergence holds if there is some $\rho \in (0, 1)$ such that
$$\frac{\phi_{k+1}}{\phi_k} \le \rho$$
for all $k$ sufficiently large.

We say that R-linear convergence holds if there exist $\rho \in (0, 1)$ and $C > 0$ such that
$$\phi_k \le C\rho^k, \quad k = 1, 2, \dots$$

Q-linear $\Rightarrow$ R-linear, but R-linear $\not\Rightarrow$ Q-linear. Example:
$$\phi_k = \begin{cases} 2^{-k} & k \text{ even}, \\ 0 & k \text{ odd}, \end{cases}$$
satisfies the R-linear condition with $C = 1$ and $\rho = 1/2$ but is not Q-linear.
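The sketch below (mine, not from the slides) generates this sequence and prints the successive ratios: they become infinite whenever $\phi_k = 0$ and $\phi_{k+1} > 0$, so no single $\rho < 1$ works, even though the bound $\phi_k \le (1/2)^k$ holds for every $k$.

```python
# Sketch: phi_k = 2^{-k} for even k and 0 for odd k is R-linear (phi_k <= (1/2)^k
# with C = 1) but not Q-linear, since the ratio phi_{k+1}/phi_k is unbounded.
import numpy as np

k = np.arange(1, 11)
phi = np.where(k % 2 == 0, 2.0 ** (-k), 0.0)

print("R-linear bound phi_k <= (1/2)^k holds:", bool(np.all(phi <= 0.5 ** k)))
with np.errstate(divide="ignore", invalid="ignore"):
    ratios = phi[1:] / phi[:-1]        # inf whenever phi_k = 0 and phi_{k+1} > 0
print("successive ratios:", ratios)
```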


Sublinear convergence is, as its name suggests, slower than linear. Several varieties of sublinear convergence are
$$\phi_k \le \frac{C}{\sqrt{k}}, \qquad \phi_k \le \frac{C}{k}, \qquad \phi_k \le \frac{C}{k^2}, \qquad k = 1, 2, \dots,$$
where in each case $C$ is some positive constant.

Superlinear convergence occurs when the constant $\rho \in (0, 1)$ in $\phi_{k+1}/\phi_k \le \rho$ can be chosen arbitrarily close to $0$.

The sequence $\{\phi_k\}$ converges Q-superlinearly to $0$ if
$$\lim_{k \to \infty} \frac{\phi_{k+1}}{\phi_k} = 0.$$

It converges Q-quadratically if
$$\frac{\phi_{k+1}}{\phi_k^2} \le C, \quad k = 1, 2, \dots,$$
for some sufficiently large $C$.

It converges R-superlinearly if there is a Q-superlinearly convergent sequence $\{\nu_k\}$ that dominates $\{\phi_k\}$, that is,
$$0 \le \phi_k \le \nu_k, \quad k = 1, 2, \dots$$
R-quadratic convergence is defined similarly.

Quadratic and superlinear rates are associated with higher-order methods, such as Newton and quasi-Newton methods.

9.1 Complexity

For a sequence satisfying the R-linear convergence condition, a sufficient condition for $\phi_k \le \epsilon < C$ is
$$C\rho^k \le \epsilon.$$
By using the estimate $\log\rho \le \rho - 1$ for all $\rho \in (0, 1)$, we have that
$$C\rho^k \le \epsilon \ \Leftrightarrow\ k\log\rho \le \log(\epsilon/C) \ \Leftarrow\ k \ge \frac{\log(C/\epsilon)}{1 - \rho}.$$
The complexity is $O\!\left(\frac{\log(1/\epsilon)}{1 - \rho}\right)$.

For an algorithm that satisfies the sublinear rates, sufficient conditions for $\phi_k \le \epsilon < C$ are
$$C/\sqrt{k} \le \epsilon \ \Leftrightarrow\ k \ge (C/\epsilon)^2, \quad \text{complexity } O(1/\epsilon^2);$$
$$C/k \le \epsilon \ \Leftrightarrow\ k \ge C/\epsilon, \quad \text{complexity } O(1/\epsilon);$$
$$C/k^2 \le \epsilon \ \Leftrightarrow\ k \ge \sqrt{C/\epsilon}, \quad \text{complexity } O(1/\sqrt{\epsilon}).$$
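A quick sketch (my own, with arbitrary illustrative constants) tabulating the iteration counts that these sufficient conditions give for a target accuracy $\epsilon$:

```python
# Sketch: iteration counts sufficient to reach phi_k <= eps under the rates above,
# using arbitrary illustrative constants C and rho.
import numpy as np

C, rho, eps = 10.0, 0.9, 1e-6

print("R-linear, k >= log(C/eps)/(1-rho):", int(np.ceil(np.log(C / eps) / (1.0 - rho))))
print("C/sqrt(k), k >= (C/eps)^2        :", int(np.ceil((C / eps) ** 2)))
print("C/k,       k >= C/eps            :", int(np.ceil(C / eps)))
print("C/k^2,     k >= sqrt(C/eps)      :", int(np.ceil(np.sqrt(C / eps))))
```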


For Q-quadratically convergent methods, the complexity is doubly logarithmic in $\epsilon$, that is,
$$O(\log\log(1/\epsilon)).$$
To show this, one can consider the sequence
$$e_k = C\phi_k,$$
which satisfies $e_{k+1} \le e_k^2$.

Once the algorithm enters a neighborhood of quadratic convergence, just a few additional iterations are required for convergence to a solution of high accuracy.

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34

Page 2: Lecture 6: Preliminaries of Optimizationmath.xmu.edu.cn/group/nona/damc/Lecture06.pdf · Preliminaries of Optimization Lecture 6 March 27-April 3, 2020 17 / 34. A point x is called

1 Basic definitions

The effective domain of f Rn 983041rarr R cup +infin is defined as

dom(f) = x | f(x) lt +infin

A function f Rn 983041rarr R cup +infin is called proper if there exists atleast one x isin Rn such that f(x) lt +infin meaning that dom(f) ∕= emptyThe epigraph of f Rn 983041rarr R cup +infin is defined by

epi(f) = (x y) f(x) le yx isin Rn y isin R

A function f Rn 983041rarr R cup +infin is called closed if epi(f) is closed

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 2 34

2 Solutions of minx

f(x)

x983183 is a local minimizer of f if there is a neighborhood N of x983183

such that f(x) ge f(x983183) for all x isin N

x983183 is a strict local minimizer if it is a local minimizer on someneighborhood N and in addition f(x) gt f(x983183) for all x isin N withx ∕= x983183

x983183 is an isolated local minimizer if there is a neighborhood N of x983183

such that f(x) ge f(x983183) for all x isin N and in addition N containsno local minimizers other than x983183

Strict local minimizers are not always isolated for example

f(x) = x4 cos(1x) + 2x4 f(0) = 0

All isolated local minimizers are strict

x983183 is a global minimizer of f if f(x) ge f(x983183) for all x isin Rn

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 3 34

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 4 34

3 Convexity

A set Ω sube Rn is called convex if it has the property that

forall xy isin Ω rArr (1minus α)x+ αy isin Ω forall α isin [0 1]

We usually deal with closed convex sets

For a convex set Ω sube Rn we define the indicator function IΩ asfollows

IΩ(x) =

9830830 if x isin Ω

+infin otherwise

The constrained optimization problem

minxisinΩ

f(x)

can be restated equivalently as follows

minxisinRn

f(x) + IΩ(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 5 34

A convex function f Rn 983041rarr R cup +infin has the following definingproperty dom(f) is convex and forall xy isin dom(f) forall α isin [0 1]

f((1minus α)x+ αy) 983249 (1minus α)f(x) + αf(y)

Theorem 1 (First-order convexity condition)

Differentiable f is convex if and only if dom(f) is convex and

f(y) ge f(x) +nablaf(x)T(y minus x) forall xy isin dom(f)

Theorem 2 (Second-order convexity conditions)

Assume f is twice continuously differentiable Then f is convex if andonly if dom(f) is convex and

nabla2f(x) ≽ 0 forall x isin dom(f)

that is nabla2f(x) is positive semidefinite

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 6 34

Important properties for convex objective functions

983183 Any local minimizer is also a global minimizer (see Theorem 12)

983183 The set of global minimizers is a convex set (easy to prove)

If there exists a value γ gt 0 such that

f((1minus α)x+ αy) 983249 (1minus α)f(x) + αf(y)minus γ

2α(1minus α)983042xminus y98304222

for all xy isin Rn and α isin [0 1] we say that f is strongly convexwith modulus of convexity γ

For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ

f(y) ge f(x) +nablaf(x)T(y minus x) +γ

2983042y minus x9830422

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 7 34

4 Subgradient and subdifferential

Definition A vector v isin Rn is a subgradient of f at a point x if forall d isin Rn it holds

f(x+ d) ge f(x) + vTd

The subdifferential denoted partf(x) is the set of all subgradients off at x (see FOMO for concrete examples)

Lemma 3 (Monotonicity of subdifferentials of convex functions)

forall convex f if a isin partf(x) and b isin partf(y) we have (aminusb)T(xminus y) ge 0

Proof By the definition of subgradient we have

f(y) ge f(x) + aT(y minus x) and f(x) ge f(y) + bT(xminus y)

Adding these two inequalities yields the statement

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 8 34

Theorem 4 (Fermatrsquos lemma generalization in convex functions)

Let f Rn 983041rarr R cup +infin be a proper convex function Then the pointx983183 is a minimizer of f(x) if and only if

0 isin partf(x983183)

Proof ldquolArrrdquo Suppose that 0 isin partf(x983183) We have

f(x983183 + d) ge f(x983183) forall d isin Rn

which implies that x983183 is a minimizer of f ldquorArrrdquo by definition

Theorem 5

Let f Rn 983041rarr R cup +infin be proper convex and let x isin int(dom(f))

If f is differentiable at x then partf(x) = nablaf(x)If partf(x) is a singleton (a set containing a single vector) then f isdifferentiable at x with gradient equal to the unique subgradient

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 9 34

5 Taylorrsquos theorem

Taylorrsquos theorem shows how smooth functions can be locallyapproximated by low-order (eg linear or quadratic) functions

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 10 34

Theorem 6 (Taylorrsquos theorem)

Given a continuously differentiable function f Rn 983041rarr R and givenxp isin Rn we have that

f(x+ p) = f(x) +

983133 1

0nablaf(x+ tp)Tpdt

f(x+ p) = f(x) +nablaf(x+ ξp)Tp for some ξ isin (0 1)

If f is twice continuously differentiable we have

nablaf(x+ p) = nablaf(x) +

983133 1

0nabla2f(x+ tp)pdt

f(x+ p) = f(x) +nablaf(x)Tp+1

2pTnabla2f(x+ ξp)p

for some ξ isin (0 1)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 11 34

If f Rn 983041rarr R is differentiable and convex then

partf(x) = nablaf(x)

andf(y) ge f(x) +nablaf(x)T(y minus x)

for all xy isin Rn

Lipschitz continuously differentiable with constant L

983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042 for all xy isin Rn

If f is Lipschitz continuously differentiable with constant L thenby Taylorrsquos theorem we have

f(y)minus f(x)minusnablaf(x)T(y minus x) le L

2983042y minus x9830422

for all xy isin Rn

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 12 34

For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ

f(y) ge f(x) +nablaf(x)T(y minus x) +γ

2983042y minus x9830422

Lemma 7

Suppose f is strongly convex with modulus of convexity γ and nablaf isuniformly Lipschitz continuous with constant L We have forall xy that

γ

2983042y minus x9830422 le f(y)minus f(x)minusnablaf(x)T(y minus x) le L

2983042y minus x9830422

Condition number κ = Lγ (eg strictly convex quadratic f)

When f is twice continuously differentiable the inequalities inLemma 7 is equivalent to

γI ≼ nabla2f(x) ≼ LI for all x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 13 34

Theorem 8

Let f be differentiable and strongly convex with modulus of convexityγ gt 0 Then the minimizer x983183 of f exists and is unique

Proof (i) Compactness of level set Show that for any point x0 thelevel set

x | f(x) le f(x0)

is closed and bounded and hence compact(ii) Existence Since f is continuous it attains its minimum on the

compact level set which is also the solution of minx f(x)(iii) Uniqueness Suppose for contradiction that the minimizer is

not unique so that we have two points x1983183 and x2

983183 that minimize f Byusing the strongly convex property we can prove

f

983061x1983183 + x2

983183

2

983062lt f(x1

983183) = f(x2983183)

This is a contradiction

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 14 34

6 Optimality conditions for smooth functions

Theorem 9 (First-order necessary condition)

If x983183 is a local minimizer of f and f is continuously differentiable inan open neighborhood of x983183 then nablaf(x983183) = 0

Proof Suppose for contradiction that nablaf(x983183) ∕= 0 Define the vectorp = minusnablaf(x983183) and note that pTnablaf(x983183) = minus983042nablaf(x983183)98304222 lt 0 Becausenablaf is continuous near x983183 there is a scalar T gt 0 such that

pTnablaf(x983183 + tp) lt 0 for all t isin [0 T ]

For any s isin (0 T ] we have by Taylorrsquos theorem that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183 + ξsp) for some ξ isin (0 1)

Therefore f(x983183 + sp) lt f(x983183) for all s isin (0 T ] We have found adirection leading away from x983183 along which f decreases so x983183 is not alocal minimizer and we have a contradiction

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 15 34

Theorem 10 (Second-order necessary conditions)

If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite

Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that

pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]

By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1

2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)

As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34

Theorem 11 (Second-order sufficient conditions)

Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f

Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and

f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1

2pTnabla2f(x983183 + ξp)p

= f(x983183) +1

2pTnabla2f(x983183 + ξp)p

for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34

A point x is called a stationary point if

nablaf(x) = 0

A stationary point x is called a saddle point if there exist u and vsuch that

f(x+ αu) lt f(x) and f(x+ αv) gt f(x)

for all sufficiently small α gt 0

Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points

If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point

If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34

Theorem 12

(i) forall convex f any local minimizer x983183 is a global minimizer of f

(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f

Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is

x = λz+ (1minus λ)x983183 for some λ isin (0 1]

By the convexity property for f we have

f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)

Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34

(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have

nablaf(x983183)T(zminus x983183) =

d

dλf(x983183 + λ(zminus x983183))|λ=0

= limλrarr0+

f(x983183 + λ(zminus x983183))minus f(x983183)

λ

le limλrarr0+

λf(z) + (1minus λ)f(x983183)minus f(x983183)

λ

= f(z)minus f(x983183) lt 0

Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point

Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms

Numerical algorithms try to seek a point where nablaf vanishes

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34

7 Karush-Kuhn-Tucker conditions

Theorem 13 (KKT conditions)

Consider the minimization problem

minxisinRn

f(x) st gi(x) le 0 i = 1 m

where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions

Let x983183 be an optimal solution and assume Slaterrsquos condition

exist x isin Rn st gi(x) lt 0 i = 1 m

hold Then there exist λ1 middot middot middot λm ge 0 satisfying

0 isin partf(x983183) +

m983131

i=1

λipartgi(x983183) λigi(x983183) = 0 i = 1 m

If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34

Active gi(x) le 0

Inactive gi(x) le 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34

8 Moreau envelope and proximal operator

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is

Mλh(x) = infu

983069h(u) +

1

2λ983042uminus x9830422

983070

=1

λinfu

983069λh(u) +

1

2983042uminus x9830422

983070

For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is

proxf (x) = argminu

983069f(u) +

1

2983042uminus x9830422

983070

From the optimality condition (see Theorem 4) we have

0 isin partf(proxf (x)) + (proxf (x)minus x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34

For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if

x983183 = proxf (x983183)

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope

Mλh(x) = h(proxλh(x)) +1

2λ983042proxλh(x)minus x9830422

can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn

minusinfin lt Mλh(x) lt +infin

even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient

nablaMλh(x) =1

λ(xminus proxλh(x))

Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34

Example h(x) = |x| and Hλ(x) = Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34

Lemma 14 (Nonexpansivity of proximal operator)

For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn

983042proxf (x)minus proxf (y)983042 le 983042xminus y983042

Proof From the optimality conditions at two points x and y we have

xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))

By applying monotonicity (see Lemma 3) we have

((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0

Rearranging this and applying the Cauchy-Schwartz inequality yields

983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))

le 983042xminus y983042983042proxf (x)minus proxf (y)983042

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34

Examples of several proximal operators

(1) f(x) = 0proxf (x) = x

(2) f(x) = λ983042x9830421 with λ gt 0

983045proxλ983042middot9830421(x)

983046i= argmin

uisinR

983069λ|u|+ 1

2(uminus xi)

2

983070

=

983099983105983103

983105983101

xi minus λ if xi gt λ

0 if xi isin [minusλλ]

xi + λ if xi lt minusλ

which is known as soft-thresholding

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34

(3) f(x) = λ983042x9830420 the number of nonzero components non-convex

983045proxλ983042middot9830420(x)

983046i=

983099983105983103

983105983101

xi if |xi| gtradic2λ

0 xi if |xi| =radic2λ

0 if |xi| ltradic2λ

which is known as hard thresholding

(4) f(x) = IΩ(x)

proxIΩ(x) = argminu

983069IΩ(u) +

1

2983042uminus x9830422

983070= argmin

uisinΩ983042uminus x983042

which is simply the projection of x onto the set Ω

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34

9 Convergence rates

An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error

For smooth f 983042nablaf(xk)983042

For nonsmooth convex f

dist(0 partf(xk))

(the sequence of distances from 0 to the subdifferential partf(xk))

Other error measures include

983042xk minus x983183983042 and f(xk)minus f983183

where x983183 is a solution and f983183 is the optimal value of f

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34

We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim

krarrinfinφk = 0

We say that Q-linear convergence holds if there is some ρ isin (0 1)such that

φk+1

φkle ρ

for all k sufficiently large

We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that

φk le Cρk k = 1 2

Q-linear rArr R-linear But R-linear ⇏ Q-linear Example

φk =

9830832minusk k even

0 k odd

satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34

Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence

φk le Cradick φk le Ck φk le Ck2 k = 1 2

where in each case C is some positive constant

Superlinear convergence occurs when the constant ρ isin (0 1) in

φk+1φk le ρ

can be chosen arbitrarily close to 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34

The sequence φk converges Q-superlinearly to 0 if

limkrarrinfin

φk+1

φk= 0

Q-Quadratic ifφk+1

φ2k

le C k = 1 2

for some sufficiently large C

R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is

0 le φk le νk k = 1 2

R-quadratic convergence is defined similarly

Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34

91 Complexity

For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is

Cρk le 983171

By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that

Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)

The complexity is O( log(1983171)1minusρ )

For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C

Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)

Ck le 983171 hArr k ge C983171 complexity O(1983171)

Ck2 le 983171 hArr k ge983155

C983171 complexity O(1radic983171)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34

For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is

O(log log(1983171))

To show this one can consider the sequence

ek = Cφk

which satisfiesek+1 le e2k

Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34

Page 3: Lecture 6: Preliminaries of Optimizationmath.xmu.edu.cn/group/nona/damc/Lecture06.pdf · Preliminaries of Optimization Lecture 6 March 27-April 3, 2020 17 / 34. A point x is called

2 Solutions of minx

f(x)

x983183 is a local minimizer of f if there is a neighborhood N of x983183

such that f(x) ge f(x983183) for all x isin N

x983183 is a strict local minimizer if it is a local minimizer on someneighborhood N and in addition f(x) gt f(x983183) for all x isin N withx ∕= x983183

x983183 is an isolated local minimizer if there is a neighborhood N of x983183

such that f(x) ge f(x983183) for all x isin N and in addition N containsno local minimizers other than x983183

Strict local minimizers are not always isolated for example

f(x) = x4 cos(1x) + 2x4 f(0) = 0

All isolated local minimizers are strict

x983183 is a global minimizer of f if f(x) ge f(x983183) for all x isin Rn

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 3 34

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 4 34

3 Convexity

A set Ω sube Rn is called convex if it has the property that

forall xy isin Ω rArr (1minus α)x+ αy isin Ω forall α isin [0 1]

We usually deal with closed convex sets

For a convex set Ω sube Rn we define the indicator function IΩ asfollows

IΩ(x) =

9830830 if x isin Ω

+infin otherwise

The constrained optimization problem

minxisinΩ

f(x)

can be restated equivalently as follows

minxisinRn

f(x) + IΩ(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 5 34

A convex function f Rn 983041rarr R cup +infin has the following definingproperty dom(f) is convex and forall xy isin dom(f) forall α isin [0 1]

f((1minus α)x+ αy) 983249 (1minus α)f(x) + αf(y)

Theorem 1 (First-order convexity condition)

Differentiable f is convex if and only if dom(f) is convex and

f(y) ge f(x) +nablaf(x)T(y minus x) forall xy isin dom(f)

Theorem 2 (Second-order convexity conditions)

Assume f is twice continuously differentiable Then f is convex if andonly if dom(f) is convex and

nabla2f(x) ≽ 0 forall x isin dom(f)

that is nabla2f(x) is positive semidefinite

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 6 34

Important properties for convex objective functions

983183 Any local minimizer is also a global minimizer (see Theorem 12)

983183 The set of global minimizers is a convex set (easy to prove)

If there exists a value γ gt 0 such that

f((1minus α)x+ αy) 983249 (1minus α)f(x) + αf(y)minus γ

2α(1minus α)983042xminus y98304222

for all xy isin Rn and α isin [0 1] we say that f is strongly convexwith modulus of convexity γ

For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ

f(y) ge f(x) +nablaf(x)T(y minus x) +γ

2983042y minus x9830422

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 7 34

4 Subgradient and subdifferential

Definition A vector v isin Rn is a subgradient of f at a point x if forall d isin Rn it holds

f(x+ d) ge f(x) + vTd

The subdifferential denoted partf(x) is the set of all subgradients off at x (see FOMO for concrete examples)

Lemma 3 (Monotonicity of subdifferentials of convex functions)

forall convex f if a isin partf(x) and b isin partf(y) we have (aminusb)T(xminus y) ge 0

Proof By the definition of subgradient we have

f(y) ge f(x) + aT(y minus x) and f(x) ge f(y) + bT(xminus y)

Adding these two inequalities yields the statement

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 8 34

Theorem 4 (Fermatrsquos lemma generalization in convex functions)

Let f Rn 983041rarr R cup +infin be a proper convex function Then the pointx983183 is a minimizer of f(x) if and only if

0 isin partf(x983183)

Proof ldquolArrrdquo Suppose that 0 isin partf(x983183) We have

f(x983183 + d) ge f(x983183) forall d isin Rn

which implies that x983183 is a minimizer of f ldquorArrrdquo by definition

Theorem 5

Let f Rn 983041rarr R cup +infin be proper convex and let x isin int(dom(f))

If f is differentiable at x then partf(x) = nablaf(x)If partf(x) is a singleton (a set containing a single vector) then f isdifferentiable at x with gradient equal to the unique subgradient

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 9 34

5 Taylorrsquos theorem

Taylorrsquos theorem shows how smooth functions can be locallyapproximated by low-order (eg linear or quadratic) functions

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 10 34

Theorem 6 (Taylorrsquos theorem)

Given a continuously differentiable function f Rn 983041rarr R and givenxp isin Rn we have that

f(x+ p) = f(x) +

983133 1

0nablaf(x+ tp)Tpdt

f(x+ p) = f(x) +nablaf(x+ ξp)Tp for some ξ isin (0 1)

If f is twice continuously differentiable we have

nablaf(x+ p) = nablaf(x) +

983133 1

0nabla2f(x+ tp)pdt

f(x+ p) = f(x) +nablaf(x)Tp+1

2pTnabla2f(x+ ξp)p

for some ξ isin (0 1)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 11 34

If f Rn 983041rarr R is differentiable and convex then

partf(x) = nablaf(x)

andf(y) ge f(x) +nablaf(x)T(y minus x)

for all xy isin Rn

Lipschitz continuously differentiable with constant L

983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042 for all xy isin Rn

If f is Lipschitz continuously differentiable with constant L thenby Taylorrsquos theorem we have

f(y)minus f(x)minusnablaf(x)T(y minus x) le L

2983042y minus x9830422

for all xy isin Rn

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 12 34

For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ

f(y) ge f(x) +nablaf(x)T(y minus x) +γ

2983042y minus x9830422

Lemma 7

Suppose f is strongly convex with modulus of convexity γ and nablaf isuniformly Lipschitz continuous with constant L We have forall xy that

γ

2983042y minus x9830422 le f(y)minus f(x)minusnablaf(x)T(y minus x) le L

2983042y minus x9830422

Condition number κ = Lγ (eg strictly convex quadratic f)

When f is twice continuously differentiable the inequalities inLemma 7 is equivalent to

γI ≼ nabla2f(x) ≼ LI for all x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 13 34

Theorem 8

Let f be differentiable and strongly convex with modulus of convexityγ gt 0 Then the minimizer x983183 of f exists and is unique

Proof (i) Compactness of level set Show that for any point x0 thelevel set

x | f(x) le f(x0)

is closed and bounded and hence compact(ii) Existence Since f is continuous it attains its minimum on the

compact level set which is also the solution of minx f(x)(iii) Uniqueness Suppose for contradiction that the minimizer is

not unique so that we have two points x1983183 and x2

983183 that minimize f Byusing the strongly convex property we can prove

f

983061x1983183 + x2

983183

2

983062lt f(x1

983183) = f(x2983183)

This is a contradiction

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 14 34

6 Optimality conditions for smooth functions

Theorem 9 (First-order necessary condition)

If x983183 is a local minimizer of f and f is continuously differentiable inan open neighborhood of x983183 then nablaf(x983183) = 0

Proof Suppose for contradiction that nablaf(x983183) ∕= 0 Define the vectorp = minusnablaf(x983183) and note that pTnablaf(x983183) = minus983042nablaf(x983183)98304222 lt 0 Becausenablaf is continuous near x983183 there is a scalar T gt 0 such that

pTnablaf(x983183 + tp) lt 0 for all t isin [0 T ]

For any s isin (0 T ] we have by Taylorrsquos theorem that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183 + ξsp) for some ξ isin (0 1)

Therefore f(x983183 + sp) lt f(x983183) for all s isin (0 T ] We have found adirection leading away from x983183 along which f decreases so x983183 is not alocal minimizer and we have a contradiction

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 15 34

Theorem 10 (Second-order necessary conditions)

If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite

Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that

pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]

By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1

2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)

As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34

Theorem 11 (Second-order sufficient conditions)

Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f

Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and

f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1

2pTnabla2f(x983183 + ξp)p

= f(x983183) +1

2pTnabla2f(x983183 + ξp)p

for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34

A point x is called a stationary point if

nablaf(x) = 0

A stationary point x is called a saddle point if there exist u and vsuch that

f(x+ αu) lt f(x) and f(x+ αv) gt f(x)

for all sufficiently small α gt 0

Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points

If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point

If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34

Theorem 12

(i) forall convex f any local minimizer x983183 is a global minimizer of f

(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f

Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is

x = λz+ (1minus λ)x983183 for some λ isin (0 1]

By the convexity property for f we have

f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)

Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34

(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have

nablaf(x983183)T(zminus x983183) =

d

dλf(x983183 + λ(zminus x983183))|λ=0

= limλrarr0+

f(x983183 + λ(zminus x983183))minus f(x983183)

λ

le limλrarr0+

λf(z) + (1minus λ)f(x983183)minus f(x983183)

λ

= f(z)minus f(x983183) lt 0

Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point

Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms

Numerical algorithms try to seek a point where nablaf vanishes

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34

7 Karush-Kuhn-Tucker conditions

Theorem 13 (KKT conditions)

Consider the minimization problem

minxisinRn

f(x) st gi(x) le 0 i = 1 m

where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions

Let x983183 be an optimal solution and assume Slaterrsquos condition

exist x isin Rn st gi(x) lt 0 i = 1 m

hold Then there exist λ1 middot middot middot λm ge 0 satisfying

0 isin partf(x983183) +

m983131

i=1

λipartgi(x983183) λigi(x983183) = 0 i = 1 m

If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34

Active gi(x) le 0

Inactive gi(x) le 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34

8 Moreau envelope and proximal operator

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is

Mλh(x) = infu

983069h(u) +

1

2λ983042uminus x9830422

983070

=1

λinfu

983069λh(u) +

1

2983042uminus x9830422

983070

For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is

proxf (x) = argminu

983069f(u) +

1

2983042uminus x9830422

983070

From the optimality condition (see Theorem 4) we have

0 isin partf(proxf (x)) + (proxf (x)minus x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34

For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if

x983183 = proxf (x983183)

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope

Mλh(x) = h(proxλh(x)) +1

2λ983042proxλh(x)minus x9830422

can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn

minusinfin lt Mλh(x) lt +infin

even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient

nablaMλh(x) =1

λ(xminus proxλh(x))

Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34

Example h(x) = |x| and Hλ(x) = Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34

Lemma 14 (Nonexpansivity of proximal operator)

For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn

983042proxf (x)minus proxf (y)983042 le 983042xminus y983042

Proof From the optimality conditions at two points x and y we have

xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))

By applying monotonicity (see Lemma 3) we have

((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0

Rearranging this and applying the Cauchy-Schwartz inequality yields

983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))

le 983042xminus y983042983042proxf (x)minus proxf (y)983042

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34

Examples of several proximal operators

(1) f(x) = 0proxf (x) = x

(2) f(x) = λ983042x9830421 with λ gt 0

983045proxλ983042middot9830421(x)

983046i= argmin

uisinR

983069λ|u|+ 1

2(uminus xi)

2

983070

=

983099983105983103

983105983101

xi minus λ if xi gt λ

0 if xi isin [minusλλ]

xi + λ if xi lt minusλ

which is known as soft-thresholding

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34

(3) f(x) = λ983042x9830420 the number of nonzero components non-convex

983045proxλ983042middot9830420(x)

983046i=

983099983105983103

983105983101

xi if |xi| gtradic2λ

0 xi if |xi| =radic2λ

0 if |xi| ltradic2λ

which is known as hard thresholding

(4) f(x) = IΩ(x)

proxIΩ(x) = argminu

983069IΩ(u) +

1

2983042uminus x9830422

983070= argmin

uisinΩ983042uminus x983042

which is simply the projection of x onto the set Ω

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34

9 Convergence rates

An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error

For smooth f 983042nablaf(xk)983042

For nonsmooth convex f

dist(0 partf(xk))

(the sequence of distances from 0 to the subdifferential partf(xk))

Other error measures include

983042xk minus x983183983042 and f(xk)minus f983183

where x983183 is a solution and f983183 is the optimal value of f

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34

We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim

krarrinfinφk = 0

We say that Q-linear convergence holds if there is some ρ isin (0 1)such that

φk+1

φkle ρ

for all k sufficiently large

We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that

φk le Cρk k = 1 2

Q-linear rArr R-linear But R-linear ⇏ Q-linear Example

φk =

9830832minusk k even

0 k odd

satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34

Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence

φk le Cradick φk le Ck φk le Ck2 k = 1 2

where in each case C is some positive constant

Superlinear convergence occurs when the constant ρ isin (0 1) in

φk+1φk le ρ

can be chosen arbitrarily close to 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34

The sequence φk converges Q-superlinearly to 0 if

limkrarrinfin

φk+1

φk= 0

Q-Quadratic ifφk+1

φ2k

le C k = 1 2

for some sufficiently large C

R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is

0 le φk le νk k = 1 2

R-quadratic convergence is defined similarly

Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34

9.1 Complexity

For a sequence satisfying the R-linear convergence condition, a sufficient condition for φ_k ≤ ε < C is

    C ρ^k ≤ ε.

By using the estimate log ρ ≤ ρ − 1 for all ρ ∈ (0, 1), we have that

    C ρ^k ≤ ε  ⇔  k log ρ ≤ log(ε/C)  ⇐  k ≥ log(C/ε)/(1 − ρ).

The complexity is O(log(1/ε)/(1 − ρ)).

For an algorithm that satisfies the sublinear rates, sufficient conditions for φ_k ≤ ε < C are:

    C/√k ≤ ε  ⇔  k ≥ (C/ε)²,  complexity O(1/ε²);
    C/k ≤ ε   ⇔  k ≥ C/ε,     complexity O(1/ε);
    C/k² ≤ ε  ⇔  k ≥ √(C/ε),  complexity O(1/√ε).
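
A small sketch (plain Python, with C = 1, ρ = 0.9 and ε = 10⁻⁶ chosen only for illustration) that turns these bounds into concrete iteration counts:

```python
import math

C, eps = 1.0, 1e-6

# R-linear rate with rho = 0.9: k >= log(C/eps) / (1 - rho)
print(math.ceil(math.log(C / eps) / (1.0 - 0.9)))  # 139

# Sublinear rates
print(math.ceil((C / eps) ** 2))      # C/sqrt(k): 10**12 iterations
print(math.ceil(C / eps))             # C/k:       10**6 iterations
print(math.ceil(math.sqrt(C / eps)))  # C/k**2:    10**3 iterations
```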

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34

For Q-quadratically convergent methods, the complexity is doubly logarithmic in ε, that is,

    O(log log(1/ε)).

To show this, one can consider the sequence

    e_k = C φ_k,

which satisfies e_{k+1} ≤ e_k².

Once the algorithm enters a neighborhood of quadratic convergence, just a few additional iterations are required for convergence to a solution of high accuracy.
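
Since e_{k+1} ≤ e_k² implies e_k ≤ e_0^(2^k), the iteration count needed to reach a tolerance ε is about log₂(log(1/ε)/log(1/e_0)). A sketch (plain Python; e_0 = 1/2 and ε = 10⁻¹⁶ are our choices):

```python
e, eps, k = 0.5, 1e-16, 0
while e > eps:
    e = e * e  # one Q-quadratic step: e_{k+1} <= e_k**2
    k += 1
print(k)  # 6, matching log2(log(1/eps) / log(1/e_0)) ~ 5.7
```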

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34

Page 4: Lecture 6: Preliminaries of Optimizationmath.xmu.edu.cn/group/nona/damc/Lecture06.pdf · Preliminaries of Optimization Lecture 6 March 27-April 3, 2020 17 / 34. A point x is called

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 4 34

3 Convexity

A set Ω sube Rn is called convex if it has the property that

forall xy isin Ω rArr (1minus α)x+ αy isin Ω forall α isin [0 1]

We usually deal with closed convex sets

For a convex set Ω sube Rn we define the indicator function IΩ asfollows

IΩ(x) =

9830830 if x isin Ω

+infin otherwise

The constrained optimization problem

minxisinΩ

f(x)

can be restated equivalently as follows

minxisinRn

f(x) + IΩ(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 5 34

A convex function f Rn 983041rarr R cup +infin has the following definingproperty dom(f) is convex and forall xy isin dom(f) forall α isin [0 1]

f((1minus α)x+ αy) 983249 (1minus α)f(x) + αf(y)

Theorem 1 (First-order convexity condition)

Differentiable f is convex if and only if dom(f) is convex and

f(y) ge f(x) +nablaf(x)T(y minus x) forall xy isin dom(f)

Theorem 2 (Second-order convexity conditions)

Assume f is twice continuously differentiable Then f is convex if andonly if dom(f) is convex and

nabla2f(x) ≽ 0 forall x isin dom(f)

that is nabla2f(x) is positive semidefinite

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 6 34

Important properties for convex objective functions

983183 Any local minimizer is also a global minimizer (see Theorem 12)

983183 The set of global minimizers is a convex set (easy to prove)

If there exists a value γ gt 0 such that

f((1minus α)x+ αy) 983249 (1minus α)f(x) + αf(y)minus γ

2α(1minus α)983042xminus y98304222

for all xy isin Rn and α isin [0 1] we say that f is strongly convexwith modulus of convexity γ

For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ

f(y) ge f(x) +nablaf(x)T(y minus x) +γ

2983042y minus x9830422

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 7 34

4 Subgradient and subdifferential

Definition A vector v isin Rn is a subgradient of f at a point x if forall d isin Rn it holds

f(x+ d) ge f(x) + vTd

The subdifferential denoted partf(x) is the set of all subgradients off at x (see FOMO for concrete examples)

Lemma 3 (Monotonicity of subdifferentials of convex functions)

forall convex f if a isin partf(x) and b isin partf(y) we have (aminusb)T(xminus y) ge 0

Proof By the definition of subgradient we have

f(y) ge f(x) + aT(y minus x) and f(x) ge f(y) + bT(xminus y)

Adding these two inequalities yields the statement

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 8 34

Theorem 4 (Fermatrsquos lemma generalization in convex functions)

Let f Rn 983041rarr R cup +infin be a proper convex function Then the pointx983183 is a minimizer of f(x) if and only if

0 isin partf(x983183)

Proof ldquolArrrdquo Suppose that 0 isin partf(x983183) We have

f(x983183 + d) ge f(x983183) forall d isin Rn

which implies that x983183 is a minimizer of f ldquorArrrdquo by definition

Theorem 5

Let f Rn 983041rarr R cup +infin be proper convex and let x isin int(dom(f))

If f is differentiable at x then partf(x) = nablaf(x)If partf(x) is a singleton (a set containing a single vector) then f isdifferentiable at x with gradient equal to the unique subgradient

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 9 34

5 Taylorrsquos theorem

Taylorrsquos theorem shows how smooth functions can be locallyapproximated by low-order (eg linear or quadratic) functions

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 10 34

Theorem 6 (Taylorrsquos theorem)

Given a continuously differentiable function f Rn 983041rarr R and givenxp isin Rn we have that

f(x+ p) = f(x) +

983133 1

0nablaf(x+ tp)Tpdt

f(x+ p) = f(x) +nablaf(x+ ξp)Tp for some ξ isin (0 1)

If f is twice continuously differentiable we have

nablaf(x+ p) = nablaf(x) +

983133 1

0nabla2f(x+ tp)pdt

f(x+ p) = f(x) +nablaf(x)Tp+1

2pTnabla2f(x+ ξp)p

for some ξ isin (0 1)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 11 34

If f Rn 983041rarr R is differentiable and convex then

partf(x) = nablaf(x)

andf(y) ge f(x) +nablaf(x)T(y minus x)

for all xy isin Rn

Lipschitz continuously differentiable with constant L

983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042 for all xy isin Rn

If f is Lipschitz continuously differentiable with constant L thenby Taylorrsquos theorem we have

f(y)minus f(x)minusnablaf(x)T(y minus x) le L

2983042y minus x9830422

for all xy isin Rn

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 12 34

For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ

f(y) ge f(x) +nablaf(x)T(y minus x) +γ

2983042y minus x9830422

Lemma 7

Suppose f is strongly convex with modulus of convexity γ and nablaf isuniformly Lipschitz continuous with constant L We have forall xy that

γ

2983042y minus x9830422 le f(y)minus f(x)minusnablaf(x)T(y minus x) le L

2983042y minus x9830422

Condition number κ = Lγ (eg strictly convex quadratic f)

When f is twice continuously differentiable the inequalities inLemma 7 is equivalent to

γI ≼ nabla2f(x) ≼ LI for all x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 13 34

Theorem 8

Let f be differentiable and strongly convex with modulus of convexityγ gt 0 Then the minimizer x983183 of f exists and is unique

Proof (i) Compactness of level set Show that for any point x0 thelevel set

x | f(x) le f(x0)

is closed and bounded and hence compact(ii) Existence Since f is continuous it attains its minimum on the

compact level set which is also the solution of minx f(x)(iii) Uniqueness Suppose for contradiction that the minimizer is

not unique so that we have two points x1983183 and x2

983183 that minimize f Byusing the strongly convex property we can prove

f

983061x1983183 + x2

983183

2

983062lt f(x1

983183) = f(x2983183)

This is a contradiction

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 14 34

6 Optimality conditions for smooth functions

Theorem 9 (First-order necessary condition)

If x983183 is a local minimizer of f and f is continuously differentiable inan open neighborhood of x983183 then nablaf(x983183) = 0

Proof Suppose for contradiction that nablaf(x983183) ∕= 0 Define the vectorp = minusnablaf(x983183) and note that pTnablaf(x983183) = minus983042nablaf(x983183)98304222 lt 0 Becausenablaf is continuous near x983183 there is a scalar T gt 0 such that

pTnablaf(x983183 + tp) lt 0 for all t isin [0 T ]

For any s isin (0 T ] we have by Taylorrsquos theorem that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183 + ξsp) for some ξ isin (0 1)

Therefore f(x983183 + sp) lt f(x983183) for all s isin (0 T ] We have found adirection leading away from x983183 along which f decreases so x983183 is not alocal minimizer and we have a contradiction

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 15 34

Theorem 10 (Second-order necessary conditions)

If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite

Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that

pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]

By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1

2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)

As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34

Theorem 11 (Second-order sufficient conditions)

Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f

Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and

f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1

2pTnabla2f(x983183 + ξp)p

= f(x983183) +1

2pTnabla2f(x983183 + ξp)p

for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34

A point x is called a stationary point if

nablaf(x) = 0

A stationary point x is called a saddle point if there exist u and vsuch that

f(x+ αu) lt f(x) and f(x+ αv) gt f(x)

for all sufficiently small α gt 0

Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points

If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point

If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34

Theorem 12

(i) forall convex f any local minimizer x983183 is a global minimizer of f

(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f

Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is

x = λz+ (1minus λ)x983183 for some λ isin (0 1]

By the convexity property for f we have

f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)

Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34

(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have

nablaf(x983183)T(zminus x983183) =

d

dλf(x983183 + λ(zminus x983183))|λ=0

= limλrarr0+

f(x983183 + λ(zminus x983183))minus f(x983183)

λ

le limλrarr0+

λf(z) + (1minus λ)f(x983183)minus f(x983183)

λ

= f(z)minus f(x983183) lt 0

Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point

Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms

Numerical algorithms try to seek a point where nablaf vanishes

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34

7 Karush-Kuhn-Tucker conditions

Theorem 13 (KKT conditions)

Consider the minimization problem

minxisinRn

f(x) st gi(x) le 0 i = 1 m

where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions

Let x983183 be an optimal solution and assume Slaterrsquos condition

exist x isin Rn st gi(x) lt 0 i = 1 m

hold Then there exist λ1 middot middot middot λm ge 0 satisfying

0 isin partf(x983183) +

m983131

i=1

λipartgi(x983183) λigi(x983183) = 0 i = 1 m

If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34

Active gi(x) le 0

Inactive gi(x) le 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34

8 Moreau envelope and proximal operator

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is

Mλh(x) = infu

983069h(u) +

1

2λ983042uminus x9830422

983070

=1

λinfu

983069λh(u) +

1

2983042uminus x9830422

983070

For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is

proxf (x) = argminu

983069f(u) +

1

2983042uminus x9830422

983070

From the optimality condition (see Theorem 4) we have

0 isin partf(proxf (x)) + (proxf (x)minus x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34

For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if

x983183 = proxf (x983183)

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope

Mλh(x) = h(proxλh(x)) +1

2λ983042proxλh(x)minus x9830422

can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn

minusinfin lt Mλh(x) lt +infin

even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient

nablaMλh(x) =1

λ(xminus proxλh(x))

Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34

Example h(x) = |x| and Hλ(x) = Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34

Lemma 14 (Nonexpansivity of proximal operator)

For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn

983042proxf (x)minus proxf (y)983042 le 983042xminus y983042

Proof From the optimality conditions at two points x and y we have

xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))

By applying monotonicity (see Lemma 3) we have

((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0

Rearranging this and applying the Cauchy-Schwartz inequality yields

983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))

le 983042xminus y983042983042proxf (x)minus proxf (y)983042

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34

Examples of several proximal operators

(1) f(x) = 0proxf (x) = x

(2) f(x) = λ983042x9830421 with λ gt 0

983045proxλ983042middot9830421(x)

983046i= argmin

uisinR

983069λ|u|+ 1

2(uminus xi)

2

983070

=

983099983105983103

983105983101

xi minus λ if xi gt λ

0 if xi isin [minusλλ]

xi + λ if xi lt minusλ

which is known as soft-thresholding

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34

(3) f(x) = λ983042x9830420 the number of nonzero components non-convex

983045proxλ983042middot9830420(x)

983046i=

983099983105983103

983105983101

xi if |xi| gtradic2λ

0 xi if |xi| =radic2λ

0 if |xi| ltradic2λ

which is known as hard thresholding

(4) f(x) = IΩ(x)

proxIΩ(x) = argminu

983069IΩ(u) +

1

2983042uminus x9830422

983070= argmin

uisinΩ983042uminus x983042

which is simply the projection of x onto the set Ω

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34

9 Convergence rates

An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error

For smooth f 983042nablaf(xk)983042

For nonsmooth convex f

dist(0 partf(xk))

(the sequence of distances from 0 to the subdifferential partf(xk))

Other error measures include

983042xk minus x983183983042 and f(xk)minus f983183

where x983183 is a solution and f983183 is the optimal value of f

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34

We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim

krarrinfinφk = 0

We say that Q-linear convergence holds if there is some ρ isin (0 1)such that

φk+1

φkle ρ

for all k sufficiently large

We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that

φk le Cρk k = 1 2

Q-linear rArr R-linear But R-linear ⇏ Q-linear Example

φk =

9830832minusk k even

0 k odd

satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34

Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence

φk le Cradick φk le Ck φk le Ck2 k = 1 2

where in each case C is some positive constant

Superlinear convergence occurs when the constant ρ isin (0 1) in

φk+1φk le ρ

can be chosen arbitrarily close to 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34

The sequence φk converges Q-superlinearly to 0 if

limkrarrinfin

φk+1

φk= 0

Q-Quadratic ifφk+1

φ2k

le C k = 1 2

for some sufficiently large C

R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is

0 le φk le νk k = 1 2

R-quadratic convergence is defined similarly

Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34

91 Complexity

For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is

Cρk le 983171

By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that

Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)

The complexity is O( log(1983171)1minusρ )

For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C

Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)

Ck le 983171 hArr k ge C983171 complexity O(1983171)

Ck2 le 983171 hArr k ge983155

C983171 complexity O(1radic983171)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34

For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is

O(log log(1983171))

To show this one can consider the sequence

ek = Cφk

which satisfiesek+1 le e2k

Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34

Page 5: Lecture 6: Preliminaries of Optimizationmath.xmu.edu.cn/group/nona/damc/Lecture06.pdf · Preliminaries of Optimization Lecture 6 March 27-April 3, 2020 17 / 34. A point x is called

3 Convexity

A set Ω sube Rn is called convex if it has the property that

forall xy isin Ω rArr (1minus α)x+ αy isin Ω forall α isin [0 1]

We usually deal with closed convex sets

For a convex set Ω sube Rn we define the indicator function IΩ asfollows

IΩ(x) =

9830830 if x isin Ω

+infin otherwise

The constrained optimization problem

minxisinΩ

f(x)

can be restated equivalently as follows

minxisinRn

f(x) + IΩ(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 5 34

A convex function f Rn 983041rarr R cup +infin has the following definingproperty dom(f) is convex and forall xy isin dom(f) forall α isin [0 1]

f((1minus α)x+ αy) 983249 (1minus α)f(x) + αf(y)

Theorem 1 (First-order convexity condition)

Differentiable f is convex if and only if dom(f) is convex and

f(y) ge f(x) +nablaf(x)T(y minus x) forall xy isin dom(f)

Theorem 2 (Second-order convexity conditions)

Assume f is twice continuously differentiable Then f is convex if andonly if dom(f) is convex and

nabla2f(x) ≽ 0 forall x isin dom(f)

that is nabla2f(x) is positive semidefinite

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 6 34

Important properties for convex objective functions

983183 Any local minimizer is also a global minimizer (see Theorem 12)

983183 The set of global minimizers is a convex set (easy to prove)

If there exists a value γ gt 0 such that

f((1minus α)x+ αy) 983249 (1minus α)f(x) + αf(y)minus γ

2α(1minus α)983042xminus y98304222

for all xy isin Rn and α isin [0 1] we say that f is strongly convexwith modulus of convexity γ

For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ

f(y) ge f(x) +nablaf(x)T(y minus x) +γ

2983042y minus x9830422

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 7 34

4 Subgradient and subdifferential

Definition A vector v isin Rn is a subgradient of f at a point x if forall d isin Rn it holds

f(x+ d) ge f(x) + vTd

The subdifferential denoted partf(x) is the set of all subgradients off at x (see FOMO for concrete examples)

Lemma 3 (Monotonicity of subdifferentials of convex functions)

forall convex f if a isin partf(x) and b isin partf(y) we have (aminusb)T(xminus y) ge 0

Proof By the definition of subgradient we have

f(y) ge f(x) + aT(y minus x) and f(x) ge f(y) + bT(xminus y)

Adding these two inequalities yields the statement

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 8 34

Theorem 4 (Fermatrsquos lemma generalization in convex functions)

Let f Rn 983041rarr R cup +infin be a proper convex function Then the pointx983183 is a minimizer of f(x) if and only if

0 isin partf(x983183)

Proof ldquolArrrdquo Suppose that 0 isin partf(x983183) We have

f(x983183 + d) ge f(x983183) forall d isin Rn

which implies that x983183 is a minimizer of f ldquorArrrdquo by definition

Theorem 5

Let f Rn 983041rarr R cup +infin be proper convex and let x isin int(dom(f))

If f is differentiable at x then partf(x) = nablaf(x)If partf(x) is a singleton (a set containing a single vector) then f isdifferentiable at x with gradient equal to the unique subgradient

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 9 34

5 Taylorrsquos theorem

Taylorrsquos theorem shows how smooth functions can be locallyapproximated by low-order (eg linear or quadratic) functions

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 10 34

Theorem 6 (Taylorrsquos theorem)

Given a continuously differentiable function f Rn 983041rarr R and givenxp isin Rn we have that

f(x+ p) = f(x) +

983133 1

0nablaf(x+ tp)Tpdt

f(x+ p) = f(x) +nablaf(x+ ξp)Tp for some ξ isin (0 1)

If f is twice continuously differentiable we have

nablaf(x+ p) = nablaf(x) +

983133 1

0nabla2f(x+ tp)pdt

f(x+ p) = f(x) +nablaf(x)Tp+1

2pTnabla2f(x+ ξp)p

for some ξ isin (0 1)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 11 34

If f Rn 983041rarr R is differentiable and convex then

partf(x) = nablaf(x)

andf(y) ge f(x) +nablaf(x)T(y minus x)

for all xy isin Rn

Lipschitz continuously differentiable with constant L

983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042 for all xy isin Rn

If f is Lipschitz continuously differentiable with constant L thenby Taylorrsquos theorem we have

f(y)minus f(x)minusnablaf(x)T(y minus x) le L

2983042y minus x9830422

for all xy isin Rn

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 12 34

For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ

f(y) ge f(x) +nablaf(x)T(y minus x) +γ

2983042y minus x9830422

Lemma 7

Suppose f is strongly convex with modulus of convexity γ and nablaf isuniformly Lipschitz continuous with constant L We have forall xy that

γ

2983042y minus x9830422 le f(y)minus f(x)minusnablaf(x)T(y minus x) le L

2983042y minus x9830422

Condition number κ = Lγ (eg strictly convex quadratic f)

When f is twice continuously differentiable the inequalities inLemma 7 is equivalent to

γI ≼ nabla2f(x) ≼ LI for all x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 13 34

Theorem 8

Let f be differentiable and strongly convex with modulus of convexityγ gt 0 Then the minimizer x983183 of f exists and is unique

Proof (i) Compactness of level set Show that for any point x0 thelevel set

x | f(x) le f(x0)

is closed and bounded and hence compact(ii) Existence Since f is continuous it attains its minimum on the

compact level set which is also the solution of minx f(x)(iii) Uniqueness Suppose for contradiction that the minimizer is

not unique so that we have two points x1983183 and x2

983183 that minimize f Byusing the strongly convex property we can prove

f

983061x1983183 + x2

983183

2

983062lt f(x1

983183) = f(x2983183)

This is a contradiction

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 14 34

6 Optimality conditions for smooth functions

Theorem 9 (First-order necessary condition)

If x983183 is a local minimizer of f and f is continuously differentiable inan open neighborhood of x983183 then nablaf(x983183) = 0

Proof Suppose for contradiction that nablaf(x983183) ∕= 0 Define the vectorp = minusnablaf(x983183) and note that pTnablaf(x983183) = minus983042nablaf(x983183)98304222 lt 0 Becausenablaf is continuous near x983183 there is a scalar T gt 0 such that

pTnablaf(x983183 + tp) lt 0 for all t isin [0 T ]

For any s isin (0 T ] we have by Taylorrsquos theorem that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183 + ξsp) for some ξ isin (0 1)

Therefore f(x983183 + sp) lt f(x983183) for all s isin (0 T ] We have found adirection leading away from x983183 along which f decreases so x983183 is not alocal minimizer and we have a contradiction

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 15 34

Theorem 10 (Second-order necessary conditions)

If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite

Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that

pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]

By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1

2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)

As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34

Theorem 11 (Second-order sufficient conditions)

Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f

Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and

f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1

2pTnabla2f(x983183 + ξp)p

= f(x983183) +1

2pTnabla2f(x983183 + ξp)p

for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34

A point x is called a stationary point if

nablaf(x) = 0

A stationary point x is called a saddle point if there exist u and vsuch that

f(x+ αu) lt f(x) and f(x+ αv) gt f(x)

for all sufficiently small α gt 0

Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points

If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point

If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34

Theorem 12

(i) forall convex f any local minimizer x983183 is a global minimizer of f

(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f

Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is

x = λz+ (1minus λ)x983183 for some λ isin (0 1]

By the convexity property for f we have

f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)

Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34

(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have

nablaf(x983183)T(zminus x983183) =

d

dλf(x983183 + λ(zminus x983183))|λ=0

= limλrarr0+

f(x983183 + λ(zminus x983183))minus f(x983183)

λ

le limλrarr0+

λf(z) + (1minus λ)f(x983183)minus f(x983183)

λ

= f(z)minus f(x983183) lt 0

Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point

Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms

Numerical algorithms try to seek a point where nablaf vanishes

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34

7 Karush-Kuhn-Tucker conditions

Theorem 13 (KKT conditions)

Consider the minimization problem

minxisinRn

f(x) st gi(x) le 0 i = 1 m

where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions

Let x983183 be an optimal solution and assume Slaterrsquos condition

exist x isin Rn st gi(x) lt 0 i = 1 m

hold Then there exist λ1 middot middot middot λm ge 0 satisfying

0 isin partf(x983183) +

m983131

i=1

λipartgi(x983183) λigi(x983183) = 0 i = 1 m

If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34

Active gi(x) le 0

Inactive gi(x) le 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34

8 Moreau envelope and proximal operator

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is

Mλh(x) = infu

983069h(u) +

1

2λ983042uminus x9830422

983070

=1

λinfu

983069λh(u) +

1

2983042uminus x9830422

983070

For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is

proxf (x) = argminu

983069f(u) +

1

2983042uminus x9830422

983070

From the optimality condition (see Theorem 4) we have

0 isin partf(proxf (x)) + (proxf (x)minus x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34

For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if

x983183 = proxf (x983183)

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope

Mλh(x) = h(proxλh(x)) +1

2λ983042proxλh(x)minus x9830422

can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn

minusinfin lt Mλh(x) lt +infin

even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient

nablaMλh(x) =1

λ(xminus proxλh(x))

Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34

Example h(x) = |x| and Hλ(x) = Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34

Lemma 14 (Nonexpansivity of proximal operator)

For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn

983042proxf (x)minus proxf (y)983042 le 983042xminus y983042

Proof From the optimality conditions at two points x and y we have

xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))

By applying monotonicity (see Lemma 3) we have

((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0

Rearranging this and applying the Cauchy-Schwartz inequality yields

983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))

le 983042xminus y983042983042proxf (x)minus proxf (y)983042

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34

Examples of several proximal operators

(1) f(x) = 0proxf (x) = x

(2) f(x) = λ983042x9830421 with λ gt 0

983045proxλ983042middot9830421(x)

983046i= argmin

uisinR

983069λ|u|+ 1

2(uminus xi)

2

983070

=

983099983105983103

983105983101

xi minus λ if xi gt λ

0 if xi isin [minusλλ]

xi + λ if xi lt minusλ

which is known as soft-thresholding

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34

(3) f(x) = λ983042x9830420 the number of nonzero components non-convex

983045proxλ983042middot9830420(x)

983046i=

983099983105983103

983105983101

xi if |xi| gtradic2λ

0 xi if |xi| =radic2λ

0 if |xi| ltradic2λ

which is known as hard thresholding

(4) f(x) = IΩ(x)

proxIΩ(x) = argminu

983069IΩ(u) +

1

2983042uminus x9830422

983070= argmin

uisinΩ983042uminus x983042

which is simply the projection of x onto the set Ω

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34

9 Convergence rates

An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error

For smooth f 983042nablaf(xk)983042

For nonsmooth convex f

dist(0 partf(xk))

(the sequence of distances from 0 to the subdifferential partf(xk))

Other error measures include

983042xk minus x983183983042 and f(xk)minus f983183

where x983183 is a solution and f983183 is the optimal value of f

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34

We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim

krarrinfinφk = 0

We say that Q-linear convergence holds if there is some ρ isin (0 1)such that

φk+1

φkle ρ

for all k sufficiently large

We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that

φk le Cρk k = 1 2

Q-linear rArr R-linear But R-linear ⇏ Q-linear Example

φk =

9830832minusk k even

0 k odd

satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34

Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence

φk le Cradick φk le Ck φk le Ck2 k = 1 2

where in each case C is some positive constant

Superlinear convergence occurs when the constant ρ isin (0 1) in

φk+1φk le ρ

can be chosen arbitrarily close to 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34

The sequence φk converges Q-superlinearly to 0 if

limkrarrinfin

φk+1

φk= 0

Q-Quadratic ifφk+1

φ2k

le C k = 1 2

for some sufficiently large C

R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is

0 le φk le νk k = 1 2

R-quadratic convergence is defined similarly

Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34

91 Complexity

For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is

Cρk le 983171

By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that

Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)

The complexity is O( log(1983171)1minusρ )

For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C

Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)

Ck le 983171 hArr k ge C983171 complexity O(1983171)

Ck2 le 983171 hArr k ge983155

C983171 complexity O(1radic983171)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34

For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is

O(log log(1983171))

To show this one can consider the sequence

ek = Cφk

which satisfiesek+1 le e2k

Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34

Page 6: Lecture 6: Preliminaries of Optimizationmath.xmu.edu.cn/group/nona/damc/Lecture06.pdf · Preliminaries of Optimization Lecture 6 March 27-April 3, 2020 17 / 34. A point x is called

A convex function f Rn 983041rarr R cup +infin has the following definingproperty dom(f) is convex and forall xy isin dom(f) forall α isin [0 1]

f((1minus α)x+ αy) 983249 (1minus α)f(x) + αf(y)

Theorem 1 (First-order convexity condition)

Differentiable f is convex if and only if dom(f) is convex and

f(y) ge f(x) +nablaf(x)T(y minus x) forall xy isin dom(f)

Theorem 2 (Second-order convexity conditions)

Assume f is twice continuously differentiable Then f is convex if andonly if dom(f) is convex and

nabla2f(x) ≽ 0 forall x isin dom(f)

that is nabla2f(x) is positive semidefinite

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 6 34

Important properties for convex objective functions

983183 Any local minimizer is also a global minimizer (see Theorem 12)

983183 The set of global minimizers is a convex set (easy to prove)

If there exists a value γ gt 0 such that

f((1minus α)x+ αy) 983249 (1minus α)f(x) + αf(y)minus γ

2α(1minus α)983042xminus y98304222

for all xy isin Rn and α isin [0 1] we say that f is strongly convexwith modulus of convexity γ

For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ

f(y) ge f(x) +nablaf(x)T(y minus x) +γ

2983042y minus x9830422

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 7 34

4 Subgradient and subdifferential

Definition A vector v isin Rn is a subgradient of f at a point x if forall d isin Rn it holds

f(x+ d) ge f(x) + vTd

The subdifferential denoted partf(x) is the set of all subgradients off at x (see FOMO for concrete examples)

Lemma 3 (Monotonicity of subdifferentials of convex functions)

forall convex f if a isin partf(x) and b isin partf(y) we have (aminusb)T(xminus y) ge 0

Proof By the definition of subgradient we have

f(y) ge f(x) + aT(y minus x) and f(x) ge f(y) + bT(xminus y)

Adding these two inequalities yields the statement

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 8 34

Theorem 4 (Fermatrsquos lemma generalization in convex functions)

Let f Rn 983041rarr R cup +infin be a proper convex function Then the pointx983183 is a minimizer of f(x) if and only if

0 isin partf(x983183)

Proof ldquolArrrdquo Suppose that 0 isin partf(x983183) We have

f(x983183 + d) ge f(x983183) forall d isin Rn

which implies that x983183 is a minimizer of f ldquorArrrdquo by definition

Theorem 5

Let f Rn 983041rarr R cup +infin be proper convex and let x isin int(dom(f))

If f is differentiable at x then partf(x) = nablaf(x)If partf(x) is a singleton (a set containing a single vector) then f isdifferentiable at x with gradient equal to the unique subgradient

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 9 34

5 Taylorrsquos theorem

Taylorrsquos theorem shows how smooth functions can be locallyapproximated by low-order (eg linear or quadratic) functions

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 10 34

Theorem 6 (Taylorrsquos theorem)

Given a continuously differentiable function f Rn 983041rarr R and givenxp isin Rn we have that

f(x+ p) = f(x) +

983133 1

0nablaf(x+ tp)Tpdt

f(x+ p) = f(x) +nablaf(x+ ξp)Tp for some ξ isin (0 1)

If f is twice continuously differentiable we have

nablaf(x+ p) = nablaf(x) +

983133 1

0nabla2f(x+ tp)pdt

f(x+ p) = f(x) +nablaf(x)Tp+1

2pTnabla2f(x+ ξp)p

for some ξ isin (0 1)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 11 34

If f Rn 983041rarr R is differentiable and convex then

partf(x) = nablaf(x)

andf(y) ge f(x) +nablaf(x)T(y minus x)

for all xy isin Rn

Lipschitz continuously differentiable with constant L

983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042 for all xy isin Rn

If f is Lipschitz continuously differentiable with constant L thenby Taylorrsquos theorem we have

f(y)minus f(x)minusnablaf(x)T(y minus x) le L

2983042y minus x9830422

for all xy isin Rn

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 12 34

For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ

f(y) ge f(x) +nablaf(x)T(y minus x) +γ

2983042y minus x9830422

Lemma 7

Suppose f is strongly convex with modulus of convexity γ and nablaf isuniformly Lipschitz continuous with constant L We have forall xy that

γ

2983042y minus x9830422 le f(y)minus f(x)minusnablaf(x)T(y minus x) le L

2983042y minus x9830422

Condition number κ = Lγ (eg strictly convex quadratic f)

When f is twice continuously differentiable the inequalities inLemma 7 is equivalent to

γI ≼ nabla2f(x) ≼ LI for all x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 13 34

Theorem 8

Let f be differentiable and strongly convex with modulus of convexityγ gt 0 Then the minimizer x983183 of f exists and is unique

Proof (i) Compactness of level set Show that for any point x0 thelevel set

x | f(x) le f(x0)

is closed and bounded and hence compact(ii) Existence Since f is continuous it attains its minimum on the

compact level set which is also the solution of minx f(x)(iii) Uniqueness Suppose for contradiction that the minimizer is

not unique so that we have two points x1983183 and x2

983183 that minimize f Byusing the strongly convex property we can prove

f

983061x1983183 + x2

983183

2

983062lt f(x1

983183) = f(x2983183)

This is a contradiction

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 14 34

6 Optimality conditions for smooth functions

Theorem 9 (First-order necessary condition)

If x983183 is a local minimizer of f and f is continuously differentiable inan open neighborhood of x983183 then nablaf(x983183) = 0

Proof Suppose for contradiction that nablaf(x983183) ∕= 0 Define the vectorp = minusnablaf(x983183) and note that pTnablaf(x983183) = minus983042nablaf(x983183)98304222 lt 0 Becausenablaf is continuous near x983183 there is a scalar T gt 0 such that

pTnablaf(x983183 + tp) lt 0 for all t isin [0 T ]

For any s isin (0 T ] we have by Taylorrsquos theorem that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183 + ξsp) for some ξ isin (0 1)

Therefore f(x983183 + sp) lt f(x983183) for all s isin (0 T ] We have found adirection leading away from x983183 along which f decreases so x983183 is not alocal minimizer and we have a contradiction

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 15 34

Theorem 10 (Second-order necessary conditions)

If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite

Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that

pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]

By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1

2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)

As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34

Theorem 11 (Second-order sufficient conditions)

Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f

Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and

f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1

2pTnabla2f(x983183 + ξp)p

= f(x983183) +1

2pTnabla2f(x983183 + ξp)p

for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34

A point x is called a stationary point if

nablaf(x) = 0

A stationary point x is called a saddle point if there exist u and vsuch that

f(x+ αu) lt f(x) and f(x+ αv) gt f(x)

for all sufficiently small α gt 0

Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points

If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point

If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34

Theorem 12

(i) forall convex f any local minimizer x983183 is a global minimizer of f

(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f

Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is

x = λz+ (1minus λ)x983183 for some λ isin (0 1]

By the convexity property for f we have

f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)

Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34

(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have

nablaf(x983183)T(zminus x983183) =

d

dλf(x983183 + λ(zminus x983183))|λ=0

= limλrarr0+

f(x983183 + λ(zminus x983183))minus f(x983183)

λ

le limλrarr0+

λf(z) + (1minus λ)f(x983183)minus f(x983183)

λ

= f(z)minus f(x983183) lt 0

Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point

Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms

Numerical algorithms try to seek a point where nablaf vanishes

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34

7 Karush-Kuhn-Tucker conditions

Theorem 13 (KKT conditions)

Consider the minimization problem

minxisinRn

f(x) st gi(x) le 0 i = 1 m

where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions

Let x983183 be an optimal solution and assume Slaterrsquos condition

exist x isin Rn st gi(x) lt 0 i = 1 m

hold Then there exist λ1 middot middot middot λm ge 0 satisfying

0 isin partf(x983183) +

m983131

i=1

λipartgi(x983183) λigi(x983183) = 0 i = 1 m

If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34

Active gi(x) le 0

Inactive gi(x) le 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34

8 Moreau envelope and proximal operator

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is

Mλh(x) = infu

983069h(u) +

1

2λ983042uminus x9830422

983070

=1

λinfu

983069λh(u) +

1

2983042uminus x9830422

983070

For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is

proxf (x) = argminu

983069f(u) +

1

2983042uminus x9830422

983070

From the optimality condition (see Theorem 4) we have

0 isin partf(proxf (x)) + (proxf (x)minus x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34

For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if

x983183 = proxf (x983183)

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope

Mλh(x) = h(proxλh(x)) +1

2λ983042proxλh(x)minus x9830422

can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn

minusinfin lt Mλh(x) lt +infin

even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient

nablaMλh(x) =1

λ(xminus proxλh(x))

Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34

Example h(x) = |x| and Hλ(x) = Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34

Lemma 14 (Nonexpansivity of proximal operator)

For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn

983042proxf (x)minus proxf (y)983042 le 983042xminus y983042

Proof From the optimality conditions at two points x and y we have

xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))

By applying monotonicity (see Lemma 3) we have

((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0

Rearranging this and applying the Cauchy-Schwartz inequality yields

983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))

le 983042xminus y983042983042proxf (x)minus proxf (y)983042

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34

Examples of several proximal operators

(1) f(x) = 0proxf (x) = x

(2) f(x) = λ983042x9830421 with λ gt 0

983045proxλ983042middot9830421(x)

983046i= argmin

uisinR

983069λ|u|+ 1

2(uminus xi)

2

983070

=

983099983105983103

983105983101

xi minus λ if xi gt λ

0 if xi isin [minusλλ]

xi + λ if xi lt minusλ

which is known as soft-thresholding

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34

(3) f(x) = λ983042x9830420 the number of nonzero components non-convex

983045proxλ983042middot9830420(x)

983046i=

983099983105983103

983105983101

xi if |xi| gtradic2λ

0 xi if |xi| =radic2λ

0 if |xi| ltradic2λ

which is known as hard thresholding

(4) f(x) = IΩ(x)

proxIΩ(x) = argminu

983069IΩ(u) +

1

2983042uminus x9830422

983070= argmin

uisinΩ983042uminus x983042

which is simply the projection of x onto the set Ω

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34

9 Convergence rates

An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error

For smooth f 983042nablaf(xk)983042

For nonsmooth convex f

dist(0 partf(xk))

(the sequence of distances from 0 to the subdifferential partf(xk))

Other error measures include

983042xk minus x983183983042 and f(xk)minus f983183

where x983183 is a solution and f983183 is the optimal value of f

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34

We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim

krarrinfinφk = 0

We say that Q-linear convergence holds if there is some ρ isin (0 1)such that

φk+1

φkle ρ

for all k sufficiently large

We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that

φk le Cρk k = 1 2

Q-linear rArr R-linear But R-linear ⇏ Q-linear Example

φk =

9830832minusk k even

0 k odd

satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34

Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence

φk le Cradick φk le Ck φk le Ck2 k = 1 2

where in each case C is some positive constant

Superlinear convergence occurs when the constant ρ isin (0 1) in

φk+1φk le ρ

can be chosen arbitrarily close to 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34

The sequence φk converges Q-superlinearly to 0 if

limkrarrinfin

φk+1

φk= 0

Q-Quadratic ifφk+1

φ2k

le C k = 1 2

for some sufficiently large C

R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is

0 le φk le νk k = 1 2

R-quadratic convergence is defined similarly

Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34

91 Complexity

For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is

Cρk le 983171

By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that

Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)

The complexity is O( log(1983171)1minusρ )

For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C

Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)

Ck le 983171 hArr k ge C983171 complexity O(1983171)

Ck2 le 983171 hArr k ge983155

C983171 complexity O(1radic983171)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34

For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is

O(log log(1983171))

To show this one can consider the sequence

ek = Cφk

which satisfiesek+1 le e2k

Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34

Page 7: Lecture 6: Preliminaries of Optimizationmath.xmu.edu.cn/group/nona/damc/Lecture06.pdf · Preliminaries of Optimization Lecture 6 March 27-April 3, 2020 17 / 34. A point x is called

Important properties for convex objective functions

983183 Any local minimizer is also a global minimizer (see Theorem 12)

983183 The set of global minimizers is a convex set (easy to prove)

If there exists a value γ gt 0 such that

f((1minus α)x+ αy) 983249 (1minus α)f(x) + αf(y)minus γ

2α(1minus α)983042xminus y98304222

for all xy isin Rn and α isin [0 1] we say that f is strongly convexwith modulus of convexity γ

For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ

f(y) ge f(x) +nablaf(x)T(y minus x) +γ

2983042y minus x9830422

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 7 34

4 Subgradient and subdifferential

Definition A vector v isin Rn is a subgradient of f at a point x if forall d isin Rn it holds

f(x+ d) ge f(x) + vTd

The subdifferential denoted partf(x) is the set of all subgradients off at x (see FOMO for concrete examples)

Lemma 3 (Monotonicity of subdifferentials of convex functions)

forall convex f if a isin partf(x) and b isin partf(y) we have (aminusb)T(xminus y) ge 0

Proof By the definition of subgradient we have

f(y) ge f(x) + aT(y minus x) and f(x) ge f(y) + bT(xminus y)

Adding these two inequalities yields the statement

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 8 34

Theorem 4 (Fermatrsquos lemma generalization in convex functions)

Let f Rn 983041rarr R cup +infin be a proper convex function Then the pointx983183 is a minimizer of f(x) if and only if

0 isin partf(x983183)

Proof ldquolArrrdquo Suppose that 0 isin partf(x983183) We have

f(x983183 + d) ge f(x983183) forall d isin Rn

which implies that x983183 is a minimizer of f ldquorArrrdquo by definition

Theorem 5

Let f Rn 983041rarr R cup +infin be proper convex and let x isin int(dom(f))

If f is differentiable at x then partf(x) = nablaf(x)If partf(x) is a singleton (a set containing a single vector) then f isdifferentiable at x with gradient equal to the unique subgradient

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 9 34

5 Taylorrsquos theorem

Taylorrsquos theorem shows how smooth functions can be locallyapproximated by low-order (eg linear or quadratic) functions

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 10 34

Theorem 6 (Taylorrsquos theorem)

Given a continuously differentiable function f Rn 983041rarr R and givenxp isin Rn we have that

f(x+ p) = f(x) +

983133 1

0nablaf(x+ tp)Tpdt

f(x+ p) = f(x) +nablaf(x+ ξp)Tp for some ξ isin (0 1)

If f is twice continuously differentiable we have

nablaf(x+ p) = nablaf(x) +

983133 1

0nabla2f(x+ tp)pdt

f(x+ p) = f(x) +nablaf(x)Tp+1

2pTnabla2f(x+ ξp)p

for some ξ isin (0 1)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 11 34

If f: ℝⁿ → ℝ is differentiable and convex, then

    ∂f(x) = {∇f(x)}

and

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)

for all x, y ∈ ℝⁿ.

Lipschitz continuously differentiable with constant L:

    ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for all x, y ∈ ℝⁿ.

If f is Lipschitz continuously differentiable with constant L, then by Taylor's theorem we have

    f(y) − f(x) − ∇f(x)ᵀ(y − x) ≤ (L/2)‖y − x‖²

for all x, y ∈ ℝⁿ.

For differentiable f, the equivalent definition of strong convexity with modulus of convexity γ is

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (γ/2)‖y − x‖².

Lemma 7
Suppose f is strongly convex with modulus of convexity γ and ∇f is uniformly Lipschitz continuous with constant L. We have for all x, y that

    (γ/2)‖y − x‖² ≤ f(y) − f(x) − ∇f(x)ᵀ(y − x) ≤ (L/2)‖y − x‖².

Condition number: κ = L/γ (e.g. for a strictly convex quadratic f).

When f is twice continuously differentiable, the inequalities in Lemma 7 are equivalent to

    γI ≼ ∇²f(x) ≼ LI for all x.
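
For a strictly convex quadratic f(x) = ½xᵀAx, one may take γ and L to be the extreme eigenvalues of A, so κ is the condition number of A. The sketch below (added; the matrix A and sample points are assumptions) verifies the two-sided bound of Lemma 7:

```python
import numpy as np

# Minimal sketch: gamma, L and kappa for f(x) = 0.5 x^T A x, plus the Lemma 7 bounds.
rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = B @ B.T + 0.5 * np.eye(4)
eig = np.linalg.eigvalsh(A)
gamma, L = eig[0], eig[-1]
print("gamma =", gamma, "L =", L, "kappa =", L / gamma)

f    = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x, y = rng.standard_normal(4), rng.standard_normal(4)
gap  = f(y) - f(x) - grad(x) @ (y - x)
d2   = np.linalg.norm(y - x) ** 2
assert 0.5 * gamma * d2 - 1e-9 <= gap <= 0.5 * L * d2 + 1e-9
```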

Theorem 8
Let f be differentiable and strongly convex with modulus of convexity γ > 0. Then the minimizer x⋆ of f exists and is unique.

Proof. (i) Compactness of the level set: Show that for any point x₀ the level set

    {x | f(x) ≤ f(x₀)}

is closed and bounded, and hence compact.
(ii) Existence: Since f is continuous, it attains its minimum on the compact level set, which is also the solution of minₓ f(x).
(iii) Uniqueness: Suppose for contradiction that the minimizer is not unique, so that we have two points x⋆¹ and x⋆² that minimize f. By using the strong convexity property we can prove

    f((x⋆¹ + x⋆²)/2) < f(x⋆¹) = f(x⋆²).

This is a contradiction.

6 Optimality conditions for smooth functions

Theorem 9 (First-order necessary condition)
If x⋆ is a local minimizer of f and f is continuously differentiable in an open neighborhood of x⋆, then ∇f(x⋆) = 0.

Proof. Suppose for contradiction that ∇f(x⋆) ≠ 0. Define the vector p = −∇f(x⋆) and note that pᵀ∇f(x⋆) = −‖∇f(x⋆)‖² < 0. Because ∇f is continuous near x⋆, there is a scalar T > 0 such that

    pᵀ∇f(x⋆ + tp) < 0 for all t ∈ [0, T].

For any s ∈ (0, T], we have by Taylor's theorem that

    f(x⋆ + sp) = f(x⋆) + s pᵀ∇f(x⋆ + ξsp) for some ξ ∈ (0, 1).

Therefore f(x⋆ + sp) < f(x⋆) for all s ∈ (0, T]. We have found a direction leading away from x⋆ along which f decreases, so x⋆ is not a local minimizer, and we have a contradiction.

Theorem 10 (Second-order necessary conditions)
If x⋆ is a local minimizer of f and ∇²f is continuous in an open neighborhood of x⋆, then ∇f(x⋆) = 0 and ∇²f(x⋆) is positive semidefinite.

Proof. We know from Theorem 9 that ∇f(x⋆) = 0. Assume that ∇²f(x⋆) is not positive semidefinite. Then we can choose a vector p such that pᵀ∇²f(x⋆)p < 0, and because ∇²f is continuous near x⋆, there is a scalar T > 0 such that

    pᵀ∇²f(x⋆ + tp)p < 0 for all t ∈ [0, T].

By doing a Taylor series expansion around x⋆, we have for all s ∈ (0, T] and some ξ ∈ (0, 1) that

    f(x⋆ + sp) = f(x⋆) + s pᵀ∇f(x⋆) + ½ s² pᵀ∇²f(x⋆ + ξsp)p < f(x⋆).

As in Theorem 9, we have found a direction from x⋆ along which f is decreasing, and so again x⋆ is not a local minimizer.

Theorem 11 (Second-order sufficient conditions)
Suppose that ∇²f is continuous in an open neighborhood of x⋆ and that ∇f(x⋆) = 0 and ∇²f(x⋆) is positive definite. Then x⋆ is a strict local minimizer of f.

Proof. Because the Hessian ∇²f is continuous and positive definite at x⋆, we can choose a radius r > 0 so that ∇²f(x) remains positive definite for all x in the open ball B = {z | ‖z − x⋆‖ < r}. Taking any nonzero vector p with ‖p‖ < r, we have x⋆ + p ∈ B and

    f(x⋆ + p) = f(x⋆) + pᵀ∇f(x⋆) + ½ pᵀ∇²f(x⋆ + ξp)p
              = f(x⋆) + ½ pᵀ∇²f(x⋆ + ξp)p

for some ξ ∈ (0, 1). Since x⋆ + ξp ∈ B, we have pᵀ∇²f(x⋆ + ξp)p > 0, and therefore f(x⋆ + p) > f(x⋆), giving the result.

A point x is called a stationary point if

    ∇f(x) = 0.

A stationary point x is called a saddle point if there exist u and v such that

    f(x + αu) < f(x) and f(x + αv) > f(x)

for all sufficiently small α > 0.

Stationary points are not necessarily local minimizers: stationary points can be local maximizers or saddle points.

- If ∇f(x) = 0 and ∇²f(x) has both strictly positive and strictly negative eigenvalues, then x is a saddle point.
- If ∇²f(x) is positive semidefinite or negative semidefinite, then ∇²f(x) alone is insufficient to classify x.
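
The sketch below (an added illustration; the two test Hessians are assumptions) classifies a stationary point through the eigenvalues of its Hessian, in line with Theorems 10-11 and the saddle-point criterion above:

```python
import numpy as np

# Minimal sketch: classify a stationary point by the eigenvalues of the Hessian.
def classify(hessian):
    ev = np.linalg.eigvalsh(hessian)          # eigenvalues in ascending order
    if np.all(ev > 0):
        return "strict local minimizer"
    if np.all(ev < 0):
        return "strict local maximizer"
    if ev[0] < 0 < ev[-1]:
        return "saddle point"
    return "inconclusive from the Hessian alone"

# f(x, y) = x^2 + y^2 and g(x, y) = x^2 - y^2, both stationary at the origin.
print(classify(np.array([[2.0, 0.0], [0.0, 2.0]])))    # strict local minimizer
print(classify(np.array([[2.0, 0.0], [0.0, -2.0]])))   # saddle point
```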

Theorem 12
(i) For all convex f, any local minimizer x⋆ is a global minimizer of f.
(ii) If f is convex and differentiable, then any stationary point x⋆ is a global minimizer of f.

Proof. (i) Suppose that x⋆ is a local but not a global minimizer. Then we can find a point z ∈ ℝⁿ with f(z) < f(x⋆). Consider the line segment that joins x⋆ to z, that is,

    x = λz + (1 − λ)x⋆ for some λ ∈ (0, 1].

By the convexity property for f, we have

    f(x) ≤ λf(z) + (1 − λ)f(x⋆) < f(x⋆).

Any neighborhood N of x⋆ contains a piece of the line segment, so there will always be points x ∈ N at which the last inequality is satisfied. Hence x⋆ is not a local minimizer.

(ii) Suppose that x⋆ is not a global minimizer and choose z as above. Then, from convexity, we have

    ∇f(x⋆)ᵀ(z − x⋆) = d/dλ f(x⋆ + λ(z − x⋆)) |_{λ=0}
                     = lim_{λ→0⁺} [f(x⋆ + λ(z − x⋆)) − f(x⋆)] / λ
                     ≤ lim_{λ→0⁺} [λf(z) + (1 − λ)f(x⋆) − f(x⋆)] / λ
                     = f(z) − f(x⋆) < 0.

Therefore ∇f(x⋆) ≠ 0, and so x⋆ is not a stationary point.

Remark: Theorems 9-12 provide the foundations for unconstrained optimization algorithms.

Numerical algorithms try to seek a point where ∇f vanishes.
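
As a minimal added illustration of such an algorithm (the test problem and the step size 1/L are assumptions), gradient descent drives ‖∇f(xₖ)‖ toward zero on a smooth convex quadratic:

```python
import numpy as np

# Minimal sketch: gradient descent seeking a point where the gradient vanishes,
# applied to f(x) = 0.5 x^T A x - b^T x with step size 1/L, L = largest eigenvalue of A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
L = np.linalg.eigvalsh(A)[-1]

x = np.zeros(2)
for k in range(200):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    x = x - g / L
print(k, x, np.linalg.norm(grad(x)))   # x approaches the solution of A x = b
```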

7 Karush-Kuhn-Tucker conditions

Theorem 13 (KKT conditions)
Consider the minimization problem

    min_{x ∈ ℝⁿ} f(x)  s.t.  gᵢ(x) ≤ 0, i = 1, …, m,

where f: ℝⁿ → ℝ and all gᵢ: ℝⁿ → ℝ are convex functions.
Let x⋆ be an optimal solution and assume Slater's condition

    ∃ x ∈ ℝⁿ s.t. gᵢ(x) < 0, i = 1, …, m,

holds. Then there exist λ₁, …, λₘ ≥ 0 satisfying

    0 ∈ ∂f(x⋆) + ∑ᵢ₌₁ᵐ λᵢ ∂gᵢ(x⋆),   λᵢ gᵢ(x⋆) = 0, i = 1, …, m.

Conversely, if x⋆ satisfies the above conditions, called the KKT conditions, then it is an optimal solution of the optimization problem.

A constraint gᵢ is called active at x⋆ if gᵢ(x⋆) = 0, and inactive if gᵢ(x⋆) < 0.
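
A tiny worked instance (added here, not from the slides): minimize f(x) = x² subject to g(x) = 1 − x ≤ 0. The optimum is x⋆ = 1 with multiplier λ = 2, and the sketch below checks each KKT condition:

```python
# Minimal sketch: check the KKT conditions for  min x^2  s.t.  g(x) = 1 - x <= 0.
f_grad = lambda x: 2 * x      # gradient of f(x) = x^2
g      = lambda x: 1 - x      # constraint g(x) <= 0
g_grad = lambda x: -1.0       # gradient of g

x_star, lam = 1.0, 2.0        # candidate primal-dual pair
assert lam >= 0                                             # dual feasibility
assert g(x_star) <= 0                                       # primal feasibility
assert abs(lam * g(x_star)) < 1e-12                         # complementary slackness
assert abs(f_grad(x_star) + lam * g_grad(x_star)) < 1e-12   # stationarity
print("KKT conditions hold at x* =", x_star, "with lambda =", lam)
```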

8 Moreau envelope and proximal operator

For a closed proper convex function h: ℝⁿ → ℝ ∪ {+∞} and a positive scalar λ, the Moreau envelope of (λ, h) is

    M_{λh}(x) = inf_u { h(u) + (1/(2λ))‖u − x‖² } = (1/λ) inf_u { λh(u) + ½‖u − x‖² }.

For a closed proper convex function f: ℝⁿ → ℝ ∪ {+∞}, the proximal operator of f is

    prox_f(x) = argmin_u { f(u) + ½‖u − x‖² }.

From the optimality condition (see Theorem 4) we have

    0 ∈ ∂f(prox_f(x)) + (prox_f(x) − x).

For a closed proper convex function f: ℝⁿ → ℝ ∪ {+∞}, the point x⋆ is a minimizer of f if and only if

    x⋆ = prox_f(x⋆).

For a closed proper convex function h: ℝⁿ → ℝ ∪ {+∞} and a positive scalar λ, the Moreau envelope

    M_{λh}(x) = h(prox_{λh}(x)) + (1/(2λ))‖prox_{λh}(x) − x‖²

can be viewed as a kind of smoothing or regularization of the function h. We have, for all x ∈ ℝⁿ,

    −∞ < M_{λh}(x) < +∞,

even when h(x) = +∞ for some x ∈ ℝⁿ. Moreover, M_{λh}(x) is convex and differentiable everywhere, with gradient

    ∇M_{λh}(x) = (1/λ)(x − prox_{λh}(x)).

Therefore, x⋆ is a minimizer of h ⇔ x⋆ is a minimizer of M_{λh}(x).

Example: h(x) = |x| and Hλ(x) = M_{λh}(x).
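
The sketch below (an added numerical check for this example; λ and the sample points are assumptions) evaluates M_{λh} both by brute-force minimization on a grid and via prox_{λh}, which for h(x) = |x| is the soft-thresholding map of example (2) below, and also verifies the gradient formula above:

```python
import numpy as np

# Minimal sketch: Moreau envelope of h(x) = |x| via its prox (soft-thresholding).
lam = 0.5
h       = lambda u: np.abs(u)
prox_lh = lambda x: np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def moreau(x):
    p = prox_lh(x)
    return h(p) + (p - x) ** 2 / (2 * lam)

u = np.linspace(-5, 5, 20001)                      # brute-force grid
for x in (-2.0, -0.3, 0.0, 0.7, 1.8):
    brute = np.min(h(u) + (u - x) ** 2 / (2 * lam))
    assert abs(moreau(x) - brute) < 1e-4           # closed form matches brute force
    grad = (x - prox_lh(x)) / lam                  # gradient formula of the envelope
    num  = (moreau(x + 1e-6) - moreau(x - 1e-6)) / 2e-6
    assert abs(grad - num) < 1e-4                  # matches a numerical derivative
```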

Lemma 14 (Nonexpansivity of the proximal operator)
For a closed proper convex function f: ℝⁿ → ℝ ∪ {+∞}, we have for all x, y ∈ ℝⁿ

    ‖prox_f(x) − prox_f(y)‖ ≤ ‖x − y‖.

Proof. From the optimality conditions at the two points x and y, we have

    x − prox_f(x) ∈ ∂f(prox_f(x)) and y − prox_f(y) ∈ ∂f(prox_f(y)).

By applying monotonicity (see Lemma 3), we have

    ((x − prox_f(x)) − (y − prox_f(y)))ᵀ(prox_f(x) − prox_f(y)) ≥ 0.

Rearranging this and applying the Cauchy-Schwarz inequality yields

    ‖prox_f(x) − prox_f(y)‖² ≤ (x − y)ᵀ(prox_f(x) − prox_f(y)) ≤ ‖x − y‖ ‖prox_f(x) − prox_f(y)‖.

Examples of several proximal operators:

(1) f(x) = 0: prox_f(x) = x.

(2) f(x) = λ‖x‖₁ with λ > 0:

    [prox_{λ‖·‖₁}(x)]ᵢ = argmin_{u∈ℝ} { λ|u| + ½(u − xᵢ)² }
                       = xᵢ − λ  if xᵢ > λ,
                         0       if xᵢ ∈ [−λ, λ],
                         xᵢ + λ  if xᵢ < −λ,

which is known as soft-thresholding.

(3) f(x) = λ‖x‖₀, the number of nonzero components (non-convex):

    [prox_{λ‖·‖₀}(x)]ᵢ = xᵢ       if |xᵢ| > √(2λ),
                         0 or xᵢ  if |xᵢ| = √(2λ),
                         0        if |xᵢ| < √(2λ),

which is known as hard thresholding.

(4) f(x) = I_Ω(x):

    prox_{I_Ω}(x) = argmin_u { I_Ω(u) + ½‖u − x‖² } = argmin_{u∈Ω} ‖u − x‖,

which is simply the projection of x onto the set Ω.
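
Two short added sketches in the same spirit: hard thresholding for λ‖x‖₀ (choosing 0 at the tie |xᵢ| = √(2λ)) and prox of the indicator of the Euclidean ball Ω = {u : ‖u‖ ≤ r}, which is just the projection onto Ω. The radius and test vectors are assumptions for illustration:

```python
import numpy as np

def hard_threshold(x, lam):
    """Proximal operator of lam * ||x||_0, choosing 0 at the tie |x_i| = sqrt(2*lam)."""
    return np.where(np.abs(x) > np.sqrt(2 * lam), x, 0.0)

def project_ball(x, r):
    """prox of the indicator of {u : ||u|| <= r}: Euclidean projection onto the ball."""
    nrm = np.linalg.norm(x)
    return x if nrm <= r else (r / nrm) * x

print(hard_threshold(np.array([3.0, 0.4, -1.2]), 1.0))   # [3. 0. 0.]
print(project_ball(np.array([3.0, 4.0]), 1.0))           # [0.6 0.8]
```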

9 Convergence rates

An important measure for evaluating algorithms is the rate of convergence to zero of some measure of error.

- For smooth f: ‖∇f(xₖ)‖.
- For nonsmooth convex f: dist(0, ∂f(xₖ)) (the sequence of distances from 0 to the subdifferential ∂f(xₖ)).

Other error measures include

    ‖xₖ − x⋆‖ and f(xₖ) − f⋆,

where x⋆ is a solution and f⋆ is the optimal value of f.

We denote by {φₖ} the sequence of nonnegative scalars whose rate of convergence to 0 we wish to find: φₖ ≥ 0 and lim_{k→∞} φₖ = 0.

We say that Q-linear convergence holds if there is some ρ ∈ (0, 1) such that

    φₖ₊₁ / φₖ ≤ ρ

for all k sufficiently large.

We say that R-linear convergence holds if there exist ρ ∈ (0, 1) and C > 0 such that

    φₖ ≤ Cρᵏ, k = 1, 2, ….

Q-linear ⇒ R-linear, but R-linear ⇏ Q-linear. Example:

    φₖ = 2⁻ᵏ if k is even, 0 if k is odd

satisfies the R-linear bound with C = 1 and ρ = 1/2, but is not Q-linear.
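
The added sketch below generates this sequence and confirms the R-linear bound Cρᵏ while showing that the ratios φₖ₊₁/φₖ do not stay below any ρ < 1:

```python
import numpy as np

# Minimal sketch: phi_k = 2^{-k} for even k, 0 for odd k  (R-linear but not Q-linear).
K = 12
phi = np.array([2.0 ** -k if k % 2 == 0 else 0.0 for k in range(K)])

C, rho = 1.0, 0.5
print(np.all(phi <= C * rho ** np.arange(K)))   # True: the R-linear bound holds
ratios = [phi[k + 1] / phi[k] if phi[k] > 0 else np.inf for k in range(K - 1)]
print(ratios)   # alternates between 0 and inf, so no Q-linear rate exists
```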

Sublinear convergence is, as its name suggests, slower than linear. Several varieties of sublinear convergence:

    φₖ ≤ C/√k, φₖ ≤ C/k, φₖ ≤ C/k², k = 1, 2, …,

where in each case C is some positive constant.

Superlinear convergence occurs when the constant ρ ∈ (0, 1) in

    φₖ₊₁ / φₖ ≤ ρ

can be chosen arbitrarily close to 0.

The sequence {φₖ} converges Q-superlinearly to 0 if

    lim_{k→∞} φₖ₊₁ / φₖ = 0.

It converges Q-quadratically if

    φₖ₊₁ / φₖ² ≤ C, k = 1, 2, …,

for some sufficiently large C.

It converges R-superlinearly if there is a Q-superlinearly convergent sequence {νₖ} that dominates {φₖ}, that is,

    0 ≤ φₖ ≤ νₖ, k = 1, 2, ….

R-quadratic convergence is defined similarly.

Quadratic and superlinear rates are associated with higher-order methods, such as Newton and quasi-Newton methods.

9.1 Complexity

For a sequence satisfying the R-linear convergence condition, a sufficient condition for φₖ ≤ ε < C is

    Cρᵏ ≤ ε.

By using the estimate log ρ ≤ ρ − 1 for all ρ ∈ (0, 1), we have that

    Cρᵏ ≤ ε ⇔ k log ρ ≤ log(ε/C) ⇐ k ≥ log(C/ε)/(1 − ρ).

The complexity is O(log(1/ε)/(1 − ρ)).

For an algorithm that satisfies the sublinear rates, sufficient conditions for φₖ ≤ ε < C are:

    C/√k ≤ ε ⇔ k ≥ (C/ε)², complexity O(1/ε²);
    C/k ≤ ε ⇔ k ≥ C/ε, complexity O(1/ε);
    C/k² ≤ ε ⇔ k ≥ √(C/ε), complexity O(1/√ε).

For Q-quadratically convergent methods, the complexity is doubly logarithmic in ε, that is,

    O(log log(1/ε)).

To show this, one can consider the sequence

    eₖ = Cφₖ,

which satisfies eₖ₊₁ ≤ eₖ².

Once the algorithm enters a neighborhood of quadratic convergence, just a few additional iterations are required for convergence to a solution of high accuracy.
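
A concrete added example (Newton's iteration for √2, chosen as an assumption for illustration): the error roughly squares at each step, so the number of correct digits doubles and only about log₂ log(1/ε) iterations are needed once the quadratic regime is reached:

```python
import math

# Minimal sketch: Newton's iteration for sqrt(2); the error roughly squares each step.
x, target = 3.0, math.sqrt(2.0)
for k in range(6):
    print(k, x, abs(x - target))
    x = 0.5 * (x + 2.0 / x)
# After about 5 iterations the error is near machine precision.
```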

Page 8: Lecture 6: Preliminaries of Optimizationmath.xmu.edu.cn/group/nona/damc/Lecture06.pdf · Preliminaries of Optimization Lecture 6 March 27-April 3, 2020 17 / 34. A point x is called

4 Subgradient and subdifferential

Definition A vector v isin Rn is a subgradient of f at a point x if forall d isin Rn it holds

f(x+ d) ge f(x) + vTd

The subdifferential denoted partf(x) is the set of all subgradients off at x (see FOMO for concrete examples)

Lemma 3 (Monotonicity of subdifferentials of convex functions)

forall convex f if a isin partf(x) and b isin partf(y) we have (aminusb)T(xminus y) ge 0

Proof By the definition of subgradient we have

f(y) ge f(x) + aT(y minus x) and f(x) ge f(y) + bT(xminus y)

Adding these two inequalities yields the statement

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 8 34

Theorem 4 (Fermatrsquos lemma generalization in convex functions)

Let f Rn 983041rarr R cup +infin be a proper convex function Then the pointx983183 is a minimizer of f(x) if and only if

0 isin partf(x983183)

Proof ldquolArrrdquo Suppose that 0 isin partf(x983183) We have

f(x983183 + d) ge f(x983183) forall d isin Rn

which implies that x983183 is a minimizer of f ldquorArrrdquo by definition

Theorem 5

Let f Rn 983041rarr R cup +infin be proper convex and let x isin int(dom(f))

If f is differentiable at x then partf(x) = nablaf(x)If partf(x) is a singleton (a set containing a single vector) then f isdifferentiable at x with gradient equal to the unique subgradient

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 9 34

5 Taylorrsquos theorem

Taylorrsquos theorem shows how smooth functions can be locallyapproximated by low-order (eg linear or quadratic) functions

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 10 34

Theorem 6 (Taylorrsquos theorem)

Given a continuously differentiable function f Rn 983041rarr R and givenxp isin Rn we have that

f(x+ p) = f(x) +

983133 1

0nablaf(x+ tp)Tpdt

f(x+ p) = f(x) +nablaf(x+ ξp)Tp for some ξ isin (0 1)

If f is twice continuously differentiable we have

nablaf(x+ p) = nablaf(x) +

983133 1

0nabla2f(x+ tp)pdt

f(x+ p) = f(x) +nablaf(x)Tp+1

2pTnabla2f(x+ ξp)p

for some ξ isin (0 1)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 11 34

If f Rn 983041rarr R is differentiable and convex then

partf(x) = nablaf(x)

andf(y) ge f(x) +nablaf(x)T(y minus x)

for all xy isin Rn

Lipschitz continuously differentiable with constant L

983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042 for all xy isin Rn

If f is Lipschitz continuously differentiable with constant L thenby Taylorrsquos theorem we have

f(y)minus f(x)minusnablaf(x)T(y minus x) le L

2983042y minus x9830422

for all xy isin Rn

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 12 34

For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ

f(y) ge f(x) +nablaf(x)T(y minus x) +γ

2983042y minus x9830422

Lemma 7

Suppose f is strongly convex with modulus of convexity γ and nablaf isuniformly Lipschitz continuous with constant L We have forall xy that

γ

2983042y minus x9830422 le f(y)minus f(x)minusnablaf(x)T(y minus x) le L

2983042y minus x9830422

Condition number κ = Lγ (eg strictly convex quadratic f)

When f is twice continuously differentiable the inequalities inLemma 7 is equivalent to

γI ≼ nabla2f(x) ≼ LI for all x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 13 34

Theorem 8

Let f be differentiable and strongly convex with modulus of convexityγ gt 0 Then the minimizer x983183 of f exists and is unique

Proof (i) Compactness of level set Show that for any point x0 thelevel set

x | f(x) le f(x0)

is closed and bounded and hence compact(ii) Existence Since f is continuous it attains its minimum on the

compact level set which is also the solution of minx f(x)(iii) Uniqueness Suppose for contradiction that the minimizer is

not unique so that we have two points x1983183 and x2

983183 that minimize f Byusing the strongly convex property we can prove

f

983061x1983183 + x2

983183

2

983062lt f(x1

983183) = f(x2983183)

This is a contradiction

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 14 34

6 Optimality conditions for smooth functions

Theorem 9 (First-order necessary condition)

If x983183 is a local minimizer of f and f is continuously differentiable inan open neighborhood of x983183 then nablaf(x983183) = 0

Proof Suppose for contradiction that nablaf(x983183) ∕= 0 Define the vectorp = minusnablaf(x983183) and note that pTnablaf(x983183) = minus983042nablaf(x983183)98304222 lt 0 Becausenablaf is continuous near x983183 there is a scalar T gt 0 such that

pTnablaf(x983183 + tp) lt 0 for all t isin [0 T ]

For any s isin (0 T ] we have by Taylorrsquos theorem that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183 + ξsp) for some ξ isin (0 1)

Therefore f(x983183 + sp) lt f(x983183) for all s isin (0 T ] We have found adirection leading away from x983183 along which f decreases so x983183 is not alocal minimizer and we have a contradiction

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 15 34

Theorem 10 (Second-order necessary conditions)

If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite

Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that

pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]

By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1

2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)

As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34

Theorem 11 (Second-order sufficient conditions)

Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f

Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and

f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1

2pTnabla2f(x983183 + ξp)p

= f(x983183) +1

2pTnabla2f(x983183 + ξp)p

for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34

A point x is called a stationary point if

nablaf(x) = 0

A stationary point x is called a saddle point if there exist u and vsuch that

f(x+ αu) lt f(x) and f(x+ αv) gt f(x)

for all sufficiently small α gt 0

Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points

If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point

If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34

Theorem 12

(i) forall convex f any local minimizer x983183 is a global minimizer of f

(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f

Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is

x = λz+ (1minus λ)x983183 for some λ isin (0 1]

By the convexity property for f we have

f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)

Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34

(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have

nablaf(x983183)T(zminus x983183) =

d

dλf(x983183 + λ(zminus x983183))|λ=0

= limλrarr0+

f(x983183 + λ(zminus x983183))minus f(x983183)

λ

le limλrarr0+

λf(z) + (1minus λ)f(x983183)minus f(x983183)

λ

= f(z)minus f(x983183) lt 0

Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point

Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms

Numerical algorithms try to seek a point where nablaf vanishes

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34

7 Karush-Kuhn-Tucker conditions

Theorem 13 (KKT conditions)

Consider the minimization problem

minxisinRn

f(x) st gi(x) le 0 i = 1 m

where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions

Let x983183 be an optimal solution and assume Slaterrsquos condition

exist x isin Rn st gi(x) lt 0 i = 1 m

hold Then there exist λ1 middot middot middot λm ge 0 satisfying

0 isin partf(x983183) +

m983131

i=1

λipartgi(x983183) λigi(x983183) = 0 i = 1 m

If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34

Active gi(x) le 0

Inactive gi(x) le 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34

8 Moreau envelope and proximal operator

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is

Mλh(x) = infu

983069h(u) +

1

2λ983042uminus x9830422

983070

=1

λinfu

983069λh(u) +

1

2983042uminus x9830422

983070

For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is

proxf (x) = argminu

983069f(u) +

1

2983042uminus x9830422

983070

From the optimality condition (see Theorem 4) we have

0 isin partf(proxf (x)) + (proxf (x)minus x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34

For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if

x983183 = proxf (x983183)

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope

Mλh(x) = h(proxλh(x)) +1

2λ983042proxλh(x)minus x9830422

can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn

minusinfin lt Mλh(x) lt +infin

even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient

nablaMλh(x) =1

λ(xminus proxλh(x))

Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34

Example h(x) = |x| and Hλ(x) = Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34

Lemma 14 (Nonexpansivity of proximal operator)

For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn

983042proxf (x)minus proxf (y)983042 le 983042xminus y983042

Proof From the optimality conditions at two points x and y we have

xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))

By applying monotonicity (see Lemma 3) we have

((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0

Rearranging this and applying the Cauchy-Schwartz inequality yields

983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))

le 983042xminus y983042983042proxf (x)minus proxf (y)983042

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34

Examples of several proximal operators

(1) f(x) = 0proxf (x) = x

(2) f(x) = λ983042x9830421 with λ gt 0

983045proxλ983042middot9830421(x)

983046i= argmin

uisinR

983069λ|u|+ 1

2(uminus xi)

2

983070

=

983099983105983103

983105983101

xi minus λ if xi gt λ

0 if xi isin [minusλλ]

xi + λ if xi lt minusλ

which is known as soft-thresholding

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34

(3) f(x) = λ983042x9830420 the number of nonzero components non-convex

983045proxλ983042middot9830420(x)

983046i=

983099983105983103

983105983101

xi if |xi| gtradic2λ

0 xi if |xi| =radic2λ

0 if |xi| ltradic2λ

which is known as hard thresholding

(4) f(x) = IΩ(x)

proxIΩ(x) = argminu

983069IΩ(u) +

1

2983042uminus x9830422

983070= argmin

uisinΩ983042uminus x983042

which is simply the projection of x onto the set Ω

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34

9 Convergence rates

An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error

For smooth f 983042nablaf(xk)983042

For nonsmooth convex f

dist(0 partf(xk))

(the sequence of distances from 0 to the subdifferential partf(xk))

Other error measures include

983042xk minus x983183983042 and f(xk)minus f983183

where x983183 is a solution and f983183 is the optimal value of f

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34

We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim

krarrinfinφk = 0

We say that Q-linear convergence holds if there is some ρ isin (0 1)such that

φk+1

φkle ρ

for all k sufficiently large

We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that

φk le Cρk k = 1 2

Q-linear rArr R-linear But R-linear ⇏ Q-linear Example

φk =

9830832minusk k even

0 k odd

satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34

Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence

φk le Cradick φk le Ck φk le Ck2 k = 1 2

where in each case C is some positive constant

Superlinear convergence occurs when the constant ρ isin (0 1) in

φk+1φk le ρ

can be chosen arbitrarily close to 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34

The sequence φk converges Q-superlinearly to 0 if

limkrarrinfin

φk+1

φk= 0

Q-Quadratic ifφk+1

φ2k

le C k = 1 2

for some sufficiently large C

R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is

0 le φk le νk k = 1 2

R-quadratic convergence is defined similarly

Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34

91 Complexity

For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is

Cρk le 983171

By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that

Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)

The complexity is O( log(1983171)1minusρ )

For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C

Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)

Ck le 983171 hArr k ge C983171 complexity O(1983171)

Ck2 le 983171 hArr k ge983155

C983171 complexity O(1radic983171)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34

For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is

O(log log(1983171))

To show this one can consider the sequence

ek = Cφk

which satisfiesek+1 le e2k

Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34

Page 9: Lecture 6: Preliminaries of Optimizationmath.xmu.edu.cn/group/nona/damc/Lecture06.pdf · Preliminaries of Optimization Lecture 6 March 27-April 3, 2020 17 / 34. A point x is called

Theorem 4 (Fermatrsquos lemma generalization in convex functions)

Let f Rn 983041rarr R cup +infin be a proper convex function Then the pointx983183 is a minimizer of f(x) if and only if

0 isin partf(x983183)

Proof ldquolArrrdquo Suppose that 0 isin partf(x983183) We have

f(x983183 + d) ge f(x983183) forall d isin Rn

which implies that x983183 is a minimizer of f ldquorArrrdquo by definition

Theorem 5

Let f Rn 983041rarr R cup +infin be proper convex and let x isin int(dom(f))

If f is differentiable at x then partf(x) = nablaf(x)If partf(x) is a singleton (a set containing a single vector) then f isdifferentiable at x with gradient equal to the unique subgradient

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 9 34

5 Taylorrsquos theorem

Taylorrsquos theorem shows how smooth functions can be locallyapproximated by low-order (eg linear or quadratic) functions

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 10 34

Theorem 6 (Taylorrsquos theorem)

Given a continuously differentiable function f Rn 983041rarr R and givenxp isin Rn we have that

f(x+ p) = f(x) +

983133 1

0nablaf(x+ tp)Tpdt

f(x+ p) = f(x) +nablaf(x+ ξp)Tp for some ξ isin (0 1)

If f is twice continuously differentiable we have

nablaf(x+ p) = nablaf(x) +

983133 1

0nabla2f(x+ tp)pdt

f(x+ p) = f(x) +nablaf(x)Tp+1

2pTnabla2f(x+ ξp)p

for some ξ isin (0 1)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 11 34

If f Rn 983041rarr R is differentiable and convex then

partf(x) = nablaf(x)

andf(y) ge f(x) +nablaf(x)T(y minus x)

for all xy isin Rn

Lipschitz continuously differentiable with constant L

983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042 for all xy isin Rn

If f is Lipschitz continuously differentiable with constant L thenby Taylorrsquos theorem we have

f(y)minus f(x)minusnablaf(x)T(y minus x) le L

2983042y minus x9830422

for all xy isin Rn

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 12 34

For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ

f(y) ge f(x) +nablaf(x)T(y minus x) +γ

2983042y minus x9830422

Lemma 7

Suppose f is strongly convex with modulus of convexity γ and nablaf isuniformly Lipschitz continuous with constant L We have forall xy that

γ

2983042y minus x9830422 le f(y)minus f(x)minusnablaf(x)T(y minus x) le L

2983042y minus x9830422

Condition number κ = Lγ (eg strictly convex quadratic f)

When f is twice continuously differentiable the inequalities inLemma 7 is equivalent to

γI ≼ nabla2f(x) ≼ LI for all x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 13 34

Theorem 8

Let f be differentiable and strongly convex with modulus of convexityγ gt 0 Then the minimizer x983183 of f exists and is unique

Proof (i) Compactness of level set Show that for any point x0 thelevel set

x | f(x) le f(x0)

is closed and bounded and hence compact(ii) Existence Since f is continuous it attains its minimum on the

compact level set which is also the solution of minx f(x)(iii) Uniqueness Suppose for contradiction that the minimizer is

not unique so that we have two points x1983183 and x2

983183 that minimize f Byusing the strongly convex property we can prove

f

983061x1983183 + x2

983183

2

983062lt f(x1

983183) = f(x2983183)

This is a contradiction

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 14 34

6 Optimality conditions for smooth functions

Theorem 9 (First-order necessary condition)

If x983183 is a local minimizer of f and f is continuously differentiable inan open neighborhood of x983183 then nablaf(x983183) = 0

Proof Suppose for contradiction that nablaf(x983183) ∕= 0 Define the vectorp = minusnablaf(x983183) and note that pTnablaf(x983183) = minus983042nablaf(x983183)98304222 lt 0 Becausenablaf is continuous near x983183 there is a scalar T gt 0 such that

pTnablaf(x983183 + tp) lt 0 for all t isin [0 T ]

For any s isin (0 T ] we have by Taylorrsquos theorem that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183 + ξsp) for some ξ isin (0 1)

Therefore f(x983183 + sp) lt f(x983183) for all s isin (0 T ] We have found adirection leading away from x983183 along which f decreases so x983183 is not alocal minimizer and we have a contradiction

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 15 34

Theorem 10 (Second-order necessary conditions)

If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite

Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that

pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]

By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1

2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)

As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34

Theorem 11 (Second-order sufficient conditions)

Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f

Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and

f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1

2pTnabla2f(x983183 + ξp)p

= f(x983183) +1

2pTnabla2f(x983183 + ξp)p

for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34

A point x is called a stationary point if

nablaf(x) = 0

A stationary point x is called a saddle point if there exist u and vsuch that

f(x+ αu) lt f(x) and f(x+ αv) gt f(x)

for all sufficiently small α gt 0

Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points

If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point

If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34

Theorem 12

(i) forall convex f any local minimizer x983183 is a global minimizer of f

(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f

Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is

x = λz+ (1minus λ)x983183 for some λ isin (0 1]

By the convexity property for f we have

f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)

Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34

(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have

nablaf(x983183)T(zminus x983183) =

d

dλf(x983183 + λ(zminus x983183))|λ=0

= limλrarr0+

f(x983183 + λ(zminus x983183))minus f(x983183)

λ

le limλrarr0+

λf(z) + (1minus λ)f(x983183)minus f(x983183)

λ

= f(z)minus f(x983183) lt 0

Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point

Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms

Numerical algorithms try to seek a point where nablaf vanishes

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34

7 Karush-Kuhn-Tucker conditions

Theorem 13 (KKT conditions)

Consider the minimization problem

minxisinRn

f(x) st gi(x) le 0 i = 1 m

where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions

Let x983183 be an optimal solution and assume Slaterrsquos condition

exist x isin Rn st gi(x) lt 0 i = 1 m

hold Then there exist λ1 middot middot middot λm ge 0 satisfying

0 isin partf(x983183) +

m983131

i=1

λipartgi(x983183) λigi(x983183) = 0 i = 1 m

If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34

Active gi(x) le 0

Inactive gi(x) le 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34

8 Moreau envelope and proximal operator

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is

Mλh(x) = infu

983069h(u) +

1

2λ983042uminus x9830422

983070

=1

λinfu

983069λh(u) +

1

2983042uminus x9830422

983070

For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is

proxf (x) = argminu

983069f(u) +

1

2983042uminus x9830422

983070

From the optimality condition (see Theorem 4) we have

0 isin partf(proxf (x)) + (proxf (x)minus x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34

For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if

x983183 = proxf (x983183)

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope

Mλh(x) = h(proxλh(x)) +1

2λ983042proxλh(x)minus x9830422

can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn

minusinfin lt Mλh(x) lt +infin

even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient

nablaMλh(x) =1

λ(xminus proxλh(x))

Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34

Example h(x) = |x| and Hλ(x) = Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34

Lemma 14 (Nonexpansivity of proximal operator)

For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn

983042proxf (x)minus proxf (y)983042 le 983042xminus y983042

Proof From the optimality conditions at two points x and y we have

xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))

By applying monotonicity (see Lemma 3) we have

((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0

Rearranging this and applying the Cauchy-Schwartz inequality yields

983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))

le 983042xminus y983042983042proxf (x)minus proxf (y)983042

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34

Examples of several proximal operators

(1) f(x) = 0proxf (x) = x

(2) f(x) = λ983042x9830421 with λ gt 0

983045proxλ983042middot9830421(x)

983046i= argmin

uisinR

983069λ|u|+ 1

2(uminus xi)

2

983070

=

983099983105983103

983105983101

xi minus λ if xi gt λ

0 if xi isin [minusλλ]

xi + λ if xi lt minusλ

which is known as soft-thresholding

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34

(3) f(x) = λ983042x9830420 the number of nonzero components non-convex

983045proxλ983042middot9830420(x)

983046i=

983099983105983103

983105983101

xi if |xi| gtradic2λ

0 xi if |xi| =radic2λ

0 if |xi| ltradic2λ

which is known as hard thresholding

(4) f(x) = IΩ(x)

proxIΩ(x) = argminu

983069IΩ(u) +

1

2983042uminus x9830422

983070= argmin

uisinΩ983042uminus x983042

which is simply the projection of x onto the set Ω

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34

9 Convergence rates

An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error

For smooth f 983042nablaf(xk)983042

For nonsmooth convex f

dist(0 partf(xk))

(the sequence of distances from 0 to the subdifferential partf(xk))

Other error measures include

983042xk minus x983183983042 and f(xk)minus f983183

where x983183 is a solution and f983183 is the optimal value of f

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34

We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim

krarrinfinφk = 0

We say that Q-linear convergence holds if there is some ρ isin (0 1)such that

φk+1

φkle ρ

for all k sufficiently large

We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that

φk le Cρk k = 1 2

Q-linear rArr R-linear But R-linear ⇏ Q-linear Example

φk =

9830832minusk k even

0 k odd

satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34

Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence

φk le Cradick φk le Ck φk le Ck2 k = 1 2

where in each case C is some positive constant

Superlinear convergence occurs when the constant ρ isin (0 1) in

φk+1φk le ρ

can be chosen arbitrarily close to 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34

The sequence φk converges Q-superlinearly to 0 if

limkrarrinfin

φk+1

φk= 0

Q-Quadratic ifφk+1

φ2k

le C k = 1 2

for some sufficiently large C

R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is

0 le φk le νk k = 1 2

R-quadratic convergence is defined similarly

Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34

91 Complexity

For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is

Cρk le 983171

By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that

Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)

The complexity is O( log(1983171)1minusρ )

For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C

Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)

Ck le 983171 hArr k ge C983171 complexity O(1983171)

Ck2 le 983171 hArr k ge983155

C983171 complexity O(1radic983171)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34

For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is

O(log log(1983171))

To show this one can consider the sequence

ek = Cφk

which satisfiesek+1 le e2k

Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34

Page 10: Lecture 6: Preliminaries of Optimizationmath.xmu.edu.cn/group/nona/damc/Lecture06.pdf · Preliminaries of Optimization Lecture 6 March 27-April 3, 2020 17 / 34. A point x is called

5 Taylorrsquos theorem

Taylorrsquos theorem shows how smooth functions can be locallyapproximated by low-order (eg linear or quadratic) functions

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 10 34

Theorem 6 (Taylorrsquos theorem)

Given a continuously differentiable function f Rn 983041rarr R and givenxp isin Rn we have that

f(x+ p) = f(x) +

983133 1

0nablaf(x+ tp)Tpdt

f(x+ p) = f(x) +nablaf(x+ ξp)Tp for some ξ isin (0 1)

If f is twice continuously differentiable we have

nablaf(x+ p) = nablaf(x) +

983133 1

0nabla2f(x+ tp)pdt

f(x+ p) = f(x) +nablaf(x)Tp+1

2pTnabla2f(x+ ξp)p

for some ξ isin (0 1)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 11 34

If f Rn 983041rarr R is differentiable and convex then

partf(x) = nablaf(x)

andf(y) ge f(x) +nablaf(x)T(y minus x)

for all xy isin Rn

Lipschitz continuously differentiable with constant L

983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042 for all xy isin Rn

If f is Lipschitz continuously differentiable with constant L thenby Taylorrsquos theorem we have

f(y)minus f(x)minusnablaf(x)T(y minus x) le L

2983042y minus x9830422

for all xy isin Rn

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 12 34

For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ

f(y) ge f(x) +nablaf(x)T(y minus x) +γ

2983042y minus x9830422

Lemma 7

Suppose f is strongly convex with modulus of convexity γ and nablaf isuniformly Lipschitz continuous with constant L We have forall xy that

γ

2983042y minus x9830422 le f(y)minus f(x)minusnablaf(x)T(y minus x) le L

2983042y minus x9830422

Condition number κ = Lγ (eg strictly convex quadratic f)

When f is twice continuously differentiable the inequalities inLemma 7 is equivalent to

γI ≼ nabla2f(x) ≼ LI for all x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 13 34

Theorem 8

Let f be differentiable and strongly convex with modulus of convexityγ gt 0 Then the minimizer x983183 of f exists and is unique

Proof (i) Compactness of level set Show that for any point x0 thelevel set

x | f(x) le f(x0)

is closed and bounded and hence compact(ii) Existence Since f is continuous it attains its minimum on the

compact level set which is also the solution of minx f(x)(iii) Uniqueness Suppose for contradiction that the minimizer is

not unique so that we have two points x1983183 and x2

983183 that minimize f Byusing the strongly convex property we can prove

f

983061x1983183 + x2

983183

2

983062lt f(x1

983183) = f(x2983183)

This is a contradiction

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 14 34

6 Optimality conditions for smooth functions

Theorem 9 (First-order necessary condition)

If x983183 is a local minimizer of f and f is continuously differentiable inan open neighborhood of x983183 then nablaf(x983183) = 0

Proof Suppose for contradiction that nablaf(x983183) ∕= 0 Define the vectorp = minusnablaf(x983183) and note that pTnablaf(x983183) = minus983042nablaf(x983183)98304222 lt 0 Becausenablaf is continuous near x983183 there is a scalar T gt 0 such that

pTnablaf(x983183 + tp) lt 0 for all t isin [0 T ]

For any s isin (0 T ] we have by Taylorrsquos theorem that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183 + ξsp) for some ξ isin (0 1)

Therefore f(x983183 + sp) lt f(x983183) for all s isin (0 T ] We have found adirection leading away from x983183 along which f decreases so x983183 is not alocal minimizer and we have a contradiction

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 15 34

Theorem 10 (Second-order necessary conditions)

If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite

Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that

pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]

By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1

2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)

As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34

Theorem 11 (Second-order sufficient conditions)

Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f

Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and

f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1

2pTnabla2f(x983183 + ξp)p

= f(x983183) +1

2pTnabla2f(x983183 + ξp)p

for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34

A point x is called a stationary point if

nablaf(x) = 0

A stationary point x is called a saddle point if there exist u and vsuch that

f(x+ αu) lt f(x) and f(x+ αv) gt f(x)

for all sufficiently small α gt 0

Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points

If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point

If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34

Theorem 12

(i) forall convex f any local minimizer x983183 is a global minimizer of f

(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f

Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is

x = λz+ (1minus λ)x983183 for some λ isin (0 1]

By the convexity property for f we have

f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)

Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34

(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have

nablaf(x983183)T(zminus x983183) =

d

dλf(x983183 + λ(zminus x983183))|λ=0

= limλrarr0+

f(x983183 + λ(zminus x983183))minus f(x983183)

λ

le limλrarr0+

λf(z) + (1minus λ)f(x983183)minus f(x983183)

λ

= f(z)minus f(x983183) lt 0

Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point

Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms

Numerical algorithms try to seek a point where nablaf vanishes

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34

7 Karush-Kuhn-Tucker conditions

Theorem 13 (KKT conditions)

Consider the minimization problem

minxisinRn

f(x) st gi(x) le 0 i = 1 m

where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions

Let x983183 be an optimal solution and assume Slaterrsquos condition

exist x isin Rn st gi(x) lt 0 i = 1 m

hold Then there exist λ1 middot middot middot λm ge 0 satisfying

0 isin partf(x983183) +

m983131

i=1

λipartgi(x983183) λigi(x983183) = 0 i = 1 m

If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34

Active gi(x) le 0

Inactive gi(x) le 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34

8 Moreau envelope and proximal operator

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is

Mλh(x) = infu

983069h(u) +

1

2λ983042uminus x9830422

983070

=1

λinfu

983069λh(u) +

1

2983042uminus x9830422

983070

For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is

proxf (x) = argminu

983069f(u) +

1

2983042uminus x9830422

983070

From the optimality condition (see Theorem 4) we have

0 isin partf(proxf (x)) + (proxf (x)minus x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34

For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if

x983183 = proxf (x983183)

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope

Mλh(x) = h(proxλh(x)) +1

2λ983042proxλh(x)minus x9830422

can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn

minusinfin lt Mλh(x) lt +infin

even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient

nablaMλh(x) =1

λ(xminus proxλh(x))

Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34

Example h(x) = |x| and Hλ(x) = Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34

Lemma 14 (Nonexpansivity of proximal operator)

For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn

983042proxf (x)minus proxf (y)983042 le 983042xminus y983042

Proof From the optimality conditions at two points x and y we have

xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))

By applying monotonicity (see Lemma 3) we have

((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0

Rearranging this and applying the Cauchy-Schwartz inequality yields

983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))

le 983042xminus y983042983042proxf (x)minus proxf (y)983042

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34

Examples of several proximal operators

(1) f(x) = 0proxf (x) = x

(2) f(x) = λ983042x9830421 with λ gt 0

983045proxλ983042middot9830421(x)

983046i= argmin

uisinR

983069λ|u|+ 1

2(uminus xi)

2

983070

=

983099983105983103

983105983101

xi minus λ if xi gt λ

0 if xi isin [minusλλ]

xi + λ if xi lt minusλ

which is known as soft-thresholding

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34

(3) f(x) = λ983042x9830420 the number of nonzero components non-convex

983045proxλ983042middot9830420(x)

983046i=

983099983105983103

983105983101

xi if |xi| gtradic2λ

0 xi if |xi| =radic2λ

0 if |xi| ltradic2λ

which is known as hard thresholding

(4) f(x) = IΩ(x)

proxIΩ(x) = argminu

983069IΩ(u) +

1

2983042uminus x9830422

983070= argmin

uisinΩ983042uminus x983042

which is simply the projection of x onto the set Ω

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34

9 Convergence rates

An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error

For smooth f 983042nablaf(xk)983042

For nonsmooth convex f

dist(0 partf(xk))

(the sequence of distances from 0 to the subdifferential partf(xk))

Other error measures include

983042xk minus x983183983042 and f(xk)minus f983183

where x983183 is a solution and f983183 is the optimal value of f

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34

We denote by $\{\phi_k\}$ the sequence of nonnegative scalars whose rate of convergence to $0$ we wish to find: $\phi_k \ge 0$ and $\lim_{k\to\infty} \phi_k = 0$.

We say that Q-linear convergence holds if there is some $\rho \in (0,1)$ such that

$$\frac{\phi_{k+1}}{\phi_k} \le \rho$$

for all $k$ sufficiently large.

We say that R-linear convergence holds if there exist $\rho \in (0,1)$ and $C > 0$ such that

$$\phi_k \le C\rho^k, \quad k = 1, 2, \ldots$$

Q-linear $\Rightarrow$ R-linear. But R-linear $\not\Rightarrow$ Q-linear. Example:

$$\phi_k = \begin{cases} 2^{-k} & k \text{ even},\\ 0 & k \text{ odd}, \end{cases}$$

satisfies the R-linear condition with $C = 1$ and $\rho = 1/2$, but is not Q-linear: for odd $k$ we have $\phi_k = 0 < \phi_{k+1}$, so the ratio $\phi_{k+1}/\phi_k$ cannot be bounded by any $\rho \in (0,1)$.

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34

Sublinear convergence is, as its name suggests, slower than linear. Several varieties of sublinear convergence:

$$\phi_k \le \frac{C}{\sqrt{k}}, \qquad \phi_k \le \frac{C}{k}, \qquad \phi_k \le \frac{C}{k^2}, \qquad k = 1, 2, \ldots,$$

where in each case $C$ is some positive constant.

Superlinear convergence occurs when the constant $\rho \in (0,1)$ in $\phi_{k+1}/\phi_k \le \rho$ can be chosen arbitrarily close to $0$.

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34

The sequence $\{\phi_k\}$ converges Q-superlinearly to $0$ if

$$\lim_{k\to\infty} \frac{\phi_{k+1}}{\phi_k} = 0,$$

and Q-quadratically if

$$\frac{\phi_{k+1}}{\phi_k^2} \le C, \quad k = 1, 2, \ldots,$$

for some sufficiently large $C$.

R-superlinear convergence holds if there is a Q-superlinearly convergent sequence $\{\nu_k\}$ that dominates $\{\phi_k\}$, that is,

$$0 \le \phi_k \le \nu_k, \quad k = 1, 2, \ldots$$

R-quadratic convergence is defined similarly.

Quadratic and superlinear rates are associated with higher-order methods, such as Newton and quasi-Newton methods.

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34

9.1 Complexity

For a sequence satisfying the R-linear convergence condition, a sufficient condition for $\phi_k \le \epsilon < C$ is

$$C\rho^k \le \epsilon$$

By using the estimate $\log\rho \le \rho - 1$ for all $\rho \in (0,1)$, we have that

$$C\rho^k \le \epsilon \;\Leftrightarrow\; k\log\rho \le \log(\epsilon/C) \;\Leftarrow\; k \ge \frac{\log(C/\epsilon)}{1-\rho}.$$

The complexity is $O\big(\tfrac{\log(1/\epsilon)}{1-\rho}\big)$.

For an algorithm that satisfies the sublinear rates, sufficient conditions for $\phi_k \le \epsilon < C$ are:

$$\frac{C}{\sqrt{k}} \le \epsilon \;\Leftrightarrow\; k \ge (C/\epsilon)^2, \quad \text{complexity } O(1/\epsilon^2);$$

$$\frac{C}{k} \le \epsilon \;\Leftrightarrow\; k \ge C/\epsilon, \quad \text{complexity } O(1/\epsilon);$$

$$\frac{C}{k^2} \le \epsilon \;\Leftrightarrow\; k \ge \sqrt{C/\epsilon}, \quad \text{complexity } O(1/\sqrt{\epsilon}).$$

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34
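As a rough numerical illustration of how different these iteration counts are (an added sketch; the constants $C$ and $\rho$ are arbitrary illustrative choices), the following prints the bounds above for a few target accuracies:

```python
import math

C, rho = 10.0, 0.9   # illustrative constants

for eps in [1e-2, 1e-4, 1e-6]:
    k_linear = math.log(C / eps) / (1.0 - rho)   # R-linear:           O(log(1/eps)/(1-rho))
    k_sqrt   = (C / eps) ** 2                    # phi_k <= C/sqrt(k): O(1/eps^2)
    k_inv    = C / eps                           # phi_k <= C/k:       O(1/eps)
    k_invsq  = math.sqrt(C / eps)                # phi_k <= C/k^2:     O(1/sqrt(eps))
    print(f"eps={eps:.0e}: linear {k_linear:.0f}, "
          f"C/sqrt(k) {k_sqrt:.0f}, C/k {k_inv:.0f}, C/k^2 {k_invsq:.0f}")
```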

For Q-quadratically convergent methods, the complexity is doubly logarithmic in $\epsilon$, that is,

$$O(\log\log(1/\epsilon))$$

To show this, one can consider the sequence $e_k = C\phi_k$, which satisfies $e_{k+1} \le e_k^2$. Hence, once some $e_K < 1$, we have $e_{K+j} \le e_K^{2^j}$, so $\phi_{K+j} \le \epsilon$ as soon as $2^j \ge \log(1/(C\epsilon))/\log(1/e_K)$, i.e. after $j = O(\log\log(1/\epsilon))$ further iterations.

Once the algorithm enters a neighborhood of quadratic convergence, just a few additional iterations are required for convergence to a solution of high accuracy.
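A tiny added demonstration of why so few iterations are needed: squaring an error below $1$ drives it to machine precision almost immediately.

```python
e = 0.5                 # error at the start of the quadratic-convergence phase (illustrative)
for k in range(7):
    e = e * e           # e_{k+1} <= e_k^2: the number of correct digits roughly doubles per step
    print(k + 1, e)
# After 6 squarings e is already about 5e-20, far below double-precision accuracy.
```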

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34

8 Moreau envelope and proximal operator

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is

Mλh(x) = infu

983069h(u) +

1

2λ983042uminus x9830422

983070

=1

λinfu

983069λh(u) +

1

2983042uminus x9830422

983070

For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is

proxf (x) = argminu

983069f(u) +

1

2983042uminus x9830422

983070

From the optimality condition (see Theorem 4) we have

0 isin partf(proxf (x)) + (proxf (x)minus x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34

For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if

x983183 = proxf (x983183)

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope

Mλh(x) = h(proxλh(x)) +1

2λ983042proxλh(x)minus x9830422

can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn

minusinfin lt Mλh(x) lt +infin

even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient

nablaMλh(x) =1

λ(xminus proxλh(x))

Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34

Example h(x) = |x| and Hλ(x) = Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34

Lemma 14 (Nonexpansivity of proximal operator)

For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn

983042proxf (x)minus proxf (y)983042 le 983042xminus y983042

Proof From the optimality conditions at two points x and y we have

xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))

By applying monotonicity (see Lemma 3) we have

((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0

Rearranging this and applying the Cauchy-Schwartz inequality yields

983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))

le 983042xminus y983042983042proxf (x)minus proxf (y)983042

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34

Examples of several proximal operators

(1) f(x) = 0proxf (x) = x

(2) f(x) = λ983042x9830421 with λ gt 0

983045proxλ983042middot9830421(x)

983046i= argmin

uisinR

983069λ|u|+ 1

2(uminus xi)

2

983070

=

983099983105983103

983105983101

xi minus λ if xi gt λ

0 if xi isin [minusλλ]

xi + λ if xi lt minusλ

which is known as soft-thresholding

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34

(3) f(x) = λ983042x9830420 the number of nonzero components non-convex

983045proxλ983042middot9830420(x)

983046i=

983099983105983103

983105983101

xi if |xi| gtradic2λ

0 xi if |xi| =radic2λ

0 if |xi| ltradic2λ

which is known as hard thresholding

(4) f(x) = IΩ(x)

proxIΩ(x) = argminu

983069IΩ(u) +

1

2983042uminus x9830422

983070= argmin

uisinΩ983042uminus x983042

which is simply the projection of x onto the set Ω

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34

9 Convergence rates

An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error

For smooth f 983042nablaf(xk)983042

For nonsmooth convex f

dist(0 partf(xk))

(the sequence of distances from 0 to the subdifferential partf(xk))

Other error measures include

983042xk minus x983183983042 and f(xk)minus f983183

where x983183 is a solution and f983183 is the optimal value of f

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34

We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim

krarrinfinφk = 0

We say that Q-linear convergence holds if there is some ρ isin (0 1)such that

φk+1

φkle ρ

for all k sufficiently large

We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that

φk le Cρk k = 1 2

Q-linear rArr R-linear But R-linear ⇏ Q-linear Example

φk =

9830832minusk k even

0 k odd

satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34

Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence

φk le Cradick φk le Ck φk le Ck2 k = 1 2

where in each case C is some positive constant

Superlinear convergence occurs when the constant ρ isin (0 1) in

φk+1φk le ρ

can be chosen arbitrarily close to 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34

The sequence φk converges Q-superlinearly to 0 if

limkrarrinfin

φk+1

φk= 0

Q-Quadratic ifφk+1

φ2k

le C k = 1 2

for some sufficiently large C

R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is

0 le φk le νk k = 1 2

R-quadratic convergence is defined similarly

Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34

91 Complexity

For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is

Cρk le 983171

By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that

Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)

The complexity is O( log(1983171)1minusρ )

For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C

Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)

Ck le 983171 hArr k ge C983171 complexity O(1983171)

Ck2 le 983171 hArr k ge983155

C983171 complexity O(1radic983171)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34

For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is

O(log log(1983171))

To show this one can consider the sequence

ek = Cφk

which satisfiesek+1 le e2k

Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34

Page 12: Lecture 6: Preliminaries of Optimizationmath.xmu.edu.cn/group/nona/damc/Lecture06.pdf · Preliminaries of Optimization Lecture 6 March 27-April 3, 2020 17 / 34. A point x is called

If f Rn 983041rarr R is differentiable and convex then

partf(x) = nablaf(x)

andf(y) ge f(x) +nablaf(x)T(y minus x)

for all xy isin Rn

Lipschitz continuously differentiable with constant L

983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042 for all xy isin Rn

If f is Lipschitz continuously differentiable with constant L thenby Taylorrsquos theorem we have

f(y)minus f(x)minusnablaf(x)T(y minus x) le L

2983042y minus x9830422

for all xy isin Rn

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 12 34

For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ

f(y) ge f(x) +nablaf(x)T(y minus x) +γ

2983042y minus x9830422

Lemma 7

Suppose f is strongly convex with modulus of convexity γ and nablaf isuniformly Lipschitz continuous with constant L We have forall xy that

γ

2983042y minus x9830422 le f(y)minus f(x)minusnablaf(x)T(y minus x) le L

2983042y minus x9830422

Condition number κ = Lγ (eg strictly convex quadratic f)

When f is twice continuously differentiable the inequalities inLemma 7 is equivalent to

γI ≼ nabla2f(x) ≼ LI for all x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 13 34

Theorem 8

Let f be differentiable and strongly convex with modulus of convexityγ gt 0 Then the minimizer x983183 of f exists and is unique

Proof (i) Compactness of level set Show that for any point x0 thelevel set

x | f(x) le f(x0)

is closed and bounded and hence compact(ii) Existence Since f is continuous it attains its minimum on the

compact level set which is also the solution of minx f(x)(iii) Uniqueness Suppose for contradiction that the minimizer is

not unique so that we have two points x1983183 and x2

983183 that minimize f Byusing the strongly convex property we can prove

f

983061x1983183 + x2

983183

2

983062lt f(x1

983183) = f(x2983183)

This is a contradiction

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 14 34

6 Optimality conditions for smooth functions

Theorem 9 (First-order necessary condition)

If x983183 is a local minimizer of f and f is continuously differentiable inan open neighborhood of x983183 then nablaf(x983183) = 0

Proof Suppose for contradiction that nablaf(x983183) ∕= 0 Define the vectorp = minusnablaf(x983183) and note that pTnablaf(x983183) = minus983042nablaf(x983183)98304222 lt 0 Becausenablaf is continuous near x983183 there is a scalar T gt 0 such that

pTnablaf(x983183 + tp) lt 0 for all t isin [0 T ]

For any s isin (0 T ] we have by Taylorrsquos theorem that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183 + ξsp) for some ξ isin (0 1)

Therefore f(x983183 + sp) lt f(x983183) for all s isin (0 T ] We have found adirection leading away from x983183 along which f decreases so x983183 is not alocal minimizer and we have a contradiction

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 15 34

Theorem 10 (Second-order necessary conditions)

If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite

Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that

pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]

By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1

2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)

As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34

Theorem 11 (Second-order sufficient conditions)

Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f

Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and

f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1

2pTnabla2f(x983183 + ξp)p

= f(x983183) +1

2pTnabla2f(x983183 + ξp)p

for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34

A point x is called a stationary point if

nablaf(x) = 0

A stationary point x is called a saddle point if there exist u and vsuch that

f(x+ αu) lt f(x) and f(x+ αv) gt f(x)

for all sufficiently small α gt 0

Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points

If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point

If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34

Theorem 12

(i) forall convex f any local minimizer x983183 is a global minimizer of f

(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f

Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is

x = λz+ (1minus λ)x983183 for some λ isin (0 1]

By the convexity property for f we have

f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)

Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34

(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have

nablaf(x983183)T(zminus x983183) =

d

dλf(x983183 + λ(zminus x983183))|λ=0

= limλrarr0+

f(x983183 + λ(zminus x983183))minus f(x983183)

λ

le limλrarr0+

λf(z) + (1minus λ)f(x983183)minus f(x983183)

λ

= f(z)minus f(x983183) lt 0

Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point

Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms

Numerical algorithms try to seek a point where nablaf vanishes

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34

7 Karush-Kuhn-Tucker conditions

Theorem 13 (KKT conditions)

Consider the minimization problem

minxisinRn

f(x) st gi(x) le 0 i = 1 m

where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions

Let x983183 be an optimal solution and assume Slaterrsquos condition

exist x isin Rn st gi(x) lt 0 i = 1 m

hold Then there exist λ1 middot middot middot λm ge 0 satisfying

0 isin partf(x983183) +

m983131

i=1

λipartgi(x983183) λigi(x983183) = 0 i = 1 m

If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34

Active gi(x) le 0

Inactive gi(x) le 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34

8 Moreau envelope and proximal operator

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is

Mλh(x) = infu

983069h(u) +

1

2λ983042uminus x9830422

983070

=1

λinfu

983069λh(u) +

1

2983042uminus x9830422

983070

For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is

proxf (x) = argminu

983069f(u) +

1

2983042uminus x9830422

983070

From the optimality condition (see Theorem 4) we have

0 isin partf(proxf (x)) + (proxf (x)minus x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34

For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if

x983183 = proxf (x983183)

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope

Mλh(x) = h(proxλh(x)) +1

2λ983042proxλh(x)minus x9830422

can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn

minusinfin lt Mλh(x) lt +infin

even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient

nablaMλh(x) =1

λ(xminus proxλh(x))

Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34

Example h(x) = |x| and Hλ(x) = Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34

Lemma 14 (Nonexpansivity of proximal operator)

For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn

983042proxf (x)minus proxf (y)983042 le 983042xminus y983042

Proof From the optimality conditions at two points x and y we have

xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))

By applying monotonicity (see Lemma 3) we have

((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0

Rearranging this and applying the Cauchy-Schwartz inequality yields

983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))

le 983042xminus y983042983042proxf (x)minus proxf (y)983042

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34

Examples of several proximal operators

(1) f(x) = 0proxf (x) = x

(2) f(x) = λ983042x9830421 with λ gt 0

983045proxλ983042middot9830421(x)

983046i= argmin

uisinR

983069λ|u|+ 1

2(uminus xi)

2

983070

=

983099983105983103

983105983101

xi minus λ if xi gt λ

0 if xi isin [minusλλ]

xi + λ if xi lt minusλ

which is known as soft-thresholding

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34

(3) f(x) = λ983042x9830420 the number of nonzero components non-convex

983045proxλ983042middot9830420(x)

983046i=

983099983105983103

983105983101

xi if |xi| gtradic2λ

0 xi if |xi| =radic2λ

0 if |xi| ltradic2λ

which is known as hard thresholding

(4) f(x) = IΩ(x)

proxIΩ(x) = argminu

983069IΩ(u) +

1

2983042uminus x9830422

983070= argmin

uisinΩ983042uminus x983042

which is simply the projection of x onto the set Ω

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34

9 Convergence rates

An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error

For smooth f 983042nablaf(xk)983042

For nonsmooth convex f

dist(0 partf(xk))

(the sequence of distances from 0 to the subdifferential partf(xk))

Other error measures include

983042xk minus x983183983042 and f(xk)minus f983183

where x983183 is a solution and f983183 is the optimal value of f

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34

We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim

krarrinfinφk = 0

We say that Q-linear convergence holds if there is some ρ isin (0 1)such that

φk+1

φkle ρ

for all k sufficiently large

We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that

φk le Cρk k = 1 2

Q-linear rArr R-linear But R-linear ⇏ Q-linear Example

φk =

9830832minusk k even

0 k odd

satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34

Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence

φk le Cradick φk le Ck φk le Ck2 k = 1 2

where in each case C is some positive constant

Superlinear convergence occurs when the constant ρ isin (0 1) in

φk+1φk le ρ

can be chosen arbitrarily close to 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34

The sequence φk converges Q-superlinearly to 0 if

limkrarrinfin

φk+1

φk= 0

Q-Quadratic ifφk+1

φ2k

le C k = 1 2

for some sufficiently large C

R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is

0 le φk le νk k = 1 2

R-quadratic convergence is defined similarly

Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34

91 Complexity

For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is

Cρk le 983171

By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that

Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)

The complexity is O( log(1983171)1minusρ )

For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C

Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)

Ck le 983171 hArr k ge C983171 complexity O(1983171)

Ck2 le 983171 hArr k ge983155

C983171 complexity O(1radic983171)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34

For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is

O(log log(1983171))

To show this one can consider the sequence

ek = Cφk

which satisfiesek+1 le e2k

Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34

Page 13: Lecture 6: Preliminaries of Optimizationmath.xmu.edu.cn/group/nona/damc/Lecture06.pdf · Preliminaries of Optimization Lecture 6 March 27-April 3, 2020 17 / 34. A point x is called

For differentiable f Equivalent definition of strongly convex withmodulus of convexity γ

f(y) ge f(x) +nablaf(x)T(y minus x) +γ

2983042y minus x9830422

Lemma 7

Suppose f is strongly convex with modulus of convexity γ and nablaf isuniformly Lipschitz continuous with constant L We have forall xy that

γ

2983042y minus x9830422 le f(y)minus f(x)minusnablaf(x)T(y minus x) le L

2983042y minus x9830422

Condition number κ = Lγ (eg strictly convex quadratic f)

When f is twice continuously differentiable the inequalities inLemma 7 is equivalent to

γI ≼ nabla2f(x) ≼ LI for all x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 13 34

Theorem 8

Let f be differentiable and strongly convex with modulus of convexityγ gt 0 Then the minimizer x983183 of f exists and is unique

Proof (i) Compactness of level set Show that for any point x0 thelevel set

x | f(x) le f(x0)

is closed and bounded and hence compact(ii) Existence Since f is continuous it attains its minimum on the

compact level set which is also the solution of minx f(x)(iii) Uniqueness Suppose for contradiction that the minimizer is

not unique so that we have two points x1983183 and x2

983183 that minimize f Byusing the strongly convex property we can prove

f

983061x1983183 + x2

983183

2

983062lt f(x1

983183) = f(x2983183)

This is a contradiction

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 14 34

6 Optimality conditions for smooth functions

Theorem 9 (First-order necessary condition)

If x983183 is a local minimizer of f and f is continuously differentiable inan open neighborhood of x983183 then nablaf(x983183) = 0

Proof Suppose for contradiction that nablaf(x983183) ∕= 0 Define the vectorp = minusnablaf(x983183) and note that pTnablaf(x983183) = minus983042nablaf(x983183)98304222 lt 0 Becausenablaf is continuous near x983183 there is a scalar T gt 0 such that

pTnablaf(x983183 + tp) lt 0 for all t isin [0 T ]

For any s isin (0 T ] we have by Taylorrsquos theorem that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183 + ξsp) for some ξ isin (0 1)

Therefore f(x983183 + sp) lt f(x983183) for all s isin (0 T ] We have found adirection leading away from x983183 along which f decreases so x983183 is not alocal minimizer and we have a contradiction

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 15 34

Theorem 10 (Second-order necessary conditions)

If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite

Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that

pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]

By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1

2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)

As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34

Theorem 11 (Second-order sufficient conditions)

Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f

Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and

f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1

2pTnabla2f(x983183 + ξp)p

= f(x983183) +1

2pTnabla2f(x983183 + ξp)p

for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34

A point x is called a stationary point if

nablaf(x) = 0

A stationary point x is called a saddle point if there exist u and vsuch that

f(x+ αu) lt f(x) and f(x+ αv) gt f(x)

for all sufficiently small α gt 0

Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points

If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point

If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34

Theorem 12

(i) forall convex f any local minimizer x983183 is a global minimizer of f

(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f

Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is

x = λz+ (1minus λ)x983183 for some λ isin (0 1]

By the convexity property for f we have

f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)

Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34

(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have

nablaf(x983183)T(zminus x983183) =

d

dλf(x983183 + λ(zminus x983183))|λ=0

= limλrarr0+

f(x983183 + λ(zminus x983183))minus f(x983183)

λ

le limλrarr0+

λf(z) + (1minus λ)f(x983183)minus f(x983183)

λ

= f(z)minus f(x983183) lt 0

Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point

Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms

Numerical algorithms try to seek a point where nablaf vanishes

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34

7 Karush-Kuhn-Tucker conditions

Theorem 13 (KKT conditions)

Consider the minimization problem

minxisinRn

f(x) st gi(x) le 0 i = 1 m

where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions

Let x983183 be an optimal solution and assume Slaterrsquos condition

exist x isin Rn st gi(x) lt 0 i = 1 m

hold Then there exist λ1 middot middot middot λm ge 0 satisfying

0 isin partf(x983183) +

m983131

i=1

λipartgi(x983183) λigi(x983183) = 0 i = 1 m

If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34

Active gi(x) le 0

Inactive gi(x) le 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34

8 Moreau envelope and proximal operator

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is

Mλh(x) = infu

983069h(u) +

1

2λ983042uminus x9830422

983070

=1

λinfu

983069λh(u) +

1

2983042uminus x9830422

983070

For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is

proxf (x) = argminu

983069f(u) +

1

2983042uminus x9830422

983070

From the optimality condition (see Theorem 4) we have

0 isin partf(proxf (x)) + (proxf (x)minus x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34

For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if

x983183 = proxf (x983183)

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope

Mλh(x) = h(proxλh(x)) +1

2λ983042proxλh(x)minus x9830422

can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn

minusinfin lt Mλh(x) lt +infin

even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient

nablaMλh(x) =1

λ(xminus proxλh(x))

Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34

Example h(x) = |x| and Hλ(x) = Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34

Lemma 14 (Nonexpansivity of proximal operator)

For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn

983042proxf (x)minus proxf (y)983042 le 983042xminus y983042

Proof From the optimality conditions at two points x and y we have

xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))

By applying monotonicity (see Lemma 3) we have

((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0

Rearranging this and applying the Cauchy-Schwartz inequality yields

983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))

le 983042xminus y983042983042proxf (x)minus proxf (y)983042

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34

Examples of several proximal operators

(1) f(x) = 0proxf (x) = x

(2) f(x) = λ983042x9830421 with λ gt 0

983045proxλ983042middot9830421(x)

983046i= argmin

uisinR

983069λ|u|+ 1

2(uminus xi)

2

983070

=

983099983105983103

983105983101

xi minus λ if xi gt λ

0 if xi isin [minusλλ]

xi + λ if xi lt minusλ

which is known as soft-thresholding

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34

(3) f(x) = λ983042x9830420 the number of nonzero components non-convex

983045proxλ983042middot9830420(x)

983046i=

983099983105983103

983105983101

xi if |xi| gtradic2λ

0 xi if |xi| =radic2λ

0 if |xi| ltradic2λ

which is known as hard thresholding

(4) f(x) = IΩ(x)

proxIΩ(x) = argminu

983069IΩ(u) +

1

2983042uminus x9830422

983070= argmin

uisinΩ983042uminus x983042

which is simply the projection of x onto the set Ω

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34

9 Convergence rates

An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error

For smooth f 983042nablaf(xk)983042

For nonsmooth convex f

dist(0 partf(xk))

(the sequence of distances from 0 to the subdifferential partf(xk))

Other error measures include

983042xk minus x983183983042 and f(xk)minus f983183

where x983183 is a solution and f983183 is the optimal value of f

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34

We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim

krarrinfinφk = 0

We say that Q-linear convergence holds if there is some ρ isin (0 1)such that

φk+1

φkle ρ

for all k sufficiently large

We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that

φk le Cρk k = 1 2

Q-linear rArr R-linear But R-linear ⇏ Q-linear Example

φk =

9830832minusk k even

0 k odd

satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34

Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence

φk le Cradick φk le Ck φk le Ck2 k = 1 2

where in each case C is some positive constant

Superlinear convergence occurs when the constant ρ isin (0 1) in

φk+1φk le ρ

can be chosen arbitrarily close to 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34

The sequence φk converges Q-superlinearly to 0 if

limkrarrinfin

φk+1

φk= 0

Q-Quadratic ifφk+1

φ2k

le C k = 1 2

for some sufficiently large C

R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is

0 le φk le νk k = 1 2

R-quadratic convergence is defined similarly

Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34

91 Complexity

For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is

Cρk le 983171

By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that

Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)

The complexity is O( log(1983171)1minusρ )

For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C

Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)

Ck le 983171 hArr k ge C983171 complexity O(1983171)

Ck2 le 983171 hArr k ge983155

C983171 complexity O(1radic983171)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34

For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is

O(log log(1983171))

To show this one can consider the sequence

ek = Cφk

which satisfiesek+1 le e2k

Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34

Page 14: Lecture 6: Preliminaries of Optimizationmath.xmu.edu.cn/group/nona/damc/Lecture06.pdf · Preliminaries of Optimization Lecture 6 March 27-April 3, 2020 17 / 34. A point x is called

Theorem 8

Let f be differentiable and strongly convex with modulus of convexityγ gt 0 Then the minimizer x983183 of f exists and is unique

Proof (i) Compactness of level set Show that for any point x0 thelevel set

x | f(x) le f(x0)

is closed and bounded and hence compact(ii) Existence Since f is continuous it attains its minimum on the

compact level set which is also the solution of minx f(x)(iii) Uniqueness Suppose for contradiction that the minimizer is

not unique so that we have two points x1983183 and x2

983183 that minimize f Byusing the strongly convex property we can prove

f

983061x1983183 + x2

983183

2

983062lt f(x1

983183) = f(x2983183)

This is a contradiction

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 14 34

6 Optimality conditions for smooth functions

Theorem 9 (First-order necessary condition)

If x983183 is a local minimizer of f and f is continuously differentiable inan open neighborhood of x983183 then nablaf(x983183) = 0

Proof Suppose for contradiction that nablaf(x983183) ∕= 0 Define the vectorp = minusnablaf(x983183) and note that pTnablaf(x983183) = minus983042nablaf(x983183)98304222 lt 0 Becausenablaf is continuous near x983183 there is a scalar T gt 0 such that

pTnablaf(x983183 + tp) lt 0 for all t isin [0 T ]

For any s isin (0 T ] we have by Taylorrsquos theorem that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183 + ξsp) for some ξ isin (0 1)

Therefore f(x983183 + sp) lt f(x983183) for all s isin (0 T ] We have found adirection leading away from x983183 along which f decreases so x983183 is not alocal minimizer and we have a contradiction

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 15 34

Theorem 10 (Second-order necessary conditions)

If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite

Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that

pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]

By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1

2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)

As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34

Theorem 11 (Second-order sufficient conditions)

Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f

Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and

f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1

2pTnabla2f(x983183 + ξp)p

= f(x983183) +1

2pTnabla2f(x983183 + ξp)p

for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34

A point x is called a stationary point if

nablaf(x) = 0

A stationary point x is called a saddle point if there exist u and vsuch that

f(x+ αu) lt f(x) and f(x+ αv) gt f(x)

for all sufficiently small α gt 0

Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points

If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point

If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34

Theorem 12

(i) forall convex f any local minimizer x983183 is a global minimizer of f

(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f

Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is

x = λz+ (1minus λ)x983183 for some λ isin (0 1]

By the convexity property for f we have

f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)

Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34

(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have

nablaf(x983183)T(zminus x983183) =

d

dλf(x983183 + λ(zminus x983183))|λ=0

= limλrarr0+

f(x983183 + λ(zminus x983183))minus f(x983183)

λ

le limλrarr0+

λf(z) + (1minus λ)f(x983183)minus f(x983183)

λ

= f(z)minus f(x983183) lt 0

Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point

Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms

Numerical algorithms try to seek a point where nablaf vanishes

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34

7 Karush-Kuhn-Tucker conditions

Theorem 13 (KKT conditions)

Consider the minimization problem

minxisinRn

f(x) st gi(x) le 0 i = 1 m

where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions

Let x983183 be an optimal solution and assume Slaterrsquos condition

exist x isin Rn st gi(x) lt 0 i = 1 m

hold Then there exist λ1 middot middot middot λm ge 0 satisfying

0 isin partf(x983183) +

m983131

i=1

λipartgi(x983183) λigi(x983183) = 0 i = 1 m

If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34

Active gi(x) le 0

Inactive gi(x) le 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34

8 Moreau envelope and proximal operator

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is

Mλh(x) = infu

983069h(u) +

1

2λ983042uminus x9830422

983070

=1

λinfu

983069λh(u) +

1

2983042uminus x9830422

983070

For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is

proxf (x) = argminu

983069f(u) +

1

2983042uminus x9830422

983070

From the optimality condition (see Theorem 4) we have

0 isin partf(proxf (x)) + (proxf (x)minus x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34

For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if

x983183 = proxf (x983183)

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope

Mλh(x) = h(proxλh(x)) +1

2λ983042proxλh(x)minus x9830422

can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn

minusinfin lt Mλh(x) lt +infin

even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient

nablaMλh(x) =1

λ(xminus proxλh(x))

Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34

Example h(x) = |x| and Hλ(x) = Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34

Lemma 14 (Nonexpansivity of proximal operator)

For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn

983042proxf (x)minus proxf (y)983042 le 983042xminus y983042

Proof From the optimality conditions at two points x and y we have

xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))

By applying monotonicity (see Lemma 3) we have

((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0

Rearranging this and applying the Cauchy-Schwartz inequality yields

983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))

le 983042xminus y983042983042proxf (x)minus proxf (y)983042

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34

Examples of several proximal operators

(1) f(x) = 0proxf (x) = x

(2) f(x) = λ983042x9830421 with λ gt 0

983045proxλ983042middot9830421(x)

983046i= argmin

uisinR

983069λ|u|+ 1

2(uminus xi)

2

983070

=

983099983105983103

983105983101

xi minus λ if xi gt λ

0 if xi isin [minusλλ]

xi + λ if xi lt minusλ

which is known as soft-thresholding

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34

(3) f(x) = λ983042x9830420 the number of nonzero components non-convex

983045proxλ983042middot9830420(x)

983046i=

983099983105983103

983105983101

xi if |xi| gtradic2λ

0 xi if |xi| =radic2λ

0 if |xi| ltradic2λ

which is known as hard thresholding

(4) f(x) = IΩ(x)

proxIΩ(x) = argminu

983069IΩ(u) +

1

2983042uminus x9830422

983070= argmin

uisinΩ983042uminus x983042

which is simply the projection of x onto the set Ω

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34

9 Convergence rates

An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error

For smooth f 983042nablaf(xk)983042

For nonsmooth convex f

dist(0 partf(xk))

(the sequence of distances from 0 to the subdifferential partf(xk))

Other error measures include

983042xk minus x983183983042 and f(xk)minus f983183

where x983183 is a solution and f983183 is the optimal value of f

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34

We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim

krarrinfinφk = 0

We say that Q-linear convergence holds if there is some ρ isin (0 1)such that

φk+1

φkle ρ

for all k sufficiently large

We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that

φk le Cρk k = 1 2

Q-linear rArr R-linear But R-linear ⇏ Q-linear Example

φk =

9830832minusk k even

0 k odd

satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34

Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence

φk le Cradick φk le Ck φk le Ck2 k = 1 2

where in each case C is some positive constant

Superlinear convergence occurs when the constant ρ isin (0 1) in

φk+1φk le ρ

can be chosen arbitrarily close to 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34

The sequence φk converges Q-superlinearly to 0 if

limkrarrinfin

φk+1

φk= 0

Q-Quadratic ifφk+1

φ2k

le C k = 1 2

for some sufficiently large C

R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is

0 le φk le νk k = 1 2

R-quadratic convergence is defined similarly

Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34

91 Complexity

For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is

Cρk le 983171

By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that

Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)

The complexity is O( log(1983171)1minusρ )

For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C

Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)

Ck le 983171 hArr k ge C983171 complexity O(1983171)

Ck2 le 983171 hArr k ge983155

C983171 complexity O(1radic983171)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34

For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is

O(log log(1983171))

To show this one can consider the sequence

ek = Cφk

which satisfiesek+1 le e2k

Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34


Page 16: Lecture 6: Preliminaries of Optimizationmath.xmu.edu.cn/group/nona/damc/Lecture06.pdf · Preliminaries of Optimization Lecture 6 March 27-April 3, 2020 17 / 34. A point x is called

Theorem 10 (Second-order necessary conditions)

If x983183 is a local minimizer of f and nabla2f is continuous in an openneighborhood of x983183 then nablaf(x983183) = 0 and nabla2f(x983183) is positivesemidefinite

Proof We know from Theorem 9 that nablaf(x983183) = 0 Assume thatnabla2f(x983183) is not positive semidefinite Then we can choose a vector psuch that pTnabla2f(x983183)p lt 0 and because nabla2f is continuous near x983183there is a scalar T gt 0 such that

pTnabla2f(x983183 + tp)p lt 0 for all t isin [0 T ]

By doing a Taylor series expansion around x983183 we have for all s isin (0 T ]and some ξ isin (0 1) that

f(x983183 + sp) = f(x983183) + spTnablaf(x983183) +1

2s2pTnabla2f(x983183 + ξsp)p lt f(x983183)

As in Theorem 9 we have found a direction from x983183 along which f isdecreasing and so again x983183 is not a local minimizerPreliminaries of Optimization Lecture 6 March 27-April 3 2020 16 34

Theorem 11 (Second-order sufficient conditions)

Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f

Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and

f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1

2pTnabla2f(x983183 + ξp)p

= f(x983183) +1

2pTnabla2f(x983183 + ξp)p

for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34

A point x is called a stationary point if

nablaf(x) = 0

A stationary point x is called a saddle point if there exist u and vsuch that

f(x+ αu) lt f(x) and f(x+ αv) gt f(x)

for all sufficiently small α gt 0

Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points

If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point

If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34

Theorem 12

(i) forall convex f any local minimizer x983183 is a global minimizer of f

(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f

Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is

x = λz+ (1minus λ)x983183 for some λ isin (0 1]

By the convexity property for f we have

f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)

Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34

(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have

nablaf(x983183)T(zminus x983183) =

d

dλf(x983183 + λ(zminus x983183))|λ=0

= limλrarr0+

f(x983183 + λ(zminus x983183))minus f(x983183)

λ

le limλrarr0+

λf(z) + (1minus λ)f(x983183)minus f(x983183)

λ

= f(z)minus f(x983183) lt 0

Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point

Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms

Numerical algorithms try to seek a point where nablaf vanishes

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34

7 Karush-Kuhn-Tucker conditions

Theorem 13 (KKT conditions)

Consider the minimization problem

minxisinRn

f(x) st gi(x) le 0 i = 1 m

where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions

Let x983183 be an optimal solution and assume Slaterrsquos condition

exist x isin Rn st gi(x) lt 0 i = 1 m

hold Then there exist λ1 middot middot middot λm ge 0 satisfying

0 isin partf(x983183) +

m983131

i=1

λipartgi(x983183) λigi(x983183) = 0 i = 1 m

If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34

Active gi(x) le 0

Inactive gi(x) le 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34

8 Moreau envelope and proximal operator

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is

Mλh(x) = infu

983069h(u) +

1

2λ983042uminus x9830422

983070

=1

λinfu

983069λh(u) +

1

2983042uminus x9830422

983070

For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is

proxf (x) = argminu

983069f(u) +

1

2983042uminus x9830422

983070

From the optimality condition (see Theorem 4) we have

0 isin partf(proxf (x)) + (proxf (x)minus x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34

For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if

x983183 = proxf (x983183)

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope

Mλh(x) = h(proxλh(x)) +1

2λ983042proxλh(x)minus x9830422

can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn

minusinfin lt Mλh(x) lt +infin

even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient

nablaMλh(x) =1

λ(xminus proxλh(x))

Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34

Example h(x) = |x| and Hλ(x) = Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34

Lemma 14 (Nonexpansivity of proximal operator)

For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn

983042proxf (x)minus proxf (y)983042 le 983042xminus y983042

Proof From the optimality conditions at two points x and y we have

xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))

By applying monotonicity (see Lemma 3) we have

((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0

Rearranging this and applying the Cauchy-Schwartz inequality yields

983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))

le 983042xminus y983042983042proxf (x)minus proxf (y)983042

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34

Examples of several proximal operators

(1) f(x) = 0proxf (x) = x

(2) f(x) = λ983042x9830421 with λ gt 0

983045proxλ983042middot9830421(x)

983046i= argmin

uisinR

983069λ|u|+ 1

2(uminus xi)

2

983070

=

983099983105983103

983105983101

xi minus λ if xi gt λ

0 if xi isin [minusλλ]

xi + λ if xi lt minusλ

which is known as soft-thresholding

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34

(3) f(x) = λ983042x9830420 the number of nonzero components non-convex

983045proxλ983042middot9830420(x)

983046i=

983099983105983103

983105983101

xi if |xi| gtradic2λ

0 xi if |xi| =radic2λ

0 if |xi| ltradic2λ

which is known as hard thresholding

(4) f(x) = IΩ(x)

proxIΩ(x) = argminu

983069IΩ(u) +

1

2983042uminus x9830422

983070= argmin

uisinΩ983042uminus x983042

which is simply the projection of x onto the set Ω

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34

9 Convergence rates

An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error

For smooth f 983042nablaf(xk)983042

For nonsmooth convex f

dist(0 partf(xk))

(the sequence of distances from 0 to the subdifferential partf(xk))

Other error measures include

983042xk minus x983183983042 and f(xk)minus f983183

where x983183 is a solution and f983183 is the optimal value of f

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34

We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim

krarrinfinφk = 0

We say that Q-linear convergence holds if there is some ρ isin (0 1)such that

φk+1

φkle ρ

for all k sufficiently large

We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that

φk le Cρk k = 1 2

Q-linear rArr R-linear But R-linear ⇏ Q-linear Example

φk =

9830832minusk k even

0 k odd

satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34

Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence

φk le Cradick φk le Ck φk le Ck2 k = 1 2

where in each case C is some positive constant

Superlinear convergence occurs when the constant ρ isin (0 1) in

φk+1φk le ρ

can be chosen arbitrarily close to 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34

The sequence φk converges Q-superlinearly to 0 if

limkrarrinfin

φk+1

φk= 0

Q-Quadratic ifφk+1

φ2k

le C k = 1 2

for some sufficiently large C

R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is

0 le φk le νk k = 1 2

R-quadratic convergence is defined similarly

Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34

91 Complexity

For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is

Cρk le 983171

By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that

Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)

The complexity is O( log(1983171)1minusρ )

For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C

Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)

Ck le 983171 hArr k ge C983171 complexity O(1983171)

Ck2 le 983171 hArr k ge983155

C983171 complexity O(1radic983171)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34

For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is

O(log log(1983171))

To show this one can consider the sequence

ek = Cφk

which satisfiesek+1 le e2k

Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34

Page 17: Lecture 6: Preliminaries of Optimizationmath.xmu.edu.cn/group/nona/damc/Lecture06.pdf · Preliminaries of Optimization Lecture 6 March 27-April 3, 2020 17 / 34. A point x is called

Theorem 11 (Second-order sufficient conditions)

Suppose that nabla2f is continuous in an open neighborhood of x983183 and thatnablaf(x983183) = 0 and nabla2f(x983183) is positive definite Then x983183 is a strict localminimizer of f

Proof Because the Hessian nabla2f is continuous and positive definite atx983183 we can choose a radius r gt 0 so that nabla2f(x) remains positivedefinite for all x in the open ball B = z | 983042zminus x983183983042 lt r Taking anynonzero vector p with 983042p983042 lt r we have x983183 + p isin B and

f(x983183 + p) = f(x983183) + pTnablaf(x983183) +1

2pTnabla2f(x983183 + ξp)p

= f(x983183) +1

2pTnabla2f(x983183 + ξp)p

for some ξ isin (0 1) Since x983183 + ξp isin B we have pTnabla2f(x983183 + ξp)p gt 0and therefore f(x983183 + p) gt f(x983183) giving the result

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 17 34

A point x is called a stationary point if

nablaf(x) = 0

A stationary point x is called a saddle point if there exist u and vsuch that

f(x+ αu) lt f(x) and f(x+ αv) gt f(x)

for all sufficiently small α gt 0

Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points

If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point

If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34

Theorem 12

(i) forall convex f any local minimizer x983183 is a global minimizer of f

(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f

Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is

x = λz+ (1minus λ)x983183 for some λ isin (0 1]

By the convexity property for f we have

f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)

Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34

(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have

nablaf(x983183)T(zminus x983183) =

d

dλf(x983183 + λ(zminus x983183))|λ=0

= limλrarr0+

f(x983183 + λ(zminus x983183))minus f(x983183)

λ

le limλrarr0+

λf(z) + (1minus λ)f(x983183)minus f(x983183)

λ

= f(z)minus f(x983183) lt 0

Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point

Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms

Numerical algorithms try to seek a point where nablaf vanishes

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34

7 Karush-Kuhn-Tucker conditions

Theorem 13 (KKT conditions)

Consider the minimization problem

minxisinRn

f(x) st gi(x) le 0 i = 1 m

where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions

Let x983183 be an optimal solution and assume Slaterrsquos condition

exist x isin Rn st gi(x) lt 0 i = 1 m

hold Then there exist λ1 middot middot middot λm ge 0 satisfying

0 isin partf(x983183) +

m983131

i=1

λipartgi(x983183) λigi(x983183) = 0 i = 1 m

If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34

Active gi(x) le 0

Inactive gi(x) le 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34

8 Moreau envelope and proximal operator

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is

Mλh(x) = infu

983069h(u) +

1

2λ983042uminus x9830422

983070

=1

λinfu

983069λh(u) +

1

2983042uminus x9830422

983070

For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is

proxf (x) = argminu

983069f(u) +

1

2983042uminus x9830422

983070

From the optimality condition (see Theorem 4) we have

0 isin partf(proxf (x)) + (proxf (x)minus x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34

For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if

x983183 = proxf (x983183)

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope

Mλh(x) = h(proxλh(x)) +1

2λ983042proxλh(x)minus x9830422

can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn

minusinfin lt Mλh(x) lt +infin

even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient

nablaMλh(x) =1

λ(xminus proxλh(x))

Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34

Example h(x) = |x| and Hλ(x) = Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34

Lemma 14 (Nonexpansivity of proximal operator)

For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn

983042proxf (x)minus proxf (y)983042 le 983042xminus y983042

Proof From the optimality conditions at two points x and y we have

xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))

By applying monotonicity (see Lemma 3) we have

((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0

Rearranging this and applying the Cauchy-Schwartz inequality yields

983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))

le 983042xminus y983042983042proxf (x)minus proxf (y)983042

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34

Examples of several proximal operators

(1) f(x) = 0proxf (x) = x

(2) f(x) = λ983042x9830421 with λ gt 0

983045proxλ983042middot9830421(x)

983046i= argmin

uisinR

983069λ|u|+ 1

2(uminus xi)

2

983070

=

983099983105983103

983105983101

xi minus λ if xi gt λ

0 if xi isin [minusλλ]

xi + λ if xi lt minusλ

which is known as soft-thresholding

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34

(3) f(x) = λ983042x9830420 the number of nonzero components non-convex

983045proxλ983042middot9830420(x)

983046i=

983099983105983103

983105983101

xi if |xi| gtradic2λ

0 xi if |xi| =radic2λ

0 if |xi| ltradic2λ

which is known as hard thresholding

(4) f(x) = IΩ(x)

proxIΩ(x) = argminu

983069IΩ(u) +

1

2983042uminus x9830422

983070= argmin

uisinΩ983042uminus x983042

which is simply the projection of x onto the set Ω

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34

9 Convergence rates

An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error

For smooth f 983042nablaf(xk)983042

For nonsmooth convex f

dist(0 partf(xk))

(the sequence of distances from 0 to the subdifferential partf(xk))

Other error measures include

983042xk minus x983183983042 and f(xk)minus f983183

where x983183 is a solution and f983183 is the optimal value of f

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34

We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim

krarrinfinφk = 0

We say that Q-linear convergence holds if there is some ρ isin (0 1)such that

φk+1

φkle ρ

for all k sufficiently large

We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that

φk le Cρk k = 1 2

Q-linear rArr R-linear But R-linear ⇏ Q-linear Example

φk =

9830832minusk k even

0 k odd

satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34

Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence

φk le Cradick φk le Ck φk le Ck2 k = 1 2

where in each case C is some positive constant

Superlinear convergence occurs when the constant ρ isin (0 1) in

φk+1φk le ρ

can be chosen arbitrarily close to 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34

The sequence φk converges Q-superlinearly to 0 if

limkrarrinfin

φk+1

φk= 0

Q-Quadratic ifφk+1

φ2k

le C k = 1 2

for some sufficiently large C

R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is

0 le φk le νk k = 1 2

R-quadratic convergence is defined similarly

Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34

91 Complexity

For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is

Cρk le 983171

By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that

Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)

The complexity is O( log(1983171)1minusρ )

For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C

Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)

Ck le 983171 hArr k ge C983171 complexity O(1983171)

Ck2 le 983171 hArr k ge983155

C983171 complexity O(1radic983171)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34

For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is

O(log log(1983171))

To show this one can consider the sequence

ek = Cφk

which satisfiesek+1 le e2k

Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34

Page 18: Lecture 6: Preliminaries of Optimizationmath.xmu.edu.cn/group/nona/damc/Lecture06.pdf · Preliminaries of Optimization Lecture 6 March 27-April 3, 2020 17 / 34. A point x is called

A point x is called a stationary point if

nablaf(x) = 0

A stationary point x is called a saddle point if there exist u and vsuch that

f(x+ αu) lt f(x) and f(x+ αv) gt f(x)

for all sufficiently small α gt 0

Stationary points are not necessarily local minimizers Stationarypoints can be local maximizers or saddle points

If nablaf(x) = 0 and nabla2f(x) has both strictly positive and strictlynegative eigenvalues then x is a saddle point

If nabla2f(x) is positive semidefinite or negative semidefinite thennabla2f(x) alone is insufficient to classify x

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 18 34

Theorem 12

(i) forall convex f any local minimizer x983183 is a global minimizer of f

(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f

Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is

x = λz+ (1minus λ)x983183 for some λ isin (0 1]

By the convexity property for f we have

f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)

Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34

(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have

nablaf(x983183)T(zminus x983183) =

d

dλf(x983183 + λ(zminus x983183))|λ=0

= limλrarr0+

f(x983183 + λ(zminus x983183))minus f(x983183)

λ

le limλrarr0+

λf(z) + (1minus λ)f(x983183)minus f(x983183)

λ

= f(z)minus f(x983183) lt 0

Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point

Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms

Numerical algorithms try to seek a point where nablaf vanishes

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34

7 Karush-Kuhn-Tucker conditions

Theorem 13 (KKT conditions)

Consider the minimization problem

minxisinRn

f(x) st gi(x) le 0 i = 1 m

where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions

Let x983183 be an optimal solution and assume Slaterrsquos condition

exist x isin Rn st gi(x) lt 0 i = 1 m

hold Then there exist λ1 middot middot middot λm ge 0 satisfying

0 isin partf(x983183) +

m983131

i=1

λipartgi(x983183) λigi(x983183) = 0 i = 1 m

If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34

Active gi(x) le 0

Inactive gi(x) le 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34

8 Moreau envelope and proximal operator

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is

Mλh(x) = infu

983069h(u) +

1

2λ983042uminus x9830422

983070

=1

λinfu

983069λh(u) +

1

2983042uminus x9830422

983070

For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is

proxf (x) = argminu

983069f(u) +

1

2983042uminus x9830422

983070

From the optimality condition (see Theorem 4) we have

0 isin partf(proxf (x)) + (proxf (x)minus x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34

For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if

x983183 = proxf (x983183)

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope

Mλh(x) = h(proxλh(x)) +1

2λ983042proxλh(x)minus x9830422

can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn

minusinfin lt Mλh(x) lt +infin

even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient

nablaMλh(x) =1

λ(xminus proxλh(x))

Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34

Example h(x) = |x| and Hλ(x) = Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34

Lemma 14 (Nonexpansivity of proximal operator)

For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn

983042proxf (x)minus proxf (y)983042 le 983042xminus y983042

Proof From the optimality conditions at two points x and y we have

xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))

By applying monotonicity (see Lemma 3) we have

((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0

Rearranging this and applying the Cauchy-Schwartz inequality yields

983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))

le 983042xminus y983042983042proxf (x)minus proxf (y)983042

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34

Examples of several proximal operators

(1) f(x) = 0proxf (x) = x

(2) f(x) = λ983042x9830421 with λ gt 0

983045proxλ983042middot9830421(x)

983046i= argmin

uisinR

983069λ|u|+ 1

2(uminus xi)

2

983070

=

983099983105983103

983105983101

xi minus λ if xi gt λ

0 if xi isin [minusλλ]

xi + λ if xi lt minusλ

which is known as soft-thresholding

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34

(3) f(x) = λ983042x9830420 the number of nonzero components non-convex

983045proxλ983042middot9830420(x)

983046i=

983099983105983103

983105983101

xi if |xi| gtradic2λ

0 xi if |xi| =radic2λ

0 if |xi| ltradic2λ

which is known as hard thresholding

(4) f(x) = IΩ(x)

proxIΩ(x) = argminu

983069IΩ(u) +

1

2983042uminus x9830422

983070= argmin

uisinΩ983042uminus x983042

which is simply the projection of x onto the set Ω

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34

9 Convergence rates

An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error

For smooth f 983042nablaf(xk)983042

For nonsmooth convex f

dist(0 partf(xk))

(the sequence of distances from 0 to the subdifferential partf(xk))

Other error measures include

983042xk minus x983183983042 and f(xk)minus f983183

where x983183 is a solution and f983183 is the optimal value of f

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34

We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim

krarrinfinφk = 0

We say that Q-linear convergence holds if there is some ρ isin (0 1)such that

φk+1

φkle ρ

for all k sufficiently large

We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that

φk le Cρk k = 1 2

Q-linear rArr R-linear But R-linear ⇏ Q-linear Example

φk =

9830832minusk k even

0 k odd

satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34

Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence

φk le Cradick φk le Ck φk le Ck2 k = 1 2

where in each case C is some positive constant

Superlinear convergence occurs when the constant ρ isin (0 1) in

φk+1φk le ρ

can be chosen arbitrarily close to 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34

The sequence φk converges Q-superlinearly to 0 if

limkrarrinfin

φk+1

φk= 0

Q-Quadratic ifφk+1

φ2k

le C k = 1 2

for some sufficiently large C

R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is

0 le φk le νk k = 1 2

R-quadratic convergence is defined similarly

Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34

91 Complexity

For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is

Cρk le 983171

By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that

Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)

The complexity is O( log(1983171)1minusρ )

For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C

Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)

Ck le 983171 hArr k ge C983171 complexity O(1983171)

Ck2 le 983171 hArr k ge983155

C983171 complexity O(1radic983171)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34

For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is

O(log log(1983171))

To show this one can consider the sequence

ek = Cφk

which satisfiesek+1 le e2k

Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34

Page 19: Lecture 6: Preliminaries of Optimizationmath.xmu.edu.cn/group/nona/damc/Lecture06.pdf · Preliminaries of Optimization Lecture 6 March 27-April 3, 2020 17 / 34. A point x is called

Theorem 12

(i) forall convex f any local minimizer x983183 is a global minimizer of f

(ii) If f is convex and differentiable then any stationary point x983183 is aglobal minimizer of f

Proof (i) Suppose that x983183 is a local but not a global minimizer Thenwe can find a point z isin Rn with f(z) lt f(x983183) Consider the linesegment that joins x983183 to z that is

x = λz+ (1minus λ)x983183 for some λ isin (0 1]

By the convexity property for f we have

f(x) le λf(z) + (1minus λ)f(x983183) lt f(x983183)

Any neighborhood N of x983183 contains a piece of the line segment sothere will always be points x isin N at which the last inequality issatisfied Hence x983183 is not a local minimizer

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 19 34

(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have

nablaf(x983183)T(zminus x983183) =

d

dλf(x983183 + λ(zminus x983183))|λ=0

= limλrarr0+

f(x983183 + λ(zminus x983183))minus f(x983183)

λ

le limλrarr0+

λf(z) + (1minus λ)f(x983183)minus f(x983183)

λ

= f(z)minus f(x983183) lt 0

Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point

Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms

Numerical algorithms try to seek a point where nablaf vanishes

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34

7 Karush-Kuhn-Tucker conditions

Theorem 13 (KKT conditions)

Consider the minimization problem

minxisinRn

f(x) st gi(x) le 0 i = 1 m

where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions

Let x983183 be an optimal solution and assume Slaterrsquos condition

exist x isin Rn st gi(x) lt 0 i = 1 m

hold Then there exist λ1 middot middot middot λm ge 0 satisfying

0 isin partf(x983183) +

m983131

i=1

λipartgi(x983183) λigi(x983183) = 0 i = 1 m

If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34

Active gi(x) le 0

Inactive gi(x) le 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34

8 Moreau envelope and proximal operator

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is

Mλh(x) = infu

983069h(u) +

1

2λ983042uminus x9830422

983070

=1

λinfu

983069λh(u) +

1

2983042uminus x9830422

983070

For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is

proxf (x) = argminu

983069f(u) +

1

2983042uminus x9830422

983070

From the optimality condition (see Theorem 4) we have

0 isin partf(proxf (x)) + (proxf (x)minus x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34

For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if

x983183 = proxf (x983183)

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope

Mλh(x) = h(proxλh(x)) +1

2λ983042proxλh(x)minus x9830422

can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn

minusinfin lt Mλh(x) lt +infin

even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient

nablaMλh(x) =1

λ(xminus proxλh(x))

Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34

Example h(x) = |x| and Hλ(x) = Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34

Lemma 14 (Nonexpansivity of proximal operator)

For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn

983042proxf (x)minus proxf (y)983042 le 983042xminus y983042

Proof From the optimality conditions at two points x and y we have

xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))

By applying monotonicity (see Lemma 3) we have

((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0

Rearranging this and applying the Cauchy-Schwartz inequality yields

983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))

le 983042xminus y983042983042proxf (x)minus proxf (y)983042

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34

Examples of several proximal operators

(1) f(x) = 0proxf (x) = x

(2) f(x) = λ983042x9830421 with λ gt 0

983045proxλ983042middot9830421(x)

983046i= argmin

uisinR

983069λ|u|+ 1

2(uminus xi)

2

983070

=

983099983105983103

983105983101

xi minus λ if xi gt λ

0 if xi isin [minusλλ]

xi + λ if xi lt minusλ

which is known as soft-thresholding

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34

(3) f(x) = λ983042x9830420 the number of nonzero components non-convex

983045proxλ983042middot9830420(x)

983046i=

983099983105983103

983105983101

xi if |xi| gtradic2λ

0 xi if |xi| =radic2λ

0 if |xi| ltradic2λ

which is known as hard thresholding

(4) f(x) = IΩ(x)

proxIΩ(x) = argminu

983069IΩ(u) +

1

2983042uminus x9830422

983070= argmin

uisinΩ983042uminus x983042

which is simply the projection of x onto the set Ω

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34

9 Convergence rates

An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error

For smooth f 983042nablaf(xk)983042

For nonsmooth convex f

dist(0 partf(xk))

(the sequence of distances from 0 to the subdifferential partf(xk))

Other error measures include

983042xk minus x983183983042 and f(xk)minus f983183

where x983183 is a solution and f983183 is the optimal value of f

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34

We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim

krarrinfinφk = 0

We say that Q-linear convergence holds if there is some ρ isin (0 1)such that

φk+1

φkle ρ

for all k sufficiently large

We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that

φk le Cρk k = 1 2

Q-linear rArr R-linear But R-linear ⇏ Q-linear Example

φk =

9830832minusk k even

0 k odd

satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34

Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence

φk le Cradick φk le Ck φk le Ck2 k = 1 2

where in each case C is some positive constant

Superlinear convergence occurs when the constant ρ isin (0 1) in

φk+1φk le ρ

can be chosen arbitrarily close to 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34

The sequence φk converges Q-superlinearly to 0 if

limkrarrinfin

φk+1

φk= 0

Q-Quadratic ifφk+1

φ2k

le C k = 1 2

for some sufficiently large C

R-superlinear if there is a Q-superlinearly convergent sequenceνk that dominates φk that is

0 le φk le νk k = 1 2

R-quadratic convergence is defined similarly

Quadratic and superlinear rates are associated with higher-ordermethods such as Newton and quasi-Newton methods

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34

91 Complexity

For a sequence satisfying the R-linear convergence condition asufficient condition for φk le 983171 lt C is

Cρk le 983171

By using the estimate log ρ le ρminus 1 for all ρ isin (0 1) we have that

Cρk le 983171 hArr k log ρ le log(983171C) lArr k ge log(C983171)(1minus ρ)

The complexity is O( log(1983171)1minusρ )

For an algorithm that satisfies the sublinear rates sufficientconditions for φk le 983171 lt C

Cradick le 983171 hArr k ge (C983171)2 complexity O(19831712)

Ck le 983171 hArr k ge C983171 complexity O(1983171)

Ck2 le 983171 hArr k ge983155

C983171 complexity O(1radic983171)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34

For Q-quadratically convergent methods the complexity is doublylogarithmic in 983171 that is

O(log log(1983171))

To show this one can consider the sequence

ek = Cφk

which satisfiesek+1 le e2k

Once the algorithm enters a neighborhood of quadraticconvergence just a few additional iterations are required forconvergence to a solution of high accuracy

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34

Page 20: Lecture 6: Preliminaries of Optimizationmath.xmu.edu.cn/group/nona/damc/Lecture06.pdf · Preliminaries of Optimization Lecture 6 March 27-April 3, 2020 17 / 34. A point x is called

(ii) Suppose that x983183 is not a global minimizer and choose z asabove Then from convexity we have

nablaf(x983183)T(zminus x983183) =

d

dλf(x983183 + λ(zminus x983183))|λ=0

= limλrarr0+

f(x983183 + λ(zminus x983183))minus f(x983183)

λ

le limλrarr0+

λf(z) + (1minus λ)f(x983183)minus f(x983183)

λ

= f(z)minus f(x983183) lt 0

Therefore nablaf(x983183) ∕= 0 and so x983183 is not a stationary point

Remark Theorems 9-12 provide the foundations for unconstrainedoptimization algorithms

Numerical algorithms try to seek a point where nablaf vanishes

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 20 34

7 Karush-Kuhn-Tucker conditions

Theorem 13 (KKT conditions)

Consider the minimization problem

minxisinRn

f(x) st gi(x) le 0 i = 1 m

where f Rn 983041rarr R and all gi Rn 983041rarr R are convex functions

Let x983183 be an optimal solution and assume Slaterrsquos condition

exist x isin Rn st gi(x) lt 0 i = 1 m

hold Then there exist λ1 middot middot middot λm ge 0 satisfying

0 isin partf(x983183) +

m983131

i=1

λipartgi(x983183) λigi(x983183) = 0 i = 1 m

If x983183 satisfies the above conditions called KKT conditions then itis an optimal solution of the optimization problem

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 21 34

Active gi(x) le 0

Inactive gi(x) le 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 22 34

8 Moreau envelope and proximal operator

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope of (λ h) is

Mλh(x) = infu

983069h(u) +

1

2λ983042uminus x9830422

983070

=1

λinfu

983069λh(u) +

1

2983042uminus x9830422

983070

For a closed proper convex function f Rn 983041rarr R cup +infin theproximal operator of f is

proxf (x) = argminu

983069f(u) +

1

2983042uminus x9830422

983070

From the optimality condition (see Theorem 4) we have

0 isin partf(proxf (x)) + (proxf (x)minus x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 23 34

For a closed proper convex function f Rn 983041rarr Rcup +infin the pointx983183 is a minimizer of f if and only if

x983183 = proxf (x983183)

For a closed proper convex function h Rn 983041rarr R cup +infin and apositive scalar λ the Moreau envelope

Mλh(x) = h(proxλh(x)) +1

2λ983042proxλh(x)minus x9830422

can be viewed as a kind of smoothing or regularization of thefunction h We have for all x isin Rn

minusinfin lt Mλh(x) lt +infin

even when h(x) = +infin for some x isin Rn Moreover Mλh(x) isconvex and differentiable everywhere with gradient

nablaMλh(x) =1

λ(xminus proxλh(x))

Therefore x983183 is a minimizer of h hArr x983183 is a minimizer of Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 24 34

Example h(x) = |x| and Hλ(x) = Mλh(x)

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 25 34

Lemma 14 (Nonexpansivity of proximal operator)

For a closed proper convex function f Rn 983041rarr R cup +infin we have forall xy isin Rn

983042proxf (x)minus proxf (y)983042 le 983042xminus y983042

Proof From the optimality conditions at two points x and y we have

xminus proxf (x) isin partf(proxf (x)) and y minus proxf (y) isin partf(proxf (y))

By applying monotonicity (see Lemma 3) we have

((xminus proxf (x))minus (y minus proxf (y)))T(proxf (x)minus proxf (y)) ge 0

Rearranging this and applying the Cauchy-Schwartz inequality yields

983042proxf (x)minus proxf (y)9830422 le (xminus y)T(proxf (x)minus proxf (y))

le 983042xminus y983042983042proxf (x)minus proxf (y)983042

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 26 34

Examples of several proximal operators

(1) f(x) = 0proxf (x) = x

(2) f(x) = λ983042x9830421 with λ gt 0

983045proxλ983042middot9830421(x)

983046i= argmin

uisinR

983069λ|u|+ 1

2(uminus xi)

2

983070

=

983099983105983103

983105983101

xi minus λ if xi gt λ

0 if xi isin [minusλλ]

xi + λ if xi lt minusλ

which is known as soft-thresholding

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 27 34

(3) f(x) = λ983042x9830420 the number of nonzero components non-convex

983045proxλ983042middot9830420(x)

983046i=

983099983105983103

983105983101

xi if |xi| gtradic2λ

0 xi if |xi| =radic2λ

0 if |xi| ltradic2λ

which is known as hard thresholding

(4) f(x) = IΩ(x)

proxIΩ(x) = argminu

983069IΩ(u) +

1

2983042uminus x9830422

983070= argmin

uisinΩ983042uminus x983042

which is simply the projection of x onto the set Ω

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 28 34

9 Convergence rates

An important measure for evaluating algorithms is the rate ofconvergence to zero of some measure of error

For smooth f 983042nablaf(xk)983042

For nonsmooth convex f

dist(0 partf(xk))

(the sequence of distances from 0 to the subdifferential partf(xk))

Other error measures include

983042xk minus x983183983042 and f(xk)minus f983183

where x983183 is a solution and f983183 is the optimal value of f

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 29 34

We denote by φk the sequence of nonnegative scalars whose rateof convergence to 0 we wish to find φk ge 0 and lim

krarrinfinφk = 0

We say that Q-linear convergence holds if there is some ρ isin (0 1)such that

φk+1

φkle ρ

for all k sufficiently large

We say that R-linear convergence holds if there exist ρ isin (0 1) andC gt 0 such that

φk le Cρk k = 1 2

Q-linear rArr R-linear But R-linear ⇏ Q-linear Example

φk =

9830832minusk k even

0 k odd

satisfies R-linear wiht C = 1 and ρ = 12 but is not Q-linear

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 30 34

Sublinear convergence is as its name suggests slower than linearSeveral varieties of sublinear convergence

φk le Cradick φk le Ck φk le Ck2 k = 1 2

where in each case C is some positive constant

Superlinear convergence occurs when the constant ρ isin (0 1) in

φk+1φk le ρ

can be chosen arbitrarily close to 0

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 31 34

The sequence $\{\varphi_k\}$ converges Q-superlinearly to $0$ if

$$\lim_{k \to \infty} \frac{\varphi_{k+1}}{\varphi_k} = 0.$$

Q-quadratically if

$$\frac{\varphi_{k+1}}{\varphi_k^2} \le C, \quad k = 1, 2, \dots,$$

for some sufficiently large $C$.

R-superlinear if there is a Q-superlinearly convergent sequence $\{\nu_k\}$ that dominates $\{\varphi_k\}$, that is,

$$0 \le \varphi_k \le \nu_k, \quad k = 1, 2, \dots$$

R-quadratic convergence is defined similarly.

Quadratic and superlinear rates are associated with higher-order methods such as Newton and quasi-Newton methods.
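As a minimal illustration of a Q-quadratic rate (not from the slides), consider Newton's method on the one-dimensional convex function $f(x) = e^x - 2x$, whose minimizer is $x_\star = \ln 2$; the printed errors $|x_k - x_\star|$ roughly square at each step.

```python
import math

# Newton's method for minimizing f(x) = exp(x) - 2x; minimizer x* = ln 2.
x_star = math.log(2.0)
x = 0.0
for k in range(6):
    print(k, abs(x - x_star))                 # error roughly squares at each step
    g, H = math.exp(x) - 2.0, math.exp(x)     # gradient and Hessian of f
    x = x - g / H
```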

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 32 34

9.1 Complexity

For a sequence satisfying the R-linear convergence condition, a sufficient condition for $\varphi_k \le \epsilon < C$ is

$$C\rho^k \le \epsilon.$$

By using the estimate $\log\rho \le \rho - 1$ for all $\rho \in (0,1)$, we have that

$$C\rho^k \le \epsilon \;\Leftrightarrow\; k \log\rho \le \log(\epsilon/C) \;\Leftarrow\; k \ge \frac{\log(C/\epsilon)}{1 - \rho}.$$

The complexity is $O\!\left(\frac{\log(1/\epsilon)}{1 - \rho}\right)$.

For an algorithm that satisfies the sublinear rates, sufficient conditions for $\varphi_k \le \epsilon < C$:

$$\frac{C}{\sqrt{k}} \le \epsilon \;\Leftrightarrow\; k \ge (C/\epsilon)^2, \quad \text{complexity } O(1/\epsilon^2);$$

$$\frac{C}{k} \le \epsilon \;\Leftrightarrow\; k \ge C/\epsilon, \quad \text{complexity } O(1/\epsilon);$$

$$\frac{C}{k^2} \le \epsilon \;\Leftrightarrow\; k \ge \sqrt{C/\epsilon}, \quad \text{complexity } O(1/\sqrt{\epsilon}).$$
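Plugging in concrete (arbitrary, purely illustrative) numbers $C = 1$, $\rho = 0.9$, $\epsilon = 10^{-6}$ shows how different these iteration counts are in practice:

```python
import math

C, rho, eps = 1.0, 0.9, 1e-6

print("linear    :", math.ceil(math.log(C / eps) / (1 - rho)))   # log(1/eps)/(1-rho) ~ 139
print("C/sqrt(k) :", math.ceil((C / eps) ** 2))                  # 1/eps^2      = 1e12
print("C/k       :", math.ceil(C / eps))                         # 1/eps        = 1e6
print("C/k^2     :", math.ceil(math.sqrt(C / eps)))              # 1/sqrt(eps)  = 1000
```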

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 33 34

For Q-quadratically convergent methods the complexity is doubly logarithmic in $\epsilon$, that is,

$$O(\log\log(1/\epsilon)).$$

To show this, one can consider the sequence

$$e_k = C\varphi_k,$$

which satisfies $e_{k+1} \le e_k^2$.
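Filling in the step behind this claim (assuming the method has reached the region where $e_0 < 1$, and that $\epsilon$ is small enough that $C\epsilon < 1$): iterating the bound gives

$$e_k \le e_{k-1}^2 \le e_{k-2}^4 \le \cdots \le e_0^{2^k},$$

so $\varphi_k = e_k/C \le \epsilon$ is guaranteed as soon as $e_0^{2^k} \le C\epsilon$, that is,

$$2^k \ge \frac{\log\bigl(1/(C\epsilon)\bigr)}{\log(1/e_0)} \quad\Longleftrightarrow\quad k \ge \log_2\frac{\log\bigl(1/(C\epsilon)\bigr)}{\log(1/e_0)} = O\bigl(\log\log(1/\epsilon)\bigr).$$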

Once the algorithm enters a neighborhood of quadratic convergence, just a few additional iterations are required for convergence to a solution of high accuracy.

Preliminaries of Optimization Lecture 6 March 27-April 3 2020 34 34
