Lecture 7: Gradient Methods
April 8 - 10, 2020
1 Line search methods
Given a search direction $d^k$ and a step length $\alpha_k$,
$$x^{k+1} = x^k + \alpha_k d^k.$$
Search directions:
Negative gradient: $d^k = -\nabla f(x^k)$
Newton: $d^k = -\nabla^2 f(x^k)^{-1}\nabla f(x^k)$
Quasi-Newton: $d^k = -B_k^{-1}\nabla f(x^k)$
Conjugate gradient ($\beta_k$ ensures $d^{k-1}$ and $d^k$ are conjugate): $d^k = -\nabla f(x^k) + \beta_k d^{k-1}$
Step length: see Numerical Optimization, Section 3.1 (exact / inexact).
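As an illustration, here is a minimal Python sketch of this generic line-search loop; the direction and step-length rules are passed in as functions, and the function name, argument names, and the fixed iteration budget are illustrative assumptions rather than part of the lecture.

```python
import numpy as np

def line_search_method(f_grad, x0, direction, step_length, num_iters=100):
    """Generic line-search loop: x^{k+1} = x^k + alpha_k d^k.

    f_grad(x)            -> gradient of f at x
    direction(g, k)      -> search direction d^k from the gradient g
    step_length(x, d, k) -> step length alpha_k
    """
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        g = f_grad(x)
        d = direction(g, k)              # e.g. -g for gradient descent
        alpha = step_length(x, d, k)
        x = x + alpha * d
    return x
```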
2 Gradient Descent (GD) method
Set $d^k = -\nabla f(x^k)$ and $\alpha_k = 1/L$:
$$x^{k+1} = x^k - \frac{1}{L}\nabla f(x^k), \qquad k = 0, 1, 2, \ldots$$
If $f$ is Lipschitz continuously differentiable with constant $L$, then (see Lecture 6, Page 11)
$$f(x + \alpha d) \le f(x) + \alpha \nabla f(x)^T d + \frac{\alpha^2 L}{2}\|d\|^2.$$
This yields
$$f(x^{k+1}) = f\!\left(x^k - \frac{1}{L}\nabla f(x^k)\right) \le f(x^k) - \frac{1}{2L}\|\nabla f(x^k)\|^2,$$
which implies $\{f(x^k)\}$ is a nonincreasing sequence.
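A minimal sketch of this iteration in Python. The quadratic test problem is an illustrative assumption; for it, the gradient's Lipschitz constant $L$ is the largest eigenvalue of $A$.

```python
import numpy as np

def gradient_descent_fixed_step(grad, L, x0, num_iters=200):
    """Gradient descent with the constant step length alpha_k = 1/L."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - grad(x) / L
    return x

# Example: f(x) = 0.5 x^T A x - b^T x, so grad f(x) = A x - b and L = lambda_max(A).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(A).max()
x = gradient_descent_fixed_step(lambda x: A @ x - b, L, np.zeros(2))
print(np.allclose(A @ x, b, atol=1e-6))  # x approaches the minimizer A^{-1} b
```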
Theorem 1 (Sublinear convergence of GD, $O(1/\sqrt{k})$)
Suppose $f$ is Lipschitz continuously differentiable with constant $L$ and there exists a constant $\bar{f}$ satisfying $f(x) \ge \bar{f}$ for all $x$. Gradient descent with $\alpha_k \equiv 1/L$ generates a sequence $\{x^k\}_{k=0}^\infty$ that satisfies
$$\min_{0 \le k \le K-1} \|\nabla f(x^k)\| \le \sqrt{\frac{2L\,(f(x^0) - f(x^K))}{K}} \le \sqrt{\frac{2L\,(f(x^0) - \bar{f})}{K}}.$$
Proof. The statement is a direct result of
$$\sum_{k=0}^{K-1}\|\nabla f(x^k)\|^2 \le 2L \sum_{k=0}^{K-1}\bigl(f(x^k) - f(x^{k+1})\bigr) = 2L\,\bigl(f(x^0) - f(x^K)\bigr)$$
and
$$\min_{0 \le k \le K-1}\|\nabla f(x^k)\| = \sqrt{\min_{0 \le k \le K-1}\|\nabla f(x^k)\|^2} \le \sqrt{\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla f(x^k)\|^2}.$$
Corollary 2
Any accumulation point of $\{x^k\}_{k=0}^\infty$ in Theorem 1 is stationary.
Theorem 3 (Sublinear convergence of GD, $O(1/k)$)
Suppose $f$ is convex and Lipschitz continuously differentiable with constant $L$, and that $\min_{x\in\mathbb{R}^n} f(x)$ has a solution $x_\star$. Gradient descent with $\alpha_k \equiv 1/L$ generates a sequence $\{x^k\}_{k=0}^\infty$ that satisfies
$$f(x^k) - f(x_\star) \le \frac{L\|x^0 - x_\star\|^2}{2k}.$$
Theorem 4 (Linear convergence of GD, $O((1 - \gamma/L)^k)$)
Suppose $f$ is Lipschitz continuously differentiable with constant $L$ and strongly convex with modulus of convexity $\gamma$. Then $f$ has a unique minimizer $x_\star$, and gradient descent with $\alpha_k \equiv 1/L$ generates a sequence $\{x^k\}_{k=0}^\infty$ that satisfies
$$f(x^{k+1}) - f(x_\star) \le (1 - \gamma/L)\,\bigl(f(x^k) - f(x_\star)\bigr), \qquad k = 0, 1, 2, \ldots$$
3 Descent Direction (DD) method
Choose $d^k$ to be a descent direction, that is,
$$\nabla f(x^k)^T d^k < 0,$$
which ensures that
$$f(x^k + \alpha d^k) < f(x^k)$$
for sufficiently small $\alpha > 0$.
The line search method
$$x^{k+1} = x^k + \alpha_k d^k$$
with a descent direction $d^k$ and sufficiently small $\alpha_k > 0$ works. Gradient descent is a special case.
Theorem 5 (Sublinear convergence of DD, $O(1/\sqrt{k})$)
Suppose $f$ is Lipschitz continuously differentiable with constant $L$ and $f$ is bounded below by a constant $\bar{f}$, i.e., $f(x) \ge \bar{f}$. Consider the line search method where $d^k$ satisfies
$$\nabla f(x^k)^T d^k \le -\eta\,\|\nabla f(x^k)\|\,\|d^k\|$$
for some $\eta > 0$, and $\alpha_k > 0$ satisfies the weak Wolfe conditions
$$f(x^k + \alpha_k d^k) \le f(x^k) + c_1 \alpha_k \nabla f(x^k)^T d^k,$$
$$\nabla f(x^k + \alpha_k d^k)^T d^k \ge c_2 \nabla f(x^k)^T d^k$$
at all $k$, for some constants $c_1$ and $c_2$ satisfying $0 < c_1 < c_2 < 1$. Then for any integer $K \ge 1$ we have
$$\min_{0 \le k \le K-1}\|\nabla f(x^k)\| \le \sqrt{\frac{L}{\eta^2 c_1 (1 - c_2)}}\,\sqrt{\frac{f(x^0) - \bar{f}}{K}}.$$
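A simple bracketing/bisection search for a step length satisfying the two weak Wolfe conditions. This is a minimal sketch, not the algorithm from Numerical Optimization; the default constants, iteration cap, and doubling rule are illustrative assumptions.

```python
import numpy as np

def weak_wolfe_step(f, grad, x, d, c1=1e-4, c2=0.9, max_iter=50):
    """Find alpha satisfying the weak Wolfe conditions by bracketing and bisection."""
    lo, hi, alpha = 0.0, np.inf, 1.0
    fx, gTd = f(x), grad(x) @ d                       # gTd < 0 for a descent direction
    for _ in range(max_iter):
        if f(x + alpha * d) > fx + c1 * alpha * gTd:  # sufficient decrease fails
            hi = alpha                                # step too long
        elif grad(x + alpha * d) @ d < c2 * gTd:      # curvature condition fails
            lo = alpha                                # step too short
        else:
            return alpha
        alpha = 2.0 * lo if np.isinf(hi) else 0.5 * (lo + hi)
    return alpha
```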
[Figure: step lengths satisfying the weak Wolfe conditions]
Corollary 6
Any accumulation point of $\{x^k\}_{k=0}^\infty$ in Theorem 5 is stationary.
Remark 1
Let $\{x^k\}_{k=0}^\infty$ be the sequence in Theorem 5.
If $f$ is convex, then any accumulation point of $\{x^k\}_{k=0}^\infty$ is a solution of $\min_{x\in\mathbb{R}^n} f(x)$.
If $f$ is nonconvex, then accumulation points of $\{x^k\}_{k=0}^\infty$ may be local minimizers, saddle points, or local maximizers.
4 Frank-Wolfe (FW) method: conditional gradient
Consider the optimization problem
$$\min_{x\in\Omega} f(x),$$
where $\Omega$ is compact and convex and $f$ is convex and differentiable.
The conditional gradient method is as follows:
$$v^k = \arg\min_{v\in\Omega} v^T \nabla f(x^k),$$
$$x^{k+1} = x^k + \alpha_k (v^k - x^k), \qquad \alpha_k = \frac{2}{k+2}.$$
Theorem 7 (Sublinear convergence of FW, $O(1/k)$)
Suppose $\Omega$ is a compact convex set with diameter $D$, i.e.,
$$\|x - y\| \le D \quad \text{for all } x, y \in \Omega.$$
Suppose that $f$ is convex and Lipschitz continuously differentiable in a neighborhood of $\Omega$ with Lipschitz constant $L$. FW generates a sequence $\{x^k\}_{k=0}^\infty$ that satisfies
$$f(x^k) - f(x_\star) \le \frac{2LD^2}{k+2}, \qquad k = 1, 2, \ldots,$$
where $x_\star$ is any solution of $\min_{x\in\Omega} f(x)$.
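A minimal sketch of the iteration for the special case where $\Omega$ is the probability simplex (an illustrative assumption, not fixed by the lecture); there the linear subproblem is solved by the vertex corresponding to the smallest gradient component.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, num_iters=200):
    """Frank-Wolfe on the probability simplex {x >= 0, sum(x) = 1}."""
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        g = grad(x)
        v = np.zeros_like(x)
        v[np.argmin(g)] = 1.0            # argmin over the simplex of v^T g is a vertex
        alpha = 2.0 / (k + 2)
        x = x + alpha * (v - x)          # convex combination, so x stays feasible
    return x
```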
5 The key idea of acceleration is momentum
Consider the iterate
$$x^{k+1} = x^k - \alpha_k \nabla f(x^k) - \sum_{i=0}^{k-1} \mu_{ki}\,\nabla f(x^i).$$
Due to the added flexibility, it may yield better convergence.
This is the foundation of the momentum method:
$$x^{k+1} = x^k - \alpha_k \nabla f(x^k) + \beta_k \cdot \text{Momentum}.$$
5.1 Heavy-Ball (HB) method
Each iteration of this method has the form
$$x^{k+1} = x^k - \alpha_k \nabla f(x^k) + \beta_k (x^k - x^{k-1}),$$
where $\alpha_k > 0$ and $\beta_k > 0$ (a two-step method).
The HB method is not a descent method; usually $f(x^{k+1}) > f(x^k)$ for many $k$. This property is shared by other momentum methods.
Example: Consider the strongly convex quadratic function
$$\min_{x\in\mathbb{R}^n} \left\{ f(x) = \frac{1}{2} x^T A x - b^T x \right\},$$
where the (constant) Hessian $A$ has eigenvalues in the range $[\gamma, L]$ with $0 < \gamma \le L$. Let
$$\alpha_k = \alpha = \frac{4}{(\sqrt{L} + \sqrt{\gamma})^2}, \qquad \beta_k = \beta = \frac{\sqrt{L} - \sqrt{\gamma}}{\sqrt{L} + \sqrt{\gamma}}.$$
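A minimal sketch of the heavy-ball iteration with these constant $\alpha$ and $\beta$ on a quadratic; taking $\gamma$ and $L$ as the extreme eigenvalues of $A$ and the iteration budget are illustrative assumptions.

```python
import numpy as np

def heavy_ball_quadratic(A, b, x0, num_iters=300):
    """Heavy-ball: x^{k+1} = x^k - alpha*(A x^k - b) + beta*(x^k - x^{k-1})."""
    eigs = np.linalg.eigvalsh(A)
    gamma, L = eigs.min(), eigs.max()
    alpha = 4.0 / (np.sqrt(L) + np.sqrt(gamma)) ** 2
    beta = (np.sqrt(L) - np.sqrt(gamma)) / (np.sqrt(L) + np.sqrt(gamma))
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x, x_prev = x - alpha * (A @ x - b) + beta * (x - x_prev), x
    return x
```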
It can be shown that
$$\|x^k - x_\star\| \le C \beta^k,$$
which further yields
$$f(x^k) - f(x_\star) \le \frac{LC^2}{2}\,\beta^{2k}.$$
If $L \gg \gamma$, we have $\beta \approx 1 - 2\sqrt{\gamma/L}$. Therefore the complexity is
$$O\bigl(\sqrt{L/\gamma}\,\log(1/\epsilon)\bigr).$$
Convergence of GD:
$$f(x^k) - f(x_\star) \le (1 - \gamma/L)^k\,\bigl(f(x^0) - f(x_\star)\bigr).$$
Complexity of GD is
$$O\bigl((L/\gamma)\,\log(1/\epsilon)\bigr).$$
5.2 Conjugate Gradient (CG) method
Given SPD $A \in \mathbb{R}^{n\times n}$,
$$Ax = b \quad \text{``}\Longleftrightarrow\text{''} \quad \min_{x\in\mathbb{R}^n} \frac{1}{2} x^T A x - b^T x.$$
Steepest Descent iteration:
$$x^{k+1} = x^k - \frac{(Ax^k - b)^T (Ax^k - b)}{(Ax^k - b)^T A (Ax^k - b)}\,(Ax^k - b).$$
CG iteration:
$$x^{k+1} = x^k + \alpha_k p^k, \quad \text{where } p^k = -\nabla f(x^k) + \xi_k p^{k-1}.$$
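A minimal sketch of linear CG for an SPD system. Note that $-\nabla f(x^k)$ is the residual $b - Ax^k$; the particular formulas used below for $\alpha_k$ and $\xi_k$ are the standard choices, stated here as an assumption since the slide leaves them unspecified.

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Linear CG for SPD A: p^k = r^k + xi_k p^{k-1}, x^{k+1} = x^k + alpha_k p^k."""
    x = np.asarray(x0, dtype=float)
    r = b - A @ x                      # residual = -grad f(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter or len(b)):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p      # xi_k = (r_new^T r_new) / (r^T r)
        rs = rs_new
    return x
```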
5.3 Nesterov's Accelerated Gradient (NAG)
NAG iteration:
$$x^{k+1} = x^k - \alpha_k \nabla f\bigl(x^k + \beta_k (x^k - x^{k-1})\bigr) + \beta_k (x^k - x^{k-1}).$$
Theorem 8 (Sublinear convergence of NAG, $O(1/k^2)$)
Suppose $f$ is convex and Lipschitz continuously differentiable with constant $L$. Suppose the minimum of $f$ is attained at $x_\star$. NAG with $x^0$, $x^1 = x^0 - \nabla f(x^0)/L$, $\alpha_k = 1/L$, and $\beta_k$ defined as follows:
$$\lambda_0 = 0, \qquad \lambda_{k+1} = \frac{1}{2}\left(1 + \sqrt{1 + 4\lambda_k^2}\right), \qquad \beta_k = \frac{\lambda_k - 1}{\lambda_{k+1}},$$
yields $\{x^k\}_{k=0}^\infty$ that satisfies
$$f(x^k) - f(x_\star) \le \frac{2L\|x^0 - x_\star\|^2}{(k+1)^2}, \qquad k = 1, 2, \ldots$$
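A minimal sketch of NAG with exactly this $\lambda_k$/$\beta_k$ schedule and $\alpha_k = 1/L$; only a gradient oracle and the constant $L$ are needed, and the iteration budget is an illustrative assumption.

```python
import numpy as np

def nag(grad, L, x0, num_iters=300):
    """Nesterov's accelerated gradient with alpha_k = 1/L and the lambda_k schedule."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev - grad(x_prev) / L         # x^1 = x^0 - grad f(x^0)/L
    lam = 1.0                             # lambda_1 = (1 + sqrt(1 + 4*lambda_0^2))/2 with lambda_0 = 0
    for k in range(1, num_iters):
        lam_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * lam ** 2))
        beta = (lam - 1.0) / lam_next     # beta_k = (lambda_k - 1) / lambda_{k+1}
        y = x + beta * (x - x_prev)       # extrapolated point x^k + beta_k (x^k - x^{k-1})
        x, x_prev = y - grad(y) / L, x
        lam = lam_next
    return x
```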
Theorem 9 (Linear convergence of NAG, $O((1 - \sqrt{\gamma/L})^k)$)
Suppose $f$ is Lipschitz continuously differentiable with constant $L$ and strongly convex with modulus of convexity $\gamma$. Suppose the unique minimizer of $f$ is $x_\star$. NAG with $x^0$,
$$x^1 = x^0 - \frac{1}{L}\nabla f(x^0),$$
and
$$\alpha_k = \frac{1}{L}, \qquad \beta_k = \frac{\sqrt{L} - \sqrt{\gamma}}{\sqrt{L} + \sqrt{\gamma}}$$
yields $\{x^k\}_{k=0}^\infty$ that satisfies
$$f(x^k) - f(x_\star) \le \frac{L + \gamma}{2}\,\|x^0 - x_\star\|^2 \left(1 - \sqrt{\frac{\gamma}{L}}\right)^k, \qquad k = 1, 2, \ldots$$
5.4 "Optimality" of NAG
NAG is "optimal" because the convergence rate of NAG is the best possible (possibly up to a constant) among algorithms that make use of all encountered gradient information.
Example: Consider the problem, with A = toeplitz([2, -1, 0, ..., 0]),
$$\min_{x\in\mathbb{R}^n} f(x) = \frac{1}{2} x^T A x - e_1^T x.$$
The iteration
$$x^0 = 0, \qquad x^{k+1} = x^k + \sum_{j=0}^{k} \xi_j \nabla f(x^j)$$
yields
$$f(x^k) - f(x_\star) \ge \frac{3\|A\|_2}{32(k+1)^2}\,\|x^0 - x_\star\|^2, \qquad k = 1, 2, \ldots, \frac{n}{2} - 1.$$
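A short sketch constructing this worst-case tridiagonal problem; the plain numpy construction replaces toeplitz, and running gradient descent on it (one particular method of the stated form) is an illustrative choice to observe the slow decrease of the gap.

```python
import numpy as np

# Worst-case problem: A = toeplitz([2, -1, 0, ..., 0]), f(x) = 0.5 x^T A x - e_1^T x.
n = 100
A = 2.0 * np.eye(n) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
e1 = np.zeros(n)
e1[0] = 1.0
x_star = np.linalg.solve(A, e1)
f = lambda x: 0.5 * x @ A @ x - e1 @ x

# Gradient descent with step 1/L fits the form x^{k+1} = x^k + sum_j xi_j grad f(x^j),
# so the lower bound applies to it.
L = np.linalg.eigvalsh(A).max()
x = np.zeros(n)
for k in range(50):
    x = x - (A @ x - e1) / L
print(f(x) - f(x_star))    # the optimality gap shrinks only slowly for small k
```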
6 Prox-Gradient (PG) method
Consider the regularized optimization problem
$$\min_{x\in\mathbb{R}^n} \phi(x) = f(x) + \lambda\psi(x),$$
where $f$ is smooth and convex, $\psi$ is convex, and $\lambda \ge 0$.
Each step of the prox-gradient method is defined as follows:
$$x^{k+1} = \operatorname{prox}_{\alpha_k\lambda\psi}\bigl(x^k - \alpha_k\nabla f(x^k)\bigr)$$
for some step length $\alpha_k > 0$, and
$$\operatorname{prox}_{\alpha_k\lambda\psi}(x) = \arg\min_u \left\{ \alpha_k\lambda\psi(u) + \frac{1}{2}\|u - x\|^2 \right\}.$$
$x^{k+1}$ is the solution of a quadratic approximation to $\phi(x)$:
$$x^{k+1} = \arg\min_u\; \nabla f(x^k)^T(u - x^k) + \frac{1}{2\alpha_k}\|u - x^k\|^2 + \lambda\psi(u).$$
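A minimal sketch for the common special case $\psi(x) = \|x\|_1$ (an illustrative assumption, not fixed by the lecture), whose prox is componentwise soft thresholding.

```python
import numpy as np

def soft_threshold(x, tau):
    """prox of tau*||.||_1: componentwise soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_gradient_l1(grad_f, L, lam, x0, num_iters=300):
    """Prox-gradient for min f(x) + lam*||x||_1 with alpha_k = 1/L."""
    x = np.asarray(x0, dtype=float)
    alpha = 1.0 / L
    for _ in range(num_iters):
        x = soft_threshold(x - alpha * grad_f(x), alpha * lam)
    return x
```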
Define
$$G_\alpha(x) = \frac{1}{\alpha}\Bigl(x - \operatorname{prox}_{\alpha\lambda\psi}\bigl(x - \alpha\nabla f(x)\bigr)\Bigr), \qquad \alpha > 0.$$
Then at step $k$ of the prox-gradient method,
$$x^{k+1} = x^k - \alpha_k G_{\alpha_k}(x^k).$$
Lemma 10
Suppose that $\psi$ is a closed convex function and that $f$ is convex and Lipschitz continuously differentiable with constant $L$. Then
(a) $G_\alpha(x) \in \nabla f(x) + \lambda\,\partial\psi\bigl(x - \alpha G_\alpha(x)\bigr)$;
(b) for any $z$ and any $\alpha \in (0, 1/L]$, we have that
$$\phi\bigl(x - \alpha G_\alpha(x)\bigr) \le \phi(z) + G_\alpha(x)^T(x - z) - \frac{\alpha}{2}\|G_\alpha(x)\|^2.$$
Theorem 11 (Sublinear convergence of PG, $O(1/k)$)
Suppose that $\psi$ is a closed convex function and that $f$ is convex and Lipschitz continuously differentiable with constant $L$. Suppose that
$$\min_{x\in\mathbb{R}^n} \phi(x) = f(x) + \lambda\psi(x), \qquad \lambda \ge 0,$$
attains a minimizer $x_\star$ with optimal objective value $\phi_\star$. Then if
$$\alpha_k = \frac{1}{L}$$
for all $k$ in the prox-gradient method, we have
$$\phi(x^k) - \phi_\star \le \frac{L\|x^0 - x_\star\|^2}{2k}, \qquad k = 1, 2, \ldots$$
1 Line search methods
Given search direction dk and step length αk
xk+1 = xk + αkdk
Search directions
Negative gradientdk = minusnablaf(xk)
Newtondk = minusnabla2f(xk)minus1nablaf(xk)
Quasi-Newtondk = minusBminus1
k nablaf(xk)
Conjugate gradient (βk ensures dkminus1 and dk are conjugate)
dk = minusnablaf(xk) + βkdkminus1
Step length Numerical Optimization sect 31 (exact inexact)
Gradient Methods Lecture 7 April 8 - 10 2020 2 20
2 Gradient Descent (GD) method
Set dk = minusnablaf(xk) and αk =1
L
xk+1 = xk minus 1
Lnablaf(xk) k = 0 1 2
If f is Lipschitz continuously differentiable with constant L then(see Lecture 6 Page 11)
f(x+ αd) le f(x) + αnablaf(x)Td+ α2L
2983042d9830422
This yields
f(xk+1) = f
983061xk minus 1
Lnablaf(xk)
983062le f(xk)minus 1
2L983042nablaf(xk)9830422
which implies f(xk) is a nonincreasing sequence
Gradient Methods Lecture 7 April 8 - 10 2020 3 20
Theorem 1 (Sublinear convergence of GD O(1radick))
Suppose f is Lipschitz continuously differentiable with constant L andthere exists a constant f satisfying f(x) ge f Gradient descent with
αk equiv 1L generates a sequence xkinfin0 that satisfies
min0lekleKminus1
983042nablaf(xk)983042 le983157
2L(f(x0)minus f(xK))
Kle
9831582L(f(x0)minus f)
K
Proof The statement is a direct result of
Kminus1983131
k=0
983042nablaf(xk)9830422 le 2L
Kminus1983131
k=0
(f(xk)minus f(xk+1)) = 2L(f(x0)minus f(xK))
and
min0lekleKminus1
983042nablaf(xk)983042 =983157
min0lekleKminus1
983042nablaf(xk)9830422 le
983161983160983160983159 1
K
Kminus1983131
k=0
983042nablaf(xk)9830422
Gradient Methods Lecture 7 April 8 - 10 2020 4 20
Corollary 2
Any accumulation point of xkinfin0 in Theorem 1 is stationary
Theorem 3 (Sublinear convergence of GD O(1k))
Suppose f is convex and Lipschitz continuously differentiable withconstant L and that minxisinRn f(x) has a solution x983183 Gradient descentwith αk equiv 1L generates a sequence xkinfin0 that satisfies
f(xk)minus f(x983183) le L983042x0 minus x9831839830422(2k)
Theorem 4 (Linear convergence of GD O((1minus γL)k))
Suppose f is Lipschitz continuously differentiable with constant L andstrongly convex with modulus of convexity γ Then f has a uniqueminimizer x983183 and gradient descent with αk equiv 1L generates a sequencexkinfin0 that satisfies
f(xk+1)minus f(x983183) le (1minus γL) (f(xk)minus f(x983183)) k = 0 1 2
Gradient Methods Lecture 7 April 8 - 10 2020 5 20
3 Descent Direction (DD) method
Choose dk to be a descent direction that is
nablaf(xk)Tdk lt 0
which ensures that
f(xk + αdk) lt f(xk)
for sufficiently small α gt 0
Line search methodxk+1 = xk + αkd
k
with a descent direction dk and sufficiently small αk gt 0 worksGradient descent is a special case
Gradient Methods Lecture 7 April 8 - 10 2020 6 20
Theorem 5 (Sublinear convergence of DD O(1radick))
Suppose f is Lipschitz continuously differentiable with constant L andf is bounded below by a constant f ie f(x) ge f Consider the linear
search method where dk satisfies
nablaf(xk)Tdk le minusη983042nablaf(xk)983042983042dk983042
for some η gt 0 and αk gt 0 satisfies the weak Wolfe conditions
f(xk + αkdk) le f(xk) + c1αknablaf(xk)Tdk
nablaf(xk + αkdk)Tdk ge c2nablaf(xk)Tdk
at all k for some constants c1 and c2 satisfying 0 lt c1 lt c2 lt 1 Thenfor any integer K ge 1 we have
min0lekleKminus1
983042nablaf(xk)983042 le983158
L
η2c1(1minus c2)
983158f(x0)minus f
K
Gradient Methods Lecture 7 April 8 - 10 2020 7 20
Step lengths satisfying the weak Wolfe conditions
Gradient Methods Lecture 7 April 8 - 10 2020 8 20
Corollary 6
Any accumulation point of xkinfin0 in Theorem 5 is stationary
Remark 1
Let xkinfin0 be the sequence in Theorem 5
If f is convex then any accumulation point of xkinfin0 is a solutionof minxisinRn f(x)
If f is nonconvex then accumulation points of xkinfin0 may be localminimizers saddle points or local maximizier
4 Frank-Wolfe (FW) method conditional gradient
Consider the optimization problem
minxisinΩ
f(x)
where Ω is compact and convex and f is convex and differentiable
Gradient Methods Lecture 7 April 8 - 10 2020 9 20
The conditional gradient method is as follows
vk = argminvisinΩ
vTnablaf(xk)
xk+1 = xk + αk(vk minus xk) αk =
2
k + 2
Theorem 7 (Sublinear convergence of FW O(1k))
Suppose Ω is a compact convex set with diameter D ie
983042xminus y983042 le D for all xy isin Ω
Suppose that f is convex and Lipschitz continuously differentiable in aneighborhood of Ω with Lipschitz constant L FW generates a sequencexkinfin0 that satisfies
f(xk)minus f(x983183) le2LD2
k + 2 k = 1 2
where x983183 is any solution of minxisinΩ f(x)
Gradient Methods Lecture 7 April 8 - 10 2020 10 20
5 The key idea of acceleration is momentum
Consider the iterate
xk+1 = xk minus αknablaf(983144xk)minuskminus1983131
i=0
microkinablaf(983144xi)
Due to more flexibility it may yield better convergence
This is the foundation of momentum method
xk+1 = xkminusαknablaf(983144xk) + βkMomentum
Gradient Methods Lecture 7 April 8 - 10 2020 11 20
51 Heavy-Ball (HB) method
Each iteration of this method has the form
xk+1 = xkminusαknablaf(xk) + βk(xk minus xkminus1)
where αk gt 0 and βk gt 0 (Two-step method)
HB method is not a descent method usually f(xk+1) gt f(xk) formany k This property is shared by other momentum methods
Example Consider the strongly convex quadratic function
minxisinRn
983069f(x) =
1
2xTAxminus bTx
983070
where the (constant) Hessian A has eigenvalues in the range [γ L]with 0 lt γ le L Let
αk = α =4
(radicL+
radicγ)2
βk = β =
radicLminusradic
γradicL+
radicγ
Gradient Methods Lecture 7 April 8 - 10 2020 12 20
It can be shown that
983042xk minus x983183983042 le Cβk
which further yields
f(xk)minus f(x983183) leLC2
2β2k
If L ≫ γ we haveβ asymp 1minus 2
983155γL
Therefore the complexity is
O(983155
Lγ log(1ε))
Convergence of GD
f(xk)minus f(x983183) le (1minus γL)k(f(x0)minus f(x983183))
Complexity of GD is
O((Lγ) log(1ε))
Gradient Methods Lecture 7 April 8 - 10 2020 13 20
52 Conjugate Gradient (CG) method
Given SPD A isin Rntimesn
Ax = bldquo hArr rdquo minxisinRn
1
2xTAxminus bTx
Steepest Descent iteration
xk+1 = xk minus (Axk minus b)T(Axk minus b)
(Axk minus b)TA(Axk minus b)(Axk minus b)
CG iteration
xk+1 = xk + αkpk where pk = minusnablaf(xk) + ξkp
kminus1
Gradient Methods Lecture 7 April 8 - 10 2020 14 20
53 Nesterovrsquos Accelerated Gradient (NAG)
NAG iteration
xk+1 = xk minus αknablaf(xk + βk(xk minus xkminus1)) + βk(x
k minus xkminus1)
Theorem 8 (Sublinear convergence of NAG O(1k2))
Suppose f is convex and Lipschitz continuously differentiable withconstant L Suppose the minimum of f is attained at x983183 NAG withx0 x1 = x0 minusnablaf(x0)L αk = 1L and βk defined as follows
λ0 = 0 λk+1 =1
2
9830611 +
9831561 + 4λ2
k
983062 βk =
λk minus 1
λk+1
yields xkinfin0 that satisfies
f(xk)minus f(x983183) le2L983042x0 minus x9831839830422
(k + 1)2 k = 1 2
Gradient Methods Lecture 7 April 8 - 10 2020 15 20
Theorem 9 (Linear convergence of NAG O((1minus983155
γL)k))
Suppose f is Lipschitz continuously differentiable with constant L andstrongly convex with modulus of convexity γ Suppose the uniqueminimizer of f is x983183 NAG with x0
x1 = x0 minus 1
Lnablaf(x0)
and
αk =1
L βk =
radicLminusradic
γradicL+
radicγ
yields xkinfin0 that satisfies
f(xk)minus f(x983183) leL+ γ
2983042x0 minus x9831839830422
9830611minus
983157γ
L
983062k
k = 1 2
Gradient Methods Lecture 7 April 8 - 10 2020 16 20
54 ldquoOptimalityrdquo of NAG
NAG is ldquooptimalrdquo because the convergence rate of NAG is thebest possible (possibly up to a constant) among algorithms thatmake use of all encountered gradient information
Example Consider the problem A=toeplitz([2-10middot middot middot 0])
minxisinRn
f(x) =1
2xTAxminus eT1 x
The iteration
x0 = 0 xk+1 = xk +
k983131
j=0
ξjnablaf(xj)
yields
f(xk)minus f(x983183) ge3983042A9830422
32(k + 1)2983042x0 minus x9831839830422 k = 1 2
n
2minus 1
Gradient Methods Lecture 7 April 8 - 10 2020 17 20
6 Prox-Gradient (PG) method
Consider the regularized optimization problem
minxisinRn
φ(x) = f(x) + λψ(x)
where f smooth and convex ψ is convex and λ ge 0
Each step of the prox-gradient method is defined as follows
xk+1 = proxαkλψ(xk minus αknablaf(xk))
for some step length αk gt 0 and
proxαkλψ(x) = argmin
u
983069αkλψ(u) +
1
2983042uminus x9830422
983070
xk+1 is the solution of a quadratic approximation to φ(x)
xk+1 = argminu
nablaf(xk)T(uminus xk) +1
2αk983042uminus xk9830422 + λψ(u)
Gradient Methods Lecture 7 April 8 - 10 2020 18 20
Define
Gα(x) =1
α
983043xminus proxαλψ (xminus αnablaf(x))
983044 α gt 0
Then at step k of prox-gradient method
xk+1 = xk minus αkGαk(xk)
Lemma 10
Suppose that ψ is a closed convex function and that f is convex andLipschitz continuously differentiable with constant L Then
(a) Gα(x) isin nablaf(x) + λpartψ(xminus αGα(x))
(b) For any z and any α isin (0 1L] we have that
φ(xminus αGα(x)) le φ(z) +Gα(x)T(xminus z)minus α
2983042Gα(x)9830422
Gradient Methods Lecture 7 April 8 - 10 2020 19 20
Theorem 11 (Sublinear convergence of PG O(1k))
Suppose that ψ is a closed convex function and that f is convex andLipschitz continuously differentiable with constant L Suppose that
minxisinRn
φ(x) = f(x) + λψ(x) λ ge 0
attains a minimizer x983183 with optimal objective value φ983183 Then if
αk =1
L
for all k in the prox-gradient method we have
φ(xk)minus φ983183 leL983042x0 minus x9831839830422
2k k = 1 2
Gradient Methods Lecture 7 April 8 - 10 2020 20 20
2 Gradient Descent (GD) method
Set dk = minusnablaf(xk) and αk =1
L
xk+1 = xk minus 1
Lnablaf(xk) k = 0 1 2
If f is Lipschitz continuously differentiable with constant L then(see Lecture 6 Page 11)
f(x+ αd) le f(x) + αnablaf(x)Td+ α2L
2983042d9830422
This yields
f(xk+1) = f
983061xk minus 1
Lnablaf(xk)
983062le f(xk)minus 1
2L983042nablaf(xk)9830422
which implies f(xk) is a nonincreasing sequence
Gradient Methods Lecture 7 April 8 - 10 2020 3 20
Theorem 1 (Sublinear convergence of GD O(1radick))
Suppose f is Lipschitz continuously differentiable with constant L andthere exists a constant f satisfying f(x) ge f Gradient descent with
αk equiv 1L generates a sequence xkinfin0 that satisfies
min0lekleKminus1
983042nablaf(xk)983042 le983157
2L(f(x0)minus f(xK))
Kle
9831582L(f(x0)minus f)
K
Proof The statement is a direct result of
Kminus1983131
k=0
983042nablaf(xk)9830422 le 2L
Kminus1983131
k=0
(f(xk)minus f(xk+1)) = 2L(f(x0)minus f(xK))
and
min0lekleKminus1
983042nablaf(xk)983042 =983157
min0lekleKminus1
983042nablaf(xk)9830422 le
983161983160983160983159 1
K
Kminus1983131
k=0
983042nablaf(xk)9830422
Gradient Methods Lecture 7 April 8 - 10 2020 4 20
Corollary 2
Any accumulation point of xkinfin0 in Theorem 1 is stationary
Theorem 3 (Sublinear convergence of GD O(1k))
Suppose f is convex and Lipschitz continuously differentiable withconstant L and that minxisinRn f(x) has a solution x983183 Gradient descentwith αk equiv 1L generates a sequence xkinfin0 that satisfies
f(xk)minus f(x983183) le L983042x0 minus x9831839830422(2k)
Theorem 4 (Linear convergence of GD O((1minus γL)k))
Suppose f is Lipschitz continuously differentiable with constant L andstrongly convex with modulus of convexity γ Then f has a uniqueminimizer x983183 and gradient descent with αk equiv 1L generates a sequencexkinfin0 that satisfies
f(xk+1)minus f(x983183) le (1minus γL) (f(xk)minus f(x983183)) k = 0 1 2
Gradient Methods Lecture 7 April 8 - 10 2020 5 20
3 Descent Direction (DD) method
Choose dk to be a descent direction that is
nablaf(xk)Tdk lt 0
which ensures that
f(xk + αdk) lt f(xk)
for sufficiently small α gt 0
Line search methodxk+1 = xk + αkd
k
with a descent direction dk and sufficiently small αk gt 0 worksGradient descent is a special case
Gradient Methods Lecture 7 April 8 - 10 2020 6 20
Theorem 5 (Sublinear convergence of DD O(1radick))
Suppose f is Lipschitz continuously differentiable with constant L andf is bounded below by a constant f ie f(x) ge f Consider the linear
search method where dk satisfies
nablaf(xk)Tdk le minusη983042nablaf(xk)983042983042dk983042
for some η gt 0 and αk gt 0 satisfies the weak Wolfe conditions
f(xk + αkdk) le f(xk) + c1αknablaf(xk)Tdk
nablaf(xk + αkdk)Tdk ge c2nablaf(xk)Tdk
at all k for some constants c1 and c2 satisfying 0 lt c1 lt c2 lt 1 Thenfor any integer K ge 1 we have
min0lekleKminus1
983042nablaf(xk)983042 le983158
L
η2c1(1minus c2)
983158f(x0)minus f
K
Gradient Methods Lecture 7 April 8 - 10 2020 7 20
Step lengths satisfying the weak Wolfe conditions
Gradient Methods Lecture 7 April 8 - 10 2020 8 20
Corollary 6
Any accumulation point of xkinfin0 in Theorem 5 is stationary
Remark 1
Let xkinfin0 be the sequence in Theorem 5
If f is convex then any accumulation point of xkinfin0 is a solutionof minxisinRn f(x)
If f is nonconvex then accumulation points of xkinfin0 may be localminimizers saddle points or local maximizier
4 Frank-Wolfe (FW) method conditional gradient
Consider the optimization problem
minxisinΩ
f(x)
where Ω is compact and convex and f is convex and differentiable
Gradient Methods Lecture 7 April 8 - 10 2020 9 20
The conditional gradient method is as follows
vk = argminvisinΩ
vTnablaf(xk)
xk+1 = xk + αk(vk minus xk) αk =
2
k + 2
Theorem 7 (Sublinear convergence of FW O(1k))
Suppose Ω is a compact convex set with diameter D ie
983042xminus y983042 le D for all xy isin Ω
Suppose that f is convex and Lipschitz continuously differentiable in aneighborhood of Ω with Lipschitz constant L FW generates a sequencexkinfin0 that satisfies
f(xk)minus f(x983183) le2LD2
k + 2 k = 1 2
where x983183 is any solution of minxisinΩ f(x)
Gradient Methods Lecture 7 April 8 - 10 2020 10 20
5 The key idea of acceleration is momentum
Consider the iterate
xk+1 = xk minus αknablaf(983144xk)minuskminus1983131
i=0
microkinablaf(983144xi)
Due to more flexibility it may yield better convergence
This is the foundation of momentum method
xk+1 = xkminusαknablaf(983144xk) + βkMomentum
Gradient Methods Lecture 7 April 8 - 10 2020 11 20
51 Heavy-Ball (HB) method
Each iteration of this method has the form
xk+1 = xkminusαknablaf(xk) + βk(xk minus xkminus1)
where αk gt 0 and βk gt 0 (Two-step method)
HB method is not a descent method usually f(xk+1) gt f(xk) formany k This property is shared by other momentum methods
Example Consider the strongly convex quadratic function
minxisinRn
983069f(x) =
1
2xTAxminus bTx
983070
where the (constant) Hessian A has eigenvalues in the range [γ L]with 0 lt γ le L Let
αk = α =4
(radicL+
radicγ)2
βk = β =
radicLminusradic
γradicL+
radicγ
Gradient Methods Lecture 7 April 8 - 10 2020 12 20
It can be shown that
983042xk minus x983183983042 le Cβk
which further yields
f(xk)minus f(x983183) leLC2
2β2k
If L ≫ γ we haveβ asymp 1minus 2
983155γL
Therefore the complexity is
O(983155
Lγ log(1ε))
Convergence of GD
f(xk)minus f(x983183) le (1minus γL)k(f(x0)minus f(x983183))
Complexity of GD is
O((Lγ) log(1ε))
Gradient Methods Lecture 7 April 8 - 10 2020 13 20
52 Conjugate Gradient (CG) method
Given SPD A isin Rntimesn
Ax = bldquo hArr rdquo minxisinRn
1
2xTAxminus bTx
Steepest Descent iteration
xk+1 = xk minus (Axk minus b)T(Axk minus b)
(Axk minus b)TA(Axk minus b)(Axk minus b)
CG iteration
xk+1 = xk + αkpk where pk = minusnablaf(xk) + ξkp
kminus1
Gradient Methods Lecture 7 April 8 - 10 2020 14 20
53 Nesterovrsquos Accelerated Gradient (NAG)
NAG iteration
xk+1 = xk minus αknablaf(xk + βk(xk minus xkminus1)) + βk(x
k minus xkminus1)
Theorem 8 (Sublinear convergence of NAG O(1k2))
Suppose f is convex and Lipschitz continuously differentiable withconstant L Suppose the minimum of f is attained at x983183 NAG withx0 x1 = x0 minusnablaf(x0)L αk = 1L and βk defined as follows
λ0 = 0 λk+1 =1
2
9830611 +
9831561 + 4λ2
k
983062 βk =
λk minus 1
λk+1
yields xkinfin0 that satisfies
f(xk)minus f(x983183) le2L983042x0 minus x9831839830422
(k + 1)2 k = 1 2
Gradient Methods Lecture 7 April 8 - 10 2020 15 20
Theorem 9 (Linear convergence of NAG O((1minus983155
γL)k))
Suppose f is Lipschitz continuously differentiable with constant L andstrongly convex with modulus of convexity γ Suppose the uniqueminimizer of f is x983183 NAG with x0
x1 = x0 minus 1
Lnablaf(x0)
and
αk =1
L βk =
radicLminusradic
γradicL+
radicγ
yields xkinfin0 that satisfies
f(xk)minus f(x983183) leL+ γ
2983042x0 minus x9831839830422
9830611minus
983157γ
L
983062k
k = 1 2
Gradient Methods Lecture 7 April 8 - 10 2020 16 20
54 ldquoOptimalityrdquo of NAG
NAG is ldquooptimalrdquo because the convergence rate of NAG is thebest possible (possibly up to a constant) among algorithms thatmake use of all encountered gradient information
Example Consider the problem A=toeplitz([2-10middot middot middot 0])
minxisinRn
f(x) =1
2xTAxminus eT1 x
The iteration
x0 = 0 xk+1 = xk +
k983131
j=0
ξjnablaf(xj)
yields
f(xk)minus f(x983183) ge3983042A9830422
32(k + 1)2983042x0 minus x9831839830422 k = 1 2
n
2minus 1
Gradient Methods Lecture 7 April 8 - 10 2020 17 20
6 Prox-Gradient (PG) method
Consider the regularized optimization problem
minxisinRn
φ(x) = f(x) + λψ(x)
where f smooth and convex ψ is convex and λ ge 0
Each step of the prox-gradient method is defined as follows
xk+1 = proxαkλψ(xk minus αknablaf(xk))
for some step length αk gt 0 and
proxαkλψ(x) = argmin
u
983069αkλψ(u) +
1
2983042uminus x9830422
983070
xk+1 is the solution of a quadratic approximation to φ(x)
xk+1 = argminu
nablaf(xk)T(uminus xk) +1
2αk983042uminus xk9830422 + λψ(u)
Gradient Methods Lecture 7 April 8 - 10 2020 18 20
Define
Gα(x) =1
α
983043xminus proxαλψ (xminus αnablaf(x))
983044 α gt 0
Then at step k of prox-gradient method
xk+1 = xk minus αkGαk(xk)
Lemma 10
Suppose that ψ is a closed convex function and that f is convex andLipschitz continuously differentiable with constant L Then
(a) Gα(x) isin nablaf(x) + λpartψ(xminus αGα(x))
(b) For any z and any α isin (0 1L] we have that
φ(xminus αGα(x)) le φ(z) +Gα(x)T(xminus z)minus α
2983042Gα(x)9830422
Gradient Methods Lecture 7 April 8 - 10 2020 19 20
Theorem 11 (Sublinear convergence of PG O(1k))
Suppose that ψ is a closed convex function and that f is convex andLipschitz continuously differentiable with constant L Suppose that
minxisinRn
φ(x) = f(x) + λψ(x) λ ge 0
attains a minimizer x983183 with optimal objective value φ983183 Then if
αk =1
L
for all k in the prox-gradient method we have
φ(xk)minus φ983183 leL983042x0 minus x9831839830422
2k k = 1 2
Gradient Methods Lecture 7 April 8 - 10 2020 20 20
Theorem 1 (Sublinear convergence of GD O(1radick))
Suppose f is Lipschitz continuously differentiable with constant L andthere exists a constant f satisfying f(x) ge f Gradient descent with
αk equiv 1L generates a sequence xkinfin0 that satisfies
min0lekleKminus1
983042nablaf(xk)983042 le983157
2L(f(x0)minus f(xK))
Kle
9831582L(f(x0)minus f)
K
Proof The statement is a direct result of
Kminus1983131
k=0
983042nablaf(xk)9830422 le 2L
Kminus1983131
k=0
(f(xk)minus f(xk+1)) = 2L(f(x0)minus f(xK))
and
min0lekleKminus1
983042nablaf(xk)983042 =983157
min0lekleKminus1
983042nablaf(xk)9830422 le
983161983160983160983159 1
K
Kminus1983131
k=0
983042nablaf(xk)9830422
Gradient Methods Lecture 7 April 8 - 10 2020 4 20
Corollary 2
Any accumulation point of xkinfin0 in Theorem 1 is stationary
Theorem 3 (Sublinear convergence of GD O(1k))
Suppose f is convex and Lipschitz continuously differentiable withconstant L and that minxisinRn f(x) has a solution x983183 Gradient descentwith αk equiv 1L generates a sequence xkinfin0 that satisfies
f(xk)minus f(x983183) le L983042x0 minus x9831839830422(2k)
Theorem 4 (Linear convergence of GD O((1minus γL)k))
Suppose f is Lipschitz continuously differentiable with constant L andstrongly convex with modulus of convexity γ Then f has a uniqueminimizer x983183 and gradient descent with αk equiv 1L generates a sequencexkinfin0 that satisfies
f(xk+1)minus f(x983183) le (1minus γL) (f(xk)minus f(x983183)) k = 0 1 2
Gradient Methods Lecture 7 April 8 - 10 2020 5 20
3 Descent Direction (DD) method
Choose dk to be a descent direction that is
nablaf(xk)Tdk lt 0
which ensures that
f(xk + αdk) lt f(xk)
for sufficiently small α gt 0
Line search methodxk+1 = xk + αkd
k
with a descent direction dk and sufficiently small αk gt 0 worksGradient descent is a special case
Gradient Methods Lecture 7 April 8 - 10 2020 6 20
Theorem 5 (Sublinear convergence of DD O(1radick))
Suppose f is Lipschitz continuously differentiable with constant L andf is bounded below by a constant f ie f(x) ge f Consider the linear
search method where dk satisfies
nablaf(xk)Tdk le minusη983042nablaf(xk)983042983042dk983042
for some η gt 0 and αk gt 0 satisfies the weak Wolfe conditions
f(xk + αkdk) le f(xk) + c1αknablaf(xk)Tdk
nablaf(xk + αkdk)Tdk ge c2nablaf(xk)Tdk
at all k for some constants c1 and c2 satisfying 0 lt c1 lt c2 lt 1 Thenfor any integer K ge 1 we have
min0lekleKminus1
983042nablaf(xk)983042 le983158
L
η2c1(1minus c2)
983158f(x0)minus f
K
Gradient Methods Lecture 7 April 8 - 10 2020 7 20
Step lengths satisfying the weak Wolfe conditions
Gradient Methods Lecture 7 April 8 - 10 2020 8 20
Corollary 6
Any accumulation point of xkinfin0 in Theorem 5 is stationary
Remark 1
Let xkinfin0 be the sequence in Theorem 5
If f is convex then any accumulation point of xkinfin0 is a solutionof minxisinRn f(x)
If f is nonconvex then accumulation points of xkinfin0 may be localminimizers saddle points or local maximizier
4 Frank-Wolfe (FW) method conditional gradient
Consider the optimization problem
minxisinΩ
f(x)
where Ω is compact and convex and f is convex and differentiable
Gradient Methods Lecture 7 April 8 - 10 2020 9 20
The conditional gradient method is as follows
vk = argminvisinΩ
vTnablaf(xk)
xk+1 = xk + αk(vk minus xk) αk =
2
k + 2
Theorem 7 (Sublinear convergence of FW O(1k))
Suppose Ω is a compact convex set with diameter D ie
983042xminus y983042 le D for all xy isin Ω
Suppose that f is convex and Lipschitz continuously differentiable in aneighborhood of Ω with Lipschitz constant L FW generates a sequencexkinfin0 that satisfies
f(xk)minus f(x983183) le2LD2
k + 2 k = 1 2
where x983183 is any solution of minxisinΩ f(x)
Gradient Methods Lecture 7 April 8 - 10 2020 10 20
5 The key idea of acceleration is momentum
Consider the iterate
xk+1 = xk minus αknablaf(983144xk)minuskminus1983131
i=0
microkinablaf(983144xi)
Due to more flexibility it may yield better convergence
This is the foundation of momentum method
xk+1 = xkminusαknablaf(983144xk) + βkMomentum
Gradient Methods Lecture 7 April 8 - 10 2020 11 20
51 Heavy-Ball (HB) method
Each iteration of this method has the form
xk+1 = xkminusαknablaf(xk) + βk(xk minus xkminus1)
where αk gt 0 and βk gt 0 (Two-step method)
HB method is not a descent method usually f(xk+1) gt f(xk) formany k This property is shared by other momentum methods
Example Consider the strongly convex quadratic function
minxisinRn
983069f(x) =
1
2xTAxminus bTx
983070
where the (constant) Hessian A has eigenvalues in the range [γ L]with 0 lt γ le L Let
αk = α =4
(radicL+
radicγ)2
βk = β =
radicLminusradic
γradicL+
radicγ
Gradient Methods Lecture 7 April 8 - 10 2020 12 20
It can be shown that
983042xk minus x983183983042 le Cβk
which further yields
f(xk)minus f(x983183) leLC2
2β2k
If L ≫ γ we haveβ asymp 1minus 2
983155γL
Therefore the complexity is
O(983155
Lγ log(1ε))
Convergence of GD
f(xk)minus f(x983183) le (1minus γL)k(f(x0)minus f(x983183))
Complexity of GD is
O((Lγ) log(1ε))
Gradient Methods Lecture 7 April 8 - 10 2020 13 20
52 Conjugate Gradient (CG) method
Given SPD A isin Rntimesn
Ax = bldquo hArr rdquo minxisinRn
1
2xTAxminus bTx
Steepest Descent iteration
xk+1 = xk minus (Axk minus b)T(Axk minus b)
(Axk minus b)TA(Axk minus b)(Axk minus b)
CG iteration
xk+1 = xk + αkpk where pk = minusnablaf(xk) + ξkp
kminus1
Gradient Methods Lecture 7 April 8 - 10 2020 14 20
53 Nesterovrsquos Accelerated Gradient (NAG)
NAG iteration
xk+1 = xk minus αknablaf(xk + βk(xk minus xkminus1)) + βk(x
k minus xkminus1)
Theorem 8 (Sublinear convergence of NAG O(1k2))
Suppose f is convex and Lipschitz continuously differentiable withconstant L Suppose the minimum of f is attained at x983183 NAG withx0 x1 = x0 minusnablaf(x0)L αk = 1L and βk defined as follows
λ0 = 0 λk+1 =1
2
9830611 +
9831561 + 4λ2
k
983062 βk =
λk minus 1
λk+1
yields xkinfin0 that satisfies
f(xk)minus f(x983183) le2L983042x0 minus x9831839830422
(k + 1)2 k = 1 2
Gradient Methods Lecture 7 April 8 - 10 2020 15 20
Theorem 9 (Linear convergence of NAG O((1minus983155
γL)k))
Suppose f is Lipschitz continuously differentiable with constant L andstrongly convex with modulus of convexity γ Suppose the uniqueminimizer of f is x983183 NAG with x0
x1 = x0 minus 1
Lnablaf(x0)
and
αk =1
L βk =
radicLminusradic
γradicL+
radicγ
yields xkinfin0 that satisfies
f(xk)minus f(x983183) leL+ γ
2983042x0 minus x9831839830422
9830611minus
983157γ
L
983062k
k = 1 2
Gradient Methods Lecture 7 April 8 - 10 2020 16 20
54 ldquoOptimalityrdquo of NAG
NAG is ldquooptimalrdquo because the convergence rate of NAG is thebest possible (possibly up to a constant) among algorithms thatmake use of all encountered gradient information
Example Consider the problem A=toeplitz([2-10middot middot middot 0])
minxisinRn
f(x) =1
2xTAxminus eT1 x
The iteration
x0 = 0 xk+1 = xk +
k983131
j=0
ξjnablaf(xj)
yields
f(xk)minus f(x983183) ge3983042A9830422
32(k + 1)2983042x0 minus x9831839830422 k = 1 2
n
2minus 1
Gradient Methods Lecture 7 April 8 - 10 2020 17 20
6 Prox-Gradient (PG) method
Consider the regularized optimization problem
minxisinRn
φ(x) = f(x) + λψ(x)
where f smooth and convex ψ is convex and λ ge 0
Each step of the prox-gradient method is defined as follows
xk+1 = proxαkλψ(xk minus αknablaf(xk))
for some step length αk gt 0 and
proxαkλψ(x) = argmin
u
983069αkλψ(u) +
1
2983042uminus x9830422
983070
xk+1 is the solution of a quadratic approximation to φ(x)
xk+1 = argminu
nablaf(xk)T(uminus xk) +1
2αk983042uminus xk9830422 + λψ(u)
Gradient Methods Lecture 7 April 8 - 10 2020 18 20
Define
Gα(x) =1
α
983043xminus proxαλψ (xminus αnablaf(x))
983044 α gt 0
Then at step k of prox-gradient method
xk+1 = xk minus αkGαk(xk)
Lemma 10
Suppose that ψ is a closed convex function and that f is convex andLipschitz continuously differentiable with constant L Then
(a) Gα(x) isin nablaf(x) + λpartψ(xminus αGα(x))
(b) For any z and any α isin (0 1L] we have that
φ(xminus αGα(x)) le φ(z) +Gα(x)T(xminus z)minus α
2983042Gα(x)9830422
Gradient Methods Lecture 7 April 8 - 10 2020 19 20
Theorem 11 (Sublinear convergence of PG O(1k))
Suppose that ψ is a closed convex function and that f is convex andLipschitz continuously differentiable with constant L Suppose that
minxisinRn
φ(x) = f(x) + λψ(x) λ ge 0
attains a minimizer x983183 with optimal objective value φ983183 Then if
αk =1
L
for all k in the prox-gradient method we have
φ(xk)minus φ983183 leL983042x0 minus x9831839830422
2k k = 1 2
Gradient Methods Lecture 7 April 8 - 10 2020 20 20
Corollary 2
Any accumulation point of xkinfin0 in Theorem 1 is stationary
Theorem 3 (Sublinear convergence of GD O(1k))
Suppose f is convex and Lipschitz continuously differentiable withconstant L and that minxisinRn f(x) has a solution x983183 Gradient descentwith αk equiv 1L generates a sequence xkinfin0 that satisfies
f(xk)minus f(x983183) le L983042x0 minus x9831839830422(2k)
Theorem 4 (Linear convergence of GD O((1minus γL)k))
Suppose f is Lipschitz continuously differentiable with constant L andstrongly convex with modulus of convexity γ Then f has a uniqueminimizer x983183 and gradient descent with αk equiv 1L generates a sequencexkinfin0 that satisfies
f(xk+1)minus f(x983183) le (1minus γL) (f(xk)minus f(x983183)) k = 0 1 2
Gradient Methods Lecture 7 April 8 - 10 2020 5 20
3 Descent Direction (DD) method
Choose dk to be a descent direction that is
nablaf(xk)Tdk lt 0
which ensures that
f(xk + αdk) lt f(xk)
for sufficiently small α gt 0
Line search methodxk+1 = xk + αkd
k
with a descent direction dk and sufficiently small αk gt 0 worksGradient descent is a special case
Gradient Methods Lecture 7 April 8 - 10 2020 6 20
Theorem 5 (Sublinear convergence of DD O(1radick))
Suppose f is Lipschitz continuously differentiable with constant L andf is bounded below by a constant f ie f(x) ge f Consider the linear
search method where dk satisfies
nablaf(xk)Tdk le minusη983042nablaf(xk)983042983042dk983042
for some η gt 0 and αk gt 0 satisfies the weak Wolfe conditions
f(xk + αkdk) le f(xk) + c1αknablaf(xk)Tdk
nablaf(xk + αkdk)Tdk ge c2nablaf(xk)Tdk
at all k for some constants c1 and c2 satisfying 0 lt c1 lt c2 lt 1 Thenfor any integer K ge 1 we have
min0lekleKminus1
983042nablaf(xk)983042 le983158
L
η2c1(1minus c2)
983158f(x0)minus f
K
Gradient Methods Lecture 7 April 8 - 10 2020 7 20
Step lengths satisfying the weak Wolfe conditions
Gradient Methods Lecture 7 April 8 - 10 2020 8 20
Corollary 6
Any accumulation point of xkinfin0 in Theorem 5 is stationary
Remark 1
Let xkinfin0 be the sequence in Theorem 5
If f is convex then any accumulation point of xkinfin0 is a solutionof minxisinRn f(x)
If f is nonconvex then accumulation points of xkinfin0 may be localminimizers saddle points or local maximizier
4 Frank-Wolfe (FW) method conditional gradient
Consider the optimization problem
minxisinΩ
f(x)
where Ω is compact and convex and f is convex and differentiable
Gradient Methods Lecture 7 April 8 - 10 2020 9 20
The conditional gradient method is as follows
vk = argminvisinΩ
vTnablaf(xk)
xk+1 = xk + αk(vk minus xk) αk =
2
k + 2
Theorem 7 (Sublinear convergence of FW O(1k))
Suppose Ω is a compact convex set with diameter D ie
983042xminus y983042 le D for all xy isin Ω
Suppose that f is convex and Lipschitz continuously differentiable in aneighborhood of Ω with Lipschitz constant L FW generates a sequencexkinfin0 that satisfies
f(xk)minus f(x983183) le2LD2
k + 2 k = 1 2
where x983183 is any solution of minxisinΩ f(x)
Gradient Methods Lecture 7 April 8 - 10 2020 10 20
5 The key idea of acceleration is momentum
Consider the iterate
xk+1 = xk minus αknablaf(983144xk)minuskminus1983131
i=0
microkinablaf(983144xi)
Due to more flexibility it may yield better convergence
This is the foundation of momentum method
xk+1 = xkminusαknablaf(983144xk) + βkMomentum
Gradient Methods Lecture 7 April 8 - 10 2020 11 20
51 Heavy-Ball (HB) method
Each iteration of this method has the form
xk+1 = xkminusαknablaf(xk) + βk(xk minus xkminus1)
where αk gt 0 and βk gt 0 (Two-step method)
HB method is not a descent method usually f(xk+1) gt f(xk) formany k This property is shared by other momentum methods
Example Consider the strongly convex quadratic function
minxisinRn
983069f(x) =
1
2xTAxminus bTx
983070
where the (constant) Hessian A has eigenvalues in the range [γ L]with 0 lt γ le L Let
αk = α =4
(radicL+
radicγ)2
βk = β =
radicLminusradic
γradicL+
radicγ
Gradient Methods Lecture 7 April 8 - 10 2020 12 20
It can be shown that
983042xk minus x983183983042 le Cβk
which further yields
f(xk)minus f(x983183) leLC2
2β2k
If L ≫ γ we haveβ asymp 1minus 2
983155γL
Therefore the complexity is
O(983155
Lγ log(1ε))
Convergence of GD
f(xk)minus f(x983183) le (1minus γL)k(f(x0)minus f(x983183))
Complexity of GD is
O((Lγ) log(1ε))
Gradient Methods Lecture 7 April 8 - 10 2020 13 20
52 Conjugate Gradient (CG) method
Given SPD A isin Rntimesn
Ax = bldquo hArr rdquo minxisinRn
1
2xTAxminus bTx
Steepest Descent iteration
xk+1 = xk minus (Axk minus b)T(Axk minus b)
(Axk minus b)TA(Axk minus b)(Axk minus b)
CG iteration
xk+1 = xk + αkpk where pk = minusnablaf(xk) + ξkp
kminus1
Gradient Methods Lecture 7 April 8 - 10 2020 14 20
53 Nesterovrsquos Accelerated Gradient (NAG)
NAG iteration
xk+1 = xk minus αknablaf(xk + βk(xk minus xkminus1)) + βk(x
k minus xkminus1)
Theorem 8 (Sublinear convergence of NAG O(1k2))
Suppose f is convex and Lipschitz continuously differentiable withconstant L Suppose the minimum of f is attained at x983183 NAG withx0 x1 = x0 minusnablaf(x0)L αk = 1L and βk defined as follows
λ0 = 0 λk+1 =1
2
9830611 +
9831561 + 4λ2
k
983062 βk =
λk minus 1
λk+1
yields xkinfin0 that satisfies
f(xk)minus f(x983183) le2L983042x0 minus x9831839830422
(k + 1)2 k = 1 2
Gradient Methods Lecture 7 April 8 - 10 2020 15 20
Theorem 9 (Linear convergence of NAG O((1minus983155
γL)k))
Suppose f is Lipschitz continuously differentiable with constant L andstrongly convex with modulus of convexity γ Suppose the uniqueminimizer of f is x983183 NAG with x0
x1 = x0 minus 1
Lnablaf(x0)
and
αk =1
L βk =
radicLminusradic
γradicL+
radicγ
yields xkinfin0 that satisfies
f(xk)minus f(x983183) leL+ γ
2983042x0 minus x9831839830422
9830611minus
983157γ
L
983062k
k = 1 2
Gradient Methods Lecture 7 April 8 - 10 2020 16 20
54 ldquoOptimalityrdquo of NAG
NAG is ldquooptimalrdquo because the convergence rate of NAG is thebest possible (possibly up to a constant) among algorithms thatmake use of all encountered gradient information
Example Consider the problem A=toeplitz([2-10middot middot middot 0])
minxisinRn
f(x) =1
2xTAxminus eT1 x
The iteration
x0 = 0 xk+1 = xk +
k983131
j=0
ξjnablaf(xj)
yields
f(xk)minus f(x983183) ge3983042A9830422
32(k + 1)2983042x0 minus x9831839830422 k = 1 2
n
2minus 1
Gradient Methods Lecture 7 April 8 - 10 2020 17 20
6 Prox-Gradient (PG) method
Consider the regularized optimization problem
minxisinRn
φ(x) = f(x) + λψ(x)
where f smooth and convex ψ is convex and λ ge 0
Each step of the prox-gradient method is defined as follows
xk+1 = proxαkλψ(xk minus αknablaf(xk))
for some step length αk gt 0 and
proxαkλψ(x) = argmin
u
983069αkλψ(u) +
1
2983042uminus x9830422
983070
xk+1 is the solution of a quadratic approximation to φ(x)
xk+1 = argminu
nablaf(xk)T(uminus xk) +1
2αk983042uminus xk9830422 + λψ(u)
Gradient Methods Lecture 7 April 8 - 10 2020 18 20
Define
Gα(x) =1
α
983043xminus proxαλψ (xminus αnablaf(x))
983044 α gt 0
Then at step k of prox-gradient method
xk+1 = xk minus αkGαk(xk)
Lemma 10
Suppose that ψ is a closed convex function and that f is convex andLipschitz continuously differentiable with constant L Then
(a) Gα(x) isin nablaf(x) + λpartψ(xminus αGα(x))
(b) For any z and any α isin (0 1L] we have that
φ(xminus αGα(x)) le φ(z) +Gα(x)T(xminus z)minus α
2983042Gα(x)9830422
Gradient Methods Lecture 7 April 8 - 10 2020 19 20
Theorem 11 (Sublinear convergence of PG O(1k))
Suppose that ψ is a closed convex function and that f is convex andLipschitz continuously differentiable with constant L Suppose that
minxisinRn
φ(x) = f(x) + λψ(x) λ ge 0
attains a minimizer x983183 with optimal objective value φ983183 Then if
αk =1
L
for all k in the prox-gradient method we have
φ(xk)minus φ983183 leL983042x0 minus x9831839830422
2k k = 1 2
Gradient Methods Lecture 7 April 8 - 10 2020 20 20
3 Descent Direction (DD) method
Choose dk to be a descent direction that is
nablaf(xk)Tdk lt 0
which ensures that
f(xk + αdk) lt f(xk)
for sufficiently small α gt 0
Line search methodxk+1 = xk + αkd
k
with a descent direction dk and sufficiently small αk gt 0 worksGradient descent is a special case
Gradient Methods Lecture 7 April 8 - 10 2020 6 20
Theorem 5 (Sublinear convergence of DD O(1radick))
Suppose f is Lipschitz continuously differentiable with constant L andf is bounded below by a constant f ie f(x) ge f Consider the linear
search method where dk satisfies
nablaf(xk)Tdk le minusη983042nablaf(xk)983042983042dk983042
for some η gt 0 and αk gt 0 satisfies the weak Wolfe conditions
f(xk + αkdk) le f(xk) + c1αknablaf(xk)Tdk
nablaf(xk + αkdk)Tdk ge c2nablaf(xk)Tdk
at all k for some constants c1 and c2 satisfying 0 lt c1 lt c2 lt 1 Thenfor any integer K ge 1 we have
min0lekleKminus1
983042nablaf(xk)983042 le983158
L
η2c1(1minus c2)
983158f(x0)minus f
K
Gradient Methods Lecture 7 April 8 - 10 2020 7 20
Step lengths satisfying the weak Wolfe conditions
Gradient Methods Lecture 7 April 8 - 10 2020 8 20
Corollary 6
Any accumulation point of xkinfin0 in Theorem 5 is stationary
Remark 1
Let xkinfin0 be the sequence in Theorem 5
If f is convex then any accumulation point of xkinfin0 is a solutionof minxisinRn f(x)
If f is nonconvex then accumulation points of xkinfin0 may be localminimizers saddle points or local maximizier
4 Frank-Wolfe (FW) method conditional gradient
Consider the optimization problem
minxisinΩ
f(x)
where Ω is compact and convex and f is convex and differentiable
Gradient Methods Lecture 7 April 8 - 10 2020 9 20
The conditional gradient method is as follows
vk = argminvisinΩ
vTnablaf(xk)
xk+1 = xk + αk(vk minus xk) αk =
2
k + 2
Theorem 7 (Sublinear convergence of FW O(1k))
Suppose Ω is a compact convex set with diameter D ie
983042xminus y983042 le D for all xy isin Ω
Suppose that f is convex and Lipschitz continuously differentiable in aneighborhood of Ω with Lipschitz constant L FW generates a sequencexkinfin0 that satisfies
f(xk)minus f(x983183) le2LD2
k + 2 k = 1 2
where x983183 is any solution of minxisinΩ f(x)
Gradient Methods Lecture 7 April 8 - 10 2020 10 20
5 The key idea of acceleration is momentum
Consider the iterate
xk+1 = xk minus αknablaf(983144xk)minuskminus1983131
i=0
microkinablaf(983144xi)
Due to more flexibility it may yield better convergence
This is the foundation of momentum method
xk+1 = xkminusαknablaf(983144xk) + βkMomentum
Gradient Methods Lecture 7 April 8 - 10 2020 11 20
51 Heavy-Ball (HB) method
Each iteration of this method has the form
xk+1 = xkminusαknablaf(xk) + βk(xk minus xkminus1)
where αk gt 0 and βk gt 0 (Two-step method)
HB method is not a descent method usually f(xk+1) gt f(xk) formany k This property is shared by other momentum methods
Example Consider the strongly convex quadratic function
minxisinRn
983069f(x) =
1
2xTAxminus bTx
983070
where the (constant) Hessian A has eigenvalues in the range [γ L]with 0 lt γ le L Let
αk = α =4
(radicL+
radicγ)2
βk = β =
radicLminusradic
γradicL+
radicγ
Gradient Methods Lecture 7 April 8 - 10 2020 12 20
It can be shown that
983042xk minus x983183983042 le Cβk
which further yields
f(xk)minus f(x983183) leLC2
2β2k
If L ≫ γ we haveβ asymp 1minus 2
983155γL
Therefore the complexity is
O(983155
Lγ log(1ε))
Convergence of GD
f(xk)minus f(x983183) le (1minus γL)k(f(x0)minus f(x983183))
Complexity of GD is
O((Lγ) log(1ε))
Gradient Methods Lecture 7 April 8 - 10 2020 13 20
52 Conjugate Gradient (CG) method
Given SPD A isin Rntimesn
Ax = bldquo hArr rdquo minxisinRn
1
2xTAxminus bTx
Steepest Descent iteration
xk+1 = xk minus (Axk minus b)T(Axk minus b)
(Axk minus b)TA(Axk minus b)(Axk minus b)
CG iteration
xk+1 = xk + αkpk where pk = minusnablaf(xk) + ξkp
kminus1
Gradient Methods Lecture 7 April 8 - 10 2020 14 20
53 Nesterovrsquos Accelerated Gradient (NAG)
NAG iteration
xk+1 = xk minus αknablaf(xk + βk(xk minus xkminus1)) + βk(x
k minus xkminus1)
Theorem 8 (Sublinear convergence of NAG O(1k2))
Suppose f is convex and Lipschitz continuously differentiable withconstant L Suppose the minimum of f is attained at x983183 NAG withx0 x1 = x0 minusnablaf(x0)L αk = 1L and βk defined as follows
λ0 = 0 λk+1 =1
2
9830611 +
9831561 + 4λ2
k
983062 βk =
λk minus 1
λk+1
yields xkinfin0 that satisfies
f(xk)minus f(x983183) le2L983042x0 minus x9831839830422
(k + 1)2 k = 1 2
Gradient Methods Lecture 7 April 8 - 10 2020 15 20
Theorem 9 (Linear convergence of NAG O((1minus983155
γL)k))
Suppose f is Lipschitz continuously differentiable with constant L andstrongly convex with modulus of convexity γ Suppose the uniqueminimizer of f is x983183 NAG with x0
x1 = x0 minus 1
Lnablaf(x0)
and
αk =1
L βk =
radicLminusradic
γradicL+
radicγ
yields xkinfin0 that satisfies
f(xk)minus f(x983183) leL+ γ
2983042x0 minus x9831839830422
9830611minus
983157γ
L
983062k
k = 1 2
Gradient Methods Lecture 7 April 8 - 10 2020 16 20
54 ldquoOptimalityrdquo of NAG
NAG is ldquooptimalrdquo because the convergence rate of NAG is thebest possible (possibly up to a constant) among algorithms thatmake use of all encountered gradient information
Example Consider the problem A=toeplitz([2-10middot middot middot 0])
minxisinRn
f(x) =1
2xTAxminus eT1 x
The iteration
x0 = 0 xk+1 = xk +
k983131
j=0
ξjnablaf(xj)
yields
f(xk)minus f(x983183) ge3983042A9830422
32(k + 1)2983042x0 minus x9831839830422 k = 1 2
n
2minus 1
Gradient Methods Lecture 7 April 8 - 10 2020 17 20
6 Prox-Gradient (PG) method
Consider the regularized optimization problem
minxisinRn
φ(x) = f(x) + λψ(x)
where f smooth and convex ψ is convex and λ ge 0
Each step of the prox-gradient method is defined as follows
xk+1 = proxαkλψ(xk minus αknablaf(xk))
for some step length αk gt 0 and
proxαkλψ(x) = argmin
u
983069αkλψ(u) +
1
2983042uminus x9830422
983070
xk+1 is the solution of a quadratic approximation to φ(x)
xk+1 = argminu
nablaf(xk)T(uminus xk) +1
2αk983042uminus xk9830422 + λψ(u)
Gradient Methods Lecture 7 April 8 - 10 2020 18 20
Define
Gα(x) =1
α
983043xminus proxαλψ (xminus αnablaf(x))
983044 α gt 0
Then at step k of prox-gradient method
xk+1 = xk minus αkGαk(xk)
Lemma 10
Suppose that ψ is a closed convex function and that f is convex andLipschitz continuously differentiable with constant L Then
(a) Gα(x) isin nablaf(x) + λpartψ(xminus αGα(x))
(b) For any z and any α isin (0 1L] we have that
φ(xminus αGα(x)) le φ(z) +Gα(x)T(xminus z)minus α
2983042Gα(x)9830422
Gradient Methods Lecture 7 April 8 - 10 2020 19 20
Theorem 11 (Sublinear convergence of PG O(1k))
Suppose that ψ is a closed convex function and that f is convex andLipschitz continuously differentiable with constant L Suppose that
minxisinRn
φ(x) = f(x) + λψ(x) λ ge 0
attains a minimizer x983183 with optimal objective value φ983183 Then if
αk =1
L
for all k in the prox-gradient method we have
φ(xk)minus φ983183 leL983042x0 minus x9831839830422
2k k = 1 2
Gradient Methods Lecture 7 April 8 - 10 2020 20 20
Theorem 5 (Sublinear convergence of DD O(1radick))
Suppose f is Lipschitz continuously differentiable with constant L andf is bounded below by a constant f ie f(x) ge f Consider the linear
search method where dk satisfies
nablaf(xk)Tdk le minusη983042nablaf(xk)983042983042dk983042
for some η gt 0 and αk gt 0 satisfies the weak Wolfe conditions
f(xk + αkdk) le f(xk) + c1αknablaf(xk)Tdk
nablaf(xk + αkdk)Tdk ge c2nablaf(xk)Tdk
at all k for some constants c1 and c2 satisfying 0 lt c1 lt c2 lt 1 Thenfor any integer K ge 1 we have
min0lekleKminus1
983042nablaf(xk)983042 le983158
L
η2c1(1minus c2)
983158f(x0)minus f
K
Gradient Methods Lecture 7 April 8 - 10 2020 7 20
Step lengths satisfying the weak Wolfe conditions
Gradient Methods Lecture 7 April 8 - 10 2020 8 20
Corollary 6
Any accumulation point of xkinfin0 in Theorem 5 is stationary
Remark 1
Let xkinfin0 be the sequence in Theorem 5
If f is convex then any accumulation point of xkinfin0 is a solutionof minxisinRn f(x)
If f is nonconvex then accumulation points of xkinfin0 may be localminimizers saddle points or local maximizier
4 Frank-Wolfe (FW) method conditional gradient
Consider the optimization problem
minxisinΩ
f(x)
where Ω is compact and convex and f is convex and differentiable
Gradient Methods Lecture 7 April 8 - 10 2020 9 20
The conditional gradient method is as follows
vk = argminvisinΩ
vTnablaf(xk)
xk+1 = xk + αk(vk minus xk) αk =
2
k + 2
Theorem 7 (Sublinear convergence of FW O(1k))
Suppose Ω is a compact convex set with diameter D ie
983042xminus y983042 le D for all xy isin Ω
Suppose that f is convex and Lipschitz continuously differentiable in aneighborhood of Ω with Lipschitz constant L FW generates a sequencexkinfin0 that satisfies
f(xk)minus f(x983183) le2LD2
k + 2 k = 1 2
where x983183 is any solution of minxisinΩ f(x)
Gradient Methods Lecture 7 April 8 - 10 2020 10 20
5 The key idea of acceleration is momentum
Consider the iterate
xk+1 = xk minus αknablaf(983144xk)minuskminus1983131
i=0
microkinablaf(983144xi)
Due to more flexibility it may yield better convergence
This is the foundation of momentum method
xk+1 = xkminusαknablaf(983144xk) + βkMomentum
Gradient Methods Lecture 7 April 8 - 10 2020 11 20
51 Heavy-Ball (HB) method
(Figure: step lengths satisfying the weak Wolfe conditions.)
Corollary 6

Any accumulation point of $\{x^k\}_{k=0}^\infty$ in Theorem 5 is stationary.

Remark 1

Let $\{x^k\}_{k=0}^\infty$ be the sequence in Theorem 5.

If $f$ is convex, then any accumulation point of $\{x^k\}_{k=0}^\infty$ is a solution of $\min_{x\in\mathbb{R}^n} f(x)$.

If $f$ is nonconvex, then accumulation points of $\{x^k\}_{k=0}^\infty$ may be local minimizers, saddle points, or local maximizers.
4 Frank-Wolfe (FW) method: conditional gradient

Consider the optimization problem

$$\min_{x\in\Omega} f(x)$$

where $\Omega$ is compact and convex and $f$ is convex and differentiable.
The conditional gradient method is as follows:

$$v^k = \operatorname*{argmin}_{v\in\Omega}\, v^T\nabla f(x^k), \qquad x^{k+1} = x^k + \alpha_k(v^k - x^k), \quad \alpha_k = \frac{2}{k+2}.$$

Theorem 7 (Sublinear convergence of FW, $O(1/k)$)

Suppose $\Omega$ is a compact convex set with diameter $D$, i.e.,

$$\|x - y\| \le D \quad \text{for all } x, y \in \Omega.$$

Suppose that $f$ is convex and Lipschitz continuously differentiable in a neighborhood of $\Omega$ with Lipschitz constant $L$. FW generates a sequence $\{x^k\}_{k=0}^\infty$ that satisfies

$$f(x^k) - f(x_\star) \le \frac{2LD^2}{k+2}, \qquad k = 1, 2, \dots,$$

where $x_\star$ is any solution of $\min_{x\in\Omega} f(x)$.
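As a concrete illustration (not taken from the lecture), here is a minimal NumPy sketch of the conditional gradient iteration when $\Omega$ is the probability simplex, where the linear subproblem is solved by putting all mass on the coordinate with the smallest gradient entry; the function name `frank_wolfe_simplex` and the quadratic test objective are illustrative assumptions.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, n_iters=200):
    """Conditional gradient sketch over the probability simplex."""
    x = x0.copy()
    for k in range(n_iters):
        g = grad(x)
        v = np.zeros_like(x)
        v[np.argmin(g)] = 1.0           # vertex minimizing v^T grad f(x^k)
        alpha = 2.0 / (k + 2.0)         # step length alpha_k = 2/(k+2)
        x = x + alpha * (v - x)
    return x

# Illustrative use: f(x) = 0.5 * ||x - c||^2 restricted to the simplex.
c = np.array([0.2, 0.5, 0.3, -0.1])
x_fw = frank_wolfe_simplex(lambda x: x - c, np.ones(4) / 4)
```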
5 The key idea of acceleration is momentum

Consider the iterate

$$x^{k+1} = x^k - \alpha_k\nabla f(\hat{x}^k) - \sum_{i=0}^{k-1}\mu_{ki}\nabla f(\hat{x}^i).$$

Due to its extra flexibility, it may yield better convergence.

This is the foundation of the momentum method:

$$x^{k+1} = x^k - \alpha_k\nabla f(\hat{x}^k) + \beta_k\,(\text{momentum term}).$$
5.1 Heavy-Ball (HB) method

Each iteration of this method has the form

$$x^{k+1} = x^k - \alpha_k\nabla f(x^k) + \beta_k(x^k - x^{k-1}),$$

where $\alpha_k > 0$ and $\beta_k > 0$ (two-step method).

The HB method is not a descent method: usually $f(x^{k+1}) > f(x^k)$ for many $k$. This property is shared by other momentum methods.

Example. Consider the strongly convex quadratic function

$$\min_{x\in\mathbb{R}^n}\left\{f(x) = \frac{1}{2}x^TAx - b^Tx\right\},$$

where the (constant) Hessian $A$ has eigenvalues in the range $[\gamma, L]$ with $0 < \gamma \le L$. Let

$$\alpha_k = \alpha = \frac{4}{(\sqrt{L}+\sqrt{\gamma})^2}, \qquad \beta_k = \beta = \frac{\sqrt{L}-\sqrt{\gamma}}{\sqrt{L}+\sqrt{\gamma}}.$$
It can be shown that

$$\|x^k - x_\star\| \le C\beta^k,$$

which further yields

$$f(x^k) - f(x_\star) \le \frac{LC^2}{2}\,\beta^{2k}.$$

If $L \gg \gamma$, we have $\beta \approx 1 - 2\sqrt{\gamma/L}$. Therefore the complexity of HB is

$$O\!\left(\sqrt{L/\gamma}\,\log(1/\varepsilon)\right).$$

Convergence of GD:

$$f(x^k) - f(x_\star) \le (1 - \gamma/L)^k\,(f(x^0) - f(x_\star)).$$

The complexity of GD is

$$O\!\left((L/\gamma)\log(1/\varepsilon)\right).$$
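For illustration, a minimal NumPy sketch of the HB iteration on this quadratic example, using the $\alpha$ and $\beta$ above; extracting $\gamma$ and $L$ via `eigvalsh` and the function name are assumptions, not part of the lecture.

```python
import numpy as np

def heavy_ball_quadratic(A, b, x0, n_iters=500):
    """Heavy-Ball iteration for f(x) = 0.5 x^T A x - b^T x with SPD A."""
    eigvals = np.linalg.eigvalsh(A)
    gamma, L = eigvals[0], eigvals[-1]                 # eigenvalues lie in [gamma, L]
    alpha = 4.0 / (np.sqrt(L) + np.sqrt(gamma)) ** 2
    beta = (np.sqrt(L) - np.sqrt(gamma)) / (np.sqrt(L) + np.sqrt(gamma))
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(n_iters):
        grad = A @ x - b                               # gradient of the quadratic
        x_prev, x = x, x - alpha * grad + beta * (x - x_prev)
    return x
```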
5.2 Conjugate Gradient (CG) method

Given SPD $A \in \mathbb{R}^{n\times n}$,

$$Ax = b \quad \text{``}\Leftrightarrow\text{''} \quad \min_{x\in\mathbb{R}^n} \frac{1}{2}x^TAx - b^Tx.$$

Steepest Descent iteration:

$$x^{k+1} = x^k - \frac{(Ax^k - b)^T(Ax^k - b)}{(Ax^k - b)^TA(Ax^k - b)}\,(Ax^k - b).$$

CG iteration:

$$x^{k+1} = x^k + \alpha_k p^k, \quad \text{where } p^k = -\nabla f(x^k) + \xi_k p^{k-1}.$$
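A textbook CG sketch for $Ax = b$ with SPD $A$; here $r = b - Ax$ is the negative gradient, and the residual-norm ratio used for $\xi_k$ is the standard choice, stated as an assumption rather than taken from the slides.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iters=None):
    """CG for Ax = b with SPD A; r = b - Ax = -grad f(x)."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iters or n):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)        # exact step length along p
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p    # xi_k = rs_new / rs_old keeps directions A-conjugate
        rs_old = rs_new
    return x
```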
5.3 Nesterov's Accelerated Gradient (NAG)

NAG iteration:

$$x^{k+1} = x^k - \alpha_k\nabla f\!\left(x^k + \beta_k(x^k - x^{k-1})\right) + \beta_k(x^k - x^{k-1}).$$

Theorem 8 (Sublinear convergence of NAG, $O(1/k^2)$)

Suppose $f$ is convex and Lipschitz continuously differentiable with constant $L$. Suppose the minimum of $f$ is attained at $x_\star$. NAG with $x^0$, $x^1 = x^0 - \nabla f(x^0)/L$, $\alpha_k = 1/L$, and $\beta_k$ defined as follows,

$$\lambda_0 = 0, \qquad \lambda_{k+1} = \frac{1}{2}\left(1 + \sqrt{1 + 4\lambda_k^2}\right), \qquad \beta_k = \frac{\lambda_k - 1}{\lambda_{k+1}},$$

yields $\{x^k\}_{k=0}^\infty$ that satisfies

$$f(x^k) - f(x_\star) \le \frac{2L\|x^0 - x_\star\|^2}{(k+1)^2}, \qquad k = 1, 2, \dots$$
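A minimal sketch of this NAG variant, assuming the loop starts from $\lambda_1 = 1$ (which follows from $\lambda_0 = 0$); the function name and loop structure are illustrative.

```python
import numpy as np

def nag(grad, x0, L, n_iters=500):
    """NAG with alpha_k = 1/L and the lambda_k-based momentum of Theorem 8."""
    x_prev = x0.copy()
    x = x0 - grad(x0) / L                  # x^1 = x^0 - (1/L) grad f(x^0)
    lam = 1.0                              # lambda_1 = 1, since lambda_0 = 0
    for _ in range(n_iters):
        lam_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * lam ** 2))
        beta = (lam - 1.0) / lam_next      # beta_k = (lambda_k - 1) / lambda_{k+1}
        y = x + beta * (x - x_prev)        # extrapolated point
        x_prev, x = x, y - grad(y) / L     # gradient step taken at y
        lam = lam_next
    return x
```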
Theorem 9 (Linear convergence of NAG, $O((1-\sqrt{\gamma/L})^k)$)

Suppose $f$ is Lipschitz continuously differentiable with constant $L$ and strongly convex with modulus of convexity $\gamma$. Suppose the unique minimizer of $f$ is $x_\star$. NAG with $x^0$,

$$x^1 = x^0 - \frac{1}{L}\nabla f(x^0),$$

and

$$\alpha_k = \frac{1}{L}, \qquad \beta_k = \frac{\sqrt{L}-\sqrt{\gamma}}{\sqrt{L}+\sqrt{\gamma}},$$

yields $\{x^k\}_{k=0}^\infty$ that satisfies

$$f(x^k) - f(x_\star) \le \frac{L+\gamma}{2}\,\|x^0 - x_\star\|^2\left(1 - \sqrt{\frac{\gamma}{L}}\right)^k, \qquad k = 1, 2, \dots$$
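The strongly convex variant only changes the momentum coefficient to a constant. A sketch under the same assumptions as the previous snippet (illustrative names; $\gamma$ and $L$ supplied by the caller):

```python
import numpy as np

def nag_strongly_convex(grad, x0, gamma, L, n_iters=500):
    """NAG with constant beta = (sqrt(L) - sqrt(gamma)) / (sqrt(L) + sqrt(gamma))."""
    beta = (np.sqrt(L) - np.sqrt(gamma)) / (np.sqrt(L) + np.sqrt(gamma))
    x_prev = x0.copy()
    x = x0 - grad(x0) / L                  # x^1 as in Theorem 9
    for _ in range(n_iters):
        y = x + beta * (x - x_prev)        # extrapolated point
        x_prev, x = x, y - grad(y) / L     # gradient step at y
    return x
```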
5.4 "Optimality" of NAG

NAG is "optimal" because the convergence rate of NAG is the best possible (possibly up to a constant) among algorithms that make use of all encountered gradient information.

Example. Consider the problem with A = toeplitz([2, -1, 0, ..., 0]):

$$\min_{x\in\mathbb{R}^n} f(x) = \frac{1}{2}x^TAx - e_1^Tx.$$

The iteration

$$x^0 = 0, \qquad x^{k+1} = x^k + \sum_{j=0}^{k}\xi_j\nabla f(x^j)$$

yields

$$f(x^k) - f(x_\star) \ge \frac{3\|A\|_2}{32(k+1)^2}\,\|x^0 - x_\star\|^2, \qquad k = 1, 2, \dots, \frac{n}{2} - 1.$$
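For reference, a short construction of this hard instance with NumPy/SciPy; the dimension $n$ and the helper names are illustrative choices.

```python
import numpy as np
from scipy.linalg import toeplitz

n = 100
col = np.zeros(n)
col[0], col[1] = 2.0, -1.0
A = toeplitz(col)                    # tridiagonal A = toeplitz([2, -1, 0, ..., 0])
e1 = np.zeros(n)
e1[0] = 1.0

f = lambda x: 0.5 * x @ (A @ x) - e1 @ x
grad_f = lambda x: A @ x - e1
x_star = np.linalg.solve(A, e1)      # unique minimizer of the quadratic
```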
6 Prox-Gradient (PG) method

Consider the regularized optimization problem

$$\min_{x\in\mathbb{R}^n} \varphi(x) = f(x) + \lambda\psi(x),$$

where $f$ is smooth and convex, $\psi$ is convex, and $\lambda \ge 0$.

Each step of the prox-gradient method is defined as follows:

$$x^{k+1} = \operatorname{prox}_{\alpha_k\lambda\psi}\!\left(x^k - \alpha_k\nabla f(x^k)\right)$$

for some step length $\alpha_k > 0$, where

$$\operatorname{prox}_{\alpha_k\lambda\psi}(x) = \operatorname*{argmin}_{u}\left\{\alpha_k\lambda\psi(u) + \frac{1}{2}\|u - x\|^2\right\}.$$

$x^{k+1}$ is the solution of a quadratic approximation to $\varphi(x)$:

$$x^{k+1} = \operatorname*{argmin}_{u}\ \nabla f(x^k)^T(u - x^k) + \frac{1}{2\alpha_k}\|u - x^k\|^2 + \lambda\psi(u).$$
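A minimal sketch for the common special case $\psi(\cdot) = \|\cdot\|_1$, whose prox is soft-thresholding; the choice of $\psi$ and the function names are assumptions made for illustration, and the fixed step $\alpha_k = 1/L$ matches Theorem 11 below.

```python
import numpy as np

def soft_threshold(v, tau):
    """prox of tau * ||.||_1, applied componentwise."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_gradient_l1(grad_f, x0, lam, L, n_iters=500):
    """Prox-gradient for phi(x) = f(x) + lam * ||x||_1 with alpha_k = 1/L."""
    x = x0.copy()
    alpha = 1.0 / L
    for _ in range(n_iters):
        x = soft_threshold(x - alpha * grad_f(x), alpha * lam)
    return x
```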
Define

$$G_\alpha(x) = \frac{1}{\alpha}\left(x - \operatorname{prox}_{\alpha\lambda\psi}\left(x - \alpha\nabla f(x)\right)\right), \qquad \alpha > 0.$$

Then, at step $k$ of the prox-gradient method,

$$x^{k+1} = x^k - \alpha_k G_{\alpha_k}(x^k).$$

Lemma 10

Suppose that $\psi$ is a closed convex function and that $f$ is convex and Lipschitz continuously differentiable with constant $L$. Then

(a) $G_\alpha(x) \in \nabla f(x) + \lambda\,\partial\psi(x - \alpha G_\alpha(x))$;

(b) for any $z$ and any $\alpha \in (0, 1/L]$, we have

$$\varphi(x - \alpha G_\alpha(x)) \le \varphi(z) + G_\alpha(x)^T(x - z) - \frac{\alpha}{2}\|G_\alpha(x)\|^2.$$
Theorem 11 (Sublinear convergence of PG, $O(1/k)$)

Suppose that $\psi$ is a closed convex function and that $f$ is convex and Lipschitz continuously differentiable with constant $L$. Suppose that

$$\min_{x\in\mathbb{R}^n} \varphi(x) = f(x) + \lambda\psi(x), \qquad \lambda \ge 0,$$

attains a minimizer $x_\star$ with optimal objective value $\varphi_\star$. Then if $\alpha_k = 1/L$ for all $k$ in the prox-gradient method, we have

$$\varphi(x^k) - \varphi_\star \le \frac{L\|x^0 - x_\star\|^2}{2k}, \qquad k = 1, 2, \dots$$