Lecture 13
Gradient Methods for Constrained Optimization
October 16, 2008
Outline
• Gradient Projection Algorithm
• Convergence Rate
Constrained Minimization
minimize f(x)
subject to x ∈ X
• Assumption 1:
• The function f is convex and continuously differentiable over Rn
• The set X is closed and convex
• The optimal value f∗ = inf_{x∈X} f(x) is finite
• Gradient projection algorithm
xk+1 = PX[xk − αk∇f(xk)]
starting with x0 ∈ X.
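As a concrete illustration, here is a minimal Python sketch of the iteration. The problem data (a quadratic objective and a unit-ball constraint) and all names are made up for the example; `project` is assumed to return the Euclidean projection onto X.

```python
import numpy as np

def projected_gradient(grad_f, project, x0, stepsize, num_iters=1000):
    """Gradient projection: x_{k+1} = P_X[x_k - alpha_k * grad f(x_k)]."""
    x = x0
    for k in range(num_iters):
        x = project(x - stepsize(k) * grad_f(x))
    return x

# Hypothetical example: minimize f(x) = ||x - c||^2 over the unit ball.
c = np.array([2.0, 1.0])
grad_f = lambda x: 2.0 * (x - c)                     # gradient of ||x - c||^2
project = lambda z: z / max(1.0, np.linalg.norm(z))  # projection onto {x : ||x|| <= 1}
x_sol = projected_gradient(grad_f, project, np.zeros(2), stepsize=lambda k: 0.1)
# x_sol is approximately c/||c||, the point of the ball closest to c.
```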
Bounded Gradients
Theorem 1 Let Assumption 1 hold, and suppose that the gradients are uniformly bounded over the set X, i.e., ‖∇f(x)‖ ≤ L for all x ∈ X. Then, the projection gradient method generates the sequence {xk} ⊂ X such that
• When the constant stepsize αk ≡ α is used, we have
lim inf_{k→∞} f(xk) ≤ f∗ + αL2/2
• When a diminishing stepsize is used with ∑_k αk = +∞, we have
lim inf_{k→∞} f(xk) = f∗.
Proof: We use the projection properties and a line of analysis similar to that of the unconstrained method. (HWK 6)
Lipschitz Gradients
• Lipschitz Gradient Lemma For a differentiable convex function f with
Lipschitz gradients, we have for all x, y ∈ Rn,
(1/L)‖∇f(x)−∇f(y)‖2 ≤ (∇f(x)−∇f(y))T(x− y),
where L is a Lipschitz constant.
• Theorem 2 Let Assumption 1 hold, and assume that the gradients of f are Lipschitz continuous over X with constant L. Suppose that the optimal solution set X∗ is not empty. Then, for a constant stepsize αk ≡ α with 0 < α < 2/L, the sequence {xk} converges to an optimal point, i.e.,
lim_{k→∞} ‖xk − x∗‖ = 0 for some x∗ ∈ X∗.
Proof: Fact 1: If z = PX[z − v] for some v ∈ Rn, then z = PX[z − τv] for any τ > 0.
Fact 2: z ∈ X∗ if and only if z = PX[z −∇f(z)].
These facts imply that z ∈ X∗ if and only if z = PX[z − τ∇f(z)] for any
τ > 0.
By using the definition of the method and the preceding relation with
τ = α, we obtain for any z ∈ X∗,
‖xk+1 − z‖2 = ‖PX[xk − α∇f(xk)]− PX[z − α∇f(z)]‖2.
By the non-expansiveness of the projection, it follows that
‖xk+1 − z‖2 ≤ ‖xk − z − α(∇f(xk)−∇f(z))‖2
= ‖xk − z‖2 − 2α(xk − z)T(∇f(xk)−∇f(z)) + α2‖∇f(xk)−∇f(z)‖2.
Using the Lipschitz Gradient Lemma, we obtain for any z ∈ X∗,
‖xk+1 − z‖2 ≤ ‖xk − z‖2 − (α/L)(2− αL)‖∇f(xk)−∇f(z)‖2. (1)
Hence, for all k,
(α/L)(2− αL)‖∇f(xk)−∇f(z)‖2 ≤ ‖xk − z‖2 − ‖xk+1 − z‖2.
By summing the preceding relations from arbitrary K to N, with K < N, we obtain
(α/L)(2− αL) ∑_{k=K}^{N} ‖∇f(xk)−∇f(z)‖2 ≤ ‖xK − z‖2 − ‖xN+1 − z‖2 ≤ ‖xK − z‖2.
In particular, setting K = 0 and letting N →∞, we see that
(α/L)(2− αL) ∑_{k=0}^{∞} ‖∇f(xk)−∇f(z)‖2 ≤ ‖x0 − z‖2 < ∞. (2)
As a consequence, we also have
lim_{k→∞} ∇f(xk) = ∇f(z). (3)
By discarding the non-positive term on the right-hand side of Eq. (1), we
have for any z ∈ X∗ and all k,
‖xk+1 − z‖2 ≤ ‖xk − z‖2 + (2− αL)‖∇f(xk)−∇f(z)‖2.
By summing these relations over k = K, . . . , N for arbitrary K and N with
K < N, we obtain
‖xN+1 − z‖2 ≤ ‖xK − z‖2 + (2− αL) ∑_{k=K}^{N} ‖∇f(xk)−∇f(z)‖2.
Taking limsup as N →∞, we obtain
lim sup_{N→∞} ‖xN+1 − z‖2 ≤ ‖xK − z‖2 + (2− αL) ∑_{k=K}^{∞} ‖∇f(xk)−∇f(z)‖2.
Now, taking liminf as K →∞ yields
lim sup_{N→∞} ‖xN+1 − z‖2 ≤ lim inf_{K→∞} ‖xK − z‖2 + (2− αL) lim_{K→∞} ∑_{k=K}^{∞} ‖∇f(xk)−∇f(z)‖2
= lim inf_{K→∞} ‖xK − z‖2,
where the equality follows in view of the relation in (2). Thus, we have that
the sequence {‖xk − z‖} is convergent for every z ∈ X∗.
By the inequality in Eq. (1), we have that
‖xk − z‖ ≤ ‖x0 − z‖ for all k.
Hence, the sequence {xk} is bounded, and it has an accumulation point.
Since the scalar sequence {‖xk − z‖} is convergent for every z ∈ X∗, the
sequence {xk} must be convergent.
Suppose now that xk → x̄. By considering the definition of the iterate xk+1,
we have
xk+1 = PX[xk − α∇f(xk)].
Letting k → ∞, and using xk → x̄ and the continuity of the gradient ∇f, we obtain
x̄ = PX[x̄− α∇f(x̄)].
In view of Facts 1 and 2, the preceding relation is equivalent to x̄ ∈ X∗. □
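As a numerical illustration of Theorem 2 (a sketch reusing the made-up problem from the earlier snippet): for f(x) = ‖x − c‖2 the gradient is Lipschitz with L = 2, so any constant stepsize α with 0 < α < 1 qualifies.

```python
import numpy as np

# Same hypothetical problem as before: f(x) = ||x - c||^2 over the unit ball.
c = np.array([2.0, 1.0])
grad_f = lambda x: 2.0 * (x - c)
project = lambda z: z / max(1.0, np.linalg.norm(z))

L = 2.0                  # Lipschitz constant of the gradient
alpha = 0.9 * (2.0 / L)  # constant stepsize with 0 < alpha < 2/L
x = np.zeros(2)
for k in range(200):
    x = project(x - alpha * grad_f(x))

x_star = c / np.linalg.norm(c)     # known solution: projection of c onto the ball
print(np.linalg.norm(x - x_star))  # effectively zero: ||x_k - x*|| -> 0
```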
Modes of Convexity: Strict and Strong
• Def. f is strictly convex if for all x ≠ y and α ∈ (0,1) we have
f(αx + (1− α)y) < αf(x) + (1− α)f(y)
• Def. f is strongly convex if there exists a scalar ν > 0 such that
f(αx + (1− α)y) ≤ αf(x) + (1− α)f(y) − (ν/2)α(1− α)‖x− y‖2
for all x, y ∈ Rn and any α ∈ [0,1].
The scalar ν is referred to as the strong convexity constant; the function is said to be strongly convex with constant ν.
Modes of Convexity: Differentiable Function
• Let f : Rn → R be continuously differentiable.
• Modes of convexity can be equivalently characterized in terms of the linearization properties of the function f.
• We have
• f is convex if and only if
f(x) +∇f(x)T(y − x) ≤ f(y) for all x, y ∈ Rn
• f is strictly convex if and only if
f(x) +∇f(x)T(y − x) < f(y) for all x ≠ y
• f is strongly convex with constant ν if and only if
f(x) +∇f(x)T(y − x) + (ν/2)‖y − x‖2 ≤ f(y) for all x, y ∈ Rn
Modes of Convexity: Gradient Mapping
• Let f : Rn → R be continuously differentiable.
• Modes of convexity can be equivalently characterized in terms of the monotonicity properties of the gradient mapping ∇f : Rn → Rn.
• We have
• f is convex if and only if
(∇f(x)−∇f(y))T(x− y) ≥ 0 for all x, y ∈ Rn
• f is strictly convex if and only if
(∇f(x)−∇f(y))T(x− y) > 0 for all x ≠ y
• f is strongly convex with constant ν if and only if
(∇f(x)−∇f(y))T(x− y) ≥ ν ‖x− y‖2 for all x, y ∈ Rn
Modes of Convexity: Twice Differentiable Function
• Let f : Rn → R be twice continuously differentiable.
• Modes of convexity can be equivalently characterized in terms of the definiteness of the Hessians ∇2f(x) for x ∈ Rn.
• We have
• f is convex if and only if
∇2f(x) ≥ 0 for all x ∈ Rn
• f is strictly convex if
∇2f(x) > 0 for all x ∈ Rn
• f is strongly convex with constant ν if and only if
∇2f(x) ≥ ν I for all x ∈ Rn
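These characterizations are easy to verify numerically for a quadratic. The sketch below is an illustration of my own (not from the lecture): for f(x) = (1/2)xTQx with Q symmetric positive definite, ∇2f(x) = Q everywhere, so f is strongly convex with constant ν = λmin(Q).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
Q = A @ A.T + np.eye(3)        # symmetric positive definite Hessian
nu = np.linalg.eigvalsh(Q)[0]  # strong convexity constant: smallest eigenvalue of Q

f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

x, y = rng.standard_normal(3), rng.standard_normal(3)

# Linearization characterization: f(x) + grad(x)^T (y-x) + (nu/2)||y-x||^2 <= f(y)
assert f(x) + grad(x) @ (y - x) + 0.5 * nu * np.linalg.norm(y - x) ** 2 <= f(y) + 1e-12

# Monotonicity characterization: (grad(x) - grad(y))^T (x-y) >= nu ||x-y||^2
assert (grad(x) - grad(y)) @ (x - y) >= nu * np.linalg.norm(x - y) ** 2 - 1e-12
```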
Strong Convexity: Implications
Let f be continuously differentiable and strongly convex∗ over Rn with
constant m
• Implications:
• Lower Bound on f over Rn: for all x, y ∈ Rn
f(y) ≥ f(x) +∇f(x)T(y − x) + (m/2)‖x− y‖2 (4)
– minimizing with respect to y on the right-hand side:
f(y) ≥ f(x) − ‖∇f(x)‖2/(2m)
– taking the minimum over y ∈ Rn:
f(x) − f∗ ≤ ‖∇f(x)‖2/(2m)
• Useful as a stopping criterion (if you know m); see the sketch after these bullets
∗ Strong convexity over Rn can be replaced by strong convexity over a set X; then all the relations remain valid over the set X.
• Relation (4) with x = x0 and f(y) ≤ f(x0) implies that the level
set Lf(f(x0)) is bounded
• Relation (4) also yields for an optimal x∗ and any x ∈ Rn,
(m/2)‖x− x∗‖2 ≤ f(x)− f(x∗)
• Last two bullets: HWK 6 assignment.
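A sketch of the stopping criterion in use on a made-up quadratic (here f(x) = (1/2)xTQx, so x∗ = 0, f∗ = 0, m = λmin(Q), and the gradients are Lipschitz with constant λmax(Q)):

```python
import numpy as np

Q = np.diag([1.0, 10.0])       # Hessian of f(x) = 0.5 x^T Q x; f* = 0
grad = lambda x: Q @ x
m = np.linalg.eigvalsh(Q)[0]   # strong convexity constant
L = np.linalg.eigvalsh(Q)[-1]  # Lipschitz constant of the gradient

tol = 1e-8                     # target accuracy for f(x) - f*
x = np.array([5.0, 5.0])
while np.linalg.norm(grad(x)) ** 2 / (2.0 * m) > tol:
    x = x - (1.0 / L) * grad(x)  # gradient step with alpha = 1/L
# On exit, f(x) - f* <= ||grad f(x)||^2 / (2m) <= tol is guaranteed.
```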
Convergence Rate: Once Differentiable
Theorem 3 Let Assumption 1 hold, and assume that the gradients of f are Lipschitz continuous over X with constant L > 0. Suppose that the function is strongly convex with constant m > 0. Then:
• A solution x∗ exists and it is unique.
• The iterates generated by the gradient projection method with αk ≡ α and 0 < α < 2/L converge to x∗ with a geometric rate, i.e.,
‖xk+1 − x∗‖2 ≤ q ‖xk − x∗‖2 for all k
with q ∈ (0,1) depending on m and L.
Proof: HWK 6.
Convergence Rate: Twice Differentiable
Theorem 4 Let Assumption 1 hold. Assume that the function is twice continuously differentiable and strongly convex with constant m > 0.
Assume also that ∇2f(x) ≤ L I for all x ∈ X. Then:
• A solution x∗ exists and it is unique.
• The iterates generated by the gradient projection method with αk ≡ α and 0 < α < 2/L converge to x∗ with a geometric rate, i.e.,
‖xk+1 − x∗‖ ≤ q ‖xk − x∗‖ for all k
with q = max{|1− αm|, |1− αL|}.
Proof: The q here is different from the one in the preceding theorem. Since
∇2f(x) ≤ L I for all x ∈ X, it follows that the gradients are Lipschitz
continuous over X with constant L. By the definition of the method and
the non-expansive property of the projection, we have for z = x∗ and any
k,
‖xk+1 − x∗‖2 = ‖PX[xk − α∇f(xk)]− PX[x∗ − α∇f(x∗)]‖2
≤ ‖xk − x∗ − α(∇f(xk)−∇f(x∗))‖2. (5)
Mean Value Theorem for vector functions: When g : Rn → Rm is differentiable on the segment [x, y], we have
g(y) = g(x) + (∫_0^1 ∇g(x + τ(y − x)) dτ)(y − x).
Applying this theorem with g = ∇f, y = xk, and x = x∗, we obtain
∇f(xk) = ∇f(x∗) + (∫_0^1 ∇2f(x∗ + τ(xk − x∗)) dτ)(xk − x∗).
Hence,
∇f(xk)−∇f(x∗) = (∫_0^1 ∇2f(x∗ + τ(xk − x∗)) dτ)(xk − x∗). (6)
By introducing the matrix Ak with Ak(xk − x∗) = ∇f(xk)−∇f(x∗) and using this in relation (5), we obtain
‖xk+1 − x∗‖ ≤ ‖(I − αAk)(xk − x∗)‖ ≤ ‖I − αAk‖ ‖xk − x∗‖
The matrix Ak is symmetric, and hence ‖I − αAk‖ is equal to the maximum absolute eigenvalue of I − αAk, i.e.,
‖I − αAk‖ = max{|1− αλmax(Ak)|, |1− αλmin(Ak)|}.
In view of Eq. (6), we have Ak = ∫_0^1 ∇2f(x∗ + τ(xk − x∗)) dτ. By the strong convexity of f, we have ∇2f(x) ≥ mI for all x, while by the given condition, we have ∇2f(x) ≤ L I. Therefore,
λmax(Ak) ≤ L, λmin(Ak) ≥ m,
implying that
‖I − αAk‖ ≤ max{|1− αm|, |1− αL|}. □
• The parameter q is minimized when α∗ = 2/(m + L), in which case
q∗ = (L−m)/(L + m) ⇐⇒ q∗ = (cond(f)− 1)/(cond(f) + 1),
with cond(f) = L/m.
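A quick numerical check of this choice (m and L below are made-up values): the contraction factor q(α) = max{|1 − αm|, |1 − αL|} is minimized at α∗ = 2/(m + L).

```python
import numpy as np

m, L = 1.0, 10.0  # hypothetical strong convexity and Lipschitz constants
q = lambda a: max(abs(1.0 - a * m), abs(1.0 - a * L))

alphas = np.linspace(1e-3, 2.0 / L, 1000)  # admissible stepsizes 0 < alpha < 2/L
best = min(alphas, key=q)
print(best, 2.0 / (m + L))         # both ~0.1818: the grid minimizer matches alpha*
print(q(best), (L - m) / (L + m))  # both ~0.8182: the optimal contraction factor q*
```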
Upper Bound on Hessian and f over the Level Set
For a twice differentiable strongly convex f :
• The level set L0 = {x | f(x) ≤ f(x0)} is bounded
• The maximum eigenvalue of the Hessian ∇2f(x) is a continuous function
of x over L0
• Hence, the maximum eigenvalue of the Hessian is bounded over L0:
there is a constant M such that ∇2f(x) ≤ MI for all x ∈ L0
• Upper Bound on f over L0:
f(y) ≤ f(x) +∇f(x)T(y − x) + (M/2)‖y − x‖2 for all x, y ∈ L0
• minimizing over y ∈ L0 on both sides:
f∗ ≤ f(x) − ‖∇f(x)‖2/(2M) for all x ∈ L0
Condition Number of a Matrix
For a twice differentiable strongly convex f : mI ≤ ∇2f(x) ≤ MI for all x ∈ L0
• The condition number cond(A) of a positive definite matrix A:
cond(A) = (largest eigenvalue of A)/(smallest eigenvalue of A)
• The ratio M/m is an upper bound on the condition number of ∇2f(x) for every x ∈ L0
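In code, for a symmetric positive definite matrix this is a one-liner (a minimal sketch with a made-up matrix):

```python
import numpy as np

def cond(A):
    """Condition number of a symmetric positive definite matrix A:
    largest eigenvalue divided by smallest eigenvalue."""
    eigs = np.linalg.eigvalsh(A)  # eigenvalues in ascending order
    return eigs[-1] / eigs[0]

Q = np.diag([1.0, 4.0, 10.0])  # hypothetical Hessian
print(cond(Q))                 # 10.0; consistent with cond <= M/m for m=1, M=10
```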
Strong Convexity and Condition Number of Level Sets
Assume a minimizer x∗ of f over Rn exists and f is strongly convex.
Consider the level set L0 = {x | f(x) ≤ f(x0)}
• We have seen that mI ≤ ∇2f(x) ≤ MI for all x ∈ L0
• Also, we have
f∗ + (m/2)‖x− x∗‖2 ≤ f(x) ≤ f∗ + (M/2)‖x− x∗‖2
• Hence: Binner ⊆ L0 ⊆ Bouter (see the numerical sketch after this list), where
Binner = {x | ‖x− x∗‖ ≤ √(2(f(x0)− f∗)/M)}
Bouter = {x | ‖x− x∗‖ ≤ √(2(f(x0)− f∗)/m)}
• Therefore, we have a bound on cond(L0):
cond(L0) ≤ M/m
• The condition number of level sets affects the efficiency of the
algorithms
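A numerical sanity check of the ball sandwich on a made-up quadratic (x∗ = 0, f∗ = 0, and m, M taken as the extreme eigenvalues of the Hessian): every point on the boundary of L0 lies between the two radii.

```python
import numpy as np

Q = np.diag([1.0, 9.0])  # f(x) = 0.5 x^T Q x; minimizer x* = 0, f* = 0
f = lambda x: 0.5 * x @ Q @ x
m, M = 1.0, 9.0
x0 = np.array([2.0, 2.0])
gap = f(x0)              # f(x0) - f*

r_inner = np.sqrt(2.0 * gap / M)  # radius of B_inner
r_outer = np.sqrt(2.0 * gap / m)  # radius of B_outer

# Points on the level-set boundary f(x) = f(x0): x = s*d with s = sqrt(2 gap / d^T Q d).
for t in np.linspace(0.0, 2.0 * np.pi, 360):
    d = np.array([np.cos(t), np.sin(t)])
    s = np.sqrt(2.0 * gap / (d @ Q @ d))
    assert r_inner - 1e-12 <= s <= r_outer + 1e-12  # B_inner <= L0 <= B_outer
```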