Coordinate Descent Methods on Huge-Scale Optimization Problems
Zhimin Peng
Optimization Group Meeting
Warm up exercise?
- Q: Why do mathematicians, after a dinner at a Chinese restaurant, always insist on taking the leftovers home?
- A: Because they know the Chinese remainder theorem!
- Q: What does the zero say to the eight?
- A: Nice belt!
Motivation
- Consider the optimization problem
  \[ \min_{x \in \mathbb{R}^N} f(x) \]
- Why coordinate descent (CD) methods?
- Greedy CD based on the maximal absolute value of the gradient:
  1. Choose $i_k = \arg\max_{1 \le i \le n} |\nabla_i f(x_k)|$
  2. Update $x_{k+1} = x_k - \alpha \nabla_{i_k} f(x_k) e_{i_k}$
- What's the problem with it?
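For concreteness, here is a minimal sketch of this greedy rule in Python (the function names, the test quadratic, and the fixed step size alpha are illustrative, not from the slides):

```python
import numpy as np

def greedy_cd(grad, x0, alpha, iters):
    """Greedy coordinate descent: update the coordinate whose
    partial derivative currently has the largest absolute value."""
    x = x0.copy()
    for _ in range(iters):
        g = grad(x)                # full gradient: O(N) work per step
        i = np.argmax(np.abs(g))   # i_k = argmax_i |grad_i f(x_k)|
        x[i] -= alpha * g[i]       # x_{k+1} = x_k - alpha grad_{i_k} f(x_k) e_{i_k}
    return x

# Example: f(x) = 0.5 x^T A x - b^T x with A positive definite
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = greedy_cd(lambda x: A @ x - b, np.zeros(2), alpha=0.3, iters=200)
```

The catch is visible in the sketch: the greedy rule evaluates the full gradient at every iteration, which can cost as much as a full gradient step; the randomized rules below avoid this.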
Huge-scale problems?
Sources:
- Internet, telecommunication
- Finite element schemes, weather prediction
Features:
- Expensive function evaluation
- Huge data
Conclusion: We need CD methods!
Unconstrained Optimization
\[ \min_{x \in \mathbb{R}^N} f(x) \]
Notation:
- Decomposition of $\mathbb{R}^N$:
  \[ \mathbb{R}^N = \bigotimes_{i=1}^{n} \mathbb{R}^{n_i} \]
- Partition of the unit matrix:
  \[ I_N = (U_1, U_2, \ldots, U_n) \in \mathbb{R}^{N \times N}, \qquad U_i \in \mathbb{R}^{N \times n_i} \]
- $x = (x^{(1)}, x^{(2)}, \ldots, x^{(n)})^T \in \mathbb{R}^N$ can be represented as
  \[ x = \sum_{i=1}^{n} U_i x^{(i)} \]
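To make the block notation concrete, here is a small numerical check of the partition identity (block sizes chosen arbitrarily for illustration):

```python
import numpy as np

N, sizes = 6, [2, 3, 1]                 # N = 6 split into blocks of size n_i
I = np.eye(N)
offsets = np.cumsum([0] + sizes)
# U_i: the N x n_i column blocks of the identity matrix
U = [I[:, offsets[i]:offsets[i + 1]] for i in range(len(sizes))]

x = np.arange(1.0, N + 1)
blocks = [Ui.T @ x for Ui in U]         # x^{(i)} = U_i^T x
x_rebuilt = sum(Ui @ xi for Ui, xi in zip(U, blocks))
assert np.allclose(x, x_rebuilt)        # x = sum_i U_i x^{(i)}
```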
More notation...
- Partial gradient of $f$:
  \[ f'_i(x) = U_i^T \nabla f(x) \in \mathbb{R}^{n_i} \]
- Assume that the gradient of $f$ is coordinatewise Lipschitz continuous:
  \[ \| f'_i(x + U_i h_i) - f'_i(x) \|^*_{(i)} \le L_i \| h_i \|_{(i)} \]
  where $\|s\|^* = \max_{\|x\| = 1} \langle s, x \rangle$ is the dual norm.
- Optimal coordinate step:
  \[ T_i(x) = x - \frac{1}{L_i} U_i \big( f'_i(x) \big)^{\#} \]
  where $s^{\#} \in \arg\max_x \big\{ \langle s, x \rangle - \tfrac{1}{2} \|x\|^2 \big\}$.
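With plain Euclidean block norms the $\#$ map is the identity, so $T_i$ reduces to a scaled block gradient step; a minimal sketch under that assumption (names illustrative):

```python
import numpy as np

def coordinate_step(x, grad, U, L, i):
    """Optimal coordinate step T_i(x) = x - (1/L_i) U_i f'_i(x),
    assuming Euclidean block norms (so the # map is the identity)."""
    gi = U[i].T @ grad(x)                  # partial gradient f'_i(x) = U_i^T grad f(x)
    return x - (1.0 / L[i]) * (U[i] @ gi)
```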
More notation...
- A new norm:
  \[ \|x\|_{[\alpha]} = \Big[ \sum_{i=1}^{n} L_i^{\alpha} \, \|x^{(i)}\|_{(i)}^2 \Big]^{1/2} \]
  where $\| \cdot \|_{(i)}$ is some fixed norm.
- Random counter $A_\alpha$, $\alpha \in \mathbb{R}$, which generates a random number $i \in \{1, \ldots, n\}$ with probability
  \[ p_\alpha(i) = \frac{L_i^{\alpha}}{\sum_j L_j^{\alpha}} \]
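The counter $A_\alpha$ is one line with numpy (a sketch; names illustrative):

```python
import numpy as np

def sample_block(L, alpha, rng=np.random.default_rng()):
    """Draw i in {0, ..., n-1} with probability L_i^alpha / sum_j L_j^alpha."""
    p = np.asarray(L, dtype=float) ** alpha
    return rng.choice(len(p), p=p / p.sum())
```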
Method RCDM(α, x_0)
Algorithm:
1. Choose $i_k = A_\alpha$
2. Update $x_{k+1} = T_{i_k}(x_k)$

Theorem
For any $k \ge 0$, we have
\[ E[f(x_k)] - f^* \le \frac{2}{k + 4} \cdot \Big( \sum_j L_j^{\alpha} \Big) \cdot R^2_{1-\alpha}(x_0) \]
where $R_\beta(x_0) = \max_x \{ \max_{x^* \in X^*} \|x - x^*\|_{[\beta]} : f(x) \le f(x_0) \}$.

Comments: $R_\beta(x_0)$ measures the distance between the initial point $x_0$ and the optimal set $X^*$; it grows with the distance between $x_0$ and $X^*$.
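Putting the pieces together, a minimal sketch of RCDM(α, x_0) for the quadratic $f(x) = \tfrac{1}{2} x^T A x - b^T x$ with single-coordinate blocks, where $f'_i(x) = (Ax - b)_i$ and $L_i = A_{ii}$ (the test problem and all names are illustrative):

```python
import numpy as np

def rcdm(A, b, x0, alpha=0.0, iters=1000, rng=np.random.default_rng(0)):
    """RCDM(alpha, x0) for f(x) = 0.5 x^T A x - b^T x with 1-d blocks."""
    L = np.diag(A).astype(float)
    p = L ** alpha
    p /= p.sum()                        # p_alpha(i) proportional to L_i^alpha
    x, g = x0.copy(), A @ x0 - b        # maintain the residual g = Ax - b
    for _ in range(iters):
        i = rng.choice(len(L), p=p)     # i_k = A_alpha
        step = g[i] / L[i]              # T_i: x_i <- x_i - f'_i(x) / L_i
        x[i] -= step
        g -= step * A[:, i]             # O(N) residual update, no full gradient
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = rcdm(A, b, np.zeros(2), alpha=1.0)  # converges to A^{-1} b
```

Each iteration touches a single column of A; this per-iteration cheapness, not the iteration count, is what makes CD attractive at huge scale.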
Proof
- Key inequality 1 (from the Lipschitz gradient inequality):
  \[ f(x) - f(T_i(x)) \ge \frac{1}{2 L_i} \big( \|f'_i(x)\|^*_{(i)} \big)^2 \]
- Key inequality 2 (taking the expectation over $i_k = A_\alpha$):
  \[ f(x_k) - E_{i_k}[f(x_{k+1})] \ge \frac{1}{2 S_\alpha} \big( \|f'(x_k)\|^*_{[1-\alpha]} \big)^2, \qquad S_\alpha = \sum_j L_j^{\alpha} \]

Combining the previous key inequalities with convexity of $f$, we obtain the claimed bound.
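A reconstructed sketch of that combination step, following the standard argument (with $\delta_k = E[f(x_k)] - f^*$): convexity and the dual-norm bound give
\[ f(x_k) - f^* \le \langle f'(x_k), x_k - x^* \rangle \le \|f'(x_k)\|^*_{[1-\alpha]} \, R_{1-\alpha}(x_0) \]
so key inequality 2 yields the recursion
\[ \delta_k - \delta_{k+1} \ge \frac{\delta_k^2}{2 \, S_\alpha \, R^2_{1-\alpha}(x_0)} \]
and summing this recursion gives $\delta_k \le \frac{2 \, S_\alpha \, R^2_{1-\alpha}(x_0)}{k + 4}$.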
Convergence for strongly convex functions
- Strongly convex functions:
  \[ f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2} \sigma(f) \|y - x\|^2 \]
  where $\sigma = \sigma(f)$ is the convexity parameter.

Theorem
Let the function $f$ be strongly convex with respect to the norm $\| \cdot \|_{[1-\alpha]}$ with convexity parameter $\sigma_{1-\alpha} = \sigma_{1-\alpha}(f) > 0$. Then, for the sequence $\{x_k\}$ generated by RCDM we have
\[ E[f(x_k)] - f^* \le \Big( 1 - \frac{\sigma_{1-\alpha}(f)}{S_\alpha(f)} \Big)^k \big( f(x_0) - f^* \big) \]

Proof:
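A reconstructed sketch: strong convexity with respect to $\| \cdot \|_{[1-\alpha]}$ implies
\[ f(x_k) - f^* \le \frac{1}{2 \sigma_{1-\alpha}(f)} \big( \|f'(x_k)\|^*_{[1-\alpha]} \big)^2 \]
so key inequality 2 gives
\[ f(x_k) - E_{i_k}[f(x_{k+1})] \ge \frac{\sigma_{1-\alpha}(f)}{S_\alpha(f)} \big( f(x_k) - f^* \big) \]
and taking full expectation and iterating yields the claimed linear rate.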
- Expected quality is good!
- How about the result of a single run?
- Define the function $f_\mu(x)$ by
  \[ f_\mu(x) = f(x) + \frac{\mu}{2} \|x - x_0\|^2_{[1]} \]
- $f_\mu(x)$ is strongly convex with respect to $\| \cdot \|_{[1]}$
- $f_\mu(x)$ has convexity parameter $\mu$

Theorem
Let us define $\mu = \frac{\epsilon}{4 R_1^2(x_0)}$ and choose
\[ k \ge 1 + \frac{2}{\mu} \ln \frac{1}{2 \mu (1 - \beta)} \]
If the random point $x_k$ is generated by RCDM(0, x_0) as applied to the function $f_\mu$, then
\[ \mathrm{Prob}\big( f(x_k) - f^* \le \epsilon \big) \ge \beta \]

Comments: The second inequality is derived from the property of strongly convex functions.
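A quick numeric sanity check of these parameter choices (the values of $\epsilon$, $\beta$, and $R_1(x_0)$ are purely illustrative):

```python
import math

eps, beta, R1 = 1e-2, 0.9, 1.0      # target accuracy, confidence, R_1(x_0)
mu = eps / (4 * R1 ** 2)            # mu = eps / (4 R_1^2(x_0))
k = 1 + (2 / mu) * math.log(1 / (2 * mu * (1 - beta)))
print(mu, math.ceil(k))             # k grows like O((1/eps) log(1/eps))
```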
Accelerated Coordinate Descent
Consider the following scheme, applied to a strongly convex function with a given convexity parameter $\sigma$:

Convergence
Based on the previous accelerated algorithm, we have the following convergence theorem:
Constrained optimization
- Consider the constrained minimization problem
  \[ \min_{x \in Q} f(x) \]
- $Q = \bigotimes_{i=1}^{n} Q_i$, where the $Q_i \subseteq \mathbb{R}^{n_i}$ are closed and convex
- $f(x)$ is convex and satisfies the smoothness assumption
  \[ \| f'_i(x + U_i h_i) - f'_i(x) \|^*_{(i)} \le L_i \| h_i \|_{(i)} \]
- Algorithm (see the sketch after this list):
  (1) Choose $i$ randomly by the uniform distribution on $\{1, \ldots, n\}$
  (2) Solve the block subproblem
      \[ u^{(i)} = \arg\min_{u^{(i)} \in Q_i} \; \langle f'_i(x_k), u^{(i)} - x_k^{(i)} \rangle + \frac{L_i}{2} \| u^{(i)} - x_k^{(i)} \|^2_{(i)} \]
  (3) Update $x_{k+1} = x_k + U_i \big( u^{(i)} - x_k^{(i)} \big)$
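With Euclidean block norms and 1-d blocks, step (2) is a clipped gradient step; a minimal sketch for box constraints $Q_i = [lo_i, hi_i]$ (the box form of $Q$ and all names are illustrative assumptions):

```python
import numpy as np

def constrained_cd_step(x, grad, L, lo, hi, rng=np.random.default_rng()):
    """One uniform-random CD step on min f(x) s.t. lo <= x <= hi.
    With 1-d blocks and Euclidean norms, step (2) becomes
    u = proj_{[lo_i, hi_i]}(x_i - f'_i(x) / L_i)."""
    i = rng.integers(len(x))                             # (1) uniform choice
    u = np.clip(x[i] - grad(x)[i] / L[i], lo[i], hi[i])  # (2) block subproblem
    x = x.copy()
    x[i] = u                                             # (3) update block i
    return x
```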
Theorem
For any $k \ge 0$ we have
\[ \phi_k - f^* \le \frac{n}{n + k} \cdot \Big( \frac{1}{2} R^2_1(x_0) + f(x_0) - f^* \Big) \]
If $f$ is strongly convex in $\| \cdot \|_{[1]}$ with constant $\sigma$, then
\[ \phi_k - f^* \le \Big( 1 - \frac{2 \sigma}{n (1 + \sigma)} \Big)^k \cdot \Big( \frac{1}{2} R^2_1(x_0) + f(x_0) - f^* \Big) \]
where $\phi_k = E[f(x_k)]$.
Implementation

Google problem
- Let $E \in \mathbb{R}^{n \times n}$ be the incidence matrix of a graph;
- $\bar{E} = E \cdot \mathrm{diag}(E^T e)^{-1}$;
- Google problem:
  \[ \min_x \; \frac{1}{2} \| \bar{E} x - x \|^2 + \frac{\gamma}{2} \big[ \langle e, x \rangle - 1 \big]^2 \]
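A minimal sketch of setting up this objective for a small random graph (the link matrix, its density, $n$, and $\gamma$ are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 100, 1.0
E = (rng.random((n, n)) < 0.05).astype(float)  # illustrative 0/1 link matrix
np.fill_diagonal(E, 0.0)
e = np.ones(n)
cols = np.maximum(E.T @ e, 1.0)                # column sums E^T e (guard zeros)
Ebar = E / cols                                # Ebar = E diag(E^T e)^{-1}

def f(x):
    """Google problem: 0.5 ||Ebar x - x||^2 + (gamma/2) (<e, x> - 1)^2."""
    return 0.5 * np.linalg.norm(Ebar @ x - x) ** 2 + 0.5 * gamma * (e @ x - 1) ** 2
```

Since $\bar{E}$ is column-stochastic, a minimizer with $\bar{E} x = x$ and $\langle e, x \rangle = 1$ is a PageRank-type stationary vector, which is why the slide calls this the Google problem.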