Coordinate Descent Methods on Huge-Scale Optimization Problems
Zhimin Peng
Optimization Group Meeting
Warm up exercise?
- Q: Why do mathematicians, after a dinner at a Chinese restaurant, always insist on taking the leftovers home?
- A: Because they know the Chinese remainder theorem!
- Q: What does the zero say to the eight?
- A: Nice belt!
Motivation
- Consider the optimization problem
  \[ \min_{x \in \mathbb{R}^N} f(x) \]
- Why coordinate descent (CD) methods?
- Greedy CD based on the maximal absolute value of the gradient:
  1. Choose $i_k = \arg\max_{1 \le i \le n} |\nabla_i f(x_k)|$
  2. Update $x_{k+1} = x_k - \alpha \nabla_{i_k} f(x_k) e_{i_k}$
- What's the problem with it?
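For concreteness, here is a minimal sketch of this greedy rule in Python (the function names, the test quadratic, and the fixed step size alpha are illustrative, not from the slides):

```python
import numpy as np

def greedy_cd(grad, x0, alpha, iters):
    """Greedy coordinate descent: update the coordinate whose
    partial derivative currently has the largest absolute value."""
    x = x0.copy()
    for _ in range(iters):
        g = grad(x)                # full gradient: O(N) work per step
        i = np.argmax(np.abs(g))   # i_k = argmax_i |grad_i f(x_k)|
        x[i] -= alpha * g[i]       # x_{k+1} = x_k - alpha grad_{i_k} f(x_k) e_{i_k}
    return x

# Example: f(x) = 0.5 x^T A x - b^T x with A positive definite
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = greedy_cd(lambda x: A @ x - b, np.zeros(2), alpha=0.3, iters=200)
```

The catch is visible in the sketch: the greedy rule evaluates the full gradient at every iteration, which can cost as much as a full gradient step; the randomized rules below avoid this.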
Huge-scale problems?
Sources:
- Internet, telecommunication
- Finite element schemes, weather prediction
Features:
- Expensive function evaluation
- Huge data
Conclusion: We need CD methods!
Unconstrained Optimization
\[ \min_{x \in \mathbb{R}^N} f(x) \]
Notation:
- Decomposition of $\mathbb{R}^N$:
  \[ \mathbb{R}^N = \bigotimes_{i=1}^{n} \mathbb{R}^{n_i} \]
- Partition of the unit matrix:
  \[ I_N = (U_1, U_2, \ldots, U_n) \in \mathbb{R}^{N \times N}, \qquad U_i \in \mathbb{R}^{N \times n_i} \]
- $x = (x^{(1)}, x^{(2)}, \ldots, x^{(n)})^T \in \mathbb{R}^N$ can be represented as
  \[ x = \sum_{i=1}^{n} U_i x^{(i)} \]
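To make the block notation concrete, here is a small numerical check of the partition identity (block sizes chosen arbitrarily for illustration):

```python
import numpy as np

N, sizes = 6, [2, 3, 1]                 # N = 6 split into blocks of size n_i
I = np.eye(N)
offsets = np.cumsum([0] + sizes)
# U_i: the N x n_i column blocks of the identity matrix
U = [I[:, offsets[i]:offsets[i + 1]] for i in range(len(sizes))]

x = np.arange(1.0, N + 1)
blocks = [Ui.T @ x for Ui in U]         # x^{(i)} = U_i^T x
x_rebuilt = sum(Ui @ xi for Ui, xi in zip(U, blocks))
assert np.allclose(x, x_rebuilt)        # x = sum_i U_i x^{(i)}
```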
More notation...
- Partial gradient of $f$:
  \[ f'_i(x) = U_i^T \nabla f(x) \in \mathbb{R}^{n_i} \]
- Assume that the gradient of $f$ is coordinatewise Lipschitz continuous:
  \[ \| f'_i(x + U_i h_i) - f'_i(x) \|^*_{(i)} \le L_i \| h_i \|_{(i)} \]
  where $\|s\|^* = \max_{\|x\| = 1} \langle s, x \rangle$ is the dual norm.
- Optimal coordinate step:
  \[ T_i(x) = x - \frac{1}{L_i} U_i \big( f'_i(x) \big)^{\#} \]
  where $s^{\#} \in \arg\max_x \big\{ \langle s, x \rangle - \tfrac{1}{2} \|x\|^2 \big\}$.
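With plain Euclidean block norms the $\#$ map is the identity, so $T_i$ reduces to a scaled block gradient step; a minimal sketch under that assumption (names illustrative):

```python
import numpy as np

def coordinate_step(x, grad, U, L, i):
    """Optimal coordinate step T_i(x) = x - (1/L_i) U_i f'_i(x),
    assuming Euclidean block norms (so the # map is the identity)."""
    gi = U[i].T @ grad(x)                  # partial gradient f'_i(x) = U_i^T grad f(x)
    return x - (1.0 / L[i]) * (U[i] @ gi)
```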
More notation...
- A new norm:
  \[ \|x\|_{[\alpha]} = \Big[ \sum_{i=1}^{n} L_i^{\alpha} \, \|x^{(i)}\|_{(i)}^2 \Big]^{1/2} \]
  where $\| \cdot \|_{(i)}$ is some fixed norm.
- Random counter $A_\alpha$, $\alpha \in \mathbb{R}$, which generates a random number $i \in \{1, \ldots, n\}$ with probability
  \[ p_\alpha(i) = \frac{L_i^{\alpha}}{\sum_j L_j^{\alpha}} \]
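The counter $A_\alpha$ is one line with numpy (a sketch; names illustrative):

```python
import numpy as np

def sample_block(L, alpha, rng=np.random.default_rng()):
    """Draw i in {0, ..., n-1} with probability L_i^alpha / sum_j L_j^alpha."""
    p = np.asarray(L, dtype=float) ** alpha
    return rng.choice(len(p), p=p / p.sum())
```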
Method RCDM(α, x_0)
Algorithm:
1. Choose $i_k = A_\alpha$
2. Update $x_{k+1} = T_{i_k}(x_k)$

Theorem
For any $k \ge 0$, we have
\[ E[f(x_k)] - f^* \le \frac{2}{k + 4} \cdot \Big( \sum_j L_j^{\alpha} \Big) \cdot R^2_{1-\alpha}(x_0) \]
where $R_\beta(x_0) = \max_x \{ \max_{x^* \in X^*} \|x - x^*\|_{[\beta]} : f(x) \le f(x_0) \}$.

Comments: $R_\beta(x_0)$ measures the distance between the initial point $x_0$ and the optimal set $X^*$; it grows with the distance between $x_0$ and $X^*$.
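Putting the pieces together, a minimal sketch of RCDM(α, x_0) for the quadratic $f(x) = \tfrac{1}{2} x^T A x - b^T x$ with single-coordinate blocks, where $f'_i(x) = (Ax - b)_i$ and $L_i = A_{ii}$ (the test problem and all names are illustrative):

```python
import numpy as np

def rcdm(A, b, x0, alpha=0.0, iters=1000, rng=np.random.default_rng(0)):
    """RCDM(alpha, x0) for f(x) = 0.5 x^T A x - b^T x with 1-d blocks."""
    L = np.diag(A).astype(float)
    p = L ** alpha
    p /= p.sum()                        # p_alpha(i) proportional to L_i^alpha
    x, g = x0.copy(), A @ x0 - b        # maintain the residual g = Ax - b
    for _ in range(iters):
        i = rng.choice(len(L), p=p)     # i_k = A_alpha
        step = g[i] / L[i]              # T_i: x_i <- x_i - f'_i(x) / L_i
        x[i] -= step
        g -= step * A[:, i]             # O(N) residual update, no full gradient
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = rcdm(A, b, np.zeros(2), alpha=1.0)  # converges to A^{-1} b
```

Each iteration touches a single column of A; this per-iteration cheapness, not the iteration count, is what makes CD attractive at huge scale.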
Proof
- Key inequality 1 (from the Lipschitz gradient inequality):
  \[ f(x) - f(T_i(x)) \ge \frac{1}{2 L_i} \big( \|f'_i(x)\|^*_{(i)} \big)^2 \]
- Key inequality 2 (taking the expectation over $i_k = A_\alpha$):
  \[ f(x_k) - E_{i_k}[f(x_{k+1})] \ge \frac{1}{2 S_\alpha} \big( \|f'(x_k)\|^*_{[1-\alpha]} \big)^2, \qquad S_\alpha = \sum_j L_j^{\alpha} \]

Combining the previous key inequalities with convexity of $f$, we obtain the claimed bound.
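A reconstructed sketch of that combination step, following the standard argument (with $\delta_k = E[f(x_k)] - f^*$): convexity and the dual-norm bound give
\[ f(x_k) - f^* \le \langle f'(x_k), x_k - x^* \rangle \le \|f'(x_k)\|^*_{[1-\alpha]} \, R_{1-\alpha}(x_0) \]
so key inequality 2 yields the recursion
\[ \delta_k - \delta_{k+1} \ge \frac{\delta_k^2}{2 \, S_\alpha \, R^2_{1-\alpha}(x_0)} \]
and summing this recursion gives $\delta_k \le \frac{2 \, S_\alpha \, R^2_{1-\alpha}(x_0)}{k + 4}$.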
Convergence for strongly convex functions
- Strongly convex functions:
  \[ f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2} \sigma(f) \|y - x\|^2 \]
  where $\sigma = \sigma(f)$ is the convexity parameter.

Theorem
Let the function $f$ be strongly convex with respect to the norm $\| \cdot \|_{[1-\alpha]}$ with convexity parameter $\sigma_{1-\alpha} = \sigma_{1-\alpha}(f) > 0$. Then, for the sequence $\{x_k\}$ generated by RCDM we have
\[ E[f(x_k)] - f^* \le \Big( 1 - \frac{\sigma_{1-\alpha}(f)}{S_\alpha(f)} \Big)^k \big( f(x_0) - f^* \big) \]

Proof:
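A reconstructed sketch: strong convexity with respect to $\| \cdot \|_{[1-\alpha]}$ implies
\[ f(x_k) - f^* \le \frac{1}{2 \sigma_{1-\alpha}(f)} \big( \|f'(x_k)\|^*_{[1-\alpha]} \big)^2 \]
so key inequality 2 gives
\[ f(x_k) - E_{i_k}[f(x_{k+1})] \ge \frac{\sigma_{1-\alpha}(f)}{S_\alpha(f)} \big( f(x_k) - f^* \big) \]
and taking full expectation and iterating yields the claimed linear rate.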
- Expected quality is good!
- How about the result of a single run?
- Define the function $f_\mu(x)$ by
  \[ f_\mu(x) = f(x) + \frac{\mu}{2} \|x - x_0\|^2_{[1]} \]
- $f_\mu(x)$ is strongly convex with respect to $\| \cdot \|_{[1]}$
- $f_\mu(x)$ has convexity parameter $\mu$

Theorem
Let us define $\mu = \frac{\epsilon}{4 R_1^2(x_0)}$ and choose
\[ k \ge 1 + \frac{2}{\mu} \ln \frac{1}{2 \mu (1 - \beta)} \]
If the random point $x_k$ is generated by RCDM(0, x_0) as applied to the function $f_\mu$, then
\[ \mathrm{Prob}\big( f(x_k) - f^* \le \epsilon \big) \ge \beta \]

Comments: The second inequality is derived from the property of strongly convex functions.
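A quick numeric sanity check of these parameter choices (the values of $\epsilon$, $\beta$, and $R_1(x_0)$ are purely illustrative):

```python
import math

eps, beta, R1 = 1e-2, 0.9, 1.0      # target accuracy, confidence, R_1(x_0)
mu = eps / (4 * R1 ** 2)            # mu = eps / (4 R_1^2(x_0))
k = 1 + (2 / mu) * math.log(1 / (2 * mu * (1 - beta)))
print(mu, math.ceil(k))             # k grows like O((1/eps) log(1/eps))
```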
Accelerated Coordinate Descent
Consider the following scheme, applied to a strongly convex function with a given convexity parameter $\sigma$:

Convergence
Based on the previous accelerated algorithm, we have the following convergence theorem:
Constrained optimization
- Consider the constrained minimization problem
  \[ \min_{x \in Q} f(x) \]
- $Q = \bigotimes_{i=1}^{n} Q_i$, where the $Q_i \subseteq \mathbb{R}^{n_i}$ are closed and convex
- $f(x)$ is convex and satisfies the smoothness assumption
  \[ \| f'_i(x + U_i h_i) - f'_i(x) \|^*_{(i)} \le L_i \| h_i \|_{(i)} \]
- Algorithm (see the sketch after this list):
  (1) Choose $i$ randomly by the uniform distribution on $\{1, \ldots, n\}$
  (2) Solve the block subproblem
      \[ u^{(i)} = \arg\min_{u^{(i)} \in Q_i} \; \langle f'_i(x_k), u^{(i)} - x_k^{(i)} \rangle + \frac{L_i}{2} \| u^{(i)} - x_k^{(i)} \|^2_{(i)} \]
  (3) Update $x_{k+1} = x_k + U_i \big( u^{(i)} - x_k^{(i)} \big)$
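With Euclidean block norms and 1-d blocks, step (2) is a clipped gradient step; a minimal sketch for box constraints $Q_i = [lo_i, hi_i]$ (the box form of $Q$ and all names are illustrative assumptions):

```python
import numpy as np

def constrained_cd_step(x, grad, L, lo, hi, rng=np.random.default_rng()):
    """One uniform-random CD step on min f(x) s.t. lo <= x <= hi.
    With 1-d blocks and Euclidean norms, step (2) becomes
    u = proj_{[lo_i, hi_i]}(x_i - f'_i(x) / L_i)."""
    i = rng.integers(len(x))                             # (1) uniform choice
    u = np.clip(x[i] - grad(x)[i] / L[i], lo[i], hi[i])  # (2) block subproblem
    x = x.copy()
    x[i] = u                                             # (3) update block i
    return x
```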
Theorem
For any $k \ge 0$ we have
\[ \phi_k - f^* \le \frac{n}{n + k} \cdot \Big( \frac{1}{2} R^2_1(x_0) + f(x_0) - f^* \Big) \]
If $f$ is strongly convex in $\| \cdot \|_{[1]}$ with constant $\sigma$, then
\[ \phi_k - f^* \le \Big( 1 - \frac{2 \sigma}{n (1 + \sigma)} \Big)^k \cdot \Big( \frac{1}{2} R^2_1(x_0) + f(x_0) - f^* \Big) \]
where $\phi_k = E[f(x_k)]$.
Implementation

Google problem
- Let $E \in \mathbb{R}^{n \times n}$ be the incidence matrix of a graph;
- $\bar{E} = E \cdot \mathrm{diag}(E^T e)^{-1}$;
- Google problem:
  \[ \min_x \; \frac{1}{2} \| \bar{E} x - x \|^2 + \frac{\gamma}{2} \big[ \langle e, x \rangle - 1 \big]^2 \]
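A minimal sketch of setting up this objective for a small random graph (the link matrix, its density, $n$, and $\gamma$ are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 100, 1.0
E = (rng.random((n, n)) < 0.05).astype(float)  # illustrative 0/1 link matrix
np.fill_diagonal(E, 0.0)
e = np.ones(n)
cols = np.maximum(E.T @ e, 1.0)                # column sums E^T e (guard zeros)
Ebar = E / cols                                # Ebar = E diag(E^T e)^{-1}

def f(x):
    """Google problem: 0.5 ||Ebar x - x||^2 + (gamma/2) (<e, x> - 1)^2."""
    return 0.5 * np.linalg.norm(Ebar @ x - x) ** 2 + 0.5 * gamma * (e @ x - 1) ** 2
```

Since $\bar{E}$ is column-stochastic, a minimizer with $\bar{E} x = x$ and $\langle e, x \rangle = 1$ is a PageRank-type stationary vector, which is why the slide calls this the Google problem.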