Achieving Acceleration via Direct Discretization of Heavy-Ball … · 2018-08-15 · Achieving Acceleration via Direct Discretization of Heavy-Ball Ordinary Differential Equation Aryan

Achieving Acceleration via Direct Discretization of Heavy-Ball Ordinary Differential Equation

Aryan Mokhtari

Laboratory for Information and Decision Systems (LIDS), MIT

Joint work with Jingzhao Zhang, Suvrit Sra, and Ali Jadbabaie

DIMACS/TRIPODS Workshop on Optimization and Machine Learning

August 14th, 2018

Supported by ONR-BRC, DARPA FunLoL, DARPA Lagrange, and NSF-IIS

Aryan Mokhtari Achieving Acceleration via Direct Discretization of HB-ODE 1

Problem formulation

I We aim to solve the following convex problem

$$\min_{x \in \mathbb{R}^d} f(x)$$

I $f : \mathbb{R}^d \to \mathbb{R}$ is convex and smooth

I We only have access to first-order information ∇f (x)

I The method of choice in many applications is Gradient Descent (GD)

⇒ Easy to implement, robust against noise, ...

I But, GD suffers from a slow convergence rate ⇒ $f(x_N) - f(x^*) \le O\left(\frac{1}{N}\right)$

I Does not match the oracle lower bound of $O\left(\frac{1}{N^2}\right)$ [Nemirovski et al. 1983]


Acceleration in convex setting

I Is it possible to achieve acceleration?

⇒ Acceleration: Any polynomial rate faster than $O\left(\frac{1}{N}\right)$

I Yes ⇒ Nesterov’s accelerated gradient (NAG) method [Nesterov, 1983]

$$x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k), \qquad y_k = x_k + \gamma_k (x_k - x_{k-1})$$

I f is L-smooth and $\gamma_k$ is properly chosen ⇒ $f(x_N) - f(x^*) \le O\left(\frac{1}{N^2}\right)$

I Nesterov’s original derivation is elegant but (a bit) unintuitive

⇒ The intuition behind it remains somewhat mysterious

⇒ Not easy to generalize

I Can one come up with a generic framework for achieving acceleration?
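The gap between GD and NAG can be seen on a toy problem. The sketch below is illustrative only: the diagonal quadratic test function, the dimension, the iteration count, and the momentum choice $\gamma_k = (k-1)/(k+2)$ are assumptions made for this example, not settings taken from the talk.

```python
# Toy comparison of gradient descent (GD) and Nesterov's accelerated
# gradient (NAG) on an ill-conditioned diagonal quadratic.
# Assumed test problem: f(x) = 0.5 * sum_i a_i * x_i^2 with a_i = 1 / i^2.

def f(x, a):
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))

def grad(x, a):
    return [ai * xi for ai, xi in zip(a, x)]

def gradient_descent(x0, a, L, n_iters):
    x = list(x0)
    for _ in range(n_iters):
        g = grad(x, a)
        x = [xi - gi / L for xi, gi in zip(x, g)]
    return x

def nesterov(x0, a, L, n_iters):
    x_prev, x = list(x0), list(x0)
    for k in range(1, n_iters + 1):
        gamma = (k - 1) / (k + 2)  # assumed momentum schedule
        y = [xi + gamma * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        g = grad(y, a)
        x_prev, x = x, [yi - gi / L for yi, gi in zip(y, g)]
    return x

a = [1.0 / (i * i) for i in range(1, 21)]  # eigenvalues, largest is L = 1
x0 = [1.0] * len(a)
x_gd = gradient_descent(x0, a, L=1.0, n_iters=200)
x_nag = nesterov(x0, a, L=1.0, n_iters=200)
print("GD :", f(x_gd, a))
print("NAG:", f(x_nag, a))
```

Both methods use the same step size 1/L and the same number of gradient evaluations, so any difference in the final objective comes from the momentum term alone.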


Acceleration in convex setting (single-sequence form)

I Substituting $y_k$ into the update writes NAG as a single recursion

$$x_{k+1} = x_k + \gamma_k (x_k - x_{k-1}) - \frac{1}{L}\nabla f\bigl(x_k + \gamma_k (x_k - x_{k-1})\bigr)$$

Literature review

I Linear coupling [Allen-Zhu and Orecchia, 2014]

⇒ Nesterov’s method: Linear coupling of GD and mirror descent

I Integral Quadratic Constraints [Lessard, Recht, Packard, 2014]

⇒ Interpreting first-order methods as linear dynamical systems

I Continuous-time perspective

⇒ Derive a 2nd-order ODE, which is the limit of NAG [Su et al. ’14]

⇒ Bregman Lagrangian to generate accelerated methods [Wibisono et al. ’16]


Continuous-time interpretation

I By taking small step sizes, the exact limit of NAG is a 2nd-order ODE

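$$x_{k+1} = y_k - h\,\nabla f(y_k), \qquad y_k = x_k + \frac{k-1}{k+2}(x_k - x_{k-1})$$

Continuous limit:

I Heavy ball ODE: $\ddot{X}_t + \frac{3}{t}\dot{X}_t + \nabla f(X_t) = 0$ ⇒ $O\left(\frac{1}{t^2}\right)$ rate in CT [SBC ’14]

I $\ddot{X}_t + \frac{p+1}{t}\dot{X}_t + C p^2 t^{p-2}\nabla f(X_t) = 0$ ⇒ $O\left(\frac{1}{t^p}\right)$ rate in CT [WWJ ’16]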

I Acceleration is achievable in continuous time

⇒ But, can’t keep the rate after discretization using standard integrators


Heavy ball ODE

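HB-ODE: $\ddot{X}_t + \frac{3}{t}\dot{X}_t + \nabla f(X_t) = 0$

I First method: (studied in [Wibisono et al. ’16])

$$Z_t = X_t + \frac{t}{2}\dot{X}_t, \qquad \dot{Z}_t = -\frac{t}{2}\nabla f(X_t)$$

$$\text{FE: } x_{k+1} = \frac{2}{k} z_k + \frac{k-2}{k} x_k, \qquad \text{BE: } z_k = z_{k-1} - \frac{h^2}{2}\,k\,\nabla f(x_k)$$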

I Converges faster than GD ⇒ But, the method is unstable!


Heavy ball ODE

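HB-ODE: $\ddot{X}_t + \frac{3}{t}\dot{X}_t + \nabla f(X_t) = 0$

I Second method: Polyak’s HB method

$$V = \dot{X}, \qquad \dot{V} + \frac{3}{t}V + \nabla f(X) = 0$$

$$\text{B-Euler: } x_{k+1} = x_k + h v_{k+1}, \qquad \text{F-Euler: } v_{k+1} = \left(1 - \frac{3}{k}\right) v_k - h \nabla f(x_k)$$

$$x_{k+1} = x_k + \beta_k (x_k - x_{k-1}) - \alpha \nabla f(x_k)$$

⇒ Stable but slow ⇒ Not faster than O(1/N)

I Third method: Nesterov Accelerated Gradient method

$$x_{k+1} = x_k + \beta_k (x_k - x_{k-1}) - \alpha \nabla f\bigl(x_k + \beta_k (x_k - x_{k-1})\bigr)$$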

⇒ Stable and Fast ⇒ But, can’t be associated with any integrator
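For the second method, the step from the two Euler updates to the momentum form can be spelled out. The short derivation below assumes the time grid $t_k = kh$, which is consistent with the $3/k$ factor in the F-Euler update, and uses the B-Euler update at the previous step, $x_k - x_{k-1} = h v_k$.

```latex
\begin{align*}
x_{k+1} - x_k
  &= h\, v_{k+1}
   = \Bigl(1 - \tfrac{3}{k}\Bigr) h\, v_k - h^2 \nabla f(x_k) \\
  &= \Bigl(1 - \tfrac{3}{k}\Bigr) (x_k - x_{k-1}) - h^2 \nabla f(x_k),
\end{align*}
% so the momentum form holds with \beta_k = 1 - 3/k and \alpha = h^2.
```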


Challenge: How to go from continuous to discrete?

I Challenge: Need a stable discretization that achieves acceleration

⇒ Too small step size ⇒ No acceleration

⇒ Too large step size ⇒ Unstable method

I Use symplectic integrators [Betancourt, Jordan, Wilson, 2018]

⇒ Preserving mechanical properties ⇒ Numerically works

I Our result [Zhang, M, Sra, Jadbabaie, 2018]:

⇒ No need for symplectic integrators!

⇒ Acceleration can be achieved using Runge-Kutta integrators

⇒ Can even beat $O\left(\frac{1}{N^2}\right)$ ⇒ If f is smooth and “flat” around minimum

I How do we obtain this result?

⇒ A more powerful Lyapunov function

⇒ Tighter analysis of discretization error


Recap of Runge-Kutta integrators

Definition. Given a dynamical system $\dot{y} = F(y)$, current point $y_0$ and step size $h$, an explicit $S$-stage Runge-Kutta method generates the next step via

$$g_i = y_0 + h \sum_{j=1}^{i-1} a_{ij} F(g_j), \qquad \Phi_h(y_0) = y_0 + h \sum_{i=1}^{S} b_i F(g_i)$$

I $\Phi_h(y_0)$ is the estimate of the state after time step h

I $a_{ij}$ and $b_i$ are suitable coefficients defined by the integrator

I $\{g_i\}_{i=1}^{S}$ are neighboring points where the gradient $F(g_i)$ is evaluated
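The definition above translates almost line by line into code. The following is a minimal sketch, assuming the state is a plain Python list and the coefficients are passed as a lower-triangular list `a` and a weight list `b`; the function name `rk_step` and the tableau constants are illustrative names, not part of the talk.

```python
# One step of an explicit S-stage Runge-Kutta method.
# a[i][j] (j < i) and b[i] are the integrator coefficients.

def rk_step(F, y0, h, a, b):
    slopes = []  # slopes[i] holds F(g_i)
    for i in range(len(b)):
        g = list(y0)
        for j in range(i):
            g = [gk + h * a[i][j] * fk for gk, fk in zip(g, slopes[j])]
        slopes.append(F(g))
    y1 = list(y0)
    for bi, fi in zip(b, slopes):
        y1 = [yk + h * bi * fk for yk, fk in zip(y1, fi)]
    return y1

# Standard coefficient tables for three classical integrators.
EULER_A, EULER_B = [[]], [1.0]
MIDPOINT_A, MIDPOINT_B = [[], [0.5]], [0.0, 1.0]
RK4_A = [[], [0.5], [0.0, 0.5], [0.0, 0.0, 1.0]]
RK4_B = [1.0 / 6, 1.0 / 3, 1.0 / 3, 1.0 / 6]

# Example: one step of size 0.1 on dy/dt = -y starting from y = 1.
decay = lambda y: [-y[0]]
print(rk_step(decay, [1.0], 0.1, RK4_A, RK4_B))
```

Passing the coefficients as data keeps the stepping routine independent of the particular integrator, so changing the order only means swapping the tables.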


Recap of Runge-Kutta integrators

I $\Phi_h(y_0)$ ⇒ Estimate of the state after time step h

I $\varphi_h(y_0)$ ⇒ True solution to the ODE with initial condition $y_0$

I An integrator $\Phi_h(y_0)$ has order s if its discretization error shrinks as

$$\|\Phi_h(y_0) - \varphi_h(y_0)\| = O(h^{s+1}), \quad \text{as } h \to 0.$$

I The explicit Euler’s method: $\Phi_h(y_0) = y_0 + h F(y_0)$

⇒ # stages S = 1 ⇒ RK of order s = 1

I The midpoint method: $\Phi_h(y_0) = y_0 + h F\left(y_0 + \frac{h}{2} F(y_0)\right)$

⇒ # stages S = 2 ⇒ RK of order s = 2

I Minimum number of stages S needed for an explicit RK method of order s:

Order s:    1  2  3  4  5  6  7  8
Stages S:   1  2  3  4  6  7  9  11
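The order of an integrator can be checked empirically: if the one-step error behaves like $h^{s+1}$, halving h should shrink it by roughly $2^{s+1}$. The sketch below does this for Euler and midpoint on the scalar test equation $\dot{y} = -y$, whose exact flow is $e^{-h} y_0$; the test equation and the step size are assumptions chosen for illustration.

```python
import math

def euler_step(F, y, h):
    return y + h * F(y)

def midpoint_step(F, y, h):
    return y + h * F(y + 0.5 * h * F(y))

def estimated_order(step, h=0.1):
    """Estimate s from the ratio of one-step errors at h and h/2."""
    F = lambda y: -y  # assumed test problem, exact flow is exp(-h) * y0
    err = lambda hh: abs(step(F, 1.0, hh) - math.exp(-hh))
    return math.log2(err(h) / err(h / 2)) - 1

print("Euler    order ~", estimated_order(euler_step))
print("Midpoint order ~", estimated_order(midpoint_step))
```

The estimates are close to 1 and 2 but not exact, because the higher-order terms of the error are still visible at a finite step size.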


Assumptions

Assumption 1 There exists an integer p ≥ 2 and a constant L such that

$$f(x) - f(x^*) \ge \frac{1}{L}\,\|\nabla^{(i)} f(x)\|^{\frac{p}{p-i}}, \quad \text{for } i \in \{1, \dots, p-1\}$$

I p is the order of the first non-zero term of the Taylor series around $x^*$

⇒ It quantifies the flatness of the objective around a minimum

I General smooth functions ⇒ p = 2

I $\ell_4$-norm regression $f(x) = \|Ax - b\|_4^4$ ⇒ p = 4

Assumption 2 There exists an integer s ≥ p and a constant M ≥ 0, such that f is (s + 2) times differentiable. Furthermore,

$$\|\nabla^{(i)} f(x)\| \le M, \quad \text{for } i = p, p+1, \dots, s, s+1, s+2.$$
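Assumption 1 is easy to check by hand in one dimension. As a sketch, take the scalar function $f(x) = x^4$ with $x^* = 0$ and $p = 4$ (an example chosen here, not one from the talk): the three ratios $|f^{(i)}(x)|^{p/(p-i)} / (f(x) - f(x^*))$ are constants, and the largest one is the smallest admissible L.

```python
def required_L(f_gap, derivatives, p, xs):
    """Smallest L such that f(x) - f(x*) >= |f^(i)(x)|^(p/(p-i)) / L
    holds at every sample point in xs, for i = 1, ..., p-1."""
    worst = 0.0
    for x in xs:
        for i, d in enumerate(derivatives, start=1):
            worst = max(worst, abs(d(x)) ** (p / (p - i)) / f_gap(x))
    return worst

# f(x) = x^4 with minimizer x* = 0, so f(x) - f(x*) = x^4.
quartic_gap = lambda x: x ** 4
quartic_derivs = [lambda x: 4 * x ** 3, lambda x: 12 * x ** 2, lambda x: 24 * x]
samples = [0.1, 0.5, 1.0, 2.0, -3.0]  # nonzero sample points (assumed)

print(required_L(quartic_gap, quartic_derivs, 4, samples))
```

For this function the binding case is i = 3, where the ratio equals $24^4$ at every point, so the constant does not depend on the sample grid.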


Algorithm

I We use a different second-order ODE

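$$\ddot{x}(t) + \frac{2p+1}{t}\,\dot{x}(t) + p^2 t^{p-2}\,\nabla f(x(t)) = 0$$

I The proposed ODE can be written as a dynamical system

$$\dot{y} = F(y) = \begin{bmatrix} -\frac{2p+1}{t}\,v - p^2 t^{p-2}\,\nabla f(x) \\ v \\ 1 \end{bmatrix}, \quad \text{where } y = \begin{bmatrix} v \\ x \\ t \end{bmatrix} \in \mathbb{R}^{2d+1}$$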

I We use RK integrators to discretize this dynamical system

What are the parameters (knobs)?

I We set p in the ODE to be the largest p satisfying Assumption 1

⇒ To get best theoretical result

I The order s of the integrator must not exceed the largest s satisfying Assumption 2
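The augmented state $y = [v; x; t]$ turns the time-dependent ODE into an autonomous system that any RK routine can consume. A minimal sketch of the vector field follows, assuming the state is stored as a flat Python list of length 2d + 1; `make_field` and `grad_f` are illustrative names.

```python
def make_field(grad_f, p, d):
    """Return F(y) for y = [v (d entries); x (d entries); t]."""
    def F(y):
        v, x, t = y[:d], y[d:2 * d], y[2 * d]
        g = grad_f(x)
        dv = [-(2 * p + 1) / t * vi - p ** 2 * t ** (p - 2) * gi
              for vi, gi in zip(v, g)]
        # x' = v, and the clock component advances at unit rate.
        return dv + list(v) + [1.0]
    return F

# Example with f(x) = 0.5 * ||x||^2, so grad f(x) = x, and p = 2.
F = make_field(lambda x: list(x), p=2, d=2)
print(F([0.0, 0.0, 1.0, 2.0, 1.0]))
```

Carrying t inside the state is what lets the integrator evaluate the damping and gradient coefficients at the intermediate stage times without special handling.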


Algorithm and main result

Algorithm 1: Input(f, x0, p, L, M, s, N)

I Set the initial state $y_0 = [\vec{0}; x_0; 1] \in \mathbb{R}^{2d+1}$

I Set step size $h = C / N^{\frac{1}{s+1}}$

I $x_N$ ← Order-s-Runge-Kutta-Integrator(F, y0, N, h)

I return $x_N$

Theorem If f is convex and Assumptions 1 and 2 are satisfied, then

$$f(x_N) - f(x^*) \le O\left(N^{-p\frac{s}{s+1}}\right)$$

I Flatter objective around a minimum ⇒ Larger p ⇒ Faster rate

I Higher order integrator ⇒ Larger s ⇒ Faster rate
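Algorithm 1 can be sketched end to end in a few lines. The version below is an illustration under stated assumptions, not the authors' implementation: it fixes the integrator to the midpoint method (s = 2), uses C = 1 for the step-size constant, and runs on the toy objective $f(x) = \frac{1}{2}\|x\|^2$ with p = 2.

```python
def direct_discretization(grad_f, x0, p, N, C=1.0):
    """Algorithm 1 with the midpoint method as the order-2 integrator."""
    s = 2  # order of the midpoint method
    d = len(x0)
    h = C / N ** (1.0 / (s + 1))

    def F(y):
        v, x, t = y[:d], y[d:2 * d], y[2 * d]
        g = grad_f(x)
        dv = [-(2 * p + 1) / t * vi - p ** 2 * t ** (p - 2) * gi
              for vi, gi in zip(v, g)]
        return dv + list(v) + [1.0]

    y = [0.0] * d + list(x0) + [1.0]  # y0 = [0; x0; 1]
    for _ in range(N):
        k1 = F(y)
        y_mid = [yi + 0.5 * h * ki for yi, ki in zip(y, k1)]
        k2 = F(y_mid)
        y = [yi + h * ki for yi, ki in zip(y, k2)]
    return y[d:2 * d]

# Toy objective (assumed): f(x) = 0.5 * ||x||^2, minimized at the origin.
objective = lambda x: 0.5 * sum(xi * xi for xi in x)
start = [1.0, -2.0]
x_final = direct_discretization(lambda x: list(x), start, p=2, N=1000)
print(objective(start), objective(x_final))
```

Each iteration costs two gradient evaluations here, one per stage, so N steps of an S-stage integrator use S·N gradient calls in total.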


Main result (special cases)

Corollary If f is convex, L-smooth, and 4th order differentiable, then for p = 2 and s = 2 we have

$$f(x_N) - f(x^*) \le O\left(\frac{1}{N^{4/3}}\right)$$

I Higher order differentiability allows one to use a higher order integrator

⇒ Leads to the optimal $O(1/N^2)$ rate in the limit.

Corollary For $f(x) = \|Ax - b\|_4^4$, for p = 4 and s = 2 we have

$$f(x_N) - f(x^*) \le O\left(\frac{1}{N^{8/3}}\right)$$

I It beats $O(N^{-2})$ when the function is flat around the minimum


Corollary For $f(x) = \|Ax - b\|_4^4$, for p = 4 and s = 4 we have

$$f(x_N) - f(x^*) \le O\left(\frac{1}{N^{16/5}}\right)$$
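The exponents in these corollaries all come from the single formula $p \cdot \frac{s}{s+1}$ in the theorem. A small helper, written here purely as a worked check, reproduces them exactly with rational arithmetic.

```python
from fractions import Fraction

def rate_exponent(p, s):
    """Exponent r in the guarantee f(x_N) - f(x*) <= O(N^(-r))."""
    return Fraction(p * s, s + 1)

for p, s in [(2, 2), (4, 2), (4, 4)]:
    print(f"p={p}, s={s}: O(N^-{rate_exponent(p, s)})")
```

For fixed p the exponent increases toward p as s grows, which is why p = 2 approaches the $O(1/N^2)$ rate only in the limit of high-order integrators.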


Proof sketch

I Lyapunov function (differs from the one in [Su, Boyd, Candes, 2014])

$$\mathcal{E}([v; x; t]) := \frac{t^2}{4p^2}\|v\|^2 + \left\|x + \frac{t}{2p}v - x^*\right\|^2 + t^p\bigl(f(x) - f(x^*)\bigr)$$


I Step 1: $\mathcal{E}$ is non-increasing with time ⇒ $\dot{\mathcal{E}}(y) \le -\frac{t}{p}\|v\|^2 \le 0$
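Step 1 follows from a direct computation. The derivation below is a reconstruction from the definitions of $\mathcal{E}$ and of the ODE ($\dot{x} = v$, $\dot{v} = -\frac{2p+1}{t}v - p^2 t^{p-2}\nabla f(x)$) rather than a transcription of the authors' proof; it uses only convexity of f.

```latex
\begin{align*}
\dot{\mathcal{E}}
 &= \frac{t}{2p^2}\|v\|^2 + \frac{t^2}{2p^2}\langle v, \dot v\rangle
  + 2\Bigl\langle x + \tfrac{t}{2p}v - x^*,\;
      \dot x + \tfrac{1}{2p}v + \tfrac{t}{2p}\dot v \Bigr\rangle \\
 &\quad + p\,t^{p-1}\bigl(f(x) - f(x^*)\bigr)
  + t^p \langle \nabla f(x), \dot x\rangle .
\end{align*}
% Substituting the dynamics gives
%   \dot x + \tfrac{1}{2p}v + \tfrac{t}{2p}\dot v
%     = -\tfrac{p}{2}\, t^{p-1}\, \nabla f(x),
% the three inner products of v with \nabla f(x) cancel, and
\begin{align*}
\dot{\mathcal{E}}
 &= -\frac{t}{p}\|v\|^2
  + p\,t^{p-1}\Bigl[f(x) - f(x^*) - \langle \nabla f(x), x - x^*\rangle\Bigr]
  \;\le\; -\frac{t}{p}\|v\|^2 ,
\end{align*}
% where the bracket is non-positive by convexity of f.
```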


I Step 2: Bound the discretization error

⇒ $\|\Phi_h(y_k) - \varphi_h(y_k)\| \le C' h^{s+1} (M + L + 1) \left[ \frac{[1 + \mathcal{E}(y_k)]^{s+1}}{t_k} + h\,\frac{[1 + \mathcal{E}(y_k)]^{s+2}}{t_k} \right]$

⇒ The value of $\mathcal{E}$ at a discretized point is close to its continuous counterpart


I Step 3: (By combining Steps 1 and 2) $\mathcal{E}$ at the points generated by the discretized ODE does not increase significantly

⇒ $\mathcal{E}(y_N) \le \exp(1)\,\mathcal{E}(y_0) + 1$


I Step 4: Use the definition of $\mathcal{E}$ and Step 3, with $t_k = 1 + hk$, to bound suboptimality

⇒ $f(x_N) - f(x^*) \le \frac{\mathcal{E}(y_N)}{t_N^p} \le \frac{e\,\mathcal{E}(y_0) + 1}{(1 + Nh)^p} \le \frac{(L + M + 1)^p \,(e\,\mathcal{E}(y_0) + 1)^{p+1}}{C\, N^{p\frac{s}{s+1}}}$
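Steps 3 and 4 can be observed numerically on a scalar example. The sketch below tracks $\mathcal{E}$ along midpoint iterates for the assumed toy objective $f(x) = \frac{1}{2}x^2$ (so $x^* = 0$) with p = 2, s = 2, C = 1 and N = 1000; all of these settings are illustrative choices.

```python
def lyapunov(v, x, t, p, f):
    # x* = 0 and f(x*) = 0 for the toy objective used below.
    return (t ** 2 / (4 * p ** 2)) * v ** 2 + (x + t * v / (2 * p)) ** 2 \
        + t ** p * f(x)

def field(v, x, t, p, grad_f):
    return (-(2 * p + 1) / t * v - p ** 2 * t ** (p - 2) * grad_f(x), v, 1.0)

def run(p=2, N=1000, C=1.0, s=2, x0=1.0):
    f = lambda x: 0.5 * x * x
    grad_f = lambda x: x
    h = C / N ** (1.0 / (s + 1))
    v, x, t = 0.0, x0, 1.0  # y0 = [0; x0; 1]
    energies = [lyapunov(v, x, t, p, f)]
    for _ in range(N):
        dv, dx, dt = field(v, x, t, p, grad_f)            # first stage
        dv, dx, dt = field(v + 0.5 * h * dv, x + 0.5 * h * dx,
                           t + 0.5 * h * dt, p, grad_f)   # midpoint stage
        v, x, t = v + h * dv, x + h * dx, t + h * dt
        energies.append(lyapunov(v, x, t, p, f))
    return energies, f(x), t

energies, f_final, t_final = run()
print("E(y0) =", energies[0], " max E =", max(energies))
print("f(x_N) =", f_final, " E(y_N)/t_N^p =", energies[-1] / t_final ** 2)
```

The final line illustrates the first inequality of Step 4: the suboptimality is bounded by the Lyapunov value divided by $t_N^p$, because the other two terms of $\mathcal{E}$ are non-negative.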


Numerical results: General smooth & convex case

I $f([x_1, x_2]) = \|Ax_1 - b\|^2 + \|Cx_2 - d\|_4^4$, where $A, C \in \mathbb{R}^{10 \times 10}$ and $b, d \in \mathbb{R}^{10}$

I The first five entries of b and d are 0 and the rest are 1

I The i-th row of A (C) is drawn i.i.d. from a multivariate Gaussian distribution conditioned on $b_i$ ($d_i$)

I f is convex and satisfies Assumption 1 (flatness) with p = 2

[Figure: log-log convergence plot; axis ticks run from $10^0$ to $10^6$ and from $10^{-15}$ to $10^0$]

I For RK of order s = 4 the convergence rate is almost $O(1/N^2)$


Numerical results: Beating Nesterov’s acceleration

I Consider the objective function $f(x) = \|Ax - b\|_4^4$

I f is convex and satisfies Assumption 1 (flatness) with p = 4

I We use an order s = 2 RK integrator and set $N = 10^6$

I We use $p \in \{1, 2, 4, 8\}$ for the ODE (by our theory p = 8 is not allowed)

[Figure: log-log convergence plot; axis ticks run from $10^0$ to $10^6$ and from $10^{-15}$ to $10^0$]

I For p = 4 and s = 2 direct discretization method is faster than NAG

⇒ Converges at the rate $O(N^{-3})$ ⇒ Better than our guarantee $O(N^{-8/3})$


Conclusions

I Introduced a general framework to achieve acceleration for CVX problems

I Specified conditions for stably discretizing an ODE using RK integrators

I Only by using first-order information (i.e., purely gradient based)

I For general smooth and convex functions

⇒ Better than $O(N^{-1})$

⇒ Becomes close to $O(N^{-2})$ by larger s (in practice: s = 4 is enough)

I Identified a new condition that quantifies the local flatness of CVX functions

I If the degree of flatness is p > 2

⇒ We can obtain rates better than $O(N^{-2})$


References

I A. Nemirovski, D. B. Yudin, and E. R. Dawson, “Problem complexity and method efficiency in optimization,” 1983.

I Y. Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2),” Soviet Mathematics Doklady, volume 27, pp. 372-376, 1983.

I Z. Allen-Zhu and L. Orecchia, “Linear coupling: An ultimate unification of gradient and mirror descent,” arXiv preprint arXiv:1407.1537, 2014.

I L. Lessard, B. Recht, and A. Packard, “Analysis and design of optimization algorithms via integral quadratic constraints,” SIAM Journal on Optimization, 26(1):57-95, 2016.

I W. Su, S. Boyd, and E. Candes, “A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights,” Advances in Neural Information Processing Systems (NIPS), pp. 2510-2518, 2014.

I A. Wibisono, A. C. Wilson, and M. I. Jordan, “A variational perspective on accelerated methods in optimization,” Proceedings of the National Academy of Sciences, 113(47), E7351-E7358, 2016.

I M. Betancourt, M. I. Jordan, and A. C. Wilson, “On symplectic optimization,” arXiv preprint arXiv:1802.03653, 2018.

I J. Zhang, A. Mokhtari, S. Sra, and A. Jadbabaie, “Direct Runge-Kutta discretization achieves acceleration,” arXiv preprint arXiv:1805.00521, 2018.


Note on assumptions

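I Our assumptions need to hold only on

$$\mathcal{A} := \{x \in \mathbb{R}^d \mid \exists x' \in \mathcal{S},\ \|x - x'\| \le 1\},$$

where $\mathcal{S} := \{x \in \mathbb{R}^d \mid f(x) \le \exp(1)\,((f(x_0) - f(x^*) + \|x_0 - x^*\|^2) + 1)\}$

I The iterates always stay in $\mathcal{S}$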


Formal statement of the main theorem

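Theorem If f is convex and Assumptions 1 and 2 are satisfied, then

$$f(x_N) - f(x^*) \le C_2 E_0 \left[\frac{(L + M + 1)\,E_0}{N^{\frac{s}{s+1}}}\right]^p = O\left(N^{-p\frac{s}{s+1}}\right),$$

where $E_0 := f(x_0) - f(x^*) + \|x_0 - x^*\|^2 + 1$ and $h = C_1 (L + M + 1)^{-1} E_0^{-1} N^{-\frac{1}{s+1}}$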

I C1 and C2 only depend on s, p, and the Runge-Kutta integrator

I Don’t need to know C1 ⇒ Using any smaller positive constant is fine

