Achieving Acceleration via Direct Discretization of Heavy-Ball Ordinary Differential Equation
Aryan Mokhtari
Laboratory for Information and Decision Systems (LIDS), MIT
Joint work with Jingzhao Zhang, Suvrit Sra, and Ali Jadbabaie
DIMACS/TRIPODS Workshop on Optimization and Machine Learning
August 14th, 2018
Supported by ONR-BRC, DARPA FunLoL, DARPA Lagrange, and NSF-IIS
Aryan Mokhtari Achieving Acceleration via Direct Discretization of HB-ODE 1
Problem formulation
I We aim to solve the following convex problem
$$\min_{x \in \mathbb{R}^d} f(x)$$
I $f : \mathbb{R}^d \to \mathbb{R}$ is convex and smooth
I We only have access to first-order information $\nabla f(x)$
I The method of choice in many applications is Gradient Descent (GD)
⇒ Easy to implement, robust against noise, ...
I But GD suffers from a slow convergence rate ⇒ $f(x_N) - f(x^*) \le O(1/N)$
I This does not match the oracle lower bound of $\Omega(1/N^2)$ [Nemirovski et al. 1983]
Acceleration in convex setting
I Is it possible to achieve acceleration?
⇒ Acceleration: Any polynomial rate faster than $O(1/N)$
I Yes ⇒ Nesterov’s accelerated gradient (NAG) method [Nesterov, 1983]
$$x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k), \qquad y_k = x_k + \gamma_k (x_k - x_{k-1})$$
⇒ Equivalently, as a single update:
$$x_{k+1} = x_k + \gamma_k (x_k - x_{k-1}) - \frac{1}{L}\nabla f\big(x_k + \gamma_k (x_k - x_{k-1})\big)$$
I If $f$ is $L$-smooth and $\gamma_k$ is properly chosen ⇒ $f(x_N) - f(x^*) \le O(1/N^2)$
I Nesterov’s original derivation is elegant but (a bit) unintuitive
⇒ The intuition behind it remains somewhat mysterious
⇒ Not easy to generalize
I Can one come up with a generic framework for achieving acceleration?
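To make the gap between the two rates concrete, here is a minimal sketch that runs GD and the NAG update above on a toy problem. The diagonal quadratic objective, the starting point, and the momentum choice $\gamma_k = (k-1)/(k+2)$ are assumptions picked for illustration, not settings taken from the talk.

```python
# Illustrative sketch: GD vs. NAG on an assumed toy quadratic
# f(x) = 0.5 * sum_i a_i * x_i^2 with minimizer x* = 0.

def f(x, a):
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))

def grad(x, a):
    return [ai * xi for ai, xi in zip(a, x)]

def gradient_descent(x0, a, L, n_iters):
    x = list(x0)
    for _ in range(n_iters):
        g = grad(x, a)
        x = [xi - gi / L for xi, gi in zip(x, g)]  # x_{k+1} = x_k - (1/L) grad f(x_k)
    return x

def nag(x0, a, L, n_iters):
    x_prev, x = list(x0), list(x0)
    for k in range(1, n_iters + 1):
        gamma = (k - 1) / (k + 2)  # assumed momentum schedule
        # y_k = x_k + gamma_k (x_k - x_{k-1})
        y = [xi + gamma * (xi - xp) for xi, xp in zip(x, x_prev)]
        g = grad(y, a)
        # x_{k+1} = y_k - (1/L) grad f(y_k)
        x_prev, x = x, [yi - gi / L for yi, gi in zip(y, g)]
    return x

# Ill-conditioned toy problem: curvatures spread between 1e-3 and 1.
a = [1e-3 * 10 ** (3 * i / 19) for i in range(20)]
L = max(a)
x0 = [1.0] * 20

f_gd = f(gradient_descent(x0, a, L, 200), a)
f_nag = f(nag(x0, a, L, 200), a)
print("GD :", f_gd)
print("NAG:", f_nag)
```

On this assumed problem the NAG iterate ends with a smaller objective value than GD after the same number of gradient evaluations.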
Literature review
I Linear coupling [Allen-Zhu and Orecchia, 2014]
⇒ Nesterov’s method: Linear coupling of GD and mirror descent
I Integral Quadratic Constraints [Lessard, Recht, Packard, 2014]
⇒ Interpreting first-order methods as linear dynamical systems
I Continuous-time perspective
⇒ Derive a 2nd-order ODE, which is the limit of NAG [Su et al. ’14]
⇒ Bregman Lagrangian to generate accelerated methods [Wibisono et al. ’16]
Continuous-time interpretation
I By taking small step sizes, the exact limit of NAG is a 2nd-order ODE
$$x_{k+1} = y_k - h\nabla f(y_k), \qquad y_k = x_k + \frac{k-1}{k+2}(x_k - x_{k-1})$$
I Continuous limit (heavy-ball ODE): $O(1/t^2)$ rate in continuous time (CT) [SBC ’14]
$$\ddot{X}_t + \frac{3}{t}\dot{X}_t + \nabla f(X_t) = 0$$
I Generalized ODE: $O(1/t^p)$ rate in CT [WWJ ’16]
$$\ddot{X}_t + \frac{p+1}{t}\dot{X}_t + C p^2 t^{p-2}\nabla f(X_t) = 0$$
I Acceleration is achievable in continuous time
⇒ But the rate cannot be kept after discretization with standard integrators
Heavy ball ODE
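$$\text{HB-ODE:}\quad \ddot{X}_t + \frac{3}{t}\dot{X}_t + \nabla f(X_t) = 0$$
I First method: (studied in [Wibisono et al. ’16])
$$Z_t = X_t + \frac{t}{2}\dot{X}_t, \qquad \dot{Z}_t = -\frac{t}{2}\nabla f(X_t)$$
$$\text{Forward Euler (FE):}\quad x_{k+1} = \frac{2}{k} z_k + \frac{k-2}{k} x_k, \qquad \text{Backward Euler (BE):}\quad z_k = z_{k-1} - \frac{h^2}{2} k \nabla f(x_k)$$
I Converges faster than GD ⇒ But the method is unstable!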
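I Second method: Polyak’s HB method
$$V = \dot{X}, \qquad \dot{V} + \frac{3}{t}V + \nabla f(X) = 0$$
$$\text{B-Euler:}\quad x_{k+1} = x_k + h v_{k+1}, \qquad \text{F-Euler:}\quad v_{k+1} = \Big(1 - \frac{3}{k}\Big) v_k - h\nabla f(x_k)$$
$$x_{k+1} = x_k + \beta_k (x_k - x_{k-1}) - \alpha\nabla f(x_k)$$
⇒ Stable but slow ⇒ Not faster than $O(1/N)$
I Third method: Nesterov Accelerated Gradient method
$$x_{k+1} = x_k + \beta_k (x_k - x_{k-1}) - \alpha\nabla f\big(x_k + \beta_k (x_k - x_{k-1})\big)$$
⇒ Stable and fast ⇒ But it cannot be associated with any integrator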
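The link between the Euler discretization and Polyak's momentum form can be checked numerically. The sketch below assumes a scalar toy objective $f(x) = x^2/2$, a starting index `k0 = 4`, and zero initial velocity (all illustrative choices). Eliminating $v_k$ via $h v_k = x_k - x_{k-1}$ gives $\beta_k = 1 - 3/k$ and $\alpha = h^2$, so the two loops should produce the same iterates.

```python
# Sketch: the forward/backward Euler scheme for the HB-ODE and Polyak's
# momentum form generate the same iterates when beta_k = 1 - 3/k, alpha = h^2.
# The objective f(x) = 0.5 * x^2 and the starting index k0 are assumptions.

def grad_f(x):
    return x  # gradient of the assumed toy objective 0.5 * x^2

def euler_form(x0, h, n_iters, k0=4):
    x, v = x0, 0.0
    xs = [x]
    for k in range(k0, k0 + n_iters):
        v = (1.0 - 3.0 / k) * v - h * grad_f(x)  # forward Euler on the velocity
        x = x + h * v                            # backward Euler on the position (uses new v)
        xs.append(x)
    return xs

def momentum_form(x0, h, n_iters, k0=4):
    x_prev, x = x0, x0  # zero initial velocity corresponds to x_{-1} = x_0
    xs = [x]
    for k in range(k0, k0 + n_iters):
        beta, alpha = 1.0 - 3.0 / k, h * h
        x_prev, x = x, x + beta * (x - x_prev) - alpha * grad_f(x)
        xs.append(x)
    return xs

xs_euler = euler_form(1.0, 0.1, 50)
xs_momentum = momentum_form(1.0, 0.1, 50)
max_gap = max(abs(a - b) for a, b in zip(xs_euler, xs_momentum))
print("largest difference between the two forms:", max_gap)
```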
Challenge: How to go from continuous to discrete?
I Challenge: Need a stable discretization that achieves acceleration
⇒ Too small step size ⇒ No acceleration
⇒ Too large step size ⇒ Unstable method
I Use symplectic integrators [Betancourt, Jordan, Wilson, 2018]
⇒ Preserving mechanical properties ⇒ Numerically works
I Our result [Zhang, M, Sra, Jadbabaie, 2018]:
⇒ No need for symplectic integrators!
⇒ Acceleration can be achieved using Runge-Kutta integrators
⇒ Can even beat $O(1/N^2)$ ⇒ If $f$ is smooth and “flat” around the minimum
I How do we obtain this result?
⇒ A more powerful Lyapunov function
⇒ Tighter analysis of discretization error
Recap of Runge-Kutta integrators
Definition. Given a dynamical system $\dot{y} = F(y)$, current point $y_0$ and step size $h$, an explicit $S$-stage Runge-Kutta method generates the next step via
$$g_i = y_0 + h\sum_{j=1}^{i-1} a_{ij} F(g_j), \qquad \Phi_h(y_0) = y_0 + h\sum_{i=1}^{S} b_i F(g_i).$$
I $\Phi_h(y_0)$ is the estimate of the state after time step $h$
I $a_{ij}$ and $b_i$ are suitable coefficients defined by the integrator
I $\{g_i\}_{i=1}^{S}$ are neighboring points where the vector field $F(g_i)$ is evaluated
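The definition above translates almost line by line into code. The following is a minimal sketch of one explicit Runge-Kutta step; the function name `rk_step` and the list-based state are illustrative choices, and the coefficient tables are the standard explicit Euler, midpoint, and classical RK4 tableaux.

```python
# Sketch of one explicit S-stage Runge-Kutta step, following the definition:
#   g_i = y0 + h * sum_{j<i} a_ij F(g_j),  Phi_h(y0) = y0 + h * sum_i b_i F(g_i)

def rk_step(F, y0, h, A, b):
    """A is the strictly lower-triangular coefficient table, b the weights."""
    S, n = len(b), len(y0)
    slopes = []  # stores F(g_1), ..., F(g_S)
    for i in range(S):
        g_i = [y0[m] + h * sum(A[i][j] * slopes[j][m] for j in range(i))
               for m in range(n)]
        slopes.append(F(g_i))
    return [y0[m] + h * sum(b[i] * slopes[i][m] for i in range(S))
            for m in range(n)]

# Standard coefficient tables (A, b).
EULER = ([[0.0]], [1.0])
MIDPOINT = ([[0.0, 0.0], [0.5, 0.0]], [0.0, 1.0])
RK4 = ([[0.0, 0.0, 0.0, 0.0],
        [0.5, 0.0, 0.0, 0.0],
        [0.0, 0.5, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0]],
       [1 / 6, 1 / 3, 1 / 3, 1 / 6])

# Usage on the assumed test system y' = -y with y(0) = 1 and h = 0.1.
decay = lambda y: [-y[0]]
y_euler = rk_step(decay, [1.0], 0.1, *EULER)
y_mid = rk_step(decay, [1.0], 0.1, *MIDPOINT)
y_rk4 = rk_step(decay, [1.0], 0.1, *RK4)
print(y_euler, y_mid, y_rk4)
```

Passing the tableau as data keeps the stepping routine independent of the particular integrator, which is convenient when comparing integrators of different order.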
I $\Phi_h(y_0)$ ⇒ Estimate of the state after time step $h$
I $\varphi_h(y_0)$ ⇒ True solution to the ODE with initial condition $y_0$
I An integrator $\Phi_h(y_0)$ has order $s$ if its discretization error shrinks as
$$\|\Phi_h(y_0) - \varphi_h(y_0)\| = O(h^{s+1}), \quad \text{as } h \to 0.$$
I The explicit Euler method: $\Phi_h(y_0) = y_0 + hF(y_0)$
⇒ # stages $S = 1$ ⇒ RK of order $s = 1$
I The midpoint method: $\Phi_h(y_0) = y_0 + hF\big(y_0 + \tfrac{h}{2}F(y_0)\big)$
⇒ # stages $S = 2$ ⇒ RK of order $s = 2$
I Minimum number of stages $S$ needed by an explicit RK method of order $s$:
order s:   1  2  3  4  5  6  7  8
stages S:  1  2  3  4  6  7  9  11
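The order of an integrator can be estimated empirically from the definition: halving $h$ should shrink the one-step error by roughly $2^{s+1}$. The sketch below does this for the explicit Euler and midpoint methods on the assumed scalar test system $\dot{y} = -y$, whose exact flow is $e^{-t} y_0$; the helper name `estimated_order` is illustrative.

```python
import math

# Sketch: estimate the order s from the one-step error ||Phi_h - phi_h|| = O(h^(s+1)).

def euler(F, y0, h):
    return y0 + h * F(y0)

def midpoint(F, y0, h):
    return y0 + h * F(y0 + 0.5 * h * F(y0))

def estimated_order(step, h=0.1):
    F = lambda y: -y                   # assumed test system y' = -y
    exact = lambda t: math.exp(-t)     # true solution started at y0 = 1
    err_h = abs(step(F, 1.0, h) - exact(h))
    err_half = abs(step(F, 1.0, h / 2) - exact(h / 2))
    # err ~ c * h^(s+1), so log2(err_h / err_half) ~ s + 1
    return math.log2(err_h / err_half) - 1.0

order_euler = estimated_order(euler)
order_midpoint = estimated_order(midpoint)
print("Euler order ~", round(order_euler, 2))
print("Midpoint order ~", round(order_midpoint, 2))
```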
Assumptions
Assumption 1 There exists an integer $p \ge 2$ and a constant $L$ such that
$$f(x) - f(x^*) \ge \frac{1}{L}\|\nabla^{(i)} f(x)\|^{\frac{p}{p-i}}, \quad \text{for } i \in \{1, \dots, p-1\}$$
I $p$ is the order of the first non-zero term of the Taylor series around $x^*$
⇒ It quantifies the flatness of the objective around a minimum
I General smooth functions ⇒ $p = 2$
I $\ell_4$-norm regression $f(x) = \|Ax - b\|_4^4$ ⇒ $p = 4$
Assumption 2 There exists an integer $s \ge p$ and a constant $M \ge 0$ such that $f$ is order $(s+2)$ differentiable. Furthermore,
$$\|\nabla^{(i)} f(x)\| \le M, \quad \text{for } i = p, p+1, \dots, s, s+1, s+2.$$
Algorithm
I We use a different second-order ODE
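$$\ddot{x}(t) + \frac{2p+1}{t}\dot{x}(t) + p^2 t^{p-2}\nabla f(x(t)) = 0$$
I The proposed ODE can be written as a dynamical system
$$\dot{y} = F(y) = \begin{bmatrix} -\frac{2p+1}{t} v - p^2 t^{p-2}\nabla f(x) \\ v \\ 1 \end{bmatrix}, \quad \text{where } y = \begin{bmatrix} v \\ x \\ t \end{bmatrix} \in \mathbb{R}^{2d+1}$$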
I We use RK integrators to discretize this dynamical system
What are the parameters (knobs)?
I We set $p$ in the ODE to be the largest $p$ satisfying Assumption 1
⇒ To get the best theoretical result
I The order $s$ of the integrator can be at most the largest $s$ satisfying Assumption 2
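As a concrete reading of the dynamical system above, here is a minimal sketch of the vector field $F$ acting on the stacked state $y = [v; x; t]$. The factory name `make_vector_field`, the flat-list state layout, and the toy gradient used in the usage lines are assumptions for illustration.

```python
# Sketch of the vector field F(y) for the state y = [v; x; t] in R^(2d+1):
#   dv/dt = -(2p+1)/t * v - p^2 * t^(p-2) * grad f(x),  dx/dt = v,  dt/dt = 1

def make_vector_field(grad_f, p, d):
    def F(y):
        v, x, t = y[:d], y[d:2 * d], y[2 * d]
        g = grad_f(x)
        dv = [-(2 * p + 1) / t * vi - p * p * t ** (p - 2) * gi
              for vi, gi in zip(v, g)]
        return dv + list(v) + [1.0]
    return F

# Usage with the assumed toy objective f(x) = 0.5 * ||x||^2, so grad f(x) = x.
F = make_vector_field(lambda x: list(x), p=2, d=2)
out_start = F([0.0, 0.0, 1.0, 2.0, 1.0])  # v = 0, x = (1, 2), t = 1
out_later = F([1.0, 0.0, 1.0, 2.0, 2.0])  # v = (1, 0), x = (1, 2), t = 2
print(out_start)
print(out_later)
```

Carrying $t$ as the last state coordinate makes the system autonomous, which is what lets a standard Runge-Kutta routine be applied without special handling of time.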
Algorithm and main result
Algorithm 1: Input($f, x_0, p, L, M, s, N$)
I Set the initial state $y_0 = [\vec{0}; x_0; 1] \in \mathbb{R}^{2d+1}$
I Set step size $h = C / N^{\frac{1}{s+1}}$
I $x_N \leftarrow$ Order-$s$-Runge-Kutta-Integrator($F, y_0, N, h$)
I return $x_N$
Theorem If $f$ is convex and Assumptions 1 and 2 are satisfied, then
$$f(x_N) - f(x^*) \le O\big(N^{-p\frac{s}{s+1}}\big)$$
I Flatter objective around a minimum ⇒ Larger p ⇒ Faster rate
I Higher order integrator ⇒ Larger s ⇒ Faster rate
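Algorithm 1 can be sketched end to end in a few lines. The version below is an illustration under stated assumptions, not the authors' implementation: it uses the classical RK4 scheme as the order $s = 4$ integrator, sets the unspecified constant to $C = 1$, and runs on a toy diagonal quadratic with $p = 2$.

```python
# Sketch of Algorithm 1 with classical RK4 (order s = 4) as the integrator.
# The constant C = 1, the toy quadratic, and N = 1000 are assumptions.

S_ORDER = 4  # order of classical RK4

def make_F(grad_f, p, d):
    def F(y):
        v, x, t = y[:d], y[d:2 * d], y[2 * d]
        g = grad_f(x)
        dv = [-(2 * p + 1) / t * vi - p * p * t ** (p - 2) * gi
              for vi, gi in zip(v, g)]
        return dv + list(v) + [1.0]
    return F

def rk4_step(F, y, h):
    k1 = F(y)
    k2 = F([yi + 0.5 * h * ki for yi, ki in zip(y, k1)])
    k3 = F([yi + 0.5 * h * ki for yi, ki in zip(y, k2)])
    k4 = F([yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6 * (a + 2 * b + 2 * c + e)
            for yi, a, b, c, e in zip(y, k1, k2, k3, k4)]

def direct_rk(grad_f, x0, p, N, C=1.0):
    d = len(x0)
    F = make_F(grad_f, p, d)
    y = [0.0] * d + list(x0) + [1.0]      # y0 = [0; x0; 1]
    h = C / N ** (1.0 / (S_ORDER + 1))    # h = C / N^(1/(s+1))
    for _ in range(N):
        y = rk4_step(F, y, h)
    return y[d:2 * d]                     # extract x_N from the state

# Assumed toy objective f(x) = 0.5 * sum_i a_i x_i^2 with minimum value 0.
a = [1.0, 0.1]
f = lambda x: 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))
grad_f = lambda x: [ai * xi for ai, xi in zip(a, x)]

x0 = [1.0, 1.0]
x_N = direct_rk(grad_f, x0, p=2, N=1000)
f_init, f_final = f(x0), f(x_N)
print("f(x0) =", f_init, " f(x_N) =", f_final)
```

Only gradient evaluations of $f$ are used; each RK4 step costs four of them, so $N$ steps use $4N$ gradient calls.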
Main result (special cases)
Corollary If $f$ is convex, $L$-smooth, and 4th order differentiable, then for $p = 2$ and $s = 2$ we have
$$f(x_N) - f(x^*) \le O\Big(\frac{1}{N^{4/3}}\Big)$$
I Higher order differentiability allows one to use a higher order integrator
⇒ Leads to the optimal $O(1/N^2)$ rate in the limit.
Corollary For $f(x) = \|Ax - b\|_4^4$, for $p = 4$ and $s = 2$ we have
$$f(x_N) - f(x^*) \le O\Big(\frac{1}{N^{8/3}}\Big)$$
Corollary For $f(x) = \|Ax - b\|_4^4$, for $p = 4$ and $s = 4$ we have
$$f(x_N) - f(x^*) \le O\Big(\frac{1}{N^{16/5}}\Big)$$
I It beats $O(N^{-2})$ when the function is flat around the minimum
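The exponents in these corollaries all come from the theorem's rate $N^{-p\,s/(s+1)}$. A small sketch that evaluates the exponent exactly (the helper name `rate_exponent` is illustrative):

```python
from fractions import Fraction

# Sketch: exponent p * s / (s + 1) in the guaranteed rate O(N^(-p s / (s + 1))).

def rate_exponent(p, s):
    return Fraction(p * s, s + 1)

for p, s in [(2, 2), (4, 2), (4, 4)]:
    print(f"p = {p}, s = {s}: rate O(N^-({rate_exponent(p, s)}))")

# For p = 2 the exponent 2s/(s+1) approaches 2 as s grows but stays below it.
exponents_p2 = [rate_exponent(2, s) for s in range(1, 9)]
print([str(e) for e in exponents_p2])
```

This shows why $p = 2$ only approaches the $O(1/N^2)$ rate in the limit of large $s$, while $p = 4$ already exceeds exponent 2 at $s = 2$.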
Proof sketch
I Lyapunov function (differs from the one in [Su, Boyd, Candes, 2014])
$$E([v; x; t]) := \frac{t^2}{4p^2}\|v\|^2 + \Big\|x + \frac{t}{2p}v - x^*\Big\|^2 + t^p\big(f(x) - f(x^*)\big).$$
I Step 1: $E$ is non-increasing with time ⇒ $\dot{E}(y) \le -\frac{t}{p}\|v\|^2 \le 0$
I Step 2: Bound the discretization error
⇒ $\|\Phi_h(y_k) - \varphi_h(y_k)\| \le C' h^{s+1}(M+L+1)\Big[\frac{(1+E(y_k))^{s+1}}{t_k} + h\,\frac{(1+E(y_k))^{s+2}}{t_k}\Big]$, where $t_k = 1 + hk$
⇒ The value of $E$ at a discretized point is close to its continuous counterpart
I Step 3: (By combining Steps 1 and 2) $E$ does not increase significantly along the points generated by the discretized ODE
⇒ $E(y_N) \le \exp(1)\, E(y_0) + 1$
I Step 4: Use the definition of $E$ and Step 3 to bound suboptimality
⇒ $f(x_N) - f(x^*) \le \frac{E(y_N)}{t_N^p} \le \frac{eE(y_0)+1}{(1+Nh)^p} \le \frac{(L+M+1)^p (eE(y_0)+1)^{p+1}}{C N^{p\frac{s}{s+1}}}$
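The role of the Lyapunov function can be illustrated numerically. The sketch below tracks $E$ along an RK4 discretization of the ODE for an assumed scalar objective $f(x) = x^2/2$ with $p = 2$, $x^* = 0$, and step size $h = 0.05$ (all illustrative choices), and compares the largest observed value with the Step 3 bound $e\,E(y_0) + 1$.

```python
import math

# Sketch: evaluate the Lyapunov function E along discretized iterates.
# Objective, p, step size, and iteration count are assumptions.

P = 2  # flatness order assumed for the toy quadratic

def f(x):
    return 0.5 * x * x

def grad_f(x):
    return x

def F(y):
    v, x, t = y
    return (-(2 * P + 1) / t * v - P * P * t ** (P - 2) * grad_f(x), v, 1.0)

def lyapunov(y, x_star=0.0):
    v, x, t = y
    return ((t * t) / (4 * P * P) * v * v
            + (x + t / (2 * P) * v - x_star) ** 2
            + t ** P * (f(x) - f(x_star)))

def rk4_step(y, h):
    shift = lambda k, c: tuple(yi + c * ki for yi, ki in zip(y, k))
    k1 = F(y)
    k2 = F(shift(k1, h / 2))
    k3 = F(shift(k2, h / 2))
    k4 = F(shift(k3, h))
    return tuple(yi + h / 6 * (a + 2 * b + 2 * c + d)
                 for yi, a, b, c, d in zip(y, k1, k2, k3, k4))

y = (0.0, 1.0, 1.0)  # y0 = [v; x; t] = [0; x0; 1] with x0 = 1
E0 = lyapunov(y)
energies = [E0]
for _ in range(2000):
    y = rk4_step(y, 0.05)
    energies.append(lyapunov(y))

bound = math.e * E0 + 1  # right-hand side of Step 3
print("E(y0) =", E0, " max_k E(y_k) =", max(energies), " bound =", bound)
```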
Numerical results: General smooth & convex case
I $f([x_1, x_2]) = \|Ax_1 - b\|^2 + \|Cx_2 - d\|_4^4$, where $A, C \in \mathbb{R}^{10 \times 10}$ and $b, d \in \mathbb{R}^{10}$
I The first five entries of $b$ and $d$ are 0 and the rest are 1
I The $i$th row of $A$ ($C$) ⇒ i.i.d. multivariate Gaussian distribution conditioned on $b_i$ ($d_i$)
I f is convex and satisfies Assumption 1 (flatness) with p = 2
[Figure: convergence plot on log-log axes; horizontal axis from $10^0$ to $10^6$, vertical axis from $10^{-15}$ to $10^0$]
I For RK of order $s = 4$ the convergence rate is almost $O(1/N^2)$
Numerical results: Beating Nesterov’s acceleration
I Consider the objective function $f(x) = \|Ax - b\|_4^4$
I $f$ is convex and satisfies Assumption 1 (flatness) with $p = 4$
I We use an order $s = 2$ RK integrator and set $N = 10^6$
I We use $p \in \{1, 2, 4, 8\}$ for the ODE (by our theory $p = 8$ is not allowed)
[Figure: convergence plot on log-log axes; horizontal axis from $10^0$ to $10^6$, vertical axis from $10^{-15}$ to $10^0$]
I For $p = 4$ and $s = 2$ the direct discretization method is faster than NAG
⇒ Converges at the rate $O(N^{-3})$ ⇒ Better than our guarantee $O(N^{-8/3})$
Conclusions
I Introduced a general framework to achieve acceleration for CVX problems
I Specified conditions for stably discretizing an ODE using RK integrators
I Only by using first-order information (i.e., purely gradient based)
I For general smooth and convex functions
⇒ Better than $O(N^{-1})$
⇒ Becomes close to $O(N^{-2})$ by larger $s$ (in practice: $s = 4$ is enough)
I Identified a new condition that quantifies the local flatness of CVX functions
I If the degree of flatness is $p > 2$
⇒ We can obtain rates better than $O(N^{-2})$
References
I A. Nemirovski, D. B. Yudin, and E. R. Dawson, “Problem complexity and method efficiency in optimization,” 1983.
I Y. Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2),” Soviet Mathematics Doklady, volume 27, pp. 372-376, 1983.
I Z. Allen-Zhu and L. Orecchia, “Linear coupling: An ultimate unification of gradient and mirror descent,” arXiv preprint arXiv:1407.1537, 2014.
I L. Lessard, B. Recht, and A. Packard, “Analysis and design of optimization algorithms via integral quadratic constraints,” SIAM Journal on Optimization, 26(1):57-95, 2016.
I W. Su, S. Boyd, and E. Candes, “A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights,” Advances in Neural Information Processing Systems (NIPS), pp. 2510-2518, 2014.
I A. Wibisono, A. C. Wilson, and M. I. Jordan, “A variational perspective on accelerated methods in optimization,” Proceedings of the National Academy of Sciences, 113(47), E7351-E7358, 2016.
I M. Betancourt, M. I. Jordan, and A. C. Wilson, “On symplectic optimization,” arXiv preprint arXiv:1802.03653, 2018.
I J. Zhang, A. Mokhtari, S. Sra, and A. Jadbabaie, “Direct Runge-Kutta discretization achieves acceleration,” arXiv preprint arXiv:1805.00521, 2018.
Note on assumptions
I Our assumptions need to hold only on
$$A := \{x \in \mathbb{R}^d \mid \exists x' \in S, \|x - x'\| \le 1\},$$
where $S := \{x \in \mathbb{R}^d \mid f(x) \le \exp(1)\big((f(x_0) - f(x^*) + \|x_0 - x^*\|^2) + 1\big)\}$
I The iterates always stay in $S$
Formal statement of the main theorem
Theorem If f is convex and Assumptions 1 and 2 are satisfied, then
$$f(x_N) - f(x^*) \le C_2 E_0 \Big[\frac{(L+M+1)E_0}{N^{\frac{s}{s+1}}}\Big]^p = O\big(N^{-p\frac{s}{s+1}}\big),$$
where $E_0 := f(x_0) - f(x^*) + \|x_0 - x^*\|^2 + 1$ and $h = C_1 (L+M+1)^{-1} E_0^{-1} N^{-\frac{1}{s+1}}$
I $C_1$ and $C_2$ only depend on $s$, $p$, and the Runge-Kutta integrator
I Don’t need to know $C_1$ ⇒ Using any smaller positive constant is fine