From an ODE for Nesterov's Method to Accelerated Stochastic Gradient Descent
Adam Oberman, with Maxime Laborde
Math and Stats, McGill



Page 1

From an ODE for Nesterov's Method to Accelerated Stochastic Gradient Descent

Adam Oberman, with Maxime Laborde
Math and Stats, McGill

Page 2

Stochastic Gradient Descent definition: Math vs. ML

• Before 2010: "SGD" meant stochastic differential equations and Brownian motion.

"The data sizes have grown faster than the speed of processors … methods limited by the computing time rather than the sample size. … Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems." [Large-Scale Machine Learning with Stochastic Gradient Descent, Leon Bottou]

• Since around 2010: SGD is the algorithm for training ML models.

w: parameters of the model f. L: loss function measuring how well the model fits the labels y of the data x.
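In this notation, the training objective and the SGD update take the following standard form (written out here for reference; the exact symbols on the slide may differ slightly):

$$F(w) = \frac{1}{N}\sum_{i=1}^{N} L\bigl(f(x_i; w),\, y_i\bigr), \qquad w_{k+1} = w_k - h_k\,\nabla_w L\bigl(f(x_{i_k}; w_k),\, y_{i_k}\bigr),$$

where $i_k$ is a randomly chosen data index (or mini-batch) and $h_k$ is the learning rate.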

Page 3

Optimization for Machine Learning: SGD

• Why is stochastic gradient important for machine learning?

Full batch (the entire data set) vs. mini-batch: SGD is gradient descent with the objective approximated by an average over a random subsample of the data set.
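A minimal mini-batch SGD sketch, assuming a generic gradient oracle grad_loss (the function names and the least-squares example are illustrative, not the slide's setup):

import numpy as np

def minibatch_sgd(w0, X, y, grad_loss, h=0.1, batch_size=32, iters=1000, seed=0):
    """Plain mini-batch SGD: average the gradient over a random subsample at each step."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    n = X.shape[0]
    for _ in range(iters):
        idx = rng.choice(n, size=batch_size, replace=False)   # random mini-batch
        g = grad_loss(w, X[idx], y[idx])                       # stochastic gradient estimate
        w = w - h * g
    return w

# Example gradient for least squares: grad = X_b^T (X_b w - y_b) / batch size
grad_ls = lambda w, Xb, yb: Xb.T @ (Xb @ w - yb) / len(yb)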

Page 4

Stochastic Approximation

• The minibatch stochastic gradient is harder to analyze.

• Instead, make the stochastic approximation assumption: the stochastic gradient is the true gradient plus noise, g = ∇f + e (as used below).

SGD algorithms

• Why not Newton's method?

• Small-scale numerical methods: enough memory to solve an NxN linear system (e.g. Newton's method); converge in a few iterations.

• SGD trade-off: many iterations (millions) of a slow but memory-cheap algorithm.

Page 5

SGD convergence rate

A number of influential variance-reduction algorithms speed up SGD:

• SAG (Schmidt, Le Roux, Bach, 2013), *Lagrange Prize

• SAGA (Defazio, Bach, Lacoste-Julien, 2014)

• SVRG (Johnson, Zhang, 2013)

However, in many modern applications, variance reduction does not help.

SGD convergence: with a c/k rate, O(10k) iterations are needed to decrease the objective gap by a factor of 10, so improving the rate constant can reduce the iteration count by 10X.

Effective heuristic: “momentum”. Not theoretically understood.

Page 6

Nesterov’s accelerated (full) gradient descent

https://distill.pub/2017/momentum/

Accelerated gradient descent: same budget, faster convergence.

Gradient descent.

$$x_{k+1} = x_k - h\,\nabla f(x_k)$$

Remark: there are two main A-GD algorithms, corresponding to the convex and strongly convex cases. We focus on the convex case to simplify the presentation; the strongly convex case is also covered.

Heuristic: the momentum term remembers old gradients and overshoots instead of getting stuck; a generic momentum update is sketched below.
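For intuition, a generic momentum ("heavy-ball") update, not necessarily the exact variant behind the plots:

$$v_{k+1} = \gamma\, v_k - h\,\nabla f(x_k), \qquad x_{k+1} = x_k + v_{k+1},$$

with momentum parameter $\gamma \in [0, 1)$: the velocity $v_k$ accumulates past gradients, so the iterate keeps moving through flat or oscillatory regions instead of getting stuck.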

Page 7

Motivation: A-GD and SGD Algorithms

$$x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k), \qquad y_{k+1} = x_{k+1} + \beta_k\,(x_{k+1} - x_k)$$

$$\beta_k = \frac{k}{k+1}, \qquad \beta_k = \frac{\sqrt{C} - 1}{\sqrt{C} + 1}$$

A-GD: convex case, strongly convex case

$$x_{k+1} = x_k - \frac{h_k}{L}\, g(x_k), \qquad g = \nabla f + e$$

$$h_k = \frac{2\sqrt{C}}{k}, \qquad h_k = \frac{c}{k^{1/2}}$$

SGD: convex case, strongly convex case

Heuristic A-SGD:

Q: Can we find parameters to achieve faster convergence? We need to understand A-GD in a way that generalizes to the stochastic case.
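A minimal sketch of the A-GD iteration displayed above, in the convex case with $\beta_k = k/(k+1)$ (illustrative only; grad_f, L and the iteration count are placeholders):

import numpy as np

def nesterov_agd(grad_f, x0, L, iters=100):
    """Accelerated gradient descent: gradient step at y_k, then momentum extrapolation."""
    x = np.array(x0, dtype=float)
    y = x.copy()
    for k in range(iters):
        x_next = y - grad_f(y) / L              # x_{k+1} = y_k - (1/L) grad f(y_k)
        beta = k / (k + 1.0)                    # convex-case momentum weight from the slide
        y = x_next + beta * (x_next - x)        # y_{k+1} = x_{k+1} + beta_k (x_{k+1} - x_k)
        x = x_next
    return x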

Page 8

Heuristic: derive accelerated-GD algorithm

Double the variables and introduce a new speed parameter $w$:

$$v_{k+1} = v_k - \frac{1}{wL}\,\nabla f(v_k)$$

(i) Couple the equations.

(ii) Reduce to a single gradient evaluation at a convex combination.

$$x_{k+1} = x_k - \frac{1}{L}\,\nabla f(x_k)$$

The second equation is unstable.

Page 9

Derivation (step 2)

Page 10

Case 2: modify derivation

Page 11

Accelerated-SGD Algorithm

In the sequel, we generalize the Lyapunov analysis for A-GD to the stochastic case. The analysis will give the best constant and will yield the following A-SGD algorithm,

with an improved convergence rate constant (TBD).

Page 12

visualize algorithm in 2d (full gradients)

• after one iteration, x and v are in the shallow valley.

• (when properly tuned) the faster v variable pulls x along faster than GD.

comparison GD and A-GD: (same axes)

Page 13

visualize algorithm in 2d (stochastic gradients)

Top: comparison of SGD and A-SGD (same axes). Bottom: after 20 iterations, the optimality gap is 100X smaller. After 20,000 iterations, SGD reaches the value that A-SGD attained at k = 20.

Page 14

visualize algorithm in 2d (stochastic gradients)

• Note: the v variable moves faster, which allows x to average.

• The linear term pulls x towards v.

Page 15

Variable learning rate: Convergence for the last iterate

Define $G^2$, a bound on the stochastic gradient: $\mathbb{E}[\hat g^2] \le G^2$.

Remark: $G^2 = L^2 D^2 + \sigma^2$, where $D$ is the diameter of the domain.

• Strongly convex case: learning rate $h_k = O(1/k)$.

- Previous results: Nemirovski, Juditsky, Lan, Shapiro (2009), Shamir, Zhang (2013), Jain, Kakade, Netrapalli, Sidford (2018): optimal rate $O(1/k)$ with constants depending on $G^2$, $D$ and $\mu$.

- Our result: $O(1/k)$ rate with constants independent of the L-smoothness bound of the gradient (depends only on $\sigma^2$ and $\mu$).

• Convex case: learning rate $h_k = O(1/\sqrt{k})$.

- Shamir, Zhang (2013): optimal rate of order $\log(k)/\sqrt{k}$, with a rate constant that depends on $G^2$ and $D$.

- Jain, Nagaraj, Netrapalli (2019) remove the log factor, assuming the number of iterations is decided in advance.

- Our result: $O(\log(k)/\sqrt{k})$ rate for the last iterate, with a constant that depends on $\sigma^2$ but is independent of the L-smoothness. New learning rate schedule.


Decreasing Learning Rate Results

Page 16

Strongly Convex A-SGD Rate:

Comparison with previous results

Learning rate: $h_k = O(1/k)$

Convergence Results

convex case

strongly convex case

Interpreting the improved rate: for both algorithms, it is simply an improvement to the rate constant. This is a nice theoretical result and a nice example of a continuous-time method. In order for it to be of practical use, we desire:

- a practical algorithm, faster in practice (we saw this for simple examples)
- one that remains faster when the theory no longer applies
  - e.g. nonconvex examples
  - e.g. training DNNs

Convergence rate $\mathbb{E}[f(x_k)] - f^*$ after $k$ steps: $G^2$ is a bound on $\mathbb{E}[g(x)^2]$, and $\sigma^2$ is the variance. $E_0$ is the initial value of the Lyapunov function.

Improvement to the rate constant: independent of $L$.


$$G^2 = L^2 + \sigma^2$$

Page 17

Comparison with previous results

              Shamir, Zhang (2013)                                              Acc. SGD

$h_k$:        $\dfrac{c_2}{\sqrt{k}}$                                           $\dfrac{c}{k^{3/4}}$

Rate:         $\left(\dfrac{D^2}{c_2} + c_2 G^2\right)\dfrac{2 + \log(k)}{\sqrt{k}}$        $\dfrac{\frac{E_0}{16c^2} + c^2\sigma^2\,(1 + \log(k))}{(k^{1/4} - 1)^2}$

Table: Convergence rate $\mathbb{E}[f(x_k)] - f^*$ after $k$ steps. $G^2$ is a bound on $\mathbb{E}[g^2]$. $E_0$ is the initial value of the Lyapunov function.

Interpreting the improved rate: remove the dependence on L-smoothness from the rate.

This is a nice theoretical result and a nice example of a continuous-time method. In order for it to be of practical use, we desire:

• a practical algorithm, faster in practice (we saw this for simple examples)


Convex A-SGD algorithm Rate:

$$G^2 = L^2 + \sigma^2$$

Page 18

Numerical Results (preliminary):

• Quadratic with synthetic noise

• Logistic Regression with minibatch stochastic gradients.

• How much difference should the rate constant make in this case?

• Note: with exponential convergence, the rate constant is not important.

• But with a c/k rate, the constant can be important (e.g. a factor of 10 in the constant can mean 10X more iterations); see the short calculation below.
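A quick check of that claim: with a bound of the form $f(x_k) - f^* \le c/k$, reaching a target gap $\epsilon$ requires

$$\frac{c}{k} \le \epsilon \iff k \ge \frac{c}{\epsilon},$$

so the iteration count scales linearly with the constant $c$; shrinking $c$ by 10X cuts the required iterations by 10X.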

Page 19

Quadratic, sigma = .05, C=4

• Small condition number. Long time.

• Acc Convex is 10X better than SGD

• Acc-Strongly convex algorithm is 10X better than Acc Convex

Page 20

Quadratic C=1000

• 40,000 iterations:

• A-SGD-C >> A-SGD-SC >> SGD.

• A-SGD Convex: very fast initial drop (40 iterations)

Page 21

Logistic Regression I: mu = .2

Other Algorithms:

• SGD: 1/k learning rate (tuned)

• Heuristic Nesterov: constant β; learning rate drops by a constant factor every few epochs, tuned.

20,000 iterations

Page 22

Logistic Regression 2: mu = .05

Page 23

• Statement of convergence rates

• Idea of proof

Page 24

Strong convexity, L-smoothness

• We use definitions equivalent to the standard ones of strong convexity and L-smoothness (stated below for reference).

• Example: when f is quadratic, the constants correspond to the smallest and largest eigenvalues of the Hessian of f.
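For reference, the standard definitions that these are equivalent to: $f$ is $\mu$-strongly convex and $L$-smooth if, for all $x, y$,

$$f(y) \ge f(x) + \nabla f(x)\cdot(y - x) + \frac{\mu}{2}|y - x|^2, \qquad f(y) \le f(x) + \nabla f(x)\cdot(y - x) + \frac{L}{2}|y - x|^2.$$

For the quadratic example $f(x) = \frac{1}{2}x^\top A x$, these hold with $\mu = \lambda_{\min}(A)$ and $L = \lambda_{\max}(A)$.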

Page 25

Lyapunov approach to convergence rates

• From the previous analysis in the full-gradient case, we have the following inequality with β = 0.

• Stochastic case: non-zero β (as in the SGD analysis).

• SGD: balance the terms to get a decreasing learning rate.

• SGD: β has a factor of L in it; A-SGD: just a factor of σ, giving a better rate. (A generic template of the argument follows.)
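The inequality itself is on the slide image; schematically, a Lyapunov rate argument of this type runs as follows (a generic template, not the slide's exact statement):

$$E_{k+1} \le E_k + \beta_k \quad\Longrightarrow\quad E_k \le E_0 + \sum_{i<k}\beta_i,$$

and the rate follows by dividing by the weight built into $E_k$ (e.g. $t_{k-1}^2$ in the convex case); the learning rate $h_k$ is chosen so that the accumulated error terms grow no faster than that weight.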

Page 26

ODE Lyapunov approach

Another ODE for Nesterov's Method

Our starting point is a perturbation of (S-A-ODE):

$$\dot x = \frac{2}{t}(v - x) - \frac{1}{\sqrt{L}}\nabla f(x), \qquad \dot v = -\frac{t}{2}\nabla f(x) \qquad \text{(1st-ODE)}$$

The system (1st-ODE) is equivalent to the following ODE,

$$\ddot x + \frac{3}{t}\dot x + \nabla f(x) = -\frac{1}{\sqrt{L}}\left(D^2 f(x)\cdot \dot x + \frac{1}{t}\nabla f(x)\right), \qquad \text{(H-ODE)}$$

which has an additional Hessian damping term with coefficient $1/\sqrt{L}$.

• Second-order ODE with Hessian damping: Alvarez, Attouch, Bolte, Redont (2002), Attouch, Peypouquet, Redont (2016)

• Shi, Du, Jordan, Su (2018) introduced a family of high-resolution second-order ODEs which also lead to Nesterov's method; (H-ODE) is the special case with coefficient $1/\sqrt{L}$


Page 27

Strongly Convex Case

Stochastic gradient version

Learning rate: $h_k > 0$. Define

$$x_{k+1} - x_k = \lambda_k\,(v_k - x_k) - \frac{h_k}{\sqrt{L}}\,(\nabla f(y_k) + e_k),$$
$$v_{k+1} - v_k = \lambda_k\,(x_k - v_k) - \frac{h_k}{\sqrt{\mu}}\,(\nabla f(y_k) + e_k),$$
$$y_k = (1 - \lambda_k)\,x_k + \lambda_k\,v_k, \qquad \lambda_k = \frac{h_k\sqrt{\mu}}{1 + h_k\sqrt{\mu}}. \qquad \text{(FE-SC)}$$

Same algorithm as before, but now:
• Variable learning rate
• Stochastic gradient
• To simplify: replace $\sqrt{\mu}$ by $\dfrac{\sqrt{\mu}}{1 + h_k\sqrt{\mu}}$
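A minimal sketch of one (FE-SC) step as reconstructed above (stoch_grad is a placeholder oracle returning $\nabla f(y) + e$; this is illustrative, not the authors' reference code):

import numpy as np

def a_sgd_sc_step(x, v, h, stoch_grad, L, mu):
    """One step of the strongly convex accelerated SGD scheme (FE-SC)."""
    lam = h * np.sqrt(mu) / (1.0 + h * np.sqrt(mu))    # lambda_k after the sqrt(mu) simplification
    y = (1.0 - lam) * x + lam * v                      # gradient evaluated at a convex combination
    g = stoch_grad(y)                                  # stochastic gradient: grad f(y_k) + e_k
    x_next = x + lam * (v - x) - (h / np.sqrt(L)) * g
    v_next = v + lam * (x - v) - (h / np.sqrt(mu)) * g
    return x_next, v_next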


Page 28

Convergence result (strongly convex case)

Define the continuous time Lyapunov function

$$E^{ac,sc}(x, v) := f(x) - f^* + \frac{\mu}{2}\,|v - x^*|^2$$

Discrete time Lyapunov function: $E^{ac,sc}_k := E^{ac,sc}(x_k, v_k)$.

Proposition (L., Oberman, 2020)

Assume $E^{ac,sc}_0 \le \frac{1}{\sqrt{L}}$. If $h_k := \dfrac{2\sqrt{\mu}}{\mu k + 4\sigma^2\,(E^{ac,sc}_0)^{-1}}$, then

$$\mathbb{E}[f(x_k)] - f^* \le \frac{4\sigma^2}{\mu k + 4\sigma^2\,(E^{ac,sc}_0)^{-1}}.$$


Page 29

Accelerated-SGD Algorithm

From (1st-ODE) to Nesterov

Define the learning rate $h_k > 0$ and a discretization of the total time $t_k$. The time discretization of (1st-ODE), with gradients evaluated at

$$y_k = \left(1 - \frac{2h_k}{t_k}\right)x_k + \frac{2h_k}{t_k}\,v_k,$$

is given by

$$x_{k+1} - x_k = \frac{2h_k}{t_k}\,(v_k - x_k) - \frac{h_k}{\sqrt{L}}\,\nabla f(y_k),$$
$$v_{k+1} - v_k = -\frac{h_k t_k}{2}\,\nabla f(y_k). \qquad \text{(FE-C)}$$

Proposition

The discretization of (1st-ODE) given by (FE-C) with $h_k = h = \frac{1}{\sqrt{L}}$ and $t_k = h(k+2)$ is equivalent to the standard Nesterov's method (C-Nest).

Fixing the learning rate recovers Nesterov's algorithm.
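A quick consistency check of the time scaling (my arithmetic, using the proposition's parameters): with $h_k = h = 1/\sqrt{L}$ and $t_k = h(k+2)$,

$$\frac{2h_k}{t_k} = \frac{2}{k+2}, \qquad \frac{h_k}{\sqrt{L}} = \frac{1}{L}, \qquad \frac{h_k t_k}{2} = \frac{k+2}{2L},$$

so the x-update mixes in $v_k$ with weight $2/(k+2)$ and takes a $1/L$ gradient step, while the v-update takes a gradient step growing linearly in $k$, matching the structure of the standard (C-Nest) scheme.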


Page 30

Accelerated-SGD Algorithm: Convergence result (convex case)

Define the continuous time Lyapunov function

$$E^{ac,c}(t, x, v) := t^2\,(f(x) - f^*) + 2\,|v - x^*|^2$$

Define the discrete time Lyapunov function $E^{ac,c}_k$ by

$$E^{ac,c}_k = E^{ac,c}(t_{k-1}, x_k, v_k)$$

Proposition (L., Oberman, 2020)

Assume $h_k := \dfrac{c}{k^{\alpha}}\,\dfrac{1}{\sqrt{L}}$ and $t_k = \sum_{i=1}^{k} h_i$; then for $\alpha = \frac{3}{4}$,

$$\mathbb{E}[f(x_k)] - f^* \le \frac{\frac{1}{16c^2}E_0 + c^2\sigma^2\,(1 + \log(k))}{(k^{1/4} - 1)^2}.$$


Page 31

Thanks!