Stochastic Optimization Algorithms for Machine Learning

Guanghui (George) Lan

H. Milton School of Industrial and Systems Engineering, Georgia Tech, USA

Workshop on Optimization and Learning
CIMI Workshop, Toulouse, France

September 10-13, 2018

Machine learning (ML)

ML exploits optimization, statistics, high performance computing, among other techniques, to transform raw data into knowledge in order to support decision-making in various areas, e.g., biomedicine, health care, logistics, energy and transportation.

The role of optimization

Optimization provides theoretical insights and efficient solution methods for ML models.

Typical ML models

Given a set of observed data $S = \{(u_i, v_i)\}_{i=1}^m$, drawn from a certain unknown distribution $D$ on $U \times V$.

Goal: to describe the relation between the $u_i$'s and $v_i$'s for prediction.
Applications: predicting strokes and seizures, identifying heart failure, stopping credit card fraud, predicting machine failure, identifying spam, ...
Classic models:
- Lasso regression: $\min_x \mathbb{E}[(\langle x, u\rangle - v)^2] + \rho\|x\|_1$.
- Support vector machine: $\min_x \mathbb{E}_{u,v}[\max\{0, v\langle x, u\rangle\}] + \rho\|x\|_2^2$.
- Deep learning: $\min_{\theta, W} \mathbb{E}_{u,v}(F(\theta^T\psi(Wu)) - v)^2$.

Optimization models/methods in ML

Stochastic optimization: $\min_{x\in X} f(x) := \mathbb{E}_\xi[F(x, \xi)]$.
- $F$ is the regularized loss function and $\xi = (u, v)$, e.g., $F(x,\xi) = (\langle x, u\rangle - v)^2 + \rho\|x\|_1$ or $F(x,\xi) = \max\{0, v\langle x, u\rangle\} + \rho\|x\|_2^2$.
- Often called population risk minimization in ML.

Finite-sum minimization: $\min_{x\in X} f(x) := \frac{1}{N}\sum_{i=1}^N f_i(x)$.
- Empirical risk minimization: $f_i(x) = F(x, \xi_i)$.
- Distributed/decentralized ML: $f_i$ is the loss for each agent $i$.

Outline

- Deterministic first-order methods
- Stochastic optimization methods
  - Stochastic gradient descent (SGD)
  - Accelerated SGD (SGD with momentum)
  - Nonconvex SGD and its acceleration
  - Adaptive and accelerated methods
- Finite-sum and distributed optimization methods
  - Variance-reduced gradient methods
  - Primal-dual gradient methods
  - Decentralized/distributed SGD
- Uncovered topics: projection-free and second-order methods

(Sub)Gradient descent

Problem: $f^* := \min_{x\in X} f(x)$, where $\emptyset \neq X \subset \mathbb{R}^n$ is a convex set and $f$ is a convex function.

Basic idea: starting from $x_1 \in \mathbb{R}^n$, update $x_t$ by $x_{t+1} = x_t - \gamma_t\nabla f(x_t)$, $t = 1, 2, \ldots$

Two essential enhancements:
- $f$ may be non-differentiable: replace $\nabla f(x_t)$ by $g(x_t) \in \partial f(x_t)$.
- $x_{t+1}$ may be infeasible: project back onto $X$, i.e.,
  $x_{t+1} := \arg\min_{x\in X}\|x - (x_t - \gamma_t g(x_t))\|_2$, $t = 1, 2, \ldots$
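As an illustration (not from the slides), the sketch below implements projected subgradient descent for a toy nonsmooth problem: minimizing $f(x) = \|Ax - b\|_1$ over the Euclidean ball $X = \{x : \|x\|_2 \le r\}$. The problem instance, the crude Lipschitz estimate, and the helper names are assumptions made for the example.

```python
import numpy as np

def project_ball(x, r=1.0):
    """Euclidean projection onto {x : ||x||_2 <= r}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= r else (r / nrm) * x

def subgradient(A, b, x):
    """A subgradient of f(x) = ||Ax - b||_1, namely A^T sign(Ax - b)."""
    return A.T @ np.sign(A @ x - b)

def projected_subgradient(A, b, r=1.0, iters=1000):
    n = A.shape[1]
    x = np.zeros(n)
    M = np.linalg.norm(A, ord=2) * np.sqrt(A.shape[0])   # crude bound on subgradient norms
    D = 2.0 * r                                          # diameter of X
    best = np.inf
    for t in range(1, iters + 1):
        gamma = D / (M * np.sqrt(t))                     # gamma_t ~ D_X / (M sqrt(t))
        x = project_ball(x - gamma * subgradient(A, b, x), r)
        best = min(best, np.linalg.norm(A @ x - b, 1))
    return x, best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
    x, best = projected_subgradient(A, b)
    print("best objective:", best)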

Interpretation

From the proximity control point of view,
$$x_{t+1} = \arg\min_{x\in X}\tfrac{1}{2}\|x - (x_t - \gamma_t g(x_t))\|_2^2
         = \arg\min_{x\in X}\gamma_t\langle g(x_t), x - x_t\rangle + \tfrac{1}{2}\|x - x_t\|_2^2
         = \arg\min_{x\in X}\gamma_t[f(x_t) + \langle g(x_t), x - x_t\rangle] + \tfrac{1}{2}\|x - x_t\|_2^2
         = \arg\min_{x\in X}\gamma_t\langle g(x_t), x\rangle + \tfrac{1}{2}\|x - x_t\|_2^2.$$

Implication: minimize the linear approximation $f(x_t) + \langle g(x_t), x - x_t\rangle$ of $f(x)$ over $X$, without moving too far away from $x_t$.

The role of the stepsize: $\gamma_t$ controls how much we trust the model, and its choice depends on the problem class being solved.

Convergence of (Sub)Gradient Descent (GD)

Nonsmooth problems

$f$ is $M$-Lipschitz continuous, i.e., $|f(x) - f(y)| \le M\|x - y\|_2$.

Theorem
Let $\bar{x}_s^k := \left(\sum_{t=s}^k\gamma_t\right)^{-1}\sum_{t=s}^k\gamma_t x_t$. Then
$$f(\bar{x}_s^k) - f^* \le \left(2\sum_{t=s}^k\gamma_t\right)^{-1}\left[\|x^* - x_s\|_2^2 + M^2\sum_{t=s}^k\gamma_t^2\right].$$

Selection of $\gamma_t$
- If $\gamma_t = \sqrt{D_X^2/(kM^2)}$ for some fixed $k$, then $f(\bar{x}_1^k) - f^* \le \frac{MD_X}{2\sqrt{k}}$, where $D_X \ge \max_{x_1, x_2\in X}\|x_1 - x_2\|$.
- If $\gamma_t = \sqrt{D_X^2/(tM^2)}$, then $f(\bar{x}_{\lceil k/2\rceil}^k) - f^* \le O(1)(MD_X/\sqrt{k})$.

Convergence of GD

Smooth problems
- $f$ is differentiable, and $\nabla f$ is $L$-Lipschitz continuous, i.e., $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|_2$.
- $\exists\,\mu \ge 0$ s.t. $f(y) \ge f(x) + \langle f'(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2$.

Theorem
Let $\gamma \in (0, \frac{2}{L}]$. If $\gamma_t = \gamma$, $t = 1, 2, \ldots$, then
$$f(x_k) - f^* \le \frac{\|x_0 - x^*\|^2}{k\gamma(2 - L\gamma)}.$$
Moreover, if $\mu > 0$ and $Q_f = L/\mu$, then
$$\|x_k - x^*\| \le \left(\tfrac{Q_f - 1}{Q_f + 1}\right)^k\|x_0 - x^*\|, \qquad
f(x_k) - f^* \le \tfrac{L}{2}\left(\tfrac{Q_f - 1}{Q_f + 1}\right)^{2k}\|x_0 - x^*\|^2.$$

Adaptation to Geometry: Mirror Descent (MD)

GD is intrinsically linked to the Euclidean structure of Rn:

The method relies on the Euclidean projection,

DX , L and M are defined in terms of the Euclidean norm.

Bregman Distance

Let $\|\cdot\|$ be a (general) norm on $\mathbb{R}^n$ with dual norm $\|x\|_* = \sup_{\|y\|\le 1}\langle x, y\rangle$, and let $\omega$ be a continuously differentiable and strongly convex function with modulus $\nu$ with respect to $\|\cdot\|$. Define
$$V(x, z) = \omega(z) - [\omega(x) + \nabla\omega(x)^T(z - x)].$$

Mirror Descent (Nemirovski and Yudin 83, Beck and Teboulle 03)
$$x_{t+1} = \arg\min_{x\in X}\gamma_t\langle g(x_t), x\rangle + V(x_t, x), \quad t = 1, 2, \ldots$$

GD is a special case of MD: $V(x_t, x) = \|x - x_t\|_2^2/2$.
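A classical non-Euclidean instance (used here only as an illustration, under the assumption that $X$ is the probability simplex) takes $\omega(x) = \sum_i x_i\log x_i$, for which $V$ is the KL divergence and the MD step has the closed multiplicative form $x_{t+1} \propto x_t\cdot\exp(-\gamma_t g(x_t))$. The sketch below, with an assumed toy objective, shows that step.

```python
import numpy as np

def entropy_mirror_step(x, g, gamma):
    """One MD step on the simplex with omega(x) = sum_i x_i log x_i:
    argmin_{x in simplex} gamma*<g, x> + KL(x, x_t)  =  x_t * exp(-gamma*g) / Z."""
    w = x * np.exp(-gamma * g)
    return w / w.sum()

def mirror_descent(grad, x0, gammas):
    x = x0.copy()
    for gamma in gammas:
        x = entropy_mirror_step(x, grad(x), gamma)
    return x

if __name__ == "__main__":
    # Toy problem: min over the simplex of 0.5*||x - c||_2^2 for a fixed target c.
    rng = np.random.default_rng(1)
    n = 5
    c = rng.random(n)
    grad = lambda x: x - c
    x = mirror_descent(grad, np.ones(n) / n, [0.5 / np.sqrt(t + 1) for t in range(500)])
    print("solution:", np.round(x, 3))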

Acceleration scheme

Assume that $f$ is smooth: $\|\nabla f(x) - \nabla f(y)\|_* \le L\|x - y\|$.

Accelerated gradient descent (AGD) (Nesterov 83, 04; Tseng 08)
Choose $\bar{x}_0 = x_0 \in X$.
1) Set $x^{md}_k = (1-\alpha_k)\bar{x}_{k-1} + \alpha_k x_{k-1}$.
2) Compute $f'(x^{md}_k)$ and set
   $x_k = \arg\min_{x\in X}\{\alpha_k\langle f'(x^{md}_k), x\rangle + \beta_k V(x_{k-1}, x)\}$,
   $\bar{x}_k = (1-\alpha_k)\bar{x}_{k-1} + \alpha_k x_k$.
3) Set $k \leftarrow k+1$ and go to step 1).

Theorem
If $\beta_k \ge L\alpha_k^2$ and $\beta_k = (1-\alpha_k)\beta_{k-1}$ for $k \ge 1$, then
$$f(\bar{x}_k) - f(x) \le (1-\alpha_1)\tfrac{\beta_k}{\beta_1}[f(\bar{x}_0) - f(x)] + \tfrac{\beta_k}{\nu}V(x_0, x), \quad \forall x\in X.$$
In particular, if $\alpha_k = 2/(k+1)$ and $\beta_k = 4L/[k(k+1)]$, then
$$f(\bar{x}_k) - f(x^*) \le \tfrac{4L}{\nu k(k+1)}V(x_0, x^*).$$
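A minimal Euclidean sketch of the AGD scheme above, taking $V(x,z) = \|z-x\|^2/2$ (so $\nu = 1$) and $X = \mathbb{R}^n$ so that the prox step reduces to $x_k = x_{k-1} - (\alpha_k/\beta_k)\nabla f(x^{md}_k)$; the quadratic test problem is an assumption.

```python
import numpy as np

def agd(grad, L, x0, iters=200):
    """AGD with alpha_k = 2/(k+1), beta_k = 4L/(k(k+1)), Euclidean V, X = R^n."""
    x, xbar = x0.copy(), x0.copy()
    for k in range(1, iters + 1):
        alpha = 2.0 / (k + 1)
        beta = 4.0 * L / (k * (k + 1))
        xmd = (1 - alpha) * xbar + alpha * x       # x_k^{md}
        x = x - (alpha / beta) * grad(xmd)         # prox step; alpha/beta = k/(2L)
        xbar = (1 - alpha) * xbar + alpha * x
    return xbar

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    A = rng.standard_normal((40, 20)); b = rng.standard_normal(40)
    Q = A.T @ A
    L = np.linalg.eigvalsh(Q).max()
    grad = lambda x: Q @ x - A.T @ b               # grad of 0.5*||Ax - b||^2
    xbar = agd(grad, L, np.zeros(20))
    print("objective:", 0.5 * np.linalg.norm(A @ xbar - b) ** 2)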

Acceleration scheme (strongly convex case)

Assumption: $\exists\,\mu > 0$ s.t. $f(y) \ge f(x) + \langle f'(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2$.

AGD for strongly convex problems
Idea: restart AGD every $N \equiv \lceil\sqrt{8L/\mu}\,\rceil$ iterations.
The algorithm. Input: $p_0 \in X$. Phase $t = 1, 2, \ldots$: set $p_t = \bar{x}_N$, where $\bar{x}_N$ is obtained from AGD with $x_0 = p_{t-1}$.

Theorem
For any $t \ge 1$, we have $\|p_t - x^*\|^2 \le (\tfrac{1}{2})^t\|p_0 - x^*\|^2$.
To have $\|p_t - x^*\|^2 \le \epsilon$, the total number of iterations is bounded by
$$\left\lceil\sqrt{\tfrac{8L}{\mu}}\,\right\rceil\log\tfrac{\|p_0 - x^*\|^2}{\epsilon}.$$
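The restart scheme is a thin wrapper around AGD. The sketch below bundles a compact Euclidean AGD routine with the restart loop and an assumed strongly convex quadratic; names and the problem instance are illustrative only.

```python
import numpy as np

def agd(grad, L, x0, iters):
    """One run of AGD (Euclidean prox, alpha_k = 2/(k+1), beta_k = 4L/(k(k+1)))."""
    x, xbar = x0.copy(), x0.copy()
    for k in range(1, iters + 1):
        alpha = 2.0 / (k + 1)
        xmd = (1 - alpha) * xbar + alpha * x
        x = x - (k / (2.0 * L)) * grad(xmd)
        xbar = (1 - alpha) * xbar + alpha * x
    return xbar

def restarted_agd(grad, L, mu, p0, phases=10):
    """Restart AGD every N = ceil(sqrt(8L/mu)) iterations, warm-starting each phase."""
    N = int(np.ceil(np.sqrt(8.0 * L / mu)))
    p = p0.copy()
    for _ in range(phases):
        p = agd(grad, L, p, iters=N)
    return p

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    A = rng.standard_normal((60, 20))
    Q = A.T @ A + 0.1 * np.eye(20)                 # strongly convex quadratic
    b = rng.standard_normal(20)
    L, mu = np.linalg.eigvalsh(Q).max(), np.linalg.eigvalsh(Q).min()
    grad = lambda x: Q @ x - b
    p = restarted_agd(grad, L, mu, np.zeros(20), phases=15)
    print("distance to optimum:", np.linalg.norm(p - np.linalg.solve(Q, b)))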

Stochastic optimization problems

The problem: $\min_{x\in X} f(x) := \mathbb{E}_\xi[F(x, \xi)]$.

Challenge: computing exact (sub)gradients is computationally prohibitive.

Stochastic oracle
At iteration $t$, with $x_t \in X$ being the input, the SO outputs a vector $G(x_t, \xi_t)$, where $\{\xi_t\}_{t\ge 1}$ are i.i.d. random variables s.t.
$$\mathbb{E}[G(x_t, \xi_t)] \equiv g(x_t) \in \partial f(x_t).$$

Examples:
- $f(x) = \mathbb{E}_\xi[F(x, \xi)]$: $G(x_t, \xi_t) = F'(x_t, \xi_t)$, with $\xi_t$ a random realization of $\xi$.
- $f(x) = \sum_{i=1}^m f_i(x)/m$: $G(x_t, i_t) = f'_{i_t}(x_t)$, with $i_t$ a uniform random variable on $\{1, \ldots, m\}$.

stochastic mirror descent

Mirror descent stochastic approximation (MDSA)

The algorithm: replace the exact linear model in MD with its stochastic approximation (goes back to Robbins and Monro 51, Nemirovski and Yudin 83):
$$x_{t+1} = \arg\min_{x\in X}\gamma_t\langle G_t, x\rangle + V(x_t, x), \quad t = 1, 2, \ldots$$

Theorem (Nemirovski, Juditsky, Lan and Shapiro 07 (09))
Assume $\mathbb{E}[\|G(x, \xi)\|_*^2] \le M^2$. Then
$$\mathbb{E}[f(\bar{x}_s^k)] - f^* \le \left(\sum_{t=s}^k\gamma_t\right)^{-1}\left[\mathbb{E}[V(x_s, x^*)] + (2\nu)^{-1}M^2\sum_{t=s}^k\gamma_t^2\right].$$

The selection of $\gamma_t$
- If $\gamma_t = \sqrt{\Omega^2/(kM^2)}$ for some fixed $k$, then $\mathbb{E}[f(\bar{x}_1^k) - f^*] \le \frac{M\Omega}{2\sqrt{k}}$, where $\Omega \ge \max_{x_1, x_2\in X}V(x_1, x_2)$.
- If $\gamma_t = \sqrt{\Omega^2/(tM^2)}$, then $\mathbb{E}[f(\bar{x}_{\lceil k/2\rceil}^k) - f^*] \le O(1)(M\Omega/\sqrt{k})$.
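In the Euclidean case MDSA is plain SGD with averaging. The sketch below runs it on an assumed streaming least-squares problem with a single-sample oracle and a constant stepsize of order $1/\sqrt{k}$ (the constants are placeholders, not the theorem's exact choice).

```python
import numpy as np

def mdsa(oracle, x0, k, gamma):
    """Euclidean MDSA: x_{t+1} = x_t - gamma*G(x_t, xi_t); return the average iterate."""
    x = x0.copy()
    xbar = np.zeros_like(x0)
    for _ in range(k):
        x = x - gamma * oracle(x)
        xbar += x / k
    return xbar

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    xstar = rng.standard_normal(10)

    def oracle(x):
        # One fresh sample (u, v) with v = <x*, u> + noise; G = (<x,u> - v) u.
        u = rng.standard_normal(10)
        v = u @ xstar + 0.1 * rng.standard_normal()
        return (x @ u - v) * u

    k = 20000
    xbar = mdsa(oracle, np.zeros(10), k, gamma=1.0 / np.sqrt(k))  # constant ~ 1/sqrt(k), constants assumed
    print("estimation error:", np.linalg.norm(xbar - xstar))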

Complexity?

Stochastic optimization: $\min_{x\in X}\mathbb{E}[F(x, \xi)]$. One $F'(x_t, \xi_t)$ per iteration, $O(1/\epsilon^2)$ iterations in total. Optimal sampling complexity.

Deterministic finite-sum optimization: $\min_{x\in X}\{f(x) := \frac{1}{m}\sum_{i=1}^m f_i(x)\}$, with
$|f_i(x) - f_i(y)| \le M_i\|x - y\|$ and $|f(x) - f(y)| \le M\|x - y\|$ for all $x, y \in X$, where $M \le \bar{M} \equiv \max_i M_i$.

  Method   Iteration complexity                                          Iteration cost
  MD       $M^2\Omega^2/\epsilon^2 \;(\le \bar{M}^2\Omega^2/\epsilon^2)$   $m$ subgradients
  MDSA     $\bar{M}^2\Omega^2/\epsilon^2$                                  1 subgradient

MDSA saves up to a factor of $O(m)$ subgradient computations.

Accelerated SGD (SGD with momentum)

Stochastic Composite Optimization (SCO)

$$\Psi^* := \min_{x\in X}\Psi(x) := f(x) + \mathcal{X}(x).$$
- $\mathcal{X}(x)$ is a simple convex function with known structure (e.g., $\mathcal{X}(x) = 0$, $\mathcal{X}(x) = \|x\|_1$).
- $\exists\,L \ge 0$, $M \ge 0$ s.t. $0 \le f(y) - f(x) - \langle f'(x), y - x\rangle \le \frac{L}{2}\|y - x\|^2 + M\|y - x\|$.
- $f$ is represented by an SO, which, given input $x_t \in X$, outputs $G(x_t, \xi_t)$ s.t.
  $\mathbb{E}[G(x_t, \xi_t)] \equiv g(x_t) \in \partial f(x_t)$ and $\mathbb{E}[\|G(x_t, \xi_t) - g(x_t)\|_*^2] \le \sigma^2$.
- Covers both nonsmooth minimization ($L = 0$) and smooth minimization ($M = 0$).
- Covers both the deterministic case ($\sigma = 0$) and the stochastic case ($\sigma \neq 0$).

Accelerated Stochastic Approximation (AC-SA)

Accelerated stochastic approximation, Lan 08 (12)
Choose $\bar{x}_0 = x_0 \in X$.
1) Set $x^{md}_k = (1-\alpha_k)\bar{x}_{k-1} + \alpha_k x_{k-1}$.
2) Compute $G(x^{md}_k, \xi_k)$ and set
   $x_k = \arg\min_{x\in X}\{\alpha_k[\langle G(x^{md}_k, \xi_k), x\rangle + \mathcal{X}(x)] + \beta_k V(x_{k-1}, x)\}$,
   $\bar{x}_k = (1-\alpha_k)\bar{x}_{k-1} + \alpha_k x_k$.
3) Set $k \leftarrow k+1$ and go to step 1).

- Evolves from Nesterov's AGD method (Nesterov 83, Tseng 08), originally designed for deterministic smooth optimization.
- Challenge I: a unified analysis for smooth, nonsmooth and stochastic optimization.
- Challenge II: a proper stepsize policy; aggressive stepsizes ($\gamma_k \equiv \alpha_k/\beta_k = O(k)$) for smooth optimization.
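A Euclidean sketch of AC-SA with $\mathcal{X}(x) = \rho\|x\|_1$ (so the prox step is soft-thresholding) and a single-sample stochastic gradient of a least-squares loss. The stepsize policy follows the theorem on the next slide ($\alpha_k = 2/(k+1)$, $\beta_k = 4\beta/(k(k+1))$ with $\beta$ as in $\beta^*$); the problem, noise bound, and all constants are assumptions.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ac_sa(oracle, L, sigma, V0, rho, x0, iters):
    """AC-SA with Euclidean V and X(x) = rho*||x||_1 (nu = 1, M = 0):
    beta = max(2L, sqrt(sigma^2*(k+1)*(k+2)/(3*V0))), as in the theorem below."""
    beta0 = max(2.0 * L, np.sqrt(sigma**2 * (iters + 1) * (iters + 2) / (3.0 * V0)))
    x, xbar = x0.copy(), x0.copy()
    for k in range(1, iters + 1):
        alpha = 2.0 / (k + 1)
        beta = 4.0 * beta0 / (k * (k + 1))
        xmd = (1 - alpha) * xbar + alpha * x
        G = oracle(xmd)
        step = alpha / beta                                  # = k / (2*beta0)
        x = soft_threshold(x - step * G, step * rho)         # prox step centered at x_{k-1}
        xbar = (1 - alpha) * xbar + alpha * x
    return xbar

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    xstar = np.zeros(50); xstar[:5] = rng.standard_normal(5)     # sparse ground truth

    def oracle(x):
        u = rng.standard_normal(50)
        v = u @ xstar + 0.05 * rng.standard_normal()
        return (x @ u - v) * u                                   # stochastic gradient of the quadratic loss

    # L = 1 (identity population covariance); sigma and V0 are rough assumed bounds.
    xbar = ac_sa(oracle, L=1.0, sigma=5.0, V0=3.0, rho=0.05, x0=np.zeros(50), iters=5000)
    print("error:", np.linalg.norm(xbar - xstar))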

Convergence Results

Theorem (Lan 08)
If $\alpha_t = \frac{2}{t+1}$ and $\beta_t = \frac{4\beta}{\nu t(t+1)}$, $\forall\,t = 1, \ldots, k$, for some $\beta \ge 2L$, then
$$\mathbb{E}[\Psi(\bar{x}_k) - \Psi^*] \le \frac{4\beta V(x_0, x^*)}{\nu k(k+1)} + \frac{4(M^2 + \sigma^2)(k+2)}{3\beta}.$$
In particular, if $k$ is fixed in advance and
$$\beta = \beta^* = \max\left\{2L, \left[\frac{\nu(M^2 + \sigma^2)(k+1)(k+2)}{3V(x_0, x^*)}\right]^{1/2}\right\},$$
then
$$\mathbb{E}[\Psi(\bar{x}_k) - \Psi^*] \le O(1)\left(\frac{L\Omega^2}{k^2} + \frac{(M+\sigma)\Omega}{\sqrt{k}}\right).$$

Note: it is possible to design a stepsize policy that does not require $k$ to be fixed in advance, either by using a different stepsize policy or by slightly modifying the algorithm.

Remarks

- A universally optimal method for nonsmooth, smooth, and composite minimization, allowing stochastic (sub)gradients.
- The method can allow a very large Lipschitz constant $L$ without affecting the rate of convergence:

  Method            Rate of convergence                        Admissible $L$
  Direct RM-SA      $[L\Omega^2 + (M+Q)\Omega]/\sqrt{k}$       $L \le (M+Q)/\Omega$
  SMP               $L\Omega^2/k + (M+Q)\Omega/\sqrt{k}$       $L \le (M+Q)k^{1/2}/\Omega$
  Optimal method    $L\Omega^2/k^2 + (M+Q)\Omega/\sqrt{k}$     $L \le (M+Q)k^{3/2}/\Omega$

Strongly convex problems

$\exists\,\mu > 0$ s.t. $f(y) - f(x) - \langle f'(x), y - x\rangle \ge \frac{\mu}{2}\|y - x\|^2$.

A generic accelerated SA framework [Ghadimi and Lan 10 (12, 13)]
0) Set $\bar{x}_0 = x_0$ and $k = 1$.
1) Set
$$x^{md}_k = \tfrac{(1-\alpha_k)(\mu+\beta_k)}{\beta_k + (1-\alpha_k^2)\mu}\,\bar{x}_{k-1} + \tfrac{\alpha_k[(1-\alpha_k)\mu + \beta_k]}{\beta_k + (1-\alpha_k^2)\mu}\,x_{k-1}.$$
2) Call the SO for computing $G_k \equiv G(x^{md}_k, \xi_k)$. Set
$$x_k = \arg\min_{x\in X}\big\{\alpha_k[\langle G_k, x\rangle + \mathcal{X}(x) + \mu V(x^{md}_k, x)] + [(1-\alpha_k)\mu + \beta_k]V(x_{k-1}, x)\big\},$$
$$\bar{x}_k = \alpha_k x_k + (1-\alpha_k)\bar{x}_{k-1}.$$
3) Set $k \leftarrow k+1$ and go to step 1.

Note: reduces to the AC-SA algorithm in Lan 08 by setting $\mu = 0$.

Nearly optimal AC-SA for strongly convex SCO

Theorem [Ghadimi and Lan 10 (12)]. Let $\{\bar{x}_k\}_{k\ge 1}$ be computed by the AC-SA algorithm with
$$\alpha_k = \tfrac{2}{k+1} \quad\text{and}\quad \beta_k = \tfrac{4L}{k(k+1)}.$$
Under Assumption 1, we have, for all $k \ge 1$,
$$\mathbb{E}[\Psi(\bar{x}_k) - \Psi^*] \le \frac{2L\|x_0 - x^*\|^2}{k(k+1)} + \frac{8(M^2 + \sigma^2)}{\mu(k+1)}.$$

Iteration complexity of finding an $\epsilon$-solution:
$$O(1)\left(\sqrt{\frac{L\|x_0 - x^*\|^2}{\epsilon}} + \frac{(M+\sigma)^2}{\mu\epsilon}\right).$$
- Significantly better than the classic SA in terms of the dependence on $L$, $\mu$, $\sigma$ and $\|x_0 - x^*\|$.
- The second term is unimprovable.
- Optimal for problems without smooth components (i.e., $L = 0$).
- Stepsizes independent of $\sigma$, $M$ and $\|x_0 - x^*\|$.

Optimal AC-SA for strongly convex SCO

Basic idea: first assume that $\mu = 0$ and then properly restart the execution of the AC-SA algorithm.

A multi-stage AC-SA algorithm.
0) Let $p_0 \in X$ and a bound $V_0$ s.t. $\Psi(p_0) - \Psi(x^*) \le V_0$ be given.
1) For $k = 1, 2, \ldots$
   1.a) Run $N_k$ iterations of AC-SA with $\mu = 0$, $x_0 = p_{k-1}$, $\alpha_t = \frac{2}{t+1}$, $\beta_t \equiv \beta_{k,t} = \frac{4\beta_k}{t(t+1)}$, where
   $$N_k = \left\lceil\max\left\{4\sqrt{\tfrac{2L}{\mu}},\; \tfrac{128(M^2+\sigma^2)}{3\mu V_0 2^{-(k+1)}}\right\}\right\rceil, \qquad
   \beta_k = \max\left\{2L, \left[\tfrac{\mu(M^2+\sigma^2)}{3V_0 2^{-(k-1)}}N_k^3\right]^{1/2}\right\};$$
   1.b) Set $p_k = \bar{x}_{N_k}$, where $\bar{x}_{N_k}$ is computed in step 1.a).

Note: in comparison with the nearly optimal AC-SA, this requires additional information on $M$, $\sigma$ and $V_0$ (or estimates thereof).

Optimal AC-SA for strongly convex SCO

Theorem [Ghadimi and Lan 10 (13)]. The multi-stage AC-SA algorithm finds an $\epsilon$-solution $\bar{x} \in X$, i.e., a point s.t. $\mathbb{E}[\Psi(\bar{x}) - \Psi^*] \le \epsilon$, in at most $K := \lceil\log V_0/\epsilon\rceil$ stages. Moreover, the total number of iterations performed by this algorithm to find such a solution is bounded by
$$O(1)\left\{\sqrt{\tfrac{L}{\mu}}\max\left(1, \log\tfrac{V_0}{\epsilon}\right) + \tfrac{M^2+\sigma^2}{\mu\epsilon}\right\}.$$

- The iteration complexity is optimal for strongly convex stochastic optimization.
- Significantly better than the nearly optimal one in terms of the dependence on $p_0$ (or $x_0$).
- Possibly an optimal single-stage algorithm by using increasing mini-batches (to be discussed later).

Nonconvex SGD

Nonconvex SP problems

Issue: all the previous stochastic algorithms require the convexity assumption to show convergence.

Problem: consider $f^* := \min_{x\in\mathbb{R}^n} f(x)$.
- $f$ is differentiable and bounded below, $f \in C^{1,1}_L(\mathbb{R}^n)$: $\|\nabla f(y) - \nabla f(x)\| \le L\|y - x\|$, $\forall\,x, y \in \mathbb{R}^n$.
- $f$ is not necessarily convex.
- Goal: compute an approximate stationary point.
- Well understood in the deterministic case.
- Only a stochastic gradient $G(x_k, \xi_k)$ is available at $x_k$:
  a) $\mathbb{E}[G(x_k, \xi_k)] = g(x_k) \equiv \nabla f(x_k)$,
  b) $\mathbb{E}[\|G(x_k, \xi_k) - g(x_k)\|^2] \le \sigma^2$.
- No convergence analysis for SA-type algorithms.

Motivating examples

- Nonconvex loss or regularization (e.g., deep learning):
  $$f(x) = \int_\Xi L(x, \xi)\,dP(\xi) + r(x),$$
  where either the loss function $L(x, \xi)$ or the regularization term $r(x)$ is nonconvex, and $G(x, \xi) = \nabla_x L(x, \xi) + \nabla r(x)$.
- Simulation-based optimization: $f(x) = \mathbb{E}_\xi[F(x, \xi)]$. Here $F(x, \xi)$ is given as a black box; usually only noisy function values are available, and there is no convexity information.
- Endogenous uncertainty: $f(x) = \int_{\Xi(x)} F(x, \xi)\,dP_x(\xi)$. Here $f(x)$ is usually nonconvex even though $F(x, \xi)$ is convex w.r.t. $x$.

The randomized stochastic gradient method

Algorithm 1: A randomized stochastic gradient (RSG) method, Ghadimi and Lan 12 (13)
Input: initial point $x_1$, iteration limit $N$, stepsizes $\{\gamma_k\}_{k\ge 1}$ and probability mass function $P_R(\cdot)$ supported on $\{1, \ldots, N\}$.
Step 0. Let $R$ be a random variable with probability mass function $P_R$.
Step $k = 1, \ldots, R$. Compute $G(x_k, \xi_k)$ and set $x_{k+1} = x_k - \gamma_k G(x_k, \xi_k)$.
Output $x_R$.
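A direct sketch of RSG for an unconstrained problem, with the PMF and constant stepsize from the theorem two slides below; the oracle, the smoothness bound, and the parameter D are assumed for illustration.

```python
import numpy as np

def rsg(oracle, x1, N, L, sigma, D=1.0, seed=0):
    """RSG: run SGD for a random number of steps R ~ P_R and output x_R,
    with constant gamma_k = min(1/L, D/(sigma*sqrt(N)))."""
    rng = np.random.default_rng(seed)
    gamma = min(1.0 / L, D / (sigma * np.sqrt(N)))
    probs = np.full(N, 2 * gamma - L * gamma**2)   # P_R(k) ∝ 2*gamma_k - L*gamma_k^2 (uniform here)
    probs /= probs.sum()
    R = rng.choice(np.arange(1, N + 1), p=probs)
    x = x1.copy()
    for _ in range(R):
        x = x - gamma * oracle(x)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(6)

    def oracle(x):
        # Noisy gradient of the nonconvex toy f(x) = x0^4 - 2*x0^2 + (x0 - x1)^2.
        g = np.array([4 * x[0]**3 - 4 * x[0] + 2 * (x[0] - x[1]), 2 * (x[1] - x[0])])
        return g + 0.1 * rng.standard_normal(2)

    x = rsg(oracle, x1=np.array([1.5, -0.5]), N=2000, L=30.0, sigma=0.1, D=1.0)
    print("output point:", x)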

How would randomization help?

By the smoothness of $f$:
$$f(x_{k+1}) \le f(x_k) - \gamma_k\langle g(x_k), G(x_k, \xi_k)\rangle + \tfrac{L\gamma_k^2}{2}\|G(x_k, \xi_k)\|^2.$$
Taking expectation on both sides w.r.t. $\xi_k$, conditioned on $(\xi_1, \ldots, \xi_{k-1})$:
$$\mathbb{E}[f(x_{k+1})] \le f(x_k) - \left(\gamma_k - \tfrac{L}{2}\gamma_k^2\right)\|g(x_k)\|^2 + \tfrac{L}{2}\gamma_k^2\sigma^2.$$
Applying the above relation inductively, we obtain
$$\tfrac{1}{L}\sum_{k=1}^N\left(\gamma_k - \tfrac{L}{2}\gamma_k^2\right)\mathbb{E}[\|g(x_k)\|^2]
\le \tfrac{1}{L}\left[f(x_1) - \mathbb{E}[f(x_{N+1})]\right] + \tfrac{1}{2}\sigma^2\sum_{k=1}^N\gamma_k^2
\le \tfrac{1}{L}\underbrace{[f(x_1) - f^*]}_{D_f} + \tfrac{1}{2}\sigma^2\sum_{k=1}^N\gamma_k^2.$$

The randomized stochastic gradient method

Theorem (Ghadimi and Lan 12 (13)). Suppose that $P_R(\cdot)$ and $\gamma_k$ are set to
$$P_R(k) := \mathrm{Prob}\{R = k\} = \frac{2\gamma_k - L\gamma_k^2}{\sum_{k=1}^N(2\gamma_k - L\gamma_k^2)}, \quad k = 1, \ldots, N,$$
$$\gamma_k = \min\left\{\frac{1}{L}, \frac{D}{\sigma\sqrt{N}}\right\}, \quad k = 1, \ldots, N,$$
for some $D > 0$. Then
$$\frac{1}{L}\mathbb{E}[\|\nabla f(x_R)\|^2] \le \frac{LD_f^2}{N} + \left(D + \frac{D_f^2}{D}\right)\frac{\sigma}{\sqrt{N}}.$$
If, in addition, $f$ is convex with an optimal solution $x^*$, then
$$\mathbb{E}[f(x_R) - f^*] \le \frac{LD_X^2}{N} + \left(D + \frac{D_X^2}{D}\right)\frac{\sigma}{\sqrt{N}},$$
where $D_X := \|x_1 - x^*\|$.

Large-deviation properties for the RSG method

Goal: find an $(\epsilon, \Lambda)$-solution $\bar{x}$, i.e., a point s.t. $\mathrm{Prob}\{\|\nabla f(\bar{x})\|^2 \le \epsilon\} \ge 1 - \Lambda$ for some $\Lambda \in (0, 1)$, in a single run of the RSG algorithm.

By Markov's inequality, the iteration complexity of finding an $(\epsilon, \Lambda)$-solution is
$$O\left\{\frac{L^2D_f^2}{\Lambda\epsilon} + \frac{L^2}{\Lambda^2}\left(D + \frac{D_f^2}{D}\right)^2\frac{\sigma^2}{\epsilon^2}\right\}.$$

A two-phase randomized stochastic gradient method

Algorithm 2: A two-phase RSG (2-RSG) method
Input: initial point $x_1$, iteration limit $N$, sample size $T$, and confidence level $\Lambda \in (0, 1)$.
Initialize: set $S = \lceil\log(2/\Lambda)\rceil$.
Optimization phase:
For $s = 1, \ldots, S$: call the RSG method with input $x_1$, iteration limit $N$, stepsizes $\{\gamma_k\}$ and probability mass function $P_R$ as in the RSG method. Let $\bar{x}_s$ be the output of this procedure.
Post-optimization phase: choose $\bar{x}^*$ from $\{\bar{x}_1, \ldots, \bar{x}_S\}$ s.t.
$$\|\bar{g}(\bar{x}^*)\| = \min_{s=1,\ldots,S}\|\bar{g}(\bar{x}_s)\|, \qquad \bar{g}(\bar{x}_s) := \tfrac{1}{T}\sum_{k=1}^T G(\bar{x}_s, \xi_k).$$
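A sketch of the two-phase wrapper. It is written generically against an `rsg_run` callable (e.g., the RSG sketch above) and a stochastic-gradient `oracle`; both are assumptions supplied by the caller.

```python
import numpy as np

def two_phase_rsg(rsg_run, oracle, x1, S, T):
    """2-RSG: run RSG S times, then keep the candidate whose averaged
    stochastic gradient over T fresh samples has the smallest norm."""
    candidates = [rsg_run(x1, seed=s) for s in range(S)]

    def est_grad_norm(x):
        return np.linalg.norm(sum(oracle(x) for _ in range(T)) / T)

    return min(candidates, key=est_grad_norm)

# Usage (assuming the rsg and oracle sketches from above):
# S = int(np.ceil(np.log(2 / 0.1)))          # Lambda = 0.1
# xbest = two_phase_rsg(lambda x1, seed: rsg(oracle, x1, N=2000, L=30.0, sigma=0.1, seed=seed),
#                       oracle, np.array([1.5, -0.5]), S=S, T=500)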

A two-phase randomized stochastic gradient method

Iteration complexity of the 2-RSG method for finding an $(\epsilon, \Lambda)$-solution:
$$O\left\{\frac{L^2D_f^2\log(1/\Lambda)}{\epsilon} + L^2\left(D + \frac{D_f^2}{D}\right)^2\frac{\log(1/\Lambda)\sigma^2}{\epsilon^2} + \frac{\log^2(1/\Lambda)\sigma^2}{\Lambda\epsilon}\right\}.$$
When the second terms are the dominant ones in both bounds, this is smaller than the previous bound by a factor of up to $O\{1/(\Lambda^2\log^2(1/\Lambda))\}$.

Large-deviation properties

Light-tail assumption:
$$\mathbb{E}\left[\exp\{\|G(x, \xi_t) - g(x)\|_*^2/\sigma^2\}\right] \le \exp\{1\}.$$
Under this assumption, the third term in the previous complexity bound can be improved by a factor of $1/\Lambda$:
$$O\left\{\log(1/\Lambda)\left[\frac{L^2D_f^2}{\epsilon} + L^2\left(D + \frac{D_f^2}{D}\right)^2\frac{\sigma^2}{\epsilon^2}\right] + \frac{\log^2(1/\Lambda)\sigma^2}{\epsilon}\right\}.$$

Nonconvex acceleration

Acceleration for nonconvex optimization?

Consider $f^* := \min_{x\in\mathbb{R}^n} f(x)$.
- The complexity of the gradient descent method for finding a point $\bar{x}$ such that $\|\nabla f(\bar{x})\|^2 \le \epsilon$ is $O(1/\epsilon)$.
- It was unclear whether the AG method still converges when $f$ is possibly nonconvex.
- Question: can we generalize the AG method for solving nonconvex problems?

The AG method for nonconvex smooth optimization

$f \in C^{1,1}_{L_f}(\mathbb{R}^n)$, possibly nonconvex, bounded from below.

Algorithm 3: Nonconvex accelerated gradient (AG)
Input: $x_0 \in \mathbb{R}^n$, stepsize parameters $\{\alpha_k\}$, $\{\lambda_k\}$, and $\{\beta_k\}$.
0) Set the initial point $\bar{x}_0 = x_0$ and $k = 1$.
1) Set $x^{md}_k = (1-\alpha_k)\bar{x}_{k-1} + \alpha_k x_{k-1}$.
2) Compute $\nabla f(x^{md}_k)$ and set
   $x_k = x_{k-1} - \lambda_k\nabla f(x^{md}_k)$,
   $\bar{x}_k = x^{md}_k - \beta_k\nabla f(x^{md}_k)$.
3) Set $k \leftarrow k+1$ and go to step 1.
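A direct sketch of Algorithm 3 for an unconstrained smooth nonconvex function, using $\alpha_k = 2/(k+1)$, $\beta_k = 1/(2L_f)$ and the admissible choice $\lambda_k = (1+\alpha_k/4)\beta_k$ from part (a) of the theorem two slides below; the test function and its Lipschitz bound are assumptions.

```python
import numpy as np

def nonconvex_ag(grad, Lf, x0, N):
    """Nonconvex AG: alpha_k = 2/(k+1), beta_k = 1/(2 Lf), lambda_k = (1 + alpha_k/4)*beta_k."""
    x, xbar = x0.copy(), x0.copy()
    beta = 1.0 / (2.0 * Lf)
    best_grad = np.inf
    for k in range(1, N + 1):
        alpha = 2.0 / (k + 1)
        lam = (1.0 + alpha / 4.0) * beta
        xmd = (1 - alpha) * xbar + alpha * x      # x_k^{md}
        g = grad(xmd)
        x = x - lam * g
        xbar = xmd - beta * g
        best_grad = min(best_grad, np.linalg.norm(g))
    return xbar, best_grad

if __name__ == "__main__":
    # Toy nonconvex function f(x) = sum_i (x_i^2 - 1)^2 / 4, with grad = x*(x^2 - 1).
    grad = lambda x: x * (x**2 - 1.0)
    Lf = 8.0   # valid Lipschitz bound on the region the iterates visit (assumed)
    xbar, gmin = nonconvex_ag(grad, Lf, 0.9 * np.ones(5), N=500)
    print("min gradient norm seen:", gmin, "final point:", np.round(xbar, 3))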

Remarks

- If $\beta_k = \alpha_k\lambda_k$, $\forall\,k \ge 1$, then the AG method is equivalent to one of the simplest variants of the well-known Nesterov method.
- If $\beta_k = \lambda_k$, $\forall\,k \ge 1$, then the AG method reduces to the gradient descent method.

Convergence properties

Theorem (Ghadimi and Lan 13 (16)). Let $\{x^{md}_k, \bar{x}_k\}_{k\ge 1}$ be computed by the AG algorithm with
$$\alpha_k = \tfrac{2}{k+1} \quad\text{and}\quad \beta_k = \tfrac{1}{2L_f}.$$
a) If $\lambda_k \in \left[\beta_k, \left(1 + \tfrac{\alpha_k}{4}\right)\beta_k\right]$, then for any $N \ge 1$,
$$\min_{k=1,\ldots,N}\|\nabla f(x^{md}_k)\|^2 \le \frac{6L_f[f(x_0) - f^*]}{N}.$$
b) If $f(\cdot)$ is convex and $\lambda_k = \tfrac{k\beta_k}{2}$, then for any $N \ge 1$,
$$\min_{k=1,\ldots,N}\|\nabla f(x^{md}_k)\|^2 \le \frac{96L_f^2\|x_0 - x^*\|^2}{N^2(N+1)}, \qquad
f(\bar{x}_N) - f(x^*) \le \frac{4L_f\|x_0 - x^*\|^2}{N(N+1)}.$$

Remarks

- The rate of convergence of the AG method is of the same order of magnitude as that of the gradient descent method.
- The AG method exhibits an optimal rate of convergence for convex problems.
- For convex problems, the AG method can find a solution $\bar{x}$ such that $\|\nabla f(\bar{x})\|^2 \le \epsilon$ in at most $O(1/\epsilon^{1/3})$ iterations.
- The selection of $\lambda_k$ for the nonconvex case is less aggressive than that for the convex case.

The AG method for nonconvex composite optimization

Consider a class of composite problems given by
$$\min_{x\in\mathbb{R}^n}\Psi(x) + \mathcal{X}(x), \qquad \Psi(x) := f(x) + h(x).$$
$f \in C^{1,1}_{L_f}(\mathbb{R}^n)$ is possibly nonconvex, $h \in C^{1,1}_{L_h}(\mathbb{R}^n)$ is convex, and $\mathcal{X}$ is a simple convex (possibly non-differentiable) function with bounded domain. Clearly, we have $\Psi \in C^{1,1}_{L_\Psi}(\mathbb{R}^n)$ with $L_\Psi = L_f + L_h$.

Assumption 1. There exists a constant $M$ such that $\|P(x, y, c)\| \le M$ for any $c \in (0, +\infty)$ and $x, y \in \mathbb{R}^n$, where $P(x, y, c)$ is given by
$$P(x, y, c) := \arg\min_{u\in\mathbb{R}^n}\left\{\langle y, u\rangle + \frac{1}{2c}\|u - x\|^2 + \mathcal{X}(u)\right\}.$$

Lemma. If $\mathcal{X}(\cdot)$ is a proper closed convex function with bounded domain, then Assumption 1 is always satisfied.
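For common choices of $\mathcal{X}$ the mapping $P(x, y, c)$ is available in closed form. The sketch below (an illustration, not part of the slides) gives it for $\mathcal{X}(u) = \rho\|u\|_1$ (soft-thresholding) and for the indicator of a Euclidean ball (projection), together with the gradient mapping $G(x, y, c)$ that appears two slides later; all names are assumed.

```python
import numpy as np

def prox_l1(x, y, c, rho):
    """P(x, y, c) for X(u) = rho*||u||_1:
    argmin_u <y,u> + ||u - x||^2/(2c) + rho*||u||_1 = soft-threshold(x - c*y, c*rho)."""
    z = x - c * y
    return np.sign(z) * np.maximum(np.abs(z) - c * rho, 0.0)

def prox_indicator_ball(x, y, c, r):
    """P(x, y, c) for X(u) = indicator of {u : ||u||_2 <= r}: project x - c*y onto the ball."""
    z = x - c * y
    nrm = np.linalg.norm(z)
    return z if nrm <= r else (r / nrm) * z

def gradient_mapping(x, y, c, prox, **kw):
    """G(x, y, c) = (x - P(x, y, c)) / c, the termination quantity used later."""
    return (x - prox(x, y, c, **kw)) / c

if __name__ == "__main__":
    x = np.array([1.0, -0.2, 0.05]); y = np.array([0.3, -0.1, 0.0])
    print(prox_l1(x, y, c=0.5, rho=0.2))
    print(gradient_mapping(x, y, 0.5, prox_l1, rho=0.2))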

Let $X \subseteq \mathbb{R}^n$ be a given convex compact set. It can easily be seen that Assumption 1 holds if $\mathcal{X}(x) = I_X(x)$, where
$$I_X(x) = \begin{cases}0 & x\in X,\\ +\infty & x\notin X,\end{cases}$$
or if $\mathcal{X}(x) = I_X(x) + \|x\|_1$.

New termination criterion

$$G(x, y, c) := \tfrac{1}{c}[x - P(x, y, c)].$$
- $G(x, \nabla\Psi(x), c)$ is called the gradient mapping.
- $G(x, \nabla\Psi(x), c) = \nabla\Psi(x)$ when $\mathcal{X}(\cdot) = 0$.
- $G(x, \cdot, c)$ is Lipschitz continuous.

Lemma. Let $x \in \mathbb{R}^n$ be given and denote $g \equiv \nabla\Psi(x)$. If $\|G(x, g, c)\| \le \epsilon$, then
$$-\nabla\Psi(P(x, g, c)) \in \partial\mathcal{X}(P(x, g, c)) + B(\epsilon(cL_\Psi + 1)),$$
where $\partial\mathcal{X}(\cdot)$ denotes the subdifferential of $\mathcal{X}(\cdot)$ and $B(r) := \{x\in\mathbb{R}^n : \|x\| \le r\}$.

The AG method for nonconvex composite optimization

Algorithm 4: The AG method for composite optimization, Ghadimi and Lan 13 (16)
Replace step 2 of the AG method (Algorithm 3) by
$$x_k = P(x_{k-1}, \nabla\Psi(x^{md}_k), \lambda_k), \qquad \bar{x}_k = P(x^{md}_k, \nabla\Psi(x^{md}_k), \beta_k).$$
$G(x^{md}_k, \nabla\Psi(x^{md}_k), \beta_k) = \frac{1}{\beta_k}(x^{md}_k - \bar{x}_k)$ is used as the termination criterion.

Convergence properties

Theorem (Ghadimi and Lan 13 (16)). If Assumption 1 holds, an optimal solution $x^*$ exists, and
$$\alpha_k = \tfrac{2}{k+1}, \quad \beta_k = \tfrac{1}{2L_\Psi}, \quad \lambda_k = \tfrac{k\beta_k}{2},$$
then for any $N \ge 1$,
$$\min_{k=1,\ldots,N}\|G(x^{md}_k, \nabla\Psi(x^{md}_k), \beta_k)\|^2 \le 24L_\Psi\left[\frac{4L_\Psi\|x_0 - x^*\|^2}{N^2(N+1)} + \frac{L_f}{N}(\|x^*\|^2 + 2M^2)\right].$$
If, in addition, $L_f = 0$, then
$$\Phi(\bar{x}_N) - \Phi(x^*) \le \frac{4L_\Psi\|x_0 - x^*\|^2}{N(N+1)},$$
where $\Phi := \Psi + \mathcal{X}$ denotes the total objective.

Remarks

- After running the AG method for at most $N = O\big((L_\Psi^2/\epsilon)^{1/3} + L_\Psi L_f/\epsilon\big)$ iterations, we have $-\nabla\Psi(\bar{x}_N) \in \partial\mathcal{X}(\bar{x}_N) + B(\epsilon)$.
- Suppose that $\mathcal{X}(\cdot)$ is Lipschitz continuous with constant $L_{\mathcal{X}}$. Then the corresponding bound for the projected gradient method (Ghadimi, Lan, and Zhang 13 (16)) is
$$O\left\{\frac{L_\Psi}{N}L_{\mathcal{X}}(\|x^*\| + M) + \frac{L_\Psi^2}{N}(\|x^*\|^2 + M^2)\right\},$$
which has a much worse dependence on the Lipschitz constants $L_h$ and $L_{\mathcal{X}}$ of the convex components.
- If the Lipschitz constant $L_f$ is small enough, i.e., $L_f = O(L_h/N^2)$, then the rate of convergence of the AG method is significantly better than that of the projected gradient method.

Stochastic oracle

The function $\Psi(\cdot)$ is represented by a stochastic first-order oracle (SO).
At iteration $k$, with $x_k \in X$ being the input, the SO outputs a vector $G(x_k, \xi_k)$, where $\xi_k$, $k \ge 1$, are random variables whose distributions $P_k$ are supported on $\Xi_k \subseteq \mathbb{R}^d$.

Assumption 2. For any $x \in \mathbb{R}^n$ and $k \ge 1$, we have
a) $\mathbb{E}[G(x, \xi_k)] = \nabla\Psi(x)$,
b) $\mathbb{E}[\|G(x, \xi_k) - \nabla\Psi(x)\|^2] \le \sigma^2$.

The RSAG method for composite stochastic optimization

Algorithm 5: RSAG for stochastic composite optimization
Input: $x_0 \in \mathbb{R}^n$, $\{\alpha_k\}$, $\{\beta_k\} > 0$ and $\{\lambda_k\} > 0$, iteration limit $N \ge 1$, and a random variable $R$ s.t. $\mathrm{Prob}\{R = k\} = p_k$, $k = 1, \ldots, N$.
0. Set $\bar{x}_0 = x_0$ and $k = 1$.
1. Set
   $x^{md}_k = (1-\alpha_k)\bar{x}_{k-1} + \alpha_k x_{k-1}$,
   $x_k = P(x_{k-1}, \bar{G}_k, \lambda_k)$,
   $\bar{x}_k = P(x^{md}_k, \bar{G}_k, \beta_k)$,
   where $\bar{G}_k = \frac{1}{m_k}\sum_{i=1}^{m_k}G(x^{md}_k, \xi_{k,i})$ for some $m_k \ge 1$, and each $G(x^{md}_k, \xi_{k,i})$ is returned by the SO.
2. If $k = R$, terminate the algorithm. Otherwise, set $k = k+1$ and go to step 1.

The RSAG method for composite stochastic optimization

Under Assumption 2, we have
$$\mathbb{E}[\bar{G}_k] = \tfrac{1}{m_k}\sum_{i=1}^{m_k}\mathbb{E}[G(x^{md}_k, \xi_{k,i})] = \nabla\Psi(x^{md}_k),$$
$$\mathbb{E}[\|\bar{G}_k - \nabla\Psi(x^{md}_k)\|^2] = \tfrac{1}{m_k^2}\mathbb{E}\Big[\big\|\textstyle\sum_{i=1}^{m_k}[G(x^{md}_k, \xi_{k,i}) - \nabla\Psi(x^{md}_k)]\big\|^2\Big] \le \tfrac{\sigma^2}{m_k},$$
$$\mathbb{E}[\|G(x^{md}_k, \nabla\Psi(x^{md}_k), \beta_k) - G(x^{md}_k, \bar{G}_k, \beta_k)\|^2] \le \tfrac{\sigma^2}{m_k}.$$

The RSAG method for composite stochastic optimization

Theorem. Suppose that $\alpha_k = \frac{2}{k+1}$, $\beta_k = \frac{1}{2L_\Psi}$, $\lambda_k = \frac{k\beta_k}{2}$, and
$$p_k = \frac{\lambda_k C_k}{\sum_{k=1}^N\lambda_k C_k}, \quad k = 1, \ldots, N, \qquad\text{where } C_k := 1 - L_\Psi\lambda_k - \frac{L_\Psi(\lambda_k - \beta_k)^2}{2\alpha_k\Gamma_k\lambda_k}\Big(\textstyle\sum_{\tau=k}^N\Gamma_\tau\Big).$$
Also assume that
$$m_k = \left\lceil\frac{\sigma^2}{L_\Psi D^2}\min\left\{\frac{k}{L_f}, \frac{k^2}{N}\right\}\right\rceil, \quad k = 1, 2, \ldots, N,$$
for some parameter $D$. Then under Assumptions 1 and 2, we have
$$\mathbb{E}[\|G(x^{md}_R, \nabla\Psi(x^{md}_R), \beta_R)\|^2] \le 96L_\Psi\left[\frac{4L_\Psi(\|x_0 - x^*\|^2 + D^2)}{N^3} + \frac{L_f(\|x^*\|^2 + 2M^2 + 3D^2)}{N}\right].$$
If, in addition, $L_f = 0$, then
$$\mathbb{E}[\Phi(\bar{x}_R) - \Phi(x^*)] \le \frac{L_\Psi}{N^2}\left(12\|x_0 - x^*\|^2 + 7D^2\right).$$

Remarks

- An optimal choice of $D$ would be $\sqrt{\|x^*\|^2 + M^2}$.
- It can easily be shown that when $L_f = 0$, Algorithm 5 possesses an optimal complexity for solving convex SP problems.
- The choice of $m_k$ requires explicit knowledge of $L_f$ and $L_\Psi$.

Theorem. If $m_k = \left\lceil\frac{\sigma^2 k}{L_\Psi D^2}\right\rceil$, $k = 1, 2, \ldots$, then
$$\mathbb{E}[\|G(x^{md}_R, \nabla\Psi(x^{md}_R), \beta_R)\|^2] \le 96L_\Psi\left[\frac{4L_\Psi\|x_0 - x^*\|^2}{N^3} + \frac{L_f(\|x^*\|^2 + 2M^2) + 3D^2}{N}\right].$$

Adaptive SGD with momentum

Motivation

Idea: use a variable Bregman divergence in each iteration.

- Adaptive stochastic subgradient (Duchi, Hazan, Singer 11); different ways to define the Bregman divergence: ADAM, AMSGrad.
- Requires X to be bounded, but does not adapt to the geometry of X.
- Adaptive to data geometry (e.g., entry-wise variance).
- Non-convergence for ADAM (exponential moving average).
- No guarantees for incorporating similar techniques into SGD with momentum.

Adaptive and accelerated SGD (A2Grad)

[Deng, Cheng, Lan 18]
1) Set $x^{md}_k = (1-\alpha_k)\bar{x}_k + \alpha_k x_k$.
2) Sample $\xi_k$, compute $G_k = \nabla F(x^{md}_k, \xi_k)$ and $\phi_k(\cdot)$.
3) Set
   $x_{k+1} = \arg\min_{x\in X}\{\langle G_k, x\rangle + \gamma_k V(x_k, x) + \beta_k V_{\phi_k}(x_k, x)\}$,
   $\bar{x}_{k+1} = (1-\alpha_k)\bar{x}_k + \alpha_k x_{k+1}$.
4) Set $k \leftarrow k+1$ and go to step 1.

Specifically, if $V(x, y) = \frac{1}{2}\|x - y\|^2$ and $V_{\phi_k}(x, y) = \frac{1}{2}(x - y)^T\mathrm{Diag}(h_k)(x - y)$, then
$$x_{k+1} = \Pi_X\left(x_k - \tfrac{1}{\gamma_k + \beta_k h_k}G_k\right),$$
where $\Pi_X$ denotes the projection onto $X$ (and the division by $\gamma_k + \beta_k h_k$ is entry-wise).
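The diagonal update above translates directly into code. The sketch below implements a single A2Grad iteration for $X = \mathbb{R}^n$ (so $\Pi_X$ is the identity); the schedules for $\alpha_k$, $\gamma_k$, $\beta_k$ and the construction of the adaptive vector $h_k$ are left to the caller (see the moving-average schemes two slides below), so everything beyond the update rule is an assumption.

```python
import numpy as np

def a2grad_step(x, xbar, alpha, gamma, beta, h, stoch_grad):
    """One A2Grad iteration (diagonal Euclidean case, X = R^n):
      x_md     = (1 - alpha)*xbar + alpha*x
      G        = stoch_grad(x_md)
      x_new    = x - G / (gamma + beta*h)        # entry-wise scaling
      xbar_new = (1 - alpha)*xbar + alpha*x_new
    h is the adaptive diagonal vector (e.g., from a moving-average scheme);
    alpha, gamma, beta are the stepsize parameters for this iteration."""
    x_md = (1 - alpha) * xbar + alpha * x
    G = stoch_grad(x_md)
    x_new = x - G / (gamma + beta * h)
    xbar_new = (1 - alpha) * xbar + alpha * x_new
    return x_new, xbar_new, G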

Convergence properties of A2Grad

Theorem [Deng, Cheng, Lan 18]. Let $\beta_k = \beta$ and
$$h_k = \frac{1}{(k+1)^{q/2}}\sqrt{\sum_{\tau=0}^k(\tau+1)^q\delta_\tau^2}$$
with $q \in [0, 2]$.¹ Then
$$\mathbb{E}[f(\bar{x}_{K+1}) - f(x)] \le \frac{2L\|x - x_0\|^2}{(K+1)(K+2)} + \frac{2\beta B\,\mathbb{E}[\sum_{i=1}^n\|\delta_{0:K,i}\|]}{K+2} + \frac{2\,\mathbb{E}[\sum_{i=1}^n\|\delta_{0:K,i}\|]}{\beta(K+2)}.$$

- The deterministic part: $O(\tfrac{1}{K^2})$.
- The stochastic part: under the assumption $\mathbb{E}[\|\delta_{k,i}\|^2] \le \sigma^2$, we have $\mathbb{E}[\|\delta_{0:K,i}\|] \le \sqrt{\sum_{k=0}^K\mathbb{E}[\delta_{k,i}^2]} \le \sqrt{K+1}\,\sigma$. Hence the stochastic part obtains the $O(\tfrac{1}{\sqrt{K}})$ rate of convergence.

¹ This operation is applied entry-wise to the vectors $h_k$ and $\delta_\tau$.

Different moving average schemes

- Setting $q = 0$, uniform moving average: $h_k = \sqrt{\sum_{\tau=0}^k\delta_\tau^2}$.
- Setting $q = 2$, incremental moving average with quadratic weights: $h_k = \frac{1}{k+1}\sqrt{\sum_{\tau=0}^k(\tau+1)^2\delta_\tau^2}$.
- Exponential moving average:
$$v_k = \begin{cases}\delta_k^2 & \text{if } k = 0,\\ \rho v_{k-1} + (1-\rho)\delta_k^2 & \text{otherwise},\end{cases}
\qquad \hat{v}_k = \max\{v_k, \hat{v}_{k-1}\}, \qquad h_k = \sqrt{(k+1)\hat{v}_k}.$$

All of the above operations are applied entry-wise to the vectors $h_k$, $\delta_\tau$, $v_k$ and $\hat{v}_k$.

Numerical results

MNIST logistic regression: [Figure: training loss and testing accuracy over 50 epochs for Adam, AMSGrad, A2Grad-uni, A2Grad-inc, and A2Grad-exp.]

MNIST neural network: [Figure: training loss and testing accuracy over 50 epochs for Adam, AMSGrad, A2Grad-uni, A2Grad-inc, and A2Grad-exp.]

Better or at least comparable performance on all other tests, e.g., CIFARNET and VGG16 on CIFAR10.

Finite-sum smooth optimization

Problem: $\Psi^* := \min_{x\in X}\{\Psi(x) := f(x) + \mathcal{X}(x) + \mu\omega(x)\}$.
- $X$ closed and convex.
- $\mathcal{X}$ simple, e.g., the $\ell_1$ norm.
- $\omega$ strongly convex with modulus 1 w.r.t. an arbitrary norm.
- $\mu \ge 0$.
- The subproblem $\arg\min_{x\in X}\{\langle g, x\rangle + \mathcal{X}(x) + \mu\omega(x)\}$ is easy.
- $f_i$ smooth convex: $\|\nabla f_i(x_1) - \nabla f_i(x_2)\|_* \le L_i\|x_1 - x_2\|$.
- Denote $f(x) \equiv \sum_{i=1}^m f_i(x)$ and $L \equiv \sum_{i=1}^m L_i$. $f$ is smooth with Lipschitz constant $L_f \le L$.

Why should we care about finite-sum optimization?

- Many problems in data analysis, especially inverse problems, are given in this form (image reconstruction and matrix completion).
- In machine learning, we intend to minimize the true risk, which is given in the form of an expectation.
- Theoretically speaking, we do not need to save the data: using SA methods (e.g., MDSA, AC-SA), we directly minimize the true risk (or generalization error).
- However, if we can afford to save the dataset and make quite a few passes through it, we might be able to get better prediction, since the estimate of the required sample size is usually conservative.

Distributed ML

Data are distributed over a multi-agent network:
$$\min_x f(x) := \sum_{i=1}^m f_i(x) \quad\text{s.t. } x \in X, \quad X := \cap_{i=1}^m X_i,$$
where $f_i(x) = \mathbb{E}[F_i(x, \zeta_i)]$ is given in the form of an expectation.

- Optimization defined over a complex multi-agent network.
- Each agent has its own data (observations of $\zeta_i$).
- Data usually are private: no sharing.
- Agents can share knowledge learned from the data.
- Communication among agents can be expensive.
- Data can be captured online.

Example: distributed SVM

Three agents: $\min_x \tfrac{1}{3}[f_1(x) + f_2(x) + f_3(x)]$, with
$$f_1(x) = \tfrac{1}{N_1}\sum_{i=1}^{N_1}\max\{0, v^1_i\langle x, u^1_i\rangle\} + \rho\|x\|_2^2,$$
$$f_2(x) = \tfrac{1}{N_2}\sum_{i=1}^{N_2}\max\{0, v^2_i\langle x, u^2_i\rangle\} + \rho\|x\|_2^2,$$
$$f_3(x) = \tfrac{1}{N_3}\sum_{i=1}^{N_3}\max\{0, v^3_i\langle x, u^3_i\rangle\} + \rho\|x\|_2^2.$$

- Dataset for agent 1: $\{(u^1_i, v^1_i)\}_{i=1}^{N_1}$.
- Dataset for agent 2: $\{(u^2_i, v^2_i)\}_{i=1}^{N_2}$.
- Dataset for agent 3: $\{(u^3_i, v^3_i)\}_{i=1}^{N_3}$.
- Each agent accesses only its own dataset.

Example: distributed SVM - streaming data

$$\min_x \tfrac{1}{3}[f_1(x) + f_2(x) + f_3(x)], \quad\text{where } f_j(x) = \mathbb{E}\left[\max\{0, v^j\langle x, u^j\rangle\}\right] + \rho\|x\|_2^2, \; j = 1, 2, 3.$$

- The dataset for agent $j$ can be viewed as samples of the random vector $(u^j, v^j)$, $j = 1, 2, 3$.
- The $(u^j, v^j)$ can follow different distributions.
- Samples can be collected in an online fashion.
- Agents can possibly share solutions, but not samples.
- Need to minimize the communication costs.

Different topology for distributed learning

Outline

- Review of nonsmooth finite-sum problems
- Variance-reduced stochastic mirror descent
- Randomized primal-dual gradient method
- Stochastic finite-sum minimization
  - Random gradient extrapolation
- Decentralized SGD

Review of nonsmooth finite-sum problems

General stochastic programming (SP): $\min_{x\in X}\mathbb{E}_\xi[F(x, \xi)]$.

Reformulation of the finite-sum problem as an SP:
$$\xi \in \{1, \ldots, m\}, \quad \mathrm{Prob}\{\xi = i\} = \nu_i, \quad F(x, i) = \nu_i^{-1}f_i(x) + h(x) + \mu\omega(x), \; i = 1, \ldots, m.$$

- Iteration complexity: $O(1/\epsilon^2)$, or $O(1/\epsilon)$ when $\mu > 0$.
- Iteration cost: $m$ times cheaper than deterministic first-order methods.
- Saves up to a factor of $O(m)$ subgradient computations.

Smooth finite-sum optimization

Problem: $\Psi^* := \min_{x\in X}\left\{\Psi(x) := \sum_{i=1}^m f_i(x) + h(x) + \mu\omega(x)\right\}$.

For simplicity, focus on the strongly convex case ($\mu > 0$).
Goal: find a solution $\bar{x} \in X$ s.t. $\|\bar{x} - x^*\| \le \epsilon\|x_0 - x^*\|$.

- AGD: $O\left\{m\sqrt{\tfrac{L_f}{\mu}}\log\tfrac{1}{\epsilon}\right\}$.
- AC-SA: $O\left\{\sqrt{\tfrac{L_f}{\mu}}\log\tfrac{1}{\epsilon} + \tfrac{\sigma^2}{\mu\epsilon}\right\}$.

Note: the optimality of these bounds does not preclude more efficient algorithms for the finite-sum problem (which carries more structural information).

Variance-reduced methods

Variance reduction techniques

- Each iteration requires only a randomly selected $\nabla f_i(x)$.
- Stochastic average gradient (SAG) by Schmidt, Roux and Bach 13: $O\left((m + L/\mu)\log\tfrac{1}{\epsilon}\right)$.
- Similar results were obtained in Johnson and Zhang 13, Defazio et al. 14, ...
- Worse dependence on $L/\mu$ than Nesterov's method.

Variance-reduced stochastic mirror descent (VRSMD)

Algorithm 6: VRSMD
Input: $x^0$, $\gamma$, $T$, $\{\theta_t\}$. Set $\tilde{x}^0 = x^0$.
For $s = 1, 2, \ldots$
  Set $\tilde{x} = \tilde{x}^{s-1}$ and $\tilde{g} = \nabla f(\tilde{x})$.
  Set $x_1 = \tilde{x}^{s-1}$ and $T = T_s$.
  Set a probability distribution $Q = \{q_1, \ldots, q_m\}$ on $\{1, \ldots, m\}$.
  For $t = 1, 2, \ldots, T$
    Pick $i_t \in \{1, \ldots, m\}$ randomly according to $Q$.
    $G_t = (\nabla f_{i_t}(x_t) - \nabla f_{i_t}(\tilde{x}))/(q_{i_t}m) + \tilde{g}$.
    $x_{t+1} = \arg\min_{x\in X}\{\gamma[\langle G_t, x\rangle + h(x)] + V(x_t, x)\}$.
  End
  Set $x^s = x_{T+1}$ and $\tilde{x}^s = \sum_{t=2}^{T+1}(\theta_t x_t)/\sum_{t=2}^{T+1}\theta_t$.
End
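A Euclidean sketch of VRSMD with $h = 0$, $X = \mathbb{R}^n$, $\theta_t = 1$, a fixed inner-loop length, and the normalization $f(x) = \frac{1}{m}\sum_i f_i(x)$ (so that the estimator above is unbiased); the least-squares instance and all constants are assumptions.

```python
import numpy as np

def vrsmd(grads, full_grad, L_i, x0, gamma, epochs, T):
    """Variance-reduced stochastic mirror descent, Euclidean case, h = 0, X = R^n.
    grads[i](x) returns grad f_i(x); full_grad(x) returns grad f(x) with f = (1/m) sum_i f_i."""
    m = len(grads)
    q = np.asarray(L_i, dtype=float); q /= q.sum()        # q_i proportional to L_i
    rng = np.random.default_rng(0)
    xtilde = x0.copy()
    for _ in range(epochs):
        gtilde = full_grad(xtilde)                        # full gradient at the snapshot
        x = xtilde.copy()
        xs = []
        for _ in range(T):
            i = rng.choice(m, p=q)
            G = (grads[i](x) - grads[i](xtilde)) / (q[i] * m) + gtilde
            x = x - gamma * G                             # Euclidean prox step
            xs.append(x.copy())
        xtilde = np.mean(xs, axis=0)                      # average of x_2, ..., x_{T+1} (theta_t = 1)
    return xtilde

if __name__ == "__main__":
    rng = np.random.default_rng(8)
    m, n = 200, 20
    A = rng.standard_normal((m, n)); b = rng.standard_normal(m)
    grads = [lambda x, a=A[i], bi=b[i]: (a @ x - bi) * a for i in range(m)]
    full_grad = lambda x: A.T @ (A @ x - b) / m
    L_i = np.sum(A**2, axis=1)                            # Lipschitz constant of each grad f_i
    LQ = L_i.sum() / m                                    # L_i/(q_i*m) is constant = sum(L_i)/m
    x = vrsmd(grads, full_grad, L_i, np.zeros(n), gamma=1.0 / (16 * LQ), epochs=30, T=2 * m)
    print("objective:", 0.5 * np.mean((A @ x - b) ** 2))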

- VRSMD is a multi-epoch method.
- Each epoch contains $T$ iterations and requires the computation of a full gradient at the snapshot point $\tilde{x}$.
- The gradient $\nabla f(\tilde{x})$ is then used to define a gradient estimator $G_t$ for $\nabla f(x_t)$ at each iteration $t$.
- $G_t$ has smaller variance than the plain sampled gradient $\nabla f_{i_t}(x_t)$, while both yield unbiased estimators of $\nabla f(x_t)$.

Lemma
Conditionally on $x_1, \ldots, x_t$, with $\delta_t := G_t - \nabla f(x_t)$:
$$\mathbb{E}[\delta_t] = 0,$$
$$\mathbb{E}[\|\delta_t\|_*^2] \le 2L_Q[f(\tilde{x}) - f(x_t) - \langle\nabla f(x_t), \tilde{x} - x_t\rangle],$$
$$\mathbb{E}[\|\delta_t\|_*^2] \le 4L_Q[\Psi(x_t) - \Psi(x^*) + \Psi(\tilde{x}) - \Psi(x^*)].$$

Smooth problems without strong convexity

Theorem. Suppose that $\theta_t = 1$, $\gamma = 1/(16L_Q)$, $q_i = \frac{L_i}{\sum_{i=1}^m L_i}$, and $T_s = 2T_{s-1}$, $s = 2, 3, \ldots$, with $T_1 = 7$. Then for any $S \ge 1$, we have
$$\mathbb{E}[\Psi(\tilde{x}^S) - \Psi(x^*)] \le \frac{8}{2^{S-1}}\left[\tfrac{1}{14}(\Psi(x^0) - \Psi(x^*)) + 16L_Q V(x^0, x^*)\right].$$
The total number of gradient computations required to find a point $\bar{x} \in X$ s.t. $\Psi(\bar{x}) - \Psi(x^*) \le \epsilon$ can be bounded by
$$O\left\{m\log\frac{\Psi(x^0) - \Psi(x^*) + L_Q V(x^0, x^*)}{\epsilon} + \frac{\Psi(x^0) - \Psi(x^*) + L_Q V(x^0, x^*)}{\epsilon}\right\}.$$

Smooth and strongly convex problems

Theorem. Suppose that
$$\gamma = \tfrac{1}{21L}, \quad q_i = \tfrac{L_i}{\sum_{i=1}^m L_i}, \quad T \ge 2, \quad \theta_t = (1-\gamma\mu)^{-t}, \; t \ge 0.$$
Then for any $s \ge 1$, we have
$$\Delta_s \le \max\left\{\left(1 - \tfrac{\mu}{21L}\right)^T, \tfrac{1}{2}\right\}^s\Delta_0,$$
where
$$\Delta_s := \gamma\,\mathbb{E}[\Psi(x^s) - \Psi(x^*)] + \mathbb{E}[V(x^s, x^*)] + (1 - \gamma\mu - 4L_Q\gamma)\mu^{-1}(\theta_T - \theta_1)\mathbb{E}[\Psi(\tilde{x}^s) - \Psi(x^*)].$$
In particular, if $T$ is set to a constant factor of $m$, then the total number of gradients required to find an $\epsilon$-solution of the problem can be bounded by $O\left(\tfrac{L}{\mu} + m\right)\log\tfrac{\Delta_0}{\epsilon}$.

Open problems and our research

Problems:
- Can we accelerate the convergence of randomized incremental gradient methods?
- What is the best possible performance we can expect?

Our approach (Lan and Zhou 15):
- Develop the primal-dual gradient (PDG) method and show its inherent relation to Nesterov's method.
- Develop a randomized PDG (RPDG) method.
- Present a new lower complexity bound.
- Establish the optimality of the RPDG method.

Saddle point optimization

Problem: $\Psi^* := \min_{x\in X}\left\{\Psi(x) := \sum_{i=1}^m f_i(x) + h(x) + \mu\omega(x)\right\}$.

Saddle point formulation: let $J_f$ be the conjugate function of $f$. Consider
$$\Psi^* := \min_{x\in X}\left\{h(x) + \mu\omega(x) + \max_{g\in\mathcal{G}}\langle x, g\rangle - J_f(g)\right\}.$$

- The buyer purchases products from the supplier.
- The unit price is given by $g \in \mathbb{R}^n$.
- $X$, $h$ and $\omega$ are constraints and other local costs for the buyer.
- The profit of the supplier: revenue $\langle x, g\rangle$ minus local cost $J_f(g)$.


Algorithmic elements to achieve equilibrium

Current order quantity x^0, and product price g^0.

Proximity control functions:
  P(x^0, x) := ω(x) − [ω(x^0) + ⟨ω′(x^0), x − x^0⟩],
  D_f(g^0, g) := J_f(g) − [J_f(g^0) + ⟨J_f′(g^0), g − g^0⟩].

Dual prox-mapping:
  M_G(−x, g^0, τ) := argmin_{g∈G} { ⟨−x, g⟩ + J_f(g) + τ D_f(g^0, g) }.
Here x is the given or predicted demand: maximize the profit, but do not move too far away from g^0.

Primal prox-mapping:
  M_X(g, x^0, η) := argmin_{x∈X} { ⟨g, x⟩ + h(x) + µω(x) + η P(x^0, x) }.
Here g is the given or predicted price: minimize the cost, but do not move too far away from x^0.
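As a concrete special case, when ω(x) = ‖x‖²/2, h ≡ 0, and X = R^n (assumptions made only for this illustration), we have P(x^0, x) = ‖x − x^0‖²/2 and the primal prox-mapping is available in closed form:

import numpy as np

def primal_prox_mapping(g, x0, eta, mu):
    # M_X(g, x0, eta) = argmin_x <g, x> + mu/2 * ||x||^2 + eta/2 * ||x - x0||^2
    # (Euclidean omega, h = 0, unconstrained X); setting the gradient to zero gives:
    return (eta * x0 - g) / (mu + eta)

For a general X or a non-Euclidean ω, this step becomes the prox/projection subproblem that is assumed to be easy to solve.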


The deterministic PDG

Algorithm 7 The primal-dual gradient method
Let x^0 = x^{−1} ∈ X, and the nonnegative parameters {τ_t}, {η_t}, and {α_t} be given. Set g^0 = ∇f(x^0).
for t = 1, . . . , k do
  Update z^t = (x^t, g^t) according to
    x̃^t = α_t (x^{t−1} − x^{t−2}) + x^{t−1},
    g^t = M_G(−x̃^t, g^{t−1}, τ_t),
    x^t = M_X(g^t, x^{t−1}, η_t).
end for


PDG in gradient form

Algorithm 8 PDG method in gradient form
Input: Let x^0 = x^{−1} ∈ X, and the nonnegative parameters {τ_t}, {η_t}, and {α_t} be given. Set x̄^0 = x^0.
for t = 1, 2, . . . , k do
  x̃^t = α_t (x^{t−1} − x^{t−2}) + x^{t−1}.
  x̄^t = (x̃^t + τ_t x̄^{t−1}) / (1 + τ_t).
  g^t = ∇f(x̄^t).
  x^t = M_X(g^t, x^{t−1}, η_t).
end for

Idea: set J_f′(g^{t−1}) = x̄^{t−1}.
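A minimal runnable sketch of Algorithm 8 in the Euclidean, unconstrained setting (ω(x) = ‖x‖²/2, h ≡ 0, X = R^n), so the prox-mapping has a closed form; the constant stepsizes are placeholders rather than the tuned choices from the convergence theorems below.

import numpy as np

def pdg_gradient_form(grad_f, x0, mu, tau, eta, alpha, num_iters):
    # PDG in gradient form (Euclidean, unconstrained):
    #   x_tilde = x_prev + alpha*(x_prev - x_prev2)        (primal extrapolation)
    #   x_bar   = (x_tilde + tau*x_bar_prev) / (1 + tau)   (point where the gradient is taken)
    #   g       = grad_f(x_bar)
    #   x       = argmin_x <g, x> + mu/2*||x||^2 + eta/2*||x - x_prev||^2
    x_prev2 = x_prev = x0.copy()
    x_bar = x0.copy()
    for _ in range(num_iters):
        x_tilde = alpha * (x_prev - x_prev2) + x_prev
        x_bar = (x_tilde + tau * x_bar) / (1.0 + tau)
        g = grad_f(x_bar)
        x = (eta * x_prev - g) / (mu + eta)   # closed-form prox-mapping
        x_prev2, x_prev = x_prev, x
    return x_prev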


Relation to Nesterov's method in the non-strongly convex setting

A variant of Nesterov's method:
  x̄^t = (1 − θ_t) x̂^{t−1} + θ_t x^{t−1},
  x^t = M_X( ∑_{i=1}^m ∇f_i(x̄^t), x^{t−1}, η_t ),
  x̂^t = (1 − θ_t) x̂^{t−1} + θ_t x^t.

Note that
  x̄^t = (1 − θ_t) x̄^{t−1} + (1 − θ_t) θ_{t−1} (x^{t−1} − x^{t−2}) + θ_t x^{t−1},
so this scheme is equivalent to PDG with τ_t = (1 − θ_t)/θ_t and α_t = θ_{t−1}(1 − θ_t)/θ_t.

Nesterov's acceleration: looking-ahead dual player.
Gradient descent: myopic dual player (α_t = τ_t = 0 in PDG).


What is new for PDG?

The connection between Nesterov's method and the primal-dual method (and thus the related Nemirovski's mirror-prox method and the popular Douglas-Rachford splitting method) had never been developed before.

Using our technique it is possible to introduce a whole new set of optimal gradient methods based on these related algorithms for saddle point or variational inequality problems.

It inspired us to develop an optimal randomized incremental gradient method.


Convergence of PDG

Theorem
Define x̄^k := (∑_{t=1}^k θ_t)^{−1} ∑_{t=1}^k θ_t x^t. Suppose that
  τ_t = √(2L_f/µ),  η_t = √(2L_f µ),  α_t = α ≡ √(2L_f/µ) / (1 + √(2L_f/µ)),  and  θ_t = α^{−t}.
Then,
  P(x^k, x^*) ≤ ((µ + L_f)/µ) α^k P(x^0, x^*),
  Ψ(x̄^k) − Ψ(x^*) ≤ µ(1 − α)^{−1} [ 1 + (L_f/µ)(2 + L_f/µ) ] α^k P(x^0, x^*).

Theorem
If τ_t = (t − 1)/2, η_t = 4L_f/t, α_t = (t − 1)/t, and θ_t = t, then
  Ψ(x̄^k) − Ψ(x^*) ≤ (8L_f/(k(k+1))) P(x^0, x^*).


Technical challenges with PDG

Using the conjugate of the objective as the distance generating function seems to be novel.

Using a Bregman distance for the strongly convex case in PDG is completely new and crucial.

We deal with non-differentiable Bregman distances and the selection of subgradients.

The complexity bounds depend only on the diameter of the primal feasible set.

PDG is quite different from Nesterov's method in the strongly convex case.


A multi-dual-player saddle point optimization

Let J_i : Y_i → R be the conjugate function of f_i, and let Y_i, i = 1, . . . , m, denote the dual spaces. Consider
  min_{x∈X} { h(x) + µω(x) + max_{y_i∈Y_i} { ⟨x, ∑_i y_i⟩ − ∑_i J_i(y_i) } }.

Define the new dual prox-functions and dual prox-mappings as
  D_i(y_i^0, y_i) := J_i(y_i) − [J_i(y_i^0) + ⟨J_i′(y_i^0), y_i − y_i^0⟩],
  M_{Y_i}(−x, y_i^0, τ) := argmin_{y_i∈Y_i} { ⟨−x, y_i⟩ + J_i(y_i) + τ D_i(y_i^0, y_i) }.


The RPDG method

Algorithm 9 The RPDG method
Let x^0 = x^{−1} ∈ X, and {τ_t}, {η_t}, and {α_t} be given. Set y_i^0 = ∇f_i(x^0), i = 1, . . . , m.
for t = 1, . . . , k do
  Choose i_t according to Prob{i_t = i} = p_i, i = 1, . . . , m.
  x̃^t = α_t (x^{t−1} − x^{t−2}) + x^{t−1}.
  y_i^t = M_{Y_i}(−x̃^t, y_i^{t−1}, τ_t) if i = i_t,   y_i^t = y_i^{t−1} if i ≠ i_t.
  ỹ_i^t = p_i^{−1}(y_i^t − y_i^{t−1}) + y_i^{t−1} if i = i_t,   ỹ_i^t = y_i^{t−1} if i ≠ i_t.
  x^t = M_X( ∑_{i=1}^m ỹ_i^t, x^{t−1}, η_t ).
end for


RPDG in gradient form

Algorithm 10 RPDG in gradient form
for t = 1, . . . , k do
  Choose i_t according to Prob{i_t = i} = p_i, i = 1, . . . , m.
  x̃^t = α_t (x^{t−1} − x^{t−2}) + x^{t−1}.
  x̄_i^t = (1 + τ_t)^{−1}( x̃^t + τ_t x̄_i^{t−1} ) if i = i_t,   x̄_i^t = x̄_i^{t−1} if i ≠ i_t.
  y_i^t = ∇f_i(x̄_i^t) if i = i_t,   y_i^t = y_i^{t−1} if i ≠ i_t.
  x^t = M_X( g^{t−1} + (p_{i_t}^{−1} − 1)(y_{i_t}^t − y_{i_t}^{t−1}), x^{t−1}, η_t ).
  g^t = g^{t−1} + y_{i_t}^t − y_{i_t}^{t−1}.
end for
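A hedged Python sketch of Algorithm 10 in the same illustrative Euclidean, unconstrained setting (Ψ(x) = ∑_i f_i(x) + µ‖x‖²/2), with constant parameters as placeholders; grad_fi(i, x) stands for ∇f_i(x).

import numpy as np

def rpdg_gradient_form(grad_fi, m, x0, mu, tau, eta, alpha, probs, num_iters, seed=0):
    # Randomized primal-dual gradient method in gradient form (Euclidean, unconstrained).
    rng = np.random.default_rng(seed)
    x_prev2 = x_prev = x0.copy()
    x_bar = [x0.copy() for _ in range(m)]       # per-component points where gradients are taken
    y = [grad_fi(i, x0) for i in range(m)]      # stored component gradients (one full pass at x0)
    g = sum(y)                                  # running sum of the stored gradients
    for _ in range(num_iters):
        i = int(rng.choice(m, p=probs))
        x_tilde = alpha * (x_prev - x_prev2) + x_prev
        x_bar[i] = (x_tilde + tau * x_bar[i]) / (1.0 + tau)
        y_new = grad_fi(i, x_bar[i])
        g_hat = g + (1.0 / probs[i] - 1.0) * (y_new - y[i])   # predicted aggregate gradient
        x = (eta * x_prev - g_hat) / (mu + eta)               # closed-form prox-mapping
        g = g + (y_new - y[i])
        y[i] = y_new
        x_prev2, x_prev = x_prev, x
    return x_prev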


Rate of Convergence

Proposition
Let C = 8L/µ, and set
  p_i = Prob{i_t = i} = 1/(2m) + L_i/(2L), i = 1, . . . , m,
  τ_t = [ √((m−1)² + 4mC) − (m−1) ] / (2m),
  η_t = [ µ√((m−1)² + 4mC) + µ(m−1) ] / 2,
  α_t = α := 1 − 1/[ (m+1) + √((m−1)² + 4mC) ].
Then
  E[P(x^k, x^*)] ≤ (1 + 3L_f/µ) α^k P(x^0, x^*),
  E[Ψ(x̄^k)] − Ψ^* ≤ α^{k/2}(1 − α)^{−1} [ µ + 2L_f + L_f²/µ ] P(x^0, x^*).


The iteration complexity of RPDG

To find a point x ∈ X s.t. E[P(x, x^*)] ≤ ε:
  O{ (m + √(mL/µ)) log[ P(x^0, x^*)/ε ] }.

To find a point x ∈ X s.t. Prob{P(x, x^*) ≤ ε} ≥ 1 − λ for some λ ∈ (0, 1):
  O{ (m + √(mL/µ)) log[ P(x^0, x^*)/(λε) ] }.

This is a factor of O{ min{ √(L/µ), √m } } savings in gradient computations (or price changes) compared with deterministic first-order methods.


Technical challenges with RPDG

The incorporation of a dual prediction step into the PDG method complicates the analysis.

The expected functional optimality gap does not necessarily minorize the primal-dual gap, unlike in the deterministic case.

Unlike existing randomized primal-dual methods, RPDG is a randomized primal gradient method and it employs a stronger termination criterion.

Acceleration is built into the randomized incremental gradient method; there is no need for Ψ^* or its estimation.


Lower complexity bound

min_{x_i∈R^{n̄}, i=1,...,m} { Ψ(x) := ∑_{i=1}^m [ f_i(x_i) + (µ/2)‖x_i‖_2² ] },
  f_i(x_i) = (µ(Q−1)/4) [ (1/2)⟨A x_i, x_i⟩ − ⟨e_1, x_i⟩ ],   n̄ ≡ n/m,

  A =
    [  2  −1   0   0  · · ·   0   0   0 ]
    [ −1   2  −1   0  · · ·   0   0   0 ]
    [            · · ·                  ]
    [  0   0   0   0  · · ·  −1   2  −1 ]
    [  0   0   0   0  · · ·   0  −1   κ ],   κ = (√Q + 3)/(√Q + 1).

Theorem
Denote q := (√Q − 1)/(√Q + 1). Then the iterates x^k generated by any randomized incremental gradient method must satisfy
  E[‖x^k − x^*‖_2²] / ‖x^0 − x^*‖_2² ≥ (1/2) exp( −4k√Q / ( m(√Q + 1)² − 4√Q ) )
for any n ≥ n(m, k) ≡ [ m log( (1 − (1 − q²)/m)^{k/2} ) ] / (2 log q).
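A small sketch that builds the matrix A from this worst-case construction (nbar plays the role of n̄ = n/m; purely illustrative):

import numpy as np

def worst_case_matrix(nbar, Q):
    # Tridiagonal matrix from the lower-bound instance: 2 on the diagonal,
    # -1 on the off-diagonals, and kappa = (sqrt(Q)+3)/(sqrt(Q)+1) in the last diagonal entry.
    kappa = (np.sqrt(Q) + 3.0) / (np.sqrt(Q) + 1.0)
    A = 2.0 * np.eye(nbar) - np.eye(nbar, k=1) - np.eye(nbar, k=-1)
    A[-1, -1] = kappa
    return A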


Complexity

Corollary
The number of gradient evaluations performed by any randomized incremental gradient method to find a solution x ∈ X s.t. E[‖x − x^*‖_2²] ≤ ε cannot be smaller than
  Ω{ (√(mC) + m) log( ‖x^0 − x^*‖_2² / ε ) },
provided that n is sufficiently large.

Remarks
No valid lower complexity bound existed before for these randomized incremental gradient methods.
We also provide a lower complexity bound for randomized coordinate descent methods as a by-product.


Existing literature

SAG/SAGA in Schmidt et al. 13 and Defazio et al. 14, and SVRG in Johnson and Zhang 13, obtain an O{ (m + L/µ) log(1/ε) } rate of convergence (not optimal).
RPDG in Lan and Zhou 15 (precursors: Zhang and Xiao 14, Dang and Lan 14) requires an exact gradient evaluation at the initial point, and differentiability over R^n.
The Catalyst scheme in Lin et al. 15 and Katyusha in Allen-Zhu 16 require re-evaluating exact gradients from time to time (synchronous delays).
No existing studies on stochastic finite-sum problems: each f_i is represented by a stochastic oracle, or each agent only has access to noisy first-order information.

Theorem (Lan and Zhou 15(17))
The lower complexity bound of any RIG method for finding a solution x ∈ X such that E[‖x − x^*‖_2²] ≤ ε is at least
  Ω{ (m + √(mL/µ)) log( ‖x^0 − x^*‖_2² / ε ) }.


Goals:
  Fully-distributed (no exact gradient evaluations) RIG method.
  Direct acceleration scheme - optimal convergence rates.
  Applicable to stochastic finite-sum problems.


Preliminaries

Prox-function and prox-mapping

We define a prox-function associated with ω as
  P(x^0, x) := ω(x) − [ω(x^0) + ⟨ω′(x^0), x − x^0⟩],
and the prox-mapping associated with X and ω as
  M_X(g, x^0, η) := argmin_{x∈X} { ⟨g, x⟩ + µω(x) + η P(x^0, x) }.

P(·, ·) is a generalization of Bregman's distance, since ω is not necessarily differentiable.
P(·, ·) is strongly convex w.r.t. an arbitrary norm because of the strong convexity of ω.
It is reasonable to assume that the above prox-mapping problem is easy to solve.


The algorithm - An optimal gradient extrapolation method (GEM)

The Problem: min_{x∈X} { f(x) + µω(x) := (1/m) ∑_{i=1}^m f_i(x) + µω(x) }.

Gradient Extrapolation Method (GEM)
Initialization: x^0 = x̄^0 ∈ X and g^{−1} = g^0 = ∇f(x^0).   [exact gradient evaluation]
for t = 1, 2, . . . , k do
  g̃^t = α_t (g^{t−1} − g^{t−2}) + g^{t−1}.
  x^t = M_X(g̃^t, x^{t−1}, η_t).
  x̄^t = (x^t + τ_t x̄^{t−1}) / (1 + τ_t).
  g^t = ∇f(x̄^t).   [one exact gradient evaluation]
end for
Output: x̄^k.
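A minimal runnable sketch of GEM in the Euclidean, unconstrained setting (ω(x) = ‖x‖²/2, X = R^n), with the prox-mapping in closed form; the decorated iterates follow the reconstruction above and the constant parameters are placeholders.

import numpy as np

def gem(grad_f, x0, mu, tau, eta, alpha, num_iters):
    # Gradient extrapolation method (Euclidean, unconstrained):
    #   g_tilde = g_prev + alpha*(g_prev - g_prev2)      (gradient extrapolation)
    #   x       = argmin_x <g_tilde, x> + mu/2*||x||^2 + eta/2*||x - x_prev||^2
    #   x_bar   = (x + tau*x_bar_prev) / (1 + tau)       (point where the gradient is taken)
    #   g       = grad_f(x_bar)
    x_prev = x0.copy()
    x_bar = x0.copy()
    g_prev2 = g_prev = grad_f(x0)
    for _ in range(num_iters):
        g_tilde = alpha * (g_prev - g_prev2) + g_prev
        x = (eta * x_prev - g_tilde) / (mu + eta)   # closed-form prox-mapping
        x_bar = (x + tau * x_bar) / (1.0 + tau)
        g = grad_f(x_bar)
        x_prev, g_prev2, g_prev = x, g_prev, g
    return x_bar   # the output is the point where the last gradient was computed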


Intuition: Game interpretation of GEM

Finite-sum optimization problem:
  min_{x∈X} { f(x) + µω(x) }
⇕
Saddle point problem:
  min_{x∈X} { max_{g∈G} { ⟨x, g⟩ − J_f(g) } + µω(x) }


A game iteratively performed by a primal player (x) and a dual player (g) for finding the optimal solution of the saddle point problem
  min_{x∈X} { max_{g∈G} { ⟨x, g⟩ − J_f(g) } + µω(x) }.

The primal player predicts the dual player's action based on historical information, and determines her/his corresponding action by minimizing the predicted cost.
The dual player determines her/his action g^t by maximizing the profit.


x^0 = x̄^0 ∈ X,  g^{−1} = g^0 = ∇f(x^0).   [initial information]


The primal player's move (prediction, then a prox step):
  g̃^t = α_t (g^{t−1} − g^{t−2}) + g^{t−1},
  x^t ← argmin_{x∈X} { ⟨g̃^t, x⟩ + µω(x) + η_t P(x^{t−1}, x) }.


The dual player's move (one gradient evaluation):
  g^t = argmin_{g∈G} { ⟨−x^t, g⟩ + J_f(g) + τ_t D_f(g^{t−1}, g) }
⇕
  x̄^t = (x^t + τ_t x̄^{t−1}) / (1 + τ_t),
  g^t = ∇f(x̄^t).


GEM: the dual of Nesterov's accelerated gradient method

GEM:
  g̃^t = α_t (g^{t−1} − g^{t−2}) + g^{t−1}
  x^t = M_X(g̃^t, x^{t−1}, η_t)
  g^t = M_G(−x^t, g^{t−1}, τ_t)

NEST:
  x̃^t = α_t (x^{t−1} − x^{t−2}) + x^{t−1}
  g^t = M_G(−x̃^t, g^{t−1}, τ_t)
  x^t = M_X(g^t, x^{t−1}, η_t)


Optimality of GEM

Theorem (Lan and Zhou 17)
Let x^* be an optimal solution, and let x^k and x̄^k be generated by GEM. Suppose that µ > 0 and that the parameters are set to
  τ_t ≡ 1/(1 − α) − 1,  η_t ≡ αµ/(1 − α),  and  α_t ≡ α = √(2L_f/µ) / (1 + √(2L_f/µ)).
Then,
  P(x^k, x^*) ≤ 2α^k ∆/µ,
  ψ(x̄^k) − ψ^* ≤ α^k ∆,
where ∆ := µP(x^0, x^*) + ψ(x^0) − ψ^*.

To obtain an ε-solution, the number of gradient evaluations / communication rounds is
  O{ m √(L_f/µ) log(1/ε) }.

Note: (a) O(1/√ε) iterations for problems without strong convexity (µ = 0), with a different parameter selection.
(b) The output x̄^k is the point where the gradient is computed.


Potential improvements

Adding randomization...

The Problem: min_{x∈X} { f(x) + µω(x) := (1/m) ∑_{i=1}^m f_i(x) + µω(x) }.

GEM
Initialization: x^0 = x̄^0 ∈ X and g^{−1} = g^0 = ∇f(x^0).   [synchronous delay]
for t = 1, 2, . . . , k do
  g̃^t = α_t (g^{t−1} − g^{t−2}) + g^{t−1}.
  x^t = M_X(g̃^t, x^{t−1}, η_t).
  x̄^t = (x^t + τ_t x̄^{t−1}) / (1 + τ_t).
  g^t = ∇f(x̄^t).   [synchronous delay]
end for
Output: x̄^k.


The algorithm - Random Gradient Extrapolation Method (RGEM)

The Problem: min_{x∈X} { ψ(x) := (1/m) ∑_{i=1}^m f_i(x) + µω(x) }.

RGEM
Initialization: x̂_i^0 = x^0 ∈ X, ∀i, and y^{−1} = y^0 = 0.   [no exact gradient evaluation]
for t = 1, . . . , k do
  Choose i_t uniformly from {1, . . . , m},
  ỹ^t ← y^{t−1} + α_t (y^{t−1} − y^{t−2}),
  x^t ← M_X( (1/m) ∑_{i=1}^m ỹ_i^t, x^{t−1}, η_t ),
  x̂_i^t ← (1 + τ_t)^{−1}( x^t + τ_t x̂_i^{t−1} ) if i = i_t,   x̂_i^t ← x̂_i^{t−1} if i ≠ i_t,
  y_i^t ← ∇f_i(x̂_i^t) if i = i_t   [one gradient evaluation],   y_i^t ← y_i^{t−1} if i ≠ i_t.
end for
Output: For some θ_t > 0, set x̄^k := (∑_{t=1}^k θ_t)^{−1} ∑_{t=1}^k θ_t x^t.
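A hedged Python sketch of RGEM in the illustrative Euclidean, unconstrained setting (ψ(x) = (1/m)∑_i f_i(x) + µ‖x‖²/2), with constant parameters and uniform output weights θ_t = 1 as placeholders; grad_fi(i, x) stands for ∇f_i(x).

import numpy as np

def rgem(grad_fi, m, x0, mu, tau, eta, alpha, num_iters, seed=0):
    # Random gradient extrapolation method (Euclidean, unconstrained), with
    # y^{-1} = y^0 = 0 (no initial gradient pass).
    rng = np.random.default_rng(seed)
    x = x0.copy()
    x_hat = [x0.copy() for _ in range(m)]         # per-agent points where gradients are taken
    y = [np.zeros_like(x0) for _ in range(m)]
    y_prev = [np.zeros_like(x0) for _ in range(m)]
    sum_x = np.zeros_like(x0)
    for _ in range(num_iters):
        i = int(rng.integers(m))
        y_tilde_avg = sum(yi + alpha * (yi - ypi) for yi, ypi in zip(y, y_prev)) / m
        x = (eta * x - y_tilde_avg) / (mu + eta)  # closed-form prox-mapping (x on the right is x^{t-1})
        x_hat[i] = (x + tau * x_hat[i]) / (1.0 + tau)
        y_prev = [yi.copy() for yi in y]
        y[i] = grad_fi(i, x_hat[i])
        sum_x += x
    return sum_x / num_iters                      # weighted-average output with theta_t = 1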


The Problem: min_{x∈X} { ψ(x) := (1/m) ∑_{i=1}^m E_{ξ_i}[F_i(x, ξ_i)] + µω(x) }.

Assumption
At iteration t, for a given point x̂_i^t ∈ X, a stochastic first-order (SO) oracle outputs a vector G_i(x̂_i^t, ξ_i^t) s.t.
  E_ξ[G_i(x̂_i^t, ξ_i^t)] = ∇f_i(x̂_i^t), i = 1, . . . , m,
  E_ξ[‖G_i(x̂_i^t, ξ_i^t) − ∇f_i(x̂_i^t)‖_*²] ≤ σ², i = 1, . . . , m.

RGEM for stochastic finite-sum optimization
The same as RGEM, except that the gradient update step is replaced by
  y_i^t ← (1/B_t) ∑_{j=1}^{B_t} G_i(x̂_i^t, ξ_{i,j}^t) if i = i_t   [stochastic gradients of f_i given by the SO],
  y_i^t ← y_i^{t−1} if i ≠ i_t.
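Under the same illustrative setup as the RGEM sketch above, the only change for the stochastic case is the gradient step; a hedged helper that averages B_t oracle samples (stoch_grad is an assumed oracle call returning one sample G_i(x, ξ)):

def minibatch_gradient(stoch_grad, x, batch_size):
    # Average batch_size stochastic subgradient samples to form y_i^t,
    # replacing the exact call grad_fi(i, x_hat[i]) in the RGEM sketch.
    return sum(stoch_grad(x) for _ in range(batch_size)) / batch_size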


RGEM from the server and activated agent perspective

The server updates the iterate x^t and computes the output solution x̄^k, given by sumx/sumθ.
One agent is activated at a time; it updates its local variables x̂_i^t and uploads the change of gradient ∆y_i to the server.


Advantages of RGEM for distributed learning

RGEM is an enhanced RIG method for distributed learning:
  Requires no exact gradient evaluations of f.
  Involves communication only between the server and the activated agent at each iteration, and tolerates communication failures.
  Possesses a directly accelerated algorithmic scheme.
  Applicable to stochastic optimization - minimization of the generalization risk.
  Sampling and communication complexities?


Optimality of RGEM

RGEM for deterministic finite-sum optimization problems

Theorem (Lan and Zhou 17)
Let x^* be an optimal solution, L = max_{i=1,...,m} L_i, and set the parameters to
  τ_t = 1/(m(1 − α)) − 1,  η_t = αµ/(1 − α),  and  α_t ≡ mα,
with α = 1 − 1/( m + √(m² + 16mL/µ) ). Then,
  E[P(x^k, x^*)] ≤ 2∆_{0,σ_0} α^k / µ,
  E[ψ(x̄^k) − ψ(x^*)] ≤ 16 max{ m, L/µ } ∆_{0,σ_0} α^{k/2},
where ∆_{0,σ_0} := µP(x^0, x^*) + ψ(x^0) − ψ(x^*) + σ_0²/(mµ), and σ_0² is an upper bound on the initial gradients such that (1/m) ∑_{i=1}^m ‖∇f_i(x^0)‖_*² ≤ σ_0².

Conclusion: To obtain a stochastic ε-solution, the number of gradient evaluations of f_i / communication rounds is
  O{ (m + √(mL/µ)) log(1/ε) }.


RGEM for stochastic finite-sum optimization problems

Theorem (Lan and Zhou 17)
Let x^* be an optimal solution and L = max_{i=1,...,m} L_i. Given the iteration limit k, let the parameters be set as before, and set
  B_t = ⌈ k(1 − α)² α^{−t} ⌉, t = 1, . . . , k.
Then
  E[P(x^k, x^*)] ≤ 2α^k ∆_{0,σ_0,σ} / µ,
  E[ψ(x̄^k) − ψ(x^*)] ≤ 16 max{ m, L/µ } ∆_{0,σ_0,σ} α^{k/2},
where ∆_{0,σ_0,σ} := µP(x^0, x^*) + ψ(x^0) − ψ(x^*) + (σ_0²/m + 5σ²)/µ.

Conclusion: To obtain a stochastic ε-solution,
  # of required communication rounds: O{ (m + √(mL/µ)) log(1/ε) },
  # of stochastic gradient evaluations: O{ ∆_{0,σ_0,σ}/(µε) + m + √(mL/µ) }.

Note: the sampling complexity is asymptotically independent of m.


Decentralized optimization techniques

Most studies focus on deterministic decentralized optimization (e.g., Nedic and Ozdaglar 09; Shi, Ling, Wu, and Yin 15):
  O(1/ε) communication rounds and gradient computations;
  O(log(1/ε)) communication rounds for unconstrained smooth and strongly convex problems.

For stochastic optimization problems:
  direct extensions of SGD-type methods (e.g., Duchi, Agarwal, and Wainwright 12);
  O(1/ε²) communication rounds and stochastic (sub)gradient computations.


Motivation

Communication is about 1,000 times more expensive than local memory access:
  communication over TCP/IP: ~10 KB/ms, plus a few ms for startup;
  CPUs read/write from/to memory: ~10 KB/µs.
Current decentralized SGD methods involve one communication round per iteration - not good for decentralized stochastic optimization and/or machine learning.

Question: Can we skip communication from time to time?
Inspiration: gradient sliding for composite optimization min_{x∈X} f(x) + h(x) (Lan, 14).


How to handle decentralized structure?


Dual decomposition (explicit):
  min_x F(x) := ∑_{i=1}^m f_i(x_i)
  s.t. x_1 = x_2 = · · · = x_m,  x_i ∈ X_i, ∀i = 1, . . . , m,
where x = [x_1^⊤, . . . , x_m^⊤].

Let N_i be the set of neighbors of agent i. The Laplacian L ∈ R^{m×m} of the graph G = (V, E) is given by
  L_ij = |N_i| − 1 if i = j;   L_ij = −1 if i ≠ j and (i, j) ∈ E;   L_ij = 0 otherwise.


Background: Laplacian L

Let N_i denote the set of neighbors of agent i:
  N_i = { j ∈ V | (i, j) ∈ E } ∪ { i }.
Then, the Laplacian L ∈ R^{m×m} of a graph G = (V, E) is defined as:
  L_ij = |N_i| − 1 if i = j;   L_ij = −1 if i ≠ j and (i, j) ∈ E;   L_ij = 0 otherwise.

For example:
  L = [  1  −1   0 ]
      [ −1   2  −1 ]
      [  0  −1   1 ],
and L1 = 0 ("agreement subspace").
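A small sketch that builds the graph Laplacian defined above from an edge list; the 3-node path graph reproduces the example matrix.

import numpy as np

def graph_laplacian(m, edges):
    # Degree of node i on the diagonal (|N_i| - 1, with N_i including i itself),
    # -1 for each edge (i, j), 0 otherwise; L @ ones(m) = 0 (agreement subspace).
    L = np.zeros((m, m))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return L

L_example = graph_laplacian(3, [(0, 1), (1, 2)])   # [[1,-1,0],[-1,2,-1],[0,-1,1]]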


Problem Formulation

min_x F(x) := ∑_{i=1}^m f_i(x_i)
s.t. x_1 = · · · = x_m,  x_i ∈ X_i, i = 1, . . . , m

(=)   [since G is connected]

min_x F(x) := ∑_{i=1}^m f_i(x_i)
s.t. x_i = x_j, ∀(i, j) ∈ E,  x_i ∈ X_i, i = 1, . . . , m

(=)   [using the Laplacian, the consistency constraints can be compactly rewritten; L := L ⊗ I_d, acting blockwise on x ∈ R^{md}]

min_x F(x) := ∑_{i=1}^m f_i(x_i)
s.t. Lx = 0,  x_i ∈ X_i, i = 1, . . . , m

(=)   min_{x∈X^m} { F(x) + max_{y∈R^{md}} ⟨Lx, y⟩ }   [equivalent saddle point form]


Decentralized Primal-Dual (DPD): Vector Form

min_{x∈X^m} { F(x) + max_{y∈R^{md}} ⟨Lx, y⟩ },   x := [x_1^⊤ · · · x_m^⊤]^⊤,   y := [y_1^⊤ · · · y_m^⊤]^⊤.

Let x^0 = x^{−1} ∈ X^m, y^0 ∈ R^{md}, and {α_k}, {τ_k}, {η_k}, {θ_k} be given.
For k = 1, . . . , N, update z^k = (x^k, y^k):
  x̃^k = α_k (x^{k−1} − x^{k−2}) + x^{k−1}
  y^k = argmin_{y∈R^{md}} { ⟨−Lx̃^k, y⟩ + (τ_k/2)‖y − y^{k−1}‖² }
  x^k = argmin_{x∈X^m} { ⟨Ly^k, x⟩ + F(x) + (η_k/2)‖x^{k−1} − x‖² }
Return z̄^N = (∑_{k=1}^N θ_k)^{−1} ∑_{k=1}^N θ_k z^k.
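A hedged Python sketch of the vector-form DPD iteration with Euclidean prox terms and θ_k = 1 (plain averaging); prox_F is an assumed callback that solves the primal subproblem argmin_x ⟨w, x⟩ + F(x) + (η/2)‖x − x_prev‖² (separable across agents), and Lmat stands for the Kronecker-expanded Laplacian L ⊗ I_d.

import numpy as np

def dpd(prox_F, Lmat, x0, y0, alpha, tau, eta, num_iters):
    # Decentralized primal-dual method, vector form.
    x_prev2 = x_prev = x0.copy()
    y = y0.copy()
    sum_x, sum_y = np.zeros_like(x0), np.zeros_like(y0)
    for _ in range(num_iters):
        x_tilde = alpha * (x_prev - x_prev2) + x_prev
        y = y + (Lmat @ x_tilde) / tau        # closed-form dual step
        x = prox_F(Lmat @ y, x_prev, eta)     # primal prox step (assumed solvable per agent)
        sum_x, sum_y = sum_x + x, sum_y + y
        x_prev2, x_prev = x_prev, x
    return sum_x / num_iters, sum_y / num_iters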


DPD: Agent i ’s point of view

Let x^0 = x^{−1} ∈ X^m, y^0 ∈ R^{md}, and {α_k}, {τ_k}, {η_k}, {θ_k} be given.
For k = 1, . . . , N, update z_i^k = (x_i^k, y_i^k):
  x̃_i^k = α_k (x_i^{k−1} − x_i^{k−2}) + x_i^{k−1}
  v_i^k = ∑_{j∈N_i} L_ij x̃_j^k
  y_i^k = y_i^{k−1} + (1/τ_k) v_i^k
  w_i^k = ∑_{j∈N_i} L_ij y_j^k
  x_i^k = argmin_{x_i∈X_i} { ⟨w_i^k, x_i⟩ + f_i(x_i) + (η_k/2)‖x_i^{k−1} − x_i‖² }
Return z̄^N = (∑_{k=1}^N θ_k)^{−1} ∑_{k=1}^N θ_k z^k.


DPD: Convergence Results

Theorem
Let x̄^N = (1/N) ∑_{k=1}^N x^k and let x^* be an optimal solution. Set α_k = θ_k = 1, η_k = 2‖L‖, and τ_k = ‖L‖, ∀k = 1, . . . , N. Then, for any N ≥ 1,
  F(x̄^N) − F(x^*) ≤ (‖L‖/N) [ ‖x^0 − x^*‖² + (1/2)‖y^0‖² ],
  ‖Lx̄^N‖ ≤ (2‖L‖/N) [ 3√( (1/2)‖x^0 − x^*‖² ) + 2‖y^* − y^0‖ ].

To find a point x ∈ X s.t. F(x) − F(x^*) ≤ ε and ‖Lx‖ ≤ ε: O(1/ε) iterations.
The number of required communication rounds is also O(1/ε).


Decentralized Communication Sliding (DCS)

Let x^0 = x^{−1} ∈ X^m, y^0 ∈ R^{md}, and {α_k}, {τ_k}, {η_k}, {θ_k}, {T_k} be given.
For k = 1, . . . , N, update z_i^k = (x_i^k, y_i^k):
  x̃_i^k = α_k (x_i^{k−1} − x_i^{k−2}) + x_i^{k−1}
  v_i^k = ∑_{j∈N_i} L_ij x̃_j^k
  y_i^k = argmin_{y_i∈R^d} { ⟨−v_i^k, y_i⟩ + (τ_k/2)‖y_i − y_i^{k−1}‖² }
  w_i^k = ∑_{j∈N_i} L_ij y_j^k
  (x_i^k, x̂_i^k) = inner loop, run for T_k iterations

Observations: the computation of v_i^k and w_i^k involves the graph Laplacian, i.e., communication with neighboring nodes.


The primal subproblem is solved with noisy subgradients

  x_i^k = argmin_{x_i∈X_i} { ⟨w_i^k, x_i⟩ + f_i(x_i) + (η_k/2)‖x_i^{k−1} − x_i‖² }.

Inner loop: let u^0 = û^0 = x_i^{k−1}, and {β_t}, {λ_t} be given.
For t = 1, . . . , T_k:
  h^{t−1} = G_i(u^{t−1}, ξ_i^{t−1})
  u^t = argmin_{u∈X_i} { ⟨h^{t−1} + w_i^k, u⟩ + (η_k/2)‖x_i^{k−1} − u‖² + (η_k β_t/2)‖u^{t−1} − u‖² }
Return x_i^k = u^{T_k} and x̂_i^k = (∑_{t=1}^{T_k} λ_t)^{−1} ∑_{t=1}^{T_k} λ_t u^t.

Observations: the same w_i^k is used throughout the inner loop, so communication is skipped! There are two output points, x_i^k and x̂_i^k.
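A hedged Python sketch of this inner loop with Euclidean prox terms; stoch_subgrad(u) is an assumed noisy-subgradient oracle G_i(u, ξ), project(·) is the Euclidean projection onto X_i (identity when X_i = R^d), and beta, lam are the stepsize and weight schedules.

import numpy as np

def cs_inner_loop(stoch_subgrad, w, x_prev, eta, T, beta, lam, project=lambda u: u):
    # The same neighbor information w is reused for all T inner steps: no communication inside.
    u = x_prev.copy()
    sum_u, sum_lam = np.zeros_like(u), 0.0
    for t in range(1, T + 1):
        h = stoch_subgrad(u)
        # argmin_u <h + w, u> + eta/2*||x_prev - u||^2 + eta*beta(t)/2*||u_prev - u||^2
        u = project((eta * x_prev + eta * beta(t) * u - (h + w)) / (eta * (1.0 + beta(t))))
        sum_u, sum_lam = sum_u + lam(t) * u, sum_lam + lam(t)
    return u, sum_u / sum_lam   # (x_i^k, the averaged point hat{x}_i^k)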


Convergence results

SDCS: convergence for convex cases

Theorem
Let x̄^N = (1/N) ∑_{k=1}^N x^k. If β_t = t/2, λ_t = t + 1, α_k = θ_k = 1, η_k = 2‖L‖, τ_k = ‖L‖, and
  T_k = ⌈ m(M² + σ²)N / (‖L‖² D) ⌉
for some D > 0, then
  E[F(x̄^N) − F(x^*)] ≤ (‖L‖/N) [ (3/2)‖x^0 − x^*‖² + (1/2)‖y^0‖² + 4D ],
  E[‖Lx̄^N‖] ≤ (‖L‖/N) [ 3√( 3‖x^0 − x^*‖² + 8D ) + 4‖y^* − y^0‖ ].


Implication

To find a point x ∈ X such that E[F(x) − F(x^*)] ≤ ε and E[‖Lx‖] ≤ ε:
  # of outer iterations / communication rounds: N = O(1/ε), the same as in the deterministic case;
  # of inner iterations / samples / stochastic (sub)gradients: ∑_{k=1}^N T_k = O(1/ε²).
This reduces the # of communication rounds by a factor of 1/ε while keeping the optimal sampling complexity.


Boundedness of y∗

Theorem
Let x^* be an optimal solution. Then there exists an optimal dual multiplier y^* s.t.
  ‖y^*‖ ≤ √m M / σ_min(L),
where σ_min(L) denotes the smallest nonzero singular value of L.

Everything can be represented in primal terms!


SDCS: convergence for strongly convex cases
Assume (µ/2)‖x − y‖² ≤ f_i(x) − f_i(y) − ⟨f_i′(y), x − y⟩ ≤ M‖x − y‖ for all x, y ∈ X.

Theorem
Let x^* be an optimal solution and x̄^N = (2/(N(N+3))) ∑_{k=1}^N (k + 1) x^k. If
  λ_t = t,  β_t^{(k)} = (t+1)µ/(2η_k) + (t−1)/2,  α_k = k/(k+1),  θ_k = k + 1,  η_k = kµ/2,
  τ_k = 4‖L‖²/((k+1)µ),  and  T_k = ⌈ (2N√(m(M² + σ²)) / (Dµ)) max{ √(m(M² + σ²)) D/(8µ), 1 } ⌉
for some D > 0, then
  E[F(x̄^N) − F(x^*)] ≤ (2/(N(N+3))) [ (µ/2)‖x^0 − x^*‖² + (2‖L‖²/µ)‖y^0‖² + 2µD ],
  E[‖Lx̄^N‖] ≤ (8‖L‖/(N(N+3))) [ 3√( 2D + (1/2)‖x^0 − x^*‖² ) + (7‖L‖/µ)‖y^* − y^0‖ ].


To find a point x ∈ X such that E[F(x) − F(x^*)] ≤ ε and E[‖Lx‖] ≤ ε:
  # of outer iterations / communication rounds: N = O(1/√ε);
  # of inner iterations / samples / stochastic (sub)gradients: ∑_{k=1}^N T_k = O(1/ε).
This reduces the # of communication rounds by a factor of 1/√ε while keeping the optimal sampling complexity.


Summary of convergence results

Table: Complexity for obtaining an ε-optimal and ε-feasible solution

Algorithm (problem type)   | # of communications        | # of subgradient evaluations
DCS: convex                | O{ ‖L‖ D²_{X^m} / ε }      | O{ m M² D²_{X^m} / ε² }
SDCS: convex               | O{ ‖L‖ D²_{X^m} / ε }      | O{ m (M² + σ²) D²_{X^m} / ε² }
DCS: strongly convex       | O{ √( µ D²_{X^m} / ε ) }   | O{ m M² / (µ ε) }
SDCS: strongly convex      | O{ √( µ D²_{X^m} / ε ) }   | O{ m (M² + σ²) / (µ ε) }


Comparisons with centralized SGD

Table: # of stochastic subgradient evaluations

Problem type      | SDCS (individual agent)          | SGD
Convex            | O{ m (M² + σ²) D²_{X^m} / ε² }   | O{ m (M_f² + σ²) D²_X / ε² }
Strongly convex   | O{ m (M² + σ²) / (µ ε) }         | O{ m (M_f² + σ²) / (µ_f ε) }

Note: M_f ≤ mM, µ_f ≥ mµ, D²_X / D²_{X^m} = O(1/m), and σ² ≤ mσ².

Conclusion: the sampling complexity is comparable to that of centralized SGD under reasonable assumptions, and hence not improvable in general.
