Upload
trannhu
View
254
Download
1
Embed Size (px)
Citation preview
Stochastic Optimization Algorithms forMachine Learning
Guanghui (George) Lan
H. Milton School of Industrial and Systems Engineering,Georgia Tech, USA
Workshop on Optimization and LearningCIMI Workshop, Toulouse, France
September 10-13, 2018
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Machine learning (ML)
ML exploits optimization, statistics, high performancecomputing, among other techniques, to transform raw data intoknowledge in order to support decision-making in variousareas, e.g., in biomedicine, health care, logistics, energy andtransportation etc.
2 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
The role of optimization
Optimization provides theoretical insights and efficient solutionmethods for ML models.
2 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Typical ML models
Given a set of observed data S = (ui , vi)mi=1, drawn from acertain unknown distribution D on U × V .
Goal: to describe the relation between ui and vi ’s forprediction.Applications: predicting strokes and seizures, identifyingheart failure, stopping credit card fraud, predicting machinefailure, identifying spam, ......Classic models:
Lasso regression: minx E[(〈x ,u〉 − v)2] + ρ‖x‖1.Support vector machine: minEu,v [max0, v〈x ,u〉] + ρ‖x‖2
2.Deep learning: minθ,W Eu,v (F (θTψ(Wu))− v)2.
3 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Optimization models/methods in ML
Stochastic optimization: minx∈X f (x) := Eξ[F (x , ξ)] .F is the regularized loss function and ξ = (u, v):F (x , ξ) = (〈x ,u〉 − v)2 + ρ‖x‖1F (x , ξ) = max0, v〈x ,u〉+ ρ‖x‖22.Often called population risk minimization in ML
Finite-sum minimization: minx∈X
f (x) := 1
N∑N
i=1fi(x).
Empirical risk minimization: fi(x) = F (x , ξi).Distributed/decentralized ML: fi - loss for each agent i .
4 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Outline
Deterministic first-order methodsStochastic optimization methods
Stochastic gradient descent (SGD)Accelerated SGD (SGD with momentum)Nonconvex SGD and its accelerationAdaptive and accelerated methods
Finite-sum and distributed optimization methodsVariance reduced gradient methodsPrimal-dual gradient methodsDecentralized/Distributed SGD
Unconvered topics: projection-free, second-order methods
5 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
(Sub)Gradient descent
Problemf ∗ := min
x∈Xf (x). Here ∅ 6= X ⊂ Rn is a convex set and f is a convex
function.
Basic IdeaStarting from x1 ∈ Rn, update xt by xt+1 = xt − γt∇f (xt ), t = 1,2, . . . .
Two essential enhancements
f may be non-differentiable: replace ∇f (xt ) by g(xt ) ∈ ∂f (xt ).
xt+1 may be infeasible: project back to X .
xt+1 := argminx∈X‖x − (xt − γtg(xt ))‖2, t = 1,2, . . . .
6 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Interpretation
From the proximity control point of view,xt+1 = argminy∈X
12‖x − (xt − γtg(xt ))‖2
2= argminx∈Xγt〈g(xt ), x − xt〉+ 1
2‖x − xt‖22
= argminx∈Xγt [f (xt ) + 〈g(xt ), x − xt〉] + 12‖x − xt‖2
2= argminx∈Xγt〈g(xt ), x〉+ 1
2‖x − xt‖22.
Implication
To minimize the linear approximation f (xt ) + 〈g(xt ), x − xt〉 of f (x)over X , without moving too far away from xt .
The role of stepsize
γt controls how much we trust the model and depends on whichproblem classes to be solved.
7 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Convergence of (Sub)Gradient Descent (GD)
Nonsmooth problems
f is M-Lipschitz continuous, i.e., |f (x)− f (y)| ≤ M‖x − y‖2.
Theorem
Let xks :=
(∑kt=sγt
)−1∑kt=s(γtxt ). Then
f (xks )− f ∗ ≤
(2∑k
t=sγt
)−1 [‖x∗ − xs‖2
2 + M2∑kt=sγ
2t
].
Selection of γt
If γt =√
D2X/(kM2), for some fixed k , then f (xk
1 )− f ∗ ≤ MDX
2√
k,
where DX ≥ maxx1,x2∈X ‖x1 − x2‖.
If γt =√
D2X/(tM2), then f (xk
dk/2e)− f ∗ ≤ O(1)(MDX/√
k).
8 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Convergence of GD
Smooth problems
f is differentiable, and ∇f is L-Lipschitz continuous, i.e.,|∇f (x)−∇f (y)| ≤ L‖x − y‖2.
∃µ ≥ 0 s.t. f (y) ≥ f (x) + 〈f ′(x), y − x〉+ 12µ‖y − x‖2.
Theorem
Let γ ∈ (0, 2L ]. If γt = γ, t = 1,2, . . ., then
f (xk )− f ∗ ≤ ‖x0−x∗‖2
kh(2−Lh) .
Moreover, if µ > 0 and Qf = L/µ, then‖xk − x∗‖ ≤ ( Qf−1
Qf +1 )k‖x0 − x∗‖,f (xk )− f ∗ ≤ L
2 ( Qf−1Qf +1 )2k‖x0 − x∗‖2.
9 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Adaption to Geometry: Mirror Descent (MD)
GD is intrinsically linked to the Euclidean structure of Rn:
The method relies on the Euclidean projection,
DX , L and M are defined in terms of the Euclidean norm.
Bregman Distance
Let ‖ · ‖ be a (general) norm on Rn and ‖x‖∗ = sup‖y‖≤1〈x , y〉, and ωbe a continuously differentiable and strongly convex function withmodulus ν with respect to ‖ · ‖. Define
V (x , z) = ω(z)− [ω(x) +∇ω(x)T (z − x)].
Mirror Descent (Nemirovski and Yudin 83, Beck and Teboule 03)
xt+1 = argminx∈Xγt〈g(xt ), x〉+ V (xt , x), t = 1,2, . . . .
GD is a special case of MD: V (xt , x) = ‖x − xt‖22/2.
10 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Acceleration scheme
Assume that f is smooth: ‖∇f (x)−∇f (y)‖∗ ≤ L‖x − y‖.
Accelerated gradient descent (AGD) (Nesterov 83, 04; Tseng 08)
Choose x0 = x0 ∈ X .1) Set xk = (1− αk )xk−1 + αk xk−1.2) Compute f ′(xk ) and set
xk = argminx∈Xαk 〈f ′(xk ), x〉+ βk V (xk−1, x),xk = (1− αk )xk−1 + αk xk .
3) Set k ← k + 1 and go to step 1).
Theorem
If βk ≥ Lα2k and βk = (1− αk )βk−1 for k ≥ 1, then
f (xk )− f (x) ≤ 1−α1β1
[f (x0)− f (x)] + βk v(x0,−x0), ∀x ∈ X .In particular, If αk = 2/(k + 1) and βk = 4L/[k(k + 1)], then
f (xk )− f (x∗) ≤ 4Lνk(k+1) V (x0, x∗).
11 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Acceleration scheme (strongly convex case)
Assumption: ∃µ > 0 s.t. f (y) ≥ f (x) + 〈f ′(x), y − x〉+ 12µ‖y − x‖2.
AGD for strongly convex problems
Idea: Restart AGD every N ≡⌈√
8L/µ⌉
iterations.The algorithm.Input: p0 ∈ X .Phase t = 1,2, . . .:
Set pt = xN , where xN is obtained from AGD with x0 = pt−1.
Theorem
For any t ≥ 1, we have ‖pt − x∗‖2 ≤ ( 12 )t‖p0 − x∗‖2.
To have ‖pt − x∗‖2 ≤ ε, the total number of iterations is bounded by⌈√8Lµ
⌉log ‖p0−x∗‖2
ε
12 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Stochastic optimization problems
The Problem: minx∈X f (x) := Eξ[F (x , ξ)].
Challenges: To compute exact (sub)gradients is computationallyprohibitive.
Stochastic oracleAt iteration t , xt ∈ X being the input, SO outputs a vector G(xt , ξt ),where ξtt≥1 are i.i.d. random variables s.t.
E[G(xt , ξt )] ≡ g(xt ) ∈ ∂Ψ(xt ).
Examples:
f (x) = Eξ[F (x , ξ)]: G(xt , ξt ) = F ′(xt , ξt ), ξt being a randomrealization of ξ.
f (x) =∑m
i=1 fi (x)/m: G(xt , it ) = f ′it (xt ), it being a uniform randomvariable on 1, . . . ,m.
13 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
stochastic mirror descent
Mirror descent stochastic approximation (MDSA)
The algorithm: Replace the exact linear model in MD with itsstochastic approximation (goes back to Robinson and Monro 51,Nemirovski and Yudin 83).
xt+1 = argminx∈Xγt〈Gt , x〉+ V (xt , x), t = 1,2, . . .
Theorem (Nemirovski, Juditsky, Lan and Shapiro 07 (09))
Assume E[‖G(x , ξ)‖2∗] ≤ M2.
E[f (xks )]− f ∗ ≤
(∑kt=sγt
)−1 [E[V (xs, x∗)] + (2ν)−1M2∑k
t=sγ2t
].
The selection of γt
If γt =√
Ω2/(kM2) for some fixed k , then E[f (xk1 )− f ∗] ≤ MΩ
2√
k,
where Ω ≥ maxx1,x2∈X V (x1, x2).
If γt =√
Ω2/(tM2), then E[f (xkdk/2e)− f ∗] ≤ O(1)(MΩ/
√k).
14 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
stochastic mirror descent
Complexity?
Stochastic Optimization: minx∈X E[F (x , ξ)],One F ′(xt , ξt ) per iteration, totally O(1/ε2) iterations.Optimal sampling complexity.
Deterministic Finite-sum Optimization:minx∈X
f (x) := 1
m∑m
i=1 fi(x)
|fi(x)− fi(y)| ≤ Mi‖x − y‖, |f (x)− f (y)| ≤ M‖x − y‖, ∀x , y ∈ XM≤ M ≡ maxi M
Iteration complexity iteration costMD M2Ω2
ε2≤ M2Ω2
ε2m subgradients
MDSA M2Ω2
ε21 subgradients
MDSA saves up to O(m) subgradient computations.
15 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Accelerated SGD (SGD with momentum)
Stochastic Composite Optimization (SCO)
Ψ∗ := minx∈XΨ(x) := f (x) + X (x).
X (x) is a simple convex function with known structure (e.g.X (x) = 0, X (x) = ‖x‖1).
∃ L ≥ 0, M ≥ 00 ≤ f (y)− f (x)− 〈f ′(x), y − x〉 ≤ L
2‖y − x‖2 + M‖y − x‖.
f is represented by an SO, which, given input xt ∈ X , outputsG(xt , ξt ) s.t.
E[G(xt , ξt )] ≡ g(xt ) ∈ ∂f (xt ),E[‖G(xt , ξt )− g(xt )‖2
∗]≤ σ2.
Covers both non-smooth minimization (L = 0) and smoothminimization (M = 0);
Covers both deterministic case (σ = 0) and stochastic case(σ 6= 0);
16 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Accelerated SGD (SGD with momentum)
Accelerated Stochastic Approximation (AC-SA)
Accelerated stochastic approximation, Lan 08 (12)
Choose x0 = x0 ∈ X .1) Set xk = (1− αk )xk−1 + αk xk−1.2) Compute G(xk , ξk ) and set
xk = argminx∈Xαk [〈G(xk , ξk ), x〉+ X (x)] + βk V (xk , x),xk = (1− αk )xk−1 + αk xk .
3) Set k ← k + 1 and go to step 1).
Evolves from Nesterov’s AGD method (Nesterov 83, Tseng 08)originally designed for deterministic smooth optimizaiton.
Challenge I: A unified analysis for smooth, nonsmooth andstochastic optimization.
Challenge II: A proper stepsize policy.Aggressive stepsize (γk ≡ αk/βk = O(k)) for smoothoptimization.
17 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Accelerated SGD (SGD with momentum)
Convergence Results
Theorem (Lan 08)
Ifαt = 2
t+1 and βt = 4βνt(t+1) , ∀ t = 1, . . . , k ,
for some β ≥ 2L, thenE[Ψ(xk )−Ψ∗] ≤ 4βV (x0,x∗)
νk(k+1) + 4(M2+σ2)(k+2)3β .
In particular, if k is fixed in advance and
β = β∗ = max
2L,[ν(M2+σ2)(k+1)(k+2)
3V (x0,x∗)
] 12
,
thenE[Ψ(xk )−Ψ∗] ≤ O(1)
(LΩ2
k2 + (M+σ)Ω√k
).
Note: It is possible to design a stepsize policy without requiring kgiven in advance by using different stepsize policy or slightlymodifying the algorithm.
18 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Accelerated SGD (SGD with momentum)
Remarks
A universally optimal method for non-smooth, smooth, andcomposite minimization, allowing stochastic (sub)gradients
The method can allow a very large Lipschitz constant L withoutaffecting the rate of convergence.Direct RM-SA LΩ2+(M+Q)Ω√
kL ≤ M+Q
Ω
SMP LΩ2
k + (M+Q)Ω√k
L ≤ (M+Q)k12
Ω
Optimal method LΩ2
k2 + (M+Q)Ω√k
L ≤ (M+Q)k32
Ω
19 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Accelerated SGD (SGD with momentum)
Strongly convex problems
∃µ > 0 s.t. f (y)− f (x)− 〈f ′(x), y − x〉 ≥ µ2 ‖y − x‖2.
A generic accelerated SA framework[Ghadimi and Lan 10 (12,13)]
0) Set x0 = x0 and k = 1.
1) Set xk = (1−αk )(µ+βk )
βk +(1−α2k )µ
xk−1 + αk [(1−αk )µ+βk ]
βk +(1−α2k )µ
xk−1,
2) Call the SO for computing Gk ≡ G(xk , ξk ). Setxk = arg minx∈X αk [〈Gk , x〉+ X (x) + µV (xk , x)]
+((1− αk )µ+ βk )V (xk−1, x) ,xk = αk xk + (1− αk )xk−1.
3) Set k ← k + 1 and go to step 1.
Note: Reduces to the AC-SA algorithm in Lan 08 by setting µ = 0.
20 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Accelerated SGD (SGD with momentum)
Nearly optimal AC-SA for strongly SCO
Theorem: [Ghadimi and Lan 10 (12)] Let xkk≥1 be computed bythe AC-SA algorithm with
αk = 2k+1 and βk = 4L
k(k+1) .
Under Assumption 1, we have,E[Ψ(xk )−Ψ∗] ≤ 2L‖x0−x∗‖2
k(k+1) + 8(M2+σ2)µ(k+1) , ∀ k ≥ 1.
——————
Iteration-complexity of finding an ε-solution:
O(1)
(√L‖x0−x∗‖2
ε + (M+σ)2
µε
).
Significantly better than the classic SA in terms of thedependence on L, µ, σ and ‖x0 − x∗‖.
The second term is unimprovable.
Optimal for problems without smooth components (i.e., L = 0).
Stepsizes independent of σ, M and ‖x0 − x∗‖.21 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Accelerated SGD (SGD with momentum)
Optimal AC-SA for strongly convex SCO
Basic idea: First assume that µ = 0 and then properly restart theexecution of the AC-SA algorithm.A multi-stage AC-SA algorithm.
0) Let p0 ∈ X and a bound V0 s.t. Ψ(p0)−Ψ(x∗) ≤ V0 be given.
1) For k = 1,2, . . .
1.a) Run Nk iterations of AC-SA with µ = 0, x0 = pk−1, αt = 2t+1 ,
βt ≡ βk,t = 4βkt(t+1) , where
Nk =⌈
max
4√
2Lµ ,
128(M2+σ2)3µV02−(k+1)
⌉,
βk = max
2L,[µ(M2+σ2)3V02−(k−1) N3
k
] 12
;
1.b) Set pk = xNk , where xNk is computed in Step 1.a).
Note: in comparision with the nearly optimal AC-SA, requiresadditional information on M, σ and V0 (or their estimations).
22 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Accelerated SGD (SGD with momentum)
Optimal AC-SA for strongly convex SCO
Theorem: [Ghadimi and Lan 10 (13)] The multi-stage AC-SAalgorithm will find an ε-solution x ∈ X s.t. E[Ψ(x)−Ψ∗] ≤ ε in at mostK := dlogV0/εe stages. Moreover, the total number of iterationsperformed by this algorithm to find such a solution is bounded by
O(1)√
Lµ max
(1, log V0
ε
)+ M2+σ2
µε
.
—————-
The iteration-complexity is optimal for strongly convex stochasticoptimization.
Significantly better then the nearly optimal one in terms of thedependence on p0 (or x0).
Possibly an optimal single-stage algorithm by using increasingmini-batches (to be discussed later).
23 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex SGD
Nonconvex SP problems
Issue: All previous stochastic algorithms require the convexityassumption to show convergence.Problem: Consider f ∗ := minx∈Rn f (x).
f is differentiable and bounded below, f ∈ C1,1L (Rn) :
‖∇f (y)−∇f (x)‖ ≤ L‖y − x‖, ∀x , y ∈ Rn.f is not necessarily convex.Goal: computing an approximate stationary point.Well-understood in the deterministic case.
Only stochastic gradient G(xk , ξk ) is available at xk .a) E[G(xk , ξk )] = g(xk ) ≡ ∇f (xk ),b) E
[‖G(xk , ξk )− g(xk )‖2
]≤ σ2.
No convergence analysis fo SA type algorithms.24 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex SGD
Motivating examples
Nonconvex loss or regularization (e.g., deep learning):f (x) =
∫Ξ L(x , ξ)dP(ξ) + r(x),
where either the loss function L(x , ξ) or the regularizationterm r(x) is nonconvex, G(x , ξ) = ∇xL(x , ξ) +∇r(x).Simulation-based optimization
f (x) = Eξ[F (x , ξ)].Here F (x , ξ) is given in a black box. Usually only noisyfunction values available. No convexity information.Endogenous uncertainty:
f (x) =∫
Ξ(x) F (x , ξ)dPx (ξ).Here f (x) is usually nonconvex even though F (x , ξ) isconvex w.r.t. x .
25 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex SGD
The randomized stochastic gradient method
Algorithm 1 A randomized stochastic gradient (RSG) method,Ghadimi and Lan 12 (13)
Input: Initial point x1, iteration limit N, stepsizes γkk≥1 andprobability mass function PR(·) supported on 1, . . . ,N.Step 0. Let R be a random variable with probability massfunction PR.Step k = 1, . . . ,R. Compute G(xk , ξk ) and set
xk+1 = xk − γkG(xk , ξk ).Output xR.
26 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex SGD
How would randomization help?
By the smoothness of f :f (xk+1) ≤ f (xk )− γk 〈g(xk ),G(xk , ξk )〉+
Lγ2k
2 ‖G(xk , ξk )‖2.
Taking expectation on both sides w.r.t. ξk , conditioning on(ξ1, . . . , ξk−1):E[f (xk+1)] ≤ f (xk )−
(γk − L
2γ2k
)‖g(xk )‖2 + L
2γ2kσ
2.
Applying the above relation inductively, we obtain:
1L∑N
k=1
(γk − L
2γ2k
)E[‖g(xk )‖2]
≤ 1L [f (x1)− E[f (xN+1)]] + 1
2σ2∑N
k=1 γ2k
≤ 1L
[f (x1)− f ∗]︸ ︷︷ ︸Df
+12σ
2∑Nk=1 γ
2k .
27 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex SGD
The randomized stochastic gradient method
Theorem: (Ghadimi and Lan 12 (13)) Suppose that PR(·) andγk are set to
PR(k) := ProbR = k =2γk−Lγ2
k∑Nk=1(2γk−Lγ2
k ), k = 1, ...,N,
γk = min
1L ,
Dσ√
N
, k = 1, . . . ,N,
for some D > 0. Then,1LE[‖∇f (xR)‖2] ≤ LD2
fN +
(D +
D2f
D
)σ√N
.
If, in addition, f is convex with an optimal solution x∗, thenE[f (xR)− f ∗] ≤ LD2
XN +
(D +
D2X
D
)σ√N
,where DX := ‖x1 − x∗‖.
28 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex SGD
Large-deviation properties for the RSG method
Goal: To find an (ε,Λ)-solution x ∈ E in a single run of the RSGalgorithm s.t. Prob‖∇f (x)‖2 ≤ ε ≥ 1− Λ for some Λ ∈ (0,1).
By Markov’s inequality, iteration-complexity of finding an
(ε,Λ)-solution: O
L2D2f
Λε + L2
Λ2
(D +
D2f
D
)2σ2
ε2
.
29 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex SGD
A two-phase randomized stochastic gradientmethod
Algorithm 2 A two-phase RSG (2-RSG) methodInput: Initial point x1, iteration limit N, sample size T , andconfidence level Λ ∈ (0,1).Initialize: Set S = dlog 2/Λe.Optimization phase:
For s = 1, . . . ,S:Call the RSG method with input x1, iteration limit N,stepsizes γk and probability mass function PR sameas in RSG method. Let xs be the output of thisprocedure.
Post-optimization phase: Choose x∗ from x1, . . . , xS s.t.‖g(x∗)‖ = mins=1,...,S ‖g(xs)‖, g(xs) := 1
T∑T
k=1 G(xs, ξk ).
30 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex SGD
A two-phase randomized stochastic gradientmethod
Iteration complexity of 2-RSG method for finding(ε,Λ)-solution:
O
L2D2f log(1/Λ)
ε + L2(
D +D2
fD
)2log(1/Λ)σ
2
ε2+ log2(1/Λ)σ2
Λε
.
When the second terms are the dominating ones in bothbounds, considerably smaller than the previous one up to afactor of
O
1/(Λ2 log2(1/Λ))
.
31 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex SGD
Large-deviation properties
Light-tail Assumption:E[exp‖G(x , ξt )− g(x)‖2∗/σ2
]≤ exp1,
The third term in the previous complexity can be improvedby factor 1/Λ:
O
log(1/Λ)
[L2D2
fε + L2
(D +
D2f
D
)2σ2
ε2
]+ log2(1/Λ)σ2
ε
.
32 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex acceleration
Acceleration for nonconvex optimizaiton?
Consider f ∗ := minx∈Rn f (x).The complexity of the gradient descent method for findinga point x such that ‖∇Ψ(x)‖2 ≤ ε is O(1/ε).It was unclear whether the AG method still converges ornot if f is possibly nonconvex.Question: Can we generalize the AG method for solvingnonconvex problems?
33 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex acceleration
The AG method for nonconvex smoothoptimization
f ∈ C1,1Lf
(Rn), possibly nonconvex, bounded from below.
Algorithm 3 Nonconvex accelerated gradient (AG)
Input: x0 ∈ Rn, stepsize parameters αk, λk, and βk.0) Set the initial points x0 = x0 and k = 1.1) Set xk = (1− αk )xk−1 + αkxk−1.
2) Compute ∇f (xk ) and setxk = xk−1 − λk∇f (xk ),xk = xk − βk∇f (xk ).
3) Set k ← k + 1 and go to step 1.
34 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex acceleration
Remarks
If βk = αkλk , ∀k ≥ 1, then the AG method is equivalent toone of the simplest variant of the well-known Nesterov’smethod.
If βk = λk , ∀k ≥ 1, then the AG method reduces to thegradient descent method.
35 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex acceleration
Convergence properties
Theorem (Ghadimi and Lan 13 (16))Let xk , xkk≥1 be computed by the AG algorithm with
αk = 2k+1 and βk = 1
2Lf.
a) If λk ∈[βk ,(1 + αk
4
)βk], then for any N ≥ 1,
mink=1,...,N
‖∇f (xk )‖2 ≤ 6Lf [f (x0)−f∗]N .
b) If f (·) is convex and λk = k βk2 , then for any N ≥ 1,
mink=1,...,N
‖∇f (xk )‖2 ≤ 96L2f ‖x0−x∗‖2
N2(N+1),
f (xN)− f (x∗) ≤ 4Lf ‖x0−x∗‖2
N(N+1) .
36 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex acceleration
Remarks
The rate of convergence for the AG method is in the sameorder of magnitude as that for the gradient descentmethod.The AG method exhibits an optimal rate of convergence forconvex problems.The AG method can find a solution x such that‖∇f (x)‖2 ≤ ε in at most O(1/ε
13 ) iterations.
The selection of λk for the nonconvex case is lessaggressive than that for the convex case.
37 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex acceleration
The AG method for nonconvex compositeoptimization
Consider a class of composite problems given by
minx∈Rn
Ψ(x) + X (x), Ψ(x) := f (x) + h(x).
f ∈ C1,1Lf
(Rn) is possibly nonconvex, h ∈ C1,1Lh
(Rn) is convex,and X is a simple convex (possibly non-differentiable)function with bounded domain. Clearly, we haveΨ ∈ C1,1
LΨ(Rn) with LΨ = Lf + Lh.
38 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex acceleration
Assumption 1There exists a constant M such that ‖P(x , y , c)‖ ≤ M for anyc ∈ (0,+∞) and x , y ∈ Rn, where P(x , y , c) is given by
P(x , y , c) := argminu∈Rn
〈y ,u〉+
12c‖u − x‖2 + X (u)
.
LemmaIf X (·) is a proper closed convex function with bounded domain,then Assumption 1 is always satisfied.
39 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex acceleration
Let X ⊆ Rn be given a convex compact set. It can be easilyseen that Assumption 1 holds if X (x) = IX (x), where
IX (x) =
0 x ∈ X ,+∞ x /∈ X .
X (x) = IX (x) + ‖x‖1.
40 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex acceleration
New termination criterion
G(x , y , c) := 1c [x − P(x , y , c)].
G(x ,Ψ(x), c) is called gradient mapping.G(x ,∇Ψ(x), c) = ∇Ψ(x) when X (·) = 0.G(x , ·, c) is Lipschitz continuous.
LemmaLet x ∈ Rn be given and denote g ≡ ∇Ψ(x). If ‖G(x ,g, c)‖ ≤ ε,then
−∇Ψ(P(x ,g, c)) ∈ ∂X (P(x ,g, c)) + B(ε(cLΨ + 1)),where ∂X (·) denotes the subdifferential of X (·) andB(r) := x ∈ Rn : ‖x‖ ≤ r.
41 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex acceleration
The AG method for nonconvex compositeoptimization
Algorithm 4 The AG method for composite optimization,Ghadimi and Lan 13 (16)
Replace Step 2 of the Algorithm 1 by
xk = P(xk−1,∇Ψ(xk ), λk ),
xk = P(xk ,∇Ψ(xk ), βk ).
G(xk ,∇Ψ(xk ), βk ) = 1βk
(xk − xk ) is used as the terminationcriterion.
42 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex acceleration
Convergence properties
Theorem (Ghadimi and Lan 13 (16))If Assumption 1 holds, an optimal solution x∗ exists and
αk = 2k+1 , βk = 1
2LΨand λk = k βk
2 ,then for any N ≥ 1,
mink=1,...,N
‖G(xk ,∇Ψ(xk ), βk )‖2 ≤
24LΨ
[4LΨ‖x0−x∗‖2
N2(N+1)+ Lf
N (‖x∗‖2 + 2M2)].
If, in addition, Lf = 0, thenΦ(xN)− Φ(x∗) ≤ 4LΨ‖x0−x∗‖2
N(N+1) .
43 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex acceleration
Remarks
After running the AG method for at most N = O(L2Ψ/ε
13 + LΨLf/ε)
iterations, we have −∇Ψ(xN) ∈ ∂X (xN) + B(ε).
Suppose that X (·) is Lipschitz continuous with constant LX . Then thecomplexity of the projected gradient method (Ghadimi, Lan, and Zhang13 (16)) is given by
O
LΨN (LX ) (‖x∗‖+ M) +
L2Ψ
N (‖x∗‖2 + M2),
which has a much worse dependence on the Lipschitz constants Lh andLX for the convex components.
If the Lipschitz constant Lf is small enough, i.e., Lf = O(Lh/N2), thenthe rate of convergence for the AG method is significantly better thanthat of the projected gradient method.
44 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex acceleration
Stochastic oracle
The function Ψ(·) is represented by a Stochastic first-orderOracle (SO).
At iteration k , xk ∈ X being the input, SO outputs a vectorG(xk , ξk ), where ξk , k ≥ 1, are random variables whosedistribution Pk are supported on Ξk ⊆ Rd .
Assumption 2For any x ∈ Rn and k ≥ 1, we have
a) E[G(x , ξk )] = ∇Ψ(x),b) E
[‖G(x , ξk )−∇Ψ(x)‖2
]≤ σ2.
45 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex acceleration
The RSAG method for composite stochasticoptimization
Algorithm 5 RSAG for stochastic composite optimization
Input: x0 ∈ Rn, αk, βk > 0 and λk > 0, iteration limitN ≥ 1, and a random variable R with s.t. ProbR = k =pk , k = 1, . . . ,N.0. Set x0 = x0 and k = 1.1. Set
xk = (1− αk )xk−1 + αkxk−1xk = P(xk−1, Gk , λk ),
xk = P(xk , Gk , βk ),
where Gk is given by Gk = 1mk
∑mki=1 G(xk , ξk ,i), for some mk ≥
1 and G(xk , ξk ) is returned by SO .2. If k = R, terminate the algorithm. Otherwise, set k = k +1and go to step 1.
46 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex acceleration
The RSAG method for composite stochasticoptimization
Under Assumption 2, we have
E[Gk ] = 1mk
∑mki=1 E[G(xk , ξk ,i)] = ∇Ψ(xk ).
E[‖Gk −∇Ψ(xk )‖2] =1
m2kE[‖∑mk
i=1[G(xk , ξk ,i)−∇Ψ(xk )]‖2]≤ σ2
mk.
E[‖G(xk ,∇Ψ(xk ), βk )− G(xk , Gk , βk )‖2] ≤ σ2
mk.
47 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex acceleration
The RSAG method for composite stochasticoptimization
Theorem
Suppose that αk = 2k+1 , βk = 1
2LΨ, λk = k βk
2 , and
pk = λk Ck∑Nk=1 λk Ck
, k = 1, . . . ,N, where
Ck := 1− LΨλk − LΨ(λk−βk )2
2αk Γkλk
(∑Nτ=k Γτ
). Also assume that
mk =⌈
σ2
LΨD2 min
kLf, k2N
LΨ
⌉, k = 1,2, . . . ,N
for some parameter D. Then under Assumptions 1 and 2, we haveE[‖G(xmd
R ,∇Ψ(xmdR ), βR)‖2]
≤ 96LΨ
[4LΨ(‖x0−x∗‖2+D2)
N3 + Lf (‖x∗‖2+2M2+3D2)N
].
If, in addition, Lf = 0, thenE[Φ(xR)− Φ(x∗)] ≤ LΨ
N2
(12‖x0 − x∗‖2 + 7D2
).
48 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Nonconvex acceleration
Remarks
An optimal choice of D would be√‖x∗‖2 + M2.
It can be easily shown that when Lf = 0, Algorithm 4 possesses anoptimal complexity for solving convex SP problems.
The choice of mk requires explicit knowledge of Lf and LΨ.
Theorem
If mk =⌈σ2k
LΨD2
⌉, k = 1, 2, . . . , then
E[‖G(xmdR ,∇Ψ(xmd
R ), βR)‖2]
≤ 96LΨ
[4LΨ‖x0−x∗‖2
N3 + Lf (‖x∗‖2+2M2)+3D2
N
].
49 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Adaptive SGD with momentum
Motivation
Idea: Using a variable Bregman divergence in each iteration.
Adaptive stochastic subgradient (Duchi, Hazan, Singer 11).Different ways to define the Bregman divergence:ADAM,AMSGrad.Requires X to be bounded, but does not adapt to thegeometry of X .Adaptive to data geometry (e.g., entry-wise variance)Non-convergence for ADAM (exponential moving average)No guarantees for incorporating similar techniques intoSGD with momentum
50 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Adaptive SGD with momentum
Adaptive and accelerated SGD (A2Grad)
[Deng, Cheng, Lan 18]
1) Set xk = (1− αk )xk + αk xk .
2) Sample ξk , compute Gk ∈ ∇F (xk , ξk ) and φk (·).
3) Setxk+1 = argminx∈X 〈Gk , x〉+ γk V (xk , x) + βk Vφk (xk , x) ,xk+1 = (1− αk )xk + αk xk+1.
4) Set k ← k + 1 and go to step 1
Specifically, ifV (x , y) = 1
2‖x − y‖2,Vφk (x , y) = 12 〈(x − y)T Diag(hk )(x − y)〉,
thenxk+1 = ΠX (xk − 1
γk +βk hkGk ),
where ΠX denotes the projection over X .
51 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Adaptive SGD with momentum
Convergence properties of A2Grad
Theorem. [Deng, Cheng, Lan 18] Let βk = β and
hk = 1(k+1)q/2
√∑kτ=0(τ + 1)qδ2
τ with q ∈ [0,2].1 Then
E [f (xK +1)− f (x)] ≤ 2L‖x−x0‖2
(K +1)(K +2) +2βBE[
∑ni=1 ‖δ0:K ,i‖]
K +2 +2E[
∑ni=1 ‖δ0:K ,i‖]β(K +2) .
The deterministic part: O( 1
K 2
).
The stochastic part: Under the assumption E[‖δk,i‖2] ≤ σ2:
E[‖δ0:K ,i‖] ≤√∑K
k=0 E[δ2k,i ] ≤
√K + 1σ. Hence the stochastic
part obtains the O( 1√K
) rate of convergence.
1This operation is applied entry-wise to the vectors hk and δtau.52 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Adaptive SGD with momentum
Different moving average schemes
Setting q = 0, uniform moving average: hk =√∑k
τ=0 δ2τ .
Setting q = 2, incremental moving average with quadratic
weight: hk by hk = 1(k+1)
√∑kτ=0(τ + 1)2δ2
τ .
Exponential moving average:
vk =
δ2
k if k = 0ρvk−1 + (1− ρ)δ2
k otherwisevk = max vk , vk−1hk =
√(k + 1)vk
All above operations are applied entry-wise to the vectorshk , δτ , vk and vk .
53 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Adaptive SGD with momentum
Numerical results
MNIST logistic regression:
0 10 20 30 40 50Epochs
0.22
0.24
0.26
0.28
0.30
0.32
0.34
0.36
0.38
Trai
ning
Los
s
AdamAMSGradA2Grad-uniA2Grad-incA2Grad-exp
0 10 20 30 40 50Epochs
0.900
0.905
0.910
0.915
0.920
0.925
Test
ing
Accu
racy
AdamAMSGradA2Grad-uniA2Grad-incA2Grad-exp
54 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Adaptive SGD with momentum
MNIST neural network:
0 10 20 30 40 50Epochs
0.00
0.02
0.04
0.06
0.08
0.10
Trai
ning
Los
s
AdamAMSGradA2Grad-uniA2Grad-incA2Grad-exp
0 10 20 30 40 50Epochs
0.965
0.970
0.975
0.980
0.985
Test
ing
Accu
racy
AdamAMSGradA2Grad-uniA2Grad-incA2Grad-exp
Better or at least comparable performance for all other tests,e.g., CIFARNET and VGG16 on CIFAR10.
55 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Finite-sum smooth optimization
Problem: Ψ∗ := minx∈X Ψ(x) := f (x) + X (x) + µω(x).
X closed and convex.
X simple, e.g., l1 norm.
ω strongly convex with modulus 1 w.r.t. an arbitrary norm.
µ ≥ 0.
Subproblem argminx∈X 〈g, x〉+ X (x) + µω(x) is easy.
fi smooth convex: ‖∇fi (x1)−∇fi (x2)‖∗ ≤ Li‖x1 − x2‖.
Denote f (x) ≡∑m
i=1fi (x) and L ≡∑m
i=1Li . f is smooth withLipschitz constant Lf ≤ L.
56 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Why should we care about finite-sumoptimization?
Many problems in data analysis, especially inverse problem aregiven in this form (image reconstruction and matrix completion).
In machine learning, we intend to minimize the true risk, which isgiven in form of expectation .
Theoretically speaking, we do not need to save data.Using SA methods (e.g., MDSA, AC-SA), we directlyminimize the true risk (or generalization error).However, if we can afford to save the dataset and quite afew passes through the dataset, we might be able to getbetter prediction, since the estimation of sample size isusually conservative.
57 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Distributed ML
Data are distributed over a multi-agent network.minx f (x) :=
∑mi=1fi(x)
s.t. x ∈ X , X := ∩mi=1Xi ,
where fi(x) = E[Fi(x , ζi)] is given in the form of expectation.
Optimization defined over complex multi-agent network.Each agent has its own data (observations of ζi ).Data usually are private - no sharing.Can share knowledge learned from data.Communication among agents can be expensive.Data can be captured on-line.
58 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Example: distributed SVM
Three agents: minx13 [f1(x) + f2(x) + f3(x)]
f1(x) = 1N1
∑N1i=1
[max0, v1
i 〈x ,u1i 〉]
+ ρ‖x‖22.
f2(x) = 1N2
∑N2i=1
[max0, v2
i 〈x ,u2i 〉]
+ ρ‖x‖22.
f3(x) = 1N3
∑N3i=1
[max0, v3
i 〈x ,u3i 〉]
+ ρ‖x‖22.
Dataset for agent 1: (u1i , v
1i )N1
i=1.
Dataset for agent 2: (u2i , v
2i )N2
i=1.
Dataset for agent 3: (u2i , v
2i )N3
i=1.Each agent accesses its own dataset.
59 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Example: distributed SVM - streaming data
minx13 [f1(x) + f2(x) + f3(x)],
where fj(x) = E[max0, v j〈x ,uj〉
]+ ρ‖x‖22, j = 1,2,3.
Dataset for agent i can be viewed as samples of randomvector (uj , v j), j = 1,2,3.(uj , v j) can satisfy different distribution.Samples can be collected in an online fashion.Agents can possibly share solutions, but not the samples.Need to minimize the communication costs.
60 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Different topology for distributed learning
61 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Outline
Review of nonsmooth finite-sumVariance-reduced stochastic mirror descentRandomized primal-dual gradient methodStochastic finite-sum minimization
Random gradient extrapolation
Decentralized SGD
62 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Review of nonsmooth finite-sum problems
General stochastic programming (SP): minx∈X Eξ[F (x , ξ)].
Reformulation of the finite sum problem as SP:
ξ ∈ 1, . . . ,m, Probξ = i = νi , andF (x , i) = ν−1
i fi (x) + h(x) + µω(x), i = 1, . . . ,m.
Iteration complexity: O(1/ε2) or O(1/ε) (µ > 0).
Iteration cost: m times cheaper than deterministic first-ordermethods.
Save up to a factor of O(m) subgradient computations.
63 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Smooth finite-sum optimization
Problem: Ψ∗ := minx∈X
Ψ(x) :=∑m
i=1fi (x) + h(x) + µω(x)
.
For simplicity, focus on the strongly convex case (µ > 0).
Goal: find a solution x ∈ X s.t. ‖x − x∗‖ ≤ ε‖x0 − x∗‖.
AGD: O
m√
Lfµ log 1
ε
.
AC-SA: O√
Lfµ log 1
ε + σ2
µε
.
Note: the optimality of these bounds does not preclude more efficientalgorithms for the finite-sum problem (more structural information).
64 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
Variance reduction techniques
Each iteration requires a randomly selected ∇fi(x).
Stochastic average gradient (SAG) by Schmidt, Roux andBach 13:
O((m + L/µ) log 1
ε
).
Similar results were obtained in Johnson and Zhang 13,Defazio et al. 14...Worse dependence on the L/µ than Nesterov’s method.
65 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
Variance-reduced stochastic mirror descent(VRSMD)
Algorithm 6algVRSMD
Input: x0, γ,T , θt.x0 = x0.For s = 1,2, . . .
Set x = xs−1 and g = ∇F (x).Set x1 = xs−1 and T = Ts.probability Q = q1, . . . ,qm on 1, . . . ,m.For t = 1,2, . . . ,T
Pick it ∈ 1, . . . ,m randomly according to Q.Gt = (∇fit (xt )−∇fit (x))/(qit n) + g.xt+1 = arg minx∈X γ[〈Gt , x〉+ h(x)] + V (xt , x).
Endset xs = xT +1 and xs =
∑Tt=2(θtxt )/
∑Tt=2θt .
End66 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
VRSMD is a multi-epoch method.
Each epoch contains T iterations and requires the computationof a full gradient at the point x .
The gradient ∇f (x) will then be used to define a gradientestimator Gt for ∇f (x t−1) at each iteration t .
Gt has smaller variance than ∇fit (xt−1), while both are unbiased
estimators for ∇f (x t−1).
Lemmavariance_reduced_betterEsti
Conditionally on x1, . . . , xt ,E[δt ] = 0,E[‖δt‖2∗] ≤ 2LQ[f (x)− f (xt )− 〈∇f (xt ), x − xt ],E[‖δt‖2∗] ≤ 4LQ[Ψ(xt )−Ψ(x∗) + Ψ(x)−Ψ(x∗)].
67 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
Smooth problems without strong convexity
Theorem. Suppose that θ = 1, γ = 1/(16LQ), qi = Li∑mi=1Li
and
Ts = 2Ts−1, s = 2,3, . . . , with T1 = 7. Then for any S ≥ 1, we haveE[Ψ(xS)−Ψ(x∗)] ≤ 8
2S−1
[ 114 (Ψ(x0)−Ψ(x∗)) + 16LQV (x0, x∗)
].
The total number of gradient computations required to find a pointx ∈ X s.t. Ψ(x)−Ψ(x∗) ≤ ε, can be bounded byO
m log Ψ(x0)−Ψ(x∗)+LQV (x0,x∗)ε + Ψ(x0)−Ψ(x∗)+LQV (x0,x∗)
ε
.
68 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
Smooth and strongly convex problems
Theorem.Supposeγ = 1
21L ,
qi = Li∑mi=1Li
,
T ≥ 2,θt = (1− γµ)−t , t ≥ 0,
Then for any s ≥ 1, we have∆s ≤
max
[(1− µ
21L )T , 12
]s∆0,
where∆s := γE[Ψ(xs)−Ψ(x∗)] + E[V (xs, x∗)]
+(1− γµ− 4LQγ)µ−1(θT − θ1)E[Ψ(xs)−Ψ(x∗)].
In particular, if T is set to a constant factor of m, denoted by om,then the total number of gradients required to find an ε-solution ofproblem (??) can be bounded by O( L
µ + m) log ∆0ε .
69 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
Open problems and our research
Problems:
Can we accelerate the convergence of randomized incrementalgradient method?
What is the best possible performance we can expect?
Our approach (Lan and Zhou 15):
Develop the primal-dual gradient (PDG) method and show itsinherent relation to Nesterov’s method.
Develop a randomized PDG (RPDG).
Present a new lower complexity bound.
Establish the optimality of the RPDG method.
70 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
Saddle point optimization
Problem: Ψ∗ := minx∈X
Ψ(x) :=∑m
i=1fi (x) + h(x) + µω(x)
.
Saddle point formulation: Let Jf be the conjugate function of f .Consider
Ψ∗ := minx∈X
h(x) + µω(x) + maxg∈G〈x ,g〉 − Jf (g)
The buyer purchases products from the supplier.
The unit price is given by g ∈ Rn.
X , h and ω are constraints and other local cost for the buyer.
The profit of supplier: revenue (〈x ,g〉) - local cost Jf (g).
71 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
Algorithmic elements to achieve equilibrium
Current order quantity x0, and product price g0.
Proximity control functions:P(x0, x) := ω(x)− [ω(x0) + 〈ω′(x0), x − x0〉].Df (g0
i , yi ) := Jf (g)− [Jf (g0) + 〈J ′f (g0),g − g0〉].
Dual prox-mapping:MG(−x ,g0, τ) := arg min
g∈G
〈−x ,g〉+ Jf (g) + τDf (g0,g)
.
x is the given or predicted demand. Maximize the profit, but not toofar away from g0.
Primal prox-mapping:MX (g, x0, η) := arg min
x∈X
〈g, x〉+ h(x) + µω(x) + ηP(x0, x)
.
g is the given or predicted price. Minimize the cost, but not too farway from x0.
72 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
The deterministic PDG
Algorithm 7 The primal-dual gradient method
Let x0 = x−1 ∈ X , and the nonnegative parameters τt, ηt,and αt be given.Set g0 = ∇f (x0).for t = 1, . . . , k do
Update z t = (x t , y t ) according tox t = αt (x t−1 − x t−2) + x t−1.
gt = MG(−x t ,gt−1, τt ).x t = MX (gt , x t−1, ηt ).
end for
73 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
PDG in gradient form
Algorithm 8 PDG method in gradient form
Input: Let x0 = x−1 ∈ X , and the nonnegative parametersτt, ηt, and αt be given.Set x0 = x0.for t = 1,2, . . . , k dox t = αt (x t−1 − x t−2) + x t−1.
x t =(x t + τtx t−1) /(1 + τt ).
gt = ∇f (x t ).x t = MX (gt , x t−1, ηt ).
end for
Idea: set J ′f (gt−1) = x t−1.
74 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
Relation to Nesterov’s method in the non-stronglyconvex setting
A variant of Nesterov’s method:
x t = (1− θt )x t−1 + θtx t−1.x t = MX (
∑mi=1∇fi (x t ), x t−1, ηt ).
x t = (1− θt )x t−1 + θtx t .
Note thatx t = (1− θt )x t−1 + (1− θt )θt−1(x t−1 − x t−2) + θtx t−1.
Equivalent to PDG with τt = (1− θt )/θt and αt = θt−1(1− θt )/θt .
Nesterov’s acceleration: looking-ahead dual players.Gradient descent: myopic dual players (αt = τt = 0 in PDG).
75 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
What is new for PDG?
The connection between Nesterov’s method and the primal-dualmethod (and thus the related Nemirovski’s mirror-prox methodand the popular Douglas-Rachford splitting method) had neverbeen developed before.
Using our technique it is possible to introduce a whole new setsof optimal gradient methods based on these related algorithmsfor saddle point or variational inequality problems.
Inspired us to develop an optimal randomized incrementalgradient method.
76 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
Convergence of PDG
Theorem
Define xk := (∑k
t=1θt )−1∑k
t=1(θtx t ). Suppose that
τt =√
2Lfµ , ηt =
√2Lfµ, αt = α ≡
√2Lf/µ
1+√
2Lf/µ, and θt = 1
αt . Then,
P(xk , x∗) ≤ µ+Lfµ αk P(x0, x∗).
Ψ(xk )−Ψ(x∗) ≤ µ(1− α)−1[1 + Lf
µ (2 + Lfµ )]αk P(x0, x∗).
Theorem
If τt = t−12 , ηt = 4Lf
t , αt = t−1t , and θt = t , then
Ψ(xk )−Ψ(x∗) ≤ 8Lfk(k+1) P(x0, x∗).
77 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
Technical challenges with PDG
Using the conjugate of the objective as the distance generatingfunction seems to be novel.
Using Bregman distance for the strongly convex case in PDG iscompletely new and crucial.
Deal with non-differentiable Bregman distances and theselection of subgradient.
The complexity bounds depend only on the diameter for theprimal feasible set.
The PDG is much different from Nesterov’s method for thestrongly convex case.
78 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
A multi-dual-player saddle point optimization
Let Ji : Yi → R be the conjugate functions of fi and Yi ,i = 1, . . . ,m, denote the dual spaces.minx∈X h(x) + µω(x) + maxyi∈Yi 〈x ,
∑i yi〉 −
∑i J(y) ,
Define their new dual prox-functions and dual prox-mappings asDi (y0
i , yi ) := Ji (yi )− [Ji (y0i ) + 〈J ′i (y0
i ), yi − y0i 〉],
MYi (−x , y0i , τ) := arg min
yi∈Yi
〈−x , y〉+ Ji (yi ) + τDi (y0
i , yi ).
79 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
The RPDG method
Algorithm 9 The RPDG method
Let x0 = x−1 ∈ X , and τt, ηt, and αt be given.Set y0
i = ∇fi(x0), i = 1, . . . ,m.for t = 1, . . . , k do
Choose it according to Probit = i = pi , i = 1, . . . ,m.x t = αt (x t−1 − x t−2) + x t−1.
y ti =
MYi (−x t , y t−1
i , τt ), i = it ,y t−1
i , i 6= it .
y ti =
p−1
i (y ti − y t−1
i ) + y t−1i , i = it ,
y t−1i , i 6= it .
.
x t = MX (∑m
i=1y ti , x
t−1, ηt ).end for
80 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
RPDG in gradient form
Algorithm 10 RPDG
for t = 1, . . . , k doChoose it according to Probit = i = pi , i = 1, . . . ,m.x t = αt (x t−1 − x t−2) + x t−1.
x ti =
(1 + τt )
−1(
x t + τtx t−1i
), i = it ,
x t−1i , i 6= it .
y ti =
∇fi(x t
i ), i = it ,y t−1
i , i 6= it .x t = MX (gt−1 + (p−1
it− 1)(y t
it − y t−1it
), x t−1, ηt ).
gt = gt−1 + y tit − y t−1
it.
end for
81 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
Rate of Convergence
Proposition
Let C = 8Lµ . and
pi = Probit = i = 12m + Li
2L , i = 1, . . . ,m,
τt =
√(m−1)2+4mC−(m−1)
2m ,
ηt =µ√
(m−1)2+4mC+µ(m−1)
2 ,αt = α := 1− 1
(m+1)+√
(m−1)2+4mC.
ThenE[P(xk , x∗)] ≤ (1 + 3Lf
µ )αk P(x0, x∗),
E[Ψ(xk )]−Ψ∗ ≤ αk/2(1− α)−1[µ+ 2Lf +
L2fµ
]P(x0, x∗).
82 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
The iteration complexity of RPGD
To find a point x ∈ X s.t. E[P(x , x∗)] ≤ ε:O
(m +√
mLµ ) log
[P(x0,x∗)
ε
].
To find a point x ∈ X s.t. ProbP(x , x∗) ≤ ε ≥ 1− λ for someλ ∈ (0,1):O
(m +√
mLµ ) log
[P(x0,x∗)λε
].
A factor of O
min√
Lµ ,√
m
savings on gradient computation(or price changes).
83 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
Technical challenges with RPDG
The incorporation of a dual prediction step into the PDG methodcomplicates the analysis.
The expected functional optimality gap does not necessarilyminorize the primal-dual gap as in the deterministic case.
Unlike existing randomized primal-dual method, RPDG is arandomized primal gradient method and it employs strongertermination criterion.
Built-in acceleration of randomized incremental gradient method,no need of Ψ∗ or its estimation.
84 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
Lower complexity bound
minxi∈Rn,i=1,...,m
Ψ(x) :=∑m
i=1
[fi (xi ) + µ
2 ‖xi‖22
].
fi (xi ) = µ(Q−1)4
[ 12 〈Axi , xi〉 − 〈e1, xi〉
]. n ≡ n/m,
A =
2 −1 0 0 · · · 0 0 0−1 2 −1 0 · · · 0 0 0· · · · · · · · · · · · · · · · · · · · ·0 0 0 0 · · · −1 2 −10 0 0 0 · · · 0 −1 κ
, κ =√Q+3√Q+1
.
Theorem
Denote q := (√Q− 1)(
√Q+ 1). Then the iterates xk generated by a
randomized incremental gradient method must satisfyE[‖xk−x∗‖2
2]
‖x0−x∗‖22≥ 1
2 exp(− 4k
√Q
m(√Q+1)2−4
√Q
)for any
n ≥ n(m, k) ≡ [m log[(
1− (1− q2)/m)k/2]]/(2 log q).
85 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
Complexity
Corollary
The number of gradient evaluations performed by any randomizedincremental gradient methods for finding a solution x ∈ X s.t.E[‖x − x∗‖2
2] ≤ ε cannot be smaller than Ω(√
mC + m)
log‖x0−x∗‖2
2ε
if
n is sufficiently large.
Remarks
No valid lower complexity bound exists before for theserandomized incremental gradient methods.
We also provide lower complexity bound for randomizedcoordinate descent methods as a by-product.
86 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
Existing literatureSAG/SAGA in Schmidt et al, 13 and Defazio et al, 14, andSVRG in Johnson and Zhang, 13 obtained O
((m + L/µ) log 1
ε
)rate of convergence
rate of convergence not optimal
RPDG in Lan and Zhou, 15 (precursor: Zhang and Xiao 14,Dang and Lan 14) require exact gradient evaluation at theinitial point, and differentiability over Rn
Catalyst scheme in Lin et al, 15 and Katyusha in Allen-Zhu, 16require re-evaluating exact gradients from time to time
synchronous delays
No existing studies on stochastic finite-sum: each fi isrepresented by a stochastic oracle, or each agent only hasaccess to noisy first-order information
Theorem (Lan and Zhou 15(17))
The lower complexity bound of any RIG methods for finding a solutionx ∈ X such that E[‖x − x∗‖2
2] ≤ ε is at least
Ω(
m +√
mL/µ)
log‖x0−x∗‖2
2ε
.
87 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
Existing literatureSAG/SAGA in Schmidt et al, 13 and Defazio et al, 14, andSVRG in Johnson and Zhang, 13 obtained O
((m + L/µ) log 1
ε
)rate of convergence rate of convergence not optimalRPDG in Lan and Zhou, 15 (precursor: Zhang and Xiao 14,Dang and Lan 14) require exact gradient evaluation at theinitial point, and differentiability over Rn
Catalyst scheme in Lin et al, 15 and Katyusha in Allen-Zhu, 16require re-evaluating exact gradients from time to time
synchronous delays
No existing studies on stochastic finite-sum: each fi isrepresented by a stochastic oracle, or each agent only hasaccess to noisy first-order information
Theorem (Lan and Zhou 15(17))
The lower complexity bound of any RIG methods for finding a solutionx ∈ X such that E[‖x − x∗‖2
2] ≤ ε is at least
Ω(
m +√
mL/µ)
log‖x0−x∗‖2
2ε
.
87 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
Existing literatureSAG/SAGA in Schmidt et al, 13 and Defazio et al, 14, andSVRG in Johnson and Zhang, 13 obtained O
((m + L/µ) log 1
ε
)rate of convergence rate of convergence not optimalRPDG in Lan and Zhou, 15 (precursor: Zhang and Xiao 14,Dang and Lan 14) require exact gradient evaluation at theinitial point, and differentiability over Rn
Catalyst scheme in Lin et al, 15 and Katyusha in Allen-Zhu, 16require re-evaluating exact gradients from time to time
synchronous delays
No existing studies on stochastic finite-sum: each fi isrepresented by a stochastic oracle, or each agent only hasaccess to noisy first-order information
Theorem (Lan and Zhou 15(17))
The lower complexity bound of any RIG methods for finding a solutionx ∈ X such that E[‖x − x∗‖2
2] ≤ ε is at least
Ω(
m +√
mL/µ)
log‖x0−x∗‖2
2ε
.
87 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
Existing literatureSAG/SAGA in Schmidt et al, 13 and Defazio et al, 14, andSVRG in Johnson and Zhang, 13 obtained O
((m + L/µ) log 1
ε
)rate of convergence rate of convergence not optimalRPDG in Lan and Zhou, 15 (precursor: Zhang and Xiao 14,Dang and Lan 14) require exact gradient evaluation at theinitial point, and differentiability over Rn
Catalyst scheme in Lin et al, 15 and Katyusha in Allen-Zhu, 16require re-evaluating exact gradients from time to time synchronous delaysNo existing studies on stochastic finite-sum: each fi isrepresented by a stochastic oracle, or each agent only hasaccess to noisy first-order information
Theorem (Lan and Zhou 15(17))
The lower complexity bound of any RIG methods for finding a solutionx ∈ X such that E[‖x − x∗‖2
2] ≤ ε is at least
Ω(
m +√
mL/µ)
log‖x0−x∗‖2
2ε
.
87 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Variance-reduced methods
Existing literature
SAG/SAGA in Schmidt et al, 13 and Defazio et al, 14, andSVRG in Johnson and Zhang, 13 obtained O
((m + L/µ) log 1
ε
)rate of convergence rate of convergence not optimalRPDG in Lan and Zhou, 15 (precursor: Zhang and Xiao 14,Dang and Lan 14) require exact gradient evaluation at theinitial point, and differentiability over Rn
Catalyst scheme in Lin et al, 15 and Katyusha in Allen-Zhu, 16require re-evaluating exact gradients from time to time synchronous delaysNo existing studies on stochastic finite-sum: each fi isrepresented by a stochastic oracle, or each agent only hasaccess to noisy first-order information
Goals:Fully-distributed (no exact gradient evaluations) RIG methodDirect acceleration scheme - optimal convergence ratesApplicable to solve stochastic finite-sum problems
87 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Preliminaries
Prox-function and prox-mapping
We define a prox-function associated with ω as
P(x0, x) := ω(x)− [ω(x0) + 〈ω′(x0), x − x0〉],
and the prox-mapping associated with X and ω is given by
MX (g, x0, η) := arg minx∈X
〈g, x〉+ µω(x) + ηP(x0, x)
.
P(·, ·) is a generalization of Bregman’s distance, since ω isnot necessarily differentiable.P(·, ·) is strongly convex w.r.t. an arbitrary norm becauseof the strong convexity of ω.Reasonable to assume the above prox-mapping problem iseasy to solve.
88 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Preliminaries
Prox-function and prox-mapping
We define a prox-function associated with ω as
P(x0, x) := ω(x)− [ω(x0) + 〈ω′(x0), x − x0〉],
and the prox-mapping associated with X and ω is given by
MX (g, x0, η) := arg minx∈X
〈g, x〉+ µω(x) + ηP(x0, x)
.
P(·, ·) is a generalization of Bregman’s distance, since ω isnot necessarily differentiable.P(·, ·) is strongly convex w.r.t. an arbitrary norm becauseof the strong convexity of ω.Reasonable to assume the above prox-mapping problem iseasy to solve.
88 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
The algorithm - An optimal gradient extrapolation method (GEM)
The Problem: minx∈Xf (x) + µω(x) := 1m
∑mi=1fi (x) + µω(x)
Gradient Extrapolation Method (GEM)
Initialization:x0 = x0 ∈ X and g−1 = g0 = ∇f (x0). exact gradient evaluationfor t = 1,2, . . . , k do
gt = αt (gt−1 − gt−2) + gt−1.
x t =MX (gt , x t−1, ηt ).
x t =(x t + τtx t−1) /(1 + τt ).
gt = ∇f (x t ). one exact gradient evaluationend forOutput: xk .
89 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
The algorithm - An optimal gradient extrapolation method (GEM)
The Problem: minx∈Xf (x) + µω(x) := 1m
∑mi=1fi (x) + µω(x)
Gradient Extrapolation Method (GEM)
Initialization:x0 = x0 ∈ X and g−1 = g0 = ∇f (x0). exact gradient evaluationfor t = 1,2, . . . , k do
gt = αt (gt−1 − gt−2) + gt−1.
x t =MX (gt , x t−1, ηt ).
x t =(x t + τtx t−1) /(1 + τt ).
gt = ∇f (x t ). one exact gradient evaluationend forOutput: xk .
89 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
The algorithm - An optimal gradient extrapolation method (GEM)
Intuition: Game interpretation of GEM
Finite-sum optimization problem:
minx∈Xf (x) + µω(x)
m
Saddle point problem:
minx∈X
maxg∈G〈x ,g〉 − Jf (g)+ µω(x)
90 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
The algorithm - An optimal gradient extrapolation method (GEM)
Intuition: Game interpretation of GEM
A game iteratively performed by a primal player (x) and a dualplayer (g) for finding the optimal solution of the saddle pointproblem.
minx∈X
maxg∈G〈x ,g〉 − Jf (g)+ µω(x)
.
The primal player predicts the dual player’s action based onhistorical information, and determines her/his correspondingaction via minimizing predicted cost.
The dual player determines her/his action gt by maximizing theprofits.
91 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
The algorithm - An optimal gradient extrapolation method (GEM)
Intuition: Game interpretation of GEM
A game iteratively performed by a primal player (x) and a dualplayer (g) for finding the optimal solution of the saddle pointproblem.
minx∈X
maxg∈G〈x ,g〉 − Jf (g)+ µω(x)
.
The primal player predicts the dual player’s action based onhistorical information, and determines her/his correspondingaction via minimizing predicted cost.
The dual player determines her/his action gt by maximizing theprofits.
91 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
The algorithm - An optimal gradient extrapolation method (GEM)
Intuition: Game interpretation of GEM
A game iteratively performed by a primal player (x) and a dualplayer (g) for finding the optimal solution of the saddle pointproblem.
minx∈X
maxg∈G〈x ,g〉 − Jf (g)+ µω(x)
.
The primal player predicts the dual player’s action based onhistorical information, and determines her/his correspondingaction via minimizing predicted cost.
The dual player determines her/his action gt by maximizing theprofits.
91 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
The algorithm - An optimal gradient extrapolation method (GEM)
Intuition: Game interpretation of GEM
A game iteratively performed by a primal player (x) and a dualplayer (g) for finding the optimal solution of the saddle pointproblem.
minx∈X
maxg∈G〈x, g〉− Jf (g)+ µω(x)
.
The primal player predicts the dual player’s action based onhistorical information, and determines her/his correspondingaction via minimizing predicted cost.
The dual player determines her/his action gt by maximizing theprofits.
91 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
The algorithm - An optimal gradient extrapolation method (GEM)
Intuition: Game interpretation of GEM
A game iteratively performed by a primal player (x) and a dualplayer (g) for finding the optimal solution of the saddle pointproblem.
x0 = x0 ∈ X , g−1 = g0 = ∇f (x0) initial information
The primal player predicts the dual player’s action based onhistorical information, and determines her/his correspondingaction via minimizing predicted cost.
The dual player determines her/his action gt by maximizing theprofits.
91 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
The algorithm - An optimal gradient extrapolation method (GEM)
Intuition: Game interpretation of GEM
A game iteratively performed by a primal player (x) and a dualplayer (g) for finding the optimal solution of the saddle pointproblem.
gt = αt (gt−1 − gt−2) + gt−1,
x t ← arg minx∈X
〈gt , x〉+ µω(x) + ηtP(x t−1, x)
.
The primal player predicts the dual player’s action based onhistorical information, and determines her/his correspondingaction via minimizing predicted cost.
The dual player determines her/his action gt by maximizing theprofits.
91 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
The algorithm - An optimal gradient extrapolation method (GEM)
Intuition: Game interpretation of GEM
A game iteratively performed by a primal player (x) and a dualplayer (g) for finding the optimal solution of the saddle pointproblem.
gt = arg ming∈G
〈−x t ,g〉+ Jf (g) + τtDg(gt−1,g)
m
x t =(x t + τtx t−1) /(1 + τt ),
gt = ∇f (x t ). one gradient evaluation
The primal player predicts the dual player’s action based onhistorical information, and determines her/his correspondingaction via minimizing predicted cost.
The dual player determines her/his action gt by maximizing theprofits.
91 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
The algorithm - An optimal gradient extrapolation method (GEM)
GEM: the dual of Nesterov’s accelerated gradientmethod
GEM:
gt = αt (gt−1 − gt−2) + gt−1
x t =MX (gt , x t−1, ηt )
gt =MG(−x t ,gt−1, τt )
NEST:
x t = αt (x t−1 − x t−2) + x t−1
gt =MG(−x t ,gt−1, τt )
x t =MX (gt , x t−1, ηt )
92 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Optimality of GEM
Theorem (Lan and Zhou 17)
Let x∗ be an optimal solution, xk and xk be given by GEM, respectively.Suppose that µ > 0, and that parameters are set to
τt ≡ 11−α − 1, ηt ≡ α
1−αµ, and αt ≡ α =
√2Lf/µ
1+√
2Lf/µ.
Then,
P(xk , x∗) ≤ 2αk ∆µ ,
ψ(xk )− ψ∗ ≤ αk ∆,
where ∆ := µP(x0, x∗) + ψ(x0)− ψ∗.
To obtain an ε-solution⇒gradient evaluations / communication rounds: O
m√
Lfµ log 1
ε
.
Note: (a) O(1/√ε) iteration for problems without strong convexity
(µ = 0) with different parameter selection.(b) Output xk is where the gradient is computed.
93 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Potential improvements
Adding randomization...
The Problem: minx∈Xf (x) + µω(x) := 1m
∑mi=1fi (x) + µω(x)
GEMInitialization:x0 = x0 ∈ X and g−1 = g0 = ∇f (x0). synchronous delayfor t = 1,2, . . . , k do
gt = αt (gt−1 − gt−2) + gt−1.
x t =MX (gt , x t−1, ηt ).
x t =(x t + τtx t−1) /(1 + τt ).
gt = ∇f (x t ). synchronous delayend forOutput: xk .
94 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Potential improvements
Adding randomization...
The Problem: minx∈Xf (x) + µω(x) := 1m
∑mi=1fi (x) + µω(x)
GEMInitialization:x0 = x0 ∈ X and g−1 = g0 = ∇f (x0). synchronous delayfor t = 1,2, . . . , k do
gt = αt (gt−1 − gt−2) + gt−1.
x t =MX (gt , x t−1, ηt ).
x t =(x t + τtx t−1) /(1 + τt ).
gt = ∇f (x t ). synchronous delayend forOutput: xk .
94 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
The algorithm - Random Gradient Extrapolation Method (RGEM)
The Problem: minx∈X ψ(x) := 1m
∑mi=1fi (x) + µω(x)
RGEMInitialization:x0
i = x0 ∈ X , ∀i , y−1 = y0 = 0. No exact gradient evaluationfor t = 1, . . . , k doChoose it uniformly from 1, . . . ,m,
y t ←y t−1 + αt (y t−1 − y t−2),
x t ←MX ( 1m
∑mi=1y t
i , xt−1, ηt ),
x ti ←
(1 + τt )
−1(x t + τtx t−1i ), i = it ,
x t−1i , i 6= it .
y ti ←
∇fi (x t
i ), i = it , one gradient evaluationy t−1
i , i 6= it .
end forOutput: For some θt > 0, set xk := (
∑kt=1θt )
−1∑kt=1θtx t .
95 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
The algorithm - Random Gradient Extrapolation Method (RGEM)
The Problem: minx∈X ψ(x) := 1m
∑mi=1Eξi [Fi (x , ξi )] + µω(x)
AssumptionAt iteration t , for a given point x t
i ∈ X , a stochastic first-order (SO) oracleoutputs a vector Gi (x t
i , ξti ) s.t.
Eξ[Gi (x ti , ξ
ti )] = ∇fi (x t
i ), i = 1, . . . ,m,
Eξ[‖Gi (x ti , ξ
ti )−∇fi (x t
i )‖2∗] ≤ σ2, i = 1, . . . ,m.
RGEM for stochastic finite-sum optimizationThe same as RGEM except that the gradient update step is replaced by
y ti ←
1Bt
∑Btj=1Gi (x t
i , ξti,j ), i = it , stochastic gradients of fi given by SO
y t−1i , i 6= it .
96 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
The algorithm - Random Gradient Extrapolation Method (RGEM)
RGEM from the server and activated agentperspective
The server updates iterates x t and calculates the outputsolution xk given by sumx/sumθ.One agent is activated at a time, who updates localvariables x t
i and uploads the changes of gradient ∆yi tothe server.
97 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
The algorithm - Random Gradient Extrapolation Method (RGEM)
RGEM from the server and activated agentperspective
The server updates iterates x t and calculates the outputsolution xk given by sumx/sumθ.One agent is activated at a time, who updates localvariables x t
i and uploads the changes of gradient ∆yi tothe server.
97 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
The algorithm - Random Gradient Extrapolation Method (RGEM)
Advantages of RGEM for distributed learning
RGEM is an enhanced RIG method for distributed learning:Require no exact gradient evaluations of fInvolve communication only between the server and theactivated agent iteratively, and tolerate communicationfailuresPossess direct accelerated algorithmic schemeApplicable to stochastic optimization - minimization ofgeneralization riskSampling and communication complexities?
98 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
The algorithm - Random Gradient Extrapolation Method (RGEM)
Advantages of RGEM for distributed learning
RGEM is an enhanced RIG method for distributed learning:Require no exact gradient evaluations of fInvolve communication only between the server and theactivated agent iteratively, and tolerate communicationfailuresPossess direct accelerated algorithmic schemeApplicable to stochastic optimization - minimization ofgeneralization riskSampling and communication complexities?
98 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
The algorithm - Random Gradient Extrapolation Method (RGEM)
Advantages of RGEM for distributed learning
RGEM is an enhanced RIG method for distributed learning:Require no exact gradient evaluations of fInvolve communication only between the server and theactivated agent iteratively, and tolerate communicationfailuresPossess direct accelerated algorithmic schemeApplicable to stochastic optimization - minimization ofgeneralization riskSampling and communication complexities?
98 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
The algorithm - Random Gradient Extrapolation Method (RGEM)
Advantages of RGEM for distributed learning
RGEM is an enhanced RIG method for distributed learning:Require no exact gradient evaluations of fInvolve communication only between the server and theactivated agent iteratively, and tolerate communicationfailuresPossess direct accelerated algorithmic schemeApplicable to stochastic optimization - minimization ofgeneralization riskSampling and communication complexities?
98 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
The algorithm - Random Gradient Extrapolation Method (RGEM)
Advantages of RGEM for distributed learning
RGEM is an enhanced RIG method for distributed learning:Require no exact gradient evaluations of fInvolve communication only between the server and theactivated agent iteratively, and tolerate communicationfailuresPossess direct accelerated algorithmic schemeApplicable to stochastic optimization - minimization ofgeneralization riskSampling and communication complexities?
98 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Optimality of RGEM
RGEM for deterministic finite-sum optimizationproblems
Theorem (Lan and Zhou 17)
Let x∗ be an optimal solution, L = maxi=1,...,m Li , and the parameters be set toτt = 1
m(1−α)− 1, ηt = α
1−αµ, and αt ≡ mα
with α = 1− 1m+√
m2+16mL/µ, then,
E[P(xk , x∗)] ≤2∆0,σ0
αk
µ,
E[ψ(xk )− ψ(x∗)] ≤ 16 max
m, Lµ
∆0,σ0α
k/2,
where ∆0,σ0 := µP(x0, x∗) + ψ(x0)− ψ(x∗) +σ2
0mµ , and σ2
0 is the upper bound of the initial
gradients such that 1m∑m
i=1‖∇fi (x0)‖2∗ ≤ σ2
0 .
Conclusion: To obtain a stochastic ε-solution⇒# of gradient evaluations of fi / communication rounds is O
(m +
√mLµ
)log 1
ε
.
99 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Optimality of RGEM
RGEM for stochastic finite-sum optimizationproblems
Theorem (Lan and Zhou 17)
Let x∗ be an optimal solution, and L = maxi=1,...,m Li . Given the iteration limit k, let theparameters be set as before, and we set
Bt = dk(1− α)2α−te, t = 1, . . . , k ,then
E[P(xk , x∗)] ≤2αk ∆0,σ0,σ
µ,
E[ψ(xk )− ψ(x∗)] ≤ 16 max
m, Lµ
∆0,σ0,σα
k/2,
where ∆0,σ0,σ := µP(x0, x∗) + ψ(x0)− ψ(x∗) +σ2
0/m+5σ2
µ.
Conclusion: To obtain a stochastic ε-solution⇒# of required communication rounds is O
(m +
√mL/µ
)log 1
ε
,
# of stochastic gradient evaluations is O(
∆0,σ0,σµε
+ m +
√mL/µ
).
Note: Sampling complexity independent of m asymptotically.100 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Decentralized optimization techniques
Most studies focus on deterministic stochastic optimization(e.g., Nedic and Ozdaglar 09; Shi, Ling, Wu, and Yin 15).
O(1/ε) communication rounds and gradient computation.O(log(1/ε)) communication rounds for unconstrainedsmooth and strongly convex problems.
For stochastic optimization problemsDirect extension of SGD type methods (e.g. Duchi,Agarwal, and Wainwright 12)O(1/ε2) communication rounds and stochastic(sub)gradient computations.
101 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Motivation
Communication is about 1,000 times more expensiveCommunication over TCP/IP: 10KB/ms plus a few ms forstartupCPUs read/write from/to memory: 10KB/µsCurrent decentralized SGDs involve one communicationround for each iteration.NOT good for decentralized stochastic optimization and/ormachine learning
Question: Can we skip communication from time to time?Inspiration: Gradient sliding for composite optimizationminx∈X f (x) + h(x) (Lan, 14).
102 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
How to handle decentralized structure?
103 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
How to handle decentralized structure?
Dual decomposition (explicit)minx F (x) :=
∑mi=1fi(xi)
s.t. x1 = x2 = . . . = xm, xi ∈ Xi ,∀i = 1, . . . ,m.where x = [xT
1 , . . . , xTm].
Let Ni be the set of neighbors of agent i .The Laplacian L ∈ Rm×m of a graph G = (V ,E) is given by
Lij =
|Ni | − 1 if i = j−1 if i 6= j and (i , j) ∈ E0 otherwise.
104 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Background: Laplacian L
Let Ni denote the set of neighbors of agent i :Ni = j ∈ V | (i , j) ∈ E ∪ i
Then, the Laplacian L ∈ Rm×m of a graph G = (V , E) isdefined as:
Lij =
|Ni | − 1 if i = j−1 if i 6= j and (i , j) ∈ E0 otherwise.
For example:
L =
1 −1 0−1 2 −10 −1 1
L1 = 0“AgreementSubspace”
105 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Problem Formulation
minx
F (x) :=m∑
i=1
fi(xi)
s.t. x1 = · · · = xm
xi ∈ Xi , i = 1, . . . ,m
(=)
minx
F (x) :=m∑
i=1
fi(xi)
s.t. xi = xj , ∀(i , j) ∈ Exi ∈ Xi , i = 1, . . . ,m
If G is connected
(=)
minx
F (x) :=m∑
i=1
fi(xi)
s.t. Lx = 0xi ∈ Xi , i = 1, . . . ,m
Using Laplacian L,consistency constraints can
be compactly rewrittenL := L⊗ Id
(=) minx∈X m
F (x) + maxy∈Rmd
〈Lx,y〉 Equivalent Saddle Point form
106 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Decentralized Primal-Dual (DPD): Vector Form
minx∈X m
F (x) + maxy∈Rmd
〈Lx,y〉x := [x>1 · · · x>m ]>
y := [y>1 · · · y>m ]>
Let x0 = x−1 ∈ X m, y0 ∈ Rmd , αk, τk, ηk, and θk be given.
For k = 1, . . . ,N, update zk = (xk ,yk )
xk = αk (xk−1 − xk−2) + xk−1
yk = argminy∈Rmd 〈−Lxk ,y〉+ τk2 ‖y− yk−1‖2
xk = argminx∈X m 〈Lyk ,x〉+ F (x) + ηk2 ‖x
k−1 − x‖2
Return zN = (∑N
k=1θk )−1∑Nk=1θk zk .
107 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
DPD: Agent i ’s point of view
Let x0 = x−1 ∈ X m, y0 ∈ Rmd , αk, τk, ηk, and θk be given.
For k = 1, . . . ,N, update zki = (xk
i , yki )
xki = αk (xk−1
i − xk−2i ) + xk−1
i
vki =
∑j∈Ni
Lij xkj
yki = yk−1
i + 1τk
vki
wki =
∑j∈Ni
Lijykj
xki = argminxi∈Xi
〈wki , xi〉+ fi(xi) + ηk
2 ‖xk−1i − xi‖2
Return zN = (∑N
k=1θk )−1∑Nk=1θkzk
108 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
DPD: Convergence Results
Theorem
Let xN = 1N∑N
k=1xk and x∗ be an optimal solution. αk = θk = 1,ηk = 2‖L‖, and τk = ‖L‖, ∀k = 1, . . . ,N. Then, for any N ≥ 1
F (xN)− F (x∗) ≤ ‖L‖N
[‖x0 − x∗‖2 + 1
2‖y0‖2]
‖LxN‖ ≤ 2‖L‖N
[3√
12‖x0 − x∗‖2 + 2‖y∗ − y0‖
].
To find a point x ∈ X s.t. F (xN)− F (x∗) ≤ ε and ‖Lx‖ ≤ ε:O(1ε
)iterations.
# of required communications is also O(1ε
).
109 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Decentralized Communication Sliding (DCS)
Let x0 = x−1 ∈ X m, y0 ∈ Rmd , αk, τk, ηk, θk, and Tk begiven.
For k = 1, . . . ,N, update zki = (xk
i , yki )
xki = αk (xk−1
i − xk−2i ) + xk−1
i
vki =
∑j∈Ni
Lij xkj
yki = argminyi∈Rd 〈−vk
i , yi〉+ τk2 ‖yi − yk−1
i ‖2
wki =
∑j∈Ni
Lijykj
(xki , x
ki ) = Inner loop for Tk times
Observations: The computation of vki and wk
i involves the graphLaplacian, and communication with neighboring nodes.
110 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
The primal subproblem is solved with noisy subgradients
xki = argminxi∈Xi
〈wki , xi〉+ fi(xi) + ηk
2 ‖xk−1i − xi‖2.
Let u0 = u0 = xk−1i , βt, and λt be given.
For t = 1, . . . ,Tk ,
ht−1 = Gi (ut−1, ξt−1i )
ut = argminu∈Xi〈ht−1 + wk
i ,u〉+ ηk2 ‖x
k−1i − u‖2 + ηkβt
2 ‖ut−1 − u‖2
Return xki = uT and xk
i =(∑t
t=1 λt
)−1∑tt=1 λtut
Observations: The same wki is used, communication is
skipped! There are two output points xki and xk
i .
111 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Convergence results
SDCS: convergence for convex cases
Theorem
Let xN = 1N∑N
k=1xk . If βt = t2 , λt = t + 1, αk = θk = 1, ηk = 2‖L‖,
τk = ‖L‖ and Tk =⌈
m(M2+σ2)N‖L‖2D
⌉for some D > 0, then
E[F (xk )− F (x∗)] ≤ ‖L‖N
[32‖x
0 − x∗‖2 + 12‖y
0‖2 + 4D],
E[‖LxN‖] ≤ ‖L‖N
[3√
3‖x0 − x∗‖2 + 8D + 4‖y∗ − y0‖].
112 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Implication
To find a point x ∈ X such that E[F (x)− F (x∗)] ≤ ε andE[‖Lx‖] ≤ ε.
# of outer iterations/communication rounds: N = O(1/ε) -same as the deterministic case.# of inner iterations/samples/stochastic (sub)gradients:∑N
k=1Tk = O(1/ε2).Reduce # of communication rounds by a factor of 1/ε whilekeeping the optimal sampling complexity.
113 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Boundedness of y∗
TheoremLet x∗ be an optimal solution. Then there exists an optimal dualmultiplier y∗, s.t.
‖y∗‖ ≤√
mMσmin(L) ,
where σmin(L) denotes the smallest nonzero singular value of L.
Everything can be represented in primal terms!
114 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
SDCS: convergence for strongly convex casesµ2‖x − y‖2 ≤ fi(x)− fi(y)− 〈f ′i (y), x − y〉 ≤ M‖x − y‖ for allx , y ∈ X .
Theorem
Let x∗ be an optimal solution and xN = 2N(N+3)
∑Nk=1(k + 1)xk . If
λt = t , β(k)t = (t+1)µ
2ηk+ t−1
2 , αk = kk+1 , θk = k + 1, ηk = kµ
2 ,
τk = 4‖L‖2
(k+1)µ , and Tk =
⌈√m(M2+σ2)
D2Nµ max
√m(M2+σ2)
D8µ ,1⌉
for
some D > 0, then
E[F (xN)− F (x∗) ≤ 2N(N+3)
[µ2‖x
0 − x∗‖2 + 2‖L‖2
µ ‖y0‖2 + 2µD
],
E[‖LxN‖] ≤ 8‖L‖N(N+3)
[3√
2D + 12‖x0 − x∗‖2 + 7‖L‖
µ ‖y∗ − y0‖
].
115 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
To find a point x ∈ X such that E[F (x)− F (x∗)] ≤ ε andE[‖Lx‖] ≤ ε.
# of outer iterations/communication rounds: N = O(1/√ε).
# of inner iterations/samples/stochastic (sub)gradients:∑Nk=1Tk = O(1/ε).
Reduce # of communication rounds by a factor of 1/√ε
while keeping the optimal sampling complexity.
116 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Summary of convergence results
Table: Complexity for obtaining an ε-optimal and ε-feasible solution
Algorithm (problem type) # of communications # of subgradientevaluations
DCS: Convex O‖L‖D2
Xmε
O
mM2D2Xm
ε2
SDCS: Convex O
‖L‖D2
Xmε
O
m(M2+σ2)D2Xm
ε2
DCS: Strongly convex O
õD2
Xmε
O
mM2
µε
SDCS: Strongly convex O
õD2
Xmε
O
m(M2+σ2)µε
117 / 118
beamer-tu-logo
Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary
Comparisons with centralized SGD
Table: # of stochastic subgradient evaluations
Problem type SDCS (individual agent) SGD
Convex O
m(M2+σ2)D2Xm
ε2
O
m(M2f +σ2)D2
Xε2
Strongly convex O
m(M2+σ2)
µε
O
m(M2f +σ2)
µf ε
Note: Mf ≤ mM, µf ≥ mµ, D2X/D
2X m = O(1/m) and σ2 ≤ mσ2.
Conclusion: Sampling complexity comparable to centralizedSGD under reasonable assumptions and hence not improvablein general.
118 / 118