Consensus-Based Distributed Online Prediction and Stochastic Optimization
Michael Rabbat
Joint work with Konstantinos Tsianos
Stochastic Online Prediction
Observe samples x(1), x(2), x(3), ..., predict w(1), w(2), w(3), ..., and suffer losses f(w(1), x(1)), f(w(2), x(2)), f(w(3), x(3)), ...

Assume the x(t) are drawn i.i.d. from an unknown distribution.

Regret: R(m) = \sum_{t=1}^{m} f(w(t), x(t)) - \sum_{t=1}^{m} f(w^*, x(t)),
where w^* = \arg\min_{w \in W} E_x[f(w, x)]
See, e.g., S. Shalev-Shwartz, "Online Learning and Online Convex Optimization," Foundations and Trends in Machine Learning, 2012.
Example Application: Internet Advertising
(Figure: the predictor publishes w(t); the environment returns a sample/data point x(t).)
Problem Formalization
Assume:
- |f(w, x) - f(w', x)| \le L \|w - w'\|  (Lipschitz continuous)
- \|\nabla f(w, x) - \nabla f(w', x)\| \le K \|w - w'\|  (Lipschitz continuous gradients)
- f(w, x) is convex in w for all x
- E[\|\nabla f(w, x) - E[\nabla f(w, x)]\|^2] \le \sigma^2  (bounded variance)

Regret: R(m) = \sum_{t=1}^{m} f(w(t), x(t)) - \sum_{t=1}^{m} f(w^*, x(t))

The best possible performance (Nemirovsky & Yudin '83) is E[R(m)] = O(\sqrt{m}), achieved by many algorithms, including Nesterov's dual averaging.
Dual Averaging Intuition

Let F(w) = E_x[f(w, x)]. By convexity, the tangent at a point (w_1, F(w_1)) gives a global lower bound F(w_1) + \langle \nabla F(w_1), w - w_1 \rangle on F. Accumulating the tangents collected at the iterates yields the lower linear model

l(w) = \frac{\sum_{i=1}^{k} \lambda_i \left( F(w_i) + \langle \nabla F(w_i), w - w_i \rangle \right)}{\sum_{i=1}^{k} \lambda_i}
The Dual Averaging Algorithm

Nesterov (2009); see also Zinkevich (2003)

Algorithm (Dual Averaging)
Initialize: z(1) = 0 \in R^d, w(1) = 0
Repeat (for t \ge 1):
1. g(t) = \nabla f(w(t), x(t))
2. z(t+1) = z(t) + g(t)  (gradient aggregation)
3. w(t+1) = \arg\min_{w \in W} \left\{ \langle z(t+1), w \rangle + \frac{1}{a(t)} \|w\|^2 \right\}  ("projection")

Theorem (Nesterov '09): For a(t) = 1/\sqrt{t} we have E[R(m)] = O(\sqrt{m}), matching the best possible performance (Nemirovsky & Yudin '83, Cesa-Bianchi & Lugosi '06).
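As a concrete illustration, here is a minimal Python sketch of the unconstrained case (W = R^d), where the minimization in step 3 has a closed form w = -(a(t)/2) z. The function names and the stream interface are illustrative, not from the talk.

```python
import math

def dual_averaging(grad, x_stream, dim, a=lambda t: 1.0 / math.sqrt(t)):
    """Nesterov's dual averaging, unconstrained case (W = R^d).

    With no constraint, step 3's argmin of <z, w> + (1/a(t))||w||^2
    has the closed form w = -(a(t) / 2) * z.
    """
    z = [0.0] * dim  # running sum of gradients z(t)
    w = [0.0] * dim  # current predictor w(t)
    iterates = []
    for t, x in enumerate(x_stream, start=1):
        g = grad(w, x)                          # 1. g(t) = grad f(w(t), x(t))
        z = [zi + gi for zi, gi in zip(z, g)]   # 2. gradient aggregation
        w = [-(a(t) / 2.0) * zi for zi in z]    # 3. "projection" (closed form)
        iterates.append(w)
    return iterates
```

For example, with the quadratic loss f(w, x) = (w - x)^2 / 2 in one dimension, the gradient is w - x and the iterates drift toward the mean of the stream.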
Distributed Online Prediction

Each of n nodes i processes its own stream of samples x_i(1), x_i(2), x_i(3), ... and makes its own predictions w_i(1), w_i(2), w_i(3), ... The communication topology is given by a graph G = (V, E).

Regret: R_n(m) = \sum_{i=1}^{n} \sum_{t=1}^{m/n} \left[ f(w_i(t), x_i(t)) - f(w^*, x_i(t)) \right]

With no collaboration, R_n(m) = n R_1(m/n) = O(\sqrt{nm}). With collaboration, we hope for R_n(m) \approx R_1(m) = O(\sqrt{m}). How can this bound be achieved?
Mini-batch Updates
[O. Dekel, R. Gilad-Bachrach, O. Shamir, L. Xiao, JMLR 2012]

Keep the predictor fixed, w(1) = w(2) = ... = w(b), while observing a mini-batch of b samples x(1), ..., x(b) and suffering losses f(w(1), x(1)), ..., f(w(1), x(b)). Then update using the average gradient

\frac{1}{b} \sum_{t=1}^{b} \nabla f(w(1), x(t))

to obtain w(b+1), and repeat on the next mini-batch.

Regret: E[R_1(m)] = O(b + \sqrt{m + b})
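A sketch of the average-gradient computation for one mini-batch, in plain Python. The grad callback and the function name are illustrative assumptions, not part of the original algorithm statement.

```python
def minibatch_gradient(grad, w, batch):
    """Average gradient (1/b) * sum_t grad f(w, x(t)) over one mini-batch.

    The predictor w is held fixed (w(1) = ... = w(b)) while the batch
    is processed; only after the batch does the dual-averaging update run.
    """
    b = len(batch)
    g_avg = [0.0] * len(w)
    for x in batch:
        g = grad(w, x)
        # accumulate each sample's gradient, scaled by 1/b
        g_avg = [a + gi / b for a, gi in zip(g_avg, g)]
    return g_avg
```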
Distributed Mini-Batch Algorithm

[O. Dekel, R. Gilad-Bachrach, O. Shamir, L. Xiao, JMLR 2012]

Distribute each mini-batch of b samples across the n nodes and aggregate (synchronously) along a spanning tree (e.g., using ALLREDUCE), so that all nodes exactly compute the average gradient \frac{1}{b} \sum_{t=1}^{b} \nabla f(w(1), x(t)).

Collaborating has a latency of \mu samples.

Regret: R_n(m) = \sum_{t=1}^{m/(b+\mu)} \sum_{i=1}^{n} \sum_{s=1}^{(b+\mu)/n} \left[ f(w_i(t), x_i(t,s)) - f(w^*, x_i(t,s)) \right]

With an appropriate choice of b, this achieves the optimal regret E[R_n(m)] = O(\sqrt{m}).

Is exact average-gradient computation necessary? Can we achieve the same rates with asynchronous distributed algorithms?
Approximate Distributed Averaging

ALLREDUCE is an example of an exact distributed averaging protocol:

y_i^+ = \text{AllReduce}(y_i/n, i) \equiv \frac{1}{n} \sum_{i=1}^{n} y_i \quad \forall i

More generally, consider approximate distributed averaging protocols which guarantee, for all i and with latency \mu,

y_i^+ = \text{DistributedAverage}(y_i, i), \qquad \left\| y_i^+ - \frac{1}{n} \sum_{i=1}^{n} y_i \right\| \le \delta
Gossip Algorithms

For a doubly stochastic matrix W with W_{i,j} > 0 \iff (i,j) \in E, consider the (synchronous) linear iterations

y_i(k+1) = W_{i,i} y_i(k) + \sum_{j \ne i} W_{i,j} y_j(k)

Then y_i(k) \to \bar{y} := \frac{1}{n} \sum_{i=1}^{n} y_i(0), and \|y_i(k) - \bar{y}\| \le \epsilon if

k \ge \frac{\log\left( \epsilon^{-1} \cdot \sqrt{n} \cdot \max_j \|y_j(0) - \bar{y}\| \right)}{1 - \lambda_2(W)}

The inverse spectral gap depends on the topology:
- ring: \frac{1}{1 - \lambda_2} = O(n^2)
- random geometric graph: \frac{1}{1 - \lambda_2} = O(n / \log(n))
- expander: \frac{1}{1 - \lambda_2} = O(1)

Related work: Tsitsiklis, Bertsekas & Athans 1986; Nedić & Ozdaglar 2009; Ram, Nedić & Veeravalli 2010; Duchi, Agarwal & Wainwright 2012.
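A minimal Python sketch of the synchronous gossip iteration, assuming W is supplied as a dense doubly stochastic matrix over scalar node values; the helper name is illustrative.

```python
def gossip_average(y, W, k):
    """Run k synchronous gossip iterations y_i(k+1) = sum_j W[i][j] * y_j(k).

    W must be doubly stochastic with W[i][j] > 0 only on edges (i, j);
    every node's value then contracts toward the average of the initial
    values at a rate governed by the spectral gap 1 - lambda_2(W).
    """
    n = len(y)
    for _ in range(k):
        y = [sum(W[i][j] * y[j] for j in range(n)) for i in range(n)]
    return y
```

For example, on a 4-node ring with lazy Metropolis weights (self-weight 1/2, neighbor weights 1/4), lambda_2(W) = 1/2 and a few dozen iterations drive every node to the network-wide average.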
Distributed Dual Averaging with Approximate Mini-Batches (DDA-AMB)

Initialize z_i(1) = 0, w_i(1) = 0.
For t = 1, ..., T := \lceil \frac{m}{b+\mu} \rceil:

g_i(t) = \frac{n}{b} \sum_{s=1}^{b/n} \nabla f(w_i(t), x_i(t,s))   (each node processes b/n of the b samples; \mu further samples arrive during the averaging latency)

z_i(t+1) = \text{DistributedAverage}(z_i(t) + g_i(t), i)

w_i(t+1) = \arg\min_{w \in W} \left\{ \langle z_i(t+1), w \rangle + \beta(t) h(w) \right\}

Here 0 < \beta(t) \le \beta(t+1) are algorithm parameters and h is a strongly convex auxiliary function, e.g., h(w) = \|w\|_2^2.

The averaging step should give

z_i(t+1) \approx \frac{1}{n} \sum_{i=1}^{n} \left( z_i(t) + g_i(t) \right) = \bar{z}(t) + \frac{1}{b} \sum_{i=1}^{n} \sum_{s=1}^{b/n} \nabla f(w_i(t), x_i(t,s))
When do Approximate Mini-Batches Work?

Theorem (Tsianos & MR): Run DDA-AMB with

k = \frac{\log\left( \sigma^2 \sqrt{n} (1 + 2Lb) \right)}{1 - \lambda_2(W)}

iterations of gossip per mini-batch and \beta(t) = K + \sqrt{\frac{t}{b+\mu}}, and take b = m^\rho for \rho \in (0, \frac{1}{2}). Then E[R_n(m)] = O(\sqrt{m}).

If G is an expander, then k = \Theta(\log n) and so \mu = \Theta(\log n). The latency is the same (order-wise) as aggregating along a tree.
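Putting the pieces together, one DDA-AMB round can be sketched in Python, using gossip as the DistributedAverage and the unconstrained prox with h(w) = \|w\|^2/2, for which the minimizer is w = -z/\beta. The scalar per-node state and all names are illustrative simplifications, not the talk's implementation.

```python
def dda_amb_round(z, w, grads, W, k, beta):
    """One DDA-AMB round (unconstrained sketch, h(w) = ||w||^2 / 2).

    z, w: per-node dual/primal iterates (scalars here for simplicity);
    grads: per-node mini-batch gradients g_i(t);
    W, k: gossip matrix and number of gossip iterations standing in
    for the approximate DistributedAverage; beta: beta(t).
    """
    n = len(z)
    # z_i(t+1) = DistributedAverage(z_i(t) + g_i(t), i), via k gossip steps
    y = [z[i] + grads[i] for i in range(n)]
    for _ in range(k):
        y = [sum(W[i][j] * y[j] for j in range(n)) for i in range(n)]
    # w_i(t+1) = argmin_w <z_i(t+1), w> + beta * ||w||^2 / 2  =>  w = -z / beta
    w_new = [-y[i] / beta for i in range(n)]
    return y, w_new
```

With enough gossip iterations, every node's z_i(t+1) lands near the network average of the z_i(t) + g_i(t), which is exactly the "should give" condition above.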
Stochastic Optimization
Consider the problem

minimize F(w) = E_x[f(w, x)] subject to w \in W

It is well known that

F(\bar{w}(m)) - F(w^*) \le \frac{1}{m} E[R_1(m)], \quad \text{where } \bar{w}(m) = \frac{1}{m} \sum_{t=1}^{m} w(t)
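The bound is the standard online-to-batch conversion; a short derivation, using only the convexity of F and the fact that w(t) depends only on x(1), ..., x(t-1):

```latex
% F is convex, so Jensen's inequality gives
F(\bar{w}(m)) \;\le\; \frac{1}{m}\sum_{t=1}^{m} F(w(t)).
% Since w(t) is independent of x(t), we have
% \mathbb{E}[f(w(t), x(t))] = \mathbb{E}[F(w(t))], hence
\mathbb{E}[F(\bar{w}(m))] - F(w^*)
  \;\le\; \frac{1}{m}\sum_{t=1}^{m} \mathbb{E}[F(w(t))] - F(w^*)
  \;=\; \frac{1}{m}\,\mathbb{E}[R_1(m)].
```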
Distributed Stochastic Optimization

Corollary: Run DDA-AMB with \beta(t) = K + \sqrt{\frac{t}{b}} and

k = \frac{\log\left( (1 + 2Lb) \sqrt{n} \right)}{1 - \lambda_2(W)}

gossip iterations per mini-batch of b gradients processed across the network. Then

F\left(\bar{w}_i\left(\lceil \tfrac{m}{b} \rceil\right)\right) - F(w^*) = O\left(\frac{1}{\sqrt{m}}\right) = O\left(\frac{1}{\sqrt{nT}}\right)

Hence \epsilon-accuracy, F(\bar{w}_i(T)) - F(w^*) \le \epsilon, is guaranteed if T \ge \frac{1}{n} \cdot \frac{1}{\epsilon^2}, for a total of

O\left( \frac{1}{\epsilon^2} \cdot \frac{\log n}{n} \cdot \frac{1}{1 - \lambda_2(W)} \right)

gossip iterations. Agarwal & Duchi (2011) obtain similar rates with an asynchronous master-worker architecture.
Experimental Evaluation
Asynchronous version of distributed dual averaging using Matlab's Distributed Computing Server (which wraps MPI) on a cluster with n = 64 processors.

Task: multinomial logistic regression on the MNIST digits, with x \in R^{784}, y \in \{0, 1, \ldots, 9\}, and w \in R^{7850}. The model assigns class y the probability

f(w, (x, y)) = \frac{1}{Z(w, (x, y))} \exp\left\{ w_{y,d+1} + \sum_{j=1}^{d} w_{y,j} x_j \right\}

where Z(\cdot) normalizes over the ten classes.
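The slide writes the per-class softmax probability; the quantity actually minimized is its negative log-likelihood. A plain-Python sketch of that loss, where the helper name and the weight layout (ten weight vectors, bias stored last) are assumptions for illustration:

```python
import math

def multinomial_logit_loss(w, x, y):
    """Negative log-likelihood of class y under the softmax model above.

    w: list of 10 weight vectors, each of length d + 1 (bias last);
    x: feature vector of length d; y: label in {0, ..., 9}.
    """
    d = len(x)
    # per-class scores: w_{c,d+1} + sum_j w_{c,j} * x_j
    scores = [wc[d] + sum(wc[j] * x[j] for j in range(d)) for wc in w]
    m = max(scores)  # shift scores to stabilize the exponentials
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[y]
```

With all-zero weights every class is equally likely, so the loss is log 10 regardless of the input.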
Performance scaling with more nodes
(Figure: average regret \frac{1}{m} R_n(m) versus run time in seconds, for n = 4, 8, 16, 32, 64.)
Performance scaling
(Figure: regret ratio R_4(T)/R_n(T) versus T, for n = 8, 16, 32.)

For fixed b, we have E[R_n(T)] = O(\sqrt{Tn}), so we expect

\frac{R_n(T)}{R_{n'}(T)} \approx \sqrt{\frac{n}{n'}}

i.e., \sqrt{2} \approx 1.4, \sqrt{4} = 2, \sqrt{8} \approx 2.8.
Conclusions
• Exact averaging is not crucial for O(\sqrt{m}) regret with distributed mini-batches
  – Just need to ensure nodes don't drift too far apart
• Current gossip bounds are worst-case in the initial condition
  – Potentially use an adaptive rule to gossip less
• A fully asynchronous version is a straightforward extension
• Open problems:
  – Lower bounds for distributed optimization (communication + computation)
  – Exploiting sparsity in distributed optimization