Hogwild!
Neil Giridharan
March 20, 2019
Outline
- Sparse Separable Cost Functions
- Motivating Examples
- Previous Work
- Hogwild!
- Convergence
- Experiments and Results
- Discussion
Sparse Separable Cost Functions
Let f : X ⊆ R^n → R. Define

  f(x) = ∑_{e∈E} f_e(x_e)

where each e ⊆ {1, ..., n} and x_e denotes the components of x indexed by e. When |E| and n are large but each x_e is short, f is sparse.
Example
x = (2, 3, 5, 7, 11), e = (3, 5), x_e = (5, 11)
f_e(x_e) = |x_e|, f_e(x_e) ≈ 27.31
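A minimal Python sketch (not from the talk: the index sets and per-term cost below are made up, and indices are zero-based) of how such a function is evaluated, touching only |e| components per term:

```python
import numpy as np

# Sparse separable cost: f(x) = sum over e in E of f_e(x_e), where each
# hyperedge e indexes only a few components of x (zero-based here).
x = np.array([2.0, 3.0, 5.0, 7.0, 11.0])
edges = [(0, 1), (2, 4), (1, 3)]            # each e is a small subset of indices

def f_e(x_e):
    # Made-up per-term cost: the Euclidean norm of the sub-vector.
    return np.linalg.norm(x_e)

f = sum(f_e(x[list(e)]) for e in edges)     # each term reads only |e| components
```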
Motivating Examples
- Netflix Prize Problem
- Neutrinos and Muons
- Image Segmentation
Netflix Prize
Figure 1: Netflix Prize Winner
Sparse Matrix Completion Visualization
Figure 2: Matrix Completion
Sparse Matrix Completion
M is an n_r × n_c low-rank matrix with some entries revealed. E contains the set of revealed index pairs (u, v); L_u denotes the uth row of L and R_v* the vth column of R*. The idea is to estimate M from the product LR*:

  minimize_{(L,R)} ∑_{(u,v)∈E} (L_u R_v* − M_uv)² + (µ/2)||L||²_F + (µ/2)||R||²_F

The regularization term above depends on all of L and R, so it is rescaled per entry:

  minimize_{(L,R)} ∑_{(u,v)∈E} [ (L_u R_v* − M_uv)² + (µ/(2|E_u−|))||L_u||²_F + (µ/(2|E_−v|))||R_v||²_F ]

where E_u− = {v : (u, v) ∈ E} and E_−v = {u : (u, v) ∈ E}.
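A minimal sketch of one SGD step on the rescaled objective, assuming L and R are stored as NumPy arrays of rows; the function and parameter names are illustrative, not the authors' code:

```python
import numpy as np

def mc_step(L, R, M_uv, u, v, mu, deg_u, deg_v, gamma):
    """One SGD step for a single observed entry (u, v) of the rescaled
    matrix completion objective. deg_u = |E_u-| and deg_v = |E_-v| count
    the observed entries in row u and column v."""
    err = L[u] @ R[v] - M_uv                        # residual L_u R_v* - M_uv
    grad_Lu = 2 * err * R[v] + (mu / deg_u) * L[u]  # touches only row u of L
    grad_Rv = 2 * err * L[u] + (mu / deg_v) * R[v]  # touches only row v of R
    L[u] -= gamma * grad_Lu
    R[v] -= gamma * grad_Rv
```

Because each step writes only row u of L and row v of R, two steps on different revealed entries rarely touch the same memory, which is exactly the sparsity Hogwild! exploits.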
Neutrinos and Muons

- If a neutrino hits a water molecule, it could potentially emit a muon
- Need to distinguish muons coming from neutrinos from other muons
- Upward-going vs. downward-going muons
Figure 3: SVM
Sparse SVM
Let E = {(z_1, y_1), ..., (z_|E|, y_|E|)} with z ∈ R^n and labels y; x is the hyperplane in the previous figure. Formulate as

  minimize_x ∑_{α∈E} max(1 − y_α x^T z_α, 0) + λ||x||²_2

The regularization term depends on all of x. Let e_α be the non-zero components of z_α, and let d_u be the number of training examples that are non-zero in component u (u = 1, ..., n). Then the objective becomes

  minimize_x ∑_{α∈E} [ max(1 − y_α x^T z_α, 0) + λ ∑_{u∈e_α} x_u²/d_u ]
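As a sketch (all names hypothetical), the per-example subgradient step on this rescaled objective touches only the components in e_α:

```python
import numpy as np

def svm_step(x, z_a, y_a, e_a, d, lam, gamma):
    """One SGD step on a single training example (z_a, y_a). e_a lists the
    non-zero components of z_a; d[u] counts the training examples that are
    non-zero in component u, as in the rescaled regularizer."""
    margin = y_a * (x @ z_a)
    for u in e_a:                        # update only the components in e_alpha
        g = 2 * lam * x[u] / d[u]        # gradient of lambda * x_u^2 / d_u
        if margin < 1:
            g -= y_a * z_a[u]            # subgradient of the hinge term
        x[u] -= gamma * g
```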
Image Segmentation
- Can reduce image segmentation to a minimum-cut problem
- Min cuts have had success even compared to normalized cuts
Figure 4: Image Segmentation
Sparse Graph Cuts
Let W be a sparse, non-negative similarity matrix, with edges corresponding to its nonzero entries. Each node is associated with a point in the D-dimensional simplex

  S_D = {ζ ∈ R^D : ζ_v ≥ 0, ∑_{v=1}^D ζ_v = 1}

  minimize_x ∑_{(u,v)∈E} w_uv ||x_u − x_v||_1, subject to x_v ∈ S_D for v = 1, ..., n
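A sketch of one subgradient step on a single edge term, assuming x is an n × D array; the projection back onto S_D is omitted, so this only illustrates the per-edge sparsity:

```python
import numpy as np

def cut_step(x, u, v, w_uv, gamma):
    """One subgradient step on a single edge term w_uv * ||x_u - x_v||_1.
    A complete method would project x[u] and x[v] back onto the simplex
    S_D after each update; that projection is omitted for brevity."""
    g = w_uv * np.sign(x[u] - x[v])      # subgradient of the l1 edge penalty
    x[u] -= gamma * g                    # each edge touches only its two endpoints
    x[v] += gamma * g
```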
Previous Work
- Approaches are inspired by classical numerical methods texts
- Master/Worker: one processor writes to memory while the other processors compute gradients
- Round Robin: one processor updates the gradient, then tells the other processors it is done
- Massive overhead due to lock contention and communication
- What happens with no locking and no communication?
Hogwild!
Assume component-wise addition is atomic. Let b_v be 1 in the vth component and 0 elsewhere. G_e(x) ∈ R^n is the gradient of f_e multiplied by |E|, with G_e(x) = 0 on the components not in e. γ is the step size.

Algorithm 1: Hogwild! (each processor repeats):
  Sample e uniformly at random from E;
  Read current state x_e and evaluate G_e(x);
  for v ∈ e do
    x_v ← x_v − γ b_v^T G_e(x)
  end for
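A minimal Python sketch of this loop, illustrative only: CPython's GIL serializes bytecode and NumPy in-place updates are not atomic, so this shows the control flow rather than real lock-free performance, and all names are assumptions:

```python
import threading
import numpy as np

def hogwild(x, edges, G_e, gamma, steps, n_threads=8):
    """Every thread samples edges and writes to the shared iterate x
    with no locks and no coordination."""
    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(steps):
            e = edges[rng.integers(len(edges))]  # sample e uniformly from E
            g = G_e(x, e)                        # gradient of f_e, scaled by |E|
            for i, v in enumerate(e):            # update only the components in e
                x[v] -= gamma * g[i]             # unlocked read-modify-write
    threads = [threading.Thread(target=worker, args=(s,)) for s in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```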
Graph Statistics
Ω := max_{e∈E} |e| : max cardinality of a hyperedge
∆ := max_{1≤v≤n} |{e ∈ E : v ∈ e}| / |E| : normalized max degree
ρ := max_{e∈E} |{e′ ∈ E : e′ ∩ e ≠ ∅}| / |E| : normalized max edge degree
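A small helper (hypothetical, and O(|E|²) in the ρ computation, so only for small inputs) that computes these statistics from a list of hyperedges given as Python sets:

```python
def graph_stats(edges):
    """Compute Omega, Delta, and rho for a list of hyperedges,
    each given as a set of component indices."""
    m = len(edges)
    omega = max(len(e) for e in edges)              # max hyperedge cardinality
    degree = {}                                     # v -> number of edges containing v
    for e in edges:
        for v in e:
            degree[v] = degree.get(v, 0) + 1
    delta = max(degree.values()) / m                # normalized max degree
    # per the definition, each edge intersects itself, so e counts too
    rho = max(sum(1 for e2 in edges if e & e2) for e in edges) / m
    return omega, delta, rho
```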
Statistic   Approximation
Ω           2r
∆           O(log n / n)
ρ           O(log n / n)

Table 1: Matrix completion statistics.
Convergence
Lipschitz continuous gradient: ||∇f(x′) − ∇f(x)|| ≤ L||x′ − x||, ∀x′, x ∈ X
Strongly convex: f(x′) ≥ f(x) + (x′ − x)^T ∇f(x) + (c/2)||x′ − x||², ∀x′, x ∈ X
Bounded gradients: ||G_e(x_e)||_2 ≤ M
Define D_0 := ||x_0 − x*||². τ bounds the lag between when a gradient is computed and when it is used at a particular step. Fix ε > 0 and ν ∈ (0, 1). For

  k ≥ 2LM²(1 + 6τρ + 6τ²Ω∆^{1/2}) log(LD_0/ε) / (c²νε)

we have E[f(x_k) − f(x*)] ≤ ε, where x* is the unique minimizer. When the graph is disconnected (∆ = 0, ρ = 0), this rate equals the serial convergence rate.
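A small helper that just evaluates the right-hand side of this bound for whatever constants one assumes, useful for sanity-checking the disconnected case:

```python
import math

def hogwild_iters(L, M, c, tau, rho, delta, omega, D0, eps, nu):
    """Evaluate the iteration bound above. With rho = delta = 0
    (disconnected graph) it reduces to the serial SGD rate
    2*L*M**2*log(L*D0/eps) / (c**2*nu*eps)."""
    return (2 * L * M**2 * (1 + 6 * tau * rho + 6 * tau**2 * omega * math.sqrt(delta))
            * math.log(L * D0 / eps) / (c**2 * nu * eps))
```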
Results
Data set   size (GB)   ρ        ∆        time (s)   speedup
RCV1       0.9         0.44     1.0      9.5        4.5
Netflix    1.5         2.5e-3   2.3e-3   301.0      5.3
KDD        3.9         3.0e-3   1.8e-3   877.5      5.2
Jumbo      30          2.6e-7   1.4e-7   9453.5     6.8
DBLife     3e-3        8.6e-3   4.3e-3   230.0      8.8
Abdomen    18          9.2e-4   9.2e-4   1181.4     4.1

Table 2: Hogwild! statistics
12-core machine: 10 cores for gradients, 2 cores for data shuffling
Experiments
- Round Robin (RR) is destroyed by communication delay
- RR does get a nearly linear speedup when gradient computation is slow
- Atomic Incremental Gradient (AIG): locks the memory associated with one edge, performs the update, then unlocks
- Graph cuts plateau after about 5 cores
- The authors believe this is an issue with data movement and poor spatial locality
Experiments
Figure 5: a) RCV1 b) Abdomen c) DBLife
Matrix Completion Experiments
Figure 6: a) Netflix b) KDD
Conclusion
Figure 7: One Perspective
Discussion
- Could hybrid locking schemes outperform Hogwild!, especially when certain terms are accessed more frequently?
- What about hybrid algorithms that combine SGD with L-BFGS?
- What would be the impact of better spatial locality? For example, could biased sampling (rather than uniform with replacement) improve matrix completion?
- How much do you agree with the person above?