Hogwild!
Neil Giridharan
March 20, 2019
Outline
- Sparse Separable Cost Functions
- Motivating Examples
- Previous Work
- Hogwild!
- Convergence
- Experiments and Results
- Discussion
Sparse Separable Cost Functions
Let f : X ⊆ R^n → R. Define

  f(x) = ∑_{e∈E} f_e(x_e)

where each e ⊆ {1, ..., n} and x_e denotes the components of x indexed by e. When |E| and n are large but each x_e is short, f is sparse.
Example
x = (2, 3, 5, 7, 11), e = (3, 5), x_e = (5, 11)
f_e(x_e) = |x_e|, f_e(x_e) ≈ 27.31
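A minimal Python sketch (not from the talk: the index sets and per-term cost below are made up, and indices are zero-based) of how such a function is evaluated, touching only |e| components per term:

```python
import numpy as np

# Sparse separable cost: f(x) = sum over e in E of f_e(x_e), where each
# hyperedge e indexes only a few components of x (zero-based here).
x = np.array([2.0, 3.0, 5.0, 7.0, 11.0])
edges = [(0, 1), (2, 4), (1, 3)]            # each e is a small subset of indices

def f_e(x_e):
    # Made-up per-term cost: the Euclidean norm of the sub-vector.
    return np.linalg.norm(x_e)

f = sum(f_e(x[list(e)]) for e in edges)     # each term reads only |e| components
```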
Motivating Examples
- Netflix Prize Problem
- Neutrinos and Muons
- Image Segmentation
Netflix Prize
Figure 1: Netflix Prize Winner
Sparse Matrix Completion Visualization
Figure 2: Matrix Completion
Sparse Matrix Completion
M is an n_r × n_c low-rank matrix with some entries revealed. E contains the set of revealed index pairs (u, v); L_u denotes the uth row of L and R_v* the vth column of R*. The idea is to estimate M from the product LR*:

  minimize_{(L,R)} ∑_{(u,v)∈E} (L_u R_v* − M_uv)² + (µ/2)||L||²_F + (µ/2)||R||²_F

The regularization term above depends on all of L and R, so it is rescaled per entry:

  minimize_{(L,R)} ∑_{(u,v)∈E} [ (L_u R_v* − M_uv)² + (µ/(2|E_u−|))||L_u||²_F + (µ/(2|E_−v|))||R_v||²_F ]

where E_u− = {v : (u, v) ∈ E} and E_−v = {u : (u, v) ∈ E}.
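A minimal sketch of one SGD step on the rescaled objective, assuming L and R are stored as NumPy arrays of rows; the function and parameter names are illustrative, not the authors' code:

```python
import numpy as np

def mc_step(L, R, M_uv, u, v, mu, deg_u, deg_v, gamma):
    """One SGD step for a single observed entry (u, v) of the rescaled
    matrix completion objective. deg_u = |E_u-| and deg_v = |E_-v| count
    the observed entries in row u and column v."""
    err = L[u] @ R[v] - M_uv                        # residual L_u R_v* - M_uv
    grad_Lu = 2 * err * R[v] + (mu / deg_u) * L[u]  # touches only row u of L
    grad_Rv = 2 * err * L[u] + (mu / deg_v) * R[v]  # touches only row v of R
    L[u] -= gamma * grad_Lu
    R[v] -= gamma * grad_Rv
```

Because each step writes only row u of L and row v of R, two steps on different revealed entries rarely touch the same memory, which is exactly the sparsity Hogwild! exploits.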
Neutrinos and Muons

- If a neutrino hits a water molecule, it could potentially emit a muon
- Need to distinguish muons coming from neutrinos from other muons
- Upward-going vs. downward-going muons
Figure 3: SVM
Sparse SVM
Let E = {(z_1, y_1), ..., (z_|E|, y_|E|)} with z ∈ R^n and labels y; x is the hyperplane in the previous figure. Formulate as

  minimize_x ∑_{α∈E} max(1 − y_α x^T z_α, 0) + λ||x||²_2

The regularization term depends on all of x. Let e_α be the non-zero components of z_α, and let d_u be the number of training examples that are non-zero in component u (u = 1, ..., n). Then the objective becomes

  minimize_x ∑_{α∈E} [ max(1 − y_α x^T z_α, 0) + λ ∑_{u∈e_α} x_u²/d_u ]
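As a sketch (all names hypothetical), the per-example subgradient step on this rescaled objective touches only the components in e_α:

```python
import numpy as np

def svm_step(x, z_a, y_a, e_a, d, lam, gamma):
    """One SGD step on a single training example (z_a, y_a). e_a lists the
    non-zero components of z_a; d[u] counts the training examples that are
    non-zero in component u, as in the rescaled regularizer."""
    margin = y_a * (x @ z_a)
    for u in e_a:                        # update only the components in e_alpha
        g = 2 * lam * x[u] / d[u]        # gradient of lambda * x_u^2 / d_u
        if margin < 1:
            g -= y_a * z_a[u]            # subgradient of the hinge term
        x[u] -= gamma * g
```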
Image Segmentation
- Can reduce image segmentation to a minimum-cut problem
- Min cuts have had success even compared to normalized cuts
Figure 4: Image Segmentation
Sparse Graph Cuts
Let W be a sparse, non-negative similarity matrix, with edges corresponding to its nonzero entries. Each node is associated with a point in the D-dimensional simplex

  S_D = {ζ ∈ R^D : ζ_v ≥ 0, ∑_{v=1}^D ζ_v = 1}

  minimize_x ∑_{(u,v)∈E} w_uv ||x_u − x_v||_1, subject to x_v ∈ S_D for v = 1, ..., n
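A sketch of one subgradient step on a single edge term, assuming x is an n × D array; the projection back onto S_D is omitted, so this only illustrates the per-edge sparsity:

```python
import numpy as np

def cut_step(x, u, v, w_uv, gamma):
    """One subgradient step on a single edge term w_uv * ||x_u - x_v||_1.
    A complete method would project x[u] and x[v] back onto the simplex
    S_D after each update; that projection is omitted for brevity."""
    g = w_uv * np.sign(x[u] - x[v])      # subgradient of the l1 edge penalty
    x[u] -= gamma * g                    # each edge touches only its two endpoints
    x[v] += gamma * g
```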
Previous Work
- Approaches are inspired by classical numerical methods texts
- Master/Worker: one processor writes to memory while the other processors compute gradients
- Round Robin: one processor updates the gradient, then tells the other processors it is done
- Massive overhead due to lock contention and communication
- What happens with no locking and no communication?
Hogwild!
Assume component-wise addition is atomic. Let b_v be 1 in the vth component and 0 elsewhere. G_e(x) ∈ R^n is the gradient of f_e multiplied by |E|, with G_e(x) = 0 on the components not in e. γ is the step size.

Algorithm 1: Hogwild! (each processor repeats):
  Sample e uniformly at random from E;
  Read current state x_e and evaluate G_e(x);
  for v ∈ e do
    x_v ← x_v − γ b_v^T G_e(x)
  end for
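A minimal Python sketch of this loop, illustrative only: CPython's GIL serializes bytecode and NumPy in-place updates are not atomic, so this shows the control flow rather than real lock-free performance, and all names are assumptions:

```python
import threading
import numpy as np

def hogwild(x, edges, G_e, gamma, steps, n_threads=8):
    """Every thread samples edges and writes to the shared iterate x
    with no locks and no coordination."""
    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(steps):
            e = edges[rng.integers(len(edges))]  # sample e uniformly from E
            g = G_e(x, e)                        # gradient of f_e, scaled by |E|
            for i, v in enumerate(e):            # update only the components in e
                x[v] -= gamma * g[i]             # unlocked read-modify-write
    threads = [threading.Thread(target=worker, args=(s,)) for s in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```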
Graph Statistics
Ω := max_{e∈E} |e| : max cardinality of a hyperedge
∆ := max_{1≤v≤n} |{e ∈ E : v ∈ e}| / |E| : normalized max degree
ρ := max_{e∈E} |{e′ ∈ E : e′ ∩ e ≠ ∅}| / |E| : normalized max edge degree
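A small helper (hypothetical, and O(|E|²) in the ρ computation, so only for small inputs) that computes these statistics from a list of hyperedges given as Python sets:

```python
def graph_stats(edges):
    """Compute Omega, Delta, and rho for a list of hyperedges,
    each given as a set of component indices."""
    m = len(edges)
    omega = max(len(e) for e in edges)              # max hyperedge cardinality
    degree = {}                                     # v -> number of edges containing v
    for e in edges:
        for v in e:
            degree[v] = degree.get(v, 0) + 1
    delta = max(degree.values()) / m                # normalized max degree
    # per the definition, each edge intersects itself, so e counts too
    rho = max(sum(1 for e2 in edges if e & e2) for e in edges) / m
    return omega, delta, rho
```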
Statistic   Approximation
Ω           2r
∆           O(log n / n)
ρ           O(log n / n)

Table 1: Matrix completion statistics.
Convergence
Lipschitz continuous gradient: ||∇f(x′) − ∇f(x)|| ≤ L||x′ − x||, ∀x′, x ∈ X
Strongly convex: f(x′) ≥ f(x) + (x′ − x)^T ∇f(x) + (c/2)||x′ − x||², ∀x′, x ∈ X
Bounded gradients: ||G_e(x_e)||_2 ≤ M
Define D_0 := ||x_0 − x*||². τ bounds the lag between when a gradient is computed and when it is used at a particular step. Fix ε > 0 and ν ∈ (0, 1). For

  k ≥ 2LM²(1 + 6τρ + 6τ²Ω∆^{1/2}) log(LD_0/ε) / (c²νε)

we have E[f(x_k) − f(x*)] ≤ ε, where x* is the unique minimizer. When the graph is disconnected (∆ = 0, ρ = 0), this rate equals the serial convergence rate.
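A small helper that just evaluates the right-hand side of this bound for whatever constants one assumes, useful for sanity-checking the disconnected case:

```python
import math

def hogwild_iters(L, M, c, tau, rho, delta, omega, D0, eps, nu):
    """Evaluate the iteration bound above. With rho = delta = 0
    (disconnected graph) it reduces to the serial SGD rate
    2*L*M**2*log(L*D0/eps) / (c**2*nu*eps)."""
    return (2 * L * M**2 * (1 + 6 * tau * rho + 6 * tau**2 * omega * math.sqrt(delta))
            * math.log(L * D0 / eps) / (c**2 * nu * eps))
```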
Results
Data set   size (GB)   ρ        ∆        time (s)   speedup
RCV1       0.9         0.44     1.0      9.5        4.5
Netflix    1.5         2.5e-3   2.3e-3   301.0      5.3
KDD        3.9         3.0e-3   1.8e-3   877.5      5.2
Jumbo      30          2.6e-7   1.4e-7   9453.5     6.8
DBLife     3e-3        8.6e-3   4.3e-3   230.0      8.8
Abdomen    18          9.2e-4   9.2e-4   1181.4     4.1

Table 2: Hogwild! statistics
12-core machine: 10 cores for gradients, 2 cores for data shuffling
Experiments
- Round Robin (RR) is destroyed by communication delay
- RR does get a nearly linear speedup when gradient computation is slow
- Atomic Incremental Gradient (AIG): locks the memory associated with one edge, performs the update, then unlocks
- Graph cuts plateau after about 5 cores
- The authors believe this is an issue with data movement and poor spatial locality
Experiments
Figure 5: a) RCV1 b) Abdomen c) DBLife
Matrix Completion Experiments
Figure 6: a) Netflix b) KDD
Conclusion
Figure 7: One Perspective
Discussion
- Could hybrid locking schemes outperform Hogwild!, especially when certain terms are accessed more frequently?
- What about hybrid algorithms that combine SGD with L-BFGS?
- What would be the impact of better spatial locality? For example, could biased sampling (rather than uniform with replacement) improve matrix completion?
- How much do you agree with the person above?