
Page 1: Distributed Machine Learning and Graph Processing with Sparse

Distributed Machine Learning and Graph Processing with Sparse Matrices

Shivaram Venkataraman*, Erik Bodzsar#

Indrajit Roy+, Alvin AuYoung+, Rob Schreiber+

*UC Berkeley, #U Chicago, +HP Labs

Page 2: Distributed Machine Learning and Graph Processing with Sparse

Big Data, Complex Algorithms

PageRank (Dominant eigenvector)

Recommendations (Matrix factorization)

Anomaly detection (Top-K eigenvalues)

User Importance (Vertex centrality)


Machine learning + Graph algorithms

Page 3: Distributed Machine Learning and Graph Processing with Sparse

Large-Scale Processing Frameworks

Data-parallel frameworks – MapReduce/Dryad (2004)

– Process each record in parallel

– Use case: computing sufficient statistics, analytics queries

Graph-centric frameworks – Pregel/GraphLab (2010)

– Process each vertex in parallel

– Use case: graphical models

Array-based frameworks – MadLINQ (2012)

– Process blocks of an array in parallel

– Use case: linear algebra operations

Page 4: Distributed Machine Learning and Graph Processing with Sparse

PageRank using Matrices

Power Method

Dominant eigenvector

[Diagram: p = M × p, iterated until convergence]

M = web graph matrix; p = PageRank vector

Simplified algorithm: repeat { p = M * p }

Linear algebra operations on sparse matrices
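A minimal single-node sketch of this power iteration in plain R (illustrative only; it assumes M is a column-stochastic web-graph matrix, and Presto's distributed version follows on the next slides):

# Single-node power method; p converges to the dominant eigenvector.
power_method <- function(M, iters = 100) {
  p <- rep(1 / nrow(M), nrow(M))   # start from the uniform vector
  for (k in seq_len(iters)) {
    p <- M %*% p                   # one power-method step: p = M * p
    p <- p / sum(p)                # renormalize to keep a probability vector
  }
  p
}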

Page 5: Distributed Machine Learning and Graph Processing with Sparse

Presto

Large-scale machine learning and graph processing on sparse matrices

Extend R – make it scalable, distributed

Page 6: Distributed Machine Learning and Graph Processing with Sparse

Challenge 1 – Sparse Matrices


Page 7: Distributed Machine Learning and Graph Processing with Sparse

Challenge 1 – Sparse Matrices

[Plot: normalized block density (log scale, 1–10000) vs. block ID for LiveJournal, Netflix, and ClueWeb-1B]

Some blocks hold 1000x more data than others, causing computation imbalance.

Page 8: Distributed Machine Learning and Graph Processing with Sparse

Challenge 2 – Data Sharing

Sharing data through pipes/network

Time-inefficient (sending copies); space-inefficient (extra copies)

[Diagram: processes on Server 1 and Server 2 each keep local copies of the data and exchange further copies over the network]

Two challenges: sparse matrices and communication overhead.

Page 9: Distributed Machine Learning and Graph Processing with Sparse

Outline

• Motivation

• Programming model

• Design

• Applications and Results


Page 10: Distributed Machine Learning and Graph Processing with Sparse

darray


Page 11: Distributed Machine Learning and Graph Processing with Sparse

foreach f(x)


Page 12: Distributed Machine Learning and Graph Processing with Sparse

PageRank Using Presto

M <- darray(dim=c(N,N), blocks=c(s,N))
P <- darray(dim=c(N,1), blocks=c(s,1))
while(...) {                    # until convergence
  foreach(i, 1:len,             # len = number of row blocks (N/s)
    calculate(m=splits(M,i),
              x=splits(P), p=splits(P,i)) {
      p <- m %*% x
    }
  )
}

Create Distributed Array

[Diagram: M partitioned into row blocks; p partitioned into P1, P2, …, PN/s]

Page 13: Distributed Machine Learning and Graph Processing with Sparse

PageRank Using Presto

M <- darray(dim=c(N,N), blocks=c(s,N))
P <- darray(dim=c(N,1), blocks=c(s,1))
while(...) {                    # until convergence
  foreach(i, 1:len,             # len = number of row blocks (N/s)
    calculate(m=splits(M,i),
              x=splits(P), p=splits(P,i)) {
      p <- m %*% x
    }
  )
}

Execute the function in the cluster

Pass array partitions: splits(M,i) passes the i-th partition of M; splits(P) with no index passes all of P

[Diagram: M partitioned into row blocks; p partitioned into P1, P2, …, PN/s]

Page 14: Distributed Machine Learning and Graph Processing with Sparse

Presto Architecture

[Diagram: a master coordinating worker nodes; each worker runs several R instances that share data through DRAM]

Page 15: Distributed Machine Learning and Graph Processing with Sparse

Repartitioning Matrices

Profile execution

Repartition

Repartition if max(t) / median(t) > δ, where t is the vector of per-partition execution times.
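A small R sketch of this trigger, assuming t holds the profiled per-partition execution times and delta is the imbalance threshold (both names are illustrative, not Presto API):

# Hypothetical repartitioning trigger; t and delta are illustrative names.
needs_repartition <- function(t, delta) {
  max(t) / median(t) > delta   # skew of the slowest partition vs. the median
}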

Page 16: Distributed Machine Learning and Graph Processing with Sparse

Maintaining Size Invariants

invariant(mat, vec, type=ROW)

Declares that mat and vec must stay partitioned identically along rows, so repartitioning one repartitions the other the same way.
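A hedged usage sketch combining the darray and invariant calls from these slides (argument shapes follow the PageRank example; N and s are as on the earlier slides; illustrative, not a verified API):

# Illustrative only: keeps the row partitions of M and P aligned.
M <- darray(dim=c(N, N), blocks=c(s, N))  # N x N matrix in s-row blocks
P <- darray(dim=c(N, 1), blocks=c(s, 1))  # vector in matching s-row blocks
invariant(M, P, type=ROW)                 # if M's rows are repartitioned,
                                          # P's rows follow identically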

Page 17: Distributed Machine Learning and Graph Processing with Sparse

Sharing Distributed Arrays

Versioned distributed arrays


Goal: Zero-copy sharing across cores

Immutable partitions → safe sharing
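A toy R sketch of the versioning idea (conceptual model only; Presto's actual mechanism operates on shared-memory partitions):

# Conceptual model: an update appends a new version of a partition, so
# readers of older versions are never disturbed (safe, zero-copy reads).
update_partition <- function(versions, new_value) {
  c(versions, list(new_value))   # append; old versions stay immutable
}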

Page 18: Distributed Machine Learning and Graph Processing with Sparse

Data Sharing Challenges

1. Garbage collection

[Diagram: two R instances sharing one R object, which consists of a header and a data part]

2. Header conflicts


Page 19: Distributed Machine Learning and Graph Processing with Sparse

Overriding R’s allocator

Allocate process-local headers; map the data part into shared memory.

[Diagram: a local R object header on a process-private page, pointing to a shared data part aligned to page boundaries]

Page 20: Distributed Machine Learning and Graph Processing with Sparse

Outline

• Motivation

• Programming model

• Design

• Applications and Results


Page 21: Distributed Machine Learning and Graph Processing with Sparse

demo

5-node cluster, 8 cores per node

PageRank on a 1.5B-edge Twitter graph



Page 23: Distributed Machine Learning and Graph Processing with Sparse

Applications Implemented in Presto

Application             Algorithm                 Presto LOC
PageRank                Eigenvector calculation   41
Triangle counting       Top-K eigenvalues         121
Netflix recommendation  Matrix factorization      130
Centrality measure      Graph algorithm           132
k-path connectivity     Graph algorithm           30
k-means                 Clustering                71
Sequence alignment      Smith-Waterman            64

Fewer than 140 lines of code


Page 24: Distributed Machine Learning and Graph Processing with Sparse

Evaluation Overview

Evaluation setup – 25-machine cluster

– Each machine: 24 cores, 96 GB RAM, 10 Gbps network

Data-sharing benefits – 1.5B-edge Twitter graph

Repartitioning analysis – 6B-edge web graph

Faster than Spark and Hadoop using in-memory data

Collaborative Filtering using Netflix dataset


Page 25: Distributed Machine Learning and Graph Processing with Sparse

Data sharing benefits

[Bar charts: per-iteration time (s), split into Compute and Transfer, at 10, 20, and 40 cores, with and without sharing. Compute time drops as cores increase in both cases; transfer time stays near 0.7 s with sharing but grows to over 4 s at 40 cores without sharing.]

Page 26: Distributed Machine Learning and Graph Processing with Sparse

Repartitioning Progress

[Plot: split size (GB, 0–30) vs. iteration count over 15 iterations]

Page 27: Distributed Machine Learning and Graph Processing with Sparse

Repartitioning benefits

[Plots: per-worker Transfer and Compute times (0–160 s), without repartitioning vs. with repartitioning]

Page 28: Distributed Machine Learning and Graph Processing with Sparse

Related Work

Large scale data processing frameworks

– MapReduce, Dryad, Spark, GraphLab

Matrix Computations – Ricardo, MadLINQ

HPC systems – ARPACK, Combinatorial BLAS

Multi-core R packages – doMC, snow, Rmpi


Page 29: Distributed Machine Learning and Graph Processing with Sparse

Presto

Co-partitioning matrices

Locality-based scheduling

Caching partitions

Page 30: Distributed Machine Learning and Graph Processing with Sparse

Conclusion

Presto: a large-scale, array-based framework that extends R

Addresses the challenges of sparse matrices

Key techniques: repartitioning and sharing versioned arrays

Page 31: Distributed Machine Learning and Graph Processing with Sparse

Backup Slides


Page 32: Distributed Machine Learning and Graph Processing with Sparse

Netflix Collaborative Filtering

[Bar chart: time (seconds) vs. number of cores, split into Load, t(R)×R, and R×t(R)×R phases: 755.1 s at 8 cores, 381.0 at 16, 256.1 at 24, 234.2 at 32, 202.7 at 40, and 155.3 at 48]

Page 33: Distributed Machine Learning and Graph Processing with Sparse

Repartitioning benefits

[Plot: time to convergence (s, 2000–8000) and cumulative partitioning time (s, 0–400) vs. number of repartitions (0–20)]