
Sparse matrix computations in MapReduce

Description: Slides from my presentation on sparse matrix methods in MapReduce.

Page 1: Sparse matrix computations in MapReduce

ICME MapReduce Workshop, April 29 – May 1, 2013

David F. Gleich, Computer Science, Purdue University
Paul G. Constantine, Center for Turbulence Research, Stanford University

Website: www.stanford.edu/~paulcon/icme-mapreduce-2013

David Gleich · Purdue 1

MRWorkshop

Page 2: Sparse matrix computations in MapReduce

Goals

- Learn the basics of MapReduce & Hadoop
- Be able to process large volumes of data from science and engineering applications
- … help enable you to explore on your own!


Page 3: Sparse matrix computations in MapReduce

Workshop overview

Monday
- Me: Sparse matrix computations in MapReduce
- Austin Benson: Tall-and-skinny matrix computations in MapReduce

Tuesday
- Joe Buck: Extending MapReduce for scientific computing
- Chunsheng Feng: Large-scale video analytics on Pivotal Hadoop

Wednesday
- Joe Nichols: Post-processing CFD dynamics data in MapReduce
- Lavanya Ramakrishnan: Evaluating MapReduce and Hadoop for science

Page 4: Sparse matrix computations in MapReduce

Sparse matrix computations in MapReduce

David F. Gleich, Computer Science, Purdue University

Slides online soon! Code https://github.com/dgleich/mapreduce-matrix-tutorial


Page 5: Sparse matrix computations in MapReduce

How to compute with big matrix data: a tale of two computers

ORNL 2010 supercomputer: 224k cores, 10 PB drive, 1.7 Pflops, 7 MW, custom interconnect, $104M. That is 45 GB/core: high CPU to disk.

Google's 2010(?) data computer: 80k cores, 50 PB drive, ? Pflops, ? MW, GB ethernet, $?? M. That is 625 GB/core: high disk to CPU.

Page 6: Sparse matrix computations in MapReduce

My data computers

Nebula Cluster @ Sandia CA: 2 TB/core storage, 64 nodes, 256 cores, GB ethernet. Cost: $150k.

ICME Hadoop @ Stanford: 3 TB/core storage, 11 nodes, 44 cores, GB ethernet. Cost: $30k.

These systems are good for working with enormous matrix data!

Page 7: Sparse matrix computations in MapReduce

My data computers

Nebula Cluster @ Sandia CA: 2 TB/core storage, 64 nodes, 256 cores, GB ethernet. Cost: $150k.

ICME Hadoop @ Stanford: 3 TB/core storage, 11 nodes, 44 cores, GB ethernet. Cost: $30k.

These systems are good, but not great, for working with some enormous matrix data!

Page 8: Sparse matrix computations in MapReduce

By 2013(?) all Fortune 500 companies will have a data computer


Page 9: Sparse matrix computations in MapReduce

How do you program them?


Page 10: Sparse matrix computations in MapReduce

MapReduce and Hadoop overview


Page 11: Sparse matrix computations in MapReduce

MapReduce is designed to solve a different set of problems from standard parallel libraries


Page 12: Sparse matrix computations in MapReduce

The MapReduce programming model

Input: a list of (key, value) pairs
Map: apply a function f to all pairs
Reduce: apply a function g to all values with key k (for all k)
Output: a list of (key, value) pairs

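The model above is small enough to sketch in plain Python. This is a toy, single-process stand-in, not Hadoop; the helper names (`mapreduce`, the word-count lambdas) are mine:

```python
from itertools import groupby
from operator import itemgetter

def mapreduce(pairs, f, g):
    """Toy model of MapReduce in one process.
    f: (key, value) -> iterable of (key2, value2)   (the Map)
    g: (key2, [values]) -> iterable of (key3, value3) (the Reduce)"""
    mapped = [kv for pair in pairs for kv in f(*pair)]   # Map phase
    mapped.sort(key=itemgetter(0))                       # Shuffle: group by key
    out = []
    for k, group in groupby(mapped, key=itemgetter(0)):
        out.extend(g(k, [v for _, v in group]))          # Reduce phase
    return out

# Word count, the "hello world" of MapReduce
counts = mapreduce([(1, "a b a"), (2, "b c")],
                   lambda k, line: [(w, 1) for w in line.split()],
                   lambda w, ones: [(w, sum(ones))])
# counts == [('a', 2), ('b', 2), ('c', 1)]
```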

Page 13: Sparse matrix computations in MapReduce

Computing a histogram: a simple MapReduce example

13

Input: Key = ImageId, Value = Pixels

Map(ImageId, Pixels): for each pixel, emit Key = (r,g,b), Value = 1

Reduce(Color, Values): emit Key = Color, Value = sum(Values)

Output: Key = Color, Value = # of pixels

[Diagram: pixel values flow through Map as ((r,g,b), 1) pairs, the shuffle groups equal colors, and Reduce sums the 1s.]

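In plain Python the same job is just a grouped sum. A toy, single-process stand-in for the picture above (the pixel data is made up, with colors abbreviated to single ints):

```python
from collections import Counter

# Toy input: ImageId -> pixels
images = {1: [5, 15, 10, 5], 2: [10, 9, 5]}

# Map: every pixel becomes (color, 1)
mapped = [(color, 1) for pixels in images.values() for color in pixels]

# Shuffle + Reduce: group by color and sum the 1s
hist = Counter()
for color, one in mapped:
    hist[color] += one
# hist == Counter({5: 3, 10: 2, 15: 1, 9: 1})
```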

Page 14: Sparse matrix computations in MapReduce

Many matrix computations are possible in MapReduce

Column sums are easy.

Input: Key = (i,j), Value = Aij
Map((i,j), val): emit Key = j, Value = val
Reduce(j, Values): emit Key = j, Value = sum(Values)

Other basic methods can use common parallel/out-of-core algorithms: sparse matrix-vector products y = Ax, sparse matrix-matrix products C = AB.

A 4-by-4 matrix A in "coordinate storage":
(3,4) -> 5
(1,2) -> -6.0
(2,3) -> -1.2
(1,1) -> 3.14
…
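As a sanity check, the column-sum map/reduce pair above amounts to this (a toy, in-memory version over the slide's coordinate-storage entries):

```python
from collections import defaultdict

# Coordinate storage: (i, j) -> Aij, as on the slide
A = {(3, 4): 5.0, (1, 2): -6.0, (2, 3): -1.2, (1, 1): 3.14}

# Map((i,j), val): key on the column j
# Reduce(j, Values): sum the values for each column
colsums = defaultdict(float)
for (i, j), val in A.items():
    colsums[j] += val
# colsums == {4: 5.0, 2: -6.0, 3: -1.2, 1: 3.14}
```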

Page 15: Sparse matrix computations in MapReduce

Many matrix computations are possible in MapReduce

Column sums are easy.

Input: Key = (i,j), Value = Aij
Map((i,j), val): emit Key = j, Value = val
Reduce(j, Values): emit Key = j, Value = sum(Values)

Other basic methods can use common parallel/out-of-core algorithms: sparse matrix-vector products y = Ax, sparse matrix-matrix products C = AB.

Beware of un-thoughtful ideas.

Page 16: Sparse matrix computations in MapReduce

Why so many limitations?


Page 17: Sparse matrix computations in MapReduce

The MapReduce programming model

Input: a list of (key, value) pairs
Map: apply a function f to all pairs
Reduce: apply a function g to all values with key k (for all k)
Output: a list of (key, value) pairs

Map function f must be side-effect free! All map functions run in parallel.
Reduce function g must be side-effect free! All reduce functions run in parallel.

Page 18: Sparse matrix computations in MapReduce

A graphical view of the MapReduce programming model

[Diagram: each Map task reads its data and emits (key, value) pairs; the shuffle groups values by key; each Reduce task consumes a key with all of its values and writes data.]

Page 19: Sparse matrix computations in MapReduce

Data scalability

The idea: bring the computations to the data. MR can schedule map functions without moving data.

[Diagram: input blocks 1–5 are stored across the nodes; map tasks (M) run on the nodes that hold their blocks, and only the shuffle moves data to the reduce tasks (R).]

Page 20: Sparse matrix computations in MapReduce

After waiting in the queue for a month and !after 24 hours of finding eigenvalues, one node randomly hiccups.

heartbreak on node rs252


Page 21: Sparse matrix computations in MapReduce

Fault tolerant

Redundant input helps make maps data-local. Just one type of communication: the shuffle.

[Diagram: input stored in triplicate; map output persisted to disk before the shuffle; reduce input/output on disk.]

Page 22: Sparse matrix computations in MapReduce

Performance results (simulated faults)

We can still run with P(fault) = 1/5 with only a ~2x performance penalty. However, with P(fault) small, we still see a performance hit.

[Plot: time to completion (sec) vs. 1/Prob(failure), the mean number of successes per failure, under fault injection; curves for 200M-by-200 and 800M-by-10 problems, with and without faults.]

With 1/5 tasks failing, the job only takes twice as long.


Page 23: Sparse matrix computations in MapReduce

Data scalability

The idea: bring the computations to the data. MR can schedule map functions without moving data.

[Diagram: input blocks 1–5 are stored across the nodes; map tasks (M) run on the nodes that hold their blocks, and only the shuffle moves data to the reduce tasks (R).]

Page 24: Sparse matrix computations in MapReduce

Computing a histogram: a simple MapReduce example

Input: Key = ImageId, Value = Pixels

Map(ImageId, Pixels): for each pixel, emit Key = (r,g,b), Value = 1

Reduce(Color, Values): emit Key = Color, Value = sum(Values)

Output: Key = Color, Value = # of pixels

[Diagram: the same histogram picture as before.]

The entire dataset is “transposed” from images to pixels. This moves the data to the computation!

(Using a combiner helps to reduce the data moved, but it cannot always be used)
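A quick way to see what a combiner buys is to count shuffled records in a toy run (the data and numbers here are made up; colors are abbreviated to single ints):

```python
from collections import Counter

# Pixels emitted by two mappers
mapper_outputs = [[5, 5, 9, 5], [9, 9, 3]]

# Without a combiner, every (color, 1) pair crosses the network
shuffled_plain = sum(len(out) for out in mapper_outputs)   # 7 records

# With a combiner, each mapper pre-sums its own counts first
combined = [Counter(out) for out in mapper_outputs]
shuffled_combined = sum(len(c) for c in combined)          # 4 records

# Either way the reducer computes the same histogram
total = Counter()
for c in combined:
    total.update(c)
# total == Counter({5: 3, 9: 3, 3: 1})
```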


Page 25: Sparse matrix computations in MapReduce

Hadoop and MapReduce are bad systems for some matrix computations.


Page 26: Sparse matrix computations in MapReduce

How should you evaluate a MapReduce algorithm?

Build a performance model!
- Measure the worst mapper: usually not too bad.
- Measure the data moved: could be very bad.
- Measure the worst reducer: could be very bad.


Page 27: Sparse matrix computations in MapReduce

Tools I like

hadoop streaming, dumbo, mrjob, hadoopy, C++


Page 28: Sparse matrix computations in MapReduce

Tools I don’t use but other people seem to like …

pig, java, hbase, mahout, Eclipse, Cassandra


Page 29: Sparse matrix computations in MapReduce

hadoop streaming

The map function is a program: (key, value) pairs are sent via stdin, and output (key, value) pairs go to stdout.
The reduce function is a program: (key, value) pairs are sent via stdin, grouped by key, and output (key, value) pairs go to stdout.
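For instance, a streaming word count can live in a single Python file used for both phases. This is a sketch: the `map`/`reduce` command-line convention is mine, and Hadoop itself does the sort between the two phases:

```python
import sys
from itertools import groupby

def mapper(lines):
    # emit "key<TAB>value" for every word; streaming passes these via stdout
    for line in lines:
        for word in line.split():
            yield "%s\t%d" % (word.lower(), 1)

def reducer(lines):
    # streaming delivers reducer input sorted, so equal keys are adjacent
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(v) for _, v in group))

if __name__ == "__main__" and len(sys.argv) > 1:
    # run as either phase: `python wc.py map` or `python wc.py reduce`
    phase = mapper if sys.argv[1] == "map" else reducer
    sys.stdout.writelines(out + "\n" for out in phase(sys.stdin))
```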


Page 30: Sparse matrix computations in MapReduce

mrjob, from Yelp

A wrapper around hadoop streaming for map and reduce functions in Python:

    from mrjob.job import MRJob

    class MRWordFreqCount(MRJob):
        def mapper(self, _, line):
            for word in line.split():
                yield (word.lower(), 1)

        def reducer(self, word, counts):
            yield (word, sum(counts))

    if __name__ == '__main__':
        MRWordFreqCount.run()


Page 31: Sparse matrix computations in MapReduce

How can Hadoop streaming possibly be fast?

Hadoop streaming frameworks:

Framework   Iter 1 QR (secs.)   Iter 1 Total (secs.)   Iter 2 Total (secs.)   Overall Total (secs.)
Dumbo       67725               960                    217                    1177
Hadoopy     70909               612                    118                    730
C++         15809               350                    37                     387
Java        --                  436                    66                     502

Synthetic data test: 100,000,000-by-500 matrix (~500 GB). Codes implemented in MapReduce streaming. Matrix stored as TypedBytes lists of doubles. Python frameworks use NumPy+Atlas. Custom C++ TypedBytes reader/writer with Atlas. New non-streaming Java implementation too.

David Gleich (Sandia)

All timing results from the Hadoop job tracker

C++ in streaming beats a native Java implementation.

MapReduce 2011


Example available from github.com/dgleich/mrtsqr for verification.

mrjob could be faster if it used typedbytes for intermediate storage see https://github.com/Yelp/mrjob/pull/447


Page 32: Sparse matrix computations in MapReduce

Code samples and short tutorials at github.com/dgleich/mrmatrix github.com/dgleich/mapreduce-matrix-tutorial


Page 33: Sparse matrix computations in MapReduce

Matrix-vector product


Ax = y, where yi = Σk Aik xk

Follow along! mapreduce-matrix-tutorial/codes/smatvec.py


Page 34: Sparse matrix computations in MapReduce

Matrix-vector product


Ax = y, where yi = Σk Aik xk

A is stored by row:

    $ head samples/smat_5_5.txt
    0 0 0.125 3 1.024 4 0.121
    1 0 0.597
    2 2 1.247
    3 4 -1.45
    4 2 0.061

x is stored entry-wise:

    $ head samples/vec_5.txt
    0 0.241
    1 -0.98
    2 0.237
    3 -0.32
    4 0.080

Follow along! mapreduce-matrix-tutorial/codes/smatvec.py


Page 35: Sparse matrix computations in MapReduce

Matrix-vector product (in pictures)


Ax = y, where yi = Σk Aik xk

Input: A, x
Map 1: align on columns
Reduce 1: output Aik xk, keyed on row i
Reduce 2: output sum(Aik xk) = y


Page 36: Sparse matrix computations in MapReduce

Matrix-vector product (in pictures)


Ax = y, where yi = Σk Aik xk

Input → Map 1: align on columns.

    def joinmap(self, key, line):
        vals = line.split()
        if len(vals) == 2:
            # the vector
            yield (vals[0],             # row
                   (float(vals[1]),))   # xi
        else:
            # the matrix
            row = vals[0]
            for i in xrange(1, len(vals), 2):
                yield (vals[i],                 # column
                       (row,                    # i, Aij
                        float(vals[i+1])))


Page 37: Sparse matrix computations in MapReduce

Matrix-vector product (in pictures)


Ax = y, where yi = Σk Aik xk

Input → Map 1: align on columns → Reduce 1: output Aik xk, keyed on row i.

    def joinred(self, key, vals):
        vecval = 0.
        matvals = []
        for val in vals:
            if len(val) == 1:
                vecval += val[0]
            else:
                matvals.append(val)
        for val in matvals:
            yield (val[0], val[1]*vecval)

Note that you should use a secondary sort to avoid reading both into memory.


Page 38: Sparse matrix computations in MapReduce

Matrix-vector product (in pictures)


Ax = y, where yi = Σk Aik xk

Input → Map 1: align on columns → Reduce 1: output Aik xk, keyed on row i → Reduce 2: output sum(Aik xk) = y.

    def sumred(self, key, vals):
        yield (key, sum(vals))
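Chaining the three functions together outside Hadoop makes a quick correctness check. This is a toy, in-memory run of the same two-stage plan; the tiny matrix is made up, and the line formats mirror the sample files above:

```python
from collections import defaultdict

# A stored by row: "row col val col val ...", x stored entry-wise: "row val"
A_lines = ["0 0 2.0 1 1.0", "1 1 3.0"]
x_lines = ["0 1.0", "1 -1.0"]

# Map 1: align A's entries and x's entries on the column index k
by_col = defaultdict(lambda: {"x": 0.0, "mat": []})
for line in x_lines:
    k, xk = line.split()
    by_col[k]["x"] = float(xk)
for line in A_lines:
    vals = line.split()
    i = vals[0]
    for p in range(1, len(vals), 2):
        by_col[vals[p]]["mat"].append((i, float(vals[p+1])))

# Reduce 1: emit Aik*xk keyed on row i; Reduce 2: sum per row
y = defaultdict(float)
for slot in by_col.values():
    for i, aik in slot["mat"]:
        y[i] += aik * slot["x"]
# A = [[2,1],[0,3]], x = [1,-1], so y == {'0': 1.0, '1': -3.0}
```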


Page 39: Sparse matrix computations in MapReduce

Move the computations to the data? Not really!


Input: A, x
Map 1: align on columns → copy the data once, now aligned on column.
Reduce 1: output Aik xk, keyed on row i → copy the data again, now aligned on row.
Reduce 2: output sum(Aik xk) = y.

Page 40: Sparse matrix computations in MapReduce

Matrix-matrix product


AB = C, where Cij = Σk Aik Bkj

Follow along! mapreduce-matrix-tutorial/codes/matmat.py


Page 41: Sparse matrix computations in MapReduce

Matrix-matrix product


AB = C, where Cij = Σk Aik Bkj

A is stored by row:

    $ head samples/smat_10_5_A.txt
    0 0 0.599 4 -1.53
    1
    2 2 0.260
    3
    4 0 0.267 1 0.839

B is stored by row:

    $ head samples/smat_5_5.txt
    0 0 0.125 3 1.024 4 0.121
    1 0 0.597
    2 2 1.247

Follow along! mapreduce-matrix-tutorial/codes/matmat.py


Page 42: Sparse matrix computations in MapReduce

Matrix-matrix product (in pictures)


AB = C, where Cij = Σk Aik Bkj

Input: A, B
Map 1: align on columns
Reduce 1: output Aik Bkj, keyed on (i,j)
Reduce 2: output sum(Aik Bkj) = C


Page 43: Sparse matrix computations in MapReduce

Matrix-matrix product (in code)


AB = C, where Cij = Σk Aik Bkj

Map 1: align on columns.

    def joinmap(self, key, line):
        mtype = self.parsemat()
        vals = line.split()
        row = vals[0]
        rowvals = \
            [(vals[i], float(vals[i+1]))
             for i in xrange(1, len(vals), 2)]
        if mtype == 1:
            # matrix A, output by col
            for val in rowvals:
                yield (val[0], (row, val[1]))
        else:
            yield (row, (rowvals,))


Page 44: Sparse matrix computations in MapReduce

Matrix-matrix product (in code)


AB = C, where Cij = Σk Aik Bkj

Map 1: align on columns → Reduce 1: output Aik Bkj, keyed on (i,j).

    def joinred(self, key, vals):
        # load the data into memory
        brow = []
        acol = []
        for val in vals:
            if len(val) == 1:
                brow.extend(val[0])
            else:
                acol.append(val)

        for (bcol, bval) in brow:
            for (arow, aval) in acol:
                yield ((arow, bcol), aval*bval)


Page 45: Sparse matrix computations in MapReduce

Matrix-matrix product (in pictures)


AB = C, where Cij = Σk Aik Bkj

Map 1: align on columns → Reduce 1: output Aik Bkj, keyed on (i,j) → Reduce 2: output sum(Aik Bkj) = C.

    def sumred(self, key, vals):
        yield (key, sum(vals))
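The same plan, collapsed into one in-memory toy (the shuffle's grouping on the inner index k becomes an explicit match; the 2-by-2 matrices are made up):

```python
from collections import defaultdict

# Both matrices in coordinate form: (i, k) -> A[i,k] and (k, j) -> B[k,j]
A = {(0, 0): 2.0, (0, 1): 1.0, (1, 1): 3.0}
B = {(0, 0): 1.0, (1, 0): -1.0, (1, 1): 4.0}

# Map 1 + Reduce 1: align on the inner index k, emit Aik*Bkj keyed on (i, j)
C = defaultdict(float)
for (i, k), aik in A.items():
    for (k2, j), bkj in B.items():
        if k == k2:                  # the shuffle does this grouping for real
            C[(i, j)] += aik * bkj   # Reduce 2: sum per (i, j)
# A = [[2,1],[0,3]], B = [[1,0],[-1,4]], so C == [[1,4],[-3,12]]
```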


Page 46: Sparse matrix computations in MapReduce

Why is MapReduce so popular?

    if (root) {
      PetscInt cur_nz=0;
      unsigned char* root_nz_buf;
      unsigned int *root_nz_buf_i, *root_nz_buf_j;
      double *root_nz_buf_v;
      PetscMalloc((sizeof(unsigned int)*2+sizeof(double))*root_nz_bufsize, &root_nz_buf);
      PetscMalloc(sizeof(unsigned int)*root_nz_bufsize, &root_nz_buf_i);
      PetscMalloc(sizeof(unsigned int)*root_nz_bufsize, &root_nz_buf_j);
      PetscMalloc(sizeof(double)*root_nz_bufsize, &root_nz_buf_v);

      unsigned long long int nzs_to_read = total_nz;

      while (send_rounds > 0) {
        // check if we are near the end of the file
        // and just read that amount
        size_t cur_nz_read = root_nz_bufsize;
        if (cur_nz_read > nzs_to_read) {
          cur_nz_read = nzs_to_read;
        }
        PetscInfo2(PETSC_NULL, " reading %i non-zeros of %lli\n", cur_nz_read, nzs_to_read);

600 lines of gross code in order to load a sparse matrix into memory, streaming from one processor. MapReduce offers a better alternative


Page 47: Sparse matrix computations in MapReduce

Thoughts on a better system

- Default quadruple precision
- Matrix computations without indexing
- Easy setup of MPI data jobs

[Diagram: the initial data load of an MPI job feeds into the compute task.]

Page 48: Sparse matrix computations in MapReduce

Double-precision floating point was designed for the era when "big" meant 1,000–10,000.


Page 49: Sparse matrix computations in MapReduce

Error analysis of summation

    s = 0
    for i = 1 to n:
        s = s + x[i]

This simple summation formula has error that is not always small if n is a billion.


fl(x + y) = (x + y)(1 + ε)

|fl(Σi xi) − Σi xi| ≤ n μ Σi |xi|,   μ ≈ 10^-16
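The problem is easy to reproduce. A toy demonstration, with `math.fsum` standing in as an accurate reference:

```python
import math

# One big value plus a million tiny ones
x = [1.0] + [1e-16] * 1000000

naive = 0.0
for xi in x:
    naive += xi          # each 1e-16 vanishes: 1.0 + 1e-16 rounds to 1.0

exact = math.fsum(x)     # correctly rounded: about 1.0000000001
# naive == 1.0, so the tail's entire 1e-10 contribution is lost
```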

Page 50: Sparse matrix computations in MapReduce

If your application matters then watch out for this issue. Use quad-precision arithmetic or compensated summation instead.


Page 51: Sparse matrix computations in MapReduce

Compensated summation ("Kahan summation algorithm" on Wikipedia):

    s = 0.; c = 0.
    for i = 1 to n:
        y = x[i] - c
        t = s + y
        c = (t - s) - y
        s = t


Mathematically, c is always zero. On a computer, c can be non-zero. The parentheses matter!

|fl(csum(x)) − Σi xi| ≤ (μ + n μ^2) Σi |xi|,   μ ≈ 10^-16
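Transcribed directly into Python (the pseudocode above wrapped in a function; the test data mirrors the failure case of the naive loop):

```python
def kahan_sum(xs):
    """Compensated (Kahan) summation."""
    s = 0.0
    c = 0.0              # running compensation; exactly 0 in exact arithmetic
    for x in xs:
        y = x - c        # subtract the error recovered last round
        t = s + y
        c = (t - s) - y  # the parentheses matter: this recovers y's lost bits
        s = t
    return s

xs = [1.0] + [1e-16] * 1000000
# plain sum(xs) returns 1.0; kahan_sum keeps the tiny terms,
# giving a result near 1.0000000001
```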

Page 52: Sparse matrix computations in MapReduce

Summary

MapReduce is a powerful but limited tool that has a role in the future of computational math. … but it should be used carefully! See Austin’s talk next!


Code samples and short tutorials at github.com/dgleich/mrmatrix github.com/dgleich/mapreduce-matrix-tutorial

Page 53: Sparse matrix computations in MapReduce

David Gleich · Purdue 53
