Spark and GPUs
M. Naumov, J. Daw, V. Ditya, A. Fit-Florea and S. Migacz, 04/07/2016
Source: files.meetup.com/18712511/nvresearch-spark-20160407_final.pdf


Page 1

M. Naumov, J. Daw, V. Ditya, A. Fit-Florea and S. Migacz

Spark and GPUs

04/07/2016

Page 2

Key Issues that Need to Be Addressed

Data
  Contiguous memory layout

Code
  Intercept compute-intensive calls
  Compile Java bytecode to PTX

Job Placement
  Awareness of nodes with and without GPUs
  Different GPU configurations

Page 3

Data

Contiguous memory layout
  Java Unsafe API
  Java NIO buffers

Keep track of where the data is
  Reuse on CPU/GPU instead of always copying

And more
  Data layout in memory, UVM, …

Page 4

Code

Intercept compute-intensive calls
  Wrap library calls (using JNI, jCUDA, SWIG, …)
  Key question: what algorithms are important?

Compile Java bytecode to PTX
  Likely limits the functions you can write
  Maybe enough for the majority of users

Page 5

Job Placement

Awareness of nodes with/without GPUs
  By all schedulers, such as Mesos, YARN, …

Different configurations
  multiple processes per GPU
  multiple GPUs per process
  processes with memory requirements larger than the memory of the GPU(s)

Page 6

Spark Language Interfaces
PyCUDA, SWIG, JNI, MLlib with NVBLAS

Page 7

Python CUDA Bindings (PyCUDA)

# CUDA kernel
mod = SourceModule("""
__global__ void vector_add(float *a, float *b, float *c)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}
""")

# CUDA run
vector_add = mod.get_function("vector_add")
vector_add(drv.In(a), drv.In(b), drv.Out(c), block=(2,1,1), grid=(3,1,1))
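To make the launch parameters concrete without a GPU, here is a pure-Python sketch that emulates the call above (block=(2,1,1), grid=(3,1,1)); `vector_add_emulated` is our illustrative name, not part of PyCUDA. Note that the kernel as written indexes only by threadIdx.x, so every block recomputes the same two elements:

```python
# Pure-Python emulation of the vector_add launch above (no GPU required).
# The kernel indexes only by threadIdx.x, so with block=(2,1,1) each block
# touches only indices 0 and 1 - a quirk of the toy example.
def vector_add_emulated(a, b, block=(2, 1, 1), grid=(3, 1, 1)):
    c = [0.0] * len(a)
    for bx in range(grid[0]):          # blocks in the grid
        for tx in range(block[0]):     # threads in the block
            i = tx                     # mirrors: int i = threadIdx.x;
            if i < len(a):
                c[i] = a[i] + b[i]     # mirrors: c[i] = a[i] + b[i];
    return c
```

A full per-element launch would instead compute i = blockIdx.x * blockDim.x + threadIdx.x.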

Page 8

Caveat I:

Must be able to serialize/deserialize (Java) or pickle/unpickle (Python) the lambda/closure/function supplied to Spark operations, such as map.

In practice, this often means the function must be “self-contained”.

Caveat II:

Currently, there is a lot of overhead in PyCUDA, which seems to include compiling the CUDA kernel at Spark runtime.

Caveat III:

Currently, there is no way to leave and reuse data on the GPU.

Nikolai Sakharnykh

Spark - Python + PyCUDA

Page 9

Python-C/C++ Interface Generation Tool (SWIG)

Python:
def test_add(n):
    x = [numpy.float64(i+1) for i in range(n)]
    y = [numpy.float64(10*(i+1)) for i in range(n)]
    e,r = mn.add(len(x),len(x),x,y)

C/C++:
int add(int n, double *r, double *x, double *y)
{
    for(int i=0; i<n; i++){ r[i]=x[i]+y[i]; }
    return 1;
}

SWIG generates the Python object-layer code: preamble, C/C++ function call, postamble.
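To make the wrapped calling convention concrete, a pure-Python stand-in for the `mn.add` call above (this mock is ours, not SWIG output): the C function fills an output array and returns a status flag, and the generated wrapper surfaces both.

```python
# Hypothetical pure-Python mock of the SWIG-wrapped C function:
#   int add(int n, double *r, double *x, double *y)
# The wrapper allocates the output array r (of size nr) and returns
# (status, r), matching `e, r = mn.add(len(x), len(x), x, y)` above.
def add(n, nr, x, y):
    r = [0.0] * nr
    for i in range(n):
        r[i] = x[i] + y[i]   # mirrors: r[i] = x[i] + y[i];
    return 1, r              # C return value first, then the output array
```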

Page 10

Python-C/C++ Interface Generation Tool (SWIG)

Code Example (typemaps for variables):

%define tmp2c_v(type, name)
#define PyType_AsType PyType_AsType_##type
%typemap(in) (type name) {
    $1 = PyType_AsType($input);
}
#undef PyType_AsType
%enddef

Page 11

Similar to PyCUDA, but does not compile code on the fly.

Allows easier wrapping of CUDA library calls.

Careful with data returned in arrays.

Careful with names across multiple library calls (they are all treated using the same rules).

SWIG can also generate interfaces to other languages (for example, Java using JNI).

Nikolai Sakharnykh

Spark - Python + SWIG

Page 12

Can be used for a Scala-C/C++ Interface: Java Native Interface (JNI)

Scala:
class Binding {
  @native def iArrayMethod(a: Array[Int]): Int
}

object Test extends App {
  System.loadLibrary("Binding")
  val b = new Binding
  val sum = b.iArrayMethod(Array(1, 2, 3))
  …
}

C/C++:
JNIEXPORT jint JNICALL Java_Binding_iArrayMethod
  (JNIEnv* env, jobject obj, jintArray array)
{
  int sum = 0;
  jsize len = (*env)->GetArrayLength(env, array);
  jint* x = (*env)->GetIntArrayElements(env, array, 0);
  for (int i = 0; i < len; i++) { sum += x[i]; }
  (*env)->ReleaseIntArrayElements(env, array, x, 0);
  return sum;
}

Page 13

Spark - Scala + JNI

Similar to SWIG, but using JNI instead of the Python object layer.

Allows easier wrapping of CUDA library calls.

Careful with arrays (GetIntArrayElements might make extra copies).

We have integrated these bindings into the Spark Maven project and they are accessible from any class.

Page 14

MLlib: Spark Machine Learning Library
  Allows the use of native BLAS libraries (such as Intel MKL)

NVBLAS
  Plug-and-play: intercepts host BLAS level-3 calls
  Offloads computation to cuBLAS when beneficial
  Supports multiple GPUs
  Designed to support preloading (no need to even recompile the code)

Spark – MLlib + NVBLAS
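A sketch of the preloading workflow. The configuration keys are from the NVBLAS documentation, but all paths, file names and the job script below are illustrative; check them against your CUDA version:

```shell
# nvblas.conf - minimal illustrative configuration:
#   NVBLAS_CPU_BLAS_LIB  /usr/lib/libopenblas.so   # CPU BLAS fallback
#   NVBLAS_GPU_LIST      ALL                       # use all visible GPUs

# Preload NVBLAS so host BLAS level-3 calls made by MLlib are intercepted;
# nothing needs to be recompiled.
LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so \
NVBLAS_CONFIG_FILE=/path/to/nvblas.conf \
spark-submit my_mllib_job.py
```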

Page 15

Investigation of Spark Operators
Basics, Prefix Sum, All-to-All

Page 16

Existing Operators

map, flatMap, mapPartitions[WithIndex],
zip[WithIndex], union, intersect, filter,
sortBy[Key], partitionBy, reduce, …
(these span transforms, shuffles, and actions)

Code Example:

>>> rdd = sc.parallelize([1, 2, 3, 4], 2)
>>> res = rdd.reduce(lambda x,y: x+y)
>>> print(res)
10

(1 + 2 + 3 + 4 = 10)

Page 17

Motivation for New Operators

y = A x  (matrix A in CSR format is represented by arrays Ap, Ac and Av)

Many algorithms are not easily expressed with existing operators.

Consider sparse matrix-vector multiplication. It is a standard benchmark for HPC. Also, it is used in the Power method to compute the PageRank of web pages.
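As a concrete reference for the operation above, a minimal CSR sparse matrix-vector multiply in plain Python (`csr_spmv` is our illustrative name; 0-based indexing here for clarity):

```python
# y = A x with A in CSR form: Ap row pointers, Ac column indices, Av values.
def csr_spmv(Ap, Ac, Av, x):
    n = len(Ap) - 1
    y = [0.0] * n
    for i in range(n):
        # the nonzeros of row i live in Av[Ap[i]:Ap[i+1]]
        for k in range(Ap[i], Ap[i + 1]):
            y[i] += Av[k] * x[Ac[k]]
    return y
```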

Page 18

Sparse Matrix Storage Formats

Dense (4×4, row indices down, column indices across):

    1.0  0    0    0
    2.0  3.0  0    0
    0    0    4.0  0
    5.0  0    6.0  7.0

Coordinate (COO):

    Row Index: 1   2   2   3   4   4   4
    Col Index: 1   1   2   3   1   3   4
    Values:    1.0 2.0 3.0 4.0 5.0 6.0 7.0

Compressed Sparse Row (CSR), row-major order:

    Ap: 1   2   4   5   8
    Ac: 1   1   2   3   1   3   4
    Av: 1.0 2.0 3.0 4.0 5.0 6.0 7.0

Compressed Sparse Column (CSC), column-major order:

    Ap: 1   4   5   7   8
    Ar: 1   2   4   2   3   4   4
    Av: 1.0 2.0 5.0 3.0 4.0 6.0 7.0
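The three formats can be related programmatically. This sketch (helper names are ours) builds the CSR and CSC arrays from the COO triplets above, keeping the slide's 1-based indexing:

```python
# Build CSR from COO triplets that are already sorted by row (1-based).
def coo_to_csr(n, rows, cols, vals):
    count = [0] * (n + 1)
    for r in rows:
        count[r] += 1                      # nonzeros per row
    Ap = [1] * (n + 1)
    for i in range(1, n + 1):
        Ap[i] = Ap[i - 1] + count[i]       # prefix sum -> row pointers
    return Ap, list(cols), list(vals)      # Ac, Av keep the row-sorted order

# Build CSC by re-sorting the triplets into column-major order.
def coo_to_csc(n, rows, cols, vals):
    order = sorted(range(len(vals)), key=lambda k: (cols[k], rows[k]))
    count = [0] * (n + 1)
    for c in cols:
        count[c] += 1                      # nonzeros per column
    Ap = [1] * (n + 1)
    for j in range(1, n + 1):
        Ap[j] = Ap[j - 1] + count[j]       # prefix sum -> column pointers
    return Ap, [rows[k] for k in order], [vals[k] for k in order]  # Ap, Ar, Av
```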

Page 19

Partitioning the matrix

y = A x, with A stored as Ap,Ac,Av, is split by row blocks:

  y1 = A1 x, with A1 stored as Ap1,Ac1,Av1
  y2 = A2 x, with A2 stored as Ap2,Ac2,Av2

Building blocks needed:
  • Partition Arrays
  • Insert (at index)
  • Compute prefix sum
  • Broadcast/Collect
  • Numeric Operations
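The row-block split above can be sketched in plain Python (helper names are ours): the CSR arrays are cut at a row boundary, each block multiplies against the full (broadcast) vector x, and concatenating y1 and y2 recovers y. 0-based indexing.

```python
def split_csr(Ap, Ac, Av, row_split):
    # first block holds rows [0, row_split), second block the rest
    s = Ap[row_split]                      # nonzeros in the first block
    Ap1 = Ap[:row_split + 1]
    Ap2 = [p - s for p in Ap[row_split:]]  # rebase pointers for block 2
    return (Ap1, Ac[:s], Av[:s]), (Ap2, Ac[s:], Av[s:])

def spmv(Ap, Ac, Av, x):
    y = [0.0] * (len(Ap) - 1)
    for i in range(len(y)):
        for k in range(Ap[i], Ap[i + 1]):
            y[i] += Av[k] * x[Ac[k]]
    return y
```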

Page 20

numElements (per partition)

def getNumElements(self):
    return self.map(lambda x: 1).reduce(lambda x,y: x+y)

def getNumLocalElements(self):
    return self.mapPartitions(lambda p: [sum(1 for x in p)])

Code Example:

>>> rdd = sc.parallelize([1, 2, 3, 4], 2)
>>> ne = rdd.getNumElements()        # 4, same as count(): 1 + 1 + 1 + 1 = 4
>>> nle = rdd.getNumLocalElements()  # RDD with a single number per partition: [[2], [2]]
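The two operators above can be modeled on plain lists of partitions (this list-of-lists "RDD" is an assumption of our sketch, not Spark API):

```python
from functools import reduce

# An "RDD" modeled as a list of partitions (each a list of elements).
def get_num_elements(partitions):
    # map every element to 1, then reduce with +  (same as count())
    return reduce(lambda x, y: x + y, (1 for p in partitions for _ in p))

def get_num_local_elements(partitions):
    # one count per partition, as mapPartitions produces
    return [[sum(1 for _ in p)] for p in partitions]
```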

Page 21

[find|insert|remove|swap][at]Index

def findIndex(self,e):
    res = self.zipWithIndex().filter(lambda (x,k): x == e)
    # check whether rdd is empty, if not then …
    return res.map(lambda (x,k): k).reduce(min)

Code Example:

>>> res = sc.parallelize([1, 3, 3, 2], 2).findIndex(3)
>>> print(res)
1

(be careful with 0/1 based indexing)

elements: 1 3 3 2
indices:  0 1 2 3
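A list-based model of the pipeline above (our sketch): zip with global indices, filter on the element, and take the minimum surviving index, returning None for the empty case the comment mentions:

```python
def find_index(xs, e):
    # zipWithIndex -> filter -> min over the surviving indices (0-based)
    matches = [k for k, x in enumerate(xs) if x == e]
    if not matches:        # the "rdd is empty" check from the slide
        return None
    return min(matches)
```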

Page 22

Also, need local versions

def findLocalIndex(self,e):
    res = self.zipWithLocalIndex().filter(lambda (x,k): x == e)
    # check whether rdd is empty, if not then …
    return res.mapPartitions(find_min_in_a_list)

Code Example:

>>> res = sc.parallelize([1, 3, 3, 2], 2).findLocalIndex(3)
>>> print(res.glom().collect())
[[1], [0]]

(be careful with 0/1 based indexing)

elements:      1 3 3 2
local indices: 0 1 0 1

Page 23

Prefix Sum (by Key)

keys:   1 2 2 3 4 4 4
values: 1 1 1 1 1 1 1

count (reduce by key):
  keys:   1 2 3 4
  counts: 1 2 1 3

add (prefix sum):
  keys: 1 2 3 4
  sums: 1 3 4 7

This can be used to convert from COO to CSR format:

  Ap: 1 2 4 5 8   (+1 optional, based on 0/1 based indexing)
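The count-then-scan recipe above in plain Python (1-based row keys; the helper name is ours):

```python
def coo_rows_to_csr_ptr(row_indices, n):
    counts = [0] * (n + 1)
    for r in row_indices:            # count step: entries per row key 1..n
        counts[r] += 1
    Ap = [1] * (n + 1)
    for i in range(1, n + 1):        # add step: running prefix sum
        Ap[i] = Ap[i - 1] + counts[i]
    return Ap                        # CSR row pointers, 1-based
```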

Page 24

Prefix Sum

def prefixSum(self): #compute prefix sum by shifting and filtering keys
    rdd = self.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y).sortBy(lambda (k,x): k)
    n = rdd.getNumElements(); offset = next_pow2(n)

Figure: the input is a 1 per element (1 1 1 1 1 1 1), keys are colors;
counts per key after the first line: 1 2 1 3; final result we expect: 1 3 4 7

Page 25

Prefix Sum

def prefixSum(self): #compute prefix sum by shifting and filtering keys
    rdd = self.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y).sortBy(lambda (k,x): k)
    n = rdd.getNumElements(); offset = next_pow2(n)
    while offset > 0:
        set1 = rdd.map(lambda t: t)
        set2 = rdd.map(lambda (k,x): (k+offset,x)).filter(lambda (k,x): k<(n+1))
        rdd = set1.union(set2).reduceByKey(lambda x,y: x+y).sortBy(lambda (k,x): k)
        offset = int(offset/2)
    return rdd

offset=2:  1 2 1 3  →  1 2 2 5

Page 26

Prefix Sum

(same code as on the previous slide)

offset=1:  1 2 2 5  →  1 3 4 7

Page 27

Prefix Sum

(same code as on the previous slides)

Code Example (we can similarly have a local variant):

>>> rdd = sc.parallelize([1,2,3,4,4,2,4], 2)
>>> rdd.prefixSum()
[(1,1), (2,3), (3,4), (4,7)]
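The shift-and-filter loop can be checked with a plain-Python emulation (our sketch: a dict stands in for the keyed RDD, and keys are assumed to be the contiguous range 1..n, as in the example):

```python
def next_pow2(n):
    p = 1
    while p < n:
        p *= 2
    return p

def prefix_sum(values):
    # count step: (key, 1) pairs reduced by key, as in prefixSum's first line
    kv = {}
    for v in values:
        kv[v] = kv.get(v, 0) + 1
    n = len(kv)                      # assumes keys are exactly 1..n
    offset = next_pow2(n)
    while offset > 0:                # Hillis-Steele style scan, ~log2(n) passes
        shifted = {k + offset: x for k, x in kv.items() if k + offset < n + 1}
        for k, x in shifted.items():     # union + reduceByKey
            kv[k] += x
        offset //= 2
    return sorted(kv.items())
```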

Page 28

numOps[Mixed]

def numOpsMixed(self, other, func): # ASSUMPTION: number of partitions is the same
    rdd = self.zipPartitions(other) #creates an rdd whose elements are partitions
    def apply_func((p,q)):
        for y in q:
            for x in p:
                yield func(x,y)
    res = rdd.flatMap(apply_func)
    return res

Example (func = add):
  [1 2] [1 2]  +  [10] [20]  =  [11 12] [21 22]
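The same operator modeled on lists of partitions (our sketch; equal partition counts are assumed, as the code's ASSUMPTION notes):

```python
def num_ops_mixed(parts_a, parts_b, func):
    out = []
    for p, q in zip(parts_a, parts_b):   # models zipPartitions
        part = []
        for y in q:                      # same loop nest as apply_func
            for x in p:
                part.append(func(x, y))
        out.append(part)
    return out
```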

Page 29

AllToAll

def allToAll(self, np, partitionFunc):
    #define add_partition_index_to_each_element and use it below …
    rdd = self.mapPartitionsWithIndex(add_partition_index_to_each_element)
    def expand_p_index(x):
        for k in range(np):
            yield (k,x)
    res = rdd.flatMap(expand_p_index).partitionBy(np, partitionFunc).map(lambda (k,x): x)
    return res.sortLocalByKey().map(lambda (k,x): x)

Figure: each element is tagged with its source partition index, replicated to all np
target partitions, then sorted locally by source partition index.
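A list-of-partitions model of allToAll (our sketch): each element is tagged with its source partition, replicated once per target key, routed by partitionFunc, and finally sorted locally by source partition:

```python
def all_to_all(partitions, np_, partition_func):
    targets = [[] for _ in range(np_)]
    for p_idx, part in enumerate(partitions):     # tag with source partition index
        for x in part:
            for k in range(np_):                  # expand_p_index: one copy per key
                targets[partition_func(k)].append((p_idx, x))
    # sortLocalByKey on the source partition index, then strip the key
    return [[x for _, x in sorted(t, key=lambda kx: kx[0])] for t in targets]
```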

Page 30

Partitioning the matrix

y = A x, with A stored as Ap,Ac,Av, is split by row blocks:

  y1 = A1 x, with A1 stored as Ap1,Ac1,Av1
  y2 = A2 x, with A2 stored as Ap2,Ac2,Av2

Building blocks needed:
  • Partition Arrays
  • Insert (at index)
  • Compute prefix sum
  • Broadcast/Collect
  • Numeric Operations

Page 31

Discussion with Audience

Page 32

Algorithms and Challenges

What algorithms would you like to implement? PCA (SVD), SVM, ALS, K-Means, …

Are you interested in machine learning (other than deep learning)?

How is Python/Scala/Java used? What code/problems are interesting?

What is your vision for how Spark should be aware of GPU resources, in conjunction with a resource manager (such as Mesos)?

What challenges do you have in using GPUs? Performance/Power/$, Memory Layout (JVM vs. C/C++), …

Page 33

Backup Slides

Page 34

PageRank (from a Linear Algebra Perspective)

• Let C ∈ R^{n×n} be a scaled adjacency matrix (with row sums = 1), let the vector b ∈ {0,1}^n have a 1 in place of dangling nodes (indices of empty rows), and let the vector u = (1/n)e, where e = [1,…,1]^T.

• Find the largest eigenpair (in which the eigenvector = the PageRank vector) of

    Ax = λx,  where  A = α(C + bu^T) + (1−α)(ue^T)

• The simplest approach: the Power method.
  Key operation: sparse matrix-vector multiplication.
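A minimal power-method sketch of this construction (our code; 0-based CSR holds the row-stochastic C). Since the PageRank vector is the dominant left eigenvector of A, the iteration applies A^T, which splits into the link term, the dangling-node term, and the teleport term:

```python
def pagerank(Ap, Ac, Av, n, alpha=0.85, iters=100):
    x = [1.0 / n] * n
    dangling = [i for i in range(n) if Ap[i] == Ap[i + 1]]  # empty rows: b
    for _ in range(iters):
        y = [(1.0 - alpha) / n] * n          # (1 - alpha) u e^T term
        d = sum(x[i] for i in dangling)      # mass sitting on dangling nodes
        for j in range(n):
            y[j] += alpha * d / n            # alpha b u^T term
        for i in range(n):                   # alpha C term: sparse x^T C
            for k in range(Ap[i], Ap[i + 1]):
                y[Ac[k]] += alpha * Av[k] * x[i]
        x = y
    return x
```

For example, for the 3-node graph 0→{1,2}, 1→{2}, 2→{0}, node 2 ends up ranked highest.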