Exploiting GPUs in Spark

Page 1: Exploiting GPUs in Spark

Kazuaki Ishizaki, IBM Research – Tokyo (IBM Japan, Ltd., Tokyo Research Laboratory)

Exploiting GPUs in Spark


Page 2: Exploiting GPUs in Spark

Who am I? Kazuaki Ishizaki

Research staff member at IBM Research – Tokyo – http://ibm.co/kiszk

Research interests – compiler optimizations, language runtime, and parallel processing

Worked on the Java virtual machine and just-in-time compiler for over 20 years – from JDK 1.0 to Java SE 8

Twitter: @kiszk

Slideshare: http://www.slideshare.net/ishizaki

Github: https://github.com/kiszk


Page 3: Exploiting GPUs in Spark

Agenda

Motivation & Goal

Introduction of GPUs

Design & New Components
– Binary columnar
– GPU enabler

Current Implementation

Performance Experiment
– Achieved a 3.15x speedup of naïve logistic regression by using a GPU

Future Direction in Spark 2.0 and beyond
– with Dataset (introduced in Spark 1.6)

Conclusion

Page 4: Exploiting GPUs in Spark

Want to Accelerate Computation-heavy Applications

Motivation

– Want to shorten the execution time of long-running Spark applications, which may be computation-heavy, shuffle-heavy, or I/O-heavy

Goal
– Accelerate computation-heavy Spark applications

According to Reynold’s talk (p. 21), the CPU will become the bottleneck in Spark


Page 5: Exploiting GPUs in Spark

Accelerate a Spark Application with GPUs

Approach

– Accelerate a Spark application by using GPUs effectively and transparently
– Exploit the high performance of GPUs
– Do not ask users to change their Spark programs

New components
– Binary columnar
– GPU enabler


Page 6: Exploiting GPUs in Spark

Motivation & Goal

Introduction of GPUs

Design & New Components

Current Implementation

Performance Experiment

Future Direction in Spark 2.0 and beyond

Conclusion

Page 7: Exploiting GPUs in Spark

GPU Programming Model

Five steps

1. Allocate GPU device memory
2. Copy data from CPU main memory to GPU device memory
3. Launch a GPU kernel to be executed in parallel on the cores
4. Copy data back from GPU device memory to CPU main memory
5. Free GPU device memory

Usually, a programmer has to write these steps in CUDA or OpenCL


[Figure: a CPU (a dozen cores per socket, main memory up to 1 TB/socket) connected over PCIe to a GPU (thousands of cores, device memory up to 12 GB); data is copied over PCIe.]
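For concreteness, here is a minimal host-side sketch of the five steps in Scala, using the JCuda driver bindings (one way to drive CUDA from the JVM). The PTX file kernel.ptx and the kernel name increment are hypothetical:

// A minimal sketch of the five steps from the JVM, assuming JCuda is on the classpath
// and a PTX file "kernel.ptx" containing a kernel "increment(int *data, int n)"
import jcuda.{Pointer, Sizeof}
import jcuda.driver.JCudaDriver._
import jcuda.driver.{CUcontext, CUdevice, CUdeviceptr, CUfunction, CUmodule}

val host = Array(1, 2, 3, 4)
cuInit(0)
val dev = new CUdevice(); cuDeviceGet(dev, 0)
val ctx = new CUcontext(); cuCtxCreate(ctx, 0, dev)
val mod = new CUmodule(); cuModuleLoad(mod, "kernel.ptx")
val fn  = new CUfunction(); cuModuleGetFunction(fn, mod, "increment")

val dptr = new CUdeviceptr()
cuMemAlloc(dptr, host.length * Sizeof.INT)                          // 1. allocate device memory
cuMemcpyHtoD(dptr, Pointer.to(host), host.length * Sizeof.INT)      // 2. copy CPU -> GPU
val args = Pointer.to(Pointer.to(dptr), Pointer.to(Array(host.length)))
cuLaunchKernel(fn, 1, 1, 1, host.length, 1, 1, 0, null, args, null) // 3. launch the kernel
cuCtxSynchronize()
cuMemcpyDtoH(Pointer.to(host), dptr, host.length * Sizeof.INT)      // 4. copy GPU -> CPU
cuMemFree(dptr)                                                     // 5. free device memory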

Page 8: Exploiting GPUs in Spark

How We Can Run a Program Faster on a GPU

Assign many parallel computations to the cores

Make memory accesses coalesced
– An example is shown in the figure below

– A column-oriented layout achieves better performance: this paper reports about a 3x performance improvement in GPU kernel execution for kmeans over a row-oriented layout

[Figure: for Pt(x: Int, y: Int), a column-oriented layout stores x1 x2 x3 x4 y1 y2 y3 y4, so four cores loading four Pt.x values (and then four Pt.y values) touch consecutive addresses: 2 memory accesses to GPU device memory in total. A row-oriented layout stores x1 y1 x2 y2 x3 y3 x4 y4, so the same loads take 4 memory accesses. Assumption: 4 consecutive data elements can be coalesced by the GPU hardware.]
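To make the two layouts concrete, here is a small Scala sketch (illustrative only) of the same Pt data in row-oriented and column-oriented form. In the columnar form, GPU thread i reading xs(i) touches an address adjacent to those of its neighbor threads, which is exactly what the hardware can coalesce:

case class Pt(x: Int, y: Int)
val pts = Array(Pt(1, 4), Pt(2, 5), Pt(3, 6), Pt(4, 7))

// Row-oriented: fields interleaved as x1 y1 x2 y2 ...
val row = pts.flatMap(p => Array(p.x, p.y))  // 1, 4, 2, 5, 3, 6, 4, 7

// Column-oriented: one contiguous array per field
val xs = pts.map(_.x)                        // 1, 2, 3, 4
val ys = pts.map(_.y)                        // 4, 5, 6, 7

// Threads 0..3 reading xs(0)..xs(3) hit 4 consecutive ints -> 1 coalesced access;
// with the row layout the same reads are strided by 2 ints -> more transactions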

Page 9: Exploiting GPUs in Spark

Motivation & Goal

Introduction of GPUs

Design & New Components

Current Implementation

Performance Experiment

Future Direction in Spark 2.0 and beyond

Conclusion

Page 10: Exploiting GPUs in Spark

Design of GPU Exploitation

Efficient
– Reduce data-copy overhead between CPU and GPU
– Make memory accesses efficient on the GPU

Transparent
– Map parallelism in a program into GPU native code

User’s Spark Program (Scala)

case class Pt(x: Int, y: Int)
rdd1 = sc.parallelize(Array(
  Pt(1, 4), Pt(2, 5),
  Pt(3, 6), Pt(4, 7),
  Pt(5, 8), Pt(6, 9)), 3)
rdd2 = rdd1.map(p => Pt(p.x*2, p.y-1))
cnt  = rdd2.map(p => p.x).reduce((x1, x2) => x1 + x2)

[Figure: the program is translated into GPU native code. rdd1's three blocks are held as x and y columns in binary columnar format off-heap, transferred to GPU device memory, processed by the GPU kernel (x*2, y-1) into rdd2's columns, and transferred back. The GPU can exploit parallelism both among blocks in an RDD and within a block of an RDD.]

Page 11: Exploiting GPUs in Spark

What Does Binary Columnar Do?

Keeps data in a binary representation (not the Java object representation)

Keeps data in a column-oriented layout

Keeps data off-heap or in GPU device memory


Example

case class Pt(x: Int, y: Int)
Array(Pt(1, 4), Pt(2, 5))

[Figure: the columnar (column-oriented) layout stores 1 2 4 5 off-heap; the row-oriented layout stores 1 4 2 5 off-heap.]
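A minimal sketch of the idea in Scala (illustrative only, not the project's actual storage classes), using a direct ByteBuffer as the off-heap backing store with one contiguous region per column:

import java.nio.ByteBuffer

case class Pt(x: Int, y: Int)
val pts = Array(Pt(1, 4), Pt(2, 5))

// One direct (off-heap) buffer: the x column first, then the y column
val buf = ByteBuffer.allocateDirect(pts.length * 2 * 4)
pts.foreach(p => buf.putInt(p.x))  // x column: 1, 2
pts.foreach(p => buf.putInt(p.y))  // y column: 4, 5

// Element i can be read back without materializing a Java object
def x(i: Int): Int = buf.getInt(i * 4)
def y(i: Int): Int = buf.getInt((pts.length + i) * 4)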

Page 12: Exploiting GPUs in Spark

Current RDD as Java objects on Java heap


case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))

[Figure: on the Java heap, each Pt is a Java object with an object header for the Java virtual machine, holding (1, 4) and (2, 5). The current RDD: row-oriented layout, Java object representation, on the Java heap.]

Page 13: Exploiting GPUs in Spark

Binary Columnar RDD on off-heap


case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))

[Figure: the current RDD keeps Pt objects with object headers on the Java heap (row-oriented layout, Java object representation); the binary columnar RDD keeps the same data off-heap as 1 2 4 5 (column-oriented layout, binary representation).]

Page 14: Exploiting GPUs in Spark


Long Path from Current RDD to GPU

Three steps to send data from an RDD to the GPU
1. Java objects to a column-oriented binary representation on the Java heap
– from a Java object to a binary representation
– from a row-oriented format to a columnar one
2. Binary representation on the Java heap to binary columnar off-heap
– garbage collection may move objects on the Java heap during GPU-related operations, so the data must live off-heap before the copy
3. Off-heap to GPU device memory


case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
rdd.map(…).reduce(…) // execute on GPU

[Figure: Pt objects on the Java heap → ByteBuffers on the Java heap (step 1) → off-heap (step 2) → GPU device memory (step 3).]

This thread on the dev mailing list also discusses the overhead of copying data between an RDD and the GPU

Page 15: Exploiting GPUs in Spark

Short Path from Binary Columnar RDD to GPU

An RDD in binary columnar format can simply be copied to GPU device memory


case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
rdd.map(…).reduce(…) // execute on GPU

[Figure: the Java-heap stages are eliminated; the off-heap columnar data (1 2 4 5) is copied directly to GPU device memory.]
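Because a direct ByteBuffer already holds the final byte layout off-heap, the transfer can be one bulk copy. A hedged sketch of that single step with the JCuda driver API, reusing buf from the binary columnar sketch above:

import jcuda.Pointer
import jcuda.driver.JCudaDriver._
import jcuda.driver.CUdeviceptr
import java.nio.ByteBuffer

// buf: a direct ByteBuffer that already holds the columnar bytes
def copyColumnarToGpu(buf: ByteBuffer): CUdeviceptr = {
  val dptr = new CUdeviceptr()
  cuMemAlloc(dptr, buf.capacity())
  // No per-object traversal and no heap-side re-encoding: one bulk copy over PCIe
  cuMemcpyHtoD(dptr, Pointer.toBuffer(buf), buf.capacity())
  dptr
}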

Page 16: Exploiting GPUs in Spark

Can Execute map() in Parallel Using Binary Columnar

Adjacent elements in a binary columnar RDD can be accessed in parallel

The same type of operation (* or -) can be executed in parallel on the data that is loaded in parallel


case class Pt(x: Int, y: Int)
rdd  = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
rdd1 = rdd.map(p => Pt(p.x*2, p.y-1))

[Figure: with the current RDD on the Java heap, the four field values are loaded one after another (access order 1, 2, 3, 4); with the binary columnar RDD off-heap, both x values are loaded in parallel and then both y values (access order 1, 1, 2, 2).]
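As a CPU-side illustration (not the project's GPU code path), mapping over separate column arrays turns the lambda into two independent bulk loops, each trivially parallel:

// Column arrays for Array(Pt(1, 4), Pt(2, 5))
val xs = Array(1, 2)
val ys = Array(4, 5)

// p => Pt(p.x*2, p.y-1) becomes two independent columnar loops;
// every element of a column can be handled by a different core or thread
val outX = new Array[Int](xs.length)
val outY = new Array[Int](ys.length)
for (i <- xs.indices.par) outX(i) = xs(i) * 2  // all *2 operations in parallel
for (i <- ys.indices.par) outY(i) = ys(i) - 1  // all -1 operations in parallel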

Page 17: Exploiting GPUs in Spark

Advantages of Binary Columnar

Can exploit the high performance of GPUs

Can reduce the overhead of data copies between CPU and GPU

Consumes a smaller memory footprint

Can directly compute on data from Apache Parquet, which is also stored in a columnar format

Can exploit SIMD instructions on the CPU


Page 18: Exploiting GPUs in Spark

What Does the GPU Enabler Do?

Copies data in a binary columnar RDD between CPU main memory and GPU device memory

Launches GPU kernels

Caches GPU native code for kernels

Generates GPU native code from the transformations and actions in a program
– We already productized the IBM Java just-in-time compiler that generates GPU native code from a lambda expression in Java 8


Page 19: Exploiting GPUs in Spark

Motivation & Goal

Introduction of GPUs

Design & New Components

Current Implementation

Performance Experiment

Future Direction in Spark 2.0 and beyond

Conclusion

Page 20: Exploiting GPUs in Spark

Software Stack in the Current Spark 2.0-SNAPSHOT

An RDD keeps data on the Java heap


[Figure: the user's Spark program uses the RDD API; RDD data lives on the Java heap.]

Page 21: Exploiting GPUs in Spark

Software Stack of GPU Exploitation

The current RDD and the binary columnar RDD co-exist


[Figure: the user's Spark program uses the RDD API; RDD data stays on the Java heap, while the GPU enabler manages columnar data both off-heap and in GPU device memory.]

Page 22: Exploiting GPUs in Spark

Current Implementation of Binary Columnar

Works with RDDs

Converts a current RDD to a binary columnar RDD and vice versa, as sketched below
– Our current implementation eliminates the conversion overhead between CPU and GPU within a task
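For illustration, the conversion is driven through convert(), as in the example on Page 23. ColumnFormat appears in the project's examples; RowFormat below is a hypothetical name used here only to show the round trip:

rdd  = sc.parallelize(1 to n, 2)
cols = rdd.convert(ColumnFormat)  // current RDD -> binary columnar RDD
rows = cols.convert(RowFormat)    // binary columnar RDD -> current RDD (hypothetical name)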


Page 23: Exploiting GPUs in Spark

Current Implementation of the GPU Enabler

Executes user-provided GPU kernels from map()/reduce() functions
– GPU memory management and data copies are handled automatically

Generates GPU native code for simple map()/reduce() methods
– Enabled by "spark.gpu.codegen=true" in spark-defaults.conf


rdd1 = sc.parallelize(1 to n, 2).convert(ColumnFormat) // rdd1 uses a binary columnar RDD
sum  = rdd1.map(i => i * 2)
           .reduce((x, y) => (x + y))

// CUDA
__global__ void sample_map(int *inX, int *inY, int *outX, int *outY, long size) {
  long ix = threadIdx.x + blockIdx.x * blockDim.x;
  if (size <= ix) return;
  outX[ix] = inX[ix] * 2;
  outY[ix] = inY[ix] - 1;
}

// Spark
mapFunction = new CUDAFunction("sample_map",      // CUDA kernel name
  Array("this.x", "this.y"),                      // the input object has two fields
  Array("this.x", "this.y"),                      // the output object has two fields
  this.getClass.getResource("/sample.ptx"))       // the ptx is generated by the CUDA compiler
rdd1 = sc.parallelize(…).convert(ColumnFormat)    // rdd1 uses a binary columnar RDD
rdd2 = rdd1.mapExtFunc(p => Pt(p.x*2, p.y-1), mapFunction)

Page 24: Exploiting GPUs in Spark

How to Use the GPU Exploitation Version

Easy to install with a one-liner and to run with a one-liner
– on x86_64, mac, and ppc64le, with CUDA 7.0 or later, and with any JVM such as IBM JDK or OpenJDK

A run script for AWS EC2 is available, which supports spot instances

$ wget https://s3.amazonaws.com/spark-gpu-public/spark-gpu-latest-bin-hadoop2.4.tgz && \
  tar xf spark-gpu-latest-bin-hadoop2.4.tgz && cd spark-gpu

$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 MASTER='local[2]' ./bin/run-example SparkGPULR 8 3200 32 5
…
numSlices=8, N=3200, D=32, ITERATIONS=5
On iteration 1
On iteration 2
On iteration 3
On iteration 4
On iteration 5
Elapsed time: 431 ms
$

Available at http://kiszk.github.io/spark-gpu/
• 3 contributors
• Private communications with other developers

Page 25: Exploiting GPUs in Spark

Achieved a 3.15x Performance Improvement with a GPU

Ran a naïve implementation of logistic regression

Achieved a 3.15x performance improvement for logistic regression over running without a GPU, on a 16-core IvyBridge box with an NVIDIA K40 GPU card
– There is still room to improve performance


Details are available at https://github.com/kiszk/spark-gpu/wiki/Benchmark

Program parameters: N=1,000,000 (# of points), D=400 (# of features), ITERATIONS=5, Slices=128 (without GPU) or 16 (with GPU), MASTER=local[8] (both without and with GPU)

Hardware and software: nx360 M4, 2 sockets of 8-core Intel Xeon E5-2667 3.3 GHz, 256 GB memory, one NVIDIA K40m card; OS: RedHat 6.6; CUDA: 7.0

Page 26: Exploiting GPUs in Spark

Motivation & Goal

Introduction of GPUs

Design & New Components

Current Implementation

Performance Experiment

Future Direction in Spark 2.0 and beyond

Conclusion

Page 27: Exploiting GPUs in Spark

Comparisons among DataFrame, Dataset, and RDD

DataFrame (with relational operations) and Dataset (with lambda functions) use Catalyst and a row-oriented data representation on off-heap


case class Pt(x: Int, y: Int)
d = Array(Pt(1, 4), Pt(2, 5))

Frontend API
  DataFrame (v1.3-): df = d.toDF(…); df.filter("x > 1").count()
  Dataset (v1.6-):   ds = d.toDS(); ds.filter(p => p.x > 1).count()
  RDD (v0.5-):       rdd = sc.parallelize(d); rdd.filter(p => p.x > 1).count()

Data
  DataFrame, Dataset: row-oriented (1 4 2 5) on off-heap
  RDD: Java objects (Pt(1, 4), Pt(2, 5)) on the Java heap

Backend computation
  DataFrame, Dataset: Java bytecode generated by Catalyst
  RDD: Java bytecode in the Spark program and runtime

Page 28: Exploiting GPUs in Spark

Design Concepts of Dataset and GPU Exploitation

Keep data in a binary representation

Keep data off-heap

Take advantage of the Catalyst optimizer


Comparison of data representations

[Figure: a Dataset, ds = (Pt(1, 4), Pt(2, 5)).toDS(), keeps row-oriented binary data (1 4 2 5) off-heap; a binary columnar RDD, sc.parallelize(Array(Pt(1, 4), Pt(2, 5))), keeps columnar binary data (1 2 4 5) off-heap. Binary columnar already follows the first two design concepts, and the GPU enabler could take advantage of Catalyst.]

How can we apply binary columnar and the GPU enabler to Dataset? One possible shape is sketched below.
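As a hypothetical extension (mapExtFunc exists in the project's RDD API; the Dataset method below does not exist and only illustrates the direction):

// Hypothetical: a Dataset counterpart of the RDD-level mapExtFunc
val ds  = Seq(Pt(1, 4), Pt(2, 5)).toDS()
val ds2 = ds.mapExtFunc(p => Pt(p.x * 2, p.y - 1), mapFunction)
// Catalyst could then plan the copy to GPU memory and the kernel launch
// like any other physical operator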

Page 29: Exploiting GPUs in Spark

Components in GPU Exploitation

Binary columnar
– Columnar
  In-memory storage keeps data in a binary representation off-heap or in GPU memory
  BinaryEncoder converts a data representation between a Java object and the binary format
  ColumnEncoder lays out a set of data elements in column-oriented order
– Memory manager
  Manages off-heap and GPU memory
  The columnar cache manages the persistency of the in-memory storage

GPU enabler
– GPU kernel launcher
  Launches kernels together with the data copies
  Caches GPU binaries for kernels
– GPU code generator
  Generates GPU code from a Spark program


[Figure: component diagram — the in-memory storage (binary encoder, column encoder, columnar cache), the GPU code generator with pre-compiled libraries for the GPU, and the memory manager over columnar data in GPU memory and off-heap memory.]

Page 30: Exploiting GPUs in Spark

Software Stack in Spark 2.0 and Beyond

Dataset will become the primary data structure for computation

A Dataset keeps data in UnsafeRow on off-heap


[Figure: the user's Spark program uses DataFrame and Dataset; below them sit Catalyst (logical optimizer, CPU code generator) and Tungsten, with UnsafeRow data on off-heap.]

Page 31: Exploiting GPUs in Spark

Columnar with Dataset

Keep data in UnsafeRow or Columnar on off-heap, or in Columnar on GPU device memory


[Figure: the same stack, extended with a memory manager; data can live as UnsafeRow or Columnar on off-heap, or as Columnar in GPU device memory.]

Page 32: Exploiting GPUs in Spark

Two Approaches for Binary Columnar with Dataset

Binary columnar as a first-class citizen
– Better end-to-end performance in a job, without conversions
– Needs more changes to the existing source code

Binary columnar as a cache in a task
– Incurs the overhead of representation conversions between two tasks at a shuffle
– Needs fewer changes to the existing source code


[Figure: a job ds1 = d.toDS(); ds2 = ds1.map(…); ds3 = ds2.map(…); ds11 = ds3.groupBy(…); ds12 = ds11.map(…) is split into task1 and task2 at a shuffle. As a first-class citizen, columnar data spans the whole job; as a cache, columnar data is used within each task and converted at the shuffle boundary.]

Page 33: Exploiting GPUs in Spark

GPU Support in Tungsten

According to Reynold’s talk (p. 25), the Tungsten backend has a plan to enable GPU exploitation


Page 34: Exploiting GPUs in Spark

GPU Enabler in Catalyst

Place the GPU kernel launcher and the GPU code generator into Catalyst


[Figure: the stack from Page 31 with the GPU kernel launcher and GPU code generator added inside Catalyst, alongside the logical optimizer and CPU code generator.]

Page 35: Exploiting GPUs in Spark

Future Direction

Refactor the current implementation to make it decomposable
– Some components currently live in one Scala file

Make pull requests for each component
– to support columnar Dataset
– to exploit GPUs


[Figure: roadmap for pull requests — in-memory storage (binary encoder, column encoder, columnar cache), memory manager and cache manager over off-heap and GPU memory, a CPU code generator for Columnar first as a cache in a task and then as a first-class citizen with multiple backend support, and the GPU kernel launcher and GPU code generator inside Catalyst.]

Page 36: Exploiting GPUs in Spark

Takeaway

Accelerate a Spark application by using GPUs effectively and transparently

Devised two new components
– Binary columnar, to alleviate the overhead of GPU exploitation
– GPU enabler, to manage GPU kernel execution from a Spark program; it calls pre-compiled libraries for the GPU and generates GPU native code at runtime

Available at http://kiszk.github.io/spark-gpu/


Component       | Initial design (Spark 1.3-1.5)               | Current status (Spark 2.0-SNAPSHOT)          | Future (Spark 2.x)
Binary columnar | with RDD                                     | with RDD                                     | with Dataset
GPU enabler     | launch GPU kernels, generate GPU native code | launch GPU kernels, generate GPU native code | in Catalyst


We appreciate any feedback and contributions