Exploiting GPUs in Spark

Page 1: Exploiting GPUs in Spark

Kazuaki Ishizaki, IBM Research – Tokyo (IBM Japan, Ltd., Tokyo Research Laboratory)

Exploiting GPUs in Spark


Page 2: Exploiting GPUs in Spark

Who am I? Kazuaki Ishizaki

Research staff member at IBM Research – Tokyo – http://ibm.co/kiszk

Research interests – compiler optimizations, language runtime, and parallel processing

Worked on the Java virtual machine and just-in-time compiler for over 20 years – from JDK 1.0 to Java SE 8

Twitter: @kiszk

Slideshare: http://www.slideshare.net/ishizaki

Github: https://github.com/kiszk


Page 3: Exploiting GPUs in Spark

Agenda

Motivation & Goal

Introduction of GPUs

Design & New Components
– Binary columnar
– GPU enabler

Current Implementation

Performance Experiment
– Achieved a 3.15x speedup of naïve logistic regression by using a GPU

Future Direction in Spark 2.0 and beyond
– with Dataset (introduced in Spark 1.6)

Conclusion

Page 4: Exploiting GPUs in Spark

Want to Accelerate Computation-heavy Applications

Motivation

– Want to shorten the execution time of long-running Spark applications, which may be computation-heavy, shuffle-heavy, or I/O-heavy

Goal
– Accelerate computation-heavy Spark applications

According to Reynold’s talk (p. 21), the CPU will become the bottleneck in Spark


Page 5: Exploiting GPUs in Spark

Accelerate a Spark Application with GPUs

Approach

– Accelerate a Spark application by using GPUs effectively and transparently
– Exploit the high performance of GPUs
– Do not ask users to change their Spark programs

New components
– Binary columnar
– GPU enabler


Page 6: Exploiting GPUs in Spark

Motivation & Goal

Introduction of GPUs

Design & New Components

Current Implementation

Performance Experiment

Future Direction in Spark 2.0 and beyond

Conclusion

Page 7: Exploiting GPUs in Spark

GPU Programming Model

Five steps

1. Allocate GPU device memory
2. Copy data from CPU main memory to GPU device memory
3. Launch a GPU kernel to be executed in parallel on the cores
4. Copy data back from GPU device memory to CPU main memory
5. Free GPU device memory

Usually, a programmer has to write these steps in CUDA or OpenCL


[Figure: a CPU (a dozen cores per socket, main memory up to 1 TB/socket) connected over PCIe to a GPU (thousands of cores, device memory up to 12 GB); data is copied over PCIe.]
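For concreteness, here is a minimal host-side sketch of the five steps in Scala, using the JCuda driver bindings (one way to drive CUDA from the JVM). The PTX file kernel.ptx and the kernel name increment are hypothetical:

// A minimal sketch of the five steps from the JVM, assuming JCuda is on the classpath
// and a PTX file "kernel.ptx" containing a kernel "increment(int *data, int n)"
import jcuda.{Pointer, Sizeof}
import jcuda.driver.JCudaDriver._
import jcuda.driver.{CUcontext, CUdevice, CUdeviceptr, CUfunction, CUmodule}

val host = Array(1, 2, 3, 4)
cuInit(0)
val dev = new CUdevice(); cuDeviceGet(dev, 0)
val ctx = new CUcontext(); cuCtxCreate(ctx, 0, dev)
val mod = new CUmodule(); cuModuleLoad(mod, "kernel.ptx")
val fn  = new CUfunction(); cuModuleGetFunction(fn, mod, "increment")

val dptr = new CUdeviceptr()
cuMemAlloc(dptr, host.length * Sizeof.INT)                          // 1. allocate device memory
cuMemcpyHtoD(dptr, Pointer.to(host), host.length * Sizeof.INT)      // 2. copy CPU -> GPU
val args = Pointer.to(Pointer.to(dptr), Pointer.to(Array(host.length)))
cuLaunchKernel(fn, 1, 1, 1, host.length, 1, 1, 0, null, args, null) // 3. launch the kernel
cuCtxSynchronize()
cuMemcpyDtoH(Pointer.to(host), dptr, host.length * Sizeof.INT)      // 4. copy GPU -> CPU
cuMemFree(dptr)                                                     // 5. free device memory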

Page 8: Exploiting GPUs in Spark

How We Can Run a Program Faster on a GPU

Assign many parallel computations to the cores

Make memory accesses coalesced
– An example is shown in the figure below

– A column-oriented layout achieves better performance: this paper reports about a 3x performance improvement in GPU kernel execution for kmeans over a row-oriented layout

[Figure: for Pt(x: Int, y: Int), a column-oriented layout stores x1 x2 x3 x4 y1 y2 y3 y4, so four cores loading four Pt.x values (and then four Pt.y values) touch consecutive addresses: 2 memory accesses to GPU device memory in total. A row-oriented layout stores x1 y1 x2 y2 x3 y3 x4 y4, so the same loads take 4 memory accesses. Assumption: 4 consecutive data elements can be coalesced by the GPU hardware.]
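To make the two layouts concrete, here is a small Scala sketch (illustrative only) of the same Pt data in row-oriented and column-oriented form. In the columnar form, GPU thread i reading xs(i) touches an address adjacent to those of its neighbor threads, which is exactly what the hardware can coalesce:

case class Pt(x: Int, y: Int)
val pts = Array(Pt(1, 4), Pt(2, 5), Pt(3, 6), Pt(4, 7))

// Row-oriented: fields interleaved as x1 y1 x2 y2 ...
val row = pts.flatMap(p => Array(p.x, p.y))  // 1, 4, 2, 5, 3, 6, 4, 7

// Column-oriented: one contiguous array per field
val xs = pts.map(_.x)                        // 1, 2, 3, 4
val ys = pts.map(_.y)                        // 4, 5, 6, 7

// Threads 0..3 reading xs(0)..xs(3) hit 4 consecutive ints -> 1 coalesced access;
// with the row layout the same reads are strided by 2 ints -> more transactions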

Page 9: Exploiting GPUs in Spark

Motivation & Goal

Introduction of GPUs

Design & New Components

Current Implementation

Performance Experiment

Future Direction in Spark 2.0 and beyond

Conclusion

Page 10: Exploiting GPUs in Spark

Design of GPU Exploitation

Efficient
– Reduce data-copy overhead between CPU and GPU
– Make memory accesses efficient on the GPU

Transparent
– Map parallelism in a program into GPU native code

User’s Spark Program (Scala)

case class Pt(x: Int, y: Int)
rdd1 = sc.parallelize(Array(
  Pt(1, 4), Pt(2, 5),
  Pt(3, 6), Pt(4, 7),
  Pt(5, 8), Pt(6, 9)), 3)
rdd2 = rdd1.map(p => Pt(p.x*2, p.y-1))
cnt  = rdd2.map(p => p.x).reduce((x1, x2) => x1 + x2)

[Figure: the program is translated into GPU native code. rdd1's three blocks are held as x and y columns in binary columnar format off-heap, transferred to GPU device memory, processed by the GPU kernel (x*2, y-1) into rdd2's columns, and transferred back. The GPU can exploit parallelism both among blocks in an RDD and within a block of an RDD.]

Page 11: Exploiting GPUs in Spark

What Does Binary Columnar Do?

Keeps data in a binary representation (not the Java object representation)

Keeps data in a column-oriented layout

Keeps data off-heap or in GPU device memory


Example

case class Pt(x: Int, y: Int)
Array(Pt(1, 4), Pt(2, 5))

[Figure: the columnar (column-oriented) layout stores 1 2 4 5 off-heap; the row-oriented layout stores 1 4 2 5 off-heap.]
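A minimal sketch of the idea in Scala (illustrative only, not the project's actual storage classes), using a direct ByteBuffer as the off-heap backing store with one contiguous region per column:

import java.nio.ByteBuffer

case class Pt(x: Int, y: Int)
val pts = Array(Pt(1, 4), Pt(2, 5))

// One direct (off-heap) buffer: the x column first, then the y column
val buf = ByteBuffer.allocateDirect(pts.length * 2 * 4)
pts.foreach(p => buf.putInt(p.x))  // x column: 1, 2
pts.foreach(p => buf.putInt(p.y))  // y column: 4, 5

// Element i can be read back without materializing a Java object
def x(i: Int): Int = buf.getInt(i * 4)
def y(i: Int): Int = buf.getInt((pts.length + i) * 4)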

Page 12: Exploiting GPUs in Spark

Current RDD as Java objects on Java heap


case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))

[Figure: on the Java heap, each Pt is a Java object with an object header for the Java virtual machine, holding (1, 4) and (2, 5). The current RDD: row-oriented layout, Java object representation, on the Java heap.]

Page 13: Exploiting GPUs in Spark

Binary Columnar RDD on off-heap


case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))

[Figure: the current RDD keeps Pt objects with object headers on the Java heap (row-oriented layout, Java object representation); the binary columnar RDD keeps the same data off-heap as 1 2 4 5 (column-oriented layout, binary representation).]

Page 14: Exploiting GPUs in Spark


Long Path from Current RDD to GPU

Three steps to send data from an RDD to the GPU
1. Java objects to a column-oriented binary representation on the Java heap
– from a Java object to a binary representation
– from a row-oriented format to a columnar one
2. Binary representation on the Java heap to binary columnar off-heap
– garbage collection may move objects on the Java heap during GPU-related operations, so the data must live off-heap before the copy
3. Off-heap to GPU device memory


case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
rdd.map(…).reduce(…) // execute on GPU

[Figure: Pt objects on the Java heap → ByteBuffers on the Java heap (step 1) → off-heap (step 2) → GPU device memory (step 3).]

This thread on the dev mailing list also discusses the overhead of copying data between an RDD and the GPU

Page 15: Exploiting GPUs in Spark

Short Path from Binary Columnar RDD to GPU

An RDD in binary columnar format can simply be copied to GPU device memory


case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
rdd.map(…).reduce(…) // execute on GPU

[Figure: the Java-heap stages are eliminated; the off-heap columnar data (1 2 4 5) is copied directly to GPU device memory.]
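Because a direct ByteBuffer already holds the final byte layout off-heap, the transfer can be one bulk copy. A hedged sketch of that single step with the JCuda driver API, reusing buf from the binary columnar sketch above:

import jcuda.Pointer
import jcuda.driver.JCudaDriver._
import jcuda.driver.CUdeviceptr
import java.nio.ByteBuffer

// buf: a direct ByteBuffer that already holds the columnar bytes
def copyColumnarToGpu(buf: ByteBuffer): CUdeviceptr = {
  val dptr = new CUdeviceptr()
  cuMemAlloc(dptr, buf.capacity())
  // No per-object traversal and no heap-side re-encoding: one bulk copy over PCIe
  cuMemcpyHtoD(dptr, Pointer.toBuffer(buf), buf.capacity())
  dptr
}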

Page 16: Exploiting GPUs in Spark

Can Execute map() in Parallel Using Binary Columnar

Adjacent elements in a binary columnar RDD can be accessed in parallel

The same type of operation (* or -) can be executed in parallel on the data that is loaded in parallel


case class Pt(x: Int, y: Int)
rdd  = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
rdd1 = rdd.map(p => Pt(p.x*2, p.y-1))

[Figure: with the current RDD on the Java heap, the four field values are loaded one after another (access order 1, 2, 3, 4); with the binary columnar RDD off-heap, both x values are loaded in parallel and then both y values (access order 1, 1, 2, 2).]
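As a CPU-side illustration (not the project's GPU code path), mapping over separate column arrays turns the lambda into two independent bulk loops, each trivially parallel:

// Column arrays for Array(Pt(1, 4), Pt(2, 5))
val xs = Array(1, 2)
val ys = Array(4, 5)

// p => Pt(p.x*2, p.y-1) becomes two independent columnar loops;
// every element of a column can be handled by a different core or thread
val outX = new Array[Int](xs.length)
val outY = new Array[Int](ys.length)
for (i <- xs.indices.par) outX(i) = xs(i) * 2  // all *2 operations in parallel
for (i <- ys.indices.par) outY(i) = ys(i) - 1  // all -1 operations in parallel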

Page 17: Exploiting GPUs in Spark

Advantages of Binary Columnar

Can exploit the high performance of GPUs

Can reduce the overhead of data copies between CPU and GPU

Consumes a smaller memory footprint

Can directly compute on data from Apache Parquet, which is also stored in a columnar format

Can exploit SIMD instructions on the CPU


Page 18: Exploiting GPUs in Spark

What Does the GPU Enabler Do?

Copies data in a binary columnar RDD between CPU main memory and GPU device memory

Launches GPU kernels

Caches GPU native code for kernels

Generates GPU native code from the transformations and actions in a program
– We already productized the IBM Java just-in-time compiler that generates GPU native code from a lambda expression in Java 8


Page 19: Exploiting GPUs in Spark

Motivation & Goal

Introduction of GPUs

Design & New Components

Current Implementation

Performance Experiment

Future Direction in Spark 2.0 and beyond

Conclusion

Page 20: Exploiting GPUs in Spark

Software Stack in the Current Spark 2.0-SNAPSHOT

An RDD keeps data on the Java heap


[Figure: the user's Spark program uses the RDD API; RDD data lives on the Java heap.]

Page 21: Exploiting GPUs in Spark

Software Stack of GPU Exploitation

The current RDD and the binary columnar RDD co-exist


[Figure: the user's Spark program uses the RDD API; RDD data stays on the Java heap, while the GPU enabler manages columnar data both off-heap and in GPU device memory.]

Page 22: Exploiting GPUs in Spark

Current Implementation of Binary Columnar

Works with RDDs

Converts a current RDD to a binary columnar RDD and vice versa, as sketched below
– Our current implementation eliminates the conversion overhead between CPU and GPU within a task
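For illustration, the conversion is driven through convert(), as in the example on Page 23. ColumnFormat appears in the project's examples; RowFormat below is a hypothetical name used here only to show the round trip:

rdd  = sc.parallelize(1 to n, 2)
cols = rdd.convert(ColumnFormat)  // current RDD -> binary columnar RDD
rows = cols.convert(RowFormat)    // binary columnar RDD -> current RDD (hypothetical name)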


Page 23: Exploiting GPUs in Spark

Current Implementation of the GPU Enabler

Executes user-provided GPU kernels from map()/reduce() functions
– GPU memory management and data copies are handled automatically

Generates GPU native code for simple map()/reduce() methods
– Enabled by "spark.gpu.codegen=true" in spark-defaults.conf


rdd1 = sc.parallelize(1 to n, 2).convert(ColumnFormat) // rdd1 uses a binary columnar RDD
sum  = rdd1.map(i => i * 2)
           .reduce((x, y) => (x + y))

// CUDA
__global__ void sample_map(int *inX, int *inY, int *outX, int *outY, long size) {
  long ix = threadIdx.x + blockIdx.x * blockDim.x;
  if (size <= ix) return;
  outX[ix] = inX[ix] * 2;
  outY[ix] = inY[ix] - 1;
}

// Spark
mapFunction = new CUDAFunction("sample_map",      // CUDA kernel name
  Array("this.x", "this.y"),                      // the input object has two fields
  Array("this.x", "this.y"),                      // the output object has two fields
  this.getClass.getResource("/sample.ptx"))       // the ptx is generated by the CUDA compiler
rdd1 = sc.parallelize(…).convert(ColumnFormat)    // rdd1 uses a binary columnar RDD
rdd2 = rdd1.mapExtFunc(p => Pt(p.x*2, p.y-1), mapFunction)

Page 24: Exploiting GPUs in Spark

How to Use the GPU Exploitation Version

Easy to install with a one-liner and to run with a one-liner
– on x86_64, mac, and ppc64le, with CUDA 7.0 or later, and with any JVM such as IBM JDK or OpenJDK

A run script for AWS EC2 is available, which supports spot instances

$ wget https://s3.amazonaws.com/spark-gpu-public/spark-gpu-latest-bin-hadoop2.4.tgz && \
  tar xf spark-gpu-latest-bin-hadoop2.4.tgz && cd spark-gpu

$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 MASTER='local[2]' ./bin/run-example SparkGPULR 8 3200 32 5
…
numSlices=8, N=3200, D=32, ITERATIONS=5
On iteration 1
On iteration 2
On iteration 3
On iteration 4
On iteration 5
Elapsed time: 431 ms
$

Available at http://kiszk.github.io/spark-gpu/
• 3 contributors
• Private communications with other developers

Page 25: Exploiting GPUs in Spark

Achieved a 3.15x Performance Improvement with a GPU

Ran a naïve implementation of logistic regression

Achieved a 3.15x performance improvement for logistic regression over running without a GPU, on a 16-core IvyBridge box with an NVIDIA K40 GPU card
– There is still room to improve performance


Details are available at https://github.com/kiszk/spark-gpu/wiki/Benchmark

Program parameters: N=1,000,000 (# of points), D=400 (# of features), ITERATIONS=5, Slices=128 (without GPU) or 16 (with GPU), MASTER=local[8] (both without and with GPU)

Hardware and software: nx360 M4, 2 sockets of 8-core Intel Xeon E5-2667 3.3 GHz, 256 GB memory, one NVIDIA K40m card; OS: RedHat 6.6; CUDA: 7.0

Page 26: Exploiting GPUs in Spark

Motivation & Goal

Introduction of GPUs

Design & New Components

Current Implementation

Performance Experiment

Future Direction in Spark 2.0 and beyond

Conclusion

Page 27: Exploiting GPUs in Spark

Comparisons among DataFrame, Dataset, and RDD

DataFrame (with relational operations) and Dataset (with lambda functions) use Catalyst and a row-oriented data representation on off-heap


case class Pt(x: Int, y: Int)
d = Array(Pt(1, 4), Pt(2, 5))

Frontend API
  DataFrame (v1.3-): df = d.toDF(…); df.filter("x > 1").count()
  Dataset (v1.6-):   ds = d.toDS(); ds.filter(p => p.x > 1).count()
  RDD (v0.5-):       rdd = sc.parallelize(d); rdd.filter(p => p.x > 1).count()

Data
  DataFrame, Dataset: row-oriented (1 4 2 5) on off-heap
  RDD: Java objects (Pt(1, 4), Pt(2, 5)) on the Java heap

Backend computation
  DataFrame, Dataset: Java bytecode generated by Catalyst
  RDD: Java bytecode in the Spark program and runtime

Page 28: Exploiting GPUs in Spark

Design Concepts of Dataset and GPU Exploitation

Keep data in a binary representation

Keep data off-heap

Take advantage of the Catalyst optimizer


Comparison of data representations

[Figure: a Dataset, ds = (Pt(1, 4), Pt(2, 5)).toDS(), keeps row-oriented binary data (1 4 2 5) off-heap; a binary columnar RDD, sc.parallelize(Array(Pt(1, 4), Pt(2, 5))), keeps columnar binary data (1 2 4 5) off-heap. Binary columnar already follows the first two design concepts, and the GPU enabler could take advantage of Catalyst.]

How can we apply binary columnar and the GPU enabler to Dataset? One possible shape is sketched below.
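As a hypothetical extension (mapExtFunc exists in the project's RDD API; the Dataset method below does not exist and only illustrates the direction):

// Hypothetical: a Dataset counterpart of the RDD-level mapExtFunc
val ds  = Seq(Pt(1, 4), Pt(2, 5)).toDS()
val ds2 = ds.mapExtFunc(p => Pt(p.x * 2, p.y - 1), mapFunction)
// Catalyst could then plan the copy to GPU memory and the kernel launch
// like any other physical operator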

Page 29: Exploiting GPUs in Spark

Components in GPU Exploitation

Binary columnar
– Columnar
  In-memory storage keeps data in a binary representation off-heap or in GPU memory
  BinaryEncoder converts a data representation between a Java object and the binary format
  ColumnEncoder lays out a set of data elements in column-oriented order
– Memory manager
  Manages off-heap and GPU memory
  The columnar cache manages the persistency of the in-memory storage

GPU enabler
– GPU kernel launcher
  Launches kernels together with the data copies
  Caches GPU binaries for kernels
– GPU code generator
  Generates GPU code from a Spark program


[Figure: component diagram — the in-memory storage (binary encoder, column encoder, columnar cache), the GPU code generator with pre-compiled libraries for the GPU, and the memory manager over columnar data in GPU memory and off-heap memory.]

Page 30: Exploiting GPUs in Spark

Software Stack in Spark 2.0 and Beyond

Dataset will become the primary data structure for computation

A Dataset keeps data in UnsafeRow on off-heap


[Figure: the user's Spark program uses DataFrame and Dataset; below them sit Catalyst (logical optimizer, CPU code generator) and Tungsten, with UnsafeRow data on off-heap.]

Page 31: Exploiting GPUs in Spark

Columnar with Dataset

Keep data in UnsafeRow or Columnar on off-heap, or in Columnar on GPU device memory


[Figure: the same stack, extended with a memory manager; data can live as UnsafeRow or Columnar on off-heap, or as Columnar in GPU device memory.]

Page 32: Exploiting GPUs in Spark

Two Approaches for Binary Columnar with Dataset

Binary columnar as a first-class citizen
– Better end-to-end performance in a job, without conversions
– Needs more changes to the existing source code

Binary columnar as a cache in a task
– Incurs the overhead of representation conversions between two tasks at a shuffle
– Needs fewer changes to the existing source code


[Figure: a job ds1 = d.toDS(); ds2 = ds1.map(…); ds3 = ds2.map(…); ds11 = ds3.groupBy(…); ds12 = ds11.map(…) is split into task1 and task2 at a shuffle. As a first-class citizen, columnar data spans the whole job; as a cache, columnar data is used within each task and converted at the shuffle boundary.]

Page 33: Exploiting GPUs in Spark

GPU Support in Tungsten

According to Reynold’s talk (p. 25), the Tungsten backend has a plan to enable GPU exploitation


Page 34: Exploiting GPUs in Spark

GPU Enabler in Catalyst

Place the GPU kernel launcher and the GPU code generator into Catalyst


[Figure: the stack from Page 31 with the GPU kernel launcher and GPU code generator added inside Catalyst, alongside the logical optimizer and CPU code generator.]

Page 35: Exploiting GPUs in Spark

Future Direction

Refactor the current implementation to make it decomposable
– Some components currently live in one Scala file

Make pull requests for each component
– to support columnar Dataset
– to exploit GPUs


[Figure: roadmap for pull requests — in-memory storage (binary encoder, column encoder, columnar cache), memory manager and cache manager over off-heap and GPU memory, a CPU code generator for Columnar first as a cache in a task and then as a first-class citizen with multiple backend support, and the GPU kernel launcher and GPU code generator inside Catalyst.]

Page 36: Exploiting GPUs in Spark

Takeaway

Accelerate a Spark application by using GPUs effectively and transparently

Devised two new components
– Binary columnar, to alleviate the overhead of GPU exploitation
– GPU enabler, to manage GPU kernel execution from a Spark program; it calls pre-compiled libraries for the GPU and generates GPU native code at runtime

Available at http://kiszk.github.io/spark-gpu/


Component       | Initial design (Spark 1.3-1.5)               | Current status (Spark 2.0-SNAPSHOT)          | Future (Spark 2.x)
Binary columnar | with RDD                                     | with RDD                                     | with Dataset
GPU enabler     | launch GPU kernels, generate GPU native code | launch GPU kernels, generate GPU native code | in Catalyst


We appreciate any feedback and contributions