© 2010 Michael Boyer 1 Harnessing the Power of GPUs for Non-Graphics Applications Michael Boyer Department of Computer Science University of Virginia Advisor: Kevin Skadron


Harnessing the Power of GPUs for Non-Graphics Applications

Michael Boyer
Department of Computer Science
University of Virginia
Advisor: Kevin Skadron

Outline
• GPU architecture
• Programming GPUs using CUDA
• Case study: Leukocyte Tracking
• Current work: CPU-GPU Task Sharing

Graphics Processors
• Graphics Processing Units (GPUs) are designed specifically for graphics rendering applications

Courtesy of GameSpot

Graphics Applications
• Graphics applications involve applying the same operation to many pieces of data
• Application characteristics:
  – Massively parallel
  – Only aggregate performance matters

CPU vs. GPU: Architectural Difference 1

[Diagram: CPU core (Fetch/Decode, Execute, Register File, OOO Logic, Branch Predictor, Data Cache, Memory Pre-Fetcher) vs. GPU core (Fetch/Decode, Execute, Register File)]

Avoid structures that only improve single-thread performance

CPU vs. GPU: Architectural Difference 2

[Diagram: CPU core (Fetch/Decode, Execute, Register File, OOO Logic, Branch Predictor, Data Cache, Memory Pre-Fetcher) vs. GPU core (one Fetch/Decode unit driving four EXE/RF pairs that together serve a thread group)]

Amortize the overhead of control logic across multiple execution units (SIMD processing)

CPU vs. GPU: Architectural Difference 3

[Diagram: CPU core (as on the previous slides) vs. GPU core (one Fetch/Decode unit, four EXE units, and separate register files for Thread Groups 1 through 4)]

Use multiple groups of threads to keep execution units busy and hide memory latency

CPU vs. GPU: Architectural Difference 4

[Diagram: CPU built from CPU cores 1 through 4 vs. GPU built from GPU cores 1 through 30]

Replicate cores to leverage more parallelism

CPU vs. GPU: Architectural Differences
• Summary: take advantage of abundant parallelism
  – Lots of threads, so focus on aggregate performance
  – Parallelism in space:
    • SIMD processing in each core
    • Many independent SIMD cores across the chip
  – Parallelism in time:
    • Multiple SIMD groups in each core

CPU vs. GPU: Peak Performance

Processor Type             CPU                           GPU
Product                    Intel Xeon W5590 (Nehalem)    AMD Radeon HD 5870
Throughput (GFLOPS)        107                           2,720
Memory Bandwidth (GB/s)    32                            154
Cost                       $1,700                        $450

• Note that these are peak numbers
• What we really care about is performance on real-world applications

General-Purpose Computing on GPUs
• Lots of recent interest in using GPUs to run non-graphics applications (GPGPU)
• Why GPUs? Why now?
  – Recent increases in performance via parallelism
  – Recent increases in programmability
  – Ubiquity in multiple market segments
• Old approach: graphics languages
• New approach: GPGPU languages
  – OpenCL, CUDA

CUDA
• Programming model for running general-purpose applications on NVIDIA GPUs
• Extension to the C programming language
• GPU is a co-processor:
  – Main program runs on the CPU
  – Large computations (kernels) are offloaded to the GPU
  – CPU and GPU have separate memory, so data must be transferred back and forth

CUDA: Typical Program Structure

void function(…) {
    Allocate memory on the GPU
    Transfer input data to the GPU
    Launch kernel on the GPU
    Transfer output data to CPU
}

__global__ void kernel(…) {
    Code executed on the GPU goes here…
}

[Diagram: CPU with CPU memory; GPU with GPU memory]

CUDA: Typical Program Transformation

for (i = 0; i < N; i++) {
    Process array element i
}

Body of loop becomes body of kernel

__global__ void kernel(…) {
    Determine this thread’s value of i
    Process array element i
}

CUDA Kernel
• Scalar program invoked across many threads
  – Typically one thread per data element
• Overall computation decomposed into a grid of thread blocks
  – Thread blocks are independent and cannot communicate (with some exceptions)
  – Threads within the same block can communicate

[Diagram: grid of Thread Blocks 1 through 5]

Simple Example: Vector Addition

C = A + B

A:    1   2   3   4   5   6   7   8
B:    9  10  11  12  13  14  15  16
C:   10  12  14  16  18  20  22  24

C Code

#include <stdlib.h>

float *CPU_add_vectors(float *A, float *B, int N) {
    // Allocate memory for the result
    float *C = (float *) malloc(N * sizeof(float));

    // Compute the sum
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }

    // Return the result
    return C;
}

CUDA Kernel

// GPU kernel that computes the vector sum C = A + B
// (each thread computes a single value of the result)
__global__ void kernel(float *A, float *B, float *C, int N) {
    // Determine which element this thread is computing
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    // Compute a single element of the result vector
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}
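The index calculation and bounds check in this kernel can be checked on the CPU by looping over every (block, thread) pair in plain C. This is only a sequential emulation for illustration: `emulate_vector_add` is a hypothetical name, and the loop order says nothing about how a GPU actually schedules threads.

```c
/* Sequentially emulate a 1-D CUDA launch of the vector-add kernel:
 * every (block, thread) pair computes its global index i and handles
 * element i only if it lies inside the vector. */
static void emulate_vector_add(const float *A, const float *B, float *C,
                               int N, int threads_per_block) {
    /* Round up so that the last, partially filled block is included. */
    int num_blocks = (N + threads_per_block - 1) / threads_per_block;

    for (int block_idx = 0; block_idx < num_blocks; block_idx++) {
        for (int thread_idx = 0; thread_idx < threads_per_block; thread_idx++) {
            /* Same formula as blockDim.x * blockIdx.x + threadIdx.x */
            int i = threads_per_block * block_idx + thread_idx;
            if (i < N) {  /* threads past the end of the vector do nothing */
                C[i] = A[i] + B[i];
            }
        }
    }
}
```

With N = 8 and 3 threads per block, three blocks are emulated and the ninth thread falls outside the vector, which is exactly the case the `if (i < N)` guard protects against.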

CUDA Host Code

float *GPU_add_vectors(float *A_CPU, float *B_CPU, int N) {
    // Allocate GPU memory for the inputs and the result
    int vector_size = N * sizeof(float);
    float *A_GPU, *B_GPU, *C_GPU;
    cudaMalloc((void **) &A_GPU, vector_size);
    cudaMalloc((void **) &B_GPU, vector_size);
    cudaMalloc((void **) &C_GPU, vector_size);

    // Transfer the input vectors to GPU memory
    cudaMemcpy(A_GPU, A_CPU, vector_size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_GPU, B_CPU, vector_size, cudaMemcpyHostToDevice);

    // Execute the kernel to compute the vector sum on the GPU
    int num_blocks = ceil((double) N / (double) THREADS_PER_BLOCK);
    kernel <<< num_blocks, THREADS_PER_BLOCK >>> (A_GPU, B_GPU, C_GPU, N);

    // Transfer the result vector from the GPU to the CPU
    float *C_CPU = (float *) malloc(vector_size);
    cudaMemcpy(C_CPU, C_GPU, vector_size, cudaMemcpyDeviceToHost);

    // Free the GPU memory now that the result is on the CPU
    cudaFree(A_GPU);
    cudaFree(B_GPU);
    cudaFree(C_GPU);

    return C_CPU;
}

Example Program Output

./vector_add 50,000,000

GPU:
  Transfer to GPU:    0.236 sec
  Kernel execution:   0.005 sec
  Transfer from GPU:  0.152 sec
  Total:              0.404 sec

CPU: 0.136 sec

Execution: GPU outperformed CPU by 27.2x
Overall: CPU outperformed GPU by 2.97x

Vector addition does not do enough work per memory operation to justify offload!
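The two speedup figures in this output follow directly from the timings. A minimal C sketch of the arithmetic, using only the numbers reported on this slide (`speedup` and `print_speedups` are hypothetical helpers, not part of the original program):

```c
#include <stdio.h>

/* Timings copied from the program output on this slide (seconds). */
static const double GPU_KERNEL_SEC = 0.005;
static const double GPU_TOTAL_SEC  = 0.404;
static const double CPU_TOTAL_SEC  = 0.136;

/* Hypothetical helper: how many times longer the slow run takes. */
static double speedup(double slow_sec, double fast_sec) {
    return slow_sec / fast_sec;
}

/* Print both comparisons: kernel-only and end-to-end. */
static void print_speedups(void) {
    /* Kernel execution alone: the GPU wins. */
    printf("Execution: GPU outperformed CPU by %.1fx\n",
           speedup(CPU_TOTAL_SEC, GPU_KERNEL_SEC));
    /* Including transfers: the CPU wins. */
    printf("Overall: CPU outperformed GPU by %.2fx\n",
           speedup(GPU_TOTAL_SEC, CPU_TOTAL_SEC));
}
```

The kernel itself accounts for only 0.005 of the 0.404 seconds of GPU time; nearly everything else is data transfer, which is why the end-to-end comparison reverses.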

Case Study: Leukocyte Tracking

Leukocyte Tracking
• Important for evaluating inflammatory drugs
• Velocity measured by tracking leukocytes through multiple frames
• Current approaches:
  – Manual analysis: 1 minute video in tens of hours
  – MATLAB: 1 minute video in 5 hours

Goal: Leverage CUDA and a GPU to accelerate leukocyte tracking to near real-time speeds

Acceleration
1. Translation: convert MATLAB code to C
2. Parallelization:
  – OpenMP for multi-core CPU
  – CUDA for GPU
• Experimental setup:
  – CPU: 3.2 GHz quad-core Intel Core 2 Extreme X9770
  – GPU: NVIDIA GeForce GTX 280 (PCIe 2.0)

Tracking Algorithm

Inputs: Video frame
        Location of cells in previous frame
Output: Location of cells in current frame

For each cell:
  – Extract sub-image near cell’s old location
  – Compute MGVF matrix over sub-image
  – Evolve active contour using MGVF matrix

→ 99.8%

Computing the MGVF Matrix
• Motion Gradient Vector Flow
• MGVF matrix is approximated via an iterative solution procedure

[Images: sub-image near cell; MGVF]

MGVF Pseudo-code

MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)

Naïve CUDA Implementation

[Chart: speedup over MATLAB (axis 0x to 250x). C: 2.0x; C + OpenMP: 7.7x; Naïve CUDA: 0.8x]

• Kernel is called ~50,000 times per frame
• Amount of work per call is small
• Runtime dominated by CUDA overheads:
  – Memory allocation, memory copying, kernel call overhead

Kernel Overhead
• Kernel calls are not cheap!
  – Overhead of one kernel call: 9 microseconds
  – Overhead of one CPU function: 3 nanoseconds
  – Kernel call is 3,000 times more expensive
• Heaviside kernel:
  – 27% of kernel runtime due to computation
  – 73% of kernel runtime due to kernel overhead
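These overhead figures suggest a simple cost model: each kernel call pays a fixed launch overhead on top of its useful computation. The sketch below is an illustrative assumption, not a measurement from the talk; the function names are hypothetical, and the 9-microsecond constant is the per-call overhead quoted on this slide.

```c
/* Per-call kernel launch overhead quoted on the slide (microseconds). */
#define KERNEL_OVERHEAD_US 9.0

/* Fraction of one kernel call's runtime lost to launch overhead,
 * assuming runtime = fixed overhead + useful computation. */
static double overhead_fraction(double compute_us) {
    return KERNEL_OVERHEAD_US / (KERNEL_OVERHEAD_US + compute_us);
}

/* Total runtime (microseconds) when a fixed amount of computation
 * is spread across num_calls kernel calls: the computation is paid
 * once, but the launch overhead is paid on every call. */
static double total_runtime_us(double total_compute_us, long num_calls) {
    return total_compute_us + (double) num_calls * KERNEL_OVERHEAD_US;
}
```

Under this model, a call whose computation takes as long as its launch overhead (9 microseconds) spends half its time on overhead, and merging 50,000 calls into one removes about 0.45 seconds of overhead regardless of how much computation is involved.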

Lesson 1: Reduce Kernel Overhead
• Increase amount of work per kernel call
  – Decrease total number of kernel calls
  – Amortize overhead of each kernel call across more computation

Larger Kernel Implementation

MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)

Larger Kernel Implementation

[Chart: percentage of runtime. Kernel Execution: 9%; Memory Copying: 15%; Memory Allocation: 71%]

[Chart: speedup over MATLAB (axis 0x to 250x). C: 2.0x; C + OpenMP: 7.7x; Naïve CUDA: 0.8x; Larger Kernel: 6.3x]

Memory Allocation Overhead

[Chart: time per call (microseconds, log scale from 0.01 to 10,000) vs. megabytes allocated per call (log scale from 1E-07 to 1,000), for malloc (CPU memory) and cudaMalloc (GPU memory)]

Lesson 2: Reduce Memory Management Overhead
• Reduce the number of memory allocations
  – Allocate memory once and reuse it throughout the application
  – If memory size is not known a priori, estimate and only re-allocate if estimate is too small
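The allocate-once strategy can be sketched on the CPU side in plain C. This is an illustration under stated assumptions, not code from the talk: `ReusableBuffer`, `buffer_reserve`, and `buffer_release` are hypothetical names, and `malloc`/`free` stand in for `cudaMalloc`/`cudaFree`, which would follow the same pattern.

```c
#include <stdlib.h>

/* Hypothetical grow-only buffer: memory is allocated on first use and
 * re-allocated only when a request exceeds the current capacity. */
typedef struct {
    void  *data;         /* current allocation (NULL if none) */
    size_t capacity;     /* bytes currently allocated */
    int    allocations;  /* number of real allocations, for illustration */
} ReusableBuffer;

/* Return a pointer to at least `bytes` of storage, reusing the existing
 * allocation when it is large enough. Returns NULL if allocation fails.
 * Contents are not preserved across a re-allocation. */
static void *buffer_reserve(ReusableBuffer *buf, size_t bytes) {
    if (bytes > buf->capacity) {
        void *fresh = malloc(bytes);  /* stand-in for cudaMalloc */
        if (fresh == NULL) {
            return NULL;
        }
        free(buf->data);              /* stand-in for cudaFree */
        buf->data = fresh;
        buf->capacity = bytes;
        buf->allocations++;
    }
    return buf->data;
}

/* Release the storage once, at the end of the application. */
static void buffer_release(ReusableBuffer *buf) {
    free(buf->data);
    buf->data = NULL;
    buf->capacity = 0;
}
```

Starting from `ReusableBuffer buf = {NULL, 0, 0};`, a first call with the estimated size performs the only allocation in the common case; later requests that fit reuse the same memory, and only an underestimate triggers a second allocation.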

Reduced Allocation Implementation

[Chart: percentage of runtime. Kernel Execution: 31%; Memory Copying: 56%; Memory Allocation: 3%]

[Chart: speedup over MATLAB (axis 0x to 250x). C: 2.0x; C + OpenMP: 7.7x; Naïve CUDA: 0.8x; Larger Kernel: 6.3x; Reduced Allocation: 25.4x]

Memory Transfer Overhead

[Chart: transfer time (milliseconds, log scale from 0.001 to 1,000) vs. megabytes per transfer (log scale from 1E-06 to 1,000), for CPU to GPU and GPU to CPU]

Lesson 3: Reduce Memory Transfer Overhead
• If the CPU operates on values produced by the GPU:
  – Move the operation to the GPU
  – May improve performance even if the operation itself is slower on the GPU

[Diagram: timelines between "values produced by GPU" and "values consumed by GPU". On the CPU: Memory Transfer, Operation (CPU), Memory Transfer. On the GPU: Operation (GPU) only]

GPU Reduction Implementation

MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)

GPU Reduction Implementation

[Chart: speedup over MATLAB (axis 0x to 250x). C: 2.0x; C + OpenMP: 7.7x; Naïve CUDA: 0.8x; Larger Kernel: 6.3x; Reduced Allocation: 25.4x; GPU Reduction: 60.7x]

[Chart: percentage of runtime. Kernel Execution: 80%; Memory Copying: 1%; Memory Allocation: 7%]

Persistent Thread Block

MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)

Persistent Thread Block
• Problem: need a global memory fence
  – Multiple thread blocks compute the MGVF matrix
  – Thread blocks cannot communicate with each other
  – So each iteration requires a separate kernel call
• Solution: compute entire matrix in one thread block
  – Arbitrary number of iterations can be computed in a single kernel call

Persistent Thread Block: Example

[Diagram: two 3×3 MGVF matrices. Canonical CUDA approach (1-to-1 mapping between threads and data elements): cells labeled 1 through 9. Persistent thread block: every cell labeled 1]

Persistent Thread Block: Example

[Diagram: two GPUs, each a 3×3 array of SMs. Canonical CUDA approach (1-to-1 mapping between threads and data elements): Cell 1 is spread across all nine SMs. Persistent thread block: Cells 1 through 9 each occupy one SM]

SM = Streaming Multiprocessor (GPU core)

Lesson 4: Avoid Global Memory Fences
• Confine dependent computations to a single thread block
  – Execute an iterative algorithm until convergence in a single kernel call
  – Only efficient if there are multiple independent computations

Persistent Thread Block Implementation

[Chart: speedup over MATLAB (axis 0x to 250x). C: 2.0x; C + OpenMP: 7.7x; Naïve CUDA: 0.8x; Larger Kernel: 6.3x; Reduced Allocation: 25.4x; GPU Reduction: 60.7x; Persistent Thread Block: 211.3x. Annotation: 27x]

Absolute Performance

[Chart: frames per second (FPS, axis 0 to 25). MATLAB: 0.11; C: 0.22; C + OpenMP: 0.83; CUDA: 21.6]


Video Example

Conclusions
• CUDA overheads can be significant bottlenecks
• CUDA provides enormous performance improvements for leukocyte tracking
  – 200x over MATLAB
  – 27x over OpenMP
• Processing time reduced from > 4.5 hours to < 1.5 minutes
• Real-time analysis feasible in near future

M. Boyer, D. Tarjan, S. T. Acton, and K. Skadron. "Accelerating Leukocyte Tracking using CUDA: A Case Study in Leveraging Manycore Coprocessors." In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), May 2009.

Current work: CPU-GPU Task Sharing

CPU-GPU Task Sharing
• Offloading decision is generally considered to be binary

[Diagram: GPU? CPU?]

CPU-GPU Task Sharing
• Offload decision does not need to be binary!
  – Dividing a task between the CPU and GPU can provide improved performance over either device alone

[Diagram: GPU? CPU? replaced by GPU and CPU together]

Theoretical Performance

[Chart: performance normalized to best without sharing (axis 0 to 2) vs. ratio of GPU to CPU performance (log scale from 0.01 to 100), for GPU, CPU, CPU+GPU (equal sharing), and CPU+GPU (optimal sharing)]
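One simple model that produces curves of this kind (an assumption for illustration; the talk does not give the formula behind the chart) lets the CPU have throughput 1 and the GPU throughput r, assumes the work divides perfectly, and ignores transfer overhead. All function names below are hypothetical.

```c
/* Throughput of the better single device when the CPU has
 * throughput 1 and the GPU has throughput r. */
static double best_without_sharing(double r) {
    return r > 1.0 ? r : 1.0;
}

/* Equal sharing: each device gets half the work, and the task is done
 * when the slower device finishes, so throughput is 2 * min(1, r).
 * The result is normalized to the best single device. */
static double equal_sharing(double r) {
    double slower = r < 1.0 ? r : 1.0;
    return 2.0 * slower / best_without_sharing(r);
}

/* Optimal sharing: work is split in proportion to throughput, so both
 * devices finish together and their throughputs add (1 + r).
 * The result is normalized to the best single device. */
static double optimal_sharing(double r) {
    return (1.0 + r) / best_without_sharing(r);
}
```

In this model, optimal sharing peaks at 2x when the two devices are equally fast (r = 1) and approaches 1x as either device dominates, while equal sharing beats the best single device only when r lies between 0.5 and 2.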

Research Goal
1. Given an input program written in CUDA or OpenCL, automatically generate a program that can execute on the CPU and GPU concurrently
2. Automatically determine best division of work:
  – When beneficial, share work between CPU and GPU
  – Otherwise, execute on CPU or GPU exclusively
  – Optimal decision can change at runtime:
    • With different inputs
    • With contention

Proposed System

[Diagram: OpenCL code → Source-to-source Translation Framework → Modified OpenCL code → OpenCL Compiler → CPU/GPU binary]

Transform all GPU memory allocations, memory transfers, and kernel launches into a form supporting concurrent CPU-GPU execution

Potential Problems
• One version of the kernel for multiple devices
  – Optimizations for GPU may hurt performance on CPU and vice versa
• Possible (but rare) for thread blocks to communicate with each other
  – Do we try to support this?
• Statically predicting data access patterns can be hard (or even impossible for some applications)

Data Sharing
• If we cannot predict data access patterns statically, then the CPU and the GPU must have a consistent view of memory

[Diagram: CPU and GPU. 1) Computation; 2) Data Transfer]

Data Sharing (2)
• If we can predict data access patterns statically, then we can minimize the data transfer overhead

[Diagram: CPU and GPU. 1) Computation; 2) Data Transfer]

Preliminary Results (HotSpot)

[Chart: execution time (seconds, axis 0 to 12) vs. percent of computation on GPU (0 to 100), for Static Analysis, Dynamic Analysis, and No Sharing]

Conclusions
• GPUs are designed to provide good performance on graphics workloads
  – But they have evolved to support any workload with abundant parallelism
• GPUs can provide large performance improvements
  – But we need to take into account the overheads involved to fully take advantage
• Allowing the CPU and GPU to work together can provide an even larger performance improvement

Acknowledgements
• Funding provided by:
  – NSF grant IIS-0612049
  – SRC grant 1607.001
  – NVIDIA research grant
  – GRC AMD/Mahboob Khan Ph.D. fellowship
• Equipment donated by NVIDIA


BACKUP

3D Rendering APIs
• High-level abstractions for rendering geometry

[Diagram: Graphics Application → Vertex Program → Rasterization → Fragment Program → Display]

Courtesy of D. Luebke, NVIDIA

CUDA: Abstractions
1. Kernel function
  – Mapped onto a grid of thread blocks
2. Scratchpad memory
  – For sharing data within a thread block
3. Barrier synchronization
  – For synchronizing within a thread block

Kernel Function

__global__ void kernel(int *in, int *out) {
    // Determine this thread’s index
    int i = threadIdx.x;

    // Add one to the input value
    out[i] = in[i] + 1;
}

Grid of Thread Blocks

Grid: 2-dimensional, ≤ 4.3 billion blocks

Thread block: 3-dimensional, ≤ 512 threads

Launching a Kernel

int num_threads = ...;
int threads_per_block = 256;

// Determine how many thread blocks are needed
// (using either of the two methods shown below)
// Method 1: the cast avoids integer division truncating before ceil()
int num_blocks = ceil((double) num_threads / threads_per_block);
// Method 2: integer-only round-up
int num_blocks = (num_threads + threads_per_block - 1) / threads_per_block;

// Make structures for grid and thread block dimensions
dim3 grid(num_blocks, 1);
dim3 thread_block(threads_per_block, 1, 1);

// Launch the kernel
kernel <<< grid, thread_block >>> (in, out);
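Rounding the block count up matters whenever the thread count is not a multiple of the block size. A minimal integer-only C sketch of the difference (both function names are hypothetical):

```c
/* Plain integer division truncates, which drops the last,
 * partially filled block and leaves some threads unlaunched. */
static int blocks_truncated(int num_threads, int threads_per_block) {
    return num_threads / threads_per_block;
}

/* Round-up division: adding (threads_per_block - 1) before dividing
 * ensures that a partial block is still counted. */
static int blocks_rounded_up(int num_threads, int threads_per_block) {
    return (num_threads + threads_per_block - 1) / threads_per_block;
}
```

For 1,000 threads and 256 threads per block, truncation yields 3 blocks (768 threads, too few) while rounding up yields 4 blocks (1,024 threads). The surplus threads are the reason a kernel needs a bounds check on its index.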

Scratchpad Memory
• Each multiprocessor has 16 KB of software-controlled shared memory
• Variables declared “__shared__” get mapped into this memory
• Values can only be shared among threads within the same thread block

Scratchpad Memory: Example

__global__ void kernel() {
    int i = threadIdx.x;

    // Compute some function
    int v = foo(i);

    // Write the value into shared memory
    __shared__ int values[THREADS_PER_BLOCK];
    values[i] = v;

    // Use the shared values
    ...
}

Barrier Synchronization
• __syncthreads() function
• Each thread waits for all other threads in the thread block
• All values written by every thread are now visible to all other threads


Barrier Synchronization: Example

__global__ void kernel(float *out, int N) {

int i = threadIdx.x;

__shared__ int values[THREADS_PER_BLOCK];

values[i] = foo(i);

// Wait to ensure all values have been written

__syncthreads();

// Compute average of two neighboring values

out[i] = (values[i] + values[(i + 1) % N]) / 2.0f;

}
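A self-contained variant of the barrier example, as a sketch. It assumes a single thread block of `THREADS_PER_BLOCK` threads and replaces the undefined `foo(i)` with a value read from an input array; `neighbor_average` and `run_neighbor_average` are hypothetical names:

```cuda
#include <cstddef>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256

// Each thread stores one value in shared memory, waits at the barrier, then
// averages its own value with its right-hand neighbor's (wrapping around).
__global__ void neighbor_average(const float *in, float *out) {
    __shared__ float values[THREADS_PER_BLOCK];
    int i = threadIdx.x;
    values[i] = in[i];
    __syncthreads();  // all writes to values[] are visible after this point
    out[i] = (values[i] + values[(i + 1) % THREADS_PER_BLOCK]) / 2.0f;
}

// Runs one thread block and returns the number of incorrect outputs,
// or -1 if a CUDA call fails.
int run_neighbor_average(void) {
    float h_in[THREADS_PER_BLOCK], h_out[THREADS_PER_BLOCK];
    for (int i = 0; i < THREADS_PER_BLOCK; i++) h_in[i] = (float) i;

    size_t bytes = THREADS_PER_BLOCK * sizeof(float);
    float *d_in = NULL, *d_out = NULL;
    cudaError_t err = cudaMalloc((void **) &d_in, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **) &d_out, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        neighbor_average<<<1, THREADS_PER_BLOCK>>>(d_in, d_out);
        err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_in);   // cudaFree(NULL) is a no-op
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int errors = 0;
    for (int i = 0; i < THREADS_PER_BLOCK; i++) {
        float expected = (h_in[i] + h_in[(i + 1) % THREADS_PER_BLOCK]) / 2.0f;
        if (h_out[i] != expected) errors++;
    }
    return errors;
}
```

Without the `__syncthreads()` call, a thread could read its neighbor's slot before the neighbor has written it.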


CUDA Overheads
• Driver initialization: 0.14 seconds
• Kernel launch: 13 μs
• GPU memory allocation and deallocation: orders of magnitude slower than on CPU
• Memory transfer: 15 μs + 1 ms/MB
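The figures above can be combined into a rough cost model. This is only a sketch: the constants are the measurements quoted on the slide, not universal values, and the function names are hypothetical. Driver initialization and allocation costs are left out.

```cuda
// Estimated time for one CPU-GPU memory transfer: 15 us + 1 ms/MB.
static double transfer_time_us(double megabytes) {
    const double fixed_us = 15.0;      // fixed cost per transfer
    const double per_mb_us = 1000.0;   // 1 ms per MB
    return fixed_us + per_mb_us * megabytes;
}

// Estimated overhead of one offload: copy the input to the GPU,
// launch one kernel, copy the output back.
static double offload_overhead_us(double in_mb, double out_mb) {
    const double kernel_launch_us = 13.0;
    return transfer_time_us(in_mb) + kernel_launch_us + transfer_time_us(out_mb);
}
```

For 4 MB in and 4 MB out, the model gives 4015 + 13 + 4015 = 8043 μs, so the kernel has to save more than about 8 ms over the CPU version before offloading pays off.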


Acceleration using CUDA

Step 1: Determine which code to offload to the GPU as a CUDA kernel

Step 2: Write the CPU-side CUDA code

Step 3: Write and optimize the GPU kernel

[Figure: the program on the CPU allocates GPU memory, transfers input data, launches the CUDA kernel on the GPU, transfers results, and frees GPU memory]
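The CPU-side sequence (allocate GPU memory, transfer input data, launch kernel, transfer results, free GPU memory) can be sketched as follows. The kernel `add_one` and the wrapper `offload_add_one` are hypothetical stand-ins for whatever code is actually offloaded:

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the offloaded code: out[i] = in[i] + 1.
__global__ void add_one(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

// Runs the five CPU-side steps for n > 0 elements.
// Returns 0 on success, 1 if any CUDA call fails.
int offload_add_one(const float *h_in, float *h_out, int n) {
    size_t bytes = (size_t) n * sizeof(float);
    float *d_in = NULL, *d_out = NULL;

    // 1. Allocate GPU memory
    cudaError_t err = cudaMalloc((void **) &d_in, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **) &d_out, bytes);

    // 2. Transfer input data
    if (err == cudaSuccess)
        err = cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    // 3. Launch kernel
    if (err == cudaSuccess) {
        int threads_per_block = 256;
        int num_blocks = (n + threads_per_block - 1) / threads_per_block;
        add_one<<<num_blocks, threads_per_block>>>(d_in, d_out, n);
        err = cudaGetLastError();
    }

    // 4. Transfer results (this copy waits for the kernel to finish)
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    // 5. Free GPU memory (cudaFree(NULL) is a no-op)
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```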


Performance Issues
• Branch divergence
• Memory coalescing
• Key concept: Warp
– Group of threads that execute concurrently
– In current hardware, warp size is 32 threads


Branch Divergence
• Remember: hardware is SIMD
• What if threads in the same warp follow two different paths?
• Solution: entire warp executes both paths
– Unneeded values are simply ignored
– Performance can suffer with many divergent branches
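A sketch contrasting a branch that diverges within a warp with one that is uniform across each warp. Both kernels are hypothetical, and they assign the two operations to different elements, so they illustrate the branching pattern rather than compute the same result. For branches this small the compiler may avoid real divergence on its own.

```cuda
#include <cuda_runtime.h>

// Divergent: even and odd threads of the same warp take different paths,
// so each warp executes both branches one after the other.
__global__ void divergent(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) {
        data[i] = data[i] * 2.0f;
    } else {
        data[i] = data[i] + 1.0f;
    }
}

// Warp-uniform: the condition has the same value for every thread in a warp
// (assuming blockDim.x is a multiple of warpSize), so no warp takes both paths.
__global__ void warp_uniform(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / warpSize) % 2 == 0) {
        data[i] = data[i] * 2.0f;
    } else {
        data[i] = data[i] + 1.0f;
    }
}
```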


Memory Coalescing
• Threads in the same half-warp access memory together
• If all threads access successive memory locations:
– All of the accesses are combined (coalesced)
– Result: significantly improved memory performance
• Otherwise:
– Each thread accesses memory separately
– Result: significantly reduced memory performance
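The two access patterns can be sketched with a pair of copy kernels. The kernel names and the `stride` parameter are hypothetical; the actual performance difference depends on the hardware generation.

```cuda
#include <cuda_runtime.h>

// Coalesced: neighboring threads read and write neighboring addresses.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i];
    }
}

// Strided: neighboring threads read addresses `stride` elements apart
// (wrapping around at n), so the reads cannot be combined.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (int) (((long long) i * stride) % n);
        out[i] = in[j];  // the write is coalesced, the read is not
    }
}
```

Timing both kernels on the same array, for example with a stride of 32, is a simple way to observe the effect on a given GPU.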


Memory Coalescing: Examples

[Figure: coalesced accesses compared with a non-coalesced access]


Parallelization Granularity

[Figure: CPU with CPU memory, and GPU with GPU memory]


Kernel Overhead Revisited
• Overhead depends on calling pattern:
– One at a time (synchronous): 9 microseconds
– Back-to-back (asynchronous): 3 microseconds

[Figure: timelines labeled "Synchronous" and "Asynchronous", showing kernel calls, memory transfers, and implicit synchronization]


Lesson 1 Revisited: Reduce Kernel Overhead

• Increase amount of work per kernel call
– Decrease total number of kernel calls
– Amortize overhead of each kernel call across more computation
• Launch kernels back-to-back
– Kernel calls are asynchronous: avoid explicit or implicit synchronization between kernel calls
– Overlap kernel execution on the GPU with driver access on the CPU
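Both pieces of advice can be sketched with a small kernel. All names here are hypothetical, and `cudaDeviceSynchronize()` is the current CUDA runtime call for waiting on the GPU, which is an assumption relative to the API version the slides were written against.

```cuda
#include <cuda_runtime.h>

// Hypothetical small kernel: adds `amount` to every element.
__global__ void increment_by(int *data, int n, int amount) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += amount;
}

// One at a time: the CPU waits after every kernel call.
void launch_synchronous(int *d_data, int n, int iterations) {
    int blocks = (n + 255) / 256;
    for (int k = 0; k < iterations; k++) {
        increment_by<<<blocks, 256>>>(d_data, n, 1);
        cudaDeviceSynchronize();  // explicit synchronization after each call
    }
}

// Back-to-back: launches are queued without waiting. Kernels in the same
// stream still run in order, so one synchronization at the end is enough.
void launch_back_to_back(int *d_data, int n, int iterations) {
    int blocks = (n + 255) / 256;
    for (int k = 0; k < iterations; k++) {
        increment_by<<<blocks, 256>>>(d_data, n, 1);
    }
    cudaDeviceSynchronize();
}

// More work per call: one launch does the work of all the iterations.
void launch_fused(int *d_data, int n, int iterations) {
    int blocks = (n + 255) / 256;
    increment_by<<<blocks, 256>>>(d_data, n, iterations);
    cudaDeviceSynchronize();
}
```

All three functions leave the same values in `d_data`; they differ only in how many kernel calls are made and how often the CPU waits for the GPU.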