© 2010 Michael Boyer 1
Harnessing the Power of GPUs for Non-Graphics Applications
Michael Boyer
Department of Computer Science
University of Virginia
Advisor: Kevin Skadron
© 2010 Michael Boyer 2
Outline
• GPU architecture
• Programming GPUs using CUDA
• Case study: Leukocyte Tracking
• Current work: CPU-GPU Task Sharing
© 2010 Michael Boyer 3
Graphics Processors
• Graphics Processing Units (GPUs) are designed specifically for graphics rendering applications

[Image of a rendered game scene; courtesy of GameSpot]
© 2010 Michael Boyer 4
Graphics Applications
• Graphics applications involve applying the same operation to many pieces of data
• Application characteristics:
  – Massively parallel
  – Only aggregate performance matters
© 2010 Michael Boyer 5
CPU vs. GPU: Architectural Difference 1

[Diagram: a CPU core contains fetch/decode logic, execution units, a register file, out-of-order (OOO) logic, a branch predictor, a data cache, and a memory pre-fetcher; a GPU core contains only fetch/decode logic, execution units, and a register file]

Avoid structures that only improve single-thread performance
© 2010 Michael Boyer 6
CPU vs. GPU: Architectural Difference 2

[Diagram: CPU core as before; in the GPU core, a single fetch/decode unit drives a thread group running on several execution units, each with its own register file]

Amortize the overhead of control logic across multiple execution units (SIMD processing)
© 2010 Michael Boyer 7
CPU vs. GPU: Architectural Difference 3

[Diagram: CPU core as before; in the GPU core, multiple thread groups (1-4), each with its own register file, share the same fetch/decode and execution units]

Use multiple groups of threads to keep execution units busy and hide memory latency
© 2010 Michael Boyer 8
CPU vs. GPU: Architectural Difference 4

[Diagram: the CPU replicates its large core 4 times (cores 1-4); the GPU replicates its small core 30 times (cores 1-30)]

Replicate cores to leverage more parallelism
© 2010 Michael Boyer 9
CPU vs. GPU: Architectural Differences
• Summary: take advantage of abundant parallelism
  – Lots of threads, so focus on aggregate performance
  – Parallelism in space:
    • SIMD processing in each core
    • Many independent SIMD cores across the chip
  – Parallelism in time:
    • Multiple SIMD groups in each core
© 2010 Michael Boyer 10
CPU vs. GPU: Peak Performance

Processor Type            CPU                          GPU
Product                   Intel Xeon W5590 (Nehalem)   AMD Radeon HD 5870
Throughput (GFLOPs)       107                          2,720
Memory Bandwidth (GB/s)   32                           154
Cost                      $1,700                       $450

• Note that these are peak numbers
• What we really care about is performance on real-world applications
© 2010 Michael Boyer 11
General-Purpose Computing on GPUs
• Lots of recent interest in using GPUs to run non-graphics applications (GPGPU)
• Why GPUs? Why now?
  – Recent increases in performance via parallelism
  – Recent increases in programmability
  – Ubiquity in multiple market segments
• Old approach: graphics languages
• New approach: GPGPU languages
  – OpenCL, CUDA
© 2010 Michael Boyer 12
CUDA
• Programming model for running general-purpose applications on NVIDIA GPUs
• Extension to the C programming language
• GPU is a co-processor:
  – Main program runs on the CPU
  – Large computations (kernels) are offloaded to the GPU
  – CPU and GPU have separate memory, so data must be transferred back and forth
© 2010 Michael Boyer 13
CUDA: Typical Program Structure

void function(…) {
    Allocate memory on the GPU
    Transfer input data to the GPU
    Launch kernel on the GPU
    Transfer output data to CPU
}

__global__ void kernel(…) {
    Code executed on the GPU goes here…
}

[Diagram: CPU with CPU memory and GPU with GPU memory]
© 2010 Michael Boyer 14
CUDA: Typical Program Transformation

for (i = 0; i < N; i++) {
    Process array element i
}

Body of loop becomes body of kernel:

__global__ void kernel(…) {
    Determine this thread’s value of i
    Process array element i
}
© 2010 Michael Boyer 15
CUDA Kernel
• Scalar program invoked across many threads
  – Typically one thread per data element
• Overall computation decomposed into a grid of thread blocks
  – Thread blocks are independent and cannot communicate (with some exceptions)
  – Threads within the same block can communicate

[Diagram: a grid of thread blocks 1-5]
© 2010 Michael Boyer 16
Simple Example: Vector Addition

C = A + B

A:  1  2  3  4  5  6  7  8
B:  9 10 11 12 13 14 15 16
C: 10 12 14 16 18 20 22 24
© 2010 Michael Boyer 17
C Code

float *CPU_add_vectors(float *A, float *B, int N) {
    // Allocate memory for the result
    float *C = (float *) malloc(N * sizeof(float));

    // Compute the sum
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }

    // Return the result
    return C;
}
© 2010 Michael Boyer 18
CUDA Kernel
// GPU kernel that computes the vector sum C = A + B
// (each thread computes a single value of the result)
__global__ void kernel(float *A, float *B, float *C, int N) {
    // Determine which element this thread is computing
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    // Compute a single element of the result vector
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}
© 2010 Michael Boyer 19
CUDA Host Code

float *GPU_add_vectors(float *A_CPU, float *B_CPU, int N) {
    // Allocate GPU memory for the inputs and the result
    int vector_size = N * sizeof(float);
    float *A_GPU, *B_GPU, *C_GPU;
    cudaMalloc((void **) &A_GPU, vector_size);
    cudaMalloc((void **) &B_GPU, vector_size);
    cudaMalloc((void **) &C_GPU, vector_size);

    // Transfer the input vectors to GPU memory
    cudaMemcpy(A_GPU, A_CPU, vector_size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_GPU, B_CPU, vector_size, cudaMemcpyHostToDevice);

    // Execute the kernel to compute the vector sum on the GPU
    int num_blocks = ceil((double) N / (double) THREADS_PER_BLOCK);
    kernel <<< num_blocks, THREADS_PER_BLOCK >>> (A_GPU, B_GPU, C_GPU, N);

    // Transfer the result vector from the GPU to the CPU
    float *C_CPU = (float *) malloc(vector_size);
    cudaMemcpy(C_CPU, C_GPU, vector_size, cudaMemcpyDeviceToHost);
    return C_CPU;
}
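For brevity, the slide's host code omits cleanup and error checking. A minimal sketch of what could be added at the end of GPU_add_vectors, before returning C_CPU (an assumption about good practice, not part of the original code; it reuses the A_GPU, B_GPU, and C_GPU names from above and needs <stdio.h> for fprintf):

    // Check that the kernel launched and completed successfully
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));
    }

    // Free the GPU buffers once the result has been copied back
    cudaFree(A_GPU);
    cudaFree(B_GPU);
    cudaFree(C_GPU);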
© 2010 Michael Boyer 20
Example Program Output

./vector_add 50,000,000

GPU: Transfer to GPU:   0.236 sec
     Kernel execution:  0.005 sec
     Transfer from GPU: 0.152 sec
     Total:             0.404 sec
CPU: 0.136 sec

Execution: GPU outperformed CPU by 27.2x
Overall:   CPU outperformed GPU by 2.97x

Vector addition does not do enough work per memory operation to justify offloading!
© 2010 Michael Boyer 21
Case Study: Leukocyte Tracking
© 2010 Michael Boyer 22
Leukocyte Tracking
• Important for evaluating inflammatory drugs
• Velocity measured by tracking leukocytes through multiple frames
• Current approaches:
  – Manual analysis: 1 minute of video takes tens of hours
  – MATLAB: 1 minute of video takes 5 hours
© 2010 Michael Boyer 23
Goal: Leverage CUDA and a GPU to accelerate leukocyte tracking to near real-time speeds
© 2010 Michael Boyer 24
Acceleration
1. Translation: convert MATLAB code to C
2. Parallelization:
   – OpenMP for multi-core CPU
   – CUDA for GPU
• Experimental setup:
  – CPU: 3.2 GHz quad-core Intel Core 2 Extreme X9770
  – GPU: NVIDIA GeForce GTX 280 (PCIe 2.0)
© 2010 Michael Boyer 25
Tracking Algorithm

Inputs: Video frame
        Location of cells in previous frame
Output: Location of cells in current frame

For each cell:
  – Extract sub-image near cell’s old location
  – Compute MGVF matrix over sub-image (→ 99.8% of runtime)
  – Evolve active contour using MGVF matrix
© 2010 Michael Boyer 26
Computing the MGVF Matrix
• Motion Gradient Vector Flow
• MGVF matrix is approximated via an iterative solution procedure

[Images: sub-image near cell and the resulting MGVF matrix]
© 2010 Michael Boyer 27
MGVF Pseudo-code

MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)
© 2010 Michael Boyer 28
Naïve CUDA Implementation
[Chart: speedup over MATLAB. C: 2.0x, C + OpenMP: 7.7x, Naïve CUDA: 0.8x]

• Kernel is called ~50,000 times per frame
• Amount of work per call is small
• Runtime dominated by CUDA overheads:
  – Memory allocation, memory copying, kernel call overhead
© 2010 Michael Boyer 29
Kernel Overhead
• Kernel calls are not cheap!
  – Overhead of one kernel call: 9 microseconds
  – Overhead of one CPU function call: 3 nanoseconds
  – A kernel call is 3,000 times more expensive
• Heaviside kernel:
  – 27% of kernel runtime due to computation
  – 73% of kernel runtime due to kernel overhead
© 2010 Michael Boyer 30
Lesson 1: Reduce Kernel Overhead
• Increase amount of work per kernel call (see the sketch below)
  – Decrease total number of kernel calls
  – Amortize overhead of each kernel call across more computation
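A minimal sketch of the idea, using stand-in element-wise operations rather than the actual leukocyte-tracking kernels: several tiny kernels that each pay their own launch overhead are fused into one larger kernel, so the overhead is paid once.

    // Before: three tiny kernels launched one after another, each paying launch overhead:
    //     scale_kernel<<<blocks, threads>>>(data, N);
    //     offset_kernel<<<blocks, threads>>>(data, N);
    //     clamp_kernel<<<blocks, threads>>>(data, N);

    // After: one larger kernel performs all three steps per element,
    // so the launch overhead is paid once instead of three times.
    __global__ void fused_kernel(float *data, int N) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < N) {
            float v = data[i];
            v = v * 2.0f;                        // scale
            v = v + 1.0f;                        // offset
            v = fminf(fmaxf(v, 0.0f), 255.0f);   // clamp
            data[i] = v;
        }
    }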
© 2010 Michael Boyer 31
Larger Kernel Implementation
MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)
© 2010 Michael Boyer 32
Larger Kernel Implementation

[Chart: runtime breakdown. Kernel execution: 9%, memory copying: 15%, memory allocation: 71%]

[Chart: speedup over MATLAB. C: 2.0x, C + OpenMP: 7.7x, Naïve CUDA: 0.8x, Larger Kernel: 6.3x]
© 2010 Michael Boyer 33
Memory Allocation Overhead
[Log-log chart: time per call (microseconds) vs. megabytes allocated per call, comparing malloc (CPU memory) and cudaMalloc (GPU memory)]
© 2010 Michael Boyer 34
Lesson 2: Reduce Memory Management Overhead
• Reduce the number of memory allocations (see the sketch below)
  – Allocate memory once and reuse it throughout the application
  – If memory size is not known a priori, estimate and only re-allocate if the estimate is too small
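A minimal sketch of that pattern; get_scratch_buffer is a hypothetical helper, not part of the original code. A single GPU buffer is grown only when a request exceeds the current capacity and is reused otherwise.

    #include <cuda_runtime.h>
    #include <stddef.h>

    // Hypothetical reusable GPU scratch buffer (illustration only).
    static void  *scratch     = NULL;
    static size_t scratch_cap = 0;

    // Return a GPU buffer of at least `bytes` bytes, re-allocating only
    // when the current buffer is too small.
    void *get_scratch_buffer(size_t bytes) {
        if (bytes > scratch_cap) {
            if (scratch != NULL) {
                cudaFree(scratch);
            }
            cudaMalloc(&scratch, bytes);
            scratch_cap = bytes;
        }
        return scratch;
    }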
© 2010 Michael Boyer 35
Reduced Allocation Implementation

[Chart: runtime breakdown. Kernel execution: 31%, memory copying: 56%, memory allocation: 3%]

[Chart: speedup over MATLAB. C: 2.0x, C + OpenMP: 7.7x, Naïve CUDA: 0.8x, Larger Kernel: 6.3x, Reduced Allocation: 25.4x]
© 2010 Michael Boyer 36
Memory Transfer Overhead
[Log-log chart: transfer time (milliseconds) vs. megabytes per transfer, for CPU-to-GPU and GPU-to-CPU transfers]
© 2010 Michael Boyer 37
Lesson 3: Reduce Memory Transfer Overhead
• If the CPU operates on values produced by the GPU:
  – Move the operation to the GPU (see the sketch below)
  – May improve performance even if the operation itself is slower on the GPU

[Diagram: timeline comparing (a) values produced by the GPU are transferred to the CPU, operated on, and transferred back to be consumed by the GPU, versus (b) the same operation performed directly on the GPU with no transfers]
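In the leukocyte application, the value in question is the convergence criterion, a reduction over the MGVF matrix. A minimal sketch of a block-level sum reduction that keeps the work on the GPU (not the talk's actual kernel; BLOCK_SIZE and the partial-sums buffer are illustrative assumptions):

    #define BLOCK_SIZE 256

    // Each block sums BLOCK_SIZE elements of `in` and writes one partial sum
    // to `out`; a second, much smaller step finishes the reduction, so the
    // full array never has to be transferred back to the CPU.
    // Assumes blockDim.x == BLOCK_SIZE and BLOCK_SIZE is a power of two.
    __global__ void block_sum(const float *in, float *out, int n) {
        __shared__ float sums[BLOCK_SIZE];
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        sums[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // Tree reduction within the block
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride) {
                sums[threadIdx.x] += sums[threadIdx.x + stride];
            }
            __syncthreads();
        }

        if (threadIdx.x == 0) {
            out[blockIdx.x] = sums[0];
        }
    }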
© 2010 Michael Boyer 38
GPU Reduction Implementation

MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)
© 2010 Michael Boyer 39
GPU Reduction Implementation

[Chart: speedup over MATLAB. C: 2.0x, C + OpenMP: 7.7x, Naïve CUDA: 0.8x, Larger Kernel: 6.3x, Reduced Allocation: 25.4x, GPU Reduction: 60.7x]

[Chart: runtime breakdown. Kernel execution: 80%, memory copying: 1%, memory allocation: 7%]
© 2010 Michael Boyer 40
Persistent Thread Block

MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)
© 2010 Michael Boyer 41
Persistent Thread Block
• Problem: need a global memory fence
  – Multiple thread blocks compute the MGVF matrix
  – Thread blocks cannot communicate with each other
  – So each iteration requires a separate kernel call
• Solution: compute the entire matrix in one thread block (see the sketch below)
  – Arbitrary number of iterations can be computed in a single kernel call
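A minimal sketch of the pattern, not the actual MGVF kernel: a single thread block iterates to convergence inside one launch, using __syncthreads() where a multi-block version would need a global memory fence. The per-element update is a stand-in.

    // Launched as: persistent_solver<<<1, 256>>>(m, n, max_iters);
    __global__ void persistent_solver(float *m, int n, int max_iters) {
        __shared__ int converged;

        for (int iter = 0; iter < max_iters; iter++) {
            if (threadIdx.x == 0) converged = 1;
            __syncthreads();

            // Each thread updates its elements (stand-in for the MGVF update).
            for (int i = threadIdx.x; i < n; i += blockDim.x) {
                float old_val = m[i];
                float new_val = 0.5f * (old_val + 1.0f);
                m[i] = new_val;
                if (fabsf(new_val - old_val) > 1e-4f) converged = 0;
            }
            __syncthreads();   // block-wide barrier replaces the global memory fence

            if (converged) break;
        }
    }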
© 2010 Michael Boyer 42
Persistent Thread Block: Example

[Diagram: in the canonical CUDA approach (1-to-1 mapping between threads and data elements), the MGVF matrix is divided among thread blocks 1-9; in the persistent thread block approach, the entire MGVF matrix is assigned to thread block 1]
© 2010 Michael Boyer 43
Persistent Thread Block: Example

[Diagram: in the canonical CUDA approach, cell 1's computation is spread across every SM and cells are processed one at a time; in the persistent thread block approach, each cell (1-9) is assigned to its own SM, so multiple cells are processed concurrently]

SM = Streaming Multiprocessor (GPU core)
© 2010 Michael Boyer 44
Lesson 4: Avoid Global Memory Fences
• Confine dependent computations to a single thread block
  – Execute an iterative algorithm until convergence in a single kernel call
  – Only efficient if there are multiple independent computations
© 2010 Michael Boyer 45
Persistent Thread Block Implementation
[Chart: speedup over MATLAB. C: 2.0x, C + OpenMP: 7.7x, Naïve CUDA: 0.8x, Larger Kernel: 6.3x, Reduced Allocation: 25.4x, GPU Reduction: 60.7x, Persistent Thread Block: 211.3x (27x over OpenMP)]
© 2010 Michael Boyer 46
Absolute Performance
[Chart: frames per second (FPS). MATLAB: 0.11, C: 0.22, C + OpenMP: 0.83, CUDA: 21.6]
© 2010 Michael Boyer 47
Video Example
© 2010 Michael Boyer 48
Conclusions
• CUDA overheads can be significant bottlenecks
• CUDA provides enormous performance improvements for leukocyte tracking
  – 200x over MATLAB
  – 27x over OpenMP
• Processing time reduced from > 4.5 hours to < 1.5 minutes
• Real-time analysis feasible in the near future

M. Boyer, D. Tarjan, S. T. Acton, and K. Skadron. "Accelerating Leukocyte Tracking using CUDA: A Case Study in Leveraging Manycore Coprocessors." In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), May 2009.
© 2010 Michael Boyer 49
Current Work: CPU-GPU Task Sharing
© 2010 Michael Boyer 50
CPU-GPU Task Sharing
• Offloading decision is generally considered to be binary

[Diagram: a task is sent either to the GPU or to the CPU]
© 2010 Michael Boyer 51
CPU-GPU Task Sharing
• Offload decision does not need to be binary!
  – Dividing a task between the CPU and GPU can provide improved performance over either device alone (see the sketch below)

[Diagram: the same task split between the GPU and the CPU]
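A minimal sketch of the idea applied to the earlier vector-addition example; the split fraction, function name, and OpenMP loop are illustrative assumptions, not the talk's actual system. The kernel, GPU buffers, and THREADS_PER_BLOCK are the ones from the vector-addition slides.

    // Add two vectors, giving the first `gpu_fraction` of the elements to the
    // GPU and the rest to the CPU, with both devices working concurrently.
    void shared_add(float *A_CPU, float *B_CPU, float *C_CPU,
                    float *A_GPU, float *B_GPU, float *C_GPU,
                    int N, float gpu_fraction) {
        int n_gpu = (int)(N * gpu_fraction);   // elements assigned to the GPU

        // Launch the GPU's share asynchronously...
        if (n_gpu > 0) {
            int num_blocks = (n_gpu + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
            kernel <<< num_blocks, THREADS_PER_BLOCK >>> (A_GPU, B_GPU, C_GPU, n_gpu);
        }

        // ...while the CPU works on the remaining elements in parallel.
        #pragma omp parallel for
        for (int i = n_gpu; i < N; i++) {
            C_CPU[i] = A_CPU[i] + B_CPU[i];
        }

        // Copy back only the GPU's portion of the result (also waits for the kernel).
        cudaMemcpy(C_CPU, C_GPU, n_gpu * sizeof(float), cudaMemcpyDeviceToHost);
    }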
© 2010 Michael Boyer 52
Theoretical Performance
[Chart: performance normalized to the best single device vs. the ratio of GPU to CPU performance (0.01 to 100), for four strategies: GPU only, CPU only, CPU+GPU (equal sharing), and CPU+GPU (optimal sharing)]
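A sketch of where these curves come from (my derivation under simple assumptions, not from the slides): normalize CPU throughput to 1 and let the GPU be r times faster. The best single device achieves max(1, r). Optimal sharing uses both devices fully, so its normalized performance is

    (1 + r) / max(1, r) = 1 + min(r, 1/r),

which peaks at 2 when r = 1 and approaches 1 when the devices are very unbalanced. Equal sharing gives half the work to each device and finishes when the slower one does, so its normalized performance is

    2 min(1, r) / max(1, r) = 2 min(r, 1/r),

which also peaks at 2 at r = 1 but falls below 1 (worse than no sharing) once one device is more than twice as fast as the other.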
© 2010 Michael Boyer 53
Research Goal
1. Given an input program written in CUDA or OpenCL, automatically generate a program that can execute on the CPU and GPU concurrently
2. Automatically determine the best division of work:
   – When beneficial, share work between CPU and GPU
   – Otherwise, execute on CPU or GPU exclusively
   – Optimal decision can change at runtime:
     • With different inputs
     • With contention
© 2010 Michael Boyer 54
Proposed System

OpenCL code → Source-to-source Translation Framework → Modified OpenCL code → OpenCL Compiler → CPU/GPU binary

The translation framework transforms all GPU memory allocations, memory transfers, and kernel launches into a form supporting concurrent CPU-GPU execution.
© 2010 Michael Boyer 55
Potential Problems
• One version of the kernel for multiple devices
  – Optimizations for the GPU may hurt performance on the CPU, and vice versa
• Possible (but rare) for thread blocks to communicate with each other
  – Do we try to support this?
• Statically predicting data access patterns can be hard (or even impossible for some applications)
© 2010 Michael Boyer 56
Data Sharing
• If we cannot predict data access patterns statically, then the CPU and the GPU must have a consistent view of memory

[Diagram: 1) computation split between the GPU and the CPU, 2) full data transfer between them]
© 2010 Michael Boyer 57
Data Sharing (2)
• If we can predict data access patterns statically, then we can minimize the data transfer overhead

[Diagram: 1) computation split between the GPU and the CPU, 2) only the needed data transferred]
© 2010 Michael Boyer 58
Preliminary Results (HotSpot)
[Chart: execution time (seconds) vs. percent of computation on GPU (0-100%), comparing static analysis, dynamic analysis, and no sharing]
© 2010 Michael Boyer 59
Conclusions
• GPUs are designed to provide good performance on graphics workloads
  – But they have evolved to support any workload with abundant parallelism
• GPUs can provide large performance improvements
  – But we need to account for the overheads involved in order to take full advantage
• Allowing the CPU and GPU to work together can provide an even larger performance improvement
© 2010 Michael Boyer 60
Acknowledgements
• Funding provided by:
  – NSF grant IIS-0612049
  – SRC grant 1607.001
  – NVIDIA research grant
  – GRC AMD/Mahboob Kahn Ph.D. fellowship
• Equipment donated by NVIDIA
© 2010 Michael Boyer 61
BACKUP
© 2010 Michael Boyer 62
3D Rendering APIs
• High-level abstractions for rendering geometry

Graphics Application → Vertex Program → Rasterization → Fragment Program → Display

Courtesy of D. Luebke, NVIDIA
© 2010 Michael Boyer 63
CUDA: Abstractions
1. Kernel function
   – Mapped onto a grid of thread blocks
2. Scratchpad memory
   – For sharing data within a thread block
3. Barrier synchronization
   – For synchronizing within a thread block
© 2010 Michael Boyer 64
Kernel Function
__global__ void kernel(int *in, int *out) {
    // Determine this thread’s index
    int i = threadIdx.x;

    // Add one to the input value
    out[i] = in[i] + 1;
}
© 2010 Michael Boyer 65
Grid of Thread Blocks
• Grid: 2-dimensional, ≤ 4.3 billion blocks
• Thread block: 3-dimensional, ≤ 512 threads
© 2010 Michael Boyer 66
Launching a Kernel
int num_threads = ...;
int threads_per_block = 256;

// Determine how many thread blocks are needed
// (using either of the two methods shown below)
int num_blocks = ceil((double) num_threads / (double) threads_per_block);
int num_blocks = (num_threads + threads_per_block - 1) / threads_per_block;

// Make structures for grid and thread block dimensions
dim3 grid(num_blocks, 1);
dim3 thread_block(threads_per_block, 1, 1);

// Launch the kernel
kernel <<< grid, thread_block >>> (in, out);
© 2010 Michael Boyer 67
Scratchpad Memory
• Each multiprocessor has 16 KB of software-controlled shared memory
• Variables declared “__shared__” get mapped into this memory
• Values can only be shared among threads within the same thread block
© 2010 Michael Boyer 68
Scratchpad Memory: Example
__global__ void kernel() {
    int i = threadIdx.x;

    // Compute some function
    int v = foo(i);

    // Write the value into shared memory
    __shared__ int values[THREADS_PER_BLOCK];
    values[i] = v;

    // Use the shared values
    ...
}
© 2010 Michael Boyer 69
Barrier Synchronization
• __syncthreads() function
• Each thread waits for all other threads in the thread block
• All values written by every thread are now visible to all other threads
© 2010 Michael Boyer 70
Barrier Synchronization: Example

__global__ void kernel(float *out, int N) {
    int i = threadIdx.x;

    __shared__ int values[THREADS_PER_BLOCK];
    values[i] = foo(i);

    // Wait to ensure all values have been written
    __syncthreads();

    // Compute average of two values
    out[i] = (values[i] + values[(i + 1) % N]) / 2.0f;
}
© 2010 Michael Boyer 71
CUDA Overheads
• Driver initialization: 0.14 seconds
• Kernel launch: 13 μs (see the measurement sketch below)
• GPU memory allocation and deallocation: orders of magnitude slower than on the CPU
• Memory transfer: 15 μs + 1 ms/MB
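Overheads like these can be measured directly. A minimal sketch of one way to estimate kernel-launch overhead using an empty kernel and CUDA events; the loop count and names are illustrative assumptions, not from the slides.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void empty_kernel() {}

    int main() {
        const int launches = 10000;
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Warm up so driver initialization is not included in the measurement.
        empty_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();

        cudaEventRecord(start);
        for (int i = 0; i < launches; i++) {
            empty_kernel<<<1, 1>>>();
        }
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("Average launch overhead: %f microseconds\n", 1000.0f * ms / launches);
        return 0;
    }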
© 2010 Michael Boyer 72
Acceleration using CUDA

[Diagram: CPU-side program flow (allocate GPU memory, transfer input data, launch kernel, transfer results, free GPU memory) with the CUDA kernel executing on the GPU]

Step 1: Determine which code to offload to the GPU as a CUDA kernel
Step 2: Write the CPU-side CUDA code
Step 3: Write and optimize the GPU kernel
© 2010 Michael Boyer 73
Performance Issues
• Branch divergence
• Memory coalescing
• Key concept: warp
  – Group of threads that execute concurrently
  – In current hardware, warp size is 32 threads
© 2010 Michael Boyer 74
Branch Divergence
• Remember: hardware is SIMD
• What if threads in the same warp follow two different paths?
• Solution: entire warp executes both paths (see the example below)
  – Unneeded values are simply ignored
  – Performance can suffer with many divergent branches
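A minimal illustration (hypothetical kernel, not from the slides): because threadIdx.x % 2 differs between neighboring threads in a warp, the warp executes both the if and else paths, masking off the threads that did not take each one.

    __global__ void divergent_kernel(float *data) {
        int i = threadIdx.x;

        // Threads in the same warp take different paths, so the warp
        // executes BOTH branches, one after the other.
        if (i % 2 == 0) {
            data[i] = data[i] * 2.0f;   // even threads
        } else {
            data[i] = data[i] + 1.0f;   // odd threads
        }
    }

    // A divergence-free alternative would branch on the warp index (i / 32),
    // so that all threads in a warp follow the same path.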
© 2010 Michael Boyer 75
Memory Coalescing
• Threads in the same half-warp access memory together (see the example below)
• If all threads access successive memory locations:
  – All of the accesses are combined (coalesced)
  – Result: significantly improved memory performance
• Otherwise:
  – Each thread accesses memory separately
  – Result: significantly reduced memory performance
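A minimal illustration (hypothetical kernels, not from the slides) contrasting an access pattern that coalesces with one that does not; stride is assumed to be greater than 1.

    // Coalesced: consecutive threads read consecutive addresses, so the
    // half-warp's loads are combined into a small number of transactions.
    __global__ void coalesced_copy(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Non-coalesced: consecutive threads read addresses `stride` elements apart,
    // so each load becomes a separate memory transaction.
    __global__ void strided_copy(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n) out[i] = in[i * stride];
    }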
© 2010 Michael Boyer 76
Memory Coalescing: Examples
[Diagram: coalesced accesses (threads reading consecutive addresses) vs. a non-coalesced access pattern]
© 2010 Michael Boyer 77
Parallelization Granularity

[Diagram: CPU with CPU memory and GPU with GPU memory, connected by memory transfers]
© 2010 Michael Boyer 78
Kernel Overhead Revisited
• Overhead depends on calling pattern:
  – One at a time (synchronous): 9 microseconds
  – Back-to-back (asynchronous): 3 microseconds

[Diagram: synchronous kernel calls separated by implicit synchronization vs. asynchronous kernel calls issued back-to-back]
© 2010 Michael Boyer 79
Lesson 1 Revisited: Reduce Kernel Overhead
• Increase amount of work per kernel call
  – Decrease total number of kernel calls
  – Amortize overhead of each kernel call across more computation
• Launch kernels back-to-back (see the sketch below)
  – Kernel calls are asynchronous: avoid explicit or implicit synchronization between kernel calls
  – Overlap kernel execution on the GPU with driver access on the CPU
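A minimal sketch of the back-to-back pattern; the kernels step_a and step_b and the variable names are illustrative assumptions, not from the slides. Launches are queued without any intervening synchronization, and the CPU waits only once, when the result is actually needed.

    // Queue many kernels back-to-back; each launch returns immediately,
    // so the CPU keeps feeding the driver while the GPU executes.
    for (int iter = 0; iter < num_iterations; iter++) {
        step_a <<< num_blocks, threads_per_block >>> (data, N);
        step_b <<< num_blocks, threads_per_block >>> (data, N);
        // No cudaMemcpy or cudaDeviceSynchronize here -- either one would
        // force the CPU to wait and serialize the launches.
    }

    // Synchronize only once, when the final result is needed on the CPU.
    cudaMemcpy(result_CPU, data, N * sizeof(float), cudaMemcpyDeviceToHost);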