© NVIDIA Corporation 2015
GPU Programming
Introduction
DR. CHRISTOPH ANGERER, NVIDIA
AGENDA
Introduction to Heterogeneous Computing
Using Accelerated Libraries
GPU Programming Languages
Introduction to CUDA
Lunch
What is Heterogeneous Computing?
[Figure: application execution on CPU + GPU together; the GPU provides high data parallelism, the CPU provides high serial performance.]
Low Latency or High Throughput?
Latency vs. Throughput
F-22 Raptor
• 1500 mph
• Knoxville to San Jose in 1:25
• Seats 1
Boeing 737
• 485 mph
• Knoxville to San Jose in 4:20
• Seats 200
Latency vs. Throughput
F-22 Raptor
• Latency: 1:25
• Throughput: 1 person / 1.42 hours = 0.7 people/hr.
Boeing 737
• Latency: 4:20
• Throughput: 200 people / 4.33 hours = 46.2 people/hr.
Low Latency or High Throughput?
CPU architecture must minimize latency within each thread
GPU architecture hides latency with computation from other threads
[Figure: a CPU core (low-latency processor) runs threads T1–T4 to completion, minimizing each thread’s latency; a GPU streaming multiprocessor (high-throughput processor) context-switches between warps W1–W4, processing one warp while the others are waiting for data or ready to be processed.]
Anatomy of a GPU-Accelerated Application
Serial code executes in a Host (CPU) thread
Parallel code executes in many Device (GPU) threads across multiple processing elements
[Figure: application timeline alternating between serial code on the Host (CPU) and parallel code on the Device (GPU).]
Simple Processing Flow
1. Copy input data from CPU memory to GPU memory (over the PCI bus)
2. Load GPU code and execute it
3. Copy results from GPU memory back to CPU memory (over the PCI bus)
3 APPROACHES TO GPU PROGRAMMING
Applications can be accelerated in three ways:
Libraries: easy to use, most performance
Compiler Directives: easy to use, portable code
Programming Languages: most performance, most flexibility
SIMPLICITY & PERFORMANCE
Accelerated Libraries
Little or no code change for standard libraries; high performance.
Limited by what libraries are available.
Compiler Directives
Based on existing programming languages, so they are simple and familiar.
Performance may not be optimal because directives often do not expose low-level architectural details.
Parallel Programming Languages
Expose low-level details for maximum performance.
Often more difficult to learn and more time-consuming to implement.
[Figure: the three approaches plotted on a simplicity vs. performance trade-off.]
Libraries
Libraries: Easy, High-Quality Acceleration
Ease of use: using libraries enables GPU acceleration without in-depth knowledge of GPU programming
“Drop-in”: many GPU-accelerated libraries follow standard APIs, thus enabling acceleration with minimal code changes
Quality: libraries offer high-quality implementations of functions encountered in a broad range of applications
Performance: NVIDIA libraries are tuned by experts
GPU-Accelerated Libraries
Providing “Drop-in” Acceleration
Linear Algebra (FFT, BLAS, SPARSE, Matrix): NVIDIA cuFFT, cuBLAS, cuSPARSE
Numerical & Math (RAND, Statistics): NVIDIA Math Lib, NVIDIA cuRAND
Data Structures & AI (Sort, Scan, Zero Sum): GPU AI – Board Games, GPU AI – Path Finding
Visual Processing (Image & Video): NVIDIA NPP, NVIDIA Video Encode
3 Steps to CUDA-accelerated application
Step 1: Substitute library calls with equivalent CUDA library calls
    saxpy(…)  becomes  cublasSaxpy(…)
Step 2: Manage data locality
    - with CUDA: cudaMalloc(), cudaMemcpy(), etc.
    - with cuBLAS: cublasAlloc(), cublasSetVector(), etc.
Step 3: Rebuild and link against the CUDA-accelerated library
    nvcc myobj.o -lcublas
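To make the three steps concrete, here is a minimal sketch of moving a SAXPY call onto the GPU with the cuBLAS v2 API (the handle-based API from cublas_v2.h); the vector length and values are illustrative, and error checking is omitted for brevity:

// saxpy.cu - sketch: y = alpha*x + y on the GPU via cuBLAS
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const int n = 1024;
    float alpha = 2.0f;
    float x[1024], y[1024];
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    // Step 2: manage data locality (allocate and copy host -> device)
    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMalloc((void **)&d_y, n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetVector(n, sizeof(float), x, 1, d_x, 1);
    cublasSetVector(n, sizeof(float), y, 1, d_y, 1);

    // Step 1: the library call replaces a CPU saxpy()
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);

    // Copy the result back (device -> host)
    cublasGetVector(n, sizeof(float), d_y, 1, y, 1);
    printf("y[0] = %f\n", y[0]);   // expect 4.0

    cublasDestroy(handle);
    cudaFree(d_x); cudaFree(d_y);
    return 0;
}

Step 3 is then the build line, e.g. nvcc saxpy.cu -lcublas.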
Programming Languages (CUDA)
GPU Programming Languages
CUDA Fortran (Fortran)
CUDA C (C)
Thrust, CUDA C++ (C++)
PyCUDA, Copperhead (Python)
GPU.NET (C#)
MATLAB, Mathematica, LabVIEW (numerical analytics)
These languages are supported on all CUDA-capable GPUs.
What is CUDA?
CUDA Platform and Programming Model
Exposes GPU computing for general-purpose use
A model for how to offload work to the GPU and how that work is executed on the GPU
CUDA C/C++
Based on industry-standard C/C++
Small set of extensions to enable heterogeneous programming
Straightforward APIs to manage devices, memory, etc.
Anatomy of a CUDA Application
Serial code executes in a Host (CPU) thread (functions)
Parallel code executes in many Device (GPU) threads across multiple processing elements (kernels)
[Figure: CUDA application timeline alternating between serial functions on the Host (CPU) and parallel kernels on the Device (GPU).]
CUDA Kernels: Parallel Threads
A kernel is a function executed on the GPU as an array of threads in parallel
All threads execute the same code
They can take different paths, but the less divergence between “neighboring” threads, the better
Each thread has an ID
Used to select input/output data
Used for control decisions
__global__ void myKernel(float *input, float *output) {
    float x = input[threadIdx.x];
    float y = func(x);
    output[threadIdx.x] = y;
}
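For context, a kernel like this is invoked from the host with a launch configuration. A minimal runnable sketch (func() from the slide is a placeholder, stood in here by sqrtf; Unified Memory, introduced later in this deck, keeps the example short):

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void myKernel(float *input, float *output) {
    float x = input[threadIdx.x];
    output[threadIdx.x] = sqrtf(x);   // stand-in for func(x)
}

int main(void) {
    const int n = 64;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));    // accessible from host and device
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; i++) in[i] = (float)i;

    myKernel<<<1, n>>>(in, out);   // one block of n threads; threadIdx.x = 0..n-1
    cudaDeviceSynchronize();       // wait for the GPU before reading results

    printf("out[4] = %f\n", out[4]);   // expect 2.0
    cudaFree(in); cudaFree(out);
    return 0;
}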
CUDA Kernels: Blocks
Threads are grouped into blocks
Blocks are grouped into a grid
CUDA Kernels: Grids
A grid is a 1D, 2D, or 3D index space
Threads query their position inside the index space using blockIdx.{x,y,z} and threadIdx.{x,y,z} and decide on the work they have to do
Only threads in the same block can communicate directly (via shared memory)
[Figure: a 2D grid of blocks on the GPU, indexed by blockIdx.x and blockIdx.y.]
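As an illustration of threads locating themselves in a 2D index space, here is a hedged sketch (kernel name, array layout, and dimensions are illustrative, not from the deck):

// Each thread computes its global 2D position from block and thread indices
// and processes one element of a row-major width x height array.
__global__ void scale2D(float *data, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)        // guard threads outside the array
        data[row * width + col] *= 2.0f;    // illustrative per-element work
}

// Launch with 2D blocks and a 2D grid via dim3:
// dim3 block(16, 16);
// dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
// scale2D<<<grid, block>>>(data, width, height);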
HELLO WORLD!
Hello World!
int main(void) {
    printf("Hello World!\n");
    return 0;
}
Standard C that runs on the host
The NVIDIA compiler (nvcc) can be used to compile programs with no device code
Output:
$ module load cudatoolkit
$ nvcc hello_world.cu
$ ./a.out
Hello World!
Hello World! with Device Code
__global__ void mykernel(void) { }

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}
Two new syntactic elements…
Hello World! with Device Code
__global__ void mykernel(void) { }
CUDA C/C++ keyword __global__ indicates a function that:
Runs on the device
Is called from host code
nvcc separates source code into host and device components
Device functions (e.g. mykernel()) are processed by the NVIDIA compiler
Host functions (e.g. main()) are processed by the standard host compiler (e.g. gcc, cl.exe)
Hello World! with Device Code
mykernel<<<1,1>>>();
Triple angle brackets mark a call from host code to device code
Also called a “kernel launch”
We’ll return to the parameters (1,1) in a moment
That’s all that is required to execute a function on the GPU!
Parallel Programming in CUDA C/C++
But wait… GPU computing is about massive parallelism!
We need a more interesting example…
We’ll start by adding two integers and build up to vector addition
[Figure: element-wise vector addition of a and b into c.]
Addition on the Device
A simple kernel to add two integers
__global__ void add(int a, int b, int *c) {
    *c = a + b;
}
As before __global__ is a CUDA C/C++ keyword meaning
add() will execute on the device
add() will be called from the host
Note that we pass a pointer for c: necessary because we want the result back on the host
Memory Management
Host and device memory are separate entities
Device pointers point to GPU memory
May be passed to/from host code
May not be dereferenced in host code
Host pointers point to CPU memory
May be passed to/from device code
May not be dereferenced in device code
Simple CUDA API for handling device memory: cudaMalloc(), cudaFree(), cudaMemcpy()
Similar to the C equivalents malloc(), free(), memcpy()
NEW (CUDA 6.0): Unified Memory can do the memory management for you!
Addition on the Device: add()
Returning to our add() kernel
__global__ void add(int a, int b, int *c) {
    *c = a + b;
}
Let’s take a look at main()…
Addition on the Device: main()
int main(void) {
    int a, b, c;   // host copies of a, b, c
    int *d_c;      // device copy of c
    int size = sizeof(int);

    // Allocate space for device copy of c
    cudaMalloc((void **)&d_c, size);

    // Setup input values
    a = 2;
    b = 7;

    // Launch add() kernel on GPU
    add<<<1,1>>>(a, b, d_c);

    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup (only d_c was allocated on the device)
    cudaFree(d_c);
    return 0;
}
Unified Memory Simplifies Data Management
int main(void) {
    int a, b;
    int *c;
    int size = sizeof(int);

    // Allocate managed space for c, accessible from host and device
    cudaMallocManaged(&c, size);

    // Setup input values
    a = 2; b = 7;

    // Launch add() kernel on GPU
    add<<<1,1>>>(a, b, c);

    // Use result on host, no explicit copy needed!
    cudaDeviceSynchronize();
    printf("Result is %d\n", *c);

    // Cleanup
    cudaFree(c);
    return 0;
}
Review
Difference between host and device
Host = CPU
Device = GPU
Using __global__ to declare a function as device code (kernel)
Executes on the device
Called from the host
Passing parameters from host code to a device kernel
RUNNING IN PARALLEL
Moving to Parallel
GPU computing is about massive parallelism
So how do we run code in parallel on the device?
add<<< 1, 1 >>>();   becomes   add<<< 1, N >>>();
Instead of executing add() once, execute it in N threads in parallel
CUDA Threads
With add() running in parallel we can do vector addition
Using parallel blocks (add<<< N, 1 >>>), each block handles one element, indexed with blockIdx.x:
__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
Terminology: a block can also be split into parallel threads
Let’s change add() to use parallel threads (add<<< 1, N >>>) instead of parallel blocks:
__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
We use threadIdx.x instead of blockIdx.x
Let’s have a look at main()…
Unified Memory
#define N 512
int main(void) {
    int *a, *b, *c;
    int size = N * sizeof(int);

    // Allocate space for N-vectors a, b, c
    cudaMallocManaged(&a, size);
    cudaMallocManaged(&b, size);
    cudaMallocManaged(&c, size);

    // Setup input values on host
    fillRandom(N, a, b);

    // Launch one add() kernel block on GPU with N threads
    add<<< 1, N >>>(a, b, c);

    // Use result on host, no explicit copy needed
    cudaDeviceSynchronize();
    printVector(N, c);

    // Cleanup
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
Review
Launching parallel kernels
Kernels are launched with a given grid (1/2/3D index space) and block (1/2/3D index subspace) through the <<<grid, block>>> notation
Launch N blocks with M threads each with add<<<N,M>>>(…);
Use blockIdx.x to access the block index, threadIdx.x to access the thread index
However, the number of threads per block is limited (max 1024)
Also, a block always executes on a single streaming multiprocessor (SM); with one block the other ~14 SMs (on a K40) sit unused
The grid provides a much bigger index space to fill the GPU
WELCOME TO THE GRID
Indexing Arrays with Blocks and Threads
A thread calculates its position in the 1/2/3D grid using blockIdx.x and threadIdx.x
Consider indexing an array with one element per thread (8 threads/block)
With M threads per block, a unique index for each thread is given by:
int index = blockIdx.x * M + threadIdx.x;
[Figure: 32 array elements split across four blocks of 8 threads; threadIdx.x runs 0–7 within each block, blockIdx.x runs 0–3.]
Indexing Arrays: Example
Which thread will operate on the red element (global index 21)?
int index = blockIdx.x * M + threadIdx.x;
          = 2 * 8 + 5;
          = 21;
[Figure: element 21 of the 32-element array is handled by the thread with threadIdx.x = 5 in the block with blockIdx.x = 2 (M = 8).]
Vector Addition with Blocks and Threads
Use the built-in variable blockDim.x for threads per block
int index = blockIdx.x * blockDim.x + threadIdx.x;
Combined version of add() to use parallel threads and parallel blocks
__global__ void add(int *a, int *b, int *c) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    c[index] = a[index] + b[index];
}
What changes need to be made in main()?
Unified Memory
#define N (2048*2048)
#define THREADS_PER_BLOCK 512
int main(void) {
    int *a, *b, *c;
    int size = N * sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMallocManaged(&a, size);
    cudaMallocManaged(&b, size);
    cudaMallocManaged(&c, size);

    // Setup input values on host
    fillRandom(N, a, b);

    // Launch add() kernel on GPU with N/THREADS_PER_BLOCK blocks
    add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(a, b, c);

    // Use result on host
    cudaDeviceSynchronize();
    printVector(N, c);

    // Cleanup
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
Handling Arbitrary Vector Sizes
Typical problem sizes are not exact multiples of blockDim.x
Avoid accessing beyond the end of the arrays:
__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}
Update the kernel launch (with M = threads per block), rounding the number of blocks up:
add<<<(N + M - 1) / M, M>>>(a, b, c, N);
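Putting it together, a hedged sketch of a complete program for an arbitrary N (the input-setup loop stands in for the fillRandom/printVector helpers from the earlier slides, which were not shown):

#include <cuda_runtime.h>
#include <stdio.h>

#define N 1000000              // deliberately not a multiple of the block size
#define THREADS_PER_BLOCK 512

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}

int main(void) {
    int *a, *b, *c;
    size_t size = N * sizeof(int);
    cudaMallocManaged(&a, size);
    cudaMallocManaged(&b, size);
    cudaMallocManaged(&c, size);
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    // Round the number of blocks up so every element is covered
    int blocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    add<<<blocks, THREADS_PER_BLOCK>>>(a, b, c, N);
    cudaDeviceSynchronize();

    printf("c[N-1] = %d\n", c[N - 1]);   // expect 3 * (N - 1)
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}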
ADVANCED CONCEPTS: COOPERATING THREADS
CONCEPTS: Heterogeneous Computing · Blocks · Threads · Indexing · Shared memory · __syncthreads() · Asynchronous operation · Handling errors · Managing devices
1D Stencil
Consider applying a 1D stencil to a 1D array of elements
Each output element is the sum of input elements within a radius
If radius is 3, then each output element is the sum of 7 input elements
[Figure: input array “in” and output array “out”; each output element sums a 7-element window of the input.]
Implementing Within a Block
Each thread processes one output element
blockDim.x elements per block
Input elements are read several times
With radius 3, each input element is read seven times
[Figure: threads 0–8 each produce one output element; their 7-element input windows overlap, including a “radius” region on each side of the block.]
Sharing Data Between Threads
Terminology: within a block, threads share data via shared memory
Extremely fast on-chip memory (in contrast to device memory, which is referred to as global memory)
Like a user-managed cache
Declared using __shared__, allocated per block
Data is not visible to threads in other blocks
Implementing With Shared Memory
Cache data in shared memory:
Read (blockDim.x + 2 * radius) input elements from global memory to shared memory
Compute blockDim.x output elements
Write blockDim.x output elements to global memory
Each block needs a halo of radius elements at each boundary
[Figure: a block’s blockDim.x output elements with a halo of radius input elements on the left and right.]
Stencil Kernel (1 of 2)
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
Stencil Kernel (2 of 2)
    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
Data Race!
The stencil example will not work…
Suppose thread 15 reads the halo before thread 0 has fetched it (illustrated with BLOCK_SIZE = 16, RADIUS = 3)…
...
temp[lindex] = in[gindex];                  // thread 15 stores at temp[18]
if (threadIdx.x < RADIUS) {                 // skipped by thread 15, since threadIdx.x >= RADIUS
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
int result = 0;
for (int offset = -RADIUS; offset <= RADIUS; offset++)
    result += temp[lindex + offset];        // thread 15 loads from temp[19], possibly before thread 0 has written it
...
__syncthreads()
void __syncthreads();
Synchronizes all threads within a block
Used to prevent RAW / WAR / WAW hazards
All threads must reach the barrier
In conditional code, the condition must be uniform across the block
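To illustrate the uniform-condition rule, a hedged sketch using the stencil’s names (not a slide from the original deck):

// WRONG: only threads with threadIdx.x < RADIUS reach the barrier;
// the rest of the block never arrives: undefined behavior (typically a hang)
if (threadIdx.x < RADIUS) {
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    __syncthreads();
}

// RIGHT: every thread in the block executes the same barrier
if (threadIdx.x < RADIUS) {
    temp[lindex - RADIUS] = in[gindex - RADIUS];
}
__syncthreads();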
Stencil Kernel (with __syncthreads(), 1 of 2)
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();
Stencil Kernel (with __syncthreads(), 2 of 2)
    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
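A hedged sketch of a host-side driver for this kernel (BLOCK_SIZE, RADIUS, and N are illustrative; the input is padded with RADIUS extra elements on each side so the first and last blocks have a valid halo to read):

#include <cuda_runtime.h>
#include <stdio.h>

#define RADIUS 3
#define BLOCK_SIZE 128
#define N (BLOCK_SIZE * 8)

// stencil_1d as defined above

int main(void) {
    int *in, *out;
    cudaMallocManaged(&in, (N + 2 * RADIUS) * sizeof(int));
    cudaMallocManaged(&out, N * sizeof(int));
    for (int i = 0; i < N + 2 * RADIUS; i++) in[i] = 1;

    // Offset the input pointer so in[gindex - RADIUS] never underruns the array
    stencil_1d<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(in + RADIUS, out);
    cudaDeviceSynchronize();

    printf("out[10] = %d\n", out[10]);   // each output sums 2*RADIUS+1 = 7 ones
    cudaFree(in); cudaFree(out);
    return 0;
}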
Review (1 of 2)
Launching parallel threads
Launch N blocks with M threads per block with kernel<<<N,M>>>(…);
Use blockIdx.x to access block index within grid
Use threadIdx.x to access thread index within block
Assign elements to threads:
int index = blockIdx.x * blockDim.x + threadIdx.x;
Review (2 of 2)
Use __shared__ to declare a variable/array in shared memory
Data is shared between threads in a block
Not visible to threads in other blocks
Use __syncthreads() as a barrier
Use to prevent data hazards
MANAGING THE DEVICE
Coordinating Host & Device
Kernel launches are asynchronous
Control returns to the CPU immediately
The CPU needs to synchronize before consuming the results:
cudaMemcpy(): blocks the CPU until the copy is complete; the copy begins once all preceding CUDA calls have completed
cudaMemcpyAsync(): asynchronous, does not block the CPU
cudaDeviceSynchronize(): blocks the CPU until all preceding CUDA calls have completed
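A small sketch of this behavior (the fill kernel and names are illustrative):

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void fill(int *data, int value) {
    data[threadIdx.x] = value;
}

int main(void) {
    const int n = 32;
    int h_data[32];
    int *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(int));

    fill<<<1, n>>>(d_data, 42);   // asynchronous: control returns immediately
    // (the CPU could do independent work here, overlapping with the GPU)

    // cudaMemcpy blocks until the preceding kernel and the copy have completed
    cudaMemcpy(h_data, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h_data[0] = %d\n", h_data[0]);   // 42

    cudaFree(d_data);
    return 0;
}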
Reporting Errors
All CUDA API calls return an error code (cudaError_t)
Error in the API call itself, OR
Error in an earlier asynchronous operation (e.g. a kernel)
Get the error code for the last error:
cudaError_t cudaGetLastError(void)
Get a string to describe the error:
const char *cudaGetErrorString(cudaError_t)
Note that cudaGetLastError() also resets the error state, so capture the code once:
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("%s\n", cudaGetErrorString(err));
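Many CUDA codebases wrap this pattern in a checking macro; a common idiom (a sketch, not part of the CUDA API itself):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc((void **)&d_c, size));
// mykernel<<<1,1>>>();
// CUDA_CHECK(cudaGetLastError());        // catches launch configuration errors
// CUDA_CHECK(cudaDeviceSynchronize());   // surfaces asynchronous execution errors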
Device Management
Application can query and select GPUs:
cudaGetDeviceCount(int *count)
cudaSetDevice(int device)
cudaGetDevice(int *device)
cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
Multiple host threads can share a device
A single host thread can manage multiple devices
cudaSetDevice(i) to select current device
cudaMemcpy(…) for peer-to-peer copies
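A minimal sketch of querying and selecting devices with these calls (the printed fields are a small, illustrative subset of cudaDeviceProp):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, %d SMs, %zu MB global memory\n",
               i, prop.name, prop.multiProcessorCount,
               prop.totalGlobalMem / (1024 * 1024));
    }
    if (count > 0)
        cudaSetDevice(0);   // make device 0 the current device
    return 0;
}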
Next Steps
We have only scratched the surface
But you can already do a lot of useful things with what we learned
Next topics:
Thread cooperation via shared memory and __syncthreads()
Asynchronous Kernel launches and memory operations
Device management (cudaSetDevice, error handling…)
Atomic operations
Launching kernels from the GPU
MPI
…
Links and References
Accelerated Libraries: https://developer.nvidia.com/gpu-accelerated-libraries
CUDA C/C++ and Fortran: http://developer.nvidia.com/cuda-toolkit
Thrust C++ Template Library: http://developer.nvidia.com/thrust
CUDA Programming Guide: http://docs.nvidia.com/cuda/
Parallel Forall blog: http://devblogs.nvidia.com/parallelforall/
PyCUDA (Python): http://mathema.tician.de/software/pycuda
GTC On Demand: http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php
LUNCH