Slides from my talk on CUDA programming at DevLink 2011
Intro to GPGPU Programming with CUDA
Rob Gillen | rob.gillenfamily.net | @argodev
Intro to GPGPU Programming with CUDA
Rob Gillen
Welcome!
Goals:
Overview of GPGPU with CUDA
“Vision Casting” for how you can use GPUs to improve your application
Introduction to CUDA C
Outline
Why GPGPUs?
Applications
Tooling
Hands-On: Matrix Multiplication
Context Setting
Level of the Talk: Introductory/Overview
Perspective of the Speaker: 12+ years as a professional developer
4+ years at Oak Ridge National Laboratory
Disclaimer: Many (most) of these slides are courtesy of NVIDIA Corporation, although they bear no responsibility for inaccuracies I introduce during this presentation.
WHY USE GPUS?
Motivation
CPU vs. GPU
GPU devotes more transistors to data processing
Specialized (purpose-designed) Silicon
NVIDIA Fermi
~1.5TFLOPS (SP)/~800GFLOPS (DP)
230 GB/s DRAM Bandwidth
Motivation
Floating-point operations per second (FLOPS) and memory bandwidth for the CPU and GPU
Example: Sparse Matrix-Vector
CPU Results from “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Williams et al, Supercomputing 2007
Rayleigh-Bénard Results
Double precision
384 x 384 x 192 grid (max that fits in 4GB)
Vertical slice of temperature at y=0
Transition from stratified (left) to turbulent (right)
Regime depends on Rayleigh number: Ra = gαΔTL³/κν
8.5x speedup versus Fortran code running on 8-core 2.5 GHz Xeon
G80 Characteristics
367 GFLOPS peak performance (25-50 times that of current high-end microprocessors)
265 GFLOPS sustained for apps such as VMD
Massively parallel: 128 cores, 90 W
Massively threaded: sustains 1000s of threads per app
30-100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics
Supercomputer Comparison
Applications
Exciting applications in the future mass-computing market have traditionally been considered “supercomputing applications”
Molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products. These “super-apps” represent and model the physical, concurrent world.
Various granularities of parallelism exist, but… the programming model must not hinder parallel implementation, and data delivery needs careful management.
*Not* for all applications
SPMD (Single Program, Multiple Data) applications are best (data parallel)
Operations need to be of sufficient size to overcome overhead
Think millions of operations.
Raytracing
NVIRT: CUDA Ray Tracing API
Tooling
VS 2010 C++ (Express is OK… sort-of.)
NVIDIA CUDA-Capable GPU
NVIDIA CUDA Toolkit (v4+)
NVIDIA CUDA Tools (v4+)
GPU Computing SDK
NVIDIA Parallel Nsight
Parallel Debugging
Parallel Analysis
VS Project Templates
Outline of CUDA Basics
Basic Memory Management
Basic Kernels and Execution on GPU
Development Resources
See the Programming Guide for the full API
See the Getting Started Guide for installation and compilation instructions
Both guides are included in the toolkit
Memory Spaces
CPU and GPU have separate memory spaces. Data is moved across the PCIe bus.
Use functions to allocate/set/copy memory on the GPU; very similar to the corresponding C functions.
Pointers are just addresses. Can’t tell from the pointer value whether the address is on the CPU or GPU.
Must exercise care when dereferencing: dereferencing a CPU pointer on the GPU will likely crash
Converse is also true
GPU Memory Allocation / Release
Host (CPU) manages device (GPU) memory:
cudaMalloc(void **pointer, size_t nbytes)
cudaMemset(void *pointer, int value, size_t count)
cudaFree(void *pointer)
int n = 1024;
int nbytes = 1024*sizeof(int);
int * d_a = 0;
cudaMalloc( (void**)&d_a, nbytes);
cudaMemset(d_a, 0, nbytes);
cudaFree(d_a);
Data Copies
cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
Returns after copy is complete
Blocks CPU thread until all bytes have been copied
Doesn’t start copying until previous CUDA calls complete
enum cudaMemcpyKind: cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
Non-blocking memcopies are provided
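Putting these calls together, a minimal round-trip might look like the following sketch (my own illustration, not slide code; the buffer names h_a and d_a and the kernel-launch placeholder are assumptions):

// Minimal round-trip sketch (inside a host function; needs <stdlib.h> and the CUDA runtime)
int n = 16;
int nbytes = n * sizeof(int);
int *h_a = (int*)malloc(nbytes);          // host buffer
int *d_a = 0;                             // device buffer

for (int i = 0; i < n; i++) h_a[i] = i;   // fill host data

cudaMalloc((void**)&d_a, nbytes);                      // allocate on GPU
cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);  // host -> device
// ... launch kernel(s) that read/write d_a ...
cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);  // device -> host

cudaFree(d_a);
free(h_a);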
DEMO: Code Walkthrough 1
CUDA Programming Model
Parallel code (kernel) is launched and executed on a device by many threads
Threads are grouped into thread blocks
Parallel code is written for a thread. Each thread is free to execute a unique code path.
Built-in thread and block ID variables
Thread Hierarchy
Threads launched for a parallel section are partitioned into thread blocks
Grid == all blocks for a given launch
A thread block is a group of threads that can: Synchronize their execution
Communicate via shared memory
Threads → Thread Blocks → Grid
Block IDs and Threads
Threads: 3D IDs, unique within a block
Blocks: 2D IDs, unique within a grid
Dimensions set at launch time
Can be unique for each grid
Built-in variables: threadIdx, blockIdx, blockDim, gridDim
[Figure 3.2: An Example of CUDA Thread Organization. The host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid on the device consists of 2D blocks (Block(0,0) … Block(1,1)), and each block, e.g. Block(1,1), contains a 3D arrangement of threads (Thread(0,0,0) … Thread(3,1,0), etc.). Courtesy: NVIDIA]
Code executed on GPU
A C function with some restrictions: must return void
Can only dereference GPU pointers
No static variables
Some additional restrictions for older GPUs
Must be declared with a qualifier: __global__ : launched by CPU, cannot be called from GPU, must return void
__device__ : called from other GPU functions, cannot be launched by the CPU
__host__ : can be executed only by the CPU
__host__ and __device__ qualifiers can be combined
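To make the qualifiers concrete, here is a small illustrative sketch of my own (the function names square, twice, and scaleAll are assumptions, not from the talk):

__device__ float square(float x)            // callable only from GPU code
{
    return x * x;
}

__host__ __device__ float twice(float x)    // compiled for both CPU and GPU
{
    return 2.0f * x;
}

__global__ void scaleAll(float *data)       // launched by the CPU, returns void
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] = square(twice(data[idx]));
}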
DEMO: Code Walkthrough 2
Launching kernels on GPU
Launch Parameters: Grid dimensions (up to 2D), dim3 type
Thread-block dimensions (up to 3D), dim3 type
Shared memory: number of bytes per block, for extern smem variables declared without size (see the sketch after the examples below)
Optional, 0 by default
Stream ID: Optional, 0 by default
dim3 grid(16, 16);
dim3 block(16, 16);
kernel<<<grid, block, 0, 0>>>(…);
kernel<<<32, 512>>>(…);
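As an illustration of the shared-memory launch parameter, a kernel can declare an unsized extern shared array and receive its size in bytes as the third launch argument. This is my own sketch (stageToShared, d_in, and d_out are assumed names), not code from the talk:

// Unsized extern shared array, sized by the 3rd launch parameter
__global__ void stageToShared(float *in, float *out)
{
    extern __shared__ float smem[];                   // size set at launch
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    smem[threadIdx.x] = in[idx];
    __syncthreads();
    out[idx] = smem[blockDim.x - 1 - threadIdx.x];    // e.g. reverse within the block
}

// 32 blocks of 256 threads, 256 floats of dynamic shared memory per block
stageToShared<<<32, 256, 256 * sizeof(float)>>>(d_in, d_out);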
Kernel Variations and Output
__global__ void kernel (int *a)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
a[idx] = 7;
} Output: 7777777777777777
__global__ void kernel (int *a)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
a[idx] = blockIdx.x;
} Output: 0000111122223333
__global__ void kernel (int *a)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
a[idx] = threadIdx.x;
} Output: 0123012301230123
Code Walkthrough 3
Build on Walkthrough 2
Write a kernel to increment n×m integers
Copy the result back to CPU
Print the values
DEMO: Code Walkthrough 3
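A minimal sketch of what this walkthrough could look like, based only on the bullet points above (my reconstruction, not the actual demo code; the sizes n = 4 and m = 8 are assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

__global__ void increment(int *a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = a[idx] + 1;                   // each thread bumps one integer
}

int main(void)
{
    int n = 4, m = 8;                      // n blocks of m threads -> n*m ints
    int nbytes = n * m * sizeof(int);
    int *h_a = (int*)malloc(nbytes);
    int *d_a = 0;
    memset(h_a, 0, nbytes);

    cudaMalloc((void**)&d_a, nbytes);
    cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);
    increment<<<n, m>>>(d_a);
    cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);

    for (int i = 0; i < n * m; i++)
        printf("%d ", h_a[i]);             // prints thirty-two 1s

    printf("\n");
    cudaFree(d_a);
    free(h_a);
    return 0;
}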
Blocks must be independent
Any possible interleaving of blocks should be valid. Presumed to run to completion without pre-emption.
Can run in any order
Can run concurrently OR sequentially
Blocks may coordinate but not synchronize. Shared queue pointer: OK (see the sketch below)
Shared lock: BAD … can easily deadlock
Independence requirement gives scalability
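One way to picture “coordinate but don’t synchronize” is a shared queue pointer advanced with an atomic, since no block ever waits on another. A rough sketch of mine (processChunks and nextChunk are assumed names; nextChunk starts at 0 on the host):

// Blocks coordinate through a shared queue pointer with atomicAdd
__global__ void processChunks(int *nextChunk, int numChunks, float *data)
{
    __shared__ int chunk;
    if (threadIdx.x == 0)
        chunk = atomicAdd(nextChunk, 1);   // this block claims the next chunk
    __syncthreads();                        // synchronization *within* the block only

    if (chunk < numChunks) {
        int i = chunk * blockDim.x + threadIdx.x;
        data[i] = 2.0f * data[i];           // each thread handles one element
    }
    // A shared lock (spin-waiting on another block) could deadlock instead,
    // because blocks are not guaranteed to run concurrently.
}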
Transparent Scalability
Hardware is free to assign blocks to any processor at any time
A kernel scales across any number of parallel processors
[Figure: the same kernel grid of Block 0 through Block 7 shown running on two different devices: a narrower device executes the blocks two at a time over four steps, while a wider device executes them four at a time over two steps. Each block can execute in any order relative to other blocks.]
EXTENDED EXAMPLE: Matrix Multiplication
A Simple Running Example: Matrix Multiplication
A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs
Leave shared memory usage until later
Local, register usage
Thread ID usage
Memory data transfer API between host and device
Assume square matrix for simplicity
Programming Model:Square Matrix Multiplication Example
P = M * N of size WIDTH x WIDTH
Without tiling: One thread calculates one element of P
M and N are loaded WIDTH times from global memory
Memory Layout of Matrix in C
[Figure: a 4×4 matrix M stored in row-major order in C; row 0 (M0,0 … M0,3) is followed by row 1 (M1,0 … M1,3), then rows 2 and 3, in one contiguous array.]
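In this row-major layout, element (row, col) of a Width × Width matrix sits at offset row * Width + col. A tiny illustrative helper of my own (not slide code):

// Row-major access: element (row, col) of a Width x Width matrix
float getElement(const float *M, int row, int col, int Width)
{
    return M[row * Width + col];
}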
Simple Matrix Multiplication (CPU)

void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i) {
        for (int j = 0; j < Width; ++j) {
            float sum = 0;
            for (int k = 0; k < Width; ++k) {
                float a = M[i * Width + k];
                float b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
    }
}
Simple Matrix Multiplication (GPU)
void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;
    …
    // 1. Allocate and load M, N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate P on the device
    cudaMalloc((void**)&Pd, size);
Simple Matrix Multiplication (GPU)
    // 2. Kernel invocation code – to be shown later
    …
    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
}
Kernel Function
// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;

Kernel Function (contd.)

    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}
Kernel Function (full)
// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}
Kernel Invocation (Host Side)
    // Setup the execution configuration
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
Only One Thread Block Used
One block of threads computes matrix Pd
Each thread computes one element of Pd
Each thread: loads a row of matrix Md; loads a column of matrix Nd; performs one multiply and one addition for each pair of Md and Nd elements. Compute-to-off-chip-memory-access ratio is close to 1:1 (not very high).
Size of matrix limited by the number of threads allowed in a thread block
[Figure: Grid 1 contains a single Block 1 covering the whole WIDTH × WIDTH output; the highlighted Thread(2, 2) computes one element of Pd from a row of Md and a column of Nd.]
Handling Arbitrary Sized Square Matrices
Have each 2D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix
Each block has (TILE_WIDTH)² threads
Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks
You still need to put a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than max grid size (64K)!
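A sketch of what the index arithmetic and launch could look like under this scheme (my reconstruction, still without shared-memory tiling; TILE_WIDTH = 16 and a Width divisible by TILE_WIDTH are assumptions):

#define TILE_WIDTH 16   // assumed tile size

__global__ void MatrixMulKernelTiledGrid(float *Md, float *Nd, float *Pd, int Width)
{
    // each thread computes one element of its block's TILE_WIDTH x TILE_WIDTH tile
    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];
    Pd[row * Width + col] = Pvalue;
}

// Host side (inside MatrixMulOnDevice): a 2D grid of (Width / TILE_WIDTH)^2 blocks
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
MatrixMulKernelTiledGrid<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);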
Small Example
[Figure: small example with TILE_WIDTH = 2. The 4×4 result Pd is split into four 2×2 tiles; Block(0,0), Block(1,0), Block(0,1), and Block(1,1) each compute one tile from the corresponding rows of Md and columns of Nd, e.g. Block(0,0) produces Pd0,0, Pd1,0, Pd0,1, and Pd1,1.]
Cleanup Topics
Memory Management: Pinned Memory (Zero-Transfer)
Portable Pinned Memory
Multi-GPU
Wrappers (Python, Java, .NET)
Kernels
Atomics
Thread Synchronization (staged reductions; see the sketch below)
NVCC
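As a taste of the “staged reductions” item above, a block-level sum typically halves the number of active threads at each stage and calls __syncthreads() between stages. A rough sketch of mine (assumes a power-of-two block size and blockDim.x * sizeof(float) of dynamic shared memory at launch), not code from the talk:

// Staged (tree) reduction within one block, using __syncthreads()
__global__ void blockSum(const float *in, float *blockSums)
{
    extern __shared__ float partial[];               // one float per thread
    int tid = threadIdx.x;
    partial[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // halve the number of active threads at each stage
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                             // wait for the stage to finish
    }

    if (tid == 0)
        blockSums[blockIdx.x] = partial[0];          // one partial sum per block
}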