Slides from my talk on CUDA programming at DevLink 2011
Intro to GPGPU Programming with CUDA
Rob Gillen | rob.gillenfamily.net | @argodev
Intro to GPGPU Programming with CUDA
Rob Gillen
Welcome!
Goals:
Overview of GPGPU with CUDA
“Vision Casting” for how you can use GPUs to improve your application
Introduction to CUDA C
Outline
Why GPGPUs?
Applications
Tooling
Hands-On: Matrix Multiplication
Context Setting
Level of the Talk: Introductory/Overview
Perspective of the Speaker: 12+ years as a professional developer
4+ years at Oak Ridge National Laboratory
Disclaimer: Many (most) of these slides are courtesy of NVIDIA Corporation, although they bear no responsibility for inaccuracies I introduce during this presentation.
WHY USE GPUS?
Motivation
CPU vs. GPU
GPU devotes more transistors to data processing
Specialized (purpose-designed) Silicon
NVIDIA Fermi
~1.5TFLOPS (SP)/~800GFLOPS (DP)
230 GB/s DRAM Bandwidth
Motivation
Floating-point operations per second (FLOPS) and memory bandwidth for the CPU and GPU
Example: Sparse Matrix-Vector
CPU Results from “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Williams et al, Supercomputing 2007
Rayleigh-Bénard Results
Double precision
384 x 384 x 192 grid (max that fits in 4GB)
Vertical slice of temperature at y=0
Transition from stratified (left) to turbulent (right)
Regime depends on Rayleigh number: Ra = gαΔTL³/κν
8.5x speedup versus Fortran code running on 8-core 2.5 GHz Xeon
G80 Characteristics
367 GFLOPS peak performance (25-50 times that of current high-end microprocessors)
265 GFLOPS sustained for apps such as VMD
Massively parallel: 128 cores, 90 W
Massively threaded: sustains 1000s of threads per app
30-100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics
Supercomputer Comparison
Applications
Exciting applications in the future mass-computing market have traditionally been considered “supercomputing applications”
Molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products. These “super-apps” represent and model the physical, concurrent world.
Various granularities of parallelism exist, but… the programming model must not hinder parallel implementation, and data delivery needs careful management.
*Not* for all applications
SPMD (Single Program, Multiple Data) applications are best (data parallel)
Operations need to be of sufficient size to overcome overhead
Think millions of operations.
Raytracing
NVIRT: CUDA Ray Tracing API
Tooling
VS 2010 C++ (Express is OK… sort-of.)
NVIDIA CUDA-Capable GPU
NVIDIA CUDA Toolkit (v4+)
NVIDIA CUDA Tools (v4+)
GPU Computing SDK
NVIDIA Parallel Nsight
Parallel Debugging
Parallel Analysis
VS Project Templates
Outline of CUDA Basics
Basic Memory Management
Basic Kernels and Execution on GPU
Development Resources
See the Programming Guide for the full API
See the Getting Started Guide for installation and compilation instructions
Both guides are included in the toolkit
Memory Spaces
CPU and GPU have separate memory spaces. Data is moved across the PCIe bus.
Use functions to allocate/set/copy memory on the GPU; very similar to the corresponding C functions.
Pointers are just addresses. Can’t tell from the pointer value whether the address is on the CPU or GPU.
Must exercise care when dereferencing: dereferencing a CPU pointer on the GPU will likely crash
Converse is also true
GPU Memory Allocation / Release
Host (CPU) manages device (GPU) memory:
cudaMalloc(void **pointer, size_t nbytes)
cudaMemset(void *pointer, int value, size_t count)
cudaFree(void *pointer)
int n = 1024;
int nbytes = 1024*sizeof(int);
int * d_a = 0;
cudaMalloc( (void**)&d_a, nbytes);
cudaMemset(d_a, 0, nbytes);
cudaFree(d_a);
Data Copies
cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
Returns after copy is complete
Blocks CPU thread until all bytes have been copied
Doesn’t start copying until previous CUDA calls complete
enum cudaMemcpyKind: cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
Non-blocking memcopies are provided
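Putting these calls together, a minimal round-trip might look like the following sketch (my own illustration, not slide code; the buffer names h_a and d_a and the kernel-launch placeholder are assumptions):

// Minimal round-trip sketch (inside a host function; needs <stdlib.h> and the CUDA runtime)
int n = 16;
int nbytes = n * sizeof(int);
int *h_a = (int*)malloc(nbytes);          // host buffer
int *d_a = 0;                             // device buffer

for (int i = 0; i < n; i++) h_a[i] = i;   // fill host data

cudaMalloc((void**)&d_a, nbytes);                      // allocate on GPU
cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);  // host -> device
// ... launch kernel(s) that read/write d_a ...
cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);  // device -> host

cudaFree(d_a);
free(h_a);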
DEMO: Code Walkthrough 1
CUDA Programming Model
Parallel code (kernel) is launched and executed on a device by many threads
Threads are grouped into thread blocks
Parallel code is written for a thread. Each thread is free to execute a unique code path.
Built-in thread and block ID variables
Thread Hierarchy
Threads launched for a parallel section are partitioned into thread blocks
Grid == all blocks for a given launch
A thread block is a group of threads that can: Synchronize their execution
Communicate via shared memory
Threads → Thread Blocks → Grid
Block IDs and Threads
Threads: 3D IDs, unique within a block
Blocks: 2D IDs, unique within a grid
Dimensions set at launch time
Can be unique for each grid
Built-in variables: threadIdx, blockIdx, blockDim, gridDim
[Figure 3.2: An Example of CUDA Thread Organization. The host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid on the device consists of 2D blocks (Block(0,0) … Block(1,1)), and each block, e.g. Block(1,1), contains a 3D arrangement of threads (Thread(0,0,0) … Thread(3,1,0), etc.). Courtesy: NVIDIA]
Code executed on GPU
A C function with some restrictions: must return void
Can only dereference GPU pointers
No static variables
Some additional restrictions for older GPUs
Must be declared with a qualifier: __global__ : launched by CPU, cannot be called from GPU, must return void
__device__ : called from other GPU functions, cannot be launched by the CPU
__host__ : can be executed only by the CPU
__host__ and __device__ qualifiers can be combined
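To make the qualifiers concrete, here is a small illustrative sketch of my own (the function names square, twice, and scaleAll are assumptions, not from the talk):

__device__ float square(float x)            // callable only from GPU code
{
    return x * x;
}

__host__ __device__ float twice(float x)    // compiled for both CPU and GPU
{
    return 2.0f * x;
}

__global__ void scaleAll(float *data)       // launched by the CPU, returns void
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] = square(twice(data[idx]));
}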
DEMO: Code Walkthrough 2
Launching kernels on GPU
Launch Parameters: Grid dimensions (up to 2D), dim3 type
Thread-block dimensions (up to 3D), dim3 type
Shared memory: number of bytes per block, for extern smem variables declared without size (see the sketch after the examples below)
Optional, 0 by default
Stream ID: Optional, 0 by default
dim3 grid(16, 16);
dim3 block(16, 16);
kernel<<<grid, block, 0, 0>>>(…);
kernel<<<32, 512>>>(…);
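As an illustration of the shared-memory launch parameter, a kernel can declare an unsized extern shared array and receive its size in bytes as the third launch argument. This is my own sketch (stageToShared, d_in, and d_out are assumed names), not code from the talk:

// Unsized extern shared array, sized by the 3rd launch parameter
__global__ void stageToShared(float *in, float *out)
{
    extern __shared__ float smem[];                   // size set at launch
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    smem[threadIdx.x] = in[idx];
    __syncthreads();
    out[idx] = smem[blockDim.x - 1 - threadIdx.x];    // e.g. reverse within the block
}

// 32 blocks of 256 threads, 256 floats of dynamic shared memory per block
stageToShared<<<32, 256, 256 * sizeof(float)>>>(d_in, d_out);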
Kernel Variations and Output
__global__ void kernel (int *a)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
a[idx] = 7;
} Output: 7777777777777777
__global__ void kernel (int *a)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
a[idx] = blockIdx.x;
} Output: 0000111122223333
__global__ void kernel (int *a)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
a[idx] = threadIdx.x;
} Output: 0123012301230123
Code Walkthrough 3
Build on Walkthrough 2
Write a kernel to increment n×m integers
Copy the result back to CPU
Print the values
DEMO: Code Walkthrough 3
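A minimal sketch of what this walkthrough could look like, based only on the bullet points above (my reconstruction, not the actual demo code; the sizes n = 4 and m = 8 are assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

__global__ void increment(int *a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = a[idx] + 1;                   // each thread bumps one integer
}

int main(void)
{
    int n = 4, m = 8;                      // n blocks of m threads -> n*m ints
    int nbytes = n * m * sizeof(int);
    int *h_a = (int*)malloc(nbytes);
    int *d_a = 0;
    memset(h_a, 0, nbytes);

    cudaMalloc((void**)&d_a, nbytes);
    cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);
    increment<<<n, m>>>(d_a);
    cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);

    for (int i = 0; i < n * m; i++)
        printf("%d ", h_a[i]);             // prints thirty-two 1s

    printf("\n");
    cudaFree(d_a);
    free(h_a);
    return 0;
}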
Blocks must be independent
Any possible interleaving of blocks should be valid. Presumed to run to completion without pre-emption.
Can run in any order
Can run concurrently OR sequentially
Blocks may coordinate but not synchronize. Shared queue pointer: OK (see the sketch below)
Shared lock: BAD … can easily deadlock
Independence requirement gives scalability
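One way to picture “coordinate but don’t synchronize” is a shared queue pointer advanced with an atomic, since no block ever waits on another. A rough sketch of mine (processChunks and nextChunk are assumed names; nextChunk starts at 0 on the host):

// Blocks coordinate through a shared queue pointer with atomicAdd
__global__ void processChunks(int *nextChunk, int numChunks, float *data)
{
    __shared__ int chunk;
    if (threadIdx.x == 0)
        chunk = atomicAdd(nextChunk, 1);   // this block claims the next chunk
    __syncthreads();                        // synchronization *within* the block only

    if (chunk < numChunks) {
        int i = chunk * blockDim.x + threadIdx.x;
        data[i] = 2.0f * data[i];           // each thread handles one element
    }
    // A shared lock (spin-waiting on another block) could deadlock instead,
    // because blocks are not guaranteed to run concurrently.
}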
Transparent Scalability
Hardware is free to assign blocks to any processor at any time
A kernel scales across any number of parallel processors
[Figure: the same kernel grid of Block 0 through Block 7 shown running on two different devices: a narrower device executes the blocks two at a time over four steps, while a wider device executes them four at a time over two steps. Each block can execute in any order relative to other blocks.]
EXTENDED EXAMPLE: Matrix Multiplication
A Simple Running Example: Matrix Multiplication
A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs
Leave shared memory usage until later
Local, register usage
Thread ID usage
Memory data transfer API between host and device
Assume square matrix for simplicity
Programming Model:Square Matrix Multiplication Example
P = M * N of size WIDTH x WIDTH
Without tiling: One thread calculates one element of P
M and N are loaded WIDTH times from global memory
Memory Layout of Matrix in C
[Figure: a 4×4 matrix M stored in row-major order in C; row 0 (M0,0 … M0,3) is followed by row 1 (M1,0 … M1,3), then rows 2 and 3, in one contiguous array.]
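In this row-major layout, element (row, col) of a Width × Width matrix sits at offset row * Width + col. A tiny illustrative helper of my own (not slide code):

// Row-major access: element (row, col) of a Width x Width matrix
float getElement(const float *M, int row, int col, int Width)
{
    return M[row * Width + col];
}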
Simple Matrix Multiplication (CPU)

void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i) {
        for (int j = 0; j < Width; ++j) {
            float sum = 0;
            for (int k = 0; k < Width; ++k) {
                float a = M[i * Width + k];
                float b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
    }
}
Simple Matrix Multiplication (GPU)
void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;
    …
    // 1. Allocate and load M, N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate P on the device
    cudaMalloc((void**)&Pd, size);
Simple Matrix Multiplication (GPU)
    // 2. Kernel invocation code – to be shown later
    …
    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
}
Kernel Function
// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;

Kernel Function (contd.)

    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}
Kernel Function (full)
// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}
Kernel Invocation (Host Side)
    // Setup the execution configuration
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
Only One Thread Block Used
One block of threads computes matrix Pd
Each thread computes one element of Pd
Each thread: loads a row of matrix Md; loads a column of matrix Nd; performs one multiply and one addition for each pair of Md and Nd elements. Compute-to-off-chip-memory-access ratio is close to 1:1 (not very high).
Size of matrix limited by the number of threads allowed in a thread block
[Figure: Grid 1 contains a single Block 1 covering the whole WIDTH × WIDTH output; the highlighted Thread(2, 2) computes one element of Pd from a row of Md and a column of Nd.]
Handling Arbitrary Sized Square Matrices
Have each 2D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix
Each block has (TILE_WIDTH)² threads
Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks
You still need to put a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than max grid size (64K)!
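A sketch of what the index arithmetic and launch could look like under this scheme (my reconstruction, still without shared-memory tiling; TILE_WIDTH = 16 and a Width divisible by TILE_WIDTH are assumptions):

#define TILE_WIDTH 16   // assumed tile size

__global__ void MatrixMulKernelTiledGrid(float *Md, float *Nd, float *Pd, int Width)
{
    // each thread computes one element of its block's TILE_WIDTH x TILE_WIDTH tile
    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];
    Pd[row * Width + col] = Pvalue;
}

// Host side (inside MatrixMulOnDevice): a 2D grid of (Width / TILE_WIDTH)^2 blocks
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
MatrixMulKernelTiledGrid<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);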
Small Example
[Figure: small example with TILE_WIDTH = 2. The 4×4 result Pd is split into four 2×2 tiles; Block(0,0), Block(1,0), Block(0,1), and Block(1,1) each compute one tile from the corresponding rows of Md and columns of Nd, e.g. Block(0,0) produces Pd0,0, Pd1,0, Pd0,1, and Pd1,1.]
Cleanup Topics
Memory Management: Pinned Memory (Zero-Transfer)
Portable Pinned Memory
Multi-GPU
Wrappers (Python, Java, .NET)
Kernels
Atomics
Thread Synchronization (staged reductions; see the sketch below)
NVCC
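As a taste of the “staged reductions” item above, a block-level sum typically halves the number of active threads at each stage and calls __syncthreads() between stages. A rough sketch of mine (assumes a power-of-two block size and blockDim.x * sizeof(float) of dynamic shared memory at launch), not code from the talk:

// Staged (tree) reduction within one block, using __syncthreads()
__global__ void blockSum(const float *in, float *blockSums)
{
    extern __shared__ float partial[];               // one float per thread
    int tid = threadIdx.x;
    partial[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // halve the number of active threads at each stage
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                             // wait for the stage to finish
    }

    if (tid == 0)
        blockSums[blockIdx.x] = partial[0];          // one partial sum per block
}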