CUDA: Introduction
Christian Trefftz / Greg Wolffe
Grand Valley State University
Supercomputing 2008 Education Program




Terms

What is GPGPU?
• General-Purpose computing on a Graphics Processing Unit
• Using graphics hardware for non-graphics computations

What is CUDA?
• Compute Unified Device Architecture
• Software architecture for managing data-parallel programming


Motivation


CPU vs. GPU

CPU:
• Fast caches
• Branching adaptability
• High performance

GPU:
• Multiple ALUs
• Fast onboard memory
• High throughput on parallel tasks
• Executes a program on each fragment/vertex

CPUs are great for task parallelism.
GPUs are great for data parallelism.


CPU vs. GPU - Hardware

More transistors devoted to data processing.


Traditional Graphics Pipeline

• Vertex processing
• Rasterizer
• Fragment processing
• Renderer (textures)


Pixel / Thread Processing


GPU Architecture


Processing Element

Processing element = thread processor = ALU


Memory Architecture

• Constant Memory
• Texture Memory
• Device Memory


Data-parallel Programming

• Think of the GPU as a massively-threaded co-processor
• Write "kernel" functions that execute on the device, processing multiple data elements in parallel
• Keep it busy! → massive threading
• Keep your data close! → local memory
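As a first taste of what such a kernel looks like (a minimal sketch; the function and array names here are illustrative, not from the deck's code):

  /* Kernel: each of the many threads processes one array element */
  __global__ void scale(float *data_d)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  /* this thread's element */
      data_d[i] = 2.0f * data_d[i];
  }

  /* Host side: launch 16 blocks of 64 threads = 1024 elements in parallel */
  /* scale<<<16, 64>>>(data_d); */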


Hardware Requirements

• CUDA-capable video card
• Power supply
• Cooling
• PCI-Express


Acknowledgements

• NVidia Corporation: developer.nvidia.com/CUDA
• NVidia: Technical Brief – Architecture Overview; CUDA Programming Guide
• ACM Queue: http://www.acmqueue.org/


A Gentle Introduction to CUDA Programming


Credits

The code used in this presentation is based on code available in:
• the Tutorial on CUDA in Dr. Dobb's Journal
• Andrew Bellenir's code for matrix multiplication
• Igor Majdandzic's code for Voronoi diagrams
• NVIDIA's CUDA programming guide


Software Requirements/Tools

• CUDA device driver
• CUDA Software Development Kit
  • Emulator
• CUDA Toolkit
  • Occupancy calculator
  • Visual profiler


To compute, we need to:

• Allocate memory that will be used for the computation (variable declaration and allocation)
• Read the data that we will compute on (input)
• Specify the computation that will be performed
• Write the results to the appropriate device (output)


A GPU is a specialized computer

• We need to allocate space in the video card's memory for the variables.
• The video card does not have I/O devices; hence we need to copy the input data from the memory in the host computer into the memory in the video card, using the variables allocated in the previous step.
• We need to specify the code to execute.
• We need to copy the results back to the memory in the host computer.
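These four steps map onto a handful of CUDA runtime calls. A minimal sketch of the pattern (the names a, a_d, n, and myKernel are placeholders; the fillArray() example later in the deck does exactly this):

  int *a_d;
  /* 1. Allocate space in the GPU card's memory */
  cudaMalloc((void **)&a_d, n * sizeof(int));
  /* 2. Copy the input data from host memory to GPU memory */
  cudaMemcpy(a_d, a, n * sizeof(int), cudaMemcpyHostToDevice);
  /* 3. Execute the kernel on the GPU (the execution configuration
        dimgrid/dimblock is explained shortly) */
  myKernel<<<dimgrid, dimblock>>>(a_d);
  /* 4. Copy the results back to host memory */
  cudaMemcpy(a, a_d, n * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(a_d);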


Initially:

Diagram: the host's memory contains array; the GPU card's memory is empty.


Allocate memory in the GPU card

Diagram: the host's memory contains array; the GPU card's memory now contains array_d.


Copy content from the host's memory to the GPU card's memory

Diagram: array on the host is copied into array_d on the GPU card.


Execute code on the GPU

Diagram: the GPU's multiprocessors (MPs) operate on array_d in the GPU card's memory.


Copy results back to the host memory

Diagram: array_d on the GPU card is copied back into array on the host.


The Kernel

• It is necessary to write the code that will be executed by the stream processors in the GPU card.
• That code, called the kernel, will be downloaded and executed, simultaneously and in lock-step fashion, on several (all?) stream processors in the GPU card.
• How is every instance of the kernel going to know which piece of data it is working on? (See the sketch below.)
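CUDA answers this by giving each thread built-in coordinates that it combines into a global index (a sketch; the examples that follow use exactly this idiom):

  /* Inside a kernel: which element does this particular thread own? */
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  /* ...after which this thread touches only element x, e.g. array_d[x] */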


Grid Size and Block Size

Programmers need to specify:
• The grid size: the size and shape of the data that the program will be working on
• The block size: the sub-area of the original grid that will be assigned to an MP (a set of stream processors that share local memory)
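In code, the two sizes become the execution configuration of the kernel launch (a sketch; the kernel name is a placeholder, and the fillArray() example later uses this exact pattern):

  dim3 dimblock(BLOCK_SIZE);               /* threads per block */
  dim3 dimgrid(arraySize / BLOCK_SIZE);    /* blocks in the grid */
  someKernel<<<dimgrid, dimblock>>>(array_d);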


Block Size

• Recall that the "stream processors" of the GPU are organized as MPs (multi-processors) and every MP has its own set of resources:
  • Registers
  • Local memory
• The block size needs to be chosen such that there are enough resources in an MP to execute a block at a time.


In the GPU:

Diagram: the processing elements are mapped onto the array elements, with the array split between Block 0 and Block 1.


Let’s look at a very simple example

• The code has been divided into two files: simple.c and simple.cu
• simple.c is ordinary C code:
  • It allocates an array of integers, initializes it to values corresponding to the indices in the array, and prints the array.
  • It calls a function that modifies the array.
  • The array is printed again.


simple.c

#include <stdio.h>
#define SIZEOFARRAY 64
extern void fillArray(int *a, int size);

/* The main program */
int main(int argc, char *argv[])
{
  /* Declare the array that will be modified by the GPU */
  int a[SIZEOFARRAY];
  int i;
  /* Initialize the array to its indices */
  for (i = 0; i < SIZEOFARRAY; i++) {
    a[i] = i;
  }
  /* Print the initial array */
  printf("Initial state of the array:\n");
  for (i = 0; i < SIZEOFARRAY; i++) {
    printf("%d ", a[i]);
  }
  printf("\n");
  /* Call the function that will in turn call the function
     in the GPU that will fill the array */
  fillArray(a, SIZEOFARRAY);
  /* Now print the array after calling fillArray */
  printf("Final state of the array:\n");
  for (i = 0; i < SIZEOFARRAY; i++) {
    printf("%d ", a[i]);
  }
  printf("\n");
  return 0;
}


simple.cu

simple.cu contains two functions:
• fillArray(): a function that will be executed on the host and which takes care of:
  • Allocating variables in the global GPU memory
  • Copying the array from the host to the GPU memory
  • Setting the grid and block sizes
  • Invoking the kernel that is executed on the GPU
  • Copying the values back to the host memory
  • Freeing the GPU memory


fillArray (part 1)

#define BLOCK_SIZE 32
extern "C" void fillArray(int *array, int arraySize)
{
  /* array_d is the GPU counterpart of the array that exists in host memory */
  int *array_d;
  cudaError_t result;

  /* allocate memory on device:
     cudaMalloc allocates space in the memory of the GPU card */
  result = cudaMalloc((void **)&array_d, sizeof(int) * arraySize);

  /* copy the array into the variable array_d in the device:
     the memory from the host is copied to the corresponding
     variable in the GPU global memory */
  result = cudaMemcpy(array_d, array, sizeof(int) * arraySize,
                      cudaMemcpyHostToDevice);


fillArray (part 2)

  /* execution configuration... */
  /* Indicate the dimension of the block */
  dim3 dimblock(BLOCK_SIZE);
  /* Indicate the dimension of the grid in blocks */
  dim3 dimgrid(arraySize / BLOCK_SIZE);

  /* actual computation: call the kernel, the function that is
     executed by each and every processing element on the GPU card */
  cu_fillArray<<<dimgrid, dimblock>>>(array_d);

  /* read results back: copy the results from the GPU back to
     the memory on the host */
  result = cudaMemcpy(array, array_d, sizeof(int) * arraySize,
                      cudaMemcpyDeviceToHost);

  /* Release the memory on the GPU card */
  cudaFree(array_d);
}


simple.cu (cont.)

The other function in simple.cu is cu_fillArray():
• This is the kernel that will be executed by every stream processor in the GPU
• It is identified as a kernel by the use of the keyword __global__
• This function uses the built-in variables blockIdx.x and threadIdx.x to identify a particular position in the array


cu_fillArray

__global__ void cu_fillArray(int *array_d)
{
  int x;
  /* blockIdx.x is a built-in variable in CUDA that returns the block ID,
     along the x axis, of the block that is executing this code.
     threadIdx.x is another built-in variable in CUDA that returns the
     thread ID, along the x axis, of the thread that is being executed
     by this stream processor in this particular block. */
  x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
  array_d[x] += array_d[x];
}


To compile:

nvcc simple.c simple.cu -o simple

• The compiler generates the code for both the host and the GPU
• Demo on cuda.littlefe.net …


What are those blockIds and threadIds?

• With a minor modification to the code, we can print the blockIds and threadIds
• We will use two arrays instead of just one:
  • One for the blockIds
  • One for the threadIds
• The code in the kernel:

  x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
  block_d[x] = blockIdx.x;
  thread_d[x] = threadIdx.x;
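Assembled into a complete kernel, the modification might look like this (a sketch; the actual blockAndThread.cu may differ in details):

  __global__ void cu_blockAndThread(int *block_d, int *thread_d)
  {
      int x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
      block_d[x] = blockIdx.x;    /* record which block handled element x */
      thread_d[x] = threadIdx.x;  /* record which thread within that block */
  }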


In the GPU:

Diagram: the processing elements are mapped onto the array elements; within each of Block 0 and Block 1, Threads 0–3 each handle one element.


Hands-on Activity

• Compile with (one single line):
  nvcc blockAndThread.c blockAndThread.cu -o blockAndThread
• Run the program:
  ./blockAndThread
• Edit the file blockAndThread.cu
• Modify the constant BLOCK_SIZE. The current value is 8; try replacing it with 4.
• Recompile as above
• Run the program and compare the output with the previous run.


This can be extended to 2 dimensions

• See files: blockAndThread2D.c and blockAndThread2D.cu
• The gist in the kernel:

  x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
  y = blockIdx.y * BLOCK_SIZE + threadIdx.y;
  pos = x * sizeOfArray + y;
  block_dX[pos] = blockIdx.x;

• Compile and run blockAndThread2D:
  nvcc blockAndThread2D.c blockAndThread2D.cu -o blockAndThread2D
  ./blockAndThread2D


When the kernel is called:

dim3 dimblock(BLOCK_SIZE, BLOCK_SIZE);
nBlocks = arraySize / BLOCK_SIZE;
dim3 dimgrid(nBlocks, nBlocks);
cu_fillArray<<<dimgrid, dimblock>>>(… params …);
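Putting the launch and the kernel gist together (a hedged sketch; the parameter list is assumed, since the deck elides it):

  __global__ void cu_fillArray2D(int *block_dX, int sizeOfArray)
  {
      /* 2D coordinates of this thread within the whole data grid */
      int x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
      int y = blockIdx.y * BLOCK_SIZE + threadIdx.y;
      int pos = x * sizeOfArray + y;   /* flatten (x, y) into a linear index */
      block_dX[pos] = blockIdx.x;      /* record the block ID along x */
  }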


Another Example: saxpy

• SAXPY (Scalar Alpha X Plus Y)
• A common operation in linear algebra
• CUDA: loop iteration → thread


Traditional Sequential Code

void saxpy_serial(int n,
                  float alpha,
                  float *x,
                  float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}


CUDA Code

__global__ void saxpy_parallel(int n,
                               float alpha,
                               float *x,
                               float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = alpha * x[i] + y[i];
}
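A typical launch for this kernel (the block size of 256 and the device pointer names x_d and y_d are assumptions; the bounds check in the kernel is what makes the rounded-up grid safe):

  int nblocks = (n + 255) / 256;   /* round up so every element gets a thread */
  saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, x_d, y_d);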


Keeping Multiprocessors in mind…

• Each hardware multiprocessor has the ability to actively process multiple blocks at one time.
• How many depends on the number of registers per thread and how much shared memory per block is required by a given kernel.
• The blocks that are processed by one multiprocessor at one time are referred to as "active".
• If a block is too large, then it will not fit into the resources of an MP.


“Warps”

• Each active block is split into SIMD ("Single Instruction Multiple Data") groups of threads called "warps".
• Each warp contains the same number of threads, called the "warp size", which are executed by the multiprocessor in a SIMD fashion.
• On "if" statements or "while" statements (control transfer), the threads may diverge.
• Use: __syncthreads()
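A small illustration of divergence (a sketch: within one warp, the two branches below cannot run simultaneously and are serialized):

  __global__ void divergent(int *out_d)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (threadIdx.x % 2 == 0)   /* even threads take one path...            */
          out_d[i] = 1;
      else                        /* ...odd threads of the same warp take the */
          out_d[i] = 2;           /* other; the warp executes both in turn    */
      __syncthreads();            /* barrier: all threads of the block meet here */
  }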


A Real Application

The Voronoi Diagram: a fundamental data structure in Computational Geometry.


Definition

Definition: Let S be a set of n sites in Euclidean space of dimension d. For each site p of S, the Voronoi cell V(p) of p is the set of points that are closer to p than to the other sites of S. The Voronoi diagram V(S) is the space partition induced by the Voronoi cells.
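In symbols (a standard formulation of the same definition, not from the slide):

$$V(p) = \{\, x \in \mathbb{R}^d \;:\; \lVert x - p \rVert < \lVert x - q \rVert \ \text{for all } q \in S,\ q \neq p \,\}$$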


Algorithms

• The classical sequential algorithm has complexity O(n log n), where n is the number of sites (seeds).
• If one only needs an approximation, on a grid of points (e.g. a digital display):
  • Assign a different color to each seed
  • Calculate the distance from every point in the grid to all seeds
  • Color each point with the color of its closest seed


Lends itself to implementation on a GPU…

• The calculation for every pixel is a good candidate to be carried out in parallel…
• Notice that the locations of the seeds are read-only in the kernel
• Thus we can use the texture map area in the GPU card, which is a fast read-only cache, to store the seeds:

  __device__ __constant__ …
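A hedged sketch of this per-pixel approach (the names, the seed limit, and the 2D launch are assumptions, not the presenters' code; the host would load seeds_c with cudaMemcpyToSymbol()):

  #define MAX_SEEDS 64
  __device__ __constant__ float2 seeds_c[MAX_SEEDS];  /* read-only, cached */

  __global__ void voronoi(int *color_d, int width, int nSeeds)
  {
      int x = blockIdx.x * blockDim.x + threadIdx.x;
      int y = blockIdx.y * blockDim.y + threadIdx.y;
      float best = 1e30f;
      int bestSeed = 0;
      /* Brute force: compare this pixel against every seed */
      for (int s = 0; s < nSeeds; s++) {
          float dx = x - seeds_c[s].x;
          float dy = y - seeds_c[s].y;
          float d = dx * dx + dy * dy;   /* squared distance suffices here */
          if (d < best) { best = d; bestSeed = s; }
      }
      color_d[y * width + x] = bestSeed; /* "color" the pixel with its closest seed */
  }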


Demo on cuda…


Tips for improving performance

Special thanks to Igor Majdandzic.


Memory Alignment

• Memory access on the GPU works much better if the data items are aligned on 64-byte boundaries.
• Hence, allocating 2D arrays so that every row starts at a 64-byte boundary address will improve performance.
• But that is difficult for a programmer to do by hand.


Allocating 2D arrays with "pitch"

CUDA offers special versions of:
• Memory allocation for 2D arrays, so that every row is padded (if necessary). The function determines the best pitch and returns it to the program. The function name is cudaMallocPitch().
• Memory copy operations that take into account the pitch that was chosen by the memory allocation operation. The function name is cudaMemcpy2D().


Pitch

Diagram: the rows and columns of the array, with padding at the end of each row; the pitch is the full (padded) width of a row.


A simple example:

• See pitch.cu
• A matrix of 30 rows and 10 columns
• The work is divided into 3 blocks of 10 rows each:
  • Block size is 10
  • Grid size is 3


Key portions of the code (1)

result = cudaMallocPitch(
             (void **)&devPtr,
             &pitch,
             width * sizeof(int),
             height);


Key portions of the code (2)

result = cudaMemcpy2D(
             devPtr,               /* destination */
             pitch,                /* destination pitch (returned by cudaMallocPitch) */
             mat,                  /* source on the host */
             width * sizeof(int),  /* source pitch: host rows are contiguous */
             width * sizeof(int),  /* width of each row to copy, in bytes */
             height,
             cudaMemcpyHostToDevice);


In the kernel:

__global__ void myKernel(int *devPtr,
                         int pitch,
                         int width,
                         int height)
{
    int c;
    int thisRow;
    thisRow = blockIdx.x * 10 + threadIdx.x;
    /* advance by thisRow * pitch BYTES to reach the start of this thread's row */
    int *row = (int *)((char *)devPtr + thisRow * pitch);
    for (c = 0; c < width; c++)
        row[c] = row[c] + 1;
}


The call to the kernel

myKernel<<<3, 10>>>(devPtr,
                    pitch,
                    width,
                    height);


Pitch → divide work by rows

• Notice that when using pitch, we divide the work by rows.
• Instead of using a 2D decomposition into 2D blocks, we are dividing the 2D matrix into blocks of rows.


Dividing the work by blocks:

Diagram: the rows of the (pitched) matrix are split into Block 0, Block 1, and Block 2, each block covering a band of rows.


An application that uses pitch: Mandelbrot

• The Mandelbrot set: a set of points in the complex plane, the boundary of which forms a fractal.
• A complex number c is in the Mandelbrot set if, when starting with x_0 = 0 and applying the iteration

  x_{n+1} = x_n^2 + c

  repeatedly, the absolute value of x_n never exceeds a certain number (that number depends on c), however large n gets.


Demo: Comparison

We can compare the execution times of:
• The sequential version
• The CUDA version


Performance Tip: Block Size

• Critical for performance
• Recommended value is 192 or 256
• Maximum value is 512
• Should be a multiple of 32, since this is the warp size for Series 8 GPUs and thus the native execution size for multiprocessors
• Limited by the number of registers on the MP
  • Series 8 GPU MPs have 8192 registers, which are shared between all the threads on an MP


Performance Tip: Grid Size

• Critical for scalability
• Recommended value is at least 100, but 1000 would scale for many generations of hardware
• Actual value depends on the size of the problem data
• It should be a multiple of the number of MPs for an even distribution of work (not a requirement, though)
• Example: 24 blocks
  • The grid will work efficiently on Series 8 (12 MPs), but it will waste resources on newer GPUs with 32 MPs


Performance Tip: Code Divergence

• Control flow instructions diverge (threads take different paths of execution)
• Examples: if, for, while
• Diverged code prevents SIMD execution – it forces serial execution (kills efficiency)
• One approach is to invoke a simpler kernel multiple times
• Liberal use of __syncthreads()


Performance Tip: Memory Latency

• 4 clock cycles for each memory read/write, plus an additional 400–600 cycles of latency
• Memory latency can be hidden by keeping a large number of threads busy
  • Keep the number of threads per block (block size) and the number of blocks per grid (grid size) as large as possible
• Constant memory can be used for constant data (variables that do not change)
• Constant memory is cached


Performance Tip: Memory Reads

• The device is capable of reading a 32-, 64- or 128-bit number from memory with a single instruction
• Data has to be aligned in memory (this can be accomplished by using cudaMallocPitch() calls)
• If formatted properly, multiple threads from a warp can each receive a piece of memory with a single read instruction


Watchdog timer

• The operating system's GUI may have a "watchdog" timer that causes programs using the primary graphics adapter to time out if they run longer than the maximum allowed time.
• Individual GPU program launches are limited to a run time of less than this maximum.
• Exceeding this time limit usually causes a launch failure.
• Possible solution: run CUDA on a GPU that is NOT attached to a display.


Resources online

• http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=532
• http://www.ddj.com/hpc-high-performance-computing/207200659
• http://www.nvidia.com/object/cuda_home.html
• http://www.nvidia.com/object/cuda_learn.html
• "Computation of Voronoi diagrams using a graphics processing unit" by Igor Majdandzic et al., available through the IEEE Digital Library, DOI: 10.1109/EIT.2008.4554342