CUDA: Introduction
Christian Trefftz / Greg Wolffe
Grand Valley State University
Supercomputing 2008 Education Program




Terms

What is GPGPU?
• General-Purpose computing on a Graphics Processing Unit
• Using graphics hardware for non-graphics computations

What is CUDA?
• Compute Unified Device Architecture
• Software architecture for managing data-parallel programming


Motivation


CPU vs. GPU

CPU:
• Fast caches
• Branching adaptability
• High performance

GPU:
• Multiple ALUs
• Fast onboard memory
• High throughput on parallel tasks
• Executes a program on each fragment/vertex

CPUs are great for task parallelism.
GPUs are great for data parallelism.


CPU vs. GPU - Hardware

More transistors devoted to data processing.


Traditional Graphics Pipeline

• Vertex processing
• Rasterizer
• Fragment processing
• Renderer (textures)


Pixel / Thread Processing


GPU Architecture


Processing Element

Processing element = thread processor = ALU


Memory Architecture

• Constant Memory
• Texture Memory
• Device Memory


Data-parallel Programming

• Think of the GPU as a massively-threaded co-processor
• Write "kernel" functions that execute on the device, processing multiple data elements in parallel
• Keep it busy! → massive threading
• Keep your data close! → local memory
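As a first taste of what such a kernel looks like (a minimal sketch; the function and array names here are illustrative, not from the deck's code):

  /* Kernel: each of the many threads processes one array element */
  __global__ void scale(float *data_d)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  /* this thread's element */
      data_d[i] = 2.0f * data_d[i];
  }

  /* Host side: launch 16 blocks of 64 threads = 1024 elements in parallel */
  /* scale<<<16, 64>>>(data_d); */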


Hardware Requirements

• CUDA-capable video card
• Power supply
• Cooling
• PCI-Express


Acknowledgements

• NVidia Corporation: developer.nvidia.com/CUDA
• NVidia: Technical Brief – Architecture Overview; CUDA Programming Guide
• ACM Queue: http://www.acmqueue.org/


A Gentle Introduction to CUDA Programming


Credits

The code used in this presentation is based on code available in:
• the Tutorial on CUDA in Dr. Dobb's Journal
• Andrew Bellenir's code for matrix multiplication
• Igor Majdandzic's code for Voronoi diagrams
• NVIDIA's CUDA programming guide


Software Requirements/Tools

• CUDA device driver
• CUDA Software Development Kit
  • Emulator
• CUDA Toolkit
  • Occupancy calculator
  • Visual profiler


To compute, we need to:

• Allocate memory that will be used for the computation (variable declaration and allocation)
• Read the data that we will compute on (input)
• Specify the computation that will be performed
• Write the results to the appropriate device (output)


A GPU is a specialized computer

• We need to allocate space in the video card's memory for the variables.
• The video card does not have I/O devices; hence we need to copy the input data from the memory in the host computer into the memory in the video card, using the variables allocated in the previous step.
• We need to specify the code to execute.
• We need to copy the results back to the memory in the host computer.
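These four steps map onto a handful of CUDA runtime calls. A minimal sketch of the pattern (the names a, a_d, n, and myKernel are placeholders; the fillArray() example later in the deck does exactly this):

  int *a_d;
  /* 1. Allocate space in the GPU card's memory */
  cudaMalloc((void **)&a_d, n * sizeof(int));
  /* 2. Copy the input data from host memory to GPU memory */
  cudaMemcpy(a_d, a, n * sizeof(int), cudaMemcpyHostToDevice);
  /* 3. Execute the kernel on the GPU (the execution configuration
        dimgrid/dimblock is explained shortly) */
  myKernel<<<dimgrid, dimblock>>>(a_d);
  /* 4. Copy the results back to host memory */
  cudaMemcpy(a, a_d, n * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(a_d);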


Initially:

Diagram: the host's memory contains array; the GPU card's memory is empty.


Allocate memory in the GPU card

Diagram: the host's memory contains array; the GPU card's memory now contains array_d.


Copy content from the host's memory to the GPU card's memory

Diagram: array on the host is copied into array_d on the GPU card.


Execute code on the GPU

Diagram: the GPU's multiprocessors (MPs) operate on array_d in the GPU card's memory.


Copy results back to the host memory

Diagram: array_d on the GPU card is copied back into array on the host.


The Kernel

• It is necessary to write the code that will be executed by the stream processors in the GPU card.
• That code, called the kernel, will be downloaded and executed, simultaneously and in lock-step fashion, on several (all?) stream processors in the GPU card.
• How is every instance of the kernel going to know which piece of data it is working on? (See the sketch below.)
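CUDA answers this by giving each thread built-in coordinates that it combines into a global index (a sketch; the examples that follow use exactly this idiom):

  /* Inside a kernel: which element does this particular thread own? */
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  /* ...after which this thread touches only element x, e.g. array_d[x] */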


Grid Size and Block Size

Programmers need to specify:
• The grid size: the size and shape of the data that the program will be working on
• The block size: the sub-area of the original grid that will be assigned to an MP (a set of stream processors that share local memory)
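In code, the two sizes become the execution configuration of the kernel launch (a sketch; the kernel name is a placeholder, and the fillArray() example later uses this exact pattern):

  dim3 dimblock(BLOCK_SIZE);               /* threads per block */
  dim3 dimgrid(arraySize / BLOCK_SIZE);    /* blocks in the grid */
  someKernel<<<dimgrid, dimblock>>>(array_d);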


Block Size

• Recall that the "stream processors" of the GPU are organized as MPs (multi-processors) and every MP has its own set of resources:
  • Registers
  • Local memory
• The block size needs to be chosen such that there are enough resources in an MP to execute a block at a time.


In the GPU:

Diagram: the processing elements are mapped onto the array elements, with the array split between Block 0 and Block 1.


Let’s look at a very simple example

• The code has been divided into two files: simple.c and simple.cu
• simple.c is ordinary C code:
  • It allocates an array of integers, initializes it to values corresponding to the indices in the array, and prints the array.
  • It calls a function that modifies the array.
  • The array is printed again.


simple.c

#include <stdio.h>
#define SIZEOFARRAY 64
extern void fillArray(int *a, int size);

/* The main program */
int main(int argc, char *argv[])
{
  /* Declare the array that will be modified by the GPU */
  int a[SIZEOFARRAY];
  int i;
  /* Initialize the array to its indices */
  for (i = 0; i < SIZEOFARRAY; i++) {
    a[i] = i;
  }
  /* Print the initial array */
  printf("Initial state of the array:\n");
  for (i = 0; i < SIZEOFARRAY; i++) {
    printf("%d ", a[i]);
  }
  printf("\n");
  /* Call the function that will in turn call the function
     in the GPU that will fill the array */
  fillArray(a, SIZEOFARRAY);
  /* Now print the array after calling fillArray */
  printf("Final state of the array:\n");
  for (i = 0; i < SIZEOFARRAY; i++) {
    printf("%d ", a[i]);
  }
  printf("\n");
  return 0;
}


simple.cu

simple.cu contains two functions:
• fillArray(): a function that will be executed on the host and which takes care of:
  • Allocating variables in the global GPU memory
  • Copying the array from the host to the GPU memory
  • Setting the grid and block sizes
  • Invoking the kernel that is executed on the GPU
  • Copying the values back to the host memory
  • Freeing the GPU memory


fillArray (part 1)

#define BLOCK_SIZE 32
extern "C" void fillArray(int *array, int arraySize)
{
  /* array_d is the GPU counterpart of the array that exists in host memory */
  int *array_d;
  cudaError_t result;

  /* allocate memory on device:
     cudaMalloc allocates space in the memory of the GPU card */
  result = cudaMalloc((void **)&array_d, sizeof(int) * arraySize);

  /* copy the array into the variable array_d in the device:
     the memory from the host is copied to the corresponding
     variable in the GPU global memory */
  result = cudaMemcpy(array_d, array, sizeof(int) * arraySize,
                      cudaMemcpyHostToDevice);


fillArray (part 2)

  /* execution configuration... */
  /* Indicate the dimension of the block */
  dim3 dimblock(BLOCK_SIZE);
  /* Indicate the dimension of the grid in blocks */
  dim3 dimgrid(arraySize / BLOCK_SIZE);

  /* actual computation: call the kernel, the function that is
     executed by each and every processing element on the GPU card */
  cu_fillArray<<<dimgrid, dimblock>>>(array_d);

  /* read results back: copy the results from the GPU back to
     the memory on the host */
  result = cudaMemcpy(array, array_d, sizeof(int) * arraySize,
                      cudaMemcpyDeviceToHost);

  /* Release the memory on the GPU card */
  cudaFree(array_d);
}


simple.cu (cont.)

The other function in simple.cu is cu_fillArray():
• This is the kernel that will be executed by every stream processor in the GPU
• It is identified as a kernel by the use of the keyword __global__
• This function uses the built-in variables blockIdx.x and threadIdx.x to identify a particular position in the array


cu_fillArray

__global__ void cu_fillArray(int *array_d)
{
  int x;
  /* blockIdx.x is a built-in variable in CUDA that returns the block ID,
     along the x axis, of the block that is executing this code.
     threadIdx.x is another built-in variable in CUDA that returns the
     thread ID, along the x axis, of the thread that is being executed
     by this stream processor in this particular block. */
  x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
  array_d[x] += array_d[x];
}


To compile:

nvcc simple.c simple.cu -o simple

• The compiler generates the code for both the host and the GPU
• Demo on cuda.littlefe.net …


What are those blockIds and threadIds?

• With a minor modification to the code, we can print the blockIds and threadIds
• We will use two arrays instead of just one:
  • One for the blockIds
  • One for the threadIds
• The code in the kernel:

  x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
  block_d[x] = blockIdx.x;
  thread_d[x] = threadIdx.x;
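Assembled into a complete kernel, the modification might look like this (a sketch; the actual blockAndThread.cu may differ in details):

  __global__ void cu_blockAndThread(int *block_d, int *thread_d)
  {
      int x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
      block_d[x] = blockIdx.x;    /* record which block handled element x */
      thread_d[x] = threadIdx.x;  /* record which thread within that block */
  }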


In the GPU:

Diagram: the processing elements are mapped onto the array elements; within each of Block 0 and Block 1, Threads 0–3 each handle one element.


Hands-on Activity

• Compile with (one single line):
  nvcc blockAndThread.c blockAndThread.cu -o blockAndThread
• Run the program:
  ./blockAndThread
• Edit the file blockAndThread.cu
• Modify the constant BLOCK_SIZE. The current value is 8; try replacing it with 4.
• Recompile as above
• Run the program and compare the output with the previous run.


This can be extended to 2 dimensions

• See files: blockAndThread2D.c and blockAndThread2D.cu
• The gist in the kernel:

  x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
  y = blockIdx.y * BLOCK_SIZE + threadIdx.y;
  pos = x * sizeOfArray + y;
  block_dX[pos] = blockIdx.x;

• Compile and run blockAndThread2D:
  nvcc blockAndThread2D.c blockAndThread2D.cu -o blockAndThread2D
  ./blockAndThread2D


When the kernel is called:

dim3 dimblock(BLOCK_SIZE, BLOCK_SIZE);
nBlocks = arraySize / BLOCK_SIZE;
dim3 dimgrid(nBlocks, nBlocks);
cu_fillArray<<<dimgrid, dimblock>>>(… params …);
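Putting the launch and the kernel gist together (a hedged sketch; the parameter list is assumed, since the deck elides it):

  __global__ void cu_fillArray2D(int *block_dX, int sizeOfArray)
  {
      /* 2D coordinates of this thread within the whole data grid */
      int x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
      int y = blockIdx.y * BLOCK_SIZE + threadIdx.y;
      int pos = x * sizeOfArray + y;   /* flatten (x, y) into a linear index */
      block_dX[pos] = blockIdx.x;      /* record the block ID along x */
  }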


Another Example: saxpy

• SAXPY (Scalar Alpha X Plus Y)
• A common operation in linear algebra
• CUDA: loop iteration → thread


Traditional Sequential Code

void saxpy_serial(int n,
                  float alpha,
                  float *x,
                  float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}


CUDA Code

__global__ void saxpy_parallel(int n,
                               float alpha,
                               float *x,
                               float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = alpha * x[i] + y[i];
}
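A typical launch for this kernel (the block size of 256 and the device pointer names x_d and y_d are assumptions; the bounds check in the kernel is what makes the rounded-up grid safe):

  int nblocks = (n + 255) / 256;   /* round up so every element gets a thread */
  saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, x_d, y_d);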


Keeping Multiprocessors in mind…

• Each hardware multiprocessor has the ability to actively process multiple blocks at one time.
• How many depends on the number of registers per thread and how much shared memory per block is required by a given kernel.
• The blocks that are processed by one multiprocessor at one time are referred to as "active".
• If a block is too large, then it will not fit into the resources of an MP.


“Warps”

• Each active block is split into SIMD ("Single Instruction Multiple Data") groups of threads called "warps".
• Each warp contains the same number of threads, called the "warp size", which are executed by the multiprocessor in a SIMD fashion.
• On "if" statements or "while" statements (control transfer), the threads may diverge.
• Use: __syncthreads()
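A small illustration of divergence (a sketch: within one warp, the two branches below cannot run simultaneously and are serialized):

  __global__ void divergent(int *out_d)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (threadIdx.x % 2 == 0)   /* even threads take one path...            */
          out_d[i] = 1;
      else                        /* ...odd threads of the same warp take the */
          out_d[i] = 2;           /* other; the warp executes both in turn    */
      __syncthreads();            /* barrier: all threads of the block meet here */
  }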


A Real Application

The Voronoi Diagram: a fundamental data structure in Computational Geometry.


Definition

Definition: Let S be a set of n sites in Euclidean space of dimension d. For each site p of S, the Voronoi cell V(p) of p is the set of points that are closer to p than to the other sites of S. The Voronoi diagram V(S) is the space partition induced by the Voronoi cells.
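In symbols (a standard formulation of the same definition, not from the slide):

$$V(p) = \{\, x \in \mathbb{R}^d \;:\; \lVert x - p \rVert < \lVert x - q \rVert \ \text{for all } q \in S,\ q \neq p \,\}$$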


Algorithms

• The classical sequential algorithm has complexity O(n log n), where n is the number of sites (seeds).
• If one only needs an approximation, on a grid of points (e.g. a digital display):
  • Assign a different color to each seed
  • Calculate the distance from every point in the grid to all seeds
  • Color each point with the color of its closest seed


Lends itself to implementation on a GPU…

• The calculation for every pixel is a good candidate to be carried out in parallel…
• Notice that the locations of the seeds are read-only in the kernel
• Thus we can use the texture map area in the GPU card, which is a fast read-only cache, to store the seeds:

  __device__ __constant__ …
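A hedged sketch of this per-pixel approach (the names, the seed limit, and the 2D launch are assumptions, not the presenters' code; the host would load seeds_c with cudaMemcpyToSymbol()):

  #define MAX_SEEDS 64
  __device__ __constant__ float2 seeds_c[MAX_SEEDS];  /* read-only, cached */

  __global__ void voronoi(int *color_d, int width, int nSeeds)
  {
      int x = blockIdx.x * blockDim.x + threadIdx.x;
      int y = blockIdx.y * blockDim.y + threadIdx.y;
      float best = 1e30f;
      int bestSeed = 0;
      /* Brute force: compare this pixel against every seed */
      for (int s = 0; s < nSeeds; s++) {
          float dx = x - seeds_c[s].x;
          float dy = y - seeds_c[s].y;
          float d = dx * dx + dy * dy;   /* squared distance suffices here */
          if (d < best) { best = d; bestSeed = s; }
      }
      color_d[y * width + x] = bestSeed; /* "color" the pixel with its closest seed */
  }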


Demo on cuda…


Tips for improving performance

Special thanks to Igor Majdandzic.


Memory Alignment

• Memory access on the GPU works much better if the data items are aligned on 64-byte boundaries.
• Hence, allocating 2D arrays so that every row starts at a 64-byte boundary address will improve performance.
• But that is difficult for a programmer to do by hand.


Allocating 2D arrays with "pitch"

CUDA offers special versions of:
• Memory allocation for 2D arrays, so that every row is padded (if necessary). The function determines the best pitch and returns it to the program. The function name is cudaMallocPitch().
• Memory copy operations that take into account the pitch that was chosen by the memory allocation operation. The function name is cudaMemcpy2D().


Pitch

Diagram: the rows and columns of the array, with padding at the end of each row; the pitch is the full (padded) width of a row.


A simple example:

• See pitch.cu
• A matrix of 30 rows and 10 columns
• The work is divided into 3 blocks of 10 rows each:
  • Block size is 10
  • Grid size is 3


Key portions of the code (1)

result = cudaMallocPitch(
             (void **)&devPtr,
             &pitch,
             width * sizeof(int),
             height);


Key portions of the code (2)

result = cudaMemcpy2D(
             devPtr,               /* destination */
             pitch,                /* destination pitch (returned by cudaMallocPitch) */
             mat,                  /* source on the host */
             width * sizeof(int),  /* source pitch: host rows are contiguous */
             width * sizeof(int),  /* width of each row to copy, in bytes */
             height,
             cudaMemcpyHostToDevice);


In the kernel:

__global__ void myKernel(int *devPtr,
                         int pitch,
                         int width,
                         int height)
{
    int c;
    int thisRow;
    thisRow = blockIdx.x * 10 + threadIdx.x;
    /* advance by thisRow * pitch BYTES to reach the start of this thread's row */
    int *row = (int *)((char *)devPtr + thisRow * pitch);
    for (c = 0; c < width; c++)
        row[c] = row[c] + 1;
}


The call to the kernel

myKernel<<<3, 10>>>(devPtr,
                    pitch,
                    width,
                    height);


Pitch → divide work by rows

• Notice that when using pitch, we divide the work by rows.
• Instead of using a 2D decomposition into 2D blocks, we are dividing the 2D matrix into blocks of rows.


Dividing the work by blocks:

Diagram: the rows of the (pitched) matrix are split into Block 0, Block 1, and Block 2, each block covering a band of rows.


An application that uses pitch: Mandelbrot

• The Mandelbrot set: a set of points in the complex plane, the boundary of which forms a fractal.
• A complex number c is in the Mandelbrot set if, when starting with x_0 = 0 and applying the iteration

  x_{n+1} = x_n^2 + c

  repeatedly, the absolute value of x_n never exceeds a certain number (that number depends on c), however large n gets.


Demo: Comparison

We can compare the execution times of:
• The sequential version
• The CUDA version


Performance Tip: Block Size

• Critical for performance
• Recommended value is 192 or 256
• Maximum value is 512
• Should be a multiple of 32, since this is the warp size for Series 8 GPUs and thus the native execution size for multiprocessors
• Limited by the number of registers on the MP
  • Series 8 GPU MPs have 8192 registers, which are shared between all the threads on an MP


Performance Tip: Grid Size

• Critical for scalability
• Recommended value is at least 100, but 1000 would scale for many generations of hardware
• Actual value depends on the size of the problem data
• It should be a multiple of the number of MPs for an even distribution of work (not a requirement, though)
• Example: 24 blocks
  • The grid will work efficiently on Series 8 (12 MPs), but it will waste resources on newer GPUs with 32 MPs


Performance Tip: Code Divergence

• Control flow instructions diverge (threads take different paths of execution)
• Examples: if, for, while
• Diverged code prevents SIMD execution – it forces serial execution (kills efficiency)
• One approach is to invoke a simpler kernel multiple times
• Liberal use of __syncthreads()


Performance Tip: Memory Latency

• 4 clock cycles for each memory read/write, plus an additional 400–600 cycles of latency
• Memory latency can be hidden by keeping a large number of threads busy
  • Keep the number of threads per block (block size) and the number of blocks per grid (grid size) as large as possible
• Constant memory can be used for constant data (variables that do not change)
• Constant memory is cached


Performance Tip: Memory Reads

• The device is capable of reading a 32-, 64- or 128-bit number from memory with a single instruction
• Data has to be aligned in memory (this can be accomplished by using cudaMallocPitch() calls)
• If formatted properly, multiple threads from a warp can each receive a piece of memory with a single read instruction


Watchdog timer

• The operating system's GUI may have a "watchdog" timer that causes programs using the primary graphics adapter to time out if they run longer than the maximum allowed time.
• Individual GPU program launches are limited to a run time of less than this maximum.
• Exceeding this time limit usually causes a launch failure.
• Possible solution: run CUDA on a GPU that is NOT attached to a display.


Resources online

• http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=532
• http://www.ddj.com/hpc-high-performance-computing/207200659
• http://www.nvidia.com/object/cuda_home.html
• http://www.nvidia.com/object/cuda_learn.html
• "Computation of Voronoi diagrams using a graphics processing unit" by Igor Majdandzic et al., available through the IEEE Digital Library, DOI: 10.1109/EIT.2008.4554342