CUDA C/C++ BASICS (cont.)

NVIDIA Corporation

© NVIDIA 2013

COOPERATING THREADS

CONCEPTS
• Heterogeneous Computing
• Blocks
• Threads
• Indexing
• Shared memory
• __syncthreads()
• Asynchronous operation
• Handling errors
• Managing devices

1D Stencil

• Consider applying a 1D stencil to a 1D array of elements
  – Each output element is the sum of input elements within a radius
• If radius is 3, then each output element is the sum of 7 input elements:

[Figure: one output element computed from the input elements within radius on either side]

Implementing Within a Block

• Each thread processes one output element
  – blockDim.x elements per block
• Input elements are read several times
  – With radius 3, each input element is read seven times
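As a baseline for what follows, here is a minimal sketch of the one-thread-per-output approach in which every thread reads its inputs straight from global memory. The names `stencil_naive` and `run_naive_stencil`, the constants `RADIUS`, `BLOCK_SIZE` and `N`, and the convention of padding the input with `RADIUS` extra elements on each side are assumptions made for illustration and do not come from the slides.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define RADIUS 3
#define BLOCK_SIZE 16
#define N (BLOCK_SIZE * 4)   // hypothetical problem size, a multiple of BLOCK_SIZE

// Each thread produces one output element and reads 2 * RADIUS + 1 inputs
// directly from global memory, so neighbouring threads re-read the same data.
// `in` points at the first real element; the caller guarantees RADIUS padding
// elements before and after the data (an assumption of this sketch).
__global__ void stencil_naive(const int *in, int *out) {
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += in[gindex + offset];
    out[gindex] = result;
}

// Runs the kernel on an all-ones input and returns the number of wrong
// outputs, or -1 if a CUDA call fails.
int run_naive_stencil() {
    std::vector<int> h_in(N + 2 * RADIUS, 1), h_out(N, 0);
    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, h_in.size() * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, h_out.size() * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(int), cudaMemcpyHostToDevice);

    // Skip the left padding so that gindex 0 is the first real element.
    stencil_naive<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out);

    cudaError_t err = cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < N; i++)
        if (h_out[i] != 2 * RADIUS + 1) wrong++;   // sum of seven ones
    return wrong;
}
```

Calling `run_naive_stencil()` from `main` should return 0 on a working device. The redundant global-memory reads in the inner loop are what the shared-memory version is meant to avoid.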

Sharing Data Between Threads

• Terminology: within a block, threads share data via shared memory
• Extremely fast on-chip memory, user-managed
• Declare using __shared__, allocated per block
• Data is not visible to threads in other blocks

Implementing With Shared Memory

• Cache data in shared memory
  – Read (blockDim.x + 2 * radius) input elements from global memory to shared memory
  – Compute blockDim.x output elements
  – Write blockDim.x output elements to global memory
  – Each block needs a halo of radius elements at each boundary

[Figure: blockDim.x output elements (16 in the example), with a halo on the left and a halo on the right]

Stencil Kernel

    __global__ void stencil_1d(int *in, int *out) {
        __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
        int gindex = threadIdx.x + blockIdx.x * blockDim.x;
        int lindex = threadIdx.x + RADIUS;

        // Read input elements into shared memory
        temp[lindex] = in[gindex];
        if (threadIdx.x < RADIUS) {
            temp[lindex - RADIUS] = in[gindex - RADIUS];
            temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
        }

        // Apply the stencil
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += temp[lindex + offset];

        // Store the result
        out[gindex] = result;
    }

Data Race!

• The stencil example will not work…
• Suppose thread 15 reads the halo before thread 0 has fetched it…

    temp[lindex] = in[gindex];                                // Store at temp[18]
    if (threadIdx.x < RADIUS) {                               // Skipped, threadIdx.x > RADIUS
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
    int result = 0;
    result += temp[lindex + 1];                               // Load from temp[19]

__syncthreads()

• void __syncthreads();
• Synchronizes all threads within a block
  – Used to prevent RAW / WAR / WAW hazards
• All threads must reach the barrier
  – In conditional code, the condition must be uniform across the block
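The rule about conditional code can be sketched as follows. In this hypothetical kernel only the in-range threads load and store, but the barrier itself is placed outside the branch so that every thread in the block reaches it. The kernel name `rotate_left_in_block`, the host helper `run_rotate_demo`, the block size of 8 and the array length of 20 are all assumptions chosen for illustration.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 8   // hypothetical block size for this sketch

// Rotates the valid elements owned by each block left by one position.
__global__ void rotate_left_in_block(const int *in, int *out, int n) {
    __shared__ int tile[BLOCK_SIZE];
    int t = threadIdx.x;
    int g = t + blockIdx.x * blockDim.x;
    // Number of in-range elements owned by this block; every thread in the
    // block computes the same value.
    int count = min((int)blockDim.x, n - (int)(blockIdx.x * blockDim.x));

    if (g < n)               // divergent branch: only in-range threads load
        tile[t] = in[g];

    // The barrier sits outside the branch so all threads of the block arrive.
    // Putting it inside `if (g < n)` would make the condition non-uniform in
    // the last block, which can hang or give unintended results.
    __syncthreads();

    if (g < n)
        out[g] = tile[(t + 1) % count];
}

// Returns the number of wrong outputs, or -1 if a CUDA call fails.
int run_rotate_demo() {
    const int n = 20;        // deliberately not a multiple of BLOCK_SIZE
    std::vector<int> h_in(n), h_out(n, -1);
    for (int i = 0; i < n; i++) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    rotate_left_in_block<<<blocks, BLOCK_SIZE>>>(d_in, d_out, n);

    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    // Host reference: element i takes the value of its right neighbour inside
    // the same block, wrapping around at the end of the block's valid range.
    int wrong = 0;
    for (int i = 0; i < n; i++) {
        int start = (i / BLOCK_SIZE) * BLOCK_SIZE;
        int count = (n - start < BLOCK_SIZE) ? n - start : BLOCK_SIZE;
        int expected = start + (i - start + 1) % count;
        if (h_out[i] != expected) wrong++;
    }
    return wrong;
}
```

The last block holds only four valid elements, so half of its threads skip both guarded statements yet still take part in the barrier.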

Stencil Kernel

    __global__ void stencil_1d(int *in, int *out) {
        __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
        int gindex = threadIdx.x + blockIdx.x * blockDim.x;
        int lindex = threadIdx.x + RADIUS;

        // Read input elements into shared memory
        temp[lindex] = in[gindex];
        if (threadIdx.x < RADIUS) {
            temp[lindex - RADIUS] = in[gindex - RADIUS];
            temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
        }

        // Synchronize (ensure all the data is available)
        __syncthreads();

        // Apply the stencil
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += temp[lindex + offset];

        // Store the result
        out[gindex] = result;
    }
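The kernel on the slides leaves out the host side. The following is one possible way to drive it, offered as a sketch rather than the presenters' own code. The values of `RADIUS`, `BLOCK_SIZE` and `N`, the helper name `run_stencil`, and the choice to pad the input with `RADIUS` elements on each side (so that the halo reads of the first and last blocks stay in bounds) are assumptions.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define RADIUS 3
#define BLOCK_SIZE 16
#define N (BLOCK_SIZE * 4)   // hypothetical size, a multiple of BLOCK_SIZE

// Shared-memory stencil; must be launched with blockDim.x == BLOCK_SIZE.
__global__ void stencil_1d(const int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Each thread loads its own element; the first RADIUS threads also load
    // the left and right halos.
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Wait until the whole tile, halos included, has been written.
    __syncthreads();

    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];
    out[gindex] = result;
}

// Returns the number of outputs that differ from a CPU reference,
// or -1 if a CUDA call fails.
int run_stencil() {
    std::vector<int> h_in(N + 2 * RADIUS), h_out(N, 0);
    for (size_t i = 0; i < h_in.size(); i++) h_in[i] = (int)i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, h_in.size() * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, h_out.size() * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(int), cudaMemcpyHostToDevice);

    // Offset the input pointer past the left padding.
    stencil_1d<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out);

    cudaError_t err = cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < N; i++) {
        int expected = 0;
        for (int j = i; j <= i + 2 * RADIUS; j++) expected += h_in[j];
        if (h_out[i] != expected) wrong++;
    }
    return wrong;
}
```

Calling `run_stencil()` from `main` should return 0. Removing the `__syncthreads()` call reintroduces the race described earlier, although a small test like this one may still pass by chance.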

Review (1 of 2)

• Launching parallel threads
  – Launch N blocks with M threads per block with kernel<<<N,M>>>(…);
  – Use blockIdx.x to access block index within grid
  – Use threadIdx.x to access thread index within block
• Allocate elements to threads:

    int index = threadIdx.x + blockIdx.x * blockDim.x;
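To make the launch syntax and the index formula concrete, here is a small sketch of an element-wise addition. The kernel name `add`, the helper `run_add_demo`, the block size of 128 and the array length of 1000 are assumptions; the bounds check is needed because the array length is not a multiple of the block size.

```cuda
#include <vector>
#include <cuda_runtime.h>

// One thread per element; threads past the end of the arrays do nothing.
__global__ void add(const int *a, const int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}

// Returns the number of wrong sums, or -1 if a CUDA call fails.
int run_add_demo() {
    const int n = 1000;             // hypothetical length
    const int threads = 128;        // M threads per block
    const int blocks = (n + threads - 1) / threads;   // N blocks, rounded up
    const size_t bytes = n * sizeof(int);

    std::vector<int> h_a(n), h_b(n), h_c(n, 0);
    for (int i = 0; i < n; i++) { h_a[i] = i; h_b[i] = 2 * i; }

    int *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    if (cudaMalloc(&d_a, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_b, bytes) != cudaSuccess) { cudaFree(d_a); return -1; }
    if (cudaMalloc(&d_c, bytes) != cudaSuccess) { cudaFree(d_a); cudaFree(d_b); return -1; }

    cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice);

    add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaError_t err = cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < n; i++)
        if (h_c[i] != 3 * i) wrong++;
    return wrong;
}
```

With 1000 elements and 128 threads per block, eight blocks are launched and the last 24 threads fail the bounds check.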

Review (2 of 2)

• Use __shared__ to declare a variable/array in shared memory
  – Data is shared between threads in a block
  – Not visible to threads in other blocks
• Use __syncthreads() as a barrier
  – Use to prevent data hazards

MANAGING THE DEVICE

CONCEPTS
• Heterogeneous Computing
• Blocks
• Threads
• Indexing
• Shared memory
• __syncthreads()
• Asynchronous operation
• Handling errors
• Managing devices

Coordinating Host & Device

• Kernel launches are asynchronous
  – Control returns to the CPU immediately
• CPU needs to synchronize before consuming the results

cudaMemcpy()               Blocks the CPU until the copy is complete
                           Copy begins when all preceding CUDA calls have completed
cudaMemcpyAsync()          Asynchronous, does not block the CPU
cudaDeviceSynchronize()    Blocks the CPU until all preceding CUDA calls have completed
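A minimal sketch of how these three calls interact, under some assumptions: the kernel `fill`, the helper `run_async_demo`, the buffer size and the use of pinned host memory from `cudaMallocHost()` are illustrative choices, not something the slides prescribe. Everything runs in the default stream, so the copy is ordered after the kernel.

```cuda
#include <cuda_runtime.h>

// Writes `value` into every element of `data`.
__global__ void fill(int *data, int n, int value) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] = value;
}

// Returns the number of wrong elements, or -1 if a CUDA call fails.
int run_async_demo() {
    const int n = 1 << 16;
    const size_t bytes = n * sizeof(int);
    int *d_data = nullptr, *h_data = nullptr;

    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return -1;
    // Pinned (page-locked) host memory, so the asynchronous copy does not
    // have to fall back to a staged transfer.
    if (cudaMallocHost((void **)&h_data, bytes) != cudaSuccess) {
        cudaFree(d_data);
        return -1;
    }

    // The launch returns to the CPU right away; the kernel may still be running.
    fill<<<(n + 255) / 256, 256>>>(d_data, n, 42);

    // Queued behind the kernel in the default stream; this call also returns
    // without waiting for the copy to finish.
    cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    // The CPU could do unrelated work here, but h_data is not yet safe to read.

    // Block until the kernel and the copy have both completed.
    cudaError_t err = cudaDeviceSynchronize();

    int wrong = -1;
    if (err == cudaSuccess) {
        wrong = 0;
        for (int i = 0; i < n; i++)
            if (h_data[i] != 42) wrong++;
    }
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return wrong;
}
```

Replacing `cudaMemcpyAsync()` with `cudaMemcpy()` would make the explicit `cudaDeviceSynchronize()` unnecessary here, because the blocking copy waits for the preceding kernel and for its own completion.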

Reporting Errors

• All CUDA API calls return an error code (cudaError_t)
  – Error in the API call itself
    OR
  – Error in an earlier asynchronous operation (e.g. kernel)
• Get the error code for the last error:

    cudaError_t cudaGetLastError(void)

• Get a string to describe the error:

    const char *cudaGetErrorString(cudaError_t)

    printf("%s\n", cudaGetErrorString(cudaGetLastError()));
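One way to apply this in practice is sketched below. A kernel launch does not return a status itself, so the sketch queries `cudaGetLastError()` right after the launch and then checks the result of `cudaDeviceSynchronize()` for errors raised during execution. The helper `check`, the kernel `noop` and the two launch functions are hypothetical names introduced for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: reports a failed call and passes the error code through.
static cudaError_t check(cudaError_t err, const char *what) {
    if (err != cudaSuccess)
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return err;
}

__global__ void noop() {}

// Launches `noop` with the given block size and reports any error.
static cudaError_t launch_noop(int threads_per_block) {
    noop<<<1, threads_per_block>>>();

    // Errors in the launch itself (for example an invalid configuration)
    // are picked up here.
    cudaError_t err = check(cudaGetLastError(), "noop launch");
    if (err != cudaSuccess) return err;

    // Errors that occur while the kernel runs surface at the next
    // synchronizing call.
    return check(cudaDeviceSynchronize(), "noop execution");
}

// A valid launch: expected to return cudaSuccess on a working device.
cudaError_t launch_ok() {
    return launch_noop(32);
}

// An invalid launch: far more threads per block than any device supports,
// so the launch is rejected and an error code comes back.
cudaError_t launch_with_bad_config() {
    const int too_many_threads = 1 << 20;
    return launch_noop(too_many_threads);
}
```

Calling `launch_with_bad_config()` prints a message to stderr and returns a value other than `cudaSuccess`; because `cudaGetLastError()` resets the recorded error, a later call to `launch_ok()` should still succeed.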

Device Management

• Application can query and select GPUs

    cudaGetDeviceCount(int *count)
    cudaSetDevice(int device)
    cudaGetDevice(int *device)
    cudaGetDeviceProperties(cudaDeviceProp *prop, int device)

• Multiple threads can share a device
• A single thread can manage multiple devices
  – cudaSetDevice(i) to select current device
  – cudaMemcpy(…) for peer-to-peer copies ✝

✝ requires OS and device support
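The query calls listed above can be combined into a short device listing. This is a sketch: the function name `list_devices`, the fields chosen for printing and the decision to select device 0 at the end are assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints a one-line summary of every CUDA device and returns how many were found.
int list_devices() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 0;
    }

    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        printf("Device %d: %s, compute capability %d.%d, "
               "%zu bytes global memory, %zu bytes shared memory per block\n",
               dev, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem, prop.sharedMemPerBlock);
    }

    // Make device 0 the current device for subsequent CUDA calls in this
    // host thread (an arbitrary choice for the sketch).
    if (count > 0) cudaSetDevice(0);
    return count;
}
```

The `major` and `minor` fields together give the compute capability of each device.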

Introduction to CUDA C/C++

• What have we learned?
  – Write and launch CUDA C/C++ kernels
    • __global__, blockIdx.x, threadIdx.x, <<<>>>
  – Manage GPU memory
    • cudaMalloc(), cudaMemcpy(), cudaFree()
  – Manage communication and synchronization
    • __shared__, __syncthreads()
    • cudaMemcpy() vs cudaMemcpyAsync(), cudaDeviceSynchronize()

Compute Capability

• The compute capability of a device describes its architecture, e.g.
  – Number of registers
  – Sizes of memories
  – Features & capabilities
• The following presentations concentrate on Fermi devices
  – Compute Capability >= 2.0

Compute Capability | Selected Features (see CUDA C Programming Guide for complete list) | Tesla models
1.0 | Fundamental CUDA support | 870
1.3 | Double precision, improved memory accesses, atomics | 10-series
2.0 | Caches, fused multiply-add, 3D grids, surfaces, ECC, P2P, concurrent kernels/copies, function pointers, recursion | 20-series

IDs and Dimensions

• A kernel is launched as a grid of blocks of threads
  – blockIdx and threadIdx are 3D
  – We showed only one dimension (x)
• Built-in variables:
  – threadIdx
  – blockIdx
  – blockDim
  – gridDim

[Figure: a Device running Grid 1, a 3 × 2 arrangement of blocks from Block (0,0,0) to Block (2,1,0); Block (1,1,0) is expanded into a 5 × 3 arrangement of threads from Thread (0,0,0) to Thread (4,2,0)]
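To show the extra dimensions in use, here is a small sketch that launches a 2D grid of 2D blocks with the same shape as the block and thread coordinates listed above (5 × 3 threads per block, 3 × 2 blocks). The kernel `fill_2d`, the helper `run_2d_demo` and the 13 × 5 array size are assumptions chosen so that the grid works out to that shape.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread computes its (x, y) position in the whole grid and writes the
// corresponding row-major linear index into `out`.
__global__ void fill_2d(int *out, int width, int height) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x < width && y < height)
        out[y * width + x] = y * width + x;
}

// Returns the number of wrong elements, or -1 if a CUDA call fails.
int run_2d_demo() {
    const int width = 13, height = 5;          // hypothetical array shape
    const int n = width * height;

    dim3 block(5, 3);                          // 5 x 3 threads per block
    dim3 grid((width + block.x - 1) / block.x, // 3 blocks in x
              (height + block.y - 1) / block.y); // 2 blocks in y

    std::vector<int> h_out(n, -1);
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return -1;

    fill_2d<<<grid, block>>>(d_out, width, height);

    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < n; i++)
        if (h_out[i] != i) wrong++;
    return wrong;
}
```

The grid covers 15 × 6 threads for a 13 × 5 array, so the bounds check in the kernel discards the threads that fall outside the array.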

Topics we skipped

• We skipped some details, you can learn more:
  – CUDA Programming Guide
  – CUDA Zone: tools, training, webinars and more
    developer.nvidia.com/cuda