CUDA C/C++ BASICS (cont.)


NVIDIA Corporation

© NVIDIA 2013

COOPERATING THREADS

CONCEPTS: Heterogeneous Computing, Blocks, Threads, Indexing, Shared memory, __syncthreads(), Asynchronous operation, Handling errors, Managing devices

1D Stencil

• Consider applying a 1D stencil to a 1D array of elements
  – Each output element is the sum of input elements within a radius

• If the radius is 3, then each output element is the sum of 7 input elements:

(figure: the input array, with a halo of radius elements on either side of each output element)
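Before the GPU version, the computation itself can be written down serially. A minimal host-side sketch (the function name and the fixed RADIUS of 3 are illustrative; boundary elements are simply skipped):

#define RADIUS 3

// Reference implementation: out[i] is the sum of the 2*RADIUS + 1
// input elements centred on in[i]; the first and last RADIUS
// elements are left untouched.
void stencil_1d_cpu(const int *in, int *out, int n) {
    for (int i = RADIUS; i < n - RADIUS; i++) {
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += in[i + offset];
        out[i] = result;
    }
}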

Implementing Within a Block

• Each thread processes one output element
  – blockDim.x elements per block

• Input elements are read several times
  – With radius 3, each input element is read seven times, as the sketch below makes visible
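Here is a sketch of the kernel without shared memory (stencil_1d_naive is an illustrative name; a valid halo around the input is assumed). With RADIUS 3, each in[] element is loaded from global memory by up to seven different threads:

__global__ void stencil_1d_naive(const int *in, int *out) {
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;

    // Each thread issues 2*RADIUS + 1 global loads, so neighbouring
    // threads re-read the same input elements from global memory.
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += in[gindex + offset];

    out[gindex] = result;
}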

Sharing Data Between Threads

• Terminology: within a block, threads share data via shared memory

• Extremely fast on-chip memory, user-managed

• Declare using __shared__, allocated per block

• Data is not visible to threads in other blocks
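For illustration, the two usual ways to declare it (names are placeholders): a size fixed at compile time, or a size chosen at launch via the optional third parameter of <<< >>>:

__global__ void shared_demo(int *data) {
    // Static: size known at compile time, one copy per block
    __shared__ int tile[256];

    // Dynamic: size in bytes supplied at launch, e.g.
    //   shared_demo<<<blocks, threads, threads * sizeof(int)>>>(d_data);
    extern __shared__ int dyn_tile[];

    tile[threadIdx.x] = data[threadIdx.x];      // visible to this block only
    dyn_tile[threadIdx.x] = tile[threadIdx.x];
}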

Implementing With Shared Memory

• Cache data in shared memory
  – Read (blockDim.x + 2 * radius) input elements from global memory to shared memory
  – Compute blockDim.x output elements
  – Write blockDim.x output elements to global memory
  – Each block needs a halo of radius elements at each boundary

(figure: blockDim.x output elements, 16 in the example, with a halo of radius elements on the left and on the right)

Stencil Kernel

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

Data Race!

• The stencil example will not work…

• Suppose thread 15 reads the halo before thread 0 has fetched it…

temp[lindex] = in[gindex];                   // thread 15: store at temp[18]

if (threadIdx.x < RADIUS) {                  // thread 15: skipped, threadIdx.x > RADIUS
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}

int result = 0;
result += temp[lindex + 1];                  // thread 15: load from temp[19], which thread 0 may not have stored yet

__syncthreads()

•  void __syncthreads();

• Synchronizes all threads within a block
  – Used to prevent RAW / WAR / WAW hazards

• All threads must reach the barrier
  – In conditional code, the condition must be uniform across the block, as in the sketch below
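A sketch of that rule (illustrative names, assuming a launch with 256 threads per block; not from the slides):

__global__ void reverse_in_block(int *data) {
    __shared__ int tile[256];
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    tile[threadIdx.x] = data[i];

    // WRONG (undefined behaviour): a barrier only some threads reach
    //   if (threadIdx.x < 128) __syncthreads();

    // RIGHT: every thread in the block reaches the same barrier
    __syncthreads();

    // Safe: all of tile[] has been written before any thread reads it
    data[i] = tile[(blockDim.x - 1) - threadIdx.x];
}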

Stencil Kernel

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
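Host code driving this kernel might look like the sketch below (N, BLOCK_SIZE, and the fill step are illustrative assumptions; the buffers get a RADIUS halo on each side so the halo reads stay in bounds):

#include <stdlib.h>

#define N          (2048 * 2048)
#define BLOCK_SIZE 512

int main(void) {
    int size = (N + 2 * RADIUS) * sizeof(int);
    int *in, *out, *d_in, *d_out;

    in  = (int *)malloc(size);   /* fill in[] here (illustrative) */
    out = (int *)malloc(size);
    cudaMalloc((void **)&d_in, size);
    cudaMalloc((void **)&d_out, size);

    cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);

    // Offset by RADIUS so that gindex - RADIUS never reads
    // before the start of the buffer
    stencil_1d<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}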

Review (1 of 2)

• Launching parallel threads
  – Launch N blocks with M threads per block with kernel<<<N,M>>>(…);
  – Use blockIdx.x to access the block index within the grid
  – Use threadIdx.x to access the thread index within the block

• Allocate elements to threads:

  int index = threadIdx.x + blockIdx.x * blockDim.x;
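As a compact recap, a kernel plus its launch (the name add and the sizes are illustrative):

__global__ void add(const int *a, const int *b, int *c) {
    // One thread per element: combine block and thread indices
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

// Launch N blocks of M threads each (N * M elements total):
//   add<<<N, M>>>(d_a, d_b, d_c);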

Review (2 of 2)

• Use __shared__ to declare a variable/array in shared memory
  – Data is shared between threads in a block
  – Not visible to threads in other blocks

• Use __syncthreads() as a barrier
  – Use to prevent data hazards

MANAGING THE DEVICE

CONCEPTS: Heterogeneous Computing, Blocks, Threads, Indexing, Shared memory, __syncthreads(), Asynchronous operation, Handling errors, Managing devices

Coordinating Host & Device

• Kernel launches are asynchronous
  – Control returns to the CPU immediately

• The CPU needs to synchronize before consuming the results

• cudaMemcpy(): blocks the CPU until the copy is complete; the copy begins when all preceding CUDA calls have completed

• cudaMemcpyAsync(): asynchronous, does not block the CPU

• cudaDeviceSynchronize(): blocks the CPU until all preceding CUDA calls have completed
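A sketch of how these combine in practice (my_kernel is a placeholder; for cudaMemcpyAsync to be truly asynchronous the host buffers must be page-locked, e.g. allocated with cudaMallocHost):

__global__ void my_kernel(int *in, int *out);   // placeholder kernel

void run_async(int *h_in, int *h_out, int *d_in, int *d_out,
               size_t size, int blocks, int threads) {
    // Queue the copy; control returns to the CPU immediately
    cudaMemcpyAsync(d_in, h_in, size, cudaMemcpyHostToDevice);

    // Kernel launch is asynchronous as well
    my_kernel<<<blocks, threads>>>(d_in, d_out);

    cudaMemcpyAsync(h_out, d_out, size, cudaMemcpyDeviceToHost);

    // Block the CPU until all preceding CUDA calls have completed
    cudaDeviceSynchronize();
}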

Reporting Errors

• All CUDA API calls return an error code (cudaError_t)
  – Error in the API call itself, OR
  – Error in an earlier asynchronous operation (e.g. a kernel)

• Get the error code for the last error:

  cudaError_t cudaGetLastError(void)

• Get a string to describe the error:

  const char *cudaGetErrorString(cudaError_t)

  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
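A common way to use these calls (a sketch, not from the slides) is a checking macro around every API call, plus cudaGetLastError() after launches:

#include <stdio.h>
#include <stdlib.h>

#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",            \
                    __FILE__, __LINE__, cudaGetErrorString(err));   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc((void **)&d_ptr, size));
//   kernel<<<blocks, threads>>>(d_in, d_out);
//   CUDA_CHECK(cudaGetLastError());   // catches launch errors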

Device Management

• An application can query and select GPUs

  cudaGetDeviceCount(int *count)
  cudaSetDevice(int device)
  cudaGetDevice(int *device)
  cudaGetDeviceProperties(cudaDeviceProp *prop, int device)

• Multiple host threads can share a device

• A single host thread can manage multiple devices
  – cudaSetDevice(i) to select the current device
  – cudaMemcpy(…) for peer-to-peer copies✝

✝ requires OS and device support
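Putting the query calls together, a sketch that enumerates the available GPUs:

#include <stdio.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int device = 0; device < count; device++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, device);
        printf("Device %d: %s (compute capability %d.%d)\n",
               device, prop.name, prop.major, prop.minor);
    }

    cudaSetDevice(0);   // make device 0 current for subsequent calls
    return 0;
}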

Introduction to CUDA C/C++

• What have we learned?
  – Write and launch CUDA C/C++ kernels
    • __global__, blockIdx.x, threadIdx.x, <<< >>>
  – Manage GPU memory
    • cudaMalloc(), cudaMemcpy(), cudaFree()
  – Manage communication and synchronization
    • __shared__, __syncthreads()
    • cudaMemcpy() vs cudaMemcpyAsync(), cudaDeviceSynchronize()

Compute Capability

• The compute capability of a device describes its architecture, e.g.
  – Number of registers
  – Sizes of memories
  – Features & capabilities

• The following presentations concentrate on Fermi devices
  – Compute Capability >= 2.0

Compute Capability   Selected Features (see CUDA C Programming Guide for complete list)   Tesla models
1.0                  Fundamental CUDA support                                             870
1.3                  Double precision, improved memory accesses, atomics                  10-series
2.0                  Caches, fused multiply-add, 3D grids, surfaces, ECC, P2P,            20-series
                     concurrent kernels/copies, function pointers, recursion
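At run time, the major/minor fields of cudaDeviceProp expose the compute capability, so Fermi-only paths can be guarded; a sketch (illustrative helper name):

// Returns nonzero when the device offers Fermi-era features
// (compute capability 2.0 or higher)
int has_fermi_features(int device) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    return prop.major >= 2;
}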

IDs and Dimensions

• A kernel is launched as a grid of blocks of threads
  – blockIdx and threadIdx are 3D
  – We showed only one dimension (x)

• Built-in variables:
  – threadIdx
  – blockIdx
  – blockDim
  – gridDim

(figure: a device executing Grid 1, a 3x2 grid of blocks; Block (1,1,0) is expanded to show its 5x3 arrangement of threads, Thread (0,0,0) through Thread (4,2,0))
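A two-dimensional sketch of these built-ins (the kernel and sizes are illustrative); dim3 dimensions left unspecified default to 1:

__global__ void scale2d(float *data, int width, int height) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;

    if (x < width && y < height)            // guard partial blocks
        data[y * width + x] *= 2.0f;
}

// Launch with dim3 sizes:
//   dim3 threads(16, 16);
//   dim3 blocks((width + 15) / 16, (height + 15) / 16);
//   scale2d<<<blocks, threads>>>(d_data, width, height);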

Topics we skipped

• We skipped some details; you can learn more:
  – CUDA Programming Guide
  – CUDA Zone: tools, training, webinars and more
    developer.nvidia.com/cuda
