

GPU Programming

Lecture 2: CUDA C Basics

Miaoqing Huang

University of Arkansas

Spring 2015

1 / 33


Outline

Evolvements of NVIDIA GPU

CUDA Basic

Detailed Steps
  - Device Memories and Data Transfer
  - Kernel Functions and Threading

2 / 33



Architecture of G80 GPU

[Figure: block diagram of the G80 GPU]

- 128 streaming processors in 16 streaming multiprocessors

4 / 33


Architecture of GT200 GPU (e.g., GeForce GTX 295)

[Figure: block diagram of the GT200 GPU]

- 240 streaming processors in 30 streaming multiprocessors

5 / 33


Architecture of Fermi GPU (e.g., GeForce GTX 480, Tesla C2075)

[Figure: Fermi architecture. A Fermi Streaming Multiprocessor (SM) combines 32 CUDA cores, 16 load/store (LD/ST) units, and 4 Special Function Units with a register file (32,768 × 32-bit), two warp schedulers with two dispatch units, an instruction cache, and an interconnect network to 64 KB shared memory / L1 cache. A 16-SM Fermi GPU adds a shared L2 cache, six DRAM interfaces, the GigaThread engine, and the host interface. Each CUDA core contains a dispatch port, an operand collector, an FP unit, an INT unit, and a result queue.]

- 512 Fermi streaming processors in 16 streaming multiprocessors

6 / 33


Architecture of Kepler GPU (e.g., Tesla K20)

- 2,880 streaming processors in 15 streaming multiprocessors

7 / 33


Architecture of Kepler Streaming Multiprocessor

- Each streaming multiprocessor contains 192 single-precision cores and 64 double-precision cores

8 / 33


Comparison among Three Architectures

GPU                           G80            GT200          Fermi
Transistors                   681 million    1.4 billion    3.0 billion
CUDA Cores                    128            240            512
Double Precision Capability   None           30 FMAs/clk    256 FMAs/clk
Single Precision Capability   128 MADs/clk   240 MADs/clk   512 FMAs/clk
Special Function Units / SM   2              2              4
Shared Memory / SM            16 KB          16 KB          48 KB / 16 KB
L1 Cache / SM                 None           None           16 KB / 48 KB
L2 Cache                      None           None           768 KB
ECC Support                   No             No             Yes

[Figure: Multiply-Add (MAD) computes A × B = Product, truncates the extra digits of the product, then adds C to produce the result. Fused Multiply-Add (FMA) retains all digits of the product A × B through the addition of C and rounds only once.]
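The difference is observable on the host with C's fmaf, which, like the GPU's FMA, rounds only once. A minimal sketch, assuming IEEE-754 single precision and a compiler that keeps the two roundings of the separate multiply and add apart (e.g., -ffp-contract=off): with a = 1 + 2^-12, the exact product a × a = 1 + 2^-11 + 2^-24 needs more mantissa bits than a float holds.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        float a = 1.0f + ldexpf(1.0f, -12);  // a = 1 + 2^-12, exact in float
        float p = a * a;                     // rounds: the low term 2^-24 is lost
        float mad = p - 1.0f;                // multiply-then-add: gives 2^-11
        float fma = fmaf(a, a, -1.0f);       // fused: one rounding at the end
        printf("multiply-then-add: %.10e\n", mad);  // 2^-11        = 4.8828125000e-04
        printf("fused multiply-add: %.10e\n", fma); // 2^-11 + 2^-24 = 4.8834085464e-04
        return 0;
    }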

9 / 33


Fermi vs. Kepler

                                      FERMI    FERMI    KEPLER       KEPLER
                                      GF100    GF104    GK104        GK110
Compute Capability                    2.0      2.1      3.0          3.5
Threads / Warp                        32       32       32           32
Max Warps / Multiprocessor            48       48       64           64
Max Threads / Multiprocessor          1536     1536     2048         2048
Max Thread Blocks / Multiprocessor    8        8        16           16
32-bit Registers / Multiprocessor     32768    32768    65536        65536
Max Registers / Thread                63       63       63           255
Max Threads / Thread Block            1024     1024     1024         1024
Shared Memory Size Configurations     16K/48K  16K/48K  16K/32K/48K  16K/32K/48K
Max X Grid Dimension                  2^16-1   2^16-1   2^32-1       2^32-1
Hyper-Q                               No       No       No           Yes
Dynamic Parallelism                   No       No       No           Yes

Compute Capability of Fermi and Kepler GPUs

10 / 33


Compute Capability

- The compute capability of a device is defined by a major revision number and a minor revision number
- Devices with the same major revision number are of the same core architecture
  - Kepler architecture: 3.x
  - Fermi architecture: 2.x
  - Prior devices: 1.x
- "NVIDIA CUDA C Programming Guide (v6.5)"
  - Appendix A lists all CUDA-enabled devices along with their compute capability
  - Appendix G gives the technical specifications of each compute capability
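The revision numbers can also be queried at run time; a minimal sketch using the CUDA runtime's cudaGetDeviceProperties (a device 0 is assumed to be present):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main(void) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);        // properties of device 0
        printf("%s: compute capability %d.%d\n",  // e.g., "Tesla K20: 3.5"
               prop.name, prop.major, prop.minor);
        return 0;
    }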

11 / 33


Outline

Evolvements of NVIDIA GPU

CUDA Basic

Detailed Steps
  - Device Memories and Data Transfer
  - Kernel Functions and Threading

12 / 33


Data Parallelism
Square Matrix Multiplication Example

[Figure: matrices M, N, and P, each WIDTH × WIDTH]

- P = M × N of size WIDTH × WIDTH
- The approach in C

    for (i = 0; i < WIDTH; i++) {
      for (j = 0; j < WIDTH; j++) {
        ......
        P[i][j] = ......;
        ......
      }
    }

- The approach in CUDA (the straightforward way)
  - One thread calculates one element of P
  - Issue WIDTH × WIDTH threads simultaneously

13 / 33




The Heterogeneous Platform

- A typical GPGPU-capable platform consists of
  - One or more microprocessors (CPUs) – host
  - One or more GPUs – device
- GPU devices are connected to CPUs through the PCIe 2.0 bus
- A CUDA program consists of
  - The code on CPU – software part
  - The code on GPU – hardware part
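Host code can discover and select the GPUs at run time; a minimal sketch with the runtime API (device 0 is chosen arbitrarily here):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main(void) {
        int count = 0;
        cudaGetDeviceCount(&count);   // number of CUDA-capable devices
        printf("%d CUDA device(s) found\n", count);
        if (count > 0)
            cudaSetDevice(0);         // subsequent CUDA calls target device 0
        return 0;
    }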

[Figure: a heterogeneous platform; CPUs linked by QPI at 25.6 GB/s, GPUs attached over 16x PCIe 2.0 links (CPU-GPU 5.8 GB/s unidirectional), with on-board GPU memory bandwidths of 102 GB/s and 144 GB/s.]

16 / 33


CUDA Program Structure

Serial Code (host)
    . . .
Parallel Kernel (device), Grid 0:
    KernelA<<< nBlk, nTid >>>(args);

Serial Code (host)
    . . .
Parallel Kernel (device), Grid 1:
    KernelB<<< nBlk, nTid >>>(args);

- Integrated host + device application C program
  - Sequential or modestly parallel parts in host C code
  - Highly parallel parts in device SPMD kernel C code
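A minimal end-to-end sketch of this structure; KernelA, KernelB, and the launch configuration are placeholders, not code from the lecture:

    #include <cuda_runtime.h>

    __global__ void KernelA(float* d, int n) { /* highly parallel part */ }
    __global__ void KernelB(float* d, int n) { /* highly parallel part */ }

    int main(void) {
        const int n = 1024, nBlk = 4, nTid = 256;
        float* d;
        cudaMalloc((void**)&d, n * sizeof(float));

        // ... serial host code, then launch Grid 0
        KernelA<<<nBlk, nTid>>>(d, n);
        cudaDeviceSynchronize();      // host waits for Grid 0 to finish

        // ... more serial host code, then launch Grid 1
        KernelB<<<nBlk, nTid>>>(d, n);
        cudaDeviceSynchronize();

        cudaFree(d);
        return 0;
    }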

17 / 33


CUDA Thread Organization

[Figure: the host launches Kernel 1 and Kernel 2 on the device; each launch creates a grid (Grid 1, Grid 2) of blocks, e.g., Block (0,0) through Block (1,1), and each block, e.g., Block (1,1), is an array of threads Thread (0,0,0) through Thread (3,1,1). Courtesy: NVIDIA]

- A kernel is implemented as a grid of threads
- Threads in a grid are further decomposed into blocks
- A grid can be up to three-dimensional
- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution
  - Efficiently sharing data through shared memory
- A block can be one-, two-, or three-dimensional
- Each block and each thread has its own ID

18 / 33



Matrix Multiplication: the main() function

[Figure: matrices M, N, and P, each WIDTH × WIDTH]

int main(void) {
    // 1.
    // Allocate and initialize the matrices M, N, P
    // I/O to read the input matrices M and N
    ......

    // 2.
    // M*N on the processor
    MatrixMultiplication(M, N, P, width);

    // 3.
    // I/O to write the output matrix P
    // Free matrices M, N, P
    ......
    return 0;
}

20 / 33


Matrix Multiplication: A Simple Host Version in C

[Figure: matrices M, N, and P, each WIDTH × WIDTH]

void MatrixMultiplication(float* M, float* N, float* P, int width)
{
    for (int i = 0; i < width; i++) {
        for (int j = 0; j < width; j++) {
            float sum = 0;
            for (int k = 0; k < width; k++) {
                float a = M[i*width + k];
                float b = N[k*width + j];
                sum += a * b;
            }
            P[i*width + j] = sum;
        }
    }
}

[Figure: row-major layout of a 4 × 4 matrix; elements M0,0 M0,1 M0,2 M0,3 M1,0 ... M3,3 are stored linearly, so element Mi,j (row i, column j) sits at offset i*width + j.]

21 / 33


Matrix Multiplication: Move computation to the device

void MatrixMultiplication(float* M, float* N, float* P, int width)
{
    int size = width*width*sizeof(float);
    float *Md, *Nd, *Pd;   // device copies of M, N, P
    ......

    // 1.
    // Allocate device memory for M, N, and P
    // Copy M and N to allocated device memory locations

    // 2.
    // Kernel invocation code - to have the device perform
    // the actual matrix multiplication

    // 3.
    // Copy P from the device memory
    // Free device matrices
}

22 / 33


Outline

Evolvements of NVIDIA GPU

CUDA Basic

Detailed Steps
  - Device Memories and Data Transfer
  - Kernel Functions and Threading

23 / 33


Memory Hierarchy on Device

- Memory hierarchy on device
  - Global Memory
    - Main means of communicating between host and device
    - Long latency access
  - Shared Memory
    - Short latency
  - Register
    - Per-thread local variables

[Figure: a grid of blocks; each block has its own shared memory, each thread its own registers, and all blocks share the per-grid global memory, which the host can also access.]

- Data access
  - Device code can read/write per-thread registers, per-block shared memory, and per-grid global memory
  - Host code can transfer data to/from per-grid global memory
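A hypothetical kernel sketch showing where each kind of variable lives (the kernel name and sizes are illustrative; a block of up to 256 threads is assumed):

    __global__ void MemorySpaces(float* g, int n)  // g: per-grid global memory
    {
        __shared__ float tile[256];   // per-block shared memory
        int i = threadIdx.x;          // local variable: per-thread register

        if (i < n) tile[i] = g[i];    // global -> shared (long-latency read)
        __syncthreads();              // block-wide synchronization
        if (i < n) g[i] = 2.0f * tile[i];  // shared -> global
    }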

24 / 33


CUDA Device Memory Allocation

- cudaMalloc()
  - Allocates object in the device Global Memory
  - Requires two parameters
    1. Address of a pointer to the allocated object
    2. Size of the allocated object

    float* Md; // d indicates a device data
    int size = width*width*sizeof(float);

    cudaMalloc((void**)&Md, size);
    cudaFree(Md);

- cudaFree()
  - Frees object from device Global Memory
  - Takes the pointer to the freed object
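Both calls return a cudaError_t status; a minimal checking sketch (the helper name check is hypothetical):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Hypothetical helper: abort with a readable message on failure.
    static void check(cudaError_t err, const char* what) {
        if (err != cudaSuccess) {
            fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
            exit(EXIT_FAILURE);
        }
    }

    // Usage:
    //   check(cudaMalloc((void**)&Md, size), "cudaMalloc Md");
    //   check(cudaFree(Md), "cudaFree Md");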

25 / 33


CUDA Host-Device Data Transfer

- cudaMemcpy()
  - Memory data transfer
  - Requires four parameters
    1. Pointer to destination
    2. Pointer to source
    3. Number of bytes copied
    4. Type of transfer
- Types of transfer
  - cudaMemcpyHostToHost
  - cudaMemcpyHostToDevice
  - cudaMemcpyDeviceToHost
  - cudaMemcpyDeviceToDevice

    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);
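A self-contained round-trip sketch under the same naming convention (M on the host, Md on the device; the sizes are illustrative):

    float M[16];
    for (int i = 0; i < 16; i++) M[i] = (float)i;

    float* Md;
    int size = 16 * sizeof(float);
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);  // host -> device
    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(Md);
    // M[i] should again equal i here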

26 / 33


Matrix Multiplication: After Integrating the Data Transfer

void MatrixMultiplication(float* M, float* N, float* P, int width)
{
    int size = width*width*sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate and load M, N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate P on the device
    cudaMalloc((void**)&Pd, size);

    // 2. Kernel invocation code -- to be shown later

    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}

27 / 33



Index (i.e., coordinates) of Block and Thread

[Figure: a 4 × 2 × 2 thread block drawn on X/Y/Z axes, with threads (0,0,0) through (3,1,1); tx, ty, tz denote threadIdx.x, threadIdx.y, threadIdx.z. A 4 × 4 matrix M0,0 ... M3,3 illustrates the mapping of thread coordinates to matrix elements.]

- Each block and each thread is assigned an index, i.e., blockIdx and threadIdx
  - blockIdx.x, blockIdx.y, blockIdx.z
  - threadIdx.x, threadIdx.y, threadIdx.z
- threadIdx.y → row, threadIdx.x → column
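For grids with more than one block (beyond this slide's single-block picture), the standard convention combines blockIdx, blockDim, and threadIdx into a global coordinate; a sketch under that assumption:

    __global__ void GlobalIndex(float* Pd, int width)
    {
        // global (row, column) of the element owned by this thread
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;

        if (row < width && col < width)    // guard against partial blocks
            Pd[row*width + col] = 0.0f;    // ... compute the element here
    }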

29 / 33



Define the Dimension of Grid and Block

- Use the pre-defined type dim3 to define the dimensions of the grid and the block
- Use gridDim and blockDim to get the dimensions of the grid and the block

    dim3 dimGrid(width_g, height_g, depth_g);
    dim3 dimBlock(width_b, height_b, depth_b);

- Total number of threads issued
  - width_g × height_g × depth_g × width_b × height_b × depth_b
- Launch the device computation threads

    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
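A concrete instance with hypothetical numbers, assuming a kernel that derives a global index from blockIdx as sketched earlier (the lecture's single-block kernel on the next page instead uses one block of width × width threads):

    dim3 dimGrid(4, 4, 1);     // 4 x 4 x 1 = 16 blocks
    dim3 dimBlock(16, 16, 1);  // 16 x 16 x 1 = 256 threads per block

    // 16 blocks x 256 threads = 4096 threads: one per element of a 64 x 64 P
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, 64);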

31 / 33


Kernel Function Specification

[Figure: matrices Md, Nd, and Pd, each WIDTH × WIDTH]

// Matrix multiplication kernel -- per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int width)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;

    for (int k = 0; k < width; ++k) {
        float Melement = Md[ty*width + k];
        float Nelement = Nd[k*width + tx];
        Pvalue += Melement * Nelement;
    }

    Pd[ty*width + tx] = Pvalue;
}
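Since this kernel indexes only with threadIdx, it assumes a single block covering the whole matrix; a sketch of the matching invocation for step 2 of the earlier host function (valid only while width × width stays within the 1024-threads-per-block limit):

    dim3 dimGrid(1, 1, 1);           // one block covers all of Pd
    dim3 dimBlock(width, width, 1);  // one thread per element of Pd

    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);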

32 / 33


CUDA Function Declarations

                                  Executed on   Only callable from
__device__ float DeviceFunc()     device        device
__global__ void KernelFunc()      device        host
__host__ float HostFunc()         host          host

- __global__ defines a kernel function
  - Must be a void function
- __device__ and __host__ can be used together
  - Generates two versions of the same function during the compilation
- For functions executed on the device
  - __global__ functions do not support recursion
  - __device__ functions only support recursion in device code compiled for devices of compute capability 2.x and higher
  - No static variable declarations inside the function
  - No variable number of arguments
  - No indirect function calls through pointers
- All functions are host functions by default
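A sketch of the combined qualifiers (the functions are hypothetical); the compiler emits one host version and one device version of the same source:

    __host__ __device__ float square(float x)   // compiled for both sides
    {
        return x * x;
    }

    __global__ void SquareKernel(float* d, int n)
    {
        int i = threadIdx.x;
        if (i < n) d[i] = square(d[i]);   // device-side call
    }

    // Host code may call square(3.0f) directly as well.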

33 / 33