ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 22, 2013. GPUMemories.ppt

GPU Memories

These notes will introduce:

• The basic memory hierarchy in the NVIDIA GPU: global memory, shared memory, register file, constant memory
• How to declare variables for each memory
• Memory coalescing
• Cache memory and how to use it most effectively in a program

Host-Device Connection

[Figure: host (CPU) with host memory, connected over PCIe to the device (GPU) with device global memory]

• Host (CPU) to host memory: memory bus, limited by the memory and the processor-memory connection bandwidth. DDR 400: 3.2 GB/s. HyperTransport and Intel's QuickPath: currently 25.6 GB/s
• Host to device (GPU): PCIe x16, 4 GB/s; PCIe x16 Gen2, 8 GB/s peak
• Device (GPU) to device global memory: GPU bus. C2050: 144 GB/s. GTX 280: 141.7 GB/s. GDDR5: 230 GB/s

Note that transferring between host and GPU is much slower than between the device and its global memory. Hence host-device transfers need to be minimized.

A GPU on a laptop such as a MacBook Pro may share the system memory.
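The gap between the two paths can be measured directly. The following is a minimal sketch, not part of the original notes: it times a host-to-device copy and a device-to-device copy with CUDA events. The helper name copyRate and the 64 MB buffer size are assumptions made for illustration, and the printed rates depend entirely on the machine.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Time one cudaMemcpy with CUDA events and return the rate in GB/s.
// copyRate is an illustrative helper, not a CUDA library function.
static float copyRate(void *dst, const void *src, size_t bytes, cudaMemcpyKind kind) {
    cudaEvent_t start, stop;
    float ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, kind);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // wait until the copy has finished
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (float)(bytes / 1.0e9) / (ms / 1.0e3f);
}

int main() {
    const size_t bytes = 64 * 1024 * 1024;   // 64 MB, an arbitrary test size
    char *h = (char *)malloc(bytes);
    char *d1 = NULL, *d2 = NULL;
    memset(h, 1, bytes);
    cudaMalloc((void **)&d1, bytes);
    cudaMalloc((void **)&d2, bytes);

    copyRate(d1, h, bytes, cudaMemcpyHostToDevice);   // warm-up, result ignored
    printf("host to device:   %.1f GB/s\n",
           copyRate(d1, h, bytes, cudaMemcpyHostToDevice));
    printf("device to device: %.1f GB/s\n",
           copyRate(d2, d1, bytes, cudaMemcpyDeviceToDevice));

    cudaFree(d1);
    cudaFree(d2);
    free(h);
    return 0;
}
```

The device-to-device figure counts each byte once although it is both read and written, so it understates the raw memory traffic.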

GPU Memory Hierarchy

Global memory is off-chip on the GPU card.

Even though global memory is an order of magnitude faster than CPU memory, it is still relatively slow and a bottleneck for performance.

The GPU is provided with faster on-chip memory, although data has to be transferred explicitly into shared memory. Pointers created with cudaMalloc() point to global memory.

Two principal levels on-chip: shared memory and registers

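Scope of global memory, shared memory, and registers

[Figure: a grid containing blocks of threads. Each thread has its own registers and local memory, each block has its own shared memory, and the whole grid shares global memory and constant memory, which connect to the host and host memory.]

For storing global constants, see later. There is also a read-only global memory called texture memory.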

Currently data can only be transferred from the host to global (and constant) memory, not from the host directly to shared memory.

Constant memory is used for data that does not change (i.e. read-only by the GPU).

Shared memory is said to provide up to 15x the speed of global memory.

Shared memory has similar speed to registers if all threads read the same address or there are no bank conflicts.

Lifetimes

• Global/constant memory: lifetime of the application
• Shared memory: lifetime of a kernel
• Registers: lifetime of a kernel

Scope

• Global/constant memory: grid
• Shared memory: block
• Registers: thread

Declaring program variables for registers, shared memory and global memory

| Memory    | Declaration                            | Scope  | Lifetime    |
|-----------|----------------------------------------|--------|-------------|
| Registers | Automatic variables* other than arrays | Thread | Kernel      |
| Local     | Automatic array variables              | Thread | Kernel      |
| Shared    | __shared__                             | Block  | Kernel      |
| Global    | __device__                             | Grid   | Application |
| Constant  | __constant__                           | Grid   | Application |

*Automatic variables are allocated automatically when entering the scope of the variable and de-allocated when leaving that scope. In C, all variables declared within a block are “automatic” by default, see http://en.wikipedia.org/wiki/Automatic_variable

Global Memory: __device__

For data available to all threads in the device.

Declared outside function bodies

Scope of grid and lifetime of application

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#define N 1000
…
__device__ int A[N];              // global memory, visible to every thread

__global__ void kernel() {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    A[tid] = …
    …
}

int main() {
    …
}


Issues with using Global memory

• Long delays, slow

• Access congestion

• Cannot synchronize accesses

• Need to ensure no conflicts of accesses between threads
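A small sketch of what a conflict between threads looks like in practice. It is not from the original notes: every thread increments the same global counter, once with a plain read-modify-write and once with atomicAdd(), which is one way (among others) to make conflicting updates safe. The variable and kernel names are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int plainCount;    // updated with an unprotected read-modify-write
__device__ int atomicCount;   // updated with atomicAdd

__global__ void countKernel() {
    plainCount = plainCount + 1;   // conflicting accesses: increments can be lost
    atomicAdd(&atomicCount, 1);    // conflicting updates are serialized by the hardware
}

int main() {
    const int blocks = 64, threads = 256;
    int zero = 0, plainResult = 0, atomicResult = 0;

    cudaMemcpyToSymbol(plainCount, &zero, sizeof(int));
    cudaMemcpyToSymbol(atomicCount, &zero, sizeof(int));

    countKernel<<<blocks, threads>>>();
    cudaDeviceSynchronize();

    cudaMemcpyFromSymbol(&plainResult, plainCount, sizeof(int));
    cudaMemcpyFromSymbol(&atomicResult, atomicCount, sizeof(int));

    // atomicResult equals the thread count; plainResult is usually far smaller.
    printf("threads = %d, plain = %d, atomic = %d\n",
           blocks * threads, plainResult, atomicResult);
    return 0;
}
```

Atomic operations are correct but serialize the conflicting accesses, so arranging for each thread to write its own location remains the faster design when the algorithm allows it.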

Shared Memory

Shared memory is on the GPU chip and very fast.

Separate data available to all threads in one block.

Declared inside function bodies

Scope of block and lifetime of kernel call

So each block would have its own array A[N]

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#define N 1000
…

__global__ void kernel() {
    __shared__ int A[N];          // one copy of A per block
    int tid = threadIdx.x;
    A[tid] = …
    …
}

int main() {
    …
}

Transferring data to shared memory

int A[N][N];   // host array, to be copied into device global memory allocated with cudaMalloc()

__global__ void myKernel(int *A_global) {
    __shared__ int A_sh[n][n];              // declare shared memory
    int row = …
    int col = …
    A_sh[i][j] = A_global[row + col*N];     // copy from global to shared
    …
}

int main() {
    …
    cudaMalloc((void**)&dev_A, size);                      // allocate global memory
    cudaMemcpy(dev_A, A, size, cudaMemcpyHostToDevice);    // copy to global memory
    myKernel<<<G, B>>>(dev_A);
    …
}

Issues with Shared Memory

Shared memory is not immediately synchronized after access.

Usually it is the writes that matter.

Use __syncthreads() before you read data that has been altered.

Shared memory is very limited (Fermi has up to 48 KB per SM, NOT per block).

Hence may have to divide your data into “chunks”
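As a hedged illustration of both points (chunking and __syncthreads()), the sketch below processes an array in block-sized chunks: each block loads its chunk into shared memory, synchronizes, and then writes the chunk out reversed, so every thread reads an element that a different thread wrote. The kernel name, CHUNK size, and reversal task are assumptions chosen for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define CHUNK 256   // elements per block; 1 KB of shared memory per block

__global__ void reverseChunks(const int *in, int *out) {
    __shared__ int chunk[CHUNK];
    int t = threadIdx.x;
    int base = blockIdx.x * CHUNK;

    chunk[t] = in[base + t];              // each thread writes one element
    __syncthreads();                      // all writes must finish before any read
    out[base + t] = chunk[CHUNK - 1 - t]; // read an element written by another thread
}

int main() {
    const int n = 4 * CHUNK;
    int h_in[n], h_out[n];
    int *d_in, *d_out;
    for (int i = 0; i < n; i++) h_in[i] = i;

    cudaMalloc((void **)&d_in, n * sizeof(int));
    cudaMalloc((void **)&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    reverseChunks<<<n / CHUNK, CHUNK>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    int ok = 1;
    for (int i = 0; i < n; i++) {
        int base = (i / CHUNK) * CHUNK;
        if (h_out[i] != h_in[base + CHUNK - 1 - (i - base)]) ok = 0;
    }
    printf("%s\n", ok ? "chunks reversed correctly" : "mismatch");

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Without the __syncthreads() call, a thread could read chunk[CHUNK - 1 - t] before the thread responsible for that element has written it.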

Example uses of shared data

Where the data can be divided into independent parts:

Image processing

- Image can be divided into blocks and placed into shared memory for processing

Block matrix multiplication

- Sub-matrices can be stored in shared memory (slides to follow on this)
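For orientation, here is one common way block matrix multiplication is written with shared memory. It is a sketch under stated assumptions, not necessarily the version developed in the later slides: square N x N matrices in row-major order, N a multiple of the tile width, and the names matMulTiled, TILE, As and Bs are all chosen for this example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 16   // tile width; block is TILE x TILE threads
#define N 64      // matrix dimension, assumed to be a multiple of TILE

__global__ void matMulTiled(const float *A, const float *B, float *C) {
    __shared__ float As[TILE][TILE];   // sub-matrix of A for this step
    __shared__ float Bs[TILE][TILE];   // sub-matrix of B for this step
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE; t++) {
        // Each thread loads one element of each sub-matrix into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                       // tiles fully loaded
        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                       // done with tiles before overwriting
    }
    C[row * N + col] = sum;
}

int main() {
    size_t bytes = N * N * sizeof(float);
    float *hA = (float *)malloc(bytes), *hB = (float *)malloc(bytes), *hC = (float *)malloc(bytes);
    float *dA, *dB, *dC;
    for (int i = 0; i < N * N; i++) { hA[i] = 1.0f; hB[i] = 2.0f; }

    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE), grid(N / TILE, N / TILE);
    matMulTiled<<<grid, block>>>(dA, dB, dC);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    // With A all ones and B all twos, every element of C is 2 * N.
    printf("C[0] = %.1f, expected %.1f\n", hC[0], 2.0f * N);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

Each element of A and B is read from global memory N / TILE times instead of N times, which is where the benefit of the shared-memory tiles comes from.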

Registers

Compiler will place variables declared in a kernel in registers when possible

Limit to the number of registers

Fermi has 32768 32-bit registers per SM

Registers are divided across “warps” (groups of 32 threads that operate in SIMT mode) and have the lifetime of the warps

__global__ void kernel() {
    int x, y, z;      // automatic scalars: candidates for registers
}
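The actual limits for an installed card can be read at run time rather than taken from a table. A minimal sketch, assuming device 0 is the GPU of interest; it uses the CUDA runtime call cudaGetDeviceProperties() and prints the fields relevant to these notes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    // Device 0 is an assumption; a multi-GPU machine may need another index.
    if (cudaGetDeviceProperties(&p, 0) != cudaSuccess) {
        printf("no CUDA device found\n");
        return 1;
    }
    printf("device:                  %s\n", p.name);
    printf("compute capability:      %d.%d\n", p.major, p.minor);
    printf("32-bit registers/block:  %d\n", p.regsPerBlock);
    printf("shared memory/block:     %zu bytes\n", p.sharedMemPerBlock);
    printf("constant memory:         %zu bytes\n", p.totalConstMem);
    printf("global memory:           %zu bytes\n", p.totalGlobalMem);
    printf("L2 cache:                %d bytes\n", p.l2CacheSize);
    printf("warp size:               %d threads\n", p.warpSize);
    return 0;
}
```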

Arrays declared within a kernel (automatic array variables)

__global__ void kernel() {
    int A[10];        // automatic array: private to each thread
}

Generally stored in global memory, but a private copy is made for each thread.*

Can be as slow to access as global memory, except that it is cached, see later

If the array is indexed with constant values, the compiler may use registers

* Global “local” memory, see later

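Constant Memory: __constant__

For data not altered by the device.

Although stored in global memory, it is cached and has fast access

Declared outside function bodies

Scope of grid and lifetime of application

Size currently limited to 65536 bytes

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
…
__constant__ int n;

__global__ void kernel() {
    …
}

int main() {
    int h_n = …;                                // host copy of the value (h_n is an illustrative name)
    cudaMemcpyToSymbol(n, &h_n, sizeof(int));   // host code sets n through the runtime, not by direct assignment
    …
}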
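A complete, runnable variant may help here. The sketch below is an illustration rather than the course's own code: it keeps a small table of polynomial coefficients in constant memory, fills it from the host with cudaMemcpyToSymbol(), and has every thread read the same coefficients. The names coef, poly and NCOEF are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NCOEF 3
__constant__ float coef[NCOEF];   // read-only on the device, set from the host

// Each thread evaluates coef[0] + coef[1]*x + coef[2]*x*x for its own x.
__global__ void poly(float *y) {
    float x = (float)threadIdx.x;
    y[threadIdx.x] = coef[0] + coef[1] * x + coef[2] * x * x;
}

int main() {
    const int threads = 4;
    float h_coef[NCOEF] = {1.0f, 2.0f, 3.0f};
    float h_y[threads];
    float *d_y;

    cudaMemcpyToSymbol(coef, h_coef, sizeof(h_coef));   // host writes constant memory
    cudaMalloc((void **)&d_y, sizeof(h_y));

    poly<<<1, threads>>>(d_y);
    cudaMemcpy(h_y, d_y, sizeof(h_y), cudaMemcpyDeviceToHost);

    for (int i = 0; i < threads; i++)
        printf("y[%d] = %.1f\n", i, h_y[i]);   // prints 1.0, 6.0, 17.0, 34.0

    cudaFree(d_y);
    return 0;
}
```

All threads read the same coef entries at the same time, which is the access pattern the constant cache is designed for.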

Local memory

Resides in the device memory space (global memory) and is slow, except that it is organized so that consecutive 32-bit words are accessed by consecutive thread IDs, giving coalesced accesses when possible.

For compute capability 2.x, it is cached in the on-chip L1 and L2 caches

Used to hold arrays that are not indexed with constant values, and variables when there are no more registers available for them

Cache memory

More recent GPUs have L1 and L2 (data) cache memory, but apparently without cache coherence, so it is up to the programmer to ensure coherence.

Make sure each thread accesses different locations

Ideally arrange accesses to be in the same cache lines

Compute capability 1.3 Teslas do not have cache memory

Compute capability 2.0 Fermis have L1/L2 caches

Fermi Caches

[Figure: Fermi cache hierarchy. Each streaming multiprocessor (SM) has its own register file and a combined L1 cache/shared memory; all SMs share the L2 cache.]

Fermi Cache Sizes

L2

• Unified 384 kB L2 cache for all SMs
• 384-bit memory bus from device memory to L2 cache
• Up to 160 GB/s bandwidth
• 128-byte cache line (32 32-bit integers or floats, or 16 doubles)

L1

• Each SM has 16 kB or 48 kB of L1 cache (64 kB split 16/48 or 48/16 between L1 cache and shared memory)
• No global cache coherency!


Poor Performance from Poor Data Layout

__global__ void kernel(int *A) {

int i = threadIdx.x + blockDim.x*blockIdx.x;

A[1000*i] = …

}

Very Bad!

Each thread accesses a location on a different line.

Fermi line size is 32 integers or floats


Taking Advantage of Cache

__global__ void kernel(int *A) {

int i = threadIdx.x + blockDim.x*blockIdx.x;

A[i] = …

}

Good!

Groups of 32 accesses by consecutive threads are on the same line. The threads will be in the same warp.

Fermi line size is 32 integers or floats
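The difference between the "very bad" and "good" layouts can be timed. This is a sketch under assumptions, not a benchmark from the notes: one kernel takes the stride as a parameter, stride 1 reproduces A[i], and stride 32 places each thread on its own 128-byte line (a smaller stride than the 1000 used on the earlier slide, to keep the array at 128 MB). The names writeStrided and timeKernel are hypothetical, and the timings depend on the GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Thread i writes A[stride * i]. Stride 1 is the cache-friendly case.
__global__ void writeStrided(int *A, int stride, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) A[(size_t)stride * i] = i;
}

// Launch the kernel once and return its run time in milliseconds.
static float timeKernel(int *A, int stride, int n) {
    cudaEvent_t start, stop;
    float ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    writeStrided<<<(n + 255) / 256, 256>>>(A, stride, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 20;        // number of threads (an arbitrary test size)
    const int maxStride = 32;     // 32 ints = one 128-byte line per thread
    int *A;
    cudaMalloc((void **)&A, (size_t)n * maxStride * sizeof(int));

    timeKernel(A, 1, n);          // warm-up, result ignored
    printf("stride 1:  %.3f ms\n", timeKernel(A, 1, n));
    printf("stride %d: %.3f ms\n", maxStride, timeKernel(A, maxStride, n));

    cudaFree(A);
    return 0;
}
```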

Warp

A “warp” in CUDA is a group of 32 threads that operate in SIMT mode

A “half warp” (16 threads) actually executes simultaneously (on current GPUs)

Using knowledge of warps and how the memory is laid out can improve code performance

Memory Banks

[Figure: the device (GPU) connected to four memory banks, Memory 1 to Memory 4, holding A[0], A[1], A[2] and A[3] respectively]

Consecutive locations are placed on successive memory banks.

The device can fetch A[0], A[1], A[2], A[3] … A[B-1] at the same time, where there are B banks.

Shared Memory Banks

Shared memory is divided into 16 or 32 banks of 32-bit width. Banks can be accessed simultaneously.

Compute capability 1.x has 16 banks; accesses are processed per half warp

Compute capability 2.x and 3.0/3.5 have 32 banks; accesses are processed per warp

To achieve maximum bandwidth, threads in a half warp (or in a warp, with 32 banks) should access different banks of shared memory

Exception: all threads read the same location, which results in a broadcast operation

(coit-grid06 and coit-grid07 have C2050s, compute capability 2.0, with 32 banks)

Global memory banks

Global memory is also partitioned into banks, depending upon the version of the GPU

200 series and 10 series NVIDIA GPUs have 8 partitions, each 256 bytes wide

C2050 has ??

Achieving best data access patterns

Requires a lot of thought; will be considered in detail for specific problems

Generally:

• Padding data to make data aligned

For matrix operations:

• Tiling
• Pre-transpose operations
• Padding (adding columns/rows)
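Tiling and padding are often shown together in a matrix transpose. The sketch below is one such illustration and is not taken from the notes: each block stages a 32 x 32 tile in shared memory, and the tile is declared with one extra column so that a column read by a warp falls on 32 different banks. It assumes compute capability 2.0 or later (1024 threads per block, 32 banks) and an N that is a multiple of TILE; all names are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 32     // tile width; assumes 32 shared memory banks
#define N 1024      // matrix dimension, assumed to be a multiple of TILE

__global__ void transposePadded(const float *in, float *out) {
    __shared__ float tile[TILE][TILE + 1];   // +1 column of padding

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * N + x];    // row-wise global read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;               // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * N + x] = tile[threadIdx.x][threadIdx.y];   // column read from shared, row-wise global write
}

int main() {
    size_t bytes = (size_t)N * N * sizeof(float);
    float *h_in = (float *)malloc(bytes), *h_out = (float *)malloc(bytes);
    float *d_in, *d_out;
    for (int i = 0; i < N * N; i++) h_in[i] = (float)i;

    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE), grid(N / TILE, N / TILE);
    transposePadded<<<grid, block>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    int ok = 1;
    for (int i = 0; i < N && ok; i++)
        for (int j = 0; j < N; j++)
            if (h_out[j * N + i] != h_in[i * N + j]) { ok = 0; break; }
    printf("%s\n", ok ? "transpose correct" : "mismatch");

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return 0;
}
```

With tile[TILE][TILE], the 32 elements of one column would be 32 words apart and all map to the same bank; the padding column shifts each row by one bank.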

Memory Coalescing

Aligned memory accesses

Threads can read 4, 8, or 16 bytes at a time from global memory, but only if accesses are aligned.

That is:

• A 4-byte read must start at address …xxxxx00
• An 8-byte read must start at address …xxxx000
• A 16-byte read must start at address …xxx0000

Then access is much faster (twice?)

Ideally try to arrange for threads to access different memory modules at the same time, and consecutive addresses

A bad case would be:

• Thread 0 to access A[0], A[1], ... A[15]
• Thread 1 to access A[16], A[17], ... A[31]
• Thread 2 to access A[32], A[33], ... A[47]
… etc.

A good case would be:

• Thread 0 to access A[0], A[16], A[32], ...
• Thread 1 to access A[1], A[17], A[33], ...
• Thread 2 to access A[2], A[18], A[34], ...
… etc.

if there are 16 banks. Need to know that detail! In both cases each thread makes its accesses one after another in time, so in the good case the threads touch consecutive addresses at each time step.
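The two assignment patterns can be written as a pair of kernels. This is a hedged sketch with hypothetical names: in chunked() each thread walks its own block of 16 consecutive elements (the bad case), while in interleaved() the threads step through the array together, with a stride equal to the total number of threads rather than the 16 of the example, so that neighbouring threads always touch neighbouring addresses.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define PER_THREAD 16   // elements handled by each thread

// Bad case: thread t handles A[16t] ... A[16t + 15].
__global__ void chunked(float *A) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < PER_THREAD; k++)
        A[t * PER_THREAD + k] += 1.0f;
}

// Good case: at step k, thread t handles A[k * total + t].
__global__ void interleaved(float *A) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int total = gridDim.x * blockDim.x;
    for (int k = 0; k < PER_THREAD; k++)
        A[k * total + t] += 1.0f;
}

int main() {
    const int blocks = 64, threads = 256;
    const int n = blocks * threads * PER_THREAD;
    float *h = (float *)malloc(n * sizeof(float));
    float *d;
    for (int i = 0; i < n; i++) h[i] = 0.0f;

    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    chunked<<<blocks, threads>>>(d);       // every element incremented once
    interleaved<<<blocks, threads>>>(d);   // and once more, in a different order
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    int ok = 1;
    for (int i = 0; i < n; i++)
        if (h[i] != 2.0f) ok = 0;
    printf("%s\n", ok ? "both kernels covered every element" : "mismatch");

    cudaFree(d);
    free(h);
    return 0;
}
```

Both kernels do the same work; only the mapping from thread and step to array index differs, which is what determines whether a warp's accesses can be coalesced.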

Memory coalescing and cache memories

Compute capability 2.x onwards have data caches. Accesses to the cache are still in blocks of consecutive locations, so a proper data layout that allows memory coalescing will still be advantageous.

Unfortunately, one needs to know the detailed physical arrangement of global memory and cache to gain maximum benefit.

Devices of different compute capability have different constraints for memory coalescing. For more information see the NVIDIA documentation: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability-3-0

Source: Wikipedia, “CUDA”, http://en.wikipedia.org/wiki/CUDA, Jan 21, 2013

C2050 (coit-grid06 and coit-grid07): compute capability 2.0

To be installed: K20 (coit-grid08), compute capability 3.5

Some notes on NVIDIA's new Tesla K20 GPU card

Released late 2012.

Uses GK110 chip

7.1 billion transistors! “Big-die” GPU

“Kepler” architecture

64KB shared memory/L1 cache

48 KB uniform (?) cache

Up to 1.5 MB L2 cache

K20 card has 5 GB global memory

320-bit GDDR5

225 watts

Source: http://www.anandtech.com/show/6446/nvidia-launches-tesla-k20-k20x-gk110-arrives-at-last/3

Kepler compute architecture

Includes:

• SMX (streaming multiprocessor) design that delivers up to 3x more performance per watt compared to the SM in Fermi. It also delivers 1 petaflop of computing in just 10 server racks.
• Dynamic Parallelism capability that enables GPU threads to automatically spawn new threads. By adapting to the data without going back to the CPU, it greatly simplifies parallel programming and enables GPU acceleration of a broader set of popular algorithms, like adaptive mesh refinement (AMR), fast multipole method (FMM), and multigrid methods.
• Hyper-Q feature that enables multiple CPU cores to simultaneously utilize the CUDA cores on a single Kepler GPU, dramatically increasing GPU utilization, slashing CPU idle times, and advancing programmability. Ideal for cluster applications that use MPI.

Source: http://www.nvidia.com/content/tesla/pdf/NV_DS_TeslaK_Family_May_2012_LR.pdf

Questions