CS179: GPU Programming
Lecture 5: Memory

Today
• GPU Memory Overview
• CUDA Memory Syntax
• Tips and tricks for memory handling
Memory Overview
• Very slow access: between host and device
• Slow access: global memory, local memory (both stored off-chip)
• Fast access: shared memory, constant memory, texture memory
• Very fast access: register memory
Global Memory
• Read/write
• Shared between blocks and grids
• Persists across multiple kernel executions
• Very slow to access
• No caching!
Constant Memory
• Read-only in device code
• Cached in each multiprocessor
• Fairly quick
• Cache can broadcast to all active threads
Texture Memory
• Read-only in device code
• 2D cached -- quick access
• Filtering methods available
Shared Memory
• Read/write per block
• Memory is shared within a block
• Generally quick
• Has bad worst cases
Local Memory
• Read/write per thread
• Not too fast (stored off-chip, in device memory)
• Each thread can only see its own local memory
• Indexable (can do arrays)
Register Memory
• Read/write per thread
• Extremely fast
• Each thread can only see its own register memory
• Not indexable (can’t do arrays)
Syntax: Register Memory
• Default memory type
• Declare as normal -- no special syntax
   int var = 1;
• Only accessible by the current thread
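A minimal sketch of register usage (the kernel name and parameters are made up for illustration): scalar locals declared inside a kernel are normally placed in registers by the compiler.

```cuda
__global__ void scale(float *data, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // held in a register
    if (idx < n) {
        float tmp = data[idx] * factor;               // also a register
        data[idx] = tmp;
    }
}
```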
Syntax: Local Memory
• “Global” variables for threads
• Can modify across local functions for a thread
• Declare with __device__ __local__ keyword
   __device__ __local__ int var = 1;
• Can also just use __local__
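In practice, per-thread arrays that are indexed with runtime values typically end up in local memory. A sketch (kernel and variable names are illustrative):

```cuda
__global__ void histogram8(const unsigned char *in, int n, int *out) {
    int counts[8] = {0};               // dynamically indexed per-thread array:
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        counts[in[i] & 7]++;           // likely placed in (off-chip) local memory
    for (int b = 0; b < 8; b++)
        atomicAdd(&out[b], counts[b]); // combine per-thread results in global memory
}
```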
Syntax: Shared Memory
• Shared across threads in a block, not across blocks
• Cannot use pointers, but can use array syntax for arrays
• Declare with __device__ __shared__ keyword
   __device__ __shared__ int var[];
• Can also just use __shared__
• Don’t need to declare size for arrays (the size can instead be supplied at kernel launch)
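A sketch of dynamically sized shared memory, with the size passed as the third launch parameter (kernel name and launch shape are illustrative):

```cuda
__global__ void reverse(int *d, int n) {
    extern __shared__ int s[];          // size set at launch time
    int t = threadIdx.x;
    s[t] = d[t];                        // stage data in fast shared memory
    __syncthreads();                    // wait for the whole block
    d[t] = s[n - 1 - t];                // read back reversed
}
// launch: reverse<<<1, n, n * sizeof(int)>>>(d, n);
```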
Syntax: Global Memory
• Created with cudaMalloc
• Can pass pointers between host and kernel
• Transfer is slow!
• Declare with __device__ keyword
   __device__ int var = 1;
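A minimal host-side sketch of allocating, filling, and freeing global memory (error checking omitted; the array name is illustrative):

```cuda
int n = 1024;
float h_a[1024];                  // host array (assumed already filled)
float *d_a;                       // device (global memory) pointer
cudaMalloc((void **)&d_a, n * sizeof(float));
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice); // slow transfer!
// ... launch kernels that take d_a as an argument ...
cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_a);
```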
Syntax: Constant Memory
• Declare with __device__ __constant__ keyword
   __device__ __constant__ int var = 1;
• Can also just use __constant__
• Set using cudaMemcpyToSymbol (or cudaMemcpy)
   cudaMemcpyToSymbol(var, src, count);
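A sketch of constant memory in use (kernel and symbol names are made up): every thread reads the same coefficients, so the constant cache can broadcast them.

```cuda
__constant__ float coeffs[4];               // lives in cached constant memory

__global__ void poly(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                              // all threads read the same coeffs:
        y[i] = coeffs[0] + x[i] * (coeffs[1] + x[i] * (coeffs[2] + x[i] * coeffs[3]));
}

// host side:
float h_coeffs[4] = {1.f, 2.f, 3.f, 4.f};
cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
```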
Syntax: Texture Memory
• To be discussed later…
Memory Issues
• Each multiprocessor has a set amount of memory
• Limits the number of blocks we can have:
   (# of blocks) x (memory used per block) <= total memory
• Either get lots of blocks using little memory, or fewer blocks using lots of memory
Memory Issues
• Register memory is limited!
• Similar to shared memory in blocks
• Can have many threads using few registers, or few threads using many registers
• The former is better: more parallelism
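One way to steer this trade-off, sketched with CUDA’s __launch_bounds__ qualifier (the kernel name and the numbers are illustrative): asking the compiler to cap register use so more blocks can be resident per multiprocessor.

```cuda
// Request that blocks of up to 256 threads fit, with at least
// 2 blocks resident per multiprocessor; the compiler limits
// per-thread register use to make this possible.
__global__ void __launch_bounds__(256, 2) scale_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}
```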
Memory Issues
• Global accesses: slow!
• Can be sped up when memory accesses are contiguous
• Memory coalescing: making memory accesses contiguous
• Coalesced accesses are: contiguous accesses, in-order accesses, aligned accesses
Memory Coalescing: Aligned Accesses
• Threads read 4, 8, or 16 bytes at a time from global memory
• Accesses must be aligned in memory!
• Good: accesses on 4-byte boundaries (0x00, 0x04, …, 0x14)
• Bad: unaligned accesses (0x00, 0x07, …, 0x14)
• Which is worse, reading 16 bytes from 0xABCD0 or 0xABCDE?
Memory Coalescing: Aligned Accesses
• Also bad: a run of accesses whose beginning is unaligned
Memory Coalescing: Aligned Accesses
• Built-in types force alignment
• float3 (12B) does not pad out to float4 (16B): float3 arrays are not aligned!
• To align a struct, use __align__(x) // x = 4, 8, or 16
• cudaMalloc aligns the start of each allocation automatically
• cudaMallocPitch aligns the start of each row for 2D arrays
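A sketch of both tools (the struct name and dimensions are made up): forcing 16-byte struct alignment, and letting the runtime pad 2D rows to an aligned pitch.

```cuda
struct __align__(16) Particle {   // pad to 16B so each element is one aligned load
    float x, y, z;                // 12 bytes of data...
    float pad;                    // ...plus 4 bytes of explicit padding
};

// host side: allocate a width x height float image with aligned rows
int width = 1024, height = 768;   // assumed dimensions
size_t pitch;                     // actual row size in bytes (>= width * sizeof(float))
float *d_img;
cudaMallocPitch((void **)&d_img, &pitch, width * sizeof(float), height);
```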
Memory Coalescing: Contiguous Accesses
• Contiguous = accesses are next to each other in memory
• Example of non-contiguous memory: threads 3 and 4 with swapped accesses!
Memory Coalescing: Contiguous Accesses
• Which is better?
   index = threadIdx.x + blockDim.x * (blockIdx.x + gridDim.x * blockIdx.y);
   index = threadIdx.x + blockDim.y * (blockIdx.y + gridDim.y * blockIdx.x);
Memory Coalescing: Contiguous Accesses
[Figures: two diagrams mapping thread[0][0]…thread[1][1] onto bank[0]…bank[3], one per index computation; the first computation gives contiguous accesses]
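The first index computation can be used as sketched here (kernel name is illustrative): consecutive threadIdx.x values touch consecutive addresses, so each warp’s loads and stores coalesce.

```cuda
__global__ void copy2d(const float *in, float *out) {
    // threadIdx.x varies fastest across a warp, so adjacent threads
    // read adjacent addresses: a coalesced access pattern
    int index = threadIdx.x + blockDim.x * (blockIdx.x + gridDim.x * blockIdx.y);
    out[index] = in[index];
}
```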
Memory Coalescing: In-order Accesses
• Do not skip addresses
• Access addresses in order in memory
• Bad examples: left, address 140 skipped; right, lots of skipped addresses
Memory Coalescing
• Good example: every address accessed, in order
Memory Coalescing
• Not as much of an issue on newer hardware
• Many restrictions relaxed -- e.g., accesses no longer need to be sequential
• However, memory coalescing and alignment are still good practice!
Memory Issues
• Shared memory can also be limiting
• Broken up into banks
• Optimal when the entire warp reads shared memory together
• Banks: each bank services only one thread at a time
• Bank conflict: when two threads try to access the same bank
• Causes slowdowns in the program!
Bank Conflicts
• Bad: many threads trying to access the same bank

Bank Conflicts
• Good: few to no bank conflicts
Bank Conflicts
• Banks service 32-bit words at a time, at addresses mod 64
• Bank 0 services 0x00, 0x40, 0x80, etc.; bank 1 services 0x04, 0x44, 0x84, etc.
• Want to avoid multiple threads accessing the same bank:
• Keep data spread out
• Split data larger than 4 bytes into multiple accesses
• Be careful of data elements with even stride (a stride sharing a factor with the bank count maps many threads to the same bank)
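A common fix, sketched below: pad a shared 2D tile by one column so that column-wise accesses with a power-of-two stride land in different banks (the kernel name and tile size are illustrative).

```cuda
#define TILE 16

__global__ void transpose_tile(const float *in, float *out, int n) {
    // +1 padding gives a row stride of 17 words, so tile[0][c], tile[1][c], ...
    // fall in different banks and column reads are conflict-free
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced global read
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;              // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced global write
}
```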
Broadcasting
• Fast distribution of data to threads
• Happens when an entire warp tries to access the same address
• The memory gets broadcast to all threads in one read
Summary
• Best memory management balances memory optimization with parallelism
• Break the problem up into coalesced chunks
• Process data in shared memory, then copy back to global
• Remember to avoid bank conflicts!
Next Time
• Texture memory
• CUDA applications in graphics