CS179: GPU Programming Lecture 5: Memory

Today
• GPU Memory Overview
• CUDA Memory Syntax
• Tips and tricks for memory handling

Memory Overview
• Very slow access:
  • Between host and device
• Slow access:
  • Global memory
  • Local memory (per-thread, but it resides in off-chip device memory)
• Fast access:
  • Shared memory, constant memory, texture memory
• Very fast access:
  • Register memory

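Global Memory
• Read/write
• Shared between blocks and grids
• Persists across multiple kernel executions
• Very slow to access
  • No caching on the earliest CUDA hardware (compute capability 1.x); Fermi and later GPUs cache global accesses in L2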

Constant Memory
• Read-only in device code
• Cached on each multiprocessor
• Fairly quick
  • The cache can broadcast a value to all active threads that read the same address

Texture Memory
• Read-only in device code
• 2D cached: quick access
• Filtering methods available

Shared Memory
• Read/write per block
• Memory is shared within a block
• Generally quick
  • Has bad worst cases (bank conflicts)

Local Memory
• Read/write per thread
• Not too fast (stored off-chip, in device memory)
• Each thread can only see its own local memory
• Indexable (can do arrays)

Register Memory
• Read/write per thread
• Extremely fast
• Each thread can only see its own register memory
• Not indexable (arrays indexed at run time cannot live in registers)

Syntax: Register Memory
• Default memory type
• Declare as normal, with no special syntax
    int var = 1;
• Only accessible by the current thread

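Syntax: Local Memory
• Per-thread storage that a thread's own device functions can read and modify
• No keyword is needed: the CUDA C Programming Guide lists no __local__ qualifier, so the __device__ __local__ form is not portable
• The compiler places an automatic variable in local memory when it cannot keep it in registers
  • Arrays indexed with run-time values
  • Large structures or arrays
  • Spilled registers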
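A minimal sketch of per-thread storage, assuming the compiler makes its usual placement choices (it is free to choose otherwise). The scalar `tid` would normally sit in a register, while the run-time-indexed `scratch` array is the kind of variable that typically ends up in local memory. The kernel and helper names are hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread fills a private table and reads one entry.
__global__ void perThreadScratch(const int *in, int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // scalar: normally a register
    if (tid >= n) return;

    int scratch[8];                 // private to this thread
    for (int i = 0; i < 8; ++i)
        scratch[i] = in[tid] + i;

    // The index depends on run-time data, so scratch[] cannot be mapped onto
    // named registers; the compiler will typically place it in local memory.
    int pick = in[tid] & 7;
    out[tid] = scratch[pick];
}

// Hypothetical host helper: runs the kernel on n values, returns true on success.
bool runPerThreadScratch(const int *hostIn, int *hostOut, int n)
{
    int *dIn = nullptr, *dOut = nullptr;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }

    cudaMemcpy(dIn, hostIn, bytes, cudaMemcpyHostToDevice);
    perThreadScratch<<<(n + 127) / 128, 128>>>(dIn, dOut, n);
    cudaError_t err = cudaMemcpy(hostOut, dOut, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    return err == cudaSuccess;
}
```

Compiling with `nvcc -Xptxas -v` reports local-memory usage per kernel, which is one way to check where the compiler actually put the array.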

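Syntax: Shared Memory
• Shared across threads in a block, not across blocks
• Array syntax works as usual; a pointer into shared memory is meaningful only inside the block that owns it
• Declare with the __device__ __shared__ keywords, giving a size (256 here is only an example)
    __device__ __shared__ int var[256];
  • Can also just use __shared__
• To leave the array size open, declare it extern and pass the size in bytes as the third kernel launch parameter
    extern __shared__ int var[];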
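A short sketch of an unsized shared array, assuming the size is supplied at launch time through the third `<<<...>>>` parameter. The kernel reverses each block-sized chunk of an array by staging it in shared memory. `reverseChunks` and `runReverseChunks` are hypothetical names, and `n` is assumed to be a multiple of the block size.

```cuda
#include <cuda_runtime.h>

// No size here: it comes from the third kernel launch parameter (in bytes).
extern __shared__ int tile[];

// Hypothetical kernel: reverses each blockDim.x-sized chunk of data in place.
__global__ void reverseChunks(int *data)
{
    int local  = threadIdx.x;
    int global = blockIdx.x * blockDim.x + local;

    tile[local] = data[global];   // stage the chunk in shared memory
    __syncthreads();              // wait until the whole block has written
    data[global] = tile[blockDim.x - 1 - local];
}

// Hypothetical host helper; n must be a multiple of threads.
bool runReverseChunks(int *hostData, int n, int threads)
{
    int *d = nullptr;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;

    cudaMemcpy(d, hostData, bytes, cudaMemcpyHostToDevice);
    reverseChunks<<<n / threads, threads, threads * sizeof(int)>>>(d);
    cudaError_t err = cudaMemcpy(hostData, d, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d);
    return err == cudaSuccess;
}
```

The `__syncthreads()` call matters: without it a thread could read a slot of `tile` that its neighbor has not written yet.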

Syntax: Global Memory
• Created dynamically with cudaMalloc
  • Can pass the resulting pointers from host to kernel
  • Transfer between host and device is slow!
• Or declare statically with the __device__ keyword
    __device__ int var = 1;
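Both ways of obtaining global memory can be sketched together. The buffer below is allocated with `cudaMalloc`, and the counter is a statically declared `__device__` variable read back with `cudaMemcpyFromSymbol`. The kernel, the counter, and the helper are hypothetical names chosen for illustration.

```cuda
#include <cuda_runtime.h>

__device__ int counter = 0;   // statically declared global-memory variable

// Hypothetical kernel: doubles each element and counts the elements processed.
__global__ void doubleAll(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        buf[i] *= 2.0f;
        atomicAdd(&counter, 1);
    }
}

// Hypothetical host helper: returns the number of processed elements, or -1.
int runDoubleAll(float *host, int n)
{
    float *dev = nullptr;
    size_t bytes = n * sizeof(float);
    int zero = 0, processed = -1;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return -1;

    cudaMemcpyToSymbol(counter, &zero, sizeof(int));        // reset the counter
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);   // slow host-to-device copy
    doubleAll<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);   // slow device-to-host copy
    cudaMemcpyFromSymbol(&processed, counter, sizeof(int));

    cudaFree(dev);
    return processed;
}
```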

Syntax: Constant Memory
• Declare with the __device__ __constant__ keywords
    __device__ __constant__ int var = 1;
  • Can also just use __constant__
• Set from the host using cudaMemcpyToSymbol (or cudaMemcpy on the address returned by cudaGetSymbolAddress)
    cudaMemcpyToSymbol(var, src, count);
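As an illustration, the sketch below stores three polynomial coefficients in constant memory and has every thread read them. Because all threads read the same addresses, this is the access pattern the constant cache handles well. `coeffs`, `evalPoly`, and `runEvalPoly` are hypothetical names.

```cuda
#include <cuda_runtime.h>

__constant__ float coeffs[3];   // read-only in kernels, set from the host

// Hypothetical kernel: y = coeffs[0] + coeffs[1]*x + coeffs[2]*x^2.
__global__ void evalPoly(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        y[i] = coeffs[0] + v * (coeffs[1] + v * coeffs[2]);
    }
}

// Hypothetical host helper: returns true on success.
bool runEvalPoly(const float *hostCoeffs, const float *hostX, float *hostY, int n)
{
    float *dX = nullptr, *dY = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&dX, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dY, bytes) != cudaSuccess) { cudaFree(dX); return false; }

    cudaMemcpyToSymbol(coeffs, hostCoeffs, 3 * sizeof(float));  // fill constant memory
    cudaMemcpy(dX, hostX, bytes, cudaMemcpyHostToDevice);
    evalPoly<<<(n + 255) / 256, 256>>>(dX, dY, n);
    cudaError_t err = cudaMemcpy(hostY, dY, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dX);
    cudaFree(dY);
    return err == cudaSuccess;
}
```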

Syntax: Texture Memory
• To be discussed later…

Memory Issues
• Each multiprocessor has a set amount of shared memory
  • Limits the number of blocks that can be resident at once
  • (# of blocks) x (memory used per block) <= total memory
  • Either get lots of blocks using little memory, or fewer blocks using lots of memory
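The inequality above can be turned into a small calculation. This is only a sketch: it looks at shared memory alone, whereas real hardware also caps resident blocks by register use and by a fixed per-multiprocessor limit. `blocksThatFit` and `blocksThatFitOnDevice` are hypothetical helpers.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Solves (# of blocks) x (memory used per block) <= total memory for # of blocks.
// Returns 0 if one block does not fit, or if bytesPerBlock is zero (invalid input).
size_t blocksThatFit(size_t totalBytes, size_t bytesPerBlock)
{
    return bytesPerBlock == 0 ? 0 : totalBytes / bytesPerBlock;
}

// Same calculation using the shared memory per multiprocessor reported for device 0.
size_t blocksThatFitOnDevice(size_t sharedBytesPerBlock)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 0;
    return blocksThatFit(prop.sharedMemPerMultiprocessor, sharedBytesPerBlock);
}
```

For example, with 48 KB of shared memory per multiprocessor, blocks that each use 12 KB allow at most 4 resident blocks. For a specific kernel, the runtime's `cudaOccupancyMaxActiveBlocksPerMultiprocessor` takes the other limits into account.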

Memory Issues
• Register memory is limited!
  • Similar to shared memory in blocks
  • Can have many threads using fewer registers, or few threads using many registers
  • The former is better: more parallelism

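Memory Issues
• Global accesses: slow!
  • Can be sped up when the memory accessed is contiguous
  • Memory coalescing: arranging accesses so that neighboring threads touch contiguous memory and the hardware can combine them
• Coalesced accesses are:
  • Contiguous accesses
  • In-order accesses
  • Aligned accesses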

Memory Coalescing: Aligned Accesses
• Threads read 4, 8, or 16 bytes at a time from global memory
• Accesses must be aligned in memory!
  • Good: [figure: addresses 0x00, 0x04, 0x14]
  • Bad: [figure: addresses 0x00, 0x07, 0x14]
• Which is worse, reading 16 bytes from 0xABCD0 or 0xABCDE?

Memory Coalescing: Aligned Accesses
• Also bad: beginning unaligned

Memory Coalescing: Aligned Accesses
• Built-in types force alignment
  • float4 (16B) is aligned to 16 bytes
  • float3 (12B) is only aligned to 4 bytes, so float3 arrays are not 16-byte aligned!
• To align a struct, use __align__(x) // x = 4, 8, 16
• cudaMalloc aligns the start of each block automatically
• cudaMallocPitch aligns the start of each row for 2D arrays
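A brief sketch of the alignment tools named above. The `Particle` struct and the `allocAlignedRows` helper are hypothetical; the compile-time checks state the sizes of the built-in vector types as they are defined in the CUDA headers.

```cuda
#include <cuda_runtime.h>

// Built-in vector types: float3 is 12 bytes, float4 is 16 bytes and 16-byte aligned.
static_assert(sizeof(float3) == 12, "float3 holds three packed floats");
static_assert(sizeof(float4) == 16 && alignof(float4) == 16, "float4 is 16-byte aligned");

// Hypothetical struct: 12 bytes of payload, padded and aligned to 16 bytes.
struct __align__(16) Particle {
    float x, y, z;
};
static_assert(sizeof(Particle) == 16 && alignof(Particle) == 16, "padded to 16 bytes");

// Hypothetical helper: allocates a width x height float array in which every row
// starts on an aligned boundary. The row stride in bytes is returned in *pitchOut.
float *allocAlignedRows(int width, int height, size_t *pitchOut)
{
    float *dev = nullptr;
    if (cudaMallocPitch((void **)&dev, pitchOut, width * sizeof(float), height)
            != cudaSuccess)
        return nullptr;
    return dev;
}
```

Element (row, col) of the pitched array lives at `(float *)((char *)dev + row * pitch) + col`, so kernels need the pitch as well as the pointer.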

Memory Coalescing: Contiguous Accesses
• Contiguous = memory is together
• Example: non-contiguous memory
  • Threads 3 and 4 swapped accesses!

Memory Coalescing: Contiguous Accesses
• Which is better?
    index = threadIdx.x + blockDim.x * (blockIdx.x + gridDim.x * blockIdx.y);
    index = threadIdx.x + blockDim.y * (blockIdx.y + gridDim.y * blockIdx.x);

Memory Coalescing: Contiguous Accesses
• Case 1: Contiguous accesses
  [Figure: thread[0][0], thread[0][1], thread[1][0], thread[1][1] accessing bank[0] through bank[3]]

Memory Coalescing: In-order Accesses
• In-order accesses
  • Do not skip addresses
  • Access addresses in order in memory
• Bad example:
  • Left: address 140 skipped
  • Right: lots of skipped addresses

Memory Coalescing
• Good example: [figure]

Memory Coalescing
• Not as much of an issue in new hardware
  • Many restrictions relaxed, e.g., accesses do not need to be sequential
• However, memory coalescing and alignment are still good practice!
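To make the contrast concrete, here is a sketch of two copy kernels that move the same data. In the first, consecutive threads touch consecutive addresses. In the second, consecutive threads touch addresses `stride` elements apart, which the hardware generally cannot combine as well. All names are hypothetical, and the strided version assumes `stride` and `n` share no common factor so that every element is still copied exactly once.

```cuda
#include <cuda_runtime.h>

// Coalesced: thread i reads and writes element i.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i handles element (i * stride) mod n, so neighboring
// threads touch addresses that are stride elements apart.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (int)(((long long)i * stride) % n);
        out[j] = in[j];
    }
}

// Hypothetical host helper: stride == 1 selects the coalesced kernel.
bool runCopy(const float *hostIn, float *hostOut, int n, int stride)
{
    float *dIn = nullptr, *dOut = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }

    cudaMemcpy(dIn, hostIn, bytes, cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, bytes);
    int threads = 256, blocks = (n + threads - 1) / threads;
    if (stride == 1)
        copyCoalesced<<<blocks, threads>>>(dIn, dOut, n);
    else
        copyStrided<<<blocks, threads>>>(dIn, dOut, n, stride);
    cudaError_t err = cudaMemcpy(hostOut, dOut, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    return err == cudaSuccess;
}
```

Both kernels produce the same output; timing them on a large array (for instance with `cudaEvent` timers) is a simple way to see how much the access pattern matters on a given GPU.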

Memory Issues
• Shared memory:
  • Also can be limiting
  • Broken up into banks
  • Optimal when the entire warp is reading shared memory together
• Banks:
  • Each bank services only one thread at a time
  • Bank conflict: when two threads try to access the same bank
  • Causes slowdowns in the program!

Bank Conflicts
• Bad:
  • Many threads trying to access the same bank

Bank Conflicts
• Good:
  • Few to no bank conflicts

Bank Conflicts
• Banks service one 32-bit word at a time, and byte addresses map to banks mod 64 (16 banks of 4 bytes each)
  • Bank 0 services 0x00, 0x40, 0x80, etc.; bank 1 services 0x04, 0x44, 0x84, etc.
• Want to avoid multiple threads accessing the same bank
  • Keep data spread out
  • Split data that is larger than 4 bytes into multiple accesses
  • Be careful of data elements with even stride
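The mapping above is easy to check with a few lines of code. The sketch below computes the bank for a byte address and the worst-case number of threads that collide when thread t reads the 4-byte word at index t x stride. `bankOf` and `conflictDegree` are hypothetical helpers. The bank count is a parameter because the 16 banks implied by "mod 64" apply to older hardware, while GPUs of compute capability 2.0 and later have 32.

```cuda
#include <cuda_runtime.h>

// Bank that services a byte address: consecutive 32-bit words go to consecutive banks.
__host__ __device__ inline int bankOf(unsigned addr, int numBanks)
{
    return (int)((addr / 4u) % (unsigned)numBanks);
}

// Worst-case number of threads that hit the same bank when thread t accesses
// the 4-byte word at index t * strideWords. Every thread uses a distinct address,
// so broadcasting does not apply. Returns -1 for an unsupported bank count.
int conflictDegree(int numThreads, int strideWords, int numBanks)
{
    if (numBanks < 1 || numBanks > 32) return -1;
    int hits[32] = {0};
    int worst = 0;
    for (int t = 0; t < numThreads; ++t) {
        int b = bankOf(4u * (unsigned)t * (unsigned)strideWords, numBanks);
        if (++hits[b] > worst) worst = hits[b];
    }
    return worst;
}
```

With 16 threads and 16 banks, a stride of 1 or 3 words gives a degree of 1 (no conflict), a stride of 2 gives 2, and a stride of 16 puts all 16 threads on the same bank. That is why even strides deserve care.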

Broadcasting
• Fast distribution of data to threads
• Happens when the entire warp tries to access the same address
  • The value is broadcast to all threads in one read

Summary
• Best memory management:
  • Balances memory optimization with parallelism
  • Break the problem up into coalesced chunks
  • Process data in shared memory, then copy back to global memory
  • Remember to avoid bank conflicts!
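A tiled matrix transpose is one common way to combine these points, sketched here under several assumptions: a square n x n matrix, n a multiple of 32, 32 x 32 thread blocks, and 32 shared-memory banks (as on compute capability 2.0 and later). The names are hypothetical. Rows are read and written in coalesced order, the reshuffling happens in shared memory, and one column of padding keeps column reads from landing on a single bank.

```cuda
#include <cuda_runtime.h>

#define TILE 32

// Hypothetical kernel: out = transpose(in) for an n x n matrix, n a multiple of TILE.
__global__ void transposeTiled(const float *in, float *out, int n)
{
    // The extra column shifts each row by one bank, so reading a column
    // of the tile touches 32 different banks instead of one.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];    // coalesced read along a row

    __syncthreads();                                   // tile fully loaded

    x = blockIdx.y * TILE + threadIdx.x;               // swap the block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write along a row
}

// Hypothetical host helper: returns true on success.
bool runTranspose(const float *hostIn, float *hostOut, int n)
{
    float *dIn = nullptr, *dOut = nullptr;
    size_t bytes = (size_t)n * n * sizeof(float);
    if (n % TILE != 0) return false;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }

    cudaMemcpy(dIn, hostIn, bytes, cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE);
    dim3 grid(n / TILE, n / TILE);
    transposeTiled<<<grid, block>>>(dIn, dOut, n);
    cudaError_t err = cudaMemcpy(hostOut, dOut, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    return err == cudaSuccess;
}
```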

Next Time
• Texture memory
• CUDA applications in graphics