
Page 1: CUDA Lecture 11 Performance Considerations

Prepared 10/11/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

CUDA Lecture 11: Performance Considerations

Page 2: CUDA Lecture 11 Performance Considerations

Always measure where your time is going! Even if you think you know where it is going.
Start coarse, go fine-grained as need be.

Keep in mind Amdahl's Law when optimizing any part of your code.
Don't continue to optimize once a part is only a small fraction of overall execution time.
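
For reference, a minimal statement of Amdahl's Law (notation mine, not from the slides): if a fraction p of the execution time is sped up by a factor s, the overall speedup is

speedup = 1 / ((1 - p) + p / s)

For example, p = 0.8 and s = 10 give 1 / (0.2 + 0.08), roughly 3.6x, so the untouched 20% quickly dominates.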

Performance Considerations – Slide 2

Preliminaries

Page 3: CUDA Lecture 11 Performance Considerations

Performance consideration issues:
Memory coalescing
Shared memory bank conflicts
Control-flow divergence
Occupancy
Kernel launch overheads

Performance Considerations – Slide 3

Outline

Page 4: CUDA Lecture 11 Performance Considerations

Off-chip memory is accessed in chunks, even if you read only a single word. If you don't use the whole chunk, bandwidth is wasted.
Chunks are aligned to multiples of 32/64/128 bytes; unaligned accesses will cost more.

When accessing global memory, peak performance utilization occurs when all threads in a half-warp access contiguous memory locations.

Performance Considerations – Slide 4

Performance Topic A: Memory Coalescing

Page 5: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 5

Memory Layout of a Matrix in C

[Figure: a 4×4 matrix M and its row-major layout in C, linearized as M0,0 M1,0 M2,0 M3,0, M0,1 M1,1 M2,1 M3,1, M0,2 M1,2 M2,2 M3,2, M0,3 M1,3 M2,3 M3,3.]

Page 6: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 6

Memory Layout of a Matrix in C

[Figure: the same matrix with the kernel's access direction overlaid; in each time period, threads T1–T4 each access one element and then step to the next.]

Page 7: CUDA Lecture 11 Performance Considerations


Performance Considerations – Slide 7

Memory Layout of a Matrix in C

[Figure: the access direction in kernel code mapped onto the linearized layout, showing whether the elements touched by threads T1–T4 in each time period are adjacent in memory.]

Page 8: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 8

Memory Layout of a Matrix in C

[Figure: matrices Md and Nd, each of width WIDTH; Thread 1 and Thread 2 traverse Md along rows (not coalesced) and Nd along columns (coalesced).]
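
A minimal kernel sketch of the contrast the figure draws; the kernel name and arguments are illustrative, and the matrix is assumed to be WIDTH x WIDTH in row-major order:

// Sketch only: two ways a thread can traverse a row-major matrix.
__global__ void traverse(const float* M, float* out, int WIDTH)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  float sum = 0.0f;

  // (a) Each thread walks one row: at step k, neighboring threads touch
  //     M[tid*WIDTH + k] and M[(tid+1)*WIDTH + k], WIDTH floats apart,
  //     so the accesses of a half-warp are NOT coalesced.
  for (int k = 0; k < WIDTH; ++k)
    sum += M[tid * WIDTH + k];

  // (b) Each thread walks one column: at step k, neighboring threads touch
  //     M[k*WIDTH + tid] and M[k*WIDTH + tid + 1], which are adjacent,
  //     so the accesses of a half-warp are coalesced.
  for (int k = 0; k < WIDTH; ++k)
    sum += M[k * WIDTH + tid];

  out[tid] = sum;
}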

Page 9: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 9

Use Shared Memory to Improve Coalescing

[Figure: original access pattern vs. tiled access pattern for Md and Nd; copy a tile into scratchpad (shared) memory, then perform the multiplication with the scratchpad values.]
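
A minimal sketch of the tiled (scratchpad) pattern the figure describes, applied to the matrix multiplication example; TILE, the kernel name, and the assumption that WIDTH is a multiple of TILE are mine:

#define TILE 16

// Each block stages one TILE x TILE tile of Md and Nd in shared memory.
// The global loads below are coalesced because consecutive threads
// (consecutive threadIdx.x) read consecutive addresses.
__global__ void matMulTiled(const float* Md, const float* Nd, float* Pd, int WIDTH)
{
  __shared__ float Ms[TILE][TILE];
  __shared__ float Ns[TILE][TILE];

  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;

  float sum = 0.0f;
  for (int t = 0; t < WIDTH / TILE; ++t) {
    // Copy into scratchpad memory (coalesced global reads).
    Ms[threadIdx.y][threadIdx.x] = Md[row * WIDTH + t * TILE + threadIdx.x];
    Ns[threadIdx.y][threadIdx.x] = Nd[(t * TILE + threadIdx.y) * WIDTH + col];
    __syncthreads();

    // Perform the multiplication with scratchpad values.
    for (int k = 0; k < TILE; ++k)
      sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
    __syncthreads();
  }
  Pd[row * WIDTH + col] = sum;
}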

Page 10: CUDA Lecture 11 Performance Considerations

Threads 0–15 access 4-byte words at addresses 116–176.
Thread 0 is the lowest active thread; it accesses address 116.
128-byte segment: 0–127.

Performance Considerations – Slide 10

Second Example

[Figure: address line 0–288 in 32-byte steps, threads t0–t15 marked, 128B segment 0–127 highlighted.]

Page 11: CUDA Lecture 11 Performance Considerations

Threads 0–15 access 4-byte words at addresses 116–176.
Thread 0 is the lowest active thread; it accesses address 116.
128-byte segment: 0–127 (reduced to 64B).

Performance Considerations – Slide 11

Second Example (cont.)

[Figure: address line 0–288 with the reduced 64B segment (64–127) highlighted.]

Page 12: CUDA Lecture 11 Performance Considerations

Threads 0–15 access 4-byte words at addresses 116–176.
Thread 0 is the lowest active thread; it accesses address 116.
128-byte segment: 0–127 (reduced to 32B).

Performance Considerations – Slide 12

Second Example (cont.)

[Figure: address line 0–288 with the reduced 32B segment (96–127) highlighted.]

Page 13: CUDA Lecture 11 Performance Considerations

Threads 0–15 access 4-byte words at addresses 116–176.
Thread 3 is now the lowest active thread; it accesses address 128.
128-byte segment: 128–255.

Performance Considerations – Slide 13

Second Example (cont.)

[Figure: address line 0–288 with the 128B segment 128–255 highlighted.]

Page 14: CUDA Lecture 11 Performance Considerations

Threads 0–15 access 4-byte words at addresses 116–176.
Thread 3 is now the lowest active thread; it accesses address 128.
128-byte segment: 128–255 (reduced to 64B).

Performance Considerations – Slide 14

Second Example (cont.)

[Figure: address line 0–288 with the reduced 64B segment (128–191) highlighted.]

Page 15: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 15

Consider the stride of your accesses

__global__ void foo(int* input, float3* input2)
{
  int i = blockDim.x * blockIdx.x + threadIdx.x;

  // Stride 1
  int a = input[i];

  // Stride 2, half the bandwidth is wasted
  int b = input[2 * i];

  // Stride 3, 2/3 of the bandwidth is wasted
  float c = input2[i].x;
}

Page 16: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 16

Example: Array of Structures (AoS)

struct record {
  int key;
  int value;
  int flag;
};

record *d_records;
cudaMalloc((void**)&d_records, ...);

Page 17: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 17

Example: Structure of Arrays (SoA)

struct SoA {
  int * keys;
  int * values;
  int * flags;
};

SoA d_SoA_data;
cudaMalloc((void**)&d_SoA_data.keys, ...);
cudaMalloc((void**)&d_SoA_data.values, ...);
cudaMalloc((void**)&d_SoA_data.flags, ...);

Page 18: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 18

Example: SoA versus AoS

__global__ void bar(record *AoS_data, SoA SoA_data)
{
  int i = blockDim.x * blockIdx.x + threadIdx.x;

  // AoS wastes bandwidth
  int key = AoS_data[i].key;

  // SoA: efficient use of bandwidth
  int key_better = SoA_data.keys[i];
}

Page 19: CUDA Lecture 11 Performance Considerations

Structure of arrays is often better than array of structures.
It is a very clear win on regular, stride-1 access patterns.
Unpredictable or irregular access patterns are case-by-case.

Performance Considerations – Slide 19

Example: SoA versus AoS (cont.)

Page 20: CUDA Lecture 11 Performance Considerations

As seen, each SM has 16 KB of shared memory organized as 16 banks of 32-bit words (Tesla).
CUDA uses shared memory as shared storage visible to all threads in a thread block, with read and write access.
It is not used explicitly for pixel shader programs; we dislike pixels talking to each other.

Performance Considerations – Slide 20

Performance Topic B: Shared Memory Bank Conflicts

[Figure: SM block diagram: instruction L1 cache (I$), multithreaded instruction buffer, register file (RF), constant L1 cache (C$), shared memory, operand select, MAD and SFU units.]

Page 21: CUDA Lecture 11 Performance Considerations

So shared memory is banked.
This only matters for threads within a warp.
Full performance with some restrictions: threads can each access different banks, or can all access the same value.
Consecutive words are in different banks.
If two or more threads access the same bank but different values, you get bank conflicts.

Performance Considerations – Slide 21

Shared Memory

Page 22: CUDA Lecture 11 Performance Considerations

In a parallel machine, many threads access memory; therefore, memory is divided into banks. This is essential to achieve high bandwidth.
Each bank can service one address per cycle, so a memory can service as many simultaneous accesses as it has banks.
Multiple simultaneous accesses to a bank result in a bank conflict; conflicting accesses are serialized.

Performance Considerations – Slide 22

Details: Parallel Memory Architecture

[Figure: shared memory banks Bank 0 through Bank 15.]

Page 23: CUDA Lecture 11 Performance Considerations

No bank conflicts: linear addressing, stride == 1.
No bank conflicts: random 1:1 permutation.

Performance Considerations – Slide 23

Bank Addressing Examples

[Figure: threads 0–15 mapped one-to-one onto banks 0–15 for both access patterns.]

Page 24: CUDA Lecture 11 Performance Considerations

Two-way bank conflicts: linear addressing, stride == 2.
Eight-way bank conflicts: linear addressing, stride == 8.

Performance Considerations – Slide 24

Bank Addressing Examples (cont.)

[Figure: with stride 2, pairs of threads map to the same bank; with stride 8, eight threads map to each of banks 0 and 8.]

Page 25: CUDA Lecture 11 Performance Considerations

Each bank has a bandwidth of 32 bits per clock cycle

Successive 32-bit words are assigned to successive banks

G80 has 16 banks, so bank = (32-bit word address) % 16. This is the same as the size of a half-warp.

No bank conflicts between different half-warps, only within a single half-warp
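
As a quick illustration (mine, not from the slides), the bank a shared-memory word falls in on G80 can be computed from a hypothetical byte offset:

// Successive 32-bit words go to successive banks; G80 has 16 banks.
int bank = (byteOffset / 4) % 16;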

Performance Considerations – Slide 25

How addresses map to banks on G80

Page 26: CUDA Lecture 11 Performance Considerations

Shared memory is as fast as registers if there are no bank conflicts

The fast case:
If all threads of a half-warp access different banks, there is no bank conflict.
If all threads of a half-warp access the identical address, there is no bank conflict (broadcast).
The slow case:
Bank conflict: multiple threads in the same half-warp access the same bank.
The accesses must be serialized.
Cost = max # of simultaneous accesses to a single bank.

Performance Considerations – Slide 26

Shared memory bank conflicts

Page 27: CUDA Lecture 11 Performance Considerations

Change all shared memory reads to the same value.
All broadcasts = no conflicts.
This will show how much performance could be improved by eliminating bank conflicts.
The same doesn't work for shared memory writes, so replace shared memory array indices with threadIdx.x.
This can also be done for the reads.
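
A minimal sketch of the trick; the kernel, s_data, and the commented-out "original" index are all illustrative:

__global__ void bankConflictProbe(const float* in, float* out)
{
  __shared__ float s_data[256];

  // Original (possibly conflicting) pattern, for comparison:
  //   s_data[index] = in[threadIdx.x];   float v = s_data[index];

  // Diagnostic write: index by threadIdx.x so each thread of a half-warp
  // hits its own bank (no conflicts).
  s_data[threadIdx.x] = in[threadIdx.x];
  __syncthreads();

  // Diagnostic read: every thread reads the same word, which is a
  // broadcast and therefore conflict-free.
  float v = s_data[0];

  out[threadIdx.x] = v;
}

Timing this version against the original gives a rough estimate of how much the bank conflicts cost.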

Performance Considerations – Slide 27

Trick to Assess Impact On Performance

Page 28: CUDA Lecture 11 Performance Considerations

Given:

This is only bank-conflict-free if s shares no common factors with the number of banks (16 on G80), so s must be odd.

Performance Considerations – Slide 28

Linear Addressing

__shared__ float shared[256];
float foo = shared[baseIndex + s * threadIdx.x];

Page 29: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 29

Linear Addressing Examples

[Figure: thread-to-bank mappings for s = 1 and s = 3; both strides are odd, so each thread of the half-warp maps to a distinct bank.]

Page 30: CUDA Lecture 11 Performance Considerations

texture and __constant__ memories are read-only.
Data resides in global memory.
They use a different read path that includes specialized caches.

Performance Considerations – Slide 30

Additional “memories”

Page 31: CUDA Lecture 11 Performance Considerations

Data stored in global memory, read through a constant-cache path.
__constant__ qualifier in declarations.
Can only be read by GPU kernels.
Limited to 64 KB.
To be used when all threads in a warp read the same address; serializes otherwise.
Throughput: 32 bits per warp per clock per multiprocessor.
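
A minimal sketch of the usage, assuming a small coefficient table; the names are illustrative:

__constant__ float coeffs[16];   // resides in the 64 KB constant space

__global__ void applyCoeffs(const float* in, float* out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    // Every thread in the warp reads the same coeffs[] entries, so the
    // reads are serviced as broadcasts through the constant cache.
    out[i] = coeffs[0] * in[i] + coeffs[1];
}

// Host side, before launching:
//   float h_coeffs[16] = { ... };
//   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));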

Performance Considerations – Slide 31

Constant Memory

Page 32: CUDA Lecture 11 Performance Considerations

Immediate address constants and indexed address constants.
Constants are stored in DRAM and cached on chip (L1 per SM).
A constant value can be broadcast to all threads in a warp.
This is an extremely efficient way of accessing a value that is common to all threads in a block!

Performance Considerations – Slide 32

Constants

[Figure: SM block diagram with the constant L1 cache highlighted.]

Page 33: CUDA Lecture 11 Performance Considerations

Objectives:
To understand the implications of control flow on branch divergence overhead and SM execution resource utilization.
To learn better ways to write code with control flow.
To understand compiler/HW predication designed to reduce the impact of control flow; there is a cost involved.

Performance Considerations – Slide 33

Performance Topic C: Control Flow Divergence

Page 34: CUDA Lecture 11 Performance Considerations

Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads); the unit of parallelism in CUDA.

Warp: a group of threads executed physically in parallel in G80

Block: a group of threads that are executed together and form the unit of resource assignment

Grid: a group of thread blocks that must all complete before the next kernel call of the program can take effect

Performance Considerations – Slide 34

Quick terminology review

Page 35: CUDA Lecture 11 Performance Considerations

Thread blocks are partitioned into warps, with instructions issued per 32 threads (a warp).
Thread IDs within a warp are consecutive and increasing; warp 0 starts with thread ID 0.
Partitioning is always the same, so you can use this knowledge in control flow.
The exact size of warps may change from generation to generation.

Performance Considerations – Slide 35

How thread blocks are partitioned

Page 36: CUDA Lecture 11 Performance Considerations

However, DO NOT rely on any ordering between warps.
If there are any dependencies between threads, you must use __syncthreads() to get correct results.
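
A minimal sketch of why the barrier is needed, assuming a block of at most 256 threads; the kernel and names are illustrative:

__global__ void shiftLeft(const float* in, float* out)
{
  __shared__ float buf[256];
  int t = threadIdx.x;

  buf[t] = in[blockIdx.x * blockDim.x + t];

  // Thread t is about to read the element written by thread t+1, which may
  // belong to a different warp, so all writes must be visible first.
  __syncthreads();

  float neighbor = (t + 1 < blockDim.x) ? buf[t + 1] : 0.0f;
  out[blockIdx.x * blockDim.x + t] = neighbor;
}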

Performance Considerations – Slide 36

How thread blocks are partitioned (cont.)

Page 37: CUDA Lecture 11 Performance Considerations

The main performance concern with branching is divergence: threads within a single warp take different paths (if-else, ...).
Different execution paths within a warp are serialized in G80; the control paths taken by the threads in a warp are traversed one at a time until there are no more.
Different warps can execute different code with no impact on performance.

Performance Considerations – Slide 37

Control Flow Instructions

Page 38: CUDA Lecture 11 Performance Considerations

A common case: avoid diverging within a warp, i.e., when the branch condition is a function of thread ID.
Example with divergence (below):
This creates two different control paths for threads in a block.
Branch granularity < warp size: threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp.

Performance Considerations – Slide 38

Control Flow Divergence (cont.)

if (threadIdx.x > 2) { ... }
else { ... }

Page 39: CUDA Lecture 11 Performance Considerations

A common case: avoid diverging within a warp, i.e., when the branch condition is a function of thread ID.
Example without divergence (below):
This also creates two different control paths for threads in a block.
Branch granularity is a whole multiple of warp size: all threads in any given warp follow the same path.

Performance Considerations – Slide 39

Control Flow Divergence (cont.)

if (threadIdx.x / WARP_SIZE > 2) { ... }
else { ... }

Page 40: CUDA Lecture 11 Performance Considerations

Given an array of values, “reduce” them to a single value in parallel.
Examples: sum reduction (sum of all values in the array), max reduction (maximum of all values in the array).
Typical parallel implementation: recursively halve the number of threads, adding two values per thread.
Takes log(n) steps for n elements, requires n/2 threads.

Performance Considerations – Slide 40

Parallel Reduction

Page 41: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 41

Example: Divergent Iteration

__global__ void per_thread_sum(int *indices, float *data, float *sums)
{
  ...
  // number of loop iterations is data dependent
  for (int j = indices[i]; j < indices[i+1]; j++)
  {
    sum += data[j];
  }
  sums[i] = sum;
}

Page 42: CUDA Lecture 11 Performance Considerations

Assume an in-place reduction using shared memory.
The original vector is in device global memory; the shared memory is used to hold a partial sum vector.
Each iteration brings the partial sum vector closer to the final sum.
The final solution will be in element 0.

Performance Considerations – Slide 42

A Vector Reduction Example

Page 43: CUDA Lecture 11 Performance Considerations

Assume we have already loaded the array into __shared__ float partialSum[].

Performance Considerations – Slide 43

A simple implementation

unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
{
  __syncthreads();
  if (t % (2*stride) == 0)
    partialSum[t] += partialSum[t+stride];
}

Page 44: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 44

Vector Reduction with Bank Conflicts

[Figure: array elements 0–11; iteration 1 forms 0+1, 2+3, 4+5, 6+7, 8+9, 10+11; iteration 2 forms 0...3, 4...7, 8...11; iteration 3 forms 0...7, 8...15.]

Page 45: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 45

Vector Reduction with Branch Divergence

[Figure: the same reduction tree annotated with the active threads (Thread 0, 2, 4, 6, 8, 10), showing how active and idle threads interleave within each warp.]

Page 46: CUDA Lecture 11 Performance Considerations

In each iteration, two control flow paths will be sequentially traversed for each warp: threads that perform the addition and threads that do not.
Threads that do not perform the addition may still cost extra cycles, depending on the implementation of divergence.

Performance Considerations – Slide 46

Some Observations

Page 47: CUDA Lecture 11 Performance Considerations

No more than half of the threads will be executing at any time.
All odd-index threads are disabled right from the beginning!
On average, fewer than 1/4 of the threads will be activated for all warps over time.
After the 5th iteration, entire warps in each block will be disabled: poor resource utilization but no divergence.
This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), where each iteration only has one thread activated per warp until all warps retire.

Performance Considerations – Slide 47

Some Observations (cont.)

Page 48: CUDA Lecture 11 Performance Considerations

Assume we have already loaded the array into __shared__ float partialSum[].

Performance Considerations – Slide 48

Shortcomings of the implementation

unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
{
  __syncthreads();
  if (t % (2*stride) == 0)
    partialSum[t] += partialSum[t+stride];
}

BAD: Divergence due to interleaved branch decisions

Page 49: CUDA Lecture 11 Performance Considerations

Assume we have already loaded the array into __shared__ float partialSum[].

Performance Considerations – Slide 49

A better implementation

unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x >> 1; stride >= 1; stride >>= 1)
{
  __syncthreads();
  if (t < stride)
    partialSum[t] += partialSum[t+stride];
}

Page 50: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 50

Less Divergence than Original

[Figure: elements 0–31; iteration 1 adds element i+16 into element i (0+16, ..., 15+31), so whole warps drop out together in later iterations.]

Page 51: CUDA Lecture 11 Performance Considerations

Only the last 5 iterations will have divergence.
Entire warps will be shut down as iterations progress.
For a 512-thread block, it takes 4 iterations to shut down all but one warp in each block.
Better resource utilization; this will likely retire warps, and thus blocks, faster.
Recall there are no bank conflicts either.

Performance Considerations – Slide 51

Some Observations About the New Implementation

Page 52: CUDA Lecture 11 Performance Considerations

For the last 6 loops only one warp is active (i.e., thread IDs 0..31).
Shared reads and writes are SIMD synchronous within a warp, so skip __syncthreads() and unroll the last 6 iterations.

Performance Considerations – Slide 52

A Potential Further Refinement but bad idea

unsigned int tid = threadIdx.x;
for (unsigned int d = n>>1; d > 32; d >>= 1)
{
  __syncthreads();
  if (tid < d) shared[tid] += shared[tid + d];
}
__syncthreads();
if (tid <= 32)
{ // unroll last 6 predicated steps
  shared[tid] += shared[tid + 32];
  shared[tid] += shared[tid + 16];
  shared[tid] += shared[tid + 8];
  shared[tid] += shared[tid + 4];
  shared[tid] += shared[tid + 2];
  shared[tid] += shared[tid + 1];
}

This would not work properly if the warp size decreases; you would need __syncthreads() between each statement.
However, having __syncthreads() inside an if statement is problematic.

Page 53: CUDA Lecture 11 Performance Considerations

A single thread can drag a whole warp with it for a long time.
Know your data patterns.
If the data is unpredictable, try to flatten peaks by letting threads work on multiple data items.
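
One way to let each thread work on multiple items is a simple strided loop; a minimal sketch (my own illustration of the idea, not code from the slides):

__global__ void sumMany(const float* data, float* partial, int n)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;

  float sum = 0.0f;
  // Each thread consumes several elements instead of exactly one, which
  // evens out the work when the per-item cost is unpredictable.
  for (int i = tid; i < n; i += stride)
    sum += data[i];

  partial[tid] = sum;
}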

Performance Considerations – Slide 53

Conclusion: Iteration Divergence

Page 54: CUDA Lecture 11 Performance Considerations

<p1> LDR r1, r2, 0
If p1 is TRUE, the instruction executes normally.
If p1 is FALSE, the instruction is treated as a NOP.

Predication example:

Performance Considerations – Slide 54

Predicated Execution Concept

...
if (x == 10)
  c = c + 1;
...

...
     LDR r5, X
     p1 <- r5 eq 10
<p1> LDR r1 <- C
<p1> ADD r1, r1, 1
<p1> STR r1 -> C
...

Page 55: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 55

Predication very helpful for if-else

[Figure: control-flow graph with block A branching to B or C, both rejoining at D; with predication the blocks are emitted sequentially as A, B, C, D.]

Page 56: CUDA Lecture 11 Performance Considerations

The cost is that extra instructions will be issued each time the code is executed. However, there is no branch divergence.

Performance Considerations – Slide 56

If-else example…

     p1,p2 <- r5 eq 10
<p1> inst 1 from B
<p1> inst 2 from B
<p1> ...
<p2> inst 1 from C
<p2> inst 2 from C
     ...

After scheduling:

     p1,p2 <- r5 eq 10
<p1> inst 1 from B
<p2> inst 1 from C
<p1> inst 2 from B
<p2> inst 2 from C
<p1> ...
     ...

Page 57: CUDA Lecture 11 Performance Considerations

Comparison instructions set condition codes (CC).
Instructions can be predicated to write results only when the CC meets a criterion (CC != 0, CC >= 0, etc.).
The compiler tries to predict if a branch condition is likely to produce many divergent warps.
If guaranteed not to diverge: only predicates if < 4 instructions.
If not guaranteed: only predicates if < 7 instructions.
May replace branches with instruction predication.

Performance Considerations – Slide 57

Instruction Predication in G80

Page 58: CUDA Lecture 11 Performance Considerations

ALL predicated instructions take execution cycles.
Those with false conditions don't write their output or invoke memory loads and stores.
This saves branch instructions, so it can be cheaper than serializing divergent paths.

Performance Considerations – Slide 58

Instruction Predication in G80 (cont.)

Page 59: CUDA Lecture 11 Performance Considerations

S. A. Mahlke, R. E. Hank, J. E. McCormick, D. I. August, and W. W. Hwu, “A Comparison of Full and Partial Predicated Execution Support for ILP Processors,” Proceedings of the 22nd International Symposium on Computer Architecture, June 1995, pp. 138–150. http://www.crhc.uiuc.edu/IMPACT/ftp/conference/isca-95-partial-pred.pdf
Also available in Readings in Computer Architecture, edited by Hill, Jouppi, and Sohi, Morgan Kaufmann, 2000.

Performance Considerations – Slide 59

For more information on instruction predication

Page 60: CUDA Lecture 11 Performance Considerations

Recall that the streaming multiprocessor implements zero-overhead warp scheduling.
At any time, only one of the warps is executed by an SM.
Warps whose next instruction has its inputs ready for consumption are eligible for execution.
Eligible warps are selected for execution on a prioritized scheduling policy.
All threads in a warp execute the same instruction when selected.

Performance Considerations – Slide 60

Performance Topic D: Occupancy

[Figure: warp scheduling timeline (TB = thread block, W = warp); TB1 W1 issues instructions until it stalls, then TB2 W1, then TB3 W1, and so on, with stalled warps resuming once their inputs are ready.]

Page 61: CUDA Lecture 11 Performance Considerations

What happens if all warps are stalled? No instruction is issued, and performance is lost.
The most common reason for stalling? Waiting on global memory.
If your code reads global memory every couple of instructions, you should try to maximize occupancy.
What determines occupancy? Register usage per thread and shared memory per thread block.

Performance Considerations – Slide 61

Thread Scheduling

Page 62: CUDA Lecture 11 Performance Considerations

There is a pool of registers and shared memory per streaming multiprocessor.
Each thread block grabs registers and shared memory.

Performance Considerations – Slide 62

Resource Limits

[Figure: thread blocks TB 0, TB 1, and TB 2 each occupying a slice of the register file and of shared memory.]

Page 63: CUDA Lecture 11 Performance Considerations

There is a pool of registers and shared memory per streaming multiprocessor.
If one or the other is fully utilized, no more thread blocks can be scheduled.

Performance Considerations – Slide 63

Resource Limits (cont.)

[Figure: TB 0 and TB 1 fully use one of the two resources, so no further thread blocks fit on the SM even though the other resource still has room.]

Page 64: CUDA Lecture 11 Performance Considerations

You can only have eight thread blocks per streaming multiprocessor.
If they're too small, they can't fill up the SM; you need 128 threads per thread block (GT200) or 192 threads per thread block (GF100).
Higher occupancy has diminishing returns for hiding latency.

Performance Considerations – Slide 64

Resource Limits (cont.)

Page 65: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 65

Hiding latency with more threads

Page 66: CUDA Lecture 11 Performance Considerations

Use nvcc -Xptxas -v to get register and shared memory usage.

Plug those numbers into the CUDA Occupancy Calculator: http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
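
For example (the file name is illustrative):

nvcc -Xptxas -v -c mykernel.cu

ptxas then reports, for each kernel, the number of registers and the amount of shared (and constant) memory it uses.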

Performance Considerations – Slide 66

How do you know what you’re using?

Page 67: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 67

CUDA GPU Occupancy Calculator Example

Page 68: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 68

CUDA GPU Occupancy Calculator Example (cont.)

Page 69: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 69

CUDA GPU Occupancy Calculator Example (cont.)

Page 70: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 70

CUDA GPU Occupancy Calculator Example (cont.)

Page 71: CUDA Lecture 11 Performance Considerations

Pass the option -maxrregcount=X to nvcc.

This isn't magic; you won't get occupancy for free.

Use this very carefully when you are right on the edge.
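
For example (the limit of 32 is illustrative):

nvcc -maxrregcount=32 -c mykernel.cu

Capping registers can raise occupancy, but the compiler may spill the excess to local memory, so measure before and after.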

Performance Considerations – Slide 71

How to influence how many registers you use

Page 72: CUDA Lecture 11 Performance Considerations

Kernel launches aren't free.
A null kernel launch will take non-trivial time; the actual number changes with hardware generations and driver software, so I can't give you one number.
Independent kernel launches are cheaper than dependent kernel launches (dependent launch: some read-back to the CPU).
If you are launching lots of small grids you will lose substantial performance due to this effect.
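
A minimal sketch of how one might measure launch overhead with CUDA events, assuming an empty kernel; the numbers you get will vary by hardware and driver as noted above:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void nullKernel() {}

int main()
{
  const int LAUNCHES = 1000;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  nullKernel<<<1, 1>>>();          // warm up / create context
  cudaDeviceSynchronize();

  cudaEventRecord(start);
  for (int i = 0; i < LAUNCHES; ++i)
    nullKernel<<<1, 1>>>();        // independent launches, no read-back
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("average launch time: %f us\n", 1000.0f * ms / LAUNCHES);
  return 0;
}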

Performance Considerations – Slide 72

Performance Topic E: Kernel Launch Overhead

Page 73: CUDA Lecture 11 Performance Considerations

If you are reading back data to the CPU for control decisions, consider doing it on the GPU instead.
Even though the GPU is slow at serial tasks, it can do a surprising amount of work before you have used up the kernel launch overhead.

Performance Considerations – Slide 73

Kernel Launch Overhead (cont.)

Page 74: CUDA Lecture 11 Performance Considerations

Measure, measure, then measure some more!
Once you identify bottlenecks, apply judicious tuning.
What is most important depends on your program.
You'll often have a series of bottlenecks, where each optimization gives a smaller boost than expected.

Performance Considerations – Slide 74

In Conclusion…

Page 75: CUDA Lecture 11 Performance Considerations

Reading: Chapter 6, “Programming Massively Parallel Processors” by Kirk and Hwu.

Based on original material from:
The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
Stanford University: Jared Hoberock, David Tarjan
Revision history: last updated 10/11/2011. Previous revisions: 9/9/2011.

Performance Considerations – Slide 75

End Credits