
Page 1: CUDA Lecture 11 Performance Considerations

Prepared 10/11/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

CUDA Lecture 11: Performance Considerations

Page 2: CUDA Lecture 11 Performance Considerations

Always measure where your time is going! Even if you think you know where it is going.
Start coarse, go fine-grained as need be.

Keep in mind Amdahl's Law when optimizing any part of your code.
Don't continue to optimize once a part is only a small fraction of overall execution time.
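
For reference, a minimal statement of Amdahl's Law (notation mine, not from the slides): if a fraction p of the execution time is sped up by a factor s, the overall speedup is

speedup = 1 / ((1 - p) + p / s)

For example, p = 0.8 and s = 10 give 1 / (0.2 + 0.08), roughly 3.6x, so the untouched 20% quickly dominates.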

Performance Considerations – Slide 2

Preliminaries

Page 3: CUDA Lecture 11 Performance Considerations

Performance consideration issues:
Memory coalescing
Shared memory bank conflicts
Control-flow divergence
Occupancy
Kernel launch overheads

Performance Considerations – Slide 3

Outline

Page 4: CUDA Lecture 11 Performance Considerations

Off-chip memory is accessed in chunks, even if you read only a single word. If you don't use the whole chunk, bandwidth is wasted.
Chunks are aligned to multiples of 32/64/128 bytes; unaligned accesses will cost more.

When accessing global memory, peak performance utilization occurs when all threads in a half-warp access contiguous memory locations.

Performance Considerations – Slide 4

Performance Topic A: Memory Coalescing

Page 5: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 5

Memory Layout of a Matrix in C

[Figure: a 4×4 matrix M and its row-major layout in C, linearized as M0,0 M1,0 M2,0 M3,0, M0,1 M1,1 M2,1 M3,1, M0,2 M1,2 M2,2 M3,2, M0,3 M1,3 M2,3 M3,3.]

Page 6: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 6

Memory Layout of a Matrix in C

[Figure: the same matrix with the kernel's access direction overlaid; in each time period, threads T1–T4 each access one element and then step to the next.]

Page 7: CUDA Lecture 11 Performance Considerations


Performance Considerations – Slide 7

Memory Layout of a Matrix in C

[Figure: the access direction in kernel code mapped onto the linearized layout, showing whether the elements touched by threads T1–T4 in each time period are adjacent in memory.]

Page 8: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 8

Memory Layout of a Matrix in C

[Figure: matrices Md and Nd, each of width WIDTH; Thread 1 and Thread 2 traverse Md along rows (not coalesced) and Nd along columns (coalesced).]
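
A minimal kernel sketch of the contrast the figure draws; the kernel name and arguments are illustrative, and the matrix is assumed to be WIDTH x WIDTH in row-major order:

// Sketch only: two ways a thread can traverse a row-major matrix.
__global__ void traverse(const float* M, float* out, int WIDTH)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  float sum = 0.0f;

  // (a) Each thread walks one row: at step k, neighboring threads touch
  //     M[tid*WIDTH + k] and M[(tid+1)*WIDTH + k], WIDTH floats apart,
  //     so the accesses of a half-warp are NOT coalesced.
  for (int k = 0; k < WIDTH; ++k)
    sum += M[tid * WIDTH + k];

  // (b) Each thread walks one column: at step k, neighboring threads touch
  //     M[k*WIDTH + tid] and M[k*WIDTH + tid + 1], which are adjacent,
  //     so the accesses of a half-warp are coalesced.
  for (int k = 0; k < WIDTH; ++k)
    sum += M[k * WIDTH + tid];

  out[tid] = sum;
}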

Page 9: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 9

Use Shared Memory to Improve Coalescing

[Figure: original access pattern vs. tiled access pattern for Md and Nd; copy a tile into scratchpad (shared) memory, then perform the multiplication with the scratchpad values.]
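
A minimal sketch of the tiled (scratchpad) pattern the figure describes, applied to the matrix multiplication example; TILE, the kernel name, and the assumption that WIDTH is a multiple of TILE are mine:

#define TILE 16

// Each block stages one TILE x TILE tile of Md and Nd in shared memory.
// The global loads below are coalesced because consecutive threads
// (consecutive threadIdx.x) read consecutive addresses.
__global__ void matMulTiled(const float* Md, const float* Nd, float* Pd, int WIDTH)
{
  __shared__ float Ms[TILE][TILE];
  __shared__ float Ns[TILE][TILE];

  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;

  float sum = 0.0f;
  for (int t = 0; t < WIDTH / TILE; ++t) {
    // Copy into scratchpad memory (coalesced global reads).
    Ms[threadIdx.y][threadIdx.x] = Md[row * WIDTH + t * TILE + threadIdx.x];
    Ns[threadIdx.y][threadIdx.x] = Nd[(t * TILE + threadIdx.y) * WIDTH + col];
    __syncthreads();

    // Perform the multiplication with scratchpad values.
    for (int k = 0; k < TILE; ++k)
      sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
    __syncthreads();
  }
  Pd[row * WIDTH + col] = sum;
}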

Page 10: CUDA Lecture 11 Performance Considerations

Threads 0–15 access 4-byte words at addresses 116–176.
Thread 0 is the lowest active thread; it accesses address 116.
128-byte segment: 0–127.

Performance Considerations – Slide 10

Second Example

[Figure: address line 0–288 in 32-byte steps, threads t0–t15 marked, 128B segment 0–127 highlighted.]

Page 11: CUDA Lecture 11 Performance Considerations

Threads 0–15 access 4-byte words at addresses 116–176.
Thread 0 is the lowest active thread; it accesses address 116.
128-byte segment: 0–127 (reduced to 64B).

Performance Considerations – Slide 11

Second Example (cont.)

[Figure: address line 0–288 with the reduced 64B segment (64–127) highlighted.]

Page 12: CUDA Lecture 11 Performance Considerations

Threads 0–15 access 4-byte words at addresses 116–176.
Thread 0 is the lowest active thread; it accesses address 116.
128-byte segment: 0–127 (reduced to 32B).

Performance Considerations – Slide 12

Second Example (cont.)

[Figure: address line 0–288 with the reduced 32B segment (96–127) highlighted.]

Page 13: CUDA Lecture 11 Performance Considerations

Threads 0–15 access 4-byte words at addresses 116–176.
Thread 3 is now the lowest active thread; it accesses address 128.
128-byte segment: 128–255.

Performance Considerations – Slide 13

Second Example (cont.)

[Figure: address line 0–288 with the 128B segment 128–255 highlighted.]

Page 14: CUDA Lecture 11 Performance Considerations

Threads 0–15 access 4-byte words at addresses 116–176.
Thread 3 is now the lowest active thread; it accesses address 128.
128-byte segment: 128–255 (reduced to 64B).

Performance Considerations – Slide 14

Second Example (cont.)

[Figure: address line 0–288 with the reduced 64B segment (128–191) highlighted.]

Page 15: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 15

Consider the stride of your accesses

__global__ void foo(int* input, float3* input2)
{
  int i = blockDim.x * blockIdx.x + threadIdx.x;

  // Stride 1
  int a = input[i];

  // Stride 2, half the bandwidth is wasted
  int b = input[2 * i];

  // Stride 3, 2/3 of the bandwidth is wasted
  float c = input2[i].x;
}

Page 16: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 16

Example: Array of Structures (AoS)

struct record {
  int key;
  int value;
  int flag;
};

record *d_records;
cudaMalloc((void**)&d_records, ...);

Page 17: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 17

Example: Structure of Arrays (SoA)

struct SoA {
  int * keys;
  int * values;
  int * flags;
};

SoA d_SoA_data;
cudaMalloc((void**)&d_SoA_data.keys, ...);
cudaMalloc((void**)&d_SoA_data.values, ...);
cudaMalloc((void**)&d_SoA_data.flags, ...);

Page 18: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 18

Example: SoA versus AoS

__global__ void bar(record *AoS_data, SoA SoA_data)
{
  int i = blockDim.x * blockIdx.x + threadIdx.x;

  // AoS wastes bandwidth
  int key = AoS_data[i].key;

  // SoA: efficient use of bandwidth
  int key_better = SoA_data.keys[i];
}

Page 19: CUDA Lecture 11 Performance Considerations

Structure of arrays is often better than array of structures.
It is a very clear win on regular, stride-1 access patterns.
Unpredictable or irregular access patterns are case-by-case.

Performance Considerations – Slide 19

Example: SoA versus AoS (cont.)

Page 20: CUDA Lecture 11 Performance Considerations

As seen, each SM has 16 KB of shared memory organized as 16 banks of 32-bit words (Tesla).
CUDA uses shared memory as shared storage visible to all threads in a thread block, with read and write access.
It is not used explicitly for pixel shader programs; we dislike pixels talking to each other.

Performance Considerations – Slide 20

Performance Topic B: Shared Memory Bank Conflicts

[Figure: SM block diagram: instruction L1 cache (I$), multithreaded instruction buffer, register file (RF), constant L1 cache (C$), shared memory, operand select, MAD and SFU units.]

Page 21: CUDA Lecture 11 Performance Considerations

So shared memory is banked.
This only matters for threads within a warp.
Full performance with some restrictions: threads can each access different banks, or can all access the same value.
Consecutive words are in different banks.
If two or more threads access the same bank but different values, you get bank conflicts.

Performance Considerations – Slide 21

Shared Memory

Page 22: CUDA Lecture 11 Performance Considerations

In a parallel machine, many threads access memory; therefore, memory is divided into banks. This is essential to achieve high bandwidth.
Each bank can service one address per cycle, so a memory can service as many simultaneous accesses as it has banks.
Multiple simultaneous accesses to a bank result in a bank conflict; conflicting accesses are serialized.

Performance Considerations – Slide 22

Details: Parallel Memory Architecture

[Figure: shared memory banks Bank 0 through Bank 15.]

Page 23: CUDA Lecture 11 Performance Considerations

No bank conflicts: linear addressing, stride == 1.
No bank conflicts: random 1:1 permutation.

Performance Considerations – Slide 23

Bank Addressing Examples

[Figure: threads 0–15 mapped one-to-one onto banks 0–15 for both access patterns.]

Page 24: CUDA Lecture 11 Performance Considerations

Two-way bank conflicts: linear addressing, stride == 2.
Eight-way bank conflicts: linear addressing, stride == 8.

Performance Considerations – Slide 24

Bank Addressing Examples (cont.)

[Figure: with stride 2, pairs of threads map to the same bank; with stride 8, eight threads map to each of banks 0 and 8.]

Page 25: CUDA Lecture 11 Performance Considerations

Each bank has a bandwidth of 32 bits per clock cycle

Successive 32-bit words are assigned to successive banks

G80 has 16 banks, so bank = (32-bit word address) % 16. This is the same as the size of a half-warp.

No bank conflicts between different half-warps, only within a single half-warp
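
As a quick illustration (mine, not from the slides), the bank a shared-memory word falls in on G80 can be computed from a hypothetical byte offset:

// Successive 32-bit words go to successive banks; G80 has 16 banks.
int bank = (byteOffset / 4) % 16;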

Performance Considerations – Slide 25

How addresses map to banks on G80

Page 26: CUDA Lecture 11 Performance Considerations

Shared memory is as fast as registers if there are no bank conflicts

The fast case:
If all threads of a half-warp access different banks, there is no bank conflict.
If all threads of a half-warp access the identical address, there is no bank conflict (broadcast).
The slow case:
Bank conflict: multiple threads in the same half-warp access the same bank.
The accesses must be serialized.
Cost = max # of simultaneous accesses to a single bank.

Performance Considerations – Slide 26

Shared memory bank conflicts

Page 27: CUDA Lecture 11 Performance Considerations

Change all shared memory reads to the same value.
All broadcasts = no conflicts.
This will show how much performance could be improved by eliminating bank conflicts.
The same doesn't work for shared memory writes, so replace shared memory array indices with threadIdx.x.
This can also be done for the reads.
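
A minimal sketch of the trick; the kernel, s_data, and the commented-out "original" index are all illustrative:

__global__ void bankConflictProbe(const float* in, float* out)
{
  __shared__ float s_data[256];

  // Original (possibly conflicting) pattern, for comparison:
  //   s_data[index] = in[threadIdx.x];   float v = s_data[index];

  // Diagnostic write: index by threadIdx.x so each thread of a half-warp
  // hits its own bank (no conflicts).
  s_data[threadIdx.x] = in[threadIdx.x];
  __syncthreads();

  // Diagnostic read: every thread reads the same word, which is a
  // broadcast and therefore conflict-free.
  float v = s_data[0];

  out[threadIdx.x] = v;
}

Timing this version against the original gives a rough estimate of how much the bank conflicts cost.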

Performance Considerations – Slide 27

Trick to Assess Impact On Performance

Page 28: CUDA Lecture 11 Performance Considerations

Given:

This is only bank-conflict-free if s shares no common factors with the number of banks (16 on G80), so s must be odd.

Performance Considerations – Slide 28

Linear Addressing

__shared__ float shared[256];
float foo = shared[baseIndex + s * threadIdx.x];

Page 29: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 29

Linear Addressing Examples

[Figure: thread-to-bank mappings for s = 1 and s = 3; both strides are odd, so each thread of the half-warp maps to a distinct bank.]

Page 30: CUDA Lecture 11 Performance Considerations

texture and __constant__ memories are read-only.
Data resides in global memory.
They use a different read path that includes specialized caches.

Performance Considerations – Slide 30

Additional “memories”

Page 31: CUDA Lecture 11 Performance Considerations

Data stored in global memory, read through a constant-cache path.
__constant__ qualifier in declarations.
Can only be read by GPU kernels.
Limited to 64 KB.
To be used when all threads in a warp read the same address; serializes otherwise.
Throughput: 32 bits per warp per clock per multiprocessor.
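
A minimal sketch of the usage, assuming a small coefficient table; the names are illustrative:

__constant__ float coeffs[16];   // resides in the 64 KB constant space

__global__ void applyCoeffs(const float* in, float* out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    // Every thread in the warp reads the same coeffs[] entries, so the
    // reads are serviced as broadcasts through the constant cache.
    out[i] = coeffs[0] * in[i] + coeffs[1];
}

// Host side, before launching:
//   float h_coeffs[16] = { ... };
//   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));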

Performance Considerations – Slide 31

Constant Memory

Page 32: CUDA Lecture 11 Performance Considerations

Immediate address constants and indexed address constants.
Constants are stored in DRAM and cached on chip (L1 per SM).
A constant value can be broadcast to all threads in a warp.
This is an extremely efficient way of accessing a value that is common to all threads in a block!

Performance Considerations – Slide 32

Constants

[Figure: SM block diagram with the constant L1 cache highlighted.]

Page 33: CUDA Lecture 11 Performance Considerations

Objectives:
To understand the implications of control flow on branch divergence overhead and SM execution resource utilization.
To learn better ways to write code with control flow.
To understand compiler/HW predication designed to reduce the impact of control flow; there is a cost involved.

Performance Considerations – Slide 33

Performance Topic C: Control Flow Divergence

Page 34: CUDA Lecture 11 Performance Considerations

Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads); the unit of parallelism in CUDA.

Warp: a group of threads executed physically in parallel in G80

Block: a group of threads that are executed together and form the unit of resource assignment

Grid: a group of thread blocks that must all complete before the next kernel call of the program can take effect

Performance Considerations – Slide 34

Quick terminology review

Page 35: CUDA Lecture 11 Performance Considerations

Thread blocks are partitioned into warps, with instructions issued per 32 threads (a warp).
Thread IDs within a warp are consecutive and increasing; warp 0 starts with thread ID 0.
Partitioning is always the same, so you can use this knowledge in control flow.
The exact size of warps may change from generation to generation.

Performance Considerations – Slide 35

How thread blocks are partitioned

Page 36: CUDA Lecture 11 Performance Considerations

However, DO NOT rely on any ordering between warps.
If there are any dependencies between threads, you must use __syncthreads() to get correct results.
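
A minimal sketch of why the barrier is needed, assuming a block of at most 256 threads; the kernel and names are illustrative:

__global__ void shiftLeft(const float* in, float* out)
{
  __shared__ float buf[256];
  int t = threadIdx.x;

  buf[t] = in[blockIdx.x * blockDim.x + t];

  // Thread t is about to read the element written by thread t+1, which may
  // belong to a different warp, so all writes must be visible first.
  __syncthreads();

  float neighbor = (t + 1 < blockDim.x) ? buf[t + 1] : 0.0f;
  out[blockIdx.x * blockDim.x + t] = neighbor;
}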

Performance Considerations – Slide 36

How thread blocks are partitioned (cont.)

Page 37: CUDA Lecture 11 Performance Considerations

The main performance concern with branching is divergence: threads within a single warp take different paths (if-else, ...).
Different execution paths within a warp are serialized in G80; the control paths taken by the threads in a warp are traversed one at a time until there are no more.
Different warps can execute different code with no impact on performance.

Performance Considerations – Slide 37

Control Flow Instructions

Page 38: CUDA Lecture 11 Performance Considerations

A common case: avoid diverging within a warp, i.e., when the branch condition is a function of thread ID.
Example with divergence (below):
This creates two different control paths for threads in a block.
Branch granularity < warp size: threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp.

Performance Considerations – Slide 38

Control Flow Divergence (cont.)

if (threadIdx.x > 2) { ... }
else { ... }

Page 39: CUDA Lecture 11 Performance Considerations

A common case: avoid diverging within a warp, i.e., when the branch condition is a function of thread ID.
Example without divergence (below):
This also creates two different control paths for threads in a block.
Branch granularity is a whole multiple of warp size: all threads in any given warp follow the same path.

Performance Considerations – Slide 39

Control Flow Divergence (cont.)

if (threadIdx.x / WARP_SIZE > 2) { ... }
else { ... }

Page 40: CUDA Lecture 11 Performance Considerations

Given an array of values, “reduce” them to a single value in parallel.
Examples: sum reduction (sum of all values in the array), max reduction (maximum of all values in the array).
Typical parallel implementation: recursively halve the number of threads, adding two values per thread.
Takes log(n) steps for n elements, requires n/2 threads.

Performance Considerations – Slide 40

Parallel Reduction

Page 41: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 41

Example: Divergent Iteration

__global__ void per_thread_sum(int *indices, float *data, float *sums)
{
  ...
  // number of loop iterations is data dependent
  for (int j = indices[i]; j < indices[i+1]; j++)
  {
    sum += data[j];
  }
  sums[i] = sum;
}

Page 42: CUDA Lecture 11 Performance Considerations

Assume an in-place reduction using shared memory.
The original vector is in device global memory; the shared memory is used to hold a partial sum vector.
Each iteration brings the partial sum vector closer to the final sum.
The final solution will be in element 0.

Performance Considerations – Slide 42

A Vector Reduction Example

Page 43: CUDA Lecture 11 Performance Considerations

Assume we have already loaded the array into __shared__ float partialSum[].

Performance Considerations – Slide 43

A simple implementation

unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
{
  __syncthreads();
  if (t % (2*stride) == 0)
    partialSum[t] += partialSum[t+stride];
}

Page 44: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 44

Vector Reduction with Bank Conflicts

[Figure: array elements 0–11; iteration 1 forms 0+1, 2+3, 4+5, 6+7, 8+9, 10+11; iteration 2 forms 0...3, 4...7, 8...11; iteration 3 forms 0...7, 8...15.]

Page 45: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 45

Vector Reduction with Branch Divergence

[Figure: the same reduction tree annotated with the active threads (Thread 0, 2, 4, 6, 8, 10), showing how active and idle threads interleave within each warp.]

Page 46: CUDA Lecture 11 Performance Considerations

In each iteration, two control flow paths will be sequentially traversed for each warp: threads that perform the addition and threads that do not.
Threads that do not perform the addition may still cost extra cycles, depending on the implementation of divergence.

Performance Considerations – Slide 46

Some Observations

Page 47: CUDA Lecture 11 Performance Considerations

No more than half of the threads will be executing at any time.
All odd-index threads are disabled right from the beginning!
On average, fewer than 1/4 of the threads will be activated for all warps over time.
After the 5th iteration, entire warps in each block will be disabled: poor resource utilization but no divergence.
This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), where each iteration only has one thread activated per warp until all warps retire.

Performance Considerations – Slide 47

Some Observations (cont.)

Page 48: CUDA Lecture 11 Performance Considerations

Assume we have already loaded the array into __shared__ float partialSum[].

Performance Considerations – Slide 48

Shortcomings of the implementation

unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
{
  __syncthreads();
  if (t % (2*stride) == 0)
    partialSum[t] += partialSum[t+stride];
}

BAD: Divergence due to interleaved branch decisions

Page 49: CUDA Lecture 11 Performance Considerations

Assume we have already loaded the array into __shared__ float partialSum[].

Performance Considerations – Slide 49

A better implementation

unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x >> 1; stride >= 1; stride >>= 1)
{
  __syncthreads();
  if (t < stride)
    partialSum[t] += partialSum[t+stride];
}

Page 50: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 50

Less Divergence than Original

[Figure: elements 0–31; iteration 1 adds element i+16 into element i (0+16, ..., 15+31), so whole warps drop out together in later iterations.]

Page 51: CUDA Lecture 11 Performance Considerations

Only the last 5 iterations will have divergence.
Entire warps will be shut down as iterations progress.
For a 512-thread block, it takes 4 iterations to shut down all but one warp in each block.
Better resource utilization; this will likely retire warps, and thus blocks, faster.
Recall there are no bank conflicts either.

Performance Considerations – Slide 51

Some Observations About the New Implementation

Page 52: CUDA Lecture 11 Performance Considerations

For the last 6 loops only one warp is active (i.e., thread IDs 0..31).
Shared reads and writes are SIMD synchronous within a warp, so skip __syncthreads() and unroll the last 6 iterations.

Performance Considerations – Slide 52

A Potential Further Refinement but bad idea

unsigned int tid = threadIdx.x;
for (unsigned int d = n>>1; d > 32; d >>= 1)
{
  __syncthreads();
  if (tid < d) shared[tid] += shared[tid + d];
}
__syncthreads();
if (tid <= 32)
{ // unroll last 6 predicated steps
  shared[tid] += shared[tid + 32];
  shared[tid] += shared[tid + 16];
  shared[tid] += shared[tid + 8];
  shared[tid] += shared[tid + 4];
  shared[tid] += shared[tid + 2];
  shared[tid] += shared[tid + 1];
}

This would not work properly if the warp size decreases; you would need __syncthreads() between each statement.
However, having __syncthreads() inside an if statement is problematic.

Page 53: CUDA Lecture 11 Performance Considerations

A single thread can drag a whole warp with it for a long time.
Know your data patterns.
If the data is unpredictable, try to flatten peaks by letting threads work on multiple data items.
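
One way to let each thread work on multiple items is a simple strided loop; a minimal sketch (my own illustration of the idea, not code from the slides):

__global__ void sumMany(const float* data, float* partial, int n)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;

  float sum = 0.0f;
  // Each thread consumes several elements instead of exactly one, which
  // evens out the work when the per-item cost is unpredictable.
  for (int i = tid; i < n; i += stride)
    sum += data[i];

  partial[tid] = sum;
}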

Performance Considerations – Slide 53

Conclusion: Iteration Divergence

Page 54: CUDA Lecture 11 Performance Considerations

<p1> LDR r1, r2, 0
If p1 is TRUE, the instruction executes normally.
If p1 is FALSE, the instruction is treated as a NOP.

Predication example:

Performance Considerations – Slide 54

Predicated Execution Concept

...
if (x == 10)
  c = c + 1;
...

...
     LDR r5, X
     p1 <- r5 eq 10
<p1> LDR r1 <- C
<p1> ADD r1, r1, 1
<p1> STR r1 -> C
...

Page 55: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 55

Predication very helpful for if-else

[Figure: control-flow graph with block A branching to B or C, both rejoining at D; with predication the blocks are emitted sequentially as A, B, C, D.]

Page 56: CUDA Lecture 11 Performance Considerations

The cost is that extra instructions will be issued each time the code is executed. However, there is no branch divergence.

Performance Considerations – Slide 56

If-else example…

     p1,p2 <- r5 eq 10
<p1> inst 1 from B
<p1> inst 2 from B
<p1> ...
<p2> inst 1 from C
<p2> inst 2 from C
     ...

After scheduling:

     p1,p2 <- r5 eq 10
<p1> inst 1 from B
<p2> inst 1 from C
<p1> inst 2 from B
<p2> inst 2 from C
<p1> ...
     ...

Page 57: CUDA Lecture 11 Performance Considerations

Comparison instructions set condition codes (CC).
Instructions can be predicated to write results only when the CC meets a criterion (CC != 0, CC >= 0, etc.).
The compiler tries to predict if a branch condition is likely to produce many divergent warps.
If guaranteed not to diverge: only predicates if < 4 instructions.
If not guaranteed: only predicates if < 7 instructions.
May replace branches with instruction predication.

Performance Considerations – Slide 57

Instruction Predication in G80

Page 58: CUDA Lecture 11 Performance Considerations

ALL predicated instructions take execution cycles.
Those with false conditions don't write their output or invoke memory loads and stores.
This saves branch instructions, so it can be cheaper than serializing divergent paths.

Performance Considerations – Slide 58

Instruction Predication in G80 (cont.)

Page 59: CUDA Lecture 11 Performance Considerations

S. A. Mahlke, R. E. Hank, J. E. McCormick, D. I. August, and W. W. Hwu, “A Comparison of Full and Partial Predicated Execution Support for ILP Processors,” Proceedings of the 22nd International Symposium on Computer Architecture, June 1995, pp. 138–150. http://www.crhc.uiuc.edu/IMPACT/ftp/conference/isca-95-partial-pred.pdf
Also available in Readings in Computer Architecture, edited by Hill, Jouppi, and Sohi, Morgan Kaufmann, 2000.

Performance Considerations – Slide 59

For more information on instruction predication

Page 60: CUDA Lecture 11 Performance Considerations

Recall that the streaming multiprocessor implements zero-overhead warp scheduling.
At any time, only one of the warps is executed by an SM.
Warps whose next instruction has its inputs ready for consumption are eligible for execution.
Eligible warps are selected for execution on a prioritized scheduling policy.
All threads in a warp execute the same instruction when selected.

Performance Considerations – Slide 60

Performance Topic D: Occupancy

[Figure: warp scheduling timeline (TB = thread block, W = warp); TB1 W1 issues instructions until it stalls, then TB2 W1, then TB3 W1, and so on, with stalled warps resuming once their inputs are ready.]

Page 61: CUDA Lecture 11 Performance Considerations

What happens if all warps are stalled? No instruction is issued, and performance is lost.
The most common reason for stalling? Waiting on global memory.
If your code reads global memory every couple of instructions, you should try to maximize occupancy.
What determines occupancy? Register usage per thread and shared memory per thread block.

Performance Considerations – Slide 61

Thread Scheduling

Page 62: CUDA Lecture 11 Performance Considerations

There is a pool of registers and shared memory per streaming multiprocessor.
Each thread block grabs registers and shared memory.

Performance Considerations – Slide 62

Resource Limits

[Figure: thread blocks TB 0, TB 1, and TB 2 each occupying a slice of the register file and of shared memory.]

Page 63: CUDA Lecture 11 Performance Considerations

There is a pool of registers and shared memory per streaming multiprocessor.
If one or the other is fully utilized, no more thread blocks can be scheduled.

Performance Considerations – Slide 63

Resource Limits (cont.)

[Figure: TB 0 and TB 1 fully use one of the two resources, so no further thread blocks fit on the SM even though the other resource still has room.]

Page 64: CUDA Lecture 11 Performance Considerations

You can only have eight thread blocks per streaming multiprocessor.
If they're too small, they can't fill up the SM; you need 128 threads per thread block (GT200) or 192 threads per thread block (GF100).
Higher occupancy has diminishing returns for hiding latency.

Performance Considerations – Slide 64

Resource Limits (cont.)

Page 65: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 65

Hiding latency with more threads

Page 66: CUDA Lecture 11 Performance Considerations

Use nvcc -Xptxas -v to get register and shared memory usage.

Plug those numbers into the CUDA Occupancy Calculator: http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
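
For example (the file name is illustrative):

nvcc -Xptxas -v -c mykernel.cu

ptxas then reports, for each kernel, the number of registers and the amount of shared (and constant) memory it uses.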

Performance Considerations – Slide 66

How do you know what you’re using?

Page 67: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 67

CUDA GPU Occupancy Calculator Example

Page 68: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 68

CUDA GPU Occupancy Calculator Example (cont.)

Page 69: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 69

CUDA GPU Occupancy Calculator Example (cont.)

Page 70: CUDA Lecture 11 Performance Considerations

Performance Considerations – Slide 70

CUDA GPU Occupancy Calculator Example (cont.)

Page 71: CUDA Lecture 11 Performance Considerations

Pass the option -maxrregcount=X to nvcc.

This isn't magic; you won't get occupancy for free.

Use this very carefully when you are right on the edge.
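
For example (the limit of 32 is illustrative):

nvcc -maxrregcount=32 -c mykernel.cu

Capping registers can raise occupancy, but the compiler may spill the excess to local memory, so measure before and after.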

Performance Considerations – Slide 71

How to influence how many registers you use

Page 72: CUDA Lecture 11 Performance Considerations

Kernel launches aren't free.
A null kernel launch will take non-trivial time; the actual number changes with hardware generations and driver software, so I can't give you one number.
Independent kernel launches are cheaper than dependent kernel launches (dependent launch: some read-back to the CPU).
If you are launching lots of small grids you will lose substantial performance due to this effect.
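
A minimal sketch of how one might measure launch overhead with CUDA events, assuming an empty kernel; the numbers you get will vary by hardware and driver as noted above:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void nullKernel() {}

int main()
{
  const int LAUNCHES = 1000;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  nullKernel<<<1, 1>>>();          // warm up / create context
  cudaDeviceSynchronize();

  cudaEventRecord(start);
  for (int i = 0; i < LAUNCHES; ++i)
    nullKernel<<<1, 1>>>();        // independent launches, no read-back
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("average launch time: %f us\n", 1000.0f * ms / LAUNCHES);
  return 0;
}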

Performance Considerations – Slide 72

Performance Topic E: Kernel Launch Overhead

Page 73: CUDA Lecture 11 Performance Considerations

If you are reading back data to the CPU for control decisions, consider doing it on the GPU instead.
Even though the GPU is slow at serial tasks, it can do a surprising amount of work before you have used up the kernel launch overhead.

Performance Considerations – Slide 73

Kernel Launch Overhead (cont.)

Page 74: CUDA Lecture 11 Performance Considerations

Measure, measure, then measure some more!
Once you identify bottlenecks, apply judicious tuning.
What is most important depends on your program.
You'll often have a series of bottlenecks, where each optimization gives a smaller boost than expected.

Performance Considerations – Slide 74

In Conclusion…

Page 75: CUDA Lecture 11 Performance Considerations

Reading: Chapter 6, “Programming Massively Parallel Processors” by Kirk and Hwu.

Based on original material from:
The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
Stanford University: Jared Hoberock, David Tarjan
Revision history: last updated 10/11/2011. Previous revisions: 9/9/2011.

Performance Considerations – Slide 75

End Credits