CUDA Lecture 10: Architectural Considerations

Prepared 10/11/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.


Page 1: CUDA Lecture 10 Architectural Considerations

Prepared 10/11/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

CUDA Lecture 10: Architectural Considerations

Page 2: CUDA Lecture 10 Architectural Considerations

• To understand the major factors that dictate performance when using a GPU as a compute accelerator for the CPU
  – The feeds and speeds of the traditional CPU world
  – The feeds and speeds when employing a GPU
• To form a solid knowledge base for performance programming in modern GPUs
  – Knowing yesterday, today, and tomorrow
  – The PC world is becoming flatter
  – Outsourcing of computation is becoming easier…

Architectural Considerations – Slide 2

Objective

Page 3: CUDA Lecture 10 Architectural Considerations

• Topic 1 (next): The GPU as Part of the PC Architecture
• Topic 2: Threading Hardware in the G80
• Topic 3: Memory Hardware in the G80

Architectural Considerations – Slide 3

Outline

Page 4: CUDA Lecture 10 Architectural Considerations

• Global variables declaration
• Function prototypes
  – __global__ void kernelOne(…)
• Main()
  – allocate memory space on the device – cudaMalloc(&d_GlblVarPtr, bytes)
  – transfer data from host to device – cudaMemcpy(d_GlblVarPtr, h_Gl…)
  – execution configuration setup
  – kernel call – kernelOne<<<execution configuration>>>(args…);
  – transfer results from device to host – cudaMemcpy(h_GlblVarPtr, …)
  – (repeat the transfer/kernel-call steps as needed)
  – optional: compare against golden (host computed) solution
• Kernel – void kernelOne(type args, …)
  – variables declaration – __local__, __shared__
  – automatic variables transparently assigned to registers or local memory
  – __syncthreads()…

Architectural Considerations – Slide 4

Recall: Typical Structure of a CUDA Program

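For concreteness, here is a minimal sketch of that structure as a complete program; the kernel body, array size, and names (h_data, d_data, kernelOne) are placeholders, not code from the lecture.

    // Minimal sketch of the typical CUDA program structure (illustrative names).
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void kernelOne(float *d_data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n) d_data[i] *= 2.0f;                    // placeholder computation
    }

    int main(void) {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *h_data = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) h_data[i] = (float)i;

        float *d_data;
        cudaMalloc(&d_data, bytes);                                 // allocate on device
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // host -> device

        dim3 block(256);                                            // execution configuration
        dim3 grid((n + block.x - 1) / block.x);
        kernelOne<<<grid, block>>>(d_data, n);                      // kernel call

        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // device -> host
        printf("h_data[1] = %f\n", h_data[1]);                      // optional check vs. golden result

        cudaFree(d_data);
        free(h_data);
        return 0;
    }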

Page 5: CUDA Lecture 10 Architectural Considerations

• The bandwidth between key components ultimately dictates system performance
  – Especially true for massively parallel systems processing massive amounts of data
  – Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases
  – Ultimately, performance falls back to what the “speeds and feeds” dictate

Architectural Considerations – Slide 5

Bandwidth: Gravity of Modern Computer Systems

Page 6: CUDA Lecture 10 Architectural Considerations

• The northbridge connects three components that must communicate at high speed: CPU, DRAM, and video
  – Video also needs first-class access to DRAM
  – Previous NVIDIA cards were connected to AGP, up to 2 GB/sec transfers
• The southbridge serves as a concentrator for slower I/O devices

Architectural Considerations – Slide 6

Classic PC Architecture

[Diagram: CPU and core logic chipset]

Page 7: CUDA Lecture 10 Architectural Considerations

• Connected to the southbridge
• Originally 33 MHz, 32-bit wide, 132 MB/sec peak transfer rate; more recently 66 MHz, 64-bit, 512 MB/sec peak
• Upstream bandwidth remains slow for devices (256 MB/sec peak)
• Shared bus with arbitration
  – The winner of arbitration becomes bus master and can connect to the CPU or DRAM through the southbridge and northbridge

Architectural Considerations – Slide 7

(Original) PCI Bus Specification

Page 8: CUDA Lecture 10 Architectural Considerations

• PCI device registers are mapped into the CPU’s physical address space
  – Accessed through loads/stores (kernel mode)
• Addresses are assigned to the PCI devices at boot time
  – All devices listen for their addresses

Architectural Considerations – Slide 8

PCI as Memory Mapped I/O

Page 9: CUDA Lecture 10 Architectural Considerations

• Switched, point-to-point connection
  – Each card has a dedicated “link” to the central switch; no bus arbitration
• Packet-switched messages form virtual channels
• Prioritized packets for quality of service, e.g., real-time video streaming

Architectural Considerations – Slide 9

PCI Express (PCIe)

Page 10: CUDA Lecture 10 Architectural Considerations

• Each link consists of one or more lanes
  – Each lane is 1 bit wide (4 wires; each 2-wire pair can transmit 2.5 Gb/sec in one direction)
  – Upstream and downstream are now simultaneous and symmetric
• Each link can combine 1, 2, 4, 8, 12, or 16 lanes – x1, x2, etc.

Architectural Considerations – Slide 10

PCIe Links and Lanes

Page 11: CUDA Lecture 10 Architectural Considerations

• Each link consists of one or more lanes
  – Each data byte is 8b/10b encoded into 10 bits with an equal number of 1’s and 0’s; net data rate 2 Gb/sec per lane each way
• Thus the net data rates are 250 MB/sec (x1), 500 MB/sec (x2), 1 GB/sec (x4), 2 GB/sec (x8), 4 GB/sec (x16), each way
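Spelling out the arithmetic behind these figures (all rates are from the slide; only the calculation is added):

\[
2.5\ \text{Gb/s} \times \tfrac{8}{10} = 2\ \text{Gb/s} = 250\ \text{MB/s per lane each way}, \qquad 16 \times 250\ \text{MB/s} = 4\ \text{GB/s (x16, each way)}.
\]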

Architectural Considerations – Slide 11

PCIe Links and Lanes (cont.)

Page 12: CUDA Lecture 10 Architectural Considerations

• PCIe forms the interconnect backbone
  – Northbridge and southbridge are both PCIe switches
• Some southbridge designs have a built-in PCI-PCIe bridge to allow old PCI cards
• Some PCIe cards are PCI cards with a PCI-PCIe bridge

Architectural Considerations – Slide 12

PCIe PC Architecture

Page 13: CUDA Lecture 10 Architectural Considerations

• FSB connection between the processor and the northbridge (82925X) memory control hub
• The northbridge handles “primary” PCIe to video/GPU and DRAM
  – PCIe x16 bandwidth at 8 GB/sec (4 GB/sec each direction)
• The southbridge (ICH6RW) handles other peripherals

Architectural Considerations – Slide 13

Today’s Intel PC Architecture: Single Core System

Page 14: CUDA Lecture 10 Architectural Considerations

• Bensley platform
  – The Blackford Memory Control Hub (MCH) is now a PCIe switch that integrates the northbridge/southbridge functions
  – FBD (Fully Buffered DIMMs) allow simultaneous R/W transfers at 10.5 GB/sec per DIMM
  – PCIe links form the backbone

Architectural Considerations – Slide 14

Today’s Intel PC Architecture: Dual Core System

Source: http://www.2cpu.com/review.php?id=109

Page 15: CUDA Lecture 10 Architectural Considerations

• Bensley platform (cont.)
  – PCIe device upstream bandwidth is now equal to downstream
  – The workstation version has an x16 GPU link via the Greencreek MCH

Architectural Considerations – Slide 15

Today’s Intel PC Architecture: Dual Core System (cont.)

Source: http://www.2cpu.com/review.php?id=109

Page 16: CUDA Lecture 10 Architectural Considerations

• Two CPU sockets
  – Dual Independent Buses to the CPUs; each is basically an FSB
  – CPU feeds at 8.5–10.5 GB/sec per socket
  – Compared to the current front-side bus CPU feed of 6.4 GB/sec
• PCIe bridges to legacy I/O devices

Architectural Considerations – Slide 16

Today’s Intel PC Architecture: Dual Core System (cont.)

Source: http://www.2cpu.com/review.php?id=109

Page 17: CUDA Lecture 10 Architectural Considerations

• The AMD HyperTransport™ Technology bus replaces the front-side bus architecture
• HyperTransport™ similarities to PCIe:
  – Packet-based, switching network
  – Dedicated links for both directions
• Shown in a 4-socket configuration, 8 GB/sec per link

Architectural Considerations – Slide 17

Today’s AMD PC Architecture

Page 18: CUDA Lecture 10 Architectural Considerations

• Northbridge/HyperTransport™ is on die
• Glueless logic to DDR, DDR2 memory
• PCI-X/PCIe bridges (usually implemented in the southbridge)

Architectural Considerations – Slide 18

Today’s AMD PC Architecture (cont.)

Page 19: CUDA Lecture 10 Architectural Considerations

• “Torrenza” technology
  – Allows licensing of coherent HyperTransport™ to third-party manufacturers to make socket-compatible accelerators/co-processors

Architectural Considerations – Slide 19

Today’s AMD PC Architecture (cont.)

Page 20: CUDA Lecture 10 Architectural Considerations

• “Torrenza” technology (cont.)
  – Allows third-party PPUs (Physics Processing Units), GPUs, and co-processors to access main system memory directly and coherently

Architectural Considerations – Slide 20

Today’s AMD PC Architecture (cont.)

Page 21: CUDA Lecture 10 Architectural Considerations

• “Torrenza” technology (cont.)
  – Could make the accelerator programming model easier to use than, say, the Cell processor, where each SPE cannot directly access main memory

Architectural Considerations – Slide 21

Today’s AMD PC Architecture (cont.)

Page 22: CUDA Lecture 10 Architectural Considerations

Primarily a low-latency direct chip-to-chip interconnect; supports mapping to a board-to-board interconnect such as PCIe

Architectural Considerations – Slide 22

HyperTransport™ Feeds and Speeds

Courtesy HyperTransport™ Consortium. Source: “White Paper: AMD HyperTransport Technology-Based System Architecture”

Page 23: CUDA Lecture 10 Architectural Considerations

• HyperTransport™ 1.0 Specification
  – 800 MHz max, 12.8 GB/s aggregate bandwidth (6.4 GB/s each way)

Architectural Considerations – Slide 23

HyperTransport™ Feeds and Speeds (cont.)

Courtesy HyperTransport™ Consortium. Source: “White Paper: AMD HyperTransport Technology-Based System Architecture”

Page 24: CUDA Lecture 10 Architectural Considerations

• HyperTransport™ 2.0 Specification
  – Added PCIe mapping
  – 1.0–1.4 GHz clock, 22.4 GB/s aggregate bandwidth (11.2 GB/s each way)

Architectural Considerations – Slide 24

HyperTransport™ Feeds and Speeds (cont.)

Courtesy HyperTransport™ Consortium. Source: “White Paper: AMD HyperTransport Technology-Based System Architecture”

Page 25: CUDA Lecture 10 Architectural Considerations

• HyperTransport™ 3.0 Specification
  – 1.8–2.6 GHz clock, 41.6 GB/s aggregate bandwidth (20.8 GB/s each way)
  – Added AC coupling to extend HyperTransport™ over long distances for system-to-system interconnect

Architectural Considerations – Slide 25

HyperTransport™ Feeds and Speeds (cont.)

Courtesy HyperTransport™ Consortium. Source: “White Paper: AMD HyperTransport Technology-Based System Architecture”

Page 26: CUDA Lecture 10 Architectural Considerations

Architectural Considerations – Slide 26

GeForce 7800 GTX Board Details

• 256 MB / 256-bit DDR3, 600 MHz
  – 8 pieces of 8M×32
• 16x PCI Express
• SLI connector
• DVI × 2
• sVideo TV out
• Single-slot cooling

Page 27: CUDA Lecture 10 Architectural Considerations

• Single-Program Multiple-Data (SPMD)
• A CUDA integrated CPU + GPU application C program
  – Serial C code executes on the CPU
  – Parallel kernel C code executes on GPU thread blocks

Architectural Considerations – Slide 27

Topic 2: Threading in G80

Page 28: CUDA Lecture 10 Architectural Considerations

Architectural Considerations – Slide 28

SPMD (cont.)

[Diagram: alternating host and device phases]
  CPU serial code
  GPU parallel kernel – KernelA<<< nBlk, nTid >>>(args);  (Grid 0)
  CPU serial code
  GPU parallel kernel – KernelB<<< nBlk, nTid >>>(args);  (Grid 1)

Page 29: CUDA Lecture 10 Architectural Considerations

• A kernel is executed as a grid of thread blocks
  – All threads share the global memory space

Grids and Blocks

[Figure 3.2: An example of CUDA thread organization – the host launches Kernel 1 (Grid 1) and Kernel 2 (Grid 2); Grid 1 contains Blocks (0,0)–(1,1); Block (1,1) contains Threads (0,0,0)–(3,1,0). Courtesy: NVIDIA]

Page 30: CUDA Lecture 10 Architectural Considerations

• A thread block is a batch of threads that can cooperate with each other by:
  – Synchronizing their execution using a barrier
  – Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate

Architectural Considerations – Slide 30

Grids and Blocks (cont.)

[Figure 3.2 repeated from the previous slide. Courtesy: NVIDIA]
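The barrier-plus-shared-memory cooperation described above looks roughly like the following sketch; the kernel name and the block-wide reversal task are illustrative, not taken from the lecture.

    // Sketch: threads of one block cooperate through shared memory and a barrier.
    // Assumes blockDim.x == 256 and an array length that is a multiple of 256.
    __global__ void reverseInBlock(const float *in, float *out) {
        __shared__ float tile[256];                     // low-latency per-block shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];                      // each thread loads one element
        __syncthreads();                                // barrier: tile fully populated
        out[i] = tile[blockDim.x - 1 - threadIdx.x];    // read an element another thread wrote
    }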

Page 31: CUDA Lecture 10 Architectural Considerations

• The programmer declares a (thread) block:
  – Block size: 1 to 512 concurrent threads
  – Block shape: 1D, 2D, or 3D
  – Block dimensions in threads

Architectural Considerations – Slide 31

CUDA Thread Block: Review

[Figure: a CUDA thread block with thread IDs 0, 1, 2, 3, …, m, each running the thread program. Courtesy: John Nickolls, NVIDIA]

Page 32: CUDA Lecture 10 Architectural Considerations

All threads in a block execute the same thread program

Threads share data and synchronize while doing their share of the work

Threads have thread id numbers within block

Thread program uses thread id to select work and address shared data

Architectural Considerations – Slide 32

CUDA Thread Block: Review (cont.)

[Figure repeated: CUDA thread block with thread IDs 0 … m. Courtesy: John Nickolls, NVIDIA]

Page 33: CUDA Lecture 10 Architectural Considerations

Architectural Considerations – Slide 33

GeForce-8 Series Hardware Overview

[Diagram: Streaming Processor Array built from Texture Processor Clusters (TPCs); each TPC contains a TEX unit and Streaming Multiprocessors (SMs); each SM holds 8 SPs, 2 SFUs, instruction fetch/dispatch, instruction L1, data L1, and shared memory]

Page 34: CUDA Lecture 10 Architectural Considerations

• SPA: Streaming Processor Array (variable across the GeForce 8 series; 8 in the GeForce 8800)
• TPC: Texture Processor Cluster (2 SMs + TEX)
• SM: Streaming Multiprocessor (8 SPs)
  – Multithreaded processor core
  – Fundamental processing unit for a CUDA thread block
• SP: Streaming Processor
  – Scalar ALU for a single CUDA thread

Architectural Considerations – Slide 34

CUDA Processor Terminology

Page 35: CUDA Lecture 10 Architectural Considerations

• Streaming Multiprocessor (SM)
  – 8 Streaming Processors (SPs)
  – 2 Super Function Units (SFUs)
• Multithreaded instruction dispatch
  – 1 to 512 threads active
  – Shared instruction fetch per 32 threads
  – Covers latency of texture/memory loads

Architectural Considerations – Slide 35

Streaming Multiprocessor

[Diagram: SM with 8 SPs, 2 SFUs, instruction fetch/dispatch, instruction L1, data L1, and shared memory]

Page 36: CUDA Lecture 10 Architectural Considerations

• 20+ GFLOPS
• 16 KB shared memory
• Texture and global memory access

Architectural Considerations – Slide 36

Streaming Multiprocessor (cont.)

[Diagram repeated from the previous slide]

Page 37: CUDA Lecture 10 Architectural Considerations

• The future of GPUs is programmable processing
• So: build the architecture around the processor

Architectural Considerations – Slide 37

G80 Thread Computing Pipeline

[Diagram: G80 graphics-mode pipeline – Host, Input Assembler, Vtx/Geom/Pixel thread issue, Setup/Rstr/ZCull, a thread processor array of SP pairs with L1 and TF units, and L2 caches with frame buffer (FB) partitions]

Page 38: CUDA Lecture 10 Architectural Considerations

• Processors execute computing threads
• Alternative operating mode specifically for computing

Architectural Considerations – Slide 38

G80 Thread Computing Pipeline (cont.)

[Diagram: G80 compute-mode pipeline – Host, Input Assembler, Thread Execution Manager (generates thread grids based on kernel calls), SP arrays with parallel data caches and texture units, and load/store paths to global memory]

Page 39: CUDA Lecture 10 Architectural Considerations

• A grid is launched on the streaming processor array (SPA)
• Thread blocks are serially distributed to all the streaming multiprocessors (SMs)
  – Potentially more than one thread block per SM
• Each SM launches warps of threads
  – Two levels of parallelism

Architectural Considerations – Slide 39

Thread Life Cycle in Hardware

[Figure: the host launches Kernel 1 (Grid 1, Blocks (0,0)–(2,1)) and Kernel 2 (Grid 2); Block (1,1) expands into Threads (0,0)–(4,2)]

Page 40: CUDA Lecture 10 Architectural Considerations

• The SM schedules and executes warps that are ready to run
• As warps and thread blocks complete, resources are freed
  – The SPA can then distribute more thread blocks

Architectural Considerations – Slide 40

Thread Life Cycle in Hardware (cont.)

[Figure repeated from the previous slide]

Page 41: CUDA Lecture 10 Architectural Considerations

• Threads are assigned to SMs at block granularity
  – Up to 8 blocks to each SM, as resources allow
  – An SM in G80 can take up to 768 threads
    • Could be 256 (threads/block) × 3 blocks
    • Or 128 (threads/block) × 6 blocks, etc.

Architectural Considerations – Slide 41

Streaming Multiprocessor Executes Blocks

[Diagram: two SMs (SM 0, SM 1), each with an MT issue unit, SPs, and shared memory, executing thread blocks with threads t0 t1 t2 … tm; texture L1, TF, L2, and device memory below]

Page 42: CUDA Lecture 10 Architectural Considerations

• Threads run concurrently
  – The SM assigns/maintains thread id numbers
  – The SM manages/schedules thread execution

Architectural Considerations – Slide 42

Streaming Multiprocessor Executes Blocks (cont.)

[Diagram repeated from the previous slide]

Page 43: CUDA Lecture 10 Architectural Considerations

• Each thread block is divided into 32-thread warps
  – This is an implementation decision, not part of the CUDA programming model
• Warps are the scheduling units in an SM

Architectural Considerations – Slide 43

Thread Scheduling/Execution

[Diagram: Block 1 warps and Block 2 warps (…t0 t1 t2 … t31…) feeding an SM with 8 SPs, 2 SFUs, instruction fetch/dispatch, L1 caches, and shared memory]

Page 44: CUDA Lecture 10 Architectural Considerations

• If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  – Each block is divided into 256/32 = 8 warps
  – There are 8 × 3 = 24 warps
  – At any point in time, only one of the 24 warps will be selected for instruction fetch and execution

Architectural Considerations – Slide 44

Thread Scheduling/Execution (cont.)

[Diagram repeated from the previous slide]

Page 45: CUDA Lecture 10 Architectural Considerations

• SM hardware implements zero-overhead warp scheduling
  – Warps whose next instruction has its operands ready for consumption are eligible for execution
  – Eligible warps are selected for execution based on a prioritized scheduling policy
  – All threads in a warp execute the same instruction when selected

Architectural Considerations – Slide 45

Streaming Multiprocessor Warp Scheduling

[Diagram: the SM multithreaded warp scheduler issuing, over time, warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96]

Page 46: CUDA Lecture 10 Architectural Considerations

• Four clock cycles are needed to dispatch the same instruction for all threads in a warp in G80
• If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate a 200-cycle memory latency
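One way to read the 13-warp figure (just the slide's own numbers spelled out):

\[
\left\lceil \frac{200\ \text{cycles}}{4\ \text{instructions} \times 4\ \text{cycles/instruction}} \right\rceil = \lceil 12.5 \rceil = 13\ \text{warps}.
\]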

Architectural Considerations – Slide 46

Streaming Multiprocessor Warp Scheduling (cont.)

[Diagram repeated from the previous slide]

Page 47: CUDA Lecture 10 Architectural Considerations

• Fetch one warp instruction per cycle
  – From the instruction L1 cache into any instruction buffer slot
• Issue one “ready-to-go” warp instruction per cycle
  – From any warp/instruction buffer slot
  – Operand scoreboarding is used to prevent hazards
  – Issue selection based on round-robin/age of warp
• The SM broadcasts the same instruction to the 32 threads of a warp

Architectural Considerations – Slide 47

SM Instruction Buffer: Warp Scheduling

[Diagram: SM datapath – instruction L1 cache (I$), multithreaded instruction buffer, register file (RF), constant L1 cache (C$), shared memory, operand select, and MAD/SFU units]

Page 48: CUDA Lecture 10 Architectural Considerations

• All register operands of all instructions in the instruction buffer are scoreboarded
  – An instruction becomes ready after the needed values are deposited
  – Prevents hazards
  – Cleared instructions are eligible for issue

Architectural Considerations – Slide 48

Scoreboarding

[Diagram: warps from thread blocks TB1–TB3 (TB = thread block, W = warp) interleaved over time; when TB1 W1 stalls, TB2 W1 issues, then TB3 W1, and so on as further stalls occur]

Page 49: CUDA Lecture 10 Architectural Considerations

• Decoupled memory/processor pipelines
  – Any thread can continue to issue instructions until scoreboarding prevents issue
  – Allows memory/processor ops to proceed in the shadow of other waiting memory/processor ops

Architectural Considerations – Slide 49

Scoreboarding (cont.)

[Diagram repeated from the previous slide]

Page 50: CUDA Lecture 10 Architectural Considerations

• For matrix multiplication, should I use 4×4, 8×8, 16×16, or 32×32 tiles?
  – For 4×4, we have 16 threads per block
  – Since each SM can take up to 768 threads, the thread capacity allows 48 blocks
  – However, each SM can only take up to 8 blocks, so there will be only 128 threads in each SM!
  – There are 8 warps, but each warp is only half full

Architectural Considerations – Slide 50

Granularity Considerations

Page 51: CUDA Lecture 10 Architectural Considerations

• For matrix multiplication, should I use 4×4, 8×8, 16×16, or 32×32 tiles? (cont.)
  – For 8×8, we have 64 threads per block
  – Since each SM can take up to 768 threads, it could take up to 12 blocks
  – However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM!
  – There are 16 warps available for scheduling in each SM
  – Each warp spans four slices in the y dimension

Architectural Considerations – Slide 51

Granularity Considerations (cont.)

Page 52: CUDA Lecture 10 Architectural Considerations

• For matrix multiplication, should I use 4×4, 8×8, 16×16, or 32×32 tiles? (cont.)
  – For 16×16, we have 256 threads per block
  – Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule
  – There are 24 warps available for scheduling in each SM
  – Each warp spans two slices in the y dimension
• For 32×32, we have 1024 threads per block
  – Not even one block can fit into an SM!

Architectural Considerations – Slide 52

Granularity Considerations (cont.)
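For reference, a sketch of how the 16×16 choice shows up in a launch configuration; the kernel name, Md/Nd/Pd, and Width are illustrative placeholders, not the lecture's code.

    // Illustrative launch configuration for 16x16 tiles.
    #define TILE_WIDTH 16
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                  // 256 threads/block -> 8 warps
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);   // assumes Width is a multiple of 16
    matrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
    // On G80: 768 / 256 = 3 blocks per SM, giving 3 * 8 = 24 warps available for scheduling.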

Page 53: CUDA Lecture 10 Architectural Considerations

Review: CUDA Device Memory Space
• Each thread can:
  – R/W per-thread registers and local memory
  – R/W per-block shared memory
  – R/W per-grid global memory
  – Read-only per-grid constant and texture memories
• The host can R/W global, constant, and texture memories

Architectural Considerations – Slide 53

Topic 3: Memory Hardware in G80

[Diagram: device memory spaces – the host reads/writes global, constant, and texture memory; each block has shared memory; each thread has registers and local memory]

Page 54: CUDA Lecture 10 Architectural Considerations

• Uses:
  – Inter-thread communication within a block
  – Cache data to reduce global memory accesses
  – Use it to avoid non-coalesced access
• Organization:
  – 16 banks, 32-bit wide (Tesla); 32 banks, 32-bit wide (Fermi)
  – Successive 32-bit words belong to different banks

Architectural Considerations – Slide 54

Overview: Shared Memory

Page 55: CUDA Lecture 10 Architectural Considerations

• Performance:
  – 32 bits per bank per 2 clocks per multiprocessor
  – Shared memory accesses are per 16 threads (half-warp)
  – Serialization: if n threads (out of 16) access the same bank, the n accesses are executed serially
  – Broadcast: n threads accessing the same word are served in one fetch

one fetch

Architectural Considerations – Slide 55

Overview: Shared Memory (cont.)
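As a small illustration of the banking rule (not from the slides), padding a shared-memory tile by one column keeps the threads of a half-warp that read down a column in different banks:

    // Illustrative fragment: a 16x16 tile padded to 16x17 so that column accesses
    // map to different 32-bit banks (assumes a 16x16 thread block and a 16x16 input).
    __global__ void transposeTileSketch(const float *in, float *out)
    {
        __shared__ float tile[16][17];        // 17, not 16: the extra column is padding
        tile[threadIdx.y][threadIdx.x] = in[threadIdx.y * 16 + threadIdx.x];
        __syncthreads();
        // Without the padding, the 16 threads of a half-warp reading tile[threadIdx.x][...]
        // would all hit the same bank and serialize.
        out[threadIdx.y * 16 + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
    }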

Page 56: CUDA Lecture 10 Architectural Considerations

• Local memory: per-thread
  – Private per thread
  – Auto variables, register spill
• Shared memory: per-block
  – Shared by threads of the same block
  – Inter-thread communication

Architectural Considerations – Slide 56

Parallel Memory Sharing

[Diagram: a thread with its local memory; a block with its shared memory]

Page 57: CUDA Lecture 10 Architectural Considerations

• Global memory: per-application
  – Shared by all threads
  – Inter-grid communication

Architectural Considerations – Slide 57

Parallel Memory Sharing (cont.)

[Diagram: sequential grids in time (Grid 0, Grid 1, …) all sharing global memory]

Page 58: CUDA Lecture 10 Architectural Considerations

• Threads in a block share data and results
  – In memory and shared memory
  – Synchronize at barrier instruction

Architectural Considerations – Slide 58

Streaming Multiprocessor Memory Architecture

[Diagram repeated: two SMs with MT issue units, SPs, and shared memory]

Page 59: CUDA Lecture 10 Architectural Considerations

• Per-block shared memory allocation
  – Keeps data close to the processor
  – Minimizes trips to global memory
  – Shared memory is dynamically allocated to blocks; it is one of the limiting resources

Architectural Considerations – Slide 59

Streaming Multiprocessor Memory Architecture (cont.)

[Diagram repeated from the previous slide]

Page 60: CUDA Lecture 10 Architectural Considerations

• Register file (RF): 32 KB (8K entries) for each SM in G80
• The TEX pipe can also read/write the RF
  – 2 SMs share 1 TEX
• The load/store pipe can also read/write the RF

Architectural Considerations – Slide 60

Streaming Multiprocessor Register File

[Diagram repeated: SM datapath with instruction buffer, RF, constant cache, shared memory, operand select, and MAD/SFU units]

Page 61: CUDA Lecture 10 Architectural Considerations

• There are 8192 registers in each SM in G80
  – This is an implementation decision, not part of CUDA
  – Registers are dynamically partitioned across all blocks assigned to the SM
  – Once assigned to a block, a register is NOT accessible by threads in other blocks
  – Each thread in the same block only accesses registers assigned to itself

Architectural Considerations – Slide 61

Programmer View of Register File

[Diagram: the register file partitioned among 4 blocks vs. 3 blocks]

Page 62: CUDA Lecture 10 Architectural Considerations

• If each block has 16 × 16 threads and each thread uses 10 registers, how many threads can run on each SM?
  – Each block requires 10 × 256 = 2560 registers
  – 8192 = 3 × 2560 + change
  – So three blocks can run on an SM as far as registers are concerned
• What if each thread increases its register use by 1?
  – Each block now requires 11 × 256 = 2816 registers
  – 8192 < 2816 × 3
  – Only two blocks can run on an SM: a ⅓ reduction in thread-level parallelism!

Architectural Considerations – Slide 62

Matrix Multiplication Example
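If register pressure is the limiter, one way (not covered in the lecture) to keep the compiler within a per-thread budget is a launch-bounds hint or the --maxrregcount compiler flag; the kernel name below is illustrative.

    // Illustrative: ask the compiler to target 256 threads/block and at least
    // 3 resident blocks per SM, which bounds the registers used per thread.
    __global__ void __launch_bounds__(256, 3)
    matrixMulKernel(float *Md, float *Nd, float *Pd, int Width)
    {
        /* kernel body omitted */
    }
    // Alternative: compile with  nvcc --maxrregcount=10 ...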

Page 63: CUDA Lecture 10 Architectural Considerations

• Dynamic partitioning gives more flexibility to compilers/programmers
  – One can run a smaller number of threads that require many registers each, or a larger number of threads that require few registers each
  – This allows for finer-grained threading than traditional CPU threading models
  – The compiler can trade off between instruction-level parallelism (ILP) and thread-level parallelism (TLP)

Architectural Considerations – Slide 63

More on Dynamic Partitioning

Page 64: CUDA Lecture 10 Architectural Considerations

• Assume a kernel has 256-thread blocks, 4 independent instructions for each global memory load in the thread program, 10 registers per thread, and 200-cycle global loads; then 3 blocks can run on each SM
• If the compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load, only two blocks can run on each SM

Architectural Considerations – Slide 64

ILP vs. TLP Example

Page 65: CUDA Lecture 10 Architectural Considerations

• However, only 200/(8×4) ≈ 6.25, i.e., 7 warps are needed to tolerate the memory latency
• Two blocks have 16 warps, so performance can actually be higher!

Architectural Considerations – Slide 65

ILP vs. TLP Example (cont.)

Page 66: CUDA Lecture 10 Architectural Considerations

• Increase in per-thread performance, but fewer threads
• Lower overall performance

Architectural Considerations – Slide 66

Resource Allocation Example

[Diagram: (a) pre-“optimization” – thread contexts for TB0–TB2 fit within the 32 KB register file and 16 KB shared memory, keeping SP0–SP7 and SFU0–SFU1 busy; the SP-utilization area determines overall performance. (b) post-“optimization” – insufficient registers to allocate 3 blocks, so fewer thread contexts and lower utilization]

Page 67: CUDA Lecture 10 Architectural Considerations

Architectural Considerations – Slide 67

Without prefetching:

    Loop {
        Load current tile to shared memory
        __syncthreads()
        Compute current tile
        __syncthreads()
    }

With prefetching:

    Load first tile from global memory (into registers)
    Loop {
        Deposit current tile to shared memory
        __syncthreads()
        Load next tile from global memory
        Compute current tile
        __syncthreads()
    }

• One could double buffer the computation, getting a better instruction mix within each thread
• This is classic software pipelining in ILP compilers

Prefetching

Page 68: CUDA Lecture 10 Architectural Considerations

• Deposit blue tile from registers into shared memory
• __syncthreads()
• Load orange tile into registers
• Compute blue tile
• Deposit orange tile into shared memory
• …

Architectural Considerations – Slide 68

Prefetch

[Figure: tiled matrix multiplication – matrices Md and Nd of size WIDTH × WIDTH, tiles of TILE_WIDTH × TILE_WIDTH, and the output tile Pdsub of Pd indexed by block (bx, by) and thread (tx, ty)]
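Putting the blue/orange steps above into code, a sketch of the prefetched tiled kernel might look like the following; TILE_WIDTH is fixed at 16, the index math assumes square matrices whose width is a multiple of 16, and this is not the lecture's exact code.

    // Sketch of register-to-shared double buffering (prefetching) for tiled matrix multiply.
    __global__ void matrixMulPrefetch(const float *Md, const float *Nd, float *Pd, int width)
    {
        __shared__ float Ms[16][16];
        __shared__ float Ns[16][16];
        int row = blockIdx.y * 16 + threadIdx.y;
        int col = blockIdx.x * 16 + threadIdx.x;

        // Load the first ("blue") tile from global memory into registers.
        float mReg = Md[row * width + threadIdx.x];
        float nReg = Nd[threadIdx.y * width + col];

        float pValue = 0.0f;
        for (int m = 0; m < width / 16; ++m) {
            // Deposit the current tile from registers into shared memory.
            Ms[threadIdx.y][threadIdx.x] = mReg;
            Ns[threadIdx.y][threadIdx.x] = nReg;
            __syncthreads();

            // Load the next ("orange") tile into registers while computing the current one.
            if (m + 1 < width / 16) {
                mReg = Md[row * width + (m + 1) * 16 + threadIdx.x];
                nReg = Nd[((m + 1) * 16 + threadIdx.y) * width + col];
            }
            for (int k = 0; k < 16; ++k)
                pValue += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
            __syncthreads();
        }
        Pd[row * width + col] = pValue;
    }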

Page 69: CUDA Lecture 10 Architectural Considerations

There are very few multiplications or additions between branches and address calculations.

Loop unrolling can help.

Architectural Considerations – Slide 69

for (int k = 0; k < BLOCK_SIZE; ++k)
    Pvalue += Ms[ty][k] * Ns[k][tx];

Instruction Mix Considerations

Pvalue += Ms[ty][k] * Ns[k][tx] + … + Ms[ty][k+15] * Ns[k+15][tx];

Page 70: CUDA Lecture 10 Architectural Considerations

Architectural Considerations – Slide 70

Tiled version:

    Ctemp = 0;
    for (…) {
        __shared__ float As[16][16];
        __shared__ float Bs[16][16];

        // load input tile elements
        As[ty][tx] = A[indexA];
        Bs[ty][tx] = B[indexB];
        indexA += 16;
        indexB += 16 * widthB;
        __syncthreads();

        // compute results for tile
        for (i = 0; i < 16; i++) {
            Ctemp += As[ty][i] * Bs[i][tx];
        }
        __syncthreads();
    }
    C[indexC] = Ctemp;

Unrolled version:

    Ctemp = 0;
    for (…) {
        __shared__ float As[16][16];
        __shared__ float Bs[16][16];

        // load input tile elements
        As[ty][tx] = A[indexA];
        Bs[ty][tx] = B[indexB];
        indexA += 16;
        indexB += 16 * widthB;
        __syncthreads();

        // compute results for tile
        Ctemp += As[ty][0] * Bs[0][tx];
        …
        Ctemp += As[ty][15] * Bs[15][tx];
        __syncthreads();
    }
    C[indexC] = Ctemp;

Removal of branch instructions and address calculations

Unrolling

Does this use more registers?

Page 71: CUDA Lecture 10 Architectural Considerations

• Long-latency operations
  – Avoid stalls by executing other threads
• Stalls and bubbles in the pipeline
  – Barrier synchronization
  – Branch divergence
• Shared resource saturation
  – Global memory bandwidth
  – Local memory capacity

Architectural Considerations – Slide 71

Major G80 Performance Detractors

Page 72: CUDA Lecture 10 Architectural Considerations

• Based on original material from:
  – Jon Stokes, “PCI Express: An Overview,” http://arstechnica.com/articles/paedia/hardware/pcie.ars
  – David Kirk and Wen-mei W. Hwu, The University of Illinois at Urbana-Champaign
• Revision history: last updated 10/11/2011. Previous revisions: 9/13/2011.

Architectural Considerations – Slide 72

End Credits