CUDA Lecture 10: Architectural Considerations

Prepared 10/11/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.


Page 1: CUDA Lecture 10 Architectural Considerations

Prepared 10/11/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

CUDA Lecture 10: Architectural Considerations

Page 2: CUDA Lecture 10 Architectural Considerations

• To understand the major factors that dictate performance when using a GPU as a compute accelerator for the CPU
  – The feeds and speeds of the traditional CPU world
  – The feeds and speeds when employing a GPU
• To form a solid knowledge base for performance programming in modern GPUs
  – Knowing yesterday, today, and tomorrow
  – The PC world is becoming flatter
  – Outsourcing of computation is becoming easier…

Architectural Considerations – Slide 2

Objective

Page 3: CUDA Lecture 10 Architectural Considerations

• Topic 1 (next): The GPU as Part of the PC Architecture
• Topic 2: Threading Hardware in the G80
• Topic 3: Memory Hardware in the G80

Architectural Considerations – Slide 3

Outline

Page 4: CUDA Lecture 10 Architectural Considerations

• Global variables declaration
• Function prototypes
  – __global__ void kernelOne(…)
• Main()
  – allocate memory space on the device – cudaMalloc(&d_GlblVarPtr, bytes)
  – transfer data from host to device – cudaMemcpy(d_GlblVarPtr, h_Gl…)
  – execution configuration setup
  – kernel call – kernelOne<<<execution configuration>>>(args…);
  – transfer results from device to host – cudaMemcpy(h_GlblVarPtr, …)
  – (repeat the transfer/kernel-call steps as needed)
  – optional: compare against golden (host computed) solution
• Kernel – void kernelOne(type args, …)
  – variables declaration – __local__, __shared__
  – automatic variables transparently assigned to registers or local memory
  – __syncthreads()…

Architectural Considerations – Slide 4

Recall: Typical Structure of a CUDA Program

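For concreteness, here is a minimal sketch of that structure as a complete program; the kernel body, array size, and names (h_data, d_data, kernelOne) are placeholders, not code from the lecture.

    // Minimal sketch of the typical CUDA program structure (illustrative names).
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void kernelOne(float *d_data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n) d_data[i] *= 2.0f;                    // placeholder computation
    }

    int main(void) {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *h_data = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) h_data[i] = (float)i;

        float *d_data;
        cudaMalloc(&d_data, bytes);                                 // allocate on device
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // host -> device

        dim3 block(256);                                            // execution configuration
        dim3 grid((n + block.x - 1) / block.x);
        kernelOne<<<grid, block>>>(d_data, n);                      // kernel call

        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // device -> host
        printf("h_data[1] = %f\n", h_data[1]);                      // optional check vs. golden result

        cudaFree(d_data);
        free(h_data);
        return 0;
    }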

Page 5: CUDA Lecture 10 Architectural Considerations

• The bandwidth between key components ultimately dictates system performance
  – Especially true for massively parallel systems processing massive amounts of data
  – Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases
  – Ultimately, performance falls back to what the “speeds and feeds” dictate

Architectural Considerations – Slide 5

Bandwidth: Gravity of Modern Computer Systems

Page 6: CUDA Lecture 10 Architectural Considerations

• The northbridge connects three components that must communicate at high speed: CPU, DRAM, and video
  – Video also needs first-class access to DRAM
  – Previous NVIDIA cards were connected to AGP, up to 2 GB/sec transfers
• The southbridge serves as a concentrator for slower I/O devices

Architectural Considerations – Slide 6

Classic PC Architecture

[Diagram: CPU and core logic chipset]

Page 7: CUDA Lecture 10 Architectural Considerations

• Connected to the southbridge
• Originally 33 MHz, 32-bit wide, 132 MB/sec peak transfer rate; more recently 66 MHz, 64-bit, 512 MB/sec peak
• Upstream bandwidth remains slow for devices (256 MB/sec peak)
• Shared bus with arbitration
  – The winner of arbitration becomes bus master and can connect to the CPU or DRAM through the southbridge and northbridge

Architectural Considerations – Slide 7

(Original) PCI Bus Specification

Page 8: CUDA Lecture 10 Architectural Considerations

• PCI device registers are mapped into the CPU’s physical address space
  – Accessed through loads/stores (kernel mode)
• Addresses are assigned to the PCI devices at boot time
  – All devices listen for their addresses

Architectural Considerations – Slide 8

PCI as Memory Mapped I/O

Page 9: CUDA Lecture 10 Architectural Considerations

• Switched, point-to-point connection
  – Each card has a dedicated “link” to the central switch; no bus arbitration
• Packet-switched messages form virtual channels
• Prioritized packets for quality of service, e.g., real-time video streaming

Architectural Considerations – Slide 9

PCI Express (PCIe)

Page 10: CUDA Lecture 10 Architectural Considerations

• Each link consists of one or more lanes
  – Each lane is 1 bit wide (4 wires; each 2-wire pair can transmit 2.5 Gb/sec in one direction)
  – Upstream and downstream are now simultaneous and symmetric
• Each link can combine 1, 2, 4, 8, 12, or 16 lanes – x1, x2, etc.

Architectural Considerations – Slide 10

PCIe Links and Lanes

Page 11: CUDA Lecture 10 Architectural Considerations

• Each link consists of one or more lanes
  – Each data byte is 8b/10b encoded into 10 bits with an equal number of 1’s and 0’s; net data rate 2 Gb/sec per lane each way
• Thus the net data rates are 250 MB/sec (x1), 500 MB/sec (x2), 1 GB/sec (x4), 2 GB/sec (x8), 4 GB/sec (x16), each way
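Spelling out the arithmetic behind these figures (all rates are from the slide; only the calculation is added):

\[
2.5\ \text{Gb/s} \times \tfrac{8}{10} = 2\ \text{Gb/s} = 250\ \text{MB/s per lane each way}, \qquad 16 \times 250\ \text{MB/s} = 4\ \text{GB/s (x16, each way)}.
\]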

Architectural Considerations – Slide 11

PCIe Links and Lanes (cont.)

Page 12: CUDA Lecture 10 Architectural Considerations

• PCIe forms the interconnect backbone
  – Northbridge and southbridge are both PCIe switches
• Some southbridge designs have a built-in PCI-PCIe bridge to allow old PCI cards
• Some PCIe cards are PCI cards with a PCI-PCIe bridge

Architectural Considerations – Slide 12

PCIe PC Architecture

Page 13: CUDA Lecture 10 Architectural Considerations

• FSB connection between the processor and the northbridge (82925X) memory control hub
• The northbridge handles “primary” PCIe to video/GPU and DRAM
  – PCIe x16 bandwidth at 8 GB/sec (4 GB/sec each direction)
• The southbridge (ICH6RW) handles other peripherals

Architectural Considerations – Slide 13

Today’s Intel PC Architecture: Single Core System

Page 14: CUDA Lecture 10 Architectural Considerations

• Bensley platform
  – The Blackford Memory Control Hub (MCH) is now a PCIe switch that integrates the northbridge/southbridge functions
  – FBD (Fully Buffered DIMMs) allow simultaneous R/W transfers at 10.5 GB/sec per DIMM
  – PCIe links form the backbone

Architectural Considerations – Slide 14

Today’s Intel PC Architecture: Dual Core System

Source: http://www.2cpu.com/review.php?id=109

Page 15: CUDA Lecture 10 Architectural Considerations

• Bensley platform (cont.)
  – PCIe device upstream bandwidth is now equal to downstream
  – The workstation version has an x16 GPU link via the Greencreek MCH

Architectural Considerations – Slide 15

Today’s Intel PC Architecture: Dual Core System (cont.)

Source: http://www.2cpu.com/review.php?id=109

Page 16: CUDA Lecture 10 Architectural Considerations

• Two CPU sockets
  – Dual Independent Buses to the CPUs; each is basically an FSB
  – CPU feeds at 8.5–10.5 GB/sec per socket
  – Compared to the current front-side bus CPU feed of 6.4 GB/sec
• PCIe bridges to legacy I/O devices

Architectural Considerations – Slide 16

Today’s Intel PC Architecture: Dual Core System (cont.)

Source: http://www.2cpu.com/review.php?id=109

Page 17: CUDA Lecture 10 Architectural Considerations

• The AMD HyperTransport™ Technology bus replaces the front-side bus architecture
• HyperTransport™ similarities to PCIe:
  – Packet-based, switching network
  – Dedicated links for both directions
• Shown in a 4-socket configuration, 8 GB/sec per link

Architectural Considerations – Slide 17

Today’s AMD PC Architecture

Page 18: CUDA Lecture 10 Architectural Considerations

• Northbridge/HyperTransport™ is on die
• Glueless logic to DDR, DDR2 memory
• PCI-X/PCIe bridges (usually implemented in the southbridge)

Architectural Considerations – Slide 18

Today’s AMD PC Architecture (cont.)

Page 19: CUDA Lecture 10 Architectural Considerations

• “Torrenza” technology
  – Allows licensing of coherent HyperTransport™ to third-party manufacturers to make socket-compatible accelerators/co-processors

Architectural Considerations – Slide 19

Today’s AMD PC Architecture (cont.)

Page 20: CUDA Lecture 10 Architectural Considerations

• “Torrenza” technology (cont.)
  – Allows third-party PPUs (Physics Processing Units), GPUs, and co-processors to access main system memory directly and coherently

Architectural Considerations – Slide 20

Today’s AMD PC Architecture (cont.)

Page 21: CUDA Lecture 10 Architectural Considerations

• “Torrenza” technology (cont.)
  – Could make the accelerator programming model easier to use than, say, the Cell processor, where each SPE cannot directly access main memory

Architectural Considerations – Slide 21

Today’s AMD PC Architecture (cont.)

Page 22: CUDA Lecture 10 Architectural Considerations

Primarily a low-latency direct chip-to-chip interconnect; supports mapping to a board-to-board interconnect such as PCIe

Architectural Considerations – Slide 22

HyperTransport™ Feeds and Speeds

Courtesy HyperTransport™ Consortium. Source: “White Paper: AMD HyperTransport Technology-Based System Architecture”

Page 23: CUDA Lecture 10 Architectural Considerations

• HyperTransport™ 1.0 Specification
  – 800 MHz max, 12.8 GB/s aggregate bandwidth (6.4 GB/s each way)

Architectural Considerations – Slide 23

HyperTransport™ Feeds and Speeds (cont.)

Courtesy HyperTransport™ Consortium. Source: “White Paper: AMD HyperTransport Technology-Based System Architecture”

Page 24: CUDA Lecture 10 Architectural Considerations

• HyperTransport™ 2.0 Specification
  – Added PCIe mapping
  – 1.0–1.4 GHz clock, 22.4 GB/s aggregate bandwidth (11.2 GB/s each way)

Architectural Considerations – Slide 24

HyperTransport™ Feeds and Speeds (cont.)

Courtesy HyperTransport™ Consortium. Source: “White Paper: AMD HyperTransport Technology-Based System Architecture”

Page 25: CUDA Lecture 10 Architectural Considerations

• HyperTransport™ 3.0 Specification
  – 1.8–2.6 GHz clock, 41.6 GB/s aggregate bandwidth (20.8 GB/s each way)
  – Added AC coupling to extend HyperTransport™ over long distances for system-to-system interconnect

Architectural Considerations – Slide 25

HyperTransport™ Feeds and Speeds (cont.)

Courtesy HyperTransport™ Consortium. Source: “White Paper: AMD HyperTransport Technology-Based System Architecture”

Page 26: CUDA Lecture 10 Architectural Considerations

Architectural Considerations – Slide 26

GeForce 7800 GTX Board Details

• 256 MB / 256-bit DDR3, 600 MHz
  – 8 pieces of 8M×32
• 16x PCI Express
• SLI connector
• DVI × 2
• sVideo TV out
• Single-slot cooling

Page 27: CUDA Lecture 10 Architectural Considerations

• Single-Program Multiple-Data (SPMD)
• A CUDA integrated CPU + GPU application C program
  – Serial C code executes on the CPU
  – Parallel kernel C code executes on GPU thread blocks

Architectural Considerations – Slide 27

Topic 2: Threading in G80

Page 28: CUDA Lecture 10 Architectural Considerations

Architectural Considerations – Slide 28

SPMD (cont.)

[Diagram: alternating host and device phases]
  CPU serial code
  GPU parallel kernel – KernelA<<< nBlk, nTid >>>(args);  (Grid 0)
  CPU serial code
  GPU parallel kernel – KernelB<<< nBlk, nTid >>>(args);  (Grid 1)

Page 29: CUDA Lecture 10 Architectural Considerations

• A kernel is executed as a grid of thread blocks
  – All threads share the global memory space

Grids and Blocks

[Figure 3.2: An example of CUDA thread organization – the host launches Kernel 1 (Grid 1) and Kernel 2 (Grid 2); Grid 1 contains Blocks (0,0)–(1,1); Block (1,1) contains Threads (0,0,0)–(3,1,0). Courtesy: NVIDIA]

Page 30: CUDA Lecture 10 Architectural Considerations

• A thread block is a batch of threads that can cooperate with each other by:
  – Synchronizing their execution using a barrier
  – Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate

Architectural Considerations – Slide 30

Grids and Blocks (cont.)

[Figure 3.2 repeated from the previous slide. Courtesy: NVIDIA]
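The barrier-plus-shared-memory cooperation described above looks roughly like the following sketch; the kernel name and the block-wide reversal task are illustrative, not taken from the lecture.

    // Sketch: threads of one block cooperate through shared memory and a barrier.
    // Assumes blockDim.x == 256 and an array length that is a multiple of 256.
    __global__ void reverseInBlock(const float *in, float *out) {
        __shared__ float tile[256];                     // low-latency per-block shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];                      // each thread loads one element
        __syncthreads();                                // barrier: tile fully populated
        out[i] = tile[blockDim.x - 1 - threadIdx.x];    // read an element another thread wrote
    }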

Page 31: CUDA Lecture 10 Architectural Considerations

• The programmer declares a (thread) block:
  – Block size: 1 to 512 concurrent threads
  – Block shape: 1D, 2D, or 3D
  – Block dimensions in threads

Architectural Considerations – Slide 31

CUDA Thread Block: Review

[Figure: a CUDA thread block with thread IDs 0, 1, 2, 3, …, m, each running the thread program. Courtesy: John Nickolls, NVIDIA]

Page 32: CUDA Lecture 10 Architectural Considerations

All threads in a block execute the same thread program

Threads share data and synchronize while doing their share of the work

Threads have thread id numbers within block

Thread program uses thread id to select work and address shared data

Architectural Considerations – Slide 32

CUDA Thread Block: Review (cont.)

[Figure repeated: CUDA thread block with thread IDs 0 … m. Courtesy: John Nickolls, NVIDIA]

Page 33: CUDA Lecture 10 Architectural Considerations

Architectural Considerations – Slide 33

GeForce-8 Series Hardware Overview

[Diagram: Streaming Processor Array built from Texture Processor Clusters (TPCs); each TPC contains a TEX unit and Streaming Multiprocessors (SMs); each SM holds 8 SPs, 2 SFUs, instruction fetch/dispatch, instruction L1, data L1, and shared memory]

Page 34: CUDA Lecture 10 Architectural Considerations

• SPA: Streaming Processor Array (variable across the GeForce 8 series; 8 in the GeForce 8800)
• TPC: Texture Processor Cluster (2 SMs + TEX)
• SM: Streaming Multiprocessor (8 SPs)
  – Multithreaded processor core
  – Fundamental processing unit for a CUDA thread block
• SP: Streaming Processor
  – Scalar ALU for a single CUDA thread

Architectural Considerations – Slide 34

CUDA Processor Terminology

Page 35: CUDA Lecture 10 Architectural Considerations

• Streaming Multiprocessor (SM)
  – 8 Streaming Processors (SPs)
  – 2 Super Function Units (SFUs)
• Multithreaded instruction dispatch
  – 1 to 512 threads active
  – Shared instruction fetch per 32 threads
  – Covers latency of texture/memory loads

Architectural Considerations – Slide 35

Streaming Multiprocessor

[Diagram: SM with 8 SPs, 2 SFUs, instruction fetch/dispatch, instruction L1, data L1, and shared memory]

Page 36: CUDA Lecture 10 Architectural Considerations

• 20+ GFLOPS
• 16 KB shared memory
• Texture and global memory access

Architectural Considerations – Slide 36

Streaming Multiprocessor (cont.)

[Diagram repeated from the previous slide]

Page 37: CUDA Lecture 10 Architectural Considerations

• The future of GPUs is programmable processing
• So: build the architecture around the processor

Architectural Considerations – Slide 37

G80 Thread Computing Pipeline

[Diagram: G80 graphics-mode pipeline – Host, Input Assembler, Vtx/Geom/Pixel thread issue, Setup/Rstr/ZCull, a thread processor array of SP pairs with L1 and TF units, and L2 caches with frame buffer (FB) partitions]

Page 38: CUDA Lecture 10 Architectural Considerations

• Processors execute computing threads
• Alternative operating mode specifically for computing

Architectural Considerations – Slide 38

G80 Thread Computing Pipeline (cont.)

[Diagram: G80 compute-mode pipeline – Host, Input Assembler, Thread Execution Manager (generates thread grids based on kernel calls), SP arrays with parallel data caches and texture units, and load/store paths to global memory]

Page 39: CUDA Lecture 10 Architectural Considerations

• A grid is launched on the streaming processor array (SPA)
• Thread blocks are serially distributed to all the streaming multiprocessors (SMs)
  – Potentially more than one thread block per SM
• Each SM launches warps of threads
  – Two levels of parallelism

Architectural Considerations – Slide 39

Thread Life Cycle in Hardware

[Figure: the host launches Kernel 1 (Grid 1, Blocks (0,0)–(2,1)) and Kernel 2 (Grid 2); Block (1,1) expands into Threads (0,0)–(4,2)]

Page 40: CUDA Lecture 10 Architectural Considerations

• The SM schedules and executes warps that are ready to run
• As warps and thread blocks complete, resources are freed
  – The SPA can then distribute more thread blocks

Architectural Considerations – Slide 40

Thread Life Cycle in Hardware (cont.)

[Figure repeated from the previous slide]

Page 41: CUDA Lecture 10 Architectural Considerations

• Threads are assigned to SMs at block granularity
  – Up to 8 blocks to each SM, as resources allow
  – An SM in G80 can take up to 768 threads
    • Could be 256 (threads/block) × 3 blocks
    • Or 128 (threads/block) × 6 blocks, etc.

Architectural Considerations – Slide 41

Streaming Multiprocessor Executes Blocks

[Diagram: two SMs (SM 0, SM 1), each with an MT issue unit, SPs, and shared memory, executing thread blocks with threads t0 t1 t2 … tm; texture L1, TF, L2, and device memory below]

Page 42: CUDA Lecture 10 Architectural Considerations

• Threads run concurrently
  – The SM assigns/maintains thread id numbers
  – The SM manages/schedules thread execution

Architectural Considerations – Slide 42

Streaming Multiprocessor Executes Blocks (cont.)

[Diagram repeated from the previous slide]

Page 43: CUDA Lecture 10 Architectural Considerations

• Each thread block is divided into 32-thread warps
  – This is an implementation decision, not part of the CUDA programming model
• Warps are the scheduling units in an SM

Architectural Considerations – Slide 43

Thread Scheduling/Execution

[Diagram: Block 1 warps and Block 2 warps (…t0 t1 t2 … t31…) feeding an SM with 8 SPs, 2 SFUs, instruction fetch/dispatch, L1 caches, and shared memory]

Page 44: CUDA Lecture 10 Architectural Considerations

• If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  – Each block is divided into 256/32 = 8 warps
  – There are 8 × 3 = 24 warps
  – At any point in time, only one of the 24 warps will be selected for instruction fetch and execution

Architectural Considerations – Slide 44

Thread Scheduling/Execution (cont.)

[Diagram repeated from the previous slide]

Page 45: CUDA Lecture 10 Architectural Considerations

• SM hardware implements zero-overhead warp scheduling
  – Warps whose next instruction has its operands ready for consumption are eligible for execution
  – Eligible warps are selected for execution based on a prioritized scheduling policy
  – All threads in a warp execute the same instruction when selected

Architectural Considerations – Slide 45

Streaming Multiprocessor Warp Scheduling

[Diagram: the SM multithreaded warp scheduler issuing, over time, warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96]

Page 46: CUDA Lecture 10 Architectural Considerations

• Four clock cycles are needed to dispatch the same instruction for all threads in a warp in G80
• If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate a 200-cycle memory latency
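One way to read the 13-warp figure (just the slide's own numbers spelled out):

\[
\left\lceil \frac{200\ \text{cycles}}{4\ \text{instructions} \times 4\ \text{cycles/instruction}} \right\rceil = \lceil 12.5 \rceil = 13\ \text{warps}.
\]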

Architectural Considerations – Slide 46

Streaming Multiprocessor Warp Scheduling (cont.)

[Diagram repeated from the previous slide]

Page 47: CUDA Lecture 10 Architectural Considerations

• Fetch one warp instruction per cycle
  – From the instruction L1 cache into any instruction buffer slot
• Issue one “ready-to-go” warp instruction per cycle
  – From any warp/instruction buffer slot
  – Operand scoreboarding is used to prevent hazards
  – Issue selection based on round-robin/age of warp
• The SM broadcasts the same instruction to the 32 threads of a warp

Architectural Considerations – Slide 47

SM Instruction Buffer: Warp Scheduling

[Diagram: SM datapath – instruction L1 cache (I$), multithreaded instruction buffer, register file (RF), constant L1 cache (C$), shared memory, operand select, and MAD/SFU units]

Page 48: CUDA Lecture 10 Architectural Considerations

• All register operands of all instructions in the instruction buffer are scoreboarded
  – An instruction becomes ready after the needed values are deposited
  – Prevents hazards
  – Cleared instructions are eligible for issue

Architectural Considerations – Slide 48

Scoreboarding

[Diagram: warps from thread blocks TB1–TB3 (TB = thread block, W = warp) interleaved over time; when TB1 W1 stalls, TB2 W1 issues, then TB3 W1, and so on as further stalls occur]

Page 49: CUDA Lecture 10 Architectural Considerations

• Decoupled memory/processor pipelines
  – Any thread can continue to issue instructions until scoreboarding prevents issue
  – Allows memory/processor ops to proceed in the shadow of other waiting memory/processor ops

Architectural Considerations – Slide 49

Scoreboarding (cont.)

[Diagram repeated from the previous slide]

Page 50: CUDA Lecture 10 Architectural Considerations

• For matrix multiplication, should I use 4×4, 8×8, 16×16, or 32×32 tiles?
  – For 4×4, we have 16 threads per block
  – Since each SM can take up to 768 threads, the thread capacity allows 48 blocks
  – However, each SM can only take up to 8 blocks, so there will be only 128 threads in each SM!
  – There are 8 warps, but each warp is only half full

Architectural Considerations – Slide 50

Granularity Considerations

Page 51: CUDA Lecture 10 Architectural Considerations

• For matrix multiplication, should I use 4×4, 8×8, 16×16, or 32×32 tiles? (cont.)
  – For 8×8, we have 64 threads per block
  – Since each SM can take up to 768 threads, it could take up to 12 blocks
  – However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM!
  – There are 16 warps available for scheduling in each SM
  – Each warp spans four slices in the y dimension

Architectural Considerations – Slide 51

Granularity Considerations (cont.)

Page 52: CUDA Lecture 10 Architectural Considerations

• For matrix multiplication, should I use 4×4, 8×8, 16×16, or 32×32 tiles? (cont.)
  – For 16×16, we have 256 threads per block
  – Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule
  – There are 24 warps available for scheduling in each SM
  – Each warp spans two slices in the y dimension
• For 32×32, we have 1024 threads per block
  – Not even one block can fit into an SM!

Architectural Considerations – Slide 52

Granularity Considerations (cont.)
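For reference, a sketch of how the 16×16 choice shows up in a launch configuration; the kernel name, Md/Nd/Pd, and Width are illustrative placeholders, not the lecture's code.

    // Illustrative launch configuration for 16x16 tiles.
    #define TILE_WIDTH 16
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                  // 256 threads/block -> 8 warps
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);   // assumes Width is a multiple of 16
    matrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
    // On G80: 768 / 256 = 3 blocks per SM, giving 3 * 8 = 24 warps available for scheduling.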

Page 53: CUDA Lecture 10 Architectural Considerations

Review: CUDA Device Memory Space
• Each thread can:
  – R/W per-thread registers and local memory
  – R/W per-block shared memory
  – R/W per-grid global memory
  – Read-only per-grid constant and texture memories
• The host can R/W global, constant, and texture memories

Architectural Considerations – Slide 53

Topic 3: Memory Hardware in G80

[Diagram: device memory spaces – the host reads/writes global, constant, and texture memory; each block has shared memory; each thread has registers and local memory]

Page 54: CUDA Lecture 10 Architectural Considerations

• Uses:
  – Inter-thread communication within a block
  – Cache data to reduce global memory accesses
  – Use it to avoid non-coalesced access
• Organization:
  – 16 banks, 32-bit wide (Tesla); 32 banks, 32-bit wide (Fermi)
  – Successive 32-bit words belong to different banks

Architectural Considerations – Slide 54

Overview: Shared Memory

Page 55: CUDA Lecture 10 Architectural Considerations

• Performance:
  – 32 bits per bank per 2 clocks per multiprocessor
  – Shared memory accesses are per 16 threads (half-warp)
  – Serialization: if n threads (out of 16) access the same bank, the n accesses are executed serially
  – Broadcast: n threads accessing the same word are served in one fetch

one fetch

Architectural Considerations – Slide 55

Overview: Shared Memory (cont.)
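As a small illustration of the banking rule (not from the slides), padding a shared-memory tile by one column keeps the threads of a half-warp that read down a column in different banks:

    // Illustrative fragment: a 16x16 tile padded to 16x17 so that column accesses
    // map to different 32-bit banks (assumes a 16x16 thread block and a 16x16 input).
    __global__ void transposeTileSketch(const float *in, float *out)
    {
        __shared__ float tile[16][17];        // 17, not 16: the extra column is padding
        tile[threadIdx.y][threadIdx.x] = in[threadIdx.y * 16 + threadIdx.x];
        __syncthreads();
        // Without the padding, the 16 threads of a half-warp reading tile[threadIdx.x][...]
        // would all hit the same bank and serialize.
        out[threadIdx.y * 16 + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
    }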

Page 56: CUDA Lecture 10 Architectural Considerations

• Local memory: per-thread
  – Private per thread
  – Auto variables, register spill
• Shared memory: per-block
  – Shared by threads of the same block
  – Inter-thread communication

Architectural Considerations – Slide 56

Parallel Memory Sharing

[Diagram: a thread with its local memory; a block with its shared memory]

Page 57: CUDA Lecture 10 Architectural Considerations

• Global memory: per-application
  – Shared by all threads
  – Inter-grid communication

Architectural Considerations – Slide 57

Parallel Memory Sharing (cont.)

[Diagram: sequential grids in time (Grid 0, Grid 1, …) all sharing global memory]

Page 58: CUDA Lecture 10 Architectural Considerations

• Threads in a block share data and results
  – In memory and shared memory
  – Synchronize at barrier instruction

Architectural Considerations – Slide 58

Streaming Multiprocessor Memory Architecture

[Diagram repeated: two SMs with MT issue units, SPs, and shared memory]

Page 59: CUDA Lecture 10 Architectural Considerations

• Per-block shared memory allocation
  – Keeps data close to the processor
  – Minimizes trips to global memory
  – Shared memory is dynamically allocated to blocks; it is one of the limiting resources

Architectural Considerations – Slide 59

Streaming Multiprocessor Memory Architecture (cont.)

[Diagram repeated from the previous slide]

Page 60: CUDA Lecture 10 Architectural Considerations

• Register file (RF): 32 KB (8K entries) for each SM in G80
• The TEX pipe can also read/write the RF
  – 2 SMs share 1 TEX
• The load/store pipe can also read/write the RF

Architectural Considerations – Slide 60

Streaming Multiprocessor Register File

[Diagram repeated: SM datapath with instruction buffer, RF, constant cache, shared memory, operand select, and MAD/SFU units]

Page 61: CUDA Lecture 10 Architectural Considerations

• There are 8192 registers in each SM in G80
  – This is an implementation decision, not part of CUDA
  – Registers are dynamically partitioned across all blocks assigned to the SM
  – Once assigned to a block, a register is NOT accessible by threads in other blocks
  – Each thread in the same block only accesses registers assigned to itself

Architectural Considerations – Slide 61

Programmer View of Register File

[Diagram: the register file partitioned among 4 blocks vs. 3 blocks]

Page 62: CUDA Lecture 10 Architectural Considerations

• If each block has 16 × 16 threads and each thread uses 10 registers, how many threads can run on each SM?
  – Each block requires 10 × 256 = 2560 registers
  – 8192 = 3 × 2560 + change
  – So three blocks can run on an SM as far as registers are concerned
• What if each thread increases its register use by 1?
  – Each block now requires 11 × 256 = 2816 registers
  – 8192 < 2816 × 3
  – Only two blocks can run on an SM: a ⅓ reduction in thread-level parallelism!

Architectural Considerations – Slide 62

Matrix Multiplication Example
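If register pressure is the limiter, one way (not covered in the lecture) to keep the compiler within a per-thread budget is a launch-bounds hint or the --maxrregcount compiler flag; the kernel name below is illustrative.

    // Illustrative: ask the compiler to target 256 threads/block and at least
    // 3 resident blocks per SM, which bounds the registers used per thread.
    __global__ void __launch_bounds__(256, 3)
    matrixMulKernel(float *Md, float *Nd, float *Pd, int Width)
    {
        /* kernel body omitted */
    }
    // Alternative: compile with  nvcc --maxrregcount=10 ...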

Page 63: CUDA Lecture 10 Architectural Considerations

• Dynamic partitioning gives more flexibility to compilers/programmers
  – One can run a smaller number of threads that require many registers each, or a larger number of threads that require few registers each
  – This allows for finer-grained threading than traditional CPU threading models
  – The compiler can trade off between instruction-level parallelism (ILP) and thread-level parallelism (TLP)

Architectural Considerations – Slide 63

More on Dynamic Partitioning

Page 64: CUDA Lecture 10 Architectural Considerations

• Assume a kernel has 256-thread blocks, 4 independent instructions for each global memory load in the thread program, 10 registers per thread, and 200-cycle global loads; then 3 blocks can run on each SM
• If the compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load, only two blocks can run on each SM

Architectural Considerations – Slide 64

ILP vs. TLP Example

Page 65: CUDA Lecture 10 Architectural Considerations

• However, only 200/(8×4) ≈ 6.25, i.e., 7 warps are needed to tolerate the memory latency
• Two blocks have 16 warps, so performance can actually be higher!

Architectural Considerations – Slide 65

ILP vs. TLP Example (cont.)

Page 66: CUDA Lecture 10 Architectural Considerations

• Increase in per-thread performance, but fewer threads
• Lower overall performance

Architectural Considerations – Slide 66

Resource Allocation Example

[Diagram: (a) pre-“optimization” – thread contexts for TB0–TB2 fit within the 32 KB register file and 16 KB shared memory, keeping SP0–SP7 and SFU0–SFU1 busy; the SP-utilization area determines overall performance. (b) post-“optimization” – insufficient registers to allocate 3 blocks, so fewer thread contexts and lower utilization]

Page 67: CUDA Lecture 10 Architectural Considerations

Architectural Considerations – Slide 67

Without prefetching:

    Loop {
        Load current tile to shared memory
        __syncthreads()
        Compute current tile
        __syncthreads()
    }

With prefetching:

    Load first tile from global memory (into registers)
    Loop {
        Deposit current tile to shared memory
        __syncthreads()
        Load next tile from global memory
        Compute current tile
        __syncthreads()
    }

• One could double buffer the computation, getting a better instruction mix within each thread
• This is classic software pipelining in ILP compilers

Prefetching

Page 68: CUDA Lecture 10 Architectural Considerations

• Deposit blue tile from registers into shared memory
• __syncthreads()
• Load orange tile into registers
• Compute blue tile
• Deposit orange tile into shared memory
• …

Architectural Considerations – Slide 68

Prefetch

[Figure: tiled matrix multiplication – matrices Md and Nd of size WIDTH × WIDTH, tiles of TILE_WIDTH × TILE_WIDTH, and the output tile Pdsub of Pd indexed by block (bx, by) and thread (tx, ty)]
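Putting the blue/orange steps above into code, a sketch of the prefetched tiled kernel might look like the following; TILE_WIDTH is fixed at 16, the index math assumes square matrices whose width is a multiple of 16, and this is not the lecture's exact code.

    // Sketch of register-to-shared double buffering (prefetching) for tiled matrix multiply.
    __global__ void matrixMulPrefetch(const float *Md, const float *Nd, float *Pd, int width)
    {
        __shared__ float Ms[16][16];
        __shared__ float Ns[16][16];
        int row = blockIdx.y * 16 + threadIdx.y;
        int col = blockIdx.x * 16 + threadIdx.x;

        // Load the first ("blue") tile from global memory into registers.
        float mReg = Md[row * width + threadIdx.x];
        float nReg = Nd[threadIdx.y * width + col];

        float pValue = 0.0f;
        for (int m = 0; m < width / 16; ++m) {
            // Deposit the current tile from registers into shared memory.
            Ms[threadIdx.y][threadIdx.x] = mReg;
            Ns[threadIdx.y][threadIdx.x] = nReg;
            __syncthreads();

            // Load the next ("orange") tile into registers while computing the current one.
            if (m + 1 < width / 16) {
                mReg = Md[row * width + (m + 1) * 16 + threadIdx.x];
                nReg = Nd[((m + 1) * 16 + threadIdx.y) * width + col];
            }
            for (int k = 0; k < 16; ++k)
                pValue += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
            __syncthreads();
        }
        Pd[row * width + col] = pValue;
    }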

Page 69: CUDA Lecture 10 Architectural Considerations

There are very few multiplications or additions between branches and address calculations.

Loop unrolling can help.

Architectural Considerations – Slide 69

for (int k = 0; k < BLOCK_SIZE; ++k)
    Pvalue += Ms[ty][k] * Ns[k][tx];

Instruction Mix Considerations

Pvalue += Ms[ty][k] * Ns[k][tx] + … + Ms[ty][k+15] * Ns[k+15][tx];

Page 70: CUDA Lecture 10 Architectural Considerations

Architectural Considerations – Slide 70

Tiled version:

    Ctemp = 0;
    for (…) {
        __shared__ float As[16][16];
        __shared__ float Bs[16][16];

        // load input tile elements
        As[ty][tx] = A[indexA];
        Bs[ty][tx] = B[indexB];
        indexA += 16;
        indexB += 16 * widthB;
        __syncthreads();

        // compute results for tile
        for (i = 0; i < 16; i++) {
            Ctemp += As[ty][i] * Bs[i][tx];
        }
        __syncthreads();
    }
    C[indexC] = Ctemp;

Unrolled version:

    Ctemp = 0;
    for (…) {
        __shared__ float As[16][16];
        __shared__ float Bs[16][16];

        // load input tile elements
        As[ty][tx] = A[indexA];
        Bs[ty][tx] = B[indexB];
        indexA += 16;
        indexB += 16 * widthB;
        __syncthreads();

        // compute results for tile
        Ctemp += As[ty][0] * Bs[0][tx];
        …
        Ctemp += As[ty][15] * Bs[15][tx];
        __syncthreads();
    }
    C[indexC] = Ctemp;

Removal of branch instructions and address calculations

Unrolling

Does this use more registers?

Page 71: CUDA Lecture 10 Architectural Considerations

• Long-latency operations
  – Avoid stalls by executing other threads
• Stalls and bubbles in the pipeline
  – Barrier synchronization
  – Branch divergence
• Shared resource saturation
  – Global memory bandwidth
  – Local memory capacity

Architectural Considerations – Slide 71

Major G80 Performance Detractors

Page 72: CUDA Lecture 10 Architectural Considerations

• Based on original material from:
  – Jon Stokes, “PCI Express: An Overview,” http://arstechnica.com/articles/paedia/hardware/pcie.ars
  – David Kirk and Wen-mei W. Hwu, The University of Illinois at Urbana-Champaign
• Revision history: last updated 10/11/2011. Previous revisions: 9/13/2011.

Architectural Considerations – Slide 72

End Credits