CUDA Lecture 10: Architectural Considerations
Prepared 10/11/2011 by T. O'Neil for 3460:677, Fall 2011, The University of Akron.
- To understand the major factors that dictate performance when using a GPU as a compute accelerator for the CPU
  - The feeds and speeds of the traditional CPU world
  - The feeds and speeds when employing a GPU
- To form a solid knowledge base for performance programming in modern GPUs
- Knowing yesterday, today, and tomorrow
  - The PC world is becoming flatter
  - Outsourcing of computation is becoming easier
Architectural Considerations – Slide 2
Objective
Topic 1 (next): The GPU as Part of the PC Architecture
Topic 2: Threading Hardware in the G80
Topic 3: Memory Hardware in the G80
Architectural Considerations – Slide 3
Outline
- Global variable declarations
- Function prototypes
  - __global__ void kernelOne(…)
- main()
  - Allocate memory space on the device: cudaMalloc(&d_GlblVarPtr, bytes)
  - Transfer data from host to device: cudaMemcpy(d_GlblVarPtr, h_Gl…)
  - Execution configuration setup
  - Kernel call: kernelOne<<<execution configuration>>>(args…)   (repeat as needed)
  - Transfer results from device to host: cudaMemcpy(h_GlblVarPtr, …)
  - Optional: compare against a golden (host-computed) solution
- Kernel: void kernelOne(type args, …)
  - Variable declarations: __local__, __shared__
  - Automatic variables are transparently assigned to registers or local memory
  - __syncthreads(), …
Architectural Considerations – Slide 4
Recall: Typical Structure of a CUDA Program
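A minimal sketch of the structure outlined above; the vector size, buffer names, and the doubling kernel body are illustrative, not from the lecture:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void kernelOne(float *d_out, const float *d_in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) d_out[i] = 2.0f * d_in[i];            // illustrative per-thread work
}

int main(void)
{
    const int N = 1024;
    size_t bytes = N * sizeof(float);
    float h_in[N], h_out[N];
    for (int i = 0; i < N; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);                                 // allocate memory space on the device
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);    // transfer data from host to device

    dim3 dimBlock(256);                                       // execution configuration setup
    dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x);
    kernelOne<<<dimGrid, dimBlock>>>(d_out, d_in, N);         // kernel call (repeat as needed)

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // transfer results from device to host
    printf("h_out[2] = %.1f\n", h_out[2]);                    // optional: compare against a golden host solution
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}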
- The bandwidth between key components ultimately dictates system performance
  - Especially true for massively parallel systems processing massive amounts of data
  - Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases
  - Ultimately, performance falls back to what the "speeds and feeds" dictate
Architectural Considerations – Slide 5
Bandwidth: Gravity of Modern Computer Systems
- The Northbridge connects three components that must communicate at high speed: CPU, DRAM, video
  - Video also needs first-class access to DRAM
  - Previous NVIDIA cards were connected to AGP, with up to 2 GB/sec transfers
- The Southbridge serves as a concentrator for slower I/O devices
Architectural Considerations – Slide 6
Classic PC Architecture
[Figure: classic PC architecture with CPU, core logic chipset (Northbridge/Southbridge), DRAM, and video]
- Connected to the Southbridge
  - Originally 33 MHz, 32-bit wide, 132 MB/sec peak transfer rate; more recently 66 MHz, 64-bit, 512 MB/sec peak
  - Upstream bandwidth remains slow for devices (256 MB/sec peak)
  - Shared bus with arbitration: the winner of arbitration becomes bus master and can connect to the CPU or DRAM through the Southbridge and Northbridge
Architectural Considerations – Slide 7
(Original) PCI Bus Specification
- PCI device registers are mapped into the CPU's physical address space
  - Accessed through loads/stores (kernel mode)
- Addresses are assigned to the PCI devices at boot time
  - All devices listen for their addresses
Architectural Considerations – Slide 8
PCI as Memory-Mapped I/O
- Switched, point-to-point connection
  - Each card has a dedicated "link" to the central switch; no bus arbitration
  - Packet-switched messages form virtual channels
  - Prioritized packets for quality of service, e.g., real-time video streaming
Architectural Considerations – Slide 9
PCI Express (PCIe)
- Each link consists of one or more lanes
  - Each lane is 1 bit wide (4 wires; each 2-wire pair can transmit 2.5 Gb/sec in one direction)
  - Upstream and downstream are now simultaneous and symmetric
- Each link can combine 1, 2, 4, 8, 12, or 16 lanes: x1, x2, etc.
Architectural Considerations – Slide 10
PCIe Links and Lanes
- Each byte of data is 8b/10b-encoded into 10 bits with an equal number of 1's and 0's; the net data rate is 2 Gb/sec per lane each way (2.5 Gb/sec × 8/10)
- Thus the net data rates are 250 MB/sec (x1), 500 MB/sec (x2), 1 GB/sec (x4), 2 GB/sec (x8), and 4 GB/sec (x16), each way
Architectural Considerations – Slide 11
PCIe Links and Lanes (cont.)
- PCIe forms the interconnect backbone
  - Northbridge and Southbridge are both PCIe switches
  - Some Southbridge designs have a built-in PCI-PCIe bridge to allow old PCI cards
  - Some PCIe cards are PCI cards with a PCI-PCIe bridge
Architectural Considerations – Slide 12
PCIe PC Architecture
- FSB connection between the processor and the Northbridge (82925X), the memory control hub
- The Northbridge handles "primary" PCIe to video/GPU and DRAM
  - PCIe x16 bandwidth at 8 GB/sec (4 GB/sec each direction)
- The Southbridge (ICH6RW) handles other peripherals
Architectural Considerations – Slide 13
Today's Intel PC Architecture: Single-Core System
- Bensley platform
  - The Blackford Memory Control Hub (MCH) is now a PCIe switch that integrates NB/SB
  - FBD (Fully Buffered DIMMs) allow simultaneous R/W transfers at 10.5 GB/sec per DIMM
  - PCIe links form the backbone
Architectural Considerations – Slide 14
Today's Intel PC Architecture: Dual-Core System
- Bensley platform
  - PCIe device upstream bandwidth is now equal to downstream
  - The workstation version has an x16 GPU link via the Greencreek MCH
Source: http://www.2cpu.com/review.php?id=109
Architectural Considerations – Slide 15
Today's Intel PC Architecture: Dual-Core System (cont.)
- Two CPU sockets
  - Dual Independent Buses to the CPUs, each basically a FSB
  - CPU feeds at 8.5–10.5 GB/sec per socket
  - Compared to the current Front-Side Bus CPU feed of 6.4 GB/sec
- PCIe bridges to legacy I/O devices
Source: http://www.2cpu.com/review.php?id=109
Architectural Considerations – Slide 16
Today's Intel PC Architecture: Dual-Core System (cont.)
- The AMD HyperTransport™ Technology bus replaces the Front-Side Bus architecture
- HyperTransport™ similarities to PCIe:
  - Packet-based switching network
  - Dedicated links for both directions
- Shown in a 4-socket configuration, 8 GB/sec per link
Source: http://www.2cpu.com/review.php?id=109
Architectural Considerations – Slide 17
Today's AMD PC Architecture
- Northbridge/HyperTransport™ is on die
- Glueless logic to DDR, DDR2 memory
- PCI-X/PCIe bridges (usually implemented in the Southbridge)
Architectural Considerations – Slide 18
Today's AMD PC Architecture (cont.)
- "Torrenza" technology allows licensing of coherent HyperTransport™ to third-party manufacturers to make socket-compatible accelerators/co-processors
Architectural Considerations – Slide 19
Today's AMD PC Architecture (cont.)
- "Torrenza" technology allows third-party PPUs (Physics Processing Units), GPUs, and co-processors to access main system memory directly and coherently
Architectural Considerations – Slide 20
Today's AMD PC Architecture (cont.)
- "Torrenza" technology could make the accelerator programming model easier to use than, say, the Cell processor, where each SPE cannot directly access main memory
Architectural Considerations – Slide 21
Today's AMD PC Architecture (cont.)
- Primarily a low-latency direct chip-to-chip interconnect; supports mapping to board-to-board interconnects such as PCIe
Architectural Considerations – Slide 22
HyperTransport™ Feeds and Speeds
- HyperTransport™ 1.0 specification: 800 MHz max, 12.8 GB/s aggregate bandwidth (6.4 GB/s each way)
Courtesy HyperTransport™ Consortium. Source: "White Paper: AMD HyperTransport Technology-Based System Architecture"
Architectural Considerations – Slide 23
HyperTransport™ Feeds and Speeds (cont.)
- HyperTransport™ 2.0 specification: added PCIe mapping; 1.0–1.4 GHz clock, 22.4 GB/s aggregate bandwidth (11.2 GB/s each way)
Architectural Considerations – Slide 24
HyperTransport™ Feeds and Speeds (cont.)
- HyperTransport™ 3.0 specification: 1.8–2.6 GHz clock, 41.6 GB/s aggregate bandwidth (20.8 GB/s each way); added AC coupling to extend HyperTransport™ to long-distance, system-to-system interconnect
Architectural Considerations – Slide 25
HyperTransport™ Feeds and Speeds (cont.)
[Figure: HyperTransport™ feeds-and-speeds summary table, courtesy HyperTransport™ Consortium]
Architectural Considerations – Slide 26
GeForce 7800 GTX Board Details
- 256 MB / 256-bit GDDR3 at 600 MHz (8 pieces of 8M×32)
- 16x PCI Express
- SLI connector
- DVI ×2
- sVideo (TV out)
- Single-slot cooling
- Single-Program Multiple-Data (SPMD)
  - A CUDA application is an integrated CPU + GPU C program
  - Serial C code executes on the CPU
  - Parallel kernel C code executes on GPU thread blocks
Architectural Considerations – Slide 27
Topic 2: Threading in G80
Architectural Considerations – Slide 28
SPMD (cont.)
[Figure: execution alternates between CPU serial code and GPU parallel kernels; KernelA<<< nBlk, nTid >>>(args); launches Grid 0, more CPU serial code runs, then KernelB<<< nBlk, nTid >>>(args); launches Grid 1]
- A kernel is executed as a grid of thread blocks
- All threads share the global memory space
Architectural Considerations – Slide 29
Grids and Blocks
[Figure 3.2: An example of CUDA thread organization; the host launches Kernel 1 on Grid 1 (blocks (0, 0) through (1, 1)) and Kernel 2 on Grid 2; Block (1, 1) is expanded into a 4×2×2 array of threads, Thread (0, 0, 0) through Thread (3, 1, 1). Courtesy: NVIDIA]
- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution using a barrier
  - Efficiently sharing data through low-latency shared memory
- Two threads from two different blocks cannot cooperate
Architectural Considerations – Slide 30
Grids and Blocks (cont.)
- The programmer declares a (thread) block:
  - Block size: 1 to 512 concurrent threads
  - Block shape: 1D, 2D, or 3D
  - Block dimensions in threads
Architectural Considerations – Slide 31
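A sketch of declaring such a block shape and grid in CUDA C; the dimensions match the figure above, and kernelOne is the illustrative kernel from the Slide 4 sketch:

dim3 dimBlock(4, 2, 2);   // 3D block: 4 × 2 × 2 = 16 threads
dim3 dimGrid(2, 2);       // 2D grid: 2 × 2 blocks
kernelOne<<<dimGrid, dimBlock>>>(d_out, d_in, N);   // launch one grid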
CUDA Thread Block: Review
[Figure: a CUDA thread block with thread IDs 0, 1, 2, 3, …, m, all running the same thread program. Courtesy: John Nickolls, NVIDIA]
- All threads in a block execute the same thread program
- Threads share data and synchronize while doing their share of the work
- Threads have thread ID numbers within the block
- The thread program uses the thread ID to select work and address shared data
Architectural Considerations – Slide 32
CUDA Thread Block: Review (cont.)
Architectural Considerations – Slide 33
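A sketch of a thread program that uses its thread ID both to select work and to address shared data; the block-level reversal and every name here are illustrative:

// Reverse each 256-element segment of d_data in place; launch with 256-thread blocks.
__global__ void reverseBlock(float *d_data)
{
    __shared__ float buf[256];                  // low-latency per-block shared memory
    int t = threadIdx.x;                        // thread ID within the block
    int base = blockIdx.x * blockDim.x;
    buf[t] = d_data[base + t];                  // each thread selects its element by thread ID
    __syncthreads();                            // barrier: wait until the whole tile is loaded
    d_data[base + t] = buf[blockDim.x - 1 - t]; // address shared data by thread ID
}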
GeForce-8 Series Hardware Overview
[Figure: the Streaming Processor Array, a row of Texture Processor Clusters (TPC); each TPC holds a TEX unit and two Streaming Multiprocessors (SM); each SM contains instruction fetch/dispatch, instruction L1 and data L1 caches, shared memory, 8 SPs, and 2 SFUs]
- SPA: Streaming Processor Array (variable across the GeForce 8 series; 8 TPCs in the GeForce 8800)
- TPC: Texture Processor Cluster (2 SMs + TEX)
- SM: Streaming Multiprocessor (8 SPs)
  - Multithreaded processor core
  - Fundamental processing unit for a CUDA thread block
- SP: Streaming Processor
  - Scalar ALU for a single CUDA thread
Architectural Considerations – Slide 34
CUDA Processor Terminology
- Streaming Multiprocessor (SM)
  - 8 Streaming Processors (SP)
  - 2 Super Function Units (SFU)
- Multithreaded instruction dispatch
  - 1 to 512 threads active
  - Shared instruction fetch per 32 threads
  - Covers latency of texture/memory loads
Architectural Considerations – Slide 35
Streaming Multiprocessor
[Figure: Streaming Multiprocessor with instruction fetch/dispatch, instruction L1 and data L1 caches, shared memory, 8 SPs, and 2 SFUs]
- 20+ GFLOPS
- 16 KB shared memory
- Texture and global memory access
Architectural Considerations – Slide 36
Streaming Multiprocessor (cont.)
- The future of GPUs is programmable processing
- So: build the architecture around the processor
Architectural Considerations – Slide 37
G80 Thread Computing Pipeline
[Figure: the G80 graphics-mode pipeline; Host and Input Assembler feed Vtx, Geom, and Pixel thread issue (with Setup/Rstr/ZCull); the thread processor array of SP pairs, each with L1 and TF, sits above the L2/FB memory partitions]
- Processors execute computing threads
- An alternative operating mode exists specifically for computing
Architectural Considerations – Slide 38
G80 Thread Computing Pipeline (cont.)
[Figure: the G80 compute-mode pipeline; Host, Input Assembler, and Thread Execution Manager feed the SM array, each group with a parallel data cache and texture unit, and load/store paths connect to Global Memory]
- The Thread Execution Manager generates thread grids based on kernel calls
- A grid is launched on the Streaming Processor Array (SPA)
- Thread blocks are serially distributed to all the Streaming Multiprocessors (SMs)
  - Potentially more than one thread block per SM
- Each SM launches warps of threads
  - Two levels of parallelism
Architectural Considerations – Slide 39
Thread Life Cycle in Hardware
[Figure: the host launches Kernel 1 on Grid 1 (blocks (0, 0) through (2, 1)) and Kernel 2 on Grid 2; Block (1, 1) is expanded into a 5×3 array of threads]
- The SM schedules and executes warps that are ready to run
- As warps and thread blocks complete, resources are freed
  - So the SPA can distribute more thread blocks
Architectural Considerations – Slide 40
Thread Life Cycle in Hardware (cont.)
- Threads are assigned to SMs at block granularity
  - Up to 8 blocks per SM, as resources allow
  - An SM in G80 can take up to 768 threads
    - Could be 256 (threads/block) × 3 blocks
    - Or 128 (threads/block) × 6 blocks, etc.
Architectural Considerations – Slide 41
Streaming Multiprocessor Executes Blocks
[Figure: two SMs (SM 0 and SM 1), each with an MT issue unit, SPs, and shared memory, executing blocks of threads t0 t1 t2 … tm; texture L1, TF, L2, and memory sit below]
- Threads run concurrently
  - The SM assigns/maintains thread ID numbers
  - The SM manages/schedules thread execution
Architectural Considerations – Slide 42
Streaming Multiprocessor Executes Blocks (cont.)
- Each thread block is divided into 32-thread warps
  - This is an implementation decision, not part of the CUDA programming model
- Warps are the scheduling units in an SM
Architectural Considerations – Slide 43
Thread Scheduling/Execution
[Figure: Block 1 warps and Block 2 warps (each warp spanning t0 t1 t2 … t31) feeding the SM's instruction fetch/dispatch unit]
- If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  - Each block is divided into 256/32 = 8 warps
  - There are 8 × 3 = 24 warps
  - At any point in time, only one of the 24 warps will be selected for instruction fetch and execution
Architectural Considerations – Slide 44
Thread Scheduling/Execution (cont.)
- SM hardware implements zero-overhead warp scheduling
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - Eligible warps are selected for execution based on a prioritized scheduling policy
  - All threads in a warp execute the same instruction when selected
Architectural Considerations – Slide 45
Streaming Multiprocessor Warp Scheduling
[Figure: the SM multithreaded warp scheduler issuing, over time: warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, …, warp 3 instruction 96]
- Four clock cycles are needed to dispatch the same instruction for all threads in a warp in G80
- If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate 200-cycle memory latency (200 ÷ (4 instructions × 4 cycles) = 12.5, rounded up to 13)
Architectural Considerations – Slide 46
Streaming Multiprocessor Warp Scheduling (cont.)
- Fetch one warp instruction per cycle
  - From the instruction L1 cache
  - Into any instruction buffer slot
- Issue one "ready-to-go" warp instruction per cycle
  - From any warp-instruction buffer slot
  - Operand scoreboarding is used to prevent hazards
- Issue selection is based on round-robin/age of warp
- The SM broadcasts the same instruction to the 32 threads of a warp
Architectural Considerations – Slide 47
SM Instruction Buffer: Warp Scheduling
[Figure: the SM instruction and operand path: instruction L1 cache (I$ L1), multithreaded instruction buffer, register file (RF), constant cache (C$ L1), shared memory, operand select, and MAD/SFU execution units]
- All register operands of all instructions in the instruction buffer are scoreboarded
  - An instruction becomes ready after the needed values are deposited
  - This prevents hazards
  - Cleared instructions are eligible for issue
Architectural Considerations – Slide 48
Scoreboarding
[Figure: warp interleaving over time (TB = thread block, W = warp); warps TB1 W1, TB2 W1, TB3 W1, TB3 W2, TB1 W2, TB1 W3 take turns issuing instructions, and when a warp stalls (TB1 W1, TB2 W1, TB3 W2) another eligible warp is selected]
- Decoupled memory/processor pipelines
  - Any thread can continue to issue instructions until scoreboarding prevents issue
  - This allows memory/processor ops to proceed in the shadow of other waiting memory/processor ops
Architectural Considerations – Slide 49
Scoreboarding (cont.)
- For matrix multiplication, should I use 4×4, 8×8, 16×16, or 32×32 tiles?
  - For 4×4, we have 16 threads per block
  - Since each SM can take up to 768 threads, the thread capacity allows 48 blocks
  - However, each SM can only take up to 8 blocks, so there will be only 128 threads in each SM!
  - There are 8 warps, but each warp is only half full
Architectural Considerations – Slide 50
Granularity Considerations
- For 8×8, we have 64 threads per block
  - Since each SM can take up to 768 threads, it could take up to 12 blocks
  - However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM!
  - There are 16 warps available for scheduling in each SM
  - Each warp spans four slices in the y dimension
Architectural Considerations – Slide 51
Granularity Considerations (cont.)
- For 16×16, we have 256 threads per block
  - Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule
  - There are 24 warps available for scheduling in each SM
  - Each warp spans two slices in the y dimension
- For 32×32, we have 1,024 threads per block; not even one block can fit into an SM!
Architectural Considerations – Slide 52
Granularity Considerations (cont.)
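A small host-side sketch that reproduces the tile arithmetic above; the 768-thread, 8-block, and 512-threads-per-block limits are the G80 values quoted in these slides, while the helper itself is illustrative:

#include <stdio.h>

/* G80 per-SM limits quoted in the slides */
enum { MAX_THREADS_PER_SM = 768, MAX_BLOCKS_PER_SM = 8, MAX_THREADS_PER_BLOCK = 512 };

static void tileOccupancy(int tile)
{
    int threadsPerBlock = tile * tile;
    if (threadsPerBlock > MAX_THREADS_PER_BLOCK) {
        printf("%2dx%-2d: %4d threads/block, does not fit in one block\n", tile, tile, threadsPerBlock);
        return;
    }
    int blocks = MAX_THREADS_PER_SM / threadsPerBlock;          /* limit by thread capacity */
    if (blocks > MAX_BLOCKS_PER_SM) blocks = MAX_BLOCKS_PER_SM; /* limit by block slots */
    int warpsPerBlock = (threadsPerBlock + 31) / 32;            /* warps are allocated per block */
    printf("%2dx%-2d: %d blocks/SM, %3d threads/SM, %2d warps/SM\n",
           tile, tile, blocks, blocks * threadsPerBlock, blocks * warpsPerBlock);
}

int main(void)
{
    int tiles[] = { 4, 8, 16, 32 };
    for (int i = 0; i < 4; ++i) tileOccupancy(tiles[i]);
    return 0;   /* prints 8/128/8, 8/512/16, 3/768/24, and "does not fit" for 32x32 */
}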
Review: CUDA Device Memory Space
- Each thread can:
  - R/W per-thread registers and local memory
  - R/W per-block shared memory
  - R/W per-grid global memory
  - Read-only per-grid constant and texture memories
- The host can R/W the global, constant, and texture memories
Architectural Considerations – Slide 53
Topic 3: Memory Hardware in G80
[Figure: the device memory spaces for a grid: each block has shared memory; each thread has registers and local memory; global, constant, and texture memories are per-grid and accessible from the host]
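A sketch of how these spaces appear in CUDA C; the names and the arithmetic are illustrative (launch with 64-thread blocks):

__constant__ float coeff[16];          // per-grid constant memory (read-only in kernels)
__device__ float scale;                // statically allocated per-grid global memory

__global__ void memorySpaces(float *gOut)   // gOut: global memory from cudaMalloc
{
    __shared__ float tile[64];         // per-block shared memory
    int t = threadIdx.x;               // automatic variables live in registers,
    float tmp;                         // or spill to per-thread local memory
    tile[t] = coeff[t % 16] * scale;   // read constant and global memory
    __syncthreads();
    tmp = tile[(t + 1) % 64];          // read a neighbor's value from shared memory
    gOut[blockIdx.x * blockDim.x + t] = tmp;   // write per-grid global memory
}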
- Uses:
  - Inter-thread communication within a block
  - Caching data to reduce global memory accesses
  - Avoiding non-coalesced access
- Organization:
  - 16 banks, each 32 bits wide (Tesla); 32 banks, each 32 bits wide (Fermi)
  - Successive 32-bit words belong to different banks
Architectural Considerations – Slide 54
Overview: Shared Memory
- Performance:
  - 32 bits per bank per 2 clocks per multiprocessor
  - Shared memory accesses are per 16 threads (half-warp)
  - Serialization: if n threads (out of 16) access the same bank, the n accesses are executed serially
  - Broadcast: n threads accessing the same word are served in one fetch
Architectural Considerations – Slide 55
Overview: Shared Memory (cont.)
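A sketch of the serialization and broadcast rules, assuming the 16-bank layout described above; the kernel is illustrative and the shared data is left uninitialized, since only the access pattern matters:

__global__ void bankPatterns(float *out)
{
    __shared__ float s[16][16];        // row stride of 16 words = number of banks
    int t = threadIdx.x;               // consider threads 0..15 of a half-warp
    float a = s[0][t];   // conflict-free: 16 consecutive words fall in 16 different banks
    float b = s[t][0];   // 16-way conflict: addresses 16 words apart all map to bank 0,
                         // so the 16 accesses are serialized
    float c = s[5][3];   // broadcast: every thread reads the same word in one fetch
    out[t] = a + b + c;
}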
- Local memory: per-thread
  - Private per thread
  - Auto variables, register spill
- Shared memory: per-block
  - Shared by threads of the same block
  - Inter-thread communication
Architectural Considerations – Slide 56
Parallel Memory Sharing
[Figure: a thread with its local memory; a block with its shared memory]
- Global memory: per-application
  - Shared by all threads
  - Inter-grid communication
Architectural Considerations – Slide 57
Parallel Memory Sharing (cont.)
[Figure: sequential grids in time (Grid 0, Grid 1, …) communicating through global memory]
- Threads in a block share data and results
  - In memory and in shared memory
  - Synchronized at barrier instructions
Architectural Considerations – Slide 58
Streaming Multiprocessor Memory Architecture
- Per-block shared memory allocation
  - Keeps data close to the processor
  - Minimizes trips to global memory
  - Shared memory is dynamically allocated to blocks, making it one of the limiting resources
Architectural Considerations – Slide 59
Streaming Multiprocessor Memory Architecture (cont.)
- Register File (RF): 32 KB (8K entries) in each SM in G80
- The TEX pipe can also read/write the RF
  - 2 SMs share 1 TEX
- The load/store pipe can also read/write the RF
Architectural Considerations – Slide 60
Streaming Multiprocessor Register File
- There are 8,192 registers in each SM in G80
  - This is an implementation decision, not part of CUDA
  - Registers are dynamically partitioned across all blocks assigned to the SM
  - Once assigned to a block, a register is NOT accessible by threads in other blocks
  - Each thread in the same block can only access registers assigned to itself
Architectural Considerations – Slide 61
Programmer View of Register File
[Figure: the register file partitioned among 4 blocks vs. 3 blocks]
- If each block has 16 × 16 threads and each thread uses 10 registers, how many threads can run on each SM?
  - Each block requires 10 × 256 = 2,560 registers
  - 8,192 = 3 × 2,560 + change
  - So three blocks can run on an SM, as far as registers are concerned
- What if each thread increases its register use by 1?
  - Each block now requires 11 × 256 = 2,816 registers
  - 8,192 < 2,816 × 3
  - Only two blocks can run on an SM: a ⅓ reduction of thread-level parallelism!
Architectural Considerations – Slide 62
Matrix Multiplication Example
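The same register arithmetic as a host-side sketch; the 8,192-register file size is from the slides and the helper is illustrative:

#include <stdio.h>

static int blocksByRegisters(int threadsPerBlock, int regsPerThread)
{
    const int regsPerSM = 8192;        /* G80 register file per SM */
    return regsPerSM / (threadsPerBlock * regsPerThread);
}

int main(void)
{
    printf("10 regs/thread: %d blocks/SM\n", blocksByRegisters(256, 10));  /* 3 blocks */
    printf("11 regs/thread: %d blocks/SM\n", blocksByRegisters(256, 11));  /* 2 blocks */
    return 0;
}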
- Dynamic partitioning gives more flexibility to compilers/programmers
  - One can run a smaller number of threads that require many registers each, or a larger number of threads that require few registers each
  - This allows for finer-grained threading than traditional CPU threading models
  - The compiler can trade off between instruction-level parallelism (ILP) and thread-level parallelism (TLP)
Architectural Considerations – Slide 63
More on Dynamic Partitioning
- Assume a kernel has 256-thread blocks, 4 independent instructions for each global memory load in the thread program, and 10 registers per thread, with global loads taking 200 cycles; then 3 blocks can run on each SM
- If the compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load, only two blocks can run on each SM
Architectural Considerations – Slide 64
ILP vs. TLP Example
- However, one only needs 200/(8 × 4) ≈ 7 warps to tolerate the memory latency
- Two blocks have 16 warps, so performance can actually be higher!
Architectural Considerations – Slide 65
ILP vs. TLP Example (cont.)
- An increase in per-thread performance with fewer threads can mean lower overall performance
Architectural Considerations – Slide 66
Resource Allocation Example
[Figure: (a) pre-"optimization": thread contexts for three blocks (TB0, TB1, TB2) fit in the 32 KB register file alongside 16 KB shared memory, keeping SP0–SP7 and SFU0/SFU1 busy; (b) post-"optimization": insufficient registers to allocate 3 blocks; the SP-utilization area determines overall performance]
Architectural Considerations – Slide 67
Without prefetching:

    Loop {
        Load current tile to shared memory
        __syncthreads()
        Compute current tile
        __syncthreads()
    }

With prefetching:

    Load next tile from global memory
    Loop {
        Deposit current tile to shared memory
        __syncthreads()
        Load next tile from global memory
        Compute current tile
        __syncthreads()
    }
- One could double buffer the computation, getting a better instruction mix within each thread
  - This is classic software pipelining in ILP compilers
- Per iteration: deposit the blue tile from registers into shared memory, __syncthreads(), load the orange tile into registers, compute the blue tile, deposit the orange tile into shared memory, …
Architectural Considerations – Slide 68
Prefetching
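A sketch of the prefetched inner loop for the tiled matrix multiply; the 16×16 tile, the index scheme, and all names are illustrative, and width is assumed to be a multiple of the tile size:

#define TILE 16

__global__ void matMulPrefetch(float *C, const float *A, const float *B, int width)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;

    float aReg = A[row * width + tx];          // prefetch the first tile into registers
    float bReg = B[ty * width + col];

    float Ctemp = 0.0f;
    for (int m = 0; m < width / TILE; ++m) {
        As[ty][tx] = aReg;                     // deposit the current tile to shared memory
        Bs[ty][tx] = bReg;
        __syncthreads();
        if (m + 1 < width / TILE) {            // load the next tile while computing this one
            aReg = A[row * width + (m + 1) * TILE + tx];
            bReg = B[((m + 1) * TILE + ty) * width + col];
        }
        for (int k = 0; k < TILE; ++k)         // compute the current tile
            Ctemp += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }
    C[row * width + col] = Ctemp;
}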
[Figure: tiled matrix multiplication: TILE_WIDTH × TILE_WIDTH tiles of Md and Nd combine to produce the Pdsub tile of Pd; bx/by index blocks, tx/ty index threads within a tile, and WIDTH is the full matrix dimension]
- There are very few multiplications or additions between branches and address calculations:

    for (int k = 0; k < BLOCK_SIZE; ++k)
        Pvalue += Ms[ty][k] * Ns[k][tx];

- Loop unrolling can help:

    Pvalue += Ms[ty][k] * Ns[k][tx] + … + Ms[ty][k+15] * Ns[k+15][tx];

Architectural Considerations – Slide 69
Instruction Mix Considerations
Architectural Considerations – Slide 70
Tiled version:

    Ctemp = 0;
    for (…) {
        __shared__ float As[16][16];
        __shared__ float Bs[16][16];
        // load input tile elements
        As[ty][tx] = A[indexA];
        Bs[ty][tx] = B[indexB];
        indexA += 16;
        indexB += 16 * widthB;
        __syncthreads();
        // compute results for tile
        for (i = 0; i < 16; i++) {
            Ctemp += As[ty][i] * Bs[i][tx];
        }
        __syncthreads();
    }
    C[indexC] = Ctemp;

Unrolled version:

    Ctemp = 0;
    for (…) {
        __shared__ float As[16][16];
        __shared__ float Bs[16][16];
        // load input tile elements
        As[ty][tx] = A[indexA];
        Bs[ty][tx] = B[indexB];
        indexA += 16;
        indexB += 16 * widthB;
        __syncthreads();
        // compute results for tile: no branches or index
        // arithmetic between the multiply-add operations
        Ctemp += As[ty][0] * Bs[0][tx];
        …
        Ctemp += As[ty][15] * Bs[15][tx];
        __syncthreads();
    }
    C[indexC] = Ctemp;

- Removal of branch instructions and address calculations
- Does this use more registers?
Unrolling
- Long-latency operations
  - Avoid stalls by executing other threads
- Stalls and bubbles in the pipeline
  - Barrier synchronization
  - Branch divergence
- Shared resource saturation
  - Global memory bandwidth
  - Local memory capacity
Architectural Considerations – Slide 71
Major G80 Performance Detractors
- Based on original material from:
  - Jon Stokes, "PCI Express: An Overview," http://arstechnica.com/articles/paedia/hardware/pcie.ars
  - David Kirk and Wen-mei W. Hwu, The University of Illinois at Urbana-Champaign
- Revision history: last updated 10/11/2011; previous revision 9/13/2011.
Architectural Considerations – Slide 72
End Credits