CUDA Lecture 10: Architectural Considerations
Prepared 10/11/2011 by T. O'Neil for 3460:677, Fall 2011, The University of Akron.
- To understand the major factors that dictate performance when using a GPU as a compute accelerator for the CPU
  - The feeds and speeds of the traditional CPU world
  - The feeds and speeds when employing a GPU
- To form a solid knowledge base for performance programming in modern GPUs
- Knowing yesterday, today, and tomorrow
  - The PC world is becoming flatter
  - Outsourcing of computation is becoming easier
Architectural Considerations – Slide 2
Objective
Topic 1 (next): The GPU as Part of the PC Architecture
Topic 2: Threading Hardware in the G80
Topic 3: Memory Hardware in the G80
Architectural Considerations – Slide 3
Outline
- Global variable declarations
- Function prototypes
  - __global__ void kernelOne(…)
- main()
  - Allocate memory space on the device: cudaMalloc(&d_GlblVarPtr, bytes)
  - Transfer data from host to device: cudaMemcpy(d_GlblVarPtr, h_Gl…)
  - Execution configuration setup
  - Kernel call: kernelOne<<<execution configuration>>>(args…)   (repeat as needed)
  - Transfer results from device to host: cudaMemcpy(h_GlblVarPtr, …)
  - Optional: compare against a golden (host-computed) solution
- Kernel: void kernelOne(type args, …)
  - Variable declarations: __local__, __shared__
  - Automatic variables are transparently assigned to registers or local memory
  - __syncthreads(), …
Architectural Considerations – Slide 4
Recall: Typical Structure of a CUDA Program
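A minimal sketch of the structure outlined above; the vector size, buffer names, and the doubling kernel body are illustrative, not from the lecture:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void kernelOne(float *d_out, const float *d_in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) d_out[i] = 2.0f * d_in[i];            // illustrative per-thread work
}

int main(void)
{
    const int N = 1024;
    size_t bytes = N * sizeof(float);
    float h_in[N], h_out[N];
    for (int i = 0; i < N; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);                                 // allocate memory space on the device
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);    // transfer data from host to device

    dim3 dimBlock(256);                                       // execution configuration setup
    dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x);
    kernelOne<<<dimGrid, dimBlock>>>(d_out, d_in, N);         // kernel call (repeat as needed)

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // transfer results from device to host
    printf("h_out[2] = %.1f\n", h_out[2]);                    // optional: compare against a golden host solution
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}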
- The bandwidth between key components ultimately dictates system performance
  - Especially true for massively parallel systems processing massive amounts of data
  - Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases
  - Ultimately, performance falls back to what the "speeds and feeds" dictate
Architectural Considerations – Slide 5
Bandwidth: Gravity of Modern Computer Systems
- The Northbridge connects three components that must communicate at high speed: CPU, DRAM, video
  - Video also needs first-class access to DRAM
  - Previous NVIDIA cards were connected to AGP, with up to 2 GB/sec transfers
- The Southbridge serves as a concentrator for slower I/O devices
Architectural Considerations – Slide 6
Classic PC Architecture
[Figure: classic PC architecture with CPU, core logic chipset (Northbridge/Southbridge), DRAM, and video]
- Connected to the Southbridge
  - Originally 33 MHz, 32-bit wide, 132 MB/sec peak transfer rate; more recently 66 MHz, 64-bit, 512 MB/sec peak
  - Upstream bandwidth remains slow for devices (256 MB/sec peak)
  - Shared bus with arbitration: the winner of arbitration becomes bus master and can connect to the CPU or DRAM through the Southbridge and Northbridge
Architectural Considerations – Slide 7
(Original) PCI Bus Specification
- PCI device registers are mapped into the CPU's physical address space
  - Accessed through loads/stores (kernel mode)
- Addresses are assigned to the PCI devices at boot time
  - All devices listen for their addresses
Architectural Considerations – Slide 8
PCI as Memory-Mapped I/O
- Switched, point-to-point connection
  - Each card has a dedicated "link" to the central switch; no bus arbitration
  - Packet-switched messages form virtual channels
  - Prioritized packets for quality of service, e.g., real-time video streaming
Architectural Considerations – Slide 9
PCI Express (PCIe)
- Each link consists of one or more lanes
  - Each lane is 1 bit wide (4 wires; each 2-wire pair can transmit 2.5 Gb/sec in one direction)
  - Upstream and downstream are now simultaneous and symmetric
- Each link can combine 1, 2, 4, 8, 12, or 16 lanes: x1, x2, etc.
Architectural Considerations – Slide 10
PCIe Links and Lanes
- Each byte of data is 8b/10b-encoded into 10 bits with an equal number of 1's and 0's; the net data rate is 2 Gb/sec per lane each way (2.5 Gb/sec × 8/10)
- Thus the net data rates are 250 MB/sec (x1), 500 MB/sec (x2), 1 GB/sec (x4), 2 GB/sec (x8), and 4 GB/sec (x16), each way
Architectural Considerations – Slide 11
PCIe Links and Lanes (cont.)
- PCIe forms the interconnect backbone
  - Northbridge and Southbridge are both PCIe switches
  - Some Southbridge designs have a built-in PCI-PCIe bridge to allow old PCI cards
  - Some PCIe cards are PCI cards with a PCI-PCIe bridge
Architectural Considerations – Slide 12
PCIe PC Architecture
- FSB connection between the processor and the Northbridge (82925X), the memory control hub
- The Northbridge handles "primary" PCIe to video/GPU and DRAM
  - PCIe x16 bandwidth at 8 GB/sec (4 GB/sec each direction)
- The Southbridge (ICH6RW) handles other peripherals
Architectural Considerations – Slide 13
Today's Intel PC Architecture: Single-Core System
- Bensley platform
  - The Blackford Memory Control Hub (MCH) is now a PCIe switch that integrates NB/SB
  - FBD (Fully Buffered DIMMs) allow simultaneous R/W transfers at 10.5 GB/sec per DIMM
  - PCIe links form the backbone
Architectural Considerations – Slide 14
Today's Intel PC Architecture: Dual-Core System
- Bensley platform
  - PCIe device upstream bandwidth is now equal to downstream
  - The workstation version has an x16 GPU link via the Greencreek MCH
Source: http://www.2cpu.com/review.php?id=109
Architectural Considerations – Slide 15
Today's Intel PC Architecture: Dual-Core System (cont.)
- Two CPU sockets
  - Dual Independent Buses to the CPUs, each basically a FSB
  - CPU feeds at 8.5–10.5 GB/sec per socket
  - Compared to the current Front-Side Bus CPU feed of 6.4 GB/sec
- PCIe bridges to legacy I/O devices
Source: http://www.2cpu.com/review.php?id=109
Architectural Considerations – Slide 16
Today's Intel PC Architecture: Dual-Core System (cont.)
- The AMD HyperTransport™ Technology bus replaces the Front-Side Bus architecture
- HyperTransport™ similarities to PCIe:
  - Packet-based switching network
  - Dedicated links for both directions
- Shown in a 4-socket configuration, 8 GB/sec per link
Source: http://www.2cpu.com/review.php?id=109
Architectural Considerations – Slide 17
Today's AMD PC Architecture
- Northbridge/HyperTransport™ is on die
- Glueless logic to DDR, DDR2 memory
- PCI-X/PCIe bridges (usually implemented in the Southbridge)
Architectural Considerations – Slide 18
Today's AMD PC Architecture (cont.)
- "Torrenza" technology allows licensing of coherent HyperTransport™ to third-party manufacturers to make socket-compatible accelerators/co-processors
Architectural Considerations – Slide 19
Today's AMD PC Architecture (cont.)
- "Torrenza" technology allows third-party PPUs (Physics Processing Units), GPUs, and co-processors to access main system memory directly and coherently
Architectural Considerations – Slide 20
Today's AMD PC Architecture (cont.)
- "Torrenza" technology could make the accelerator programming model easier to use than, say, the Cell processor, where each SPE cannot directly access main memory
Architectural Considerations – Slide 21
Today's AMD PC Architecture (cont.)
- Primarily a low-latency direct chip-to-chip interconnect; supports mapping to board-to-board interconnects such as PCIe
Architectural Considerations – Slide 22
HyperTransport™ Feeds and Speeds
- HyperTransport™ 1.0 specification: 800 MHz max, 12.8 GB/s aggregate bandwidth (6.4 GB/s each way)
Courtesy HyperTransport™ Consortium. Source: "White Paper: AMD HyperTransport Technology-Based System Architecture"
Architectural Considerations – Slide 23
HyperTransport™ Feeds and Speeds (cont.)
- HyperTransport™ 2.0 specification: added PCIe mapping; 1.0–1.4 GHz clock, 22.4 GB/s aggregate bandwidth (11.2 GB/s each way)
Architectural Considerations – Slide 24
HyperTransport™ Feeds and Speeds (cont.)
- HyperTransport™ 3.0 specification: 1.8–2.6 GHz clock, 41.6 GB/s aggregate bandwidth (20.8 GB/s each way); added AC coupling to extend HyperTransport™ to long-distance, system-to-system interconnect
Architectural Considerations – Slide 25
HyperTransport™ Feeds and Speeds (cont.)
[Figure: HyperTransport™ feeds-and-speeds summary table, courtesy HyperTransport™ Consortium]
Architectural Considerations – Slide 26
GeForce 7800 GTX Board Details
- 256 MB / 256-bit GDDR3 at 600 MHz (8 pieces of 8M×32)
- 16x PCI Express
- SLI connector
- DVI ×2
- sVideo (TV out)
- Single-slot cooling
- Single-Program Multiple-Data (SPMD)
  - A CUDA application is an integrated CPU + GPU C program
  - Serial C code executes on the CPU
  - Parallel kernel C code executes on GPU thread blocks
Architectural Considerations – Slide 27
Topic 2: Threading in G80
Architectural Considerations – Slide 28
SPMD (cont.)
[Figure: execution alternates between CPU serial code and GPU parallel kernels; KernelA<<< nBlk, nTid >>>(args); launches Grid 0, more CPU serial code runs, then KernelB<<< nBlk, nTid >>>(args); launches Grid 1]
- A kernel is executed as a grid of thread blocks
- All threads share the global memory space
Architectural Considerations – Slide 29
Grids and Blocks
[Figure 3.2: An example of CUDA thread organization; the host launches Kernel 1 on Grid 1 (blocks (0, 0) through (1, 1)) and Kernel 2 on Grid 2; Block (1, 1) is expanded into a 4×2×2 array of threads, Thread (0, 0, 0) through Thread (3, 1, 1). Courtesy: NVIDIA]
- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution using a barrier
  - Efficiently sharing data through low-latency shared memory
- Two threads from two different blocks cannot cooperate
Architectural Considerations – Slide 30
Grids and Blocks (cont.)
- The programmer declares a (thread) block:
  - Block size: 1 to 512 concurrent threads
  - Block shape: 1D, 2D, or 3D
  - Block dimensions in threads
Architectural Considerations – Slide 31
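A sketch of declaring such a block shape and grid in CUDA C; the dimensions match the figure above, and kernelOne is the illustrative kernel from the Slide 4 sketch:

dim3 dimBlock(4, 2, 2);   // 3D block: 4 × 2 × 2 = 16 threads
dim3 dimGrid(2, 2);       // 2D grid: 2 × 2 blocks
kernelOne<<<dimGrid, dimBlock>>>(d_out, d_in, N);   // launch one grid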
CUDA Thread Block: Review
[Figure: a CUDA thread block with thread IDs 0, 1, 2, 3, …, m, all running the same thread program. Courtesy: John Nickolls, NVIDIA]
- All threads in a block execute the same thread program
- Threads share data and synchronize while doing their share of the work
- Threads have thread ID numbers within the block
- The thread program uses the thread ID to select work and address shared data
Architectural Considerations – Slide 32
CUDA Thread Block: Review (cont.)
Architectural Considerations – Slide 33
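A sketch of a thread program that uses its thread ID both to select work and to address shared data; the block-level reversal and every name here are illustrative:

// Reverse each 256-element segment of d_data in place; launch with 256-thread blocks.
__global__ void reverseBlock(float *d_data)
{
    __shared__ float buf[256];                  // low-latency per-block shared memory
    int t = threadIdx.x;                        // thread ID within the block
    int base = blockIdx.x * blockDim.x;
    buf[t] = d_data[base + t];                  // each thread selects its element by thread ID
    __syncthreads();                            // barrier: wait until the whole tile is loaded
    d_data[base + t] = buf[blockDim.x - 1 - t]; // address shared data by thread ID
}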
GeForce-8 Series Hardware Overview
[Figure: the Streaming Processor Array, a row of Texture Processor Clusters (TPC); each TPC holds a TEX unit and two Streaming Multiprocessors (SM); each SM contains instruction fetch/dispatch, instruction L1 and data L1 caches, shared memory, 8 SPs, and 2 SFUs]
- SPA: Streaming Processor Array (variable across the GeForce 8 series; 8 TPCs in the GeForce 8800)
- TPC: Texture Processor Cluster (2 SMs + TEX)
- SM: Streaming Multiprocessor (8 SPs)
  - Multithreaded processor core
  - Fundamental processing unit for a CUDA thread block
- SP: Streaming Processor
  - Scalar ALU for a single CUDA thread
Architectural Considerations – Slide 34
CUDA Processor Terminology
- Streaming Multiprocessor (SM)
  - 8 Streaming Processors (SP)
  - 2 Super Function Units (SFU)
- Multithreaded instruction dispatch
  - 1 to 512 threads active
  - Shared instruction fetch per 32 threads
  - Covers latency of texture/memory loads
Architectural Considerations – Slide 35
Streaming Multiprocessor
[Figure: Streaming Multiprocessor with instruction fetch/dispatch, instruction L1 and data L1 caches, shared memory, 8 SPs, and 2 SFUs]
- 20+ GFLOPS
- 16 KB shared memory
- Texture and global memory access
Architectural Considerations – Slide 36
Streaming Multiprocessor (cont.)
- The future of GPUs is programmable processing
- So: build the architecture around the processor
Architectural Considerations – Slide 37
G80 Thread Computing Pipeline
[Figure: the G80 graphics-mode pipeline; Host and Input Assembler feed Vtx, Geom, and Pixel thread issue (with Setup/Rstr/ZCull); the thread processor array of SP pairs, each with L1 and TF, sits above the L2/FB memory partitions]
- Processors execute computing threads
- An alternative operating mode exists specifically for computing
Architectural Considerations – Slide 38
G80 Thread Computing Pipeline (cont.)
[Figure: the G80 compute-mode pipeline; Host, Input Assembler, and Thread Execution Manager feed the SM array, each group with a parallel data cache and texture unit, and load/store paths connect to Global Memory]
- The Thread Execution Manager generates thread grids based on kernel calls
- A grid is launched on the Streaming Processor Array (SPA)
- Thread blocks are serially distributed to all the Streaming Multiprocessors (SMs)
  - Potentially more than one thread block per SM
- Each SM launches warps of threads
  - Two levels of parallelism
Architectural Considerations – Slide 39
Thread Life Cycle in Hardware
[Figure: the host launches Kernel 1 on Grid 1 (blocks (0, 0) through (2, 1)) and Kernel 2 on Grid 2; Block (1, 1) is expanded into a 5×3 array of threads]
- The SM schedules and executes warps that are ready to run
- As warps and thread blocks complete, resources are freed
  - So the SPA can distribute more thread blocks
Architectural Considerations – Slide 40
Thread Life Cycle in Hardware (cont.)
- Threads are assigned to SMs at block granularity
  - Up to 8 blocks per SM, as resources allow
  - An SM in G80 can take up to 768 threads
    - Could be 256 (threads/block) × 3 blocks
    - Or 128 (threads/block) × 6 blocks, etc.
Architectural Considerations – Slide 41
Streaming Multiprocessor Executes Blocks
[Figure: two SMs (SM 0 and SM 1), each with an MT issue unit, SPs, and shared memory, executing blocks of threads t0 t1 t2 … tm; texture L1, TF, L2, and memory sit below]
- Threads run concurrently
  - The SM assigns/maintains thread ID numbers
  - The SM manages/schedules thread execution
Architectural Considerations – Slide 42
Streaming Multiprocessor Executes Blocks (cont.)
- Each thread block is divided into 32-thread warps
  - This is an implementation decision, not part of the CUDA programming model
- Warps are the scheduling units in an SM
Architectural Considerations – Slide 43
Thread Scheduling/Execution
[Figure: Block 1 warps and Block 2 warps (each warp spanning t0 t1 t2 … t31) feeding the SM's instruction fetch/dispatch unit]
- If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  - Each block is divided into 256/32 = 8 warps
  - There are 8 × 3 = 24 warps
  - At any point in time, only one of the 24 warps will be selected for instruction fetch and execution
Architectural Considerations – Slide 44
Thread Scheduling/Execution (cont.)
- SM hardware implements zero-overhead warp scheduling
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - Eligible warps are selected for execution based on a prioritized scheduling policy
  - All threads in a warp execute the same instruction when selected
Architectural Considerations – Slide 45
Streaming Multiprocessor Warp Scheduling
[Figure: the SM multithreaded warp scheduler issuing, over time: warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, …, warp 3 instruction 96]
- Four clock cycles are needed to dispatch the same instruction for all threads in a warp in G80
- If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate 200-cycle memory latency (200 ÷ (4 instructions × 4 cycles) = 12.5, rounded up to 13)
Architectural Considerations – Slide 46
Streaming Multiprocessor Warp Scheduling (cont.)
- Fetch one warp instruction per cycle
  - From the instruction L1 cache
  - Into any instruction buffer slot
- Issue one "ready-to-go" warp instruction per cycle
  - From any warp-instruction buffer slot
  - Operand scoreboarding is used to prevent hazards
- Issue selection is based on round-robin/age of warp
- The SM broadcasts the same instruction to the 32 threads of a warp
Architectural Considerations – Slide 47
SM Instruction Buffer: Warp Scheduling
[Figure: the SM instruction and operand path: instruction L1 cache (I$ L1), multithreaded instruction buffer, register file (RF), constant cache (C$ L1), shared memory, operand select, and MAD/SFU execution units]
- All register operands of all instructions in the instruction buffer are scoreboarded
  - An instruction becomes ready after the needed values are deposited
  - This prevents hazards
  - Cleared instructions are eligible for issue
Architectural Considerations – Slide 48
Scoreboarding
[Figure: warp interleaving over time (TB = thread block, W = warp); warps TB1 W1, TB2 W1, TB3 W1, TB3 W2, TB1 W2, TB1 W3 take turns issuing instructions, and when a warp stalls (TB1 W1, TB2 W1, TB3 W2) another eligible warp is selected]
- Decoupled memory/processor pipelines
  - Any thread can continue to issue instructions until scoreboarding prevents issue
  - This allows memory/processor ops to proceed in the shadow of other waiting memory/processor ops
Architectural Considerations – Slide 49
Scoreboarding (cont.)
- For matrix multiplication, should I use 4×4, 8×8, 16×16, or 32×32 tiles?
  - For 4×4, we have 16 threads per block
  - Since each SM can take up to 768 threads, the thread capacity allows 48 blocks
  - However, each SM can only take up to 8 blocks, so there will be only 128 threads in each SM!
  - There are 8 warps, but each warp is only half full
Architectural Considerations – Slide 50
Granularity Considerations
- For 8×8, we have 64 threads per block
  - Since each SM can take up to 768 threads, it could take up to 12 blocks
  - However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM!
  - There are 16 warps available for scheduling in each SM
  - Each warp spans four slices in the y dimension
Architectural Considerations – Slide 51
Granularity Considerations (cont.)
- For 16×16, we have 256 threads per block
  - Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule
  - There are 24 warps available for scheduling in each SM
  - Each warp spans two slices in the y dimension
- For 32×32, we have 1,024 threads per block; not even one block can fit into an SM!
Architectural Considerations – Slide 52
Granularity Considerations (cont.)
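A small host-side sketch that reproduces the tile arithmetic above; the 768-thread, 8-block, and 512-threads-per-block limits are the G80 values quoted in these slides, while the helper itself is illustrative:

#include <stdio.h>

/* G80 per-SM limits quoted in the slides */
enum { MAX_THREADS_PER_SM = 768, MAX_BLOCKS_PER_SM = 8, MAX_THREADS_PER_BLOCK = 512 };

static void tileOccupancy(int tile)
{
    int threadsPerBlock = tile * tile;
    if (threadsPerBlock > MAX_THREADS_PER_BLOCK) {
        printf("%2dx%-2d: %4d threads/block, does not fit in one block\n", tile, tile, threadsPerBlock);
        return;
    }
    int blocks = MAX_THREADS_PER_SM / threadsPerBlock;          /* limit by thread capacity */
    if (blocks > MAX_BLOCKS_PER_SM) blocks = MAX_BLOCKS_PER_SM; /* limit by block slots */
    int warpsPerBlock = (threadsPerBlock + 31) / 32;            /* warps are allocated per block */
    printf("%2dx%-2d: %d blocks/SM, %3d threads/SM, %2d warps/SM\n",
           tile, tile, blocks, blocks * threadsPerBlock, blocks * warpsPerBlock);
}

int main(void)
{
    int tiles[] = { 4, 8, 16, 32 };
    for (int i = 0; i < 4; ++i) tileOccupancy(tiles[i]);
    return 0;   /* prints 8/128/8, 8/512/16, 3/768/24, and "does not fit" for 32x32 */
}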
Review: CUDA Device Memory Space
- Each thread can:
  - R/W per-thread registers and local memory
  - R/W per-block shared memory
  - R/W per-grid global memory
  - Read-only per-grid constant and texture memories
- The host can R/W the global, constant, and texture memories
Architectural Considerations – Slide 53
Topic 3: Memory Hardware in G80
[Figure: the device memory spaces for a grid: each block has shared memory; each thread has registers and local memory; global, constant, and texture memories are per-grid and accessible from the host]
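A sketch of how these spaces appear in CUDA C; the names and the arithmetic are illustrative (launch with 64-thread blocks):

__constant__ float coeff[16];          // per-grid constant memory (read-only in kernels)
__device__ float scale;                // statically allocated per-grid global memory

__global__ void memorySpaces(float *gOut)   // gOut: global memory from cudaMalloc
{
    __shared__ float tile[64];         // per-block shared memory
    int t = threadIdx.x;               // automatic variables live in registers,
    float tmp;                         // or spill to per-thread local memory
    tile[t] = coeff[t % 16] * scale;   // read constant and global memory
    __syncthreads();
    tmp = tile[(t + 1) % 64];          // read a neighbor's value from shared memory
    gOut[blockIdx.x * blockDim.x + t] = tmp;   // write per-grid global memory
}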
- Uses:
  - Inter-thread communication within a block
  - Caching data to reduce global memory accesses
  - Avoiding non-coalesced access
- Organization:
  - 16 banks, each 32 bits wide (Tesla); 32 banks, each 32 bits wide (Fermi)
  - Successive 32-bit words belong to different banks
Architectural Considerations – Slide 54
Overview: Shared Memory
- Performance:
  - 32 bits per bank per 2 clocks per multiprocessor
  - Shared memory accesses are per 16 threads (half-warp)
  - Serialization: if n threads (out of 16) access the same bank, the n accesses are executed serially
  - Broadcast: n threads accessing the same word are served in one fetch
Architectural Considerations – Slide 55
Overview: Shared Memory (cont.)
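A sketch of the serialization and broadcast rules, assuming the 16-bank layout described above; the kernel is illustrative and the shared data is left uninitialized, since only the access pattern matters:

__global__ void bankPatterns(float *out)
{
    __shared__ float s[16][16];        // row stride of 16 words = number of banks
    int t = threadIdx.x;               // consider threads 0..15 of a half-warp
    float a = s[0][t];   // conflict-free: 16 consecutive words fall in 16 different banks
    float b = s[t][0];   // 16-way conflict: addresses 16 words apart all map to bank 0,
                         // so the 16 accesses are serialized
    float c = s[5][3];   // broadcast: every thread reads the same word in one fetch
    out[t] = a + b + c;
}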
- Local memory: per-thread
  - Private per thread
  - Auto variables, register spill
- Shared memory: per-block
  - Shared by threads of the same block
  - Inter-thread communication
Architectural Considerations – Slide 56
Parallel Memory Sharing
[Figure: a thread with its local memory; a block with its shared memory]
- Global memory: per-application
  - Shared by all threads
  - Inter-grid communication
Architectural Considerations – Slide 57
Parallel Memory Sharing (cont.)
[Figure: sequential grids in time (Grid 0, Grid 1, …) communicating through global memory]
- Threads in a block share data and results
  - In memory and in shared memory
  - Synchronized at barrier instructions
Architectural Considerations – Slide 58
Streaming Multiprocessor Memory Architecture
- Per-block shared memory allocation
  - Keeps data close to the processor
  - Minimizes trips to global memory
  - Shared memory is dynamically allocated to blocks, making it one of the limiting resources
Architectural Considerations – Slide 59
Streaming Multiprocessor Memory Architecture (cont.)
- Register File (RF): 32 KB (8K entries) in each SM in G80
- The TEX pipe can also read/write the RF
  - 2 SMs share 1 TEX
- The load/store pipe can also read/write the RF
Architectural Considerations – Slide 60
Streaming Multiprocessor Register File
- There are 8,192 registers in each SM in G80
  - This is an implementation decision, not part of CUDA
  - Registers are dynamically partitioned across all blocks assigned to the SM
  - Once assigned to a block, a register is NOT accessible by threads in other blocks
  - Each thread in the same block can only access registers assigned to itself
Architectural Considerations – Slide 61
Programmer View of Register File
[Figure: the register file partitioned among 4 blocks vs. 3 blocks]
- If each block has 16 × 16 threads and each thread uses 10 registers, how many threads can run on each SM?
  - Each block requires 10 × 256 = 2,560 registers
  - 8,192 = 3 × 2,560 + change
  - So three blocks can run on an SM, as far as registers are concerned
- What if each thread increases its register use by 1?
  - Each block now requires 11 × 256 = 2,816 registers
  - 8,192 < 2,816 × 3
  - Only two blocks can run on an SM: a ⅓ reduction of thread-level parallelism!
Architectural Considerations – Slide 62
Matrix Multiplication Example
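The same register arithmetic as a host-side sketch; the 8,192-register file size is from the slides and the helper is illustrative:

#include <stdio.h>

static int blocksByRegisters(int threadsPerBlock, int regsPerThread)
{
    const int regsPerSM = 8192;        /* G80 register file per SM */
    return regsPerSM / (threadsPerBlock * regsPerThread);
}

int main(void)
{
    printf("10 regs/thread: %d blocks/SM\n", blocksByRegisters(256, 10));  /* 3 blocks */
    printf("11 regs/thread: %d blocks/SM\n", blocksByRegisters(256, 11));  /* 2 blocks */
    return 0;
}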
- Dynamic partitioning gives more flexibility to compilers/programmers
  - One can run a smaller number of threads that require many registers each, or a larger number of threads that require few registers each
  - This allows for finer-grained threading than traditional CPU threading models
  - The compiler can trade off between instruction-level parallelism (ILP) and thread-level parallelism (TLP)
Architectural Considerations – Slide 63
More on Dynamic Partitioning
- Assume a kernel has 256-thread blocks, 4 independent instructions for each global memory load in the thread program, and 10 registers per thread, with global loads taking 200 cycles; then 3 blocks can run on each SM
- If the compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load, only two blocks can run on each SM
Architectural Considerations – Slide 64
ILP vs. TLP Example
- However, one only needs 200/(8 × 4) ≈ 7 warps to tolerate the memory latency
- Two blocks have 16 warps, so performance can actually be higher!
Architectural Considerations – Slide 65
ILP vs. TLP Example (cont.)
- An increase in per-thread performance with fewer threads can mean lower overall performance
Architectural Considerations – Slide 66
Resource Allocation Example
[Figure: (a) pre-"optimization": thread contexts for three blocks (TB0, TB1, TB2) fit in the 32 KB register file alongside 16 KB shared memory, keeping SP0–SP7 and SFU0/SFU1 busy; (b) post-"optimization": insufficient registers to allocate 3 blocks; the SP-utilization area determines overall performance]
Architectural Considerations – Slide 67
Without prefetching:

    Loop {
        Load current tile to shared memory
        __syncthreads()
        Compute current tile
        __syncthreads()
    }

With prefetching:

    Load next tile from global memory
    Loop {
        Deposit current tile to shared memory
        __syncthreads()
        Load next tile from global memory
        Compute current tile
        __syncthreads()
    }
- One could double buffer the computation, getting a better instruction mix within each thread
  - This is classic software pipelining in ILP compilers
- Per iteration: deposit the blue tile from registers into shared memory, __syncthreads(), load the orange tile into registers, compute the blue tile, deposit the orange tile into shared memory, …
Architectural Considerations – Slide 68
Prefetching
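A sketch of the prefetched inner loop for the tiled matrix multiply; the 16×16 tile, the index scheme, and all names are illustrative, and width is assumed to be a multiple of the tile size:

#define TILE 16

__global__ void matMulPrefetch(float *C, const float *A, const float *B, int width)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;

    float aReg = A[row * width + tx];          // prefetch the first tile into registers
    float bReg = B[ty * width + col];

    float Ctemp = 0.0f;
    for (int m = 0; m < width / TILE; ++m) {
        As[ty][tx] = aReg;                     // deposit the current tile to shared memory
        Bs[ty][tx] = bReg;
        __syncthreads();
        if (m + 1 < width / TILE) {            // load the next tile while computing this one
            aReg = A[row * width + (m + 1) * TILE + tx];
            bReg = B[((m + 1) * TILE + ty) * width + col];
        }
        for (int k = 0; k < TILE; ++k)         // compute the current tile
            Ctemp += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }
    C[row * width + col] = Ctemp;
}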
[Figure: tiled matrix multiplication: TILE_WIDTH × TILE_WIDTH tiles of Md and Nd combine to produce the Pdsub tile of Pd; bx/by index blocks, tx/ty index threads within a tile, and WIDTH is the full matrix dimension]
- There are very few multiplications or additions between branches and address calculations:

    for (int k = 0; k < BLOCK_SIZE; ++k)
        Pvalue += Ms[ty][k] * Ns[k][tx];

- Loop unrolling can help:

    Pvalue += Ms[ty][k] * Ns[k][tx] + … + Ms[ty][k+15] * Ns[k+15][tx];

Architectural Considerations – Slide 69
Instruction Mix Considerations
Architectural Considerations – Slide 70
Tiled version:

    Ctemp = 0;
    for (…) {
        __shared__ float As[16][16];
        __shared__ float Bs[16][16];
        // load input tile elements
        As[ty][tx] = A[indexA];
        Bs[ty][tx] = B[indexB];
        indexA += 16;
        indexB += 16 * widthB;
        __syncthreads();
        // compute results for tile
        for (i = 0; i < 16; i++) {
            Ctemp += As[ty][i] * Bs[i][tx];
        }
        __syncthreads();
    }
    C[indexC] = Ctemp;

Unrolled version:

    Ctemp = 0;
    for (…) {
        __shared__ float As[16][16];
        __shared__ float Bs[16][16];
        // load input tile elements
        As[ty][tx] = A[indexA];
        Bs[ty][tx] = B[indexB];
        indexA += 16;
        indexB += 16 * widthB;
        __syncthreads();
        // compute results for tile: no branches or index
        // arithmetic between the multiply-add operations
        Ctemp += As[ty][0] * Bs[0][tx];
        …
        Ctemp += As[ty][15] * Bs[15][tx];
        __syncthreads();
    }
    C[indexC] = Ctemp;

- Removal of branch instructions and address calculations
- Does this use more registers?
Unrolling
- Long-latency operations
  - Avoid stalls by executing other threads
- Stalls and bubbles in the pipeline
  - Barrier synchronization
  - Branch divergence
- Shared resource saturation
  - Global memory bandwidth
  - Local memory capacity
Architectural Considerations – Slide 71
Major G80 Performance Detractors
- Based on original material from:
  - Jon Stokes, "PCI Express: An Overview," http://arstechnica.com/articles/paedia/hardware/pcie.ars
  - David Kirk and Wen-mei W. Hwu, The University of Illinois at Urbana-Champaign
- Revision history: last updated 10/11/2011; previous revision 9/13/2011.
Architectural Considerations – Slide 72
End Credits