1 February 11, 2004
Streaming Architectures and GPUs
Ian Buck, Bill Dally & Pat Hanrahan
Stanford University, February 11, 2004
2 February 11, 2004
To Exploit VLSI Technology We Need:
• Parallelism
  – To keep 100s of ALUs per chip (thousands/board, millions/system) busy
• Latency tolerance
  – To cover 500-cycle remote memory access time
• Locality
  – To match 20 Tb/s ALU bandwidth to ~100 Gb/s chip bandwidth
• Moore’s Law
  – Growth of transistors, not performance
Arithmetic is cheap, global bandwidth is expensive:
local << global on-chip << off-chip << global system
[Figure: a 90nm chip ($200, 1 GHz), 12mm across; a 64-bit FPU drawn to scale is only 0.5mm; 1 clock; 50 pJ/FLOP]
Courtesy of Bill Dally
3 February 11, 2004
Arithmetic Intensity
Lots of ops per word transferred
• Compute-to-bandwidth ratio
• High arithmetic intensity is desirable
  – App limited by ALU performance, not off-chip bandwidth
  – More chip real estate for ALUs, not caches
Courtesy of Pat Hanrahan
[Figure: same 90nm chip / 64-bit FPU scale diagram as on slide 2]
4 February 11, 2004
Brook: Stream Programming Model
• Enforce data-parallel computing
• Encourage arithmetic intensity
• Provide fundamental ops for stream computing
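To make the model concrete, here is a minimal Brook example in the style of the saxpy kernel from the BrookGPU papers (a sketch from memory; details of shipped Brook may differ). Streams are declared with angle brackets, and a kernel is applied to every element in parallel:

    // saxpy: result = a*x + y, computed independently per stream element
    kernel void saxpy(float a, float4 x<>, float4 y<>, out float4 result<>) {
        result = a * x + y;
    }

    void main() {
        float a = 2.0f;
        float4 X[100], Y[100], Result[100];  // ordinary memory
        float4 x<100>, y<100>, result<100>;  // streams of 100 elements
        // ... initialize X and Y ...
        streamRead(x, X);                    // copy memory -> stream
        streamRead(y, Y);
        saxpy(a, x, y, result);              // run kernel over all elements
        streamWrite(result, Result);         // copy stream -> memory
    }

The three bullets map directly onto the code: the kernel is data parallel by construction, does all its arithmetic on operands already in registers, and streamRead/streamWrite are the fundamental data-movement ops.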
5 February 11, 2004
Streams & Kernels
• Streams
  – Collections of records requiring similar computation (sketched below)
    • Vertex positions, voxels, FEM cells, …
  – Provide data parallelism
• Kernels
  – Functions applied to each element in a stream
    • Transforms, PDEs, …
    • No dependencies between stream elements
  – Encourage high arithmetic intensity
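As an illustration of a record stream, a kernel can be applied independently to each element of a stream of user-defined structs. This sketch is hypothetical (the Cell struct and advance kernel are invented for illustration), assuming Brook's struct streams and Cg-style vector types:

    // Hypothetical record type: one entry per FEM cell.
    typedef struct {
        float4 state;   // conserved variables
        float4 flux;    // accumulated flux
    } Cell;

    // Applied to every cell independently: no inter-element dependencies,
    // so all elements may run in parallel.
    kernel void advance(Cell c<>, float dt, out Cell cnew<>) {
        cnew.state = c.state + dt * c.flux;
        cnew.flux  = float4(0.0f, 0.0f, 0.0f, 0.0f);
    }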
6 February 11, 2004
Vectors vs. Streams
• Vectors:
  – v: array of floats
  – Instruction sequence: LD v0; LD v1; ADD v0, v1, v2; ST v2
  – Large set of temps
• Streams:
  – s: stream of records
  – Instruction sequence: LD s0; LD s1; CALLS f, s0, s1, s2; ST s2
  – Small set of temps
Higher arithmetic intensity: |f|/|s| >> |+|/|v| (the ops in kernel f per stream word moved far exceed one add per vector word moved)
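A toy Brook kernel (invented for illustration) makes the ratio concrete: the vector ADD above performs one op per three words moved, while a kernel can amortize one load/store pair over many ALU ops:

    // ~7 ALU ops on 1 word in / 1 word out, vs. 1 op per 3 words
    // for the vector ADD above.
    kernel void evalPoly(float x<>, out float y<>) {
        y = ((((((x * 0.5f + 1.0f) * x + 2.0f) * x + 3.0f) * x
                + 4.0f) * x + 5.0f) * x + 6.0f) * x;
    }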
7 February 11, 2004
Imagine
• Imagine
  – Stream processor for image and signal processing
  – 16mm die in 0.18µm TI process
  – 21M transistors
[Diagram: four SDRAM channels (2 GB/s) feed a Stream Register File (32 GB/s), which feeds parallel ALU clusters (544 GB/s)]
8 February 11, 2004
Merrimac Processor
• 90nm tech (1 V), ASIC technology
• 1 GHz (37 FO4)
• 128 GOPS
• Inter-cluster switch between clusters
• 127.5 mm² (small, ~12mm x 10mm)
  – Stanford Imagine is 16mm x 16mm
  – MIT Raw is 18mm x 18mm
• 25 Watts (P4 = 75 W)
  – ~41 W with memories
[Floorplan: 12.5mm x 10.2mm die. Visible blocks: microcontroller; 16 ALU clusters; 16 RDRAM interfaces; forward ECC; address generators and reorder buffers; 8 cache banks; memory switch; two MIPS64 20kc cores; 8K-word SRF banks (2.3mm x 1.6mm). Each cluster holds four 64-bit FP/INT MADD units, each with three 64-word register files (0.6mm x 0.9mm).]
9 February 11, 2004
Merrimac Streaming Supercomputer
• Node: Stream Processor (128 FPUs, 128 GFLOPS) + 16x DRDRAM (2 GBytes); 16 GBytes/s to memory; 16 GBytes/s (32+32 pairs) to the on-board network
• Board: 16 nodes (1K FPUs, 2 TFLOPS, 32 GBytes) connected by an on-board network
• Backplane: 32 boards (512 nodes, 64K FPUs, 64 TFLOPS, 1 TByte) connected by an intra-cabinet network; 64 GBytes/s (128+128 pairs) over 6" Teradyne GbX
• System: 32 backplanes joined by an inter-cabinet network; 1 TBytes/s (2K+2K links) over ribbon fiber with E/O-O/E conversion
• Bisection bandwidth: 32 TBytes/s
• All links 5 Gb/s per pair or fiber; all bandwidths are full duplex
10 February 11, 2004
Streaming Applications
• Finite volume – StreamFLO (from TFLO)
• Finite element – StreamFEM
• Molecular dynamics code (ODEs) – StreamMD
• Model (elliptic, hyperbolic, and parabolic) PDEs
• PCA applications: FFT, matrix multiply, SVD, sort
11 February 11, 2004
StreamFLO
• StreamFLO is the Brook version of FLO82, a FORTRAN code written by Prof. Jameson, for the solution of the inviscid flow around an airfoil.
• The code uses a cell centered finite volume formulation with a multigrid acceleration to solve the 2D Euler equations.
• The structure of the code is similar to TFLO and the algorithm is found in many compressible flow solvers.
12 February 11, 2004
StreamFEM
• A Brook implementation of the Discontinuous Galerkin (DG) Finite Element Method (FEM) in 2D triangulated domains.
13 February 11, 2004
StreamMD: motivation
• Application: study the folding of human proteins.
• Molecular Dynamics: computer simulation of the dynamics of macro molecules.
• Why this application?– Expect high arithmetic
intensity.– Requires variable length
neighborlists.– Molecular Dynamics can be
used in engine simulation to model spray, e.g. droplet formation and breakup, drag, deformation of droplet.
• Test case chosen for initial evaluation: box of water molecules.
[Images: DNA molecule; human immunodeficiency virus (HIV)]
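To suggest how such a computation maps onto streams, here is a hypothetical Brook sketch (not the actual StreamMD code; the kernel and its arguments are invented): atom positions are gathered through a precomputed neighbor pair list and a Lennard-Jones-style force is evaluated per pair, keeping many ALU ops per word fetched:

    // Hypothetical pairwise-force kernel (illustrative, unit-free constants).
    // pos[]  - gather array of atom positions
    // pair<> - stream of (i, j) neighbor-list index pairs
    // f<>    - output stream: force contribution of each pair
    kernel void pairForce(float4 pos[], float2 pair<>, out float3 f<>) {
        float3 d   = pos[pair.x].xyz - pos[pair.y].xyz;
        float r2   = d.x*d.x + d.y*d.y + d.z*d.z;
        float inv2 = 1.0f / r2;
        float inv6 = inv2 * inv2 * inv2;
        float s    = (48.0f * inv6 * inv6 - 24.0f * inv6) * inv2;
        f = s * d;
    }

Note the divides here are exactly the kind of operation blamed for StreamMD's lower sustained GFLOPS in the results table that follows.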
14 February 11, 2004
Summary of Application Results
Application                    | Sustained GFLOPS¹ | FP Ops / Mem Ref | LRF Refs       | SRF Refs     | Mem Refs
StreamFEM2D (Euler, quadratic) | 32.2              | 23.5             | 169.5M (93.6%) | 10.3M (5.7%) | 1.4M (0.7%)
StreamFEM2D (MHD, cubic)       | 33.5              | 50.6             | 733.3M (94.0%) | 43.8M (5.6%) | 3.2M (0.4%)
StreamMD                       | 14.2²             | 12.1             | 90.2M (97.5%)  | 1.6M (1.7%)  | 0.7M (0.8%)
StreamFLO                      | 11.4²             | 7.4              | 234.3M (95.7%) | 7.2M (2.9%)  | 3.4M (1.4%)

1. Simulated on a machine with 64 GFLOPS peak performance.
2. The low numbers are a result of many divide and square-root operations.
15 February 11, 2004
Streaming on graphics hardware?

[Chart: observed GFLOPS (0–25) from Jun-01 through Jul-03 for the GeForce FX line (NV30, NV35) and the Pentium 4]
Pentium 4 SSE theoretical*: 3 GHz × 4-wide × 0.5 inst/cycle = 6 GFLOPS

GeForce FX 5900 (NV35) fragment shader observed: MULR R0, R0, R0 at 20 GFLOPS, equivalent to a 10 GHz P4
and getting faster: 3x improvement over NV30 (6 months)
*from Intel P4 Optimization Manual
16 February 11, 2004
GPU Program Architecture
[Diagram: Input Registers → Program → Output Registers; the Program also reads Constants and Texture, and uses temporary Registers]
17 February 11, 2004
Example Program
Simple Specular and Diffuse Lighting

!!VP1.0
#
# c[0-3]  = modelview projection (composite) matrix
# c[4-7]  = modelview inverse transpose
# c[32]   = eye-space light direction
# c[33]   = constant eye-space half-angle vector (infinite viewer)
# c[35].x = pre-multiplied monochromatic diffuse light color & diffuse mat.
# c[35].y = pre-multiplied monochromatic ambient light color & diffuse mat.
# c[36]   = specular color
# c[38].x = specular power
#
# outputs homogeneous position and color
#
DP4 o[HPOS].x, c[0], v[OPOS];     # Compute position.
DP4 o[HPOS].y, c[1], v[OPOS];
DP4 o[HPOS].z, c[2], v[OPOS];
DP4 o[HPOS].w, c[3], v[OPOS];
DP3 R0.x, c[4], v[NRML];          # Compute normal.
DP3 R0.y, c[5], v[NRML];
DP3 R0.z, c[6], v[NRML];          # R0 = N' = transformed normal
DP3 R1.x, c[32], R0;              # R1.x = Ldir DOT N'
DP3 R1.y, c[33], R0;              # R1.y = H DOT N'
MOV R1.w, c[38].x;                # R1.w = specular power
LIT R2, R1;                       # Compute lighting values
MAD R3, c[35].x, R2.y, c[35].y;   # diffuse + ambient
MAD o[COL0].xyz, c[36], R2.z, R3; # + specular
END
18 February 11, 2004
Cg/HLSL: High-level language for GPUs
Specular Lighting
// Lookup the normal map
float4 normal = 2 * (tex2D(normalMap, I.texCoord0.xy) - 0.5);
// Multiply the 3x2 matrix generated from lightDir and halfAngle by the
// scaled normal, then look up the intensity map with the result.
float2 intensCoord = float2(dot(I.lightDir.xyz, normal.xyz),
                            dot(I.halfAngle.xyz, normal.xyz));
float4 intensity = tex2D(intensityMap, intensCoord);
// Lookup color
float4 color = tex2D(colorMap, I.texCoord3.xy);
// Blend/Modulate intensity with color
return color * intensity;
19 February 11, 2004
GPU: Data Parallel
• Each fragment shaded independently
  – No dependencies between fragments
  – Temporary registers are zeroed
  – No static variables
  – No read-modify-write textures
• Multiple “pixel pipes” provide data parallelism
  – Supports ALU-heavy architectures
  – Hides memory latency
[Torborg and Kajiya 96, Anderson et al. 97, Igehy et al. 98]
20 February 11, 2004
GPU: Arithmetic Intensity
Lots of ops per word transferred
Graphics pipeline:
• Vertex
  – BW: 1 triangle = 32 bytes
  – OP: 100-500 f32-ops / triangle
• Rasterization
  – Creates 16-32 fragments per triangle
• Fragment
  – BW: 1 fragment = 10 bytes
  – OP: 300-1000 i8-ops / fragment
Shader Programs
Courtesy of Pat Hanrahan
21 February 11, 2004
Streaming Architectures
[Diagram: four SDRAM channels feed a Stream Register File, which feeds parallel ALU clusters]
22 February 11, 2004
Streaming Architectures
[Diagram: as above; the fragment program (MAD R3, R1, R2; MAD R5, R2, R3;) plays the role of the Kernel Execution Unit]
23 February 11, 2004
Streaming Architectures
[Diagram: as above; the GPU’s Parallel Fragment Pipelines play the role of the parallel ALU clusters]
24 February 11, 2004
Streaming Architectures
[Diagram: as above; what plays the role of the Stream Register File?]
• Texture Cache?
• F-Buffer [Mark et al.]
25 February 11, 2004
Conclusions
• The problem is bandwidth – arithmetic is cheap
• Stream processing & architectures can provide VLSI-efficient scientific computing
  – Imagine
  – Merrimac
• GPUs are first-generation streaming architectures
  – Apply the same stream programming model for general-purpose computing on GPUs