1 February 11, 2004
Streaming Architectures and GPUs
Ian Buck, Bill Dally & Pat Hanrahan
Stanford University, February 11, 2004
2 February 11, 2004
To Exploit VLSI Technology We Need:
• Parallelism
  – To keep 100s of ALUs per chip (thousands/board, millions/system) busy
• Latency tolerance
  – To cover 500-cycle remote memory access time
• Locality
  – To match 20 Tb/s ALU bandwidth to ~100 Gb/s chip bandwidth
• Moore’s Law
  – Growth of transistors, not performance
Arithmetic is cheap, global bandwidth is expensive:
local << global on-chip << off-chip << global system
[Figure: a 90nm chip ($200, 1 GHz), 12mm across; a 64-bit FPU drawn to scale is only 0.5mm; 1 clock; 50 pJ/FLOP]
Courtesy of Bill Dally
3 February 11, 2004
Arithmetic Intensity
Lots of ops per word transferred
• Compute-to-bandwidth ratio
• High arithmetic intensity is desirable
  – App limited by ALU performance, not off-chip bandwidth
  – More chip real estate for ALUs, not caches
Courtesy of Pat Hanrahan
[Figure: same 90nm chip / 64-bit FPU scale diagram as on slide 2]
4 February 11, 2004
Brook: Stream Programming Model
• Enforce data-parallel computing
• Encourage arithmetic intensity
• Provide fundamental ops for stream computing
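To make the model concrete, here is a minimal Brook example in the style of the saxpy kernel from the BrookGPU papers (a sketch from memory; details of shipped Brook may differ). Streams are declared with angle brackets, and a kernel is applied to every element in parallel:

    // saxpy: result = a*x + y, computed independently per stream element
    kernel void saxpy(float a, float4 x<>, float4 y<>, out float4 result<>) {
        result = a * x + y;
    }

    void main() {
        float a = 2.0f;
        float4 X[100], Y[100], Result[100];  // ordinary memory
        float4 x<100>, y<100>, result<100>;  // streams of 100 elements
        // ... initialize X and Y ...
        streamRead(x, X);                    // copy memory -> stream
        streamRead(y, Y);
        saxpy(a, x, y, result);              // run kernel over all elements
        streamWrite(result, Result);         // copy stream -> memory
    }

The three bullets map directly onto the code: the kernel is data parallel by construction, does all its arithmetic on operands already in registers, and streamRead/streamWrite are the fundamental data-movement ops.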
5 February 11, 2004
Streams & Kernels
• Streams
  – Collections of records requiring similar computation (sketched below)
    • Vertex positions, voxels, FEM cells, …
  – Provide data parallelism
• Kernels
  – Functions applied to each element in a stream
    • Transforms, PDEs, …
    • No dependencies between stream elements
  – Encourage high arithmetic intensity
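As an illustration of a record stream, a kernel can be applied independently to each element of a stream of user-defined structs. This sketch is hypothetical (the Cell struct and advance kernel are invented for illustration), assuming Brook's struct streams and Cg-style vector types:

    // Hypothetical record type: one entry per FEM cell.
    typedef struct {
        float4 state;   // conserved variables
        float4 flux;    // accumulated flux
    } Cell;

    // Applied to every cell independently: no inter-element dependencies,
    // so all elements may run in parallel.
    kernel void advance(Cell c<>, float dt, out Cell cnew<>) {
        cnew.state = c.state + dt * c.flux;
        cnew.flux  = float4(0.0f, 0.0f, 0.0f, 0.0f);
    }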
6 February 11, 2004
Vectors vs. Streams
• Vectors:
  – v: array of floats
  – Instruction sequence: LD v0; LD v1; ADD v0, v1, v2; ST v2
  – Large set of temps
• Streams:
  – s: stream of records
  – Instruction sequence: LD s0; LD s1; CALLS f, s0, s1, s2; ST s2
  – Small set of temps
Higher arithmetic intensity: |f|/|s| >> |+|/|v| (the ops in kernel f per stream word moved far exceed one add per vector word moved)
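A toy Brook kernel (invented for illustration) makes the ratio concrete: the vector ADD above performs one op per three words moved, while a kernel can amortize one load/store pair over many ALU ops:

    // ~7 ALU ops on 1 word in / 1 word out, vs. 1 op per 3 words
    // for the vector ADD above.
    kernel void evalPoly(float x<>, out float y<>) {
        y = ((((((x * 0.5f + 1.0f) * x + 2.0f) * x + 3.0f) * x
                + 4.0f) * x + 5.0f) * x + 6.0f) * x;
    }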
7 February 11, 2004
Imagine
• Imagine
  – Stream processor for image and signal processing
  – 16mm die in 0.18µm TI process
  – 21M transistors
[Diagram: four SDRAM channels (2 GB/s) feed a Stream Register File (32 GB/s), which feeds parallel ALU clusters (544 GB/s)]
8 February 11, 2004
Merrimac Processor
• 90nm tech (1 V), ASIC technology
• 1 GHz (37 FO4)
• 128 GOPS
• Inter-cluster switch between clusters
• 127.5 mm² (small, ~12mm x 10mm)
  – Stanford Imagine is 16mm x 16mm
  – MIT Raw is 18mm x 18mm
• 25 Watts (P4 = 75 W)
  – ~41 W with memories
[Floorplan: 12.5mm x 10.2mm die. Visible blocks: microcontroller; 16 ALU clusters; 16 RDRAM interfaces; forward ECC; address generators and reorder buffers; 8 cache banks; memory switch; two MIPS64 20kc cores; 8K-word SRF banks (2.3mm x 1.6mm). Each cluster holds four 64-bit FP/INT MADD units, each with three 64-word register files (0.6mm x 0.9mm).]
9 February 11, 2004
Merrimac Streaming Supercomputer
• Node: Stream Processor (128 FPUs, 128 GFLOPS) + 16x DRDRAM (2 GBytes); 16 GBytes/s to memory; 16 GBytes/s (32+32 pairs) to the on-board network
• Board: 16 nodes (1K FPUs, 2 TFLOPS, 32 GBytes) connected by an on-board network
• Backplane: 32 boards (512 nodes, 64K FPUs, 64 TFLOPS, 1 TByte) connected by an intra-cabinet network; 64 GBytes/s (128+128 pairs) over 6" Teradyne GbX
• System: 32 backplanes joined by an inter-cabinet network; 1 TBytes/s (2K+2K links) over ribbon fiber with E/O-O/E conversion
• Bisection bandwidth: 32 TBytes/s
• All links 5 Gb/s per pair or fiber; all bandwidths are full duplex
10 February 11, 2004
Streaming Applications
• Finite volume – StreamFLO (from TFLO)
• Finite element – StreamFEM
• Molecular dynamics code (ODEs) – StreamMD
• Model (elliptic, hyperbolic, and parabolic) PDEs
• PCA applications: FFT, matrix multiply, SVD, sort
11 February 11, 2004
StreamFLO
• StreamFLO is the Brook version of FLO82, a FORTRAN code written by Prof. Jameson, for the solution of the inviscid flow around an airfoil.
• The code uses a cell centered finite volume formulation with a multigrid acceleration to solve the 2D Euler equations.
• The structure of the code is similar to TFLO and the algorithm is found in many compressible flow solvers.
12 February 11, 2004
StreamFEM
• A Brook implementation of the Discontinuous Galerkin (DG) Finite Element Method (FEM) in 2D triangulated domains.
13 February 11, 2004
StreamMD: motivation
• Application: study the folding of human proteins.
• Molecular Dynamics: computer simulation of the dynamics of macro molecules.
• Why this application?– Expect high arithmetic
intensity.– Requires variable length
neighborlists.– Molecular Dynamics can be
used in engine simulation to model spray, e.g. droplet formation and breakup, drag, deformation of droplet.
• Test case chosen for initial evaluation: box of water molecules.
[Images: DNA molecule; human immunodeficiency virus (HIV)]
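To suggest how such a computation maps onto streams, here is a hypothetical Brook sketch (not the actual StreamMD code; the kernel and its arguments are invented): atom positions are gathered through a precomputed neighbor pair list and a Lennard-Jones-style force is evaluated per pair, keeping many ALU ops per word fetched:

    // Hypothetical pairwise-force kernel (illustrative, unit-free constants).
    // pos[]  - gather array of atom positions
    // pair<> - stream of (i, j) neighbor-list index pairs
    // f<>    - output stream: force contribution of each pair
    kernel void pairForce(float4 pos[], float2 pair<>, out float3 f<>) {
        float3 d   = pos[pair.x].xyz - pos[pair.y].xyz;
        float r2   = d.x*d.x + d.y*d.y + d.z*d.z;
        float inv2 = 1.0f / r2;
        float inv6 = inv2 * inv2 * inv2;
        float s    = (48.0f * inv6 * inv6 - 24.0f * inv6) * inv2;
        f = s * d;
    }

Note the divides here are exactly the kind of operation blamed for StreamMD's lower sustained GFLOPS in the results table that follows.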
14 February 11, 2004
Summary of Application Results
Application                    | Sustained GFLOPS¹ | FP Ops / Mem Ref | LRF Refs       | SRF Refs     | Mem Refs
StreamFEM2D (Euler, quadratic) | 32.2              | 23.5             | 169.5M (93.6%) | 10.3M (5.7%) | 1.4M (0.7%)
StreamFEM2D (MHD, cubic)       | 33.5              | 50.6             | 733.3M (94.0%) | 43.8M (5.6%) | 3.2M (0.4%)
StreamMD                       | 14.2²             | 12.1             | 90.2M (97.5%)  | 1.6M (1.7%)  | 0.7M (0.8%)
StreamFLO                      | 11.4²             | 7.4              | 234.3M (95.7%) | 7.2M (2.9%)  | 3.4M (1.4%)

1. Simulated on a machine with 64 GFLOPS peak performance.
2. The low numbers are a result of many divide and square-root operations.
15 February 11, 2004
Streaming on graphics hardware?

[Chart: observed GFLOPS (0–25) from Jun-01 through Jul-03 for the GeForce FX line (NV30, NV35) and the Pentium 4]
Pentium 4 SSE theoretical*: 3 GHz × 4-wide × 0.5 inst/cycle = 6 GFLOPS

GeForce FX 5900 (NV35) fragment shader observed: MULR R0, R0, R0 at 20 GFLOPS, equivalent to a 10 GHz P4
and getting faster: 3x improvement over NV30 (6 months)
*from Intel P4 Optimization Manual
16 February 11, 2004
GPU Program Architecture
[Diagram: Input Registers → Program → Output Registers; the Program also reads Constants and Texture, and uses temporary Registers]
17 February 11, 2004
Example Program
Simple Specular and Diffuse Lighting

!!VP1.0
#
# c[0-3]  = modelview projection (composite) matrix
# c[4-7]  = modelview inverse transpose
# c[32]   = eye-space light direction
# c[33]   = constant eye-space half-angle vector (infinite viewer)
# c[35].x = pre-multiplied monochromatic diffuse light color & diffuse mat.
# c[35].y = pre-multiplied monochromatic ambient light color & diffuse mat.
# c[36]   = specular color
# c[38].x = specular power
#
# outputs homogeneous position and color
#
DP4 o[HPOS].x, c[0], v[OPOS];     # Compute position.
DP4 o[HPOS].y, c[1], v[OPOS];
DP4 o[HPOS].z, c[2], v[OPOS];
DP4 o[HPOS].w, c[3], v[OPOS];
DP3 R0.x, c[4], v[NRML];          # Compute normal.
DP3 R0.y, c[5], v[NRML];
DP3 R0.z, c[6], v[NRML];          # R0 = N' = transformed normal
DP3 R1.x, c[32], R0;              # R1.x = Ldir DOT N'
DP3 R1.y, c[33], R0;              # R1.y = H DOT N'
MOV R1.w, c[38].x;                # R1.w = specular power
LIT R2, R1;                       # Compute lighting values
MAD R3, c[35].x, R2.y, c[35].y;   # diffuse + ambient
MAD o[COL0].xyz, c[36], R2.z, R3; # + specular
END
18 February 11, 2004
Cg/HLSL: High-level language for GPUs
Specular Lighting
// Lookup the normal map
float4 normal = 2 * (tex2D(normalMap, I.texCoord0.xy) - 0.5);
// Multiply the 3x2 matrix generated from lightDir and halfAngle by the
// scaled normal, then look up the intensity map with the result.
float2 intensCoord = float2(dot(I.lightDir.xyz, normal.xyz),
                            dot(I.halfAngle.xyz, normal.xyz));
float4 intensity = tex2D(intensityMap, intensCoord);
// Lookup color
float4 color = tex2D(colorMap, I.texCoord3.xy);
// Blend/Modulate intensity with color
return color * intensity;
19 February 11, 2004
GPU: Data Parallel
• Each fragment shaded independently
  – No dependencies between fragments
  – Temporary registers are zeroed
  – No static variables
  – No read-modify-write textures
• Multiple “pixel pipes” provide data parallelism
  – Supports ALU-heavy architectures
  – Hides memory latency
[Torborg and Kajiya 96, Anderson et al. 97, Igehy et al. 98]
20 February 11, 2004
GPU: Arithmetic Intensity
Lots of ops per word transferred
Graphics pipeline:
• Vertex
  – BW: 1 triangle = 32 bytes
  – OP: 100-500 f32-ops / triangle
• Rasterization
  – Creates 16-32 fragments per triangle
• Fragment
  – BW: 1 fragment = 10 bytes
  – OP: 300-1000 i8-ops / fragment
Shader Programs
Courtesy of Pat Hanrahan
21 February 11, 2004
Streaming Architectures
[Diagram: four SDRAM channels feed a Stream Register File, which feeds parallel ALU clusters]
22 February 11, 2004
Streaming Architectures
[Diagram: as above; the fragment program (MAD R3, R1, R2; MAD R5, R2, R3;) plays the role of the Kernel Execution Unit]
23 February 11, 2004
Streaming Architectures
[Diagram: as above; the GPU’s Parallel Fragment Pipelines play the role of the parallel ALU clusters]
24 February 11, 2004
Streaming Architectures
[Diagram: as above; what plays the role of the Stream Register File?]
• Texture Cache?
• F-Buffer [Mark et al.]
25 February 11, 2004
Conclusions
• The problem is bandwidth – arithmetic is cheap
• Stream processing & architectures can provide VLSI-efficient scientific computing
  – Imagine
  – Merrimac
• GPUs are first-generation streaming architectures
  – Apply the same stream programming model for general-purpose computing on GPUs