Modified from: A Survey of General-Purpose Computation on Graphics Hardware John Owens University of California, Davis David Luebke University of Virginia


with Naga Govindaraju, Mark Harris, Jens Krüger, Aaron Lefohn, Tim Purcell

Motivation: The Potential of GPGPU

• In short:
  • The power and flexibility of GPUs make them an attractive platform for general-purpose computation
  • Example applications range from in-game physics simulation to conventional computational science
  • Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor

Problems: Difficult To Use

• GPUs designed for & driven by video games
  • Programming model unusual
  • Programming idioms tied to computer graphics
  • Programming environment tightly constrained
• Underlying architectures are:
  • Inherently parallel
  • Rapidly evolving (even in basic feature set!)
  • Largely secret
• Can’t simply “port” CPU code!

STAR Goals

• Detailed & useful survey of general-purpose computing on graphics hardware
  • Hardware and software developments behind GPGPU
  • Building blocks: techniques for mapping general-purpose computation to the GPU
  • Applications: important applications of GPGPU
  • A comprehensive GPGPU bibliography

NVIDIA GeForce 6800 3D Pipeline

[Figure: GeForce 6800 pipeline diagram with Vertex, Fragment, and Composite stages. Labeled blocks: Triangle Setup, Z-Cull, Shader Instruction Dispatch, L2 Tex, Fragment Crossbar, Memory Partition. Courtesy Nick Triantos, NVIDIA]

Programming a GPU for Graphics

• Each fragment is shaded w/ SIMD program
• Shading can use values from texture memory
• Image can be used as texture on future passes
• Application specifies geometry → rasterized

Programming a GPU for GP Programs

• Run a SIMD kernel over each fragment
• “Gather” is permitted from texture memory
• Resulting buffer can be treated as texture on next pass
• Draw a screen-sized quad → stream
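
The model on this slide can be imitated on the CPU to make the constraints concrete. The sketch below is only an illustration in plain Python, not GPU code: `run_pass`, `fetch`, and `blur_kernel` are hypothetical names, and clamping reads at the edge is an assumed addressing mode. Each output cell plays the role of one fragment, may read any input cell (gather), and may write only its own output.

```python
def run_pass(kernel, texture):
    """Apply `kernel` once per output cell (one 'fragment' each); return a new buffer."""
    height, width = len(texture), len(texture[0])

    def fetch(x, y):
        # Texture fetch ("gather"): read any input cell, clamping to the edge
        # (clamping is an assumed addressing mode for this sketch).
        x = min(max(x, 0), width - 1)
        y = min(max(y, 0), height - 1)
        return texture[y][x]

    # Each cell computes only its own output value; the input is never modified.
    return [[kernel(fetch, x, y) for x in range(width)] for y in range(height)]


def blur_kernel(fetch, x, y):
    # Average of the cell and its four axis-aligned neighbors.
    return (fetch(x, y) + fetch(x - 1, y) + fetch(x + 1, y)
            + fetch(x, y - 1) + fetch(x, y + 1)) / 5.0
```

Calling `run_pass(blur_kernel, texture)` corresponds to drawing one screen-sized quad with `blur_kernel` bound as the fragment program; the returned buffer can be passed back in as the texture for the next pass.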

Feedback

• Each algorithm step depends on the results of previous steps
• Each time step depends on the results of the previous time step
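
One common way to realize this feedback is to alternate between two buffers, reading one while writing the other and then swapping them. The following is a minimal CPU-side sketch of that idea; `step` and `simulate` are hypothetical names, and the three-cell averaging rule with wrap-around is just a placeholder kernel.

```python
def step(src, dst):
    """One pass: each cell becomes the mean of itself and its two neighbors (wrapping)."""
    n = len(src)
    for i in range(n):
        dst[i] = (src[(i - 1) % n] + src[i] + src[(i + 1) % n]) / 3.0


def simulate(initial, num_steps):
    # Two buffers swap roles every step: the pass reads `read_buf` and writes
    # `write_buf`, then the freshly written buffer becomes the next input.
    read_buf = list(initial)
    write_buf = [0.0] * len(initial)
    for _ in range(num_steps):
        step(read_buf, write_buf)
        read_buf, write_buf = write_buf, read_buf
    return read_buf
```

Keeping the input and output buffers distinct within a pass matters because every cell must see the previous step's values, not values already overwritten during the current step.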

CPU-GPU Analogies

CPU                               GPU
Array Write (Grid[i][j] = x;)  =  Render to Texture

CPU-GPU Analogies

CPU                     GPU
Stream / Data Array  =  Texture
Memory Read          =  Texture Sample

Kernels

CPU                                     GPU
Kernel / loop body / algorithm step  =  Fragment Program

Scatter vs. Gather

• Grid communication
  • Grid cells share information
• Gather
  • Indirect read from memory (x = a[i])
  • Naturally maps to a texture fetch
  • Used to access data structures and data streams
• Scatter
  • Indirect write to memory (a[i] = x)
  • Difficult to emulate:
    • Usually done on CPU
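
The difference between the two access patterns, and why scatter is awkward on gather-only hardware, can be sketched as follows. All three function names are hypothetical. `scatter_as_gather` is one naive emulation chosen for illustration (every output address searches all inputs for a writer); it is not presented as the method used in practice.

```python
def gather(a, indices):
    # out[k] = a[indices[k]]: each output reads from a computed address.
    return [a[i] for i in indices]


def scatter(values, indices, size, fill=0):
    # out[indices[k]] = values[k]: each input writes to a computed address.
    out = [fill] * size
    for v, i in zip(values, indices):
        out[i] = v
    return out


def scatter_as_gather(values, indices, size, fill=0):
    # Gather-only rewrite: each output address scans every input to see
    # whether one targets it. Cost grows as size * len(values).
    out = []
    for address in range(size):
        result = fill
        for v, i in zip(values, indices):
            if i == address:
                result = v  # last writer wins, matching scatter() above
        out.append(result)
    return out
```

The quadratic cost of the gather-only version is one way to see why the slide calls scatter difficult to emulate.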

Computational Resources Inventory

• Programmable parallel processors
  • Vertex & Fragment pipelines
• Rasterizer
  • Mostly useful for interpolating addresses (texture coordinates) and per-vertex constants
• Texture unit
  • Read-only memory interface
• Render to texture
  • Write-only memory interface

Vertex Processor

• Fully programmable (SIMD / MIMD)
• Processes 4-vectors (RGBA / XYZW)
• Capable of scatter but not gather
  • Can change the location of current vertex
  • Cannot read info from other vertices
  • Can only read a small constant memory
• Latest GPUs: Vertex Texture Fetch
  • Random access memory for vertices
  • Gather (but not from the vertex stream itself)

Fragment Processor

• Fully programmable (SIMD)
• Processes 4-component vectors (RGBA / XYZW)
• Random access memory read (textures)
• Capable of gather but not scatter
  • RAM read (texture fetch), but no RAM write
  • Output address fixed to a specific pixel
• Typically more useful than vertex processor
  • More fragment pipelines than vertex pipelines
  • Direct output (fragment processor is at end of pipeline)

Building Blocks & Applications

GPGPU Building Blocks

• Fundamental techniques & computational building blocks:
  • Flow control (a very fundamental building block)
  • Stream operations
  • Data structures
  • Differential equations & linear algebra
  • Data queries

Flow control

• Surprising number of issues on GPUs
• Main themes:
  • Avoid branching when possible
  • Move branching earlier in the pipeline when possible
  • Largely SIMD, so coherent branching is most efficient
• Mechanisms:
  • Rasterized geometry
  • Z-cull
  • Occlusion query

Domain Decomposition

• Avoid branches where outcome is fixed
  • One region is always true, another false
  • Separate fragment programs (FPs) for each region, no branches
• Example: boundaries
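
As a rough illustration of the boundary example, the sketch below runs one kernel over the interior and a different kernel over the edge strips, so neither kernel contains an "am I on the boundary?" test. The names and the fixed-value boundary rule are assumptions made for the example, and the grid is assumed to be at least 2 x 2.

```python
def interior_kernel(grid, x, y):
    # Four-neighbor average; only valid where all four neighbors exist.
    return (grid[y][x - 1] + grid[y][x + 1]
            + grid[y - 1][x] + grid[y + 1][x]) / 4.0


def boundary_kernel(grid, x, y):
    # Assumed boundary rule for this sketch: edge cells keep their value.
    return grid[y][x]


def run_region(kernel, grid, out, xs, ys):
    # Stands in for rasterizing one piece of geometry with one program bound.
    for y in ys:
        for x in xs:
            out[y][x] = kernel(grid, x, y)


def step(grid):
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    # Interior region: one quad, interior program, no branches.
    run_region(interior_kernel, grid, out, range(1, w - 1), range(1, h - 1))
    # Boundary region: top/bottom rows, then left/right columns.
    run_region(boundary_kernel, grid, out, range(w), [0, h - 1])
    run_region(boundary_kernel, grid, out, [0, w - 1], range(1, h - 1))
    return out
```

The branch is resolved once per region by the choice of geometry, rather than once per cell inside the kernel.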

Flat 3D Textures

• Advantages
  • One texture update per operation
  • Better use of GPU parallelism
  • Non-power-of-two textures
  • Quick simulation preview
• Disadvantage
  • Must compute texture offsets
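
The offset computation named as the disadvantage might look like the sketch below. It assumes the slices of the volume are laid out as tiles in row-major order inside one 2D texture; that layout, and the names `flat_coords`, `volume_coords`, and `tiles_per_row`, are assumptions for illustration.

```python
def flat_coords(x, y, z, width, height, tiles_per_row):
    """Map volume cell (x, y, z) to 2D texel coordinates (u, v).

    Slice z is stored as tile number z, with tiles laid out row-major.
    """
    tile_x = z % tiles_per_row
    tile_y = z // tiles_per_row
    return tile_x * width + x, tile_y * height + y


def volume_coords(u, v, width, height, tiles_per_row):
    """Inverse mapping: 2D texel (u, v) back to volume cell (x, y, z)."""
    z = (v // height) * tiles_per_row + (u // width)
    return u % width, v % height, z
```

Neighbors in x and y are adjacent texels within a tile, while a neighbor in z requires recomputing the tile offset, which is the extra address arithmetic the slide refers to.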

Staggered Simulation

• Non-interactive application:
  • Simulate as fast as possible
  • Frame rate suffers

[Figure: frame timeline, labeled 20 ms]

Staggered Simulation

• Interactive frame rate!
  • Simulation still proceeds pretty fast

[Figure: frame timeline, labeled 10 and 20 ms]

Z-Cull

• In early pass, modify depth buffer
  • Write depth=0 for pixels that should not be modified by later passes
  • Write depth=1 for rest
• Subsequent passes
  • Enable depth test (GL_LESS)
  • Draw full-screen quad at z=0.5
  • Only pixels with previous depth=1 will be processed
• Can also use early stencil test
• Note: Depth replace disables ZCull
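
The depth-mask logic above can be emulated on the CPU to check which cells a later pass would touch. This is only a sketch with hypothetical names. On real hardware the depth test is fixed-function and rejected fragments never run the fragment program, whereas here an `if` stands in for that test.

```python
def build_depth_mask(active):
    # Early pass: depth=1 where later passes should run, depth=0 elsewhere.
    return [[1.0 if cell else 0.0 for cell in row] for row in active]


def masked_pass(kernel, state, depth, quad_z=0.5):
    """Emulate drawing a full-screen quad at `quad_z` with a GL_LESS depth test."""
    out = [row[:] for row in state]
    processed = 0
    for y, row in enumerate(state):
        for x, value in enumerate(row):
            # GL_LESS: the fragment passes only if its depth is less than
            # the stored depth, so 0.5 < 1 passes and 0.5 < 0 fails.
            if quad_z < depth[y][x]:
                out[y][x] = kernel(value)
                processed += 1
    return out, processed
```

The `processed` count shows the saving: only cells whose stored depth is 1 pay for the kernel.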

Pre-computation

• Pre-compute anything that will not change every iteration!
• Example: arbitrary boundaries
  • When user draws boundaries, compute texture containing boundary info for cells
  • Reuse that texture until boundaries modified
  • Combine with Z-cull for higher performance!
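
A minimal sketch of the reuse pattern in the boundary example follows. The class `BoundaryTexture` and its methods are hypothetical: the boundary texture is rebuilt only after the user draws, and every simulation iteration in between reuses the cached copy.

```python
class BoundaryTexture:
    """Caches a per-cell boundary mask and rebuilds it only when boundaries change."""

    def __init__(self, width, height):
        self.width, self.height = width, height
        self._obstacles = set()
        self._texture = None
        self.rebuild_count = 0  # kept only to make the reuse observable

    def draw(self, cells):
        # User edits the boundaries: record them and mark the texture stale.
        self._obstacles |= set(cells)
        self._texture = None

    def get(self):
        # Called every iteration; does real work only if the texture is stale.
        if self._texture is None:
            self._texture = [[1 if (x, y) in self._obstacles else 0
                              for x in range(self.width)]
                             for y in range(self.height)]
            self.rebuild_count += 1
        return self._texture
```

The same cached mask could also feed the depth buffer used for Z-cull, so boundary cells are skipped entirely in later passes.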

Stream Operations

• Several stream operations in GPGPU toolkit:
  • Map: apply a function to every element in a stream
  • Reduce: use a function to reduce a stream to a smaller stream (often 1 element)
  • Scatter/gather: indirect write (scatter) and indirect read (gather)
  • Filter: select a subset of elements in a stream
  • Sort: order elements in a stream
  • Search: find a given element, nearest neighbors, etc.
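
Three of these operations can be sketched in a few lines. The function names are hypothetical, and `stream_reduce` assumes `op` is associative. It reduces in repeated passes, each combining adjacent pairs into a stream about half the size, which is one plausible shape for a multipass reduction rather than a statement of how any particular system does it.

```python
def stream_map(f, stream):
    # Map: the same function applied independently to every element.
    return [f(x) for x in stream]


def stream_filter(pred, stream):
    # Filter: keep only the elements that satisfy the predicate.
    return [x for x in stream if pred(x)]


def stream_reduce(op, stream):
    """Reduce to one element in passes; returns (result, number_of_passes)."""
    if not stream:
        raise ValueError("cannot reduce an empty stream")
    current = list(stream)
    passes = 0
    while len(current) > 1:
        # One pass: every output element combines two adjacent inputs.
        nxt = [op(current[i], current[i + 1])
               for i in range(0, len(current) - 1, 2)]
        if len(current) % 2 == 1:
            nxt.append(current[-1])  # odd element carries over unchanged
        current = nxt
        passes += 1
    return current[0], passes
```

Each pass is itself a map over the output stream with two gathers per element, so a reduction of n elements takes about log2(n) passes.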

Simple Fire Effect

Blur and scroll upward

Trails of blur emerge from bright source ‘embers’ at the bottom

[Figure: neighbor sample positions labeled VA, VB, VC, VD]

Cellular Automata

• Great for generating noise and other animated patterns to use in blending
• Game of Life in a Pixel Shader
• Cell ‘state’ relative to the rules is computed at each texel
• Dependent texture read
• State accesses ‘rules’ table, which is a texture
• Highly complex rules are easy!

The Rules

For a space that is 'populated':
  Each cell with one or no neighbors dies, as if by loneliness.
  Each cell with four or more neighbors dies, as if by overpopulation.
  Each cell with two or three neighbors survives.
For a space that is 'empty' or 'unpopulated':
  Each cell with three neighbors becomes populated.
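
The rules above can be written as a small lookup table indexed by the cell's state and its neighbor count, which mirrors the idea of a 'rules' texture read through a dependent lookup. The sketch below is a CPU illustration. `RULES` and `life_step` are hypothetical names, and wrapping at the grid edges is an assumption.

```python
# RULES[state][neighbor_count] -> next state (0 = empty, 1 = populated).
RULES = [
    [0, 0, 0, 1, 0, 0, 0, 0, 0],  # empty: populated with exactly three neighbors
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # populated: survives with two or three
]


def life_step(grid):
    """Advance the Game of Life by one generation (edges wrap around)."""
    h, w = len(grid), len(grid[0])

    def neighbor_count(x, y):
        total = 0
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dx or dy:
                    total += grid[(y + dy) % h][(x + dx) % w]
        return total

    # The computed (state, count) pair is used as an address into the table,
    # analogous to a dependent texture read.
    return [[RULES[grid[y][x]][neighbor_count(x, y)] for x in range(w)]
            for y in range(h)]
```

Changing the automaton means changing only the table, which is the sense in which complex rules stay easy.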

Lattice Computations

• How far can we take them?
  • Anything we can describe with discrete PDEs!
  • Discrete in space and time
  • Also other approximations

Approximate Methods

• Several different approximations
  • Cellular Automata (CA)
  • Coupled Map Lattice (CML)
  • Lattice-Boltzmann Methods (LBM)

Coupled Map Lattice

• Mapping:
  • Continuous state → lattice nodes
• Coupling:
  • Nodes interact with each other to produce new state according to specified rules

Coupled Map Lattice

• CML introduced by Kaneko (1980s)
  • Used CML to study spatio-temporal chaos
  • Others adapted CML to physical simulation:
    • Boiling [Yanagita 1992]
    • Convection [Yanagita 1993]
    • Clouds [Yanagita 1997; Miyazaki 2001]
    • Chemical reaction-diffusion [Kapral 1993]
    • Saltation (sand ripples / dunes) [Nishimori 1993]
    • And more

CML vs. CA

• CML extends cellular automata (CA)

         CA         CML
SPACE    Discrete   Discrete
TIME     Discrete   Discrete
STATE    Discrete   Continuous

CML vs. CA

• Continuous state is more useful
  • Discrete: physical quantities difficult
    • Must filter over many nodes to get “real” values
  • Continuous: physical quantities easy
    • Real physical values at each node
    • Temperature, velocity, concentration, etc.

Rules?

• CML updated via simple, local rules
  • Simple: same rule applied at every cell (SIMD)
  • Local: cells updated according to some function of their neighbors’ state

Example: Buoyancy

• Used in temperature-based boiling simulation
• At each cell:
  • If neighbors to left and right of cell are warmer, raise the cell’s temperature
  • If neighbors are cooler, lower its temperature
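
A literal reading of this rule can be sketched as below. The slide gives only the direction of the change, so the fixed step `delta`, the edge clamping, and the choice to leave a cell unchanged when its neighbors disagree are all assumptions. This is not the buoyancy operator of any cited simulation.

```python
def buoyancy_step(temps, delta=0.5):
    """Apply the left/right comparison rule to every cell of a 2D grid."""
    out = []
    for row in temps:
        w = len(row)
        new_row = []
        for x, t in enumerate(row):
            left = row[max(x - 1, 0)]        # clamp at the left edge
            right = row[min(x + 1, w - 1)]   # clamp at the right edge
            if left > t and right > t:
                new_row.append(t + delta)    # both neighbors warmer: raise
            elif left < t and right < t:
                new_row.append(t - delta)    # both neighbors cooler: lower
            else:
                new_row.append(t)            # mixed case: assumed unchanged
        out.append(new_row)
    return out
```

Every cell reads only the previous state and writes only itself, so the rule fits the gather-only fragment model directly.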

CML Operations

• Implement operations as building blocks for use in multiple simulations
  • Diffusion
  • Buoyancy (2 types)
  • Latent Heat
  • Advection
  • Viscosity / Pressure
  • Gray-Scott Chemical Reaction
  • Boundary Conditions
  • User interaction (drawing)
  • Transfer function (color gradient)

Anatomy of a CML operation

• Neighbor Sampling
  • Select and read values, v, of nearby cells
• Computation on Neighbors
  • Compute f(v) for each sample (f can be arbitrary computation)
• Combine new values (arithmetic)
• Store new values back in lattice
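
The four stages above map onto a small generic routine. The sketch below uses hypothetical names (`cml_operation`, `diffusion`, `rate`) and wrap-around edges as an assumption. The diffusion operator shown, which moves each cell toward the mean of its four neighbors, is one simple instance and not necessarily the formula used in the original simulations.

```python
def cml_operation(lattice, offsets, f, combine):
    """Apply one CML operation to every node and return the new lattice."""
    h, w = len(lattice), len(lattice[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # 1. Neighbor sampling: read values v of nearby cells.
            samples = [lattice[(y + dy) % h][(x + dx) % w] for dx, dy in offsets]
            # 2. Computation on neighbors: f(v) for each sample.
            values = [f(v) for v in samples]
            # 3. Combine the new values, then 4. store back in the lattice.
            out[y][x] = combine(lattice[y][x], values)
    return out


def diffusion(lattice, rate=0.2):
    # Each node moves a fraction `rate` toward the mean of its four neighbors.
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    return cml_operation(
        lattice, offsets,
        f=lambda v: v,
        combine=lambda center, vals: center + rate * (sum(vals) / len(vals) - center),
    )
```

Other operations from the building-block list would reuse `cml_operation` with different `offsets`, `f`, and `combine`.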

Graphics Hardware

• Why use it?
  • Speed: up to 25x speedup in our sims
  • GPU perf. grows faster than CPU perf.
  • Cheap: GeForce 4 Ti 4200 < $130
  • Load balancing in complex applications
• Why not use it?
  • Low precision computation (not anymore!)
  • Difficult to program (not anymore!)

Hardware Implementation (GF4)

Simulating the world

• Simulate a wide variety of phenomena on GPUs
  • Anything we can describe with discrete PDEs or approximations of PDEs
