Parallel Graphics APIs Gregory S. Johnson [email protected]

Parallel Graphics APIs

Gregory S. Johnson ([email protected], cs.utexas.edu)

Topics

• Problem: Host / Graphics Performance Mismatch
• Conventional Solutions
• Parallelism
• IRIS Performer (Rohlf and Helman, 1994)
• Stanford Parallel API (Igehy et al., 1998)

Problem

• graphics subsystems can process graphics primitives faster than a single-CPU host can deliver the related sequence of commands
• when a single-CPU host is busy with non-graphics related tasks (I/O, OS, etc.), the graphics subsystem idles

[Figure labels: OpenGL command issued; OpenGL* command processed]

Bottlenecks (Igehy et al.)

• overhead associated with encoding API commands
• data bandwidth from the API host
• data bandwidth into the graphics subsystem
• overhead associated with decoding API commands

Solution Directions

• utilize the given resources more effectively
• add more hardware resources

Conventional Solutions: Packed Primitive Arrays

• arrays of primitives stored in system memory which can be issued to the graphics system via a small number of API calls
• the use of primitive arrays can result in reduced API overhead and increased bandwidth utilization via DMA

    glVertexPointer(2, GL_FLOAT, 0, verts);
    glEnableClientState(GL_VERTEX_ARRAY);
    glColorPointer(3, GL_FLOAT, 0, colors);
    glEnableClientState(GL_COLOR_ARRAY);

    /* "strip" points to an array of indices with triangle strip connectivity, */
    /* based on the vertices in the "verts" array */
    glDrawElements(GL_TRIANGLE_STRIP, length, GL_UNSIGNED_INT, strip);

Conventional Solutions: Display Lists

• a display list is a set of graphics commands (low level equivalents) stored on the graphics subsystem and typically used as a macro
• useful in cases where geometry in a scene is drawn repeatedly
• even more useful if the geometry fits on the graphics card itself

    /* create a "vane" for the tail of the arrow */
    glNewList(VANE, GL_COMPILE);
        glBegin(GL_QUADS);
            glColor3f(1.0, 1.0, 1.0);
            glVertex3fv(v1); glVertex3fv(v2);
            glVertex3fv(v3); glVertex3fv(v4);
            ...
        glEnd();
    glEndList();

Conventional Solutions: Compression

GL_SUNX_geometry_compression

• encoding of scene geometry by the host CPU and decoding by the graphics subsystem
• compression of graphics data can reduce inter-subsystem bandwidth requirements at the expense of decoding time

Parallelism

• inherent parallelism: not all graphics-related commands need be issued in strict order (e.g. drawing opaque primitives on Z-buffer equipped hardware)
• parallelism to cover latency: the graphics subsystem is faster at processing commands than the host CPU is at generating them

[Figure labels: OpenGL commands issued; OpenGL* commands processed]

Terminology

• context is the scope within which graphics state is affected by the graphics commands issued (in some sense, a binding between graphics state and issued graphics commands)

IRIS Performer: A High Performance Multiprocessing Toolkit for Real-Time 3D Graphics

John Rohlf, James Helman
Silicon Graphics Computer Systems (1994)

Summary

• discusses the design and implementation of a pair of libraries for developing high performance graphics applications easily
• a low-level library to provide high performance rendering via specialized graphics primitives and efficient state management
• a high-level library for multiprocessing which utilizes pipeline parallelism for traversing, culling, and issuing elements of a hierarchically organized scene graph

A Tale of Two Libraries

• libpr provides efficient graphics primitives, state management, and basic mechanisms in support of efficient rendering
• libpf provides support for multiprocessing and hierarchical organization of scene elements

libpr: pfGeoSet

• a “primitive array”-like data structure which holds homogeneous graphics primitives and the associated color, normal, and texture coordinate data

libpr: State Management

• libpr provides 3 mechanisms for setting graphics state
• immediate mode: a “state stack” helps reduce unnecessary state changes and is typically used to set global state
• display list mode: typically used by libpf to capture a full frame’s worth of data for purposes of multiprocessing
• encapsulated mode: motivated by the observation that most state applies to the bulk of a scene; is typically used to tie a small number of state changes to specific geometry
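One way to picture how a state stack can avoid unnecessary state changes is the C sketch below. It is an illustration under stated assumptions, not libpr's actual interface: the GfxState and StateStack types, the ss_* functions, and the two-field state are all hypothetical. The stack tracks what the application wants, a shadow copy tracks what the hardware is believed to hold, and a change is counted as issued only when the two differ.

```c
#include <assert.h>

#define STATE_STACK_DEPTH 16

/* Hypothetical, drastically reduced graphics state. */
typedef struct {
    int texture_id;
    int lighting_on;
} GfxState;

typedef struct {
    GfxState stack[STATE_STACK_DEPTH]; /* requested state, innermost on top */
    int top;                           /* index of the current entry */
    GfxState hw;                       /* state the hardware is believed to hold */
    int changes_issued;                /* state changes actually sent */
} StateStack;

void ss_init(StateStack *s, GfxState initial)
{
    s->stack[0] = initial;
    s->top = 0;
    s->hw = initial;
    s->changes_issued = 0;
}

/* Start a new scope that inherits the current state. */
void ss_push(StateStack *s)
{
    assert(s->top + 1 < STATE_STACK_DEPTH);
    s->stack[s->top + 1] = s->stack[s->top];
    s->top++;
}

/* Leave the current scope; the enclosing state becomes current again. */
void ss_pop(StateStack *s)
{
    assert(s->top > 0);
    s->top--;
}

void ss_set_texture(StateStack *s, int id)
{
    s->stack[s->top].texture_id = id;
}

/* Before drawing: send only the fields that differ from the hardware copy. */
void ss_apply(StateStack *s)
{
    const GfxState *want = &s->stack[s->top];
    if (want->texture_id != s->hw.texture_id) {
        s->hw.texture_id = want->texture_id;   /* a real texture bind would go here */
        s->changes_issued++;
    }
    if (want->lighting_on != s->hw.lighting_on) {
        s->hw.lighting_on = want->lighting_on; /* a real lighting toggle would go here */
        s->changes_issued++;
    }
}
```

In this sketch, setting a texture that is already bound costs nothing at draw time, and popping a scope restores the enclosing state with a single change on the next ss_apply call.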

libpr: Multiprocessing Support

• libpr doesn’t implement multiprocessing itself
• libpr does provide support for shared data including synchronized access
• includes “multibuffered” arrays which can be thought of as multiple copies of an array, each at a different stage of processing
• multibuffering solves the problems of data exclusion and synchronization
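The idea of multiple copies of an array, each at a different stage of processing, can be sketched in C as follows. This is a minimal sketch and not Performer's implementation: the MultiBuffer type, the mb_* functions, the three-stage count, and the copy-forward policy at frame boundaries are all assumptions made for illustration. Each stage reads or writes only the copy it owns during a frame, so no locking is needed within a frame.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MB_STAGES 3 /* assumed stages: 0 = APP, 1 = CULL, 2 = DRAW */
#define MB_LEN    4 /* elements per copy, kept tiny for illustration */

/* Hypothetical multibuffered array: one private copy per pipeline stage. */
typedef struct {
    float  copy[MB_STAGES][MB_LEN];
    size_t frame; /* number of frame boundaries crossed so far */
} MultiBuffer;

/* Index of the copy that the given stage owns during the current frame. */
size_t mb_index(const MultiBuffer *mb, size_t stage)
{
    assert(stage < MB_STAGES);
    return (mb->frame + MB_STAGES - stage) % MB_STAGES;
}

/* Pointer to the data the given stage may touch this frame. */
float *mb_data(MultiBuffer *mb, size_t stage)
{
    return mb->copy[mb_index(mb, stage)];
}

/* Frame boundary: every copy moves one stage downstream. The copy that
 * DRAW has finished with is recycled for APP and seeded with APP's
 * previous contents, so APP only needs to write what changed. */
void mb_advance(MultiBuffer *mb)
{
    size_t prev_app = mb_index(mb, 0);
    mb->frame++;
    memcpy(mb->copy[mb_index(mb, 0)], mb->copy[prev_app], sizeof mb->copy[0]);
}
```

With this layout, data written by APP in frame n is seen by CULL in frame n+1 and by DRAW in frame n+2, while APP is already free to modify its own copy for later frames.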

libpf: Scene Graphs

• libpf organizes scene elements into scene graphs, for increased modeling, access, and processing efficiency
• a scene graph is a tree-like structure containing nodes which correspond to geometry, lights, cameras, coloration, texture, transformations, etc.

libpf: Scene Graph Hierarchy

• scene graphs promote top-down state inheritance
• the top-down inheritance restriction enables parallel traversal and processing of the scene graph tree
• scene graphs also encode a hierarchy of bounding volumes, simplifying intersection testing and culling

libpf: Scene Graph Traversal

• intersection traversal: application-driven collision detection
• culling traversal: precedes drawing traversals, culling geometry whose bounding spheres fall outside of the view frustum, and placing the remaining geometry in a (possibly sorted) display list
• draw traversal: traverses the display list generated during the culling phase and issues the appropriate commands to the graphics subsystem
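The culling traversal can be sketched in C as a recursive walk that prunes any subtree whose bounding sphere lies entirely outside one of the frustum planes. This is a minimal sketch, not Performer's code: the Plane, Sphere, Node, and DrawList types and the sphere_outside and cull functions are hypothetical names, the planes are assumed to have unit-length normals pointing into the frustum, and the draw list simply records integer geometry ids in traversal order without sorting.

```c
#include <assert.h>
#include <stddef.h>

#define DRAW_LIST_MAX 64

/* Plane with unit normal (nx, ny, nz); a point p is inside when n.p + d >= 0. */
typedef struct { float nx, ny, nz, d; } Plane;

typedef struct { float x, y, z, r; } Sphere;

/* Hypothetical scene graph node. */
typedef struct Node {
    Sphere        bound;       /* encloses this node and all of its descendants */
    int           geoset_id;   /* >= 0 for leaf geometry, -1 for group nodes */
    struct Node **children;
    size_t        child_count;
} Node;

/* Output of the cull traversal, consumed later by the draw traversal. */
typedef struct {
    int    ids[DRAW_LIST_MAX];
    size_t count;
} DrawList;

/* Returns 1 when the sphere is completely outside at least one plane. */
int sphere_outside(const Sphere *s, const Plane *planes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float dist = planes[i].nx * s->x + planes[i].ny * s->y
                   + planes[i].nz * s->z + planes[i].d;
        if (dist < -s->r)
            return 1;
    }
    return 0;
}

/* Depth-first cull: skip invisible subtrees, append visible geometry. */
void cull(const Node *node, const Plane *planes, size_t n, DrawList *out)
{
    if (sphere_outside(&node->bound, planes, n))
        return; /* the whole subtree is culled with a single test */
    if (node->geoset_id >= 0 && out->count < DRAW_LIST_MAX)
        out->ids[out->count++] = node->geoset_id;
    for (size_t i = 0; i < node->child_count; i++)
        cull(node->children[i], planes, n, out);
}
```

Because a parent's sphere encloses its children, one failed test at a group node removes every descendant from consideration, which is where the bounding volume hierarchy pays off.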

libpf: Optimizations

• pfFlatten: reduce the number of transformations
• pfLOD: level-of-detail based on geometry of varying complexity
• pfSequence: animated sequences
• pfBillboard: special representation of axially symmetric shapes

libpf: Multiprocessing

• a pipelined approach to multiprocessing, whereby different processors execute different stages of the APP -> CULL -> DRAW and APP -> ISECT pipelines

The Design of a Parallel Graphics Interface

Homan Igehy, Gordon Stoll, Pat Hanrahan
Stanford University (1998)

Summary

• discuss several issues (state, mode, order) influencing the design of a graphics API
• propose a swank parallel API composed of a small number of extensions to OpenGL
• present an implementation of the API within a custom software graphics pipeline
• examine the performance of the implementation applied to a pair of graphics-related applications

Parallelism via Existing OpenGL Constructs

• consider a pair of application threads each with its own graphics context, issuing OpenGL commands for a single framebuffer
• recall that a stream of OpenGL commands is issued by the host CPU(s) and later executed by the graphics subsystem

Thread 1

    DrawPrimitives(opaq[1..256])
    appBarrier(appBarrierVar)
    DrawPrimitives(tran[1..256])
    glFinish()
    appBarrier(appBarrierVar)

Thread 2

    DrawPrimitives(opaq[257..512])
    glFinish()
    appBarrier(appBarrierVar)
    appBarrier(appBarrierVar)
    DrawPrimitives(tran[257..512])

Addition of a Wait Construct

• glFinish() commands force the issuing threads to wait for the previously issued graphics commands to complete (on the graphics subsystem)
• but synchronization between the application threads in this example is only needed to ensure that the graphics commands are issued in order

Thread 1

    DrawPrimitives(opaq[1..256])
    appBarrier(appBarrierVar)
    glpWaitContext(Thread2Ctx)
    DrawPrimitives(tran[1..256])
    appBarrier(appBarrierVar)

Thread 2

    DrawPrimitives(opaq[257..512])
    appBarrier(appBarrierVar)
    appBarrier(appBarrierVar)
    glpWaitContext(Thread1Ctx)
    DrawPrimitives(tran[257..512])

Improved Synchronization

• synchronization of the graphics command streams in the previous example is performed by the application threads, stalling them
• graphics subsystem-level barrier (many-to-many) and semaphore (point-to-point) synchronization mechanisms are introduced

Thread 1

    DrawPrimitives(opaq[1..256])
    glpBarrier(glpBarrierVar)
    DrawPrimitives(tran[1..256])
    glpBarrier(glpBarrierVar)

Thread 2

    DrawPrimitives(opaq[257..512])
    glpBarrier(glpBarrierVar)
    glpBarrier(glpBarrierVar)
    DrawPrimitives(tran[257..512])

Example: Marching Cubes

Serial

    for (i=0; i<M; i++)
      for (j=0; j<N; j++)
        ExtractAndRender(grid[i,j])

Parallel (Unordered)

    for (i=0; i<M; i++)
      for (j=(myProc+i)%P; j<N; j+=P)
        ExtractAndRender(grid[i,j])

Parallel (Ordered)

    for (i=0; i<M; i++)
      for (j=(myProc+i)%P; j<N; j+=P)
        if (i>0) glpPSema(sema[i-1,j])
        if (j>0) glpPSema(sema[i,j-1])
        ExtractAndRender(grid[i,j])
        if (i<M-1) glpVSema(sema[i,j])
        if (j<N-1) glpVSema(sema[i,j])
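The ordering that the semaphores enforce in the ordered loop can be checked with a small single-threaded simulation in C. This is a sketch under stated assumptions, not the Argus implementation: the semaphores are modeled as plain integer counters, the command streams are interleaved by a hypothetical round-robin scheduler, the two P operations of a cell are tested together for simplicity, and all names (GRID_M, GRID_N, NUM_STREAMS, run_ordered, order_is_valid) are invented for illustration. A stream whose next cell is still waiting on a semaphore is skipped until another stream signals it.

```c
#include <assert.h>

#define GRID_M      4 /* rows (M in the pseudocode) */
#define GRID_N      6 /* columns (N in the pseudocode) */
#define NUM_STREAMS 3 /* command streams (P in the pseudocode) */

static int sema[GRID_M][GRID_N];         /* counting semaphores, initially 0 */
static int render_order[GRID_M][GRID_N]; /* 1-based sequence number per cell */

typedef struct { int i, j; } Cursor;     /* next cell a stream will render */

/* First column that stream p handles in row i. */
int first_col(int p, int i)
{
    return (p + i) % NUM_STREAMS;
}

/* Interleave the streams until none can make progress.
 * Returns the number of cells rendered. */
int run_ordered(void)
{
    Cursor cur[NUM_STREAMS];
    int rendered = 0;

    for (int p = 0; p < NUM_STREAMS; p++) {
        cur[p].i = 0;
        cur[p].j = first_col(p, 0);
    }
    for (;;) {
        int progress = 0;
        for (int p = 0; p < NUM_STREAMS; p++) {
            Cursor *c = &cur[p];
            /* Move to the next row once this row's columns are exhausted. */
            while (c->i < GRID_M && c->j >= GRID_N) {
                c->i++;
                c->j = first_col(p, c->i);
            }
            if (c->i >= GRID_M)
                continue; /* this stream is finished */
            int i = c->i, j = c->j;
            /* P operations: blocked unless both neighbors have signaled. */
            if (i > 0 && sema[i - 1][j] == 0) continue;
            if (j > 0 && sema[i][j - 1] == 0) continue;
            if (i > 0) sema[i - 1][j]--;
            if (j > 0) sema[i][j - 1]--;
            render_order[i][j] = ++rendered; /* stands in for ExtractAndRender */
            /* V operations: one signal for each dependent neighbor. */
            if (i < GRID_M - 1) sema[i][j]++;
            if (j < GRID_N - 1) sema[i][j]++;
            c->j += NUM_STREAMS;
            progress = 1;
        }
        if (!progress)
            break;
    }
    return rendered;
}

/* Returns 1 when every cell was rendered after its upper and left neighbors. */
int order_is_valid(void)
{
    for (int i = 0; i < GRID_M; i++) {
        for (int j = 0; j < GRID_N; j++) {
            if (render_order[i][j] == 0) return 0;
            if (i > 0 && render_order[i - 1][j] > render_order[i][j]) return 0;
            if (j > 0 && render_order[i][j - 1] > render_order[i][j]) return 0;
        }
    }
    return 1;
}
```

Each cell signals its semaphore twice because two later cells wait on it, the one below and the one to the right. Since every stream visits its cells in row-major order, the earliest unrendered cell in row-major order is always runnable, so the streams cannot deadlock.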

Implementation: Argus

[Figure: Argus pipeline; InfiniteReality pipeline]

Performance

• Argus software pipeline on an SGI Origin SMP applied to Nurbs (patch tessellator - embarrassingly parallel) and March (parallel marching cubes)


Convergence

• the Performer approach utilizes pipeline parallelism while the Stanford approach utilizes multithreaded parallelism
• the authors note that the role of their API is complementary to that of IRIS Performer, which utilizes pipeline parallelism but is constrained by placing one processor in charge of issuing graphics commands

The End
