WireGL: A Scalable Graphics System for Clusters. Greg Humphreys, Matthew Eldridge, Ian Buck, Gordon Stoll, Matthew Everett, and Pat Hanrahan. Presented by Bruce Johnson.


Page 1:

WireGL

A Scalable Graphics System for Clusters

Greg Humphreys, Matthew Eldridge, Ian Buck, Gordon Stoll, Matthew Everett, and Pat Hanrahan

Presented by Bruce Johnson

Page 2:

Motivation for WireGL

Data sets for scientific computation applications are enormous.

Visualization of these datasets on single workstations is difficult or impossible.

Therefore, we need a scalable, parallel graphics rendering system.

Page 3:

What is WireGL?

Provides a parallel interface to a cluster-based virtual graphics system

Extends OpenGL API

Allows flexible assignment of tiles to graphics accelerators

Can perform final image reassembly in software using a general purpose cluster interconnect

Can bring rendering power of cluster to displays ranging from a single monitor to a multi-projector, wall-sized display.

Page 4:

WireGL Illustrated

Page 5:

Parallel Graphics Architecture Classification

Classify by the point in the graphics pipeline at which data are redistributed.

Redistribution, or “sorting”, is the transition from object parallelism to image parallelism

Sort location has tremendous implications for the architecture’s communication needs.

Page 6:

Advantage of WireGL’s Communication Infrastructure

WireGL uses commodity parts, as opposed to the highly specialized components found in SGI's InfiniteReality.

Therefore, the hardware or the network may be upgraded at any time without redesigning the system

Page 7:

Points of Communication in Graphics Pipeline

The use of commodity parts restricts the choice of communication points, because individual graphics accelerators cannot be modified.

Therefore, there are only two points in the graphics pipeline to induce communication.

Immediately after the application stage.

Immediately before the final display stage.

If communication is used after the application stage, this is the traditional sort-first graphics architecture.

Page 8:

WireGL is a Sort-first Renderer

Page 9:

WireGL’s Implementation (From a High Level)

WireGL consists of one or more clients submitting OpenGL commands simultaneously to one or more graphics servers known as pipeservers.

Pipeservers are organized as a sort-first parallel graphics pipeline and together serve to render a single output image.

Each pipeserver has its own graphics accelerator and a high-speed network connecting it to all of its clients.

Page 10:

Compute, Graphics, Interface and Resolution Limited

Compute limited means that the simulation generates data more slowly than the graphics system can accept it.

Graphics limited (geometry limited) means that a single client occupies multiple servers, keeping each server it occupies busy due to its long rendering time.

Interface limited means that an application is limited by the rate at which it can issue geometry to the graphics system.

Resolution limited (field limited) means that the visualization of the data is hampered by a lack of display resolution.

Page 11:

How does WireGL Deal With These Limitations?

WireGL has no inherent restriction on the number of clients and servers it can accommodate.

For compute-limited applications, one needs more clients than servers.

For graphics-limited applications, one needs more servers than clients.

For interface-limited applications, one needs an equal number of clients and servers.

For resolution-limited applications WireGL affords one the capacity to use larger display devices.

Page 12:

Client Implementation

WireGL replaces the system's OpenGL library on Windows, Linux, and IRIX machines. As the program makes calls to the OpenGL API, WireGL classifies them into three categories: geometry, state, and special.

Page 13:

Geometry Commands

Geometry commands are those that appear between glBegin and glEnd

These commands are packed into a global “geometry buffer”.

The buffer contains a copy of the arguments to the function and an opcode.

These opcodes and data are sent directly to the networking library as a single function call.

Commands like glNormal3f do not themselves create fragments; their state effects are simply recorded in the buffer.
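The packing scheme described above can be sketched as follows. This is an illustrative model only: the opcode values, record layout, and `GeometryBuffer` class are assumptions for exposition, not WireGL's actual wire format.

```python
import struct

# Hypothetical opcodes; WireGL's real encoding is not given in the slides.
OP_VERTEX3F = 1
OP_NORMAL3F = 2

class GeometryBuffer:
    """Packs OpenGL-style commands as (opcode, args) records into one buffer."""
    def __init__(self, capacity=64 * 1024):
        self.capacity = capacity
        self.data = bytearray()

    def pack(self, opcode, *floats):
        # One record: a 1-byte opcode followed by the float arguments.
        record = struct.pack("<B%df" % len(floats), opcode, *floats)
        if len(self.data) + len(record) > self.capacity:
            raise BufferError("flush required before packing more commands")
        self.data += record

buf = GeometryBuffer()
buf.pack(OP_NORMAL3F, 0.0, 0.0, 1.0)  # state effect recorded in the buffer
buf.pack(OP_VERTEX3F, 1.0, 2.0, 3.0)  # geometry recorded the same way
```

Packing into a flat byte buffer is what lets the whole batch be handed to the networking library as a single send.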

Page 14:

State Commands

State commands directly affect the graphics state such as glRotatef, glBlendFunc or glTexImage2D.

Each state element has n associated bits indicating whether that element is out of sync with each of the n servers.

When a state command is executed, the bits are all set to 1, indicating that each server might need a new copy of that element.
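A minimal sketch of this dirty-bit scheme, assuming the per-element bit vector described above (class and method names are illustrative):

```python
class StateElement:
    """Tracks which of n servers hold a stale copy of one state element."""
    def __init__(self, n_servers):
        self.n = n_servers
        self.dirty = 0  # bit i set => server i is out of sync

    def mutate(self, value):
        # A state command was executed: every server may need the new copy.
        self.value = value
        self.dirty = (1 << self.n) - 1

    def sync(self, server):
        # The element was sent to one server; clear that server's bit.
        self.dirty &= ~(1 << server)

blend = StateElement(n_servers=4)
blend.mutate("GL_SRC_ALPHA")  # all four dirty bits set
blend.sync(2)                 # server 2 is now up to date
```

Lazy updates fall out of this representation: state is only transmitted to a server whose bit is still set.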

Page 15:

Geometry Buffer Transmission

Two circumstances can trigger transmission of the geometry buffer. First, if the buffer fills up, it must be flushed to make room for subsequent commands. Second, if a state command is called while the geometry buffer is not empty, the buffer must be flushed, because OpenGL has strict ordering semantics.

The geometry buffer cannot be sent to the overlapped servers immediately, since they may not yet have the correct OpenGL state. The application's current state must be sent prior to any transmission of geometry.

WireGL currently has no automatic mechanism for determining the best time to partition geometry.
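The two flush triggers can be expressed as a small predicate. The capacity constant and function name are assumptions for illustration:

```python
CAPACITY = 64 * 1024  # hypothetical buffer size; the real value is tunable

def must_flush(buffered_bytes, incoming_record_bytes, is_state_command):
    """Decide whether the geometry buffer must be transmitted now.

    Mirrors the two triggers above: (1) the buffer would overflow, or
    (2) a state command arrives while geometry is already buffered,
    because OpenGL's strict ordering semantics forbid reordering."""
    if buffered_bytes + incoming_record_bytes > CAPACITY:
        return True
    if is_state_command and buffered_bytes > 0:
        return True
    return False
```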

Page 16:

Parallel Graphics Considerations

When running a parallel application, each client node performs a sort-first distribution of geometry and state to all pipeservers. When multiple OpenGL graphics contexts wish to render a single image, their streams must be ordered explicitly.

Synchronization functions are therefore added to WireGL: glBarrierExec(name) causes a graphics context to enter a barrier, glSemaphoreP waits for a signal, and glSemaphoreV issues a signal.

These ordering commands are broadcast, because the same ordering restrictions must be observed by all servers.
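The semaphore semantics can be modeled as follows. This is a simplified server-side sketch of how glSemaphoreP/glSemaphoreV order graphics contexts, not WireGL's implementation:

```python
class GLSemaphore:
    """Server-side model of semaphore ordering between graphics contexts."""
    def __init__(self):
        self.count = 0      # banked signals from glSemaphoreV
        self.waiters = []   # contexts blocked in glSemaphoreP

    def P(self, ctx):
        """glSemaphoreP: consume a banked signal, or block the context."""
        if self.count > 0:
            self.count -= 1
            return "runnable"
        self.waiters.append(ctx)
        return "blocked"

    def V(self):
        """glSemaphoreV: wake the oldest waiter, or bank the signal."""
        if self.waiters:
            return self.waiters.pop(0)
        self.count += 1
        return None

sem = GLSemaphore()
state = sem.P("ctx0")  # ctx0 must wait: no signal has been issued yet
woken = sem.V()        # another context signals; ctx0 becomes runnable
```

Because the ordering commands are broadcast, every server evaluates the same P/V sequence and therefore enforces an identical interleaving of the client streams.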

Page 17:

Special Commands

Examples of special commands are SwapBuffers, glFinish, and glClear.

glClear has a barrier immediately after its call to ensure that the frame buffer is clear before any drawing may take place.

SwapBuffers has consequences for synchronization because only one client may execute it per frame. SwapBuffers marks the end of a frame and causes a buffer swap to be executed by all servers.

Page 18:

Pipeserver Implementation

A pipeserver maintains a queue of pending commands for each client. As new commands arrive over the network, they are placed at the end of that client's queue. These queues are stored in a circular "run queue" of contexts.

A pipeserver continues executing a client’s commands until it runs out of work or the context “blocks” on a barrier with a semaphore operation.

Blocked contexts are placed on wait queues associated with the semaphore or barrier they are waiting on.
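The run-queue behavior described above can be sketched as follows; the class, the "BLOCK" sentinel, and the command strings are illustrative assumptions:

```python
from collections import deque

class PipeserverRunQueue:
    """Circular run queue of client contexts with per-client command queues."""
    def __init__(self, clients):
        self.queues = {c: deque() for c in clients}  # pending commands
        self.run_queue = deque(clients)              # round-robin order
        self.blocked = set()

    def enqueue(self, client, command):
        self.queues[client].append(command)

    def step(self):
        """Execute the front context's commands until it runs out of work
        or blocks, then rotate to the next context."""
        executed = []
        ctx = self.run_queue[0]
        while self.queues[ctx] and ctx not in self.blocked:
            cmd = self.queues[ctx].popleft()
            if cmd == "BLOCK":          # e.g. glSemaphoreP with no signal
                self.blocked.add(ctx)   # would move to the wait queue
            else:
                executed.append((ctx, cmd))
        self.run_queue.rotate(-1)       # round-robin: move ctx to the back
        return executed

rq = PipeserverRunQueue(["c0", "c1"])
rq.enqueue("c0", "draw_a")
rq.enqueue("c0", "BLOCK")
rq.enqueue("c1", "draw_b")
out = rq.step() + rq.step()  # c0 runs until it blocks, then c1 runs
```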

Page 19:

Portrait of a Pipeserver

Page 20:

Context Switching

Since each client has an associated graphics context, a context switch must be performed each time a client's stream blocks.

This context switching time is limited by the hardware.

The context switching time is slow enough to limit the amount of intra-frame parallelization achievable with WireGL.

Page 21:

Overcoming Context Switching Limitations

Each pipeserver uses the same state tracking library as the client to maintain each client's state in software. Context switching on the server is facilitated by a context differencing operation. Parallel applications collaborate to produce a single image and will typically have similar graphics states, so context switching amongst collaborating nodes has a cost proportional to the contexts' disparity. Hence a hierarchy arises in which different contexts are classified according to their difference.
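The differencing operation can be sketched over a flat dictionary of state elements (real OpenGL state is hierarchical, so this is a deliberate simplification, and the names are illustrative):

```python
def context_diff(current, target):
    """Return only the state elements that must be sent to switch the
    hardware from `current` to `target` context."""
    return {elem: val for elem, val in target.items()
            if current.get(elem) != val}

# Two collaborating contexts with nearly identical state: the switch
# cost is proportional to the disparity, here a single element.
hw_state  = {"blend": "off", "texture": "brick", "matrix": "M1"}
ctx_state = {"blend": "off", "texture": "brick", "matrix": "M2"}
updates = context_diff(hw_state, ctx_state)
```

When the contexts are similar, almost nothing needs to be sent, which is why switching among collaborating clients stays cheap.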

Page 22:

Scheduling Amongst Different Contexts

When a context blocks, the servers have a choice as to which context they will run next.

Therefore, one must weigh the cost of performing a context switch against the amount of work that can be done before the next context switch is required.

A simple round-robin scheduler was used.

Page 23:

Round-Robin Scheduling

Why does round-robin scheduling work? First, clients participating in the visualization of large data sets are likely to have similar contexts, making the expense of context switching low and uniform.

Since we can’t know when a stream will block, we can only estimate the time to the next context switch by using the amount of work queued for a particular context.

Any large disparity in the amount of work queued for a particular context is likely the result of application-level load imbalance.

Page 24:

Description of the Network

WireGL uses a connection-based network abstraction in order to support multiple network types.

Uses a credit-based flow control mechanism to prevent servers from exhausting memory resources when they can’t keep up with the clients.

The server/client pair is joined by a connection.

Sends are zero-copy, since buffer allocation is the responsibility of the network layer.

Receives are likewise zero-copy, because the receive buffers are also allocated by the network layer.
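Credit-based flow control can be sketched as a counter on each connection. The class and starting credit count are illustrative assumptions:

```python
class CreditConnection:
    """Client side of credit-based flow control: the client may only send
    while it holds credits; the server returns credits as it drains buffers."""
    def __init__(self, credits):
        self.credits = credits  # buffers the server has agreed to accept

    def send(self, buf):
        if self.credits == 0:
            return False        # must wait: server is out of buffer space
        self.credits -= 1
        # ... transmit buf over the interconnect ...
        return True

    def on_credit_returned(self, n=1):
        self.credits += n       # server freed n buffers

conn = CreditConnection(credits=2)
sent = [conn.send(b"geometry") for _ in range(3)]  # third send must wait
conn.on_credit_returned()                          # server caught up
```

Because a send is refused rather than buffered when credits run out, a slow server can never force unbounded memory growth on either side.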

Page 25:

Symmetric Connection

The connection between the client and the server is completely symmetric, meaning that servers can return data to the clients.

WireGL supports glFinish to tell applications when a command has been executed. This allows applications to synchronize their output with some external input, ensuring that the graphics system's internal buffering does not cause the output to lag behind the input.

The user may optionally enable an implicit glFinish-like synchronization upon each call to SwapBuffers. This ensures that no client gets more than one frame ahead of the servers.

Page 26:

Display Management

To form a seamless output image, tiles must be extracted from the framebuffers of the pipeservers and reassembled to drive the display device. There are two ways to perform this display reassembly: in hardware or in software.

Page 27:

Display Reassembly in Hardware

Experiments used Lightning-2 boards which accepted 4 inputs and emitted 8 outputs.

More inputs are accommodated by chaining multiple Lightning-2 boards into a "pixel bus".

Multiple outputs can be accommodated by repeating the inputs.

Hence, an arbitrary number of accelerators and displays may be connected in a 2-D mesh.

Page 28:

Display Reassembly in Hardware (2)

Each input to a Lightning-2 usually contributes to multiple output displays, so Lightning-2 must observe a full output frame for each input before it may swap. This introduces one frame of latency.

Lightning-2 provides a per-host back channel using the host's serial port. WireGL waits for this notification before executing a client's SwapBuffers command.

Having synchronized outputs allows a Lightning-2 to drive tiled display devices like IBM’s Bertha or a multi-projector display wall without tearing artifacts.

Page 29:

Display Reassembly in Software

Without special hardware to support image reassembly, the final rendered image must be read out of each local framebuffer and redistributed over a network.

The drawback to pure software reassembly is that it may diminish performance: pixel data must be read out of the local framebuffer, transferred over the internal network of the cluster, and written back to the framebuffer for display.

Software rendering has demonstrated an inability to sustain high frame rates.
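To see the scale of the problem, consider a back-of-the-envelope calculation. The numbers below are illustrative assumptions, not measurements from the paper:

```python
# Illustrative figures: one 1024x768 RGBA tile refreshed at 30 Hz.
width, height, bytes_per_pixel, fps = 1024, 768, 4, 30

frame_bytes = width * height * bytes_per_pixel  # bytes per frame
readback_per_second = frame_bytes * fps         # framebuffer readback rate

# Roughly 94 MB/s of readback traffic alone; the same pixels must then
# cross the cluster network and be written back for display, so total
# pixel traffic is a multiple of this figure.
```

With each of the three traversals (read, transfer, write) competing for bandwidth, it is easy to see why pure software reassembly struggles to sustain high frame rates at larger resolutions.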

Page 30:

Visualization Server

A separate, dedicated compositing node and network can act as a "visualization server".

Using the visualization server, all pipeservers read the color contents of their managed tiles at the end of each frame.

Those images are sent over the cluster's interconnect to the separate compositing server for reassembly.

Page 31:

Applications Used

March: a parallel implementation of the marching cubes algorithm. March extracts and renders 385,492 lit triangles per frame.

Nurbs: a parallel patch evaluator that uses multiple processors to subdivide a curved surface and tessellate it. Nurbs tessellates and renders 413,000 lit, stripped triangles per frame.

Hundy: a parallel application that renders a set of unorganized triangle strips. Hundy renders 4 million triangles per frame at a rate of 7.45 million triangles per second.

Page 32:

Parallel Rendering Speedups

Page 33:

Parallel Interface

To scale any interface-limited application it is necessary to allow parallel submission of graphics primitives.

This effect was illustrated with Hundy.

Some of Hundy’s performance measurements show a super-linear speed up because Hundy generates a large amount of network traffic per second.

This shows that Hundy’s performance is very sensitive to the behavior of the network under a high load.

Page 34:

Hardware vs. Software Image Reassembly

As the size of the output image grows, software image reassembly can quickly compromise the performance of the application.

A single application was written to measure the overhead of software versus hardware reassembly. It demonstrated that hardware-supported reassembly is necessary to maintain high frame rates.

Page 35:

Load Balancing

Two kinds of load balancing to consider.

Application level load balancing (that is, balancing the amount of computation performed by each client node)

It is the responsibility of the programmer to efficiently distribute the work to various nodes.

This aspect of load balancing was tested for each of the applications, and each application was shown to possess adequate application-level load balancing.

Page 36:

Load Balancing (2)

The other type is the graphics work done by the servers, which must be distributed across multiple servers. However, the rendering work required to generate an output image is typically not uniformly distributed in screen space. Thus, the tiling of the output image introduces a potential load imbalance, which may create a load imbalance in the network as well.

Page 37:

Scalability Limits

Experiments indicate that WireGL should scale from 16 pipeservers and 16 clients to 32 pipeservers and 32 clients if the network were better able to support all-to-all communication.

The limit on scalability is the amount of screen-space parallelism available for a given output size.

For a huge cluster of, say, 128 nodes, the tile size would be so small that it would be difficult to provide good load balance for any non-trivial application without a prohibitively high overlap factor.

Page 38:

Texture Management

WireGL's client treats texture data as a component of the graphics state and lazily updates the servers as needed. In the worst case, this results in each texture being replicated on every server node in the system.

This is a consequence of using commodity graphics accelerators in the cluster: it is not possible to introduce a stage of communication to remotely access texture memory.

The authors are currently investigating new texture management strategies, such as parallel texture caching.

Page 39:

Latency

There are two sources of latency: the display reassembly stage, and the buffering of commands on the client.

Display reassembly via the Lightning-2 boards introduces one frame of latency, while display reassembly in software introduces 50-100 ms of latency. Latency due to command buffering depends on the size of the network buffers, and on the fact that a pipeserver cannot process a buffer until it has been completely received.

Page 40:

Future Work

The main direction for future development is to add flexibility to accommodate a broader range of parallel rendering applications. The next version will allow a user to describe an arbitrary directed graph of graphics stream processing units.

This will involve developing new parallel applications that use the new system.

The system also shows promise for immersive environments such as CAVEs.