

Equalizer 2.0 – Convergence of a Parallel Rendering Framework

Stefan Eilemann, David Steiner and Renato Pajarola, Senior Member, IEEE

Abstract—Developing real-world graphics applications that leverage multiple GPUs and computers for interactive 3D rendering is a complex task. It requires expertise in distributed systems and parallel rendering in addition to the application domain itself. We present a mature parallel rendering framework which provides a large set of features, algorithms and system integrations for a wide range of real-world research and industry applications. Using the Equalizer parallel rendering framework, we show how a wide set of generic algorithms can be integrated in the framework to help application scalability and development in many different domains, highlighting how concrete applications benefit from the diverse aspects and use cases of Equalizer. We present novel parallel rendering algorithms, powerful abstractions for large visualization setups and virtual reality, as well as new experimental results for parallel rendering and data distribution.

Index Terms—Parallel Rendering, Scalable Visualization, Cluster Graphics, Immersive Environments, Display Walls


1 INTRODUCTION

The Equalizer parallel rendering framework as first presented in [8] has demonstrated its general versatility and the usefulness of its minimally-intrusive programming design in a variety of applications and projects. In particular, the integration of large-scale parallel rendering algorithms, APIs for developing complex distributed applications, and many individual features make Equalizer a unique, open source framework to develop visualization applications for virtually any type of setup and use case. While individual applications and case studies using the framework, as well as new algorithms and system components extending it, have been presented since the initial release of Equalizer, many important features and functionalities have not previously been published and are presented here. In this report, we thus present an updated comprehensive review of the integration of research, application use cases and commercial developments with respect to Equalizer, as well as novel comparative experimental results of its scalability features.

We present novel algorithms for parallel rendering which did not appear in a separate publication, including pixel and sub-pixel decompositions, dynamic frame resolution, tunable sort-first load-balancing parameters, frame-rate equalization, thread synchronization modes for multi-GPU rendering, a powerful abstraction for multi-view rendering on arbitrary display setups, dynamic focus distance and asymmetric eye positions for VR, parallel pixel streaming to tiled display walls, as well as a fully-fledged data distribution API with compression and reliable multicast.

The remainder of this paper is structured as follows: First we provide an update on related work since the introduction of Equalizer. The main body of this paper then presents new performance features, VR algorithms, usability features to build complex applications, the main novel features of the underlying Collage network library, and a quick overview of the main Equalizer-based applications. A result section presents new experiments not previously published, followed by the discussion and conclusion.

• email: [email protected], steiner@ifi.uzh.ch, [email protected]

• All authors are with the Visualization and MultiMedia Lab, Department of Informatics, University of Zürich.

2 RELATED WORK

In 2009 we presented Equalizer [8], which introduced the architecture of a generic parallel rendering framework and summarized our work in parallel rendering. Since then, an extensive Programming and User Guide provides in-depth documentation on using and programming Equalizer [6]. In the following related work we assume these two publications and their references as a starting point, and focus on new work published since 2009.

The concept of transparent OpenGL interception popularized by WireGL and Chromium [21] has received little attention since 2009. While some commercial implementations such as TechViz and MechDyne Conduit continue to exist, on the research side only ClusterGL [30] has been presented. ClusterGL employs the same approach as Chromium, but delivers a significantly faster implementation of transparent OpenGL interception and distribution for parallel rendering. CGLX [5] tries to bring parallel execution transparently to OpenGL applications by emulating the GLUT API and intercepting certain OpenGL calls. In contrast to frameworks like Chromium and ClusterGL, which distribute OpenGL calls, CGLX follows the distributed application approach. This works transparently for trivial applications, but quickly requires the application developer to address the complexities of a distributed application when mutable application state needs to be synchronized across processes. For realistic applications, writing parallel applications remains the only viable approach for scalable parallel rendering, as shown by the success of ParaView, VisIt and various Equalizer-based applications.

On the other hand, software for driving and interacting with tiled display walls has received significant attention, including Sage [29] and Sage 2 [26] in particular. Sage was built entirely around the concept of a shared framebuffer where all content windows are separate applications using pixel streaming. It is no longer actively supported.


Fig. 1: Example Equalizer applications: (a) 192-megapixel CAVE at KAUST running RTNeuron, (b) immersive HMD with external tracked and untracked views running RTT DeltaGen for virtual car usability studies, and (c) CAVE2 running a molecular visualization.

Sage 2 is a complete, browser-centric reimplementation where each application is a web application distributed across browser instances. DisplayCluster [23], and its continuation Tide [3], also implement the shared framebuffer concept of Sage, but provide a few native content applications integrated into the display servers. All these solutions implement a scalable display environment and are a target display platform for scalable 3D graphics applications.

Equalizer itself has received significant attention within the research community. Various algorithms to improve parallel rendering performance have been proposed: compression and region of interest during compositing [25], load-balancing resources for multi-display installations [11], asynchronous compositing and NUMA optimizations [7], as well as work queueing [32]. Additionally, complex large-scale and out-of-core multiresolution rendering approaches have been parallelized and implemented with Equalizer [18], [19], demonstrating the feasibility of using the framework with complex rendering algorithms and 3D model representations.

Furthermore, various applications and frameworks have used Equalizer for new research in visualization. On the application side, RTT Deltagen, Bino, Livre and RTNeuron [20] are the most mature examples and are presented in Section 7. On the framework side, Omegalib [13], a framework used in the CAVE2, made significant progress in integrating 2D collaborative workspaces like Sage 2 with 3D immersive content. Lambers et al. developed a framework for visualizing remote sensing data [24] on large displays and immersive installations.

3 PERFORMANCE FEATURES

3.1 New Decomposition Modes

The initial version of Equalizer implemented the basic sort-first (2D), sort-last (DB), stereo (EYE) and multilevel decompositions [8]. In the following we present the newly added decomposition modes and motivate their use cases; together they bring the overall feature set well beyond the typical sort-first and sort-last rendering modes. Figure 2 provides an overview of the new modes. The compound concept to set up scalable rendering is presented in [8].

3.1.1 Time-Multiplex

Time-multiplexing (Figure 2a), also called AFR or DPlex, was first implemented in [2] for shared memory machines. It is however a better fit for distributed memory systems, since the separate memory space makes concurrent rendering of different frames easier to implement. While it increases the framerate linearly, it does not decrease the latency between user input and the corresponding output. Consequently, this decomposition mode is mostly useful for non-interactive movie generation. It is transparent to Equalizer applications, but does require the configuration latency to be equal to or greater than the number of source channels. Furthermore, to work in multi-threaded, multi-GPU configurations, the application needs to support running the rendering threads asynchronously (Section 3.3.4). The output frame rate of the destination channel may be smoothed using a frame rate equalizer (Section 3.2.5).

3.1.2 Tiles and Chunks

Tile (Figure 2b) and chunk decompositions are variants of sort-first and sort-last rendering, respectively. They decompose the scene into a predefined set of fixed-size image tiles or database ranges. These tasks, or work packages, are queued and processed by all source channels, which poll a server-central queue. Prefetching ensures that task communication overlaps with rendering. As shown in [32] and in the results, these modes can provide better performance due to being inherently load-balanced, as long as the setup overhead per render task is insignificant. This mode is transparent to Equalizer applications.

3.1.3 Pixel

Pixel compounds (Figure 2c) decompose the destination channel by interleaving rows or columns in image space. They are a variant of sort-first decomposition which works well for fill-limited applications that are not geometry-bound, for example direct volume rendering. Source channels cannot reduce geometry load through view frustum culling, since each source channel has almost the same frustum. However, the fragment load on all source channels is reduced linearly and well load-balanced due to the interleaved distribution of pixels. This functionality is transparent to Equalizer applications, and the default compositing uses the stencil buffer to blit pixels onto the destination channel.

3.1.4 Subpixel

Subpixel compounds (Figure 2d) are similar to pixel compounds, but they decompose the work for a single pixel, for example with Monte-Carlo ray tracing, FSAA or depth-of-field rendering. Composition typically uses accumulation and averaging of all computed fragments for a pixel. This feature is not fully transparent to the application, since it needs to adapt (jitter or tilt) the frustum based on the iteration executed, as sketched below.
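As an illustration, the following self-contained sketch shows how one subpixel pass might jitter its frustum. The helper names and the sampling grid are ours, not the Equalizer API; the index/size pair mirrors the [ index size ] values of the subpixel compound description.

    #include <cmath>
    #include <GL/gl.h>

    // Hedged sketch: jitter the view frustum for one subpixel iteration.
    // All names are illustrative; pixelW/pixelH are the extent of one
    // output pixel on the near plane.
    struct Frustum { float left, right, bottom, top, nearPlane, farPlane; };

    void drawSubpixelPass( const Frustum& f, unsigned index, unsigned size,
                           float pixelW, float pixelH )
    {
        // Deterministic sample position inside the pixel for this
        // iteration, here on a regular ceil(sqrt(size))^2 grid.
        const unsigned grid =
            static_cast< unsigned >( std::ceil( std::sqrt( float( size ))));
        const float dx = (( index % grid ) + .5f ) / float( grid ) - .5f;
        const float dy = (( index / grid ) + .5f ) / float( grid ) - .5f;

        glMatrixMode( GL_PROJECTION );
        glLoadIdentity();
        // Shift the frustum by a fraction of a pixel; averaging all
        // passes during compositing yields the anti-aliased result.
        glFrustum( f.left   + dx * pixelW, f.right + dx * pixelW,
                   f.bottom + dy * pixelH, f.top   + dy * pixelH,
                   f.nearPlane, f.farPlane );
        glMatrixMode( GL_MODELVIEW );
        // ... render the scene as usual ...
    }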


compound {
    channel "destination"
    framerate_equalizer {}
    compound {
        channel "source1"
        phase 0 period 3
        outputframe { name "frame" }
    }
    compound {
        channel "source2"
        phase 1 period 3
        outputframe { name "frame" }
    }
    compound {
        channel "source3"
        phase 2 period 3
        outputframe { name "frame" }
    }
    inputframe { name "frame" }
}

(a) Time-Multiplex

compound {
    channel "destination"
    outputtiles {
        name "queue"
        size [ 64 64 ]
    }
    compound {
        channel "destination"
        inputtiles { name "queue" }
    }
    compound {
        channel "source1"
        inputtiles { name "queue" }
        outputframe {}
    }
    compound {
        channel "source2"
        inputtiles { name "queue" }
        outputframe {}
    }
    compound {
        channel "source3"
        inputtiles { name "queue" }
        outputframe {}
    }
    inputframe { name "frame.source1" }
    inputframe { name "frame.source2" }
    inputframe { name "frame.source3" }
}

(b) Tiles

compound {
    channel "dest"
    compound {
        channel "dest"
        pixel [ 0 0 3 1 ]
        outputframe { type texture }
    }
    compound {
        channel "source1"
        pixel [ 1 0 3 1 ]
        outputframe {}
    }
    compound {
        channel "source2"
        pixel [ 2 0 3 1 ]
        outputframe {}
    }
    inputframe { name "frame.dest" }
    inputframe { name "frame.source1" }
    inputframe { name "frame.source2" }
}

(c) Pixel

compound {
    channel "dest"
    compound {
        channel "dest"
        subpixel [ 0 3 ]
        outputframe { type texture }
    }
    compound {
        channel "source1"
        subpixel [ 1 3 ]
        outputframe {}
    }
    compound {
        channel "source2"
        subpixel [ 2 3 ]
        outputframe {}
    }
    inputframe { name "frame.dest" }
    inputframe { name "frame.source1" }
    inputframe { name "frame.source2" }
}

(d) Subpixel

Fig. 2: New Equalizer task decomposition modes and their compound descriptions for parallel rendering

3.2 Equalizers

Equalizers are an addition to compound trees. They modify parameters of their respective subtree at runtime to dynamically optimize the resource usage, each tuning one aspect of the decomposition. By design they are transparent to application developers, but may expose application-accessible parameters to tune their behavior. Resource equalization is the critical component for scalable parallel rendering, and is therefore the eponym of the Equalizer project.

3.2.1 Sort-First and Sort-Last Load Balancing

Sort-first (Figure 3a) and sort-last load balancing are the most obvious optimizations for these parallel rendering modes. Our load equalizers are fully transparent to application developers; that is, they use a reactive approach based on past rendering times, which assumes reasonable frame-to-frame coherence. Equalizer implements two different algorithms, a load equalizer and a tree equalizer. The result section provides some evidence on the strengths and weaknesses of both algorithms.

The load equalizer maintains a 1D or 2D grid mapping the load of each channel, stored in normalized 1D/2D coordinates using time/area as the load metric. The contributing source channels are organized in a binary tree, and the algorithm balances the two branches of each level by equalizing the integral over the cost map on each side.

The tree equalizer also uses a binary tree for recursive load balancing. It computes the accumulated render time on all nodes of the tree, and uses this to allocate an equal render time to each subtree. It makes no assumption about the load distribution in 1D or 2D space; it only tries to correct the imbalance in render time.

Both equalizers expose tunable parameters that allow application developers to optimize the load balancing based on the characteristics of their rendering algorithm (see the configuration sketch after this list):

Split mode configures the tile layout: horizontal stripes, vertical stripes, or 2D, a binary-tree split alternating the split axis on each level, which results in compact 2D tiles.

Damping reduces frame-to-frame oscillations. The load equalizer assumes an equal load distribution within each region of interest, which in reality is not the case, causing the load balancing to overshoot. Damping is a normalized scalar defining how much of the computed delta from the previous split position is applied to the new split.

Resistance eliminates small deltas in the load balancing step. This can help the application cache visibility computations, since the frustum no longer changes every frame.

Boundaries define the modulo factor in pixels onto which a load split may fall. Some rendering algorithms produce artefacts related to the OpenGL raster position, e.g., screen-door transparency, which can be eliminated by aligning the boundary to the pixel repetition. Furthermore, some rendering algorithms are sensitive to cache alignment, which can again be exploited by choosing the corresponding boundary.
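For illustration, the compound sketch below sets these parameters in the .eqc syntax of Figure 2; the attribute names follow the Equalizer server grammar to the best of our knowledge, and the concrete values are arbitrary.

    compound {
        channel "destination"
        load_equalizer {
            mode 2D              # HORIZONTAL, VERTICAL or 2D split
            damping .5           # apply 50% of the computed split delta
            resistance [ 8 8 ]   # ignore splits moving less than 8 pixels
            boundary [ 16 16 ]   # align splits to 16-pixel multiples
        }
        compound { channel "destination" }
        compound { channel "source1" outputframe {} }
        inputframe { name "frame.source1" }
    }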


Fig. 3: Runtime modifications: (a) load balancing, (b) cross-segment load balancing, (c) dynamic frame resolution, and (d) monitoring

3.2.2 Dynamic Work Packages

The package equalizers implement client-affinity models for tile and chunk compounds (Section 3.1.2). A tile equalizer or chunk equalizer creates the work packages and changes their assignment to individual nodes, based on an affinity model specified in the equalizer. In [32], we explore this approach in detail.

3.2.3 Cross-Segment Load Balancing

Cross-segment load balancing (Figure 3b) addresses the optimal allocation of n rendering resources to m output channels (with n ≥ m). A view equalizer works in conjunction with load equalizers balancing the individual output channels. It monitors the usage of shared source channels across outputs and activates them to balance the rendering time of all outputs. In [11], we provide a detailed description and evaluation of our algorithm.
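A hedged configuration sketch of this setup, again in the .eqc style of Figure 2 (channel and frame names are illustrative): the view_equalizer on the top-level compound coordinates the load_equalizers of two output channels sharing one auxiliary source.

    compound {
        view_equalizer {}
        compound {
            channel "display-left"
            load_equalizer { mode 2D }
            compound { channel "display-left" }
            compound { channel "shared-source" outputframe { name "frame.left" } }
            inputframe { name "frame.left" }
        }
        compound {
            channel "display-right"
            load_equalizer { mode 2D }
            compound { channel "display-right" }
            compound { channel "shared-source" outputframe { name "frame.right" } }
            inputframe { name "frame.right" }
        }
    }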

3.2.4 Dynamic Frame Resolution

The DFR equalizer (Figure 3c) provides functionality similar to dynamic video resizing [27]; that is, it maintains a constant framerate by adapting the rendering resolution of a fill-limited application. In Equalizer, this works by rendering into a source channel (typically an FBO) separate from the destination channel, and then scaling the rendering during the transfer (typically through an on-GPU texture) to the destination channel. The DFR equalizer monitors the rendering performance and accordingly adapts the resolution of the source channel and the zoom factor for the source-to-destination transfer. If the performance and source channel resolution allow, this will not only subsample, but also supersample the destination channel to reduce aliasing artefacts.
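For example (a sketch in the .eqc style of Figure 2; we assume the DFR_equalizer attribute name of the Equalizer server grammar, and the target frame rate is arbitrary):

    compound {
        channel "destination"
        DFR_equalizer { framerate 30 }   # adapt resolution to hold ~30 fps
        compound {
            channel "source"             # renders into an FBO
            outputframe { type texture }
        }
        inputframe { name "frame.source" }
    }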

3.2.5 Frame Rate Equalizer

The framerate equalizer smooths the output frame rate of a destination channel by instructing the corresponding window to enforce a minimum time between buffer swaps. This is regularly used for time-multiplexed decompositions, where source channels tend to drift and finish their rendering unevenly distributed over time. The equalizer is however fully independent of DPlex compounds, and may also be used to smooth irregular application rendering algorithms.

3.2.6 Monitoring

The monitor equalizer (Figure 3d, Figure 4) allows reusing the rendering on another channel, typically for monitoring a large setup from a control workstation. Output frames on the display channels are connected to input frames on a single monitoring channel. The monitor equalizer adjusts the scaling factor and offset between output and input, so that the monitor channel shows the same, but typically downscaled, view as the originating segments.

3.3 Optimizations

3.3.1 Region of Interest

The region of interest (ROI) is the screen-space 2D bounding box enclosing the data rendered by a single resource. We have extended the core parallel rendering framework to use an application-provided ROI to optimize the load equalizer as well as the image compositing. The load equalizer uses the ROI to refine its load grid to the regions containing data. The compositing code uses the ROI to minimize image readback and network transmission. In [25] and [7], we provide the details of the algorithm, and show that using the ROI can quadruple the rendering performance, in particular for the costly compositing step in sort-last rendering.

3.3.2 Asynchronous Compositing

Asynchronous compositing pipelines the rendering with compositing operations by executing image readback, network transfer and image assembly in threads running in parallel to the rendering threads. In [7], we provide the details of the implementation and experimental data showing an improvement in rendering performance of over 25% for large node counts.

3.3.3 Download and Compression Plugins

Compression for the compositing step is critical for performance. This not only applies to the well-researched network transfer step, but also to the transfer between GPU and CPU. Equalizer supports a variety of compression algorithms, from very fast run-length encoding (RLE) and YUV subsampling on the GPU to JPEG compression. These algorithms are implemented as runtime-loaded plugins, allowing easy extension and customization with application-specific compression. In [25], we show this to be a critical step for interactive performance at scale.

3.3.4 Thread Synchronization Modes

Applications differ in how decoupled and thread-safe their rendering code is from the application logic. Full decoupling requires a copy of all mutable data in each render thread, which is not feasible for most applications and large-data scenarios. To easily customize the synchronization of all threads on a single process, Equalizer implements three threading modes: full synchronization, draw synchronization and asynchronous execution. Note that execution between node processes is always asynchronous, for up to latency frames.

In full synchronization, all threads always execute the same frame; that is, the render threads are unlocked after Node::frameStart, and the node blocks for all render threads to finish the frame before executing Node::frameFinish. This allows the render threads to read shared data during all their operations, but provides the slowest performance.

In draw synchronization, the node thread and all render threads are synchronized for all frameDraw operations; that is, Node::frameFinish is executed after the last channel is done drawing. This allows the render threads to read shared data during their draw operation, but not during compositing. Since compositing is often independent of the rendered data, this is the default mode. It allows overlapping compositing with rendering and data synchronization on multi-GPU machines.

In asynchronous execution, all threads run asynchronously. Render threads may work on different frames at any given time. This mode is the fastest, and requires the application to have one instance of each mutable object in each render thread. It is required for scaling time-multiplex compounds on multi-GPU machines.
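The mode is selected per node in the configuration. A hedged sketch follows; we assume the thread_model node attribute of the Equalizer server grammar, where LOCAL_SYNC, DRAW_SYNC and ASYNC correspond to the three modes above:

    node {
        attributes { thread_model DRAW_SYNC }   # or LOCAL_SYNC, ASYNC
        pipe { window { channel { name "gpu1" } } }
        pipe { window { channel { name "gpu2" } } }
    }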

4 VIRTUAL REALITY FEATURES

Virtual Reality is an important field for parallel rendering. It does however require special attention to support it as a first-class citizen in a generic parallel rendering framework. Equalizer has been used in many virtual reality installations, such as the CAVE2 [14], the high-resolution C6 CAVE at the KAUST visualization laboratory, and head-mounted displays (Figure 1). In the following we lay out the features needed to support these installations, motivated by application use cases.

4.1 Head Tracking

Head tracking is the minimal feature needed to support immersive installations. Equalizer supports multiple, independently tracked views through the observer abstraction (Section 5.1). Built-in VRPN support enables the direct, application-transparent configuration of a VRPN tracker device. Alternatively, applications can provide a 4×4 tracking matrix. Both CAVE-like tracking with fixed projection surfaces and HMD tracking with moving displays are implemented.
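In code, the application-provided path reduces to updating the observer once per frame. A minimal sketch, assuming the eq::Config::getObservers() and eq::Observer::setHeadMatrix() accessors:

    // Hedged sketch: feed an application-provided 4x4 tracking matrix to
    // the first observer; 'trackingMatrix' comes from the application's
    // own tracking system.
    void updateHead( eq::Config* config, const eq::Matrix4f& trackingMatrix )
    {
        const eq::Observers& observers = config->getObservers();
        if( !observers.empty( ))
            observers.front()->setHeadMatrix( trackingMatrix );
    }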

4.2 Dynamic Focus Distance

To our knowledge, all parallel rendering systems have the focal plane coincide with the physical display surface. For better viewing comfort, we introduce a new dynamic focus mode, where the application defines the distance of the focal plane from the observer based on the current lookat distance. Initial experiments show that this provides better viewing comfort, in particular for objects placed in front of the physical displays.

4.3 Asymmetric Eye Position

Traditional head tracking computes the left and right eye positions using an interocular distance. However, human heads are not symmetric, and by measuring individual users a more precise frustum can be computed. Equalizer supports this through the optional configuration of individual 3D eye translations relative to the tracking matrix.

4.4 Model Unit

This feature allows applications to specify a scaling factor between the model and the real world, to allow exploration of macroscopic or microscopic worlds in virtual reality. The unit is per view, allowing different scale factors within the same application. It scales both the specified projection surface and the eye position (and therefore separation) to achieve the necessary effect.

4.5 Runtime Stereo Switch

Applications can switch each view between mono and stereo rendering at runtime, and run both monoscopic and stereoscopic views concurrently (Figure 1(b)). This switch may involve starting and stopping resources and processes for passive stereo or stereo-dependent task decompositions (Section 5.2).

5 USABILITY FEATURES

In this section we present features motivated by real-world application use cases, i.e., new functionality rather than performance improvements. We motivate each use case, explain the architecture and integration into our parallel rendering framework, and, where applicable, show the steps needed to use the functionality in applications.

5.1 Physical and Logical Visualization Setup

Real-world visualization setups can be complex. An abstract representation of the display system simplifies the configuration process. Applications often need to be aware of the spatial relationships of the display setup, for example to render 2D overlays or to configure multiple views on a tiled display wall.

We addressed this need through a new configuration section interspersed between the node/pipe/window/channel hardware resources and the compound trees configuring the resource usage for parallel rendering.

A typical installation consists of one projection canvas, which is one aggregated projection surface, e.g., a tiled display wall or a CAVE. Desktop windows are also considered a canvas. Each canvas is made of one or more segments, which are the individual outputs connected to a display or projector. Segments can be planar or non-planar to each other, and can overlap or have gaps between each other. A segment references a channel, which defines the output area of this segment, e.g., on a DVI connector connected to a projector. This abstraction covers all use cases, from simple windows over tiled display walls with bezels to non-planar immersive systems with edge-blending.

A canvas can define a frustum, which creates default, planar sub-frusta for all of its segments. A segment can also define a frustum, which overrides the canvas frustum, e.g., for non-planar setups such as CAVEs or curved screens. These frusta describe a physically-correct display setup for a Virtual Reality installation. A canvas may have a software or hardware swap barrier, which synchronizes the rendering of all contributing GPUs. The software barrier executes a glFinish to ensure the GPU is ready to swap, a Collage barrier (Section 6.4) to synchronize all segments, and the swap buffers call followed by a glFlush to ensure timely execution of the swap command. The hardware swap barrier is implemented using the NV swap group extension.

On each canvas, the application can display one or more views. A view is used in the sense of the MVC pattern. The view class is available to applications to define view-specific data for rendering, e.g., a scene, viewing mode or camera. The application process manages this data, and the render clients receive it for rendering.

A layout groups one or more views which logically belong together. A layout is applied to a canvas, and the layout assignment can be changed at run-time by the application. The intersection between views and segments defines which output channels are available and which frustum they should use for rendering. These output channels are then used as destination channels in a compound; they are automatically created during configuration.

An observer looks at one or more views. It is described by its position in the world and its eye separation. Each observer has its own stereo mode, focus distance and eye positions. This allows having untracked views and multiple tracked views, e.g., two HMDs, in the same application.
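Putting these concepts together, a hedged .eqc sketch for a two-segment display wall with a two-view layout; wall coordinates and names are illustrative, and the view-to-observer assignment as well as the node/pipe/window/channel resources are omitted:

    config {
        observer {}
        layout { name "main" view {} view {} }
        canvas {
            layout "main"
            wall {                         # physical frustum of the canvas
                bottom_left  [ -2 -.75 -1 ]
                bottom_right [  2 -.75 -1 ]
                top_left     [ -2  .75 -1 ]
            }
            segment { channel "left"  viewport [  0 0 .5 1 ] }
            segment { channel "right" viewport [ .5 0 .5 1 ] }
        }
        # hardware resources and compounds omitted
    }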

Fig. 4: A 2×2 tiled display wall and control host rendering four independent views, driven by an eight-node visualization cluster

Figure 4 shows RTT Deltagen running an example multi-segment, multi-view setup driven by eight rendering nodes. The main tiled display wall canvas uses four LCD segments showing one layout with four views, which do not align on the segment boundaries. This setup creates seven destination channels. The configuration provides multiple, run-time configurable layouts. It is driven from the control host on the right, which shows four views, each in their own canvas and segment windows with a single-view layout each. One view on the control host synchronizes its content (model and camera) with one view on the display wall through Collage objects. The control host allows full model modifications and all workflows supported within the standalone Deltagen application, and all changes are synchronized to the corresponding rendering nodes. For this monoscopic setup no head tracking or observers are used.

5.2 Runtime Reconfiguration

Switching a layout, as described above, or switching the stereo rendering mode may involve a different set of resources after the change, including launching and exiting render client processes. Equalizer solves this through a reconfiguration step at the beginning of each rendering frame. Each resource (channel, window, pipe, node) has an activation count, which is updated when the layout or any other relevant rendering parameter is changed. When a resource is found whose activation count does not match its current started/stopped state, the resource is created or destroyed and configInit or configExit is called accordingly. In the current implementation, a normal configuration initialization or exit, as described in [8], uses the same code path, with all used resources transitioning to a running or stopped state accordingly. Since starting new resources typically requires object mapping and associated data distribution, it is a costly operation.

5.3 Automatic Configuration

Automatic configuration implements the discovery of local and remote resources, as well as the creation of typical configurations using the discovered resources at application launch time.

The discovery is implemented in a separate library, hwsd (HardWare Service Discovery), which uses a plugin-based approach to discover GPUs for GLX, AGL or WGL windowing systems, as well as network interfaces on Linux, Mac OS X and Windows. Furthermore, it detects the presence of VirtualGL to allow optimal configuration of remote visualization clusters. The resources can be discovered on the local workstation and, through the help of a simple daemon using the zeroconf protocol, on a set of remote nodes within a visualization cluster. A session identifier may be used to support multiple users on a single cluster.

The Equalizer server uses the hwsd library to discover local and remote resources when an hwsd session name is provided instead of a .eqc configuration file. A set of standard decomposition modes is configured, which can be selected by activating the corresponding layout.

This versatile mechanism allows non-experts to configure and profit from multi-GPU workstations and visualization clusters, and provides system administrators with the tools to implement easy-to-use integration with cluster schedulers.

5.4 Qt Windowing

Qt is a popular windowing toolkit among application developers. Unfortunately, it imposes a different threading model for window creation and event handling compared to Equalizer. In Equalizer, each GPU rendering thread is independently responsible for creating its windows, receiving the events and eventually dispatching them to the application process' main thread. This design is motivated by the natural threading model of X11 and WGL, and allows simple sequential semantics between OpenGL rendering and event handling. In contrast, Qt requires all windows and each QOpenGLContext to be created from the Qt main thread. An existing Qt window or context may subsequently be moved to a different thread, and events are signalled from the main thread. For Qt windows, Equalizer internally dispatches the window creation from the render thread to the main thread, moves the created objects back to the render thread, and dispatches Qt signals to the correct render threads.

5.5 Tide Integration

Tide (Tiled interactive display environment) is an improved version of DisplayCluster [23], providing a touch-based, multi-window user interface for high-resolution tiled display walls. Remote applications receive input events and send pixel streams using the Deflect client library. Equalizer includes full support, enabling application-transparent integration with Tide. When a Tide server is configured, all output channels of a view stream in parallel to one window on the wall. In [28], we have shown interactive framerates for a 24-megapixel resolution over a WAN link. Deflect events are translated and injected into the Equalizer event flow, allowing seamless application integration.

Page 7: JOURNAL OF LA Equalizer 2.0 – Convergence of a Parallel ... · lage network library and a quick overview of the main Equalizer-based applications. A result section presents new

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 7

5.6 Sequel

Sequel is a simplification layer for Equalizer. It is based on the realization that, while fully expressive, the verbose abstraction layer of nodes, pipes, windows and channels in Equalizer requires significant learning to fully understand and exploit. In reality, a higher abstraction of only Application and Renderer is sufficient for many use cases. In Sequel, the application class drives the configuration, and one renderer instance is created for each (pipe) render thread. These classes also provide the natural place to store and distribute data. Finally, ViewData provides a convenient way to manage multiple views by storing the camera, model or any other view-specific information.
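A minimal sketch of a Sequel application, assuming the seq::Application and seq::Renderer interfaces of the Sequel library; the exact signatures may differ between versions:

    #include <seq/sequel.h>

    // Hedged sketch: one Application drives the configuration, one
    // Renderer instance is created per render thread.
    class Renderer : public seq::Renderer
    {
    public:
        explicit Renderer( seq::Application& app ) : seq::Renderer( app ) {}

    protected:
        void draw( co::Object* frameData ) override
        {
            applyRenderContext();   // set up GL state for this channel
            applyModelMatrix();     // apply the camera for this view
            // ... issue OpenGL draw calls here ...
        }
    };

    class Application : public seq::Application
    {
    public:
        seq::Renderer* createRenderer() override
            { return new Renderer( *this ); }
    };

    int main( const int argc, char** argv )
    {
        seq::ApplicationPtr app = new Application;
        if( app->init( argc, argv, 0 ) && app->run( 0 ) && app->exit( ))
            return EXIT_SUCCESS;
        return EXIT_FAILURE;
    }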

6 THE COLLAGE NETWORK LIBRARY

An important part of writing a parallel rendering application is the communication layer between the individual processes. Equalizer relies on the Collage network library for its internal operation. Collage furthermore provides powerful abstractions for writing Equalizer applications, which are introduced in this section.

6.1 Architecture

Collage provides networking functionality at different abstraction layers, gradually providing higher-level functionality for the programmer. The main primitives in Collage and their relations are shown in Figure 5 and provide:

Connection A stream-oriented point-to-point communication line. The connections transmit raw data reliably between two endpoints for unicast connections, and between a set of endpoints for multicast connections. For unicast, process-local pipes, TCP and InfiniBand RDMA are implemented. For multicast, a reliable, UDP-based protocol is discussed in Section 6.2.

DataI/OStream Abstracts the input and output of C++ data types from or to a set of connections by implementing output stream operators. Uses buffering to aggregate data for network transmission, and performs byte swapping during input if the endianness differs between the remote and local node.

Node and LocalNode The abstraction of a process in the cluster. Nodes communicate with each other using connections. A LocalNode listens on various connections and processes requests for a given process. Received data is wrapped in ICommands and dispatched to command handler methods. A Node is a proxy for a remote LocalNode. The Equalizer Client object is a LocalNode.

Object Provides object-oriented, versioned data distribution of C++ objects between nodes. Objects are registered or mapped on a LocalNode.

6.2 Reliable Stream Protocol

RSP is an implementation of a reliable multicast protocol over unreliable UDP transport. RSP behaves similarly to TCP: it provides full reliability and ordering of the data, and slow receivers will eventually throttle the sender through a sliding window algorithm. This behavior is needed to guarantee delivery of data in all situations. Pragmatic General Multicast (PGM [16]) provides full ordering, but slow clients are disconnected from the multicast session instead of throttling the send rate.

Fig. 5: Communication between two Collage objects

RSP combines various established algorithms [1], [15] for multicast in an open source implementation capable of delivering wire-speed transmission rates on high-speed LAN interfaces. In the following we outline the RSP protocol and implementation, and motivate the design decisions. The defaults given below are for Linux and OS X; the Windows UDP stack requires different default values, which can be found in the implementation.

Our RSP implementation uses a separate protocol thread for each RSP group, which handles all reads and writes on the multicast socket. It implements the protocol handling and communicates with the application threads through thread-safe queues. The queues contain datagrams of the application byte stream, prefixed by a header of at most eight bytes. Each connection has a configurable number of buffers (1024 by default) of a configurable MTU (1470 bytes by default), which are either free or in transmission.

Handling a smooth packet flow is critical for performance. RSP uses active flow control to advance the byte stream buffered by the implementation. Each incoming connection actively acknowledges every n (17 by default) packets fully received. The incoming connections offset this acknowledgment by their connection identifier to avoid bursts of acks. Any missed datagram is actively nack'ed as soon as detected. Write connections continuously retransmit packets for nack datagrams, and advance their window upon reception of all acks from the group. The writer explicitly requests an ack or nack when it runs out of empty buffers or finishes its write queue. Nack datagrams may contain multiple ranges of missed datagrams, which is motivated by the observation that UDP implementations often drop multiple contiguous packets.

Congestion control is necessary to optimize bandwidth usage. While TCP uses the well-known additive increase, multiplicative decrease algorithm, we have chosen a more aggressive congestion control of additive increase, additive decrease. This has experimentally proven to work better: UDP is often rate-limited by switches, that is, packets are discarded regularly rather than exceptionally, and only slowly backing off the current send rate helps to stay close to this limit. Furthermore, our RSP traffic is limited to the local subnet, making cooperation between multiple data streams less of an issue. Send rate limiting uses a bucket algorithm, where over time the bucket fills with send credits, from which sends are subtracted. If no credits are available, the sender sleeps until sufficient credits are available.
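To make the rate limiting concrete, the following self-contained sketch combines the credit bucket with additive-increase/additive-decrease adjustment. Names and constants are illustrative, not the Collage sources:

    #include <algorithm>
    #include <chrono>
    #include <cstdint>
    #include <thread>

    // Hedged sketch of RSP-style send pacing: a credit bucket refilled at
    // the current send rate, with additive increase on success and
    // additive decrease on loss (instead of TCP's multiplicative decrease).
    class RateLimiter
    {
    public:
        // Block until 'bytes' send credits are available, then consume them.
        void acquire( const uint64_t bytes )
        {
            refill();
            while( _credits < bytes )
            {
                std::this_thread::sleep_for( std::chrono::microseconds( 100 ));
                refill();
            }
            _credits -= bytes;
        }

        // A round of acks completed without any nack: probe for more rate.
        void onAck()  { _rate += _step; }

        // A nack signalled packet loss: back off additively, since switches
        // drop UDP regularly rather than exceptionally.
        void onNack() { _rate = std::max( _rate - _step, _step ); }

    private:
        void refill()
        {
            const auto now = std::chrono::steady_clock::now();
            const double elapsed =
                std::chrono::duration< double >( now - _last ).count();
            _last = now;
            _credits = std::min( _credits + uint64_t( _rate * elapsed ),
                                 _maxCredits );
        }

        double   _rate = 100e6;          // current send rate, bytes/second
        double   _step = 1e6;            // additive increase/decrease step
        uint64_t _credits = 0;           // available send credits in bytes
        uint64_t _maxCredits = 1 << 20;  // bucket capacity (burst limit)
        std::chrono::steady_clock::time_point _last =
            std::chrono::steady_clock::now();
    };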

6.3 Distributed, Versioned Objects

Adapting an existing application for parallel rendering requires the synchronization of application data across the processes in the parallel rendering setup. Existing parallel rendering frameworks often address this poorly; at best they rely on MPI to distribute data. Real-world, interactive visualization applications are typically written in C++ and have complex data models and class hierarchies to represent their application state. As outlined in [8], the parallel rendering code in an Equalizer application only needs access to the data needed for rendering, as all application logic is centralized in the application main thread. We have encountered two main approaches to this distribution: using a shared filesystem for static data, or using data distribution for static and dynamic data. Distributed objects are not required to build Equalizer applications; while most developers choose this abstraction for convenience, we have seen applications using other means of data distribution, e.g., MPI.

6.3.1 Programming Interface

Distributed objects in Collage provide powerful, object-oriented data distribution for C++ objects. They facilitate the implementation of data distribution in a cluster environment. Distributed objects are created by subclassing co::Serializable or co::Object. The application programmer implements serialization and deserialization. Distributed objects can be static (immutable) or dynamic. Objects have a universally unique identifier (UUID) as a cluster-wide address. A master-slave model is used to establish mapping and data synchronization across processes. Typically, the application main loop registers a master instance and communicates the UUID to the render clients, which map their instances to the given identifier. The following object types are available:

Static The object is neither versioned nor buffered. The instance data is serialized whenever a new slave instance is mapped. No additional data is stored.

Instance The object is versioned and buffered. The instance and delta data are identical; that is, only instance data is serialized. Previous instance data is saved to be able to map old versions.

Delta The object is versioned and buffered. The delta data is typically smaller than the instance data and is transmitted to slave instances for synchronization. Previous instance and delta data are saved to be able to map and sync old versions.

Unbuffered The object is versioned and unbuffered. No data is stored, and no previous versions can be mapped.

Serialization is facilitated using output or input streams, which abstract the data transmission and are used like a std::stream. The data streams implement efficient buffering and compression, and automatically select the best connection for data transport. Custom data type serializers can be implemented by providing the appropriate serialization functions. No pointers should be transmitted directly through the data streams. For pointers, the corresponding object is typically a distributed object as well, and its UUID and version are transmitted in place of the pointer.

Dynamic objects are versioned, and on commit the delta data from the previous version is sent, using multicast if available, to all mapped slave instances. The data is queued on the remote node, and is applied when the application calls sync to synchronize the object to a new version. The sync method might block if a version has not yet been committed or is still in transmission. All versioned objects have the following characteristics:

• The master instance of the object generates new versions for all slaves. These versions are continuous. It is possible to commit on slave instances, but special care has to be taken to handle possible conflicts.

• Slave instance versions can only be advanced; that is, sync( version ) with a version smaller than the current version will fail.

• Newly mapped slave instances are mapped to the oldest available version by default, or to the version specified when calling mapObject.

The Collage Serializable implements one convenient usage pattern for object data distribution; a sketch follows below. The co::Serializable data distribution is based on the concept of dirty bits, allowing inheritance with data distribution. Dirty bits form a 64-bit mask which marks the parts of the object to be distributed during the next commit. For serialization, the application developer implements serialize or deserialize, which are called with the bit mask specifying which data has to be transmitted or received. During a commit or sync, the current dirty bits are given, whereas during object mapping all dirty bits are passed to the serialization methods.
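A minimal sketch of this pattern, assuming the co::Serializable interface (DIRTY_CUSTOM, setDirty() and the serialize/deserialize hooks); the Camera class and its dirty bit are illustrative:

    #include <co/co.h>

    // Hedged sketch: a distributed camera using the dirty-bit pattern.
    class Camera : public co::Serializable
    {
    public:
        enum DirtyBits { DIRTY_POSITION = co::Serializable::DIRTY_CUSTOM << 0 };

        void setPosition( const float x, const float y, const float z )
        {
            _pos[0] = x; _pos[1] = y; _pos[2] = z;
            setDirty( DIRTY_POSITION );      // distribute on the next commit
        }

    protected:
        // Called with the current dirty bits on commit, all bits on mapping.
        void serialize( co::DataOStream& os, const uint64_t dirtyBits ) override
        {
            co::Serializable::serialize( os, dirtyBits );
            if( dirtyBits & DIRTY_POSITION )
                os << _pos[0] << _pos[1] << _pos[2];
        }

        void deserialize( co::DataIStream& is, const uint64_t dirtyBits ) override
        {
            co::Serializable::deserialize( is, dirtyBits );
            if( dirtyBits & DIRTY_POSITION )
                is >> _pos[0] >> _pos[1] >> _pos[2];
        }

    private:
        float _pos[3] = { 0.f, 0.f, 0.f };
    };

The application would register the master instance on its local node and communicate the resulting identifier to the render clients, which map their instances and call sync each frame; a commit on the master then sends only the dirty parts as a new version.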

Blocking commits allow limiting the number of outstanding, queued versions on the slave nodes: a token-based protocol blocks the commit on the master instance if too many unsynchronized versions exist.

6.3.2 Optimizations

The API presented in the previous section provides sufficient abstraction to implement various optimizations for faster mapping and synchronization of data: compression, chunking, caching, preloading and multicast. The results section evaluates some of these optimizations.

The most obvious one is compression. Recently, many new compression algorithms have been developed which exploit modern CPU architectures and deliver compression rates well above one gigabyte per second. Collage uses the Pression library [17], which provides a unified interface for a number of compression libraries, such as FastLZ [22], Snappy [31] and ZStandard [12]. It also contains a custom, virtually zero-cost RLE compressor. Pression parallelizes the compression and decompression using data decomposition. This compression is generic, and implemented transparently for the application. Applications can also use data-specific compression.

The data streaming interface implements chunking, which pipelines the serialization code with the network transmission. After a configurable number of bytes has been serialized to the internal buffer, it is transmitted and serialization continues. This is used both for the initial mapping data and for commit data.

Caching retains instance data of objects in a client-side cache, and reuses this data to accelerate the mapping of objects. The instance cache is either filled by "snooping" on multicast transmissions, or by explicit preloading when the master objects are registered. Preloading sends instance data of recently registered master objects to all connected nodes while the corresponding node is idle. These nodes simply enter the received data into their cache. Preloading uses multicast when available.

Due to the master-slave model of data distribution, multicast is used to optimize the transmission time of data. If the contributing nodes share a multicast session, and more than one slave instance is mapped, Collage automatically uses the multicast connection to send the new version information.


6.4 Barriers, Queues and Object Maps

Collage implements a few generic distributed objects which are used by Equalizer and other applications. A barrier is a distributed barrier primitive used for software swap synchronization in Equalizer (Section 5.1). Its implementation follows a simple master-slave approach, which has proven sufficient for this use case. Queues are distributed, single-producer, multiple-consumer FIFO queues. To hide network latencies, consumers prefetch items into a local queue. Queues are used for tile and chunk compounds (Section 3.1.2).

The object map facilitates distribution and synchronization of a collection of distributed objects. Master versions can be registered on a central node, e.g., the application node in Equalizer. Consumers, e.g., Equalizer render clients, can selectively map the objects they are interested in. Committing the object map commits all registered objects and syncs their new versions to the slaves. Syncing the map on the slaves synchronizes all mapped instances to the new versions recorded in the object map. This effective design allows data distribution with minimal application logic. It is used by Sequel (Section 5.6) and other Collage applications.

7 EXAMPLE APPLICATIONS

In this section, we present some major applications built using Equalizer, and show how they interact with the framework to solve complex parallel rendering problems.

7.1 Livre

Livre (Large-scale Interactive Volume Rendering Engine) is a GPU ray-casting based parallel 4D volume renderer, implementing state-of-the-art view-dependent level-of-detail (LOD) rendering and out-of-core data management [10]. Hierarchical and out-of-core LOD data management is supported by an implicit volume octree, accessed asynchronously by the renderer from a data source on a shared file system. Different data sources can be used, providing octree-conformant access to raw or compressed volume data as well as to implicitly generated volume data, e.g., from event simulations or surface meshes.

High-level state information, e.g., camera position and rendering settings, is shared in Livre through Collage objects between the parallel application and rendering threads. Sort-first decomposition is efficiently supported through octree traversal and culling, both for scalability and for driving large-scale tiled display walls.

7.2 RTT Deltagen

RTT Deltagen (now Dassault 3D Excite) is a commercial application for interactive, high-quality rendering of CAD data. The RTT Scale module, delivering multi-GPU and distributed execution, is based on Equalizer and Collage, and has driven many of the aforementioned features.

RTT Scale uses a master-slave execution mode, where a single running Deltagen instance can go into "Scale mode" at any time by launching an Equalizer configuration. Consequently, the whole internal representation needed for rendering is based on a Collage-based data distribution. The rendering clients are separate, smaller applications which map their scenes during startup. At runtime, any change performed in the main application is committed as a delta at the beginning of the next frame, following a design pattern similar to the Collage Serializable (Section 6.3.1). Multicast (Section 6.2) is used to keep data distribution times during session launch reasonable for larger cluster sizes (tens to hundreds of nodes).
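
A condensed sketch of such a dirty-bit delta pattern follows; the field layout and the Serializable base are illustrative, not the actual Collage interface:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

class SceneNode // : public co::Serializable in the real system
{
public:
    enum DirtyBits : uint64_t
    {
        DIRTY_TRANSFORM = 1ull << 0,
        DIRTY_MATERIAL  = 1ull << 1,
    };

    void setTransform(const float m[16])
    {
        std::copy(m, m + 16, _transform);
        _dirty |= DIRTY_TRANSFORM;          // record the change for commit
    }

    // Master side, once per frame: serialize only what changed since the
    // last commit, then reset the dirty mask.
    void serializeDelta(std::vector<uint8_t>& out)
    {
        append(out, &_dirty, sizeof(_dirty));
        if (_dirty & DIRTY_TRANSFORM)
            append(out, _transform, sizeof(_transform));
        if (_dirty & DIRTY_MATERIAL)
            append(out, &_material, sizeof(_material));
        _dirty = 0;
    }

private:
    static void append(std::vector<uint8_t>& out, const void* data,
                       size_t size)
    {
        const uint8_t* p = static_cast<const uint8_t*>(data);
        out.insert(out.end(), p, p + size);
    }

    uint64_t _dirty = 0;
    float _transform[16] = {};
    uint32_t _material = 0;
};

The slave side reads the mask first and deserializes the changed fields in the same order.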

RTT Scale is used for a wide variety of use cases. In virtual reality, the application is used for virtual prototyping and design reviews in front of high-resolution display walls and CAVEs, as well as for virtual prototyping of human-machine interactions using CAVEs and HMDs (Figure 1(b)). For scalability, sort-first and tile compounds are used to achieve fast, high-quality rendering, primarily for interactive raytracing, both on CPUs and GPUs. For CPU-based raytracing, Linux-based rendering clients are often used with a Windows-based application node.

7.3 RTNeuron

RTNeuron [20] is a scalable real-time rendering tool for the visualization of neuronal simulations based on cable models. It is based on OpenSceneGraph for data management and Equalizer for parallel rendering, and focuses not only on fast rendering times, but also on fast loading times with no offline preprocessing. It provides level-of-detail (LOD) rendering, high-quality anti-aliasing based on jittered frusta and accumulation during still views, and interactive modification of the visual representation of neurons on a per-neuron basis (full neuron vs. soma only, branch pruning depending on the branch level, etc.). RTNeuron implements both sort-first and sort-last rendering with order-independent transparency.

7.4 RASTeR

RASTeR [4] is an out-of-core and view-dependent real-time multiresolution terrain rendering approach using a patch-based restricted quadtree triangulation. For load-balanced parallel rendering [19] it exploits fast hierarchical view-frustum culling of the level-of-detail (LOD) quadtree for sort-first decomposition, and uniform distribution of the visible LOD triangle patches for sort-last decomposition. The latter is enabled by a fast traversal of the patch-based restricted quadtree triangulation hierarchy which results in a list of selected LOD nodes, constituting a view-dependent cut, or front, of activated nodes through the LOD hierarchy. Assigning and distributing equally sized segments of this active LOD front to the concurrent rendering threads results in a near-optimal sort-last decomposition for each frame, as sketched below.
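
A sketch of this segment assignment (our simplification, treating all patches of the front as equally weighted, per the uniform distribution described above):

#include <cstddef>
#include <vector>

struct Patch {};  // a restricted-quadtree LOD patch selected for this view

// Split the active LOD front into k contiguous, equally sized segments,
// one per concurrent rendering thread.
std::vector<std::vector<Patch>> partitionFront(
    const std::vector<Patch>& front, size_t k)
{
    std::vector<std::vector<Patch>> segments(k);
    const size_t base = front.size() / k;  // minimum patches per renderer
    const size_t rest = front.size() % k;  // first 'rest' renderers get +1
    size_t pos = 0;
    for (size_t i = 0; i < k; ++i)
    {
        const size_t n = base + (i < rest ? 1 : 0);
        segments[i].assign(front.begin() + pos, front.begin() + pos + n);
        pos += n;
    }
    return segments;
}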

7.5 Bino

Bino is a stereoscopic 3D video player capable of running on very large display systems. Originally written for the immersive semi-cylindrical projection system at the University of Siegen, it has been used in many installations thanks to its flexibility of configuration. Bino decodes the video on each Equalizer rendering process and only synchronizes the time step globally, therefore providing a scalable solution for video playback.

7.6 Omegalib

Omegalib [13] is a software framework built on top of Equalizer that facilitates application development for hybrid reality environments such as the CAVE2. Hybrid reality environments aim to create a seamless 2D/3D environment that supports both information-rich analysis (traditionally done on tiled display walls) as well as virtual reality simulation exploration (traditionally done in VR systems) at a resolution matching human visual acuity.

Omegalib supports dynamic reconfigurability of the display environment, so that areas of the display can be interactively allocated to 2D or 3D workspaces as needed. It makes it possible to have multiple immersive applications running on a cluster-controlled display system, to have different input sources dynamically routed to applications, and to have rendering results optionally redirected to a distributed compositing manager. Omegalib supports pluggable front-ends to simplify the integration of third-party libraries like OpenGL, OpenSceneGraph, and the Visualization Toolkit (VTK).

8 EXPERIMENTAL RESULTS

This section presents new experiments, complementing the results of previous publications [7], [8], [9], [11], [20], [25], [28], [32]. The first part summarizes rendering performance over all decomposition modes with a few representative workloads. The second part analyzes data distribution performance, in particular how the optimizations in Collage perform in realistic scenarios.

8.1 Decomposition Modes

We conducted new performance benchmarks for various decomposition modes on a cluster using hexa-core Intel Xeon E5-2620 v3 CPUs (2.4 GHz), NVIDIA GTX 970 GPUs with 4 GB VRAM each, 16 GB main memory per node, 4 GBit/s Ethernet, and QDR InfiniBand. GCC 4.8 was used with CMake 3.7 release mode settings to compile the software stack.

We tested the decomposition modes with both polygonal data and volume data (Figure 6 (middle and left)), using test scenes that allowed us to adapt the rendering load the system has to cope with. In both cases the scene is comprised of two rows of instantiated, identical models, with 30 models in total. Rendering was performed at an output resolution of 2560×1440. The camera is initially placed in the center of the scene, between the two rows, rendering only half of the model instances. It is then moved backward over the duration of 800 frames, steadily increasing the rendering load by revealing more models, until all 30 instances are visible.

We investigated the scalability of individual decomposition modes by running the same experiment using a varying number of render nodes (2-9) and one dedicated application/display node. We subsequently summed up the duration of all rendered frames for each run (Figure 8).

For sort-first and sort-last rendering we present static and load-balanced task decomposition. For readability, we only present the results of the equalizer (load or tree) providing the better performance for each application. Unsurprisingly, static decompositions perform worse than load-balanced compounds. Sort-first polygon rendering exhibits oscillations in performance as nodes are added to the task, due to unfavorable assignment of a tile with a high work load on odd node counts. Static sort-last volume rendering shows a similar oscillation behavior, as ranges of scene geometry tend to get similarly unfavorably assigned under such conditions.

The simpler tree equalizer outperforms the load-grid-driven load equalizer in almost all cases, except for sort-first volume rendering where the load in the region of interest is relatively uniform. This counterintuitive result seems to confirm once more that simple algorithms often outperform theoretically better, but more complex implementations. On the other hand, the tile equalizer often outperforms the tree equalizer. This suggests that the underlying implicit load balancing can be superior to the explicit methods of the load and tree equalizers in high-load situations, where the additional overhead of tile generation and distribution is more justified. The relatively simple nature of our benchmark application's rendering algorithms also favors work packages, since they have a near-zero static overhead per rendering pass.

Finally, we also provide scalability results for pixel compounds. While naturally load-balanced, pixel compounds only scale fill rate and not geometry processing. Consequently, pixel compounds provide better performance for volume rendering, and a predictable scaling behavior for both.

For volume rendering we also measured the performance of decomposition modes under heterogeneous load, which was easily achievable by varying the number of volume samples used for each fragment (1-7) while rendering. This allowed for a consistent linear scaling of rendering load, which was randomly varied (Figure 9) either per frame or per node. Such a linear scaling of load per node corresponds to a scaling of resources, e.g., doubling the rendering load on a specific node corresponds to halving its available rendering resources. To the system this node would then contribute the value 0.5 in terms of normalized compute resources, as illustrated by Figure 9 (left).
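
The correspondence is direct, as the following toy calculation over the experiment's sample factors shows:

#include <cstdio>

int main()
{
    // Scaling the per-fragment sample count by s multiplies a node's
    // rendering load by s, i.e. divides its normalized compute resources
    // by s: load x2 -> 0.5 resources, the value quoted above.
    const float sampleFactors[] = { 1.f, 2.f, 4.f, 7.f };
    for (float s : sampleFactors)
        std::printf("load x%.0f -> normalized resources %.2f\n", s, 1.f / s);
    return 0;
}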

This figure gives an impression of how individual modes perform on heterogeneous systems. In this case the tree equalizer performs best (Figure 9 (left)), as it allows us to a priori define how much use it should make of individual nodes, i.e., bias the allocation of rendering time in accordance with the (simulated) compute resources. Figure 9 (right), on the other hand, illustrates how the tested decomposition modes perform on a system where compute resources fluctuate randomly every frame, as can arguably be the case for shared rendering nodes in virtualized environments. For this scenario the tile equalizer seems best suited, as it performs load balancing implicitly and does not assume coherence of available resources between frames. The simpler tree equalizer also outperforms the load equalizer in this experiment.

8.2 Object Distribution

The data distribution benchmarks have been performed on a cluster using dual-processor Intel Xeon X5690 CPUs (3.47 GHz), 24 GB main memory per node, 10 GBit/s Ethernet and QDR InfiniBand. Intel ICC 2017 was used with CMake 3.2 release mode settings to compile the software stack. To benchmark the data distribution we used two datasets: the David statue (as in Figure 6 (middle), but at 2 mm resolution) and 3D volumes of spike frequencies of an electrical simulation of three million neurons (Figure 6 (right)).

The PLY file is converted into a k-d tree for fast view-frustum culling and rendering, and the resulting data structure is serialized in binary form for data transmission. The spike frequency volumes aggregate the number of spikes which happened within a given voxel over a given time. The absolute spike count is renormalized to an unsigned byte value during creation. Higher densities in the volume represent higher spiking activity in the voxel.
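
The renormalization can be sketched as follows; the linear mapping to the byte range is our assumption, as the exact transfer function is not detailed here:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Map per-voxel spike counts linearly to unsigned bytes, so higher
// densities encode higher spiking activity (assumed linear transfer).
std::vector<uint8_t> normalizeSpikes(const std::vector<uint32_t>& counts)
{
    uint32_t maxCount = 1;
    for (uint32_t c : counts)
        maxCount = std::max(maxCount, c);

    std::vector<uint8_t> volume(counts.size());
    for (size_t i = 0; i < counts.size(); ++i)
        volume[i] = static_cast<uint8_t>(uint64_t(counts[i]) * 255 / maxCount);
    return volume;
}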

8.2.1 Data Compression Engines

The characteristics of the data compression algorithm are a critical factor for data distribution performance. Our microbenchmark compresses a set of binary files, measuring the speed and compression ratio of the various engines. Figure 7 shows the compression and decompression speed in gigabytes per second as well as the size of the compressed data relative to the uncompressed data. The ZSTDx engines use the ZStandard compression library at compression level x. The measurements were performed on a single, isolated node.
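
A minimal version of such a measurement for one engine, using the ZStandard C API directly (the actual benchmark iterates over files and all compression plugins; error handling is omitted for brevity):

#include <chrono>
#include <cstdio>
#include <vector>
#include <zstd.h>   // ZStandard; other engines are benchmarked analogously

// Measure compressed size and compress/decompress throughput on one buffer.
void benchmark(const std::vector<char>& input, int level)
{
    std::vector<char> compressed(ZSTD_compressBound(input.size()));
    using clock = std::chrono::steady_clock;

    auto t0 = clock::now();
    const size_t csize = ZSTD_compress(compressed.data(), compressed.size(),
                                       input.data(), input.size(), level);
    auto t1 = clock::now();

    std::vector<char> output(input.size());
    ZSTD_decompress(output.data(), output.size(), compressed.data(), csize);
    auto t2 = clock::now();

    const double gb = double(input.size()) / (1024.0 * 1024.0 * 1024.0);
    const auto sec = [](auto d)
        { return std::chrono::duration<double>(d).count(); };
    std::printf("ZSTD%d: %.0f%% size, %.2f GB/s compress, %.2f GB/s "
                "decompress\n", level, 100.0 * csize / input.size(),
                gb / sec(t1 - t0), gb / sec(t2 - t1));
}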

Fig. 6: Benchmark data: Flower alley with 30 volume models of 1024³ unsigned bytes (1 GB) each (left); David alley with 30 statues of 56 M triangles (988 MB) each (middle); Volumes used for object distribution benchmarks (right): spike frequencies of a three million neuron electrical simulation at 512×437×240 unsigned byte (51 MB) and a MicroCT scan of a beechnut at 1024×1024×1546 unsigned short (3 GB) resolution

Measured engine characteristics (data of Fig. 7):

Engine    Relative size    Compression (GB/s)    Decompression (GB/s)
RLE       98%              10.0                  13.2
Snappy    77%              4.41                  7.97
FastLZ    76%              1.96                  4.75
LZF       76%              1.67                  5.64
ZSTD1     63%              1.50                  3.76
ZSTD2     63%              1.15                  3.02
ZSTD3     62%              0.719                 3.46
ZSTD4     60%              0.590                 2.42
ZSTD5     60%              0.456                 2.74
ZSTD10    60%              0.220                 3.25
ZSTD19    55%              0.047                 1.69

Fig. 7: Data compression for generic binary data

RLE compression has a very low overhead but merely removes "blank space" in the data. The Snappy compression, used as default in Collage, achieves the same compression ratio as the LZ variants at a much higher speed. The ZStandard compressor has roughly the same speed as the LZ variants at the lowest compression level, but provides significantly better compression. At higher compression levels it can improve the compression ratio slightly, but at a high cost in compression speed.

The compression ratios for the models used in the following section deviate from this averaged distribution. Figure 10 shows the compression ratios for the triangle and volume data.

The PLY data is barely compressible; the default compressor achieves a 10% reduction. This is due to the high entropy of the data and the dominant use of floating point values. Overall, the profile is similar to the generic benchmark, at a smaller compression rate.

The volume data, on the other hand, is sparsely populated and uses integer (byte and short) values, which are easier to compress. The naive RLE implementation already achieves a good compression rate, showing that the smaller volume contains at most 28% empty space and the bigger volume at most 43%. Snappy and ZStandard can reduce the spike data much further, down to a few megabytes. Surprisingly, the beechnut data set does not yield significantly higher compression with the modern Snappy and ZStandard libraries.

8.2.2 Model Distribution and Update

In this section we analyze how data distribution and synchronization perform in real-world applications. We extracted the existing data distribution code from a mesh renderer (eqPly) and a volume renderer (Livre) into a benchmark application to measure the time to initially map all objects on the render client nodes, and to perform a commit-sync of the full data set after mapping has been established. All figures show a noticeable measurement jitter due to other services running on the cluster. The details of the benchmark algorithm can be found in the implementation¹.

We used the same data sets as in the previous section, and ran the benchmark on up to eight physical nodes; that is, beyond eight processes the nodes start to run two processes each, which share CPU, memory and network interface bandwidth.

Object mapping is measured using the following settings: none distributes the raw, uncompressed, and unbuffered data; compression uses the Snappy compressor to distribute unbuffered data; buffered reuses uncompressed, serialized data for mappings from multiple nodes; and compression buffered reuses the compressed buffer for multiple nodes. Unbuffered operations need to reserialize, and potentially recompress, the master object data for each slave node. Each slave instance needs to deserialize and decompress the data.
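
As a back-of-the-envelope cost model (our notation, not drawn from the implementation), mapping to n slaves with buffered compression costs roughly

    t_map(n) ≈ t_comp + n · (t_send + t_decomp)

whereas unbuffered compression pays the compression on every mapping:

    t_map(n) ≈ n · (t_comp + t_send + t_decomp)

Buffering thus matters most for expensive compressors and large n, while compression only pays off when the saved transmission time exceeds the added per-slave decompression cost.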

During data synchronization, the master commits the object data to all mapped slave instances simultaneously. This is a "push" operation, whereas the mapping is a slave-triggered "pull" operation. Slave nodes queue this data and consume it during synchronization. We test the time to commit and sync the data using different compression engines.

The David statue at 2 mm resolution is organized in a k-d tree for rendering. Each node is a separate distributed object which has two child node objects. A total of 1023 objects is distributed and synchronized. Figure 11 shows the distribution times for this data set. Due to the limited compressibility of the data, the results are relatively similar. Compressing the data repeatedly for each client decreases performance, since the compression overhead cannot be amortized by the decreased transmission time. Buffering data slightly improves performance by reducing the CPU and copy overhead. Combining compression and buffering leads to the best performance, although only by about 10%.
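
The per-node registration can be sketched as follows (Node and the registration callback are stand-in names); note that 1023 objects correspond to a complete binary tree of depth 10 (2^10 - 1 nodes):

#include <cstddef>

struct Node
{
    Node* left = nullptr;
    Node* right = nullptr;
    // ... bounding box, vertex data, etc.
};

// Register every k-d tree node as its own distributed object on the
// master; slaves mirror this traversal, mapping each object by identifier.
template <typename Register>
size_t registerTree(Node* node, Register& reg)
{
    if (!node)
        return 0;
    reg(node);  // make this node a distributed object
    return 1 + registerTree(node->left, reg) + registerTree(node->right, reg);
}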

During synchronization, data is pushed from the master process to all mapped slaves using a unicast connection to each slave. While the results in Figure 11 (middle) are relatively close to each other, we can still observe how the tradeoff between compression ratio and speed influences overall performance. Better but slower compression algorithms lead to improved overall performance when amortized over many send operations.

1. https://github.com/Eyescale/Equalizer/tree/paper2018/tools/eqObjectBench

The volume data sets are distributed in a single object, serializing the raw volume buffer. The Spike volume data set has a significant compression ratio, which is reflected by the results in Figure 12. Compression for this data is beneficial for transmitting data over a 10 Gb/s link even for a single slave process. Buffering has little benefit since the serialization of volume data is trivial. Buffered compression makes a significant difference, since the compression cost can be amortized over n nodes, reaching raw data transmission rates of 3.7 GB/s with the default Snappy compressor and at best 4.4 GB/s with ZStandard at level 1.

The distribution of the beechnut data set also behaves as expected (Figure 13). Due to the larger object size, uncompressed transmission is slightly faster compared to the Spike data set at 700 MB/s, and compressed transmission does not improve the mapping performance, likely due to increased memory pressure caused by the large data size. The comparison of the various compression engines is consistent with the benchmarks in Figure 10; RLE, Snappy and the LZ variants are very close to each other, and ZSTD1 can provide better performance beyond four nodes due to the better compression ratio.

Finally, we compare data distribution speed using different protocols. In this benchmark, the data synchronization time of the Spike volume data set is measured, as in Figure 12 (middle). Buffering is enabled and compression is disabled to focus on the raw network performance. Figure 14 shows the performance using various protocols. TCP over the faster InfiniBand link outperforms the cheaper ten gigabit Ethernet link by more than a factor of two. Unexpectedly, the native RDMA connection performs worse, even though it outperforms IPoIB in a simple peer-to-peer connection benchmark. This needs further investigation, but we suspect the byte stream connection abstraction chosen by Collage is not well suited for remote DMA semantics; that is, one needs to design the network API around zero-copy semantics with managed memory for modern high-speed transports. Both InfiniBand connections show significant measurement jitter.

RSP multicast performs as expected. Collage starts using multicast to commit new object versions when two or more clients are mapped, since the transmission to a single client is faster using unicast. RSP consistently outperforms unicast on the same physical interface and shows good scaling behavior (2.5 times slower on 16 vs. 2 clients on Ethernet, 1.8 times slower on InfiniBand). The scaling is significantly better when only one process per node is used (Figure 14, middle: 30% slower on Ethernet, nearly flat on InfiniBand). The increased transmission time with multiple clients is caused by a higher probability of packet loss, which increases significantly when using more than one process per node. Figure 14 (right) plots the number of retransmissions divided by the number of datagrams. InfiniBand outperforms Ethernet slightly, but is largely limited by the RSP implementation throughput of preparing and queueing the datagrams to and from the protocol thread, which we observed in profiling.

9 DISCUSSION AND CONCLUSION

We have presented a significantly improved generic parallel rendering system compared to the original publication [8]. While the original publication motivated the system design, this publication describes a feature-rich, mature implementation capable of supporting a wide variety of use cases. We doubled the number of supported scalable rendering modes, many of which are presented here for the first time. We present new runtime adaptations for better load balance and performance, and describe how common optimizations are integrated into the system, making Equalizer the most generic scalable rendering system available.

Furthermore, we present many new features needed in parallel rendering applications, from advanced virtual reality support to advanced display system setup for 2D/3D integration, auto-configuration and runtime reconfiguration, and an advanced network data synchronization library tailored to parallel rendering applications. We highlight a few commercial and research applications underlining the generic and versatile system implementation.

With respect to the feature set implemented, we believe that Equalizer now covers almost any scenario within its scope. For future work, we would like to integrate new research for better scalability, new network implementations, in particular for modern zero-copy RDMA-based transports, as well as extend the Sequel abstraction layer for ease of use.

ACKNOWLEDGMENTS

We would like to thank and acknowledge the following institutions and projects for providing the 3D geometry and volume test data sets: the Digital Michelangelo Project, Stanford 3D Scanning Repository, Cyberware Inc., volvis.org and the Visual Human Project. This work was partially supported by the Swiss National Science Foundation Grants 200021-116329 and 200020-129525, the Swiss Commission for Technology and Innovation CTI/KTI Project 9394.2 PFES-ES, the EU FP7 People Programme (Marie Curie Actions) under REA Grant Agreement no. 290227 and a Hasler Stiftung grant (project number 12097).

We would also like to thank all supporters and contributors of Equalizer, most notably RTT, the Blue Brain Project, the University of Siegen, the Electronic Visualization Lab at the University of Illinois Chicago, and Dardo Kleiner.

Stefan Eilemann works towards large-scale visualization for Exascale simulations; the interactive integration of simulation, analysis and visualization; and flexible frameworks for data sharing and dynamic allocation of heterogeneous resources. He was the technical manager of the Visualization Team in the Blue Brain Project, is the CEO and founder of Eyescale, and the lead developer of the Equalizer parallel rendering framework. He received his master's degree in Computer Science from EPFL in 2015, and his Engineering Diploma in Computer Science in 1998. He is currently working towards a PhD in Computer Science at the Visualization and MultiMedia Lab at the University of Zurich.

David Steiner received MSc degrees from the University of Applied Sciences Upper Austria (Digital Media) and the University of Zurich (Computer Science). He joined the Visualization and MultiMedia Lab (VMML) in 2012 and is currently pursuing his doctorate. His research interests include interactive large-scale data visualization, distributed parallel rendering, and load balancing.

Prof. Dr. Renato Pajarola has been a Professor in computer science at the University of Zurich since 2005, leading the Visualization and MultiMedia Lab (VMML). He has previously been an Assistant Professor at the University of California Irvine and a Postdoc at Georgia Tech. He received his Dipl. Inf-Ing. ETH and Dr. sc. techn. degrees in computer science from the Swiss Federal Institute of Technology (ETH) Zurich in 1994 and 1998, respectively. He is a Fellow of the Eurographics Association and a Senior Member of both ACM and IEEE. His research interests include real-time 3D graphics, interactive data visualization and geometry processing.

REFERENCES

[1] B. Adamson, C. Bormann, M. Handley, and J. Macker. Negative-acknowledgment (NACK)-oriented reliable multicast (NORM) protocol. Memo RFC 3940, The Internet Society, 2004.

[2] P. Bhaniramka, P. C. D. Robert, and S. Eilemann. OpenGL Multipipe SDK: A toolkit for scalable parallel rendering. In Proceedings IEEE Visualization, pages 119–126, 2005.

[3] Blue Brain Project. Tide: Tiled Interactive Display Environment. https://github.com/BlueBrain/Tide, 2016.

[4] J. Bosch, P. Goswami, and R. Pajarola. RASTeR: Simple and efficient terrain rendering on the GPU. In Proceedings EUROGRAPHICS Areas Papers, Scientific Visualization, pages 35–42, 2009.

[5] K.-U. Doerr and F. Kuester. CGLX: A scalable, high-performance visualization framework for networked display environments. IEEE Transactions on Visualization and Computer Graphics, 17(2):320–332, March 2011.

[6] S. Eilemann. Equalizer Programming and User Guide. Technical report, Eyescale Software GmbH, 2013.

[7] S. Eilemann, A. Bilgili, M. Abdellah, J. Hernando, M. Makhinya, R. Pajarola, and F. Schurmann. Parallel rendering on hybrid multi-GPU clusters. In Proceedings Eurographics Symposium on Parallel Graphics and Visualization, pages 109–117, 2012.

[8] S. Eilemann, M. Makhinya, and R. Pajarola. Equalizer: A scalable parallel rendering framework. IEEE Transactions on Visualization and Computer Graphics, 15(3):436–452, May/June 2009.

[9] S. Eilemann and R. Pajarola. Direct send compositing for parallel sort-last rendering. In Proceedings Eurographics Symposium on Parallel Graphics and Visualization, 2007.

[10] K. Engel, M. Hadwiger, J. M. Kniss, C. Rezk-Salama, and D. Weiskopf. Real-Time Volume Graphics. AK Peters, 2006.

[11] F. Erol, S. Eilemann, and R. Pajarola. Cross-segment load balancing in parallel rendering. In Proceedings Eurographics Symposium on Parallel Graphics and Visualization, pages 41–50, 2011.

[12] Facebook Inc. Fast real-time compression algorithm. https://github.com/facebook/zstd, 2016.

[13] A. Febretti, A. Nishimoto, V. Mateevitsi, L. Renambot, A. Johnson, and J. Leigh. Omegalib: A multi-view application framework for hybrid reality display environments. In 2014 IEEE Virtual Reality (VR), pages 9–14, March 2014.

[14] A. Febretti, A. Nishimoto, T. Thigpen, J. Talandis, L. Long, J. Pirtle, T. Peterka, A. Verlo, M. Brown, D. Plepys, et al. CAVE2: A hybrid reality environment for immersive simulation and information analysis. In IS&T/SPIE Electronic Imaging, pages 864903–864903. International Society for Optics and Photonics, 2013.

[15] R.-H. Gau, Z. J. Haas, and B. Krishnamachari. On multicast flow control for heterogeneous receivers. IEEE/ACM Transactions on Networking, 10(1):86–101, February 2002.

[16] J. Gemmell, T. Montgomery, T. Speakman, and J. Crowcroft. The PGM reliable multicast protocol. IEEE Network, 17(1):16–22, January 2003.

[17] Eyescale Software GmbH and Blue Brain Project. Compression and data transfer plugins. https://github.com/Eyescale/Pression, 2016.

[18] P. Goswami, F. Erol, R. Mukhi, R. Pajarola, and E. Gobbetti. An efficient multiresolution framework for high quality interactive rendering of massive point clouds using multi-way kd-trees. The Visual Computer, 29(1):69–83, 2013.

[19] P. Goswami, M. Makhinya, J. Bosch, and R. Pajarola. Scalable parallel out-of-core terrain rendering. In Proceedings Eurographics Symposium on Parallel Graphics and Visualization, pages 63–71, 2010.

[20] J. B. Hernando, J. Biddiscombe, B. Bohara, S. Eilemann, and F. Schurmann. Practical parallel rendering of detailed neuron simulations. In Proceedings Eurographics Symposium on Parallel Graphics and Visualization, pages 49–56, Aire-la-Ville, Switzerland, 2013. Eurographics Association.

[21] G. Humphreys, M. Houston, R. Ng, R. Frank, S. Ahern, P. D. Kirchner, and J. T. Klosowski. Chromium: A stream-processing framework for interactive rendering on clusters. ACM Transactions on Graphics, 21(3):693–702, 2002.

[22] J. Jesper, K. Sadakane, and W.-K. Sung. Fast LZ-compression algorithm. Technical report, Department of Computer Science and Communication Engineering, Kyushu University, Japan.

[23] G. P. Johnson, G. D. Abram, B. Westing, P. Navrátil, and K. Gaither. DisplayCluster: An interactive visualization environment for tiled displays. In 2012 IEEE International Conference on Cluster Computing, pages 239–247, September 2012.

[24] M. Lambers and A. Kolb. GPU-based framework for distributed interactive 3D visualization of multimodal remote sensing data. In 2009 IEEE International Geoscience and Remote Sensing Symposium, volume 4, pages IV-57–IV-60, July 2009.

[25] M. Makhinya, S. Eilemann, and R. Pajarola. Fast compositing for cluster-parallel rendering. In Proceedings Eurographics Symposium on Parallel Graphics and Visualization, pages 111–120, Aire-la-Ville, Switzerland, 2010. Eurographics Association.

[26] T. Marrinan, J. Aurisano, A. Nishimoto, K. Bharadwaj, V. Mateevitsi, L. Renambot, L. Long, A. Johnson, and J. Leigh. SAGE2: A new approach for data intensive collaboration using scalable resolution shared displays. In Collaborative Computing: Networking, Applications and Worksharing, pages 177–186, 2014.

[27] J. S. Montrym, D. R. Baum, D. L. Dignam, and C. J. Migdal. InfiniteReality: A real-time graphics system. In Proceedings ACM SIGGRAPH, pages 293–302, 1997.

[28] D. Nachbaur, R. Dumusc, A. Bilgili, J. Hernando, and S. Eilemann. Remote parallel rendering for high-resolution tiled display walls. In Large Data Analysis and Visualization (LDAV), 2014 IEEE 4th Symposium on, pages 117–118, November 2014.

[29] K. Naveen, V. Venkatram, C. Vaidya, S. Nicholas, S. Allan, Z. Charles, G. Gideon, L. Jason, and J. Andrew. SAGE: The scalable adaptive graphics environment.

[30] B. Neal, P. Hunkin, and A. McGregor. Distributed OpenGL rendering in network bandwidth constrained environments. In T. Kuhlen, R. Pajarola, and K. Zhou, editors, Proceedings Eurographics Conference on Parallel Graphics and Visualization, pages 21–29. Eurographics Association, 2011.

[31] Google. A fast compressor/decompressor. https://github.com/google/snappy, 2016.

[32] D. Steiner, E. G. Paredes, S. Eilemann, and R. Pajarola. Dynamic work packages in parallel rendering. In Proceedings Eurographics Symposium on Parallel Graphics and Visualization, pages 89–98, 2016.

Fig. 8: Compound scalability for polygonal and volume data

Fig. 9: Scalability with heterogeneous rendering resources

Per-data-set engine characteristics (data of Fig. 10).

PLY data (David statue, 2 mm):

Engine    Relative size    Compression (GB/s)    Decompression (GB/s)
RLE       100%             6.77                  11.8
Snappy    90%              3.27                  9.33
FastLZ    90%              2.07                  4.74
LZF       89%              1.73                  5.71
ZSTD1     84%              1.31                  3.46
ZSTD2     81%              0.908                 3.11
ZSTD3     81%              0.552                 3.13
ZSTD4     77%              0.471                 2.50
ZSTD5     77%              0.356                 2.47
ZSTD10    77%              0.158                 2.68
ZSTD19    70%              0.074                 2.02

Spike volume (512×437×240 unsigned byte):

Engine    Relative size    Compression (GB/s)    Decompression (GB/s)
RLE       72%              6.87                  3.92
Snappy    19%              5.00                  11.0
FastLZ    19%              4.32                  3.00
LZF       19%              5.01                  2.32
ZSTD1     12%              2.33                  2.46
ZSTD2     12%              1.66                  2.40
ZSTD3     12%              1.01                  2.12
ZSTD4     12%              0.699                 2.21
ZSTD5     12%              0.549                 1.75
ZSTD10    11%              0.205                 5.00
ZSTD19    9%               0.013                 2.63

Beechnut volume (1024×1024×1546 unsigned short):

Engine    Relative size    Compression (GB/s)    Decompression (GB/s)
RLE       57%              5.30                  13.0
Snappy    59%              15.6                  13.9
FastLZ    56%              2.76                  7.16
LZF       56%              2.69                  7.63
ZSTD1     45%              3.53                  7.79
ZSTD2     45%              1.82                  7.59
ZSTD3     45%              1.02                  7.69
ZSTD4     45%              0.707                 6.91
ZSTD5     45%              0.582                 6.46
ZSTD10    45%              0.451                 6.40
ZSTD19    44%              0.149                 2.70

Fig. 10: Data compression for PLY data (left, David statue 2 mm) and raw volumes shown in Figure 6 (right)

Fig. 11: Object mapping (left) and data synchronization time (middle, detail view right) for the David 2 mm data set

Fig. 12: Object mapping (left) and data synchronization time (middle, detail view right) for the Spike data set in Figure 6 (right)

Fig. 13: Object mapping (left) and data synchronization time (middle, detail view right) for the beechnut data set in Figure 6 (right)

Fig. 14: Object synchronization using different network transports