
A Scalable and Reconfigurable Shared-Memory Graphics Architecture

Michael Manzke∗, Ross Brennan, Keith O’Conor, John Dingliana, Carol O’Sullivan

Trinity College Dublin

Current scalable high-performance graphics systems are either constructed using special-purpose graphics acceleration hardware or built as a cluster of commodity components with a software infrastructure that exploits multiple graphics cards [Humphreys et al. 2002]. Both of these solutions are used in application domains where computational demand cannot be met by a single commodity graphics card, e.g., large-scale scientific visualisation. The former approach tends to provide the highest performance but is expensive, because it requires frequent redesign of the special-purpose graphics acceleration hardware in order to maintain a performance advantage over the commodity graphics hardware used in the cluster approach. The latter approach, while more affordable and scalable, has intrinsic performance drawbacks due to computationally expensive communication between the individual graphics pipelines.

Figure 1: The first prototype of the custom-built high-performance graphics cluster node. The figure shows how a commodity graphics card interfaces with the cluster node. It also depicts the four SCI cables that will interconnect the custom-built GPU interface boards and the PC cluster via a 2D torus topology.

In this sketch we propose a scalable, tightly coupled cluster of custom-built boards that provide an AGP interface for commodity graphics accelerators. This hybrid solution aims to bridge the gap between the two current solutions, offering a minimal custom-built hardware component together with a novel and efficient shared-memory infrastructure that exploits cutting-edge consumer graphics hardware. The boards are supplied with rendering instructions by a cluster of commodity PCs that execute OpenGL graphics applications. All the commodity PCs and custom-built boards are interconnected with an implementation of the IEEE 1596-1992 Scalable Coherent Interface (SCI) standard. This technology provides the system with a high-bandwidth, low-latency, point-to-point interconnect. Our design allows for the implementation of a 2D torus topology with good scalability properties and excellent suitability for parallel rendering. Most importantly, the interconnect implements a Distributed Shared Memory (DSM) architecture in hardware. Figure 2 shows how local memories on the custom-built boards and the PCs become part of the system-wide DSM through the SCI interconnect. Figure 2 also depicts Field Programmable Gate Arrays (FPGAs) on the custom-built boards. These reconfigurable components assist the SCI implementation and provide substantial additional computational resources that may be used to control the commodity graphics accelerators and to perform operations associated with a parallel rendering infrastructure. Beyond these applications of the FPGAs, we envision other graphics-related computation, e.g., ray tracing. These reconfigurable components are an integral part of the scalable shared-memory graphics cluster and consequently increase the programmability of the parallel rendering system, just as vertex and pixel shaders increased the programmability of graphics pipelines.

∗e-mail: Michael.Manzke@cs.tcd.ie

Figure 2: Shared-memory system. Custom-built boards 1 to n (each with FPGAs, a GPU card and local shared memory) and PC boards 1 to n (each with a CPU and local shared memory) are joined into a distributed shared memory with a single address space, implemented through a high-speed interconnect; both sides of the system are scalable.

In this sketch, we describe the design of a tightly coupled, scalable Non-Uniform Memory Access (NUMA) architecture of distributed FPGAs, GPUs and memory that may be constructed with a limited amount of custom-built hardware. A first prototype of the custom-built boards, seen in Figure 1, has been manufactured and is currently being debugged. A second revision will resolve the outstanding problems. We expect this hardware DSM cluster to communicate data at 500 Mbytes/s with low latencies (< 1.5 µs). This hard real-time capable parallel rendering cluster will be connected by the same high-speed interconnect to a commodity PC cluster that will execute the graphics application. We have introduced this novel architecture and estimate, based on the arguments presented, that this solution could out-perform pure commodity implementations without increased hardware cost, while maintaining its adaptability to the most recent generation of commodity graphics accelerators and target applications. Later prototypes will incorporate PCI Express in order to be compatible with the latest commodity graphics accelerators.

References

HUMPHREYS, G., HOUSTON, M., NG, R., FRANK, R., AHERN, S., KIRCHNER, P. D., AND KLOSOWSKI, J. T. 2002. Chromium: a stream-processing framework for interactive rendering on clusters. In SIGGRAPH, 693–702.
