A Scalable and Reconfigurable Shared-Memory Graphics Architecture

Michael Manzke∗, Ross Brennan, Keith O’Conor, John Dingliana, Carol O’Sullivan

Trinity College Dublin

Current scalable high-performance graphics systems are either constructed using special-purpose graphics acceleration hardware or built as a cluster of commodity components with a software infrastructure that exploits multiple graphics cards [Humphreys et al. 2002]. Both solutions are used in application domains where the computational demand cannot be met by a single commodity graphics card, e.g., large-scale scientific visualisation. The former approach tends to provide the highest performance but is expensive because it requires frequent redesign of the special-purpose graphics acceleration hardware in order to maintain a performance advantage over the commodity graphics hardware used in the cluster approach. The latter approach, while more affordable and scalable, has intrinsic performance drawbacks due to computationally expensive communication between the individual graphics pipelines.

Figure 1: The first prototype of the custom-built high-performance graphics cluster node. The figure shows how a commodity graphics card interfaces with the cluster node. It also depicts the four SCI cables that will interconnect the custom-built GPU interface boards and the PC cluster via a 2D torus topology.

In this sketch we propose a scalable, tightly coupled cluster of custom-built boards that provide an AGP interface for commodity graphics accelerators. This hybrid solution aims to bridge the gap between the two current approaches, offering a minimal custom-built hardware component together with a novel and efficient shared-memory infrastructure that exploits cutting-edge consumer graphics hardware. The boards are supplied with rendering instructions by a cluster of commodity PCs that execute OpenGL graphics applications. All the commodity PCs and custom-built boards are interconnected with an implementation of the IEEE 1596-1992 Scalable Coherent Interface (SCI) standard. This technology provides the system with a high-bandwidth, low-latency, point-to-point interconnect. Our design allows for the implementation of a 2D torus topology with good scalability properties and excellent suitability for parallel rendering. Most importantly, the interconnect implements a Distributed Shared Memory (DSM) architecture in hardware.

∗e-mail: [email protected]

Figure 2 shows how local memories on the custom-built boards and the PCs become part of the system-wide DSM through the SCI interconnect. Figure 2 also depicts Field Programmable Gate Arrays (FPGAs) on the custom-built boards. These reconfigurable components assist the SCI implementation and provide substantial additional computational resources that may be used to control the commodity graphics accelerators and to perform operations associated with a parallel rendering infrastructure. Beyond these uses, we envision the FPGAs performing other graphics-related computation, e.g., ray tracing. The reconfigurable components are an integral part of the scalable shared-memory graphics cluster and consequently increase the programmability of the parallel rendering system, just as vertex and pixel shaders increased the programmability of graphics pipelines.
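To make the command path concrete, the following C sketch shows one way a PC node could hand OpenGL-style rendering instructions to a board through the DSM: a ring buffer placed in a shared SCI segment, filled by the PC and drained by the board-side FPGA. The segment-mapping helper, the command encoding and all constants are hypothetical illustrations rather than our actual interface; a real implementation would connect to and map the segment through the SCI adapter's programming interface.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical command record written into the shared segment; a real
 * system would define this format jointly with the FPGA firmware. */
typedef struct {
    uint32_t opcode;       /* e.g. a hypothetical OP_DRAW_ARRAYS   */
    uint32_t arg0, arg1;   /* opcode-specific arguments            */
    uint32_t pad;          /* keep records 16 bytes for alignment  */
} gl_cmd_t;

/* Ring buffer header at the start of the shared segment. The PC
 * advances head; the consuming board advances tail. */
typedef struct {
    volatile uint32_t head;
    volatile uint32_t tail;
    uint32_t capacity;
    uint32_t reserved;
    gl_cmd_t cmds[];       /* command slots follow the header */
} cmd_ring_t;

/* Stand-in for mapping the board's exported SCI segment into this PC's
 * address space (in the real system: segment connect + map through the
 * SCI adapter API). Faked with local memory so the sketch runs. */
static volatile cmd_ring_t *dsm_map_segment(unsigned node, unsigned seg,
                                            uint32_t capacity)
{
    (void)node; (void)seg;
    cmd_ring_t *r = calloc(1, sizeof *r + capacity * sizeof(gl_cmd_t));
    if (r) r->capacity = capacity;
    return r;
}

/* Enqueue one command; returns 0 on success, -1 if the ring is full.
 * Over real SCI, posted writes would have to be flushed before head is
 * published, so the board never sees head ahead of the payload. */
static int push_cmd(volatile cmd_ring_t *ring, gl_cmd_t cmd)
{
    uint32_t head = ring->head;
    uint32_t next = (head + 1) % ring->capacity;
    if (next == ring->tail)
        return -1;                /* board has not caught up */
    ring->cmds[head] = cmd;       /* payload: remote stores  */
    ring->head = next;            /* publish                 */
    return 0;
}

int main(void)
{
    volatile cmd_ring_t *ring = dsm_map_segment(3, 1, 1024);
    if (!ring) return 1;
    gl_cmd_t draw = { 0x10 /* hypothetical opcode */, 0, 36, 0 };
    printf("push_cmd -> %d\n", push_cmd(ring, draw));
    return 0;
}
```

Because SCI exposes remote memory as ordinary mapped addresses, the producer side is plain stores; splitting head and tail between a single writer and a single reader keeps the ring lock-free across the interconnect.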

[Figure 2 diagram: Boards 1…n, each with FPGAs, a GPU card and local shared memory, and PC boards 1…n, each with a CPU and local shared memory, both sides scalable in number, joined into a Distributed Shared Memory with a single address space implemented through a high-speed interconnect.]

Figure 2: Shared memory system.
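The single address space of Figure 2 follows directly from SCI's addressing model: IEEE 1596 defines a 64-bit global address whose top 16 bits select one of up to 64K nodes, leaving 48 bits of address space within each node. The minimal sketch below illustrates that split; the node and offset values are arbitrary examples.

```c
#include <stdint.h>
#include <stdio.h>

/* IEEE 1596 SCI: 64-bit global addresses, top 16 bits = node ID,
 * low 48 bits = offset within that node's local memory. */
#define SCI_NODE_SHIFT 48
#define SCI_OFF_MASK   ((1ULL << SCI_NODE_SHIFT) - 1)

static uint64_t sci_global_addr(uint16_t node, uint64_t offset)
{
    return ((uint64_t)node << SCI_NODE_SHIFT) | (offset & SCI_OFF_MASK);
}

int main(void)
{
    /* Arbitrary example: board 7, byte offset 0x1000 into its
     * local shared memory. */
    uint64_t g = sci_global_addr(7, 0x1000);
    printf("global 0x%016llx = node %llu + offset 0x%llx\n",
           (unsigned long long)g,
           (unsigned long long)(g >> SCI_NODE_SHIFT),
           (unsigned long long)(g & SCI_OFF_MASK));
    return 0;
}
```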

In this sketch, we describe the design of a tightly coupled, scalable Non-Uniform Memory Access (NUMA) architecture of distributed FPGAs, GPUs and memory that may be constructed with a limited amount of custom-built hardware. A first prototype of the custom-built boards, seen in Figure 1, has been manufactured and is currently being debugged; a second revision will resolve the outstanding problems. We expect this hardware DSM cluster to communicate data at 500 Mbytes/s with low latencies (< 1.5 µs). This hard real-time-capable parallel rendering cluster will be connected through the same high-speed interconnect to a commodity PC cluster that will execute the graphics application. We have introduced this novel architecture and estimate, based on the arguments presented, that the solution could outperform pure commodity implementations without increased hardware cost, while maintaining its adaptability to the most recent generation of commodity graphics accelerators and target applications. Later prototypes will incorporate PCI Express in order to be compatible with the latest commodity graphics accelerators.
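As a rough plausibility check on these figures (the frame size and rate below are illustrative assumptions, not measured values), transferring a 1024×768 RGBA frame at 500 Mbytes/s costs about 6.3 ms, comfortably inside a 60 Hz frame budget:

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions: one 1024x768 RGBA8 frame, 60 Hz
     * target; bandwidth and latency figures from the text. */
    const double frame_bytes = 1024.0 * 768.0 * 4.0;   /* ~3.1 MB      */
    const double bandwidth   = 500e6;                  /* 500 Mbytes/s */
    const double latency     = 1.5e-6;                 /* < 1.5 us     */

    double xfer   = frame_bytes / bandwidth + latency; /* ~6.3 ms      */
    double budget = 1.0 / 60.0;                        /* ~16.7 ms     */

    printf("frame transfer %.2f ms of %.2f ms budget\n",
           xfer * 1e3, budget * 1e3);
    return 0;
}
```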

References

HUMPHREYS, G., HOUSTON, M., NG, R., FRANK, R., AHERN, S., KIRCHNER, P. D., AND KLOSOWSKI, J. T. 2002. Chromium: a stream-processing framework for interactive rendering on clusters. In SIGGRAPH, 693–702.