Interconnect-Limited VLSI Architecture

8/2/2019 Interconnect-Limited VLSI Architecture

1/3

Interconnect-Lim ted VLSl ArchitectureWilliam J. DallyComputer Systems LaboratoryStanford University

AbstractAs semiconductor technology scales, wires arebecoming the dominant factor in determining systemperformance and power dissipation. By 2008, it is expectedthat chip traversal will require 16 clocks. Mo demsuperscalar architectures that depend on global register files,global bypass structures, and global instruction issue logicare poorly matched to tomorrows semiconductortechnology. This technology demands architectures thatexploit locality and minimize global commun ication. In thispaper we describe three approaches to developingarchitectures that are well matched to interconnect-limited

technology. These architectures reduce the use of globalcommunication by clustering execution resources with theirdata and instruction storage and extending the storagehierarchy to the level of individual ALU s. They also makemore efficient use of global interconnect by organizing it asa regular network, rather than a collection of ad-hocdedicated wires.

exposing it to the programmer and compiler foroptim ization. The first approa ch is to use a clusteredarchitecture in which the execution resources are partitionedinto clusters, each with their own data and instructionstorage . With this appro ach, most data and controlcommunication is local to the cluster and does not requireglobal resources. Secon d, within each cluster we extend thestorage hierarchy by adding register files local to eachexecutio n unit input. Using these local register files data canbe routed directly from execution unit to execution unitusing a minimum o f wiring resources. Third, all remainingglobal communication is performed over a packet-switchingnetwork rather than on dedicated wires. This approachmakes more efficient use of the global wires and greatlysimplifies the design.The remainder of this paper introduces the problem ofinterconnect in contemporary architectures and describeseach of ou r three approaches in mo re detail.Superscalar architectures depend on globalcommunication

IntroductionContemporary computer architectures evolved in alogic-centric era when wires were considered free in termsof delay and power. As a result, these architectures dependheavily on implicit global communication to distribute dataand instructions to execution units. These globalarchitectures were well matched for a logic-limitedtechnology; they use profligate communication to optimizethe use of executio n units. How ever, they are poorlymatched to emerging wire-limited semiconductortechnologies in which traversing a chip takes 16 clocks anddissipates considerable power. In these new technologies, itis more important to optimize the use of the wires than theuse of the execution units.There is tremendous opportunity in designingarchitectures for wire-limited technologies. Early effortshave already shown that by eliminating gratuitous globalcommunication, these architectures reduce global bandwidthdemands by an order of magnitude or more. By makingglobal communication explicit, and then optimizing it, thesearchitectures can reduce average global wire delays by a

factor of 3 or more. Also, by casting global communicationas a packet routing rather than a wire routing problem, thesearchitectures can greatly reduce the cost and complexity ofdesigning future chips.This paper describes three complementary approachesto developing architectures that are well suited forinterconnect-limited VLSI technology. These approachesare all designed to mak e global communication explicit thus

GlobalRegisterFile

GlobalInstructionFetch 8 Issue

LFigure 1 Simplified view of a superscalar architectureA modern superscalar architecture such as the IntelPentium I1 (1) uses considerable global communication forboth control and data. As illustrated in Figure 1, a typicalsuperscalar architecture is organized around a global registerfile, from which data is distributed, and a global instructionunit, from which con trol is distributed. The result is thatglobal communication is required for every instruction, bothto route data to an d from arithm etic units, and to dispatch thecontrol to the selected arithmetic unit.The communication problem is greater than the figureindicates. To achieve good performance, full bypass pathsare provided from the output of every ALU to the input ofevery other ALU resulting in a full crossbar switch betweenthe ALUs. A similar crossbar exists in the register file, andadditional crossbars are needed to check instructions fordependencies and to route instructions from instruction-cache output ports to available execution units.

0-7803-5 174-6/99/$10.000 99 9 IEEE IITC 99 - 15


2/3

This global organization was appropriate back in 1995when the number of execution units was relatively small andglobal wires could be traversed in less than a clock cycle.Then the wires were largely ignored. Today the wiresdominate the delay, area, and power o f our chips and can nolonger be ignored. At the same time the number ofexecution units is increasing, making the problem worse.Because the communication in a typical superscalararchitecture is implicit, hidden in the mechanics ofinstruction execution, it cannot be optimized by theprogrammer or compiler. As wires come to dominate ourarchitectures we need to make this communication explicitso it can be managed and optimized.Clustered architectures exploit register andinstruction locality

Figure 2: A clustered architecture makes communicationexplicit.The communication requirements of a multiple-issueprocessor can be greatly reduced by dividing the machineinto clusters as illustrated in Figure 2. The ALUs arepartitioned into clusters of 2-4 ALUs each (each cluster isdenoted by a single ALU sym bol in the figure). Theregisters are then partitioned into local register files, one foreach cluster. Each cluster is contro lled by its owninstruction unit. To communicate data values, ALUs indifferent clusters exchang e data via a packet switch. In asimilar manner, the local instruction units synchronize withone another when required. The MIT Multi-ALU Processorpioneered this style of clustered architecture (2,3,4). It ha srecently been applied to the Alpha 21264 m icroprocessor fordata only ( 5 ) , but without exposing the clustering to thecompiler.A clustered architecture makes global communication

explicit and hence exposed for optimization. A compiler cangroup instructions that communicate on the same cluster,while placing instructions that should run in parallel ondifferent clusters. The compiler can also reduce latencysensitivity by placing instructions that are not dependent ona potentially high-latency instruction (e.g., a load) in aseparate thread. This eliminates false control dependencies

without the need for expensive reordering hardware. Thiscompile-time instruction placement is not possible onconventional architectures.Keckler and others have shown that, for a class ofbenchmarks, replacing the global registers and crossbars of aconventional architecture with the local registers and a lowerbandwidth, explicit switch of a clustered architecture resultsin a 72% reduction in wiring area with little impact onclocks per instruction (2). This study show s that there issignificant register locality in programs that can be exploitedby making com munication explicit.A bandwidth hierarchy keeps most communicationon short wires. .

Switch 1

Figure 3: An extended register hierarchy results in mostcommunication occurring on local wires connecting ALUs.Extending the storage hierarchy to include several levelsof registers as illustrated in Figure 3 hrther reducescommunication. In a conventional processor, there areseveral levels to the memory hierarchy, but only a singlelevel of registers (all global). Clustered processors extendthe register hierarchy one level to include a set of localregisters for each cluster. This hierarchy can be extendedone step further to include registers associated with each

ALU inpu t port as in the Imagine architecture (6). (Asimilar arrangement of local registers (in the form ofFIFO s)is found in the Cydrome architecture (7), but without theupper levels).The use of such local registers further reducescommunication. With this arrangement, the input operandsof each instruction are always available locally, at the inputof the ALU. The only communication required is to routethe result of each instruction to the register file(s) of theconsuming ALU(s). This direct ALU to ALU routing is theminimum required to pass the result between the two ALUs.In contrast, a conventional organization would require threecluster-wide communications for each instruction tocommunicate w ith the cluster registers.In a series of experiments on graphics, multimedia, andsignal processing workloads, Rixner and others have foundthat with this approach reduces global register bandwidthdema nd by a factor of 10-20 (6). In effect, slow globalregister (or even cluster register) accesses over long wireshave been replaced in most cases by direct ALU to ALUcommunication.

IITC 99-16


3/3

Route packets,no t wiresToday, global connections on chips are made by routingwires from one point to another. This is true of the datapaths that carry operands between memory arrays, registerfiles, and operation units and also for the control paths thatcarry instructions and sequencing information. In the nearfuture we envision that most global connections on chips,both data and control, will be made not by dedicated wires,but rather over on-chip communication networks (as wassuggested for circuit boards in (8)). There are threecompelling reasons to expect this change.First, the need to put repeaters into long wires allow s usto add the switching needed to implement a network at littleadditional cost. Today, in an 0.25pm process, repeaters arerequired about every 4mm for performance critical signals.By 2008, it is expected that this repeate r spacing will be lessthan Imm . Every l mm , even a dedicated wire will need tobe connected down to a buffer fabricated on the substrate.With the addition of just a few transistors, a m ultiplexer can

be included with the buffer to implement the data path of anetwork switch. Som e logic is also needed to control thisswitch. However, only one copy o f this logic is needed for amulti-bit switch, and this logic can be pipelined ahead of thedata so as not to slow operation.Second, using a network rather than dedicated wiringmakes more efficient use of critical global wiring resourcesby allowing these resources to be shared by different sendersand receivers. With dedicated wiring, when a module isidle, the dedicated wires it is attached to go unused. With anetwork, these wires are available to route other traffic.Finally, restricting global wiring to a regular networkgreatly simplifies the design process. In doing so, it enableshigh-performance circuit techniques, and facilitates designre-use. With a network, only one set of global interconnectneed be designed for all chips implemented in a givenprocess. Regardless of the function of the chip, its globalcommunication is performed by routing packets over thisinterconnect. Further, the interconnect is a regular structure:a single tile containing a basic routing element andassociated wires that is repeated across this chip in bothdimensions. As with a memory cell, considerable effort canbe expended on optimizing and verifying this basic elementyielding a highly-reliable, high-performance design. Forexample, it is possible to employ special low-energysignaling techniques, to use optimized transmission lines,and to carehlly analyze cross talk in such a regular structurein a way that is not practical for random, dedicated wiring.Once such a network is in use, it also provides a convenientinterface to connect to semicond uctor IP of all varieties.Conclusion

New architectures that expose and reduce globalcommunication are required to make effective use ofemere ine semiconductor technolo ev. ContemDorarv

superscalar architectures with their global registers, globalbypass, and global instruction logic are poorly suited forthese technologies.In this paper we have introduced three approaches tobuilding interconnect-oriented architectures that are beingpursued in ongoing projects at Stanford: clustering, registerhierarchy, and global networks. By clustering executionunits with their associated data and instruction storage, mostcommunication can be kept local to the cluster and thedemands on global commu nication greatly reduced. Withineach cluster, the storage hierarchy is extended to the level ofindividual execution unit inputs reducing intra-clustercomm unication to a minimum value. When globalcommunication is required, performing this communicationover a shared packet network rather than dedicated wiresmakes more efficient use of the communication resource andsimplifies design complexity.Designing an architecture to optimize the use ofinterconnect has already resulted in an order of magnitudereduction in global bandwidth and a factor of 3 reduction inlatency. This is in contrast to the factor of 2-3 im provementthat can be expected from better materials for conductorsand insulators.We have only begun to investigate this very promisingarea and many challenges remain. Instruction sets thatexpose communication, yet hide implementation details areneeded. Compilers and run-time software that optimizes theuse of interconnect must be developed. We also need todiscover the best way to organize on-chip interconnectionnetworks.References

Gwennap, Linley, P6 Underscores Intels Lead, MicroprocessorReport 9(2), February 1995,pp 1,6-15.Keckler, Stephen and Dally, William, Processor Coupling:Integrating Compile-Time and Run-Time Parallelism: Proceedings ofthe Annual International Symposium on Computer Architecture,Fillo, Marco, Keckler, Stephen, Dally, William, Carter, Nicholas,Chang, Andrew, Gurevich, Yevgeny, and Lee, Whay, The M-Machine Multicomputer, International Journal of ParallelProgramming, 25(3), 1997 pp. 183-212.Keckler, S tephen, Dally, W illiam, Maskit, Daniel, Carter, Nicholas,Chang, Andrew, and Lee, W hay, Exploiting Fine-Grain Thread-Level Parallelism on the MIT Multi-ALU Processor: 25th AnnualInternational Symposium on Computer Architecture, ISCA-25, 1998,Gwennap, Linley, Digital 21264 Sets New Standard,Microprocessor Report, October 1998.Rixner, Scott, Dally, William, Kapasi, Ujval, Khailany, Brucek,Lopez-Lagunas, Abelardo, Mattson, Peter, and Owens, John, ABandwidth-Efficient Architecture for Media Processing, Proceedingsof the 31st Annual International Symposium on Microarchitecture,Rau, B., Yen, David, Yen, Wei, and Towle, Ross, Th e Cyrda-5Departmental Supercomputer: Design Philosophies, Decisions, andTrade-offs, Computer, 22(1), January 1989, pp. 12-35.Seitz, C., Lets Route Packets Instead of Wires: Proceedings of theSixth MIT Conference on Advanced Research in VLSI, W. Dally Ed.,M IT Press, 199 0, pp. 133-138.

ISCA-I 9, 1992, pp. 202-2 13.

pp . 306-3 17.

1998, pp. 3-13.

IITC 99-17

Documents

Interconnect-Limited VLSI Architecture