Scaling to the End of Silicon with EDGE Architectures
D. Burger, S.W. Keckler, K.S. McKinley, M. Dahlin, L.K. John, C. Lin, C.R. Moore, J. Burrill, R.G. McDonald, W. Yoder and the TRIPS Team
(presented by Khalid El-Arini)


Page 1: Scaling to the End of Silicon with EDGE Architectures

Scaling to the End of Silicon with EDGE Architectures

D. Burger, S.W. Keckler, K.S. McKinley, M. Dahlin, L.K. John, C. Lin, C.R. Moore, J. Burrill, R.G. McDonald, W. Yoder and the TRIPS Team

(presented by Khalid El-Arini)

Page 2: Scaling to the End of Silicon with EDGE Architectures

Overview

Motivation

High-level architecture description

Compiling for TRIPS

Discussion

Page 3: Scaling to the End of Silicon with EDGE Architectures

Why do we need a new ISA?

For the last 20 years, we have witnessed dramatic improvements in processor performance

Acceleration of clock rates (x86):
1990: 33 MHz
2004: 3.4 GHz

Aggressive pipelining is responsible for approximately half of the performance gain

However, all good things come to an end [Hrishikesh et al., ISCA '02]

Page 4: Scaling to the End of Silicon with EDGE Architectures

Explicit Data Graph Execution

Direct instruction communication

Producer and consumer instructions interact directly

An instruction fires when its inputs are available

Dataflow explicitly represented in hardware

No rediscovery of data dependencies

Higher exposed concurrency

More power-efficient execution
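The firing rule on this slide can be illustrated with a minimal sketch. The instruction names (i1 through i4), the tiny dataflow graph, and the operand values here are all hypothetical, not taken from TRIPS; the point is only that producers route results straight to consumers, and a consumer fires the moment all of its operands have arrived, with no shared register file in between:

```python
from collections import deque

# Hypothetical four-instruction dataflow graph: i1 and i2 feed i3,
# which feeds i4. "needs" is the operand count; "targets" lists the
# consumers the result is routed to directly.
instructions = {
    "i1": {"op": lambda: 3,          "needs": 0, "targets": ["i3"]},
    "i2": {"op": lambda: 4,          "needs": 0, "targets": ["i3"]},
    "i3": {"op": lambda a, b: a + b, "needs": 2, "targets": ["i4"]},
    "i4": {"op": lambda a: a * 2,    "needs": 1, "targets": []},
}

def run(insts):
    operands = {name: [] for name in insts}
    ready = deque(n for n, i in insts.items() if i["needs"] == 0)
    results = {}
    while ready:
        name = ready.popleft()
        results[name] = insts[name]["op"](*operands[name])
        for t in insts[name]["targets"]:          # direct producer-to-consumer delivery
            operands[t].append(results[name])
            if len(operands[t]) == insts[t]["needs"]:
                ready.append(t)                   # all inputs available: fire
    return results

print(run(instructions)["i4"])  # (3 + 4) * 2 = 14
```

Note there is no "rediscovery of data dependencies" at run time: the edges of the graph are part of the program representation itself.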

Page 5: Scaling to the End of Silicon with EDGE Architectures

TRIPS: An EDGE Architecture

Four goals:

Increase in concurrency

Power-efficient high performance

Mitigation of communication delays

Increased flexibility

Page 6: Scaling to the End of Silicon with EDGE Architectures

Block Atomic Execution

Compiler groups instructions into blocks

Called "hyperblocks," which contain up to 128 instructions

Each block is fetched, executed, and committed atomically (similar to the conventional notion of transactions)

Sequential execution semantics at the block level – each block is a megainstruction

Dataflow execution semantics within each block

Page 7: Scaling to the End of Silicon with EDGE Architectures

Hyperblocks and Predication

128 instructions?!

Predication allows us to hide branches within the dataflow graph

Loop unrolling and function inlining also help

Page 8: Scaling to the End of Silicon with EDGE Architectures

TRIPS Instructions

RISC add:
ADD R1, R2, R3

TRIPS add:
T5: ADD T13, T17

Compiler statically determines the locations of instructions

The block mapping/execution model eliminates the need to go through shared data structures (e.g., the register file) while executing within a hyperblock

[Figure: instruction T5 routes its result directly to target instructions T13 and T17]
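The rewrite from destination-register form to target form can be sketched mechanically. This is a simplified illustration, not the TRIPS compiler's actual algorithm: it scans a straight-line block, and whenever a source register was defined earlier in the same block, it records the consumer slot on the producer instead of keeping the register name. Slot numbers stand in for the T5/T13/T17-style instruction names above:

```python
def to_target_form(risc):
    """risc: list of (dest_reg, opcode, src_regs) tuples for one block.
    Returns, per instruction slot, the list of later slots its result
    must be routed to (target form)."""
    targets = [[] for _ in risc]
    last_def = {}                          # register -> slot that produced it
    for slot, (dest, _op, srcs) in enumerate(risc):
        for r in srcs:
            if r in last_def:              # value produced inside the block:
                targets[last_def[r]].append(slot)  # route it to this consumer
        last_def[dest] = slot
    return targets

# ADD r1, r8, r9 ; MUL r2, r1, r1 ; SUB r3, r2, r8
risc = [("r1", "add", ["r8", "r9"]),
        ("r2", "mul", ["r1", "r1"]),
        ("r3", "sub", ["r2", "r8"])]
print(to_target_form(risc))  # [[1, 1], [2], []]
```

Slot 0's result is delivered twice to slot 1 (both operands of the multiply); registers r8 and r9 come from outside the block, so they would still be read through architectural state at the block boundary.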

Page 9: Scaling to the End of Silicon with EDGE Architectures

TRIPS Processor Core

Page 10: Scaling to the End of Silicon with EDGE Architectures

Compiling for TRIPS

Page 11: Scaling to the End of Silicon with EDGE Architectures

Compiling for TRIPS

Page 12: Scaling to the End of Silicon with EDGE Architectures

Compiling for TRIPS

Page 13: Scaling to the End of Silicon with EDGE Architectures

Compiling for TRIPS

Two new responsibilities:

Generating hyperblocks

Spatial scheduling of blocks

Page 14: Scaling to the End of Silicon with EDGE Architectures

Predicated Execution

Naïve implementation:
Route a predicate to every instruction in a predicated basic block

Wide fan-out problem

Better implementations:

Predicate only the first instruction in a chain
Saves power if the predicate is false

Predicate only the last instruction in a chain
Hides the latency of the predicate computation

Page 15: Scaling to the End of Silicon with EDGE Architectures

Spatial Scheduling

Two competing goals:

Place independent instructions on different ALUs to increase concurrency

Place instructions near one another to minimize routing distances and communication delays
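One way to see the tension between the two goals is a toy greedy placer. This is an illustrative sketch, not the TRIPS scheduler: each instruction is placed at the grid slot minimizing Manhattan routing distance to its producers, plus a penalty for slots that are already occupied (contention costs concurrency). The grid size, cost weights, and instruction names are all assumptions:

```python
from itertools import product

def schedule(instrs, deps, grid=(4, 4)):
    """instrs: instruction names in placement order; deps[i]: producers
    of instruction i. Returns a mapping from instruction to grid slot."""
    placement, load = {}, {}
    for i in instrs:
        best, best_cost = None, float("inf")
        for slot in product(range(grid[0]), range(grid[1])):
            # Communication cost: total Manhattan distance to producers
            route = sum(abs(slot[0] - placement[p][0]) +
                        abs(slot[1] - placement[p][1]) for p in deps[i])
            # Concurrency cost: penalize sharing an ALU (weight is arbitrary)
            cost = route + 2 * load.get(slot, 0)
            if cost < best_cost:
                best, best_cost = slot, cost
        placement[i] = best
        load[best] = load.get(best, 0) + 1
    return placement

# Two independent producers feeding one consumer
p = schedule(["a", "b", "c"], {"a": [], "b": [], "c": ["a", "b"]})
print(p)
```

With these weights, the independent instructions a and b land on different ALUs (concurrency), while c is drawn toward its producers (short routes); changing the penalty weight shifts the balance between the two goals.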

Page 16: Scaling to the End of Silicon with EDGE Architectures

Discussion

Now that intermediate results within a hyperblock's dataflow are passed directly between instructions, how will register allocation be affected?

Compare EDGE compiler/hardware responsibilities with those of RISC and VLIW