
Computer Science and Artificial Intelligence Laboratory

Technical Report

Massachusetts Institute of Technology, Cambridge, MA 02139 USA — www.csail.mit.edu

MIT-CSAIL-TR-2006-066 September 18, 2006

RingScalar: A Complexity-Effective Out-of-Order Superscalar Microarchitecture

Jessica H. Tseng and Krste Asanović

Page 2: RingScalar: A Complexity-Effective Out-of-Order …krste/papers/MIT-CSAIL...Jessica H. Tseng and Krste Asanovi´c MIT Computer Science and Artificial Intelligence Laboratory 32 Vassar

RingScalar: A Complexity-Effective Out-of-Order Superscalar Microarchitecture

Jessica H. Tseng and Krste Asanović
MIT Computer Science and Artificial Intelligence Laboratory

32 Vassar Street, Cambridge, MA 02139
{jhtseng,krste}@csail.mit.edu

Abstract

RingScalar is a complexity-effective microarchitecture for out-of-order superscalar processors that reduces the area, latency, and power of all major structures in the instruction flow. The design divides an N-way superscalar into N columns connected in a unidirectional ring, where each column contains a portion of the instruction window, a bank of the register file, and an ALU. The design exploits the fact that most decoded instructions are waiting on just one operand to use only a single tag per issue window entry, and to restrict instruction wakeup and value bypass to only communicate with the neighboring column. Detailed simulations of four-issue single-threaded machines running SPECint2000 show that RingScalar has IPC only 13% lower than an idealized superscalar, while providing large reductions in area, power, and circuit latency.

1 Introduction

Early research in decentralized or clustered architectures [18, 8, 7, 16] was motivated by the desire to build wider superscalar architectures (8-way and greater) while maintaining high clock frequencies in an era where wire delay was becoming problematic [10]. Clustered architectures divide a wide-issue microarchitecture into disjoint clusters, each containing local issue windows, register files, and functional units. Because each cluster is much smaller and simpler than a monolithic superscalar design, the circuit latency of any path within a cluster is significantly lower, allowing greater clock rates than a centralized design of the same total issue width. However, any communication across clusters incurs greater latency, and so a critical issue is the scheme used to map instructions to clusters.

Ranganathan and Franklin [14] grouped decentralized clustering schemes into three categories. Execution unit based dependence schemes (EDD) map instructions to clusters according to the types of instructions (e.g., floating-point versus integer). Control dependence based schemes (CDD), such as Multiscalar [18] and Trace Processors [16], map instructions that are contiguous in program order to the same cluster. Data dependence based schemes (DDD), such as PEWS [8] and Multicluster [7], try to map data-dependent instructions to the same cluster.

EDD schemes are widely used, dating back to early out-of-order designs [19], and can also be employed locally within clusters of other schemes. However, EDD schemes do not scale well to large issue widths [14]. DDD schemes were found to be more effective than CDD schemes except for the largest issue widths, but both these schemes incur significant area and complexity overheads for small increases in total IPC [14].

In recent years, two trends have significantly changed processor design optimization targets. First, power dissipation is now a primary design constraint, and designs must be evaluated based on both performance and power. Second, the advent of chip-scale multiprocessors has placed greater emphasis on processor core area, as total chip throughput can also be increased by exploiting thread-level parallelism across multiple cores. We believe these trends favor techniques that reduce the complexity of moderate issue-width cores, rather than techniques that use large area and complex control to increase single-thread performance.

In this paper, we introduce "RingScalar", a new centralized out-of-order superscalar microarchitecture that uses banking to increase the area and power efficiency of all the major components in the instruction flow without adding significant pipeline control complexity. RingScalar builds an N-way superscalar from N columns, connected in a unidirectional ring. Each column contains a bank of the issue window, a bank of the physical register file, and an ALU. Communication throughout the microarchitecture is engineered to reduce the number of ports required for each type of access to each bank within each structure. The restricted ring topology reduces electrical loading on latency-critical communications between columns, such as instruction wakeup and value bypassing. We exploit the fact that most decoded instructions are waiting on only one operand to use just a single source tag in each issue window entry, and dispatch instructions to columns according to data dependencies to reduce the performance impact of the restricted communication. Detailed simulations of the SPECint2000 benchmarks on four-issue machines show that a RingScalar design has an average IPC only 13% lower than an idealized superscalar, while having much reduced area, power, and circuit latency.


2 RingScalar Microarchitecture

The RingScalar design builds upon earlier work in banked register files [23, 4, 2, 11, 20], tag elimination [5, 9], and dependence-based scheduling [8, 10]. For clarity, this section describes RingScalar in comparison to a conventional superscalar. Detailed comparison with previous work is deferred to Section 5.

2.1 Baseline Pipeline Overview

RingScalar uses the same general structure as the MIPS R10K [24] and Alpha 21264 [12] processors, with a unified physical register file containing both speculative and committed state. The following briefly summarizes the operation of this style of out-of-order superscalar processor, which we also use as a baseline against which to compare the RingScalar design.

Instructions are fetched and decoded in program order. During decode, each instruction attempts to allocate resources including: an entry in the reorder buffer to support in-order commit; a free physical register to hold the instruction's result value, if any; an entry in the issue window; and an entry in the memory queue, if this is a memory instruction. If some required resource is not available, decode stalls. Otherwise, the architectural register operands of the instruction are renamed to point to physical registers, and the source operands are checked to see if they are already available or if the instruction must wait on the operands in the issue window. The instruction is then dispatched to the issue window, with a tag for each source operand to hold its physical register number and readiness.

Execution occurs out of order from the issue window, driven by data availability. As earlier instructions execute, they broadcast their result tags across the issue window to wake up instructions with matching source tags. An instruction becomes a candidate for execution when all of its source operands are ready. A select circuit picks some subset of the ready instructions for execution on the available functional units. Once instructions have been selected for issue, they read operands from the physical register file and/or the bypass network and proceed to execute on the functional units. When instructions complete execution, they write values to the physical register file and write exception status to the reorder buffer entry. When it is known that an instruction will complete successfully, its issue window entry can be freed.
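The tag-broadcast wakeup step above can be sketched in software. This is a simplified illustrative model, not the authors' hardware; the class and tag names are our own:

```python
# Toy model of tag-broadcast wakeup in a conventional issue window.
# Tags here are strings like "p5" standing in for physical register numbers.
class WindowEntry:
    def __init__(self, src_tags):
        # src_tags: dict mapping each source operand tag -> ready flag
        self.src = dict(src_tags)

    def wakeup(self, result_tag):
        # A broadcast result tag marks any matching source operand ready.
        if result_tag in self.src:
            self.src[result_tag] = True

    def ready(self):
        # The entry may request issue once every source operand is ready.
        return all(self.src.values())

# An entry waiting on p5 (not yet produced), with p9 already available:
entry = WindowEntry({"p5": False, "p9": True})
entry.wakeup("p5")   # producer of p5 completes and broadcasts its tag
assert entry.ready()
```

In a real window this comparison happens in parallel in every entry on every wakeup port, which is exactly the fanout RingScalar later reduces.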

To preserve the illusion of sequential program execution, instructions are committed from the reorder buffer in program order. If the next instruction to commit recorded an exception in the reorder buffer, the machine pipeline is flushed and execution continues at the exception handler. As instructions commit, they free any remaining machine resources (physical register, reorder buffer entry, memory queue entry) for use by new instructions entering decode.

Memory instructions require several additional steps in execution. During decode, an entry is allocated in the memory queue in program order. Memory instructions are split into address calculation and data movement sub-instructions. Store address and store data sub-instructions issue independently from the window, writing to the memory queue on completion. Store instructions only update the cache when they commit, using address and data values from the memory queue. Load instructions are handled as a single load address calculation in the issue window. On issue, loads calculate the effective address then check for address dependencies on earlier stores buffered in the memory queue. Depending on the memory speculation policy (discussed below), the load will attempt to proceed using speculative data obtained either from the cache or from earlier store instructions in the memory queue (the load may later require re-execution if an address mis-speculation or a violation of memory consistency is detected). If the load cannot proceed due to an unresolvable address or data dependency, it waits in the memory queue to reissue when the dependency is resolved. Loads reissuing from the memory queue are given priority for data access over newly issued loads entering the memory queue.
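The load's check against older buffered stores can be sketched as follows. This is a toy model with invented names; a real memory queue also tracks partial overlaps, program order across loads, and the speculation policy:

```python
def search_older_stores(mem_queue, load_addr):
    """Check a load against older stores in the memory queue.
    mem_queue holds (addr, data) tuples in program order; addr is None
    if the store address is not yet resolved. Returns forwarded data,
    "wait" if an older address is unresolved, or None for a cache access."""
    for addr, data in reversed(mem_queue):   # youngest older store first
        if addr is None:
            return "wait"        # unresolved older store: load may have to wait
        if addr == load_addr:
            return data          # forward the store's data to the load
    return None                  # no conflict: load reads the cache

stores = [(0x100, 7), (0x200, 9)]
assert search_older_stores(stores, 0x200) == 9     # forwarded from store
assert search_older_stores(stores, 0x300) is None  # no match: cache access
```

Under a more aggressive speculation policy the "wait" case would instead proceed speculatively and replay on mis-speculation, as the text describes.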

2.2 RingScalar Overview

RingScalar retains this overall instruction flow and uses the same reorder buffer and memory queue, but drastically reduces the circuitry required in the issue window, register file, and bypass network by restricting global communication within these structures. These restrictions exploit the fact that most instructions enter the issue window waiting on one or zero operands.

The overall structure of the RingScalar microarchitecture is shown in Figure 1. RingScalar divides an N-way issue machine into N columns connected in a unidirectional ring. Each column contains a portion of the issue window, a portion of the physical register file, and an ALU. Physical registers are divided equally among the columns, and any instruction that writes a given physical register must be dispatched and issued in the column holding the physical register. This restriction means each bank of the regfile needs only a single write port, directly connected to the output of the ALU in that column.

A second restriction is that any instruction entering the window while waiting for an operand must be dispatched to the column immediately to the right of the column containing the producer of the value (the leftmost column is considered to be to the right of the rightmost column in the ring). This restriction has two major impacts. First, when an instruction executes, it need only broadcast its tag to the neighboring column, which reduces the fanout on the tag wakeup broadcast by a factor of N compared to a conventional window. Second, the bypass network can be reduced to a simple ring connection between ALUs, as any other value an instruction needs should be available from the register file. The bypass fanout is reduced by a factor of N, and each ALU output now only has to drive the regfile write port and the two inputs of the following ALU.

These restrictions are key to the microarchitectural savings in RingScalar and, as shown in the evaluation, have a relatively small impact on instructions per cycle (IPC) while reducing area, power, and circuit latency significantly. The following subsections describe the major components of the machine in more detail.

[Figure 1 (block diagram, not reproduced): four columns Col0-Col3, each with a select arbiter, an issue window bank, and a register file bank (Bank0-Bank3). A rename stage and dispatch crossbar feed the columns; a 2nd wakeup signal crossbar links them. SRC1/SRC2 read address arbiters and crossbars, SRC1/SRC2 local-to-global data port crossbars, and a write address crossbar serve the register file; Select Arbiter II and a load miss write-back reservation path handle load miss write data. Pipeline stages shown: Rename, Wakeup & Select, Read & Bypass, Execute.]

Figure 1. RingScalar core microarchitecture for a four-issue machine. The reorder buffer and the memory queue are not shown.

2.3 RingScalar Register Renaming

The RingScalar design trades a little added complexity in the rename and dispatch stage (and some overall IPC degradation) to reduce the area, power, and latency of the remaining stages (issue window, regfile, bypass network). Figure 2 shows an example of the RingScalar renaming and dispatch process. We used the Alpha ISA in our experiments, where instructions can have zero, one, or two source register operands.

As RingScalar decodes each instruction, the source architectural registers are renamed into the appropriate physical registers and the readiness of these operands is checked, just as in a conventional superscalar processor. Instructions that are not waiting on any operand (zero-waiting) can be dispatched to any column in the window. Instructions that are waiting on one operand (one-waiting) must be dispatched to the column to the right of the producer column. Instructions waiting on two operands (two-waiting) are split into two parts that will issue sequentially, and each part must be dispatched to the column to the right of the appropriate producer column. A separate 2nd wakeup port is provided on each column to enable the first part of an instruction to wake up the second part regardless of the column in which it resides. The second part sits in the window, but will not request issue until after the first part has issued and woken up the second part. The second part can be woken up one cycle after the first part. Previous work has considered the use of prediction to determine which operand will arrive last [5],

[Figure 2 (not reproduced): a four-instruction dispatch group (lw r1, (r3); add r1, r1, #1; sw r1, (r4); xor r3, r1, #4) is renamed to column-tagged physical registers and dispatched across Col0-Col3, which already hold the sub and and instructions; one dispatch conflict is marked.]

Figure 2. RingScalar rename and dispatch. The sub and and instructions were already in the window before the new dispatch group.

but in this work, we adopt a simple heuristic that assumes the first (left) source operand will arrive last for a two-waiting instruction. Store instructions are handled specially by splitting them into two parts (address and data) that can issue independently and in parallel.

To reduce complexity in both the dispatch crossbar and the issue window entries, there is only a single dispatch port for each column in the window. The two parts of a store or two-waiting instruction occupy two separate dispatch ports if they go to different columns, but can be dispatched together in one cycle to the same column. Note that a physical register tag is now two fields: the window column and the register within the column.

The RingScalar renamer has some flexibility in how any zero-waiting instructions (and any dependents) are mapped to columns. To reduce the complexity of the rename logic, we adopt a simple greedy scheme where instructions are considered in program order. Zero-waiting instructions select a dispatch column using a random permutation that changes each cycle. One-waiting and two-waiting instructions have no freedom and must be allocated as described above. When the next instruction cannot be dispatched because a required dispatch port is busy, dispatch stalls until the next cycle.
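The greedy in-order column assignment above can be sketched as follows. This is our simplification, with invented names; it models only single-part instructions (each either zero-waiting or one-waiting on a known producer column):

```python
import random

N = 4  # issue width and number of ring columns

def dispatch_group(insts, rng=random):
    """Greedy in-order column assignment (a sketch of the scheme in the text).
    insts: list where each element is the producer's column for a one-waiting
    instruction, or None for a zero-waiting instruction.
    Returns the list of (instruction index, column) pairs dispatched."""
    perm = rng.sample(range(N), N)   # random permutation, changes each cycle
    free = set(range(N))             # one dispatch port per column
    dispatched = []
    for i, producer_col in enumerate(insts):
        if producer_col is None:
            # Zero-waiting: first column in the permutation with a free port.
            col = next((c for c in perm if c in free), None)
        else:
            # One-waiting: must use the column right of the producer.
            col = (producer_col + 1) % N
            if col not in free:
                col = None
        if col is None:
            break                    # required port busy: stall rest of group
        free.discard(col)
        dispatched.append((i, col))
    return dispatched

# A one-waiting instruction whose producer sits in column 2 must go to column 3:
result = dispatch_group([2, None], random.Random(1))
assert result[0] == (0, 3)
```

Two-waiting and store instructions would simply request two such columns, one per part, as the text describes.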

Figure 3 shows the renaming and column dispatch circuitry in detail. As with a conventional superscalar, the first step is a rename table lookup to find the current physical register holding each architectural register source operand, together with its readiness, while, in parallel, the architectural source registers of later instructions in the group are checked for dependencies on the architectural destination registers of earlier instructions.

Each rename table lookup returns the column (Col) in which the physical register resides and a single bit (Rdy?) indicating if the value is ready or not, in addition to the physical register number. Below the rename table in Figure 3, we show only the circuitry responsible for column allocation and do not show the physical register number within the column. Column information is represented using a unary format with one bit per window column (i.e., Col is an N-bit vector) to simplify circuitry.

[Figure 3 (circuit diagram, not reproduced): a rename table supplies each instruction's src1/src2 Col and Rdy? bits to per-instruction Col. Select blocks, which combine them with a precomputed Zero-Wait Column vector to produce Requests and Next vectors; later instructions mux in the Next vector of in-group producers. Equality comparators and a serial grant chain seeded with the Col. Full vector produce the per-instruction column grants.]

Figure 3. RingScalar register renaming and column dispatch circuitry. Only the circuitry for src1 of instructions 1 and 2 is shown.

For each instruction, the Col. Select circuitry calculates two N-bit column vectors: Requests and Next. The Requests vector has one or two bits set, indicating which columns the instruction wants to dispatch into, and is calculated in two steps. First, if both of the operands are ready, an internal vector is set to the precomputed Zero-Wait Column vector, which has a single bit pointing at the randomly assigned column for this instruction. If at least one of the operands is not ready, the internal vector is set to the bitwise-OR of the two input Col vectors. The internal vector is then rotated by one bit position to yield the Requests vector. The rotation is simple rewiring and so has no additional logic delay.

The Next output has a single bit set, indicating the column into which this instruction will write its final result. First, the internal vector is assigned either the Zero-Wait Column if both operands are ready, the Col for the second source only if the instruction is one-waiting on the second operand, or otherwise the Col for the first source (i.e., one-waiting on the first operand or two-waiting). Second, the internal vector is rotated by one bit position to obtain the Next vector.
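The two vector computations above can be sketched with N-bit unary vectors. This follows our reading of the text (in particular, we assume a ready operand contributes no bits to the OR, so a one-waiting instruction requests exactly one column); names are our own:

```python
N = 4  # window columns; vectors are N-bit lists, bit i = column i

def unary(i):
    """One-hot N-bit column vector with bit i set."""
    v = [0] * N
    v[i] = 1
    return v

def rotate(v):
    # Rotate one position toward the next ring column (pure rewiring in hardware).
    return [v[-1]] + v[:-1]

def col_select(col1, rdy1, col2, rdy2, zero_wait):
    """Sketch of one Col. Select block: returns (requests, next_vec)."""
    if rdy1 and rdy2:
        internal = zero_wait                  # zero-waiting instruction
    else:
        # OR the column vectors of the non-ready operands (our assumption).
        a = [0] * N if rdy1 else col1
        b = [0] * N if rdy2 else col2
        internal = [x | y for x, y in zip(a, b)]
    requests = rotate(internal)

    if rdy1 and rdy2:
        nxt = zero_wait
    elif rdy1:              # one-waiting on the second operand only
        nxt = col2
    else:                   # one-waiting on the first operand, or two-waiting
        nxt = col1
    return requests, rotate(nxt)

# One-waiting on src1 produced in column 2: request and write into column 3.
req, nxt = col_select(unary(2), False, unary(0), True, unary(1))
assert req == unary(3) and nxt == unary(3)
```

A two-waiting instruction yields a Requests vector with two bits set, one per part, matching the dispatch-port discussion earlier.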

Any later instruction in the current dispatch group that has a RAW dependency on an earlier instruction in the group must mux in the Next vector from the earlier instruction in place of the stale column vector read from the rename table. In the worst case, a series of serially-dependent instructions requires the Next values to ripple across the different Col. Select blocks in the dispatch group, as shown in Figure 3. Fortunately, mux select lines are available early, and the ripple path always carries non-ready operands (Rdy? = 0), which reduces worst-case latency to a few gate delays per instruction.

The dispatch column arbiter is implemented using the serial logic gates shown at the bottom of Figure 3. The leftmost input to the arbiter chain is a Col. Full vector indicating which columns cannot accept a new instruction, either because the issue window is full or because there are no free physical registers left in the column. The arbiter ensures that instructions dispatch in program order by preventing later instructions from dispatching if an earlier one did not get all requested columns (this is the purpose of the equality comparators). The ripple through the arbiter is in parallel with the slower ripple through the Col. Select blocks, so it adds only a single gate delay before yielding the column grant signals.

The additional column allocation latency in RingScalar is compensated by the reduced dispatch latency, as each dispatch port fans out to N times fewer entries than in a conventional superscalar, and each entry has one port rather than N.

Note that the column allocation circuitry is a small amount of logic (dozens of gates on each of the N bit slices) and represents a very small power and area overhead compared to the savings in issue window and register file size.

The final step of renaming is to allocate a destination physical register for the instruction if required (stores and branches do not require destination registers, and neither does the first part of a two-waiting instruction). Each column has a separate physical register free list, and so any instruction that is dispatched to a column simply takes the head of the relevant free list.

2.4 Issue Window

The RingScalar issue window has several complexity reductions compared to a conventional superscalar. The primary savings come from the reduced port count in each column. A conventional superscalar window has N dispatch ports, 2N wakeup ports, and N issue ports on each entry. RingScalar has only a single dispatch port, two narrower wakeup ports, and one issue port.

Each column needs only two wakeup ports. The first wakeup port is used by the preceding column to wake up dependent instructions, while the second port wakes up the second part of a two-waiting instruction. Both of these ports are narrower than in a conventional superscalar, as shown in Figure 4. The physical register tag requires log2(N) fewer bits because the consumer must be waiting for a value located in the preceding column. The second-part tag can be considerably narrower as it only needs to distinguish between multiple second parts mapped to the same column in the issue window.
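The tag-width saving can be made concrete with a small worked example. The numbers below are an illustration using a configuration that appears later in the paper (128 physical registers, N = 4 columns), not a claim from this passage:

```python
import math

P, N = 128, 4                      # physical registers, ring columns
full_tag = int(math.log2(P))       # conventional tag: addresses any of P regs
ring_tag = int(math.log2(P // N))  # RingScalar tag: column is implied by wiring

# The saving per tag is exactly log2(N) bits: 7 bits -> 5 bits here.
assert full_tag - ring_tag == int(math.log2(N))
```

Because the comparator width scales with the tag width, this narrowing compounds with the fanout reduction discussed next.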

The RingScalar instruction window design significantly reduces the critical wakeup-select scheduling loop. Wakeup has reduced latency because an instruction only has to broadcast its tag to one column, each entry has only one comparator, and each tag is narrower. A conventional design requires the tag be driven across the entire window, with each entry having two comparators, leading to a 2N times greater fanout in total. The select arbiter also has considerably reduced latency, as each column has a separate arbiter, and each column can only issue at most one instruction. The conventional design has an arbiter with N times more inputs and N times more outputs.

[Figure 4 (not reproduced): a conventional window entry with two full-width source tags, each with ready bits and comparators on the wakeup ports, versus a RingScalar entry with one narrow source tag on a single wakeup port plus a part2 tag on the 2nd-part wakeup port.]

Figure 4. Wakeup circuitry.

Each entry only has a single issue port, which reduces electrical loading to read out instruction information after select. A conventional design has each entry connected to N issue ports, each with N times greater fanout.

The combination of a single dispatch port and a single issue port makes it particularly straightforward to implement a compacting instruction queue [3], where each column in the window holds a stack of instructions ordered by age, with the oldest at the bottom. The fixed-priority select arbiter picks the oldest ready instruction for issue. To compact out a completed instruction from the window, each entry below the hole retains its value, each entry at or above the hole copies the entry immediately above to squeeze out the hole, while the previous highest entry copies from the dispatch port if there is a new dispatch (Figure 5). The age-ordered window columns also simplify pipeline cleanup after exceptions or branch mispredictions.
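One compaction step can be sketched as a list operation. This is a behavioral model only (in hardware every entry shifts in parallel in one cycle); the function name is our own:

```python
def compact(entries, hole, new_entry=None):
    """One compaction step in an age-ordered column (oldest at index 0).
    Entries below the hole keep their values; entries at or above it copy
    from the slot above; the freed top slot takes a new dispatch, if any."""
    out = entries[:hole] + entries[hole + 1:]   # squeeze out the hole
    if new_entry is not None:
        out.append(new_entry)                   # dispatch into the top slot
    return out

col = ["add", "sub", "xor", "lw"]   # oldest ... youngest
assert compact(col, 1, "or") == ["add", "xor", "lw", "or"]
assert compact(col, 0) == ["sub", "xor", "lw"]
```

Because index 0 always holds the oldest instruction, the fixed-priority select arbiter can simply scan from the bottom for the first ready entry.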

[Figure 5 (not reproduced): a conventional window with entries 0-5, each carrying a ready bit and connected to multiple dispatch and issue ports, versus a RingScalar column of entries 0-2 with a single dispatch port, a single issue port, and a fixed-priority select arbiter.]

Figure 5. Compacting instruction queues.


In practice, instruction entries are not compacted out right after issue, but only after it is known they will complete successfully. In particular, dependent operations are scheduled assuming loads will hit in the cache, and must remain in the window until the cache access is validated in case a miss requires the dependent instruction to be replayed. As described in the following section, RingScalar uses the same technique if a banked register file with read conflicts is used.

Instructions are latched after issue, then undergo a second stage of select arbitration (Select Arbiter II in Figure 1) that is used to resolve structural hazards across columns. For example, our evaluation machine only allows a single load to issue per cycle. We also allow only a single first-part sub-instruction to wake up a second-part sub-instruction across the 2nd Wakeup Signal Crossbar each cycle. Instructions failing the second stage of arbitration remain in the issue latch and block further issue in the column until the structural hazard is resolved.

2.5 Register File

One of the greatest savings in the RingScalar design comes from the reduction in the number of write ports required on the register file. Each column has a separate physical register bank, which needs only a single write port.

We allow any column to read data from any register bank in the machine. In the simplest RingScalar design, we provide a full complement of read ports (2N) on every bank. To further reduce regfile power and area, we can reduce the number of read ports per bank and use the speculative read conflict resolution strategy previously published in [20]. For example, the four-issue machine shown in Figure 1 has four read ports and one write port per bank, whereas a conventional machine would have eight read ports and four write ports. A local-to-global read port crossbar is required to allow any functional unit to read any register from any local bank's read port.

The reduced-read-port design adds an additional arbitration stage to the pipeline, as shown in Figure 6. Instructions continue issuing assuming there will be no conflicts. When read-port conflicts are detected, the instruction window must be repaired and execution must be replayed [20]. The number of read bank conflicts is reduced by not requesting read port accesses when a value was produced in the preceding cycle and hence will be available on the bypass ring (conservative bypass-skip [20]), and by implementing the read-sharing optimization [2], which allows a single bank port to send the same register to multiple requesters over the global ports.
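The combined effect of bypass-skip and read sharing on bank arbitration can be sketched as follows. This is our simplification of the schemes cited above ([20, 2]); the function and field names are invented:

```python
def arbitrate_reads(requests, ports_per_bank=4):
    """Sketch of per-bank read-port arbitration with read sharing and
    conservative bypass-skip. requests: list of (bank, reg, bypassed)
    tuples, one per source operand. Returns the set of granted (bank, reg)
    reads, or None if any bank exceeds its ports (conflict -> replay)."""
    needed = {}
    for bank, reg, bypassed in requests:
        if bypassed:
            continue                  # value arrives on the bypass ring: skip
        # Read sharing: identical (bank, reg) requests collapse to one port.
        needed.setdefault(bank, set()).add(reg)
    for bank, regs in needed.items():
        if len(regs) > ports_per_bank:
            return None               # read-port conflict: repair and replay
    return {(b, r) for b, regs in needed.items() for r in regs}

# Two readers of the same register share one port; a bypassed value is skipped.
reqs = [(0, 5, False), (0, 5, False), (1, 2, True)]
assert arbitrate_reads(reqs) == {(0, 5)}
```

A replay path (the None case) would then repair the window as described in the text, at the cost of the extra arbitration stage in Figure 6.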

Unlike the previous design [20], there is no need for a global write port network and an arbiter with bank conflict detection, as RingScalar associates one column with each write port and issues at most one instruction per column. This is a considerable saving, as it was also previously found that more than one write port per bank was required to reduce write port conflicts to an acceptable level [21].

Variable-latency instructions, such as cache misses, also require access to the register file write ports to return their results. To avoid conflicts, a returning cache miss inserts a high-priority request into the select arbiter for the target column, preventing another instruction from issuing while the cache miss uses the write port.

2.6 Bypass Network

RingScalar also provides a large reduction in bypass network complexity. An ALU can only bypass to its neighbor around the ring, not even to its own inputs. This bypass path is sufficient because the dependence-based rename and dispatch ensures dependent instructions are located immediately following the producer in the ring. If an operand was ready when the instruction was dispatched, it would have been obtained from the register file in any case. If a dependent instruction does not issue right after the producer, it must wait until the value is available in the regfile before issuing.

3 Evaluation

To characterize the behavior of RingScalar, we extensively modified SMTSIM [22], a cycle-accurate simulator that models an out-of-order superscalar processor with simultaneous multithreading ability. These modifications included changes to the Rename, Dispatch, Select, Issue, Regfile Read, and Bypass stages of the processor pipeline. A register renaming table that maps architectural registers to physical registers was added to monitor the regfile access from each instruction. To keep track of a unified physical register file organized into banks, extra arbitration logic was added to each regfile bank to prevent over-subscription of read and write ports when a lesser-ported storage cell is used. Since load misses are timing critical, a write-port reservation queue was also added to give them priority over other instructions. Additional changes were made to the register renaming policy, dispatch logic, wakeup-select loop, and issue logic to model the RingScalar design.

This paper focuses on evaluating the performance of RingScalar within integer pipelines, so we chose the SPEC CINT2000 benchmark suite for its wide range of applications taken from a variety of workloads. The suite has long run times, but expected performance can be well characterized without running to completion. To reduce the simulation run time to a reasonable length, the methodology described in [17] is used to fast-forward execution to a sample of half a billion instructions for each application. The benchmarks are compiled with optimization for the Alpha instruction set.

Given the design target for this microarchitecture, wecompare RingScalar against idealized models of 4-issue cen-tralized superscalars. Table 1 shows parameters commonacross the machines compared. We used a large reorderbuffer of 256 entries and a large memory queue of 64 entries

6

Page 8: RingScalar: A Complexity-Effective Out-of-Order …krste/papers/MIT-CSAIL...Jessica H. Tseng and Krste Asanovi´c MIT Computer Science and Artificial Intelligence Laboratory 32 Vassar

Fetch DecodeIssue

Arbitrate Bypass ExecuteRead

WakeupRenameDispatch Select Writeback Commit

Figure 6. RingScalar pipeline structure.

L1 I-cache 16KB 4-way,64-byte lines, 1 cycle

L1 D-cache 16KB 4-way,64-byte lines, 1 cycle

L2 unified cache 1MB 4-way,64-byte lines, 12 cycles

L3 unified cache 8MB 8-way,64-byte lines, 25 cycles

Fetch width 8Dispatch, issue, and 4commit widthInteger ALUs 4Memory instructions 2 (1-Load and 1-Store)Reorder Buffer 256 entriesMemory Queue 64 entriesBranch predictor gshare 4K 2-bit

counters, 12-bit history

Table 1. Common simulation parameters.

such that these would not limit performance. The simulator has an unrealizable memory queue model, with perfect prediction of load latency (hits versus misses) and perfect knowledge of memory dependencies (i.e., loads are only issued when they will not depend on an earlier store). Although we would obtain greater savings by comparing to wider-issue machines, we did not observe a substantial increase in IPC that would justify more than 4-issue on these codes, even with the optimistic memory system.

In this paper, each configuration is labeled with the following nomenclature: (arch)(#iq):(size)R(#read)W(#write), where (arch) is either the monolithic baseline (BL) or the RingScalar (RS) architecture, (#iq) is the total number of instruction window entries, (size) is the regfile size, and (#read) and (#write) are the numbers of read and write ports in each regfile storage cell.
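As an illustration, configuration labels in this nomenclature can be decoded mechanically. The short sketch below is our own (the parser is not part of the paper) and simply splits a label such as RS48:128R4W1 into its fields:

```python
import re

# Hypothetical helper: decode a configuration label of the form
# (arch)(#iq):(size)R(#read)W(#write), e.g. "RS48:128R4W1".
LABEL_RE = re.compile(r"(?P<arch>BL|RS)(?P<iq>\d+):(?P<size>\d+)R(?P<read>\d+)W(?P<write>\d+)")

def parse_config(label):
    m = LABEL_RE.fullmatch(label)
    if m is None:
        raise ValueError(f"not a valid configuration label: {label}")
    d = m.groupdict()
    return {
        "arch": d["arch"],               # BL = monolithic baseline, RS = RingScalar
        "window_entries": int(d["iq"]),  # total instruction window entries
        "regfile_size": int(d["size"]),  # number of physical registers
        "read_ports": int(d["read"]),    # read ports per regfile storage cell
        "write_ports": int(d["write"]),  # write ports per regfile storage cell
    }
```

For example, parse_config("BL32:80R8W4") yields the baseline's 32-entry window, 80 registers, and 8R/4W cells.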

Our idealized 4-issue superscalar baseline, BL32:80R8W4, contains a conventional monolithic issue window with 32 issue queue entries, a fully multiported register file with 80 registers, and a full bypass network. The issue window uses an oldest-first priority scheme to select among multiple ready instructions. For RingScalar, we assume entries are evenly distributed among columns; e.g., RS48:128R8W1 is a RingScalar design where each of the four columns has 12 issue window entries and a bank of 32 registers with 8 read ports and one write port. Any RingScalar with a speculatively-controlled banked register file [21] has an additional read-port arbitration stage, as shown in Figure 6, and its branch misprediction penalty increases by one cycle.

3.1 Resource Sizing

The register file and issue window of RingScalar are spread evenly across the columns. Their utilization is lower than that of a monolithic structure, so the optimal sizing needs to be re-evaluated. Our first experiments show the effect of increasing the regfile size while keeping the issue window fixed. Figure 7 shows diminishing performance improvements as we increase the regfile size for both the baseline superscalar BL32:xR8W4 and the RingScalar RS48:xR4W1 processor. For the baseline design, IPC saturates at 144 registers; performance remains the same if additional registers are added beyond this point. RingScalar, however, keeps improving as more registers are added. Because instructions can only be allocated to a particular column, an imbalance of registers across the columns can lower total regfile utilization. Nevertheless, the diminishing returns do not justify implementing a regfile larger than 128 entries for issue queue sizes of 48 and 64. The performance of the BL128:256R8W4 configuration is also plotted to show the limit on IPC for these codes with this simulation framework.

[Figure: average IPC (roughly 1.2-1.6) versus regfile size (100-250) for BL128:256R8W4, BL32:xR8W4, BL16:xR8W4, RS64:xR4W1, and RS48:xR4W1.]

Figure 7. Average IPC comparison for different regfile sizes.

Our second set of experiments, in Figure 8, shows how performance varies with the size of the RingScalar issue window. For designs with 256 registers, IPC improvement tapers off beyond a window size of 64; for designs with 128 registers, it tapers off beyond a 48-entry window. This also demonstrates the importance of a balanced design, as increasing the resources in just a single area will not always lead to higher performance. We chose RS64:256 and RS48:128 as the two basic configurations for the RingScalar evaluation.


[Figure: average IPC versus issue window size (20-120 entries) for RSx:256R4W1 and RSx:128R4W1.]

Figure 8. RingScalar average IPC sensitivity to instruction window size.

[Figure: stacked-bar distribution (% of total instructions) of zero-waiting, one-waiting, and two-waiting instructions for each benchmark (bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, avg).]

Figure 9. Distribution of zero-waiting, one-waiting, and two-waiting instructions for SPEC CINT2000 running on the baseline superscalar.

3.2 Operand Availability

Previous work indicates that issue window source tags are underutilized and that many instructions enter the issue queue with zero or one outstanding register operands [5, 9]. Our results, shown in Figure 9, confirm this finding, with more than 80% of instructions waiting on one or zero operands for our four-way baseline superscalar processor running the SPEC CINT2000 benchmark suite. The percentage of two-waiting instructions ranges from 3.9% (mcf) to 21.9% (bzip), with an average of 11.8%.

As the second part of a two-waiting instruction will only be issued if woken up by the first part, the arrival timing of the two unmet operands impacts the performance of RingScalar. When the source operand of the second part arrives last, the wakeup of the second part is likely to finish before the instruction is ready to issue. However, if the second part's operand arrives first, or arrives at the same time as the first part's, waking the second part delays issue of the ready instruction by additional cycles. RingScalar uses a simple scheme in which the left source operand is always predicted to arrive last. Figure 10 shows that the left source operand arrives last more than half of the time in seven out of twelve programs (the ratio ranges from 35.8% to 65.0%, with an

[Figure: stacked-bar distribution (% of two-waiting instructions) per benchmark of whether the left operand arrives last, the right operand arrives last, or both arrive at the same time.]

Figure 10. Percentage distribution of the last-arriving operand for two-waiting instructions.

average of 50.0%). These numbers do not include store instructions, where the address calculation and data movement issue independently and in parallel.

3.3 IPC Comparison

We compared RingScalar performance against the baseline configuration. Simulations were run with a gshare branch predictor (Figure 11) and with a perfect branch predictor (Figure 12) to ascertain the effect of the extra pipeline stage and of branch predictor inaccuracies. Relative performance differences remain reasonably consistent across the different branch predictor designs, with IPC increasing 12% on average under perfect branch prediction.

Despite their simplified window design and their much more realistic implementation parameters, the RingScalar results are quite competitive with the idealized superscalars. In comparison to the baseline (BL32:80R8W4), the small RingScalar design without regfile read-port conflicts (RS48:128R8W1) shows an average IPC reduction of 12%, with a maximum degradation of 24%. The performance impact is mainly due to delayed issue of critical instructions, which can pile up in the same issue column. Extra regfile savings can be achieved in RingScalar with a lesser-ported banked structure. Figure 11 shows that IPC drops only another 1% for the RS48:128R4W1 design, but 4% for the RS48:128R2W1 design. Further comparing the RS48:128R4W1 design to the large RingScalar design (RS64:256R4W1), only a 2% IPC difference is observed. These data suggest that RS48:128R4W1 is a good design point for a four-issue machine.

3.4 Two-waiting Queues

The performance loss from using a single-tag instruction window is evaluated by comparing against a variation of the RingScalar design in which each issue queue column is divided into three banks. Figure 13 shows that the instruction banks are of three types, depending on whether instructions are waiting on zero, one, or two source register operands. Unlike the original RingScalar design, instructions that wait on both operands are placed in the banks with two source


[Figure: per-benchmark IPC (0-3) for BL32:80R8W4, RS64:256R4W1, RS48:128R8W1, RS48:128R4W1, and RS48:128R2W1.]

Figure 11. SPEC CINT2000 IPC with a gshare branch predictor.

[Figure: per-benchmark IPC (0-3) for BL32:80R8W4, RS64:256R4W1, RS48:128R8W1, RS48:128R4W1, and RS48:128R2W1.]

Figure 12. SPEC CINT2000 IPC with a perfect branch predictor.

tags. The two-waiting queues reduce the complexity of the register renaming logic and eliminate the prediction of operand availability, as two-waiting instructions are no longer split into two parts. Simulations show a 3% IPC improvement across the benchmarks after a 16-entry two-waiting queue is added to RingScalar (RS48:128R4W1). We believe this small improvement does not justify the additional area and power consumption of two-waiting queues.

4 Complexity Analysis

To determine the complexity-effectiveness of RingScalar designs, we analyze the area, latency, and power reductions of key components in this section. The approach is first to compare required regfile die area by counting the number of occupied wire tracks. We then evaluate the factors that determine the latency and power of issue windows.

The area of a register file can be estimated from the number of bitlines and the number of wordlines [15, 20]. For regfiles with single-ended reads and differential writes, we use Equation 1 to approximate the grid area of a bit-slice. The width (w) and height (h) of a storage cell, including power and ground, are given in units of wire tracks. R is the number of read ports, W is the number of write ports, and E is the number of physical entries in the regfile. The equation expresses that each regfile port requires one wordline per entry, plus a single bitline for read ports and a pair of bitlines for write ports:

    Area = (w + R + 2W) x (h + R + W) x E    (1)

An issue window consists of dispatch ports, issue ports, wakeup ports, comparators, a tag broadcast network, and select arbiters. The speed of the wakeup-select loop is a function of wire propagation delay and fan-in/fan-out delay. Its power consumption is proportional to switched capacitance, such as wire capacitance and transistor parasitic capacitance. RingScalar reduces these parameters by adopting lesser-ported banked structures. Power savings can also be achieved by minimizing the number of active components, such as comparators. Using Equation 2, the number of bit comparators in an issue window can be determined. T is the


[Figure: RingScalar datapath with four columns (Col0-Col3). After the rename stage, a dispatch crossbar and arbiter feed each column's issue window, which is split into zero-, one-, and two-waiting banks served by per-column select arbiters (Select Arbiter I and II). SRC1 and SRC2 read-address arbiters and crossbars feed the four register file banks (Bank0-Bank3); SRC1/SRC2 local-to-global data port crossbars and a write-address crossbar connect the banks to the bypass and execute stages, and load-miss writeback uses a write-port reservation path for its write address and data.]

Figure 13. RingScalar architecture for designs with three issue banks per column.

number of tags per entry, B is the number of tag bits per entry (which depends on the regfile size), W is the number of wakeup ports per tag, and E is the number of entries. For example, there are 2 x 7 x 8 x 32 = 3584 bit comparators in the baseline (BL32:80R8W4) design.

    N_comparators = T x B x W x E    (2)

Table 2 provides a complexity and performance comparison across several RingScalar designs and the baseline. In general, RingScalar designs are smaller and more power-efficient than the conventional superscalar. Even the large RingScalar design (RS64:256R4W1) has a smaller regfile, a much simpler issue window (faster wakeup-select), and a smaller bypass network than the baseline, despite its increased number of regfile entries.

The RS48:128R4W1 design point appears to be a sweet spot in the performance-complexity space. Its regfile area is under half that of the baseline, its issue window is much smaller, and its IPC is just 13.3% below the baseline. Adding more read ports only increases IPC by 1% (RS48:128R8W1). The additional regfile area saving from moving from RS48:128R4W1 to RS48:128R2W1 is only 9%, but this causes a 3% drop in IPC. Furthermore, a 48-entry RingScalar issue window requires only one fourth the number of dispatch, issue, and wakeup ports, and 21% of the tag comparators, of a conventional 32-entry design. Table 2 also shows that RingScalar reduces wakeup delay because of its reduced wakeup broadcast fan-out. Compared to the baseline configuration, RS48:128R4W1 also has faster select timing: BL32:80R8W4 has one arbiter that selects four instructions out of a pool of 32, while RingScalar has four arbiters, each independently selecting only one out of 12. The reduced bypass network decreases the ALU fan-out by a factor of three, while the bypass mux fan-in is cut from seven to four.

In this evaluation, the baseline architecture was idealized in several respects, so more realistic superscalar models would have a reduced IPC advantage over RingScalar. For example, most designs approximate oldest-first select arbitration to reduce circuit complexity; real designs have less than fully orthogonal functional unit issue; and they experience load-hit mispredictions and memory dependence misspeculations. These additional stalls will tend to reduce the IPC advantage of existing superscalar designs.

Although, due to the large engineering effort required, we were unable to perform a full analysis using full-custom circuit implementations of both RingScalar and the conventional design, we believe the complexity analysis above shows that RingScalar is a promising approach that could enable significant cycle time, power, and area reductions.

5 Related Work

In this section, we describe how RingScalar relates to earlier work on banked register files, tag elimination, and dependence-based scheduling [8, 10]. We also discuss its relationship to clustered designs.

RingScalar improves on previous work on banked register files by using the rename stage to steer instructions such that only a single write port per bank is required, avoiding write conflicts while supporting the simple pipeline control structure proposed in [20] to also reduce read ports per cell.

The fact that many operands are ready before an instruction is dispatched has inspired the tag-elimination [5], half-price architecture [9], and banked issue queue [3] designs. Compared to a conventional scheduler, these schemes reduce the electrical loading of the wakeup ports and the number of comparators by keeping the number of tag checks to a minimum. However, they still require N wakeup ports, N dispatch ports, and N issue ports for an N-way machine. In contrast, RingScalar reduces both the number of tag comparisons and the number of wakeup ports. In addition, each RingScalar issue queue entry requires only a single dispatch port and a single issue port.

Earlier dependence-based scheduling techniques have exploited the fact that it is unnecessary to perform tag checks


Configuration   Regfile  Dispatch/Issue/  # Compa-  Wakeup   Select     ALU      MUX     IPC
                Area     Wakeup Ports     rators    Fan-out  Arbiter    Fan-out  Fan-in

BL32:80R8W4     100.0%   4/4/8            100.0%    64       4 from 32  9        7       100.0%
RS64:256R4W1    80.8%    1/1/2            28.6%     16       1 from 16  3        4       89.7%
RS48:128R8W1    87.6%    1/1/2            21.4%     12       1 from 12  3        4       87.6%
RS48:128R4W1    40.4%    1/1/2            21.4%     12       1 from 12  3        4       86.7%
RS48:128R2W1    31.4%    1/1/2            21.4%     12       1 from 12  3        4       84.2%

Table 2. Total complexity comparisons. Results in percentages are normalized to the baseline (BL32:80R8W4).

on a consumer instruction prior to the issue of the producer, allowing instructions in the same dependency chain to be grouped to reduce wakeup complexity [8, 10, 13, 6]. By keeping track of the dependency chains and only issuing the instructions that reach the heads of issue queues, these schemes eliminate tag broadcast and simplify the select logic. However, these designs either add complexity in tracking the dependency chains or require a large interconnect crossbar to connect the distributed ports. For example, the FIFO-based approach presented by Palacharla et al. [10] requires that N dependent instructions can be steered into the tail of a FIFO at dispatch time. This results in an N x (N x F) interconnect crossbar to allow any of the N dispatched instructions to connect to any of the N dispatch ports on any of the F FIFOs. In contrast, RingScalar spreads dependent instructions across different columns, reducing the dispatch interconnect requirements and the number of dispatch ports per entry. To reduce wakeup costs, RingScalar restricts the possible wakeup paths across windows, and to reduce select costs, only a subset of instructions is considered for each issue slot. RingScalar's integrated approach also reduces the cost of the register file and bypass network.
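As a back-of-the-envelope illustration (our own sketch; the FIFO count F is a free parameter, not a value fixed by this paper, and the RingScalar count assumes one dispatch port per column as in Table 2), the dispatch crosspoint counts of the two schemes scale as follows:

```python
def fifo_dispatch_crosspoints(n, f):
    """Palacharla-style FIFO steering: any of n dispatched instructions
    may connect to any of n dispatch ports on any of f FIFOs,
    i.e. an n x (n*f) crossbar."""
    return n * (n * f)

def ringscalar_dispatch_crosspoints(n):
    """RingScalar (assumed model): each of n dispatched instructions
    connects to one of n columns, each exposing a single dispatch port."""
    return n * n

n = 4  # 4-issue machine
f = 8  # hypothetical FIFO count, chosen only for illustration
print(fifo_dispatch_crosspoints(n, f), ringscalar_dispatch_crosspoints(n))  # 128 16
```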

Although the techniques are similar, clustering can be distinguished from banking in two ways. First, the components that are most closely connected differ. A clustered architecture first tightly couples a local issue window, a local regfile, and local execution units within each cluster, then provides looser coupling between the clusters. A banked architecture first tightly couples the banks within an issue window or a regfile, often with combinational paths, then connects these major microarchitectural structures using conventional registers between pipeline stages. Second, clustering is generally used in larger-scale designs, where each cluster is itself a moderate issue-width core, whereas banking reduces the complexity of a moderate issue-width core.

Multiscalar [18] was the earliest CDD clustered scheme to exploit control flow hierarchy; it uses compiler techniques to statically divide a single program into a collection of tasks. Each task is then dynamically scheduled and assigned to one of many execution clusters at run time. Aggressive control speculation and memory dependence speculation are implemented to allow parallel execution of multiple tasks. To maintain a sequential appearance, each task is retired in program order. This scheme requires extensive modification to the compiler, and performance depends heavily on the ability to execute tasks in parallel. Alternative CDD schemes include Trace Processors [16], which build traces dynamically as the program executes and dynamically dispatch whole traces to different clusters. This approach, however, requires a large centralized cache to store the traces, and the algorithm used to delineate traces strongly influences overall performance. Both Multiscalar and Trace Processors attempt to reduce inter-cluster communication by localizing sequential program segments to an individual cluster, but rely largely on control and value prediction for parallel execution.

PEWs [8] and Multicluster [7] are DDD schemes that try to assign dependent instructions to the same cluster to minimize inter-cluster communication, with the assignments determined at decode time. Although performance is comparable to a centralized design for small numbers of clusters, these algorithms inherently have poor load balancing and cannot effectively utilize a large number of clusters. The Multicluster authors [7] suggest the possibility of using compiler techniques to increase utilization by performing code optimization and scheduling. The fine-grained banking approach in RingScalar gives satisfactory performance with a simple fixed instruction distribution algorithm, whereas these larger clustered approaches often require complex instruction distribution schemes.

Abella and Gonzalez [1] recently published a related approach for clusters, which distributes the workload across all the clusters by placing consumer instructions in the cluster next to the one that contains the producer instructions. The processor is laid out in a ring configuration so that the results of a cluster can be forwarded to the neighboring cluster with low latency. Each partition of the register file can be read only from its own cluster but written only from the previous neighboring cluster. This design still requires long-latency inter-cluster communication to move missing operands to the appropriate clusters. RingScalar uses a simpler variant of this ring scheme, although with the goal of reducing complexity rather than providing good load balancing.

Several of the proposals mentioned above have attempted to reduce the cost of one component of a superscalar architecture (e.g., just the register file or just the issue window), but often with a large increase in overall pipeline control complexity, or while needing compensating enhancements to other portions of the machine (e.g., extending the bypass network to forward values queuing for limited regfile write ports). The RingScalar architecture is engineered to simplify all of the major components simultaneously.

6 Conclusion

The RingScalar design provides complexity reductions throughout the major components of an out-of-order processor, in exchange for a small increase in complexity in the rename and dispatch stages. Compared with idealized superscalar architectures, there is only a small (10.3-13.3%) drop in IPC, but a large reduction in the area, power, and latency of the issue window, register file, and bypass network. RingScalar should be even more competitive against realistic conventional superscalar processors, and should provide a suitable design point for CMP cores that need both high single-thread performance and low power and area.

References

[1] J. Abella and A. Gonzalez. Inherently workload-balanced clustered microarchitecture. In IPDPS-19, Long Beach, CA, April 2005.

[2] R. Balasubramonian, S. Dwarkadas, and D. H. Albonesi. Reducing the complexity of the register file in dynamic superscalar processors. In MICRO-34, Austin, TX, December 2001.

[3] A. Buyuktosunoglu, D. Albonesi, P. Bose, P. Cook, and S. Schuster. Tradeoffs in power-efficient issue queue design. In ISLPED'02, Monterey, CA, August 2002.

[4] J.-L. Cruz, A. Gonzalez, M. Valero, and N. P. Topham. Multiple-banked register file architectures. In ISCA-27, Vancouver, Canada, June 2000.

[5] D. Ernst and T. Austin. Efficient dynamic scheduling through tag elimination. In ISCA-29, Anchorage, AK, May 2002.

[6] D. Ernst, A. Hamel, and T. Austin. Cyclone: A broadcast-free dynamic instruction scheduler with selective replay. In ISCA-30, San Diego, CA, June 2003.

[7] K. I. Farkas, P. Chow, N. P. Jouppi, and Z. G. Vranesic. The Multicluster architecture: Reducing cycle time through partitioning. In MICRO-30, Research Triangle Park, NC, December 1997.

[8] G. A. Kemp and M. Franklin. PEWs: A decentralized dynamic scheduler. In ICPP'96, Bloomingdale, IL, August 1996.

[9] I. Kim and M. Lipasti. Half-price architecture. In ISCA-30, San Diego, CA, June 2003.

[10] S. Palacharla, N. Jouppi, and J. E. Smith. Complexity-effective superscalar processors. In ISCA-24, Denver, CO, June 1997.

[11] I. Park, M. D. Powell, and T. N. Vijaykumar. Reducing register ports for higher speed and lower energy. In MICRO-35, Istanbul, Turkey, November 2002.

[12] D. Ponomarev, G. Kucuk, O. Ergin, K. Ghose, and P. Kogge. Energy-efficient issue queue design. IEEE Transactions on Very Large Scale Integration Systems, 11(5), October 2003.

[13] S. Raasch, N. Binkert, and S. Reinhardt. A scalable instruction queue design using dependence chains. In ISCA-29, Anchorage, AK, May 2002.

[14] N. Ranganathan and M. Franklin. An empirical study of decentralized ILP execution models. In ASPLOS-8, San Jose, CA, October 1998.

[15] S. Rixner, W. Dally, B. Khailany, P. Mattson, U. J. Kapasi, and J. D. Owens. Register organization for media processing. In HPCA-6, Toulouse, France, January 2000.

[16] E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. E. Smith. Trace processors. In MICRO-30, Research Triangle Park, NC, December 1997.

[17] S. Sair and M. Charney. Memory behavior of the SPEC2000 benchmark suite. Technical report, IBM Research, Yorktown Heights, NY, October 2000.

[18] G. S. Sohi, S. Breach, and T. N. Vijaykumar. Multiscalar processors. In ISCA-22, Santa Margherita Ligure, Italy, June 1995.

[19] R. M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM Journal of Research and Development, 11(1), January 1967.

[20] J. Tseng and K. Asanovic. Banked multiported register files for high-frequency superscalar microprocessors. In ISCA-30, San Diego, CA, June 2003.

[21] J. Tseng and K. Asanovic. A speculative control scheme for an energy-efficient banked register file. IEEE Transactions on Computers, 54(6), June 2005.

[22] D. M. Tullsen. Simulation and modeling of a simultaneous multithreading processor. In 22nd Annual Computer Measurement Group Conference, San Diego, CA, December 1996.

[23] S. Wallace and N. Bagherzadeh. A scalable register file architecture for dynamically scheduled processors. In PACT-5, Boston, MA, October 1996.

[24] K. C. Yeager. The MIPS R10000 superscalar microprocessor. IEEE Micro, 16(2), April 1996.
