
Page 1:

UC Regents Spring 2014 © UCB -- CS 152 L18: Dynamic Scheduling I

2014-4-3 John Lazzaro (not a prof - “John” is always OK)

CS 152 Computer Architecture and Engineering
www-inst.eecs.berkeley.edu/~cs152/

TA: Eric Love

Lecture 19 -- Dynamic Scheduling II

Page 2:


Case studies of dynamic execution

DEC Alpha 21264: High performance from a relatively simple implementation of a modern instruction set.

IBM Power: Evolving dynamic designs over many generations.

Simultaneous Multi-threading: Adapting multi-threading to dynamic scheduling.

Short Break


Page 3:

DEC Alpha

21164: 4-issue in-order design.

21264: 4-issue out-of-order design.

The 21264 was 50% to 200% faster in real-world applications.

Page 4:

500 MHz, 0.5µ parts for the in-order 21164 and the out-of-order 21264.

Similarly-sized on-chip caches (116K vs 128K); the in-order 21164 has a larger off-chip cache.

The 21264 has 55% more transistors than the 21164. The die is 44% larger.

The 21264 has a 1.7x advantage on integer code, and a 2.7x advantage on floating-point code.

The 21264 consumes 46% more power than the 21164.

Page 5:


Alpha microprocessors have been performance leaders since their introduction in 1992. The first generation 21064 and the later 21164 [1,2] raised expectations for the newest generation; performance leadership was again a goal of the 21264 design team. Benchmark scores of 30+ SPECint95 and 58+ SPECfp95 offer convincing evidence thus far that the 21264 achieves this goal and will continue to set a high performance standard.

A unique combination of high clock speeds and advanced microarchitectural techniques, including many forms of out-of-order and speculative execution, provide exceptional core computational performance in the 21264. The processor also features a high-bandwidth memory system that can quickly deliver data values to the execution core, providing robust performance for a wide range of applications, including those without cache locality. The advanced performance levels are attained while maintaining an installed application base. All Alpha generations are upward-compatible. Database, real-time visual computing, data mining, medical imaging, scientific/technical, and many other applications can utilize the outstanding performance available with the 21264.

Architecture highlights

The 21264 is a superscalar microprocessor that can fetch and execute up to four instructions per cycle. It also features out-of-order execution [3,4]. With this, instructions execute as soon as possible and in parallel with other nondependent work, which results in faster execution because critical-path computations start and complete quickly.

The processor also employs speculative execution to maximize performance. It speculatively fetches and executes instructions even though it may not know immediately whether the instructions will be on the final execution path. This is particularly useful, for instance, when the 21264 predicts branch directions and speculatively executes down the predicted path.

Sophisticated branch prediction, coupled with speculative and dynamic execution, extracts instruction parallelism from applications. With more functional units and these dynamic execution techniques, the processor is 50% to 200% faster than its 21164 predecessor for many applications, even though both generations can fetch at most four instructions per cycle [5].

The 21264's memory system also enables high performance levels. On-chip and off-chip caches provide for very low latency data access. Additionally, the 21264 can service many parallel memory references to all caches in the hierarchy, as well as to the off-chip memory system. This permits very high bandwidth data access [6]. For example, the processor can sustain more than 1.3 GBytes/sec on the Stream benchmark [7].

The microprocessor's cycle time is 500 to 600 MHz, implemented by 15 million transistors in a 2.2-V, 0.35-micron CMOS process with six metal layers. The 3.1 cm² processor

R. E. Kessler, Compaq Computer Corporation

THE ALPHA 21264 OWES ITS HIGH PERFORMANCE TO HIGH CLOCK SPEED, MANY FORMS OF OUT-OF-ORDER AND SPECULATIVE EXECUTION, AND A HIGH-BANDWIDTH MEMORY SYSTEM.

"The Alpha 21264 Microprocessor," IEEE Micro, March-April 1999.

The Real Difference: Speculation

If the ability to recover from mis-speculation is built into an implementation ... it offers the option to add speculative features to all parts of the design.

Page 6:

Gronowski et al., "High-Performance Microprocessor Design"

Fig. 2. 21064 die photo.

Fig. 3. 21164 die photo.

II. ARCHITECTURE

The Alpha instruction set architecture is a true 64-bit load/store RISC architecture designed with emphasis on high clock speed and multiple instruction issue [4]. Fixed-length instructions, minimal instruction ordering constraints, and 64-bit data manipulation allow for straightforward instruction decode and a clean microarchitectural design. The architecture does not contain condition codes, branch delay slots, adaptations from existing 32-bit architectures, or other bits of architectural history that can add complexity. The chip organization for each generation was carefully chosen to gain the most advantage from microarchitectural features while maintaining the ability to meet critical circuit paths.

Fig. 4. 21264 die photo.

The 21064 is a fully pipelined in-order execution machine capable of issuing two instructions per clock cycle. It contains one pipelined integer execution unit and one pipelined floating-point execution unit. Integer instruction latency is one or two cycles, except for multiplies, which are not pipelined. Floating-point instruction latency is six cycles for all instructions except for divides. The chip includes an 8-kB instruction cache and an 8-kB data cache. The emphasis of this design was to gain performance through clock rate while keeping the architecture relatively simple. Subsequent designs rely more heavily on aggressive architectural enhancements to further increase performance.

The quad-issue, in-order execution implementation of the 21164 was more complex than the 21064, but simpler than an out-of-order execution implementation [5]. It contains two pipelined integer execution units and two pipelined floating-point execution units. The first-level cache was changed to nonblocking. A second-level 96-kB unified cache was added on-chip to improve memory latency without adding excessive complexity. Integer latency was reduced to one cycle for all instructions, and was roughly halved for MUL instructions. The floating-point unit contains separate add and multiply pipelines, each with a four-cycle latency [6]. Floating-point divide latency is reduced by 50%.

The trend of increased architectural complexity continues with Digital's latest Alpha microprocessor. The 21264 gains

21264 die photo, annotated: fetch and predict, I-cache, data cache, two integer pipes, FP pipe, and the OoO control blocks.

Separate OoO control for integer and floating point.

RISC decode happens in the OoO blocks.

Unlabeled areas are devoted to memory system control.


Page 7:

comes in a 587-pin PGA package. It can execute up to 2.4 billion instructions per second.

Figure 1 shows a photo of the 21264, highlighting major sections. Figure 2 is a high-level overview of the 21264 pipeline, which has seven stages, similar to the earlier in-order 21164. One notable addition is the map stage that renames registers to expose instruction parallelism; this addition is fundamental to the 21264's out-of-order techniques.

Instruction pipeline: Fetch

The instruction pipeline begins with the fetch stage, which delivers four instructions to the out-of-order execution engine each cycle. The processor speculatively fetches through line, branch, or jump predictions. Since the predictions are usually accurate, this instruction fetch implementation typically supplies a continuous stream of good-path instructions to keep the functional units busy with useful work.

Two architectural techniques increase fetch efficiency: line and way prediction, and branch prediction. A 64-Kbyte, two-way set-associative instruction cache offers much-improved level-one hit rates compared to the 8-Kbyte, direct-mapped instruction cache in the Alpha 21164.

Line and way prediction

The processor implements a line and way prediction technique that combines the advantages of set-associative behavior and fetch bubble elimination, together with the fast access time of a direct-mapped cache. Figure 3 (next page) shows the technique's main features. Each four-instruction fetch block includes a line and way prediction. This prediction indicates where to fetch the next block of four instructions, including which way, that is, which of the two choices allowed by the two-way associative cache.

The processor reads out the next instructions using the prediction (via the wraparound path in Figure 3) while, in parallel, it completes the validity check for the previous instructions. Note that the address paths needing extra logic levels (instruction decode, branch prediction, and cache tag comparison) are outside the critical fetch loop.

The processor loads the line and way predictors on an instruction cache fill, and

Figure 1. Alpha 21264 microprocessor die photo, showing the instruction fetch unit, the integer mapper and integer queue, the two integer unit clusters, the floating-point map and queue, the floating-point units, the instruction and data caches, the memory controllers, the data and control buses, and the bus interface unit (BIU).

Figure 2. Stages of the Alpha 21264 instruction pipeline: Fetch (0), Slot (1), Rename (2), Issue (3), Register read (4), Execute (5), Memory (6). The branch predictor and line/set prediction feed a 64-Kbyte, two-way instruction cache; integer register rename feeds a 20-entry integer issue queue and duplicated 80-entry integer register files in front of four integer execution units; floating-point register rename feeds a 15-entry floating-point issue queue and a 72-entry floating-point register file in front of the floating-point add and multiply units; the 64-Kbyte, two-way data cache connects to the level-two cache and system interface.

21264 pipeline diagram

Rename and Issue stages are the primary locations of dynamic scheduling logic. Load/store disambiguation support resides in the Memory stage.

Slot: absorbs the delay of the long path on the last slide.


Page 8:



Fetch stage close-up:

dynamically retrains them when they are in error. Most mispredictions cost a single cycle. The line and way predictors are correct 85% to 100% of the time for most applications, so training is infrequent. As an additional precaution, a 2-bit hysteresis counter associated with each fetch block eliminates overtraining: training occurs only when the current prediction has been in error multiple times. Line and way prediction is an important speed enhancement since the mispredict cost is low and line/way mispredictions are rare.

Beyond the speed benefits of direct cache access, line and way prediction has other benefits. For example, frequently encountered predictable branches, such as loop terminators, avoid the mis-fetch penalty often associated with a taken branch. The processor also trains the line predictor with the address of jumps and subroutine calls that use direct register addressing. Code using dynamically linked library routines will thus benefit after the line predictor is trained with the target. This is important since the pipeline delays required to calculate the indirect (subroutine) jump address are eight cycles or more.
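As a concrete illustration of the scheme described above, here is a minimal Python sketch of per-block line/way prediction with 2-bit hysteresis training. The class and function names, the block layout, and the training policy details are illustrative assumptions, not the 21264's actual structures.

class FetchBlockEntry:
    """One I-cache fetch block: four instructions plus a line/way prediction."""
    def __init__(self, instructions, pred_line, pred_way):
        self.instructions = instructions   # the 4 cached instructions
        self.pred_line = pred_line         # predicted index of the next cache line
        self.pred_way = pred_way           # predicted way (0 or 1)
        self.hysteresis = 0                # 2-bit counter; retrain only after repeated errors

def fetch_block(icache, line, way):
    """Speculative fetch: read a block and chase its next-line/way prediction."""
    block = icache[line][way]
    # The fast loop uses only the prediction to name the next block to read.
    return block.instructions, block.pred_line, block.pred_way

def train_predictor(block, actual_line, actual_way):
    """Off the critical path, once the true next fetch address is known."""
    if (block.pred_line, block.pred_way) == (actual_line, actual_way):
        block.hysteresis = max(block.hysteresis - 1, 0)
        return
    # Mispredicted: bump the hysteresis counter and retrain only when it
    # saturates, so one anomalous fetch does not destroy a good prediction.
    block.hysteresis = min(block.hysteresis + 1, 3)
    if block.hysteresis == 3:
        block.pred_line, block.pred_way = actual_line, actual_way
        block.hysteresis = 0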

An instruction cache miss forces the instruction fetch engine to check the level-two (L2) cache or system memory for the necessary instructions. The fetch engine prefetches up to four 64-byte (or 16-instruction) cache lines to tolerate the additional latency. The result is very high bandwidth instruction fetch, even when the instructions are not found in the instruction cache. For instance, the processor can saturate the available L2 cache bandwidth with instruction prefetches.

Branch prediction

Branch prediction is more important to the 21264's efficiency than to previous microprocessors for several reasons. First, the seven-cycle mispredict cost is slightly higher than in previous generations. Second, the instruction execution engine is faster than in previous generations. Finally, successful branch prediction can utilize the processor's speculative execution capabilities. Good branch prediction avoids the costs of mispredicts and capitalizes on the most opportunities to find parallelism. The 21164 could accept 20 in-flight instructions at most, but the 21264 can accept 80, offering many more parallelism opportunities.

The 21264 implements a sophisticated tournament branch prediction scheme. The scheme dynamically chooses between two types of branch predictors (one using local history, and one using global history) to predict the direction of a given branch [8]. The result is a tournament branch predictor with better prediction accuracy than larger tables of either individual method, with a 90% to 100% success rate on most simulated applications/benchmarks. Together, local and global correlation techniques minimize branch mispredicts. The processor adapts to dynamically choose the best method for each branch.

Figure 4, in detailing the structure of the tournament branch predictor, shows the local-history prediction path, through a two-level structure, on the left. The first level holds 10 bits of branch pattern history for up to 1,024 branches. This 10-bit pattern picks from one of 1,024 prediction counters. The global predictor is a 4,096-entry table of 2-bit saturating counters indexed by the path, or global, history of the last 12 branches. The choice prediction, or chooser, is also a 4,096-entry table of 2-bit prediction counters indexed by the path history. The "Local and global branch predictors" box describes these techniques in more detail.

The processor inserts the true branch direction in the local-history table once branches
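The table sizes above translate directly into a small model. The following Python sketch assumes a 1,024-entry local history table of 10-bit patterns feeding 1,024 3-bit counters, plus 4,096-entry global and choice tables of 2-bit counters indexed by a 12-branch path history; the indexing, counter updates, and chooser training are simplifications rather than the 21264's exact logic.

class TournamentPredictor:
    def __init__(self):
        self.local_hist = [0] * 1024    # 10-bit branch pattern history per branch
        self.local_ctr = [4] * 1024     # 3-bit prediction counters, indexed by pattern
        self.global_ctr = [1] * 4096    # 2-bit counters, indexed by path history
        self.choice_ctr = [1] * 4096    # 2-bit counters: trust global (>=2) or local?
        self.path_hist = 0              # directions of the last 12 branches

    def predict(self, pc):
        hist = self.local_hist[pc % 1024]
        local_taken = self.local_ctr[hist] >= 4
        global_taken = self.global_ctr[self.path_hist] >= 2
        return global_taken if self.choice_ctr[self.path_hist] >= 2 else local_taken

    def update(self, pc, taken):
        idx = pc % 1024
        hist = self.local_hist[idx]
        local_taken = self.local_ctr[hist] >= 4
        global_taken = self.global_ctr[self.path_hist] >= 2
        step = 1 if taken else -1
        # The chooser only learns when the two component predictors disagree.
        if local_taken != global_taken:
            delta = 1 if global_taken == taken else -1
            self.choice_ctr[self.path_hist] = min(3, max(0, self.choice_ctr[self.path_hist] + delta))
        # Train both component predictors toward the true outcome.
        self.local_ctr[hist] = min(7, max(0, self.local_ctr[hist] + step))
        self.global_ctr[self.path_hist] = min(3, max(0, self.global_ctr[self.path_hist] + step))
        # Shift the outcome into the per-branch and path histories.
        self.local_hist[idx] = ((hist << 1) | int(taken)) & 0x3FF
        self.path_hist = ((self.path_hist << 1) | int(taken)) & 0xFFF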

Figure 3. Alpha 21264 instruction fetch. The line and way prediction (wraparound path on the right side) provides a fast instruction fetch path that avoids common fetch stalls when the predictions are correct. Program counter (PC) generation selects between the predicted next line plus way and the outcome of instruction decode, branch prediction, and the tag compares (hit/miss/way miss), which sit outside the critical loop; each fetch delivers 4 instructions along with a line prediction and a way prediction.

Slide callouts: Learn dynamic jumps. No branch penalty. Set associativity.

Each cache line stores predictions of the next line and the cache way to be fetched. If the predictions are correct, the fetcher maintains the required 4 instructions/cycle pace.

Speculative.


Page 9:



Rename stage close-up:

(1) Allocates new physical registers for destinations.
(2) Looks up physical register numbers for sources.
(3) Handles rename dependences within the 4 issuing instructions in one clock cycle!
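A minimal Python sketch of point (3): renaming a 4-instruction bundle in one pass, with destinations renamed earlier in the bundle forwarded to later sources. The data structures and names are illustrative; the real map stage performs the equivalent work with CAMs in a single cycle.

def rename_bundle(bundle, rat, free_list):
    """Rename up to 4 instructions, honoring dependences inside the bundle.

    bundle: list of (dest, src1, src2) architectural register numbers (dest may be None).
    rat: dict mapping architectural register -> current physical register.
    free_list: list of unused physical register numbers.
    Returns a list of (pdest, psrc1, psrc2) physical register tuples.
    """
    renamed = []
    bundle_map = {}                       # destinations renamed earlier in this bundle
    for dest, src1, src2 in bundle:
        # A source reads the newest mapping: a same-bundle producer wins over the RAT.
        psrc1 = bundle_map.get(src1, rat.get(src1))
        psrc2 = bundle_map.get(src2, rat.get(src2))
        pdest = None
        if dest is not None:
            pdest = free_list.pop(0)      # allocate a fresh physical register
            bundle_map[dest] = pdest      # visible to younger instructions in the bundle
        renamed.append((pdest, psrc1, psrc2))
    rat.update(bundle_map)                # commit the bundle's new mappings
    return renamed

For example, if instruction 0 writes r3 and instruction 2 reads r3, instruction 2 receives the physical register just allocated for instruction 0 rather than the stale RAT entry.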

issue and retire. It also trains the correct predictions by updating the referenced local, global, and choice counters at that time. The processor maintains path history with a silo of 12 branch predictions. This silo is speculatively updated before a branch retires and is backed up on a mispredict.

Out-of-order execution

The 21264 offers out-of-order efficiencies with higher clock speeds than competing designs, yet this speed does not restrict the microprocessor's dynamic execution capabilities. The out-of-order execution logic receives four fetched instructions every cycle, renames/remaps the registers to avoid unnecessary register dependencies, and queues the instructions until operands or functional units become available. It dynamically issues up to six instructions every cycle: four integer instructions and two floating-point instructions. It also provides an in-order execution model to the programmer via in-order instruction retire.

Register renaming

Register renaming exposes application instruction parallelism since it eliminates unnecessary dependencies and allows speculative execution. Register renaming assigns a unique storage location with each write-reference to a register. The 21264 speculatively allocates a register to each instruction with a register result. The register only becomes part of the user-visible (architectural) register state when the instruction retires/commits. This lets the instruction speculatively issue and deposit its result into the register file before the instruction retires. Register renaming also eliminates write-after-write and write-after-read register dependencies, but preserves all the read-after-write register dependencies that are necessary for correct computation.

The left side of Figure 5 depicts the map, or register rename, stage in more detail. The processor maintains storage with each internal register indicating the user-visible register that is currently associated with the given internal register (if any). Thus, register renaming is a content-addressable memory (CAM) operation for register sources together with a register allocation for the destination register. All pipeline stages subsequent to the register map stage operate on internal registers rather than user-visible registers.

Beyond the 31 integer and 31 floating-point user-visible (non-speculative) registers, an additional 41 integer and 41 floating-point registers are available to hold speculative results prior to instruction retirement. The register mapper stores the register map state for each in-flight instruction so that the machine architectural state can be restored in case a misspeculation occurs.

The Alpha conditional-move instructions must be handled specially by the map stage. These operations conditionally move one of two source registers into a destination register. This makes conditional move the only instruction in the Alpha architecture that requires three register sources: the two

Figure 4. Block diagram of the 21264 tournament branch predictor. The local history prediction path is on the left; the global history prediction path and the chooser (choice prediction) are on the right. The program counter indexes a local history table (1,024 x 10 bits) which selects a local prediction counter (1,024 x 3 bits); the path history indexes the global prediction (4,096 x 2 bits) and choice prediction (4,096 x 2 bits) tables, and a mux driven by the chooser selects the final branch prediction.

Figure 5. Block diagram of the 21264's map (register rename) and queue stages. The map stage renames programmer-visible register numbers to internal register numbers (72-80 internal registers), using map content-addressable memories together with saved map state covering the 80 in-flight instructions. The queue stage stores instructions (4 per cycle) until they are ready to issue, using a register scoreboard over the queue entries and a request/grant arbiter to select issued instructions. These structures are duplicated for integer and floating-point execution.

Input: 4 instructions specifying architected registers.

Output: 12 physical register numbers (1 destination and 2 sources for each of the 4 instructions to be issued).

Saved map state is kept for mis-speculation recovery, and is time-stamped.


Page 10:


Recall: malloc() -- free() in hardware

• multimedia programs using MMX™ instructions (MM - 8),
• games, e.g. Quake (GAM - 5),
• programs written in JAVA (JAV - 5),
• some TPC benchmarks (TPC - 3),
• common programs running on NT, e.g. Word, Excel (NT - 8),
• and common programs running on Windows 95 (W95 - 8).

In this paper, we focus more on statistical results which emphasize the motivations for the various suggestions. Performance results are only briefly presented, mainly to give a flavor of the potential benefit. Performance results are de-emphasized, as actual performance benefits are highly dependent on the implementation and may vary a lot. Choosing an arbitrary configuration (whether current or futuristic) may give biased results of questionable significance.

2 Advanced Register Renaming

2.1 Current Register Dependency-Tracking and Renaming Techniques

Modern processors exploit out-of-order execution to speed up processing time. Out-of-order execution involves a mechanism called register renaming in which the processor maps logical registers into physical locations. Register renaming is used to remove register anti-dependencies and output-dependencies and to recover from control speculation. The basic register renaming mechanism is well known and widely used (e.g. Intel® Pentium® Pro Processor [Inte96]). This section presents the most advanced combined register renaming and dependency-tracking scheme involving three structures: a Free List (FL), a Register Alias Table (RAT), and an Active List (AL). This scheme has been used in the MIPS R10000 and DEC 21264.

The RAT maintains the latest mapping (the pairing of a logical register with the physical register it maps to) for each logical register. The RAT is indexed by the source logical registers, and provides the mappings to the corresponding physical registers (dependency-tracking). For each logical destination register specified by the renamed instructions, the allocator (renamer) provides an unused physical register from the FL. The RAT is updated with these new mappings. Physical registers can be reclaimed once they cannot be referenced anymore. Once a logical register is renamed, all subsequent instructions can only access the new mapping; i.e. they cannot read the physical register previously mapped. Thus, an appropriate and straightforward condition for register reclaiming is to reclaim a physical register only when the instruction that evicted it from the RAT retires. As a result, whenever a new mapping updates the RAT, the evicted old mapping is pushed into the AL (an AL entry is provided to each instruction). When an instruction retires, the physical register of the old mapping recorded in the AL, if any, is reclaimed and pushed into the FL. This cycle is depicted in figure 1.

Figure 1. Register Renaming. The REGISTER ALIAS TABLE maps each logical register number to a physical register number; instruction renaming takes unused physical registers from the FREE LIST, evicted mappings are recorded in the ACTIVE LIST, and instruction retirement reclaims the evicted physical registers back to the FREE LIST.
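A minimal Python sketch of the FL/RAT/AL cycle just described. The class layout and sizes are illustrative assumptions; the point is only the allocate-on-rename, reclaim-on-retire lifecycle.

from collections import deque

class Renamer:
    def __init__(self, num_logical, num_physical):
        # Initially each logical register maps to the physical register of the same index.
        self.rat = {r: r for r in range(num_logical)}             # Register Alias Table
        self.free_list = deque(range(num_logical, num_physical))  # unused physical registers
        self.active_list = deque()                                # one entry per in-flight instruction

    def rename(self, dest, srcs):
        """Rename one instruction: look up its sources, allocate its destination."""
        psrcs = [self.rat[s] for s in srcs]       # dependency tracking through the RAT
        evicted, pdest = None, None
        if dest is not None:
            pdest = self.free_list.popleft()      # allocation from the Free List
            evicted = self.rat[dest]              # old mapping, not reclaimable yet
            self.rat[dest] = pdest
        self.active_list.append(evicted)          # Active List entry for this instruction
        return pdest, psrcs

    def retire(self):
        """The oldest instruction retires; the mapping it evicted is now dead."""
        evicted = self.active_list.popleft()
        if evicted is not None:
            self.free_list.append(evicted)        # reclaim back to the Free List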

2.2 Physical Register Reuse

Motivation. Most instructions operate on several source operands and generate results. These results are recorded into local physical registers allocated for each instruction, so that dependent instructions can operate on them. The range of generated values is usually limited. Indeed, integer results are often pretty small and the same value may be generated several times by different instructions currently in the instruction window. A perfect example is Boolean values such as control-flow conditions. Figure 2 shows the percentage of computed values that match one of the values generated by preceding instructions according to the number of prior instructions scanned (16, 32, 64, 128, or 256 instructions). Note that the scanned values were not filtered for duplicate results so they may also exhibit a high level of redundancy. Results are highlighted for four of the SpecInt95 benchmarks, and confirm our claim for programs compiled to run on an IA-32 processor.

Figure 2. Number of Identical Results: percentage of matching values for compress 95, xlisp 95, go 95, and ijpeg 95, scanning the previous 16, 32, 64, 128, or 256 instructions.

Concept. Physical registers hold values that are part of the architectural states currently alive in the machine. A physical register is allocated for every result regardless of its value. However, there is no reason to allocate separate physical registers when they maintain the same value. This paper proposes to reuse a physical register whenever we detect that an incoming result value matches a previous one. Physical Register Reuse relies on a Value-Identity Detection hardware to perform the detection prior to register renaming. The detector outcome can be either safe or speculative. By mapping several logical


The record-keeping shown in this diagram occurs in the rename stage.


Page 11:



Issue stage close-up:

(1) Newly issued instructions are placed in the top of the queue.
(2) Instructions check the scoreboard: are their 2 sources ready?
(3) The arbiter selects the 4 oldest "ready" instructions.
(4) Update removes these 4 from the queue.

Input: 4 just-issued instructions, renamed to use physical registers.

Output: the 4 oldest instructions whose 2 source registers are ready for use.

producer/consumer relationships within the four instructions) are combined to assign either previously allocated registers or the registers supplied by the free register generator to the source specifiers.

The resulting four register maps are saved in the map silo, which, in turn, provides information to the free register generator as to which registers are currently allocated. Finally, the last map that was created is used as the initial map for the next cycle. In the event of a branch mispredict or trap, the CAM is restored with the map silo entry associated with the redirecting instruction.

Issue Queue

Each cycle, up to four instructions are loaded into the two issue queues. The floating point queue (Fqueue) chooses two instructions from a 15-entry window and issues them to the two floating point pipelines. The integer queue (Iqueue) chooses four instructions from a 20-entry window and issues them to the four integer pipelines (figure 3).

Figure 3. Integer Issue Queue and Ebox (two execution clusters, each with an 80-entry register file and two functional units).

The fundamental circuit loop in the Iqueue is the path in which a single-cycle producing instruction is granted (issued) at the end of one issue cycle and a consuming instruction requests to be issued at the beginning of the next cycle (e.g. instructions (0) and (2) in the mapper example). The grant must be communicated to all newer consumer instructions in the issue window.

The issue queue maintains a register scoreboard, based on physical register number, and tracks the progress of multiple-cycle (e.g. integer multiply) and variable-cycle (e.g. memory load) instructions. When arithmetic result data or load data is available for bypass, the scoreboard unit notifies all instructions in the queue.

The queue arbitration cycle works as follows:
1. New instructions are loaded into the "top" of the queue.
2. Register scoreboard information is communicated to all queue entries.
3. Instructions that are data-ready request issue.
4. A set of issue arbiters search the queue from "bottom" to "top", selecting instructions that are data-ready in an age-prioritized order and skipping over instructions that are not data-ready.
5. Selected instructions are broadcast to the functional units.

In the next cycle a queue-update mechanism calculates which queue entries are available for future instructions and squashes issued instructions out of the queue. Instructions that are still resident in the queue shift towards the bottom.
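A hedged Python sketch of the five-step arbitration cycle above, using one generic queue (rather than the separate Iqueue/Fqueue) and a one-cycle wake-up approximation; the entry format and scoreboard handling are illustrative assumptions.

def issue_cycle(queue, scoreboard, new_instructions, width=4):
    """One issue-queue cycle: load, wake up, select oldest-ready, squash.

    queue: list of entries, oldest first; each entry is a dict with 'srcs' and 'dest'.
    scoreboard: set of physical registers whose values are available for bypass.
    Returns the entries selected for issue this cycle.
    """
    # Step 1: new instructions enter the "top" (youngest end) of the queue.
    queue.extend(new_instructions)

    # Steps 2-3: scoreboard information wakes up entries whose sources are all ready.
    for entry in queue:
        entry['ready'] = all(src in scoreboard for src in entry['srcs'])

    # Step 4: scan from the "bottom" (oldest) toward the "top", taking ready
    # entries in age-prioritized order and skipping entries that are not ready.
    selected = [entry for entry in queue if entry['ready']][:width]

    # Step 5: broadcast to the functional units; here we just publish the results
    # so dependent instructions can wake up, and mark the entries as issued.
    for entry in selected:
        entry['issued'] = True
        if entry['dest'] is not None:
            scoreboard.add(entry['dest'])

    # Queue update: squash issued entries; the survivors compact toward the bottom.
    queue[:] = [entry for entry in queue if not entry.get('issued')]
    return selected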

Ebox and Fbox

The Ebox functional unit organization was designed around a fast execute-bypass cycle. In order to reduce the impact of the large number of register ports required for a quad-issue CPU and to limit the effect on cycle time of long bypass busses between the functional units, the Ebox was organized around two clusters (see figure 3). Each cluster contains two functional units, an 80-entry register file, and result busses to/from the other cluster. The lower two functional units contain one-cycle adders and logical units; the upper two contain adders, logic units, and shifters. One upper functional unit contains a 7-cycle, fully pipelined multiplier; the other contains a 3-cycle motion video pipeline, which implements motion estimation, threshold, and pixel compaction/expansion functions. The two integer unit clusters have equal capability to execute most instructions (integer multiply, motion video, and some special-purpose instructions can only be executed in one cluster).

The execute pipeline operation proceeds as follows:
Stage 3: Instructions are issued to both clusters.
Stage 4: Register files are read.
Stage 5: Execution (may be multiple cycles).
Stage 6: Results are written to the register file of the cluster in which execution is performed and are bypassed into the next execution stage within the cluster.
Stage 7: Results are written to the cross-cluster register file and are bypassed into the next execution stage in the other cluster.
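As a rough model of the clustered bypass timing above (a hypothetical helper, not DEC's scheduler): a single-cycle result produced in stage 5 can feed a consumer in its own cluster one stage later, but a consumer in the other cluster one stage after that.

def earliest_use_stage(exec_stage, producer_cluster, consumer_cluster):
    """Earliest pipeline stage at which a dependent op can consume a 1-cycle result.

    Within the producing cluster the result is bypassed in the next stage (stage 6
    for a stage-5 producer); the cross-cluster copy arrives one stage later (stage 7).
    """
    if producer_cluster == consumer_cluster:
        return exec_stage + 1    # intra-cluster bypass
    return exec_stage + 2        # cross-cluster register write plus bypass

For example, earliest_use_stage(5, 0, 1) gives 7, matching the stage-7 cross-cluster write described above.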



Scoreboard: Tracks writes to physical registers.


Page 12:



Execution close-up:


Internal memory system

The internal memory system supports many in-flight memory references and out-of-order operations. It can service up to two memory references from the integer execution pipes every cycle. These two memory references are out-of-order issues. The memory system simultaneously tracks up to 32 in-flight loads, 32 in-flight stores, and 8 in-flight (instruction or data) cache misses. It also has a 64-Kbyte, two-way set-associative data cache. This cache has much lower miss rates than the 8-Kbyte, direct-mapped cache in the earlier 21164. The end result is a high-bandwidth, low-latency memory system.

Data path

The 21264 supports any combination of two loads or stores per cycle without conflict. The data cache is double-pumped to implement the necessary two ports. That means that the data cache is referenced twice each cycle, once per each of the two clock phases. In effect, the data cache operates at twice the frequency of the processor clock, an important feature of the 21264's memory system.

Figure 7 depicts the memory system's internal data paths. The two 64-bit data buses are the heart of the internal memory system. Each load receives data via these buses from the data cache, the speculative store data buffers, or an external (system or L2) fill. Stores first transfer their data across the data buses into the speculative store buffer. Store data remains in the speculative store buffer until the stores retire. Once they retire, the data is written (dumped) into the data cache on idle cache cycles. Each dump can write 128 bits into the cache since two stores can merge into one dump. Dumps use the double-pumped data cache to implement a read-modify-write sequence. Read-modify-write is required on stores to update the stored SECDED ECC that allows correction of single-bit errors.

Stores can forward their data to subsequent loads while they reside in the speculative store data buffer. Load instructions compare their age and address against these pending stores. On a match, the appropriate store data is put on the data bus rather than the data from the data cache. In effect, the speculative store data buffer performs a memory-renaming function. From the perspective of younger loads, it appears the stores write into the data cache immediately. However, squashed stores are removed from the speculative store data buffer before they affect the final cache state.

Figure 7 shows how data is brought into and out of the internal memory system. Fill data arrives on the data buses. Pending loads sample the data to write into the register file while, in parallel, the caches (instruction or data) also fill using the same bus data. The data cache is write-back, so fills also use its double-pumped capability: the previous cache contents are read out in the same cycle that fill data is written in. The bus interface unit captures this victim data and later writes it back.

Address and control structure

The internal memory system maintains a 32-entry load queue (LDQ) and a 32-entry

Table 2. Sample 21264 instruction latencies (s-p means single-precision; d-p means double-precision).

Instruction class                                              Latency (cycles)
Simple integer operations                                      1
Motion-video instructions / integer population count and
  leading/trailing zero count unit (MVI/PLZ)                   3
Integer multiply                                               7
Integer load                                                   3
Floating-point load                                            4
Floating-point add                                             4
Floating-point multiply                                        4
Floating-point divide                                          12 s-p, 15 d-p
Floating-point square-root                                     15 s-p, 30 d-p

Figure 7. The 21264's internal memory system data paths: the two 64-bit data buses connect the cluster 0 and cluster 1 memory units, the data cache, the speculative store data buffer, and the instruction cache fill path, with victim data flowing out through the bus interface to the L2 cache and the system (individual paths in the figure are labeled 64 or 128 bits wide).

Internal memory systemThe internal memory system supports

many in-flight memory references and out-of-order operations. It can service up to twomemory references from the integer executionpipes every cycle. These two memory refer-ences are out-of-order issues. The memorysystem simultaneously tracks up to 32 in-flight loads, 32 in-flight stores, and 8 in-flight(instruction or data) cache misses. It also hasa 64-Kbyte, two-way set-associative datacache. This cache has much lower miss ratesthan the 8-Kbyte, direct-mapped cache in theearlier 21164. The end result is a high-band-width, low-latency memory system.

Data pathThe 21264 supports any combination of

two loads or stores per cycle without conflict.The data cache is double-pumped to imple-ment the necessary two ports. That meansthat the data cache is referenced twice eachcycle—once per each of the two clock phases.In effect, the data cache operates at twice thefrequency of the processor clock—an impor-tant feature of the 21264’s memory system.

Figure 7 depicts the memory system’s inter-nal data paths. The two 64-bit data buses arethe heart of the internal memory system. Eachload receives data via these buses from the datacache, the speculative store data buffers, or anexternal (system or L2) fill. Stores first trans-fer their data across the data buses into thespeculative store buffer. Store data remains inthe speculative store buffer until the storesretire. Once they retire, the data is written(dumped) into the data cache on idle cachecycles. Each dump can write 128 bits into thecache since two stores can merge into onedump. Dumps use the double-pumped datacache to implement a read-modify-writesequence. Read-modify-write is required onstores to update the stored SECDED ECCthat allows correction of single-bit errors.

Stores can forward their data to subsequent loads while they reside in the speculative store data buffer. Load instructions compare their age and address against these pending stores. On a match, the appropriate store data is put on the data bus rather than the data from the data cache. In effect, the speculative store data buffer performs a memory-renaming function. From the perspective of younger loads, it appears the stores write into the data cache immediately. However, squashed stores are removed from the speculative store data buffer before they affect the final cache state.
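The forwarding decision is essentially an associative search on address, qualified by age. As a rough illustration only (not the 21264's actual circuit; the class and function names below are invented for the sketch), in Python:

# Behavioral sketch of store-to-load forwarding from a speculative store buffer.
# Each buffered store records its program-order age, address, and data.
# A load returns the youngest older store's data on an address match,
# otherwise it falls through to the data cache.

class SpeculativeStoreBuffer:
    def __init__(self):
        self.entries = []                      # (age, addr, data) of unretired stores

    def add_store(self, age, addr, data):
        self.entries.append((age, addr, data))

    def squash_younger_than(self, age):
        # Stores on a squashed (mispredicted) path never reach the cache.
        self.entries = [e for e in self.entries if e[0] <= age]

    def forward(self, load_age, addr):
        # Youngest store that is older than the load and matches its address.
        matches = [e for e in self.entries if e[1] == addr and e[0] < load_age]
        return max(matches)[2] if matches else None


def do_load(ssb, dcache, load_age, addr):
    data = ssb.forward(load_age, addr)
    return data if data is not None else dcache.get(addr, 0)


dcache = {0x100: 7}
ssb = SpeculativeStoreBuffer()
ssb.add_store(age=1, addr=0x100, data=42)            # store still speculative
print(do_load(ssb, dcache, load_age=2, addr=0x100))  # 42: forwarded ("memory renaming")
print(do_load(ssb, dcache, load_age=0, addr=0x100))  # 7: this load is older than the store

The age check is what keeps the renaming correct: a load older than a buffered store must not see that store's data.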

Figure 7 shows how data is brought into and out of the internal memory system. Fill data arrives on the data buses. Pending loads sample the data to write into the register file while, in parallel, the caches (instruction or data) also fill using the same bus data. The data cache is write-back, so fills also use its double-pumped capability: the previous cache contents are read out in the same cycle that fill data is written in. The bus interface unit captures this victim data and later writes it back.

Address and control structure

The internal memory system maintains a 32-entry load queue (LDQ) and a 32-entry store queue (STQ).


Table 2. Sample 21264 instruction latencies (s-p means single-precision; d-p means double-precision).

Instruction class                                              Latency (cycles)
Simple integer operations                                      1
Motion-video instructions/integer population count and
  leading/trailing zero count unit (MVI/PLZ)                   3
Integer multiply                                               7
Integer load                                                   3
Floating-point load                                            4
Floating-point add                                             4
Floating-point multiply                                        4
Floating-point divide                                          12 s-p, 15 d-p
Floating-point square-root                                     15 s-p, 30 d-p

Figure 7. The 21264's internal memory system data paths. [The data buses link the cluster 0 and cluster 1 memory units, the speculative store data buffer, the data cache, and the instruction cache; fill data arrives from the L2 cache and system through the bus interface, and victim data returns to it; labeled bus widths are 64 and 128 bits.]

(1) Two copies of the register file, to reduce port pressure.
(2) Forwarding buses are low-latency paths through the CPU.
Relies on speculation.



comes in a 587-pin PGA package. It can execute up to 2.4 billion instructions per second.

Figure 1 shows a photo of the 21264, highlighting major sections. Figure 2 is a high-level overview of the 21264 pipeline, which has seven stages, similar to the earlier in-order 21164. One notable addition is the map stage that renames registers to expose instruction parallelism; this addition is fundamental to the 21264's out-of-order techniques.

Instruction pipeline: Fetch

The instruction pipeline begins with the fetch stage, which delivers four instructions to the out-of-order execution engine each cycle. The processor speculatively fetches through line, branch, or jump predictions. Since the predictions are usually accurate, this instruction fetch implementation typically supplies a continuous stream of good-path instructions to keep the functional units busy with useful work.

Two architectural techniques increase fetch efficiency: line and way prediction, and branch prediction. A 64-Kbyte, two-way set-associative instruction cache offers much-improved level-one hit rates compared to the 8-Kbyte, direct-mapped instruction cache in the Alpha 21164.

Line and way prediction

The processor implements a line and way prediction technique that combines the advantages of set-associative behavior and fetch bubble elimination, together with the fast access time of a direct-mapped cache. Figure 3 (next page) shows the technique's main features. Each four-instruction fetch block includes a line and way prediction. This prediction indicates where to fetch the next block of four instructions, including which way, that is, which of the two choices allowed by the two-way associative cache.

The processor reads out the next instructions using the prediction (via the wraparound path in Figure 3) while, in parallel, it completes the validity check for the previous instructions. Note that the address paths needing extra logic levels (instruction decode, branch prediction, and cache tag comparison) are outside the critical fetch loop.

The processor loads the line and way predictors on an instruction cache fill, and ...
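Although the excerpt breaks off above, the core of the scheme is the predicted (line, way) pointer stored alongside each fetch block, which keeps tag comparison out of the critical fetch loop. A minimal behavioral sketch, with an invented cache layout and a toy retraining rule (both are assumptions for illustration, not the 21264's logic):

# Sketch of line/way-predicted fetch: each cached fetch block carries a guess
# of the (index, way) of the next block, so the next access needs no tag
# compare or branch prediction in the critical loop. The slower checks run
# in parallel and retrain the predictor when the guess was wrong.

icache = {
    # (index, way) -> {"instrs": [...], "next": predicted (index, way)}
    (0, 0): {"instrs": ["i0", "i1", "i2", "i3"], "next": (1, 1)},
    (1, 1): {"instrs": ["i4", "i5", "i6", "i7"], "next": (2, 0)},
    (2, 0): {"instrs": ["i8", "i9", "iA", "iB"], "next": (1, 1)},   # stale prediction
}

order = [(0, 0), (1, 1), (2, 0)]                  # true fetch order for this toy example

def correct_next(loc):
    return order[(order.index(loc) + 1) % len(order)]

def fetch(start, cycles):
    loc = start
    for _ in range(cycles):
        block = icache[loc]
        yield block["instrs"]                     # deliver 4 instructions this cycle
        predicted = block["next"]
        actual = correct_next(loc)                # computed off the critical path
        if predicted != actual:
            block["next"] = actual                # retrain line/way predictor
            loc = actual                          # real hardware pays a fetch bubble here
        else:
            loc = predicted

for instrs in fetch((0, 0), 6):
    print(instrs)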


Figure 1. Alpha 21264 microprocessor die photo. BIU stands for bus interface unit. [Labeled regions: floating-point units, float map and queue, instruction fetch, instruction cache, data cache, bus interface unit, memory controllers, data and control buses, integer queue, integer mapper, and the two integer units (cluster 0 and cluster 1).]

Figure 2. Stages of the Alpha 21264 instruction pipeline. [Stages 0-6: Fetch, Slot, Rename, Issue, Register read, Execute, Memory. Shown around them: the branch predictor and line/set prediction, the 64-Kbyte two-way instruction cache, integer and floating-point register rename, the 20-entry integer issue queue and 15-entry floating-point issue queue, two 80-entry integer register files, a 72-entry floating-point register file, four integer execution pipes, floating-point add and multiply pipes, the 64-Kbyte two-way data cache, and the level-two cache and system interface.]

Latencies, from issue to retirement.
8 retirements per cycle can be sustained over short time periods. Peak rate is 11 retirements in a single cycle.

flight window. This means that up to 80 instructions can be in partial states of completion at any time, allowing for significant execution concurrency and latency hiding. (This is particularly true since the memory system can track an additional 32 in-flight loads and 32 in-flight stores.)

Table 1 shows the minimum latency, in number of cycles, from issue until retire eligibility for different instruction classes. The retire mechanism can retire at most 11 instructions in a single cycle, and it can sustain a rate of 8 per cycle (over short periods).

Execution engine

Figure 6 depicts the six execution pipelines. Each pipeline is physically placed above or below its corresponding register file. The 21264 splits the integer register file into two clusters that contain duplicates of the 80-entry register file. Two pipes access a single register file to form a cluster, and the two clusters combine to support four-way integer instruction execution. This clustering makes the design simpler and faster, although it costs an extra cycle of latency to broadcast results from an integer cluster to the other cluster. The upper pipelines from the two integer clusters in Figure 6 are managed by the same issue queue arbiter, as are the two lower pipelines. The integer queue statically slots instructions to either the upper or lower pipeline arbiters. It then dynamically selects which cluster to execute an instruction on, left or right.

The performance costs of the register clustering and issue queue arbitration simplifications are small: a few percent or less compared to an idealized unclustered implementation in most applications. There are multiple reasons for the minimal performance effect. First, for many operations (such as loads and stores) the static issue queue assignment is not a restriction since they can only execute in either the upper or lower pipelines. Second, critical-path computations tend to execute on the same cluster. The issue queue prefers older instructions, so more-critical instructions incur fewer cross-cluster delays; an instruction can usually issue first on the same cluster that produces the result. As a result, this integer pipeline architecture provides much of the implementation simplicity, lower risk, and higher speed of a two-issue machine with the performance benefits of four-way integer issue. Figure 6 also shows the floating-point execution pipes' configuration. A single cluster has the two floating-point execution pipes, with a single 72-entry register file.

The 21264 includes new functional units not present in prior Alpha microprocessors. The Alpha motion-video instructions (MVI, used to speed many forms of image processing), a fully pipelined integer multiply unit, an integer population count and leading/trailing zero count unit (PLZ), a floating-point square-root functional unit, and instructions to move register values directly between floating-point and integer registers are included. The processor also provides more complete hardware support for the IEEE floating-point standard, including precise exceptions, NaN and infinity processing, and support for flushing denormal results to zero. Table 2 shows sample instruction latencies (issue of producer to issue of consumer). These latencies are achieved through result bypassing.


Table 1. Sample 21264 retire pipe stages.

Instruction class               Retire latency (cycles)
Integer                         4
Memory                          7
Floating-point                  8
Branch/jump to subroutine       7

Figure 6. The four integer execution pipes (upper and lower for each of a left and right cluster) and the two floating-point pipes in the 21264, together with the functional units in each. [Cluster 0 and cluster 1 each contain 80 registers; the upper pipes hold shift/branch and add/logic units, plus the integer multiplier in one cluster and the MVI/PLZ unit (motion-video instructions, integer population count and leading/trailing zero count) in the other; the lower pipes hold add/logic and load/store units; the floating-point side holds 72 registers with multiply, add, divide, and square-root (SQRT) units; "+1" marks the extra cycle to cross between integer clusters.]

issue and retire. It also trains the correct predictions by updating the referenced local, global, and choice counters at that time. The processor maintains path history with a silo of 12 branch predictions. This silo is speculatively updated before a branch retires and is backed up on a mispredict.

Out-of-order execution

The 21264 offers out-of-order efficiencies with higher clock speeds than competing designs, yet this speed does not restrict the microprocessor's dynamic execution capabilities. The out-of-order execution logic receives four fetched instructions every cycle, renames/remaps the registers to avoid unnecessary register dependencies, and queues the instructions until operands or functional units become available. It dynamically issues up to six instructions every cycle: four integer instructions and two floating-point instructions. It also provides an in-order execution model to the programmer via in-order instruction retire.

Register renaming

Register renaming exposes application instruction parallelism since it eliminates unnecessary dependencies and allows speculative execution. Register renaming assigns a unique storage location to each write reference to a register. The 21264 speculatively allocates a register to each instruction with a register result. The register only becomes part of the user-visible (architectural) register state when the instruction retires/commits. This lets the instruction speculatively issue and deposit its result into the register file before the instruction retires. Register renaming also eliminates write-after-write and write-after-read register dependencies, but preserves all the read-after-write register dependencies that are necessary for correct computation.

The left side of Figure 5 depicts the map, or register rename, stage in more detail. The processor maintains storage with each internal register indicating the user-visible register that is currently associated with the given internal register (if any). Thus, register renaming is a content-addressable memory (CAM) operation for register sources together with a register allocation for the destination register. All pipeline stages subsequent to the register map stage operate on internal registers rather than user-visible registers.

Beyond the 31 integer and 31 floating-point user-visible (non-speculative) registers, an additional 41 integer and 41 floating-point registers are available to hold speculative results prior to instruction retirement. The register mapper stores the register map state for each in-flight instruction so that the machine architectural state can be restored in case a misspeculation occurs.
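One way to picture the mechanism is a rename map plus a free list, checkpointed per instruction so the map can be restored on a misspeculation. This is an illustrative Python sketch under those assumptions, not the 21264's CAM-based implementation:

# Sketch of register renaming: architectural register names map to physical
# (internal) registers; every write gets a fresh physical register, which
# removes WAW/WAR dependences while keeping true RAW dependences.

class Renamer:
    def __init__(self, n_arch, n_phys):
        self.map = {f"r{i}": i for i in range(n_arch)}     # arch -> phys
        self.free = list(range(n_arch, n_phys))            # unused phys regs
        self.checkpoints = []                              # saved map state per instruction

    def rename(self, dst, srcs):
        phys_srcs = [self.map[s] for s in srcs]            # read current map
        self.checkpoints.append(dict(self.map))            # for misspeculation recovery
        new_phys = self.free.pop(0)                        # allocate the result register
        self.map[dst] = new_phys
        return new_phys, phys_srcs

    def recover(self, checkpoint_index):
        self.map = self.checkpoints[checkpoint_index]      # roll the map back


r = Renamer(n_arch=4, n_phys=10)
print(r.rename("r1", ["r2", "r3"]))   # e.g. (4, [2, 3])
print(r.rename("r1", ["r1", "r2"]))   # (5, [4, 2]): the WAW on r1 disappears

The second call shows why WAW and WAR hazards vanish: the new write to r1 gets a fresh physical register while the read of r1 still sees the previous mapping.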

The Alpha conditional-move instructions must be handled specially by the map stage. These operations conditionally move one of two source registers into a destination register. This makes conditional move the only instruction in the Alpha architecture that requires three register sources: the two ...


Figure 4. Block diagram of the 21264 tournament branch predictor. The local history prediction path is on the left; the global history prediction path and the chooser (choice prediction) are on the right. [Local history table: 1,024 x 10; local prediction: 1,024 x 3; global prediction: 4,096 x 2; choice prediction: 4,096 x 2. The program counter indexes the local path, the path history indexes the global and choice tables, and a mux produces the final branch prediction.]
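To make the structure concrete, here is a small behavioral sketch of a tournament predictor with the table shapes from Figure 4 (local history indexed by PC, global and choice tables indexed by path history). The counter update policy shown is a generic saturating-counter scheme and is an assumption, not the 21264's exact training rule:

# Sketch of a tournament predictor: a local predictor (per-branch history
# selects a 3-bit counter) and a global predictor (path history selects a
# 2-bit counter) both predict; a choice table, also indexed by path history,
# picks which one to believe.

class Tournament:
    def __init__(self):
        self.local_hist = [0] * 1024          # 10 bits of per-branch history
        self.local_ctr = [3] * 1024           # 3-bit counters (0..7)
        self.global_ctr = [1] * 4096          # 2-bit counters (0..3)
        self.choice = [1] * 4096              # 2-bit: >= 2 means "use global"
        self.path = 0                         # 12 bits of global path history

    def predict(self, pc):
        li = pc % 1024
        local = self.local_ctr[self.local_hist[li]] >= 4
        glob = self.global_ctr[self.path] >= 2
        return glob if self.choice[self.path] >= 2 else local

    def update(self, pc, taken):
        li = pc % 1024
        lh = self.local_hist[li]
        local = self.local_ctr[lh] >= 4
        glob = self.global_ctr[self.path] >= 2
        if local != glob:                     # train the chooser toward the winner
            delta = 1 if glob == taken else -1
            self.choice[self.path] = min(3, max(0, self.choice[self.path] + delta))
        d = 1 if taken else -1
        self.local_ctr[lh] = min(7, max(0, self.local_ctr[lh] + d))
        self.global_ctr[self.path] = min(3, max(0, self.global_ctr[self.path] + d))
        self.local_hist[li] = ((lh << 1) | taken) & 0x3FF
        self.path = ((self.path << 1) | taken) & 0xFFF


bp = Tournament()
for outcome in [1, 1, 1, 0, 1, 1, 1, 0]:      # a branch taken 3 of every 4 times
    print(bp.predict(0x40), outcome)
    bp.update(0x40, outcome)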

Figure 5. Block diagram of the 21264's map (register rename) and queue stages. The map stage renames programmer-visible register numbers to internal register numbers. The queue stage stores instructions until they are ready to issue. These structures are duplicated for integer and floating-point execution. [Map stage: four instructions per cycle present register numbers to the map content-addressable memories, backed by saved map state, and receive internal register numbers drawn from 72 to 80 internal registers. Queue stage: queue entries with a register scoreboard; an arbiter handles request/grant for up to 80 in-flight instructions and produces the issued instructions.]

Retirement managed here.

Short latencies keep buffers to a reasonable size.



Execution unit close-up:


producer/consumer relationships within the four instructions) are combined to assign either previously allocated registers or the registers supplied by the free register generator to the source specifiers.

The resulting four register maps are saved in the map silo, which, in turn, provides information to the free register generator as to which registers are currently allocated. Finally, the last map that was created is used as the initial map for the next cycle. In the event of a branch mispredict or trap, the CAM is restored with the map silo entry associated with the redirecting instruction.

Issue Queue

Each cycle, up to four instructions are loaded into the two issue queues. The floating point queue (Fqueue) chooses two instructions from a 15-entry window and issues them to the two floating point pipelines. The integer queue (Iqueue) chooses four instructions from a 20-entry window and issues them to the four integer pipelines (figure 3).

[Figure 3. Integer Issue Queue and Ebox: the integer issue queue feeds Ebox cluster 0 and Ebox cluster 1, each with an 80-entry register file, execute units, and a media unit.]

The fundamental circuit loop in the Iqueue is the path in which a single-cycle producing instruction is granted (issued) at the end of one issue cycle and a consuming instruction requests to be issued at the beginning of the next cycle (e.g. instructions (0) and (2) in the mapper example). The grant must be communicated to all newer consumer instructions in the issue window.

The issue queue maintains a register scoreboard, based on physical register number, and tracks the progress of multiple-cycle (e.g. integer multiply) and variable-cycle (e.g. memory load) instructions. When arithmetic result data or load data is available for bypass, the scoreboard unit notifies all instructions in the queue.

The queue arbitration cycle works as follows:
1. New instructions are loaded into the "top" of the queue.
2. Register scoreboard information is communicated to all queue entries.
3. Instructions that are data-ready request for issue.
4. A set of issue arbiters search the queue from "bottom" to "top", selecting instructions that are data-ready in an age-prioritized order and skipping over instructions that are not data-ready.
5. Selected instructions are broadcast to the functional units.

In the next cycle a queue-update mechanism calculates which queue entries are available for future instructions and squashes issued instructions out of the queue. Instructions that are still resident in the queue shift towards the bottom.
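A behavioral sketch of that pick loop follows; it is a simplification (single arbiter, invented entry fields) rather than the two-arbiter hardware:

# Sketch of age-prioritized issue: each queue entry waits for its source
# physical registers to be marked ready by the scoreboard; each cycle the
# arbiter scans from oldest ("bottom") to newest, grants data-ready entries
# up to the issue width, and squashes granted entries out of the queue.

def issue_cycle(queue, ready_regs, width=4):
    granted = []
    for entry in sorted(queue, key=lambda e: e["age"]):       # oldest first
        if len(granted) == width:
            break
        if all(src in ready_regs for src in entry["srcs"]):   # data-ready?
            granted.append(entry)
    for e in granted:
        queue.remove(e)                                       # queue update
        ready_regs.add(e["dst"])       # single-cycle result: ready for the next pick
    return granted


queue = [
    {"age": 0, "srcs": [1, 2], "dst": 10},
    {"age": 1, "srcs": [10],   "dst": 11},   # depends on the age-0 result
    {"age": 2, "srcs": [3],    "dst": 12},
]
ready = {1, 2, 3}
print([e["age"] for e in issue_cycle(queue, ready)])   # [0, 2]
print([e["age"] for e in issue_cycle(queue, ready)])   # [1]

Marking the destination ready at the end of the cycle models the critical producer-grant/consumer-request loop described above.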

Ebox and Fbox

The Ebox functional unit organization was designed around a fast execute-bypass cycle. In order to reduce the impact of the large number of register ports required for a quad-issue CPU and to limit the effect on cycle time of long bypass busses between the functional units, the Ebox was organized around two clusters (see figure 3). Each cluster contains two functional units, an 80-entry register file, and result busses to/from the other cluster. The lower two functional units contain one-cycle adders and logical units; the upper two contain adders, logic units, and shifters. One upper functional unit contains a 7-cycle, fully pipelined multiplier; the other contains a 3-cycle motion video pipeline, which implements motion estimation, threshold, and pixel compaction/expansion functions. The two integer unit clusters have equal capability to execute most instructions (integer multiply, motion video, and some special-purpose instructions can only be executed in one cluster).

The execute pipeline operation proceeds as follows:
Stage 3: Instructions are issued to both clusters.
Stage 4: Register files are read.
Stage 5: Execution (may be multiple cycles).
Stage 6: Results are written to the register file of the cluster in which execution is performed and are bypassed into the next execution stage within the cluster.
Stage 7: Results are written to the cross-cluster register file and are bypassed into the next execution stage in the other cluster.
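The effect of the clustering on dependent-instruction timing can be summarized with a tiny latency model (my own simplification of stages 5 through 7 above):

# Toy latency model of the clustered Ebox bypass: a consumer on the same
# cluster as the producer can use the result the cycle after it is produced
# (stage-6 bypass); a consumer on the other cluster waits one extra cycle
# for the cross-cluster write (stage 7).

def earliest_consumer_issue(producer_issue_cycle, producer_latency, same_cluster):
    ready = producer_issue_cycle + producer_latency
    return ready if same_cluster else ready + 1


add_issued_at = 0   # a 1-cycle add issued on cluster 0
print(earliest_consumer_issue(add_issued_at, 1, same_cluster=True))    # 1
print(earliest_consumer_issue(add_issued_at, 1, same_cluster=False))   # 2

This is the one-cycle penalty that the age-prioritized arbiters mostly hide by keeping critical chains on the producing cluster.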


(1) Two arbiters: one for top pipes, one for bottom pipes.
(2) Instructions statically assigned to top or bottom.
(3) Arbiter dynamically selects left or right.
Thus, 2 dual-issue dynamic machines, not a 4-issue machine.
Why? Simplifies arbiter. Performance penalty? A few %.



Memory stages close-up:



Loads and stores from the execution unit appear as "Cluster 0/1 memory unit" in the Figure 7 diagram.

The one-cycle cross-cluster bypass delay resulted in a negligible performance penalty (about 1% on SPECInt95) but reduced the operand bypass bus length by 75%.


Floating Point Execution Unit

The floating point pipe execution units are organized around a single 72-entry register file. One unit contains a 4-cycle fully pipelined adder and the other contains a 4-cycle multiplier. In addition, the adder pipeline contains a square-root and divide unit. The Fbox pipeline operation is similar to the Ebox pipeline except the execution stage is elongated and there is only one cluster.

[Figure 4. Mbox Address Datapath: virtual addresses VA0 and VA1 enter the dual-ported 128-entry TLB; the resulting physical addresses PA0 and PA1 are CAMed against the 8-entry MAF, the 32-entry LDQ, and the 32-entry STQ.]

Memory Operations

The lower two integer functional unit adders are shared between ADD/SUB instructions and effective virtual address calculations (register + displacement) for load and store instructions. Loads are processed as follows:

Stage 3: Up to two load instructions are issued, potentially out-of-order.
Stages 4 and 5: Register file read and displacement address calculation.
Stages 6A and 6B: The 64KB 2-way, virtually indexed, physically tagged data cache is accessed. The cache is phase-pipelined such that one index is supplied every 1 ns (assuming a 2 ns cycle time). Most dual-ported caches impose constraints on the indices that are supplied each cycle to avoid bank conflicts. Phase-pipelining the cache avoids these constraints.
Stage 7: A 128-bit load/store data bus (the LSD bus) is driven from the cache to the execution units. The cache data reaches both integer unit subclusters at the same time; consuming instructions can issue to any functional unit 3 cycles after the load is issued. Cache data takes an additional cycle to reach the floating point execution unit.

Mbox

The memory instruction pipeline discussed above is optimized for loads/stores which hit in the Dcache and do not cause any address reference order hazards. The Mbox detects and resolves these hazards and processes Dcache misses.

Hazard Detection and Resolution

As discussed earlier, out-of-order issued instructions can generate three types of hazards: read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW). Register renaming resolves WAW and WAR for references to the architectural register specifiers, and the Ibox queue resolves the RAW dependencies. The Mbox must detect and resolve these hazards as they apply to references to memory. Consider the following series of memory instructions which reference address (A):

(0) LD Memory(A) → R1
(1) ST R2 → Memory(A)
(2) LD Memory(A) → R3
(3) ST R4 → Memory(A)

Assume that address (A) is cached in the Dcache. If (0) and (1) issue out-of-order from the Iqueue, R1 will incorrectly receive the result of the store. If (1) and (2) are issued out-of-order, R3 will incorrectly receive the value before the store, and, finally, if (1) and (3) issue and complete out-of-order, the value stored to location (A) will be R2 instead of R4.
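To see how such cases are caught, here is a toy checker for the most important ordering problem in the example above: a load that issues before an older store to the same address reads stale data and must be trapped and replayed. It is a simplification of the age-and-address CAM checks the Mbox performs (WAR and WAW handling are omitted):

# Toy memory-order hazard checker for the LD/ST example above.
# Program order (age): 0 LD A, 1 ST A, 2 LD A, 3 ST A.
# When a store issues, it looks for younger loads to the same address that
# have already issued; each such pair requires a trap-and-replay.

ops = {                        # program-order age -> (kind, address)
    0: ("LD", "A"),
    1: ("ST", "A"),
    2: ("LD", "A"),
    3: ("ST", "A"),
}

def find_replays(issue_order):
    issued_loads = []          # ages of loads that already got their data
    replays = []
    for age in issue_order:
        kind, addr = ops[age]
        if kind == "ST":
            for load_age in issued_loads:
                if ops[load_age][1] == addr and load_age > age:
                    replays.append((age, load_age))   # younger load already done
        else:
            issued_loads.append(age)
    return replays


print(find_replays([0, 1, 2, 3]))   # [] : program order, nothing to undo
print(find_replays([0, 2, 1, 3]))   # [(1, 2)] : load 2 ran before store 1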

The datapath which the Mbox uses to resolve these hazards is shown in figure 4. Since loads and stores can dual-issue, the datapath receives two effective addresses per cycle (VA0 and VA1) from the Ebox adders. It first translates them to physical addresses (PA0 and PA1) using a dual-ported, 128-entry, fully associative translation lookaside buffer (TLB). The physical addresses travel over the three key structures in the Mbox: the LoaD Queue (LDQ), the STore Queue (STQ), and the Miss Address File (MAF).

The 32-entry LDQ contains all the in-flight load instructions, and the 32-entry STQ contains all the in-flight store instructions. The MAF contains all the in-flight cache transactions which are pending to the backup ...



1st stop: TLB, to convert virtual memory addresses.


2nd stop: the Load Queue (LDQ) and Store Queue (STQ) each hold 32 instructions, until retirement ...
3rd stop: Flush STQ to the data cache ... on a miss, place in the Miss Address File (MAF == MSHR).
"Double-pumped": 1 GHz.


So we can roll back!



LDQ/STQ close-up:



Hazards we are trying to prevent:

cache and system. Each entry in the MAF refers to a 64-byte block of data which is ultimately bound for the Dcache or the Icache.

The instruction processing pipeline described above extends to the Mbox as follows:

Stage 6: The Dcache tags and TLB are read. Dcache hit is calculated.
Stage 7: The physical addresses generated from the TLB (PA0, PA1) are CAMed across the LDQ, STQ, and MAF.
Stage 8: Load instructions are written into the LDQ; store instructions are written into the STQ, and, if the memory reference missed the Dcache, it is written into the MAF. In parallel, the CAM results from the preceding cycle are combined with relative instruction age information to detect hazards. In addition, the MAF uses the result of its CAMs to detect the loads and stores which can be merged into the same 64-byte cache block.
Stage 9: The MAF entry allocation in stage 8 is validated to the system interface, and the MAF number associated with this particular memory miss is written into the appropriate (LDQ/STQ) structure. This MAF number provides a mapping between the merged references to the same cache block and individual outstanding load and store instructions.
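The MAF's merging behavior can be sketched as a small dictionary keyed by block address (illustrative only; real entries also record which LDQ/STQ slots depend on them, and entries are freed when fills complete):

# Sketch of Miss Address File (MAF) merging: misses to the same 64-byte block
# share one entry, with up to 8 outstanding blocks, so many loads and stores
# map onto a single off-chip transaction.

BLOCK = 64
MAF_ENTRIES = 8

maf = {}            # block address -> MAF entry number
waiting = {}        # MAF entry number -> list of (kind, age) waiting on the fill

def miss(kind, age, addr):
    block = addr // BLOCK * BLOCK
    if block in maf:                          # merge with an in-flight miss
        entry = maf[block]
    else:
        if len(maf) == MAF_ENTRIES:
            return None                       # structural stall: MAF is full
        entry = len(maf)
        maf[block] = entry
        waiting[entry] = []
    waiting[entry].append((kind, age))
    return entry


print(miss("LD", 0, 0x1008))   # 0 : new entry for block 0x1000
print(miss("LD", 1, 0x1030))   # 0 : same 64-byte block, merged
print(miss("ST", 2, 0x2000))   # 1 : different block, new entry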

Given these resources and within the context of the Mbox pipeline, memory hazards are solved as follows. RAW hazards are discovered when an issued store detects that a younger load to the same address has already issued and delivered its data. In this event, the CPU is trapped to the store instruction, and instruction flow is replayed by the Ibox. This is a potentially common hazard, so, in addition to trapping, the Ibox is trained to issue that load in-order with respect to the prior store instruction.

WAR hazards are discovered when an issued load detects an older store which references the same address; the CPU is trapped to the load address. Finally, WAW hazards are avoided by forcing the STQ to write data to the Dcache in-order. Thus, stores can be issued out-of-order and removed from the Iqueue, allowing further instruction processing, but the store data is written to the Dcache in program order.
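The "trained to issue that load in-order" behavior can be sketched as a small predictor keyed by the load's PC: after a replay, the load is marked to wait for prior stores the next time it is seen. The table name and indexing here are my own for illustration, not the hardware's:

# Sketch of store-load order training: after a load causes a RAW replay, its
# PC is remembered; later instances of that load are held until all older
# stores have issued, instead of issuing speculatively early again.

store_wait = set()            # PCs of loads that must wait for older stores

def may_issue_load(pc, older_stores_outstanding):
    if pc in store_wait and older_stores_outstanding:
        return False          # forced to issue in-order with prior stores
    return True

def on_replay(load_pc):
    store_wait.add(load_pc)   # train: this load already caused one replay


print(may_issue_load(0x40, older_stores_outstanding=True))    # True (untrained)
on_replay(0x40)                                               # replay detected
print(may_issue_load(0x40, older_stores_outstanding=True))    # False
print(may_issue_load(0x40, older_stores_outstanding=False))   # True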

the data to the Mbox. As the data is delivered, it must be spliced into the execution pipeline so that dependent instructions can be issued. The fill pipeline proceeds from the Cbox as follows:

1. The Cbox informs the Mbox and the rest of the chip that fill data will be available on the Load Store Bus (LSD) in 6 cycles.
2. The Mbox receives the fill command plus the MAF number and CAMs the MAF number across the LDQ. The loads which referenced this cache block arbitrate for the two load/store pipes.
3. The Ibox stops issuing load instructions for one cycle because the LSD bus has been scheduled for fill data.
4. Bubble cycle.
5. Fill data arrives at the 21264 pins. The Ibox ceases issuing loads for one cycle because the Dcache has been scheduled for a fill update.
6. The Ibox issues instructions which were dependent on the load data because the data will be available for bypass on the LSD bus.
7. The fill data is driven on the LSD bus and is ready to be consumed.
8. The fill data and tag are written into the Dcache.

Since the cache block is 64 bytes, and the 21264 has two 64-bit fill pipelines, it takes 4 transactions across these pipelines to complete a valid fill. The tag array is written in the first cycle so newly issued loads can "hit" in the Dcache on partially filled cache blocks. After the cache block is written, the MAF number is CAMed across both the LDQ and STQ. In the case of the LDQ, this CAM indicates all the loads which wanted this cache block but were not satisfied on the original fill. These loads are placed in "retry" state and use the Mbox retry pipeline to resolve themselves. In the case of the STQ, all stores which wanted this block are informed that the block is now in the Dcache and in the appropriate cache state to be writeable. When these stores are retired, they can be written into the Dcache in-order. In some cases, the Cbox delivers a cache block which can be consumed by loads but is not writeable. In these cases, all the stores in the STQ are also placed in the "retry" state for further processing by the Mbox retry pipeline. The Mbox retry pipeline starts in the first cycle of the fill pipeline, and similar to a fill, reissues loads and stores into the execution pipeline.
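A behavioral sketch of the fill-time wake-up: the returning MAF number is matched against the LDQ, satisfied loads complete, and loads that still cannot be satisfied are marked for the retry pipeline. The entry fields and the satisfied_by_fill test are stand-ins for the real timing and coverage conditions:

# Sketch of fill processing: when a block returns, every LDQ entry waiting on
# that MAF number either consumes the fill data or is marked for retry.

ldq = [
    {"age": 5, "maf": 0, "done": False, "retry": False},
    {"age": 9, "maf": 0, "done": False, "retry": False},
    {"age": 7, "maf": 1, "done": False, "retry": False},
]

def process_fill(maf_number, satisfied_by_fill):
    for entry in ldq:
        if entry["maf"] != maf_number or entry["done"]:
            continue
        if satisfied_by_fill(entry):
            entry["done"] = True       # data bypassed from the LSD bus
        else:
            entry["retry"] = True      # reissue through the Mbox retry pipeline


process_fill(0, satisfied_by_fill=lambda e: e["age"] != 9)
print(ldq)    # load 5 completes, load 9 retries, load 7 (MAF 1) is untouched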

Dcache Miss Processing

If a memory reference misses the Dcache and is not trapped or merged in stage 8 of the Mbox pipeline, a new MAF entry is generated for the reference. The Cbox finds the block in the L2 cache or main memory, and delivers ...

Cbox

The Cache Box (Cbox) controls the cache subsystem within the 21264 microprocessor and has two primary tasks. First, it cooperates with the external system to ...

33

cache and system. Each entry in the MAF refers to a 64- byte block of data which is ultimately bound for the Dcache or the Icache.
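The MAF merging behavior in stage 8 can be sketched in a few lines of C. This is a sketch under stated assumptions (the eight-entry MAF size is taken from the text later in this lecture; the entry layout and function names are invented): a miss is merged with any outstanding miss to the same 64-byte block, otherwise a free entry is allocated.

#include <stdbool.h>
#include <stdint.h>

#define MAF_ENTRIES 8
#define BLOCK_BYTES 64

/* Hypothetical miss-address-file entry: one outstanding 64-byte block. */
typedef struct {
    bool     valid;
    uint64_t block_addr;
} maf_entry_t;

/* On a Dcache miss, return the MAF number the reference maps to:
   an existing entry for the same block (a merge), a newly allocated
   entry, or -1 if the MAF is full and the reference must retry. */
int maf_allocate_or_merge(maf_entry_t maf[MAF_ENTRIES], uint64_t phys_addr)
{
    uint64_t block = phys_addr & ~(uint64_t)(BLOCK_BYTES - 1);
    int free_slot = -1;

    for (int i = 0; i < MAF_ENTRIES; i++) {
        if (maf[i].valid && maf[i].block_addr == block)
            return i;                      /* merge with the outstanding miss */
        if (!maf[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot >= 0) {
        maf[free_slot].valid = true;
        maf[free_slot].block_addr = block;
    }
    return free_slot;
}

The returned MAF number plays the role described in stage 9: it is the tag that later lets the fill CAM find every load and store waiting on that block.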


To do so, the LDQ and STQ hold lists of up to 32 loads and 32 stores, in issue order. When a new load or store arrives, its address is compared against these lists to detect and fix hazards.

16Thursday, April 3, 14


LDQ/STQ speculation

address-out and address-in buses in the system pin bus. This provides bandwidth for new address requests (out from the processor) and system probes (into the processor), and allows for simple, small-scale multiprocessor system designs. The 21264 system interface's low pin counts and high bandwidth let a high-performance system (of four or more processors) broadcast probes without using a large number of pins. The BIU stores pending system probes in an eight-entry probe queue before responding to the probes, in order. It responds to probes very quickly to support a system with minimum latency, and minimizes the address bus bandwidth required in common probe response cases.

The 21264 provides a rich set of possible coherence actions; it can scale to larger-scale system implementations, including directory-based systems.4 It supports all five of the standard MOESI (modified-owned-exclusive-shared-invalid) cache states.
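For reference, the five MOESI states named above, written out as a small C enum (a labeling convenience, not 21264 RTL):

/* The five standard MOESI cache-coherence states. */
typedef enum {
    CACHE_MODIFIED,    /* dirty; this cache holds the only copy              */
    CACHE_OWNED,       /* dirty; other caches may hold shared (clean) copies */
    CACHE_EXCLUSIVE,   /* clean; this cache holds the only copy              */
    CACHE_SHARED,      /* clean; other caches may also hold copies           */
    CACHE_INVALID      /* no valid copy in this cache                        */
} moesi_state_t;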

The BIU supports a wide range of system data bus speeds. The peak bandwidth of the system data interface is 8 bytes of data per 1.5 CPU cycles, or 3.2 Gbytes/sec at a 400-MHz transfer rate. The load latency (issue of load to issue of consumer) can be as low as 160 ns with a 60-ns DRAM access time. The total of eight in-flight MAFs and eight in-flight victims provide many parallel memory operations to schedule for high SRAM and DRAM efficiency. This translates into high memory system performance, even with cache misses. For example, the 21264 has sustained in excess of 1.3 Gbytes/sec (user-visible) memory bandwidth on the Stream benchmark.7

Dynamic execution examples

The 21264 architecture is very dynamic. In this article I have discussed a number of its dynamic techniques, including the line predictor, branch predictor, and issue queue scheduling. Two more examples in this section further illustrate the 21264's dynamic adaptability.

Store/load memory ordering

The 21264 memory system supports the full capabilities of the out-of-order execution core, yet maintains an in-order architectural memory model. This is a challenge when multiple loads and stores reference the same address. The register rename logic cannot automatically handle these read-after-write memory dependencies as it does register dependencies because it does not have the memory address until the instruction issues. Instead, the memory system dynamically detects the problem case after the instructions issue (and the addresses are available).

This example shows how the 21264 dynamically adapts to avoid the costs of load misspeculation. It remembers the first misspeculation and avoids the problem in subsequent executions by delaying the load.

Figure 10 shows how the 21264 resolves a memory read-after-write hazard. The source instructions are on the far left: a store followed by a load to the same address. On the first execution of these instructions, the 21264 attempts to issue the load as early as possible, before the older store, to minimize load latency. The load receives the wrong data since it issues before the store in this case, so the 21264 hazard detection logic squashes the load (and all subsequent instructions). After this type of load misspeculation, the 21264 trains to avoid it on subsequent executions by setting a bit in a load wait table.

Figure 10 also shows what happens on subsequent executions of the same code. At fetch time the store wait table bit corresponding to the load is set. The issue queue then forces the issue point of the marked load to be delayed until all prior stores have issued, thereby avoiding this store/load order violation and also allowing the speculative store buffer to bypass the correct data to the load. This store wait table is periodically cleared to avoid unnecessary waits.
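A minimal sketch of the training mechanism just described, assuming a direct-mapped, one-bit-per-entry table indexed by load PC (the table's real size, indexing, and clearing interval are not given in the text; the names below are invented):

#include <stdbool.h>
#include <stdint.h>

#define WAIT_TABLE_ENTRIES 1024            /* size is an assumption */

static bool load_wait[WAIT_TABLE_ENTRIES]; /* one wait bit per entry */

static unsigned wait_index(uint64_t load_pc)
{
    return (unsigned)(load_pc >> 2) & (WAIT_TABLE_ENTRIES - 1);
}

/* Called when hazard logic squashes a load that issued before an older
   store to the same address: remember the misspeculation. */
void train_on_load_misspeculation(uint64_t load_pc)
{
    load_wait[wait_index(load_pc)] = true;
}

/* Consulted at fetch: a marked load is held in the issue queue until all
   prior stores have issued, so the store buffer can bypass correct data. */
bool must_wait_for_prior_stores(uint64_t load_pc)
{
    return load_wait[wait_index(load_pc)];
}

/* Periodically cleared so loads that no longer conflict are not delayed
   forever. */
void clear_wait_table(void)
{
    for (unsigned i = 0; i < WAIT_TABLE_ENTRIES; i++)
        load_wait[i] = false;
}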

This example store/load order case shows how the memory system produces a result that is the same as an in-order memory system while capturing the performance advantages of out-of-order execution.


Figure 10. An example of the 21264 memory load-after-store hazard adaptation (assume R10 = R11).

Source code:            STQ R0, 0(R10)  followed by  LDQ R1, 0(R11)
First execution:        LDQ R1, 0(R11) issues before STQ R0, 0(R10); this load got the wrong data!
Subsequent executions:  STQ R0, 0(R10) issues first; the marked (delayed) load LDQ R1, 0(R11) gets the store data.

It also marks the load instruction in a predictor, so that future executions of that load are not speculatively issued ahead of older stores.

Unmarked loads issue as early as possible, and before as many stores as possible, while only the necessary marked loads are delayed.

Load hit/miss prediction

There are minispeculations within the 21264's speculative execution engine. To achieve the minimum three-cycle integer load hit latency, the processor must speculatively issue the consumers of the integer load data before knowing if the load hit or missed in the on-chip data cache. This early issue allows the consumers to receive bypassed data from a load at the earliest possible time. Note in Figure 2 that the data cache stage is three cycles after the queue, or issue, stage, so the load's cache lookup must happen in parallel with the consumers' issue. Furthermore, it really takes another cycle after the cache lookup to get the hit/miss indication to the issue queue. This means that consumers of the results produced by the consumers of the load data (the beneficiaries) can also speculatively issue, even though the load may have actually missed.

The processor could rely on the general mechanisms available in the speculative execution engine to abort the integer load data's speculatively executed consumers; however, that requires restarting the entire instruction pipeline. Given that load misses can be frequent in some applications, this technique would be too expensive. Instead, the processor handles this with a minirestart. When consumers speculatively issue three cycles after a load that misses, two integer issue cycles (on all four integer pipes) are squashed. All integer instructions that issued during those two cycles are pulled back into the issue queue to be reissued later. This forces the processor to reissue both the consumers and the beneficiaries. If the load hits, the instruction schedule shown on the top of Figure 11 will be executed. If the load misses, however, the original issues of the unrelated instructions L3-L4 and U4-U6 must be reexecuted in cycles 5 and 6. The schedule thus is delayed two cycles from that depicted.

While this two-cycle window is less costly than fully restarting the processor pipeline, it still can be expensive for applications with many integer load misses. Consequently, the 21264 predicts when loads will miss and does not speculatively issue the consumers of the load data in that case. The bottom half of Figure 11 shows the example instruction schedule for this prediction. The effective load latency is five cycles rather than the minimum three for an integer load hit that is (incorrectly) predicted to miss. But more unrelated instructions are allowed to issue in the slots not taken by the consumer and the beneficiaries.

The load hit/miss predictor is the most-significant bit of a 4-bit counter that tracks the hit/miss behavior of recent loads. The saturating counter decrements by two on cycles when there is a load miss; otherwise, it increments by one when there is a hit. This hit/miss predictor minimizes latencies in applications that often hit, and avoids the costs of over-speculation for applications that often miss.
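The counter update rule above is simple enough to transcribe directly. A sketch in C (the initial value and the single shared counter are assumptions consistent with, but not spelled out by, the text):

#include <stdbool.h>

/* 4-bit saturating counter shared across recent loads.  The most-
   significant bit is the prediction: 1 = predict hit, 0 = predict miss. */
static int hit_miss_counter = 15;

bool predict_integer_load_hit(void)
{
    return (hit_miss_counter & 0x8) != 0;
}

/* Decrement by two on a cycle with a load miss; otherwise increment by
   one on a hit.  Saturate at 0 and 15. */
void update_hit_miss_counter(bool load_hit)
{
    if (load_hit) {
        if (hit_miss_counter < 15)
            hit_miss_counter += 1;
    } else {
        hit_miss_counter -= 2;
        if (hit_miss_counter < 0)
            hit_miss_counter = 0;
    }
}

Because misses subtract twice as fast as hits add, the predictor switches to predict-miss quickly in miss-heavy phases, which is exactly the case where the two-cycle minirestart would otherwise be paid repeatedly.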

The 21264 treats floating-point loads differently than integer loads for load hit/miss prediction. The floating-point load latency is four cycles, with no single-cycle operations, so there is enough time to resolve the exact instruction that used the load result.


Figure 11. Integer load hit/miss prediction example. This figure depicts the execution of a workload when the selected load (P) is predicted to hit (a) and predicted to miss (b) on the four integer pipes. The cross-hatched and screened sections show the instructions that are either squashed and reexecuted from the issue queue, or delayed due to operand availability or the reexecution of other instructions. (Legend: P = producing load, C = consumer, BX = beneficiary of the load, LX = unrelated instruction on the lower pipes, UX = unrelated instruction on the upper pipes; cycles 0 through 6 are shown for both the predict-hit and predict-miss schedules.)


17Thursday, April 3, 14



Architecture highlights

The 21264 is a superscalar microprocessor that can fetch and execute up to four instructions per cycle. It also features out-of-order execution.3,4 With this, instructions execute as soon as possible and in parallel with other nondependent work, which results in faster execution because critical-path computations start and complete quickly.

The processor also employs speculative execution to maximize performance. It speculatively fetches and executes instructions even though it may not know immediately whether the instructions will be on the final execution path. This is particularly useful, for instance, when the 21264 predicts branch directions and speculatively executes down the predicted path.

Sophisticated branch prediction, coupled with speculative and dynamic execution, extracts instruction parallelism from applications. With more functional units and these dynamic execution techniques, the processor is 50% to 200% faster than its 21164 predecessor for many applications, even though both generations can fetch at most four instructions per cycle.5

The 21264's memory system also enables high performance levels. On-chip and off-chip caches provide for very low latency data access. Additionally, the 21264 can service many parallel memory references to all caches in the hierarchy, as well as to the off-chip memory system. This permits very high bandwidth data access.6 For example, the processor can sustain more than 1.3 GBytes/sec on the Stream benchmark.7

The microprocessor's cycle time is 500 to 600 MHz, implemented by 15 million transistors in a 2.2-V, 0.35-micron CMOS process with six metal layers. The 3.1 cm2 processor

R. E. Kessler, Compaq Computer Corporation

THE ALPHA 21264 MICROPROCESSOR

THE ALPHA 21264 OWES ITS HIGH PERFORMANCE TO HIGH CLOCK SPEED, MANY FORMS OF OUT-OF-ORDER AND SPECULATIVE EXECUTION, AND A HIGH-BANDWIDTH MEMORY SYSTEM.

Designing a microprocessor is a team sport. Below are the author and acknowledgement lists for the papers whose figures I use.

Circuit Implementation of a 600MHz Superscalar RISC Microprocessor

M. Matson, D. Bailey, S. Bell, L. Biro, S. Butler, J. Clouser, J. Farrell, M. Gowan, D. Priore, and K. Wilcox

Compaq Computer Corporation, Shrewsbury, MA

Abstract

The circuit techniques used to implement a 600MHz, out-of-order, superscalar RISC Alpha microprocessor are described. Innovative logic and circuit design created a chip that attains 30+ SpecInt95 and 50+ SpecFP95, and supports a secondary cache bandwidth of 6.4GB/s. Microarchitectural techniques were used to optimize latencies and cycle time, while a variety of static and dynamic design methods balanced critical path delays against power consumption. The chip relies heavily on full custom design and layout to meet speed and area goals. An extensive CAD suite guaranteed the integrity of the design.

1. Introduction

The design of the Alpha 21264 microprocessor [1] was driven by a desire to achieve the highest performance possible in a single chip, 0.35um CMOS microprocessor. This goal was realized by combining low instruction latencies and a high frequency of operation with out-of-order issue techniques. The microprocessor fetches four instructions per cycle and can issue up to six simultaneously. Large 64KB, two-way set associative, primary caches were included for both instructions and data; a high bandwidth secondary cache interface transfers up to 6.4GB/s of data into or from the chip. A phase-locked loop [2] generates the 600MHz internal clock. The increased power that accompanies such high frequencies is managed through reduced VDD, conditional clocking, and other low power techniques.

The Alpha microprocessor road map dictates continual improvements in architecture, circuits, and fabrication technology with each successive generation. In comparison to its predecessors [3-5], the 21264 issues instructions out-of-order, supports more in-flight instructions, executes more instructions in parallel, has much larger primary caches and memory bandwidth, and contains additional integer and floating point function units. Other differences include a phase-locked loop to simplify system design, as well as conditional clocks and a clocking hierarchy to reduce power consumption and permit critical path tradeoffs. Custom circuit design enabled the incorporation of these advances while reducing the cycle time more than possible with the process shrink alone. The new 0.35um (drawn) process provides faster devices and die area for more features, along with reference planes for better signal integrity and power distribution.

The remainder of this paper will describe the 21264's physical characteristics, design methodology, and major blocks, paying particular attention to the underlying design problems and implementation approaches used to solve them. The paper will conclude with a discussion of how these strategies created a microprocessor with leading edge performance.

2. Physical Characteristics

Characteristics of the CMOS process are summarized in Table 1. The process provides two fine pitch metal layers, two coarse pitch metal layers, and two reference planes. The pitch of the finer layers aids in compacting the layout, while the lower resistance of the coarser layers is beneficial for clocks and long signal wires. The reference planes lower the effective impedance of the power supply and also provide a low inductance return path for clock and signal lines. Moreover, they greatly reduce the capacitive and inductive coupling between the wires, which could otherwise induce reliability failures due to voltage undershoot or overshoot, functional failures caused by excessive noise, and wide variations in path delays due to data dependencies. Gate oxide capacitors, placed near large drivers and underneath upper level metal routing channels, further diminish power supply noise.

Table 1: CMOS Process Technology
  Feature size        0.35um
  Channel length      0.25um
  Gate oxide          6.0nm
  VTXn/VTXp           0.35V / -0.35V
  Metal 1, 2          5.7kA AlCu, 1.225um pitch
  Reference plane 1   14.8kA AlCu, VSS
  Metal 3, 4          14.8kA AlCu, 2.80um pitch
  Reference plane 2   14.8kA AlCu, VDD

The microprocessor is packaged in a 587 pin ceramic interstitial pin grid array. A CuW heat slug lowers the thermal resistance between the die and detachable heat sink. The package has a 1uF wirebond attached chip capacitor in addition to the distributed on-chip decoupling capacitors.

There is no “i” in T-E-A-M ...

circuits


Compaq has been shipping the 21264 to customers since the last quarter of 1998. Future versions of the 21264, taking advantage of technology advances for lower cost and higher speed, will extend the Alpha's performance leadership well into the new millennium. The next-generation 21364 and 21464 Alphas are currently being designed. They will carry the Alpha line even further into the future.

Acknowledgments

The 21264 is the fruition of many individuals, including M. Albers, R. Allmon, M. Arneborn, D. Asher, R. Badeau, D. Bailey, S. Bakke, A. Barber, S. Bell, B. Benschneider, M. Bhaiwala, D. Bhavsar, L. Biro, S. Britton, D. Brown, M. Callander, C. Chang, J. Clouser, R. Davies, D. Dever, N. Dohm, R. Dupcak, J. Emer, N. Fairbanks, B. Fields, M. Gowan, R. Gries, J. Hagan, C. Hanks, R. Hokinson, C. Houghton, J. Huggins, D. Jackson, D. Katz, J. Kowaleski, J. Krause, J. Kumpf, G. Lowney, M. Matson, P. McKernan, S. Meier, J. Mylius, K. Menzel, D. Morgan, T. Morse, L. Noack, N. O'Neill, S. Park, P. Patsis, M. Petronino, J. Pickholtz, M. Quinn, C. Ramey, D. Ramey, E. Rasmussen, N. Raughley, M. Reilly, S. Root, E. Samberg, S. Samudrala, D. Sarrazin, S. Sayadi, D. Siegrist, Y. Seok, T. Sperber, R. Stamm, J. St Laurent, J. Sun, R. Tan, S. Taylor, S. Thierauf, G. Vernes, V. von Kaenel, D. Webb, J. Wiedemeier, K. Wilcox, and T. Zou.

References

1. D. Dobberpuhl et al., "A 200 MHz 64-bit Dual Issue CMOS Microprocessor," IEEE J. Solid-State Circuits, Vol. 27, No. 11, Nov. 1992, pp. 1,555-1,567.
2. J. Edmondson et al., "Superscalar Instruction Execution in the 21164 Alpha Microprocessor," IEEE Micro, Vol. 15, No. 2, Apr. 1995, pp. 33-43.
3. B. Gieseke et al., "A 600 MHz Superscalar RISC Microprocessor with Out-of-Order Execution," IEEE Int'l Solid-State Circuits Conf. Dig., Tech. Papers, IEEE Press, Piscataway, N.J., Feb. 1997, pp. 176-177.
4. D. Leibholz and R. Razdan, "The Alpha 21264: A 500 MHz Out-of-Order Execution Microprocessor," Proc. IEEE Compcon 97, IEEE Computer Soc. Press, Los Alamitos, Calif., 1997, pp. 28-36.
5. R.E. Kessler, E.J. McLellan, and D.A. Webb, "The Alpha 21264 Microprocessor Architecture," Proc. 1998 IEEE Int'l Conf. Computer Design: VLSI in Computers and Processors, IEEE Computer Soc. Press, Oct. 1998, pp. 90-95.
6. M. Matson et al., "Circuit Implementation of a 600 MHz Superscalar RISC Microprocessor," 1998 IEEE Int'l Conf. Computer Design: VLSI in Computers and Processors, Oct. 1998, pp. 104-110.
7. J.D. McCalpin, "STREAM: Sustainable Memory Bandwidth in High-Performance Computers," Univ. of Virginia, Dept. of Computer Science, Charlottesville, Va.; http://www.cs.virginia.edu/stream/.
8. S. McFarling, Combining Branch Predictors, Tech. Note TN-36, Compaq Computer Corp. Western Research Laboratory, Palo Alto, Calif., June 1993; http://www.research.digital.com/wrl/techreports/abstracts/TN-36.html.
9. T. Fischer and D. Leibholz, "Design Tradeoffs in Stall-Control Circuits for 600 MHz Instruction Queues," Proc. IEEE Int'l Solid-State Circuits Conf. Dig., Tech. Papers, IEEE Press, Feb. 1998, pp. 398-399.

Richard E. Kessler is a consulting engineer in the Alpha Development Group of Compaq Computer Corp. in Shrewsbury, Massachusetts. He is an architect of the Alpha 21264 and 21364 microprocessors. His interests include microprocessor and computer system architecture. He has an MS and a PhD in computer sciences from the University of Wisconsin, Madison, and a BS in electrical and computer engineering from the University of Iowa. He is a member of the ACM and the IEEE.

Contact Kessler about this article at Compaq Computer Corp., 334 South St., Shrewsbury, MA 01545; [email protected].



architect

memory. Directory or duplicate-tag based protocols can be built using these primitives in a similar fashion.

References

[1] D. Dobberpuhl et al., "A 200-MHz 64-bit Dual Issue CMOS Microprocessor," Digital Technical Journal, vol. 4, no. 4, 1992.
[2] J. Edmondson et al., "Superscalar instruction execution in the 21164 Alpha Microprocessor," IEEE Micro, vol. 15, no. 2, Apr. 1995.
[3] S. McFarling, "Combining Branch Predictors," Technical Note TN-36, Digital Equipment Corporation Western Research Laboratory, June 1993. <www.research.digital.com/wrl/techreports/abstracts/TN-36.html>

Acknowledgments

The authors acknowledge the contributions of the following individuals: J. Emer, B. Gieseke, B. Grundmann, J. Keller, R. Kessler, E. McLellan, D. Meyer, J. Pierce, S. Steely, and D. Webb.


The Alpha 21264: A 500 MHz Out-of-Order Execution Microprocessor

Daniel Leibholz and Rahul Razdan, Digital Equipment Corporation

Hudson, MA 01749

Abstract

This paper describes the internal organization of the 21264, a 500 MHz, out-of-order, quad-fetch, six-way issue microprocessor. The aggressive cycle-time of the 21264 in combination with many architectural innovations, such as out-of-order and speculative execution, enable this microprocessor to deliver an estimated 30 SpecInt95 and 50 SpecFp95 performance. In addition, the 21264 can sustain 5+ Gigabytes/sec of bandwidth to an L2 cache and 3+ Gigabytes/sec to memory for high performance on memory-intensive applications.

Introduction

The 21264 is the third generation of Alpha microprocessors designed and built by Digital Semiconductor. Like its predecessors, the 21064 [1] and the 21164 [2], the design objective of the 21264 team was to build a world-class microprocessor which is the undisputed performance leader. The principal levers used to achieve this objective were:

● A cycle time (2.0 ns in 0.35 micron CMOS at 2 volts) was chosen by evaluation of the circuit loops which provide the most performance leverage. For example, an integer add and result bypass (to the next integer operation) is critical to the performance of most integer programs and is therefore a determining factor in choosing the cycle time.

● An out-of-order, superscalar execution core was built to increase the average instructions executed per cycle (ipc) for the machine. The out-of-order execution model dynamically finds instruction-level parallelism in the program and hides memory latency by executing load instructions that may be located past conditional branches.

● Performance-focused instructions were added to the Alpha architecture and implemented in the 21264. These include:

  ⇒ Motion estimation instructions accelerate CPU-intensive video compression and decompression algorithms.

  ⇒ Prefetch instructions enable software control of the data caches.

  ⇒ Floating point square root and bidirectional register file transfer instructions (integer-to-floating point) enhance floating point performance.

● High-speed interfaces to the backup (L2) cache and system memory dramatically increase the bandwidth available from each of these sources.

The combination of these techniques delivers an estimated 30 SpecInt95 and over 50 SpecFp95 performance on the standard SPEC95 benchmark suite and over 1600 MB/s on the McCalpin STREAM benchmark. In addition, the dramatic rise in external

Figure 1. 21264 Floorplan


micro-architects

18Thursday, April 3, 14


UC Regents Spring 2014 © UCBCS 152 L19: Dynamic Scheduling II

Break

Play:19Thursday, April 3, 14


UC Regents Spring 2014 © UCBCS 152 L18: Dynamic Scheduling I

Multi-Threading(Dynamic Scheduling)

20Thursday, April 3, 14


UC Regents Spring 2014 © UCBCS 152 L18: Dynamic Scheduling I

Power 4 (predates Power 5 shown earlier)

● Load hit store: A younger load that executes before an older store to the same memory location has written its data to the caches must retrieve the data from the SDQ. As loads execute, they check the SRQ to see whether there is any older store to the same memory location with data in the SDQ. If one is found, the data is forwarded from the SDQ rather than from the cache. If the data cannot be forwarded (as is the case if the load and store instructions operate on overlapping memory locations and the load data is not the same as or contained within the store data), the group containing the load instruction is flushed; that is, it and all younger groups are discarded and refetched from the instruction cache. If we can tell that there is an older store instruction that will write to the same memory location but has yet to write its result to the SDQ, the load instruction is rejected and reissued, again waiting for the store instruction to execute.

● Store hit load: If a younger load instruction executes before we have had a chance to recognize that an older store will be writing to the same memory location, the load instruction has received stale data. To guard against this, as a store instruction executes it checks the LRQ; if it finds a younger load that has executed and loaded from memory locations to which the store is writing, the group containing the load instruction and all younger groups are flushed and refetched from the instruction cache. To simplify the logic, all groups following the store are flushed. If the offending load is in the same group as the store instruction, the group is flushed, and all instructions in the group form single-instruction groups.

● Load hit load: Two loads to the same memory location must observe the memory reference order and prevent a store to the memory location from another processor between the intervening loads. If the younger load obtains old data, the older load must not obtain new data. This requirement is called sequential load consistency. To guard against this, LRQ entries for all loads include a bit which, if set, indicates that a snoop has occurred to the line containing the loaded data for that entry. When a load instruction executes, it compares its load address against all addresses in the LRQ. A match against a younger entry which has been snooped indicates that a sequential load consistency problem exists. To simplify the logic, all groups following the older load instruction are flushed. If both load instructions are in the same group, the flush request is for the group itself. In this case, each instruction in the group when refetched forms a single-instruction group in order to avoid this situation the second time around.
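A compact sketch of the "load hit store" decision from the first bullet above (the queue names follow the text; the entry fields and the idea of handing the function a pre-matched SRQ entry are illustrative simplifications): the load either uses the cache, forwards from the SDQ, is rejected and reissued, or forces a group flush.

#include <stdbool.h>
#include <stddef.h>

typedef enum {
    LOAD_USE_CACHE,        /* no older store to this address            */
    LOAD_FORWARD_FROM_SDQ, /* store data fully covers the load          */
    LOAD_REJECT_REISSUE,   /* older store has not written the SDQ yet   */
    LOAD_FLUSH_GROUP       /* partial overlap: refetch from the I-cache */
} load_action_t;

/* Hypothetical view of the oldest matching, older store found by the
   load's SRQ search (NULL if there is none). */
typedef struct {
    bool data_in_sdq;      /* store data already sits in the SDQ        */
    bool covers_load;      /* load bytes contained within store bytes   */
} srq_match_t;

load_action_t load_hit_store_check(const srq_match_t *match)
{
    if (match == NULL)
        return LOAD_USE_CACHE;
    if (!match->data_in_sdq)
        return LOAD_REJECT_REISSUE;     /* wait for the store to execute */
    if (match->covers_load)
        return LOAD_FORWARD_FROM_SDQ;   /* bypass the store data         */
    return LOAD_FLUSH_GROUP;            /* discard and refetch the group */
}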

Instruction execution pipeline

Figure 4 shows the POWER4 instruction execution pipeline for the various pipelines. The IF, IC, and BP cycles correspond to the instruction-fetching and branch-prediction cycles. The D0 through GD cycles are the cycles during which instruction decode and group formation occur. The MP cycle is the mapper cycle, in which all dependencies are determined, resources assigned, and the group dispatched into the appropriate issue queues. During the ISS cycle, the IOP is issued to the appropriate execution unit, reads the appropriate

Figure 4. POWER4 instruction execution pipeline. (The diagram shows instruction fetch stages IF, IC, BP; decode and group formation stages D0-D3, Xfer, GD; then MP, ISS, RF feeding the branch pipeline, the load/store pipeline (EA, DC, Fmt), the fixed-point pipeline (EX), and the floating-point pipeline (F6), each ending in WB, Xfer, and group commit CP. Branch redirects, interrupts, and flushes feed back to instruction fetch; everything after GD is out-of-order processing.)

Single-threaded predecessor to Power 5. 8 execution units in the out-of-order engine, each may issue an instruction each cycle.

21Thursday, April 3, 14


UC Regents Spring 2014 © UCBCS 152 L18: Dynamic Scheduling I

For most apps, most execution units lie idle

Figure 2: Sources of all unused issue cycles in an 8-issue superscalar processor. Processor busy represents the utilized issue slots; all others represent wasted issue slots. (Bar chart over the SPEC applications alvinn, doduc, eqntott, espresso, fpppp, hydro2d, li, mdljdp2, mdljsp2, nasa7, ora, su2cor, swm, tomcatv, and a composite; the y-axis is percent of total issue cycles. Wasted-cycle categories include itlb miss, dtlb miss, icache miss, dcache miss, branch misprediction, control hazards, load delays, short and long integer dependences, short and long fp dependences, and memory conflicts.)

such as an I tlb miss and an I cache miss, the wasted cycles are divided up appropriately. Table 3 specifies all possible sources of wasted cycles in our model, and some of the latency-hiding or latency-reducing techniques that might apply to them. Previous work [32, 5, 18], in contrast, quantified some of these same effects by removing barriers to parallelism and measuring the resulting increases in performance.

Our results, shown in Figure 2, demonstrate that the functional units of our wide superscalar processor are highly underutilized. From the composite results bar on the far right, we see a utilization of only 19% (the "processor busy" component of the composite bar of Figure 2), which represents an average execution of less than 1.5 instructions per cycle on our 8-issue machine.

These results also indicate that there is no dominant source of wasted issue bandwidth. Although there are dominant items in individual applications (e.g., mdljsp2, swm, fpppp), the dominant cause is different in each case. In the composite results we see that the largest cause (short FP dependences) is responsible for 37% of the issue bandwidth, but there are six other causes that account for at least 4.5% of wasted cycles. Even completely eliminating any one factor will not necessarily improve performance to the degree that this graph might imply, because many of the causes overlap.

Not only is there no dominant cause of wasted cycles; there appears to be no dominant solution. It is thus unlikely that any single latency-tolerating technique will produce a dramatic increase in the performance of these programs if it only attacks specific types of latencies. Instruction scheduling targets several important segments of the wasted issue bandwidth, but we expect that our compiler has already achieved most of the available gains in that regard. Current trends have been to devote increasingly larger amounts of on-chip area to caches, yet even if memory latencies are completely eliminated, we cannot achieve 40% utilization of this processor. If specific latency-hiding techniques are limited, then any dramatic increase in parallelism needs to come from a general latency-hiding solution, of which multithreading is an example. The different types of multithreading have the potential to hide all sources of latency, but to different degrees.

This becomes clearer if we classify wasted cycles as either vertical

From: Tullsen, Eggers, and Levy,“Simultaneous Multithreading: Maximizing On-chip Parallelism, ISCA 1995.
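A quick arithmetic check of the composite figure quoted above: 19% utilization of an 8-issue machine is roughly 0.19 x 8 ≈ 1.5, matching the "less than 1.5 instructions per cycle" number the paper reports.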

For an 8-way superscalar. Observation: Most hardware in an out-of-order CPU concerns physical registers. Could several instruction threads share this hardware?

22Thursday, April 3, 14


UC Regents Spring 2014 © UCBCS 152 L18: Dynamic Scheduling I

Simultaneous Multi-threading ...

Diagram: issue slots (M M FX FX FP FP BR CC) over cycles 1-9, first for one thread with 8 units, then for two threads sharing the same 8 units.
M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes
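A toy C sketch of the idea in the diagram (purely illustrative; not the issue logic of any real machine): each cycle, issue slots that one thread cannot fill are offered to the other thread, attacking both whole idle cycles and idle slots within a busy cycle.

#define ISSUE_WIDTH 8
#define NUM_THREADS 2

/* ready[t] = number of instructions thread t could issue this cycle.
   Returns how many of the 8 slots get filled when the threads are
   allowed to share them. */
int fill_issue_slots(const int ready[NUM_THREADS])
{
    int filled = 0;
    for (int t = 0; t < NUM_THREADS && filled < ISSUE_WIDTH; t++) {
        int take = ready[t];
        if (take > ISSUE_WIDTH - filled)
            take = ISSUE_WIDTH - filled;
        filled += take;
    }
    return filled;
}

For example, with one thread that has 3 ready instructions, 5 of the 8 slots go idle that cycle; with a second thread that has 4 ready instructions, only 1 slot goes idle.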

23Thursday, April 3, 14


UC Regents Spring 2014 © UCBCS 152 L18: Dynamic Scheduling I


The Power5 scans fetched instructions for branches (BP stage), and if it finds a branch, predicts the branch direction using three branch history tables shared by the two threads. Two of the BHTs use bimodal and path-correlated branch prediction mechanisms to predict branch directions.6,7 The third BHT predicts which of these prediction mechanisms is more likely to predict the correct direction.7 If the fetched instructions contain multiple branches, the BP stage can predict all the branches at the same time. In addition to predicting direction, the Power5 also predicts the target of a taken branch in the current cycle's eight-instruction group. In the PowerPC architecture, the processor can calculate the target of most branches from the instruction's address and offset value. For


Figure 3. Power5 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit). (The diagram shows the branch, load/store, fixed-point, and floating-point pipelines, with branch redirects and interrupts/flushes feeding back to instruction fetch; everything after GD is out-of-order processing.)

Figure 4. Power5 instruction data flow (BXU = branch execution unit and CRL = condition register logical execution unit). (The diagram distinguishes resources shared by the two threads, such as the shared-register mappers, shared issue queues, shared register files, and shared execution units (LSU0/1, FXU0/1, FPU0/1, BXU, CRL), and the caches and translation structures, from per-thread resources such as the two program counters, the two instruction buffers, and group completion with the store queue; branch prediction uses shared branch history tables, a return stack, and a target cache, with dynamic instruction selection by thread priority.)

Power 4

Power 5

2 fetch (PC), 2 initial decodes

2 commits (architected register sets)

24Thursday, April 3, 14


UC Regents Spring 2014 © UCBCS 152 L18: Dynamic Scheduling I

Power 5 data flow ...


Figure 4. Power5 instruction data flow (BXU = branch execution unit and CRL = condition register logical execution unit).Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.

25Thursday, April 3, 14


UC Regents Spring 2014 © UCBCS 152 L18: Dynamic Scheduling I

Power 5 thread performance ...

… mode. In this mode, the Power5 gives all the physical resources, including the GPR and FPR rename pools, to the active thread, allowing it to achieve higher performance than a Power4 system at equivalent frequencies.

The Power5 supports two types of ST operation: an inactive thread can be in either a dormant or a null state. From a hardware perspective, the only difference between these states is whether or not the thread awakens on an external or decrementer interrupt. In the dormant state, the operating system boots up in SMT mode but instructs the hardware to put the thread into the dormant state when there is no work for that thread. To make a dormant thread active, either the active thread executes a special instruction, or an external or decrementer interrupt targets the dormant thread. The hardware detects these scenarios and changes the dormant thread to the active state. It is software's responsibility to restore the architected state of a thread transitioning from the dormant to the active state.

When a thread is in the null state, the operating system is unaware of the thread's existence. As in the dormant state, the operating system does not allocate resources to a null thread. This mode is advantageous if all the system's executing tasks perform better in ST mode.

Dynamic power management

In current CMOS technologies, chip power has become one of the most important design parameters. With the introduction of SMT, more instructions execute per cycle per processor core, thus increasing the core's and the chip's total switching power. To reduce switching power, Power5 chips use a fine-grained, dynamic clock-gating mechanism extensively. This mechanism gates off clocks to a local clock buffer if dynamic power management logic knows the set of latches driven by the buffer will not be used in the next cycle. For example, if the GPRs are guaranteed not to be read in a given cycle, the clock-gating mechanism turns off the clocks to the GPR read ports. This allows substantial power saving with no performance impact.

In every cycle, the dynamic power management logic determines whether a local clock buffer that drives a set of latches can be clock gated in the next cycle. The set of latches driven by a clock-gated local clock buffer can still be read but cannot be written. We used power-modeling tools to estimate the utilization of various design macros and their associated switching power across a range of workloads. We then determined the benefit of clock gating for those macros, implementing cycle-by-cycle dynamic power management in macros where such management provided a reasonable power-saving benefit. We paid special attention to ensuring that clock gating causes no performance loss and that clock-gating logic does not create a critical timing path. A minimum amount of logic implements the clock-gating function.
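A rough software analogy of that per-cycle decision, in Python (purely illustrative; the real mechanism is hardware logic driving local clock buffers, and the instruction fields below are made up):

def gpr_read_clock_enable(next_cycle_instructions):
    """Return True if the GPR read-port latches must be clocked next cycle."""
    return any(insn.get("reads_gpr", False) for insn in next_cycle_instructions)

cycles = [
    [{"op": "add", "reads_gpr": True}],    # GPRs read -> clock enabled
    [{"op": "fadd", "reads_gpr": False}],  # FP-only work -> GPR read ports gated
    [],                                    # pipeline bubble -> gated
]
for n, insns in enumerate(cycles):
    state = "enabled" if gpr_read_clock_enable(insns) else "gated"
    print(f"cycle {n}: GPR read-port clock {state}")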

In addition to switching power, leakage power has become a performance limiter. To reduce leakage power, the Power5 uses transistors with low threshold voltage only in critical paths, such as the FPR read path. We implemented the Power5 SRAM arrays mainly with high threshold voltage devices.

The Power5 also has a low-power mode, enabled when the system software instructs the hardware to execute both threads at the lowest available priority. In low-power mode, instructions dispatch once every 32 cycles at …


Figure 5. Effects of thread priority on performance (y-axis: instructions per cycle (IPC) for thread 0 and thread 1; x-axis: thread 0 priority, thread 1 priority pairs, ranging from power-save mode through balanced priorities to single-thread mode).

Relative priority of each thread controllable in hardware.

For balanced operation, both threads run slower than if they “owned” the machine.

26Thursday, April 3, 14


UC Regents Spring 2014 © UCBCS 152 L18: Dynamic Scheduling I

Multi-Core

27Thursday, April 3, 14


UC Regents Spring 2014 © UCBCS 152 L18: Dynamic Scheduling I

Recall: Superscalar utilization by a thread

Figure 2: Sources of all unused issue cycles in an 8-issue superscalar processor. Processor busy represents the utilized issue slots; all others represent wasted issue slots. (One bar per application: alvinn, doduc, eqntott, espresso, fpppp, hydro2d, li, mdljdp2, mdljsp2, nasa7, ora, su2cor, swm, tomcatv, plus a composite; y-axis: percent of total issue cycles; categories include processor busy, itlb/dtlb/icache/dcache misses, branch misprediction, control hazards, load delays, short/long integer, short/long FP, and memory conflict.)

such as an I tlb miss and an I cache miss, the wasted cycles are divided up appropriately. Table 3 specifies all possible sources of wasted cycles in our model, and some of the latency-hiding or latency-reducing techniques that might apply to them. Previous work [32, 5, 18], in contrast, quantified some of these same effects by removing barriers to parallelism and measuring the resulting increases in performance.

Our results, shown in Figure 2, demonstrate that the functional units of our wide superscalar processor are highly underutilized. From the composite results bar on the far right, we see a utilization of only 19% (the "processor busy" component of the composite bar of Figure 2), which represents an average execution of less than 1.5 instructions per cycle on our 8-issue machine.

These results also indicate that there is no dominant source of wasted issue bandwidth. Although there are dominant items in individual applications (e.g., mdljsp2, swm, fpppp), the dominant cause is different in each case. In the composite results we see that the largest cause (short FP dependences) is responsible for 37% of the issue bandwidth, but there are six other causes that account for at least 4.5% of wasted cycles. Even completely eliminating any one factor will not necessarily improve performance to the degree that this graph might imply, because many of the causes overlap.

Not only is there no dominant cause of wasted cycles; there appears to be no dominant solution. It is thus unlikely that any single latency-tolerating technique will produce a dramatic increase in the performance of these programs if it only attacks specific types of latencies. Instruction scheduling targets several important segments of the wasted issue bandwidth, but we expect that our compiler has already achieved most of the available gains in that regard. Current trends have been to devote increasingly larger amounts of on-chip area to caches, yet even if memory latencies are completely eliminated, we cannot achieve 40% utilization of this processor. If specific latency-hiding techniques are limited, then any dramatic increase in parallelism needs to come from a general latency-hiding solution, of which multithreading is an example. The different types of multithreading have the potential to hide all sources of latency, but to different degrees.

This becomes clearer if we classify wasted cycles as either vertical …
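(Sanity check on the numbers above: 19% utilization of 8 issue slots is 0.19 × 8 ≈ 1.5 instructions per cycle, matching the "less than 1.5" figure.)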

For an 8-way superscalar. Observation: in many cases, the on-chip cache and DRAM I/O bandwidth is also underutilized by one CPU. So, let 2 cores share them.

28Thursday, April 3, 14


UC Regents Spring 2014 © UCBCS 152 L18: Dynamic Scheduling I

Most of Power 5 die is shared hardware

… supports a 1.875-Mbyte on-chip L2 cache. Power4 and Power4+ systems both have 32-Mbyte L3 caches, whereas Power5 systems have a 36-Mbyte L3 cache. The L3 cache operates as a backdoor with separate buses for reads and writes that operate at half processor speed. In Power4 and Power4+ systems, the L3 was an inline cache for data retrieved from memory. Because of the higher transistor density of the Power5's 130-nm technology, we could move the memory controller on chip and eliminate a chip previously needed for the memory controller function. These two changes in the Power5 also have the significant side benefits of reducing latency to the L3 cache and main memory, as well as reducing the number of chips necessary to build a system.

Chip overview

Figure 2 shows the Power5 chip, which IBM fabricates using silicon-on-insulator (SOI) devices and copper interconnect. SOI technology reduces device capacitance to increase transistor performance [5]. Copper interconnect decreases wire resistance and reduces delays in wire-dominated chip-timing paths. In 130 nm lithography, the chip uses eight metal levels and measures 389 mm². The Power5 processor supports the 64-bit PowerPC architecture. A single die contains two identical processor cores, each supporting two logical threads. This architecture makes the chip appear as a four-way symmetric multiprocessor to the operating system. The two cores share a 1.875-Mbyte (1,920-Kbyte) L2 cache. We implemented the L2 cache as three identical slices with separate controllers for each. The L2 slices are 10-way set-associative with 512 congruence classes of 128-byte lines. The data's real address determines which L2 slice the data is cached in. Either processor core can independently access each L2 controller. We also integrated the directory for an off-chip 36-Mbyte L3 cache on the Power5 chip. Having the L3 cache directory on chip allows the processor to check the directory after an L2 miss without experiencing off-chip delays. To reduce memory latencies, we integrated the memory controller on the chip. This eliminates driver and receiver delays to an external controller.

Processor core

We designed the Power5 processor core to support both enhanced SMT and single-threaded (ST) operation modes. Figure 3 shows the Power5's instruction pipeline, which is identical to the Power4's. All pipeline latencies in the Power5, including the branch misprediction penalty and load-to-use latency with an L1 data cache hit, are the same as in the Power4. The identical pipeline structure lets optimizations designed for Power4-based systems perform equally well on Power5-based systems. Figure 4 shows the Power5's instruction flow diagram. In SMT mode, the Power5 uses two separate instruction fetch address registers to store the program counters for the two threads. Instruction fetches (IF stage) alternate between the two threads. In ST mode, the Power5 uses only one program counter and can fetch instructions for that thread every cycle. It can fetch up to eight instructions from the instruction cache (IC stage) every cycle. The two threads share the instruction cache and the instruction translation facility. In a given cycle, all fetched instructions come from the same thread.

Figure 2. Power5 chip (FXU = fixed-point execution unit, ISU = instruction sequencing unit, IDU = instruction decode unit, LSU = load/store unit, IFU = instruction fetch unit, FPU = floating-point unit, and MC = memory controller). Slide annotations on the die photo: Core #1, Core #2, and shared components (L2 cache, L3 cache control, DRAM controller).
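(A quick consistency check on the L2 organization above: 3 slices × 512 congruence classes × 10 ways × 128-byte lines = 1,966,080 bytes = 1,920 Kbytes = 1.875 Mbytes, matching the stated L2 capacity.)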

29Thursday, April 3, 14


UC Regents Spring 2014 © UCBCS 152 L18: Dynamic Scheduling I

Core-to-core interactions stay on chip

(Figure 2, the Power5 chip die photo, and the accompanying chip-overview excerpt are repeated from the previous slide.)

(1) Threads on two cores that use shared libraries conserve L2 memory.
(2) Threads on two cores share memory via L2 cache operations. Much faster than 2 CPUs on 2 chips.
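A small illustration of point (2), hypothetical code rather than anything from the slides: two threads exchanging data through shared memory. On a dual-core Power5, this producer/consumer traffic can be serviced by the shared on-chip L2 instead of crossing a chip-to-chip bus.

import queue
import threading

shared = queue.Queue(maxsize=64)   # buffer both cores can hold in the shared cache

def producer(n):
    for i in range(n):
        shared.put(i * i)          # writes land in memory the other core will read
    shared.put(None)               # sentinel: no more work

def consumer():
    total = 0
    while (item := shared.get()) is not None:
        total += item              # reads are served from the shared cache when it hits
    print("consumer saw", total)

t0 = threading.Thread(target=producer, args=(1000,))
t1 = threading.Thread(target=consumer)
t0.start(); t1.start()
t0.join(); t1.join()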

30Thursday, April 3, 14


UC Regents Spring 2014 © UCBCS 152 L18: Dynamic Scheduling I

Sun Niagara

31Thursday, April 3, 14


UC Regents Spring 2014 © UCBCS 152 L18: Dynamic Scheduling I

The case for Sun’s Niagara ...

Figure 2 (repeated from the earlier slide): Sources of all unused issue cycles in an 8-issue superscalar processor. Processor busy represents the utilized issue slots; all others represent wasted issue slots.


For an 8-way superscalar. Observation: some apps struggle to reach a CPI == 1. For throughput on these apps, a large number of single-issue cores is better than a few superscalars.

32Thursday, April 3, 14


UC Regents Spring 2014 © UCBCS 152 L18: Dynamic Scheduling I

Niagara (original): 32 threads on one chip
8 cores: single-issue, 1.2 GHz, 6-stage pipeline, 4-way multi-threaded, fast crypto support
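(8 cores × 4 threads per core = 32 hardware threads.)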

Shared resources: 3 MB on-chip cache; 4 DDR2 interfaces (32G DRAM, 20 Gb/s); 1 shared FP unit; GB Ethernet ports

Sources: Hot Chips, via EE Times, Infoworld. J Schwartz weblog (Sun COO)

Die size: 340 mm² in 90 nm. Power: 50-60 W.

33Thursday, April 3, 14


UC Regents Spring 2014 © UCBCS 152 L18: Dynamic Scheduling I

The board that booted Niagara first-silicon

Source: J Schwartz weblog (then Sun COO, now CEO)

34Thursday, April 3, 14


UC Regents Spring 2014 © UCBCS 152 L18: Dynamic Scheduling I

Used in Sun Fire T2000: “Coolthreads”

Web server benchmarks used to position the T2000 in the market.

Claim: server uses 1/3 the power of competing servers.

35Thursday, April 3, 14


UC Regents Spring 2014 © UCBCS 152 L19: Dynamic Scheduling II© 2013 International Business Machines Corporation 2

IBM POWER chip generations (Technology; Compute: cores, threads; Caching: on-chip, off-chip; Bandwidth: sustained memory, peak I/O):

POWER5 (2004): 130nm SOI; 2 cores, SMT2; 1.9MB on-chip, 36MB off-chip; 15GB/s, 3GB/s
POWER6 (2007): 65nm SOI; 2 cores, SMT2; 8MB on-chip, 32MB off-chip; 30GB/s, 10GB/s
POWER7 (2010): 45nm SOI eDRAM; 8 cores, SMT4; 2 + 32MB on-chip, none off-chip; 100GB/s, 20GB/s
POWER7+ (2012): 32nm SOI eDRAM; 8 cores, SMT4; 2 + 80MB on-chip, none off-chip; 100GB/s, 20GB/s
POWER8 (2014): 22nm SOI eDRAM; 12 cores, SMT8; 6 + 96MB on-chip, 128MB off-chip; 230GB/s, 48GB/s

IBM RISC chips, since Power 4 (2001) ...

36Thursday, April 3, 14


UC Regents Spring 2014 © UCBCS 152 L19: Dynamic Scheduling II© 2013 International Business Machines Corporation 6

POWER8 core (floorplan units labeled in the figure: VSU, FXU, IFU, DFU, ISU, LSU)

Larger Caching Structures vs. POWER7
• 2x L1 data cache (64 KB)
• 2x outstanding data cache misses
• 4x translation cache

Wider Load/Store
• 32B → 64B L2-to-L1 data bus
• 2x data cache to execution dataflow

Enhanced Prefetch
• Instruction speculation awareness
• Data prefetch depth awareness
• Adaptive bandwidth awareness
• Topology awareness

Execution Improvement vs. POWER7
• SMT4 → SMT8
• 8 dispatch, 10 issue
• 16 execution pipes: 2 FXU, 2 LSU, 2 LU, 4 FPU, 2 VMX, 1 Crypto, 1 DFU, 1 CR, 1 BR
• Larger issue queues (4 x 16-entry)
• Larger global completion and load/store reorder
• Improved branch prediction
• Improved unaligned storage access

Core performance vs. POWER7: ~1.6x single thread, ~2x max SMT

37Thursday, April 3, 14


UC Regents Spring 2014 © UCBCS 152 L18: Dynamic Scheduling I

Recap: Dynamic Scheduling

Three big ideas: register renaming, data-driven detection of RAW resolution, bus-based architecture.

Has saved architectures that have a small number of registers: IBM 360 floating-point ISA, Intel x86 ISA.

Very complex, but enables many things: out-of-order execution, multiple issue, loop unrolling, etc.
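A toy register-renaming sketch in Python (not any particular machine's implementation) showing the first big idea: every write gets a fresh physical register, so WAR and WAW hazards vanish and only true RAW dependences remain visible to the scheduler.

# Toy register renamer: architectural registers r0..r7, a larger physical file.
free_list = [f"p{i}" for i in range(16)]            # free physical registers
rat = {f"r{i}": free_list.pop(0) for i in range(8)} # rename (alias) table

def rename(op, dst, src1, src2):
    """Rewrite one instruction's registers; returns the renamed instruction."""
    s1, s2 = rat[src1], rat[src2]        # sources read the current mapping (RAW preserved)
    new_dst = free_list.pop(0)           # fresh physical register for the destination
    rat[dst] = new_dst                   # later readers of dst see the new name
    return (op, new_dst, s1, s2)

# A WAW/WAR-laden sequence: both writes to r1 get different physical registers,
# so the second write no longer waits on readers of the first.
program = [("mul", "r1", "r2", "r3"),
           ("add", "r4", "r1", "r5"),
           ("sub", "r1", "r6", "r7")]
for insn in program:
    print(rename(*insn))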

38Thursday, April 3, 14


On Tuesday

Epilogue ...

Have a good weekend!

39Thursday, April 3, 14