UC Regents Spring 2014 © UCB. CS 152 L18: Dynamic Scheduling I
2014-4-3 John Lazzaro
(not a prof - “John” is always OK)
CS 152: Computer Architecture and Engineering
www-inst.eecs.berkeley.edu/~cs152/
TA: Eric Love
Lecture 19 -- Dynamic Scheduling II
Thursday, April 3, 2014
Case studies of dynamic execution
DEC Alpha 21264: High performance from a relatively simple implementation of a modern instruction set.
IBM Power: Evolving dynamic designs over many generations.
Simultaneous Multi-threading: Adapting multi-threading to dynamic scheduling.
Short Break
DEC Alpha
21164: 4-issue, in-order design.
21264: 4-issue, out-of-order design.
The 21264 was 50% to 200% faster in real-world applications.
500 MHz 0.5µ parts for the in-order 21164 and the out-of-order 21264.
Similarly-sized on-chip caches (116K vs. 128K). The in-order 21164 has a larger off-chip cache.
The 21264 has 55% more transistors than the 21164. The die is 44% larger.
The 21264 has a 1.7x advantage on integer code, and a 2.7x advantage on floating-point code.
The 21264 consumes 46% more power than the 21164.
Alpha microprocessors have been performance leaders since their introduction in 1992. The first generation 21064 and the later 21164 [1,2] raised expectations for the newest generation—performance leadership was again a goal of the 21264 design team. Benchmark scores of 30+ SPECint95 and 58+ SPECfp95 offer convincing evidence thus far that the 21264 achieves this goal and will continue to set a high performance standard.

A unique combination of high clock speeds and advanced microarchitectural techniques, including many forms of out-of-order and speculative execution, provide exceptional core computational performance in the 21264. The processor also features a high-bandwidth memory system that can quickly deliver data values to the execution core, providing robust performance for a wide range of applications, including those without cache locality. The advanced performance levels are attained while maintaining an installed application base. All Alpha generations are upward-compatible. Database, real-time visual computing, data mining, medical imaging, scientific/technical, and many other applications can utilize the outstanding performance available with the 21264.
Architecture highlights

The 21264 is a superscalar microprocessor that can fetch and execute up to four instructions per cycle. It also features out-of-order execution [3,4]. With this, instructions execute as soon as possible and in parallel with other nondependent work, which results in faster execution because critical-path computations start and complete quickly.

The processor also employs speculative execution to maximize performance. It speculatively fetches and executes instructions even though it may not know immediately whether the instructions will be on the final execution path. This is particularly useful, for instance, when the 21264 predicts branch directions and speculatively executes down the predicted path.

Sophisticated branch prediction, coupled with speculative and dynamic execution, extracts instruction parallelism from applications. With more functional units and these dynamic execution techniques, the processor is 50% to 200% faster than its 21164 predecessor for many applications, even though both generations can fetch at most four instructions per cycle [5].

The 21264’s memory system also enables high performance levels. On-chip and off-chip caches provide for very low latency data access. Additionally, the 21264 can service many parallel memory references to all caches in the hierarchy, as well as to the off-chip memory system. This permits very high bandwidth data access [6]. For example, the processor can sustain more than 1.3 GBytes/sec on the Stream benchmark [7].

The microprocessor’s cycle time is 500 to 600 MHz, implemented by 15 million transistors in a 2.2-V, 0.35-micron CMOS process with six metal layers. The 3.1 cm² processor
R. E. Kessler, Compaq Computer Corporation

THE ALPHA 21264 OWES ITS HIGH PERFORMANCE TO HIGH CLOCK SPEED, MANY FORMS OF OUT-OF-ORDER AND SPECULATIVE EXECUTION, AND A HIGH-BANDWIDTH MEMORY SYSTEM.

0272-1732/99/$10.00 © 1999 IEEE

THE ALPHA 21264 MICROPROCESSOR
The Real Difference: Speculation

If the ability to recover from mis-speculation is built into an implementation ... it offers the option to add speculative features to all parts of the design.
GRONOWSKI et al.: HIGH-PERFORMANCE MICROPROCESSOR DESIGN
Fig. 2. 21064 die photo.
Fig. 3. 21164 die photo.
II. ARCHITECTURE

The Alpha instruction set architecture is a true 64-bit load/store RISC architecture designed with emphasis on high clock speed and multiple instruction issue [4]. Fixed-length instructions, minimal instruction ordering constraints, and 64-bit data manipulation allow for straightforward instruction decode and a clean microarchitectural design. The architecture does not contain condition codes, branch delay slots, adaptations from existing 32-bit architectures, and other bits of architectural history that can add complexity. The chip organization for each generation was carefully chosen to gain the most advantage from microarchitectural features while maintaining the ability to meet critical circuit paths.

Fig. 4. 21264 die photo.

The 21064 is a fully pipelined in-order execution machine capable of issuing two instructions per clock cycle. It contains one pipelined integer execution unit and one pipelined floating-point execution unit. Integer instruction latency is one or two cycles, except for multiplies, which are not pipelined. Floating-point instruction latency is six cycles for all instructions except for divides. The chip includes an 8-kB instruction cache and an 8-kB data cache. The emphasis of this design was to gain performance through clock rate while keeping the architecture relatively simple. Subsequent designs rely more heavily on aggressive architectural enhancements to further increase performance.

The quad-issue, in-order execution implementation of the 21164 was more complex than the 21064, but simpler than an out-of-order execution implementation [5]. It contains two pipelined integer execution units and two pipelined floating-point execution units. The first-level cache was changed to nonblocking. A second-level 96-kB unified cache was added on-chip to improve memory latency without adding excessive complexity. Integer latency was reduced to one cycle for all instructions, and was roughly halved for all MUL instructions. The floating-point unit contains separate add and multiply pipelines, each with a four-cycle latency [6]. Floating-point divide latency is reduced by 50%.

The trend of increased architectural complexity continues with Digital’s latest Alpha microprocessor. The 21264 gains
[21264 die photo, annotated: FP pipe, two integer pipes, OoO control blocks, I-cache, data cache, fetch and predict]

Separate OoO control for integer and floating point.
RISC decode happens in OoO blocks.
Unlabeled areas devoted to memory system control.
comes in a 587-pin PGA package. It can execute up to 2.4 billion instructions per second.

Figure 1 shows a photo of the 21264, highlighting major sections. Figure 2 is a high-level overview of the 21264 pipeline, which has seven stages, similar to the earlier in-order 21164. One notable addition is the map stage that renames registers to expose instruction parallelism—this addition is fundamental to the 21264’s out-of-order techniques.

Instruction pipeline—Fetch

The instruction pipeline begins with the fetch stage, which delivers four instructions to the out-of-order execution engine each cycle. The processor speculatively fetches through line, branch, or jump predictions. Since the predictions are usually accurate, this instruction fetch implementation typically supplies a continuous stream of good-path instructions to keep the functional units busy with useful work.

Two architectural techniques increase fetch efficiency: line and way prediction, and branch prediction. A 64-Kbyte, two-way set-associative instruction cache offers much-improved level-one hit rates compared to the 8-Kbyte, direct-mapped instruction cache in the Alpha 21164.
Line and way prediction

The processor implements a line and way prediction technique that combines the advantages of set-associative behavior and fetch bubble elimination, together with the fast access time of a direct-mapped cache. Figure 3 (next page) shows the technique’s main features. Each four-instruction fetch block includes a line and way prediction. This prediction indicates where to fetch the next block of four instructions, including which way—that is, which of the two choices allowed by the two-way associative cache.

The processor reads out the next instructions using the prediction (via the wraparound path in Figure 3) while, in parallel, it completes the validity check for the previous instructions. Note that the address paths needing extra logic levels—instruction decode, branch prediction, and cache tag comparison—are outside the critical fetch loop.
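The wraparound fetch loop described here can be sketched in software. This is a minimal model, not the actual hardware; the `FetchBlock` class and `fetch_stream` helper are invented for illustration. Each fetch block carries a guess of the next (line, way), so the next access proceeds like a direct-mapped lookup while the previous block's tag check completes off the critical loop.

```python
class FetchBlock:
    """A 4-instruction cache block plus its next-fetch prediction."""
    def __init__(self, instructions, next_line, next_way):
        self.instructions = instructions   # four instructions
        self.next_line = next_line         # predicted next cache line index
        self.next_way = next_way           # predicted way (0 or 1)

def fetch_stream(cache, start_line, start_way, n_blocks):
    """Follow line/way predictions for n_blocks fetches.

    cache maps (line, way) -> FetchBlock. Returns the fetched
    instructions, in order.
    """
    fetched = []
    line, way = start_line, start_way
    for _ in range(n_blocks):
        block = cache[(line, way)]
        fetched.extend(block.instructions)
        # Follow the predicted pointer immediately; in the real design the
        # validity check for this block happens in parallel, off this loop.
        line, way = block.next_line, block.next_way
    return fetched
```

When the predictions are correct, this loop delivers four instructions per iteration with no stalls, which is the point of the technique.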
The processor loads the line and way predictors on an instruction cache fill, and
MARCH–APRIL 1999
Figure 1. Alpha 21264 microprocessor die photo, showing the floating-point units, float map and queue, instruction fetch, bus interface unit (BIU), memory controllers, data and control buses, data cache, instruction cache, integer queue, integer mapper, and integer units (clusters 0 and 1).
Figure 2. Stages of the Alpha 21264 instruction pipeline: Fetch (0), Slot (1), Rename (2), Issue (3), Register read (4), Execute (5), Memory (6). Structures shown include the branch predictor, line/set prediction, instruction cache (64 Kbytes, two-way), integer register rename, integer issue queue (20 entries), two integer register files (80), four integer execution units, floating-point register rename, floating-point issue queue (15), floating-point register file (72), floating-point add and multiply execution units, data cache (64 Kbytes, two-way), and the level-two cache and system interface.
21264 pipeline diagram: the Rename and Issue stages are the primary locations of dynamic scheduling logic. Load/store disambiguation support resides in the Memory stage.

Slot: absorbs the delay of the long path on the last slide.
Fetch stage close-up:
dynamically retrains them when they are in error. Most mispredictions cost a single cycle. The line and way predictors are correct 85% to 100% of the time for most applications, so training is infrequent. As an additional precaution, a 2-bit hysteresis counter associated with each fetch block eliminates overtraining—training occurs only when the current prediction has been in error multiple times. Line and way prediction is an important speed enhancement since the mispredict cost is low and line/way mispredictions are rare.

Beyond the speed benefits of direct cache access, line and way prediction has other benefits. For example, frequently encountered predictable branches, such as loop terminators, avoid the mis-fetch penalty often associated with a taken branch. The processor also trains the line predictor with the address of jumps and subroutine calls that use direct register addressing. Code using dynamically linked library routines will thus benefit after the line predictor is trained with the target. This is important since the pipeline delays required to calculate the indirect (subroutine) jump address are eight cycles or more.
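The hysteresis-guarded retraining can be modeled as follows. This is an illustrative sketch: the counter width matches the 2-bit counter in the text, but the initialization and thresholds are assumptions, not the 21264's exact policy.

```python
class LinePredictorEntry:
    """One fetch block's next-fetch prediction with a 2-bit hysteresis
    counter, so a single stray fetch path cannot overwrite a good
    prediction (retraining needs repeated mispredictions)."""

    def __init__(self, target):
        self.target = target     # predicted next (line, way), or any token
        self.hysteresis = 3      # 2-bit counter, saturates at 3

    def update(self, actual_target):
        if self.target == actual_target:
            # Correct prediction: strengthen confidence.
            self.hysteresis = min(3, self.hysteresis + 1)
        else:
            # Misprediction: weaken; retrain only after repeated errors.
            self.hysteresis -= 1
            if self.hysteresis <= 0:
                self.target = actual_target
                self.hysteresis = 3
```

A single misprediction therefore leaves the stored target untouched; only a run of mispredictions flips it.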
An instruction cache miss forces the instruction fetch engine to check the level-two (L2) cache or system memory for the necessary instructions. The fetch engine prefetches up to four 64-byte (or 16-instruction) cache lines to tolerate the additional latency. The result is very high bandwidth instruction fetch, even when the instructions are not found in the instruction cache. For instance, the processor can saturate the available L2 cache bandwidth with instruction prefetches.
Branch prediction

Branch prediction is more important to the 21264’s efficiency than to previous microprocessors for several reasons. First, the seven-cycle mispredict cost is slightly higher than previous generations. Second, the instruction execution engine is faster than in previous generations. Finally, successful branch prediction can utilize the processor’s speculative execution capabilities. Good branch prediction avoids the costs of mispredicts and capitalizes on the most opportunities to find parallelism. The 21164 could accept 20 in-flight instructions at most, but the 21264 can accept 80, offering many more parallelism opportunities.

The 21264 implements a sophisticated tournament branch prediction scheme. The scheme dynamically chooses between two types of branch predictors—one using local history, and one using global history—to predict the direction of a given branch [8]. The result is a tournament branch predictor with better prediction accuracy than larger tables of either individual method, with a 90% to 100% success rate on most simulated applications/benchmarks. Together, local and global correlation techniques minimize branch mispredicts. The processor adapts to dynamically choose the best method for each branch.

Figure 4, in detailing the structure of the tournament branch predictor, shows the local-history prediction path—through a two-level structure—on the left. The first level holds 10 bits of branch pattern history for up to 1,024 branches. This 10-bit pattern picks from one of 1,024 prediction counters. The global predictor is a 4,096-entry table of 2-bit saturating counters indexed by the path, or global, history of the last 12 branches. The choice prediction, or chooser, is also a 4,096-entry table of 2-bit prediction counters indexed by the path history. The “Local and global branch predictors” box describes these techniques in more detail.
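A software model of the tournament structure just described might look like this. Table sizes follow the text; the indexing and update policies are simplified assumptions, not the 21264's exact logic. Branch outcomes are 1 (taken) or 0 (not taken).

```python
class TournamentPredictor:
    """Sketch of a tournament predictor: local two-level path, global
    path, and a chooser trained toward whichever path was right."""

    def __init__(self):
        self.local_history = [0] * 1024   # 10-bit pattern per branch slot
        self.local_ctr = [4] * 1024       # 3-bit counters (0..7), >=4 taken
        self.global_ctr = [2] * 4096      # 2-bit counters (0..3), >=2 taken
        self.choice_ctr = [2] * 4096      # 2-bit chooser, >=2 -> use global
        self.path_history = 0             # outcomes of the last 12 branches

    def predict(self, pc):
        pattern = self.local_history[pc % 1024]
        local_taken = self.local_ctr[pattern] >= 4
        global_taken = self.global_ctr[self.path_history] >= 2
        use_global = self.choice_ctr[self.path_history] >= 2
        return global_taken if use_global else local_taken

    def update(self, pc, taken):
        lidx = pc % 1024
        pattern = self.local_history[lidx]
        local_taken = self.local_ctr[pattern] >= 4
        global_taken = self.global_ctr[self.path_history] >= 2
        # Train the chooser only when the two predictors disagree.
        if local_taken != global_taken:
            c = self.choice_ctr[self.path_history]
            self.choice_ctr[self.path_history] = (
                min(3, c + 1) if global_taken == taken else max(0, c - 1))
        # Train both counter tables toward the true outcome.
        l = self.local_ctr[pattern]
        self.local_ctr[pattern] = min(7, l + 1) if taken else max(0, l - 1)
        g = self.global_ctr[self.path_history]
        self.global_ctr[self.path_history] = (
            min(3, g + 1) if taken else max(0, g - 1))
        # Shift the true direction into both histories.
        self.local_history[lidx] = ((pattern << 1) | taken) & 0x3FF
        self.path_history = ((self.path_history << 1) | taken) & 0xFFF
```

The chooser-on-disagreement update is what lets the predictor pick, per branch, whichever of the two methods has been more accurate.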
The processor inserts the true branch direction in the local-history table once branches
[Slide annotations on Figure 3: “Learn dynamic jumps,” “No branch penalty,” “Set associativity”]

Figure 3. Alpha 21264 instruction fetch. The line and way prediction (wraparound path on the right side) provides a fast instruction fetch path that avoids common fetch stalls when the predictions are correct. Structures shown: program counter (PC) generation; instruction decode, branch prediction, and validity check; tags for way 0 and way 1 with comparators producing hit/miss/way miss; cached instructions with their line and way predictions; and the next line plus way feeding four instructions out.
Each cache line stores predictions of the next line, and the cache way, to be fetched. If predictions are correct, the fetcher maintains the required 4 instructions/cycle pace. The fetch is speculative.
Rename stage close-up:
(1) Allocates new physical registers for destinations.
(2) Looks up physical register numbers for sources.
(3) Handles rename dependences within the 4 issuing instructions in one clock cycle!
issue and retire. It also trains the correct predictions by updating the referenced local, global, and choice counters at that time. The processor maintains path history with a silo of 12 branch predictions. This silo is speculatively updated before a branch retires and is backed up on a mispredict.
Out-of-order execution

The 21264 offers out-of-order efficiencies with higher clock speeds than competing designs, yet this speed does not restrict the microprocessor’s dynamic execution capabilities. The out-of-order execution logic receives four fetched instructions every cycle, renames/remaps the registers to avoid unnecessary register dependencies, and queues the instructions until operands or functional units become available. It dynamically issues up to six instructions every cycle—four integer instructions and two floating-point instructions. It also provides an in-order execution model to the programmer via in-order instruction retire.
Register renaming

Register renaming exposes application instruction parallelism since it eliminates unnecessary dependencies and allows speculative execution. Register renaming associates a unique storage location with each write-reference to a register. The 21264 speculatively allocates a register to each instruction with a register result. The register only becomes part of the user-visible (architectural) register state when the instruction retires/commits. This lets the instruction speculatively issue and deposit its result into the register file before the instruction retires. Register renaming also eliminates write-after-write and write-after-read register dependencies, but preserves all the read-after-write register dependencies that are necessary for correct computation.

The left side of Figure 5 depicts the map, or register rename, stage in more detail. The processor maintains storage with each internal register indicating the user-visible register that is currently associated with the given internal register (if any). Thus, register renaming is a content-addressable memory (CAM) operation for register sources together with a register allocation for the destination register. All pipeline stages subsequent to the register map stage operate on internal registers rather than user-visible registers.

Beyond the 31 integer and 31 floating-point user-visible (non-speculative) registers, an additional 41 integer and 41 floating-point registers are available to hold speculative results prior to instruction retirement. The register mapper stores the register map state for each in-flight instruction so that the machine architectural state can be restored in case a misspeculation occurs.
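A toy model of map-stage renaming with per-instruction map checkpoints: the 31 architectural / 72 total register counts follow the text, but the data structures are invented for illustration, and freeing of registers allocated on the wrong path is omitted for brevity.

```python
class RenameMap:
    """Sketch: sources read the current architectural->physical map
    (a CAM lookup in hardware), each destination gets a fresh physical
    register, and the whole map is snapshotted per instruction so a
    misspeculation can restore architectural state."""

    def __init__(self, n_arch=31, n_phys=72):
        self.map = {r: r for r in range(n_arch)}   # arch -> phys
        self.free = list(range(n_arch, n_phys))    # unused physical regs
        self.checkpoints = []                      # saved map state per inst

    def rename(self, dest, sources):
        phys_sources = [self.map[s] for s in sources]  # lookup sources
        phys_dest = self.free.pop(0)                   # allocate destination
        self.map[dest] = phys_dest
        self.checkpoints.append(dict(self.map))        # for recovery
        return phys_dest, phys_sources

    def recover(self, checkpoint_index):
        # Restore the map saved with the redirecting instruction and
        # discard the younger (wrong-path) checkpoints.
        self.map = dict(self.checkpoints[checkpoint_index])
        del self.checkpoints[checkpoint_index + 1:]
```

Note how a second write to the same architectural register gets a new physical register while the old one still holds the prior value: that is exactly how write-after-write hazards disappear.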
The Alpha conditional-move instructions must be handled specially by the map stage. These operations conditionally move one of two source registers into a destination register. This makes conditional move the only instruction in the Alpha architecture that requires three register sources—the two
Figure 4. Block diagram of the 21264 tournament branch predictor. The local history prediction path (local history table, 1,024 × 10; local prediction, 1,024 × 3) is on the left; the global history prediction path (global prediction, 4,096 × 2) and the chooser (choice prediction, 4,096 × 2), both indexed by the path history, are on the right.
Figure 5. Block diagram of the 21264’s map (register rename) and queue stages. The map stage (map content-addressable memories plus saved map state) renames programmer-visible register numbers to internal register numbers (72–80 internal registers). The queue stage (queue entries, register scoreboard, and arbiter, with request/grant signaling over 80 in-flight instructions) stores instructions until they are ready to issue. These structures are duplicated for integer and floating-point execution.
Input: 4 instructions specifying architected registers.
Output: 12 physical register numbers: 1 destination and 2 sources for each of the 4 instructions to be issued.
Saved map state: for mis-speculation recovery. Time-stamped.
Recall: malloc() -- free() in hardware
• multimedia programs using MMX™ instructions (MM - 8),
• games, e.g. Quake (GAM - 5),
• programs written in JAVA (JAV - 5),
• some TPC benchmarks (TPC - 3),
• common programs running on NT, e.g. Word, Excel (NT - 8),
• and common programs running on Windows 95 (W95 - 8).
In this paper, we focus more on statistical results which emphasize the motivations for the various suggestions. Performance results are only briefly presented, mainly to give a flavor of the potential benefit. Performance results are de-emphasized, as actual performance benefits are highly dependent on the implementation and may vary a lot. Choosing an arbitrary configuration (whether current or futuristic) may give biased results of questionable significance.
2 Advanced Register Renaming

2.1 Current Register Dependency-Tracking and Renaming Techniques

Modern processors exploit out-of-order execution to speed up processing time. Out-of-order execution involves a mechanism called register renaming in which the processor maps logical registers into physical locations. Register renaming is used to remove register anti-dependencies and output-dependencies and to recover from control speculation. The basic register renaming mechanism is well known and widely used (e.g. Intel® Pentium® Pro Processor [Inte96]). This section presents the most advanced combined register renaming and dependency-tracking scheme involving three structures: a Free List (FL), a Register Alias Table (RAT), and an Active List (AL). This scheme has been used in the MIPS R10000 and DEC 21264.
The RAT maintains the latest mapping² for each logical register. The RAT is indexed by the source logical registers, and provides the mappings to the corresponding physical registers (dependency-tracking). For each logical destination register specified by the renamed instructions, the allocator (renamer) provides an unused physical register from the FL. The RAT is updated with these new mappings. Physical registers can be reclaimed once they cannot be referenced anymore. Once a logical register is renamed, all subsequent instructions can only access the new mapping; i.e. they cannot read the physical register previously mapped. Thus, an appropriate and straightforward condition for register reclaiming is to reclaim a physical register only when the instruction that evicted it from the RAT retires. As a result, whenever a new mapping updates the RAT, the evicted old mapping is pushed into the AL (an AL entry is provided to each instruction). When an instruction retires, the physical register of the old mapping recorded in the AL, if any, is reclaimed and pushed into the FL. This cycle is depicted in Figure 1.
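The FL/RAT/AL allocate-and-reclaim cycle can be sketched as follows. This is a simplified, one-instruction-at-a-time model (no superscalar grouping); the structure names follow the paper, the sizes are arbitrary.

```python
from collections import deque

class Renamer:
    """FL/RAT/AL scheme: rename pops a physical register from the Free
    List and pushes the evicted old mapping into the Active List;
    in-order retirement reclaims the evicted register."""

    def __init__(self, n_logical=8, n_physical=16):
        self.rat = {r: r for r in range(n_logical)}    # Register Alias Table
        self.fl = deque(range(n_logical, n_physical))  # Free List
        self.al = deque()                              # Active List (FIFO)

    def rename(self, dest):
        new_phys = self.fl.popleft()       # allocate from the FL
        evicted = self.rat[dest]           # old mapping leaves the RAT
        self.rat[dest] = new_phys
        self.al.append(evicted)            # reclaimed only at retirement
        return new_phys

    def retire(self):
        # Oldest instruction retires: the register it evicted from the
        # RAT can no longer be referenced, so push it back onto the FL.
        self.fl.append(self.al.popleft())
```

The key invariant is that a physical register returns to the FL only when the instruction that evicted its mapping retires, so no in-flight instruction can still be reading it.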
² Throughout the paper, we refer to a mapping as the pairing of a logical register with a physical register it maps to.
Figure 1. Register Renaming. The cycle: Instruction Renaming triggers Register Allocation from the FREE LIST; the REGISTER ALIAS TABLE records the logical-to-physical mapping (# logical register to # physical register); Instruction Retirement triggers Register Reclaiming, returning the evicted physical register (held in the ACTIVE LIST) to the FREE LIST.
2.2 Physical Register Reuse

Motivation. Most instructions operate on several source operands and generate results. These results are recorded into local physical registers allocated for each instruction, so that dependent instructions can operate on them. The range of generated values is usually limited. Indeed, integer results are often pretty small and the same value may be generated several times by different instructions currently in the instruction window. A perfect example is Boolean values such as control-flow conditions. Figure 2 shows the percentage of computed values that match one of the values generated by preceding instructions according to the number of prior instructions scanned (16, 32, 64, 128, or 256 instructions). Note that the scanned values were not filtered for duplicate results so they may also exhibit a high level of redundancy. Results are highlighted for four of the SpecInt95 benchmarks, and confirm our claim for programs compiled to run on an IA-32 processor.
Figure 2. Number of Identical Results. Percentage of computed values (0% to 100%) matching a prior result for compress 95, xlisp 95, go 95, and ijpeg 95, scanning the previous 16, 32, 64, 128, or 256 instructions.
Concept. Physical registers hold values that are part of the architectural states currently alive in the machine. A physical register is allocated for every result regardless of its value. However, there is no reason to allocate separate physical registers when they maintain the same value. This paper proposes to reuse a physical register whenever we detect that an incoming result value matches a previous one. Physical Register Reuse relies on a Value-Identity Detection hardware to perform the detection prior to register renaming. The detector outcome can be either safe or speculative. By mapping several logical
0-8186-8609-X/98 $10.00 © 1998 IEEE
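A minimal sketch of the proposed reuse, assuming an idealized value-identity detector: the paper leaves the detector design open, so this table-based version (and all the names in it) is invented purely for illustration.

```python
class ReusingRenamer:
    """Allocate a physical register for a result value, but if some
    physical register already holds an identical value, map the logical
    destination to that register instead of consuming a new one."""

    def __init__(self, n_physical=16):
        self.free = list(range(n_physical))
        self.value_of = {}    # phys reg -> value it holds
        self.reg_with = {}    # value -> phys reg (value-identity table)

    def allocate(self, logical_dest, value):
        """Return (phys_reg, reused) for a result with a known value."""
        if value in self.reg_with:              # identical result detected
            return self.reg_with[value], True   # share the existing register
        phys = self.free.pop(0)                 # otherwise allocate normally
        self.value_of[phys] = value
        self.reg_with[value] = phys
        return phys, False
```

A real design would also need safe reclaiming of shared registers (a register now has multiple logical mappings), which this sketch deliberately leaves out.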
The record-keeping shown in this diagram occurs in the rename stage.
comes in a 587-pin PGA package. It can exe-cute up to 2.4 billion instructions per second.
Figure 1 shows a photo of the 21264, high-lighting major sections. Figure 2 is a high-leveloverview of the 21264 pipeline, which hasseven stages, similar to the earlier in-order21164. One notable addition is the map stagethat renames registers to expose instructionparallelism—this addition is fundamental tothe 21264’s out-of-order techniques.
Instruction pipeline—FetchThe instruction pipeline begins with the
fetch stage, which delivers four instructionsto the out-of-order execution engine eachcycle. The processor speculatively fetchesthrough line, branch, or jump predictions.Since the predictions are usually accurate, thisinstruction fetch implementation typicallysupplies a continuous stream of good-pathinstructions to keep the functional units busywith useful work.
Two architectural techniques increase fetchefficiency: line and way prediction, andbranch prediction. A 64-Kbyte, two-way set-associative instruction cache offers much-improved level-one hit rates compared to the8-Kbyte, direct-mapped instruction cache inthe Alpha 21164.
Line and way predictionThe processor implements
a line and way prediction tech-nique that combines theadvantages of set-associativebehavior and fetch bubbleelimination, together with thefast access time of a direct-mapped cache. Figure 3 (nextpage) shows the technique’smain features. Each four-instruction fetch blockincludes a line and way pre-diction. This prediction indi-cates where to fetch the nextblock of four instructions,including which way—that is,which of the two choicesallowed by two-way associative cache.
The processor reads out the next instruc-tions using the prediction (via the wraparoundpath in Figure 3) while, in parallel, it completesthe validity check for the previous instruc-tions. Note that the address paths needingextra logic levels—instruction decode, branchprediction, and cache tag comparison—areoutside the critical fetch loop.
The processor loads the line and way predictors on an instruction cache fill, and
MARCH–APRIL 1999
Figure 1. Alpha 21264 microprocessor die photo, highlighting the major sections: instruction fetch, floating-point units, float map and queue, integer units (clusters 0 and 1), integer mapper and queue, instruction and data caches, memory controllers, data and control buses, and the bus interface unit (BIU).
Figure 2. Stages of the Alpha 21264 instruction pipeline: fetch (0), slot (1), rename (2), issue (3), register read (4), execute (5), and memory (6). The diagram shows the branch predictor and line/set prediction feeding the instruction cache (64 Kbytes, two-way); integer and floating-point register rename; the integer issue queue (20 entries) driving four integer execution pipes backed by two 80-entry integer register files; the floating-point issue queue (15 entries) driving floating-point add and multiply execution backed by a 72-entry register file; and the data cache (64 Kbytes, two-way) with the level-two cache and system interface.
Issue stage close-up:
Input: 4 just-issued instructions, renamed to use physical registers.
(1) Newly issued instructions placed in top of queue.
(2) Instructions check scoreboard: are 2 sources ready?
(3) Arbiter selects 4 oldest "ready" instructions.
(4) Update removes these 4 from queue.
Output: the 4 oldest instructions whose 2 source registers are ready for use.
producer/consumer relationships within the four instructions) are combined to assign either previously allocated registers or the registers supplied by the free register generator to the source specifiers.
The resulting four register maps are saved in the map silo, which, in turn, provides information to the free register generator as to which registers are currently allocated. Finally, the last map that was created is used as the initial map for the next cycle. In the event of a branch mispredict or trap, the CAM is restored with the map silo entry associated with the redirecting instruction.
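The mapping and recovery scheme in the last two paragraphs can be sketched as follows. This is a simplified model with invented names; the real mapper performs all four renames and the silo write in a single cycle. Sources are looked up in the current map (the CAM operation), the destination gets a free physical register, and a per-instruction snapshot lets a mispredict or trap restore the map.

```python
# Minimal register-rename sketch (illustrative, not the 21264's RTL).
class RenameMap:
    def __init__(self, n_arch=32, n_phys=80):
        self.map = {r: r for r in range(n_arch)}   # arch -> physical
        self.free = list(range(n_arch, n_phys))    # free register generator
        self.silo = []                             # per-instruction map silo

    def rename(self, srcs, dest):
        phys_srcs = [self.map[s] for s in srcs]    # CAM lookup for sources
        phys_dest = self.free.pop(0)               # speculative allocation
        self.map[dest] = phys_dest
        self.silo.append(dict(self.map))           # saved map state
        return phys_srcs, phys_dest

    def recover(self, index):
        # Restore the map siloed with the redirecting instruction.
        self.map = dict(self.silo[index])
        del self.silo[index + 1:]
```

Note how a second write to the same architectural register gets a fresh physical register, which is exactly how write-after-write hazards disappear.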
Issue Queue
Each cycle, up to four instructions are loaded into the two issue queues. The floating point queue (Fqueue) chooses two instructions from a 15-entry window and issues them to the two floating point pipelines. The integer queue (Iqueue) chooses four instructions from a 20-entry window and issues them to the four integer pipelines (figure 3).
Figure 3. Integer issue queue and Ebox (two execute/register-file clusters, 80 registers each).
The fundamental circuit loop in the Iqueue is the path in which a single-cycle producing instruction is granted (issued) at the end of one issue cycle and a consuming instruction requests to be issued at the beginning of the next cycle (e.g., instructions (0) and (2) in the mapper example). The grant must be communicated to all newer consumer instructions in the issue window.
The issue queue maintains a register scoreboard, based on physical register number, and tracks the progress of multiple-cycle (e.g. integer multiply) and variable-cycle (e.g. memory load) instructions. When arithmetic result data or load data is available for bypass, the scoreboard unit notifies all instructions in the queue.
The queue arbitration cycle works as follows:
1. New instructions are loaded into the "top" of the queue.
2. Register scoreboard information is communicated to all queue entries.
3. Instructions that are data-ready request issue.
4. A set of issue arbiters search the queue from "bottom" to "top", selecting instructions that are data-ready in an age-prioritized order and skipping over instructions that are not data-ready.
5. Selected instructions are broadcast to the functional units.
In the next cycle a queue-update mechanism calculates which queue entries are available for future instructions and squashes issued instructions out of the queue. Instructions that are still resident in the queue shift towards the bottom.
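The arbitration steps above collapse into a single function in this behavioral sketch (one arbiter instead of the real upper/lower pair, with age encoded by list position, oldest first):

```python
# Age-prioritized issue arbitration sketch (simplified: one arbiter).
def arbitrate(queue, ready_regs, width=4):
    """queue: oldest-first list of (srcs, dest); returns issued entries."""
    issued, keep = [], []
    for srcs, dest in queue:                 # search oldest ("bottom") first
        if len(issued) < width and all(s in ready_regs for s in srcs):
            issued.append((srcs, dest))      # data-ready and a slot: grant
        else:
            keep.append((srcs, dest))        # not ready or no slot: skip over
    queue[:] = keep       # squash issued entries; survivors shift toward bottom
    return issued
```

Skipped-over entries stay in place, so a not-ready old instruction does not block a ready younger one, matching step 4 above.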
Ebox and Fbox
The Ebox functional unit organization was designed around a fast execute-bypass cycle. In order to reduce the impact of the large number of register ports required for a quad-issue CPU and to limit the effect on cycle time of long bypass busses between the functional units, the Ebox was organized around two clusters (see figure 3). Each cluster contains two functional units, an 80-entry register file, and result busses to/from the other cluster. The lower two functional units contain one-cycle adders and logical units; the upper two contain adders, logic units, and shifters. One upper functional unit contains a 7-cycle, fully pipelined multiplier; the other contains a 3-cycle motion video pipeline, which implements motion estimation, threshold, and pixel compaction/expansion functions. The two integer unit clusters have equal capability to execute most instructions (integer multiply, motion video, and some special-purpose instructions can only be executed in one cluster).
The execute pipeline operation proceeds as follows:
- Stage 3: Instructions are issued to both clusters.
- Stage 4: Register files are read.
- Stage 5: Execution (may be multiple cycles).
- Stage 6: Results are written to the register file of the cluster in which execution is performed and are bypassed into the next execution stage within the cluster.
- Stage 7: Results are written to the cross-cluster register file and are bypassed into the next execution stage in the other cluster.
issue and retire. It also trains the correct predictions by updating the referenced local, global, and choice counters at that time. The processor maintains path history with a silo of 12 branch predictions. This silo is speculatively updated before a branch retires and is backed up on a mispredict.
Out-of-order execution
The 21264 offers out-of-order efficiencies with higher clock speeds than competing designs, yet this speed does not restrict the microprocessor's dynamic execution capabilities. The out-of-order execution logic receives four fetched instructions every cycle, renames/remaps the registers to avoid unnecessary register dependencies, and queues the instructions until operands or functional units become available. It dynamically issues up to six instructions every cycle: four integer instructions and two floating-point instructions. It also provides an in-order execution model to the programmer via in-order instruction retire.
Register renaming
Register renaming exposes application instruction parallelism since it eliminates unnecessary dependencies and allows speculative execution. Register renaming assigns a unique storage location with each write-reference to a register. The 21264 speculatively allocates a register to each instruction with a register result. The register only becomes part of the user-visible (architectural) register state when the instruction retires/commits. This lets the instruction speculatively issue and deposit its result into the register file before the instruction retires. Register renaming also eliminates write-after-write and write-after-read register dependencies, but preserves all the read-after-write register dependencies that are necessary for correct computation.
The left side of Figure 5 depicts the map, or register rename, stage in more detail. The processor maintains storage with each internal register indicating the user-visible register that is currently associated with the given internal register (if any). Thus, register renaming is a content-addressable memory (CAM) operation for register sources together with a register allocation for the destination register. All pipeline stages subsequent to the register map stage operate on internal registers rather than user-visible registers.
Beyond the 31 integer and 31 floating-point user-visible (non-speculative) registers, an additional 41 integer and 41 floating-point registers are available to hold speculative results prior to instruction retirement. The register mapper stores the register map state for each in-flight instruction so that the machine architectural state can be restored in case a misspeculation occurs.
The Alpha conditional-move instructions must be handled specially by the map stage. These operations conditionally move one of two source registers into a destination register. This makes conditional move the only instruction in the Alpha architecture that requires three register sources: the two
ALPHA 21264
IEEE MICRO
Figure 4. Block diagram of the 21264 tournament branch predictor. The local history prediction path is on the left: the program counter indexes a local history table (1,024 × 10), which indexes local prediction counters (1,024 × 3). The global history prediction path and the chooser (choice prediction) are on the right: the path history indexes global prediction (4,096 × 2) and choice prediction (4,096 × 2) counters, and a mux produces the final branch prediction.
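A toy version of the tournament predictor in Figure 4, using the table sizes from the figure. The counter thresholds and training rules here are conventional saturating-counter choices, not necessarily DEC's exact ones.

```python
# Tournament branch predictor sketch (sizes from Figure 4; policies assumed).
class Tournament:
    def __init__(self):
        self.local_hist = [0] * 1024     # per-PC branch history (10 bits)
        self.local_pred = [3] * 1024     # 3-bit saturating counters
        self.global_pred = [1] * 4096    # 2-bit saturating counters
        self.choice = [1] * 4096         # 2-bit chooser counters
        self.path = 0                    # global path history (12 bits)

    def predict(self, pc):
        local = self.local_pred[self.local_hist[pc % 1024]] >= 4
        glob = self.global_pred[self.path] >= 2
        use_global = self.choice[self.path] >= 2
        return glob if use_global else local

    def train(self, pc, taken):
        li, gi = self.local_hist[pc % 1024], self.path
        local = self.local_pred[li] >= 4
        glob = self.global_pred[gi] >= 2
        if local != glob:                # chooser learns only on disagreement
            delta = 1 if glob == bool(taken) else -1
            self.choice[gi] = min(3, max(0, self.choice[gi] + delta))
        step = 1 if taken else -1
        self.local_pred[li] = min(7, max(0, self.local_pred[li] + step))
        self.global_pred[gi] = min(3, max(0, self.global_pred[gi] + step))
        bit = 1 if taken else 0
        self.local_hist[pc % 1024] = ((li << 1) | bit) & 1023
        self.path = ((self.path << 1) | bit) & 4095
```

The chooser trains only when local and global disagree, so it learns which component is more trustworthy for the current path rather than merely which branches are taken.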
Figure 5. Block diagram of the 21264's map (register rename) and queue stages. The map stage renames programmer-visible register numbers to internal register numbers (72–80 internal registers) via map content-addressable memories, with saved map state for recovery. The queue stage stores instructions until they are ready to issue: a register scoreboard and the queue entries exchange request/grant signals with the arbiter, covering 80 in-flight instructions. These structures are duplicated for integer and floating-point execution.
Scoreboard: Tracks writes to physical registers.
11Thursday, April 3, 14
Execution close-up:
Internal memory system
The internal memory system supports many in-flight memory references and out-of-order operations. It can service up to two memory references from the integer execution pipes every cycle. These two memory references are out-of-order issues. The memory system simultaneously tracks up to 32 in-flight loads, 32 in-flight stores, and 8 in-flight (instruction or data) cache misses. It also has a 64-Kbyte, two-way set-associative data cache. This cache has much lower miss rates than the 8-Kbyte, direct-mapped cache in the earlier 21164. The end result is a high-bandwidth, low-latency memory system.
Data path
The 21264 supports any combination of two loads or stores per cycle without conflict. The data cache is double-pumped to implement the necessary two ports. That means that the data cache is referenced twice each cycle, once per each of the two clock phases. In effect, the data cache operates at twice the frequency of the processor clock, an important feature of the 21264's memory system.
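Double-pumping can be modeled as two back-to-back accesses per processor cycle, one per clock phase. This is a behavioral sketch (invented function, ignoring ECC and the real port timing), just to make the "two references per cycle" claim concrete:

```python
# Behavioral sketch of a double-pumped data cache: up to two
# requests are serviced sequentially within one processor cycle,
# one on each clock phase.
def dcache_cycle(dcache, ops):
    """ops: up to two (kind, addr, data) requests; kind is 'ld' or 'st'."""
    results = []
    for kind, addr, data in ops[:2]:       # phase A, then phase B
        if kind == 'ld':
            results.append(dcache.get(addr))   # load reads in its phase
        else:
            dcache[addr] = data                # store writes in its phase
            results.append(None)
    return results
```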
Figure 7 depicts the memory system's internal data paths. The two 64-bit data buses are the heart of the internal memory system. Each load receives data via these buses from the data cache, the speculative store data buffers, or an external (system or L2) fill. Stores first transfer their data across the data buses into the speculative store buffer. Store data remains in the speculative store buffer until the stores retire. Once they retire, the data is written (dumped) into the data cache on idle cache cycles. Each dump can write 128 bits into the cache since two stores can merge into one dump. Dumps use the double-pumped data cache to implement a read-modify-write sequence. Read-modify-write is required on stores to update the stored SECDED ECC that allows correction of single-bit errors.
Stores can forward their data to subsequent loads while they reside in the speculative store data buffer. Load instructions compare their age and address against these pending stores. On a match, the appropriate store data is put on the data bus rather than the data from the data cache. In effect, the speculative store data buffer performs a memory-renaming function. From the perspective of younger loads, it appears the stores write into the data cache immediately. However, squashed stores are removed from the speculative store data buffer before they affect the final cache state.
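The age-and-address match can be sketched as a small function (an illustrative model; the real store buffer is a CAM searched in parallel): a load takes the youngest matching store that is still older than itself, otherwise it reads the cache.

```python
# Store-to-load forwarding sketch ("memory renaming"; names invented).
def load_value(addr, age, store_buffer, dcache):
    """store_buffer: list of (age, addr, data) for unretired stores."""
    older = [(a, d) for a, ad, d in store_buffer if ad == addr and a < age]
    if older:
        return max(older)[1]     # youngest matching older store forwards
    return dcache.get(addr)      # otherwise the data cache supplies data
```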
Figure 7 shows how data is brought into and out of the internal memory system. Fill data arrives on the data buses. Pending loads sample the data to write into the register file while, in parallel, the caches (instruction or data) also fill using the same bus data. The data cache is write-back, so fills also use its double-pumped capability: the previous cache contents are read out in the same cycle that fill data is written in. The bus interface unit captures this victim data and later writes it back.
Address and control structure
The internal memory system maintains a 32-entry load queue (LDQ) and a 32-entry
Table 2. Sample 21264 instruction latencies (s-p means single-precision; d-p means double-precision).

Instruction class                                  Latency (cycles)
Simple integer operations                          1
Motion-video instructions/integer population
  count and leading/trailing zero count
  unit (MVI/PLZ)                                   3
Integer multiply                                   7
Integer load                                       3
Floating-point load                                4
Floating-point add                                 4
Floating-point multiply                            4
Floating-point divide                              12 s-p, 15 d-p
Floating-point square-root                         15 s-p, 30 d-p
Figure 7. The 21264's internal memory system data paths. Two 64-bit data buses link the cluster 0 and cluster 1 memory units, the data cache, the speculative store data buffer, and the instruction cache; 128-bit fill data and victim data paths connect the caches through the bus interface to the L2 cache and the system port.
(1) Two copies of register files, to reduce port pressure.
(2) Forwarding buses are low-latency paths through the CPU.
Relies on speculation.
12Thursday, April 3, 14
Latencies, from issue to retirement. 8 retirements per cycle can be sustained over short time periods; peak rate is 11 retirements in a single cycle.
flight window. This means that up to 80 instructions can be in partial states of completion at any time, allowing for significant execution concurrency and latency hiding. (This is particularly true since the memory system can track an additional 32 in-flight loads and 32 in-flight stores.)
Table 1 shows the minimum latency, in number of cycles, from issue until retire eligibility for different instruction classes. The retire mechanism can retire at most 11 instructions in a single cycle, and it can sustain a rate of 8 per cycle (over short periods).
Execution engine
Figure 6 depicts the six execution pipelines. Each pipeline is physically placed above or below its corresponding register file. The 21264 splits the integer register file into two clusters that contain duplicates of the 80-entry register file. Two pipes access a single register file to form a cluster, and the two clusters combine to support four-way integer instruction execution. This clustering makes the design simpler and faster, although it costs an extra cycle of latency to broadcast results from an integer cluster to the other cluster. The upper pipelines from the two integer clusters in Figure 6 are managed by the same issue queue arbiter, as are the two lower pipelines. The integer queue statically slots instructions to either the upper or lower pipeline arbiters. It then dynamically selects which cluster to execute an instruction on, left or right.
The performance costs of the register clustering and issue queue arbitration simplifications are small: a few percent or less compared to an idealized unclustered implementation in most applications. There are multiple reasons for the minimal performance effect. First, for many operations (such as loads and stores) the static-issue queue assignment is not a restriction since they can only execute in either the upper or lower pipelines. Second, critical-path computations tend to execute on the same cluster. The issue queue prefers older instructions, so more-critical instructions incur fewer cross-cluster delays: an instruction can usually issue first on the same cluster that produces the result. This integer pipeline architecture as a result provides much of the implementation simplicity, lower risk, and higher speed of a two-issue machine with the performance benefits of four-way integer issue. Figure 6 also shows the floating-point execution pipes' configuration. A single cluster has the two floating-point execution pipes, with a single 72-entry register file.
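The cross-cluster cost described above reduces to a one-line model (a hypothetical helper, just to make the trade-off concrete): a consumer issued on the cluster that produced its source avoids the one-cycle broadcast delay, which is why the issue logic prefers the producing cluster.

```python
# Toy model of the one-cycle cross-cluster broadcast penalty.
def result_ready_cycle(exec_cycle, latency, producer_cluster, consumer_cluster):
    extra = 0 if producer_cluster == consumer_cluster else 1  # broadcast cost
    return exec_cycle + latency + extra
```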
The 21264 includes new functional units not present in prior Alpha microprocessors. The Alpha motion-video instructions (MVI, used to speed many forms of image processing), a fully pipelined integer multiply unit, an integer population count and leading/trailing zero count unit (PLZ), a floating-point square-root functional unit, and instructions to move register values directly between floating-point and integer registers are included. The processor also provides more complete hardware support for the IEEE floating-point standard, including precise exceptions, NaN and infinity processing, and support for flushing denormal results to zero. Table 2 shows sample instruction latencies (issue of producer to issue of consumer). These latencies are achieved through result bypassing.
Table 1. Sample 21264 retire pipe stages.

Instruction class            Retire latency (cycles)
Integer                      4
Memory                       7
Floating-point               8
Branch/jump to subroutine    7
Figure 6. The four integer execution pipes (upper and lower for each of a left and right cluster) and the two floating-point pipes in the 21264, together with the functional units in each. Cluster 0 and cluster 1 each hold 80 registers with shift/branch, add/logic, and load/store units; the integer multiply unit and the MVI/PLZ unit (motion video instructions; integer population count and leading/trailing zero count unit) each reside in one cluster. The floating-point side holds 72 registers with floating-point add, multiply, divide, and square-root (SQRT) units.
issue and retire. It also trains the correct pre-dictions by updating the referenced local,global, and choice counters at that time. Theprocessor maintains path history with a siloof 12 branch predictions. This silo is specu-latively updated before a branch retires and isbacked up on a mispredict.
Out-of-order execution The 21264 offers out-of-order efficiencies
with higher clock speeds than competingdesigns, yet this speed does not restrict themicroprocessor’s dynamic execution capabili-ties. The out-of-order execution logic receivesfour fetched instructions every cycle,renames/remaps the registers to avoid unneces-sary register dependencies, and queues the
instructions until operands or functional unitsbecome available. It dynamically issues up to sixinstructions every cycle—four integer instruc-tions and two floating-point instructions. It alsoprovides an in-order execution model to theprogrammer via in-order instruction retire.
Register renaming

Register renaming exposes application instruction parallelism since it eliminates unnecessary dependencies and allows speculative execution. Register renaming assigns a unique storage location to each write-reference to a register. The 21264 speculatively allocates a register to each instruction with a register result. The register only becomes part of the user-visible (architectural) register state when the instruction retires/commits. This lets the instruction speculatively issue and deposit its result into the register file before the instruction retires. Register renaming also eliminates write-after-write and write-after-read register dependencies, but preserves all the read-after-write register dependencies that are necessary for correct computation.
The left side of Figure 5 depicts the map, or register rename, stage in more detail. The processor maintains storage with each internal register indicating the user-visible register that is currently associated with the given internal register (if any). Thus, register renaming is a content-addressable memory (CAM) operation for register sources together with a register allocation for the destination register. All pipeline stages subsequent to the register map stage operate on internal registers rather than user-visible registers.
Beyond the 31 integer and 31 floating-point user-visible (non-speculative) registers, an additional 41 integer and 41 floating-point registers are available to hold speculative results prior to instruction retirement. The register mapper stores the register map state for each in-flight instruction so that the machine architectural state can be restored in case a misspeculation occurs.
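A minimal Python sketch of this map-stage bookkeeping, under assumed simplifications (a dict-based map table instead of the hardware CAM, a FIFO free list, and a full map checkpoint per instruction instead of compact saved map state; all names are illustrative):

```python
class Renamer:
    def __init__(self, arch_regs=31, extra=41):
        self.map = {r: r for r in range(arch_regs)}   # arch -> internal
        self.free = list(range(arch_regs, arch_regs + extra))
        self.silo = []                                # saved map states

    def rename(self, srcs, dst):
        # Sources read the current map (the CAM lookup in hardware).
        phys_srcs = [self.map[s] for s in srcs]
        # The destination gets a fresh internal register, which removes
        # WAW and WAR dependencies while preserving RAW dependencies.
        phys_dst = self.free.pop(0)
        self.silo.append(dict(self.map))              # checkpoint map state
        self.map[dst] = phys_dst
        return phys_srcs, phys_dst

    def recover(self, silo_index):
        # On a mispredict or trap, restore the map state saved with the
        # redirecting instruction and discard younger checkpoints.
        self.map = self.silo[silo_index]
        del self.silo[silo_index:]
```

Two back-to-back writes to the same architectural register get different internal registers, and a later reader of that register sees the most recent mapping.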
The Alpha conditional-move instructions must be handled specially by the map stage. These operations conditionally move one of two source registers into a destination register. This makes conditional move the only instruction in the Alpha architecture that requires three register sources: the two
[Figure 4 graphic: the program counter indexes a local history table (1,024 x 10) whose history selects a local prediction (1,024 x 3); path history indexes a global prediction (4,096 x 2) and a choice prediction (4,096 x 2); a mux driven by the choice prediction selects the final branch prediction.]
Figure 4. Block diagram of the 21264 tournament branch predictor. The local history prediction path is on the left; the global history prediction path and the chooser (choice prediction) are on the right.
[Figure 5 graphic: four fetched instructions present register numbers to the map content-addressable memories (with saved map state) and receive internal register numbers from the 72–80 internal registers; queue entries, a register scoreboard, and an arbiter tracking 80 in-flight instructions exchange request/grant signals to produce issued instructions.]
Figure 5. Block diagram of the 21264's map (register rename) and queue stages. The map stage renames programmer-visible register numbers to internal register numbers. The queue stage stores instructions until they are ready to issue. These structures are duplicated for integer and floating-point execution.
Retirement managed here.
Short latencies keep buffers to a reasonable size.
Execution unit close-up:
flight window. This means that up to 80 instructions can be in partial states of completion at any time, allowing for significant execution concurrency and latency hiding. (This is particularly true since the memory system can track an additional 32 in-flight loads and 32 in-flight stores.)
Table 1 shows the minimum latency, in number of cycles, from issue until retire eligibility for different instruction classes. The retire mechanism can retire at most 11 instructions in a single cycle, and it can sustain a rate of 8 per cycle (over short periods).
Execution engine

Figure 6 depicts the six execution pipelines. Each pipeline is physically placed above or below its corresponding register file. The 21264 splits the integer register file into two clusters that contain duplicates of the 80-entry register file. Two pipes access a single register file to form a cluster, and the two clusters combine to support four-way integer instruction execution. This clustering makes the design simpler and faster, although it costs an extra cycle of latency to broadcast results from an integer cluster to the other cluster. The upper pipelines from the two integer clusters in Figure 6 are managed by the same issue queue arbiter, as are the two lower pipelines. The integer queue statically slots instructions to either the upper or lower pipeline arbiters. It then dynamically selects which cluster to execute an instruction on, left or right.

The performance costs of the register clustering and issue queue arbitration simplifications are small: a few percent or less compared to an idealized unclustered implementation in most applications. There are multiple reasons for the minimal performance effect. First, for many operations (such as loads and stores) the static-issue queue assignment is not a restriction since they can only execute in either the upper or lower pipelines. Second, critical-path computations tend to execute on the same cluster. The issue queue prefers older instructions, so more-critical instructions incur fewer cross-cluster delays; an instruction can usually issue first on the same cluster that produces the result. As a result, this integer pipeline architecture provides much of the implementation simplicity, lower risk, and higher speed of a two-issue machine with the performance benefits of four-way integer issue. Figure 6 also shows the floating-point execution pipes' configuration. A single cluster has the two floating-point execution pipes, with a single 72-entry register file.
producer/consumer relationships within the four instructions) are combined to assign either previously allocated registers or the registers supplied by the free register generator to the source specifiers.
The resulting four register maps are saved in the map silo, which, in turn, provides information to the free register generator as to which registers are currently allocated. Finally, the last map that was created is used as the initial map for the next cycle. In the event of a branch mispredict or trap, the CAM is restored with the map silo entry associated with the redirecting instruction.
Issue Queue
Each cycle, up to four instructions are loaded into the two issue queues. The floating point queue (Fqueue) chooses two instructions from a 15-entry window and issues them to the two floating point pipelines. The integer queue (Iqueue) chooses four instructions from a 20-entry window and issues them to the four integer pipelines (figure 3).
[Figure 3 graphic: the integer issue queue feeding Ebox cluster 0 and Ebox cluster 1, each with an 80-entry register file, execute pipes, and a media unit.]
Figure 3. Integer Issue Queue and Ebox

The fundamental circuit loop in the Iqueue is the path in which a single-cycle producing instruction is granted (issued) at the end of one issue cycle and a consuming instruction requests to be issued at the beginning of the next cycle (e.g., instructions (0) and (2) in the mapper example). The grant must be communicated to all newer consumer instructions in the issue window.
The issue queue maintains a register scoreboard, based on physical register number, and tracks the progress of multiple-cycle (e.g. integer multiply) and variable-cycle (e.g. memory load) instructions. When arithmetic result data or load data is available for bypass, the scoreboard unit notifies all instructions in the queue.
The queue arbitration cycle works as follows:
1. New instructions are loaded into the "top" of the queue.
2. Register scoreboard information is communicated to all queue entries.
3. Instructions that are data-ready request issue.
4. A set of issue arbiters search the queue from "bottom" to "top", selecting instructions that are data-ready in an age-prioritized order and skipping over instructions that are not data-ready.
5. Selected instructions are broadcast to the functional units.

In the next cycle a queue-update mechanism calculates which queue entries are available for future instructions and squashes issued instructions out of the queue. Instructions that are still resident in the queue shift towards the bottom.
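The five-step cycle above can be sketched as follows. This is a hypothetical model, not the 21264's circuit: the queue is a Python list ordered oldest-first ("bottom" to "top"), the scoreboard is a set of ready physical register numbers, and "squash and shift" is modeled by rebuilding the list.

```python
def arbitrate(queue, ready_regs, width=4):
    # queue: list of instructions, oldest ("bottom") first.
    # ready_regs: scoreboard set of physical registers with valid data.
    issued = []
    for inst in queue:                   # scan from the bottom (oldest)
        if len(issued) == width:
            break                        # at most `width` grants per cycle
        if all(src in ready_regs for src in inst["srcs"]):
            issued.append(inst)          # data-ready: grant issue
        # not data-ready: skip over it, keep scanning toward the top
    # Queue update: squash issued entries; survivors shift toward bottom.
    remaining = [i for i in queue if i not in issued]
    return issued, remaining
```

Note the age priority: an older instruction that is not data-ready is skipped rather than blocking a younger ready one, but ready instructions are always granted oldest-first.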
Ebox and Fbox
The Ebox functional unit organization was designed around a fast execute-bypass cycle. In order to reduce the impact of the large number of register ports required for a quad-issue CPU and to limit the effect on cycle time of long bypass busses between the functional units, the Ebox was organized around two clusters (see figure 3). Each cluster contains two functional units, an 80-entry register file, and result busses to/from the other cluster. The lower two functional units contain one-cycle adders and logical units; the upper two contain adders, logic units, and shifters. One upper functional unit contains a 7-cycle, fully pipelined multiplier; the other contains a 3-cycle motion video pipeline, which implements motion estimation, threshold, and pixel compaction/expansion functions. The two integer unit clusters have equal capability to execute most instructions (integer multiply, motion video, and some special-purpose instructions can only be executed in one cluster).
The execute pipeline operation proceeds as follows:
Stage 3: Instructions are issued to both clusters.
Stage 4: Register files are read.
Stage 5: Execution (may be multiple cycles).
Stage 6: Results are written to the register file of the cluster in which execution is performed and are bypassed into the next execution stage within the cluster.
Stage 7: Results are written to the cross-cluster register file and are bypassed into the next execution stage in the other cluster.
(1) Two arbiters: one for top pipes, one for bottom pipes.
(2) Instructions statically assigned to top or bottom.
(3) Arbiter dynamically selects left or right.
Thus, 2 dual-issue dynamic machines, not a 4-issue machine.
Why? Simplifies arbiter. Performance penalty? A few %.
comes in a 587-pin PGA package. It can execute up to 2.4 billion instructions per second.
Figure 1 shows a photo of the 21264, highlighting major sections. Figure 2 is a high-level overview of the 21264 pipeline, which has seven stages, similar to the earlier in-order 21164. One notable addition is the map stage that renames registers to expose instruction parallelism; this addition is fundamental to the 21264's out-of-order techniques.
Instruction pipeline – Fetch

The instruction pipeline begins with the fetch stage, which delivers four instructions to the out-of-order execution engine each cycle. The processor speculatively fetches through line, branch, or jump predictions. Since the predictions are usually accurate, this instruction fetch implementation typically supplies a continuous stream of good-path instructions to keep the functional units busy with useful work.
Two architectural techniques increase fetch efficiency: line and way prediction, and branch prediction. A 64-Kbyte, two-way set-associative instruction cache offers much-improved level-one hit rates compared to the 8-Kbyte, direct-mapped instruction cache in the Alpha 21164.
Line and way prediction

The processor implements a line and way prediction technique that combines the advantages of set-associative behavior and fetch bubble elimination, together with the fast access time of a direct-mapped cache. Figure 3 (next page) shows the technique's main features. Each four-instruction fetch block includes a line and way prediction. This prediction indicates where to fetch the next block of four instructions, including which way, that is, which of the two choices allowed by the two-way associative cache.
The processor reads out the next instructions using the prediction (via the wraparound path in Figure 3) while, in parallel, it completes the validity check for the previous instructions. Note that the address paths needing extra logic levels (instruction decode, branch prediction, and cache tag comparison) are outside the critical fetch loop.
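The critical fetch loop can be sketched as follows. This is a hypothetical model, not the cache's actual structure: each cached block carries its instructions plus a (line, way) guess for the next fetch; the next block is read immediately from that guess while the previous block's check completes in parallel, and a wrong guess retrains the predictor outside the loop. All names are illustrative.

```python
def fetch(icache, line, way):
    # icache[way][line] -> (instructions, next_line, next_way).
    # One direct lookup per cycle: no tag compare in the fetch loop.
    insts, next_line, next_way = icache[way][line]
    return insts, next_line, next_way

def verify_and_train(icache, line, way, actual_line, actual_way):
    # Runs in parallel, outside the critical loop: if the stored guess
    # was wrong, rewrite it so the next visit predicts correctly.
    insts, nl, nw = icache[way][line]
    if (nl, nw) != (actual_line, actual_way):
        icache[way][line] = (insts, actual_line, actual_way)
```

The point of the structure is that `fetch` touches only one array entry, giving direct-mapped access time even though the cache is two-way associative.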
The processor loads the line and way predictors on an instruction cache fill, and
MARCH–APRIL 1999
[Figure 1 graphic: die photo labeling the floating-point units, float map and queue, instruction fetch, bus interface unit, memory controllers, data and control buses, data cache, instruction cache, integer queue, integer mapper, and integer units (clusters 0 and 1).]

Figure 1. Alpha 21264 microprocessor die photo. BIU stands for bus interface unit.
[Figure 2 graphic: pipeline stages Fetch (0), Slot (1), Rename (2), Issue (3), Register read (4), Execute (5), Memory (6). Fetch uses the branch predictor, line/set prediction, and a 64-Kbyte two-way instruction cache; integer and floating-point register rename feed a 20-entry integer issue queue and a 15-entry floating-point issue queue; two 80-entry integer register files feed four integer execution pipes (two also generating addresses), and a 72-entry floating-point register file feeds floating-point add and multiply pipes; the 64-Kbyte two-way data cache connects to the level-two cache and system interface.]

Figure 2. Stages of the Alpha 21264 instruction pipeline.
Memory stages close-up:
Internal memory system

The internal memory system supports many in-flight memory references and out-of-order operations. It can service up to two memory references from the integer execution pipes every cycle. These two memory references are out-of-order issues. The memory system simultaneously tracks up to 32 in-flight loads, 32 in-flight stores, and 8 in-flight (instruction or data) cache misses. It also has a 64-Kbyte, two-way set-associative data cache. This cache has much lower miss rates than the 8-Kbyte, direct-mapped cache in the earlier 21164. The end result is a high-bandwidth, low-latency memory system.
Data path

The 21264 supports any combination of two loads or stores per cycle without conflict. The data cache is double-pumped to implement the necessary two ports. That means that the data cache is referenced twice each cycle, once per each of the two clock phases. In effect, the data cache operates at twice the frequency of the processor clock, an important feature of the 21264's memory system.
Figure 7 depicts the memory system's internal data paths. The two 64-bit data buses are the heart of the internal memory system. Each load receives data via these buses from the data cache, the speculative store data buffers, or an external (system or L2) fill. Stores first transfer their data across the data buses into the speculative store buffer. Store data remains in the speculative store buffer until the stores retire. Once they retire, the data is written (dumped) into the data cache on idle cache cycles. Each dump can write 128 bits into the cache since two stores can merge into one dump. Dumps use the double-pumped data cache to implement a read-modify-write sequence. Read-modify-write is required on stores to update the stored SECDED ECC that allows correction of single-bit errors.
Stores can forward their data to subsequent loads while they reside in the speculative store data buffer. Load instructions compare their age and address against these pending stores. On a match, the appropriate store data is put on the data bus rather than the data from the data cache. In effect, the speculative store data buffer performs a memory-renaming function. From the perspective of younger loads, it appears the stores write into the data cache immediately. However, squashed stores are removed from the speculative store data buffer before they affect the final cache state.
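The forwarding rule can be sketched in a few lines. This is a hypothetical model of the behavior described above, not the hardware: a load searches the speculative store buffer for the youngest store that is both older than the load and to the same address, and takes its data; otherwise the data cache supplies the load. All names are illustrative.

```python
def load(addr, age, store_buffer, dcache):
    # store_buffer: list of (store_age, addr, data) for unretired stores.
    # A smaller age means an older (earlier-in-program-order) access.
    older = [s for s in store_buffer if s[0] < age and s[1] == addr]
    if older:
        # Forward from the youngest matching older store: to this load,
        # it looks as if that store already wrote the data cache.
        return max(older)[2]
    return dcache.get(addr, 0)   # no match: the data cache supplies it
```

Squashing a misspeculated store is then just removing its entry from `store_buffer` before retirement, so it never reaches the cache.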
Figure 7 shows how data is brought into and out of the internal memory system. Fill data arrives on the data buses. Pending loads sample the data to write into the register file while, in parallel, the caches (instruction or data) also fill using the same bus data. The data cache is write-back, so fills also use its double-pumped capability: the previous cache contents are read out in the same cycle that fill data is written in. The bus interface unit captures this victim data and later writes it back.
Address and control structure

The internal memory system maintains a 32-entry load queue (LDQ) and a 32-entry
Table 2. Sample 21264 instruction latencies (s-p means single-precision; d-p means double-precision).

Instruction class                                    Latency (cycles)
Simple integer operations                            1
Motion-video instructions/integer population count
  and leading/trailing zero count unit (MVI/PLZ)     3
Integer multiply                                     7
Integer load                                         3
Floating-point load                                  4
Floating-point add                                   4
Floating-point multiply                              4
Floating-point divide                                12 s-p, 15 d-p
Floating-point square root                           15 s-p, 30 d-p
[Figure 7 graphic: two 64-bit data buses linking the cluster 0 and cluster 1 memory units, the speculative store data buffer, the instruction cache, and the data cache; 128-bit fill and victim-data paths connect through the bus interface to the L2 cache and the 64-bit system port.]

Figure 7. The 21264's internal memory system data paths.
Loads and stores from the execution unit appear as "Cluster 0/1 memory unit" in the diagram below.
The one-cycle cross-cluster bypass delay resulted in a negligible performance penalty (about 1% on SPECint95) but reduced the operand bypass bus length by 75%.
Floating Point Execution Unit

The floating point pipe execution units are organized around a single 72-entry register file. One unit contains a 4-cycle fully pipelined adder and the other contains a 4-cycle multiplier. In addition, the adder pipeline contains a square-root and divide unit. The Fbox pipeline operation is similar to the Ebox pipeline except the execution stage is elongated and there is only one cluster.
[Figure 4 graphic (detail): physical addresses PA0 and PA1 feed the 8-entry MAF, the 32-entry LDQ, and the 32-entry STQ.]
Memory Operations
The lower two integer functional unit adders are shared between ADD/SUB instructions and effective virtual address calculations (register + displacement) for load and store instructions. Loads are processed as follows:
Stage 3: Up to two load instructions are issued, potentially out-of-order.
Stages 4 and 5: Register file read and displacement address calculation.
Stages 6A and 6B: The 64KB 2-way, virtually indexed, physically tagged data cache is accessed. The cache is phase-pipelined such that one index is supplied every 1 ns (assuming a 2 ns cycle time). Most dual-ported caches impose constraints on the indices that are supplied each cycle to avoid bank conflicts. Phase-pipelining the cache avoids these constraints.
Stage 7: A 128-bit load/store data bus (the LSD bus) is driven from the cache to the execution units. The cache data reaches both integer unit subclusters at the same time, so consuming instructions can issue to any functional unit 3 cycles after the load is issued. Cache data takes an additional cycle to reach the floating point execution unit.
Mbox
The memory instruction pipeline discussed above is optimized for loads/stores which hit in the Dcache and do not cause any address reference order hazards. The Mbox detects and resolves these hazards and processes Dcache misses.
Hazard Detection and Resolution
As discussed earlier, out-of-order issued instructions can generate three types of hazards: read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW). Register renaming resolves WAW and WAR for references to the architectural register specifiers, and the Ibox queue resolves the RAW dependencies. The Mbox must detect and resolve these hazards as they apply to references to memory. Consider the following series of memory instructions which reference address (A):
(0) LD Memory(A) → R1
(1) ST R2 → Memory(A)
(2) LD Memory(A) → R3
(3) ST R4 → Memory(A)
Assume that address (A) is cached in the Dcache. If (0) and (1) issue out-of-order from the Iqueue, R1 will incorrectly receive the result of the store. If (1) and (2) are issued out-of-order, R3 will incorrectly receive the value before the store, and, finally, if (1) and (3) issue and complete out-of-order, the value stored to location (A) will be R2 instead of R4.
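One of these cases, the RAW-through-memory hazard where load (2) issues before older store (1), can be sketched as an LDQ scan. This is a hypothetical model of the detection step only, not the Mbox circuit: when a store computes its address, it searches the LDQ for younger loads to the same address that have already executed, and any hits must be replayed with the forwarded store data. The WAR and WAW memory cases are handled analogously by age/address checks against the STQ. All names are illustrative.

```python
def store_checks_ldq(store_age, store_addr, ldq):
    # ldq entries: (load_age, addr, done). A smaller age is older.
    # Returns the ages of younger, already-executed loads to the same
    # address: those loads read stale data and must be replayed.
    return [age for (age, addr, done) in ldq
            if age > store_age and addr == store_addr and done]
```

In the four-instruction example, store (1) scanning an LDQ in which load (2) has already completed would flag load (2) for replay, restoring the in-order memory semantics.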
The datapath which the Mbox uses to resolve these hazards is shown in figure 4. Since loads and stores can dual-issue, the datapath receives two effective addresses per cycle (VA0 and VA1) from the Ebox adders. It first translates them to physical addresses (PA0 and PA1) using a dual-ported, 128-entry, fully associative translation lookaside buffer (TLB). The physical addresses travel over the three key structures in the Mbox: the LoaD Queue (LDQ), the STore Queue (STQ), and the Miss Address File (MAF).

Figure 4. Mbox Address Datapath
The 32-entry LDQ contains all the in-flight load instructions, and the 32-entry STQ contains all the in-flight store instructions. The MAF contains all the in-flight cache transactions which are pending to the backup
1st stop: TLB, to convert virtual memory addresses.
2nd stop: Load Queue (LDQ) and Store Queue (STQ) each hold 32 instructions, until retirement ...
3rd stop: Flush STQ to the data cache ... on a miss, place in Miss Address File (MAF == MSHR).
"Double-pumped" at 1 GHz.
The one-cycle cross-cluster bypass delay resulted in a negligible performance penalty (abut 1 % on SPECInt95) but reduced the operand bypass bus length by 75%.
A
Floating Point Execution Unit The floating point pipe execution units are organized around a single 72-entry register file. One unit contains a 4-cycle fully pipelined adder and the other contains a 4- cycle multiplier. In addition, the adder pipeline contains a square-root and divide unit. The Fbox pipeline operation is similar to the Ebox pipeline except the execution stage is elongated and there is only one cluster.
PA0 A PA1
MAF 8-ent ries
LDQ 32-entries
STQ 32-entries
Memory Operations
The lower two integer functional unit adders are shared between ADD/SUB instructions and effective virtual address calculations (register + displacement) for load and store instructions. Loads are processed as follows:
Stage 3: Up to two load instructions are issued, potentially out-of-order. Stage 4 and 5: Register file read and displacement address calculation. Stage 6A and 6B: The 64KB 2-way, virtually indexed, physically tagged data cache is accessed. The cache is phase-pipelined such that one index is supplied every 1 ns (assuming a 2ns cycle time). Most dual-ported caches impose constraints on the indices that are supplied each cycle to avoid bank conflicts. Phase-pipelining the cache avoids these constraints. Stage 7: A 128-bit loadstore data bus (the LSD bus) is driven from the cache to the execution units. The cache data reaches both integer unit subclusters at the same time -- consuming instructions can issue to any functional unit 3 cycles after the load is issued. Cache data takes an additional cycle to reach the floating point execution unit.
Mbox
The memory instruction pipeline discussed above is optimized for loads/stores which hit in the Dcache and do not cause any address reference order hazards. The Mbox detects and resolves these hazards and processes Dcache misses .
Hazard Detection and Resolution
As discussed earlier, out-of-order issued instructions can generate three type of hazards read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW). Register renaming resolves WAW and WAR for
references to the architectural register specifiers, and the Ibox queue resolves the RAW dependencies. The Mbox must detect and resolve these hazards as they apply to references to memory. Consider the following series of memory instructions which reference address (A):
(0) LD Memory(A) → R1
(1) ST R2 → Memory(A)
(2) LD Memory(A) → R3
(3) ST R4 → Memory(A)
Assume that address (A) is cached in the Dcache. If (0) and (1) issue out-of-order from the Iqueue, R1 will incorrectly receive the result of the store. If (1) and (2) are issued out-of-order, R3 will incorrectly receive the value before the store, and, finally, if (1) and (3) issue and complete out-of-order, the value stored to location (A) will be R2 instead of R4.
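The three failure cases above can be checked with a minimal sketch (this is not the 21264's mechanism, just a demonstration of why the reorderings are wrong; the `run` helper and register values are illustrative):

```python
# Execute the four-instruction sequence above in a given issue order.
# Memory and registers are plain dicts; each instruction is (kind, reg).
def run(order, program, mem_init=0, regs=None):
    mem = {"A": mem_init}
    regs = dict(regs or {})
    for i in order:
        kind, reg = program[i]
        if kind == "LD":
            regs[reg] = mem["A"]   # load from address A
        else:
            mem["A"] = regs[reg]   # store to address A
    return regs, mem["A"]

# (0) LD A->R1  (1) ST R2->A  (2) LD A->R3  (3) ST R4->A
prog = [("LD", "R1"), ("ST", "R2"), ("LD", "R3"), ("ST", "R4")]
init = {"R2": 20, "R4": 40}

# Program order: R1 sees the old value, R3 sees R2, memory ends as R4.
regs, final = run([0, 1, 2, 3], prog, mem_init=5, regs=init)
assert regs["R1"] == 5 and regs["R3"] == 20 and final == 40

# (1) before (0): R1 incorrectly receives the result of the store.
regs, _ = run([1, 0, 2, 3], prog, mem_init=5, regs=init)
assert regs["R1"] == 20

# (2) before (1): R3 incorrectly receives the pre-store value.
regs, _ = run([0, 2, 1, 3], prog, mem_init=5, regs=init)
assert regs["R3"] == 5

# (3) completes before (1): memory holds R2 instead of R4.
_, final = run([0, 2, 3, 1], prog, mem_init=5, regs=init)
assert final == 20
```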
The datapath which the Mbox uses to resolve these hazards is shown in Figure 4. Since loads and stores can dual-issue, the datapath receives two effective addresses per cycle (VA0 and VA1) from the Ebox adders. It first translates them to physical addresses (PA0 and PA1) using a dual-ported, 128-entry, fully associative translation lookaside buffer (TLB). The physical addresses travel over the three key structures in the Mbox: the LoaD Queue (LDQ), the STore Queue (STQ), and the Miss Address File (MAF).

[Figure 4. Mbox Address Datapath]
The 32-entry LDQ contains all the in-flight load instructions, and the 32-entry STQ contains all the in-flight store instructions. The MAF contains all the in-flight cache transactions which are pending to the backup cache and system.
So we can roll back!
LDQ/STQ close-up:
The one-cycle cross-cluster bypass delay resulted in a negligible performance penalty (about 1% on SPECint95) but reduced the operand bypass bus length by 75%.
Hazards we are trying to prevent:
Each entry in the MAF refers to a 64-byte block of data which is ultimately bound for the Dcache or the Icache.
The instruction processing pipeline described above extends to the Mbox as follows:
Stage 6: The Dcache tags and TLB are read. Dcache hit is calculated.
Stage 7: The physical addresses generated from the TLB (PA0, PA1) are CAMed across the LDQ, STQ, and MAF.
Stage 8: Load instructions are written into the LDQ; store instructions are written into the STQ; and, if the memory reference missed the Dcache, it is written into the MAF. In parallel, the CAM results from the preceding cycle are combined with relative instruction age information to detect hazards. In addition, the MAF uses the result of its CAMs to detect the loads and stores which can be merged into the same 64-byte cache block.
Stage 9: The MAF entry allocation in stage 8 is validated to the system interface, and the MAF number associated with this particular memory miss is written into the appropriate (LDQ/STQ) structure. This MAF number provides a mapping between the merged references to the same cache block and individual outstanding load and store instructions.
Given these resources and within the context of the Mbox pipeline, memory hazards are solved as follows. RAW hazards are discovered when an issued store detects that a younger load to the same address has already issued and delivered its data. In this event, the CPU is trapped to the store instruction, and instruction flow is replayed by the Ibox. This is a potentially common hazard, so, in addition to trapping, the Ibox is trained to issue that load in-order with respect to the prior store instruction.
WAR hazards are discovered when an issued load detects an older store which references the same address; the CPU is trapped to the load address. Finally, WAW hazards are avoided by forcing the STQ to write data to the Dcache in-order. Thus, stores can be issued out-of-order and removed from the Iqueue, allowing further instruction processing, but the store data is written to the Dcache in program order.
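The address-plus-age checks described above can be sketched as follows. This is an illustrative model only: entry fields and function names are invented for the sketch, not the 21264's actual structures.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    age: int      # program-order position (smaller = older)
    addr: int     # physical address the instruction references
    done: bool    # load already delivered data / store already issued

def store_issue_check(store_age, store_addr, ldq):
    """RAW replay case: an issuing store finds a younger load to the
    same address that has already issued and delivered (stale) data."""
    return any(e.addr == store_addr and e.age > store_age and e.done
               for e in ldq)

def load_issue_check(load_age, load_addr, stq):
    """WAR trap case: an issuing load finds an older store to the
    same address still in flight."""
    return any(e.addr == load_addr and e.age < load_age for e in stq)

# A younger load (age 2) already completed; now the older store
# (age 1) to the same address issues -> hazard, replay from the store.
ldq = [Entry(age=2, addr=0x100, done=True)]
assert store_issue_check(1, 0x100, ldq)
assert not store_issue_check(1, 0x200, ldq)

# A load issues while an older store to its address is in the STQ.
stq = [Entry(age=1, addr=0x300, done=False)]
assert load_issue_check(2, 0x300, stq)
```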
As the data is delivered, it must be spliced into the execution pipeline so that dependent instructions can be issued. The fill pipeline proceeds from the Cbox as follows:

1. The Cbox informs the Mbox and the rest of the chip that fill data will be available on the Load Store Bus (LSD) in 6 cycles.
2. The Mbox receives the fill command plus the MAF number and CAMs the MAF number across the LDQ. The loads which referenced this cache block arbitrate for the two load/store pipes.
3. The Ibox stops issuing load instructions for one cycle because the LSD bus has been scheduled for fill data.
4. Bubble cycle.
5. Fill data arrives at the 21264 pins.
6. The Ibox ceases issuing loads for one cycle because the Dcache has been scheduled for a fill update. The Ibox issues instructions which were dependent on the load data because the data will be available for bypass on the LSD bus.
7. The fill data is driven on the LSD bus and is ready to be consumed.
8. The fill data and tag are written into the Dcache.

Since the cache block is 64 bytes, and the 21264 has two 64-bit fill pipelines, it takes 4 transactions across these pipelines to complete a valid fill. The tag array is written in the first cycle so newly issued loads can "hit" in the Dcache on partially filled cache blocks. After the cache block is written, the MAF number is CAMed across both the LDQ and STQ. In the case of the LDQ, this CAM indicates all the loads which wanted this cache block but were not satisfied on the original fill. These loads are placed in "retry" state and use the Mbox retry pipeline to resolve themselves. In the case of the STQ, all stores which wanted this block are informed that the block is now in the Dcache and in the appropriate cache state to be writeable. When these stores are retired, they can be written into the Dcache in-order. In some cases, the Cbox delivers a cache block which can be consumed by loads but is not writeable. In these cases, all the stores in the STQ are also placed in the "retry" state for further processing by the Mbox retry pipeline. The Mbox retry pipeline starts in the first cycle of the fill pipeline, and similar to a fill, reissues loads and stores into the execution pipeline.
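MAF-style merging of misses to the same 64-byte block, as described above, can be sketched like this. The class and method names are invented for illustration; only the 8-entry capacity, 64-byte block granularity, and CAM-then-merge behavior come from the text.

```python
# Sketch of miss-address-file merging: misses to the same 64-byte
# block share one entry, and the entry number links the merged
# references back to individual loads/stores.
BLOCK = 64

class MAF:
    def __init__(self, entries=8):
        self.capacity = entries
        self.blocks = []          # in-flight block addresses, by entry

    def lookup_or_allocate(self, addr):
        """Return (maf_number, merged) for a reference that missed."""
        block = addr // BLOCK * BLOCK
        if block in self.blocks:               # CAM hit: merge
            return self.blocks.index(block), True
        if len(self.blocks) >= self.capacity:  # structure full
            return None, False                 # caller must stall/retry
        self.blocks.append(block)              # new in-flight miss
        return len(self.blocks) - 1, False

maf = MAF()
n0, merged0 = maf.lookup_or_allocate(0x1008)  # first miss, block 0x1000
n1, merged1 = maf.lookup_or_allocate(0x1030)  # same block: merged
assert n0 == n1 and not merged0 and merged1
```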
Dcache Miss Processing

If a memory reference misses the Dcache and is not trapped or merged in stage 8 of the Mbox pipeline, a new MAF entry is generated for the reference. The Cbox finds the block in the L2 cache or main memory, and delivers the data to the Mbox.

Cbox

The Cache Box (Cbox) controls the cache subsystem within the 21264 microprocessor and has two primary tasks. First, it cooperates with the external system to
To do so, the LDQ and STQ keep lists of up to 32 loads and 32 stores, in issue order. When a new load or store arrives, addresses are compared to detect and fix hazards.
LDQ/STQ speculation
address-out and address-in buses in the system pin bus. This provides bandwidth for new address requests (out from the processor) and system probes (into the processor), and allows for simple, small-scale multiprocessor system designs. The 21264 system interface's low pin counts and high bandwidth let a high-performance system (of four or more processors) broadcast probes without using a large number of pins. The BIU stores pending system probes in an eight-entry probe queue before responding to the probes, in order. It responds to probes very quickly to support a system with minimum latency, and minimizes the address bus bandwidth required in common probe response cases.
The 21264 provides a rich set of possible coherence actions; it can scale to larger-scale system implementations, including directory-based systems.4 It supports all five of the standard MOESI (modified-owned-exclusive-shared-invalid) cache states.
The BIU supports a wide range of system data bus speeds. The peak bandwidth of the system data interface is 8 bytes of data per 1.5 CPU cycles, or 3.2 Gbytes/sec at a 400-MHz transfer rate. The load latency (issue of load to issue of consumer) can be as low as 160 ns with a 60-ns DRAM access time. The total of eight in-flight MAFs and eight in-flight victims provide many parallel memory operations to schedule for high SRAM and DRAM efficiency. This translates into high memory system performance, even with cache misses. For example, the 21264 has sustained in excess of 1.3 Gbytes/sec (user-visible) memory bandwidth on the Stream benchmark.7
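The bandwidth arithmetic quoted above checks out; the implied CPU clock in the last two lines is an inference from the stated ratio, not a figure given in the text.

```python
# Verify the quoted peak system data bandwidth: 8 bytes per transfer
# at a 400-MHz transfer rate.
bytes_per_transfer = 8
transfer_rate_hz = 400e6
peak_bw_bytes_per_s = bytes_per_transfer * transfer_rate_hz
assert peak_bw_bytes_per_s == 3.2e9   # 3.2 Gbytes/sec, as stated

# One transfer per 1.5 CPU cycles at this rate implies a CPU clock of
# 600 MHz (inferred from the ratio, not stated in the text).
cpu_clock_hz = 1.5 * transfer_rate_hz
assert cpu_clock_hz == 600e6
```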
Dynamic execution examples

The 21264 architecture is very dynamic. In this article I have discussed a number of its dynamic techniques, including the line predictor, branch predictor, and issue queue scheduling. Two more examples in this section further illustrate the 21264's dynamic adaptability.
Store/load memory ordering

The 21264 memory system supports the full capabilities of the out-of-order execution core, yet maintains an in-order architectural memory model. This is a challenge when multiple loads and stores reference the same address. The register rename logic cannot automatically handle these read-after-write memory dependencies as it does register dependencies because it does not have the memory address until the instruction issues. Instead, the memory system dynamically detects the problem case after the instructions issue (and the addresses are available).
This example shows how the 21264 dynamically adapts to avoid the costs of load misspeculation. It remembers the first misspeculation and avoids the problem in subsequent executions by delaying the load.
Figure 10 shows how the 21264 resolves a memory read-after-write hazard. The source instructions are on the far left: a store followed by a load to the same address. On the first execution of these instructions, the 21264 attempts to issue the load as early as possible, before the older store, to minimize load latency. The load receives the wrong data since it issues before the store in this case, so the 21264 hazard detection logic squashes the load (and all subsequent instructions). After this type of load misspeculation, the 21264 trains to avoid it on subsequent executions by setting a bit in a load wait table.
Figure 10 also shows what happens on subsequent executions of the same code. At fetch time the store wait table bit corresponding to the load is set. The issue queue then forces the issue point of the marked load to be delayed until all prior stores have issued, thereby avoiding this store/load order violation and also allowing the speculative store buffer to bypass the correct data to the load. This store wait table is periodically cleared to avoid unnecessary waits.
This example store/load order case shows how the memory system produces a result that is the same as an in-order memory system while capturing the performance advantages of out-of-order execution.
ALPHA 21264
IEEE MICRO
Figure 10. An example of the 21264 memory load-after-store hazard adaptation. (Assume R10 = R11.)

Source code:
    STQ R0, 0(R10)
    LDQ R1, 0(R11)

First execution (the load issues early and gets the wrong data):
    LDQ R1, 0(R11)
    STQ R0, 0(R10)

Subsequent executions (the marked, delayed load gets the store data):
    STQ R0, 0(R10)
    LDQ R1, 0(R11)
It also marks the load instruction in a predictor, so that future invocations are not speculatively executed.
Unmarked loads issue as early as possible, and before as many stores as possible, while only the necessary marked loads are delayed.
Load hit/miss prediction

There are minispeculations within the 21264's speculative execution engine. To achieve the minimum three-cycle integer load hit latency, the processor must speculatively issue the consumers of the integer load data before knowing if the load hit or missed in the on-chip data cache. This early issue allows the consumers to receive bypassed data from a load at the earliest possible time. Note in Figure 2 that the data cache stage is three cycles after the queue, or issue, stage, so the load's cache lookup must happen in parallel with the consumers' issue. Furthermore, it really takes another cycle after the cache lookup to get the hit/miss indication to the issue queue. This means that consumers of the results produced by the consumers of the load data (the beneficiaries) can also speculatively issue, even though the load may have actually missed.
The processor could rely on the general mechanisms available in the speculative execution engine to abort the integer load data's speculatively executed consumers; however, that requires restarting the entire instruction pipeline. Given that load misses can be frequent in some applications, this technique would be too expensive. Instead, the processor handles this with a minirestart. When consumers speculatively issue three cycles after a load that misses, two integer issue cycles (on all four integer pipes) are squashed. All integer instructions that issued during those two cycles are pulled back into the issue queue to be reissued later. This forces the processor to reissue both the consumers and the beneficiaries. If the load hits, the instruction schedule shown on the top of Figure 11 will be executed. If the load misses, however, the original issues of the unrelated instructions L3–L4 and U4–U6 must be reexecuted in cycles 5 and 6. The schedule thus is delayed two cycles from that depicted.
While this two-cycle window is less costly than fully restarting the processor pipeline, it still can be expensive for applications with many integer load misses. Consequently, the 21264 predicts when loads will miss and does not speculatively issue the consumers of the load data in that case. The bottom half of Figure 11 shows the example instruction schedule for this prediction. The effective load latency is five cycles rather than the minimum three for an integer load hit that is (incorrectly) predicted to miss. But more unrelated instructions are allowed to issue in the slots not taken by the consumer and the beneficiaries.
The load hit/miss predictor is the most-significant bit of a 4-bit counter that tracks the hit/miss behavior of recent loads. The saturating counter decrements by two on cycles when there is a load miss; otherwise it increments by one when there is a hit. This hit/miss predictor minimizes latencies in applications that often hit, and avoids the costs of overspeculation for applications that often miss.
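The counter behavior above can be sketched directly. The class name and the initial counter value are illustrative assumptions; the 4-bit width, MSB prediction, +1 on hit, and −2 on miss come from the text.

```python
# Sketch of the 4-bit saturating hit/miss predictor described above.
class HitMissPredictor:
    def __init__(self):
        self.counter = 15                 # start saturated (assumption)

    def predict_hit(self) -> bool:
        return (self.counter & 0b1000) != 0   # MSB = predict hit

    def update(self, hit: bool):
        if hit:
            self.counter = min(15, self.counter + 1)   # +1 on a hit
        else:
            self.counter = max(0, self.counter - 2)    # -2 on a miss

p = HitMissPredictor()
assert p.predict_hit()
for _ in range(4):            # a run of misses drags the MSB low:
    p.update(hit=False)       # 15 -> 13 -> 11 -> 9 -> 7
assert not p.predict_hit()    # consumers are now held back
for _ in range(8):            # sustained hits retrain the predictor
    p.update(hit=True)
assert p.predict_hit()
```

The asymmetric update (−2 versus +1) makes the predictor switch to the conservative schedule quickly when misses cluster, while still recovering under sustained hits.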
The 21264 treats floating-point loads differently than integer loads for load hit/miss prediction. The floating-point load latency is four cycles, with no single-cycle operations, so
MARCH–APRIL 1999
Figure 11. Integer load hit/miss prediction example. This figure depicts the execution of a workload when the selected load (P) is predicted to hit (a) and predicted to miss (b) on the four integer pipes. The cross-hatched and screened sections show the instructions that are either squashed and reexecuted from the issue queue, or delayed due to operand availability or the reexecution of other instructions.

Legend: P = producing load; C = consumer; BX = beneficiary of load; LX = unrelated instruction, lower pipes; UX = unrelated instruction, upper pipes.
of out-of-order execution.Unmarked loads issue as earlyas possible, and before asmany stores as possible, whileonly the necessary markedloads are delayed.
Load hit/miss predictionThere are minispeculations
within the 21264’s specula-tive execution engine. Toachieve the minimum three-cycle integer load hit latency,the processor must specula-tively issue the consumers ofthe integer load data beforeknowing if the load hit ormissed in the on-chip datacache. This early issue allowsthe consumers to receivebypassed data from a load atthe earliest possible time.Note in Figure 2 that the datacache stage is three cycles after the queue, orissue, stage, so the load’s cache lookup musthappen in parallel with the consumers issue.Furthermore, it really takes another cycle afterthe cache lookup to get the hit/miss indica-tion to the issue queue. This means that con-sumers of the results produced by theconsumers of the load data (the beneficiaries)can also speculatively issue—even though theload may have actually missed.
The processor could rely on the generalmechanisms available in the speculative exe-cution engine to abort the integer load data’sspeculatively executed consumers; however,that requires restarting the entire instructionpipeline. Given that load misses can be fre-quent in some applications, this techniquewould be too expensive. Instead, the proces-sor handles this with a minirestart. When con-sumers speculatively issue three cycles after aload that misses, two integer issue cycles (onall four integer pipes) are squashed. All inte-ger instructions that issued during those twocycles are pulled back into the issue queue tobe reissued later. This forces the processor toreissue both the consumers and the benefi-ciaries. If the load hits, the instruction sched-ule shown on the top of Figure 11 will beexecuted. If the load misses, however, the orig-inal issues of the unrelated instructions L3–L4
and U4–U6 must be reexecuted in cycles 5and 6. The schedule thus is delayed two cyclesfrom that depicted.
While this two-cycle window is less costlythan fully restarting the processor pipeline, itstill can be expensive for applications withmany integer load misses. Consequently, the21264 predicts when loads will miss and doesnot speculatively issue the consumers of theload data in that case. The bottom half of Fig-ure 11 shows the example instruction schedulefor this prediction. The effective load latencyis five cycles rather than the minimum threefor an integer load hit that is (incorrectly) pre-dicted to miss. But more unrelated instruc-tions are allowed to issue in the slots not takenby the consumer and the beneficiaries.
The load hit/miss predictor is the most-sig-nificant bit of a 4-bit counter that tracks thehit/miss behavior of recent loads. The satu-rating counter decrements by two on cycleswhen there is a load miss, otherwise it incre-ments by one when there is a hit. This hit/misspredictor minimizes latencies in applicationsthat often hit, and avoids the costs of over-speculation for applications that often miss.
The 21264 treats floating-point loads dif-ferently than integer loads for load hit/missprediction. The floating-point load latency isfour cycles, with no single-cycle operations, so
35MARCH–APRIL 1999
F igure 11. Integer load hit/m iss prediction example . This figure depicts the execution of aworkload when the se lected load (P) is predicted to hit (a) and predicted to m iss (b) on thefour integer pipes. The cross-hatched and screened sections show the instructions that aree ither squashed and reexecuted from the issue queue , or de layed due to operand availabili-ty or the reexecution of other instructions.
0Cycle 1 2 3 4 5 6
U1 U3 U5 U7 C
L1 L5 L6 B1
U9
U2 U4 U6 U8
L2P L3 L4 L7 L8Pred
ict m
iss
inte
ger p
ipes
U1 U3 CC U5U5
B1B1
U4U4 U6U6
L3L3 L4L4
C U5 B2
L1 B1 L5
U9
U9U2 U4 U6 U7
L2P L3 L4 L6 L7
Pred
ict h
itin
tege
r pip
es
P
C
BX
LX
UX
Producing load
Consumer
Beneficiary of load
Unrelated instruction, lower pipes
Unrelated instruction, upper pipes
Squashed and reexecuted if P misses
Delayed (rescheduled) if P misses
(a)
(b)
.
of out-of-order execution.Unmarked loads issue as earlyas possible, and before asmany stores as possible, whileonly the necessary markedloads are delayed.
Load hit/miss predictionThere are minispeculations
within the 21264’s specula-tive execution engine. Toachieve the minimum three-cycle integer load hit latency,the processor must specula-tively issue the consumers ofthe integer load data beforeknowing if the load hit ormissed in the on-chip datacache. This early issue allowsthe consumers to receivebypassed data from a load atthe earliest possible time.Note in Figure 2 that the datacache stage is three cycles after the queue, orissue, stage, so the load’s cache lookup musthappen in parallel with the consumers issue.Furthermore, it really takes another cycle afterthe cache lookup to get the hit/miss indica-tion to the issue queue. This means that con-sumers of the results produced by theconsumers of the load data (the beneficiaries)can also speculatively issue—even though theload may have actually missed.
The processor could rely on the general mechanisms available in the speculative execution engine to abort the integer load data’s speculatively executed consumers; however, that requires restarting the entire instruction pipeline. Given that load misses can be frequent in some applications, this technique would be too expensive. Instead, the processor handles this with a minirestart. When consumers speculatively issue three cycles after a load that misses, two integer issue cycles (on all four integer pipes) are squashed. All integer instructions that issued during those two cycles are pulled back into the issue queue to be reissued later. This forces the processor to reissue both the consumers and the beneficiaries. If the load hits, the instruction schedule shown on the top of Figure 11 will be executed. If the load misses, however, the original issues of the unrelated instructions L3–L4 and U4–U6 must be reexecuted in cycles 5 and 6. The schedule thus is delayed two cycles from that depicted.
While this two-cycle window is less costly than fully restarting the processor pipeline, it still can be expensive for applications with many integer load misses. Consequently, the 21264 predicts when loads will miss and does not speculatively issue the consumers of the load data in that case. The bottom half of Figure 11 shows the example instruction schedule for this prediction. The effective load latency is five cycles rather than the minimum three for an integer load hit that is (incorrectly) predicted to miss. But more unrelated instructions are allowed to issue in the slots not taken by the consumer and the beneficiaries.
The load hit/miss predictor is the most-significant bit of a 4-bit counter that tracks the hit/miss behavior of recent loads. The saturating counter decrements by two on cycles when there is a load miss; otherwise it increments by one when there is a hit. This hit/miss predictor minimizes latencies in applications that often hit, and avoids the costs of overspeculation for applications that often miss.
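The counter policy just described can be sketched in a few lines. This is an illustrative model of the described behavior, not DEC's hardware; the class name and the choice to start the counter saturated at "hit" are assumptions.

```python
class LoadHitMissPredictor:
    """Sketch of the 21264 load hit/miss predictor described above:
    a 4-bit saturating counter whose most-significant bit supplies
    the prediction. Hits increment by 1; misses decrement by 2, so
    the predictor flips toward 'miss' faster than toward 'hit'."""

    def __init__(self):
        self.counter = 15  # assumed initial state: strongly predict hit

    def predict_hit(self):
        # MSB of the 4-bit counter: values 8..15 predict "hit"
        return self.counter >= 8

    def update(self, hit):
        if hit:
            self.counter = min(15, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 2)
```

From a saturated counter, four consecutive misses (15 → 13 → 11 → 9 → 7) are enough to flip the prediction to "miss," while a single subsequent hit (7 → 8) flips it back; the asymmetric step sizes bias the predictor toward avoiding costly overspeculation.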
Subsequent execution
Alpha microprocessors have been performance leaders since their introduction in 1992. The first-generation 21064 and the later 21164 [1,2] raised expectations for the newest generation; performance leadership was again a goal of the 21264 design team. Benchmark scores of 30+ SPECint95 and 58+ SPECfp95 offer convincing evidence thus far that the 21264 achieves this goal and will continue to set a high performance standard.
A unique combination of high clock speeds and advanced microarchitectural techniques, including many forms of out-of-order and speculative execution, provide exceptional core computational performance in the 21264. The processor also features a high-bandwidth memory system that can quickly deliver data values to the execution core, providing robust performance for a wide range of applications, including those without cache locality. The advanced performance levels are attained while maintaining an installed application base. All Alpha generations are upward-compatible. Database, real-time visual computing, data mining, medical imaging, scientific/technical, and many other applications can utilize the outstanding performance available with the 21264.
Architecture highlights

The 21264 is a superscalar microprocessor that can fetch and execute up to four instructions per cycle. It also features out-of-order execution [3,4]. With this, instructions execute as soon as possible and in parallel with other nondependent work, which results in faster execution because critical-path computations start and complete quickly.
The processor also employs speculative execution to maximize performance. It speculatively fetches and executes instructions even though it may not know immediately whether the instructions will be on the final execution path. This is particularly useful, for instance, when the 21264 predicts branch directions and speculatively executes down the predicted path.
Sophisticated branch prediction, coupled with speculative and dynamic execution, extracts instruction parallelism from applications. With more functional units and these dynamic execution techniques, the processor is 50% to 200% faster than its 21164 predecessor for many applications, even though both generations can fetch at most four instructions per cycle [5].
The 21264’s memory system also enables high performance levels. On-chip and off-chip caches provide for very low latency data access. Additionally, the 21264 can service many parallel memory references to all caches in the hierarchy, as well as to the off-chip memory system. This permits very high bandwidth data access [6]. For example, the processor can sustain more than 1.3 GBytes/sec on the Stream benchmark [7].
The microprocessor’s clock frequency is 500 to 600 MHz, implemented with 15 million transistors in a 2.2-V, 0.35-micron CMOS process with six metal layers. The 3.1-cm² processor
R. E. Kessler, Compaq Computer Corporation
THE ALPHA 21264 OWES ITS HIGH PERFORMANCE TO HIGH CLOCK SPEED, MANY FORMS OF OUT-OF-ORDER AND SPECULATIVE EXECUTION, AND A HIGH-BANDWIDTH MEMORY SYSTEM.
THE ALPHA 21264 MICROPROCESSOR
0272-1732/99/$10.00 © 1999 IEEE
Designing a microprocessor is a team sport. Below are the author and acknowledgement lists for the papers whose figures I use.
Circuit Implementation of a 600MHz Superscalar RISC Microprocessor
M. Matson, D. Bailey, S. Bell, L. Biro, S. Butler, J. Clouser, J. Farrell, M. Gowan, D. Priore, and K. Wilcox
Compaq Computer Corporation, Shrewsbury, MA
Abstract

The circuit techniques used to implement a 600-MHz, out-of-order, superscalar RISC Alpha microprocessor are described. Innovative logic and circuit design created a chip that attains 30+ SPECint95 and 50+ SPECfp95, and supports a secondary cache bandwidth of 6.4 GB/s. Microarchitectural techniques were used to optimize latencies and cycle time, while a variety of static and dynamic design methods balanced critical path delays against power consumption. The chip relies heavily on full custom design and layout to meet speed and area goals. An extensive CAD suite guaranteed the integrity of the design.
1. Introduction

The design of the Alpha 21264 microprocessor [1] was driven by a desire to achieve the highest performance possible in a single-chip, 0.35-µm CMOS microprocessor. This goal was realized by combining low instruction latencies and a high frequency of operation with out-of-order issue techniques. The microprocessor fetches four instructions per cycle and can issue up to six simultaneously. Large 64-KB, two-way set-associative primary caches were included for both instructions and data; a high-bandwidth secondary cache interface transfers up to 6.4 GB/s of data into or from the chip. A phase-locked loop [2] generates the 600-MHz internal clock. The increased power that accompanies such high frequencies is managed through reduced VDD, conditional clocking, and other low-power techniques.

The Alpha microprocessor road map dictates continual improvements in architecture, circuits, and fabrication technology with each successive generation. In comparison to its predecessors [3-5], the 21264 issues instructions out of order, supports more in-flight instructions, executes more instructions in parallel, has much larger primary caches and memory bandwidth, and contains additional integer and floating-point function units. Other differences include a phase-locked loop to simplify system design, as well as conditional clocks and a clocking hierarchy to reduce power consumption and permit critical path tradeoffs. Custom circuit design enabled the incorporation of these advances while reducing the cycle time more than possible with the process shrink alone. The new 0.35-µm (drawn) process provides faster devices and die area for more features, along with reference planes for better signal integrity and power distribution.
The remainder of this paper will describe the 21264’s physical characteristics, design methodology, and major blocks, paying particular attention to the underlying design problems and implementation approaches used to solve them. The paper will conclude with a discussion of how these strategies created a microprocessor with leading-edge performance.
2. Physical Characteristics

Characteristics of the CMOS process are summarized in Table 1. The process provides two fine-pitch metal layers, two coarse-pitch metal layers, and two reference planes. The pitch of the finer layers aids in compacting the layout, while the lower resistance of the coarser layers is beneficial for clocks and long signal wires. The reference planes lower the effective impedance of the power supply and also provide a low-inductance return path for clock and signal lines. Moreover, they greatly reduce the capacitive and inductive coupling between the wires, which could otherwise induce reliability failures due to voltage undershoot or overshoot, functional failures caused by excessive noise, and wide variations in path delays due to data dependencies. Gate oxide capacitors, placed near large drivers and underneath upper-level metal routing channels, further diminish power supply noise.
Table 1: CMOS Process Technology

Feature size:       0.35 µm
Channel length:     0.25 µm
Gate oxide:         6.0 nm
VTXn/VTXp:          0.35 V / -0.35 V
Metal 1, 2:         5.7 kÅ AlCu, 1.225 µm pitch
Reference plane 1:  14.8 kÅ AlCu, VSS
Metal 3, 4:         14.8 kÅ AlCu, 2.80 µm pitch
Reference plane 2:  14.8 kÅ AlCu, VDD
The microprocessor is packaged in a 587-pin ceramic interstitial pin grid array. A CuW heat slug lowers the thermal resistance between the die and detachable heat sink. The package has a 1-µF wirebond-attached chip capacitor in addition to the distributed on-chip decoupling capacitors.
There is no “i” in T-E-A-M ...
circuits
there is enough time to resolve the exact instruction that used the load result.
Compaq has been shipping the 21264 to customers since the last quarter of 1998.
Future versions of the 21264, taking advantage of technology advances for lower cost and higher speed, will extend the Alpha’s performance leadership well into the new millennium. The next-generation 21364 and 21464 Alphas are currently being designed. They will carry the Alpha line even further into the future.
Acknowledgments

The 21264 is the fruition of many individuals, including M. Albers, R. Allmon, M. Arneborn, D. Asher, R. Badeau, D. Bailey, S. Bakke, A. Barber, S. Bell, B. Benschneider, M. Bhaiwala, D. Bhavsar, L. Biro, S. Britton, D. Brown, M. Callander, C. Chang, J. Clouser, R. Davies, D. Dever, N. Dohm, R. Dupcak, J. Emer, N. Fairbanks, B. Fields, M. Gowan, R. Gries, J. Hagan, C. Hanks, R. Hokinson, C. Houghton, J. Huggins, D. Jackson, D. Katz, J. Kowaleski, J. Krause, J. Kumpf, G. Lowney, M. Matson, P. McKernan, S. Meier, J. Mylius, K. Menzel, D. Morgan, T. Morse, L. Noack, N. O’Neill, S. Park, P. Patsis, M. Petronino, J. Pickholtz, M. Quinn, C. Ramey, D. Ramey, E. Rasmussen, N. Raughley, M. Reilly, S. Root, E. Samberg, S. Samudrala, D. Sarrazin, S. Sayadi, D. Siegrist, Y. Seok, T. Sperber, R. Stamm, J. St Laurent, J. Sun, R. Tan, S. Taylor, S. Thierauf, G. Vernes, V. von Kaenel, D. Webb, J. Wiedemeier, K. Wilcox, and T. Zou.
References
1. D. Dobberpuhl et al., "A 200 MHz 64-bit Dual Issue CMOS Microprocessor," IEEE J. Solid-State Circuits, Vol. 27, No. 11, Nov. 1992, pp. 1555-1567.
2. J. Edmondson et al., "Superscalar Instruction Execution in the 21164 Alpha Microprocessor," IEEE Micro, Vol. 15, No. 2, Apr. 1995, pp. 33-43.
3. B. Gieseke et al., "A 600 MHz Superscalar RISC Microprocessor with Out-of-Order Execution," IEEE Int'l Solid-State Circuits Conf. Dig., Tech. Papers, IEEE Press, Piscataway, N.J., Feb. 1997, pp. 176-177.
4. D. Leibholz and R. Razdan, "The Alpha 21264: A 500 MHz Out-of-Order Execution Microprocessor," Proc. IEEE Compcon 97, IEEE Computer Soc. Press, Los Alamitos, Calif., 1997, pp. 28-36.
5. R.E. Kessler, E.J. McLellan, and D.A. Webb, "The Alpha 21264 Microprocessor Architecture," Proc. 1998 IEEE Int'l Conf. Computer Design: VLSI in Computers and Processors, IEEE Computer Soc. Press, Oct. 1998, pp. 90-95.
6. M. Matson et al., "Circuit Implementation of a 600 MHz Superscalar RISC Microprocessor," 1998 IEEE Int'l Conf. Computer Design: VLSI in Computers and Processors, Oct. 1998, pp. 104-110.
7. J.D. McCalpin, "STREAM: Sustainable Memory Bandwidth in High-Performance Computers," Univ. of Virginia, Dept. of Computer Science, Charlottesville, Va.; http://www.cs.virginia.edu/stream/.
8. S. McFarling, Combining Branch Predictors, Tech. Note TN-36, Compaq Computer Corp. Western Research Laboratory, Palo Alto, Calif., June 1993; http://www.research.digital.com/wrl/techreports/abstracts/TN-36.html.
9. T. Fischer and D. Leibholz, "Design Tradeoffs in Stall-Control Circuits for 600 MHz Instruction Queues," Proc. IEEE Int'l Solid-State Circuits Conf. Dig., Tech. Papers, IEEE Press, Feb. 1998, pp. 398-399.
Richard E. Kessler is a consulting engineer in the Alpha Development Group of Compaq Computer Corp. in Shrewsbury, Massachusetts. He is an architect of the Alpha 21264 and 21364 microprocessors. His interests include microprocessor and computer system architecture. He has an MS and a PhD in computer sciences from the University of Wisconsin, Madison, and a BS in electrical and computer engineering from the University of Iowa. He is a member of the ACM and the IEEE.

Contact Kessler about this article at Compaq Computer Corp., 334 South St., Shrewsbury, MA 01545; [email protected].
ALPHA 21264
IEEE MICRO
architect
memory. Directory- or duplicate-tag-based protocols can be built using these primitives in a similar fashion.
References
[1] D. Dobberpuhl et al., "A 200-MHz 64-bit Dual Issue CMOS Microprocessor," Digital Technical Journal, vol. 4, no. 4, 1992.
[2] J. Edmondson et al., "Superscalar Instruction Execution in the 21164 Alpha Microprocessor," IEEE Micro, vol. 15, no. 2, Apr. 1995.
[3] S. McFarling, "Combining Branch Predictors," Technical Note TN-36, Digital Equipment Corporation Western Research Laboratory, June 1993. <www.research.digital.com/wrl/techreports/abstracts/TN-36.html>
Acknowledgments
The authors acknowledge the contributions of the following individuals: J. Emer, B. Gieseke, B. Grundmann, J. Keller, R. Kessler, E. McLellan, D. Meyer, J. Pierce, S. Steely, and D. Webb.
The Alpha 21264: A 500 MHz Out-of-Order Execution Microprocessor
Daniel Leibholz and Rahul Razdan Digital Equipment Corporation
Hudson, MA 01749
Abstract
This paper describes the internal organization of the 21264, a 500 MHz, out-of-order, quad-fetch, six-way issue microprocessor. The aggressive cycle time of the 21264, in combination with many architectural innovations such as out-of-order and speculative execution, enables this microprocessor to deliver an estimated 30 SPECint95 and 50 SPECfp95 performance. In addition, the 21264 can sustain 5+ Gigabytes/sec of bandwidth to an L2 cache and 3+ Gigabytes/sec to memory for high performance on memory-intensive applications.
Introduction
The 21264 is the third generation of Alpha microprocessors designed and built by Digital Semiconductor. Like its predecessors, the 21064 [1] and the 21164 [2], the design objective of the 21264 team was to build a world-class microprocessor which is the undisputed performance leader. The principal levers used to achieve this objective were:
● A cycle time (2.0 ns in 0.35-micron CMOS at 2 volts) was chosen by evaluation of the circuit loops which provide the most performance leverage. For example, an integer add and result bypass (to the next integer operation) is critical to the performance of most integer programs and is therefore a determining factor in choosing the cycle time.
● An out-of-order, superscalar execution core was built to increase the average instructions executed per cycle (IPC) for the machine. The out-of-order execution model dynamically finds instruction-level parallelism in the program and hides memory latency by executing load instructions that may be located past conditional branches.
● Performance-focused instructions were added to the Alpha architecture and implemented in the 21264. These include:
  - Motion estimation instructions that accelerate CPU-intensive video compression and decompression algorithms.
  - Prefetch instructions that enable software control of the data caches.
  - Floating-point square root and bidirectional register file transfer instructions (integer-to-floating point) that enhance floating-point performance.
● High-speed interfaces to the backup (L2) cache and system memory dramatically increase the bandwidth available from each of these sources.
The combination of these techniques delivers an estimated 30 SPECint95 and over 50 SPECfp95 performance on the standard SPEC95 benchmark suite, and over 1600 MB/s on the McCalpin STREAM benchmark. In addition, the dramatic rise in external
Figure 1. 21264 Floorplan
1063-6390/97 $10.00 © 1997 IEEE
micro-architects
Break
Multi-Threading (Dynamic Scheduling)
Power 4 (predates Power 5 shown earlier)
● Load hit store: A younger load that executes before an older store to the same memory location has written its data to the caches must retrieve the data from the SDQ. As loads execute, they check the SRQ to see whether there is any older store to the same memory location with data in the SDQ. If one is found, the data is forwarded from the SDQ rather than from the cache. If the data cannot be forwarded (as is the case if the load and store instructions operate on overlapping memory locations and the load data is not the same as or contained within the store data), the group containing the load instruction is flushed; that is, it and all younger groups are discarded and refetched from the instruction cache. If we can tell that there is an older store instruction that will write to the same memory location but has yet to write its result to the SDQ, the load instruction is rejected and reissued, again waiting for the store instruction to execute.
● Store hit load: If a younger load instruction executes before we have had a chance to recognize that an older store will be writing to the same memory location, the load instruction has received stale data. To guard against this, as a store instruction executes it checks the LRQ; if it finds a younger load that has executed and loaded from memory locations to which the store is writing, the group containing the load instruction and all younger groups are flushed and refetched from the instruction cache. To simplify the logic, all groups following the store are flushed. If the offending load is in the same group as the store instruction, the group is flushed, and all instructions in the group form single-instruction groups.
● Load hit load: Two loads to the same memory location must observe the memory reference order and prevent a store to the memory location from another processor between the intervening loads. If the younger load obtains old data, the older load must not obtain new data. This requirement is called sequential load consistency. To guard against this, LRQ entries for all loads include a bit which, if set, indicates that a snoop has occurred to the line containing the loaded data for that entry. When a load instruction executes, it compares its load address against all addresses in the LRQ. A match against a younger entry which has been snooped indicates that a sequential load consistency problem exists. To simplify the logic, all groups following the older load instruction are flushed. If both load instructions are in the same group, the flush request is for the group itself. In this case, each instruction in the group when refetched forms a single-instruction group in order to avoid this situation the second time around.
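The load-hit-store case above can be illustrated with a simplified model. The SRQ/SDQ names follow the text, but the sequence-number ages, dictionary entry layout, and byte-granularity overlap test are illustrative assumptions, not IBM's implementation; real POWER4 logic also interacts with group flush machinery not modeled here.

```python
def execute_load(load_age, load_addr, load_size, srq):
    """Model one load checking the store reorder queue (SRQ).

    Each SRQ entry is a dict {'age', 'addr', 'size', 'data'}, where
    'data' holds bytes once the store has written the SDQ, else None.
    Smaller 'age' means older in program order. Returns one of:
    ('forward', data), ('reject',), ('flush',), or ('cache',).
    """
    best = None
    for st in srq:
        if st['age'] >= load_age:
            continue  # only stores older than the load matter
        s_lo, s_hi = st['addr'], st['addr'] + st['size']
        l_lo, l_hi = load_addr, load_addr + load_size
        if s_hi <= l_lo or l_hi <= s_lo:
            continue  # disjoint addresses: no interaction
        if st['data'] is None:
            return ('reject',)  # store known, data not yet in SDQ: reissue
        if s_lo <= l_lo and l_hi <= s_hi:
            # store fully covers the load; remember the youngest such store
            if best is None or st['age'] > best['age']:
                best = st
        else:
            return ('flush',)  # partial overlap: refetch the load's group
    if best is not None:
        off = load_addr - best['addr']
        return ('forward', best['data'][off:off + load_size])
    return ('cache',)  # no older conflicting store: read the data cache
```

For example, a load of 4 bytes at address 100 against an older 4-byte store to 100 whose data is already in the SDQ forwards that data; the same load against a store whose data has not yet reached the SDQ is rejected, and against a store that only partially overlaps it triggers a flush.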
Instruction execution pipeline

Figure 4 shows the POWER4 instruction execution pipeline for the various pipelines. The IF, IC, and BP cycles correspond to the instruction-fetching and branch-prediction cycles. The D0 through GD cycles are the cycles during which instruction decode and group formation occur. The MP cycle is the mapper cycle, in which all dependencies are determined, resources assigned, and the group dispatched into the appropriate issue queues. During the ISS cycle, the IOP is issued to the appropriate execution unit, reads the appropriate
Figure 4. POWER4 instruction execution pipeline.
[Figure 4 diagram: instruction fetch (IF, IC, BP) and instruction crack/group formation (D0, D1, D2, D3, Xfer, GD) feed out-of-order processing in four pipelines: BR (MP ISS RF EX WB), LD/ST (MP ISS RF EA DC Fmt WB), FX (MP ISS RF EX WB), and FP (MP ISS RF F6 WB), all converging at group commit (CP). Branch redirects and interrupts/flushes feed back to instruction fetch.]
IBM J. RES. & DEV. VOL. 46 NO. 1 JANUARY 2002, J. M. TENDLER ET AL.
Single-threaded predecessor to Power 5. 8 execution units in out-of-order engine; each may issue an instruction each cycle.
For most apps, most execution units lie idle
[Figure 2 chart: for each application (alvinn, doduc, eqntott, espresso, fpppp, hydro2d, li, mdljdp2, mdljsp2, nasa7, ora, su2cor, swm, tomcatv) and a composite, the percent of total issue cycles (0-100) is broken down into: processor busy, itlb miss, dtlb miss, icache miss, dcache miss, branch misprediction, control hazards, load delays, short integer, long integer, short fp, long fp, and memory conflict.]
Figure 2: Sources of all unused issue cycles in an 8-issue superscalar processor. Processor busy represents the utilized issue slots; all
others represent wasted issue slots.
such as an I tlb miss and an I cache miss, the wasted cycles are
divided up appropriately. Table 3 specifies all possible sources
of wasted cycles in our model, and some of the latency-hiding or
latency-reducing techniques that might apply to them. Previous
work [32, 5, 18], in contrast, quantified some of these same effects
by removing barriers to parallelism and measuring the resulting
increases in performance.
Our results, shown in Figure 2, demonstrate that the functional
units of our wide superscalar processor are highly underutilized.
From the composite results bar on the far right, we see a utilization
of only 19% (the “processor busy” component of the composite bar
of Figure 2), which represents an average execution of less than 1.5
instructions per cycle on our 8-issue machine.
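The arithmetic behind that claim is worth making explicit: roughly 19% utilization of 8 issue slots per cycle corresponds to roughly 1.5 instructions per cycle (the quoted 19% is itself a rounded figure).

```python
# Sanity check of the numbers quoted above: composite utilization of
# roughly 19% on an 8-issue machine is roughly 1.5 instructions/cycle.
issue_width = 8
utilization = 0.19
ipc = issue_width * utilization
print(f"average IPC = {ipc:.2f}")
```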
These results also indicate that there is no dominant source of
wasted issue bandwidth. Although there are dominant items in
individual applications (e.g., mdljsp2, swm, fpppp), the dominant
cause is different in each case. In the composite results we see that
the largest cause (short FP dependences) is responsible for 37% of
the issue bandwidth, but there are six other causes that account for
at least 4.5% of wasted cycles. Even completely eliminating any
one factor will not necessarily improve performance to the degree
that this graph might imply, because many of the causes overlap.
Not only is there no dominant cause of wasted cycles — there
appears to be no dominant solution. It is thus unlikely that any single
latency-tolerating technique will produce a dramatic increase in the
performance of these programs if it only attacks specific types of
latencies. Instruction scheduling targets several important segments
of the wasted issue bandwidth, but we expect that our compiler
has already achieved most of the available gains in that regard.
Current trends have been to devote increasingly larger amounts of
on-chip area to caches, yet even if memory latencies are completely
eliminated, we cannot achieve 40% utilization of this processor. If
specific latency-hiding techniques are limited, then any dramatic
increase in parallelism needs to come from a general latency-hiding
solution, of which multithreading is an example. The different types
of multithreading have the potential to hide all sources of latency,
but to different degrees.
This becomes clearer if we classify wasted cycles as either vertical
From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-Chip Parallelism,” ISCA 1995.
For an 8-way superscalar. Observation: Most hardware in an out-of-order CPU concerns physical registers. Could several instruction threads share this hardware?
Simultaneous Multi-threading ...
[Diagram: one thread, 8 units. Issue slots over cycles 1-9 across units M M FX FX FP FP BR CC, with many slots left empty by the single thread. M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes]
[Diagram: two threads, 8 units. The same issue slots over cycles 1-9, with the second thread filling many of the slots the first thread leaves empty.]
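The two pictures can be mimicked with a toy slot-filling model. The instruction mixes below are invented for illustration, and real SMT issue logic (dependencies, fetch policy, priorities) is far more involved; the point is only that a second thread can claim slots the first leaves empty.

```python
# One cycle offers 8 issue slots, one per unit, as in the diagram.
SLOTS = ["M", "M", "FX", "FX", "FP", "FP", "BR", "CC"]

def issue(threads):
    """Greedily fill one cycle's issue slots from per-thread ready lists."""
    filled = 0
    for slot in SLOTS:
        for ready in threads:
            if slot in ready:
                ready.remove(slot)  # the instruction leaves the ready list
                filled += 1
                break               # this slot is taken; try the next one
    return filled

print(issue([["M", "FX", "FX"]]))                     # one thread fills 3 slots
print(issue([["M", "FX", "FX"], ["M", "FP", "BR"]]))  # two threads fill 6 slots
```

With one thread, only the slots matching that thread's mix are used; adding a second thread with a complementary mix doubles the filled slots in this toy cycle, which is exactly the effect the diagram illustrates.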
● Load hit store: A younger load that executes before anolder store to the same memory location has written itsdata to the caches must retrieve the data from the SDQ.As loads execute, they check the SRQ to see whetherthere is any older store to the same memory locationwith data in the SDQ. If one is found, the data isforwarded from the SDQ rather than from the cache. Ifthe data cannot be forwarded (as is the case if the loadand store instructions operate on overlapping memorylocations and the load data is not the same as orcontained within the store data), the group containingthe load instruction is flushed; that is, it and all youngergroups are discarded and refetched from the instructioncache. If we can tell that there is an older storeinstruction that will write to the same memory locationbut has yet to write its result to the SDQ, the loadinstruction is rejected and reissued, again waiting forthe store instruction to execute.
● Store hit load: If a younger load instruction executesbefore we have had a chance to recognize that an olderstore will be writing to the same memory location, theload instruction has received stale data. To guardagainst this, as a store instruction executes it checks theLRQ; if it finds a younger load that has executed andloaded from memory locations to which the store iswriting, the group containing the load instruction andall younger groups are flushed and refetched from theinstruction cache. To simplify the logic, all groupsfollowing the store are flushed. If the offending load isin the same group as the store instruction, the group isflushed, and all instructions in the group form single-instruction groups.
● Load hit load: Two loads to the same memory locationmust observe the memory reference order and preventa store to the memory location from another processorbetween the intervening loads. If the younger loadobtains old data, the older load must not obtainnew data. This requirement is called sequential loadconsistency. To guard against this, LRQ entries for allloads include a bit which, if set, indicates that a snoophas occurred to the line containing the loaded datafor that entry. When a load instruction executes, itcompares its load address against all addresses in theLRQ. A match against a younger entry which has beensnooped indicates that a sequential load consistencyproblem exists. To simplify the logic, all groupsfollowing the older load instruction are flushed. If bothload instructions are in the same group, the flushrequest is for the group itself. In this case, eachinstruction in the group when refetched forms a single-instruction group in order to avoid this situation thesecond time around.
Instruction execution pipelineFigure 4 shows the POWER4 instruction executionpipeline for the various pipelines. The IF, IC, and BPcycles correspond to the instruction-fetching and branch-prediction cycles. The D0 through GD cycles are thecycles during which instruction decode and groupformation occur. The MP cycle is the mapper cycle,in which all dependencies are determined, resourcesassigned, and the group dispatched into the appropriateissue queues. During the ISS cycle, the IOP is issued tothe appropriate execution unit, reads the appropriate
Figure 4. POWER4 instruction execution pipeline. [Figure: stage diagrams for the branch (BR), load/store (LD/ST), fixed-point (FX), and floating-point (FP) pipelines: IF, IC, BP, D0–D3, Xfer, GD, MP, ISS, RF, then EX (BR/FX), EA–DC–Fmt (LD/ST), or F6 (FP), followed by WB and CP; branch redirects, interrupts and flushes, instruction crack and group formation, and out-of-order processing are annotated.]
IBM J. RES. & DEV. VOL. 46 NO. 1 JANUARY 2002 J. M. TENDLER ET AL.
The Power5 scans fetched instructions for branches (BP stage), and if it finds a branch, predicts the branch direction using three branch history tables shared by the two threads. Two of the BHTs use bimodal and path-correlated branch prediction mechanisms to predict branch directions.6,7 The third BHT predicts which of these prediction mechanisms is more likely to predict the correct direction.7 If the fetched instructions contain multiple branches, the BP stage can predict all the branches at the same time. In addition to predicting direction, the Power5 also predicts the target of a taken branch in the current cycle's eight-instruction group. In the PowerPC architecture, the processor can calculate the target of most branches from the instruction's address and offset value. For
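A chooser-based scheme of this shape — two direction predictors plus a selector table — can be sketched as follows. The table sizes, indexing, and update policy here are illustrative assumptions, not the Power5's actual design.

```python
# Illustrative chooser-based branch predictor: a bimodal table, a
# path-correlated table, and a selector table that picks between them.
# All sizes and hash functions are invented for this sketch.

SIZE = 1024

bimodal  = [1] * SIZE   # 2-bit counters indexed by branch address
path     = [1] * SIZE   # 2-bit counters indexed by address ^ path history
selector = [1] * SIZE   # 2-bit counters: >= 2 means "trust path predictor"
history  = 0

def predict(pc):
    b = bimodal[pc % SIZE] >= 2
    p = path[(pc ^ history) % SIZE] >= 2
    return p if selector[pc % SIZE] >= 2 else b

def update(pc, taken):
    global history
    b_idx, p_idx = pc % SIZE, (pc ^ history) % SIZE
    b = bimodal[b_idx] >= 2
    p = path[p_idx] >= 2
    # Train the selector only when the two predictors disagree.
    if b != p:
        delta = 1 if p == taken else -1
        selector[pc % SIZE] = min(3, max(0, selector[pc % SIZE] + delta))
    # Train both direction predictors toward the actual outcome.
    for table, idx in ((bimodal, b_idx), (path, p_idx)):
        table[idx] = min(3, max(0, table[idx] + (1 if taken else -1)))
    history = ((history << 1) | taken) & (SIZE - 1)
```

The selector learns, per branch, which of the two mechanisms has been more accurate, which is the role the text describes for the third BHT.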
MARCH–APRIL 2004
[Figure: Power5 pipeline stage diagram for the branch, load/store, fixed-point, and floating-point pipelines, with instruction fetch, group formation and instruction decode, out-of-order processing, branch redirects, and interrupts and flushes annotated.]
Figure 3. Power5 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit).
[Figure: Power5 instruction data flow, showing per-thread program counters and instruction buffers (thread 0 and thread 1 resources); shared branch prediction (branch history tables, return stack, target cache); shared instruction cache and instruction translation; group formation and instruction decode; dispatch; shared register mappers; shared issue queues; shared execution units (FXU0/1, LSU0/1, FPU0/1, BXU, CRL); shared read/write register files; store queue; data cache and data translation; L2 cache; and group completion. Dynamic instruction selection is governed by thread priority.]
Figure 4. Power5 instruction data flow (BXU = branch execution unit and CRL = condition register logical execution unit).
Power 4
Power 5
2 fetch (PC), 2 initial decodes
2 commits (architected register sets)
Power 5 data flow ...
Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.
Power 5 thread performance ...
mode. In this mode, the Power5 gives all the physical resources, including the GPR and FPR rename pools, to the active thread, allowing it to achieve higher performance than a Power4 system at equivalent frequencies.
The Power5 supports two types of ST operation: An inactive thread can be in either a dormant or a null state. From a hardware perspective, the only difference between these states is whether or not the thread awakens on an external or decrementer interrupt. In the dormant state, the operating system boots up in SMT mode but instructs the hardware to put the thread into the dormant state when there is no work for that thread. To make a dormant thread active, either the active thread executes a special instruction, or an external or decrementer interrupt targets the dormant thread. The hardware detects these scenarios and changes the dormant thread to the active state. It is software's responsibility to restore the architected state of a thread transitioning from the dormant to the active state.
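The dormant/null distinction reduces to a small state machine. The sketch below is a simplification of the behavior described above; the class shape and event names are invented for illustration.

```python
# Toy model of the Power5 inactive-thread states described in the text.
# State names follow the text; the event interface is an assumption.

class Thread:
    def __init__(self):
        self.state = "active"

    def software_idles(self, to_state):
        # The OS parks a thread it has no work for ("dormant"), or tells
        # the hardware the thread does not exist at all ("null").
        assert to_state in ("dormant", "null")
        self.state = to_state

    def event(self, kind):
        # Only a dormant thread awakens on these events; software must
        # then restore the thread's architected state itself.
        if self.state == "dormant" and kind in (
                "special_instruction", "external_interrupt",
                "decrementer_interrupt"):
            self.state = "active"

t = Thread()
t.software_idles("dormant")
t.event("decrementer_interrupt")   # dormant thread wakes
u = Thread()
u.software_idles("null")
u.event("external_interrupt")      # null thread ignores interrupts
```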
When a thread is in the null state, the operating system is unaware of the thread's existence. As in the dormant state, the operating system does not allocate resources to a null thread. This mode is advantageous if all the system's executing tasks perform better in ST mode.
Dynamic power management
In current CMOS technologies, chip power has become one of the most important design parameters. With the introduction of SMT, more instructions execute per cycle per processor core, thus increasing the core's and the chip's total switching power. To reduce switching power, Power5 chips use a fine-grained, dynamic clock-gating mechanism extensively. This mechanism gates off clocks to a local clock buffer if dynamic power management logic knows the set of latches driven by the buffer will not be used in the next cycle. For example, if the GPRs are guaranteed not to be read in a given cycle, the clock-gating mechanism turns off the clocks to the GPR read ports. This allows substantial power saving with no performance impact.
In every cycle, the dynamic power management logic determines whether a local clock buffer that drives a set of latches can be clock gated in the next cycle. The set of latches driven by a clock-gated local clock buffer can still be read but cannot be written. We used power-modeling tools to estimate the utilization of various design macros and their associated switching power across a range of workloads. We then determined the benefit of clock gating for those macros, implementing cycle-by-cycle dynamic power management in macros where such management provided a reasonable power-saving benefit. We paid special attention to ensuring that clock gating causes no performance loss and that clock-gating logic does not create a critical timing path. A minimum amount of logic implements the clock-gating function.
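The per-cycle decision can be modeled simply: clock a buffer only in cycles where its latches are known to be needed. A toy model, with an invented read schedule standing in for the GPR read-port example above:

```python
# Toy model of fine-grained clock gating: the GPR read-port clocks
# toggle only in cycles where a read is scheduled. The schedule and
# structure are invented for illustration.

def clocked_cycles(read_schedule):
    """read_schedule[c] is True if the GPR read ports are used in
    cycle c. Returns how many cycles the read-port clocks toggle."""
    clocked = 0
    for will_read in read_schedule:
        if will_read:
            clocked += 1
        # else: the local clock buffer is gated off this cycle,
        # saving switching power with no performance impact.
    return clocked

schedule = [True, False, False, True, False, True, False, False]
saved = len(schedule) - clocked_cycles(schedule)
```

In this schedule the clocks toggle in only 3 of 8 cycles; the other 5 cycles' switching power in that macro is saved.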
In addition to switching power, leakage power has become a performance limiter. To reduce leakage power, the Power5 uses transistors with low threshold voltage only in critical paths, such as the FPR read path. We implemented the Power5 SRAM arrays mainly with high threshold voltage devices.
The Power5 also has a low-power mode, enabled when the system software instructs the hardware to execute both threads at the lowest available priority. In low-power mode, instructions dispatch once every 32 cycles at
HOT CHIPS 15, IEEE MICRO
[Figure: instructions per cycle (IPC) for thread 0 and thread 1 across (thread 0, thread 1) priority pairs, from power-save mode (1,1 / 0,1 / 1,0) through balanced pairs (7,7 down to 2,2) to single-thread mode (7,0 / 0,7).]
Figure 5. Effects of thread priority on performance.
Relative priority of each thread controllable in hardware.
For balanced operation, both threads run slower than if they “owned” the machine.
Multi-Core
Recall: Superscalar utilization by a thread
[Figure: stacked bars showing, for each application (alvinn, doduc, eqntott, espresso, fpppp, hydro2d, li, mdljdp2, mdljsp2, nasa7, ora, su2cor, swm, tomcatv) and a composite, the percent of total issue cycles spent processor busy versus wasted on itlb/dtlb misses, icache/dcache misses, branch misprediction, control hazards, load delays, short/long integer, short/long fp, and memory conflicts.]
Figure 2: Sources of all unused issue cycles in an 8-issue superscalar processor. Processor busy represents the utilized issue slots; all others represent wasted issue slots.
such as an I tlb miss and an I cache miss, the wasted cycles are
divided up appropriately. Table 3 specifies all possible sources
of wasted cycles in our model, and some of the latency-hiding or
latency-reducing techniques that might apply to them. Previous
work [32, 5, 18], in contrast, quantified some of these same effects
by removing barriers to parallelism and measuring the resulting
increases in performance.
Our results, shown in Figure 2, demonstrate that the functional
units of our wide superscalar processor are highly underutilized.
From the composite results bar on the far right, we see a utilization
of only 19% (the “processor busy” component of the composite bar
of Figure 2), which represents an average execution of less than 1.5
instructions per cycle on our 8-issue machine.
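The relationship between the 19% figure and the IPC figure is simple arithmetic on the issue width:

```python
# Utilization-to-IPC arithmetic for the 8-issue machine of Figure 2.
issue_width = 8          # issue slots available per cycle
utilization = 0.19       # "processor busy" share of all issue slots

ipc = issue_width * utilization
print(f"average IPC = {ipc:.2f} of a possible {issue_width}")
```

The rounded 19% gives 1.52; the paper's "less than 1.5" corresponds to a busy share just under 19%.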
These results also indicate that there is no dominant source of
wasted issue bandwidth. Although there are dominant items in
individual applications (e.g., mdljsp2, swm, fpppp), the dominant
cause is different in each case. In the composite results we see that
the largest cause (short FP dependences) is responsible for 37% of
the issue bandwidth, but there are six other causes that account for
at least 4.5% of wasted cycles. Even completely eliminating any
one factor will not necessarily improve performance to the degree
that this graph might imply, because many of the causes overlap.
Not only is there no dominant cause of wasted cycles — there
appears to be no dominant solution. It is thus unlikely that any single
latency-tolerating technique will produce a dramatic increase in the
performance of these programs if it only attacks specific types of
latencies. Instruction scheduling targets several important segments
of the wasted issue bandwidth, but we expect that our compiler
has already achieved most of the available gains in that regard.
Current trends have been to devote increasingly larger amounts of
on-chip area to caches, yet even if memory latencies are completely
eliminated, we cannot achieve 40% utilization of this processor. If
specific latency-hiding techniques are limited, then any dramatic
increase in parallelism needs to come from a general latency-hiding
solution, of which multithreading is an example. The different types
of multithreading have the potential to hide all sources of latency,
but to different degrees.
This becomes clearer if we classify wasted cycles as either vertical
For an 8-way superscalar. Observation: In many cases, the on-chip cache and DRAM I/O bandwidth is also underutilized by one CPU. So, let 2 cores share them.
Most of Power 5 die is shared hardware
supports a 1.875-Mbyte on-chip L2 cache. Power4 and Power4+ systems both have 32-Mbyte L3 caches, whereas Power5 systems have a 36-Mbyte L3 cache. The L3 cache operates as a backdoor with separate buses for reads and writes that operate at half processor speed. In Power4 and Power4+ systems, the L3 was an inline cache for data retrieved from memory. Because of the higher transistor density of the Power5's 130-nm technology, we could move the memory controller on chip and eliminate a chip previously needed for the memory controller function. These two changes in the Power5 also have the significant side benefits of reducing latency to the L3 cache and main memory, as well as reducing the number of chips necessary to build a system.

Chip overview
Figure 2 shows the Power5 chip, which IBM fabricates using silicon-on-insulator (SOI) devices and copper interconnect. SOI technology reduces device capacitance to increase transistor performance.5 Copper interconnect decreases wire resistance and reduces delays in wire-dominated chip-timing paths. In 130 nm lithography, the chip uses eight metal levels and measures 389 mm².

The Power5 processor supports the 64-bit PowerPC architecture. A single die contains two identical processor cores, each supporting two logical threads. This architecture makes the chip appear as a four-way symmetric multiprocessor to the operating system. The two cores share a 1.875-Mbyte (1,920-Kbyte) L2 cache. We implemented the L2 cache as three identical slices with separate controllers for each. The L2 slices are 10-way set-associative with 512 congruence classes of 128-byte lines. The data's real address determines which L2 slice the data is cached in. Either processor core can independently access each L2 controller.

We also integrated the directory for an off-chip 36-Mbyte L3 cache on the Power5 chip. Having the L3 cache directory on chip allows the processor to check the directory after an L2 miss without experiencing off-chip delays. To reduce memory latencies, we integrated the memory controller on the chip. This eliminates driver and receiver delays to an external controller.

Processor core
We designed the Power5 processor core to support both enhanced SMT and single-threaded (ST) operation modes. Figure 3 shows the Power5's instruction pipeline, which is identical to the Power4's. All pipeline latencies in the Power5, including the branch misprediction penalty and load-to-use latency with an L1 data cache hit, are the same as in the Power4. The identical pipeline structure lets optimizations designed for Power4-based systems perform equally well on Power5-based systems. Figure 4 shows the Power5's instruction flow diagram.

In SMT mode, the Power5 uses two separate instruction fetch address registers to store the program counters for the two threads. Instruction fetches (IF stage) alternate between the two threads. In ST mode, the Power5 uses only one program counter and can fetch instructions for that thread every cycle. It can fetch up to eight instructions from the instruction cache (IC stage) every cycle. The two threads share the instruction cache and the instruction translation facility. In a given cycle, all fetched instructions come from the same thread.

Figure 2. Power5 chip (FXU = fixed-point execution unit, ISU = instruction sequencing unit, IDU = instruction decode unit, LSU = load/store unit, IFU = instruction fetch unit, FPU = floating-point unit, and MC = memory controller).

[Die photo annotations: Core #1; Core #2; Shared Components: L2 cache, L3 cache control, DRAM controller]
Core-to-core interactions stay on chip
(1) Threads on two cores that use shared libraries conserve L2 memory.
(2) Threads on two cores share memory via L2 cache operations. Much faster than 2 CPUs on 2 chips.
Sun Niagara
The case for Sun’s Niagara ...
For an 8-way superscalar. Observation: Some apps struggle to reach a CPI of 1. For throughput on these apps, a large number of single-issue cores is better than a few superscalars.
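A back-of-envelope model shows why many simple cores win on throughput for such apps. The 19% busy figure for the wide core comes from the discussion above; the 70% per-core utilization assumed for a 4-way multithreaded single-issue core is an illustrative number, not a measurement.

```python
# Back-of-envelope throughput comparison motivating Niagara-style chips.
# The 0.70 utilization for a multithreaded single-issue core is assumed.

wide_core_ipc = 8 * 0.19        # one 8-issue core, ~19% of slots busy
simple_cores_ipc = 8 * 1 * 0.70 # 8 single-issue cores, each assumed
                                # 70% busy thanks to 4-way multithreading

print(f"wide core: {wide_core_ipc:.2f} IPC, "
      f"many simple cores: {simple_cores_ipc:.2f} aggregate IPC")
```

Under these assumptions the simple-core chip sustains several times the aggregate IPC, at the cost of single-thread latency.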
Niagara (original): 32 threads on one chip
8 cores: single-issue, 1.2 GHz, 6-stage pipeline, 4-way multi-threaded, fast crypto support
Shared resources: 3 MB on-chip cache, 4 DDR2 interfaces, 32 GB DRAM, 20 Gb/s, 1 shared FP unit, GB Ethernet ports
Die size: 340 mm² in 90 nm. Power: 50-60 W
Sources: Hot Chips, via EE Times, Infoworld. J Schwartz weblog (Sun COO)
The board that booted Niagara first-silicon
Source: J Schwartz weblog (then Sun COO, now CEO)
Used in Sun Fire T2000: “Coolthreads”
Web server benchmarks used to position the T2000 in the market.
Claim: server uses 1/3 the power of competing servers.
© 2013 International Business Machines Corporation

IBM RISC chips, since Power 4 (2001) ...

                    POWER5     POWER6     POWER7     POWER7+    POWER8
                    2004       2007       2010       2012       2014
Technology          130nm SOI  65nm SOI   45nm SOI   32nm SOI   22nm SOI
                                          eDRAM      eDRAM      eDRAM
Compute cores       2          2          8          8          12
Threads             SMT2       SMT2       SMT4       SMT4       SMT8
On-chip caching     1.9MB      8MB        2 + 32MB   2 + 80MB   6 + 96MB
Off-chip caching    36MB       32MB       None       None       128MB
Sust. mem. BW       15GB/s     30GB/s     100GB/s    100GB/s    230GB/s
Peak I/O BW         3GB/s      10GB/s     20GB/s     20GB/s     48GB/s
[Die photo: VSU, FXU, IFU, DFU, ISU, LSU]

Larger caching structures vs. POWER7:
• 2x L1 data cache (64 KB)
• 2x outstanding data cache misses
• 4x translation cache

Wider load/store:
• 32B → 64B L2 to L1 data bus
• 2x data cache to execution dataflow

Enhanced prefetch:
• Instruction speculation awareness
• Data prefetch depth awareness
• Adaptive bandwidth awareness
• Topology awareness

Execution improvement vs. POWER7:
• SMT4 → SMT8
• 8 dispatch, 10 issue
• 16 execution pipes: 2 FXU, 2 LSU, 2 LU, 4 FPU, 2 VMX, 1 Crypto, 1 DFU, 1 CR, 1 BR
• Larger issue queues (4 x 16-entry)
• Larger global completion, load/store reorder
• Improved branch prediction
• Improved unaligned storage access

Core performance vs. POWER7: ~1.6x single thread, ~2x max SMT
Recap: Dynamic Scheduling
Three big ideas: register renaming, data-driven detection of RAW resolution, bus-based architecture.
Has saved architectures that have a small number of registers: IBM 360 floating-point ISA, Intel x86 ISA.
Very complex, but enables many things: out-of-order execution, multiple issue, loop unrolling, etc.
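Register renaming, the first of the recap's big ideas, can be sketched in a few lines. The free-list structure below is illustrative; real renamers also reclaim physical registers at commit.

```python
# Minimal register-renaming sketch: a small architectural register set
# (as in the IBM 360 FP or x86 ISAs) is mapped onto a larger physical
# register file, removing WAR/WAW hazards while preserving RAW ones.

class Renamer:
    def __init__(self, num_arch, num_phys):
        self.map = list(range(num_arch))             # arch -> phys
        self.free = list(range(num_arch, num_phys))  # unused phys regs

    def rename(self, dst, srcs):
        """Rename one instruction: sources read the current mapping,
        the destination gets a fresh physical register."""
        phys_srcs = [self.map[s] for s in srcs]
        new_dst = self.free.pop(0)
        self.map[dst] = new_dst
        return new_dst, phys_srcs

r = Renamer(num_arch=4, num_phys=8)
# r1 = r2 + r3 ; r1 = r1 + r2  -- the second write of r1 gets its own
# physical register, so WAW/WAR on r1 no longer constrain scheduling.
d1, s1 = r.rename(1, [2, 3])
d2, s2 = r.rename(1, [1, 2])
```

Note that the second instruction's first source maps to the first instruction's destination: the true RAW dependence survives renaming, which is exactly what the data-driven scheduling hardware then tracks.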
On Tuesday
Epilogue ...
Have a good weekend!