Slide 1 — REK, August 1998

The Alpha 21264 Microprocessor: Out-of-Order Execution at 600 MHz

R. E. Kessler
COMPAQ Computer Corporation
Shrewsbury, MA
Slide 2

Some Highlights

• Continued Alpha performance leadership
  - 600 MHz operation in 0.35 µm CMOS6, 6 metal layers, 2.2 V
  - 15 million transistors, 3.1 cm², 587-pin PGA
  - SPECint95 of 30+ and SPECfp95 of 50+
  - Out-of-order and speculative execution
  - 4-way integer issue
  - 2-way floating-point issue
  - Sophisticated tournament branch prediction
  - High-bandwidth memory system (1+ GB/sec)
Slide 3

Alpha 21264: Block Diagram

[Block diagram. Pipeline stages 0-6: FETCH, MAP, QUEUE, REG, EXEC, DCACHE. Fetch: 64 KB 2-set L1 instruction cache with branch predictors and next-line address logic, delivering 4 instructions/cycle. Integer side: integer register map, 20-entry integer issue queue, two 80-entry register files feeding four Exec pipes. Floating-point side: FP register map, 15-entry FP issue queue, 72-entry register file, FP ADD/Div/Sqrt and FP MUL pipes. Memory: 64 KB 2-set L1 data cache, victim buffer, miss-address path, and a bus interface unit with a 128-bit cache bus, 64-bit system bus, and 44-bit physical addresses. 80 in-flight instructions plus 32 loads and 32 stores.]
Slide 4

Alpha 21264: Block Diagram (repeated from Slide 3)
Slide 5

21264 Instruction Fetch Bandwidth Enablers

• The 64 KB two-way associative instruction cache supplies four instructions every cycle
• The next-fetch and set predictors provide the fast cache access times of a direct-mapped cache and eliminate bubbles in non-sequential control flows
• The instruction fetcher speculates through up to 20 branch predictions to supply a continuous stream of instructions
• The tournament branch predictor dynamically selects between local and global history to minimize mispredicts
Slide 6

Instruction Stream Improvements

[Diagram: the PC indexes the Icache data array, the set-0 and set-1 tags (with comparators producing hit/miss/set-miss), and a set predictor; each cache line also stores the next line address + set, so PC generation overlaps instruction decode and branch prediction. Benefits called out: 0-cycle penalty on branches, set-associative behavior at direct-mapped speed, and the predictor learns dynamic jumps.]
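The next-line/set idea above is simple enough to model directly. The sketch below is a hypothetical simplified model (not the 21264's exact tables): each line remembers the predicted address and set of the next fetch, defaulting to sequential fall-through, and is retrained after a mispredict.

```python
class NextLinePredictor:
    """Sketch of a next-line/set (way) predictor: each icache line
    stores the predicted (next line, set) of the following fetch, so
    the cache is accessed like a direct-mapped cache with no bubble."""
    def __init__(self):
        self.next_line = {}  # line index -> (predicted next line, predicted set)

    def predict(self, line_index):
        # Untrained default: fall through sequentially into set 0.
        return self.next_line.get(line_index, (line_index + 1, 0))

    def train(self, line_index, actual_next, actual_set):
        # Learns taken branches and dynamic jumps after a mispredict.
        self.next_line[line_index] = (actual_next, actual_set)

p = NextLinePredictor()
assert p.predict(5) == (6, 0)    # untrained: sequential fetch
p.train(5, 100, 1)               # a taken branch from line 5 to line 100, set 1
assert p.predict(5) == (100, 1)  # later fetches follow the branch with no bubble
```

Because the prediction is stored with the line itself, the fetch pipeline never waits for decode or tag comparison to pick the next fetch address.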
Slide 7

Fetch Stage: Branch Prediction

• Some branch directions can be predicted based on their past behavior: local correlation
    if (a%2 == 0)   history TNTN... (predict from its own repeating pattern)
    if (b == 0)     history NNNN... (predict not taken)
• Others can be predicted based on how the program arrived at the branch: global correlation
    if (a == 0)   ...taken...
    if (a == b)   ...taken...
    if (b == 0)   (predict taken: if a == 0 and a == b, then b must be 0)
Slide 8

Tournament Branch Prediction

[Diagram: the program counter indexes a local history table (1024 × 10 bits); the 10-bit local history indexes a local prediction table (1024 × 3-bit counters). The global path history indexes a global prediction table (4096 × 2-bit counters) and a choice prediction table (4096 × 2-bit counters) that selects which component supplies the final prediction.]
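The table structure above maps naturally onto a small behavioral simulator. The following sketch assumes conventional saturating-counter updates and chooser training on disagreement; those policies are assumptions, as the slide gives only the table shapes.

```python
class TournamentPredictor:
    """Sketch of the slide's predictor: local (1024 x 10-bit histories
    indexing 1024 x 3-bit counters), global (4096 x 2-bit counters on
    path history), and a 4096 x 2-bit choice table picking between them."""
    def __init__(self):
        self.local_hist = [0] * 1024    # per-branch 10-bit histories
        self.local_pred = [4] * 1024    # 3-bit counters, >= 4 means taken
        self.global_pred = [2] * 4096   # 2-bit counters, >= 2 means taken
        self.choice = [2] * 4096        # 2-bit counters, >= 2 favors global
        self.path_hist = 0              # 12-bit global path history

    def _components(self, pc):
        lh = self.local_hist[pc % 1024]
        return self.local_pred[lh] >= 4, self.global_pred[self.path_hist] >= 2

    def predict(self, pc):
        local_t, global_t = self._components(pc)
        return global_t if self.choice[self.path_hist] >= 2 else local_t

    def update(self, pc, taken):
        li = pc % 1024
        lh = self.local_hist[li]
        local_t, global_t = self._components(pc)
        # Train the chooser only when the two predictors disagree.
        if local_t != global_t:
            c = self.choice[self.path_hist]
            self.choice[self.path_hist] = (min(3, c + 1) if global_t == taken
                                           else max(0, c - 1))
        # Saturating-counter updates, then shift the histories.
        lp = self.local_pred[lh]
        self.local_pred[lh] = min(7, lp + 1) if taken else max(0, lp - 1)
        gp = self.global_pred[self.path_hist]
        self.global_pred[self.path_hist] = min(3, gp + 1) if taken else max(0, gp - 1)
        self.local_hist[li] = ((lh << 1) | int(taken)) & 0x3FF
        self.path_hist = ((self.path_hist << 1) | int(taken)) & 0xFFF
```

After warm-up, a branch with a repeating pattern such as TNTN... from Slide 7 is predicted correctly from its own local history, while the global tables capture cross-branch correlation.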
Slide 9

Alpha 21264: Block Diagram (repeated from Slide 3)
Slide 10

Mapper and Queue Stages

• Mapper:
  - Renames 4 instructions per cycle (8 source / 4 destination registers)
  - 80 integer + 72 floating-point physical registers
• Queue stage:
  - Integer: 20 entries / quad-issue
  - Floating point: 15 entries / dual-issue
  - Instructions issue out of order when their data is ready
  - Prioritized from oldest to youngest each cycle
  - Instructions leave the queues after they issue
  - The queue collapses every cycle as needed
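What the map stage does can be sketched in a few lines. This is a simplified free-list model, not the 21264's CAM-based implementation: each architectural destination gets a fresh physical register, and sources read the most recent mapping, so write-after-write and write-after-read hazards disappear.

```python
from collections import deque

class RenameMap:
    """Simplified register-renaming sketch: architectural registers map
    to physical registers; each destination write allocates a fresh
    physical register so older in-flight values stay readable."""
    def __init__(self, arch_regs=32, phys_regs=80):
        self.table = list(range(arch_regs))             # arch -> phys
        self.free = deque(range(arch_regs, phys_regs))  # unallocated phys regs

    def rename(self, dest, srcs):
        phys_srcs = [self.table[s] for s in srcs]  # read current mappings
        new_phys = self.free.popleft()             # allocate for the dest
        old_phys = self.table[dest]
        self.table[dest] = new_phys
        # old_phys is freed when the instruction retires (not modeled here)
        return new_phys, phys_srcs, old_phys

rm = RenameMap()
# ADDQ R1, R2 -> R3 followed by SUBQ R3, R4 -> R3: the second write to R3
# gets a new physical register, and the SUBQ reads the ADDQ's result.
d1, s1, _ = rm.rename(3, [1, 2])
d2, s2, _ = rm.rename(3, [3, 4])
assert s2[0] == d1 and d2 != d1
```

Saved map state (shown on the next slide) lets the mapper roll this table back instantly on a branch mispredict.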
Slide 11

Map and Queue Stages

[Diagram: 4 instructions/cycle flow into the MAPPER (CAMs, 80 entries, with saved map state), which turns register numbers into renamed registers, and then into the 20-entry ISSUE QUEUE; a register scoreboard drives request/grant lines into an arbiter, which selects the issued instructions.]
Slide 12

Alpha 21264: Block Diagram (repeated from Slide 3)
Slide 13

Register and Execute Stages

[Diagram. Floating-point execution units: one register file feeding an FP add pipe (which also handles divide and square root) and an FP multiply pipe. Integer execution units: two clusters (cluster 0 and cluster 1), each with its own register file and two pipes: an upper pipe with add/logic and shift/branch (plus multiply in one cluster and motion video in the other) and a lower pipe with add/logic/memory.]
Slide 14

Integer Cross-Cluster Instruction Scheduling and Execution

[Diagram: the two integer clusters from the previous slide.]

• Instructions are statically pre-slotted to the upper or lower execution pipes
• The issue queue dynamically selects between the left and right clusters
• This has most of the performance of 4-way issue with the simplicity of 2-way issue
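The two-level scheme above (static row, dynamic column) can be sketched as follows. The opcode classes here are illustrative placeholders, not the 21264's actual slotting rules:

```python
# Hypothetical sketch of 21264-style integer slotting: an instruction is
# statically assigned to the upper or lower pipe row by its opcode class,
# and the issue queue picks a free cluster (0 or 1) dynamically.
UPPER_ONLY = {"SLL", "SRL", "BR", "MUL", "MVI"}  # shift/branch/mul/video ops

def slot(opcode):
    """Static pre-slotting: upper row for shift/branch-class ops,
    lower row (which also owns the memory ports) otherwise."""
    return "upper" if opcode in UPPER_ONLY else "lower"

def issue(opcode, busy):
    """Dynamic cluster selection: take any free pipe in the slotted row.
    `busy` maps (row, cluster) -> bool."""
    row = slot(opcode)
    for cluster in (0, 1):
        if not busy.get((row, cluster), False):
            return row, cluster
    return None  # both clusters busy in that row: stall this cycle

busy = {("upper", 0): True}
assert issue("SLL", busy) == ("upper", 1)  # upper row: cluster 0 busy -> 1
assert issue("LDQ", busy) == ("lower", 0)  # memory op goes to a lower pipe
```

The static row assignment keeps the arbiter small (each queue entry bids on one row), while the dynamic cluster choice recovers most of the flexibility of full 4-way scheduling.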
Slide 15

Alpha 21264: Block Diagram (repeated from Slide 3)
Slide 16

21264 On-Chip Memory System Features

• Two loads/stores per cycle
  - any combination
• 64 KB two-way associative L1 data cache (9.6 GB/sec)
  - phase-pipelined at > 1 GHz (no bank conflicts!)
  - 3-cycle latency (issue to issue of consumer) with hit prediction
• Out-of-order and speculative execution
  - minimizes effective memory latency
• 32 outstanding loads and 32 outstanding stores
  - maximizes memory-system parallelism
• Speculative stores forward data to subsequent loads
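The last bullet, store-to-load forwarding, can be sketched as a store-buffer lookup. This is a simplified software model, not the 21264's exact hardware: a load checks older buffered stores for an address match before reading the cache, and the youngest matching store wins.

```python
def load_value(addr, store_buffer, cache):
    """Sketch of speculative store-to-load forwarding: `store_buffer`
    holds (addr, data) pairs for older, not-yet-retired stores (youngest
    last); `cache` is a dict standing in for the data cache."""
    for st_addr, st_data in reversed(store_buffer):
        if st_addr == addr:
            return st_data   # bypassed from the store queue
    return cache[addr]       # no match: read the data cache

cache = {0x1000: 7, 0x1008: 9}         # hypothetical memory contents
stores = [(0x1000, 41), (0x1000, 42)]  # two older in-flight stores
assert load_value(0x1000, stores, cache) == 42  # youngest store forwards
assert load_value(0x1008, stores, cache) == 9   # no match: cache data
```

Forwarding lets a dependent load proceed without waiting for the store to retire and write the cache.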
Slide 17

Memory System (Contd.)

• New references check addresses against existing references
• Speculative store data whose address matches bypasses to loads
• Multiple misses to the same cache block merge
• Hazard-detection logic manages out-of-order references

[Diagram: the cluster 0 and cluster 1 memory units each drive a 64-bit data path into the double-pumped GHz data cache (two 128-bit ports); speculative store data bypasses to loads; the bus interface connects a 128-bit L2 path at 1.5× and a 64-bit system path at 1.5×. Data and address paths are drawn separately.]
Slide 18

Low-Latency Speculative Issue of Integer Load Data Consumers (Predict Hit)

    LDQ  R0, 0(R0)
    ADDQ R0, R0, R1
    OP   R2, R2, R3

[Timing diagram: each instruction flows through Q R E D stages; the LDQ cache-lookup cycle and the ADDQ issue cycle are 3 cycles apart. On a miss, this ADDQ and the following op are squashed, and ops restart two cycles back.]

• When predicting a load hit:
  - The ADDQ issues (speculatively) after 3 cycles
  - Best performance if the load actually hits (matching the prediction)
  - The ADDQ issues before the load hit/miss outcome is known
• If the LDQ misses when predicted to hit:
  - Squash two cycles (replay the ADDQ and its consumers)
  - Force a "mini-replay" (directly from the issue queue)
Slide 19

Low-Latency Speculative Issue of Integer Load Data Consumers (Predict Miss)

    LDQ  R0, 0(R0)
    OP1  R2, R2, R3
    OP2  R4, R4, R5
    ADDQ R0, R0, R1

[Timing diagram: Q R E D stages; the earliest ADDQ issue trails the LDQ cache-lookup cycle by 5 cycles.]

• When predicting a load miss:
  - The minimum load latency is 5 cycles (more on an actual miss)
  - There are no squashes
  - Best performance if the load actually misses (as predicted)
• The hit/miss predictor:
  - MSB of a 4-bit counter (hits increment by 1, misses decrement by 2)
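The hit/miss predictor described above is small enough to model directly. This sketch assumes a single counter (the slide does not say how many counters there are or how they are indexed):

```python
class HitMissPredictor:
    """4-bit saturating counter; the MSB predicts hit. Hits increment
    by 1, misses decrement by 2, so the predictor backs off toward the
    safe 5-cycle schedule quickly when loads start missing."""
    def __init__(self):
        self.counter = 15  # start saturated toward "hit"

    def predict_hit(self):
        return self.counter >= 8  # MSB of the 4-bit counter

    def update(self, hit):
        if hit:
            self.counter = min(15, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 2)

p = HitMissPredictor()
for _ in range(4):              # four consecutive misses...
    p.update(False)
assert not p.predict_hit()      # ...flip the prediction to "miss"
for _ in range(8):              # it takes twice as many hits
    p.update(True)
assert p.predict_hit()          # ...to flip back to "hit"
```

The asymmetric update reflects the cost asymmetry: a wrongly predicted hit squashes and replays instructions, while a wrongly predicted miss only adds latency.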
Slide 20

Dynamic Hazard Avoidance (Before Marking)

Program order (assume R28 == R29):
    LDQ R0, 64(R28)
    STQ R0, 0(R28)
    LDQ R1, 0(R29)
A store followed by a load to the same memory address!

First execution order:
    LDQ R0, 64(R28)
    LDQ R1, 0(R29)
    ...
    STQ R0, 0(R28)
This (re-ordered) load got the wrong data value!
Slide 21

Dynamic Hazard Avoidance (After Marking/Training)

Program order (assume R28 == R29):
    LDQ R0, 64(R28)
    STQ R0, 0(R28)
    LDQ R1, 0(R29)
A store followed by a load to the same memory address! This load is marked this time.

Subsequent executions (after marking):
    LDQ R0, 64(R28)
    ...
    STQ R0, 0(R28)
    LDQ R1, 0(R29)
The marked (delayed) load gets the bypassed store data!
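The marking mechanism on these two slides amounts to a small training table. The sketch below is a simplification (the PC value is hypothetical): the first time a load is caught issuing ahead of an older store to the same address, its PC is marked; on later executions a marked load waits for prior stores and picks up the bypassed data.

```python
class StoreWaitTable:
    """Sketch of load marking: PCs of loads that once caused a
    store-load ordering violation are recorded; marked loads are
    later held until older stores have resolved."""
    def __init__(self):
        self.marked = set()  # PCs of loads that once replayed

    def may_issue_early(self, load_pc):
        return load_pc not in self.marked

    def train_on_violation(self, load_pc):
        self.marked.add(load_pc)

swt = StoreWaitTable()
load_pc = 0x120001A8                     # hypothetical PC of "LDQ R1, 0(R29)"
assert swt.may_issue_early(load_pc)      # first run: load reorders past the store
swt.train_on_violation(load_pc)          # ...and is caught with stale data
assert not swt.may_issue_early(load_pc)  # later runs: load waits, gets bypass
```

Training on the load's PC works because store-load conflicts are highly repetitive: a load that conflicted once will usually conflict again.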
Slide 22

New Memory Prefetches

• Software-directed prefetches:
  - Prefetch
  - Prefetch with modify intent
  - Prefetch, evict it next
  - Evict block (eject data from the cache)
  - Write hint
    - allocates the block with no data read
    - useful for full cache-block writes
Slide 23

21264 Off-Chip Memory System Features

• 8 outstanding block fills + 8 victims
• Split L2 cache and system busses (back-side cache)
• High-speed (bandwidth) point-to-point channels
  - clock-forwarding technology, low pin counts
• L2 hit load latency (load issue to consumer issue) = 6 cycles + SRAM latency
• Max L2 cache bandwidth of 16 bytes per 1.5 cycles
  - 6.4 GB/sec with a 400 MHz transfer rate
• L2 miss load latency can be 160 ns (60 ns DRAM)
• Max system bandwidth of 8 bytes per 1.5 cycles
  - 3.2 GB/sec with a 400 MHz transfer rate
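The bandwidth figures above follow directly from bus width times transfer rate; one transfer per 1.5 CPU cycles at 600 MHz is a 400 MHz transfer rate. A quick check:

```python
# Bus bandwidth = width (bytes) x transfer rate (Hz).
cpu_hz = 600e6
transfer_hz = cpu_hz / 1.5   # one transfer per 1.5 cycles = 400 MHz

l2_bw = 16 * transfer_hz     # 16-byte L2 cache port
sys_bw = 8 * transfer_hz     # 8-byte system port

assert transfer_hz == 400e6
assert l2_bw == 6.4e9        # 6.4 GB/sec, as on the slide
assert sys_bw == 3.2e9       # 3.2 GB/sec, as on the slide
```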
Slide 24

21264 Pin Bus

L2 cache: 0-16 MB synchronous, direct-mapped. Peak bandwidth:
    Example 1: Reg-Reg 133 MHz BurstRAM        2.1 GB/sec
    Example 2: 'RISC' 200+ MHz Late Write      3.2+ GB/sec
    Example 3: Dual Data 400+ MHz              6.4+ GB/sec

[Diagram: the Alpha 21264 connects to L2 cache tag RAMs and 128-bit L2 cache data RAMs over a dedicated L2 cache port (data, tag, address), and to the system chipset (DRAM and I/O) over the system pin bus (address out, address in, 64-bit system data). Independent data busses allow simultaneous data transfers.]
Slide 25

COMPAQ Alpha / AMD Shared System Pin Bus (External Interface)

[Diagram: the AMD K7, with its own cache, uses the same system pin bus ("Slot A") — address out, address in, 64-bit system data — to a system chipset (DRAM and I/O).]

• High-performance system pin bus
• Shared system chipset designs
• This is a win-win!
Slide 26

Dual 21264 System Example

[Diagram: two 21264 + L2 cache nodes each connect over 64-bit links to a bank of 8 data switches; the switches connect via two 256-bit paths to two memory banks of 100 MHz SDRAM, and via 64-bit links to two PCI chips (each driving a 64-bit PCI bus); a control chip manages the switches.]
Slide 27

Measured Performance: SPEC95

SPECint95:
    21264 - 600           30
    21164 - 600           18.8
    PA 8200 - 240         17.4
    Pentium II - 400      16.5
    UltraSparc II - 360   16.1
    R10000 - 250          14.7
    PowerPC 604e - 332    14.4

SPECfp95:
    21264 - 600           50
    21164 - 600           29.2
    PA 8200 - 240         28.5
    POWER2 - 160          26.6
    R10000 - 250          24.5
    UltraSparc II - 360   23.5
    Pentium II - 400      13.7
    PowerPC 604e - 332    12.6
Slide 28

Measured Performance: STREAMS

Max STREAMS bandwidth (MB/sec):
    21264 - 600 (ass.)               1000
    RS6000 - 397                      883
    UltraSparc II - UE6000 (ass.)     367
    R10000 - 250 - O2000              358
    21164 - 533 - 164SX               292
    Pentium II - 300 - Dell           213
Slide 29

Alpha 21264

[Die photo with labeled regions: instruction cache, data cache, data and control busses, integer mapper, integer queue, integer units (left and right), floating-point mapper and queue, floating-point unit, memory controllers, bus interface unit (BIU), and instruction datapath.]
Slide 30

Summary

• The 21264 will maintain Alpha's performance lead
  - 30+ SPECint95 and 50+ SPECfp95
  - 1+ GB/sec memory bandwidth
• The 21264 proves that high frequency and sophisticated architectural features can coexist:
  - high-bandwidth speculative instruction fetch
  - out-of-order execution
  - 6-way instruction issue
  - highly parallel out-of-order memory system