Slide 1 — REK, August 1998

The Alpha 21264 Microprocessor: Out-of-Order Execution at 600 MHz

R. E. Kessler
COMPAQ Computer Corporation
Shrewsbury, MA
Slide 2

Some Highlights

• Continued Alpha performance leadership
  - 600 MHz operation in 0.35 µm CMOS6, 6 metal layers, 2.2 V
  - 15 million transistors, 3.1 cm², 587-pin PGA
  - SPECint95 of 30+ and SPECfp95 of 50+
  - Out-of-order and speculative execution
  - 4-way integer issue
  - 2-way floating-point issue
  - Sophisticated tournament branch prediction
  - High-bandwidth memory system (1+ GB/sec)
Slide 3

Alpha 21264: Block Diagram

[Block diagram. Pipeline stages 0-6: FETCH, MAP, QUEUE, REG, EXEC, DCACHE. Fetch: 64 KB 2-set L1 instruction cache with branch predictors and next-line address logic, delivering 4 instructions/cycle. Integer side: integer register map, 20-entry integer issue queue, two 80-entry register files feeding four Exec pipes. Floating-point side: FP register map, 15-entry FP issue queue, 72-entry register file, FP ADD/Div/Sqrt and FP MUL pipes. Memory: 64 KB 2-set L1 data cache, victim buffer, miss-address path, and a bus interface unit with a 128-bit cache bus, 64-bit system bus, and 44-bit physical addresses. 80 in-flight instructions plus 32 loads and 32 stores.]
Slide 4

Alpha 21264: Block Diagram (repeated from Slide 3)
Slide 5

21264 Instruction Fetch Bandwidth Enablers

• The 64 KB two-way associative instruction cache supplies four instructions every cycle
• The next-fetch and set predictors provide the fast cache access times of a direct-mapped cache and eliminate bubbles in non-sequential control flows
• The instruction fetcher speculates through up to 20 branch predictions to supply a continuous stream of instructions
• The tournament branch predictor dynamically selects between local and global history to minimize mispredicts
Slide 6

Instruction Stream Improvements

[Diagram: the PC indexes the Icache data array, the set-0 and set-1 tags (with comparators producing hit/miss/set-miss), and a set predictor; each cache line also stores the next line address + set, so PC generation overlaps instruction decode and branch prediction. Benefits called out: 0-cycle penalty on branches, set-associative behavior at direct-mapped speed, and the predictor learns dynamic jumps.]
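The next-line/set idea above is simple enough to model directly. The sketch below is a hypothetical simplified model (not the 21264's exact tables): each line remembers the predicted address and set of the next fetch, defaulting to sequential fall-through, and is retrained after a mispredict.

```python
class NextLinePredictor:
    """Sketch of a next-line/set (way) predictor: each icache line
    stores the predicted (next line, set) of the following fetch, so
    the cache is accessed like a direct-mapped cache with no bubble."""
    def __init__(self):
        self.next_line = {}  # line index -> (predicted next line, predicted set)

    def predict(self, line_index):
        # Untrained default: fall through sequentially into set 0.
        return self.next_line.get(line_index, (line_index + 1, 0))

    def train(self, line_index, actual_next, actual_set):
        # Learns taken branches and dynamic jumps after a mispredict.
        self.next_line[line_index] = (actual_next, actual_set)

p = NextLinePredictor()
assert p.predict(5) == (6, 0)    # untrained: sequential fetch
p.train(5, 100, 1)               # a taken branch from line 5 to line 100, set 1
assert p.predict(5) == (100, 1)  # later fetches follow the branch with no bubble
```

Because the prediction is stored with the line itself, the fetch pipeline never waits for decode or tag comparison to pick the next fetch address.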
Slide 7

Fetch Stage: Branch Prediction

• Some branch directions can be predicted based on their past behavior: local correlation
    if (a%2 == 0)   history TNTN... (predict from its own repeating pattern)
    if (b == 0)     history NNNN... (predict not taken)
• Others can be predicted based on how the program arrived at the branch: global correlation
    if (a == 0)   ...taken...
    if (a == b)   ...taken...
    if (b == 0)   (predict taken: if a == 0 and a == b, then b must be 0)
Slide 8

Tournament Branch Prediction

[Diagram: the program counter indexes a local history table (1024 × 10 bits); the 10-bit local history indexes a local prediction table (1024 × 3-bit counters). The global path history indexes a global prediction table (4096 × 2-bit counters) and a choice prediction table (4096 × 2-bit counters) that selects which component supplies the final prediction.]
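The table structure above maps naturally onto a small behavioral simulator. The following sketch assumes conventional saturating-counter updates and chooser training on disagreement; those policies are assumptions, as the slide gives only the table shapes.

```python
class TournamentPredictor:
    """Sketch of the slide's predictor: local (1024 x 10-bit histories
    indexing 1024 x 3-bit counters), global (4096 x 2-bit counters on
    path history), and a 4096 x 2-bit choice table picking between them."""
    def __init__(self):
        self.local_hist = [0] * 1024    # per-branch 10-bit histories
        self.local_pred = [4] * 1024    # 3-bit counters, >= 4 means taken
        self.global_pred = [2] * 4096   # 2-bit counters, >= 2 means taken
        self.choice = [2] * 4096        # 2-bit counters, >= 2 favors global
        self.path_hist = 0              # 12-bit global path history

    def _components(self, pc):
        lh = self.local_hist[pc % 1024]
        return self.local_pred[lh] >= 4, self.global_pred[self.path_hist] >= 2

    def predict(self, pc):
        local_t, global_t = self._components(pc)
        return global_t if self.choice[self.path_hist] >= 2 else local_t

    def update(self, pc, taken):
        li = pc % 1024
        lh = self.local_hist[li]
        local_t, global_t = self._components(pc)
        # Train the chooser only when the two predictors disagree.
        if local_t != global_t:
            c = self.choice[self.path_hist]
            self.choice[self.path_hist] = (min(3, c + 1) if global_t == taken
                                           else max(0, c - 1))
        # Saturating-counter updates, then shift the histories.
        lp = self.local_pred[lh]
        self.local_pred[lh] = min(7, lp + 1) if taken else max(0, lp - 1)
        gp = self.global_pred[self.path_hist]
        self.global_pred[self.path_hist] = min(3, gp + 1) if taken else max(0, gp - 1)
        self.local_hist[li] = ((lh << 1) | int(taken)) & 0x3FF
        self.path_hist = ((self.path_hist << 1) | int(taken)) & 0xFFF
```

After warm-up, a branch with a repeating pattern such as TNTN... from Slide 7 is predicted correctly from its own local history, while the global tables capture cross-branch correlation.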
Slide 9

Alpha 21264: Block Diagram (repeated from Slide 3)
Slide 10

Mapper and Queue Stages

• Mapper:
  - Renames 4 instructions per cycle (8 source / 4 destination registers)
  - 80 integer + 72 floating-point physical registers
• Queue stage:
  - Integer: 20 entries / quad-issue
  - Floating point: 15 entries / dual-issue
  - Instructions issue out of order when their data is ready
  - Prioritized from oldest to youngest each cycle
  - Instructions leave the queues after they issue
  - The queue collapses every cycle as needed
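What the map stage does can be sketched in a few lines. This is a simplified free-list model, not the 21264's CAM-based implementation: each architectural destination gets a fresh physical register, and sources read the most recent mapping, so write-after-write and write-after-read hazards disappear.

```python
from collections import deque

class RenameMap:
    """Simplified register-renaming sketch: architectural registers map
    to physical registers; each destination write allocates a fresh
    physical register so older in-flight values stay readable."""
    def __init__(self, arch_regs=32, phys_regs=80):
        self.table = list(range(arch_regs))             # arch -> phys
        self.free = deque(range(arch_regs, phys_regs))  # unallocated phys regs

    def rename(self, dest, srcs):
        phys_srcs = [self.table[s] for s in srcs]  # read current mappings
        new_phys = self.free.popleft()             # allocate for the dest
        old_phys = self.table[dest]
        self.table[dest] = new_phys
        # old_phys is freed when the instruction retires (not modeled here)
        return new_phys, phys_srcs, old_phys

rm = RenameMap()
# ADDQ R1, R2 -> R3 followed by SUBQ R3, R4 -> R3: the second write to R3
# gets a new physical register, and the SUBQ reads the ADDQ's result.
d1, s1, _ = rm.rename(3, [1, 2])
d2, s2, _ = rm.rename(3, [3, 4])
assert s2[0] == d1 and d2 != d1
```

Saved map state (shown on the next slide) lets the mapper roll this table back instantly on a branch mispredict.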
Slide 11

Map and Queue Stages

[Diagram: 4 instructions/cycle flow into the MAPPER (CAMs, 80 entries, with saved map state), which turns register numbers into renamed registers, and then into the 20-entry ISSUE QUEUE; a register scoreboard drives request/grant lines into an arbiter, which selects the issued instructions.]
Slide 12

Alpha 21264: Block Diagram (repeated from Slide 3)
Slide 13

Register and Execute Stages

[Diagram. Floating-point execution units: one register file feeding an FP add pipe (which also handles divide and square root) and an FP multiply pipe. Integer execution units: two clusters (cluster 0 and cluster 1), each with its own register file and two pipes: an upper pipe with add/logic and shift/branch (plus multiply in one cluster and motion video in the other) and a lower pipe with add/logic/memory.]
Slide 14

Integer Cross-Cluster Instruction Scheduling and Execution

[Diagram: the two integer clusters from the previous slide.]

• Instructions are statically pre-slotted to the upper or lower execution pipes
• The issue queue dynamically selects between the left and right clusters
• This has most of the performance of 4-way issue with the simplicity of 2-way issue
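The two-level scheme above (static row, dynamic column) can be sketched as follows. The opcode classes here are illustrative placeholders, not the 21264's actual slotting rules:

```python
# Hypothetical sketch of 21264-style integer slotting: an instruction is
# statically assigned to the upper or lower pipe row by its opcode class,
# and the issue queue picks a free cluster (0 or 1) dynamically.
UPPER_ONLY = {"SLL", "SRL", "BR", "MUL", "MVI"}  # shift/branch/mul/video ops

def slot(opcode):
    """Static pre-slotting: upper row for shift/branch-class ops,
    lower row (which also owns the memory ports) otherwise."""
    return "upper" if opcode in UPPER_ONLY else "lower"

def issue(opcode, busy):
    """Dynamic cluster selection: take any free pipe in the slotted row.
    `busy` maps (row, cluster) -> bool."""
    row = slot(opcode)
    for cluster in (0, 1):
        if not busy.get((row, cluster), False):
            return row, cluster
    return None  # both clusters busy in that row: stall this cycle

busy = {("upper", 0): True}
assert issue("SLL", busy) == ("upper", 1)  # upper row: cluster 0 busy -> 1
assert issue("LDQ", busy) == ("lower", 0)  # memory op goes to a lower pipe
```

The static row assignment keeps the arbiter small (each queue entry bids on one row), while the dynamic cluster choice recovers most of the flexibility of full 4-way scheduling.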
Slide 15

Alpha 21264: Block Diagram (repeated from Slide 3)
Slide 16

21264 On-Chip Memory System Features

• Two loads/stores per cycle
  - any combination
• 64 KB two-way associative L1 data cache (9.6 GB/sec)
  - phase-pipelined at > 1 GHz (no bank conflicts!)
  - 3-cycle latency (issue to issue of consumer) with hit prediction
• Out-of-order and speculative execution
  - minimizes effective memory latency
• 32 outstanding loads and 32 outstanding stores
  - maximizes memory-system parallelism
• Speculative stores forward data to subsequent loads
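The last bullet, store-to-load forwarding, can be sketched as a store-buffer lookup. This is a simplified software model, not the 21264's exact hardware: a load checks older buffered stores for an address match before reading the cache, and the youngest matching store wins.

```python
def load_value(addr, store_buffer, cache):
    """Sketch of speculative store-to-load forwarding: `store_buffer`
    holds (addr, data) pairs for older, not-yet-retired stores (youngest
    last); `cache` is a dict standing in for the data cache."""
    for st_addr, st_data in reversed(store_buffer):
        if st_addr == addr:
            return st_data   # bypassed from the store queue
    return cache[addr]       # no match: read the data cache

cache = {0x1000: 7, 0x1008: 9}         # hypothetical memory contents
stores = [(0x1000, 41), (0x1000, 42)]  # two older in-flight stores
assert load_value(0x1000, stores, cache) == 42  # youngest store forwards
assert load_value(0x1008, stores, cache) == 9   # no match: cache data
```

Forwarding lets a dependent load proceed without waiting for the store to retire and write the cache.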
Slide 17

Memory System (Contd.)

• New references check addresses against existing references
• Speculative store data whose address matches bypasses to loads
• Multiple misses to the same cache block merge
• Hazard-detection logic manages out-of-order references

[Diagram: the cluster 0 and cluster 1 memory units each drive a 64-bit data path into the double-pumped GHz data cache (two 128-bit ports); speculative store data bypasses to loads; the bus interface connects a 128-bit L2 path at 1.5× and a 64-bit system path at 1.5×. Data and address paths are drawn separately.]
Slide 18

Low-Latency Speculative Issue of Integer Load Data Consumers (Predict Hit)

    LDQ  R0, 0(R0)
    ADDQ R0, R0, R1
    OP   R2, R2, R3

[Timing diagram: each instruction flows through Q R E D stages; the LDQ cache-lookup cycle and the ADDQ issue cycle are 3 cycles apart. On a miss, this ADDQ and the following op are squashed, and ops restart two cycles back.]

• When predicting a load hit:
  - The ADDQ issues (speculatively) after 3 cycles
  - Best performance if the load actually hits (matching the prediction)
  - The ADDQ issues before the load hit/miss outcome is known
• If the LDQ misses when predicted to hit:
  - Squash two cycles (replay the ADDQ and its consumers)
  - Force a "mini-replay" (directly from the issue queue)
Slide 19

Low-Latency Speculative Issue of Integer Load Data Consumers (Predict Miss)

    LDQ  R0, 0(R0)
    OP1  R2, R2, R3
    OP2  R4, R4, R5
    ADDQ R0, R0, R1

[Timing diagram: Q R E D stages; the earliest ADDQ issue trails the LDQ cache-lookup cycle by 5 cycles.]

• When predicting a load miss:
  - The minimum load latency is 5 cycles (more on an actual miss)
  - There are no squashes
  - Best performance if the load actually misses (as predicted)
• The hit/miss predictor:
  - MSB of a 4-bit counter (hits increment by 1, misses decrement by 2)
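The hit/miss predictor described above is small enough to model directly. This sketch assumes a single counter (the slide does not say how many counters there are or how they are indexed):

```python
class HitMissPredictor:
    """4-bit saturating counter; the MSB predicts hit. Hits increment
    by 1, misses decrement by 2, so the predictor backs off toward the
    safe 5-cycle schedule quickly when loads start missing."""
    def __init__(self):
        self.counter = 15  # start saturated toward "hit"

    def predict_hit(self):
        return self.counter >= 8  # MSB of the 4-bit counter

    def update(self, hit):
        if hit:
            self.counter = min(15, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 2)

p = HitMissPredictor()
for _ in range(4):              # four consecutive misses...
    p.update(False)
assert not p.predict_hit()      # ...flip the prediction to "miss"
for _ in range(8):              # it takes twice as many hits
    p.update(True)
assert p.predict_hit()          # ...to flip back to "hit"
```

The asymmetric update reflects the cost asymmetry: a wrongly predicted hit squashes and replays instructions, while a wrongly predicted miss only adds latency.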
Slide 20

Dynamic Hazard Avoidance (Before Marking)

Program order (assume R28 == R29):
    LDQ R0, 64(R28)
    STQ R0, 0(R28)
    LDQ R1, 0(R29)
A store followed by a load to the same memory address!

First execution order:
    LDQ R0, 64(R28)
    LDQ R1, 0(R29)
    ...
    STQ R0, 0(R28)
This (re-ordered) load got the wrong data value!
Slide 21

Dynamic Hazard Avoidance (After Marking/Training)

Program order (assume R28 == R29):
    LDQ R0, 64(R28)
    STQ R0, 0(R28)
    LDQ R1, 0(R29)
A store followed by a load to the same memory address! This load is marked this time.

Subsequent executions (after marking):
    LDQ R0, 64(R28)
    ...
    STQ R0, 0(R28)
    LDQ R1, 0(R29)
The marked (delayed) load gets the bypassed store data!
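The marking mechanism on these two slides amounts to a small training table. The sketch below is a simplification (the PC value is hypothetical): the first time a load is caught issuing ahead of an older store to the same address, its PC is marked; on later executions a marked load waits for prior stores and picks up the bypassed data.

```python
class StoreWaitTable:
    """Sketch of load marking: PCs of loads that once caused a
    store-load ordering violation are recorded; marked loads are
    later held until older stores have resolved."""
    def __init__(self):
        self.marked = set()  # PCs of loads that once replayed

    def may_issue_early(self, load_pc):
        return load_pc not in self.marked

    def train_on_violation(self, load_pc):
        self.marked.add(load_pc)

swt = StoreWaitTable()
load_pc = 0x120001A8                     # hypothetical PC of "LDQ R1, 0(R29)"
assert swt.may_issue_early(load_pc)      # first run: load reorders past the store
swt.train_on_violation(load_pc)          # ...and is caught with stale data
assert not swt.may_issue_early(load_pc)  # later runs: load waits, gets bypass
```

Training on the load's PC works because store-load conflicts are highly repetitive: a load that conflicted once will usually conflict again.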
Slide 22

New Memory Prefetches

• Software-directed prefetches:
  - Prefetch
  - Prefetch with modify intent
  - Prefetch, evict it next
  - Evict block (eject data from the cache)
  - Write hint
    - allocates the block with no data read
    - useful for full cache-block writes
Slide 23

21264 Off-Chip Memory System Features

• 8 outstanding block fills + 8 victims
• Split L2 cache and system busses (back-side cache)
• High-speed (bandwidth) point-to-point channels
  - clock-forwarding technology, low pin counts
• L2 hit load latency (load issue to consumer issue) = 6 cycles + SRAM latency
• Max L2 cache bandwidth of 16 bytes per 1.5 cycles
  - 6.4 GB/sec with a 400 MHz transfer rate
• L2 miss load latency can be 160 ns (60 ns DRAM)
• Max system bandwidth of 8 bytes per 1.5 cycles
  - 3.2 GB/sec with a 400 MHz transfer rate
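The bandwidth figures above follow directly from bus width times transfer rate; one transfer per 1.5 CPU cycles at 600 MHz is a 400 MHz transfer rate. A quick check:

```python
# Bus bandwidth = width (bytes) x transfer rate (Hz).
cpu_hz = 600e6
transfer_hz = cpu_hz / 1.5   # one transfer per 1.5 cycles = 400 MHz

l2_bw = 16 * transfer_hz     # 16-byte L2 cache port
sys_bw = 8 * transfer_hz     # 8-byte system port

assert transfer_hz == 400e6
assert l2_bw == 6.4e9        # 6.4 GB/sec, as on the slide
assert sys_bw == 3.2e9       # 3.2 GB/sec, as on the slide
```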
Slide 24

21264 Pin Bus

L2 cache: 0-16 MB synchronous, direct-mapped. Peak bandwidth:
    Example 1: Reg-Reg 133 MHz BurstRAM        2.1 GB/sec
    Example 2: 'RISC' 200+ MHz Late Write      3.2+ GB/sec
    Example 3: Dual Data 400+ MHz              6.4+ GB/sec

[Diagram: the Alpha 21264 connects to L2 cache tag RAMs and 128-bit L2 cache data RAMs over a dedicated L2 cache port (data, tag, address), and to the system chipset (DRAM and I/O) over the system pin bus (address out, address in, 64-bit system data). Independent data busses allow simultaneous data transfers.]
Slide 25

COMPAQ Alpha / AMD Shared System Pin Bus (External Interface)

[Diagram: the AMD K7, with its own cache, uses the same system pin bus ("Slot A") — address out, address in, 64-bit system data — to a system chipset (DRAM and I/O).]

• High-performance system pin bus
• Shared system chipset designs
• This is a win-win!
Slide 26

Dual 21264 System Example

[Diagram: two 21264 + L2 cache nodes each connect over 64-bit links to a bank of 8 data switches; the switches connect via two 256-bit paths to two memory banks of 100 MHz SDRAM, and via 64-bit links to two PCI chips (each driving a 64-bit PCI bus); a control chip manages the switches.]
Slide 27

Measured Performance: SPEC95

SPECint95:
    21264 - 600           30
    21164 - 600           18.8
    PA 8200 - 240         17.4
    Pentium II - 400      16.5
    UltraSparc II - 360   16.1
    R10000 - 250          14.7
    PowerPC 604e - 332    14.4

SPECfp95:
    21264 - 600           50
    21164 - 600           29.2
    PA 8200 - 240         28.5
    POWER2 - 160          26.6
    R10000 - 250          24.5
    UltraSparc II - 360   23.5
    Pentium II - 400      13.7
    PowerPC 604e - 332    12.6
Slide 28

Measured Performance: STREAMS

Max STREAMS bandwidth (MB/sec):
    21264 - 600 (ass.)               1000
    RS6000 - 397                      883
    UltraSparc II - UE6000 (ass.)     367
    R10000 - 250 - O2000              358
    21164 - 533 - 164SX               292
    Pentium II - 300 - Dell           213
Slide 29

Alpha 21264

[Die photo with labeled regions: instruction cache, data cache, data and control busses, integer mapper, integer queue, integer units (left and right), floating-point mapper and queue, floating-point unit, memory controllers, bus interface unit (BIU), and instruction datapath.]
Slide 30

Summary

• The 21264 will maintain Alpha's performance lead
  - 30+ SPECint95 and 50+ SPECfp95
  - 1+ GB/sec memory bandwidth
• The 21264 proves that high frequency and sophisticated architectural features can coexist:
  - high-bandwidth speculative instruction fetch
  - out-of-order execution
  - 6-way instruction issue
  - highly parallel out-of-order memory system