48
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

Embed Size (px)

Citation preview

Page 1: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Out-of-Order Execution Structures

Page 2: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

MIPS R10000-Like Design

• Based on:– Complexity-Effective Superscalar Processors– S. Palacharla, N. Jouppi and J. E. Smith, ISCA 97

Page 3: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Fetch Phase

• Fetch:– Read instructions from I-Cache– Predict Branches– Pass on to Decode phase

Page 4: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Decode Phase

• Decode:– Parse instruction– Shuffle opcode parts to appropriate ports for

rename

Page 5: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Renaming Phase

• Rename:– Map Architectural registers to Physical– Eliminate False Dependences– Passes renamed instructions to scheduler

• Called Dispatch

Page 6: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Scheduling Phase

• Wakeup:– Instructions check whether they become ready– From Writeback: physical register names

• Select:– Amongst the ready select those to execute– Structural hazards

Page 7: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Register File Read Phase

• Read source operands

Page 8: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Bypass and Execute Phase

Page 9: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Data Cache Access Phase

Page 10: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Writeback Phase

• Write result to register file• Broadcast tag in order to wakeup waiting

instructions– Notice that the tag broadcast should happen TWO

cycles in advance of the result production

Page 11: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Reservation Station Model

• Used by Pentium Pro, PowerPC 604• Re-order buffer holds values• Renaming points to re-order buffer entries

– Tomasulo-like

Page 12: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Physical Register File vs. Reservation Station

• Physical Register File– Values reside in the register file– At writeback instructions broadcast the

register name• Reservation Stations:

– Values reside:– In the register file upon commit

• Non-speculative

– In reservation stations prior to commit• Speculative

Page 13: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Quantifying Complexity

• Critical Path Delay as a function of architectural parameters– Instruction Window size (WinSize)– Issue Width (IW)

• Full-custom Implementations– Study the critical path– Delay model– Extrapolate how it will scale with “future”

technologies

Page 14: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Renaming

• Inputs:– IW instructions– Up to 2 x Input register names– Up to 1 x Output register name

• Outputs:– 2 x input physical registers– 1 x new output physical register– 1 x previous physical register name for

checkpointing – Updated rename table

• Superscalar Issue complicates things a bit

Page 15: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Renaming One Instruction

s1s2

d

RAT

p0

p31

s1s2

old

d

new reg from free list

Write port

Read port

Read port

Read port

1

1

2

1

For mispeculation recovery

Page 16: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Renaming Two Instructions

RAT

s1 s2 d new d s1 s2 d new d

?

?

?

ps1 ps2 Old d new d ps1 ps2 Old dnew d

Cross BundleDependency Check Logic

Page 17: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Renaming More Instructions

• Dependency Checking logic for instruction i must match against all preceding destinations

• If there are multiple matches it must enforce priority:– Pick the one closest to this instruction

Page 18: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

RAT: SRAM Implementation

decoderSRAM cell

bitlines

Sense amp

Arch reg

Phys reg

#ARCH REGS

lg(#PHYS REGS)

Page 19: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

SRAM RAT cell

Page 20: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

RAT: CAM Implementation

encoder

CAM cellArch reg

Phys reg#PHYS REGS

lg(#ARCH REGS)

Active bit

• One CAM per physical register• Active bit indicates the current map• New version by setting active bit

Page 21: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

CAM Cell

Match

Wordline

Bitline

Bitline_B

Page 22: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

SRAM vs. CAM

• SRAM:– Arch reg rows– Lg(phy reg) cols– SRAM read/write

• CAM:– Phy reg rows– Lg(arch reg) cols– CAM match– Update:

• Reset previous valid bit• Set current valid bit

Page 23: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Scheduler: Part #1 - Wakeup

Page 24: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Tree of Arbiters

REQ Signals

GRANT Signals

Anyreq raised if any req is active, Grant

Issued if arbiter enabled

Root enabled if

FU available

Scheduler: Part #2 - Select

For a Single FU

Location based select policy

Page 25: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Select for more than one FUs

• Handling Multiple FUs of Same Type:– Stack Select logic blocks

in series - hierarchy– Mask the Request granted

to previous unit

• NOT Feasible for More than 2 FUs• Alternative:

– statically partition issue window among FUs – MIPS R10000, HP PA 8000

Page 26: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Datapath and Bypass

Commonly Used Layout:

1 Bit-Slice

Turn on Tri-State A

to pass result of

FU1 to left operand of

FU0

Page 27: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Complexity Analysis

• Critical path delay as a function of:– Issue Width – Window Size

• Register Renaming Table

• Wakeup and Select

• Bypass paths

Page 28: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Methodology

• A representative CMOS design is selected from published alternatives

• Implemented the circuits for 3 technologies:– 0.8micron, 0.35micron and 0.18 micron

• Optimize for speed

• Wire parasitics in delay model– Rmetal, Cmetal

Page 29: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Methodology

• Feature size scaling: 1 / S• Voltage scaling: 1 / U

• Logic Delay = (CLx V) / I

• Capac. Load: CL= 1 1 / S

• Supply Voltage: V = 1 1 / U• Average charge/discharge current: I = 1

1 / U

• So, Logic Delay = (1 / S x 1 / U ) / (1 / U) = 1 / S

Page 30: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Wire Delay

• L: wire length• Intrinsic RC delay

• Rmetal: resistance per unit length

• Cmetal: capacitance per unit length

• 0.5: 1st order approximation of distributed RC model – uniformly distributed R & C

Page 31: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Wire Delay Scaling

• Metal Thickness doesn’t scale much– Width ~ 1/S– Rmetal ~ S

• Fringe Capacitance dominates in smaller feature sizes – edges to parallel wires and the substrate

• Parallel plate – scales with 1 / S– Cmetal ~ S

• Length scales with 1/S• Overall Scale factor: S x S x (1/S)2 = 1

• Wire delay remains constant

Page 32: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Register Renaming Table

Page 33: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Dependency Checking Logic

• Accessed in Parallel with Map Table• Every Logical Reg compared against

logical dest regs of current rename group• For IW=2,4,8, delay less than map table

r1

r4

r4

r4

r4

Page 34: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Renaming Delay

• SRAM scheme• Delay Components:

– Time to decode the arch reg index– Time to drive wordline– Time to pull down bit line– Time for SenseAmp to detect pull-down– MUX time ignored as control from dep.

Check logic comes in advance

Page 35: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Renaming Circuit

Page 36: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Decoder Delay

Page 37: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Decoder Delay

• Predecoding for speed• Length of predecode lines:

– Cellheight: Height of single cell excluding wordlines

– Wordline spacing• NVREG: # of virtual reg-s• x3: 3-operand instr-s

Page 38: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Decoder Delay

• Tnand fall delay of NAND• Tnor rise delay of NOR

• Rnandpd NAND pull-down channel resistance + Predecode line metal resistance

• Ceq diff-n Cap. of NAND + gate Cap. of NOR + interconnect Cap.

Page 39: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Decoder Delay

• Substitute• Predecode line length, Req and Ceq we

get:

• c2: intrinsic RC delay of predecode line• c2 very small • Decoder delay ~linearly dependent on

IW

Page 40: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Rename Delay

• Wordline

• c2: intrinsic RC delay of wordline• c2 very small • Wordline delay ~linearly dependent on

IW

Page 41: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Rename Delay• Bitline:

• c2 very small • Bitline delay ~linearly dependent on IW

• SenseAmp delay ~linearly dependent on IW

Page 42: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Rename Logic Delay Scaling

• Feature size - [increase in bitline&wordline delay with increasing IW]

• 0.8um: IW 2 8 Bitline delay + 37%• 0.18um: IW 28 Bitline delay + 53%

• Total delay increases linearly with IW

• Each Component shows linear increase with IW

• Bitline Delay > Wordline Delay

• Bitline length ~ # of Logical reg-s

• Wordline length ~ width of physical reg designator

IW impact on delay worsenswith decreasing featuresize

Page 43: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Wakeup Delay• Critical Path: Mismatch Pull ready signal low• Delay Components:

– Tag drivers drive tag lines - vertical– Mismatched bit: pull down stack pull matchline low

– horizontal– Final OR gate or all the matchlines of an operand

tag

• Ttagdrive ~ Driver Pullup R & Tagline length & Tagline Load C

• Quadratic component significant for IW>2 & 0.18um

Page 44: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Wakeup Delay

• Quadratic component Small for both cases

• Both delays ~linearly dependent on IW

Page 45: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Wakeup Delay: IW and Window Size• 0.18um Process• Quadratic

dependence• Issue width has

greater effect increase all 3 delay components

• As IW & WinSize + together delay actually changes like: THIS

Page 46: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Wakeup Delay: Window Size

• 8 way & 0.18 Process• Tag drive delay increases rapidly with WinSize +• Match OR delay constant

Page 47: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Wakeup Delay: Feature size

• 8 way & 64 entry window• Tag drive and Tag match delays do not scale as well as MatchOR

delay • Match OR logic delay• Others also have wire delays

Page 48: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Selection Logic and Bypass Delay

• Selection– Logarithmically dependent on WinSize

• Bypass: Delay dependent on (IW)2