
© Wenisch 2007 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

EECS 470
Lecture 6: Tomasulo's Algorithm
Fall 2007
Prof. Thomas Wenisch
http://www.eecs.umich.edu/courses/eecs470

[Figure: floating-point unit block diagram: Storage Bus, Instruction Unit, Floating Point Operand Stack (FLOS), Floating Point Buffers (FLB), Floating Point Registers (FLR) with busy bits and tags, Store Data Buffers (SDB), decoder and control, FLB Bus, FLR Bus, Adder and Multiply/Divide reservation stations (sink, tag, source, ctrl. fields), Result, Common Data Bus (CDB)]

Many thanks to Prof. Martin and Roth of University of Pennsylvania for most of these slides. Portions developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, Vijaykumar, and Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, and University of Wisconsin.


Announcements

• HW #2 due Wednesday 9/26
• Programming assignment #2 due today
  • Hand in printed synthesis results by 6pm at CSE 4620
  • Electronic submission of integer square root by 6pm


Readings

For Today:
  H & P Chapter 2.4-2.6
  D. Sima "Design Space of Register Renaming Techniques"

For Wednesday:
  Smith & Pleszkun "Implementing Precise Interrupts"


Scoreboard Redux

• The good
  + Cheap hardware
    • InsnStatus + FuStatus + RegStatus ~ 1 FP unit in area
  + Pretty good performance
    • 1.7X for FORTRAN (scientific array) programs
• The less good
  – No bypassing
    • Is this a fundamental problem?
  – Limited scheduling scope
    • Structural/WAW hazards delay dispatch
  – Slow issue of truly-dependent (RAW) insns
    • WAR hazards delay writeback
    • Fix with hardware register renaming


Register Renaming

• Register renaming (in hardware)
  • Change register names to eliminate WAR/WAW hazards
  • One of the most beautiful things in computer architecture
  • Key: think of registers (r1,f0…) as names, not storage locations
  + Can have more locations than names
  + Can have multiple active versions of same name
• How does it work?
  • Map-table: maps names to most recent locations
    • SRAM indexed by name
  • On a write: allocate new location, note in map-table
  • On a read: find location of most recent write via map-table lookup
  • Small detail: must de-allocate locations at some point


Register Renaming Example

• Parameters
  • Names: r1,r2,r3
  • Locations: p1,p2,p3,p4,p5,p6,p7
  • Original mapping: r1→p1, r2→p2, r3→p3, p4–p7 are “free”

MapTable
r1  r2  r3    FreeList       Raw insns       Renamed insns
p1  p2  p3    p4,p5,p6,p7    add r2,r3,r1    add p2,p3,p4
p4  p2  p3    p5,p6,p7       sub r2,r1,r3    sub p2,p4,p5
p4  p2  p5    p6,p7          mul r2,r3,r3    mul p2,p5,p6
p4  p2  p6    p7             div r1,4,r1     div p4,4,p7

• Renaming
  + Removes WAW and WAR dependences
  + Leaves RAW intact!
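The map-table and free-list steps in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rename`, the tuple encoding `(op, src1, src2, dest)` with the destination last, and the use of a `deque` as the free list are choices made here, not part of the slides. De-allocation of locations is deliberately left out.

```python
from collections import deque

def rename(insns, map_table, free_list):
    """Rename (op, src1, src2, dest) tuples; dest is the last operand, as on the slide."""
    renamed = []
    for op, src1, src2, dest in insns:
        # On a read: find the location of the most recent write via the map table.
        # Operands that are not register names (immediates) pass through unchanged.
        p1 = map_table.get(src1, src1)
        p2 = map_table.get(src2, src2)
        # On a write: allocate a new location and note it in the map table.
        pd = free_list.popleft()
        map_table[dest] = pd
        renamed.append((op, p1, p2, pd))
    return renamed

# Parameters from the slide: r1->p1, r2->p2, r3->p3, p4..p7 free.
map_table = {"r1": "p1", "r2": "p2", "r3": "p3"}
free_list = deque(["p4", "p5", "p6", "p7"])
raw = [("add", "r2", "r3", "r1"),
       ("sub", "r2", "r1", "r3"),
       ("mul", "r2", "r3", "r3"),
       ("div", "r1", "4", "r1")]
renamed = rename(raw, map_table, free_list)
for insn in renamed:
    print(insn)
```

Sources are looked up before the destination is remapped, which is what keeps the RAW dependence through r1 in `div r1,4,r1` intact.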


Scheduling Algorithm II: Tomasulo

• Tomasulo’s algorithm
  • Reservation stations (RS): instruction buffer
  • Common data bus (CDB): broadcasts results to RS
  • Register renaming: removes WAR/WAW hazards
• First implementation: IBM 360/91 [1967]
  • Dynamic scheduling for FP units only
  • Bypassing
• Our example: “Simple Tomasulo”
  • Dynamic scheduling for everything, including load/store
  • No bypassing (for comparison with Scoreboard)
  • 5 RS: 1 ALU, 1 load, 1 store, 2 FP (3-cycle, pipelined)


Tomasulo Data Structures

• Reservation Stations (RS#)
  • FU, busy, op, R: destination register name
  • T: destination register tag (RS# of this RS)
  • T1,T2: source register tags (RS# of RS that will produce value)
  • V1,V2: source register values
    • That’s new
• Map Table
  • T: tag (RS#) that will write this register
• Common Data Bus (CDB)
  • Broadcasts <RS#, value> of completed insns
• Tags interpreted as ready-bits++
  • T==0 → Value is ready somewhere
  • T!=0 → Value is not ready, wait until CDB broadcasts T
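As a rough sketch of the fields listed above, one reservation-station entry could be modeled as a small Python record. The class name `RSEntry`, the lowercase field names, and the `ready` and `capture` helpers are hypothetical and exist only for illustration; the one convention taken from the slide is that tag 0 means "value is ready".

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RSEntry:
    tag: int                      # T: RS# of this entry (nonzero)
    fu: str                       # FU this entry feeds
    busy: bool = False
    op: Optional[str] = None
    r: Optional[str] = None       # R: destination register name
    t1: int = 0                   # T1: tag of the RS producing source 1 (0 = ready)
    t2: int = 0                   # T2: tag of the RS producing source 2 (0 = ready)
    v1: Optional[float] = None    # V1: source 1 value, once known
    v2: Optional[float] = None    # V2: source 2 value, once known

    def ready(self) -> bool:
        # Tags act as ready bits: both must be 0 before the insn can issue.
        return self.busy and self.t1 == 0 and self.t2 == 0

    def capture(self, cdb_tag: int, cdb_value: float) -> None:
        # CDB broadcasts <RS#, value>: a matching source clears its tag and copies the value.
        if self.t1 == cdb_tag:
            self.t1, self.v1 = 0, cdb_value
        if self.t2 == cdb_tag:
            self.t2, self.v2 = 0, cdb_value

# mulf f0,f1,f2 in RS#4, waiting for RS#2 to produce f1 (values are made up).
mulf = RSEntry(tag=4, fu="FP1", busy=True, op="mulf", r="f2", t2=2, v1=2.0)
waiting = mulf.ready()
mulf.capture(2, 3.0)
print(waiting, mulf.ready(), mulf.v2)
```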


Simple Tomasulo Data Structures

[Figure: Simple Tomasulo datapath: fetched insns, Map Table (T), Regfile (value), Reservation Stations (T, FU, op, R, T1, T2, V1, V2), tag comparators (==), CDB.T and CDB.V]

Figure legend:
• Insn fields and status bits
• Tags
• Values


Simple Tomasulo Pipeline

• New pipeline structure: F, D, S, X, W
  • D (dispatch)
    • Structural hazard ? stall : allocate RS entry
  • S (issue)
    • RAW hazard ? wait (monitor CDB) : go to execute
  • W (writeback)
    • Write register, free RS entry
  • W and RAW-dependent S in same cycle
  • W and structural-dependent D in same cycle


Tomasulo Dispatch (D)

[Figure: Simple Tomasulo datapath: fetched insns, Map Table (T), Regfile (value), Reservation Stations (T, FU, op, R, T1, T2, V1, V2), tag comparators (==), CDB.T and CDB.V]

• Stall for structural (RS) hazards
• Allocate RS entry
• Input register ready ? read value into RS : read tag into RS
• Set register status (i.e., rename) for output register
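The four dispatch steps above can be sketched as follows. This is an illustration, not the course's implementation: the helper names `make_rs` and `dispatch`, the dictionary layout, the choice of the lowest-numbered free RS, and the register values are assumptions. The RS numbering (1 ALU, 2 LD, 3 ST, 4 and 5 FP) follows the slides.

```python
def make_rs():
    # One entry per RS#, numbered as on the slides: 1 ALU, 2 LD, 3 ST, 4 FP1, 5 FP2.
    kinds = {1: "ALU", 2: "LD", 3: "ST", 4: "FP", 5: "FP"}
    return {tag: {"kind": kind, "busy": False, "op": None, "R": None,
                  "T1": 0, "T2": 0, "V1": None, "V2": None}
            for tag, kind in kinds.items()}

def dispatch(op, kind, src1, src2, dest, rs, map_table, regfile):
    """Dispatch one insn; return its RS#, or None to signal a structural stall."""
    free = [tag for tag in sorted(rs)
            if rs[tag]["kind"] == kind and not rs[tag]["busy"]]
    if not free:
        return None                       # stall for structural (RS) hazard
    tag = free[0]                         # allocate RS entry (lowest free RS#: an assumption)
    entry = rs[tag]
    entry.update(busy=True, op=op, R=dest, T1=0, T2=0, V1=None, V2=None)
    # Input register ready ? read value into RS : read tag into RS
    for slot, src in (("1", src1), ("2", src2)):
        if src is None:
            continue
        producer = map_table[src]
        if producer == 0:
            entry["V" + slot] = regfile[src]
        else:
            entry["T" + slot] = producer
    # Rename the output register last, so an insn like addi r1,4,r1 still reads the old r1.
    if dest is not None:
        map_table[dest] = tag
    return tag

rs = make_rs()
map_table = {"f0": 0, "f1": 0, "f2": 0, "r1": 0}
regfile = {"f0": 2.0, "f1": 0.0, "f2": 0.0, "r1": 100}    # made-up values
ldf_tag = dispatch("ldf", "LD", None, "r1", "f1", rs, map_table, regfile)
mulf_tag = dispatch("mulf", "FP", "f0", "f1", "f2", rs, map_table, regfile)
stalled = dispatch("ldf", "LD", None, "r1", "f1", rs, map_table, regfile)
print(ldf_tag, mulf_tag, stalled, map_table)
```

The third call returns `None` because the single load RS is still busy, which is the structural stall named in the first bullet.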


Tomasulo Issue (S)

[Figure: Simple Tomasulo datapath: fetched insns, Map Table (T), Regfile (value), Reservation Stations (T, FU, op, R, T1, T2, V1, V2), tag comparators (==), CDB.T and CDB.V]

• Wait for RAW hazards
• Read register values from RS
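A minimal sketch of the issue check, assuming a dictionary per RS entry: an insn may leave S only once both of its source tags are 0, and it then reads its operands from its own RS entry rather than from the register file. The `issued` flag, the function name `issue`, and the lowest-RS#-first order are assumptions added for illustration; FU availability is ignored here.

```python
def issue(rs):
    """Return (RS#, op, V1, V2) for one insn that can issue, or None if all must wait."""
    for tag in sorted(rs):
        e = rs[tag]
        # Wait for RAW hazards: both source tags must be cleared (0).
        if e["busy"] and not e["issued"] and e["T1"] == 0 and e["T2"] == 0:
            e["issued"] = True
            # Read register values from the RS entry itself.
            return tag, e["op"], e["V1"], e["V2"]
    return None

# Hypothetical state using the same tags as the ldf / mulf / stf example (values made up).
rs = {
    2: {"busy": True, "issued": False, "op": "ldf",  "T1": 0, "T2": 0, "V1": None, "V2": 100},
    3: {"busy": True, "issued": False, "op": "stf",  "T1": 4, "T2": 0, "V1": None, "V2": 100},
    4: {"busy": True, "issued": False, "op": "mulf", "T1": 0, "T2": 2, "V1": 2.0,  "V2": None},
}
first = issue(rs)     # ldf has no outstanding tags
second = issue(rs)    # stf waits on RS#4, mulf waits on RS#2
print(first, second)
```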


Tomasulo Execute (X)

[Figure: Simple Tomasulo datapath: fetched insns, Map Table (T), Regfile (value), Reservation Stations (T, FU, op, R, T1, T2, V1, V2), tag comparators (==), CDB.T and CDB.V]


Tomasulo Writeback (W)

[Figure: Simple Tomasulo datapath: fetched insns, Map Table (T), Regfile (value), Reservation Stations (T, FU, op, R, T1, T2, V1, V2), tag comparators (==), CDB.T and CDB.V]

• Wait for structural (CDB) hazards
• Output Reg Status tag still matches? clear, write result to register
• CDB broadcast to RS: tag match ? clear tag, copy value
• Free RS entry
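The writeback rules above can be sketched like this, again with hypothetical names (`writeback`, the dictionary fields) and made-up values. CDB arbitration is not modeled; the function assumes the finishing insn already owns the bus.

```python
def writeback(tag, value, rs, map_table, regfile):
    """Complete the insn in RS `tag`, broadcasting <tag, value> on the CDB."""
    dest = rs[tag]["R"]
    # Output register status tag still matches? clear it and write the register.
    # If a later insn has renamed dest, the regfile write is skipped.
    if dest is not None and map_table.get(dest) == tag:
        regfile[dest] = value
        map_table[dest] = 0
    # CDB broadcast to RS: tag match ? clear tag, copy value.
    for e in rs.values():
        if not e["busy"]:
            continue
        if e["T1"] == tag:
            e["T1"], e["V1"] = 0, value
        if e["T2"] == tag:
            e["T2"], e["V2"] = 0, value
    # Free RS entry.
    rs[tag]["busy"] = False

# Hypothetical state: first mulf (RS#4) finishes after a second mulf (RS#5) renamed f2,
# while a second ldf (RS#2) is still in flight.
rs = {
    2: {"busy": True, "op": "ldf",  "R": "f1", "T1": 0, "T2": 0, "V1": None, "V2": 104},
    3: {"busy": True, "op": "stf",  "R": None, "T1": 4, "T2": 0, "V1": None, "V2": 100},
    4: {"busy": True, "op": "mulf", "R": "f2", "T1": 0, "T2": 0, "V1": 2.0,  "V2": 3.0},
    5: {"busy": True, "op": "mulf", "R": "f2", "T1": 0, "T2": 2, "V1": 2.0,  "V2": None},
}
map_table = {"f1": 2, "f2": 5}
regfile = {"f1": 3.0, "f2": 0.0}
writeback(4, 6.0, rs, map_table, regfile)   # f2 tag is RS#5, so the regfile is not written
writeback(2, 7.0, rs, map_table, regfile)   # f1 tag still RS#2, so the regfile is written
print(regfile, map_table, rs[3]["V1"], rs[5]["V2"])
```

The first call shows why WAW hazards need no stall: the stale result reaches the waiting store through the CDB but never overwrites the newer mapping of f2.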


Difference Between Scoreboard…

[Figure: Scoreboard datapath: fetched insns, Insn (S, X), Reg Status (T), Regfile (value), FU Status (T, FU, op, R, R1, R2, T1, T2), tag comparators (==)]


…And Tomasulo

[Figure: Simple Tomasulo datapath: fetched insns, Map Table (T), Regfile (value), Reservation Stations (T, FU, op, R, T1, T2, V1, V2), tag comparators (==), CDB.T and CDB.V]

• What in Tomasulo implements register renaming?
  • Value copies in RS (V1, V2)
  • Insn stores correct input values in its own RS entry
  + Future insns can overwrite master copy in regfile, doesn’t matter


Value/Copy-Based Register Renaming

• Tomasulo-style register renaming
  • Called “value-based” or “copy-based”
  • Names: architectural registers
  • Storage locations: register file and reservation stations
    • Values can and do exist in both
    • Register file holds master (i.e., most recent) values
    + RS copies eliminate WAR hazards
• Storage locations referred to internally by RS# tags
  • Register table translates names to tags
  • Tag == 0 → value is in register file
  • Tag != 0 → value is not ready and is being computed by RS#
• CDB broadcasts values with tags attached
  • So insns know what value they are looking at


Value-Based Renaming Example

ldf X(r1),f1 (allocate RS#2)
  • MT[r1] == 0 → RS[2].V2 = RF[r1]
  • MT[f1] = RS#2
mulf f0,f1,f2 (allocate RS#4)
  • MT[f0] == 0 → RS[4].V1 = RF[f0]
  • MT[f1] == RS#2 → RS[4].T2 = RS#2
  • MT[f2] = RS#4
addf f7,f8,f0
  • Can write RF[f0] before mulf executes, why?
ldf X(r1),f1
  • Can write RF[f1] before mulf executes, why?
  • Can write RF[f1] before first ldf, why?

Map Table
Reg  T
f0
f1   RS#2
f2   RS#4
r1

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
2  LD   yes   ldf   f1  -     -     -     [r1]
4  FP1  yes   mulf  f2  -     RS#2  [f0]


Tomasulo Data Structures

Insn Status
Insn            D    S    X    W
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)

Map Table
Reg  T
f0
f1
f2
r1

CDB
T    V

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  no
2  LD   no
3  ST   no
4  FP1  no
5  FP2  no


Tomasulo: Cycle 1

Insn Status
Insn            D    S    X    W
ldf X(r1),f1    c1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)

Map Table
Reg  T
f0
f1   RS#2
f2
r1

CDB
T    V

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  no
2  LD   yes   ldf   f1  -     -     -     [r1]   (allocate)
3  ST   no
4  FP1  no
5  FP2  no


Tomasulo: Cycle 2

Insn Status
Insn            D    S    X    W
ldf X(r1),f1    c1   c2
mulf f0,f1,f2   c2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)

Map Table
Reg  T
f0
f1   RS#2
f2   RS#4
r1

CDB
T    V

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  no
2  LD   yes   ldf   f1  -     -     -     [r1]
3  ST   no
4  FP1  yes   mulf  f2  -     RS#2  [f0]  -      (allocate)
5  FP2  no


Tomasulo: Cycle 3

Insn Status
Insn            D    S    X    W
ldf X(r1),f1    c1   c2   c3
mulf f0,f1,f2   c2
stf f2,Z(r1)    c3
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)

Map Table
Reg  T
f0
f1   RS#2
f2   RS#4
r1

CDB
T    V

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  no
2  LD   yes   ldf   f1  -     -     -     [r1]
3  ST   yes   stf   -   RS#4  -     -     [r1]   (allocate)
4  FP1  yes   mulf  f2  -     RS#2  [f0]  -
5  FP2  no


Tomasulo: Cycle 4

Insn Status
Insn            D    S    X    W
ldf X(r1),f1    c1   c2   c3   c4
mulf f0,f1,f2   c2   c4
stf f2,Z(r1)    c3
addi r1,4,r1    c4
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)

Map Table
Reg  T
f0
f1   RS#2   (cleared this cycle)
f2   RS#4
r1   RS#1

CDB
T     V
RS#2  [f1]

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  yes   addi  r1  -     -     [r1]  -      (allocate)
2  LD   no                                        (free)
3  ST   yes   stf   -   RS#4  -     -     [r1]
4  FP1  yes   mulf  f2  -     RS#2  [f0]  CDB.V  (RS#2 ready → grab CDB value)
5  FP2  no

• ldf finished (W): clear f1 RegStatus, CDB broadcast


Tomasulo: Cycle 5

Insn Status
Insn            D    S    X    W
ldf X(r1),f1    c1   c2   c3   c4
mulf f0,f1,f2   c2   c4   c5
stf f2,Z(r1)    c3
addi r1,4,r1    c4   c5
ldf X(r1),f1    c5
mulf f0,f1,f2
stf f2,Z(r1)

Map Table
Reg  T
f0
f1   RS#2
f2   RS#4
r1   RS#1

CDB
T    V

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  yes   addi  r1  -     -     [r1]  -
2  LD   yes   ldf   f1  -     RS#1  -     -      (allocate)
3  ST   yes   stf   -   RS#4  -     -     [r1]
4  FP1  yes   mulf  f2  -     -     [f0]  [f1]
5  FP2  no


Tomasulo: Cycle 6

Insn Status
Insn            D    S    X    W
ldf X(r1),f1    c1   c2   c3   c4
mulf f0,f1,f2   c2   c4   c5+
stf f2,Z(r1)    c3
addi r1,4,r1    c4   c5   c6
ldf X(r1),f1    c5
mulf f0,f1,f2   c6
stf f2,Z(r1)

Map Table
Reg  T
f0
f1   RS#2
f2   RS#4 → RS#5
r1   RS#1

CDB
T    V

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  yes   addi  r1  -     -     [r1]  -
2  LD   yes   ldf   f1  -     RS#1  -     -
3  ST   yes   stf   -   RS#4  -     -     [r1]
4  FP1  yes   mulf  f2  -     -     [f0]  [f1]
5  FP2  yes   mulf  f2  -     RS#2  [f0]  -      (allocate)

• no D stall on WAW: scoreboard would
  • overwrite f2 RegStatus
  • anyone who needs old f2 tag has it


Tomasulo: Cycle 7

Insn Status
Insn            D    S    X    W
ldf X(r1),f1    c1   c2   c3   c4
mulf f0,f1,f2   c2   c4   c5+
stf f2,Z(r1)    c3
addi r1,4,r1    c4   c5   c6   c7
ldf X(r1),f1    c5   c7
mulf f0,f1,f2   c6
stf f2,Z(r1)

Map Table
Reg  T
f0
f1   RS#2
f2   RS#5
r1   RS#1   (cleared this cycle)

CDB
T     V
RS#1  [r1]

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  no
2  LD   yes   ldf   f1  -     RS#1  -     CDB.V  (RS#1 ready → grab CDB value)
3  ST   yes   stf   -   RS#4  -     -     [r1]
4  FP1  yes   mulf  f2  -     -     [f0]  [f1]
5  FP2  yes   mulf  f2  -     RS#2  [f0]  -

• addi finished (W): clear r1 RegStatus, CDB broadcast
• no W wait on WAR: scoreboard would
  • anyone who needs old r1 has RS copy
• D stall on store RS: structural


Tomasulo: Cycle 8

Insn Status
Insn            D    S    X    W
ldf X(r1),f1    c1   c2   c3   c4
mulf f0,f1,f2   c2   c4   c5+  c8
stf f2,Z(r1)    c3   c8
addi r1,4,r1    c4   c5   c6   c7
ldf X(r1),f1    c5   c7   c8
mulf f0,f1,f2   c6
stf f2,Z(r1)

Map Table
Reg  T
f0
f1   RS#2
f2   RS#5
r1

CDB
T     V
RS#4  [f2]

Reservation Stations
T  FU   busy  op    R   T1    T2    V1     V2
1  ALU  no
2  LD   yes   ldf   f1  -     -     -      [r1]
3  ST   yes   stf   -   RS#4  -     CDB.V  [r1]   (RS#4 ready → grab CDB value)
4  FP1  no
5  FP2  yes   mulf  f2  -     RS#2  [f0]   -

• mulf finished (W): don’t clear f2 RegStatus
  • already overwritten by 2nd mulf (RS#5)
  • CDB broadcast


Tomasulo: Cycle 9

Insn Status
Insn            D    S    X    W
ldf X(r1),f1    c1   c2   c3   c4
mulf f0,f1,f2   c2   c4   c5+  c8
stf f2,Z(r1)    c3   c8   c9
addi r1,4,r1    c4   c5   c6   c7
ldf X(r1),f1    c5   c7   c8   c9
mulf f0,f1,f2   c6   c9
stf f2,Z(r1)

Map Table
Reg  T
f0
f1   RS#2   (cleared this cycle)
f2   RS#5
r1

CDB
T     V
RS#2  [f1]

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  no
2  LD   no
3  ST   yes   stf   -   -     -     [f2]  [r1]
4  FP1  no
5  FP2  yes   mulf  f2  -     RS#2  [f0]  CDB.V  (RS#2 ready → grab CDB value)

• 2nd ldf finished (W): clear f1 RegStatus, CDB broadcast


Tomasulo: Cycle 10

Insn Status
Insn            D    S    X    W
ldf X(r1),f1    c1   c2   c3   c4
mulf f0,f1,f2   c2   c4   c5+  c8
stf f2,Z(r1)    c3   c8   c9   c10
addi r1,4,r1    c4   c5   c6   c7
ldf X(r1),f1    c5   c7   c8   c9
mulf f0,f1,f2   c6   c9   c10
stf f2,Z(r1)    c10

Map Table
Reg  T
f0
f1
f2   RS#5
r1

CDB
T    V

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  no
2  LD   no
3  ST   yes   stf   -   RS#5  -     -     [r1]   (free → allocate)
4  FP1  no
5  FP2  yes   mulf  f2  -     -     [f0]  [f1]

• stf finished (W): no output register → no CDB broadcast


Scoreboard vs. Tomasulo

                Scoreboard             Tomasulo
Insn            D    S    X     W      D    S    X     W
ldf X(r1),f1    c1   c2   c3    c4     c1   c2   c3    c4
mulf f0,f1,f2   c2   c4   c5+   c8     c2   c4   c5+   c8
stf f2,Z(r1)    c3   c8   c9    c10    c3   c8   c9    c10
addi r1,4,r1    c4   c5   c6    c9     c4   c5   c6    c7
ldf X(r1),f1    c5   c9   c10   c11    c5   c7   c8    c9
mulf f0,f1,f2   c8   c11  c12+  c15    c6   c9   c10+  c13
stf f2,Z(r1)    c10  c15  c16   c17    c10  c13  c14   c15

Hazard        Scoreboard    Tomasulo
Insn buffer   stall in D    stall in D
FU            wait in S     wait in S
RAW           wait in S     wait in S
WAR           wait in W     none
WAW           stall in D    none


Scoreboard vs. Tomasulo II: Cache Miss

                Scoreboard             Tomasulo
Insn            D    S    X     W      D    S    X     W
ldf X(r1),f1    c1   c2   c3+   c8     c1   c2   c3+   c8
mulf f0,f1,f2   c2   c8   c9+   c12    c2   c8   c9+   c12
stf f2,Z(r1)    c3   c12  c13   c14    c3   c12  c13   c14
addi r1,4,r1    c4   c5   c6    c13    c4   c5   c6    c7
ldf X(r1),f1    c8   c13  c14   c15    c5   c7   c8    c9
mulf f0,f1,f2   c12  c15  c16+  c19    c6   c9   c10+  c13
stf f2,Z(r1)    c13  c19  c20   c21    c7   c13  c14   c15

• Assume
  • 5 cycle cache miss on first ldf
  • Ignore FUST and RS structural hazards
+ Advantage Tomasulo
  • No addi WAR hazard (c7) means iterations run in parallel


Can We Add Superscalar?

• Dynamic scheduling and multiple issue are orthogonal
  • E.g., Pentium4: dynamically scheduled 5-way superscalar
  • Two dimensions
    • N: superscalar width (number of parallel operations)
    • W: window size (number of reservation stations)
• What do we need for an N-by-W Tomasulo?
  • RS: N tag/value w-ports (D), N value r-ports (S), 2N tag CAMs (W)
  • Select logic: W→N priority encoder (S)
  • MT: 2N r-ports (D), N w-ports (D)
  • RF: 2N r-ports (D), N w-ports (W)
  • CDB: N (W)
  • Which are the expensive pieces?


Superscalar Select Logic

• Superscalar select logic: W→N priority encoder
  – Somewhat complicated (N² log W)
  • Can simplify using different RS designs
• Split design
  • Divide RS into N banks: 1 per FU?
  • Implement N separate W/N→1 encoders
  + Simpler: N * log(W/N)
  – Less scheduling flexibility
• FIFO design [Palacharla+]
  • Can issue only head of each RS bank
  + Simpler: no select logic at all
  – Less scheduling flexibility (but surprisingly not that bad)
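To make the flexibility trade-off concrete, here is a behavioral sketch of the monolithic W→N select and the split design. It models only which entries are granted, not gate-level cost. The function names, the lowest-index-first priority, and the assumption that W divides evenly by N are choices made for this illustration.

```python
def select_monolithic(ready, n):
    """W->N select: grant up to n ready entries, lowest index first."""
    return [i for i, is_ready in enumerate(ready) if is_ready][:n]

def select_split(ready, n):
    """Split design: n banks of W/n entries, each with its own W/n->1 encoder."""
    size = len(ready) // n            # assumes W is a multiple of n
    grants = []
    for bank in range(n):
        entries = range(bank * size, (bank + 1) * size)
        pick = next((i for i in entries if ready[i]), None)
        if pick is not None:
            grants.append(pick)       # at most one grant per bank
    return grants

# W = 8 entries, N = 2: both ready insns happen to sit in bank 0.
ready = [False, True, True, False, False, False, False, False]
mono = select_monolithic(ready, 2)
split = select_split(ready, 2)
print(mono, split)
```

With both ready entries in the same bank, the monolithic select grants two insns while the split design grants one, which is the "less scheduling flexibility" cost named above.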


Can We Add Bypassing?

[Figure: Simple Tomasulo datapath: fetched insns, Map Table (T), Regfile (value), Reservation Stations (T, FU, op, R, T1, T2, V1, V2), tag comparators (==), CDB.T and CDB.V]

• Yes, but it’s more complicated than you might think
  • In fact: requires a completely new pipeline


Why Out-of-Order Bypassing Is Hard

                No Bypassing           Bypassing
Insn            D    S    X     W      D    S    X     W
ldf X(r1),f1    c1   c2   c3    c4     c1   c2   c3    c4
mulf f0,f1,f2   c2   c4   c5+   c8     c2   c3   c4+   c7
stf f2,Z(r1)    c3   c8   c9    c10    c3   c6   c7    c8
addi r1,4,r1    c4   c5   c6    c7     c4   c5   c6    c7
ldf X(r1),f1    c5   c7   c8    c9     c5   c7   c7    c9
mulf f0,f1,f2   c6   c9   c10+  c13    c6   c9   c8+   c13
stf f2,Z(r1)    c10  c13  c14   c15    c10  c13  c11   c15

• Bypassing: ldf X in c3 → mulf X in c4 → mulf S in c3
  • But how can mulf S in c3 if ldf W in c4? Must change pipeline
• Modern scheduler
  • Split CDB tag and value, move tag broadcast to S
    • ldf tag broadcast now in cycle 2 → mulf S in cycle 3
  • How do multi-cycle operations work? How do cache misses work?


Dynamic Scheduling Summary

• Dynamic scheduling: out-of-order execution
  • Higher pipeline/FU utilization, improved performance
  • Easier and more effective in hardware than software
    + More storage locations than architectural registers
    + Dynamic handling of cache misses
• Instruction buffer: multiple F/D latches
  • Implements large scheduling scope + “passing” functionality
  • Split decode into in-order dispatch and out-of-order issue
    • Stall vs. wait
• Dynamic scheduling algorithms
  • Scoreboard: no register renaming, limited out-of-order
  • Tomasulo: copy-based register renaming, full out-of-order


Are we done?

• When can Tomasulo go wrong?
  • Exceptions!!
    • No way to figure out relative order of instructions in RS
• Next Lecture: How to make exceptions work