47
ECE 4100/6100 Advanced Computer Architecture Lecture 7 Dynamic Scheduling (I) Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia Institute of Technology

Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

Embed Size (px)

Citation preview

Page 1: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

ECE 4100/6100Advanced Computer Architecture

Lecture 7 Dynamic Scheduling (I)

Prof. Hsien-Hsin Sean LeeSchool of Electrical and Computer EngineeringGeorgia Institute of Technology

Page 2: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

2

Data Flow Graph (DFG) i1: r2 = 4(r22) i2: r10 = 4(r25) i3: r10 = r2 + r10 i4: 4(r26) = r10 i5: r14 = 8(r27) i6: r6 = (r22) i7: r5 = (r23) i8: r5 = r6 – r5 i9: r4 = r14 * r5 i10: r15 = 12(r27) i11: r7 = 4(r22) i12: r8 = 4(r23) i13: r8 = r7 – r8 i14: r8 = r15* r8 i15: r8 = r4 – r8 i16: (r28) = r8

i1 i2

i3

i4

i6 i7

i8

i5

i9

i11 i12

i13

i10

i14

i15

i16

Data Flow Graph (or Data Dependency Graph)

Page 3: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

3

Data Flow Execution Model• To exploit maximal ILP

• An instruction can be executed immediately after– All source operands are ready– Execution unit available– Destination is ready (to be written)

Page 4: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

4

Dynamic Scheduling• Exploit ILP at run-time• Execute instructions out-of-order by a restricted

data flow execution model (still use PC!)• Hardware will

– Maintain true dependency (data flow manner)– Maintain exception behavior– Find ILP within an Instruction Window (pool)

• Need an accurate branch predictor• Pros

– Scalable performance: allows code to be compiled on one platform, but also run efficiently on another

– Handle cases where dependency is unknown at compile-time

• Cons– Hardware complexity (main argument from the VLIW/EPIC

camp)

Page 5: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

5

Out-of-Order Execution i1: r2 = 4(r22) i2: r10 = 4(r25) i3: r10 = r2 + r10 i4: 4(r26) = r10 i5: r14 = 8(r27) i6: r6 = (r22) i7: r5 = (r23) i8: r5 = r6 – r5 i9: r4 = r14 * r5 i10: r15 = 12(r27) i11: r7 = 4(r22) i12: r8 = 4(r23) i13: r8 = r7 – r8 i14: r8 = r15* r8 i15: r8 = r4 – r8 i16: (r28) = r8

i1 i2

i3

i4

i6 i7

i8

i5

i9

i11 i12

i13

i10

i14

i15

i16

Page 6: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

6

OOO Execution• OOO execution out-of-order completion

• OOO execution out-of-order retirement (commit)

• No (speculative) instruction allowed to retire until it is confirmed on the right path

• Fetch, decode, issue (i.e., front-end) are still done in the program order

Page 7: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

7

CDC 6600 Scoreboard Algorithm• Enable OOO Execution to address long-

latency FP instructions• Use scoreboard tables to track

– Functional unit status– Register update status

• Issue and execute instructions whenever– No structural hazard– No data hazard

• Cons– Stop issue when WAW is detected– Stop writeback when WAR is detected

Page 8: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

8

CDC6600 Scoreboard

Func

tiona

l Uni

ts

Reg

iste

rs

FP MultFP Mult

FP Divide

FP Add

Integer

MemorySCOREBOARD

Data bus

Data bus

Data bus

Data bus

Control bus/Status

Int

Mult1

Mult2

Add

Div

Fu1

1

0

1

1

BusyLoad

Mult

Sub

Div

OpF1

F0

F8

F2

DestR3

F1

F6

F0

Src1

F4

F1

F6

Src2

Int

Mult1

Dep1

Int

Dep2

F0 F1 F2 .. .. .. F31Mult1 Int Div .. .. .. xxxFU

FU Status Table

Register Update Table

Page 9: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

9

IBM 360• IBM 360 introduced

– 8-bit = 1 byte– 32-bit = 1 word– Byte-addressable memory– Differentiate an “architecture” from an

“implementation”

• IBM 360/91 FPU about 3 years after CDC 6600 (1966-7)

• Tomasulo algorithm – Dynamic scheduling– Register renaming

Page 10: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

10

Tomasulo Algorithm• Goal: High Performance without special compilers

– Dynamic scheduling done completely by HW– We generally use “supercalar processor” for such

category as opposed to “VLIW” or “EPIC”• Differences between IBM 360 and CDC 6600 ISA

– IBM has only 2 register specifiers per inst vs. 3 in CDC 6600

• Make WAW and WAR much worse– IBM has 4 FP registers vs. 8 in CDC 6600

• Smaller number of architectural register, compiler is incapable of exploiting better register allocation

– IBM has memory-to-register operations• Why study? Lead to Pentium Pro/II/III/4, Core, Alpha 21264,

MIPS R10000, HP 8000, PowerPC 604

Page 11: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

11

IBM 360/91 FPU w/ Tomasulo Algorithm• To not stall floating point instructions due to long

latency– Two function units FP Add + FP Mult/Div – 360/91 FPU is not pipelined

• Three new Mechanisms– Reservation StationsReservation Stations (RS)

• 3 in FP Add, 2 in FP mult/div• Register name is discarded when issue to reservation station

– TagsTags• 4-bit tag for one of the 11 possible sources (5 RSs + 6 FLB for

loads)• Written for unavailable sources whose results are being

generated by one of the sources (5 RS or 6 FLB)• New tag assignment eliminates false dependency

– Common Data BusCommon Data Bus (CDB), driven by• 11 Sources: 5 RS + 6 FLB• 17 Destinations: 2*5 RS + 3 SDB + 4 FLR

Page 12: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

12

Basic Principles• Do not rely on a centralized register file !

• RS fetches and buffers an operand as soon as it is available via CDB– Eliminating the need to get it from a register (No WAR)– Data Flow execution model

• Pending instructions designate the RS that will provide their input (renaming and maintain RAW)

• Due to in-order issue, the register status table always keeps the latest write (No WAW issue)

Page 13: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

13

Key Representation• Op Operation to perform in the units• Vj Value of Source 1 (called SINK in

360/91)• Vk Value of Source 2 (called SOURCE in

360/91)• Qj The RS (tag) will produce source 1• Qk The RS (tag) will produce source 2• A(ddress) Hold info for the memory

address generation for a load or store• Qi Whose value should be stored into

the register

Page 14: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

14

IBM 360/91 FPU w/ Tomasulo AlgorithmFrom Mem FP Registers (FLR)

Reservation Reservation StationsStations

Common Data Bus (CDB)Common Data Bus (CDB)

To Mem

FP operation stack (FLOS)

FP Load Buffers(FLB)

Store DataBuffers(SDB)

654321

FP Adder FP Mult/Div

321

21

Page 15: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

15

IBM 360/91 FPU w/ Tomasulo AlgorithmFrom Mem

FLR

Reservation Reservation StationsStations

Common Data Bus (CDB)Common Data Bus (CDB)

To Mem

FP operation stack (FLOS)

FLB

Store DataBuffers(SDB)

654321

FP Adder FP Mult/Div

321

21

Tag(Qj)

Tag(Qk)

Sink(Vj)

Source(Vk)Control

Tags and other info in RS Tags and other info in RS

Control Tag(Qi) Tags in FLBTags in FLB

Control Tag

Control

Page 16: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

16

RAWRAW Example: i: R2 i: R2 R0 + R4 (2 clks) R0 + R4 (2 clks)j: R8 j: R8 R0 + R2 (2 clks) R0 + R2 (2 clks)

RSRSTag Sink Tag Src

112233

Adder

FLR Busy

Tag Data

0 6.02 3.54 10.08 7.8

Cycle #0: Cycle #0:

RSRSTag Sink Tag Src

4455

Multiplier/Divider

RSRSTag Sink Tag Src

11 0 6.0 0 10.02233

Adder

FLR Busy

Tag Data

0 6.02 1 11 ---4 10.08 7.8

Cycle #1: IssueCycle #1: Issue i i

RSRSTag Sink Tag Src

4455

Multiplier/Divider

RSRSTag Sink Tag Src

11 0 6.0 0 10.022 0 6.0 11 ---33

Adder

FLR Busy

Tag Data

0 6.02 1 11 ---4 10.08 1 22 ---

Cycle #2: Issue Cycle #2: Issue jj

RSRSTag Sink Tag Src

4455

Multiplier/Divider

Page 17: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

17

RAWRAW Example: i: R2 i: R2 R0 + R4 (2 clks) R0 + R4 (2 clks)j: R8 j: R8 R0 + R2 (2 clks) R0 + R2 (2 clks)

RSRSTag Sink Tag Src

1122 0 6.0 0 16.033

Adder

FLR Busy

Tag Data

0 6.02 16.04 10.08 1 22 ---

Cycle #3: Cycle #3: Broadcasts tag and result: CDB_a=<RS1,16.0>Broadcasts tag and result: CDB_a=<RS1,16.0>

RSRSTag Sink Tag Src

4455

Multiplier/Divider

RSRSTag Sink Tag Src

112233

Adder

FLR Busy

Tag Data

0 6.02 16.04 10.08 22.0

Cycle #5: Cycle #5: Broadcasts tag and result: CDB_a=<RS2,22.0>Broadcasts tag and result: CDB_a=<RS2,22.0>

RSRSTag Sink Tag Src

4455

Multiplier/Divider

Page 18: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

18

RSRSTag Sink Tag Src

112233

Adder

FLR Busy

Tag Data

0 6.02 3.54 10.08 7.8

Cycle #0:Cycle #0:

RSRSTag Sink Tag Src

4455

Multiplier/Divider

RSRSTag Sink Tag Src

112233

Adder

FLR Busy

Tag Data

0 6.02 3.54 1 44 ---8 7.8

Cycle #1: Issue Cycle #1: Issue ii

RSRSTag Sink Tag Src

44 0 6.0 0 7.855

Multiplier/Divider

RSRSTag Sink Tag Src

112233

Adder

FLR Busy

Tag Data

0 1 55 ---2 3.54 1 44 ---8 7.8

Cycle #2: Issue Cycle #2: Issue jj

RSRSTag Sink Tag Src

44 0 6.0 0 7.855 44 --- 0 3.5

Multiplier/Divider

WARWAR Example: i: R4 i: R4 R0 x R8 (3) R0 x R8 (3)j: R0 j: R0 R4 x R2 (3) R4 x R2 (3)k: R2 k: R2 R2 + R8 (2) R2 + R8 (2)

Page 19: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

19

Cycle #3: Issue Cycle #3: Issue kk

WARWAR Example: i: R4 i: R4 R0 x R8 (3) R0 x R8 (3)j: R0 j: R0 R4 x R2 (3) R4 x R2 (3)k: R2 k: R2 R2 + R8 (2) R2 + R8 (2)

RSRSTag Sink Tag Src

11 0 3.5 0 7.82233

Adder

FLR Busy

Tag Data

0 1 55 ---2 1 11 ---4 1 44 ---8 7.8

RSRSTag Sink Tag Src

44 0 6.0 0 7.855 44 --- 0 3.5

Multiplier/Divider

RSRSTag Sink Tag Src

11 0 3.5 0 7.82233

Adder

FLR Busy

Tag Data

0 1 55 ---2 1 11 ---4 46.88 7.8

Cycle #4: Cycle #4: Broadcasts CDB_m=<RS4,46.8>;Broadcasts CDB_m=<RS4,46.8>;

RSRSTag Sink Tag Src

4455 0 46.8 0 3.5

Multiplier/Divider

RSRSTag Sink Tag Src

112233

Adder

FLR Busy

Tag Data

0 1 55 ---2 11.34 46.88 7.8

Cycle #5: Cycle #5: Broadcasts CDB_a=<RS1,11.3> Broadcasts CDB_a=<RS1,11.3>

RSRSTag Sink Tag Src

4455 0 46.8 0 3.5

Multiplier/Divider

Page 20: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

20

WARWAR Example: i: R4 i: R4 R0 x R8 (3) R0 x R8 (3)j: R0 j: R0 R4 x R2 (3) R4 x R2 (3)k: R2 k: R2 R2 + R8 (2) R2 + R8 (2)

RSRSTag Sink Tag Src

112233

Adder

FLR Busy

Tag Data

0 163.82 11.34 46.88 7.8

Cycle #7: Cycle #7: Broadcasts CDB_m=<RS5,163.8> Broadcasts CDB_m=<RS5,163.8>

RSRSTag Sink Tag Src

4455

Multiplier/Divider

Page 21: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

21

RSRSTag Sink Tag Src

11 0 6.0 44 ---2233

Adder

FLR Busy

Tag Data

0 6.02 1 11 ---4 1 44 ---8 7.8

Cycle #2: IssueCycle #2: Issue j j

RSRSTag Sink Tag Src

44 0 6.0 0 7.855

Multiplier/Divider

WAWWAW Example:

RSRSTag Sink Tag Src

112233

Adder

FLR Busy

Tag Data

0 6.02 3.54 10.08 7.8

Cycle #0:Cycle #0:

RSRSTag Sink Tag Src

4455

Multiplier/Divider

RSRSTag Sink Tag Src

112233

Adder

FLR Busy

Tag Data

0 6.02 3.54 1 44 ---8 7.8

Cycle #1: Issue Cycle #1: Issue ii

RSRSTag Sink Tag Src

44 0 6.0 0 7.855

Multiplier/Divider

i: R4 i: R4 R0 x R8 (3) R0 x R8 (3)j: R2 j: R2 R0 + R4 (2) R0 + R4 (2)k: R4 k: R4 R0 + R8 (2) R0 + R8 (2)

Page 22: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

22

RSRSTag Sink Tag Src

11 0 6.0 44 ---22 0 6.0 0 7.833

Adder

FLR Busy

Tag Data

0 6.02 1 11 ---4 1 22 ---8 7.8

Cycle #3: IssueCycle #3: Issue k k

RSRSTag Sink Tag Src

44 0 6.0 0 7.855

Multiplier/Divider

WAWWAW Example: i: R4 i: R4 R0 x R8 (3) R0 x R8 (3)j: R2 j: R2 R0 + R4 (2) R0 + R4 (2)k: R4 k: R4 R0 + R8 (2) R0 + R8 (2)

RSRSTag Sink Tag Src

11 0 6.0 0 46.822 0 6.0 0 7.833

Adder

FLR Busy

Tag Data

0 6.02 1 11 ---4 1 22 ---8 7.8

Cycle #4: Cycle #4: Broadcasts CDB_m=<RS4,46.8>Broadcasts CDB_m=<RS4,46.8>

RSRSTag Sink Tag Src

4455

Multiplier/Divider

RSRSTag Sink Tag Src

11 0 6.0 0 46.82233

Adder

FLR Busy

Tag Data

0 6.02 1 11 ---4 13.88 7.8

Cycle #5: Cycle #5: Broadcasts CDB_a=<RS2,13.8> Broadcasts CDB_a=<RS2,13.8>

RSRSTag Sink Tag Src

4455

Multiplier/Divider

Page 23: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

23

WAWWAW Example:

RSRSTag Sink Tag Src

112233

Adder

FLR Busy

Tag Data

0 6.02 52.84 13.88 7.8

Cycle #6: Cycle #6: Broadcasts CDB_a=<RS1,52.8> Broadcasts CDB_a=<RS1,52.8>

RSRSTag Sink Tag Src

4455

Multiplier/Divider

i: R4 i: R4 R0 x R8 (3) R0 x R8 (3)j: R2 j: R2 R0 + R4 (2) R0 + R4 (2)k: R4 k: R4 R0 + R8 (2) R0 + R8 (2)

Page 24: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

24

Tomasulo Example (H&P Text)

Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 Load1 NoLD F2 45+ R3 Load2 NoMULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F300 Qi

These are RS, we have only “one FU” for eachtype (MUL, ADD, LD). We reduce Load from 6 to 3 for simplicity. SDB is not shown either

Page 25: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

25

Assumption• INT (load) 1 cycle • MULT 10 cycles• ADD 2 cycles• DIVIDE 40 cycles

Page 26: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

26

Tomasulo Example Cycle 1Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 Load2 NoMULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F301 Qi Load1

Page 27: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

27

Tomasulo Example Cycle 2Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F302 Qi Load2 Load1

Note: Unlike CDC6600, RS enables multiple outstanding loadsLoad is calculating the effective address

Page 28: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

28

Tomasulo Example Cycle 3Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F303 Qi Mult1 Load2 Load1

• Note: registers names are removed (“renamed”) in Reservation Stations; MULT issued vs. scoreboard

• Load1 completing; what is waiting for Load1?

Page 29: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

29

Tomasulo Example Cycle 4Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 Yes SUBD M(A1) Load2Add2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F304 Qi Mult1 Load2 M(A1) Add1

• Load1 write to CDB; Load2 completing; what is waiting for Load2?

Page 30: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

30

Tomasulo Example Cycle 5Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

2 Add1 Yes SUBD M(A1) M(A2)Add2 NoAdd3 No

10 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD R(F6) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F305 Qi Mult1 M(A2) Add1 Mult2

Page 31: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

31

Tomasulo Example Cycle 6Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

1 Add1 Yes SUBD M(A1) M(A2)Add2 Yes ADDD R(F2) Add1Add3 No

9 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD R(F6) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F306 Qi Mult1 Add2 Add1 Mult2

• R(F6) was entered in Cycle 5• Issue ADDD here vs. scoreboard?

Page 32: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

32

Tomasulo Example Cycle 7Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7DIVD F10 F0 F6 5ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

0 Add1 Yes SUBD M(A1) M(A2)Add2 Yes ADDD R(F2) Add1Add3 No

8 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD R(F6) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F307 Qi Mult1 Add2 Add1 Mult2

• Add1 completing; what is waiting for it?

Page 33: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

33

Tomasulo Example Cycle 8Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 No2 Add2 Yes ADDD (M1-M2)R(F2)

Add3 No7 Mult1 Yes MULTD M(A2) R(F4)

Mult2 Yes DIVD R(F6) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F308 Qi Mult1 Add2 (M1-M2)Mult2

Page 34: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

34

Tomasulo Example Cycle 9Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 No1 Add2 Yes ADDD (M1-M2)R(F2)

Add3 No6 Mult1 Yes MULTD M(A2) R(F4)

Mult2 Yes DIVD R(F6) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F309 FU Mult1 Add2 Mult2

Page 35: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

35

Tomasulo Example Cycle 10Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 No0 Add2 Yes ADDD (M1-M2)R(F2)

Add3 No5 Mult1 Yes MULTD M(A2) R(F4)

Mult2 Yes DIVD R(F6) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3010 FU Mult1 Add2 Mult2

• Add2 completing; what is waiting for it?

Page 36: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

36

Tomasulo Example Cycle 11Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 No

4 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD R(F6) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3011 FU Mult1 (M1-M2+M(A2)) Mult2

• Write result of ADDD here vs. scoreboard?• All quick instructions complete in this cycle!

Page 37: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

37

Tomasulo Example Cycle 12Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 No

3 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD R(F6) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3012 FU Mult1 Mult2

Page 38: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

38

Tomasulo Example Cycle 13Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 No

2 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD R(F6) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3013 FU Mult1 Mult2

Page 39: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

39

Tomasulo Example Cycle 14Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 No

1 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD R(F6) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3014 FU Mult1 Mult2

Page 40: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

40

Tomasulo Example Cycle 15Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 No

0 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD R(F6) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3015 FU Mult1 Mult2

Page 41: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

41

Tomasulo Example Cycle 16Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 No

40 Mult2 Yes DIVD M*F4 R(F6)

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3016 FU M*F4 Mult2

Page 42: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

42

Faster than light computation(skip a couple of cycles)

Page 43: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

43

Tomasulo Example Cycle 55Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 No

1 Mult2 Yes DIVD M*F4 R(F6)

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3055 FU Mult2

Page 44: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

44

Tomasulo Example Cycle 56Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5 56ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 No

0 Mult2 Yes DIVD M*F4 R(F6)

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3056 FU Mult2

• Mult2 is completing; what is waiting for it?

Page 45: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

45

Tomasulo Example Cycle 57Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5 56 57ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3057 FU Result

• Once again: In-order issue, out-of-order execution and completion.

Page 46: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

46

Compare to Scoreboard Cycle 62

Instruction status: Read Exec Write Exec WriteInstruction j k Issue Oper Comp Result Issue ComplResultLD F6 34+ R2 1 2 3 4 1 3 4LD F2 45+ R3 5 6 7 8 2 4 5MULTD F0 F2 F4 6 9 19 20 3 15 16SUBD F8 F6 F2 7 9 11 12 4 7 8DIVD F10 F0 F6 8 21 61 62 5 56 57ADDD F6 F8 F2 13 14 16 22 6 10 11

• Why take longer on scoreboard/6600?• Structural Hazards• Lack of forwarding

Page 47: Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1

47

Issues in Tomasulo Algorithm• CDB at high speed?• Precise exception issues• Speculative instructions

– Branch prediction enlarges instruction window

– How to rollback when mispredicted?