EE524/CptS561 Advanced Computer Architecture Dynamic Scheduling A scheme to overcome data hazards


Page 1: Dynamic Scheduling

Dynamic Scheduling

A scheme to overcome data hazards

Page 2: Dynamic Scheduling

Advantages of Dynamic Scheduling

• Dynamic scheduling: hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior
• It handles cases when dependences are unknown at compile time
  – It allows the processor to tolerate unpredictable delays such as cache misses, by executing other code while waiting for the miss to resolve
• It allows code that was compiled for one pipeline to run efficiently on a different pipeline
• It simplifies the compiler
• Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling

Page 3: Dynamic Scheduling

HW Schemes: Instruction Parallelism

• Key idea: allow instructions behind a stall to proceed

    DIVD F0,F2,F4
    ADDD F10,F0,F8
    SUBD F12,F8,F14

• Enables out-of-order execution and allows out-of-order completion (e.g., SUBD)
  – In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue)
• We will distinguish when an instruction begins execution and when it completes execution; between the two times, the instruction is in execution
• Note: dynamic execution creates WAR and WAW hazards and makes exceptions harder
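The three-instruction sequence on this slide can be checked mechanically. The following is a minimal sketch, not part of the original slides: the `Instr` tuple and the `hazards` helper are hypothetical names introduced only for illustration. It classifies the register dependences between two instructions and shows why SUBD may proceed while ADDD must wait for DIVD.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    src1: str
    src2: str


def hazards(earlier: Instr, later: Instr) -> set[str]:
    """Return the data hazards `later` has with respect to `earlier`."""
    found = set()
    if earlier.dest in (later.src1, later.src2):
        found.add("RAW")  # later reads what earlier writes
    if later.dest in (earlier.src1, earlier.src2):
        found.add("WAR")  # later overwrites what earlier still reads
    if later.dest == earlier.dest:
        found.add("WAW")  # both write the same register
    return found


divd = Instr("DIVD", "F0", "F2", "F4")
addd = Instr("ADDD", "F10", "F0", "F8")
subd = Instr("SUBD", "F12", "F8", "F14")

print(hazards(divd, addd))  # ADDD reads F0, which DIVD produces
print(hazards(divd, subd))  # no shared registers: SUBD need not wait
print(hazards(addd, subd))  # SUBD is independent of ADDD as well
```

Because SUBD shares no register with either earlier instruction, a dynamically scheduled pipeline can let it execute and complete while ADDD is still stalled behind the long-running DIVD.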

Page 4: Dynamic Scheduling

Dynamic Scheduling Step 1

• The simple pipeline had one stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue
• Split the ID pipe stage of the simple 5-stage pipeline into 2 stages:
  – Issue: decode instructions, check for structural hazards
  – Read operands: wait until no data hazards, then read operands

Page 5: Dynamic Scheduling

Tomasulo Algorithm

• Control & buffers distributed with Function Units (FU)
  – FU buffers called “reservation stations”; have pending operands
• Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming
  – Avoids WAR, WAW hazards
  – More reservation stations than registers, so can do optimizations compilers cannot
• Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs
• Loads and Stores treated as FUs with RSs as well
• Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue
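To make the renaming idea concrete, here is a minimal sketch under stated assumptions: the four-instruction program, the station names, and the `rename` helper are all hypothetical and chosen only to exhibit one WAR and one WAW dependence. It also assumes no instruction completes while the four are being issued, and it assigns stations statically rather than modelling a free list.

```python
def rename(program, stations):
    """Issue instructions in order, replacing register names with values or RS tags."""
    reg_status = {}  # register -> tag of the station that will write it
    issued = []
    for (op, dest, s1, s2), tag in zip(program, stations):
        # A source becomes a tag if a producer is pending, otherwise the
        # register's current value, written symbolically as R(name).
        srcs = [reg_status.get(s, f"R({s})") for s in (s1, s2)]
        issued.append((tag, op, *srcs))
        reg_status[dest] = tag  # later readers of dest now wait on this station
    return issued, reg_status


# Hypothetical sequence: SUBD writes F8 after ADDD reads it (WAR),
# and MULTD writes F6 after ADDD writes it (WAW).
program = [
    ("DIVD", "F0", "F2", "F4"),
    ("ADDD", "F6", "F0", "F8"),
    ("SUBD", "F8", "F10", "F14"),
    ("MULTD", "F6", "F10", "F8"),
]
stations = ["Mult1", "Add1", "Add2", "Mult2"]

issued, reg_status = rename(program, stations)
for row in issued:
    print(row)
print(reg_status)
```

ADDD captures the old value R(F8) at issue, so SUBD can overwrite F8 at any time without harm. The final owner of F6 is Mult2, so a late result from Add1 would not be written to the register, which is how renaming removes the WAW hazard in this sketch.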

Page 6: Dynamic Scheduling

[Figure: Tomasulo scheme. Block diagram showing the FP operation queue (from the instruction unit), the FP registers, six load buffers (from memory), three store buffers (to memory), reservation stations feeding the FP adders (three stations) and the FP multipliers (two stations), the operation bus, the operand buses, and the common data bus (CDB).]

Page 7: Dynamic Scheduling

Reservation Station Components

Op: Operation to perform in the unit (e.g., + or –)
Vj, Vk: Value of source operands
  – Store buffers have a V field, the result to be stored
Qj, Qk: Reservation stations producing source registers (value to be written)
  – Note: No ready flags as in Scoreboard; Qj, Qk = 0 => ready
  – Store buffers only have Qi for RS producing result
Busy: Indicates reservation station or FU is busy
Register result status: Indicates which functional unit will write each register, if one exists. Blank when there are no pending instructions that will write that register.
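The fields listed above map naturally onto a small record type. The sketch below is an illustration only: the class name, the use of `None` in place of "Q = 0", and the numeric placeholder values are assumptions, not something the slides specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None    # operation to perform, e.g. "MULTD"
    vj: Optional[float] = None  # value of first source operand, once known
    vk: Optional[float] = None  # value of second source operand, once known
    qj: Optional[str] = None    # station producing Vj; None plays the role of Qj = 0
    qk: Optional[str] = None    # station producing Vk; None plays the role of Qk = 0

    def ready(self) -> bool:
        """No separate ready flags: an entry is ready when both Q fields are empty."""
        return self.busy and self.qj is None and self.qk is None


# An entry shaped like MULTD F0,F2,F4 right after issue: the F4 value is
# held in Vk (2.5 is a made-up number) and F2 is awaited from Load2.
mult1 = ReservationStation("Mult1", busy=True, op="MULTD", vk=2.5, qj="Load2")
print(mult1.ready())

# Once Load2's result arrives, Vj is filled in and Qj is cleared.
mult1.vj, mult1.qj = 4.0, None
print(mult1.ready())
```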

Page 8: Dynamic Scheduling

Three Stages of Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
   If reservation station free (no structural hazard), control issues instr & sends operands (renames registers).
2. Execution: operate on operands (EX)
   When both operands ready then execute; if not ready, watch Common Data Bus for result
3. Write result: finish execution (WB)
   Write on Common Data Bus to all awaiting units; mark reservation station available

• Normal data bus: data + destination (“go to” bus)
• Common data bus: data + source (“come from” bus)
  – 64 bits of data + 4 bits of Functional Unit source address
  – Write if matches expected Functional Unit (produces result)
  – Does the broadcast
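The “come from” behaviour of the common data bus can be sketched in a few lines. This is an illustrative model, not the slides' implementation: stations are plain dictionaries, values are symbolic strings, and the `broadcast` helper is a hypothetical name. The state used below mirrors the cycle at which Load2 writes its result in the worked example.

```python
def broadcast(tag, value, stations, reg_status, regs):
    """Put (source tag, value) on the CDB; every listener compares the tag."""
    for rs in stations:
        if rs.get("Qj") == tag:
            rs["Vj"], rs["Qj"] = value, None
        if rs.get("Qk") == tag:
            rs["Vk"], rs["Qk"] = value, None
    # A register is written only if it still expects this producer.
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            regs[reg] = value
            del reg_status[reg]


stations = [
    {"name": "Add1", "op": "SUBD", "Vj": "M(34+R2)", "Vk": None, "Qj": None, "Qk": "Load2"},
    {"name": "Mult1", "op": "MULTD", "Vj": None, "Vk": "R(F4)", "Qj": "Load2", "Qk": None},
]
reg_status = {"F0": "Mult1", "F2": "Load2", "F8": "Add1"}
regs = {}

broadcast("Load2", "M(45+R3)", stations, reg_status, regs)
print(stations[0]["Vk"], stations[1]["Vj"], regs, reg_status)
```

The producer never names its consumers. Each waiting unit recognizes the source tag on the bus, which is why one broadcast can satisfy Add1, Mult1, and the register file in the same cycle.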

Page 9: Dynamic Scheduling

Tomasulo Example Cycle 0

Instruction status:
  Instruction  j    k    Issue  Exec complete  Write result
  LD    F6     34+  R2
  LD    F2     45+  R3
  MULTD F0     F2   F4
  SUBD  F8     F6   F2
  DIVD  F10    F0   F6
  ADDD  F6     F8   F2

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations (Vj, Vk: source values S1, S2; Qj, Qk: RS for j, RS for k):
  Time  Name   Busy  Op     Vj        Vk        Qj     Qk
  0     Add1   No
  0     Add2   No
  0     Add3   No
  0     Mult1  No
  0     Mult2  No

Register result status:
  Clock       F0     F2        F4  F6         F8       F10    F12  ...  F30
  0      FU

Page 10: Dynamic Scheduling

[Figure: Tomasulo scheme, cycle 0. Load buffers, store buffers, and reservation stations are all empty. Next to issue: LD F6, 34(R2).]

Page 11: Dynamic Scheduling

[Figure: Tomasulo scheme, cycle 1. In flight: LD F6, 34(R2), holding address 34+R2 in load buffer 1. Register status: F6: load1. Next to issue: LD F2, 45(R3).]

Page 12: Dynamic Scheduling

[Figure: Tomasulo scheme, cycle 2. In flight: LD F6, 34(R2) (load buffer 1, address 34+R2) and LD F2, 45(R3) (load buffer 2, address 45+R3). Register status: F6: load1, F2: load2. Next to issue: MULTD F0,F2,F4.]

Page 13: Dynamic Scheduling

[Figure: Tomasulo scheme, cycle 3. Load buffer 1 now holds Mem[34+R2]; load buffer 2 holds address 45+R3. MULTD F0,F2,F4 occupies multiplier station 1, waiting on load2 for its first operand and holding the value of F4. Register status: F6: load1, F2: load2, F0: mult1. Next to issue: SUBD F8,F6,F2.]

Page 14: Dynamic Scheduling

[Figure: Tomasulo scheme, cycle 4. Load1 broadcasts Mem[34+R2] on the CDB (L1: Mem[34+R2]) and F6 receives it. Load buffer 2 holds Mem[45+R3]. SUBD F8,F6,F2 occupies adder station 1 with operand tags load1 and load2; MULTD still waits on load2 in multiplier station 1. Register status: F2: load2, F0: mult1, F8: add1. Next to issue: DIVD F10,F0,F6.]

Page 15: Dynamic Scheduling

[Figure: Tomasulo scheme, cycle 5. Load2 broadcasts Mem[45+R3] on the CDB (L2: Mem[45+R3]); F2 and the waiting stations receive it. Adder station 1 holds SUBD with Mem[R2] already captured; multiplier station 2 holds DIVD F10,F0,F6, waiting on Mult1. Register status: F8: add1, F0: mult1, F10: mult2. Next to issue: ADDD F6,F8,F2.]

Page 16: Dynamic Scheduling

[Figure: Tomasulo scheme, cycle 6. Adder station 1: SUBD with operands Mem[R2] and M[R3]. Adder station 2: ADDD F6,F8,F2, waiting on add1 and holding M[R3]. Multiplier station 1: MULTD with M[R3] and the value of F4. Multiplier station 2: DIVD, waiting on Mult1. Register status: F6: add2, F0: mult1, F10: mult2, F8: add1.]

Page 17: Dynamic Scheduling

[Figure: Tomasulo scheme, cycle 7. Station contents and register status are unchanged from cycle 6: SUBD (add1), ADDD (add2), MULTD (mult1), and DIVD (mult2) are in flight.]

Page 18: Dynamic Scheduling

[Figure: Tomasulo scheme, cycle 8. Add1 broadcasts the SUBD result M()-M() on the CDB; F8 and the waiting ADDD in adder station 2 receive it. Register status: F6: add2, F0: mult1, F10: mult2.]

Page 19: Dynamic Scheduling

[Figure: Tomasulo scheme, cycle 9. Adder station 1 is free. In flight: ADDD (adder station 2, operands M()-M() and M[R3]), MULTD (multiplier station 1), and DIVD (multiplier station 2, waiting on Mult1). Register status: F6: add2, F0: mult1, F10: mult2.]

Page 20: Dynamic Scheduling

[Figure: Tomasulo scheme, cycle 10. Station contents and register status are unchanged from cycle 9: ADDD (add2), MULTD (mult1), and DIVD (mult2) are in flight.]

Page 21: Dynamic Scheduling

[Figure: Tomasulo scheme, cycle 11. Add2 broadcasts the ADDD result (M()-M())+M() on the CDB and F6 receives it. Register status: F0: mult1, F10: mult2.]

Page 22: Dynamic Scheduling

[Figure: Tomasulo scheme, cycle 12. All adder stations are free. In flight: MULTD (multiplier station 1) and DIVD (multiplier station 2, waiting on Mult1). Register status: F0: mult1, F10: mult2.]

Page 23: Dynamic Scheduling

[Figure: Tomasulo scheme, cycle 13. Unchanged from cycle 12: MULTD (mult1) and DIVD (mult2) are in flight. Register status: F0: mult1, F10: mult2.]

Page 24: Dynamic Scheduling

[Figure: Tomasulo scheme, cycle 14. Unchanged from cycle 12: MULTD (mult1) and DIVD (mult2) are in flight. Register status: F0: mult1, F10: mult2.]

Page 25: Dynamic Scheduling

[Figure: Tomasulo scheme, cycle 15. Unchanged from cycle 12: MULTD (mult1) and DIVD (mult2) are in flight. Register status: F0: mult1, F10: mult2.]

Page 26: Dynamic Scheduling

[Figure: Tomasulo scheme, cycle 16. Mult1 broadcasts the MULTD result M()*F4 on the CDB; F0 and the waiting DIVD in multiplier station 2 receive it. Register status: F10: mult2.]

Page 27: Dynamic Scheduling

[Figure: Tomasulo scheme, cycle 17. Multiplier station 1 is free. DIVD F10,F0,F6 executes in multiplier station 2 with first operand M()*F4. Register status: F10: mult2.]

Page 28: Dynamic Scheduling

[Figure: Tomasulo scheme, cycle 57. Mult2 broadcasts the DIVD result M()*F4 / M() on the CDB and F10 receives it.]

Page 29: Dynamic Scheduling

Tomasulo Example Cycle 1

Instruction status:
  Instruction  j    k    Issue  Exec complete  Write result
  LD    F6     34+  R2   1
  LD    F2     45+  R3
  MULTD F0     F2   F4
  SUBD  F8     F6   F2
  DIVD  F10    F0   F6
  ADDD  F6     F8   F2

Load buffers:
  Name   Busy  Address
  Load1  Yes   34+R2
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj        Vk        Qj     Qk
  0     Add1   No
  0     Add2   No
  0     Add3   No
  0     Mult1  No
  0     Mult2  No

Register result status:
  Clock       F0     F2        F4  F6         F8       F10    F12  ...  F30
  1      FU                        Load1

Page 30: Dynamic Scheduling

Tomasulo Example Cycle 2

Instruction status:
  Instruction  j    k    Issue  Exec complete  Write result
  LD    F6     34+  R2   1
  LD    F2     45+  R3   2
  MULTD F0     F2   F4
  SUBD  F8     F6   F2
  DIVD  F10    F0   F6
  ADDD  F6     F8   F2

Load buffers:
  Name   Busy  Address
  Load1  Yes   34+R2
  Load2  Yes   45+R3
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj        Vk        Qj     Qk
  0     Add1   No
  0     Add2   No
  0     Add3   No
  0     Mult1  No
  0     Mult2  No

Register result status:
  Clock       F0     F2        F4  F6         F8       F10    F12  ...  F30
  2      FU          Load2         Load1

• Note: Unlike 6600, can have multiple loads outstanding

Page 31: Dynamic Scheduling

Tomasulo Example Cycle 3

Instruction status:
  Instruction  j    k    Issue  Exec complete  Write result
  LD    F6     34+  R2   1      3
  LD    F2     45+  R3   2
  MULTD F0     F2   F4   3
  SUBD  F8     F6   F2
  DIVD  F10    F0   F6
  ADDD  F6     F8   F2

Load buffers:
  Name   Busy  Address
  Load1  Yes   34+R2
  Load2  Yes   45+R3
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj        Vk        Qj     Qk
  0     Add1   No
  0     Add2   No
  0     Add3   No
  0     Mult1  Yes   MULTD            R(F4)     Load2
  0     Mult2  No

Register result status:
  Clock       F0     F2        F4  F6         F8       F10    F12  ...  F30
  3      FU   Mult1  Load2         Load1

• Note: register names are removed (“renamed”) in reservation stations; MULT issued vs. scoreboard
• Load1 completing; what is waiting for Load1?

Page 32: Dynamic Scheduling

Tomasulo Example Cycle 4

Instruction status:
  Instruction  j    k    Issue  Exec complete  Write result
  LD    F6     34+  R2   1      3              4
  LD    F2     45+  R3   2      4
  MULTD F0     F2   F4   3
  SUBD  F8     F6   F2   4
  DIVD  F10    F0   F6
  ADDD  F6     F8   F2

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  Yes   45+R3
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj        Vk        Qj     Qk
  0     Add1   Yes   SUBD   M(34+R2)                   Load2
  0     Add2   No
  0     Add3   No
  0     Mult1  Yes   MULTD            R(F4)     Load2
  0     Mult2  No

Register result status:
  Clock       F0     F2        F4  F6         F8       F10    F12  ...  F30
  4      FU   Mult1  Load2         M(34+R2)   Add1

• Load2 completing; what is waiting for it?

Page 33: Dynamic Scheduling

Tomasulo Example Cycle 5

Instruction status:
  Instruction  j    k    Issue  Exec complete  Write result
  LD    F6     34+  R2   1      3              4
  LD    F2     45+  R3   2      4              5
  MULTD F0     F2   F4   3
  SUBD  F8     F6   F2   4
  DIVD  F10    F0   F6   5
  ADDD  F6     F8   F2

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj        Vk        Qj     Qk
  2     Add1   Yes   SUBD   M(34+R2)  M(45+R3)
  0     Add2   No
  0     Add3   No
  10    Mult1  Yes   MULTD  M(45+R3)  R(F4)
  0     Mult2  Yes   DIVD             M(34+R2)  Mult1

Register result status:
  Clock       F0     F2        F4  F6         F8       F10    F12  ...  F30
  5      FU   Mult1  M(45+R3)      M(34+R2)   Add1     Mult2

Page 34: Dynamic Scheduling

Tomasulo Example Cycle 6

Instruction status:
  Instruction  j    k    Issue  Exec complete  Write result
  LD    F6     34+  R2   1      3              4
  LD    F2     45+  R3   2      4              5
  MULTD F0     F2   F4   3
  SUBD  F8     F6   F2   4
  DIVD  F10    F0   F6   5
  ADDD  F6     F8   F2   6

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj        Vk        Qj     Qk
  1     Add1   Yes   SUBD   M(34+R2)  M(45+R3)
  0     Add2   Yes   ADDD             M(45+R3)  Add1
  0     Add3   No
  9     Mult1  Yes   MULTD  M(45+R3)  R(F4)
  0     Mult2  Yes   DIVD             M(34+R2)  Mult1

Register result status:
  Clock       F0     F2        F4  F6         F8       F10    F12  ...  F30
  6      FU   Mult1  M(45+R3)      Add2       Add1     Mult2

• Issue ADDD here vs. scoreboard?

Page 35: Dynamic Scheduling

Tomasulo Example Cycle 7

Instruction status:
  Instruction  j    k    Issue  Exec complete  Write result
  LD    F6     34+  R2   1      3              4
  LD    F2     45+  R3   2      4              5
  MULTD F0     F2   F4   3
  SUBD  F8     F6   F2   4      7
  DIVD  F10    F0   F6   5
  ADDD  F6     F8   F2   6

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj        Vk        Qj     Qk
  0     Add1   Yes   SUBD   M(34+R2)  M(45+R3)
  0     Add2   Yes   ADDD             M(45+R3)  Add1
  0     Add3   No
  8     Mult1  Yes   MULTD  M(45+R3)  R(F4)
  0     Mult2  Yes   DIVD             M(34+R2)  Mult1

Register result status:
  Clock       F0     F2        F4  F6         F8       F10    F12  ...  F30
  7      FU   Mult1  M(45+R3)      Add2       Add1     Mult2

• Add1 completing; what is waiting for it?

Page 36: Dynamic Scheduling

Tomasulo Example Cycle 8

Instruction status:
  Instruction  j    k    Issue  Exec complete  Write result
  LD    F6     34+  R2   1      3              4
  LD    F2     45+  R3   2      4              5
  MULTD F0     F2   F4   3
  SUBD  F8     F6   F2   4      7              8
  DIVD  F10    F0   F6   5
  ADDD  F6     F8   F2   6

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj        Vk        Qj     Qk
  0     Add1   No
  2     Add2   Yes   ADDD   M()-M()   M(45+R3)
  0     Add3   No
  7     Mult1  Yes   MULTD  M(45+R3)  R(F4)
  0     Mult2  Yes   DIVD             M(34+R2)  Mult1

Register result status:
  Clock       F0     F2        F4  F6         F8       F10    F12  ...  F30
  8      FU   Mult1  M(45+R3)      Add2       M()-M()  Mult2

Page 37: Dynamic Scheduling

Tomasulo Example Cycle 9

Instruction status:
  Instruction  j    k    Issue  Exec complete  Write result
  LD    F6     34+  R2   1      3              4
  LD    F2     45+  R3   2      4              5
  MULTD F0     F2   F4   3
  SUBD  F8     F6   F2   4      7              8
  DIVD  F10    F0   F6   5
  ADDD  F6     F8   F2   6

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj        Vk        Qj     Qk
  0     Add1   No
  1     Add2   Yes   ADDD   M()-M()   M(45+R3)
  0     Add3   No
  6     Mult1  Yes   MULTD  M(45+R3)  R(F4)
  0     Mult2  Yes   DIVD             M(34+R2)  Mult1

Register result status:
  Clock       F0     F2        F4  F6         F8       F10    F12  ...  F30
  9      FU   Mult1  M(45+R3)      Add2       M()-M()  Mult2

Page 38: Dynamic Scheduling

Tomasulo Example Cycle 10

Instruction status:
  Instruction  j    k    Issue  Exec complete  Write result
  LD    F6     34+  R2   1      3              4
  LD    F2     45+  R3   2      4              5
  MULTD F0     F2   F4   3
  SUBD  F8     F6   F2   4      7              8
  DIVD  F10    F0   F6   5
  ADDD  F6     F8   F2   6      10

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj        Vk        Qj     Qk
  0     Add1   No
  0     Add2   Yes   ADDD   M()-M()   M(45+R3)
  0     Add3   No
  5     Mult1  Yes   MULTD  M(45+R3)  R(F4)
  0     Mult2  Yes   DIVD             M(34+R2)  Mult1

Register result status:
  Clock       F0     F2        F4  F6         F8       F10    F12  ...  F30
  10     FU   Mult1  M(45+R3)      Add2       M()-M()  Mult2

• Add2 completing; what is waiting for it?

Page 39: Dynamic Scheduling

Tomasulo Example Cycle 11

Instruction status:
  Instruction  j    k    Issue  Exec complete  Write result
  LD    F6     34+  R2   1      3              4
  LD    F2     45+  R3   2      4              5
  MULTD F0     F2   F4   3
  SUBD  F8     F6   F2   4      7              8
  DIVD  F10    F0   F6   5
  ADDD  F6     F8   F2   6      10             11

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj        Vk        Qj     Qk
  0     Add1   No
  0     Add2   No
  0     Add3   No
  4     Mult1  Yes   MULTD  M(45+R3)  R(F4)
  0     Mult2  Yes   DIVD             M(34+R2)  Mult1

Register result status:
  Clock       F0     F2        F4  F6         F8       F10    F12  ...  F30
  11     FU   Mult1  M(45+R3)      (M-M)+M()  M()-M()  Mult2

• Write result of ADDD here vs. scoreboard?

Page 40: Dynamic Scheduling

Tomasulo Example Cycle 12

Instruction status:
  Instruction  j    k    Issue  Exec complete  Write result
  LD    F6     34+  R2   1      3              4
  LD    F2     45+  R3   2      4              5
  MULTD F0     F2   F4   3
  SUBD  F8     F6   F2   4      7              8
  DIVD  F10    F0   F6   5
  ADDD  F6     F8   F2   6      10             11

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj        Vk        Qj     Qk
  0     Add1   No
  0     Add2   No
  0     Add3   No
  3     Mult1  Yes   MULTD  M(45+R3)  R(F4)
  0     Mult2  Yes   DIVD             M(34+R2)  Mult1

Register result status:
  Clock       F0     F2        F4  F6         F8       F10    F12  ...  F30
  12     FU   Mult1  M(45+R3)      (M-M)+M()  M()-M()  Mult2

• Note: all quick instructions complete already

Page 41: Dynamic Scheduling

Tomasulo Example Cycle 13

Instruction status:
  Instruction  j    k    Issue  Exec complete  Write result
  LD    F6     34+  R2   1      3              4
  LD    F2     45+  R3   2      4              5
  MULTD F0     F2   F4   3
  SUBD  F8     F6   F2   4      7              8
  DIVD  F10    F0   F6   5
  ADDD  F6     F8   F2   6      10             11

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj        Vk        Qj     Qk
  0     Add1   No
  0     Add2   No
  0     Add3   No
  2     Mult1  Yes   MULTD  M(45+R3)  R(F4)
  0     Mult2  Yes   DIVD             M(34+R2)  Mult1

Register result status:
  Clock       F0     F2        F4  F6         F8       F10    F12  ...  F30
  13     FU   Mult1  M(45+R3)      (M-M)+M()  M()-M()  Mult2

Page 42: Dynamic Scheduling

Tomasulo Example Cycle 14

Instruction status:
  Instruction  j    k    Issue  Exec complete  Write result
  LD    F6     34+  R2   1      3              4
  LD    F2     45+  R3   2      4              5
  MULTD F0     F2   F4   3
  SUBD  F8     F6   F2   4      7              8
  DIVD  F10    F0   F6   5
  ADDD  F6     F8   F2   6      10             11

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj        Vk        Qj     Qk
  0     Add1   No
  0     Add2   No
  0     Add3   No
  1     Mult1  Yes   MULTD  M(45+R3)  R(F4)
  0     Mult2  Yes   DIVD             M(34+R2)  Mult1

Register result status:
  Clock       F0     F2        F4  F6         F8       F10    F12  ...  F30
  14     FU   Mult1  M(45+R3)      (M-M)+M()  M()-M()  Mult2

Page 43: Dynamic Scheduling

Tomasulo Example Cycle 15

Instruction status:
  Instruction  j    k    Issue  Exec complete  Write result
  LD    F6     34+  R2   1      3              4
  LD    F2     45+  R3   2      4              5
  MULTD F0     F2   F4   3      15
  SUBD  F8     F6   F2   4      7              8
  DIVD  F10    F0   F6   5
  ADDD  F6     F8   F2   6      10             11

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj        Vk        Qj     Qk
  0     Add1   No
  0     Add2   No
  0     Add3   No
  0     Mult1  Yes   MULTD  M(45+R3)  R(F4)
  0     Mult2  Yes   DIVD             M(34+R2)  Mult1

Register result status:
  Clock       F0     F2        F4  F6         F8       F10    F12  ...  F30
  15     FU   Mult1  M(45+R3)      (M-M)+M()  M()-M()  Mult2

• Mult1 completing; what is waiting for it?

Page 44: Dynamic Scheduling

Tomasulo Example Cycle 16

Instruction status:
  Instruction  j    k    Issue  Exec complete  Write result
  LD    F6     34+  R2   1      3              4
  LD    F2     45+  R3   2      4              5
  MULTD F0     F2   F4   3      15             16
  SUBD  F8     F6   F2   4      7              8
  DIVD  F10    F0   F6   5
  ADDD  F6     F8   F2   6      10             11

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj        Vk        Qj     Qk
  0     Add1   No
  0     Add2   No
  0     Add3   No
  0     Mult1  No
  40    Mult2  Yes   DIVD   M*F4      M(34+R2)

Register result status:
  Clock       F0     F2        F4  F6         F8       F10    F12  ...  F30
  16     FU   M*F4   M(45+R3)      (M-M)+M()  M()-M()  Mult2

• Note: Just waiting for divide

Page 45: Dynamic Scheduling

Tomasulo Example Cycle 55

Instruction status:
  Instruction  j    k    Issue  Exec complete  Write result
  LD    F6     34+  R2   1      3              4
  LD    F2     45+  R3   2      4              5
  MULTD F0     F2   F4   3      15             16
  SUBD  F8     F6   F2   4      7              8
  DIVD  F10    F0   F6   5
  ADDD  F6     F8   F2   6      10             11

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj        Vk        Qj     Qk
  0     Add1   No
  0     Add2   No
  0     Add3   No
  0     Mult1  No
  1     Mult2  Yes   DIVD   M*F4      M(34+R2)

Register result status:
  Clock       F0     F2        F4  F6         F8       F10    F12  ...  F30
  55     FU   M*F4   M(45+R3)      (M-M)+M()  M()-M()  Mult2

Page 46: Dynamic Scheduling

Tomasulo Example Cycle 56

Instruction status:
  Instruction  j    k    Issue  Exec complete  Write result
  LD    F6     34+  R2   1      3              4
  LD    F2     45+  R3   2      4              5
  MULTD F0     F2   F4   3      15             16
  SUBD  F8     F6   F2   4      7              8
  DIVD  F10    F0   F6   5      56
  ADDD  F6     F8   F2   6      10             11

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj        Vk        Qj     Qk
  0     Add1   No
  0     Add2   No
  0     Add3   No
  0     Mult1  No
  0     Mult2  Yes   DIVD   M*F4      M(34+R2)

Register result status:
  Clock       F0     F2        F4  F6         F8       F10    F12  ...  F30
  56     FU   M*F4   M(45+R3)      (M-M)+M()  M()-M()  Mult2

• Mult2 completing; what is waiting for it?

Page 47: Dynamic Scheduling

Tomasulo Example Cycle 57

Instruction status:
  Instruction  j    k    Issue  Exec complete  Write result
  LD    F6     34+  R2   1      3              4
  LD    F2     45+  R3   2      4              5
  MULTD F0     F2   F4   3      15             16
  SUBD  F8     F6   F2   4      7              8
  DIVD  F10    F0   F6   5      56             57
  ADDD  F6     F8   F2   6      10             11

Load buffers:
  Name   Busy  Address
  Load1  No
  Load2  No
  Load3  No

Reservation stations:
  Time  Name   Busy  Op     Vj        Vk        Qj     Qk
  0     Add1   No
  0     Add2   No
  0     Add3   No
  0     Mult1  No
  0     Mult2  No

Register result status:
  Clock       F0     F2        F4  F6         F8       F10    F12  ...  F30
  57     FU   M*F4   M(45+R3)      (M-M)+M()  M()-M()  M*F4/M

• Again, in-order issue, out-of-order execution, completion
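The issue, execution-complete, and write-result cycles in this walk-through follow a simple pattern that can be reproduced in a few lines. The sketch below is a simplified timing model, not a full Tomasulo simulator. It assumes one issue per cycle, a free reservation station for every instruction, and no contention for the CDB. The latencies (load 2, add/subtract 2, multiply 10, divide 40) are inferred from the Time column and the load timings in the tables rather than stated on a slide, and the `schedule` helper is a hypothetical name.

```python
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

# (op, destination, FP source registers). The loads read only integer
# registers, which are treated as always ready.
PROGRAM = [
    ("LD", "F6", []),
    ("LD", "F2", []),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]


def schedule(program, latency):
    """Return (op, dest, issue, exec_complete, write_result) per instruction."""
    written = {}  # register -> cycle its most recently issued producer writes the CDB
    rows = []
    for i, (op, dest, srcs) in enumerate(program):
        issue = i + 1  # in-order issue, one instruction per cycle
        # Sources are looked up at issue time, which models renaming: DIVD
        # sees the F6 written by the load, not the later ADDD.
        operands_ready = max((written.get(s, 0) for s in srcs), default=0)
        start = max(issue, operands_ready) + 1
        complete = start + latency[op] - 1
        write = complete + 1
        written[dest] = write
        rows.append((op, dest, issue, complete, write))
    return rows


for row in schedule(PROGRAM, LATENCY):
    print(row)
```

Under these assumptions the model gives write-result cycles 4, 5, 16, 8, 57, and 11 in program order, matching the final instruction status table. ADDD writes at cycle 11, long before DIVD at cycle 57, which is the out-of-order completion the last bullet refers to.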

Page 48: Dynamic Scheduling

Tomasulo Drawbacks

• Complexity
  – delays of 360/91, MIPS 10000, IBM 620?
• Many associative stores (CDB) at high speed
• Performance limited by Common Data Bus
  – Multiple CDBs => more FU logic for parallel assoc stores

Page 49: Dynamic Scheduling

Tomasulo Loop Example

Loop: LD    F0 0  R1
      MULTD F4 F0 F2
      SD    F4 0  R1
      SUBI  R1 R1 #8
      BNEZ  R1 Loop

• Assume multiply takes 4 clocks
• Assume first load takes 8 clocks (cache miss?), second load takes 4 clocks (hit)
• To be clear, will show clocks for SUBI, BNEZ
• In reality, integer instructions run ahead

Page 50: Dynamic Scheduling

Loop Example Cycle 0

Instruction status:
  Instruction  j   k   Iteration  Issue  Exec complete  Write result
  LD    F0     0   R1  1
  MULTD F4     F0  F2  1
  SD    F4     0   R1  1
  LD    F0     0   R1  2
  MULTD F4     F0  F2  2
  SD    F4     0   R1  2

Load and store buffers:
  Name    Busy  Address  Qi
  Load1   No
  Load2   No
  Load3   No
  Store1  No
  Store2  No
  Store3  No

Reservation stations (Vj, Vk: source values S1, S2; Qj, Qk: RS for j, RS for k):
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  0     Add1   No
  0     Add2   No
  0     Add3   No
  0     Mult1  No
  0     Mult2  No

Register result status:
  Clock  R1       F0     F2  F4     F6  F8  F10  F12  ...  F30
  0      80   Qi

Page 51: Dynamic Scheduling

Loop Example Cycle 1

Instruction status:
  Instruction  j   k   Iteration  Issue  Exec complete  Write result
  LD    F0     0   R1  1          1
  MULTD F4     F0  F2  1
  SD    F4     0   R1  1
  LD    F0     0   R1  2
  MULTD F4     F0  F2  2
  SD    F4     0   R1  2

Load and store buffers:
  Name    Busy  Address  Qi
  Load1   Yes   80
  Load2   No
  Load3   No
  Store1  No
  Store2  No
  Store3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  0     Add1   No
  0     Add2   No
  0     Add3   No
  0     Mult1  No
  0     Mult2  No

Register result status:
  Clock  R1       F0     F2  F4     F6  F8  F10  F12  ...  F30
  1      80   Qi  Load1

Page 52: Dynamic Scheduling

Loop Example Cycle 2

Instruction status:
  Instruction  j   k   Iteration  Issue  Exec complete  Write result
  LD    F0     0   R1  1          1
  MULTD F4     F0  F2  1          2
  SD    F4     0   R1  1
  LD    F0     0   R1  2
  MULTD F4     F0  F2  2
  SD    F4     0   R1  2

Load and store buffers:
  Name    Busy  Address  Qi
  Load1   Yes   80
  Load2   No
  Load3   No
  Store1  No
  Store2  No
  Store3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  0     Add1   No
  0     Add2   No
  0     Add3   No
  0     Mult1  Yes   MULTD         R(F2)  Load1
  0     Mult2  No

Register result status:
  Clock  R1       F0     F2  F4     F6  F8  F10  F12  ...  F30
  2      80   Qi  Load1      Mult1

Page 53: Dynamic Scheduling

Loop Example Cycle 3

Instruction status:
  Instruction  j   k   Iteration  Issue  Exec complete  Write result
  LD    F0     0   R1  1          1
  MULTD F4     F0  F2  1          2
  SD    F4     0   R1  1          3
  LD    F0     0   R1  2
  MULTD F4     F0  F2  2
  SD    F4     0   R1  2

Load and store buffers:
  Name    Busy  Address  Qi
  Load1   Yes   80
  Load2   No
  Load3   No
  Store1  Yes   80       Mult1
  Store2  No
  Store3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  0     Add1   No
  0     Add2   No
  0     Add3   No
  0     Mult1  Yes   MULTD         R(F2)  Load1
  0     Mult2  No

Register result status:
  Clock  R1       F0     F2  F4     F6  F8  F10  F12  ...  F30
  3      80   Qi  Load1      Mult1

• Note: MULT1 has no register names in RS

Page 54: Dynamic Scheduling

Loop Example Cycle 4

Instruction status:
  Instruction  j   k   Iteration  Issue  Exec complete  Write result
  LD    F0     0   R1  1          1
  MULTD F4     F0  F2  1          2
  SD    F4     0   R1  1          3
  LD    F0     0   R1  2
  MULTD F4     F0  F2  2
  SD    F4     0   R1  2

Load and store buffers:
  Name    Busy  Address  Qi
  Load1   Yes   80
  Load2   No
  Load3   No
  Store1  Yes   80       Mult1
  Store2  No
  Store3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  0     Add1   No
  0     Add2   No
  0     Add3   No
  0     Mult1  Yes   MULTD         R(F2)  Load1
  0     Mult2  No

Register result status:
  Clock  R1       F0     F2  F4     F6  F8  F10  F12  ...  F30
  4      72   Qi  Load1      Mult1

Page 55: Dynamic Scheduling

Loop Example Cycle 5

Instruction status:
  Instruction  j   k   Iteration  Issue  Exec complete  Write result
  LD    F0     0   R1  1          1
  MULTD F4     F0  F2  1          2
  SD    F4     0   R1  1          3
  LD    F0     0   R1  2
  MULTD F4     F0  F2  2
  SD    F4     0   R1  2

Load and store buffers:
  Name    Busy  Address  Qi
  Load1   Yes   80
  Load2   No
  Load3   No
  Store1  Yes   80       Mult1
  Store2  No
  Store3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  0     Add1   No
  0     Add2   No
  0     Add3   No
  0     Mult1  Yes   MULTD         R(F2)  Load1
  0     Mult2  No

Register result status:
  Clock  R1       F0     F2  F4     F6  F8  F10  F12  ...  F30
  5      72   Qi  Load1      Mult1

Page 56: Dynamic Scheduling

Loop Example Cycle 6

Instruction status:
  Instruction  j   k   Iteration  Issue  Exec complete  Write result
  LD    F0     0   R1  1          1
  MULTD F4     F0  F2  1          2
  SD    F4     0   R1  1          3
  LD    F0     0   R1  2          6
  MULTD F4     F0  F2  2
  SD    F4     0   R1  2

Load and store buffers:
  Name    Busy  Address  Qi
  Load1   Yes   80
  Load2   Yes   72
  Load3   No
  Store1  Yes   80       Mult1
  Store2  No
  Store3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  0     Add1   No
  0     Add2   No
  0     Add3   No
  0     Mult1  Yes   MULTD         R(F2)  Load1
  0     Mult2  No

Register result status:
  Clock  R1       F0     F2  F4     F6  F8  F10  F12  ...  F30
  6      72   Qi  Load2      Mult1

• Note: F0 never sees Load1 result
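The note that F0 never sees the Load1 result follows directly from how the register result status is updated at issue. The sketch below is an illustration under simplifying assumptions: only the F0 and F4 entries are tracked, SD, SUBI, and BNEZ are omitted, and the three helper functions are hypothetical names rather than anything defined in the slides.

```python
def issue_load(dest, tag, reg_status):
    """A load claims its destination register for the given load buffer."""
    reg_status[dest] = tag


def issue_mult(dest, src, tag, reg_status):
    """Capture the tag `src` is waiting on (None if its value is ready), then claim dest."""
    qj = reg_status.get(src)
    reg_status[dest] = tag
    return qj


def writeback(tag, reg_status):
    """Return the registers that accept the broadcast from `tag`."""
    hit = [reg for reg, owner in reg_status.items() if owner == tag]
    for reg in hit:
        del reg_status[reg]
    return hit


reg_status = {}
issue_load("F0", "Load1", reg_status)                 # iteration 1: LD F0
q1 = issue_mult("F4", "F0", "Mult1", reg_status)      # iteration 1: MULTD waits on Load1
issue_load("F0", "Load2", reg_status)                 # iteration 2: LD F0 re-claims F0
q2 = issue_mult("F4", "F0", "Mult2", reg_status)      # iteration 2: MULTD waits on Load2

print(q1, q2)
print(writeback("Load1", reg_status))  # no register still expects Load1
print(reg_status)
```

Mult1 still receives the first load's value because it holds the tag Load1 in its Qj field. Only the architectural register F0 skips it, since by then F0 is owned by Load2. In this sense the hardware keeps the two iterations' values apart without the compiler unrolling the loop.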

Page 57: Dynamic Scheduling

Loop Example Cycle 7

Instruction status:
  Instruction  j   k   Iteration  Issue  Exec complete  Write result
  LD    F0     0   R1  1          1
  MULTD F4     F0  F2  1          2
  SD    F4     0   R1  1          3
  LD    F0     0   R1  2          6
  MULTD F4     F0  F2  2          7
  SD    F4     0   R1  2

Load and store buffers:
  Name    Busy  Address  Qi
  Load1   Yes   80
  Load2   Yes   72
  Load3   No
  Store1  Yes   80       Mult1
  Store2  No
  Store3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  0     Add1   No
  0     Add2   No
  0     Add3   No
  0     Mult1  Yes   MULTD         R(F2)  Load1
  0     Mult2  Yes   MULTD         R(F2)  Load2

Register result status:
  Clock  R1       F0     F2  F4     F6  F8  F10  F12  ...  F30
  7      72   Qi  Load2      Mult2

• Note: MULT2 has no register names in RS

Page 58: Dynamic Scheduling

Loop Example Cycle 8

Instruction status:
  Instruction  j   k   Iteration  Issue  Exec complete  Write result
  LD    F0     0   R1  1          1
  MULTD F4     F0  F2  1          2
  SD    F4     0   R1  1          3
  LD    F0     0   R1  2          6
  MULTD F4     F0  F2  2          7
  SD    F4     0   R1  2          8

Load and store buffers:
  Name    Busy  Address  Qi
  Load1   Yes   80
  Load2   Yes   72
  Load3   No
  Store1  Yes   80       Mult1
  Store2  Yes   72       Mult2
  Store3  No

Reservation stations:
  Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  0     Add1   No
  0     Add2   No
  0     Add3   No
  0     Mult1  Yes   MULTD         R(F2)  Load1
  0     Mult2  Yes   MULTD         R(F2)  Load2

Register result status:
  Clock  R1       F0     F2  F4     F6  F8  F10  F12  ...  F30
  8      72   Qi  Load2      Mult2

Page 59: Dynamic Scheduling

Loop Example Cycle 9

Instruction status
Instruction  j   k   Iteration  Issue  Exec complete  Write result
LD    F0     0   R1  1          1      9              -
MULTD F4     F0  F2  1          2      -              -
SD    F4     0   R1  1          3      -              -
LD    F0     0   R1  2          6      -              -
MULTD F4     F0  F2  2          7      -              -
SD    F4     0   R1  2          8      -              -

Buffer  Busy  Address  Qi
Load1   Yes   80       -
Load2   Yes   72       -
Load3   No    -        -
Store1  Yes   80       Mult1
Store2  Yes   72       Mult2
Store3  No    -        -

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
0     Add1   No    -      -      -      -      -
0     Add2   No    -      -      -      -      -
0     Add3   No    -      -      -      -      -
0     Mult1  Yes   MULTD  -      R(F2)  Load1  -
0     Mult2  Yes   MULTD  -      R(F2)  Load2  -

Code:
LD    F0 0  R1
MULTD F4 F0 F2
SD    F4 0  R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status
Clock  R1       F0     F2  F4     F6  F8  F10  F12 ... F30
9      64   Qi  Load2  -   Mult2

• Load1 completing; what is waiting for it?
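One way to answer the question is to scan the reservation stations for entries whose Qj or Qk holds the completing unit's tag. The sketch below assumes a simple dictionary model of the two multiply stations as shown on the cycle 9 slide; `waiting_on` and `broadcast` are hypothetical helper names, and operand values are kept as the symbolic strings used on the slides.

```python
# Multiply reservation stations as of cycle 9 (symbolic operand values).
stations = {
    "Mult1": {"vj": None, "vk": "R(F2)", "qj": "Load1", "qk": None},
    "Mult2": {"vj": None, "vk": "R(F2)", "qj": "Load2", "qk": None},
}

def waiting_on(tag):
    """Names of the stations that still need the result tagged `tag`."""
    return [name for name, rs in stations.items()
            if tag in (rs["qj"], rs["qk"])]

def broadcast(tag, value):
    """Deliver a result: fill matching operands and clear their tags."""
    for rs in stations.values():
        if rs["qj"] == tag:
            rs["vj"], rs["qj"] = value, None
        if rs["qk"] == tag:
            rs["vk"], rs["qk"] = value, None

waiters_before = waiting_on("Load1")
broadcast("Load1", "M(80)")   # Load1 writes its result in cycle 10
waiters_after = waiting_on("Load1")
```

Only Mult1 is waiting on Load1; after the broadcast it holds M(80) and R(F2) and can begin executing, while Mult2 keeps waiting on Load2.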

Page 60: Dynamic Scheduling

Loop Example Cycle 10

Instruction status
Instruction  j   k   Iteration  Issue  Exec complete  Write result
LD    F0     0   R1  1          1      9              10
MULTD F4     F0  F2  1          2      -              -
SD    F4     0   R1  1          3      -              -
LD    F0     0   R1  2          6      10             -
MULTD F4     F0  F2  2          7      -              -
SD    F4     0   R1  2          8      -              -

Buffer  Busy  Address  Qi
Load1   No    -        -
Load2   Yes   72       -
Load3   No    -        -
Store1  Yes   80       Mult1
Store2  Yes   72       Mult2
Store3  No    -        -

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
0     Add1   No    -      -      -      -      -
0     Add2   No    -      -      -      -      -
0     Add3   No    -      -      -      -      -
4     Mult1  Yes   MULTD  M(80)  R(F2)  -      -
0     Mult2  Yes   MULTD  -      R(F2)  Load2  -

Code:
LD    F0 0  R1
MULTD F4 F0 F2
SD    F4 0  R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status
Clock  R1       F0     F2  F4     F6  F8  F10  F12 ... F30
10     64   Qi  Load2  -   Mult2

• Load2 completing; what is waiting for it?

Page 61: Dynamic Scheduling

Loop Example Cycle 11

Instruction status
Instruction  j   k   Iteration  Issue  Exec complete  Write result
LD    F0     0   R1  1          1      9              10
MULTD F4     F0  F2  1          2      -              -
SD    F4     0   R1  1          3      -              -
LD    F0     0   R1  2          6      10             11
MULTD F4     F0  F2  2          7      -              -
SD    F4     0   R1  2          8      -              -

Buffer  Busy  Address  Qi
Load1   No    -        -
Load2   No    -        -
Load3   Yes   64       -
Store1  Yes   80       Mult1
Store2  Yes   72       Mult2
Store3  No    -        -

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
0     Add1   No    -      -      -      -      -
0     Add2   No    -      -      -      -      -
0     Add3   No    -      -      -      -      -
3     Mult1  Yes   MULTD  M(80)  R(F2)  -      -
4     Mult2  Yes   MULTD  M(72)  R(F2)  -      -

Code:
LD    F0 0  R1
MULTD F4 F0 F2
SD    F4 0  R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status
Clock  R1       F0     F2  F4     F6  F8  F10  F12 ... F30
11     64   Qi  Load3  -   Mult2

Page 62: Dynamic Scheduling

Loop Example Cycle 12

Instruction status
Instruction  j   k   Iteration  Issue  Exec complete  Write result
LD    F0     0   R1  1          1      9              10
MULTD F4     F0  F2  1          2      -              -
SD    F4     0   R1  1          3      -              -
LD    F0     0   R1  2          6      10             11
MULTD F4     F0  F2  2          7      -              -
SD    F4     0   R1  2          8      -              -

Buffer  Busy  Address  Qi
Load1   No    -        -
Load2   No    -        -
Load3   Yes   64       -
Store1  Yes   80       Mult1
Store2  Yes   72       Mult2
Store3  No    -        -

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
0     Add1   No    -      -      -      -      -
0     Add2   No    -      -      -      -      -
0     Add3   No    -      -      -      -      -
2     Mult1  Yes   MULTD  M(80)  R(F2)  -      -
3     Mult2  Yes   MULTD  M(72)  R(F2)  -      -

Code:
LD    F0 0  R1
MULTD F4 F0 F2
SD    F4 0  R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status
Clock  R1       F0     F2  F4     F6  F8  F10  F12 ... F30
12     64   Qi  Load3  -   Mult2

Page 63: Dynamic Scheduling

Loop Example Cycle 13

Instruction status
Instruction  j   k   Iteration  Issue  Exec complete  Write result
LD    F0     0   R1  1          1      9              10
MULTD F4     F0  F2  1          2      -              -
SD    F4     0   R1  1          3      -              -
LD    F0     0   R1  2          6      10             11
MULTD F4     F0  F2  2          7      -              -
SD    F4     0   R1  2          8      -              -

Buffer  Busy  Address  Qi
Load1   No    -        -
Load2   No    -        -
Load3   Yes   64       -
Store1  Yes   80       Mult1
Store2  Yes   72       Mult2
Store3  No    -        -

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
0     Add1   No    -      -      -      -      -
0     Add2   No    -      -      -      -      -
0     Add3   No    -      -      -      -      -
1     Mult1  Yes   MULTD  M(80)  R(F2)  -      -
2     Mult2  Yes   MULTD  M(72)  R(F2)  -      -

Code:
LD    F0 0  R1
MULTD F4 F0 F2
SD    F4 0  R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status
Clock  R1       F0     F2  F4     F6  F8  F10  F12 ... F30
13     64   Qi  Load3  -   Mult2

Page 64: Dynamic Scheduling

Loop Example Cycle 14

Instruction status
Instruction  j   k   Iteration  Issue  Exec complete  Write result
LD    F0     0   R1  1          1      9              10
MULTD F4     F0  F2  1          2      14             -
SD    F4     0   R1  1          3      -              -
LD    F0     0   R1  2          6      10             11
MULTD F4     F0  F2  2          7      -              -
SD    F4     0   R1  2          8      -              -

Buffer  Busy  Address  Qi
Load1   No    -        -
Load2   No    -        -
Load3   Yes   64       -
Store1  Yes   80       Mult1
Store2  Yes   72       Mult2
Store3  No    -        -

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
0     Add1   No    -      -      -      -      -
0     Add2   No    -      -      -      -      -
0     Add3   No    -      -      -      -      -
0     Mult1  Yes   MULTD  M(80)  R(F2)  -      -
1     Mult2  Yes   MULTD  M(72)  R(F2)  -      -

Code:
LD    F0 0  R1
MULTD F4 F0 F2
SD    F4 0  R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status
Clock  R1       F0     F2  F4     F6  F8  F10  F12 ... F30
14     64   Qi  Load3  -   Mult2

• Mult1 completing; what is waiting for it?
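The Time column of the multiply stations in cycles 10 through 14 follows simple arithmetic, sketched below. It assumes the 4-cycle multiply latency implied by Time = 4 for Mult1 in cycle 10; the function names are hypothetical.

```python
MULT_LATENCY = 4  # assumed from the Time value shown for Mult1 in cycle 10

def exec_complete(operand_write_cycle, latency=MULT_LATENCY):
    """Cycle in which execution completes, counted from the cycle in
    which the last missing operand was broadcast to the station."""
    return operand_write_cycle + latency

def time_remaining(cycle, operand_write_cycle, latency=MULT_LATENCY):
    """Value of the Time column at `cycle`, never below zero."""
    return max(0, operand_write_cycle + latency - cycle)

# Load1 writes its result in cycle 10, Load2 in cycle 11.
mult1_done = exec_complete(10)
mult2_done = exec_complete(11)
cycle12_times = (time_remaining(12, 10), time_remaining(12, 11))
```

Under this assumption Mult1 completes in cycle 14 and Mult2 in cycle 15, matching the Exec complete column, and the cycle 12 Time values come out as 2 and 3.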

Page 65: Dynamic Scheduling

Loop Example Cycle 15

Instruction status
Instruction  j   k   Iteration  Issue  Exec complete  Write result
LD    F0     0   R1  1          1      9              10
MULTD F4     F0  F2  1          2      14             15
SD    F4     0   R1  1          3      -              -
LD    F0     0   R1  2          6      10             11
MULTD F4     F0  F2  2          7      15             -
SD    F4     0   R1  2          8      -              -

Buffer  Busy  Address  Qi
Load1   No    -        -
Load2   No    -        -
Load3   Yes   64       -
Store1  Yes   80       M(80)*R(F2)
Store2  Yes   72       Mult2
Store3  No    -        -

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
0     Add1   No    -      -      -      -      -
0     Add2   No    -      -      -      -      -
0     Add3   No    -      -      -      -      -
0     Mult1  No    -      -      -      -      -
0     Mult2  Yes   MULTD  M(72)  R(F2)  -      -

Code:
LD    F0 0  R1
MULTD F4 F0 F2
SD    F4 0  R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status
Clock  R1       F0     F2  F4     F6  F8  F10  F12 ... F30
15     64   Qi  Load3  -   Mult2

• Mult2 completing; what is waiting for it?

Page 66: Dynamic Scheduling

Loop Example Cycle 16

Instruction status
Instruction  j   k   Iteration  Issue  Exec complete  Write result
LD    F0     0   R1  1          1      9              10
MULTD F4     F0  F2  1          2      14             15
SD    F4     0   R1  1          3      -              -
LD    F0     0   R1  2          6      10             11
MULTD F4     F0  F2  2          7      15             16
SD    F4     0   R1  2          8      -              -

Buffer  Busy  Address  Qi
Load1   No    -        -
Load2   No    -        -
Load3   Yes   64       -
Store1  Yes   80       M(80)*R(F2)
Store2  Yes   72       M(72)*R(F2)
Store3  No    -        -

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
0     Add1   No    -      -      -      -      -
0     Add2   No    -      -      -      -      -
0     Add3   No    -      -      -      -      -
0     Mult1  Yes   MULTD  -      R(F2)  Load3  -
0     Mult2  No    -      -      -      -      -

Code:
LD    F0 0  R1
MULTD F4 F0 F2
SD    F4 0  R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status
Clock  R1       F0     F2  F4     F6  F8  F10  F12 ... F30
16     64   Qi  Load3  -   Mult1

Page 67: Dynamic Scheduling

Loop Example Cycle 17

Instruction status
Instruction  j   k   Iteration  Issue  Exec complete  Write result
LD    F0     0   R1  1          1      9              10
MULTD F4     F0  F2  1          2      14             15
SD    F4     0   R1  1          3      -              -
LD    F0     0   R1  2          6      10             11
MULTD F4     F0  F2  2          7      15             16
SD    F4     0   R1  2          8      -              -

Buffer  Busy  Address  Qi
Load1   No    -        -
Load2   No    -        -
Load3   Yes   64       -
Store1  Yes   80       M(80)*R(F2)
Store2  Yes   72       M(72)*R(F2)
Store3  Yes   64       Mult1

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
0     Add1   No    -      -      -      -      -
0     Add2   No    -      -      -      -      -
0     Add3   No    -      -      -      -      -
0     Mult1  Yes   MULTD  -      R(F2)  Load3  -
0     Mult2  No    -      -      -      -      -

Code:
LD    F0 0  R1
MULTD F4 F0 F2
SD    F4 0  R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status
Clock  R1       F0     F2  F4     F6  F8  F10  F12 ... F30
17     64   Qi  Load3  -   Mult1

Page 68: Dynamic Scheduling

Loop Example Cycle 18

Instruction status
Instruction  j   k   Iteration  Issue  Exec complete  Write result
LD    F0     0   R1  1          1      9              10
MULTD F4     F0  F2  1          2      14             15
SD    F4     0   R1  1          3      18             -
LD    F0     0   R1  2          6      10             11
MULTD F4     F0  F2  2          7      15             16
SD    F4     0   R1  2          8      -              -

Buffer  Busy  Address  Qi
Load1   No    -        -
Load2   No    -        -
Load3   Yes   64       -
Store1  Yes   80       M(80)*R(F2)
Store2  Yes   72       M(72)*R(F2)
Store3  Yes   64       Mult1

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
0     Add1   No    -      -      -      -      -
0     Add2   No    -      -      -      -      -
0     Add3   No    -      -      -      -      -
0     Mult1  Yes   MULTD  -      R(F2)  Load3  -
0     Mult2  No    -      -      -      -      -

Code:
LD    F0 0  R1
MULTD F4 F0 F2
SD    F4 0  R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status
Clock  R1       F0     F2  F4     F6  F8  F10  F12 ... F30
18     56   Qi  Load3  -   Mult1

Page 69: Dynamic Scheduling

Loop Example Cycle 19

Instruction status
Instruction  j   k   Iteration  Issue  Exec complete  Write result
LD    F0     0   R1  1          1      9              10
MULTD F4     F0  F2  1          2      14             15
SD    F4     0   R1  1          3      18             19
LD    F0     0   R1  2          6      10             11
MULTD F4     F0  F2  2          7      15             16
SD    F4     0   R1  2          8      -              -

Buffer  Busy  Address  Qi
Load1   No    -        -
Load2   No    -        -
Load3   Yes   64       -
Store1  No    -        -
Store2  Yes   72       M(72)*R(F2)
Store3  Yes   64       Mult1

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
0     Add1   No    -      -      -      -      -
0     Add2   No    -      -      -      -      -
0     Add3   No    -      -      -      -      -
0     Mult1  Yes   MULTD  -      R(F2)  Load3  -
0     Mult2  No    -      -      -      -      -

Code:
LD    F0 0  R1
MULTD F4 F0 F2
SD    F4 0  R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status
Clock  R1       F0     F2  F4     F6  F8  F10  F12 ... F30
19     56   Qi  Load3  -   Mult1

Page 70: Dynamic Scheduling

Loop Example Cycle 20

Instruction status
Instruction  j   k   Iteration  Issue  Exec complete  Write result
LD    F0     0   R1  1          1      9              10
MULTD F4     F0  F2  1          2      14             15
SD    F4     0   R1  1          3      18             19
LD    F0     0   R1  2          6      10             11
MULTD F4     F0  F2  2          7      15             16
SD    F4     0   R1  2          8      20             -

Buffer  Busy  Address  Qi
Load1   No    -        -
Load2   No    -        -
Load3   Yes   64       -
Store1  No    -        -
Store2  Yes   72       M(72)*R(F2)
Store3  Yes   64       Mult1

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
0     Add1   No    -      -      -      -      -
0     Add2   No    -      -      -      -      -
0     Add3   No    -      -      -      -      -
0     Mult1  Yes   MULTD  -      R(F2)  Load3  -
0     Mult2  No    -      -      -      -      -

Code:
LD    F0 0  R1
MULTD F4 F0 F2
SD    F4 0  R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status
Clock  R1       F0     F2  F4     F6  F8  F10  F12 ... F30
20     56   Qi  Load3  -   Mult1

Page 71: Dynamic Scheduling

Loop Example Cycle 21

Instruction status
Instruction  j   k   Iteration  Issue  Exec complete  Write result
LD    F0     0   R1  1          1      9              10
MULTD F4     F0  F2  1          2      14             15
SD    F4     0   R1  1          3      18             19
LD    F0     0   R1  2          6      10             11
MULTD F4     F0  F2  2          7      15             16
SD    F4     0   R1  2          8      20             21

Buffer  Busy  Address  Qi
Load1   No    -        -
Load2   No    -        -
Load3   Yes   64       -
Store1  No    -        -
Store2  No    -        -
Store3  Yes   64       Mult1

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
0     Add1   No    -      -      -      -      -
0     Add2   No    -      -      -      -      -
0     Add3   No    -      -      -      -      -
0     Mult1  Yes   MULTD  -      R(F2)  Load3  -
0     Mult2  No    -      -      -      -      -

Code:
LD    F0 0  R1
MULTD F4 F0 F2
SD    F4 0  R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status
Clock  R1       F0     F2  F4     F6  F8  F10  F12 ... F30
21     56   Qi  Load3  -   Mult1

Page 72: Dynamic Scheduling

Tomasulo Summary

• Reservation stations: renaming to a larger set of registers + buffering of source operands
– Prevents registers from becoming a bottleneck
– Avoids the WAR and WAW hazards of the scoreboard
– Allows loop unrolling in HW

• Not limited to basic blocks (the integer unit gets ahead, beyond branches)

• Helps with cache misses as well

• Lasting contributions
– Dynamic scheduling
– Register renaming
– Load/store disambiguation

• 360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264
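As a rough illustration of load/store disambiguation, the sketch below checks a load's effective address against the addresses held in busy store buffers. It is a simplified assumption of how such a check could look, not the 360/91's actual logic; `load_may_proceed` is a hypothetical name, and the addresses are the ones from the loop example.

```python
def load_may_proceed(load_addr, pending_store_addrs):
    """A load may go ahead of earlier stores only if no pending store
    targets the same address (simplified exact-match check)."""
    return load_addr not in pending_store_addrs

# Cycle 11 of the loop example: Load3 targets address 64 while
# Store1 (80) and Store2 (72) are still waiting for their data.
pending_stores = [80, 72]
load3_ok = load_may_proceed(64, pending_stores)
conflict_ok = load_may_proceed(72, pending_stores)
```

Because each iteration uses a different address, the third load need not wait for the earlier stores, which is part of what lets the hardware overlap iterations.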

Page 73: Dynamic Scheduling

[Figure: processor block diagram. Labeled components: fetch unit; instruction cache; dispatch unit with 8-entry instruction queue; reservation stations; execution units XSU0, XSU1, MCFXU, LSU, FPU, BPU; completion unit with reorder buffer; data cache. Labeled connections: instruction dispatch buses; instruction operation buses; GP operand buses; GP result buses; FP operand buses; FP result buses; result status buses; register numbers; reorder buffer information; branch correction.]