Transcript

10/25/99 ©UCB Fall 1999 CS152 / KubiatowiczLec16.1

CS152 Computer Architecture and Engineering

Lecture 16

Dynamic Scheduling (Cont), Speculation, and ILP

October 25, 1999

John Kubiatowicz (http.cs.berkeley.edu/~kubitron)

lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/

Review: Compiler techniques for parallelism

° Loop unrolling ⇒ Multiple iterations of loop in software:
• Amortizes loop overhead over several iterations
• Gives more opportunity for scheduling around stalls

° Software Pipelining ⇒ Take one instruction from each of several iterations of the loop
• Software overlapping of loop iterations
• Today will show hardware overlapping of loop iterations

° Very Long Instruction Word machines (VLIW) ⇒ Multiple operations coded in single, long instruction
• Requires sophisticated compiler to decide which operations can be done in parallel
• Trace scheduling ⇒ find common path and schedule code as if branches didn’t exist (+ add “fixup code”)

° All of these require additional registers
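As a rough illustration of loop unrolling, the following Python sketch performs the same update with one loop test per four elements. The function names are hypothetical, Python merely stands in for the compiled loop, and the unrolled version assumes the array length is a multiple of 4.

```python
def add_scalar(x, s):
    """Rolled loop: one loop test and index update per element."""
    for i in range(len(x)):
        x[i] += s
    return x


def add_scalar_unrolled(x, s):
    """Unrolled by 4: one loop test per four elements.

    Assumes len(x) % 4 == 0 (an assumption made for this sketch).
    """
    for i in range(0, len(x), 4):
        x[i] += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
    return x


rolled = add_scalar([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], 10.0)
unrolled = add_scalar_unrolled([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], 10.0)
print(rolled == unrolled)
```

The four statements in the unrolled body are independent of one another, which is what gives a scheduler room to fill stall slots.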

Review: Dynamic hardware for out-of-order execution

° HW exploitation of ILP
• Works when can’t know dependence at compile time.
• Code for one machine runs well on another

° Key idea of Scoreboard: Allow instructions behind stall to proceed (Decode => Issue instr & read operands)
• Enables out-of-order execution => out-of-order completion
• ID stage checked both for structural & data dependencies
• Original version didn’t handle forwarding.
• No automatic register renaming ⇒ stalls for WAR and WAW hazards

• Are these fundamental limitations??? (No)
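To make the hazard names concrete, here is a small Python sketch that classifies the dependence between an earlier and a later instruction. The helper name `classify_hazards` and the `(dest, sources)` tuple layout are assumptions made for this illustration.

```python
def classify_hazards(first, second):
    """Return the hazards `second` has with respect to the earlier `first`.

    Each instruction is a (dest, sources) pair of register names.
    """
    d1, s1 = first
    d2, s2 = second
    hazards = []
    if d1 is not None and d1 in s2:
        hazards.append("RAW")  # second reads what first writes (true dependence)
    if d2 is not None and d2 in s1:
        hazards.append("WAR")  # second overwrites a register first still reads
    if d1 is not None and d1 == d2:
        hazards.append("WAW")  # both write the same register
    return hazards


divd = ("F10", ("F0", "F6"))  # DIVD F10, F0, F6
addd = ("F6", ("F8", "F2"))   # ADDD F6, F8, F2
print(classify_hazards(divd, addd))  # ['WAR']
```

Only RAW is a true data dependence; WAR and WAW come from reusing register names, which is why renaming can remove them.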

The Big Picture: Where are We Now?

° The Five Classic Components of a Computer: Processor (Control, Datapath), Memory, Input, Output

° Today’s Topics:
• Recap last lecture
• Hardware loop unrolling with Tomasulo algorithm
• Administrivia
• Speculation, branch prediction
• Reorder buffers

Another Dynamic Algorithm: Tomasulo Algorithm

° For IBM 360/91 about 3 years after CDC 6600 (1966)

° Goal: High Performance without special compilers

° Differences between IBM 360 & CDC 6600 ISA
• IBM has only 2 register specifiers/instr vs. 3 in CDC 6600
• IBM has 4 FP registers vs. 8 in CDC 6600
• IBM has memory-register ops

° Why Study? Led to Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …

Tomasulo Algorithm vs. Scoreboard

° Control & buffers distributed with Function Units (FU) vs. centralized in scoreboard;
• FU buffers called “reservation stations”; have pending operands

° Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming;
• avoids WAR, WAW hazards
• More reservation stations than registers, so can do optimizations compilers can’t

° Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs

° Loads and Stores treated as FUs with RSs as well

° Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue
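A minimal Python sketch of renaming at issue time, under simplifying assumptions: the dictionaries `reg_file` and `reg_status`, the function name `issue`, and the numeric register values are all hypothetical. Each source register is replaced either by its current value or by the name of the reservation station that will produce it.

```python
def issue(instr, reg_status, reg_file, station):
    """Build a reservation-station entry for `instr` and rename its destination."""
    op, dest, src_j, src_k = instr
    entry = {"op": op}
    for field, src in (("j", src_j), ("k", src_k)):
        producer = reg_status.get(src)
        if producer is None:
            entry["V" + field] = reg_file[src]  # value is available: copy it now
            entry["Q" + field] = None
        else:
            entry["V" + field] = None
            entry["Q" + field] = producer       # pointer to the producing station
    reg_status[dest] = station                  # later readers wait on this station
    return entry


reg_file = {"F0": 1.0, "F2": 2.0, "F6": 6.0, "F8": 8.0}
reg_status = {"F0": "Mult1"}  # F0 is still being computed by Mult1

divd = issue(("DIVD", "F10", "F0", "F6"), reg_status, reg_file, "Mult2")
addd = issue(("ADDD", "F6", "F8", "F2"), reg_status, reg_file, "Add2")
print(divd)
print(reg_status["F6"])  # Add2
```

DIVD copied the old value of F6 when it issued, so the later ADDD is free to claim F6 without a WAR stall.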

Tomasulo Organization

[Figure: block diagram of the Tomasulo organization]

Reservation Station Components

Op: Operation to perform in the unit (e.g., + or –)

Vj, Vk: Value of source operands
• Store buffers have V field, result to be stored

Qj, Qk: Reservation stations producing source registers (value to be written)
• Note: No ready flags as in Scoreboard; Qj,Qk=0 => ready
• Store buffers only have Qi for RS producing result

Busy: Indicates reservation station or FU is busy

Register result status: Indicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register.
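The fields above map naturally onto a small record. This is only a sketch: the class name, field types, and the use of `None` in place of the hardware encoding Qj,Qk=0 are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None  # value of first source operand, once known
    vk: Optional[float] = None  # value of second source operand, once known
    qj: Optional[str] = None    # station producing Vj; None plays the role of Qj = 0
    qk: Optional[str] = None    # station producing Vk; None plays the role of Qk = 0

    def ready(self) -> bool:
        """No separate ready flags: an entry may execute when both Q fields are empty."""
        return self.busy and self.qj is None and self.qk is None


mult1 = ReservationStation("Mult1", busy=True, op="MULTD", vk=4.0, qj="Load2")
before = mult1.ready()
mult1.vj, mult1.qj = 2.0, None  # the value from Load2 arrives
after = mult1.ready()
print(before, after)  # False True
```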

Three Stages of Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
   If reservation station free (no structural hazard), control issues instr & sends operands (renames registers).

2. Execution: operate on operands (EX)
   When both operands ready then execute; if not ready, watch Common Data Bus for result

3. Write result: finish execution (WB)
   Write on Common Data Bus to all awaiting units; mark reservation station available

° Normal data bus: data + destination (“go to” bus)

° Common data bus: data + source (“come from” bus)
• 64 bits of data + 4 bits of Functional Unit source address
• Write if matches expected Functional Unit (produces result)
• Does the broadcast
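The “come from” behaviour of the Common Data Bus can be sketched in Python as follows. The function name `broadcast`, the dictionary layout, and the numeric values are assumptions for illustration; real hardware performs all of these tag comparisons in parallel.

```python
def broadcast(tag, value, stations, reg_status, reg_file):
    """Deliver (source tag, value) to every consumer waiting on that tag."""
    for rs in stations:
        if rs["Qj"] == tag:
            rs["Vj"], rs["Qj"] = value, None
        if rs["Qk"] == tag:
            rs["Vk"], rs["Qk"] = value, None
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            reg_file[reg] = value  # register was waiting on this producer
            del reg_status[reg]


stations = [
    {"name": "Add1", "Vj": 1.5, "Qj": None, "Vk": None, "Qk": "Load2"},
    {"name": "Mult1", "Vj": None, "Qj": "Load2", "Vk": 4.0, "Qk": None},
]
reg_status = {"F2": "Load2", "F0": "Mult1"}
reg_file = {}

broadcast("Load2", 2.5, stations, reg_status, reg_file)
print(stations[0]["Vk"], stations[1]["Vj"], reg_file, reg_status)
```

The producer never names its consumers; each waiting unit recognizes the source tag and captures the value itself.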

Tomasulo Example

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result  Buffer  Busy  Address
LD     F6    34+   R2                                    Load1   No
LD     F2    45+   R3                                    Load2   No
MULTD  F0    F2    F4                                    Load3   No
SUBD   F8    F6    F2
DIVD   F10   F0    F6
ADDD   F6    F8    F2

Reservation stations:
                          S1       S2       RS     RS
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12 ... F30
0      FU

Tomasulo Example Cycle 1

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result  Buffer  Busy  Address
LD     F6    34+   R2    1                               Load1   Yes   34+R2
LD     F2    45+   R3                                    Load2   No
MULTD  F0    F2    F4                                    Load3   No
SUBD   F8    F6    F2
DIVD   F10   F0    F6
ADDD   F6    F8    F2

Reservation stations:
                          S1       S2       RS     RS
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12 ... F30
1      FU                       Load1

Tomasulo Example Cycle 2

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result  Buffer  Busy  Address
LD     F6    34+   R2    1                               Load1   Yes   34+R2
LD     F2    45+   R3    2                               Load2   Yes   45+R3
MULTD  F0    F2    F4                                    Load3   No
SUBD   F8    F6    F2
DIVD   F10   F0    F6
ADDD   F6    F8    F2

Reservation stations:
                          S1       S2       RS     RS
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12 ... F30
2      FU         Load2         Load1

° Note: Unlike 6600, can have multiple loads outstanding

Tomasulo Example Cycle 3

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result  Buffer  Busy  Address
LD     F6    34+   R2    1      3                        Load1   Yes   34+R2
LD     F2    45+   R3    2                               Load2   Yes   45+R3
MULTD  F0    F2    F4    3                               Load3   No
SUBD   F8    F6    F2
DIVD   F10   F0    F6
ADDD   F6    F8    F2

Reservation stations:
                          S1       S2       RS     RS
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   MULTD           R(F4)    Load2
      Mult2  No

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12 ... F30
3      FU  Mult1  Load2         Load1

° Note: register names are removed (“renamed”) in Reservation Stations; MULT issued vs. scoreboard
° Load1 completing; what is waiting for Load1?

Tomasulo Example Cycle 4

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result  Buffer  Busy  Address
LD     F6    34+   R2    1      3          4             Load1   No
LD     F2    45+   R3    2      4                        Load2   Yes   45+R3
MULTD  F0    F2    F4    3                               Load3   No
SUBD   F8    F6    F2    4
DIVD   F10   F0    F6
ADDD   F6    F8    F2

Reservation stations:
                          S1       S2       RS     RS
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   Yes   SUBD   M(A1)                    Load2
      Add2   No
      Add3   No
      Mult1  Yes   MULTD           R(F4)    Load2
      Mult2  No

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12 ... F30
4      FU  Mult1  Load2         M(A1)    Add1

° Load2 completing; what is waiting for Load2?

Tomasulo Example Cycle 5

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result  Buffer  Busy  Address
LD     F6    34+   R2    1      3          4             Load1   No
LD     F2    45+   R3    2      4          5             Load2   No
MULTD  F0    F2    F4    3                               Load3   No
SUBD   F8    F6    F2    4
DIVD   F10   F0    F6    5
ADDD   F6    F8    F2

Reservation stations:
                          S1       S2       RS     RS
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
2     Add1   Yes   SUBD   M(A1)    M(A2)
      Add2   No
      Add3   No
10    Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12 ... F30
5      FU  Mult1  M(A2)         M(A1)    Add1   Mult2

Tomasulo Example Cycle 6

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result  Buffer  Busy  Address
LD     F6    34+   R2    1      3          4             Load1   No
LD     F2    45+   R3    2      4          5             Load2   No
MULTD  F0    F2    F4    3                               Load3   No
SUBD   F8    F6    F2    4
DIVD   F10   F0    F6    5
ADDD   F6    F8    F2    6

Reservation stations:
                          S1       S2       RS     RS
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
1     Add1   Yes   SUBD   M(A1)    M(A2)
      Add2   Yes   ADDD            M(A2)    Add1
      Add3   No
9     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12 ... F30
6      FU  Mult1  M(A2)         Add2     Add1   Mult2

° Issue ADDD here vs. scoreboard?

Tomasulo Example Cycle 7

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result  Buffer  Busy  Address
LD     F6    34+   R2    1      3          4             Load1   No
LD     F2    45+   R3    2      4          5             Load2   No
MULTD  F0    F2    F4    3                               Load3   No
SUBD   F8    F6    F2    4      7
DIVD   F10   F0    F6    5
ADDD   F6    F8    F2    6

Reservation stations:
                          S1       S2       RS     RS
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
0     Add1   Yes   SUBD   M(A1)    M(A2)
      Add2   Yes   ADDD            M(A2)    Add1
      Add3   No
8     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12 ... F30
7      FU  Mult1  M(A2)         Add2     Add1   Mult2

° Add1 completing; what is waiting for it?

Tomasulo Example Cycle 8

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result  Buffer  Busy  Address
LD     F6    34+   R2    1      3          4             Load1   No
LD     F2    45+   R3    2      4          5             Load2   No
MULTD  F0    F2    F4    3                               Load3   No
SUBD   F8    F6    F2    4      7          8
DIVD   F10   F0    F6    5
ADDD   F6    F8    F2    6

Reservation stations:
                          S1       S2       RS     RS
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
2     Add2   Yes   ADDD   (M-M)    M(A2)
      Add3   No
7     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12 ... F30
8      FU  Mult1  M(A2)         Add2     (M-M)  Mult2

Tomasulo Example Cycle 9

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result  Buffer  Busy  Address
LD     F6    34+   R2    1      3          4             Load1   No
LD     F2    45+   R3    2      4          5             Load2   No
MULTD  F0    F2    F4    3                               Load3   No
SUBD   F8    F6    F2    4      7          8
DIVD   F10   F0    F6    5
ADDD   F6    F8    F2    6

Reservation stations:
                          S1       S2       RS     RS
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
1     Add2   Yes   ADDD   (M-M)    M(A2)
      Add3   No
6     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12 ... F30
9      FU  Mult1  M(A2)         Add2     (M-M)  Mult2

Tomasulo Example Cycle 10

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result  Buffer  Busy  Address
LD     F6    34+   R2    1      3          4             Load1   No
LD     F2    45+   R3    2      4          5             Load2   No
MULTD  F0    F2    F4    3                               Load3   No
SUBD   F8    F6    F2    4      7          8
DIVD   F10   F0    F6    5
ADDD   F6    F8    F2    6      10

Reservation stations:
                          S1       S2       RS     RS
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
0     Add2   Yes   ADDD   (M-M)    M(A2)
      Add3   No
5     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12 ... F30
10     FU  Mult1  M(A2)         Add2     (M-M)  Mult2

° Add2 completing; what is waiting for it?

Tomasulo Example Cycle 11

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result  Buffer  Busy  Address
LD     F6    34+   R2    1      3          4             Load1   No
LD     F2    45+   R3    2      4          5             Load2   No
MULTD  F0    F2    F4    3                               Load3   No
SUBD   F8    F6    F2    4      7          8
DIVD   F10   F0    F6    5
ADDD   F6    F8    F2    6      10         11

Reservation stations:
                          S1       S2       RS     RS
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
4     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12 ... F30
11     FU  Mult1  M(A2)         (M-M+M)  (M-M)  Mult2

° Write result of ADDD here vs. scoreboard?
° All quick instructions complete in this cycle!

Tomasulo Example Cycle 12

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result  Buffer  Busy  Address
LD     F6    34+   R2    1      3          4             Load1   No
LD     F2    45+   R3    2      4          5             Load2   No
MULTD  F0    F2    F4    3                               Load3   No
SUBD   F8    F6    F2    4      7          8
DIVD   F10   F0    F6    5
ADDD   F6    F8    F2    6      10         11

Reservation stations:
                          S1       S2       RS     RS
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
3     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12 ... F30
12     FU  Mult1  M(A2)         (M-M+M)  (M-M)  Mult2

Tomasulo Example Cycle 13

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result  Buffer  Busy  Address
LD     F6    34+   R2    1      3          4             Load1   No
LD     F2    45+   R3    2      4          5             Load2   No
MULTD  F0    F2    F4    3                               Load3   No
SUBD   F8    F6    F2    4      7          8
DIVD   F10   F0    F6    5
ADDD   F6    F8    F2    6      10         11

Reservation stations:
                          S1       S2       RS     RS
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
2     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12 ... F30
13     FU  Mult1  M(A2)         (M-M+M)  (M-M)  Mult2

Tomasulo Example Cycle 14

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result  Buffer  Busy  Address
LD     F6    34+   R2    1      3          4             Load1   No
LD     F2    45+   R3    2      4          5             Load2   No
MULTD  F0    F2    F4    3                               Load3   No
SUBD   F8    F6    F2    4      7          8
DIVD   F10   F0    F6    5
ADDD   F6    F8    F2    6      10         11

Reservation stations:
                          S1       S2       RS     RS
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
1     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12 ... F30
14     FU  Mult1  M(A2)         (M-M+M)  (M-M)  Mult2

Tomasulo Example Cycle 15

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result  Buffer  Busy  Address
LD     F6    34+   R2    1      3          4             Load1   No
LD     F2    45+   R3    2      4          5             Load2   No
MULTD  F0    F2    F4    3      15                       Load3   No
SUBD   F8    F6    F2    4      7          8
DIVD   F10   F0    F6    5
ADDD   F6    F8    F2    6      10         11

Reservation stations:
                          S1       S2       RS     RS
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
0     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12 ... F30
15     FU  Mult1  M(A2)         (M-M+M)  (M-M)  Mult2

Tomasulo Example Cycle 16

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result  Buffer  Busy  Address
LD     F6    34+   R2    1      3          4             Load1   No
LD     F2    45+   R3    2      4          5             Load2   No
MULTD  F0    F2    F4    3      15         16            Load3   No
SUBD   F8    F6    F2    4      7          8
DIVD   F10   F0    F6    5
ADDD   F6    F8    F2    6      10         11

Reservation stations:
                          S1       S2       RS     RS
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
40    Mult2  Yes   DIVD   M*F4     M(A1)

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12 ... F30
16     FU  M*F4   M(A2)         (M-M+M)  (M-M)  Mult2

Faster than light computation (skip a couple of cycles)

Tomasulo Example Cycle 55

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result  Buffer  Busy  Address
LD     F6    34+   R2    1      3          4             Load1   No
LD     F2    45+   R3    2      4          5             Load2   No
MULTD  F0    F2    F4    3      15         16            Load3   No
SUBD   F8    F6    F2    4      7          8
DIVD   F10   F0    F6    5
ADDD   F6    F8    F2    6      10         11

Reservation stations:
                          S1       S2       RS     RS
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
1     Mult2  Yes   DIVD   M*F4     M(A1)

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12 ... F30
55     FU  M*F4   M(A2)         (M-M+M)  (M-M)  Mult2

Tomasulo Example Cycle 56

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result  Buffer  Busy  Address
LD     F6    34+   R2    1      3          4             Load1   No
LD     F2    45+   R3    2      4          5             Load2   No
MULTD  F0    F2    F4    3      15         16            Load3   No
SUBD   F8    F6    F2    4      7          8
DIVD   F10   F0    F6    5      56
ADDD   F6    F8    F2    6      10         11

Reservation stations:
                          S1       S2       RS     RS
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
0     Mult2  Yes   DIVD   M*F4     M(A1)

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12 ... F30
56     FU  M*F4   M(A2)         (M-M+M)  (M-M)  Mult2

° Mult2 completing; what is waiting for it?

Tomasulo Example Cycle 57

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result  Buffer  Busy  Address
LD     F6    34+   R2    1      3          4             Load1   No
LD     F2    45+   R3    2      4          5             Load2   No
MULTD  F0    F2    F4    3      15         16            Load3   No
SUBD   F8    F6    F2    4      7          8
DIVD   F10   F0    F6    5      56         57
ADDD   F6    F8    F2    6      10         11

Reservation stations:
                          S1       S2       RS     RS
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12 ... F30
57     FU  M*F4   M(A2)         (M-M+M)  (M-M)  Result

° Once again: In-order issue, out-of-order execution and completion.

Compare to Scoreboard Cycle 62

                         Scoreboard                                    Tomasulo
Instruction  j     k     Issue  Read Oper  Exec Comp  Write Result     Issue  Exec Comp  Write Result
LD     F6    34+   R2    1      2          3          4                1      3          4
LD     F2    45+   R3    5      6          7          8                2      4          5
MULTD  F0    F2    F4    6      9          19         20               3      15         16
SUBD   F8    F6    F2    7      9          11         12               4      7          8
DIVD   F10   F0    F6    8      21         61         62               5      56         57
ADDD   F6    F8    F2    13     14         16         22               6      10         11

° Why take longer on scoreboard/6600?
• Structural Hazards
• Lack of forwarding
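The write-result cycles in the comparison can be tallied with a short Python sketch. The dictionary keys are labels chosen for this illustration; the cycle numbers are copied from the two write-result columns.

```python
# Write-result cycle of each instruction under each scheme
scoreboard = {"LD F6": 4, "LD F2": 8, "MULTD": 20, "SUBD": 12, "DIVD": 62, "ADDD": 22}
tomasulo = {"LD F6": 4, "LD F2": 5, "MULTD": 16, "SUBD": 8, "DIVD": 57, "ADDD": 11}

# Cycles saved per instruction, and the cycle at which the last result is written
saved = {name: scoreboard[name] - tomasulo[name] for name in scoreboard}
last_scoreboard = max(scoreboard.values())
last_tomasulo = max(tomasulo.values())

print(saved)
print(last_scoreboard, last_tomasulo)  # 62 57
```

ADDD gains the most (11 cycles), since it no longer waits for the WAR hazard on F6 to clear.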

Tomasulo v. Scoreboard (IBM 360/91 v. CDC 6600)

Tomasulo (IBM 360/91)             Scoreboard (CDC 6600)
Pipelined Functional Units        Multiple Functional Units
(6 load, 3 store, 3 +, 2 x/÷)     (1 load/store, 1 +, 2 x, 1 ÷)
window size: ~ 14 instructions    ~ 5 instructions
No issue on structural hazard     same
WAR: renaming avoids              stall completion
WAW: renaming avoids              stall issue
Broadcast results from FU         Write/read registers
Control: reservation stations     central scoreboard

Tomasulo Drawbacks

° Complexity
• delays of 360/91, MIPS 10000, IBM 620?

° Many associative stores (CDB) at high speed

° Performance limited by Common Data Bus
• Multiple CDBs => more FU logic for parallel assoc stores

Administrivia

° Should be debugging Lab 5 by now!
• Remember: a working processor is necessary for full credit…

° Tomorrow: Sections are back in classroom

° More info on some of the things that we have been talking about last two lectures:
• Computer Architecture: A Quantitative Approach by John Hennessy and David Patterson

° Next: Memory systems
• Start reading Chapter 7 (of your text) now…
• Lab 6 will be using memory systems.

Administrivia: Be careful about clock edges in lab5!

[Figure: pipelined datapath. Labels: Next PC, PC, Inst. Mem, IR, Reg File, Exec, Mem Access, Data Mem, Reg. File; control stages Dcd Ctrl, Ex Ctrl, Mem Ctrl, WB Ctrl; instruction registers IRex, IRmem, IRwb; registers A, B, S, M, D; signals Equal, Valid]

° Since Register file has edge-triggered write:
• Must have everything set up at end of memory stage
• This means that “M” register here is not actual register!

° Same with edge-triggered memory ⇒ “D” register appears “inside” memory

Tomasulo Loop Example

Loop: LD    F0 0  R1
      MULTD F4 F0 F2
      SD    F4 0  R1
      SUBI  R1 R1 #8
      BNEZ  R1 Loop

° Assume Multiply takes 4 clocks

° Assume first load takes 8 clocks (cache miss), second load takes 1 clock (hit)

° To be clear, will show clocks for SUBI, BNEZ

° Reality: integer instructions ahead

Loop Example

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result  Buffer  Busy  Addr  Fu
1     LD     F0    0    R1                                   Load1   No
1     MULTD  F4    F0   F2                                   Load2   No
1     SD     F4    0    R1                                   Load3   No
2     LD     F0    0    R1                                   Store1  No
2     MULTD  F4    F0   F2                                   Store2  No
2     SD     F4    0    R1                                   Store3  No

Reservation stations:
                          S1      S2      RS
Time  Name   Busy  Op     Vj      Vk      Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Code: LD F0 0 R1; MULTD F4 F0 F2; SD F4 0 R1; SUBI R1 R1 #8; BNEZ R1 Loop

Register result status:
Clock  R1       F0     F2     F4     F6     F8     F10    F12 ... F30
0      80   Fu

Loop Example Cycle 1

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result  Buffer  Busy  Addr  Fu
1     LD     F0    0    R1   1                               Load1   Yes   80
1     MULTD  F4    F0   F2                                   Load2   No
1     SD     F4    0    R1                                   Load3   No
2     LD     F0    0    R1                                   Store1  No
2     MULTD  F4    F0   F2                                   Store2  No
2     SD     F4    0    R1                                   Store3  No

Reservation stations:
                          S1      S2      RS
Time  Name   Busy  Op     Vj      Vk      Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Code: LD F0 0 R1; MULTD F4 F0 F2; SD F4 0 R1; SUBI R1 R1 #8; BNEZ R1 Loop

Register result status:
Clock  R1       F0     F2     F4     F6     F8     F10    F12 ... F30
1      80   Fu  Load1

Loop Example Cycle 2

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result  Buffer  Busy  Addr  Fu
1     LD     F0    0    R1   1                               Load1   Yes   80
1     MULTD  F4    F0   F2   2                               Load2   No
1     SD     F4    0    R1                                   Load3   No
2     LD     F0    0    R1                                   Store1  No
2     MULTD  F4    F0   F2                                   Store2  No
2     SD     F4    0    R1                                   Store3  No

Reservation stations:
                          S1      S2      RS
Time  Name   Busy  Op     Vj      Vk      Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd          R(F2)   Load1
      Mult2  No

Code: LD F0 0 R1; MULTD F4 F0 F2; SD F4 0 R1; SUBI R1 R1 #8; BNEZ R1 Loop

Register result status:
Clock  R1       F0     F2     F4     F6     F8     F10    F12 ... F30
2      80   Fu  Load1         Mult1

Loop Example Cycle 3

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result  Buffer  Busy  Addr  Fu
1     LD     F0    0    R1   1                               Load1   Yes   80
1     MULTD  F4    F0   F2   2                               Load2   No
1     SD     F4    0    R1   3                               Load3   No
2     LD     F0    0    R1                                   Store1  Yes   80    Mult1
2     MULTD  F4    F0   F2                                   Store2  No
2     SD     F4    0    R1                                   Store3  No

Reservation stations:
                          S1      S2      RS
Time  Name   Busy  Op     Vj      Vk      Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd          R(F2)   Load1
      Mult2  No

Code: LD F0 0 R1; MULTD F4 F0 F2; SD F4 0 R1; SUBI R1 R1 #8; BNEZ R1 Loop

Register result status:
Clock  R1       F0     F2     F4     F6     F8     F10    F12 ... F30
3      80   Fu  Load1         Mult1

° Implicit renaming sets up “DataFlow” graph

What does this mean physically?

[Figure: Tomasulo organization annotated with the state after cycle 3]
Load buffer: Load1, addr: 80
Register result status: F0: Load1, F4: Mult1
Reservation station Mult1: mul, R(F2), Load1
Store buffer: Store1, Addr: 80, waiting on Mult1

Loop Example Cycle 4

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result  Buffer  Busy  Addr  Fu
1     LD     F0    0    R1   1                               Load1   Yes   80
1     MULTD  F4    F0   F2   2                               Load2   No
1     SD     F4    0    R1   3                               Load3   No
2     LD     F0    0    R1                                   Store1  Yes   80    Mult1
2     MULTD  F4    F0   F2                                   Store2  No
2     SD     F4    0    R1                                   Store3  No

Reservation stations:
                          S1      S2      RS
Time  Name   Busy  Op     Vj      Vk      Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd          R(F2)   Load1
      Mult2  No

Code: LD F0 0 R1; MULTD F4 F0 F2; SD F4 0 R1; SUBI R1 R1 #8; BNEZ R1 Loop

Register result status:
Clock  R1       F0     F2     F4     F6     F8     F10    F12 ... F30
4      80   Fu  Load1         Mult1

° Dispatching SUBI Instruction

Loop Example Cycle 5

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result  Buffer  Busy  Addr  Fu
1     LD     F0    0    R1   1                               Load1   Yes   80
1     MULTD  F4    F0   F2   2                               Load2   No
1     SD     F4    0    R1   3                               Load3   No
2     LD     F0    0    R1                                   Store1  Yes   80    Mult1
2     MULTD  F4    F0   F2                                   Store2  No
2     SD     F4    0    R1                                   Store3  No

Reservation stations:
                          S1      S2      RS
Time  Name   Busy  Op     Vj      Vk      Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd          R(F2)   Load1
      Mult2  No

Code: LD F0 0 R1; MULTD F4 F0 F2; SD F4 0 R1; SUBI R1 R1 #8; BNEZ R1 Loop

Register result status:
Clock  R1       F0     F2     F4     F6     F8     F10    F12 ... F30
5      72   Fu  Load1         Mult1

° And, BNEZ instruction

Loop Example Cycle 6

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result  Buffer  Busy  Addr  Fu
1     LD     F0    0    R1   1                               Load1   Yes   80
1     MULTD  F4    F0   F2   2                               Load2   Yes   72
1     SD     F4    0    R1   3                               Load3   No
2     LD     F0    0    R1   6                               Store1  Yes   80    Mult1
2     MULTD  F4    F0   F2                                   Store2  No
2     SD     F4    0    R1                                   Store3  No

Reservation stations:
                          S1      S2      RS
Time  Name   Busy  Op     Vj      Vk      Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd          R(F2)   Load1
      Mult2  No

Code: LD F0 0 R1; MULTD F4 F0 F2; SD F4 0 R1; SUBI R1 R1 #8; BNEZ R1 Loop

Register result status:
Clock  R1       F0     F2     F4     F6     F8     F10    F12 ... F30
6      72   Fu  Load2         Mult1

° Notice that F0 never sees Load from location 80

Loop Example Cycle 7

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result  Buffer  Busy  Addr  Fu
1     LD     F0    0    R1   1                               Load1   Yes   80
1     MULTD  F4    F0   F2   2                               Load2   Yes   72
1     SD     F4    0    R1   3                               Load3   No
2     LD     F0    0    R1   6                               Store1  Yes   80    Mult1
2     MULTD  F4    F0   F2   7                               Store2  No
2     SD     F4    0    R1                                   Store3  No

Reservation stations:
                          S1      S2      RS
Time  Name   Busy  Op     Vj      Vk      Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd          R(F2)   Load1
      Mult2  Yes   Multd          R(F2)   Load2

Code: LD F0 0 R1; MULTD F4 F0 F2; SD F4 0 R1; SUBI R1 R1 #8; BNEZ R1 Loop

Register result status:
Clock  R1       F0     F2     F4     F6     F8     F10    F12 ... F30
7      72   Fu  Load2         Mult2

° Register file completely detached from iteration 1

Loop Example Cycle 8

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result  Buffer  Busy  Addr  Fu
1     LD     F0    0    R1   1                               Load1   Yes   80
1     MULTD  F4    F0   F2   2                               Load2   Yes   72
1     SD     F4    0    R1   3                               Load3   No
2     LD     F0    0    R1   6                               Store1  Yes   80    Mult1
2     MULTD  F4    F0   F2   7                               Store2  Yes   72    Mult2
2     SD     F4    0    R1   8                               Store3  No

Reservation stations:
                          S1      S2      RS
Time  Name   Busy  Op     Vj      Vk      Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd          R(F2)   Load1
      Mult2  Yes   Multd          R(F2)   Load2

Code: LD F0 0 R1; MULTD F4 F0 F2; SD F4 0 R1; SUBI R1 R1 #8; BNEZ R1 Loop

Register result status:
Clock  R1       F0     F2     F4     F6     F8     F10    F12 ... F30
8      72   Fu  Load2         Mult2

° First and Second iteration completely overlapped

What does this mean physically?

[Figure: Tomasulo organization annotated with the state after cycle 8]
Load buffers: Load1, addr: 80; Load2, addr: 72
Register result status: F0: Load2, F4: Mult2
Reservation stations: Mult1: mul, R(F2), Load1; Mult2: mul, R(F2), Load2
Store buffers: Store1, Addr: 80, waiting on Mult1; Store2, Addr: 72, waiting on Mult2

Loop Example Cycle 9

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result  Buffer  Busy  Addr  Fu
1     LD     F0    0    R1   1      9                        Load1   Yes   80
1     MULTD  F4    F0   F2   2                               Load2   Yes   72
1     SD     F4    0    R1   3                               Load3   No
2     LD     F0    0    R1   6                               Store1  Yes   80    Mult1
2     MULTD  F4    F0   F2   7                               Store2  Yes   72    Mult2
2     SD     F4    0    R1   8                               Store3  No

Reservation stations:
                          S1      S2      RS
Time  Name   Busy  Op     Vj      Vk      Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd          R(F2)   Load1
      Mult2  Yes   Multd          R(F2)   Load2

Code: LD F0 0 R1; MULTD F4 F0 F2; SD F4 0 R1; SUBI R1 R1 #8; BNEZ R1 Loop

Register result status:
Clock  R1       F0     F2     F4     F6     F8     F10    F12 ... F30
9      72   Fu  Load2         Mult2

° Load1 completing: who is waiting?
° Note: Dispatching SUBI

Loop Example Cycle 10

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result  Buffer  Busy  Addr  Fu
1     LD     F0    0    R1   1      9          10            Load1   No
1     MULTD  F4    F0   F2   2                               Load2   Yes   72
1     SD     F4    0    R1   3                               Load3   No
2     LD     F0    0    R1   6      10                       Store1  Yes   80    Mult1
2     MULTD  F4    F0   F2   7                               Store2  Yes   72    Mult2
2     SD     F4    0    R1   8                               Store3  No

Reservation stations:
                          S1      S2      RS
Time  Name   Busy  Op     Vj      Vk      Qj     Qk
      Add1   No
      Add2   No
      Add3   No
4     Mult1  Yes   Multd  M[80]   R(F2)
      Mult2  Yes   Multd          R(F2)   Load2

Code: LD F0 0 R1; MULTD F4 F0 F2; SD F4 0 R1; SUBI R1 R1 #8; BNEZ R1 Loop

Register result status:
Clock  R1       F0     F2     F4     F6     F8     F10    F12 ... F30
10     64   Fu  Load2         Mult2

° Load2 completing: who is waiting?
° Note: Dispatching BNEZ

Loop Example Cycle 11

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result  Buffer  Busy  Addr  Fu
1     LD     F0    0    R1   1      9          10            Load1   No
1     MULTD  F4    F0   F2   2                               Load2   No
1     SD     F4    0    R1   3                               Load3   Yes   64
2     LD     F0    0    R1   6      10         11            Store1  Yes   80    Mult1
2     MULTD  F4    F0   F2   7                               Store2  Yes   72    Mult2
2     SD     F4    0    R1   8                               Store3  No

Reservation stations:
                          S1      S2      RS
Time  Name   Busy  Op     Vj      Vk      Qj     Qk
      Add1   No
      Add2   No
      Add3   No
3     Mult1  Yes   Multd  M[80]   R(F2)
4     Mult2  Yes   Multd  M[72]   R(F2)

Code: LD F0 0 R1; MULTD F4 F0 F2; SD F4 0 R1; SUBI R1 R1 #8; BNEZ R1 Loop

Register result status:
Clock  R1       F0     F2     F4     F6     F8     F10    F12 ... F30
11     64   Fu  Load3         Mult2

° Next load in sequence

Loop Example Cycle 12

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result  Buffer  Busy  Addr  Fu
1     LD     F0    0    R1   1      9          10            Load1   No
1     MULTD  F4    F0   F2   2                               Load2   No
1     SD     F4    0    R1   3                               Load3   Yes   64
2     LD     F0    0    R1   6      10         11            Store1  Yes   80    Mult1
2     MULTD  F4    F0   F2   7                               Store2  Yes   72    Mult2
2     SD     F4    0    R1   8                               Store3  No

Reservation stations:
                          S1      S2      RS
Time  Name   Busy  Op     Vj      Vk      Qj     Qk
      Add1   No
      Add2   No
      Add3   No
2     Mult1  Yes   Multd  M[80]   R(F2)
3     Mult2  Yes   Multd  M[72]   R(F2)

Code: LD F0 0 R1; MULTD F4 F0 F2; SD F4 0 R1; SUBI R1 R1 #8; BNEZ R1 Loop

Register result status:
Clock  R1       F0     F2     F4     F6     F8     F10    F12 ... F30
12     64   Fu  Load3         Mult2

° Why not issue third multiply?

Loop Example Cycle 13

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result  Buffer  Busy  Addr  Fu
1     LD     F0    0    R1   1      9          10            Load1   No
1     MULTD  F4    F0   F2   2                               Load2   No
1     SD     F4    0    R1   3                               Load3   Yes   64
2     LD     F0    0    R1   6      10         11            Store1  Yes   80    Mult1
2     MULTD  F4    F0   F2   7                               Store2  Yes   72    Mult2
2     SD     F4    0    R1   8                               Store3  No

Reservation stations:
                          S1      S2      RS
Time  Name   Busy  Op     Vj      Vk      Qj     Qk
      Add1   No
      Add2   No
      Add3   No
1     Mult1  Yes   Multd  M[80]   R(F2)
2     Mult2  Yes   Multd  M[72]   R(F2)

Code: LD F0 0 R1; MULTD F4 F0 F2; SD F4 0 R1; SUBI R1 R1 #8; BNEZ R1 Loop

Register result status:
Clock  R1       F0     F2     F4     F6     F8     F10    F12 ... F30
13     64   Fu  Load3         Mult2

Loop Example Cycle 14

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result  Buffer  Busy  Addr  Fu
1     LD     F0    0    R1   1      9          10            Load1   No
1     MULTD  F4    F0   F2   2      14                       Load2   No
1     SD     F4    0    R1   3                               Load3   Yes   64
2     LD     F0    0    R1   6      10         11            Store1  Yes   80    Mult1
2     MULTD  F4    F0   F2   7                               Store2  Yes   72    Mult2
2     SD     F4    0    R1   8                               Store3  No

Reservation stations:
                          S1      S2      RS
Time  Name   Busy  Op     Vj      Vk      Qj     Qk
      Add1   No
      Add2   No
      Add3   No
0     Mult1  Yes   Multd  M[80]   R(F2)
1     Mult2  Yes   Multd  M[72]   R(F2)

Code: LD F0 0 R1; MULTD F4 F0 F2; SD F4 0 R1; SUBI R1 R1 #8; BNEZ R1 Loop

Register result status:
Clock  R1       F0     F2     F4     F6     F8     F10    F12 ... F30
14     64   Fu  Load3         Mult2

° Mult1 completing. Who is waiting?

Loop Example Cycle 15

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result  Buffer  Busy  Addr  Fu
1     LD     F0    0    R1   1      9          10            Load1   No
1     MULTD  F4    F0   F2   2      14         15            Load2   No
1     SD     F4    0    R1   3                               Load3   Yes   64
2     LD     F0    0    R1   6      10         11            Store1  Yes   80    [80]*R2
2     MULTD  F4    F0   F2   7      15                       Store2  Yes   72    Mult2
2     SD     F4    0    R1   8                               Store3  No

Reservation stations:
                          S1      S2      RS
Time  Name   Busy  Op     Vj      Vk      Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
0     Mult2  Yes   Multd  M[72]   R(F2)

Code: LD F0 0 R1; MULTD F4 F0 F2; SD F4 0 R1; SUBI R1 R1 #8; BNEZ R1 Loop

Register result status:
Clock  R1       F0     F2     F4     F6     F8     F10    F12 ... F30
15     64   Fu  Load3         Mult2

° Mult2 completing. Who is waiting?

Loop Example Cycle 16

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result  Buffer  Busy  Addr  Fu
1     LD     F0    0    R1   1      9          10            Load1   No
1     MULTD  F4    F0   F2   2      14         15            Load2   No
1     SD     F4    0    R1   3                               Load3   Yes   64
2     LD     F0    0    R1   6      10         11            Store1  Yes   80    [80]*R2
2     MULTD  F4    F0   F2   7      15         16            Store2  Yes   72    [72]*R2
2     SD     F4    0    R1   8                               Store3  No

Reservation stations:
                          S1      S2      RS
Time  Name   Busy  Op     Vj      Vk      Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd          R(F2)   Load3
      Mult2  No

Code: LD F0 0 R1; MULTD F4 F0 F2; SD F4 0 R1; SUBI R1 R1 #8; BNEZ R1 Loop

Register result status:
Clock  R1       F0     F2     F4     F6     F8     F10    F12 ... F30
16     64   Fu  Load3         Mult1

Loop Example Cycle 17

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result  Buffer  Busy  Addr  Fu
1     LD     F0    0    R1   1      9          10            Load1   No
1     MULTD  F4    F0   F2   2      14         15            Load2   No
1     SD     F4    0    R1   3                               Load3   Yes   64
2     LD     F0    0    R1   6      10         11            Store1  Yes   80    [80]*R2
2     MULTD  F4    F0   F2   7      15         16            Store2  Yes   72    [72]*R2
2     SD     F4    0    R1   8                               Store3  Yes   64    Mult1

Reservation stations:
                          S1      S2      RS
Time  Name   Busy  Op     Vj      Vk      Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd          R(F2)   Load3
      Mult2  No

Code: LD F0 0 R1; MULTD F4 F0 F2; SD F4 0 R1; SUBI R1 R1 #8; BNEZ R1 Loop

Register result status:
Clock  R1       F0     F2     F4     F6     F8     F10    F12 ... F30
17     64   Fu  Load3         Mult1

Loop Example Cycle 18

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result  Buffer  Busy  Addr  Fu
1     LD     F0    0    R1   1      9          10            Load1   No
1     MULTD  F4    F0   F2   2      14         15            Load2   No
1     SD     F4    0    R1   3      18                       Load3   Yes   64
2     LD     F0    0    R1   6      10         11            Store1  Yes   80    [80]*R2
2     MULTD  F4    F0   F2   7      15         16            Store2  Yes   72    [72]*R2
2     SD     F4    0    R1   8                               Store3  Yes   64    Mult1

Reservation stations:
                          S1      S2      RS
Time  Name   Busy  Op     Vj      Vk      Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd          R(F2)   Load3
      Mult2  No

Code: LD F0 0 R1; MULTD F4 F0 F2; SD F4 0 R1; SUBI R1 R1 #8; BNEZ R1 Loop

Register result status:
Clock  R1       F0     F2     F4     F6     F8     F10    F12 ... F30
18     64   Fu  Load3         Mult1

Loop Example Cycle 19

Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result  Buffer  Busy  Addr  Fu
1     LD     F0    0    R1   1      9          10            Load1   No
1     MULTD  F4    F0   F2   2      14         15            Load2   No
1     SD     F4    0    R1   3      18         19            Load3   Yes   64
2     LD     F0    0    R1   6      10         11            Store1  No
2     MULTD  F4    F0   F2   7      15         16            Store2  Yes   72    [72]*R2
2     SD     F4    0    R1   8      19                       Store3  Yes   64    Mult1

Reservation stations:
                          S1      S2      RS
Time  Name   Busy  Op     Vj      Vk      Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd          R(F2)   Load3
      Mult2  No

Code: LD F0 0 R1; MULTD F4 F0 F2; SD F4 0 R1; SUBI R1 R1 #8; BNEZ R1 Loop

Register result status:
Clock  R1       F0     F2     F4     F6     F8     F10    F12 ... F30
19     64   Fu  Load3         Mult1

Loop Example Cycle 20

Instruction status:

ITER  Instruction  j   k   Issue  Exec Comp  Write Result
1     LD     F0    0   R1  1      9          10
1     MULTD  F4    F0  F2  2      14         15
1     SD     F4    0   R1  3      18         19
2     LD     F0    0   R1  6      10         11
2     MULTD  F4    F0  F2  7      15         16
2     SD     F4    0   R1  8      19         20

Load and store buffers:

Name    Busy  Addr  Fu
Load1   No
Load2   No
Load3   Yes   64
Store1  No
Store2  No
Store3  Yes   64    Mult1

Reservation Stations:

                          S1  S2     RS
Time  Name   Busy  Op     Vj  Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd      R(F2)  Load3
      Mult2  No

Code:

LD    F0 0 R1
MULTD F4 F0 F2
SD    F4 0 R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status

Clock  R1      F0     F2  F4     F6  F8  F10  F12  ...  F30
20     64  Fu  Load3      Mult1

Why can Tomasulo overlap iterations of loops?

° Register renaming

• Multiple iterations use different physical destinations for registers (dynamic loop unrolling).

• Replace static register names from code with dynamic register “pointers”

• Effectively increases size of register file

• Permit instruction issue to advance past integer control flow operations.

° Crucial: integer unit must “get ahead” of floating point unit so that we can issue multiple iterations

° Other idea: Tomasulo building “DataFlow” graph.
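The renaming idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the tag names (T1, T2, ...) and the `rename` helper are hypothetical, the integer instructions (SUBI, BNEZ) are left out, and a real machine uses reservation station or buffer numbers as the tags.

```python
from itertools import count


def rename(instructions):
    """Give every destination a fresh tag; sources read the latest tag.

    instructions: list of (op, dest, sources) in program order.
    """
    fresh = count(1)   # hypothetical tag generator: T1, T2, ...
    table = {}         # architectural register -> most recent tag
    renamed = []
    for op, dest, srcs in instructions:
        # Look up sources first, in program order, so each source names
        # its most recent producer (true RAW dependences are preserved).
        new_srcs = tuple(table.get(s, s) for s in srcs)
        new_dest = dest
        if dest is not None:
            new_dest = f"T{next(fresh)}"
            table[dest] = new_dest   # later readers see the new tag
        renamed.append((op, new_dest, new_srcs))
    return renamed


# Floating point part of the lecture's loop body, issued twice.
body = [
    ("LD", "F0", ("R1",)),
    ("MULTD", "F4", ("F0", "F2")),
    ("SD", None, ("F4", "R1")),
]
two_iterations = rename(body + body)
for instr in two_iterations:
    print(instr)
```

Both iterations write F0 and F4 in the source code, but after renaming the second iteration writes T3 and T4, so it no longer conflicts (WAR/WAW) with the first iteration's T1 and T2.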

Recall: Unrolled Loop That Minimizes Stalls

Loop: LD   F0,0(R1)
      LD   F6,-8(R1)
      LD   F10,-16(R1)
      LD   F14,-24(R1)
      ADDD F4,F0,F2
      ADDD F8,F6,F2
      ADDD F12,F10,F2
      ADDD F16,F14,F2
      SD   0(R1),F4
      SD   -8(R1),F8
      SD   -16(R1),F12
      SUBI R1,R1,#32
      BNEZ R1,Loop
      SD   8(R1),F16   ; 8-32 = -24

Why issue in order?

° In-order issue permits us to analyze data flow of program

° This way, we know exactly which results should flow to which subsequent instructions

• If we issued out-of-order, we would confuse RAW and WAR hazards!

• The most advanced machines that I know of all issue in order.

° This idea works perfectly well “in principle” with multiple instructions issued per clock:

• Need to multi-port “rename table” and be able to rename a sequence of instructions together

• Need to be able to issue to multiple reservation stations in a single cycle.

• Need to have 2x number of read ports and 1x number of write ports in register file.

° In-order issue can be serious bottleneck when issuing multiple instructions per clock-cycle

Branches must be resolved quickly for loop overlap!

° In our example, we relied on the fact that branches were under control of “fast” integer unit in order to get overlap!

Loop: LD    F0 0 R1
      MULTD F4 F0 F2
      SD    F4 0 R1
      SUBI  R1 R1 #8
      BNEZ  R1 Loop

° What happens if branch depends on result of multd??

• We completely lose all of our advantages!

• Need to be able to “predict” branch outcome.

• If we were to predict that branch was taken, this would be right most of the time.

° Problem much worse for superscalar machines!

Independent “Fetch” unit

[Figure: Instruction Fetch with Branch Prediction sends a stream of instructions to execute to the Out-Of-Order Execution Unit, which returns correctness feedback on branch results]

° Instruction fetch decoupled from execution

° Often issue logic (+ rename) included with Fetch

Prediction: Branches, Dependencies, Data

° Prediction has become essential to getting good performance from scalar instruction streams.

° We will discuss predicting branches. However, architects are now predicting everything: data dependencies, actual data, and results of groups of instructions:

• At what point does computation become a probabilistic operation + verification?

• We are pretty close with control hazards already…

° Why does prediction work?

• Underlying algorithm has regularities.

• Data that is being operated on has regularities.

• Instruction sequence has redundancies that are artifacts of way that humans/compilers think about problems.

° Prediction ⇒ Compressible information streams?

Dynamic Branch Prediction

° Prediction could be “Static” (at compile time) or “Dynamic” (at runtime)

• For our example, if we were to statically predict “taken”, we would only be wrong once each pass through loop

° Is dynamic branch prediction better than static branch prediction?

• Seems to be. Still some debate to this effect

• Today, lots of hardware being devoted to dynamic branch predictors.

Simple dynamic prediction: Branch Target Buffer (BTB)

° Address of branch is used as index to get prediction AND branch address (if taken)

• Must check for branch match now, since can’t use wrong branch address

• Grab predicted PC from table since may take several cycles to compute

° Update predicted PC when branch is actually resolved

° Return instruction addresses predicted with stack
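A minimal sketch of a BTB lookup and update, assuming a direct-mapped table, 4-byte instructions, and a full-PC tag check. The class name, table size, and the choice to drop an entry when its branch resolves not taken are assumptions made for illustration, not details from the slide.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: each slot holds (branch_pc, predicted_target)."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries   # assumes 4-byte instructions

    def predict(self, pc):
        """Return the predicted next PC; fall through when no entry matches."""
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:   # the "branch match" check
            return slot[1]
        return pc + 4

    def update(self, pc, taken, target):
        """Called once the branch has actually been resolved."""
        i = self._index(pc)
        if taken:
            self.table[i] = (pc, target)
        elif self.table[i] is not None and self.table[i][0] == pc:
            self.table[i] = None   # stop predicting taken for this branch


btb = BranchTargetBuffer(entries=4)
before = btb.predict(0x40)          # no entry yet: fall through
btb.update(0x40, True, 0x10)        # branch at 0x40 resolved taken to 0x10
after = btb.predict(0x40)           # now predicts the stored target
aliased = btb.predict(0x50)         # same slot, different branch: tag mismatch
```

The `aliased` lookup shows why the address check matters: 0x40 and 0x50 share a slot in a 4-entry table, and without the tag comparison the wrong target would be used.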

Dynamic Branch Prediction

° Performance = ƒ(accuracy, cost of misprediction)

° Branch History Table: Lower bits of PC address index table of 1-bit values

• Says whether or not branch taken last time

• No address check

° Problem: in a loop, 1-bit BHT will cause two mispredictions (avg is 9 iterations before exit):

• End of loop case, when it exits instead of looping as before

• First time through loop on next time through code, when it predicts exit instead of looping
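The two mispredictions per pass can be checked with a small simulation. This sketch models a single 1-bit entry with no aliasing; the function name and the 9-taken-then-exit pattern are illustrative assumptions matching the numbers on the slide.

```python
def one_bit_mispredictions(outcomes, initial=False):
    """Count mispredictions of a 1-bit predictor that repeats the last outcome."""
    last = initial
    misses = 0
    for taken in outcomes:
        if taken != last:
            misses += 1
        last = taken   # remember only what the branch did last time
    return misses


# Loop branch: taken 9 times, then not taken on exit. Run the loop 3 times.
one_pass = [True] * 9 + [False]
misses = one_bit_mispredictions(one_pass * 3, initial=True)
print(misses)
```

Starting from a "taken" entry, the first pass misses only at the exit; every later pass misses twice (once on re-entry, once on exit), which gives 5 misses over 3 passes.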

Dynamic Branch Prediction

° Solution: 2-bit scheme where change prediction only if get misprediction twice: (Figure 4.13, p. 264)

° Red: stop, not taken

° Green: go, taken

° Adds hysteresis to decision making process

[Figure: state diagram with two “Predict Taken” states and two “Predict Not Taken” states; arcs labeled T (taken) and NT (not taken)]
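One common way to build the 2-bit scheme is a saturating counter. The sketch below assumes that variant (the state encoding and class name are illustrative; the transitions in the textbook figure may differ in the two middle states) and replays the same loop pattern of 9 taken branches followed by an exit.

```python
class TwoBitCounter:
    """Saturating counter: states 0,1 predict not taken; 2,3 predict taken."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(predictor, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


one_pass = [True] * 9 + [False]
two_bit_misses = count_mispredictions(TwoBitCounter(), one_pass * 3)
print(two_bit_misses)
```

The single not-taken outcome at loop exit only moves the counter from 3 to 2, so the branch is still predicted taken on re-entry: one miss per pass instead of two.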

BHT Accuracy

° Mispredict because either:

• Wrong guess for that branch

• Got branch history of wrong branch when index the table

° With a 4096-entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%

° 4096 about as good as infinite table (in Alpha 21164)

Correlating Branches

° Hypothesis: recent branches are correlated; that is, behavior of recently executed branches affects prediction of current branch

° Two possibilities; Current branch depends on:

• Last m most recently executed branches anywhere in program. Produces a “GA” (for “global address”) in the Yeh and Patt classification (e.g. GAg)

• Last m most recent outcomes of same branch. Produces a “PA” (for “per address”) in same classification (e.g. PAg)

° Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table entry

• A single history table shared by all branches (appends a “g” at end), indexed by history value.

• Address is used along with history to select table entry (appends a “p” at end of classification)

• If only portion of address used, often appends an “s” to indicate “set-indexed” tables (i.e. GAs)

Correlating Branches

° For instance, consider global history, set-indexed BHT. That gives us a GAs history table.

(2,2) GAs predictor

• First 2 means that we keep two bits of history

• Second means that we have 2 bit counters in each slot.

• Then behavior of recent branches selects between, say, four predictions of next branch, updating just that prediction

• Note that the original two-bit counter solution would be a (0,2) GAs predictor

• Note also that aliasing is possible here...

[Figure: the branch address and a 2-bit global branch history register together select one of the 2-bits per branch predictors, which supplies the prediction]
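A rough sketch of an (m,2) predictor with global history and a set-indexed table. The class name, table size, initial counter value, and the alternating test pattern are all assumptions chosen for illustration; with `history_bits=0` the same class degenerates to the plain two-bit counter table, i.e. the (0,2) case.

```python
class CorrelatingPredictor:
    """(m,2) predictor: m bits of global history pick one of 2**m
    two-bit counters inside the set selected by the branch address."""

    def __init__(self, history_bits=2, sets=1024):
        self.m = history_bits
        self.sets = sets
        self.history = 0   # global history register, newest outcome in bit 0
        # Every counter starts at 1 (weakly not taken); an arbitrary choice.
        self.counters = [[1] * (1 << history_bits) for _ in range(sets)]

    def _slot(self, pc):
        return (pc >> 2) % self.sets, self.history

    def predict(self, pc):
        s, h = self._slot(pc)
        return self.counters[s][h] >= 2

    def update(self, pc, taken):
        s, h = self._slot(pc)
        c = self.counters[s][h]
        # Only the counter selected by the current history is updated.
        self.counters[s][h] = min(3, c + 1) if taken else max(0, c - 1)
        mask = (1 << self.m) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def run(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


alternating = [True, False] * 20   # a branch that flips every time
misses_corr = run(CorrelatingPredictor(history_bits=2), 0x100, alternating)
misses_plain = run(CorrelatingPredictor(history_bits=0), 0x100, alternating)
print(misses_corr, misses_plain)
```

In this run the history-indexed predictor learns the alternating pattern after two misses, while the (0,2) configuration, from this starting state, bounces between its two middle states and mispredicts every time.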

Accuracy of Different Schemes

[Figure: bar chart of frequency of mispredictions (vertical axis 0% to 18%) for nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, li. Three series: 4096 Entries 2-bit BHT; Unlimited Entries 2-bit BHT; 1024 Entries (2,2) BHT. Surviving bar labels, not assignable to individual bars: 0%, 1%, 5%, 6%, 6%, 11%, 4%, 6%, 5%, 1%]

HW support for More ILP

° Avoid branch prediction by turning branches into conditionally executed instructions:

if (x) then A = B op C else NOP

• If false, then neither store result nor cause exception

• Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr.

• EPIC: 64 1-bit condition fields, selected for conditional execution

° Drawbacks to conditional instructions

• Still takes a clock even if “annulled”

• Stall if condition evaluated late

• Complex conditions reduce effectiveness; condition becomes known late in pipeline

Now what about exceptions???

° Out-of-order commit really messes up our chance to get precise exceptions!

• When committing results out-of-order, register file contains results from later instructions while earlier ones have not completed yet.

• What if need to cause exception on one of those early instructions??

° Need to be able to “rollback” register file to consistent state

• Remember that “precise” means that there is some PC such that: all instructions before have committed results, and none after have committed results.

° Big problem for branch prediction as well: What if prediction wrong??

Relationship between precise interrupts and speculation:

° Speculation is a form of guessing.

° Important for branch prediction:

• Need to “take our best shot” at predicting branch direction.

• If we issue multiple instructions per cycle, lose lots of potential instructions otherwise:

- Consider 4 instructions per cycle

- If take single cycle to decide on branch, waste from 4 - 7 instruction slots!

° If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly:

• This is exactly same as precise exceptions!

° Technique for both precise interrupts/exceptions and speculation: in-order completion or commit
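One way to read the "4 - 7 instruction slots" figure is sketched below. It assumes (this is an interpretation, not stated on the slide) that the slots after the branch in its own issue group are lost, plus one full group for each cycle spent deciding the branch. The function name and parameters are hypothetical.

```python
def wasted_slots(issue_width, branch_position, resolve_cycles=1):
    """Issue slots lost while waiting for a branch to resolve.

    branch_position is 1-based within the branch's issue group.
    """
    rest_of_group = issue_width - branch_position
    return rest_of_group + issue_width * resolve_cycles


# 4-issue machine, branch in slot 1, 2, 3 or 4 of its group.
waste = [wasted_slots(4, position) for position in range(1, 5)]
print(waste)
```

Under these assumptions the loss ranges from 4 slots (branch is last in its group) to 7 slots (branch is first).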

HW support for precise interrupts

° Need HW buffer for results of uncommitted instructions: reorder buffer

• 3 fields: instr, destination, value

• Reorder buffer can be operand source => more registers like RS

• Use reorder buffer number instead of reservation station when execution completes

• Supplies operands between execution complete & commit

• Once operand commits, result is put into register

• Instructions commit

• As a result, it’s easy to undo speculated instructions on mispredicted branches or on exceptions

[Figure: FP Op Queue, Reorder Buffer, FP Regs, and two sets of Res Stations, each feeding an FP Adder]

Four Steps of Speculative Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue

• If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called “dispatch”)

2. Execution: operate on operands (EX)

• When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called “issue”)

3. Write result: finish execution (WB)

• Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.

4. Commit: update register with reorder result

• When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer.

• Mispredicted branch or interrupt flushes reorder buffer (sometimes called “graduation”)
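The commit step can be sketched as a queue that retires only from its head. This is a simplified model under stated assumptions: the class and field names are hypothetical, reservation stations and the CDB are not modeled, and a mispredicted branch is represented by a flag set when its result is written.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # oldest instruction at the left (head)
        self.regs = {}           # committed (architectural) register state

    def issue(self, instr, dest):
        entry = {"instr": instr, "dest": dest, "value": None,
                 "done": False, "mispredicted": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value=None, mispredicted=False):
        entry["value"] = value
        entry["done"] = True
        entry["mispredicted"] = mispredicted

    def commit(self):
        """Retire finished instructions from the head, in program order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            head = self.entries.popleft()
            committed.append(head["instr"])
            if head["mispredicted"]:
                self.entries.clear()   # squash all younger instructions
                break
            if head["dest"] is not None:
                self.regs[head["dest"]] = head["value"]
        return committed


rob = ReorderBuffer()
ld = rob.issue("LD", "F0")
bne = rob.issue("BNE", None)
add = rob.issue("ADDD", "F10")     # issued past the unresolved branch

rob.write_result(add, 3.0)         # youngest finishes first (out of order)
first = rob.commit()               # nothing retires: head (LD) not done
rob.write_result(ld, 1.5)
second = rob.commit()              # LD retires; BNE blocks ADDD
rob.write_result(bne, mispredicted=True)
third = rob.commit()               # BNE retires and flushes ADDD
```

Because ADDD's result lived only in the buffer, the flush leaves the register state exactly as if ADDD had never executed, which is the same property needed for precise exceptions.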

Tomasulo With Reorder buffer:

[Figure: FP Op Queue, reorder buffer, registers, reservation stations and load buffer. Recoverable contents:

Reorder buffer, oldest first: LD F0,10(R2) (dest F0, done N); ADDD F10,F4,F0 (dest F10, done N); DIVD F2,F10,F6 (dest F2, done N); BNE F2,<…> (no dest, done N); LD F4,0(R3) (dest F4, value M[10], done Y); ADDD F0,F4,F6 (dest F0); ST 0(R3),F0. For the last two entries the labels <val2>, Y and Ex survive, but their row assignment is unclear.

Reservation stations: 2 ADDD R(F4),ROB1; 3 DIVD ROB2,R(F6)

Load buffer: 1 10+R2]

Dynamic Scheduling in PowerPC 604 and Pentium Pro

° Both In-order Issue, Out-of-order execution, In-order Commit

PPro central reservation station for any functional units with one bus shared by a branch and an integer unit

Dynamic Scheduling in PowerPC 604 and Pentium Pro

Parameter                            PPC           PPro
Max. instructions issued/clock       4             3
Max. instr. complete exec./clock     6             5
Max. instr. committed/clock          6             3
Instructions in reorder buffer       16            40
Number of rename buffers             12 Int/8 FP   40
Number of reservation stations       12            20
No. integer functional units (FUs)   2             2
No. floating point FUs               1             1
No. branch FUs                       1             1
No. complex integer FUs              1             0
No. memory FUs                       1             1 load + 1 store

Dynamic Scheduling in Pentium Pro

° PPro doesn’t pipeline 80x86 instructions

° PPro decode unit translates the Intel instructions into 72-bit micro-operations (~ MIPS)

° Sends micro-operations to reorder buffer & reservation stations

° Takes 1 clock cycle to determine length of 80x86 instructions + 2 more to create the micro-operations

° Most instructions translate to 1 to 4 micro-operations

° Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations

Limits to Multi-Issue Machines

° Inherent limitations of ILP

• 1 branch in 5: How to keep a 5-way superscalar busy?

• Latencies of units: many operations must be scheduled

• Need about Pipeline Depth x No. Functional Units of independent instructions to keep fully busy

• Increase ports to Register File

- VLIW example needs 7 read and 3 write for Int. Reg. & 5 read and 3 write for FP reg

• Increase ports to memory

• Current state of the art: Many hardware structures (such as issue/rename logic) have delay proportional to square of number of instructions issued/cycle

Limits to ILP

° Conflicting studies of amount

• Benchmarks (vectorized Fortran FP vs. integer C programs)

• Hardware sophistication

• Compiler sophistication

° How much ILP is available using existing mechanisms with increasing HW budgets?

° Do we need to invent new HW/SW mechanisms to keep on processor performance curve?

• Intel MMX

• Motorola AltiVec

• Supersparc Multimedia ops, etc.

Limits to ILP

Initial HW Model here; MIPS compilers.

Assumptions for ideal/perfect machine to start:

1. Register renaming: infinite virtual registers and all WAW & WAR hazards are avoided

2. Branch prediction: perfect; no mispredictions

3. Jump prediction: all jumps perfectly predicted => machine with perfect speculation & an unbounded buffer of instructions available

4. Memory-address alias analysis: addresses are known & a store can be moved before a load provided addresses not equal

1 cycle latency for all instructions; unlimited number of instructions issued per clock cycle
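Under these ideal assumptions only true (RAW) dependences limit parallelism, so the achievable IPC of a trace is its instruction count divided by the length of its longest dependence chain. A minimal sketch, assuming a made-up six-instruction trace with register dependences only (memory dependences are ignored; the function and register names are hypothetical):

```python
def ideal_ipc(instructions):
    """instructions: list of (dest, sources) in program order.

    Ideal model: 1-cycle latency, unlimited issue, perfect renaming,
    so an instruction executes one cycle after its latest producer.
    """
    ready = {}        # register -> cycle in which its current value is produced
    last_cycle = 0
    for dest, srcs in instructions:
        cycle = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        if dest is not None:
            ready[dest] = cycle   # overwriting models renaming: no WAW/WAR stalls
        last_cycle = max(last_cycle, cycle)
    return last_cycle, len(instructions) / last_cycle


trace = [
    ("r1", ()),             # cycle 1
    ("r2", ()),             # cycle 1
    ("r3", ("r1", "r2")),   # cycle 2
    ("r4", ("r3",)),        # cycle 3
    ("r5", ()),             # cycle 1
    ("r6", ("r5",)),        # cycle 2
]
cycles, ipc = ideal_ipc(trace)
print(cycles, ipc)
```

Here six instructions finish in three cycles, an IPC of 2.0. Each restriction added on the following slides (finite window, real predictors, fewer rename registers) can only lower the IPC measured this way.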

Upper Limit to ILP: Ideal Machine

Program    IPC
gcc        54.8
espresso   62.6
li         17.9
fpppp      75.2
doducd     118.7
tomcatv    150.1

Integer: 18 - 60

FP: 75 - 150

More Realistic HW: Branch Impact

Change from infinite window to examine to 2000 and maximum issue of 64 instructions per clock cycle

IPC by program and branch predictor:

Program    Perfect  Selective predictor  Standard 2-bit  Static     None
                    (Pick Cor. or BHT)   (BHT (512))     (Profile)  (No prediction)
gcc        35       9                    6               6          2
espresso   41       12                   7               6          2
li         16       10                   6               7          2
fpppp      61       48                   46              45         29
doducd     58       15                   13              14         4
tomcatv    60       46                   45              45         19

FP: 15 - 45

Integer: 6 - 12

More Realistic HW: Register Impact (rename regs)

Change 2000 instr window, 64 instr issue, 8K 2 level Prediction

IPC by program and number of rename registers:

Program    Infinite  256  128  64  32  None
gcc        11        10   10   9   5   4
espresso   15        15   13   10  5   4
li         12        12   12   11  6   5
fpppp      59        49   35   20  5   4
doducd     29        16   15   11  5   5
tomcatv    54        45   44   28  7   5

Integer: 5 - 15

FP: 11 - 45

More Realistic HW: Alias Impact

Change 2000 instr window, 64 instr issue, 8K 2 level Prediction, 256 renaming registers

IPC by program and alias analysis (series: Perfect; Global/Stack perf, heap conflicts; Inspec. Assem.; None):

Program    Perfect  Global/Stack perf
gcc        10       7
espresso   15       7
li         12       9
fpppp      49       49
doducd     16       16
tomcatv    45       45

[Bar labels for the Inspection and None series, all between 3 and 6, cannot be assigned to individual programs]

FP: 4 - 45 (Fortran, no heap)

Integer: 4 - 9

Realistic HW for ‘9X: Window Impact

Perfect disambiguation (HW), 1K Selective Prediction, 16 entry return, 64 registers, issue as many as window

IPC by program and window size:

Program    Infinite  256  128  64  32  16  8  4
gcc        10        10   10   9   8   6   4  3
espresso   15        15   13   10  8   6   4  2
li         12        12   11   11  9   6   4  3
fpppp      52        47   35   22  14  8   5  3
doducd     17        16   15   12  9   7   4  3
tomcatv    56        45   34   22  14  9   6  3

Integer: 6 - 12

FP: 8 - 45

Brainiac vs. Speed Demon (1993)

° 8-scalar IBM Power-2 @ 71.5 MHz (5 stage pipe) vs. 2-scalar Alpha @ 200 MHz (7 stage pipe)

[Figure: bar chart by benchmark, vertical axis 0 to 900. Benchmarks: espresso, li, eqntott, compress, sc, gcc, spice, doduc, mdljdp2, wave5, tomcatv, ora, alvinn, ear, mdljsp2, swm256, su2cor, hydro2d, nasa, fpppp. Bar values not recoverable]

Summary #1/2

° Reservation stations: renaming to larger set of registers + buffering source operands

• Prevents registers as bottleneck

• Avoids WAR, WAW hazards of Scoreboard

• Allows loop unrolling in HW

° Not limited to basic blocks (integer unit gets ahead, beyond branches)

° Helps cache misses as well

° Lasting Contributions

• Dynamic scheduling

• Register renaming

• Load/store disambiguation

° 360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264

Summary #2/2

° Dynamic hardware schemes can unroll loops dynamically in hardware

° Branch prediction very important to good performance

° Precise exceptions/Speculation: Out-of-order execution, In-order commit (reorder buffer)

° Superscalar and VLIW: CPI < 1 (IPC > 1)

• Dynamic issue vs. Static issue

• More instructions issue at same time => larger hazard penalty

• Limitation is often number of instructions that you can successfully fetch and decode per cycle ⇒ “Flynn barrier”

° SW Pipelining

• Symbolic Loop Unrolling to get most from pipeline with little code expansion, little overhead

