Module: Speculative Execution © Krishna V. Palem, Weng Fai Wong, and Sudhakar Yalamanchili, Georgia…

Module: Speculative Execution

© Krishna V. Palem, Weng Fai Wong, and Sudhakar Yalamanchili, Georgia Institute of Technology (slides contributed by Prof. Weng Fai Wong were prepared while visiting, and employed by, Georgia Tech)

2ECE 4100/6100 (2)

Reading for This Module

Speculative Execution – Section 3.7

The Reorder Buffer and Register Renaming – Section 3.7

Multithreading – Section 6.9

Additional Reading – Section 3.10, Section 4.5 (pp. 345-350)

Hyperthreading – ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.pdf

P4 Microarchitecture – http://www.intel.com/technology/itj/q12001/articles/art_2.htm

Speculation

Speculative execution is the execution of instructions before it is known if it is safe to do so

Rely on branch prediction to get the branch direction right in most cases


Speculation vs. Prediction

Prediction is targeted at instruction fetch
– Prediction is de-coupled from the decision to execute fetched instructions
– Prediction helps boost the issue rate

Speculation refers to the execution of predicted instructions

[Figure: pipeline IF → ID → {EX INT, EX FP, EX BR} → MEM → WB]

Keep the instruction pipeline full via prediction

Keep the out-of-order execution core full via speculation

Maintain correctness of out-of-order execution

Speculation

Hardware-based speculation, as an extension of dynamic scheduling, is composed of
– Branch prediction to select instructions to be speculatively executed
– Dynamic scheduling: what we have seen so far
– Execution
– Commitment: update machine state
– Exception handling

Challenges
– Handling multiple execution completions per cycle
– Enforcing dependencies to ensure correctness
– Handling exceptions

The Reorder Buffer

Principle

Basic block sizes are not very large
– Prediction can increase the issue rate but not the completion rate
– Boosting issue rate by itself is insufficient

The completion rate has to be increased to keep up with the issue rate
– Need speculative execution

Key idea: separate instruction execution from instruction commitment
– Compute on a need-to-know basis until speculation outcome is determined

[Figure: processor datapath: I-Fetch → Execution Core → Retire]

Issues

What is commitment?
– Updating the register file!
– Permanent update to the machine state

What should be the criteria?
– Commitment is performed in program order

How to enforce the criteria?
– Reorder the instructions that complete out-of-order

Reorder Buffer

The Reorder Buffer

Initially proposed to support precise interrupts

Handles output and anti-dependences
– Another form of register renaming

Does not take care of flow dependences

A FIFO circular queue

[Figure: The Reorder Buffer]

Three Simple Steps

Every instruction gets a reorder entry allocated in-order as it is issued - the entry is marked “invalid”

When an instruction completes, it writes its result to the corresponding entry in the reorder buffer – the entry is now valid

When the entry at the head of the reorder buffer is “valid” it is committed to the register file
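The three steps can be sketched as a toy model. This is only an illustration under stated assumptions: the class name `ReorderBuffer`, the running-counter tag scheme, and the dictionary register file are hypothetical and not taken from the slides.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: a FIFO of entries, each with a destination, a value, and a valid bit."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()      # head of the queue is entries[0]
        self.next_tag = 0           # hypothetical tag scheme: a running counter
        self.regfile = {}           # architectural register file

    def issue(self, dest):
        """Step 1: allocate an entry in program order, marked invalid."""
        if len(self.entries) == self.size:
            return None             # ROB full: issue must stall
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append({"tag": tag, "dest": dest, "value": None, "valid": False})
        return tag

    def complete(self, tag, value):
        """Step 2: an instruction finishes (possibly out of order) and fills its entry."""
        for e in self.entries:
            if e["tag"] == tag:
                e["value"], e["valid"] = value, True
                return
        raise KeyError(tag)

    def commit(self):
        """Step 3: retire the head entry only if it is valid; otherwise do nothing."""
        if self.entries and self.entries[0]["valid"]:
            e = self.entries.popleft()
            self.regfile[e["dest"]] = e["value"]
            return e["tag"]
        return None


rob = ReorderBuffer(size=4)
t0 = rob.issue("F0")
t1 = rob.issue("F2")
rob.complete(t1, 2.5)                  # the younger instruction finishes first
blocked = rob.commit()                 # head (t0) is still invalid, so nothing commits
rob.complete(t0, 1.0)
order = [rob.commit(), rob.commit()]   # now both retire, in program order
```

Although the second instruction completes first, its result reaches the register file only after the first one has committed.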

[Figure: Tomasulo-style FP unit extended with a ROB. Labels: From Instruction Unit, FP Ops “Stack”, Decoder, FP Registers, ROB, Load Buffer (From Memory), Store Buffer (To Memory), Reservation Stations, FP Add (2 stage), FP Mul/Div (6 stage), Operand Busses, Operation Bus, Common Data Bus]

Structure/Operation of the ROB

Issue/dispatch must now issue a ROB entry
– ROB tag is used in renaming

Execute in a data-driven manner
– Write results on the CDB using the ROB tag

Commit instructions in-order
– Commit valid instructions at the head of the ROB
– Incorrect branches cause the ROB to be flushed and execution restarted

[Figure: ROB entry fields: I-Type (branch, memory, register), Dest (register, memory address), Value, Ready (status), Speculation info (speculative? identify which block?)]

Why do you need this information?

The Result

Results are written into the register file in-order

Destination registers are effectively renamed to reorder buffer entries
– Each instruction writes to a new “destination register”

Using the Speculative Bits

Speculative instructions are marked in the reorder buffer by a special “speculative” bit
– Should a branch become confirmed, it will turn the “speculative” bits of the corresponding speculative instructions to “confirm”
– If it is not confirmed, status is set to “not confirmed”

When an instruction reaches the head of the reorder buffer
– If it is marked “speculative”, commitment is stalled until its status is determined, i.e., it is no longer speculative
– If it is marked “confirm”, commit the instruction
– If it is marked as “not confirmed”, its result is discarded

These are known as speculative writebacks
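The head-of-buffer rules can be sketched as follows. This is a minimal model under stated assumptions: the ROB is a plain list of dictionaries, the `branch` field (which branch an entry depends on) and the function names are hypothetical, and a real design would typically flush rather than drain mispredicted entries one by one.

```python
SPECULATIVE, CONFIRM, NOT_CONFIRMED = "speculative", "confirm", "not confirmed"


def resolve_branch(rob, branch_id, predicted_correctly):
    """Update the status of every entry that was issued under the given branch."""
    for e in rob:
        if e["branch"] == branch_id and e["status"] == SPECULATIVE:
            e["status"] = CONFIRM if predicted_correctly else NOT_CONFIRMED


def commit_head(rob, regfile):
    """Process the head entry; return 'stall', 'commit', 'discard', or 'empty'."""
    if not rob:
        return "empty"
    head = rob[0]
    if not head["valid"] or head["status"] == SPECULATIVE:
        return "stall"                 # outcome of the branch is not known yet
    rob.pop(0)
    if head["status"] == CONFIRM:
        regfile[head["dest"]] = head["value"]
        return "commit"
    return "discard"                   # "not confirmed": the result is thrown away


rob = [
    {"dest": "R1", "value": 10, "valid": True, "status": SPECULATIVE, "branch": "B1"},
    {"dest": "R2", "value": 20, "valid": True, "status": SPECULATIVE, "branch": "B2"},
]
regfile = {}
first = commit_head(rob, regfile)      # still speculative: commitment stalls
resolve_branch(rob, "B1", True)
resolve_branch(rob, "B2", False)
second = commit_head(rob, regfile)     # confirmed: R1 is written
third = commit_head(rob, regfile)      # not confirmed: R2 is never written
```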


Speculative Memory References

Loads/stores do exhibit flow, output and anti-dependences

Speculative stores are a problem
– Use a store buffer and manage it like a re-order buffer
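A store buffer managed like a reorder buffer might look like the sketch below. The class and method names are hypothetical, memory is modeled as a dictionary, and forwarding a buffered store value to a later load is an added assumption that the slide does not state.

```python
class StoreBuffer:
    """Toy store buffer: stores wait here in program order until they commit."""

    def __init__(self, memory):
        self.memory = memory
        self.pending = []                  # oldest store first

    def store(self, addr, value):
        # A speculative store is buffered; memory is not touched yet.
        self.pending.append({"addr": addr, "value": value})

    def load(self, addr):
        # Assumption: loads see the youngest buffered store to the same address.
        for s in reversed(self.pending):
            if s["addr"] == addr:
                return s["value"]
        return self.memory.get(addr, 0)

    def commit_oldest(self):
        # Only at commit does the store update memory, in program order.
        s = self.pending.pop(0)
        self.memory[s["addr"]] = s["value"]

    def squash(self):
        # Misspeculation: drop all buffered stores; memory was never modified.
        self.pending.clear()


mem = {0x10: 1}
sb = StoreBuffer(mem)
sb.store(0x10, 99)
seen = sb.load(0x10)           # the buffered value is visible to later loads
in_memory = mem[0x10]          # memory still holds the old value
sb.squash()
after_squash = sb.load(0x10)   # the speculative store left no trace
sb.store(0x10, 7)
sb.commit_oldest()
```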


Simple solution

Only one load/store unit

The reservation stations for this single load/store unit form a queue

Process requests strictly in queue (in-order) order

Inefficient

Separate Load/Store Units


An Example - Alpha 21264

[Figure: Alpha 21264 block diagram. Labels: Instruction Cache; Instruction processing and dispatch; Integer issue, FP issue; Integer execution unit, FP execution unit; Load queue, Store queue (both 32-entry reorder buffers); Data Cache; Memory interface]

Parallel Retirement

Retiring only one instruction per cycle is also a bottleneck

Can retire instructions in parallel

Advantage: free up more reorder buffer entries quickly
– Does not affect instruction execution directly as instructions can read from the reorder buffer directly

[Figure: Parallel Retirement. Labels: Instruction Results, Reorder Buffer, Retirement Logic, Register File, Instruction Operands (from the reorder buffer), Instruction Operands (from the register file)]

Parallel Retirement

Although retiring in parallel, retirement logic must guarantee in-order retirement
– Must check valid bits in sequence
– Must check destination register number

Requires more ports to the register file
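The in-order guarantee under parallel retirement can be sketched as below. The function name, the `width` parameter, and the list-of-dictionaries ROB are hypothetical; hardware would perform these checks combinationally in one cycle rather than in a loop.

```python
def retire_parallel(rob, regfile, width):
    """Retire up to `width` entries in one cycle, strictly from the head and in order."""
    retired = []
    while rob and len(retired) < width:
        head = rob[0]
        if not head["valid"]:
            break                               # an invalid entry blocks everything behind it
        rob.pop(0)
        regfile[head["dest"]] = head["value"]   # if two entries share a destination, the later one wins
        retired.append(head["tag"])
    return retired


rob = [
    {"tag": 0, "dest": "R1", "value": 5, "valid": True},
    {"tag": 1, "dest": "R1", "value": 7, "valid": True},
    {"tag": 2, "dest": "R2", "value": None, "valid": False},
    {"tag": 3, "dest": "R3", "value": 9, "valid": True},
]
regfile = {}
cycle1 = retire_parallel(rob, regfile, width=4)
```

Entry 3 is valid but cannot retire in this cycle because entry 2, which precedes it, is not.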


Forwarding from the ROB

Results from the ROB can be forwarded directly to executing instructions
– Can read valid results directly from the reorder buffer

Suppose two reorder buffer entries write to the same destination register, say R0

An instruction reading from R0 needs extra hardware to decide which entry to read from
– The later of the two instructions writing to R0 has the higher priority
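The priority rule can be sketched as a search from the youngest entry toward the oldest. The function name and data layout are hypothetical, and returning a tag to wait on when the latest writer has not completed is a Tomasulo-style assumption rather than something the slide specifies.

```python
def read_operand(reg, rob, regfile):
    """Return ('value', v) if the operand is available, else ('wait', tag)."""
    # Scan from the tail (youngest) toward the head (oldest): the latest writer wins.
    for e in reversed(rob):
        if e["dest"] == reg:
            if e["valid"]:
                return ("value", e["value"])
            return ("wait", e["tag"])
    # No in-flight writer: the register file holds the current value.
    return ("value", regfile[reg])


regfile = {"R0": 1}
rob = [
    {"tag": 0, "dest": "R0", "value": 11, "valid": True},
    {"tag": 1, "dest": "R0", "value": 22, "valid": True},
]
a = read_operand("R0", rob, regfile)        # the later writer (tag 1) has priority
rob.append({"tag": 2, "dest": "R0", "value": None, "valid": False})
b = read_operand("R0", rob, regfile)        # newest writer is not done yet
c = read_operand("R0", [], regfile)         # empty ROB: read the register file
```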

Register Renaming

Dependencies and Register Pressure

Registers are re-used over the life of a program

Compilers provide a static scheme for re-using registers

Speculative execution creates a greater demand for registers to eliminate name dependencies

Register renaming increases issue rate


Renaming Used at Different Points

Extend the resources available for renaming
– More physical registers are available than are visible in the ISA

Renaming performed at/during ID or prior to issue
– Number of registers determines how many instructions can exist between issue and commit

[Figure: pipeline IF → ID → {EX INT, EX FP, EX BR} → MEM → MEM → WB, marking where values are available for forwarding and where values are available for commitment]

Principle

Instructions specify logical or architectural registers

At instruction issue a logical register is re-mapped or re-named to one of a larger pool of physical registers

[Figure: Register Re-Map Table (Logical Register File) with entries R0, R1, ..., R7; each entry contains the name of a physical register in the Physical Register File (P0, P1, ..., P11)]

Example: IBM RS 6000

Add a few extra registers to be re-used over the life of the program

[Figure: RS 6000 scheme: registers R0, R1, R2, ..., R7 plus extra registers; registers are tracked as either free registers or registers in use]

How do we keep track of this mapping information?
– Index a table with register number: the mapping table
– Keep track of free registers available for renaming
– Keep track of registers currently in use

When Is It Safe to Re-Use a Register?

If no active instruction is using that register, it can be re-used

One approach is to check the registers being used by all active instructions
– Expensive

Another approach is to perform checks at instruction commitment

Case Study: MIPS R10000

MIPS R10000

There are 32 logical registers
– 5-bit logical register specifiers

There are 64 physical registers
– 6-bit physical register identifiers

Main Data Structures

The Register Map Table

The Free Register List

The Active List

The Busy Bit Table

Duplicated for General Purpose and Floating Point Registers


The Register Map Table

A multi-ported Static Random Access Memory (SRAM)

Takes 5-bit addresses

Delivers 6-bit results

For each instruction that may be issued in one cycle, requires three read ports
– e.g., ADD.D F0, F2, F4
– Need at least one write port per instruction that can be retired in a cycle (recall parallel retirement)

Active List

A FIFO queue - similar in function to the reorder buffer

Each instruction has a corresponding active list entry

Processing the head of the active list is called instruction retirement or graduation (what we referred to as commitment)

Free Register List

A FIFO queue of physical registers that are available for reuse


Busy Bit Table

A table to indicate the availability of source operands

Busy bit in the instruction queue entry must be updated constantly
– Each time a physical register is written, all corresponding busy bits in the instruction queues must be updated
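The update rule can be sketched as a table plus a broadcast to waiting queue entries. The class `BusyBits` and its method names are hypothetical, and for brevity this sketch never removes instructions from the queue once they are ready.

```python
class BusyBits:
    """Toy Busy Bit Table with one instruction queue that snoops register writes."""

    def __init__(self, num_physical):
        self.busy = [False] * num_physical   # one bit per physical register
        self.queue = []                      # instruction queue entries

    def enqueue(self, op, psrcs, pdst):
        # Copy the current busy bits of the sources into the new queue entry.
        entry = {"op": op, "psrcs": psrcs, "pdst": pdst,
                 "src_busy": [self.busy[p] for p in psrcs]}
        self.busy[pdst] = True               # the destination has no value yet
        self.queue.append(entry)
        return entry

    def write_register(self, preg):
        """A result is written to `preg`: clear the table bit and every matching queue bit."""
        self.busy[preg] = False
        for entry in self.queue:
            for i, p in enumerate(entry["psrcs"]):
                if p == preg:
                    entry["src_busy"][i] = False

    def ready(self):
        # An instruction may proceed once none of its sources is busy.
        return [e["op"] for e in self.queue if not any(e["src_busy"])]


bb = BusyBits(64)
bb.enqueue("mul", [1, 2], 40)    # both sources already available
bb.enqueue("add", [40, 3], 41)   # must wait for P40
before = bb.ready()
bb.write_register(40)            # mul writes its result
after = bb.ready()
```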


Functional Unit Instruction Queue

Equivalent of reservation stations for each functional unit

Consists of
– opcode
– ready bit of physical register operands
– physical source register identifiers
– physical destination register identifier
– a TAG field for locating the corresponding active list entry

MIPS R10000 RMT

[Figure: MIPS R10000 renaming datapath. Labels: Instruction (op, src1, src2, dst); Register Map Table; Free Register List (New Pdst); Busy Bit Table; Old Pdst; FU Instruction Queue entry (Op, Ready Field, Psrc1, Psrc2, Pdst, Tag); entry holding Old Pdst, Dst, Done Bit]

Upon Instruction Issue...

Each instruction gets the following allocated

an entry in the corresponding FU instruction queue (reservation station)

an entry in the active list (reorder buffer)

a new physical destination register from the free register list (register renaming)


Next...

The two 5 bit logical source register specifiers are used to access the RMT to obtain the corresponding physical registers

The 5-bit logical destination register specifier is used to access the RMT
– The output (the old physical destination) is written to the corresponding active list entry
– The busy bit for the new physical destination is set
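The issue-time renaming steps can be sketched as follows, assuming 32 logical and 64 physical registers as stated earlier. The initial identity mapping, the function name `rename`, and the dictionary layouts are hypothetical choices made for illustration.

```python
from collections import deque

NUM_LOGICAL, NUM_PHYSICAL = 32, 64

rmt = list(range(NUM_LOGICAL))                        # logical -> physical (identity start is an assumption)
free_list = deque(range(NUM_LOGICAL, NUM_PHYSICAL))   # FIFO of free physical registers
busy = [False] * NUM_PHYSICAL                         # Busy Bit Table
active_list = deque()                                 # in-order list of in-flight instructions


def rename(op, dst, src1, src2):
    """Rename one three-operand instruction given logical register numbers."""
    psrc1, psrc2 = rmt[src1], rmt[src2]   # read source maps before changing the destination map
    old_pdst = rmt[dst]                   # previous mapping, saved in the active list
    new_pdst = free_list.popleft()        # raises IndexError if none is free (hardware would stall)
    rmt[dst] = new_pdst
    busy[new_pdst] = True                 # no value has been written yet
    active_list.append({"op": op, "dst": dst, "old_pdst": old_pdst, "done": False})
    return {"op": op, "psrc1": psrc1, "psrc2": psrc2, "pdst": new_pdst}


i1 = rename("add", dst=1, src1=2, src2=3)   # R1 = R2 + R3
i2 = rename("sub", dst=4, src1=1, src2=5)   # R4 = R1 - R5 (flow dependence on i1)
i3 = rename("mul", dst=1, src1=6, src2=7)   # R1 = R6 * R7 (output dependence on i1, anti-dependence on i2)
```

After renaming, i2 reads the physical register that i1 writes, while i3 writes a different physical register, so only the flow dependence remains.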


Instruction Execution

When both physical source registers are ready, proceed with operand read
– Takes care of flow dependences

Result is written directly to the physical destination register

Update Busy Bit Table

DONE bit in active list entry is set


Instruction Retirement

When the entry at the head of the active list is marked “DONE”, proceed to retire the instruction
– Old physical register is released to the free register list for reuse

Each allocated physical register is written exactly once
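Retirement can be sketched as popping the head of the active list and recycling the old physical register. The function name, field names, and the register numbers in the example state are hypothetical.

```python
from collections import deque


def retire(active_list, free_list):
    """Graduate the head entry if it is DONE; return the freed physical register or None."""
    if not active_list or not active_list[0]["done"]:
        return None
    entry = active_list.popleft()
    free_list.append(entry["old_pdst"])   # the previous mapping of the destination is now dead
    return entry["old_pdst"]


# Hypothetical state: I0 renamed its destination from P7 to P5, I1 from P9 to P6.
active_list = deque([
    {"name": "I0", "old_pdst": 7, "pdst": 5, "done": False},
    {"name": "I1", "old_pdst": 9, "pdst": 6, "done": True},
])
free_list = deque([12, 13])

stalled = retire(active_list, free_list)   # I0 is not done, so I1 must wait behind it
active_list[0]["done"] = True
freed = [retire(active_list, free_list), retire(active_list, free_list)]
```

Note that the register returned to the free list is the old mapping, not the one the retiring instruction wrote.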


When Is It Safe?

When is it safe to reuse a physical register?

Example: R1 was previously mapped to P7; now a new instruction, I1, will write to R1 and gets assigned P5

It is safe to reuse P7 once I1 has completed execution (written to P5) and committed

[Figure: code sequence
R1 = ...; ... = R1; ... = R1   (R1 remapped to P7)
R1 = ...; ... = R1             (R1 remapped to P5)]

From the behavior of the ROB, we know all prior instructions have committed, i.e., P7 can be freed after this instruction commits

Why?

Because the logical register R1 has been overwritten

Any subsequent read of R1 should go to P5


Handling Flow Dependences

Each time we allocate a new destination register, we update the RMT

Any subsequent read will get the correct map from the RMT

The busy bit system, comprising the Busy Bit Table and the constant updating of the busy fields in the instruction queue entries, ensures data availability checking

Handling Output and Anti-Dependences

Each instruction writes to a newly allocated physical register

Registers are renamed from logical to physical
– Can use more physical registers

Case Study: Intel Pentium III and Pentium IV (NETBURST)

Intel IA32

Due to backward compatibility
– Complex instructions
– Limited number of registers

Each complex instruction is translated into several micro-ops (uops)

Register renaming used to allow for more registers


The Pipeline

[Figure: Basic Pentium 3 misprediction pipeline, stages 1 to 10: fetch, fetch, dec, dec, dec, rename, ROB rd, Rdy/sch, dispatch, exec]

[Figure: Basic Pentium 4 misprediction pipeline, stages 1 to 20, key stages: TC Nxt IP, TC Fetch, drv, all, rename, que, sch, sch, sch, disp, disp, RF, RF, EX, drv]

ROB

The Reorder Buffer (ROB) in the IA32 is implemented by content-addressable memory

Serves as an instruction pool

[Figure: Labels: Bus interface (to system bus); L2 cache; L1 I-cache, L1 D-cache; Fetch/decode unit, Dispatch/execute unit, Retire unit; Instruction pool (ROB); fetch, load, and store paths]

ROB Entries

Each ROB entry has a data and a status field

ROB data field stores the data result of a uop

ROB status field tracks the status of the uop producing the result that is to go into the corresponding data field

Register Renaming in P-III

A Register Alias Table (multi-ported SRAM) keeps track of the latest alias for logical registers

ROB is managed like a reorder buffer

Tracks availability of data

Once retired, data is copied from ROB to the Retirement Register File (RRF)


Pentium 3

[Figure: Pentium 3 register renaming. The Register Alias Table (RAT), with entries for EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP, remembers the most current version of each register; 40-entry ROB with Status and Data fields; RRF]

Register Renaming

RAT may point to a ROB entry or an RRF entry

No physical EAX, EBX etc. exist


Pentium IV

Introduced the NETBURST architecture

Eliminates the copying of ROB data values to the RRF

Consists of two RATs
– Frontend RAT
– Retirement RAT

Pentium IV

The 128-entry Register File (RF) is separated from the ROB, which now consists only of status fields

A unique, in-order sequence number is allocated for each uop that points to the corresponding ROB entry


Pentium IV

[Figure: NetBurst register renaming. Front End RAT and Retirement RAT, each with entries for EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP; RF (Data, 128 physical registers); ROB (Status). Shown with the 20-stage pipeline: TC Nxt IP, TC Fetch, drv, all, rename, que, sch, sch, sch, disp, disp, RF, RF, EX, drv]

Pentium IV Execution Core

Up to 126 instructions in flight and up to 48 loads and 24 stores pending

The front end feeds the execution core
– Allocator: allocate ROB entry, rename registers, allocate μop queue entry, allocate load/store buffer

Front end μop supply and backend μop retirement bandwidth is 3 μops per cycle

Dispatch bandwidth into the execution core is 6 μops per cycle

Multi-clock bypass network for double speed integer ALUs

Pentium IV Execution Core

[Figure: Pentium IV dispatch ports. The compute μop queue and the memory μop queue feed out-of-order schedulers, which feed four dispatch ports:
– Exec Port 0: ALU (2X): Add/Sub, Logic, Store Data, Branches; FP Move: FP/SSE Move, FP/SSE Store
– Exec Port 1: ALU (2X): Add/Sub; Integer: Shift/Rotate; FP: FP/SSE Add, Mul, Div
– Load Port: Load
– Store Port: Store]

Some Observations

Applications have a high level of thread parallelism

Within a thread, high latency operations have to be tolerated, e.g., cache misses

5X increase in performance for a 15X increase in effective (scaled) chip area and 18X increase in power!

Transistors have been invested to improve the performance of a single thread
– Sub-linear relationship between investment (chip area) and return (execution speed): utilization is the key!

What Next?

Exploit thread level parallelism
– Use multiple processors and keep them busy
– Time sharing
  – Switch-on-event time sharing
  – Need to flush the deep pipelines
– Fine grained multi-threading to keep the pipelines full

Simultaneous multithreading to maximize resource utilization with minimal overhead

Forms of Multithreading

[Figure: issue slots over time, with stalls marked, for four designs: Superscalar, Coarse Grain Multithreading, Fine Grained Multithreading, Simultaneous Multithreading]

Increasing Utilization in the NetBurst Microarchitecture

Observations for dynamically scheduled processors
– Have large register sets with support for renaming
– Tag support enables tracking of instructions across threads
– Schedulers and execution units track dependencies

Idea: provide support for sharing resources across threads with little additional hardware support: Hyper-threading

Abstraction: logical processors
– This is what the programmer and operating system see

Hyper-threading in the Xeon Processor Family

Goals:
– Minimize die area cost
– Independent forward progress for a logical processor
– Do not penalize single thread performance: minimize static allocation of resources

Implementation of Hyper-threading adds less than 5% to the chip area

Principle: share major logic components by adding or partitioning buffering logic

[Figure: 2 CPUs without Hyper-threading (one Arch State per set of processor execution resources) versus 2 CPUs with Hyper-threading (two Arch States per set of processor execution resources)]

The Xeon Pipeline

[Figure: the Xeon pipeline. Stage labels: Fetch Logic, Trace cache access (TC), μop queue, Rename (Reg rename, allocator), Queue, Schedule, Register Read, Execute (EX INT, EX FP, EX BR), L1 cache, WB, Retire (ROB)]

Annotations on the figure:
– Round robin access, dynamic sharing
– Shared ucode ROM
– Independent code pointers
– Duplicate ITLBs and PCs
– Independent I-buffers for decode
– RAS duplicated and some sharing of branch prediction logic
– Fairness enforced by limits on buffer sharing
– Separate RATs
– Schedulers oblivious to logical processors
– Execution unit oblivious to logical processors
– Forwarding feasible due to shared register file
– Shared DTLB with logical processor tags

Performance

65% performance increase for high end server applications for 4-way server platform

~20%-30% performance improvement for categories such as transactions, web server, and server side Java environment

Operating system can optimize scheduling of threads across logical/physical processor combinations


Power 5


Power 5: Key Features

Shared I-cache, fetching 8 instructions/thread/cycle

Shared BHT

5 instr/thread decoded and grouped

Group dispatch and commitment
– Instructions tracked as a group via GCT

Register renaming dynamically shares registers between threads, as well as LRQ and SRQ

Issue is independent of group membership

I and D caches fully shared via increased associativity

Resource balancing logic to prevent starvation

Recall…

[Figure: division of work between compiler and hardware. Tasks: Front-End & Optimizer, Determine Dependences, Determine Independences, Bind Resources, Execute. Architecture classes shown: Sequential (superscalar), Dependence Architecture (dataflow), Independence Architecture (Horizon), Independence Architecture (VLIW)]

Review of the Superscalar Datapath

[Figure: in-order fetch and issue logic → out-of-order execution core → in-order completion logic]

Instruction Issue
– Renaming
– Allocate reservation stations
– Allocate re-order buffer entry
– Check for structural hazards

Instruction Execution
– Data driven execution: all dependencies have been resolved
– Issue to functional unit
– De-allocate reservation stations
– Forwarding
– Check load/store dependencies

Instruction Completion
– Enable waiting instructions
– Retire from re-order buffer
– Forward from re-order buffer

Concluding Remarks

Degree of speculation
– Speculate the bad along with the good, e.g., cache misses

Speculating through multiple branches
– Hide long functional unit delays
– May need to speculate through multiple branches in one cycle

Use SATSIM
– Follow the execution and understand the use of the register renaming and use of the re-order buffer
– http://www.ece.gatech.edu/research/pica/SATSim/satsim.html

Check the data sheets for modern processors. What techniques do they use?

Study Guide

Given a code sequence
– What is the state of the ROB at some point in time?

Exception handling
– Using a ROB

Register renaming
– Given a code sequence, what would be the contents of the rename table or rename register file (depending on which technique is used) at some point in time?
– Which physical registers are available?

Forms of speculation: understanding how they work
– Across branches
– Speculating memory accesses
