Page 1: Instruction-Level Parallelism


Instruction-Level Parallelism

Page 2: Instruction-Level Parallelism

Outline

• Instruction-Level Parallelism: Concepts and Challenges
• Overcoming Data Hazards with Dynamic Scheduling
• Reducing Branch Cost with Dynamic Hardware Prediction
• High-Performance Instruction Delivery
• Hardware-Based Speculation
• Studies of the Limitations of ILP

Page 3: Instruction-Level Parallelism


Instruction-Level Parallelism: Concepts and Challenges

Page 4: Instruction-Level Parallelism

Introduction

• Instruction-Level Parallelism (ILP): the potential overlap in execution among instructions
  – Instructions are executed in parallel
  – A pipeline supports a limited sense of ILP
• This chapter introduces techniques to increase the amount of parallelism exploited among instructions
  – How to reduce the impact of data and control hazards
  – How to increase the ability of the processor to exploit parallelism

Page 5: Instruction-Level Parallelism

Approaches To Exploiting ILP

• Hardware approach: the focus of this chapter
  – Dynamic – decided at run time
  – Dominates the desktop and server markets
  – Examples: Pentium III and IV
• Software approach: the focus of the next chapter
  – Static – decided at compile time
  – Relies on compilers
  – Broader adoption in the embedded market
  – But also includes IA-64 and Intel's Itanium

Page 6: Instruction-Level Parallelism

ILP within a Basic Block

• Basic block
  – Instructions between branch instructions
  – Instructions in a basic block are executed in sequence
  – Real code is a collection of basic blocks connected by branches
• Note the dynamic branch frequency: between 15% and 25%
  – Basic block size is between 6 and 7 instructions
  – These instructions may depend on each other (data dependence)
  – Therefore, probably little parallelism within a single block
• To obtain substantial performance enhancement: exploit ILP across multiple basic blocks
  – The easiest target is the loop
  – Exploit parallelism among iterations of a loop (loop-level parallelism)

Page 7: Instruction-Level Parallelism

Loop-Level Parallelism (LLP)

• Consider adding two 1000-element arrays:

    for (i = 1; i <= 1000; i = i + 1)
        x[i] = x[i] + y[i];

  – There is no dependence between data values produced in any iteration j and those needed in iteration j+n, for any j and n
  – Truly independent iterations
  – Independence means no stalls due to data hazards
• Basic idea: convert LLP into ILP
  – Unroll the loop, either statically by the compiler (next chapter) or dynamically by the hardware (this chapter) – see the sketch below
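
A minimal sketch of the statically unrolled form, assuming x, y, and i are declared as in the slide's loop and the compiler unrolls by a factor of 4 (1000 is a multiple of 4, so no cleanup loop is needed):

    /* Unrolled by 4: four independent adds per iteration and one branch
       instead of four, exposing more ILP to the pipeline. */
    for (i = 1; i <= 997; i = i + 4) {
        x[i]     = x[i]     + y[i];
        x[i + 1] = x[i + 1] + y[i + 1];
        x[i + 2] = x[i + 2] + y[i + 2];
        x[i + 3] = x[i + 3] + y[i + 3];
    }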

Page 8: Instruction-Level Parallelism


Data Dependences and Hazards

Page 9: Instruction-Level Parallelism

Introduction

• To exploit instruction-level parallelism we must determine which instructions can be executed in parallel
• If two instructions are independent, then
  – They can execute simultaneously (in parallel) in a pipeline without stalls, assuming no structural hazards
  – Their execution order can be swapped
• Dependent instructions must be executed in order, or only partially overlapped in the pipeline
• Why check dependence?
  – To determine how much parallelism exists, and how that parallelism can be exploited

Page 10: Instruction-Level Parallelism

Types of Dependences

• Data dependence
• Name dependence
• Control dependence

Page 11: Instruction-Level Parallelism

Data Dependence Analysis

• Instruction i is data dependent on instruction j if i uses a result produced by j
  – OR i uses a result produced by k, and k depends on j (a chain)
• A dependence indicates a potential RAW hazard
  – Whether it induces a hazard and a stall depends on the pipeline organization
  – The possibility limits the performance
• Dependences dictate the order in which instructions must be executed
  – They set a bound on how much parallelism can be exploited
• Overcoming a data dependence
  – Maintain the dependence but avoid the hazard – schedule the code (in HW or SW)
  – Eliminate the dependence by transforming the code (by the compiler) – see the sketch below
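
As an illustration of the second option, a hedged C sketch (not from the slides; a and n are hypothetical names): a summation whose RAW chain on a single accumulator is broken by splitting it into two independent accumulators.

    /* Dependent version: every add waits on the previous one (RAW chain on sum). */
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i];

    /* Transformed version: two independent chains that can overlap in the
       pipeline, combined once at the end. Assumes n is even; floating-point
       rounding may differ slightly because the order of additions changes. */
    double s0 = 0.0, s1 = 0.0;
    for (int i = 0; i < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    sum = s0 + s1;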

Page 12: Instruction-Level Parallelism

Data Dependence Example

    Loop: L.D     F0, 0(R1)
          ADD.D   F4, F0, F2
          S.D     F4, 0(R1)
          DADDUI  R1, R1, #-8
          BNE     R1, R2, Loop

If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped. Here F0 flows from L.D to ADD.D, F4 from ADD.D to S.D, and R1 from DADDUI to BNE.

Page 13: Instruction-Level Parallelism

Data Dependence through a Memory Location

• Dependences that flow through memory locations are more difficult to detect
• Two addresses may refer to the same location but look different
  – Example: 100(R4) and 20(R6) may be identical (when R4 + 100 = R6 + 20)
• The effective address of a load or store may change from one execution of the instruction to another

Page 14: Instruction-Level Parallelism

Name Dependence

• Occurs when two instructions use the same register name or memory location without a data dependence
• Let i precede j in program order
  – i is antidependent on j when j writes a register that i reads
    • Indicates a potential WAR hazard
  – i is output dependent on j if they both write to the same register
    • Indicates a potential WAW hazard
• These are not true data dependences – no value is transmitted between the instructions
  – The instructions can execute simultaneously or be reordered if the name used in the instructions is changed so that they do not conflict

Page 15: Instruction-Level Parallelism

Name Dependence Example

    L.D    F0, 0(R1)
    ADD.D  F4, F0, F2
    S.D    F4, 0(R1)
    L.D    F0, -8(R1)
    ADD.D  F4, F0, F2

Output dependence: the two L.D instructions both write F0, and the two ADD.D instructions both write F4.
Anti-dependence: the second L.D writes F0, which the first ADD.D reads; the second ADD.D writes F4, which the S.D reads.

Register renaming removes these conflicts – for example, letting the second L.D and ADD.D use unused registers such as F6 and F8. Renaming can be performed either by the compiler or by the hardware.

Page 16: Instruction-Level Parallelism

Control Dependence

• A control dependence determines the ordering of an instruction i with respect to a branch instruction
  – so that instruction i is executed in correct program order
  – and only when it should be
• One of the simplest examples of a control dependence is the dependence of the statements in the "then" part of an if statement on the branch

Page 17: Instruction-Level Parallelism

Control Dependence

• Since branches are conditional
  – Some instructions will be executed and others will not
  – Instructions before the branch don't matter
  – The only possibility is between a branch and the instructions that follow it
• Two obvious constraints to maintain control dependence
  – Instructions controlled by the branch cannot be moved before the branch (since they would then be uncontrolled)
  – An instruction not controlled by the branch cannot be moved after the branch (since it would then be controlled)

    if p1 { s1 };
    A;
    if p2 { s2 };

  Here s1 is control dependent on p1 and s2 on p2, while A is control dependent on neither.

Page 18: Instruction-Level Parallelism

• Control dependence is preserved by two properties of a simple pipeline
  – Instructions execute in program order
  – The detection of control or branch hazards ensures that an instruction that is control dependent on a branch is not executed until the branch direction is known

Page 19: Instruction-Level Parallelism


Overcoming Data Hazards with Dynamic Scheduling

Page 20: Instruction-Level Parallelism

Introduction

• Approaches used to avoid data hazards
  – Forwarding or bypassing – keep a dependence from resulting in a hazard
  – Stalling – stall the instruction that uses the result, and successive instructions
  – Compiler (pipeline) scheduling – static scheduling
• In-order instruction issue and execution
  – Instructions are issued in program order, and if an instruction is stalled in the pipeline, no later instructions can proceed
  – If there is a dependence between two closely spaced instructions in the pipeline, this leads to a hazard, and a stall will result
• Out-of-order execution – dynamic
  – Send independent instructions to execution units as soon as possible

Page 21: Instruction-Level Parallelism

Dynamic Scheduling vs. Static Scheduling

• Dynamic scheduling – avoid stalling when dependences are present
• Static scheduling – minimize stalls by separating dependent instructions so that they will not lead to hazards

Page 22: Instruction-Level Parallelism

Dynamic Scheduling

• Dynamic scheduling – the HW rearranges the instruction execution to avoid stalling when dependences, which could generate hazards, are present
• Advantages
  – Enables handling some dependences unknown at compile time
  – Simplifies the compiler
  – Code compiled for one machine runs well on another
• Approaches
  – Scoreboarding
  – Tomasulo's approach

Page 23: Instruction-Level Parallelism

Dynamic Scheduling Idea

• Original simple pipeline
  – ID – decode, check all hazards, read operands
  – EX – execute
• Dynamic pipeline
  – Split ID ("issue to execution unit") into two parts
  – Check for structural hazards
  – Wait for data dependences
• New organization (conceptual)
  – Issue – decode, check structural hazards, read ready operands
  – ReadOps – wait until data hazards clear, read operands, begin execution
  – Issue stays in order; ReadOps (the beginning of EX) is out of order

Page 24: Instruction-Level Parallelism

Dynamic Scheduling (Cont.)

• Dynamic scheduling can create WAW and WAR hazards
• Consider:

    DIV.D  F0, F2, F4
    ADD.D  F10, F0, F8
    SUB.D  F12, F8, F14

  – DIV.D has a long latency (20+ pipeline stages)
  – ADD.D has a data dependence on F0; SUB.D does not
• Stalling ADD.D will stall SUB.D too
  – So swap them – the compiler might have done this, but so could the HW
• Key idea – allow instructions behind a stall to proceed
  – SUB.D can proceed even when ADD.D is stalled
• Hazard? Here the swap is safe: SUB.D writes F12, which the earlier instructions neither read nor write. A WAR hazard would arise only if SUB.D wrote a register that ADD.D reads (e.g., F8)

Page 25: Instruction-Level Parallelism

Dynamic Scheduling (Cont.)

• All instructions pass through the issue stage in order
• But instructions can be stalled or bypass each other in the read-operand stage, and thus enter execution out of order

Page 26: Instruction-Level Parallelism


Reducing Branch Penalties with Dynamic Hardware Prediction

Page 27: Instruction-Level Parallelism

Dynamic Control Hazard Avoidance

• Consider the effects of increasing the ILP
  – Control dependences rapidly become the limiting factor
  – They tend not to get optimized by the compiler
    • Higher branch frequencies result
    • Plus, with multiple issue (more than one instruction per cycle), control instructions arrive more often per cycle
  – Control stall penalties go up as machines go faster
    • Amdahl's Law in action – again
• Branch prediction helps if it can be done at reasonable cost
  – Static – by the compiler
  – Dynamic – by the HW

Page 28: Instruction-Level Parallelism

Dynamic Branch Prediction

• The processor attempts to resolve the outcome of a branch early, thus preventing control dependences from causing stalls
• BP_Performance = f(accuracy, cost of misprediction)
• Branch penalties depend on
  – The structure of the pipeline
  – The type of predictor
  – The strategy used for recovering from misprediction

Page 29: Instruction-Level Parallelism

Basic Branch Prediction and Branch-Prediction Buffers

• The simplest dynamic branch-prediction scheme is the branch-prediction buffer, or branch history table (BHT)
  – A branch-prediction buffer is a small memory indexed by the lower portion of the address of the branch instruction
  – The memory contains a bit that says whether the branch was recently taken or not
  – It is used to reduce the branch delay
• The prediction is a hint that is assumed to be correct, and fetching begins in the predicted direction
  – If the hint turns out to be wrong, the prediction bit is inverted and stored back
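
A minimal C sketch of this one-bit scheme (not from the slides; the table size and the use of the low PC bits as the index are illustrative assumptions):

    #include <stdint.h>

    #define BHT_BITS 10                     /* index = low 10 bits of the word address */
    #define BHT_SIZE (1 << BHT_BITS)

    static uint8_t bht[BHT_SIZE];           /* one prediction bit per entry */

    static uint32_t bht_index(uint32_t pc) {
        return (pc >> 2) & (BHT_SIZE - 1);  /* >> 2 skips the byte offset of 4-byte instructions */
    }

    int bht_predict(uint32_t pc) {          /* 1 = predict taken */
        return bht[bht_index(pc)];
    }

    void bht_update(uint32_t pc, int taken) {
        /* "inverted and stored back": a wrong hint is simply overwritten
           with the actual outcome */
        bht[bht_index(pc)] = (uint8_t)(taken != 0);
    }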

Page 30: Instruction-Level Parallelism

BHT Prediction

[Figure: a branch history table indexed by the lower bits of the branch address]

• If two branch instructions have the same lower address bits, they share the same entry (see the aliasing problem on the next slide)
• The buffer is useful only if the target address is known before the branch condition is decided

Page 31: Instruction-Level Parallelism

Problems with the Simple BHT

• Aliasing
  – All branches with the same index (lower) bits reference the same BHT entry
    • Hence they mutually predict each other
    • There is no guarantee that a prediction is right – but it may not matter anyway
  – Avoidance
    • Make the table bigger – OK, since it's only a single bit vector
    • This is a common cache improvement strategy as well; other cache strategies may also apply
• Consider how this works for loops
  – It always mispredicts twice for every loop
    • Once is unavoidable, since the exit is always a surprise
    • But the previous exit also causes a mispredict on the first iteration of every new entry into the loop
• The clear benefit is that it's cheap and understandable

Page 32: Instruction-Level Parallelism

N-bit Predictors

• Use an n-bit saturating counter
  – A 2-bit counter implies 4 states
  – Statistically, 2 bits get most of the advantage
• Idea: improve on the loop-entry problem – see the sketch below
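
A hedged C sketch of the 2-bit saturating counter (the state encoding 0-3 is an assumption: 0-1 predict not taken, 2-3 predict taken):

    #include <stdint.h>

    int twobit_predict(uint8_t ctr) {       /* 1 = predict taken */
        return ctr >= 2;
    }

    uint8_t twobit_update(uint8_t ctr, int taken) {
        if (taken)
            return ctr < 3 ? ctr + 1 : 3;   /* saturate at strongly taken */
        else
            return ctr > 0 ? ctr - 1 : 0;   /* saturate at strongly not taken */
    }

With this counter a loop branch mispredicts only once per execution of the loop: the exit moves the counter from 3 to 2, so the next entry into the loop still predicts taken.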

Page 33: Instruction-Level Parallelism

Improving the Prediction Strategy by Correlating Branches

• 2-bit predictors use only the recent behavior of a single branch to predict the future of that branch
• It may be possible to improve the prediction accuracy if we also look at the recent behavior of other branches

Page 34: Instruction-Level Parallelism

Correlating Branches

• Consider the worst case for the 2-bit predictor:

    if (aa == 2) aa = 0;
    if (bb == 2) bb = 0;
    if (aa != bb) { ... }

  – Single-level predictors can never get this case: if the first two branches are both not taken (both conditions true, so aa and bb are set to 0), the third branch is always taken, since aa == bb
• Correlating, or two-level, predictors
  – Correlation = what happened on the last branch
    • Note that the last correlating branch may not always be the same
  – Predictor = which way to go
    • 4 possibilities: which way the last one went chooses the prediction
    • (last-taken, last-not-taken) × (predict-taken, predict-not-taken)

Page 35: Instruction-Level Parallelism

Correlating Branches

• Hypothesis: recently executed branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch
• Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table
• In general, an (m,n) predictor records the last m branches to select among 2^m history tables, each with n-bit counters – see the sketch below
  – The old 2-bit BHT is then a (0,2) predictor
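
A hedged C sketch of a small (2,2) predictor (the global-history-register implementation and all sizes are illustrative assumptions, not from the slides):

    #include <stdint.h>

    #define M         2                            /* last m = 2 branch outcomes */
    #define IDX_BITS  10
    #define TBL_SIZE  (1 << IDX_BITS)

    static uint8_t  tables[1 << M][TBL_SIZE];      /* 2^m tables of 2-bit counters */
    static uint32_t ghist;                         /* global history: last m outcomes */

    int corr_predict(uint32_t pc) {
        uint32_t idx = (pc >> 2) & (TBL_SIZE - 1);
        return tables[ghist][idx] >= 2;            /* n = 2: high half means taken */
    }

    void corr_update(uint32_t pc, int taken) {
        uint8_t *c = &tables[ghist][(pc >> 2) & (TBL_SIZE - 1)];
        if (taken && *c < 3) (*c)++;               /* saturating 2-bit update */
        if (!taken && *c > 0) (*c)--;
        ghist = ((ghist << 1) | (taken ? 1u : 0u)) & ((1u << M) - 1);
    }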

Page 36: Instruction-Level Parallelism

Tournament Predictors

• Adaptively combine local and global predictors
  – Multiple predictors
    • One based on global information: the results of the m most recently executed branches
    • One based on local information: the results of past executions of the current branch instruction
  – A selector chooses which predictor to use
    • A 2-bit saturating counter, incremented whenever the "predicted" predictor is correct and the other predictor is incorrect, and decremented in the reverse situation – see the sketch below
• Advantage
  – The ability to select the right predictor for the right branch
• Example: the Alpha 21264 branch predictor (pp. 207-209)
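
A hedged C sketch of the selector's update rule, in a common formulation where the counter moves toward whichever predictor was correct when the two disagree; the encoding (high values favor the global predictor) is an assumption:

    #include <stdint.h>

    int choose_global(uint8_t sel) {                   /* sel >= 2: trust global */
        return sel >= 2;
    }

    uint8_t selector_update(uint8_t sel, int global_ok, int local_ok) {
        if (global_ok && !local_ok && sel < 3) sel++;  /* global right, local wrong */
        if (local_ok && !global_ok && sel > 0) sel--;  /* local right, global wrong */
        return sel;                                    /* unchanged if they agree */
    }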

Page 37: Instruction-Level Parallelism

High-Performance Instruction Delivery

• In a high-performance pipeline, especially one with multiple issue, predicting branches well is not enough
• We must actually deliver a high-bandwidth instruction stream
• We consider 3 concepts
  – A branch-target buffer
  – An integrated instruction fetch unit
  – Dealing with indirect branches by predicting the return address

Page 38: Instruction-Level Parallelism

Branch-Target Buffers

• To reduce the branch penalty to 0
  – We need to know the target address by the end of IF
  – But the instruction has not even been decoded yet
  – So use the instruction address rather than wait for the decode
• If the prediction works, the penalty goes to 0!
• A branch-prediction cache that stores the predicted address for the next instruction after a branch is called a branch-target buffer, or branch-target cache

Page 39: Instruction-Level Parallelism

• BTB idea – a cache that stores taken branches (no need to store untaken ones)
  – The match tag is the instruction address, compared with the current PC
  – The data field is the predicted PC
• May want to add a predictor field
  – To avoid the mispredict-twice-on-every-loop phenomenon
  – Adds complexity, since we now have to track untaken branches as well
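
A hedged C sketch of such a buffer (the direct-mapped organization and the sizes are illustrative assumptions):

    #include <stdint.h>

    #define BTB_SIZE 512

    struct btb_entry {
        uint32_t tag;        /* address of the (taken) branch instruction */
        uint32_t target;     /* predicted PC of the next instruction */
        int      valid;
    };

    static struct btb_entry btb[BTB_SIZE];

    /* Lookup during IF: a hit means "this PC was recently a taken branch;
       start fetching from the stored target". Returns 1 on a hit. */
    int btb_lookup(uint32_t pc, uint32_t *target) {
        struct btb_entry *e = &btb[(pc >> 2) % BTB_SIZE];
        if (e->valid && e->tag == pc) {
            *target = e->target;
            return 1;
        }
        return 0;
    }

    void btb_insert(uint32_t pc, uint32_t target) {   /* on a taken branch */
        struct btb_entry *e = &btb[(pc >> 2) % BTB_SIZE];
        e->tag = pc; e->target = target; e->valid = 1;
    }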

Page 40: Instruction-Level Parallelism

Branch Target Buffer/Cache – Illustration

Page 41: Instruction-Level Parallelism

The Steps Involved in Handling an Instruction with a Branch-Target Buffer

Page 42: Instruction-Level Parallelism

Return Address Predictor

• Indirect jumps – jumps whose destination address varies at run time
  – Indirect procedure calls, select or case statements, procedure returns
• The accuracy of a BTB for procedure returns is low
  – If the procedure is called from many places, and the calls from one place are not clustered in time
• Use a small buffer of return addresses operating as a stack
  – Cache the most recent return addresses
  – Push a return address on a call, and pop one off at a return
  – If the cache is sufficiently large (max call depth), prediction is perfect
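
A hedged C sketch of the return-address stack (the depth and the overflow policy are illustrative assumptions):

    #include <stdint.h>

    #define RAS_DEPTH 16                   /* covers call depths up to 16 */

    static uint32_t ras[RAS_DEPTH];
    static int      ras_top;               /* number of valid entries */

    void ras_push(uint32_t return_pc) {    /* on a procedure call */
        if (ras_top < RAS_DEPTH)
            ras[ras_top++] = return_pc;
        /* real designs typically wrap and overwrite the oldest entry instead */
    }

    int ras_pop(uint32_t *predicted_pc) {  /* on a return: the predicted target */
        if (ras_top == 0)
            return 0;                      /* empty: no prediction */
        *predicted_pc = ras[--ras_top];
        return 1;
    }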

Page 43: Instruction-Level Parallelism

Dynamic Branch Prediction Summary

• Branch history table: 2 bits for loop accuracy
• Correlation: recently executed branches are correlated with the next branch
• Branch-target buffer: includes the branch address and the prediction
• Reduce the penalty further by fetching instructions from both the predicted and unpredicted directions

Page 44: Instruction-Level Parallelism


3.7 Hardware-Based Speculation

Page 45: Instruction-Level Parallelism

Overview

• Overcome control dependence by speculating on the outcome of branches and executing the program as if our guesses were correct
  – Fetch, issue, and execute instructions
  – Need mechanisms to handle the situation when the speculation is incorrect
• A variety of mechanisms support speculation by the compiler (next chapter)
• Hardware speculation extends the ideas of dynamic scheduling (this chapter)

Page 46: Instruction-Level Parallelism

Key Ideas

• Hardware-based speculation combines 3 key ideas:
  – Dynamic branch prediction to choose which instructions to execute
  – Speculation to allow speculated blocks to execute before the control dependences are resolved
    • And to undo the effects of an incorrectly speculated sequence
  – Dynamic scheduling to deal with the scheduling of different combinations of basic blocks (a Tomasulo-style approach)

Page 47: Instruction-Level Parallelism

HW Speculation Approach

• Issue → execute → write result → commit
  – Commit is the point at which the operation is no longer speculative
• Allow out-of-order execution
  – Require in-order commit
  – Prevent speculative instructions from performing destructive state changes (e.g., memory writes or register writes)
• Collect pre-commit instructions in a reorder buffer (ROB)
  – Holds completed but not committed instructions
  – Effectively contains a set of virtual registers that store the results of speculative instructions until they are no longer speculative
  – Similar to a reservation station, and it becomes a bypass source

Page 48: Instruction-Level Parallelism

The Speculative MIPS (Cont.)

• Need a HW buffer for the results of uncommitted instructions: the reorder buffer (ROB)
  – 4 fields: instruction type, destination field, value field, ready field
  – The ROB is a source of operands – more registers, like the reservation stations (RS)
    • The ROB supplies operands in the interval between the completion of instruction execution and instruction commit
    • Use the ROB number instead of the RS number to indicate the source of operands when execution completes (but has not committed)
  – Once an instruction commits, its result is put into the register
  – As a result, it is easy to undo speculated instructions on mispredicted branches or on exceptions

Page 49: Instruction-Level Parallelism

ROB Fields

• Instruction type – branch, store, or register operation
• Destination field
  – Unused for branches
  – Memory address for stores
  – Register number for loads and ALU operations (register operations)
• Value – holds the value of the instruction result until commit
• Ready – indicates that the instruction has completed execution
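
These four fields map directly onto a struct; a hedged C sketch (the field widths and the extra misprediction flag for branches are illustrative assumptions):

    #include <stdint.h>

    enum rob_type { ROB_BRANCH, ROB_STORE, ROB_REGOP };

    struct rob_entry {
        enum rob_type type;    /* instruction type */
        uint32_t dest;         /* store: memory address; load/ALU: register
                                  number; unused for branches */
        uint64_t value;        /* result, held here until commit */
        int      ready;        /* has execution completed? */
        int      mispredicted; /* branches only: was the prediction wrong?
                                  (used by the commit sketch later) */
    };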

Page 50: Instruction-Level Parallelism

Steps in Speculative Execution

• Issue (or dispatch)
  – Get an instruction from the instruction queue
  – In-order issue if an empty RS and ROB slot are available; otherwise, stall
  – Send operands to the RS if they are in the registers or the ROB
  – The ROB number allocated for the result is sent to the RS
• Execute
  – The RS waits and grabs results
  – When all operands are there, execution happens
• Write result
  – The result is posted to the ROB
  – Waiting reservation stations can grab it as well

Page 51: Instruction-Level Parallelism

Steps in Speculative Execution (Cont.)

• Commit (or graduate) – when the instruction reaches the ROB head
  – Normal commit – the instruction reaches the ROB head and its result is present in the buffer
    • Update the register and remove the instruction from the ROB
  – Store – update memory and remove the instruction from the ROB
  – Branch with incorrect prediction – wrong speculation
    • Flush the ROB
    • Restart at the correct successor of the branch
    • Remove the instruction from the ROB
  – Branch with correct prediction – finish the branch
    • Remove the instruction from the ROB

Page 52: Instruction-Level Parallelism

Other Issues

• Performance is more sensitive to branch prediction
  – The impact of a misprediction is higher
  – Prediction accuracy, misprediction detection, and misprediction recovery all increase in importance
• Precise exceptions
  – Handled by not recognizing the exception until the instruction is ready to commit
  – If a speculative instruction raises an exception, the exception is recorded in the ROB
    • Exceptions on a mispredicted branch path are flushed as well
    • If the instruction reaches the ROB head, take the exception

Page 53: Instruction-Level Parallelism

Multiple Issue with Speculation

• Process multiple instructions per clock, assigning an RS and a ROB entry to each instruction
• To maintain a throughput of greater than one instruction per cycle, the processor must handle multiple instruction commits per clock
• Speculation helps significantly when a branch is a key potential performance limitation
• Speculation can be advantageous when there are data-dependent branches, which otherwise would limit performance
  – This depends on accurate branch prediction: incorrect speculation will typically harm performance