Chapter Six
Enhancing Performance with Pipelining


Page 1: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

12004 Morgan Kaufmann Publishers

Chapter Six

Enhancing Performance with Pipelining

Page 2: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

22004 Morgan Kaufmann Publishers

Page 3: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

32004 Morgan Kaufmann Publishers

Outline

• 6.1 An overview of pipelining

• 6.2 A pipelined Datapath

• 6.3 Pipelined Control

• 6.4 Data Hazards and Forwarding

• 6.5 Data Hazards and Stalls

• 6.6 Branch Hazards

• 6.7 Using a Hardware Description Language to describe and Model a pipeline

• 6.8 Exceptions

• 6.9 Advanced Pipelining: Extracting More Performance

• 6.10 Real Stuff: The Pentium 4 Pipeline

• 6.11 Fallacies and Pitfalls

• 6.12 Concluding Remarks

• 6.13 Historical Perspective and Further Reading

Page 4: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

42004 Morgan Kaufmann Publishers

6.1 An overview of Pipelining

Page 5: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

52004 Morgan Kaufmann Publishers

Keywords

• Pipelining An implementation technique in which multiple instructions are overlapped in execution, much like an assembly line.

• Structural hazard An occurrence in which a planned instruction cannot execute in the proper clock cycle because the hardware cannot support the combination of instructions that are set to execute in the given clock cycle.

• Data hazard Also called pipeline data hazard. An occurrence in which a planned instruction cannot execute in the proper clock cycle because data that is needed to execute the instruction is not yet available.

• Forwarding Also called bypassing. A method of resolving a data hazard by retrieving the missing data element from internal buffers rather than waiting for it to arrive from programmer-visible registers or memory.

• Load-use data hazard A specific form of data hazard in which the data requested by a load instruction has not yet become available when it is requested.

Page 6: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

62004 Morgan Kaufmann Publishers

Keywords

• Pipeline stall Also called bubble. A stall initiated in order to resolve a hazard.

• Control hazard Also called branch hazard. An occurrence in which the proper instruction cannot execute in the proper clock cycle because the instruction that was fetched is not the one that is needed; that is, the flow of instruction addresses is not what the pipeline expected.

• Untaken branch One that falls through to the successive instruction. A taken branch is one that causes transfer to the branch target.

• Branch prediction A method of resolving a branch hazard that assumes a given outcome for the branch and proceeds from that assumption rather than waiting to ascertain the actual outcome.

• Latency (pipeline) The number of stages in a pipeline or the number of stages between two instructions during execution.

Page 7: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

72004 Morgan Kaufmann Publishers

Figure 6.1 The laundry analogy for pipelining.

Page 8: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

82004 Morgan Kaufmann Publishers

Figure 6.2 Total time for each instruction calculated from the time for each component.

Instruction class                   Instruction fetch   Register read   ALU operation   Data access   Register write   Total time
Load word (lw)                      200 ps              100 ps          200 ps          200 ps        100 ps           800 ps
Store word (sw)                     200 ps              100 ps          200 ps          200 ps                         700 ps
R-format (add, sub, and, or, slt)   200 ps              100 ps          200 ps                        100 ps           600 ps
Branch (beq)                        200 ps              100 ps          200 ps                                         500 ps

Page 9: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

92004 Morgan Kaufmann Publishers

Pipelining

• Improve performance by increasing instruction throughput

Ideal speedup is number of stages in the pipeline. Do we achieve this?

(Diagram: the three loads lw $1, 100($0), lw $2, 200($0), and lw $3, 300($0) executed first without pipelining, where each instruction takes 800 ps and the next one starts only when the previous one finishes, and then with pipelining, where a new instruction starts every 200 ps and each one still passes through instruction fetch, register read, ALU, data access, and register write.)

Note: timing assumptions changed for this example
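
A quick check of the numbers in this example, as a small Python sketch (mine, not from the book; the 800 ps per instruction and 200 ps per stage come from the figure above):

# Sketch: nonpipelined vs. pipelined execution time for n instructions,
# using the example's timing (800 ps per instruction, 200 ps per stage).
def nonpipelined_time_ps(n, instr_time_ps=800):
    return n * instr_time_ps                      # instructions run back to back

def pipelined_time_ps(n, stages=5, stage_time_ps=200):
    # The first instruction fills the pipeline; after that, one finishes per cycle.
    return (stages + (n - 1)) * stage_time_ps

n = 3
print(nonpipelined_time_ps(n))                            # 2400 ps
print(pipelined_time_ps(n))                               # 1400 ps
print(nonpipelined_time_ps(n) / pipelined_time_ps(n))     # about 1.7, not the ideal 4 (800/200)

With many more instructions the ratio approaches the ideal speedup of 4 for these stage times.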

Page 10: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

102004 Morgan Kaufmann Publishers

Figure 6.4 Graphical representation of the instruction pipeline, similar in spirit to the laundry pipeline in Figure 6.1.

Page 11: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

112004 Morgan Kaufmann Publishers

Figure 6.5 Graphical representation of forwarding

Page 12: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

122004 Morgan Kaufmann Publishers

Figure 6.6 We need a stall even with forwarding when an R-format instruction following a load tries to use the data.

Page 13: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

132004 Morgan Kaufmann Publishers

Figure 6.7 Pipeline showing stalling on every conditional branch as solution to control hazards.

Page 14: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

142004 Morgan Kaufmann Publishers

Figure 6.8 Predicting that branches are not taken as a solution to control hazard.

Page 15: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

152004 Morgan Kaufmann Publishers

Pipelining

• What makes it easy
– all instructions are the same length
– just a few instruction formats
– memory operands appear only in loads and stores

• What makes it hard?
– structural hazards: suppose we had only one memory
– control hazards: need to worry about branch instructions
– data hazards: an instruction depends on a previous instruction

• We'll build a simple pipeline and look at these issues

• We'll talk about modern processors and what really makes it hard:
– exception handling
– trying to improve performance with out-of-order execution, etc.

Page 16: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

162004 Morgan Kaufmann Publishers

6.2 A pipelined Datapath

Page 17: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

172004 Morgan Kaufmann Publishers

Basic Idea

• What do we need to add to actually split the datapath into stages?

IF: instruction fetch; ID: instruction decode/register file read; EX: execute/address calculation; MEM: memory access; WB: write back

(Diagram: the single-cycle datapath, PC, instruction memory, register file, sign-extend unit, ALU, branch adder, and data memory, divided into these five stages.)

Page 18: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

182004 Morgan Kaufmann Publishers

Figure 6.10 Instructions being executed using the single-cycle datapath in figure 6.9, assuming pipelined execution.

Page 19: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

192004 Morgan Kaufmann Publishers

Pipelined Datapath

Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?

(Diagram: the same datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the five stages.)
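
One way to see the problem the slide asks about: the register number that a load writes in WB was decoded four cycles earlier in ID, so if the write-register field were taken from whatever instruction currently sits in IF/ID, the wrong register would be written. The destination register number has to travel with its instruction through the pipeline registers. A tiny Python sketch of the idea (illustrative only; the field layout is mine):

# Sketch: the destination register number decoded in ID must ride through
# ID/EX, EX/MEM, and MEM/WB so that WB still knows which register to write.
program = [("lw  $10, 20($1)", 10), ("sub $11, $2, $3", 11), ("and $12, $4, $5", 12)]

if_id = id_ex = ex_mem = mem_wb = None            # the four pipeline registers
for cycle in range(1, len(program) + 5):
    if mem_wb:                                    # WB stage uses the carried register number
        print(f"CC {cycle}: write register {mem_wb[1]}  ({mem_wb[0]})")
    # At each clock edge every pipeline register captures the previous stage's output.
    new_if_id = program[cycle - 1] if cycle <= len(program) else None
    mem_wb, ex_mem, id_ex, if_id = ex_mem, id_ex, if_id, new_if_id

Any instruction that writes a register manifests the problem, for example the lw above: it is fetched in CC 1 and must still write register 10 in CC 5.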

Page 20: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

202004 Morgan Kaufmann Publishers

Figure 6.12 IF and ID: first and second pipe stages of an instruction, with the active portions of the datapath in figure 6.11 highlighted.

Page 21: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

212004 Morgan Kaufmann Publishers

Page 22: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

222004 Morgan Kaufmann Publishers

Figure 6.13 EX: the third pipe stage of a load instruction, highlighting the portions of the datapath in figure 6.11 used in this pipe stage.

Page 23: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

232004 Morgan Kaufmann Publishers

Figure 6.14 MEM and WB: the fourth and fifth pipe stages of a load instruction, highlighting the portions of the datapath in figure 6.11 used in this pipe stage.

Page 24: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

242004 Morgan Kaufmann Publishers

Page 25: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

252004 Morgan Kaufmann Publishers

Figure 6.15 EX: the third pipe stage of a store instruction.

Page 26: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

262004 Morgan Kaufmann Publishers

Figure 6.16 MEM and WB: the fourth and fifth pipe stage of a store instruction.

Page 27: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

272004 Morgan Kaufmann Publishers

Page 28: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

282004 Morgan Kaufmann Publishers

Corrected Datapath

Page 29: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

292004 Morgan Kaufmann Publishers

Graphically Representing Pipelines

• Can help with answering questions like:

– how many cycles does it take to execute this code?

– what is the ALU doing during cycle 4?

– use this representation to help understand datapaths

(Diagram: multiple-clock-cycle pipeline diagram for lw $1, 100($0), lw $2, 200($0), and lw $3, 300($0) over clock cycles CC 1 to CC 7; each instruction occupies IM, Reg, ALU, DM, and Reg in successive cycles, offset by one cycle from the instruction before it.)
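
Diagrams like this one are mechanical enough to generate by program. A small Python sketch (mine, not from the book) that prints which stage each instruction occupies in each cycle, assuming one instruction enters per cycle and no stalls:

# Sketch: print a multiple-clock-cycle pipeline diagram (no hazards, no stalls).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def diagram(instructions):
    cycles = len(instructions) + len(STAGES) - 1
    print("instruction".ljust(20) + " ".join(f"CC{c + 1:>2}" for c in range(cycles)))
    for i, instr in enumerate(instructions):
        cells = [STAGES[c - i] if 0 <= c - i < len(STAGES) else "" for c in range(cycles)]
        print(instr.ljust(20) + " ".join(f"{cell:>4}" for cell in cells))

diagram(["lw $1, 100($0)", "lw $2, 200($0)", "lw $3, 300($0)"])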

Page 30: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

302004 Morgan Kaufmann Publishers

Figure 6.18 The portion of the datapath in figure 6.17 that is used in all five stages of a load instruction.

Page 31: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

312004 Morgan Kaufmann Publishers

Figure 6.19 Multiple-clock-cycle pipeline diagram of five instructions.

Page 32: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

322004 Morgan Kaufmann Publishers

Figure 6.20 Traditional multiple-clock-cycle pipeline diagram of five instructions in figure 6.19.

Page 33: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

332004 Morgan Kaufmann Publishers

Figure 6.21 The single-clock-cycle diagram corresponding to clock cycle 5 of the pipeline in figures 6.19 and 6.20.

Page 34: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

342004 Morgan Kaufmann Publishers

6.3 Pipelined Control

Page 35: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

352004 Morgan Kaufmann Publishers

Pipeline Control

Page 36: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

362004 Morgan Kaufmann Publishers

• We have 5 stages. What needs to be controlled in each stage?

– Instruction Fetch and PC Increment

– Instruction Decode / Register Fetch

– Execution

– Memory Stage

– Write Back

• How would control be handled in an automobile plant?

– a fancy control center telling everyone what to do?

– should we use a finite state machine?

Pipeline control

Page 37: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

372004 Morgan Kaufmann Publishers

Figure 6.23 A copy of figure 5.12 on page 302.

Instruction opcode   ALUOp   Instruction operation   Function code   Desired ALU action   ALU control input

LW                   00      load word               XXXXXX          add                  0010
SW                   00      store word              XXXXXX          add                  0010
Branch equal         01      branch equal            XXXXXX          subtract             0110
R-type               10      add                     100000          add                  0010
R-type               10      subtract                100010          subtract             0110
R-type               10      AND                     100100          and                  0000
R-type               10      OR                      100101          or                   0001
R-type               10      set on less than        101010          set on less than     0111
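
This table maps directly onto a small combinational function. A Python sketch of that mapping (the function name is mine; the ALUOp, function-code, and control values are the ones in Figure 6.23):

# Sketch of the ALU control of Figure 6.23: ALUOp comes from the main control
# unit, funct is the low 6 bits of an R-format instruction.
def alu_control(alu_op, funct=None):
    if alu_op == 0b00:                  # lw / sw
        return 0b0010                   # add
    if alu_op == 0b01:                  # beq
        return 0b0110                   # subtract
    return {0b100000: 0b0010,           # add
            0b100010: 0b0110,           # subtract
            0b100100: 0b0000,           # AND
            0b100101: 0b0001,           # OR
            0b101010: 0b0111}[funct]    # set on less than

assert alu_control(0b10, 0b100010) == 0b0110    # R-type sub uses ALU subtract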

Page 38: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

382004 Morgan Kaufmann Publishers

Figure 6.24 A copy of figure 5.16 on page 306.

Signal name: effect when deasserted (0) / effect when asserted (1)

RegDst: deasserted, the register destination number for the Write register comes from the rt field (bits 20:16); asserted, it comes from the rd field (bits 15:11).

RegWrite: deasserted, none; asserted, the register on the Write register input is written with the value on the Write data input.

ALUSrc: deasserted, the second ALU operand comes from the second register file output (Read data 2); asserted, the second ALU operand is the sign-extended, lower 16 bits of the instruction.

PCSrc: deasserted, the PC is replaced by the output of the adder that computes PC + 4; asserted, the PC is replaced by the output of the adder that computes the branch target.

MemRead: deasserted, none; asserted, data memory contents designated by the address input are put on the Read data output.

MemWrite: deasserted, none; asserted, data memory contents designated by the address input are replaced by the value on the Write data input.

MemtoReg: deasserted, the value fed to the register Write data input comes from the ALU; asserted, the value fed to the register Write data input comes from the data memory.

Page 39: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

392004 Morgan Kaufmann Publishers

Figure 6.25 The values of the control lines are the same as in figure 5.18 on page 308, but they have been shuffled into three groups corresponding to the last three pipeline stages.

Instruction   RegDst   ALUOp1   ALUOp0   ALUSrc   Branch   MemRead   MemWrite   RegWrite   MemtoReg
R-format      1        1        0        0        0        0         0          1          0
lw            0        0        0        1        0        1         0          1          1
sw            X        0        0        1        0        0         1          0          X
beq           X        0        1        0        1        0         0          0          X

RegDst, ALUOp1, ALUOp0, and ALUSrc are the execution/address calculation stage control lines; Branch, MemRead, and MemWrite are the memory access stage control lines; RegWrite and MemtoReg are the write-back stage control lines.
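
Because the control values depend only on the instruction class, they amount to a lookup performed once in ID and then carried along with the instruction. A Python sketch of that view (the dictionary layout is mine; None stands for an X, don't care):

# Sketch: the control lines of Figure 6.25 grouped the way the pipeline uses
# them: (EX-stage lines), (MEM-stage lines), (WB-stage lines).
CONTROL = {
    #             RegDst ALUOp1 ALUOp0 ALUSrc    Branch MemRead MemWrite   RegWrite MemtoReg
    "R-format": ((1,     1,     0,     0),       (0,    0,      0),        (1,      0)),
    "lw":       ((0,     0,     0,     1),       (0,    1,      0),        (1,      1)),
    "sw":       ((None,  0,     0,     1),       (0,    0,      1),        (0,      None)),
    "beq":      ((None,  0,     1,     0),       (1,    0,      0),        (0,      None)),
}

ex_ctl, mem_ctl, wb_ctl = CONTROL["lw"]   # read in ID, then passed down the pipeline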

Page 40: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

402004 Morgan Kaufmann Publishers

6.4 Data Hazards and Forwarding

Page 41: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

412004 Morgan Kaufmann Publishers

• Pass control signals along just like the data

Pipeline Control

The values are the same as in Figure 6.25, grouped into execution/address calculation, memory access, and write-back stage control lines:

Instruction   RegDst   ALUOp1   ALUOp0   ALUSrc   Branch   MemRead   MemWrite   RegWrite   MemtoReg
R-format      1        1        0        0        0        0         0          1          0
lw            0        0        0        1        0        1         0          1          1
sw            X        0        0        1        0        0         1          0          X
beq           X        0        1        0        1        0         0          0          X

(Diagram: the Control unit writes the EX, M, and WB control fields into the ID/EX pipeline register along with the instruction; the M and WB fields are passed on through EX/MEM, and the WB field through MEM/WB, so each stage finds its control lines waiting for it.)

Page 42: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

422004 Morgan Kaufmann Publishers

Datapath with Control

Page 43: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

432004 Morgan Kaufmann Publishers

• Problem with starting next instruction before first is finished

– dependencies that “go backward in time” are data hazards

Dependencies

Program execution order (in instructions):
sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)

(Diagram: pipeline diagram over clock cycles CC 1 to CC 9. Value of register $2: 10 through CC 4, 10/-20 in CC 5 when the sub writes it back, -20 afterwards; the and and or read $2 before the sub has written the new value.)

Page 44: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

442004 Morgan Kaufmann Publishers

• Have compiler guarantee no hazards

• Where do we insert the “nops” ?

sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)
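
One possible placement (a sketch; it assumes, as elsewhere in the chapter, that the register file is written in the first half of a clock cycle and read in the second half, so a value written back in CC 5 can be read in ID during CC 5): two nops after the sub are enough, and no others are needed because the later readers of $2 come late enough anyway.

sub $2, $1, $3
nop              # wait for $2 to be written back
nop              # wait for $2 to be written back
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)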

• Problem: this really slows us down!

Software Solution

Page 45: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

452004 Morgan Kaufmann Publishers

EX hazard

if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10

if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10

Page 46: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

462004 Morgan Kaufmann Publishers

MEM hazard

if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01

if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
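
These conditions translate almost line for line into the forwarding unit. A Python sketch (field and function names are mine); it also includes the priority check that prefers the more recent EX/MEM result when both pipeline registers match, a refinement the book adds after the simple conditions shown above:

# Sketch of the forwarding unit: select each ALU operand from the register file
# (0b00), the EX/MEM pipeline register (0b10), or the MEM/WB register (0b01).
def forward(ex_mem, mem_wb, id_ex):
    fwd_a = fwd_b = 0b00
    # EX hazard: a result still sitting in EX/MEM is the most recent one.
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0:
        if ex_mem["Rd"] == id_ex["Rs"]:
            fwd_a = 0b10
        if ex_mem["Rd"] == id_ex["Rt"]:
            fwd_b = 0b10
    # MEM hazard: otherwise forward from MEM/WB.
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0:
        if fwd_a == 0b00 and mem_wb["Rd"] == id_ex["Rs"]:
            fwd_a = 0b01
        if fwd_b == 0b00 and mem_wb["Rd"] == id_ex["Rt"]:
            fwd_b = 0b01
    return fwd_a, fwd_b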

Page 47: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

472004 Morgan Kaufmann Publishers

• Use temporary results, don’t wait for them to be written

– register file forwarding to handle read/write to same register

– ALU forwarding

Forwarding

(Diagram: the same sub/and/or/add/sw sequence with forwarding paths from the EX/MEM and MEM/WB pipeline registers to the ALU inputs. Value of register $2: 10 until the end of CC 5, then -20. Value of EX/MEM: -20 in CC 4; value of MEM/WB: -20 in CC 5. Slide annotation: what if this $2 was $13?)

Page 48: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

482004 Morgan Kaufmann Publishers

Figure 6.30 On the top are the ALU and pipeline registers before adding forwarding.

Page 49: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

492004 Morgan Kaufmann Publishers

Forwarding

• The main idea (some details not shown)

Page 50: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

502004 Morgan Kaufmann Publishers

Figure 6.31 The control values for the forwarding multiplexors in figure 6.30.

Mux control     Source    Explanation
ForwardA = 00   ID/EX     The first ALU operand comes from the register file.
ForwardA = 10   EX/MEM    The first ALU operand is forwarded from the prior ALU result.
ForwardA = 01   MEM/WB    The first ALU operand is forwarded from data memory or an earlier ALU result.
ForwardB = 00   ID/EX     The second ALU operand comes from the register file.
ForwardB = 10   EX/MEM    The second ALU operand is forwarded from the prior ALU result.
ForwardB = 01   MEM/WB    The second ALU operand is forwarded from data memory or an earlier ALU result.

Page 51: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

512004 Morgan Kaufmann Publishers

Figure 6.32 The datapath modified to resolve hazards via forwarding.

Page 52: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

522004 Morgan Kaufmann Publishers

Figure 6.33 A close-up of the datapath in figure 6.30 on page 409 shows a 2:1 multiplexor, which has been added to select the sign-extended immediate as an ALU input.

Page 53: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

532004 Morgan Kaufmann Publishers

6.5 Data hazards and Stalls

Page 54: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

542004 Morgan Kaufmann Publishers

Keywords

• nop An instruction that does no operation to change state.

Page 55: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

552004 Morgan Kaufmann Publishers

• Load word can still cause a hazard:
– an instruction that follows a load tries to read the same register the load writes

• Thus, we need a hazard detection unit to “stall” the instruction that uses the load result

Can't always forward

Program execution order (in instructions):
lw  $2, 20($1)
and $4, $2, $5
or  $8, $2, $6
add $9, $4, $2
slt $1, $6, $7

(Diagram: pipeline diagram over CC 1 to CC 9; the lw's data is not read from memory until the end of CC 4, but the and needs it at the start of its EX stage in CC 4, a dependence that goes backward in time even with forwarding.)

Page 56: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

562004 Morgan Kaufmann Publishers

Stalling

• We can stall the pipeline by keeping an instruction in the same stage

bubble

Program execution order (in instructions):
lw  $2, 20($1)
and $4, $2, $5   (the and is turned into a nop and repeated one cycle later)
or  $8, $2, $6
add $9, $4, $2

(Diagram: pipeline diagram over CC 1 to CC 10; a one-cycle bubble delays the and and everything behind it, so the and reaches EX in CC 5, when the lw result can be forwarded from MEM/WB.)

Page 57: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

572004 Morgan Kaufmann Publishers

Hazard Detection Unit

• Stall by letting an instruction that won’t write anything go forward
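
The load-use check itself is small. A Python sketch of the standard condition (field names are mine): a load sitting in EX whose destination matches either source register of the instruction in ID forces a one-cycle stall.

# Sketch of load-use hazard detection: stall when the instruction in ID needs a
# register that the load currently in EX has not yet read from memory.
def must_stall(id_ex, if_id):
    return (id_ex["MemRead"]
            and (id_ex["Rt"] == if_id["Rs"] or id_ex["Rt"] == if_id["Rt"]))

# On a stall: keep PC and IF/ID unchanged and zero the ID/EX control signals,
# which is exactly the bubble, an instruction that writes nothing.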

Page 58: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

582004 Morgan Kaufmann Publishers

6.6 Branch Hazards

Page 59: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

592004 Morgan Kaufmann Publishers

Keywords

• Flush (instructions) To discard instructions in a pipeline, usually due to an unexpected event.

• Dynamic branch prediction Prediction of branches at runtime using runtime information.

• Branch prediction buffer Also called branch history table. A small memory that is indexed by the lower portion of the address of the branch instruction and that contains one or more bits indicating whether the branch was recently taken or not.

• Branch delay slot The slot directly after a delayed branch instruction, which in the MIPS architecture is filled by an instruction that does not affect the branch.

• Branch target buffer A structure that caches the destination PC or destination instruction for a branch. It is usually organized as a cache with tags, making it more costly than a simple prediction buffer.

Page 60: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

602004 Morgan Kaufmann Publishers

Keywords

• Correlating predictor A branch predictor that combines local behavior of a particular branch and global information about the behavior of some recent number of executed branches.

• Tournament branch predictor A branch predictor with multiple predictions for each branch and a selection mechanism that chooses which predictor to enable for a given branch.

Page 61: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

612004 Morgan Kaufmann Publishers

• When we decide to branch, other instructions are in the pipeline!

• We are predicting “branch not taken”

– need to add hardware for flushing instructions if we are wrong

Branch Hazards

Program execution order (in instructions):
40 beq $1, $3, 28
44 and $12, $2, $5
48 or  $13, $6, $2
52 add $14, $2, $2
72 lw  $4, 50($7)

(Diagram: pipeline diagram over CC 1 to CC 9; the branch outcome is not known until the beq reaches the MEM stage, so the instructions at 44, 48, and 52 are already in the pipeline and must be flushed when the branch to address 72 turns out to be taken.)

Page 62: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

622004 Morgan Kaufmann Publishers

Figure 6.38 The ID stage of clock cycle 3 determines that a branch must be taken, so it selects 72 as the next PC address and zeros the instruction fetched for the next clock cycle.

Page 63: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

632004 Morgan Kaufmann Publishers

Page 64: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

642004 Morgan Kaufmann Publishers

Branches

• If the branch is taken, we have a penalty of one cycle
• For our simple design, this is reasonable
• With deeper pipelines, the penalty increases and static branch prediction drastically hurts performance
• Solution: dynamic branch prediction

(State diagram: two “predict taken” states and two “predict not taken” states, with taken and not-taken transitions between them; the prediction changes only after two mispredictions in a row.)

A 2-bit prediction scheme
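
A 2-bit saturating counter per prediction-buffer entry is easy to model. A Python sketch (mine, not from the book): counter values 0 and 1 predict not taken, 2 and 3 predict taken, and it takes two consecutive mispredictions to flip the prediction.

# Sketch: 2-bit saturating-counter branch predictor, one counter per entry,
# indexed by low-order bits of the branch instruction's address.
class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.table = [1] * entries                 # start weakly not taken

    def _index(self, pc):
        return (pc >> 2) % len(self.table)

    def predict_taken(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)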

Page 65: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

652004 Morgan Kaufmann Publishers

Branch Prediction

• Sophisticated Techniques:

– A “branch target buffer” to help us look up the destination

– Correlating predictors that base prediction on global behavior and recently executed branches (e.g., the prediction for a specific branch instruction depends on what happened in previous branches)

– Tournament predictors that use different types of prediction strategies and keep track of which one is performing best.

– A “branch delay slot” which the compiler tries to fill with a useful instruction (make the one cycle delay part of the ISA)

• Branch prediction is especially important because it enables other more advanced pipelining techniques to be effective!

• Modern processors predict correctly 95% of the time!

Page 66: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

662004 Morgan Kaufmann Publishers

Figure 6.40 Scheduling the branch delay slot.

Page 67: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

672004 Morgan Kaufmann Publishers

6.7 Using a Hardware Description Language to Describe and Model a Pipeline

Page 68: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

682004 Morgan Kaufmann Publishers

Figure 6.41 The final datapath and control for this chapter.

Page 69: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

692004 Morgan Kaufmann Publishers

6.8 Exceptions

Page 70: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

702004 Morgan Kaufmann Publishers

Keywords

• Imprecise interrupt Also called imprecise exception. Interrupts or exceptions in pipelined computers that are not associated with the exact instruction that was the cause of the interrupt or exception.

• Precise interrupt Also called precise exception. An interrupt or exception that is always associated with the correct instruction in pipelined computers.

Page 71: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

712004 Morgan Kaufmann Publishers

Figure 6.42 The datapath with controls to handle exceptions.

Page 72: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

722004 Morgan Kaufmann Publishers

Figure 6.43 The result of an exception due to arithmetic overflow in the add instruction.

Page 73: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

732004 Morgan Kaufmann Publishers

Page 74: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

742004 Morgan Kaufmann Publishers

6.9 Advanced Pipelining: Extracting More Performance

Page 75: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

752004 Morgan Kaufmann Publishers

Keywords

• Instruction-level parallelism The parallelism among instructions.

• Multiple issue A scheme whereby multiple instructions are launched in 1 clock cycle.

• Static multiple issue An approach to implementing a multiple-issue processor where many decisions are made by the compiler before execution.

• Dynamic multiple issue An approach to implementing a multiple-issue processor where many decisions are made during execution by the processor.

• Issue slots The positions from which instructions could issue in a given clock cycle; by analogy these correspond to positions at the starting blocks for a sprint.

• Speculation An approach whereby the compiler or processor guesses the outcome of an instruction to remove it as a dependence in executing other instructions.

Page 76: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

762004 Morgan Kaufmann Publishers

Keywords

• Issue packet The set of instructions that issues together in 1 clock cycle; the packet may be determined statically by the compiler or dynamically by the processor.

• Loop unrolling A technique to get more performance from loops that access arrays, in which multiple copies of the loop body are made and instructions from different iterations are scheduled together.

• Register renaming The renaming of registers, by the compiler or hardware, to remove antidependences.

• Antidependence Also called name dependence. An ordering forced by the reuse of a name, typically a register, rather than by a true dependence that carries a value between two instructions.

• Instruction group In IA-64, a sequence of consecutive instructions with no register data dependences among them.

Page 77: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

772004 Morgan Kaufmann Publishers

Keywords

• Stop In IA-64, an explicit indicator of a break between independent and dependent instructions.

• Predication A technique to make instructions dependent on predicates rather than on branches.

• Poison A result generated when a speculative load yields an exception, or an instruction uses a poisoned operand.

• Advanced load In IA-64, a speculative load instruction with support to check for aliases that could invalidate the load.

• Superscalar An advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle.

• Dynamic pipeline scheduling Hardware support for recording the order of instruction execution so as to avoid stalls.

• Commit unit The unit in a dynamic or out-of-order execution pipeline that decides when it is safe to release the result of an operation to programmer-visible registers and memory.

Page 78: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

782004 Morgan Kaufmann Publishers

Keywords

• Reservation station A buffer within a functional unit that holds the operands and the operation.

• Reorder buffer The buffer that holds results in a dynamically scheduled processor until it is safe to store the results to memory or a register.

• In-order commit A commit in which the results of pipelined execution are written to the programmer-visible state in the same order that instructions are fetched.

• Out-of-order execution A situation in pipelined execution when an instruction blocked from executing does not cause the following instructions to wait.

Page 79: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

792004 Morgan Kaufmann Publishers

Figure 6.44 Static two-issue pipeline in operation.

Instruction type             Pipe stages
ALU or branch instruction    IF  ID  EX  MEM WB
Load or store instruction    IF  ID  EX  MEM WB
ALU or branch instruction        IF  ID  EX  MEM WB
Load or store instruction        IF  ID  EX  MEM WB
ALU or branch instruction            IF  ID  EX  MEM WB
Load or store instruction            IF  ID  EX  MEM WB
ALU or branch instruction                IF  ID  EX  MEM WB
Load or store instruction                IF  ID  EX  MEM WB

One ALU or branch instruction and one load or store instruction issue together, and each pair starts one clock cycle after the previous pair.

Page 80: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

802004 Morgan Kaufmann Publishers

Figure 6.45 A static two-issue datapath.

Page 81: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

812004 Morgan Kaufmann Publishers

Figure 6.46 The scheduled code as it would look on a two-issue MIPS pipeline.

ALU or branch instruction     Data transfer instruction   Clock cycle
Loop:                         lw   $t0, 0($s1)            1
addi $s1, $s1, -4                                         2
addu $t0, $t0, $s2                                        3
bne  $s1, $zero, Loop         sw   $t0, 4($s1)            4

Page 82: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

822004 Morgan Kaufmann Publishers

Figure 6.47 The unrolled and scheduled code of figure 6.46 as it would look on a static two-issue pipeline.

ALU or branch instruction     Data transfer instruction   Clock cycle
Loop: addi $s1, $s1, -16      lw   $t0, 0($s1)            1
                              lw   $t1, 12($s1)           2
addu $t0, $t0, $s2            lw   $t2, 8($s1)            3
addu $t1, $t1, $s2            lw   $t3, 4($s1)            4
addu $t2, $t2, $s2            sw   $t0, 16($s1)           5
addu $t3, $t3, $s2            sw   $t1, 12($s1)           6
                              sw   $t2, 8($s1)            7
bne  $s1, $zero, Loop         sw   $t3, 4($s1)            8
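
A quick worked comparison of the two schedules (mine, based on the two tables above):

# Sketch: what the unrolling buys on the two-issue pipeline.
orig_ipc = 5 / 4               # Figure 6.46: 5 instructions in 4 clock cycles -> 1.25
unrolled_ipc = 14 / 8          # Figure 6.47: 14 instructions in 8 clock cycles -> 1.75
orig_cycles_per_element = 4 / 1        # one array element every 4 cycles
unrolled_cycles_per_element = 8 / 4    # four elements in 8 cycles -> 2 cycles each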

Page 83: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

832004 Morgan Kaufmann Publishers

Figure 6.48 A summary of the characteristics of the Itanium and Itanium2, Intel’s first two implementations of the IA-64 architecture.

Itanium: maximum of 6 instruction issues per clock; functional units: 4 integer/media, 2 memory, 3 branch, 2 FP; maximum of 9 operations per clock; maximum clock rate 0.8 GHz; 25 million transistors; 130 watts; SPECint2000 379; SPECfp2000 701.

Itanium 2: maximum of 6 instruction issues per clock; functional units: 6 integer/media, 4 memory, 3 branch, 2 FP; maximum of 11 operations per clock; maximum clock rate 1.5 GHz; 221 million transistors; 130 watts; SPECint2000 810; SPECfp2000 1427.

Page 84: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

842004 Morgan Kaufmann Publishers

Improving Performance

• Try and avoid stalls! E.g., reorder these instructions:

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)

• Dynamic Pipeline Scheduling

– Hardware chooses which instructions to execute next

– Will execute instructions out of order (e.g., doesn’t wait for a dependency to be resolved, but rather keeps going!)

– Speculates on branches and keeps the pipeline full (may need to rollback if prediction incorrect)

• Trying to exploit instruction-level parallelism
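
One reordering that removes the load-use stall (a sketch; it only swaps the two stores, which write different addresses, so the memory result is unchanged):

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)   # no longer uses the value loaded by the immediately preceding instruction
sw $t2, 0($t1)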

Page 85: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

852004 Morgan Kaufmann Publishers

Figure 6.49 The three primary units of a dynamically scheduled pipeline.

Page 86: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

862004 Morgan Kaufmann Publishers

Advanced Pipelining

• Increase the depth of the pipeline

• Start more than one instruction each cycle (multiple issue)

• Loop unrolling to expose more ILP (better scheduling)

• “Superscalar” processors

– DEC Alpha 21264: 9 stage pipeline, 6 instruction issue

• All modern processors are superscalar and issue multiple instructions usually with some limitations (e.g., different “pipes”)

• VLIW: very long instruction word, static multiple issue (relies more on compiler technology)

• This class has given you the background you need to learn more!

Page 87: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

872004 Morgan Kaufmann Publishers

6.10 Real Stuff: The Pentium 4 Pipeline

Page 88: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

882004 Morgan Kaufmann Publishers

Keywords

• Microarchitecture The organization of the processor, including the major functional units, their interconnection, and control.

• Architectural registers The instruction set visible registers of a processor; for example, in MIPS, these are the 32 integer and 32 floating-point registers.

Page 89: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

892004 Morgan Kaufmann Publishers

Figure 6.50 The microarchitecture of the Intel Pentium 4.

Page 90: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

902004 Morgan Kaufmann Publishers

Figure 6.51 The Pentium 4 pipeline showing the pipeline flow for a typical instruction and the number of clock cycles for the major steps in the pipeline.

Page 91: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

912004 Morgan Kaufmann Publishers

6.11 Fallacies and Pitfalls

Page 92: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

922004 Morgan Kaufmann Publishers

• Fallacy: Pipelining is easy.

• Fallacy: Pipelining ideas can be implemented independent of technology.

• Pitfall: Failure to consider instruction set design can adversely impact pipelining.

Page 93: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

932004 Morgan Kaufmann Publishers

6.12 Concluding Remarks

Page 94: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

942004 Morgan Kaufmann Publishers

Keywords

• Instruction latency The inherent execution time for an instruction.

Page 95: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

952004 Morgan Kaufmann Publishers

Chapter 6 Summary

• Pipelining does not improve latency, but does improve throughput

(Chart: the designs from Chapters 5 and 6, multicycle (Section 5.5), single-cycle (Section 5.4), pipelined, deeply pipelined, multiple-issue pipelined (Section 6.9), and multiple issue with deep pipeline (Section 6.10), ordered from slower to faster by instructions per clock, IPC = 1/CPI.)

Page 96: 1  2004 Morgan Kaufmann Publishers Chapter Six Enhancing Performance with Pipelining

962004 Morgan Kaufmann Publishers

(Chart: the same designs, multicycle (Section 5.5), single-cycle (Section 5.4), pipelined, deeply pipelined, multiple-issue pipelined (Section 6.9), and multiple issue with deep pipeline (Section 6.10), compared by use latency in instructions, from 1 to several.)