ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE

1[ material from Patterson & Hennessy and Harris]

ENGN1640: Design of Computing SystemsTopic 05: Pipeline Processor Design

Professor Sherief Redahttp://scale.engin.brown.edu

Electrical Sciences and Computer EngineeringSchool of Engineering

Brown UniversitySpring 2016

http://scale.engin.brown.edu

Pipelining analogy

2

Pipelined laundry: overlapping execution Parallelism improves performance

Four loads: Speedup

= 16/7 = 2.3 Non-stop:

Ideal Speedup= 4n/(n + 3) ≈ 4= number of stages

Pipelined ARM processor

3

• Temporal parallelism• Divide single-cycle processor into 5 stages:

– Fetch– Decode– Execute– Memory– Writeback

• Add pipeline registers between stages

Single-Cycle vs Pipelined

4

Time (ps)Instr

FetchInstruction

DecRead Reg

ExecuteALU

MemoryRead / Write

WrReg1

2

0 100 200 300 400 500 600 700 800 900 1100 1200 1300 1400 15001000

Instr

1

2

(b)

3

FetchInstruction

DecRead Reg

ExecuteALU

MemoryRead / Write

WrReg

FetchInstruction

DecRead Reg

ExecuteALU

MemoryRead / Write

WrReg

FetchInstruction

DecRead Reg

ExecuteALU

MemoryRead / Write

WrReg

FetchInstruction

DecRead Reg

ExecuteALU

MemoryRead / Write

WrReg

Single-Cycle

Pipelined

Pipeline datapath abstraction

5

Time (cycles)

LDR R2, [R0, #40] RF 40

R0RF

R2+ DM

RF R10

R9RF

R3+ DM

RF R5

R1RF

R4- DM

RF R13

R12RF

R5& DM

RF 20

R1RF

R6+ DM

RF 42

R11RF

R7| DM

ADD R3, R9, R10

SUB R4, R1, R5

AND R5, R12, R13

STR R6, [R1, #20]

ORR R7, R11, #42

1 2 3 4 5 6 7 8 9 10

ADD

IM

IM

IM

IM

IM

IM LDR

SUB

AND

STR

ORR

Single-cycle & pipelined datapath

6

ExtImm

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

RegisterFile

01

A RDData

MemoryWD

WE

10

PC10

PC'

Instr

19:16

15:12

23:0

SrcB

ALUResult ReadData

WriteData

SrcA

PCPlus4

Result

CLK

ALU

PCPlus8 R15

3:0

+

4

15RA1

RA2

Extend

01

01

ExtImmE

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

RegisterFile

01

A RDData

MemoryWD

WE

10

PCF10

PC'

InstrD

19:16

15:12

23:0

SrcBE

ALUResultE ReadDataW

WriteDataE

SrcAE

PCPlus4F

ResultW

CLK

ALU

PCPlus8 R15

3:0

+

4

15RA1D

RA2D

Extend

01

01

CLK CLK CLK CLK

Fetch Decode Execute Memory Writeback

InstrF

ALUOutM ALUOutW

WA3D

Single-Cycle

Pipelined

ExtImmE

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

RegisterFile

01

A RDData

MemoryWD

WE

10

PCF10

PC'

InstrD

19:16

15:12

23:0

SrcBE


WriteDataE

SrcAE

PCPlus4F

ResultW

CLK

ALU

PCPlus8

R15

3:0

+

4

15RA1D

RA2D

Extend

01

01

CLK CLK CLK CLK

InstrF

ALUOutM ALUOutWWA3E WA3M WA3WWA3D

• WA3 must arrive at same time as Result• Register file written on falling edge of CLK

Optimized pipeline datapath

7

Remove adder by using PCPlus4F after PC has been updated to PC+4Assumes writing happens (e.g., in first half of clock cycle) before reading

ExtImmE

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

RegisterFile

01

A RDData

MemoryWD

WE

10

PCF10

PC'

InstrD

19:16

15:12

23:0

SrcBE


WriteDataE

SrcAE

PCPlus4F

ResultW

CLK

ALU

R15

3:015

RA1D

RA2D

Extend

01

01

CLK CLK CLK CLK

InstrF


PCPlus8D

ExtImmE

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

RegisterFile

01

A RDData

MemoryWD

WE

10

PCF10

PC'

InstrD

19:16

15:12

23:0

SrcBE


WriteDataE

SrcAE

PCPlus4F

ResultW

CLK

ALU

PCPlus8

R15

3:0

+

4

15RA1D

RA2D

Extend

01

01

CLK CLK CLK CLK

InstrF


Tracing LDR in its journey: 1st cycle

8

ExtImmE

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

RegisterFile

01

A RDData

MemoryWD

WE

10

PCF10

PC'

InstrD

19:16

15:12

23:0

SrcBE


WriteDataE

SrcAE

PCPlus4F

ResultW

CLK

ALU

R15

3:015

RA1D

RA2D

Extend

01

01

CLK CLK CLK CLK

InstrF


PCPlus8D

LDR fetch

Tracing LDR in its journey: 2nd cycle

9

ExtImmE

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

RegisterFile

01

A RDData

MemoryWD

WE

10

PCF10

PC'

InstrD

19:16

15:12

23:0

SrcBE


WriteDataE

SrcAE

PCPlus4F

ResultW

CLK

ALU

R15

3:015

RA1D

RA2D

Extend

01

01

CLK CLK CLK CLK

InstrF


PCPlus8D

LDR decode

Tracing LDR in its journey: 3rd cycle

10

ExtImmE

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

RegisterFile

01

A RDData

MemoryWD

WE

10

PCF10

PC'

InstrD

19:16

15:12

23:0

SrcBE


WriteDataE

SrcAE

PCPlus4F

ResultW

CLK

ALU

R15

3:015

RA1D

RA2D

Extend

01

01

CLK CLK CLK CLK

InstrF


PCPlus8D

LDR EXE

Tracing LDR in its journey: 4th cycle

11

ExtImmE

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

RegisterFile

01

A RDData

MemoryWD

WE

10

PCF10

PC'

InstrD

19:16

15:12

23:0

SrcBE


WriteDataE

SrcAE

PCPlus4F

ResultW

CLK

ALU

R15

3:015

RA1D

RA2D

Extend

01

01

CLK CLK CLK CLK

InstrF


PCPlus8D

LDR mem

Tracing LDR in its journey: 5th cycle

12

ExtImmE

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

RegisterFile

01

A RDData

MemoryWD

WE

10

PCF10

PC'

InstrD

19:16

15:12

23:0

SrcBE


WriteDataE

SrcAE

PCPlus4F

ResultW

CLK

ALU

R15

3:015

RA1D

RA2D

Extend

01

01

CLK CLK CLK CLK

InstrF


PCPlus8D

LDR WBLDR WB

Pipeline performance

13

Assume time for stages is 100ps for register read or write 200ps for other stages

Compare pipelined datapath with single-cycle datapath

Instr Instr fetch Register read

ALU op Memory access

Register write

Total time

LDR 200ps 100 ps 200ps 200ps 100 ps 800ps

STR 200ps 100 ps 200ps 200ps 700ps

data op 200ps 100 ps 200ps 100 ps 600ps

Branch 200ps 100 ps 200ps 500ps

Single-cycle versus pipeline performance

14

Single-cycle (Tc= 800ps)

Pipelined (Tc= 200ps)

LDR R1,[R5]

LDR R2, [R6]

LDR R3, [R7]

LDR R1,[R5]

LDR R2, [R6]

LDR R3, [R7]

Pipeline speedup

15

If all stages are balanced i.e., all take the same time Time between instructionspipelined

= Time between instructionsnonpipelinedNumber of stages

Ideal speedup (n instructions and s stages)

If not balanced, speedup is less Speedup due to increased throughput

Latency (time for each instruction) does not decrease Branches will also reduce the speedup Added pipeline registers reduce the speedup

Reminder of single-cycle control

16

ExtImm

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

RegisterFile

01

A RDData

MemoryWD

WE

10

PC10

PC'

Instr

19:16

15:12

23:0

25:20

SrcB

ALUResult ReadData

WriteData

SrcA

PCPlus4

Result

27:26

ImmSrc

PCSrc

MemWriteMemtoReg

ALUSrc

RegWrite

OpFunct

ControlUnit

ALUFlags

CLK

ALUControl

ALU

PCPlus8 R15

3:0

Cond31:28

Flags

15:12 Rd

+

4

15RA1

RA2

0 1

Extend

01

01

RegSrc

Modifications to pipeline control

17

Control signals derived from instruction Same as in single-cycle implementation Control delayed to proper pipeline stage

Pipelined datapath + control

18

ExtImm

E

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

RegisterFile

01

A RDData

MemoryWD

WE

10

PCFPC'

InstrD

19:16

15:12

23:0

25:20

SrcBE


WriteDataE

SrcAE

PCPlus4F

ResultW

27:26

ImmSrcD

MemWriteDMemtoRegD

ALUSrcD

RegWriteD

OpFunct

ControlUnit

ALUFlagsCLK

ALUControlD

ALU

PCPlus8D

R15

3:0

31:28

FlagWriteD

15:12 Rd

15RA1D

RA2D

0 1

Extend

01

01

RegSrcD

CLK

InstrF

CLK

ALUOutM ALUOutWWA3E WA3M WA3W

CLK CLK

MemWriteEMemtoRegE

ALUSrcE

RegWriteE

ALUControlEMemWriteMMemtoRegMRegWriteM

MemtoRegWRegWriteW

BranchD

FlagsE

FlagWriteE

BranchE

CondE

CondExE

10

PCSrcD PCSrcE PCSrcM PCSrcW

Flags'CondUnit

• Same control unit as single-cycle processor• Control delayed to proper pipeline stage

Pipelining hazards

19

Situations that prevent starting the next instruction in the next cycle

1. Structural hazards• A required resource is busy

2. Data hazard• Need to wait for previous instruction to

complete its data read/write3. Control hazard

• Deciding on control action depends on previous instruction

1. Structural hazards

20

Conflict for use of a resource In ARMs pipeline with a single memory

Load/store requires data access Instruction fetch would have to stall for that

cycle Would cause a pipeline “bubble”

Hence, pipelined datapaths require separate instruction/data memories Or separate instruction/data caches

2. Data Hazards: compute-use

21

Time (cycles)

ADD R1, R4, R5 RF R5

R4RF

R1+ DM

RF R3

R1RF

R8& DM

RF R1

R6RF

R9| DM

RF R7

R1RF

R10- DM

AND R8, R1, R3

ORR R9, R6, R1

SUB R10, R1, R7

1 2 3 4 5 6 7 8

AND

IM

IM

IM

IM ADD

ORR

SUB

Data Hazard: load-use

22

LDR R1, [R4, #40]

AND R8, R1, R3

ORR R9, R6, R1

SUB R10, R1, R7

Handling data hazards

23

A. Compile-time techniques B. Forward data at run timeC. Stall the processor at run time

A. Data hazard elimination using compile-time techniques (nop)

24

• Insert enough nops until result is ready (wastes cycles)

Time (cycles)


R4RF

R1+ DM

RF R3

R1RF

R8& DM

RF R1

R6RF

R9| DM

RF R7

R1RF

R10- DM

AND R8, R1, R3

ORR R9, R6, R1

SUB R10, R1, R7

1 2 3 4 5 6 7 8

AND

IM

IM

IM

IM ADD

ORR

SUB

NOP

NOP

RF RFDMNOPIM

RF RFDMNOPIM

9 10

A. Data hazard elimination using compile-time techniques (code rescheduling)

25

Reorder code to avoid use of load result in the next instruction

Compiler must be aware of pipeline structure

ADD R1, R4, R5

ADD R8, R1, R3

AND R9, R6, R1

SUB R10, R2, R7

ADD R1, R4, R5

SUB R10, R2, R7

NOP

ADD R8, R1, R3

AND R9, R6, R1

Rescheduling saved one cycles!

B. Data hazard elimination using data forwarding/bypassing during runtime

26

Don’t wait for result to be stored in a register forward the results whenever the results

Requires extra connections in the datapath

Time (cycles)


R4RF

R1+ DM

RF R3

R1RF

R8& DM

RF R1

R6RF

R9| DM

RF R7

R1RF

R10- DM

AND R8, R1, R3

ORR R9, R6, R1

SUB R10, R1, R7

1 2 3 4 5 6 7 8

AND

IM

IM

IM

IM ADD

ORR

SUB

• Check if register read in Execute stage matches register written in Memory or Writeback stage

• If so, forward result

Circuitry for forwarding

27

ExtImm

E

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

RegisterFile

01

A RDData

MemoryWD

WE

10

PCFPC'

InstrD

19:16

15:12

23:0

25:20

SrcBE


WriteDataE

SrcAE

PCPlus4F

ResultW

27:26

ImmSrcD

MemWriteDMemtoRegD

ALUSrcD

RegWriteD

OpFunct

ControlUnit

ALUFlagsCLK

ALUControlD

ALU

PCPlus8D

R15

3:0

31:28

FlagWriteD

15:12 Rd

15RA1D

RA2D

0 1

Extend

01

01

RegSrcD

CLK

InstrF

CLK


CLK CLK

MemWriteEMemtoRegE

ALUSrcE

RegWriteE


MemtoRegWRegWriteW

BranchD

FlagsE

FlagWriteE

BranchE

CondE

CondExE

10


Flags'CondUnit

000110

000110

Hazard Unit

ForwardA

EForw

ardBE

RegW

riteM

Match

RegW

riteWCLK

Presenter

Presentation Notes

Problem in book with 2nd forWARD MUZ

To forward or not to forward!

28

• Execute stage register matches Memory stage register?Match_1E_M = (RA1E == WA3M)Match_2E_M = (RA2E == WA3M)

• Execute stage register matches Writeback stage register?Match_1E_W = (RA1E == WA3W)Match_2E_W = (RA2E == WA3W)

• If it matches, forward result:

if (Match_1E_M • RegWriteM) ForwardAE = 10; else if (Match_1E_W • RegWriteW) ForwardAE = 01; else ForwardAE = 00;

Double data hazard

29

Consider the sequence:ADD R1,R1,R2ADD R1,R1,R3ADD R1,R1,R4

Both hazards occur Want to use the most recent

Revise MEM hazard condition Give priority to EX results. That is, only fwd

from MEM if EX hazard condition isn’t true

Pipelining hazards

30

Structural hazards2. Data hazard3. Control hazard

Compile-time techniques Forward data at run timeC. Stall the processor at run time

Stalling

3131

Time (cycles)

LDR R1, [R4, #40] RF 40

R4RF

R1+ DM

RF R3

R1RF

R8& DM

RF R1

R6RF

R9| DM

RF R7

R1RF

R10- DM

AND R8, R1, R3

ORR R9, R6, R1

SUB R10, R1, R7

1 2 3 4 5 6 7 8

AND

IM

IM

IM

IM LDR

ORR

SUB

Trouble!

Forwarding is not going to eliminate all hazards

32

Time (cycles)

LDR R1, [R4, #40] RF 40

R4RF

R1+ DM

RF R3

R1RF

R8& DM

RF R1

R6RF

R9| DM

RF R7

R1RF

R10- DM

AND R8, R1, R3

ORR R9, R6, R1

SUB R10, R1, R7

1 2 3 4 5 6 7 8

AND

IM

IM

IM

IM LDR

ORR

SUB

9

RF R3

R1

IM ORR

Stall

FIX C. Data hazard elimination by stalling

33

stall inserted here

clock cycle wasted − necessary for correctness

Stalling HW

34

ExtImm

E

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

RegisterFile

01

A RDData

MemoryWD

WE

10

PCFPC'

InstrD

19:16

15:12

23:0

25:20

SrcBE


WriteDataE

SrcAE

PCPlus4F

ResultW

27:26

ImmSrcD

MemWriteDMemtoRegD

ALUSrcD

RegWriteD

OpFunct

ControlUnit

ALUFlagsCLK

ALUControlD

ALU

PCPlus8D

R15

3:0

31:28

FlagWriteD

15:12 Rd

15RA1D

RA2D

0 1

Extend

01

01

RegSrcD

CLK

InstrF

CLK


CLK CLK

MemWriteEMemtoRegE

ALUSrcE

RegWriteE


MemtoRegWRegWriteW

BranchD

FlagsE

FlagWriteE

BranchE

CondE

CondExE

10


Flags'CondUnit

000110

000110

Hazard Unit

ForwardA

EForw

ardBE

RegW

riteM

Match

RegW

riteW

Mem

toRegE

StallF

StallD

FlushE

EN

CLR

CLRE

N

FlushD

CLK

To stall or not to stall!

35

• Is either source register in the Decode stage the same as the one being written in the Execute stage?

Match_12D_E = (RA1D == WA3E) | (RA2D == WA3E)

• Is a LDR in the Execute stage AND Match_12D_E?

ldrstall = Match_12D_E • MemtoRegEStallF = StallD = FlushE = ldrstall

Data hazard summary

36

Compiler can arrange code to avoid hazards and stalls. Requires knowledge of the pipeline structure

Forwarding can sometimes avoids stalls at the expense of extra hardware complexity

Stalls reduce performance by increasing the average cycles per instruction (CPI). But sometimes are absolutely necessary to get correct results

3. Control hazards

37

• B: – branch not determined until the Writeback stage

of pipeline– Instructions after branch fetched before branch

occurs– These 4 instructions must be flushed if branch

happens

• Writes to PC (R15) similar

Control hazards

38

Time (cycles)

B 3C RF RFDM

RF R3

R1RF& DM

RF R1

R6RF| DM

AND R8, R1, R3

ORR R9, R6, R1

SUB R10, R1, R7

1 2 3 4 5 6 7 8

AND

IM

IM

IM B

ORR

20

24

28

2C

34... ...

9

Flushthese

instructions

64 ADD R12, R3, R4 RF R4

R3RF

R12+ DMIM ADD

RF R7

R1RF- DMIM SUB

RF R8

R1RF- DMIM SUBSUB R11, R1, R830

10

Branch misprediction penalty• number of instruction flushed when branch is taken (4)• May be reduced by determining BTA earlier

Early branch resolution

39

• Determine BTA in Execute stage– Branch misprediction penalty = 2 cycles

• Hardware changes– Add a branch multiplexer before PC register to select BTA

from ALUResultE– Add BranchTakenE select signal for this multiplexer (only

asserted if branch condition satisfied)– PCSrcW now only asserted for writes to PC

Pipelined processor with early BTA

40

ExtImm

E

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

RegisterFile

01

A RDData

MemoryWD

WE

10

PCF01

PC'

InstrD

19:16

15:12

23:0

25:20

SrcBE


WriteDataE

SrcAE

PCPlus4F

ResultW

27:26

ImmSrcD

MemWriteDMemtoRegD

ALUSrcD

RegWriteD

OpFunct

ControlUnit

ALUFlagsCLK

ALUControlD

ALU

PCPlus8D

R15

3:0

31:28

FlagWriteD

15:12 Rd

15RA1D

RA2D

0 1

Extend

01

01

RegSrcD

CLK

InstrF

CLK

ALUOutM ALUOutW

000110

000110

WA3E WA3M WA3W

CLK CLK

MemWriteEMemtoRegE

ALUSrcE

RegWriteE


MemtoRegWRegWriteW

BranchD

FlagsE

FlagWriteE

BranchE

CondEC

ondExE

Hazard Unit

StallF

StallD

FlushE

ForwardA

EForw

ardBE

EN

CLR

CLRE

N

10


FlushD

Flags'CondUnit

BranchTakenE

RegW

riteM

Match

RegW

riteW

Mem

toRegE

CLK

Control hazards with early BTA

41

Time (cycles)

B 3C RF RFDM

RF R3

R1RF& DM

RF R1

R6RF| DM

AND R8, R1, R3

ORR R9, R6, R1

SUB R10, R1, R7

1 2 3 4 5 6 7 8

AND

IM

IM

IM B

ORR

20

24

28

2C

34... ...

9

Flushthese

instructions

64 ADD R12, R3, R4 RF R4

R3RF

R12+ DMIM ADD

SUB R11, R1, R830

10

Control stalling logic

42

• PCWrPendingF = 1 if write to PC in Decode, Execute or Memory

PCWrPendingF = PCSrcD + PCSrcE + PCSrcM

• Stall Fetch if PCWrPendingFStallF = ldrStallD + PCWrPendingF

• Flush Decode if PCWrPendingF OR PC is written in Writeback OR branch is taken

FlushD = PCWrPendingF + PCSrcW + BranchTakenE

• Flush Execute if branch is takenFlushE = ldrStallD + BranchTakenE

• Stall Decode if ldrStallD (as before)StallD = ldrStallD

ARM Pipelined Processor with Hazard Unit

43

ExtImm

E

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

RegisterFile

01

A RDData

MemoryWD

WE

10

PCF01

PC'

InstrD

19:16

15:12

23:0

25:20

SrcBE


WriteDataE

SrcAE

PCPlus4F

ResultW

27:26

ImmSrcD

MemWriteDMemtoRegD

ALUSrcD

RegWriteD

OpFunct

ControlUnit

ALUFlagsCLK

ALUControlD

ALU

PCPlus8D

R15

3:0

31:28

FlagWriteD

15:12 Rd

15RA1D

RA2D

0 1

Extend

01

01

RegSrcD

CLK

InstrF

CLK

ALUOutM ALUOutW

000110

000110

WA3E WA3M WA3W

CLK CLK

MemWriteEMemtoRegE

ALUSrcE

RegWriteE


MemtoRegWRegWriteW

BranchD

FlagsE

FlagWriteE

BranchE

CondEC

ondExE

Hazard Unit

StallF

StallD

FlushE

ForwardA

EForw

ardBE

EN

CLR

CLRE

N

10


FlushD

Flags'CondUnit

BranchTakenE

RegW

riteM

Match

RegW

riteW

Mem

toRegE

CLK

Branch prediction

44

• Ideal pipelined processor: CPI = 1• Branch misprediction increases CPI• Static branch prediction:

– Always not taken– Always taken– Check direction of branch (forward or backward): If

backward, predict taken; else, predict not taken• Dynamic branch prediction:

– Make a dynamic based on history of branches

In all cases, branch must be executed to see if prediction was correct if not, flush instructions and resume from correct direction!

Eliminating 2-cycle stall for taken-prediction policy with branch target buffer

45

Even with predictor, still need to calculate the target address 2-cycle penalty for a taken branch

Branch target buffer Cache of target addresses Indexed by PC when instruction fetched

If hit and instruction is branch predicted taken, can fetch target immediately – no 2-cycle penalty

Branch target buffer

46

Branch PC Branch target address

=

No: it is not a branchNext PC = PC+4

Yes: it is a branch andPC = branch target address

PC

IM

Pipeline reg

MU

X rest of pipeline

2. Dynamic branch prediction

47

In deeper and superscalar pipelines, branch penalty is more significant Use dynamic prediction:

Branch prediction buffer (aka branch history table) indexed by recent branch instruction addresses and stores outcome (taken/not taken)

To execute a branch: Check table, expect the same outcome Start fetching from fall-through or target If wrong, flush pipeline and flip prediction

[floorplan of Pentium

processor]

1-bit branch predictor

48

• Remembers whether branch was taken the last time and does the same thing

• Mispredicts first and last branch of loop• Prediction bits are added to BTB entries

MOV R1, #0 ; R1 = sum

MOV R0, #0 ; R0 = i

FOR ; for (i=0; i<10; i=i+1)

CMP R0, #10

BGE DONE

ADD R1, R1, R0 ; sum = sum + i

ADD R0, R0, #1

B FOR

DONE

Problem with 1-bit predictor

49

Inner loop branches mispredicted twice!outer: …

…inner: …

…beq …, …, inner…beq …, …, outer

Mispredict as taken on last iteration of inner loop Then mispredict as not taken on first iteration of

inner loop next time around

2-bit predictor

50

Only change prediction on two successive mispredictions

Summary

• Pipelining for speedup

• Ideal speedup = number of stages, but actual speedup depends on delay balance between stages (clock frequency), delays introduced by pipeline registers, and number of stalls (CPI).

• Hazards (structural, data, and control) can increase CPI

• Hazards can be eliminated or mitigated using code reorganization, stalling, flushing, forwarding / bypassing

• Branch prediction, branch prediction buffer, branch target buffer can reduce stalls arising from control hazards

51

Documents

ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE