Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
1[ material from Patterson & Hennessy and Harris]
ENGN1640: Design of Computing SystemsTopic 05: Pipeline Processor Design
Professor Sherief Redahttp://scale.engin.brown.edu
Electrical Sciences and Computer EngineeringSchool of Engineering
Brown UniversitySpring 2016
Pipelining analogy
2
Pipelined laundry: overlapping execution Parallelism improves performance
Four loads: Speedup
= 16/7 = 2.3 Non-stop:
Ideal Speedup= 4n/(n + 3) ≈ 4= number of stages
Pipelined ARM processor
3
• Temporal parallelism• Divide single-cycle processor into 5 stages:
– Fetch– Decode– Execute– Memory– Writeback
• Add pipeline registers between stages
Single-Cycle vs Pipelined
4
Time (ps)Instr
FetchInstruction
DecRead Reg
ExecuteALU
MemoryRead / Write
WrReg1
2
0 100 200 300 400 500 600 700 800 900 1100 1200 1300 1400 15001000
Instr
1
2
(b)
3
FetchInstruction
DecRead Reg
ExecuteALU
MemoryRead / Write
WrReg
FetchInstruction
DecRead Reg
ExecuteALU
MemoryRead / Write
WrReg
FetchInstruction
DecRead Reg
ExecuteALU
MemoryRead / Write
WrReg
FetchInstruction
DecRead Reg
ExecuteALU
MemoryRead / Write
WrReg
Single-Cycle
Pipelined
Pipeline datapath abstraction
5
Time (cycles)
LDR R2, [R0, #40] RF 40
R0RF
R2+ DM
RF R10
R9RF
R3+ DM
RF R5
R1RF
R4- DM
RF R13
R12RF
R5& DM
RF 20
R1RF
R6+ DM
RF 42
R11RF
R7| DM
ADD R3, R9, R10
SUB R4, R1, R5
AND R5, R12, R13
STR R6, [R1, #20]
ORR R7, R11, #42
1 2 3 4 5 6 7 8 9 10
ADD
IM
IM
IM
IM
IM
IM LDR
SUB
AND
STR
ORR
Single-cycle & pipelined datapath
6
ExtImm
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
RegisterFile
01
A RDData
MemoryWD
WE
10
PC10
PC'
Instr
19:16
15:12
23:0
SrcB
ALUResult ReadData
WriteData
SrcA
PCPlus4
Result
CLK
ALU
PCPlus8 R15
3:0
+
4
15RA1
RA2
Extend
01
01
ExtImmE
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
RegisterFile
01
A RDData
MemoryWD
WE
10
PCF10
PC'
InstrD
19:16
15:12
23:0
SrcBE
ALUResultE ReadDataW
WriteDataE
SrcAE
PCPlus4F
ResultW
CLK
ALU
PCPlus8 R15
3:0
+
4
15RA1D
RA2D
Extend
01
01
CLK CLK CLK CLK
Fetch Decode Execute Memory Writeback
InstrF
ALUOutM ALUOutW
WA3D
Single-Cycle
Pipelined
ExtImmE
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
RegisterFile
01
A RDData
MemoryWD
WE
10
PCF10
PC'
InstrD
19:16
15:12
23:0
SrcBE
ALUResultE ReadDataW
WriteDataE
SrcAE
PCPlus4F
ResultW
CLK
ALU
PCPlus8
R15
3:0
+
4
15RA1D
RA2D
Extend
01
01
CLK CLK CLK CLK
InstrF
ALUOutM ALUOutWWA3E WA3M WA3WWA3D
• WA3 must arrive at same time as Result• Register file written on falling edge of CLK
Optimized pipeline datapath
7
Remove adder by using PCPlus4F after PC has been updated to PC+4Assumes writing happens (e.g., in first half of clock cycle) before reading
ExtImmE
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
RegisterFile
01
A RDData
MemoryWD
WE
10
PCF10
PC'
InstrD
19:16
15:12
23:0
SrcBE
ALUResultE ReadDataW
WriteDataE
SrcAE
PCPlus4F
ResultW
CLK
ALU
R15
3:015
RA1D
RA2D
Extend
01
01
CLK CLK CLK CLK
InstrF
ALUOutM ALUOutWWA3E WA3M WA3WWA3D
PCPlus8D
ExtImmE
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
RegisterFile
01
A RDData
MemoryWD
WE
10
PCF10
PC'
InstrD
19:16
15:12
23:0
SrcBE
ALUResultE ReadDataW
WriteDataE
SrcAE
PCPlus4F
ResultW
CLK
ALU
PCPlus8
R15
3:0
+
4
15RA1D
RA2D
Extend
01
01
CLK CLK CLK CLK
InstrF
ALUOutM ALUOutWWA3E WA3M WA3WWA3D
Tracing LDR in its journey: 1st cycle
8
ExtImmE
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
RegisterFile
01
A RDData
MemoryWD
WE
10
PCF10
PC'
InstrD
19:16
15:12
23:0
SrcBE
ALUResultE ReadDataW
WriteDataE
SrcAE
PCPlus4F
ResultW
CLK
ALU
R15
3:015
RA1D
RA2D
Extend
01
01
CLK CLK CLK CLK
InstrF
ALUOutM ALUOutWWA3E WA3M WA3WWA3D
PCPlus8D
LDR fetch
Tracing LDR in its journey: 2nd cycle
9
ExtImmE
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
RegisterFile
01
A RDData
MemoryWD
WE
10
PCF10
PC'
InstrD
19:16
15:12
23:0
SrcBE
ALUResultE ReadDataW
WriteDataE
SrcAE
PCPlus4F
ResultW
CLK
ALU
R15
3:015
RA1D
RA2D
Extend
01
01
CLK CLK CLK CLK
InstrF
ALUOutM ALUOutWWA3E WA3M WA3WWA3D
PCPlus8D
LDR decode
Tracing LDR in its journey: 3rd cycle
10
ExtImmE
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
RegisterFile
01
A RDData
MemoryWD
WE
10
PCF10
PC'
InstrD
19:16
15:12
23:0
SrcBE
ALUResultE ReadDataW
WriteDataE
SrcAE
PCPlus4F
ResultW
CLK
ALU
R15
3:015
RA1D
RA2D
Extend
01
01
CLK CLK CLK CLK
InstrF
ALUOutM ALUOutWWA3E WA3M WA3WWA3D
PCPlus8D
LDR EXE
Tracing LDR in its journey: 4th cycle
11
ExtImmE
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
RegisterFile
01
A RDData
MemoryWD
WE
10
PCF10
PC'
InstrD
19:16
15:12
23:0
SrcBE
ALUResultE ReadDataW
WriteDataE
SrcAE
PCPlus4F
ResultW
CLK
ALU
R15
3:015
RA1D
RA2D
Extend
01
01
CLK CLK CLK CLK
InstrF
ALUOutM ALUOutWWA3E WA3M WA3WWA3D
PCPlus8D
LDR mem
Tracing LDR in its journey: 5th cycle
12
ExtImmE
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
RegisterFile
01
A RDData
MemoryWD
WE
10
PCF10
PC'
InstrD
19:16
15:12
23:0
SrcBE
ALUResultE ReadDataW
WriteDataE
SrcAE
PCPlus4F
ResultW
CLK
ALU
R15
3:015
RA1D
RA2D
Extend
01
01
CLK CLK CLK CLK
InstrF
ALUOutM ALUOutWWA3E WA3M WA3WWA3D
PCPlus8D
LDR WBLDR WB
Pipeline performance
13
Assume time for stages is 100ps for register read or write 200ps for other stages
Compare pipelined datapath with single-cycle datapath
Instr Instr fetch Register read
ALU op Memory access
Register write
Total time
LDR 200ps 100 ps 200ps 200ps 100 ps 800ps
STR 200ps 100 ps 200ps 200ps 700ps
data op 200ps 100 ps 200ps 100 ps 600ps
Branch 200ps 100 ps 200ps 500ps
Single-cycle versus pipeline performance
14
Single-cycle (Tc= 800ps)
Pipelined (Tc= 200ps)
LDR R1,[R5]
LDR R2, [R6]
LDR R3, [R7]
LDR R1,[R5]
LDR R2, [R6]
LDR R3, [R7]
Pipeline speedup
15
If all stages are balanced i.e., all take the same time Time between instructionspipelined
= Time between instructionsnonpipelinedNumber of stages
Ideal speedup (n instructions and s stages)
If not balanced, speedup is less Speedup due to increased throughput
Latency (time for each instruction) does not decrease Branches will also reduce the speedup Added pipeline registers reduce the speedup
Reminder of single-cycle control
16
ExtImm
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
RegisterFile
01
A RDData
MemoryWD
WE
10
PC10
PC'
Instr
19:16
15:12
23:0
25:20
SrcB
ALUResult ReadData
WriteData
SrcA
PCPlus4
Result
27:26
ImmSrc
PCSrc
MemWriteMemtoReg
ALUSrc
RegWrite
OpFunct
ControlUnit
ALUFlags
CLK
ALUControl
ALU
PCPlus8 R15
3:0
Cond31:28
Flags
15:12 Rd
+
4
15RA1
RA2
0 1
Extend
01
01
RegSrc
Modifications to pipeline control
17
Control signals derived from instruction Same as in single-cycle implementation Control delayed to proper pipeline stage
Pipelined datapath + control
18
ExtImm
E
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
RegisterFile
01
A RDData
MemoryWD
WE
10
PCFPC'
InstrD
19:16
15:12
23:0
25:20
SrcBE
ALUResultE ReadDataW
WriteDataE
SrcAE
PCPlus4F
ResultW
27:26
ImmSrcD
MemWriteDMemtoRegD
ALUSrcD
RegWriteD
OpFunct
ControlUnit
ALUFlagsCLK
ALUControlD
ALU
PCPlus8D
R15
3:0
31:28
FlagWriteD
15:12 Rd
15RA1D
RA2D
0 1
Extend
01
01
RegSrcD
CLK
InstrF
CLK
ALUOutM ALUOutWWA3E WA3M WA3W
CLK CLK
MemWriteEMemtoRegE
ALUSrcE
RegWriteE
ALUControlEMemWriteMMemtoRegMRegWriteM
MemtoRegWRegWriteW
BranchD
FlagsE
FlagWriteE
BranchE
CondE
CondExE
10
PCSrcD PCSrcE PCSrcM PCSrcW
Flags'CondUnit
• Same control unit as single-cycle processor• Control delayed to proper pipeline stage
Pipelining hazards
19
Situations that prevent starting the next instruction in the next cycle
1. Structural hazards• A required resource is busy
2. Data hazard• Need to wait for previous instruction to
complete its data read/write3. Control hazard
• Deciding on control action depends on previous instruction
1. Structural hazards
20
Conflict for use of a resource In ARMs pipeline with a single memory
Load/store requires data access Instruction fetch would have to stall for that
cycle Would cause a pipeline “bubble”
Hence, pipelined datapaths require separate instruction/data memories Or separate instruction/data caches
2. Data Hazards: compute-use
21
Time (cycles)
ADD R1, R4, R5 RF R5
R4RF
R1+ DM
RF R3
R1RF
R8& DM
RF R1
R6RF
R9| DM
RF R7
R1RF
R10- DM
AND R8, R1, R3
ORR R9, R6, R1
SUB R10, R1, R7
1 2 3 4 5 6 7 8
AND
IM
IM
IM
IM ADD
ORR
SUB
Data Hazard: load-use
22
LDR R1, [R4, #40]
AND R8, R1, R3
ORR R9, R6, R1
SUB R10, R1, R7
Handling data hazards
23
A. Compile-time techniques B. Forward data at run timeC. Stall the processor at run time
A. Data hazard elimination using compile-time techniques (nop)
24
• Insert enough nops until result is ready (wastes cycles)
Time (cycles)
ADD R1, R4, R5 RF R5
R4RF
R1+ DM
RF R3
R1RF
R8& DM
RF R1
R6RF
R9| DM
RF R7
R1RF
R10- DM
AND R8, R1, R3
ORR R9, R6, R1
SUB R10, R1, R7
1 2 3 4 5 6 7 8
AND
IM
IM
IM
IM ADD
ORR
SUB
NOP
NOP
RF RFDMNOPIM
RF RFDMNOPIM
9 10
A. Data hazard elimination using compile-time techniques (code rescheduling)
25
Reorder code to avoid use of load result in the next instruction
Compiler must be aware of pipeline structure
ADD R1, R4, R5
ADD R8, R1, R3
AND R9, R6, R1
SUB R10, R2, R7
ADD R1, R4, R5
SUB R10, R2, R7
NOP
ADD R8, R1, R3
AND R9, R6, R1
Rescheduling saved one cycles!
B. Data hazard elimination using data forwarding/bypassing during runtime
26
Don’t wait for result to be stored in a register forward the results whenever the results
Requires extra connections in the datapath
Time (cycles)
ADD R1, R4, R5 RF R5
R4RF
R1+ DM
RF R3
R1RF
R8& DM
RF R1
R6RF
R9| DM
RF R7
R1RF
R10- DM
AND R8, R1, R3
ORR R9, R6, R1
SUB R10, R1, R7
1 2 3 4 5 6 7 8
AND
IM
IM
IM
IM ADD
ORR
SUB
• Check if register read in Execute stage matches register written in Memory or Writeback stage
• If so, forward result
Circuitry for forwarding
27
ExtImm
E
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
RegisterFile
01
A RDData
MemoryWD
WE
10
PCFPC'
InstrD
19:16
15:12
23:0
25:20
SrcBE
ALUResultE ReadDataW
WriteDataE
SrcAE
PCPlus4F
ResultW
27:26
ImmSrcD
MemWriteDMemtoRegD
ALUSrcD
RegWriteD
OpFunct
ControlUnit
ALUFlagsCLK
ALUControlD
ALU
PCPlus8D
R15
3:0
31:28
FlagWriteD
15:12 Rd
15RA1D
RA2D
0 1
Extend
01
01
RegSrcD
CLK
InstrF
CLK
ALUOutM ALUOutWWA3E WA3M WA3W
CLK CLK
MemWriteEMemtoRegE
ALUSrcE
RegWriteE
ALUControlEMemWriteMMemtoRegMRegWriteM
MemtoRegWRegWriteW
BranchD
FlagsE
FlagWriteE
BranchE
CondE
CondExE
10
PCSrcD PCSrcE PCSrcM PCSrcW
Flags'CondUnit
000110
000110
Hazard Unit
ForwardA
EForw
ardBE
RegW
riteM
Match
RegW
riteWCLK
To forward or not to forward!
28
• Execute stage register matches Memory stage register?Match_1E_M = (RA1E == WA3M)Match_2E_M = (RA2E == WA3M)
• Execute stage register matches Writeback stage register?Match_1E_W = (RA1E == WA3W)Match_2E_W = (RA2E == WA3W)
• If it matches, forward result:
if (Match_1E_M • RegWriteM) ForwardAE = 10; else if (Match_1E_W • RegWriteW) ForwardAE = 01; else ForwardAE = 00;
Double data hazard
29
Consider the sequence:ADD R1,R1,R2ADD R1,R1,R3ADD R1,R1,R4
Both hazards occur Want to use the most recent
Revise MEM hazard condition Give priority to EX results. That is, only fwd
from MEM if EX hazard condition isn’t true
Pipelining hazards
30
Structural hazards2. Data hazard3. Control hazard
Compile-time techniques Forward data at run timeC. Stall the processor at run time
Stalling
3131
Time (cycles)
LDR R1, [R4, #40] RF 40
R4RF
R1+ DM
RF R3
R1RF
R8& DM
RF R1
R6RF
R9| DM
RF R7
R1RF
R10- DM
AND R8, R1, R3
ORR R9, R6, R1
SUB R10, R1, R7
1 2 3 4 5 6 7 8
AND
IM
IM
IM
IM LDR
ORR
SUB
Trouble!
Forwarding is not going to eliminate all hazards
32
Time (cycles)
LDR R1, [R4, #40] RF 40
R4RF
R1+ DM
RF R3
R1RF
R8& DM
RF R1
R6RF
R9| DM
RF R7
R1RF
R10- DM
AND R8, R1, R3
ORR R9, R6, R1
SUB R10, R1, R7
1 2 3 4 5 6 7 8
AND
IM
IM
IM
IM LDR
ORR
SUB
9
RF R3
R1
IM ORR
Stall
FIX C. Data hazard elimination by stalling
33
stall inserted here
clock cycle wasted − necessary for correctness
Stalling HW
34
ExtImm
E
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
RegisterFile
01
A RDData
MemoryWD
WE
10
PCFPC'
InstrD
19:16
15:12
23:0
25:20
SrcBE
ALUResultE ReadDataW
WriteDataE
SrcAE
PCPlus4F
ResultW
27:26
ImmSrcD
MemWriteDMemtoRegD
ALUSrcD
RegWriteD
OpFunct
ControlUnit
ALUFlagsCLK
ALUControlD
ALU
PCPlus8D
R15
3:0
31:28
FlagWriteD
15:12 Rd
15RA1D
RA2D
0 1
Extend
01
01
RegSrcD
CLK
InstrF
CLK
ALUOutM ALUOutWWA3E WA3M WA3W
CLK CLK
MemWriteEMemtoRegE
ALUSrcE
RegWriteE
ALUControlEMemWriteMMemtoRegMRegWriteM
MemtoRegWRegWriteW
BranchD
FlagsE
FlagWriteE
BranchE
CondE
CondExE
10
PCSrcD PCSrcE PCSrcM PCSrcW
Flags'CondUnit
000110
000110
Hazard Unit
ForwardA
EForw
ardBE
RegW
riteM
Match
RegW
riteW
Mem
toRegE
StallF
StallD
FlushE
EN
CLR
CLRE
N
FlushD
CLK
To stall or not to stall!
35
• Is either source register in the Decode stage the same as the one being written in the Execute stage?
Match_12D_E = (RA1D == WA3E) | (RA2D == WA3E)
• Is a LDR in the Execute stage AND Match_12D_E?
ldrstall = Match_12D_E • MemtoRegEStallF = StallD = FlushE = ldrstall
Data hazard summary
36
Compiler can arrange code to avoid hazards and stalls. Requires knowledge of the pipeline structure
Forwarding can sometimes avoids stalls at the expense of extra hardware complexity
Stalls reduce performance by increasing the average cycles per instruction (CPI). But sometimes are absolutely necessary to get correct results
3. Control hazards
37
• B: – branch not determined until the Writeback stage
of pipeline– Instructions after branch fetched before branch
occurs– These 4 instructions must be flushed if branch
happens
• Writes to PC (R15) similar
Control hazards
38
Time (cycles)
B 3C RF RFDM
RF R3
R1RF& DM
RF R1
R6RF| DM
AND R8, R1, R3
ORR R9, R6, R1
SUB R10, R1, R7
1 2 3 4 5 6 7 8
AND
IM
IM
IM B
ORR
20
24
28
2C
34... ...
9
Flushthese
instructions
64 ADD R12, R3, R4 RF R4
R3RF
R12+ DMIM ADD
RF R7
R1RF- DMIM SUB
RF R8
R1RF- DMIM SUBSUB R11, R1, R830
10
Branch misprediction penalty• number of instruction flushed when branch is taken (4)• May be reduced by determining BTA earlier
Early branch resolution
39
• Determine BTA in Execute stage– Branch misprediction penalty = 2 cycles
• Hardware changes– Add a branch multiplexer before PC register to select BTA
from ALUResultE– Add BranchTakenE select signal for this multiplexer (only
asserted if branch condition satisfied)– PCSrcW now only asserted for writes to PC
Pipelined processor with early BTA
40
ExtImm
E
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
RegisterFile
01
A RDData
MemoryWD
WE
10
PCF01
PC'
InstrD
19:16
15:12
23:0
25:20
SrcBE
ALUResultE ReadDataW
WriteDataE
SrcAE
PCPlus4F
ResultW
27:26
ImmSrcD
MemWriteDMemtoRegD
ALUSrcD
RegWriteD
OpFunct
ControlUnit
ALUFlagsCLK
ALUControlD
ALU
PCPlus8D
R15
3:0
31:28
FlagWriteD
15:12 Rd
15RA1D
RA2D
0 1
Extend
01
01
RegSrcD
CLK
InstrF
CLK
ALUOutM ALUOutW
000110
000110
WA3E WA3M WA3W
CLK CLK
MemWriteEMemtoRegE
ALUSrcE
RegWriteE
ALUControlEMemWriteMMemtoRegMRegWriteM
MemtoRegWRegWriteW
BranchD
FlagsE
FlagWriteE
BranchE
CondEC
ondExE
Hazard Unit
StallF
StallD
FlushE
ForwardA
EForw
ardBE
EN
CLR
CLRE
N
10
PCSrcD PCSrcE PCSrcM PCSrcW
FlushD
Flags'CondUnit
BranchTakenE
RegW
riteM
Match
RegW
riteW
Mem
toRegE
CLK
Control hazards with early BTA
41
Time (cycles)
B 3C RF RFDM
RF R3
R1RF& DM
RF R1
R6RF| DM
AND R8, R1, R3
ORR R9, R6, R1
SUB R10, R1, R7
1 2 3 4 5 6 7 8
AND
IM
IM
IM B
ORR
20
24
28
2C
34... ...
9
Flushthese
instructions
64 ADD R12, R3, R4 RF R4
R3RF
R12+ DMIM ADD
SUB R11, R1, R830
10
Control stalling logic
42
• PCWrPendingF = 1 if write to PC in Decode, Execute or Memory
PCWrPendingF = PCSrcD + PCSrcE + PCSrcM
• Stall Fetch if PCWrPendingFStallF = ldrStallD + PCWrPendingF
• Flush Decode if PCWrPendingF OR PC is written in Writeback OR branch is taken
FlushD = PCWrPendingF + PCSrcW + BranchTakenE
• Flush Execute if branch is takenFlushE = ldrStallD + BranchTakenE
• Stall Decode if ldrStallD (as before)StallD = ldrStallD
ARM Pipelined Processor with Hazard Unit
43
ExtImm
E
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
RegisterFile
01
A RDData
MemoryWD
WE
10
PCF01
PC'
InstrD
19:16
15:12
23:0
25:20
SrcBE
ALUResultE ReadDataW
WriteDataE
SrcAE
PCPlus4F
ResultW
27:26
ImmSrcD
MemWriteDMemtoRegD
ALUSrcD
RegWriteD
OpFunct
ControlUnit
ALUFlagsCLK
ALUControlD
ALU
PCPlus8D
R15
3:0
31:28
FlagWriteD
15:12 Rd
15RA1D
RA2D
0 1
Extend
01
01
RegSrcD
CLK
InstrF
CLK
ALUOutM ALUOutW
000110
000110
WA3E WA3M WA3W
CLK CLK
MemWriteEMemtoRegE
ALUSrcE
RegWriteE
ALUControlEMemWriteMMemtoRegMRegWriteM
MemtoRegWRegWriteW
BranchD
FlagsE
FlagWriteE
BranchE
CondEC
ondExE
Hazard Unit
StallF
StallD
FlushE
ForwardA
EForw
ardBE
EN
CLR
CLRE
N
10
PCSrcD PCSrcE PCSrcM PCSrcW
FlushD
Flags'CondUnit
BranchTakenE
RegW
riteM
Match
RegW
riteW
Mem
toRegE
CLK
Branch prediction
44
• Ideal pipelined processor: CPI = 1• Branch misprediction increases CPI• Static branch prediction:
– Always not taken– Always taken– Check direction of branch (forward or backward): If
backward, predict taken; else, predict not taken• Dynamic branch prediction:
– Make a dynamic based on history of branches
In all cases, branch must be executed to see if prediction was correct if not, flush instructions and resume from correct direction!
Eliminating 2-cycle stall for taken-prediction policy with branch target buffer
45
Even with predictor, still need to calculate the target address 2-cycle penalty for a taken branch
Branch target buffer Cache of target addresses Indexed by PC when instruction fetched
If hit and instruction is branch predicted taken, can fetch target immediately – no 2-cycle penalty
Branch target buffer
46
Branch PC Branch target address
=
No: it is not a branchNext PC = PC+4
Yes: it is a branch andPC = branch target address
PC
IM
Pipeline reg
MU
X rest of pipeline
2. Dynamic branch prediction
47
In deeper and superscalar pipelines, branch penalty is more significant Use dynamic prediction:
Branch prediction buffer (aka branch history table) indexed by recent branch instruction addresses and stores outcome (taken/not taken)
To execute a branch: Check table, expect the same outcome Start fetching from fall-through or target If wrong, flush pipeline and flip prediction
[floorplan of Pentium
processor]
1-bit branch predictor
48
• Remembers whether branch was taken the last time and does the same thing
• Mispredicts first and last branch of loop• Prediction bits are added to BTB entries
MOV R1, #0 ; R1 = sum
MOV R0, #0 ; R0 = i
FOR ; for (i=0; i<10; i=i+1)
CMP R0, #10
BGE DONE
ADD R1, R1, R0 ; sum = sum + i
ADD R0, R0, #1
B FOR
DONE
Problem with 1-bit predictor
49
Inner loop branches mispredicted twice!outer: …
…inner: …
…beq …, …, inner…beq …, …, outer
Mispredict as taken on last iteration of inner loop Then mispredict as not taken on first iteration of
inner loop next time around
2-bit predictor
50
Only change prediction on two successive mispredictions
Summary
• Pipelining for speedup
• Ideal speedup = number of stages, but actual speedup depends on delay balance between stages (clock frequency), delays introduced by pipeline registers, and number of stalls (CPI).
• Hazards (structural, data, and control) can increase CPI
• Hazards can be eliminated or mitigated using code reorganization, stalling, flushing, forwarding / bypassing
• Branch prediction, branch prediction buffer, branch target buffer can reduce stalls arising from control hazards
51