INTRODUCTION TO ADVANCED PIPELINING
Lecture 5
Pipelined Processor: Datapath + Control
[Figure: pipelined datapath with control. Units: PC, instruction memory (Imem), register file (Regs), sign extend, shift left 2, ALU and ALU control, data memory (Dmem). Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB, carrying the EX, M and WB control fields. Control signals: RegWrite, ALUSrc, ALUOp, RegDst, MemRead, MemWrite, MemtoReg, Branch, PCSrc.]
Control Hazard on Branches: Three Stage Stall
Four Branch Hazard Alternatives (drawn in subsequent slides)
#1: Stall until branch direction is clear
  - 3 slots of delay
  - Moving the decision to the 2nd stage by testing the registers there saves 2 cycles (See Fig)
#2: Predict Branch Not Taken
  - Execute successor instructions in sequence
  - “Squash” instructions in the pipeline if the branch is actually taken
  - 47% of branches are not taken on average
  - PC+4 is already calculated, so use it to get the next instruction
#3: Predict Branch Taken
  - 53% of branches are taken on average
  - But the branch target address has not been calculated yet
  - Move the branch adder to the 2nd stage
  - Still incurs a 1 cycle branch penalty. Why?
#4: Dynamic Branch Prediction
  - Keep a history of branches and predict accordingly
  - 90% accuracy; employed in most CPUs
Reducing Stalls
Stall: wait until the decision is clear
  - To stall the pipeline, clear the instructions already in the pipeline: clear the contents of the IF/ID, ID/EX and EX/MEM registers.
Move the decision up to the 2nd stage by adding hardware to check the registers as they are read
  - Adopted by many MIPS processors (see Fig. 6.51)
  - Penalty: 1 cycle
Use Exclusive OR to compare the register outputs in the 2nd stage and enable the branch condition, instead of waiting for the comparison by the ALU in the 3rd stage.
Flush the instruction in the IF stage by adding a control line called IF.Flush (Fig. 6.51) that zeros the IF/ID pipeline register, turning the fetched instruction into a no-operation.
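The XOR comparison can be sketched in software as follows. This is only an illustration of the idea, not the hardware in Fig. 6.51; the function name `branch_equal` and the 32-bit default width are assumptions made for the example.

```python
def branch_equal(rs_value: int, rt_value: int, width: int = 32) -> bool:
    """Return True when two register values are equal, using XOR as the comparator.

    Hypothetical helper: models the early equality test done in the 2nd stage.
    """
    mask = (1 << width) - 1
    # XOR produces a 1 in every bit position where the two operands differ.
    difference = (rs_value ^ rt_value) & mask
    # The branch condition holds when no bit differs
    # (in hardware, a NOR over all XOR outputs).
    return difference == 0


print(branch_equal(7, 7))  # equal registers: beq condition is true
print(branch_equal(7, 5))  # different registers: beq condition is false
```

Because the test needs only bitwise gates and no carry chain, it is cheap enough to place next to the register read instead of waiting for the ALU.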
Control Hazard Solutions
Guess the branch outcome, then back up if wrong: “branch prediction”
  - For example, predict not taken
  - Impact: 1 clock per branch instruction if right, 2 if wrong (static: right ~50% of time)
  - More dynamic scheme: keep history of the branch instruction (right ~90% of time)
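The impact figures can be turned into an average cost per branch. The sketch below assumes the costs quoted on the slide (1 clock when the prediction is right, 2 when it is wrong); the function name `clocks_per_branch` is made up for this example.

```python
def clocks_per_branch(accuracy: float, right_cost: int = 1, wrong_cost: int = 2) -> float:
    """Expected clocks per branch for a predictor that is right `accuracy` of the time."""
    return accuracy * right_cost + (1.0 - accuracy) * wrong_cost


static_cost = clocks_per_branch(0.50)   # static prediction, right about 50% of the time
dynamic_cost = clocks_per_branch(0.90)  # dynamic prediction, right about 90% of the time

print(f"static:  {static_cost:.2f} clocks per branch")   # 1.50
print(f"dynamic: {dynamic_cost:.2f} clocks per branch")  # 1.10
```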
[Figure: pipeline timing diagram. Instruction order (add, beq, Load) against time in clock cycles; each instruction passes through IM, Reg, ALU, DM, Reg.]
Compiler Solutions
Redefine branch behavior so that the branch takes effect after the next instruction: “delayed branch”
Impact: 1 clock cycle per branch instruction if the compiler can find an instruction to put in the “delay slot” (≥ 50% of time)
[Figure: pipeline timing diagram. Instruction order (add, beq, Misc, Load) against time in clock cycles; each instruction passes through IM, Reg, ALU, DM, Reg.]
Example Nondelayed vs. Delayed Branch
Nondelayed Branch
      add M1, M2, M3
      sub M4, M5, M6
      beq M1, M4, Exit
      or  M8, M9, M10
      xor M10, M1, M11
Exit:

Delayed Branch
      add M1, M2, M3
      sub M4, M5, M6
      beq M1, M4, Exit
      or  M8, M9, M10
      xor M10, M1, M11
Exit:
Delayed Branch
Where to get instructions to fill the branch delay slot?
  - Before the branch instruction
  - From the target address: only valuable when branch taken
  - From fall through: only valuable when branch not taken
Compiler effectiveness for a single branch delay slot:
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots are useful in computation
  - About 50% (60% x 80%) of slots usefully filled
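The "about 50%" figure is the product of the two rates above. A short check, using a hypothetical helper name:

```python
def usefully_filled_fraction(fill_rate: float, useful_rate: float) -> float:
    """Fraction of all delay slots that hold an instruction doing useful work."""
    return fill_rate * useful_rate


# 60% of slots filled, 80% of those useful: 48%, which the slide rounds to about 50%.
print(f"{usefully_filled_fraction(0.60, 0.80):.0%}")
```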
Dynamic Branch Prediction
Performance = ƒ(accuracy, cost of misprediction)
Branch History Table (BHT): lower bits of the PC address index a table of 1-bit values
  - Says whether or not the branch was taken last time (T: taken, N: not taken)
  - No full address check
Problem: in a loop, a 1-bit BHT will cause 2 mispredictions (avg is 9 iterations before exit):
  - End of loop case, when it exits instead of looping as before
  - First time through the loop on the next time through the code, when it predicts exit instead of looping
  - Only 77.8% accuracy if 9 iterations per loop on average
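The 77.8% figure can be reproduced with a small simulation. This is a sketch under the assumption that "9 iterations" means the loop branch executes 9 times per pass through the code (8 taken, then 1 not taken at exit); the function name is illustrative.

```python
def one_bit_accuracy(outcomes, last_outcome=False):
    """Accuracy of a 1-bit predictor that always predicts the previous outcome.

    `outcomes` is a sequence of booleans (True = taken).
    `last_outcome` is the assumed initial content of the 1-bit entry.
    """
    correct = 0
    for taken in outcomes:
        if last_outcome == taken:
            correct += 1
        last_outcome = taken  # the single bit remembers only the latest outcome
    return correct / len(outcomes)


# One pass through the loop: the branch is taken 8 times, then falls through.
one_pass = [True] * 8 + [False]
# Re-enter the loop many times, as in an outer loop around it.
accuracy = one_bit_accuracy(one_pass * 1000)
print(f"{accuracy:.1%}")  # 2 misses out of every 9 branches
```

Each pass costs one miss on the first iteration (the bit still says "not taken" from the previous exit) and one miss at the exit, giving 7 correct out of 9.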
2-bit Branch Prediction - Scheme 1
Better solution: 2-bit scheme (Jim Smith, 1981)
[Figure: 2-bit state diagram with states T*, T*N, N*T and N*, connected by transitions on taken (T) and not taken (N) outcomes. Green (go): predict taken. Red (stop): predict not taken.]
Branch History Table (BHT)
BHT is a table of “Predictors”
  - 2-bit saturating counters
  - Indexed by the PC address of the branch
In the fetch phase of a branch:
  - The predictor from the BHT is used to make the prediction
When the branch completes:
  - Update the corresponding predictor
[Figure: BHT with entries Predictor 0, Predictor 1, ..., Predictor 127, selected by the branch PC; each entry is a 2-bit state machine with states T*, T*N, N*T and N*.]
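A BHT of 2-bit saturating counters can be sketched as below. The class name, the initial counter value, and the choice to drop the two low PC bits (which assumes 4-byte aligned instructions) are assumptions made for the illustration; only the 128-entry size and the predict-at-fetch, update-at-completion flow come from the slide.

```python
class BranchHistoryTable:
    """Table of 2-bit saturating counters indexed by low bits of the branch PC."""

    def __init__(self, entries: int = 128):
        self.entries = entries
        # Assumed initial state: 1 = weakly not taken. Counter values range 0..3.
        self.counters = [1] * entries

    def _index(self, pc: int) -> int:
        # Assumption: instructions are 4 bytes, so the two lowest PC bits carry no information.
        return (pc >> 2) % self.entries

    def predict(self, pc: int) -> bool:
        """Fetch phase: predict taken when the counter is in its upper half (2 or 3)."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        """Branch completion: move the counter toward the actual outcome, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


bht = BranchHistoryTable()
pc = 0x00400100
print(bht.predict(pc))   # initial prediction: not taken
bht.update(pc, taken=True)
print(bht.predict(pc))   # one taken outcome moves the counter into the taken half
```

Since there is no tag or full address check, two branches whose PCs share the same index bits use the same predictor and can disturb each other.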
2-bit Branch Prediction - Scheme 2
Another solution: 2-bit scheme where the prediction changes (in either direction) only after two mispredictions (Lee & A. Smith, IEEE Computer, Jan 1984)
[Figure: 2-bit state diagram with states T*, T*N, N*T and N*, connected by transitions on taken (T) and not taken (N) outcomes. Green (go): predict taken. Red (stop): predict not taken.]
Comparison
Scheme 1
Actual:     N    N    T    T    N    N    T    T
State:      N*   N*   N*   N*T  T*N  N*T  N*   N*T
Predicted:  N    N    N    N    T    N    N    N

Scheme 2
Actual:     N    N    T    T    N    N    T    T
State:      N*   N*   N*   N*T  T*   T*N  N*   N*T
Predicted:  N    N    N    N    ?    ?    ?    ?

[Figure: state diagrams of Scheme 2 and Scheme 1 side by side, each with states T*, T*N, N*T and N*.]
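The two State rows above can be reproduced by stepping two small state machines over the same outcome sequence. The transition tables below are inferred from those rows, not copied from the original diagrams, so treat them as an assumption: one is a plain saturating counter, the other jumps from a weak state straight to the opposite strong state on a second misprediction. The table names are made up for this sketch.

```python
# State names follow the slides. The first letter of a state is its prediction:
# N* and N*T predict not taken, T*N and T* predict taken.
SATURATING = {
    "N*":  {"T": "N*T", "N": "N*"},
    "N*T": {"T": "T*N", "N": "N*"},
    "T*N": {"T": "T*",  "N": "N*T"},
    "T*":  {"T": "T*",  "N": "T*N"},
}

# Variant in which a weak state moves directly to the opposite strong state.
TWO_MISS = {
    "N*":  {"T": "N*T", "N": "N*"},
    "N*T": {"T": "T*",  "N": "N*"},
    "T*N": {"T": "T*",  "N": "N*"},
    "T*":  {"T": "T*",  "N": "T*N"},
}


def simulate(table, outcomes, state="N*"):
    """Return (state before each branch, prediction made for each branch)."""
    states, predictions = [], []
    for actual in outcomes:
        states.append(state)
        predictions.append(state[0])  # prediction is the first letter of the state name
        state = table[state][actual]  # update once the branch outcome is known
    return states, predictions


outcomes = list("NNTTNNTT")
for name, table in (("saturating", SATURATING), ("two-miss", TWO_MISS)):
    states, predictions = simulate(table, outcomes)
    print(f"{name:10s} states: {' '.join(states)}")
    print(f"{name:10s} predicted: {' '.join(predictions)}")
```

Running both tables on the same sequence shows where the schemes diverge: they agree until the second consecutive taken branch, after which one sits in a weak state and the other in a strong one.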
Further Comparison
Alternating taken / not-taken
  - Your worst-case prediction scenario
Both schemes achieve 80-95% accuracy with only a small difference in behavior
[Figure: state diagrams of the two schemes, repeated from the previous slide.]
n-bit Branch Predictor
n-bit prediction:
  - Keep an n-bit saturating counter for each branch.
  - Increment it on branch taken and decrement it on branch not taken.
  - If the counter is greater than or equal to half its maximum value, predict the branch as taken.
This can be done for any n, but it turns out that n = 2 performs almost as well as other values of n.
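A minimal sketch of such a counter, parameterized by n. The class and method names are illustrative, and the initial value of 0 is an assumption.

```python
class SaturatingCounter:
    """n-bit saturating counter used as a branch predictor."""

    def __init__(self, n_bits: int = 2, value: int = 0):
        self.max_value = (1 << n_bits) - 1  # e.g. 3 for n = 2
        self.value = value                  # assumed to start at 0 (strongly not taken)

    def predict_taken(self) -> bool:
        # "Greater than or equal to half its maximum value", written without fractions.
        return 2 * self.value >= self.max_value

    def update(self, taken: bool) -> None:
        # Increment on taken, decrement on not taken, never leaving 0..max_value.
        if taken:
            self.value = min(self.max_value, self.value + 1)
        else:
            self.value = max(0, self.value - 1)


counter = SaturatingCounter(n_bits=2)
for outcome in (True, True, False, True):
    print(counter.value, counter.predict_taken(), outcome)
    counter.update(outcome)
```

With n = 1 the same class degenerates to the 1-bit "same as last time" predictor.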
Correlating Branches
Idea: whether recently executed branches were taken or not taken is related to the behavior of the present branch (as well as to the history of that branch's own behavior)
  - The behavior of recent branches then selects between, say, 4 predictions of the next branch, updating just that prediction
(2,2) predictor: 2-bit global, 2-bit local
[Figure: (2,2) predictor. The branch address (4 bits) selects a row of 2-bit per-branch local predictors; the 2-bit recent global branch history selects which predictor in that row supplies the prediction (01 = not taken (0) then taken (1) branches before reaching this one).]
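A (2,2) predictor along these lines can be sketched as a small two-dimensional table. The class name, the initial counter value, and dropping the two low PC bits before indexing are assumptions for this example; the 4 address bits, the 2-bit global history and the 2-bit counters come from the slide.

```python
class CorrelatingPredictor:
    """(2,2) predictor: 2 bits of global history select one of four 2-bit counters per row."""

    def __init__(self, address_bits: int = 4, history_bits: int = 2):
        self.address_mask = (1 << address_bits) - 1
        self.history_mask = (1 << history_bits) - 1
        self.history = 0  # most recent outcome in the lowest bit (1 = taken)
        # counters[row][history]; assumed to start at 1 (weakly not taken).
        self.counters = [[1] * (1 << history_bits) for _ in range(1 << address_bits)]

    def _row(self, pc: int) -> int:
        # Assumption: 4-byte instructions, so skip the two lowest PC bits.
        return (pc >> 2) & self.address_mask

    def predict(self, pc: int) -> bool:
        return self.counters[self._row(pc)][self.history] >= 2

    def update(self, pc: int, taken: bool) -> None:
        row = self._row(pc)
        value = self.counters[row][self.history]
        # Update only the counter that was selected by the current history.
        self.counters[row][self.history] = min(3, value + 1) if taken else max(0, value - 1)
        # Shift the new outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


predictor = CorrelatingPredictor()
pc = 0x00400020
misses = 0
for i in range(100):
    taken = (i % 2 == 0)  # strictly alternating branch: T N T N ...
    if predictor.predict(pc) != taken:
        misses += 1
    predictor.update(pc, taken)
print(misses)
```

The alternating pattern that defeats a single 2-bit counter becomes predictable here, because each history value ends up with its own counter trained to one outcome.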
Accuracy of Different Schemes
Floating Point Arithmetic Pipeline
Pipeline arithmetic units are usually found in very high speed computers
They are used to implement floating-point operations, multiplication of fixed-point numbers, and similar computations encountered in scientific problems
Floating Point Arithmetic Pipeline
Example for floating-point addition and subtraction
Inputs are two normalized floating-point binary numbers:
  X = A x 2^a
  Y = B x 2^b
A and B are two fractions that represent the mantissas; a and b are the exponents
Floating Point Arithmetic Pipeline
The pipeline has four segments:
  1. Compare the exponents
  2. Align the mantissas
  3. Add or subtract the mantissas
  4. Normalize the result
Floating Point Arithmetic Pipeline
X = 0.9504 x 10^3 and Y = 0.8200 x 10^2
  - The two exponents are subtracted in the first segment to obtain 3 - 2 = 1
  - The larger exponent, 3, is chosen as the exponent of the result
  - Segment 2 shifts the mantissa of Y to the right to obtain Y = 0.0820 x 10^3
  - The mantissas are now aligned
  - Segment 3 produces the sum Z = 1.0324 x 10^3
  - Segment 4 normalizes the result by shifting the mantissa once to the right and incrementing the exponent by one to obtain Z = 0.10324 x 10^4
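The four segments can be traced in code on the decimal example. This sketch uses base 10 and Python's `decimal` module to match the worked numbers (the pipeline itself is described for binary operands); the function name `fp_add` and the normalization range 0.1 <= |mantissa| < 1 are assumptions for the illustration.

```python
from decimal import Decimal


def fp_add(a_mant: Decimal, a_exp: int, b_mant: Decimal, b_exp: int):
    """Add two base-10 floating-point numbers given as (mantissa, exponent) pairs."""
    # Segment 1: compare the exponents and keep the larger one.
    diff = a_exp - b_exp
    exp = max(a_exp, b_exp)

    # Segment 2: align the mantissas by shifting the smaller operand to the right.
    if diff >= 0:
        b_mant = b_mant.scaleb(-diff)
    else:
        a_mant = a_mant.scaleb(diff)

    # Segment 3: add the aligned mantissas.
    mant = a_mant + b_mant

    # Segment 4: normalize so that 0.1 <= |mantissa| < 1.
    while abs(mant) >= 1:
        mant = mant.scaleb(-1)
        exp += 1
    while mant != 0 and abs(mant) < Decimal("0.1"):
        mant = mant.scaleb(1)
        exp -= 1
    return mant, exp


mantissa, exponent = fp_add(Decimal("0.9504"), 3, Decimal("0.8200"), 2)
print(mantissa, exponent)
```

In hardware each segment is separated from the next by a register, so four different additions can be in flight at once; the function above only shows the order of the steps.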
Case Study: MIPS R4000 Pipeline
8 Stage Pipeline:
  IF - First half of instruction fetch: PC selection, initiation of instruction cache access
  IS - Second half of instruction fetch: access to instruction cache
  RF - Instruction decode, register fetch, hazard checking, and instruction cache hit detection (tag check)
  EX - Execution: effective address calculation, ALU operation, branch target computation and condition evaluation
  DF - First half of access to data cache
  DS - Second half of access to data cache
  TC - Tag check for data cache hit
  WB - Write back for loads and register-register operations
The Pipeline Structure of the R4000
[Figure: R4000 pipeline stages IF IS RF EX DF DS TC WB drawn over instruction memory, REG, ALU, data memory and REG. Annotations: instruction is available, tag check, load data available.]
Case Study: MIPS R4000 LOAD Latency
2 Cycle Load Latency
[Figure: pipeline timing for LD R1, X followed by ADD R3, R1, R2, each passing through IF IS RF EX DF DS TC WB. Annotations: load data available with forwarding; load data needed in EX; 2 stall cycles.]
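The 2 stall cycles follow from the stage positions. The sketch below assumes that, with forwarding, load data becomes available at the end of DS and is needed at the start of EX; the function name and its parameters are made up for the example.

```python
STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]


def load_use_stalls(distance: int = 1, produce_stage: str = "DS", consume_stage: str = "EX") -> int:
    """Stall cycles for an instruction issued `distance` slots after a load it depends on.

    Assumption: the load's data is forwarded at the end of `produce_stage`
    and the dependent instruction needs it at the start of `consume_stage`.
    """
    # Cycle (1-based, load fetched in cycle 1) at whose end the data is available.
    available_end = STAGES.index(produce_stage) + 1
    # Cycle in which the consumer would reach its consuming stage without stalling.
    needed_at = distance + STAGES.index(consume_stage) + 1
    # Data ready at the end of cycle c can be used from cycle c + 1 onward.
    return max(0, available_end + 1 - needed_at)


for distance in (1, 2, 3):
    print(distance, load_use_stalls(distance))
```

An instruction directly after the load stalls for 2 cycles, one that is two slots behind stalls for 1, and one that is three slots behind does not stall at all.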
Extending DLX to Handle Floating Point Operations
[Figure: DLX pipeline with IF, ID, MEM and WB stages and four parallel execution units: integer unit (EX), FP/integer multiplier, FP adder, FP divider.]