Anne Bracy, CS 3410
Computer Science, Cornell University
See P&H Chapter: 4.5-4.8
The slides are the product of many rounds of teaching CS 3410 by Professors Weatherspoon, Bala, Bracy, McKee, and Sirer.
Single-cycle:  insn0.fetch, dec, exec | insn1.fetch, dec, exec
Multi-cycle:   insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec
Pipelined:     insn0.fetch | insn0.dec   | insn0.exec
                             insn1.fetch | insn1.dec  | insn1.exec
5-stage Pipeline
• Implementation
• Working Example
Hazards
• Structural
• Data Hazards
• Control Hazards
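[Figure: five-stage pipelined datapath. Stages: Instruction Fetch, Instruction Decode, Execute, Memory, Write-Back, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers. Blocks shown: PC, +4, instruction memory (inst, new pc), control, register file (A, B), extend (imm), alu, compute jump/branch targets, data memory (addr, din, dout). Control (ctrl) is carried through ID/EX, EX/MEM, and MEM/WB alongside the data values B, D, and M.]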
Cycle   1    2    3    4    5    6    7    8    9
add     IF   ID   EX   MEM  WB
nand         IF   ID   EX   MEM  WB
lw                IF   ID   EX   MEM  WB
add                    IF   ID   EX   MEM  WB
sw                          IF   ID   EX   MEM  WB

Latency: 5 cycles
Throughput: 1 insn/cycle (CPI = 1)
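The timing diagram above can be reproduced with a few lines of Python. This is a minimal sketch that assumes an ideal pipeline (no stalls, one instruction entering per cycle); the function name `pipeline_diagram` is made up for this illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(instructions):
    """Return (rows, total_cycles) for an ideal pipeline with no stalls."""
    # The first instruction needs len(STAGES) cycles; each later one adds 1.
    total_cycles = len(STAGES) + len(instructions) - 1
    rows = []
    for i, name in enumerate(instructions):
        cells = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage  # instruction i enters IF in cycle i + 1
        rows.append((name, cells))
    return rows, total_cycles

rows, total = pipeline_diagram(["add", "nand", "lw", "add", "sw"])
for name, cells in rows:
    print(f"{name:6}" + "".join(f"{c:5}" for c in cells))
print("total cycles:", total)
```

Each instruction still takes 5 cycles from fetch to write-back (latency), but once the pipeline is full one instruction finishes every cycle (throughput).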
• Break datapath into multiple cycles (here 5)
  • Parallel execution increases throughput
  • Balanced pipeline very important
    • Slowest stage determines clock rate
    • Imbalance kills performance
• Add pipeline registers (flip-flops) for isolation
  • Each stage begins by reading values from latch
  • Each stage ends by writing values to latch
• Resolve hazards
Stage: Perform Functionality / Latch values of interest

Fetch
  Perform: Use PC to index Program Memory, increment PC
  Latch: Instruction bits (to be decoded), PC+4 (to compute branch targets)
Decode
  Perform: Decode instruction, generate control signals, read register file
  Latch: Control information, Rd index, immediates, offsets, register values (Ra, Rb), PC+4 (to compute branch targets)
Execute
  Perform: Perform ALU operation. Compute targets (PC+4+offset, etc.) in case this is a branch, decide if branch taken
  Latch: Control information, Rd index, etc. Result of ALU operation, value in case this is a store instruction
Memory
  Perform: Perform load/store if needed, address is ALU result
  Latch: Control information, Rd index, etc. Result of load, pass result from execute
Writeback
  Perform: Select value, write to register file
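[Figure: Stage 1, Instruction Fetch. The PC addresses instruction memory (addr, mc; 00 = read word); inst and PC+4 are latched in IF/ID for the rest of the pipeline.]
The next PC is chosen by pcsel from:
• PC+4
• pcreg (PC registers: JR)
• pcrel (PC-relative: BEQ, BNE)
• pcabs (PC absolute: J and JAL)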
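[Figure: Stage 2, Instruction Decode. Inputs from IF/ID: inst, PC+4. Blocks: decode (ctrl), register file (Ra, Rb, Rd, WE; outputs A, B; write-back inputs result, dest), extend (imm). Signals shown: pcrel, pcabs. Output latch: ID/EX.]
[Figure: Stage 3, Execute. Inputs from ID/EX: ctrl, PC+4, A, B, imm. Blocks: alu, adder (+) for the target, branch? decision. Signals shown: pcsel, pcreg, target. Output latch: EX/MEM (ctrl, B, D).]
[Figure: Stage 4, Memory. Inputs from EX/MEM: ctrl, B, D, target. Blocks: data memory (addr, din, dout, mc). Signals shown: branch?, pcsel, pcrel, pcabs, pcreg. Output latch: MEM/WB (ctrl, M, D).]
[Figure: Stage 5, Write-Back. Inputs from MEM/WB: ctrl, M, D. Outputs to the register file: result, dest.]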
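[Figure: complete five-stage datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Blocks: PC, +4, inst mem, register file (Ra, Rb, Rd), data mem (addr, din, dout). Fields carried between stages include inst, PC+4, OP, A, B, imm, Rd, Rt, D, M, target, and ctrl.]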
• Instructions same length
  • 32 bits, easy to fetch and then decode
• 3 types of instruction formats
  • Easy to route bits between stages
  • Can read a register source before even knowing what the instruction is
• Memory access through lw and sw only
  • Access memory after ALU
Consider a non-pipelined processor with clock period C (e.g., 50 ns). If you divide the processor into N stages (e.g., 6), your new clock period will be:
A. C
B. N
C. less than C/N
D. C/N
E. greater than C/N
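One way to reason about this question is to apply two points made earlier: the slowest stage determines the clock rate, and every stage ends by writing a pipeline register. The sketch below does the arithmetic; the 0.5 ns latch overhead and the unbalanced stage delays are made-up numbers chosen only for illustration, and `pipelined_clock_period` is a hypothetical helper.

```python
def pipelined_clock_period(stage_delays_ns, latch_overhead_ns):
    """Clock period = slowest stage + time to write the pipeline register."""
    return max(stage_delays_ns) + latch_overhead_ns

C = 50.0   # non-pipelined clock period in ns (from the question)
N = 6      # number of stages (from the question)

ideal = C / N
# Perfectly balanced stages, but with an assumed 0.5 ns latch overhead.
balanced = pipelined_clock_period([C / N] * N, latch_overhead_ns=0.5)
# Assumed unbalanced split of the same 50 ns of logic.
unbalanced = pipelined_clock_period([5, 8, 12, 10, 8, 7], latch_overhead_ns=0.5)

print(f"C/N        = {ideal:.2f} ns")
print(f"balanced   = {balanced:.2f} ns")
print(f"unbalanced = {unbalanced:.2f} ns")
```

Under these assumptions both pipelined periods come out greater than C/N, and the unbalanced split is noticeably worse.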
add  r3 ← r1, r2
nand r6 ← r4, r5
lw   r4 ← 20(r2)
add  r5 ← r2, r5
sw   r7 → 12(r3)

Assume 8-register machine
[Figure sequence: the five-stage datapath (PC, instruction memory, register file, extend, ALU, data memory, MUXes, and the IF/ID, ID/EX, EX/MEM, MEM/WB registers) shown at Time 0 through Time 9. The recoverable contents are summarized below.]

Register file at Time 0: R0 = 0, R1 = 36, R2 = 9, R3 = 12, R4 = 18, R5 = 7, R6 = 41, R7 = 22

Time  IF           ID           EX           MEM          WB
0     (pipeline empty, all nops)
1     add 3 1 2
2     nand 6 4 5   add 3 1 2
3     lw 4 20(2)   nand 6 4 5   add 3 1 2
4     add 5 2 5    lw 4 20(2)   nand 6 4 5   add 3 1 2
5     sw 7 12(3)   add 5 2 5    lw 4 20(2)   nand 6 4 5   add 3 1 2
6     nop          sw 7 12(3)   add 5 2 5    lw 4 20(2)   nand 6 4 5
7     nop          nop          sw 7 12(3)   add 5 2 5    lw 4 20(2)
8     nop          nop          nop          sw 7 12(3)   add 5 2 5
9     nop          nop          nop          nop          sw 7 12(3)

From Time 6 onward: No more instructions.

Values computed along the way:
add  r3 ← r1, r2:  36 + 9 = 45, written to R3
nand r6 ← r4, r5:  18 = 010010, 7 = 000111, nand = 111101 = -3, written to R6
lw   r4 ← 20(r2):  address 20 + 9 = 29, loaded value 99, written to R4
add  r5 ← r2, r5:  9 + 7 = 16, written to R5
sw   r7 → 12(r3):  address 12 + 45 = 57, stores R7 = 22

Register file at Time 9: R0 = 0, R1 = 36, R2 = 9, R3 = 45, R4 = 99, R5 = 16, R6 = -3, R7 = 22

Slides thanks to Sally McKee
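As a cross-check on the working example, the five instructions can be executed one at a time in Python. This is only a sketch: it runs the program sequentially rather than modeling the pipeline, which gives the same final state here because no instruction reads a register written by one of the few instructions just before it. The initial register values and the memory word at address 29 are the ones shown in the figures; the `regs` and `mem` names are illustrative.

```python
# Register file at Time 0 (index = register number), as shown in the figures.
regs = [0, 36, 9, 12, 18, 7, 41, 22]
# Only the memory word touched by lw is modeled; its value comes from the figures.
mem = {29: 99}

regs[3] = regs[1] + regs[2]        # add  r3 <- r1, r2
regs[6] = ~(regs[4] & regs[5])     # nand r6 <- r4, r5 (bitwise NOT of AND)
regs[4] = mem[20 + regs[2]]        # lw   r4 <- 20(r2)
regs[5] = regs[2] + regs[5]        # add  r5 <- r2, r5
mem[12 + regs[3]] = regs[7]        # sw   r7 -> 12(r3)

for number, value in enumerate(regs):
    print(f"R{number} = {value}")
print("mem[57] =", mem[57])
```

The final register values match the register file shown in the last frames of the example.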
Correctness problems associated w/ processor design
1. Structural hazards
   Same resource needed for different purposes at the same time (Possible: ALU, Register File, Memory)
2. Data hazards
   Instruction output needed before it's available
3. Control hazards
   Next instruction PC unknown at time of Fetch
add r3, r1, r2    IF ID Ex M  W
nop                  IF ID Ex M  W
nop                     IF ID Ex M  W
add r6, r3, r8             IF ID Ex M  W

[Figure: datapath with inst mem, register file (A, B, D), and data mem.]
Problem: Need to read a value that is currently being written
Solution: negate RF clock: write first half, read second half
Dependence: relationship between two insns
• Data: two insns use same storage location
• Control: 1 insn affects whether another executes at all
• Not a bad thing, programs would be boring otherwise
• Enforced by making older insn go before younger one
  – Happens naturally in single-/multi-cycle designs
  – But not in a pipeline
Hazard: dependence & possibility of wrong insn order
• Effects of wrong insn order cannot be externally visible
• Hazards are a bad thing: most solutions either complicate the hardware or reduce performance
Data Hazards
• register file reads occur in stage 2 (ID)
• register file writes occur in stage 5 (WB)
• next instructions may read values about to be written

add r3, r1, r2
sub r5, r3, r4

Is there a dependence? Is there a hazard? How do we detect this?
Clock cycle       1    2    3    4    5    6    7    8    9
add r3, r1, r2    IF   ID   EX   MEM  WB
sub r5, r3, r4         IF   ID   EX   MEM  WB
lw  r6, 4(r3)               IF   ID   EX   MEM  WB
or  r5, r3, r5                   IF   ID   EX   MEM  WB
sw  r6, 12(r3)                        IF   ID   EX   MEM  WB
(time →)

backwards arrows require time travel
Detecting Data Hazards
[Figure: five-stage datapath with comparators in the decode stage. IF/ID.Ra is checked against 0 and compared with the Rd fields held in ID/EX and EX/MEM.]

add r3, r1, r2
sub r5, r3, r4

Stall = (IF/ID.Ra != 0 && (IF/ID.Ra == ID/EX.Rd || IF/ID.Ra == EX/M.Rd))
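The stall condition translates directly into a small predicate. The sketch below mirrors the expression for Ra only, as on the slide; the function and parameter names are illustrative, and a real decoder would presumably apply the same check to Rb.

```python
def needs_stall(if_id_ra, id_ex_rd, ex_m_rd):
    """Stall when the source register in decode matches a destination
    register that is still in the EX or MEM stage. Register 0 never stalls."""
    return if_id_ra != 0 and (if_id_ra == id_ex_rd or if_id_ra == ex_m_rd)

# add r3, r1, r2 followed by sub r5, r3, r4:
print(needs_stall(if_id_ra=3, id_ex_rd=3, ex_m_rd=0))  # add is in EX
print(needs_stall(if_id_ra=3, id_ex_rd=0, ex_m_rd=3))  # add is in MEM
print(needs_stall(if_id_ra=3, id_ex_rd=0, ex_m_rd=0))  # add has reached WB
```

The first two calls report a stall; the third does not, since the slide's condition only looks at the ID/EX and EX/M registers.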
1. Do Nothing
• Change the ISA to match implementation
• "Hey compiler: don't create code w/ data hazards!"
  (We can do better than this)
2. Stall
• Pause current and subsequent instructions till safe
3. Forward/bypass
• Forward data value to where it is needed
  (Only works if value actually exists already)
How to stall an instruction in ID stage
• prevent IF/ID pipeline register update
  – stalls the ID stage instruction
• convert ID stage insn into nop for later stages
  – innocuous "bubble" passes through pipeline
• prevent PC update
  – stalls the next (IF stage) instruction
[Figure: five-stage datapath with a detect hazard block in the decode stage, running add r3, r1, r2 / sub r5, r3, r5 / or r6, r3, r4 / add r6, r3, r8.]
If hazard: WE=0, MemWr=0, RegWr=0
[Figure sequence: three consecutive cycles of the stall logic for add r3, r1, r2 / sub r5, r3, r5 / or r6, r3, r4. The /stall signal gates the write enables (WE) of the PC and IF/ID and turns the Op sent to ID/Ex into a nop.]

NOP = If(IF/ID.rA ≠ 0 && (IF/ID.rA == ID/Ex.Rd || IF/ID.rA == Ex/M.Rd))

Frame 1: add r3, r1, r2 is one stage ahead of sub r5, r3, r5, which is in decode; or r6, r3, r4 is held in fetch (WE=0). STALL CONDITION MET: a nop (MemWr=0, RegWr=0) is sent on.
Frame 2: add r3, r1, r2 is two stages ahead, followed by a nop (MemWr=0, RegWr=0); sub is still in decode and or is still held (WE=0). STALL CONDITION MET.
Frame 3: add r3, r1, r2 is followed by two nops; or r6, r3, r4 is released (WE=1). NO STALL CONDITION MET: sub allowed to leave decode stage.
Clock cycle       1  2  3  4  5  6  7  8
add r3, r1, r2
sub r5, r3, r5
or  r6, r3, r4
add r6, r3, r8
(time →)
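To fill in a diagram like this one by hand or by program, the stall rule can be applied instruction by instruction. The following is a rough sketch under stated assumptions: no forwarding, an instruction waits in ID while a producer of one of its sources is in Ex or M, and the register file is written in the first half of the cycle so a value can be read in the same cycle its producer is in W. The function `schedule_with_stalls` and the tuple format are invented for this illustration.

```python
def schedule_with_stalls(program):
    """program: list of (text, dest, sources) using register numbers.
    Returns a list of (text, row) where row maps stage name -> cycle."""
    schedule = []
    wb_cycle = {}                        # register -> cycle its latest producer is in W
    prev_id_enter, prev_id_leave = 1, 1  # makes the first instruction fetch in cycle 1
    for text, dest, sources in program:
        if_enter = prev_id_enter         # IF frees up when the older insn enters ID
        id_enter = prev_id_leave + 1     # ID frees up when the older insn leaves it
        ready = max((wb_cycle.get(r, 0) for r in sources if r != 0), default=0)
        id_leave = max(id_enter, ready)  # wait in ID until every producer reaches W
        row = {"IF": if_enter, "ID": id_enter, "Ex": id_leave + 1,
               "M": id_leave + 2, "W": id_leave + 3,
               "stalls": id_leave - id_enter}
        if dest:
            wb_cycle[dest] = row["W"]
        schedule.append((text, row))
        prev_id_enter, prev_id_leave = id_enter, id_leave
    return schedule

program = [
    ("add r3, r1, r2", 3, [1, 2]),
    ("sub r5, r3, r5", 5, [3, 5]),
    ("or  r6, r3, r4", 6, [3, 4]),
    ("add r6, r3, r8", 6, [3, 8]),
]
schedule = schedule_with_stalls(program)
for text, row in schedule:
    print(text, row)
```

Under these assumptions sub spends two extra cycles in ID waiting for r3, and the instructions behind it are delayed by the same amount.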
1. Do Nothing
• Change the ISA to match implementation
• "Compiler: don't create code with data hazards!"
  (Nice try, we can do better than this)
2. Stall
• Pause current and subsequent instructions till safe
3. Forward/bypass
• Forward data value to where it is needed
  (Only works if value actually exists already)
[Figure: datapath across IF/ID, ID/Ex, Ex/Mem, Mem/WB with a forward unit feeding the ALU inputs and a detect hazard block in decode.]
Two types of forwarding/bypass
• Forwarding from Ex/Mem registers to Ex stage (M → Ex)
• Forwarding from Mem/WB register to Ex stage (W → Ex)
add r3, r1, r2    IF ID Ex M  W
sub r5, r3, r1       IF ID Ex M  W

[Figure: datapath with a bypass path from the Ex/Mem register back to the ALU input.]
Problem: EX needs ALU result that is in MEM stage
Solution: add a bypass from EX/MEM.D to start of EX
Detection Logic in Ex Stage:
forward = (Ex/M.WE && Ex/M.Rd != 0 && ID/Ex.Ra == Ex/M.Rd)
          || (same for Rb)

[Figure: datapath with the Ex/Mem bypass, running add r3, r1, r2 / sub r5, r3, r1.]
add r3, r1, r2    IF ID Ex M  W
sub r5, r3, r1       IF ID Ex M  W
or  r6, r3, r4          IF ID Ex M  W

[Figure: datapath with a bypass path from the Mem/WB register back to the ALU input.]
Problem: EX needs value being written by WB
Solution: Add bypass from WB final value to start of EX
Detection Logic:
forward = (M/WB.WE && M/WB.Rd != 0 && ID/Ex.Ra == M/WB.Rd
           && not (Ex/M.WE && Ex/M.Rd != 0 && ID/Ex.Ra == Ex/M.Rd))
          || (same for Rb)

[Figure: datapath with the Mem/WB bypass, running add r3, r1, r2 / sub r5, r3, r1 / or r6, r3, r4.]
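The two detection rules can be combined into one selector for an ALU operand. This is a sketch, not the course's reference design: it encodes the priority implied by the "not (Ex/M ...)" term by testing the Ex/Mem match first. The function name, parameter names, and the string labels for the three sources are illustrative.

```python
def forward_select(id_ex_ra, ex_m_we, ex_m_rd, m_wb_we, m_wb_rd):
    """Pick where the ALU operand for register id_ex_ra should come from."""
    # M -> Ex: the most recent producer wins, so check Ex/Mem first.
    if ex_m_we and ex_m_rd != 0 and id_ex_ra == ex_m_rd:
        return "EX/MEM"
    # W -> Ex: used only when the Ex/Mem rule above did not match.
    if m_wb_we and m_wb_rd != 0 and id_ex_ra == m_wb_rd:
        return "MEM/WB"
    # No hazard: use the value read from the register file in decode.
    return "REG"

# sub r5, r3, r1 in Ex while add r3, r1, r2 is in Mem:
print(forward_select(id_ex_ra=3, ex_m_we=True, ex_m_rd=3, m_wb_we=False, m_wb_rd=0))
# or r6, r3, r4 in Ex while add r3 is in WB and sub r5 is in Mem:
print(forward_select(id_ex_ra=3, ex_m_we=True, ex_m_rd=5, m_wb_we=True, m_wb_rd=3))
```

The same function would be called a second time with Rb to choose the other ALU input.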
Clock cycle       1   2   3   4   5   6   7   8   9
add r3, r1, r2    IF  ID  Ex  M   W
sub r5, r3, r4        IF  ID  Ex  M   W
lw  r6, 4(r3)             IF  ID  Ex  M   W
or  r5, r3, r5                IF  ID  Ex  M   W
sw  r6, 12(r3)                    IF  ID  Ex  M   W
(time →)
Data dependency after a load instruction:
• Value not available until after the M stage
→ Next instruction cannot proceed if dependent

THE KILLER HAZARD

[Figure: datapath (inst mem, register file with A, B, D, data mem) running lw r4, 20(r8) followed by or r6, r3, r4.]
lw r4, 20(r8)    IF ID  Ex M  W
or r6, r3, r4       IF  ID* ID Ex M  W
ID* = Stall

[Figure sequence: datapath frames for the load-use stall. Pipeline contents shown: NOP, or r6, r4, r1, lw r4, 20(r8); a NOP is inserted between lw and the dependent or before or reaches Ex.]
[Figure: datapath across IF/ID, ID/Ex, Ex/Mem, Mem/WB with forward unit and detect hazard blocks.]
Stall = If(ID/Ex.MemRead && IF/ID.Ra == ID/Ex.Rd)
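The load-use stall check can be sketched the same way as the other detection logic. The slide's expression tests Ra only; checking Rb as well and ignoring register 0 are assumptions added here, and the function and parameter names are illustrative.

```python
def load_use_stall(id_ex_mem_read, id_ex_rd, if_id_ra, if_id_rb):
    """Stall the instruction in decode when the instruction in Ex is a load
    whose destination is one of the decode-stage source registers."""
    return bool(
        id_ex_mem_read
        and id_ex_rd != 0                      # assumption: register 0 never stalls
        and id_ex_rd in (if_id_ra, if_id_rb)   # assumption: Rb checked like Ra
    )

# lw r4, 20(r8) in Ex, or r6, r3, r4 in decode:
print(load_use_stall(id_ex_mem_read=True, id_ex_rd=4, if_id_ra=3, if_id_rb=4))
# Same registers, but the instruction in Ex is not a load (forwarding suffices):
print(load_use_stall(id_ex_mem_read=False, id_ex_rd=4, if_id_ra=3, if_id_rb=4))
```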
Two MIPS Solutions:
• MIPS 2000/3000: delay slot
  – ISA says results of loads are not available until one cycle later
  – Assembler inserts nop, or reorders to fill delay slot
• MIPS 4000 onwards: stall
  – But really, programmer/compiler reorders to avoid stalling in the load delay slot
Control Hazards
• instructions are fetched in stage 1 (IF)
• branch and jump decisions occur in stage 3 (EX)
→ next PC not known until 2 cycles after branch/jump

0x10: beq r1, r2, L
0x14: add r3, r0, r3
0x18: sub r5, r4, r6
0x1C: L: or r3, r2, r4
Branch not taken? No Problem!
Branch taken? Just fetched add, sub… → Zap & Flush
10: beq r1, r2, L       IF  ID  Ex  M   W
14: add r3, r0, r3          IF  ID  NOP NOP NOP
18: sub r5, r4, r6              IF  NOP NOP NOP NOP
1C: L: or r3, r2, r4                IF  ID  Ex  M   W

[Figure: datapath (PC, +4, inst mem, register file, data mem) with the "branch calc" and "decide branch" blocks; New PC = 1C.]

If branch Taken → Zap
• prevent PC update
• clear IF/ID latch
• branch continues

For every taken branch? OUCH!!!
1. Delay Slot
• You MUST do this
• MIPS ISA: 1 insn after ctrl insn always executed
  • Whether branch taken or not
2. Resolve Branch at Decode
• Some groups do this for Project 2, your choice
• Move branch calc from EX to ID
• Alternative: just zap 2nd instruction when branch taken
3. Branch Prediction
• Not in 3410, but every processor worth anything does this
10: beq r1, r2, L       IF  ID  Ex  M   W
14: add r3, r0, r3          IF  ID  Ex  M   W
18: sub r5, r4, r6              IF  NOP NOP NOP NOP
1C: L: or r3, r2, r4                IF  ID  Ex  M   W

[Figure: datapath with "branch calc" and "decide branch" blocks; New PC = 1C. The instruction after the branch executes and only sub is zapped.]

10: beq r1, r2, L       IF  ID  Ex  M   W
14: add r3, r0, r3          IF  ID  Ex  M   W
1C: L: or r3, r2, r4            IF  ID  Ex  M   W

[Figure: datapath with "branch calc" and "decide branch" blocks; New PC = 1C. No instruction is zapped.]
Most processors support Speculative Execution
• Guess direction of the branch
  – Allow instructions to move through pipeline
  – Zap them later if guess turns out to be wrong
• A must for long pipelines
Data hazards occur when an operand (register) depends on the result of a previous instruction that may not be computed yet. Pipelined processors need to detect data hazards.
Stalling, preventing a dependent instruction from advancing, is one way to resolve data hazards. Stalling introduces NOPs ("bubbles") into a pipeline. Introduce NOPs by (1) preventing the PC from updating, (2) preventing the IF/ID registers from changing, and (3) preventing writes to memory and register file. NOPs significantly decrease performance.
Forwarding bypasses some pipeline stages by sending a result directly to a dependent instruction's operand (register). It gives better performance than stalling.
Control hazards occur because the PC following a control instruction is not known until the control instruction is executed. If branch is taken → need to zap instructions. 1 cycle performance penalty.
Delay slots can potentially recover performance lost to control hazards. The instruction in the delay slot will always be executed. Requires software (compiler) to make use of delay slot. Put nop in delay slot if not able to put useful instruction in delay slot.
We can reduce cost of a control hazard by moving branch decision and calculation from Ex stage to ID stage. With a delay slot, this removes the need to flush instructions on taken branches.
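The cost of control hazards can be put in rough numbers. The sketch below assumes a base CPI of 1 and charges each taken branch a fixed number of zapped instructions: 2 when the branch is decided in Ex with no delay slot (add and sub in the earlier example), 1 when only one instruction is zapped, and 0 when the branch is resolved in ID and the delay slot holds a useful instruction. The branch frequency (20%) and taken rate (60%) are made-up values for illustration only.

```python
def effective_cpi(branch_fraction, taken_fraction, penalty_cycles, base_cpi=1.0):
    """Average cycles per instruction when each taken branch costs
    penalty_cycles extra cycles (zapped instructions)."""
    return base_cpi + branch_fraction * taken_fraction * penalty_cycles

BRANCHES = 0.20   # assumed fraction of instructions that are branches
TAKEN = 0.60      # assumed fraction of branches that are taken

for label, penalty in [("decide in Ex, zap 2", 2),
                       ("zap 1", 1),
                       ("decide in ID + filled delay slot", 0)]:
    print(f"{label:34} CPI = {effective_cpi(BRANCHES, TAKEN, penalty):.2f}")
```

If the delay slot can only be filled with a nop, that nop is still a wasted cycle, so the last line is the best case rather than a guarantee.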