Chapter 7, Digital Design and Computer Architecture: ARM® Edition © 2015

Chapter 7
Digital Design and Computer Architecture: ARM® Edition
Sarah L. Harris and David Money Harris

Chapter 7 :: Topics
• Introduction
• Performance Analysis
• Single-Cycle Processor
• Multicycle Processor
• Pipelined Processor
• Advanced Microarchitecture

Introduction
• Microarchitecture: how to implement an architecture in hardware
• Processor:
  – Datapath: functional blocks
  – Control: control signals

Microarchitecture
• Multiple implementations for a single architecture:
  – Single-cycle: each instruction executes in a single cycle
  – Multicycle: each instruction is broken up into a series of shorter steps
  – Pipelined: each instruction is broken up into a series of steps, and multiple instructions execute at once

Processor Performance
• Program execution time:
  Execution Time = (# instructions)(cycles/instruction)(seconds/cycle)
• Definitions:
  – CPI: cycles/instruction
  – Clock period: seconds/cycle
  – IPC: instructions/cycle = 1/CPI
• Challenge is to satisfy constraints of:
  – Cost
  – Power
  – Performance
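
The execution-time relation above can be sketched in a few lines of Python. The function names and the example workload (one million instructions, CPI of 2, 1 ns clock) are assumptions made up for illustration; they do not come from the slides.

```python
def execution_time(num_instructions: int, cpi: float, clock_period_s: float) -> float:
    """Execution time in seconds = instructions x CPI x clock period."""
    return num_instructions * cpi * clock_period_s


def ipc_from_cpi(cpi: float) -> float:
    """Instructions per cycle is the reciprocal of cycles per instruction."""
    return 1.0 / cpi


# Hypothetical workload, chosen only to exercise the formula.
example_time = execution_time(num_instructions=1_000_000, cpi=2.0, clock_period_s=1e-9)
example_ipc = ipc_from_cpi(2.0)
print(f"time = {example_time:.6f} s, IPC = {example_ipc}")
```

Lowering any one of the three factors shortens execution time, which is why the later slides track both CPI and clock period for each microarchitecture.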

ARM Processor
• Consider a subset of ARM instructions:
  – Data-processing instructions:
    • ADD, SUB, AND, ORR
    • with register and immediate Src2, but no shifts
  – Memory instructions:
    • LDR, STR
    • with positive immediate offset
  – Branch instructions:
    • B

Review: Instruction Formats
[Figure: ARM instruction formats; only the "Branch" label survives in the extracted text.]

Architectural State Elements
Determines everything about a processor:
  – Architectural state:
    • 16 registers (including PC)
    • Status register
  – Memory

ARM Architectural State Elements
[Figure: the state elements. PC register (PC', PC), Instruction Memory (A, RD), Register File (A1, A2, A3, WD3, WE3, R15, RD1, RD2), Status register, and Data Memory (A, WD, WE, RD). Data buses are 32 bits wide; register addresses and the status flags are 4 bits wide.]

Single-Cycle ARM Processor
• Datapath
• Control

Single-Cycle ARM Processor
• Datapath: start with the LDR instruction
• Example:
    LDR R1, [R2, #5]
    LDR Rd, [Rn, imm12]

Single-Cycle Datapath: LDR fetch
STEP 1: Fetch instruction
[Figure: PC drives the address input A of the Instruction Memory; the read data RD is the instruction, Instr.]

Single-Cycle Datapath: LDR Reg Read
LDR Rd, [Rn, imm12]
STEP 2: Read source operands from RF
[Figure: Instr 19:16 (Rn) drives RA1, the A1 address input of the Register File.]

Single-Cycle Datapath: LDR Immed.
LDR Rd, [Rn, imm12]
STEP 3: Extend the immediate
[Figure: Instr 11:0 feeds the Extend unit, which produces ExtImm.]

Single-Cycle Datapath: LDR Address
LDR Rd, [Rn, imm12]
STEP 4: Compute the memory address
[Figure: the ALU adds SrcA (RD1 from the Register File) and SrcB (ExtImm) with ALUControl = 00; ALUResult is the memory address.]

Single-Cycle Datapath: LDR Mem Read
LDR Rd, [Rn, imm12]
STEP 5: Read data from memory and write it back to the register file
[Figure: ALUResult addresses the Data Memory; ReadData goes to the Register File write port WD3, with Instr 15:12 (Rd) on A3. Control values shown: RegWrite = 1, ALUControl = 00.]

Single-Cycle Datapath: PC Increment
STEP 6: Determine address of next instruction
[Figure: an adder computes PCPlus4 = PC + 4, which becomes PC'. Control values shown: RegWrite = 1, ALUControl = 00.]

Single-Cycle Datapath: Access to PC
PC can be the source or destination of an instruction
• Source: R15 must be available in the Register File
  – PC is read as the current PC plus 8
• Destination: be able to write the result to PC
[Figure: a second adder produces PCPlus8 for the R15 input of the Register File, and a multiplexer controlled by PCSrc sits in front of the PC register. Signals shown: RegWrite, PCSrc, ALUControl.]

Single-Cycle Datapath: STR
STR Rd, [Rn, imm12]
Expand the datapath to handle STR:
• Write data in Rd to memory
[Figure: RA2 is added as the second Register File read address; RD2 supplies WriteData to the Data Memory WD input. Control values shown for STR: RegWrite = 0, PCSrc = 0, ALUControl = 00, MemWrite = 1.]

Single-Cycle Datapath: Data-processing
ADD Rd, Rn, imm8
With immediate Src2:
• Read from Rn and Imm8 (ImmSrc chooses the zero-extended Imm8 instead of Imm12)
• Write ALUResult to the register file
• Write to Rd
[Figure: a MemtoReg multiplexer selects between ALUResult and ReadData to form Result, and the ALU produces ALUFlags. Control signals shown: PCSrc, RegWrite, ImmSrc, ALUControl (varies with the instruction), MemWrite, MemtoReg.]

Single-Cycle Datapath: Data-processing
ADD Rd, Rn, Rm
With register Src2:
• Read from Rn and Rm (instead of Imm8)
• Write ALUResult to the register file
• Write to Rd
[Figure: a multiplexer controlled by RegSrc puts Instr 3:0 (Rm) on RA2, and a multiplexer controlled by ALUSrc selects the SrcB operand. Control signals shown: PCSrc, RegWrite, ImmSrc (don't care), ALUControl (varies), MemWrite, MemtoReg, ALUSrc, RegSrc.]

Single-Cycle Datapath: B
B Label
Calculate branch target address:
  BTA = (ExtImm) + (PC + 8)
  ExtImm = Imm24 << 2 and sign-extended
[Figure: Instr 23:0 feeds the Extend unit, and a second RegSrc multiplexer can put the constant 15 (R15) on RA1. Control signals shown: PCSrc, RegWrite, ImmSrc, ALUControl, MemWrite, MemtoReg, ALUSrc, RegSrc.]
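
The branch target calculation can be checked with a short Python sketch. The helper names `sign_extend` and `branch_target` are assumptions introduced here, not names from the slides; the first example reuses the branch that appears later in the deck (a branch at address 0x20 whose target is 0x64).

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def branch_target(pc: int, imm24: int) -> int:
    """BTA = ExtImm + (PC + 8), with ExtImm = sign-extended (Imm24 << 2)."""
    ext_imm = sign_extend(imm24 & 0xFFFFFF, 24) << 2
    return (pc + 8 + ext_imm) & 0xFFFFFFFF  # wrap to a 32-bit address


# Forward branch: PC = 0x20, Imm24 = 15 -> 0x20 + 8 + 60 = 0x64
forward = branch_target(0x20, 15)
# Backward branch: Imm24 = 0xFFFFFE encodes -2 words, so the branch targets itself
backward = branch_target(0x100, 0xFFFFFE)
print(hex(forward), hex(backward))
```

Sign-extending before shifting gives the same value as shifting the 24-bit field and then sign-extending the 26-bit result, which is the order the slide states.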

Single-Cycle ARM Processor
[Figure: complete single-cycle processor, datapath plus Control Unit. Control Unit inputs: Cond (Instr 31:28), Op (Instr 27:26), Funct (Instr 25:20), Rd (Instr 15:12), and ALUFlags. Control Unit outputs: PCSrc, MemtoReg, MemWrite, ALUControl, ALUSrc, ImmSrc, RegWrite, RegSrc.]

Example: ORR

Review: Processor Performance
Program Execution Time = (# instructions)(cycles/instruction)(seconds/cycle)
                       = # instructions × CPI × T_C

Single-Cycle Performance
• T_C is limited by the critical path (LDR)
• Single-cycle critical path:
  T_c1 = t_pcq_PC + t_mem + t_dec + max[t_mux + t_RFread, t_sext + t_mux] + t_ALU + t_mem + t_mux + t_RFsetup
• Typically, the limiting paths are:
  – memory, ALU, register file
  – so the max term reduces to t_mux + t_RFread, giving
    T_c1 = t_pcq_PC + 2 t_mem + t_dec + t_RFread + t_ALU + 2 t_mux + t_RFsetup

Single-Cycle Performance Example
Element                Parameter    Delay (ps)
Register clock-to-Q    t_pcq_PC     40
Register setup         t_setup      50
Multiplexer            t_mux        25
ALU                    t_ALU        120
Decoder                t_dec        70
Memory read            t_mem        200
Register file read     t_RFread     100
Register file setup    t_RFsetup    60

T_c1 = t_pcq_PC + 2 t_mem + t_dec + t_RFread + t_ALU + 2 t_mux + t_RFsetup
     = [40 + 2(200) + 70 + 100 + 120 + 2(25) + 60] ps
     = 840 ps

Single-Cycle Performance Example
Program with 100 billion instructions:
Execution Time = # instructions × CPI × T_C
               = (100 × 10^9)(1)(840 × 10^-12 s)
               = 84 seconds
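
As a cross-check of the single-cycle numbers, here is a minimal Python sketch that evaluates the simplified critical-path formula from the delay table. The dictionary keys and function name are assumptions chosen to mirror the slide's parameter names.

```python
# Element delays in picoseconds, copied from the table in the slides.
DELAYS_PS = {
    "t_pcq_PC": 40,
    "t_setup": 50,
    "t_mux": 25,
    "t_ALU": 120,
    "t_dec": 70,
    "t_mem": 200,
    "t_RFread": 100,
    "t_RFsetup": 60,
}


def single_cycle_period_ps(d: dict) -> int:
    """Simplified critical path: the LDR instruction through both memories."""
    return (d["t_pcq_PC"] + 2 * d["t_mem"] + d["t_dec"] + d["t_RFread"]
            + d["t_ALU"] + 2 * d["t_mux"] + d["t_RFsetup"])


INSTRUCTIONS = 100 * 10**9   # 100 billion instructions
CPI = 1                      # every instruction takes exactly one cycle

tc1_ps = single_cycle_period_ps(DELAYS_PS)
exec_time_s = INSTRUCTIONS * CPI * tc1_ps / 10**12  # picoseconds to seconds
print(f"Tc1 = {tc1_ps} ps, execution time = {exec_time_s} s")
```

Working in integer picoseconds and converting to seconds only at the end avoids floating-point noise in the intermediate sum.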

Multicycle ARM Processor
• Single-cycle:
  + simple
  - cycle time limited by longest instruction (LDR)
  - separate memories for instruction and data
  - 3 adders/ALUs
• Multicycle processor addresses these issues by breaking each instruction into shorter steps
  o shorter instructions take fewer steps
  o can reuse hardware
  o cycle time is faster
• Multicycle:
  + higher clock speed
  + simpler instructions run faster
  + reuse expensive hardware on multiple cycles
  - sequencing overhead paid many times
Same design steps as single-cycle:
• first datapath
• then control

Multicycle State Elements
Replace the Instruction and Data memories with a single unified memory, which is more realistic

Multicycle Datapath: LDR
LDR Rd, [Rn, imm12]
STEP 1 (Instruction Fetch): Fetch instruction
STEP 2 (LDR Register Read): Read source operands from RF
STEP 3 (LDR Address): Compute the memory address
STEP 4 (LDR Memory Read): Read data from memory
STEP 5 (LDR Write Register): Write data back to register file
STEP 6 (Increment PC): Increment PC

Multicycle Datapath: Access to PC
PC can be read/written by an instruction
• Read: R15 (PC+8) available in Register File

Multicycle Datapath: Read from PC (R15)
Example: ADD R1, R15, R2
• R15 needs to be read as PC+8 from the Register File (RF) in the 2nd step
• So (also in the 2nd step) PC+8 is produced by the ALU and routed to the R15 input of the RF
  – SrcA = PC (which was already updated in step 1 to PC+4)
  – SrcB = 4
  – ALUResult = PC+8
• ALUResult is fed to the R15 input port of the RF in the 2nd step (and is then routed to the RD1 output of the RF)

Multicycle Datapath: Access to PC
PC can be read/written by an instruction
• Read: R15 (PC+8) available in Register File
• Write: be able to write the result of an instruction to PC

Multicycle Datapath: Write to PC (R15)
Example: SUB R15, R8, R3
• Result of the instruction needs to be written to the PC register
• ALUResult is already routed to the PC register; just assert PCWrite

Multicycle Datapath: STR
Write data in Rd to memory

Multicycle Datapath: Data-processing
• With immediate addressing (i.e., an immediate Src2), no additional changes are needed for the datapath
• With register addressing (register Src2): read from Rn and Rm

Multicycle Datapath: B
Calculate branch target address:
  BTA = (ExtImm) + (PC + 8)
  ExtImm = Imm24 << 2 and sign-extended

Multicycle ARM Processor

Main Controller FSM
[State-diagram figures not recovered. The slides build the FSM up in this order:]
• Fetch
• Decode
• Address
• Read Memory
• LDR
• STR
• Data-processing

Multicycle Controller FSM

Multicycle Processor Performance
• Instructions take different numbers of cycles:
  – 3 cycles: B
  – 4 cycles: DP, STR
  – 5 cycles: LDR
• CPI is a weighted average
• SPECINT2000 benchmark:
  – 25% loads
  – 10% stores
  – 13% branches
  – 52% R-type (data-processing)
Average CPI = (0.13)(3) + (0.52 + 0.10)(4) + (0.25)(5) = 4.12
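
The weighted-average CPI can be reproduced with a small Python sketch. The dictionary layout and the name `average_cpi` are assumptions for illustration; the instruction mix and cycle counts are the ones listed on the slide.

```python
# Instruction mix (fraction of dynamic instructions) and cycles per class.
MIX = {"load": 0.25, "store": 0.10, "branch": 0.13, "data_processing": 0.52}
CYCLES = {"load": 5, "store": 4, "branch": 3, "data_processing": 4}


def average_cpi(mix: dict, cycles: dict) -> float:
    """Weighted average: sum over classes of fraction x cycles."""
    return sum(mix[kind] * cycles[kind] for kind in mix)


multicycle_cpi = average_cpi(MIX, CYCLES)
print(f"average CPI = {multicycle_cpi:.2f}")
```

The same helper works for any mix whose fractions sum to 1, so a different benchmark only needs a different `MIX` table.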

Multicycle Processor Performance
Multicycle critical path:
• Assumptions:
  – RF is faster than memory
  – writing memory is faster than reading memory
T_c2 = t_pcq + 2 t_mux + max(t_ALU + t_mux, t_mem) + t_setup

Multicycle Performance Example
Element                Parameter    Delay (ps)
Register clock-to-Q    t_pcq_PC     40
Register setup         t_setup      50
Multiplexer            t_mux        25
ALU                    t_ALU        120
Decoder                t_dec        70
Memory read            t_mem        200
Register file read     t_RFread     100
Register file setup    t_RFsetup    60

T_c2 = t_pcq + 2 t_mux + max[t_ALU + t_mux, t_mem] + t_setup
     = [40 + 2(25) + max(145, 200) + 50] ps
     = [40 + 2(25) + 200 + 50] ps
     = 340 ps

Multicycle Performance Example
For a program with 100 billion instructions executing on a multicycle ARM processor:
  – CPI = 4.12 cycles/instruction
  – Clock cycle time: T_c2 = 340 ps
Execution Time = (# instructions) × CPI × T_c
               = (100 × 10^9)(4.12)(340 × 10^-12 s)
               = 140 seconds
This is slower than the single-cycle processor (84 sec.)
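
The multicycle clock period and the comparison with the single-cycle design can be sketched as follows. The key names and the function `multicycle_period_ps` are assumptions; the delays, the CPI of 4.12, and the 84-second single-cycle figure are taken from the slides.

```python
# Delays in picoseconds from the table; only the terms in the Tc2 formula are needed.
DELAYS_PS = {"t_pcq": 40, "t_setup": 50, "t_mux": 25, "t_ALU": 120, "t_mem": 200}


def multicycle_period_ps(d: dict) -> int:
    """Tc2 = t_pcq + 2*t_mux + max(t_ALU + t_mux, t_mem) + t_setup."""
    return (d["t_pcq"] + 2 * d["t_mux"]
            + max(d["t_ALU"] + d["t_mux"], d["t_mem"]) + d["t_setup"])


INSTRUCTIONS = 100 * 10**9       # 100 billion instructions
MULTICYCLE_CPI = 4.12            # weighted average from the instruction mix
SINGLE_CYCLE_TIME_S = 84.0       # result of the single-cycle example

tc2_ps = multicycle_period_ps(DELAYS_PS)
multicycle_time_s = INSTRUCTIONS * MULTICYCLE_CPI * tc2_ps / 10**12
print(f"Tc2 = {tc2_ps} ps, time = {multicycle_time_s:.2f} s")
print("slower than single-cycle:", multicycle_time_s > SINGLE_CYCLE_TIME_S)
```

The clock is about 2.5 times faster than the single-cycle clock, but the CPI is more than 4 times higher, which is why the multicycle design loses on this workload.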

Review: Single-Cycle ARM Processor
[Figure: complete single-cycle ARM processor, datapath plus Control Unit, as shown earlier.]

Review: Multicycle ARM Processor

Pipelined ARM Processor
• Temporal parallelism
• Divide single-cycle processor into 5 stages:
  – Fetch
  – Decode
  – Execute
  – Memory
  – Writeback
• Add pipeline registers between stages

Single-Cycle vs. Pipelined
[Figure: timing diagrams on a 0 to 1500 ps time axis. Single-cycle: instructions 1 and 2 run one after the other, each passing through Fetch Instruction, Decode/Read Reg, Execute ALU, Memory Read/Write, and Write Reg. (b) Pipelined: instructions 1, 2, and 3 pass through the same five stages but overlap in time.]

Pipelined Processor Abstraction
[Figure: pipeline diagram over cycles 1 to 10. Each instruction passes through IM, RF (read), ALU, DM, and RF (write), starting one cycle after the instruction before it.]
  LDR R2, [R0, #40]
  ADD R3, R9, R10
  SUB R4, R1, R5
  AND R5, R12, R13
  STR R6, [R1, #20]
  ORR R7, R11, #42

Single-Cycle & Pipelined Datapath
[Figure: the single-cycle datapath next to the pipelined datapath. The pipelined version adds pipeline registers between the Fetch, Decode, Execute, Memory, and Writeback stages, and signal names carry a stage suffix: InstrF, InstrD, RA1D, RA2D, WA3D, SrcAE, SrcBE, ExtImmE, WriteDataE, ALUResultE, ALUOutM, ALUOutW, ReadDataW, ResultW.]

Corrected Pipelined Datapath
• WA3 must arrive at the same time as Result
• Register file written on falling edge of CLK
[Figure: pipelined datapath in which the write address is carried through the pipeline registers as WA3D, WA3E, WA3M, WA3W.]

Optimized Pipelined Datapath
Remove adder by using PCPlus4F after PC has been updated to PC+4
[Figure: the pipelined datapath before and after the optimization. In the optimized version the separate PCPlus8 adder is gone and PCPlus8D feeds the R15 input of the Register File.]

Pipelined Processor Control
• Same control unit as single-cycle processor
• Control delayed to proper pipeline stage
[Figure: pipelined datapath with the Control Unit in the Decode stage. Control signals are registered along with the data and carry stage suffixes: PCSrcD/E/M/W, RegWriteD/E/M/W, MemtoRegD/E/M/W, MemWriteD/E/M, ALUControlD/E, ALUSrcD/E, BranchD/E, FlagWriteD/E, ImmSrcD, RegSrcD. The Cond Unit in the Execute stage takes CondE, FlagsE, and ALUFlags and produces CondExE and Flags'.]

Pipeline Hazards
• When an instruction depends on a result from an instruction that hasn't completed
• Types:
  – Data hazard: register value not yet written back to register file
  – Control hazard: next instruction not decided yet (caused by branch)

Data Hazard
[Figure: pipeline diagram over cycles 1 to 8. ADD produces R1, which the following three instructions read as a source.]
  ADD R1, R4, R5
  AND R8, R1, R3
  ORR R9, R6, R1
  SUB R10, R1, R7

Handling Data Hazards
• Insert NOPs in code at compile time
• Rearrange code at compile time
• Forward data at run time
• Stall the processor at run time

Compile-Time Hazard Elimination
• Insert enough NOPs for result to be ready
• Or move independent useful instructions forward
[Figure: pipeline diagram over cycles 1 to 10 with two NOPs inserted after the ADD.]
  ADD R1, R4, R5
  NOP
  NOP
  AND R8, R1, R3
  ORR R9, R6, R1
  SUB R10, R1, R7

Data Forwarding
• Check if register read in Execute stage matches register written in Memory or Writeback stage
• If so, forward result
[Figure: pipeline diagram over cycles 1 to 8 for the code below, with the R1 result of ADD forwarded to the instructions that need it.]
  ADD R1, R4, R5
  AND R8, R1, R3
  ORR R9, R6, R1
  SUB R10, R1, R7

Data Forwarding
[Figure: pipelined processor with a Hazard Unit. Three-input multiplexers (select values 00, 01, 10) in front of the ALU operands are controlled by ForwardAE and ForwardBE. Hazard Unit inputs shown: Match, RegWriteM, RegWriteW.]

Data Forwarding
• Execute stage register matches Memory stage register?
  Match_1E_M = (RA1E == WA3M)
  Match_2E_M = (RA2E == WA3M)
• Execute stage register matches Writeback stage register?
  Match_1E_W = (RA1E == WA3W)
  Match_2E_W = (RA2E == WA3W)
• If it matches, forward result (• denotes AND):
  if      (Match_1E_M • RegWriteM) ForwardAE = 10;
  else if (Match_1E_W • RegWriteW) ForwardAE = 01;
  else                             ForwardAE = 00;
• ForwardBE is the same, but with Match_2E
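
The forwarding priority can be modeled in Python as a sanity check. This is a behavioral sketch, not hardware description code; the function name `forward_select` and its argument names are assumptions that mirror the signal names on the slide.

```python
def forward_select(ra_e: int, wa3_m: int, wa3_w: int,
                   reg_write_m: bool, reg_write_w: bool) -> int:
    """Return the 2-bit forwarding mux select for one Execute-stage source.

    0b10: forward from the Memory stage, 0b01: forward from the Writeback
    stage, 0b00: use the value read from the register file.
    """
    if ra_e == wa3_m and reg_write_m:
        return 0b10  # the Memory stage holds the newest value, so it wins
    if ra_e == wa3_w and reg_write_w:
        return 0b01
    return 0b00


# ADD R1, R4, R5 followed by AND R8, R1, R3:
# when AND is in Execute, ADD is in Memory and writes R1.
forward_ae = forward_select(ra_e=1, wa3_m=1, wa3_w=0,
                            reg_write_m=True, reg_write_w=False)

# ORR R9, R6, R1 two instructions after the ADD:
# when ORR is in Execute, ADD is in Writeback; Memory holds AND (writes R8).
forward_be = forward_select(ra_e=1, wa3_m=8, wa3_w=1,
                            reg_write_m=True, reg_write_w=True)
print(format(forward_ae, "02b"), format(forward_be, "02b"))
```

Calling the same function once with RA1E and once with RA2E gives ForwardAE and ForwardBE respectively. Checking the Memory stage first matters when both stages write the same register, because the Memory-stage result is the more recent one.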

Stalling
[Figure: two pipeline diagrams for the code below. Without a stall (cycles 1 to 8), the R1 value loaded by LDR is not available in time for the AND that follows it ("Trouble!"). With a stall (cycles 1 to 9), AND repeats its register-file read and ORR repeats its fetch for one cycle.]
  LDR R1, [R4, #40]
  AND R8, R1, R3
  ORR R9, R6, R1
  SUB R10, R1, R7

Stalling Hardware
[Figure: pipelined processor with Hazard Unit extended for stalls. New Hazard Unit input: MemtoRegE. New outputs: StallF and StallD drive the enable (EN) inputs of the Fetch and Decode pipeline registers; FlushD and FlushE drive the clear (CLR) inputs of the Decode and Execute pipeline registers.]

Stalling Logic
• Is either source register in the Decode stage the same as the one being written in the Execute stage? (+ denotes OR, • denotes AND)
  Match_12D_E = (RA1D == WA3E) + (RA2D == WA3E)
• Is an LDR in the Execute stage AND Match_12D_E?
  ldrStallD = Match_12D_E • MemtoRegE
  StallF = StallD = FlushE = ldrStallD
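
A behavioral Python sketch of the load-use stall check follows. The function name `ldr_stall` and the argument names are assumptions modeled on the slide's signals; the example uses the LDR/AND pair from the stalling slides.

```python
def ldr_stall(ra1_d: int, ra2_d: int, wa3_e: int, mem_to_reg_e: bool) -> bool:
    """True when the Decode-stage instruction reads a register that an LDR
    currently in the Execute stage is about to load."""
    match_12d_e = (ra1_d == wa3_e) or (ra2_d == wa3_e)
    return match_12d_e and mem_to_reg_e


# LDR R1, [R4, #40] is in Execute (WA3E = 1, MemtoRegE = 1);
# AND R8, R1, R3 is in Decode (RA1D = 1, RA2D = 3).
stall = ldr_stall(ra1_d=1, ra2_d=3, wa3_e=1, mem_to_reg_e=True)

# On a stall, Fetch and Decode hold their contents and Execute gets a bubble.
stall_f = stall_d = flush_e = stall
print(stall_f, stall_d, flush_e)
```

A data-processing instruction in Execute never triggers this stall because MemtoRegE is 0; its result can be forwarded instead.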

Control Hazards
• B:
  – branch not determined until the Writeback stage of pipeline
  – instructions after branch fetched before branch occurs
  – these 4 instructions must be flushed if branch happens
• Writes to PC (R15) similar

Control Hazards
[Figure: pipeline diagram over cycles 1 to 10 for the code below (addresses in hex). The four instructions fetched after the branch are marked "Flush these instructions"; the branch target at address 64 is fetched afterwards.]
  20  B 3C
  24  AND R8, R1, R3
  28  ORR R9, R6, R1
  2C  SUB R10, R1, R7
  30  SUB R11, R1, R8
  34  ...
  64  ADD R12, R3, R4
Branch misprediction penalty:
• number of instructions flushed when branch is taken (4)
• may be reduced by determining BTA earlier

Early Branch Resolution
• Determine BTA in Execute stage
  – Branch misprediction penalty = 2 cycles
• Hardware changes
  – Add a branch multiplexer before PC register to select BTA from ALUResultE
  – Add BranchTakenE select signal for this multiplexer (only asserted if branch condition satisfied)
  – PCSrcW now only asserted for writes to PC

Pipelined Processor with Early BTA
[Figure: pipelined processor with Hazard Unit and an additional multiplexer in front of the PC register, controlled by BranchTakenE, that selects the branch target from ALUResultE.]

Control Hazards with Early BTA
[Figure: pipeline diagram over cycles 1 to 10 for the code below (addresses in hex). With the branch resolved in the Execute stage, only the two instructions fetched after the branch are flushed before the target at address 64 is fetched.]
  20  B 3C
  24  AND R8, R1, R3
  28  ORR R9, R6, R1
  2C  SUB R10, R1, R7
  30  SUB R11, R1, R8
  34  ...
  64  ADD R12, R3, R4

Control Stalling Logic
• PCWrPendingF = 1 if there is a write to PC in Decode, Execute, or Memory
  PCWrPendingF = PCSrcD + PCSrcE + PCSrcM
• Stall Fetch if PCWrPendingF
  StallF = ldrStallD + PCWrPendingF
• Flush Decode if PCWrPendingF OR PC is written in Writeback OR branch is taken
  FlushD = PCWrPendingF + PCSrcW + BranchTakenE
• Flush Execute if branch is taken
  FlushE = ldrStallD + BranchTakenE
• Stall Decode if ldrStallD (as before)
  StallD = ldrStallD
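
The stall and flush equations above can be collected into one behavioral Python function. This is a sketch for checking the logic by hand, with assumed names (`HazardOutputs`, `control_hazard_logic`); it is not the book's HDL.

```python
from dataclasses import dataclass


@dataclass
class HazardOutputs:
    stall_f: bool
    stall_d: bool
    flush_d: bool
    flush_e: bool


def control_hazard_logic(ldr_stall_d: bool, pc_src_d: bool, pc_src_e: bool,
                         pc_src_m: bool, pc_src_w: bool,
                         branch_taken_e: bool) -> HazardOutputs:
    """Combine load-use stalls with PC-write and branch handling."""
    # A write to PC is in flight somewhere between Decode and Memory.
    pc_wr_pending_f = pc_src_d or pc_src_e or pc_src_m
    return HazardOutputs(
        stall_f=ldr_stall_d or pc_wr_pending_f,
        stall_d=ldr_stall_d,
        flush_d=pc_wr_pending_f or pc_src_w or branch_taken_e,
        flush_e=ldr_stall_d or branch_taken_e,
    )


# A taken branch resolved in Execute: flush Decode and Execute, no stall.
taken = control_hazard_logic(False, False, False, False, False, True)
# A write to PC currently in the Execute stage: hold Fetch, flush Decode.
pending = control_hazard_logic(False, False, True, False, False, False)
print(taken)
print(pending)
```

Flushing both Decode and Execute on a taken branch discards the two wrongly fetched instructions, which matches the 2-cycle penalty stated for early branch resolution.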

ARM Pipelined Processor with Hazard Unit
[Figure: complete pipelined processor. Hazard Unit inputs shown: Match, RegWriteM, RegWriteW, MemtoRegE, BranchTakenE, PCSrcD, PCSrcE, PCSrcM, PCSrcW. Hazard Unit outputs: StallF, StallD, FlushD, FlushE, ForwardAE, ForwardBE.]
Pipelined Performance Example
• SPECINT2000 benchmark:
  – 25% loads
  – 10% stores
  – 13% branches
  – 52% R-type
• Suppose:
  – 40% of loads used by next instruction
  – 50% of branches mispredicted
• What is the average CPI?
  – Load CPI = 1 when not stalling, 2 when stalling
    So, CPIlw = 1(0.6) + 2(0.4) = 1.4
  – Branch CPI = 1 when not stalling, 3 when stalling
    So, CPIbeq = 1(0.5) + 3(0.5) = 2
  Average CPI = (0.25)(1.4) + (0.1)(1) + (0.13)(2) + (0.52)(1) = 1.23
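The weighted-average CPI above can be checked with a few lines of Python. This is a sketch under the slide's stated assumptions; the helper names `expected_cpi` and `average_cpi` and the dictionary keys are hypothetical.

```python
def expected_cpi(base_cpi, penalized_cpi, penalized_fraction):
    """CPI of one instruction class that sometimes pays a penalty."""
    return base_cpi * (1 - penalized_fraction) + penalized_cpi * penalized_fraction


def average_cpi(mix, cpi_by_class):
    """Weight each class CPI by its share of the instruction mix."""
    return sum(mix[name] * cpi_by_class[name] for name in mix)


# Instruction mix from the slide (fractions of all instructions).
MIX = {"load": 0.25, "store": 0.10, "branch": 0.13, "r_type": 0.52}

CPI_BY_CLASS = {
    "load": expected_cpi(1, 2, 0.40),    # 40% of loads stall one cycle
    "store": 1.0,
    "branch": expected_cpi(1, 3, 0.50),  # 50% of branches cost two extra cycles
    "r_type": 1.0,
}

AVERAGE_CPI = average_cpi(MIX, CPI_BY_CLASS)
```

Changing the misprediction fraction in `CPI_BY_CLASS` shows how strongly branch prediction quality moves the average.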
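Pipelined Performance
• Pipelined processor critical path:
\[
T_{c3} = \max \left[ \begin{array}{ll}
t_{pcq} + t_{mem} + t_{setup} & \text{Fetch} \\
2(t_{RFread} + t_{setup}) & \text{Decode} \\
t_{pcq} + 2t_{mux} + t_{ALU} + t_{setup} & \text{Execute} \\
t_{pcq} + t_{mem} + t_{setup} & \text{Memory} \\
2(t_{pcq} + t_{mux} + t_{RFwrite}) & \text{Writeback}
\end{array} \right]
\]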
Pipelined Performance Example
Element               Parameter   Delay (ps)
Register clock-to-Q   tpcq_PC     40
Register setup        tsetup      50
Multiplexer           tmux        25
ALU                   tALU        120
Memory read           tmem        200
Register file read    tRFread     100
Register file setup   tRFsetup    60
Register file write   tRFwrite    70
Cycle time: Tc3 = 2(tRFread + tsetup) = 2[100 + 50] ps = 300 ps
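To see why Decode sets the cycle time, each stage of the critical-path expression can be evaluated with the delays from the table. The sketch below is illustrative; the dictionary keys and function names are hypothetical, and the clock-to-Q delay of 40 ps is used for tpcq.

```python
# Element delays in picoseconds, taken from the table.
DELAYS_PS = {
    "t_pcq": 40, "t_setup": 50, "t_mux": 25, "t_ALU": 120,
    "t_mem": 200, "t_RFread": 100, "t_RFwrite": 70,
}


def pipeline_stage_delays(d):
    """Evaluate each term of the pipelined critical-path expression."""
    return {
        "Fetch": d["t_pcq"] + d["t_mem"] + d["t_setup"],
        "Decode": 2 * (d["t_RFread"] + d["t_setup"]),
        "Execute": d["t_pcq"] + 2 * d["t_mux"] + d["t_ALU"] + d["t_setup"],
        "Memory": d["t_pcq"] + d["t_mem"] + d["t_setup"],
        "Writeback": 2 * (d["t_pcq"] + d["t_mux"] + d["t_RFwrite"]),
    }


def cycle_time(d):
    """Return the limiting stage and its delay (the cycle time Tc3)."""
    stages = pipeline_stage_delays(d)
    limiting = max(stages, key=stages.get)
    return limiting, stages[limiting]


STAGES = pipeline_stage_delays(DELAYS_PS)
LIMITING_STAGE, TC3_PS = cycle_time(DELAYS_PS)
```

The stage delays come out as Fetch 290, Decode 300, Execute 260, Memory 290, and Writeback 270 ps, so Decode is the limiting stage at 300 ps.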
Pipelined Performance Example
Program with 100 billion instructions
Execution Time = (# instructions) × CPI × Tc
               = (100 × 10^9)(1.23)(300 × 10^-12)
               = 36.9 seconds
Processor Performance Comparison
Processor      Execution Time (seconds)   Speedup (single-cycle as baseline)
Single-cycle   84                         1
Multicycle     140                        0.6
Pipelined      36.9                       2.28
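The pipelined execution time and the speedup column can be reproduced from the execution-time formula. This is a small sketch; the function names are hypothetical, and the single-cycle and multicycle times are taken from the table as given.

```python
def execution_time(instructions, cpi, cycle_time_s):
    """Execution time = (# instructions) x CPI x Tc, in seconds."""
    return instructions * cpi * cycle_time_s


def speedup(baseline_s, time_s):
    """Speedup relative to a baseline execution time."""
    return baseline_s / time_s


SINGLE_CYCLE_S = 84.0   # from the table
MULTICYCLE_S = 140.0    # from the table
PIPELINED_S = execution_time(100e9, 1.23, 300e-12)

SPEEDUPS = {
    "Single-cycle": round(speedup(SINGLE_CYCLE_S, SINGLE_CYCLE_S), 2),
    "Multicycle": round(speedup(SINGLE_CYCLE_S, MULTICYCLE_S), 2),
    "Pipelined": round(speedup(SINGLE_CYCLE_S, PIPELINED_S), 2),
}
```

A speedup below 1, as for the multicycle design, means that design is slower than the single-cycle baseline on this workload.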
Advanced Microarchitecture
• Deep Pipelining
• Micro-operations
• Branch Prediction
• Superscalar Processors
• Out of Order Processors
• Register Renaming
• SIMD
• Multithreading
• Multiprocessors
Deep Pipelining
• 10-20 stages typical
• Number of stages limited by:
  – Pipeline hazards
  – Sequencing overhead
  – Power
  – Cost
Micro-operations
• Decompose more complex instructions into a series of simple instructions called micro-operations (micro-ops or µ-ops)
• At run-time, complex instructions are decoded into one or more micro-ops
• Used heavily in CISC (complex instruction set computer) architectures (e.g., x86)
• Used for some ARM instructions, for example:
  Complex Op            Micro-op Sequence
  LDR R1, [R2], #4      LDR R1, [R2]
                        ADD R2, R2, #4
  Without µ-ops, this instruction would need a 2nd write port on the register file
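The decomposition of a post-indexed load can be illustrated with a toy decoder that works on assembly text. This is only a sketch of the idea, not how a hardware decoder is built; the function `expand_to_micro_ops` and its regular expression are hypothetical and handle only the `LDR Rd, [Rn], #imm` form with a non-negative immediate.

```python
import re

# Matches the post-indexed form "LDR Rd, [Rn], #imm" (hypothetical toy parser).
POST_INDEXED_LDR = re.compile(r"^\s*LDR\s+(R\d+),\s*\[(R\d+)\],\s*#(\d+)\s*$")


def expand_to_micro_ops(instruction):
    """Split a post-indexed LDR into a load and an add; pass others through."""
    match = POST_INDEXED_LDR.match(instruction)
    if match is None:
        return [instruction.strip()]
    rd, rn, offset = match.groups()
    # Each micro-op writes only one register, so one write port suffices.
    return [f"LDR {rd}, [{rn}]", f"ADD {rn}, {rn}, #{offset}"]


MICRO_OPS = expand_to_micro_ops("LDR R1, [R2], #4")
```

Each resulting micro-op writes a single register, which is why the split avoids the second register file write port.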
Micro-operations
• Allow for dense code (fewer memory accesses)
• Yet preserve simplicity of RISC hardware
• ARM strikes balance by choosing instructions that:
  – Give better code density than pure RISC instruction sets (such as MIPS)
  – Enable more efficient decoding than CISC instruction sets (such as x86)
Branch Prediction
• Guess whether branch will be taken
  – Backward branches are usually taken (loops)
  – Consider history to improve guess
• Good prediction reduces fraction of branches requiring a flush
Branch Prediction
• Ideal pipelined processor: CPI = 1
• Branch misprediction increases CPI
• Static branch prediction:
  – Check direction of branch (forward or backward)
  – If backward, predict taken
  – Else, predict not taken
• Dynamic branch prediction:
  – Keep history of last several hundred (or thousand) branches in branch target buffer, record:
    • Branch destination
    • Whether branch was taken
Branch Prediction Example
      MOV R1, #0        ; R1 = sum
      MOV R0, #0        ; R0 = i
FOR                     ; for (i=0; i<10; i=i+1)
      CMP R0, #10
      BGE DONE
      ADD R1, R1, R0    ; sum = sum + i
      ADD R0, R0, #1    ; i = i + 1
      B   FOR
DONE
1-Bit Branch Predictor
• Remembers whether branch was taken the last time and does the same thing
• Mispredicts first and last branch of loop
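2-Bit Branch Predictor
Only mispredicts last branch of loop
[Figure: state transition diagram with four states: strongly taken (predict taken), weakly taken (predict taken), weakly not taken (predict not taken), and strongly not taken (predict not taken). Arcs between adjacent states are labeled with the branch outcome, taken or not taken.]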
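The difference between the two predictors shows up when the loop's BGE is simulated over several runs of the loop. The sketch below assumes the 2-bit predictor behaves as a saturating counter (0 = strongly not taken through 3 = strongly taken), that both predictors start in a not-taken state, and that the loop is executed three times in a row; the function names are hypothetical.

```python
def loop_branch_outcomes(iterations=10):
    """BGE DONE outcomes for one run: not taken each iteration, taken on exit."""
    return [False] * iterations + [True]


def mispredictions_1bit(outcomes, last_taken=False):
    """Predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if last_taken != taken:
            misses += 1
        last_taken = taken
    return misses


def mispredictions_2bit(outcomes, state=0):
    """Saturating counter: states 2 and 3 predict taken, 0 and 1 predict not taken."""
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        state = min(3, state + 1) if taken else max(0, state - 1)
    return misses


# Three back-to-back runs of the loop; predictor state carries over between runs.
THREE_RUNS = loop_branch_outcomes() * 3
ONE_BIT_MISSES = mispredictions_1bit(THREE_RUNS)
TWO_BIT_MISSES = mispredictions_2bit(THREE_RUNS)
```

Under these assumptions the 1-bit predictor misses 5 times (once in the first run, then on the first and last branch of each later run), while the 2-bit predictor misses 3 times (only the last branch of each run).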
Superscalar
• Multiple copies of datapath execute multiple instructions at once
• Dependencies make it tricky to issue multiple instructions at once
[Figure: superscalar datapath with PC, instruction memory, a six-port register file (read ports A1/RD1, A2/RD2, A4/RD4, A5/RD5 and write ports A3/WD3, A6/WD6), two ALUs, and a two-port data memory.]
Superscalar Example
      LDR R8, [R0, #40]
      ADD R9, R1, R2
      SUB R10, R1, R3
      AND R11, R3, R4
      ORR R12, R1, R5
      STR R5, [R0, #80]
Ideal IPC: 2    Actual IPC: 2
[Figure: pipeline timing diagram, cycles 1 to 8. The instructions issue two at a time: LDR with ADD, SUB with AND, and ORR with STR.]
Superscalar with Dependencies
      LDR R8, [R0, #40]
      ADD R9, R8, R1
      SUB R8, R2, R3
      AND R10, R4, R8
      ORR R11, R5, R6
      STR R7, [R11, #80]
Ideal IPC: 2    Actual IPC: 6/5 = 1.2
[Figure: pipeline timing diagram, cycles 1 to 9, showing a stall caused by the dependencies between the instructions.]
Out of Order Processor
• Looks ahead across multiple instructions
• Issues as many instructions as possible at once
• Issues instructions out of order (as long as no dependencies)
• Dependencies:
  – RAW (read after write): one instruction writes, later instruction reads a register
  – WAR (write after read): one instruction reads, later instruction writes a register
  – WAW (write after write): one instruction writes, later instruction writes a register
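The three dependency types can be made concrete with a small classifier over pairs of instructions. This is an illustrative sketch; the `(destination, sources)` tuple encoding and the name `classify_hazards` are hypothetical, and the sample instructions are chosen only to exercise each case.

```python
def classify_hazards(earlier, later):
    """Return the set of dependencies the later instruction has on the earlier.

    Each instruction is (destination_register_or_None, tuple_of_source_registers).
    """
    earlier_dest, earlier_srcs = earlier
    later_dest, later_srcs = later
    hazards = set()
    if earlier_dest is not None and earlier_dest in later_srcs:
        hazards.add("RAW")  # earlier writes, later reads
    if later_dest is not None and later_dest in earlier_srcs:
        hazards.add("WAR")  # earlier reads, later writes
    if earlier_dest is not None and earlier_dest == later_dest:
        hazards.add("WAW")  # both write the same register
    return hazards


LDR_R8 = ("R8", ("R0",))        # LDR R8, [R0, #40]
ADD_R9 = ("R9", ("R8", "R1"))   # ADD R9, R8, R1
SUB_R8 = ("R8", ("R2", "R3"))   # SUB R8, R2, R3
ORR_R11 = ("R11", ("R5", "R6")) # ORR R11, R5, R6
```

For these samples, `classify_hazards(LDR_R8, ADD_R9)` reports RAW, `classify_hazards(ADD_R9, SUB_R8)` reports WAR, and `classify_hazards(LDR_R8, SUB_R8)` reports WAW, while ORR is independent of all three.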
Out of Order Processor
• Instruction level parallelism (ILP): number of instructions that can be issued simultaneously (average < 3)
• Scoreboard: table that keeps track of:
  – Instructions waiting to issue
  – Available functional units
  – Dependencies
Out of Order Processor Example
      LDR R8, [R0, #40]
      ADD R9, R8, R1
      SUB R8, R2, R3
      AND R10, R4, R8
      ORR R11, R5, R6
      STR R7, [R11, #80]
Ideal IPC: 2    Actual IPC: 6/4 = 1.5
[Figure: pipeline timing diagram, cycles 1 to 8. Issue order: LDR with ORR, then STR, then ADD with SUB, then AND. Annotations: two-cycle latency between load and use of R8 (RAW, LDR to ADD); RAW on R11 (ORR to STR); WAR on R8 (ADD to SUB); RAW on R8 (SUB to AND).]
Register Renaming
Original code:
      LDR R8, [R0, #40]
      ADD R9, R8, R1
      SUB R8, R2, R3
      AND R10, R4, R8
      ORR R11, R5, R6
      STR R7, [R11, #80]
With R8 renamed to T0 in SUB and AND:
      LDR R8, [R0, #40]
      ADD R9, R8, R1
      SUB T0, R2, R3
      AND R10, R4, T0
      ORR R11, R5, R6
      STR R7, [R11, #80]
Ideal IPC: 2    Actual IPC: 6/3 = 2
[Figure: pipeline timing diagram, cycles 1 to 7. Issue order: LDR with SUB, then AND with ORR, then ADD with STR. Annotations: 2-cycle RAW on R8 (LDR to ADD); RAW on T0 (SUB to AND); RAW on R11 (ORR to STR).]
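Renaming can be sketched as rewriting one instruction's destination and every later read of that value. This is a simplified illustration, not a hardware rename table; the `(op, destination, sources)` encoding and the function `rename_destination` are hypothetical, and memory offsets are omitted from the source tuples.

```python
def rename_destination(program, index, new_reg):
    """Rename the register written by program[index] and later reads of that value."""
    op, old_reg, srcs = program[index]
    renamed = list(program[:index])
    renamed.append((op, new_reg, srcs))
    still_live = True
    for later_op, later_dest, later_srcs in program[index + 1:]:
        if still_live:
            # Reads of the old name now refer to the renamed value.
            later_srcs = tuple(new_reg if s == old_reg else s for s in later_srcs)
            if later_dest == old_reg:
                still_live = False  # a newer write ends the renamed value's range
        renamed.append((later_op, later_dest, later_srcs))
    return renamed


PROGRAM = [
    ("LDR", "R8", ("R0",)),
    ("ADD", "R9", ("R8", "R1")),
    ("SUB", "R8", ("R2", "R3")),
    ("AND", "R10", ("R4", "R8")),
    ("ORR", "R11", ("R5", "R6")),
    ("STR", None, ("R7", "R11")),
]

RENAMED = rename_destination(PROGRAM, 2, "T0")
```

After renaming, SUB writes T0 and AND reads T0, so the WAR dependency between ADD and SUB on R8 disappears and SUB no longer has to wait for ADD.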
SIMD
• Single Instruction Multiple Data (SIMD)
  – Single instruction acts on multiple pieces of data at once
  – Common application: graphics
  – Perform short arithmetic operations (also called packed arithmetic)
• For example, add eight 8-bit elements
[Figure: packed addition of 64-bit registers. D0 holds a7...a0 and D1 holds b7...b0, one 8-bit element per byte, with a0 and b0 in bits 7:0 and a7 and b7 in bits 63:56. D2 receives the eight sums a7 + b7, ..., a0 + b0.]
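The packed addition of eight 8-bit elements can be modeled in software on 64-bit integers. This is a sketch of the idea only; the function name `packed_add_u8x8` is hypothetical, and wraparound (modulo 256) within each lane is an assumption, since the slide does not say how overflow is handled.

```python
def packed_add_u8x8(d0, d1):
    """Add eight unsigned 8-bit lanes of two 64-bit values independently."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        a = (d0 >> shift) & 0xFF
        b = (d1 >> shift) & 0xFF
        # Mask the sum so a carry never spills into the neighboring lane.
        result |= ((a + b) & 0xFF) << shift
    return result


D0 = 0x01020304050607FF
D1 = 0x0101010101010101
D2 = packed_add_u8x8(D0, D1)
```

In this example the lowest lane computes 0xFF + 0x01 and wraps to 0x00 without disturbing the lane above it, giving 0x0203040506070800.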
Advanced Architecture Techniques
• Multithreading
  – Word processor: thread for typing, spell checking, printing
• Multiprocessors
  – Multiple processors (cores) on a single chip
Threading: Definitions
• Process: program running on a computer
  – Multiple processes can run at once: e.g., surfing Web, playing music, writing a paper
• Thread: part of a program
  – Each process has one or more threads: e.g., a word processor may have threads for typing, spell checking, printing
Threads in Conventional Processor
• One thread runs at once
• When one thread stalls (for example, waiting for memory):
  – Architectural state of that thread stored
  – Architectural state of waiting thread loaded into processor and it runs
  – Called context switching
• Appears to user like all threads running simultaneously
Multithreading
• Multiple copies of architectural state
• Multiple threads active at once:
  – When one thread stalls, another runs immediately
  – If one thread can't keep all execution units busy, another thread can use them
• Does not increase instruction-level parallelism (ILP) of single thread, but increases throughput
Intel calls this "hyperthreading"
Multiprocessors
• Multiple processors (cores) with a method of communication between them
• Types:
  – Homogeneous: multiple cores with shared main memory
  – Heterogeneous: separate cores for different tasks (for example, DSP and CPU in cell phone)
  – Clusters: each core has own memory system
Other Resources
• Patterson & Hennessy's: Computer Architecture: A Quantitative Approach
• Conferences:
  – www.cs.wisc.edu/~arch/www/
  – ISCA (International Symposium on Computer Architecture)
  – HPCA (International Symposium on High Performance Computer Architecture)