1
Case Study: MIPS R4000
Section A.6
Advanced Computer Architecture – Fall 2004
2
Goal
• Today: Longer pipelines (R4000) => Better branch prediction, more instruction parallelism?
3
Case Study: MIPS R4000 (200 MHz)
• 8-stage pipeline:
– IF: first half of instruction fetch; PC selection happens here, as well as initiation of the instruction cache access.
– IS: second half of the instruction cache access.
– RF: instruction decode and register fetch, hazard checking, and instruction cache hit detection.
– EX: execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
– DF: data fetch, first half of the data cache access.
– DS: second half of the data cache access.
– TC: tag check, determine whether the data cache access hit.
– WB: write back for loads and register-register operations.
• 8 stages: What is the impact on load delay? On branch delay? Why?
4
R4000 Pipeline
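Figure: R4000 datapath (instruction memory, register file, ALU, data memory), one pipeline stage per clock:
IF: instruction fetch, first half (PC select)
IS: instruction fetch, second half (finish)
RF: register fetch, I-cache hit detection
EX: effective address, ALU, branch target
DF: data cache, first half
DS: data cache, second half
TC: tag check (data)
WB: write back for loads and register-register ALU operations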
5
Load Delay
Clock      1   2   3   4   5   6   7   8
i (load)   IF  IS  RF  EX  DF  DS  TC  WB
i+1            IF  IS  RF  EX  DF  DS  TC
i+2                IF  IS  RF  EX  DF  DS
i+3                    IF  IS  RF  EX  DF
i+4                        IF  IS  RF  EX
i+5                            IF  IS  RF
i+6                                IF  IS
i+7                                    IF
Two-cycle load latency
• 2-cycle load delay
– Value available at the end of DS => use it before the tag check
– Seems odd since you have to take it back on a miss…
– But miss is rare, so common case is a hit
– And dependent instructions will be behind, so stalling them is easy
– But, can’t do a destructive operation until you know it was a hit
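The two-cycle figure can be read straight off the stage order. A minimal Python sketch, assuming the loaded value is forwarded at the end of DS to an instruction that needs it at the start of EX (the function names and the `distance` parameter are illustrative, not part of the slides):

```python
# Stage order as listed on the slides.
STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]


def load_delay(produce_stage="DS", consume_stage="EX"):
    """Number of load delay cycles: how many stages separate the point where
    the loaded value appears (end of DS) from the point where a consumer
    needs it (start of EX)."""
    return STAGES.index(produce_stage) - STAGES.index(consume_stage)


def load_use_stalls(distance):
    """Stall cycles for a dependent instruction issued `distance` slots after
    the load (1 means the very next instruction). Assumes full forwarding."""
    return max(0, load_delay() - (distance - 1))


if __name__ == "__main__":
    print("load delay:", load_delay())
    for distance in (1, 2, 3):
        print("distance", distance, "-> stalls", load_use_stalls(distance))
```

Under these assumptions a dependent instruction right behind the load waits two cycles, one placed two slots behind waits one cycle, and anything further back does not stall.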
6
Branch Delay
Clock        1   2   3   4   5   6   7   8
i (branch)   IF  IS  RF  EX  DF  DS  TC  WB
i+1              IF  IS  RF  EX  DF  DS  TC
i+2                  IF  IS  RF  EX  DF  DS
i+3                      IF  IS  RF  EX  DF
i+4                          IF  IS  RF  EX
i+5                              IF  IS  RF
i+6                                  IF  IS
i+7                                      IF
Three-cycle branch latency (conditions evaluated during the EX phase)
Delay slot plus two stalls
Branch likely cancels the delay slot if not taken
• 3-cycle branch delay
– Target is only really known after the EX stage
– 1st delay slot happens no matter what
– Slots 2 and 3:
  • Filled if predict-not-taken
  • NOPs if predict-taken
7
Branch handling in MIPS
• MIPS architecture uses delayed branches
• R4000 uses a predict-not-taken strategy
• Only one delay slot is filled by the compiler
• Taken branches need 2 additional stall cycles inserted by the HW
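The bullets above can be turned into a small cost model. This is a hedged Python sketch, not a description of the actual hardware: it assumes an ordinary delayed branch (not branch-likely), counts an unfilled delay slot as one wasted cycle, and the taken and fill fractions in the usage line are made-up illustrative values.

```python
STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]

# The branch outcome is known in EX, so three fetch slots follow the branch
# before the target can be fetched.
BRANCH_DELAY = STAGES.index("EX") - STAGES.index("IF")
DELAY_SLOTS = 1  # one architectural delay slot, filled by the compiler


def branch_stall_cycles(taken, slot_filled):
    """Cycles lost on one branch under predict-not-taken."""
    stalls = (BRANCH_DELAY - DELAY_SLOTS) if taken else 0
    if not slot_filled:
        stalls += DELAY_SLOTS  # the slot holds a NOP, so it is wasted
    return stalls


def average_branch_stalls(taken_fraction, slot_fill_fraction):
    """Expected stall cycles per branch, treating both fractions as independent."""
    return (taken_fraction * (BRANCH_DELAY - DELAY_SLOTS)
            + (1.0 - slot_fill_fraction) * DELAY_SLOTS)


if __name__ == "__main__":
    print(branch_stall_cycles(taken=True, slot_filled=True))
    # Hypothetical workload: 60% of branches taken, 50% of delay slots filled.
    print(average_branch_stalls(0.6, 0.5))
```

A taken branch with a filled slot costs the two hardware stalls; an untaken branch with a filled slot costs nothing.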
8
MIPS R4000 Floating Point
• 3 functional units: FP adder, FP multiplier, FP divider
• Last step of FP multiplier/divider uses FP adder HW
• 8 kinds of stages in FP units:

Stage  Functional unit  Description
A      FP adder         Mantissa ADD stage
D      FP divider       Divide pipeline stage
E      FP multiplier    Exception test stage
M      FP multiplier    First stage of multiplier
N      FP multiplier    Second stage of multiplier
R      FP adder         Rounding stage
S      FP adder         Operand shift stage
U                       Unpack FP numbers

• A stage may be used multiple times by an instruction => lots of structural hazards
9
MIPS FP Pipe Stages
FP instr         Stages used in cycles 1, 2, 3, …
Add, Subtract    U, S+A, A+R, R+S
Multiply         U, E+M, M, M, M, N, N+A, R
Divide           U, A, R, D^28, …, D+A, D+R, D+R, D+A, D+R, A, R
Square root      U, E, (A+R)^108, …, A, R
Negate           U, S
Absolute value   U, S
FP compare       U, A, R

(A superscript is a repeat count: D^28 means the D stage is used for 28 consecutive cycles.)

Stages:
A  Mantissa ADD stage
D  Divide pipeline stage
E  Exception test stage
M  First stage of multiplier
N  Second stage of multiplier
R  Rounding stage
S  Operand shift stage
U  Unpack FP numbers
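Because several instructions share the A, R and S stages, structural hazards can be found by overlaying two stage sequences. The Python sketch below does this for a multiply followed by an add, using the stage sequences from the table above; the helper names are illustrative, and cycles are numbered from 0 at the multiply's issue (the table counts from 1).

```python
# Stage usage per cycle, copied from the FP pipe stage table.
MULTIPLY = ["U", "E+M", "M", "M", "M", "N", "N+A", "R"]
ADD = ["U", "S+A", "A+R", "R+S"]


def units_by_cycle(stages, issue_cycle):
    """Map each absolute cycle to the set of stage units busy in that cycle."""
    return {issue_cycle + i: set(entry.split("+"))
            for i, entry in enumerate(stages)}


def conflicts(first, second, offset):
    """Units needed by both instructions in the same cycle when `second`
    issues `offset` cycles after `first`. Returns {cycle: shared_units}."""
    a = units_by_cycle(first, 0)
    b = units_by_cycle(second, offset)
    return {c: a[c] & b[c] for c in sorted(a.keys() & b.keys()) if a[c] & b[c]}


if __name__ == "__main__":
    for offset in range(1, 8):
        clash = conflicts(MULTIPLY, ADD, offset)
        print("add issued", offset, "cycles after multiply:",
              clash if clash else "no conflict")
```

With these sequences, only adds issued 4 or 5 cycles after the multiply collide with it, in both cases over the shared adder stages A and R.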
10
FP operations in MIPS R4000
• Latency and initiation interval?

FP instruction   Latency (cycles)   Initiation interval (cycles)
Add, Sub         4                  3
Multiply         8                  4
Divide           36                 35
Square root      112                111
Negate           2                  1
Absolute value   2                  1
FP compare       3                  2
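To see how the two columns interact, here is a rough Python sketch. It assumes latency is counted from issue to result, that the operations are independent and of the same type, and that nothing else competes for the units; `cycles_for_sequence` is an illustrative helper, not something from the slides.

```python
# (latency, initiation interval) in cycles, from the table above.
FP_TIMING = {
    "add": (4, 3),
    "multiply": (8, 4),
    "divide": (36, 35),
    "sqrt": (112, 111),
    "negate": (2, 1),
    "abs": (2, 1),
    "compare": (3, 2),
}


def cycles_for_sequence(op, count):
    """Cycles from issuing the first of `count` back-to-back operations of
    type `op` until the last result is available."""
    if count < 1:
        raise ValueError("count must be at least 1")
    latency, interval = FP_TIMING[op]
    # Each extra operation can start `interval` cycles after the previous one.
    return latency + (count - 1) * interval


if __name__ == "__main__":
    for op in ("add", "multiply", "divide"):
        print(op, [cycles_for_sequence(op, n) for n in (1, 2, 3)])
```

The divide and square-root rows show why a long initiation interval matters: their interval is almost as long as their latency, so those units are barely pipelined.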
11
R4000 Performance
• Not the ideal CPI of 1:
– Load stalls (1 or 2 clock cycles)
– Branch stalls (2 cycles + unfilled slots)
– FP result stalls: RAW data hazard (latency)
– FP structural stalls: not enough FP hardware (parallelism)
Figure: stacked bar chart of CPI (axis from 0 to 4.5) for the benchmarks eqntott, espresso, gcc, li, doduc, nasa7, ora, spice2g6, su2cor and tomcatv, with each bar split into base, load stalls, branch stalls, FP result stalls and FP structural stalls.
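The breakdown used on this slide amounts to adding per-instruction stall contributions to the ideal CPI of 1. A minimal Python sketch, with made-up stall values that are not read from the chart:

```python
def pipeline_cpi(load, branch, fp_result, fp_structural, base=1.0):
    """Effective CPI as the ideal (base) CPI plus the average stall cycles
    per instruction from each of the four sources named on the slide."""
    return base + load + branch + fp_result + fp_structural


def slowdown_vs_ideal(cpi, base=1.0):
    """How many times slower than a stall-free pipeline at the same clock."""
    return cpi / base


if __name__ == "__main__":
    # Hypothetical integer program: no FP stalls at all.
    cpi = pipeline_cpi(load=0.25, branch=0.5, fp_result=0.0, fp_structural=0.0)
    print(cpi, slowdown_vs_ideal(cpi))
```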
12
Some Things to Observe
• Unexpected events (e.g. branches) may cause problems
• Pipelining cannot be done arbitrarily
– Laminarity becomes a problem
– Stalls tend to increase
– Exception and branch penalties go up
– There is a point of diminishing returns
• Performance of unoptimized code is misleading
– Especially true of pipelined machines
– Instruction scheduling can remove most of the stalls
Easy huh?! On to ILP and multiple-issue pipelines…
13
Summary of Appendix A
• Pipelining: most important technique for enhancing processor performance
• Pipelining concepts
• Simple compiler strategies for enhancing performance
• How to deal with structural, data and control hazards
• How to deal with FP operations
• Example of a longer pipeline: MIPS R4000