MIPSR4000


Case Study: MIPS R4000

Section A.6

Advanced Computer Architecture – Fall 2004


Goal

• Today: Longer pipelines (R4000) => Better branch prediction, more instruction parallelism?


Case Study: MIPS R4000 (200 MHz)

• 8 Stage Pipeline:
– IF: first half of instruction fetch; PC selection happens here, as well as initiation of the instruction cache access.
– IS: second half of the instruction cache access.
– RF: instruction decode and register fetch, hazard checking, and instruction cache hit detection.
– EX: execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
– DF: data fetch, first half of the data cache access.
– DS: second half of the data cache access.
– TC: tag check, determines whether the data cache access hit.
– WB: write back for loads and register-register operations.

• 8 Stages: What is the impact on load delay? On branch delay? Why?


R4000 Pipeline

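[Figure: R4000 pipeline datapath (Inst. Memory, Reg, ALU, Data Memory, Reg) laid over the stages IF IS RF EX DF DS TC WB]

Stage annotations from the figure:
IF: Inst. first half, PC select
IS: Inst. second half, finish
RF: RegFetch, Icache hit
EX: EFA, ALU, branch tgt.
DF: Data cache
DS: Data cache
TC: Tag check (data)
WB: Write back for loads and RR ALU operations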


Load Delay

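Clock      1   2   3   4   5   6   7   8
Instr 1    IF  IS  RF  EX  DF  DS  TC  WB
Instr 2        IF  IS  RF  EX  DF  DS  TC
Instr 3            IF  IS  RF  EX  DF  DS
Instr 4                IF  IS  RF  EX  DF
Instr 5                    IF  IS  RF  EX
Instr 6                        IF  IS  RF
Instr 7                            IF  IS
Instr 8                                IF

Two-cycle load latency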

• 2-cycle load delay
– Value is available at the end of DS => it can be used before the tag check
– Seems odd, since the result has to be taken back on a miss…
– But a miss is rare, so the common case is a hit
– And dependent instructions are behind the load, so stalling them is easy
– But a destructive operation cannot be done until the access is known to be a hit
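The two-cycle figure follows from the stage positions: the load's data is available at the end of DS, while a consumer needs its operand at the start of EX. A minimal Python sketch of that arithmetic, assuming full forwarding from the end of DS into EX (`STAGES` and `load_use_stalls` are illustrative names, not part of the slides):

```python
STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]

def load_use_stalls(distance, value_ready="DS", value_needed="EX"):
    """Stall cycles for a consumer issued `distance` instructions after a load.

    Assumes the loaded value is forwarded at the end of `value_ready`
    and the consumer needs it at the start of `value_needed`.
    """
    ready = STAGES.index(value_ready)     # DS is stage index 5
    needed = STAGES.index(value_needed)   # EX is stage index 3
    load_delay = ready - needed           # 5 - 3 = 2 cycles
    # Each independent instruction placed between the load and its
    # consumer hides one cycle of the delay.
    return max(0, load_delay - (distance - 1))

for d in (1, 2, 3):
    print(d, load_use_stalls(d))
```

Under these assumptions the instruction right behind the load stalls 2 cycles, the next one stalls 1 cycle, and anything further back does not stall.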


Branch Delay

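Clock      1   2   3   4   5   6   7   8
Instr 1    IF  IS  RF  EX  DF  DS  TC  WB
Instr 2        IF  IS  RF  EX  DF  DS  TC
Instr 3            IF  IS  RF  EX  DF  DS
Instr 4                IF  IS  RF  EX  DF
Instr 5                    IF  IS  RF  EX
Instr 6                        IF  IS  RF
Instr 7                            IF  IS
Instr 8                                IF

Three-cycle branch latency (conditions evaluated during EX phase)

Delay slot plus two stalls
Branch likely cancels delay slot if not taken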

• 3-cycle branch delay
– Target is only really known after the EX stage
– 1st delay slot happens no matter what
– Slots 2 and 3
  • Filled if predict-not-taken
  • NOPs if predict-taken


Branch handling in MIPS
• MIPS architecture uses delayed branch
• R4000 uses a predict-not-taken strategy
• Only one delay slot is filled by the compiler
• Taken branches need 2 additional stall cycles inserted by the HW
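The per-branch cost can be worked out from the points above. The sketch below assumes the branch is resolved in EX, that there is one architectural delay slot, and that the remaining slots follow predict-not-taken; the function and parameter names are hypothetical.

```python
STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]
DELAY_SLOTS = 1  # one architectural delay slot, filled by the compiler when possible

def branch_delay(resolve_stage="EX"):
    """Cycles between fetching a branch and fetching from its resolved target."""
    return STAGES.index(resolve_stage) - STAGES.index("IF")

def wasted_cycles(taken, slot_filled=True):
    """Cycles lost on one branch under predict-not-taken."""
    # Taken: the instructions fetched after the delay slot are squashed.
    # Not taken: the fall-through instructions were the right ones.
    lost = branch_delay() - DELAY_SLOTS if taken else 0
    if not slot_filled:
        lost += 1  # a NOP in the delay slot does no useful work
    return lost

print(branch_delay())                     # 3-cycle branch delay
print(wasted_cycles(taken=True))          # 2 stall cycles on a taken branch
print(wasted_cycles(taken=False))         # no loss when not taken
```

An unfilled delay slot adds one lost cycle in either case, which corresponds to the "2 cycles + unfilled slots" accounting of branch stalls.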


MIPS R4000 Floating Point
• 3 functional units: FP Adder, FP Multiplier, FP Divider
• Last step of FP Multiplier/Divider uses FP Adder HW
• 8 kinds of stages in FP units:

Stage   Functional unit   Description
A       FP adder          Mantissa ADD stage
D       FP divider        Divide pipeline stage
E       FP multiplier     Exception test stage
M       FP multiplier     First stage of multiplier
N       FP multiplier     Second stage of multiplier
R       FP adder          Rounding stage
S       FP adder          Operand shift stage
U                         Unpack FP numbers

• A stage may be used multiple times by an instruction => lots of structural hazards


MIPS FP Pipe Stages

FP Instr         1   2     3     4     5   6   7     8 …
Add, Subtract    U   S+A   A+R   R+S
Multiply         U   E+M   M     M     M   N   N+A   R
Divide           U   A     R     D^28  …   D+A   D+R, D+R, D+A, D+R, A, R
Square root      U   E     (A+R)^108   …   A   R
Negate           U   S
Absolute value   U   S
FP compare       U   A     R

(X^n denotes stage X repeated n times.)

Stages:
A  Mantissa ADD stage
D  Divide pipeline stage
E  Exception test stage
M  First stage of multiplier
N  Second stage of multiplier
R  Rounding stage
S  Operand shift stage
U  Unpack FP numbers
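Because each instruction occupies specific stages in specific cycles, a structural hazard between two FP instructions can be checked by overlaying their rows of the table. A sketch using the add, multiply, and compare rows from above; `conflicts` is an illustrative helper, and it only tests whether a given issue gap collides (it does not model the stall that would follow):

```python
# Stage usage per cycle, copied from the add, multiply and compare rows.
FP_STAGES = {
    "add": ["U", "S+A", "A+R", "R+S"],
    "mul": ["U", "E+M", "M", "M", "M", "N", "N+A", "R"],
    "cmp": ["U", "A", "R"],
}

def units(entry):
    """Split a table entry such as 'N+A' into the set of stages it uses."""
    return set(entry.split("+"))

def conflicts(first, second, gap):
    """Stages needed by both instructions in the same cycle.

    `second` issues `gap` cycles after `first`. Returns a list of
    (cycle of `first`, shared stages), with cycles counted from 1.
    """
    a, b = FP_STAGES[first], FP_STAGES[second]
    found = []
    for i, entry in enumerate(b):
        j = i + gap                      # cycle index in the first instruction
        if j < len(a):
            shared = units(a[j]) & units(entry)
            if shared:
                found.append((j + 1, sorted(shared)))
    return found

# Which issue gaps make an add collide with an earlier multiply?
print([g for g in range(1, 8) if conflicts("mul", "add", g)])
```

With these rows, an add collides with an earlier multiply only when issued 4 or 5 cycles after it (both need the A and R stages near the end of the multiply), while a multiply issued after an add finds no shared stage.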


FP operations in MIPS R4000
• Latency and Initiation Interval?

FP instruction    Latency   Initiation Interval
Add, Sub          4         3
Multiply          8         4
Divide            36        35
Square Root       112       111
Negate            2         1
Absolute Value    2         1
FP compare        3         2
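To see how the two columns interact, the following sketch estimates how long a run of independent operations of one type takes. It assumes latency counts cycles from issue to result and the initiation interval is the minimum gap between issuing two operations of the same type; `cycles_for_independent_ops` is a hypothetical helper.

```python
# operation: (latency, initiation interval), taken from the table above
FP_TIMING = {
    "add": (4, 3),
    "multiply": (8, 4),
    "divide": (36, 35),
    "sqrt": (112, 111),
    "negate": (2, 1),
    "abs": (2, 1),
    "compare": (3, 2),
}

def cycles_for_independent_ops(op, n):
    """Cycles from the first issue until the last of n results is ready."""
    latency, interval = FP_TIMING[op]
    # The last operation issues (n - 1) intervals after the first,
    # then needs one full latency to produce its result.
    return (n - 1) * interval + latency

print(cycles_for_independent_ops("add", 3))
print(cycles_for_independent_ops("divide", 2))
```

For divide and square root the initiation interval is almost the full latency, so these units are barely pipelined; negate and absolute value can be issued every cycle.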


R4000 Performance
• Not ideal CPI of 1:
– Load stalls (1 or 2 clock cycles)
– Branch stalls (2 cycles + unfilled slots)
– FP result stalls: RAW data hazard (latency)
– FP structural stalls: Not enough FP hardware (parallelism)

[Figure: stacked bar chart, one bar per benchmark (eqntott, espresso, gcc, li, doduc, nasa7, ora, spice2g6, su2cor, tomcatv), vertical axis from 0 to 4.5. Legend: Base, Load stalls, Branch stalls, FP result stalls, FP structural stalls.]
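The breakdown adds each stall category on top of the ideal CPI of 1. A small sketch of that accounting; the stall values in `example` are made-up placeholders, not measurements from the chart, and `cpi_breakdown` is an illustrative name.

```python
def cpi_breakdown(stalls_per_instr, base=1.0):
    """Total CPI and each component's share of it.

    `stalls_per_instr` maps a stall category to its average stall
    cycles per instruction.
    """
    total = base + sum(stalls_per_instr.values())
    shares = {"base": base / total}
    shares.update({name: s / total for name, s in stalls_per_instr.items()})
    return total, shares

# Hypothetical stall cycles per instruction, for illustration only.
example = {
    "load": 0.25,
    "branch": 0.5,
    "fp_result": 0.125,
    "fp_structural": 0.125,
}

cpi, shares = cpi_breakdown(example)
print(cpi, shares)
```

With these placeholder numbers the stalls double the CPI, and branch stalls alone account for a quarter of the total.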


Some Things to Observe
• Unexpected events (e.g. branches) may cause problems
• Pipelining cannot be done arbitrarily
– Laminarity becomes a problem
– Stalls tend to increase
– Exception and branch penalties go up
– There is a point of diminishing returns
• Performance of unoptimized code is misleading
– Especially true of pipelined machines
– Instruction scheduling can remove most of the stalls

Easy huh?! On to ILP and multiple-issue pipelines…
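As a toy illustration of the scheduling point, the sketch below counts load-use stalls for the same six instructions in two orders. It models only the two-cycle load delay on an in-order pipeline (no branches, FP, or cache misses); the program encoding and register names are invented for the example.

```python
LOAD_DELAY = 2  # extra cycles before a loaded value can be consumed

def count_load_stalls(program):
    """Total stall cycles for an in-order pipeline with a 2-cycle load delay.

    `program` is a list of (dest, sources, is_load) tuples.
    """
    ready_at = {}  # register -> earliest cycle a consumer may issue
    cycle = 0      # cycle in which the next instruction would issue
    stalls = 0
    for dest, sources, is_load in program:
        earliest = max([ready_at.get(r, 0) for r in sources] + [cycle])
        stalls += earliest - cycle
        cycle = earliest
        ready_at[dest] = cycle + 1 + (LOAD_DELAY if is_load else 0)
        cycle += 1
    return stalls

# Two sums, each of two loaded values, in source order.
unscheduled = [
    ("r1", [], True), ("r2", [], True), ("r3", ["r1", "r2"], False),
    ("r4", [], True), ("r5", [], True), ("r6", ["r4", "r5"], False),
]
# Same instructions with all loads hoisted ahead of the adds.
scheduled = [
    ("r1", [], True), ("r2", [], True), ("r4", [], True), ("r5", [], True),
    ("r3", ["r1", "r2"], False), ("r6", ["r4", "r5"], False),
]

print(count_load_stalls(unscheduled), count_load_stalls(scheduled))
```

In this model, hoisting the loads reduces the stall count from 4 cycles to 1 without changing what the code computes.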


Summary of Appendix A
• Pipelining: most important technique for enhancing processor performance
• Pipelining concepts
• Simple compiler strategies for enhancing performance
• How to deal with structural, data and control hazards
• How to deal with FP operations
• Example of longer pipeline: MIPS R4000
