63
Computer Architecture 2011 – Pipeline 1 Computer Architecture MIPS Pipeline Lihu Rappoport and Adi Yoaz Some of the slides were taken from: (1) Avi Mendelson (2) Randi Katz (3) Patterson

Computer Architecture MIPS Pipeline

  • Upload
    siusan

  • View
    75

  • Download
    1

Embed Size (px)

DESCRIPTION

Computer Architecture MIPS Pipeline. Lihu Rappoport and Adi Yoaz. Some of the slides were taken from: (1) Avi Mendelson (2) Randi Katz (3) Patterson . Program execution order. Program execution order. Time. Time. lw R1, 100(R0). lw R1, 100(R0). lw R2, 200(R0). - PowerPoint PPT Presentation

Citation preview

Page 1: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline1

Computer Architecture

MIPS Pipeline

Lihu Rappoport and Adi Yoaz

Some of the slides were taken from: (1) Avi Mendelson (2) Randi Katz (3) Patterson

Page 2: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline2

DataAccess

DataAccess

DataAccess

DataAccess

DataAccess

Pipelining Instructions

Ideal speedup is number of stages in the pipeline. Do we achieve this?

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

. ..

InstFetch Reg ALU Reg

InstFetch Reg ALU Reg

InstFetch

InstFetch Reg ALU Reg

InstFetch Reg ALU Reg

InstFetch Reg ALU Reg

2 ns

2 ns

2 ns 2 ns 2 ns 2 ns 2 ns

8 ns

8 ns

8 ns

TimeProgramexecutionorder

lw R1, 100(R0)

lw R2, 200(R0)

lw R3, 300(R0)

TimeProgramexecutionorder

lw R1, 100(R0)

lw R2, 200(R0)

lw R3, 300(R0)

Page 3: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline3

Pipelined Car Assembly

chassis engine finish

1 hour 2 hours 1 hour

Car 1Car 2Car 3

Page 4: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline4

Pipelining Pipelining does not reduce the latency of single task,

it increases the throughput of entire workload Potential speedup = Number of pipe stages

Pipeline rate is limited by the slowest pipeline stageÞ Partition the pipe to many pipe stagesÞ Make the longest pipe stage to be as short as possible Þ Balance the work in the pipe stages

Pipeline adds overhead (e.g., latches) Time to “fill” pipeline and time to “drain” it reduces speedup Stall for dependencies

Þ Too many pipe-stages start to loose performance IPC of an ideal pipelined machine is 1

Every clock one instruction finishes

Page 5: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline5

Instruction Decode /

register fetch

Instruction fetch

Execute /address

calculation

Memoryaccess

Writeback

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipelined CPU

Page 6: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline6

Pipelined CPU with Control

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EXEX/MEM

MEM/WBIn

stru

ctio

n

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction

Con

trol

WBMEM

WBMEM

WB

EXE

Instruction Decode /

register fetch

Instruction fetch

Execute /address

calculation

Memoryaccess

Writeback

Page 7: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline7

Structural Hazard Attempt to use the same resource two different ways at the same time Register File:

Accessed in 2 stages: Read during stage 2 (ID) Write during stage 5 (WB)

Solution: 2 read ports, 1 write port Memory

Accessed in 2 stages: Instruction Fetch during stage 1 (IF) Data read/write during stage 4 (MEM)

Solution: separate instruction cache and data cache

Each functional unit can only be used once per instruction Each functional unit must be used at the same stage for all

instructions

IM Reg

IM Reg

IM Reg DM Reg

IM DM Reg

IM DM Reg

DM Reg

Reg

Reg

Reg

DM

Page 8: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline8

Pipeline Example: cycle 1

0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction lwPC4

4

Page 9: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline9

Pipeline Example: cycle 2

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

lw

PC8

8 4

sub[R1]

910

0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7

Page 10: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline10

Pipeline Example: cycle 3

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

lw

PC12

12 8

and[R2]

[R1]+9

sub4

[R3]

1011 0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7

Page 11: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline11

Pipeline Example: cycle 4

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

lw

PC16

12 8

or[R4]

Data frommemoryaddress[R1]+9

sub4

[R5]

1112

and16

10

[R2]-[R3]

0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7

Page 12: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline12

Problem with starting next instruction before first is finished dependencies that “go backward in time” are data

hazards

Dependencies: RAW Hazard

IM Reg

IM Reg

IM Reg DM Reg

IM DM Reg

IM DM Reg

DM Reg

Reg

Reg

Reg

DM

sub R2, R1, R3

and R12,R2, R5

or R13,R6, R2

add R14,R2, R2

sw R15,100(R2)

Programexecutionorder

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)

CC 7 CC 8 CC 910 10/–20Value of R2 10 10 10 -20 -20 -20 -20

Page 13: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline13

IM bubble bubble bubble bubble

IM bubble bubble bubble bubble

IM bubble bubble bubble bubble

RAW Hazard: HW Solution 1 - Add Stalls

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)

CC 7 CC 8 CC 910 10/–20Value of R2

DM Reg

IM DM RegReg

IM Reg

IM Reg DM Reg

IM DM Reg

Reg

Reg

DM

sub R2, R1, R3

stall

stall

stall

and R12,R2, R5

or R13,R6, R2

add R14,R2, R2

sw R15,100(R2)

Programexecutionorder

10 10 10 -20 -20 -20 -20

• Have the hardware detect hazard and add stalls if needed

• Problem: this also slows us down!

Page 14: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline14

Use temporary results, don’t wait for them to be written to the register file register file forwarding to handle read/write to same register ALU forwarding

RAW Hazard: HW Solution 2 - Forwarding

IM Reg

IM Reg

IM Reg DM Reg

IM DM Reg

IM DM Reg

DM Reg

Reg

Reg

Reg

X X X –20 X X X X XX X X X –20 X X X X

DM

sub R2, R1, R3

and R12,R2, R5

or R13,R6, R2

add R14,R2, R2

sw R15,100(R2)

Programexecutionorder

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)

CC 7 CC 8 CC 910 10/–20Value of R2 10 10 10 -20 -20 -20 -20

Value EX/MEMValue MEM/WB

Page 15: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline15

Forwarding Hardware

Con

trol

EX

M

WB

ForwardingUnit

1mux

0

0mux

1

0muxB

1

2

0muxA

1

2

DataMemoryA

LU

Register File

Inst

ruct

ion

Mem

ory

PC

ID/EX EX/MEM MEM/WBIF/ID

Inst

ruct

ion

IF/ID.RsIF/ID.RtIF/ID.RtIF/ID.Rd

Rs

RtRtRd

WB

M WB

EX/MEM.Rd

MEM/WB.Rd

EX/M

EM.R

egW

rite

MEM

/WB

.Reg

Writ

e

Page 16: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline16

Forwarding ControlEX Hazard:

if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg1)) ALUSelA = 1

if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg2)) ALUSelB = 1

MEM Hazard: if (MEM/WB.RegWrite and ((not EX/MEM.RegWrite) or (EX/MEM.WriteReg

ID/EX.ReadReg1)) and (MEM/WB.WriteReg = ID/EX.ReadReg1)) ALUSelA = 2 if (MEM/WB.RegWrite and ((not EX/MEM.RegWrite) or (EX/MEM.WriteReg

ID/EX.ReadReg2)) and (MEM/WB.WriteReg = ID/EX.ReadReg2)) ALUSelB = 2

Page 17: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline17

Forwarding Hardware Example: Bypassing From EX to Src1 and From WB to Src2

lw R11,9(R1) sub R10,R2, R3and R12,R10,R11

Con

trol

EX

M

WB

ForwardingUnit

1mux

0

0mux

1

0mux

1

2

0mux

1

2

DataMemoryA

LU

Register File

Inst

ruct

ion

Mem

ory

PC

ID/EX EX/MEM MEM/WBIF/ID

Inst

ruct

ion

IF/ID.RsIF/ID.RtIF/ID.RtIF/ID.Rd

Rs

RtRtRd

WB

M WB

EX/MEM.Rd

MEM/WB.Rd

lw[R10]

Data frommemoryaddress[R1]+9

sub

[R11]

1012

and

11

[R2]-[R3]

1110

Page 18: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline18

Forwarding Hardware Example 2: Bypassing From WB to Src2

sub R10,R2, R3xxxand R12,R10,R11

Con

trol

EX

M

WB

ForwardingUnit

1mux

0

0mux

1

0mux

1

2

0mux

1

2

DataMemoryA

LU

Register File

Inst

ruct

ion

Mem

ory

PC

ID/EX EX/MEM MEM/WBIF/ID

Inst

ruct

ion

IF/ID.RsIF/ID.RtIF/ID.RtIF/ID.Rd

Rs

RtRtRd

WB

M WB

EX/MEM.Rd

MEM/WB.Rd

sub[R11]

xxx

[R10]

12

and

10

[R2]-[R3]

1110

Page 19: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline19

Register File Split

sub R2, R1, R3

xxx

xxx

and R12,R2,R11

Register file is written during first half of the cycle Register file is read during second half of the cycleÞ Register file is written before it is read Þ returns the

correct data

IM Reg

IM Reg

IM DM Reg

IM DM Reg

DM Reg

Reg

DMReg

Reg

Page 20: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline20

Load word can still causes a hazard: an instruction tries to read a register following a load

instruction that writes to the same register

Þ A hazard detection unit is needed to “stall” the load instruction

Can't always forward

Reg

IM

Reg

Reg

IM

IM Reg DM Reg

IM DM Reg

IM DM Reg

DM Reg

Reg

Reg

DM

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)

CC 7 CC 8 CC 9Programexecutionorder

lw R2, 30(R1)

and R12,R2, R5

or R13,R6, R2

add R14,R2, R2

sw R15,100(R2)

Page 21: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline21

De-assert the enable to ID/EXE The dependant instruction (and) stays another cycle in IF/EXE

De-assert the enable to the IF/ID latch, and to the PC Freeze pipeline stages preceding the stalled instruction

Issue a NOP into the EXE/MEM latch (instead of the stalled inst.) Allow the stalling instruction (lw) to move on

Reg

IM

Reg

Reg

IM DM

IM Reg DM RegIM

IM DM Reg

IM DM Reg

DM Reg

RegReg

Reg

bubble

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)

CC 7 CC 8 CC 9Programexecutionorder

lw R2, 30(R1)

and R12,R2, R5

or R13,R6, R2

add R14,R2, R2

sw R15,100(R2)

Stalling

Page 22: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline22

Hazard Detection (Stall) Logicif (ID/EX.RegWrite and (ID/EX.opcode = lw) and ( (ID/EX.WriteReg = IF/ID.ReadReg1) or (ID/EX.WriteReg = IF/ID.ReadReg2) ) then stall

Page 23: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline23

Forwarding + Hazard Detection Unit

Con

trol

EX

M

WB

ForwardingUnit

1mux

0

0mux

1

0mux

1

2

0mux

1

2

DataMemoryA

LU

Register File

Inst

ruct

ion

Mem

ory

PC

ID/EX EX/MEM MEM/WBIF/ID

Inst

ruct

ion

IF/ID.RsIF/ID.RtIF/ID.RtIF/ID.Rd

Rs

RtRtRd

WB

M WB

EX/MEM.Rd

MEM/WB.Rd

HazardDetection

Unit

0

ID/EX.MemRead

PC W

rite

0mux

1

IF/ID

Writ

e

ID/EX.Rt

Page 24: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline24

Example: code for (assume all variables are in memory):a = b + c;d = e – f;

Slow codeLW Rb,bLW Rc,c

Stall ADD Ra,Rb,RcSW a,Ra LW Re,e LW Rf,f

Stall SUB Rd,Re,RfSW d,Rd

Instruction order can be changed as long as the correctness is kept

Software Scheduling to Avoid Load Hazards

Fast codeLW Rb,bLW Rc,cLW Re,e ADD Ra,Rb,RcLW Rf,fSW a,Ra SUB Rd,Re,RfSW d,Rd

Page 25: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline25

Control Hazards

Page 26: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline26

Executing a BEQ Instruction (i)BEQ R4, R5, 27 ; if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4

0 or 4 beq R4, R5, 27 8 and12 sw16 sub

Calculate branch condition

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction

R4R5 -27

812

12

beqand

Calculate branch target

Page 27: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline27

Executing a BEQ Instruction (ii)BEQ R4, R5, 27 ; if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4

0 or 4 beq R4, R5, 27 8 and12 sw16 sub

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction R4-R5=0-

8+SignExt(27)*4

1216

16

beqandsw

Page 28: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline28

Executing a BEQ Instruction (iii)BEQ R4, R5, 27 ; if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4

0 or 4 beq R4, R5, 27 8 and12 sw16 sub

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction

8+SignExt(27)*4

16

20 or 116

beqandsw

20

sub

Page 29: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline29

Control Hazard on Branches

And

Beq

sub

sw

The 3 instructions following the branch get into the pipe even if the branch is taken

Inst from target

IM Reg DM Reg

PC

IM Reg DM Reg

PC

IM Reg DM Reg

PC

IM Reg DM Reg

PC

IM Reg DM Reg

PC

Page 30: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline30

Control Hazard: Stall

Stall pipe when branch is encountered until resolved Stall impact: assumptions

CPI = 1 20% of instructions are branches Stall 3 cycles on every branch

Þ CPI new = 1 + 0.2 × 3 = 1.6 (CPI new = CPI Ideal + avg. stall cycles / instr.)

We loose 60% of the performance

Page 31: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline31

Control Hazard: Predict Not Taken

Execute instructions from the fall-through (not-taken) path As if there is no branch If the branch is not-taken (~50%), no penalty is paid

If branch actually taken Flush the fall-through path instructions before they

change the machine state (memory / registers) Fetch the instructions from the correct (taken) path

Assuming ~50% branches not taken on average CPI new = 1 + (0.2 × 0.5) × 3 = 1.3

Page 32: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline32

Dynamic Branch Prediction

Look up

PC of inst in fetch

?= Branch predicted taken or not taken

No:Inst is not pred to be branch

Yes:Inst is pred to be branch

Branch PC Target PC History

Predicted Target

Add a Branch Target Buffer (BTB) the predicts (at fetch) Instruction is a branch Branch taken / not-taken Taken branch target

Page 33: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline33

BTB Allocation

Allocate instructions identified as branches (after decode) Both conditional and unconditional branches are allocated

Not taken branches need not be allocated BTB miss implicitly predicts not-taken

Prediction BTB lookup is done parallel to IC lookup BTB provides

Indication that the instruction is a branch (BTB hits) Branch predicted target Branch predicted direction Branch predicted type (e.g., conditional, unconditional)

Update (when branch outcome is known) Branch target Branch history (taken / not-taken)

Page 34: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline34

BTB (cont.) Wrong prediction

Predict not-taken, actual taken Predict taken, actual not-taken, or actual taken but wrong

target In case of wrong prediction – flush the pipeline

Reset latches (same as making all instructions to be NOPs) Select the PC source to be from the correct path

Need get the fall-through with the branch Start fetching instruction from correct path

Assuming P% correct prediction rate CPI new = 1 + (0.2 × (1-P)) × 3 For example, if P=0.7

CPI new = 1 + (0.2 × 0.3) × 3 = 1.18

Page 35: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline35

Adding a BTB to the Pipeline

Shift left 2

ALUSrc

6

ALUresultZero

+

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EXEX/MEM MEM

/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4+

IF/ID

PC

0

1

mux

0

1

mux

0mux

1

0

mux

Inst.Memory

Address

Instruction

BTB

12

pred targetpred dir

PC+4 (Not-taken target)taken target

3

Mis-predict

DetectionUnit

Flush

predicted targetPC+4 (Not-taken target)

predicted direction

−4

address

targetdirection

alloc/updt

Page 36: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline36

Using The BTBPC moves to next instruction

Inst Mem gets PCand fetches new inst

BTB gets PCand looks it up

IF/ID latch loadedwith new inst

BTB Hit ?

Br taken ?

PC PC + 4PC perd addr

IF

IDIF/ID latch loaded

with pred instIF/ID latch loaded

with seq. instBranch ?

yes no

noyes

noyesEXE

Page 37: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline37

Using The BTB (cont.)ID

EXE

MEM

WB

Branch ?

Calculate brcond & trgt

Flush pipe &update PC

Corect pred ?

yes no

IF/ID latch loadedwith correct inst

continue

Update BTB

yes no

continue

Page 38: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline38

Backup

Page 39: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline39

· R-type (register insts)

· I-type (Load, Store, Branch, inst’s w/imm data)· J-type (Jump)

op: operation of the instructionrs, rt, rd: the source and destination register specifiersshamt: shift amountfunct: selects the variant of the operation in the “op” fieldaddress / immediate: address offset or immediate valuetarget address: target address of the jump instruction

op target address02631

6 bits 26 bits

op rs rt rd shamt funct061116212631

6 bits 6 bits5 bits5 bits5 bits5 bits

op rs rt immediate016212631

6 bits 16 bits5 bits5 bits

MIPS Instruction Formats

Page 40: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline40

Each memory location is 8 bit = 1 byte wide has an address

We assume 32 byte address An address space of 232 bytes

Memory stores both instructions and data Each instruction is 32 bit wide Þ

stored in 4 consecutive bytes in memory

Various data types have different width

The Memory Space1 byte

00000000000000010000000200000003000000040000000500000006

FFFFFFFAFFFFFFFBFFFFFFFCFFFFFFFDFFFFFFFEFFFFFFFF

Page 41: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline41

Register FileRegWrite

Readreg 1

Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

RegisterFile

32

32

32

5

5

5

The Register File holds 32 registers Each register is 32 bit wide The RF supports parallel

reading any two registers and writing any register

Inputs Read reg 1/2: #register whose value

will be output on Read data 1/2 RegWrite: write enable

Write reg (relevant when RegWrite=1) #register to which the value in Write data is written to

Write data (relevant when RegWrite=1) data written to Write reg

Outputs Read data 1/2: data read from Read reg 1/2

Page 42: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline42

Memory Components Inputs

Address: address of the memory location we wish to access

Read: read data from location Write: write data into location Write data (relevant when Write=1)

data to be written into specified location Outputs

Read data (relevant when Read=1) data read from specified location

Write

Address

WriteData

ReadDataMemory

Read

32

32

32

Cache Memory components are slow relative to the CPU

A cache is a fast memory which contains only small part of the memory Instruction cache stores parts of the memory space which hold code Data Cache stores parts of the memory space which hold data

Page 43: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline43

The Program Counter (PC) Holds the address (in memory) of the next instruction to be

executed After each instruction, advanced to point to the next

instruction If the current instruction is not a taken branch, the next instruction resides right after the current instruction PC PC + 4 If the current instruction is a taken branch, the next instruction resides at the branch target PC target (absolute jump) PC PC + 4 + offset×4 (relative jump)

Page 44: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline44

Fetch Fetch instruction pointed by PC from I-Cache

Decode Decode instruction (generate control signals) Fetch operands from register file

Execute For a memory access: calculate effective address For an ALU operation: execute operation in ALU For a branch: calculate condition and target

Memory Access For load: read data from memory For store: write data into memory

Write Back Write result back to register file update program counter

Instruction Execution Stages

InstructionFetch

InstructionDecode

Execute

Memory

ResultStore

Page 45: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline45

The MIPS CPUInstruction

Decode /register fetch

Instruction fetch

Execute /address

calculation

Memoryaccess

Writeback

ALUSrcALU

ALU resZero

AddShift left 2

0

1

mux

ALUControl

ALUOp

Signextend

16 32

Instruction

4Add

PC

InstructionCache

Address

Instruction

0

1

mux

MemRead

MemWrite

Address

WriteData

ReadData

DataCache

Branch

PCSrc

MemtoReg

1

0

mux

RegDst

RegWrite

Readreg 1

Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r Fi

le [20-16]

[25-21]

0

1

mux [15-11]

[15-0]

[5-0]

Con

trol

Page 46: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline46

Executing an Add Instruction

ALUSrc=0ALU

ALU resZero

AddShift left 2

0

1

mux

ALUControl

ALUOp

Signextend

16 32

Instruction

4Add

PC

InstructionMemory

Address

Instruction

0

1

mux

MemRead=0

MemWrite=0

Address

WriteData

ReadData

DataMemory

Branch=0

PCSrc=0

MemtoReg=01

0

mux

RegDst=1

RegWrite=1Readreg 1

Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r Fi

le [20-16]

[25-21]

0

1

mux [15-11]

[15-0]

[5-0]

R3

R5

35

R3 + R5

+

[PC]+4

2

op rs rt rd shamt funct 0611162126313 5 2 0 0 = AddALU

Add R2, R3, R5 ; R2 R3+R5

ADD

Page 47: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline47

Executing a Load Instruction

op rs rt immediate 0162126312 1 30LW

LW R1, (30)R2 ; R1 Mem[R2+30]

ALUSrc=1ALU

ALU resZero

AddShift left 2

0

1

mux

ALUControl

ALUOp

Signextend

16 32

Instruction

4Add

PC

InstructionMemory

Address

Instruction

0

1

mux

MemRead=1

MemWrite=0

Address

WriteData

ReadData

DataMemory

Branch=0

PCSrc=0

MemtoReg=11

0

mux

RegDst=0

RegWrite=1Readreg 1

Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r Fi

le [20-16]

[25-21]

0

1

mux [15-11]

[15-0]

[5-0]

R2

30

2R2+30+

[PC]+4

1

Mem[R2+30]

LW

Page 48: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline48

Executing a Store Instruction

op rs rt immediate 0162126312 1 30SW

SW R1, (30)R2 ; Mem[R2+30] R1

ALUSrc = 1ALU

ALU resZero

AddShift left 2

0

1

mux

ALUControl

ALUOp

Signextend

16 32

Instruction

4Add

PC

InstructionMemory

Address

Instruction

0

1

mux

MemRead

MemWrite = 1

Address

WriteData

ReadData

DataMemory

Branch=0

PCSrc = 0

MemtoReg =1

0

mux

RegDst =

RegWrite = 0Readreg 1

Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r Fi

le [20-16]

[25-21]

0

1

mux [15-11]

[15-0]

[5-0]

R2

30

2R2+30+

[PC]+4

1SW

R1

Page 49: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline49

Executing a BEQ InstructionBEQ R4, R5, 27 ; if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4

op rs rt immediate 0162126314 5 27BEQ

ALUSrc=0ALU

ALU resZero

AddShift left 2

0

1

mux

ALUControl

ALUOp

Signextend

16 32

Instruction

4Add

PC

InstructionMemory

Address

Instruction

0

1

mux

MemRead=0

MemWrite=0

Address

WriteData

ReadData

DataMemory

Branch=1

PCSrc= (R4 – R5 == 0)

MemtoReg= 1

0

mux

RegDst=

RegWrite=0Readreg 1

Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r Fi

le [20-16]

[25-21]

0

1

mux [15-11]

[15-0]

[5-0]

R4

R5

45

R4-R5-

PC+4

27

ADD

PC+4orPC+4+SignExt(27)*4

Page 50: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline50

Control Signals

add sub ori lw sw beq jumpRegDstALUSrcMemtoRegRegWriteMemWriteBranchJumpALUctr<2:0>

1001000

Add

1001000

Subtract

0101000

Or

0111000

Add

x1x0100

Add

x0x0010

Subtract

xxx00x1

xxx

funcop 00 0000 00 0000 00 1101 10 0011 10 1011 00 0100 00 0010

10 0000 10 0010 Don’t Care

Page 51: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline51

Pipelined CPU: Load (cycle 1 – Fetch)

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction lw

op rs rt immediate 0162126312 1 30LW

LW R1, (30)R2 ; R1 Mem[R2+30]

PC+4

Page 52: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline52

Pipelined CPU: Load (cycle 2 – Dec)

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction

op rs rt immediate 0162126312 1 30LW

LW R1, (30)R2 ; R1 Mem[R2+30]

PC+4

R2

30

Page 53: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline53

Pipelined CPU: Load (cycle 3 – Exe)

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction

op rs rt immediate 0162126312 1 30LW

LW R1, (30)R2 ; R1 Mem[R2+30]

R2+30

Page 54: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline54

Pipelined CPU: Load (cycle 4 – Mem)

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction

op rs rt immediate 0162126312 1 30LW

LW R1, (30)R2 ; R1 Mem[R2+30]

D

Page 55: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline55

Pipelined CPU: Load (cycle 5 – WB)

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction

op rs rt immediate 0162126312 1 30LW

LW R1, (30)R2 ; R1 Mem[R2+30]

D

Page 56: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline56

PC

Instructionmemory

Inst

ruct

ion

Add

Instruction[20– 16]

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

16 32Instruction[15– 0]

0

0

Mux

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux1

ALUresult

Zero

Writedata

Readdata

Mux

1

ALUcontrol

Shiftleft 2

Reg

Writ

e

MemRead

Control

ALU

Instruction[15– 11]

6

EX

M

WB

M

WB

WBIF/ID

PCSrc

ID/EX

EX/MEM

MEM/WB

Mux

0

1

Mem

Writ

e

AddressData

memory

Address

Datapath with Control

Page 57: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline57

Multi-Cycle Control Pass control signals along just like the data

Execution/Address Calculation stage control lines

Memory access stage control lines

Write-back stage control

lines

InstructionReg Dst

ALU Op1

ALU Op0

ALU Src Branch

Mem Read

Mem Write

Reg write

Mem to Reg

R-format 1 1 0 0 0 0 0 1 0lw 0 0 0 1 0 1 0 1 1sw X 0 0 1 0 0 1 0 Xbeq X 0 1 0 1 0 0 0 X

Control

EX

M

WB

M

WB

WB

IF/ID ID/EX EX/MEM MEM/WB

Instruction

Page 58: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline58

Five Execution Steps Instruction Fetch

Use PC to get instruction and put it in the Instruction Register. Increment the PC by 4 and put the result back in the PC.

IR = Memory[PC];PC = PC + 4;

Instruction Decode and Register Fetch Read registers rs and rt Compute the branch address A = Reg[IR[25-21]];

B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2);

We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)

Page 59: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline59

Five Execution Steps (cont.)• Execution ALU is performing one of three functions, based on instruction type:

Memory Reference: effective address calculation.ALUOut = A + sign-extend(IR[15-0]);

R-type:ALUOut = A op B;

Branch:if (A==B) PC = ALUOut;

• Memory Access or R-type instruction completion

• Write-back step

Page 60: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline60

The Store Instruction

sw rt, rs, imm16

mem[PC] Fetch the instruction from memory

Addr <- R[rs] + SignExt(imm16) Calculate the memory addressMem[Addr] <- R[rt] Store the register into memoryPC <- PC + 4 Calculate the next instruction’s

address

op rs rt immediate016212631

6 bits 16 bits5 bits5 bits

Mem[Rs + SignExt[imm16]] <- Rt Example: sw rt, rs, imm16

Page 61: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline61

RAW Hazard: SW Solution

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)

CC 7 CC 8 CC 910 10/–20Value of R2

DM Reg

IM DM RegReg

IM Reg

IM Reg DM Reg

IM DM Reg

Reg

Reg

DM

sub R2, R1, R3

NOP

NOP

NOP

and R12,R2, R5

or R13,R6, R2

add R14,R2, R2

sw R15,100(R2)

Programexecutionorder

10 10 10 -20 -20 -20 -20

IM Reg

IM Reg DM Reg

IM DM Reg

Reg

Reg

DM

• Have compiler avoid hazards by adding NOP instructions

• Problem: this really slows us down!

Page 62: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline62

Delayed Branch Define branch to take place AFTER n following instruction

HW executes n instructions following the branch regardless of branch is taken or not

SW puts in the n slots following the branch instructions that need to be executed regardless of branch resolution Instructions that are before the branch instruction, or Instructions from the converged path after the branch

If cannot find independent instructions, put NOP

Original Coder3 = 23R4 = R3+R5If (r1==r2) goto xR1 = R4 + R5X: R7 = R1

New CodeIf (r1==r2) goto x r3 = 23R4 = R3 +R5NOPR1 = R4 + R5X: R7 = R1

Þ

Page 63: Computer Architecture MIPS Pipeline

Computer Architecture 2011 – Pipeline63

Delayed Branch Performance Filling 1 delay slot is easy, 2 is hard, 3 is harder Assuming we can effectively fill d% of the delayed slots

CPInew = 1 + 0.2 × (3 × (1-d))

For example, for d=0.5, we get CPInew = 1.3 Mixing architecture with micro-arch

New generations requires more delay slots Cause computability issues between generations