361 Computer Architecture Lecture 12: Designing a Pipeline ...kcoloma/ece361/lectures/Lec12-pipeline.pdf · pipeline.1 361 Computer Architecture Lecture 12: Designing a Pipeline Processor

1

pipeline.1

361Computer Architecture

Lecture 12: Designing a Pipeline Processor

pipeline.2

Overview of a Multiple Cycle Implementation

° The root of the single cycle processor’s problems:• The cycle time has to be long enough for the slowest instruction

° Solution:• Break the instruction into smaller steps• Execute each step (instead of the entire instruction) in one cycle

- Cycle time: time it takes to execute the longest step- Keep all the steps to have similar length

• This is the essence of the multiple cycle processor

° The advantages of the multiple cycle processor:• Cycle time is much shorter• Different instructions take different number of cycles to complete

- Load takes five cycles- Jump only takes three cycles

• Allows a functional unit to be used more than once per instruction

2

pipeline.3

Multiple Cycle Processor

° MCP: A functional unit to be used more than once per instruction

IdealMemoryWrAdrDin

RAdr

32

32

32Dout

MemWr32

AL

U

3232

ALUOp

ALUControl

Instruction Reg

32

IRWr

32

Reg File

Ra

Rw

busW

Rb5

5

32busA

32busB

RegWr

Rs

Rt

Mux

0

1

Rt

Rd

PCWr

ALUSelA

Mux 01

RegDst

Mux

0

1

32

PC

MemtoReg

Extend

ExtOp

Mux

0

132

0

1

23

4

16Imm 32

<< 2

ALUSelB

Mux

1

0

Target32

Zero

ZeroPCWrCond PCSrc BrWr

32

IorD

pipeline.4

Outline of Today’s Lecture

° Recap and Introduction

° Introduction to the Concept of Pipelined Processor

° Pipelined Datapath and Pipelined Control

° How to Avoid Race Condition in a Pipeline Design?

° Pipeline Example: Instructions Interaction

° Summary

3

pipeline.5

Pipelining is Natural!

° Laundry Example

° Sammy, Marc, Griffy, Alberteach have one load of clothesto wash, dry, and fold

° Washer takes 30 minutes

° Dryer takes 30 minutes

° “Folder” takes 30 minutes

° “Stasher” takes 30 minutesto put clothes into drawers

A B C D

pipeline.6

Sequential Laundry

° Sequential laundry takes 8 hours for 4 loads

° If they learned pipelining, how long would laundry take?

30Task

Order

B

CD

ATime

30 30 3030 30 3030 30 30 3030 30 30 3030

6 PM 7 8 9 10 11 12 1 2 AM

4

pipeline.7

Pipelined Laundry: Start work ASAP

° Pipelined laundry takes 3.5 hours for 4 loads!

Task

Order

12 2 AM6 PM 7 8 9 10 11 1

Time

BCD

A3030 30 3030 30 30

pipeline.8

Pipelining Lessons

° Pipelining doesn’t help latencyof single task, it helpsthroughput of entire workload

° Multiple tasks operatingsimultaneously using differentresources

° Potential speedup = Numberpipe stages

° Pipeline rate limited by slowestpipeline stage

° Unbalanced lengths of pipestages reduces speedup

° Time to “fill” pipeline and time to“drain” it reduces speedup

° Stall for Dependences

6 PM 7 8 9Time

BCD

A3030 30 3030 30 30

Task

Order

5

pipeline.9

Why Pipeline?

° Suppose we execute 100 instructions

° Single Cycle Machine• 45 ns/cycle x 1 CPI x 100 inst = 4500 ns

° Multicycle Machine• 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns

° Ideal pipelined machine• 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns

pipeline.10

Timing Diagram of a Load Instruction

Clk

PC

Rs, Rt, Rd,Op, Func

Clk-to-Q

ALUctr

Instruction Memory Access Time

Old Value New Value

RegWr Old Value New Value

Delay through Control Logic

busARegister File Access Time

Old Value New Value

busBALU Delay

Old Value New Value

Old Value New Value

New ValueOld Value

ExtOp Old Value New Value

ALUSrc Old Value New Value

Address Old Value New Value

busW Old Value New

Delay through Extender & Mux

Data Memory Access Time

Instruction Fetch Instr Decode /

Reg. Fetch

Address Reg WrData Memory

Register File W

rite Tim

e

6

pipeline.11

The Five Stages of Load

° Ifetch: Instruction Fetch• Fetch the instruction from the Instruction Memory

° Reg/Dec: Registers Fetch and Instruction Decode

° Exec: Calculate the memory address

° Mem: Read the data from the Data Memory

° Wr: Write the data back to the register file

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

Ifetch Reg/Dec Exec Mem WrLoad

pipeline.12

Pipelining the Load Instruction

° The five independent functional units in the pipeline datapath are:• Instruction Memory for the Ifetch stage• Register File’s Read ports (bus A and busB) for the Reg/Dec stage• ALU for the Exec stage• Data Memory for the Mem stage• Register File’s Write port (bus W) for the Wr stage

° One instruction enters the pipeline every cycle• One instruction comes out of the pipeline (complete) every cycle• The “Effective” Cycles per Instruction (CPI) is 1

Clock

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7

Ifetch Reg/Dec Exec Mem Wr1st lw

Ifetch Reg/Dec Exec Mem Wr2nd lw

Ifetch Reg/Dec Exec Mem Wr3rd lw

7

pipeline.13

Conventional Pipelined Execution Representation

IFetch Dcd Exec Mem WB





IFetch Dcd Exec Mem WBProgram Flow

Time

pipeline.14

Single Cycle, Multiple Cycle, vs. Pipeline

Clk

Cycle 1

Multiple Cycle Implementation:

Ifetch Reg Exec Mem Wr

Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10

Load Ifetch Reg Exec Mem Wr

Ifetch Reg Exec MemLoad Store

Pipeline Implementation:

Ifetch Reg Exec Mem WrStore

Clk

Single Cycle Implementation:

Load Store Waste

IfetchR-type

Ifetch Reg Exec Mem WrR-type

Cycle 1 Cycle 2

8

pipeline.15

Why Pipeline? Because the resources are there!

Instr.

Order

Time (clock cycles)

Inst 0

Inst 1

Inst 2

Inst 4

Inst 3A

LUIm Reg Dm Reg

AL

UIm Reg Dm Reg

AL

UIm Reg Dm RegA

LUIm Reg Dm Reg

AL

UIm Reg Dm Reg

pipeline.16

Can pipelining get us into trouble?

° Yes: Pipeline Hazards• structural hazards: attempt to use the same resource two

different ways at the same time- E.g., combined washer/dryer would be a structural hazard

or folder busy doing something else (watching TV)• data hazards: attempt to use item before it is ready

- E.g., one sock of pair in dryer and one in washer; can’tfold until get sock from washer through dryer

- instruction depends on result of prior instruction still inthe pipeline

• control hazards: attempt to make a decision before condition isevaulated

- E.g., washing football uniforms and need to get properdetergent level; need to see after dryer before next load in

- branch instructions

° Can always resolve hazards by waiting• pipeline control must detect the hazard• take action (or delay action) to resolve hazards

9

pipeline.17

Mem

Single Memory is a Structural Hazard

Instr.

Order

Time (clock cycles)

Load

Instr 1

Instr 2

Instr 3

Instr 4A

LUMem Reg Mem Reg

AL

UMem Reg Mem Reg

AL

UMem Reg Mem RegA

LUReg Mem Reg

AL

UMem Reg Mem Reg

Detection is easy in this case! (right half highlight means read, left half write)

pipeline.18

Structural Hazards limit performance

° Example: if 1.3 memory accesses per instruction and only one memoryaccess per cycle then

• average CPI 1.3• otherwise resource is more than 100% utilized• More on Hazards later

10

pipeline.19

Pipelining the R-type and Load Instruction

° We have a problem:• Two instructions try to write to the register file at the same time!

Clock


Ifetch Reg/Dec Exec WrR-type





Ops! We have a problem!

pipeline.20

The Four Stages of R-type



° Exec: ALU operates on the two register operands

° Wr: Write the ALU output back to the register file

Cycle 1 Cycle 2 Cycle 3 Cycle 4


11

pipeline.21

Important Observation

° Each functional unit can only be used once per instruction

° Each functional unit must be used at the same stage for all instructions:• Load uses Register File’s Write Port during its 5th stage

• R-type uses Register File’s Write Port during its 4th stage

Ifetch Reg/Dec Exec Mem WrLoad1 2 3 4 5

Ifetch Reg/Dec Exec WrR-type1 2 3 4

pipeline.22

Solution 1: Insert “Bubble” into the Pipeline

° Insert a “bubble” into the pipeline to prevent 2 writes at the same cycle• The control logic can be complex

° No instruction is completed during Cycle 5:• The “Effective” CPI for load is >1

Clock



Ifetch Reg/Dec Exec


Ifetch Reg/Dec Exec WrR-typeIfetch Reg/Dec Exec WrR-type Pipeline

Bubble

Ifetch Reg/Dec Exec Wr

12

pipeline.23

Solution 2: Delay R-type’s Write by One Cycle

° Delay R-type’s register write by one cycle:• Now R-type instructions also use Reg File’s write port at Stage 5• Mem stage is a NOOP stage: nothing is being done

Clock







Ifetch Reg/Dec Exec WrR-type Mem

Mem

Mem

Mem

Mem

1 2 3 4 5

pipeline.24

The Four Stages of Store



° Exec: Calculate the memory address

° Mem: Write the data into the Data Memory


Ifetch Reg/Dec Exec MemStore Wr

13

pipeline.25

The Four Stages of Beq



° Exec: ALU compares the two register operands• Adder calculates the branch target address

° Mem: If the registers we compared in the Exec stage are the same,• Write the branch target address into the PC


Ifetch Reg/Dec Exec MemBeq Wr

pipeline.26

A Pipelined Datapath

IF/ID R

egister

ID/E

x Register

Ex/M

em R

egister

Mem

/Wr R

egister

PC

DataMem

WADi

RA Do

IUnit

A

I

RFileDi

Ra

Rb

Rw

MemWr

RegWr ExtOp

ExecUnit

busAbusB

Imm16

ALUOp

ALUSrc

Mux

1

0

MemtoReg

10

RegDst

Rt

Rd

Imm16

PC+4 PC+4

Rs

Rt

PC+4

Zero

Branch

10

Clk

Ifetch Reg/Dec Exec Mem Wr

14

pipeline.27

The Instruction Fetch Stage

IF/ID: lw

$1, 100 ($2)

ID/E

x Register

Ex/M

em R

egister

Mem

/Wr R

egister

PC = 14 Data

Mem

WADi

RA Do

IUnit

A

I

RFileDi

Ra

Rb

Rw

MemWr

RegWr ExtOp

ExecUnit

busAbusB

Imm16

ALUOp

ALUSrc

Mux

1

0

MemtoReg

10

RegDst

Rt

Rd

Imm16

PC+4 PC+4

Rs

Rt

PC+4

Zero

Branch

10

ClkIfetch Reg/Dec Exec Mem

You are here!

° Location 10: lw $1, 0x100($2) $1 <- Mem[($2) + 0x100]

pipeline.28

A Detail View of the Instruction Unit

° Location 10: lw $1, 0x100($2)

IF/ID: lw

$1, 100 ($2)

PC = 14

10

10

Adder

InstructionMemory

“4”

Instruction

Address

Clk

Ifetch

You are here!

Reg/Dec

15

pipeline.29

The Decode / Register Fetch Stage

IF/ID:

ID/E

x: Reg. 2 &

0x100

Ex/M

em R

egister

Mem

/Wr R

egister

PC

DataMem

WADi

RA Do

IUnit

A

I

RFileDi

Ra

Rb

Rw

MemWr

RegWr ExtOp

ExecUnit

busAbusB

Imm16

ALUOp

ALUSrc

Mux

1

0

MemtoReg

10

RegDst

Rt

Rd

Imm16

PC+4 PC+4

Rs

Rt

PC+4

Zero

Branch

10


You are here!


pipeline.30

Load’s Address Calculation Stage

IF/ID: ID

/Ex R

egister

Ex/M

em: L

oad’s Address

Mem

/Wr R

egister

PC

DataMem

WADi

RA Do

IUnit

A

I

RFileDi

Ra

Rb

Rw

MemWr

RegWr ExtOp=1

ExecUnit

busAbusB

Imm16

ALUOp=Add

ALUSrc=1

Mux

1

0

MemtoReg

10

RegDst=0

Rt

Rd

Imm16

PC+4 PC+4

Rs

Rt

PC+4

Zero

Branch

10


You are here!


16

pipeline.31

A Detail View of the Execution Unit

ID/E

x Register

Ex/M

em: L

oad’s Mem

ory AddressALU

Control

ALUctr

32busA

32busB

Extender

Mux

16

imm16

ALUSrc=1ExtOp=1

3

AL

U

Zero

0

1

32ALUout

32

Adder

3 ALUOp=Add

<< 2

32PC+4

Target

32

Clk

Exec

You are here!

Mem

pipeline.32

Load’s Memory Access Stage

IF/ID: ID

/Ex R

egister

Ex/M

em R

egister

Mem

/Wr: L

oad’s Data

PC

DataMem

WADi

RA Do

IUnit

A

I

RFileDi

Ra

Rb

Rw

MemWr=0

RegWr ExtOp

ExecUnit

busAbusB

Imm16

ALUOp

ALUSrc

Mux

1

0

MemtoReg

10

RegDst

Rt

Rd

Imm16

PC+4 PC+4

Rs

Rt

PC+4

Zero

Branch=0

10


You are here!


17

pipeline.33

Load’s Write Back Stage

IF/ID: ID

/Ex R

egister

Ex/M

em R

egister

Mem

/Wr R

egister

PC

DataMem

WADi

RA Do

IUnit

A

I

RFileDi

Ra

Rb

Rw

MemWr

RegWr=1 ExtOp

ExecUnit

busAbusB

Imm16

ALUOp

ALUSrc

Mux

1

0

MemtoReg=1

10

RegDst

Rt

Rd

Imm16

PC+4 PC+4

Rs

Rt

PC+4

Zero

Branch

10


You are somewhere out there!


Wr

pipeline.34

How About Control Signals?

IF/ID: ID

/Ex R

egister

Ex/M

em: L

oad’s Address

Mem

/Wr R

egister

PC

DataMem

WADi

RA Do

IUnit

A

I

RFileDi

Ra

Rb

Rw

MemWr

RegWr ExtOp=1

ExecUnit

busAbusB

Imm16

ALUOp=Add

ALUSrc=1

Mux

1

0

MemtoReg

10

RegDst=0

Rt

Rd

Imm16

PC+4 PC+4

Rs

Rt

PC+4

Zero

Branch

10

Ifetch Reg/Dec Exec Mem

° Key Observation: Control Signals at Stage N = Func (Instr. at Stage N)• N = Exec, Mem, or Wr

° Example: Controls Signals at Exec Stage = Func(Load’s Exec)

Wr

18

pipeline.35

Pipeline Control

° The Main Control generates the control signals during Reg/Dec• Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later• Control signals for Mem (MemWr Branch) are used 2 cycles later• Control signals for Wr (MemtoReg MemWr) are used 3 cycles later

IF/ID R

egister

ID/E

x Register

Ex/M

em R

egister

Mem

/Wr R

egister

Reg/Dec Exec Mem

ExtOp

ALUOpRegDst

ALUSrc

BranchMemWr

MemtoRegRegWr

MainControl

ExtOp

ALUOpRegDst

ALUSrc

MemtoRegRegWr

MemtoRegRegWr

MemtoRegRegWr

BranchMemWr

BranchMemWr

Wr

pipeline.36

Beginning of the Wr’s Stage: A Real World Problem

° At the beginning of the Wr stage, we have a problem if:• RegAdr’s (Rd or Rt) Clk-to-Q > RegWr’s Clk-to-Q

° Similarly, at the beginning of the Mem stage, we have a problem if:• WrAdr’s Clk-to-Q > MemWr’s Clk-to-Q

° We have a race condition between Address and Write Enable!

Ex/M

em

Mem

/Wr RegAdr

RegWr MemWr

DataWrAdrData

RegFile

DataMemory

Clk

RegAdr

RegWrRegWr’s Clk-to-Q

RegAdr’s Clk-to-Q

Clk

WrAdr

MemWrMemWr’s Clk-to-Q

WrAdr’s Clk-to-Q

19

pipeline.37

The Pipeline Problem

° Multiple Cycle design prevents race condition between Addr and WrEn:• Make sure Address is stable by the end of Cycle N• Asserts WrEn during Cycle N + 1

° This approach can NOT be used in the pipeline design because:• Must be able to write the register file every cycle• Must be able write the data memory every cycle

Clock

Ifetch Reg/Dec Exec Mem WrStore

Ifetch Reg/Dec Exec Mem WrStore

Ifetch Reg/Dec Exec Mem WrR-type

Ifetch Reg/Dec Exec Mem WrR-type

pipeline.38

Synchronize Register File & Synchronize Memory

° Solution: And the Write Enable signal with the Clock• This is the ONLY place where gating the clock is used• MUST consult circuit expert to ensure no timing violation:

- Example: Clock High Time > Write Access Delay

WrEn

I_AddrI_Data

Reg Fileor

Memory

Clk

I_AddrI_WrEn

AddressData

I_WrEn

C_WrEn

C_WrEn

Clk

AddressData

WrEn

Reg Fileor

Memory

Synchronize Memory and Register File

Address, Data, and WrEn must be stableat least 1 set-up time before the Clk edge

Write occurs at the cycle followingthe clock edge that captures the signals

20

pipeline.39

A More Extensive Pipelining Example

Clock

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8

Ifetch Reg/Dec Exec Mem Wr0: Load

Ifetch Reg/Dec Exec Mem Wr4: R-type

Ifetch Reg/Dec Exec Mem Wr8: Store

Ifetch Reg/Dec Exec Mem Wr12: Beq (target is 1000)

End ofCycle 4

End ofCycle 5

End ofCycle 6

End ofCycle 7

° End of Cycle 4: Load’s Mem, R-type’s Exec, Store’s Reg, Beq’s Ifetch

° End of Cycle 5: Load’s Wr, R-type’s Mem, Store’s Exec, Beq’s Reg

° End of Cycle 6: R-type’s Wr, Store’s Mem, Beq’s Exec

° End of Cycle 7: Store’s Wr, Beq’s Mem

pipeline.40

Pipelining Example: End of Cycle 4° 0: Load’s Mem 4: R-type’s Exec 8: Store’s Reg 12: Beq’s Ifetch

IF/ID: B

eq Instruction

ID/E

x: Store’s busA &

B

Ex/M

em: R

-type’s Result

Mem

/Wr: L

oad’s Dout

PC = 16 Data

Mem

WADi

RA Do

IUnit

A

I

RFileDi

Ra

Rb

Rw

RegWr=0 ExtOp=x

ExecUnit

busAbusB

Imm16

ALUOp=R-type

ALUSrc=0

Mux

1

0

MemtoReg=x

10

RegDst=1

Rt

Rd

Imm16

PC+4 PC+4

Rs

Rt

PC+4

Zero

Branch=0

10

12: Beq’s Ifet

8: Store’s Reg 4: R-type’s Exec 0: Load’s Mem

Clk

MemWr=0Clk

21

pipeline.41

Pipelining Example: End of Cycle 5° 0: Lw’s Wr 4: R’s Mem 8: Store’s Exec 12: Beq’s Reg 16: R’s Ifetch

IF/ID: Instruction @

16

ID/E

x: Beq’s busA

& B

Ex/M

em: Store’s A

ddress

Mem

/Wr: R

-type’s Result

PC = 20 Data

Mem

WADi

RA Do

IUnit

A

I

RFileDi

Ra

Rb

Rw

RegWr=1 ExtOp=1

ExecUnit

busAbusB

Imm16

ALUOp=Add

ALUSrc=1

Mux

1

0

MemtoReg=1

10

RegDst=x

Rt

Rd

Imm16

PC+4 PC+4

Rs

Rt

PC+4

Zero

Branch=0

10

16: R’s Ifet

12: Beq’s Reg 8: Store’s Exec 4: R-type’s Mem

0: Load’s Wr

Clk

MemWr=0Clk

pipeline.42

Pipelining Example: End of Cycle 6° 4: R’s Wr 8: Store’s Mem 12: Beq’s Exec 16: R’s Reg 20: R’s Ifet


20

ID/E

x:R-type’s busA

& B

Ex/M

em: B

eq’s Results

Mem

/Wr: N

othing for St

PC = 24 Data

Mem

WADi

RA Do

IUnit

A

I

RFileDi

Ra

Rb

Rw

RegWr=1 ExtOp=1

ExecUnit

busAbusB

Imm16

ALUOp=Sub

ALUSrc=0

Mux

1

0

MemtoReg=0

10

RegDst=x

Rt

Rd

Imm16

PC+4 PC+4

Rs

Rt

PC+4

Zero

Branch=0

10

20:R-type’s Ifet

16: R-type’s Reg 12: Beq’s Exec 8: Store’s Mem

4: R-type’s Wr

Clk

MemWr=1Clk

22

pipeline.43

Pipelining Example: End of Cycle 7° 8: Store’s Wr 12: Beq’s Mem 16: R’s Exec 20: R’s Reg 24: R’s Ifet


24

ID/E

x:R-type’s busA

& B

Ex/M

em: R

type’s Results

Mem

/Wr:N

othing for Beq

PC = 1000

DataMem

WADi

RA Do

IUnit

A

I

RFileDi

Ra

Rb

Rw

RegWr=0 ExtOp=x

ExecUnit

busAbusB

Imm16

ALUOp=R-type

ALUSrc=0

Mux

1

0

MemtoReg=x

10

RegDst=1

Rt

Rd

Imm16

PC+4 PC+4

Rs

Rt

PC+4

Zero

Branch=1

10

24:R-type’s Ifet

20: R-type’s Reg 16: R-type’s Exec 12: Beq’s Mem

8: Store’s Wr

Clk

MemWr=0Clk

pipeline.44

The Delay Branch Phenomenon

° Although Beq is fetched during Cycle 4:• Target address is NOT written into the PC until the end of Cycle 7• Branch’s target is NOT fetched until Cycle 8• 3-instruction delay before the branch take effect

° This is referred to as Branch Hazard:• Clever design techniques can reduce the delay to ONE instruction






12: Beq(target is 1000)

20: R-type

Clk

Ifetch Reg/Dec Exec Mem Wr1000: Target of Br

23

pipeline.45

The Delay Load Phenomenon

° Although Load is fetched during Cycle 1:• The data is NOT written into the Reg File until the end of Cycle 5• We cannot read this value from the Reg File until Cycle 6• 3-instruction delay before the load take effect

° This is referred to as Data Hazard:• Clever design techniques can reduce the delay to ONE instruction

Clock


Ifetch Reg/Dec Exec Mem WrI0: Load

Ifetch Reg/Dec Exec Mem WrPlus 1




pipeline.46

Summary

° Disadvantages of the Single Cycle Processor• Long cycle time• Cycle time is too long for all instructions except the Load

° Multiple Clock Cycle Processor:• Divide the instructions into smaller steps• Execute each step (instead of the entire instruction) in one cycle

° Pipeline Processor:• Natural enhancement of the multiple clock cycle processor• Each functional unit can only be used once per instruction• If a instruction is going to use a functional unit:

- it must use it at the same stage as all other instructions• Pipeline Control:

- Each stage’s control signal depends ONLY on the instructionthat is currently in that stage

Documents

361 Computer Architecture Lecture 12: Designing a Pipeline ...kcoloma/ece361/lectures/Lec12-pipeline.pdf · pipeline.1 361 Computer Architecture Lecture 12: Designing a Pipeline Processor