EECS 322 March 27, 2000 Based on Dave Patterson slides Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses powerpoint animation: please viewshow EECS 322 Computer Architecture Introduction to Pipelining

EECS 322 Computer Architecture Introduction to Pipelining


Page 1: EECS 322 Computer Architecture  Introduction to Pipelining

EECS 322 Computer Architecture: Introduction to Pipelining

Instructor: Francis G. Wolff (wolff@eecs.cwru.edu)
Case Western Reserve University
Based on Dave Patterson slides
EECS 322, March 27, 2000

This presentation uses PowerPoint animation: please view it as a slide show.

Page 2: EECS 322 Computer Architecture  Introduction to Pipelining

Comparison: CISC versus RISC

CISC                                     RISC
Any instruction may reference memory     Only load/store references memory
Many instructions & addressing modes     Few instructions & addressing modes
Variable instruction formats             Fixed instruction formats
Single register set                      Multiple register sets
Multi-clock cycle instructions           Single-clock cycle instructions
Micro-program interprets instructions    Hardware (FSM) executes instructions
Complexity is in the micro-program       Complexity is in the compiler
Little to no pipelining                  Highly pipelined
Program code size small                  Program code size large

Page 3: EECS 322 Computer Architecture  Introduction to Pipelining


Pipelining (Designing…,M.J.Quinn, ‘87)

Instruction Pipelining is the use of pipelining to allow more than one instruction to be in some stage of execution at the same time.

Ferranti ATLAS (1963): Pipelining sped up the average instruction time by a factor of about 3.75 (375%). Memory could not keep up with the CPU, so a cache was needed.

Cache memory is a small, fast memory unit used as a buffer between a processor and primary memory

Page 4: EECS 322 Computer Architecture  Introduction to Pipelining

Memory Hierarchy

Listed from the top of the hierarchy to the bottom:

  Registers
  Pipelining
  Cache memory
  Primary real memory
  Virtual memory (disk, swapping)

Faster toward the top; cheaper and more capacity toward the bottom.

Page 5: EECS 322 Computer Architecture  Introduction to Pipelining


Pipelining versus Parallelism (Designing…,M.J.Quinn, ‘87)

Most high-performance computers exhibit a great deal of concurrency.

However, it is not desirable to call every modern computer a parallel computer.

Pipelining and parallelism are 2 methods used to achieve concurrency.

Pipelining increases concurrency by dividing a computation into a number of steps.

Parallelism is the use of multiple resources to increase concurrency.

Page 6: EECS 322 Computer Architecture  Introduction to Pipelining


Pipelining is Natural!

° Laundry Example

° Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold

° Washer takes 30 minutes

° Dryer takes 30 minutes

° “Folder” takes 30 minutes

° “Stasher” takes 30 minutes to put clothes into drawers

[Figure: the four loads of laundry, labeled A, B, C, D]

Page 7: EECS 322 Computer Architecture  Introduction to Pipelining


Sequential Laundry

° Sequential laundry takes 8 hours for 4 loads

° If they learned pipelining, how long would laundry take?

[Figure: task order (A, B, C, D) versus time, 6 PM to 2 AM. Each load runs through four consecutive 30-minute steps, and each load starts only after the previous one has finished.]

Page 8: EECS 322 Computer Architecture  Introduction to Pipelining


Pipelined Laundry: Start work ASAP

° Pipelined laundry takes 3.5 hours for 4 loads!

[Figure: task order (A, B, C, D) versus time, 6 PM to 2 AM. Each load starts 30 minutes after the previous one, so the four loads overlap in seven 30-minute slots.]
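The two laundry totals can be checked with a few lines of arithmetic. This is a minimal sketch assuming the four 30-minute steps listed earlier (wash, dry, fold, stash) and four loads; the function names are made up for illustration.

```python
# Laundry timing sketch: 4 loads, 4 steps of 30 minutes each.
STEP_MINUTES = 30
NUM_STEPS = 4      # wash, dry, fold, stash
NUM_LOADS = 4      # Ann, Brian, Cathy, Dave


def sequential_minutes(loads, steps, step_minutes):
    """Each load finishes all of its steps before the next load starts."""
    return loads * steps * step_minutes


def pipelined_minutes(loads, steps, step_minutes):
    """The first load needs `steps` slots; every later load finishes one slot after it."""
    return (steps + loads - 1) * step_minutes


seq = sequential_minutes(NUM_LOADS, NUM_STEPS, STEP_MINUTES)
pipe = pipelined_minutes(NUM_LOADS, NUM_STEPS, STEP_MINUTES)
print(seq / 60, "hours sequential")
print(pipe / 60, "hours pipelined")
```

The pipelined total is (steps + loads - 1) slots because the pipeline needs steps - 1 slots to fill before the first load comes out.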

Page 9: EECS 322 Computer Architecture  Introduction to Pipelining


Pipelining Lessons

° Pipelining doesn’t help latency of single task, it helps throughput of entire workload

° Multiple tasks operating simultaneously using different resources

° Potential speedup = Number pipe stages

° Pipeline rate limited by slowest pipeline stage

° Unbalanced lengths of pipe stages reduce speedup

° Time to “fill” pipeline and time to “drain” it reduces speedup

° Stall for Dependences

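[Figure: the pipelined laundry diagram again. Task order (A, B, C, D) versus time, 6 PM to 9 PM and beyond, with each load starting 30 minutes after the previous one.]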
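The lessons about the slowest stage, unbalanced stages, and fill/drain time can be seen in one small calculation. The sketch below assumes every stage advances in lock step at the pace of the slowest stage; the unbalanced stage times are hypothetical numbers chosen to keep the total work the same.

```python
def pipeline_time(stage_times, num_tasks):
    """Total time when all stages advance at the pace of the slowest stage."""
    cycle = max(stage_times)
    # (stages - 1) cycles to fill the pipeline, then one task completes per cycle
    return (len(stage_times) + num_tasks - 1) * cycle


def speedup(stage_times, num_tasks):
    """Unpipelined time divided by pipelined time."""
    unpipelined = sum(stage_times) * num_tasks
    return unpipelined / pipeline_time(stage_times, num_tasks)


balanced = [30, 30, 30, 30]
unbalanced = [20, 40, 30, 30]   # hypothetical: same total work, one slow stage

for n in (4, 1000):
    print(n, "tasks:",
          round(speedup(balanced, n), 2), "balanced,",
          round(speedup(unbalanced, n), 2), "unbalanced")
```

With few tasks, fill and drain time keeps the speedup well below the number of stages; with many tasks the balanced pipeline approaches the stage count, while the unbalanced one stays limited by its slowest stage.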

Page 10: EECS 322 Computer Architecture  Introduction to Pipelining

The Five Stages of Load

° Ifetch: Instruction Fetch
  • Fetch the instruction from the Instruction Memory

° Reg/Dec: Registers Fetch and Instruction Decode

° Exec: Calculate the memory address

° Mem: Read the data from the Data Memory

° Wr: Write the data back to the register file

        Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
Load    Ifetch    Reg/Dec   Exec      Mem       Wr

Page 11: EECS 322 Computer Architecture  Introduction to Pipelining

RISCEE 4 Architecture

[Figure: RISCEE 4 datapath. Labeled elements: PC, memory (address, Read Data, Write Data), IR, MDR, MDR2, Accumulator, ALU with inputs X and Y, ALUOut, and multiplexers. Control signals: MemRead, MemWrite, IRWrite, IorD, RegDst, RegWrite, ALUsrcA, ALUsrcB, PCSrc. PC write condition: P0 | (~AluZero & BZ).]

ALUop: 1 = X+0, 2 = X-Y, 3 = 0+Y, 4 = 0, 5 = X+Y

Clock = load value into register

Page 12: EECS 322 Computer Architecture  Introduction to Pipelining

Single Cycle, Multiple Cycle, vs. Pipeline

[Figure: timing of three implementations against the clock (Clk).

Single Cycle Implementation: Load in Cycle 1, Store in Cycle 2, with part of a cycle marked "Waste".

Multiple Cycle Implementation: Load (Ifetch, Reg, Exec, Mem, Wr), then Store (Ifetch, Reg, Exec, Mem), then R-type (Ifetch, ...), one after another across Cycle 1 to Cycle 10.

Pipeline Implementation: Load, Store, and R-type each pass through Ifetch, Reg, Exec, Mem, Wr, overlapped in time.]

Page 13: EECS 322 Computer Architecture  Introduction to Pipelining

Why Pipeline?

° Suppose we execute 100 instructions

° Single Cycle Machine
  • 45 ns/cycle x 1 CPI x 100 inst = 4500 ns

° Multicycle Machine
  • 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns

° Ideal pipelined machine
  • 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
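The three totals can be reproduced directly. The sketch below uses only the cycle times, CPI values, and instruction count given on the slide; the variable names are illustrative.

```python
NUM_INSTRUCTIONS = 100

# cycle time (ns) x CPI x instruction count
single_cycle_ns = 45 * 1 * NUM_INSTRUCTIONS
multicycle_ns = 10 * 4.6 * NUM_INSTRUCTIONS
# ideal pipeline: one instruction per cycle plus 4 cycles to drain
pipelined_ns = 10 * (1 * NUM_INSTRUCTIONS + 4)

for name, total in [("single cycle", single_cycle_ns),
                    ("multicycle", multicycle_ns),
                    ("pipelined", pipelined_ns)]:
    ratio = single_cycle_ns / total
    print(f"{name:<13}{total:8.0f} ns  {ratio:.2f}x relative to single cycle")
```

The multicycle machine is no faster than the single cycle machine here; the gain comes from overlapping instructions, not from the shorter clock alone.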

Page 14: EECS 322 Computer Architecture  Introduction to Pipelining

Why Pipeline? Because the resources are there!

[Figure: pipeline diagram, time (clock cycles) versus instruction order (Inst 0 to Inst 4). Each instruction passes through Im, Reg, ALU, Dm, Reg, starting one cycle after the previous instruction. In the Reg boxes one half marks RegRead and the other RegWrite.]

Resource usage by clock cycle:

Cycle      1     2     3     4     5     6     7     8     9
MemInst    busy  busy  busy  busy  busy  idle  idle  idle  idle
MemData    idle  idle  idle  busy  busy  busy  busy  busy  idle
RegRead    idle  busy  busy  busy  busy  busy  idle  idle  idle
RegWrite   idle  idle  idle  idle  busy  busy  busy  busy  busy
ALU        idle  idle  busy  busy  busy  busy  busy  idle  idle
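The busy/idle pattern follows mechanically from the stage order. The sketch below regenerates it under the assumption that every instruction uses one resource per stage, in the order MemInst, RegRead, ALU, MemData, RegWrite, and that a new instruction starts every cycle; the function name is made up for illustration.

```python
# Resource used in each stage: IF, ID, EX, MEM, WB
STAGES = ["MemInst", "RegRead", "ALU", "MemData", "RegWrite"]


def resource_usage(num_instructions):
    """Return {resource: [busy flag per cycle]} for an ideal 5-stage pipeline."""
    num_cycles = num_instructions + len(STAGES) - 1
    usage = {name: [False] * num_cycles for name in STAGES}
    for i in range(num_instructions):        # instruction i starts in cycle i (0-based)
        for s, name in enumerate(STAGES):
            usage[name][i + s] = True        # stage s runs s cycles after the start
    return usage


usage = resource_usage(5)
for name in ["MemInst", "MemData", "RegRead", "RegWrite", "ALU"]:
    row = " ".join("busy" if busy else "idle" for busy in usage[name])
    print(f"{name:<9}{row}")
```

Only in cycle 5 are all five resources busy at once; before and after that the pipeline is filling or draining.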

Page 15: EECS 322 Computer Architecture  Introduction to Pipelining

Can pipelining get us into trouble?

° Yes: Pipeline Hazards
  • structural hazards: attempt to use the same resource two different ways at the same time
    - E.g., combined washer/dryer would be a structural hazard, or folder busy doing something else (watching TV)
  • data hazards: attempt to use item before it is ready
    - E.g., one sock of pair in dryer and one in washer; can’t fold until the sock from the washer gets through the dryer
    - instruction depends on result of prior instruction still in the pipeline
  • control hazards: attempt to make a decision before condition is evaluated
    - E.g., washing football uniforms and need to get proper detergent level; need to see after dryer before next load in
    - branch instructions

° Can always resolve hazards by waiting
  • pipeline control must detect the hazard
  • take action (or delay action) to resolve hazards

Page 16: EECS 322 Computer Architecture  Introduction to Pipelining

Single Memory (Inst & Data) is a Structural Hazard

structural hazards: attempt to use the same resource two different ways at the same time

[Figure: pipeline diagram, time versus instruction order (Load, Instr 1, Instr 2, Instr 3). Each instruction passes through Mem, Reg, ALU, Mem, Reg. In the Reg boxes, a highlighted right half means read and a highlighted left half means write.]

Detection is easy in this case!

Resource usage, one row per clock cycle:

Mem (Inst & Data)   RegRead   RegWrite   ALU
busy                idle      idle       idle
busy                busy      idle       idle
busy                busy      idle       busy
busy                busy      idle       busy
busy                busy      busy       busy
idle                busy      busy       busy
idle                idle      busy       busy
idle                idle      busy       idle

Previous example: separate InstMem and DataMem

Page 17: EECS 322 Computer Architecture  Introduction to Pipelining

Single Memory (Inst & Data) is a Structural Hazard

structural hazards: attempt to use the same resource two different ways at the same time

By changing the architecture from a Harvard memory (separate instruction and data memory) to a von Neumann memory, we actually created a structural hazard!

Structural hazards can be avoided by changing

  • hardware: design of the architecture (splitting resources)
  • software: re-order the instruction sequence
  • software: delay

Page 18: EECS 322 Computer Architecture  Introduction to Pipelining


Pipelining

° Improve performance by increasing instruction throughput

Ideal speedup is number of stages in the pipeline. Do we achieve this?

Page 19: EECS 322 Computer Architecture  Introduction to Pipelining

Figure 6.4: Stall on Branch

[Figure: pipeline diagram. Program execution order (in instructions): add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0). Time axis: 2 to 16. Each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg. The gaps between instruction starts are marked 2 ns, 2 ns, and 4 ns, the 4 ns gap being next to lw.]

Page 20: EECS 322 Computer Architecture  Introduction to Pipelining

Figure 6.5: Predicting branches

[Figure: two pipeline diagrams, each with time axis 2 to 14 and stages Instruction fetch, Reg, ALU, Data access, Reg.

First panel: add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0), with 2 ns between instruction starts.

Second panel: add $4, $5, $6; beq $1, $2, 40; or $7, $8, $9, with gaps marked 2 ns and 4 ns and a row of five bubbles.]

Page 21: EECS 322 Computer Architecture  Introduction to Pipelining

Figure 6.6: Delayed branch

[Figure: pipeline diagram. Program execution order (in instructions): beq $1, $2, 40; add $4, $5, $6 (delayed branch slot); lw $3, 300($0). Time axis: 2 to 14. Each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg, with 2 ns between instruction starts.]

Page 22: EECS 322 Computer Architecture  Introduction to Pipelining

Figure 6.7: Instruction pipeline

[Figure: add $s0, $t0, $t1 passing through IF, ID, EX, MEM, WB along a time axis marked 2, 4, 6, 8, 10.]

Pipeline stages
  • IF: instruction fetch (read)
  • ID: instruction decode and register read (read)
  • EX: execute ALU operation
  • MEM: data memory (read or write)
  • WB: write back to register

Resources
  • Mem: instr. & data memory
  • RegRead1: register read port #1
  • RegRead2: register read port #2
  • RegWrite: register write
  • ALU: ALU operation

Page 23: EECS 322 Computer Architecture  Introduction to Pipelining

Figure 6.8: Forwarding

[Figure: pipeline diagram, time axis 2 to 10. Program execution order (in instructions): add $s0, $t0, $t1; sub $t2, $s0, $t3. Each instruction passes through IF, ID, EX, MEM, WB, with sub starting one stage after add.]

Page 24: EECS 322 Computer Architecture  Introduction to Pipelining

Figure 6.9: Load Forwarding

[Figure: pipeline diagram, time axis 2 to 14. Program execution order (in instructions): lw $s0, 20($t1); sub $t2, $s0, $t3. Each instruction passes through IF, ID, EX, MEM, WB, with a row of five bubbles between the two instructions.]

Page 25: EECS 322 Computer Architecture  Introduction to Pipelining

Figure 6.9: Reordering

Original sequence:

  lw $t0, 0($t1)      $t0 = Memory[0+$t1]
  lw $t2, 4($t1)      $t2 = Memory[4+$t1]
  sw $t2, 0($t1)      Memory[0+$t1] = $t2
  sw $t0, 4($t1)      Memory[4+$t1] = $t0

Reordered sequence (the two loads are swapped, so $t2 is loaded one instruction earlier):

  lw $t2, 4($t1)
  lw $t0, 0($t1)
  sw $t2, 0($t1)
  sw $t0, 4($t1)
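Why the reordering helps can be checked by counting load-use pairs. This is a minimal sketch that assumes a stall occurs only when an instruction reads a register loaded by the instruction immediately before it; the tuple encoding and the function name are made up for illustration.

```python
def load_use_stalls(program):
    """Count instructions that read a register loaded by the instruction just before them."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    return stalls


# (opcode, destination register or None, source registers)
original = [
    ("lw", "$t0", ["$t1"]),
    ("lw", "$t2", ["$t1"]),
    ("sw", None, ["$t2", "$t1"]),   # reads $t2 right after it is loaded
    ("sw", None, ["$t0", "$t1"]),
]
reordered = [original[1], original[0], original[2], original[3]]

print("original: ", load_use_stalls(original), "load-use stall(s)")
print("reordered:", load_use_stalls(reordered), "load-use stall(s)")
```

Both sequences leave memory in the same state, since the two loads do not depend on each other and may be swapped freely.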

Page 26: EECS 322 Computer Architecture  Introduction to Pipelining


Basic Idea: split the datapath

° What do we need to add to actually split the datapath into stages?

Page 27: EECS 322 Computer Architecture  Introduction to Pipelining

Graphically Representing Pipelines

° Can help with answering questions like:
  • how many cycles does it take to execute this code?
  • what is the ALU doing during cycle 4?
  • use this representation to help understand datapaths

Page 28: EECS 322 Computer Architecture  Introduction to Pipelining

Pipeline datapath with registers

Figure 6.12

[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB between the stages. Labeled elements: PC, Add (PC + 4), Instruction memory (Address, Instruction), Registers (Read register 1, Read register 2, Write register, Write data, Read data 1, Read data 2), Sign extend (16 to 32), Shift left 2, Add (Add result), ALU (Zero, ALU result), Data memory (Address, Write data, Read data), and multiplexers.]

Page 29: EECS 322 Computer Architecture  Introduction to Pipelining

Load instruction fetch and decode

Figure 6.13

[Figure: two copies of the pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. First panel: lw in the Instruction fetch stage. Second panel: lw in the Instruction decode stage.]

Page 30: EECS 322 Computer Architecture  Introduction to Pipelining

Load instruction execution

Figure 6.14

[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB, showing lw in the Execution stage.]

Page 31: EECS 322 Computer Architecture  Introduction to Pipelining

Load instruction memory and write back

Figure 6.15

[Figure: two copies of the pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. First panel: lw in the Memory stage. Second panel: lw in the Write back stage.]

Page 32: EECS 322 Computer Architecture  Introduction to Pipelining

Store instruction execution

Figure 6.16

[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB, showing sw in the Execution stage.]

Page 33: EECS 322 Computer Architecture  Introduction to Pipelining

Store instruction memory and write back

Figure 6.17

[Figure: two copies of the pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. First panel: sw in the Memory stage. Second panel: sw in the Write back stage.]

Page 34: EECS 322 Computer Architecture  Introduction to Pipelining

Load instruction: corrected datapath

Figure 6.18

[Figure: corrected pipelined datapath for the load instruction, with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB.]

Page 35: EECS 322 Computer Architecture  Introduction to Pipelining

Load instruction: overall usage

Figure 6.19

[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB, showing the portions used by a load instruction across all five stages.]

Page 36: EECS 322 Computer Architecture  Introduction to Pipelining

Multi-clock-cycle pipeline diagram

Figure 6.20-21

[Figure 6.20: program execution order (in instructions) versus time (in clock cycles), CC 1 to CC 6, drawn with the resources IM, Reg, ALU, DM, Reg for lw $10, 20($1) and sub $11, $2, $3.]

Figure 6.21, the same diagram with stage names:

                   CC 1          CC 2          CC 3          CC 4          CC 5          CC 6
lw $10, 20($1)     Instruction   Instruction   Execution     Data          Write back
                   fetch         decode                      access
sub $11, $2, $3                  Instruction   Instruction   Execution     Data          Write back
                                 fetch         decode                      access
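A multi-clock-cycle diagram like this one can be generated for any instruction list. The sketch below assumes an ideal five-stage pipeline with no stalls and one instruction issued per clock cycle; the abbreviations IF, ID, EX, MEM, WB and the function name are choices made for this example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_rows(instructions):
    """Return one text row per instruction, each shifted one clock cycle to the right."""
    total_cycles = len(instructions) + len(STAGES) - 1
    width = max(len(text) for text in instructions)
    rows = []
    for i, text in enumerate(instructions):
        cells = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage             # instruction i is in stage s during cycle i + s
        rows.append(text.ljust(width) + "  " + " ".join(c.ljust(4) for c in cells))
    return rows


program = ["lw $10, 20($1)", "sub $11, $2, $3"]
label_width = max(len(text) for text in program) + 2
print(" " * label_width + " ".join(f"CC{n}".ljust(4) for n in range(1, 7)))
for row in pipeline_rows(program):
    print(row)
```

Reading down a column shows what every instruction is doing in one clock cycle, which is the view the single-clock-cycle diagrams present.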

Page 37: EECS 322 Computer Architecture  Introduction to Pipelining

Single-cycle #1-2

Figure 6.22

[Figure: two copies of the pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB.

Clock 1: lw $10, 20($1) in Instruction fetch.

Clock 2: sub $11, $2, $3 in Instruction fetch; lw $10, 20($1) in Instruction decode.]

Page 38: EECS 322 Computer Architecture  Introduction to Pipelining

Single-cycle #3-4

Figure 6.23

[Figure: two copies of the pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB.

Clock 3: sub $11, $2, $3 in Instruction decode; lw $10, 20($1) in Execution.

Clock 4: sub $11, $2, $3 in Execution; lw $10, 20($1) in Memory.]

Page 39: EECS 322 Computer Architecture  Introduction to Pipelining

Single-cycle #5-6

Figure 6.24

[Figure: two copies of the pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB.

Clock 5: sub $11, $2, $3 in Memory; lw $10, 20($1) in Write back.

Clock 6: sub $11, $2, $3 in Write back.]

Page 40: EECS 322 Computer Architecture  Introduction to Pipelining

Conventional Pipelined Execution Representation

Time runs left to right; program flow runs top to bottom.

IFetch Dcd    Exec   Mem    WB
       IFetch Dcd    Exec   Mem    WB
              IFetch Dcd    Exec   Mem    WB
                     IFetch Dcd    Exec   Mem    WB
                            IFetch Dcd    Exec   Mem    WB
                                   IFetch Dcd    Exec   Mem    WB

Page 41: EECS 322 Computer Architecture  Introduction to Pipelining

Structural Hazards limit performance

° Example: if 1.3 memory accesses per instruction and only one memory access per cycle then
  • average CPI ≥ 1.3
  • otherwise resource is more than 100% utilized
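The CPI bound follows from dividing the demand for the memory by its capacity. This is a minimal sketch assuming the ideal CPI without the hazard is 1; the dual-port case is a hypothetical comparison and the function name is illustrative.

```python
def min_cpi(accesses_per_instruction, accesses_per_cycle=1.0):
    """Lower bound on average CPI when all memory accesses share one resource."""
    # The memory can serve `accesses_per_cycle` accesses each cycle, so an
    # instruction needs at least this many cycles of memory time on average.
    memory_bound = accesses_per_instruction / accesses_per_cycle
    return max(1.0, memory_bound)        # assumes an ideal CPI of 1 otherwise


print(min_cpi(1.3))         # one access per cycle, as on the slide
print(min_cpi(1.3, 2.0))    # hypothetical memory serving two accesses per cycle
```

A CPI below the bound would require the memory to serve more accesses than it can in a cycle, which is the "more than 100% utilized" case.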

Page 42: EECS 322 Computer Architecture  Introduction to Pipelining

Control Hazard Solutions

° Stall: wait until decision is clear
  • It’s possible to move up decision to 2nd stage by adding hardware to check registers as being read

° Impact: 2 clock cycles per branch instruction => slow

[Figure: pipeline diagram, time (clock cycles) versus instruction order (Add, Beq, Load). Each instruction passes through Mem, Reg, ALU, Mem, Reg.]

Page 43: EECS 322 Computer Architecture  Introduction to Pipelining

Control Hazard Solutions

° Predict: guess one direction then back up if wrong
  • Predict not taken

° Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right ≈ 50% of time)

° More dynamic scheme: history of 1 branch (≈ 90%)

[Figure: pipeline diagram, time (clock cycles) versus instruction order (Add, Beq, Load). Each instruction passes through Mem, Reg, ALU, Mem, Reg.]
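The impact figures can be turned into an average cost per branch. The sketch below assumes the costs stated above (1 cycle when the prediction is right, 2 when it is wrong) and treats the approximate accuracies as exact probabilities; the function name is made up for illustration.

```python
def expected_branch_cycles(p_correct, cycles_if_right=1, cycles_if_wrong=2):
    """Average clock cycles per branch for a given prediction accuracy."""
    return p_correct * cycles_if_right + (1 - p_correct) * cycles_if_wrong


schemes = [
    ("always stall", 0.0),          # every branch costs 2 cycles
    ("predict not taken", 0.5),
    ("history of 1 branch", 0.9),
]
for name, accuracy in schemes:
    print(f"{name:<20}{expected_branch_cycles(accuracy):.2f} cycles per branch")
```

Stalling is modeled here as a predictor that is never right, which reproduces the 2 cycles per branch of the stall solution.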

Page 44: EECS 322 Computer Architecture  Introduction to Pipelining

Control Hazard Solutions

° Redefine branch behavior (takes place after next instruction): “delayed branch”

° Impact: 0 clock cycles per branch instruction if an instruction can be found to put in the “slot” (≈ 50% of time)

° Less useful as more instructions are launched per clock cycle

[Figure: pipeline diagram, time (clock cycles) versus instruction order (Add, Beq, Misc, Load). Each instruction passes through Mem, Reg, ALU, Mem, Reg.]

Page 45: EECS 322 Computer Architecture  Introduction to Pipelining

Data Hazard on r1

  add r1, r2, r3
  sub r4, r1, r3
  and r6, r1, r7
  or  r8, r1, r9
  xor r10, r1, r11

Page 46: EECS 322 Computer Architecture  Introduction to Pipelining

Data Hazard on r1:

• Dependencies backwards in time are hazards

[Figure: pipeline diagram, time (clock cycles) versus instruction order: add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11. Stages IF, ID/RF, EX, MEM, WB, drawn with the resources Im, Reg, ALU, Dm, Reg.]

Page 47: EECS 322 Computer Architecture  Introduction to Pipelining

Data Hazard Solution:

• “Forward” result from one stage to another

• “or” OK if define read/write properly

[Figure: pipeline diagram, time (clock cycles) versus instruction order: add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11. Stages IF, ID/RF, EX, MEM, WB, drawn with the resources Im, Reg, ALU, Dm, Reg.]
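How each reader of r1 gets its value depends only on its distance from the add. The sketch below assumes the classic five-stage arrangement: forwarding from the EX/MEM and MEM/WB pipeline registers, and a register file that writes in the first half of a cycle and reads in the second half. The tuple encoding and the function name are made up for illustration, and the sketch ignores later instructions that overwrite the register.

```python
def classify_readers(program, producer_index=0):
    """Say how each later reader of the producer's result can obtain the value."""
    dest = program[producer_index][1]
    result = {}
    for i in range(producer_index + 1, len(program)):
        op, _, sources = program[i]
        if dest not in sources:
            continue
        distance = i - producer_index    # how many instructions after the producer
        if distance == 1:
            result[op] = "forward from EX/MEM"
        elif distance == 2:
            result[op] = "forward from MEM/WB"
        elif distance == 3:
            result[op] = "register file: write first half, read second half"
        else:
            result[op] = "no hazard"
    return result


# (opcode, destination register, source registers)
program = [
    ("add", "r1", ["r2", "r3"]),
    ("sub", "r4", ["r1", "r3"]),
    ("and", "r6", ["r1", "r7"]),
    ("or", "r8", ["r1", "r9"]),
    ("xor", "r10", ["r1", "r11"]),
]
for op, how in classify_readers(program).items():
    print(f"{op:<4}{how}")
```

The third case is the one the slide calls OK "if define read/write properly".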

Page 48: EECS 322 Computer Architecture  Introduction to Pipelining

Forwarding (or Bypassing): What about Loads

• Dependencies backwards in time are hazards

• Can’t solve with forwarding:
  • Must delay/stall instruction dependent on loads

[Figure: pipeline diagram, time (clock cycles): lw r1,0(r2) followed by sub r4,r1,r3. Stages IF, ID/RF, EX, MEM, WB, drawn with the resources Im, Reg, ALU, Dm, Reg.]

Page 49: EECS 322 Computer Architecture  Introduction to Pipelining

Pipelining the Load Instruction

° The five independent functional units in the pipeline datapath are:
  • Instruction Memory for the Ifetch stage
  • Register File’s Read ports (bus A and bus B) for the Reg/Dec stage
  • ALU for the Exec stage
  • Data Memory for the Mem stage
  • Register File’s Write port (bus W) for the Wr stage

          Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
1st lw    Ifetch   Reg/Dec  Exec     Mem      Wr
2nd lw             Ifetch   Reg/Dec  Exec     Mem      Wr
3rd lw                      Ifetch   Reg/Dec  Exec     Mem      Wr

Page 50: EECS 322 Computer Architecture  Introduction to Pipelining

The Four Stages of R-type

° Ifetch: Instruction Fetch
  • Fetch the instruction from the Instruction Memory

° Reg/Dec: Registers Fetch and Instruction Decode

° Exec:
  • ALU operates on the two register operands
  • Update PC

° Wr: Write the ALU output back to the register file

          Cycle 1  Cycle 2  Cycle 3  Cycle 4
R-type    Ifetch   Reg/Dec  Exec     Wr

Page 51: EECS 322 Computer Architecture  Introduction to Pipelining

Pipelining the R-type and Load Instruction

° We have pipeline conflict or structural hazard:
  • Two instructions try to write to the register file at the same time!
  • Only one write port

Cycle     1        2        3        4        5        6        7        8        9
R-type    Ifetch   Reg/Dec  Exec     Wr
R-type             Ifetch   Reg/Dec  Exec     Wr
Load                        Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                               Ifetch   Reg/Dec  Exec     Wr
R-type                                        Ifetch   Reg/Dec  Exec     Wr

Oops! We have a problem!
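The conflict can be found by computing the write cycle of each instruction. The sketch below assumes one instruction is issued per clock cycle, R-type writes in its 4th stage, and Load writes in its 5th; the dictionary and function names are made up for illustration.

```python
from collections import defaultdict

# Stage number in which each kind of instruction writes the register file
WRITE_STAGE = {"R-type": 4, "Load": 5}


def write_port_conflicts(kinds, write_stage=WRITE_STAGE):
    """Return {cycle: writers} for every clock cycle with more than one register write."""
    writers = defaultdict(list)
    for start_cycle, kind in enumerate(kinds, start=1):   # one instruction issued per cycle
        write_cycle = start_cycle + write_stage[kind] - 1
        writers[write_cycle].append((start_cycle, kind))
    return {cycle: who for cycle, who in writers.items() if len(who) > 1}


sequence = ["R-type", "R-type", "Load", "R-type", "R-type"]
print(write_port_conflicts(sequence))

# If every instruction writes in its 5th stage, no two writes share a cycle.
same_stage = {"R-type": 5, "Load": 5}
print(write_port_conflicts(sequence, same_stage))
```

The Load issued in cycle 3 and the R-type issued in cycle 4 both reach the write port in cycle 7, which is the conflict this slide points out.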

Page 52: EECS 322 Computer Architecture  Introduction to Pipelining

Important Observation

° Each functional unit can only be used once per instruction

° Each functional unit must be used at the same stage for all instructions:
  • Load uses Register File’s Write Port during its 5th stage
  • R-type uses Register File’s Write Port during its 4th stage

Stage     1        2        3        4        5
Load      Ifetch   Reg/Dec  Exec     Mem      Wr
R-type    Ifetch   Reg/Dec  Exec     Wr

° 2 ways to solve this pipeline hazard.

Page 53: EECS 322 Computer Architecture  Introduction to Pipelining

Solution 1: Insert “Bubble” into the Pipeline

° Insert a “bubble” into the pipeline to prevent 2 writes at the same cycle
  • The control logic can be complex.
  • Lose instruction fetch and issue opportunity.

° No instruction is started in Cycle 6!

[Figure: pipeline diagram against the clock, Cycle 1 to Cycle 9, with R-type instructions (Ifetch, Reg/Dec, Exec, Wr) and a Load (Ifetch, Reg/Dec, Exec, Mem, Wr). A pipeline bubble delays one R-type instruction so that no two instructions write in the same cycle.]

Page 54: EECS 322 Computer Architecture  Introduction to Pipelining

Solution 2: Delay R-type’s Write by One Cycle

° Delay R-type’s register write by one cycle:
  • Now R-type instructions also use Reg File’s write port at Stage 5
  • Mem stage is a NOOP stage: nothing is being done.

Stage     1        2        3        4        5
R-type    Ifetch   Reg/Dec  Exec     Mem      Wr

Cycle     1        2        3        4        5        6        7        8        9
R-type    Ifetch   Reg/Dec  Exec     Mem      Wr
R-type             Ifetch   Reg/Dec  Exec     Mem      Wr
Load                        Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                               Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                                        Ifetch   Reg/Dec  Exec     Mem      Wr

Page 55: EECS 322 Computer Architecture  Introduction to Pipelining

The Four Stages of Store

° Ifetch: Instruction Fetch
  • Fetch the instruction from the Instruction Memory

° Reg/Dec: Registers Fetch and Instruction Decode

° Exec: Calculate the memory address

° Mem: Write the data into the Data Memory

          Cycle 1  Cycle 2  Cycle 3  Cycle 4
Store     Ifetch   Reg/Dec  Exec     Mem      (Wr: not used)

Page 56: EECS 322 Computer Architecture  Introduction to Pipelining

The Three Stages of Beq

° Ifetch: Instruction Fetch
  • Fetch the instruction from the Instruction Memory

° Reg/Dec:
  • Registers Fetch and Instruction Decode

° Exec:
  • compare the two register operands
  • select correct branch target address
  • latch into PC

          Cycle 1  Cycle 2  Cycle 3  Cycle 4
Beq       Ifetch   Reg/Dec  Exec     (Mem and Wr: not used)

Page 57: EECS 322 Computer Architecture  Introduction to Pipelining

Summary: Pipelining

° What makes it easy
  • all instructions are the same length
  • just a few instruction formats
  • memory operands appear only in loads and stores

° What makes it hard?
  • structural hazards: suppose we had only one memory
  • control hazards: need to worry about branch instructions
  • data hazards: an instruction depends on a previous instruction

° We’ll build a simple pipeline and look at these issues

° We’ll talk about modern processors and what really makes it hard:
  • exception handling
  • trying to improve performance with out-of-order execution, etc.

Page 58: EECS 322 Computer Architecture  Introduction to Pipelining

Summary

° Pipelining is a fundamental concept
  • multiple steps using distinct resources

° Utilize capabilities of the Datapath by pipelined instruction processing
  • start next instruction while working on the current one
  • limited by length of longest stage (plus fill/flush)
  • detect and resolve hazards