Chap 6.1 Computer Architecture Chapter 6 Enhancing Performance with Pipelining


Chap 6.1

Computer Architecture

Chapter 6

Enhancing Performance with Pipelining


Chap 6.2

Pipelining Overview

Pipelined Datapath

Pipelined Control

Hazards

• Structural Hazards

• Data Hazards

• Branch Hazards

Dynamic Scheduling

Examples of Pipelining

Summary

Contents


Chap 6.3

The Five Classic Components of a Computer

Back to the datapath again to try to speed it up some more

• Single cycle datapath – great CPI but terrible cycle time (long critical path)

• Multiple cycle datapath – good cycle time but poor CPI

• Pipelining – get the best of both!

[Figure: the five classic components of a computer: the processor (control and datapath), memory, input, and output.]

The Big Picture: Where are We Now?


Chap 6.4

Sequential laundry takes 8 hours for 4 loads

If they learned pipelining, how long would laundry take?

[Figure: sequential laundry timeline from 6 PM to 2 AM. Loads A, B, C, D (task order) run one after another; each load is a run of four 30-minute steps.]

Sequential Laundry


Chap 6.5

Pipelined laundry takes 3.5 hours for 4 loads!

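[Figure: pipelined laundry timeline starting at 6 PM. Loads A, B, C, D (task order) overlap, each starting 30 minutes after the previous one; the whole job spans seven 30-minute slots.]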

Pipelined Laundry: Start work ASAP
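
The two laundry totals can be checked with a few lines of arithmetic. This is a minimal sketch: the function names are made up for illustration, and the four steps per load are an assumption inferred from 8 hours / 4 loads at 30 minutes per step.

```python
STEP_MINUTES = 30    # each laundry step takes 30 minutes (from the slides)
STEPS_PER_LOAD = 4   # assumption: four steps per load, so one load takes 2 hours


def sequential_minutes(loads: int) -> int:
    """Each load finishes all of its steps before the next load starts."""
    return loads * STEPS_PER_LOAD * STEP_MINUTES


def pipelined_minutes(loads: int) -> int:
    """A new load starts every step; the last load still needs all of its steps."""
    return (STEPS_PER_LOAD + loads - 1) * STEP_MINUTES


print(sequential_minutes(4) / 60)  # 8.0 (hours)
print(pipelined_minutes(4) / 60)   # 3.5 (hours)
```

The time for a single load is unchanged (2 hours); only the rate at which loads complete improves.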


Chap 6.6

Ifetch: Instruction Fetch

• Fetch the instruction from the Instruction Memory

Reg/Dec: Register Fetch and Instruction Decode

Exec: Calculate the memory address

Mem: Read the data from the Data Memory

Wr: Write the data back to the register file

Cycle:  1       2        3     4    5

Load:   Ifetch  Reg/Dec  Exec  Mem  Wr

One way to break up the datapath operations


Chap 6.7

[Figure: instructions Inst 0 to Inst 4 in instruction order, plotted against time in clock cycles. Each instruction passes through Im, Reg, ALU, Dm, Reg and starts one cycle after the instruction before it.]

Apply pipelining


Chap 6.8

[Figure: six instructions, each passing through IFetch, Reg, Exec, Mem, WB, with each row starting one stage later than the row above. Vertical axis: program flow. Horizontal axis: time.]

Conventional Pipelined Execution Representation
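
The staggered representation is easy to regenerate as text. The sketch below is illustrative only (the function name and column width are arbitrary choices) and assumes an ideal pipeline with no stalls.

```python
STAGES = ["IFetch", "Reg", "Exec", "Mem", "WB"]


def pipeline_rows(num_instructions: int, width: int = 7) -> list[str]:
    """Return one text row per instruction, shifted right by one stage per row."""
    rows = []
    for i in range(num_instructions):
        cells = [""] * i + STAGES          # i empty cycles before the fetch
        rows.append("".join(cell.ljust(width) for cell in cells).rstrip())
    return rows


for row in pipeline_rows(6):
    print(row)

# With no stalls, the last instruction finishes in cycle (stages + instructions - 1).
print(len(STAGES) + 6 - 1)  # 10
```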


Chap 6.9

Improve performance by increasing instruction throughput

Ideal speedup is number of stages in the pipeline. Do we achieve this?

[Figure: three loads (lw $1, 100($0); lw $2, 200($0); lw $3, 300($0)) in program execution order, each drawn as Instruction fetch, Reg, ALU, Data access, Reg. Top panel, no pipelining: a new instruction starts every 8 ns (time axis 2 to 18 ns). Bottom panel, pipelining: a new instruction starts every 2 ns (time axis 2 to 14 ns).]

Pipelining
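
One way to approach "Do we achieve this?" is to plug in the 8 ns and 2 ns intervals from the figure on this slide. The sketch below assumes an ideal five-stage pipeline with no hazards; the function names are hypothetical.

```python
STAGES = 5
UNPIPELINED_NS = 8   # time per instruction without pipelining (from the figure)
STAGE_NS = 2         # pipelined clock period (from the figure)


def unpipelined_ns(n: int) -> int:
    """Instructions run back to back, 8 ns each."""
    return n * UNPIPELINED_NS


def pipelined_ns(n: int) -> int:
    """Fill the pipeline once, then finish one instruction per 2 ns cycle."""
    return (STAGES + n - 1) * STAGE_NS


for n in (3, 1000):
    speedup = unpipelined_ns(n) / pipelined_ns(n)
    print(n, unpipelined_ns(n), pipelined_ns(n), f"{speedup:.2f}")
```

For three loads this gives 24 ns against 14 ns. For long runs the ratio approaches 8 / 2 = 4, not 5: under these numbers five 2 ns stages add up to more than the 8 ns unpipelined instruction, so the stages are not perfectly balanced and the ideal speedup is not reached.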


Chap 6.10

[Figure: clock-by-clock comparison of Load, Store, and R-type. Single Cycle Implementation: one long cycle per instruction (Cycle 1: Load, Cycle 2: Store, with part of the cycle wasted). Multiple Cycle Implementation: Load takes Ifetch, Reg, Exec, Mem, Wr; Store takes Ifetch, Reg, Exec, Mem; R-type begins with Ifetch in Cycle 10. Pipeline Implementation: Load, Store, and R-type each go through Ifetch, Reg, Exec, Mem, Wr in overlapping cycles.]

Single Cycle, Multiple Cycle, vs. Pipeline


Chap 6.11

Suppose we execute 100 instructions

Single Cycle Machine

• 45 ns/cycle x 1 CPI x 100 inst = 4500 ns

Multicycle Machine

• 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns

Ideal pipelined machine

• 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns

How much improvement can pipelining give us?
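
The three totals on this slide follow from time = cycle time x cycles. A small sketch, using the slide's numbers as defaults (the function names are illustrative):

```python
def single_cycle_ns(n: int, cycle_ns: float = 45) -> float:
    """One long cycle per instruction (CPI = 1)."""
    return cycle_ns * 1 * n


def multicycle_ns(n: int, cycle_ns: float = 10, cpi: float = 4.6) -> float:
    """Short cycles, but several of them per instruction."""
    return cycle_ns * cpi * n


def ideal_pipeline_ns(n: int, cycle_ns: float = 10, drain_cycles: int = 4) -> float:
    """One instruction completes per cycle, plus extra cycles to drain the pipe."""
    return cycle_ns * (1 * n + drain_cycles)


n = 100
print(f"single cycle: {single_cycle_ns(n):.0f} ns")
print(f"multicycle:   {multicycle_ns(n):.0f} ns")
print(f"pipelined:    {ideal_pipeline_ns(n):.0f} ns")
print(f"speedup over single cycle: {single_cycle_ns(n) / ideal_pipeline_ns(n):.2f}x")
```

With these figures the ideal pipeline is roughly 4.3 times faster than the single-cycle machine on 100 instructions, and the 4-cycle drain matters less as the instruction count grows.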


Chap 6.12

What makes it easy?

• all instructions are the same length

• just a few instruction formats

• memory operands appear only in loads and stores

What makes it hard?

• structural hazards: suppose we had only one memory

• control hazards: need to worry about branch instructions

• data hazards: an instruction depends on a previous instruction

We’ll build a simple pipeline and look at these issues

We’ll talk about modern processors and what really makes it hard:

• exception handling

• trying to improve performance with out-of-order execution, etc.

Pipelining


Chap 6.13

What do we need to add to actually split the datapath into stages?

[Figure: the single-cycle datapath (PC, instruction memory, registers, sign extend, ALU, data memory, adders, and multiplexers) divided into five stages. IF: Instruction fetch. ID: Instruction decode/register file read. EX: Execute/address calculation. MEM: Memory access. WB: Write back.]

Basic Idea


Chap 6.14

Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?

[Figure: the datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the five stages.]

Pipelined Datapath


Chap 6.15

[Figure: the corrected pipelined datapath, with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB.]

Corrected Datapath


Chap 6.16

Can help with answering questions like:

• how many cycles does it take to execute this code?

• what is the ALU doing during cycle 4?

• use this representation to help understand datapaths

[Figure: lw $10, 20($1) followed by sub $11, $2, $3, in program execution order, each drawn as IM, Reg, ALU, DM, Reg across clock cycles CC 1 to CC 6.]

Graphically Representing Pipelines
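
The questions on this slide can also be answered mechanically. The sketch below assumes an ideal pipeline with no stalls, so instruction i (counting from 0) enters the first stage in cycle i + 1; the helper names are hypothetical.

```python
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]


def stage_in_cycle(instr_index: int, cycle: int):
    """Stage occupied by instruction instr_index (0-based) in a 1-based cycle."""
    position = cycle - 1 - instr_index
    return STAGES[position] if 0 <= position < len(STAGES) else None


def total_cycles(num_instructions: int) -> int:
    """Cycles until the last instruction leaves the final stage."""
    return len(STAGES) + num_instructions - 1


program = ["lw $10, 20($1)", "sub $11, $2, $3"]

print(total_cycles(len(program)))  # 6

# What is the ALU doing during cycle 4?
print([ins for i, ins in enumerate(program) if stage_in_cycle(i, 4) == "ALU"])
# ['sub $11, $2, $3']
```

In cycle 4 the lw is in its DM stage, so the ALU is working on the sub.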


Chap 6.17

[Figure: the pipelined datapath (pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB) with its control lines labeled: PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, and MemtoReg. Instruction[15–0] feeds the sign extend unit; Instruction[20–16] and Instruction[15–11] feed the RegDst multiplexer.]

Pipeline Control


Chap 6.18

We have 5 stages. What needs to be controlled in each stage?

• Instruction Fetch and PC Increment

• Instruction Decode / Register Fetch

• Execution

• Memory Stage

• Write Back

How would control be handled in an automobile plant?

• a fancy control center telling everyone what to do?

• should we use a finite state machine?

Pipeline control


Chap 6.19

Pass control signals along just like the data

Control lines by stage (X = don't care):

             Execution/Address Calculation    Memory access            Write-back
Instruction  RegDst ALUOp1 ALUOp0 ALUSrc      Branch MemRead MemWrite  RegWrite MemtoReg
R-format     1      1      0      0           0      0       0         1        0
lw           0      0      0      1           0      1       0         1        1
sw           X      0      0      1           0      0       1         0        X
beq          X      0      1      0           1      0       0         0        X

[Figure: the Control unit decodes the instruction held in IF/ID. ID/EX carries the EX, M, and WB groups; EX/MEM carries M and WB; MEM/WB carries WB.]

Pipeline Control
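
The idea of passing control along with the data can be sketched as a table lookup followed by dropping one group of signals at each pipeline register. The values below are the ones from this slide's table; the dictionary layout and the `carried_groups` helper are illustrative assumptions, not part of the slides.

```python
X = None  # "don't care" entries from the slide's table

# Control lines grouped by the stage that consumes them.
CONTROL = {
    "R-format": {
        "EX": {"RegDst": 1, "ALUOp1": 1, "ALUOp0": 0, "ALUSrc": 0},
        "M": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemtoReg": 0},
    },
    "lw": {
        "EX": {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
        "M": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemtoReg": 1},
    },
    "sw": {
        "EX": {"RegDst": X, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
        "M": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
        "WB": {"RegWrite": 0, "MemtoReg": X},
    },
    "beq": {
        "EX": {"RegDst": X, "ALUOp1": 0, "ALUOp0": 1, "ALUSrc": 0},
        "M": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
        "WB": {"RegWrite": 0, "MemtoReg": X},
    },
}


def carried_groups(instruction: str) -> dict:
    """Signal groups still held in each pipeline register for one instruction."""
    signals = CONTROL[instruction]
    return {
        "ID/EX": {g: signals[g] for g in ("EX", "M", "WB")},
        "EX/MEM": {g: signals[g] for g in ("M", "WB")},   # EX group already used
        "MEM/WB": {g: signals[g] for g in ("WB",)},       # M group already used
    }


for register, groups in carried_groups("lw").items():
    print(register, sorted(groups))
```

Each stage consumes its own group and forwards the rest, so no central state machine has to track which instruction is where.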


Chap 6.20

[Figure: the complete pipelined datapath with the Control unit added. Control signals are generated in the ID stage and carried in the EX, M, and WB fields of the ID/EX, EX/MEM, and MEM/WB pipeline registers. Labeled control lines: PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, MemtoReg.]

Datapath with Control


Chap 6.21

Yes: Pipeline Hazards

• structural hazards: attempt to use the same resource two different ways at the same time

- E.g., a combined washer/dryer would be a structural hazard, or the folder is busy doing something else (watching TV)

• data hazards: attempt to use an item before it is ready

- E.g., one sock of a pair is in the dryer and one is in the washer; can’t fold until the sock in the washer gets through the dryer

- instruction depends on result of prior instruction still in the pipeline

• control hazards: attempt to make a decision before the condition is evaluated

- E.g., washing football uniforms and need to get the proper detergent level; need to see the result after the dryer before putting the next load in

- branch instructions

Can always resolve hazards by waiting

• pipeline control must detect the hazard

• take action (or delay action) to resolve hazards

Can pipelining get us into trouble?


Chap 6.22

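[Figure: Load followed by Instr 1 to Instr 4 in instruction order, plotted against time in clock cycles. Each instruction passes through Mem, Reg, ALU, Mem, Reg. With a single memory, the Load's data access and the instruction fetch of Instr 3 need the memory in the same cycle.]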

Detection is easy in this case! (right half highlight means read, left half write)

Single Memory is a Structural Hazard


Chap 6.23

Problem with starting next instruction before first is finished

• dependencies that go backward in time are data hazards

[Figure: the sequence below in program execution order, each instruction drawn as IM, Reg, ALU, DM, Reg across clock cycles CC 1 to CC 9, with dependence lines from the result of sub to each later read of $2.]

sub $2, $1, $3

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2: 10 in CC 1 to CC 4, 10/-20 in CC 5, -20 in CC 6 to CC 9

Dependencies
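
A rough way to find the dependencies that "go backward in time" is to compare the cycle in which a register is read with the cycle in which it is written. The sketch below assumes the five-stage pipeline of these slides with no stalls: instruction i (0-based) reads its registers in cycle i + 2 and writes its result in cycle i + 5, and a read in the same cycle as the write sees the new value (the 10/-20 entry for CC 5). It ignores the case of a later instruction overwriting the same register. The tuple layout and function name are illustrative.

```python
# (name, destination register or None, source registers)
program = [
    ("sub", "$2", ["$1", "$3"]),
    ("and", "$12", ["$2", "$5"]),
    ("or", "$13", ["$6", "$2"]),
    ("add", "$14", ["$2", "$2"]),
    ("sw", None, ["$15", "$2"]),
]


def data_hazards(instructions):
    """Return (producer, consumer, register) triples where the read comes too early."""
    hazards = []
    for i, (producer, dest, _) in enumerate(instructions):
        if dest is None:
            continue
        write_cycle = i + 5                      # producer's write-back stage
        for j in range(i + 1, len(instructions)):
            consumer, _, sources = instructions[j]
            read_cycle = j + 2                   # consumer's register-read stage
            if dest in sources and read_cycle < write_cycle:
                hazards.append((producer, consumer, dest))
    return hazards


print(data_hazards(program))
# [('sub', 'and', '$2'), ('sub', 'or', '$2')]
```

Under these assumptions the add and the sw are safe: they read $2 in CC 5 and CC 6, when the new value is already available.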


Chap 6.24

Have compiler guarantee no hazards

Where do we insert the nops?

sub $2, $1, $3

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Problem: this really slows us down!

Software Solution
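
A compiler-style pass for the software solution might look like the sketch below. It is a simplified illustration, not a real scheduler: it assumes a five-stage pipeline without forwarding in which a consumer must sit at least three instruction slots after its producer (so that its register read falls in the same cycle as the producer's write-back), and the `insert_nops` helper is hypothetical.

```python
# (name, destination register or None, source registers)
program = [
    ("sub", "$2", ["$1", "$3"]),
    ("and", "$12", ["$2", "$5"]),
    ("or", "$13", ["$6", "$2"]),
    ("add", "$14", ["$2", "$2"]),
    ("sw", None, ["$15", "$2"]),
]

NOP = ("nop", None, [])


def insert_nops(instructions, min_distance=3):
    """Pad with nops so every consumer is min_distance slots behind its producer."""
    scheduled = []
    for instr in instructions:
        _, _, sources = instr
        needed = 0
        # Only the last (min_distance - 1) scheduled slots can be too close.
        for back in range(1, min_distance):
            if back > len(scheduled):
                break
            earlier_dest = scheduled[-back][1]
            if earlier_dest is not None and earlier_dest in sources:
                needed = max(needed, min_distance - back)
        scheduled.extend([NOP] * needed)
        scheduled.append(instr)
    return [name for name, _, _ in scheduled]


print(insert_nops(program))
# ['sub', 'nop', 'nop', 'and', 'or', 'add', 'sw']
```

Two nops after the sub are enough here, because pushing the and back also pushes back everything behind it. The cost is two wasted cycles for this one dependence, which is the slowdown the slide complains about.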


Chap 6.25

Use temporary results, don’t wait for them to be written

• register file forwarding to handle read/write to same register

• ALU forwarding

what if this $2 was $13?

[Figure: the sequence below in program execution order, each instruction drawn as IM, Reg, ALU, DM, Reg across clock cycles CC 1 to CC 9, with forwarding paths from the EX/MEM and MEM/WB pipeline registers to the ALU inputs of the following instructions.]

sub $2, $1, $3

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

                   CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of $2:       10    10    10    10    10/-20  -20   -20   -20   -20
Value of EX/MEM:   X     X     X     -20   X       X     X     X     X
Value of MEM/WB:   X     X     X     X     -20     X     X     X     X

Forwarding


Chap 6.26

[Figure: the pipelined datapath with a forwarding unit. The Rs, Rt, and Rd fields from IF/ID (IF/ID.RegisterRs, IF/ID.RegisterRt, IF/ID.RegisterRd) are carried into ID/EX. The forwarding unit compares Rs and Rt with EX/MEM.RegisterRd and MEM/WB.RegisterRd and controls multiplexers on the ALU inputs.]

Forwarding
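
The decision made by the forwarding unit for one ALU input can be sketched as below. This follows the usual textbook formulation rather than anything spelled out on the slide: the RegWrite checks, the register-0 exclusion, the priority of EX/MEM over MEM/WB, and the name `forward_select` are all assumptions made for illustration.

```python
def forward_select(src_reg: int,
                   ex_mem_reg_write: bool, ex_mem_rd: int,
                   mem_wb_reg_write: bool, mem_wb_rd: int) -> str:
    """Pick the source for one ALU operand whose register number is src_reg."""
    # Most recent result first: the instruction one ahead, now in the MEM stage.
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"
    # Otherwise the instruction two ahead, now in the WB stage.
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"
    # No pending write to this register: use the value read from the register file.
    return "REGFILE"


# and $12, $2, $5 in EX while sub $2, $1, $3 is in MEM (CC 4 in the example):
print(forward_select(2, True, 2, False, 0))    # EX/MEM

# or $13, $6, $2 in EX while "and" is in MEM and "sub" is in WB (CC 5):
print(forward_select(2, True, 12, True, 2))    # MEM/WB
```

The two calls line up with the forwarding example in these slides, where -20 sits in EX/MEM during CC 4 and in MEM/WB during CC 5.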


Chap 6.27

Load word can still cause a hazard:

• an instruction tries to read a register following a load instruction that writes to the same register.

Thus, we need a hazard detection unit to stall the instruction that follows the load

[Figure: the sequence below in program execution order, each instruction drawn as IM, Reg, ALU, DM, Reg across clock cycles CC 1 to CC 9. The value loaded by lw is not available until after its DM stage, which is too late for the ALU stage of the and.]

lw $2, 20($1)

and $4, $2, $5

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

Can't always forward


Chap 6.28

We can stall the pipeline by keeping an instruction in the same stage

[Figure: the sequence below in program execution order across clock cycles CC 1 to CC 10. A bubble is inserted after the lw, so the and and every instruction behind it are delayed by one cycle.]

lw $2, 20($1)

and $4, $2, $5

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

Stalling


Chap 6.29

Stall by letting an instruction that won’t write anything go forward

[Figure: the pipelined datapath with a hazard detection unit added alongside the forwarding unit. The hazard detection unit takes ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, and IF/ID.RegisterRt as inputs and drives PCWrite, IF/IDWrite, and a multiplexer that can replace the control signals entering ID/EX with 0.]

Hazard Detection Unit
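
The load-use check can be written as a single condition over the signals named in the figure. A minimal sketch, assuming the usual textbook condition; the function names and the returned dictionary are illustrative.

```python
def must_stall(id_ex_mem_read: bool, id_ex_rt: int,
               if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction in ID needs the register a load in EX will write."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)


def stall_controls(stall: bool) -> dict:
    """Outputs of the hazard detection unit for one cycle."""
    return {
        "PCWrite": not stall,         # hold the PC so the same fetch repeats
        "IF/IDWrite": not stall,      # hold the instruction waiting in ID
        "zero_id_ex_control": stall,  # send a do-nothing bubble down the pipe
    }


# lw $2, 20($1) is in EX; and $4, $2, $5 is in ID.
stall = must_stall(True, 2, 2, 5)
print(stall)                  # True
print(stall_controls(stall))
```

After the one-cycle bubble the loaded value can be forwarded from MEM/WB, so no further stall is needed for this pair.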


Chap 6.30

When we decide to branch, other instructions are in the pipeline!

We are predicting branch not taken

• need to add hardware for flushing instructions if we are wrong

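[Figure: the sequence below (instruction address, then instruction) in program execution order, each instruction drawn as IM, Reg, ALU, DM, Reg across clock cycles CC 1 to CC 9. The instructions at 44, 48, and 52 enter the pipeline before the branch outcome is known; the instruction at 72 is the branch target.]

40 beq $1, $3, 7

44 and $12, $2, $5

48 or $13, $6, $2

52 add $14, $2, $2

72 lw $4, 50($7)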

Control (Branch) Hazards


Chap 6.31

[Figure: the pipelined datapath with branch handling moved into the ID stage: a register comparator (=), a branch-target adder fed by the sign extend and shift left 2 units, and an IF.Flush control line into the IF/ID register. The hazard detection unit and forwarding unit are retained.]

Flushing Instructions


Chap 6.32

Stall: wait until decision is clear

• It's possible to move up the decision to the 2nd stage by adding hardware to check registers as they are being read

Impact: 2 clock cycles per branch instruction => slow

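[Figure: Add, Beq, Load in instruction order, plotted against time in clock cycles. Each instruction passes through Mem, Reg, ALU, Mem, Reg; the Load's fetch is delayed until the branch decision is known.]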

Example: Control Hazard Solutions


Chap 6.33

Predict: guess one direction, then back up if wrong

• Predict not taken

Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right 50% of time)

More dynamic scheme: history of 1 branch (right about 90% of time)

[Figure: Add, Beq, Load in instruction order, plotted against time in clock cycles. Each instruction passes through Mem, Reg, ALU, Mem, Reg; the Load is fetched right after the Beq on the predicted path.]

Example: Control Hazard Solutions
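
The impact figures on this slide can be turned into an average cost per branch, and the "history of 1 branch" scheme can be tried on a made-up branch pattern. Both functions below are illustrative sketches; the loop pattern is invented for the example and is not taken from the slides.

```python
def avg_branch_cycles(accuracy: float, right_cost: int = 1, wrong_cost: int = 2) -> float:
    """Expected cycles per branch for a given prediction accuracy."""
    return accuracy * right_cost + (1 - accuracy) * wrong_cost


def one_bit_accuracy(outcomes, initial_prediction=False) -> float:
    """Predict that each branch repeats its previous outcome (one bit of history)."""
    prediction = initial_prediction
    correct = 0
    for taken in outcomes:
        correct += (prediction == taken)
        prediction = taken            # remember only the latest outcome
    return correct / len(outcomes)


print(f"{avg_branch_cycles(0.5):.1f}")   # 1.5 cycles per branch at 50% accuracy
print(f"{avg_branch_cycles(0.9):.1f}")   # 1.1 cycles per branch at 90% accuracy

# Hypothetical loop branch: taken 9 times, then falls through, repeated 10 times.
pattern = ([True] * 9 + [False]) * 10
print(one_bit_accuracy(pattern))         # 0.8
```

On this particular pattern the one-bit scheme mispredicts twice per pass through the loop (once on entry, once on exit), so the accuracy a real program sees depends heavily on how its branches behave.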


Chap 6.34

Redefine branch behavior (takes place after next instruction) “delayed branch”

Impact: 1 clock cycle per branch instruction if an instruction can be found to put in the “slot” (about 50% of time)

[Figure: Add, Beq, Misc, Load in instruction order, plotted against time in clock cycles. Each instruction passes through Mem, Reg, ALU, Mem, Reg; the Misc instruction occupies the slot after the Beq, and the Load follows it.]

Example: Control Hazard Solutions


Chap 6.35

The hardware performs the scheduling?

• hardware tries to find instructions to execute

• out of order execution is possible

• speculative execution and dynamic branch prediction

All modern processors are very complicated

• DEC Alpha 21264: 9 stage pipeline, 6 instruction issue

• PowerPC and Pentium: branch history table

• Compiler technology important

Dynamic Scheduling


Chap 6.36

Pipelining is a fundamental concept

• multiple steps using distinct resources

Utilize capabilities of the Datapath by pipelined instruction processing

• start next instruction while working on the current one

• limited by length of longest stage (plus fill/flush)

• detect and resolve hazards

Summary