

Enhancing Performance by Pipelining

COE608: Computer Organization and Architecture

Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan

Electrical and Computer Engineering

Ryerson University

Overview
• Introduction to Pipelining
  ♦ Laundry Example
• Pipelining and CPU Performance
  ♦ Pipelining Hazards
  ♦ Hazard Solutions
• Pipelined Datapath and Control
  ♦ Stalling, Forwarding and Flushing

Chapter 4 (Sections 4.5, 4.6, 4.7, 4.8 and part of 4.10/4.11) of Text


Introduction

Laundry Example
Consider Ann, Brian, Cathy and Dave, who each have one load of clothes to wash, dry, fold and stash.
• Washer takes 30 minutes
• Dryer takes 30 minutes
• “Folder” takes 30 minutes
• “Stasher” takes 30 minutes to put clothes into drawers



Sequential Laundry

Sequential laundry takes 8 hours for 4 loads

[Figure: sequential laundry timeline: loads A, B, C and D run back-to-back in 30-minute steps, from 6 PM until 2 AM.]


Pipelined Laundry
Start work ASAP:
• Multiple tasks operate simultaneously using different resources.

[Figure: pipelined laundry timeline: loads A, B, C and D overlap in successive 30-minute stages starting at 6 PM, so all four finish in 3.5 hours.]


Pipelined Laundry

• Pipelining doesn’t reduce the latency of a single task; it improves the throughput of the entire workload.
• Potential speedup = Number of (pipe) stages
• Pipeline rate is limited by the slowest pipeline stage.
• Unbalanced lengths of pipe stages reduce the speedup.
• Time to “fill” the pipeline and time to “drain” it reduce the speedup.
• Stall due to dependencies??
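As a quick sanity check on these bullets, here is a small Python sketch (an illustration, not from the slides) that computes the sequential and pipelined laundry times for the 4-load, 4-stage, 30-minutes-per-stage example:

    # Illustrative sketch, not course code.
    STAGE_MIN = 30                                 # wash = dry = fold = stash = 30 minutes
    STAGES, LOADS = 4, 4

    sequential = LOADS * STAGES * STAGE_MIN        # 480 min = 8 hours
    pipelined = (STAGES + LOADS - 1) * STAGE_MIN   # 210 min = 3.5 hours

    print(sequential, pipelined, round(sequential / pipelined, 2))   # 480 210 2.29

The speedup is 2.29x rather than the ideal 4x because of the fill and drain time, and each individual load still takes the full 2 hours.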



Pipelining in the CPU
Improve performance by increasing instruction throughput.
Five natural stages in an instruction execution:

Load Word Instruction Stages

• Ifetch: Instruction Fetch (fetch the instruction from Instruction Memory)

• Reg/Dec: Registers Fetch and Instruction Decode

• Exec: Calculate the memory address

• Mem: Read data from the Data Memory

• Wr: Write data back to the register file

Cycle 1: Ifetch | Cycle 2: Reg/Dec | Cycle 3: Exec | Cycle 4: Mem | Cycle 5: Wr
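The schedule above generalizes: in an ideal pipeline, instruction i simply occupies one stage per cycle, shifted by i cycles. A tiny Python sketch (an illustration, not course code) that prints this schedule:

    # Illustrative sketch of an ideal 5-stage pipeline schedule.
    STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]

    def schedule(n_instructions):
        # Instruction i (0-based) enters Ifetch in cycle i + 1 and advances one stage per cycle.
        for i in range(n_instructions):
            row = ["-"] * i + STAGES
            print("inst %d: " % (i + 1) + " | ".join("%-7s" % s for s in row))

    schedule(3)

Running it shows three instructions overlapping exactly as in the pipelined diagrams that follow.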


CPU Pipelining

Ideal speedup is number of stages in the pipeline.

What makes pipelining easy?
• All instructions are of the same length.
• There are only a few instruction formats.
• Memory operands appear only in loads and stores.

What makes pipelining hard?
• Structural Hazards: two instructions need the same resource at the same time.
• Control Hazards: a decision must be made before the branch condition is known.
• Data Hazards: an instruction depends on a previous instruction still in the pipeline.

[Figure: three loads (lw $1, 100($0); lw $2, 200($0); lw $3, 300($0)) shown executing without pipelining, where each instruction takes 8 ns (Instruction fetch, Reg, ALU, Data access, Reg) and one finishes every 8 ns, and with pipelining, where each stage takes 2 ns and a new instruction starts every 2 ns.]


Single Cycle, Multiple Cycle vs. Pipeline

Suppose we execute 100 instructions:
• Single-Cycle Machine: 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
• Multicycle Machine: 10 ns/cycle x 4.6 CPI (mix of inst) x 100 inst = 4600 ns
• Ideal Pipelined Machine: 10 ns/cycle x (1 CPI x 100 inst + 4 cycles to drain) = 1040 ns
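The same arithmetic in a short Python sketch (the numbers come from the bullets above; everything else is illustrative):

    # Back-of-the-envelope check of the three machine models.
    N = 100                                  # instructions executed
    single_cycle = 45 * 1 * N                # 45 ns/cycle, CPI = 1          -> 4500 ns
    multicycle = 10 * 4.6 * N                # 10 ns/cycle, average CPI 4.6  -> 4600 ns
    pipelined = 10 * (1 * N + 4)             # 10 ns/cycle, CPI 1 plus 4-cycle drain -> 1040 ns

    print(single_cycle, multicycle, pipelined)

Note that the pipelined machine wins on throughput even though its cycle time matches the multicycle machine and a single instruction still needs 5 cycles (50 ns) of latency.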

[Figure: clock-cycle diagrams contrasting the single-cycle implementation (one long fixed cycle per instruction, with wasted time on Load and Store) with the multicycle implementation, where Load takes Ifetch, Reg, Exec, Mem, Wr; Store takes Ifetch, Reg, Exec, Mem; and R-type takes Ifetch, Reg, Exec, Wr.]


Pipeline Hazards
• Structural Hazards: attempt to use the same resource in two different ways at the same time.
• Data Hazards: attempt to use an item before it is ready; an instruction depends on the result of a prior instruction still in the pipeline.
• Control Hazards: attempt to make a decision before the condition is evaluated (branch instructions).
The hazards can always be resolved by waiting.

Pipeline control must detect the hazards and take actions (or delay action) to resolve hazards.


Hazards

Structural Hazard: Single Memory
[Figure: Load followed by Instr 1 to Instr 4; with a single memory, a later instruction’s fetch and the Load’s data access need the memory in the same clock cycle.]

Control Hazard
[Figure: Add, Beq, Load; the instruction after the Beq must be fetched before the branch outcome is known.]


Control Hazard Solutions

• Stall: wait until the branch decision is clear. It is possible to move the decision up to the 2nd stage by adding hardware that compares the registers as they are read.
• Predict: guess one branch direction, then back up if wrong.

[Figure: Add, Beq, Load; a one-cycle bubble is inserted after the Beq so the Load is fetched only once the branch is resolved (stall).]


Control Hazard Solutions

• Delayed branch: redefine the branch behavior so that the branch takes effect after the next instruction.
• Impact: no clock cycle is wasted if one can find a useful instruction to put in the “slot”.

[Figure: Add, Beq, Misc, Load; the Misc instruction fills the branch delay slot, so no clock cycle is wasted.]


Data Hazard

Data hazard on r1: dependencies that reach backward in time are hazards.
“Forward” the result from one stage to another.

[Figure (shown twice): add r1,r2,r3 followed by sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 in a 5-stage pipeline (IF, ID/RF, EX, MEM, WB). The later instructions need r1 before the add has written it back, so their dependences reach backward in time; forwarding the ALU result resolves them.]


Pipelining R-type and Load

Structural hazard:
• Two instructions try to write to the register file at the same time.

Load uses Register File’s Write Port in 5th stage.

R-type uses Register File’s Write Port in 4th stage.

Solution ???

[Figure: a Load followed by R-type instructions; in one clock cycle the Load reaches Wr (its 5th stage) at the same time as a following R-type reaches Wr (its 4th stage). We have a problem!]


Pipelining R-type and Load

Insert “Bubble” into the Pipeline

Insert a “bubble” into the pipeline to prevent 2 writes at the same cycle.

• The control logic can be complex.
• We lose an instruction fetch and issue opportunity.

[Figure: with the bubble inserted, the instruction that would have written in the same cycle as the Load is held back one cycle, so only one register-file write occurs per cycle.]


Pipelining R-type and Load

Delay the R-type’s register write by one cycle:
• Now R-type instructions also use the Reg File’s write port at Stage 5.
• The Mem stage is a NOOP stage: nothing is being done there.

[Figure: every instruction now passes through five stages (Ifetch, Reg/Dec, Exec, Mem, Wr); R-type instructions simply idle in Mem, so all register-file writes happen in stage 5.]


Stages in the Datapath
What do we need to add to split the datapath into stages?

[Figure: the single-cycle datapath (PC, instruction memory, register file, sign-extend, ALU, data memory and write-back mux) divided into five stages by the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers.]


Pipelined Datapath and Control

[Figure: the pipelined datapath with control added: RegDst, ALUSrc, ALUOp, Branch, MemRead, MemWrite, RegWrite, MemtoReg and PCSrc alongside the IF/ID, ID/EX, EX/MEM and MEM/WB registers.]


Pipeline Control

Main Control generates the control signals during Instruction Decode:
• Exec (ExtOp, ALUSrc, ...) control signals are used 1 cycle later.
• Mem (MemWr, Branch) control signals are used 2 cycles later.
• Wr (MemtoReg, RegWr) control signals are used 3 cycles later.

Pass the control signals along just like the data.

Control lines by stage: Execution/Address Calculation (RegDst, ALUOp1, ALUOp0, ALUSrc), Memory access (Branch, MemRead, MemWrite), Write-back (RegWrite, MemtoReg).

Instruction | RegDst | ALUOp1 | ALUOp0 | ALUSrc | Branch | MemRead | MemWrite | RegWrite | MemtoReg
R-format    |   1    |   1    |   0    |   0    |   0    |    0    |    0     |    1     |    0
lw          |   0    |   0    |   0    |   1    |   0    |    1    |    0     |    1     |    1
sw          |   X    |   0    |   0    |   1    |   0    |    0    |    1     |    0     |    X
beq         |   X    |   0    |   1    |   0    |   1    |    0    |    0     |    0     |    X
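One way to read the table is as a lookup performed once in the Reg/Dec stage; the resulting bits then ride along with the instruction. A hedged Python sketch (signal names follow the table, but the dictionary itself is not from the text):

    # Illustrative encoding of the control table above; None stands for X (don't care).
    CONTROL = {
        #            RegDst ALUOp1 ALUOp0 ALUSrc Branch MemRead MemWrite RegWrite MemtoReg
        "R-format": (1,     1,     0,     0,     0,     0,      0,       1,       0),
        "lw":       (0,     0,     0,     1,     0,     1,      0,       1,       1),
        "sw":       (None,  0,     0,     1,     0,     0,      1,       0,       None),
        "beq":      (None,  0,     1,     0,     1,     0,      0,       0,       None),
    }

    # EX uses RegDst/ALUOp/ALUSrc; MEM uses Branch/MemRead/MemWrite one cycle later;
    # WB uses RegWrite/MemtoReg a cycle after that, carried in ID/EX, EX/MEM and MEM/WB.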

[Figure: Main Control in the Reg/Dec stage produces ExtOp, ALUSrc, ALUOp, RegDst, MemWr, Branch, MemtoReg and RegWr; the Exec, Mem and Wr groups of signals are carried forward with the instruction through the ID/EX, EX/MEM and MEM/WB registers.]


Datapath with Control

[Figure: the complete pipelined datapath with the Control unit; the EX, M and WB control fields travel through the ID/EX, EX/MEM and MEM/WB pipeline registers along with the data.]


Dependencies
Problem: we start the next instruction before the first has completed.
Dependencies that “go backward in time” cause data hazards.

[Figure: sub $2, $1, $3 followed by and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2). Register $2 holds 10 until the sub writes -20 in clock cycle 5, so the and and or would read the stale value.]


Software Solution
Have the compiler guarantee no hazards. Where do we insert the “noops”?

sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)
(Two noops after the sub would be needed before the and can safely read $2.)
Problem: this really slows down the application!

Instead, use the temporary results and don’t wait for them to be written into the register file: forwarding handles a read and write to the same register (ALU forwarding).
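A minimal model of the ALU-forwarding decision, assuming the usual MIPS conditions (the slides give the idea but not this code; register 0 is excluded because $0 is never a real destination):

    # Sketch of the forwarding-unit test for the first ALU operand.
    def forward_a(id_ex_rs, ex_mem, mem_wb):
        # Returns the mux select: 0b10 = EX/MEM ALU result, 0b01 = MEM/WB value, 0b00 = register file.
        if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == id_ex_rs:
            return 0b10          # EX hazard: result is still in the EX/MEM register
        if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == id_ex_rs:
            return 0b01          # MEM hazard: result is in MEM/WB
        return 0b00              # no hazard: use the register-file value

The ForwardB logic is identical with Rt in place of Rs.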

[Figure: the same sub/and/or/add/sw sequence with forwarding: the -20 result appears in EX/MEM in clock cycle 4 and in MEM/WB in cycle 5, and is forwarded from there to the ALU inputs of the dependent instructions.]


Forwarding

Can't always forward

An instruction tries to read a register following a load instruction that writes to the same register.

[Figure: the datapath with a Forwarding unit that compares EX/MEM.RegisterRd and MEM/WB.RegisterRd with the Rs/Rt fields carried in ID/EX and steers the ALU-input muxes.]
[Figure: lw $2, 20($1) followed by and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7; the and needs $2 before the load has even read memory, so forwarding alone cannot help.]


Stalling
A hazard detection unit “stalls” the pipeline on a load-use hazard: stall by keeping the instruction in the same stage.
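The stall condition itself is small. A sketch under the usual assumptions (field names mirror the ID/EX.MemRead and IF/ID.RegisterRs/Rt labels in the figure below; the code is illustrative, not from the text):

    # Sketch of the load-use hazard test.
    def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
        # Stall when the instruction in EX is a load and the instruction
        # being decoded reads the register that load is about to write.
        return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)

On a stall, PCWrite and IF/IDWrite are deasserted so the same instruction is re-fetched and re-decoded, and the ID/EX control signals are zeroed to insert the bubble.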

[Figure: the datapath with a Hazard detection unit that checks ID/EX.MemRead against IF/ID.RegisterRs and IF/ID.RegisterRt, holds PC and IF/ID (PCWrite, IF/IDWrite) and zeroes the ID/EX control signals to insert a bubble.]
[Figure: lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7; a one-cycle bubble is inserted between the lw and the and, after which forwarding supplies $2.]


Branch Hazards
By the time we decide to branch, other instructions are already in the pipeline. We predict “branch not taken”.

Need to add hardware for flushing instructions if we are wrong.
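How much mispredicted branches cost depends on where the branch is resolved. A rough Python sketch (the 17% branch frequency and 60% taken rate are illustrative assumptions, not figures from the slides):

    # Illustrative penalty model for "predict not taken".
    def effective_cpi(branch_frac, taken_frac, flush_penalty):
        # Every taken branch flushes `flush_penalty` wrongly fetched instructions,
        # each costing one extra cycle.
        return 1.0 + branch_frac * taken_frac * flush_penalty

    print(effective_cpi(0.17, 0.60, 1))   # branch resolved in ID  -> CPI about 1.10
    print(effective_cpi(0.17, 0.60, 3))   # branch resolved in MEM -> CPI about 1.31

This is why the datapath below moves the comparison into the ID stage and flushes only the one instruction in IF.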

[Figure: 40 beq $1, $3, 7 followed by 44 and $12, $2, $5; 48 or $13, $6, $2; 52 add $14, $2, $2; a taken branch targets 72 lw $4, 50($7), so the wrongly fetched instructions must be flushed.]
[Figure: the datapath with the branch comparison (“=”) moved into the ID stage, an IF.Flush signal and a mux that zeroes the control signals, so a taken branch flushes only the one instruction in IF.]


Improving Performance

Try to avoid stalls, e.g. by reordering instructions:

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
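A toy stall counter (a sketch, not from the text) makes the point: with forwarding, only a value used in the very next slot after its load forces a stall, so swapping the two stores removes it.

    # Toy model: (op, destination, source registers considered for the load-use check).
    code = [("lw", "$t0", []), ("lw", "$t2", []),
            ("sw", None, ["$t2"]), ("sw", None, ["$t0"])]

    def load_use_stalls(instrs):
        stalls = 0
        for prev, cur in zip(instrs, instrs[1:]):
            if prev[0] == "lw" and prev[1] in cur[2]:
                stalls += 1                 # load result needed by the very next instruction
        return stalls

    print(load_use_stalls(code))                        # 1 (sw $t2 right after lw $t2)
    reordered = [code[0], code[1], code[3], code[2]]    # swap the two sw instructions
    print(load_use_stalls(reordered))                   # 0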

Add a “branch delay slot”
• Always execute the next instruction after a branch.
• Rely on the compiler to “fill” the slot with something useful rather than NOOPs.

Dynamic Scheduling: the hardware performs the “scheduling”
• Hardware tries to find instructions to execute.
• Out-of-order execution is possible.
• Speculative execution and dynamic branch prediction.

Modern processors are very complicated:
• Alpha 21264: 9-stage pipeline, 6-instruction issue.
• PowerPC and Pentium: keep a branch history table.

Compiler technology is important.

Superscalar Architecture: start more than one instruction in the same cycle.


Nondelayed vs. Delayed Branch

[Figure: the sequence add $1,$2,$3; sub $4,$5,$6; beq $1,$4,Exit; or $8,$9,$10; xor $10,$1,$11; Exit: shown for a nondelayed branch and for a delayed branch, where an instruction independent of the branch is moved into the delay slot (“Fill Slot”).]


Superscalar Architecture

An Example: Independent integer and FP issue to separate pipelines

Single-issue Total Time = Int Exec Time + FP Exec Time

Max Speedup = Total Time / MAX(Int Exec Time, FP Exec Time)
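A quick check of this bound with made-up numbers (the 60/40 split is an assumption for illustration only):

    # Illustrative values; any units will do.
    int_time, fp_time = 60.0, 40.0
    single_issue = int_time + fp_time                     # 100 time units single-issue
    max_speedup = single_issue / max(int_time, fp_time)   # 100 / 60

    print(round(max_speedup, 2))                          # 1.67

Even with perfect overlap, the longer of the two instruction streams dominates, so dual issue to separate integer and FP pipelines can never exceed this ratio.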

[Figure: a dual-issue organization: an I-Cache feeding an instruction issue and bypass unit, integer and FP register files, an integer unit, a load/store unit, FP add and FP multiply units, shared operand/result busses and a D-Cache.]


Superscalar Laundry: Parallel per stage

More resources, HW to match mix of parallel tasks?

[Figure: superscalar laundry: with more resources, several loads (A to F, light, dark and very dirty clothing) proceed in parallel per stage starting at 6 PM.]


Superscalar Laundry: Mismatch Mix

Task mix underutilizes extra resources

[Figure: mismatch mix: loads A to D are mostly light clothing with one dark load, so the duplicated resources sit partly idle.]


General Superscalar Organization

Super-Pipelined
• Many pipeline stages need less than half a clock cycle.
• Doubling the internal clock speed gets two tasks per external clock cycle.
• Superscalar allows parallel fetch and execute.


Superscalar Execution

• Simultaneously fetch multiple instructions.
• Logic to determine true dependencies involving register values.
• Mechanisms to communicate these values.
• Mechanisms to initiate multiple instructions in parallel.
• Resources for parallel execution of multiple instructions.
• Mechanisms for committing process state in correct order.


The PowerPC 601 Architecture


Intel Itanium Microprocessor