

Enhancing Performance by Pipelining

COE608: Computer Organization and Architecture

Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan

Electrical and Computer Engineering

Ryerson University

Overview
• Introduction to Pipelining
  ♦ Laundry Example
• Pipelining and CPU Performance
  ♦ Pipelining Hazards
  ♦ Hazard Solutions
• Pipelined Datapath and Control
  ♦ Stalling, Forwarding and Flushing

Chapter 4 (Sections 4.5, 4.6, 4.7, 4.8 and part of 4.10/4.11) of Text


Introduction

Laundry Example
Consider Ann, Brian, Cathy and Dave, who each have one load of clothes to wash, dry, fold and stash.
• Washer takes 30 minutes
• Dryer takes 30 minutes
• “Folder” takes 30 minutes
• “Stasher” takes 30 minutes to put clothes into drawers



Sequential Laundry

Sequential laundry takes 8 hours for 4 loads

[Figure: sequential laundry timeline: loads A, B, C and D run back-to-back in 30-minute steps, from 6 PM until 2 AM.]


Pipelined Laundry
Start work ASAP:
• Multiple tasks operate simultaneously using different resources.

[Figure: pipelined laundry timeline: loads A, B, C and D overlap in successive 30-minute stages starting at 6 PM, so all four finish in 3.5 hours.]


Pipelined Laundry

• Pipelining doesn’t reduce the latency of a single task; it improves the throughput of the entire workload.
• Potential speedup = Number of (pipe) stages
• Pipeline rate is limited by the slowest pipeline stage.
• Unbalanced lengths of pipe stages reduce the speedup.
• Time to “fill” the pipeline and time to “drain” it reduce the speedup.
• Stall due to dependencies??
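As a quick sanity check on these bullets, here is a small Python sketch (an illustration, not from the slides) that computes the sequential and pipelined laundry times for the 4-load, 4-stage, 30-minutes-per-stage example:

    # Illustrative sketch, not course code.
    STAGE_MIN = 30                                 # wash = dry = fold = stash = 30 minutes
    STAGES, LOADS = 4, 4

    sequential = LOADS * STAGES * STAGE_MIN        # 480 min = 8 hours
    pipelined = (STAGES + LOADS - 1) * STAGE_MIN   # 210 min = 3.5 hours

    print(sequential, pipelined, round(sequential / pipelined, 2))   # 480 210 2.29

The speedup is 2.29x rather than the ideal 4x because of the fill and drain time, and each individual load still takes the full 2 hours.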



Pipelining in the CPU
Improve performance by increasing instruction throughput.
Five natural stages in an instruction execution:

Load Word Instruction Stages

• Ifetch: Instruction Fetch (fetch the instruction from Instruction Memory)

• Reg/Dec: Registers Fetch and Instruction Decode

• Exec: Calculate the memory address

• Mem: Read data from the Data Memory

• Wr: Write data back to the register file

Cycle 1: Ifetch | Cycle 2: Reg/Dec | Cycle 3: Exec | Cycle 4: Mem | Cycle 5: Wr
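The schedule above generalizes: in an ideal pipeline, instruction i simply occupies one stage per cycle, shifted by i cycles. A tiny Python sketch (an illustration, not course code) that prints this schedule:

    # Illustrative sketch of an ideal 5-stage pipeline schedule.
    STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]

    def schedule(n_instructions):
        # Instruction i (0-based) enters Ifetch in cycle i + 1 and advances one stage per cycle.
        for i in range(n_instructions):
            row = ["-"] * i + STAGES
            print("inst %d: " % (i + 1) + " | ".join("%-7s" % s for s in row))

    schedule(3)

Running it shows three instructions overlapping exactly as in the pipelined diagrams that follow.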


CPU Pipelining

Ideal speedup is number of stages in the pipeline.

What makes pipelining easy?
• All instructions are of the same length.
• There are only a few instruction formats.
• Memory operands appear only in loads and stores.

What makes pipelining hard?
• Structural Hazards: two instructions need the same resource at the same time.
• Control Hazards: a decision must be made before the branch condition is known.
• Data Hazards: an instruction depends on a previous instruction still in the pipeline.

[Figure: three loads (lw $1, 100($0); lw $2, 200($0); lw $3, 300($0)) shown executing without pipelining, where each instruction takes 8 ns (Instruction fetch, Reg, ALU, Data access, Reg) and one finishes every 8 ns, and with pipelining, where each stage takes 2 ns and a new instruction starts every 2 ns.]


Single Cycle, Multiple Cycle vs. Pipeline

Suppose we execute 100 instructions:
• Single-Cycle Machine: 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
• Multicycle Machine: 10 ns/cycle x 4.6 CPI (mix of inst) x 100 inst = 4600 ns
• Ideal Pipelined Machine: 10 ns/cycle x (1 CPI x 100 inst + 4 cycles to drain) = 1040 ns
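The same arithmetic in a short Python sketch (the numbers come from the bullets above; everything else is illustrative):

    # Back-of-the-envelope check of the three machine models.
    N = 100                                  # instructions executed
    single_cycle = 45 * 1 * N                # 45 ns/cycle, CPI = 1          -> 4500 ns
    multicycle = 10 * 4.6 * N                # 10 ns/cycle, average CPI 4.6  -> 4600 ns
    pipelined = 10 * (1 * N + 4)             # 10 ns/cycle, CPI 1 plus 4-cycle drain -> 1040 ns

    print(single_cycle, multicycle, pipelined)

Note that the pipelined machine wins on throughput even though its cycle time matches the multicycle machine and a single instruction still needs 5 cycles (50 ns) of latency.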

[Figure: clock-cycle diagrams contrasting the single-cycle implementation (one long fixed cycle per instruction, with wasted time on Load and Store) with the multicycle implementation, where Load takes Ifetch, Reg, Exec, Mem, Wr; Store takes Ifetch, Reg, Exec, Mem; and R-type takes Ifetch, Reg, Exec, Wr.]


Pipeline Hazards
• Structural Hazards: attempt to use the same resource in two different ways at the same time.
• Data Hazards: attempt to use an item before it is ready; an instruction depends on the result of a prior instruction still in the pipeline.
• Control Hazards: attempt to make a decision before the condition is evaluated (branch instructions).
The hazards can always be resolved by waiting.

Pipeline control must detect the hazards and take actions (or delay action) to resolve hazards.


Hazards

Structural Hazard: Single Memory
[Figure: Load followed by Instr 1 to Instr 4; with a single memory, a later instruction’s fetch and the Load’s data access need the memory in the same clock cycle.]

Control Hazard
[Figure: Add, Beq, Load; the instruction after the Beq must be fetched before the branch outcome is known.]


Control Hazard Solutions

• Stall: wait until the branch decision is clear. It is possible to move the decision up to the 2nd stage by adding hardware that compares the registers as they are read.
• Predict: guess one branch direction, then back up if wrong.

[Figure: Add, Beq, Load; a one-cycle bubble is inserted after the Beq so the Load is fetched only once the branch is resolved (stall).]


Control Hazard Solutions

• Delayed branch: redefine the branch behavior so that the branch takes effect after the next instruction.
• Impact: no clock cycle is wasted if one can find a useful instruction to put in the “slot”.

[Figure: Add, Beq, Misc, Load; the Misc instruction fills the branch delay slot, so no clock cycle is wasted.]


Data Hazard

Data hazard on r1: dependencies that reach backward in time are hazards.
“Forward” the result from one stage to another.

[Figure (shown twice): add r1,r2,r3 followed by sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 in a 5-stage pipeline (IF, ID/RF, EX, MEM, WB). The later instructions need r1 before the add has written it back, so their dependences reach backward in time; forwarding the ALU result resolves them.]


Pipelining R-type and Load

Structural hazard:
• Two instructions try to write to the register file at the same time.

Load uses Register File’s Write Port in 5th stage.

R-type uses Register File’s Write Port in 4th stage.

Solution ???

[Figure: a Load followed by R-type instructions; in one clock cycle the Load reaches Wr (its 5th stage) at the same time as a following R-type reaches Wr (its 4th stage). We have a problem!]


Pipelining R-type and Load

Insert “Bubble” into the Pipeline

Insert a “bubble” into the pipeline to prevent 2 writes at the same cycle.

• The control logic can be complex.
• We lose an instruction fetch and issue opportunity.

[Figure: with the bubble inserted, the instruction that would have written in the same cycle as the Load is held back one cycle, so only one register-file write occurs per cycle.]


Pipelining R-type and Load

Delay the R-type’s register write by one cycle:
• Now R-type instructions also use the Reg File’s write port at Stage 5.
• The Mem stage is a NOOP stage: nothing is being done there.

[Figure: every instruction now passes through five stages (Ifetch, Reg/Dec, Exec, Mem, Wr); R-type instructions simply idle in Mem, so all register-file writes happen in stage 5.]


Stages in the Datapath
What do we need to add to split the datapath into stages?

[Figure: the single-cycle datapath (PC, instruction memory, register file, sign-extend, ALU, data memory and write-back mux) divided into five stages by the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers.]


Pipelined Datapath and Control

[Figure: the pipelined datapath with control added: RegDst, ALUSrc, ALUOp, Branch, MemRead, MemWrite, RegWrite, MemtoReg and PCSrc alongside the IF/ID, ID/EX, EX/MEM and MEM/WB registers.]


Pipeline Control

Main Control generates the control signals during Instruction Decode:
• Exec (ExtOp, ALUSrc, ...) control signals are used 1 cycle later.
• Mem (MemWr, Branch) control signals are used 2 cycles later.
• Wr (MemtoReg, RegWr) control signals are used 3 cycles later.

Pass the control signals along just like the data.

Control lines by stage: Execution/Address Calculation (RegDst, ALUOp1, ALUOp0, ALUSrc), Memory access (Branch, MemRead, MemWrite), Write-back (RegWrite, MemtoReg).

Instruction | RegDst | ALUOp1 | ALUOp0 | ALUSrc | Branch | MemRead | MemWrite | RegWrite | MemtoReg
R-format    |   1    |   1    |   0    |   0    |   0    |    0    |    0     |    1     |    0
lw          |   0    |   0    |   0    |   1    |   0    |    1    |    0     |    1     |    1
sw          |   X    |   0    |   0    |   1    |   0    |    0    |    1     |    0     |    X
beq         |   X    |   0    |   1    |   0    |   1    |    0    |    0     |    0     |    X
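One way to read the table is as a lookup performed once in the Reg/Dec stage; the resulting bits then ride along with the instruction. A hedged Python sketch (signal names follow the table, but the dictionary itself is not from the text):

    # Illustrative encoding of the control table above; None stands for X (don't care).
    CONTROL = {
        #            RegDst ALUOp1 ALUOp0 ALUSrc Branch MemRead MemWrite RegWrite MemtoReg
        "R-format": (1,     1,     0,     0,     0,     0,      0,       1,       0),
        "lw":       (0,     0,     0,     1,     0,     1,      0,       1,       1),
        "sw":       (None,  0,     0,     1,     0,     0,      1,       0,       None),
        "beq":      (None,  0,     1,     0,     1,     0,      0,       0,       None),
    }

    # EX uses RegDst/ALUOp/ALUSrc; MEM uses Branch/MemRead/MemWrite one cycle later;
    # WB uses RegWrite/MemtoReg a cycle after that, carried in ID/EX, EX/MEM and MEM/WB.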

[Figure: Main Control in the Reg/Dec stage produces ExtOp, ALUSrc, ALUOp, RegDst, MemWr, Branch, MemtoReg and RegWr; the Exec, Mem and Wr groups of signals are carried forward with the instruction through the ID/EX, EX/MEM and MEM/WB registers.]


Datapath with Control

[Figure: the complete pipelined datapath with the Control unit; the EX, M and WB control fields travel through the ID/EX, EX/MEM and MEM/WB pipeline registers along with the data.]


Dependencies
Problem: we start the next instruction before the first has completed.
Dependencies that “go backward in time” cause data hazards.

[Figure: sub $2, $1, $3 followed by and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2). Register $2 holds 10 until the sub writes -20 in clock cycle 5, so the and and or would read the stale value.]


Software Solution
Have the compiler guarantee no hazards. Where do we insert the “noops”?

sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)
(Two noops after the sub would be needed before the and can safely read $2.)
Problem: this really slows down the application!

Instead, use the temporary results and don’t wait for them to be written into the register file: forwarding handles a read and write to the same register (ALU forwarding).
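A minimal model of the ALU-forwarding decision, assuming the usual MIPS conditions (the slides give the idea but not this code; register 0 is excluded because $0 is never a real destination):

    # Sketch of the forwarding-unit test for the first ALU operand.
    def forward_a(id_ex_rs, ex_mem, mem_wb):
        # Returns the mux select: 0b10 = EX/MEM ALU result, 0b01 = MEM/WB value, 0b00 = register file.
        if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == id_ex_rs:
            return 0b10          # EX hazard: result is still in the EX/MEM register
        if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == id_ex_rs:
            return 0b01          # MEM hazard: result is in MEM/WB
        return 0b00              # no hazard: use the register-file value

The ForwardB logic is identical with Rt in place of Rs.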

[Figure: the same sub/and/or/add/sw sequence with forwarding: the -20 result appears in EX/MEM in clock cycle 4 and in MEM/WB in cycle 5, and is forwarded from there to the ALU inputs of the dependent instructions.]


Forwarding

Can't always forward

An instruction tries to read a register following a load instruction that writes to the same register.

[Figure: the datapath with a Forwarding unit that compares EX/MEM.RegisterRd and MEM/WB.RegisterRd with the Rs/Rt fields carried in ID/EX and steers the ALU-input muxes.]
[Figure: lw $2, 20($1) followed by and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7; the and needs $2 before the load has even read memory, so forwarding alone cannot help.]


Stalling
A hazard detection unit “stalls” the pipeline on a load-use hazard: stall by keeping the instruction in the same stage.
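The stall condition itself is small. A sketch under the usual assumptions (field names mirror the ID/EX.MemRead and IF/ID.RegisterRs/Rt labels in the figure below; the code is illustrative, not from the text):

    # Sketch of the load-use hazard test.
    def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
        # Stall when the instruction in EX is a load and the instruction
        # being decoded reads the register that load is about to write.
        return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)

On a stall, PCWrite and IF/IDWrite are deasserted so the same instruction is re-fetched and re-decoded, and the ID/EX control signals are zeroed to insert the bubble.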

[Figure: the datapath with a Hazard detection unit that checks ID/EX.MemRead against IF/ID.RegisterRs and IF/ID.RegisterRt, holds PC and IF/ID (PCWrite, IF/IDWrite) and zeroes the ID/EX control signals to insert a bubble.]
[Figure: lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7; a one-cycle bubble is inserted between the lw and the and, after which forwarding supplies $2.]


Branch Hazards
By the time we decide to branch, other instructions are already in the pipeline. We predict “branch not taken”.

Need to add hardware for flushing instructions if we are wrong.
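How much mispredicted branches cost depends on where the branch is resolved. A rough Python sketch (the 17% branch frequency and 60% taken rate are illustrative assumptions, not figures from the slides):

    # Illustrative penalty model for "predict not taken".
    def effective_cpi(branch_frac, taken_frac, flush_penalty):
        # Every taken branch flushes `flush_penalty` wrongly fetched instructions,
        # each costing one extra cycle.
        return 1.0 + branch_frac * taken_frac * flush_penalty

    print(effective_cpi(0.17, 0.60, 1))   # branch resolved in ID  -> CPI about 1.10
    print(effective_cpi(0.17, 0.60, 3))   # branch resolved in MEM -> CPI about 1.31

This is why the datapath below moves the comparison into the ID stage and flushes only the one instruction in IF.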

[Figure: 40 beq $1, $3, 7 followed by 44 and $12, $2, $5; 48 or $13, $6, $2; 52 add $14, $2, $2; a taken branch targets 72 lw $4, 50($7), so the wrongly fetched instructions must be flushed.]
[Figure: the datapath with the branch comparison (“=”) moved into the ID stage, an IF.Flush signal and a mux that zeroes the control signals, so a taken branch flushes only the one instruction in IF.]


Improving Performance

Try to avoid stalls, e.g. by reordering instructions:

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
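A toy stall counter (a sketch, not from the text) makes the point: with forwarding, only a value used in the very next slot after its load forces a stall, so swapping the two stores removes it.

    # Toy model: (op, destination, source registers considered for the load-use check).
    code = [("lw", "$t0", []), ("lw", "$t2", []),
            ("sw", None, ["$t2"]), ("sw", None, ["$t0"])]

    def load_use_stalls(instrs):
        stalls = 0
        for prev, cur in zip(instrs, instrs[1:]):
            if prev[0] == "lw" and prev[1] in cur[2]:
                stalls += 1                 # load result needed by the very next instruction
        return stalls

    print(load_use_stalls(code))                        # 1 (sw $t2 right after lw $t2)
    reordered = [code[0], code[1], code[3], code[2]]    # swap the two sw instructions
    print(load_use_stalls(reordered))                   # 0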

Add a “branch delay slot”
• Always execute the next instruction after a branch.
• Rely on the compiler to “fill” the slot with something useful rather than NOOPs.

Dynamic Scheduling: the hardware performs the “scheduling”
• Hardware tries to find instructions to execute.
• Out-of-order execution is possible.
• Speculative execution and dynamic branch prediction.

Modern processors are very complicated:
• Alpha 21264: 9-stage pipeline, 6-instruction issue.
• PowerPC and Pentium: keep a branch history table.

Compiler technology is important.

Superscalar Architecture: start more than one instruction in the same cycle.


Nondelayed vs. Delayed Branch

[Figure: the sequence add $1,$2,$3; sub $4,$5,$6; beq $1,$4,Exit; or $8,$9,$10; xor $10,$1,$11; Exit: shown for a nondelayed branch and for a delayed branch, where an instruction independent of the branch is moved into the delay slot (“Fill Slot”).]


Superscalar Architecture

An Example: Independent integer and FP issue to separate pipelines

Single-issue Total Time = Int Exec Time + FP Exec Time

Max Speedup = Total Time / MAX(Int Exec Time, FP Exec Time)
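A quick check of this bound with made-up numbers (the 60/40 split is an assumption for illustration only):

    # Illustrative values; any units will do.
    int_time, fp_time = 60.0, 40.0
    single_issue = int_time + fp_time                     # 100 time units single-issue
    max_speedup = single_issue / max(int_time, fp_time)   # 100 / 60

    print(round(max_speedup, 2))                          # 1.67

Even with perfect overlap, the longer of the two instruction streams dominates, so dual issue to separate integer and FP pipelines can never exceed this ratio.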

[Figure: a dual-issue organization: an I-Cache feeding an instruction issue and bypass unit, integer and FP register files, an integer unit, a load/store unit, FP add and FP multiply units, shared operand/result busses and a D-Cache.]


Superscalar Laundry: Parallel per stage

More resources, HW to match mix of parallel tasks?

[Figure: superscalar laundry: with more resources, several loads (A to F, light, dark and very dirty clothing) proceed in parallel per stage starting at 6 PM.]


Superscalar Laundry: Mismatch Mix

Task mix underutilizes extra resources

[Figure: mismatch mix: loads A to D are mostly light clothing with one dark load, so the duplicated resources sit partly idle.]


General Superscalar Organization

Super-Pipelined
• Many pipeline stages need less than half a clock cycle.
• Doubling the internal clock speed gets two tasks per external clock cycle.
• Superscalar allows parallel fetch and execute.


Superscalar Execution

• Simultaneously fetch multiple instructions.
• Logic to determine true dependencies involving register values.
• Mechanisms to communicate these values.
• Mechanisms to initiate multiple instructions in parallel.
• Resources for parallel execution of multiple instructions.
• Mechanisms for committing process state in correct order.


The PowerPC 601 Architecture


Intel Itanium Microprocessor