32
1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining PC Instruction m em ory Instruction Add Instruction [20– 16] Mem toReg ALU O p Branch R egD st ALU Src 4 16 32 Instruction [15– 0] 0 0 M u x 0 1 A dd Add result R egisters W rite register W rite data R ead data 1 R ead data 2 R ead register1 R ead register2 Sign extend M u x 1 A LU result Zero W rite data Read data M u x 1 A LU control Shift left2 RegWrite M em R ead C ontrol ALU Instruction [15– 11] 6 EX M WB M WB WB IF/ID PC Src ID /EX EX /M E M MEM/W B M u x 0 1 MemWrite A ddress D ata m em ory Address

1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

  • View
    218

  • Download
    0

Embed Size (px)

Citation preview

Page 1: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

1CSE 45432 SUNY New Paltz

Chapter Six

Enhancing Performance with Pipelining

PC

Instructionmemory

Instruction

Add

Instruction[20– 16]

MemtoReg

ALUOp

Branch

RegDst

ALUSrc

4

16 32Instruction[15– 0]

0

0

Mux

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1

ALUresult

Zero

Writedata

Readdata

Mux

1

ALUcontrol

Shiftleft 2

RegW

rite

MemRead

Control

ALU

Instruction[15– 11]

6

EX

M

WB

M

WB

WBIF/ID

PCSrc

ID/EX

EX/MEM

MEM/WB

Mux

0

1

MemW

rite

AddressData

memory

Address

Page 2: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

2CSE 45432 SUNY New Paltz

Sequential Laundry

• Sequential laundry takes 8 hours for 4 loads

• If they learned pipelining, how long would laundry take?

30Task

Order

B

C

D

ATime

3030 3030 30 3030 3030 3030 3030 3030

6 PM 7 8 9 10 11 12 1 2 AM

Page 3: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

3CSE 45432 SUNY New Paltz

Pipelined Laundry• Pipelining doesn’t help latency of

single task, it helps throughput of entire workload

• Multiple tasks operating simultaneously using different resources

• Potential speedup = Number pipe stages

• Pipeline rate limited by slowest pipeline stage

• Unbalanced lengths of pipe stages reduces speedup

• Time to “fill” pipeline and time to “drain” it reduces speedup

• Stall for dependencies

6 PM 7 8 9

Time

B

C

D

A

303030 3030 3030Task

Order

Page 4: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

4CSE 45432 SUNY New Paltz

Single Stage VS. Pipeline Performance

Ideal speedup is number of stages in the pipeline.

Instructionfetch

R eg ALU Dataaccess

R eg

8 nsInstruction

fetchR eg ALU

Dataaccess

Reg

8 nsIns truction

fe tch

8 ns

Tim e

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

...

Programexecutionorder(in instructions)

Instructionfetch

Reg ALUDa ta

accessReg

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 nsInstruction

fetchReg ALU

Dataaccess

Reg

2 nsInstruction

fetchReg ALU

D ataaccess

Reg

2 ns 2 ns 2 ns 2 ns 2 n s

P rogramexecutionorder(in instruc tions)

InstructionInstr.

MemoryRegister

Read ALU Op.Data

MemoryReg. Write Total

R-format 2 1 2 0 1 6 nslw 2 1 2 1 2 8 nssw 2 1 2 2 7 nsbeq 2 1 2 5 ns

Page 5: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

5CSE 45432 SUNY New Paltz

Pipelining

• What makes it easy in MIPS

– all instructions are the same length

– just a few instruction formats

– memory operands appear only in loads and stores

• What makes it hard?

– structural hazards: suppose we had only one memory

– control hazards: need to worry about branch instructions

– data hazards: an instruction depends on a previous instruction

• We’ll build a simple pipeline and look at these issues

Page 6: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

6CSE 45432 SUNY New Paltz

The Five Stages of Load

• Ifetch: Instruction Fetch

– Fetch the instruction from the Instruction Memory

• Reg/Dec: Registers Fetch and Instruction Decode

• Exec: Calculate the memory address

• Mem: Read the data from the Data Memory

• Wr: Write the data back to the register file

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

Ifetch Reg/Dec Exec Mem WrLoad

Page 7: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

7CSE 45432 SUNY New Paltz

Single Cycle, Multiple Cycle, vs. Pipeline

Clk

Cycle 1

Multiple Cycle Implementation:Multiple Cycle Implementation:

Ifetch Reg Exec Mem Wr

Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10

Load Ifetch Reg Exec Mem Wr

Ifetch Reg Exec Mem

Load Store

Pipeline Implementation:Pipeline Implementation:

Ifetch Reg Exec Mem WrStore

Clk

Single Cycle Implementation:Single Cycle Implementation:

Load Store Waste

Ifetch

R-type

Ifetch Reg Exec Mem WrR-type

Cycle 1 Cycle 2

Page 8: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

8CSE 45432 SUNY New Paltz

Basic Idea

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Instruction

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

ReaddataAddress

Datamemory

1

ALUresult

Mux

ALUZero

IF: Instruction fetch ID: Instruction decode/register file read

EX: Execute/address calculation

MEM: Memory access WB: Write back

Page 9: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

9CSE 45432 SUNY New Paltz

Pipelined Datapath

• Walk through lw instruction• Walk through sw instruction• The design

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Datamemory

Address

Page 10: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

10CSE 45432 SUNY New Paltz

Corrected Datapath

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0

Address

Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

Datamemory

1

ALUresult

Mux

ALUZero

ID/EX

Page 11: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

11CSE 45432 SUNY New Paltz

Graphically Representing Pipelines

• Can help with answering questions like:– how many cycles does it take to execute this code?– what is the ALU doing during cycle 4?– use this representation to help understand datapaths

IM Reg DM Reg

IM Reg DM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

lw $10, 20($1)

Programexecutionorder(in instructions)

sub $11, $2, $3

ALU

ALU

Page 12: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

12CSE 45432 SUNY New Paltz

Why Pipeline? Because the resources are there!

Instr.

Order

Time (clock cycles)

Inst 0

Inst 1

Inst 2

Inst 4

Inst 3

AL

UIm Reg Dm Reg

AL

UIm Reg Dm Reg

AL

UIm Reg Dm RegA

LUIm Reg Dm Reg

AL

UIm Reg Dm Reg

Page 13: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

13CSE 45432 SUNY New Paltz

Pipeline Control

PC

Instructionmemory

Address

Inst

ruct

ion

Instruction[20– 16]

MemtoReg

ALUOp

Branch

RegDst

ALUSrc

4

16 32Instruction[15– 0]

0

0Registers

Writeregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1Write

data

Read

data Mux

1

ALUcontrol

RegWrite

MemRead

Instruction[15– 11]

6

IF/ID ID/EX EX/MEM MEM/WB

MemWrite

Address

Datamemory

PCSrc

Zero

AddAdd

result

Shiftleft 2

ALUresult

ALU

Zero

Add

0

1

Mux

0

1

Mux

Page 14: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

14CSE 45432 SUNY New Paltz

• Pass control signals along just like the data

Pipeline Control

Execution/Address Calculation stage control lines

Memory access stage control lines

Write-back stage control

lines

InstructionReg Dst

ALU Op1

ALU Op0

ALU Src Branch

Mem Read

Mem Write

Reg write

Mem to Reg

R-format 1 1 0 0 0 0 0 1 0lw 0 0 0 1 0 1 0 1 1sw X 0 0 1 0 0 1 0 Xbeq X 0 1 0 1 0 0 0 X

Control

EX

M

WB

M

WB

WB

IF/ID ID/EX EX/MEM MEM/WB

Instruction

Page 15: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

15CSE 45432 SUNY New Paltz

Datapath with Control

PC

Instructionmemory

Inst

ruct

ion

Add

Instruction[20– 16]

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

16 32Instruction[15– 0]

0

0

Mux

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1

ALUresult

Zero

Writedata

Readdata

Mux

1

ALUcontrol

Shiftleft 2

RegW

rite

MemRead

Control

ALU

Instruction[15– 11]

6

EX

M

WB

M

WB

WBIF/ID

PCSrc

ID/EX

EX/MEM

MEM/WB

Mux

0

1

Mem

Write

AddressData

memory

Address

Page 16: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

16CSE 45432 SUNY New Paltz

Designing a Pipelined Processor

• Go back and examine your datapath and control diagram

• associated resources with states

• ensure that flows do not conflict, or figure out how to resolve

• assert control in appropriate stage

Page 17: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

17CSE 45432 SUNY New Paltz

Can pipelining get us into trouble?

• Yes: Pipeline Hazards

– structural hazards: attempt to use the same resource two different ways at the same time

– data hazards: attempt to use item before it is ready

• instruction depends on result of prior instruction still in the pipeline

– control hazards: attempt to make a decision before condition is evaluated

• branch instructions• Can always resolve hazards by waiting

– pipeline control must detect the hazard

– take action (or delay action) to resolve hazards

Page 18: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

18CSE 45432 SUNY New Paltz

• Problem with starting next instruction before first is finished

• dependencies that “go backward in time” are data hazards

Data Hazards

IM Reg

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

sub $2, $1, $3

Programexecutionorder(in instructions)

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2:

DM Reg

Reg

Reg

Reg

DM

Page 19: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

19CSE 45432 SUNY New Paltz

• Have compiler guarantee no hazards

• Where should compiler insert “nop” instructions?

sub $2, $1, $3and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15, 100($2)

• Problem:

– It happens too often to rely on compiler

– It really slows us down!

Software Solution

Page 20: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

20CSE 45432 SUNY New Paltz

• Use temporary results, don’t wait for them to be written

• Also, write register file during 1st half of clock and read during 2nd half

Data Hazard Solution: Forwarding

what if this $2 was $13?

IM R e g

IM R e g

C C 1 C C 2 C C 3 C C 4 C C 5 C C 6

T im e ( in c lo c k cy c le s )

s ub $ 2 , $ 1 , $ 3

P r o g ra me xe c u tio n o rde r( in in s tru c tio n s )

a n d $ 1 2 , $ 2 , $5

IM R eg D M R e g

IM D M R e g

IM D M R e g

C C 7 C C 8 C C 9

10 1 0 1 0 1 0 1 0 /– 2 0 – 2 0 – 2 0 – 2 0 – 20

o r $ 1 3 , $ 6 , $2

a dd $ 1 4 , $ 2 , $2

s w $ 15 , 1 0 0 ($ 2 )

V a lue o f re g is te r $ 2 :

D M R eg

R eg

R e g

R eg

X X X – 20 X X X X XV a lu e o f E X /M E M :X X X X – 2 0 X X X XV a lu e o f M E M /W B :

D M

Page 21: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

21CSE 45432 SUNY New Paltz

• Steer the result from precious instruction to the ALU

• EX hazardif (EX/MEM.RegWrite

and (EX/MEM.RegisterRd = 0)

and (EX /MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10

if (EX/MEM.RegWrite

and (EX/MEM.RegisterRd = 0)

and (EX /MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10

• MEM hazardif (MEM/WB.RegWrite

and (MEM/WB.RegisterRd = 0)

and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01

if (MEM/WB.RegWrite

and (MEM/WB.RegisterRd = 0)

and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01

Hazard Conditions

Page 22: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

22CSE 45432 SUNY New Paltz

Forwarding

PCInstructionmemory

Registers

Mux

Mux

Control

ALU

EX

M

WB

M

WB

WB

ID/EX

EX/MEM

MEM/WB

Datamemory

Mux

Forwardingunit

IF/ID

Inst

ruct

ion

Mux

RdEX/MEM.RegisterRd

MEM/WB.RegisterRd

Rt

Rt

Rs

IF/ID.RegisterRd

IF/ID.RegisterRt

IF/ID.RegisterRt

IF/ID.RegisterRs

00 Register file

01 Mem. or earlier ALU

10 Prior ALU

Page 23: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

23CSE 45432 SUNY New Paltz

• lw can still cause a hazard:– an instruction tries to read a register following a load instruction that writes to the same

register.

• Thus, we need a hazard detection unit to “stall” the load instruction

Can't always forward

Reg

IM

Reg

Reg

IM

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

lw $2, 20($1)

Programexecutionorder(in instructions)

and $4, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

DM Reg

Reg

Reg

DM

Page 24: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

24CSE 45432 SUNY New Paltz

Stalling

• We can stall the pipeline by keeping an instruction in the same stage

• Repeat in clock cycle 4 what they did in clock cycle 3

lw $2, 20($1)

Programexecutionorder(in instructions)

and $4, $2, $5

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

Reg

IM

Reg

Reg

IM DM

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (in clock cycles)

IM Reg DM RegIM

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9 CC 10

DM Reg

RegReg

Reg

bubble

Page 25: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

25CSE 45432 SUNY New Paltz

Hazard Detection Unit

• Stall by letting an instruction that won’t write anything go forward

• controls writing of the PC and IF/ID plus MUX

PCInstruction

memory

Registers

Mux

Mux

Mux

Control

ALU

EX

M

WB

M

WB

WB

ID/EX

EX/MEM

MEM/WB

Datamemory

Mux

Hazarddetection

unit

Forwardingunit

0

Mux

IF/ID

Inst

ruct

ion

ID/EX.MemReadIF

/ID

Wri

te

PC

Writ

e

ID/EX.RegisterRt

IF/ID.RegisterRd

IF/ID.RegisterRt

IF/ID.RegisterRt

IF/ID.RegisterRs

RtRs

Rd

Rt EX/MEM.RegisterRd

MEM/WB.RegisterRd

Page 26: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

26CSE 45432 SUNY New Paltz

• When we decide to branch, other instructions are in the pipeline!

• We are predicting “branch not taken”– need to add hardware for flushing instructions if we are wrong

Branch Hazards

Reg

Reg

CC 1

Time (in clock cycles)

40 beq $1, $3, 7

Programexecutionorder(in instructions)

IM Reg

IM DM

IM DM

IM DM

DM

DM Reg

Reg Reg

Reg

Reg

RegIM

44 and $12, $2, $5

48 or $13, $6, $2

52 add $14, $2, $2

72 lw $4, 50($7)

CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9

Reg

Page 27: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

27CSE 45432 SUNY New Paltz

Flushing Instructions

PCInstruction

memory

4

Registers

Mux

Mux

Mux

ALU

EX

M

WB

M

WB

WB

ID/EX

0

EX/MEM

MEM/WB

Datamemory

Mux

Hazarddetection

unit

Forwardingunit

IF.Flush

IF/ID

Signextend

Control

Mux

=

Shiftleft 2

Mux

• Reduce branch delay

Page 28: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

28CSE 45432 SUNY New Paltz

Improving Performance

• Superpipelining: ideal maximum speedup is related to number of stages

• Superscalar: start more than one instruction in the same cycle

• Dynamic pipeline scheduling

– Try and avoid stalls! E.g., reorder these instructions:

lw $t0, 0($t1)lw $t2, 4($t1)sw $t2, 0($t1)sw $t0, 4($t1)

Page 29: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

29CSE 45432 SUNY New Paltz

Dynamic Scheduling

• The hardware performs the “scheduling”

– hardware tries to find instructions to execute

– out of order execution is possible

– speculative execution and dynamic branch prediction

• All modern processors are very complicated

– DEC Alpha 21264: 9 stage pipeline, 6 instruction issue

– PowerPC and Pentium: branch history table

– Compiler technology important

Page 30: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

30CSE 45432 SUNY New Paltz

Dynamic Scheduling in PowerPC 604 and Pentium Pro

• Both In-order Issue, Out-of-order execution, In-order Commit

Pentium Pro central reservation station for any functional units with one bus shared by a branch and an integer unit

Page 31: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

31CSE 45432 SUNY New Paltz

Dynamic Scheduling in Pentium Pro

• PPro doesn’t pipeline 80x86 instructions

• PPro decode unit translates the Intel instructions into 72-bit micro-operations ( MIPS)

• Sends micro-operations to reorder buffer & reservation stations

• Takes 1 clock cycle to determine length of 80x86 instructions + 2 more to create the micro-operations

• Most instructions translate to 1 to 4 micro-operations

• Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations

Page 32: 1 CSE 45432 SUNY New Paltz Chapter Six Enhancing Performance with Pipelining

32CSE 45432 SUNY New Paltz

FYI: MIPS R3000 clocking discipline

• 2-phase non-overlapping clocks

phi1

phi2

phi1 phi1phi2Edge-triggered