15
1 COSC 6385 Computer Architecture - Pipelining Edgar Gabriel Spring 2018 Some of the slides are based on a lecture by David Culler, University of California, Berkley http://www.eecs.berkeley.edu/~ culler/courses/cs252-s05 Pipelining Pipelining is an implementation technique whereby multiple instructions are overlapped in execution Split an “expensive” operation into several sub- operations Execute the sub-operations in a staggered manner Real world analogy: assembly line in car manufacturing Each station is doing something different Each station working on a separate car Pipelining increases the throughput, but does not reduce the latency of an operation

COSC 6385 Computer Architecture - Pipelininggabriel/courses/cosc6385_s18/CA_06_Pipelining.pdf · 1 COSC 6385 Computer Architecture - Pipelining Edgar Gabriel Spring 2018 Some of the

Embed Size (px)

Citation preview

1

COSC 6385

Computer Architecture

- Pipelining

Edgar Gabriel

Spring 2018

Some of the slides are based on a lecture by David Culler, University of California, Berkley

http://www.eecs.berkeley.edu/~culler/courses/cs252-s05

Pipelining

• Pipelining is an implementation technique whereby

multiple instructions are overlapped in execution

– Split an “expensive” operation into several sub-

operations

– Execute the sub-operations in a staggered manner

• Real world analogy: assembly line in car manufacturing

– Each station is doing something different

– Each station working on a separate car

• Pipelining increases the throughput, but does not

reduce the latency of an operation

2

Classes of instructions

• ALU instructions

– Take either 2 registers as operands or 1 register and one

16bit immediate offset

– Results are stored in a 3rd register

• Load and store instructions

• Branches and jumps

Typical implementation of an

instruction (I)1. Instruction fetch cycle (IF):

• send PC to memory

• Fetch current instruction

• Update PC to next sequential PC (+4 bytes)

2. Instruction decode/register fetch cycle (ID)

• Decode instruction

• Read registers corresponding to register source

specifiers from register file

• Sign extend offset fields if needed

• Compute possible branch target address

3

Typical implementation of an

instruction (II)3. Execution /effective address cycle (EX)

• ALU adds base register and offset to form effective address or

• ALU performs operations on the values read from register file or

• ALU performs operation on value read from register and sign-extended immediate

4. Memory access cycle (MEM)

• If instruction is a load, read memory using the effective address computed in step 3

• If instruction is a store, write the data from the second register read of the register file to the effective address

5. Write-back cycle (WB)

• Write result into register file

• From memory for a load instruction

• From ALU for an ALU instruction

Typical implementation of an

instruction (III)MemoryAccess

WriteBack

InstructionFetch

Instr. DecodeReg. Fetch

ExecuteAddr. Calc

LMD

ALU

MU

X

Mem

ory

Reg

File

MU

XM

UX

Data

Mem

ory

MU

X

SignExtend

4

Adder Zero?

Next SEQ PC

PC

Next PC

WB Data

Inst

RD

RS1

RS2

Imm

Slide based on a lecture by David Culler,

University of California, Berkley

http://www.eecs.berkeley.edu/~culler/courses/cs252-s05

4

Details(I)

4

Adder

PC Read

address

Instruction

memory

Instruction

Fetching instructions and incrementing program count (PC)

Details (II)

Read

register 1

Register

file

ALU instructions, e.g. add R1, R2, R3

Read

register 2

Write

register

Write

Data

Read

data 1

Read

data 2

5

5

5ALU

Register

numbers

Data

Data

RegWrite

ALU operation4

ALU

result

Zero

Register number input is 5 bit wide

if you have 32(=25) registers

Write control signal

ALU operation

control signal (4 bits)

5

Details (III)

Address

Data

memory

Load/Store instructions, e.g. LW R1,offset (R2)

Write

Data

Read

data

MemRead

MemWrite

SignExtend

16 32

Basic steps for a load/store operation

• sign extend the offset from 16 to 32 bit

• add the sign extended offset to R2

• Load the content of the resulting address into R1 or

• store the data from R1 into the resulting memory address

Details (IV)Combining Load/Store and ALU instructions

Read

register 1

Register

file

Read register 2

Write

register

Write

Data

Read

data 1

Read

data 2

RegWrite

Instruction

SignExtend

16 32

MUX

0

1

ALU

4

Address

Data

memory

Write

Data

Read

data

MemRead

MemWriteALUsrc

MUX

0

1

MemtoReg

ALU operation

6

Details (V)Branches e.g. beq R1,R2,offset

Basic steps for a branch equal instruction

• compute branch target address

• sign extended offset field

• shift offset field by 2 bits in order to ensure a word offset

• add shifted, sign-extended offset to PC

• compare registers R1 and R2

Details (VI)Implementation of branches, e.g. beq R1,R2,offset

Read

register 1

Register

file

Read register 2

Write

register

Write

Data

Read

data 1

Read

data 2

RegWrite

SignExtend

16 32

ALU

4 ALU operation

InstructionTo branch

control logic

Shift Left 2

Add

PC+4 from instruction datapathBranch

target

7

Visualizing pipelining

Instr.

Order

Time (clock ycles)

ID

ALU

MemIF WB

ID

ALU

MemIF WB

ID

ALU

MemIF WB

ID

ALU

MemIF WB

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7Cycle 5

Slide based on a lecture by David Culler,

University of California, Berkley

http://www.eecs.berkeley.edu/~culler/courses/cs252-s05

Effects of pipelining

• A pipeline of depth n requires n-times the memory bandwidth of a

non-pipelined processor for the same clock rate

• Separate data and instruction cache eliminates some memory

conflicts

• Register file is used in stage ID and in WB

– Usually not a conflict, since write’s are executed in the first

half of the clock-cycle and read’s in the second half

• Instructions in the pipeline should not attempt to use the same

hardware resources at the same time

– Introducing pipeline registers between successive stages of the

pipeline

– Registers named after the stages they connect (e.g. IF/ID,

ID/ALU, etc.)

8

MemoryAccess

WriteBack

InstructionFetch

Instr. DecodeReg. Fetch

ExecuteAddr.Calc

ALU

Mem

ory

Reg

File

MU

XM

UX

Data

Mem

ory

MU

X

SignExtend

Zero?

IF/I

D

ID/E

X

MEM

/WB

EX/M

EM

4

Adder

Next SEQ PC Next SEQ PC

RD RD RD

Next PC

Addre

ss

RS1

RS2

Imm

MU

X

Slide based on a lecture by David Culler,

University of California, Berkley

http://www.eecs.berkeley.edu/~culler/courses/cs252-s05

Pipeline Hazards

• Limits to pipelining: Hazards prevent next instruction

from executing during its designated clock cycle

– Structural hazards: HW cannot support this combination

of instructions

– Data hazards: Instruction depends on result of prior

instruction still in the pipeline

– Control hazards: Caused by delay between the fetching of

instructions and decisions about changes in control flow

(branches and jumps).

Slide based on a lecture by David Culler,

University of California, Berkley

http://www.eecs.berkeley.edu/~culler/courses/cs252-s05

9

One Memory Port/Structural Hazards

Instr.

Order

Load

Instr 1

Instr 2

Instr 3

Instr 4

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7Cycle 5

Reg

ALU

DMemIfetch Reg

Slide based on a lecture by David Culler,

University of California, Berkley

http://www.eecs.berkeley.edu/~culler/courses/cs252-s05

One Memory Port/Structural Hazards

Instr.

Order

Load

Instr 1

Instr 2

Stall

Instr 3

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7Cycle 5

Reg

ALU

DMemIfetch Reg

Bubble Bubble Bubble BubbleBubble

Slide based on a lecture by David Culler,

University of California, Berkley

http://www.eecs.berkeley.edu/~culler/courses/cs252-s05

10

Instr.

Order

add r1,r2,r3

sub r4,r1,r3

and r6,r1,r7

or r8,r1,r9

xor r10,r1,r11

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Data Hazard on R1

IF ID EX MEMWB

Slide based on a lecture by David Culler,

University of California, Berkley

http://www.eecs.berkeley.edu/~culler/courses/cs252-s05

• Read After Write (RAW)

InstrJ tries to read operand before InstrI writes it

• Caused by a “Dependence” (in compiler nomenclature).

This hazard results from an actual need for

communication.

Three Generic Data Hazards

I: add r1,r2,r3

J: sub r4,r1,r3

Slide based on a lecture by David Culler,

University of California, Berkley

http://www.eecs.berkeley.edu/~culler/courses/cs252-s05

11

• Write After Read (WAR)

InstrJ writes operand before InstrI reads it

• Called an “anti-dependence” by compiler writers.

This results from reuse of the name “r1”.

• Can’t happen in our 5 stage pipeline because:

– All instructions take 5 stages, and

– Reads are always in stage 2, and

– Writes are always in stage 5

I: sub r4,r1,r3

J: add r1,r2,r3

K: mul r6,r1,r7

Three Generic Data Hazards

Slide based on a lecture by David Culler,

University of California, Berkley

http://www.eecs.berkeley.edu/~culler/courses/cs252-s05

Three Generic Data Hazards

• Write After Write (WAW)

InstrJ writes operand before InstrI writes it.

• Called an “output dependence” by compiler writers

This also results from the reuse of name “r1”.

• Can’t happen in 5 stage pipeline because:

– All instructions take 5 stages, and

– Writes are always in stage 5

I: sub r1,r4,r3

J: add r1,r2,r3

K: mul r6,r1,r7

Slide based on a lecture by David Culler,

University of California, Berkley

http://www.eecs.berkeley.edu/~culler/courses/cs252-s05

12

Time (clock cycles)

Forwarding to Avoid Data Hazard

Inst

r.

Order

add r1,r2,r3

sub r4,r1,r3

and r6,r1,r7

or r8,r1,r9

xor r10,r1,r11

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Slide based on a lecture by David Culler,

University of California, Berkley

http://www.eecs.berkeley.edu/~culler/courses/cs252-s05

Instr.

Order

ld r1, 0(r2)

sub r4,r1,r6

and r6,r1,r7

or r8,r1,r9

Data Hazard even with Forwarding

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Slide based on a lecture by David Culler,

University of California, Berkley

http://www.eecs.berkeley.edu/~culler/courses/cs252-s05

13

Data Hazard Even with Forwarding

or r8,r1,r9

Instr.

Order

ld r1, 0(r2)

sub r4,r1,r6

and r6,r1,r7

Reg

ALU

DMemIfetch Reg

RegIfetch

ALU

DMem RegBubble

Ifetch

ALU

DMem RegBubble Reg

Ifetch

ALU

DMemBubble Reg

Slide based on a lecture by David Culler,

University of California, Berkley

http://www.eecs.berkeley.edu/~culler/courses/cs252-s05

Adder

IF/I

D

Branches: Pipelined DatapathMemoryAccess

WriteBack

InstructionFetch

Instr. DecodeReg. Fetch

ExecuteAddr. Calc

ALU

Mem

ory

Reg

File

MU

X

Data

Mem

ory

MU

X

SignExtend

Zero?

MEM

/WB

EX/M

EM

4

Adder

Next SEQ PC

RD RD RD WB

Dat

a

Next PC

Addre

ss

RS1

RS2

Imm

MU

X

ID/E

X

Slide based on a lecture by David Culler,

University of California, Berkley

http://www.eecs.berkeley.edu/~culler/courses/cs252-s05

14

Four Branch Hazard Alternatives

#1: Stall until branch direction is clear

#2: Predict Branch Not Taken

– Execute successor instructions in sequence

– “Squash” instructions in pipeline if branch actually taken

– Advantage of late pipeline state update

– 47% branches not taken on average

– PC+4 already calculated, so use it to get next instruction

#3: Predict Branch Taken

– 53% branches taken on average

– But haven’t calculated branch target address yet

• still incurs 1 cycle branch penalty

• Other machines: branch target known before outcomeSlide based on a lecture by David Culler,

University of California, Berkley

http://www.eecs.berkeley.edu/~culler/courses/cs252-s05

Four Branch Hazard Alternatives

#4: Delayed Branch

– Define branch to take place AFTER a following

instruction

branch instruction

sequential successor1sequential successor2........

sequential successorn

branch target if taken

– 1 slot delay allows proper decision and branch target

address in 5 stage pipeline

Branch delay of length n

Slide based on a lecture by David Culler,

University of California, Berkley

http://www.eecs.berkeley.edu/~culler/courses/cs252-s05

15

Delayed Branch

• Where to get instructions to fill branch delay slot?

– Before branch instruction

– From the target address: only valuable when branch

taken

– From fall through: only valuable when branch not taken

• Compiler effectiveness for single branch delay slot:

– Fills about 60% of branch delay slots

– About 80% of instructions executed in branch delay slots

useful in computation

Slide based on a lecture by David Culler,

University of California, Berkley

http://www.eecs.berkeley.edu/~culler/courses/cs252-s05