CSCI2510 Computer Organization Lecture 11: Pipelining

CSCI2510 Computer Organization

Lecture 11: Pipelining

Ming-Chang YANG

[email protected]

Reading: Chap. 8 (5th Ed.)

mailto:[email protected]

Why We Need Pipelining?

• Real-life example:

Four loads of

laundry that need

to be washed,

dried, and folded.

– Washing: 30 min

– Drying: 40 min

– Folding: 20 min

• Without pipeline:

– 360 min in total

• With pipeline:

– 210 min in total!

CSCI2510 Lec11: Pipelining 2

https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/pipelining/index.html

Outline

• Sequential Execution vs Pipelining

• Pipeline Stall: Hazard

– Data Hazard

– Instruction Hazard

– Structural Hazard

• Superscalar Operation


Sequential Execution

• The processor fetches and executes instructions, one

after the other.

– Fi: Fetch steps for instruction Ii

– Ei: Execute steps for instruction Ii

• Execution of a program consists of a sequential

sequence of fetch and execute steps:

• How to improve the speed of execution?

– Use faster technologies to build CPU and memory ($$$).

– Arrange hardware to do multiple operations at a time ($).


F1 E1 F2 E2 F3 E3

I1 I2 I3

Time

Separate HW & Interstage Buffer

• Consider a computer having two separate hardware

units:

– One hardware unit is for fetching instructions.

– The other hardware unit is for executing instructions.

• Interstage Buffer: Deposit the fetched instruction.

– Execution unit executes the deposited instruction.

– Fetch unit fetches the next instruction at the same time.


InstructionFetchUnit

ExecutionUnit

Interstage buffer

Instruction

Inst

ruct

ion

• Assume the computer is controlled by a clock.

– The fetch and execute steps of any instruction can be

completed in one clock cycle.

• Fetch and execute units form a two-stage pipeline:

– Both units are kept busy all the time.

– An interstage buffer is needed to hold the instruction.


Basic Idea of Instruction Pipelining (1/2)

F 1 E 1

F 2 E 2

F 3 E 3

I 1

I 2

I 3

Instruction

Clock cycle 1 2 3 4 Time

• Parallelism is increased by overlapping the fetch and

execute steps.

– If executions sustain for a long time, the completion rate of

a two-stage pipelining will be twice.

• More is better? How about 4-stage pipeline?

– F: Fetch instruction from memory

– D: Decode instruction and fetch source operands

– E: Execute instruction

– W: Write the result


Basic Idea of Instruction Pipelining (2/2)

FFetch

instruction

DDecode

instruction

4-Stage Pipeline (1/2)


F 4I4

F 1

F 2

F 3

I1

I2

I3

D1

D2

D3

D4

E 1

E 2

E 3

E 4

W1

W2

W3

W4

Instruction

Clock cycle 1 2 3 4 5 6 7

EExecute

operation

WWriteresults

Interstage buffers

B1 B2 B3

Time

4-Stage Pipeline (2/2)

• The four hardware units perform their tasks

simultaneously without interfering others.

– The required information is passed from one unit to the next through an interstage buffer.

• Each stage should be roughly the same maximum

clock period.

– Why? A unit that completes its task early is idle for the

remainder of the clock period.


Class Exercise 11.1

• During clock cycle 4, what is the information hold by

the three interstage buffers (i.e., B1, B2, and B3)

respectivley?


Student ID:

Name:

Date:

FFetch

instruction

DDecode

instruction

F 4I4

F 1

F 2

F 3

I1

I2

I3

D1

D2

D3

D4

E 1

E 2

E 3

E 4

W1

W2

W3

W4

Clock cycle 1 2 3 4 5 6 7

EExecute

operation

WWriteresults

B1 B2 B3

Time

Outline



– Data Hazard





Reality: Pipeline may Stall

• If any pipeline stage requires more than 1 cycle,

others have to wait (pipeline stalled).

– E.g. E2 requires three cycles to complete


F 1

F 2

F 3

I 1

I 2

I 3

E 1

E 2

E 3

D 1

D 2

D 3

W 1

W 2

W 3

Instruction

F 4 D 4I 4

Clock cycle 1 2 3 4 5 6 7 8 9

E 4

F 5I 5 D 5

T ime

E 5

W 4

In cycles 5 and 6: Write, Decode and

Fetch units must wait and do nothing …

Stall & Hazard

• Hazard: Any condition that causes pipeline to stall.

• Another example: A cache miss occurs in F2:

Figure: Instruction execution steps in successive clock cycles.

Figure: Statuses of processor stages in successive clock cycles.


F1

F2

F3

I1

I2

I3

D1

D2

D3

E1

E2

E3

W1

W2

W3

Instruction

1 2 3 4 5 6 7 8 9Clock c ycleTime

1 2 3 4 5 6 7 8Clock c ycle

Stage

F: Fetch

D: Decode

E: Execute

W: Write

F1 F2 F3

D1 D2 D3idle idle idle

E1 E2 E3idle idle idle

W1 W2idle idle idle

9

W3

F2 F2 F2

Time

Types of Hazards

• Data Hazard

– Either the source or the destination operands of an

instruction are not available when required.

• Instruction Hazard

– A delay in the availability of an instruction (this may

be a result of a miss in the cache).

• Structural Hazard

– Two instructions require the use of a given

hardware resource at the same time.


Outline



– Data Hazard





Data Hazard


I1 (Mul)

I2 (Add)

I3

Instruction

1 2 3 4 5 6 7 8 9Clock c ycle

I4

F 1

F 2

F 3

D 1

D 3

E 1

E 3

E2

W3

W1

D2-A W2

F 4 D4 E 4 W 4

D 2

Time

Pipeline is stalled for two cycles.

I1: A = 3 * A;

I2: B = 4 + A;

D: Decode and fetch

source operands

• A data hazard is a situation in which the pipeline is

stalled because the operands are delayed.

• Example:

– Dependent operations must be performed sequentially to

ensure the data consistency.

Class Exercise 11.2

• Please specify whether we will encounter data

hazards for the following instructions.


I1: A = 5 * C;

I2: B = 20 + C;

I1: C = A * B;

I2: E = C + D;

Software Solution to Data Hazard

• The compiler detects and introduces two-cycle delay

by inserting NOP (No-operation) instructions.

– Advantage: Simpler hardware, less cost

– Disadvantage: Larger code size, less flexibility, and

reduced performance


F 1

F 2

I1 (Mul)

I2 (Add)

D 1 E 1

E2

Instruction

1 2 3 4 5 6 7 8 9Clock c ycle

W1

W2D 2

Time

NOP

NOP

I1: A = 3 * A;

I2: B = 4 + A;

Question: Do we really avoid the pipline stalling?

Hardware Solution to Data Hazard (1/2)

• The data hazard arises because I2 is waiting for data

to be written in the register A.

• In fact, the result of I1 is available at the output of ALU.

• Delay is reduced if the result can be forwarded to E2.


F 1

F 2

F 3

I1 (Mul)

I2 (Add)

I3

D 1

D 3

E 1

E 3

E2

W3

Instruction

1 2 3 4 5 6 7 8 9Clock c ycle

W1

D2-A W2

F 4 D4 E 4 W 4I4

D 2

Time

D: Decode and fetch

source operands

I1: A = 3 * A;

I2: B = 4 + A;

Result of I1 is available here!

Hardware Solution to Data Hazard (2/2)

• Operand Forwarding: By introducing the forwarding

path, the execution of I2 can proceed without stalling.

– Disadvantage: Additional hardware cost


E: Ex ecute(ALU)

W: Write(Register file)

SRC1,SRC2 RSL T

(b) Source and result registers

Register

file

SRC1 SRC2

RSL T

Destination

Source 1

Source 2

(a) Datapath (3 buses)

ALU

Port A Port B

Outline



– Data Hazard





Instruction Hazard

• Recall: The purpose of the instruction fetch unit is to

supply the execution units with instructions.

– F: Fetch instruction from memory

– D: Decode instruction and fetch source operands

– E: Execute instruction

– W: Write the result

• Instruction Hazard: The cases cause the pipeline to

stall, because of the delay of instructions.

1) Cache miss

2) Branch instruction (both unconditional and conditional)CSCI2510 Lec11: Pipelining 24

FFetch

instruction

DDecode

instruction

EExecute

operation

WWriteresults

B1 B2 B3

Instruction Hazard: Cache Miss

• The effect of a cache miss on the pipelined operation

is as follows:

– I1 is fetched from the cache in cycle 1.

– The fetch operation F2 for I2 results in a cache miss.

• The instruction fetch unit must suspend any further fetch requests until

F2 is completed.


F1

F2

I1

I2

I3

D1

D2

E1

E2

W1

W2

F3 D3 E3 W3

Instruction

1 2 3 4 5 6 7 8 9Clock cycleTime

F3

Postponed

• Branches may also cause the pipeline to stall.

– Branch Penalty: The time lost because of a branch inst.

– Branch penalty can be reduced by computing the branch

address earlier in Decode stage (rather than Execute stage)

• However, it still results in 1 cycle branch penalty to the pipeline.


F1 D1 E 1 W1

I2 (Branch to Ik)

I1

1 2 3 4 5 6 7Clock cycle

F2 D2

Branch address computed in Execute stageBranch Penalty: 2 clock cycles

E 2

8Time

F 1 D1 E 1 W1

I2 (Branch to Ik)

I 1


F 2 D2

Branch address computed in Decode stageBranch Penalty: 1 clock cycle

Time

Fk D k Ek

Fk+ 1 Dk+ 1

Ik

Ik+ 1

Wk

E k+ 1

XF3I3

D3

F4 XI4

I3 and I4 must be discarded

F 3 X

F k D k Ek

F k+ 1 D k+ 1

I 3

I k

I k+ 1

W k

E k+ 1

Only I3 is discarded

Instruction Hazard: Unconditional Branch

Solution to Instruction Hazard

• Instruction Queue: The interstage buffer between

Fetch and Decode units can keep multiple instructions.

– Fetch unit gets and deposits one instruction at a time.

– Decode unit consumes one instruction at a time.


Instruction queue

EExecute

operation

WWriteresults

DDecode

instruction

FFetch

instruction

Interstage buffers

F4

W3E 3

F2 D2 E 2 W2

F3 D3

E 4D4 W4F4

• F4, F5, F6, Fk, and Fk+1, are delayed.

• I1, I2, I3, I4, and Ik cannot complete in successive cycles.


Example: Without Instruction Queue

F1 D1 E 1 E 1 E 1 W1

I5 (Branch to Ik)

I1

1 2 3 4 5 6 7 8 9Clock cycle

I2

I3

I4

I6

Ik

Ik+ 1

10Time

XF6

Fk D k Ek

Fk+ 1 D k+ 1

Wk

E k+ 1

11 12

Instruction 1 takes 3

Execute cycles (i.e., 2-

cycle stall).

Instruction 4 is delayed.

Instruction 5 is a branch .

Instruction 6 is discarded.

F5 D5

Since there is no

instruction queue!

• I6 is still being discarded, but the instruction queue can avoid

delaying F4, F5, F6, Fk, and Fk+1 if the queue is not empty.

• I1, I2, I3, I4, and Ik can complete in successive cycles.CSCI2510 Lec11: Pipelining 29

Example: With Instruction Queue

F1 D1 E 1 E 1 E 1 W1

I5 (Branch to Ik)

I1

1 2 3 4 5 6 7 8 9Clock cycle

I2

I3

I4

I6

Ik

Ik+ 1

10

1Queue length 1 1 12 3 2 1 1 1

Time

X

F4

W3E 3

F2 D2 E 2 W2

F3 D3

E 4D4 W4

F5

F6

Fk D k Ek

Fk+ 1 D k+ 1

Wk

E k+ 1

Keep fetching

D5

Instruction 1 takes 3

Execute cycles (i.e., 2-

cycle stall),

The queue length rises to

3 before cycle 6.

Instruction 5 is a branch .

Instruction 6 is discarded,

after taking Branch.

The queue length drops to

1 before cycle 8.

Without vs With Instruction Queue

• With instruction queue, the branch instruction does

not increase the overall execution time (if the queue

is not empty).

– Since instructions can complete in successive clock cycles.

• Branch address is computed in parallel with other

instructions, so no cycles lost due to branch.

– This is called branch folding.

• Instruction queue is also possible to hide the effect of

cache miss (if the queue is not empty).


Class Exercise 11.3

• Please show how instruction queue can hide the

effect of cache miss (three cycles) caused by F4.


F1 D1 E 1 E 1 E 1 W1

W3E 3

I1

F2 D2

1 2 3 4 5 6 7 8 9Clock cycle

E 2 W2

F3 D3

I2

I3

I4

10Time

11 12

F1 D1 E 1 E 1 E 1 W1

W3E 3

I1

F2 D2 E 2 W2

F3 D3

I2

I3

I4

1 2 3 4 5 6 7 8 9Clock cycle 10Time

11 12

1 1 1Queue length

Without

Instruction

Queue

With

Instruction

Queue

All intermediate instructionsmust be discarded …

• Branch folding is not working for conditional branches.

• Conditional branches may result in added hazard.

– Since the condition is based on the preceding instruction.

• Example:


Instruction Hazard: Conditional Branch

Add

LOOP Shift_left R1

Decrement

Branch=0

R2

LOOP

NEXT R1,R3

R2 is used as the branch condition.

We need to wait for R2 to determine whether to perform the conditional branching.

F 1 D1 E 1 W1

I 2 (Decrement)

I 1


F3 D3I 3

Time

F 2 D2 E 2 W2

(Shift)

(Branch if R2 = 0) D3-R2

F k D k EkI k W k

8 9 10

LOOP

Solution 1) Delayed Branch (1/2)

• The location following a

branch instruction is

called a branch delay slot.

• Delayed branching can

minimize the penalty by

– Placing useful instructions

in branch delay slots, and

– Internally re-ordering the

instructions.


Add

LOOP Shift_left R1

Decrement

Branch=0

R2

LOOP

NEXT R1,R3

(a) Original program loop

Add

LOOP

Shift_left R1

Decrement

Branch=0

R2

LOOP

NEXT R1,R3

(b) Internally Re-ordered instructions(actual program logic NOT affected)

Branch Delay Slot

Solution 1) Delayed Branch (2/2)

• Delayed branching can minimize the branch penalty.


Instruction

1 2 3 4 5 6 7 8Clock cycle Time

F ENEXT: Add (Branch not taken) WD

9 10

ALU

Result

Forwarding

ALU

Result

Forwarding

F D

F Daddr

F E

Decrement

Branch=0?

Shift (delay slot)

E

W

W

D

(get branch address)

F E

F

F E

Decrement (Branch is taken)

Branch=0?

Shift (delay slot)

W

W

D

Daddr

D

(get branch address)

Solution 2) Branch Prediction (1/2)


F1

F2

I1 (Compare)

I2 (Branch>0)

D1 E1 W1

Instruction

E2

Clock cycle 1 2 3 4 5 6

D 2 / P2

Time

I3 F3 D 3 X(Branch Delay Slot)

F4

Fk D k

XI4

Ik

Incorrect Prediction

Fk D kIk

Correct Prediction

• Attempt to predict

whether conditional

branch will take place.

– Delayed branch can

be applied together.

• Branch Prediction:

– If we get it right: no

lost cycles.

• Registers and memory

cannot be updated until

we know we got it right.

– If we get it wrong, just

cancel the instructions.

– Branch prediction can

be dynamic or static.

Solution 2) Branch Prediction (2/2)

• Static Branch Prediction

– The same choice is used every time the conditional branch

is encountered.

– For example, a branch instruction at the end of a loop

causes a branch to the start of the loop for every pass

through the loop except the last one.

• It is helpful to assume this branch will be taken under this case.

– A flexible approach is to have the compiler decide.

• Dynamic Branch Prediction

– The choice is influenced by the past behavior.

– For example, a simple prediction is to use the result of the

most recent execution of the branch instruction.


Outline



– Data Hazard





Structural Hazard

• A structural hazard is the situation when two

instructions require the use of a hardware resource at

the same time.

• The most common case is in accessing to memory.

– Case 1: One instruction is accessing memory during the

Execute or Write stage; while another is being fetched.

– Solution 1: Many processors use separate instruction and

data caches to avoid this delay.

– Case 2: Another example is when two instructions require

access to the register file at the same time.

– Solution 2: Let the register file have more input/output ports.

• In general, the structural hazard can be avoided by

providing sufficient hardware resources ($$$).CSCI2510 Lec11: Pipelining 39

Class Exercise 11.4

• What is the cause of the following structure hazard?


Outline



– Data Hazard





Superscalar Operation (1/2)

• Superscalar: Execute multiple instructions at any

time via multiple processing units (i.e., we can

execute more than one instruction per cycle)


W : Writeresults

Decode /Dispatch

unit

Instruction queue

Floating-pointunit

Integerunit

F : Instructionfetch unit

Fetch two instructionsat a time

Decode two instructionsat a time

I1 (FracAdd)

Instruction

Clock cycle 1 2 3 4 5 6

Time

F1 D1 E1A E1B E1C W1

I2 (Add) F2 D2 E2 W2

Superscalar Operation (2/2)

• Superscalar operation may result in out-of-order

execution, and cause data consistency issue.

– In our previous example, I1 and I2 are dispatched in the

same order as they appear.

– However, their execution is completed out of order.

– To guarantee a consistent state when out-of-order

execution occur, the results of the execution of instructions

must be written in program order strictly .

• The out-of-order execution is also a common

technique to make use of instruction cycles by re-

ordering instructions.

– E.g., Delayed branching reorders the instructions to

minimize the branch penalty.CSCI2510 Lec11: Pipelining 44

Out-of-Order Execution

R1 mem[r0] /* Instruction 1 */

R2 R1 + R2 /* Instruction 2 */

R5 R5 + 1 /* Instruction 3 */

R6 R6 – R3 /* Instruction 4 */

• Instruction 1 results in a cache miss, and a cache

miss can stall entire processor for 20-30 cycles.

• Instruction 2 cannot be executed since it needs R1.

• In instruction queue, look ahead and find instructions

3 and 4 to execute first (reordering).

R1 mem[r0] /* Instruction 1 */

R5 R5 + 1 /* Instruction 3 */

R6 R6 – R3 /* Instruction 4 */

R2 R1 + R2 /* Instruction 2 */


Summary



– Data Hazard





Documents

CSCI2510 Computer Organization Lecture 11: Pipelining