43
CSCI2510 Computer Organization Lecture 12: Pipelining Ming-Chang YANG [email protected] Reading: Chap. 8 (5 th Ed.)

CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

  • Upload
    others

  • View
    17

  • Download
    0

Embed Size (px)

Citation preview

Page 1: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

CSCI2510 Computer Organization

Lecture 12: Pipelining

Ming-Chang YANG

[email protected]

Reading: Chap. 8 (5th Ed.)

Page 2: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Why We Need Pipelining?

• Real-life example:

Four loads of

laundry that need

to be washed,

dried, and folded.

– Washing: 30 min

– Drying: 40 min

– Folding: 20 min

• Without pipeline:

– 360 min in total

• With pipeline:

– 210 min in total!

CSCI2510 Lec12: Pipelining 2

https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/pipelining/index.html

Page 3: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Outline

• Sequential Execution vs Pipelining

• Pipeline Stall: Hazard

– Data Hazard

– Instruction Hazard

– Structural Hazard

• Superscalar Operation

CSCI2510 Lec12: Pipelining 3

Page 4: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Sequential Execution

• The processor fetches and executes instructions, one

after the other.

– Fi: Fetch steps for instruction Ii

– Ei: Execute steps for instruction Ii

• Execution of a program consists of a sequential

sequence of fetch and execute steps:

• How to improve the speed of execution?

– Use faster technologies to build CPU and memory ($$$).

– Arrange hardware to do multiple operations at a time ($).CSCI2510 Lec12: Pipelining 4

F1 E1 F2 E2 F3 E3

I1 I2 I3

Time

Page 5: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Separate HW & Interstage Buffer

• Consider a computer having two separate hardware

units:

– One hardware unit is for fetching instructions.

– The other hardware unit is for executing instructions.

• Interstage Buffer: Deposit the fetched instruction.

– Execution unit executes the deposited instruction.

– Fetch unit fetches the next instruction at the same time.

CSCI2510 Lec12: Pipelining 5

InstructionFetchUnit

ExecutionUnit

Interstage buffer

Instruction

Instr

uction

Page 6: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

• Assume the computer is controlled by a clock.

– The fetch and execute steps of any instruction can be

completed in one clock cycle.

• Fetch and execute units form a two-stage pipeline:

– Both units are kept busy all the time.

– An interstage buffer is needed to hold the instruction.

CSCI2510 Lec12: Pipelining 6

Basic Idea of Instruction Pipelining (1/2)

F1 E 1

F2 E 2

F3 E 3

I 1

I 2

I 3

Instruction

Clock cycle 1 2 3 4 Time

Page 7: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

• Parallelism is increased by overlapping the fetch and

execute steps.

– If executions sustain for a long time, the completion rate of

a two-stage pipelining will be twice.

• More is better? How about 4-stage pipeline?

– F: Fetch instruction from memory

– D: Decode instruction and fetch source operands

– E: Execute instruction

– W: Write the result

CSCI2510 Lec12: Pipelining 7

Basic Idea of Instruction Pipelining (2/2)

Page 8: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

F

Fetch

instruction

D

Decode

instruction

4-Stage Pipeline (1/2)

CSCI2510 Lec12: Pipelining 8

F4I4

F1

F2

F3

I1

I2

I3

D1

D2

D3

D4

E1

E2

E3

E4

W1

W2

W3

W4

Instruction

Clock cycle 1 2 3 4 5 6 7

E

Execute

operation

W

Write

results

Interstage buffers

B1 B2 B3

Time

Page 9: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Class Exercise 12.1

• During clock cycle 4, what is the information hold by

the three interstage buffers (i.e., B1, B2, and B3)

respectivley?

CSCI2510 Lec12: Pipelining 9

Student ID:

Name:

Date:

F

Fetch

instruction

D

Decode

instruction

F4I4

F1

F2

F3

I1

I2

I3

D1

D2

D3

D4

E1

E2

E3

E4

W1

W2

W3

W4

Clock cycle 1 2 3 4 5 6 7

E

Execute

operation

W

Write

results

B1 B2 B3

Time

Page 10: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

4-Stage Pipeline (2/2)

• The four hardware units perform their tasks

simultaneously without interfering others.

– The required information is passed from one unit to the

next through a interstage buffer.

• Each stage should be roughly the same maximum

clock period.

– Why? A unit that completes its task early is idle for the

remainder of the clock period.

• Question: What is the ideal speedup of an N-stage pipeline compared to the sequential execution?

CSCI2510 Lec12: Pipelining 11

Page 11: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Outline

• Sequential Execution vs Pipelining

• Pipeline Stall: Hazard

– Data Hazard

– Instruction Hazard

– Structural Hazard

• Superscalar Operation

CSCI2510 Lec12: Pipelining 12

Page 12: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Reality: Pipeline may Stall

• If a pipeline stage requires more than 1 cycle, others

have to wait (pipeline stalled)

– E.g. E2 requires three cycles to complete

CSCI2510 Lec12: Pipelining 13

F1

F2

F3

I1

I2

I3

E1

E2

E3

D1

D 2

D 3

W 1

W 2

W 3

Instruction

F4 D 4I4

Clock cycle 1 2 3 4 5 6 7 8 9

E4

F5I5 D 5

Time

E5

W 4

In cycles 5 and 6: Write, Decode and

Fetch units must wait and do nothing …

Page 13: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Stall & Hazard

• Hazard: Any condition that causes pipeline to stall.

• Another example: A cache miss occurs in F2:

Figure: Instruction execution steps in successive clock cycles.

Figure: Statuses of processor stages in successive clock cycles.

CSCI2510 Lec12: Pipelining 14

F1

F2

F3

I1

I2

I3

D1

D2

D3

E1

E2

E3

W1

W2

W3

Instruction

1 2 3 4 5 6 7 8 9Clock cycle

Time

1 2 3 4 5 6 7 8Clock cycle

Stage

F: Fetch

D: Decode

E: Execute

W: Write

F1 F2 F3

D1 D2 D3idle idle idle

E1 E2 E3idle idle idle

W1 W2idle idle idle

9

W3

F2 F2 F2

Time

Page 14: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Types of Hazards

• Data Hazard

– Either the source or the destination operands of an

instruction are not available when required.

• Instruction Hazard

– A delay in the availability of an instruction (this may

be a result of a miss in the cache).

• Structural Hazard

– Two instructions require the use of a given

hardware resource at the same time.

CSCI2510 Lec12: Pipelining 15

Page 15: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Outline

• Sequential Execution vs Pipelining

• Pipeline Stall: Hazard

– Data Hazard

– Instruction Hazard

– Structural Hazard

• Superscalar Operation

CSCI2510 Lec12: Pipelining 16

Page 16: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Data Hazard

CSCI2510 Lec12: Pipelining 17

I1 (Mul)

I2 (Add)

I3

Instruction

1 2 3 4 5 6 7 8 9Clock cycle

I4

F1

F2

F3

D1

D3

E1

E3

E2

W3

W1

D2-A W2

F4 D4 E4 W4

D2

Time

Pipeline is stalled for two cycles.

I1: A = 3 * A;

I2: B = 4 + A;

D: Decode and fetch

source operands

• A data hazard is a situation in which the pipeline is

stalled because the operands are delayed.

• Example:

– Dependent operations must be performed sequentially to

ensure the data consistency.

Page 17: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Class Exercise 12.2

• Please specify whether we will encounter data

hazards for the following instructions.

CSCI2510 Lec12: Pipelining 18

I1: A = 5 * C;

I2: B = 20 + C;

I1: C = A * B;

I2: E = C + D;

Page 18: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Software Solution to Data Hazard

• The compiler detects and introduces two-cycle delay

by inserting NOP (No-operation) instructions.

– Advantage: Simpler hardware, less cost

– Disadvantage: Larger code size, less flexibility, and

reduced performance

CSCI2510 Lec12: Pipelining 20

F1

F2

I1 (Mul)

I2 (Add)

D1 E1

E2

Instruction

1 2 3 4 5 6 7 8 9Clock cycle

W1

W2D2

Time

NOP

NOP

I1: A = 3 * A;

I2: B = 4 + A;

Question: Do we really avoid the pipline stalling?

Page 19: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Hardware Solution to Data Hazard (1/2)

• The data hazard arises because I2 is waiting for data

to be written in the register A.

• In fact, the result of I1 is available at the output of ALU.

• Delay is reduced if the result can be forwarded to E2.

CSCI2510 Lec12: Pipelining 21

F1

F2

F3

I1 (Mul)

I2 (Add)

I3

D1

D3

E1

E3

E2

W3

Instruction

1 2 3 4 5 6 7 8 9Clock cycle

W1

D2-A W2

F4 D4 E4 W4I4

D2

Time

D: Decode and fetch

source operands

I1: A = 3 * A;

I2: B = 4 + A;

Result of I1 is available here!

Page 20: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Hardware Solution to Data Hazard (2/2)

• Operand Forwarding: By introducing the forwarding

path, the execution of I2 can proceed without stalling.

– Disadvantage: Additional hardware cost

CSCI2510 Lec12: Pipelining 22

E: Execute(ALU)

W: Write(Register file)

SRC1,SRC2 RSLT

(b) Source and result registers

Register

file

SRC1 SRC2

RSLT

Destination

Source 1

Source 2

(a) Datapath (3 buses)

ALU

Port A Port B

Page 21: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Outline

• Sequential Execution vs Pipelining

• Pipeline Stall: Hazard

– Data Hazard

– Instruction Hazard

– Structural Hazard

• Superscalar Operation

CSCI2510 Lec12: Pipelining 23

Page 22: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Instruction Hazard

• Recall: The purpose of the instruction fetch unit is to

supply the execution units with instructions.

– F: Fetch instruction from memory

– D: Decode instruction and fetch source operands

– E: Execute instruction

– W: Write the result

• Instruction Hazard: The cases cause the pipeline to

stall, because of the delay of instructions.

1) Cache miss

2) Brach instruction (both unconditional and conditional)CSCI2510 Lec12: Pipelining 24

F

Fetch

instruction

D

Decode

instruction

E

Execute

operation

W

Write

results

B1 B2 B3

Page 23: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Instruction Hazard: Cache Miss

• The effect of a cache miss on the pipelined operation

is as follows:

– I1 is fetched from the cache in cycle 1.

– The fetch operation F2 for I2 results in a cache miss.

• The instruction fetch unit must suspend any further fetch requests until

F2 is completed.

CSCI2510 Lec12: Pipelining 25

F1

F2

I1

I2

I3

D1

D2

E1

E2

W1

W2

F3 D3 E3 W3

Instruction

1 2 3 4 5 6 7 8 9Clock cycleTime

F3

Postponed

Page 24: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

• Branches may also cause the pipeline to stall.

– Branch Penalty: The time lost because of a branch inst.

– Branch penalty can be reduced by computing the branch

address earlier in Decode stage (rather than Execute stage)

• However, it still results in 1 cycle branch penalty to the pipeline.

CSCI2510 Lec12: Pipelining 26

F1 D1 E1 W1

I2 (Branch to Ik)

I1

1 2 3 4 5 6 7Clock cycle

F2 D2

Branch address computed in Execute stage

Branch Penalty: 2 clock cycles

E2

8

Time

F1 D1 E1 W1

I2 (Branch to Ik)

I1

1 2 3 4 5 6 7Clock cycle

F2 D2

Branch address computed in Decode stage

Branch Penalty: 1 clock cycle

Time

Fk Dk Ek

Fk+1 Dk+1

Ik

Ik+1

Wk

Ek+1

XF3I3

D3

F4 XI4

I3 and I4 must be

discarded

F3 X

Fk Dk Ek

Fk+ 1 Dk+ 1

I3

Ik

Ik+ 1

Wk

Ek+ 1

Only I3 is

discarded

Instruction Hazard: Unconditional Branch

Page 25: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Solution to Instruction Hazard

• Instruction Queue: The interstage buffer between

Fetch and Decode units can keep multiple instructions.

– Fetch unit gets and deposits one instruction at a time.

– Decode unit consumes one instruction at a time.

CSCI2510 Lec12: Pipelining 27

Instruction queue

E

Execute

operation

W

Write

results

D

Decode

instruction

F

Fetch

instruction

Interstage buffers

Page 26: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

F4

W3E3

F2 D2 E2 W2

F3 D3

E4D4 W4F4

• F4, F5, F6, Fk, and Fk+1, are delayed.

• I1, I2, I3, I4, and Ik cannot complete in successive cycles.

CSCI2510 Lec12: Pipelining 28

Example: Without Instruction Queue

F1 D1 E1 E1 E1 W1

I5 (Branch to Ik)

I1

1 2 3 4 5 6 7 8 9Clock cycle

I2

I3

I4

I6

Ik

Ik+1

10Time

XF6

Fk Dk Ek

Fk+1 Dk+1

Wk

Ek+1

11 12

Instruction 1 takes 3

Execute cycles (i.e., 2-

cycle stall).

Instruction 4 is delayed.

Instruction 5 is a branch .

Instruction 6 is discarded.

F5 D5

Since there is no

instruction queue!

Page 27: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

• I6 is still being discarded, but the instruction queue can avoid

delaying F4, F5, F6, Fk, and Fk+1 if the queue is not empty.

• I1, I2, I3, I4, and Ik can complete in successive cycles.CSCI2510 Lec12: Pipelining 29

Example: With Instruction Queue

F1 D1 E1 E1 E1 W1

I5 (Branch to Ik)

I1

1 2 3 4 5 6 7 8 9Clock cycle

I2

I3

I4

I6

Ik

Ik+1

10

1Queue length 1 1 12 3 2 1 1 1

Time

X

F4

W3E3

F2 D2 E2 W2

F3 D3

E4D4 W4

F5

F6

Fk Dk Ek

Fk+1 Dk+1

Wk

Ek+1

Keep

fetching

D5

Instruction 1 takes 3

Execute cycles (i.e., 2-

cycle stall),

The queue length rises to

3 before cycle 6.

Instruction 5 is a branch .

Instruction 6 is discarded,

after taking Branch.

The queue length drops to

1 before cycle 8.

Page 28: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Without vs With Instruction Queue

• With instruction queue, the branch instruction does

not increase the overall execution time (if the queue

is not empty).

– Since instructions can complete in successive clock cycles.

• Branch address is computed in parallel with other

instructions, so no cycles lost due to branch.

– This is called branch folding.

• Instruction queue is also possible to hide the effect of

cache miss (if the queue is not empty).

CSCI2510 Lec12: Pipelining 30

Page 29: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Class Exercise 12.3

• Please show how instruction queue can hide the

effect of cache miss (three cycles) caused by F4.

CSCI2510 Lec12: Pipelining 31

F1 D1 E1 E1 E1 W1

W3E3

I1

F2 D2

1 2 3 4 5 6 7 8 9Clock cycle

E2 W2

F3 D3

I2

I3

I4

10Time

11 12

F1 D1 E1 E1 E1 W1

W3E3

I1

F2 D2 E2 W2

F3 D3

I2

I3

I4

1 2 3 4 5 6 7 8 9Clock cycle 10Time

11 12

1 1 1Queue length

Without

Instruction

Queue

With

Instruction

Queue

Page 30: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

All intermediate instructions

must be discarded …

• Branch folding is not working for conditional branches.

• Conditional branches may result in added hazard.

– Since the condition is based on the preceding instruction.

• Example:

CSCI2510 Lec12: Pipelining 33

Instruction Hazard: Conditional Branch

Add

LOOP Shift_left R1

Decrement

Branch=0

R2

LOOP

NEXT R1,R3

R2 is used as the

branch condition.

We need to wait for R2 to

determine whether to perform

the conditional branching.

F1 D1 E1 W1

I2 (Decrement)

I1

1 2 3 4 5 6 7Clock cycle

F3 D3I3

Time

F2 D2 E2 W2

(Shift)

(Branch if R2 = 0) D3-R2

Fk Dk EkIkWk

8 9 10

LOOP

Page 31: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Solution 1) Delayed Branch (1/2)

• The location following a

branch instruction is

called a branch delay slot.

• Delayed branching can

minimize the penalty by

– Placing useful instructions

in branch delay slots, and

– Internally re-ordering the

instructions.

CSCI2510 Lec12: Pipelining 34

Add

LOOP Shift_left R1

Decrement

Branch=0

R2

LOOP

NEXT R1,R3

(a) Original program loop

Add

LOOP

Shift_left R1

Decrement

Branch=0

R2

LOOP

NEXT R1,R3

(b) Internally Re-ordered instructions

(actual program logic NOT affected)

Branch Delay Slot

Page 32: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Solution 1) Delayed Branch (2/2)

• Delayed branching can minimize the branch penalty.

CSCI2510 Lec12: Pipelining 35

Instruction

1 2 3 4 5 6 7 8Clock cycle Time

F ENEXT: Add (Branch not taken) WD

9 10

ALU

Result

Forwarding

ALU

Result

Forwarding

F D

F Daddr

F E

Decrement

Branch=0?

Shift (delay slot)

E

W

W

D

(get branch address)

F E

F

F E

Decrement (Branch is taken)

Branch=0?

Shift (delay slot)

W

W

D

Daddr

D

(get branch address)

Page 33: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Solution 2) Branch Prediction (1/2)

CSCI2510 Lec12: Pipelining 36

F1

F2

I1 (Compare)

I2 (Branch>0)

D1 E1 W1

Instruction

E2

Clock cycle 1 2 3 4 5 6

D2 / P2

Time

I3 F3 D3 X(Branch Delay Slot)

F4

Fk Dk

XI4

Ik

Incorrect Prediction

Fk DkIk

Correct Prediction

• Attempt to predict

whether conditional

branch will take place.

– Delayed branch can

be applied together.

• Branch Prediction:

– If we get it right: no

lost cycles.

• Registers and memory

cannot be updated until

we know we got it right.

– If we get it wrong, just

cancel the instructions.

– Branch prediction can

be dynamic or static.

Page 34: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Solution 2) Branch Prediction (2/2)

• Static Branch Prediction

– The same choice is used every time the conditional branch

is encountered.

– For example, a branch instruction at the end of a loop

causes a branch to the start of the loop for every pass

through the loop except the last one.

• It is helpful to assume this branch will be taken under this case.

– A flexible approach is to have the compiler decide.

• Dynamic Branch Prediction

– The choice is influenced by the past behavior.

– For example, a simple prediction is to use the result of the

most recent execution of the branch instruction.

CSCI2510 Lec12: Pipelining 37

Page 35: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Outline

• Sequential Execution vs Pipelining

• Pipeline Stall: Hazard

– Data Hazard

– Instruction Hazard

– Structural Hazard

• Superscalar Operation

CSCI2510 Lec12: Pipelining 38

Page 36: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Structural Hazard

• A structural hazard is the situation when two

instructions require the use of a hardware resource at

the same time.

• The most common case is in accessing to memory.

– Case 1: One instruction is accessing memory during the

Execute or Write stage; while another is being fetched.

– Solution 1: Many processors use separate instruction and

data caches to avoid this delay.

– Case 2: Another example is when two instructions require

access to the register file at the same time.

– Solution 2: Let the register file have more input/output ports.

• In general, the structural hazard can be avoided by

providing sufficient hardware resources ($$$).CSCI2510 Lec12: Pipelining 39

Page 37: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Class Exercise 12.4

• What is the cause of the following structure hazard?

CSCI2510 Lec12: Pipelining 40

Page 38: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Outline

• Sequential Execution vs Pipelining

• Pipeline Stall: Hazard

– Data Hazard

– Instruction Hazard

– Structural Hazard

• Superscalar Operation

CSCI2510 Lec12: Pipelining 42

Page 39: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Superscalar Operation (1/2)

• Superscalar: Execute multiple instructions at any

time via multiple processing units (i.e., we can

execute more than one instruction per cycle)

CSCI2510 Lec12: Pipelining 43

W : Writeresults

Decode /

Dispatchunit

Instruction

queue

Floating-pointunit

Integerunit

F : Instructionfetch unit

Fetch two instructions

at a time

Decode two

instructions

at a time

I1 (FracAdd)

Instruction

Clock cycle 1 2 3 4 5 6

Time

F1 D1 E1A E1B E1C W1

I2 (Add) F2 D2 E2 W2

Page 40: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Superscalar Operation (2/2)

• Superscalar operation may result in out-of-order

execution, and cause data consistency issue.

– In our previous example, I1 and I2 are dispatched in the

same order as they appear.

– However, their execution is completed out of order.

– To guarantee a consistent state when out-of-order

execution occur, the results of the execution of instructions

must be written in program order strictly .

• The out-of-order execution is also a common

technique to make use of instruction cycles by re-

ordering instructions.

– E.g., Delayed branching reorders the instructions to

minimize the branch penalty.CSCI2510 Lec12: Pipelining 44

Page 41: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Out-of-Order Execution

R1 mem[r0] /* Instruction 1 */

R2 R1 + R2 /* Instruction 2 */

R5 R5 + 1 /* Instruction 3 */

R6 R6 – R3 /* Instruction 4 */

• Instruction 1 results in a cache miss, and a cache

miss can stall entire processor for 20-30 cycles.

• Instruction 2 cannot be executed since it needs R1.

• In instruction queue, look ahead and find instructions

3 and 4 to execute first (reordering).

R1 mem[r0] /* Instruction 1 */

R5 R5 + 1 /* Instruction 3 */

R6 R6 – R3 /* Instruction 4 */

R2 R1 + R2 /* Instruction 2 */

CSCI2510 Lec12: Pipelining 45

Page 42: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

• I6 is still being discarded, but the instruction queue can avoid

delaying F4, F5, F6, Fk, and Fk+1 if the queue is not empty.

• I1, I2, I3, I4, and Ik can complete in successive cycles.CSCI2510 Lec12: Pipelining 46

Recall: With Instruction Queue

F1 D1 E1 E1 E1 W1

I5 (Branch to Ik)

I1

1 2 3 4 5 6 7 8 9Clock cycle

I2

I3

I4

I6

Ik

Ik+1

10

1Queue length 1 1 12 3 2 1 1 1

Time

X

F4

W3E3

F2 D2 E2 W2

F3 D3

E4D4 W4

F5

F6

Fk Dk Ek

Fk+1 Dk+1

Wk

Ek+1

Keep

fetching

D5

Instruction 1 takes 3

Execute cycles (i.e., 2-

cycle stall),

The queue length rises to

3 before cycle 6.

Instruction 5 is a branch .

Instruction 6 is discarded,

after taking Branch.

The queue length drops to

1 before cycle 8.

Decode two

instructionsat a time

Page 43: CSCI2510 Computer Organization Lecture 12: Pipeliningmcyang/csci2510/2019F/Lec12... · 2019-12-07 · • Assume the computer is controlled by a clock. –The fetch and execute steps

Summary

• Sequential Execution vs Pipelining

• Pipeline Stall: Hazard

– Data Hazard

– Instruction Hazard

– Structural Hazard

• Superscalar Operation

CSCI2510 Lec12: Pipelining 47