53
55:035 Computer Architecture and Organization Lecture 10

55:035 Computer Architecture and Organization Lecture 10

Embed Size (px)

Citation preview

Page 1: 55:035 Computer Architecture and Organization Lecture 10

55:035 Computer Architecture and Organization

Lecture 10

Page 2: 55:035 Computer Architecture and Organization Lecture 10

Outline Pipelining

Basics Throughput Execution Time Pipeline Registers Program Execution Hazards

Structural Data Control

255:035 Computer Architecture and Organization

Page 3: 55:035 Computer Architecture and Organization Lecture 10

ILP: Instruction Level Parallelism Single-cycle and multi-cycle datapaths execute

one instruction at a time. How can we get better performance? Answer: Execute multiple instruction at a time:

Pipelining – Enhance a multi-cycle datapath to fetch one instruction every cycle.

Parallelism – Fetch multiple instructions every cycle.

355:035 Computer Architecture and Organization

Page 4: 55:035 Computer Architecture and Organization Lecture 10

Automobile Team Assembly

1 car assembled every four hours6 cars per day180 cars per month2,040 cars per year

1 hour

1 hour

1 hour1 hour

455:035 Computer Architecture and Organization

Page 5: 55:035 Computer Architecture and Organization Lecture 10

Automobile Assembly Line

First car assembled in 4 hours (pipeline latency) thereafter 1 car per hour 21 cars on first day, thereafter 24 cars per day 717 cars per month 8,637 cars per year

Task 11 hour

Task 21 hour

Task 31 hour

Task 41 hour

Mecahnical Electrical Painting Testing

555:035 Computer Architecture and Organization

Page 6: 55:035 Computer Architecture and Organization Lecture 10

Throughput: Team Assembly

Mechanical Electrical Painting Testing Mechanical Electrical Painting Testing

Time of assembling one car = n hours

where n is the number of nearly equal subtasks,each requiring 1 unit of time

Throughput = 1/n cars per unit time

Red car completed

Red car started

TimeBlue car started

Blue car completed

655:035 Computer Architecture and Organization

Page 7: 55:035 Computer Architecture and Organization Lecture 10

Throughput: Assembly LineMechanical Electrical Painting Testing

Mechanical Electrical Painting Testing

Mechanical Electrical Painting Testing

Mechanical Electrical Painting Testing

Car 1

Car 2

Car 3

Car 4

.

.

Car 1 complete

Car 2 complete

Time to complete first car = n time units (latency)Cars completed in time T = T – n + 1Throughput = 1- (n - 1)/ T car per unit time

Throughput (assembly line) 1 – (n - 1)/ T n (n – 1)─────────────────── = ──────── = n – ───── → nThroughput (team assembly) 1/n T as T→∞

time

755:035 Computer Architecture and Organization

Page 8: 55:035 Computer Architecture and Organization Lecture 10

Some Features of Assembly Line

Task 11 hour

Task 21 hour

Task 31 hour

Task 41 hour

Mechanical Electrical Painting Testing

Electrical parts delivered (JIT)

Defect found

Stall assembly line to fix the cause of defect

3 cars in the assembly line are suspects, to be removed (flush pipeline)

855:035 Computer Architecture and Organization

Page 9: 55:035 Computer Architecture and Organization Lecture 10

Pipelining in a Computer Divide datapath into nearly equal tasks, to be

performed serially and requiring non-overlapping resources.

Insert registers at task boundaries in the datapath; registers pass the output data from one task as input data to the next task.

Synchronize tasks with a clock having a cycle time that just exceeds the time required by the longest task.

Break each instruction down into a fixed number of tasks so that instructions can be executed in a staggered fashion.

955:035 Computer Architecture and Organization

Page 10: 55:035 Computer Architecture and Organization Lecture 10

Single-Cycle DatapathInstructionInstruction

classclass

Instr.Instr.

fetchfetch

(IF)(IF)

Instr. Instr. DecodeDecode

(also reg. (also reg. file read)file read)

(ID)(ID)

Execution Execution (ALU(ALU

Operation)Operation)

(EX)(EX)

Data Data accessaccess

(MEM)(MEM)

Write Write Back Back (Reg. (Reg.

file file write)write)

(WB)(WB)

Total Total timetime

lwlw 2ns2ns 1ns1ns 2ns2ns 2ns2ns 1ns1ns 8ns8ns

swsw 2ns2ns 1ns1ns 2ns2ns 2ns2ns 8ns8ns

R-formatR-format

add, sub, and, or, sltadd, sub, and, or, slt

2ns2ns 1ns1ns 2ns2ns 1ns1ns 8ns8ns

B-format, beqB-format, beq 2ns2ns 1ns1ns 2ns2ns 8ns8ns

No operation on data; idle time equalizes instruction length to a fixed clock period.

1055:035 Computer Architecture and Organization

Page 11: 55:035 Computer Architecture and Organization Lecture 10

Execution Time: Single-Cycle

IF ID EX MEM WB IF ID EX MEM WB

IF ID EX MEM WB

0 2 4 6 8 10 12 14 16 . .

Time (ns) lw $1, 100($0) lw $2, 200($0) lw $3, 300($0)

Clock cycle time = 8 ns

Total time for executing three lw instructions = 24 ns

1155:035 Computer Architecture and Organization

Page 12: 55:035 Computer Architecture and Organization Lecture 10

Pipelined DatapathInstructionInstruction

classclass

Instr.Instr.

fetchfetch

(IF)(IF)

Instr. Instr. DecodeDecode

(also reg. (also reg. file read)file read)

(ID)(ID)

Execu-Execu-tion (ALUtion (ALU

Opera-Opera-tion)tion)

(EX)(EX)

Data Data accessaccess

(MEM)(MEM)

Write Write Back Back

(Reg. file (Reg. file write)write)

(WB)(WB)

Total Total timetime

lwlw 2ns2ns1ns1ns

2ns2ns2ns2ns 2ns2ns

1ns1ns

2ns2ns10ns10ns

swsw 2ns2ns1ns1ns

2ns2ns2ns2ns 2ns2ns

1ns1ns

2ns2ns10ns10ns

R-format: add, sub, R-format: add, sub, and, or, sltand, or, slt 2ns2ns

1ns1ns

2ns2ns2ns2ns 2ns2ns

1ns1ns

2ns2ns10ns10ns

B-format:B-format:

beqbeq2ns2ns

1ns1ns

2ns2ns2ns2ns 2ns2ns

1ns1ns

2ns2ns10ns10ns

No operation on data; idle time inserted to equalize instruction lengths.

1255:035 Computer Architecture and Organization

Page 13: 55:035 Computer Architecture and Organization Lecture 10

Execution Time: Pipeline

IF ID EX MEM RW

IF ID EX MEM RW

IF ID EX MEM RW

0 2 4 6 8 10 12 14 16 . .

Time (ns) lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

Clock cycle time = 2 ns, four times faster than single-cycle clock

Total time for executing three lw instructions = 14 ns

Single-cycle time 24Performance ratio = ──────────── = ── = 1.7

Pipeline time 14

1355:035 Computer Architecture and Organization

Page 14: 55:035 Computer Architecture and Organization Lecture 10

Pipeline PerformanceClock cycle time = 2 ns

1,003 lw instructions:

Total time for executing 1,003 lw instructions = 2,014 ns

Single-cycle time 8,024Performance ratio = ──────────── = ──── = 3.98

Pipeline time 2,014

10,003 lw instructions:

Performance ratio = 80,024 / 20,014 = 3.998 → Clock cycle ratio (4)

Pipeline performance approaches clock-cycle ratio for long programs.

1455:035 Computer Architecture and Organization

Page 15: 55:035 Computer Architecture and Organization Lecture 10

Single-Cycle Datapath

Instr. mem.PC

Ad

d

Reg

. F

ile

Datamem.

1 m

ux

0

1 m

ux

0

0 m

ux

1

4

1 m

ux

0

Sign ext.

Shift left 2

ALUCont.

CO

NT

RO

L

opcode

MemWriteMemReadA

LU

Branch

zero

0-15

0-5

11-15

16-20

21-25

26-31

AL

U

MemtoReg

ALUOp

ALUSrc

RegDst

RegWrite

IF: Instr. fetch ID: Instr. decode,reg. file read

EX: Execute,address calc.

MEM: mem. access

WB:writeback

1555:035 Computer Architecture and Organization

Page 16: 55:035 Computer Architecture and Organization Lecture 10

Pipelining of RISC Instructions

FetchInstruction

ExamineOpcode

FetchOperands

PerformOperation

StoreResult

Although an instruction takes five clock cycles,one instruction is completed every cycle.

IF ID EX MEM WB

Instruction Decode Execute Memory WriteFetch instruction and Operation Back

Fetch operands to Reg file

1655:035 Computer Architecture and Organization

Page 17: 55:035 Computer Architecture and Organization Lecture 10

Pipeline Registers

Instr. mem.PC

Ad

d

Reg

. F

ile

Datamem.

1 m

ux

0

1 m

ux

0

0 m

ux

1

4

1 m

ux

0

Sign ext.

Shift left 2

ALUCont.

CO

NT

RO

L

opcode

MemWriteMemReadA

LU

Branch

zero

0-15

0-5

11-15

16-20

21-25

26-31

AL

U

MemtoReg

ALUOp

ALUSrc

RegDst

RegWrite

IF/ID ID/EX EX/MEM

MEM/WB

This requires aCONTROL not too different from single-cycle

1755:035 Computer Architecture and Organization

Page 18: 55:035 Computer Architecture and Organization Lecture 10

Pipeline Register Functions Four pipeline registers are added:

Register Register namename Data heldData held

IF/IDIF/ID PC+4, Instruction word (IW)PC+4, Instruction word (IW)

ID/EXID/EX PC+4, R1, R2, IW(0-15) sign ext., IW(11-15)PC+4, R1, R2, IW(0-15) sign ext., IW(11-15)

EX/MEMEX/MEM PC+4, zero, ALUResult, R2, IW(11-15) or IW(16-20)PC+4, zero, ALUResult, R2, IW(11-15) or IW(16-20)

MEM/WBMEM/WB M[ALUResult], ALUResult, IW(11-15) or IW(16-20)M[ALUResult], ALUResult, IW(11-15) or IW(16-20)

1855:035 Computer Architecture and Organization

Page 19: 55:035 Computer Architecture and Organization Lecture 10

Pipelined Datapath

InstrmemPC

Ad

d

Reg

. F

ile

Datamem1

mu

x 0

1 m

ux

0

0 m

ux

1

4

Sgn ext

Shift left 2

opcode

AL

U

zero

0-15

16-20

21-25

26-31

AL

U

IF/ID ID/EX EX/MEM MEM/WB

11-15 for R-type16-20 for I-type lw

1955:035 Computer Architecture and Organization

Page 20: 55:035 Computer Architecture and Organization Lecture 10

Five-Cycle PipelineCC1 CC2 CC3 CC4 CC5

IM

ID, R

EG

. F

ILE

R

EA

D

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/ID

ID/E

X

ME

M/W

B

EX

/ME

M

2055:035 Computer Architecture and Organization

Page 21: 55:035 Computer Architecture and Organization Lecture 10

Add Instruction add $t0, $s1, $s2

Machine instruction word 000000 10001 10010 01000 00000 100000 opcode $s1 $s2 $t0 function

CC1 CC2 CC3 CC4 CC5

IM

ID, R

EG

. F

ILE

R

EA

D

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/ID

ID/E

X

ME

M/W

B

EX

/ME

M

IF ID EX MEM WBread $s1 add write $t0read $s2 $s1+$s2

2155:035 Computer Architecture and Organization

Page 22: 55:035 Computer Architecture and Organization Lecture 10

Pipelined Datapath Executing add

PCR

eg.

Fil

e

1 m

ux

0

1 m

ux

0

0 m

ux

1

4

Sign ext.

Shift left 2

opcode

AL

U

zero

0-15

16-20

21-25

26-31

AL

U

IF/ID ID/EX EX/MEM MEM/WB

s1

$s1

11-15 for R-type16-20 for I-type lw

t0

Instr mem s2 $s2

Ad

d

addr

Datamem

data

2255:035 Computer Architecture and Organization

Page 23: 55:035 Computer Architecture and Organization Lecture 10

Load Instruction lw $t0, 1200 ($t1) 100011 01001 01000 0001 0010 0000 0000 opcode $t1 $t0 1200

CC1 CC2 CC3 CC4 CC5

IM

ID, R

EG

. F

ILE

R

EA

D

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/ID

ID/E

X

ME

M/W

B

EX

/ME

M

IF ID EX MEM WBread $t1 add read write $t0sign ext $t1+1200 M[addr]1200

2355:035 Computer Architecture and Organization

Page 24: 55:035 Computer Architecture and Organization Lecture 10

Pipelined Datapath Executing lw

PC

Ad

d

Reg

. F

ile

addr

Datamem

data1

mu

x 0

1 m

ux

0

0 m

ux

1

4

Sign ext.

Shift left 2

opcode

AL

U

zero

0-15

16-20

21-25

26-31

AL

U

IF/ID ID/EX EX/MEM MEM/WB

t1

1200

$t1

11-15 for R-type16-20 for I-type lw

t0

Instrmem

2455:035 Computer Architecture and Organization

Page 25: 55:035 Computer Architecture and Organization Lecture 10

Store Instruction sw $t0, 1200 ($t1) 101011 01001 01000 0001 0010 0000 0000 opcode $t1 $t0 1200

CC1 CC2 CC3 CC4 CC5

IM

ID, R

EG

. F

ILE

R

EA

D

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/ID

ID/E

X

ME

M/W

B

EX

/ME

M

IF ID EX MEM WBread $t1 add write sign ext $t1+1200 M[addr]1200 (addr) ← $t0

2555:035 Computer Architecture and Organization

Page 26: 55:035 Computer Architecture and Organization Lecture 10

Pipelined Datapath Executing sw

PC

Reg

. File

addrDatamem

data

1 m

ux

0

1 m

ux

0

0 m

ux

1

4

Sign ext.

Shift left 2

opcode

AL

U

zero

0-15

16-20

21-25

26-31

AL

U

IF/ID ID/EX EX/MEM MEM/WB

t1

1200

$t1

11-15 for R-type16-20 for I-type lw

Instr mem t0 $t0

Ad

d

2655:035 Computer Architecture and Organization

Page 27: 55:035 Computer Architecture and Organization Lecture 10

Executing a Program

Consider a five-instruction segment:

lw $10, 20($1) sub $11, $2, $3 add $12, $3, $4 lw $13, 24($1) add $14, $5, $6

2755:035 Computer Architecture and Organization

Page 28: 55:035 Computer Architecture and Organization Lecture 10

Program ExecutiontimeCC1 CC2 CC3 CC4 CC5

IM

ID,

RE

G.

FIL

E

RE

AD

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/I

D

ID/E

X

ME

M/W

B

EX

/ME

M

IM

ID,

RE

G.

FIL

E

RE

AD

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/I

D

ID/E

X

ME

M/W

B

EX

/ME

M

IM

ID,

RE

G.

FIL

E

RE

AD

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/I

D

ID/E

X

ME

M/W

B

EX

/ME

M

IM

ID,

RE

G.

FIL

E

RE

AD

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/I

D

ID/E

X

ME

M/W

B

EX

/ME

M

IM

ID,

RE

G.

FIL

E

RE

AD

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/I

D

ID/E

X

ME

M/W

B

EX

/ME

M

lw $10, 20($1)

sub $11, $2, $3

add $12, $3, $4

lw $13, 24($1)

add $14, $5, $6

Pro

gra

m i

nst

ruct

ion

s

2855:035 Computer Architecture and Organization

Page 29: 55:035 Computer Architecture and Organization Lecture 10

CC5

InstrmemPC

Ad

d

Reg

. F

ile

Datamem.1

mu

x 0

1 m

ux

0

0 m

ux

1

4

Sign ext.

Shift left 2

opcode

AL

U

zero

0-15

16-20

21-25

26-31

AL

U

IF/ID ID/EX EX/MEM MEM/WB

11-15 for R-type16-20 for I-type lw

IF: add $14, $5, $6 ID: lw $13, 24($1) EX: add $12, $3, $4MEM: sub $11, $2, $3

WB: lw $10, 20($1)

2955:035 Computer Architecture and Organization

Page 30: 55:035 Computer Architecture and Organization Lecture 10

Advantages of Pipeline After the fifth cycle (CC5), one instruction is completed

each cycle; CPI ≈ 1, neglecting the initial pipeline latency of 5 cycles.

Pipeline latency is defined as the number of stages in the pipeline, or

The number of clock cycles after which the first instruction is completed.

The clock cycle time is about four times shorter than that of single-cycle datapath and about the same as that of multicycle datapath.

For multicycle datapath, CPI = 3. …. So, pipelined execution is faster, but . . .

3055:035 Computer Architecture and Organization

Page 31: 55:035 Computer Architecture and Organization Lecture 10

Pipeline Hazards Definition: Hazard in a pipeline is a situation in

which the next instruction cannot complete execution one clock cycle after completion of the present instruction.

Three types of hazards: Structural hazard (resource conflict) Data hazard Control hazard

3155:035 Computer Architecture and Organization

Page 32: 55:035 Computer Architecture and Organization Lecture 10

Structural Hazard Two instructions cannot execute due to a

resource conflict. Example: Consider a computer with a common

data and instruction memory. The fourth cycle of a lw instruction requires memory access (memory read) and at the same time the first cycle of the fourth instruction requires instruction fetch (memory read). This will cause a memory resource conflict.

3255:035 Computer Architecture and Organization

Page 33: 55:035 Computer Architecture and Organization Lecture 10

Example of Structural HazardCC1 CC2 CC3 CC4 CC5

IM/D

M

ID,

RE

G.

FIL

E

RE

AD

AL

U

IM/D

M R

EG

. F

ILE

W

RIT

E

IF/I

D

ID/E

X

ME

M/W

B

EX

/ME

M

IM/D

M

ID,

RE

G.

FIL

E

RE

AD

AL

U

IM/D

M R

EG

. F

ILE

W

RIT

E

IF/I

D

ID/E

X

ME

M/W

B

EX

/ME

M

IM/D

M

ID,

RE

G.

FIL

E

RE

AD

AL

U

IM/D

M R

EG

. F

ILE

W

RIT

E

IF/I

D

ID/E

X

ME

M/W

B

EX

/ME

M

IM/D

M

ID,

RE

G.

FIL

E

RE

AD

AL

U

IM/D

M R

EG

. F

ILE

W

RIT

E

IF/I

D

ID/E

X

ME

M/W

B

EX

/ME

M

lw $10, 20($1)

sub $11, $2, $3

add $12, $3, $4

lw $13, 24($1)

time

Pro

gra

m i

ns

tru

cti

on

s

Common data and instr. Mem.

Needed by two instructions

3355:035 Computer Architecture and Organization

Page 34: 55:035 Computer Architecture and Organization Lecture 10

Possible Remedies for Structural Hazards

Provide duplicate hardware resources in datapath.

Control unit or compiler can insert delays (no-op cycles) between instructions. This is known as pipeline stall or bubble.

3455:035 Computer Architecture and Organization

Page 35: 55:035 Computer Architecture and Organization Lecture 10

Stall (Bubble) for Structural Hazard

CC1 CC2 CC3 CC4 CC5

IM/D

M

ID,

RE

G.

FIL

E

RE

AD

AL

U

IM/D

M

R

EG

. F

ILE

W

RIT

E

IF/I

D

ID/E

X

ME

M/W

B

EX

/ME

M

IM/D

M

ID,

RE

G.

FIL

E

RE

AD

AL

U

IM/D

M R

EG

. F

ILE

W

RIT

E

IF/I

D

ID/E

X

ME

M/W

B

EX

/ME

M

IM/D

M

ID,

RE

G.

FIL

E

RE

AD

AL

U

IM/D

M R

EG

. F

ILE

W

RIT

E

IF/I

D

ID/E

X

ME

M/W

B

EX

/ME

M

IM/D

M

ID,

RE

G.

FIL

E

RE

AD

AL

U

IM/D

M R

EG

. F

ILE

W

RIT

E

IF/I

D

ID/E

X

ME

M/W

B

EX

/ME

M

lw $10, 20($1)

sub $11, $2, $3

add $12, $3, $4

lw $13, 24($1)

time

Pro

gra

m i

ns

tru

cti

on

s

Stall (bubble)

3555:035 Computer Architecture and Organization

Page 36: 55:035 Computer Architecture and Organization Lecture 10

Data Hazard Data hazard means that an instruction cannot be

completed because the needed data, being generated by another instruction in the pipeline, is not available.

Example: consider two instructions: add $s0, $t0, $t1 sub $t2, $s0, $t3 # needs $s0

3655:035 Computer Architecture and Organization

Page 37: 55:035 Computer Architecture and Organization Lecture 10

Example of Data HazardCC1 CC2 CC3 CC4 CC5

IM

ID, R

EG

. F

ILE

R

EA

D

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/ID

ID/E

X

ME

M/W

B

EX

/ME

M

IM

ID, R

EG

. F

ILE

R

EA

D

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/ID

ID/E

X

ME

M/W

B

EX

/ME

M

add $s0, $t0, $t1

sub $t2, $s0, $t3

time

Pro

gra

m in

stru

ctio

ns

Write s0 in CC5

Read s0 and t3 in CC3

We need to read s0 from reg file in cycle 3But s0 will not be written in reg file until cycle 5

However, s0 will only be used in cycle 4And it is available at the end of cycle 3

3755:035 Computer Architecture and Organization

Page 38: 55:035 Computer Architecture and Organization Lecture 10

Forwarding or Bypassing Output of a resource used by an instruction is

forwarded to the input of some resource being used by another instruction.

Forwarding can eliminate some, but not all, data hazards.

3855:035 Computer Architecture and Organization

Page 39: 55:035 Computer Architecture and Organization Lecture 10

Forwarding for Data HazardCC1 CC2 CC3 CC4 CC5

IM

ID, R

EG

. F

ILE

R

EA

D

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/ID

ID/E

X

ME

M/W

B

EX

/ME

M

IM

ID, R

EG

. F

ILE

R

EA

D

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/ID

ID/E

X

ME

M/W

B

EX

/ME

M

add $s0, $t0, $t1

sub $t2, $s0, $t3

time

Pro

gra

m in

stru

ctio

ns

Write s0 in CC5

Read s0 and t3 in CC3

Forwarding

3955:035 Computer Architecture and Organization

Page 40: 55:035 Computer Architecture and Organization Lecture 10

Forwarding Unit Hardware

ForwardingUnit

DataMem.A

LU

FOR

W.

MU

X

FOR

W.

MU

X

ID/EX EX/MEM MEM/WB

MU

X

Destination registersSource reg.

IDs from opcode

Datato reg.

file

4055:035 Computer Architecture and Organization

Page 41: 55:035 Computer Architecture and Organization Lecture 10

Forwarding Alone May Not WorkCC1 CC2 CC3 CC4 CC5

IM

ID, R

EG

. F

ILE

R

EA

D

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/ID

ID/E

X

ME

M/W

B

EX

/ME

M

IM

ID, R

EG

. F

ILE

R

EA

D

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/ID

ID/E

X

ME

M/W

B

EX

/ME

M

lw $s0, 20($s1)

sub $t2, $s0, $t3

time

Pro

gra

m in

stru

ctio

ns

Write s0 in CC5

Read s0 and t3 in CC3

data needed by sub (data hazard)

data available from memory only at the end of cycle 4

4155:035 Computer Architecture and Organization

Page 42: 55:035 Computer Architecture and Organization Lecture 10

Use Bubble and ForwardingCC1 CC2 CC3 CC4 CC5

IM

ID, R

EG

. F

ILE

R

EA

D

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/ID

ID/E

X

ME

M/W

B

EX

/ME

M

lw $s0, 20($s1)

sub $t2, $s0, $t3

time

Pro

gra

m in

stru

ctio

ns

Write s0 in CC5

Forwarding

stall(bubble)

IM

ID, R

EG

. F

ILE

R

EA

D

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/ID

ID/E

X

ME

M/W

B

EX

/ME

M

4255:035 Computer Architecture and Organization

Page 43: 55:035 Computer Architecture and Organization Lecture 10

Hazard Detection Unit Hardware

Source reg.IDs from opcode

ForwardingUnit

DataMem.A

LU

FOR

W.

MU

X

FOR

W.

MU

X

ID/EX EX/MEM MEM/WB

Controlsignals

to reg. file

HazardDetection

Unit

PC

ControlN

OP M

UX

0

IF/ID

Inst

ruct

ion

Disablewrite

4355:035 Computer Architecture and Organization

Page 44: 55:035 Computer Architecture and Organization Lecture 10

Resolving Hazards Hazards are resolved by Hazard detection and

forwarding units. Compiler’s understanding of how these units

work can improve performance.

4455:035 Computer Architecture and Organization

Page 45: 55:035 Computer Architecture and Organization Lecture 10

Avoiding Stall by Code ReorderC code:

A = B + E;C = B + F;

MIPS code: lw $t1, 0($t0) . $t1 writtenlw $t2, 4($t0) . . $t2 writtenadd $t3, $t1, $t2 . . . $t1, $t2 neededsw $t3, 12($t0) . . . .lw $t4, 8($t0) . . . . . $t4 writtenadd $t5, $t1, $t4 . . . . . $t4 neededsw $t5, 16,($t0) . . . . .

. . . . . . .

. . .

4555:035 Computer Architecture and Organization

Page 46: 55:035 Computer Architecture and Organization Lecture 10

Reordered Code

C code: A = B + E;C = B + F;

MIPS code: lw $t1, 0($t0)lw $t2, 4($t0)lw $t4, 8($t0)add $t3, $t1, $t2 no hazardsw $t3, 12($t0)add $t5, $t1, $t4 no hazardsw $t5, 16,($t0)

4655:035 Computer Architecture and Organization

Page 47: 55:035 Computer Architecture and Organization Lecture 10

Control Hazard Instruction to be fetched is not known! Example: Instruction being executed is branch-

type, which will determine the next instruction: add $4, $5, $6 beq $1, $2, 40 next instruction . . . 40 and $7, $8, $9

4755:035 Computer Architecture and Organization

Page 48: 55:035 Computer Architecture and Organization Lecture 10

Stall on BranchCC1 CC2 CC3 CC4 CC5

IM

ID, R

EG

. F

ILE

R

EA

D

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/ID

ID/E

X

ME

M/W

B

EX

/ME

M

IM

ID, R

EG

. F

ILE

R

EA

D

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/ID

ID/E

X

ME

M/W

B

EX

/ME

M

IM

ID, R

EG

. F

ILE

R

EA

D

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/ID

ID/E

X

ME

M/W

B

EX

/ME

M

add $4, $5, $6

beq $1, $2, 40

next instruction or and $7, $8, $9

time

Pro

gra

m in

stru

ctio

ns

Stall (bubble)

Page 49: 55:035 Computer Architecture and Organization Lecture 10

Why Only One Stall? Extra hardware in ID phase:

Additional ALU to compute branch address Comparator to generate zero signal Hazard detection unit writes the branch address in PC

4955:035 Computer Architecture and Organization

Page 50: 55:035 Computer Architecture and Organization Lecture 10

Ways to Handle Branch Stall or bubble Branch prediction:

Heuristics Next instruction Prediction based on statistics (dynamic) Hardware decision (dynamic)

Prediction error: pipeline flush

Delayed branch

5055:035 Computer Architecture and Organization

Page 51: 55:035 Computer Architecture and Organization Lecture 10

Delayed Branch Example Stall on branch

add $4, $5, $6 beq $1, $2, skip next instruction . . . skip or $7, $8, $9

Delayed branch beq $1, $2, skip add $4, $5, $6 next instruction . . . skip or $7, $8, $9

Instruction executed irrespective of branch decision

5155:035 Computer Architecture and Organization

Page 52: 55:035 Computer Architecture and Organization Lecture 10

Delayed BranchCC1 CC2 CC3 CC4 CC5

IM

ID,

RE

G.

FIL

E

RE

AD

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/I

D

ID/E

X

ME

M/W

B

EX

/ME

M

IM

ID,

RE

G.

FIL

E

RE

AD

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/I

D

ID/E

X

ME

M/W

B

EX

/ME

M

IM

ID,

RE

G.

FIL

E

RE

AD

AL

U

DM

R

EG

. F

ILE

W

RIT

E

IF/I

D

ID/E

X

ME

M/W

B

EX

/ME

M

add $4, $5, $6

beq $1, $2, skip

next instruction or skip or $7, $8, $9

time

Pro

gra

m in

stru

ctio

ns

5255:035 Computer Architecture and Organization

Page 53: 55:035 Computer Architecture and Organization Lecture 10

Summary: Hazards Structural hazards

Cause: resource conflict Remedies: (i) hardware resources, (ii) stall (bubble)

Data hazards Cause: data unavailablity Remedies: (i) forwarding, (ii) stall (bubble), (iii) code

reordering

Control hazards Cause: out-of-sequence execution (branch or jump) Remedies: (i) stall (bubble), (ii) branch prediction/pipeline

flush, (iii) delayed branch/pipeline flush

5355:035 Computer Architecture and Organization