Lec 04 Pipelined Processor


  • Pipelined Processor Design

    M S Bhat

    Dept. of E&C,

    NITK Suratkal

  • VL722 2 8/16/2015

    Presentation Outline

    Pipelining versus Serial Execution

    Pipelined Datapath and Control

    Pipeline Hazards

    Data Hazards and Forwarding

    Load Delay, Hazard Detection, and Stall

    Control Hazards

    Delayed Branch and Dynamic Branch Prediction

  • VL722 3 8/16/2015

    Laundry Example: Three Stages

    1. Wash dirty load of clothes

    2. Dry wet clothes

    3. Fold and put clothes into drawers

    Each stage takes 30 minutes to complete

    Four loads of clothes to wash, dry, and fold


    Pipelining Example

  • VL722 4 8/16/2015

    Sequential laundry takes 6 hours for 4 loads

    Intuitively, we can use pipelining to speed up laundry

    Sequential Laundry

    [Figure: timeline from 6 PM to midnight - loads A, B, C, D each occupy three consecutive 30-minute slots (wash, dry, fold), one load after another]

  • VL722 5 8/16/2015

    Pipelined laundry takes 3 hours for 4 loads

    Speedup factor is 2 for 4 loads

    Time to wash, dry, and fold one load is still the

    same (90 minutes)

    Pipelined Laundry: Start Load ASAP

    [Figure: timeline from 6 PM to 9 PM - loads A, B, C, D overlapped, with a new load starting every 30 minutes]
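    The laundry arithmetic above can be checked in a few lines of Python (an illustrative sketch, not part of the original slides):

```python
# Laundry pipelining arithmetic: 3 stages of 30 minutes each, 4 loads.
STAGE_MIN = 30
STAGES = 3
LOADS = 4

sequential = LOADS * STAGES * STAGE_MIN       # each load runs start to finish alone
pipelined = (STAGES + LOADS - 1) * STAGE_MIN  # fill the pipeline, then one load per slot

print(sequential)               # 360 minutes = 6 hours
print(pipelined)                # 180 minutes = 3 hours
print(sequential / pipelined)   # speedup = 2.0 for 4 loads
```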

  • VL722 6 8/16/2015

    Serial Execution versus Pipelining

    Consider a task that can be divided into k subtasks

    The k subtasks are executed on k different stages

    Each subtask requires one time unit

    The total execution time of the task is k time units

    Pipelining is to overlap the execution

    The k stages work in parallel on k different tasks

    Tasks enter/leave pipeline at the rate of one task per time unit

    [Figure: six tasks, each passing through pipeline stages 1, 2, ..., k - executed back to back without pipelining, and overlapped with pipelining]

    Without Pipelining

    One completion every k time units

    With Pipelining

    One completion every 1 time unit

  • VL722 7 8/16/2015

    Synchronous Pipeline

    Uses clocked registers between stages

    Upon arrival of a clock edge

    All registers hold the results of previous stages simultaneously

    The pipeline stages are combinational logic circuits

    It is desirable to have balanced stages

    Approximately equal delay in all stages

    Clock period is determined by the maximum stage delay

    [Figure: k pipeline stages S1, S2, ..., Sk separated by clocked registers, with a common Clock, Input on the left, Output on the right]

  • VL722 8 8/16/2015

    Pipeline Performance

    Let ti = time delay in stage Si

    Clock cycle t = max(ti) is the maximum stage delay

    Clock frequency f = 1/t = 1/max(ti)

    A pipeline can process n tasks in k + n - 1 cycles

    k cycles are needed to complete the first task

    n - 1 cycles are needed to complete the remaining n - 1 tasks

    Ideal speedup of a k-stage pipeline over serial execution:

    Sk = (serial execution in cycles) / (pipelined execution in cycles) = nk / (k + n - 1), which approaches k for large n
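    The speedup formula can be verified numerically; a minimal sketch (not from the slides):

```python
# Ideal k-stage pipeline speedup: S_k = n*k / (k + n - 1), approaching k for large n.
def speedup(k: int, n: int) -> float:
    serial_cycles = n * k         # each task takes k cycles, run back to back
    pipelined_cycles = k + n - 1  # k cycles to fill, then one completion per cycle
    return serial_cycles / pipelined_cycles

print(speedup(5, 4))     # 4 tasks on 5 stages: 20/8 = 2.5
print(speedup(5, 1000))  # large n: 5000/1004, close to k = 5
```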

  • VL722 9 8/16/2015

    MIPS Processor Pipeline

    Five stages, one cycle per stage

    1. IF: Instruction Fetch from instruction memory

    2. ID: Instruction Decode, register read, and J/Br address

    3. EX: Execute operation or calculate load/store address

    4. MEM: Memory access for load and store

    5. WB: Write Back result to register

  • VL722 10 8/16/2015

    Performance Example

    Assume the following operation times for components:

    Instruction and data memories: 200 ps

    ALU and adders: 180 ps

    Decode and Register file access (read or write): 150 ps

    Ignore the delays in PC, mux, extender, and wires

    Which of the following would be faster and by how much?

    Single-cycle implementation for all instructions

    Multicycle implementation optimized for every class of instructions

    Assume the following instruction mix:

    40% ALU, 20% Loads, 10% stores, 20% branches, & 10% jumps

  • VL722 11 8/16/2015

    Single-Cycle vs Multicycle Implementation

    Break instruction execution into five steps

    Instruction fetch

    Instruction decode and register read

    Execution, memory address calculation, or branch completion

    Memory access or ALU instruction completion

    Load instruction completion

    One step = One clock cycle (clock cycle is reduced)

    First 2 steps are the same for all instructions

    Instruction    # cycles
    ALU & Store    4
    Load           5
    Branch         3
    Jump           2

  • VL722 12 8/16/2015

    Solution

    Instruction Class   Instruction Memory   Register Read   ALU Operation   Data Memory   Register Write   Total
    ALU                 200                  150             180                           150              680 ps
    Load                200                  150             180             200           150              880 ps
    Store               200                  150             180             200                            730 ps
    Branch              200                  150             180                                            530 ps
    Jump                200                  150 (decode and update PC)                                     350 ps

    For the fixed single-cycle implementation:

    Clock cycle = 880 ps, determined by the longest delay (load instruction)

    For the multicycle implementation:

    Clock cycle = max(200, 150, 180) = 200 ps (maximum delay at any step)

    Average CPI = 0.4*4 + 0.2*5 + 0.1*4 + 0.2*3 + 0.1*2 = 3.8

    Speedup = 880 ps / (3.8 * 200 ps) = 880 / 760 = 1.16
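    The CPI and speedup numbers above can be reproduced with a short calculation (an illustrative sketch, not part of the original slides):

```python
# Average CPI and speedup for the multicycle example: instruction mix from the
# slides, cycle counts per class, 200 ps multicycle clock, 880 ps single-cycle clock.
mix    = {"ALU": 0.40, "Load": 0.20, "Store": 0.10, "Branch": 0.20, "Jump": 0.10}
cycles = {"ALU": 4,    "Load": 5,    "Store": 4,    "Branch": 3,    "Jump": 2}

avg_cpi = sum(mix[c] * cycles[c] for c in mix)  # 0.4*4 + 0.2*5 + 0.1*4 + 0.2*3 + 0.1*2
single_cycle = 880                              # ps, set by the slowest (load) instruction
multicycle = avg_cpi * 200                      # ps per instruction at a 200 ps clock

print(round(avg_cpi, 2))                    # 3.8
print(round(single_cycle / multicycle, 2))  # 1.16: multicycle is faster
```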

  • VL722 13 8/16/2015

    Single-Cycle vs Pipelined Performance

    Consider a 5-stage instruction execution in which

    Instruction fetch = Data memory access = 200 ps

    ALU operation = 180 ps

    Register read = register write = 150 ps

    What is the clock cycle of the single-cycle processor?

    What is the clock cycle of the pipelined processor?

    What is the speedup factor of pipelined execution?

    Solution

    Single-Cycle Clock = 200+150+180+200+150 = 880 ps

    [Figure: two instructions executed back to back, each occupying one 880 ps cycle through IF, Reg, ALU, MEM, Reg]

  • VL722 14 8/16/2015

    Single-Cycle versus Pipelined (cont'd)

    Pipelined clock cycle = max(200, 180, 150) = 200 ps

    CPI for pipelined execution = 1

    One instruction is completed in each cycle (ignoring pipeline fill)

    Speedup of pipelined execution = 880 ps / 200 ps = 4.4

    Instruction count and CPI are equal in both cases

    The speedup factor is less than 5 (the number of pipeline stages)

    Because the pipeline stages are not balanced

    Throughput = 1/max(delay) = 1/200 ps = 5 x 10^9 instructions/sec

    [Figure: pipelined timing - instructions overlapped, one 200 ps stage per cycle]

  • VL722 15 8/16/2015

    Pipeline Performance Summary

  • VL722 16 8/16/2015

    Next . . .

    Pipelining versus Serial Execution

    Pipelined Datapath and Control

    Pipeline Hazards

    Data Hazards and Forwarding

    Load Delay, Hazard Detection, and Stall

    Control Hazards

    Delayed Branch and Dynamic Branch Prediction

  • VL722 17 8/16/2015

    Single-Cycle Datapath

    Shown below is the single-cycle datapath

    How to pipeline this single-cycle datapath?

    Answer: Introduce a pipeline register at the end of each stage

    [Figure: single-cycle datapath - PC and +1 incrementer, Instruction Memory, Register File (RA, RB, RW ports), extender, ALU, and Data Memory, annotated with stage boundaries IF = Instruction Fetch, ID = Decode & Register Read, EX = Execute, MEM = Memory Access, WB = Write Back, and control signals PCSrc (J, Beq, Bne, zero), RegDst, RegWrite, ExtOp, ALUSrc, ALUCtrl, MemRead, MemWrite, MemtoReg]

  • VL722 18 8/16/2015

    Pipelined Datapath

    Pipeline registers are shown in green, including the PC

    The same clock edge updates all pipeline registers, the register file, and the data memory (for store instructions)

    [Figure: the datapath above with pipeline registers (NPC, Instruction, A, B, Imm, ALUout, WBData) inserted between the IF, ID, EX, MEM, and WB stages]

  • VL722 19 8/16/2015

    Problem with the Register Destination

    Is there a problem with the register destination address?

    The instruction in the ID stage differs from the one in the WB stage

    The instruction in the WB stage is therefore not writing to its own destination register, but to the destination of a different instruction - the one currently in the ID stage

    [Figure: pipelined datapath in which the Rd field flows directly from the ID stage to the register file's RW port, so the write-back uses the wrong instruction's destination]

  • VL722 20 8/16/2015

    Pipelining the Destination Register

    The destination register number should be pipelined

    The destination register number is passed from the ID to the WB stage

    The WB stage writes back data knowing the destination register

    [Figure: pipelined datapath with the destination register number carried through pipeline registers Rd2, Rd3, Rd4 alongside the data]

  • VL722 21 8/16/2015

    Graphically Representing Pipelines

    Multiple instruction execution over multiple clock cycles

    Instructions are listed in execution order from top to bottom

    Clock cycles move from left to right

    Figure shows the use of resources at each stage and each cycle

    [Figure: multi-cycle pipeline diagram for lw R14, 8(R21); add R17, R18, R19; ori R20, R11, 7; sub R13, R18, R11; sw R18, 10(R19) - each instruction shown as IF, Reg, ALU, DM, Reg boxes across cycles CC1-CC8]

  • VL722 22 8/16/2015

    Instruction-Time Diagram

    The instruction-time diagram shows:

    Which instruction occupies which stage at each clock cycle

    Instruction flow is pipelined over the 5 stages

    [Figure: instruction-time diagram for lw R15, 8(R19); lw R14, 8(R21); ori R12, R19, 7; sub R13, R18, R11; sw R18, 10(R19) - stages IF, ID, EX, MEM, WB listed per instruction across cycles CC1-CC9]

    Up to five instructions can be in the pipeline during the same cycle: Instruction-Level Parallelism (ILP)

    ALU instructions skip the MEM stage

    Store instructions skip the WB stage

  • VL722 23 8/16/2015

    Control Signals

    The same control signals are used as in the single-cycle datapath

    [Figure: pipelined datapath annotated with all control signals - RegDst in ID; ExtOp, ALUSrc, ALUCtrl, J, Beq, Bne in EX; MemRead, MemWrite, MemtoReg in MEM; RegWrite in WB; PCSrc driven by the branch/jump logic]

  • VL722 24 8/16/2015

    Pipelined Control

    Pass the control signals along the pipeline just like the data

    [Figure: pipelined datapath with the Main & ALU Control in the ID stage generating all control signals, which are then carried through the EX, MEM, and WB pipeline registers to the stages that use them]

  • VL722 25 8/16/2015

    Pipelined Control Cont'd

    ID stage generates all the control signals

    Pipeline the control signals as the instruction moves

    Extend the pipeline registers to include the control signals

    Each stage uses some of the control signals

    Instruction Decode and Register Read

    Control signals are generated

    RegDst is used in this stage

    Execution Stage => ExtOp, ALUSrc, and ALUCtrl

    Next PC uses J, Beq, Bne, and zero signals for branch control

    Memory Stage => MemRead, MemWrite, and MemtoReg

    Write Back Stage => RegWrite is used in this stage

  • VL722 26 8/16/2015

    Control Signals Summary

    Op       RegDst   ALUSrc   ExtOp    J   Beq  Bne  ALUCtrl  MemRd  MemWr  MemReg  RegWrite
    R-Type   1=Rd     0=Reg    x        0   0    0    func     0      0      0       1
    addi     0=Rt     1=Imm    1=sign   0   0    0    ADD      0      0      0       1
    slti     0=Rt     1=Imm    1=sign   0   0    0    SLT      0      0      0       1
    andi     0=Rt     1=Imm    0=zero   0   0    0    AND      0      0      0       1
    ori      0=Rt     1=Imm    0=zero   0   0    0    OR       0      0      0       1
    lw       0=Rt     1=Imm    1=sign   0   0    0    ADD      1      0      1       1
    sw       x        1=Imm    1=sign   0   0    0    ADD      0      1      x       0
    beq      x        0=Reg    x        0   1    0    SUB      0      0      x       0
    bne      x        0=Reg    x        0   0    1    SUB      0      0      x       0
    j        x        x        x        1   0    0    x        0      0      x       0
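    The table above can be encoded as a lookup structure for sanity checks; the following is an illustrative sketch (the dict layout and helper name are mine, not part of the slides):

```python
# The control-signal table encoded as a Python dict. 'x' marks don't-cares;
# field names follow the slide's signal names.
FIELDS = ["RegDst", "ALUSrc", "ExtOp", "J", "Beq", "Bne",
          "ALUCtrl", "MemRd", "MemWr", "MemReg", "RegWrite"]

CONTROL = {
    "R-Type": [1,   0,   "x", 0, 0, 0, "func", 0, 0, 0,   1],
    "addi":   [0,   1,   1,   0, 0, 0, "ADD",  0, 0, 0,   1],
    "slti":   [0,   1,   1,   0, 0, 0, "SLT",  0, 0, 0,   1],
    "andi":   [0,   1,   0,   0, 0, 0, "AND",  0, 0, 0,   1],
    "ori":    [0,   1,   0,   0, 0, 0, "OR",   0, 0, 0,   1],
    "lw":     [0,   1,   1,   0, 0, 0, "ADD",  1, 0, 1,   1],
    "sw":     ["x", 1,   1,   0, 0, 0, "ADD",  0, 1, "x", 0],
    "beq":    ["x", 0,   "x", 0, 1, 0, "SUB",  0, 0, "x", 0],
    "bne":    ["x", 0,   "x", 0, 0, 1, "SUB",  0, 0, "x", 0],
    "j":      ["x", "x", "x", 1, 0, 0, "x",    0, 0, "x", 0],
}

def signal(op, name):
    """Look up one control signal for an opcode (hypothetical helper)."""
    return CONTROL[op][FIELDS.index(name)]

print(signal("lw", "MemRd"))     # 1
print(signal("beq", "ALUCtrl"))  # SUB
```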

  • VL722 27 8/16/2015

    Next . . .

    Pipelining versus Serial Execution

    Pipelined Datapath and Control

    Pipeline Hazards

    Data Hazards and Forwarding

    Load Delay, Hazard Detection, and Stall

    Control Hazards

    Delayed Branch and Dynamic Branch Prediction

  • VL722 28 8/16/2015

    Pipeline Hazards

    Hazards: situations that would cause incorrect execution if the next instruction were launched during its designated clock cycle

    1. Structural hazards

    Caused by resource contention: two instructions using the same resource during the same cycle

    2. Data hazards

    An instruction may compute a result needed by the next instruction

    Hardware can detect dependencies between instructions

    3. Control hazards

    Caused by instructions that change the control flow (branches/jumps)

    Delays in changing the flow of control

    Hazards complicate pipeline control and limit performance

  • VL722 29 8/16/2015

    Structural Hazards

    Problem: attempt to use the same hardware resource by two different instructions during the same cycle

    Example:

    Writing back an ALU result in stage 4

    Conflicts with writing load data in stage 5

    [Figure: instruction-time diagram for lw R14, 8(R21); ori R12, R19, 7; sub R13, R18, R19; sw R18, 10(R19) - two instructions attempt to write the register file during the same cycle]

  • VL722 30 8/16/2015

    Resolving Structural Hazards

    This is a serious hazard: it cannot be ignored

    Solution 1: Delay Access to Resource

    Must have mechanism to delay instruction access to resource

    Delay all write backs to the register file to stage 5

    ALU instructions bypass stage 4 (memory) without doing anything

    Solution 2: Add more hardware resources (more costly)

    Add more hardware to eliminate the structural hazard

    Redesign the register file to have two write ports

    First write port can be used to write back ALU results in stage 4

    Second write port can be used to write back load data in stage 5

  • VL722 31 8/16/2015

    Next . . .

    Pipelining versus Serial Execution

    Pipelined Datapath and Control

    Pipeline Hazards

    Data Hazards and Forwarding

    Load Delay, Hazard Detection, and Stall

    Control Hazards

    Delayed Branch and Dynamic Branch Prediction

  • VL722 32 8/16/2015

    Data Hazards

    A dependency between instructions can cause a data hazard when

    The dependent instructions are close to each other, and

    Pipelined execution might change the order of operand access

    Read After Write (RAW) Hazard

    Given two instructions I and J, where I comes before J

    Instruction J should read an operand after it is written by I

    Called a data dependence in compiler terminology

    I: add R17, R18, R19 # R17 is written
    J: sub R20, R17, R19 # R17 is read

    The hazard occurs when J reads the operand before I writes it

  • VL722 33 8/16/2015

    Example of a RAW Data Hazard

    The result of sub is needed by the add, or, and, & sw instructions

    Instructions add & or will read the old value of R18 from the register file

    During CC5, R18 is written at the end of the cycle

    [Figure: pipeline diagram for sub R18, R9, R11; add R20, R18, R13; or R22, R11, R18; and R23, R12, R18; sw R24, 10(R18) - R18 holds 10 through CC5 and 20 from CC6 onward]

  • VL722 34 8/16/2015

    Solution 1: Stalling the Pipeline

    Three stall cycles during CC3 through CC5 (wasting 3 cycles)

    The stall cycles delay execution of add and the fetching of the or instruction

    The add instruction cannot read R18 until the beginning of CC6

    The add instruction remains in the Instruction register until CC6

    The PC register is not modified until the beginning of CC6

    [Figure: pipeline diagram - sub R18, R9, R11 proceeds normally; add R20, R18, R13 stalls in ID during CC3-CC5; or R22, R11, R18 is fetched late]

  • VL722 35 8/16/2015

    Solution 2: Forwarding the ALU Result

    The ALU result is forwarded (fed back) to the ALU input

    No bubbles are inserted into the pipeline and no cycles are wasted

    The ALU result is forwarded from the ALU, MEM, and WB stages

    [Figure: pipeline diagram for sub R18, R9, R11; add R20, R18, R13; or R22, R11, R18; and R23, R22, R18; sw R24, 10(R18) - arrows forward the sub result from the EX, MEM, and WB stages to the dependent instructions]

  • VL722 36 8/16/2015

    Implementing Forwarding

    Two multiplexers are added at the inputs of the A & B registers

    Data from the ALU stage, MEM stage, and WB stage is fed back

    Two signals, ForwardA and ForwardB, control forwarding

    [Figure: pipelined datapath with 4-to-1 forwarding multiplexers on the A and B register inputs, fed from BusA/BusB, the ALU result, the MEM-stage result, and the WB-stage write-back data]

  • VL722 37 8/16/2015

    Forwarding Control Signals

    Signal         Explanation
    ForwardA = 0   First ALU operand comes from the register file = value of (Rs)
    ForwardA = 1   Forward the result of the previous instruction to A (from the ALU stage)
    ForwardA = 2   Forward the result of the 2nd previous instruction to A (from the MEM stage)
    ForwardA = 3   Forward the result of the 3rd previous instruction to A (from the WB stage)
    ForwardB = 0   Second ALU operand comes from the register file = value of (Rt)
    ForwardB = 1   Forward the result of the previous instruction to B (from the ALU stage)
    ForwardB = 2   Forward the result of the 2nd previous instruction to B (from the MEM stage)
    ForwardB = 3   Forward the result of the 3rd previous instruction to B (from the WB stage)

  • VL722 38 8/16/2015

    Forwarding Example

    Instruction sequence:

    lw  R12, 4(R8)
    ori R15, R9, 2
    sub R11, R12, R15

    When the sub instruction is in the decode stage:

    ori will be in the ALU stage

    lw will be in the MEM stage

    ForwardA = 2 (R12 from the MEM stage); ForwardB = 1 (R15 from the ALU stage)

    [Figure: forwarding datapath with the mux selections 2 and 1 highlighted for this sequence]

  • VL722 39 8/16/2015

    RAW Hazard Detection

    The current instruction being decoded is in the Decode stage

    The previous instruction is in the Execute stage

    The second previous instruction is in the Memory stage

    The third previous instruction is in the Write Back stage

    if      ((Rs != 0) and (Rs == Rd2) and (EX.RegWrite))   ForwardA = 1
    else if ((Rs != 0) and (Rs == Rd3) and (MEM.RegWrite))  ForwardA = 2
    else if ((Rs != 0) and (Rs == Rd4) and (WB.RegWrite))   ForwardA = 3
    else                                                    ForwardA = 0

    if      ((Rt != 0) and (Rt == Rd2) and (EX.RegWrite))   ForwardB = 1
    else if ((Rt != 0) and (Rt == Rd3) and (MEM.RegWrite))  ForwardB = 2
    else if ((Rt != 0) and (Rt == Rd4) and (WB.RegWrite))   ForwardB = 3
    else                                                    ForwardB = 0
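    The forwarding conditions can be transcribed directly into a small function; this is an illustrative sketch (the function name and argument layout are mine), assuming each stage supplies its destination register number and RegWrite bit:

```python
# Forwarding mux selection for one source register (Rs or Rt), following the
# priority order on the slide: EX result first, then MEM, then WB.
def forward(src, rd2, ex_wr, rd3, mem_wr, rd4, wb_wr):
    if src != 0 and src == rd2 and ex_wr:   # result still in the ALU (EX) stage
        return 1
    if src != 0 and src == rd3 and mem_wr:  # result in the MEM stage
        return 2
    if src != 0 and src == rd4 and wb_wr:   # result in the WB stage
        return 3
    return 0                                # no hazard: read the register file

# sub R18,... is in EX; the next instruction reads R18 -> forward from EX
print(forward(18, 18, True, 0, False, 0, False))  # 1
# register R0 is never forwarded
print(forward(0, 0, True, 0, True, 0, True))      # 0
```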

  • VL722 40 8/16/2015

    Hazard Detect and Forward Logic

    [Figure: datapath with a Hazard Detect and Forward unit comparing Rs and Rt against Rd2, Rd3, Rd4 and the RegWrite bits in the EX, MEM, and WB stages to drive the ForwardA and ForwardB mux selects]

  • VL722 41 8/16/2015

    Next . . .

    Pipelining versus Serial Execution

    Pipelined Datapath and Control

    Pipeline Hazards

    Data Hazards and Forwarding

    Load Delay, Hazard Detection, and Pipeline Stall

    Control Hazards

    Delayed Branch and Dynamic Branch Prediction

  • VL722 42 8/16/2015

    Load Delay

    Unfortunately, not all data hazards can be forwarded

    A load has a delay that cannot be eliminated by forwarding

    In the example shown below:

    The lw instruction does not read its data until the end of CC4

    It cannot forward that data to add at the end of CC3 - NOT possible

    However, the load can forward data to the 2nd next and later instructions

    [Figure: pipeline diagram for lw R18, 20(R17); add R20, R18, R13; or R14, R11, R18; and R15, R18, R12 - the load data is available only after the DM access in CC4]

  • VL722 43 8/16/2015

    Detecting RAW Hazard after Load

    Detecting a RAW hazard after a Load instruction:

    The load instruction will be in the EX stage

    Instruction that depends on the load data is in the decode stage

    Condition for stalling the pipeline:

    if ((EX.MemRead == 1)                       // detect a load in the EX stage
        and (ForwardA == 1 or ForwardB == 1))   // RAW hazard on the load result
            Stall

    Insert a bubble into the EX stage after the load instruction

    A bubble is a no-op that wastes one clock cycle

    It delays the dependent instruction after the load by one cycle

    Because of the RAW hazard
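    The stall condition above can be expressed as a small predicate; an illustrative sketch (names are mine), assuming boolean stage fields:

```python
# Load-use stall detection: EX.MemRead identifies a load in the EX stage;
# ForwardA/ForwardB == 1 means the instruction in ID needs that result from
# the very next stage, which forwarding cannot satisfy in time.
def must_stall(ex_mem_read, forward_a, forward_b):
    return ex_mem_read and (forward_a == 1 or forward_b == 1)

print(must_stall(True, 1, 0))   # True: lw immediately followed by a dependent instruction
print(must_stall(True, 0, 2))   # False: the dependence is two instructions back
print(must_stall(False, 1, 0))  # False: the producer is not a load, forwarding suffices
```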

  • VL722 44 8/16/2015

    Stall the Pipeline for One Cycle

    The add instruction depends on lw: stall at CC3

    Allow the load instruction in the ALU stage to proceed

    Freeze the PC and Instruction registers (NO instruction is fetched)

    Introduce a bubble into the ALU stage (a bubble is a NO-OP)

    The load can then forward its data to the next instruction, after delaying it by one cycle

    [Figure: pipeline diagram across CC1-CC8 for lw R18, 20(R17); add R20, R18, R13; or R14, R11, R18 - add stalls one cycle while a bubble occupies the ALU stage]

  • VL722 45 8/16/2015

    Showing Stall Cycles

    Stall cycles can be shown on the instruction-time diagram

    The hazard is detected in the Decode stage

    Stall indicates that the instruction is delayed

    Instruction fetching is also delayed after a stall

    Data forwarding is shown using blue arrows

    Example (spanning cycles CC1-CC10):

    lw  R17, (R13)     IF  ID  EX  MEM  WB
    lw  R18, 8(R17)    IF  Stall  ID  EX  MEM  WB
    add R2, R18, R11   IF  Stall  ID  EX  MEM  WB
    sub R1, R18, R2    IF  ID  EX  MEM  WB

  • VL722 46 8/16/2015

    Hazard Detect, Forward, and Stall

    [Figure: datapath in which the Hazard Detect, Forward, & Stall unit also receives EX.MemRead; on a stall it disables the PC, freezes the Instruction register, and clears the control signals entering EX (Bubble = 0)]

  • VL722 47 8/16/2015

    Code Scheduling to Avoid Stalls

    Compilers reorder code in a way that avoids load stalls

    Consider the translation of the following statements:

    A = B + C; D = E - F; // A through F are in memory

    Slow code:

    lw   R8, 4(R16)      # &B = 4(R16)
    lw   R9, 8(R16)      # &C = 8(R16)
    add  R10, R8, R9     # stall cycle
    sw   R10, 0(R16)     # &A = 0(R16)
    lw   R11, 16(R16)    # &E = 16(R16)
    lw   R12, 20(R16)    # &F = 20(R16)
    sub  R13, R11, R12   # stall cycle
    sw   R13, 12(R0)     # &D = 12(R0)

    Fast code (no stalls):

    lw   R8, 4(R16)
    lw   R9, 8(R16)
    lw   R11, 16(R16)
    lw   R12, 20(R16)
    add  R10, R8, R9
    sw   R10, 0(R16)
    sub  R13, R11, R12
    sw   R13, 12(R0)
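    The stall counts for the two sequences can be checked with a toy counter; an illustrative sketch (the tuple encoding is mine; base registers of the loads are omitted since only load-use pairs matter here):

```python
# Count load-use stalls: one stall cycle whenever an instruction reads a
# register loaded by the immediately preceding lw. Each entry is
# (opcode, destination register or None, list of source registers).
def load_use_stalls(seq):
    stalls = 0
    for (op1, dst1, _), (_, _, srcs2) in zip(seq, seq[1:]):
        if op1 == "lw" and dst1 in srcs2:
            stalls += 1
    return stalls

slow = [("lw", 8, []), ("lw", 9, []), ("add", 10, [8, 9]), ("sw", None, [10]),
        ("lw", 11, []), ("lw", 12, []), ("sub", 13, [11, 12]), ("sw", None, [13])]
fast = [("lw", 8, []), ("lw", 9, []), ("lw", 11, []), ("lw", 12, []),
        ("add", 10, [8, 9]), ("sw", None, [10]), ("sub", 13, [11, 12]), ("sw", None, [13])]

print(load_use_stalls(slow))  # 2 (add after lw R9, sub after lw R12)
print(load_use_stalls(fast))  # 0
```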

  • VL722 48 8/16/2015

    Name Dependence: Write After Read

    Instruction J writes its result after it is read by I

    Called an anti-dependence by compiler writers

    I: sub R12, R9, R11 # R9 is read
    J: add R9, R10, R11 # R9 is written

    Results from the reuse of the name R9

    NOT a data hazard in the 5-stage pipeline because:

    Reads are always in stage 2

    Writes are always in stage 5, and

    Instructions are processed in order

    An anti-dependence can be eliminated by renaming

    Use a different destination register for add (e.g., R13)

  • VL722 49 8/16/2015

    Name Dependence: Write After Write

    Same destination register is written by two instructions

    Called output-dependence in compiler terminology

    I: sub R9, R12, R11 # R9 is written

    J: add R9, R10, R11 # R9 is written again

    Not a data hazard in the 5-stage pipeline because:

    All writes are ordered and always take place in stage 5

    However, can be a hazard in more complex pipelines

    If instructions are allowed to complete out of order, and

    Instruction J completes and writes R9 before instruction I

    Output dependence can be eliminated by renaming R9

    Read After Read is NOT a name dependence

  • VL722 50 8/16/2015

    Hazards due to Loads and Stores

    Consider the following statements:

    sw R10, 0(R1)
    lw R11, 6(R5)

    Is any data hazard possible in the above code sequence?

    If 0(R1) == 6(R5), the two instructions refer to the same memory location!!

    This is a data dependency through data memory

    But there is no data hazard, since the memory is not accessed by the two instructions simultaneously: the write and the read happen in two consecutive clock cycles

    In an out-of-order execution processor, however, this can lead to a hazard!

  • VL722 51 8/16/2015

    Next . . .

    Pipelining versus Serial Execution

    Pipelined Datapath and Control

    Pipeline Hazards

    Data Hazards and Forwarding

    Load Delay, Hazard Detection, and Stall

    Control Hazards

    Delayed Branch and Dynamic Branch Prediction

  • VL722 52 8/16/2015

    Control Hazards

    Jump and Branch can cause great performance loss

    Jump instruction needs only the jump target address

    Branch instruction needs two things:

    Branch Result Taken or Not Taken

    Branch Target Address

    PC + 4 if the branch is NOT taken

    PC + 4 + (immediate << 2) if the branch is taken

    Jump and Branch targets are computed in the ID stage

    At which point a new instruction is already being fetched

    Jump Instruction: 1-cycle delay

    Branch: 2-cycle delay for branch result (taken or not taken)

  • VL722 53 8/16/2015

    2-Cycle Branch Delay

    Control logic detects a Branch instruction in the 2nd Stage

    ALU computes the Branch outcome in the 3rd Stage

    Next1 and Next2 instructions will be fetched anyway

    Convert Next1 and Next2 into bubbles if branch is taken

                     cc1   cc2   cc3     cc4     cc5     cc6     cc7
    Beq R9,R10,L1    IF    Reg   ALU     (branch target & outcome computed in cc3)
    Next1                  IF    Reg     Bubble  Bubble  Bubble
    Next2                        IF      Bubble  Bubble  Bubble  Bubble
    L1: target                           IF      Reg     ALU     DM

  • VL722 54 8/16/2015

    Implementing Jump and Branch

    [Datapath figure: implementing Jump and Branch. The PC addresses the Instruction Memory; Rs/Rt index the Register File; Imm16/Imm26 feed the Next PC logic; the pipeline registers (NPC, A, B, ALUout, D) carry values through EX and MEM. A PCSrc mux driven by J, Beq, Bne and the ALU zero flag selects between the incremented PC and the Jump or Branch Target; control signals are forced to 0 (Bubble) on a taken branch or jump.]

    Branch Delay = 2 cycles

    Branch target & outcome are computed in the ALU stage

  • VL722 55 8/16/2015

    Predict Branch NOT Taken

    Branches can be predicted to be NOT taken

    If branch outcome is NOT taken then

    Next1 and Next2 instructions can be executed

    Do not convert Next1 & Next2 into bubbles

    No wasted cycles

    Else, convert Next1 and Next2 into bubbles → 2 wasted cycles

                     cc1   cc2   cc3   cc4   cc5   cc6   cc7
    Beq R9,R10,L1    IF    Reg   ALU   (NOT taken)
    Next1                  IF    Reg   ALU   DM    Reg
    Next2                        IF    Reg   ALU   DM    Reg

  • VL722 56 8/16/2015

    Reducing the Delay of Branches

    Branch delay can be reduced from 2 cycles to just 1 cycle

    Branches can be determined earlier in the Decode stage

    A comparator is used in the decode stage to determine the branch decision, i.e. whether the branch is taken or not

    Because forwarded data must arrive before the comparison, the delay of the decode stage increases, which can also increase the clock cycle time

    Only one instruction that follows the branch is fetched

    If the branch is taken then only one instruction is flushed

    We should insert a bubble after jump or taken branch

    This will convert the next instruction into a NOP

  • VL722 57 8/16/2015

    Reducing Branch Delay to 1 Cycle

    [Datapath figure: same pipeline, but an equality comparator (=, Zero) is moved into the decode stage, and the Next PC logic selects the Jump or Branch Target for J, Beq, Bne. A Reset signal converts the next instruction after a jump or taken branch into a bubble (control signals = 0). Drawback: a longer cycle, since data must be forwarded and then compared within the decode stage.]

  • VL722 58 8/16/2015

    Based on SPEC benchmarks on DLX

    Branches occur with a frequency of 14% to 16% in integer

    programs and 3% to 12% in floating point programs.

    About 75% of the branches are forward branches

    60% of forward branches are taken

    85% of backward branches are taken

    67% of all branches are taken

    Why are branches (especially backward branches) more

    likely to be taken than not taken?

    Branch Behavior in Programs

  • VL722 59 8/16/2015

    #1: Stall until branch direction is clear

    #2: Predict Branch Not Taken

    Execute successor instructions in sequence

    Flush instructions in pipeline if branch actually taken

    Advantage of late pipeline state update

    33% DLX branches not taken on average

    PC+4 already calculated, so use it to get next instruction

    #3: Predict Branch Taken

    67% DLX branches taken on average

    #4: Define branch to take place AFTER the n following instructions (Delayed branch)

    Four Branch Hazard Alternatives

  • VL722 60 8/16/2015

    Next . . .

    Pipelining versus Serial Execution

    Pipelined Datapath and Control

    Pipeline Hazards

    Data Hazards and Forwarding

    Load Delay, Hazard Detection, and Stall

    Control Hazards

    Delayed Branch and Dynamic Branch Prediction

  • VL722 61 8/16/2015

    Branch Hazard Alternatives

    Predict Branch Not Taken (previously discussed)

    Successor instruction is already fetched

    Do NOT Flush instruction after branch if branch is NOT taken

    Flush only instructions appearing after Jump or taken branch

    Delayed Branch

    Define branch to take place AFTER the next instruction

    Compiler/assembler fills the branch delay slot (for 1 delay cycle)

    Dynamic Branch Prediction

    Loop branches are taken most of the time

    Must reduce branch delay to 0, but how?

    How to predict branch behavior at runtime?

  • VL722 62 8/16/2015

    Define branch to take place after the next instruction

    Instruction in branch delay slot is always executed

    The compiler tries to move a useful instruction into the delay slot.

    From before the Branch: Always helpful when possible

    For a 1-cycle branch delay, we have one delay slot

    branch instruction

    branch delay slot (next instruction)

    branch target (if branch taken)

    Compiler fills the branch delay slot

    By selecting an independent instruction

    From before the branch

    If no independent instruction is found

    Compiler fills delay slot with a NO-OP

    Delayed Branch

    Before scheduling:               After scheduling (add fills the delay slot):

    label: . . .                     label: . . .

           add R10,R11,R12                  beq R17,R16,label

           beq R17,R16,label                add R10,R11,R12

           <Delay Slot>

  • VL722 63 8/16/2015

    Delayed Branch From the Target: Helps when branch is taken. May duplicate instructions

    Before:                      After:

        ADD  R2, R1, R3              ADD  R2, R1, R3
        BEQZ R2, L1                  BEQZ R2, L2
        <delay slot>                 SUB  R4, R5, R6
        -                            -
    L1: SUB  R4, R5, R6          L1: SUB  R4, R5, R6
    L2:                          L2:

    Instructions between BEQ and SUB (in fall through) must not use R4.

    Why is instruction at L1 duplicated? What if R5 or R6 changed?

    (Because, L1 can be reached by another path !!)

  • VL722 64 8/16/2015

    Delayed Branch From Fall Through: Helps when branch is not taken (default case)

    Before:                      After:

        ADD  R2, R1, R3              ADD  R2, R1, R3
        BEQZ R2, L1                  BEQZ R2, L1
        <delay slot>                 SUB  R4, R5, R6
        SUB  R4, R5, R6              -
        -
    L1:                          L1:

    Instructions at target (L1 and after) must not use R4 till set (written) again.

  • VL722 65 8/16/2015

    Compiler effectiveness for single branch delay slot:

    Fills about 60% of branch delay slots

    About 80% of instructions executed in branch delay slots are useful in computation

    About 50% (i.e., 60% x 80%) of slots usefully filled

    Canceling branches or nullifying branches

    Include a prediction of whether the branch is taken or not taken

    If the prediction is correct, the instruction in the delay slot is executed

    If the prediction is incorrect, the instruction in the delay slot is quashed.

    Allow more slots to be filled from the target address or fall through

    Filling delay slots

  • VL722 66 8/16/2015

    New meaning for branch instruction

    Branching takes place after next instruction (Not immediately!)

    Impacts software and compiler

    Compiler is responsible to fill the branch delay slot

    For a 1-cycle branch delay → one branch delay slot

    However, modern processors are deeply pipelined

    Branch penalty is multiple cycles in deeper pipelines

    Multiple delay slots are difficult to fill with useful instructions

    MIPS used delayed branching in earlier pipelines

    However, delayed branching is not useful in recent processors

    Drawback of Delayed Branching

  • VL722 67 8/16/2015

    Two strategies examined

    Backward branch predict taken, forward branch not taken

    Profile-based prediction: record branch behavior, predict branch based on prior run

    [Chart: instructions per mispredicted branch (log scale, 1 to 100,000) for SPEC benchmarks alvinn, compress, doduc, espresso, gcc, hydro2d, mdljsp2, ora, swm256, and tomcatv, comparing profile-based prediction with direction-based prediction.]

    Compiler Static Prediction of Taken/Untaken Branches

  • VL722 68 8/16/2015

    Zero-Delayed Branching

    How to achieve zero delay for a jump or a taken branch?

    Jump or branch target address is computed in the ID stage

    Next instruction has already been fetched in the IF stage

    Solution

    Introduce a Branch Target Buffer (BTB) in the IF stage

    Store the target address of recent branch and jump instructions

    Use the lower bits of the PC to index the BTB

    Each BTB entry stores Branch/Jump address & Target Address

    Check the PC to see if the instruction being fetched is a branch

    Update the PC using the target address stored in the BTB

  • VL722 69 8/16/2015

    Branch Target Buffer

    The branch target buffer is implemented as a small cache

    Stores the target address of recent branches and jumps

    We must also have prediction bits

    To predict whether branches are taken or not taken

    The prediction bits are dynamically determined by the hardware

    [Figure: Branch Target & Prediction Buffer. The low-order bits of the PC are used as an index; each entry holds the address of a recent branch, its target address, and predict bits. A mux selects the next PC: the incremented PC, or the stored target address when the entry hits and predict_taken is set.]

  • VL722 70 8/16/2015

    Dynamic Branch Prediction

    Prediction of branches at runtime using prediction bits

    Prediction bits are associated with each entry in the BTB

    Prediction bits reflect the recent history of a branch instruction

    Typically a few prediction bits (1 or 2) are used per entry

    We don't know whether the prediction is correct until the branch is resolved

    If correct prediction

    Continue normal execution → no wasted cycles

    If incorrect prediction (misprediction)

    Flush the instructions that were incorrectly fetched → wasted cycles

    Update prediction bits and target address for future use
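The lookup/update flow above can be sketched as a small simulation. The entry layout, 16-entry table size, and function names here are illustrative assumptions, not the lecture's exact design.

```python
# Direct-mapped Branch Target Buffer sketch. Each entry stores the branch
# address (as a tag), its target address, and a taken/not-taken prediction.
BTB_SIZE = 16                     # number of entries (assumed)
btb = [None] * BTB_SIZE

def btb_lookup(pc):
    """IF stage: return the predicted next PC."""
    entry = btb[(pc >> 2) % BTB_SIZE]    # index with low-order PC bits
    if entry is not None and entry[0] == pc and entry[2]:
        return entry[1]                  # hit and predicted taken
    return pc + 4                        # miss or predicted not taken

def btb_update(pc, target, taken):
    """After the branch resolves: record address, target, and prediction."""
    btb[(pc >> 2) % BTB_SIZE] = (pc, target, taken)

btb_update(0x400100, 0x400200, True)
print(hex(btb_lookup(0x400100)))  # 0x400200 (hit, predicted taken)
print(hex(btb_lookup(0x400104)))  # 0x400108 (miss: sequential fetch)
```

On a misprediction the real hardware additionally flushes the wrongly fetched instructions; the sketch only models the next-PC selection.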

  • VL722 71 8/16/2015

    Dynamic Branch Prediction (cont'd)

    IF: Use PC to address Instruction Memory and Branch Target Buffer

        Found BTB entry with predict taken?
            No  → Increment PC
            Yes → PC = target address

    ID: Jump or taken branch, but no BTB entry predicted taken?

        Mispredicted Jump/branch: enter jump/branch address, target address, and set prediction in BTB entry; flush fetched instructions; restart PC at target address

        Otherwise → Normal Execution

    EX: Predicted taken, but branch not taken?

        Mispredicted branch: update prediction bits; flush fetched instructions; restart PC after branch

        Predicted taken and actually taken → Correct Prediction: no stall cycles

  • VL722 72 8/16/2015

    Prediction is just a hint that is assumed to be correct

    If incorrect then fetched instructions are flushed

    1-bit prediction scheme is simplest to implement

    1 bit per branch instruction (associated with BTB entry)

    Record last outcome of a branch instruction (Taken/Not taken)

    Use last outcome to predict future behavior of a branch

    1-bit Prediction Scheme

    State machine (two states):

        Predict Taken:     Taken → stay in Predict Taken; Not Taken → go to Predict Not Taken
        Predict Not Taken: Not Taken → stay in Predict Not Taken; Taken → go to Predict Taken

  • VL722 73 8/16/2015

    1-Bit Predictor: Shortcoming

    Inner loop branch mispredicted twice!

    Mispredict as taken on last iteration of inner loop

    Then mispredict as not taken on first iteration of inner loop next time around

    outer:  . . .
    inner:  . . .
            bne . . ., . . ., inner
            . . .
            bne . . ., . . ., outer
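A minimal simulation (function and trace names are my own) makes the double misprediction concrete: a 1-bit predictor remembers only the last outcome, so a loop branch that runs T T T NT mispredicts both the loop exit and the re-entry.

```python
def simulate_1bit(outcomes, initial=True):
    """Count mispredictions of a 1-bit predictor (state = last outcome)."""
    state = initial              # True = predict taken
    mispredicts = 0
    for taken in outcomes:
        if state != taken:
            mispredicts += 1
        state = taken            # 1-bit scheme: remember only the last outcome
    return mispredicts

# Inner loop of 4 iterations, executed twice: T T T NT  T T T NT
trace = [True, True, True, False] * 2
print(simulate_1bit(trace))  # 3: each loop exit, plus the re-entry after an exit
```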

  • VL722 74 8/16/2015

    1-bit prediction scheme has a performance shortcoming

    2-bit prediction scheme works better and is often used

    4 states: strong and weak predict taken / predict not taken

    Implemented as a saturating counter

    Counter is incremented to max=3 when branch outcome is taken

    Counter is decremented to min=0 when branch is not taken

    2-bit Prediction Scheme

    State machine (four states of the saturating counter):

        Strong Predict Taken (3):     Taken → stay; Not Taken → Weak Predict Taken
        Weak Predict Taken (2):       Taken → Strong Predict Taken; Not Taken → Weak Predict Not Taken
        Weak Predict Not Taken (1):   Taken → Weak Predict Taken; Not Taken → Strong Predict Not Taken
        Strong Predict Not Taken (0): Taken → Weak Predict Not Taken; Not Taken → stay
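The saturating counter above can be written directly (a sketch with assumed names); on the same T T T NT loop pattern it mispredicts only the loop exits, not the re-entries.

```python
def simulate_2bit(outcomes, counter=3):
    """Count mispredictions of a 2-bit saturating counter (0..3)."""
    mispredicts = 0
    for taken in outcomes:
        if (counter >= 2) != taken:        # states 2-3 predict taken
            mispredicts += 1
        if taken:
            counter = min(3, counter + 1)  # saturate at max = 3
        else:
            counter = max(0, counter - 1)  # saturate at min = 0
    return mispredicts

trace = [True, True, True, False] * 2      # same nested-loop pattern
print(simulate_2bit(trace))  # 2: only the not-taken loop exits mispredict
```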

  • VL722 75 8/16/2015

    Alternative state machine

    2-bit Prediction Scheme

  • 1-Bit Branch History Table

    Example:

    main()
    {
      int i, j;
      while (1)
        for (i=1; i<

  • 2-Bit Counter Scheme: Prediction is made based on the last two branch outcomes

    Each BHT entry consists of 2 bits, usually a 2-bit counter, which is associated with the state of an automaton (many different automata are possible)

    [Figure: the PC indexes the BHT; the selected entry's automaton state gives the Prediction, and the actual branch outcome updates the automaton.]

  • 2-Bit BHT: 2-bit scheme where the prediction changes only after two consecutive mispredictions

    MSB of the state symbol represents the prediction; 1: TAKEN, 0: NOT TAKEN

    States: Predict TAKEN (11), Predict TAKEN (10), Predict NOT TAKEN (01), Predict NOT TAKEN (00); a T outcome moves the state toward 11, an NT outcome moves it toward 00

    Example trace for the repeating outcome pattern 1, 1, 1, 0 (assume initial counter value 11; O: correct, X: mispredict):

    branch outcome   counter value   prediction   new counter value   correctness
    1                11              T            11                  O
    1                11              T            11                  O
    1                11              T            11                  O
    0                11              T            10                  X
    1                10              T            11                  O
    1                11              T            11                  O
    1                11              T            11                  O
    0                11              T            10                  X
    1                10              T            11                  O
    1                11              T            11                  O
    1                11              T            11                  O
    0                11              T            10                  X

    Prediction accuracy for this branch: 75%

  • 79

    Case for Correlating Predictors

    Basic two-bit predictor schemes

    use recent behavior of a branch to predict its future behavior

    Improve the prediction accuracy

    look also at recent behavior of other branches

    if (aa == 2) aa = 0;

    if (bb == 2) bb = 0;

    if (aa != bb) { }

        subi R3, R1, #2
        bnez R3, L1        ; b1 (aa != 2)
        add  R1, R0, R0    ; aa = 0
    L1: subi R3, R2, #2
        bnez R3, L2        ; b2 (bb != 2)
        add  R2, R0, R0    ; bb = 0
    L2: sub  R3, R1, R2
        beqz R3, L3        ; b3 (aa == bb)

    b3 is correlated with b1 and b2;

    If b1 and b2 are both untaken,

    then b3 will be taken.

    =>

    Use correlating predictors or

    two-level predictors.

  • 80

    Branch Correlation

    Branch direction

    Not independent

    Correlated to the path taken

    Example: Decision of b3 can be surely known beforehand if the path to b3 is 1-1

    Track path using a 2-bit register

    Code snippet:

    if (aa==2)    // b1
        aa = 0;
    if (bb==2)    // b2
        bb = 0;
    if (aa!=bb) { // b3
        . . .
    }

    [Figure: decision tree b1 → b2 → b3 with edges labeled 0 (NT) and 1 (T). Paths to b3: A: 0-0 (aa=0, bb=0), B: 0-1 (aa=0, bb!=2), C: 1-0 (aa!=2, bb=0), D: 1-1 (aa!=2, bb!=2).]

  • Correlating Branches

    Example: if (d==0)
                 d=1;
             if (d==1)
                 . . .

        BNEZ R1, b1      ; (b1) (d != 0)
        ADDI R1, R1, #1  ; since d==0, make d=1
    b1: SUBI R3, R1, #1
        BNEZ R3, b2      ; (b2) (d != 1)
        .....
    b2:

    Branch behavior as a function of the initial value of d:

    Initial value of d   d==0?   b1   Value of d before b2   d==1?   b2
    0                    Y       NT   1                      Y       NT
    1                    N       T    1                      Y       NT
    2                    N       T    2                      N       T

    If b1 is NT, then b2 is NT

    1-bit self-history predictor on the sequence d = 2, 0, 2, 0, 2, 0, ...:

    d=?   b1 prediction   b1 action   new b1 prediction   b2 prediction   b2 action   new b2 prediction
    2     NT              T           T                   NT              T           T
    0     T               NT          NT                  T               NT          NT
    2     NT              T           T                   NT              T           T
    0     T               NT          NT                  T               NT          NT

    All branches are mispredicted

  • Correlating Branches (with 1 bit of global correlation)

    Example: if (d==0)
                 d=1;
             if (d==1)
                 . . .

        BNEZ R1, b1      ; branch b1 (d != 0)
        ADDI R1, R1, #1  ; since d==0, make d=1
    b1: SUBI R3, R1, #1
        BNEZ R3, b2      ; branch b2 (d != 1)
        .....
    b2:

    Each branch keeps two self-prediction bits XX = (prediction if the last branch was NT) / (prediction if the last branch was T):

    Prediction bits (XX)   Prediction, if last branch action was NT   Prediction, if last branch action was T
    NT/NT                  NT                                         NT
    NT/T                   NT                                         T
    T/NT                   T                                          NT
    T/T                    T                                          T

    Trace for d = 2, 0, 2, 0, ... (initial self-prediction bits NT/NT; initial last branch was NT; the prediction used is the bit selected by the global history):

    d=?   b1 prediction   b1 action   new b1 prediction   b2 prediction   b2 action   new b2 prediction
    2     NT/NT           T           T/NT                NT/NT           T           NT/T
    0     T/NT            NT          T/NT                NT/T            NT          NT/T
    2     T/NT            T           T/NT                NT/T            T           NT/T
    0     T/NT            NT          T/NT                NT/T            NT          NT/T

    Misprediction occurs only on the first prediction

  • 8/16/2015 83

    Local/Global Predictors

    Instead of maintaining a counter for each branch to capture the common case,

    Maintain a counter for each branch and surrounding pattern

    If the surrounding pattern belongs to the branch being predicted, the predictor is referred to as a local predictor

    If the surrounding pattern includes neighboring branches, the predictor is referred to as a global predictor

  • 8/16/2015 84

    Correlated Branch Prediction

    Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table

    In general, an (m,n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters

    Thus, the old 2-bit BHT is a (0,2) predictor

    Global Branch History: an m-bit shift register keeping the T/NT status of the last m branches.

    Each entry in the table has 2^m n-bit predictors.
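A sketch of the (m,n) structure, here with m = 2 and n = 2; the table size, initial counter values, and indexing are assumptions for illustration.

```python
M, N = 2, 2                        # (2,2) predictor
TABLE = 256                        # entries, indexed by low-order PC bits
MAX = (1 << N) - 1                 # saturating counter maximum (3 for n=2)

history = 0                        # m-bit global branch history
# one n-bit counter per (entry, history pattern), initialised weakly taken
counters = [[MAX // 2 + 1] * (1 << M) for _ in range(TABLE)]

def predict(pc):
    """The global history selects which of the entry's counters to use."""
    return counters[pc % TABLE][history] > MAX // 2

def update(pc, taken):
    """Train the selected counter, then shift the outcome into the history."""
    global history
    c = counters[pc % TABLE]
    c[history] = min(MAX, c[history] + 1) if taken else max(0, c[history] - 1)
    history = ((history << 1) | int(taken)) & ((1 << M) - 1)
```

Each static branch thus keeps 2^m counters and the global history picks among them, which is what lets the predictor learn a pattern like "b3 is taken when b1 and b2 were both not taken".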

  • 8/16/2015 85

    Correlating Branches

    (2,2) predictor: the behavior of the two most recent branches selects between four predictions for the next branch, updating just that prediction

    [Figure: the branch address and a 2-bit global branch history together index a table of 2-bit per-branch predictors, producing the Prediction.]

  • 86

    Correlated Branch Predictor

    (M,N) correlation scheme

    M: shift register size (# bits)

    N: N-bit counter

    [Figure: left, the plain 2-bit saturating counter scheme — the Branch PC hashes into a table of 2-bit counters, giving the Prediction. Right, the (2,2) correlation scheme — a 2-bit shift register of global branch history selects one of four 2-bit counters in each hashed Branch PC entry (2^w rows for a w-bit hash); subsequent branch directions are shifted into the history register.]

  • 8/16/2015 87

    Accuracy of Different Schemes

    [Chart: frequency of mispredictions on SPEC89 benchmarks (nasa7, matrix300, doducd, spice, fpppp, gcc, espresso, eqntott, li, tomcatv) for three schemes: a 4,096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1,024-entry (2,2) BHT. Misprediction rates range from 0% to 11%; the (2,2) predictor is generally the most accurate.]

  • 88

    Two-Level Branch Predictor

    Generalized correlated branch predictor

    1st level keeps branch history in Branch History Register (BHR)

    2nd level segregates pattern history in Pattern History Table (PHT)

    [Figure: an N-bit Branch History Register (BHR) holding the last branch outcomes (Rc-k ... Rc-1, shifted left on update) indexes a Pattern History Table (PHT) with 2^N entries (patterns 00..00 through 11..11). The current state of the indexed entry's FSM gives the Prediction; the actual branch outcome Rc drives the PHT update logic.]

  • 89

    Branch History Register

    An N-bit Shift Register → 2^N patterns in PHT

    Shift in branch outcomes: 1 = taken, 0 = not taken

    First-in First-Out

    BHR can be

    Global

    Per-set

    Local (Per-address)

  • 90

    Pattern History Table

    2^N entries addressed by the N-bit BHR

    Each entry keeps a counter (2-bit or more) for prediction

    Counter update: the same as 2-bit counter

    Can be initialized in alternate patterns (01, 10, 01, 10, ..)

    Alias (or interference) problem
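The BHR/PHT pair can be sketched as follows (sizes and names are assumed): a repeating T T T NT pattern becomes fully predictable once each 4-bit history maps to its own counter.

```python
N = 4                          # BHR width (assumed)
bhr = 0                        # shift register of the last N outcomes
pht = [2] * (1 << N)           # 2**N two-bit counters, initialised weakly taken

def predict():
    return pht[bhr] >= 2       # MSB of the selected 2-bit counter

def update(taken):
    global bhr
    pht[bhr] = min(3, pht[bhr] + 1) if taken else max(0, pht[bhr] - 1)
    bhr = ((bhr << 1) | int(taken)) & ((1 << N) - 1)   # shift the outcome in

for _ in range(8):                          # warm up on the loop pattern
    for taken in (True, True, True, False):
        update(taken)

misses = 0                                  # measure one more period
for taken in (True, True, True, False):
    if predict() != taken:
        misses += 1
    update(taken)
print(misses)  # 0: every history pattern has learned its outcome
```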

  • 91

    Two-Level Branch Prediction

    [Example: PC = 0x4001000C selects a set of PHT entries; BHR = 0110 selects the entry within the set. The entry's 2-bit counter is 10, so MSB = 1 → Predict Taken.]

  • 92

    Predictor Update (Actually, Not Taken)

    [Example continued: the branch resolves as Not Taken. The predictor is updated after the branch is resolved: the selected 2-bit counter is decremented (10 → 01), and the BHR shifts in the outcome (0110 → 1100).]

  • 8/16/2015 93

    A local predictor might work well for some branches or programs, while a global predictor might work well for others

    Provide one of each and maintain another predictor to identify which predictor is best for each branch

    [Figure: Tournament Predictor. The Branch PC indexes a table of 2-bit saturating counters whose output steers a MUX choosing between a Local Predictor and a Global Predictor.]

    Tournament Predictors

  • 8/16/2015 94

    Tournament Predictors: Multilevel branch predictor

    Selector for the Global and Local predictors of correlating branch prediction

    Use n-bit saturating counter to choose between predictors

    Usual choice between global and local predictors
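The selector itself is just another table of saturating counters; a sketch (the names and the 256-entry size are assumptions):

```python
chooser = [2] * 256            # 2-bit counters: 0-1 favour local, 2-3 global

def choose(pc, local_pred, global_pred):
    """Pick the final prediction from the two component predictors."""
    return global_pred if chooser[pc % 256] >= 2 else local_pred

def train_chooser(pc, local_pred, global_pred, taken):
    """Train only on disagreement, toward whichever predictor was right."""
    if local_pred != global_pred:
        i = pc % 256
        if global_pred == taken:
            chooser[i] = min(3, chooser[i] + 1)
        else:
            chooser[i] = max(0, chooser[i] - 1)

print(choose(5, False, True))              # True: starts weakly favouring global
train_chooser(5, False, True, taken=False)
print(choose(5, False, True))              # False: local was right, now favoured
```

Training only on disagreement means the chooser tracks which predictor is better per branch, rather than the raw branch outcome.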

  • 95

    Advantage of tournament predictor is the ability to select the right predictor for a particular branch

    A typical tournament predictor selects global predictor 40% of the time for SPEC integer benchmarks

    AMD Opteron and Phenom use tournament style

    Tournament Predictors

  • Accuracy v. Size (SPEC89)

    [Chart: conditional branch misprediction rate (0% to 10%) versus total predictor size (0 to 128 Kbits) for local 2-bit counters, a correlating (2,2) scheme, and a tournament predictor; the tournament predictor gives the lowest misprediction rate at every size.]

  • 97

    Based on predictors used in the Core 2 Duo chip

    Combines three different predictors: two-bit, global history, and loop exit predictor

    Loop exit predictor: uses a counter to predict the exact number of taken branches (number of loop iterations) for a branch that is detected as a loop branch

    Tournament: tracks the accuracy of each predictor

    Main problem of speculation: a mispredicted branch may lead to another branch being mispredicted!

    Tournament Predictors (Intel Core i7)

  • 98

    Branch Prediction is More Important Today

    Conditional branches still comprise about 20% of instructions

    Correct predictions are more important today - why?

    pipelines are deeper

    a branch is not resolved until more cycles after fetching - therefore the misprediction penalty is greater

    cycle times smaller - more emphasis on throughput (performance)

    more functionality between fetch & execute

    multiple instruction issue (superscalars & VLIW)

    branch occurs almost every cycle

    flushing & refetching more instructions

    object-oriented programming

    more indirect branches - which are harder to predict

    dual of Amdahl's Law

    other forms of pipeline stalling are being addressed - so the portion of CPI due to branch delays is relatively larger

    All this means that the potential stalling due to branches is greater

  • 99

    Branch Prediction is More Important Today

    On the other hand,

    Chips are denser so we can consider sophisticated HW solutions

    Hardware cost is small compared to the performance gain

  • 100

    Directions in Branch Prediction

    1: Improve the prediction

    correlated (2-level) predictor (Pentium III - 512 entries, 2-bit, Pentium Pro - 4 history bits)

    hybrid local/global predictor (Alpha 21264)

    confidence predictors

    2: Determine the target earlier

    branch target buffer (Pentium Pro, IA-64 Itanium)

    next address in I-cache (Alpha 21264, UltraSPARC)

    return address stack (Alpha 21264, IA-64 Itanium, MIPS R10000, Pentium Pro, UltraSPARC-3)

    3: Reduce misprediction penalty

    fetch both instruction streams (IBM mainframes, SuperSPARC)

    4: Eliminate the branch

    predicated execution (IA-64 Itanium, Alpha 21264)

  • VL722 101 8/16/2015

    Exceptions: Events other than branches or jumps that change the normal flow of instruction execution. Some types of exceptions:

    I/O Device request

    Invoking an OS service from user program

    Tracing Instruction execution

    Breakpoint (programmer requested interrupt)

    Integer arithmetic overflow

    FP arithmetic anomaly

    Page fault (page not in main memory)

    Misaligned memory accesses

    Memory protection violation

    Use of undefined instruction

    Hardware malfunction

    Power failure

    Pipelining Complications

  • VL722 102 8/16/2015

    Exceptions: Events other than branches or jumps that change the normal flow of instruction execution.

    5 instructions executing in 5 stage pipeline

    How to stop the pipeline? Who caused the interrupt?

    How to restart the pipeline?

    Stage   Problems causing the interrupts
    IF      Page fault on instruction fetch; misaligned memory access; memory-protection violation
    ID      Undefined or illegal opcode
    EX      Arithmetic interrupt
    MEM     Page fault on data fetch; misaligned memory access; memory-protection violation

    Pipelining Complications

  • VL722 103 8/16/2015

    Simultaneous exceptions in more than one pipeline stage, e.g.,

    LOAD with data page fault in MEM stage

    ADD with instruction page fault in IF stage

    Solution #1

    Interrupt status vector per instruction

    Defer check until last stage, kill state update if exception

    Solution #2

    Interrupt ASAP

    Restart everything that is incomplete

    Pipelining Complications

  • VL722 104 8/16/2015

    Our DLX pipeline only writes results at the end of the instruction's execution. Not all processors do this.

    Address modes: Auto-increment causes register change during instruction execution

    Interrupts: need to restore register state

    Adds WAR and WAW hazards, since writes happen not only in the last stage

    Memory-Memory Move Instructions

    Must be able to handle multiple page faults

    VAX and x86 store values temporarily in registers

    Condition Codes

    Need to detect the last instruction to change condition codes

    Pipelining Complications

  • VL722 105 8/16/2015

    Fallacies and Pitfalls

    Pipelining is easy!

    The basic idea is easy

    The devil is in the details

    Detecting data hazards and stalling pipeline

    Poor ISA design can make pipelining harder

    Complex instruction sets (Intel IA-32)

    Significant overhead to make pipelining work

    IA-32 micro-op approach

    Complex addressing modes

    Register update side effects, memory indirection

  • VL722 106 8/16/2015

    Pipeline Hazards Summary

    Three types of pipeline hazards

    Structural hazards: conflicts using a resource during same cycle

    Data hazards: due to data dependencies between instructions

    Control hazards: due to branch and jump instructions

    Hazards limit the performance and complicate the design

    Structural hazards: eliminated by careful design or more hardware

    Data hazards are eliminated by data forwarding

    However, load delay cannot be completely eliminated

    Delayed branching can be a solution for control hazards

    BTB with branch prediction can reduce branch delay to zero

    Branch misprediction should flush the wrongly fetched instructions