
Grupo de Arquitectura de Computadores, Comunicaciones y Sistemas

COMPUTER ARCHITECTURE

Instruction Level Parallelism Exploitation

Contents

Introduction to pipelining.

Hazards.

  Structural hazards.

  Data hazards.

  Control hazards.

    Compile-time alternatives.

    Run-time alternatives.

Multi-cycle operations.

Computer Architecture - 2014 - J. Daniel García

Introduction to pipelining

Pipeline

Implementation technique: multiple instructions overlap their execution over time.

A costly operation is divided into several simpler sub-operations.

Each sub-operation executes in its own stage.

Effects:

Throughput is increased.

Latency does not decrease.

           IF    ID    EX    M     W
Cycle 1    IF1
Cycle 2    IF2   ID1
Cycle 3    IF3   ID2   EX1
Cycle 4    IF4   ID3   EX2   M1
Cycle 5    IF5   ID4   EX3   M2    W1
Cycle 6    IF6   ID5   EX4   M3    W2
Cycle 7    IF7   ID6   EX5   M4    W3
Cycle 8    IF8   ID7   EX6   M5    W4

Filling the pipeline: cycles 1 to 4.

Latency: 5 cycles.

Throughput (ideal): 1 instruction per cycle.
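The timing of an ideal five-stage pipeline can be reproduced with a few lines of Python. This is only a sketch under the assumption of no stalls; the names `pipeline_schedule` and `total_cycles` are illustrative and not part of the course material.

```python
STAGES = ["IF", "ID", "EX", "M", "W"]

def pipeline_schedule(n_instructions, stages=STAGES):
    """Map each clock cycle to the stage labels active in it (ideal pipeline, no stalls)."""
    schedule = {}
    for i in range(1, n_instructions + 1):
        for s, stage in enumerate(stages):
            cycle = i + s  # instruction i occupies stage number s in cycle i + s
            schedule.setdefault(cycle, []).append(f"{stage}{i}")
    return schedule

def total_cycles(n_instructions, depth=len(STAGES)):
    """Cycles needed to complete n instructions: fill the pipeline, then one per cycle."""
    return depth + n_instructions - 1

schedule = pipeline_schedule(8)
for cycle in range(1, 9):
    print(f"Cycle {cycle}:", " ".join(schedule[cycle]))
```

A single instruction still needs 5 cycles (latency is unchanged), while a long sequence approaches one completed instruction per cycle.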

Pipeline stages

Instruction Fetch (IF)

Send PC value to memory.

Fetch current instruction.

Update PC (e.g. add 4).

[Figure: five-stage datapath (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back) showing the PC adder (+4, Next SEQ PC, Next PC), instruction memory, register file (RS1, RS2, RD), sign extend (Imm), ALU with Zero? test, data memory (LMD), multiplexers and WB data.]

Instruction Decode (ID)

Decode current instruction.

Read the values of the referenced source registers.

Perform equality test on register values.

Sign-extend offset field of instruction.

Compute possible branch target (offset + incremented PC).


Execution / Effective Address (EX)

ALU operates on operands:

Memory reference: ALU adds base register and offset to form effective address.

Register-Register: ALU performs operation on values from register file.

Register-Immediate: ALU performs operation on value from register file and sign-extended immediate.


Memory Access (MEM)

Load instructions:

Read using effective address computed in EX.

Store instructions:

Write the data of the second source register (read from the register file in ID) to the effective address computed in EX.

Write-back (WB)

Write result (from memory or ALU) into register file.
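The work done in the five stages can be sketched as a small Python function that runs one instruction through IF, ID, EX, MEM and WB in sequence (no overlap). The tuple encoding `(op, rd, rs1, rs2, imm)`, the four opcodes and the function name `execute` are assumptions made for illustration only.

```python
def execute(instr, regs, mem, pc):
    """Run one instruction through the five stages, one after another."""
    # IF: the caller fetched `instr` at address `pc`; update the PC (add 4).
    next_pc = pc + 4
    # ID: decode and read the source registers (immediate assumed already sign-extended).
    op, rd, rs1, rs2, imm = instr
    a, b = regs[rs1], regs[rs2]
    # EX: effective address or ALU operation.
    if op in ("LD", "SD"):
        alu_out = a + imm          # base register + offset
    elif op == "ADD":
        alu_out = a + b            # register-register
    elif op == "ADDI":
        alu_out = a + imm          # register-immediate
    else:
        raise ValueError(f"unsupported opcode {op}")
    # MEM: only loads and stores access data memory.
    loaded = None
    if op == "LD":
        loaded = mem[alu_out]
    elif op == "SD":
        mem[alu_out] = b           # second source register is the store data
    # WB: write the result (from memory or ALU) into the register file.
    if op == "LD":
        regs[rd] = loaded
    elif op in ("ADD", "ADDI"):
        regs[rd] = alu_out
    return next_pc

regs = [0] * 8
regs[2] = 100
mem = {108: 7}
pc = execute(("LD", 1, 2, 0, 8), regs, mem, 0)      # R1 <- mem[R2 + 8]
pc = execute(("ADDI", 3, 1, 0, 5), regs, mem, pc)   # R3 <- R1 + 5
pc = execute(("SD", 0, 2, 3, 0), regs, mem, pc)     # mem[R2 + 0] <- R3
```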

Pipeline over time

[Figure: pipeline diagram with instructions in program order on the vertical axis and time in clock cycles (Cycle 1 to Cycle 7) on the horizontal axis; each of the four instructions goes through Ifetch, Reg, ALU, DMem, Reg.]

Pipeline effects

An n-stage pipeline needs n times the memory bandwidth of the non-pipelined version at the same clock rate.

Solution: caching, caching, caching, …

Separating data and instruction caches removes some memory conflicts.

Instructions in the pipeline should not try to use the same resource at the same time.

Solution: introduce registers at every stage boundary.

Stages communication

[Figure: the five-stage datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB between consecutive stages; Next SEQ PC and the destination register (RD) are carried through the pipeline registers.]

Example

Non-pipelined processor:

Clock cycle: 1 ns.

ALU operations (40%) and branches (20%): 4 cycles.

Memory operations (40%): 5 cycles.

Pipeline overhead: 0.2 ns.

What is the speedup of the pipelined version?

\[ t_{orig} = \text{clock cycle} \times \text{CPI} = 1\,\text{ns} \times (0.6 \times 4 + 0.4 \times 5) = 4.4\,\text{ns} \]

\[ t_{pipeline} = 1\,\text{ns} + 0.2\,\text{ns} = 1.2\,\text{ns} \]

\[ S = \frac{4.4\,\text{ns}}{1.2\,\text{ns}} \approx 3.7 \]
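The arithmetic of this example can be checked with a short Python sketch. The instruction mix and timings come from the slide; the variable and function names are illustrative.

```python
def average_cpi(mix):
    """Weighted CPI for a list of (frequency, cycles) pairs."""
    return sum(freq * cycles for freq, cycles in mix)

# (frequency, cycles): ALU 40% and branches 20% take 4 cycles, memory 40% takes 5.
mix = [(0.4, 4), (0.2, 4), (0.4, 5)]
clock_ns = 1.0
overhead_ns = 0.2

t_orig = clock_ns * average_cpi(mix)   # average instruction time, non-pipelined
t_pipeline = clock_ns + overhead_ns    # one instruction per (slower) cycle
speedup = t_orig / t_pipeline
print(f"t_orig = {t_orig:.1f} ns, t_pipeline = {t_pipeline:.1f} ns, S = {speedup:.2f}")
```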

Pipeline hazards

Structural hazards.

Data hazards.

Control hazards.

Hazards

Hazard: a situation that prevents the next instruction from starting in its expected clock cycle.

Hazards reduce the performance of pipelined architectures.

Hazards:

Structural hazard.

Data hazard.

Control hazard.

Simple approach for hazards:

Stall the instruction flow.

Already started instructions will continue.

Structural hazard

Happens when the processor cannot support all possible instruction sequences.

Two pipeline stages need to use the same resource in the same cycle.

Reasons:

Functional units which are not fully pipelined.

Functional units which are not duplicated.

These hazards can be avoided at the cost of more expensive hardware.

Impact of stalls

Pipeline speedup.

Pipeline ideal CPI is 1.

Need to add stall cycles per instruction.

Non-pipelined processor:

CPI = 1, but with a much longer clock cycle.

Clock cycle is N times the pipelined cycle.

N is the pipeline depth.

\[ S = \frac{\text{average instr. time}_{\text{non-pipelined}}}{\text{average instr. time}_{\text{pipelined}}} = \frac{\text{CPI}_{\text{non-pipelined}} \times \text{cycle}_{\text{non-pipelined}}}{\text{CPI}_{\text{pipelined}} \times \text{cycle}_{\text{pipelined}}} \]

\[ S = \frac{\text{pipeline depth}}{1 + \text{stalls per instruction}} \]
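The final speedup expression is easy to evaluate for different stall rates. The sketch below assumes, as the slide does, an ideal CPI of 1 and a non-pipelined cycle that is exactly `depth` times the pipelined cycle; the function name is illustrative.

```python
def pipeline_speedup(depth, stalls_per_instruction):
    """Speedup over the non-pipelined processor: depth / (1 + stalls per instruction)."""
    return depth / (1.0 + stalls_per_instruction)

for stalls in (0.0, 0.25, 1.0):
    print(f"stalls/instr = {stalls:.2f} -> S = {pipeline_speedup(5, stalls):.2f}")
```

With no stalls the speedup equals the pipeline depth; one stall cycle per instruction halves it.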

Structural hazard: Example

[Figure: pipeline diagram (Cycle 1 to Cycle 7; stages Ifetch, Reg, ALU, DMem, Reg) for LOAD followed by Instr i+1, Instr i+2, Instr i+3 and Instr i+4, assuming single-port memory.]

[Figure: the same sequence, assuming single-port memory, drawn with one instruction row fewer.]

Example

Two alternative designs:

A: No structural hazard. Clock cycle 1 ns.

B: With structural hazards. Clock cycle 0.9 ns.

Data access instructions with hazards: 40%.

Which is the faster alternative?

\[ t_{inst}(A) = \text{CPI} \times \text{cycle} = 1 \times 1\,\text{ns} = 1\,\text{ns} \]

\[ t_{inst}(B) = \text{CPI} \times \text{cycle} = (0.6 \times 1 + 0.4 \times (1 + 1)) \times 0.9\,\text{ns} = 1.4 \times 0.9\,\text{ns} = 1.26\,\text{ns} \]

Design A is faster.

Data hazards

A data hazard happens when the pipeline changes the order of read/write accesses to operands.

    I1: DADD R1, R2, R3
    I2: DSUB R4, R1, R5
    I3: AND R6, R1, R7
    I4: OR R8, R1, R9
    I5: XOR R10, R1, R11

I2 reads R1 before I1 modifies it.

I3 reads R1 before I1 modifies it.

I4 gets the right value: the register file is read in the second half of the cycle.

I5 gets the right value.
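The analysis of the sequence I1 to I5 can be sketched in Python. The sketch assumes a five-stage pipeline without forwarding in which registers are read in ID and written in WB, with the write in the first half of the cycle and the read in the second half, so a consumer is safe once it is three or more instructions behind the producer. The tuple layout `(opcode, destination, sources)` and the function name are illustrative.

```python
def raw_hazards(program, safe_distance=3):
    """Return (producer_index, consumer_index, register) for every RAW hazard."""
    hazards = []
    for j, (_, _, sources) in enumerate(program):
        for reg in dict.fromkeys(sources):  # each source register once, in order
            # Look back at the closest earlier writers that are still too near.
            for i in range(j - 1, max(j - safe_distance, -1), -1):
                if program[i][1] == reg:
                    hazards.append((i, j, reg))
                    break
    return hazards

program = [
    ("DADD", "R1", ("R2", "R3")),    # I1
    ("DSUB", "R4", ("R1", "R5")),    # I2
    ("AND", "R6", ("R1", "R7")),     # I3
    ("OR", "R8", ("R1", "R9")),      # I4
    ("XOR", "R10", ("R1", "R11")),   # I5
]
for producer, consumer, reg in raw_hazards(program):
    print(f"I{consumer + 1} reads {reg} before I{producer + 1} writes it")
```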

Data hazard

[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) for the sequence DADD R1, R2, R3; SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11.]

Stalls in data hazards

[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the same instruction sequence with stall cycles inserted.]

Data hazard: RAW

Read After Write.

Instruction i+1 tries to read a datum before instruction i writes it.

    i:   add r1, r2, r3
    i+1: sub r4, r1, r3

If there is a data dependency:

Instructions can neither be executed in parallel nor overlap.

Solutions:

Hardware detection.

Compiler control.

Data hazard: WAR

Write After Read:

Instruction i+1 modifies operand before instruction i reads it.

    i:   sub r4, r1, r3
    i+1: add r1, r2, r3
    i+2: mul r6, r1, r7

Also known as anti-dependency (compiler domain).

Name reuse.

Cannot happen in MIPS with a 5-stage pipeline.

All instructions with 5 stages.

Reads are always in stage 2.

Writes are always in stage 5.

Data hazard: WAW

Write After Write:

Instruction i+1 modifies operand before instruction i modifies it.

    i:   sub r1, r4, r3
    i+1: add r1, r2, r3
    i+2: mul r6, r1, r7

Also known as output dependency (compiler domain).

Name reuse.

Cannot happen in MIPS with a 5-stage pipeline.

All instructions with 5 stages.

Writes are always in stage 5.
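The three kinds of data dependency (RAW, WAR, WAW) can be told apart by comparing the destination and source registers of two instructions. The following Python sketch uses an illustrative `(destination, sources)` representation and a hypothetical function name; it classifies dependencies only and says nothing about whether a given pipeline actually turns them into hazards.

```python
def classify(first, second):
    """Dependencies of `second` on `first`, where `first` comes earlier in program order."""
    dest1, sources1 = first
    dest2, sources2 = second
    kinds = set()
    if dest1 in sources2:
        kinds.add("RAW")   # second reads what first writes
    if dest2 in sources1:
        kinds.add("WAR")   # second writes what first reads
    if dest1 == dest2:
        kinds.add("WAW")   # both write the same register
    return kinds

add_r1 = ("r1", ("r2", "r3"))       # add r1, r2, r3
sub_r4 = ("r4", ("r1", "r3"))       # sub r4, r1, r3
sub_r1 = ("r1", ("r4", "r3"))       # sub r1, r4, r3

print(classify(add_r1, sub_r4))     # add then sub r4: true dependency
print(classify(sub_r4, add_r1))     # sub r4 then add: anti-dependency
print(classify(sub_r1, add_r1))     # sub r1 then add: output dependency
```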

Solutions to data hazards

RAW dependencies:

Forwarding.

WAR and WAW dependencies:

Register renaming:

Statically by compiler.

Dynamically by hardware.
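Register renaming can be illustrated with a minimal Python sketch: every destination gets a fresh name, and sources use the most recent name of the register they read. The physical register names `p0`, `p1`, … and the function name are assumptions for illustration; a real compiler or processor works with a finite register pool.

```python
from itertools import count

def rename_registers(program):
    """Rename (opcode, destination, sources) tuples so that no name is ever rewritten."""
    fresh = count()
    mapping = {}  # architectural register -> current physical name

    def current(reg):
        if reg not in mapping:          # value produced before this code fragment
            mapping[reg] = f"p{next(fresh)}"
        return mapping[reg]

    renamed = []
    for op, dest, sources in program:
        new_sources = tuple(current(r) for r in sources)  # read the old names first
        mapping[dest] = f"p{next(fresh)}"                  # then allocate a new name
        renamed.append((op, mapping[dest], new_sources))
    return renamed

program = [
    ("sub", "r4", ("r1", "r3")),
    ("add", "r1", ("r2", "r3")),   # WAR with the sub on r1
    ("mul", "r6", ("r1", "r7")),   # RAW with the add on r1
]
renamed = rename_registers(program)
for instr in renamed:
    print(instr)
```

After renaming, `add` writes a name that `sub` never reads, so the WAR dependency disappears, while `mul` still reads the name written by `add`, so the RAW dependency is preserved.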

Forwarding

Technique to avoid some data stalls.

Basic idea:

No need to wait until result is written into register file.

Result is already in pipeline registers.

Use this value instead of the one from the register file.

Implementation:

Results from EX and MEM stages written into ALU input registers.

Forwarding logic selects between real input and forwarding register.

[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) with forwarding for the sequence DADD R1, R2, R3; SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11.]

Forwarding limitations

Not every hazard can be avoided with forwarding.

You cannot travel backwards in time!

    I1: LD R1, 0(R2)
    I2: DSUB R4, R1, R5
    I3: AND R6, R1, R7
    I4: OR R8, R1, R9
    I5: XOR R10, R1, R11

[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) of I1 and I2.]

The value loaded by I1 is available only at the end of its MEM stage, after I2 needs it at the start of EX.

In this case a stall is needed.
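The stall that forwarding cannot remove can be counted with a small Python sketch. It assumes a five-stage pipeline with full forwarding, in which the only remaining data stall is one cycle when an instruction uses the result of the load immediately before it; the tuple layout and function name are illustrative.

```python
def load_use_stalls(program):
    """Count one-cycle stalls for (opcode, destination, sources) tuples with forwarding."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "LD" and prev_dest in cur_sources:
            stalls += 1   # loaded value arrives after MEM, too late for the next EX
    return stalls

load_sequence = [
    ("LD", "R1", ("R2",)),
    ("DSUB", "R4", ("R1", "R5")),
    ("AND", "R6", ("R1", "R7")),
    ("OR", "R8", ("R1", "R9")),
    ("XOR", "R10", ("R1", "R11")),
]
alu_sequence = [("DADD", "R1", ("R2", "R3"))] + load_sequence[1:]

print(load_use_stalls(load_sequence))  # load followed by a dependent DSUB
print(load_use_stalls(alu_sequence))   # ALU result can be forwarded in time
```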

Memory access stalls

[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the sequence LD R1, 0(R2); DSUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11 with the stall inserted after the load.]

Pipeline hazards

Structural hazards.

Data hazards.

Control hazards.

  Compile-time alternatives.

  Run-time alternatives.

Control hazard

A control hazard is associated with an instruction that modifies the PC.

The next instruction is not known until the current one completes.

Terminology:

Taken branch: PC is updated with the branch target.

Not-taken branch: PC keeps its sequential value.

Problem:

Pipeline assumes branch will not be taken.

What if, after ID, we find out branch needs to be taken?

Alternatives in control hazards

Compile-time: fixed for the whole program execution.

Software may try to minimize the impact if it knows the hardware behavior.

Compiler can do this job.

Run-time: variable behavior during program execution.

Tries to predict what software will do.

Control hazards: Static solutions

Alternatives:

Pipeline freezing.

Fixed prediction.

Always not taken.

Always taken.

Delayed branching.

In many cases the compiler needs to know which alternative is used in order to reduce its negative impact.

Pipeline freezing

Idea: If the current instruction is a branch, stop or flush subsequent instructions from the pipeline until the target is known.

Run-time penalty is known.

Software (compiler) cannot do anything.

Branch target is known in ID stage.

Repeat next instruction FETCH.

FETCH repetition

[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for a branch instruction followed by Instr. i+1, Instr. i+2 and Instr. i+3; the instruction fetch (IM) of Instr. i+1 is repeated (IF repeated).]

Fixed prediction: not-taken

Idea: Assume branch will be not-taken.

Avoids updating processor state until branch not taken is confirmed.

If branch is taken, subsequent instructions are removed from the pipeline and the next instruction is fetched from the branch target.

Transforms instructions into NOPs.

Compiler task:

Organize code setting the most frequent option as not-taken and inverting the condition if needed.

[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for a taken branch under predict-not-taken, with rows Branch Instr., Instr. i+1, Branch target and Instr. i+1; the first Instr. i+1 is fetched (IM) and then left inactive.]

Fixed prediction: taken

Idea: Assume branch will be taken.

As soon as branch is decoded and target is computed, target instructions start to be fetched.

In a 5-stage pipeline it does not provide improvements.

Target address not known until branch decision is made.

Useful in processors with complex and slow conditions.

Compiler task:

Organize code setting the most frequent option as taken and inverting the condition if needed.

Delayed branching

Idea: Branch happens after executing the n instructions that follow the branch instruction.

In a 5-stage pipeline: 1 delay slot.

    Branch instruction
    Instruction suc1
    Instruction suc2
    …
    Instruction sucn          (suc1 to sucn: n-length delay)
    Conditional instruction

[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) of the branch instruction, the delayed instruction, and the next or target instruction.]

Delayed branching: Compiler

Case 1 (preferred): DADD moves into the delay slot.

    Before:                        After:
    DADD R1, R2, R3                BEQZ R2, LABEL
    BEQZ R2, LABEL                 DADD R1, R2, R3          (delay slot)
    (delay slot)                   XOR R5, R6, R7
    XOR R5, R6, R7                 LABEL: AND R8, R9, R10
    LABEL: AND R8, R9, R10

Case 2: XOR cannot move to the delay slot due to data dependency; DADD from the branch target is copied into the slot.

    Before:                        After:
    LABEL: DADD R1, R2, R3         DADD R1, R2, R3
    XOR R5, R6, R7                 LABEL: XOR R5, R6, R7
    BEQZ R5, LABEL                 BEQZ R5, LABEL
    (delay slot)                   DADD R1, R2, R3          (delay slot)
    AND R8, R9, R10                AND R8, R9, R10

Case 3: XOR moves into the delay slot, only if R5 is not used after LABEL.

    Before:                        After:
    DADD R1, R2, R3                DADD R1, R2, R3
    BEQZ R1, LABEL                 BEQZ R1, LABEL
    (delay slot)                   XOR R5, R6, R7           (delay slot)
    XOR R5, R6, R7                 LABEL: AND R8, R9, R10
    LABEL: AND R8, R9, R10

Delayed branching

Compiler effectiveness for the 1-slot case:

Fills around 60% of the slots.

Around 80% of the instructions executed in slots are useful for computation.

Around 50% of the slots are filled usefully.

With deeper pipelines and multiple issue more slots are needed.

Need to move to more popular dynamic approaches.

Pipeline performance with branch fixed prediction

The number of branch stalls depends on:

Branch frequency.

Branch penalty.

\[ \text{stall cycles from branches} = \text{branch frequency} \times \text{branch penalty} \]

\[ S = \frac{\text{pipeline depth}}{1 + \text{branch frequency} \times \text{branch penalty}} \]

Example

MIPS R4000 (deeper pipeline).

3 stages before knowing branch target.

1 additional stage to evaluate condition.

Assuming no data stalls in comparisons.

Branch frequency:

Unconditional branch: 4%.

Conditional branch, not-taken: 6%.

Conditional branch, taken: 10%.

Branch scheme        Penalty unconditional   Penalty not-taken   Penalty taken
Flush pipeline       2                       3                   3
Predict taken        2                       3                   2
Predict not-taken    2                       0                   3

Solution

Contribution over ideal CPI:

Branch scheme        Unconditional branch   Branch not-taken   Branch taken      Total
Frequency            4%                     6%                 10%               20%
Flush pipeline       0.04 x 2 = 0.08        0.06 x 3 = 0.18    0.10 x 3 = 0.30   0.56
Predict taken        0.04 x 2 = 0.08        0.06 x 3 = 0.18    0.10 x 2 = 0.20   0.46
Predict not-taken    0.04 x 2 = 0.08        0.06 x 0 = 0.00    0.10 x 3 = 0.30   0.38

Speedup of predicting taken over flushing pipeline:

\[ S = \frac{1 + 0.56}{1 + 0.46} = 1.068 \]

Speedup of predicting not-taken over flushing pipeline:

\[ S = \frac{1 + 0.56}{1 + 0.38} = 1.130 \]
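The table and the two speedups can be recomputed with a short Python sketch. Frequencies and penalties are the ones given in the example; the dictionary keys and function names are illustrative.

```python
frequency = {"unconditional": 0.04, "not_taken": 0.06, "taken": 0.10}
penalty = {
    "flush":             {"unconditional": 2, "not_taken": 3, "taken": 3},
    "predict_taken":     {"unconditional": 2, "not_taken": 3, "taken": 2},
    "predict_not_taken": {"unconditional": 2, "not_taken": 0, "taken": 3},
}

def branch_stall_cpi(scheme):
    """Stall cycles per instruction added by branches under the given scheme."""
    return sum(frequency[kind] * penalty[scheme][kind] for kind in frequency)

def speedup_over(base, scheme):
    """Speedup of `scheme` relative to `base`, both with an ideal CPI of 1."""
    return (1 + branch_stall_cpi(base)) / (1 + branch_stall_cpi(scheme))

for scheme in penalty:
    print(f"{scheme}: +{branch_stall_cpi(scheme):.2f} CPI, "
          f"S over flush = {speedup_over('flush', scheme):.3f}")
```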


Branching and run-time

Each branch is strongly biased:

Either it is taken most of the time,

Or it is not taken most of the time.

Prediction based on execution profile:

Run once to collect statistics.

Collected information is used to modify the code and take advantage of it.

Prediction with execution profile

SPEC92: branch frequency from 3% to 24%.

Floating point. Misprediction rate. Average: 9%. Standard deviation: 4%.

Integer. Misprediction rate. Average: 15%. Standard deviation: 5%.

[Figure: bar chart of misprediction rate (axis from 0% to 25%) per benchmark, grouped into floating point and integer. Bar values: 12%, 22%, 18%, 11%, 12%, 4%, 6%, 9%, 10%, 15%.]

Dynamic prediction: BHT

Branch History Table:

Index: Lower portion of branch instruction address (PC).

Value: 1 bit (branch taken or not last time).

Improvement: Use more bits to increase precision.

[Figure: state diagram of a 2-bit predictor with states Predict Taken (11), Predict Taken (10), Predict Not Taken (01) and Predict Not Taken (00); transitions are labeled T (taken) and NT (not taken).]
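A branch history table with 2 bits per entry can be sketched in Python as follows. This sketch assumes a saturating counter (values 2 and 3 predict taken, 0 and 1 predict not taken), which is one common 2-bit scheme and not necessarily the exact transition diagram of the slide; the class name, the 4-byte instruction alignment and the default of 4096 entries are likewise assumptions.

```python
class BranchHistoryTable:
    """2-bit saturating counters indexed by the low bits of the branch address."""

    def __init__(self, entries=4096):
        self.entries = entries
        self.counters = [0] * entries   # start in strong "not taken"

    def _index(self, pc):
        return (pc >> 2) % self.entries  # drop byte offset, keep low-order bits

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # True means "predict taken"

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

def mispredictions(bht, pc, outcomes):
    """Predict, compare, then train on each outcome of the branch at `pc`."""
    misses = 0
    for taken in outcomes:
        if bht.predict(pc) != taken:
            misses += 1
        bht.update(pc, taken)
    return misses

# A loop branch: taken 9 times, then not taken once, for 3 executions of the loop.
bht = BranchHistoryTable()
loop_outcomes = ([True] * 9 + [False]) * 3
misses = mispredictions(bht, 0x1000, loop_outcomes)
print(misses, "mispredictions out of", len(loop_outcomes))
```

Two different branches whose addresses share the same low-order bits use the same entry, which is the second source of mispredictions: the entry then holds the history of a different branch.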

BHT: Precision

Mispredictions:

Wrongly predicted branch outcome.

The table entry holds the history of a different branch.

Results for a 2-bit BHT with 4K entries:

[Figure: bar chart of misprediction rate (axis from 0% to 20%) per benchmark. Benchmarks: eqntott, espresso, gcc, li, spice, doduc, spice, fpppp, matrix300, nasa7. Bar values: 18%, 5%, 12%, 10%, 9%, 5%, 9%, 9%, 0%, 1%.]

Branch dynamic prediction

Why does branch prediction work?

Algorithms exhibit regularities.

Data structures exhibit regularities.

Is dynamic prediction better than static prediction?

It appears so.

There is a small number of important branches in programs with dynamic behavior.

Multi-cycle operations

Floating point operations

Floating point operations in a single cycle?

An extremely long clock cycle.

Impact on global performance.

FPU control logic very complex.

Enormous amount of logic in FP units.

Alternative: Pipelining floating point unit.

Execution stage is repeated several times.

Multiple functional units in EX.

Example: Integer unit, FP and integer multiplier, FP adder, FP and integer divider.

Pipelining and floating point

EX stage may last more than 1 clock cycle.

[Figure: pipeline IF, ID, EX, MEM, WB in which EX is one of four parallel functional units: integer unit, int/FP multiplier, FP adder, int/FP divider.]

Latencies and initiation interval

Latency: Number of cycles between an instruction producing a result and an instruction using that result.

Initiation interval: Number of cycles between two instructions using the same functional unit.

Operation            Latency   Initiation interval
Integer ALU          0         1
Loads                1         1
FP addition          3         1
FP multiplication    6         1
FP division          24        25
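The two columns of the table translate directly into stall counts. The Python sketch below uses the values from the table; the dictionary keys, the function names and the notion of `distance` (how many instructions or cycles later the second operation is issued) are illustrative assumptions.

```python
LATENCY = {"int_alu": 0, "load": 1, "fp_add": 3, "fp_mul": 6, "fp_div": 24}
INITIATION_INTERVAL = {"int_alu": 1, "load": 1, "fp_add": 1, "fp_mul": 1, "fp_div": 25}

def data_stalls(producer_op, distance=1):
    """Stall cycles for a consumer issued `distance` instructions after its producer."""
    return max(0, LATENCY[producer_op] - (distance - 1))

def structural_stalls(op, distance=1):
    """Stall cycles for a second `op` issued `distance` cycles after the first one."""
    return max(0, INITIATION_INTERVAL[op] - distance)

print(data_stalls("load"))             # dependent instruction right after a load
print(data_stalls("fp_add", 2))        # one independent instruction in between
print(structural_stalls("fp_div"))     # back-to-back divisions on a non-pipelined divider
print(structural_stalls("fp_mul"))     # fully pipelined multiplier
```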

Summary

Pipelined architectures require higher memory bandwidth.

Pipeline hazards lead to stalls.

Performance degradation.

Stalls due to data hazards may be mitigated with compiler techniques.

Stalls due to control hazards may be reduced with:

Compile-time alternatives.

Run-time alternatives.

Multi-cycle operations allow for shorter clock cycles.

References

Computer Architecture. A Quantitative Approach. Fifth Edition.

Hennessy and Patterson.

Sections C.1, C.2 and C.5.

Recommended exercises:

C.1, C.2, C.3, C.4, C.5