Hier wird Wissen Wirklichkeit Computer Architecture – Part 8 – page 1 of 75 – Prof. Dr. Uwe Brinkschulte, M.Sc. Benjamin Betting
Part 8
Instruction Level Parallelism (ILP) - Pipelining
Computer Architecture
Slide Sets
WS 2012/2013
Prof. Dr. Uwe Brinkschulte, M.Sc. Benjamin Betting
Parallel Computing
Pipelining
Superscalar
VLIW
EPIC
Multithreading
Multiprocessing
Multi-Cores
Cluster of Computers
Cloud- and Grid-Computing
Thread- and Task-Level Parallelism
Instruction-Level Parallelism
Architectures with instruction level parallelism (ILP): Pipelining vs. concurrency
The basis of most computer architectures is still the well-known von Neumann or Harvard principle, which relies on sequential operation.
In modern high performance processors this sequential operation
mode is extended by instruction level parallelism (ILP).
ILP can be implemented by two modes of parallelism:
• Parallelism in time (pipelining)
• Parallelism in space (concurrency)
Together with technological improvement, these two techniques of parallelism are an important source of high performance.
• Parallelism in time (pipelining) means that the execution of instructions is overlapped in time by partitioning the instruction cycle.
• Parallelism in space (concurrency) means that more than one instruction is executed in parallel, either in order or out of order.
Both techniques are combined in modern microprocessors and define instruction level parallelism for better performance.
Pipelining vs. concurrency
Pipelining vs. concurrency
[Figure: pipelining overlaps the stages of instructions 1-3 in time, one stage per clock cycle, while concurrency executes instructions 1-3 in parallel units in the same cycle.]
Parallelism in time relies on the assembly line principle, which has also matured in automotive production.
It can be combined effectively with concurrency.
In computer architecture, an assembly line is called a pipeline.
"Pipelines accelerate execution speed in the same way like Henry Ford
revolutionized car manufacturing with the introduction of the assembly line"
(Peter Wayner, 1992)
Pipelining means the fragmentation of a machine instruction into several partial operations.
These partial operations are executed by partial units in a sequential and synchronized manner.
Every processing unit executes only one specific partial operation.
Taken together, the partial processing units are called a pipeline.
Pipelining vs. concurrency
Fragmentation of the instruction cycle
1. instruction fetch
The instruction addressed by the program counter is loaded from
main memory or a cache into the instruction register. The program
counter is incremented.
2. instruction decode
Internal control signals are generated according to the instruction's opcode and addressing modes.
3. operand fetch
The operands are provided by registers or functional units.
Possible fragmentation into 5 stages:
Fragmentation of the instruction cycle
4. execute
The operation is executed with the operands.
5. write back
The result is written into a register or bypassed to serve as
operand for a succeeding operation.
Depending on the instruction or instruction class some stages may be
skipped.
The entirety of stages is called instruction cycle.
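The overlapped operation of these stages can be sketched in a small simulation. This is an illustrative Python sketch, not material from the slides; the stage names and the issue of one instruction per cycle are the assumptions:

```python
# Illustrative sketch: cycle-by-cycle occupancy of a classic 5-stage pipeline.
# Stage names follow the fragmentation above.
STAGES = ["IF", "ID", "OF", "EX", "WB"]

def pipeline_table(n_instructions):
    """Return {cycle: {stage: instruction}} for n instructions, one issued per cycle."""
    table = {}
    for i in range(n_instructions):
        for s, stage in enumerate(STAGES):
            cycle = i + s + 1          # instruction i enters stage s in cycle i+s+1
            table.setdefault(cycle, {})[stage] = i + 1
    return table

table = pipeline_table(3)
# Instruction 1 finishes WB in cycle 5, instruction 3 in cycle 7:
# total cycles = k + (n - 1) = 5 + 2 = 7
print(max(table))   # 7
```

In cycle 3, for example, instruction 3 is fetched while instruction 2 is decoded and instruction 1 fetches its operands, exactly the overlap described above.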
• In the first stage, the fetch unit accesses the instruction.
• The fetched instruction is passed to the instruction decode unit.
• While this second unit processes the instruction, the first unit already fetches the next instruction.
• In the best case, an n-stage pipeline executes n instructions in parallel.
• Each instruction is in a different stage of its execution.
• When the pipeline is filled, the execution of one instruction is finished every clock cycle.
• A processor capable of finishing one instruction per clock cycle is called a scalar processor.
Instruction pipelining
[Figure: three instructions staggered across the clock cycles; each passes through instruction fetch, instruction decode, operand fetch, execute and write back, offset by one cycle from its predecessor.]
Instruction pipelining
Pipeline design principles
• Pipeline stages are linked by registers.
• The instruction and the intermediate result are forwarded every clock cycle (in special cases every half clock cycle) to the next pipeline register.
• A pipeline is as fast as its slowest stage.
• Therefore, an important issue in pipeline design is to ensure that the stages consume equivalent amounts of time.
• A high number of pipeline stages (often called a superpipeline) leads to short clock cycles and higher speedup.
• But a stall of a long pipeline, e.g. due to a control flow dependency, results in long wait times until the pipeline can be refilled.
• Thus, a real trade-off exists for the designer.
Basic pipeline measures
Pipelining belongs to the class of fine grain parallelism. It takes place at a microarchitectural level.
Definitions:
• An operation is the application of a function F to operands. An operation produces a result.
• An operation can be made up of a set of partial operations f1 ... fp (in most cases p = k). It is assumed that the partial operations are applied in sequential order.
• An instruction defines through its format the function, operands and result.
A k-stage pipeline executes n operations of F in
tp(n, k) = k + (n − 1) cycles:
k cycles to execute the first instruction (fill the pipeline),
n − 1 cycles to execute the remaining n − 1 instructions.
The figure shows an example: tp(10, 5) = 5 + (10 − 1) = 14
Pipeline operation
[Figure: 10 instructions (i ... i+9) flowing through 5 stages over 14 cycles, with a start-up (fill) phase, a processing phase, and a drain phase.]
Pipeline throughput:
T(n, k) = # operations / tp(n, k) = n / (k + (n − 1))   [operations per cycle]
Pipeline speedup:
S(n, k) = unpipelined execution time / pipelined execution time = n · k / (k + (n − 1))
lim (n → ∞) S(n, k) = k
In the best case, when a high number of linearly succeeding operations is executed, the pipeline speedup converges to the number of pipeline stages.
Basic pipeline measures
Pipeline efficiency:
E(n, k) = S(n, k) / k = n / (k + (n − 1)) ≤ 1
lim (n → ∞) E(n, k) = 1
Pipeline efficiency reaches 1 (peak performance) if an infinite operation stream without bubbles or stalls is executed. This is of course only a best case analysis.
Practical evaluation: Hockney numbers:
n∞: pipeline peak performance at an infinite number of operations
n½: number of operations at which the pipeline reaches half its peak performance
Basic pipeline measures
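The measures above translate directly into a few lines of Python. This is a sketch of the formulas only; the example value tp(10, 5) = 14 is taken from the slides, the other numbers are assumed for illustration:

```python
# Sketch of the basic pipeline measures for a k-stage pipeline and n operations.
def t_p(n, k):              # execution time in cycles: k to fill + (n-1) further results
    return k + (n - 1)

def throughput(n, k):       # operations per cycle
    return n / t_p(n, k)

def speedup(n, k):          # unpipelined time n*k divided by pipelined time
    return n * k / t_p(n, k)

def efficiency(n, k):       # speedup normalized by the stage count, always <= 1
    return speedup(n, k) / k

print(t_p(10, 5))                         # 14, matching the example in the slides
# For large n, speedup approaches k and efficiency approaches 1:
print(round(speedup(10_000, 5), 3), round(efficiency(10_000, 5), 4))
# Half peak performance (the Hockney number n_1/2): solving n/(k+n-1) = 1/2 gives n = k-1.
print(efficiency(4, 5))                   # 0.5 for k = 5
```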
Pipeline stages
[Figure: a k-stage pipeline for a function F; instructions and operands enter the first stage f1, pass through f2, f3, ..., fk, and results leave the last stage.]
Stages are separated by registers.
Partitioning of an operation F:
If a partitioning of an operation is impossible, F can also be applied in parallel and overlapped over two clock cycles.
[Figure: an operation F taking time tf; its partitioning into suboperations f1 and f2 taking time tf/2 each; and the alternative of two overlapped units 1/1' and 2/2', each taking time tf.]
Operation example for partitioning
[Figure: timing of successive operations i, i+1, i+2, i+3 over cycles t ... t+5, shown for the partitioned variant (f1, f2) and for the variant with two overlapped units.]
If tfi = max(tf1 ... tfk) determines the clock frequency in an unbalanced pipeline (tfi >> tf1, ... , tfi >> tfk), fi should be partitioned further for better performance.
Balancing pipeline suboperations
[Figure: version 1, an unbalanced pipeline f1, f2, f3 with f1 << f2 and f2 >> f3; version 2, the slow stage f2 split into f2a, f2b, f2c.]
Overall execution time, clock frequency
Register delays:
tpd = propagation delay time
tsu = setup time
Clock period:
cp = max(tfi) + tpd + tsu
Overall pipelined execution time of an operation F:
t(F) = (max(tfi) + tpd + tsu) · k = k · max(tfi) + k · (tpd + tsu)
where max(tfi) is the maximum processing time of a suboperation, (tpd + tsu) the register delay, and k the number of stages.
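The timing model can be illustrated numerically. Only the formula itself is from the slides; the stage times and register delays below are hypothetical:

```python
# Sketch of the timing model: the slowest suboperation plus register
# propagation delay and setup time determine the clock period.
def clock_period(stage_times, t_pd, t_su):
    return max(stage_times) + t_pd + t_su

def pipelined_time(stage_times, t_pd, t_su):
    k = len(stage_times)
    return clock_period(stage_times, t_pd, t_su) * k

stages = [1.0, 2.0, 1.2, 1.1, 0.9]        # ns, hypothetical suboperation times
print(clock_period(stages, 0.1, 0.1))     # 2.2 ns: the 2.0 ns stage dominates
print(pipelined_time(stages, 0.1, 0.1))   # 11.0 ns for the whole operation
# Splitting the slow 2.0 ns stage into two shorter stages shortens the clock
# period, which is exactly the balancing argument made above.
```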
Architecture of a linear 5-stage pipeline with registers
[Figure: linear 5-stage pipeline IF → ID → OF → EX → WB; the stages are separated by operand registers (OR); the ALU forms the execute stage; PC, IC, IR, DEC, RF and DC are attached to the corresponding stages.]
IF = instruction fetch, ID = instruction decode, OF = operand fetch, EX = execute, WB = write back
IC = instruction cache, DC = data cache, IR = instruction register, CR = control register, RF = register file (e.g. 3-port register file), DEC = decoder (control unit), OR = operand register, PC = program counter
Pipeline hazards
So far, we have assumed a smooth flow of operations through the pipeline.
But there are several effects which can cause stalls in pipelined operation.
These effects are called pipeline hazards.
Pipeline hazards can be caused by
• dataflow dependencies
• resource dependencies
• control flow dependencies
Dataflow dependencies
Pipelined processors have to consider 3 classes of dataflow dependencies. The same dependencies have to be considered in concurrency.
1. true dependency: read after write (RAW)
destination(i) = source(i+1)
X ← A + B   instruction i
Y ← X + B   instruction i+1
X has to be written by instruction i before it is read by the succeeding instruction.
A hazard occurs if the distance of the two instructions is smaller than the number of pipeline stages; in this case X would have to be read before it is created.
2. anti dependency: write after read (WAR)
source(i) = destination(i+1)
Y has to be read by instruction i before it is written by the succeeding instruction.
X ← Y + B   instruction i
Y ← A + C   instruction i+1
Dataflow dependencies
A hazard occurs if the order of the instructions is changed in the pipeline.
3. output dependency: write after write (WAW)
destination(i) = destination(i+1)
Both instructions write their results into the same register.
Y ← A / B   instruction i
Y ← C + D   instruction i+1
Dataflow dependencies
A hazard occurs if the order of the instructions is changed in the pipeline.
Example of a short assembler program containing a true dependency, anti dependencies and an output dependency.
I1 ADD R1,2,R2   ; R1 = R2+2
I2 ADD R4,R3,R1  ; R4 = R1+R3
I3 MULT R3,3,R5  ; R3 = R5·3
I4 MULT R3,3,R6  ; R3 = R6·3
Dependency graph
[Figure: I1 → I2 true dependency (R1); I2 → I3 and I2 → I4 anti dependencies (R3); I3 → I4 output dependency (R3).]
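The three dependency classes can be detected mechanically. The following helper is a hypothetical illustration (not from the slides) that classifies the dependency of a later instruction on an earlier one, with instructions given as (destination, sources) pairs:

```python
# Hypothetical dependency classifier for two instructions in program order.
def classify(i, j):
    """i, j are (destination, sources) tuples; j comes after i."""
    dest_i, srcs_i = i
    dest_j, srcs_j = j
    deps = []
    if dest_i in srcs_j:
        deps.append("RAW")          # true dependency: j reads what i writes
    if dest_j in srcs_i:
        deps.append("WAR")          # anti dependency: j overwrites what i reads
    if dest_i == dest_j:
        deps.append("WAW")          # output dependency: both write the same register
    return deps

I1 = ("R1", ["R2"])        # ADD R1,2,R2  ; R1 = R2+2
I2 = ("R4", ["R3", "R1"])  # ADD R4,R3,R1 ; R4 = R1+R3
I3 = ("R3", ["R5"])        # MULT R3,3,R5 ; R3 = R5*3
print(classify(I1, I2))    # ['RAW']  true dependency on R1
print(classify(I2, I3))    # ['WAR']  anti dependency on R3
```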
Example of a true dependency hazard (RAW) in a 5-stage pipeline
i:   X := A op B
i+1: Y := X op C
[Figure: instruction i passes through fetch, decode, read, execute, write; at the issue point, the issue check detects that i+1 would read X before i has written it (RAW), so i+1 cannot proceed.]
Solutions for true dependency hazards
Software solutions:
• Inserting NOOP instructions
• Reorder instructions
Hardware solutions:
• Pipeline interlocking
• Forwarding
Any combinations of these solutions are possible as well
Solving a true dependency hazard by inserting NOOPs
The RAW hazard is eliminated through insertion of NOOPs (bubbles) into the pipeline. This was the solution used in the first RISC processors.
The NOOPs are inserted by the compiler or programmer.
[Figure: the 5-stage pipeline with two NOOPs inserted between instruction i (X := A op B) and instruction i+1 (Y := X op C), so that i+1 reads X only after i has written it.]
Solving a true dependency hazard by reordering instructions
Sometimes, instead of inserting NOOPs, instructions can be reordered to the same effect.
For this, instructions having no true dependencies and not changing the control flow are arranged in between the conflicting instructions.
Example:
With NOOPs:
X := A op B
NOOP
NOOP
Y := X op C
Z := D op E
F := INP(0)
Reordered:
X := A op B
Z := D op E
F := INP(0)
Y := X op C
Solving a true dependency hazard by pipeline interlocking
Pipeline interlocking means the pipeline processing is delayed by hardware until the conflict is solved.
So the compiler or programmer is relieved (used e.g. in the MIPS processor, Microprocessor with Interlocked Pipeline Stages).
[Figure: the 5-stage pipeline as before; the issue check for i+1 stalls (interlocks) instruction i+1 until instruction i has written X.]
Forwarding
Forwarding is a simple hardware technique to save one delay slot (NOOP).
An operand X needed by instruction i+1 is directly forwarded from the output of the ALU to its input. The register file is bypassed.
If more than one delay slot is necessary, forwarding is combined with interlocking or NOOP insertion.
The data forwarding path can also be used to provide operands of a waiting instruction from the cache.
This shortens the delay slot between a load and an execute instruction using this operand.
Data cache access is sped up considerably by this technique.
Load and result forwarding
[Figure: the path from cache through memory register to ALU, with two bypasses: load forwarding from the cache directly to the ALU input, and result forwarding from the ALU output back to its input.]
Hardware realization of the forward path
[Figure: the 5-stage pipeline with forward control and two bypass paths: the data forwarding path (result forwarding) from the EX stage output (R) back to the ALU inputs (S1, S2), and the load data path (load forwarding) from the data cache; one NOOP or interlock cycle remains, checked at the issue point for i+1.]
Anti- and output-dependency hazards (false dependencies)
An output dependency hazard may occur if an instruction i needs more time units to execute than instruction i+1.
Of course this is only possible if the processor consists of several processing units with different numbers of stages.
Anti-dependency hazards only occur if the order of instructions is changed in the pipeline.
This is never the case for ordinary scalar pipelines.
In superscalar pipelines, this hazard occurs.
Output dependency hazard (regarding only 3 stages of the 5-stage pipeline)
[Figure: two functional units FU1 and FU2; instruction i is issued to FU1 and needs three execute cycles for A op B, instruction i+1 is issued to FU2 and needs one cycle for C op D; both write register Y, so i+1 would write Y before i does.]
Removing false dependencies
False dependencies can always be removed by register renaming.
This can be done by hardware or by the compiler.
So the hazard will never occur.
Example (anti dependency on the left, output dependency on the right):
X := Y op B        Y := A op B
Y := A op C        Y := C op D
Renaming the second Y to Z:
X := Y op B        Y := A op B
Z := A op C        Z := C op D
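Renaming can be sketched in a few lines. The following is a minimal illustrative algorithm, not the hardware mechanism described here; the fresh names T0, T1, ... are an assumption:

```python
# Minimal register-renaming sketch: every write gets a fresh name,
# so WAR and WAW dependencies disappear; RAW dependencies remain.
def rename(program):
    """program: list of (dest, sources). Returns the renamed program."""
    latest = {}                                   # architectural register -> current name
    fresh = iter("T%d" % n for n in range(1000))
    out = []
    for dest, srcs in program:
        srcs = [latest.get(s, s) for s in srcs]   # read the newest version of each source
        latest[dest] = next(fresh)                # allocate a new name for every write
        out.append((latest[dest], srcs))
    return out

prog = [("Y", ["A", "B"]), ("Y", ["C", "D"])]     # WAW on Y
print(rename(prog))   # [('T0', ['A', 'B']), ('T1', ['C', 'D'])]  no WAW left
```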
Resource dependencies
Resource dependencies can be classified into:
• intra-pipeline dependencies
• instruction class dependencies
An intra-pipeline dependency occurs if instructions of two succeeding stages need the same pipeline resource.
The succeeding instruction (and the following instructions) have to be delayed until the resource becomes available.
This happens e.g. if the common register file lacks a sufficient number of ports or some instructions need more than one clock cycle to run through a particular pipeline resource.
Examples: a register file with a common read/write port (possible conflict of a read in stage 3 with a write in stage 5) or a multi-cycle division unit in the execute stage.
Resource dependencies
An instruction class dependency occurs if two or more instructions which are in the same pipeline stage need a pipeline resource existing only once.
This never happens in a scalar pipeline.
Superscalar processors with several execution units often face this sort of conflict.
A twofold superscalar processor may issue two instructions to two execution units simultaneously.
If these instructions need the same (only once existent) execution unit, an instruction class dependency arises.
Control flow dependencies
Every change in control flow is a potential candidate for a conflict.
Several instruction classes cause changes in control flow:
• conditional branch
• jump
• jump to subroutine, return from subroutine
The control flow target is not yet available when the next instruction is to be fetched.
Especially conditional branches cause severe conflicts:
the analysis of the condition, which usually finishes only in the last pipeline stages, determines the next instruction to issue.
Control flow hazards
Example of a control flow hazard due to a conditional branch
[Figure: a CMP instruction followed by BRANCH COND in the 5-stage pipeline IF, ID, OF, EX, WB; the condition code is available only after the EX stage of CMP, so the next correct instruction cannot be fetched until then.]
Solutions for control flow hazards
Software solutions:
• Inserting NOOP instructions
• Reorder instructions
Hardware solutions:
• Pipeline interlocking
• Forwarding
• Fast compare and jump logic
• Branch prediction
Solution: interlocking or NOOP insertion
[Figure: CMP and BRANCH COND run through the pipeline; the delay slots after the branch are filled with NOOPs or interlocking until the condition code from the EX stage selects NEXT CORRECT I and NEXT+1 CORRECT I.]
Penalty: 6 cycles
Reducing the penalty by forwarding the comparison result
[Figure: the condition code is forwarded from the EX stage of CMP directly to the branch logic, so fewer NOOP or interlock cycles are needed before NEXT CORRECT I can be fetched.]
Penalty: 4 cycles
Reducing the penalty by forwarding the next correct instruction address
[Figure: in addition, the next correct instruction address is forwarded directly to the fetch stage, removing one more NOOP or interlock cycle.]
Penalty: 3 cycles
Reducing the penalty by fast compare and jump logic
[Figure: a fast compare logic computes the comparison result early and a fast jump logic selects the next address, leaving only two delay slots of NOOP or interlocking.]
Penalty: 2 cycles
Reducing the penalty by fast compare and jump logic
Special logic for compare and jump instructions can reduce the penalty by one cycle.
These circuits can be much faster than a more general execution unit (ALU), allowing comparison and jump to complete in one clock cycle.
The higher speed of the fast compare logic is possible because normally only simple comparisons like equal, unequal, <0, >0, ≤0, ≥0, =0 are needed.
Reducing the penalty by fast compare and jump logic + reordering instructions
The remaining 2 NOOPs or interlock cycles can be replaced by reordering code.
Two independent instructions can be moved after the branch instruction (delayed branch).
Example:
Before reordering:
Z := D op E
F := INP(0)
CMP
BRANCH COND
NOOP
NOOP
NEXT INSTR (COND = FALSE)
. . .
NEXT INSTR (COND = TRUE)
After reordering (delayed branch):
CMP
BRANCH COND
Z := D op E
F := INP(0)
NEXT INSTR (COND = FALSE)
. . .
NEXT INSTR (COND = TRUE)
Branch prediction
Another possibility of avoiding control flow hazards is branch prediction.
Here, the outcome of the branch (taken or not taken) is predicted before the result of the comparison is known.
In case of correct branch prediction, the penalty can be reduced down to 0.
First, let us assume a perfectly working branch predictor.
Reducing the penalty by branch prediction
[Figure: a branch predictor delivers the prediction result (taken or not taken) and the next address to the fetch stage; without a branch target address, two delay slots remain.]
Penalty: still 2 cycles
Branch target address cache
To further reduce the penalty, a branch target address cache (BTAC) can be introduced.
This cache holds the addresses of branches and the corresponding target addresses.
Therefore, if the cache is already filled, a branch and its possible target address can be identified in the fetch phase.
[Figure: the BTAC is indexed by part of the branch address (e.g. the lower m bits) and delivers the branch target address.]
Reducing the penalty by branch prediction and branch target address cache
[Figure: the branch predictor and the BTAC together deliver the prediction result and the next address already in the fetch stage, so NEXT CORRECT I and NEXT+1 CORRECT I follow the branch without delay slots.]
Penalty: 0 cycles
Branch prediction and pipeline utilization
For a penalty of 0 cycles, two prerequisites must be met:
• the branch address must be stored in the BTAC
• the branch prediction must be correct
Otherwise we will get a penalty.
Branch prediction and pipeline utilization
In case of a BTAC miss, the penalty will be pb (in our example 2).
In case of a misprediction, the penalty will be the number of cycles pm needed to flush the pipeline (e.g. 5).
In modern processors, this can be much more (e.g. 11 for the Pentium II).
The overall penalty calculates to:
p = m · pm + (1 − m) · b · pb    with m: misprediction rate, b: BTAC miss rate
The pipeline utilization can be calculated as:
u = n / (n + p)    with n: number of instructions
So, an excellent branch prediction is necessary.
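The penalty and utilization formulas can be evaluated directly. This is a sketch; the penalties pb = 2 and pm = 5 follow the example above, while the rates m and b are hypothetical:

```python
# Sketch of the overall branch penalty and pipeline utilization formulas.
def penalty(m, b, p_m, p_b):
    """m: misprediction rate, b: BTAC miss rate, p_m/p_b: respective penalties."""
    return m * p_m + (1 - m) * b * p_b

def utilization(n, p):
    """n: number of instructions, p: overall penalty in cycles."""
    return n / (n + p)

p = penalty(m=0.1, b=0.2, p_m=5, p_b=2)
print(round(p, 2))                  # 0.86 cycles average penalty
print(round(utilization(10, p), 3))
# A better predictor (m = 0.01) cuts the penalty considerably:
print(round(penalty(m=0.01, b=0.2, p_m=5, p_b=2), 3))
```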
Branch prediction techniques
In general, two classes of branch prediction techniques can be
distinguished:
• static branch prediction
for a given branch, the prediction is always the same, it never
changes
• dynamic branch prediction
for a given branch, the prediction changes dynamically
Static branch prediction
• Predict always not taken
the simplest technique, no BTAC necessary; on the first attempt the branch is always ignored
• Predict always taken
a bit more complicated, needs a BTAC to take the branch on the first attempt; produces slightly better results
• Predict backward taken, forward not taken
loop-oriented prediction; a backward branch often belongs to a loop and therefore is taken quite often
• Compiler controlled
the compiler sets a bit for each branch to tell the processor how to predict the branch; still static since it never changes during runtime
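The loop-oriented rule can be stated as a one-line predicate. This is an illustrative sketch; the addresses are hypothetical:

```python
# Sketch of "backward taken, forward not taken" (BTFN) static prediction.
def predict_btfn(branch_addr, target_addr):
    """Predict taken iff the branch jumps backwards (a typical loop branch)."""
    return target_addr < branch_addr

print(predict_btfn(0x1040, 0x1000))   # True: backward branch, loop-like
print(predict_btfn(0x1040, 0x1080))   # False: forward branch
```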
Dynamic branch prediction
Dynamic branch prediction means that information about the probability of a branch is collected at runtime.
Dynamic branch prediction is based on knowledge about the past behavior of the branch.
This knowledge can be stored in a table addressed through the address of the branch instruction.
Often, this information is stored in the BTAC as well, but there are also solutions with separate tables.
Dynamic branch prediction produces much better results than static branch prediction.
Today, a misprediction rate below 10% is possible.
Using the BTAC to store branch history information
[Figure: each BTAC entry, indexed by part of the branch address (e.g. the lower m bits), holds the branch address, the branch target address and additional history bits for the branch history.]
Interferences
Only a part of the branch address is used as index into the table containing the branch history.
If two branches have an identical bit pattern in this part, they share the same table entry => interference.
This often leads to mispredictions, because one branch messes up the history of the other one.
The larger the history table, the fewer interferences occur.
Best case: all bits of the branch address would be used as index => no interferences.
Due to limited chip space, this is not possible for large programs.
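The interference effect is easy to demonstrate. This is an illustrative sketch; the branch addresses are hypothetical:

```python
# Sketch: indexing a history table with the lower m bits of the branch address
# makes two different branches alias (interfere) when those bits coincide.
def table_index(branch_addr, m):
    return branch_addr & ((1 << m) - 1)   # keep only the lower m bits

a, b = 0x401234, 0x7F1234                 # two different branch addresses
print(table_index(a, 16) == table_index(b, 16))   # True: they share an entry
print(table_index(a, 24) == table_index(b, 24))   # False: larger table, no clash
```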
One bit predictor
The simplest predictor: only one bit is used to store the branch history.
For each branch, two states (taken, not taken), depending on the last execution, are stored.
The prediction always refers to the last state.
[State diagram: two states, Predict Taken and Predict Not Taken; a taken outcome (T) leads to Predict Taken, a not-taken outcome (NT) leads to Predict Not Taken.]
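A one bit predictor is only a few lines of code. This is an illustrative sketch; the initial state "not taken" is an assumption:

```python
# Minimal one-bit predictor sketch: the prediction is simply the last outcome.
class OneBitPredictor:
    def __init__(self):
        self.last = False          # initially predict not taken (assumption)

    def predict(self):
        return self.last

    def update(self, taken):
        self.last = taken

p = OneBitPredictor()
mispredictions = 0
for taken in [True, True, True, False]:   # loop branch: taken 3x, then exit
    if p.predict() != taken:
        mispredictions += 1
    p.update(taken)
print(mispredictions)   # 2: the first iteration and the loop exit
```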
Two bit predictor
Two bits per branch are used to store the history.
This results in four states (strongly taken, weakly taken, weakly not taken, strongly not taken).
In a strong state, it takes two mispredictions to change the prediction.
[State diagram: two bit predictor with saturation counter; states (11) Predict Strongly Taken, (10) Predict Weakly Taken, (01) Predict Weakly Not Taken, (00) Predict Strongly Not Taken; each taken outcome (T) moves one state towards 11, each not-taken outcome (NT) one state towards 00.]
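The saturation counter variant can be sketched as follows. This is illustrative; the initial state "strongly taken" is an assumption:

```python
# Two-bit saturating counter sketch: states 0/1 predict not taken, 2/3 predict taken.
class TwoBitPredictor:
    def __init__(self, state=3):       # start strongly taken (assumption)
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)   # saturate at strongly taken
        else:
            self.state = max(0, self.state - 1)   # saturate at strongly not taken

def count_mispredictions(pred, outcomes):
    n = 0
    for taken in outcomes:
        if pred.predict() != taken:
            n += 1
        pred.update(taken)
    return n

# Inner loop of a nested loop: taken 3x, not taken once, repeated.
inner = [True, True, True, False] * 2
print(count_mispredictions(TwoBitPredictor(), inner))   # 2: one per loop exit
```

Run on the same pattern, a one bit predictor would additionally mispredict every reentry of the inner loop, which is the comparison made on the next slide.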
Two bit predictor
[State diagram: two bit predictor with hysteresis counter; the same four states (11) Predict Strongly Taken, (10) Predict Weakly Taken, (01) Predict Weakly Not Taken, (00) Predict Strongly Not Taken, but with transitions differing from the saturation counter.]
One bit predictor versus two bit predictor
One bit predictor is simpler and needs less memory
For a branch at the end of a loop, the one bit predictor correctly predicts the
branch direction as long as the loop is iterated
In a nested loop, each iteration of the outer loop produces two mispredictions in
the inner loop
A two bit predictor avoids one of these two mispredictions
The technique can be extended to n bits, but this yields no significant improvement in performance
[Loop diagram: the one-bit predictor mispredicts both when the inner loop is left and when it is reentered; the two-bit predictor mispredicts only when the inner loop is left.]
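The nested-loop behaviour can be checked with a small simulation (an illustrative sketch; the iteration counts are arbitrary):

```python
# Compare mispredictions of a one-bit and a two-bit predictor on the
# backward branch of an inner loop (taken on every iteration except the exit).
class OneBit:
    def __init__(self): self.taken = True
    def predict(self): return self.taken
    def update(self, t): self.taken = t

class TwoBit:
    def __init__(self): self.state = 3          # saturating counter, start strongly taken
    def predict(self): return self.state >= 2
    def update(self, t):
        self.state = min(3, self.state + 1) if t else max(0, self.state - 1)

def mispredictions(p, outer=4, inner=5):
    miss = 0
    for _ in range(outer):                      # outer loop re-enters the inner loop
        for i in range(inner):
            taken = i < inner - 1               # inner branch: taken except on exit
            if p.predict() != taken:
                miss += 1
            p.update(taken)
    return miss

one = mispredictions(OneBit())    # 7: one miss on the first exit, then two per outer iteration
two = mispredictions(TwoBit())    # 4: only one miss (the exit) per outer iteration
```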
Correlation predictors
Often, branches are not independent
Example:
DEC A
BRZ X
. . .
X: LD A,0
BRZ Y
The second branch is always taken when the first branch is taken
Both branches are correlated
This is not exploited by the one or two bit predictors
Correlation predictors
One or two bit predictors only use self-history
Correlation predictors also use neighbor-history
This means the own history and the history of neighboring branches
preceding in execution order are used
Notation: an (m,n) predictor uses the outcome of the last m branches to select one
of 2^m predictors, where each of these predictors is an n-bit predictor for a
single branch
A branch history register (BHR) stores the direction of the last m
branches in an m-bit shift register
The BHR is used as an index to select a pattern history table (PHT)
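This scheme can be sketched as follows (here an (m,n) = (2,2) predictor; the table layout and sizes are illustrative):

```python
# (2,2) correlation predictor: a global 2-bit BHR selects, together with the
# branch address, a 2-bit saturating counter in the pattern history tables.
class CorrelationPredictor:
    def __init__(self, m=2):
        self.mask = (1 << m) - 1
        self.bhr = 0                            # branch history register (m-bit shift register)
        self.pht = {}                           # (branch address, history) -> 2-bit counter

    def predict(self, pc):
        return self.pht.get((pc, self.bhr), 1) >= 2

    def update(self, pc, taken):
        c = self.pht.get((pc, self.bhr), 1)
        self.pht[(pc, self.bhr)] = min(3, c + 1) if taken else max(0, c - 1)
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask   # shift in the new direction

p = CorrelationPredictor()
for _ in range(4):                              # train: the branch at 0x40 is always taken
    p.update(0x40, True)
trained = p.predict(0x40)                       # with history 11 it now predicts taken
```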
Implementation of a (2,2) predictor
[Figure: the branch address indexes a row of 2-bit predictors in the pattern history tables (PHTs); the 2-bit BHR (a 2-bit shift register) selects which of the four PHTs is used.]
Two level adaptive predictors
Two level adaptive predictors have been developed by Yeh and Patt at
nearly the same time as the correlation predictors (1992)
Like the correlation predictor, the two level adaptive predictor uses two
levels of tables, where the first level is used to select prediction bits of the
second level
Variants of two level adaptive predictors:

                                       global PHT   per-set PHTs   per-address PHTs
global scheme (global BHR):            GAg          GAs            GAp
per-address scheme (per-address BHT):  PAg          PAs            PAp
per-set scheme (per-set BHT):          SAg          SAs            SAp

(the correlation predictors use a global BHR and thus belong to the global scheme)
Two level adaptive predictors
Examples: GAg(4), GAp(4), PAg(4), PAp(4)
For the s/S variants, only part of the branch address is used
gshare and gselect predictors
When using a global PHT, parts of the branch address bits and the BHR can be
combined in two ways to address a PHT entry:
gselect: branch address bits and BHR are concatenated
gshare: branch address bits and BHR are XORed
gshare performs slightly better than gselect due to fewer interferences
Example:
branch addr   BHR        gselect 4/4   gshare 8/8
00000000      00000001   00000001      00000001
00000000      00000000   00000000      00000000
11111111      00000000   11110000      11111111
11111111      10000000   11110000      01111111
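The two index functions from the table can be written out directly (bit widths as in the example: gselect concatenates 4+4 bits, gshare XORs 8 bits):

```python
# gselect: concatenate low address bits with the BHR; gshare: XOR them.
def gselect(addr, bhr):
    return ((addr & 0xF) << 4) | (bhr & 0xF)    # 4 address bits : 4 history bits

def gshare(addr, bhr):
    return (addr ^ bhr) & 0xFF                  # 8 address bits XOR 8 history bits

# The last two rows of the table alias under gselect but not under gshare:
a = gselect(0b11111111, 0b00000000)             # 0b11110000
b = gselect(0b11111111, 0b10000000)             # 0b11110000 -> interference
c = gshare(0b11111111, 0b00000000)              # 0b11111111
d = gshare(0b11111111, 0b10000000)              # 0b01111111 -> distinct entries
```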
Hybrid predictors
A hybrid or combined predictor consists of two different branch predictors and a
selection predictor that chooses one of the two branch predictor results for each
prediction
Any predictor can be used as selection predictor
Examples:
McFarling: two bit predictor combined with gshare
Young and Smith: compiler controlled static predictor combined with
two level adaptive predictor
Often, a simple predictor that delivers reasonable results already in the warm-up
phase is combined with a more sophisticated predictor delivering better results later
The combined predictor is often better than the individual predictors
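The selection predictor can itself be a 2-bit counter that learns which component is right more often (a sketch in the spirit of McFarling's combining scheme; the component predictors here are trivial stand-ins):

```python
# Combining predictor: two components plus a 2-bit selector counter.
# Selector >= 2 means "trust predictor 1"; it moves toward whichever
# component was right when exactly one of the two was correct.
class HybridPredictor:
    def __init__(self, p1, p2, sel=1):
        self.p1, self.p2, self.sel = p1, p2, sel

    def predict(self):
        return self.p1.predict() if self.sel >= 2 else self.p2.predict()

    def update(self, taken):
        r1 = self.p1.predict() == taken
        r2 = self.p2.predict() == taken
        if r1 and not r2:
            self.sel = min(3, self.sel + 1)     # predictor 1 alone was right
        elif r2 and not r1:
            self.sel = max(0, self.sel - 1)     # predictor 2 alone was right
        self.p1.update(taken)
        self.p2.update(taken)

class Always:                                   # trivial stand-in component
    def __init__(self, v): self.v = v
    def predict(self): return self.v
    def update(self, taken): pass

h = HybridPredictor(Always(True), Always(False))
before = h.predict()                            # selector still trusts p2: predicts not taken
h.update(True)                                  # only p1 was right -> selector moves toward p1
after = h.predict()                             # now predicts taken
```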
Misprediction rates
SAg, gshare and McFarling:
              committed      conditional   taken        misprediction rate (%)
Application   instructions   branches      branches
              (in millions)  (in millions) (%)          SAg    gshare  combining
compress      80.4           14.4          54.6         10.1   10.1    9.9
gcc           250.9          50.4          49.0         12.8   23.9    12.2
perl          228.2          43.8          52.6         9.2    25.9    11.4
go            548.1          80.3          54.5         25.6   34.4    24.1
m88ksim       416.5          89.8          71.7         4.7    8.6     4.7
xlisp         183.3          41.8          39.5         10.3   10.2    6.8
vortex        180.9          29.1          50.1         2.0    8.3     1.7
jpeg          252.0          20.0          70.0         10.3   12.5    10.4
mean          267.6          46.2          54.3         8.6    14.5    8.1
Multipath execution
Multipath execution: in case of a branch, both paths are followed by the processor simultaneously; the wrong path is discarded later
[Figure: a pipeline with two IF and two DEC stages feeding a common instruction issue point, followed by RF read, ALU/CC, and RF write stages]
a simple multipath pipeline with two instruction fetch and decode stages
Predication
Predication means that the execution of an instruction depends on a predicate
The instruction is executed only if the predicate is true
If all instructions of an instruction set support predication, this is called a fully predicated instruction set
Examples of fully predicated instruction sets: IA-64 (Itanium), ARM
Fully predicated instruction sets can avoid conditional branches
Example:

with cond. branch:        predicated:
    CMP A, 0                CMP A, 0, P
    BZ  L1                  P.ADD B,C
    ADD B,C                 P.SUB C,D
    SUB C,D                 LD  A,3
L1: LD  A,3
Predication
On the hardware side, the predicated instruction is executed anyway.
In case of a false predicate, the result of the instruction is discarded
Advantages:
• conditional branches can be avoided
• no speculation necessary
• basic block length is increased resulting in better compiler optimization
Disadvantages:
• unnecessary execution of instructions
• additional predicate bits necessary in instruction format
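The discard semantics can be illustrated with a toy register model (illustrative only; the register names follow the example on the previous slide):

```python
# Predicated add: the ALU computes the result unconditionally,
# the predicate only gates the write-back.
def p_add(regs, predicate, dst, src):
    result = regs[dst] + regs[src]              # executed in any case
    if predicate:
        regs[dst] = result                      # committed only if the predicate is true

regs = {'A': 0, 'B': 5, 'C': 2}
p = regs['A'] != 0                              # predicate from CMP A, 0
p_add(regs, p, 'B', 'C')                        # P.ADD B,C: A == 0, result discarded
b_after_false = regs['B']                       # B stays 5

regs['A'] = 1
p = regs['A'] != 0
p_add(regs, p, 'B', 'C')                        # predicate true: B becomes 7
b_after_true = regs['B']
```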
Trace cache
A trace is a sequence of executed instructions which can span several basic blocks
Therefore, within a trace all branches are resolved
A trace cache stores such traces while they are executed
If the same trace is executed again, the instruction sequence can be taken from the trace cache and no branch needs to be executed again
While an instruction cache contains the static instruction sequence, the trace cache contains the dynamic instruction sequence
Example for a trace cache: Pentium 4
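A trace cache can be sketched as a lookup keyed by the start address plus the branch directions inside the trace (a simplified model, not the Pentium 4 implementation):

```python
# The trace cache stores the dynamic instruction sequence: the key includes
# the predicted directions of the branches contained in the trace.
class TraceCache:
    def __init__(self):
        self.traces = {}

    def store(self, start_pc, branch_dirs, instructions):
        self.traces[(start_pc, tuple(branch_dirs))] = instructions

    def fetch(self, start_pc, predicted_dirs):
        return self.traces.get((start_pc, tuple(predicted_dirs)))

tc = TraceCache()
# a trace spanning three basic blocks, with its two branches taken / not taken:
tc.store(0x100, [True, False], ['ld', 'add', 'bne', 'sub', 'beq', 'st'])
hit  = tc.fetch(0x100, [True, False])   # same predicted path: whole trace in one access
miss = tc.fetch(0x100, [False, False])  # different path: None, fall back to the I-cache
```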
[Figure: instruction cache (I-cache) versus trace cache]