Hier wird Wissen Wirklichkeit Computer Architecture – Part 8 – page 1 of 75 – Prof. Dr. Uwe Brinkschulte, M.Sc. Benjamin Betting
Part 8
Instruction Level Parallelism (ILP) - Pipelining
Computer Architecture
Slide Sets
WS 2012/2013
Prof. Dr. Uwe Brinkschulte, M.Sc. Benjamin Betting
Parallel Computing
Pipelining
Superscalar
VLIW
EPIC
Multithreading
Multiprocessing
Multi-Cores
Cluster of Computers
Cloud- and Grid-Computing
Thread- and Task-Level Parallelism
Instruction-Level Parallelism
Architectures with instruction level parallelism (ILP): Pipelining vs. concurrency
The basis of most computer architectures is still the well-known von Neumann or Harvard principle, which relies on sequential operation.
In modern high performance processors this sequential operation
mode is extended by instruction level parallelism (ILP).
ILP can be implemented by two modes of parallelism:
• Parallelism in time (pipelining)
• Parallelism in space (concurrency)
Together with technological improvement, these two techniques of parallelism are an important source of high performance.
• Parallelism in time (pipelining) means that the execution of instructions is overlapped in time by partitioning the instruction cycle.
• Parallelism in space (concurrency) means that more than one instruction is executed in parallel, either in order or out of order.
Both techniques are combined in modern microprocessors and define instruction level parallelism for better performance.
Pipelining vs. concurrency
Pipelining vs. concurrency
[Figure: pipelining overlaps the stages of instructions 1-3 in time, one stage per clock cycle, while concurrency executes instructions 1-3 in parallel units in the same cycle.]
Parallelism in time relies on the assembly line principle, which has also matured in automotive production.
It can be combined effectively with concurrency.
In computer architecture, an assembly line is called a pipeline.
"Pipelines accelerate execution speed in the same way like Henry Ford
revolutionized car manufacturing with the introduction of the assembly line"
(Peter Wayner, 1992)
Pipelining means the fragmentation of a machine instruction into several partial operations.
These partial operations are executed by partial units in a sequential and synchronized manner.
Every processing unit executes only one specific partial operation.
Taken together, the partial processing units are called a pipeline.
Pipelining vs. concurrency
Fragmentation of the instruction cycle
1. instruction fetch
The instruction addressed by the program counter is loaded from
main memory or a cache into the instruction register. The program
counter is incremented.
2. instruction decode
Internal control signals are generated according to the instruction's opcode and addressing modes.
3. operand fetch
The operands are provided by registers or functional units.
Possible fragmentation into 5 stages:
Fragmentation of the instruction cycle
4. execute
The operation is executed with the operands.
5. write back
The result is written into a register or bypassed to serve as
operand for a succeeding operation.
Depending on the instruction or instruction class some stages may be
skipped.
The entirety of stages is called instruction cycle.
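The overlapped operation of these stages can be sketched in a small simulation. This is an illustrative Python sketch, not material from the slides; the stage names and the issue of one instruction per cycle are the assumptions:

```python
# Illustrative sketch: cycle-by-cycle occupancy of a classic 5-stage pipeline.
# Stage names follow the fragmentation above.
STAGES = ["IF", "ID", "OF", "EX", "WB"]

def pipeline_table(n_instructions):
    """Return {cycle: {stage: instruction}} for n instructions, one issued per cycle."""
    table = {}
    for i in range(n_instructions):
        for s, stage in enumerate(STAGES):
            cycle = i + s + 1          # instruction i enters stage s in cycle i+s+1
            table.setdefault(cycle, {})[stage] = i + 1
    return table

table = pipeline_table(3)
# Instruction 1 finishes WB in cycle 5, instruction 3 in cycle 7:
# total cycles = k + (n - 1) = 5 + 2 = 7
print(max(table))   # 7
```

In cycle 3, for example, instruction 3 is fetched while instruction 2 is decoded and instruction 1 fetches its operands, exactly the overlap described above.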
• In the first stage, the fetch unit accesses the instruction.
• The fetched instruction is passed to the instruction decode unit.
• While this second unit processes the instruction, the first unit already fetches the next instruction.
• In the best case, an n-stage pipeline executes n instructions in parallel.
• Each instruction is in a different stage of its execution.
• When the pipeline is filled, the execution of one instruction is finished every clock cycle.
• A processor capable of finishing one instruction per clock cycle is called a scalar processor.
Instruction pipelining
[Figure: three instructions staggered across the clock cycles; each passes through instruction fetch, instruction decode, operand fetch, execute and write back, offset by one cycle from its predecessor.]
Instruction pipelining
Pipeline design principles
• Pipeline stages are linked by registers.
• The instruction and the intermediate result are forwarded every clock cycle (in special cases every half clock cycle) to the next pipeline register.
• A pipeline is as fast as its slowest stage.
• Therefore, an important issue in pipeline design is to ensure that the stages consume equivalent amounts of time.
• A high number of pipeline stages (often called a superpipeline) leads to short clock cycles and higher speedup.
• But a stall of a long pipeline, e.g. due to a control flow dependency, results in long wait times until the pipeline can be refilled.
• Thus, a real trade-off exists for the designer.
Basic pipeline measures
Pipelining belongs to the class of fine grain parallelism. It takes place at a microarchitectural level.
Definitions:
• An operation is the application of a function F to operands. An operation produces a result.
• An operation can be made up of a set of partial operations f1 ... fp (in most cases p = k). It is assumed that the partial operations are applied in sequential order.
• An instruction defines through its format the function, operands and result.
A k-stage pipeline executes n operations of F in
tp(n, k) = k + (n − 1) cycles:
k cycles to execute the first instruction (fill the pipeline),
n − 1 cycles to execute the remaining n − 1 instructions.
The figure shows an example: tp(10, 5) = 5 + (10 − 1) = 14
Pipeline operation
[Figure: 10 instructions (i ... i+9) flowing through 5 stages over 14 cycles, with a start-up (fill) phase, a processing phase, and a drain phase.]
Pipeline throughput:
T(n, k) = # operations / tp(n, k) = n / (k + (n − 1))   [operations per cycle]
Pipeline speedup:
S(n, k) = unpipelined execution time / pipelined execution time = n · k / (k + (n − 1))
lim (n → ∞) S(n, k) = k
In the best case, when a high number of linearly succeeding operations is executed, the pipeline speedup converges to the number of pipeline stages.
Basic pipeline measures
Pipeline efficiency:
E(n, k) = S(n, k) / k = n / (k + (n − 1)) ≤ 1
lim (n → ∞) E(n, k) = 1
Pipeline efficiency reaches 1 (peak performance) if an infinite operation stream without bubbles or stalls is executed. This is of course only a best case analysis.
Practical evaluation: Hockney numbers:
n∞: pipeline peak performance at an infinite number of operations
n½: number of operations at which the pipeline reaches half its peak performance
Basic pipeline measures
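The measures above translate directly into a few lines of Python. This is a sketch of the formulas only; the example value tp(10, 5) = 14 is taken from the slides, the other numbers are assumed for illustration:

```python
# Sketch of the basic pipeline measures for a k-stage pipeline and n operations.
def t_p(n, k):              # execution time in cycles: k to fill + (n-1) further results
    return k + (n - 1)

def throughput(n, k):       # operations per cycle
    return n / t_p(n, k)

def speedup(n, k):          # unpipelined time n*k divided by pipelined time
    return n * k / t_p(n, k)

def efficiency(n, k):       # speedup normalized by the stage count, always <= 1
    return speedup(n, k) / k

print(t_p(10, 5))                         # 14, matching the example in the slides
# For large n, speedup approaches k and efficiency approaches 1:
print(round(speedup(10_000, 5), 3), round(efficiency(10_000, 5), 4))
# Half peak performance (the Hockney number n_1/2): solving n/(k+n-1) = 1/2 gives n = k-1.
print(efficiency(4, 5))                   # 0.5 for k = 5
```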
Pipeline stages
[Figure: a k-stage pipeline for a function F; instructions and operands enter the first stage f1, pass through f2, f3, ..., fk, and results leave the last stage.]
Stages are separated by registers.
Partitioning of an operation F:
If a partitioning of an operation is impossible, F can also be applied in parallel and overlapped over two clock cycles.
[Figure: an operation F taking time tf; its partitioning into suboperations f1 and f2 taking time tf/2 each; and the alternative of two overlapped units 1/1' and 2/2', each taking time tf.]
Operation example for partitioning
[Figure: timing of successive operations i, i+1, i+2, i+3 over cycles t ... t+5, shown for the partitioned variant (f1, f2) and for the variant with two overlapped units.]
If tfi = max(tf1 ... tfk) determines the clock frequency in an unbalanced pipeline (tfi >> tf1, ... , tfi >> tfk), fi should be partitioned further for better performance.
Balancing pipeline suboperations
[Figure: version 1, an unbalanced pipeline f1, f2, f3 with f1 << f2 and f2 >> f3; version 2, the slow stage f2 split into f2a, f2b, f2c.]
Overall execution time, clock frequency
Register delays:
tpd = propagation delay time
tsu = setup time
Clock period:
cp = max(tfi) + tpd + tsu
Overall pipelined execution time of an operation F:
t(F) = (max(tfi) + tpd + tsu) · k = k · max(tfi) + k · (tpd + tsu)
where max(tfi) is the maximum processing time of a suboperation, (tpd + tsu) the register delay, and k the number of stages.
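The timing model can be illustrated numerically. Only the formula itself is from the slides; the stage times and register delays below are hypothetical:

```python
# Sketch of the timing model: the slowest suboperation plus register
# propagation delay and setup time determine the clock period.
def clock_period(stage_times, t_pd, t_su):
    return max(stage_times) + t_pd + t_su

def pipelined_time(stage_times, t_pd, t_su):
    k = len(stage_times)
    return clock_period(stage_times, t_pd, t_su) * k

stages = [1.0, 2.0, 1.2, 1.1, 0.9]        # ns, hypothetical suboperation times
print(clock_period(stages, 0.1, 0.1))     # 2.2 ns: the 2.0 ns stage dominates
print(pipelined_time(stages, 0.1, 0.1))   # 11.0 ns for the whole operation
# Splitting the slow 2.0 ns stage into two shorter stages shortens the clock
# period, which is exactly the balancing argument made above.
```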
Architecture of a linear 5-stage pipeline with registers
[Figure: linear 5-stage pipeline IF → ID → OF → EX → WB; the stages are separated by operand registers (OR); the ALU forms the execute stage; PC, IC, IR, DEC, RF and DC are attached to the corresponding stages.]
IF = instruction fetch, ID = instruction decode, OF = operand fetch, EX = execute, WB = write back
IC = instruction cache, DC = data cache, IR = instruction register, CR = control register, RF = register file (e.g. 3-port register file), DEC = decoder (control unit), OR = operand register, PC = program counter
Pipeline hazards
So far, we have assumed a smooth flow of operations through the pipeline.
But there are several effects which can cause stalls in pipelined operation.
These effects are called pipeline hazards.
Pipeline hazards can be caused by
• dataflow dependencies
• resource dependencies
• control flow dependencies
Dataflow dependencies
Pipelined processors have to consider 3 classes of dataflow dependencies. The same dependencies have to be considered in concurrency.
1. true dependency: read after write (RAW)
destination(i) = source(i+1)
X ← A + B   instruction i
Y ← X + B   instruction i+1
X has to be written by instruction i before it is read by the succeeding instruction.
A hazard occurs if the distance of the two instructions is smaller than the number of pipeline stages; in this case X would have to be read before it is created.
2. anti dependency: write after read (WAR)
source(i) = destination(i+1)
Y has to be read by instruction i before it is written by the succeeding instruction.
X ← Y + B   instruction i
Y ← A + C   instruction i+1
Dataflow dependencies
A hazard occurs if the order of the instructions is changed in the pipeline.
3. output dependency: write after write (WAW)
destination(i) = destination(i+1)
Both instructions write their results into the same register.
Y ← A / B   instruction i
Y ← C + D   instruction i+1
Dataflow dependencies
A hazard occurs if the order of the instructions is changed in the pipeline.
Example of a short assembler program containing a true dependency, anti dependencies and an output dependency.
I1 ADD R1,2,R2   ; R1 = R2+2
I2 ADD R4,R3,R1  ; R4 = R1+R3
I3 MULT R3,3,R5  ; R3 = R5·3
I4 MULT R3,3,R6  ; R3 = R6·3
Dependency graph
[Figure: I1 → I2 true dependency (R1); I2 → I3 and I2 → I4 anti dependencies (R3); I3 → I4 output dependency (R3).]
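The three dependency classes can be detected mechanically. The following helper is a hypothetical illustration (not from the slides) that classifies the dependency of a later instruction on an earlier one, with instructions given as (destination, sources) pairs:

```python
# Hypothetical dependency classifier for two instructions in program order.
def classify(i, j):
    """i, j are (destination, sources) tuples; j comes after i."""
    dest_i, srcs_i = i
    dest_j, srcs_j = j
    deps = []
    if dest_i in srcs_j:
        deps.append("RAW")          # true dependency: j reads what i writes
    if dest_j in srcs_i:
        deps.append("WAR")          # anti dependency: j overwrites what i reads
    if dest_i == dest_j:
        deps.append("WAW")          # output dependency: both write the same register
    return deps

I1 = ("R1", ["R2"])        # ADD R1,2,R2  ; R1 = R2+2
I2 = ("R4", ["R3", "R1"])  # ADD R4,R3,R1 ; R4 = R1+R3
I3 = ("R3", ["R5"])        # MULT R3,3,R5 ; R3 = R5*3
print(classify(I1, I2))    # ['RAW']  true dependency on R1
print(classify(I2, I3))    # ['WAR']  anti dependency on R3
```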
Example of a true dependency hazard (RAW) in a 5-stage pipeline
i:   X := A op B
i+1: Y := X op C
[Figure: instruction i passes through fetch, decode, read, execute, write; at the issue point, the issue check detects that i+1 would read X before i has written it (RAW), so i+1 cannot proceed.]
Solutions for true dependency hazards
Software solutions:
• Inserting NOOP instructions
• Reorder instructions
Hardware solutions:
• Pipeline interlocking
• Forwarding
Any combinations of these solutions are possible as well
Solving a true dependency hazard by inserting NOOPs
The RAW hazard is eliminated through insertion of NOOPs (bubbles) into the pipeline. This was the solution used in the first RISC processors.
The NOOPs are inserted by the compiler or programmer.
[Figure: the 5-stage pipeline with two NOOPs inserted between instruction i (X := A op B) and instruction i+1 (Y := X op C), so that i+1 reads X only after i has written it.]
Solving a true dependency hazard by reordering instructions
Sometimes, instead of inserting NOOPs, instructions can be reordered to the same effect.
For this, instructions having no true dependencies and not changing the control flow are arranged in between the conflicting instructions.
Example:
With NOOPs:
X := A op B
NOOP
NOOP
Y := X op C
Z := D op E
F := INP(0)
Reordered:
X := A op B
Z := D op E
F := INP(0)
Y := X op C
Solving a true dependency hazard by pipeline interlocking
Pipeline interlocking means the pipeline processing is delayed by hardware until the conflict is solved.
So the compiler or programmer is relieved (used e.g. in the MIPS processor, Microprocessor with Interlocked Pipeline Stages).
[Figure: the 5-stage pipeline as before; the issue check for i+1 stalls (interlocks) instruction i+1 until instruction i has written X.]
Forwarding
Forwarding is a simple hardware technique to save one delay slot (NOOP).
An operand X needed by instruction i+1 is directly forwarded from the output of the ALU to its input. The register file is bypassed.
If more than one delay slot is necessary, forwarding is combined with interlocking or NOOP insertion.
The data forwarding path can also be used to provide operands of a waiting instruction from the cache.
This shortens the delay slot between a load and an execute instruction using this operand.
Data cache access is sped up considerably by this technique.
Load and result forwarding
[Figure: the path from cache through memory register to ALU, with two bypasses: load forwarding from the cache directly to the ALU input, and result forwarding from the ALU output back to its input.]
Hardware realization of the forward path
[Figure: the 5-stage pipeline with forward control and two bypass paths: the data forwarding path (result forwarding) from the EX stage output (R) back to the ALU inputs (S1, S2), and the load data path (load forwarding) from the data cache; one NOOP or interlock cycle remains, checked at the issue point for i+1.]
Anti- and output-dependency hazards (false dependencies)
An output dependency hazard may occur if an instruction i needs more time units to execute than instruction i+1.
Of course this is only possible if the processor consists of several processing units with different numbers of stages.
Anti-dependency hazards only occur if the order of instructions is changed in the pipeline.
This is never the case for ordinary scalar pipelines.
In superscalar pipelines, this hazard occurs.
Output dependency hazard (regarding only 3 stages of the 5-stage pipeline)
[Figure: two functional units FU1 and FU2; instruction i is issued to FU1 and needs three execute cycles for A op B, instruction i+1 is issued to FU2 and needs one cycle for C op D; both write register Y, so i+1 would write Y before i does.]
Removing false dependencies
False dependencies can always be removed by register renaming.
This can be done by hardware or by the compiler.
So the hazard will never occur.
Example (anti dependency on the left, output dependency on the right):
X := Y op B        Y := A op B
Y := A op C        Y := C op D
Renaming the second Y to Z:
X := Y op B        Y := A op B
Z := A op C        Z := C op D
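Renaming can be sketched in a few lines. The following is a minimal illustrative algorithm, not the hardware mechanism described here; the fresh names T0, T1, ... are an assumption:

```python
# Minimal register-renaming sketch: every write gets a fresh name,
# so WAR and WAW dependencies disappear; RAW dependencies remain.
def rename(program):
    """program: list of (dest, sources). Returns the renamed program."""
    latest = {}                                   # architectural register -> current name
    fresh = iter("T%d" % n for n in range(1000))
    out = []
    for dest, srcs in program:
        srcs = [latest.get(s, s) for s in srcs]   # read the newest version of each source
        latest[dest] = next(fresh)                # allocate a new name for every write
        out.append((latest[dest], srcs))
    return out

prog = [("Y", ["A", "B"]), ("Y", ["C", "D"])]     # WAW on Y
print(rename(prog))   # [('T0', ['A', 'B']), ('T1', ['C', 'D'])]  no WAW left
```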
Resource dependencies
Resource dependencies can be classified into:
• intra-pipeline dependencies
• instruction class dependencies
An intra-pipeline dependency occurs if instructions of two succeeding stages need the same pipeline resource.
The succeeding instruction (and the following instructions) have to be delayed until the resource becomes available.
This happens e.g. if the common register file lacks a sufficient number of ports or some instructions need more than one clock cycle to run through a particular pipeline resource.
Examples: a register file with a common read/write port (possible conflict of a read in stage 3 with a write in stage 5) or a multi-cycle division unit in the execute stage.
Resource dependencies
An instruction class dependency occurs if two or more instructions which are in the same pipeline stage need a pipeline resource existing only once.
This never happens in a scalar pipeline.
Superscalar processors with several execution units often face this sort of conflict.
A twofold superscalar processor may issue two instructions to two execution units simultaneously.
If these instructions need the same (only once existent) execution unit, an instruction class dependency arises.
Control flow dependencies
Every change in control flow is a potential candidate for a conflict.
Several instruction classes cause changes in control flow:
• conditional branch
• jump
• jump to subroutine, return from subroutine
The control flow target is not yet available when the next instruction is to be fetched.
Especially conditional branches cause severe conflicts:
the analysis of the condition, which usually finishes only in the last pipeline stages, determines the next instruction to issue.
Control flow hazards
Example of a control flow hazard due to a conditional branch
[Figure: a CMP instruction followed by BRANCH COND in the 5-stage pipeline IF, ID, OF, EX, WB; the condition code is available only after the EX stage of CMP, so the next correct instruction cannot be fetched until then.]
Solutions for control flow hazards
Software solutions:
• Inserting NOOP instructions
• Reorder instructions
Hardware solutions:
• Pipeline interlocking
• Forwarding
• Fast compare and jump logic
• Branch prediction
Solution: interlocking or NOOP insertion
[Figure: CMP and BRANCH COND run through the pipeline; the delay slots after the branch are filled with NOOPs or interlocking until the condition code from the EX stage selects NEXT CORRECT I and NEXT+1 CORRECT I.]
Penalty: 6 cycles
Reducing the penalty by forwarding the comparison result
[Figure: the condition code is forwarded from the EX stage of CMP directly to the branch logic, so fewer NOOP or interlock cycles are needed before NEXT CORRECT I can be fetched.]
Penalty: 4 cycles
Reducing the penalty by forwarding the next correct instruction address
[Figure: in addition, the next correct instruction address is forwarded directly to the fetch stage, removing one more NOOP or interlock cycle.]
Penalty: 3 cycles
Reducing the penalty by fast compare and jump logic
[Figure: a fast compare logic computes the comparison result early and a fast jump logic selects the next address, leaving only two delay slots of NOOP or interlocking.]
Penalty: 2 cycles
Reducing the penalty by fast compare and jump logic
Special logic for compare and jump instructions can reduce the penalty by one cycle.
These circuits can be much faster than a more general execution unit (ALU), allowing comparison and jump to complete in one clock cycle.
The higher speed of the fast compare logic is possible because normally only simple comparisons like equal, unequal, <0, >0, ≤0, ≥0, =0 are needed.
Reducing the penalty by fast compare and jump logic + reordering instructions
The remaining 2 NOOPs or interlock cycles can be replaced by reordering code.
Two independent instructions can be moved after the branch instruction (delayed branch).
Example:
Before reordering:
Z := D op E
F := INP(0)
CMP
BRANCH COND
NOOP
NOOP
NEXT INSTR (COND = FALSE)
. . .
NEXT INSTR (COND = TRUE)
After reordering (delayed branch):
CMP
BRANCH COND
Z := D op E
F := INP(0)
NEXT INSTR (COND = FALSE)
. . .
NEXT INSTR (COND = TRUE)
Branch prediction
Another possibility of avoiding control flow hazards is branch prediction.
Here, the outcome of the branch (taken or not taken) is predicted before the result of the comparison is known.
In case of correct branch prediction, the penalty can be reduced down to 0.
First, let us assume a perfectly working branch predictor.
Reducing the penalty by branch prediction
[Figure: a branch predictor delivers the prediction result (taken or not taken) and the next address to the fetch stage; without a branch target address, two delay slots remain.]
Penalty: still 2 cycles
Branch target address cache
To further reduce the penalty, a branch target address cache (BTAC) can be introduced.
This cache holds the addresses of branches and the corresponding target addresses.
Therefore, if the cache is already filled, a branch and its possible target address can be identified in the fetch phase.
[Figure: the BTAC is indexed by part of the branch address (e.g. the lower m bits) and delivers the branch target address.]
Reducing the penalty by branch prediction and branch target address cache
[Figure: the branch predictor and the BTAC together deliver the prediction result and the next address already in the fetch stage, so NEXT CORRECT I and NEXT+1 CORRECT I follow the branch without delay slots.]
Penalty: 0 cycles
Branch prediction and pipeline utilization
For a penalty of 0 cycles, two prerequisites must be met:
• the branch address must be stored in the BTAC
• the branch prediction must be correct
Otherwise we will get a penalty.
Branch prediction and pipeline utilization
In case of a BTAC miss, the penalty will be pb (in our example 2).
In case of a misprediction, the penalty will be the number of cycles pm needed to flush the pipeline (e.g. 5).
In modern processors, this can be much more (e.g. 11 for the Pentium II).
The overall penalty calculates to:
p = m · pm + (1 − m) · b · pb    with m: misprediction rate, b: BTAC miss rate
The pipeline utilization can be calculated as:
u = n / (n + p)    with n: number of instructions
So, an excellent branch prediction is necessary.
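The penalty and utilization formulas can be evaluated directly. This is a sketch; the penalties pb = 2 and pm = 5 follow the example above, while the rates m and b are hypothetical:

```python
# Sketch of the overall branch penalty and pipeline utilization formulas.
def penalty(m, b, p_m, p_b):
    """m: misprediction rate, b: BTAC miss rate, p_m/p_b: respective penalties."""
    return m * p_m + (1 - m) * b * p_b

def utilization(n, p):
    """n: number of instructions, p: overall penalty in cycles."""
    return n / (n + p)

p = penalty(m=0.1, b=0.2, p_m=5, p_b=2)
print(round(p, 2))                  # 0.86 cycles average penalty
print(round(utilization(10, p), 3))
# A better predictor (m = 0.01) cuts the penalty considerably:
print(round(penalty(m=0.01, b=0.2, p_m=5, p_b=2), 3))
```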
Branch prediction techniques
In general, two classes of branch prediction techniques can be
distinguished:
• static branch prediction
for a given branch, the prediction is always the same, it never
changes
• dynamic branch prediction
for a given branch, the prediction changes dynamically
Static branch prediction
• Predict always not taken
the simplest technique, no BTAC necessary; on the first attempt the branch is always ignored
• Predict always taken
a bit more complicated, needs a BTAC to take the branch on the first attempt; produces slightly better results
• Predict backward taken, forward not taken
loop-oriented prediction; a backward branch often belongs to a loop and therefore is taken quite often
• Compiler controlled
the compiler sets a bit for each branch to tell the processor how to predict the branch; still static since it never changes during runtime
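The loop-oriented rule can be stated as a one-line predicate. This is an illustrative sketch; the addresses are hypothetical:

```python
# Sketch of "backward taken, forward not taken" (BTFN) static prediction.
def predict_btfn(branch_addr, target_addr):
    """Predict taken iff the branch jumps backwards (a typical loop branch)."""
    return target_addr < branch_addr

print(predict_btfn(0x1040, 0x1000))   # True: backward branch, loop-like
print(predict_btfn(0x1040, 0x1080))   # False: forward branch
```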
Dynamic branch prediction
Dynamic branch prediction means that information about the probability of a branch is collected at runtime.
Dynamic branch prediction is based on knowledge about the past behavior of the branch.
This knowledge can be stored in a table addressed through the address of the branch instruction.
Often, this information is stored in the BTAC as well, but there are also solutions with separate tables.
Dynamic branch prediction produces much better results than static branch prediction.
Today, a misprediction rate below 10% is possible.
Using the BTAC to store branch history information
[Figure: each BTAC entry, indexed by part of the branch address (e.g. the lower m bits), holds the branch address, the branch target address and additional history bits for the branch history.]
Interferences
Only a part of the branch address is used as index into the table containing the branch history.
If two branches have an identical bit pattern in this part, they share the same table entry => interference.
This often leads to mispredictions, because one branch messes up the history of the other one.
The larger the history table, the fewer interferences occur.
Best case: all bits of the branch address would be used as index => no interferences.
Due to limited chip space, this is not possible for large programs.
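The interference effect is easy to demonstrate. This is an illustrative sketch; the branch addresses are hypothetical:

```python
# Sketch: indexing a history table with the lower m bits of the branch address
# makes two different branches alias (interfere) when those bits coincide.
def table_index(branch_addr, m):
    return branch_addr & ((1 << m) - 1)   # keep only the lower m bits

a, b = 0x401234, 0x7F1234                 # two different branch addresses
print(table_index(a, 16) == table_index(b, 16))   # True: they share an entry
print(table_index(a, 24) == table_index(b, 24))   # False: larger table, no clash
```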
One bit predictor
The simplest predictor: only one bit is used to store the branch history.
For each branch, two states (taken, not taken), depending on the last execution, are stored.
The prediction always refers to the last state.
[State diagram: two states, Predict Taken and Predict Not Taken; a taken outcome (T) leads to Predict Taken, a not-taken outcome (NT) leads to Predict Not Taken.]
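A one bit predictor is only a few lines of code. This is an illustrative sketch; the initial state "not taken" is an assumption:

```python
# Minimal one-bit predictor sketch: the prediction is simply the last outcome.
class OneBitPredictor:
    def __init__(self):
        self.last = False          # initially predict not taken (assumption)

    def predict(self):
        return self.last

    def update(self, taken):
        self.last = taken

p = OneBitPredictor()
mispredictions = 0
for taken in [True, True, True, False]:   # loop branch: taken 3x, then exit
    if p.predict() != taken:
        mispredictions += 1
    p.update(taken)
print(mispredictions)   # 2: the first iteration and the loop exit
```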
Two bit predictor
Two bits per branch are used to store the history.
This results in four states (strongly taken, weakly taken, weakly not taken, strongly not taken).
In a strong state, it takes two mispredictions to change the prediction.
[State diagram: two bit predictor with saturation counter; states (11) Predict Strongly Taken, (10) Predict Weakly Taken, (01) Predict Weakly Not Taken, (00) Predict Strongly Not Taken; each taken outcome (T) moves one state towards 11, each not-taken outcome (NT) one state towards 00.]
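The saturation counter variant can be sketched as follows. This is illustrative; the initial state "strongly taken" is an assumption:

```python
# Two-bit saturating counter sketch: states 0/1 predict not taken, 2/3 predict taken.
class TwoBitPredictor:
    def __init__(self, state=3):       # start strongly taken (assumption)
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)   # saturate at strongly taken
        else:
            self.state = max(0, self.state - 1)   # saturate at strongly not taken

def count_mispredictions(pred, outcomes):
    n = 0
    for taken in outcomes:
        if pred.predict() != taken:
            n += 1
        pred.update(taken)
    return n

# Inner loop of a nested loop: taken 3x, not taken once, repeated.
inner = [True, True, True, False] * 2
print(count_mispredictions(TwoBitPredictor(), inner))   # 2: one per loop exit
```

Run on the same pattern, a one bit predictor would additionally mispredict every reentry of the inner loop, which is the comparison made on the next slide.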
Two bit predictor
[State diagram: two bit predictor with hysteresis counter; the same four states (11) Predict Strongly Taken, (10) Predict Weakly Taken, (01) Predict Weakly Not Taken, (00) Predict Strongly Not Taken, but with transitions differing from the saturation counter.]
One bit predictor versus two bit predictor
One bit predictor is simpler and needs less memory
For a branch at the end of a loop, the one bit predictor correctly predicts the
branch direction as long as the loop is iterated
In a nested loop, each iteration of the outer loop produces two mispredictions in
the inner loop
A two bit predictor avoids one of these two mispredictions
The technique can be extended to n bits, but this yields no significant improvement in performance
[Loop diagram: the one-bit predictor mispredicts both when the inner loop is left and when it is reentered; the two-bit predictor mispredicts only when the inner loop is left.]
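The nested-loop behaviour can be checked with a small simulation (an illustrative sketch; the iteration counts are arbitrary):

```python
# Compare mispredictions of a one-bit and a two-bit predictor on the
# backward branch of an inner loop (taken on every iteration except the exit).
class OneBit:
    def __init__(self): self.taken = True
    def predict(self): return self.taken
    def update(self, t): self.taken = t

class TwoBit:
    def __init__(self): self.state = 3          # saturating counter, start strongly taken
    def predict(self): return self.state >= 2
    def update(self, t):
        self.state = min(3, self.state + 1) if t else max(0, self.state - 1)

def mispredictions(p, outer=4, inner=5):
    miss = 0
    for _ in range(outer):                      # outer loop re-enters the inner loop
        for i in range(inner):
            taken = i < inner - 1               # inner branch: taken except on exit
            if p.predict() != taken:
                miss += 1
            p.update(taken)
    return miss

one = mispredictions(OneBit())    # 7: one miss on the first exit, then two per outer iteration
two = mispredictions(TwoBit())    # 4: only one miss (the exit) per outer iteration
```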
Correlation predictors
Often, branches are not independent
Example:
DEC A
BRZ X
. . .
X: LD A,0
BRZ Y
The second branch is always taken when the first branch is taken
Both branches are correlated
This is not exploited by the one or two bit predictors
Correlation predictors
One or two bit predictors only use self-history
Correlation predictors also use neighbor-history
This means the own history and the history of neighboring branches
preceding in execution order are used
Notation: an (m,n) predictor uses the outcome of the last m branches to select one
of 2^m predictors, where each of these predictors is an n-bit predictor for a
single branch
A branch history register (BHR) stores the direction of the last m
branches in an m-bit shift register
The BHR is used as an index to select a pattern history table (PHT)
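This scheme can be sketched as follows (here an (m,n) = (2,2) predictor; the table layout and sizes are illustrative):

```python
# (2,2) correlation predictor: a global 2-bit BHR selects, together with the
# branch address, a 2-bit saturating counter in the pattern history tables.
class CorrelationPredictor:
    def __init__(self, m=2):
        self.mask = (1 << m) - 1
        self.bhr = 0                            # branch history register (m-bit shift register)
        self.pht = {}                           # (branch address, history) -> 2-bit counter

    def predict(self, pc):
        return self.pht.get((pc, self.bhr), 1) >= 2

    def update(self, pc, taken):
        c = self.pht.get((pc, self.bhr), 1)
        self.pht[(pc, self.bhr)] = min(3, c + 1) if taken else max(0, c - 1)
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask   # shift in the new direction

p = CorrelationPredictor()
for _ in range(4):                              # train: the branch at 0x40 is always taken
    p.update(0x40, True)
trained = p.predict(0x40)                       # with history 11 it now predicts taken
```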
Implementation of a (2,2) predictor
[Figure: the branch address indexes a row of 2-bit predictors in the pattern history tables (PHTs); the 2-bit BHR (a 2-bit shift register) selects which of the four PHTs is used.]
Two level adaptive predictors
Two level adaptive predictors have been developed by Yeh and Patt at
nearly the same time as the correlation predictors (1992)
Like the correlation predictor, the two level adaptive predictor uses two
levels of tables, where the first level is used to select prediction bits of the
second level
Variants of two level adaptive predictors:

                                       global PHT   per-set PHTs   per-address PHTs
global scheme (global BHR):            GAg          GAs            GAp
per-address scheme (per-address BHT):  PAg          PAs            PAp
per-set scheme (per-set BHT):          SAg          SAs            SAp

(the correlation predictors use a global BHR and thus belong to the global scheme)
Two level adaptive predictors
Examples: GAg(4), GAp(4), PAg(4), PAp(4)
For the s/S variants, only part of the branch address is used
gshare and gselect predictors
When using a global PHT, parts of the branch address bits and the BHR can be
combined in two ways to address a PHT entry:
gselect: branch address bits and BHR are concatenated
gshare: branch address bits and BHR are XORed
gshare performs slightly better than gselect due to fewer interferences
Example:
branch addr   BHR        gselect 4/4   gshare 8/8
00000000      00000001   00000001      00000001
00000000      00000000   00000000      00000000
11111111      00000000   11110000      11111111
11111111      10000000   11110000      01111111
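The two index functions from the table can be written out directly (bit widths as in the example: gselect concatenates 4+4 bits, gshare XORs 8 bits):

```python
# gselect: concatenate low address bits with the BHR; gshare: XOR them.
def gselect(addr, bhr):
    return ((addr & 0xF) << 4) | (bhr & 0xF)    # 4 address bits : 4 history bits

def gshare(addr, bhr):
    return (addr ^ bhr) & 0xFF                  # 8 address bits XOR 8 history bits

# The last two rows of the table alias under gselect but not under gshare:
a = gselect(0b11111111, 0b00000000)             # 0b11110000
b = gselect(0b11111111, 0b10000000)             # 0b11110000 -> interference
c = gshare(0b11111111, 0b00000000)              # 0b11111111
d = gshare(0b11111111, 0b10000000)              # 0b01111111 -> distinct entries
```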
Hybrid predictors
A hybrid or combined predictor consists of two different branch predictors and a
selection predictor that chooses one of the two branch predictor results for each
prediction
Any predictor can be used as selection predictor
Examples:
McFarling: two bit predictor combined with gshare
Young and Smith: compiler controlled static predictor combined with
two level adaptive predictor
Often, a simple predictor that delivers reasonable results already in the warm-up
phase is combined with a more sophisticated predictor delivering better results later
The combined predictor is often better than the individual predictors
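The selection predictor can itself be a 2-bit counter that learns which component is right more often (a sketch in the spirit of McFarling's combining scheme; the component predictors here are trivial stand-ins):

```python
# Combining predictor: two components plus a 2-bit selector counter.
# Selector >= 2 means "trust predictor 1"; it moves toward whichever
# component was right when exactly one of the two was correct.
class HybridPredictor:
    def __init__(self, p1, p2, sel=1):
        self.p1, self.p2, self.sel = p1, p2, sel

    def predict(self):
        return self.p1.predict() if self.sel >= 2 else self.p2.predict()

    def update(self, taken):
        r1 = self.p1.predict() == taken
        r2 = self.p2.predict() == taken
        if r1 and not r2:
            self.sel = min(3, self.sel + 1)     # predictor 1 alone was right
        elif r2 and not r1:
            self.sel = max(0, self.sel - 1)     # predictor 2 alone was right
        self.p1.update(taken)
        self.p2.update(taken)

class Always:                                   # trivial stand-in component
    def __init__(self, v): self.v = v
    def predict(self): return self.v
    def update(self, taken): pass

h = HybridPredictor(Always(True), Always(False))
before = h.predict()                            # selector still trusts p2: predicts not taken
h.update(True)                                  # only p1 was right -> selector moves toward p1
after = h.predict()                             # now predicts taken
```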
Misprediction rates
SAg, gshare and McFarling:
              committed      conditional   taken        misprediction rate (%)
Application   instructions   branches      branches
              (in millions)  (in millions) (%)          SAg    gshare  combining
compress      80.4           14.4          54.6         10.1   10.1    9.9
gcc           250.9          50.4          49.0         12.8   23.9    12.2
perl          228.2          43.8          52.6         9.2    25.9    11.4
go            548.1          80.3          54.5         25.6   34.4    24.1
m88ksim       416.5          89.8          71.7         4.7    8.6     4.7
xlisp         183.3          41.8          39.5         10.3   10.2    6.8
vortex        180.9          29.1          50.1         2.0    8.3     1.7
jpeg          252.0          20.0          70.0         10.3   12.5    10.4
mean          267.6          46.2          54.3         8.6    14.5    8.1
Multipath execution
Multipath execution: in case of a branch, both paths are followed by the processor simultaneously; the wrong path is discarded later
[Figure: a pipeline with two IF and two DEC stages feeding a common instruction issue point, followed by RF read, ALU/CC, and RF write stages]
a simple multipath pipeline with two instruction fetch and decode stages
Predication
Predication means that the execution of an instruction depends on a predicate
The instruction is executed only if the predicate is true
If all instructions of an instruction set support predication, this is called a fully predicated instruction set
Examples of fully predicated instruction sets: IA-64 (Itanium), ARM
Fully predicated instruction sets can avoid conditional branches
Example:

with cond. branch:        predicated:
    CMP A, 0                CMP A, 0, P
    BZ  L1                  P.ADD B,C
    ADD B,C                 P.SUB C,D
    SUB C,D                 LD  A,3
L1: LD  A,3
Predication
On the hardware side, the predicated instruction is executed anyway.
In case of a false predicate, the result of the instruction is discarded
Advantages:
• conditional branches can be avoided
• no speculation necessary
• basic block length is increased resulting in better compiler optimization
Disadvantages:
• unnecessary execution of instructions
• additional predicate bits necessary in instruction format
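The discard semantics can be illustrated with a toy register model (illustrative only; the register names follow the example on the previous slide):

```python
# Predicated add: the ALU computes the result unconditionally,
# the predicate only gates the write-back.
def p_add(regs, predicate, dst, src):
    result = regs[dst] + regs[src]              # executed in any case
    if predicate:
        regs[dst] = result                      # committed only if the predicate is true

regs = {'A': 0, 'B': 5, 'C': 2}
p = regs['A'] != 0                              # predicate from CMP A, 0
p_add(regs, p, 'B', 'C')                        # P.ADD B,C: A == 0, result discarded
b_after_false = regs['B']                       # B stays 5

regs['A'] = 1
p = regs['A'] != 0
p_add(regs, p, 'B', 'C')                        # predicate true: B becomes 7
b_after_true = regs['B']
```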
Trace cache
A trace is a sequence of executed instructions which can span several basic blocks
Therefore, within a trace all branches are resolved
A trace cache stores such traces while they are executed
If the same trace is executed again, the instruction sequence can be taken from the trace cache and no branch needs to be executed again
While an instruction cache contains the static instruction sequence, the trace cache contains the dynamic instruction sequence
Example for a trace cache: Pentium 4
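A trace cache can be sketched as a lookup keyed by the start address plus the branch directions inside the trace (a simplified model, not the Pentium 4 implementation):

```python
# The trace cache stores the dynamic instruction sequence: the key includes
# the predicted directions of the branches contained in the trace.
class TraceCache:
    def __init__(self):
        self.traces = {}

    def store(self, start_pc, branch_dirs, instructions):
        self.traces[(start_pc, tuple(branch_dirs))] = instructions

    def fetch(self, start_pc, predicted_dirs):
        return self.traces.get((start_pc, tuple(predicted_dirs)))

tc = TraceCache()
# a trace spanning three basic blocks, with its two branches taken / not taken:
tc.store(0x100, [True, False], ['ld', 'add', 'bne', 'sub', 'beq', 'st'])
hit  = tc.fetch(0x100, [True, False])   # same predicted path: whole trace in one access
miss = tc.fetch(0x100, [False, False])  # different path: None, fall back to the I-cache
```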
[Figure: instruction cache (I-cache) versus trace cache]