Hardware-based Devirtualization (VPC Prediction)
Hyesoon Kim, Jose A. Joao, Onur Mutlu++, Chang Joo Lee, Yale N. Patt, Robert Cohn*
Direct vs. Indirect Branch
Conditional (direct) branch: br.cond TARGET
Indirect branch: R1 = MEM[R2]; branch R1
[Figure: a conditional (direct) branch at address A goes to TARG if taken (T) or to A+1 if not taken (N); the next address after an indirect branch at A is unknown (?)]
Indirect branches are costly for processor performance
Much more difficult to predict than conditional (direct) branches: multiple target addresses
An indirect branch predictor requires a large structure
Source Code Examples
Switch structures
Virtual function calls
Source code:
Shape *s = …;
a = s->area(); // virtual function call
Static assembly code:
R1 = MEM[R2] // function address lookup
call R1 // a register-indirect call
Indirect Branch Mispredictions
Data from Intel Core Duo processor
[Figure: mispredictions per kilo-instruction (MPKI, axis 0 to 16), split into direct and indirect branches, for applications including iexplorer, firefox, vtune, cygwin, emacs, acroread, outlook, excel, simics, windvd, pptview, and sqlservr, plus the average (AVG)]
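[Figure: conventional branch predictor. The PC (0x0800) and a hash of the global history register (GHR, ..1001010) index the direction predictor, the Branch Target Buffer (BTB), and a separate indirect branch predictor; the predicted target is chosen among PC+1, TARG1, and TARG2 depending on whether the branch is direct or indirect and whether it is predicted taken (T)]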
VPC Prediction: Basic Idea
Key idea: Treat an indirect branch as multiple “virtual” conditional branches
Only for prediction purposes
Use the conditional branch predictor
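[Figure: VPC branch predictor. Virtual PCs (VPC1, VPC2) derived from the PC (0x0800), together with a hash of the GHR (..1001010), index the direction predictor and the Branch Target Buffer, which supplies the predicted target (TARG1 or TARG2); no separate indirect branch predictor is present]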
VPC Prediction: Basic Idea (continued)
Benefits:
No separate complex structure
Can be applied to any other conditional branch prediction algorithm
Improving the conditional branch prediction algorithm will also improve the indirect branch prediction accuracy
Inspiration: Static Devirtualization
Source code:
Shape *s = …;
a = s->area(); // an indirect call
Optimized source code:
Shape *s = …;
if (s->type == Rectangle) // a conditional branch at PC: X
  a = Rectangle::area();
else if (s->type == Circle) // a conditional branch at PC: Y
  a = Circle::area();
else
  a = s->area(); // an indirect call at PC: Z
Smalltalk (’84), Calder and Grunwald (’94), Garrett et al. (’94), Ishizaki et al. (’00)
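The guarded-call transformation can be sketched as compilable C++. This is only an illustration under assumptions: the `Kind` tag, the three concrete classes, and the `area_devirtualized` helper are hypothetical names that stand in for the slide's `s->type` test, and a real compiler would perform this rewrite on its intermediate representation rather than in source.

```cpp
#include <cassert>

// Hypothetical type tag standing in for the slide's s->type field.
enum class Kind { Rectangle, Circle, Other };

struct Shape {
    Kind type;
    explicit Shape(Kind k) : type(k) {}
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Rectangle : Shape {
    double w, h;
    Rectangle(double w_, double h_) : Shape(Kind::Rectangle), w(w_), h(h_) {}
    double area() const override { return w * h; }
};

struct Circle : Shape {
    double r;
    explicit Circle(double r_) : Shape(Kind::Circle), r(r_) {}
    double area() const override { return 3.14159 * r * r; }
};

// A type the "optimizer" did not anticipate; it takes the fallback path.
struct Triangle : Shape {
    double b, h;
    Triangle(double b_, double h_) : Shape(Kind::Other), b(b_), h(h_) {}
    double area() const override { return 0.5 * b * h; }
};

// Guarded direct calls for the expected types, indirect call as the fallback.
double area_devirtualized(const Shape* s) {
    if (s->type == Kind::Rectangle)    // conditional branch (PC: X on the slide)
        return static_cast<const Rectangle*>(s)->Rectangle::area();  // direct call
    else if (s->type == Kind::Circle)  // conditional branch (PC: Y)
        return static_cast<const Circle*>(s)->Circle::area();        // direct call
    else
        return s->area();              // indirect call (PC: Z)
}
```

The qualified calls (`Rectangle::area`, `Circle::area`) bypass virtual dispatch, so only the final `else` still executes an indirect call.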
VPC Prediction
Source code:
Shape *s = …;
a = s->area(); // an indirect call
Static assembly code:
R1 = MEM[R2]
call R1 // PC: L
Dynamic virtual branches (for prediction purposes):
conditional jump TARGET1 // virtual PC = L
conditional jump TARGET2 // virtual PC = L XOR HASHVAL[1]
conditional jump TARGET3 // virtual PC = L XOR HASHVAL[2]
conditional jump TARGET4 // virtual PC = L XOR HASHVAL[3]
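Virtual PC Address Generation
Use original PC address and iteration counter value
[Figure: the iteration counter value selects an entry of the hash value table (example entries 0xabcd, 0x018a, 0x7a9c, 0x…); the selected hash value is combined with the PC to form the virtual PC]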
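A minimal sketch of virtual PC generation follows. It assumes the XOR combination given in the virtual branch listing (virtual PC = L for the first virtual branch, L XOR HASHVAL[k] afterwards). The table size, the zero entry at index 0, and the function name `virtual_pc` are illustrative assumptions; the three nonzero constants are the example entries shown in the figure.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical hash value table. Index 0 is set to 0 (an assumption) so that
// the first iteration uses the original PC; the remaining constants are the
// example entries from the figure.
constexpr std::array<std::uint32_t, 4> HASHVAL = {0x0000, 0xabcd, 0x018a, 0x7a9c};

// iter is 1-based: iteration 1 yields the original PC, and iteration k > 1
// yields PC XOR HASHVAL[k - 1].
std::uint32_t virtual_pc(std::uint32_t pc, unsigned iter) {
    assert(iter >= 1 && iter <= HASHVAL.size());
    return pc ^ HASHVAL[iter - 1];
}
```

For example, with PC 0x0800 the second virtual branch would be looked up at 0x0800 XOR 0xabcd = 0xa3cd. Because XOR with a fixed constant is cheap and stateless, each virtual PC can be recomputed from the PC and the iteration counter alone.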
VPC Prediction Process-I
Real instruction:
call R1 // PC: L
Virtual instructions:
cond. jump TARG1 // VPC: L
cond. jump TARG2 // VPC: VL2
cond. jump TARG3 // VPC: VL3
cond. jump TARG4 // VPC: VL4
Iteration 1: PC = L, GHR = 1111. The BTB returns TARG1 and the direction predictor predicts not taken, so move to the next iteration.
VPC Prediction Process-II
Iteration 2: VPC = VL2, VGHR = 1110. The BTB returns TARG2 and the direction predictor predicts not taken, so move to the next iteration.
VPC Prediction Process-III
Iteration 3: VPC = VL3, VGHR = 1100. The BTB returns TARG3 and the direction predictor predicts taken.
Predicted Target = TARG3
VPC Prediction Algorithm
Access the conditional branch predictor and the BTB with VPCA and VGHR
Compute VPCA and VGHR for the next iteration:
VPCA = PC XOR HASHVAL[iter]
VGHR = VGHR << 1
Predicted not taken: move to the next iteration
Predicted taken: use the target in the BTB as the target of the indirect branch
Give up and stall if the iteration count > MAX_ITER or on a BTB miss
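The prediction steps can be sketched as a small software model. This is a toy under stated assumptions, not the hardware design: the direction predictor and BTB are plain hash maps, the `Prediction` and `VpcPredictor` types are hypothetical, the hash constants are the example values from the hash value table figure, and the history is masked to 4 bits only to mirror the walk-through slides.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Prediction {
    bool valid;            // false: give up and stall
    std::uint32_t target;  // meaningful only when valid is true
};

struct VpcPredictor {
    // hashval[k - 1] plays the role of HASHVAL[k] on the slide (assumed values).
    std::vector<std::uint32_t> hashval{0xabcd, 0x018a, 0x7a9c};
    unsigned max_iter = 4;         // MAX_ITER (assumed value)
    std::uint32_t ghr_mask = 0xF;  // 4-bit history, as in the slide example
    std::unordered_map<std::uint64_t, bool> direction;     // (VPCA, VGHR) -> taken?
    std::unordered_map<std::uint32_t, std::uint32_t> btb;  // VPCA -> target

    static std::uint64_t key(std::uint32_t vpca, std::uint32_t vghr) {
        return (static_cast<std::uint64_t>(vpca) << 32) | vghr;
    }

    Prediction predict(std::uint32_t pc, std::uint32_t ghr) const {
        std::uint32_t vpca = pc;   // first virtual branch uses the real PC
        std::uint32_t vghr = ghr;
        for (unsigned iter = 1; iter <= max_iter; ++iter) {
            auto entry = btb.find(vpca);
            if (entry == btb.end()) return Prediction{false, 0};  // BTB miss: stall
            auto dir = direction.find(key(vpca, vghr));
            bool taken = dir != direction.end() && dir->second;
            if (taken) return Prediction{true, entry->second};    // use the BTB target
            // Predicted not taken: compute VPCA and VGHR for the next iteration.
            if (iter > hashval.size()) break;                     // no hash value left
            vpca = pc ^ hashval[iter - 1];
            vghr = (vghr << 1) & ghr_mask;
        }
        return Prediction{false, 0};  // iteration limit reached: stall
    }
};
```

Populating the BTB with three targets for one branch and marking only the third virtual branch as taken reproduces the three-iteration walk-through, where the history goes 1111, 1110, 1100 and the third lookup supplies the target.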
VPC Training Algorithm
An iterative process when an indirect branch is retired (not on the critical path)
Update the conditional branch predictor:
Virtual branch has the correct target: taken
Virtual branch has a wrong target: not taken
Update the replacement policy bits of the correct target in the BTB
If the correct target is not in the BTB, insert it:
Conditional branch predictor: taken
Replace the least frequently used (LFU) target
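The training steps can be sketched in the same spirit. The model below is an assumption-laden simplification: each virtual branch of one indirect branch is a `VirtualSlot` record (a hypothetical type) holding its BTB target, a use counter standing in for the replacement policy bits, and the direction it was last trained toward. Resetting the counter to 0 on insertion is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct VirtualSlot {
    bool valid;            // a BTB entry exists for this virtual PC
    std::uint32_t target;  // target stored in the BTB entry
    unsigned use_count;    // stand-in for the LFU replacement bits
    bool taken;            // direction the predictor was last trained toward
};

// Trains the virtual branches of one retired indirect branch and returns the
// index of the slot that holds the correct target afterwards.
int train(std::vector<VirtualSlot>& slots, std::uint32_t correct) {
    assert(!slots.empty());
    for (std::size_t i = 0; i < slots.size(); ++i) {
        VirtualSlot& s = slots[i];
        if (s.valid && s.target == correct) {
            s.taken = true;   // correct target: train taken
            ++s.use_count;    // update the replacement bits
            return static_cast<int>(i);
        }
        if (s.valid) s.taken = false;  // wrong target: train not taken
    }
    // Correct target not found: pick an empty slot, else the LFU slot.
    std::size_t victim = 0;
    for (std::size_t i = 0; i < slots.size(); ++i) {
        if (!slots[i].valid) { victim = i; break; }
        if (slots[i].use_count < slots[victim].use_count) victim = i;
    }
    slots[victim] = VirtualSlot{true, correct, 0, true};  // insert, train taken
    return static_cast<int>(victim);
}
```

With use counters of 3, 1, and 8 and a correct target that none of the slots holds, this sketch trains all three virtual branches not taken and then overwrites the slot whose counter is 1.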
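Hardware Cost and Complexity
[Figure: block diagram with the PC, an iteration counter, a hash function producing the VPCA, the GHR and VGHR, the branch direction predictor (BP), and the BTB; outputs are Taken/Not Taken, Direct/Indirect, and Target Address, with a Predict? signal]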
Simulation Methodology
Pin-based x86 simulator
Processor configuration:
4K-entry BTB
64KB perceptron conditional branch predictor
Minimum 30-cycle branch misprediction penalty
8-wide, 512-entry instruction window
Less aggressive processor (in the paper)
Gshare, O-GEHL conditional branch predictors
Indirect branch intensive benchmarks:
5 SPEC CPU2000, 5 SPEC CPU2006, 2 other C++
IBM server benchmarks (OLTP) (in the paper)
VPC MPKI
[Figure: indirect branch mispredictions (MPKI, axis 0 to 16) for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx, and AVG; series: baseline and VPC-ITER-2, 4, 6, 8, 10, 12, 14, 16]
VPC Performance
[Figure: % IPC improvement over baseline (axis 0 to 110) for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx, and AVG; series: VPC-ITER-2, 4, 6, 8, 10, 12, 14, 16]
Different Direction Predictors
[Figure: IPC improvement (%, axis 0 to 35) with gshare, perceptron, and O-GEHL direction predictors; the conditional branch accuracy (%) labels are 98%, 98.3%, and 99%]
Improving conditional branch prediction accuracy also improves indirect branch prediction accuracy!
VPC vs. Static Devirtualization
Advantages of static devirtualization:
Enables other compiler optimizations (function inlining)
Can reduce the number of mispredictions
Disadvantages/limitations of static devirtualization:
Not all indirect branches can be statically devirtualized
Requires extensive static analysis/profiling
Lack of adaptivity to run-time input set and phase behavior
VPC prediction can be used with statically devirtualized binaries: 10% improvement on top of static devirtualization
Conclusion
VPC dynamically converts indirect branches into multiple conditional branches and uses the existing conditional branch prediction hardware
VPC prediction reduces the branch misprediction penalty without significant extra hardware storage
Baseline: 26% IPC improvement
O-GEHL: 31% IPC improvement
VPC can be an enabler encouraging programmers to use object-oriented programming styles
VPC vs. Cascaded IBP
[Figure: % IPC improvement over baseline (axis -20 to 120) for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx, and AVG; series: cascaded predictors of 704B, 1.4KB, 2.8KB, 5.5KB, 11KB, 22KB, 44KB, 88KB, and 176KB, and VPC-ITER-12]
VPC vs. Other Indirect BP
                           gcc      crafty   eon      perlbmk
Tagged Target Cache (TTC)  12KB     1.5KB    >192KB   1.5KB
Cascaded                   >176KB   2.8KB    >176KB   2.8KB
TTC: Chang et al. (’96)
Cascaded: Driesen and Holzle (’98)
Iterative prediction
It doesn’t hurt performance significantly
Why? Most predictions complete within a few iterations
VPC Hit Iteration Counter
[Figure: distribution (0% to 100%) of the iteration at which VPC prediction hits, for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx, and AVG; categories: 1, 2, 3, 4, 5-6, 7-8, 9-10, 11-12]
Can the BTB be pipelined?
Yes
The next iteration of VPC can be started without knowing the outcome of the previous iteration in the pipeline
Consecutive VPC prediction iterations can simply be pipelined
If an iteration is not needed, simply discard its prediction
Is a 4K-entry BTB too large?
Pentium 4 has a 4K-entry BTB
IBM Z series (z990) has an 8K-entry BTB
AMD Athlon and Hammer have 2K-entry BTBs
BTB Size Effects
[Figure: indirect branch mispredictions (MPKI, axis 0 to 8) for base and vpc, and % IPC improvement over baseline (axis 0 to 40), for BTB sizes of 512, 1024, 2048, and 4096]
VPC Prediction Accuracy
[Figure: breakdown of VPC accesses (%, 0% to 100%) into correct, wrong target, and no target, for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, and ixx]
Target Distribution
[Figure: distribution (0% to 100%) of the number of targets, for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx, and AVG; categories: 1, 2, 3, 4, 5, 6-10, 11-15, 16+]
VPC vs. Tagged Target Cache
[Figure: % IPC improvement over baseline (axis 0 to 120) for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx, and AVG; series: TTC of 384B, 768B, 1.5KB, 3KB, 6KB, 12KB, 24KB, 48KB, 96KB, and 192KB, and VPC-ITER-12]
VPC Prediction Delay Effects
[Figure: % IPC improvement over baseline (axis 0 to 120) for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx, and AVG; series: 1, 2, 4, 6, 8, and 10 br/cycle]
VPC with O-GEHL BP
[Figure: % IPC improvement over baseline (axis 0 to 120) for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx, and AVG; series: TTC of 384B, 768B, 1.5KB, 3KB, 6KB, 12KB, 24KB, and 48KB, and VPC-ITER-12]
VPC with a Less Aggressive Processor
[Figure: % IPC improvement over baseline (axis 0 to 70) for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx, and AVG; series: TTC of 384B, 768B, 1.5KB, 3KB, 6KB, 12KB, 24KB, and 48KB, and VPC-ITER-12]
Server Benchmarks
[Figure: indirect branch mispredictions (MPKI, axis 0 to 16) for OLTP1, OLTP2, OLTP3, and AVG; series: baseline and VPC-ITER-2, 4, 6, 8, 10, 12, 14, 16]
Server Benchmarks (VPC vs. TTC)
[Figure: indirect branch mispredictions (MPKI, axis 0 to 18) for OLTP1, OLTP2, OLTP3, and AVG; series: baseline, TTC of 384B, 768B, 1.5KB, 3KB, 6KB, 12KB, 24KB, and 48KB, and VPC-ITER-10]
VPC Prediction vs. Compiler-Based Devirtualization (With TTC)
[Figure: % IPC improvement over baseline (axis -10 to 90) for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, and AVG; series: TTC of 384B, 768B, 1.5KB, 3KB, 6KB, 12KB, 24KB, and 48KB, and VPC-ITER-12]
Conditional Br. Prediction Effects
[Figure: conditional branch MPKI (axis 0 to 4) for Base and VPC with gshare, perceptron, and O-GEHL direction predictors]
VPC prediction reduces the accuracy of conditional branch direction prediction, but not by much!
Indirect Branch Mispredictions
[Figure: indirect branches as a percentage of all mispredicted branches (%, axis 0 to 60)]
VPC Prediction with Static Devirtualization
VPC prediction can be used with statically devirtualized binaries, because not all indirect branches can be devirtualized
[Figure: % IPC improvement over baseline (axis 0 to 60) for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, and AVG; series: VPC-ITER-4, 6, 8, 10, 12]
VPC Training: Correct Prediction
Retirement, real instruction: call R1 // PC: L
Known: correctly predicted, predicted iter = 3
Iter  VPCA  VGHR    Direction BP  BTB
1     L     GHR     Not-taken     -
2     VL2   GHR<<1  Not-taken     -
3     VL3   GHR<<2  Taken         Update replacement
VPC Training: Misprediction
Retirement, real instruction: call R1 // PC: L
Known: mispredicted, correct target address
Iter  VPCA  VGHR    BTB Access       Train Direction BP  Train BTB
1     L     GHR     TARG != Correct  Not-taken           -
2     VL2   GHR<<1  TARG != Correct  Not-taken           -
3     VL3   GHR<<2  TARG = Correct   Taken               Update replacement
VPC Training: Misprediction
Retirement, real instruction: call R1 // PC: L
Known: mispredicted, correct target address
Iter  VPCA  VGHR    BTB Access       Train Direction BP  Train BTB
1     L     GHR     TARG != Correct  Not-taken           -
2     VL2   GHR<<1  TARG != Correct  Not-taken           -
3     VL3   GHR<<2  TARG != Correct  Not-taken           -
No Target
VPC Training: Misprediction
Retirement, real instruction: call R1 // PC: L
Known: mispredicted, correct target address
Iter         VPCA  VGHR    BTB Access       Repl. counter  Train BP   Train BTB
1            L     GHR     TARG != Correct  3              Not-taken  -
2            VL2   GHR<<1  TARG != Correct  1              Not-taken  Nothing
3            VL3   GHR<<2  TARG != Correct  8              Not-taken  -
Replacement                                 0              Taken      Insert