Tracing versus Partial Evaluation
Which Meta-Compilation Approach is Better for Self-Optimizing Interpreters?
Stefan Marr, Stéphane Ducasse
OOPSLA, October 28, 2015
Disclaimer: Work Done At Oracle Labs
I am currently funded by Oracle Labs.
* Würthinger, T.; Wimmer, C.; Wöß, A.; Stadler, L.; Duboscq, G.; Humer, C.; Richards, G.; Simon, D. & Wolczko, M., One VM to Rule Them All, in Proceedings of the 2013 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software (Onward!), ACM.
Compare Concrete Systems
Truffle + Graal with Partial Evaluation
RPython with Meta-Tracing
[2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25.
[3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204.
Selecting A Case Study
On both systems: a Self-Optimizing AST Interpreter.
Represents a large group of dynamic languages:
• Dynamically typed (Smalltalk)
• Classes (and everything is an object)
• Closures (lambdas)
• Non-local returns (almost exceptions)
• Set of benchmarks
http://som-st.github.io
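The slide calls non-local returns "almost exceptions" because that is how an interpreter can implement them. A minimal sketch, assuming invented names (`NonLocalReturn`, `run_method`; this is not SOM's actual API): a `^`-return inside a closure unwinds via an exception that carries its target activation, and only the home method's frame catches it.

```python
# Hypothetical sketch: Smalltalk-style non-local returns modeled with
# exceptions. Names are invented for illustration, not taken from SOM.

class NonLocalReturn(Exception):
    def __init__(self, value, home):
        self.value = value
        self.home = home  # the method activation the return targets

def run_method(body, frame):
    """Execute a method body; catch non-local returns aimed at this frame."""
    try:
        return body(frame)
    except NonLocalReturn as nlr:
        if nlr.home is frame:
            return nlr.value  # a `^`-return escaped from a nested block
        raise  # targeted at an outer activation; keep unwinding

# Usage: a block inside the method does `^ 42`, skipping the rest.
def method_body(frame):
    def block():
        raise NonLocalReturn(42, frame)  # `^ 42` inside a closure
    block()
    return "unreached"

assert run_method(method_body, object()) == 42
```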
SOMMT versus SOMPE
Meta-Tracing versus Partial Evaluation
[Figure: the example statement `cnt := cnt + 1`, compiled once by meta-tracing and once by partial evaluation]
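Both meta-compilation approaches start from a self-optimizing AST. A minimal sketch of the node-rewriting idea behind the `cnt := cnt + 1` example, with invented class names (`UninitializedAdd`, `IntAdd`, etc.; not SOM's implementation): the generic `+` node observes integer operands on first execution and replaces itself with a specialized node that the meta-compiler can then compile tightly.

```python
# Illustrative self-optimizing AST, assuming hypothetical node classes.

class Literal:
    def __init__(self, value):
        self.value = value
    def execute(self, frame):
        return self.value

class ReadVar:
    def __init__(self, name):
        self.name = name
    def execute(self, frame):
        return frame[self.name]

class IntAdd:
    """Specialized `+`: only reached after integers were observed."""
    def __init__(self, left, right):
        self.left, self.right = left, right
    def execute(self, frame):
        return self.left.execute(frame) + self.right.execute(frame)

class UninitializedAdd:
    """Generic `+` that rewrites itself in its parent on first execution."""
    def __init__(self, parent, slot, left, right):
        self.parent, self.slot = parent, slot
        self.left, self.right = left, right
    def execute(self, frame):
        l = self.left.execute(frame)
        r = self.right.execute(frame)
        if isinstance(l, int) and isinstance(r, int):
            # Specialize: swap in the int-only node for future executions.
            setattr(self.parent, self.slot, IntAdd(self.left, self.right))
            return l + r
        raise TypeError("non-int case elided in this sketch")

class Assign:
    def __init__(self, name, expr):
        self.name, self.expr = name, expr
    def execute(self, frame):
        frame[self.name] = self.expr.execute(frame)
        return frame[self.name]

# Build `cnt := cnt + 1`; the add node starts out uninitialized.
stmt = Assign("cnt", None)
stmt.expr = UninitializedAdd(stmt, "expr", ReadVar("cnt"), Literal(1))

frame = {"cnt": 0}
stmt.execute(frame)                   # first run specializes the node
assert isinstance(stmt.expr, IntAdd)  # the AST rewrote itself
stmt.execute(frame)
assert frame["cnt"] == 2
```

In a full system the specialized node would deoptimize and respecialize when its assumption fails; this sketch elides that path.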
WHICH APPROACH IS FASTER FAST?
Minimal amount of engineering to get good performance.
Peak Performance of Basic Interpreters
[Figure: runtime normalized to Java 8, compiled or interpreted, logarithmic scale from 10 to 100 (lower is better); benchmarks Bounce, BubbleSort, DeltaBlue, Fannkuch, GraphSearch, Json, Mandelbrot, NBody, PageRank, Permute, Queens, QuickSort, Richards, Sieve, Storage, Towers; left panel: SOMMT on RPython, right panel: SOMPE on Truffle]
Minimal SOMMT: 5.5x slower (min. 1.6x, max. 14x)
Minimal SOMPE: 170x slower (min. 60x, max. 600x)
WHICH APPROACH IS THE FASTEST?
Best peak performance.
Which Self-Optimizations Should a Language Implementer Add?
• Type-specialize variables
• Type-specialize object fields
• Type-specialize collection storage
• Lower control structures from library
• Lower common library operations
• Inline caching
• Inline primitive operations
• Cache globals
• …
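Inline caching, one of the optimizations listed above, can be sketched independently of either framework. A hedged illustration with invented names (`SendSite`, `Point`; not the SOM implementation): each send site remembers the (class, method) pairs it has seen, so repeated sends to receivers of a known class skip the full method lookup.

```python
# Hypothetical polymorphic inline cache at a message-send site.

class SendSite:
    """Caches (class -> method) pairs observed at this call site."""
    def __init__(self, selector, max_entries=6):
        self.selector = selector
        self.cache = []              # list of (cls, method) pairs
        self.max_entries = max_entries

    def send(self, receiver, *args):
        cls = type(receiver)
        for cached_cls, method in self.cache:  # fast path: cache hit
            if cls is cached_cls:
                return method(receiver, *args)
        method = getattr(cls, self.selector)   # slow path: full lookup
        if len(self.cache) < self.max_entries:
            self.cache.append((cls, method))   # remember for next time
        return method(receiver, *args)

class Point:
    def __init__(self, x):
        self.x = x
    def double(self):
        return 2 * self.x

site = SendSite("double")
assert site.send(Point(21)) == 42
assert site.send(Point(5)) == 10   # second send hits the cache
assert len(site.cache) == 1        # monomorphic: one cached entry
```

In a meta-compiled interpreter the payoff is larger than saved lookups: a cache hit against a constant class lets the trace or the partial evaluator inline the target method.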
Peak Performance of Optimized Interpreters
[Figure: runtime normalized to Java 8, compiled or interpreted, scale 1 to 12 (lower is better); same benchmarks as before; left panel: SOMMT on RPython, right panel: SOMPE on Truffle]
Optimized SOMMT: 3x slower (min. 1.5x, max. 11x), a 2.4x speedup over minimal SOMMT
Optimized SOMPE: 2.3x slower (min. 4%, max. 4.9x), an 80x speedup over minimal SOMPE
Optimization Impact on SOMPE
[Figure: speedup factor per optimization relative to the baseline (higher is better, logarithmic scale from roughly 0.85 to 12): lower control structures, inline caching, cache globals, typed fields, lower common ops, array strategies, inline basic ops., typed vars, opt. local vars, baseline, min. escaping closures, typed args, catch-return nodes]
Implementation Sizes
RPython (“the way I write Python”): from minimal to optimized, +57% LOC (from 3,455 LOC to 5,414 LOC)
Truffle (“the way I write Java”): from minimal to optimized, +103% LOC (from 5,424 LOC to 11,037 LOC)
WHICH APPROACH GIVES BETTER STARTUP PERFORMANCE?
Considering the user-perceived system performance.
Measuring “Whole Program” Runtime
[Figure: wall-clock behavior for various run lengths, aggregated over all benchmarks; y-axis: geometric mean of (wall-clock time for x iterations, divided by the corresponding Java result); x-axis: iterations of the benchmark in the same process, 0 to 600, with 8 sec, 25 sec, and 46 sec marks; lines for Java, SOMMT (RTruffleSOM), and SOMPE (TruffleSOM)]
• Process start to finish
• Overall wall-clock time
• Normalized to Java
CONCLUSIONS
Tracing vs. Partial Evaluation
• Peak performance seems similar
  – No indications of conceptual limitations
• Startup performance
  – Unclear; tiered compilation?
• But, tracing is faster fast!
  – Requires fewer optimizations
  – Better ‘prototype’ performance