Tracing versus Partial Evaluation
Which Meta-Compilation Approach is Better for Self-Optimizing Interpreters?
Stefan Marr, Stéphane Ducasse
OOPSLA, October 28, 2015
Disclaimer: Work Done At Oracle Labs
I am currently funded by Oracle Labs.
* Würthinger, T.; Wimmer, C.; Wöß, A.; Stadler, L.; Duboscq, G.; Humer, C.; Richards, G.; Simon, D. & Wolczko, M., One VM to Rule Them All, in Proceedings of the 2013 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software (Onward!), ACM.
Compare Concrete Systems
Truffle + Graal with Partial Evaluation
RPython with Meta-Tracing
[2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25.
[3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204.
Selecting A Case Study
On both systems: a Self-Optimizing AST Interpreter.
Represents a large group of dynamic languages:
• Dynamically typed (Smalltalk)
• Classes (and everything is an object)
• Closures (lambdas)
• Non-local returns (almost exceptions)
• Set of benchmarks
http://som-st.github.io
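The slide calls non-local returns "almost exceptions" because that is how an interpreter can implement them. A minimal sketch, assuming invented names (`NonLocalReturn`, `run_method`; this is not SOM's actual API): a `^`-return inside a closure unwinds via an exception that carries its target activation, and only the home method's frame catches it.

```python
# Hypothetical sketch: Smalltalk-style non-local returns modeled with
# exceptions. Names are invented for illustration, not taken from SOM.

class NonLocalReturn(Exception):
    def __init__(self, value, home):
        self.value = value
        self.home = home  # the method activation the return targets

def run_method(body, frame):
    """Execute a method body; catch non-local returns aimed at this frame."""
    try:
        return body(frame)
    except NonLocalReturn as nlr:
        if nlr.home is frame:
            return nlr.value  # a `^`-return escaped from a nested block
        raise  # targeted at an outer activation; keep unwinding

# Usage: a block inside the method does `^ 42`, skipping the rest.
def method_body(frame):
    def block():
        raise NonLocalReturn(42, frame)  # `^ 42` inside a closure
    block()
    return "unreached"

assert run_method(method_body, object()) == 42
```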
SOMMT versus SOMPE
Meta-Tracing versus Partial Evaluation
[Figure: the example statement `cnt := cnt + 1`, compiled once by meta-tracing and once by partial evaluation]
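Both meta-compilation approaches start from a self-optimizing AST. A minimal sketch of the node-rewriting idea behind the `cnt := cnt + 1` example, with invented class names (`UninitializedAdd`, `IntAdd`, etc.; not SOM's implementation): the generic `+` node observes integer operands on first execution and replaces itself with a specialized node that the meta-compiler can then compile tightly.

```python
# Illustrative self-optimizing AST, assuming hypothetical node classes.

class Literal:
    def __init__(self, value):
        self.value = value
    def execute(self, frame):
        return self.value

class ReadVar:
    def __init__(self, name):
        self.name = name
    def execute(self, frame):
        return frame[self.name]

class IntAdd:
    """Specialized `+`: only reached after integers were observed."""
    def __init__(self, left, right):
        self.left, self.right = left, right
    def execute(self, frame):
        return self.left.execute(frame) + self.right.execute(frame)

class UninitializedAdd:
    """Generic `+` that rewrites itself in its parent on first execution."""
    def __init__(self, parent, slot, left, right):
        self.parent, self.slot = parent, slot
        self.left, self.right = left, right
    def execute(self, frame):
        l = self.left.execute(frame)
        r = self.right.execute(frame)
        if isinstance(l, int) and isinstance(r, int):
            # Specialize: swap in the int-only node for future executions.
            setattr(self.parent, self.slot, IntAdd(self.left, self.right))
            return l + r
        raise TypeError("non-int case elided in this sketch")

class Assign:
    def __init__(self, name, expr):
        self.name, self.expr = name, expr
    def execute(self, frame):
        frame[self.name] = self.expr.execute(frame)
        return frame[self.name]

# Build `cnt := cnt + 1`; the add node starts out uninitialized.
stmt = Assign("cnt", None)
stmt.expr = UninitializedAdd(stmt, "expr", ReadVar("cnt"), Literal(1))

frame = {"cnt": 0}
stmt.execute(frame)                   # first run specializes the node
assert isinstance(stmt.expr, IntAdd)  # the AST rewrote itself
stmt.execute(frame)
assert frame["cnt"] == 2
```

In a full system the specialized node would deoptimize and respecialize when its assumption fails; this sketch elides that path.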
WHICH APPROACH IS FASTER FAST?
Minimal amount of engineering to get good performance.
Peak Performance of Basic Interpreters
[Figure: runtime normalized to Java 8, compiled or interpreted, logarithmic scale from 10 to 100 (lower is better); benchmarks Bounce, BubbleSort, DeltaBlue, Fannkuch, GraphSearch, Json, Mandelbrot, NBody, PageRank, Permute, Queens, QuickSort, Richards, Sieve, Storage, Towers; left panel: SOMMT on RPython, right panel: SOMPE on Truffle]
Minimal SOMMT: 5.5x slower (min. 1.6x, max. 14x)
Minimal SOMPE: 170x slower (min. 60x, max. 600x)
WHICH APPROACH IS THE FASTEST?
Best peak performance.
Which Self-Optimizations Should a Language Implementer Add?
• Type-specialize variables
• Type-specialize object fields
• Type-specialize collection storage
• Lower control structures from library
• Lower common library operations
• Inline caching
• Inline primitive operations
• Cache globals
• …
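Inline caching, one of the optimizations listed above, can be sketched independently of either framework. A hedged illustration with invented names (`SendSite`, `Point`; not the SOM implementation): each send site remembers the (class, method) pairs it has seen, so repeated sends to receivers of a known class skip the full method lookup.

```python
# Hypothetical polymorphic inline cache at a message-send site.

class SendSite:
    """Caches (class -> method) pairs observed at this call site."""
    def __init__(self, selector, max_entries=6):
        self.selector = selector
        self.cache = []              # list of (cls, method) pairs
        self.max_entries = max_entries

    def send(self, receiver, *args):
        cls = type(receiver)
        for cached_cls, method in self.cache:  # fast path: cache hit
            if cls is cached_cls:
                return method(receiver, *args)
        method = getattr(cls, self.selector)   # slow path: full lookup
        if len(self.cache) < self.max_entries:
            self.cache.append((cls, method))   # remember for next time
        return method(receiver, *args)

class Point:
    def __init__(self, x):
        self.x = x
    def double(self):
        return 2 * self.x

site = SendSite("double")
assert site.send(Point(21)) == 42
assert site.send(Point(5)) == 10   # second send hits the cache
assert len(site.cache) == 1        # monomorphic: one cached entry
```

In a meta-compiled interpreter the payoff is larger than saved lookups: a cache hit against a constant class lets the trace or the partial evaluator inline the target method.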
Peak Performance of Optimized Interpreters
[Figure: runtime normalized to Java 8, compiled or interpreted, scale 1 to 12 (lower is better); same benchmarks as before; left panel: SOMMT on RPython, right panel: SOMPE on Truffle]
Optimized SOMMT: 3x slower (min. 1.5x, max. 11x), a 2.4x speedup over minimal SOMMT
Optimized SOMPE: 2.3x slower (min. 4%, max. 4.9x), an 80x speedup over minimal SOMPE
Optimization Impact on SOMPE
[Figure: speedup factor per optimization relative to the baseline (higher is better, logarithmic scale from roughly 0.85 to 12): lower control structures, inline caching, cache globals, typed fields, lower common ops, array strategies, inline basic ops., typed vars, opt. local vars, baseline, min. escaping closures, typed args, catch-return nodes]
Implementation Sizes
RPython (“the way I write Python”): from minimal to optimized, +57% LOC (from 3,455 LOC to 5,414 LOC)
Truffle (“the way I write Java”): from minimal to optimized, +103% LOC (from 5,424 LOC to 11,037 LOC)
WHICH APPROACH GIVES BETTER STARTUP PERFORMANCE?
Considering the user-perceived system performance.
Measuring “Whole Program” Runtime
[Figure: wall-clock behavior for various run lengths, aggregated over all benchmarks; y-axis: geometric mean of (wall-clock time for x iterations, divided by the corresponding Java result); x-axis: iterations of the benchmark in the same process, 0 to 600, with 8 sec, 25 sec, and 46 sec marks; lines for Java, SOMMT (RTruffleSOM), and SOMPE (TruffleSOM)]
• Process start to finish
• Overall wall-clock time
• Normalized to Java
CONCLUSIONS
Tracing vs. Partial Evaluation
• Peak performance seems similar
  – No indications of conceptual limitations
• Startup performance
  – Unclear; tiered compilation?
• But, tracing is faster fast!
  – Requires fewer optimizations
  – Better ‘prototype’ performance