Tracing versus Partial Evaluation: Which Meta-Compilation Approach is Better for Self-Optimizing Interpreters? Stefan Marr, Stéphane Ducasse. OOPSLA, October 28, 2015.

Page 1: Tracing versus Partial Evaluation: Which Meta-Compilation Approach is Better for Self-Optimizing Interpreters?

Tracing versus Partial Evaluation

Which Meta-Compilation Approach is Better for Self-Optimizing Interpreters?

Stefan Marr, Stéphane Ducasse
OOPSLA, October 28, 2015

Work Done At

Page 2

Disclaimer

I am currently funded by Oracle Labs.

* Würthinger, T.; Wimmer, C.; Wöß, A.; Stadler, L.; Duboscq, G.; Humer, C.; Richards, G.; Simon, D. & Wolczko, M., One VM to Rule Them All, in Proceedings of the 2013 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software, ACM.

Page 4

Compare Concrete Systems

• Truffle + Graal (Oracle Labs), with Partial Evaluation

• RPython, with Meta-Tracing

[2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25.

[3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204.

Page 5

Selecting A Case Study

Self-Optimizing AST Interpreter, on both systems

Page 6

Represents Large Group of Dynamic Languages

Dynamically Typed (Smalltalk)

Classes (and everything is an Object)

Closures (lambdas)

Non-local Returns (almost exceptions)

Set of Benchmarks

http://som-st.github.io
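
The remark that non-local returns are "almost exceptions" can be illustrated with a small sketch. It is not taken from either SOM implementation; the names `NonLocalReturn`, `frame_id`, and `find_first_negative` are made up for illustration. A block that returns from its enclosing method is modeled as a closure that raises an exception, which the activation that created the closure then catches.

```python
class NonLocalReturn(Exception):
    """Carries a value from a block back to the method activation that created it."""

    def __init__(self, frame_id, value):
        super().__init__()
        self.frame_id = frame_id
        self.value = value


def find_first_negative(numbers):
    # Unique marker for this method activation (hypothetical mechanism).
    frame_id = object()

    def block(n):
        # Stands in for a Smalltalk-style block containing `^ n`:
        # the return leaves the enclosing method, not just the block.
        if n < 0:
            raise NonLocalReturn(frame_id, n)

    try:
        for n in numbers:  # stands in for `numbers do: [:n | ...]`
            block(n)
    except NonLocalReturn as e:
        if e.frame_id is not frame_id:
            raise  # belongs to another activation, keep unwinding
        return e.value
    return None


print(find_first_negative([3, 1, -4, -1]))
```

The identity check on `frame_id` is what separates this from an ordinary exception: only the activation that created the block may complete the return.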

Page 7

SOM_MT versus SOM_PE

SOM_MT: Meta-Tracing

SOM_PE: Partial Evaluation

[Figure: the same AST drawn once per approach; node labels: if, cnt :=, +, cnt, 1, cnt :=, 0]
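
To make "self-optimizing AST interpreter" concrete, here is a minimal sketch for an assignment such as `cnt := cnt + 1`, as in the AST on this slide. It assumes a simple design in which a node rewrites itself in its parent after observing operand types. All class names (`UninitializedAdd`, `IntAdd`, `GenericAdd`, and so on) are hypothetical and not the API of Truffle, RPython, or either SOM implementation.

```python
class Node:
    parent = None

    def adopt(self, child):
        child.parent = self
        return child

    def replace(self, new_node):
        # Swap this node for new_node in whichever parent field holds it.
        if self.parent is not None:
            for name, value in list(vars(self.parent).items()):
                if value is self:
                    setattr(self.parent, name, new_node)
            new_node.parent = self.parent
        return new_node


class Literal(Node):
    def __init__(self, value):
        self.value = value

    def execute(self, frame):
        return self.value


class ReadVar(Node):
    def __init__(self, name):
        self.name = name

    def execute(self, frame):
        return frame[self.name]


class WriteVar(Node):
    def __init__(self, name, expr):
        self.name = name
        self.expr = self.adopt(expr)

    def execute(self, frame):
        frame[self.name] = self.expr.execute(frame)
        return frame[self.name]


class BinaryNode(Node):
    def __init__(self, left, right):
        self.left = self.adopt(left)
        self.right = self.adopt(right)


class GenericAdd(BinaryNode):
    def execute(self, frame):
        return self.left.execute(frame) + self.right.execute(frame)


class IntAdd(BinaryNode):
    def execute(self, frame):
        l, r = self.left.execute(frame), self.right.execute(frame)
        if type(l) is not int or type(r) is not int:
            # Speculation failed: fall back to the generic node.
            self.replace(GenericAdd(self.left, self.right))
        return l + r


class UninitializedAdd(BinaryNode):
    def execute(self, frame):
        l, r = self.left.execute(frame), self.right.execute(frame)
        # Specialize based on the types observed at the first execution.
        if type(l) is int and type(r) is int:
            self.replace(IntAdd(self.left, self.right))
        else:
            self.replace(GenericAdd(self.left, self.right))
        return l + r


tree = WriteVar("cnt", UninitializedAdd(ReadVar("cnt"), Literal(1)))
frame = {"cnt": 0}
tree.execute(frame)
print(type(tree.expr).__name__, frame["cnt"])
```

In plain Python the specialized node is no faster. The point of the pattern is that the rewritten tree records what the program actually did, which is the information a meta-compiler (tracing or partial evaluation) can then exploit.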

Page 8

WHICH APPROACH IS FASTER FAST?

minimal amount of engineering to get good performance

Page 9

Peak Performance of Basic Interpreters

[Figure: two panels, compiled SOM_MT on RPython and compiled SOM_PE on Truffle. Y axis: runtime normalized to Java 8 (compiled or interpreted), ticks at 10 and 100, lower is better. X axis: benchmarks Bounce, BubbleSort, DeltaBlue, Fannkuch, GraphSearch, Json, Mandelbrot, NBody, PageRank, Permute, Queens, QuickSort, Richards, Sieve, Storage, Towers.]

Minimal SOM_MT: 5.5x slower (min. 1.6x, max. 14x)

Minimal SOM_PE: 170x slower (min. 60x, max. 600x)

Page 10

WHICH APPROACH IS THE FASTEST?

best peak performance

Page 11

Which Self-Optimizations Should a Language Implementer Add?

• Type-specialize variables

• Type-specialize object fields

• Type-specialize collection storage

• Lower control structures from library

• Lower common library operations

• Inline caching

• Inline primitive operations

• Cache globals

• …
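
As an illustration of one item on this list, inline caching, the following sketch caches method lookups per call site. It assumes a simple class-based object model. The names `SomClass`, `SomObject`, `CachedSend`, and the cache limit of 4 are hypothetical and not taken from either SOM implementation.

```python
class SomClass:
    def __init__(self, name, methods):
        self.name = name
        self.methods = methods
        self.lookups = 0  # counts slow-path lookups, for demonstration only

    def lookup(self, selector):
        self.lookups += 1
        return self.methods[selector]


class SomObject:
    def __init__(self, cls):
        self.cls = cls


class CachedSend:
    """A message-send site that remembers lookup results per receiver class."""

    MAX_ENTRIES = 4  # assumed limit before the site stops adding entries

    def __init__(self, selector):
        self.selector = selector
        self.cache = []  # list of (class, method) pairs

    def execute(self, receiver, *args):
        # Fast path: a cheap identity check on the receiver's class.
        for cls, method in self.cache:
            if receiver.cls is cls:
                return method(receiver, *args)
        # Slow path: full lookup, then remember the result for next time.
        method = receiver.cls.lookup(self.selector)
        if len(self.cache) < self.MAX_ENTRIES:
            self.cache.append((receiver.cls, method))
        return method(receiver, *args)


point = SomClass("Point", {"describe": lambda self: "a Point"})
send = CachedSend("describe")
obj = SomObject(point)
results = [send.execute(obj) for _ in range(3)]
print(results, "lookups:", point.lookups)
```

After the first send, the call site answers from its cache, so the class lookup runs only once for the three sends.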

Page 12

Peak Performance of Optimized Interpreter

[Figure: two panels, compiled SOM_MT on RPython and compiled SOM_PE on Truffle. Y axis: runtime normalized to Java 8 (compiled or interpreted), ticks at 1, 4, 8, 12, lower is better. X axis: benchmarks Bounce, BubbleSort, DeltaBlue, Fannkuch, GraphSearch, Json, Mandelbrot, NBody, PageRank, Permute, Queens, QuickSort, Richards, Sieve, Storage, Towers.]

Optimized SOM_MT: 3x slower (min. 1.5x, max. 11x), 2.4x speedup

Optimized SOM_PE: 2.3x slower (min. 4%, max. 4.9x), 80x speedup

Page 13

Optimization Impact on SOM_PE

[Figure: speedup factor per optimization (higher is better, logarithmic scale, axis ticks from 0.85 to 12.00). Optimizations shown: lower control structures, inline caching, cache globals, typed fields, lower common ops, array strategies, inline basic ops., typed vars, opt. local vars, baseline, min. escaping closures, typed args, catch-return nodes.]

Page 14

Implementation Sizes

RPython

From Minimal to Optimized: +57% LOC

From 3,455 LOC to 5,414 LOC

The Way I write Python

Truffle

From Minimal to Optimized: +103% LOC

From 5,424 LOC to 11,037 LOC

The Way I write Java
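
The growth percentages follow directly from the LOC counts on this slide. A quick check, where the helper name `growth_percent` is only illustrative:

```python
def growth_percent(before, after):
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) / before * 100


rpython = growth_percent(3455, 5414)
truffle = growth_percent(5424, 11037)

print(f"RPython: +{rpython:.1f}% LOC")  # prints "RPython: +56.7% LOC"
print(f"Truffle: +{truffle:.1f}% LOC")  # prints "Truffle: +103.5% LOC"
```

The Truffle value of 103.48% rounds down to the +103% shown on the slide; the one-decimal output rounds it up to 103.5.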

Page 15

WHICH APPROACH GIVES BETTER STARTUP PERFORMANCE?

Considering the User-Perceived System Performance

Page 16

Measuring “Whole Program” Runtime

• Process Start to Finish

• Overall Wall-clock time

• Normalized to Java

[Figure: Wall-Clock Behavior for Various Run Lengths: Aggregation over all Benchmarks. X axis: iterations of benchmark in same process (0 to 600; annotations at 8sec, 25sec, 46sec). Y axis: factor over Java for x iterations, that is, the geometric mean of the wall-clock time for x iterations divided by the corresponding Java result (ticks at 4, 8, 12, 16). Series: Java, SOM_MT (RTruffleSOM-jit-experiments), SOM_PE (TruffleSOM-graal-no-expgc).]
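
The aggregate on the y axis, a geometric mean of per-benchmark wall-clock times divided by the corresponding Java result, can be sketched as follows. The timing values are invented for illustration and are not measurements from the talk. The function names are hypothetical as well.

```python
import math


def geometric_mean(values):
    """Geometric mean via logarithms, which avoids overflow for long lists."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def factor_over_java(vm_times, java_times):
    """Geometric mean of per-benchmark wall-clock ratios (VM time / Java time)."""
    ratios = [vm_times[name] / java_times[name] for name in java_times]
    return geometric_mean(ratios)


# Hypothetical whole-process wall-clock times in seconds for a fixed
# number of iterations (made-up numbers, not data from the slides).
java_times = {"Bounce": 2.0, "Queens": 1.0}
som_times = {"Bounce": 8.0, "Queens": 1.0}

print(factor_over_java(som_times, java_times))
```

With ratios of 4 and 1, the result is their geometric mean, 2. A geometric mean is the usual choice for averaging normalized ratios, because the ranking of systems does not depend on which one is used as the baseline.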

Page 17

CONCLUSIONS

Page 18

Tracing vs. Partial Evaluation

• Peak performance seems similar

– No indications of conceptual limitations

• Startup Performance

– Unclear, tiered compilation?

• But, tracing is faster fast!

– Requires fewer optimizations

– Better ‘prototype’ performance
