Trace Fragment Selection within Method-based JVMs Duane Merrill Kim Hazelwood VEE ’08

Trace Fragment Selection within Method-based JVMs

Duane Merrill Kim Hazelwood

VEE ’08

2

Overview

• Would trace fragment dispatch benefit VMs with JITs?

– Fragment-dispatch as a feedback-directed optimization

• Why?

– Improve VM performance via better instruction layout

• Overview

– Motivation

– New scheme for trace selection

– Viability in JikesRVM

• Evaluate opportunities for code improvement

• Evaluate trace selection overhead

3

Traditional VM Adaptive Code Generation

Phase 3: More Advanced JIT Compilation

Update Class/TOC dispatch tables, perform OSR

Phase 2: JIT Method compilation

Compilation Shape: Source Method

Dispatch Shape: Corresponding MC Code Array & Machine Code Trace Fragment

Phase 1: Interpreter

Compilation Shape: Source Instruction

Dispatch Shape: Corresponding MC Instruction(s)


4

SDT/DBI/Embedded VM Adaptive Code Generation





Dispatch Shape: Corresponding MC Code Array &






5

Proposed VM Adaptive Code Generation





Dispatch Shape(s): Corresponding MC Code Array &






6

Trace Fragment Dispatch

• Trace

– A specific sequence of instructions observed at runtime

– Span:

• Branches

• Procedure calls and returns

• Potentially arbitrary number of instructions

• Trace Fragment

– A finite, linear sequence of machine code instructions

– Single-entry, multiple-exit (viz. superblock)

– Cached, linked
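The definition above can be made concrete with a small data structure. This is a minimal sketch, not the authors' implementation: the class name `TraceFragment`, its fields, and the `link` method are hypothetical, and the on-trace block that each exit leaves from is assumed for illustration. The block labels come from the slide's example (A B D M O P E, exits to C and to N) and from the N P E fragment shown later in the deck.

```python
from dataclasses import dataclass, field


@dataclass
class TraceFragment:
    """A finite, linear run of blocks: entered only at the head, left at any exit."""
    blocks: list                                # on-trace block labels, in execution order
    exits: dict = field(default_factory=dict)   # off-trace target -> on-trace block it leaves from
    links: dict = field(default_factory=dict)   # off-trace target -> fragment linked at that exit

    @property
    def head(self):
        # Single entry: control may only enter at the first block.
        return self.blocks[0]

    def link(self, other):
        """Link the exit that targets other's head, if this fragment has one."""
        if other.head in self.exits:
            self.links[other.head] = other
            return True
        return False


# Fragment from the slide: A B D M O P E, with exits to C and to N.
# Which on-trace block each exit leaves from is an assumption.
main = TraceFragment(list("ABDMOPE"), exits={"C": "A", "N": "M"})
side = TraceFragment(list("NPE"))
linked = main.link(side)   # the exit to N now transfers to the N P E fragment
```

Keeping exits as explicit records is what makes fragments cheap to link and to discard: unlinking only touches the `links` table, never the block sequence.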

[Figure: control-flow graphs of foo() and bar() over blocks A, B, C, D, E, M, N, O, P; trace fragment A B D M O P E with exits to C and to N]

7

Trace Fragment Dispatch: The Good

• Location, Location, Location

– “Inlining-like”:

• Context sensitive

• Partial

– Spatial locality provides most of achieved speedup

• Simple, low-cost “local” optimizations

– Redundancy elimination

• Nimbly adjusts to changing behavior

– Efficient

– Lots of early-exits? Discard fragment and re-trace
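One way to picture the "discard and re-trace" policy is a per-fragment profile that compares early exits against total entries. The sketch below is an illustration only: the class `FragmentProfile`, the 50% ratio, and the 100-sample minimum are all assumptions, since the slides do not say how "lots of early exits" is measured.

```python
class FragmentProfile:
    """Counts how often a fragment runs to completion versus leaving early."""

    def __init__(self, max_early_exit_ratio=0.5, min_samples=100):
        # Both thresholds are illustrative assumptions, not values from the talk.
        self.max_early_exit_ratio = max_early_exit_ratio
        self.min_samples = min_samples
        self.entries = 0
        self.early_exits = 0

    def record(self, exited_early):
        """Record one execution of the fragment."""
        self.entries += 1
        if exited_early:
            self.early_exits += 1

    def should_discard(self):
        """True once enough samples show the fragment no longer matches behavior."""
        if self.entries < self.min_samples:
            return False
        return self.early_exits / self.entries > self.max_early_exit_ratio


profile = FragmentProfile()
for i in range(200):
    profile.record(exited_early=(i % 4 != 0))   # 150 of 200 executions leave early

discard = profile.should_discard()
```

When `should_discard()` becomes true, the fragment would be unlinked and its header re-armed for tracing, so the next hot path is recorded fresh.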

[Figure: control-flow graphs of foo() and bar() over blocks A, B, C, D, E, M, N, O, P; trace fragment A B D M O P E with exits to C and to N]

8

Trace Fragment Dispatch: The Bad

• Lacks optimization power

– Data flow analysis

– Code motion & loop optimizations

• Code expansion

– Tail duplication

– Exponential growth (if all paths maintained indefinitely)
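The exponential-growth concern can be checked with a small worked example. The sketch below builds a hypothetical control-flow graph of k back-to-back if/else diamonds (the helper names and block labels are invented for illustration) and counts how many blocks would be emitted if every distinct path were kept as its own fragment.

```python
def enumerate_paths(cfg, start, end):
    """Return every acyclic path from start to end as a list of block labels."""
    if start == end:
        return [[end]]
    return [[start] + rest
            for succ in cfg[start]
            for rest in enumerate_paths(cfg, succ, end)]


def diamond_chain(k):
    """Build a hypothetical CFG of k back-to-back if/else diamonds."""
    cfg = {}
    for i in range(k):
        cfg[f"J{i}"] = [f"T{i}", f"F{i}"]   # branch block with two successors
        cfg[f"T{i}"] = [f"J{i + 1}"]        # taken side
        cfg[f"F{i}"] = [f"J{i + 1}"]        # fall-through side
    cfg[f"J{k}"] = []                       # final join block
    return cfg


for k in (1, 2, 4, 8):
    paths = enumerate_paths(diamond_chain(k), "J0", f"J{k}")
    copies = sum(len(p) for p in paths)     # blocks emitted if each path gets a fragment
    original = 3 * k + 1                    # blocks in the method itself
    print(k, len(paths), copies, original)
```

The path count doubles with each added diamond while the original block count grows linearly, which is why the slide qualifies the growth with "if all paths maintained indefinitely".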

[Figure: control-flow graphs of foo() and bar() over blocks A, B, C, D, E, M, N, O, P; trace fragment A B D M O P E with exits to C and to N]

9


[Figure: the foo()/bar() example with a second trace fragment added. Fragments: A B D M O P E (exits to C and to N) and C D M O P E (exit to N, transfer to A)]




• Code expansion



10


[Figure: the foo()/bar() example with a third trace fragment added. Fragments: A B D M O P E (exits to C and to N), C D M O P E (exit to N, transfer to A), and N P E (transfer to A)]




• Code expansion



11

Supplement Method Dispatch with Trace Dispatch

• Why?

– Improve VM performance via better instruction layout

– Easily-disposable fragments reflect current program behavior

• How?

– JIT compiler inserts instrumentation into method code arrays:

• Monitor potential “hot trace headers”

• Record control flow

– VM runtime assembles & patches trace fragments:

• Blocks “scavenged” from compiled code arrays

• Conditionals adjusted for proper fallthroughs

• Method code arrays patched to transfer control to fragments

• New fragments linked to existing fragments
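The monitoring and recording steps above can be sketched as a two-stage recorder: a cheap counter at each potential trace header, followed by block-by-block recording once a header becomes hot. This is only an illustration. The class `TraceRecorder`, the `on_block` hook, and the threshold of 3 are assumptions, and the sketch ends a trace only when control returns to the header.

```python
class TraceRecorder:
    """Counts executions of trace headers, then records the path from a hot header."""

    def __init__(self, headers, hot_threshold=3):
        # hot_threshold is an illustrative assumption, not a value from the talk.
        self.counters = {h: 0 for h in headers}
        self.hot_threshold = hot_threshold
        self.recording = None   # list of blocks while a trace is being recorded
        self.traces = []        # completed traces

    def on_block(self, block):
        """Called by instrumentation each time a basic block is entered."""
        if self.recording is not None:
            if block == self.recording[0]:
                # Back at the header: the trace is complete.
                self.traces.append(self.recording)
                self.recording = None
            else:
                self.recording.append(block)
                return
        if block in self.counters:
            self.counters[block] += 1
            if self.counters[block] == self.hot_threshold:
                self.recording = [block]   # header is hot: start recording


recorder = TraceRecorder(headers={"A"})
for block in "ABDE" * 4:   # a loop running A, B, D, E four times
    recorder.on_block(block)
```

After the loop, `recorder.traces` holds one trace, recorded on the third visit to A. In a real VM the completed trace would then be handed to the runtime, which assembles the fragment and patches the method code array.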

12

Easy Fragment Management

• Improved trace selection

– JIT to identify trace starting locations

– VM to determine trace stopping locations

• “Friendly” encoding of instructions

– Patch spots built-in

– Avoid pesky PC-relative jumps (e.g., switch statements)

• Knowledge of language implementation features:

– Calling conventions

– Stack layout

– Virtual method dispatch tables

13

Efficient Fragment Management

• “Mixed-mode” scheme:

– Execution in both method code arrays & trace fragments

• Share the same register allocation

– Control flows off-trace into method code arrays

• Fewer trace fragments

• Manageable code expansion

– JVM control is already built into yield points

– Disposable trace fragments

• No need to redo expensive analysis as behavior changes

14

Our Work: Trace Fragment Selection

1. Develop new trace selection methodology

– Leverage JIT global analysis, VM runtime

2. Implement trace selection in JikesRVM and evaluate viability

– Do recorded traces indicate room for code improvement?

– Do the traces exhibit good characteristics?

– Is instrumentation overhead reasonable?

15

Improved Trace Selection: Starting Locations

1. Loop Header Locations

– Identified by JIT loop analysis

– More accurate than “target of backward branch” heuristic

2. “Early exit” blocks

– Allows trace fragments to be “layered”

3. Method prologue

– Catches recursive execution

[Figure: control-flow graphs of foo() and bar() over blocks A, B, C, D, E, M, N, O, P; trace fragment A B D M O P E with exits to C and to N]

16

3. Method prologue

[Figure: the foo()/bar() example with trace fragments A B D M O P E (exits to C and to N) and N P E (transfer to A)]

17







3. Method prologue


[Figure: control-flow graph of foo() over blocks A, B, C, D; trace fragment A B D with exits to C and to Epilogue]

18

Improved Trace Selection: Stopping Criteria

1. Cycle

Returned to the loop header

2. Abutted

Arrived at another loop header

3. Length Limited (unusual)

128 basic blocks encountered

4. Rejoined (unusual)

Returned to a basic block already in trace

5. Exited (unusual)

Exited the method without meeting above conditions. (Identifiable by stack height.)
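The five criteria can be read as a single check made before each block is appended to the trace. The sketch below is one possible reading, not the authors' code: the function name, the order in which the criteria are tested, and the use of an integer call depth for "stack height" are all assumptions.

```python
MAX_BLOCKS = 128   # length limit stated on the slide


def stop_reason(trace, next_block, loop_headers, stack_height, entry_height):
    """Return the stopping criterion met by next_block, or None to keep tracing.

    trace:        blocks recorded so far; trace[0] is the trace header
    next_block:   block about to execute
    loop_headers: set of blocks the JIT marked as loop headers
    stack_height: current call depth; entry_height is the depth at the trace header
    """
    if next_block == trace[0]:
        return "cycle"            # returned to the loop header
    if next_block in loop_headers:
        return "abutted"          # arrived at another loop header
    if len(trace) >= MAX_BLOCKS:
        return "length-limited"   # 128 basic blocks encountered
    if next_block in trace:
        return "rejoined"         # returned to a block already in the trace
    if stack_height < entry_height:
        return "exited"           # left the method that holds the trace header
    return None


headers = {"A", "M"}
print(stop_reason(["A", "B", "D"], "A", headers, 1, 1))
print(stop_reason(["A", "B", "D"], "M", headers, 2, 1))
```

Testing "cycle" and "abutted" first reflects the slide's labeling of the other three outcomes as unusual; a different ordering would change which label a trace receives when two criteria hold at once.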

[Figure: the foo()/bar() example with trace fragments A B D M O P E (exits to C and to N) and N P E (transfer to A)]

20

JIT-Inserted Instrumentation

(a) Assembly of original method code-block

(b) Assembly of code-block to be used for tracing

– Low-fidelity instrumentation: loop header counters

– High-fidelity instrumentation: paths through blocks

[Figure: code-block layouts for blocks A B C D, one block marked as a loop header. Labels: A, JUMP_BLOCK, TRACE_HEAD_A, B, C, D, TRACE_HEAD_B, TRAMPOLINE_A, TRAMPOLINE_B, and instrumented copies A’, B’, C’, D’ with INSTRUM_A to INSTRUM_D and TRAMPOLINE_A’ to TRAMPOLINE_D’]

23





[Figure: animation step of the JIT-inserted instrumentation layout for blocks A B C D; the labels match the earlier layout except that TRACE_HEAD_A no longer appears]

24

Improvement Opportunity

[Figure: control-flow graphs of foo() and bar() over blocks A, B, C, D, E, M, N, O, P, shown with the block layout A B D E C M N P O]

25


[Figure: control-flow graphs of foo() and bar() with the block layout A B D E C M N P O, placed on an axis labeled Virtual Address Space (1GB), from 5B0480C6 (Low) to 9BFE8D1F (High)]

26

Trace Layouts in Address Space (227_MTRT)

[Figure: plot of trace layouts. Axes: Traces; Virtual Address Space (1GB), from 5B0480C6 (Low) to 9BFE8D1F (High)]

27


[Figure: control-flow graphs of foo() and bar() with the block layout A B D E C M N P O. Legend: Gap Transition, Fallthrough Transition]

28

Trace Continuity: DaCapo & SpecJVM98 Benchmarks

– 1/3 traces necessarily fragmented (inter-procedural)

– Most intra-procedural traces non-contiguous
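The continuity categories can be illustrated by classifying each block-to-block transition from machine-code addresses. This is a sketch under stated assumptions: a transition counts as a fallthrough when the next block starts exactly where the current one ends, any trace containing an interprocedural gap is placed in the interprocedural category, and the `Block` tuple and all addresses are hypothetical.

```python
from collections import namedtuple

# start/end are byte addresses of the block's machine code; end is exclusive.
Block = namedtuple("Block", "name method start end")


def transition_kind(cur, nxt):
    """Classify the control transfer from one traced block to the next."""
    if nxt.start == cur.end:
        return "fallthrough"
    return "local gap" if cur.method == nxt.method else "interprocedural gap"


def trace_continuity(trace):
    """Classify a whole trace by the kinds of transitions it contains."""
    kinds = {transition_kind(a, b) for a, b in zip(trace, trace[1:])}
    if "interprocedural gap" in kinds:
        return "with interprocedural gaps"
    if "local gap" in kinds:
        return "local gaps only"
    return "contiguous"


# Hypothetical addresses for three blocks of foo() and one block of bar().
A = Block("A", "foo", 0x1000, 0x1020)
B = Block("B", "foo", 0x1020, 0x1040)
D = Block("D", "foo", 0x1080, 0x10A0)
M = Block("M", "bar", 0x9000, 0x9010)

print(trace_continuity([A, B]))
print(trace_continuity([A, B, D]))
print(trace_continuity([A, B, D, M]))
```

Under this reading, a trace that crosses a call or return is fragmented unless the callee happens to be laid out directly after the call site, which matches the slide's note that inter-procedural traces are necessarily fragmented.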

[Figure: stacked bar chart, one bar per benchmark. X-axis: antlr, bloat, chart, eclipse, fop, hsqldb, jython, luindex, lusearch, pmd, xalan, _201_compress, _202_jess, _205_raytrace, _209_db, _213_javac, _222_mpegaudio, _227_mtrt, _228_jack, _999_checkit, Average. Y-axis: Traces (%), 0% to 100%. Series: With Interprocedural and Local Gaps, Local Gaps Only, Contiguous]

29

Transitions between basic blocks

[Figure: stacked bar chart, one bar per benchmark. X-axis: antlr, bloat, chart, eclipse, fop, hsqldb, jython, luindex, lusearch, pmd, xalan, _201_compress, _202_jess, _205_raytrace, _209_db, _213_javac, _222_mpegaudio, _227_mtrt, _228_jack, _999_checkit, Average. Y-axis: Basic Block Transitions (%), 0% to 100%. Series: Local Gaps, Interprocedural Gaps, Fallthroughs]

– Appropriate fallthrough block 80% of the time

– 15% misprediction rate for local control flow.

– 20% of all transitions could benefit from trace fragment dispatch

Distance               Transition Gaps

0 - 64B (cacheline)    34.7%

65B - 4KB (page)       40.7%

4KB+                   24.6%
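The distance buckets in the table can be reproduced from raw gap distances with a few lines of code. This is a hedged sketch: the function names are invented, the sample distances are hypothetical rather than measurements from the talk, and the thresholds compare byte distance only, without checking actual cache-line or page alignment.

```python
CACHE_LINE = 64     # bytes, upper bound of the table's first bucket
PAGE = 4 * 1024     # bytes, upper bound of the table's second bucket


def gap_bucket(distance):
    """Map a transition gap's byte distance to one of the table's three buckets."""
    d = abs(distance)   # assumed: backward and forward gaps count alike
    if d <= CACHE_LINE:
        return "0 - 64B (cacheline)"
    if d <= PAGE:
        return "65B - 4KB (page)"
    return "4KB+"


def bucket_shares(distances):
    """Return the percentage of gaps that fall in each bucket."""
    counts = {}
    for d in distances:
        bucket = gap_bucket(d)
        counts[bucket] = counts.get(bucket, 0) + 1
    return {k: 100.0 * v / len(distances) for k, v in counts.items()}


# Hypothetical gap distances in bytes, not measurements from the talk.
shares = bucket_shares([16, 48, -200, 3000, 70000, 64, 4096, 4097])
print(shares)
```

Gaps in the last bucket are the ones most likely to land on a different page, which is the case the summary slide singles out.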

30

Trace Characteristics

[Figure: stacked bar chart, one bar per benchmark. X-axis: antlr, bloat, chart, eclipse, fop, hsqldb, jython, luindex, lusearch, pmd, xalan, _201_compress, _202_jess, _205_raytrace, _209_db, _213_javac, _222_mpegaudio, _227_mtrt, _228_jack, _999_checkit, Average. Y-axis: Traces (%), 0% to 100%. Series: Cycle, Rejoined, Length Limited, Abutted, Exited]

– Cycle and abutted traces make up the majority

– Few length-limited, rejoined traces

– Surprisingly large number of exited traces

• Sporadic loops

31

Instrumentation Overhead (Startup)

[Figure: bar chart, one pair of bars per benchmark. X-axis: _201_compress, _202_jess, _205_raytrace, _209_db, _213_javac, _222_mpegaudio, _227_mtrt, _228_jack, _999_checkit, antlr, bloat, chart, eclipse, fop, hsqldb, jython, luindex, lusearch, pmd, xalan, Average. Y-axis: Normalized Execution Time (%), 85% to 115%. Series: Original, Tracing]

– One-iteration tests. (40x)

– Mixed slowdown results: 7.4% (jython), -6.5% (_227_mtrt)

– Average startup overhead: 1.7%

32

Instrumentation Overhead (Steady State)

[Figure: bar chart, one pair of bars per benchmark. X-axis: _201_compress, _202_jess, _205_raytrace, _209_db, _213_javac, _222_mpegaudio, _227_mtrt, _228_jack, _999_checkit, antlr, bloat, chart, eclipse, fop, hsqldb, jython, luindex, lusearch, pmd, xalan, Average. Y-axis: Normalized Execution Time (%), 85% to 115%. Series: Original, Tracing]

– 40-iteration tests. (8x)

– Average steady-state overhead: 1.7%

33

Summary

• Envision trace fragment dispatch as a feedback-directed optimization

– Locality optimizations not addressed by JIT compiler

– Adapt to changing behavior without recompilation

• More accurate trace selection

– Enabled by the co-location with the JIT and VM runtime

• Evaluated opportunity and cost

– 20% of basic block transitions do not use sequential fallthrough

– 25% of taken branches/calls transfer control flow to locations outside the VM page

– Minimal startup and maintenance overhead for trace selection

34

Questions?

35







3. Method prologue


[Figure: control-flow graph of foo() over blocks A, B, C, D; trace fragment B C with exit to D]

36

3. Method prologue

[Figure: control-flow graph of foo() over blocks A, B, C, D; trace fragments B C (exit to D) and D A (transfer to A)]

37

Normalized Trace Layouts (227_MTRT)

[Figure: plot of normalized trace layouts. Axis label: Traces]
