Trace Fragment Selection within Method-based JVMs
Duane Merrill Kim Hazelwood
VEE ‘08
2
Overview
• Would trace fragment dispatch benefit VMs with JITs?
– Fragment-dispatch as a feedback-directed optimization
• Why?
– Improve VM performance via better instruction layout
• Overview
– Motivation
– New scheme for trace selection
– Viability in JikesRVM
• Evaluate opportunities for code improvement
• Evaluate trace selection overhead
3
Traditional VM Adaptive Code Generation
Phase 3: More Advanced JIT Compilation
Update Class/TOC dispatch tables, perform OSR
Phase 2: JIT Method compilation
Compilation Shape: Source Method
Dispatch Shape: Corresponding MC Code Array &
Machine Code Trace Fragment
Phase 1: Interpreter
Compilation Shape: Source Instruction
Dispatch Shape: Corresponding MC Instruction(s)
Machine Code Trace Fragment
4
SDT/DBI/Embedded VM Adaptive Code Generation
Phase 3: More Advanced JIT Compilation
Update Class/TOC dispatch tables, perform OSR
Phase 2: JIT Method compilation
Compilation Shape: Source Method
Dispatch Shape: Corresponding MC Code Array &
Machine Code Trace Fragment
Phase 1: Interpreter
Compilation Shape: Source Instruction
Dispatch Shape: Corresponding MC Instruction(s)
Machine Code Trace Fragment
5
Proposed VM Adaptive Code Generation
Phase 3: More Advanced JIT Compilation
Update Class/TOC dispatch tables, perform OSR
Phase 2: JIT Method compilation
Compilation Shape: Source Method
Dispatch Shape(s): Corresponding MC Code Array &
Machine Code Trace Fragment
Phase 1: Interpreter
Compilation Shape: Source Instruction
Dispatch Shape: Corresponding MC Instruction(s)
Machine Code Trace Fragment
6
Trace Fragment Dispatch
• Trace
– A specific sequence of instructions observed at runtime
– Span:
• Branches
• Procedure calls and returns
• Potentially arbitrary number of instructions
• Trace Fragment
– A finite, linear sequence of machine code instructions
– Single-entry, multiple-exit (viz. superblock)
– Cached, linked
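[Figure: control-flow graph of foo() (blocks A, B, C, D, E) calling bar() (blocks M, N, O, P); the trace fragment A B D M O P E has exits to C and to N.]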
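A minimal sketch of the trace fragment definition above: a linear, single-entry, multiple-exit block sequence that is cached and linked. The class and field names (`TraceFragment`, `FragmentCache`, `exits`, `links`) are hypothetical and are not taken from JikesRVM or from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class TraceFragment:
    """A linear, single-entry, multiple-exit run of basic blocks."""
    head: str                                   # the only block where control may enter
    blocks: list                                # blocks in recorded execution order
    exits: dict = field(default_factory=dict)   # on-trace block -> off-trace target
    links: dict = field(default_factory=dict)   # off-trace target -> linked fragment

    def side_exit(self, block):
        """Return where control goes when it leaves the trace after `block`."""
        target = self.exits[block]
        # A linked exit jumps straight into another cached fragment;
        # an unlinked exit returns to ordinary (non-trace) code.
        return self.links.get(target, target)


class FragmentCache:
    """Cache of fragments keyed by head block (hypothetical helper)."""

    def __init__(self):
        self.by_head = {}

    def add(self, frag):
        self.by_head[frag.head] = frag
        # Link every exit whose target now has a cached fragment.
        for other in self.by_head.values():
            for target in other.exits.values():
                if target in self.by_head:
                    other.links[target] = self.by_head[target]
```

With the fragment from the figure, the exit to N stays an ordinary jump until a fragment headed at N is cached, after which `side_exit("M")` resolves to that fragment.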
7
Trace Fragment Dispatch: The Good
• Location, Location, Location
– “Inlining-like”:
• Context sensitive
• Partial
– Spatial locality provides most of achieved speedup
• Simple, low-cost “local” optimizations
– Redundancy elimination
• Nimbly adjusts to changing behavior
– Efficient
– Lots of early-exits? Discard fragment and re-trace
8
Trace Fragment Dispatch: The Bad
• Lacks optimization power
– Data flow analysis
– Code motion & loop optimizations
• Code expansion
– Tail duplication
– Exponential growth (if all paths maintained indefinitely)
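The exponential-growth point can be checked with a small count. The sketch below assumes a hypothetical control-flow graph made of n if/else diamonds in sequence (not a graph from the slides) and counts the distinct paths a trace selector would have to keep if every path were retained.

```python
from functools import lru_cache


def diamond_chain(n):
    """Build a DAG of n if/else diamonds in sequence; returns (cfg, entry, exit)."""
    cfg = {}
    for i in range(n):
        cfg[f"J{i}"] = [f"T{i}", f"F{i}"]   # branch block with two successors
        cfg[f"T{i}"] = [f"J{i + 1}"]        # "then" side rejoins
        cfg[f"F{i}"] = [f"J{i + 1}"]        # "else" side rejoins
    cfg[f"J{n}"] = []
    return cfg, "J0", f"J{n}"


def count_paths(cfg, entry, exit_):
    """Count distinct entry-to-exit paths in an acyclic CFG."""
    @lru_cache(maxsize=None)
    def walk(node):
        if node == exit_:
            return 1
        return sum(walk(succ) for succ in cfg[node])
    return walk(entry)
```

Each diamond doubles the path count, so ten diamonds already give 1024 paths, each of which would duplicate the shared tail blocks.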
9
Trace Fragment Dispatch: The Bad
[Figure: the same control-flow graph; a second fragment C D M O P E (exit to N, ending with a transfer to A) sits beside A B D M O P E, duplicating the tail D M O P E.]
10
Trace Fragment Dispatch: The Bad
[Figure: the same control-flow graph; a third fragment N P E (ending with a transfer to A) is added, duplicating P E once more.]
11
Supplement Method Dispatch with Trace Dispatch
• Why?
– Improve VM performance via better instruction layout
– Easily-disposable fragments reflect current program behavior
• How?
– JIT compiler inserts instrumentation into method code arrays:
• Monitor potential “hot trace headers”
• Record control flow
– VM runtime assembles & patches trace fragments:
• Blocks “scavenged” from compiled code arrays
• Conditionals adjusted for proper fallthroughs
• Method code arrays patched to transfer control to fragments
• New fragments linked to existing fragments
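As an illustration of the "conditionals adjusted for proper fallthroughs" step, the sketch below copies recorded blocks into a fragment and inverts a branch whenever the hot successor was the branch target. The `Block` model, the condition names and `assemble_fragment` are assumptions made for this example; they do not describe JikesRVM's machine-code representation.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical condition codes and their inverses.
INVERSE = {"eq": "ne", "ne": "eq", "lt": "ge", "ge": "lt", "gt": "le", "le": "gt"}


@dataclass(frozen=True)
class Block:
    name: str
    cond: Optional[str] = None          # None means the block has no conditional branch
    taken: Optional[str] = None         # target when the condition holds
    fallthrough: Optional[str] = None   # next block in the method code array


def assemble_fragment(method_blocks, recorded_path):
    """Copy recorded blocks in order, flipping branches so the trace falls through."""
    path = list(recorded_path)
    fragment = []
    for here, nxt in zip(path, path[1:] + [None]):
        blk = method_blocks[here]
        if nxt is not None and blk.cond is not None and nxt == blk.taken:
            # The hot successor was the branch target: invert the condition so the
            # hot path becomes the fallthrough and the cold path becomes the exit.
            blk = Block(blk.name, INVERSE[blk.cond], blk.fallthrough, blk.taken)
        fragment.append(blk)
    return fragment
```

For a recorded path A, B, D where A originally branched to B on `eq` and fell through to C, the fragment copy of A branches to C on `ne` and falls through to B.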
12
Easy Fragment Management
• Improved trace selection
– JIT to identify trace starting locations
– VM to determine trace stopping locations
• “Friendly” encoding of instructions
– Patch spots built-in
– Avoid pesky PC-relative jumps (e.g., switch statements)
• Knowledge of language implementation features:
– Calling conventions
– Stack layout
– Virtual method dispatch tables
13
Efficient Fragment Management
• “Mixed-mode” scheme:
– Execution in both method code arrays & trace fragments
• Share the same register allocation
– Control flows off-trace into method code arrays
• Fewer trace fragments
• Manageable code expansion
– JVM control is already built into yield points
– Disposable trace fragments
• No need to redo expensive analysis as behavior changes
14
Our Work: Trace Fragment Selection
1. Develop new trace selection methodology
– Leverage JIT global analysis, VM runtime
2. Implement trace selection in JikesRVM and evaluate viability
– Do recorded traces indicate room for code improvement?
– Do the traces exhibit good characteristics?
– Is instrumentation overhead reasonable?
15
Improved Trace Selection: Starting Locations
1. Loop Header Locations
– Identified by JIT loop analysis
– More accurate than “target of backward branch” heuristic
2. “Early exit” blocks
– Allows trace fragments to be “layered”
3. Method prologue
– Catches recursive execution
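To illustrate why loop analysis can be more accurate than the “target of backward branch” heuristic, the sketch below compares the two on a small graph. The edges, the block addresses and the function names are assumptions for this example; dominator-based back-edge detection is one standard form of loop analysis and is not necessarily what the JikesRVM JIT does.

```python
def dominators(cfg, entry):
    """Iterative dominator sets: dom[n] = {n} plus the intersection over predecessors."""
    nodes = set(cfg)
    preds = {n: set() for n in nodes}
    for src, succs in cfg.items():
        for dst in succs:
            preds[dst].add(src)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n], changed = new, True
    return dom


def loop_headers(cfg, entry):
    """Headers of natural loops: targets of edges u->v where v dominates u."""
    dom = dominators(cfg, entry)
    return {v for u, succs in cfg.items() for v in succs if v in dom[u]}


def backward_branch_targets(cfg, address):
    """Address-based heuristic: any branch to a lower-or-equal address."""
    return {v for u, succs in cfg.items() for v in succs if address[v] <= address[u]}
```

If block C is laid out after D, the edge C->D is a backward branch even though D heads no loop, so the heuristic reports D as a trace head while the dominator-based analysis reports only the real loop header.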
16
Improved Trace Selection: Starting Locations
[Figure: the same control-flow graph; a second fragment N P E starts at the early-exit block N and is layered with A B D M O P E.]
17
Improved Trace Selection: Starting Locations
[Figure: method prologue example. Control-flow graph of foo() with blocks A, B, C, D; trace fragment A B D with transfers labeled “to C” and “to Epilogue”.]
18
Improved Trace Selection: Stopping Criteria
1. Cycle
Returned to the loop header
2. Abutted
Arrived at another loop header
3. Length Limited (unusual)
128 basic blocks encountered
4. Rejoined (unusual)
Returned to a basic block already in trace
5. Exited (unusual)
Exited the method without meeting above conditions. (Identifiable by stack height.)
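The five stopping criteria can be read as a single classification loop. The sketch below is one possible ordering of the checks, under the assumption that the recorder sees each executed block together with its stack depth relative to the frame that owns the trace head; `record_trace`, `Stop` and the step format are hypothetical names, and only the 128-block limit comes from the slide.

```python
from enum import Enum


class Stop(Enum):
    CYCLE = "returned to the loop header"
    ABUTTED = "arrived at another loop header"
    LENGTH_LIMITED = "128 basic blocks encountered"
    REJOINED = "returned to a basic block already in trace"
    EXITED = "exited the method"


MAX_BLOCKS = 128  # limit stated on the slide


def record_trace(head, steps, loop_headers):
    """Consume (block, stack_depth) steps after `head`; return (trace, reason).

    stack_depth is relative to the frame that owns `head`; a negative depth is
    taken to mean that frame has returned (an assumption of this sketch).
    """
    trace = [head]
    for block, depth in steps:
        if depth < 0:
            return trace, Stop.EXITED
        if block == head:
            return trace, Stop.CYCLE
        if block in loop_headers:
            return trace, Stop.ABUTTED
        if block in trace:
            return trace, Stop.REJOINED
        trace.append(block)
        if len(trace) >= MAX_BLOCKS:
            return trace, Stop.LENGTH_LIMITED
    return trace, None  # ran out of steps before any criterion fired
```

Checking for the head before the general loop-header test keeps a cycle from being reported as abutted when the head is itself a loop header.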
20
JIT-Inserted Instrumentation
(a) Assembly of original method code-block
(b) Assembly of code-block to be used for tracing
[Figure: code layout before and after instrumentation. Original blocks A B C D, one marked as a loop header. Low-fidelity instrumentation (loop header counters): JUMP_BLOCK, TRACE_HEAD_A, TRACE_HEAD_B, TRAMPOLINE_A, TRAMPOLINE_B. High-fidelity instrumentation (paths through blocks): block copies A’ B’ C’ D’ with INSTRUM_A to INSTRUM_D and TRAMPOLINE_A’ to TRAMPOLINE_D’.]
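One way to read the two instrumentation levels is as a cheap counter that, once hot, switches execution to a copy of the code in which every block records itself. The sketch below models only that switch. The class name, the threshold value and the rule that recording ends when the head is reached again are assumptions for illustration, not details from the slides.

```python
HOT_THRESHOLD = 3  # hypothetical hotness threshold, not a value from the slides


class TwoLevelProfiler:
    """Low-fidelity counters at trace heads; high-fidelity path recording when hot."""

    def __init__(self, trace_heads):
        self.counters = {head: 0 for head in trace_heads}
        self.recording = None   # path being recorded, or None in low-fidelity mode
        self.paths = []         # completed recorded paths

    def enter_block(self, block):
        if self.recording is not None:
            # High fidelity: every block on the path logs itself.
            if block != self.recording[0]:
                self.recording.append(block)
                return
            # Back at the head: the path is complete, return to counting.
            self.paths.append(self.recording)
            self.recording = None
        if block in self.counters:
            # Low fidelity: only trace heads pay for a counter update.
            self.counters[block] += 1
            if self.counters[block] >= HOT_THRESHOLD:
                self.counters[block] = 0
                self.recording = [block]
```

In this model most executions pay only for a counter increment at the head, and per-block recording cost is incurred for a single pass after the head becomes hot.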
23
JIT-Inserted Instrumentation
(a) Assembly of original method code-block
(b) Assembly of code-block to be used for tracing
[Figure: the same low-fidelity and high-fidelity layout, with TRACE_HEAD_A no longer present.]
24
Improvement Opportunity
[Figure: control-flow graph of foo() (blocks A, B, C, D, E) and bar() (blocks M, N, O, P) beside the block layout A B D E C M N P O.]
26
Trace Layouts in Address Space (227_MTRT)
[Figure: traces plotted across the virtual address space (1GB), from 5B0480C6 (low) to 9BFE8D1F (high); vertical axis: Traces.]
27
Improvement Opportunity
[Figure: the same control-flow graph and layout A B D E C M N P O, with each transition marked as a gap transition or a fallthrough transition.]
28
Trace Continuity
DaCapo & SpecJVM98 Benchmarks
– 1/3 traces necessarily fragmented (inter-procedural)
– Most intra-procedural traces non-contiguous
[Figure: stacked bar chart of Traces (%) per benchmark (antlr, bloat, chart, eclipse, fop, hsqldb, jython, luindex, lusearch, pmd, xalan, _201_compress, _202_jess, _205_raytrace, _209_db, _213_javac, _222_mpegaudio, _227_mtrt, _228_jack, _999_checkit, Average). Series: With Interprocedural and Local Gaps, Local Gaps Only, Contiguous.]
29
Transitions between basic blocks
[Figure: stacked bar chart of Basic Block Transitions (%) per benchmark (antlr, bloat, chart, eclipse, fop, hsqldb, jython, luindex, lusearch, pmd, xalan, _201_compress, _202_jess, _205_raytrace, _209_db, _213_javac, _222_mpegaudio, _227_mtrt, _228_jack, _999_checkit, Average). Series: Local Gaps, Interprocedural Gaps, Fallthroughs.]
– Appropriate fallthrough block 80% of the time
– 15% misprediction rate for local control flow
– 20% of all transitions could benefit from trace fragment dispatch
Distance               Transition Gaps
0 - 64B (cacheline)    34.7%
65B - 4KB (page)       40.7%
4KB+                   24.6%
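The categories used in these measurements can be sketched as a small classifier. The `Placed` record, the rule that any cross-method transition counts as an interprocedural gap, and the definition of distance as the byte gap between the end of the source block and the start of the destination block are assumptions of this example; the slides do not spell out how the measurement was implemented.

```python
from typing import NamedTuple

CACHE_LINE = 64   # bytes, per the table
PAGE = 4096       # bytes, per the table


class Placed(NamedTuple):
    """A basic block placed in memory (hypothetical record)."""
    method: str
    start: int   # first byte address
    end: int     # one past the last byte


def classify_transition(src, dst):
    """Label a block-to-block transition the way the chart legend does."""
    if src.method != dst.method:
        return "interprocedural gap"
    if dst.start == src.end:
        return "fallthrough"
    return "local gap"


def gap_bucket(src, dst):
    """Bucket a gap by distance, using the table's three ranges."""
    distance = abs(dst.start - src.end)
    if distance <= CACHE_LINE:
        return "0 - 64B (cacheline)"
    if distance <= PAGE:
        return "65B - 4KB (page)"
    return "4KB+"
```

With 32-byte blocks laid out as A B D E C, the transition A to B is a fallthrough, while A to C is a local gap of 96 bytes and lands in the page bucket.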
30
Trace Characteristics
[Figure: stacked bar chart of Traces (%) per benchmark (antlr, bloat, chart, eclipse, fop, hsqldb, jython, luindex, lusearch, pmd, xalan, _201_compress, _202_jess, _205_raytrace, _209_db, _213_javac, _222_mpegaudio, _227_mtrt, _228_jack, _999_checkit, Average). Series: Cycle, Rejoined, Length Limited, Abutted, Exited.]
– Cycle and abutted traces make up the majority
– Few length-limited, rejoined traces
– Surprisingly large number of exited traces
• Sporadic loops
31
Instrumentation Overhead (Startup)
[Figure: bar chart of Normalized Execution Time (%), axis from 85% to 115%, per SpecJVM98 and DaCapo benchmark plus Average. Series: Original, Tracing.]
– One-iteration tests. (40x)
– Mixed slowdown results: 7.4% (jython), -6.5% (_227_mtrt)
– Average startup overhead: 1.7%
32
Instrumentation Overhead (Steady State)
[Figure: bar chart of Normalized Execution Time (%), axis from 85% to 115%, per SpecJVM98 and DaCapo benchmark plus Average. Series: Original, Tracing.]
– 40-iteration tests. (8x)
– Average steady-state overhead: 1.7%
33
Summary
• Envision trace fragment dispatch as a feedback-directed optimization
– Locality optimizations not addressed by JIT compiler
– Adapt to changing behavior without recompilation
• More accurate trace selection
– Enabled by the co-location with the JIT and VM runtime
• Evaluated opportunity and cost
– 20% of basic block transitions do not use sequential fallthrough
– 25% of taken branches/calls transfer control flow to locations outside the VM page
– Minimal startup and maintenance overhead for trace selection
34
Questions?
35
Improved Trace Selection: Starting Locations
[Figure: backup example. Control-flow graph of foo() with blocks A, B, C, D; trace fragment B C with an exit to D.]
36
Improved Trace Selection: Starting Locations
[Figure: the same backup example with an additional fragment D A; transfers labeled “to D” and “to A”.]
37
Normalized Trace Layouts (227_MTRT)
[Figure: normalized trace layouts; vertical axis: Traces.]