Dynamic Optimization
David Kaeli
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA
[email protected]
What is Dynamic Optimization
• Allow a running binary to adapt to the underlying hardware system dynamically
• Perform optimization without sacrificing performance
[Diagram: a static source program with its input running on the OS/HW platform, versus a "fluid binary" with its input managed by a runtime dynamic optimization system]
Why Dynamic versus Static
• Allows code to adapt to:
  – Changes in the microarchitecture of the underlying platform (related to binary translation)
  – Changes in program input
  – Environment dynamics (e.g., system load, system availability)
• Involves very little user interaction (optimization should be applied transparently)
• Source code is not needed
• Language independent
Challenges with Dynamic Optimization
• Reducing the associated overhead and maintaining transparency
• Addressing a range of workloads
• Selecting appropriate optimizations
Dynamic Optimization Systems
• Dynamo – HP Labs, PA-RISC/HPUX – runtime optimization
• Vulcan/Mojo – MS Research, x86-IA64/Win2K – desktop instrumentation, profiling and optimization
• Jalapeno – IBM Research, JVM-PPC-SMPs/AIX – Java JIT designed for research
• Latte – Seoul National University, Korea – Java JIT designed for efficient register allocation
Dynamo
[Diagram: normal execution model – application + libs (native binary) running directly on the CPU platform – versus the Dynamo execution model, where Dynamo sits between the application and the CPU]
To the application, Dynamo looks like a software interpreter that executes the same instruction set as the underlying hardware interpreter (the CPU).
* Many of these slides were provided by Evelyn Duesterwald
Elements of Dynamo
A novel performance delivery mechanism:
– Optimize the code when it executes, not when it is created
• A client-enabled performance mechanism
• Dynamic code re-layout
• Partial dynamic inlining/superblock formation
• Path-specific optimization
• Adaptive: machine- and input-specific
• Complementary to static optimization
• Transparent: requires no compiler support
Flow within Dynamo
[Diagram: the interpretation/profiling loop. Interpret until a taken branch, then look up the next PC in the trace cache. On a hit, execute from the Dynamo code cache; on a miss, check whether this is a hot start of trace. If so, the Trace Selector captures the input native instruction stream, the Trace Optimizer optimizes it, the trace is emitted, and the Trace Linker patches exit branches and recycles the counter; if not, continue interpreting.]
Traces in Dynamo
Trace = single-entry, join-free dynamic sequence of basic blocks
[Diagram: a control flow graph with blocks A–F including a call/return, its memory layout, and the corresponding trace cache layout, where the selected trace (A, B, C, E, D) is laid out contiguously, with a trampoline that exits to the interpreter and a connection to another trace]
Traces in Dynamo
• Interprocedural forward path: start-of-trace = target of backward branch; end-of-trace = taken backward branch
• 11 paths through the loop: ABCEH, ABCEHKMO, ABCEHKNO, ABCEIKMO, ABCEIKNO, ABCFJL, ABCFJLNO, ABDFJL, ABDFJLNO, ABDGJL, ABDGJLNO
[Diagram: control flow graph with blocks A through O]
Traces in Dynamo – typical path profiles
[Diagram: the same control flow graph with blocks A through O]
Approach:
• profile all edge frequencies
• select the hot trace by following the highest-frequency branch outcome
Disadvantages:
• infeasible paths: ignores branch correlation
• overhead: need to profile every conditional branch
Traces in Dynamo – Next Executing Tail Prediction
[Diagram: the same control flow graph with blocks A through O]
• Minimal profiling: profile only start-of-trace points (block A)
• Optimistic: at a hot start-of-trace, select the next executing tail as the trace
Advantages:
• very lightweight: # instrumentation points = # targets of backward branches; # counters = # targets of backward branches
• statistically likely to pick the hottest path
• selects only feasible paths
• easy to implement
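The NET heuristic above can be sketched as a small profiling loop. This is a toy Python model: the block addresses, the counter threshold, and the interpreter hooks are illustrative assumptions, not Dynamo's actual values or interfaces.

```python
# A minimal sketch of Next-Executing-Tail (NET) trace selection.
# Counters exist only for targets of backward branches, matching the
# "# counters = # targets of backward branches" property on the slide.

HOT_THRESHOLD = 50  # placeholder; Dynamo's real threshold differs

class NetSelector:
    def __init__(self):
        self.counters = {}      # start-of-trace address -> hit count
        self.trace_cache = {}   # start-of-trace address -> list of blocks

    def on_backward_branch_target(self, addr):
        """Called when interpretation reaches the target of a backward branch."""
        if addr in self.trace_cache:
            return self.trace_cache[addr]   # hit: execute the cached trace
        self.counters[addr] = self.counters.get(addr, 0) + 1
        if self.counters[addr] >= HOT_THRESHOLD:
            return "RECORD"                 # hot: record the next executing
                                            # tail until a backward branch
        return None                         # cold: keep interpreting

    def end_trace(self, addr, blocks):
        """Install the recorded tail as the trace for this start-of-trace."""
        self.trace_cache[addr] = blocks
        del self.counters[addr]             # counter can be recycled
```

Note how the optimistic part of the heuristic is that whatever executes next after the counter overflows is assumed to be the hot path; no per-branch profiling is needed.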
Trace Selection
[Plot: number of traces selected (0–90) over time for the li benchmark]
When to stop creating new traces
• Excessively high trace selection rates cause unacceptable overhead and potential thrashing in the Dynamo code cache
• We need the opportunity to amortize the cost of creating traces, so trace creation must sometimes be turned off
• “Bail out” is entered when the creation rate per unit time is excessively high
Trace Optimization
• Build a lightweight Intermediate Representation (IR): symbolic labels, extended virtual register set
• Optimization with integrated demand-driven analysis
• Scheduling and register allocation – retain previous mappings
[Pipeline diagram: list of trace blocks → lite IR → forward pass → backward pass → register allocation → linker]
Trace Optimization
Are there any runtime optimization opportunities in statically optimized code? Limitations of static compiler optimization:
• cost of call-specific interprocedural optimization
• cost of path-specific optimization in the presence of complex control flow
• difficulty of predicting past indirect branches
• lack of access to shared libraries
• sub-optimal register allocation decisions (e.g., register allocation for individual array elements or pointers)
Path-specific optimizations
Conservative optimizations (precise signal delivery, memory-safe):
• partial procedure inlining
• redundant branch removal
• constant propagation
• constant folding
• copy propagation
Aggressive optimizations:
• redundant load removal, runtime-disambiguated (guarded) load removal
• dead code elimination
• partially dead code sinking
• loop unrolling
• loop invariant hoisting
Aggressive optimization can be made memory- and signal-safe via compiler hints and de-optimization.
Dynamo Optimizations
• Constant propagation
  – Given x <- c for variable x and constant c
  – Replace all later uses of x with c, assuming that x will not be modified

Before:                  After:
  entry                    entry
  b <- 3                   b <- 3
  c <- 4 * b               c <- 4 * 3
  c > b ?  (y/n)           c > 3 ?  (y/n)
  y: d <- b + 2            y: d <- 3 + 2
  n: e <- a + b            n: e <- a + 3
  exit                     exit
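The propagation step can be sketched over straight-line assignments. The tuple IR below is a toy representation for illustration, not Dynamo's actual IR, and the branch condition is omitted for brevity.

```python
# Constant propagation over a straight-line trace, mirroring the slide's
# example. Instructions are (dest, op, arg1, arg2) tuples.

def propagate_constants(trace):
    consts = {}        # variable -> known constant value
    out = []
    for dest, op, a1, a2 in trace:
        a1 = consts.get(a1, a1)    # replace uses with known constants
        a2 = consts.get(a2, a2)
        out.append((dest, op, a1, a2))
        if op == "mov" and isinstance(a1, int):
            consts[dest] = a1      # record x <- c
        else:
            consts.pop(dest, None) # dest redefined: forget old constant
    return out

# The slide's example: b <- 3; c <- 4 * b; d <- b + 2; e <- a + b
trace = [("b", "mov", 3, None),
         ("c", "mul", 4, "b"),
         ("d", "add", "b", 2),
         ("e", "add", "a", "b")]
# After propagation: c <- 4 * 3; d <- 3 + 2; e <- a + 3
```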
Dynamo Optimizations
• Constant folding
  – Identifying that all operands in an assignment are constant after macro expansion and constant propagation
  – Easy for booleans, a little trickier for integers (exceptions such as divide by zero and overflow); for FP this can be very tricky due to multiple FP formats

Before:                  After:
  entry                    entry
  b <- 3                   b <- 3
  c <- 4 * 3               c <- 12
  c > 3 ?  (y/n)           c > 3
  y: d <- 3 + 2            e <- a + 3
  n: e <- a + 3            d <- 3 + 2
  exit                     exit
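The folding step follows directly: once all operands are constant, evaluate the operation at optimization time, declining to fold cases that would change exception behavior. This sketch reuses the same illustrative tuple IR as above; it is not Dynamo's implementation.

```python
# Constant folding: evaluate an op once all of its operands are constants.
# Integer ops must guard against exceptions (e.g., divide by zero);
# this sketch simply declines to fold those cases.

def fold(op, a1, a2):
    if not (isinstance(a1, int) and isinstance(a2, int)):
        return None                    # not all operands are constant
    if op == "add":
        return a1 + a2
    if op == "mul":
        return a1 * a2
    if op == "div":
        if a2 == 0:
            return None                # preserve the runtime exception
        return a1 // a2
    return None

def fold_constants(trace):
    out = []
    for dest, op, a1, a2 in trace:
        v = fold(op, a1, a2)
        if v is not None:
            out.append((dest, "mov", v, None))   # e.g., c <- 12, not c <- 4*3
        else:
            out.append((dest, op, a1, a2))
    return out
```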
Dynamo Optimizations
• Partial load removal – LRE paper
• Dead code elimination
  – A variable is dead if it is not used on any path from where it is defined to where the function exits
  – An instruction is dead if it computes only values that are not used on any executable path leading from the instruction
  – Dead code is often created through the application of other code optimizations (e.g., strength reduction: replacing expensive ops by less expensive ops)
• Loop invariant hoisting – moving invariant operations out of the loop body
• Fragment link-time optimizations – apply peephole optimization around links, looking for dead code removal
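Over a single trace, dead code elimination reduces to a backward liveness pass: an instruction whose result is never used later (and which has no side effects) can be dropped. The (dest, op, args) encoding below is an illustrative assumption.

```python
# Dead code elimination over a trace via a single backward liveness pass.

def eliminate_dead_code(trace, live_out):
    """live_out: variables still needed after the trace exits."""
    live = set(live_out)
    kept = []
    for dest, op, args in reversed(trace):
        if dest is not None and dest not in live and op != "call":
            continue                   # dead: result unused, no side effect
        kept.append((dest, op, args))
        live.discard(dest)             # dest is defined here, not live above
        live.update(a for a in args if isinstance(a, str))
    kept.reverse()
    return kept
```

Because the pass runs backward and dead instructions never add their operands to the live set, whole dead chains (a value used only by another dead instruction) disappear in one sweep.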
Implementation Issues
Problem: a signal arrives while executing in the code cache.
• How can we achieve transparent signal delivery?
• How can the original signal context be reconstructed?
Dynamo approach: intercept all signals. Upon arrival of a signal at code cache location L, Dynamo first gains control:
1. Save the code cache context
2. Retranslate the trace and record:
   i. any changes in register mapping up to position L
   ii. the original code addresses of L
   iii. all context-modifying optimizations and the steps for de-optimization
3. Update the code cache context to obtain the native context
4. Load the native context and execute the original signal handler
Dynamic Code Cache
Problem: How do we control the size of dynamically recompiled code? How do we react to phase changes?
Adaptive flushing-based cache management scheme:
• preemptive cache flushes
• fast allocation/de-allocation of traces
• removal of old and cold traces
• branch re-biasing to improve locality in the cache
• configurable for various performance/memory-footprint trade-offs
• code cache default size: 300 Kbytes
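The preemptive-flush idea can be sketched as follows. The policy shown (flush everything on cache pressure, on the assumption that pressure signals a phase change) is a simplification for illustration; Dynamo's actual heuristics are more refined.

```python
# A sketch of preemptive cache flushing. When the cache fills, old and
# cold traces likely belong to a finished program phase, so dropping the
# whole cache is both fast and usually harmless.

CACHE_SIZE = 300 * 1024   # Dynamo's default code cache: 300 Kbytes

class CodeCache:
    def __init__(self, size=CACHE_SIZE):
        self.size = size
        self.used = 0
        self.traces = {}

    def insert(self, start_addr, nbytes):
        if self.used + nbytes > self.size:
            self.flush()               # preemptive flush on pressure
        self.traces[start_addr] = nbytes
        self.used += nbytes

    def flush(self):
        self.traces.clear()            # fast de-allocation: drop everything
        self.used = 0
```

Flushing wholesale avoids per-trace eviction bookkeeping, at the cost of re-selecting any traces that were still hot.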
Dynamo Performance
[Bar chart: speedup over native execution, from roughly -20% to +25%, for compress, go, ijpeg, li, m88ksim, perl, vortex, deltablue, and Average, with each bar split into trace selection and path optimization contributions; one bar is labeled -1.53]
(+O2 compiled native binary running under Dynamo on a PA-8000)
Bailout
[Plot: number of traces selected (0–140) over time for go before bail-out, go after bail-out, and li]
• Bail out if the trace selection rate exceeds a tolerable threshold
Bailout
• To prevent degradation, Dynamo keeps track of the current trace selection rate
• Virtual time is recorded by counting the number of interpreted BBs needed to select N traces
• A threshold is set to judge whether a rate is "high"
• The trace selection rate is considered excessive if k consecutive high-rate time intervals have been encountered
• Bailout turns off trace selection and optimization; execution resumes in the original binary
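The bailout heuristic above can be sketched directly: virtual time advances with interpreted basic blocks, and if N traces keep arriving within too few interpreted BBs for k intervals in a row, trace selection is turned off. N, k, and the rate threshold below are illustrative parameters, not Dynamo's tuned values.

```python
# A sketch of Dynamo's bailout heuristic using virtual time
# (interpreted basic-block count) per trace-selection interval.

class BailoutMonitor:
    def __init__(self, n_traces=10, max_rate_bbs=1000, k=3):
        self.n_traces = n_traces          # traces per measurement interval
        self.max_rate_bbs = max_rate_bbs  # "high rate" if N traces arrive
                                          # within this many interpreted BBs
        self.k = k                        # consecutive high intervals to bail
        self.bbs = 0                      # virtual time within the interval
        self.traces = 0
        self.high_streak = 0
        self.bailed_out = False

    def on_interpreted_bb(self):
        self.bbs += 1

    def on_trace_selected(self):
        self.traces += 1
        if self.traces == self.n_traces:       # end of interval
            if self.bbs <= self.max_rate_bbs:  # rate too high
                self.high_streak += 1
            else:
                self.high_streak = 0
            if self.high_streak >= self.k:
                self.bailed_out = True         # resume the original binary
            self.bbs = 0
            self.traces = 0
```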
Performance speedups with bailout
[Bar chart: speedup over native execution, from roughly -10% to +25%, for compress, go, ijpeg, li, m88ksim, perl, vortex, deltablue, and Average, with each bar split into trace selection and path optimization contributions]
(+O2 compiled native binary running under Dynamo on a PA-8000)
Memory Overhead – Dynamo text
• Decode: 30%
• Interpreter: 21%
• Control: 15%
• Trace Optimization: 5% + 15%
• Trace Formation: 5%
• Memory Mgmt: 5%
• Initialization: 4%
Total size = 273 Kb
PA-RISC-dependent portion = 179 Kb (66%)
Summary of Dynamo
• Demonstrated the potential for dynamic optimization through an actual implementation
• Optimization impact tends to be program dependent
• More sophisticated bailout algorithms need to be devised
• Static compile-time hints should be used to help guide a dynamic optimization system
Vulcan – A. Srivastava
• Provides both static and dynamic code modification
• Performs optimization on x86, IA64 and MSIL binaries
• Can work in the presence of multithreading and variable length instructions (X86)
• Designed to be able to perform modifications on a remote machine using a Distributed Component Object Model (DCOM) interface
• Can also serve as a binary translator
Mojo – Dynamic Optimization using Vulcan (Chaiken & Gillies)
• Targets a desktop x86/Windows 2000 environment
• Supports large, multithreaded applications that use exception handlers
• Requires no OS support
• Allows optimization across shared library boundaries
• Can be aided by information provided by a static compiler
Mojo Structure
[Diagram: the Mojo Dispatcher at the center, connected to the Path Builder, the Path Cache, the Basic Block Cache, the exception-handling component, and the original code in NT DLLs]
1. Interrogate the Path Cache for a hit.
2. If hit, execute from the Path Cache directly; else interrogate the Basic Block Cache for a hit.
3. If hit in the BBC, execute directly; else load the block from the original code.
4. Each time control returns to the Mojo Dispatcher, BBs are checked for "hotness".
5. If a BB is hot enough, Mojo turns on path building. Once a complete path has been built and optimized, it is placed in the Path Cache.
Mojo Components
• Mojo Dispatcher
  – Is the control point in the dynamic optimization system
  – Manages execution context using its own stack space
• Basic Block Cache
  – Handles basic blocks that have not yet become hot
  – Identifies basic block boundaries by dynamically decoding instruction bytes
  – Branches are modified to pass control to the dispatcher, passing along the address of the next basic block to execute
  – Additional information is kept in the BBC that is used when constructing paths
Mojo Components
• Path Builder
  – Responsible for selecting, building, and optimizing hot paths
  – Maintains "hotness" information for basic blocks
  – Utilizes the same heuristic for building hot paths as Dynamo (next path after counter overflow)
  – Utilizes separate thresholds for back-edge targets and path-exit targets (hot side exits must be detected when constructing a dynamic path)
  – Instructions are laid out contiguously (reordered), eliminating many taken conditional branches
Mojo Components – Path Builder
• Path termination – Dynamo only terminates paths on back edges
[Diagram: original nested loops over blocks A, B, C; Dynamo's back-edge profiling terminates the path at the back edge, while Mojo's back-edge and side-exit profiling selects a longer path]
Exception Handling and Threads
• Mojo patches ntdll.dll
• Mojo captures the state of the machine before passing exceptions off to the dispatcher
• The dispatcher prevents the exception handler from polluting the Path Cache
• To handle multithreading, Mojo allocates a basic block cache per thread, but uses a shared Path Cache
• Locking mechanisms are provided so the shared Path Cache can be accessed and updated reliably
Mojo performance
[Bar chart: execution time relative to native execution (0–180) for Byte, OOPACK, drystone, life, puzzle, 8queens, bubsort, csieve, qsort, acker, and fib]
qsort, acker and fib are recursive programs
Mojo performance – SPEC2000/SPEC95
[Bar chart: execution time in seconds (0–1400) for SPEC2000/SPEC95 benchmarks, native vs. Mojo]
Mojo Execution - Windows
[Bar chart: execution time in seconds (0–120) for Winword and FoxPro, native vs. Mojo]
Comments
• For simple programs with simple control flow, Mojo shows good improvement
• For larger programs with more dynamic control flow, Mojo is overwhelmed with the amount of path creation (same problem that was encountered for Dynamo)
• A bailout strategy is needed, along with a better hot-path detection algorithm
• Future work is investigating how to use hints obtained during static compilation to aid in the dynamic optimization of the code
What is a JIT
• Just-in-Time compiler – developed to address the performance issues encountered with Java interpreters/translators
• Portability generally means lower performance; JITs attempt to bridge this gap
• JITs dynamically cache translated Java bytecodes and perform extensive optimization on the native instructions
• Given the overhead of using an OO programming model (frequent method calls), extensive exception checking, and the overhead of dynamic translation/compilation, the quality of the JIT must be high
Common JITs
• Sun Java Development Kit (Sun)
• HotSpot JIT (Sun)
• Kaffe (Transvirtual Technologies)
• Jalapeno (IBM Research)
• Latte (Seoul National University)
IBM Jalapeno JVM and JIT
• Designed specifically for servers
  – Shared-memory multiprocessor scalability
  – Manage a large number of concurrent threads
  – High availability
  – Rapid response and graceful degradation (an issue when garbage collection is involved)
• Mainly developed in Java (reliability?)
• Designed specifically for extensive dynamic optimization
The Jalapeno Adaptive Optimization System
• Translates bytecodes directly to the native ISA
• Recompilation is performed in a separate thread from the application, and thus can be done in parallel to program execution
• AOS has three components– Runtime measurement system– Controller– Recompilation system
Jalapeno AOS Architecture
[Diagram: the measurement subsystem gathers raw data from the executing code and the hardware/VM performance monitor; organizers turn raw data into formatted data in the AOS database and post events to the organizer event queue; the controller consumes this formatted profile data and places instrumentation/compilation plans on the compilation queue; compilation threads invoke the compilers (Base, Opt, …) and install new instrumented/optimized code]
Three Optimization Levels
• Level 0 – On-the-fly optimizations performed during translation (constant propagation, constant folding, dead code detection)
• Level 1 – Adds to Level 0: common subexpression elimination, redundant load elimination, aggressive inlining
• Level 2 – Adds to Level 1: flow-sensitive optimizations, array bounds check elimination
Controller model
• Decides when to recompile a method
• Decides which optimization level to use
• Measurements are used to guide the profiling strategy and select the hot methods to recompile
• An analytical model is also used that represents the costs and benefits of performing these tasks
When to recompile?
Ti = current total amount of time the program will spend executing method m
Cj = cost of recompiling method m at optimization level j
Tj = expected total amount of time the program will spend executing method m after optimization at level j
For j = 0, 1, 2, choose the j that minimizes Cj + Tj.
If Cj + Tj < Ti, the Controller recompiles at level j; otherwise it decides not to recompile.
When to recompile?
• To estimate Ti, we assume the program will run for a total time of Tf, and use profile data to indicate what percentage of the total execution time (Pm) is spent in method m (versus the rest of the program)
• We can compute Ti as:
Ti = Tf * Pm
• This is the initial estimated execution time for method m. A new Ti is computed based on an estimate of the speedup of method m.
• The above weight decays over time.
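The controller's decision can be sketched as follows. The per-level speedup table and cost values are illustrative assumptions; the real system derives them from offline measurement and decays them over time.

```python
# A sketch of the Jalapeno controller's cost/benefit decision:
# recompile method m at level j only if Cj + Tj < Ti.

SPEEDUP = {0: 1.0, 1: 2.0, 2: 3.0}    # assumed speedup of level j over base

def choose_level(Tf, Pm, cost):
    """Tf: estimated total remaining run time; Pm: fraction spent in m;
    cost: dict mapping level j -> recompilation cost Cj for method m."""
    Ti = Tf * Pm                       # time the program will spend in m
    best_j, best_total = None, Ti      # doing nothing costs Ti
    for j, Cj in cost.items():
        Tj = Ti / SPEEDUP[j]           # expected time in m if optimized at j
        if Cj + Tj < best_total:
            best_j, best_total = j, Cj + Tj
    return best_j                      # None means: do not recompile
```

For example, with Tf = 100, Pm = 0.5 (so Ti = 50) and costs {0: 100, 1: 10, 2: 40}, level 1 wins: 10 + 50/2 = 35 beats both doing nothing (50) and level 2 (40 + 50/3).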
How well does optimization work in Jalapeno?
[Bar chart: speedup over baseline (0–14) on the SPECjvm98 benchmarks for Opt0, Opt1, and Opt2]
Comments about Jalapeno
• Focused on method-granularity optimization
• Simple heuristics for predicting runtimes and benefits/costs are highly sensitive to cold vs. warm invocation of the application
• New work looks at method specific optimizations that consider additional characteristics besides just the estimated runtime
Latte
• Addresses the inefficiencies in the stack-based Java bytecode machine by efficiently mapping stack space to a RISC register file
• Since traditional register coloring is an expensive algorithm, and allocation must be done in the same space as the runtime, this system looks at other ways to get good register allocation at a reduced cost
Java Translation to Native Code
1. Identify control join points and subroutines in the bytecode using a depth-first search traversal
2. Translate bytecodes into a control flow graph, mapping program variables to a set of pseudo-registers
3. Perform traditional compiler optimizations
4. Perform register allocation
5. Convert the CFG to native host (SPARC) code
Treeregion Scheduling
• The CFG is partitioned into treeregions (single-entry, multiple-exit subgraphs shaped like trees)
• Treeregions start at the beginning of the program or at join points, and end either at the end of the program or at new join points
• Liveness analysis is performed
• Individual treeregions are scheduled using a backward sweep, followed by a forward sweep
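The partitioning rule above can be sketched directly: every join point (a block with more than one predecessor) roots a new region, and region growth stops whenever a successor is a join point. The dict-of-successor-lists CFG encoding is an assumption for illustration, not Latte's representation.

```python
# A sketch of partitioning a CFG into treeregions: single-entry,
# multiple-exit subgraphs that stop at join points.

def find_join_points(cfg):
    """A block is a join point if it has more than one predecessor."""
    preds = {}
    for b, succs in cfg.items():
        for s in succs:
            preds.setdefault(s, []).append(b)
    return {b for b, p in preds.items() if len(p) > 1}

def treeregions(cfg, entry):
    joins = find_join_points(cfg)
    regions = []
    for root in [entry] + sorted(joins - {entry}):
        region, work, seen = [], [root], set()
        while work:
            b = work.pop()
            if b in seen:
                continue
            seen.add(b)
            region.append(b)
            # a join point terminates this region and roots its own region
            work.extend(s for s in cfg.get(b, []) if s not in joins)
        regions.append(region)
    return regions
```

On a diamond CFG (A branches to B and C, which both reach D), D is a join point, so the regions are the tree {A, B, C} and the singleton {D}.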
How well does optimization work in Latte?
[Bar chart: speedup over baseline (0–5) on the SPECjvm98 benchmarks for Latte, Latte(opt), and HotSpot]
Comments on Latte
• Good register allocation can help to improve the runtime performance of a dynamically tuned Java bytecode binary
• Optimization should target hot spots in the executable
• Can provide very competitive performance compared with the Sun JDK and HotSpot compilation tools
Summary on Dynamic Optimization
• There is always a struggle to balance the costs and benefits of particular types of dynamic optimizers
• Dynamic optimizers can be workload dependent
• There exists a lot of room in Java JITs to improve upon instruction schedules and register allocation
• This is a rich area for future research on compiler and memory management studies