Dynamic Optimization
David Kaeli
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA
[email protected]
What is Dynamic Optimization
• Allow a running binary to adapt to the underlying hardware system dynamically
• Perform optimization without sacrificing performance
[Diagram: a static source program with its input running on the OS/HW platform, versus a "fluid binary" with its input managed by a runtime dynamic optimization system]
Why Dynamic versus Static
• Allows code to adapt to:
  – Changes in the microarchitecture of the underlying platform (related to binary translation)
  – Changes in program input
  – Environment dynamics (e.g., system load, system availability)
• Involves very little user interaction (optimization should be applied transparently)
• Source code is not needed
• Language independent
Challenges with Dynamic Optimization
• Reducing the associated overhead and maintaining transparency
• Addressing a range of workloads
• Selecting appropriate optimizations
Dynamic Optimization Systems
• Dynamo – HP Labs, PA-RISC/HPUX – runtime optimization
• Vulcan/Mojo – MS Research, x86-IA64/Win2K – desktop instrumentation, profiling and optimization
• Jalapeno – IBM Research, JVM-PPC-SMPs/AIX – Java JIT designed for research
• Latte – Seoul National University, Korea – Java JIT designed for efficient register allocation
Dynamo
[Diagram: normal execution model – application + libs (native binary) running directly on the CPU platform – versus the Dynamo execution model, where Dynamo sits between the application and the CPU]
To the application, Dynamo looks like a software interpreter that executes the same instruction set as the underlying hardware interpreter (the CPU).
* Many of these slides were provided by Evelyn Duesterwald
Elements of Dynamo
A novel performance delivery mechanism:
– Optimize the code when it executes, not when it is created
• A client-enabled performance mechanism
• Dynamic code re-layout
• Partial dynamic inlining/superblock formation
• Path-specific optimization
• Adaptive: machine- and input-specific
• Complementary to static optimization
• Transparent: requires no compiler support
Flow within Dynamo
[Diagram: the interpretation/profiling loop. Interpret until a taken branch, then look up the next PC in the trace cache. On a hit, execute from the Dynamo code cache; on a miss, check whether this is a hot start of trace. If so, the Trace Selector captures the input native instruction stream, the Trace Optimizer optimizes it, the trace is emitted, and the Trace Linker patches exit branches and recycles the counter; if not, continue interpreting.]
Traces in Dynamo
Trace = single-entry, join-free dynamic sequence of basic blocks
[Diagram: a control flow graph with blocks A–F including a call/return, its memory layout, and the corresponding trace cache layout, where the selected trace (A, B, C, E, D) is laid out contiguously, with a trampoline that exits to the interpreter and a connection to another trace]
Traces in Dynamo
• Interprocedural forward path: start-of-trace = target of backward branch; end-of-trace = taken backward branch
• 11 paths through the loop: ABCEH, ABCEHKMO, ABCEHKNO, ABCEIKMO, ABCEIKNO, ABCFJL, ABCFJLNO, ABDFJL, ABDFJLNO, ABDGJL, ABDGJLNO
[Diagram: control flow graph with blocks A through O]
Traces in Dynamo – typical path profiles
[Diagram: the same control flow graph with blocks A through O]
Approach:
• profile all edge frequencies
• select the hot trace by following the highest-frequency branch outcome
Disadvantages:
• infeasible paths: ignores branch correlation
• overhead: need to profile every conditional branch
Traces in Dynamo – Next Executing Tail Prediction
[Diagram: the same control flow graph with blocks A through O]
• Minimal profiling: profile only start-of-trace points (block A)
• Optimistic: at a hot start-of-trace, select the next executing tail as the trace
Advantages:
• very lightweight: # instrumentation points = # targets of backward branches; # counters = # targets of backward branches
• statistically likely to pick the hottest path
• selects only feasible paths
• easy to implement
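The NET heuristic above can be sketched as a small profiling loop. This is a toy Python model: the block addresses, the counter threshold, and the interpreter hooks are illustrative assumptions, not Dynamo's actual values or interfaces.

```python
# A minimal sketch of Next-Executing-Tail (NET) trace selection.
# Counters exist only for targets of backward branches, matching the
# "# counters = # targets of backward branches" property on the slide.

HOT_THRESHOLD = 50  # placeholder; Dynamo's real threshold differs

class NetSelector:
    def __init__(self):
        self.counters = {}      # start-of-trace address -> hit count
        self.trace_cache = {}   # start-of-trace address -> list of blocks

    def on_backward_branch_target(self, addr):
        """Called when interpretation reaches the target of a backward branch."""
        if addr in self.trace_cache:
            return self.trace_cache[addr]   # hit: execute the cached trace
        self.counters[addr] = self.counters.get(addr, 0) + 1
        if self.counters[addr] >= HOT_THRESHOLD:
            return "RECORD"                 # hot: record the next executing
                                            # tail until a backward branch
        return None                         # cold: keep interpreting

    def end_trace(self, addr, blocks):
        """Install the recorded tail as the trace for this start-of-trace."""
        self.trace_cache[addr] = blocks
        del self.counters[addr]             # counter can be recycled
```

Note how the optimistic part of the heuristic is that whatever executes next after the counter overflows is assumed to be the hot path; no per-branch profiling is needed.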
Trace Selection
[Plot: number of traces selected (0–90) over time for the li benchmark]
When to stop creating new traces
• Excessively high trace selection rates cause unacceptable overhead and potential thrashing in the Dynamo code cache
• We need the opportunity to amortize the cost of creating traces, so trace creation must sometimes be turned off
• “Bail out” is entered when the creation rate per unit time is excessively high
Trace Optimization
• Build a lightweight Intermediate Representation (IR): symbolic labels, extended virtual register set
• Optimization with integrated demand-driven analysis
• Scheduling and register allocation – retain previous mappings
[Pipeline diagram: list of trace blocks → lite IR → forward pass → backward pass → register allocation → linker]
Trace Optimization
Are there any runtime optimization opportunities in statically optimized code? Limitations of static compiler optimization:
• cost of call-specific interprocedural optimization
• cost of path-specific optimization in the presence of complex control flow
• difficulty of predicting past indirect branches
• lack of access to shared libraries
• sub-optimal register allocation decisions (e.g., register allocation for individual array elements or pointers)
Path-specific optimizations
Conservative optimizations (precise signal delivery, memory-safe):
• partial procedure inlining
• redundant branch removal
• constant propagation
• constant folding
• copy propagation
Aggressive optimizations:
• redundant load removal, runtime-disambiguated (guarded) load removal
• dead code elimination
• partially dead code sinking
• loop unrolling
• loop invariant hoisting
Aggressive optimization can be made memory- and signal-safe via compiler hints and de-optimization.
Dynamo Optimizations
• Constant propagation
  – Given x <- c for variable x and constant c
  – Replace all later uses of x with c, assuming that x will not be modified

Before:                  After:
  entry                    entry
  b <- 3                   b <- 3
  c <- 4 * b               c <- 4 * 3
  c > b ?  (y/n)           c > 3 ?  (y/n)
  y: d <- b + 2            y: d <- 3 + 2
  n: e <- a + b            n: e <- a + 3
  exit                     exit
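The propagation step can be sketched over straight-line assignments. The tuple IR below is a toy representation for illustration, not Dynamo's actual IR, and the branch condition is omitted for brevity.

```python
# Constant propagation over a straight-line trace, mirroring the slide's
# example. Instructions are (dest, op, arg1, arg2) tuples.

def propagate_constants(trace):
    consts = {}        # variable -> known constant value
    out = []
    for dest, op, a1, a2 in trace:
        a1 = consts.get(a1, a1)    # replace uses with known constants
        a2 = consts.get(a2, a2)
        out.append((dest, op, a1, a2))
        if op == "mov" and isinstance(a1, int):
            consts[dest] = a1      # record x <- c
        else:
            consts.pop(dest, None) # dest redefined: forget old constant
    return out

# The slide's example: b <- 3; c <- 4 * b; d <- b + 2; e <- a + b
trace = [("b", "mov", 3, None),
         ("c", "mul", 4, "b"),
         ("d", "add", "b", 2),
         ("e", "add", "a", "b")]
# After propagation: c <- 4 * 3; d <- 3 + 2; e <- a + 3
```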
Dynamo Optimizations
• Constant folding
  – Identifying that all operands in an assignment are constant after macro expansion and constant propagation
  – Easy for booleans, a little trickier for integers (exceptions such as divide by zero and overflow); for FP this can be very tricky due to multiple FP formats

Before:                  After:
  entry                    entry
  b <- 3                   b <- 3
  c <- 4 * 3               c <- 12
  c > 3 ?  (y/n)           c > 3
  y: d <- 3 + 2            e <- a + 3
  n: e <- a + 3            d <- 3 + 2
  exit                     exit
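The folding step follows directly: once all operands are constant, evaluate the operation at optimization time, declining to fold cases that would change exception behavior. This sketch reuses the same illustrative tuple IR as above; it is not Dynamo's implementation.

```python
# Constant folding: evaluate an op once all of its operands are constants.
# Integer ops must guard against exceptions (e.g., divide by zero);
# this sketch simply declines to fold those cases.

def fold(op, a1, a2):
    if not (isinstance(a1, int) and isinstance(a2, int)):
        return None                    # not all operands are constant
    if op == "add":
        return a1 + a2
    if op == "mul":
        return a1 * a2
    if op == "div":
        if a2 == 0:
            return None                # preserve the runtime exception
        return a1 // a2
    return None

def fold_constants(trace):
    out = []
    for dest, op, a1, a2 in trace:
        v = fold(op, a1, a2)
        if v is not None:
            out.append((dest, "mov", v, None))   # e.g., c <- 12, not c <- 4*3
        else:
            out.append((dest, op, a1, a2))
    return out
```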
Dynamo Optimizations
• Partial load removal – LRE paper
• Dead code elimination
  – A variable is dead if it is not used on any path from where it is defined to where the function exits
  – An instruction is dead if it computes only values that are not used on any executable path leading from the instruction
  – Dead code is often created through the application of other code optimizations (e.g., strength reduction: replacing expensive ops by less expensive ops)
• Loop invariant hoisting – moving invariant operations out of the loop body
• Fragment link-time optimizations – apply peephole optimization around links, looking for dead code removal
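Over a single trace, dead code elimination reduces to a backward liveness pass: an instruction whose result is never used later (and which has no side effects) can be dropped. The (dest, op, args) encoding below is an illustrative assumption.

```python
# Dead code elimination over a trace via a single backward liveness pass.

def eliminate_dead_code(trace, live_out):
    """live_out: variables still needed after the trace exits."""
    live = set(live_out)
    kept = []
    for dest, op, args in reversed(trace):
        if dest is not None and dest not in live and op != "call":
            continue                   # dead: result unused, no side effect
        kept.append((dest, op, args))
        live.discard(dest)             # dest is defined here, not live above
        live.update(a for a in args if isinstance(a, str))
    kept.reverse()
    return kept
```

Because the pass runs backward and dead instructions never add their operands to the live set, whole dead chains (a value used only by another dead instruction) disappear in one sweep.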
Implementation Issues
Problem: a signal arrives while executing in the code cache.
• How can we achieve transparent signal delivery?
• How can the original signal context be reconstructed?
Dynamo approach: intercept all signals. Upon arrival of a signal at code cache location L, Dynamo first gains control:
1. Save the code cache context
2. Retranslate the trace and record:
   i. any changes in register mapping up to position L
   ii. the original code addresses of L
   iii. all context-modifying optimizations and the steps for de-optimization
3. Update the code cache context to obtain the native context
4. Load the native context and execute the original signal handler
Dynamic Code Cache
Problem: How do we control the size of dynamically recompiled code? How do we react to phase changes?
Adaptive flushing-based cache management scheme:
• preemptive cache flushes
• fast allocation/de-allocation of traces
• removal of old and cold traces
• branch re-biasing to improve locality in the cache
• configurable for various performance/memory-footprint trade-offs
• code cache default size: 300 Kbytes
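The preemptive-flush idea can be sketched as follows. The policy shown (flush everything on cache pressure, on the assumption that pressure signals a phase change) is a simplification for illustration; Dynamo's actual heuristics are more refined.

```python
# A sketch of preemptive cache flushing. When the cache fills, old and
# cold traces likely belong to a finished program phase, so dropping the
# whole cache is both fast and usually harmless.

CACHE_SIZE = 300 * 1024   # Dynamo's default code cache: 300 Kbytes

class CodeCache:
    def __init__(self, size=CACHE_SIZE):
        self.size = size
        self.used = 0
        self.traces = {}

    def insert(self, start_addr, nbytes):
        if self.used + nbytes > self.size:
            self.flush()               # preemptive flush on pressure
        self.traces[start_addr] = nbytes
        self.used += nbytes

    def flush(self):
        self.traces.clear()            # fast de-allocation: drop everything
        self.used = 0
```

Flushing wholesale avoids per-trace eviction bookkeeping, at the cost of re-selecting any traces that were still hot.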
Dynamo Performance
[Bar chart: speedup over native execution, from roughly -20% to +25%, for compress, go, ijpeg, li, m88ksim, perl, vortex, deltablue, and Average, with each bar split into trace selection and path optimization contributions; one bar is labeled -1.53]
(+O2 compiled native binary running under Dynamo on a PA-8000)
Bailout
[Plot: number of traces selected (0–140) over time for go before bail-out, go after bail-out, and li]
• Bail out if the trace selection rate exceeds a tolerable threshold
Bailout
• To prevent degradation, Dynamo keeps track of the current trace selection rate
• Virtual time is recorded by counting the number of interpreted BBs needed to select N traces
• A threshold is set to judge whether a rate is "high"
• The trace selection rate is considered excessive if k consecutive high-rate time intervals have been encountered
• Bailout turns off trace selection and optimization; execution resumes in the original binary
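The bailout heuristic above can be sketched directly: virtual time advances with interpreted basic blocks, and if N traces keep arriving within too few interpreted BBs for k intervals in a row, trace selection is turned off. N, k, and the rate threshold below are illustrative parameters, not Dynamo's tuned values.

```python
# A sketch of Dynamo's bailout heuristic using virtual time
# (interpreted basic-block count) per trace-selection interval.

class BailoutMonitor:
    def __init__(self, n_traces=10, max_rate_bbs=1000, k=3):
        self.n_traces = n_traces          # traces per measurement interval
        self.max_rate_bbs = max_rate_bbs  # "high rate" if N traces arrive
                                          # within this many interpreted BBs
        self.k = k                        # consecutive high intervals to bail
        self.bbs = 0                      # virtual time within the interval
        self.traces = 0
        self.high_streak = 0
        self.bailed_out = False

    def on_interpreted_bb(self):
        self.bbs += 1

    def on_trace_selected(self):
        self.traces += 1
        if self.traces == self.n_traces:       # end of interval
            if self.bbs <= self.max_rate_bbs:  # rate too high
                self.high_streak += 1
            else:
                self.high_streak = 0
            if self.high_streak >= self.k:
                self.bailed_out = True         # resume the original binary
            self.bbs = 0
            self.traces = 0
```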
Performance speedups with bailout
[Bar chart: speedup over native execution, from roughly -10% to +25%, for compress, go, ijpeg, li, m88ksim, perl, vortex, deltablue, and Average, with each bar split into trace selection and path optimization contributions]
(+O2 compiled native binary running under Dynamo on a PA-8000)
Memory Overhead – Dynamo text
• Decode: 30%
• Interpreter: 21%
• Control: 15%
• Trace Optimization: 5% + 15%
• Trace Formation: 5%
• Memory Mgmt: 5%
• Initialization: 4%
Total size = 273 Kb
PA-RISC-dependent portion = 179 Kb (66%)
Summary of Dynamo
• Demonstrated the potential for dynamic optimization through an actual implementation
• Optimization impact tends to be program dependent
• More sophisticated bailout algorithms need to be devised
• Static compile-time hints should be used to help guide a dynamic optimization system
Vulcan – A. Srivastava
• Provides both static and dynamic code modification
• Performs optimization on x86, IA64 and MSIL binaries
• Can work in the presence of multithreading and variable length instructions (X86)
• Designed to be able to perform modifications on a remote machine using a Distributed Component Object Model (DCOM) interface
• Can also serve as a binary translator
Mojo – Dynamic Optimization using Vulcan (Chaiken & Gillies)
• Targets a desktop x86/Windows 2000 environment
• Supports large, multithreaded applications that use exception handlers
• Requires no OS support
• Allows optimization across shared library boundaries
• Can be aided by information provided by a static compiler
Mojo Structure
[Diagram: the Mojo Dispatcher at the center, connected to the Path Builder, the Path Cache, the Basic Block Cache, the exception-handling component, and the original code in NT DLLs]
1. Interrogate the Path Cache for a hit.
2. If hit, execute from the Path Cache directly; else interrogate the Basic Block Cache for a hit.
3. If hit in the BBC, execute directly; else load the block from the original code.
4. Each time control returns to the Mojo Dispatcher, BBs are checked for "hotness".
5. If a BB is hot enough, Mojo turns on path building. Once a complete path has been built and optimized, it is placed in the Path Cache.
Mojo Components
• Mojo Dispatcher
  – Is the control point in the dynamic optimization system
  – Manages execution context using its own stack space
• Basic Block Cache
  – Handles basic blocks that have not yet become hot
  – Identifies basic block boundaries by dynamically decoding instruction bytes
  – Branches are modified to pass control to the dispatcher, passing along the address of the next basic block to execute
  – Additional information is kept in the BBC that is used when constructing paths
Mojo Components
• Path Builder
  – Responsible for selecting, building, and optimizing hot paths
  – Maintains "hotness" information for basic blocks
  – Utilizes the same heuristic for building hot paths as Dynamo (next path after counter overflow)
  – Utilizes separate thresholds for back-edge targets and path-exit targets (hot side exits must be detected when constructing a dynamic path)
  – Instructions are laid out contiguously (reordered), eliminating many taken conditional branches
Mojo Components – Path Builder
• Path termination – Dynamo only terminates paths on back edges
[Diagram: original nested loops over blocks A, B, C; Dynamo's back-edge profiling terminates the path at the back edge, while Mojo's back-edge and side-exit profiling selects a longer path]
Exception Handling and Threads
• Mojo patches ntdll.dll
• Mojo captures the state of the machine before passing exceptions off to the dispatcher
• The dispatcher prevents the exception handler from polluting the Path Cache
• To handle multithreading, Mojo allocates a basic block cache per thread, but uses a shared Path Cache
• Locking mechanisms are provided so the shared Path Cache can be accessed and updated reliably
Mojo performance
[Bar chart: execution time relative to native execution (0–180) for Byte, OOPACK, drystone, life, puzzle, 8queens, bubsort, csieve, qsort, acker, and fib]
qsort, acker and fib are recursive programs
Mojo performance – SPEC2000/SPEC95
[Bar chart: execution time in seconds (0–1400) for SPEC2000/SPEC95 benchmarks, native vs. Mojo]
Mojo Execution - Windows
[Bar chart: execution time in seconds (0–120) for Winword and FoxPro, native vs. Mojo]
Comments
• For simple programs with simple control flow, Mojo shows good improvement
• For larger programs with more dynamic control flow, Mojo is overwhelmed with the amount of path creation (same problem that was encountered for Dynamo)
• A bailout strategy is needed, along with a better hot-path detection algorithm
• Future work is investigating how to use hints obtained during static compilation to aid in the dynamic optimization of the code
What is a JIT
• Just-in-Time compiler – developed to address the performance issues encountered with Java interpreters/translators
• Portability generally means lower performance; JITs attempt to bridge this gap
• JITs dynamically cache translated Java bytecodes and perform extensive optimization on the native instructions
• Given the overhead of using an OO programming model (frequent method calls), extensive exception checking, and the overhead of dynamic translation/compilation, the quality of the JIT must be high
Common JITs
• Sun Java Development Kit (Sun)
• HotSpot JIT (Sun)
• Kaffe (Transvirtual Technologies)
• Jalapeno (IBM Research)
• Latte (Seoul National University)
IBM Jalapeno JVM and JIT
• Designed specifically for servers
  – Shared-memory multiprocessor scalability
  – Manage a large number of concurrent threads
  – High availability
  – Rapid response and graceful degradation (an issue when garbage collection is involved)
• Mainly developed in Java (reliability?)
• Designed specifically for extensive dynamic optimization
The Jalapeno Adaptive Optimization System
• Translates bytecodes directly to the native ISA
• Recompilation is performed in a separate thread from the application, and thus can be done in parallel to program execution
• AOS has three components– Runtime measurement system– Controller– Recompilation system
Jalapeno AOS Architecture
[Diagram: the measurement subsystem gathers raw data from the executing code and the hardware/VM performance monitor; organizers turn raw data into formatted data in the AOS database and post events to the organizer event queue; the controller consumes this formatted profile data and places instrumentation/compilation plans on the compilation queue; compilation threads invoke the compilers (Base, Opt, …) and install new instrumented/optimized code]
Three Optimization Levels
• Level 0 – On-the-fly optimizations performed during translation (constant propagation, constant folding, dead code detection)
• Level 1 – Adds to Level 0: common subexpression elimination, redundant load elimination, aggressive inlining
• Level 2 – Adds to Level 1: flow-sensitive optimizations, array bounds check elimination
Controller model
• Decides when to recompile a method
• Decides which optimization level to use
• Measurements are used to guide the profiling strategy and select the hot methods to recompile
• An analytical model is also used that represents the costs and benefits of performing these tasks
When to recompile?
Ti = current total amount of time the program will spend executing method m
Cj = cost of recompiling method m at optimization level j
Tj = expected total amount of time the program will spend executing method m after optimization at level j
For j = 0, 1, 2, choose the j that minimizes Cj + Tj.
If Cj + Tj < Ti, the Controller recompiles at level j; otherwise it decides not to recompile.
When to recompile?
• To estimate Ti, we assume the program will run for a total time of Tf, and use profile data to indicate what percentage of the total execution time (Pm) is spent in method m (versus the rest of the program)
• We can compute Ti as:
Ti = Tf * Pm
• This is the initial estimated execution time for method m. A new Ti is computed based on an estimate of the speedup of method m.
• The above weight decays over time.
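The controller's decision can be sketched as follows. The per-level speedup table and cost values are illustrative assumptions; the real system derives them from offline measurement and decays them over time.

```python
# A sketch of the Jalapeno controller's cost/benefit decision:
# recompile method m at level j only if Cj + Tj < Ti.

SPEEDUP = {0: 1.0, 1: 2.0, 2: 3.0}    # assumed speedup of level j over base

def choose_level(Tf, Pm, cost):
    """Tf: estimated total remaining run time; Pm: fraction spent in m;
    cost: dict mapping level j -> recompilation cost Cj for method m."""
    Ti = Tf * Pm                       # time the program will spend in m
    best_j, best_total = None, Ti      # doing nothing costs Ti
    for j, Cj in cost.items():
        Tj = Ti / SPEEDUP[j]           # expected time in m if optimized at j
        if Cj + Tj < best_total:
            best_j, best_total = j, Cj + Tj
    return best_j                      # None means: do not recompile
```

For example, with Tf = 100, Pm = 0.5 (so Ti = 50) and costs {0: 100, 1: 10, 2: 40}, level 1 wins: 10 + 50/2 = 35 beats both doing nothing (50) and level 2 (40 + 50/3).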
How well does optimization work in Jalapeno?
[Bar chart: speedup over baseline (0–14) on the SPECjvm98 benchmarks for Opt0, Opt1, and Opt2]
Comments about Jalapeno
• Focused on method-granularity optimization
• Simple heuristics for predicting runtimes and benefits/costs are highly sensitive to cold vs. warm invocation of the application
• New work looks at method specific optimizations that consider additional characteristics besides just the estimated runtime
Latte
• Addresses the inefficiencies in the stack-based Java bytecode machine by efficiently mapping stack space to a RISC register file
• Since traditional register coloring is an expensive algorithm, and allocation must be done in the same space as the runtime, this system looks at other ways to get good register allocation at a reduced cost
Java Translation to Native Code
1. Identify control join points and subroutines in the bytecode using a depth-first search traversal
2. Translate bytecodes into a control flow graph, mapping program variables to a set of pseudo-registers
3. Perform traditional compiler optimizations
4. Perform register allocation
5. Convert the CFG to native host (SPARC) code
Treeregion Scheduling
• The CFG is partitioned into treeregions (single-entry, multiple-exit subgraphs shaped like trees)
• Treeregions start at the beginning of the program or at join points, and end either at the end of the program or at new join points
• Liveness analysis is performed
• Individual treeregions are scheduled using a backward sweep, followed by a forward sweep
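The partitioning rule above can be sketched directly: every join point (a block with more than one predecessor) roots a new region, and region growth stops whenever a successor is a join point. The dict-of-successor-lists CFG encoding is an assumption for illustration, not Latte's representation.

```python
# A sketch of partitioning a CFG into treeregions: single-entry,
# multiple-exit subgraphs that stop at join points.

def find_join_points(cfg):
    """A block is a join point if it has more than one predecessor."""
    preds = {}
    for b, succs in cfg.items():
        for s in succs:
            preds.setdefault(s, []).append(b)
    return {b for b, p in preds.items() if len(p) > 1}

def treeregions(cfg, entry):
    joins = find_join_points(cfg)
    regions = []
    for root in [entry] + sorted(joins - {entry}):
        region, work, seen = [], [root], set()
        while work:
            b = work.pop()
            if b in seen:
                continue
            seen.add(b)
            region.append(b)
            # a join point terminates this region and roots its own region
            work.extend(s for s in cfg.get(b, []) if s not in joins)
        regions.append(region)
    return regions
```

On a diamond CFG (A branches to B and C, which both reach D), D is a join point, so the regions are the tree {A, B, C} and the singleton {D}.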
How well does optimization work in Latte?
[Bar chart: speedup over baseline (0–5) on the SPECjvm98 benchmarks for Latte, Latte(opt), and HotSpot]
Comments on Latte
• Good register allocation can help to improve the runtime performance of a dynamically tuned Java bytecode binary
• Optimization should target hot spots in the executable
• Can provide very competitive performance compared with the Sun JDK and HotSpot compilation tools
Summary on Dynamic Optimization
• There is always a struggle to balance the costs and benefits of particular types of dynamic optimizers
• Dynamic optimizers can be workload dependent
• There exists a lot of room in Java JITs to improve upon instruction schedules and register allocation
• This is a rich area for future research on compiler and memory management studies