An introduction to virtual machines and JavaScript engine implementation.
Virtual Machine & JavaScript Engine
@nwind
(HLL) Virtual Machine
Take the red pill, and I will show you the rabbit hole.
Virtual Machine History
• Pascal 1970
• Smalltalk 1980
• Self 1986
• Python 1991
• Java 1995
• JavaScript 1995
The Smalltalk demonstration showed three amazing features. One was how computers could be networked; the second was how object-oriented programming worked. But Jobs and his team paid little attention to these attributes because they were so amazed by the third feature, ...
How Does a Virtual Machine Work?
• Parser
• Intermediate Representation (IR)
• Interpreter
• Garbage Collection
• Optimization
Parser
• Tokenize
• AST
Tokenize
var foo = 10;

keyword
space
identifier
equal
number
semicolon
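The scan above can be reproduced by a hand-rolled scanner. A minimal sketch in C (the token kinds and the single `var` keyword check are simplifications, and all names are invented for illustration):

```c
#include <ctype.h>
#include <string.h>

typedef enum { TOK_KEYWORD, TOK_IDENT, TOK_EQUAL, TOK_NUMBER, TOK_SEMI, TOK_EOF } TokKind;
typedef struct { TokKind kind; char text[32]; } Token;

/* Return the next token starting at *src, advancing the cursor past it. */
static Token next_token(const char **src) {
    const char *p = *src;
    Token t = { TOK_EOF, "" };
    while (*p == ' ') p++;                       /* consume whitespace */
    if (isalpha((unsigned char)*p)) {            /* keyword or identifier */
        size_t n = 0;
        while (isalnum((unsigned char)p[n])) n++;
        memcpy(t.text, p, n);
        t.kind = strcmp(t.text, "var") == 0 ? TOK_KEYWORD : TOK_IDENT;
        p += n;
    } else if (isdigit((unsigned char)*p)) {     /* number literal */
        size_t n = 0;
        while (isdigit((unsigned char)p[n])) n++;
        memcpy(t.text, p, n);
        t.kind = TOK_NUMBER;
        p += n;
    } else if (*p == '=') { t.kind = TOK_EQUAL; t.text[0] = *p++; }
    else   if (*p == ';') { t.kind = TOK_SEMI;  t.text[0] = *p++; }
    *src = p;
    return t;
}
```

Feeding it `var foo = 10;` yields the keyword, identifier, equal, number and semicolon tokens listed above; like most real tokenizers, it consumes the whitespace rather than emitting it.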
AST
Assign
├─ Variable foo
└─ Constant 10
{
  "type": "Program",
  "body": [
    {
      "type": "VariableDeclaration",
      "declarations": [
        {
          "id": {
            "type": "Identifier",
            "name": "foo"
          },
          "init": {
            "type": "BinaryExpression",
            "operator": "+",
            "left": {
              "type": "Identifier",
              "name": "bar"
            },
            "right": {
              "type": "Literal",
              "value": 1
            }
          }
        }
      ],
      "kind": "var"
    }
  ]
}
AST demo (Esprima)
var foo = bar + 1;
http://esprima.org/demo/parse.html
Intermediate Representation
• Bytecode
• Stack vs. register
00000: deffun 0 null
00005: nop
00006: callvar 0
00009: int8 2
00011: call 1
00014: pop
00015: stop

foo:
00020: getarg 0
00023: one
00024: add
00025: return
00026: stop
Bytecode (SpiderMonkey)
function foo(bar) {
  return bar + 1;
}
foo(2);
8 m_instructions; 168 bytes at 0x7fc1ba3070e0; 1 parameter(s); 10 callee register(s)
[  0] enter
[  1] mov                r0, undefined(@k0)
[  4] get_global_var     r1, 5
[  7] mov                r2, undefined(@k0)
[ 10] mov                r3, 2(@k1)
[ 13] call               r1, 2, 10
[ 17] op_call_put_result r0
[ 19] end                r0
Constants:
  k0 = undefined
  k1 = 2
3 m_instructions; 64 bytes at 0x7fc1ba306e80; 2 parameter(s); 1 callee register(s)
[ 0] enter
[ 1] add   r0, r-7, 1(@k0)
[ 6] ret   r0
Constants: k0 = 1
End: 3
Bytecode (JSC)
function foo(bar) {
  return bar + 1;
}
foo(2);
Stack vs. register
• Stack
• JVM, .NET, PHP, Python, older JavaScript engines
• Register
• Lua, Dalvik, all modern JavaScript engines
• Smaller, Faster (about 30%)
• RISC
local a,t,i   1: LOADNIL  0 2 0
a=a+i         2: ADD      0 0 2
a=a+1         3: ADD      0 0 250 ; a
a=t[i]        4: GETTABLE 0 1 2
Stack vs. register
local a,t,i   1: PUSHNIL 3
a=a+i         2: GETLOCAL 0   ; a
              3: GETLOCAL 2   ; i
              4: ADD
              5: SETLOCAL 0   ; a
a=a+1         6: GETLOCAL 0   ; a
              7: ADDI 1
              8: SETLOCAL 0   ; a
a=t[i]        9: GETLOCAL 1   ; t
             10: GETINDEXED 2 ; i
             11: SETLOCAL 0   ; a
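The dispatch-count difference is easy to see in a toy interpreter. A sketch in C of `a = a + i` on both machine styles (opcode names and encodings are invented for illustration): the stack machine takes four dispatches where the register machine takes one, which is where much of the ~30% speedup comes from.

```c
/* Stack machine: a = a + i is GET a, GET i, ADD, SET a (4 dispatches). */
enum { S_GET, S_ADD, S_SET, S_HALT };
static void run_stack(const int *code, int *locals) {
    int stack[8], sp = 0, pc = 0;
    for (;;) {
        switch (code[pc++]) {
        case S_GET:  stack[sp++] = locals[code[pc++]]; break;  /* push local */
        case S_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
        case S_SET:  locals[code[pc++]] = stack[--sp]; break;  /* pop to local */
        case S_HALT: return;
        }
    }
}

/* Register machine: the same statement is one ADD dst, a, b (1 dispatch). */
enum { R_ADD, R_HALT };
static void run_reg(const int *code, int *regs) {
    int pc = 0;
    for (;;) {
        switch (code[pc++]) {
        case R_ADD: {
            int d = code[pc++], a = code[pc++], b = code[pc++];
            regs[d] = regs[a] + regs[b];
            break;
        }
        case R_HALT: return;
        }
    }
}
```

The register instruction is wider (three operand slots), so register bytecode trades slightly larger instructions for far fewer trips through the dispatch loop.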
Interpreter
• Switch statement
• Direct threading, Indirect threading, Token threading ...
while (true) {
    switch (opcode) {
        case ADD:
            ...
            break;
        case SUB:
            ...
            break;
        ...
    }
}
Switch statement
mov  %edx,0xffffffffffffffe4(%rbp)
cmpl $0x1,0xffffffffffffffe4(%rbp)
je   6e <interpret+0x6e>
cmpl $0x1,0xffffffffffffffe4(%rbp)
jb   4a <interpret+0x4a>
cmpl $0x2,0xffffffffffffffe4(%rbp)
je   93 <interpret+0x93>
jmp  22 <interpret+0x22>
...
typedef void *Inst;
Inst program[] = { &&ADD, &&SUB };
Inst *ip = program;
goto *ip++;
ADD:
    ...
    goto *ip++;

SUB:
    ...
    goto *ip++;
Direct threading

mov  0xffffffffffffffe8(%rbp),%rdx
lea  0xffffffffffffffe8(%rbp),%rax
addq $0x8,(%rax)
mov  %rdx,0xffffffffffffffd8(%rbp)
jmpq *0xffffffffffffffd8(%rbp)
ADD:
    ...
    mov  0xffffffffffffffe8(%rbp),%rdx
    lea  0xffffffffffffffe8(%rbp),%rax
    addq $0x8,(%rax)
    mov  %rdx,0xffffffffffffffd8(%rbp)
    jmp  2c <interpreter+0x2c>
http://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html
Garbage Collection
• Reference counting (php, python ...), smart pointer
• Tracing
• Stop the world
• Copying, Mark-and-sweep, Mark-and-compact
• Generational GC
• Precise vs. conservative
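Reference counting, the first scheme on the list, fits in a few lines of C. A minimal sketch (the object layout and `finalize` hook are invented; note that plain reference counting cannot collect cycles, which is one reason tracing collectors dominate in JavaScript engines):

```c
#include <stdlib.h>

typedef struct RCObject {
    int refcount;
    void (*finalize)(struct RCObject *);  /* optional destructor hook */
} RCObject;

/* Allocate an object; its creator holds the first reference. */
static RCObject *rc_new(void (*finalize)(RCObject *)) {
    RCObject *o = malloc(sizeof *o);
    o->refcount = 1;
    o->finalize = finalize;
    return o;
}

static void rc_retain(RCObject *o) { o->refcount++; }

/* Drop one reference; returns 1 when the object was actually freed. */
static int rc_release(RCObject *o) {
    if (--o->refcount > 0) return 0;
    if (o->finalize) o->finalize(o);
    free(o);
    return 1;
}
```

Every assignment of a managed value pairs a `rc_retain` with a later `rc_release`, which is exactly the bookkeeping PHP and CPython do on each refcounted write.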
Precise vs. conservative
• Conservative
• If it looks like a pointer, treat it as a pointer
• Might leak memory
• Can't move objects, causing memory fragmentation
• Precise
• Indirect vs. direct reference
It is time for the DARK Magic
Optimization Magic
• Interpreter optimization
• Compiler optimization
• JIT
• Type inference
• Hidden Type
• Method inline, PICs
Interpreter optimization
Switch dispatch is inefficient. Why?
CPU Pipeline
• Fetch, Decode, Execute, Write-back
• Branch prediction
http://en.wikipedia.org/wiki/File:Pipeline,_4_stage.svg
ICONST_1_START: *sp++ = 1; ICONST_1_END: goto **(pc++);
INEG_START: sp[-1] = -sp[-1]; INEG_END: goto **(pc++);
DISPATCH_START: goto **(pc++); DISPATCH_END: ;
size_t iconst_size   = (&&ICONST_1_END - &&ICONST_1_START);
size_t ineg_size     = (&&INEG_END - &&INEG_START);
size_t dispatch_size = (&&DISPATCH_END - &&DISPATCH_START);
void *buf = malloc(iconst_size + ineg_size + dispatch_size);
void *current = buf;
memcpy(current, &&ICONST_1_START, iconst_size);
current += iconst_size;
memcpy(current, &&INEG_START, ineg_size);
current += ineg_size;
memcpy(current, &&DISPATCH_START, dispatch_size);
...
goto **buf;
Solution: Inline Threading
Interpreter? JIT!
Compiler optimization
Compiler optimization
• SSA
• Data-flow
• Control-flow
• Loop
• ...
http://www.oracle.com/us/technologies/java/java7-renaissance-vm-428200.pdf | © 2011 Oracle Corporation
What a JVM can do...

compiler tactics
• delayed compilation
• tiered compilation
• on-stack replacement
• delayed reoptimization
• program dependence graph representation
• static single assignment representation

proof-based techniques
• exact type inference
• memory value inference
• memory value tracking
• constant folding
• reassociation
• operator strength reduction
• null check elimination
• type test strength reduction
• type test elimination
• algebraic simplification
• common subexpression elimination
• integer range typing

flow-sensitive rewrites
• conditional constant propagation
• dominating test detection
• flow-carried type narrowing
• dead code elimination

language-specific techniques
• class hierarchy analysis
• devirtualization
• symbolic constant propagation
• autobox elimination
• escape analysis
• lock elision
• lock fusion
• de-reflection

speculative (profile-based) techniques
• optimistic nullness assertions
• optimistic type assertions
• optimistic type strengthening
• optimistic array length strengthening
• untaken branch pruning
• optimistic N-morphic inlining
• branch frequency prediction
• call frequency prediction

memory and placement transformation
• expression hoisting
• expression sinking
• redundant store elimination
• adjacent store fusion
• card-mark elimination
• merge-point splitting

loop transformations
• loop unrolling
• loop peeling
• safepoint elimination
• iteration range splitting
• range check elimination
• loop vectorization

global code shaping
• inlining (graph integration)
• global code motion
• heat-based code layout
• switch balancing
• throw inlining

control flow graph transformation
• local code scheduling
• local code bundling
• delay slot filling
• graph-coloring register allocation
• linear scan register allocation
• live range splitting
• copy coalescing
• constant splitting
• copy removal
• address mode matching
• instruction peepholing
• DFA-based code generator
Thursday, July 7, 2011
Just-In-Time (JIT)
JIT
• Method JIT, Trace JIT, Regular expression JIT
• Code generation
• Register allocation
How does a JIT work?
• mmap/new/malloc (mprotect)
• Generate native code
• C cast / reinterpret_cast
• Call the function
asm (
".text\n"
".globl " SYMBOL_STRING(ctiTrampoline) "\n"
HIDE_SYMBOL(ctiTrampoline) "\n"
SYMBOL_STRING(ctiTrampoline) ":" "\n"
    "pushl %ebp" "\n"
    "movl %esp, %ebp" "\n"
    "pushl %esi" "\n"
    "pushl %edi" "\n"
    "pushl %ebx" "\n"
    "subl $0x3c, %esp" "\n"
    "movl $512, %esi" "\n"
    "movl 0x58(%esp), %edi" "\n"
    "call *0x50(%esp)" "\n"
    "addl $0x3c, %esp" "\n"
    "popl %ebx" "\n"
    "popl %edi" "\n"
    "popl %esi" "\n"
    "popl %ebp" "\n"
    "ret" "\n"
);
Trampoline (JSC x86)
// Execute the code!
inline JSValue execute(RegisterFile* registerFile, CallFrame* callFrame, JSGlobalData* globalData)
{
    JSValue result = JSValue::decode(
        ctiTrampoline(
            m_ref.m_code.executableAddress(),
            registerFile,
            callFrame,
            0,
            Profiler::enabledProfilerReference(),
            globalData));
    return globalData->exception ? jsNull() : result;
}
Register allocation
• Linear scan
• Graph coloring
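Linear scan is the simpler of the two and what most JIT tiers use. A compact sketch in C over precomputed live intervals, with two registers and a spill-the-interval-that-ends-last heuristic (the interval format and constants are invented; real allocators also handle fixed registers, interval splitting, and spill slots):

```c
#define NREGS   2
#define SPILLED -1

typedef struct { int start, end, reg; } Interval;

/* Assign each interval (pre-sorted by start) a register, or SPILLED. */
static void linear_scan(Interval *iv, int n) {
    Interval *active[NREGS];
    int nactive = 0;
    for (int i = 0; i < n; i++) {
        /* expire intervals that ended before this one starts */
        int k = 0;
        for (int j = 0; j < nactive; j++)
            if (active[j]->end >= iv[i].start) active[k++] = active[j];
        nactive = k;

        if (nactive < NREGS) {
            /* a register is free: take the lowest-numbered unused one */
            int used[NREGS] = { 0 };
            for (int j = 0; j < nactive; j++) used[active[j]->reg] = 1;
            for (int r = 0; r < NREGS; r++)
                if (!used[r]) { iv[i].reg = r; break; }
            active[nactive++] = &iv[i];
        } else {
            /* all registers busy: spill whichever interval ends last */
            Interval *victim = &iv[i];
            for (int j = 0; j < nactive; j++)
                if (active[j]->end > victim->end) victim = active[j];
            if (victim != &iv[i]) {
                iv[i].reg = victim->reg;
                victim->reg = SPILLED;
                for (int j = 0; j < nactive; j++)
                    if (active[j] == victim) active[j] = &iv[i];
            } else {
                iv[i].reg = SPILLED;
            }
        }
    }
}
```

One pass over the sorted intervals, no interference graph: that is why V8's Lithium and many method JITs pick linear scan over graph coloring despite the occasionally worse assignment.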
Code generation
• Pipelining
• SIMD (SSE2, SSE3 ...)
• Debug
Type inference
a + b
Property access
“foo.bar”
00001f63  movl  %ecx,0x04(%edx)
foo.bar in C
__ZN2v88internal7HashMap6LookupEPvjb:
00000338  pushl %ebp
00000339  pushl %ebx
0000033a  pushl %edi
0000033b  pushl %esi
0000033c  subl  $0x0c,%esp
0000033f  movl  0x20(%esp),%esi
00000343  movl  0x08(%esi),%eax
00000346  movl  0x0c(%esi),%ecx
00000349  imull $0x0c,%ecx,%edi
0000034c  leal  0xff(%ecx),%ecx
0000034f  addl  %eax,%edi
00000351  movl  0x28(%esp),%ebx
00000355  andl  %ebx,%ecx
00000357  imull $0x0c,%ecx,%ebp
0000035a  addl  %eax,%ebp
0000035c  jmp   0x0000036a
0000035e  nop
00000360  addl  $0x0c,%ebp
00000363  cmpl  %edi,%ebp
00000365  jb    0x0000036a
00000367  movl  0x08(%esi),%ebp
0000036a  movl  0x00(%ebp),%eax
0000036d  testl %eax,%eax
0000036f  je    0x0000038b
00000371  cmpl  %ebx,0x08(%ebp)
00000374  jne   0x00000360
00000376  movl  %eax,0x04(%esp)
0000037a  movl  0x24(%esp),%eax
0000037e  movl  %eax,(%esp)
00000381  call  *0x04(%esi)
00000384  testb %al,%al
00000386  je    0x00000360
00000388  movl  0x00(%ebp),%eax
0000038b  testl %eax,%eax
0000038d  jne   0x00000418
00000393  cmpb  $0x00,0x2c(%esp)
00000398  jne   0x0000039e
0000039a  xorl  %ebp,%ebp
0000039c  jmp   0x00000418
0000039e  movl  0x24(%esp),%eax
000003a2  movl  %eax,0x00(%ebp)
000003a5  movl  $0x00000000,0x04(%ebp)
000003ac  movl  %ebx,0x08(%ebp)
000003af  movl  0x10(%esi),%eax
000003b2  leal  0x01(%eax),%ecx
000003b5  movl  %ecx,0x10(%esi)
000003b8  shrl  $0x02,%ecx
000003bb  leal  0x01(%ecx,%eax),%eax
... 27 lines more
foo.bar in JavaScript
__ZN2v88internal7HashMap6LookupEPvjb
means:
v8::internal::HashMap::Lookup(void*, unsigned int, bool)
How to optimize?
Hidden Type

add property x, then add property y
http://code.google.com/apis/v8/design.html
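The transition chain sketches out in a few lines of C: every object points to a shared shape, and adding a property either follows a cached transition or creates a new shape (one cached transition per shape and invented names throughout; V8 keeps a full transition table, and shapes here are deliberately never freed since they are shared metadata):

```c
#include <stdlib.h>
#include <string.h>

typedef struct Shape {
    const char *prop;          /* property added by this transition */
    int offset;                /* slot index of that property */
    struct Shape *parent;
    struct Shape *transition;  /* one cached outgoing transition */
} Shape;

typedef struct {
    Shape *shape;
    double slots[8];
} Obj;

static Shape empty_shape;      /* the root shape: no properties */

static void obj_init(Obj *o) { o->shape = &empty_shape; }

static void obj_add(Obj *o, const char *prop, double v) {
    Shape *s = o->shape;
    if (s->transition && strcmp(s->transition->prop, prop) == 0) {
        o->shape = s->transition;           /* fast path: reuse existing shape */
    } else {
        Shape *t = calloc(1, sizeof *t);    /* slow path: create a new shape */
        t->prop = prop;
        t->offset = (s == &empty_shape) ? 0 : s->offset + 1;
        t->parent = s;
        s->transition = t;
        o->shape = t;
    }
    o->slots[o->shape->offset] = v;
}
```

Two objects that add x and then y in the same order end up with the same shape pointer, which is what lets property reads compile down to the single `movl` shown earlier.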
But nothing is perfect
One secret of V8 hidden classes
http://jsperf.com/test-v8-delete
20x slower!
But properties are rarely deleted
Figure 3 gives the average size of the code functions occurring in the JavaScript program source. These seem fairly consistent across sites. More interestingly, Figure 4 shows the number of events per function, which roughly corresponds to the number of bytecodes evaluated by the interpreter (note that some low-level bytecodes such as branches and arithmetic are not recorded in the trace). It is interesting to note that the median is fairly high, around 20 events. This suggests that, in contrast to Java, there are fewer short methods (e.g. accessors) in JavaScript and thus possibly fewer opportunities to benefit from inlining optimizations.
5.2 Instruction Mix

The instruction mix of JavaScript programs is also fairly traditional: more read operations are expected than write operations. As shown in Figure 5, reads are far more common than writes: over all traces the proportion of reads to writes is 6 to 1. Deletes comprise only .1% of all events. That graph further breaks reads, writes and deletes into various specific types; prop refers to accesses
Figure 5. Instruction mix. The per-site proportion of read, write, delete, call instructions (averaged over multiple traces).
Figure 6. Prototype chain length. The per-site quartile and maximum prototype chain lengths.
using dot notation (e.g. x.f), hash refers to access using indexing notation (e.g. x[s]), indx refers to accesses using indexing notation with a numeric argument. The overall number of calls is high, 20%, as the interpreter does not perform any inlining. Exception handling is rather infrequent with a grand total of 1,328 throws over 478 million trace events. There are some outliers such as ISHK, WORD and DIGG where updates are a much smaller proportion of operations (and influenced by the sheer number of objects in these sites), but otherwise the traces are consistent.
5.3 Prototype Chains

One higher-level metric is the length of an object’s prototype chain, which is the number of prototype objects that may potentially be traversed in order to find an object’s inherited property. This is roughly comparable to metrics of the depth of class hierarchies in class-based languages, such as the Depth of Inheritance (DIT) metric discussed in [23]. Studies of C++ programs mention a maximum DIT of 8 and a median of 1, whereas Smalltalk has a median of 3 and maximum of 10. Figure 6 shows that in all but four sites, the median prototype chain length is 1. Note that we start our graph at chain length 1, the minimum. All objects except Object.prototype have at least one prototype, which if unspecified, defaults to the Object.prototype. The maximum observed prototype chain length is 10. The majority of sites do not seem to use prototypes for code reuse, but this is possibly explained by the existence of other ways to achieve code reuse in JavaScript (i.e., the ability to assign closures directly into a field of an object). The programs that do utilize prototypes have similar inheritance properties to Java [23].
5.4 Object Kinds

Figure 7 breaks down the kinds of objects allocated at run-time into a number of categories. There are a number of frequently used built-in data types: dates (Date), regular expressions (RegExp), document and layout objects (DOM), arrays (Array) and runtime errors. The remaining objects are separated into four groups: anonymous objects, instances, functions, and prototypes. Anonymous objects are constructed with an object literal using the {...} notation, while instances are constructed by calls of the form new C(...). A function object is created for every function expression evaluated by the interpreter and a prototype object is automatically added to every function in case it is used as a constructor. Over all sites and traces, arrays account for 31% of objects allocated. Dates and DOM objects come next with 12% and 14%, respectively. Functions, prototypes, and instances each account for 10% of the allocated objects, and finally anonymous objects account for
Figure 7. Kinds of allocated objects. The per-site proportion of runtime object kinds (averaged over multiple traces).
Only 0.1% are deletes
An Analysis of the Dynamic Behavior of JavaScript Programs
Optimize method call
function foo(bar) {
  return bar.pro();
}
bar can be anything
Adaptive Optimization for Self
Polymorphic inline cache
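The mechanism can be sketched in C with a monomorphic cache, the one-entry special case of a PIC: each property-access site remembers the last shape it saw and the slot offset it found, so repeated accesses to same-shaped objects skip the slow lookup (shape layout and names are invented for illustration; a polymorphic cache simply holds several shape/offset pairs):

```c
#include <string.h>

typedef struct { const char *props[4]; } Shape;  /* props[i] names slot i */
typedef struct { Shape *shape; int slots[4]; } Obj;

typedef struct {
    Shape *cached_shape;   /* shape seen last time at this access site */
    int    cached_offset;
    long   hits, misses;
} InlineCache;

static int slow_lookup(const Shape *s, const char *name) {
    for (int i = 0; i < 4; i++)
        if (s->props[i] && strcmp(s->props[i], name) == 0)
            return i;
    return -1;
}

/* The "compiled code" for one particular o.name access site.
   Assumes the property exists on the object's shape. */
static int ic_get(InlineCache *ic, const Obj *o, const char *name) {
    if (ic->cached_shape == o->shape) {    /* fast path: shape check + load */
        ic->hits++;
        return o->slots[ic->cached_offset];
    }
    ic->misses++;                          /* slow path: look up, then cache */
    ic->cached_offset = slow_lookup(o->shape, name);
    ic->cached_shape  = o->shape;
    return o->slots[ic->cached_offset];
}
```

As long as every object reaching the site shares one shape, the access is a pointer compare plus an indexed load, which is how `bar.pro()` avoids a hash lookup per call.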
Tagged pointer
typedef union {
  void *p;
  double d;
  long l;
} Value;

typedef struct {
  unsigned char type;
  Value value;
} Object;
Object a;
Tagged pointer
sizeof(a)? If everything is an object, the overhead is too high for small integers.
Tagged pointer
In almost all systems, pointer addresses are aligned (to 4 or 8 bytes)
http://www.gnu.org/s/libc/manual/html_node/Aligned-Memory-Blocks.html
“The address of a block returned by malloc or realloc in the GNU system is always a multiple of eight (or sixteen on 64-bit systems). ”
Tagged pointer
Example: 0xc00ab958. An aligned pointer's last 2 or 3 bits must be 0.
…1000 → Pointer (low bits are 0)
…1001 → Small number (low bit is 1)
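Those guaranteed-zero low bits make the encoding almost free. A sketch in C of the small-integer ("smi") scheme, where a low bit of 1 marks an integer and an aligned pointer is stored unchanged (names invented; the arithmetic right shift of negative values is technically implementation-defined in C, though universal in practice):

```c
#include <stdint.h>

typedef uintptr_t Value;

/* Small integer: value shifted left one bit, low bit set as the tag. */
static Value    box_smi(intptr_t n)  { return ((uintptr_t)n << 1) | 1; }
static intptr_t unbox_smi(Value v)   { return (intptr_t)v >> 1; }
static int      is_smi(Value v)      { return (int)(v & 1); }

/* Pointer: stored as-is; alignment guarantees the low bits are 0. */
static Value    box_ptr(void *p)     { return (uintptr_t)p; }
static void    *unbox_ptr(Value v)   { return (void *)v; }
```

A type check is a single AND, and smi arithmetic can often be done directly on the tagged form, which is why V8 keeps 31-bit smis distinct from heap numbers.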
How about double?
* The top 16 bits denote the type of the encoded JSValue:
*
*   Pointer  { 0000:PPPP:PPPP:PPPP
*            / 0001:****:****:****
*   Double   { ...
*            \ FFFE:****:****:****
*   Integer  { FFFF:0000:IIII:IIII
NaN-tagging (JSC 64-bit)

On 64-bit systems only 48 bits of the address space are used, which means the top 16 bits of a pointer are always 0.
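The layout above can be reproduced in portable C: double bit patterns are shifted up by 2^48 so their top 16 bits land in the 0001–FFFE range, integers live under the FFFF tag, and a raw 48-bit pointer keeps its zero top bits. A sketch of the scheme (cell types and error handling omitted; names invented):

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t JSVal;

#define TAG_INT       0xFFFF000000000000ULL
#define DOUBLE_OFFSET 0x0001000000000000ULL  /* pushes doubles out of the pointer range */

static JSVal   box_int(int32_t i) { return TAG_INT | (uint32_t)i; }
static int32_t unbox_int(JSVal v) { return (int32_t)v; }

static JSVal box_double(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);   /* type-pun without aliasing issues */
    return bits + DOUBLE_OFFSET;
}
static double unbox_double(JSVal v) {
    uint64_t bits = v - DOUBLE_OFFSET;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

static int is_int(JSVal v)     { return (v & TAG_INT) == TAG_INT; }
static int is_pointer(JSVal v) { return (v >> 48) == 0; }  /* top 16 bits zero */
```

Every value, pointer or not, stays in one 64-bit word, and checking the type is a mask or a shift rather than a load through an Object header.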
V8
V8
• Lars Bak
• Hidden Class, PICs
• Built-in objects written in JavaScript
• Crankshaft
• Precise generational GC
Lars Bak
• Implementing VMs since 1988
• Beta
• Self
• HotSpot
Source code → Native Code

Source code → High-Level IR → Low-Level IR → Opt → Native Code } Crankshaft

HotSpot client compiler
Crankshaft
• Profiling
• Compiler optimization
• On-stack replacement
• Deoptimize
High-Level IR (Hydrogen)
• Function inlining
• Type inference
• Stack-check elimination
• Loop-invariant code motion
• Common subexpression elimination
• ...
http://wingolog.org/archives/2011/08/02/a-closer-look-at-crankshaft-v8s-optimizing-compiler
Low-Level IR (Lithium)
• Linear-scan register allocation
• Code generation
• Lazy deoptimization
http://wingolog.org/archives/2011/09/05/from-ssa-to-native-code-v8s-lithium-language
function ArraySort(comparefn) {
  if (IS_NULL_OR_UNDEFINED(this) && !IS_UNDETECTABLE(this)) {
    throw MakeTypeError("called_on_null_or_undefined",
                        ["Array.prototype.sort"]);
  }

  // In-place QuickSort algorithm.
  // For short (length <= 22) arrays, insertion sort is used for efficiency.

  if (!IS_SPEC_FUNCTION(comparefn)) {
    comparefn = function (x, y) {
      if (x === y) return 0;
      if (%_IsSmi(x) && %_IsSmi(y)) {
        return %SmiLexicographicCompare(x, y);
      }
      x = ToString(x);
      y = ToString(y);
      if (x == y) return 0;
      else return x < y ? -1 : 1;
    };
  }
  ...
Built-in objects written in JS
v8/src/array.js
GC
V8 performance
Can V8 be faster?
Dart
• Clear syntax, optional types, libraries
• Performance
• Can compile to JavaScript
• But IE, WebKit and Mozilla rejected it
• What do you think?
• My thought: Will XML replace HTML? No, but thanks, Google, for pushing the web forward
Embed V8
Embed
v8::Handle<v8::Value> Print(const v8::Arguments& args) {
  for (int i = 0; i < args.Length(); i++) {
    v8::HandleScope handle_scope;
    v8::String::Utf8Value str(args[i]);
    const char* cstr = ToCString(str);
    printf("%s", cstr);
  }
  return v8::Undefined();
}
v8::Handle<v8::ObjectTemplate> global = v8::ObjectTemplate::New();
global->Set(v8::String::New("print"),
            v8::FunctionTemplate::New(Print));
Expose Function
Node.JS
• Pros
• Async
• One language for everything
• Faster than PHP, Python
• Community
• Cons
• Lack of great libraries
• ES5 code hard to maintain
• Still too young
JavaScriptCore (Nitro)
Where does it come from?
1997 Macworld
“Apple has decided to make Internet Explorer its default browser on the Macintosh.”
“Since we believe in choice, we're going to be shipping other Internet browsers...”
Steve Jobs
JavaScriptCore History
• 2001 KJS (kde-2.2)
• Bison
• AST interpreter
• 2008 SquirrelFish
• Bytecode (register-based)
• Direct threading
• 2008 SquirrelFish Extreme
• PICs
• method JIT
• regular expression JIT
• DFG JIT (March 2011)
AST → Bytecode → Interpreter / Method JIT / DFG JIT (SSA)
SpiderMonkey
Monkeys
• SpiderMonkey
• Written by Brendan Eich
• interpreter
• TraceMonkey
• trace JIT
• removed
• JägerMonkey
• PICs
• method JIT (from JSC)
• IonMonkey
• Type Inference
• Compiler optimization
IonMonkey
• SSA
• function inline
• linear-scan register allocation
• dead code elimination
• loop-invariant code motion
• ...
Chakra (IE9)
Chakra
• Interpreter/JIT
• Type System (hidden class)
• PICs
• Delayed parsing
• Uses UTF-8 internally
Unlocking the JavaScript Opportunity with Internet Explorer 9
Carakan (Opera)
Carakan
• Register VM
• Method JIT, Regex JIT
• Hidden type
• Function inline
Rhino and JVM
Rhino is SLOW, why?
Because JVM is slow?
The JVM didn't support dynamic languages well
Solution: invokedynamic
Before: Caller → some tricks → Method (hard for the JVM to optimize)

After: Caller → invokedynamic → method handle → Method
One ring to rule them all?
Rhino + invokedynamic
• Pros
• Easier to implement
• Lots of great Java Libraries
• JVM optimization for free
• Cons
• Only in JVM7
• Not fully optimized yet
• Hard to beat V8
Compiler optimization is HARD
Is there an easy way?
LLVM
LLVM
• Clang, VMKit, GHC, PyPy, Rubinius ...
• DragonEgg: replace GCC back-end
• IR
• Optimization
• Linking, code generation, JIT
• Apple
LLVM simplification
int foo(int bar) {
  int one = 1;
  return bar + one;
}

int main() {
  foo(3);
}
define i32 @foo(i32 %bar) nounwind ssp {
entry:
  %bar_addr = alloca i32, align 4
  %retval = alloca i32
  %0 = alloca i32
  %one = alloca i32
  %"alloca point" = bitcast i32 0 to i32
  store i32 %bar, i32* %bar_addr
  store i32 1, i32* %one, align 4
  %1 = load i32* %bar_addr, align 4
  %2 = load i32* %one, align 4
  %3 = add nsw i32 %1, %2
  store i32 %3, i32* %0, align 4
  %4 = load i32* %0, align 4
  store i32 %4, i32* %retval, align 4
  br label %return

return:
  %retval1 = load i32* %retval
  ret i32 %retval1
}

define i32 @main() nounwind ssp {
entry:
  %retval = alloca i32
  %"alloca point" = bitcast i32 0 to i32
  %0 = call i32 @foo(i32 3) nounwind ssp
  br label %return

return:
  %retval1 = load i32* %retval
  ret i32 %retval1
}
define i32 @foo(i32 %bar) nounwind readnone ssp {
entry:
  %0 = add nsw i32 %bar, 1
  ret i32 %0
}

define i32 @main() nounwind readnone ssp {
entry:
  ret i32 undef
}
Optimization
Optimization (70+)
http://llvm.org/docs/Passes.html
LLVM backend
Figure 4: LLVM system architecture diagram
code in non-conforming languages is executed as “unmanaged code”. Such code is represented in native form and not in the CLI intermediate representation, so it is not exposed to CLI optimizations. These systems do not provide #2 with #1 or #3 because runtime optimization is generally only possible when using JIT code generation. They do not aim to provide #4, and instead provide a rich runtime framework for languages that match their runtime and object model, e.g., Java and C#. Omniware [1] provides #5 and most of the benefits of #2 (because, like LLVM, it uses a low-level representation that permits extensive static optimization), but at the cost of not providing information for high-level analysis and optimization (i.e., #1). It does not aim to provide #3 or #4.
• Transparent binary runtime optimization systems like Dynamo and the runtime optimizers in Transmeta processors provide benefits #2, #4 and #5, but they do not provide #1. They provide benefit #3 only at runtime, and only to a limited extent because they work only on native binary code, limiting the optimizations they can perform.
• Profile Guided Optimization for static languages provides benefit #3 at the cost of not being transparent (they require a multi-phase compilation process). Additionally, PGO suffers from three problems: (1) Empirically, developers are unlikely to use PGO, except when compiling benchmarks. (2) When PGO is used, the application is tuned to the behavior of the training run. If the training run is not representative of the end-user’s usage patterns, performance may not improve and may even be hurt by the profile-driven optimization. (3) The profiling information is completely static, meaning that the compiler cannot make use of phase behavior in the program or adapt to changing usage patterns.
There are also significant limitations of the LLVM strategy. First, language-specific optimizations must be performed in the front-end before generating LLVM code. LLVM is not designed to represent source-language types or features directly. Second, it is an open question whether languages requiring sophisticated runtime systems such as Java can benefit directly from LLVM. We are currently exploring the potential benefits of implementing higher-level virtual machines such as JVM or CLI on top of LLVM.
The subsections below describe the key components of the LLVM compiler architecture, emphasizing design and implementation features that make the capabilities above practical and efficient.
3.2 Compile-Time: External front-end & static optimizer
External static LLVM compilers (referred to as front-ends) translate source-language programs into the LLVM virtual instruction set. Each static compiler can perform three key tasks, of which the first and third are optional: (1) Perform language-specific optimizations, e.g., optimizing closures in languages with higher-order functions. (2) Translate source programs to LLVM code, synthesizing as much useful LLVM type information as possible, especially to expose pointers, structures, and arrays. (3) Invoke LLVM passes for global or interprocedural optimizations at the module level. The LLVM optimizations are built into libraries, making it easy for front-ends to use them.
The front-end does not have to perform SSA construction. Instead, variables can be allocated on the stack (which is not in SSA form), and the LLVM stack promotion and scalar expansion passes can be used to build SSA form effectively. Stack promotion converts stack-allocated scalar values to SSA registers if their address does not escape the current function, inserting φ functions as necessary to preserve SSA form. Scalar expansion precedes this and expands local structures to scalars wherever possible, so that their fields can be mapped to SSA registers as well.
Note that many “high-level” optimizations are not really language-dependent, and are often special cases of more general optimizations that may be performed on LLVM code. For example, both virtual function resolution for object-oriented languages (described in Section 4.1.2) and tail-recursion elimination which is crucial for functional languages can be done in LLVM. In such cases, it is better to extend the LLVM optimizer to perform the transformation, rather than investing effort in code which only benefits a particular front-end. This also allows the optimizations to be performed throughout the lifetime of the program.
3.3 Linker & Interprocedural Optimizer

Link time is the first phase of the compilation process where most⁷ of the program is available for analysis and transformation. As such, link-time is a natural place to perform aggressive interprocedural optimizations across the entire program. The link-time optimizations in LLVM operate on the LLVM representation directly, taking advantage of the semantic information it contains. LLVM currently includes a number of interprocedural analyses, such as a context-sensitive points-to analysis (Data Structure Analysis [31]), call graph construction, and Mod/Ref analysis, and interprocedural transformations like inlining, dead global elimination, dead argument elimination, dead type elimination, constant propagation, array bounds check elimination [28], simple structure field reordering, and Auto-
⁷ Note that shared libraries and system libraries may not be available for analysis at link time, or may be compiled directly to native code.
Proceedings of the International Symposium on Code Generation and Optimization (CGO’04) 0-7695-2102-9/04 $ 20.00 © 2004 IEEE
LLVM on JavaScript
Emscripten
• C/C++ to LLVM IR
• LLVM IR to JavaScript
• Run on browser
define i32 @foo(i32 %bar) nounwind readnone ssp {
entry:
  %0 = add nsw i32 %bar, 1
  ret i32 %0
}

define i32 @main() nounwind readnone ssp {
entry:
  ret i32 undef
}
...
function _foo($bar) {
  var __label__;
  var $0 = ((($bar) + 1) | 0);
  return $0;
}

function _main() {
  var __label__;
  return undef;
}
Module["_main"] = _main;
...
Emscripten demo
• Python, Ruby, Lua virtual machine (http://repl.it/)
• OpenJPEG
• Poppler
• FreeType
• ...
https://github.com/kripken/emscripten/wiki
Performance? Good enough!
benchmark            SM     V8     gcc    ratio
fannkuch (10)        1.158  0.931  0.231  4.04
fasta (2100000)      1.115  1.128  0.452  2.47
primes               1.443  3.194  0.438  3.29
raytrace (7,256)     1.930  2.944  0.228  8.46
dlmalloc (400,400)   5.050  1.880  0.315  5.97
The first column is the name of the benchmark, and in parentheses any parameters used in running it. The source code to all the benchmarks can be found at https://github.com/kripken/emscripten/tree/master/tests (each in a separate file with its name, except for ‘primes’, which is embedded inside runner.py in the function test primes). A brief summary of the benchmarks is as follows:
• fannkuch and fasta are commonly-known benchmarks, appearing for example on the Computer Language Benchmarks Game [8]. They use a mix of mathematic operations (integer in the former, floating-point in the latter) and memory access.
• primes is the simplest benchmark in terms of code. It is basically just a tiny loop that calculates prime numbers.
• raytrace is real-world code, from the sphereflake raytracer [9]. This benchmark has a combination of memory access and floating-point math.
• dlmalloc (Doug Lea’s malloc [10]) is a well-known real-world implementation of malloc and free. This benchmark does a large amount of calls to malloc and free in an intermixed way, which tests memory access and integer calculations.
Returning to the table of results, the second column is the elapsed time (in seconds) when running the compiled code (generated using all Emscripten and LLVM optimizations as well as the Closure Compiler) in the SpiderMonkey JavaScript engine (specifically the JaegerMonkey branch, checked out June 15th, 2011). The third column is the elapsed time when running the same JavaScript code in the V8 JavaScript engine (checked out Jun 15th, 2011). In both the second and third column lower values are better; the best of the two is in bold. The fourth column is the elapsed time when running the original code compiled with gcc -O3, using GCC 4.4.4. The last column is the ratio, that is, how much slower the JavaScript code (running in the faster of the two engines for that test) is when compared to gcc. All the tests were run on a MacBook Pro with an Intel i7 CPU clocked at 2.66GHz, running on Ubuntu 10.04.
Clearly the results greatly vary by the benchmark, with the generated JavaScript running from 2.47 to 8.46 times slower. There are also significant differences between the
[8] http://shootout.alioth.debian.org/
[9] http://ompf.org/ray/sphereflake/
[10] http://en.wikipedia.org/wiki/Malloc#dlmalloc_and_its_derivatives
two JavaScript engines, with each better at some of the benchmarks. It appears that code that does simple numerical operations – like the primes test – can run fairly fast, while code that has a lot of memory accesses, for example due to using structures – like the raytrace test – will be slower. (The main issue with structures is that Emscripten does not ‘nativize’ them yet, as it does to simple local variables.)
Being 2.47 to 8.46 times slower than the most-optimized C++ code is a significant slowdown, but it is still more than fast enough for many purposes, and the main point of course is that the code can run anywhere the web can be accessed. Further work on Emscripten is expected to improve the speed as well, as are improvements to LLVM, the Closure Compiler, and JavaScript engines themselves; see further discussion in the Summary.
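The generated code benchmarked above is ordinary JavaScript. As a sketch of what compiled output of this kind can look like (a hypothetical hand translation, not actual Emscripten output), a simple C loop might become a plain JavaScript loop, with `|0` coercing results back to 32-bit integers the way C arithmetic would wrap them:

```javascript
// C:  int accumulate(int n) {
//       int s = 0;
//       for (int i = 0; i < n; i++) s += i;
//       return s;
//     }
// Hand-written sketch of a "natural JavaScript" translation:
// ordinary control flow and arithmetic, with |0 forcing 32-bit
// integer semantics at each step.
function accumulate(n) {
  n = n | 0;
  var s = 0, i = 0;
  for (i = 0; (i | 0) < (n | 0); i = (i + 1) | 0) {
    s = (s + i) | 0;
  }
  return s | 0;
}

console.log(accumulate(5)); // 10
```

Because the output is a normal loop over small integers, JavaScript engines can apply their usual optimizations to it, which is the whole premise of the approach.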
2.3 Limitations

Emscripten’s compilation approach, as has been described in this Section so far, is to generate ‘natural’ JavaScript, as close as possible to normal JavaScript on the web, so that modern JavaScript engines perform well on it. In particular, we try to generate ‘normal’ JavaScript operations, like regular addition and multiplication and so forth. This is a very different approach than, say, emulating a CPU at a low level, or, in the case of LLVM, writing an LLVM bitcode interpreter in JavaScript. The latter approach has the benefit of being able to run virtually any compiled code, at the cost of speed, whereas Emscripten makes a tradeoff in the other direction. We will now give a summary of some of the limitations of Emscripten’s approach.
• 64-bit Integers: JavaScript numbers are all 64-bit doubles, with engines typically implementing them as 32-bit integers where possible for speed. A consequence of this is that it is impossible to directly implement 64-bit integers in JavaScript, as integer values larger than 32 bits will become doubles, with only 53 bits for the significand. Thus, when Emscripten uses normal JavaScript addition and so forth for 64-bit integers, it runs the risk of rounding effects. This could be solved by emulating 64-bit integers, but it would be much slower than native code.
• Multithreading: JavaScript has Web Workers, which are additional threads (or processes) that communicate via message passing. There is no shared state in this model, which means that it is not directly possible to compile multithreaded C++ code into JavaScript. A partial solution could be to emulate threads, without Workers, by manually controlling which blocks of code run (a variation on the switch-in-a-loop construction mentioned earlier) and manually switching between threads every so often. However, in that case there would not be any utilization of additional CPU cores, and furthermore performance would be slow due to not using normal JavaScript loops.
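The 64-bit rounding hazard is easy to demonstrate directly: past 2^53 a double can no longer represent every integer, so ‘natural’ addition silently rounds. The `add64` helper below is a hypothetical sketch of the two-word emulation mentioned above (not how any particular engine or compiler does it), and hints at why emulation costs speed:

```javascript
// JavaScript numbers are IEEE-754 doubles with a 53-bit significand,
// so integer arithmetic is exact only up to 2^53. Past that, adding
// 1 can be lost entirely to rounding.
var limit = Math.pow(2, 53);      // 9007199254740992

console.log(limit + 1 === limit); // true: 2^53 + 1 rounds back to 2^53
console.log(limit + 2);           // 9007199254740994: representable again

// A 64-bit integer emulated as two unsigned 32-bit halves (low, high).
// Hypothetical sketch: every add now takes several operations plus a
// carry check, instead of one native instruction.
function add64(aLo, aHi, bLo, bHi) {
  var lo = (aLo + bLo) >>> 0;            // unsigned 32-bit addition
  var carry = lo < (aLo >>> 0) ? 1 : 0;  // wraparound means a carry out
  var hi = (aHi + bHi + carry) >>> 0;
  return [lo, hi];
}

console.log(add64(0xFFFFFFFF, 0, 1, 0)); // [0, 1]: carry into high word
```

The emulated form stays exact for the full 64-bit range, at the price of replacing one addition with several operations per arithmetic step.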
JavaScript on LLVM
Fabric Engine
• JavaScript Integration
• Native code compilation (LLVM)
• Multi-threaded execution
• OpenGL Rendering
Fabric Engine
http://fabric-engine.com/2011/11/server-performance-benchmarks/
Conclusion?
David Wheeler

“All problems in computer science can be solved by another level of indirection.”
References

• The Behavior of Efficient Virtual Machine Interpreters on Modern Architectures
• Virtual Machine Showdown: Stack Versus Registers
• The Implementation of Lua 5.0
• Why Is the New Google V8 Engine so Fast?
• Context Threading: A Flexible and Efficient Dispatch Technique for Virtual Machine Interpreters
• Effective Inline-Threaded Interpretation of Java Bytecode Using Preparation Sequences
• Smalltalk-80: The Language and Its Implementation
References

• Design of the Java HotSpot™ Client Compiler for Java 6
• Oracle JRockit: The Definitive Guide
• Virtual Machines: Versatile platforms for systems and processes
• Fast and Precise Hybrid Type Inference for JavaScript
• LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation
• Emscripten: An LLVM-to-JavaScript Compiler
• An Analysis of the Dynamic Behavior of JavaScript Programs
References

• Adaptive Optimization for SELF
• Bytecodes meet Combinators: invokedynamic on the JVM
• Efficient Implementation of the Smalltalk-80 System
• Design, Implementation, and Evaluation of Optimizations in a Just-In-Time Compiler
• Optimizing direct threaded code by selective inlining
• Linear Scan Register Allocation
• Optimizing Invokedynamic
References

• Representing Type Information in Dynamically Typed Languages
• Trace-based Just-in-Time Type Specialization for Dynamic Languages
• The Structure and Performance of Efficient Interpreters
• Know Your Engines: How to Make Your JavaScript Fast
• IE Blog, Chromium Blog, WebKit Blog, Opera Blog, Mozilla Blog, Wingolog’s Blog, RednaxelaFX’s Blog, David Mandelin’s Blog...
Thank you