An introduction to virtual machines and JavaScript engine implementation.
Virtual Machine & JavaScript Engine
@nwind
(HLL) Virtual Machine
Take the red pill, and I will show you the rabbit hole.
Virtual Machine History
• Pascal 1970
• Smalltalk 1980
• Self 1986
• Python 1991
• Java 1995
• JavaScript 1995
The Smalltalk demonstration showed three amazing features. One was how computers could be networked; the second was how object-oriented programming worked. But Jobs and his team paid little attention to these attributes because they were so amazed by the third feature, ...
How Does a Virtual Machine Work?
• Parser
• Intermediate Representation (IR)
• Interpreter
• Garbage Collection
• Optimization
Parser
• Tokenize
• AST
Tokenize
var foo = 10;

keyword
space
identifier
equal
number
semicolon
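The scan above can be reproduced by a hand-rolled scanner. A minimal sketch in C (the token kinds and the single `var` keyword check are simplifications, and all names are invented for illustration):

```c
#include <ctype.h>
#include <string.h>

typedef enum { TOK_KEYWORD, TOK_IDENT, TOK_EQUAL, TOK_NUMBER, TOK_SEMI, TOK_EOF } TokKind;
typedef struct { TokKind kind; char text[32]; } Token;

/* Return the next token starting at *src, advancing the cursor past it. */
static Token next_token(const char **src) {
    const char *p = *src;
    Token t = { TOK_EOF, "" };
    while (*p == ' ') p++;                       /* consume whitespace */
    if (isalpha((unsigned char)*p)) {            /* keyword or identifier */
        size_t n = 0;
        while (isalnum((unsigned char)p[n])) n++;
        memcpy(t.text, p, n);
        t.kind = strcmp(t.text, "var") == 0 ? TOK_KEYWORD : TOK_IDENT;
        p += n;
    } else if (isdigit((unsigned char)*p)) {     /* number literal */
        size_t n = 0;
        while (isdigit((unsigned char)p[n])) n++;
        memcpy(t.text, p, n);
        t.kind = TOK_NUMBER;
        p += n;
    } else if (*p == '=') { t.kind = TOK_EQUAL; t.text[0] = *p++; }
    else   if (*p == ';') { t.kind = TOK_SEMI;  t.text[0] = *p++; }
    *src = p;
    return t;
}
```

Feeding it `var foo = 10;` yields the keyword, identifier, equal, number and semicolon tokens listed above; like most real tokenizers, it consumes the whitespace rather than emitting it.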
AST
Assign
├─ Variable foo
└─ Constant 10
{
  "type": "Program",
  "body": [
    {
      "type": "VariableDeclaration",
      "declarations": [
        {
          "id": {
            "type": "Identifier",
            "name": "foo"
          },
          "init": {
            "type": "BinaryExpression",
            "operator": "+",
            "left": {
              "type": "Identifier",
              "name": "bar"
            },
            "right": {
              "type": "Literal",
              "value": 1
            }
          }
        }
      ],
      "kind": "var"
    }
  ]
}
AST demo (Esprima)
var foo = bar + 1;
http://esprima.org/demo/parse.html
Intermediate Representation
• Bytecode
• Stack vs. register
00000: deffun 0 null
00005: nop
00006: callvar 0
00009: int8 2
00011: call 1
00014: pop
00015: stop

foo:
00020: getarg 0
00023: one
00024: add
00025: return
00026: stop
Bytecode (SpiderMonkey)
function foo(bar) {
  return bar + 1;
}
foo(2);
8 m_instructions; 168 bytes at 0x7fc1ba3070e0; 1 parameter(s); 10 callee register(s)
[  0] enter
[  1] mov                r0, undefined(@k0)
[  4] get_global_var     r1, 5
[  7] mov                r2, undefined(@k0)
[ 10] mov                r3, 2(@k1)
[ 13] call               r1, 2, 10
[ 17] op_call_put_result r0
[ 19] end                r0
Constants:
  k0 = undefined
  k1 = 2
3 m_instructions; 64 bytes at 0x7fc1ba306e80; 2 parameter(s); 1 callee register(s)
[ 0] enter
[ 1] add   r0, r-7, 1(@k0)
[ 6] ret   r0
Constants: k0 = 1
End: 3
Bytecode (JSC)
function foo(bar) {
  return bar + 1;
}
foo(2);
Stack vs. register
• Stack
• JVM, .NET, PHP, Python, older JavaScript engines
• Register
• Lua, Dalvik, all modern JavaScript engines
• Smaller, Faster (about 30%)
• RISC
local a,t,i   1: LOADNIL  0 2 0
a=a+i         2: ADD      0 0 2
a=a+1         3: ADD      0 0 250 ; a
a=t[i]        4: GETTABLE 0 1 2
Stack vs. register
local a,t,i   1: PUSHNIL 3
a=a+i         2: GETLOCAL 0   ; a
              3: GETLOCAL 2   ; i
              4: ADD
              5: SETLOCAL 0   ; a
a=a+1         6: GETLOCAL 0   ; a
              7: ADDI 1
              8: SETLOCAL 0   ; a
a=t[i]        9: GETLOCAL 1   ; t
             10: GETINDEXED 2 ; i
             11: SETLOCAL 0   ; a
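The dispatch-count difference is easy to see in a toy interpreter. A sketch in C of `a = a + i` on both machine styles (opcode names and encodings are invented for illustration): the stack machine takes four dispatches where the register machine takes one, which is where much of the ~30% speedup comes from.

```c
/* Stack machine: a = a + i is GET a, GET i, ADD, SET a (4 dispatches). */
enum { S_GET, S_ADD, S_SET, S_HALT };
static void run_stack(const int *code, int *locals) {
    int stack[8], sp = 0, pc = 0;
    for (;;) {
        switch (code[pc++]) {
        case S_GET:  stack[sp++] = locals[code[pc++]]; break;  /* push local */
        case S_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
        case S_SET:  locals[code[pc++]] = stack[--sp]; break;  /* pop to local */
        case S_HALT: return;
        }
    }
}

/* Register machine: the same statement is one ADD dst, a, b (1 dispatch). */
enum { R_ADD, R_HALT };
static void run_reg(const int *code, int *regs) {
    int pc = 0;
    for (;;) {
        switch (code[pc++]) {
        case R_ADD: {
            int d = code[pc++], a = code[pc++], b = code[pc++];
            regs[d] = regs[a] + regs[b];
            break;
        }
        case R_HALT: return;
        }
    }
}
```

The register instruction is wider (three operand slots), so register bytecode trades slightly larger instructions for far fewer trips through the dispatch loop.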
Interpreter
• Switch statement
• Direct threading, Indirect threading, Token threading ...
while (true) {
    switch (opcode) {
        case ADD:
            ...
            break;
        case SUB:
            ...
            break;
        ...
    }
}
Switch statement
mov  %edx,0xffffffffffffffe4(%rbp)
cmpl $0x1,0xffffffffffffffe4(%rbp)
je   6e <interpret+0x6e>
cmpl $0x1,0xffffffffffffffe4(%rbp)
jb   4a <interpret+0x4a>
cmpl $0x2,0xffffffffffffffe4(%rbp)
je   93 <interpret+0x93>
jmp  22 <interpret+0x22>
...
typedef void *Inst;
Inst program[] = { &&ADD, &&SUB };
Inst *ip = program;
goto *ip++;
ADD:
    ...
    goto *ip++;

SUB:
    ...
    goto *ip++;
Direct threading

mov  0xffffffffffffffe8(%rbp),%rdx
lea  0xffffffffffffffe8(%rbp),%rax
addq $0x8,(%rax)
mov  %rdx,0xffffffffffffffd8(%rbp)
jmpq *0xffffffffffffffd8(%rbp)
ADD:
    ...
    mov  0xffffffffffffffe8(%rbp),%rdx
    lea  0xffffffffffffffe8(%rbp),%rax
    addq $0x8,(%rax)
    mov  %rdx,0xffffffffffffffd8(%rbp)
    jmp  2c <interpreter+0x2c>
http://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html
Garbage Collection
• Reference counting (php, python ...), smart pointer
• Tracing
• Stop the world
• Copying, Mark-and-sweep, Mark-and-compact
• Generational GC
• Precise vs. conservative
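Reference counting, the first scheme on the list, fits in a few lines of C. A minimal sketch (the object layout and `finalize` hook are invented; note that plain reference counting cannot collect cycles, which is one reason tracing collectors dominate in JavaScript engines):

```c
#include <stdlib.h>

typedef struct RCObject {
    int refcount;
    void (*finalize)(struct RCObject *);  /* optional destructor hook */
} RCObject;

/* Allocate an object; its creator holds the first reference. */
static RCObject *rc_new(void (*finalize)(RCObject *)) {
    RCObject *o = malloc(sizeof *o);
    o->refcount = 1;
    o->finalize = finalize;
    return o;
}

static void rc_retain(RCObject *o) { o->refcount++; }

/* Drop one reference; returns 1 when the object was actually freed. */
static int rc_release(RCObject *o) {
    if (--o->refcount > 0) return 0;
    if (o->finalize) o->finalize(o);
    free(o);
    return 1;
}
```

Every assignment of a managed value pairs a `rc_retain` with a later `rc_release`, which is exactly the bookkeeping PHP and CPython do on each refcounted write.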
Precise vs. conservative
• Conservative
• If it looks like a pointer, treat it as a pointer
• Might leak memory
• Can't move objects, causing memory fragmentation
• Precise
• Indirect vs. direct reference
It is time for the DARK Magic
Optimization Magic
• Interpreter optimization
• Compiler optimization
• JIT
• Type inference
• Hidden Type
• Method inline, PICs
Interpreter optimization
Switch dispatch is inefficient. Why?
CPU Pipeline
• Fetch, Decode, Execute, Write-back
• Branch prediction
http://en.wikipedia.org/wiki/File:Pipeline,_4_stage.svg
ICONST_1_START: *sp++ = 1; ICONST_1_END: goto **(pc++);
INEG_START: sp[-1] = -sp[-1]; INEG_END: goto **(pc++);
DISPATCH_START: goto **(pc++); DISPATCH_END: ;
size_t iconst_size   = (&&ICONST_1_END - &&ICONST_1_START);
size_t ineg_size     = (&&INEG_END - &&INEG_START);
size_t dispatch_size = (&&DISPATCH_END - &&DISPATCH_START);
void *buf = malloc(iconst_size + ineg_size + dispatch_size);
void *current = buf;
memcpy(current, &&ICONST_1_START, iconst_size);
current += iconst_size;
memcpy(current, &&INEG_START, ineg_size);
current += ineg_size;
memcpy(current, &&DISPATCH_START, dispatch_size);
...
goto **buf;
Solution: Inline Threading
Interpreter? JIT!
Compiler optimization
Compiler optimization
• SSA
• Data-flow
• Control-flow
• Loop
• ...
http://www.oracle.com/us/technologies/java/java7-renaissance-vm-428200.pdf | © 2011 Oracle Corporation
What a JVM can do...

compiler tactics
• delayed compilation
• tiered compilation
• on-stack replacement
• delayed reoptimization
• program dependence graph representation
• static single assignment representation

proof-based techniques
• exact type inference
• memory value inference
• memory value tracking
• constant folding
• reassociation
• operator strength reduction
• null check elimination
• type test strength reduction
• type test elimination
• algebraic simplification
• common subexpression elimination
• integer range typing

flow-sensitive rewrites
• conditional constant propagation
• dominating test detection
• flow-carried type narrowing
• dead code elimination

language-specific techniques
• class hierarchy analysis
• devirtualization
• symbolic constant propagation
• autobox elimination
• escape analysis
• lock elision
• lock fusion
• de-reflection

speculative (profile-based) techniques
• optimistic nullness assertions
• optimistic type assertions
• optimistic type strengthening
• optimistic array length strengthening
• untaken branch pruning
• optimistic N-morphic inlining
• branch frequency prediction
• call frequency prediction

memory and placement transformation
• expression hoisting
• expression sinking
• redundant store elimination
• adjacent store fusion
• card-mark elimination
• merge-point splitting

loop transformations
• loop unrolling
• loop peeling
• safepoint elimination
• iteration range splitting
• range check elimination
• loop vectorization

global code shaping
• inlining (graph integration)
• global code motion
• heat-based code layout
• switch balancing
• throw inlining

control flow graph transformation
• local code scheduling
• local code bundling
• delay slot filling
• graph-coloring register allocation
• linear scan register allocation
• live range splitting
• copy coalescing
• constant splitting
• copy removal
• address mode matching
• instruction peepholing
• DFA-based code generator
Thursday, July 7, 2011
Just-In-Time (JIT)
JIT
• Method JIT, Trace JIT, Regular expression JIT
• Code generation
• Register allocation
How does a JIT work?
• mmap/new/malloc (mprotect)
• Generate native code
• C cast / reinterpret_cast
• Call the function
asm (
".text\n"
".globl " SYMBOL_STRING(ctiTrampoline) "\n"
HIDE_SYMBOL(ctiTrampoline) "\n"
SYMBOL_STRING(ctiTrampoline) ":" "\n"
    "pushl %ebp" "\n"
    "movl %esp, %ebp" "\n"
    "pushl %esi" "\n"
    "pushl %edi" "\n"
    "pushl %ebx" "\n"
    "subl $0x3c, %esp" "\n"
    "movl $512, %esi" "\n"
    "movl 0x58(%esp), %edi" "\n"
    "call *0x50(%esp)" "\n"
    "addl $0x3c, %esp" "\n"
    "popl %ebx" "\n"
    "popl %edi" "\n"
    "popl %esi" "\n"
    "popl %ebp" "\n"
    "ret" "\n"
);
Trampoline (JSC x86)
// Execute the code!
inline JSValue execute(RegisterFile* registerFile, CallFrame* callFrame, JSGlobalData* globalData)
{
    JSValue result = JSValue::decode(
        ctiTrampoline(
            m_ref.m_code.executableAddress(),
            registerFile,
            callFrame,
            0,
            Profiler::enabledProfilerReference(),
            globalData));
    return globalData->exception ? jsNull() : result;
}
Register allocation
• Linear scan
• Graph coloring
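Linear scan is the simpler of the two and what most JIT tiers use. A compact sketch in C over precomputed live intervals, with two registers and a spill-the-interval-that-ends-last heuristic (the interval format and constants are invented; real allocators also handle fixed registers, interval splitting, and spill slots):

```c
#define NREGS   2
#define SPILLED -1

typedef struct { int start, end, reg; } Interval;

/* Assign each interval (pre-sorted by start) a register, or SPILLED. */
static void linear_scan(Interval *iv, int n) {
    Interval *active[NREGS];
    int nactive = 0;
    for (int i = 0; i < n; i++) {
        /* expire intervals that ended before this one starts */
        int k = 0;
        for (int j = 0; j < nactive; j++)
            if (active[j]->end >= iv[i].start) active[k++] = active[j];
        nactive = k;

        if (nactive < NREGS) {
            /* a register is free: take the lowest-numbered unused one */
            int used[NREGS] = { 0 };
            for (int j = 0; j < nactive; j++) used[active[j]->reg] = 1;
            for (int r = 0; r < NREGS; r++)
                if (!used[r]) { iv[i].reg = r; break; }
            active[nactive++] = &iv[i];
        } else {
            /* all registers busy: spill whichever interval ends last */
            Interval *victim = &iv[i];
            for (int j = 0; j < nactive; j++)
                if (active[j]->end > victim->end) victim = active[j];
            if (victim != &iv[i]) {
                iv[i].reg = victim->reg;
                victim->reg = SPILLED;
                for (int j = 0; j < nactive; j++)
                    if (active[j] == victim) active[j] = &iv[i];
            } else {
                iv[i].reg = SPILLED;
            }
        }
    }
}
```

One pass over the sorted intervals, no interference graph: that is why V8's Lithium and many method JITs pick linear scan over graph coloring despite the occasionally worse assignment.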
Code generation
• Pipelining
• SIMD (SSE2, SSE3 ...)
• Debug
Type inference
a + b
Property access
“foo.bar”
00001f63  movl  %ecx,0x04(%edx)
foo.bar in C
__ZN2v88internal7HashMap6LookupEPvjb:
00000338  pushl %ebp
00000339  pushl %ebx
0000033a  pushl %edi
0000033b  pushl %esi
0000033c  subl  $0x0c,%esp
0000033f  movl  0x20(%esp),%esi
00000343  movl  0x08(%esi),%eax
00000346  movl  0x0c(%esi),%ecx
00000349  imull $0x0c,%ecx,%edi
0000034c  leal  0xff(%ecx),%ecx
0000034f  addl  %eax,%edi
00000351  movl  0x28(%esp),%ebx
00000355  andl  %ebx,%ecx
00000357  imull $0x0c,%ecx,%ebp
0000035a  addl  %eax,%ebp
0000035c  jmp   0x0000036a
0000035e  nop
00000360  addl  $0x0c,%ebp
00000363  cmpl  %edi,%ebp
00000365  jb    0x0000036a
00000367  movl  0x08(%esi),%ebp
0000036a  movl  0x00(%ebp),%eax
0000036d  testl %eax,%eax
0000036f  je    0x0000038b
00000371  cmpl  %ebx,0x08(%ebp)
00000374  jne   0x00000360
00000376  movl  %eax,0x04(%esp)
0000037a  movl  0x24(%esp),%eax
0000037e  movl  %eax,(%esp)
00000381  call  *0x04(%esi)
00000384  testb %al,%al
00000386  je    0x00000360
00000388  movl  0x00(%ebp),%eax
0000038b  testl %eax,%eax
0000038d  jne   0x00000418
00000393  cmpb  $0x00,0x2c(%esp)
00000398  jne   0x0000039e
0000039a  xorl  %ebp,%ebp
0000039c  jmp   0x00000418
0000039e  movl  0x24(%esp),%eax
000003a2  movl  %eax,0x00(%ebp)
000003a5  movl  $0x00000000,0x04(%ebp)
000003ac  movl  %ebx,0x08(%ebp)
000003af  movl  0x10(%esi),%eax
000003b2  leal  0x01(%eax),%ecx
000003b5  movl  %ecx,0x10(%esi)
000003b8  shrl  $0x02,%ecx
000003bb  leal  0x01(%ecx,%eax),%eax
... 27 lines more
foo.bar in JavaScript
__ZN2v88internal7HashMap6LookupEPvjb
means:
v8::internal::HashMap::Lookup(void*, unsigned int, bool)
How to optimize?
Hidden Type

add property x, then add property y
http://code.google.com/apis/v8/design.html
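The transition chain sketches out in a few lines of C: every object points to a shared shape, and adding a property either follows a cached transition or creates a new shape (one cached transition per shape and invented names throughout; V8 keeps a full transition table, and shapes here are deliberately never freed since they are shared metadata):

```c
#include <stdlib.h>
#include <string.h>

typedef struct Shape {
    const char *prop;          /* property added by this transition */
    int offset;                /* slot index of that property */
    struct Shape *parent;
    struct Shape *transition;  /* one cached outgoing transition */
} Shape;

typedef struct {
    Shape *shape;
    double slots[8];
} Obj;

static Shape empty_shape;      /* the root shape: no properties */

static void obj_init(Obj *o) { o->shape = &empty_shape; }

static void obj_add(Obj *o, const char *prop, double v) {
    Shape *s = o->shape;
    if (s->transition && strcmp(s->transition->prop, prop) == 0) {
        o->shape = s->transition;           /* fast path: reuse existing shape */
    } else {
        Shape *t = calloc(1, sizeof *t);    /* slow path: create a new shape */
        t->prop = prop;
        t->offset = (s == &empty_shape) ? 0 : s->offset + 1;
        t->parent = s;
        s->transition = t;
        o->shape = t;
    }
    o->slots[o->shape->offset] = v;
}
```

Two objects that add x and then y in the same order end up with the same shape pointer, which is what lets property reads compile down to the single `movl` shown earlier.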
But nothing is perfect
One secret of V8 hidden classes
http://jsperf.com/test-v8-delete
20x slower!
But properties are rarely deleted
Figure 3 gives the average size of the code functions occurring in the JavaScript program source. These seem fairly consistent across sites. More interestingly, Figure 4 shows the number of events per function, which roughly corresponds to the number of bytecodes evaluated by the interpreter (note that some low-level bytecodes such as branches and arithmetic are not recorded in the trace). It is interesting to note that the median is fairly high, around 20 events. This suggests that, in contrast to Java, there are fewer short methods (e.g. accessors) in JavaScript and thus possibly fewer opportunities to benefit from inlining optimizations.
5.2 Instruction Mix

The instruction mix of JavaScript programs is also fairly traditional: more read operations are expected than write operations. As shown in Figure 5, reads are far more common than writes: over all traces the proportion of reads to writes is 6 to 1. Deletes comprise only .1% of all events. That graph further breaks reads, writes and deletes into various specific types; prop refers to accesses
Figure 5. Instruction mix. The per-site proportion of read, write, delete, call instructions (averaged over multiple traces).
Figure 6. Prototype chain length. The per-site quartile and maximum prototype chain lengths.
using dot notation (e.g. x.f), hash refers to access using indexing notation (e.g. x[s]), indx refers to accesses using indexing notation with a numeric argument. The overall number of calls is high, 20%, as the interpreter does not perform any inlining. Exception handling is rather infrequent with a grand total of 1,328 throws over 478 million trace events. There are some outliers such as ISHK, WORD and DIGG where updates are a much smaller proportion of operations (and influenced by the sheer number of objects in these sites), but otherwise the traces are consistent.
5.3 Prototype Chains

One higher-level metric is the length of an object’s prototype chain, which is the number of prototype objects that may potentially be traversed in order to find an object’s inherited property. This is roughly comparable to metrics of the depth of class hierarchies in class-based languages, such as the Depth of Inheritance (DIT) metric discussed in [23]. Studies of C++ programs mention a maximum DIT of 8 and a median of 1, whereas Smalltalk has a median of 3 and maximum of 10. Figure 6 shows that in all but four sites, the median prototype chain length is 1. Note that we start our graph at chain length 1, the minimum. All objects except Object.prototype have at least one prototype, which if unspecified, defaults to the Object.prototype. The maximum observed prototype chain length is 10. The majority of sites do not seem to use prototypes for code reuse, but this is possibly explained by the existence of other ways to achieve code reuse in JavaScript (i.e., the ability to assign closures directly into a field of an object). The programs that do utilize prototypes have similar inheritance properties to Java [23].
5.4 Object Kinds

Figure 7 breaks down the kinds of objects allocated at run-time into a number of categories. There are a number of frequently used built-in data types: dates (Date), regular expressions (RegExp), document and layout objects (DOM), arrays (Array) and runtime errors. The remaining objects are separated into four groups: anonymous objects, instances, functions, and prototypes. Anonymous objects are constructed with an object literal using the {...} notation, while instances are constructed by calls of the form new C(...). A function object is created for every function expression evaluated by the interpreter and a prototype object is automatically added to every function in case it is used as a constructor. Over all sites and traces, arrays account for 31% of objects allocated. Dates and DOM objects come next with 12% and 14%, respectively. Functions, prototypes, and instances each account for 10% of the allocated objects, and finally anonymous objects account for
Figure 7. Kinds of allocated objects. The per-site proportion of runtime object kinds (averaged over multiple traces).
Only 0.1% are deletes
An Analysis of the Dynamic Behavior of JavaScript Programs
Optimize method call
function foo(bar) {
  return bar.pro();
}
bar can be anything
Adaptive Optimization for Self
Polymorphic inline cache
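The mechanism can be sketched in C with a monomorphic cache, the one-entry special case of a PIC: each property-access site remembers the last shape it saw and the slot offset it found, so repeated accesses to same-shaped objects skip the slow lookup (shape layout and names are invented for illustration; a polymorphic cache simply holds several shape/offset pairs):

```c
#include <string.h>

typedef struct { const char *props[4]; } Shape;  /* props[i] names slot i */
typedef struct { Shape *shape; int slots[4]; } Obj;

typedef struct {
    Shape *cached_shape;   /* shape seen last time at this access site */
    int    cached_offset;
    long   hits, misses;
} InlineCache;

static int slow_lookup(const Shape *s, const char *name) {
    for (int i = 0; i < 4; i++)
        if (s->props[i] && strcmp(s->props[i], name) == 0)
            return i;
    return -1;
}

/* The "compiled code" for one particular o.name access site.
   Assumes the property exists on the object's shape. */
static int ic_get(InlineCache *ic, const Obj *o, const char *name) {
    if (ic->cached_shape == o->shape) {    /* fast path: shape check + load */
        ic->hits++;
        return o->slots[ic->cached_offset];
    }
    ic->misses++;                          /* slow path: look up, then cache */
    ic->cached_offset = slow_lookup(o->shape, name);
    ic->cached_shape  = o->shape;
    return o->slots[ic->cached_offset];
}
```

As long as every object reaching the site shares one shape, the access is a pointer compare plus an indexed load, which is how `bar.pro()` avoids a hash lookup per call.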
Tagged pointer
typedef union {
  void *p;
  double d;
  long l;
} Value;

typedef struct {
  unsigned char type;
  Value value;
} Object;
Object a;
Tagged pointer
sizeof(a)? If everything is an object, the overhead is too high for small integers.
Tagged pointer
In almost all systems, pointer addresses are aligned (to 4 or 8 bytes)
http://www.gnu.org/s/libc/manual/html_node/Aligned-Memory-Blocks.html
“The address of a block returned by malloc or realloc in the GNU system is always a multiple of eight (or sixteen on 64-bit systems). ”
Tagged pointer
Example: 0xc00ab958. An aligned pointer's last 2 or 3 bits must be 0.
…1000 → Pointer (low bits are 0)
…1001 → Small number (low bit is 1)
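Those guaranteed-zero low bits make the encoding almost free. A sketch in C of the small-integer ("smi") scheme, where a low bit of 1 marks an integer and an aligned pointer is stored unchanged (names invented; the arithmetic right shift of negative values is technically implementation-defined in C, though universal in practice):

```c
#include <stdint.h>

typedef uintptr_t Value;

/* Small integer: value shifted left one bit, low bit set as the tag. */
static Value    box_smi(intptr_t n)  { return ((uintptr_t)n << 1) | 1; }
static intptr_t unbox_smi(Value v)   { return (intptr_t)v >> 1; }
static int      is_smi(Value v)      { return (int)(v & 1); }

/* Pointer: stored as-is; alignment guarantees the low bits are 0. */
static Value    box_ptr(void *p)     { return (uintptr_t)p; }
static void    *unbox_ptr(Value v)   { return (void *)v; }
```

A type check is a single AND, and smi arithmetic can often be done directly on the tagged form, which is why V8 keeps 31-bit smis distinct from heap numbers.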
How about double?
* The top 16 bits denote the type of the encoded JSValue:
*
*   Pointer  { 0000:PPPP:PPPP:PPPP
*            / 0001:****:****:****
*   Double   { ...
*            \ FFFE:****:****:****
*   Integer  { FFFF:0000:IIII:IIII
NaN-tagging (JSC 64-bit)

On 64-bit systems only 48 bits of the address space are used, which means the top 16 bits of a pointer are always 0.
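The layout above can be reproduced in portable C: double bit patterns are shifted up by 2^48 so their top 16 bits land in the 0001–FFFE range, integers live under the FFFF tag, and a raw 48-bit pointer keeps its zero top bits. A sketch of the scheme (cell types and error handling omitted; names invented):

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t JSVal;

#define TAG_INT       0xFFFF000000000000ULL
#define DOUBLE_OFFSET 0x0001000000000000ULL  /* pushes doubles out of the pointer range */

static JSVal   box_int(int32_t i) { return TAG_INT | (uint32_t)i; }
static int32_t unbox_int(JSVal v) { return (int32_t)v; }

static JSVal box_double(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);   /* type-pun without aliasing issues */
    return bits + DOUBLE_OFFSET;
}
static double unbox_double(JSVal v) {
    uint64_t bits = v - DOUBLE_OFFSET;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

static int is_int(JSVal v)     { return (v & TAG_INT) == TAG_INT; }
static int is_pointer(JSVal v) { return (v >> 48) == 0; }  /* top 16 bits zero */
```

Every value, pointer or not, stays in one 64-bit word, and checking the type is a mask or a shift rather than a load through an Object header.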
V8
V8
• Lars Bak
• Hidden Class, PICs
• Built-in objects written in JavaScript
• Crankshaft
• Precise generational GC
Lars Bak
• Implementing VMs since 1988
• Beta
• Self
• HotSpot
Source code → Native Code

Source code → High-Level IR → Low-Level IR → Opt → Native Code } Crankshaft

HotSpot client compiler
Crankshaft
• Profiling
• Compiler optimization
• On-stack replacement
• Deoptimize
High-Level IR (Hydrogen)
• Function inlining
• Type inference
• Stack-check elimination
• Loop-invariant code motion
• Common subexpression elimination
• ...
http://wingolog.org/archives/2011/08/02/a-closer-look-at-crankshaft-v8s-optimizing-compiler
Low-Level IR (Lithium)
• Linear-scan register allocation
• Code generation
• Lazy deoptimization
http://wingolog.org/archives/2011/09/05/from-ssa-to-native-code-v8s-lithium-language
function ArraySort(comparefn) {
  if (IS_NULL_OR_UNDEFINED(this) && !IS_UNDETECTABLE(this)) {
    throw MakeTypeError("called_on_null_or_undefined",
                        ["Array.prototype.sort"]);
  }

  // In-place QuickSort algorithm.
  // For short (length <= 22) arrays, insertion sort is used for efficiency.

  if (!IS_SPEC_FUNCTION(comparefn)) {
    comparefn = function (x, y) {
      if (x === y) return 0;
      if (%_IsSmi(x) && %_IsSmi(y)) {
        return %SmiLexicographicCompare(x, y);
      }
      x = ToString(x);
      y = ToString(y);
      if (x == y) return 0;
      else return x < y ? -1 : 1;
    };
  }
  ...
Built-in objects written in JS
v8/src/array.js
GC
V8 performance
Can V8 be faster?
Dart
• Clear syntax, optional types, libraries
• Performance
• Can compile to JavaScript
• But IE, WebKit and Mozilla rejected it
• What do you think?
• My thought: Will XML replace HTML? No, but thanks, Google, for pushing the web forward
Embed V8
Embed
v8::Handle<v8::Value> Print(const v8::Arguments& args) {
  for (int i = 0; i < args.Length(); i++) {
    v8::HandleScope handle_scope;
    v8::String::Utf8Value str(args[i]);
    const char* cstr = ToCString(str);
    printf("%s", cstr);
  }
  return v8::Undefined();
}
v8::Handle<v8::ObjectTemplate> global = v8::ObjectTemplate::New();
global->Set(v8::String::New("print"),
            v8::FunctionTemplate::New(Print));
Expose Function
Node.JS
• Pros
• Async
• One language for everything
• Faster than PHP, Python
• Community
• Cons
• Lack of great libraries
• ES5 code hard to maintain
• Still too young
JavaScriptCore (Nitro)
Where does it come from?
1997 Macworld
“Apple has decided to make Internet Explorer its default browser on the Macintosh.”
“Since we believe in choice, we're going to be shipping other Internet browsers...”
Steve Jobs
JavaScriptCore History
• 2001 KJS (kde-2.2)
• Bison
• AST interpreter
• 2008 SquirrelFish
• Bytecode (register-based)
• Direct threading
• 2008 SquirrelFish Extreme
• PICs
• method JIT
• regular expression JIT
• DFG JIT (March 2011)
AST → Bytecode → Interpreter / Method JIT / DFG JIT (SSA)
SpiderMonkey
Monkeys
• SpiderMonkey
• Written by Brendan Eich
• interpreter
• TraceMonkey
• trace JIT
• removed
• JägerMonkey
• PICs
• method JIT (from JSC)
• IonMonkey
• Type Inference
• Compiler optimization
IonMonkey
• SSA
• function inline
• linear-scan register allocation
• dead code elimination
• loop-invariant code motion
• ...
Chakra (IE9)
Chakra
• Interpreter/JIT
• Type System (hidden class)
• PICs
• Delayed parsing
• Uses UTF-8 internally
Unlocking the JavaScript Opportunity with Internet Explorer 9
Carakan (Opera)
Carakan
• Register VM
• Method JIT, Regex JIT
• Hidden type
• Function inline
Rhino and JVM
Rhino is SLOW, why?
Because JVM is slow?
The JVM didn't support dynamic languages well
Solution: invokedynamic
Before: Caller → some tricks → Method (hard for the JVM to optimize)

After: Caller → invokedynamic → method handle → Method
One ring to rule them all?
Rhino + invokedynamic
• Pros
• Easier to implement
• Lots of great Java Libraries
• JVM optimization for free
• Cons
• Only in JVM7
• Not fully optimized yet
• Hard to beat V8
Compiler optimization is HARD
Is there an easy way?
LLVM
LLVM
• Clang, VMKit, GHC, PyPy, Rubinius ...
• DragonEgg: replace GCC back-end
• IR
• Optimization
• Linking, code generation, JIT
• Apple
LLVM simplification
int foo(int bar) {
  int one = 1;
  return bar + one;
}

int main() {
  foo(3);
}
define i32 @foo(i32 %bar) nounwind ssp {
entry:
  %bar_addr = alloca i32, align 4
  %retval = alloca i32
  %0 = alloca i32
  %one = alloca i32
  %"alloca point" = bitcast i32 0 to i32
  store i32 %bar, i32* %bar_addr
  store i32 1, i32* %one, align 4
  %1 = load i32* %bar_addr, align 4
  %2 = load i32* %one, align 4
  %3 = add nsw i32 %1, %2
  store i32 %3, i32* %0, align 4
  %4 = load i32* %0, align 4
  store i32 %4, i32* %retval, align 4
  br label %return

return:
  %retval1 = load i32* %retval
  ret i32 %retval1
}

define i32 @main() nounwind ssp {
entry:
  %retval = alloca i32
  %"alloca point" = bitcast i32 0 to i32
  %0 = call i32 @foo(i32 3) nounwind ssp
  br label %return

return:
  %retval1 = load i32* %retval
  ret i32 %retval1
}
define i32 @foo(i32 %bar) nounwind readnone ssp {
entry:
  %0 = add nsw i32 %bar, 1
  ret i32 %0
}

define i32 @main() nounwind readnone ssp {
entry:
  ret i32 undef
}
Optimization
Optimization (70+)
http://llvm.org/docs/Passes.html
LLVM backend
Figure 4: LLVM system architecture diagram
code in non-conforming languages is executed as “unmanaged code”. Such code is represented in native form and not in the CLI intermediate representation, so it is not exposed to CLI optimizations. These systems do not provide #2 with #1 or #3 because runtime optimization is generally only possible when using JIT code generation. They do not aim to provide #4, and instead provide a rich runtime framework for languages that match their runtime and object model, e.g., Java and C#. Omniware [1] provides #5 and most of the benefits of #2 (because, like LLVM, it uses a low-level representation that permits extensive static optimization), but at the cost of not providing information for high-level analysis and optimization (i.e., #1). It does not aim to provide #3 or #4.
• Transparent binary runtime optimization systems like Dynamo and the runtime optimizers in Transmeta processors provide benefits #2, #4 and #5, but they do not provide #1. They provide benefit #3 only at runtime, and only to a limited extent because they work only on native binary code, limiting the optimizations they can perform.
• Profile Guided Optimization for static languages provides benefit #3 at the cost of not being transparent (they require a multi-phase compilation process). Additionally, PGO suffers from three problems: (1) Empirically, developers are unlikely to use PGO, except when compiling benchmarks. (2) When PGO is used, the application is tuned to the behavior of the training run. If the training run is not representative of the end-user’s usage patterns, performance may not improve and may even be hurt by the profile-driven optimization. (3) The profiling information is completely static, meaning that the compiler cannot make use of phase behavior in the program or adapt to changing usage patterns.
There are also significant limitations of the LLVM strategy. First, language-specific optimizations must be performed in the front-end before generating LLVM code. LLVM is not designed to represent source-language types or features directly. Second, it is an open question whether languages requiring sophisticated runtime systems such as Java can benefit directly from LLVM. We are currently exploring the potential benefits of implementing higher-level virtual machines such as JVM or CLI on top of LLVM.
The subsections below describe the key components of the LLVM compiler architecture, emphasizing design and implementation features that make the capabilities above practical and efficient.
3.2 Compile-Time: External front-end & static optimizer
External static LLVM compilers (referred to as front-ends) translate source-language programs into the LLVM virtual instruction set. Each static compiler can perform three key tasks, of which the first and third are optional: (1) Perform language-specific optimizations, e.g., optimizing closures in languages with higher-order functions. (2) Translate source programs to LLVM code, synthesizing as much useful LLVM type information as possible, especially to expose pointers, structures, and arrays. (3) Invoke LLVM passes for global or interprocedural optimizations at the module level. The LLVM optimizations are built into libraries, making it easy for front-ends to use them.
The front-end does not have to perform SSA construction. Instead, variables can be allocated on the stack (which is not in SSA form), and the LLVM stack promotion and scalar expansion passes can be used to build SSA form effectively. Stack promotion converts stack-allocated scalar values to SSA registers if their address does not escape the current function, inserting φ functions as necessary to preserve SSA form. Scalar expansion precedes this and expands local structures to scalars wherever possible, so that their fields can be mapped to SSA registers as well.
Note that many “high-level” optimizations are not really language-dependent, and are often special cases of more general optimizations that may be performed on LLVM code. For example, both virtual function resolution for object-oriented languages (described in Section 4.1.2) and tail-recursion elimination which is crucial for functional languages can be done in LLVM. In such cases, it is better to extend the LLVM optimizer to perform the transformation, rather than investing effort in code which only benefits a particular front-end. This also allows the optimizations to be performed throughout the lifetime of the program.
3.3 Linker & Interprocedural Optimizer

Link time is the first phase of the compilation process where most⁷ of the program is available for analysis and transformation. As such, link-time is a natural place to perform aggressive interprocedural optimizations across the entire program. The link-time optimizations in LLVM operate on the LLVM representation directly, taking advantage of the semantic information it contains. LLVM currently includes a number of interprocedural analyses, such as a context-sensitive points-to analysis (Data Structure Analysis [31]), call graph construction, and Mod/Ref analysis, and interprocedural transformations like inlining, dead global elimination, dead argument elimination, dead type elimination, constant propagation, array bounds check elimination [28], simple structure field reordering, and Auto-
⁷ Note that shared libraries and system libraries may not be available for analysis at link time, or may be compiled directly to native code.
Proceedings of the International Symposium on Code Generation and Optimization (CGO’04) 0-7695-2102-9/04 $ 20.00 © 2004 IEEE
LLVM on JavaScript
Emscripten
• C/C++ to LLVM IR
• LLVM IR to JavaScript
• Run on browser
define i32 @foo(i32 %bar) nounwind readnone ssp {
entry:
  %0 = add nsw i32 %bar, 1
  ret i32 %0
}

define i32 @main() nounwind readnone ssp {
entry:
  ret i32 undef
}
...
function _foo($bar) {
  var __label__;
  var $0 = ((($bar) + 1) | 0);
  return $0;
}

function _main() {
  var __label__;
  return undef;
}
Module["_main"] = _main;
...
Emscripten demo
• Python, Ruby, Lua virtual machine (http://repl.it/)
• OpenJPEG
• Poppler
• FreeType
• ...
https://github.com/kripken/emscripten/wiki
Performance? Good enough!
benchmark            SM     V8     gcc    ratio
fannkuch (10)        1.158  0.931  0.231  4.04
fasta (2100000)      1.115  1.128  0.452  2.47
primes               1.443  3.194  0.438  3.29
raytrace (7,256)     1.930  2.944  0.228  8.46
dlmalloc (400,400)   5.050  1.880  0.315  5.97
The first column is the name of the benchmark, and in parentheses any parameters used in running it. The source code to all the benchmarks can be found at https://github.com/kripken/emscripten/tree/master/tests (each in a separate file with its name, except for ‘primes’, which is embedded inside runner.py in the function test primes). A brief summary of the benchmarks is as follows:
• fannkuch and fasta are commonly-known benchmarks, appearing for example on the Computer Language Benchmarks Game [8]. They use a mix of mathematic operations (integer in the former, floating-point in the latter) and memory access.
• primes is the simplest benchmark in terms of code. It is basically just a tiny loop that calculates prime numbers.
• raytrace is real-world code, from the sphereflake raytracer [9]. This benchmark has a combination of memory access and floating-point math.
• dlmalloc (Doug Lea’s malloc [10]) is a well-known real-world implementation of malloc and free. This benchmark does a large amount of calls to malloc and free in an intermixed way, which tests memory access and integer calculations.
Returning to the table of results, the second column is the elapsed time (in seconds) when running the compiled code (generated using all Emscripten and LLVM optimizations as well as the Closure Compiler) in the SpiderMonkey JavaScript engine (specifically the JaegerMonkey branch, checked out June 15th, 2011). The third column is the elapsed time when running the same JavaScript code in the V8 JavaScript engine (checked out Jun 15th, 2011). In both the second and third column lower values are better; the best of the two is in bold. The fourth column is the elapsed time when running the original code compiled with gcc -O3, using GCC 4.4.4. The last column is the ratio, that is, how much slower the JavaScript code (running in the faster of the two engines for that test) is when compared to gcc. All the tests were run on a MacBook Pro with an Intel i7 CPU clocked at 2.66GHz, running on Ubuntu 10.04.
Clearly the results greatly vary by the benchmark, with the generated JavaScript running from 2.47 to 8.46 times slower. There are also significant differences between the
[8] http://shootout.alioth.debian.org/
[9] http://ompf.org/ray/sphereflake/
[10] http://en.wikipedia.org/wiki/Malloc#dlmalloc_and_its_derivatives
two JavaScript engines, with each better at some of the benchmarks. It appears that code that does simple numerical operations – like the primes test – can run fairly fast, while code that has a lot of memory accesses, for example due to using structures – like the raytrace test – will be slower. (The main issue with structures is that Emscripten does not ‘nativize’ them yet, as it does to simple local variables.)
Being 2.47 to 8.46 times slower than the most-optimized C++ code is a significant slowdown, but it is still more than fast enough for many purposes, and the main point of course is that the code can run anywhere the web can be accessed. Further work on Emscripten is expected to improve the speed as well, as are improvements to LLVM, the Closure Compiler, and JavaScript engines themselves; see further discussion in the Summary.
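The generated code benchmarked above is ordinary JavaScript. As a sketch of what compiled output of this kind can look like (a hypothetical hand translation, not actual Emscripten output), a simple C loop might become a plain JavaScript loop, with `|0` coercing results back to 32-bit integers the way C arithmetic would wrap them:

```javascript
// C:  int accumulate(int n) {
//       int s = 0;
//       for (int i = 0; i < n; i++) s += i;
//       return s;
//     }
// Hand-written sketch of a "natural JavaScript" translation:
// ordinary control flow and arithmetic, with |0 forcing 32-bit
// integer semantics at each step.
function accumulate(n) {
  n = n | 0;
  var s = 0, i = 0;
  for (i = 0; (i | 0) < (n | 0); i = (i + 1) | 0) {
    s = (s + i) | 0;
  }
  return s | 0;
}

console.log(accumulate(5)); // 10
```

Because the output is a normal loop over small integers, JavaScript engines can apply their usual optimizations to it, which is the whole premise of the approach.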
2.3 Limitations

Emscripten’s compilation approach, as has been described in this Section so far, is to generate ‘natural’ JavaScript, as close as possible to normal JavaScript on the web, so that modern JavaScript engines perform well on it. In particular, we try to generate ‘normal’ JavaScript operations, like regular addition and multiplication and so forth. This is a very different approach than, say, emulating a CPU at a low level, or, in the case of LLVM, writing an LLVM bitcode interpreter in JavaScript. The latter approach has the benefit of being able to run virtually any compiled code, at the cost of speed, whereas Emscripten makes a tradeoff in the other direction. We will now give a summary of some of the limitations of Emscripten’s approach.
• 64-bit Integers: JavaScript numbers are all 64-bit doubles, with engines typically implementing them as 32-bit integers where possible for speed. A consequence of this is that it is impossible to directly implement 64-bit integers in JavaScript, as integer values larger than 32 bits will become doubles, with only 53 bits for the significand. Thus, when Emscripten uses normal JavaScript addition and so forth for 64-bit integers, it runs the risk of rounding effects. This could be solved by emulating 64-bit integers, but it would be much slower than native code.
• Multithreading: JavaScript has Web Workers, which are additional threads (or processes) that communicate via message passing. There is no shared state in this model, which means that it is not directly possible to compile multithreaded C++ code into JavaScript. A partial solution could be to emulate threads, without Workers, by manually controlling which blocks of code run (a variation on the switch-in-a-loop construction mentioned earlier) and manually switching between threads every so often. However, in that case there would not be any utilization of additional CPU cores, and furthermore performance would be slow due to not using normal JavaScript loops.
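The 64-bit rounding hazard is easy to demonstrate directly: past 2^53 a double can no longer represent every integer, so ‘natural’ addition silently rounds. The `add64` helper below is a hypothetical sketch of the two-word emulation mentioned above (not how any particular engine or compiler does it), and hints at why emulation costs speed:

```javascript
// JavaScript numbers are IEEE-754 doubles with a 53-bit significand,
// so integer arithmetic is exact only up to 2^53. Past that, adding
// 1 can be lost entirely to rounding.
var limit = Math.pow(2, 53);      // 9007199254740992

console.log(limit + 1 === limit); // true: 2^53 + 1 rounds back to 2^53
console.log(limit + 2);           // 9007199254740994: representable again

// A 64-bit integer emulated as two unsigned 32-bit halves (low, high).
// Hypothetical sketch: every add now takes several operations plus a
// carry check, instead of one native instruction.
function add64(aLo, aHi, bLo, bHi) {
  var lo = (aLo + bLo) >>> 0;            // unsigned 32-bit addition
  var carry = lo < (aLo >>> 0) ? 1 : 0;  // wraparound means a carry out
  var hi = (aHi + bHi + carry) >>> 0;
  return [lo, hi];
}

console.log(add64(0xFFFFFFFF, 0, 1, 0)); // [0, 1]: carry into high word
```

The emulated form stays exact for the full 64-bit range, at the price of replacing one addition with several operations per arithmetic step.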
JavaScript on LLVM
Fabric Engine
• JavaScript Integration
• Native code compilation (LLVM)
• Multi-threaded execution
• OpenGL Rendering
Fabric Engine
http://fabric-engine.com/2011/11/server-performance-benchmarks/
Conclusion?
David Wheeler

“All problems in computer science can be solved by another level of indirection.”
References

• The Behavior of Efficient Virtual Machine Interpreters on Modern Architectures
• Virtual Machine Showdown: Stack Versus Registers
• The Implementation of Lua 5.0
• Why Is the New Google V8 Engine so Fast?
• Context Threading: A Flexible and Efficient Dispatch Technique for Virtual Machine Interpreters
• Effective Inline-Threaded Interpretation of Java Bytecode Using Preparation Sequences
• Smalltalk-80: The Language and Its Implementation
References

• Design of the Java HotSpot™ Client Compiler for Java 6
• Oracle JRockit: The Definitive Guide
• Virtual Machines: Versatile platforms for systems and processes
• Fast and Precise Hybrid Type Inference for JavaScript
• LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation
• Emscripten: An LLVM-to-JavaScript Compiler
• An Analysis of the Dynamic Behavior of JavaScript Programs
References

• Adaptive Optimization for SELF
• Bytecodes meet Combinators: invokedynamic on the JVM
• Efficient Implementation of the Smalltalk-80 System
• Design, Implementation, and Evaluation of Optimizations in a Just-In-Time Compiler
• Optimizing direct threaded code by selective inlining
• Linear Scan Register Allocation
• Optimizing Invokedynamic
References

• Representing Type Information in Dynamically Typed Languages
• Trace-based Just-in-Time Type Specialization for Dynamic Languages
• The Structure and Performance of Efficient Interpreters
• Know Your Engines: How to Make Your JavaScript Fast
• IE Blog, Chromium Blog, WebKit Blog, Opera Blog, Mozilla Blog, Wingolog’s Blog, RednaxelaFX’s Blog, David Mandelin’s Blog...
Thank you