Virtual Machine & JavaScript Engine @nwind


DESCRIPTION

Introduction to virtual machines and JavaScript engine implementation.


Page 1: Virtual machine and javascript engine

Virtual Machine & JavaScript Engine@nwind

Page 2: Virtual machine and javascript engine

(HLL) Virtual Machine

Page 3: Virtual machine and javascript engine

Take the red pill. I will show you the rabbit hole.

Page 4: Virtual machine and javascript engine

Virtual Machine History

• Pascal 1970

• Smalltalk 1980

• Self 1986

• Python 1991

• Java 1995

• JavaScript 1995

Page 5: Virtual machine and javascript engine

The Smalltalk demonstration showed three amazing features. One was how computers could be networked; the second was how object-oriented programming worked. But Jobs and his team paid little attention to these attributes because they were so amazed by the third feature, ...

Page 6: Virtual machine and javascript engine

How Does a Virtual Machine Work?

• Parser

• Intermediate Representation (IR)

• Interpreter

• Garbage Collection

• Optimization

Page 7: Virtual machine and javascript engine

Parser

• Tokenize

• AST

Page 8: Virtual machine and javascript engine

Tokenize

var foo = 10;

var  → keyword
     → space
foo  → identifier
=    → equal
10   → number
;    → semicolon
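For illustration, a toy lexer in C that splits the statement above into those token kinds. This is a sketch only: the token names and the single-statement scope are assumptions; a real engine scanner handles the whole grammar, Unicode, and error recovery.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Simplified token kinds, matching the labels on the slide above. */
typedef enum { TOK_KEYWORD, TOK_IDENTIFIER, TOK_EQUAL, TOK_NUMBER, TOK_SEMICOLON } TokenKind;

static const char *kind_name(TokenKind k) {
    static const char *names[] = { "keyword", "identifier", "equal", "number", "semicolon" };
    return names[k];
}

int main(void) {
    const char *src = "var foo = 10;";
    const char *p = src;
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }   /* skip whitespace */
        const char *start = p;
        TokenKind kind;
        if (isalpha((unsigned char)*p)) {                    /* keyword or identifier */
            while (isalnum((unsigned char)*p)) p++;
            kind = (p - start == 3 && strncmp(start, "var", 3) == 0) ? TOK_KEYWORD : TOK_IDENTIFIER;
        } else if (isdigit((unsigned char)*p)) {             /* number literal */
            while (isdigit((unsigned char)*p)) p++;
            kind = TOK_NUMBER;
        } else if (*p == '=') { p++; kind = TOK_EQUAL; }
        else if (*p == ';') { p++; kind = TOK_SEMICOLON; }
        else { p++; continue; }                              /* ignore anything else */
        printf("%-10s '%.*s'\n", kind_name(kind), (int)(p - start), start);
    }
    return 0;
}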

Page 9: Virtual machine and javascript engine

AST

Assign
  ├─ Variable foo
  └─ Constant 10

Page 10: Virtual machine and javascript engine

{ "type": "Program", "body": [ { "type": "VariableDeclaration", "declarations": [ { "id": { "type": "Identifier", "name": "foo" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Identifier", "name": "bar" }, "right": { "type": "Literal", "value": 1 } } } ], "kind": "var" } ]}

AST demo (Esprima)

var foo = bar + 1;

http://esprima.org/demo/parse.html

Page 11: Virtual machine and javascript engine

Intermediate Representation

• Bytecode

• Stack vs. register

Page 12: Virtual machine and javascript engine

00000: deffun 0 null
00005: nop
00006: callvar 0
00009: int8 2
00011: call 1
00014: pop
00015: stop

foo:
00020: getarg 0
00023: one
00024: add
00025: return
00026: stop

Bytecode (SpiderMonkey)

function foo(bar) {
  return bar + 1;
}

foo(2);

Page 13: Virtual machine and javascript engine

8 m_instructions; 168 bytes at 0x7fc1ba3070e0; 1 parameter(s); 10 callee register(s)

[ 0] enter
[ 1] mov                 r0, undefined(@k0)
[ 4] get_global_var      r1, 5
[10] mov                 r2, undefined(@k0)
[10] mov                 r3, 2(@k1)
[13] call                r1, 2, 10
[17] op_call_put_result  r0
[19] end                 r0

Constants: k0 = undefined, k1 = 2

3 m_instructions; 64 bytes at 0x7fc1ba306e80; 2 parameter(s); 1 callee register(s)

[ 0] enter
[ 1] add  r0, r-7, 1(@k0)
[ 6] ret  r0

Constants: k0 = 1

End: 3

Bytecode (JSC)

function foo(bar) {
  return bar + 1;
}

foo(2);

Page 14: Virtual machine and javascript engine

Stack vs. register

• Stack

• JVM, .NET, PHP, Python, old JavaScript engines

• Register

• Lua, Dalvik, all modern JavaScript engines

• Smaller, faster (about 30%)

• RISC

Page 15: Virtual machine and javascript engine

Stack vs. register

Register-based (Lua 5.0):

local a,t,i   1: LOADNIL  0 2 0
a=a+i         2: ADD      0 0 2
a=a+1         3: ADD      0 0 250 ; 1
a=t[i]        4: GETTABLE 0 1 2

Stack-based (Lua 4.0):

local a,t,i   1: PUSHNIL 3
a=a+i         2: GETLOCAL 0   ; a
              3: GETLOCAL 2   ; i
              4: ADD
              5: SETLOCAL 0   ; a
a=a+1         6: GETLOCAL 0   ; a
              7: ADDI 1
              8: SETLOCAL 0   ; a
a=t[i]        9: GETLOCAL 1   ; t
             10: GETINDEXED 2 ; i
             11: SETLOCAL 0   ; a

Page 16: Virtual machine and javascript engine

Interpreter

• Switch statement

• Direct threading, Indirect threading, Token threading ...

Page 17: Virtual machine and javascript engine

while (true) {
    switch (opcode) {
        case ADD:
            ...
            break;
        case SUB:
            ...
            break;
        ...
    }
}

Switch statement

mov  %edx,0xffffffffffffffe4(%rbp)
cmpl $0x1,0xffffffffffffffe4(%rbp)
je   6e <interpret+0x6e>
cmpl $0x1,0xffffffffffffffe4(%rbp)
jb   4a <interpret+0x4a>
cmpl $0x2,0xffffffffffffffe4(%rbp)
je   93 <interpret+0x93>
jmp  22 <interpret+0x22>
...

Page 18: Virtual machine and javascript engine

typedef void *Inst;
Inst program[] = { &&ADD, &&SUB };
Inst *ip = program;
goto *ip++;

ADD: ... goto *ip++;

SUB: ... goto *ip++;

Direct threading

mov  0xffffffffffffffe8(%rbp),%rdx
lea  0xffffffffffffffe8(%rbp),%rax
addq $0x8,(%rax)
mov  %rdx,0xffffffffffffffd8(%rbp)
jmpq *0xffffffffffffffd8(%rbp)

ADD: ...
  mov  0xffffffffffffffe8(%rbp),%rdx
  lea  0xffffffffffffffe8(%rbp),%rax
  addq $0x8,(%rax)
  mov  %rdx,0xffffffffffffffd8(%rbp)
  jmp  2c <interpreter+0x2c>

http://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html

Page 19: Virtual machine and javascript engine

Garbage Collection

• Reference counting (PHP, Python, ...), smart pointers

• Tracing

• Stop the world

• Copying, Mark-and-sweep, Mark-and-compact

• Generational GC

• Precise vs. conservative
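As a concrete illustration of the tracing family above, here is a minimal stop-the-world mark-and-sweep sketch in C. It assumes a toy object layout with two reference fields and uses recursion for marking; real collectors use an explicit mark stack, plus generations, barriers, and so on.

#include <stdio.h>
#include <stdlib.h>

/* Toy heap object: every object can reference up to two others. */
typedef struct Obj {
    struct Obj *refs[2];
    struct Obj *next;      /* all-objects list, used by the sweep phase */
    int marked;
} Obj;

static Obj *heap = NULL;   /* head of the all-objects list */

static Obj *new_obj(void) {
    Obj *o = calloc(1, sizeof(Obj));
    o->next = heap;
    heap = o;
    return o;
}

/* Mark phase: follow references from a root, flagging everything reachable. */
static void mark(Obj *o) {
    if (o == NULL || o->marked) return;
    o->marked = 1;
    mark(o->refs[0]);
    mark(o->refs[1]);
}

/* Sweep phase: free everything unmarked, clear marks for the next cycle. */
static void sweep(void) {
    Obj **link = &heap;
    while (*link) {
        Obj *o = *link;
        if (!o->marked) { *link = o->next; free(o); }
        else            { o->marked = 0; link = &o->next; }
    }
}

int main(void) {
    Obj *root = new_obj();
    root->refs[0] = new_obj();   /* reachable */
    new_obj();                   /* garbage: nothing points to it */

    mark(root);                  /* "stop the world", trace from the roots */
    sweep();

    int live = 0;
    for (Obj *o = heap; o; o = o->next) live++;
    printf("live objects after GC: %d\n", live);   /* prints 2 */
    return 0;
}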

Page 20: Virtual machine and javascript engine

Precise vs. conservative

• Conservative

• If it looks like a pointer, treat it as a pointer

• Might leak memory

• Can’t move objects, which causes memory fragmentation

• Precise

• Indirectly vs. Directly reference
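A sketch of the conservative rule "if it looks like a pointer, treat it as a pointer": scan a region word by word and keep anything whose value falls inside the heap's address range. The single-block heap and the fake stack frame below are assumptions made for brevity.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Pretend heap: one contiguous block, so a simple range check stands in for
 * "is this address inside a live allocation?". */
static char *heap_lo, *heap_hi;

/* Conservative scan: any word that happens to fall inside the heap range is
 * treated as a pointer, so the object it "points to" must be kept alive and
 * can never be moved. */
static int scan_region(uintptr_t *from, uintptr_t *to) {
    int hits = 0;
    for (uintptr_t *p = from; p < to; p++) {
        uintptr_t word = *p;
        if (word >= (uintptr_t)heap_lo && word < (uintptr_t)heap_hi)
            hits++;
    }
    return hits;
}

int main(void) {
    heap_lo = malloc(4096);
    heap_hi = heap_lo + 4096;

    /* Simulated stack frame: one real pointer and one plain integer. */
    uintptr_t frame[2];
    frame[0] = (uintptr_t)(heap_lo + 128);
    frame[1] = 42;

    printf("pointer-looking words: %d\n", scan_region(frame, frame + 2));
    free(heap_lo);
    return 0;
}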

Page 21: Virtual machine and javascript engine
Page 22: Virtual machine and javascript engine

It is time for the DARK Magic

Page 23: Virtual machine and javascript engine
Page 24: Virtual machine and javascript engine

Optimization Magic

• Interpreter optimization

• Compiler optimization

• JIT

• Type inference

• Hidden Type

• Method inlining, PICs

Page 25: Virtual machine and javascript engine

Interpreter optimization

Page 26: Virtual machine and javascript engine

Why is switch dispatch inefficient?

Page 27: Virtual machine and javascript engine

CPU Pipeline

• Fetch, Decode, Execute, Write-back

• Branch prediction

Page 29: Virtual machine and javascript engine

ICONST_1_START: *sp++ = 1;
ICONST_1_END:   goto **(pc++);
INEG_START:     sp[-1] = -sp[-1];
INEG_END:       goto **(pc++);
DISPATCH_START: goto **(pc++);
DISPATCH_END:   ;

size_t iconst_size   = (&&ICONST_1_END - &&ICONST_1_START);
size_t ineg_size     = (&&INEG_END - &&INEG_START);
size_t dispatch_size = (&&DISPATCH_END - &&DISPATCH_START);

void *buf = malloc(iconst_size + ineg_size + dispatch_size);
void *current = buf;
memcpy(current, &&ICONST_1_START, iconst_size);  current += iconst_size;
memcpy(current, &&INEG_START, ineg_size);        current += ineg_size;
memcpy(current, &&DISPATCH_START, dispatch_size);
...

goto **buf;

Solution: Inline Threading

Interpreter? JIT!

Page 30: Virtual machine and javascript engine

Compiler optimization

Page 31: Virtual machine and javascript engine

Compiler optimization

• SSA

• Data-flow

• Control-flow

• Loop

• ...

Page 32: Virtual machine and javascript engine

http://www.oracle.com/us/technologies/java/java7-renaissance-vm-428200.pdf | © 2011 Oracle Corporation

What a JVM can do...

compiler tactics: delayed compilation, tiered compilation, on-stack replacement, delayed reoptimization, program dependence graph representation, static single assignment representation

proof-based techniques: exact type inference, memory value inference, memory value tracking, constant folding, reassociation, operator strength reduction, null check elimination, type test strength reduction, type test elimination, algebraic simplification, common subexpression elimination, integer range typing

flow-sensitive rewrites: conditional constant propagation, dominating test detection, flow-carried type narrowing, dead code elimination

language-specific techniques: class hierarchy analysis, devirtualization, symbolic constant propagation, autobox elimination, escape analysis, lock elision, lock fusion, de-reflection

speculative (profile-based) techniques: optimistic nullness assertions, optimistic type assertions, optimistic type strengthening, optimistic array length strengthening, untaken branch pruning, optimistic N-morphic inlining, branch frequency prediction, call frequency prediction

memory and placement transformation: expression hoisting, expression sinking, redundant store elimination, adjacent store fusion, card-mark elimination, merge-point splitting

loop transformations: loop unrolling, loop peeling, safepoint elimination, iteration range splitting, range check elimination, loop vectorization

global code shaping: inlining (graph integration), global code motion, heat-based code layout, switch balancing, throw inlining

control flow graph transformation: local code scheduling, local code bundling, delay slot filling, graph-coloring register allocation, linear scan register allocation, live range splitting, copy coalescing, constant splitting, copy removal, address mode matching, instruction peepholing, DFA-based code generator


Page 33: Virtual machine and javascript engine

Just-In-Time (JIT)

Page 34: Virtual machine and javascript engine

JIT

• Method JIT, Trace JIT, Regular expression JIT

• Code generation

• Register allocation

Page 35: Virtual machine and javascript engine

How Does a JIT Work?

• mmap/new/malloc (mprotect)

• generate native code

• c cast/reinterpret_cast

• call the function
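A minimal sketch of those four steps on x86-64 System V (Linux/macOS style). The hand-assembled bytes implement "return arg + 1"; a production engine would use its own executable allocator and code generator rather than a fixed byte array.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    /* mov eax, edi ; add eax, 1 ; ret  (first int argument arrives in edi) */
    unsigned char code[] = { 0x89, 0xf8, 0x83, 0xc0, 0x01, 0xc3 };

    /* 1. Get a writable page (mmap). */
    void *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return 1;

    /* 2. Generate (here: just copy) the native code into it. */
    memcpy(mem, code, sizeof(code));

    /* 3. Flip the page to executable (W^X) and cast it to a function pointer. */
    mprotect(mem, 4096, PROT_READ | PROT_EXEC);
    int (*add_one)(int) = (int (*)(int))mem;

    /* 4. Call the freshly generated function. */
    printf("%d\n", add_one(41));   /* prints 42 */

    munmap(mem, 4096);
    return 0;
}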

Page 36: Virtual machine and javascript engine

asm (".text\n"".globl " SYMBOL_STRING(ctiTrampoline) "\n"HIDE_SYMBOL(ctiTrampoline) "\n"SYMBOL_STRING(ctiTrampoline) ":" "\n" "pushl %ebp" "\n" "movl %esp, %ebp" "\n" "pushl %esi" "\n" "pushl %edi" "\n" "pushl %ebx" "\n" "subl $0x3c, %esp" "\n" "movl $512, %esi" "\n" "movl 0x58(%esp), %edi" "\n" "call *0x50(%esp)" "\n" "addl $0x3c, %esp" "\n" "popl %ebx" "\n" "popl %edi" "\n" "popl %esi" "\n" "popl %ebp" "\n" "ret" "\n");

Trampoline (JSC x86)

// Execute the code!
inline JSValue execute(RegisterFile* registerFile, CallFrame* callFrame, JSGlobalData* globalData)
{
    JSValue result = JSValue::decode(
        ctiTrampoline(
            m_ref.m_code.executableAddress(),
            registerFile,
            callFrame,
            0,
            Profiler::enabledProfilerReference(),
            globalData));
    return globalData->exception ? jsNull() : result;
}

Page 37: Virtual machine and javascript engine

Register allocation

• Linear scan

• Graph coloring
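A simplified linear-scan sketch (after Poletto & Sarkar): live intervals sorted by start are walked once, intervals that have ended free their registers, and an interval that finds no free register is spilled. The interval data is invented for the example, and the published algorithm spills the active interval that ends furthest away rather than always the current one.

#include <stdio.h>

#define NUM_REGS 2

/* Live interval of a virtual register: [start, end] in instruction numbers. */
typedef struct { const char *name; int start, end; int reg; } Interval;

static void linear_scan(Interval *iv, int n) {       /* iv must be sorted by start */
    Interval *active[NUM_REGS] = { 0 };
    for (int i = 0; i < n; i++) {
        int reg = -1;
        for (int r = 0; r < NUM_REGS; r++) {
            /* Expire intervals that ended before this one starts. */
            if (active[r] && active[r]->end < iv[i].start) active[r] = NULL;
            if (!active[r] && reg < 0) reg = r;       /* remember a free register */
        }
        iv[i].reg = reg;                              /* -1 means "spilled to the stack" */
        if (reg >= 0) active[reg] = &iv[i];
    }
}

int main(void) {
    Interval iv[] = {
        { "a", 0, 6, 0 }, { "b", 1, 3, 0 }, { "c", 2, 8, 0 }, { "d", 4, 7, 0 },
    };
    linear_scan(iv, 4);
    for (int i = 0; i < 4; i++) {
        if (iv[i].reg < 0) printf("%s -> spill\n", iv[i].name);
        else               printf("%s -> r%d\n", iv[i].name, iv[i].reg);
    }
    return 0;
}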

Page 38: Virtual machine and javascript engine

Code generation

• Pipelining

• SIMD (SSE2, SSE3 ...)

• Debug

Page 39: Virtual machine and javascript engine

Type inference

Page 40: Virtual machine and javascript engine

a + b
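Why a bare "a + b" is expensive without type information: a generic interpreter add has to branch on the runtime type of both operands every single time, which is exactly the work that type inference (or profiling plus guards) lets the JIT remove. The value representation below is a toy, not any engine's actual one.

#include <stdio.h>

typedef struct {
    enum { T_INT, T_DOUBLE, T_STRING } type;
    union { int i; double d; const char *s; } as;
} Value;

/* The generic "+" an interpreter falls back to when it knows nothing. */
static Value generic_add(Value a, Value b) {
    Value r;
    if (a.type == T_INT && b.type == T_INT) {                /* int + int fast case */
        r.type = T_INT;  r.as.i = a.as.i + b.as.i;
    } else if (a.type != T_STRING && b.type != T_STRING) {   /* mixed numeric case */
        double x = a.type == T_INT ? a.as.i : a.as.d;
        double y = b.type == T_INT ? b.as.i : b.as.d;
        r.type = T_DOUBLE;  r.as.d = x + y;
    } else {                                                 /* string concatenation, etc. */
        r.type = T_STRING;  r.as.s = "<concatenated>";       /* elided in this sketch */
    }
    return r;
}

int main(void) {
    Value a = { T_INT, { .i = 40 } }, b = { T_INT, { .i = 2 } };
    printf("%d\n", generic_add(a, b).as.i);   /* a JIT with type info emits just one add */
    return 0;
}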

Page 41: Virtual machine and javascript engine
Page 42: Virtual machine and javascript engine

Property access

Page 43: Virtual machine and javascript engine

“foo.bar”

Page 44: Virtual machine and javascript engine

00001f63  movl  %ecx,0x04(%edx)

foo.bar in C

Page 45: Virtual machine and javascript engine

__ZN2v88internal7HashMap6LookupEPvjb:
00000338  pushl %ebp
00000339  pushl %ebx
0000033a  pushl %edi
0000033b  pushl %esi
0000033c  subl  $0x0c,%esp
0000033f  movl  0x20(%esp),%esi
00000343  movl  0x08(%esi),%eax
00000346  movl  0x0c(%esi),%ecx
00000349  imull $0x0c,%ecx,%edi
0000034c  leal  0xff(%ecx),%ecx
0000034f  addl  %eax,%edi
00000351  movl  0x28(%esp),%ebx
00000355  andl  %ebx,%ecx
00000357  imull $0x0c,%ecx,%ebp
0000035a  addl  %eax,%ebp
0000035c  jmp   0x0000036a
0000035e  nop
00000360  addl  $0x0c,%ebp
00000363  cmpl  %edi,%ebp
00000365  jb    0x0000036a
00000367  movl  0x08(%esi),%ebp
0000036a  movl  0x00(%ebp),%eax
0000036d  testl %eax,%eax
0000036f  je    0x0000038b
00000371  cmpl  %ebx,0x08(%ebp)
00000374  jne   0x00000360
00000376  movl  %eax,0x04(%esp)
0000037a  movl  0x24(%esp),%eax
0000037e  movl  %eax,(%esp)
00000381  call  *0x04(%esi)
00000384  testb %al,%al
00000386  je    0x00000360
00000388  movl  0x00(%ebp),%eax
0000038b  testl %eax,%eax
0000038d  jne   0x00000418
00000393  cmpb  $0x00,0x2c(%esp)
00000398  jne   0x0000039e
0000039a  xorl  %ebp,%ebp
0000039c  jmp   0x00000418
0000039e  movl  0x24(%esp),%eax
000003a2  movl  %eax,0x00(%ebp)
000003a5  movl  $0x00000000,0x04(%ebp)
000003ac  movl  %ebx,0x08(%ebp)
000003af  movl  0x10(%esi),%eax
000003b2  leal  0x01(%eax),%ecx
000003b5  movl  %ecx,0x10(%esi)
000003b8  shrl  $0x02,%ecx
000003bb  leal  0x01(%ecx,%eax),%eax
... 27 lines more

foo.bar in JavaScript

__ZN2v88internal7HashMap6LookupEPvjb

means:

v8::internal::HashMap::Lookup(void*, unsigned int, bool)

Page 46: Virtual machine and javascript engine

How to optimize?

Page 47: Virtual machine and javascript engine

Hidden Type

add property x, then add property y

http://code.google.com/apis/v8/design.html
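A sketch of the idea behind hidden classes (the struct layout below is invented, not V8's actual maps): each shape records which property lives in which slot, plus the transition taken when a new property is added, so objects built the same way end up sharing a shape and property offsets become predictable.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PROPS 4

/* A "hidden class" (V8: map, JSC: structure, SpiderMonkey: shape). */
typedef struct Shape {
    const char *keys[MAX_PROPS];
    int count;
    const char *transition_key;      /* "adding this key leads to ..." */
    struct Shape *transition;        /* ... this child shape */
} Shape;

typedef struct {
    Shape *shape;
    double slots[MAX_PROPS];         /* property values live at fixed offsets */
} Object;

static Shape root_shape;             /* the empty-object shape */

/* Adding a property either follows an existing transition (shape is shared)
 * or creates a new child shape and records the transition. */
static void add_property(Object *obj, const char *key, double value) {
    Shape *s = obj->shape;
    if (s->transition && strcmp(s->transition_key, key) == 0) {
        obj->shape = s->transition;                  /* reuse the cached transition */
    } else {
        Shape *child = calloc(1, sizeof(Shape));
        memcpy(child->keys, s->keys, sizeof(s->keys));
        child->count = s->count;
        child->keys[child->count++] = key;
        s->transition_key = key;
        s->transition = child;
        obj->shape = child;
    }
    obj->slots[obj->shape->count - 1] = value;
}

int main(void) {
    Object a = { &root_shape }, b = { &root_shape };
    add_property(&a, "x", 1);  add_property(&a, "y", 2);   /* builds {} -> {x} -> {x,y} */
    add_property(&b, "x", 3);  add_property(&b, "y", 4);   /* follows the same transitions */
    printf("a and b share a hidden class: %s\n", a.shape == b.shape ? "yes" : "no");
    return 0;
}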

Page 48: Virtual machine and javascript engine

But nothing is perfect

Page 49: Virtual machine and javascript engine

One secret of the V8 hidden class

http://jsperf.com/test-v8-delete

20x slower!

Page 50: Virtual machine and javascript engine

But properties are rarely deleted

Figure 3 gives the average size of the code functions occurring in the JavaScript program source. These seem fairly consistent across sites. More interestingly, Figure 4 shows the number of events per function, which roughly corresponds to the number of bytecodes evaluated by the interpreter (note that some low-level bytecodes such as branches and arithmetic are not recorded in the trace). It is interesting to note that the median is fairly high, around 20 events. This suggests that, in contrast to Java, there are fewer short methods (e.g. accessors) in JavaScript and thus possibly fewer opportunities to benefit from inlining optimizations.

5.2 Instruction Mix

The instruction mix of JavaScript programs is also fairly traditional: more read operations are expected than write operations. As shown in Figure 5, reads are far more common than writes: over all traces the proportion of reads to writes is 6 to 1. Deletes comprise only .1% of all events. That graph further breaks reads, writes and deletes into various specific types; prop refers to accesses


Figure 5. Instruction mix. The per-site proportion of read, write, delete, and call instructions (averaged over multiple traces).


Figure 6. Prototype chain length. The per-site quartile and maximum prototype chain lengths.

using dot notation (e.g. x.f), hash refers to access using indexing notation (e.g. x[s]), indx refers to accesses using indexing notation with a numeric argument. The overall number of calls is high, 20%, as the interpreter does not perform any inlining. Exception handling is rather infrequent with a grand total of 1,328 throws over 478 million trace events. There are some outliers such as ISHK, WORD and DIGG where updates are a much smaller proportion of operations (and influenced by the sheer number of objects in these sites), but otherwise the traces are consistent.

5.3 Prototype Chains

One higher-level metric is the length of an object’s prototype chain, which is the number of prototype objects that may potentially be traversed in order to find an object’s inherited property. This is roughly comparable to metrics of the depth of class hierarchies in class-based languages, such as the Depth of Inheritance (DIT) metric discussed in [23]. Studies of C++ programs mention a maximum DIT of 8 and a median of 1, whereas Smalltalk has a median of 3 and maximum of 10. Figure 6 shows that in all but four sites, the median prototype chain length is 1. Note that we start our graph at chain length 1, the minimum. All objects except Object.prototype have at least one prototype, which if unspecified, defaults to the Object.prototype. The maximum observed prototype chain length is 10. The majority of sites do not seem to use prototypes for code reuse, but this is possibly explained by the existence of other ways to achieve code reuse in JavaScript (i.e., the ability to assign closures directly into a field of an object). The programs that do utilize prototypes have similar inheritance properties to Java [23].

5.4 Object Kinds

Figure 7 breaks down the kinds of objects allocated at run-time into a number of categories. There are a number of frequently used built-in data types: dates (Date), regular expressions (RegExp), document and layout objects (DOM), arrays (Array) and runtime errors. The remaining objects are separated into four groups: anonymous objects, instances, functions, and prototypes. Anonymous objects are constructed with an object literal using the {...} notation, while instances are constructed by calls of the form new C(...). A function object is created for every function expression evaluated by the interpreter and a prototype object is automatically added to every function in case it is used as a constructor. Over all sites and traces, arrays account for 31% of objects allocated. Dates and DOM objects come next with 12% and 14%, respectively. Functions, prototypes, and instances each account for 10% of the allocated objects, and finally anonymous objects account for


Figure 7. Kinds of allocated objects. The per-site proportion of runtime object kinds (averaged over multiple traces).

Only 0.1% delete

An Analysis of the Dynamic Behavior of JavaScript Programs

Page 51: Virtual machine and javascript engine

Optimize method call

Page 52: Virtual machine and javascript engine

function foo(bar) {
  return bar.pro();
}

bar can be anything

Page 53: Virtual machine and javascript engine

Adaptive Optimization for SELF

Page 54: Virtual machine and javascript engine

Polymorphic inline cache
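A sketch of a polymorphic inline cache for one property-access site (toy object model; real PICs patch generated code and fall back to a megamorphic path when the table fills up): the site remembers the (hidden class, offset) pairs it has already seen, so repeated accesses skip the slow lookup.

#include <stdio.h>
#include <string.h>

#define PIC_SIZE 4
#define MAX_PROPS 4

/* Minimal object model: a type (stand-in for a hidden class) lists property
 * names; objects store values at the matching slot offsets. */
typedef struct { const char *keys[MAX_PROPS]; int count; } Type;
typedef struct { const Type *type; double slots[MAX_PROPS]; } Object;

/* One property-access site ("obj.x") owns one inline cache. */
typedef struct { const Type *type; int offset; } PicEntry;
typedef struct { PicEntry entries[PIC_SIZE]; int count; } Pic;

static int slow_lookup(const Type *t, const char *key) {
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->keys[i], key) == 0) return i;
    return -1;
}

static double get_property(Pic *pic, const Object *obj, const char *key) {
    /* Fast path: this type was seen at this site before, reuse the offset. */
    for (int i = 0; i < pic->count; i++)
        if (pic->entries[i].type == obj->type)
            return obj->slots[pic->entries[i].offset];

    /* Slow path: full lookup, then grow the cache (monomorphic -> polymorphic). */
    int offset = slow_lookup(obj->type, key);
    if (offset < 0) return 0;                        /* "undefined" stand-in */
    if (pic->count < PIC_SIZE)
        pic->entries[pic->count++] = (PicEntry){ obj->type, offset };
    return obj->slots[offset];
}

int main(void) {
    Type point = { { "x", "y" }, 2 }, circle = { { "r", "x", "y" }, 3 };
    Object p = { &point,  { 1, 2 } };
    Object c = { &circle, { 5, 3, 4 } };
    Pic site = { 0 };                                /* the cache for one "obj.x" site */

    printf("%g %g\n", get_property(&site, &p, "x"), get_property(&site, &c, "x"));
    printf("cache entries at this site: %d\n", site.count);   /* 2: polymorphic */
    return 0;
}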

Page 55: Virtual machine and javascript engine

Tagged pointer

Page 56: Virtual machine and javascript engine

typedef union {
    void *p;
    double d;
    long l;
} Value;

typedef struct {
    unsigned char type;
    Value value;
} Object;

Object a;

Tagged pointer

sizeof(a)?? If everything is an object, there is too much overhead for small integers.

Page 57: Virtual machine and javascript engine

Tagged pointer

In almost all systems, pointer addresses are aligned (to 4 or 8 bytes)

http://www.gnu.org/s/libc/manual/html_node/Aligned-Memory-Blocks.html

“The address of a block returned by malloc or realloc in the GNU system is always a multiple of eight (or sixteen on 64-bit systems). ”

Page 58: Virtual machine and javascript engine

Tagged pointer

Example: 0xc00ab958 — the pointer’s last 2 or 3 bits must be 0

low bits 1 0 0 0 (e.g. 0x...8) → Pointer
low bits 1 0 0 1 (e.g. 0x...9) → Small Number
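A sketch of that encoding in C (the helper names are made up): because aligned allocation keeps the low pointer bits zero, the lowest bit can mark small integers, and anything with the bit clear is treated as a heap pointer.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uintptr_t Value;

/* Small integers are shifted left and tagged with the low bit set;
 * aligned pointers already have that bit clear. */
static Value    from_int(intptr_t i) { return ((uintptr_t)i << 1) | 1; }
static intptr_t to_int(Value v)      { return (intptr_t)v >> 1; }
static Value    from_ptr(void *p)    { return (uintptr_t)p; }
static void    *to_ptr(Value v)      { return (void *)v; }
static int      is_int(Value v)      { return v & 1; }

int main(void) {
    double *boxed = malloc(sizeof(double));   /* malloc returns aligned memory */
    *boxed = 3.14;

    Value a = from_int(42);
    Value b = from_ptr(boxed);

    if (is_int(a))  printf("small int: %ld\n", (long)to_int(a));
    if (!is_int(b)) printf("heap double: %g\n", *(double *)to_ptr(b));

    assert(((uintptr_t)boxed & 1) == 0);      /* alignment guarantees the tag bit is free */
    free(boxed);
    return 0;
}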

Page 59: Virtual machine and javascript engine

How about double?

Page 60: Virtual machine and javascript engine

* The top 16 bits denote the type of the encoded JSValue:
*
* Pointer  {  0000:PPPP:PPPP:PPPP
*          /  0001:****:****:****
* Double   {  ...
*          \  FFFE:****:****:****
* Integer  {  FFFF:0000:IIII:IIII

NaN-tagging (JSC, 64-bit)

On a 64-bit system only the low 48 bits of a pointer are actually used, so the top 16 bits are always 0 and can carry a type tag.
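A simplified sketch of this 64-bit encoding (it ignores booleans/undefined and assumes NaN payloads are normalized before encoding, which the real engine takes care of): doubles are shifted up by 2^48 so their top 16 bits can never collide with the pointer tag 0x0000 or the integer tag 0xFFFF.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint64_t JSValue;

#define DOUBLE_OFFSET ((uint64_t)1 << 48)       /* shifts doubles into 0001..FFFE */
#define INT_TAG       ((uint64_t)0xFFFF << 48)  /* top 16 bits for 32-bit integers */

static JSValue from_double(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits + DOUBLE_OFFSET;
}
static double to_double(JSValue v) {
    uint64_t bits = v - DOUBLE_OFFSET;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}
static JSValue from_int32(int32_t i) { return INT_TAG | (uint32_t)i; }
static int32_t to_int32(JSValue v)   { return (int32_t)v; }

static int is_int32(JSValue v)  { return (v >> 48) == 0xFFFF; }
static int is_double(JSValue v) { return v >= DOUBLE_OFFSET && !is_int32(v); }
/* Anything below DOUBLE_OFFSET has its top 16 bits clear: a 48-bit pointer. */

int main(void) {
    JSValue a = from_int32(-7);
    JSValue b = from_double(3.25);
    if (is_int32(a))  printf("int: %d\n", to_int32(a));
    if (is_double(b)) printf("double: %g\n", to_double(b));
    return 0;
}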

Page 61: Virtual machine and javascript engine

V8

Page 62: Virtual machine and javascript engine
Page 63: Virtual machine and javascript engine

V8

• Lars Bak

• Hidden Class, PICs

• Built-in objects written in JavaScript

• Crankshaft

• Precise generational GC

Page 64: Virtual machine and javascript engine
Page 65: Virtual machine and javascript engine

Lars Bak

• Implementing VMs since 1988

• Beta

• Self

• HotSpot

Page 66: Virtual machine and javascript engine

Source code → Native code

Source code → High-Level IR → Low-Level IR → Optimized native code   } Crankshaft

Page 67: Virtual machine and javascript engine

Hotspot client compiler

Page 68: Virtual machine and javascript engine

Crankshaft

• Profiling

• Compiler optimization

• On-stack replacement

• Deoptimize

Page 69: Virtual machine and javascript engine
Page 70: Virtual machine and javascript engine

High-Level IR (Hydrogen)

• function inlining

• type inference

• stack check elimination

• loop-invariant code motion

• common subexpression elimination

• ...

http://wingolog.org/archives/2011/08/02/a-closer-look-at-crankshaft-v8s-optimizing-compiler
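To make one item from the list above concrete, here is loop-invariant code motion done by hand in C; Hydrogen performs the same hoisting automatically on its IR. The functions and data below are invented for illustration.

#include <stdio.h>

/* Before: the multiply does not depend on the loop variable,
 * so it is recomputed needlessly on every iteration. */
static int sum_before(const int *a, int n, int scale, int bias) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += a[i] + scale * bias;     /* scale * bias recomputed n times */
    return total;
}

/* After: the invariant expression is hoisted out of the loop. */
static int sum_after(const int *a, int n, int scale, int bias) {
    int total = 0;
    int hoisted = scale * bias;           /* computed once */
    for (int i = 0; i < n; i++)
        total += a[i] + hoisted;
    return total;
}

int main(void) {
    int a[] = { 1, 2, 3, 4 };
    printf("%d %d\n", sum_before(a, 4, 3, 5), sum_after(a, 4, 3, 5));   /* same result */
    return 0;
}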

Page 71: Virtual machine and javascript engine

Low-Level IR (Lithium)

• linear-scan register allocator

• code generation

• lazy deoptimization

http://wingolog.org/archives/2011/09/05/from-ssa-to-native-code-v8s-lithium-language

Page 72: Virtual machine and javascript engine
Page 73: Virtual machine and javascript engine

function ArraySort(comparefn) {
  if (IS_NULL_OR_UNDEFINED(this) && !IS_UNDETECTABLE(this)) {
    throw MakeTypeError("called_on_null_or_undefined",
                        ["Array.prototype.sort"]);
  }

  // In-place QuickSort algorithm.
  // For short (length <= 22) arrays, insertion sort is used for efficiency.

  if (!IS_SPEC_FUNCTION(comparefn)) {
    comparefn = function (x, y) {
      if (x === y) return 0;
      if (%_IsSmi(x) && %_IsSmi(y)) {
        return %SmiLexicographicCompare(x, y);
      }
      x = ToString(x);
      y = ToString(y);
      if (x == y) return 0;
      else return x < y ? -1 : 1;
    };
  }
  ...

Built-in objects written in JS

v8/src/array.js

Page 74: Virtual machine and javascript engine

GC

Page 75: Virtual machine and javascript engine

V8 performance

Page 76: Virtual machine and javascript engine

Can V8 be faster?

Page 77: Virtual machine and javascript engine

Dart

• Clear syntax, optional types, libraries

• Performance

• Can compile to JavaScript

• But IE, WebKit and Mozilla rejected it

• What do you think?

• My thought: Will XML replace HTML? No. But thanks, Google, for pushing the web forward

Page 78: Virtual machine and javascript engine

Embed V8

Page 79: Virtual machine and javascript engine

Embed

Page 80: Virtual machine and javascript engine

v8::Handle<v8::Value> Print(const v8::Arguments& args) {
  for (int i = 0; i < args.Length(); i++) {
    v8::HandleScope handle_scope;
    v8::String::Utf8Value str(args[i]);
    const char* cstr = ToCString(str);
    printf("%s", cstr);
  }
  return v8::Undefined();
}

v8::Handle<v8::ObjectTemplate> global = v8::ObjectTemplate::New();
global->Set(v8::String::New("print"),
            v8::FunctionTemplate::New(Print));

Expose Function

Page 81: Virtual machine and javascript engine
Page 82: Virtual machine and javascript engine

Node.JS

• Pros

• Async

• One language for everything

• Faster than PHP, Python

• Community

• Cons

• Lack of great libraries

• ES5 code is hard to maintain

• Still too young

Page 83: Virtual machine and javascript engine

JavaScriptCore (Nitro)

Page 84: Virtual machine and javascript engine

Where does it come from?

Page 85: Virtual machine and javascript engine

1997 Macworld

Page 86: Virtual machine and javascript engine

“Apple has decided to make Internet Explorer its default browser on the Macintosh.”

“Since we believe in choice, we’re going to be shipping other Internet browsers...”

Steve Jobs

Page 87: Virtual machine and javascript engine

JavaScriptCore History

• 2001 KJS (kde-2.2)

• Bison

• AST interpreter

• 2008 SquirrelFish

• Bytecode(Register)

• Direct threading

• 2008 SquirrelFish Extreme

• PICs

• method JIT

• regular expression JIT

• DFG JIT (March 2011)

Page 88: Virtual machine and javascript engine

AST → Bytecode → Interpreter
               → Method JIT
               → DFG JIT (SSA)

Page 89: Virtual machine and javascript engine

SpiderMonkey

Page 90: Virtual machine and javascript engine
Page 91: Virtual machine and javascript engine

Monkey

• SpiderMonkey

• Written by Brendan Eich

• interpreter

• TraceMonkey

• trace JIT

• removed

• JägerMonkey

• PICs

• method JIT (from JSC)

• IonMonkey

• Type Inference

• Compiler optimization

Page 92: Virtual machine and javascript engine

IonMonkey

• SSA

• function inline

• linear-scan register allocation

• dead code elimination

• loop-invariant code motion

• ...

Page 93: Virtual machine and javascript engine

http://www.arewefastyet.com/

Page 94: Virtual machine and javascript engine

Chakra (IE9)

Page 95: Virtual machine and javascript engine

Chakra

• Interpreter/JIT

• Type System (hidden class)

• PICs

• Deferred parsing

• Uses UTF-8 internally

Page 96: Virtual machine and javascript engine

Unlocking the JavaScript Opportunity with Internet Explorer 9

Page 97: Virtual machine and javascript engine

Unlocking the JavaScript Opportunity with Internet Explorer 9

Page 98: Virtual machine and javascript engine

Carakan (Opera)

Page 99: Virtual machine and javascript engine

Carakan

• Register VM

• Method JIT, Regex JIT

• Hidden type

• Function inline

Page 100: Virtual machine and javascript engine

Rhino and JVM

Page 101: Virtual machine and javascript engine

Rhino is SLOW, why?

Page 102: Virtual machine and javascript engine

Because JVM is slow?

Page 103: Virtual machine and javascript engine

JVM did’t support dynamic language well

Page 104: Virtual machine and javascript engine

Solution: invokedynamic

Page 105: Virtual machine and javascript engine

Before: Caller → (some tricks) → Method — hard to optimize in the JVM

After: Caller → invokedynamic → method handle → Method

Page 106: Virtual machine and javascript engine

One ring to rule them all?

Page 107: Virtual machine and javascript engine

Rhino + invokedynamic

• Pros

• Easier to implement

• Lots of great Java Libraries

• JVM optimization for free

• Cons

• Only in JVM7

• Not fully optimized yet

• Hard to beat V8

Page 108: Virtual machine and javascript engine

Compiler optimization is HARD

Page 109: Virtual machine and javascript engine

Is there an easy way?

Page 110: Virtual machine and javascript engine

LLVM

Page 111: Virtual machine and javascript engine
Page 112: Virtual machine and javascript engine

LLVM

• Clang, VMKit, GHC, PyPy, Rubinius ...

• DragonEgg: replace GCC back-end

• IR

• Optimization

• Linking, code generation, JIT

• Apple

Page 113: Virtual machine and javascript engine

LLVM, simplified

Page 114: Virtual machine and javascript engine

int foo(int bar) {
    int one = 1;
    return bar + one;
}

int main() {
    foo(3);
}

define i32 @foo(i32 %bar) nounwind ssp {
entry:
  %bar_addr = alloca i32, align 4
  %retval = alloca i32
  %0 = alloca i32
  %one = alloca i32
  %"alloca point" = bitcast i32 0 to i32
  store i32 %bar, i32* %bar_addr
  store i32 1, i32* %one, align 4
  %1 = load i32* %bar_addr, align 4
  %2 = load i32* %one, align 4
  %3 = add nsw i32 %1, %2
  store i32 %3, i32* %0, align 4
  %4 = load i32* %0, align 4
  store i32 %4, i32* %retval, align 4
  br label %return

return:
  %retval1 = load i32* %retval
  ret i32 %retval1
}

define i32 @main() nounwind ssp {
entry:
  %retval = alloca i32
  %"alloca point" = bitcast i32 0 to i32
  %0 = call i32 @foo(i32 3) nounwind ssp
  br label %return

return:
  %retval1 = load i32* %retval
  ret i32 %retval1
}

Page 115: Virtual machine and javascript engine

define i32 @foo(i32 %bar) nounwind ssp {
entry:
  %bar_addr = alloca i32, align 4
  %retval = alloca i32
  %0 = alloca i32
  %one = alloca i32
  %"alloca point" = bitcast i32 0 to i32
  store i32 %bar, i32* %bar_addr
  store i32 1, i32* %one, align 4
  %1 = load i32* %bar_addr, align 4
  %2 = load i32* %one, align 4
  %3 = add nsw i32 %1, %2
  store i32 %3, i32* %0, align 4
  %4 = load i32* %0, align 4
  store i32 %4, i32* %retval, align 4
  br label %return

return:
  %retval1 = load i32* %retval
  ret i32 %retval1
}

define i32 @main() nounwind ssp {
entry:
  %retval = alloca i32
  %"alloca point" = bitcast i32 0 to i32
  %0 = call i32 @foo(i32 3) nounwind ssp
  br label %return

return:
  %retval1 = load i32* %retval
  ret i32 %retval1
}

define i32 @foo(i32 %bar) nounwind readnone ssp {
entry:
  %0 = add nsw i32 %bar, 1
  ret i32 %0
}

define i32 @main() nounwind readnone ssp {
entry:
  ret i32 undef
}

Optimization

Page 116: Virtual machine and javascript engine

Optimization (70+)

http://llvm.org/docs/Passes.html

Page 117: Virtual machine and javascript engine

define i32 @foo(i32 %bar) nounwind readnone ssp {
entry:
  %0 = add nsw i32 %bar, 1
  ret i32 %0
}

define i32 @main() nounwind readnone ssp {
entry:
  ret i32 undef
}

LLVM backend

Page 118: Virtual machine and javascript engine

[Figure 4 diagram residue: compiler front-ends 1..N produce .o files containing LLVM IR; the linker performs IPO/IPA over the program and LLVM libraries; native code generation or a JIT produces the executable ("exe & LLVM"); runtime profile and trace info feed an offline reoptimizer that rewrites the executable's embedded LLVM code.]

Figure 4: LLVM system architecture diagram

code in non-conforming languages is executed as “unmanaged code”. Such code is represented in native form and not in the CLI intermediate representation, so it is not exposed to CLI optimizations. These systems do not provide #2 with #1 or #3 because runtime optimization is generally only possible when using JIT code generation. They do not aim to provide #4, and instead provide a rich runtime framework for languages that match their runtime and object model, e.g., Java and C#. Omniware [1] provides #5 and most of the benefits of #2 (because, like LLVM, it uses a low-level representation that permits extensive static optimization), but at the cost of not providing information for high-level analysis and optimization (i.e., #1). It does not aim to provide #3 or #4.

• Transparent binary runtime optimization systems like Dynamo and the runtime optimizers in Transmeta processors provide benefits #2, #4 and #5, but they do not provide #1. They provide benefit #3 only at runtime, and only to a limited extent because they work only on native binary code, limiting the optimizations they can perform.

• Profile Guided Optimization for static languages provide benefit #3 at the cost of not being transparent (they require a multi-phase compilation process). Additionally, PGO suffers from three problems: (1) Empirically, developers are unlikely to use PGO, except when compiling benchmarks. (2) When PGO is used, the application is tuned to the behavior of the training run. If the training run is not representative of the end-user’s usage patterns, performance may not improve and may even be hurt by the profile-driven optimization. (3) The profiling information is completely static, meaning that the compiler cannot make use of phase behavior in the program or adapt to changing usage patterns.

There are also significant limitations of the LLVM strategy. First, language-specific optimizations must be performed in the front-end before generating LLVM code. LLVM is not designed to represent source languages types or features directly. Second, it is an open question whether languages requiring sophisticated runtime systems such as Java can benefit directly from LLVM. We are currently exploring the potential benefits of implementing higher-level virtual machines such as JVM or CLI on top of LLVM.

The subsections below describe the key components of the LLVM compiler architecture, emphasizing design and implementation features that make the capabilities above practical and efficient.

3.2 Compile-Time: External front-end & static optimizer

External static LLVM compilers (referred to as front-ends) translate source-language programs into the LLVM virtual instruction set. Each static compiler can perform three key tasks, of which the first and third are optional: (1) Perform language-specific optimizations, e.g., optimizing closures in languages with higher-order functions. (2) Translate source programs to LLVM code, synthesizing as much useful LLVM type information as possible, especially to expose pointers, structures, and arrays. (3) Invoke LLVM passes for global or interprocedural optimizations at the module level. The LLVM optimizations are built into libraries, making it easy for front-ends to use them.

The front-end does not have to perform SSA construction. Instead, variables can be allocated on the stack (which is not in SSA form), and the LLVM stack promotion and scalar expansion passes can be used to build SSA form effectively. Stack promotion converts stack-allocated scalar values to SSA registers if their address does not escape the current function, inserting φ functions as necessary to preserve SSA form. Scalar expansion precedes this and expands local structures to scalars wherever possible, so that their fields can be mapped to SSA registers as well.

Note that many “high-level” optimizations are not really language-dependent, and are often special cases of more general optimizations that may be performed on LLVM code. For example, both virtual function resolution for object-oriented languages (described in Section 4.1.2) and tail-recursion elimination which is crucial for functional languages can be done in LLVM. In such cases, it is better to extend the LLVM optimizer to perform the transformation, rather than investing effort in code which only benefits a particular front-end. This also allows the optimizations to be performed throughout the lifetime of the program.

3.3 Linker & Interprocedural Optimizer

Link time is the first phase of the compilation process where most⁷ of the program is available for analysis and transformation. As such, link-time is a natural place to perform aggressive interprocedural optimizations across the entire program. The link-time optimizations in LLVM operate on the LLVM representation directly, taking advantage of the semantic information it contains. LLVM currently includes a number of interprocedural analyses, such as a context-sensitive points-to analysis (Data Structure Analysis [31]), call graph construction, and Mod/Ref analysis, and interprocedural transformations like inlining, dead global elimination, dead argument elimination, dead type elimination, constant propagation, array bounds check elimination [28], simple structure field reordering, and Auto-

⁷ Note that shared libraries and system libraries may not be available for analysis at link time, or may be compiled directly to native code.


Page 119: Virtual machine and javascript engine

LLVM on JavaScript

Page 120: Virtual machine and javascript engine

Emscripten

• C/C++ to LLVM IR

• LLVM IR to JavaScript

• Run on browser

Page 121: Virtual machine and javascript engine

define i32 @foo(i32 %bar) nounwind readnone ssp {
entry:
  %0 = add nsw i32 %bar, 1
  ret i32 %0
}

define i32 @main() nounwind readnone ssp {
entry:
  ret i32 undef
}

...

function _foo($bar) {
  var __label__;
  var $0 = ((($bar)+1)|0);
  return $0;
}

function _main() {
  var __label__;
  return undef;
}
Module["_main"] = _main;

...

Page 122: Virtual machine and javascript engine

Emscripten demo

• Python, Ruby, Lua virtual machine (http://repl.it/)

• OpenJPEG

• Poppler

• FreeType

• ...

https://github.com/kripken/emscripten/wiki

Page 123: Virtual machine and javascript engine

Performance? good enough!

benchmark            SM     V8     gcc    ratio
fannkuch (10)        1.158  0.931  0.231  4.04
fasta (2100000)      1.115  1.128  0.452  2.47
primes               1.443  3.194  0.438  3.29
raytrace (7,256)     1.930  2.944  0.228  8.46
dlmalloc (400,400)   5.050  1.880  0.315  5.97

The first column is the name of the benchmark, and in parentheses any parameters used in running it. The source code to all the benchmarks can be found at https://github.com/kripken/emscripten/tree/master/tests (each in a separate file with its name, except for ‘primes’, which is embedded inside runner.py in the function test primes). A brief summary of the benchmarks is as follows:

• fannkuch and fasta are commonly-known benchmarks, appearing for example on the Computer Language Benchmarks Game⁸. They use a mix of mathematic operations (integer in the former, floating-point in the latter) and memory access.

• primes is the simplest benchmark in terms of code. It is basically just a tiny loop that calculates prime numbers.

• raytrace is real-world code, from the sphereflake raytracer⁹. This benchmark has a combination of memory access and floating-point math.

• dlmalloc (Doug Lea’s malloc¹⁰) is a well-known real-world implementation of malloc and free. This benchmark does a large amount of calls to malloc and free in an intermixed way, which tests memory access and integer calculations.

Returning to the table of results, the second column is the elapsed time (in seconds) when running the compiled code (generated using all Emscripten and LLVM optimizations as well as the Closure Compiler) in the SpiderMonkey JavaScript engine (specifically the JaegerMonkey branch, checked out June 15th, 2011). The third column is the elapsed time when running the same JavaScript code in the V8 JavaScript engine (checked out Jun 15th, 2011). In both the second and third column lower values are better; the best of the two is in bold. The fourth column is the elapsed time when running the original code compiled with gcc -O3, using GCC 4.4.4. The last column is the ratio, that is, how much slower the JavaScript code (running in the faster of the two engines for that test) is when compared to gcc. All the tests were run on a MacBook Pro with an Intel i7 CPU clocked at 2.66GHz, running on Ubuntu 10.04.

Clearly the results greatly vary by the benchmark, with the generated JavaScript running from 2.47 to 8.46 times slower. There are also significant differences between the

⁸ http://shootout.alioth.debian.org/

⁹ http://ompf.org/ray/sphereflake/

¹⁰ http://en.wikipedia.org/wiki/Malloc#dlmalloc_and_its_derivatives

two JavaScript engines, with each better at some of the benchmarks. It appears that code that does simple numerical operations – like the primes test – can run fairly fast, while code that has a lot of memory accesses, for example due to using structures – like the raytrace test – will be slower. (The main issue with structures is that Emscripten does not ‘nativize’ them yet, as it does to simple local variables.)

Being 2.47 to 8.46 times slower than the most-optimized C++ code is a significant slowdown, but it is still more than fast enough for many purposes, and the main point of course is that the code can run anywhere the web can be accessed. Further work on Emscripten is expected to improve the speed as well, as are improvements to LLVM, the Closure Compiler, and JavaScript engines themselves; see further discussion in the Summary.

2.3 Limitations

Emscripten’s compilation approach, as has been described in this Section so far, is to generate ‘natural’ JavaScript, as close as possible to normal JavaScript on the web, so that modern JavaScript engines perform well on it. In particular, we try to generate ‘normal’ JavaScript operations, like regular addition and multiplication and so forth. This is a very different approach than, say, emulating a CPU on a low level, or for the case of LLVM, writing an LLVM bitcode interpreter in JavaScript. The latter approach has the benefit of being able to run virtually any compiled code, at the cost of speed, whereas Emscripten makes a tradeoff in the other direction. We will now give a summary of some of the limitations of Emscripten’s approach.

• 64-bit Integers: JavaScript numbers are all 64-bit doubles, with engines typically implementing them as 32-bit integers where possible for speed. A consequence of this is that it is impossible to directly implement 64-bit integers in JavaScript, as integer values larger than 32 bits will become doubles, with only 53 bits for the significand. Thus, when Emscripten uses normal JavaScript addition and so forth for 64-bit integers, it runs the risk of rounding effects. This could be solved by emulating 64-bit integers, but it would be much slower than native code.

• Multithreading: JavaScript has Web Workers, which are additional threads (or processes) that communicate via message passing. There is no shared state in this model, which means that it is not directly possible to compile multithreaded code in C++ into JavaScript. A partial solution could be to emulate threads, without Workers, by manually controlling which blocks of code run (a variation on the switch in a loop construction mentioned earlier) and manually switching between threads every so often. However, in that case there would not be any utilization of additional CPU cores, and furthermore performance would be slow due to not using normal JavaScript loops.


Page 124: Virtual machine and javascript engine

JavaScript on LLVM

Page 125: Virtual machine and javascript engine

Fabric Engine

• JavaScript Integration

• Native code compilation (LLVM)

• Multi-threaded execution

• OpenGL Rendering

Page 126: Virtual machine and javascript engine

Fabric Engine

http://fabric-engine.com/2011/11/server-performance-benchmarks/

Page 127: Virtual machine and javascript engine

Conclusion?

Page 128: Virtual machine and javascript engine

David Wheeler

All problems in computer science can be solved by another level of indirection

Page 129: Virtual machine and javascript engine
Page 130: Virtual machine and javascript engine
Page 131: Virtual machine and javascript engine

References

• The Behavior of Efficient Virtual Machine Interpreters on Modern Architectures

• Virtual Machine Showdown: Stack Versus Registers

• The Implementation of Lua 5.0

• Why Is the New Google V8 Engine so Fast?

• Context Threading: A Flexible and Efficient Dispatch Technique for Virtual Machine Interpreters

• Effective Inline-Threaded Interpretation of Java Bytecode Using Preparation Sequences

• Smalltalk-80: The Language and Its Implementation

Page 132: Virtual machine and javascript engine

References

• Design of the Java HotSpot™ Client Compiler for Java 6

• Oracle JRockit: The Definitive Guide

• Virtual Machines: Versatile platforms for systems and processes

• Fast and Precise Hybrid Type Inference for JavaScript

• LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation

• Emscripten: An LLVM-to-JavaScript Compiler

• An Analysis of the Dynamic Behavior of JavaScript Programs

Page 133: Virtual machine and javascript engine

References

• Adaptive Optimization for SELF

• Bytecodes meet Combinators: invokedynamic on the JVM

• Context Threading: A Flexible and Efficient Dispatch Technique for Virtual Machine Interpreters

• Efficient Implementation of the Smalltalk-80 System

• Design, Implementation, and Evaluation of Optimizations in a Just-In-Time Compiler

• Optimizing direct threaded code by selective inlining

• Linear scan register allocation

• Optimizing Invokedynamic

Page 134: Virtual machine and javascript engine

References

• Representing Type Information in Dynamically Typed Languages

• The Behavior of Efficient Virtual Machine Interpreters on Modern Architectures

• Trace-based Just-in-Time Type Specialization for Dynamic Languages

• The Structure and Performance of Efficient Interpreters

• Know Your Engines: How to Make Your JavaScript Fast

• IE Blog, Chromium Blog, WebKit Blog, Opera Blog, Mozilla Blog, Wingolog’s Blog, RednaxelaFX’s Blog, David Mandelin’s Blog...

Page 135: Virtual machine and javascript engine

Thank you