JavaScript on the GPU

I experimented with running JavaScript on the GPU - see how the first iteration of the experiment went.

If you don’t get this ref...shame on you

Work @ Sencha, Web Platform Team

Doing webkitty things...

WebKit Committer

Co-Author, W3C Web Cryptography API

What I’ll blabber about today

Why JavaScript on the GPU
Running JavaScript on the GPU
What’s to come...

Why JavaScript on the GPU?

Better question: Why a GPU?

A: They’re fast! (well, at certain things...)

GPUs are fast b/c...

Totally different paradigm from CPUs
    Data parallelism vs. Task parallelism
    Stream processing vs. Sequential processing
GPUs can divide-and-conquer
Hardware capable of a large number of “threads”
    e.g. ATI Radeon HD 6770m: 480 stream processing units == 480 cores
Typically very high memory bandwidth
Many, many GigaFLOPs

GPUs don’t solve all problems

Not all tasks can be accelerated by GPUs
Tasks must be parallelizable, i.e.:
    Side effect free
    Homogeneous and/or streamable
Overall tasks will become limited by Amdahl’s Law
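
Amdahl's Law puts a number on that limit. Below is a minimal sketch in C, assuming a fraction of the work can be spread evenly over the available processors with no overhead and the rest stays serial. The function name `amdahl_speedup` is illustrative and does not come from the talk.

```c
#include <assert.h>

/* Amdahl's Law: best-case speedup when a fraction `parallel` (0..1) of the
 * work is spread evenly across `processors` units and the remainder stays
 * serial. Name and signature are illustrative. */
double amdahl_speedup(double parallel, unsigned processors)
{
    assert(parallel >= 0.0 && parallel <= 1.0);
    assert(processors > 0);
    return 1.0 / ((1.0 - parallel) + parallel / processors);
}
```

With the 480 stream processors mentioned earlier, a task that is 90% parallelizable still tops out just under 10x, because the serial 10% dominates.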

Let’s find out...

Experiment
Code Name “LateralJS”

LateralJS

Our Mission
To make JavaScript a first-class citizen on all GPUs and take advantage of hardware accelerated operations & data parallelization.

Our Options

OpenCL
    AMD, Nvidia, Intel, etc.
    A shitty version of C99
    No dynamic memory
    No recursion
    No function pointers
    Terrible tooling
    Immature (arguably)

Nvidia CUDA
    Nvidia only
    C++ (C for CUDA)
    Dynamic memory
    Recursion
    Function pointers
    Great dev. tooling
    More mature (arguably)

Why not a Static Compiler?

We want full JavaScript support
    Object / prototype
    Closures
    Recursion
    Functions as objects
    Variable typing
Type Inference limitations
Reasonably limited to size and complexity of “kernel-esque” functions
Not nearly insane enough

Why an Interpreter?

We want it all baby - full JavaScript support!
Most insane approach
Challenging to make it good, but holds a lot of promise

OpenCL Headaches

Oh the agony...

Multiple memory spaces - pointer hell
No recursion - all inlined functions
No standard libc libraries
No dynamic memory
No standard data structures - apart from vector ops
Buggy ass AMD/Nvidia compilers

Multiple Memory Spaces

In the order of fastest to slowest:

    space             description
    private           very fast
                      stream processor cache (~64KB)
                      scoped to a single work item
    local             fast
                      ~= L1 cache on CPUs (~64KB)
                      scoped to a single work group
    global, constant  slow, by orders of magnitude
                      ~= system memory over slow bus
                      available to all work groups/items
                      all the VRAM on the card (MBs)

Memory Space Pointer Hell

    global uchar* gptr = (global uchar*) 0x1000;
    local uchar* lptr = (local uchar*) gptr; // FAIL!
    uchar* pptr = (uchar*) gptr;             // FAIL! private is implicit

0x1000 points to something different depending on the address space (global, local, or private)!

Memory Space Pointer Hell

Pointers must always be fully qualified
Macros to help ease the pain

    #define GPTR(TYPE) global TYPE*
    #define CPTR(TYPE) constant TYPE*
    #define LPTR(TYPE) local TYPE*
    #define PPTR(TYPE) private TYPE*

No Recursion!?!?!?

No call stack
All functions are inlined to the kernel function

    uint factorial(uint n) {
        if (n <= 1)
            return 1;
        else
            return n * factorial(n - 1); // compile-time error
    }
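
With no call stack, a recursive algorithm has to be restated as a loop, or driven by an explicit stack kept in memory. Here is a minimal sketch of the factorial above written iteratively. It is plain C, so `unsigned int` stands in for OpenCL's `uint`, and the name `factorial_iter` is made up for illustration.

```c
/* Iterative replacement for the recursive factorial: no function calls
 * itself, so the whole body can be inlined into a kernel. */
unsigned int factorial_iter(unsigned int n)
{
    unsigned int result = 1;
    for (unsigned int i = 2; i <= n; i++)
        result *= i;   /* wraps around for n > 12 when unsigned int is 32 bits */
    return result;
}
```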

No standard libc libraries

memcpy? strcpy? strcmp? etc...
Implement our own

    #define MEMCPY(NAME, DEST_AS, SRC_AS) \
        DEST_AS void* NAME(DEST_AS void*, SRC_AS const void*, uint); \
        DEST_AS void* NAME(DEST_AS void* dest, SRC_AS const void* src, uint size) { \
            DEST_AS uchar* cDest = (DEST_AS uchar*)dest; \
            SRC_AS const uchar* cSrc = (SRC_AS const uchar*)src; \
            for (uint i = 0; i < size; i++) \
                cDest[i] = cSrc[i]; \
            return (DEST_AS void*)cDest; \
        }
    PTR_MACRO_DEST_SRC(MEMCPY, memcpy)

Produces

    memcpy_g   memcpy_gc  memcpy_gl  memcpy_gp
    memcpy_l   memcpy_lc  memcpy_lg  memcpy_lp
    memcpy_p   memcpy_pc  memcpy_pg  memcpy_pl

No dynamic memory

No malloc()
No free()
What to do...

Yes! dynamic memory

Create a large buffer of global memory - our “heap”
Implement our own malloc() and free()
Create a handle structure - “virtual memory”
P(T, hnd) macro to get the current pointer address

    GPTR(handle) hnd = malloc(sizeof(uint));
    GPTR(uint) ptr = P(uint, hnd);
    *ptr = 0xdeadbeef;
    free(hnd);
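
To make the handle idea concrete, here is a minimal host-side sketch in plain C. It is not Lateral's allocator: the names `heap_alloc`, `heap_free` and `heap_compact`, the handle fields, and the bump-allocation strategy are all assumptions chosen for brevity. A static array stands in for the global-memory buffer. What it shows is why a handle (an offset that can be patched) is useful: live blocks can be moved to close gaps, and `P(T, hnd)` still resolves to the current address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HEAP_WORDS  1024u                 /* 4 KB heap; size chosen arbitrarily */
#define HEAP_SIZE   (HEAP_WORDS * 4u)
#define MAX_HANDLES 64u

/* A handle records where a block currently lives. Callers hold the handle,
 * never a long-lived raw pointer, so blocks can be moved. */
typedef struct {
    uint32_t offset;   /* byte offset of the block inside the heap */
    uint32_t size;     /* block size in bytes, rounded up to a multiple of 4 */
    int      in_use;
} handle;

static uint32_t heap_words[HEAP_WORDS];   /* stand-in for the global buffer */
static handle   handles[MAX_HANDLES];
static uint32_t heap_top;                 /* first unused byte */

#define HEAP_BYTES ((unsigned char *)heap_words)

/* Resolve a handle to a typed pointer; valid only until the next compaction. */
#define P(TYPE, hnd) ((TYPE *)(HEAP_BYTES + (hnd)->offset))

/* Bump-allocate `size` bytes; returns NULL when out of space or handles. */
handle *heap_alloc(uint32_t size)
{
    if (size > HEAP_SIZE)
        return NULL;
    uint32_t rounded = size ? (size + 3u) & ~3u : 4u;
    if (rounded > HEAP_SIZE - heap_top)
        return NULL;
    for (uint32_t i = 0; i < MAX_HANDLES; i++) {
        if (!handles[i].in_use) {
            handles[i].offset = heap_top;
            handles[i].size = rounded;
            handles[i].in_use = 1;
            heap_top += rounded;
            return &handles[i];
        }
    }
    return NULL;
}

/* Freeing only marks the block dead; heap_compact() reclaims the space. */
void heap_free(handle *hnd)
{
    assert(hnd != NULL && hnd->in_use);
    hnd->in_use = 0;
}

/* Slide live blocks toward offset 0, lowest offset first, patching handles. */
void heap_compact(void)
{
    uint32_t cursor = 0;
    for (;;) {
        handle *next = NULL;
        for (uint32_t i = 0; i < MAX_HANDLES; i++) {
            handle *h = &handles[i];
            if (h->in_use && h->offset >= cursor &&
                (next == NULL || h->offset < next->offset))
                next = h;
        }
        if (next == NULL)
            break;
        if (next->offset != cursor) {
            memmove(HEAP_BYTES + cursor, HEAP_BYTES + next->offset, next->size);
            next->offset = cursor;
        }
        cursor += next->size;
    }
    heap_top = cursor;
}
```

After a compaction, any pointer previously obtained through `P` must be discarded and fetched again from the handle. That is the price of being able to move blocks.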

Ok, we get the point...FYL!

High-level Architecture

[Diagram, built up over several slides. Boxes shown: Host, V8, Esprima Parser, Data Serializer & Marshaller, Device Mgr, GPUs, Stack-based Interpreter, Data Heap, Garbage Collector.]

Steps added to the diagram, in order:
    eval(code);
    Build JSON AST
    Serialize AST: JSON => C Structs
    Ship to GPU to Interpret
    Fetch Result

AST Generation

JavaScript Source => Esprima in V8 => JSON AST (v8::Object) => Lateral AST (C structs)

Embed esprima.js

Resource Generator

    $ resgen esprima.js resgen_esprima_js.c

resgen_esprima_js.c

    const unsigned char resgen_esprima_js[] = {
        0x2f, 0x2a, 0x0a, 0x20, 0x20, 0x43, 0x6f, 0x70, 0x79, 0x72,
        0x69, 0x67, 0x68, 0x74, 0x20, 0x28, 0x43, 0x29, 0x20, 0x32,
        ...
        0x20, 0x3a, 0x20, 0x2a, 0x2f, 0x0a, 0x0a, 0
    };

Embed esprima.js

ASTGenerator.cpp

    // Declared as an array to match the definition in resgen_esprima_js.c
    extern "C" const unsigned char resgen_esprima_js[];

    void ASTGenerator::init()
    {
        HandleScope scope;
        s_context = Context::New();
        s_context->Enter();
        Handle<Script> script = Script::Compile(
            String::New(reinterpret_cast<const char*>(resgen_esprima_js)));
        script->Run();
        s_context->Exit();
        s_initialized = true;
    }

Build JSON AST

e.g.

    ASTGenerator::esprimaParse("var xyz = new Array(10);");

    Handle<Object> ASTGenerator::esprimaParse(const char* javascript)
    {
        if (!s_initialized)
            init();

        HandleScope scope;
        s_context->Enter();
        Handle<Object> global = s_context->Global();
        Handle<Object> esprima = Handle<Object>::Cast(global->Get(String::New("esprima")));
        Handle<Function> esprimaParse = Handle<Function>::Cast(esprima->Get(String::New("parse")));
        Handle<String> code = String::New(javascript);
        Handle<Object> ast = Handle<Object>::Cast(esprimaParse->Call(esprima, 1, (Handle<Value>*)&code));

        s_context->Exit();
        return scope.Close(ast);
    }

Build JSON AST

    {
        "type": "VariableDeclaration",
        "declarations": [
            {
                "type": "VariableDeclarator",
                "id": {
                    "type": "Identifier",
                    "name": "xyz"
                },
                "init": {
                    "type": "NewExpression",
                    "callee": {
                        "type": "Identifier",
                        "name": "Array"
                    },
                    "arguments": [
                        {
                            "type": "Literal",
                            "value": 10
                        }
                    ]
                }
            }
        ],
        "kind": "var"
    }

Lateral AST structs

Structs shared between Host and OpenCL

    #ifdef __OPENCL_VERSION__
    #define CL(TYPE) TYPE
    #else
    #define CL(TYPE) cl_##TYPE
    #endif

    typedef struct ast_type_st {
        CL(uint) id;
        CL(uint) size;
    } ast_type;

    typedef struct ast_program_st {
        ast_type type;
        CL(uint) body;
        CL(uint) numBody;
    } ast_program;

    typedef struct ast_identifier_st {
        ast_type type;
        CL(uint) name;
    } ast_identifier;

Lateral AST structs

v8::Object => ast_type, expanded

    ast_type* vd1_1_init_id = (ast_type*)astCreateIdentifier("Array");
    ast_type* vd1_1_init_args[1];
    vd1_1_init_args[0] = (ast_type*)astCreateNumberLiteral(10);
    ast_type* vd1_1_init = (ast_type*)astCreateNewExpression(vd1_1_init_id, vd1_1_init_args, 1);
    free(vd1_1_init_id);
    for (int i = 0; i < 1; i++)
        free(vd1_1_init_args[i]);
    ast_type* vd1_1_id = (ast_type*)astCreateIdentifier("xyz");
    ast_type* vd1_decls[1];
    vd1_decls[0] = (ast_type*)astCreateVariableDeclarator(vd1_1_id, vd1_1_init);
    free(vd1_1_id);
    free(vd1_1_init);
    ast_type* vd1 = (ast_type*)astCreateVariableDeclaration(vd1_decls, 1, "var");
    for (int i = 0; i < 1; i++)
        free(vd1_decls[i]);

Lateral AST structs

astCreateIdentifier

    ast_identifier* astCreateIdentifier(const char* str) {
        CL(uint) size = sizeof(ast_identifier) + rnd(strlen(str) + 1, 4);
        ast_identifier* ast_id = (ast_identifier*)malloc(size);

        // copy the string
        strcpy((char*)(ast_id + 1), str);

        // fill the struct
        ast_id->type.id = AST_IDENTIFIER;
        ast_id->type.size = size;
        ast_id->name = sizeof(ast_identifier); // offset

        return ast_id;
    }

Lateral AST structs

astCreateIdentifier(“xyz”)

    offset  field      value
    0       type.id    AST_IDENTIFIER (0x01)
    4       type.size  16
    8       name       12 (offset)
    12      str[0]     ‘x’
    13      str[1]     ‘y’
    14      str[2]     ‘z’
    15      str[3]     ‘\0’

Lateral AST structs

astCreateNewExpression

    ast_expression_new* astCreateNewExpression(ast_type* callee, ast_type** arguments, int numArgs) {
        CL(uint) size = sizeof(ast_expression_new) + callee->size;
        for (int i = 0; i < numArgs; i++)
            size += arguments[i]->size;

        ast_expression_new* ast_new = (ast_expression_new*)malloc(size);
        ast_new->type.id = AST_NEW_EXPR;
        ast_new->type.size = size;

        CL(uint) offset = sizeof(ast_expression_new);
        char* dest = (char*)ast_new;

        // copy callee
        memcpy(dest + offset, callee, callee->size);
        ast_new->callee = offset;
        offset += callee->size;

        // copy arguments
        if (numArgs) {
            ast_new->arguments = offset;
            for (int i = 0; i < numArgs; i++) {
                ast_type* arg = arguments[i];
                memcpy(dest + offset, arg, arg->size);
                offset += arg->size;
            }
        } else
            ast_new->arguments = 0;
        ast_new->numArguments = numArgs;

        return ast_new;
    }

Lateral AST structs

new Array(10)

    offset  field           value
    0       type.id         AST_NEW_EXPR (0x308)
    4       type.size       52
    8       callee          20 (offset)
    12      arguments       40 (offset)
    16      numArguments    1
    20      callee node     ast_identifier (“Array”)
    40      arguments node  ast_literal_number (10)

Lateral AST structs

Shared across the Host and the OpenCL runtime
    Host writes, Lateral reads
Constructed on Host as contiguous blobs
    Easy to send to GPU: memcpy(gpu, ast, ast->size);
    Fast to send to GPU, single buffer write
    Simple to traverse w/ pointer arithmetic
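
The "traverse with pointer arithmetic" point can be sketched in a few lines of host-side C. This is a stand-in, not Lateral's source: `uint32_t` replaces `CL(uint)`, `round_up4` is an assumed reading of what `rnd(n, 4)` does, and `make_identifier`, `ast_at` and `identifier_name` are hypothetical helper names. The struct layout and the `AST_IDENTIFIER` value follow the slides.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define AST_IDENTIFIER 0x01u   /* id value from the astCreateIdentifier table */

typedef struct { uint32_t id; uint32_t size; } ast_type;
typedef struct { ast_type type; uint32_t name; } ast_identifier;

/* Round n up to the next multiple of 4 (assumed behaviour of rnd(n, 4)). */
static uint32_t round_up4(uint32_t n) { return (n + 3u) & ~3u; }

/* Build an identifier node as one contiguous blob: struct first, string after. */
ast_identifier *make_identifier(const char *str)
{
    uint32_t len = (uint32_t)strlen(str) + 1u;
    uint32_t size = (uint32_t)sizeof(ast_identifier) + round_up4(len);
    ast_identifier *node = calloc(1, size);
    if (node == NULL)
        return NULL;
    node->type.id = AST_IDENTIFIER;
    node->type.size = size;
    node->name = (uint32_t)sizeof(ast_identifier);   /* an offset, not a pointer */
    memcpy((char *)node + node->name, str, len);
    return node;
}

/* Turn an offset stored in a node into an address inside the same blob. */
static const void *ast_at(const ast_type *node, uint32_t offset)
{
    assert(offset < node->size);
    return (const char *)node + offset;
}

/* Read the identifier's name by following its offset field. */
const char *identifier_name(const ast_identifier *node)
{
    return ast_at(&node->type, node->name);
}
```

Because the node stores offsets rather than pointers, a byte-for-byte copy of the blob, which is what a buffer write to the device amounts to, stays valid at its new address.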

Stack-based Interpreter

Building Blocks

[Diagram. Boxes shown: Lateral State, AST Traverse Stack, Call/Exec Stack, Return Stack, Scope Stack, Symbol/Ref Table, Heap, JS Type Structs, AST Traverse Loop, Interpret Loop.]

Kernels

    #include "state.h"
    #include "jsvm/asttraverse.h"
    #include "jsvm/interpreter.h"

    // Setup VM structures
    kernel void lateral_init(GPTR(uchar) lateral_heap) {
        LATERAL_STATE_INIT
    }

    // Interpret the AST
    kernel void lateral(GPTR(uchar) lateral_heap, GPTR(ast_type) lateral_ast) {
        LATERAL_STATE

        ast_push(lateral_ast);
        while (!Q_EMPTY(lateral_state->ast_stack, ast_q) || !Q_EMPTY(lateral_state->call_stack, call_q)) {
            while (!Q_EMPTY(lateral_state->ast_stack, ast_q))
                traverse();
            if (!Q_EMPTY(lateral_state->call_stack, call_q))
                interpret();
        }
    }

Let’s interpret...

    var x = 1 + 2;

    {
        "type": "VariableDeclaration",
        "declarations": [
            {
                "type": "VariableDeclarator",
                "id": {
                    "type": "Identifier",
                    "name": "x"
                },
                "init": {
                    "type": "BinaryExpression",
                    "operator": "+",
                    "left": {
                        "type": "Literal",
                        "value": 1
                    },
                    "right": {
                        "type": "Literal",
                        "value": 2
                    }
                }
            }
        ],
        "kind": "var"
    }

Contents of the three stacks, one row per slide (rightmost entry is the top of each stack; "-" means empty):

    Step  AST                    Call                                  Return
    1     -                      -                                     -
    2     VarDecl                -                                     -
    3     VarDtor                -                                     -
    4     Ident Binary           VarDtor                               -
    5     Ident Literal Literal  VarDtor Binary                        -
    6     Ident Literal          VarDtor Binary Literal                -
    7     Ident                  VarDtor Binary Literal Literal        -
    8     -                      VarDtor Binary Literal Literal Ident  -
    9     -                      VarDtor Binary Literal Literal        “x”
    10    -                      VarDtor Binary Literal                “x” 1
    11    -                      VarDtor Binary                        “x” 1 2
    12    -                      VarDtor                               “x” 3
    13    -                      -                                     -
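
The hand-off between the traverse loop and the interpret loop traced above can be sketched in plain C. This is a simplified illustration, not Lateral's interpreter: it handles only number literals and `+`, leaves out the VarDecl, VarDtor and Ident nodes, uses ordinary pointers instead of offsets, and every type and function name in it is invented for the example. The loop structure mirrors the `lateral` kernel: drain the AST stack onto the call stack, then execute one entry from the call stack.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { NODE_LITERAL, NODE_BINARY_ADD } node_kind;

typedef struct node {
    node_kind kind;
    double value;               /* used by NODE_LITERAL */
    const struct node *left;    /* used by NODE_BINARY_ADD */
    const struct node *right;
} node;

#define STACK_MAX 64

typedef struct { const node *items[STACK_MAX]; size_t top; } node_stack;
typedef struct { double items[STACK_MAX]; size_t top; } value_stack;

static void push_node(node_stack *s, const node *n)
{
    assert(s->top < STACK_MAX);
    s->items[s->top++] = n;
}

static const node *pop_node(node_stack *s)
{
    assert(s->top > 0);
    return s->items[--s->top];
}

static void push_value(value_stack *s, double v)
{
    assert(s->top < STACK_MAX);
    s->items[s->top++] = v;
}

static double pop_value(value_stack *s)
{
    assert(s->top > 0);
    return s->items[--s->top];
}

/* Evaluate an expression tree with three explicit stacks and no recursion. */
double evaluate(const node *root)
{
    node_stack ast = { {NULL}, 0 }, call = { {NULL}, 0 };
    value_stack ret = { {0.0}, 0 };

    push_node(&ast, root);
    while (ast.top > 0 || call.top > 0) {
        /* Traverse: move nodes to the call stack, scheduling their children. */
        while (ast.top > 0) {
            const node *n = pop_node(&ast);
            push_node(&call, n);
            if (n->kind == NODE_BINARY_ADD) {
                push_node(&ast, n->left);
                push_node(&ast, n->right);
            }
        }
        /* Interpret: execute one node; its operands are already on `ret`. */
        if (call.top > 0) {
            const node *n = pop_node(&call);
            if (n->kind == NODE_LITERAL) {
                push_value(&ret, n->value);
            } else {
                double rhs = pop_value(&ret);
                double lhs = pop_value(&ret);
                push_value(&ret, lhs + rhs);
            }
        }
    }
    return pop_value(&ret);
}
```

Even in this stripped-down form, evaluating `1 + 2` costs several pushes and pops per node, which gives a feel for how quickly stack traffic adds up.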

Benchmark

Small loop of FLOPs

    var input = new Array(10);
    for (var i = 0; i < input.length; i++) {
        input[i] = Math.pow((i + 1) / 1.23, 3);
    }

Execution Time

    Runtime  Device                          Time
    Lateral  GPU CL, ATI Radeon 6770m        116.571533 ms
    Lateral  CPU CL, Intel Core i7 4x2.4GHz  0.226007 ms
    V8       Intel Core i7 4x2.4GHz          0.090664 ms

What went wrong?

Everything
Stack-based AST Interpreter, no optimizations
Heavy global memory access, no optimizations
No data or task parallelism

Stack-based Interpreter

Slow as molasses
Memory hog Eclipse style
Heavy memory access
    “var x = 1 + 2;” == 30 stack hits alone!
    Too much dynamic allocation
No inline optimizations, just following the yellow brick AST
Straight up lazy
Replace with something better!
    Bytecode compiler on Host
    Bytecode register-based interpreter on Device

Too much global access

Everything is dynamically allocated to global memory
Register based interpreter & bytecode compiler can make better use of local and private memory

Optimizing memory access yields crazy results

    // 11.1207 seconds
    size_t tid = get_global_id(0);
    c[tid] = a[tid];
    while(b[tid] > 0) { // touch global memory on each loop
        b[tid]--;       // touch global memory on each loop
        c[tid]++;       // touch global memory on each loop
    }

    // 0.0445558 seconds!! HOLY SHIT!
    size_t tid = get_global_id(0);
    int tmp = a[tid];              // temp private variable
    for(int i=b[tid]; i > 0; i--)
        tmp++;                     // touch private variables on each loop
    c[tid] = tmp;                  // touch global memory one time

No data or task parallelism

Everything being interpreted in a single “thread”
We have hundreds of cores available to us!
Build in heuristics
    Identify side-effect free statements
    Break into parallel tasks - very magical

    var input = new Array(10);
    for (var i = 0; i < input.length; i++) {
        input[i] = Math.pow((i + 1) / 1.23, 3);
    }

    input[0] = Math.pow((0 + 1) / 1.23, 3);
    input[1] = Math.pow((1 + 1) / 1.23, 3);
    ...
    input[9] = Math.pow((9 + 1) / 1.23, 3);
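
Why this loop is a good candidate can be shown with a small C sketch. Each element depends only on its own index, so the loop body can be written as a pure function and the index range cut into independent tasks. The names `benchmark_element` and `fill_range` are hypothetical, and `x * x * x` stands in for `Math.pow(x, 3)`. Nothing here runs in parallel, it only demonstrates that the tasks share no state.

```c
#include <stddef.h>

/* One iteration of the benchmark loop as a pure function of its index.
 * x * x * x stands in for Math.pow(x, 3). */
double benchmark_element(size_t i)
{
    double x = (double)(i + 1) / 1.23;
    return x * x * x;
}

/* Fill out[begin..end): one "task". Tasks over disjoint ranges read and
 * write disjoint memory, so they could run in any order or simultaneously. */
void fill_range(double *out, size_t begin, size_t end)
{
    for (size_t i = begin; i < end; i++)
        out[i] = benchmark_element(i);
}
```

Calling `fill_range(out, 5, 10)` before `fill_range(out, 0, 5)` produces the same array as a single pass over 0..10. That order independence is the property a heuristic would need to establish before splitting the work across work items.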

What’s in store

Acceptable performance on all CL devices
V8/Node extension to launch Lateral tasks
High-level API to perform map-reduce, etc.
Lateral-cluster...mmmmm

Thanks!

Jarred Nicholls
@jarrednicholls

[email protected]