I experimented with running JavaScript on the GPU - see how the first iteration of the experiment went.
If you don’t get this ref...shame on you
Work @ Sencha, Web Platform Team
Doing webkitty things...
WebKit Committer
Co-Author, W3C Web Cryptography API
JavaScript on the GPU
Why JavaScript on the GPU
Running JavaScript on the GPU
What’s to come...
What I’ll blabber about today
Why JavaScript on the GPU?
Better question:Why a GPU?
A: They’re fast!(well, at certain things...)
Totally different paradigm from CPUs
Data parallelism vs. Task parallelism
Stream processing vs. Sequential processing
GPUs can divide-and-conquer
Hardware capable of a large number of “threads”
e.g. ATI Radeon HD 6770m: 480 stream processing units == 480 cores
Typically very high memory bandwidth
Many, many GigaFLOPs
GPUs are fast b/c...
Not all tasks can be accelerated by GPUs
Tasks must be parallelizable, i.e.:
Side effect free
Homogeneous and/or streamable
Overall tasks will become limited by Amdahl’s Law
GPUs don’t solve all problems
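Amdahl’s Law puts a hard ceiling on the speedup: if only a fraction p of a task is parallelizable, n cores can never make it more than 1 / ((1 - p) + p/n) times faster. A quick C sketch (the function name is mine):

```c
#include <assert.h>

/* Amdahl's Law: overall speedup when a fraction p of the work
   is parallelizable across n processors. */
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

Even with 480 stream processors, a task that is 50% serial caps out below a 2x speedup.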
Let’s find out...
Experiment
Code Name “LateralJS”
LateralJS
Our Mission
To make JavaScript a first-class citizen on all GPUs and take advantage of hardware accelerated operations & data parallelization.
Our Options

OpenCL
AMD, Nvidia, Intel, etc.
A shitty version of C99
No dynamic memory
No recursion
No function pointers
Terrible tooling
Immature (arguably)

Nvidia CUDA
Nvidia only
C++ (C for CUDA)
Dynamic memory
Recursion
Function pointers
Great dev. tooling
More mature (arguably)
We want full JavaScript support
Object / prototype
Closures
Recursion
Functions as objects
Variable typing
Why not a Static Compiler?
Type Inference limitations
Reasonably limited to size and complexity of “kernel-esque” functions
Not nearly insane enough
Why an Interpreter?
We want it all baby - full JavaScript support!
Most insane approach
Challenging to make it good, but holds a lot of promise
OpenCL Headaches
Multiple memory spaces - pointer hell
No recursion - all inlined functions
No standard libc libraries
No dynamic memory
No standard data structures - apart from vector ops
Buggy ass AMD/Nvidia compilers
Oh the agony...
Multiple Memory Spaces

In order of fastest to slowest:

space             speed                         description
private           very fast                     stream processor cache (~64KB); scoped to a single work item
local             fast                          ~= L1 cache on CPUs (~64KB); scoped to a single work group
global/constant   slow, by orders of magnitude  ~= system memory over a slow bus; available to all work groups/items; all the VRAM on the card (MBs)
global uchar* gptr = 0x1000;
local uchar* lptr = (local uchar*) gptr;  // FAIL!
uchar* pptr = (uchar*) gptr;              // FAIL! private is implicit
Memory Space Pointer Hell
0x1000 points to something different depending on the address space (global, local, or private)!
#define GPTR(TYPE) global TYPE*
#define CPTR(TYPE) constant TYPE*
#define LPTR(TYPE) local TYPE*
#define PPTR(TYPE) private TYPE*
Memory Space Pointer Hell
Pointers must always be fully qualifiedMacros to help ease the pain
uint factorial(uint n) {
    if (n <= 1)
        return 1;
    else
        return n * factorial(n - 1); // compile-time error
}
No Recursion!?!?!?
No call stack
All functions are inlined to the kernel function
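With no call stack, recursive functions like the factorial above have to be rewritten iteratively before they can live in a kernel. A plain-C sketch of that rewrite:

```c
#include <assert.h>

/* Iterative rewrite of the recursive factorial above, as required
   in an OpenCL kernel (no call stack, all functions inlined). */
unsigned int factorial(unsigned int n) {
    unsigned int result = 1;
    for (unsigned int i = 2; i <= n; i++)
        result *= i;
    return result;
}
```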
No standard libc libraries
memcpy? strcpy? strcmp? etc...
Implement our own
#define MEMCPY(NAME, DEST_AS, SRC_AS) \
    DEST_AS void* NAME(DEST_AS void*, SRC_AS const void*, uint); \
    DEST_AS void* NAME(DEST_AS void* dest, SRC_AS const void* src, uint size) { \
        DEST_AS uchar* cDest = (DEST_AS uchar*)dest; \
        SRC_AS const uchar* cSrc = (SRC_AS const uchar*)src; \
        for (uint i = 0; i < size; i++) \
            cDest[i] = cSrc[i]; \
        return (DEST_AS void*)cDest; \
    }

PTR_MACRO_DEST_SRC(MEMCPY, memcpy)
Produces:
memcpy_g, memcpy_l, memcpy_p
memcpy_gc, memcpy_gl, memcpy_gp
memcpy_lc, memcpy_lg, memcpy_lp
memcpy_pc, memcpy_pg, memcpy_pl
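Stripped of the address-space qualifiers, every one of those generated variants expands to the same byte-copy loop. A host-side C equivalent of the macro body:

```c
#include <assert.h>

/* Plain-C equivalent of the loop the MEMCPY macro expands to for
   each (dest, src) address-space pair; in host C the qualifiers
   simply disappear. */
void* memcpy_sketch(void* dest, const void* src, unsigned int size) {
    unsigned char* cDest = (unsigned char*)dest;
    const unsigned char* cSrc = (const unsigned char*)src;
    for (unsigned int i = 0; i < size; i++)
        cDest[i] = cSrc[i];
    return dest;
}
```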
No dynamic memory
No malloc()
No free()
What to do...
Yes! Dynamic memory
Create a large buffer of global memory - our “heap”
Implement our own malloc() and free()
Create a handle structure - “virtual memory”
P(T, hnd) macro to get the current pointer address
GPTR(handle) hnd = malloc(sizeof(uint));
GPTR(uint) ptr = P(uint, hnd);
*ptr = 0xdeadbeef;
free(hnd);
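A minimal host-side sketch of the idea: a bump allocator over a fixed "heap" buffer, with handles that store offsets so the allocator could later move memory behind them. All names here are my illustration, not Lateral's actual allocator:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical handle-based bump allocator over a fixed "heap"
   buffer, in the spirit of the slides. A handle holds an offset
   rather than a pointer ("virtual memory"), so a compacting
   allocator could relocate blocks and just update handles. */
#define HEAP_SIZE 1024

typedef struct { uint32_t offset; } handle;

static unsigned char g_heap[HEAP_SIZE];
static uint32_t g_next = 4; /* offset 0 reserved as "null" */

handle heap_alloc(uint32_t size) {
    handle h = { 0 };
    if (g_next + size <= HEAP_SIZE) {
        h.offset = g_next;
        g_next += size;
    }
    return h;
}

/* P(T, hnd)-style dereference: resolve the handle to a live pointer */
#define P(TYPE, hnd) ((TYPE*)(g_heap + (hnd).offset))
```

A real free() and compaction pass are omitted; the point is the handle indirection.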
Ok, we get the point...FYL!
High-level Architecture

Host: V8 + Esprima Parser → Device Mgr → Data Serializer & Marshaller
GPUs: Stack-based Interpreter + Data Heap + Garbage Collector

Flow: eval(code); → Build JSON AST → Serialize AST (JSON => C Structs) → Ship to GPU to Interpret → Fetch Result
AST Generation
JavaScript Source → Esprima in V8 → JSON AST (v8::Object) → Lateral AST (C structs)
Resource Generator
Embed esprima.js

$ resgen esprima.js resgen_esprima_js.c
resgen_esprima_js.c:

const unsigned char resgen_esprima_js[] = {
    0x2f, 0x2a, 0x0a, 0x20, 0x20, 0x43, 0x6f, 0x70,
    0x79, 0x72, 0x69, 0x67, 0x68, 0x74, 0x20, 0x28,
    0x43, 0x29, 0x20, 0x32,
    ...
    0x20, 0x3a, 0x20, 0x2a, 0x2f, 0x0a, 0x0a, 0
};
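The heart of a resgen-style tool is just emitting a C array initializer for the input bytes, with a trailing NUL so the array can double as a C string. A sketch (the real tool reads esprima.js from disk; this function signature is mine):

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* Emit a C byte-array initializer for `data` into `out`, the way a
   resgen-style resource generator would. Returns bytes written. */
int emit_resource(char* out, size_t cap, const char* name,
                  const unsigned char* data, size_t n) {
    int w = snprintf(out, cap, "const unsigned char %s[] = { ", name);
    for (size_t i = 0; i < n; i++)
        w += snprintf(out + w, cap - w, "0x%02x, ", data[i]);
    w += snprintf(out + w, cap - w, "0 };"); /* trailing NUL terminator */
    return w;
}
```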
ASTGenerator.cpp:

extern const char resgen_esprima_js;

void ASTGenerator::init()
{
    HandleScope scope;
    s_context = Context::New();
    s_context->Enter();
    Handle<Script> script = Script::Compile(String::New(&resgen_esprima_js));
    script->Run();
    s_context->Exit();
    s_initialized = true;
}
Build JSON AST

e.g.
ASTGenerator::esprimaParse("var xyz = new Array(10);");
Handle<Object> ASTGenerator::esprimaParse(const char* javascript)
{
    if (!s_initialized)
        init();

    HandleScope scope;
    s_context->Enter();
    Handle<Object> global = s_context->Global();
    Handle<Object> esprima = Handle<Object>::Cast(global->Get(String::New("esprima")));
    Handle<Function> esprimaParse = Handle<Function>::Cast(esprima->Get(String::New("parse")));
    Handle<String> code = String::New(javascript);
    Handle<Object> ast = Handle<Object>::Cast(esprimaParse->Call(esprima, 1, (Handle<Value>*)&code));

    s_context->Exit();
    return scope.Close(ast);
}
{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "xyz" }, "init": { "type": "NewExpression", "callee": { "type": "Identifier", "name": "Array" }, "arguments": [ { "type": "Literal", "value": 10 } ] } } ], "kind": "var"}
Build JSON AST
Lateral AST structs

typedef struct ast_type_st {
    CL(uint) id;
    CL(uint) size;
} ast_type;

typedef struct ast_program_st {
    ast_type type;
    CL(uint) body;
    CL(uint) numBody;
} ast_program;

typedef struct ast_identifier_st {
    ast_type type;
    CL(uint) name;
} ast_identifier;
Structs shared between Host and OpenCL:

#ifdef __OPENCL_VERSION__
#define CL(TYPE) TYPE
#else
#define CL(TYPE) cl_##TYPE
#endif
v8::Object => ast_type, expanded:

ast_type* vd1_1_init_id = (ast_type*)astCreateIdentifier("Array");
ast_type* vd1_1_init_args[1];
vd1_1_init_args[0] = (ast_type*)astCreateNumberLiteral(10);
ast_type* vd1_1_init = (ast_type*)astCreateNewExpression(vd1_1_init_id, vd1_1_init_args, 1);
free(vd1_1_init_id);
for (int i = 0; i < 1; i++)
    free(vd1_1_init_args[i]);
ast_type* vd1_1_id = (ast_type*)astCreateIdentifier("xyz");
ast_type* vd1_decls[1];
vd1_decls[0] = (ast_type*)astCreateVariableDeclarator(vd1_1_id, vd1_1_init);
free(vd1_1_id);
free(vd1_1_init);
ast_type* vd1 = (ast_type*)astCreateVariableDeclaration(vd1_decls, 1, "var");
for (int i = 0; i < 1; i++)
    free(vd1_decls[i]);
astCreateIdentifier:

ast_identifier* astCreateIdentifier(const char* str)
{
    CL(uint) size = sizeof(ast_identifier) + rnd(strlen(str) + 1, 4);
    ast_identifier* ast_id = (ast_identifier*)malloc(size);

    // copy the string
    strcpy((char*)(ast_id + 1), str);

    // fill the struct
    ast_id->type.id = AST_IDENTIFIER;
    ast_id->type.size = size;
    ast_id->name = sizeof(ast_identifier); // offset

    return ast_id;
}
astCreateIdentifier("xyz") memory layout:

offset  field      value
0       type.id    AST_IDENTIFIER (0x01)
4       type.size  16
8       name       12 (offset)
12      str[0]     'x'
13      str[1]     'y'
14      str[2]     'z'
15      str[3]     '\0'
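The layout above can be reproduced and checked on the host with plain C. This is a self-contained re-creation for illustration; the `AST_IDENTIFIER` value and the `rnd` round-up helper are assumptions inferred from the slides:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* Re-creation of the ast_identifier blob layout from the slides.
   AST_IDENTIFIER and rnd() (round n up to a multiple of m) are
   assumptions based on the layout table. */
typedef struct { uint32_t id; uint32_t size; } ast_type;
typedef struct { ast_type type; uint32_t name; } ast_identifier;

#define AST_IDENTIFIER 0x01

static uint32_t rnd(uint32_t n, uint32_t m) { return (n + m - 1) / m * m; }

ast_identifier* astCreateIdentifier(const char* str)
{
    uint32_t size = sizeof(ast_identifier) + rnd((uint32_t)strlen(str) + 1, 4);
    ast_identifier* ast_id = (ast_identifier*)malloc(size);

    strcpy((char*)(ast_id + 1), str);      /* string lives right after the struct */
    ast_id->type.id = AST_IDENTIFIER;
    ast_id->type.size = size;
    ast_id->name = sizeof(ast_identifier); /* byte offset of the string */
    return ast_id;
}
```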
astCreateNewExpression:

ast_expression_new* astCreateNewExpression(ast_type* callee, ast_type** arguments, int numArgs)
{
    CL(uint) size = sizeof(ast_expression_new) + callee->size;
    for (int i = 0; i < numArgs; i++)
        size += arguments[i]->size;

    ast_expression_new* ast_new = (ast_expression_new*)malloc(size);
    ast_new->type.id = AST_NEW_EXPR;
    ast_new->type.size = size;

    CL(uint) offset = sizeof(ast_expression_new);
    char* dest = (char*)ast_new;

    // copy callee
    memcpy(dest + offset, callee, callee->size);
    ast_new->callee = offset;
    offset += callee->size;

    // copy arguments
    if (numArgs) {
        ast_new->arguments = offset;
        for (int i = 0; i < numArgs; i++) {
            ast_type* arg = arguments[i];
            memcpy(dest + offset, arg, arg->size);
            offset += arg->size;
        }
    } else
        ast_new->arguments = 0;
    ast_new->numArguments = numArgs;

    return ast_new;
}
new Array(10) memory layout:

offset  field           value
0       type.id         AST_NEW_EXPR (0x308)
4       type.size       52
8       callee          20 (offset)
12      arguments       40 (offset)
16      numArguments    1
20      callee node     ast_identifier ("Array")
40      arguments node  ast_literal_number (10)
Lateral AST structs
Shared across the Host and the OpenCL runtime; Host writes, Lateral reads
Constructed on Host as contiguous blobs
Easy to send to GPU: memcpy(gpu, ast, ast->size);
Fast to send to GPU: single buffer write
Simple to traverse w/ pointer arithmetic
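Because children are stored as byte offsets from their parent node, traversal really is just pointer arithmetic. A sketch of the idea (`ast_child` is my helper name, not Lateral's):

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Walking a contiguous AST blob with pointer arithmetic. Children
   are stored as byte offsets from the start of the parent node, so
   the whole tree ships to the GPU in one buffer write. */
typedef struct { uint32_t id; uint32_t size; } ast_type;

static const ast_type* ast_child(const ast_type* node, uint32_t offset)
{
    return (const ast_type*)((const char*)node + offset);
}
```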
Stack-based Interpreter

Building Blocks:
Heap
AST Traverse Loop / Interpret Loop
AST Traverse Stack
Symbol/Ref Table
Call/Exec Stack
Return Stack
Lateral State
Scope Stack
JS Type Structs
#include "state.h"#include "jsvm/asttraverse.h"#include "jsvm/interpreter.h"
// Setup VM structureskernel void lateral_init(GPTR(uchar) lateral_heap) { LATERAL_STATE_INIT}
// Interpret the ASTkernel void lateral(GPTR(uchar) lateral_heap, GPTR(ast_type) lateral_ast) { LATERAL_STATE
ast_push(lateral_ast); while (!Q_EMPTY(lateral_state->ast_stack, ast_q) || !Q_EMPTY(lateral_state->call_stack, call_q)) { while (!Q_EMPTY(lateral_state->ast_stack, ast_q)) traverse(); if (!Q_EMPTY(lateral_state->call_stack, call_q)) interpret(); }}
Kernels
var x = 1 + 2;
Let’s interpret...
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
VarDecl
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
VarDtor
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
IdentBinary
VarDtor
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
IdentLiteralLiteral
VarDtorBinary
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
IdentLiteral
VarDtorBinaryLiteral
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
Ident VarDtorBinaryLiteralLiteral
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
VarDtorBinaryLiteralLiteralIdent
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
VarDtorBinaryLiteralLiteral
“x”
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
VarDtorBinaryLiteral
“x”1
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
VarDtorBinary
“x”12
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
VarDtor “x”3
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
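The return-stack discipline traced above can be boiled down to a tiny sketch: literal nodes push their value, and a binary "+" node pops its two operands and pushes the sum, which the VariableDeclarator then binds. This is my illustration of the mechanism, not Lateral's actual interpreter:

```c
#include <assert.h>

/* Minimal sketch of the return-stack discipline in the trace:
   Literal pushes a value; a Binary "+" pops two operands and
   pushes the sum. */
#define MAX_STACK 16

typedef struct { double vals[MAX_STACK]; int top; } ret_stack;

void push_literal(ret_stack* s, double v)
{
    s->vals[s->top++] = v;
}

void interpret_binary_add(ret_stack* s)
{
    double right = s->vals[--s->top];
    double left  = s->vals[--s->top];
    s->vals[s->top++] = left + right;
}
```

For `var x = 1 + 2;`: push 1, push 2, interpret the Binary node, and a single 3 is left for the VarDtor to bind to x.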
Benchmark
var input = new Array(10);
for (var i = 0; i < input.length; i++) {
    input[i] = Math.pow((i + 1) / 1.23, 3);
}
Small loop of FLOPs
Execution Time - Lateral

GPU CL             CPU CL                  V8
ATI Radeon 6770m   Intel Core i7 4x2.4GHz  Intel Core i7 4x2.4GHz
116.571533ms       0.226007ms              0.090664ms
What went wrong?
Everything
Stack-based AST Interpreter, no optimizations
Heavy global memory access, no optimizations
No data or task parallelism
Slow as molasses
Memory hog, Eclipse style
Heavy memory access: “var x = 1 + 2;” == 30 stack hits alone!
Too much dynamic allocation
No inline optimizations, just following the yellow brick AST
Straight up lazy
Stack-based Interpreter
Replace with something better!
Bytecode compiler on Host
Bytecode register-based interpreter on Device
Too much global access
Everything is dynamically allocated to global memory
Register-based interpreter & bytecode compiler can make better use of local and private memory
// 11.1207 seconds
size_t tid = get_global_id(0);
c[tid] = a[tid];
while (b[tid] > 0) {  // touch global memory on each loop
    b[tid]--;         // touch global memory on each loop
    c[tid]++;         // touch global memory on each loop
}

// 0.0445558 seconds!! HOLY SHIT!
size_t tid = get_global_id(0);
int tmp = a[tid];                 // temp private variable
for (int i = b[tid]; i > 0; i--)
    tmp++;                        // touch private variables on each loop
c[tid] = tmp;                     // touch global memory one time
Optimizing memory access yields crazy results
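The transformation is easy to verify in plain C: hoist the repeatedly-touched array slot into a local temporary and write back once. Both versions compute the same result; on a GPU only the second avoids hammering global memory:

```c
#include <assert.h>

/* The slide's optimization in plain C. slow_version touches the
   arrays (global memory on a GPU) every iteration; fast_version
   caches the value in a temporary (private memory) and writes
   back once. */
void slow_version(const int* a, int* b, int* c, int tid)
{
    c[tid] = a[tid];
    while (b[tid] > 0) {
        b[tid]--;
        c[tid]++;
    }
}

void fast_version(const int* a, const int* b, int* c, int tid)
{
    int tmp = a[tid];
    for (int i = b[tid]; i > 0; i--)
        tmp++;
    c[tid] = tmp;
}
```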
No data or task parallelism
Everything being interpreted in a single “thread”
We have hundreds of cores available to us!
Build in heuristics:
Identify side-effect free statements
Break into parallel tasks - very magical
var input = new Array(10);
for (var i = 0; i < input.length; i++) {
    input[i] = Math.pow((i + 1) / 1.23, 3);
}

becomes, in parallel:

input[0] = Math.pow((0 + 1) / 1.23, 3);
input[1] = Math.pow((1 + 1) / 1.23, 3);
...
input[9] = Math.pow((9 + 1) / 1.23, 3);
What’s in store
Acceptable performance on all CL devices
V8/Node extension to launch Lateral tasks
High-level API to perform map-reduce, etc.
Lateral-cluster... mmmmm
Thanks!
Jarred Nicholls
@jarrednicholls