I experimented with running JavaScript on the GPU - see how the first iteration of the experiment went.
If you don’t get this ref...shame on you
Work @ Sencha, Web Platform Team
Doing webkitty things...
WebKit Committer
Co-Author, W3C Web Cryptography API
JavaScript on the GPU
Why JavaScript on the GPU
Running JavaScript on the GPU
What’s to come...
What I’ll blabber about today
Why JavaScript on the GPU?
Better question:Why a GPU?
A: They’re fast!(well, at certain things...)
Totally different paradigm from CPUs
Data parallelism vs. Task parallelism
Stream processing vs. Sequential processing
GPUs can divide-and-conquer
Hardware capable of a large number of “threads”
e.g. ATI Radeon HD 6770m: 480 stream processing units == 480 cores
Typically very high memory bandwidth
Many, many GigaFLOPs
GPUs are fast b/c...
Not all tasks can be accelerated by GPUs
Tasks must be parallelizable, i.e.:
Side effect free
Homogeneous and/or streamable
Overall tasks will become limited by Amdahl’s Law
GPUs don’t solve all problems
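Amdahl’s Law puts a hard ceiling on the speedup: if only a fraction p of a task is parallelizable, n cores can never make it more than 1 / ((1 - p) + p/n) times faster. A quick C sketch (the function name is mine):

```c
#include <assert.h>

/* Amdahl's Law: overall speedup when a fraction p of the work
   is parallelizable across n processors. */
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

Even with 480 stream processors, a task that is 50% serial caps out below a 2x speedup.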
Let’s find out...
Experiment
Code Name “LateralJS”
LateralJS
Our Mission
To make JavaScript a first-class citizen on all GPUs and take advantage of hardware accelerated operations & data parallelization.
Our Options

OpenCL
AMD, Nvidia, Intel, etc.
A shitty version of C99
No dynamic memory
No recursion
No function pointers
Terrible tooling
Immature (arguably)

Nvidia CUDA
Nvidia only
C++ (C for CUDA)
Dynamic memory
Recursion
Function pointers
Great dev. tooling
More mature (arguably)
We want full JavaScript support
Object / prototype
Closures
Recursion
Functions as objects
Variable typing
Why not a Static Compiler?
Type Inference limitations
Reasonably limited to size and complexity of “kernel-esque” functions
Not nearly insane enough
Why an Interpreter?
We want it all baby - full JavaScript support!
Most insane approach
Challenging to make it good, but holds a lot of promise
OpenCL Headaches
Multiple memory spaces - pointer hell
No recursion - all inlined functions
No standard libc libraries
No dynamic memory
No standard data structures - apart from vector ops
Buggy ass AMD/Nvidia compilers
Oh the agony...
Multiple Memory Spaces

In order of fastest to slowest:

space             speed                         description
private           very fast                     stream processor cache (~64KB); scoped to a single work item
local             fast                          ~= L1 cache on CPUs (~64KB); scoped to a single work group
global/constant   slow, by orders of magnitude  ~= system memory over a slow bus; available to all work groups/items; all the VRAM on the card (MBs)
global uchar* gptr = 0x1000;
local uchar* lptr = (local uchar*) gptr;  // FAIL!
uchar* pptr = (uchar*) gptr;              // FAIL! private is implicit
Memory Space Pointer Hell
0x1000 points to something different depending on the address space (global, local, or private)!
#define GPTR(TYPE) global TYPE*
#define CPTR(TYPE) constant TYPE*
#define LPTR(TYPE) local TYPE*
#define PPTR(TYPE) private TYPE*
Memory Space Pointer Hell
Pointers must always be fully qualifiedMacros to help ease the pain
uint factorial(uint n) {
    if (n <= 1)
        return 1;
    else
        return n * factorial(n - 1); // compile-time error
}
No Recursion!?!?!?
No call stack
All functions are inlined to the kernel function
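With no call stack, recursive functions like the factorial above have to be rewritten iteratively before they can live in a kernel. A plain-C sketch of that rewrite:

```c
#include <assert.h>

/* Iterative rewrite of the recursive factorial above, as required
   in an OpenCL kernel (no call stack, all functions inlined). */
unsigned int factorial(unsigned int n) {
    unsigned int result = 1;
    for (unsigned int i = 2; i <= n; i++)
        result *= i;
    return result;
}
```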
No standard libc libraries
memcpy? strcpy? strcmp? etc...
Implement our own
#define MEMCPY(NAME, DEST_AS, SRC_AS) \
    DEST_AS void* NAME(DEST_AS void*, SRC_AS const void*, uint); \
    DEST_AS void* NAME(DEST_AS void* dest, SRC_AS const void* src, uint size) { \
        DEST_AS uchar* cDest = (DEST_AS uchar*)dest; \
        SRC_AS const uchar* cSrc = (SRC_AS const uchar*)src; \
        for (uint i = 0; i < size; i++) \
            cDest[i] = cSrc[i]; \
        return (DEST_AS void*)cDest; \
    }

PTR_MACRO_DEST_SRC(MEMCPY, memcpy)
Produces:
memcpy_g, memcpy_l, memcpy_p
memcpy_gc, memcpy_gl, memcpy_gp
memcpy_lc, memcpy_lg, memcpy_lp
memcpy_pc, memcpy_pg, memcpy_pl
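Stripped of the address-space qualifiers, every one of those generated variants expands to the same byte-copy loop. A host-side C equivalent of the macro body:

```c
#include <assert.h>

/* Plain-C equivalent of the loop the MEMCPY macro expands to for
   each (dest, src) address-space pair; in host C the qualifiers
   simply disappear. */
void* memcpy_sketch(void* dest, const void* src, unsigned int size) {
    unsigned char* cDest = (unsigned char*)dest;
    const unsigned char* cSrc = (const unsigned char*)src;
    for (unsigned int i = 0; i < size; i++)
        cDest[i] = cSrc[i];
    return dest;
}
```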
No dynamic memory
No malloc()
No free()
What to do...
Yes! Dynamic memory
Create a large buffer of global memory - our “heap”
Implement our own malloc() and free()
Create a handle structure - “virtual memory”
P(T, hnd) macro to get the current pointer address
GPTR(handle) hnd = malloc(sizeof(uint));
GPTR(uint) ptr = P(uint, hnd);
*ptr = 0xdeadbeef;
free(hnd);
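A minimal host-side sketch of the idea: a bump allocator over a fixed "heap" buffer, with handles that store offsets so the allocator could later move memory behind them. All names here are my illustration, not Lateral's actual allocator:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical handle-based bump allocator over a fixed "heap"
   buffer, in the spirit of the slides. A handle holds an offset
   rather than a pointer ("virtual memory"), so a compacting
   allocator could relocate blocks and just update handles. */
#define HEAP_SIZE 1024

typedef struct { uint32_t offset; } handle;

static unsigned char g_heap[HEAP_SIZE];
static uint32_t g_next = 4; /* offset 0 reserved as "null" */

handle heap_alloc(uint32_t size) {
    handle h = { 0 };
    if (g_next + size <= HEAP_SIZE) {
        h.offset = g_next;
        g_next += size;
    }
    return h;
}

/* P(T, hnd)-style dereference: resolve the handle to a live pointer */
#define P(TYPE, hnd) ((TYPE*)(g_heap + (hnd).offset))
```

A real free() and compaction pass are omitted; the point is the handle indirection.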
Ok, we get the point...FYL!
High-level Architecture

Host: V8 + Esprima Parser → Device Mgr → Data Serializer & Marshaller
GPUs: Stack-based Interpreter + Data Heap + Garbage Collector

Flow: eval(code); → Build JSON AST → Serialize AST (JSON => C Structs) → Ship to GPU to Interpret → Fetch Result
AST Generation
JavaScript Source → Esprima in V8 → JSON AST (v8::Object) → Lateral AST (C structs)
Resource Generator
Embed esprima.js

$ resgen esprima.js resgen_esprima_js.c
resgen_esprima_js.c:

const unsigned char resgen_esprima_js[] = {
    0x2f, 0x2a, 0x0a, 0x20, 0x20, 0x43, 0x6f, 0x70,
    0x79, 0x72, 0x69, 0x67, 0x68, 0x74, 0x20, 0x28,
    0x43, 0x29, 0x20, 0x32,
    ...
    0x20, 0x3a, 0x20, 0x2a, 0x2f, 0x0a, 0x0a, 0
};
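The heart of a resgen-style tool is just emitting a C array initializer for the input bytes, with a trailing NUL so the array can double as a C string. A sketch (the real tool reads esprima.js from disk; this function signature is mine):

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* Emit a C byte-array initializer for `data` into `out`, the way a
   resgen-style resource generator would. Returns bytes written. */
int emit_resource(char* out, size_t cap, const char* name,
                  const unsigned char* data, size_t n) {
    int w = snprintf(out, cap, "const unsigned char %s[] = { ", name);
    for (size_t i = 0; i < n; i++)
        w += snprintf(out + w, cap - w, "0x%02x, ", data[i]);
    w += snprintf(out + w, cap - w, "0 };"); /* trailing NUL terminator */
    return w;
}
```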
ASTGenerator.cpp:

extern const char resgen_esprima_js;

void ASTGenerator::init()
{
    HandleScope scope;
    s_context = Context::New();
    s_context->Enter();
    Handle<Script> script = Script::Compile(String::New(&resgen_esprima_js));
    script->Run();
    s_context->Exit();
    s_initialized = true;
}
Build JSON AST

e.g.
ASTGenerator::esprimaParse("var xyz = new Array(10);");
Handle<Object> ASTGenerator::esprimaParse(const char* javascript)
{
    if (!s_initialized)
        init();

    HandleScope scope;
    s_context->Enter();
    Handle<Object> global = s_context->Global();
    Handle<Object> esprima = Handle<Object>::Cast(global->Get(String::New("esprima")));
    Handle<Function> esprimaParse = Handle<Function>::Cast(esprima->Get(String::New("parse")));
    Handle<String> code = String::New(javascript);
    Handle<Object> ast = Handle<Object>::Cast(esprimaParse->Call(esprima, 1, (Handle<Value>*)&code));

    s_context->Exit();
    return scope.Close(ast);
}
{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "xyz" }, "init": { "type": "NewExpression", "callee": { "type": "Identifier", "name": "Array" }, "arguments": [ { "type": "Literal", "value": 10 } ] } } ], "kind": "var"}
Build JSON AST
Lateral AST structs

typedef struct ast_type_st {
    CL(uint) id;
    CL(uint) size;
} ast_type;

typedef struct ast_program_st {
    ast_type type;
    CL(uint) body;
    CL(uint) numBody;
} ast_program;

typedef struct ast_identifier_st {
    ast_type type;
    CL(uint) name;
} ast_identifier;
Structs shared between Host and OpenCL:

#ifdef __OPENCL_VERSION__
#define CL(TYPE) TYPE
#else
#define CL(TYPE) cl_##TYPE
#endif
v8::Object => ast_type, expanded:

ast_type* vd1_1_init_id = (ast_type*)astCreateIdentifier("Array");
ast_type* vd1_1_init_args[1];
vd1_1_init_args[0] = (ast_type*)astCreateNumberLiteral(10);
ast_type* vd1_1_init = (ast_type*)astCreateNewExpression(vd1_1_init_id, vd1_1_init_args, 1);
free(vd1_1_init_id);
for (int i = 0; i < 1; i++)
    free(vd1_1_init_args[i]);
ast_type* vd1_1_id = (ast_type*)astCreateIdentifier("xyz");
ast_type* vd1_decls[1];
vd1_decls[0] = (ast_type*)astCreateVariableDeclarator(vd1_1_id, vd1_1_init);
free(vd1_1_id);
free(vd1_1_init);
ast_type* vd1 = (ast_type*)astCreateVariableDeclaration(vd1_decls, 1, "var");
for (int i = 0; i < 1; i++)
    free(vd1_decls[i]);
astCreateIdentifier:

ast_identifier* astCreateIdentifier(const char* str)
{
    CL(uint) size = sizeof(ast_identifier) + rnd(strlen(str) + 1, 4);
    ast_identifier* ast_id = (ast_identifier*)malloc(size);

    // copy the string
    strcpy((char*)(ast_id + 1), str);

    // fill the struct
    ast_id->type.id = AST_IDENTIFIER;
    ast_id->type.size = size;
    ast_id->name = sizeof(ast_identifier); // offset

    return ast_id;
}
astCreateIdentifier("xyz") memory layout:

offset  field      value
0       type.id    AST_IDENTIFIER (0x01)
4       type.size  16
8       name       12 (offset)
12      str[0]     'x'
13      str[1]     'y'
14      str[2]     'z'
15      str[3]     '\0'
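The layout above can be reproduced and checked on the host with plain C. This is a self-contained re-creation for illustration; the `AST_IDENTIFIER` value and the `rnd` round-up helper are assumptions inferred from the slides:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* Re-creation of the ast_identifier blob layout from the slides.
   AST_IDENTIFIER and rnd() (round n up to a multiple of m) are
   assumptions based on the layout table. */
typedef struct { uint32_t id; uint32_t size; } ast_type;
typedef struct { ast_type type; uint32_t name; } ast_identifier;

#define AST_IDENTIFIER 0x01

static uint32_t rnd(uint32_t n, uint32_t m) { return (n + m - 1) / m * m; }

ast_identifier* astCreateIdentifier(const char* str)
{
    uint32_t size = sizeof(ast_identifier) + rnd((uint32_t)strlen(str) + 1, 4);
    ast_identifier* ast_id = (ast_identifier*)malloc(size);

    strcpy((char*)(ast_id + 1), str);      /* string lives right after the struct */
    ast_id->type.id = AST_IDENTIFIER;
    ast_id->type.size = size;
    ast_id->name = sizeof(ast_identifier); /* byte offset of the string */
    return ast_id;
}
```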
astCreateNewExpression:

ast_expression_new* astCreateNewExpression(ast_type* callee, ast_type** arguments, int numArgs)
{
    CL(uint) size = sizeof(ast_expression_new) + callee->size;
    for (int i = 0; i < numArgs; i++)
        size += arguments[i]->size;

    ast_expression_new* ast_new = (ast_expression_new*)malloc(size);
    ast_new->type.id = AST_NEW_EXPR;
    ast_new->type.size = size;

    CL(uint) offset = sizeof(ast_expression_new);
    char* dest = (char*)ast_new;

    // copy callee
    memcpy(dest + offset, callee, callee->size);
    ast_new->callee = offset;
    offset += callee->size;

    // copy arguments
    if (numArgs) {
        ast_new->arguments = offset;
        for (int i = 0; i < numArgs; i++) {
            ast_type* arg = arguments[i];
            memcpy(dest + offset, arg, arg->size);
            offset += arg->size;
        }
    } else
        ast_new->arguments = 0;
    ast_new->numArguments = numArgs;

    return ast_new;
}
new Array(10) memory layout:

offset  field           value
0       type.id         AST_NEW_EXPR (0x308)
4       type.size       52
8       callee          20 (offset)
12      arguments       40 (offset)
16      numArguments    1
20      callee node     ast_identifier ("Array")
40      arguments node  ast_literal_number (10)
Lateral AST structs
Shared across the Host and the OpenCL runtime; Host writes, Lateral reads
Constructed on Host as contiguous blobs
Easy to send to GPU: memcpy(gpu, ast, ast->size);
Fast to send to GPU: single buffer write
Simple to traverse w/ pointer arithmetic
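Because children are stored as byte offsets from their parent node, traversal really is just pointer arithmetic. A sketch of the idea (`ast_child` is my helper name, not Lateral's):

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Walking a contiguous AST blob with pointer arithmetic. Children
   are stored as byte offsets from the start of the parent node, so
   the whole tree ships to the GPU in one buffer write. */
typedef struct { uint32_t id; uint32_t size; } ast_type;

static const ast_type* ast_child(const ast_type* node, uint32_t offset)
{
    return (const ast_type*)((const char*)node + offset);
}
```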
Stack-based Interpreter

Building Blocks:
Heap
AST Traverse Loop / Interpret Loop
AST Traverse Stack
Symbol/Ref Table
Call/Exec Stack
Return Stack
Lateral State
Scope Stack
JS Type Structs
#include "state.h"#include "jsvm/asttraverse.h"#include "jsvm/interpreter.h"
// Setup VM structureskernel void lateral_init(GPTR(uchar) lateral_heap) { LATERAL_STATE_INIT}
// Interpret the ASTkernel void lateral(GPTR(uchar) lateral_heap, GPTR(ast_type) lateral_ast) { LATERAL_STATE
ast_push(lateral_ast); while (!Q_EMPTY(lateral_state->ast_stack, ast_q) || !Q_EMPTY(lateral_state->call_stack, call_q)) { while (!Q_EMPTY(lateral_state->ast_stack, ast_q)) traverse(); if (!Q_EMPTY(lateral_state->call_stack, call_q)) interpret(); }}
Kernels
var x = 1 + 2;
Let’s interpret...
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
VarDecl
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
VarDtor
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
IdentBinary
VarDtor
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
IdentLiteralLiteral
VarDtorBinary
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
IdentLiteral
VarDtorBinaryLiteral
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
Ident VarDtorBinaryLiteralLiteral
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
VarDtorBinaryLiteralLiteralIdent
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
VarDtorBinaryLiteralLiteral
“x”
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
VarDtorBinaryLiteral
“x”1
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
VarDtorBinary
“x”12
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
VarDtor “x”3
var x = 1 + 2;{ "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var"}
AST Call Return
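The return-stack discipline traced above can be boiled down to a tiny sketch: literal nodes push their value, and a binary "+" node pops its two operands and pushes the sum, which the VariableDeclarator then binds. This is my illustration of the mechanism, not Lateral's actual interpreter:

```c
#include <assert.h>

/* Minimal sketch of the return-stack discipline in the trace:
   Literal pushes a value; a Binary "+" pops two operands and
   pushes the sum. */
#define MAX_STACK 16

typedef struct { double vals[MAX_STACK]; int top; } ret_stack;

void push_literal(ret_stack* s, double v)
{
    s->vals[s->top++] = v;
}

void interpret_binary_add(ret_stack* s)
{
    double right = s->vals[--s->top];
    double left  = s->vals[--s->top];
    s->vals[s->top++] = left + right;
}
```

For `var x = 1 + 2;`: push 1, push 2, interpret the Binary node, and a single 3 is left for the VarDtor to bind to x.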
Benchmark
var input = new Array(10);
for (var i = 0; i < input.length; i++) {
    input[i] = Math.pow((i + 1) / 1.23, 3);
}
Small loop of FLOPs
Execution Time - Lateral

GPU CL             CPU CL                  V8
ATI Radeon 6770m   Intel Core i7 4x2.4GHz  Intel Core i7 4x2.4GHz
116.571533ms       0.226007ms              0.090664ms
What went wrong?
Everything
Stack-based AST Interpreter, no optimizations
Heavy global memory access, no optimizations
No data or task parallelism
Slow as molasses
Memory hog, Eclipse style
Heavy memory access: “var x = 1 + 2;” == 30 stack hits alone!
Too much dynamic allocation
No inline optimizations, just following the yellow brick AST
Straight up lazy
Stack-based Interpreter
Replace with something better!
Bytecode compiler on Host
Bytecode register-based interpreter on Device
Too much global access
Everything is dynamically allocated to global memory
Register-based interpreter & bytecode compiler can make better use of local and private memory
// 11.1207 seconds
size_t tid = get_global_id(0);
c[tid] = a[tid];
while (b[tid] > 0) {  // touch global memory on each loop
    b[tid]--;         // touch global memory on each loop
    c[tid]++;         // touch global memory on each loop
}

// 0.0445558 seconds!! HOLY SHIT!
size_t tid = get_global_id(0);
int tmp = a[tid];                 // temp private variable
for (int i = b[tid]; i > 0; i--)
    tmp++;                        // touch private variables on each loop
c[tid] = tmp;                     // touch global memory one time
Optimizing memory access yields crazy results
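The transformation is easy to verify in plain C: hoist the repeatedly-touched array slot into a local temporary and write back once. Both versions compute the same result; on a GPU only the second avoids hammering global memory:

```c
#include <assert.h>

/* The slide's optimization in plain C. slow_version touches the
   arrays (global memory on a GPU) every iteration; fast_version
   caches the value in a temporary (private memory) and writes
   back once. */
void slow_version(const int* a, int* b, int* c, int tid)
{
    c[tid] = a[tid];
    while (b[tid] > 0) {
        b[tid]--;
        c[tid]++;
    }
}

void fast_version(const int* a, const int* b, int* c, int tid)
{
    int tmp = a[tid];
    for (int i = b[tid]; i > 0; i--)
        tmp++;
    c[tid] = tmp;
}
```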
No data or task parallelism
Everything being interpreted in a single “thread”
We have hundreds of cores available to us!
Build in heuristics:
Identify side-effect free statements
Break into parallel tasks - very magical
var input = new Array(10);
for (var i = 0; i < input.length; i++) {
    input[i] = Math.pow((i + 1) / 1.23, 3);
}

becomes, in parallel:

input[0] = Math.pow((0 + 1) / 1.23, 3);
input[1] = Math.pow((1 + 1) / 1.23, 3);
...
input[9] = Math.pow((9 + 1) / 1.23, 3);
What’s in store
Acceptable performance on all CL devices
V8/Node extension to launch Lateral tasks
High-level API to perform map-reduce, etc.
Lateral-cluster... mmmmm
Thanks!
Jarred Nicholls
@jarrednicholls