Building a Big Data Machine Learning Platform

Cliff Click, CTO, 0xdata
[email protected] · http://0xdata.com · http://cliffc.org/blog


DESCRIPTION

H2O - It's open source, in-memory, big data, clustered computing - Math At Scale. We have the world's fastest logistic regression (by a lot!), the world's first (and fastest) distributed Gradient Boosting Machine (GBM), plus Random Forest, PCA, KMeans++, etc. It offers R's "plyr"-style data munging at scale, including ddply (Group-By, for the SQL crowd) and much of R's expressive coding style. We built H2O, an open-source platform for working with in-memory distributed data. Then we built, on top of H2O, state-of-the-art predictive modeling and analytics (e.g. GLM & logistic regression, GBM, Random Forest, Neural Nets, and PCA, to name a few) that's 1000x faster than the disk-bound alternatives, and 100x faster than R (we love R, but it's too slow on big data!). We can run R expressions on tera-scale datasets, or munge data from Scala & Python. We build our newest algorithms in a few weeks, start to finish, because the platform makes Big Math easy. We routinely test on 100 GB datasets, and have customers using 1 TB datasets. This talk is about the platform, coding style & API that let us seamlessly handle datasets from 1 KB to 1 TB without changing a line of code, on clusters ranging from your laptop to 100 servers with many TB of RAM and hundreds of CPUs.


Page 1: Building a Big Data Machine Learning Platform

Building a Big Data Machine Learning Platform

Cliff Click, CTO, 0xdata
[email protected]
http://0xdata.com
http://cliffc.org/blog

Page 2: Building a Big Data Machine Learning Platform

H2O is...

● Pure Java, Open Source: 0xdata.com
● https://github.com/0xdata/h2o/
● A Platform for doing Parallel Distributed Math
● In-memory analytics: GLM, GBM, RF, Logistic Regression, Deep Learning, PCA, KMeans...
● Accessible via REST & JSON, R, Python, Java
● Scala & now Spark!

Page 3: Building a Big Data Machine Learning Platform

Sparkling Water

● Bleeding edge: Spark & H2O RDDs
● Move data back & forth, model & munge
● Same process, same JVM
● H2O Data as a:
  ● Spark RDD, or
  ● Scala Collection
● Code in:
  ● https://github.com/0xdata/h2o-dev
  ● https://github.com/0xdata/perrier

Frame.toRDD.runJob(...)

Frame.foreach{...}

Page 4: Building a Big Data Machine Learning Platform

A Collection of Distributed Vectors

// A Distributed Vector: much more than 2 billion elements
class Vec {
  long length();                // more than an int's worth

  // fast random access
  double at(long idx);          // get the idx'th elem
  boolean isNA(long idx);

  void set(long idx, double d); // writable
  void append(double d);        // variable sized
}
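
A minimal usage sketch of the Vec API above (hedged: the vec variable and the mean computation are illustrative, not from the deck):

// Sum a Vec with the API above, skipping missing values
double sum = 0; long n = 0;
for (long i = 0; i < vec.length(); i++)
  if (!vec.isNA(i)) { sum += vec.at(i); n++; }
double mean = sum / n;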

Page 5: Building a Big Data Machine Learning Platform

Distributed Data Taxonomy

A Single Vector: Vec

Page 6: Building a Big Data Machine Learning Platform

Distributed Data Taxonomy

A Very Large Single Vec: >> 2 billion elements

● Java primitive, usually double
● Length is a long: >> 2^31 elements
● Compressed, often 2x to 4x
● Random access
● Linear access is FORTRAN speed

Page 7: Building a Big Data Machine Learning Platform

Distributed Data Taxonomy

A Single Distributed Vec: >> 2 billion elements

[Diagram: one Vec split across four JVM heaps, 32 GB each]

● Java Heap: data in-heap, not off-heap
● Split across heaps
● GC management: watch FullGC, spill-to-disk
  ● GC very cheap with the default GC
● To-the-metal speed, Java ease

Page 8: Building a Big Data Machine Learning Platform

Distributed Data Taxonomy

A Collection of Distributed Vecs

[Diagram: five Vecs, each split across four JVM heaps]

● Vecs aligned in heaps
● Optimized for concurrent access
● Random access: any row, any JVM
● But faster if local... more on that later

Page 9: Building a Big Data Machine Learning Platform

Distributed Data Taxonomy

A Frame: Vec[] (columns: age, sex, zip, ID, car)

[Diagram: the five Vecs of a Frame split across four JVM heaps]

● Similar to an R frame
● Change Vecs freely
● Add, remove Vecs
● Describes a row of user data
● Struct-of-Arrays (vs array-of-structs)

Page 10: Building a Big Data Machine Learning Platform

Distributed Data Taxonomy

A Chunk: the Unit of Parallel Access

[Diagram: each Vec split into per-heap Chunks across four JVMs]

● Typically 1e3 to 1e6 elements
● Stored compressed, in byte arrays
● Get/put is a few clock cycles, including compression
● Compression is good: more data per cache-miss
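
A sketch of the Chunk shape these bullets imply (hedged: the field and method names follow later slides, but the class layout itself is an assumption):

// A Chunk: a compressed window onto ~1e3..1e6 elements of one Vec
abstract class Chunk {
  int len;                   // elements in this chunk
  byte[] mem;                // compressed storage, lives in one heap
  abstract double at(int i); // decompress-on-read: a few cycles
  abstract void set(int i, double d); // compress-on-write
}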

Page 11: Building a Big Data Machine Learning Platform

Distributed Data Taxonomy

A Chunk[]: Concurrent Vec Access (columns: age, sex, zip, ID, car)

[Diagram: a row cuts across the aligned Chunks of all five Vecs]

● Access a Row in a single thread, like a Java object: class Person { ... }
● Can read & write: Mutable Vectors
● Both are full Java speed
● Conflicting writes: use JMM rules

Page 12: Building a Big Data Machine Learning Platform

Distributed Data Taxonomy

Single Threaded Execution

[Diagram: one CPU works one Chunk of rows across the Vecs]

● One CPU works a Chunk of rows
● Fork/Join work unit
  ● Big enough to cover control overheads
  ● Small enough to get fine-grained parallelism
● Map/Reduce
● Code written in a simple single-threaded style

Page 13: Building a Big Data Machine Learning Platform

Distributed Data Taxonomy

Distributed Parallel Execution

[Diagram: all CPUs on all four JVMs work Chunks in parallel]

● All CPUs grab Chunks in parallel
● F/J load balances
● Code moves to Data
● Map/Reduce & F/J handle all sync
● H2O handles all comm & data management

Page 14: Building a Big Data Machine Learning Platform

Distributed Data Taxonomy

Frame – a collection of Vecs
Vec – a collection of Chunks
Chunk – a collection of 1e3 to 1e6 elems
elem – a Java double
Row i – the i'th elements of all the Vecs in a Frame
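
Tying the taxonomy together, a hedged sketch of reading one row (the fr variable and vecs() accessor are assumptions for illustration):

// Row i is the i'th element of every Vec in the Frame
Vec[] vecs = fr.vecs();      // 'fr' and 'vecs()' assumed
long i = 42;                 // some row index
double[] row = new double[vecs.length];
for (int c = 0; c < vecs.length; c++)
  row[c] = vecs[c].at(i);    // random access: any row, any JVM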

Page 15: Building a Big Data Machine Learning Platform

Sparkling Water: Spark and H2O

● Convert RDDs <==> Frames
  ● In memory: a simple, fast call
  ● In process: no external tooling needed
  ● Distributed: data does not move*
  ● Eager, not Lazy
● Makes a data copy!
  ● H2O data is highly compressed
  ● Often 1/4 to 1/10th the original size

*See fine print

Page 16: Building a Big Data Machine Learning Platform

Spark Partitions and H2O Chunks

ID   age  sex  salary
101  27   M     50000
102  21   F     25000
103  45   -    120000
104  53   M    150000

[Diagram: the same four rows held two ways inside one JVM heap, as a Spark Partition[User] (an array of row objects) and as H2O Chunks (one compressed column-chunk per Vec)]

These structures are limited to a single JVM heap. There can be many in a heap, limited only by memory.

*Only the data correspondence is shown; a real data copy is required!

Page 17: Building a Big Data Machine Learning Platform

Spark RDDs and H2O Frames

[Diagram: a 32-row table (ID, age, sex, salary; IDs 101-132) split across four JVM heaps, 8 rows per heap. In each heap the rows form one Spark Partition and, column-wise, one H2O Chunk per Vec. The Partitions together form an RDD; the aligned Chunks form the Vecs of a Frame.]

Page 18: Building a Big Data Machine Learning Platform

Sparkling Water

● Convert to an H2O Frame
  ● Eager: executes RDDs immediately
  ● Makes a compressed H2O copy
● Convert to a Spark RDD
  ● Lazy: defines a normal RDD
  ● When executed, acts as a checkpoint

val fr  = toDataFrame(sparkCx, rdd)
val rdd = toRDD(sparkCx, fr)

Page 19: Building a Big Data Machine Learning Platform

Distributed Coding Taxonomy

● No Distribution Coding:
  ● Whole Algorithms, Whole Vector-Math
  ● REST + JSON: e.g. load data, GLM, get results
  ● R, Python, Web, bash/curl
● Simple Data-Parallel Coding:
  ● Map/Reduce-style: e.g. any dense linear algebra
  ● Java/Scala foreach* style
● Complex Data-Parallel Coding:
  ● K/V Store, Graph Algos, e.g. PageRank

Page 20: Building a Big Data Machine Learning Platform

Simple Data-Parallel Coding

● Map/Reduce Per-Row: Stateless
● Example from Linear Regression: Σ y²
● Auto-parallel, auto-distributed
● Fortran speed, Java ease

double sumY2 = new MRTask() {
  double map( double d ) { return d*d; }
  double reduce( double d1, double d2 ) { return d1+d2; }
}.doAll( vecY );

Page 21: Building a Big Data Machine Learning Platform

Simple Data-Parallel Coding

● Scala version in development:

MR(var X:Double = 0) {
  def map = (d:Double) => { X = d*d }
  def reduce = (that) => { X += that.X }
}.doAll( vecY )

● Or:

MR(0).map(_*_).reduce(add).doAll( vecY )

Page 22: Building a Big Data Machine Learning Platform

Simple Data-Parallel Coding

● Map/Reduce Per-Row: Stateful
● Linear Regression Pass 1: Σ x, Σ y, Σ y²

class LRPass1 extends MRTask {
  double sumX, sumY, sumY2; // I Can Haz State?
  void map( double X, double Y ) {
    sumX += X; sumY += Y; sumY2 += Y*Y;
  }
  void reduce( LRPass1 that ) {
    sumX  += that.sumX ;
    sumY  += that.sumY ;
    sumY2 += that.sumY2;
  }
}
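
A hedged usage sketch of the stateful task above (vecX/vecY and the doAll dispatch follow the earlier slides):

// Run the pass, then derive simple statistics from the sums
LRPass1 lr = new LRPass1();
lr.doAll( vecX, vecY );
long   n     = vecY.length();
double meanY = lr.sumY  / n;
double varY  = lr.sumY2 / n - meanY*meanY; // E[y²] - E[y]²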

Page 23: Building a Big Data Machine Learning Platform

Simple Data-Parallel Coding

● Scala version in development:

MR { var X, Y, X2 = 0.0; var nrows = 0L
  def map = (x:Double, y:Double) => { X = x; Y = y; X2 = x*x; nrows = 1 }
  def reduce = (that) => {
    X += that.X; Y += that.Y; X2 += that.X2; nrows += that.nrows
  }
}.doAll( vecX, vecY )

Page 24: Building a Big Data Machine Learning Platform

Simple Data-Parallel Coding

● Map/Reduce Per-Row: Batch Stateful

class LRPass1 extends MRTask {
  double sumX, sumY, sumY2;
  void map( Chunk CX, Chunk CY ) {      // whole Chunks
    for( int i=0; i<CX.len; i++ ) {     // batch!
      double X = CX.at(i), Y = CY.at(i);
      sumX += X; sumY += Y; sumY2 += Y*Y;
    }
  }
  void reduce( LRPass1 that ) {
    sumX  += that.sumX ;
    sumY  += that.sumY ;
    sumY2 += that.sumY2;
  }
}

Page 25: Building a Big Data Machine Learning Platform

Other Simple Examples

● Filter & Count (underage males):
● (can pass in any number of Vecs or a Frame)

long count = new MRTask() {
  long map( long age, long sex ) {
    return (age<=17 && sex==MALE) ? 1 : 0;
  }
  long reduce( long d1, long d2 ) { return d1+d2; }
}.doAll( vecAge, vecSex );

● Scala syntax:

MR(0).map( _('age)<=17 && _('sex)==MALE )
     .reduce(add).doAll( frame )

Page 26: Building a Big Data Machine Learning Platform

Other Simple Examples

● Filter into a new set (underage males):
● Can write or append a subset of rows
  – (append order is preserved)

class Filter extends MRTask {
  void map( Chunk CRisk, Chunk CAge, Chunk CSex ) {
    for( int i=0; i<CAge.len; i++ )
      if( CAge.at(i)<=17 && CSex.at(i)==MALE )
        CRisk.append(CAge.at(i)); // build a set
  }
};
Vec risk = new AppendableVec();
new Filter().doAll( risk, vecAge, vecSex );
...risk... // all the underage males


Page 28: Building a Big Data Machine Learning Platform

Other Simple Examples

● Group-by: count of car-types by age

class AgeHisto extends MRTask {
  long carAges[][]; // count of cars by age
  void map( Chunk CAge, Chunk CCar ) {
    carAges = new long[numAges][numCars];
    for( int i=0; i<CAge.len; i++ )
      carAges[(int)CAge.at(i)][(int)CCar.at(i)]++;
  }
  void reduce( AgeHisto that ) {
    for( int i=0; i<carAges.length; i++ )
      for( int j=0; j<carAges[i].length; j++ )
        carAges[i][j] += that.carAges[i][j];
  }
}
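
A hedged usage sketch (vecAge/vecCar and doAll follow the earlier slides; the indices are made up):

// Run the group-by, then read counts out of the rolled-up histogram
AgeHisto h = new AgeHisto();
h.doAll( vecAge, vecCar );
long n = h.carAges[18][3]; // e.g. 18-year-olds driving car-type 3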

Page 29: Building a Big Data Machine Learning Platform

Other Simple Examples

● Group-by: count of car-types by age

Setting carAges in map() makes it an output field: it is private per map call, with single-threaded write access, and must be rolled up in the reduce call.

Page 30: Building a Big Data Machine Learning Platform

Other Simple Examples

● Uniques
● Uses a distributed hash set

class Uniques extends MRTask {
  DNonBlockingHashSet<Long> dnbhs = new ...;
  void map( long id ) { dnbhs.add(id); }
  void reduce( Uniques that ) { dnbhs.putAll(that.dnbhs); }
};
long uniques = new Uniques().doAll( vecVisitors ).dnbhs.size();

Page 31: Building a Big Data Machine Learning Platform

Other Simple Examples

● Uniques
● Uses a distributed hash set

Setting dnbhs in <init> makes it an input field: shared across all map() calls, and often read-only. This one is written to, so it needs a reduce.

Page 32: Building a Big Data Machine Learning Platform

Summary: Write (parallel) Java

● Most simple Java "just works"
● The Scala API is experimental, but will also "just work"
● Fast: parallel distributed reads, writes, appends
  ● Reads: same speed as plain Java array loads
  ● Writes, appends: slightly slower (compression)
  ● Typically memory-bandwidth limited
    – (may be CPU limited in a few cases)
● Slower: conflicting writes (but follows the strict JMM)
● Also supports transactional updates

Page 33: Building a Big Data Machine Learning Platform

Summary: Writing Analytics

● We're writing Big Data Distributed Analytics:
  ● Deep Learning
  ● Generalized Linear Modeling (ADMM, GLMNET)
    – Logistic Regression, Poisson, Gamma
  ● Random Forest, GBM, KMeans++, KNN
● Solidly working on 100 GB datasets
● Testing tera-scale now
● Paying customers (in production!)
● Come write your own (distributed) algorithm!!!

Page 34: Building a Big Data Machine Learning Platform

Q & A

0xdata.com
https://github.com/0xdata/h2o
https://github.com/0xdata/h2o-dev
https://github.com/0xdata/h2o-perrier

Page 35: Building a Big Data Machine Learning Platform

Cool Systems Stuff...

● ...that I ran out of space for
● Reliable UDP, integrated w/ RPC
  ● TCP is reliably UNreliable
  ● Already have a reliable UDP framework, so no problem
● Fork/Join goodies:
  ● Priority queues
  ● Distributed F/J
  ● Surviving fork bombs & lost threads
● The K/V store does the JMM via a hardware-like MESI protocol

Page 36: Building a Big Data Machine Learning Platform

Speed Concerns

● How fast is fast?
● Data is Big (by definition!) & we must see it all
● Typically: less math than memory bandwidth
  ● So decompression happens while waiting for memory
  ● More (de)compression is better
  ● Currently 15 compression schemes
  ● Picked per-chunk, so can (and does) vary across a dataset
  ● All decompression schemes take 2-10 cycles max
● Time left over for plenty of math
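
As an illustration only (not one of H2O's actual 15 schemes), a bias-and-scale byte encoding shows how a decode can cost just a couple of cycles:

// Hypothetical per-chunk compression: 1 byte per element,
// plus per-chunk decode constants applied inline on every read
class ScaledByteChunk {
  byte[] mem;          // 8x smaller than a double[]
  double bias, scale;  // per-chunk decode constants
  double at(int i) { return bias + (mem[i] & 0xFF) * scale; }
}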

Page 37: Building a Big Data Machine Learning Platform

Speed Concerns

● Serialization:
  ● Rarely send Big Data around (too much of it! Access must normally be node-local)
  ● Instead it's POJOs doing the math (histograms, Gram matrix, sums & variances, etc.)
● Bytecode weaver on class load
  ● Writes fields via Unsafe into DirectByteBuffers
  ● A 2-byte unique token defines the type (and nested types)
  ● Compression on that too! (more CPU than network)

Page 38: Building a Big Data Machine Learning Platform

Serialization

● Write fields via Unsafe into DirectByteBuffers
● All from simple JIT'd code
  – just the loads & stores, nothing else
● 2-byte token once per top-level RPC
  – (more tokens if subclassed objects are used)
● Streaming async NIO
● Multiple shared TCP & UDP channels
  – small stuff via UDP & big via TCP
● Full app-level retry & error recovery
  – (can pull a cable & re-insert it & all will recover)
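
A hedged sketch of the wire idea (illustrative only; the token value and field layout are assumptions, not H2O's real format):

// A 2-byte type token, then just the raw field stores
import java.nio.ByteBuffer;

class WireSketch {
  static final short TYPE_TOKEN = 0x0042;  // hypothetical type id
  static ByteBuffer write(double sumX, double sumY) {
    ByteBuffer bb = ByteBuffer.allocateDirect(2 + 8 + 8);
    bb.putShort(TYPE_TOKEN); // receiver maps token -> class
    bb.putDouble(sumX);      // then only the field loads & stores
    bb.putDouble(sumY);
    return bb;
  }
}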

Page 39: Building a Big Data Machine Learning Platform

Map / Reduce

● Map: once per Chunk; typically 1000's per node
  ● Using Fork/Join for fine-grained parallelism
● Reduce: reduce early & often – after every 2 maps
  ● Deterministic: same Maps, same rows, every time
  ● Until all the Maps & Reduces on a node are done
● Then ship results over-the-wire
  ● And Reduce globally in a log-tree rollup
● Network latency is 2 log-tree traversals
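
A hedged sketch of the shape of that global rollup (generic pseudocode; reduce matches the MRTask pattern from the earlier slides):

// Combine per-node results pairwise in a tree: O(log n) network hops
static LRPass1 treeReduce(LRPass1[] nodeResults, int lo, int hi) {
  if (hi - lo == 1) return nodeResults[lo];
  int mid = (lo + hi) / 2;
  LRPass1 left  = treeReduce(nodeResults, lo,  mid);
  LRPass1 right = treeReduce(nodeResults, mid, hi);
  left.reduce(right); // same reduce as the per-node rollup
  return left;
}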

Page 40: Building a Big Data Machine Learning Platform

Fork/Join Experience

● Really Good (except when it's not)
● Good stuff: easy to write...
  ● (after a steep learning curve)
● Works! Fine to have many, many small jobs; load balances across CPUs, keeps 'em all busy, etc.
● Full-featured, flexible
● We've got 100's of uses of it scattered throughout

Page 41: Building a Big Data Machine Learning Platform

Fork/Join Experience

● Really Good (except when it's not)
● Blocking threads is hard on F/J
  – (the ManagedBlocker.block API is painful)
  – Still get thread starvation sometimes
● "CountedCompleters" – CPS by any other name
  – Painful to write explicit-CPS in Java
● No priority queues – a must-have
  ● And no Java thread priorities
  ● So we built priority queues on top of F/J & the JVM

Page 42: Building a Big Data Machine Learning Platform

Fork/Join Experience

● Default: exceptions are silently dropped
  ● Usual symptom: all threads idle, but the job is not done
  ● A complete maintenance disaster – must catch & track & log all exceptions
    – (and even pass them around the distributed cluster)
● A forgotten "tryComplete()" is not too hard to track
● Fork-bomb – must cap all thread pools
  ● Which can lead to deadlock
  ● Which leads to using CPS-style occasionally
● Despite the issues, I'd use it again

Page 43: Building a Big Data Machine Learning Platform

The Platform

[Diagram: the per-JVM software stack, shown identically for JVM 1 and JVM 2. From bottom to top: NFS / HDFS storage feeding byte[]s; objects extending Iced, serialized via AutoBuffer; DTask and RPC; DRemoteTask on distributed Fork/Join; user code extending MRTask on top. The two JVMs communicate via K/V get/put over UDP / TCP.]

Page 44: Building a Big Data Machine Learning Platform

TCP Fails

● In < 5 mins, I can force a TCP fail on Linux
● "Fail" means: the server opens + writes + closes with NO ERRORS, and the client gets no data and no errors
● In my lab (no virtualization) or on EC2
● Basically, H2O can mimic a DDOS attack
  ● And Linux will "cheat" on the TCP protocol
  ● And cancel valid, in-progress TCP handshakes
  ● Verified w/ wireshark

Page 45: Building a Big Data Machine Learning Platform

TCP Fails

● Any formal verification? (yes, lots)
  ● Of recent Linux kernels?
  ● Ones with DDOS-defense built in?