"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias Johansson, Lead Developer at Valo.io

Page 1

Tobias Johansson, @ntjohansson

27/10/2016

Big data analytics

Einstürzenden Neudaten: Building an analytics engine from scratch

Page 2: What is Valo

• Big data analytics engine
• Focusing on simplicity from a usage perspective
• Single process containing
  • Time-series repository
  • Semi-structured repository
  • Execution engine
  • Etc.
• Written in Scala/C++/Lua

Page 3: What is Valo

• REST based

PUT /streams/sensors/environment/air

{
  "sampleTime": { "type": "datetime" },
  "sensor": { "type": "contributor" },
  "pollution": { "type": "double" }
}

POST /streams/sensors/environment/air

{
  "sampleTime": "2016/10/27 15:13:00",
  "sensor": "131e90ad-e32a",
  "pollution": 85.6
}
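As a rough illustration of the two calls on this slide, the sketch below builds (but does not send) the PUT and POST requests with the JDK's `java.net.http` API. The base URL `http://localhost:8888` and the names `StreamRequests`, `createStream` and `publishEvent` are assumptions made for the example; the slides do not say where a Valo node listens or what a client should look like.

```scala
import java.net.URI
import java.net.http.HttpRequest
import java.net.http.HttpRequest.BodyPublishers

object StreamRequests {
  // Hypothetical base URL: the deck does not give a host or port.
  val baseUrl = "http://localhost:8888"
  val streamPath = "/streams/sensors/environment/air"

  // Schema body from the slide: each field name maps to a declared type.
  val schemaJson: String =
    """{
      |  "sampleTime": { "type": "datetime" },
      |  "sensor": { "type": "contributor" },
      |  "pollution": { "type": "double" }
      |}""".stripMargin

  // Event body from the slide: one fact matching the schema above.
  val eventJson: String =
    """{
      |  "sampleTime": "2016/10/27 15:13:00",
      |  "sensor": "131e90ad-e32a",
      |  "pollution": 85.6
      |}""".stripMargin

  private def jsonRequest(): HttpRequest.Builder =
    HttpRequest
      .newBuilder(URI.create(baseUrl + streamPath))
      .header("Content-Type", "application/json")

  // PUT with the schema as the body.
  def createStream(): HttpRequest =
    jsonRequest().PUT(BodyPublishers.ofString(schemaJson)).build()

  // POST with a single event as the body.
  def publishEvent(): HttpRequest =
    jsonRequest().POST(BodyPublishers.ofString(eventJson)).build()

  def main(args: Array[String]): Unit = {
    // The requests are only built and printed; sending them needs a running server.
    Seq(createStream(), publishEvent()).foreach { r =>
      println(s"${r.method()} ${r.uri()}")
    }
  }
}
```

Sending either request would be a call to `HttpClient.newHttpClient().send(...)` against a real endpoint, which is left out because the endpoint here is invented.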

Page 4: What is Valo

• Data friendly

POST /streams/sensors/environment/air
Content-Type: application/json

POST /streams/sensors/environment/air
Content-Type: application/cbor

POST /streams/sensors/environment/air
Content-Type: application/csv

POST /streams/sensors/environment/air
Content-Type: application/bson

Time-series, semi-structured

Page 5: What is Valo

• Real-time and historical queries

Page 6: Looks simple?

Trust me, it is not.

Page 7: Looks simple?

Trust me, it is not.

Dynamo style clustering and vector-clocks

Eventual consistency

Gossip protocols

Distributed algorithms

Distributed execution engine

Expression trees and runtime code generation

Query rewriting and optimization

Consistent hashing

Time-series repository

Semi-structured repository

Data atomicity

Back pressure

Elasticity

Advanced ML algorithms

IO

Actor systems

Data distribution

Cluster management

B+ trees

Query language

KV-store

REST-api

Jump consistent hashing

Off-heap memory

Data formats

Distributed joins

Time semantics

Gap-filling

Statistical models

Distributed CRDTs

Transports

Real-time queries

Page 9: Know your cluster

It will crash

Page 10: Know your cluster

• You need a cluster to run big data analytics on, but it is built on:
  • Commodity hardware, which can fail
  • An unreliable network

Page 12: Know your cluster

• Issues:
  • Unreachable nodes
  • Dropped messages
  • Delayed messages
  • No response
  • Split network
    • Multiple working clusters
    • Mutable state is likely to diverge

Page 13: Know your cluster

• Accept these issues and don't try to fight them. Make life simpler by:
  • Not having a single point of failure
    • No leaders
    • No master/slave
    • No special nodes
  • Making it eventually consistent
    • Use CRDTs for sets, counters, etc.
    • Use vector clocks for configuration
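To make the last two bullets concrete, here is a minimal sketch of a grow-only counter (one of the simplest CRDTs) and a vector clock comparison. It is a textbook illustration, not Valo's implementation; the names `GCounter`, `VectorClock` and `Relation` are made up for the example.

```scala
object EventualConsistency {
  // G-Counter: one slot per node. Merge takes the per-node maximum, so merging is
  // commutative, associative and idempotent, and replicas converge in any order.
  final case class GCounter(counts: Map[String, Long] = Map.empty) {
    def increment(node: String, by: Long = 1L): GCounter =
      copy(counts = counts.updated(node, counts.getOrElse(node, 0L) + by))

    def value: Long = counts.values.sum

    def merge(other: GCounter): GCounter =
      GCounter((counts.keySet ++ other.counts.keySet).map { n =>
        n -> math.max(counts.getOrElse(n, 0L), other.counts.getOrElse(n, 0L))
      }.toMap)
  }

  sealed trait Relation
  case object Before extends Relation
  case object After extends Relation
  case object Same extends Relation
  case object Concurrent extends Relation

  // Vector clock: each node counts its own updates. Two versions are concurrent
  // when each has seen an update that the other has not.
  final case class VectorClock(ticks: Map[String, Long] = Map.empty) {
    def tick(node: String): VectorClock =
      copy(ticks = ticks.updated(node, ticks.getOrElse(node, 0L) + 1L))

    def compare(other: VectorClock): Relation = {
      val nodes = ticks.keySet ++ other.ticks.keySet
      val less = nodes.exists(n => ticks.getOrElse(n, 0L) < other.ticks.getOrElse(n, 0L))
      val more = nodes.exists(n => ticks.getOrElse(n, 0L) > other.ticks.getOrElse(n, 0L))
      (less, more) match {
        case (false, false) => Same
        case (true, false)  => Before
        case (false, true)  => After
        case (true, true)   => Concurrent
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Two halves of a split cluster count independently, then heal.
    val left = GCounter().increment("A").increment("A")
    val right = GCounter().increment("B")
    println(left.merge(right).value) // 3

    // Two configuration edits made from the same base on different nodes.
    val base = VectorClock().tick("A")
    println(base.tick("A").compare(base.tick("B"))) // Concurrent
  }
}
```

A `Concurrent` result is the signal that two configuration versions diverged during a split and need an explicit resolution rule.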

Page 14: Know your data

Page 15: Know your data

• Do not treat all data the same
  • Time-series repository
    • CPU data, market data, ECG
  • Semi-structured repository
    • Log files, emails
  • KV repository
    • Configuration
• Unless you are Oracle or Microsoft, make your data immutable and append-only.
  • Streams are facts at points in time, and facts do not change
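One way to read "immutable, append-only" in code: the only write operation adds a new fact at the end, and a correction is itself a new fact rather than an update. The sketch below assumes a hypothetical `EventLog` type and reuses the field names from the air-quality stream on page 3; it is not Valo's storage layer.

```scala
object AppendOnly {
  // An event is an immutable fact: a value observed at a point in time.
  final case class Event(sampleTime: Long, sensor: String, pollution: Double)

  // The log exposes append and read, and nothing that edits or deletes.
  final case class EventLog(events: Vector[Event] = Vector.empty) {
    def append(e: Event): EventLog = copy(events = events :+ e)

    // Events with from <= sampleTime < until, in arrival order.
    def between(from: Long, until: Long): Vector[Event] =
      events.filter(e => e.sampleTime >= from && e.sampleTime < until)
  }

  def main(args: Array[String]): Unit = {
    val log = EventLog()
      .append(Event(1000L, "131e90ad-e32a", 85.6))
      .append(Event(2000L, "131e90ad-e32a", 86.1))
    println(log.between(0L, 1500L).size) // 1
  }
}
```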

Page 16: Know your data

• Build properties into your data distribution policies. Properties which:
  • Maximise resilience
    • Avoid replicas on the same physical server rack
  • Optimise data locality
  • Minimise the number of data transfers required when adding/removing nodes
  • Deterministically tell where data lives in the cluster
    • Where does data for T0 to T1 sit in the cluster?
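The rack rule above can be sketched as a small placement function: walk the candidate nodes in preference order and keep at most one per rack. `Node`, `rack` and `pickReplicas` are hypothetical names for this illustration; how Valo actually models racks is not described in the slides.

```scala
object RackAwarePlacement {
  final case class Node(name: String, rack: String)

  // Takes candidates in preference order and returns up to `replicas` nodes,
  // never two from the same rack. Returns fewer if there are not enough racks.
  def pickReplicas(candidates: Seq[Node], replicas: Int): Seq[Node] =
    candidates.foldLeft(Vector.empty[Node]) { (chosen, n) =>
      if (chosen.size < replicas && !chosen.exists(_.rack == n.rack)) chosen :+ n
      else chosen
    }

  def main(args: Array[String]): Unit = {
    val nodes = Seq(Node("A", "r1"), Node("B", "r1"), Node("C", "r2"), Node("D", "r3"))
    println(pickReplicas(nodes, 2).map(_.name)) // Vector(A, C)
  }
}
```

Because the result depends only on the candidate order and the rack labels, every node that computes it gets the same answer, which also serves the "deterministically tell where data lives" property.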

Page 17: Know your data

• Consistent hashing
  • Minimises the number of data transfers in the cluster
• Time-based distribution
  • Distribute data in the cluster in second, minute, hour, day buckets
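The two bullets combine naturally: bucket a timestamp first, then hash the bucket to a node. The sketch below uses jump consistent hashing (Lamping and Veach), which appears in the topic list on page 7, as one possible consistent hash; the slides do not say which variant Valo uses for which purpose. The hour-sized bucket and the names `hourBucket` and `nodeFor` are assumptions for the example.

```scala
object DataPlacement {
  // Jump consistent hash: maps a 64-bit key to a bucket in [0, numBuckets).
  // Growing from n to n + 1 buckets only ever moves keys into the new bucket n.
  def jumpHash(key: Long, numBuckets: Int): Int = {
    require(numBuckets > 0, "numBuckets must be positive")
    var k = key
    var b = -1L
    var j = 0L
    while (j < numBuckets) {
      b = j
      k = k * 2862933555777941757L + 1L // 64-bit linear congruential step, wraps on overflow
      j = ((b + 1) * ((1L << 31).toDouble / ((k >>> 33) + 1).toDouble)).toLong
    }
    b.toInt
  }

  val hourMillis: Long = 60L * 60L * 1000L

  // Time-based bucketing: every timestamp within the same hour gets the same bucket id.
  def hourBucket(epochMillis: Long): Long = Math.floorDiv(epochMillis, hourMillis)

  // Deterministic answer to "where does the data for this time sit?"
  def nodeFor(epochMillis: Long, nodes: Int): Int = jumpHash(hourBucket(epochMillis), nodes)

  def main(args: Array[String]): Unit = {
    val moved = (0L until 1000L).count(k => jumpHash(k, 10) != jumpHash(k, 11))
    println(s"$moved of 1000 keys moved when growing from 10 to 11 nodes")
  }
}
```

With a plain `key % nodes` scheme most keys would move when a node is added; here only the keys that land on the new node move.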

Page 18: Know your data

Figure: two node-by-segment grids side by side (nodes A to N as rows, segments 1 2 3 4 5 6 8 9 as columns). The column position of each mark did not survive extraction; the marks per node are:

Node | Left grid | Right grid
A    | x x       | x
B    | x x x     | x x
C    | x x x x   | x x x
D    | x x x x   | x x x
E    | x x x     | x x x
F    | x x       | x x x
G    | x         | x x x
K    | x x       | x x x
L    | x x       | x x
M    | x         | x
N    |           |

Page 19: Know your data

Figure: the same two node-by-segment grids (nodes A to N as rows, segments 1 2 3 4 5 6 8 9 as columns), with some marks in the left grid shown as a capital X. The column position of each mark did not survive extraction; the marks per node are:

Node | Left grid | Right grid
A    | x x       | x
B    | x x x     | x x
C    | x x x x   | x x x
D    | x x x     | x x x
E    | x X x     | x x x
F    | x X       | x x x
G    | x X       | x x x
K    | x x       | x x x
L    | x x       | x x
M    | x         | x
N    |           |

Page 20: Know your algos

Page 21: Know your algos

from historical /streams/demo/infrastructure/cpu
select avg(kernel)

Page 22: Know your algos

from historical /streams/demo/infrastructure/cpu
select avg(kernel)

[Diagram: four Avg operators in rows of 1, 1 and 2]

Page 23: Know your algos

from historical /streams/demo/infrastructure/cpu
select avg(kernel)

[Diagram: seven Avg operators]

Page 24: Know your algos

from historical /streams/demo/infrastructure/cpu
select avg(kernel)

[Diagram: seven Avg operators in rows of 1, 1, 2 and 3]

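Page 25: Know your algos

Init:     () -> β
Apply:    β -> 'a list -> β
Reduce:   β -> β -> β
Finalise: β -> 'r

// Method signatures only, as shown on the slide. NamedDouble, Parser and
// Generator are types the deck does not define. The comments describe the
// likely role of each method, judged from its name.
class AverageDouble {
  def apply(value: NamedDouble): Unit  // fold one value into the running state
  def reset(): Unit                    // return to the initial (empty) state
  def merge(state: Parser): Unit       // combine with partial state read from elsewhere
  def restore(state: Parser): Unit     // replace the state with one previously saved
  def getResult: NamedDouble           // produce the final result
  def save(gen: Generator): Unit       // write the partial state out
}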
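The four signatures can be instantiated for an average in a few lines. This is a minimal sketch, assuming the partial state β is a (sum, count) pair; `AvgState` and the function names are invented for the example and are not taken from Valo's code.

```scala
object MergeableAverage {
  // β: the partial state. A sum and a count merge exactly; a bare average does not.
  final case class AvgState(sum: Double, count: Long)

  // Init: () -> β
  def init(): AvgState = AvgState(0.0, 0L)

  // Apply: β -> 'a list -> β
  def applyValues(state: AvgState, values: Seq[Double]): AvgState =
    AvgState(state.sum + values.sum, state.count + values.size)

  // Reduce: β -> β -> β
  def reduce(a: AvgState, b: AvgState): AvgState =
    AvgState(a.sum + b.sum, a.count + b.count)

  // Finalise: β -> 'r (None when no values were seen)
  def finalise(state: AvgState): Option[Double] =
    if (state.count == 0L) None else Some(state.sum / state.count)

  def main(args: Array[String]): Unit = {
    // Two nodes hold different numbers of samples for the same query.
    val nodeA = applyValues(init(), Seq(10.0, 20.0, 30.0))
    val nodeB = applyValues(init(), Seq(50.0))
    println(finalise(reduce(nodeA, nodeB))) // Some(27.5)
    // Averaging the two per-node averages instead would give (20 + 50) / 2 = 35.
  }
}
```

Because `reduce` is associative and commutative, partial states can be combined in any tree shape across the cluster and still give the same answer.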

Page 26: Travelling algos

[Diagram: six Avg operators]

Figure: node-by-segment grid (nodes A to N as rows, segments 1 2 3 4 5 6 8 9 as columns). The column position of each mark did not survive extraction; the marks per node are:

Node | Marks
A    | x
B    | x x
C    | x x x
D    | x x x
E    | x x x
F    | x x x
G    | x x x
K    | x x x
L    | x x
M    | x
N    |

from historical /streams/demo/infrastructure/itime
group by timeStamp window of 5 minutes every 5 minutes fill last, alpha
select alpha, timeStamp, last(a) as la
partition every 1 hour as implicit
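A sketch of what the windowing clause might compute for a single group: tumbling 5-minute windows that each report the last value seen, with empty windows carrying the previous result forward. That reading of `fill last` is an assumption, not documented behaviour, and `Sample` and `lastPerWindow` are hypothetical names. The `partition` clause and the grouping by `alpha` are not modelled.

```scala
object WindowFill {
  final case class Sample(timeMillis: Long, value: Double)

  // Returns one (windowStart, value) pair per tumbling window in [startMillis, endMillis).
  // A window with no samples repeats the previous window's value; windows before the
  // first sample have no value to carry and yield None.
  def lastPerWindow(
      samples: Seq[Sample],
      startMillis: Long,
      endMillis: Long,
      windowMillis: Long): Seq[(Long, Option[Double])] = {
    val byWindow: Map[Long, Double] = samples
      .filter(s => s.timeMillis >= startMillis && s.timeMillis < endMillis)
      .sortBy(_.timeMillis)
      .groupBy(s => (s.timeMillis - startMillis) / windowMillis)
      .map { case (w, ss) => w -> ss.last.value }

    // Number of windows, rounding a partial final window up.
    val windows = (endMillis - startMillis + windowMillis - 1) / windowMillis

    (0L until windows)
      .foldLeft((Vector.empty[(Long, Option[Double])], Option.empty[Double])) {
        case ((out, previous), w) =>
          val current = byWindow.get(w).orElse(previous)
          (out :+ ((startMillis + w * windowMillis, current)), current)
      }
      ._1
  }

  def main(args: Array[String]): Unit = {
    val fiveMinutes = 5L * 60L * 1000L
    val samples = Seq(Sample(60000L, 1.0), Sample(120000L, 2.0), Sample(700000L, 5.0))
    // The middle window has no samples and repeats 2.0.
    lastPerWindow(samples, 0L, 3 * fiveMinutes, fiveMinutes).foreach(println)
  }
}
```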

Page 27

Dynamo style clustering and vector-clocks

Eventual consistency

Gossip protocols

Distributed algorithms

Distributed execution engine

Expression trees and runtime code generation

Query rewriting and optimization

Consistent hashing

Time-series repository

Semi-structured repository

Data atomicity

Back pressure

Elasticity

Advanced ML algorithms

IO

Actor systems

Data distribution

Cluster management

B+ trees

Query language

KV-store

REST-api

Jump consistent hashing

Off-heap memory

Data formats

Distributed joins

Time semantics

Gap-filling

Statistical models

Distributed CRDTs

Transports

Real-time queries

./valo

Page 28

Thank you

Meet us at the Startup Area

www.valo.io

[email protected]

@ntjohansson

Page 29: Algos

MicroTickFrequency

MicroVolatility

OnlineMisraGries

Anomaly

Histogram

Bivar

Univar

Skyline

EMA

MovingKurtosis

MovingDerivative

RecursiveEMA

MovingVariance

MovingVariance

Average

Sum

Sum

TopK

Quantiles

Page 30: What has brought us here today