"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias Johansson, Lead Developer at Valo.io

Tobias Johansson @ntjohansson

27/10/2016

Big data analytics

Einstürzenden Neudaten: Building an analytics engine from scratch

What is Valo

• Big data analytics engine
• Focusing on simplicity from a usage perspective
• Single process containing
  • Time-series repository
  • Semi-structured repository
  • Execution engine
  • Etc.
• Written in Scala/C++/Lua

What is Valo

• REST based

PUT /streams/sensors/environment/air

{
  "sampleTime": { "type": "datetime" },
  "sensor": { "type": "contributor" },
  "pollution": { "type": "double" }
}

POST /streams/sensors/environment/air

{
  "sampleTime": "2016/10/27 15:13:00",
  "sensor": "131e90ad-e32a",
  "pollution": 85.6
}
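The two calls above can be sketched from a JVM client. This is a minimal sketch, not Valo's client library: it uses only the JDK's `java.net.http` request builder, and the host and port (`localhost:8888`) and the object name `ValoRestSketch` are assumptions, since the slides show only the paths and bodies.

```scala
import java.net.URI
import java.net.http.HttpRequest
import java.net.http.HttpRequest.BodyPublishers

object ValoRestSketch {
  // Assumption: a node listening on localhost:8888 (the slides give no host or port).
  val baseUrl = "http://localhost:8888"
  val streamPath = "/streams/sensors/environment/air"

  // Schema from the slide: field name -> type.
  val schemaJson: String =
    """{
      |  "sampleTime": { "type": "datetime" },
      |  "sensor": { "type": "contributor" },
      |  "pollution": { "type": "double" }
      |}""".stripMargin

  // One sample matching that schema, also from the slide.
  val sampleJson: String =
    """{
      |  "sampleTime": "2016/10/27 15:13:00",
      |  "sensor": "131e90ad-e32a",
      |  "pollution": 85.6
      |}""".stripMargin

  private def jsonRequest(): HttpRequest.Builder =
    HttpRequest.newBuilder(URI.create(baseUrl + streamPath))
      .header("Content-Type", "application/json")

  // PUT declares the stream together with its schema.
  val createStream: HttpRequest =
    jsonRequest().PUT(BodyPublishers.ofString(schemaJson)).build()

  // POST appends one sample to the stream.
  val appendSample: HttpRequest =
    jsonRequest().POST(BodyPublishers.ofString(sampleJson)).build()
}
```

The requests are only built here. Sending one would be `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` against a running node.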

What is Valo

• Data friendly

POST /streams/sensors/environment/air
Content-Type: application/json

POST /streams/sensors/environment/air
Content-Type: application/cbor

POST /streams/sensors/environment/air
Content-Type: application/csv

POST /streams/sensors/environment/air
Content-Type: application/bson

Time-series Semi-structured

What is Valo

• Real-time and historical queries

Looks simple?

Trust me, it is not.


Dynamo style clustering and vector-clocks

Eventual consistency

Gossip protocols

Distributed algorithms

Distributed execution engine

Expression trees and runtime code generation

Query rewriting and optimization

Consistent hashing

Time-series repository

Semi-structured repository

Data atomicity

Back pressure

Elasticity

Advanced ML algorithms

IO

Actor systems

Data distribution

Cluster management

B+ trees

Query language

KV-store

REST-api

Jump consistent hashing

Off-heap memory

Data formats

Distributed joins

Time semantics

Gap-filling

Statistical models

Distributed CRDTs

Transports

Realtime queries





Know your cluster

It will crash

Know your cluster

• You need a cluster to run big data analytics on, but it is based on:
  • Commodity hardware which can fail
  • An unreliable network

Know your cluster

• Issues:
  • Unreachable nodes
  • Dropped messages
  • Delayed messages
  • No response


  • Split network
    • Multiple working clusters
    • Mutable state is likely to diverge

Know your cluster

• Accept these issues and don’t try to fight them. Make life simpler by:
  • Not having a single point of failure
    • No leaders
    • No master/slave
    • No special nodes
  • Making it eventually consistent
    • Use CRDTs for sets, counters, etc.
    • Use vector clocks for configuration
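Both tools named above fit in a few lines. The following is a minimal sketch, not Valo's code: a grow-only counter (one of the simplest CRDTs) and a vector clock comparison. All names (`EventualConsistencySketch`, `GCounter`, `VectorClock`, `Relation`) are assumptions made for the illustration.

```scala
object EventualConsistencySketch {
  // Grow-only counter CRDT: each node only ever increments its own slot.
  final case class GCounter(counts: Map[String, Long] = Map.empty) {
    def increment(node: String, by: Long = 1L): GCounter =
      GCounter(counts.updated(node, counts.getOrElse(node, 0L) + by))

    def value: Long = counts.values.sum

    // Merge keeps the per-node maximum, which makes it commutative,
    // associative and idempotent, so replicas converge in any merge order.
    def merge(other: GCounter): GCounter =
      GCounter((counts.keySet ++ other.counts.keySet).map { n =>
        n -> math.max(counts.getOrElse(n, 0L), other.counts.getOrElse(n, 0L))
      }.toMap)
  }

  sealed trait Relation
  case object Before extends Relation
  case object After extends Relation
  case object Same extends Relation
  case object Concurrent extends Relation

  // Vector clock: one logical tick counter per node.
  final case class VectorClock(ticks: Map[String, Long] = Map.empty) {
    def tick(node: String): VectorClock =
      VectorClock(ticks.updated(node, ticks.getOrElse(node, 0L) + 1L))

    // Concurrent means neither version descends from the other: a conflict.
    def compare(other: VectorClock): Relation = {
      val nodes = ticks.keySet ++ other.ticks.keySet
      val someLess = nodes.exists(n => ticks.getOrElse(n, 0L) < other.ticks.getOrElse(n, 0L))
      val someMore = nodes.exists(n => ticks.getOrElse(n, 0L) > other.ticks.getOrElse(n, 0L))
      (someLess, someMore) match {
        case (false, false) => Same
        case (true, false)  => Before
        case (false, true)  => After
        case (true, true)   => Concurrent
      }
    }
  }
}
```

During a network split both sides can keep incrementing their counters, and merging once the split heals gives the same total whichever side merges first. For configuration, a `Concurrent` result is the signal that two nodes changed the same setting independently and the conflict has to be resolved.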

Know your data

• Do not treat all data the same

  • Time-series repository
    • CPU data, market data, ECG
  • Semi-structured repository
    • Log files, emails
  • KV repository
    • Configuration

• Unless you are Oracle or Microsoft, make your data immutable, append only.

• Streams are facts at points in time, and facts do not change

Know your data

• Build properties into your data distribution policies. Properties which:
  • Maximise resilience
    • Avoid replicas on the same physical server rack
  • Optimise data locality
  • Minimise the number of data transfers required when adding/removing nodes
  • Deterministically tell where data lives in the cluster
    • Where does data for T0 to T1 sit in the cluster?

Know your data

• Consistent hashing
  • Minimises the number of data transfers in the cluster
• Time-based distribution
  • Distribute data in the cluster in second, minute, hour, day buckets
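The two ideas can be combined to answer "where does data for T0 to T1 sit?" without any lookup table. The sketch below is an illustration under assumptions, not Valo's placement policy: it uses jump consistent hashing (the Lamping and Veach algorithm, which the deck lists among its topics) and assumes the time-bucket index is used directly as the hash key. Replicas and rack awareness are left out, and the names `DataPlacementSketch`, `bucketOf`, `nodeFor` and `nodesFor` are hypothetical.

```scala
object DataPlacementSketch {
  // Jump consistent hash: maps a 64-bit key to a bucket in [0, numBuckets).
  // Growing from n to n + 1 buckets moves a key only if it lands in bucket n.
  def jumpHash(key: Long, numBuckets: Int): Int = {
    require(numBuckets > 0, "numBuckets must be positive")
    var k = key
    var b = -1L
    var j = 0L
    while (j < numBuckets) {
      b = j
      k = k * 2862933555777941757L + 1L // 64-bit LCG step, wraps like unsigned
      j = ((b + 1) * ((1L << 31).toDouble / ((k >>> 33) + 1).toDouble)).toLong
    }
    b.toInt
  }

  // Index of the time bucket (second, minute, hour, ...) holding a timestamp.
  def bucketOf(epochMillis: Long, bucketMillis: Long): Long =
    Math.floorDiv(epochMillis, bucketMillis)

  // Assumption: the bucket index itself is the hash key.
  def nodeFor(epochMillis: Long, bucketMillis: Long, nodes: Vector[String]): String =
    nodes(jumpHash(bucketOf(epochMillis, bucketMillis), nodes.size))

  // Every bucket overlapping [t0, t1], paired with the node that owns it.
  def nodesFor(t0: Long, t1: Long, bucketMillis: Long,
               nodes: Vector[String]): Seq[(Long, String)] =
    (bucketOf(t0, bucketMillis) to bucketOf(t1, bucketMillis))
      .map(b => b -> nodes(jumpHash(b, nodes.size)))
}
```

For example, `nodesFor(t0, t1, 3600000L, Vector("A", "B", "C"))` lists the hourly buckets in a time range with their owners, computed identically on every node. One known limit of jump hashing is that nodes can only be added or removed at the end of the numbered list.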




Know your data

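[Figure: two node/segment placement grids, side by side. Rows are nodes A, B, C, D, E, F, G, K, L, M, N; columns are segments 1 2 3 4 5 6 8 9; an x marks a segment held by a node. The column positions of the marks were lost in extraction. Marks per node, left grid: A 2, B 3, C 4, D 4, E 3, F 2, G 1, K 2, L 2, M 1, N 0. Right grid: A 1, B 2, C 3, D 3, E 3, F 3, G 3, K 3, L 2, M 1, N 0.]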



Know your data

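[Figure: the same two node/segment placement grids after a change. The column positions of the marks were lost in extraction. Marks per node, left grid: A 2, B 3, C 4, D 3, E 3, F 2, G 2, K 2, L 2, M 1, N 0, with one mark in each of rows E, F and G shown as an upper-case X. Right grid: A 1, B 2, C 3, D 3, E 3, F 3, G 3, K 3, L 2, M 1, N 0.]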

Know your algos

Know your algos

from historical /streams/demo/infrastructure/cpu
select avg(kernel)

Know your algos


[Diagrams: three slides showing instances of the Avg operator, four on the first slide and seven on each of the next two. Their arrangement was lost in extraction.]

Know your algos

Init: () -> β
Apply: β -> 'a list -> β
Reduce: β -> β -> β
Finalise: β -> 'r

class AverageDouble {
  // Method signatures as shown on the slide; bodies are omitted.
  def apply(value: NamedDouble): Unit
  def reset(): Unit
  def merge(state: Parser): Unit
  def restore(state: Parser): Unit
  def getResult: NamedDouble
  def save(gen: Generator): Unit
}
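The Init/Apply/Reduce/Finalise shape is what lets an average run where the data lives: each node folds its own segments into a small partial state, and only those states travel and get merged. Below is a minimal sketch of that shape for an average. It is not the `AverageDouble` implementation, whose body the slide does not show; the names `MergeableAverageSketch` and `AvgState` are assumptions, and plain `Double` stands in for `NamedDouble`.

```scala
object MergeableAverageSketch {
  // Partial state (the β of the signatures): running sum and count.
  final case class AvgState(sum: Double, count: Long)

  // Init: () -> β
  val init: AvgState = AvgState(0.0, 0L)

  // Apply: β -> 'a list -> β, folds a batch of local values into the state.
  def applyBatch(state: AvgState, values: Seq[Double]): AvgState =
    AvgState(state.sum + values.sum, state.count + values.size)

  // Reduce: β -> β -> β, merges two partial states from different nodes.
  def reduce(a: AvgState, b: AvgState): AvgState =
    AvgState(a.sum + b.sum, a.count + b.count)

  // Finalise: β -> 'r, None when no values were seen.
  def finalise(state: AvgState): Option[Double] =
    if (state.count == 0L) None else Some(state.sum / state.count)
}
```

With segments `Seq(1.0, 2.0)`, `Seq(3.0)` and `Seq(4.0, 5.0, 6.0)` on three nodes, merging the three partial states and finalising gives 3.5. Averaging the three local averages (1.5, 3.0, 5.0) would give a different and wrong number, which is why the state carries sum and count rather than a finished average. On the slide's class, `reset`, `apply`, `merge` and `getResult` plausibly play these four roles, with `save` and `restore` serialising the state between nodes, though the slide does not say so explicitly.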

Travelling algos

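[Figure: six Avg operator instances above a node/segment placement grid. Rows are nodes A, B, C, D, E, F, G, K, L, M, N; columns are segments 1 2 3 4 5 6 8 9. The column positions of the marks were lost in extraction. Marks per node: A 1, B 2, C 3, D 3, E 3, F 3, G 3, K 3, L 2, M 1, N 0.]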

from historical /streams/demo/infrastructure/itime
group by timeStamp window of 5 minutes every 5 minutes fill last, alpha
select alpha, timeStamp, last(a) as la
partition every 1 hour as implicit
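The windowing and gap-filling part of this query can be sketched on a single series. This is a sketch under assumptions, not the engine's semantics: it assumes "fill last" means an empty 5-minute window repeats the previous window's result, and it ignores the `alpha` grouping and the hourly partitioning. The names `GapFillSketch`, `Sample` and `lastPerWindow` are hypothetical.

```scala
object GapFillSketch {
  final case class Sample(timeMillis: Long, value: Double)

  val windowMillis: Long = 5 * 60 * 1000L

  // For each tumbling window between the first and last sample, emit
  // (windowStartMillis, last value seen in that window). Assumption: an empty
  // window reuses the previous window's result ("fill last").
  def lastPerWindow(samples: Seq[Sample],
                    window: Long = windowMillis): Seq[(Long, Double)] =
    if (samples.isEmpty) Seq.empty
    else {
      val sorted = samples.sortBy(_.timeMillis)
      // Window index -> value of the latest sample inside that window.
      val byWindow: Map[Long, Double] =
        sorted
          .groupBy(s => Math.floorDiv(s.timeMillis, window))
          .map { case (w, ss) => w -> ss.last.value }
      val first = Math.floorDiv(sorted.head.timeMillis, window)
      val last = Math.floorDiv(sorted.last.timeMillis, window)
      var carried = byWindow(first)
      (first to last).map { w =>
        carried = byWindow.getOrElse(w, carried) // gap: keep the carried value
        (w * window, carried)
      }
    }
}
```

Samples at minute 0 (1.0), minute 2 (2.0) and minute 16 (5.0) produce four windows: the first holds 2.0, the two empty windows repeat 2.0, and the window starting at minute 15 holds 5.0.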



./valo

www.valo.io

Thank you

Meet us at the Startup Area

@ntjohansson

Algos

MicroTickFrequency

MicroVolatility

OnlineMisraGries

Anomaly

Histogram

Bivar

Univar

Skyline

EMA

MovingKurtosis

MovingDerivative

RecursiveEMA

MovingVariance

MovingVariance

Average

Sum

Sum

TopK

Quantiles

What has brought us here today

Data & Analytics