Tobias Johansson @ntjohansson
27/10/2016
Big data analytics
Einstürzenden Neudaten: Building an analytics engine from scratch
What is Valo
• Big data analytics engine
• Focusing on simplicity from a usage perspective
• Single process containing
  • Time-series repository
  • Semi-structured repository
  • Execution engine
  • Etc.
• Written in Scala/C++/Lua
• REST based
PUT /streams/sensors/environment/air
{
  "sampleTime": { "type": "datetime" },
  "sensor": { "type": "contributor" },
  "pollution": { "type": "double" }
}
POST /streams/sensors/environment/air
{
  "sampleTime": "2016/10/27 15:13:00",
  "sensor": "131e90ad-e32a",
  "pollution": 85.6
}
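The two requests above can be sketched from a client's point of view. This is a minimal sketch, not Valo's client library: the base URL (`BASE_URL`) and the helper `build_request` are assumptions made for illustration, and the requests are only built, not sent.

```python
import json
from urllib.request import Request

# Hypothetical base URL; the slides only show the request paths.
BASE_URL = "http://localhost:8888"
STREAM_PATH = "/streams/sensors/environment/air"


def build_request(method: str, body: dict) -> Request:
    """Build (but do not send) a JSON request against the stream path."""
    return Request(
        url=BASE_URL + STREAM_PATH,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


# Schema and sample copied from the slide.
schema = {
    "sampleTime": {"type": "datetime"},
    "sensor": {"type": "contributor"},
    "pollution": {"type": "double"},
}
sample = {
    "sampleTime": "2016/10/27 15:13:00",
    "sensor": "131e90ad-e32a",
    "pollution": 85.6,
}

create_stream = build_request("PUT", schema)    # defines the stream
publish_sample = build_request("POST", sample)  # publishes one event
# urllib.request.urlopen(create_stream) would send it to a running server.
```

Keeping the schema and the sample as plain dictionaries mirrors the slide: the stream is defined once with PUT, then events are posted to the same path.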
• Data friendly
POST /streams/sensors/environment/air
Content-Type: application/json

POST /streams/sensors/environment/air
Content-Type: application/cbor

POST /streams/sensors/environment/air
Content-Type: application/csv

POST /streams/sensors/environment/air
Content-Type: application/bson
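To illustrate the idea of one endpoint accepting several encodings, here is a small sketch that serialises the same sample for two of the content types listed above. The function `encode` is hypothetical, and whether Valo expects a CSV header row is an assumption. CBOR and BSON are left out because they need third-party encoders.

```python
import csv
import io
import json


def encode(sample: dict, content_type: str) -> bytes:
    """Serialise one sample for the given Content-Type header value."""
    if content_type == "application/json":
        return json.dumps(sample).encode("utf-8")
    if content_type == "application/csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=list(sample))
        writer.writeheader()   # assumption: a header row names the fields
        writer.writerow(sample)
        return buf.getvalue().encode("utf-8")
    raise ValueError(f"no encoder in this sketch for {content_type}")


sample = {
    "sampleTime": "2016/10/27 15:13:00",
    "sensor": "131e90ad-e32a",
    "pollution": 85.6,
}
as_json = encode(sample, "application/json")
as_csv = encode(sample, "application/csv")
```

The payload changes with the header, while the path and the meaning of the sample stay the same.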
Time-series Semi-structured
• Real-time and historical queries
Looks simple? Trust me, it is not.
Dynamo style clustering and vector-clocks
Eventual consistency
Gossip protocols
Distributed algorithms
Distributed execution engine
Expression trees and runtime code generation
Query rewriting and optimization
Consistent hashing
Time-series repository
Semi-structured repository
Data atomicity
Back pressure
Elasticity
Advanced ML algorithms
IO
Actor systems
Data distribution
Cluster management
B+ trees
Query language
KV-store
REST-api
Jump consistent hashing
Off-heap memory
Data formats
Distributed joins
Time semantics
Gap-filling
Statistical models
Distributed CRDTs
Transports
Realtime queries
Know your cluster
It will crash
• You need a cluster to run big data analytics on, but it is built on:
  • Commodity hardware, which can fail
  • An unreliable network
• Issues:
  • Unreachable nodes
  • Dropped messages
  • Delayed messages
  • No response
  • Split network
    • Multiple working clusters
    • Mutable state is likely to diverge
• Accept these issues and don’t try to fight them. Make life simpler by:
  • Not having a single point of failure
    • No leaders
    • No master/slave
    • No special nodes
  • Making it eventually consistent
    • Use CRDTs for sets, counters, etc.
    • Use vector-clocks for configuration
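The two building blocks named above can be sketched in a few lines. This is a textbook grow-only counter (G-Counter) and a plain vector-clock comparison, not Valo's implementation; the names `GCounter` and `compare_clocks` are chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter CRDT: one slot per node, merged by element-wise max."""
    counts: dict = field(default_factory=dict)

    def increment(self, node: str, amount: int = 1) -> None:
        self.counts[node] = self.counts.get(node, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Max is commutative, associative and idempotent, so replicas
        # converge regardless of merge order or repeated delivery.
        for node, n in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), n)

    @property
    def value(self) -> int:
        return sum(self.counts.values())


def compare_clocks(a: dict, b: dict) -> str:
    """Compare two vector clocks: 'before', 'after', 'equal' or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


# Two replicas count independently during a network split, then merge.
left, right = GCounter(), GCounter()
left.increment("A", 3)
right.increment("B", 2)
left.merge(right)
right.merge(left)
```

After the merge both replicas report the same total, with no leader involved. A "concurrent" result from `compare_clocks` is the signal that two configuration updates happened on different sides of a split and need reconciling.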
Know your data
• Do not treat all data the same
  • Time-series repository
    • CPU data, market data, ECG
  • Semi-structured repository
    • Log files, emails
  • KV repository
    • Configuration
• Unless you are Oracle or Microsoft, make your data immutable and append-only.
• Streams are facts at points in time, and facts do not change
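A minimal sketch of what "immutable, append only" can mean in code, assuming a hypothetical `Fact` record and `AppendOnlyStream` container: facts cannot be edited after creation, and the stream offers no update or delete operation.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Fact:
    """One observation at a point in time; frozen, so it cannot be modified."""
    sample_time: str
    pollution: float


class AppendOnlyStream:
    """A stream that only supports appending and reading."""

    def __init__(self) -> None:
        self._facts: List[Fact] = []

    def append(self, fact: Fact) -> None:
        self._facts.append(fact)

    def read(self) -> Tuple[Fact, ...]:
        # Return a tuple so callers cannot mutate the stream's storage.
        return tuple(self._facts)


stream = AppendOnlyStream()
stream.append(Fact("2016/10/27 15:13:00", 85.6))
# A correction is a new fact, not an edit of the old one.
stream.append(Fact("2016/10/27 15:14:00", 84.9))
```

Because nothing is ever rewritten, replicas only have to agree on which facts exist, not on the order of conflicting updates.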
• Build properties into your data distribution policies, properties which:
  • Maximise resilience
    • Avoid replicas on the same physical server rack
  • Optimise data locality
  • Minimise number of data transfers required when adding/removing nodes
  • Deterministically tell where data lives in the cluster
    • Where does data for T0 to T1 sit in the cluster?
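The properties listed above can be combined in one small placement function. This sketch uses rendezvous (highest-random-weight) hashing over hourly buckets, which is a swapped-in technique chosen for brevity and not necessarily what Valo does. The cluster layout `NODES` and the function names are hypothetical.

```python
import hashlib
from datetime import datetime, timedelta

# Hypothetical cluster layout: node name -> rack name.
NODES = {"A": "rack1", "B": "rack1", "C": "rack2", "D": "rack2", "E": "rack3"}


def _score(bucket: str, node: str) -> int:
    """Deterministic pseudo-random weight for a (bucket, node) pair."""
    digest = hashlib.sha256(f"{bucket}|{node}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def replicas_for(bucket: str, copies: int = 2) -> list:
    """Pick `copies` nodes for a bucket, never two on the same rack."""
    ranked = sorted(NODES, key=lambda n: _score(bucket, n), reverse=True)
    chosen, racks = [], set()
    for node in ranked:
        if NODES[node] not in racks:
            chosen.append(node)
            racks.add(NODES[node])
        if len(chosen) == copies:
            break
    return chosen


def hour_buckets(t0: datetime, t1: datetime) -> list:
    """Names of the hourly buckets that overlap the range t0..t1."""
    current = t0.replace(minute=0, second=0, microsecond=0)
    buckets = []
    while current <= t1:
        buckets.append(current.strftime("%Y-%m-%dT%H"))
        current += timedelta(hours=1)
    return buckets


def locate(t0: datetime, t1: datetime) -> dict:
    """Answer 'where does data for T0 to T1 sit?' without asking any node."""
    return {bucket: replicas_for(bucket) for bucket in hour_buckets(t0, t1)}


placement = locate(datetime(2016, 10, 27, 15, 13), datetime(2016, 10, 27, 17, 5))
```

Every node that knows the cluster layout computes the same answer, so no lookup service or special node is needed to find the data for a time range.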
• Consistent hashing
  • Minimises number of data transfers in the cluster
• Time-based distribution
  • Distribute data in the cluster in second, minute, hour, day buckets
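As a rough illustration of how the two ideas fit together, the sketch below places time buckets on a classic consistent-hash ring with virtual nodes and then adds a node. The class `HashRing`, the helper `bucket_key` and the bucket naming scheme are assumptions for the example, not Valo's design.

```python
import bisect
import hashlib
from datetime import datetime, timedelta


def _position(key: str) -> int:
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes: int = 64):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node owns several points on the ring to even out the load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_position(f"{node}#{i}"), node))

    def owner(self, key: str) -> str:
        # First ring point at or after the key's position, wrapping around.
        idx = bisect.bisect(self._ring, (_position(key), "")) % len(self._ring)
        return self._ring[idx][1]


FORMATS = {
    "second": "%Y-%m-%dT%H:%M:%S",
    "minute": "%Y-%m-%dT%H:%M",
    "hour": "%Y-%m-%dT%H",
    "day": "%Y-%m-%d",
}


def bucket_key(stream: str, ts: datetime, granularity: str) -> str:
    """Name of the time bucket a timestamp falls into."""
    return f"{stream}@{ts.strftime(FORMATS[granularity])}"


start = datetime(2016, 10, 27)
keys = [
    bucket_key("/streams/sensors/environment/air", start + timedelta(hours=h), "hour")
    for h in range(240)
]
ring = HashRing(["A", "B", "C", "D"])
before = {k: ring.owner(k) for k in keys}
ring.add("E")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Only the buckets listed in `moved` change owner, and all of them move to the new node; everything else stays where it was, which is the "minimises number of data transfers" property.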
[Figure: two side-by-side grids of nodes (A to N) against segments (1 to 9), with an x marking node/segment pairs; the column positions of the marks were lost in extraction.]
[Figure: two side-by-side grids of nodes (A to N) against segments (1 to 9), as before, with some marks on nodes E, F and G shown as a capital X; the column positions of the marks were lost in extraction.]
Know your algos
from historical /streams/demo/infrastructure/cpu
select avg(kernel)
[Figure: diagram of multiple Avg operators for the query above, built up over several slides.]
Init: () -> β
Apply: β -> 'a list -> β
Reduce: β -> β -> β
Finalise: β -> 'r
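// Method bodies are elided on the slide; NamedDouble, Parser and Generator are types not shown here.
trait AverageDouble {
  def apply(value: NamedDouble): Unit  // fold one value into the running state
  def reset(): Unit                    // clear the running state
  def merge(state: Parser): Unit       // combine with saved state from another instance
  def restore(state: Parser): Unit     // load previously saved state
  def getResult: NamedDouble           // produce the final result
  def save(gen: Generator): Unit       // write the running state out
}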
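The Init/Apply/Reduce/Finalise shape above can be made concrete for an average. This is a minimal sketch under the assumption that the intermediate state β is a (sum, count) pair; the per-node data in `segments` is made up for the example.

```python
from functools import reduce
from typing import Iterable, Tuple

State = Tuple[float, int]  # β: (running sum, count)


def init() -> State:
    return (0.0, 0)


def apply(state: State, values: Iterable[float]) -> State:
    """Fold a batch of local values into the state."""
    total, count = state
    batch = list(values)
    return (total + sum(batch), count + len(batch))


def reduce_states(a: State, b: State) -> State:
    """Merge two partial states, e.g. from two nodes."""
    return (a[0] + b[0], a[1] + b[1])


def finalise(state: State) -> float:
    total, count = state
    return total / count if count else float("nan")


# Hypothetical values held by three nodes.
segments = {"A": [10.0, 20.0], "B": [30.0], "C": [40.0, 50.0, 60.0]}

# Each node computes a partial state next to its own data...
partials = [apply(init(), values) for values in segments.values()]
# ...and only the small states travel over the network to be merged.
result = finalise(reduce(reduce_states, partials, init()))
```

Shipping (sum, count) rather than each node's local average matters: averaging the three local averages here would give about 31.67, while the merged state gives the correct 35.0.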
Travelling algos
[Figure: Avg operators shown above a grid of nodes (A to N) against segments (1 to 9); the column positions of the marks were lost in extraction.]
from historical /streams/demo/infrastructure/itime
group by timeStamp window of 5 minutes every 5 minutes fill last, alpha
select alpha, timeStamp, last(a) as la partition every 1 hour as implicit
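The windowing and gap-filling part of the query above can be approximated as follows. The exact semantics of Valo's query language are not given in the slides, so this sketch assumes tumbling 5-minute windows, `last(a)` as the latest value in each window, and `fill last` as carrying the previous window's result into empty windows. It ignores the grouping by `alpha` and the partition clause.

```python
from datetime import datetime, timedelta


def last_per_window(samples, start, end, width=timedelta(minutes=5)):
    """Return [(window_start, last value)] for every window in [start, end).

    samples is an iterable of (timestamp, value). Empty windows reuse the
    previous window's result (None if there is no previous result).
    """
    ordered = sorted(samples, key=lambda s: s[0])
    results, previous, i = [], None, 0
    window_start = start
    while window_start < end:
        window_end = window_start + width
        current = None
        # Consume all samples that fall before the end of this window.
        while i < len(ordered) and ordered[i][0] < window_end:
            if ordered[i][0] >= window_start:
                current = ordered[i][1]
            i += 1
        if current is None:
            current = previous  # gap-filling: repeat the last known value
        results.append((window_start, current))
        previous = current
        window_start = window_end
    return results


day = datetime(2016, 10, 27)
samples = [
    (day.replace(hour=15, minute=1), 1.0),
    (day.replace(hour=15, minute=3), 2.0),
    (day.replace(hour=15, minute=12), 5.0),
]
windows = last_per_window(samples, day.replace(hour=15), day.replace(hour=15, minute=20))
```

With these samples the 15:05 and 15:15 windows contain no data, so they repeat the value of the window before them.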
./valo
Algos
MicroTickFrequency
MicroVolatility
OnlineMisraGries
Anomaly
Histogram
Bivar
Univar
Skyline
EMA
MovingKurtosis
MovingDerivative
RecursiveEMA
MovingVariance
Average
Sum
TopK
Quantiles
What has brought us here today