Upload
others
View
6
Download
0
Embed Size (px)
Citation preview
Scalable Data Ingestion for Stream Processing and Beyond
Gabriel AntoniuJoint work with Ovidiu Marcu, Alexandru Costan, Maria S. Pérez
9th JLESC workshop, UTK, Knoxville, April 15, 2019
Velocity
Data in motion
FluidDynamic
Volume
Data at rest
StationaryStatic
2
From Big Data to Fast Data
Correctness Exact results Approximate results
Latency High-latency Low-latency
Cost Stateless Stateful
Batch Streaming
3
State of the art until recently:Lambda Architectures
Historical events
Real-time events
Exact historical
model
Approximate real-time
model
Periodic queries
Continuous queries
Batch processing
Stream processing
Results &
Actions
What?
Why?
4
The streaming pipeline: latency happens
Unified batch and stream processing
Ingest delay (write latency)
Throughput(read latency)
Network delay or unavailable Backlog
Poor storage design
Starved resources
Hardware failure
DATATRANSFER
Edge Cloud
5
What is ingestion ?
•Collect data from various sources → producers
•Deliver them for processing / storage → consumers
•Optionally: buffer, log, pre-process
Ingestion determines the processing performance6
State of the art: Apache Kafka
Limitations• Scalability• Data duplication
400 nodes, peak 3.2M events/s50 nodes, average 200K events/s
7
The KerA approach to ingestion • Scalability → Dynamic partitioning• Enables seamless elasticity
• Data duplication → Unified ingestion and storage • Support for both
• Streams (unbounded data)
• Objects (bounded data)
8
Zoom: scalability
Each partition is statically associated with one consumer: limited scalability
Producers Brokers Consumers
Partitions
Partitions
Partitions
9
KerA: dynamic partitioning
10
• Streamlets: logical stream containers; #streamlets > #brokers• Groups: created and processed dynamically; maximum #active groups per broker
Streamlets Streamlets
Streamlets Streamlets
Streamlets Streamlets
GroupsGroups
GroupsGroups
Groups
Increased network and storage overheads 11
Zoom: data duplication
KerA: unified ingestion and storage
KerA
INGESTIONBrokers
Acquire Push/Pulldata access
Streams
Objects
Common data model for streams and objects
STORAGEBackups
Move less data, process them faster
12
Evaluating scalability
1.0*1053.0*1055.0*1057.0*1059.0*1051.1*1061.3*1061.5*1061.7*1061.9*1062.1*106
4 8 16 32
Aver
age
Aggr
egat
ed
Clie
nt T
hrou
ghpu
t (re
cord
s/s)
Clients Number
KeraProdKeraConsKafkaProdKafkaCons
1.0*1053.0*1055.0*1057.0*1059.0*1051.1*1061.3*1061.5*1061.7*1061.9*1062.1*106
4 8 12 16
Ave
rage
Agg
rega
ted
Clie
nt T
hrou
ghpu
t (re
cord
s/s)
Nodes Number
KeraProdKeraConsKafkaProdKafkaCons
64 clients, 32 partitions, 1MB request size, 100B records
# Brokers
2x better throughputwith 75% less resources
Vertical Horizontal
8x10x
# Clients
4 brokers, 32 partitions, 128KB request size, 100B records
21*105
19*105
17*105
15*105
13*105
11*105
9*105
7*105
5*105
3*105
1*105
Thro
ughp
ut
21*105
19*105
17*105
15*105
13*105
11*105
9*105
7*105
5*105
3*105
1*105
13
14
Our vision: hybrid analytics architecture
Present data
Stream processing
Past data
Historicalmodel
Real-timemodel
ComputationalModel
Batch processing
FuturemodelSimulation
Proactive control
Continuous update
In situ processing
In transit processing
HybridAnalytics
Hybrid analytics: processing architecture
DATAfrom the
Real World
DATAfrom the
Hypothetical World …
Simulation (e.g., digital twin)
Computation
In situ pre-processingof simulation data
Sensor
In situ streampre-processingof sensor data
…
Hybrid (stream + batch) in transit processing
(data in-motion + data at-rest)
Historicaldata
BetterDecisionLearning
15
Data processing
16
Hybrid analytics architecture
DATAfrom the
Real World
DATAfrom the
Hypothetical World …
Simulation (e.g., digital twin)
Computation
In situ pre-processingof simulation data
Sensor
In situ streampre-processingof sensor data
…
BetterDecisionLearning
Hybrid (stream + batch) in transit processing
(data in-motion + data at-rest)
Historicaldata
Postdoc (ANR OverFlow project)• Investigating Edge vs. Cloud
computing trade-offs for stream processing
• Methodology for benchmarking Edge processing frameworks
Ph.D. (to hire)• Uniform Cloud and Edge stream
processing for Fast Data analytics
Pedro Silva
Hybrid analytics architecture
DATAfrom the
Real World
DATAfrom the
Hypothetical World …
Simulation (e.g., digital twin)
Computation
In situ pre-processingof simulation data
Sensor
In situ streampre-processingof sensor data
…
Historicaldata
BetterDecisionLearning
Hybrid (stream + batch) in transit processing
(data in-motion + data at-rest) Research Engineer (Inria ADT project)• Enable support for in situ Big Data
analytics• Elastic allocation of dedicated resources
(cores/nodes)Ovidiu Marcu
Hybrid analytics architecture
DATAfrom the
Real World
DATAfrom the
Hypothetical World …
Simulation (e.g., digital twin)
Computation
In situ pre-processingof simulation data
Sensor
In situ streampre-processingof sensor data
…
BetterDecisionLearning
Hybrid (stream + batch) in transit processing
(data in-motion + data at-rest)
KerA+++seamless integration with in situ/in transit
+large state management
Historicaldata
Ovidiu Marcu
Startup (ZettaFlow)• Low and consistent latency
(lightweight offset indexing, independent memory management)
• Model applications not partitioning/stream storage
Ph.D. (Inria IPL project)• HPC – Big Data processing
convergence• Bridge in situ/in transit and
stream/batch processing
H2020 project in preparation
Thank You!