Data Streams, Data-flow parallelism, and Real-time Analytics
Christoph.Koch@epfl.ch -- EPFL DATA Lab
The Fail Whale
Real-time data processing use-cases
• Science: sensors; feedback for experiment control
• Monitoring, log analysis, root cause analysis for failures
• Finance: algorithmic trading, risk management, OLTP
• Commerce: OLTP
• Web: Twitter, Facebook; search frontends (Google), personalized <anything>, clickstream analysis
Real-time Analytics on Big Data
• Big Data 3V: Volume, Velocity, Variety
• This talk focuses on velocity and volume
• Continuous data analysis
• Stream monitoring & mining; enforcing policies/security.
• Timely response required (low latencies!)
• Performance: high throughput and low latencies!
Paths to (real-time) performance
• Parallelization
• Small data (seriously!)
• Incrementalization (online/anytime)
• Specialization
Comp. Arch. not to the rescue
• Current data growth outpaces Moore’s law.
• Sequential CPU performance is no longer growing (flat for the past three Intel processor generations).
• Logic gates need time for their states to stabilize, which caps clock speeds.
• Moore's law expected to fail by 2020: only a few (2?) die-shrink iterations left.
• Limitation on the number of cores.
• Dennard scaling (the true motor of Moore's law) has ended.
• Energy cost and cooling problems!
• More computational power will always be more expensive!
Parallelization is no silver bullet
• Computer architecture
• End of Dennard scaling: parallelization is expensive!
• Computational complexity theory
• There are inherently sequential problems (assuming NC ⊊ PTIME).
• Fundamental impossibilities in distributed computing:
• Distributed computation requires synchronization.
• Distributed consensus has a minimum latency dictated by spatial distance of compute nodes (and other factors).
• msecs in LAN, 100s of msecs in WAN. Speed of light!
• There is a maximum number of synchronous computation steps per second, no matter how much parallel hardware is available.
Reminder: Two-Phase Commit
1. Coordinator: send "prepare" to all subordinates.
2. Subordinate: force-write a prepare record; send its yes or no vote.
3. Coordinator: wait for all responses; force-write the commit or abort record; send commit or abort.
4. Subordinate: force-write the commit or abort record; send an ACK.
5. Coordinator: wait for all ACKs; write the end record.
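As a minimal sketch of the protocol above, with synchronous calls standing in for network messages and timeouts and crash recovery elided (Coordinator, Subordinate, and Log are hypothetical names, not any system's API):

```python
class Log:
    """Stand-in for a durable log; force_write models a forced disk write."""
    def force_write(self, rec):
        print("force-write:", rec)
    def write(self, rec):
        print("write:", rec)

class Subordinate:
    def __init__(self, log):
        self.log = log
    def prepare(self):
        self.log.force_write("prepare")   # must be durable before voting
        return "yes"                      # or "no" if the local work failed
    def decide(self, decision):
        self.log.force_write(decision)    # commit or abort record
        return "ack"                      # the return doubles as the ACK

class Coordinator:
    def __init__(self, subordinates, log):
        self.subordinates, self.log = subordinates, log
    def run(self):
        votes = [s.prepare() for s in self.subordinates]        # round trip 1
        decision = "commit" if all(v == "yes" for v in votes) else "abort"
        self.log.force_write(decision)
        acks = [s.decide(decision) for s in self.subordinates]  # round trip 2
        assert all(a == "ack" for a in acks)
        self.log.write("end")             # end record need not be forced
        return decision

print(Coordinator([Subordinate(Log()), Subordinate(Log())], Log()).run())
```

The two list comprehensions are exactly the two network round trips whose latency the next slide prices out.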
The cost of synchronization
• 2PC: minimum latency of two network round trips.
• Latency limits Xact throughput: throughput (#Xacts/second) = 1 / (cost of one run of 2PC).
• Lausanne-Shenzhen: 9491 km x 4 one-way trips ≈ 126 ms at the speed of light.
• Wide-area / cross-data-center consistency: ~8 Xacts/s ?!? (worked out below)
• Consensus ≥ 2-phase commit.
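Made explicit, the back-of-the-envelope computation behind these numbers (figures from the slide; vacuum light speed, so real fiber and switching would only be slower):

```python
# 2PC = 2 round trips = 4 one-way traversals of the link.
distance_km = 9491                         # Lausanne-Shenzhen, one way
c_km_per_s = 299_792                       # speed of light in vacuum

latency_s = 4 * distance_km / c_km_per_s   # ~0.1266 s, i.e. ~126 ms
print(f"min 2PC latency: {latency_s * 1e3:.0f} ms")
print(f"max sequential rate: {1 / latency_s:.1f} Xacts/s")   # ~7.9, i.e. ~8
```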
Xact throughput local vs. cross-datacenter (slide by P. Bailis)
The cost of synchronization
• For every network latency, there is a maximum scale at which a synchronized system can run.
• Jim Gray: in the late 1980s, optimal 2PC throughput was reached at ~50 machines (above that, throughput decreases).
• Today the number is higher, but not by much.
Latency and scaling of synchronized parallel systems
• SIMD in CPUs: does not need to scale, and is implemented in hardware on a single die: OK.
• Cache coherence in multi-socket servers: a headache for computer architects!
• Linear algebra in HPC
• very special and easy problem, superfast interconnects, locality
• but scaling remains a challenge for designers of supercomputer and linear algebra routines.
• Scaling of ad-hoc problems on supercomputers: an open problem, ad-hoc solutions.
• Consistency inside data center (Xacts, MPI syncs): <10000 Hz
• Cross-data center consistency ~10 Hz
• Latency of heterogeneous local batch jobs: <2Hz (HPC, Mapreduce, Spark)
Latency and scaling of synchronized parallel systems #2
• Scaling of batch systems: <2Hz
• HPC jobs <1Hz(?)
• Map/reduce 0.1 Hz (synchronization via disk, slow scheduling)
• Spark <2 Hz; “Spark Streaming”
• Note: Hadoop efficiency: takes 80-100 nodes to match single-core performance.
The cost of synchronization
• Distributed computation needs synchronization.
• Low-latency stream processing?
• Asynchronous message forwarding: data-flow parallelism!
[Diagram: Machine 1 and Machine 2 start from a := 1 and b := 1 and alternately apply a += a+b and b += a+b, exchanging the intermediate values 2, 3, 5, 5; the first three cross-machine steps are marked "sync", the last "no sync".]
Does streaming/message passing defeat the 2PC lower bound?
• Assume we compute each statement once.
• Different machines handle statements
• Don’t compute until you have received all the msgs you need.
• Works!
• But requires synchronized timestamps on the input stream.
• One stream source, or synchronization of the stream sources!
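A sketch of the rule above: each statement becomes a node that fires once, as soon as all the messages it needs have arrived; no global synchronization, only asynchronous forwarding (the Node structure is illustrative, not any framework's API):

```python
class Node:
    """One dataflow node per statement; purely message-driven."""
    def __init__(self, needed, fn):
        self.needed = set(needed)   # input names the statement reads, e.g. {"a", "b"}
        self.inbox = {}
        self.fn = fn                # the statement, e.g. lambda a, b: a + (a + b)
        self.outputs = []           # (downstream_node, input_name) pairs
        self.done = False

    def receive(self, name, value):
        self.inbox[name] = value
        if not self.done and self.needed <= self.inbox.keys():
            self.done = True                      # compute each statement once
            result = self.fn(**self.inbox)
            for node, as_name in self.outputs:
                node.receive(as_name, result)     # asynchronous forward

# a' = a + (a + b) fires exactly when both a and b have arrived:
n = Node({"a", "b"}, lambda a, b: a + (a + b))
n.receive("a", 1)   # waits
n.receive("b", 1)   # fires and forwards the result
```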
[Diagram: the computation unrolled over a grid of machines (i, j); machines (i, 1) through (i, 5) each own one statement (a' += a+b, b'' += a'+b', a'' += a'+b'', b''' += a''+b''), starting from a := 1 and b := 1, with the values 2, 3, 5, 5 flowing between them.]
Does streaming/message passing defeat the 2PC lower bound?
• Repeatedly compute values.
• Each msg has a (creation) epoch timestamp.
• Multiple msgs can share a timestamp.
• Works in this case!
• Computes only sums of two objects, so we know when we have received all the msgs we need to make progress!
[Diagram: the same per-statement machine grid as on the previous slide.]
Does streaming/message passing defeat the 2PC lower bound?
• Repeatedly compute values.
• Each msg has a (creation) epoch timestamp.
• Multiple msgs can share a timestamp.
• Notify when no more messages with a particular timestamp are to come from a sender.
• Requires waiting for notify() from all sources (sketched below).
• Synch again!
• If there is a cyclic dependency (the same values read as written), 2PC is back!
[Diagram: the same per-statement machine grid as on the previous slides.]
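A sketch of the notification scheme from the slide above: messages are buffered per epoch timestamp, and an epoch is released only once every source has sent its end-of-epoch notify() (the names are ours, not a specific system's):

```python
from collections import defaultdict

class EpochOperator:
    def __init__(self, sources, on_epoch_closed):
        self.sources = set(sources)
        self.open = defaultdict(lambda: set(self.sources))  # epoch -> sources still open
        self.buffer = defaultdict(list)                     # epoch -> buffered msgs
        self.on_epoch_closed = on_epoch_closed

    def receive(self, source, epoch, msg):
        self.buffer[epoch].append(msg)

    def notify(self, source, epoch):
        """`source` promises to send no more messages with timestamp `epoch`."""
        self.open[epoch].discard(source)
        if not self.open[epoch]:          # heard from everyone: safe to progress
            del self.open[epoch]
            self.on_epoch_closed(epoch, self.buffer.pop(epoch, []))

op = EpochOperator({"s1", "s2"}, lambda e, msgs: print("epoch", e, msgs))
op.receive("s1", 0, "x")
op.notify("s1", 0)   # s1 done with epoch 0; still waiting on s2
op.notify("s2", 0)   # fires: epoch 0 is closed
```

The wait for notify() from all sources is exactly the synchronization point the slide warns about.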
Streaming+Iteration: Structured time [Naiad]
[Diagram: a dataflow graph of operators A through E containing a loop: timestamps are (t) outside the loop, (t, t') inside it, and the loop's feedback edge advances (t, t') to (t, t'+1).]
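The timestamp structure can be shown in a few lines (a toy encoding of the idea, not Naiad's actual API): entering a loop extends the timestamp with a loop counter, the feedback edge increments it, and leaving the loop truncates it.

```python
# Toy structured timestamps: tuples ordered lexicographically.

def ingress(ts):    # entering the loop:  (t)     -> (t, 0)
    return ts + (0,)

def feedback(ts):   # loop feedback edge: (t, t') -> (t, t'+1)
    return ts[:-1] + (ts[-1] + 1,)

def egress(ts):     # leaving the loop:   (t, t') -> (t)
    return ts[:-1]

t = ingress((3,))            # (3, 0)
t = feedback(t)              # (3, 1)
assert egress(t) == (3,)     # back to the outer epoch
assert (3, 1) < (4, 0)       # lexicographic order drives progress tracking
```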
Data-flow parallel systems
• Popular for real-time analytics.
• Most popular framework: Apache Storm / Twitter Heron
• Simple programming model (“bolts” – analogous to map-reduce mappers).
• Requires non-blocking operators: e.g., a symmetric hash join instead of a sort-merge join (see the sketch below).
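For instance, a symmetric hash join keeps one hash table per input and probes the opposite table as each tuple arrives, so it emits results without ever waiting for an input to finish, unlike a sort-merge join, which blocks until its inputs are complete (a minimal sketch):

```python
from collections import defaultdict

class SymmetricHashJoin:
    """Non-blocking equi-join: insert each arriving tuple into its side's
    hash table, then immediately probe the other side for matches."""
    def __init__(self):
        self.table = {"left": defaultdict(list), "right": defaultdict(list)}

    def insert(self, side, key, tup):
        other = "right" if side == "left" else "left"
        self.table[side][key].append(tup)
        for match in self.table[other][key]:
            yield (tup, match) if side == "left" else (match, tup)

join = SymmetricHashJoin()
print(list(join.insert("left", 1, ("a",))))    # []: no right-side match yet
print(list(join.insert("right", 1, ("b",))))   # [(('a',), ('b',))]
```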
Latency and Data Skew
• Data skew: uneven parallelization.
• Reasons:
  • bad initial parallelization
  • uneven blowup of intermediate results
  • bad hashing in (map/reduce) reshuffling
• Occurs both in batch systems such as map-reduce and in streaming systems.
• Fixes: skew-resilient repartitioning, load shedding, … (see the sketch below)
• Node failures look similar.
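One common form of skew-resilient repartitioning is hot-key splitting: spread each known-hot key over several workers and merge the partial results downstream (a sketch; detecting the hot keys and the downstream merge are assumed to happen elsewhere):

```python
import random

def route(key, n_workers, hot_keys, fanout=4):
    """Pick a worker for a tuple. Hot keys are salted so their load spreads
    over `fanout` workers instead of hammering one; the per-key partial
    aggregates must then be merged in a downstream stage."""
    if key in hot_keys:
        salt = random.randrange(fanout)        # split the hot key
        return hash((key, salt)) % n_workers
    return hash(key) % n_workers
```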
Paths to (real-time) performance
• Parallelization
• Small data (seriously!)
• Incrementalization (online/anytime)
• Specialization
  • Hardware: GPU, FPGA, …
  • Software: compilation
  • Lots of activity on both the hardware and software fronts at EPFL
Summary
• The classical batch job is getting a lot of competition: people need low latency for a variety of reasons.
• Part of a cloud/OpenStack deployment will be used for low-latency work.
• Latency is a problem! Virtualization, MS vs. Amazon.
• Distributed computation at low latencies has fundamental limits.
• Very hard systems problems, huge design space to explore.
• Incrementalization can give asymptotic efficiency improvements; in practice, many orders of magnitude.