
Page 1:

Data Streams, Data-flow parallelism, and Real-time Analytics

Christoph.Koch@epfl.ch -- EPFL DATA Lab

Page 2:

The Fail Whale

Page 3:

Real-time data processing use-cases

• Science: sensors; feedback for experiment control

• Monitoring, log analysis, root cause analysis for failures

• Finance: algorithmic trading, risk management, OLTP

• Commerce: OLTP

• Web: Twitter, Facebook; search frontends (Google), personalized <anything>, clickstream analysis

Page 4:

Real-time Analytics on Big Data

• Big Data 3V: Volume, Velocity, Variety

• This talk focuses on velocity and volume

• Continuous data analysis

• Stream monitoring & mining; enforcing policies/security.

• Timely response required (low latencies!)

• Performance: high throughput and low latencies!

Page 5:

Paths to (real-time) performance

• Parallelization

• Small data (seriously!)

• Incrementalization (online/anytime)

• Specialization

Page 6:

Comp. Arch. not to the rescue

• Current data growth outpaces Moore’s law.

• Sequential CPU performance has stopped growing (flat across the last three Intel processor generations).

• Logical states need time to stabilize.

• Moore’s law to fail by 2020: Only a few (2?) die shrinkage iterations left.

• Limitation on number of cores.

• Dennard scaling (the true motor of Moore’s law) has ended

• Energy cost and cooling problems!

• More computational power will always be more expensive!

Page 7:

Parallelization is no silver bullet

• Computer architecture

• End of Dennard scaling: parallelization is expensive!

• Computational complexity theory

• There are inherently sequential (P-complete) problems, assuming NC < PTIME

• Fundamental impossibilities in distributed computing:

• Distributed computation requires synchronization.

• Distributed consensus has a minimum latency dictated by spatial distance of compute nodes (and other factors).

• msecs in LAN, 100s of msecs in WAN. Speed of light!

• Max # of synchronous computation steps per second, no matter how much parallel hardware is available.

Page 8:

Reminder: Two-Phase Commit

• Coordinator: send prepare to all subordinates.

• Subordinate: force-write prepare record; send yes or no vote.

• Coordinator: wait for all responses; force-write commit or abort record; send commit or abort.

• Subordinate: force-write commit or abort record; send ACK.

• Coordinator: wait for all ACKs; write end record.
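To make the message flow concrete, here is a minimal in-process sketch of the protocol above (my own illustration, not from the talk): Python queues stand in for the network, and an in-memory force_write() stands in for the forced log writes.

    import queue, threading

    def force_write(log, record):
        # Stand-in for a forced (durable) log write; here we only append in memory.
        log.append(record)

    def subordinate(inbox, coord_inbox, name, vote_yes):
        log = []
        assert inbox.get() == "prepare"        # wait for PREPARE
        force_write(log, "prepare")            # force-write prepare record
        coord_inbox.put((name, "yes" if vote_yes else "no"))
        decision = inbox.get()                 # wait for COMMIT or ABORT
        force_write(log, decision)             # force-write commit/abort record
        coord_inbox.put((name, "ack"))         # send ACK

    def coordinator(sub_inboxes, coord_inbox):
        log = []
        for inbox in sub_inboxes:              # phase 1: send PREPARE to everyone
            inbox.put("prepare")
        votes = [coord_inbox.get()[1] for _ in sub_inboxes]   # wait for all votes
        decision = "commit" if all(v == "yes" for v in votes) else "abort"
        force_write(log, decision)             # force-write commit or abort record
        for inbox in sub_inboxes:              # phase 2: send the decision
            inbox.put(decision)
        for _ in sub_inboxes:                  # wait for all ACKs
            coord_inbox.get()
        log.append("end")                      # write end record (not forced)
        return decision

    coord_inbox = queue.Queue()
    inboxes = [queue.Queue() for _ in range(2)]
    for i, inbox in enumerate(inboxes):
        threading.Thread(target=subordinate,
                         args=(inbox, coord_inbox, f"sub{i}", True)).start()
    print(coordinator(inboxes, coord_inbox))   # both vote yes -> prints "commit"

Even in this toy version, each subordinate blocks twice on the coordinator and the coordinator blocks twice on the slowest subordinate, which is where the two-round-trip latency floor on the next slide comes from.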

Page 9:

The cost of synchronization

• 2PC: Minimum latency two network roundtrips.

• Latency limits Xact throughput: #Xacts/second = 1 / (time for one run of 2PC)

• Lausanne-Shenzhen: 9491 km * 4 one-way hops; 126 ms at the speed of light (worked out in the sketch below).

• Wide-area / cross-data-center consistency – 8 Xacts/s ?!?

• Consensus >= 2-phase commit.
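The back-of-the-envelope calculation behind these numbers (a sketch; the 9491 km distance and the four one-way hops are taken from the slide):

    # Two round trips = 4 one-way hops at the speed of light in vacuum.
    distance_km = 9491                          # Lausanne - Shenzhen
    speed_of_light_km_per_s = 299_792
    latency_s = 4 * distance_km / speed_of_light_km_per_s
    print(f"{latency_s * 1000:.1f} ms per 2PC run")     # ~126.6 ms
    print(f"{1 / latency_s:.1f} serialized Xacts/s")    # ~7.9, i.e. the "8 Xacts/s"

Real networks add routing and switching delay on top of this, so 8 Xacts/s is already an optimistic bound for fully serialized cross-continent transactions.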

Page 10:

Xact throughput, local vs. cross-datacenter (slide by P. Bailis)

Page 11:

The cost of synchronization

• For every network latency, there is a maximum scale at which a synchronized system can run.

• Jim Gray: in the late 1980s, optimal 2PC throughput was reached at ~50 machines (above that, throughput decreases).

• Today the number is higher, but not by much.

Page 12:

Latency and scaling of synchronized parallel systems

• SIMD in CPUs: does not need to scale, and is implemented in hardware on a single die: OK

• Cache coherence in multi-socket servers: a headache for computer architects!

• Linear algebra in HPC

• very special and easy problem, superfast interconnects, locality

• but scaling remains a challenge for the designers of supercomputers and linear-algebra routines.

• Scaling of ad-hoc problems on supercomputers: an open problem, ad-hoc solutions.

• Consistency inside data center (Xacts, MPI syncs): <10000 Hz

• Cross-data center consistency ~10 Hz

• Heterogeneous local batch jobs: latencies of 0.5 s or more, i.e. <2 Hz (HPC, MapReduce, Spark)

Page 13:

Latency and scaling of synchronized parallel systems #2

• Scaling of batch systems: <2Hz

• HPC jobs <1Hz(?)

• Map/reduce: 0.1 Hz (synchronization via disk, slow scheduling)

• Spark <2 Hz; “Spark Streaming”

• Note: Hadoop efficiency: takes 80-100 nodes to match single-core performance.

Page 14:

The cost of synchronization

• Distributed computation needs synchronization.

• Low-latency stream processing?

• Asynchronous message forwarding – data-flow parallelism


[Diagram: Machine 1 starts with b := 1, Machine 2 with a := 1. The machines interleave the updates a += a+b and b += a+b, sending intermediate values (2, 3, 5, 5 in the figure) to each other; the first three cross-machine exchanges are marked "sync", the last one "no sync".]

Page 15:

Does streaming/message passing defeat the 2PC lower bound?

• Assume we compute each statement once.

• Different machines handle statements

• Don’t compute until you have received all the msgs you need.

• Works!

• But requires synchronized timestamps on the input stream.

• One stream source, or synchronization of the stream sources!


[Diagram: the statements a’ += a+b, b’’ += a’+b’, a’’ += a’+b’’, and b’’’ += a’’+b’’ are assigned to different machines in a grid (machines (1, j), (2, j), (i, 1)…(i, 5)); the initial values a := 1, b := 1 and the intermediate results (2, 3, 5, 5) flow between them as messages.]
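A minimal sketch of the compute-once scheme on this slide (my own illustration): each operator knows which inputs its statement needs, buffers messages as they arrive, and fires exactly once when the set is complete, with no coordinator in the loop.

    class OneShotOperator:
        # Owns one statement (e.g. a' = a + b) and fires once all inputs have arrived.
        def __init__(self, name, needed_inputs, downstream=()):
            self.name = name
            self.needed = set(needed_inputs)    # input names still missing
            self.values = {}                    # buffered input values
            self.downstream = list(downstream)  # [(operator, input name there)]

        def receive(self, input_name, value):
            self.values[input_name] = value
            self.needed.discard(input_name)
            if not self.needed:                 # all required msgs received: compute
                result = sum(self.values.values())
                print(f"{self.name} = {result}")
                for op, as_name in self.downstream:
                    op.receive(as_name, result)

    # Wire a' = a + b into a'' = a' + b and drive it with messages in arbitrary order.
    a2 = OneShotOperator("a''", ["a'", "b"])
    a1 = OneShotOperator("a'", ["a", "b"], downstream=[(a2, "a'")])
    a2.receive("b", 1); a1.receive("a", 1); a1.receive("b", 1)   # prints a' = 2, a'' = 3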

Page 16:

Does streaming/message passing defeat the 2PC lower bound?

• Repeatedly compute values.

• Each msg has a (creation) epoch timestamp

• Multiple msgs can share a timestamp.

• Works in this case!

• Computes only sums of two objects: we know when we have received all the msgs we need to make progress!


[Diagram: the same machine grid as on the previous slide, now fed repeatedly; each message carrying the values (2, 3, 5, 5) is tagged with its epoch timestamp.]
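Extending the previous sketch to repeated computation (again my own illustration, not Naiad or Storm code): because each operator here needs exactly two inputs per epoch, it can buffer per epoch and fire as soon as both inputs tagged with that epoch are present, without any global synchronization.

    from collections import defaultdict

    class EpochSumOperator:
        # Adds exactly two inputs per epoch; fires when both have arrived for that epoch.
        def __init__(self, name, downstream=()):
            self.name = name
            self.pending = defaultdict(dict)    # epoch -> {input_name: value}
            self.downstream = list(downstream)

        def receive(self, epoch, input_name, value):
            inputs = self.pending[epoch]
            inputs[input_name] = value
            if len(inputs) == 2:                # both msgs for this epoch are here
                result = sum(inputs.values())
                print(f"epoch {epoch}: {self.name} = {result}")
                for op, as_name in self.downstream:
                    op.receive(epoch, as_name, result)
                del self.pending[epoch]

    b2 = EpochSumOperator("b''")
    a1 = EpochSumOperator("a'", downstream=[(b2, "a'")])
    # Messages from two epochs arrive interleaved; per-epoch buffering untangles them.
    a1.receive(0, "a", 1); a1.receive(1, "a", 2)
    a1.receive(0, "b", 1)                             # epoch 0: a' = 2, forwarded to b''
    b2.receive(1, "b'", 3); b2.receive(0, "b'", 3)    # epoch 0: b'' = 5
    a1.receive(1, "b", 2)                             # epoch 1: a' = 4, then b'' = 7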

Page 17:

Does streaming/message passing defeat the 2PC lower bound?

• Repeatedly compute values.

• Each msg has a (creation) epoch timestamp

• Multiple msgs can share a timestamp.

• Notify when no more messages of a particular ts are to come from a sender.

• Requires waiting for notify() from all sources.

• Synch again!

• If there is a cyclic dependency (the same values read as written), 2PC is back!


[Diagram: the same machine grid as on the previous slides; progress now additionally depends on end-of-epoch notifications from every sender.]
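One way to make "no more messages for this ts" explicit is the notification scheme sketched below (my own illustration): every sender announces the end of an epoch, and an operator finalizes an epoch only once data and notifications from all of its senders have arrived.

    from collections import defaultdict

    class NotifyingOperator:
        # Finalizes an epoch only after notify(epoch) from every upstream sender.
        def __init__(self, name, senders):
            self.name = name
            self.senders = set(senders)
            self.values = defaultdict(list)     # epoch -> values received so far
            self.closed = defaultdict(set)      # epoch -> senders that sent notify()

        def receive(self, epoch, sender, value):
            self.values[epoch].append(value)

        def notify(self, epoch, sender):
            self.closed[epoch].add(sender)
            if self.closed[epoch] == self.senders:   # every sender is done with epoch
                total = sum(self.values.pop(epoch, []))
                print(f"epoch {epoch} final at {self.name}: {total}")

    op = NotifyingOperator("b'''", senders={"a''", "b''"})
    op.receive(0, "a''", 5); op.receive(0, "b''", 5)
    op.notify(0, "a''")            # not enough: b'' might still send for epoch 0
    op.notify(0, "b''")            # now epoch 0 can be finalized: prints 10

Collecting these notifications is itself a synchronization step, and with a cyclic dependency an operator ends up waiting on notifications that depend on its own output, which is exactly the agreement problem 2PC solves.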

Page 18:

Streaming+Iteration: Structured time [Naiad]


[Diagram: a dataflow graph of operators A-E containing a nested loop. Outside the loop, messages carry a plain timestamp (t); loop ingress extends it to (t, t’), the feedback edge advances it to (t, t’+1), and loop egress restores (t).]
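A tiny sketch of structured timestamps in the spirit of Naiad (simplified, my own code, not the Naiad API): timestamps are tuples, loop ingress appends a counter, the feedback edge increments it, and loop egress drops it, so deciding whether one message can still influence another reduces to a tuple comparison.

    def ingress(ts):
        return ts + (0,)                 # entering the loop: add a fresh loop counter

    def feedback(ts):
        return ts[:-1] + (ts[-1] + 1,)   # one trip around the loop: bump the counter

    def egress(ts):
        return ts[:-1]                   # leaving the loop: drop the loop counter

    def may_influence(ts_from, ts_to):
        # Simplified "could-result-in": lexicographic order on equally long timestamps.
        return ts_from <= ts_to

    t = (3,)                             # outer epoch 3
    inner = ingress(t)                   # (3, 0)
    print(feedback(inner))               # (3, 1)
    print(egress(feedback(inner)))       # (3,)
    print(may_influence((3, 1), (3, 0))) # False: iteration 1 cannot affect iteration 0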

Page 19:

Data-flow parallel systems

• Popular for real-time analytics.

• Most popular framework: Apache Storm / Twitter Heron

• Simple programming model (“bolts” – analogous to map-reduce mappers).

• Requires nonblocking operators: e.g. symmetric hash-join vs. sort-merge join.
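For illustration, a sketch of a nonblocking symmetric hash join (my own code, not Storm's API): each side keeps a hash table of what it has seen and probes the other side's table as tuples arrive, so results stream out immediately instead of waiting for fully sorted inputs as in a sort-merge join.

    from collections import defaultdict

    class SymmetricHashJoin:
        # Nonblocking equi-join: emits matches as soon as a tuple arrives on either side.
        def __init__(self):
            self.left = defaultdict(list)      # join key -> left tuples seen so far
            self.right = defaultdict(list)     # join key -> right tuples seen so far

        def on_left(self, key, row):
            self.left[key].append(row)
            return [(row, r) for r in self.right[key]]      # probe the right table

        def on_right(self, key, row):
            self.right[key].append(row)
            return [(lrow, row) for lrow in self.left[key]] # probe the left table

    join = SymmetricHashJoin()
    print(join.on_left(42, "order#1"))     # []  (no matching right tuple yet)
    print(join.on_right(42, "item#A"))     # [('order#1', 'item#A')]  emitted immediately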

Page 20:

Latency and Data Skew

• Data skew: uneven parallelization.

• Reasons:
  • Bad initial parallelization
  • Uneven blowup of intermediate results
  • Bad hashing in (map/reduce) reshuffling

• Occurs both in batch systems such as map-reduce and in streaming/data-flow systems.

• Fixes: skew-resilient repartitioning (e.g. the salting sketch below), load shedding, …

• Node failures look similar.
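As one concrete instance of skew-resilient repartitioning (a sketch of key salting; the slide does not prescribe a specific technique, and the key names below are made up): hot keys are salted so their tuples spread over several workers, at the price of merging partial results downstream.

    import hashlib

    NUM_WORKERS = 8
    HOT_KEYS = {"celebrity_user"}      # keys detected to be heavily skewed (hypothetical)
    SALT_FANOUT = 4                    # spread each hot key over this many partitions

    def partition(key, salt=0):
        digest = hashlib.md5(f"{key}:{salt}".encode()).hexdigest()
        return int(digest, 16) % NUM_WORKERS

    def route(key, tuple_id):
        # Salting a hot key by tuple id spreads its load over up to SALT_FANOUT workers;
        # per-(key, salt) partial aggregates must be merged in a later step.
        salt = tuple_id % SALT_FANOUT if key in HOT_KEYS else 0
        return partition(key, salt)

    print({route("celebrity_user", i) for i in range(100)})   # several workers
    print({route("quiet_user", i) for i in range(100)})       # a single worker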

Page 21:

Paths to (real-time) performance

• Parallelization

• Small data (seriously!) <cut>

• Incrementalization (online/anytime)

• Specialization

Page 22:

Paths to (real-time) performance

• Parallelization

• Small data (seriously!)

• Incrementalization (online/anytime) <cut>

• Specialization

Page 23:

Paths to (real-time) performance

• Parallelization

• Small data (seriously!)

• Incrementalization (online/anytime)

• Specialization
  • Hardware: GPU, FPGA, …
  • Software: compilation
  • Lots of activity both on the hw and sw fronts at EPFL & ETHZ

Page 24:

Summary

• The classical batch job is getting a lot of competition. People need low latency for a variety of reasons.

• Part of a cloud/OpenStack deployment will be used for low-latency work.

  • Latency is a problem! Virtualization, MS vs Amazon

• Distributed computation at low latencies has fundamental limits.

• Very hard systems problems, huge design space to explore.

• Incrementalization can give asymptotic efficiency improvements. By many orders of magnitude in practice.
