
Page 1:

Data Streams, Data-flow parallelism, and Real-time Analytics

Christoph.Koch@epfl.ch -- EPFL DATA Lab

Page 2:

The Fail Whale

Page 3:

Real-time data processing use-cases

• Science: sensors; feedback for experiment control

• Monitoring, log analysis, root cause analysis for failures

• Finance: algorithmic trading, risk management, OLTP

• Commerce: OLTP

• Web: Twitter, Facebook; search frontends (Google), personalized <anything>, clickstream analysis

Page 4:

Real-time Analytics on Big Data

• Big Data 3V: Volume, Velocity, Variety

• This talk focuses on velocity and volume

• Continuous data analysis

• Stream monitoring & mining; enforcing policies/security.

• Timely response required (low latencies!)

• Performance: high throughput and low latencies!

Page 5:

Paths to (real-time) performance

• Parallelization

• Small data (seriously!)

• Incrementalization (online/anytime)

• Specialization

Page 6:

Comp. Arch. not to the rescue

• Current data growth outpaces Moore’s law.

• Sequential CPU performance has stopped growing (flat across the last three Intel processor generations).

• Logical states need time to stabilize.

• Moore’s law to fail by 2020: Only a few (2?) die shrinkage iterations left.

• Limitation on number of cores.

• Dennard scaling (the true motor of Moore’s law) has ended

• Energy cost and cooling problems!

• More computational power will always be more expensive!

Page 7:

Parallelization is no silver bullet

• Computer architecture

• End of Dennard scaling: parallelization is expensive!

• Computational complexity theory

• There are inherently sequential (P-complete) problems, assuming NC < PTIME

• Fundamental impossibilities in distributed computing:

• Distributed computation requires synchronization.

• Distributed consensus has a minimum latency dictated by spatial distance of compute nodes (and other factors).

• msecs in LAN, 100s of msecs in WAN. Speed of light!

• Max # of synchronous computation steps per second, no matter how much parallel hardware is available.

Page 8:

Reminder: Two-Phase Commit

• Coordinator: send prepare to all subordinates.

• Subordinate: force-write prepare record; send yes or no vote.

• Coordinator: wait for all responses; force-write commit or abort record; send commit or abort.

• Subordinate: force-write commit or abort record; send ACK.

• Coordinator: wait for all ACKs; write end record.
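To make the message flow concrete, here is a minimal in-process sketch of the protocol above (my own illustration, not from the talk): Python queues stand in for the network, and an in-memory force_write() stands in for the forced log writes.

    import queue, threading

    def force_write(log, record):
        # Stand-in for a forced (durable) log write; here we only append in memory.
        log.append(record)

    def subordinate(inbox, coord_inbox, name, vote_yes):
        log = []
        assert inbox.get() == "prepare"        # wait for PREPARE
        force_write(log, "prepare")            # force-write prepare record
        coord_inbox.put((name, "yes" if vote_yes else "no"))
        decision = inbox.get()                 # wait for COMMIT or ABORT
        force_write(log, decision)             # force-write commit/abort record
        coord_inbox.put((name, "ack"))         # send ACK

    def coordinator(sub_inboxes, coord_inbox):
        log = []
        for inbox in sub_inboxes:              # phase 1: send PREPARE to everyone
            inbox.put("prepare")
        votes = [coord_inbox.get()[1] for _ in sub_inboxes]   # wait for all votes
        decision = "commit" if all(v == "yes" for v in votes) else "abort"
        force_write(log, decision)             # force-write commit or abort record
        for inbox in sub_inboxes:              # phase 2: send the decision
            inbox.put(decision)
        for _ in sub_inboxes:                  # wait for all ACKs
            coord_inbox.get()
        log.append("end")                      # write end record (not forced)
        return decision

    coord_inbox = queue.Queue()
    inboxes = [queue.Queue() for _ in range(2)]
    for i, inbox in enumerate(inboxes):
        threading.Thread(target=subordinate,
                         args=(inbox, coord_inbox, f"sub{i}", True)).start()
    print(coordinator(inboxes, coord_inbox))   # both vote yes -> prints "commit"

Even in this toy version, each subordinate blocks twice on the coordinator and the coordinator blocks twice on the slowest subordinate, which is where the two-round-trip latency floor on the next slide comes from.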

Page 9:

The cost of synchronization

• 2PC: Minimum latency two network roundtrips.

• Latency limits Xact throughput: #Xacts/second = 1 / (time for one run of 2PC)

• Lausanne-Shenzhen: 9491 km * 4 one-way hops; 126 ms at the speed of light (worked out in the sketch below).

• Wide-area / cross-data-center consistency – 8 Xacts/s ?!?

• Consensus >= 2-phase commit.
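The back-of-the-envelope calculation behind these numbers (a sketch; the 9491 km distance and the four one-way hops are taken from the slide):

    # Two round trips = 4 one-way hops at the speed of light in vacuum.
    distance_km = 9491                          # Lausanne - Shenzhen
    speed_of_light_km_per_s = 299_792
    latency_s = 4 * distance_km / speed_of_light_km_per_s
    print(f"{latency_s * 1000:.1f} ms per 2PC run")     # ~126.6 ms
    print(f"{1 / latency_s:.1f} serialized Xacts/s")    # ~7.9, i.e. the "8 Xacts/s"

Real networks add routing and switching delay on top of this, so 8 Xacts/s is already an optimistic bound for fully serialized cross-continent transactions.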

Page 10:

Xact throughput, local vs. cross-datacenter (slide by P. Bailis)

Page 11:

The cost of synchronization

• For every network latency, there is a maximum scale at which a synchronized system can run.

• Jim Gray: in the late 1980s, optimal 2PC throughput was reached at ~50 machines (above that, throughput decreases).

• Today the number is higher, but not by much.

Page 12:

Latency and scaling of synchronized parallel systems

• SIMD in CPUs: does not need to scale, and is implemented in hardware on a single die: OK

• Cache coherence in multi-socket servers: a headache for computer architects!

• Linear algebra in HPC

• very special and easy problem, superfast interconnects, locality

• but scaling remains a challenge for the designers of supercomputers and linear-algebra routines.

• Scaling of ad-hoc problems on supercomputers: an open problem, ad-hoc solutions.

• Consistency inside data center (Xacts, MPI syncs): <10000 Hz

• Cross-data center consistency ~10 Hz

• Heterogeneous local batch jobs: latencies of 0.5 s or more, i.e. <2 Hz (HPC, MapReduce, Spark)

Page 13:

Latency and scaling of synchronized parallel systems #2

• Scaling of batch systems: <2Hz

• HPC jobs <1Hz(?)

• Map/reduce: 0.1 Hz (synchronization via disk, slow scheduling)

• Spark <2 Hz; “Spark Streaming”

• Note: Hadoop efficiency: takes 80-100 nodes to match single-core performance.

Page 14:

The cost of synchronization

• Distributed computation needs synchronization.

• Low-latency stream processing?

• Asynchronous message forwarding – data-flow parallelism


[Diagram: Machine 1 starts with b := 1, Machine 2 with a := 1. The machines interleave the updates a += a+b and b += a+b, sending intermediate values (2, 3, 5, 5 in the figure) to each other; the first three cross-machine exchanges are marked "sync", the last one "no sync".]

Page 15:

Does streaming/message passing defeat the 2PC lower bound?

• Assume we compute each statement once.

• Different machines handle statements

• Don’t compute until you have received all the msgs you need.

• Works!

• But requires synchronized timestamps on the input stream.

• One stream source, or synchronization of the stream sources!


[Diagram: the statements a’ += a+b, b’’ += a’+b’, a’’ += a’+b’’, and b’’’ += a’’+b’’ are assigned to different machines in a grid (machines (1, j), (2, j), (i, 1)…(i, 5)); the initial values a := 1, b := 1 and the intermediate results (2, 3, 5, 5) flow between them as messages.]
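A minimal sketch of the compute-once scheme on this slide (my own illustration): each operator knows which inputs its statement needs, buffers messages as they arrive, and fires exactly once when the set is complete, with no coordinator in the loop.

    class OneShotOperator:
        # Owns one statement (e.g. a' = a + b) and fires once all inputs have arrived.
        def __init__(self, name, needed_inputs, downstream=()):
            self.name = name
            self.needed = set(needed_inputs)    # input names still missing
            self.values = {}                    # buffered input values
            self.downstream = list(downstream)  # [(operator, input name there)]

        def receive(self, input_name, value):
            self.values[input_name] = value
            self.needed.discard(input_name)
            if not self.needed:                 # all required msgs received: compute
                result = sum(self.values.values())
                print(f"{self.name} = {result}")
                for op, as_name in self.downstream:
                    op.receive(as_name, result)

    # Wire a' = a + b into a'' = a' + b and drive it with messages in arbitrary order.
    a2 = OneShotOperator("a''", ["a'", "b"])
    a1 = OneShotOperator("a'", ["a", "b"], downstream=[(a2, "a'")])
    a2.receive("b", 1); a1.receive("a", 1); a1.receive("b", 1)   # prints a' = 2, a'' = 3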

Page 16:

Does streaming/message passing defeat the 2PC lower bound?

• Repeatedly compute values.

• Each msg has a (creation) epoch timestamp

• Multiple msgs can share a timestamp.

• Works in this case!

• Computes only sums of two objects: we know when we have received all the msgs we need to make progress!


[Diagram: the same machine grid as on the previous slide, now fed repeatedly; each message carrying the values (2, 3, 5, 5) is tagged with its epoch timestamp.]
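Extending the previous sketch to repeated computation (again my own illustration, not Naiad or Storm code): because each operator here needs exactly two inputs per epoch, it can buffer per epoch and fire as soon as both inputs tagged with that epoch are present, without any global synchronization.

    from collections import defaultdict

    class EpochSumOperator:
        # Adds exactly two inputs per epoch; fires when both have arrived for that epoch.
        def __init__(self, name, downstream=()):
            self.name = name
            self.pending = defaultdict(dict)    # epoch -> {input_name: value}
            self.downstream = list(downstream)

        def receive(self, epoch, input_name, value):
            inputs = self.pending[epoch]
            inputs[input_name] = value
            if len(inputs) == 2:                # both msgs for this epoch are here
                result = sum(inputs.values())
                print(f"epoch {epoch}: {self.name} = {result}")
                for op, as_name in self.downstream:
                    op.receive(epoch, as_name, result)
                del self.pending[epoch]

    b2 = EpochSumOperator("b''")
    a1 = EpochSumOperator("a'", downstream=[(b2, "a'")])
    # Messages from two epochs arrive interleaved; per-epoch buffering untangles them.
    a1.receive(0, "a", 1); a1.receive(1, "a", 2)
    a1.receive(0, "b", 1)                             # epoch 0: a' = 2, forwarded to b''
    b2.receive(1, "b'", 3); b2.receive(0, "b'", 3)    # epoch 0: b'' = 5
    a1.receive(1, "b", 2)                             # epoch 1: a' = 4, then b'' = 7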

Page 17:

Does streaming/message passing defeat the 2PC lower bound?

• Repeatedly compute values.

• Each msg has a (creation) epoch timestamp

• Multiple msgs can share a timestamp.

• Notify when no more messages of a particular ts are to come from a sender.

• Requires waiting for notify() from all sources.

• Synch again!

• If there is a cyclic dependency (the same values read as written), 2PC is back!


[Diagram: the same machine grid as on the previous slides; progress now additionally depends on end-of-epoch notifications from every sender.]
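One way to make "no more messages for this ts" explicit is the notification scheme sketched below (my own illustration): every sender announces the end of an epoch, and an operator finalizes an epoch only once data and notifications from all of its senders have arrived.

    from collections import defaultdict

    class NotifyingOperator:
        # Finalizes an epoch only after notify(epoch) from every upstream sender.
        def __init__(self, name, senders):
            self.name = name
            self.senders = set(senders)
            self.values = defaultdict(list)     # epoch -> values received so far
            self.closed = defaultdict(set)      # epoch -> senders that sent notify()

        def receive(self, epoch, sender, value):
            self.values[epoch].append(value)

        def notify(self, epoch, sender):
            self.closed[epoch].add(sender)
            if self.closed[epoch] == self.senders:   # every sender is done with epoch
                total = sum(self.values.pop(epoch, []))
                print(f"epoch {epoch} final at {self.name}: {total}")

    op = NotifyingOperator("b'''", senders={"a''", "b''"})
    op.receive(0, "a''", 5); op.receive(0, "b''", 5)
    op.notify(0, "a''")            # not enough: b'' might still send for epoch 0
    op.notify(0, "b''")            # now epoch 0 can be finalized: prints 10

Collecting these notifications is itself a synchronization step, and with a cyclic dependency an operator ends up waiting on notifications that depend on its own output, which is exactly the agreement problem 2PC solves.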

Page 18:

Streaming+Iteration: Structured time [Naiad]


[Diagram: a dataflow graph of operators A-E containing a nested loop. Outside the loop, messages carry a plain timestamp (t); loop ingress extends it to (t, t’), the feedback edge advances it to (t, t’+1), and loop egress restores (t).]
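A tiny sketch of structured timestamps in the spirit of Naiad (simplified, my own code, not the Naiad API): timestamps are tuples, loop ingress appends a counter, the feedback edge increments it, and loop egress drops it, so deciding whether one message can still influence another reduces to a tuple comparison.

    def ingress(ts):
        return ts + (0,)                 # entering the loop: add a fresh loop counter

    def feedback(ts):
        return ts[:-1] + (ts[-1] + 1,)   # one trip around the loop: bump the counter

    def egress(ts):
        return ts[:-1]                   # leaving the loop: drop the loop counter

    def may_influence(ts_from, ts_to):
        # Simplified "could-result-in": lexicographic order on equally long timestamps.
        return ts_from <= ts_to

    t = (3,)                             # outer epoch 3
    inner = ingress(t)                   # (3, 0)
    print(feedback(inner))               # (3, 1)
    print(egress(feedback(inner)))       # (3,)
    print(may_influence((3, 1), (3, 0))) # False: iteration 1 cannot affect iteration 0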

Page 19:

Data-flow parallel systems

• Popular for real-time analytics.

• Most popular framework: Apache Storm / Twitter Heron

• Simple programming model (“bolts” – analogous to map-reduce mappers).

• Requires nonblocking operators: e.g. symmetric hash-join vs. sort-merge join.
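For illustration, a sketch of a nonblocking symmetric hash join (my own code, not Storm's API): each side keeps a hash table of what it has seen and probes the other side's table as tuples arrive, so results stream out immediately instead of waiting for fully sorted inputs as in a sort-merge join.

    from collections import defaultdict

    class SymmetricHashJoin:
        # Nonblocking equi-join: emits matches as soon as a tuple arrives on either side.
        def __init__(self):
            self.left = defaultdict(list)      # join key -> left tuples seen so far
            self.right = defaultdict(list)     # join key -> right tuples seen so far

        def on_left(self, key, row):
            self.left[key].append(row)
            return [(row, r) for r in self.right[key]]      # probe the right table

        def on_right(self, key, row):
            self.right[key].append(row)
            return [(lrow, row) for lrow in self.left[key]] # probe the left table

    join = SymmetricHashJoin()
    print(join.on_left(42, "order#1"))     # []  (no matching right tuple yet)
    print(join.on_right(42, "item#A"))     # [('order#1', 'item#A')]  emitted immediately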

Page 20:

Latency and Data Skew

• Data skew: uneven parallelization.

• Reasons:
  • Bad initial parallelization
  • Uneven blowup of intermediate results
  • Bad hashing in (map/reduce) reshuffling

• Occurs both in batch systems such as map-reduce and in streaming/data-flow systems.

• Fixes: skew-resilient repartitioning (e.g. the salting sketch below), load shedding, …

• Node failures look similar.
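As one concrete instance of skew-resilient repartitioning (a sketch of key salting; the slide does not prescribe a specific technique, and the key names below are made up): hot keys are salted so their tuples spread over several workers, at the price of merging partial results downstream.

    import hashlib

    NUM_WORKERS = 8
    HOT_KEYS = {"celebrity_user"}      # keys detected to be heavily skewed (hypothetical)
    SALT_FANOUT = 4                    # spread each hot key over this many partitions

    def partition(key, salt=0):
        digest = hashlib.md5(f"{key}:{salt}".encode()).hexdigest()
        return int(digest, 16) % NUM_WORKERS

    def route(key, tuple_id):
        # Salting a hot key by tuple id spreads its load over up to SALT_FANOUT workers;
        # per-(key, salt) partial aggregates must be merged in a later step.
        salt = tuple_id % SALT_FANOUT if key in HOT_KEYS else 0
        return partition(key, salt)

    print({route("celebrity_user", i) for i in range(100)})   # several workers
    print({route("quiet_user", i) for i in range(100)})       # a single worker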

Page 21:

Paths to (real-time) performance

• Parallelization

• Small data (seriously!) <cut>

• Incrementalization (online/anytime)

• Specialization

Page 22:

Paths to (real-time) performance

• Parallelization

• Small data (seriously!)

• Incrementalization (online/anytime) <cut>

• Specialization

Page 23:

Paths to (real-time) performance

• Parallelization

• Small data (seriously!)

• Incrementalization (online/anytime)

• Specialization
  • Hardware: GPU, FPGA, …
  • Software: compilation
  • Lots of activity both on the hw and sw fronts at EPFL & ETHZ

Page 24:

Summary

• The classical batch job is getting a lot of competition. People need low latency for a variety of reasons.

• Part of a cloud/OpenStack deployment will be used for low-latency work.

  • Latency is a problem! Virtualization, MS vs Amazon

• Distributed computation at low latencies has fundamental limits.

• Very hard systems problems, huge design space to explore.

• Incrementalization can give asymptotic efficiency improvements. By many orders of magnitude in practice.
