Data Streams, Data-flow parallelism, and Real-time Analytics
Christoph.Koch@epfl.ch -- EPFL DATA Lab
The Fail Whale
Real-time data processing use-cases
• Science: sensors; feedback for experiment control
• Monitoring, log analysis, root cause analysis for failures
• Finance: algorithmic trading, risk management, OLTP
• Commerce: OLTP
• Web: Twitter, Facebook; search frontends (Google), personalized <anything>, clickstream analysis
Real-time Analytics on Big Data
• Big Data 3V: Volume, Velocity, Variety
• This talk focuses on velocity and volume
• Continuous data analysis
• Stream monitoring & mining; enforcing policies/security.
• Timely response required (low latencies!)
• Performance: high throughput and low latencies!
Paths to (real-time) performance
• Parallelization
• Small data (seriously!)
• Incrementalization (online/anytime)
• Specialization
Comp. Arch. not to the rescue
• Current data growth outpaces Moore’s law.
• Sequential CPU performance is no longer growing (flat for the past three Intel processor generations).
• Logic gates need time for their states to stabilize, which caps clock speeds.
• Moore's law expected to fail by 2020: only a few (2?) die-shrink iterations left.
• Limitation on the number of cores.
• Dennard scaling (the true motor of Moore's law) has ended.
• Energy cost and cooling problems!
• More computational power will always be more expensive!
Parallelization is no silver bullet
• Computer architecture
• End of Dennard scaling: parallelization is expensive!
• Computational complexity theory
• There are inherently sequential problems (assuming NC ⊊ PTIME).
• Fundamental impossibilities in distributed computing:
• Distributed computation requires synchronization.
• Distributed consensus has a minimum latency dictated by spatial distance of compute nodes (and other factors).
• msecs in LAN, 100s of msecs in WAN. Speed of light!
• There is a maximum number of synchronous computation steps per second, no matter how much parallel hardware is available.
Reminder: Two-Phase Commit
1. Coordinator: send "prepare" to all subordinates.
2. Subordinate: force-write a prepare record; send its yes or no vote.
3. Coordinator: wait for all responses; force-write the commit or abort record; send commit or abort.
4. Subordinate: force-write the commit or abort record; send an ACK.
5. Coordinator: wait for all ACKs; write the end record.
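As a minimal sketch of the protocol above, with synchronous calls standing in for network messages and timeouts and crash recovery elided (Coordinator, Subordinate, and Log are hypothetical names, not any system's API):

```python
class Log:
    """Stand-in for a durable log; force_write models a forced disk write."""
    def force_write(self, rec):
        print("force-write:", rec)
    def write(self, rec):
        print("write:", rec)

class Subordinate:
    def __init__(self, log):
        self.log = log
    def prepare(self):
        self.log.force_write("prepare")   # must be durable before voting
        return "yes"                      # or "no" if the local work failed
    def decide(self, decision):
        self.log.force_write(decision)    # commit or abort record
        return "ack"                      # the return doubles as the ACK

class Coordinator:
    def __init__(self, subordinates, log):
        self.subordinates, self.log = subordinates, log
    def run(self):
        votes = [s.prepare() for s in self.subordinates]        # round trip 1
        decision = "commit" if all(v == "yes" for v in votes) else "abort"
        self.log.force_write(decision)
        acks = [s.decide(decision) for s in self.subordinates]  # round trip 2
        assert all(a == "ack" for a in acks)
        self.log.write("end")             # end record need not be forced
        return decision

print(Coordinator([Subordinate(Log()), Subordinate(Log())], Log()).run())
```

The two list comprehensions are exactly the two network round trips whose latency the next slide prices out.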
The cost of synchronization
• 2PC: minimum latency of two network round trips.
• Latency limits Xact throughput: throughput (#Xacts/second) = 1 / (cost of one run of 2PC).
• Lausanne-Shenzhen: 9491 km x 4 one-way trips ≈ 126 ms at the speed of light.
• Wide-area / cross-data-center consistency: ~8 Xacts/s ?!? (worked out below)
• Consensus ≥ 2-phase commit.
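Made explicit, the back-of-the-envelope computation behind these numbers (figures from the slide; vacuum light speed, so real fiber and switching would only be slower):

```python
# 2PC = 2 round trips = 4 one-way traversals of the link.
distance_km = 9491                         # Lausanne-Shenzhen, one way
c_km_per_s = 299_792                       # speed of light in vacuum

latency_s = 4 * distance_km / c_km_per_s   # ~0.1266 s, i.e. ~126 ms
print(f"min 2PC latency: {latency_s * 1e3:.0f} ms")
print(f"max sequential rate: {1 / latency_s:.1f} Xacts/s")   # ~7.9, i.e. ~8
```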
Xact throughput local vs. cross-datacenter (slide by P. Bailis)
The cost of synchronization
• For every network latency, there is a maximum scale at which a synchronized system can run.
• Jim Gray: in the late 1980s, optimal 2PC throughput was reached at ~50 machines (above that, throughput decreases).
• Today the number is higher, but not by much.
Latency and scaling of synchronized parallel systems
• SIMD in CPUs: does not need to scale, and is implemented in hardware on a single die: OK.
• Cache coherence in multi-socket servers: a headache for computer architects!
• Linear algebra in HPC
• very special and easy problem, superfast interconnects, locality
• but scaling remains a challenge for designers of supercomputer and linear algebra routines.
• Scaling of ad-hoc problems on supercomputers: an open problem, ad-hoc solutions.
• Consistency inside data center (Xacts, MPI syncs): <10000 Hz
• Cross-data center consistency ~10 Hz
• Latency of heterogeneous local batch jobs: <2Hz (HPC, Mapreduce, Spark)
Latency and scaling of synchronized parallel systems #2
• Scaling of batch systems: <2Hz
• HPC jobs <1Hz(?)
• Map/reduce 0.1 Hz (synchronization via disk, slow scheduling)
• Spark <2 Hz; “Spark Streaming”
• Note: Hadoop efficiency: takes 80-100 nodes to match single-core performance.
The cost of synchronization
• Distributed computation needs synchronization.
• Low-latency stream processing?
• Asynchronous message forwarding: data-flow parallelism!
[Diagram: Machine 1 and Machine 2 start from a := 1 and b := 1 and alternately apply a += a+b and b += a+b, exchanging the intermediate values 2, 3, 5, 5; the first three cross-machine steps are marked "sync", the last "no sync".]
Does streaming/message passing defeat the 2PC lower bound?
• Assume we compute each statement once.
• Different machines handle statements
• Don’t compute until you have received all the msgs you need.
• Works!
• But requires synchronized timestamps on the input stream.
• One stream source, or synchronization of the stream sources!
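A sketch of the rule above: each statement becomes a node that fires once, as soon as all the messages it needs have arrived; no global synchronization, only asynchronous forwarding (the Node structure is illustrative, not any framework's API):

```python
class Node:
    """One dataflow node per statement; purely message-driven."""
    def __init__(self, needed, fn):
        self.needed = set(needed)   # input names the statement reads, e.g. {"a", "b"}
        self.inbox = {}
        self.fn = fn                # the statement, e.g. lambda a, b: a + (a + b)
        self.outputs = []           # (downstream_node, input_name) pairs
        self.done = False

    def receive(self, name, value):
        self.inbox[name] = value
        if not self.done and self.needed <= self.inbox.keys():
            self.done = True                      # compute each statement once
            result = self.fn(**self.inbox)
            for node, as_name in self.outputs:
                node.receive(as_name, result)     # asynchronous forward

# a' = a + (a + b) fires exactly when both a and b have arrived:
n = Node({"a", "b"}, lambda a, b: a + (a + b))
n.receive("a", 1)   # waits
n.receive("b", 1)   # fires and forwards the result
```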
[Diagram: the computation unrolled over a grid of machines (i, j); machines (i, 1) through (i, 5) each own one statement (a' += a+b, b'' += a'+b', a'' += a'+b'', b''' += a''+b''), starting from a := 1 and b := 1, with the values 2, 3, 5, 5 flowing between them.]
Does streaming/message passing defeat the 2PC lower bound?
• Repeatedly compute values.
• Each msg has a (creation) epoch timestamp.
• Multiple msgs can share a timestamp.
• Works in this case!
• Computes only sums of two objects, so we know when we have received all the msgs we need to make progress!
[Diagram: the same per-statement machine grid as on the previous slide.]
Does streaming/message passing defeat the 2PC lower bound?
• Repeatedly compute values.
• Each msg has a (creation) epoch timestamp.
• Multiple msgs can share a timestamp.
• Notify when no more messages with a particular timestamp are to come from a sender.
• Requires waiting for notify() from all sources (sketched below).
• Synch again!
• If there is a cyclic dependency (the same values read as written), 2PC is back!
[Diagram: the same per-statement machine grid as on the previous slides.]
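A sketch of the notification scheme from the slide above: messages are buffered per epoch timestamp, and an epoch is released only once every source has sent its end-of-epoch notify() (the names are ours, not a specific system's):

```python
from collections import defaultdict

class EpochOperator:
    def __init__(self, sources, on_epoch_closed):
        self.sources = set(sources)
        self.open = defaultdict(lambda: set(self.sources))  # epoch -> sources still open
        self.buffer = defaultdict(list)                     # epoch -> buffered msgs
        self.on_epoch_closed = on_epoch_closed

    def receive(self, source, epoch, msg):
        self.buffer[epoch].append(msg)

    def notify(self, source, epoch):
        """`source` promises to send no more messages with timestamp `epoch`."""
        self.open[epoch].discard(source)
        if not self.open[epoch]:          # heard from everyone: safe to progress
            del self.open[epoch]
            self.on_epoch_closed(epoch, self.buffer.pop(epoch, []))

op = EpochOperator({"s1", "s2"}, lambda e, msgs: print("epoch", e, msgs))
op.receive("s1", 0, "x")
op.notify("s1", 0)   # s1 done with epoch 0; still waiting on s2
op.notify("s2", 0)   # fires: epoch 0 is closed
```

The wait for notify() from all sources is exactly the synchronization point the slide warns about.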
Streaming+Iteration: Structured time [Naiad]
[Diagram: a dataflow graph of operators A through E containing a loop: timestamps are (t) outside the loop, (t, t') inside it, and the loop's feedback edge advances (t, t') to (t, t'+1).]
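The timestamp structure can be shown in a few lines (a toy encoding of the idea, not Naiad's actual API): entering a loop extends the timestamp with a loop counter, the feedback edge increments it, and leaving the loop truncates it.

```python
# Toy structured timestamps: tuples ordered lexicographically.

def ingress(ts):    # entering the loop:  (t)     -> (t, 0)
    return ts + (0,)

def feedback(ts):   # loop feedback edge: (t, t') -> (t, t'+1)
    return ts[:-1] + (ts[-1] + 1,)

def egress(ts):     # leaving the loop:   (t, t') -> (t)
    return ts[:-1]

t = ingress((3,))            # (3, 0)
t = feedback(t)              # (3, 1)
assert egress(t) == (3,)     # back to the outer epoch
assert (3, 1) < (4, 0)       # lexicographic order drives progress tracking
```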
Data-flow parallel systems
• Popular for real-time analytics.
• Most popular framework: Apache Storm / Twitter Heron
• Simple programming model (“bolts” – analogous to map-reduce mappers).
• Requires non-blocking operators: e.g., a symmetric hash join instead of a sort-merge join (see the sketch below).
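For instance, a symmetric hash join keeps one hash table per input and probes the opposite table as each tuple arrives, so it emits results without ever waiting for an input to finish, unlike a sort-merge join, which blocks until its inputs are complete (a minimal sketch):

```python
from collections import defaultdict

class SymmetricHashJoin:
    """Non-blocking equi-join: insert each arriving tuple into its side's
    hash table, then immediately probe the other side for matches."""
    def __init__(self):
        self.table = {"left": defaultdict(list), "right": defaultdict(list)}

    def insert(self, side, key, tup):
        other = "right" if side == "left" else "left"
        self.table[side][key].append(tup)
        for match in self.table[other][key]:
            yield (tup, match) if side == "left" else (match, tup)

join = SymmetricHashJoin()
print(list(join.insert("left", 1, ("a",))))    # []: no right-side match yet
print(list(join.insert("right", 1, ("b",))))   # [(('a',), ('b',))]
```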
Latency and Data Skew
• Data skew: uneven parallelization.
• Reasons:
  • bad initial parallelization
  • uneven blowup of intermediate results
  • bad hashing in (map/reduce) reshuffling
• Occurs both in batch systems such as map-reduce and in streaming systems.
• Fixes: skew-resilient repartitioning, load shedding, … (see the sketch below)
• Node failures look similar.
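One common form of skew-resilient repartitioning is hot-key splitting: spread each known-hot key over several workers and merge the partial results downstream (a sketch; detecting the hot keys and the downstream merge are assumed to happen elsewhere):

```python
import random

def route(key, n_workers, hot_keys, fanout=4):
    """Pick a worker for a tuple. Hot keys are salted so their load spreads
    over `fanout` workers instead of hammering one; the per-key partial
    aggregates must then be merged in a downstream stage."""
    if key in hot_keys:
        salt = random.randrange(fanout)        # split the hot key
        return hash((key, salt)) % n_workers
    return hash(key) % n_workers
```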
Paths to (real-time) performance
• Parallelization
• Small data (seriously!)
• Incrementalization (online/anytime)
• Specialization
  • Hardware: GPU, FPGA, …
  • Software: compilation
  • Lots of activity on both the hardware and software fronts at EPFL
Summary
• The classical batch job is getting a lot of competition: people need low latency for a variety of reasons.
• Part of a cloud/OpenStack deployment will be used for low-latency work.
• Latency is a problem! Virtualization, MS vs. Amazon.
• Distributed computation at low latencies has fundamental limits.
• Very hard systems problems, huge design space to explore.
• Incrementalization can give asymptotic efficiency improvements; in practice, many orders of magnitude.