Distributed Data Management Summer Semester 2013
TU Kaiserslautern
Dr.-Ing. Sebastian Michel
[email protected]‐saarland.de
Distributed Data Management, SoSe 2013, S. Michel 1
(DISTRIBUTED) DATA STREAM PROCESSING (INTRODUCTION)
Lecture 8+
So Far: Databases/NoSQL Datastores
• Data is changing, yes, but mostly through inserts and updates to stored data items
• Historic data is kept
• Queries operate on the full data (tables)
• MapReduce is the extreme case: write once, read many times
• Data warehousing, too: periodically loading data into a store for deep(er) analytics
• Data mining
Data Stream Management vs. Traditional Data Management
• At query time, data is accessed as a whole
• Data is persistently stored
• Queries are ad hoc (mainly)
[Figure: DATA Base/Store with Insert, Update, Delete, and Query & Results]
Data Stream Management vs. Traditional Data Management (Cont'd)
• Data is moving! Continuously generated (assumed infinite!)
• At high pace
• Queries are (mainly) continuous (aka standing): registered once, observed "forever"
• Answers to queries are (often) required in (near) real time
• Probabilistic methods for efficiency, or considering only part of the stream (sliding window)
[Figure: DATA STREAM, set of queries, results]
Sensor Networks
• (Distributed) sensor networks; "Smart Dust"
• Mainly numeric measurements of (natural) phenomena
• Computation of queries like max, average, min, quantiles, value > τ
Mobile Ad-Hoc Networks
• Connections between sensors are ad hoc
• Efficient and reliable routing
• Understanding the (changing) topology
• Power consumption
• Also: vehicular ad-hoc networks
Social Sensors
• Explicitly: Snow Tweets (http://snowcore.uwaterloo.ca/snowtweets/)
– #snowtweets 50.0 cm. at K1A 0A2
– #snowtweets 10.0 in. at 20500
– #snowtweets 4 cm at Palmerston North 4414
• Implicitly: by mentioning topics or people in social communication
Earthquake News on Twitter
[Figure: plot over time; source: http://blog.socialflow.com/]
Stock Market
• Real-time analysis of stock market changes
• Computing statistics over streams, e.g., for decision support
• Opportunities for reacting in real time
• Even with fully automated means: algorithmic trading
DBMS vs. DSMS
Database management system (DBMS) | Data stream management system (DSMS)
Persistent data (relations) | Volatile data streams
Random access | Sequential access
One-time queries | Continuous queries
(Theoretically) unlimited secondary storage | Limited main memory
Only the current state is relevant | Consideration of the order of the input
Relatively low update rate | Potentially extremely high update rate
Little or no time requirements | Real-time requirements
Assumes exact data | Assumes outdated/inaccurate data
Plannable query processing | Variable data arrival and data characteristics
Source: http://en.wikipedia.org/wiki/Data-stream_management_system
Data Stream Model
• The stream of data items is unbounded (available memory is not)
• No way to store the entire stream (how could we, it is (probably) never ending)
• To compute query results, we need to devise algorithms with little memory consumption
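
As a rough illustration of "little memory consumption", the sketch below keeps count, mean, min, and max of a numeric stream in constant space. The class name `RunningStats` and the sample readings are made up for this example; they are not taken from the lecture.

```python
import math


class RunningStats:
    """Constant-memory summary of a numeric stream (hypothetical helper)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.minimum = math.inf
        self.maximum = -math.inf

    def update(self, x):
        # Each item is inspected once and then discarded.
        self.count += 1
        # Incremental mean: no need to keep the full sum of past items.
        self.mean += (x - self.mean) / self.count
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)


stats = RunningStats()
for reading in [4.0, 8.0, 6.0]:  # stand-in for an unbounded stream
    stats.update(reading)
```

The state is four numbers no matter how long the stream runs. Quantiles, by contrast, cannot be maintained exactly this way and call for approximate summaries.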
Overview of Data Stream Topics
• Synopses:
– concise representations of stream content
– tailored to tasks, e.g., counting distinct elements
– usually not exact, but approximations (estimators) of true values
• Windows:
– focus on a certain recent subset of the data
– computation of functions/joins over window(s) content
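
To make the window idea concrete, here is a minimal sketch of a count-based sliding window that reports the mean of the most recent items. The class name `SlidingWindowMean`, the window size of 3, and the input values are assumptions chosen for illustration; the lecture does not prescribe this design.

```python
from collections import deque


class SlidingWindowMean:
    """Mean over the last `size` items of a stream (size must be >= 1)."""

    def __init__(self, size):
        self.window = deque(maxlen=size)
        self.total = 0.0

    def update(self, x):
        # If the window is full, the oldest item is about to be evicted.
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]
        self.window.append(x)  # deque drops the oldest item automatically
        self.total += x
        return self.total / len(self.window)


w = SlidingWindowMean(3)
means = [w.update(x) for x in [1, 2, 3, 4, 5]]
```

Memory use is bounded by the window size rather than by the stream length.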
Data Stream Mining: Teasers
• I tell you integer numbers between 1 and N
• I will tell all but one number
• After N-1 numbers I ask: which number was missing?
481 324 122 412 871 231 849 447 641 …
Data Stream Mining: Teasers (Cont'd)
• Keep Boolean array of length N:
– Mark position for observed number
– Size required: N
– Computation at end: N to find missing number
• Much better:
– keep sum of numbers: S
– Missing number is N*(N+1)/2 - S
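
The sum-based solution can be sketched in a few lines. The function name `find_missing` and the sample input are illustrative only.

```python
def find_missing(stream, n):
    """Return the one number from 1..n that does not appear in `stream`."""
    total = 0
    for x in stream:  # single pass, one integer of state
        total += x
    # The numbers 1..n sum to n*(n+1)/2; the difference is the missing one.
    return n * (n + 1) // 2 - total


missing = find_missing([3, 1, 5, 2], 5)
```

Here the stream sums to 11 and 1 + 2 + 3 + 4 + 5 = 15, so the missing number is 4.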
Counting Occurrences
• Consider a stream of elements ai: …, a2, a84, a41, a2, a77, a231, a2, a4, a54, …
• How often does a2 occur?
• How to implement?
• Keep a counter for each id
• Required space: #ids (= N)
• Not feasible if N is very large
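
For comparison, the exact per-id counter might look as follows. This is a sketch; the string ids simply mirror the element names in the example stream.

```python
from collections import Counter


def count_exact(stream):
    """Exact frequencies: one counter per distinct id."""
    counts = Counter()
    for item in stream:
        counts[item] += 1
    return counts


stream = ["a2", "a84", "a41", "a2", "a77", "a231", "a2", "a4", "a54"]
counts = count_exact(stream)
```

The answers are exact, but the number of counters grows with the number of distinct ids, which is the space problem noted above.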
Probabilistic Counting: Count-Min Sketch
• Keep 2-dim array (h, r)
• h hash functions* that map to range 0…(r-1)
• Arriving item a
• For each j: array[j, hj(a)]++
[Figure: 2-dim counter array with rows h1 to h4 and columns 0 to 5]
Cormode, Muthukrishnan (2004). An Improved Data Stream Summary: The Count-Min Sketch and its Applications. J. Algorithms 55: 29–38.
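
A minimal sketch of the update step, assuming the h hash functions are simulated by salting SHA-256 with the row index. That hashing scheme, the class name `CountMinSketch`, and the method names are choices made for this example, not the construction from the paper. The `estimate` method is included only so that the structure can be queried.

```python
import hashlib


class CountMinSketch:
    def __init__(self, num_hashes, width):
        self.num_hashes = num_hashes  # h rows, one per hash function
        self.width = width            # r columns, hash range 0..r-1
        self.table = [[0] * width for _ in range(num_hashes)]

    def _column(self, j, item):
        # Assumption: row-salted SHA-256 stands in for hash function h_j.
        digest = hashlib.sha256(f"{j}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        # For each j: array[j, h_j(item)]++
        for j in range(self.num_hashes):
            self.table[j][self._column(j, item)] += 1

    def estimate(self, item):
        # Minimum over the counters the item maps to; never below the true count.
        return min(self.table[j][self._column(j, item)]
                   for j in range(self.num_hashes))


sketch = CountMinSketch(num_hashes=4, width=6)
stream = ["a2", "a84", "a41", "a2", "a77", "a231", "a2", "a4", "a54"]
for item in stream:
    sketch.add(item)
```

Space is h × r counters regardless of how many distinct items arrive. The price is that colliding items share counters.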
Count-Min Sketch: Counting
• How often did we see item a?
• h1(a) = 4, h2(a) = 5, h3(a) = 0, h4(a) = 2
• Take the minimum of the corresponding values in the 2-d array. Here: 4
• The estimate never underestimates the true count
• Overestimation is probabilistically bounded
      0  1  2  3  4  5
h1    5  3  4  4  9  3
h2    4  7  1  4  4  8
h3    8  4  6  7  2  1
h4    3  1  4  8  7  5
Values selected by the hash functions: 9 (h1), 8 (h2), 8 (h3), 4 (h4)
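
The lookup on this slide can be replayed directly. The array and the hash positions are the ones given above; the variable names are illustrative.

```python
# Counter array from the slide: rows h1..h4, columns 0..5.
table = [
    [5, 3, 4, 4, 9, 3],
    [4, 7, 1, 4, 4, 8],
    [8, 4, 6, 7, 2, 1],
    [3, 1, 4, 8, 7, 5],
]

# h1(a) = 4, h2(a) = 5, h3(a) = 0, h4(a) = 2
positions = [4, 5, 0, 2]

# One candidate counter per row, then the minimum as the estimate.
candidates = [row[col] for row, col in zip(table, positions)]
estimate = min(candidates)
```

The candidates are 9, 8, 8, and 4, so the estimate is 4. Counters above the minimum have been inflated by other items hashing to the same cell.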
Outlook
• Some more basics of stream processing
• A few more fundamentals of data stream mining
• Then, window-based data stream management
• Then going distributed:
– Distributed DSMS
– Distributed (ad-hoc) sensor networks
– Scalable computation of massive data streams (think: Hadoop for data streams)