29
Data Stream Management Systems Tore Risch Dept. of information technology Uppsala University Sweden 2016-12-05

Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

Data Stream Management

Systems

Tore Risch

Dept. of information technology

Uppsala University

Sweden

2016-12-05

Page 2: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

New applications

• Data comes as huge data streams, e.g.

- Satellite data

- Scientific instruments

- Colliders

- Industrial equipment

- Patient monitoring

- Stock data

- Traffic monitoring

Page 3: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

Too much data to store on disk

Would like to query data directly in the streams

Page 4: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

DataBase Management

Systems, DBMS

4

e.g. Oracle, MySQL, SQL Server

Query Processor

Data Manager

Meta –

data

DBMS

Dynamic SQL Queries

Stored

Data

Page 5: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

DataStream Management

Systems, DSMS

5

e.g. Tipco Streambase

Query Processor

Stream ManagerData

streams

Meta –

data

DSMS

Dynamic Continuous Queries (CQs)

Data

streams RDB

Page 6: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

Data managers, key/value

stores

6

Data manager

Java/C++/JavaScript programs

Stored

Data

e.g. BerkeleyDB, Riak, Cassandra, DynamoDB

Page 7: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

Data Stream Managers

7

e.g. Storm, Kinesis, Stream Mill, Flink

Data

streams

Data

streams

Data stream

manager

Java/C++/JavaScript programs

Page 8: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

sa.engine Stream Analythics

Engine

8

Page 9: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

Continuous queries over streams

from expresswaysSchema for stream CarLocStr of tuples:

CarLocStr(car_id, /* unique car identifier */

speed, /* speed of the car */

exp_way, /* expressway: 0..10 */

lane, /* lane: 0,1,2,3 */

dir, /* direction: 0(east), 1(west) */

x-pos); /* coordinate in express way */

Continuous query expressed in the query language CQL (Stanford):

Continuously get the cars in a window of the stream every 30 seconds:

SELECT DISTINCT car_id

FROM CarLocStr [RANGE 30 SECONDS]

Continiuously compute the average speed of vehicles per expressway, direction, segment each 5 minutes:

SELECT exp_way, dir, seg, AVG(speed) as speed,

FROM CarSegStr [RANGE 5 MINUTES]

GROUP BY exp_way, dir, seg

Page 10: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

DSMSs vs. Relational Databases• Streams ordered

– Regular relational DBs are based on sets and bags

• The order of the result of a relational (sub-query) is insignificant

• Used in lots of query optimization methods

– Must preserve order when result is stream of tuples

• Streams potentially infinite in size– Regular DBs based on queries to finite tables

– A stream must normally be accessed in one pass

– A regular relational join (e.g. nested-loop join) may scan tablesseveral times

• NLJ impossible over streams

• Merge Join (MJ) possible in stream (tuple arrival) order only

– Cannot materialize (store) intermediate results

• Regular hash join impossible over streams

Different query processing for streams

Page 11: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

DSMSs vs. relational databases• Often very high stream data volume and rate

– Regular DBs much less demanding

– Regular database are optimized for high-watermark

• Once loaded it does not change in size very fast

– Streams may have extreme insert and delete rates as tuples flow through the DSMS

• Real-time delivery, Quality of Service– Regular databases also have QoS requirements, but slower!

• E.g. guaranteed average response time of 1s (soft real-time)

– If a DSMS does not keep up with the flow some actions have to be made

• E.g. skipping elements, shedding

Page 12: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

DSMSs vs. relational databases• Stream queries are continuous

– Regular DB queries passive

• Client send query – server replies

– Queries over stream run continuously until they are stopped

Continuous queries, CQs

• Active databases (triggers) are related, but:– Active database are very side effect oriented (very procedural)

– The action in a trigger typically updates the database

– Triggers activated by database state changes

• Continuous queries are non-pocedural– CQs can be seen as side effect free functions (algebra) over

streams

– Database updates are handled as streams insering tuples intosome database (after sufficient data reduction by CQ filters)

Page 13: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

Continuous queries• CQs are turned on and run until stop condition becomes

true or when they are explicitly terminated– Regular queries finsish when demanded result data delivered

• CQs often based on time stamps (logs) of stream elements, i.e. streams contain temporal data– Regular queries often not temporal

– Temporal databases provides queries over time stamped stored data

• Non-monotone operators in CQs (e.g. grouping and sorting) must be made over stream windows– A stream window is a bounded stream segment

– By contrast, regular DBMS queries specified over entire tables

Page 14: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

A simple example

peakdetect(winagg(readstream(“SIG2"),256,256));

Page 15: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

Stream windows• Limited size section of stream stored temporarily in

database– A window is regarded as a transient stored regular database

table

• Need window operators to chop stream into segments– as [RANGE 30 SECONDS] in CQL

• Window size based on:– A time window

• E.g. elements last 30 seconds

– i.e. a window contains all event processed during the last second

– Number of elements, a counting window

• E.g. last 10 elements

– i.e. windows has fixed size of 10 elements

– A landmark window

• All events from time t0 belongs to window

– c.f. log file

Page 16: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

Stream windows

• Windows also have stride– Rule for how fast they move forward,

• E.g. 256 elements for a 256 element counting windowor 30s for a 30s time window

=> A tumbling window

• E.g. 128 elements for a 256 element counting windowor 10s for a 30s time window

=> A sliding windows

• The CQ language needs window operators to collect stream tuples into stream of windows of tuples– Stride and window size are parameters

Page 17: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

Window joins

• Join over temporary data in window– Can be full join with aggregate functions (group-by)

• Window join operators become approximate– Since only windows of elements are joined rather than entire

stream

• In a relational database all elements in tables are matched in a join

– Often approximate summaries of streaming data maintained

• E.g. moving average, standard deviation, max, min

• The summaries also become dynamic streams!

Page 18: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

Continuous query languages

• CQL– The classical CQ language developed by Stanford CSD

– Minimal extension to SQL by streams over tuples

– FROM clause is augumented with a […] construct to specify streams and windows:

SELECT DISTINCT car_id

FROM CarLocStr [RANGE 30 SECONDS];

SELECT exp_way, dir, seg, AVG(speed) as speed,

FROM CarSegStr [RANGE 5 MINUTES]

GROUP BY exp_way, dir, seg;

– [now] indicates no window:SELECT car_id, speed

FROM CarLocStr [now]

Page 19: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

Continuous query languages

• There is no standard CQ language

• Most DSMS use variants of CQL– E.g. Tipco StreamBase, Azure Stream Analythics, SQL Stream

Variables always bound to rows in streams, tuple algebra andqueries over tuples

Syntactic sugar added to handle windows and sequences

Numerical models must be expressed in terms of rows, which many look unnatural for numerical models

• sa.engine (and sa.amos) based on domain algebra and queriesVariables bound to objects rather than rows

• For example, number, vectors, arrays, strings etc.

• Mathematical models easy and natural to express algebraically

Combines regular mathematical algebra with stream algebra, relational algebra, and object-oriented queries

Page 20: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

Parallel DSMSs

• Classical DSMSs execute on one computer– Can be many threads (multi-core)

• The streams contain rather low volume events– Rather slow reading from sensors

• Sensor networks

• Smart dust

• Typically relatively simple processing (conditions) over events– E.g. thresholds

• Temperature > 100 and pressure < 200

• CQ typically makes data reduction by source filtering

• Could still be substantial data volumes at the data sink the receiver of the stream.– Collect all event into a relational database for SQL analyses

Page 21: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

Parallel DSMSs

• New applications for DSMS require

– Advanced queries and computations on stream elements, e.g.:

• E.g. applying advanced statistical operators over windows

• Complex numerical computations (FFT, PCA) and other mathematical models

• Advanced pattern matching and AI techniques

– Scalability over both data volume and computations

• Solution

– Parallel DSMS• Computations made in in parallel

Page 22: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

Well known DSMS SystemsSTREAM (Stanford):StreaMon: Baby & Widom: An Adaptive Engine for

Stream Query Processing, SIGMOD 2004, Classical, as DSMSprototype, defines CQL. Used by SQLStream product.

Aurora (Brown,MIT,Brandeis): Carney et al: Monitoring Streams – A New Class of Data ManagemBnt Applications, VLDB 2003. Classical, a classical DSMS, now Tipco Streambase product.

STORM (http://storm.incubator.apache.org/): Data stream manager developed by Twitter. Stream processing engine where programmers specify communicating distributed computations in Java (Scala, Clojure). Program library, no query language.

Kinesis (https://aws.amazon.com/kinesis/streams/): Amazon’s data stream manager.

Flink (https://flink.apache.org/): A data stream manager. CQL-similar query language being developed (https://calcite.apache.org/).

Borealis (Brown & Brandeis): Ahmad et al: StreaMon: An Adaptive Engine for Stream Query Processing, SIGMOD 2005, first parallell DSMS prototype.

Azure Stream Analytics (https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-introduction), Microsoft’s DSMS, own CQL dialect.

Page 23: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

sa.engine (SCSQ++)• The world’s fastest DSMS 2010-now

• Measured with DSMS benchmark Linear Road: http://www.cs.brandeis.edu/~linearroad/

• Processes demanding computations on

streaming data at network speed

• Performance obtained by non-trivial

parallelization on clusters

• See: http://user.it.uu.se/~udbl/SCSQ.html

• SVALI: SCSQ extended with features for

analyses of streams from industrial equipment:http://link.springer.com/article/10.1007/s10844-016-0427-2

Page 24: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

Performance

2,5 2,5 564

512

1024

0

200

400

600

800

1000

1200

Best timely response to compute intense

queries and analyses over high data stream

rates, www.cs.brandeis.edu/~linearroad:

Page 25: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

SAQL

sa.engine

How: Massively parallel stream query processing

Page 26: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

Use case: fleet management

and analysis

Communicated

data streams

sa.enginesa.engine

sa.engine

sa.engine

sa.engine

sa.engine

sa.engine

sa.engine

sa.engine

Central

analytics

Edge

analytics

Edge

analytics

Equipment Monitoring centers Analysts

Page 27: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

Stream Analysis and

Query Language SAQL

R/Matlab:

3 000 000 users

Hadoop

100 000

users

Empowers analysts to both implement and deploy

both analytical models and queries

by combining regular algebra and stream algebra in

very high-level queries

Automatic generation of (distributed) optimized execution

plans ensures efficiency and scalability

Existing algorithms in e.g. C or Java can be

reused in queries.

Page 28: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

The sa.engine approch

Real-time query and analysis of streaming data

End user analytics on very high level by query and analysis

language

Capability of handling compute intense analyses

over massive data streams in real-time (AI, learning, prediction)

Light-weight implementation enables edge analytics

on embedded IoT systems and mobile devices

Plug-in architecture where existing algorithms and indexes

can easily be reused

Page 29: Data Stream Management Systems · • Real-time delivery, Quality of Service – Regular databases also have QoS requirements, but slower! • E.g. guaranteed average response time

Overview paper

L. Golab and T. Özsu: Issues in Stream Data Management, SIGMOD Records, 32(2), June 2003, http://www.acm.org/sigmod/record/issues/0306/1.golab-ozsu1.pdf