Approximate Continuous Query Answering Over Streams and Dynamic Linked Data Sets

Soheila Dehghanzadeh, Daniele Dell’Aglio, Shen Gao, Emanuele Della Valle, Alessandra Mileo, Abraham Bernstein

ICWE - 25 June 2015

Outline

● Introduction to Continuous Queries

● Motivating Example

● Problem Description

● Solution

● Experimental Results

● Conclusions


Introduction

•RDF Stream Processing engines usually register queries and execute them in a continuous fashion.

[Figure: an RDF Stream Generator feeds a time-based sliding window W(ω,β), labelled with its width and slide; an evaluation is performed on each window]
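The time-based sliding window can be illustrated with a minimal sketch. It assumes integer timestamps, a window that covers the half-open interval (t - width, t] at evaluation time t, and a first evaluation at start + width; the function name, the item layout, and these conventions are assumptions made for illustration, not taken from the slides.

```python
from typing import Iterator, List, Tuple

# (timestamp, payload); the payload type is an assumption for illustration
Item = Tuple[int, str]


def sliding_windows(stream: List[Item], width: int, slide: int,
                    start: int, end: int) -> Iterator[Tuple[int, List[Item]]]:
    """Yield (evaluation_time, window_content) pairs.

    The window evaluated at time t contains the items whose timestamp
    lies in the half-open interval (t - width, t].
    """
    t = start + width
    while t <= end:
        content = [item for item in stream if t - width < item[0] <= t]
        yield t, content
        t += slide  # the next evaluation happens one slide later


stream = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
for t, content in sliding_windows(stream, width=2, slide=1, start=0, end=5):
    print(t, content)
```

With a slide smaller than the width, consecutive windows overlap, so the same stream item takes part in several evaluations.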

Introduction

•Complex continuous queries combine data streams with remote background data.

Background data (SPARQL endpoint)

Motivating Example: Finding Influential Users

•Influential User: a user who has more than a given number of followers and is mentioned more than a given number of times in a given period (200 seconds).

•Follower number: stored in a remote endpoint.

•Mention number: computed by processing the stream of messages.

Inspired by Chris Testa's SemTech 2011 talk: http://goo.gl/kLSqGo
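The influential-user definition above can be sketched as a join between mention counts computed from the current window and follower counts taken from background data. The function name, the thresholds, and the use of a plain dictionary in place of the remote endpoint are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def influential_users(window_mentions: Iterable[str],
                      followers: Dict[str, int],
                      min_mentions: int,
                      min_followers: int) -> Set[str]:
    """Return users exceeding both thresholds.

    window_mentions: one user id per mention observed in the current window.
    followers: follower count per user (stands in for the remote endpoint).
    """
    mention_count = Counter(window_mentions)
    return {
        user
        for user, count in mention_count.items()
        if count > min_mentions and followers.get(user, 0) > min_followers
    }


mentions = ["alice", "alice", "alice", "bob", "bob", "carol", "carol", "carol"]
followers = {"alice": 1000, "bob": 5000, "carol": 10}
print(influential_users(mentions, followers, min_mentions=2, min_followers=500))
```

Only the mention count depends on the stream; the follower count is the part that must come from the remote source or from a local copy of it.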

Investigating the Scenario: Symmetrical hash join

•Drawbacks:

• Data access constraints.

• Background data is huge and has to be fetched at every evaluation, which is slow and wastes computational and financial resources.

Investigating the Scenario: Nested Loop Join

•Drawbacks:

• One invocation for each mapping from the WINDOW clause evaluation – high number of requests to the server.

• API restrictions (e.g., limited amount of requests over time).

Investigating the Scenario: Local Views

•Challenges:

• Data goes out of date

Local View

Investigating the Scenario: Maintenance processes

•Maintenance introduces a trade-off between response quality and time.

•We propose to manage this trade-off by fixing the time dimension based on query constraints and maximizing the freshness of the response.

[Figure: Local View and Maintenance Process; labels: freshness decreases, refresh cost/quality trade-off]


Problem Description

The maintenance process should identify elements of the local view that maximize response freshness.


Requirements of The Maintenance Process

1. should satisfy the Quality of Service constraints on responsiveness and freshness of the answer;

2. should take into account the change rates of the data elements in the REST API;

3. should consider the dynamicity of the change rate values;

4. may consider the sliding window operator.


Hypotheses

•We formulated the following hypotheses to build the maintenance process:

•HP1: the freshness of the answer can increase by maintaining part of the local view involved in the current query evaluation

•HP2: the freshness of the answer increases by refreshing the (possibly) stale local view entries that would remain fresh in a higher number of evaluations


Solution: WSJ+WBM

[Figure: architecture with the Window, the JOIN, WSJ, WBM, the Refresher, and the Local View]


Terminology

[Figure: timeline t with windows W1 to W4; legend: mappings from the WINDOW clause, mappings in the LOCAL VIEW, compatible mappings]

Best Before Time: the time at which an element will become stale; it is defined by:


•WSJ identifies the candidate set: the possibly stale local view mappings involved in the current evaluation.

•WSJ analyzes the content of the current window evaluation and identifies the compatible mappings in the local view.

•The possibly stale mappings are identified by analyzing the associated best before time.
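The WSJ step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `ViewEntry` fields, the use of a single join key to decide compatibility, and the rule that an entry is possibly stale once its best before time is less than or equal to the current time are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class ViewEntry:
    key: str            # join key, e.g. a user id (hypothetical field)
    value: int          # cached background value, e.g. a follower count
    best_before: float  # time after which the entry may be stale


def wsj_candidates(window_keys: Iterable[str],
                   local_view: Dict[str, ViewEntry],
                   now: float) -> List[ViewEntry]:
    """Return the possibly stale view entries compatible with the window."""
    seen = set()
    candidates = []
    for key in window_keys:
        if key in seen:
            continue  # each view entry is considered at most once
        seen.add(key)
        entry = local_view.get(key)
        # Keep only entries that join with the window AND may be stale.
        if entry is not None and entry.best_before <= now:
            candidates.append(entry)
    return candidates


view = {
    "a": ViewEntry("a", 1, best_before=5.0),
    "b": ViewEntry("b", 2, best_before=20.0),
    "c": ViewEntry("c", 3, best_before=3.0),
}
print([e.key for e in wsj_candidates(["a", "b", "a"], view, now=10.0)])
```

Entry "c" is stale but does not appear in the window, so refreshing it would not change the current answer; this is the intuition behind HP1.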


•WBM ranks the candidate set to determine which mappings to update.

•The ranking is computed through two values: the renewed best before time and the remaining life time

•The top k elements are selected to be refreshed. The value k is selected according to the responsiveness constraint.


WBM: renewed best before time

•When would the mappings become stale if refreshed now?

•The renewed best before time V is computed as:


[Figure: table with columns V, L, and Score for the candidate mappings, shown over windows W1 to W4]

WBM: remaining life time and score

•In how many future evaluations is the mapping involved?

•The remaining life time L is computed as:

•WBM ranks the mappings by using a score:

Score=min(L,V)

• is selected for the maintenance
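The ranking step can be sketched with the score Score=min(L,V) from the slide. The sketch takes V and L as precomputed numbers expressed in the same unit, and it assumes that a higher score is preferred, which is how HP2 reads; the `Candidate` type, the function name, and the descending order are assumptions made for illustration.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    key: str
    V: float  # renewed best before time (given as input, same unit as L)
    L: float  # remaining life time (given as input)


def wbm_select(candidates: List[Candidate], k: int) -> List[Candidate]:
    """Rank candidates by Score = min(L, V) and keep the top k.

    ASSUMPTION: candidates with a higher score are refreshed first.
    k stands for the refresh budget allowed by the responsiveness constraint.
    """
    ranked = sorted(candidates, key=lambda c: min(c.L, c.V), reverse=True)
    return ranked[:k]


cands = [Candidate("a", V=3, L=1), Candidate("b", V=4, L=3), Candidate("c", V=2, L=5)]
print([c.key for c in wbm_select(cands, k=2)])
```

Taking the minimum means a refresh is only credited for evaluations in which the mapping is both still fresh and still involved in the window.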


Experiment: Data Collection

1. Streaming API

a. Twitter stream data for mention count

2. Twitter APIs to get the number of followers

a. Create snapshots every minute

b. Simulate the change based on each user’s predefined change rate.

[Figure: streaming dataset; snapshots / synthetic dataset]


Experimental setup

•We study our hypotheses using a comparative evaluation with

• LRU: use the least recently updated elements for maintenance

• RND: use a random subset of elements for maintenance

•Error measure

• Compare, at each consecutive evaluation, the answer of the motivating query over the cache with its answer over the real/synthetic dataset.

•HP1: We compared the cumulative staleness with and without WSJ (i.e., GNR) for both baselines.

• GNR: the candidate set is the whole set of view entries.

•HP2: We compared the cumulative staleness of WBM and the improved baselines.
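The two baseline policies, LRU and RND, can be sketched as selection functions over a candidate set. The function names, the `last_update` map from entry key to last refresh time, and the budget parameter `k` are assumptions made for illustration; passing the whole view corresponds to GNR, while passing only the WSJ candidate set gives the improved baselines.

```python
import random
from typing import Dict, List, Optional


def lru_select(last_update: Dict[str, float], k: int) -> List[str]:
    """LRU baseline: pick the k least recently updated entries."""
    return sorted(last_update, key=lambda key: last_update[key])[:k]


def rnd_select(last_update: Dict[str, float], k: int,
               rng: Optional[random.Random] = None) -> List[str]:
    """RND baseline: pick a random subset of at most k entries."""
    rng = rng or random.Random()
    keys = list(last_update)
    return rng.sample(keys, min(k, len(keys)))


last_update = {"a": 5.0, "b": 1.0, "c": 3.0}
print(lru_select(last_update, k=2))
print(rnd_select(last_update, k=2, rng=random.Random(0)))
```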


HP1: Maintaining involved entries of local view maximizes response accuracy.

Synthetic

WSJ improves more than GNR as the update budget increases.


HP2: Maintaining possibly stale entries from local view that will stay fresh for a longer time maximizes response accuracy.

Synthetic

WBM does not improve as much as WBM*, which shows that the error is caused by wrong estimates of the best before time (BBT). A more accurate prediction of the BBT is needed.


Conclusions and Future Work

•Conclusions:

• We proposed using materialization to optimize the processing of continuous queries.

• We proposed a policy that maximizes freshness under the time constraint of the continuous query.

• We tested our policy against baseline policies (LRU and Random).

•Future Work:

• Extending real continuous query processors with the proposed approach

• Measuring the time overhead of maintenance

• Investigating more complex queries that have complicated join patterns between the SERVICE and STREAM clauses.

• Dynamically estimating the change rate of users.


Soheila Dehghanzadeh, Daniele Dell’Aglio, Shen Gao, Emanuele Della Valle, Alessandra Mileo, Abraham Bernstein

soheila.dehghanzadeh@insight-centre.org http://www.slideshare.net/sallyde
