Addressing Time/Quality Trade-off in View
Maintenance
Soheila Dehghanzadeh
Insight Centre for Data Analytics
Outline
•Introduction
•Terminology
•Problem definition
•Proposed solution
•Experimental results
•Conclusion
Slide 2
Introduction: Query Processing On Linked Data
•Report changes to the local store (maintenance)
  • Sources pro-actively report changes or their existence (pushing).
  • The query processor discovers new sources and changes by crawling (pulling).
•Maintenance trade-off
  • Fast maintenance yields high quality but slow responses, and vice versa.
•Problem: maintain the local store according to a user-defined trade-off.
•Why is it important? It eliminates unnecessary maintenance, leading to faster responses and better scalability.
[Figure: off-line materialization — replication (databases) or caching (web) — feeds a local store; the query processor answers queries from the local store while updates and new sources keep arriving. Both sides of the trade-off affect scalability, availability, and performance.]
View Maintenance Categorization
Slide 4
Trade-off Management vs. Change Reporting Mechanism

Time/quality trade-off:

                    query level                      replica level
                    quality       time              quality       time
update stream       A             B                 C             D
no update stream    E             F                 G             H
Problem Definition
•Problem E: Optimizing maintenance to satisfy quality constraints within the lowest response time for each query.
•Problem F: Optimizing maintenance to satisfy time constraints with the highest response quality for each query.
Slide 5
Terminology
•Quality requirements:
  • Freshness = B/(A+B)
  • Completeness = B/(B+C)
•Maintenance plan
  • Each set of views chosen for maintenance is called a maintenance plan.
  • With n views, the number of maintenance plans is 2^n.
  • Each maintenance plan leads to a different response quality.
Slide 6
Example view freshness:  V1: 20%   V2: 90%   V3: 10%   V4: 80%
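The A, B, C counts are not spelled out on this slide; a minimal sketch of the two metrics, assuming A counts stale tuples in the response, B counts fresh tuples in the response, and C counts fresh tuples missing from the response:

```python
from fractions import Fraction

def freshness(a: int, b: int) -> Fraction:
    """Fraction of returned results that are fresh: B / (A + B)."""
    return Fraction(b, a + b)

def completeness(b: int, c: int) -> Fraction:
    """Fraction of all fresh results that were returned: B / (B + C)."""
    return Fraction(b, b + c)

# e.g. a response with 2 stale tuples (A), 6 fresh tuples (B),
# and 2 fresh tuples missing from the response (C):
print(freshness(2, 6))     # 3/4
print(completeness(6, 2))  # 3/4
```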
Freshness Example
Slide 7

Two views are joined: V1 holds (a, b) pairs and V2 holds (a, c) pairs, each tuple labelled fresh (T) or stale (F). A join tuple is fresh only if both contributing tuples are fresh.

No maintenance:
  V1: (a1,b1,T) (a2,b2,T) (a3,b3,F) (a4,b4,T) (a5,b5,F)        -> 60% fresh
  V2: (a1,c1,F) (a1,c2,F) (a1,c3,T) (a2,c4,T) (a6,c5,F)        -> 40% fresh
  Join: (a1,b1,c1,F) (a1,b1,c2,F) (a1,b1,c3,T) (a2,b2,c4,T)    -> 50% fresh

Maintaining V1 only (all of V1 becomes fresh):
  V1 -> 100% fresh, V2 -> 40% fresh, join result -> still 50% fresh

Maintaining V2 only (all of V2 becomes fresh):
  V1 -> 60% fresh, V2 -> 100% fresh, join result -> 100% fresh

Different maintenance plans over the same views yield different response quality.
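The numbers above can be reproduced in a few lines (a sketch; the view names `v1`/`v2` and the tuple encoding are ours):

```python
from fractions import Fraction

# Views as {join_key: [(value, is_fresh), ...]}; the "no maintenance" panel.
v1 = {"a1": [("b1", True)], "a2": [("b2", True)], "a3": [("b3", False)],
      "a4": [("b4", True)], "a5": [("b5", False)]}
v2 = {"a1": [("c1", False), ("c2", False), ("c3", True)],
      "a2": [("c4", True)], "a6": [("c5", False)]}

def view_freshness(v):
    flags = [f for vals in v.values() for (_, f) in vals]
    return Fraction(sum(flags), len(flags))

def join_freshness(left, right):
    # A join tuple is fresh only if both contributing tuples are fresh.
    flags = [fl and fr
             for k in left.keys() & right.keys()
             for (_, fl) in left[k] for (_, fr) in right[k]]
    return Fraction(sum(flags), len(flags))

print(view_freshness(v1))      # 3/5  (60%)
print(view_freshness(v2))      # 2/5  (40%)
print(join_freshness(v1, v2))  # 1/2  (50%)
```

Marking every tuple of `v2` fresh (the "maintain V2" plan) drives the join freshness to 100%, while doing the same for `v1` leaves it at 50%, as in the example.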
Research questions
•What is the least costly maintenance plan that fulfills the response quality requirements?
•What is the quality of the response without maintenance?
•What is the quality of the response of each maintenance plan?
Slide 8
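The first research question can be answered by brute force over all 2^n plans. The maintenance costs below are hypothetical, and the per-plan freshness values follow the two-view example above:

```python
from itertools import combinations

# Hypothetical maintenance costs per view; per-plan join freshness taken
# from the earlier two-view example -- illustrative numbers only.
views = ["V1", "V2"]
cost = {"V1": 4, "V2": 7}
freshness_of_plan = {frozenset(): 0.50,
                     frozenset({"V1"}): 0.50,
                     frozenset({"V2"}): 1.00,
                     frozenset({"V1", "V2"}): 1.00}

def cheapest_plan(threshold):
    """Least costly maintenance plan whose freshness meets the constraint."""
    best = None
    for r in range(len(views) + 1):
        for plan in combinations(views, r):
            p = frozenset(plan)
            if freshness_of_plan[p] >= threshold:
                c = sum(cost[v] for v in p)
                if best is None or c < best[0]:
                    best = (c, p)
    return best

print(cheapest_plan(0.9))  # (7, frozenset({'V2'})): maintaining V2 alone suffices
print(cheapest_plan(0.5))  # (0, frozenset()): no maintenance needed
```

Exhaustive search is only feasible for small n; that is exactly why the rest of the talk estimates per-plan quality from summaries instead of computing it exactly.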
Experiment
•We use the BSBM benchmark to create a dataset and a query set.
•We label triples with true/false to specify their freshness status.
•We summarize the cache to estimate the quality of a query response without actually executing the query on the cache.
•To summarize the cache, we extended cardinality estimation techniques to the freshness estimation problem.
Slide 9
Alice  Lives  Dublin     True
Bob    Lives  Berlin     False
Alice  Job    Teacher    True
Bob    Job    Developer  False
Cardinality Estimation
•Capture the data distribution by splitting the data into buckets, keeping only each bucket's cardinality in the summary.
Slide 10

Data (subject, predicate, object, freshness):
  Alice  Job    Teacher      T
  Alice  Lives  Dublin       T
  Alice  Job    PhD student  F
  Alice  Lives  Athlon       F
  Bob    Job    Manager      T
  Bob    Lives  Berlin       T
  Bob    Lives  Chicago      T
  Bob    Lives  Munich       F
  Bob    Lives  Belfast      F
  Bob    Lives  Limerick     F
  Bob    Job    CEO          F
  Bob    Job    Consultant   F

Per-(subject, predicate) buckets (cardinality, fresh count):
  Alice Job *    2 (1 fresh)     Bob Job *    3 (1 fresh)
  Alice Lives *  2 (1 fresh)     Bob Lives *  5 (2 fresh)
Per-predicate buckets:
  * Job *    5 (2 fresh)         * Lives *  7 (3 fresh)

Queries:
  Q1: ?a Job ?b
  Q2: (?a Job ?b) ^ (?a Lives ?c)

Estimates with the per-predicate summary (predicate multiplication):
        Cardinality (est / actual)   Freshness (est / actual)
  Q1    5 / 5                        2/5 / 2/5
  Q2    35 / 19                      6/35 / 3/19

Estimates with the per-(subject, predicate) summary:
        Cardinality (est / actual)   Freshness (est / actual)
  Q1    5 / 5                        2/5 / 2/5
  Q2    19 / 19                      3/19 / 3/19
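The estimates in the tables above can be reproduced with two toy estimators (a sketch; the bucket layout and function names are ours):

```python
from fractions import Fraction

# Per-(subject, predicate) buckets: (cardinality, fresh count), from the slide.
fine = {("Alice", "Job"): (2, 1), ("Bob", "Job"): (3, 1),
        ("Alice", "Lives"): (2, 1), ("Bob", "Lives"): (5, 2)}
# Coarse per-predicate buckets: * Job * -> 5 (2 fresh), * Lives * -> 7 (3 fresh).
coarse = {"Job": (5, 2), "Lives": (7, 3)}

def estimate_coarse(p1, p2):
    # Predicate multiplication: treat each pattern as a single bucket.
    (n1, f1), (n2, f2) = coarse[p1], coarse[p2]
    return n1 * n2, Fraction(f1 * f2, n1 * n2)

def estimate_fine(p1, p2):
    # Multiply per shared subject, then sum over subjects.
    subjects = ({s for (s, p) in fine if p == p1}
                & {s for (s, p) in fine if p == p2})
    card = fresh = 0
    for s in subjects:
        (n1, f1), (n2, f2) = fine[(s, p1)], fine[(s, p2)]
        card += n1 * n2
        fresh += f1 * f2
    return card, Fraction(fresh, card)

print(estimate_coarse("Job", "Lives"))  # (35, Fraction(6, 35))
print(estimate_fine("Job", "Lives"))    # (19, Fraction(3, 19))
```

The finer buckets give the exact answer here because subject is the only join dimension; the next two slides show where even that breaks down.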
Cardinality Estimation Approaches
•System R assumptions for cardinality estimation:
  • Data is uniformly distributed per attribute.
  • Predicates are independent (whether in the same table or in different tables).
  • Predicate-multiplication approaches make both assumptions.
•Histograms capture the dependencies among predicates for more accurate estimation.
Slide 11
Slide 11
Measure accuracy of the estimation approach
Slide 12
We measure the difference between the actual and estimated freshness of the queries in a query set, where n is the number of queries.
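The error formula on the original slide is not legible in this export; one common choice consistent with the description, mean absolute error over the n queries, would be:

```python
def mean_abs_error(actual, estimated):
    """Average |actual - estimated| freshness over n queries (an assumed
    metric -- the slide's exact formula is not recoverable from this export)."""
    assert len(actual) == len(estimated)
    n = len(actual)
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / n

# e.g. the two queries from the previous slide: Q1 is estimated exactly,
# Q2 is estimated as 6/35 when the actual freshness is 3/19.
print(mean_abs_error([2/5, 3/19], [2/5, 6/35]))
```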
Estimation Results
Estimation Error 1
Slide 14

Summary (per-predicate freshness):
  ?s Job ?o       50%
  ?s Lives-in ?o  75%

Data:
  a  Job       teacher    T        a  Lives in  Dublin    T
  a  Job       professor  F        b  Lives in  Galway    F
  a  Job       PhD        F        b  Lives in  Cork      T
  b  Job       developer  T        b  Lives in  Limerick  T

Query: <?s, Job, ?o1> join <?s, Lives in, ?o2>
  a  teacher    Dublin    T
  a  professor  Dublin    F
  a  PhD        Dublin    F
  b  developer  Galway    F
  b  developer  Cork      T
  b  developer  Limerick  T

Estimated freshness from the summary: 50% x 75% = 37.5%; actual: 3/6 = 50%.

Reason: dependencies between predicates.
Solution:
  • A more granular index on the join dimension (subject) and the bounded dimension (predicate).
  • Histogram and table-level synopses can capture these dependencies and reduce the error accordingly.
Experiment: we did not observe this error in our experiment because the dataset contained no such dependencies.
Estimation Error 2
20 October 2014 Slide 15

Summary (per-predicate freshness):
  ?s Job ?o1       50%
  ?s Lives-in ?o2  75%

Data:
  a  Job       teacher    T        a  Lives in  Dublin    T
  a  Job       professor  F        b  Lives in  Galway    F
  a  Job       PhD        F        b  Lives in  Cork      T
  b  Job       developer  T        b  Lives in  Limerick  T

Query: <?s, Job, Developer> join <?s, Lives in, ?o2>
  b  Developer  Galway    F
  b  Developer  Cork      T
  b  Developer  Limerick  T

Estimated freshness from the summary: 50% x 75% = 37.5%; actual: 2/3 = 66%.

Reason: bounded object.
Solution: a more granular index on the join dimension (subject) and the bounded dimensions (predicate and object) => we would need to index the whole dataset, which is not efficient.
Experiment: we did not observe any improvement on this error when using histograms.
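The gap between the 37.5% estimate and the 66% actual freshness can be reproduced directly (a sketch; the data encoding is ours). The per-predicate summary cannot see that the one `Developer` triple happens to be fresh:

```python
from fractions import Fraction

# The data from the slide, as (subject, predicate, object, is_fresh).
data = [("a", "Job", "teacher", True), ("a", "Job", "professor", False),
        ("a", "Job", "PhD", False), ("b", "Job", "developer", True),
        ("a", "Lives in", "Dublin", True), ("b", "Lives in", "Galway", False),
        ("b", "Lives in", "Cork", True), ("b", "Lives in", "Limerick", True)]

def actual_join_freshness(obj1):
    # <?s, Job, obj1> join <?s, Lives in, ?o2>
    left = [(s, f) for (s, p, o, f) in data if p == "Job" and o == obj1]
    right = [(s, f) for (s, p, o, f) in data if p == "Lives in"]
    flags = [fl and fr for (sl, fl) in left for (sr, fr) in right if sl == sr]
    return Fraction(sum(flags), len(flags))

# Per-predicate summary: Job is 50% fresh, Lives-in is 75% fresh.
estimated = Fraction(1, 2) * Fraction(3, 4)           # 3/8 = 37.5%
print(estimated, actual_join_freshness("developer"))  # 3/8 vs 2/3
```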
Concern 1 on problem definition

As the underlying data changes over time, the views and the join result all lose freshness, but at different rates (example from the slide, with a Job view, a Lives-in view, and their join):

  Step   Job view   Lives-in view   Join result
  t0     100%       100%            100%
  t1     66%        75%             50%
  t2     33%        50%             16%
  t3     0%         25%             0%

The freshness of the join result cannot be read off the freshness of the input views: it depends on which tuples are stale, so the same view-level freshness can correspond to very different query-level freshness.
Concern 2 on the suggested solution
•We need to build one summary per maintenance plan, because the summary of one maintenance plan cannot be used to estimate the freshness of a query executed on another maintenance plan.
•This is very inefficient, given the space requirements and the cost of maintaining these summaries.
Slide 17
Conclusion
•We defined quality constraints based on freshness and completeness.
•We summarized a snapshot of a dataset to estimate the freshness of various queries, using indexing and histograms for our freshness estimation problem.
•We need to build an individual summary for each maintenance plan, since a summary built for one maintenance plan cannot be used to estimate the quality of a query executed on another maintenance plan.
•Our experiment was not affected by the estimation error caused by dependencies, because the dataset contained no such dependencies. The next step is to design a more realistic dataset and again compare the results of histograms and predicate multiplication.
•Summarization techniques are designed for very static environments: any change to the underlying data requires rebuilding the summary from scratch. So does it really make sense to extend cardinality estimation to freshness estimation?
Slide 18
Problem Definition
•Problem E: Optimizing maintenance to satisfy quality constraints within the lowest response time for each query.
•Problem F: Optimizing maintenance to satisfy time constraints with the highest response quality for each query.
Slide 19
Problem description without join
Slide 20
•The user queries the replica with time constraints.
•The replica should maintain only the subset of results that is most likely to be expired.
[Figure: scenario — stream data flows into a window; background data is materialized in a replica.]
Slide 23
Use Case: Twitter
•Stream data: number of mentions in the last Twitter window.
•Background data: user follower count (the replica).
•"Rising stars" query: find users who have been mentioned more than 100 times in the last 10 minutes and have more than 1000 followers, with a constraint on the execution time.
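The rising-stars query can be sketched as a function over the current window and the (possibly stale) replica. The data below is a toy illustration, with thresholds lowered so the example produces output:

```python
from collections import Counter

def rising_stars(window_mentions, follower_replica,
                 min_mentions=100, min_followers=1000):
    """Users mentioned > min_mentions in the window with > min_followers.

    follower_replica holds possibly stale counts; mention counts come
    from the live window.
    """
    counts = Counter(window_mentions)  # mentions in the last window
    return [u for u, n in counts.items()
            if n > min_mentions and follower_replica.get(u, 0) > min_followers]

window = ["@alice"] * 3 + ["@bob"]                 # toy window contents
replica = {"@alice": 5000, "@bob": 200}            # possibly stale replica
print(rising_stars(window, replica, min_mentions=2, min_followers=1000))
# ['@alice']
```

The time constraint shows up as a refresh budget on `follower_replica`: the operator can only re-fetch a few follower counts per window, which is what the join policies below decide.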
Continuous join operator with one replica
•We implemented a set of continuous join operators:
  • DWJoin: uses the static replica and never changes it (the quality of the response degrades over time).
  • Baseline join: uses an LRU policy to choose which of the matched entries to update (but the least recently updated entry does not necessarily require updating).
  • Oracle join: fetches data directly from the source.
  • Smart join: computes change-rate statistics and fetches the entries that are most likely to be expired.
  • Mixed baseline-smart joins (possible extensions).
Slide 23
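The smart policy can be sketched as a scoring function (an illustration, not the thesis implementation): score each replica entry by its learned change rate times the time since its last refresh, and refresh the top entries within the budget. The change rates here are hypothetical statistics that a real operator would learn online:

```python
import time

def pick_entries_to_refresh(last_refreshed, change_rate, budget, now=None):
    """Return up to `budget` keys ranked by expected staleness."""
    now = time.time() if now is None else now
    # Expected number of missed changes since the last refresh.
    score = {k: (now - t) * change_rate.get(k, 0.0)
             for k, t in last_refreshed.items()}
    return sorted(score, key=score.get, reverse=True)[:budget]

last = {"@alice": 100.0, "@bob": 100.0, "@carol": 140.0}  # refresh timestamps
rate = {"@alice": 0.5, "@bob": 0.1, "@carol": 2.0}        # changes per second
print(pick_entries_to_refresh(last, rate, budget=2, now=150.0))
# ['@alice', '@carol']
```

The baseline (LRU) policy would pick `@alice` and `@bob` here, wasting budget on an entry that rarely changes; the score ranks the fast-changing `@carol` ahead of it despite its recent refresh.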
Performance of join operators
Slide 24
Possible extensions
Slide 25
The problem becomes more complicated when the query is a join between replicas: updating which combination of entries yields the highest increase in the quality of the join result?
[Figure: two replicas joined]
Future works
•Use a better model for learning the change rate in the smart policy.
•We believe the smart policy will perform better when the change rate is more predictable.
•Investigate the setting where there are joins on the background knowledge side, to learn which combination of stale entries contributes most to result correctness when updated.
Slide 26
Thanks a lot for your attention!
Slide 27