Addressing Time/Quality Trade-off in View
Maintenance
Soheila Dehghanzadeh
Insight Centre for Data Analytics
Outline
•Introduction
•Terminology
•Problem definition
•Proposed solution
•Experimental results
•Conclusion
Slide 2
Introduction: Query Processing On Linked Data
•Report changes to the local store (maintenance)
  • Sources pro-actively report changes or their existence (pushing).
  • The query processor discovers new sources and changes by crawling (pulling).
•Maintenance trade-off
  • Fast maintenance yields high quality but slow responses, and vice versa.
•Problem: maintain the local store according to a user-defined trade-off.
•Why is it important? It eliminates unnecessary maintenance, leading to faster responses and better scalability.
[Figure: off-line materialization — replication (databases) or caching (web) — feeds a local store; the query processor answers queries from the local store while updates and new sources keep arriving. Both sides of the trade-off affect scalability, availability, and performance.]
View Maintenance Categorization
Slide 4
Trade-off Management vs. Change Reporting Mechanism

Time/quality trade-off:

                    query level                      replica level
                    quality       time              quality       time
update stream       A             B                 C             D
no update stream    E             F                 G             H
Problem Definition
•Problem E: Optimizing maintenance to satisfy quality constraints within the lowest response time for each query.
•Problem F: Optimizing maintenance to satisfy time constraints with the highest response quality for each query.
Slide 5
Terminology
•Quality requirements:
  • Freshness = B/(A+B)
  • Completeness = B/(B+C)
•Maintenance plan
  • Each set of views chosen for maintenance is called a maintenance plan.
  • With n views, the number of maintenance plans is 2^n.
  • Each maintenance plan leads to a different response quality.
Slide 6
Example view freshness:  V1: 20%   V2: 90%   V3: 10%   V4: 80%
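The A, B, C counts are not spelled out on this slide; a minimal sketch of the two metrics, assuming A counts stale tuples in the response, B counts fresh tuples in the response, and C counts fresh tuples missing from the response:

```python
from fractions import Fraction

def freshness(a: int, b: int) -> Fraction:
    """Fraction of returned results that are fresh: B / (A + B)."""
    return Fraction(b, a + b)

def completeness(b: int, c: int) -> Fraction:
    """Fraction of all fresh results that were returned: B / (B + C)."""
    return Fraction(b, b + c)

# e.g. a response with 2 stale tuples (A), 6 fresh tuples (B),
# and 2 fresh tuples missing from the response (C):
print(freshness(2, 6))     # 3/4
print(completeness(6, 2))  # 3/4
```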
Freshness Example
Slide 7

Two views are joined: V1 holds (a, b) pairs and V2 holds (a, c) pairs, each tuple labelled fresh (T) or stale (F). A join tuple is fresh only if both contributing tuples are fresh.

No maintenance:
  V1: (a1,b1,T) (a2,b2,T) (a3,b3,F) (a4,b4,T) (a5,b5,F)        -> 60% fresh
  V2: (a1,c1,F) (a1,c2,F) (a1,c3,T) (a2,c4,T) (a6,c5,F)        -> 40% fresh
  Join: (a1,b1,c1,F) (a1,b1,c2,F) (a1,b1,c3,T) (a2,b2,c4,T)    -> 50% fresh

Maintaining V1 only (all of V1 becomes fresh):
  V1 -> 100% fresh, V2 -> 40% fresh, join result -> still 50% fresh

Maintaining V2 only (all of V2 becomes fresh):
  V1 -> 60% fresh, V2 -> 100% fresh, join result -> 100% fresh

Different maintenance plans over the same views yield different response quality.
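The numbers above can be reproduced in a few lines (a sketch; the view names `v1`/`v2` and the tuple encoding are ours):

```python
from fractions import Fraction

# Views as {join_key: [(value, is_fresh), ...]}; the "no maintenance" panel.
v1 = {"a1": [("b1", True)], "a2": [("b2", True)], "a3": [("b3", False)],
      "a4": [("b4", True)], "a5": [("b5", False)]}
v2 = {"a1": [("c1", False), ("c2", False), ("c3", True)],
      "a2": [("c4", True)], "a6": [("c5", False)]}

def view_freshness(v):
    flags = [f for vals in v.values() for (_, f) in vals]
    return Fraction(sum(flags), len(flags))

def join_freshness(left, right):
    # A join tuple is fresh only if both contributing tuples are fresh.
    flags = [fl and fr
             for k in left.keys() & right.keys()
             for (_, fl) in left[k] for (_, fr) in right[k]]
    return Fraction(sum(flags), len(flags))

print(view_freshness(v1))      # 3/5  (60%)
print(view_freshness(v2))      # 2/5  (40%)
print(join_freshness(v1, v2))  # 1/2  (50%)
```

Marking every tuple of `v2` fresh (the "maintain V2" plan) drives the join freshness to 100%, while doing the same for `v1` leaves it at 50%, as in the example.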
Research questions
•What is the least costly maintenance plan that fulfills the response quality requirements?
•What is the quality of the response without maintenance?
•What is the quality of the response of each maintenance plan?
Slide 8
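The first research question can be answered by brute force over all 2^n plans. The maintenance costs below are hypothetical, and the per-plan freshness values follow the two-view example above:

```python
from itertools import combinations

# Hypothetical maintenance costs per view; per-plan join freshness taken
# from the earlier two-view example -- illustrative numbers only.
views = ["V1", "V2"]
cost = {"V1": 4, "V2": 7}
freshness_of_plan = {frozenset(): 0.50,
                     frozenset({"V1"}): 0.50,
                     frozenset({"V2"}): 1.00,
                     frozenset({"V1", "V2"}): 1.00}

def cheapest_plan(threshold):
    """Least costly maintenance plan whose freshness meets the constraint."""
    best = None
    for r in range(len(views) + 1):
        for plan in combinations(views, r):
            p = frozenset(plan)
            if freshness_of_plan[p] >= threshold:
                c = sum(cost[v] for v in p)
                if best is None or c < best[0]:
                    best = (c, p)
    return best

print(cheapest_plan(0.9))  # (7, frozenset({'V2'})): maintaining V2 alone suffices
print(cheapest_plan(0.5))  # (0, frozenset()): no maintenance needed
```

Exhaustive search is only feasible for small n; that is exactly why the rest of the talk estimates per-plan quality from summaries instead of computing it exactly.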
Experiment
•We use the BSBM benchmark to create a dataset and a query set.
•We label triples with true/false to specify their freshness status.
•We summarize the cache to estimate the quality of a query response without actually executing the query on the cache.
•To summarize the cache, we extended cardinality estimation techniques to the freshness estimation problem.
Slide 9
Alice  Lives  Dublin     True
Bob    Lives  Berlin     False
Alice  Job    Teacher    True
Bob    Job    Developer  False
Cardinality Estimation
•Capture the data distribution by splitting the data into buckets, keeping only each bucket's cardinality in the summary.
Slide 10

Data (subject, predicate, object, freshness):
  Alice  Job    Teacher      T
  Alice  Lives  Dublin       T
  Alice  Job    PhD student  F
  Alice  Lives  Athlon       F
  Bob    Job    Manager      T
  Bob    Lives  Berlin       T
  Bob    Lives  Chicago      T
  Bob    Lives  Munich       F
  Bob    Lives  Belfast      F
  Bob    Lives  Limerick     F
  Bob    Job    CEO          F
  Bob    Job    Consultant   F

Per-(subject, predicate) buckets (cardinality, fresh count):
  Alice Job *    2 (1 fresh)     Bob Job *    3 (1 fresh)
  Alice Lives *  2 (1 fresh)     Bob Lives *  5 (2 fresh)
Per-predicate buckets:
  * Job *    5 (2 fresh)         * Lives *  7 (3 fresh)

Queries:
  Q1: ?a Job ?b
  Q2: (?a Job ?b) ^ (?a Lives ?c)

Estimates with the per-predicate summary (predicate multiplication):
        Cardinality (est / actual)   Freshness (est / actual)
  Q1    5 / 5                        2/5 / 2/5
  Q2    35 / 19                      6/35 / 3/19

Estimates with the per-(subject, predicate) summary:
        Cardinality (est / actual)   Freshness (est / actual)
  Q1    5 / 5                        2/5 / 2/5
  Q2    19 / 19                      3/19 / 3/19
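The estimates in the tables above can be reproduced with two toy estimators (a sketch; the bucket layout and function names are ours):

```python
from fractions import Fraction

# Per-(subject, predicate) buckets: (cardinality, fresh count), from the slide.
fine = {("Alice", "Job"): (2, 1), ("Bob", "Job"): (3, 1),
        ("Alice", "Lives"): (2, 1), ("Bob", "Lives"): (5, 2)}
# Coarse per-predicate buckets: * Job * -> 5 (2 fresh), * Lives * -> 7 (3 fresh).
coarse = {"Job": (5, 2), "Lives": (7, 3)}

def estimate_coarse(p1, p2):
    # Predicate multiplication: treat each pattern as a single bucket.
    (n1, f1), (n2, f2) = coarse[p1], coarse[p2]
    return n1 * n2, Fraction(f1 * f2, n1 * n2)

def estimate_fine(p1, p2):
    # Multiply per shared subject, then sum over subjects.
    subjects = ({s for (s, p) in fine if p == p1}
                & {s for (s, p) in fine if p == p2})
    card = fresh = 0
    for s in subjects:
        (n1, f1), (n2, f2) = fine[(s, p1)], fine[(s, p2)]
        card += n1 * n2
        fresh += f1 * f2
    return card, Fraction(fresh, card)

print(estimate_coarse("Job", "Lives"))  # (35, Fraction(6, 35))
print(estimate_fine("Job", "Lives"))    # (19, Fraction(3, 19))
```

The finer buckets give the exact answer here because subject is the only join dimension; the next two slides show where even that breaks down.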
Cardinality Estimation Approaches
•System R assumptions for cardinality estimation:
  • Data is uniformly distributed per attribute.
  • Predicates are independent (whether in the same table or in different tables).
  • Predicate-multiplication approaches make both assumptions.
•Histograms capture the dependencies among predicates for more accurate estimation.
Slide 11
Slide 11
Measure accuracy of the estimation approach
Slide 12
We measure the difference between the actual and estimated freshness of the queries in a query set, where n is the number of queries.
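The error formula on the original slide is not legible in this export; one common choice consistent with the description, mean absolute error over the n queries, would be:

```python
def mean_abs_error(actual, estimated):
    """Average |actual - estimated| freshness over n queries (an assumed
    metric -- the slide's exact formula is not recoverable from this export)."""
    assert len(actual) == len(estimated)
    n = len(actual)
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / n

# e.g. the two queries from the previous slide: Q1 is estimated exactly,
# Q2 is estimated as 6/35 when the actual freshness is 3/19.
print(mean_abs_error([2/5, 3/19], [2/5, 6/35]))
```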
Estimation Results
Estimation Error 1
Slide 14

Summary (per-predicate freshness):
  ?s Job ?o       50%
  ?s Lives-in ?o  75%

Data:
  a  Job       teacher    T        a  Lives in  Dublin    T
  a  Job       professor  F        b  Lives in  Galway    F
  a  Job       PhD        F        b  Lives in  Cork      T
  b  Job       developer  T        b  Lives in  Limerick  T

Query: <?s, Job, ?o1> join <?s, Lives in, ?o2>
  a  teacher    Dublin    T
  a  professor  Dublin    F
  a  PhD        Dublin    F
  b  developer  Galway    F
  b  developer  Cork      T
  b  developer  Limerick  T

Estimated freshness from the summary: 50% x 75% = 37.5%; actual: 3/6 = 50%.

Reason: dependencies between predicates.
Solution:
  • A more granular index on the join dimension (subject) and the bounded dimension (predicate).
  • Histogram and table-level synopses can capture these dependencies and reduce the error accordingly.
Experiment: we did not observe this error in our experiment because the dataset contained no such dependencies.
Estimation Error 2
20 October 2014 Slide 15

Summary (per-predicate freshness):
  ?s Job ?o1       50%
  ?s Lives-in ?o2  75%

Data:
  a  Job       teacher    T        a  Lives in  Dublin    T
  a  Job       professor  F        b  Lives in  Galway    F
  a  Job       PhD        F        b  Lives in  Cork      T
  b  Job       developer  T        b  Lives in  Limerick  T

Query: <?s, Job, Developer> join <?s, Lives in, ?o2>
  b  Developer  Galway    F
  b  Developer  Cork      T
  b  Developer  Limerick  T

Estimated freshness from the summary: 50% x 75% = 37.5%; actual: 2/3 = 66%.

Reason: bounded object.
Solution: a more granular index on the join dimension (subject) and the bounded dimensions (predicate and object) => we would need to index the whole dataset, which is not efficient.
Experiment: we did not observe any improvement on this error when using histograms.
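The gap between the 37.5% estimate and the 66% actual freshness can be reproduced directly (a sketch; the data encoding is ours). The per-predicate summary cannot see that the one `Developer` triple happens to be fresh:

```python
from fractions import Fraction

# The data from the slide, as (subject, predicate, object, is_fresh).
data = [("a", "Job", "teacher", True), ("a", "Job", "professor", False),
        ("a", "Job", "PhD", False), ("b", "Job", "developer", True),
        ("a", "Lives in", "Dublin", True), ("b", "Lives in", "Galway", False),
        ("b", "Lives in", "Cork", True), ("b", "Lives in", "Limerick", True)]

def actual_join_freshness(obj1):
    # <?s, Job, obj1> join <?s, Lives in, ?o2>
    left = [(s, f) for (s, p, o, f) in data if p == "Job" and o == obj1]
    right = [(s, f) for (s, p, o, f) in data if p == "Lives in"]
    flags = [fl and fr for (sl, fl) in left for (sr, fr) in right if sl == sr]
    return Fraction(sum(flags), len(flags))

# Per-predicate summary: Job is 50% fresh, Lives-in is 75% fresh.
estimated = Fraction(1, 2) * Fraction(3, 4)           # 3/8 = 37.5%
print(estimated, actual_join_freshness("developer"))  # 3/8 vs 2/3
```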
Concern 1 on problem definition

As the underlying data changes over time, the views and the join result all lose freshness, but at different rates (example from the slide, with a Job view, a Lives-in view, and their join):

  Step   Job view   Lives-in view   Join result
  t0     100%       100%            100%
  t1     66%        75%             50%
  t2     33%        50%             16%
  t3     0%         25%             0%

The freshness of the join result cannot be read off the freshness of the input views: it depends on which tuples are stale, so the same view-level freshness can correspond to very different query-level freshness.
Concern 2 on the suggested solution
•We need to build one summary per maintenance plan, because the summary of one maintenance plan cannot be used to estimate the freshness of a query executed on another maintenance plan.
•This is very inefficient, given the space requirements and the cost of maintaining these summaries.
Slide 17
Conclusion
•We defined quality constraints based on freshness and completeness.
•We summarized a snapshot of a dataset to estimate the freshness of various queries, using indexing and histograms for our freshness estimation problem.
•We need to build an individual summary for each maintenance plan, since a summary built for one maintenance plan cannot be used to estimate the quality of a query executed on another maintenance plan.
•Our experiment was not affected by the estimation error caused by dependencies, because the dataset contained no such dependencies. The next step is to design a more realistic dataset and again compare the results of histograms and predicate multiplication.
•Summarization techniques are designed for very static environments: any change to the underlying data requires rebuilding the summary from scratch. So does it really make sense to extend cardinality estimation to freshness estimation?
Slide 18
Problem Definition
•Problem E: Optimizing maintenance to satisfy quality constraints within the lowest response time for each query.
•Problem F: Optimizing maintenance to satisfy time constraints with the highest response quality for each query.
Slide 19
Problem description without join
Slide 20
•The user queries the replica with time constraints.
•The replica should maintain only the subset of results that is most likely to be expired.
[Figure: scenario — stream data flows into a window; background data is materialized in a replica.]
Slide 23
Use Case: Twitter
•Stream data: number of mentions in the last Twitter window.
•Background data: user follower count (the replica).
•"Rising stars" query: find users who have been mentioned more than 100 times in the last 10 minutes and have more than 1000 followers, with a constraint on the execution time.
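The rising-stars query can be sketched as a function over the current window and the (possibly stale) replica. The data below is a toy illustration, with thresholds lowered so the example produces output:

```python
from collections import Counter

def rising_stars(window_mentions, follower_replica,
                 min_mentions=100, min_followers=1000):
    """Users mentioned > min_mentions in the window with > min_followers.

    follower_replica holds possibly stale counts; mention counts come
    from the live window.
    """
    counts = Counter(window_mentions)  # mentions in the last window
    return [u for u, n in counts.items()
            if n > min_mentions and follower_replica.get(u, 0) > min_followers]

window = ["@alice"] * 3 + ["@bob"]                 # toy window contents
replica = {"@alice": 5000, "@bob": 200}            # possibly stale replica
print(rising_stars(window, replica, min_mentions=2, min_followers=1000))
# ['@alice']
```

The time constraint shows up as a refresh budget on `follower_replica`: the operator can only re-fetch a few follower counts per window, which is what the join policies below decide.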
Continuous join operator with one replica
•We implemented a set of continuous join operators:
  • DWJoin: uses the static replica and never changes it (the quality of the response degrades over time).
  • Baseline join: uses an LRU policy to choose which of the matched entries to update (but the least recently updated entry does not necessarily require updating).
  • Oracle join: fetches data directly from the source.
  • Smart join: computes change-rate statistics and fetches the entries that are most likely to be expired.
  • Mixed baseline-smart joins (possible extensions).
Slide 23
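The smart policy can be sketched as a scoring function (an illustration, not the thesis implementation): score each replica entry by its learned change rate times the time since its last refresh, and refresh the top entries within the budget. The change rates here are hypothetical statistics that a real operator would learn online:

```python
import time

def pick_entries_to_refresh(last_refreshed, change_rate, budget, now=None):
    """Return up to `budget` keys ranked by expected staleness."""
    now = time.time() if now is None else now
    # Expected number of missed changes since the last refresh.
    score = {k: (now - t) * change_rate.get(k, 0.0)
             for k, t in last_refreshed.items()}
    return sorted(score, key=score.get, reverse=True)[:budget]

last = {"@alice": 100.0, "@bob": 100.0, "@carol": 140.0}  # refresh timestamps
rate = {"@alice": 0.5, "@bob": 0.1, "@carol": 2.0}        # changes per second
print(pick_entries_to_refresh(last, rate, budget=2, now=150.0))
# ['@alice', '@carol']
```

The baseline (LRU) policy would pick `@alice` and `@bob` here, wasting budget on an entry that rarely changes; the score ranks the fast-changing `@carol` ahead of it despite its recent refresh.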
Performance of join operators
Slide 24
Possible extensions
Slide 25
The problem becomes more complicated when the query is a join between replicas: updating which combination of entries yields the highest increase in the quality of the join result?
[Figure: two replicas joined]
Future works
•Use a better model for learning the change rate in the smart policy.
•We believe the smart policy will perform better when the change rate is more predictable.
•Investigate the setting where there are joins on the background knowledge side, to learn which combination of stale entries contributes most to result correctness when updated.
Slide 26
Thanks a lot for your attention!
Slide 27