University of Versailles, September 28th 1999 Hubert Naacke

Mediator Cost Models for Heterogeneous Data Sources

Hubert Naacke

INRIA Rocquencourt, 78153 Le Chesnay


Outline

Context: DISCO, a mediation system

Problem and objective: cost evaluation of query plans, using diverse cost info

Proposition: extensible cost model, declarative language

Detailed solution: cost formulas, hierarchical integration, cost evaluation

Validation: experimentation on Web data sources

Related solutions: comparison with calibration, historical cost, query sampling

Conclusion

Context: DISCO, a mediation system

Mediation Systems

Goals: intelligent integration of information; exploit existing sources, reuse data and power.

Systems: IRO-DB (Univ. Versailles), Hermes (Univ. Maryland), Garlic (IBM), Tsimmis (Stanford), Cords (Univ. Waterloo), DISCO (Bull + INRIA): Distributed Information Search Component.

Mediator-Wrapper Architecture

[Figure: two phases. Registration phase: Administrator, Mediator, Wrapper 1 ... Wrapper n, Data Sources. Query processing phase: User Application, Interface, Query, Mediator, Wrapper 1 ... Wrapper n, Data Sources, Result.]

DISCO Architecture

[Figure: two phases. Registration: Administrator, Catalog, Mediator, Wrapper 1 ... Wrapper n, Data Sources. Query processing: Application, Query, Query Decomposition, Optimization, Execution Engine, Query Recomposition, Tuples, Mediator, Wrapper 1 ... Wrapper n, Data Sources. Other labels in the diagram: Relational Algebra, Schema, Cost model, Capabilities, Catalog.]

Problem: cost evaluation of a query plan in the mediator

Cost-Based Optimization

Plan generation: 1 query → n plans (sub-plans).

Logical: operator reordering, permutation, distribution.

Physical: 1 operator → n algorithms.

Cost estimation of plans (sub-plans).

Compare plans → 1 candidate for execution.

Objective: minimize response time, resources, memory.

Diversity of cost information

Source autonomy: no cost info (black box); statistics only; statistics + cost formulas.

Instability of the execution environment: network communication (e.g. daily contention); workload (e.g. low load at night).

Objective: take into account all available cost info

Proposition: extensible cost model, hierarchical cost formulas

Proposition

Declarative language: describes the wrapper cost model, based on rules.

Hierarchical classification: integrates generic and specialized cost info.

Cost evaluation algorithm.

Expressiveness: sufficient for heterogeneous sources.

Extensible framework.

Assumptions

More cost info → better plan.

Cost info comes from the mediator (generic cost model) and from the wrapper (specialized cost model).

The cost model is used by the mediator: it is the input for query plan evaluation; there is no cost evaluation in the wrapper.

Requirements

Cost model specified by the wrapper implementor: statistics, formulas.

The description may be incomplete: missing statistics, missing formulas.

Integration, wrapper → mediator: transfer at registration time; merge wrapper cost models into the mediator.

Cost Communication Language

Exporting statistics. Collection statistics: total size, cardinality, tuple size. Attribute statistics: index, min, max, distinct values.

Exporting formulas. Mathematical formulas: selectivity, statistics of intermediate collections. Cost model = set of cost rules.

Interface for transfer: extension of the wrapper interface; textual form parsed by the mediator.

Cost Rule Definition

Head: a plan (or sub-plan), e.g. scan(Publication).

Body: one formula per cost vector component; cost vector = [Total time, Total size, Cardinality]; the formulas depend on statistics.

Genericity: the head may contain variables and so unify with a set of plans, e.g. scan(Collection).
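The head/body structure of a cost rule can be illustrated with a small Python sketch. It is not DISCO's rule language: the tuple encoding of plans, the `VARIABLES` set, the `unify` helper, and all statistics and constants are assumptions made up for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# A plan node is encoded as a tuple: (operator, arg1, arg2, ...).
# Hypothetical convention: names listed in VARIABLES are rule variables,
# every other name in a rule head is a constant.
Plan = Tuple[str, ...]
VARIABLES = {"Collection", "Predicate"}


def unify(head: Plan, plan: Plan) -> Optional[Dict[str, str]]:
    """Return the variable bindings if `head` matches `plan`, else None."""
    if len(head) != len(plan) or head[0] != plan[0]:
        return None
    bindings: Dict[str, str] = {}
    for pattern, actual in zip(head[1:], plan[1:]):
        if pattern in VARIABLES:
            # A variable must be bound to the same value everywhere.
            if bindings.setdefault(pattern, actual) != actual:
                return None
        elif pattern != actual:
            return None
    return bindings


@dataclass
class CostRule:
    head: Plan
    # One formula per cost vector component; each formula takes the
    # statistics of the bound collection and returns a number.
    body: Dict[str, Callable[[Dict[str, float]], float]]


# Made-up statistics, for illustration only.
STATS = {"Publication": {"card": 1000.0, "tuple_size": 200.0}}

scan_rule = CostRule(
    head=("scan", "Collection"),
    body={
        "TotalTime": lambda s: 0.5 + 0.001 * s["card"],
        "TotalSize": lambda s: s["card"] * s["tuple_size"],
        "Cardinality": lambda s: s["card"],
    },
)

bindings = unify(scan_rule.head, ("scan", "Publication"))
stats = STATS[bindings["Collection"]]
cost_vector = {name: formula(stats) for name, formula in scan_rule.body.items()}
```

Here the generic head scan(Collection) unifies with the concrete plan scan(Publication), and the body then yields one value per cost vector component.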

Cost Integration

Hierarchical rule classification; criterion: based on rule head variables.

Any wrapper = generic scope: generic rule for fully autonomous sources.

Any collection = wrapper scope, e.g. source latency.

Any predicate = collection scope, e.g. access method specificity.

...

Cost rules hierarchy

[Figure: rule hierarchy, from generic to specific. Generic-scope rules (any wrapper): select(Collection, Predicate). Wrapper-scope rules (wrapper 1, wrapper 2): select(Collection, Predicate). Collection-scope rules: select(publication, Predicate), select(author, Pred). Predicate-scope rules: select(pub, year < S), select(pub, title = T). Query-specific rules. The rules carry formulas of the form TotalTime = ..., TotalSize = ..., Count = ...]
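Choosing a rule from this hierarchy can be sketched in Python as follows. This is only an illustration under assumptions: the `Rule` record, the `applies` test, and the rule contents are hypothetical, and the scope names simply mirror the levels listed in the slides.

```python
from typing import List, NamedTuple, Optional

# Scope levels named in the slides, from least to most specific.
SCOPES = ["generic", "wrapper", "collection", "predicate", "query"]


class Rule(NamedTuple):
    scope: str
    wrapper: Optional[str]      # None means "any wrapper"
    collection: Optional[str]   # None means "any collection"
    predicate: Optional[str]    # None means "any predicate"
    formula: str                # placeholder text, not a real formula


def applies(rule: Rule, wrapper: str, collection: str, predicate: str) -> bool:
    """A rule applies when every field it fixes matches the plan node."""
    return (rule.wrapper in (None, wrapper)
            and rule.collection in (None, collection)
            and rule.predicate in (None, predicate))


def most_specific(rules: List[Rule], wrapper: str, collection: str,
                  predicate: str) -> Rule:
    """Among the applicable rules, keep the one deepest in the hierarchy."""
    candidates = [r for r in rules if applies(r, wrapper, collection, predicate)]
    return max(candidates, key=lambda r: SCOPES.index(r.scope))


# Hypothetical rule base for select(...), mirroring the figure's levels.
RULES = [
    Rule("generic", None, None, None, "generic select formula"),
    Rule("wrapper", "wrapper1", None, None, "wrapper1 select formula"),
    Rule("collection", "wrapper1", "publication", None, "publication formula"),
    Rule("predicate", "wrapper1", "publication", "year < S", "year formula"),
]

chosen = most_specific(RULES, "wrapper1", "publication", "year < S")
fallback = most_specific(RULES, "wrapper2", "author", "name = N")
```

In this sketch a wrapper that exports nothing (wrapper2) falls back to the generic rule, while wrapper1 gets the predicate-scope rule. The generic rule always applies, so `candidates` is never empty here.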

Cost Evaluation Algorithm

Traversing the plan tree:

1. Top-down: attach cost formulas to each node, using the most specific formulas of the hierarchy; a node may get many formulas.

2. Bottom-up: cost computation; the cost of a node depends on the cost of its sub-nodes.

Sorting costs: many dimensions (response time, size); compare cost(plan1) < cost(plan2).
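The two traversals can be sketched in Python as below. This is a simplified illustration, not the DISCO algorithm itself: formulas are looked up by operator name only (the hierarchy lookup is left out), each node gets a single formula, and the constants in `FORMULAS` are made up.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Cost = Dict[str, float]  # e.g. {"TotalTime": ..., "Cardinality": ...}


@dataclass
class Node:
    op: str
    children: List["Node"] = field(default_factory=list)
    formula: Optional[Callable[[List[Cost]], Cost]] = None
    cost: Optional[Cost] = None


def attach(node: Node, formulas: Dict[str, Callable[[List[Cost]], Cost]]) -> None:
    """Pass 1, top-down: give every node the formula for its operator."""
    node.formula = formulas[node.op]
    for child in node.children:
        attach(child, formulas)


def evaluate(node: Node) -> Cost:
    """Pass 2, bottom-up: children are costed before their parent."""
    child_costs = [evaluate(child) for child in node.children]
    node.cost = node.formula(child_costs)
    return node.cost


# Hypothetical formulas with made-up constants, one per operator.
FORMULAS = {
    "scan": lambda kids: {"TotalTime": 2.0, "Cardinality": 1000.0},
    "select": lambda kids: {
        "TotalTime": kids[0]["TotalTime"] + 0.001 * kids[0]["Cardinality"],
        "Cardinality": 0.1 * kids[0]["Cardinality"],
    },
    "join": lambda kids: {
        "TotalTime": sum(k["TotalTime"] for k in kids) + 1.0,
        "Cardinality": min(k["Cardinality"] for k in kids),
    },
}

# join(select(scan), scan)
plan = Node("join", [Node("select", [Node("scan")]), Node("scan")])
attach(plan, FORMULAS)
total = evaluate(plan)
```

Two plans costed this way could then be compared on one dimension, for example `total["TotalTime"]`, which is one possible reading of cost(plan1) < cost(plan2).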

Example

"Get all persons younger than 25 who publish papers"

join(select(Person, age < 25), Pub, person.name = pub.author)

[Figure: the plan tree (scan of Person, select age < 25, scan of Pub) shown with the rule heads scan(X), select(X, P) and select(scan(Person), age).]

Validation: experimentation on Web sources

Validation

Objectives

Efficiency of the cost model, generic vs. specialized: limited efficiency of the generic cost model, maximal efficiency of the specialized cost model.

Power of the cost language: it can describe heterogeneous execution models.

Experimentation

Data: real sources on the Web, TPC-D data.

Queries: selection, projection, inter-site join.

Validation method

Evaluate the cost of the plans → P5: best cost plan.

Execute the plans → P3: best response time plan.

Compare the response times P3 / P5, according to the cost model (generic or specialized) and according to the query and data (varying selectivity, cardinality of collections).

[Figure: two charts over plans 1 to 5, one for cost and one for execution, with the labels "best cost", "best exec" and "exec P5".]
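The comparison step can be written as a short Python sketch. It assumes that the efficiency of a cost model is the response time of the truly fastest plan divided by the response time of the plan the model picks, which is one reading of "compare response time P3 / P5"; the function name and all numbers are made up.

```python
from typing import Dict


def relative_efficiency(estimated: Dict[str, float],
                        measured: Dict[str, float]) -> float:
    """Response time of the fastest plan divided by the response time of
    the plan with the lowest estimated cost (1.0 means optimal choice)."""
    chosen = min(estimated, key=estimated.get)   # best cost plan
    fastest = min(measured, key=measured.get)    # best response time plan
    return measured[fastest] / measured[chosen]


# Made-up numbers for five candidate plans, for illustration only.
estimated_cost = {"P1": 0.9, "P2": 0.7, "P3": 0.5, "P4": 0.8, "P5": 0.4}
measured_time = {"P1": 1.0, "P2": 0.8, "P3": 0.4, "P4": 0.9, "P5": 0.5}

efficiency = relative_efficiency(estimated_cost, measured_time)
```

With these numbers the cost model picks P5 while P3 is actually fastest, so the ratio is below 1.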

Experimental system: Sources

Bibliographic data sources on the Web: ACM (80 MB, cardinality 30000); DBLP (60 MB, cardinality 90000).

Data from the TPC-D benchmark: stored in Oracle, accessed via JDBC; size 1 GB (scaling factor = 1); collection cardinalities from 150K to 1.5M.

Wrappers for bibliographical sources

Wrapper for the ACM source. Schema: Publication(author, title, conf, year). Capability: select on author (optionally conf, year). Cost info: only statistics, no formula.

Wrapper for the DBLP source. Schema: Publication(author, title, conf, year). Capability: select on author; select on conf and year (=). Cost info:

select(Pub, pred(conf=x2, year=x3)) :- totalTime = S0 + card(Pub) * sel(pred) * S1

Experimental system: Mediator

Schema: Publication = ACM Pub or DBLP Pub (replication).

Generic cost model

Cost of select:

select(Pub, Pred) : totalTime = S0 + card(Pub) * S1

Cost of select on an indexed attribute:

select(Pub, Pred) : totalTime = S0 + card(Pub) * sel(pred) * S1
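To see how far the two select formulas can diverge, here is a small worked example in Python. Only card(Pub) = 90000 (the DBLP cardinality) comes from the slides; the values of S0, S1 and the selectivity are assumptions chosen for illustration.

```python
def select_time_generic(card: float, s0: float, s1: float) -> float:
    """totalTime = S0 + card(Pub) * S1"""
    return s0 + card * s1


def select_time_indexed(card: float, sel: float, s0: float, s1: float) -> float:
    """totalTime = S0 + card(Pub) * sel(pred) * S1"""
    return s0 + card * sel * s1


# card(Pub) is the DBLP cardinality quoted in the slides;
# S0, S1 and SEL are made-up values.
CARD_PUB = 90000
S0, S1 = 1.0, 0.001
SEL = 0.02

generic = select_time_generic(CARD_PUB, S0, S1)       # 1 + 90000 * 0.001
indexed = select_time_indexed(CARD_PUB, SEL, S0, S1)  # 1 + 90000 * 0.02 * 0.001
```

Under these assumed constants the first formula estimates about 91 time units and the second about 2.8, so the two estimates can rank competing plans very differently.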

Cost model for select operation

select * from Publication where author = x1 and conf = x2 and year < x3

[Figure: plans P1 and P2 for this query, where s = submit to wrapper. Operators shown: Select (author = x1, conf = x2, year < x3); select author = x1; select conf = x2, year = xi; union over xmin < xi < x3; materialized. Sources: ACM, DBLP.]

Optimizer choice? Generic cost model: P1. Specialized cost model: P1 or P2.

Performance results

Efficiency loss for low selectivity. Execution of P1 vs. P2.

[Figure, first chart: response time (s) of exec(P1) and exec(P2) against the selectivity of predicate year < x3 (0 to 1); the value 0.42 is marked. Second chart: relative efficiency (%) of the generic and specialized cost models against the selectivity of predicate year < x3 (0 to 1). selectivity(conference = x2) = 0.02.]

Performance results

Efficiency loss for high selectivity. Execution of P1 vs. P2.

[Figure, first chart: response time (s) of exec(P1) and exec(P2) against the selectivity of predicate author = x1 (0 to 0.03); the value 0.006 is marked. Second chart: relative efficiency (%) of the generic and specialized cost models against the selectivity of predicate author = x1 (0 to 0.03).]

Cost model for join operation

Inter-site join:

select * from ACM a, DBLP b where a.author = b.author and a.conf = SIGMOD

[Figure: plans P1 and P2, where s = submit to wrapper. One plan is a dep-join on a.author = b.author, with select conf = SIGMOD and select author = X; the other is a hash-join on a.author = b.author, with select conf = SIGMOD. Sources: ACM, DBLP.]

Performance results

Efficiency of specialized cost model. Response time and cost of joins.

[Figure, first chart: response time (s) against the cardinality ratio (0 to 20) for card(Order) = 1000, with series exec(P1), exec(P2), cost(P1), cost(P2). Second chart: relative efficiency (%) against the cardinality ratio (0 to 20) for card(Order) = 1000.]

Specific projection within TPC-D

select * from Order o, Customer c where o.custkey = c.custkey

Order(orderkey, custkey, comment...), Customer(custkey, name)

[Figure: plans P1 and P2, where s = submit to wrapper. Labels shown: dep-join on o.custkey = c.custkey; select custkey = X; select orderkey = X; proj orderkey, custkey; materialized; tmp.custkey = c.custkey; tmp.orderkey = c.orderkey; collections Order and Customer.]

Performance results

Generic cost model efficiency. Response time: plan 2 / plan 1.

[Figure, first chart: response time ratio P2/P1 (0 to 1.8) against selectivity (0 to 0.25), for Order, OrderMedium and OrderLarge. Second chart: relative efficiency (%) against selectivity (0 to 0.25), for the same three collections.]

Related work: comparison with other approaches

Known Solutions (1)

IRO-DB [Gardarin et al. in CoopIS'97]. Hypothesis: the generic cost model is unique. Calibration of parameters.

Hermes [Subrahmanian et al. in SIGMOD'96]. Hypothesis: black box, stability over time. Historical costs:

cost[initial delay, total time, cardinality] = table[query]
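The historical-cost idea, cost = table[query], can be sketched as a plain lookup table. This is an illustration only and not how Hermes is implemented; the `CostTable` class, the query strings and the recorded numbers are hypothetical.

```python
from typing import Dict, NamedTuple, Optional


class HistoricalCost(NamedTuple):
    initial_delay: float
    total_time: float
    cardinality: int


class CostTable:
    """Remembers the last observed cost of each query string."""

    def __init__(self) -> None:
        self._table: Dict[str, HistoricalCost] = {}

    def record(self, query: str, cost: HistoricalCost) -> None:
        self._table[query] = cost

    def lookup(self, query: str) -> Optional[HistoricalCost]:
        # Only an exact match is found; unseen queries return None.
        return self._table.get(query)


table = CostTable()
# Made-up observation, for illustration only.
table.record("select(pub, year < 1990)", HistoricalCost(0.2, 3.5, 120))

known = table.lookup("select(pub, year < 1990)")
unknown = table.lookup("select(pub, year < 1995)")
```

The second lookup returns nothing, which shows the limit of a pure table: a query that was never executed has no recorded cost.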

Known Solutions (2)

Cords [Zhu, Larson in Distributed & Parallel DB, 98]. Hypothesis: similar queries have similar costs. Query sampling + regression cost model.

Garlic [Roth, Ozcan, Haas in VLDB ’99]. Multimedia context, cost of image retrieval. cost[cold resp. time, hot resp. time, cardinality]. Override wrapper cost functions.
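The "query sampling + regression" idea listed for Cords can be illustrated with an ordinary least-squares fit. This sketch is not the Cords method itself: the single explanatory variable (result cardinality), the `fit_line` helper and the sample values are assumptions made for illustration.

```python
from typing import List, Tuple


def fit_line(samples: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Ordinary least squares for time = a + b * cardinality."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Made-up sample queries: (result cardinality, observed response time in s).
SAMPLES = [(100.0, 1.2), (200.0, 1.4), (400.0, 1.8), (800.0, 2.6)]

intercept, slope = fit_line(SAMPLES)
# Estimated response time of an unseen query returning 600 tuples.
predicted = intercept + slope * 600.0
```

The fitted line plays the role of a cost formula derived from observations, so a source that exports no formula can still be given an estimate.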

Comparison with related work

Generalizes the calibration and sampling approaches.

Integrates Hermes historical costs: bottom layer of the cost rules hierarchy; useful for query caching.

Confirmation from Garlic: cost models do matter; need for wrapper input; extensible (cost of method call); trade-off between expressiveness and abstraction.

Conclusion

Disco’s cost model is efficient: easy specialization of the cost model for an access method; yields 100% efficiency for typical web queries; improves logical and physical optimization.

Extensible: may specify constraints imposed by a source (periodic variation, resource limitation).

Flexible: fine granularity of cost formulas.

Directions for Future Work

Update the cost model at runtime: polling / notification.

Cost rules for query caching.

Optimize the cost evaluation algorithm; stop condition: response time < Tmax or result size < Smax.

Extensions: EC (cost = price, new parameters); cost of path expressions, full-text search.

References

Disco’s cost model definition: Hubert Naacke, Georges Gardarin, and Anthony Tomasic. Leveraging Mediator Cost Models with Heterogeneous Data Sources (extended version). ICDE 1998. Early version in BDA 1997.

Disco’s cost model validation: Hubert Naacke, Anthony Tomasic, and Patrick Valduriez. Validating Mediator Cost Models with Disco. To appear, NISJ 1999.

Disco implementation: Anthony Tomasic, Rémy Amouroux, Philippe Bonnet, Olga Kapitskaia, Hubert Naacke, and Louiqa Raschid. The Distributed Information Search Component (Disco) and the World Wide Web. In ACM SIGMOD 1997, Research Prototype Demonstration.

MIRO-WEB Esprit Project, Spanish Hospital Application.


DBLP Wrapper: Materialization

select * from Publication where conf = C and year = Y

Materialized view V(conf, year):

Conf    Year  Author  Title
VLDB    1985  Gray    t
VLDB    1985  Du      u
‘‘      …     …       …
‘‘      1999  Haas    v
Sigmod  1985  Kim     w
‘‘      …     …       …
‘‘      1999  Pu      x

Cost model for projection

select * from Publication pub, Person pers where pub.author = pers.name

Person(name, picture), Publication(author, ...)

[Figure: plan 1 and plan 2, where s = submit to wrapper. Labels shown: select author = X; select name = X; project name; materialized; collections Publication and Person.]