University of Versailles, September 28th 1999 Hubert Naacke
Mediator Cost Models for Heterogeneous Data Sources
Hubert Naacke
INRIA Rocquencourt, 78153 Le Chesnay
Outline
Context: DISCO, a mediation system
Problem and objective: cost evaluation of query plans, using diverse cost info
Proposition: extensible cost model, declarative language
Detailed solution: hierarchical integration of cost formulas, cost evaluation
Validation: experimentation on Web data sources
Related solutions: comparison with calibration, historical cost, query sampling
Conclusion

Context: DISCO, a mediation system
Mediation Systems
goals
  intelligent integration of information
  exploit existing sources, reuse data and power
systems
  IRO-DB (Univ. Versailles)
  Hermes (Univ. Maryland)
  Garlic (IBM)
  Tsimmis (Stanford)
  Cords (Univ. Waterloo)
  DISCO (Bull + INRIA): Distributed Information Search Component
Mediator-Wrapper Architecture
[Figure: the mediator sits between the user application and wrappers 1 to n, each wrapper in front of its data sources. Registration phase: involves the administrator, the mediator, and the wrappers. Query processing phase: the user application sends a query through the interface and receives a result.]
DISCO Architecture
[Figure: DISCO architecture, shown for the registration phase and the query processing phase. Labels: Application, Query, Query Decomposition, Optimization, Query Recomposition, Execution Engine, Catalog (Schema, Cost model, Capabilities), Relational Algebra, Tuples, Mediator, Wrapper 1 to Wrapper n, Data Sources, Administrator.]
Problem: cost evaluation of a query plan
Cost Based Optimization
plan generation: 1 query -> n plans (sub-plans)
  logical: operator reordering, permutation, distribution
  physical: 1 operator -> n algorithms
cost estimation of plans (sub-plans)
compare plans -> 1 candidate for execution
objective: minimize response time, resources, memory
Diversity of cost information
source autonomy
  no cost info: black box
  statistics only
  statistics + cost formulas
instability of execution environment
  network communication (e.g., daily contention)
  workload (e.g., low load at night)
Objective: take into account all available cost info
Proposition: extensible cost model
Proposition
declarative language
  describes the wrapper cost model
  based on rules
hierarchical classification
  integrates generic and specialized cost info
cost evaluation algorithm
expressiveness: sufficient for heterogeneous sources
extensible framework
Assumptions
more cost info -> better plan
cost info comes from
  mediator: generic cost model
  wrapper: specialized cost model
cost model used by mediator
  input for query plan evaluation
  no cost evaluation in wrapper
Requirements
cost model specified by wrapper implementor
  statistics
  formulas
description may be incomplete
  missing statistics
  missing formulas
integration: wrapper -> mediator
  transfer at registration time
  merge wrapper cost models into mediator
Cost Communication Language
exporting statistics
  collection stats: total size, cardinality, tuple size
  attribute stats: index, min, max, distinct values
exporting formulas
  math. formulas: selectivity, statistics of intermediate collections
  cost model = set of cost rules
interface for transfer
  extension of wrapper interface
  textual form parsed by mediator
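As an illustration of how exported attribute statistics can feed selectivity formulas, here is a minimal Python sketch. It assumes uniformly distributed values, which is a textbook estimate and not necessarily the formula DISCO uses. The class name, field names, and the sample values for the year attribute are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AttributeStats:
    """Per-attribute statistics a wrapper might export (field names are assumptions)."""
    indexed: bool
    min_value: float
    max_value: float
    distinct_values: int


def sel_equals(stats: AttributeStats) -> float:
    """Selectivity of `attr = constant`, assuming uniformly distributed values."""
    return 1.0 / stats.distinct_values


def sel_less_than(stats: AttributeStats, bound: float) -> float:
    """Selectivity of `attr < bound`, assuming a uniform spread over [min, max]."""
    span = stats.max_value - stats.min_value
    if span <= 0:
        return 1.0 if bound > stats.min_value else 0.0
    fraction = (bound - stats.min_value) / span
    return min(1.0, max(0.0, fraction))  # clamp to [0, 1]


# Hypothetical statistics for a `year` attribute.
year = AttributeStats(indexed=True, min_value=1985, max_value=1999, distinct_values=15)
half = sel_less_than(year, 1992)  # bound sits in the middle of [1985, 1999]
```

A wrapper that exports no statistics at all would leave the mediator with default values only, which is the "black box" case listed under diversity of cost information.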
Cost Rule Definition
head is a plan (or sub-plan), e.g., scan(Publication)
body: 1 formula per cost vector component
  cost vector = [Total time, Total size, Cardinality]
  depends on statistics
genericity: head contains variables
  unifies with a set of plans, e.g., scan(Collection)
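The head/body structure of a cost rule can be sketched in Python as follows. This is an illustrative model, not DISCO's rule language: the `Var` and `CostRule` classes, the tuple encoding of plans, and all statistics and coefficients are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Var:
    """A variable in a rule head, e.g. Collection in scan(Collection)."""
    name: str


def unify(head, plan, bindings=None):
    """Match a rule head against a plan; return variable bindings, or None."""
    bindings = dict(bindings or {})
    if isinstance(head, Var):
        if head.name in bindings and bindings[head.name] != plan:
            return None
        bindings[head.name] = plan
        return bindings
    if isinstance(head, tuple) and isinstance(plan, tuple):
        if len(head) != len(plan):
            return None
        for h, p in zip(head, plan):
            bindings = unify(h, p, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if head == plan else None


@dataclass
class CostRule:
    head: object                                      # plan pattern
    body: Dict[str, Callable[[dict, dict], float]]    # one formula per cost component


# Hypothetical statistics and coefficients.
STATS = {"Publication": {"card": 90000, "tuple_size": 100}}

scan_rule = CostRule(
    head=("scan", Var("Collection")),
    body={
        "TotalTime": lambda b, s: 0.5 + s[b["Collection"]]["card"] * 0.001,
        "TotalSize": lambda b, s: s[b["Collection"]]["card"] * s[b["Collection"]]["tuple_size"],
        "Cardinality": lambda b, s: s[b["Collection"]]["card"],
    },
)


def apply_rule(rule, plan, stats):
    """Evaluate every formula of the rule body if the head unifies with the plan."""
    bindings = unify(rule.head, plan)
    if bindings is None:
        return None
    return {name: formula(bindings, stats) for name, formula in rule.body.items()}


cost = apply_rule(scan_rule, ("scan", "Publication"), STATS)
```

Because the head contains a variable, the same rule applies to any scan; a head such as `("scan", "Publication")` would match that one collection only.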
Cost Integration
hierarchical rule classification
  criteria: based on rule head variables
any wrapper = generic scope
  generic rule for fully autonomous sources
any collection = wrapper scope, e.g., source latency
any predicate = collection scope, e.g., access method specificity
...
Cost rules hierarchy
[Figure: cost rules hierarchy, from most generic to most specific; each rule carries formulas such as TotalTime = ..., TotalSize = ..., Count = ...
  Generic-scope rules (any wrapper): select(Collection, Predicate)
  Wrapper-scope rules (wrapper 1, wrapper 2): select(Collection, Predicate)
  Collection-scope rules: select(publication, Predicate), select(author, Pred)
  Predicate-scope rules: select(pub, year < S), select(pub, title = T)
  Query-specific rules]
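A minimal Python sketch of picking the most specific rule in such a hierarchy is shown below. It is an assumption-laden simplification: rule heads are reduced to three optional fields, `None` plays the role of a variable, and specificity is simply the number of bound fields. The wrapper name `w1` and the predicate strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SelectRule:
    """Head of a rule for select(...); None plays the role of a variable."""
    wrapper: Optional[str] = None
    collection: Optional[str] = None
    predicate: Optional[str] = None
    label: str = ""

    def matches(self, wrapper, collection, predicate):
        return (self.wrapper in (None, wrapper)
                and self.collection in (None, collection)
                and self.predicate in (None, predicate))

    def specificity(self):
        # Number of bound fields: 0 for the generic rule, 3 for a predicate-scope rule.
        return sum(x is not None for x in (self.wrapper, self.collection, self.predicate))


RULES = [
    SelectRule(label="generic scope"),
    SelectRule(wrapper="w1", label="wrapper scope"),
    SelectRule(wrapper="w1", collection="publication", label="collection scope"),
    SelectRule(wrapper="w1", collection="publication", predicate="year <",
               label="predicate scope"),
]


def most_specific(rules, wrapper, collection, predicate):
    """Return the matching rule with the most bound head fields."""
    candidates = [r for r in rules if r.matches(wrapper, collection, predicate)]
    return max(candidates, key=SelectRule.specificity)


chosen = most_specific(RULES, "w1", "publication", "title =")
```

The generic rule always matches, so a wrapper that exports nothing still gets a cost estimate; each more specific rule registered by a wrapper overrides it for the plans it covers.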
Cost Evaluation Algorithm
traversing the plan tree
  1. top-down = attach cost formulas to nodes
     most specific formulas of the hierarchy
     may attach many formulas
  2. bottom-up = cost computation
     depends on sub-node costs
sorting costs
  many dimensions (response time, size)
  compare: cost(plan1) < cost(plan2)
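The two traversals can be sketched in Python as follows. This is a simplified illustration, not DISCO's implementation: it attaches a single formula per node, the rule table is keyed by (operator, argument) with `None` standing for "any", and every cardinality and coefficient is a hypothetical value.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

CARD = {"Person": 1000, "Pub": 2000}  # hypothetical collection cardinalities


@dataclass
class Node:
    op: str                                   # "scan", "select" or "join"
    arg: str = ""                             # collection name or predicate text
    children: List["Node"] = field(default_factory=list)
    formula: Optional[Callable] = None


# Hypothetical rules: (operator, argument or None for "any") -> formula(node, child_costs)
RULES = {
    ("scan", None): lambda n, cs: {"time": CARD[n.arg] / 100, "card": CARD[n.arg]},
    ("select", None): lambda n, cs: {"time": cs[0]["time"], "card": cs[0]["card"] / 2},
    ("select", "age < 25"): lambda n, cs: {"time": cs[0]["time"], "card": cs[0]["card"] / 10},
    ("join", None): lambda n, cs: {
        "time": cs[0]["time"] + cs[1]["time"] + cs[0]["card"] * cs[1]["card"] / 10000,
        "card": cs[0]["card"] * cs[1]["card"] / 1000,
    },
}


def attach(node):
    """Top-down pass: attach the most specific formula available to each node."""
    node.formula = RULES.get((node.op, node.arg), RULES[(node.op, None)])
    for child in node.children:
        attach(child)


def evaluate(node):
    """Bottom-up pass: a node's cost depends on the costs of its sub-nodes."""
    child_costs = [evaluate(child) for child in node.children]
    return node.formula(node, child_costs)


def cost_key(cost):
    """One possible ordering over several dimensions: time first, then cardinality."""
    return (cost["time"], cost["card"])


plan = Node("join", "person.name = pub.author", [
    Node("select", "age < 25", [Node("scan", "Person")]),
    Node("scan", "Pub"),
])
attach(plan)
plan_cost = evaluate(plan)
```

With these numbers the select node picks the specific `age < 25` formula rather than the generic select formula, and candidate plans can be ranked with `min(costs, key=cost_key)`.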
Example
"get all persons younger than 25 who publish papers"
join(select(Person, age < 25), Pub, person.name = pub.author)
[Figure: plan tree for the query, a join over select(age < 25) on scan(Person) and scan(Pub), annotated with the rule heads matched at each node: select(X, P), scan(X), and the more specific select(scan(Person), age).]
Validation: experimentation on Web sources
Validation
objectives
  efficiency of cost model: generic vs. specialized
    limited efficiency of generic cost model
    maximal efficiency of specialized cost model
  cost language power: can describe heterogeneous execution models
experimentation
  data: real sources on the Web, TPC-D data
  queries: selection, projection, inter-site join
Validation method
evaluate the cost of plans -> P5: best cost plan
execute the plans -> P3: best response time plan
compare response time P3 / P5
  according to the cost model: generic or specialized
  according to the query and data: varying selectivity, cardinality of collections
[Figure: two bar charts over plans 1 to 5, one of estimated cost (marking the best cost) and one of execution time (marking the best exec and the exec of P5).]
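The comparison can be sketched in Python as below, assuming that relative efficiency is the response time of the fastest plan divided by the response time of the plan the cost model ranks best. That definition is inferred from the "P3 / P5" comparison on this slide, and all costs and timings in the example are hypothetical.

```python
def relative_efficiency(exec_times, estimated_costs):
    """Ratio exec(best response time plan) / exec(best cost plan); 1.0 is optimal."""
    chosen = min(estimated_costs, key=estimated_costs.get)  # plan picked by the cost model
    best = min(exec_times, key=exec_times.get)              # plan that actually ran fastest
    return exec_times[best] / exec_times[chosen]


# Hypothetical measurements for five candidate plans.
costs = {"P1": 0.6, "P2": 1.0, "P3": 0.4, "P4": 0.8, "P5": 0.3}
times = {"P1": 4.0, "P2": 8.0, "P3": 2.0, "P4": 6.0, "P5": 2.5}

efficiency = relative_efficiency(times, costs)  # cost model picks P5, P3 runs fastest
```

When the cost model picks the plan that is also fastest in practice, the ratio is 1, which corresponds to 100% efficiency.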
Experimental system: Sources
bibliographic data sources on the Web
  ACM: 80 MB, card: 30000
  DBLP: 60 MB, card: 90000
data from TPC-D benchmark
  stored in Oracle, access via JDBC
  size: 1 GB (scaling factor = 1)
  collections card: 150K - 1.5M
Wrappers for bibliographical sources
Wrapper for ACM source
  schema: Publication(author, title, conf, year)
  capability: select on author (optionally conf, year)
  cost info: only statistics, no formula
Wrapper for DBLP source
  schema: Publication(author, title, conf, year)
  capability: select on author, select on conf and year (=)
  cost info:
    select(Pub, pred(conf=x2, year=x3)) :-
      totalTime = S0 + card(Pub) * sel(pred) * S1
Experimental system: Mediator
schema
  Publication: ACM Pub or DBLP Pub (replication)
generic cost model
  cost of select:
    select(Pub, Pred) : totalTime = S0 + card(Pub) * S1
  cost of select on indexed attribute:
    select(Pub, Pred) : totalTime = S0 + card(Pub) * sel(pred) * S1
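To see how far apart the two select formulas can be, here is a small worked example in Python. The cardinality is the DBLP figure quoted earlier; the values of S0, S1, and the selectivity are hypothetical, since the slides do not give them.

```python
def total_time_select(card, s0, s1):
    """Generic select: totalTime = S0 + card(Pub) * S1 (every tuple is read)."""
    return s0 + card * s1


def total_time_indexed_select(card, selectivity, s0, s1):
    """Select on an indexed attribute: totalTime = S0 + card(Pub) * sel(pred) * S1."""
    return s0 + card * selectivity * s1


CARD_PUB = 90000   # DBLP cardinality from the experimental setup
S0 = 0.5           # hypothetical start-up time, in seconds
S1 = 0.001         # hypothetical per-tuple time, in seconds
SEL = 0.02         # hypothetical predicate selectivity

full_scan_time = total_time_select(CARD_PUB, S0, S1)
indexed_time = total_time_indexed_select(CARD_PUB, SEL, S0, S1)
```

Under these assumed values the generic formula estimates about 90.5 s and the indexed formula about 2.3 s, so the formula the optimizer uses can change which plan looks cheapest.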
Cost model for select operation
select * from Publication where author = x1 and conf = x2 and year < x3
[Figure: candidate plans P1 and P2 over the ACM and DBLP sources (s = submit to wrapper). Node labels: select author = x1, conf = x2, year < x3; select author = x1; select conf = x2, year = xi (materialized); union over xmin < xi < x3.]
Optimizer choice?
  generic cost model: P1
  specialized cost model: P1 or P2
Performance results
Efficiency loss for low selectivity
Execution of P1 vs. P2
[Figure, first panel: response time (s) of exec(P1) and exec(P2) vs. selectivity of predicate year < x3 (0 to 1), with the value 0.42 marked; selectivity(conference = x2) = 0.02.]
[Figure, second panel: relative efficiency (%) of the generic and specialized cost models vs. selectivity of predicate year < x3 (0 to 1).]
Performance results
Efficiency loss for high selectivity
Execution of P1 vs. P2
[Figure, first panel: response time (s) of exec(P1) and exec(P2) vs. selectivity of predicate author = x1 (0 to 0.03), with the value 0.006 marked.]
[Figure, second panel: relative efficiency (%) of the generic and specialized cost models vs. selectivity of predicate author = x1 (0 to 0.03).]
Cost model for join operation
Inter-site join
select * from ACM a, DBLP b where a.author = b.author and a.conf = SIGMOD
[Figure: plans P1 and P2 over the ACM and DBLP sources (s = submit to wrapper). One plan is a dep-join on a.author = b.author between select conf = SIGMOD and select author = X; the other is a hash-join on a.author = b.author with select conf = SIGMOD.]
Performance results
Efficiency of specialized cost model
Response time and cost of joins
[Figure, first panel: response time (s) for exec(P1), exec(P2), cost(P1), cost(P2) vs. cardinality ratio (0 to 20) for card(Order) = 1000.]
[Figure, second panel: relative efficiency (%) vs. cardinality ratio (0 to 20) for card(Order) = 1000.]
Specific projection within TPC-D
Order(orderkey, custkey, comment...), Customer(custkey, name)
select * from Order o, Customer c where o.custkey = c.custkey
[Figure: plans P1 and P2 over the Order and Customer collections (s = submit to wrapper). Node labels: dep-join on o.custkey = c.custkey; select custkey = X; select orderkey = X; proj orderkey, custkey (materialized); tmp.custkey = c.custkey; tmp.orderkey = c.orderkey.]
Performance results
Generic cost model efficiency
Response time: plan 2 / plan 1
[Figure, first panel: response time ratio P2/P1 vs. selectivity (0 to 0.25) for Order, OrderMedium, OrderLarge.]
[Figure, second panel: relative efficiency (%) vs. selectivity (0 to 0.25) for Order, OrderMedium, OrderLarge.]
Related work: comparison with other approaches
Known Solutions (1)
IRO-DB [Gardarin et al. in CoopIS'97]
  hypothesis: generic cost model is unique
  calibration of parameters
Hermes [Subrahmanian et al. in SIGMOD'96]
  hypothesis: black box, stability over time
  historical costs:
    cost[initial delay, total time, cardinality] = table[query]
Known Solutions (2)
Cords [Zhu, Larson in Distributed & Parallel DB, 98]
  hypothesis: similar queries have similar costs
  query sampling + regression cost model
Garlic [Roth, Ozcan, Haas in VLDB '99]
  multimedia context, cost of image retrieval
  cost[cold resp. time, hot resp. time, cardinality]
  override wrapper cost functions
Comparison with related work
generalizes calibration and sampling approach
integrates Hermes historical cost
  bottom layer of cost rules hierarchy
  useful for query caching
Garlic confirmation
  cost models do matter: need for wrapper input
  extensible: cost of method call
  trade-off expressiveness/abstraction
Conclusion
Disco's cost model is efficient
  easy specialization of cost model for access method
    yields 100% efficiency for typical web queries
  improves logical and physical optimization
Extensible
  may specify constraints imposed by source
    periodic variation, resource limitation
Flexible
  fine granularity of cost formulas
Directions for Future Work
update cost model at runtime
  polling / notification
cost rules for query caching
optimize cost evaluation algorithm
  stop condition: response time < Tmax or result size < Smax
extensions
  EC: cost = price, new parameters
  cost of path expression, full-text search
References
Disco's cost model definition
  Hubert Naacke, Georges Gardarin, and Anthony Tomasic. Leveraging Mediator Cost Models with Heterogeneous Data Sources (Extended version). ICDE 1998. Early version in BDA 1997.
Disco's cost model validation
  Hubert Naacke, Anthony Tomasic, and Patrick Valduriez. Validating Mediator Cost Models with Disco. To appear, NISJ 1999.
Disco implementation
  Anthony Tomasic, Rémy Amouroux, Philippe Bonnet, Olga Kapitskaia, Hubert Naacke, and Louiqa Raschid. The Distributed Information Search Component (Disco) and the World Wide Web. In ACM SIGMOD 1997, Research Prototype Demonstration.
  MIRO-WEB Esprit Project, Spanish Hospital Application.
DBLP Wrapper: Materialization
materialized view V(conf, year):
select * from Publication where conf = C and year = Y

Conf    Year  Author  Title
VLDB    1985  Gray    t
VLDB    1985  Du      u
''      ...   ...     ...
''      1999  Haas    v
Sigmod  1985  Kim     w
''      ...   ...     ...
''      1999  Pu      x
Cost model for projection
Person(name, picture), Publication(author, ...)
select * from Publication pub, Person pers where pub.author = pers.name
[Figure: plan 1 and plan 2 over the Person and Publication collections (s = submit to wrapper). Node labels: select author = X; select name = X; project name (materialized).]