MANAGING UNCERTAINTY OF XML SCHEMA MATCHING
Reynold Cheng, Jian Gong, David W. Cheung
ICDE’2010
THE DATA INTEGRATION PROBLEM
Querying the source data through a target query interface
E.g., querying multiple data sources through a mediated query interface
[Figure labels: data source, source schema, schema mapping, target schema, query interface]
SCHEMA MATCHING & MAPPING
Schema matching: finding element correspondences, with similarities, between schemas
Schema mapping: a set of one-to-one correspondences between two schemas
Generation: pick the best correspondences
Sample mapping: Order - ORDER, BP - IP, BCN - ICN, …
SCHEMA MAPPING AND UNCERTAINTY
The mapping between schemas can be uncertain
Example: Purchase Order schemas
Uncertain mappings (which one is correct?)
  M1: Order-ORDER, …, BCN-ICN, …
  M2: Order-ORDER, …, RCN-ICN, …
  …
Compute Pr(Mi) by:
  1) aggregating similarities of correspondences, and
  2) normalizing probabilities of top-k mappings
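The two steps for Pr(Mi) can be illustrated with a minimal sketch. It assumes the aggregate is a plain sum of similarities (the mapping score used later in the talk); the similarity values and the function name `mapping_probabilities` are made up for illustration.

```python
def mapping_probabilities(mappings):
    """mappings: list of dicts {(source_elem, target_elem): similarity}."""
    # Step 1: aggregate the similarities of each mapping's correspondences (assumed: sum).
    scores = [sum(m.values()) for m in mappings]
    total = sum(scores)
    # Step 2: normalize over the top-k mappings so that the probabilities sum to 1.
    return [s / total for s in scores]

# Hypothetical similarities for two candidate mappings.
m1 = {("Order", "ORDER"): 0.9, ("BCN", "ICN"): 0.7}
m2 = {("Order", "ORDER"): 0.9, ("RCN", "ICN"): 0.5}
probs = mapping_probabilities([m1, m2])
```

With these numbers M1 ends up slightly more probable than M2, since only the BCN-ICN versus RCN-ICN similarity differs.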
DATA INTEGRATION RELOADED
Managing uncertainty of XML schema matching
Issues: mapping generation and storage, query evaluation, etc.
[Figure labels: data source, source schema, uncertain schema mapping, mediated schema, query interface]
OBSERVATION
Sharing among uncertain mappings
Overlapping:
  “Order~ORDER” shared by m1-m5
  “BP~IP” shared by m1, m2, m4, m5
  “BCN~ICN” shared by m1, m2
  …
OBSERVATION
How much overlap is there in real-world schema mappings?
Overlapping ratio (o-ratio): the average overlap of the top-100 possible schema mappings
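The slide does not give a formula for the o-ratio. One plausible reading, shown here purely as an assumption, is the average pairwise overlap, where the overlap of two mappings is the number of shared correspondences divided by the number of correspondences in either.

```python
from itertools import combinations

def o_ratio(mappings):
    """mappings: list of sets of (source, target) correspondences."""
    pairs = list(combinations(mappings, 2))
    # Assumed overlap measure for a pair: |shared| / |union|.
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Three toy mappings built from correspondences named on the slides.
m1 = {("Order", "ORDER"), ("BP", "IP"), ("BCN", "ICN")}
m2 = {("Order", "ORDER"), ("BP", "IP"), ("RCN", "ICN")}
m3 = {("Order", "ORDER"), ("SP", "IP"), ("BP", "SP")}
ratio = o_ratio([m1, m2, m3])
```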
OUR CONTRIBUTION
Propose the block tree: a novel data structure to represent a set of mappings
  Definition
  Efficient generation
Propose the probabilistic twig query (PTQ)
  Definition
  Efficient evaluation with the block tree
  Top-k PTQ, and its computation issues
Improve the possible-mapping generation process
  A divide-and-conquer approach
Conduct experiments on real data to validate our methods
RELATED WORK
Schema matching approaches and tools [RB01]
  COMA [DR02]
Managing uncertainty in schema matching
  Top-k schema mappings [Gal06]
  Generating top-k mappings [Murty86]
Query evaluation in data integration
  Theoretical foundation [Len02]
  Data integration with uncertainty [DHY07]
  XML query rewriting for data integration [YP04]
XML query evaluation
  Twig query [QYD07]
  Querying probabilistic XML documents [KYS08]
OUTLINE
Introduction
Problem
  Data model
  Query model
Techniques
Results
Conclusion
DATA MODEL
XML schema and document [QYD07]
  Node-labeled tree
  Document nodes may carry text values
Schema mapping [DHY07]
  One-to-one mapping
[Figure: two schemas and a document]
Uncertain mappings
  M1: Order-ORDER, …, BCN-ICN, …
  M2: Order-ORDER, …, RCN-ICN, …
  …
QUERY MODEL (SINGLE MAPPING)
Twig query through a target schema [YP04]
  Step 1: rewrite the target query into a source query, based on the schema mapping
    M1: Order-ORDER, BP-IP, BCN-ICN, …
  Step 2: evaluate the source query on the source document
[Figure: target query and target schema; source query, source schema and source document]
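The two steps can be sketched in a few lines. This is a simplified illustration, not the rewriting algorithm of [YP04]: a twig is assumed to be a nested tuple `(label, [child twigs])`, every edge is treated as a descendant edge, and the toy source document is invented.

```python
import xml.etree.ElementTree as ET

def rewrite(twig, mapping):
    """Step 1: replace each target label by its source label.
    mapping: dict source_element -> target_element (one-to-one)."""
    to_source = {t: s for s, t in mapping.items()}
    label, children = twig
    return (to_source[label], [rewrite(c, mapping) for c in children])

def matches(node, twig):
    """True if node carries the twig's root label and every child twig
    matches some proper descendant of node."""
    label, children = twig
    return node.tag == label and all(
        any(matches(d, c) for d in node.iter() if d is not node)
        for c in children)

def evaluate(twig, doc_root):
    """Step 2: return the document nodes that match the twig root."""
    return [n for n in doc_root.iter() if matches(n, twig)]

m1 = {"Order": "ORDER", "BP": "IP", "BCN": "ICN"}
target_query = ("IP", [("ICN", [])])        # invoice parties that have a contact name
source_query = rewrite(target_query, m1)    # ("BP", [("BCN", [])])
doc = ET.fromstring(
    "<Order><BP><BCN>Alice</BCN></BP><SP><RCN>Bob</RCN></SP></Order>")
answers = evaluate(source_query, doc)
```

Here the only answer is the BP node, whose BCN child holds the contact name.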
QUERY MODEL (UNCERTAIN MAPPINGS)
Query evaluation with uncertain mappings [DHY07]
  Mappings: pM = {(M1, Pr(M1)), …, (Mh, Pr(Mh))}
  The query answers from mapping Mi have probability Pr(Mi)
[Figure: the target query QT is rewritten with each (Mi, Pr(Mi)) into a source query QSi; evaluating QSi yields the answers (Ri, Pr(Mi)), for i = 1, …, h]
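A minimal sketch of this evaluation model follows. It assumes that when several mappings return the same answer, their probabilities are added; the slide only states that answers from Mi carry Pr(Mi). The toy document, the probabilities, and the helper names are hypothetical.

```python
from collections import defaultdict

def evaluate_uncertain(target_query, p_mappings, rewrite, evaluate):
    """p_mappings: list of (mapping, probability). Returns {answer: probability}."""
    result = defaultdict(float)
    for mapping, prob in p_mappings:
        source_query = rewrite(target_query, mapping)   # QSi
        for answer in set(evaluate(source_query)):      # Ri
            result[answer] += prob                      # assumed: sum over mappings
    return dict(result)

# Toy source "document": source label -> text values found under that label.
doc = {"BCN": ["Alice", "Carol"], "RCN": ["Bob", "Carol"]}
rewrite = lambda q, m: {t: s for s, t in m.items()}[q]
evaluate = lambda q: doc.get(q, [])

pM = [({"BCN": "ICN"}, 0.6), ({"RCN": "ICN"}, 0.4)]
answers = evaluate_uncertain("ICN", pM, rewrite, evaluate)
```

In this example "Carol" is returned under both mappings and so receives the full probability mass, while "Alice" and "Bob" keep the probability of the single mapping that produces them.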
OUTLINE
Introduction
Problem
Techniques
  Block tree
  Query evaluation
  Mapping generation
Results
Conclusion
THE BLOCK
Each block, which is attached to a target schema element, consists of:
  C: a set of correspondences
  M: a set of mappings
Semantics: the mappings in M share the correspondences in C
Drawback: exponential number of blocks to handle
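As a rough illustration, a block can be modeled as a small record plus a check of its semantics. The class and field names below are assumptions made for this sketch, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Block:
    element: str      # target schema element the block is attached to
    C: frozenset      # shared correspondences, as (source, target) pairs
    M: frozenset      # ids of the mappings that share every correspondence in C

def is_consistent(block, mappings):
    """Block semantics: every mapping listed in M contains all of C."""
    return all(block.C <= mappings[m] for m in block.M)

mappings = {
    "m1": {("Order", "ORDER"), ("BP", "IP"), ("BCN", "ICN")},
    "m2": {("Order", "ORDER"), ("BP", "IP"), ("BCN", "ICN")},
    "m3": {("Order", "ORDER"), ("SP", "IP")},
}
good = Block("IP", frozenset({("BP", "IP"), ("BCN", "ICN")}), frozenset({"m1", "m2"}))
bad = Block("IP", frozenset({("BP", "IP")}), frozenset({"m1", "m3"}))
```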
THE C-BLOCK
A c-block (constrained block) is a block which:
  Contains a correspondence for every element in its sub-tree (so that it is more useful for query evaluation)
  Is shared by enough mappings to meet a threshold (otherwise it is not worth storing)
Example: |pM| = 5, threshold = 0.4
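The two conditions can be checked as below. The sketch assumes the threshold is a fraction of |pM| and that meeting it exactly is enough, which is how the example values (5 mappings, threshold 0.4, blocks shared by 2 mappings) appear to be used; the function name is hypothetical.

```python
def is_c_block(C, M, subtree_elements, total_mappings, threshold):
    """C: set of (source, target) pairs; M: set of mapping ids.
    subtree_elements: target elements in the sub-tree of the block's element."""
    covered = {target for _, target in C}
    covers_subtree = set(subtree_elements) <= covered
    shared_enough = len(M) / total_mappings >= threshold   # assumed: >=, as a fraction
    return covers_subtree and shared_enough

# |pM| = 5 and threshold = 0.4, as on the slide.
full = is_c_block({("BP", "IP"), ("BCN", "ICN")}, {"m1", "m2"}, ["IP", "ICN"], 5, 0.4)
partial = is_c_block({("BP", "IP")}, {"m1", "m2", "m4", "m5"}, ["IP", "ICN"], 5, 0.4)
rare = is_c_block({("OCN", "ICN")}, {"m5"}, ["ICN"], 5, 0.4)
```

The second block fails because it has no correspondence for ICN, and the third because a single mapping out of five falls below the threshold.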
THE BLOCK TREE
Creation of the block tree
  Follows the structure of the target schema
  A bottom-up method
Lemma 1 (informal): The c-blocks for an element can be created from the c-blocks of its children.
Lemma 2 (informal): If an element has no c-block, then its parent (if any) has no c-block.
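One way to read the two lemmas as a procedure is sketched below. This is an illustrative reconstruction, not the paper's algorithm: for each element it combines one c-block per child with a group of mappings that agree on the element's own correspondence, intersects the mapping sets, and keeps the result if enough mappings remain. The data layout and the name `build_c_blocks` are assumptions.

```python
from itertools import product

def build_c_blocks(tree, root, mappings, threshold):
    """tree: dict target element -> list of child elements.
    mappings: dict mapping_id -> dict target element -> source element.
    Returns dict element -> list of (C, M) c-blocks, built bottom-up."""
    blocks = {}
    min_size = threshold * len(mappings)

    def visit(t):
        children = tree.get(t, [])
        for child in children:
            visit(child)
        # Group the mappings by the source element they assign to t.
        groups = {}
        for mid, m in mappings.items():
            if t in m:
                groups.setdefault(m[t], set()).add(mid)
        found = []
        # Lemma 2: if a child has no c-block, product() is empty and t gets none.
        for combo in product(*(blocks[c] for c in children)):
            for src, mids in groups.items():
                C, M = {(src, t)}, set(mids)
                for child_C, child_M in combo:      # Lemma 1: reuse the children's c-blocks
                    C |= child_C
                    M &= child_M
                if len(M) >= min_size:
                    found.append((frozenset(C), frozenset(M)))
        blocks[t] = found

    visit(root)
    return blocks

tree = {"IP": ["ICN"]}
mappings = {
    "m1": {"IP": "BP", "ICN": "BCN"},
    "m2": {"IP": "BP", "ICN": "BCN"},
    "m3": {"IP": "SP", "ICN": "RCN"},
    "m4": {"IP": "BP", "ICN": "RCN"},
    "m5": {"IP": "BP", "ICN": "OCN"},
}
blocks = build_c_blocks(tree, "IP", mappings, threshold=0.4)
```

On this toy input, ICN gets the c-blocks BCN~ICN (m1, m2) and RCN~ICN (m3, m4), and IP gets the single c-block BP~IP, BCN~ICN (m1, m2).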
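THE BLOCK TREE
Reducing the storage cost of uncertain mappings
If part of a mapping is in the block tree, then replace it with a link
[Figure: block tree over the target schema elements ORDER, IP, ICN, SP, SCN, ..., with blocks labeled b1-b5 and g1-g3]
Blocks shown in the figure (C: shared correspondences; M: mappings sharing them):
  C: Order~ORDER          M: m1, m2, m3, m4, m5
  C: BP~IP                M: m1, m2, m4, m5
  C: BP~IP, BCN~ICN       M: m1, m2
  C: BCN~ICN              M: m1, m2
  C: RCN~ICN              M: m3, m4
  C: OCN~SCN              M: m2, m3
  C: BCN~SCN              M: m4, m5
Mappings shown in the figure (links such as b2.C, b3.C, b4.C, b5.C stand in for correspondences stored in a block):
  m1: Order~ORDER, RCN~SCN, ...
  m2: Order~ORDER, OCN~SCN, ...
  m3: Order~ORDER, SP~IP, BP~SP, ...
  m4: Order~ORDER, BP~IP, ...
  m5: Order~ORDER, BP~IP, OCN~ICN, ...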
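The link replacement can be sketched as follows. The greedy, largest-block-first order and the block ids `b_order` and `b_ip` are assumptions made for illustration; the slides do not say how links are chosen.

```python
def compress(mapping, blocks):
    """mapping: set of (source, target) correspondences.
    blocks: dict block_id -> set of correspondences C stored in the block tree.
    Returns (links, remainder): block ids that stand in for stored correspondences,
    and the correspondences that still have to be stored explicitly."""
    links, remainder = [], set(mapping)
    # Assumed heuristic: try larger blocks first so each link replaces more.
    for bid, C in sorted(blocks.items(), key=lambda kv: -len(kv[1])):
        if C <= remainder:
            links.append(bid)
            remainder -= C
    return links, remainder

blocks = {
    "b_order": {("Order", "ORDER")},
    "b_ip": {("BP", "IP"), ("BCN", "ICN")},
}
m1 = {("Order", "ORDER"), ("BP", "IP"), ("BCN", "ICN"), ("RCN", "SCN")}
links, remainder = compress(m1, blocks)
```

Only RCN~SCN remains stored with the mapping; the other three correspondences are reached through two links.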
QUERY EVALUATION AND UNCERTAINTY
The uncertainty in mappings may affect query answers
Uncertain mappings
  M1: Order-ORDER, …, BCN-ICN, …
  M2: Order-ORDER, …, RCN-ICN, …
  …
Target query Q: //ICN, which finds all ICNs (contact names of invoice parties) in the purchase order
[Figure: an example source document, marking the nodes returned by M1 and the nodes returned by M2]
THE BASELINE APPROACH
Evaluate QT with each mapping in pM separately
Drawback: when a mapping Mi is large, or h is large, the computation is expensive
[Figure: QT is rewritten with each (Mi, Pr(Mi)) into QSi; each QSi is evaluated on the source document DS, giving (Ri, Pr(Mi)), for i = 1, …, h]
QUERY EVALUATION WITH BLOCK TREE
Consider the root of a query
  Case 1): the root is found in the block tree; use the blocks to evaluate the whole query
    Only one mapping in the block is used
    Deal with the remaining mappings
  Case 2): the root is not found; decompose the query (if possible), invoke recursion, and join the partial answers
[Figure: example query over ORDER, IP, ICN and SP; in case 2 it is decomposed into parts answered by direct query and by recursion]
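The saving in case 1 can be illustrated with a simplified sketch: when a block's correspondences cover every element of the query, one rewriting and one evaluation serve all mappings in that block, and only the remaining mappings are evaluated one by one. Case 2 (decomposition and join) is not covered. The data layout and the callback `run_source_query` are assumptions for this sketch.

```python
def evaluate_with_blocks(query_elems, blocks, mappings, probs, run_source_query):
    """query_elems: target elements used by the query.
    blocks: list of (C, M), with C a dict target -> source and M a set of mapping ids.
    mappings: dict mapping_id -> dict target -> source; probs: mapping_id -> Pr.
    run_source_query: callable taking a tuple of source elements, returning answers."""
    results, done = {}, set()
    for C, M in blocks:
        if set(query_elems) <= C.keys() and not M <= done:
            # One rewriting and one evaluation serve every mapping in M.
            answers = run_source_query(tuple(C[e] for e in query_elems))
            for mid in M - done:
                results[mid] = (answers, probs[mid])
            done |= M
    # Remaining mappings: one evaluation each, as in the baseline.
    for mid in sorted(mappings.keys() - done):
        m = mappings[mid]
        results[mid] = (run_source_query(tuple(m[e] for e in query_elems)), probs[mid])
    return results

calls = []
def run_source_query(source_elems):
    calls.append(source_elems)             # count evaluations on the source document
    return ["answers for " + "/".join(source_elems)]

mappings = {"m1": {"ICN": "BCN"}, "m2": {"ICN": "BCN"}, "m3": {"ICN": "RCN"},
            "m4": {"ICN": "RCN"}, "m5": {"ICN": "OCN"}}
probs = {mid: 0.2 for mid in mappings}
blocks = [({"ICN": "BCN"}, {"m1", "m2"}), ({"ICN": "RCN"}, {"m3", "m4"})]
results = evaluate_with_blocks(["ICN"], blocks, mappings, probs, run_source_query)
```

For the query //ICN this runs three source queries instead of five.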
OUTLINE
Introduction
Problem
  Data model
  Query model
Techniques
  Block tree
  Query evaluation
  Mapping generation
Results
Conclusion
MAPPING GENERATION
A mapping m for a schema S with another schema T contains a set of correspondences (es, et)
  et may be EMPTY, i.e., es matches no element in T
  Each element in S occurs exactly once in m
  Each element in T occurs at most once in m
  m’s score is the sum of the similarities of its correspondences
Problem definition
  Given: two schemas S and T, and a set of correspondences (es, et) with similarities (the schema matching results)
  Return: h mappings m1, …, mh, whose scores are the highest
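The constraints and the score translate directly into code. In this sketch EMPTY is represented by `None`, and a correspondence to EMPTY is assumed to contribute 0 to the score; the similarity values are invented.

```python
def is_valid_mapping(m, S):
    """m: list of (es, et) pairs, with et = None standing for EMPTY.
    S: list of the source schema's elements."""
    sources = [es for es, _ in m]
    targets = [et for _, et in m if et is not None]
    each_source_once = sorted(sources) == sorted(S)
    each_target_at_most_once = len(targets) == len(set(targets))
    return each_source_once and each_target_at_most_once

def score(m, sim):
    """Sum of the similarities of m's correspondences (EMPTY counts as 0, by assumption)."""
    return sum(sim[(es, et)] for es, et in m if et is not None)

S = ["Order", "BP", "SP"]
sim = {("Order", "ORDER"): 0.9, ("BP", "IP"): 0.8, ("SP", "IP"): 0.6}
ok = [("Order", "ORDER"), ("BP", "IP"), ("SP", None)]
clash = [("Order", "ORDER"), ("BP", "IP"), ("SP", "IP")]   # IP used twice
```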
MAPPING GENERATION
Baseline solution
  Finding the h maximum bipartite matchings (min-cost flow)
  Polynomial in the size of the bipartite graph
MAPPING GENERATION
Observation: XML schema matching is usually sparse
Improvement: a divide-and-conquer approach
  Derive partitions (maximal connected sub-graphs) of the bipartite graph
  Find the top-h partial mappings from each partition
  Merge
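The three steps can be sketched end to end on a toy input. The partition step uses a union-find over the bipartite graph; the per-partition step is a brute-force enumeration standing in for a proper ranking algorithm such as [Murty86]; correspondences to EMPTY are simply left out of the returned mappings. All names and similarity values are illustrative assumptions.

```python
import heapq
from itertools import product

def partitions(edges):
    """Split correspondences {(es, et): sim} into connected components."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for es, et in edges:
        parent[find(("S", es))] = find(("T", et))
    groups = {}
    for (es, et), sim in edges.items():
        groups.setdefault(find(("S", es)), {})[(es, et)] = sim
    return list(groups.values())

def top_h_partial(part, h):
    """Brute-force top-h partial mappings of one partition."""
    sources = sorted({es for es, _ in part})
    options = [[(es, et) for (e, et) in part if e == es] + [(es, None)]
               for es in sources]
    found = []
    for choice in product(*options):
        targets = [et for _, et in choice if et is not None]
        if len(targets) == len(set(targets)):       # each target used at most once
            pairs = tuple(c for c in choice if c[1] is not None)
            found.append((sum(part[p] for p in pairs), pairs))
    return heapq.nlargest(h, found)

def merge(tops_a, tops_b, h):
    """Top-h combinations of two independent partitions' top-h lists."""
    return heapq.nlargest(
        h, [(sa + sb, ma + mb) for sa, ma in tops_a for sb, mb in tops_b])

def top_h_mappings(edges, h):
    result = [(0.0, ())]
    for part in partitions(edges):
        result = merge(result, top_h_partial(part, h), h)
    return result

edges = {("Order", "ORDER"): 0.9, ("BP", "IP"): 0.8,
         ("SP", "IP"): 0.6, ("BCN", "ICN"): 0.7}
top = top_h_mappings(edges, h=2)
```

Because partitions share no elements, the best h full mappings can only be built from the best h partial mappings of each partition, which is why merging the short per-partition lists is enough.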
OUTLINE
Introduction
Problem
Techniques
Results
Conclusion
DATASET AND RESULTS
XML schemas and documents
  7 schemas for purchase orders, obtained from various e-commerce standards (e.g., XCBL, OpenTrans)
  Accompanying sample XML documents
Schema matching
  Tool: COMA++, with different schema matching methods
  10 datasets: (source-schema, target-schema, matching-method)
Target queries
  10 hand-written queries
RESULTS
Uncertain mappings: do they really overlap?
RESULTS
How much space does the block tree save for storing uncertain mappings? And why?
RESULTS
Is the block tree effective?
  Intuitively, larger blocks tend to be more useful
RESULTS
The block tree can be efficiently created
  Fast, and controllable
RESULTS
Can the block tree really help to improve query performance?
  Varying the total number of mappings
RESULTS
Can it scale?
  Probabilistic twig query and top-k query
RESULTS
Top-h mapping generation
  Performance gain of partitioning
CONCLUSION
We study the problem of handling uncertainty in XML schema matching
Observations
  Overlapping mappings, sparse bipartite graphs, etc.
Approach
  The block tree
  Query evaluation with the block tree
  Generating uncertain mappings more efficiently
Future work
  Other types of queries, probabilistic documents, index updates, relational scenarios, etc.
THANKS!
Q & A
REFERENCES
[Len02] Lenzerini, “Data integration: a theoretical perspective”, in PODS, 2002
[YP04] Yu et al, “Constraint-based XML query rewriting for data integration”, in SIGMOD, 2004
[DR02] Do et al, “COMA: a system for flexible combination of schema matching approaches”, in VLDB, 2002
[Gal06] Gal, “Managing uncertainty in schema matching with top-k schema mappings”, in J. Data Semantics VI, 2006
[DHY07] Dong et al, “Data integration with uncertainty”, in VLDB, 2007
[QYD07] Qin et al, “TwigList: make twig pattern matching fast”, in DASFAA, 2007
[Murty86] Murty, “An algorithm for ranking all the assignments in order of increasing cost”, Operations Research, vol 16, 1968
[RB01] Rahm et al, “A survey of approaches to automatic schema matching”, VLDB J, vol 10, 2001
[KYS08] Kimelfeld et al, “Query efficiency in probabilistic XML models”, in SIGMOD, 2008
…
QUERY REWRITING
Given
  A target twig query QT
  A schema mapping m between S and T, which is a set of correspondences (es, et)
Mapping semantics
  For each sub-tree in the source document DS which contains a set of source elements in m, there exists a sub-tree in the target document DT which contains the corresponding target elements
Procedure
  For each element in QT, replace it with a source element
  Connect all the source elements
LEMMA 1
Lemma 1 (conceptually): The c-blocks for a schema element t can be created from the c-blocks of t’s children.
An example
[Figure: example schema tree rooted at Order, with InvoiceTo and DeliverTo sub-trees (name, Address, Contact, street, email, city, country), annotated with numbers such as 27|24|25|24 and 51|49; example blocks b1.M: 1-52, b2.M: 53-100, b3.M: 1,3,5,…, b4.M: 2,4,6,...]
RESULTS
What kinds of queries did we use?