
Creating semantic mappings

(based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

What have we studied before? Formalisms for specifying source descriptions, and how to use these descriptions to reformulate queries.

What is the goal now? A set of techniques that helps a designer create semantic mappings and source descriptions for a particular data integration application. This is a heuristic task; the idea is to reduce the time it takes to create semantic mappings.

Motivating example

DVD vendor schema:
Product(title, productionYear, releaseDate, basePrice, listPrice, rating, customerReviews, saleLocation)
Movie(title, year, director, directorOrigin, mainActors, genre, awards)
Locations(name, taxRate, shippingCharge)

Online aggregator schema:
Item(title, year, classification, genre, director, price, starring, avgReviews)

Recap: semantic heterogeneities

- Table and attribute names can differ: the attributes rating and classification, as well as mainActors and starring.
- Multiple attributes in one schema correspond to a single attribute in the other: basePrice and taxRate from the vendor are used to compute the value of price in the aggregator schema.
- Tabular organization may be different: the DVD vendor requires 3 tables, the aggregator needs only one.
- Coverage and level of detail may differ: the DVD vendor models releaseDate and awards, which the aggregator does not.

Process of creating schema mappings

1) Schema matching: creating correspondences between the elements of the 2 schemas.
Ex: the attribute rating in the vendor schema corresponds to the attribute classification in the aggregator schema.

2) Creating schema mappings from the correspondences (and filling in missing details): specifies the transformations that have to be applied to the source data in order to produce the target data.
Ex: to compute the value of price in the aggregator schema, we have to join the Product table with the Locations table, using saleLocation = name, and add the appropriate local tax given by taxRate.

Schema matching

Goal: to automatically create a set of correspondences between the 2 schemas. Given two schemas S1 and S2, whose schema elements are the table and attribute names in the schema, a correspondence A -> B states that a set of elements A in S1 maps to a set of elements B in S2. The most common correspondences are 1-1, where A and B are singletons.
Ex: for the target relation Item, there are the following correspondences:
Product.title -> title
Movie.year -> year
{Product.basePrice, Locations.taxRate} -> price

Output of the schema matcher

Associates a confidence measure (in [0,1]) with every correspondence, because of the heuristic nature of the schema matching process. Possible heuristics: examine the names of the schema elements, examine the data values, etc. A filter may also be associated with a correspondence.
Ex: {Product.basePrice, Locations.taxRate} -> price may apply only to locations in the US.

Components of a schema matcher

Base matchers: predict correspondences based on cues available in the schema and data

Combiner: combines the predictions of the base matchers into a single similarity matrix

Constraint enforcer: applies domain knowledge and constraints to prune the possible matches

Match selector: chooses the best match or matches from the similarity matrix


Creating mappings from correspondences

Create the actual mappings from the matches (correspondences): find how tuples from one source can be translated into tuples in the other. The challenge is that there may be more than one possible way of joining the data.
Ex: to compute the value of price in the aggregator schema, we may join the Product table with the Locations table, using saleLocation = name, and add the appropriate local tax given by taxRate; or join Product with Movie to obtain the origin of the director, and compute the price based on the taxes in the director's country of birth.

Outline

- Base matchers
- Combining match predictions
- Applying constraints and domain knowledge to candidate schema matches
- Match selector
- Applying machine learning techniques to enable the schema matcher to learn
- Discovering m-m matches
- From matches to mappings

Base matchers

Input: a pair of schemas S1 and S2, with elements A and B respectively, plus additional available information, such as data instances or text descriptions.
Output: a correspondence matrix that assigns to every pair of elements (Ai, Bj) a number between 0 and 1 predicting whether Ai corresponds to Bj.

Classes of base matchers

1) Name-based matchers: based on comparing the names of schema elements.
2) Instance-based matchers: based on inspecting data instances. They must look at large amounts of data, so they are slower (though efficiency can be improved), but they are more precise.

For specific domains, it is possible to develop more specialized and effective matchers.

Name-based matchers

Compare the names of the elements, hoping that the names convey the true semantics of the elements. The challenge is to find effective distance measures that reflect the distance between element names, since names are never written in exactly the same way.

Edit distance

Edit distance between the strings that represent the names of the elements. The Levenshtein distance is the minimum number of operations (insertions, deletions or replacements) needed to transform one string into another. Given two schema elements represented by the strings s1 and s2, and their edit distance, denoted editDistance(s1, s2):

editSimilarity(s1, s2) = 1 - editDistance(s1, s2) / max(length(s1), length(s2))

Ex: the Levenshtein distance between the strings FullName and FName is 3; between TotalPrice and PriceSum it is 8. The editSimilarity between FullName and FName is 0.625; between TotalPrice and PriceSum it is 0.2.
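A minimal Python sketch of both definitions (the dynamic-programming Levenshtein implementation and the function names are mine, not from the slides):

def edit_distance(s1: str, s2: str) -> int:
    # Levenshtein distance via the standard dynamic program.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # replacement
        prev = curr
    return prev[-1]

def edit_similarity(s1: str, s2: str) -> float:
    return 1 - edit_distance(s1, s2) / max(len(s1), len(s2))

print(edit_similarity("FullName", "FName"))       # 0.625
print(edit_similarity("TotalPrice", "PriceSum"))  # ≈ 0.2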

Qgram distance

Qgrams are substrings within the names. Given a positive integer q, the set of q-grams of s, qgrams(s), consists of all the substrings of s of size q.
Ex: the 3-grams of price are {pri, ric, ice}.
Given a number q:

qgramSimilarity(s1, s2) = ||qgrams(s1) ∩ qgrams(s2)|| / ||qgrams(s1) ∪ qgrams(s2)||

Ex: the 3-gram similarity between pricemarked and markedprice is 7/18 = 0.39.
Advantages over edit distance: faster, and more resilient to word-order interchanging.
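A sketch under the set-based reading of the formula above (note that on this example it yields 7/11 ≈ 0.64; the slide's 7/18 divides the overlap by the total number of q-grams of both strings instead):

def qgrams(s: str, q: int = 3) -> set[str]:
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_similarity(s1: str, s2: str, q: int = 3) -> float:
    g1, g2 = qgrams(s1, q), qgrams(s2, q)
    return len(g1 & g2) / len(g1 | g2)

print(qgrams("price"))                                 # {'pri', 'ric', 'ice'}
print(qgram_similarity("pricemarked", "markedprice"))  # 7/11 ≈ 0.64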

Sound distance

Based on the way strings sound. The Soundex algorithm encodes names by their sound; it can detect when two names sound the same even if their spelling varies.
Ex: ship2 is similar to shipto.

Normalization

Element names can be composed of acronyms or short phrases to express their meanings. Normalization replaces a single token by several tokens that can be compared; element names should be normalized before applying distance measures. Some normalization techniques (see the sketch below):
- Expand known abbreviations
- Expand a string with its synonyms
- Remove articles, prepositions and conjunctions
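A sketch of such a normalizer; the abbreviation table and stopword list are illustrative stand-ins, not from the slides:

import re

ABBREVIATIONS = {"qty": "quantity", "addr": "address", "desc": "description"}
STOPWORDS = {"a", "an", "the", "of", "and", "or"}

def normalize(name: str) -> list[str]:
    # Split camelCase and non-alphanumeric separators into tokens.
    tokens = [t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)]
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]  # expand known abbreviations
    return [t for t in tokens if t not in STOPWORDS]    # drop articles etc.

print(normalize("shipAddr"))   # ['ship', 'address']
print(normalize("total_qty"))  # ['total', 'quantity']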

Instance-based matchers

Data instances, if available, convey the meaning of a schema element even more than its name; use them for predicting correspondences between schema elements. Techniques:
- Develop a set of rules for inferring common types from the format of the data values (ex: phone numbers, prices, zip codes, etc.)
- Value overlap
- Text-field analysis

Value overlap

Measures the overlap of values in the two elements. Applies to categorical elements, whose values range over some finite domain (e.g., movie ratings, country names).
The Jaccard coefficient is the fraction of the observed values that can be an instance of both elements. It can also be defined as the conditional probability of a value being an instance of both elements, given that it is an instance of one of them:

JaccardSim(e1, e2) = Pr(e1 ∩ e2 | e1 ∪ e2) = ||D(e1) ∩ D(e2)|| / ||D(e1) ∪ D(e2)||

where D(e) is the set of values for element e.
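A direct sketch of the coefficient, with made-up movie-rating domains as example data:

def jaccard_sim(values1, values2) -> float:
    d1, d2 = set(values1), set(values2)   # D(e1), D(e2)
    return len(d1 & d2) / len(d1 | d2)

ratings_vendor = ["G", "PG", "PG-13", "R"]
ratings_aggregator = ["PG", "PG-13", "R", "NC-17"]
print(jaccard_sim(ratings_vendor, ratings_aggregator))  # 3/5 = 0.6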

Text-field analysis

Applies to elements whose values are longer texts (e.g., house descriptions). Their values can vary drastically, so the probability of finding the exact same string for both elements is very low. The idea is to compare the general topics these text fields are about, using text classifiers.

Text classifiers

A classifier for a concept C is an algorithm that distinguishes instances of C from non-instances. It creates an internal model based on training examples: positive examples that are known to be instances of C, and negative examples that are known not to be. Given an example c, the classifier applies its model to decide whether c is an instance of C.

Combining match predictions (1)

The results of the base matchers are summarized in a similarity cube. Suppose the schema matcher used l base matchers to predict correspondences between the elements A1, ..., An of S1 and the elements B1, ..., Bm of S2. The similarity cube assigns to each triple (b, i, j) a number between 0 and 1 describing the prediction of base matcher b about the correspondence between Ai and Bj.

Combining match predictions (2)

Output: a similarity matrix that combines the predictions of the base matchers. For every pair (i, j) we want a value Combined(i, j) between 0 and 1 that gives a single prediction about the correspondence between Ai and Bj. Two possible combinations:

Combined(i, j) = max_{b=1..l} Base(b, i, j)
Combined(i, j) = (1/l) * sum_{b=1..l} Base(b, i, j)

Max is used when we trust a matcher that outputs a high value; avg is used otherwise. Multi-step combination functions, and giving weights to the matchers, are also possible.
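A sketch of both combiners over a similarity cube, with illustrative numbers:

def combine(cube, how="avg"):
    # cube[b][i][j]: prediction of base matcher b for the pair (Ai, Bj).
    l, n, m = len(cube), len(cube[0]), len(cube[0][0])
    agg = max if how == "max" else (lambda vals: sum(vals) / l)
    return [[agg([cube[b][i][j] for b in range(l)])
             for j in range(m)] for i in range(n)]

cube = [[[0.9, 0.1], [0.2, 0.7]],   # predictions of matcher 1
        [[0.5, 0.3], [0.4, 0.9]]]   # predictions of matcher 2
print(combine(cube, "avg"))  # ≈ [[0.7, 0.2], [0.3, 0.8]]
print(combine(cube, "max"))  # [[0.9, 0.3], [0.4, 0.9]]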

Applying constraints and domain knowledge to candidate matches

There may exist domain-specific knowledge helpful in the process of schema matching, expressed as a set of constraints that enable pruning candidate matches.
- Hard constraints must be applied: the schema matcher will not output any match that violates them.
- Soft constraints are of a more heuristic nature and may be violated in some schemas; the number of violated constraints should be minimized.
A cost is associated with each constraint: infinite for hard constraints, any positive number for soft constraints.

Example

Schema:
Book(ISBN, publisher, pubCountry, title, review)
Item(code, name, brand, origin, desc)
Inventory(ISBN, quantity, location)
InStore(code, availQuant)

Constraints:
T1: if A -> Item.name, then A is a key. Cost = ∞
T2: if sim(A1, B1) ≈ sim(A2, B1), A1 is next to A3, B1 is next to B2, and sim(A3, B2) > 0.8, then match A1 to B1. Cost = 2
T3: Average(length(desc)) ≥ 20. Cost = 1.5
T4: if ||{Ai ∈ attributes(R1) | ∃ Bj ∈ attributes(R2) s.t. sim(Ai, Bj) > 0.8}|| ≥ ||attributes(R1)||/2, then match table R1 to table R2. Cost = 1

Algorithms for applying constraints to the similarity matrix

- Applying constraints with A* search: guaranteed to find the optimal solution, but computationally more expensive.
- Applying constraints with local propagation: faster, but may get stuck in a local minimum.
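A sketch of cost-based constraint enforcement; the representation of matches and the sample constraint are assumptions for illustration:

import math

def total_cost(match: dict, constraints) -> float:
    # Each constraint maps a candidate match to a cost: math.inf for a
    # hard violation, a positive number for a soft one, 0 when satisfied.
    return sum(c(match) for c in constraints)

KEYS = {"Book.ISBN", "Item.code"}  # assumed key attributes of the source

def t1(match):
    # Hard constraint in the spirit of T1: whatever maps to Item.name must be a key.
    a = match.get("Item.name")
    return 0 if a is None or a in KEYS else math.inf

def best_match(candidate_matches, constraints):
    return min(candidate_matches, key=lambda m: total_cost(m, constraints))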


Match selector

Input: the similarity matrix. Output: a schema match, or the top few matches. If the matching system is interactive, the user can choose among the top k correspondences, and the system computes several possible matches.
Ex: schema1(shipAddr, shipPhone, billAddr, billPhone) and schema2(addr, phone). Both shipPhone -> phone and billPhone -> phone are candidates until the user indicates shipAddr -> addr; then shipPhone -> phone becomes more likely than billPhone -> phone.

The algorithm behind

The match selection problem can be formulated as an instance of finding a stable marriage: the elements of S1 are the men, the elements of S2 are the women, and sim(i, j) is the degree to which Ai and Bj desire each other. The goal is to find a stable match between men and women. A match is unstable if it contains Ai -> Bj and Ak -> Bl such that sim(i, l) > sim(i, j) and sim(i, l) > sim(k, l); if such couples existed, Ai and Bl would want to be matched together.
To produce a schema match without unhappy couples (see the sketch below):

Match = {}
Repeat:
  Let (i, j) be the highest value in sim such that Ai and Bj are not in Match
  Add Ai -> Bj to Match
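A sketch of this greedy selection over a similarity matrix (names and numbers are illustrative):

def select_matches(sim):
    # Repeatedly pick the highest remaining similarity whose
    # row and column are both still unmatched.
    pairs = sorted(((sim[i][j], i, j) for i in range(len(sim))
                    for j in range(len(sim[0]))), reverse=True)
    used_i, used_j, match = set(), set(), {}
    for s, i, j in pairs:
        if i not in used_i and j not in used_j:
            match[i] = j
            used_i.add(i)
            used_j.add(j)
    return match

sim = [[0.9, 0.8],   # e.g. shipPhone vs (phone, addr)
       [0.7, 0.2]]   # e.g. billPhone vs (phone, addr)
print(select_matches(sim))  # {0: 0, 1: 1}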

Applying machine learning techniques to enable the schema matcher to learn

Schema matching tasks are often repetitive. When working in the same domain, one starts to identify how common domain concepts get expressed in schemas, so the designer can create schema matches more quickly over time. So: can the schema matching itself also improve over time? That is, can a schema matcher learn from previous experience? Machine learning techniques can be applied to schema matching, thus enabling the matcher to improve over time.

Learning to match

Suppose there are n data sources s1, ..., sn whose schemas must be mapped to the mediated schema G. The goal is to train the system by manually providing it with schema matches on a small number of data sources (e.g., s1, ..., sm, where m is much smaller than n). The system generalizes from the training examples so that it can predict matches for the sources sm+1, ..., sn.

Components of the system

Training phase (1)

1. Manually specify mappings for several sources

2. Extract source data

3. Create training data for each base learner

4. Train the base learners

5. Train the meta-learner

Training phase (2)

Learning classifiers for elements in the mediated schema: the classifier for an element e of the mediated schema examines an element of a source schema and predicts whether it matches e or not. To create the classifiers, a machine learning algorithm is employed. Each machine learning algorithm typically considers only one aspect of the schema and has its own advantages and drawbacks, so a multi-strategy learning technique is used.

Multi-strategy learning

Training phase:
- Employ a set of learners l1, ..., lk. Each base learner creates a classifier for each element e of the mediated schema from its training examples.
- Use a meta-learner to learn weights for the different base learners: for each element e of the mediated schema and base learner l, the meta-learner computes a weight w_{e,l}. It can do so because the correct matches are known for the training examples.

Matching phase (see the sketch below):
- When presented with a schema S whose elements are e1', ..., et', apply the base learners to e1', ..., et'. Let p_{e,l}(e') be the prediction of learner l on whether e' matches e.
- Combine the learners:

p_e(e') = sum_{j=1..k} w_{e,lj} * p_{e,lj}(e')
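A sketch of the combination step; the learners and weights are stand-ins:

def combined_prediction(e_prime, base_learners, weights):
    # p_e(e') = sum_j w_{e,lj} * p_{e,lj}(e') for one mediated-schema element e.
    return sum(w * learner(e_prime) for learner, w in zip(base_learners, weights))

name_learner = lambda e: 0.2    # stand-in score from a name-based learner
value_learner = lambda e: 0.8   # stand-in score from a data-value learner
print(combined_prediction("area", [name_learner, value_learner], [0.3, 0.7]))
# 0.3*0.2 + 0.7*0.8 ≈ 0.62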

Example

[Figure: Charlie comes to town and asks the data integration system to find houses with 2 bedrooms priced under 300K. A mediated schema sits over three wrapped sources, each with its own source schema: realestate.com (source schema 1), homeseekers.com (source schema 2) and homes.com (source schema 3).]

Example

[Figure: learned hypotheses. Training data from realestate.com, whose schema (location, listed-price, phone, comments) is matched to the mediated schema (address, price, agent-phone, description). Sample rows include location "Miami, FL" / "Boston, MA", listed-price "$250,000" / "$110,000", phone "(305) 729 0831" / "(617) 253 1429", and comments "Fantastic house" / "Great location". The learners produce hypotheses such as: if "fantastic" and "great" occur frequently in the data values => description; if "phone" occurs in the name => agent-phone. The hypotheses are then applied to homes.com data: price "$550,000" / "$320,000", contact-phone "(278) 345 7215" / "(617) 335 2315", extra-info "Beautiful yard" / "Great beach".]

Training the Learners

[Figure: from the manually specified match between realestate.com (location, listed-price, phone, comments) and the mediated schema (address, price, agent-phone, description), training data is extracted for each base learner. The Name Learner is trained on name pairs such as (location, address), (listed-price, price), (phone, agent-phone), (comments, description); the Naive Bayes Learner is trained on value-label pairs such as ("Miami, FL", address), ("$250,000", price), ("(305) 729 0831", agent-phone), ("Fantastic house", description).]

Applying the Learners

[Figure: the trained learners are applied to homes.com, whose schema is (area, day-phone, extra-info). For the values of area ("Seattle, WA", "Kent, WA", "Austin, TX"), the Name Learner and Naive Bayes Learner produce predictions such as (address, 0.8), (description, 0.2) and (address, 0.6), (description, 0.4), which the Meta-Learner combines into (address, 0.7), (description, 0.3). For day-phone values ("(278) 345 7215", "(617) 335 2315", "(512) 427 1115"), the combined prediction is (agent-phone, 0.9), (description, 0.1).]

Base Learners

Input:
- schema information: name, proximity, structure, ...
- data information: value, format, ...
Output: a prediction weighted by a confidence score.
Examples:
- Name learner: agent-name => (name, 0.7), (phone, 0.3)
- Naive Bayes learner: "Kent, WA" => (address, 0.8), (name, 0.2); "Great location" => (description, 0.9), (address, 0.1)

Rule-based learner

Examines a set of training examples and computes a set of rules that can be applied to test instances; the rules can be represented as logical formulae or as decision trees. It works well in domains where a set of rules can accurately characterize instances of the class (e.g., identifying elements that adhere to certain formats).

Example: rule-based learner for identifying phone numbers (1)

Positive and negative examples of phone numbers:

Example           instance?   # of digits   position of (   position of )   position of -
(608)435-2322     yes         10            1               5               9
(60)445-284       no          9             1               4               8
849-7394          yes         7             -               -               4
(1343) 429-441    no          10            1               6               10
43 43 (12 1285)   no          10            5               12              -
5549902           no          7             -               -               -
(212) 433 8842    yes         10            1               5               -

Example: rule-based learner for identifying phone numbers (2)

A common method for learning rules is to create a decision tree, which encodes rules such as (see the sketch below):
- If i has 10 digits, a '(' in position 1 and a ')' in position 5, then yes
- If i has 7 digits but no '-' in position 4, then no
- ...
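A hand-coded sketch of the extracted features and two rules such a tree could encode (the second rule is the positive reading of the 7-digit rule above):

def features(s: str):
    # Features from the table: digit count and 1-based positions of '(', ')', '-'.
    pos = lambda ch: s.index(ch) + 1 if ch in s else None
    return {"digits": sum(c.isdigit() for c in s),
            "lparen": pos("("), "rparen": pos(")"), "dash": pos("-")}

def looks_like_phone(s: str) -> bool:
    f = features(s)
    if f["digits"] == 10 and f["lparen"] == 1 and f["rparen"] == 5:
        return True
    if f["digits"] == 7 and f["dash"] == 4:
        return True
    return False

print(looks_like_phone("(608)435-2322"))  # True
print(looks_like_phone("5549902"))        # False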

Naive Bayes Learner

Examines the tokens of a test instance and assigns to the instance the most likely class, given the occurrences of tokens in the training set. It is effective for recognizing text fields. Given a test instance, the learner converts it into a bag of tokens.

Naive Bayes learner at work

Given that c1, ..., cn are the elements of the mediated schema, the learner is given a test instance d = {w1, ..., wk} to classify. The goal is to assign d to the element cd with the highest posterior probability given d:

cd = arg max_{ci} P(ci | d)

Since P(ci | d) = P(d | ci) P(ci) / P(d), and P(d) does not depend on ci:

cd = arg max_{ci} [P(d | ci) P(ci) / P(d)] = arg max_{ci} [P(d | ci) P(ci)]

P(d | ci) and P(ci) must be estimated from the training data.

Estimation of P(d|ci) and P(ci)

P(ci) is approximated by the fraction of the training instances with label ci. To compute P(d|ci), assume that the tokens wj appear in d independently of each other given ci:

P(d | ci) = P(w1 | ci) P(w2 | ci) ... P(wk | ci)
P(wj | ci) = n(wj, ci) / n(ci)

where n(ci) is the total number of tokens in the training instances with label ci, and n(wj, ci) is the number of times token wj appears in all training instances with label ci.
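A compact sketch of this learner (no smoothing, so an unseen token zeroes out a class; the training pairs are made up):

from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, examples):  # examples: list of (tokens, label) pairs
        self.label_counts = Counter(label for _, label in examples)
        self.token_counts = defaultdict(Counter)
        for tokens, label in examples:
            self.token_counts[label].update(tokens)

    def predict(self, tokens):
        def score(c):
            n_c = sum(self.token_counts[c].values())                     # n(ci)
            p = self.label_counts[c] / sum(self.label_counts.values())  # P(ci)
            for w in tokens:
                p *= self.token_counts[c][w] / n_c                      # P(wj|ci)
            return p
        return max(self.label_counts, key=score)

nb = NaiveBayes()
nb.fit([(["fantastic", "house"], "description"),
        (["great", "location"], "description"),
        (["miami", "fl"], "address")])
print(nb.predict(["great", "house"]))  # description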

Conclusion of Naive Bayes learner

Naive Bayes performs well in many domains, in spite of the fact that the independence assumption is not always valid. It works best when there are tokens strongly indicative of the correct label because they appear in one element and not in the others (ex: "beautiful" and "fantastic" to describe houses), or when there are only weakly suggestive tokens, but many of them. It does not work well on short or numeric fields.

Recap: Multi-strategy learning

Training phase:
- Employ a set of learners l1, ..., lk. Each base learner creates a classifier for each element e of the mediated schema from its training examples.
- Use a meta-learner to learn weights for the different base learners: for each element e of the mediated schema and base learner l, the meta-learner computes a weight w_{e,l}.

Matching phase:
- When presented with a schema S whose elements are e1', ..., et', apply the base learners to e1', ..., et'. Let p_{e,l}(e') be the prediction of learner l on whether e' matches e.
- Combine the learners:

p_e(e') = sum_{j=1..k} w_{e,lj} * p_{e,lj}(e')

Training the meta-learner

The meta-learner learns the weights to attach to each of the base learners from the training examples; the weights can be different for every mediated-schema element. How does it work?
- It asks the base learners for predictions on the training examples.
- It judges how well each learner performed in providing the prediction for each mediated-schema element.
- It assigns to each combination (mediated-schema element ci, base learner lj) a weight indicating how much it trusts that learner's predictions regarding ci.
Any classification algorithm can be used to compute the weights.

Outline

- Base matchers
- Combining match predictions
- Applying constraints and domain knowledge to candidate schema matches
- Match selector
- Applying machine learning techniques to enable the schema matcher to learn
- Discovering m-m matches
- From matches to mappings

From matches to mappings

Schema matches give correspondences between the source and the target schemas. Now we must specify the operations to be performed on the source data so that it can be transformed into the target data. Using a DBMS as the transformation engine, creating mappings becomes a process of query discovery: find the queries, using joins, unions, filtering and aggregates, that correctly transform the data into the desired schema.

What follows is an algorithm that explores the space of possible schema mappings, used in the CLIO system.

User interaction

Creating mappings is a complex process. The system generates the mapping expressions automatically: the possible mappings are produced using the semantics conveyed by constraints such as foreign keys. The system then shows the designer example data instances so that she can verify which are the right mappings.

Motivating example

Question: union professor salaries with employee salaries, or join the salaries computed from the two correspondences? (Here two correspondences, f1 and f2, both target the salary attribute of the target relation Personnel: f1 maps {PayRate.HrRate, WorksOn.Hrs} and f2 maps the professors' salary.)

Possible mappings

If the attribute ProjRank is a foreign key referencing the relation PayRate, the mapping would be:

SELECT P.HrRate * W.Hrs
FROM PayRate P, WorksOn W
WHERE P.Rank = W.ProjRank

If instead ProjRank is not a foreign key of PayRate, but the Name attribute of WorksOn is a foreign key of Student and the Yr attribute of Student is a foreign key of PayRate (the salary depends on the year of the student), then the following query should be chosen:

SELECT P.HrRate * W.Hrs
FROM PayRate P, WorksOn W, Student S
WHERE W.Name = S.Name AND S.Yr = P.Rank

It is not clear which join path to choose for mapping f1!

Possible mappings (2)

One interpretation of f2 is that values produced from f1 should be joined with those produced by f2; but then most of the values in the source DB would not be mapped to the target. Another interpretation is that there are two ways of computing the salary of employees: one applying to professors and another to the other employees. The corresponding mapping is:

SELECT P.HrRate * W.Hrs
FROM PayRate P, WorksOn W, Student S
WHERE W.Name = S.Name AND S.Yr = P.Rank
UNION ALL
SELECT Salary
FROM Professor

Principles to guide the mapping construction

- If possible, all values in the source should appear in the target: choose a union rather than a join.
- If possible, a value from the source should contribute only once to the target.
- Associations between values that exist in the source should not be lost: use a join rather than a cartesian product to compute the salary value using f1.

Possible mappings (3)

Consider a few more correspondences:

f3: Professor(Id) -> Personnel(Id)
f4: Professor(Name) -> Personnel(Name)
f5: Address(Addr) -> Personnel(Addr)
f6: Student(Name) -> Personnel(Name)

They fall into two candidate sets of correspondences:
- f2, f3, f4 and f5: map from Professor to Personnel
- f1 and f6: map from the other employees to Personnel

The algorithm explores the possible joins within every candidate set and considers how to union the transformations corresponding to each candidate set.

Possible mappings (4)

Given f3-f6, the most reasonable mapping is:

SELECT P.Id, P.Name, P.Sal, A.Addr
FROM Professor P, Address A
WHERE A.Id = P.Id
UNION ALL
SELECT NULL as Id, S.Name, P.HrRate*W.Hrs, NULL as Addr
FROM Student S, PayRate P, WorksOn W
WHERE S.Name = W.Name AND S.Yr = P.Rank

Possible mappings (5)

But this one is also possible:

SELECT NULL as Id, NULL as Name, NULL as Sal, A.Addr
FROM Address A
UNION ALL
SELECT P.Id, P.Name, P.Sal, NULL as Addr
FROM Professor P
UNION ALL
SELECT NULL as Id, NULL as Name, NULL as Sal, NULL as Addr
FROM Student S
...

Query discovery algorithm – Goal

Eliminate unlikely mappings from the large search space of candidate mappings, and identify correct mappings a user might not otherwise have considered.

Query discovery algorithm – Characteristics

- It is interactive: it explores the space of possible mappings, proposes the most likely ones to the user, and accepts user feedback to guide it in the right direction.
- It uses heuristics, which can be replaced by better ones if available.

Query discovery algorithm – Input

A set of correspondences M = {fi: (Ai -> Bi)}, where Ai is a set of attributes in the source S1 and Bi is one attribute of the target S2. There may also be filters on source attributes: a range restriction on an attribute, an aggregate of an attribute, etc.

Query discovery algorithm – 1st phase

Create the set V of all candidate sets: subsets of M that contain at most one correspondence per attribute of S2. Each candidate set represents one way of computing the attributes of S2. If a set covers all attributes of S2, it is called a complete set. The elements of V do not need to be disjoint.

Example

Given the correspondences:
f1: S1.A -> T.C
f2: S2.A -> T.D
f3: S2.B -> T.C
the complete candidate sets are {{f1, f2}, {f2, f3}}. The singleton sets {f1}, {f2} and {f3} are also candidate sets.
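A runnable sketch of this first phase on the example above (the dictionary encoding of M is my assumption):

from itertools import combinations

M = {"f1": "T.C", "f2": "T.D", "f3": "T.C"}  # correspondence -> target attribute
target_attrs = {"T.C", "T.D"}

def candidate_sets(M):
    # Subsets of M with at most one correspondence per target attribute.
    names = list(M)
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            targets = [M[f] for f in subset]
            if len(targets) == len(set(targets)):
                yield set(subset)

all_sets = list(candidate_sets(M))
complete = [v for v in all_sets if {M[f] for f in v} == target_attrs]
print(complete)  # [{'f1', 'f2'}, {'f2', 'f3'}]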

Query discovery algorithm – 2nd phase

Consider the candidate sets in V and search for the best set of joins within each candidate set. Take a candidate set v in V and suppose (Ai -> Bi) is in v, where Ai includes attributes from multiple relations in S1. Then search for a join path connecting the relations mentioned in Ai, using the following:

Heuristic: A join path can be either:
- a path through foreign keys,
- a path proposed by inspecting previous queries on S1, or
- a path discovered by mining the data for joinable columns in S1.

Query discovery algorithm – 2nd phase (cont.)

The set of candidate sets in V for which join paths are found is denoted by V'. When there are multiple join paths, use the following to select among them:

Heuristic: Prefer paths through foreign keys. If there are multiple such paths, choose one that involves an attribute on which there is a filter in a correspondence, if one exists. To further rank paths, favor the join path where the estimated difference between the outer join and the inner join is smallest: this favors joins with the least number of dangling tuples.

Query discovery algorithm – 3rd phase

Examine the candidate sets in V' and try to combine them by union so that they cover all the correspondences in M. Search for covers of the correspondences: a subset T of V' is a cover if it includes all the correspondences in M and is minimal (no candidate set can be removed from T while still obtaining a cover).

Example: V' = {{f1, f2}, {f2, f3}, {f1}, {f2}, {f3}}. Possible covers include:
T1 = {{f1}, {f2, f3}}
T2 = {{f1, f2}, {f2, f3}}

Query discovery algorithm – 3rd phase (cont.)

If there are multiple possible covers, use the following:

Heuristic: Choose the cover with the smallest number of candidate sets (a simpler mapping should be more appropriate). If more than one cover has the same number of candidate sets, choose the one that includes more attributes of S2 (to cover more of that schema).

Query discovery algorithm – 4th phase

Creates the schema mapping expression as an SQL query: first create an SQL query for each candidate set in the selected cover, and then union them. For a candidate set v:
- the attributes of S2 in v are put in the SELECT clause;
- each of the relations in the join paths found for v is put in the FROM clause;
- the corresponding join predicates are put in the WHERE clause;
- any filters associated with the correspondences in v are also added to the WHERE clause.
Finally, take the union of the queries for each candidate set in the cover. Computing the SQL mapping expressions for T1 and T2 is left as an exercise; a sketch of the assembly step follows.
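A sketch of the query-assembly step for one candidate set, instantiated on the Personnel example above (the helper name is mine):

def candidate_set_sql(select_attrs, relations, join_preds, filters=()):
    sql = "SELECT " + ", ".join(select_attrs)
    sql += "\nFROM " + ", ".join(relations)
    preds = list(join_preds) + list(filters)
    if preds:
        sql += "\nWHERE " + " AND ".join(preds)
    return sql

q1 = candidate_set_sql(["P.Id", "P.Name", "P.Sal", "A.Addr"],
                       ["Professor P", "Address A"], ["A.Id = P.Id"])
q2 = candidate_set_sql(["NULL as Id", "S.Name", "P.HrRate*W.Hrs", "NULL as Addr"],
                       ["Student S", "PayRate P", "WorksOn W"],
                       ["S.Name = W.Name", "S.Yr = P.Rank"])
print(q1 + "\nUNION ALL\n" + q2)  # the "most reasonable mapping" from above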

The CLIO tool

http://www.almaden.ibm.com/cs/projects/criollo/

http://birte08.stanford.edu/ppts/11-ho.pd

References

Chapter 4 of the draft of the book "Principles of Data Integration" by AnHai Doan, Alon Halevy, Zachary Ives (in preparation).

Erhard Rahm and Philip A. Bernstein. "A survey of approaches to automatic schema matching". VLDB Journal, 10(4):334-350, 2001.

AnHai Doan, Pedro Domingos, Alon Halevy. "Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach". SIGMOD 2001.

R.J. Miller, L.M. Haas, and M. Hernandez. "Schema Matching as Query Discovery". VLDB 2000.

Renée J. Miller, Mauricio A. Hernandez, Laura M. Haas, Ling-Ling Yan, C. T. Howard Ho, Ronald Fagin, Lucian Popa. "The Clio Project: Managing Heterogeneity". SIGMOD Record 30(1), March 2001, pp. 78-83.

Virtual Data Integration in industry

Known as EII (Enterprise Information Integration); different from EAI (Enterprise Application Integration), and different from ETL (Extraction, Transformation and Loading) in Data Warehousing.

Recommended reference: A. Halevy et al., "Enterprise Information Integration: Successes, Challenges and Controversies", SIGMOD 2005.