Learning Source Mappings
Zachary G. Ives
University of Pennsylvania
CIS 650 – Database & Information Systems
October 27, 2008
LSD Slides courtesy AnHai Doan
2
Administrivia
Midterm due Thursday: 5-10 pages (single-spaced, 10-12 pt)
3
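Semantic Mappings between Schemas
Mediated & source schemas = XML DTDs
[Figure: two house schema trees. One has house with children location, contact (name, phone), full-baths, half-baths. The other has house with children address, contact-info (agent-name, agent-phone), num-baths. Edges between the trees are labeled as 1-1 mappings and non 1-1 mappings.]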
4
The LSD (Learning Source Descriptions) Approach
Suppose user wants to integrate 100 data sources
1. User manually creates mappings for a few sources, say 3, and shows LSD these mappings
2. LSD learns from the mappings
   “Multi-strategy” learning incorporates many types of info in a general way
   Knowledge of constraints further helps
3. LSD proposes mappings for remaining 97 sources
5
Example
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com data
  location: Miami, FL; Boston, MA; ...
  listed-price: $250,000; $110,000; ...
  phone: (305) 729 0831; (617) 253 1429; ...
  comments: Fantastic house; Great location; ...
Learned hypotheses
  If “phone” occurs in the name => agent-phone
  If “fantastic” & “great” occur frequently in data values => description
homes.com data
  price: $550,000; $320,000; ...
  contact-phone: (278) 345 7215; (617) 335 2315; ...
  extra-info: Beautiful yard; Great beach; ...
6
LSD’s Multi-Strategy Learning
Use a set of base learners
  each exploits well certain types of information
Match schema elements of a new source
  apply the base learners
  combine their predictions using a meta-learner
Meta-learner
  uses training sources to measure base learner accuracy
  weighs each learner based on its accuracy
7
Base Learners
Input
  schema information: name, proximity, structure, ...
  data information: value, format, ...
Output
  prediction weighted by confidence score
Examples
  Name learner
    agent-name => (name,0.7), (phone,0.3)
  Naive Bayes learner
    “Kent, WA” => (address,0.8), (name,0.2)
    “Great location” => (description,0.9), (address,0.1)
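To make the data-based learner concrete, here is a minimal sketch of a Naive Bayes learner over the words of a data value. The tokenizer, the add-one smoothing, and the class name `NaiveBayesLearner` are illustrative assumptions, not LSD's actual implementation; the training pairs mirror the realestate.com values used in these slides.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Assumed tokenizer: lowercase words and digit runs.
    return re.findall(r"[a-z]+|\d+", text.lower())


class NaiveBayesLearner:
    """Hypothetical word-based Naive Bayes learner (illustration only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter()            # label -> number of examples
        self.vocab = set()

    def train(self, examples):
        # examples: iterable of (data value, mediated-schema label)
        for value, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(value):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, value):
        # Returns {label: confidence}, with confidences summing to 1.
        tokens = tokenize(value)
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokens:
                # Add-one smoothing so unseen words do not zero out a label.
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}


learner = NaiveBayesLearner()
learner.train([
    ("Miami, FL", "address"),
    ("Boston, MA", "address"),
    ("$250,000", "price"),
    ("(305) 729 0831", "agent-phone"),
    ("Fantastic house", "description"),
    ("Great location", "description"),
])
prediction = learner.predict("Great beach")
```

With this tiny training set, `prediction` ranks description highest for “Great beach” because “great” was seen only in description values. The confidence values themselves depend on the smoothing choice and are not meant to match the numbers on the slides.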
8
Training the Learners
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com data
  <location> Miami, FL </> <listed-price> $250,000</> <phone> (305) 729 0831</> <comments> Fantastic house </>
  <location> Boston, MA </> <listed-price> $110,000</> <phone> (617) 253 1429</> <comments> Great location </>
Name Learner training examples
  (location, address), (listed-price, price), (phone, agent-phone), (comments, description), ...
Naive Bayes Learner training examples
  (“Miami, FL”, address), (“$ 250,000”, price), (“(305) 729 0831”, agent-phone), (“Fantastic house”, description), ...
9
Applying the Learners
Mediated schema: address, price, agent-phone, description
Schema of homes.com: area, day-phone, extra-info
homes.com data
  <area>Seattle, WA</> <area>Kent, WA</> <area>Austin, TX</>
  <day-phone>(278) 345 7215</> <day-phone>(617) 335 2315</> <day-phone>(512) 427 1115</>
  <extra-info>Beautiful yard</> <extra-info>Great beach</> <extra-info>Close to Seattle</>
Each instance goes through the Name Learner and Naive Bayes, then the Meta-Learner
  area, per instance: (address,0.8), (description,0.2); (address,0.6), (description,0.4); (address,0.7), (description,0.3)
  area, combined: (address,0.7), (description,0.3)
  day-phone: (agent-phone,0.9), (description,0.1)
  extra-info: (address,0.6), (description,0.4)
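The three per-instance predictions for area, (address,0.8), (address,0.6), (address,0.7), are consistent with a simple average producing the combined (address,0.7), (description,0.3). The sketch below assumes plain averaging over instances; the function name `average_predictions` is hypothetical, and the slides do not state the exact combination rule.

```python
from collections import defaultdict


def average_predictions(instance_predictions):
    # Average per-instance confidence scores into one prediction
    # for the whole source element (assumed combination rule).
    totals = defaultdict(float)
    for pred in instance_predictions:
        for label, score in pred.items():
            totals[label] += score
    n = len(instance_predictions)
    return {label: total / n for label, total in totals.items()}


area_instances = [
    {"address": 0.8, "description": 0.2},  # Seattle, WA
    {"address": 0.6, "description": 0.4},  # Kent, WA
    {"address": 0.7, "description": 0.3},  # Austin, TX
]
area_prediction = average_predictions(area_instances)
```

Up to floating-point rounding, `area_prediction` is address 0.7 and description 0.3, matching the combined prediction for area.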
10
Domain Constraints
Impose semantic regularities on sources
  verified using schema or data
Examples
  a = address & b = address => a = b
  a = house-id => a is a key
  a = agent-info & b = agent-name => b is nested in a
Can be specified up front
  when creating mediated schema
  independent of any actual source schema
11
The Constraint Handler
Predictions from Meta-Learner
  area: (address,0.7), (description,0.3)
  contact-phone: (agent-phone,0.9), (description,0.1)
  extra-info: (address,0.6), (description,0.4)
Domain Constraints
  a = address & b = address => a = b
Candidate mappings and their scores
  area: address, contact-phone: agent-phone, extra-info: address        0.7 × 0.9 × 0.6 = 0.378
  area: address, contact-phone: agent-phone, extra-info: description    0.7 × 0.9 × 0.4 = 0.252
  area: description, contact-phone: description, extra-info: description    0.3 × 0.1 × 0.4 = 0.012
Can specify arbitrary constraints
User feedback = domain constraint
  ad-id = house-id
Extended to handle domain heuristics
  a = agent-phone & b = agent-name => a & b are usually close to each other
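The scores 0.378, 0.252, and 0.012 are products of the per-element confidences, and the constraint that only one element may map to address rules out the 0.378 candidate. The sketch below reproduces that arithmetic by exhaustive enumeration, which is only practical for tiny examples and is not claimed to be LSD's actual search procedure. The names `best_mapping` and `at_most_one_address` are hypothetical.

```python
from itertools import product
from math import prod

# Meta-learner predictions from the slide.
predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}


def at_most_one_address(mapping):
    # Encodes: a = address & b = address => a = b.
    return sum(1 for label in mapping.values() if label == "address") <= 1


def best_mapping(predictions, constraints):
    # Score every complete assignment by the product of its confidences
    # and keep the best one that satisfies all constraints.
    elements = list(predictions)
    best, best_score = None, 0.0
    for labels in product(*(predictions[e] for e in elements)):
        mapping = dict(zip(elements, labels))
        if not all(check(mapping) for check in constraints):
            continue
        score = prod(predictions[e][label] for e, label in mapping.items())
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score


unconstrained, unconstrained_score = best_mapping(predictions, [])
constrained, constrained_score = best_mapping(predictions, [at_most_one_address])
```

Without constraints the highest-scoring candidate maps both area and extra-info to address (score 0.378). With the constraint applied, the best remaining candidate maps extra-info to description (score 0.252).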
12
Putting It All Together: LSD System
[Figure: LSD architecture. Training Phase: the mediated schema, source schemas, and data listings yield training data for base learners L1, L2, ..., Lk. Matching Phase: Mapping Combination feeds the Constraint Handler, which also takes Domain Constraints and User Feedback.]
Base learners: Name Learner, XML learner, Naive Bayes, Whirl learner
Meta-learner
  uses stacking [Ting&Witten99, Wolpert92]
  returns linear weighted combination of base learners’ predictions
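A linear weighted combination of base learners' predictions can be sketched as follows. This is a deliberate simplification: the weights here are just each learner's normalized accuracy on held-out training examples, whereas stacking proper fits the weights by regression. All names and the held-out predictions are hypothetical.

```python
def top_label(prediction):
    return max(prediction, key=prediction.get)


def accuracy_weights(learner_outputs, true_labels):
    # learner_outputs: {learner name: [prediction dict per held-out example]}
    # Assumption: weight each learner by its normalized held-out accuracy.
    raw = {}
    for name, preds in learner_outputs.items():
        correct = sum(top_label(p) == y for p, y in zip(preds, true_labels))
        raw[name] = correct / len(true_labels)
    total = sum(raw.values())
    if total == 0:
        return {name: 1.0 / len(raw) for name in raw}  # fall back to uniform
    return {name: acc / total for name, acc in raw.items()}


def combine(predictions, weights):
    # Linear weighted combination of the base learners' confidence scores.
    combined = {}
    for name, pred in predictions.items():
        for label, score in pred.items():
            combined[label] = combined.get(label, 0.0) + weights[name] * score
    return combined


true_labels = ["address", "description", "agent-phone", "price"]
held_out = {
    "name": [
        {"address": 0.9, "price": 0.1},
        {"price": 0.6, "description": 0.4},
        {"agent-phone": 0.8, "address": 0.2},
        {"address": 0.7, "price": 0.3},
    ],
    "naive_bayes": [
        {"address": 0.8, "description": 0.2},
        {"description": 0.9, "address": 0.1},
        {"agent-phone": 0.7, "price": 0.3},
        {"address": 0.6, "price": 0.4},
    ],
}
weights = accuracy_weights(held_out, true_labels)
combined = combine(
    {
        "name": {"address": 0.6, "description": 0.4},
        "naive_bayes": {"address": 0.8, "description": 0.2},
    },
    weights,
)
```

Here the name learner is right on 2 of 4 held-out examples and Naive Bayes on 3 of 4, so the weights come out to 0.4 and 0.6, and the combined prediction leans toward the more accurate learner.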
13
Empirical Evaluation
Four domains
  Real Estate I & II, Course Offerings, Faculty Listings
For each domain
  create mediated DTD & domain constraints
  choose five sources
  extract & convert data listings into XML
  mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48
Ten runs for each experiment - in each run:
  manually provide 1-1 mappings for 3 sources
  ask LSD to propose mappings for remaining 2 sources
  accuracy = % of 1-1 mappings correctly identified
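The accuracy measure, the percentage of 1-1 mappings correctly identified, can be computed as in this small sketch. The function name `matching_accuracy` and the example gold and proposed mappings are hypothetical illustrations, not experimental data.

```python
def matching_accuracy(proposed, gold):
    # Percentage of gold 1-1 mappings that the proposed mapping gets right.
    correct = sum(1 for elem, label in gold.items() if proposed.get(elem) == label)
    return 100.0 * correct / len(gold)


gold = {
    "area": "address",
    "day-phone": "agent-phone",
    "extra-info": "description",
}
proposed = {
    "area": "address",
    "day-phone": "agent-phone",
    "extra-info": "address",  # wrong: should be description
}
accuracy = matching_accuracy(proposed, gold)
```

Two of the three source elements are matched correctly here, so the accuracy is two thirds, expressed as a percentage.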
14
LSD Matching Accuracy
[Figure: bar chart of average matching accuracy (%), on a 0 - 100 scale, for Real Estate I, Real Estate II, Course Offerings, and Faculty Listings.]
LSD’s accuracy: 71 - 92%
Best single base learner: 42 - 72%
+ Meta-learner: + 5 - 22%
+ Constraint handler: + 7 - 13%
+ XML learner: + 0.8 - 6%
15
LSD Summary
Applies machine learning to schema matching
  use of multi-strategy learning
  domain & user-specified constraints
Probably the most flexible means of doing schema matching today in a semi-automated way
Complementary project: CLIO (IBM Almaden) uses key and foreign-key constraints to help the user build mappings
Since LSD…
A lot more work on the following:
  Alternative schemes for putting together info from base learners
  Hierarchical learners
    Compare two trees: parent nodes are likely to be the same if child nodes are similar; child nodes are likely to be the same if parent nodes are similar
  Using mass collaboration – humans do the work
And a lot of work on entity resolution or record matching
  Uses similar ideas to try to determine when two records are referring to the same entity
16
17
Jumping Up a Level
We’ve now seen how heterogeneous data makes a huge difference …
… in the need for relating different kinds of attributes
  Mapping languages
  Mapping tools
  Query reformulation
… and in query processing
  Adaptive query processing
Next time we’ll go even further, and start to consider search – focusing on Google