The LSD Project
Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach
AnHai Doan, Pedro Domingos, Alon Halevy
University of Washington

Data Integration
Example query: Find houses with four bathrooms priced under $500,000
[Diagram: the query is posed against the mediated schema; source schema 1, source schema 2, and source schema 3 each sit behind a wrapper; the sources shown are realestate.com, homes.com, and homeseekers.com]

Semantic Mappings between Schemas
Mediated & source schemas = XML DTDs
[Diagram: two schema trees connected by arrows labeled "1-1 mapping" and "non 1-1 mapping"]
– house: location, contact (name, phone), full-baths, half-baths
– house: address, contact-info (agent-name, agent-phone), num-baths

Current State of Affairs
Finding semantic mappings is now the bottleneck!
– largely done by hand
– labor intensive & error prone
Will only be exacerbated
– data sharing & XML become pervasive
– proliferation of DTDs
– translation of legacy data
– reconciling ontologies on the semantic web
Need (semi-)automatic approaches to scale up!

The LSD (Learning Source Descriptions) Approach
Suppose user wants to integrate 100 data sources
1. User
– manually creates mappings for a few sources, say 3
– shows LSD these mappings
2. LSD learns from the mappings
3. LSD proposes mappings for remaining 97 sources

Example
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
Data from realestate.com
– location: Miami, FL; Boston, MA; ...
– listed-price: $250,000; $110,000; ...
– phone: (305) 729 0831; (617) 253 1429; ...
– comments: Fantastic house; Great location; ...
Learned hypotheses
– If “phone” occurs in the name => agent-phone
– If “fantastic” & “great” occur frequently in data values => description
Data from homes.com
– price: $550,000; $320,000; ...
– contact-phone: (278) 345 7215; (617) 335 2315; ...
– extra-info: Beautiful yard; Great beach; ...

Our Contributions
1. Use of multi-strategy learning
– well-suited to exploit multiple types of knowledge
– highly modular & extensible
2. Extend learning to incorporate constraints
– handle a wide range of domain & user-specified constraints
3. Develop XML learner
– exploit hierarchical nature of XML

Multi-Strategy Learning
Use a set of base learners
– each exploits certain types of information well
Match schema elements of a new source
– apply the base learners
– combine their predictions using a meta-learner
Meta-learner
– uses training sources to measure base learner accuracy
– weighs each learner based on its accuracy

Base Learners
Input
– schema information: name, proximity, structure, ...
– data information: value, format, ...
Output
– prediction weighted by confidence score
Examples
– Name learner
  – agent-name => (name,0.7), (phone,0.3)
– Naive Bayes learner
  – “Kent, WA” => (address,0.8), (name,0.2)
  – “Great location” => (description,0.9), (address,0.1)

Training the Learners
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
Data listings from realestate.com
<location> Miami, FL </> <listed-price> $250,000</> <phone> (305) 729 0831</> <comments> Fantastic house </>
<location> Boston, MA </> <listed-price> $110,000</> <phone> (617) 253 1429</> <comments> Great location </>
Training examples for the Name Learner
(location, address) (listed-price, price) (phone, agent-phone) (comments, description) ...
Training examples for the Naive Bayes Learner
(“Miami, FL”, address) (“$250,000”, price) (“(305) 729 0831”, agent-phone) (“Fantastic house”, description) ...
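As a rough illustration of how the Naive Bayes learner could be trained on (data value, mediated-schema element) pairs like the ones above, here is a minimal sketch. The tokenizer, the add-one smoothing, and the class name `NaiveBayesLearner` are assumptions made for this example; the talk does not specify them.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(value):
    """Lowercase a data value and split it into word and number tokens."""
    return re.findall(r"[a-z]+|\d+", value.lower())


class NaiveBayesLearner:
    """Multinomial Naive Bayes over tokens, with add-one smoothing (assumed)."""

    def fit(self, examples):
        # examples: iterable of (data value, mediated-schema label) pairs
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for value, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(value):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, value):
        """Return {label: confidence}; confidences sum to 1."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokenize(value):
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        # Normalize in log space to avoid underflow.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}


# Training pairs taken from the realestate.com listings on the slide.
training = [
    ("Miami, FL", "address"), ("Boston, MA", "address"),
    ("$250,000", "price"), ("$110,000", "price"),
    ("(305) 729 0831", "agent-phone"), ("(617) 253 1429", "agent-phone"),
    ("Fantastic house", "description"), ("Great location", "description"),
]
learner = NaiveBayesLearner().fit(training)
prediction = learner.predict("Great house")
best_label = max(prediction, key=prediction.get)
```

With only eight training values the confidences are not meaningful in themselves; the point is the shape of the output, a confidence score per mediated-schema element.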

Applying the Learners
Mediated schema: address, price, agent-phone, description
Schema of homes.com: area, day-phone, extra-info
Data listings from homes.com
<area>Seattle, WA</> <area>Kent, WA</> <area>Austin, TX</>
<day-phone>(278) 345 7215</> <day-phone>(617) 335 2315</> <day-phone>(512) 427 1115</>
<extra-info>Beautiful yard</> <extra-info>Great beach</> <extra-info>Close to Seattle</>
Predictions (Name Learner and Naive Bayes, combined by the Meta-Learner)
– area, per data value: (address,0.8), (description,0.2); (address,0.6), (description,0.4); (address,0.7), (description,0.3)
– area, combined: (address,0.7), (description,0.3)
– day-phone: (agent-phone,0.9), (description,0.1)
– extra-info: (address,0.6), (description,0.4)
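The slide does not state how the predictions for individual data values become a single prediction for a schema element. A minimal sketch, assuming plain averaging (an assumption, chosen because the three per-value predictions listed average to (address,0.7), (description,0.3)):

```python
from collections import defaultdict


def average_predictions(instance_predictions):
    """Average a list of {label: confidence} dicts into one dict."""
    totals = defaultdict(float)
    for pred in instance_predictions:
        for label, conf in pred.items():
            totals[label] += conf
    n = len(instance_predictions)
    return {label: total / n for label, total in totals.items()}


# The three per-value predictions shown on the slide.
area_values = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.6, "description": 0.4},
    {"address": 0.7, "description": 0.3},
]
area_prediction = average_predictions(area_values)
```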

Domain Constraints
Impose semantic regularities on sources
– verified using schema or data
Examples
– a = address & b = address => a = b
– a = house-id => a is a key
– a = agent-info & b = agent-name => b is nested in a
Can be specified up front
– when creating mediated schema
– independent of any actual source schema

The Constraint Handler
Predictions from Meta-Learner
– area: (address,0.7), (description,0.3)
– contact-phone: (agent-phone,0.9), (description,0.1)
– extra-info: (address,0.6), (description,0.4)
Domain constraint
– a = address & b = address => a = b
Candidate mapping combinations
– area: address, contact-phone: agent-phone, extra-info: address: 0.7 × 0.9 × 0.6 = 0.378 (two elements map to address, so the constraint is violated)
– area: address, contact-phone: agent-phone, extra-info: description: 0.7 × 0.9 × 0.4 = 0.252
– area: description, contact-phone: description, extra-info: description: 0.3 × 0.1 × 0.4 = 0.012
Can specify arbitrary constraints
User feedback = domain constraint
– ad-id = house-id
Extended to handle domain heuristics
– a = agent-phone & b = agent-name => a & b are usually close to each other
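The arithmetic on this slide can be reproduced with a small sketch: score each mapping combination as the product of the meta-learner confidences, then keep the best combination that satisfies the domain constraint. Exhaustive enumeration and the function names are assumptions made for illustration; the slide does not describe how the handler searches the space.

```python
from itertools import product
from math import prod

# Meta-learner predictions from the slide.
predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}


def at_most_one_address(mapping):
    """Constraint: a = address & b = address => a = b."""
    return sum(1 for label in mapping.values() if label == "address") <= 1


def best_mapping(predictions, constraints=()):
    """Score every combination exhaustively; return the best valid one."""
    elements = list(predictions)
    best, best_score = None, -1.0
    for labels in product(*(predictions[e] for e in elements)):
        mapping = dict(zip(elements, labels))
        if not all(check(mapping) for check in constraints):
            continue
        score = prod(predictions[e][label] for e, label in mapping.items())
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score


unconstrained, unconstrained_score = best_mapping(predictions)
constrained, constrained_score = best_mapping(predictions, [at_most_one_address])
```

Without the constraint the top combination maps both area and extra-info to address (score 0.378); with it, extra-info falls back to description (score 0.252). Enumeration grows exponentially with the number of elements, so this only suits toy inputs.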

Putting It All Together: the LSD System
[Architecture diagram with two parts, Training Phase and Matching Phase. Labels: Mediated schema; Source schemas; Data listings; Training data for base learners; base learners L1, L2, ..., Lk; Mapping Combination; Constraint Handler; Domain Constraints; User Feedback]
Base learners: Name Learner, XML learner, Naive Bayes, Whirl learner
Meta-learner
– uses stacking [Ting&Witten99, Wolpert92]
– returns linear weighted combination of base learners’ predictions
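A linear weighted combination of base learners' predictions might look like the sketch below. The weights and the example confidences are hypothetical values picked for illustration, and using one weight per learner is a simplifying assumption; in the talk the weights come from stacking on the training sources.

```python
def combine(base_predictions, weights):
    """Linear weighted combination of base learners' predictions.

    base_predictions: {learner name: {label: confidence}}
    weights: {learner name: weight}
    """
    combined = {}
    for learner, pred in base_predictions.items():
        for label, conf in pred.items():
            combined[label] = combined.get(label, 0.0) + weights[learner] * conf
    return combined


# Hypothetical predictions for one source element from two base learners.
base_predictions = {
    "name_learner": {"address": 0.6, "description": 0.4},
    "naive_bayes": {"address": 0.8, "description": 0.2},
}
# Hypothetical weights; a more accurate learner would receive a larger weight.
weights = {"name_learner": 0.4, "naive_bayes": 0.6}
combined = combine(base_predictions, weights)
```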

Empirical Evaluation
Four domains
– Real Estate I & II, Course Offerings, Faculty Listings
For each domain
– create mediated DTD & domain constraints
– choose five sources
– extract & convert data listings into XML
– mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48
Ten runs for each experiment - in each run:
– manually provide 1-1 mappings for 3 sources
– ask LSD to propose mappings for remaining 2 sources
– accuracy = % of 1-1 mappings correctly identified
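The accuracy measure can be written as a short function. This is a sketch under one assumption: the denominator is the number of source elements that have a reference 1-1 mapping. The example mappings are hypothetical.

```python
def matching_accuracy(proposed, correct):
    """Percentage of reference 1-1 mappings that the proposal gets right."""
    if not correct:
        raise ValueError("no reference mappings to score against")
    hits = sum(1 for elem, label in correct.items() if proposed.get(elem) == label)
    return 100.0 * hits / len(correct)


# Hypothetical reference and proposed mappings for one source.
correct = {
    "area": "address",
    "contact-phone": "agent-phone",
    "extra-info": "description",
}
proposed = {
    "area": "address",
    "contact-phone": "agent-phone",
    "extra-info": "address",
}
accuracy = matching_accuracy(proposed, correct)  # two of three mappings match
```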

High Matching Accuracy
[Bar chart: average matching accuracy (%) for Real Estate I, Real Estate II, Course Offerings, and Faculty Listings]
LSD’s accuracy: 71 - 92%
Best single base learner: 42 - 72%
+ Meta-learner: + 5 - 22%
+ Constraint handler: + 7 - 13%
+ XML learner: + 0.8 - 6%

Performance Sensitivity
[Line chart: average matching accuracy (%) versus number of data listings per source, from 0 to 500]

Contribution of Schema vs. Data
[Bar chart: average matching accuracy (%) for Real Estate I, Real Estate II, Course Offerings, and Faculty Listings]
More experiments in the paper!

Related Work
Rule-based approaches
– TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98], CUPID [Madhavan et al. 01]
– utilize only schema information
Learner-based approaches
– SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95]
– employ a single learner, limited applicability
Others
– DELTA [Clifton et al. 97], CLIO [Miller et al. 00][Yan et al. 01]
Multi-strategy learning in other domains
– series of workshops [91,93,96,98,00]
– [Freitag98], Proverb [Keim et al. 99]

Summary
LSD project
– applies machine learning to schema matching
Main ideas & contributions
– use of multi-strategy learning
– extend learning to handle domain & user-specified constraints
– develop XML learner
System design: A contribution to generic schema-matching
– highly modular & extensible
– handle multiple types of knowledge
– continuously improve over time

Ongoing & Future Work
Improve accuracy
– address current system limitations
Extend LSD to more complex mappings
Apply LSD to other application contexts
– data translation
– data warehousing
– e-commerce
– information extraction
– semantic web
www.cs.washington.edu/homes/anhai/lsd.html

Contribution of Each Component
[Bar chart: average matching accuracy (%) for Real Estate I, Course Offerings, Faculty Listings, and Real Estate II. Series: Without Name Learner; Without Naive Bayes; Without Whirl Learner; Without Constraint Handler; The complete LSD system]

Exploiting Hierarchical Structure
Existing learners flatten out all structures
Developed XML learner
– similar to the Naive Bayes learner
  – input instance = bag of tokens
– differs in one crucial aspect
  – consider not only text tokens, but also structure tokens
Example elements
<description> Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors.</description>
<contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm></contact>
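To make the difference between text tokens and structure tokens concrete, here is a small sketch that builds a bag of tokens for each of the two elements above. The slide does not define how structure tokens are encoded; representing each nested tag as a token such as `<name>` is an assumption for illustration, as are the function names.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def text_tokens(element):
    """All words in the element's text, ignoring tag structure."""
    text = " ".join(element.itertext())
    return re.findall(r"[a-z]+", text.lower())


def structure_tokens(element):
    """Hypothetical structure tokens: one per nested sub-element tag."""
    return [f"<{child.tag}>" for child in element.iter() if child is not element]


def bag_of_tokens(xml_string):
    """Bag of text tokens plus structure tokens for one XML element."""
    element = ET.fromstring(xml_string)
    return Counter(text_tokens(element) + structure_tokens(element))


contact = bag_of_tokens(
    "<contact><name>Gail Murphy</name><firm>MAX Realtors</firm></contact>"
)
description = bag_of_tokens(
    "<description>Victorian house with a view. Name your price! "
    "To see it, contact Gail Murphy at MAX Realtors.</description>"
)
```

Both bags contain the text tokens gail, murphy, max, and realtors, so a learner that sees only text has little to separate the two elements. Only the contact bag carries the structure tokens `<name>` and `<firm>`.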