Robert McCann, University of Illinois
Joint work with Bedoor AlShebli, Quoc Le, Hoa Nguyen, Long Vu, & AnHai Doan
VLDB 2005
Mapping Maintenance for Data Integration Systems
Data Integration Systems
[Figure: the user query "Find homes under $300K" is posed over a mediated schema. The mediated schema is mapped to source schemas 1, 2, and 3; a wrapper extracts the data for each source schema from a Web source (homeseekers.com, yahoo.com, windermere.com).]
Mapping Maintenance is a Key Bottleneck
Constructing mappings has proven difficult…
– (see first speaker)
…but maintenance often quickly dominates cost
E.g., Integrated Genome Database Project [Stein, 03]
– 12 genomic databases, each remodeled data twice per year
– System broke every two weeks, abandoned after 1 year
E.g., Integration Project at Illinois
– Integrated 400 DB researcher homepages
– 2 system administrators, stopped after 3 months
Reducing maintenance costs is now crucial!
Problem Definition
[Figure: homeseekers.com is passed through a wrapper, and the wrapper output is mapped to the mediated schema cost | city | numbeds | numbaths.]
Initial wrapper output:
price | location | beds | baths
$185,000 | “Urbana, IL” | 2 | 2
$270,000 | “Seattle, WA” | 3 | 2
5 weeks later (source has changed):
price | location | beds | baths
$180,000 | 61801 | 2 | 2
$260,000 | 98195 | 3 | 2
Is the mapping to the mediated schema still valid?
Example 1: Change Source Schema or Data
[Figure: homeseekers.com and its wrapper, mapped to the mediated schema cost | city | numbeds | numbaths.]
Original wrapper output:
price | location | beds | baths
$185,000 | “Urbana, IL” | 2 | 2
$270,000 | “Seattle, WA” | 3 | 2
Update tuples:
price | location | beds | baths
$180,000 | “Urbana, IL” | 2 | 2
$260,000 | “Seattle, WA” | 3 | 2
Change units of price:
price | location | beds | baths
185 | “Urbana, IL” | 2 | 2
270 | “Seattle, WA” | 3 | 2
Example 2: Change Presentation Format
[Figure: homeseekers.com and its wrapper, mapped to the mediated schema cost | city | numbeds | numbaths.]
Original page: $185,000 Urbana, IL 2bed/2bath Century 21
Original wrapper output:
price | location | beds | baths
$185,000 | “Urbana, IL” | 2 | 2
$270,000 | “Seattle, WA” | 3 | 2
Display location as zipcode
Page: $185,000 61801 2bed/2bath Century 21
Wrapper output:
price | location | beds | baths
$185,000 | 61801 | 2 | 2
$270,000 | 98195 | 3 | 2
Rearrange page layout
Page: $185,000 - Urbana, IL 2bed/2bath Century 21
Wrapper output:
price | location | beds | baths
$185,000 | “Century 21” | 2 | 2
$270,000 | “RE/MAX” | 3 | 2
The MAVERIC Approach
Suppose administrator wants to maintain mappings for 1 year
1. For a short initial period (e.g., 5 weeks)
– Administrator manually verifies each mapping
– MAVERIC probes the source to learn data characteristics
2. For remaining time (e.g., 47 weeks)
– MAVERIC probes the source to observe new data instances
– MAVERIC outputs an alarm if characteristics differ
– If an alarm, administrator repairs mappings
Example
Training phase
Wrapper output for homeseekers.com on week 1:
price | location | beds | baths
$185,000 | “Urbana, IL” | 2 | 2
$270,000 | “Seattle, WA” | 3 | 2
Wrapper output for homeseekers.com on week 5:
price | location | beds | baths
$132,000 | “Salem, OR” | 2 | 1
$365,000 | “Atlanta, GA” | 4 | 2
Learned data characteristics
– If average price < 100,000, output alarm
– If layout of attributes changes, output alarm
– If beds < baths, output alarm
Verification phase
Wrapper output for homeseekers.com on week 6:
price | location | beds | baths
132 | “Century 21” | 1 | 2
365 | “RE/MAX” | 2 | 4
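To make the example concrete, here is a minimal Python sketch of two of the learned characteristics applied to the week 6 output. The tuple layout, the function names, and the hard-coded thresholds are assumptions for illustration; the slides do not say how MAVERIC represents or learns these rules, and the layout rule is omitted because it needs the HTML page rather than the extracted tuples.

```python
from statistics import mean

# Assumed tuple layout: (price, location, beds, baths), as in the slide tables.
week1 = [(185_000, "Urbana, IL", 2, 2), (270_000, "Seattle, WA", 3, 2)]
week6 = [(132, "Century 21", 1, 2), (365, "RE/MAX", 2, 4)]

def average_price_alarm(rows, floor=100_000):
    """Rule from the slide: alarm if the average price is below 100,000."""
    return mean(row[0] for row in rows) < floor

def beds_baths_alarm(rows):
    """Rule from the slide: alarm if any tuple has beds < baths."""
    return any(row[2] < row[3] for row in rows)

def check(rows):
    """Return the list of rules that fire on one wrapper output."""
    alarms = []
    if average_price_alarm(rows):
        alarms.append("average price < 100,000")
    if beds_baths_alarm(rows):
        alarms.append("beds < baths")
    return alarms

print(check(week1))  # training data: no rule fires
print(check(week6))  # verification data: both rules fire
```

On the week 6 tuples both rules fire, which matches the slide's point that the changed page layout has corrupted the extracted values.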
Contributions
Develop core MAVERIC system
– An ensemble of sensors that exploit multiple characteristics of data
– A combiner that leverages the most effective sensors
Significantly improve core system
– Generate synthetic data to improve training
– Leverage external data to improve training
– Employ filters to reduce false alarms
Extensive evaluation over 114 sources in 6 domains
– Core MAVERIC outperforms related work, improving F-1 by 4-19%
– Enhancements further improve F-1 by 2-13%
Training the Core MAVERIC System
Sensors learn internal profiles of data characteristics
Combiner learns weight for each sensor
– employ Winnow to learn weights
[Figure: wrapper output for homeseekers.com on week 1 and week 5 feeds sensors s1, …, sm, whose outputs feed the combiner. Example sensor profiles: avg value of price; layout of attributes in HTML pages (price location beds / baths).]
Wrapper output for homeseekers.com on week 1:
price | location | beds | baths
$185,000 | “Urbana, IL” | 2 | 2
$270,000 | “Seattle, WA” | 3 | 2
Wrapper output for homeseekers.com on week 5:
price | location | beds | baths
$132,000 | “Salem, OR” | 2 | 1
$365,000 | “Atlanta, GA” | 4 | 2
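The slide names Winnow but gives no details, so the following is only a sketch of the textbook Winnow update with multiplicative promotion and demotion. It assumes each sensor casts a binary vote (1 = "mapping looks broken"), that the threshold equals the number of sensors, and that alpha = 2; the variant and parameters MAVERIC actually uses may differ.

```python
def train_winnow(examples, n_sensors, alpha=2.0, epochs=5):
    """Learn one weight per sensor with textbook Winnow updates.

    examples: list of (votes, label) pairs, where votes is a list of 0/1
    sensor votes and label is 1 if the mapping was really broken.
    """
    weights = [1.0] * n_sensors
    theta = float(n_sensors)  # assumed threshold, a common textbook choice
    for _ in range(epochs):
        for votes, label in examples:
            score = sum(w * v for w, v in zip(weights, votes))
            predicted = 1 if score >= theta else 0
            if predicted == label:
                continue  # Winnow only updates on mistakes
            # Missed alarm: promote sensors that fired. False alarm: demote them.
            factor = alpha if label == 1 else 1.0 / alpha
            weights = [w * factor if v else w for w, v in zip(weights, votes)]
    return weights, theta

# Hypothetical training data: sensor 0 is reliable, sensors 1 and 2 are noisy.
examples = [
    ([1, 1, 0], 1),
    ([0, 1, 1], 0),
    ([1, 0, 0], 1),
    ([0, 1, 0], 0),
]
weights, theta = train_winnow(examples, n_sensors=3)
print(weights, theta)
```

On this toy data the reliable sensor ends up with the largest weight and the noisiest one with the smallest, which is the behavior the combiner relies on.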
Verifying with the Core MAVERIC System
Sensors leverage internal profiles to output sensor scores
Combiner combines scores based upon weights
– alarm if combined score ≥ θ
[Figure: wrapper output for homeseekers.com on week 6 feeds sensors s1, …, sm, which send score1, …, scorem to the combiner. Sensor observations: new avg price; layout of attributes has changed.]
Wrapper output for homeseekers.com on week 6:
price | location | beds | baths
132 | “Century 21” | 1 | 2
365 | “RE/MAX” | 2 | 4
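A minimal sketch of the verification step, assuming sensor scores lie in [0, 1] and that the combiner takes a weight-normalized average. The slides only say that scores are combined "based upon weights", so the exact formula, the weights, and θ = 0.5 are illustrative assumptions.

```python
def combined_score(scores, weights):
    """Weighted average of sensor scores (assumed combination rule)."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def should_alarm(scores, weights, theta=0.5):
    """Alarm if the combined score reaches the threshold theta."""
    return combined_score(scores, weights) >= theta

weights = [4.0, 1.0, 0.5]  # hypothetical learned sensor weights

# The heavily weighted sensor reports a strong anomaly: alarm.
print(should_alarm([0.9, 0.2, 0.1], weights))
# Only the lightly weighted sensors complain: no alarm.
print(should_alarm([0.1, 0.9, 0.9], weights))
```

Normalizing by the sum of the weights keeps the combined score on the same [0, 1] scale as the individual scores, so θ can be read as a fraction of full agreement.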
Improving Training via Perturbation
Idea: expand training data by generating synthetic data
Simulate natural source changes during training
– Source data changes, e.g., insert and delete tuples
– Presentation format changes, e.g., $29.99 becomes 29.99 USD
System “practices ahead of time”
[Figure: source S at t1, …, tn is passed through the wrapper to give query results at t1, …, tn (the original results). The perturber applies a change, reapplies the wrapper, and tests the results, yielding perturbed results. Original and perturbed results together form the training data for S, which is used to train sensors s1, …, sm and the combiner.]
Example: Reformatting Price
Original HTML (homeseekers.com): $185,000 Urbana, IL 3bed/2bath …
Original results (wrapper output):
price | location | beds | baths
$185,000 | “Urbana, IL” | 3 | 2
Perturbed HTML (after perturbation): 185,000 USD Urbana, IL 3bed/2bath …
Perturbed results (wrapper output):
price | location | beds | baths
185,000 USD | “Urbana, IL” | 3 | 2
[Figure: the original and perturbed results are compared (?=). The original training example and the perturbed training example both go into the training data for sensors s1, …, sm and the combiner.]
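The perturber loop (apply change, reapply wrapper, test results) can be sketched as follows. The positional wrapper, the two perturbation functions, and the digit-based equality test are all hypothetical stand-ins chosen to keep the example short; they are not the wrappers or tests used by MAVERIC.

```python
import re

def wrapper(page):
    """Hypothetical positional wrapper: the first line of the page is the price."""
    return {"price": page.splitlines()[0]}

def reformat_price(page):
    """Presentation change from the slide: $185,000 becomes 185,000 USD."""
    return re.sub(r"\$([\d,]+(?:\.\d+)?)", r"\1 USD", page)

def swap_first_two_lines(page):
    """Layout change: the price and location lines trade places."""
    lines = page.splitlines()
    lines[0], lines[1] = lines[1], lines[0]
    return "\n".join(lines)

def digits(value):
    """Crude normalization: keep only the digits of a value."""
    return re.sub(r"\D", "", value)

def perturbed_example(page, change):
    """Apply a change, reapply the wrapper, and test the results."""
    original = wrapper(page)
    perturbed = wrapper(change(page))
    same = digits(original["price"]) == digits(perturbed["price"])
    return perturbed, ("valid" if same else "broken")

page = "$185,000\nUrbana, IL\n3bed/2bath"
print(perturbed_example(page, reformat_price))
print(perturbed_example(page, swap_first_two_lines))
```

The first call yields a new valid training example in an unseen format (185,000 USD); the second yields a broken one, because the wrapper now extracts the location as the price.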
Additional Improvements
Improve training by borrowing data from other sources
[Figure: sources S and S’, each with a wrapper and a source schema, are mapped to the mediated schema attributes cost and description. Attributes and values shown: category, price (house, $185,000) and comments, amount (“This…”, 185,000 USD).]
Reduce false alarms via filtering
[Figure: a potentially corrupt attribute, price with value 185,000 USD, is checked against three kinds of evidence, leading to the conclusion that price is valid.]
– Web Search Engines: “price is 185,000 USD”, “costs 185,000 USD”
– Other Sources: amount 210 K
– Monetary Recognizers: $185,000, $185000.00
(see paper for details)
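As an illustration of the recognizer-based filter only, here is a minimal sketch that suppresses an alarm on a price attribute when most of its values still look monetary. The regular expression, the function names, and the 0.9 cutoff are assumptions; the slides defer the real filter design to the paper.

```python
import re

# Assumed monetary formats: a leading "$" or a trailing "USD".
MONEY = re.compile(r"^(?:\$\s?\d[\d,]*(?:\.\d{2})?|\d[\d,]*(?:\.\d{2})?\s?USD)$")

def looks_monetary(value):
    """Return True if the value matches one of the assumed money formats."""
    return bool(MONEY.match(value.strip()))

def keep_alarm(values, min_fraction=0.9):
    """Keep the alarm unless at least min_fraction of the values look monetary."""
    if not values:
        return True  # nothing to validate, so do not suppress the alarm
    recognized = sum(looks_monetary(v) for v in values)
    return recognized / len(values) < min_fraction

print(keep_alarm(["185,000 USD", "270,000 USD"]))  # values still look like prices
print(keep_alarm(["Century 21", "RE/MAX"]))        # values do not look like prices
```

The first alarm is filtered out as a likely false alarm; the second is kept and passed to the administrator.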
Empirical Evaluation
Test verification ability over 114 sources in 6 domains
Domain | Number of Sources | Schema Size (Number of Attributes) | Probing Schedule | Snapshots with Correct Mappings | Snapshots with Broken Mappings
Flights | 19 | 8 | weekly for 10 weeks | 164 | 26
Books | 21 | 6 | weekly for 12 weeks | 210 | 42
Researchers | 60 | 4 | daily for 313 days | 12480 | 6274
Real Estate | 5 | 17 | 11 snapshots per source | 30 | 25
Inventory | 4 | 7 | 11 snapshots per source | 24 | 20
Courses | 5 | 11 | 11 snapshots per source | 30 | 25
Core MAVERIC Outperforms Prior Work
Compare with recent system [Lerman et al, Journal of AI Research 03]
Achieve F-1 from 82-93%, an improvement of 4-19% in all domains
Domain | Lerman System P / R | Lerman System F-1 | Sensor Ensemble P / R | Sensor Ensemble F-1
Flights | 0.81 / 1.00 | 0.85 | 0.93 / 0.98 | 0.93
Books | 0.83 / 1.00 | 0.89 | 0.90 / 0.99 | 0.93
Researchers | 0.77 / 0.99 | 0.84 | 0.90 / 0.99 | 0.93
Real Estate | 0.45 / 0.90 | 0.63 | 0.80 / 0.82 | 0.82
Inventory | 0.52 / 0.89 | 0.67 | 0.75 / 0.90 | 0.77
Courses | 0.49 / 0.94 | 0.66 | 0.92 / 0.88 | 0.88
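For readers unfamiliar with the metrics, here is a sketch of how precision (P), recall (R), and F-1 are conventionally computed from alarm counts. Treating "mapping is broken" as the positive class is an assumption, and the counts are made up. The F-1 values reported in the talk need not equal the harmonic mean of the listed P and R, for example if the figures are averaged over sources, which is also an assumption.

```python
def precision_recall_f1(true_alarms, false_alarms, missed_alarms):
    """Standard definitions, with 'mapping is broken' as the positive class."""
    raised = true_alarms + false_alarms    # all alarms the system output
    broken = true_alarms + missed_alarms   # all snapshots that were truly broken
    precision = true_alarms / raised if raised else 0.0
    recall = true_alarms / broken if broken else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 8 correct alarms, 2 false alarms, 0 missed alarms.
p, r, f1 = precision_recall_f1(8, 2, 0)
print(round(p, 2), round(r, 2), round(f1, 2))
```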
Enhancements Boost Performance
Each enhancement improved F-1 in at least 4 domains
[Figure: chart of F-1 (axis from 0.6 to 1) per domain (Flights, Books, Researchers, Real Estate, Inventory, Courses) for progressively enhanced versions of MAVERIC: Sensor Ensemble; Sensor Ensemble + Perturbation; Sensor Ensemble + Perturbation + Multi-Src Train; Sensor Ensemble + Perturbation + Multi-Src Train + Filtering.]
Reasons for Mistakes
Unrecognized instance formats
– E.g., trained over TIME with format 2:00 pm, source changed format to 1400, output false alarm
– E.g., trained over DAYS with format M-W-F, source changed format to Mon Wed Fri, output false alarm
– Train with additional perturbations? Leverage more sources?
Attributes with similar values
– E.g., trained with ORDER-DATE before SHIP-DATE, source reversed order, missed alarm on reversed values (ORDER-DATE = 7/13/2004, SHIP-DATE = 7/4/2004)
– Include additional domain constraints?
Related Work
Schema matching
– [Dhamankar et al, 04], [He & Chang, 03], [Kang & Naughton, 03], [Rahm & Bernstein, 01], [Doan, 01]
– Quantify semantics to compute matching scores
Activity monitoring
– [Shavlik & Shavlik, 04], [Lazarevic et al, 03], [Stolfo et al, 01], [Fawcett & Provost, 99], [Allan et al, 98]
– Profile normal behavior to detect notable events (e.g., intrusions)
Mapping and wrapper maintenance
– Wrapper verification: [Lerman et al, 03], [Kushmerick, 00]
– Mapping and wrapper repair: [Velegrakis et al, 03], [Meng et al, 03], [Chidlovskii, 01]
Conclusion & Future Work
Developed MAVERIC to reduce maintenance costs
– An ensemble of sensors that exploit multiple characteristics of data
Significantly improved core system
– Perturbation, multi-source training, and filtering
Extensively evaluated over 114 sources in 6 domains
– Core outperformed related work, improving F-1 by 4-19%
– Enhancements further improved F-1 by 2-13%
Future work
– Further improve and evaluate MAVERIC
– Develop a solution for repairing broken mappings