Mapping Maintenance for Data Integration Systems
Robert McCann, University of Illinois. Joint work with Bedoor AlShebli, Quoc Le, Hoa Nguyen, Long Vu, & AnHai Doan. VLDB 2005.


Page 1: Mapping Maintenance for Data Integration Systems

Robert McCann
University of Illinois

Joint work with Bedoor AlShebli, Quoc Le, Hoa Nguyen, Long Vu, & AnHai Doan

VLDB 2005

Page 2: Data Integration Systems

[Figure: a data integration system. The user query "Find homes under $300K" is posed against the mediated schema, which is mapped to source schemas 1, 2, and 3. Each source (homeseekers.com, yahoo.com, windermere.com) is accessed through its own wrapper.]

Page 3: Mapping Maintenance is a Key Bottleneck

Constructing mappings has proven difficult…
– (see first speaker)

…but maintenance often quickly dominates cost

E.g., Integrated Genome Database Project [Stein, 03]
– 12 genomic databases, each remodeled data twice per year
– System broke every two weeks, abandoned after 1 year

E.g., Integration Project at Illinois
– Integrated 400 DB researcher homepages
– 2 system administrators, stopped after 3 months

Reducing maintenance costs is now crucial!

Page 4: Problem Definition

Mediated schema: cost | city | numbeds | numbaths

Initially, the homeseekers.com wrapper returns:
price | location | beds | baths
$185,000 | “Urbana, IL” | 2 | 2
$270,000 | “Seattle, WA” | 3 | 2

5 weeks later (source has changed), the homeseekers.com wrapper returns:
price | location | beds | baths
$180,000 | 61801 | 2 | 2
$260,000 | 98195 | 3 | 2

? Is the mapping from the source schema to the mediated schema still valid?

Page 5: Example 1: Change Source Schema or Data

Mediated schema: cost | city | numbeds | numbaths

Original results from the homeseekers.com wrapper:
price | location | beds | baths
$185,000 | “Urbana, IL” | 2 | 2
$270,000 | “Seattle, WA” | 3 | 2

Update tuples:
price | location | beds | baths
$180,000 | “Urbana, IL” | 2 | 2
$260,000 | “Seattle, WA” | 3 | 2

Change units of price:
price | location | beds | baths
185 | “Urbana, IL” | 2 | 2
270 | “Seattle, WA” | 3 | 2

Page 6: Example 2: Change Presentation Format

Mediated schema: cost | city | numbeds | numbaths

Original homeseekers.com page: $185,000; Urbana, IL; 2bed/2bath; Century 21
Wrapper results:
price | location | beds | baths
$185,000 | “Urbana, IL” | 2 | 2
$270,000 | “Seattle, WA” | 3 | 2

Display location as zipcode
Page: $185,000; 61801; 2bed/2bath; Century 21
Wrapper results:
price | location | beds | baths
$185,000 | 61801 | 2 | 2
$270,000 | 98195 | 3 | 2

Rearrange page layout
Page: $185,000 - Urbana, IL; 2bed/2bath; Century 21
Wrapper results:
price | location | beds | baths
$185,000 | “Century 21” | 2 | 2
$270,000 | “RE/MAX” | 3 | 2

Page 7: The MAVERIC Approach

Suppose the administrator wants to maintain mappings for 1 year

1. For a short initial period (e.g., 5 weeks)
– Administrator manually verifies each mapping
– MAVERIC probes the source to learn data characteristics

2. For the remaining time (e.g., 47 weeks)
– MAVERIC probes the source to observe new data instances
– MAVERIC outputs an alarm if characteristics differ
– If an alarm is raised, the administrator repairs the mappings

Page 8: Example

Training phase

homeseekers.com on week 1, via wrapper:
price | location | beds | baths
$185,000 | “Urbana, IL” | 2 | 2
$270,000 | “Seattle, WA” | 3 | 2

homeseekers.com on week 5, via wrapper:
price | location | beds | baths
$132,000 | “Salem, OR” | 2 | 1
$365,000 | “Atlanta, GA” | 4 | 2

Learned data characteristics
– If average price < 100,000, output alarm
– If layout of attributes changes, output alarm
– If beds < baths, output alarm

Verification phase

homeseekers.com on week 6, via wrapper:
price | location | beds | baths
132 | “Century 21” | 1 | 2
365 | “RE/MAX” | 2 | 4
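The two value-based rules on this slide can be sketched in a few lines of Python. This is only an illustration: the rules are hard-coded from the slide rather than learned, the layout rule is omitted because it needs the raw HTML, and the names `avg_price_too_low`, `beds_below_baths`, and `verify` are hypothetical. Treating "beds < baths" as "any listing has fewer beds than baths" is also an assumption.

```python
from statistics import mean

# Wrapper output from the training phase (weeks 1 and 5 on the slide).
training_rows = [
    {"price": 185_000, "location": "Urbana, IL", "beds": 2, "baths": 2},
    {"price": 270_000, "location": "Seattle, WA", "beds": 3, "baths": 2},
    {"price": 132_000, "location": "Salem, OR", "beds": 2, "baths": 1},
    {"price": 365_000, "location": "Atlanta, GA", "beds": 4, "baths": 2},
]

# Wrapper output on week 6, after the source changed.
week6_rows = [
    {"price": 132, "location": "Century 21", "beds": 1, "baths": 2},
    {"price": 365, "location": "RE/MAX", "beds": 2, "baths": 4},
]


def avg_price_too_low(rows, threshold=100_000):
    """Slide rule: alarm if the average price is below 100,000."""
    return mean(r["price"] for r in rows) < threshold


def beds_below_baths(rows):
    """Slide rule (assumed per-row): alarm if any listing has beds < baths."""
    return any(r["beds"] < r["baths"] for r in rows)


CHECKS = {"avg_price": avg_price_too_low, "beds_vs_baths": beds_below_baths}


def verify(rows):
    """Return the names of all checks that raise an alarm on these rows."""
    return [name for name, check in CHECKS.items() if check(rows)]


print(verify(training_rows))  # []
print(verify(week6_rows))     # ['avg_price', 'beds_vs_baths']
```

On the training rows neither rule fires, while the week 6 rows trip both, which matches the alarm the slide expects.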

Page 9: Contributions

Develop core MAVERIC system
– An ensemble of sensors that exploit multiple characteristics of data
– A combiner that leverages the most effective sensors

Significantly improve core system
– Generate synthetic data to improve training
– Leverage external data to improve training
– Employ filters to reduce false alarms

Extensive evaluation over 114 sources in 6 domains
– Core MAVERIC outperforms related work, improving F-1 by 4-19%
– Enhancements further improve F-1 by 2-13%

Page 10: Training the Core MAVERIC System

Sensors learn internal profiles of data characteristics
– E.g., avg value of price
– E.g., layout of attributes in HTML pages: price location beds / baths

Combiner learns weight for each sensor
– Employ Winnow to learn weights

[Figure: sensors s1 … sm feed a combiner. Training data comes from the wrapper applied to homeseekers.com on week 1 and on week 5.]

homeseekers.com on week 1, via wrapper:
price | location | beds | baths
$185,000 | “Urbana, IL” | 2 | 2
$270,000 | “Seattle, WA” | 3 | 2

homeseekers.com on week 5, via wrapper:
price | location | beds | baths
$132,000 | “Salem, OR” | 2 | 1
$365,000 | “Atlanta, GA” | 4 | 2
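The slide names Winnow as the weight learner but does not say how sensor outputs are fed to it. The sketch below shows the textbook Winnow2 update (multiplicative promotion and demotion) under the assumption that each sensor casts a binary vote per snapshot. The function name `train_winnow`, the threshold choice (number of sensors), and the toy examples are all hypothetical.

```python
def train_winnow(examples, n_sensors, alpha=2.0, epochs=5):
    """Learn one weight per sensor with the Winnow2 update rule.

    examples: list of (votes, broken) pairs, where votes is a list of 0/1
    sensor votes and broken is True if the mapping was really broken.
    """
    weights = [1.0] * n_sensors
    theta = float(n_sensors)  # a common default threshold for Winnow
    for _ in range(epochs):
        for votes, broken in examples:
            total = sum(w for w, v in zip(weights, votes) if v)
            predicted = total >= theta
            if predicted == broken:
                continue  # correct prediction: leave weights unchanged
            # Missed alarm: promote voting sensors. False alarm: demote them.
            factor = alpha if broken else 1.0 / alpha
            weights = [w * factor if v else w for w, v in zip(weights, votes)]
    return weights, theta


# Toy data: sensor 0 votes only on broken mappings, sensor 1 is noisy,
# sensor 2 votes once, on a snapshot that was not broken.
examples = [
    ([1, 1, 0], True),
    ([0, 1, 1], False),
    ([1, 0, 0], True),
    ([0, 1, 0], False),
]
weights, theta = train_winnow(examples, n_sensors=3)
print(weights, theta)  # [4.0, 1.0, 0.5] 3.0
```

After training, the reliable sensor carries the largest weight, which is the sense in which a combiner of this kind can lean on its most effective sensors.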

Page 11: Verifying with the Core MAVERIC System

Sensors leverage internal profiles to output sensor scores
– E.g., new avg price
– E.g., layout of attributes has changed

Combiner combines scores based upon weights
– Alarm if combined score ≥ θ

[Figure: wrapper results for homeseekers.com on week 6 are passed to sensors s1 … sm, which output score1 … scorem to the combiner.]

homeseekers.com on week 6, via wrapper:
price | location | beds | baths
132 | “Century 21” | 1 | 2
365 | “RE/MAX” | 2 | 4
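A minimal sketch of the thresholding step, assuming sensor scores lie in [0, 1] and the combined score is a weight-normalized sum. The slides do not give the exact combination formula, so the normalization, the value θ = 0.5, and the weights and scores below are illustrative assumptions only.

```python
def combined_score(scores, weights):
    """Weighted average of sensor scores (assumed combination rule)."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def should_alarm(scores, weights, theta=0.5):
    """Raise an alarm if the combined score reaches the threshold theta."""
    return combined_score(scores, weights) >= theta


# Hypothetical weights for three sensors, the first being the most trusted.
weights = [4.0, 1.0, 0.5]

# Hypothetical scores: higher means "this snapshot looks broken".
scores_after_change = [0.9, 0.2, 0.1]  # the trusted sensor is confident
scores_quiet_week = [0.1, 0.8, 0.1]    # only a low-weight sensor objects

print(should_alarm(scores_after_change, weights))  # True
print(should_alarm(scores_quiet_week, weights))    # False
```

The second case shows why weighting matters: a single low-weight sensor objecting is not enough to cross θ.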

Page 12: Improving Training via Perturbation

Idea: expand training data by generating synthetic data

Simulate natural source changes during training
– Source data changes, e.g., insert and delete tuples
– Presentation format changes, e.g., $29.99 becomes 29.99 USD

System “practices ahead of time”

[Figure: source S is probed through the wrapper at times t1 … tn, giving the original query results at t1 … tn. A perturber applies a change, reapplies the wrapper, and tests the results, giving perturbed results. Original and perturbed results together form the training data for S, which is used to train sensors s1 … sm and the combiner.]

Page 13: Example: Reformatting Price

Original HTML from homeseekers.com: $185,000; Urbana, IL; 3bed/2bath; …
Original results (via wrapper):
price | location | beds | baths
$185,000 | “Urbana, IL” | 3 | 2

Perturbation: reformat price

Perturbed HTML: 185,000 USD; Urbana, IL; 3bed/2bath; …
Perturbed results (via wrapper):
price | location | beds | baths
185,000 USD | “Urbana, IL” | 3 | 2

?= Are the perturbed results equivalent to the original results?

[Figure: the original training example and the perturbed training example are both added to the training data for sensors s1 … sm and the combiner.]
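The apply-change, reapply-wrapper, test-results loop can be sketched as follows. Everything here is a toy stand-in: `wrapper` is a hypothetical positional extractor over a three-line listing, `reformat_price` is one hand-written perturbation, and `same_data` is an assumed equivalence test that compares numeric price amounts.

```python
import re


def wrapper(page):
    """Toy positional wrapper: line 1 is price, line 2 location, line 3 rooms."""
    price, location, rooms = page.strip().splitlines()[:3]
    beds, baths = re.fullmatch(r"(\d+)bed/(\d+)bath", rooms).groups()
    return {"price": price, "location": location,
            "beds": int(beds), "baths": int(baths)}


def reformat_price(page):
    """Perturbation from the slide: $185,000 becomes 185,000 USD."""
    return re.sub(r"\$([\d,]+)", r"\1 USD", page)


def amount(text):
    """Strip everything but digits so differently formatted prices compare."""
    return int(re.sub(r"\D", "", text))


def same_data(a, b):
    """Assumed equivalence test: same amount, location, beds, and baths."""
    return (amount(a["price"]) == amount(b["price"])
            and all(a[k] == b[k] for k in ("location", "beds", "baths")))


original_page = "$185,000\nUrbana, IL\n3bed/2bath\n"
original = wrapper(original_page)
perturbed = wrapper(reformat_price(original_page))

# Label the synthetic example by whether the wrapper still returns the same data.
label = "correct" if same_data(original, perturbed) else "broken"
training_data = [(original, "correct"), (perturbed, label)]

print(perturbed["price"], label)  # 185,000 USD correct
```

Here the wrapper survives the format change, so the perturbed result becomes an extra example of a correct mapping with an unfamiliar price format.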

Page 14: Additional Improvements

Improve training by borrowing data from other sources

[Figure: two sources, S and S’, each with its own source schema and wrapper, are mapped to the mediated schema (cost, description). Labels shown: source attributes comments, amount, category, price; example values “This…”, 185,000 USD, house, $185,000.]

Reduce false alarms via filtering

[Figure: a potentially corrupt attribute, price with value 185,000 USD, is checked against three kinds of evidence:
– Web Search Engines: “price is 185,000 USD”, “costs 185,000 USD”
– Other Sources: amount, 210 K
– Monetary Recognizers: $185,000, $185000.00
Outcome: price is valid]

(see paper for details)
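As a rough illustration of the recognizer-style filter, the sketch below suppresses an alarm on a price attribute when every value still looks monetary. The regular expression and the function name `keep_alarm` are assumptions made for this example; the slide only lists sample formats that a monetary recognizer might accept.

```python
import re

# Assumed monetary formats: "$185,000", "$185000.00", "185,000 USD".
NUMBER = r"\d+(?:,\d{3})*(?:\.\d{2})?"
MONEY = re.compile(rf"\$\s?{NUMBER}|{NUMBER}\s?USD")


def looks_monetary(value):
    """True if the whole value matches one of the assumed money formats."""
    return MONEY.fullmatch(value.strip()) is not None


def keep_alarm(values):
    """Keep the alarm only if some value is not recognizably monetary."""
    return not all(looks_monetary(v) for v in values)


# Price was reformatted but is still a price: the alarm is filtered out.
print(keep_alarm(["185,000 USD", "270,000 USD"]))  # False

# Price column now holds agency names: the alarm stands.
print(keep_alarm(["Century 21", "RE/MAX"]))  # True
```

A value such as "210 K" would not match these patterns, which is one reason the slide lists other sources and search engines as additional evidence.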

Page 15: Empirical Evaluation

Test verification ability over 114 sources in 6 domains

Domain | Number of Sources | Schema Size (Number of Attributes) | Probing Schedule | Snapshots with Correct Mappings | Snapshots with Broken Mappings
Flights | 19 | 8 | weekly for 10 weeks | 164 | 26
Books | 21 | 6 | weekly for 12 weeks | 210 | 42
Researchers | 60 | 4 | daily for 313 days | 12480 | 6274
Real Estate | 5 | 17 | 11 snapshots per source | 30 | 25
Inventory | 4 | 7 | 11 snapshots per source | 24 | 20
Courses | 5 | 11 | 11 snapshots per source | 30 | 25

Page 16: Core MAVERIC Outperforms Prior Work

Compare with recent system [Lerman et al, Journal of AI Research 03]

Domain | Lerman System P / R | Lerman System F-1 | Sensor Ensemble P / R | Sensor Ensemble F-1
Flights | 0.81 / 1.00 | 0.85 | 0.93 / 0.98 | 0.93
Books | 0.83 / 1.00 | 0.89 | 0.90 / 0.99 | 0.93
Researchers | 0.77 / 0.99 | 0.84 | 0.90 / 0.99 | 0.93
Real Estate | 0.45 / 0.90 | 0.63 | 0.80 / 0.82 | 0.82
Inventory | 0.52 / 0.89 | 0.67 | 0.75 / 0.90 | 0.77
Courses | 0.49 / 0.94 | 0.66 | 0.92 / 0.88 | 0.88

Achieve F-1 from 82-93%, an improvement of 4-19% in all domains

Page 17: Enhancements Boost Performance

Each enhancement improved F-1 in at least 4 domains

[Figure: bar chart, "Progressively enhanced versions of MAVERIC". Y-axis: F-1, from 0.6 to 1. X-axis: Flights, Books, Researchers, Real Estate, Inventory, Courses. Series: Sensor Ensemble; Sensor Ensemble + Perturbation; Sensor Ensemble + Perturbation + Multi-Src Train; Sensor Ensemble + Perturbation + Multi-Src Train + Filtering.]

Page 18: Reasons for Mistakes

Unrecognized instance formats
– E.g., trained over TIME with format 2:00 pm, source changed format to 1400, output false alarm
– E.g., trained over DAYS with format M-W-F, source changed format to Mon Wed Fri, output false alarm
– Train with additional perturbations? Leverage more sources?

Attributes with similar values
– E.g., trained with ORDER-DATE before SHIP-DATE, source reversed order, missed alarm on reversed values (ORDER-DATE = 7/13/2004, SHIP-DATE = 7/4/2004)
– Include additional domain constraints?
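One possible form of such a domain constraint is a check that an order is never dated after its shipment. This is a sketch of the idea raised on the slide, not something the slides say MAVERIC implements; the function name `violates_date_order` and the second row are hypothetical.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"


def violates_date_order(row):
    """True if ORDER-DATE falls after SHIP-DATE, which the constraint forbids."""
    order = datetime.strptime(row["ORDER-DATE"], DATE_FORMAT)
    ship = datetime.strptime(row["SHIP-DATE"], DATE_FORMAT)
    return order > ship


# Values from the slide: the two attributes have been swapped by the source.
swapped = {"ORDER-DATE": "7/13/2004", "SHIP-DATE": "7/4/2004"}

# Hypothetical well-formed row for comparison.
normal = {"ORDER-DATE": "7/1/2004", "SHIP-DATE": "7/4/2004"}

print(violates_date_order(swapped))  # True
print(violates_date_order(normal))   # False
```

Both columns still hold valid dates after the swap, which is why value-profile sensors can miss it while an ordering constraint catches it.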

Page 19: Related Work

Schema matching
– [Dhamankar et al, 04], [He & Chang, 03], [Kang & Naughton, 03], [Rahm & Bernstein, 01], [Doan, 01]
– Quantify semantics to compute matching scores

Activity monitoring
– [Shavlik & Shavlik, 04], [Lazarevic et al, 03], [Stolfo et al, 01], [Fawcett & Provost, 99], [Allan et al, 98]
– Profile normal behavior to detect notable events (e.g., intrusions)

Mapping and wrapper maintenance
– Wrapper verification: [Lerman et al, 03], [Kushmerick, 00]
– Mapping and wrapper repair: [Velegrakis et al, 03], [Meng et al, 03], [Chidlovskii, 01]

Page 20: Conclusion & Future Work

Developed MAVERIC to reduce maintenance costs
– An ensemble of sensors that exploit multiple characteristics of data

Significantly improved core system
– Perturbation, multi-source training, and filtering

Extensively evaluated over 114 sources in 6 domains
– Core outperformed related work, improving F-1 by 4-19%
– Enhancements further improved F-1 by 2-13%

Future work
– Further improve and evaluate MAVERIC
– Develop a solution for repairing broken mappings