Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai

Learning Source Mappings

Zachary G. Ives, University of Pennsylvania

CIS 650 – Database & Information Systems

October 27, 2008

LSD Slides courtesy AnHai Doan

Administrivia

Midterm due Thursday: 5-10 pages (single-spaced, 10-12 pt)

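Semantic Mappings between Schemas

Mediated & source schemas = XML DTDs

[Figure: two schema trees connected by mapping arrows]

Source schema: house (location, contact (name, phone), full-baths, half-baths)

Mediated schema: house (address, contact-info (agent-name, agent-phone), num-baths)

1-1 mapping, e.g. location to address

non 1-1 mapping, e.g. num-baths to full-baths and half-baths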

The LSD (Learning Source Descriptions) Approach

Suppose user wants to integrate 100 data sources

1. User manually creates mappings for a few sources, say 3, and shows LSD these mappings

2. LSD learns from the mappings
   “Multi-strategy” learning incorporates many types of info in a general way
   Knowledge of constraints further helps

3. LSD proposes mappings for remaining 97 sources

Example

Mediated schema: address, price, agent-phone, description

Schema of realestate.com: location, listed-price, phone, comments

realestate.com
  location: Miami, FL; Boston, MA; ...
  listed-price: $250,000; $110,000; ...
  phone: (305) 729 0831; (617) 253 1429; ...
  comments: Fantastic house; Great location; ...

homes.com
  price: $550,000; $320,000; ...
  contact-phone: (278) 345 7215; (617) 335 2315; ...
  extra-info: Beautiful yard; Great beach; ...

Learned hypotheses
  If “phone” occurs in the name => agent-phone
  If “fantastic” & “great” occur frequently in data values => description

LSD’s Multi-Strategy Learning

Use a set of base learners
  each exploits well certain types of information

Match schema elements of a new source
  apply the base learners
  combine their predictions using a meta-learner

Meta-learner
  uses training sources to measure base learner accuracy
  weighs each learner based on its accuracy

Base Learners

Input
  schema information: name, proximity, structure, ...
  data information: value, format, ...

Output
  prediction weighted by confidence score

Examples
  Name learner
    agent-name => (name,0.7), (phone,0.3)
  Naive Bayes learner
    “Kent, WA” => (address,0.8), (name,0.2)
    “Great location” => (description,0.9), (address,0.1)
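To make the base-learner interface concrete, here is a minimal sketch of a name-based learner that returns a confidence per mediated-schema label. It is an illustration only: the token-overlap (Jaccard) score, the smoothing constant, and the names `tokens`, `name_learner`, and `MEDIATED_LABELS` are assumptions made for this example, not the matcher LSD actually uses.

```python
import re

# Labels of the mediated schema used in the running example.
MEDIATED_LABELS = ["address", "price", "agent-phone", "description"]


def tokens(name):
    """Split a schema element name into a set of lowercase word tokens."""
    return set(re.findall(r"[a-z]+", name.lower()))


def name_learner(element_name, labels=MEDIATED_LABELS):
    """Return {label: confidence}, with confidences summing to 1.

    Hypothetical scoring: Jaccard overlap of name tokens plus a small
    smoothing constant, so that no label gets exactly zero confidence.
    """
    source_tokens = tokens(element_name)
    raw = {}
    for label in labels:
        label_tokens = tokens(label)
        union = source_tokens | label_tokens
        overlap = len(source_tokens & label_tokens) / len(union) if union else 0.0
        raw[label] = overlap + 0.05
    total = sum(raw.values())
    return {label: score / total for label, score in raw.items()}


prediction = name_learner("contact-phone")
best_label = max(prediction, key=prediction.get)
print(best_label, prediction)
```

Any learner that returns the same label-to-confidence dictionary can be plugged into a combination step, which is what lets a multi-strategy system mix schema-based and data-based evidence.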

Training the Learners

[Figure: schema of realestate.com (location, listed-price, phone, comments) mapped to the mediated schema]

realestate.com data listings
  <location> Miami, FL </> <listed-price> $250,000</> <phone> (305) 729 0831</> <comments> Fantastic house </>
  <location> Boston, MA </> <listed-price> $110,000</> <phone> (617) 253 1429</> <comments> Great location </>

Name Learner training examples
  (location, address) (listed-price, price) (phone, agent-phone) (comments, description) ...

Naive Bayes Learner training examples
  (“Miami, FL”, address) (“$ 250,000”, price) (“(305) 729 0831”, agent-phone) (“Fantastic house”, description) ...
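The training step can be sketched as follows: the user-supplied mapping turns each source tag into a (name, label) example for the name learner and each data value into a (value, label) example for Naive Bayes. This is a small sketch under stated assumptions: the tokenizer, the Laplace smoothing, and the names `MAPPING`, `LISTINGS`, `name_examples`, `value_examples`, and `NaiveBayes` are illustrative choices, not LSD's implementation.

```python
import math
import re
from collections import Counter, defaultdict

# Manually supplied 1-1 mapping for realestate.com (taken from the slide).
MAPPING = {"location": "address", "listed-price": "price",
           "phone": "agent-phone", "comments": "description"}

LISTINGS = [
    {"location": "Miami, FL", "listed-price": "$250,000",
     "phone": "(305) 729 0831", "comments": "Fantastic house"},
    {"location": "Boston, MA", "listed-price": "$110,000",
     "phone": "(617) 253 1429", "comments": "Great location"},
]


def name_examples(mapping):
    """(source tag name, mediated label) pairs for the name learner."""
    return list(mapping.items())


def value_examples(listings, mapping):
    """(data value, mediated label) pairs for the Naive Bayes learner."""
    return [(value, mapping[tag]) for row in listings for tag, value in row.items()]


def tokenize(text):
    """Assumed tokenizer: words, digit runs, and a few punctuation marks."""
    return re.findall(r"[a-z]+|\d+|[$(),]", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, examples):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        self.vocab = set()
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        """Return {label: confidence}, normalized to sum to 1."""
        n_examples = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(count / n_examples)
            for tok in tokenize(text):
                smoothed = self.word_counts[label][tok] + 1
                score += math.log(smoothed / (total + len(self.vocab)))
            log_scores[label] = score
        top = max(log_scores.values())
        unnormalized = {lab: math.exp(s - top) for lab, s in log_scores.items()}
        z = sum(unnormalized.values())
        return {lab: v / z for lab, v in unnormalized.items()}


model = NaiveBayes().fit(value_examples(LISTINGS, MAPPING))
print(model.predict("Great beach"))
```

With only two listings the confidences are coarse; the point is the shape of the training data, not the quality of the classifier.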

Applying the Learners

[Figure: each element of the homes.com schema (area, day-phone, extra-info) is passed through the Name Learner and Naive Bayes, then the Meta-Learner, and matched against the mediated schema]

homes.com data listings
  <area>Seattle, WA</> <area>Kent, WA</> <area>Austin, TX</>
  <day-phone>(278) 345 7215</> <day-phone>(617) 335 2315</> <day-phone>(512) 427 1115</>
  <extra-info>Beautiful yard</> <extra-info>Great beach</> <extra-info>Close to Seattle</>

Per-instance predictions
  (address,0.8), (description,0.2)
  (address,0.6), (description,0.4)
  (address,0.7), (description,0.3)

Combined predictions
  (address,0.7), (description,0.3)
  (agent-phone,0.9), (description,0.1)
  (address,0.6), (description,0.4)
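One way to read the numbers on this slide is that per-instance predictions are averaged into a single prediction for the element. The sketch below assumes a plain mean, which is consistent with the slide's values (0.8, 0.6, 0.7 average to 0.7); the function name `average_predictions` is hypothetical.

```python
def average_predictions(instance_predictions):
    """Average a list of {label: confidence} dicts into one dict.

    Assumption: a simple unweighted mean over instances; labels missing
    from an instance count as confidence 0 for that instance.
    """
    labels = {label for pred in instance_predictions for label in pred}
    n = len(instance_predictions)
    return {
        label: sum(pred.get(label, 0.0) for pred in instance_predictions) / n
        for label in labels
    }


# Per-instance predictions as listed on the slide.
area_instances = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.6, "description": 0.4},
    {"address": 0.7, "description": 0.3},
]

area_prediction = average_predictions(area_instances)
print({label: round(conf, 2) for label, conf in area_prediction.items()})
```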

Domain Constraints

Impose semantic regularities on sources
  verified using schema or data

Examples
  a = address & b = address => a = b
  a = house-id => a is a key
  a = agent-info & b = agent-name => b is nested in a

Can be specified up front
  when creating mediated schema
  independent of any actual source schema

The Constraint Handler

Predictions from Meta-Learner
  area: (address,0.7), (description,0.3)
  contact-phone: (agent-phone,0.9), (description,0.1)
  extra-info: (address,0.6), (description,0.4)

Domain Constraints
  a = address & b = address => a = b

Candidate mappings, scored by the product of confidences
  area: description, contact-phone: description, extra-info: description    0.3 x 0.1 x 0.4 = 0.012
  area: address, contact-phone: agent-phone, extra-info: address            0.7 x 0.9 x 0.6 = 0.378
  area: address, contact-phone: agent-phone, extra-info: description        0.7 x 0.9 x 0.4 = 0.252

Can specify arbitrary constraints

User feedback = domain constraint
  ad-id = house-id

Extended to handle domain heuristics
  a = agent-phone & b = agent-name => a & b are usually close to each other
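The arithmetic on this slide can be reproduced with a short sketch: score every candidate mapping by the product of its confidences, discard candidates that violate a domain constraint, and keep the best survivor. Exhaustive enumeration is used here only because the example is tiny; the names `best_mapping` and `at_most_one_address` are hypothetical, and nothing here is meant to describe how LSD searches the space in practice.

```python
from itertools import product
from math import prod

# Meta-learner predictions from the slide.
predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}


def at_most_one_address(mapping):
    """Constraint: a = address & b = address => a = b."""
    return sum(1 for label in mapping.values() if label == "address") <= 1


def best_mapping(predictions, constraints):
    """Return (mapping, score) maximizing the product of confidences
    among candidate mappings that satisfy every constraint."""
    elements = list(predictions)
    best, best_score = None, -1.0
    for labels in product(*(predictions[e] for e in elements)):
        mapping = dict(zip(elements, labels))
        if not all(check(mapping) for check in constraints):
            continue
        score = prod(predictions[e][label] for e, label in mapping.items())
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score


unconstrained, raw_score = best_mapping(predictions, [])
mapping, score = best_mapping(predictions, [at_most_one_address])
print(unconstrained, round(raw_score, 3))
print(mapping, round(score, 3))
```

Without the constraint the top candidate maps both area and extra-info to address (0.378); with it, the handler falls back to the 0.252 candidate, which maps extra-info to description.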

Putting It All Together: LSD System

[Figure: LSD architecture. Training Phase: mediated schema, source schemas, and data listings yield training data for base learners L1, L2, ..., Lk. Matching Phase: Mapping Combination and Constraint Handler, with Domain Constraints and User Feedback as inputs.]

Base learners: Name Learner, XML learner, Naive Bayes, Whirl learner

Meta-learner
  uses stacking [Ting&Witten99, Wolpert92]
  returns linear weighted combination of base learners’ predictions
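A linear weighted combination of base-learner predictions can be sketched in a few lines. The weights and the base-learner outputs below are made-up illustrative numbers, and `combine` is a hypothetical helper; in stacking the weights would be learned from the training sources rather than fixed by hand.

```python
def combine(base_predictions, weights):
    """Linear weighted combination of base-learner predictions.

    base_predictions: {learner name: {label: confidence}}
    weights:          {learner name: weight}
    Returns {label: confidence}, renormalized to sum to 1.
    """
    combined = {}
    for learner, prediction in base_predictions.items():
        for label, confidence in prediction.items():
            combined[label] = combined.get(label, 0.0) + weights[learner] * confidence
    total = sum(combined.values())
    return {label: score / total for label, score in combined.items()}


# Hypothetical base-learner outputs for one source element.
base_predictions = {
    "name_learner": {"address": 0.4, "description": 0.6},
    "naive_bayes": {"address": 0.8, "description": 0.2},
}

# Hypothetical weights; a real meta-learner would fit these on training data.
weights = {"name_learner": 0.3, "naive_bayes": 0.7}

combined = combine(base_predictions, weights)
print({label: round(conf, 2) for label, conf in combined.items()})
```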

Empirical Evaluation

Four domains
  Real Estate I & II, Course Offerings, Faculty Listings

For each domain
  create mediated DTD & domain constraints
  choose five sources
  extract & convert data listings into XML
  mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48

Ten runs for each experiment - in each run:
  manually provide 1-1 mappings for 3 sources
  ask LSD to propose mappings for remaining 2 sources
  accuracy = % of 1-1 mappings correctly identified
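The accuracy measure defined above is simple to compute once a reference mapping is available. The sketch below assumes mappings are dictionaries from source element to mediated label; `matching_accuracy` and the sample mappings are illustrative, not data from the experiments.

```python
def matching_accuracy(proposed, reference):
    """Percentage of reference 1-1 mappings that the proposal gets right."""
    correct = sum(
        1 for element, label in reference.items() if proposed.get(element) == label
    )
    return 100.0 * correct / len(reference)


# Illustrative mappings for a three-element source schema.
reference = {"area": "address", "contact-phone": "agent-phone", "extra-info": "description"}
proposed = {"area": "address", "contact-phone": "agent-phone", "extra-info": "address"}

accuracy = matching_accuracy(proposed, reference)
print(f"{accuracy:.1f}%")
```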

LSD Matching Accuracy

[Figure: bar chart of Average Matching Accuracy (%), y-axis 0 to 100, for Real Estate I, Real Estate II, Course Offerings, and Faculty Listings]

LSD’s accuracy: 71 - 92%
Best single base learner: 42 - 72%
+ Meta-learner: + 5 - 22%
+ Constraint handler: + 7 - 13%
+ XML learner: + 0.8 - 6%

LSD Summary

Applies machine learning to schema matching
  use of multi-strategy learning
  Domain & user-specified constraints

Probably the most flexible means of doing schema matching today in a semi-automated way

Complementary project: CLIO (IBM Almaden) uses key and foreign-key constraints to help the user build mappings

Since LSD…

A lot more work on the following:
  Alternative schemes for putting together info from base learners
  Hierarchical learners
    Compare two trees: parent nodes are likely to be the same if child nodes are similar; child nodes are likely to be the same if parent nodes are similar
  Using mass collaboration: humans do the work

And a lot of work on entity resolution or record matching
  Uses similar ideas to try to determine when two records are referring to the same entity

Jumping Up a Level

We’ve now seen how heterogeneous data makes a huge difference …
  In the need for relating different kinds of attributes
    Mapping languages
    Mapping tools
    Query reformulation
  … and in query processing
    Adaptive query processing

Next time we’ll go even further, and start to consider search, focusing on Google
