HKU CSIS DB Seminar: COMA - A system for flexible combination of schema matching approaches - VLDB 2002 - Hong-Hai Do and Erhard Rahm. Speaker: Eric Lo. http://www.csis.hku.hk/~dbgroup/seminar/seminar020927.htm



What is Schema Matching?

• Finding semantic correspondences between elements of two schemas

• Input: 2 schemas
• Output: a set of mappings


Why Schema Matching?

• Done by domain experts
• Time consuming
• Reduce user effort
• Semi-automatic
  – Need user to verify
  – Need user to modify


Application domains

• E-commerce:
  – E.g. a comparison shopping website
  – Aggregates product offers from multiple independent online stores
  – Match each product catalog against their combined catalog
    • [Amazon].product_code <-> [Combined].product_id
    • [Wrox].bookid <-> [Combined].product_id


Application domains

• Data warehouses and data integration systems
  – Preprocessing
• Data translation
  – XML <-> relational data mapping


Schema matching categories

• Goal: high match accuracy for a large variety of schemas
• A single technique is not enough for different schemas
  – Combine different approaches effectively
• Hybrid approach:
  – Most common
  – Different match criteria (e.g. name, data type, dictionary, thesaurus…) are used in a single algorithm
• Composite approach:
  – High flexibility
  – One match algorithm per match criterion
  – Combine the independent results from the algorithms


Outline

• Introduction
• COMA system
• Overview of different matchers
• Reuse matcher from COMA
• Evaluation
• Conclusions
• Discussions
• References


COMA: COmbining MAtch algorithms

• Composite approach
• No previous work on composite generic matching
• A generic match system
• Supports multiple schema types (e.g. XML and relational)


COMA

• Different match algorithms exist as an extensible library in COMA (the matcher library)

• Supports different ways of combining the results of the match algorithms in the library

• An evaluation platform to systematically examine and compare the effectiveness of different matchers (the match algorithms in the library) and combination strategies


COMA

• Interactive and iterative match process which allows user feedback

• Also proposes a new matcher that reuses previously obtained match results (the authors observed that many schemas to be matched are very similar to previously matched schemas)


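Matching Process

[Figure: Schema1 and Schema2 are passed to the matchers (Matcher1, Matcher2, Matcher3, UserFeedback); the matcher results are stored in a similarity cube and then combined (S1 -> S2, S2 -> S1) into the match result]

Matcher library:
– Simple matchers: n-gram, synonym
– Hybrid: NamePath

Combined match result:
<schema1.cname> <-> <schema2.companyname> Sim = 0.95
<schema1.cname> <-> <schema2.businessname> Sim = 0.8
<schema1.address> <-> <schema2.address> Sim = 1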


5 Steps

• Step 1: Schema Representation
• Step 2: Schema Tree Distinct Elements
• Step 3: Matching Algorithm (Matcher)
• Step 4: Aggregation of k matcher values
• Step 5: Selection


Step1: Schema Representation


XML Schema Representation


Step 2

• Traverse the schema tree
• Represent each schema element by its path
  – Sequence of nodes from the root
  – E.g. Address in PO2
  – Multiple paths
    • PO2.DeliverTo.Address
    • PO2.BillTo.Address
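
The path representation of Step 2 can be sketched in Python as below. This is only an illustration, not COMA's implementation: the nested-dict encoding of the schema tree, the function name `element_paths` and the Street/City children under BillTo.Address are assumptions made for the example.

```python
def element_paths(tree, prefix=()):
    """Yield the dotted root-to-node path of every node in a nested-dict tree."""
    for name, children in tree.items():
        path = prefix + (name,)
        yield ".".join(path)
        yield from element_paths(children, path)


# Hypothetical fragment of PO2: the shared Address fragment hangs under two parents,
# so the single Address element is reachable through two distinct paths.
address = {"Street": {}, "City": {}}
po2 = {"PO2": {"DeliverTo": {"Address": address}, "BillTo": {"Address": address}}}

paths = list(element_paths(po2))
for p in paths:
    print(p)
```

Because the shared fragment is visited once per parent, the number of paths is larger than the number of distinct schema nodes.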


Step 3: Match algorithms

• Take in each schema element path
• Return a similarity value
• If human feedback is involved:
  – User-approved matches get similarity 1, rejected ones get 0
• Different matchers return similarity values between 0 and 1
• COMA currently supports simple, hybrid and reuse-oriented matchers (discussed later)


Storing the results of k matchers in a similarity cube

• k matchers
• m schema 1 elements
• n schema 2 elements
• A cube of k x m x n similarity values is stored in the repository for the later combination and selection steps

[Figure: cube with axes k, m and n]


Some samples from similarity cube

Matcher             PO1 element                PO2 element              Sim
Matcher1: TypeName  ShipTo.shipToCity          DeliverTo.Address.City   0.65
                    ShipTo.shipToStreet        DeliverTo.Address.City   0.3
                    ShipTo.Customer.custCity   DeliverTo.Address.City   0.8
Matcher2: NamePath  ShipTo.shipToCity          DeliverTo.Address.City   0.78
                    ShipTo.shipToStreet        DeliverTo.Address.City   0.73
                    ShipTo.Customer.custCity   DeliverTo.Address.City   0.53


Step 4 and 5: Combine match result

• Combine the k results from the similarity cube
• Step 4: Aggregation
  – Aggregation of matcher-specific results
    • E.g. taking the average / max / min of the k values
    • ShipTo.shipToCity <-> DeliverTo.Address.City 0.72
    • ShipTo.shipToStreet 0.52
    • ShipTo.Customer.custCity 0.67
• Step 5: Selection
  – Selection of match candidates
    • Select ShipTo.shipToCity <-> DeliverTo.Address.City (0.72)
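
As a rough sketch of Steps 4 and 5, the sample values can be held in a k x m x n nested list, averaged over the matcher axis, and the best PO1 candidate picked for the PO2 element. The list layout and the function names are assumptions for illustration; the averages come out as 0.715, 0.515 and 0.665, which the slide shows rounded.

```python
matchers = ["TypeName", "NamePath"]                      # k = 2
po1 = ["ShipTo.shipToCity", "ShipTo.shipToStreet",
       "ShipTo.Customer.custCity"]                       # m = 3
po2 = ["DeliverTo.Address.City"]                         # n = 1

# cube[x][i][j]: similarity of PO1 element i and PO2 element j under matcher x
cube = [
    [[0.65], [0.3], [0.8]],     # TypeName
    [[0.78], [0.73], [0.53]],   # NamePath
]


def aggregate_average(cube):
    """Step 4: collapse the matcher axis by averaging, giving an m x n matrix."""
    k = len(cube)
    return [[sum(cube[x][i][j] for x in range(k)) / k
             for j in range(len(cube[0][i]))]
            for i in range(len(cube[0]))]


def select_best(combined, j):
    """Step 5: index of the PO1 element with the highest combined value for PO2 element j."""
    return max(range(len(combined)), key=lambda i: combined[i][j])


combined = aggregate_average(cube)
best = po1[select_best(combined, 0)]
print(best)
```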


How do the matchers work?

• Step 1: Schema Representation
• Step 2: Schema Tree Distinct Elements
• Step 3: Matching Algorithm (Matcher)
• Step 4: Aggregation of k matcher values
• Step 5: Selection


COMA Matcher Library

Type            Name          Schema info         Auxiliary info
Simple          Affix         Element names       -
                N-gram        Element names       -
                Soundex       Element names       -
                EditDistance  Element names       -
                Synonym       Element names       External dictionaries
                DataType      Data types          Data type compatibility table
                UserFeedback  -                   User-specified (mis-)matches
Hybrid          Name          Element names       -
                NamePath      Names + paths       -
                TypeName      Data types + names  -
                Children      Child elements      -
                Leaves        Leaf elements       -
Reuse-oriented  Schema        -                   Existing schema-level match results


Simple Matcher

• Use element names to compare
  – Name string
  – Name semantics

• Can use approximate string matching techniques (as applied in data cleansing)

• Affix: looks for common affixes (prefixes and suffixes) of the name strings

• DataType: similarity = degree of compatibility of 2 data types (values are predefined)
  – E.g. int and bit = 0.6, text and hex = 0.1
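
A minimal sketch of the Affix and DataType matchers follows. The scoring rule for Affix (longest common prefix or suffix divided by the longer name), the value 1.0 for identical types and the default 0.0 for unlisted type pairs are assumptions; only the two compatibility values are taken from the slide.

```python
import os


def affix_similarity(a, b):
    """Length of the longest common prefix or suffix, relative to the longer name."""
    if not a or not b:
        return 0.0
    a, b = a.lower(), b.lower()
    prefix = len(os.path.commonprefix([a, b]))
    suffix = len(os.path.commonprefix([a[::-1], b[::-1]]))
    return max(prefix, suffix) / max(len(a), len(b))


# Predefined compatibility table; only these two entries appear on the slide.
COMPATIBILITY = {
    frozenset(("int", "bit")): 0.6,
    frozenset(("text", "hex")): 0.1,
}


def datatype_similarity(t1, t2):
    if t1 == t2:
        return 1.0  # assumption: identical types are fully compatible
    return COMPATIBILITY.get(frozenset((t1, t2)), 0.0)  # assumption: unknown pairs score 0


print(affix_similarity("shipToCity", "custCity"))  # shared suffix "city"
print(datatype_similarity("bit", "int"))
```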


Hybrid Matcher

• Fixed combination of simple matchers
• E.g. EditDistance + DataType
• Hybrid matcher 1 (Name matcher):
  – Tokenization (POShipTo -> PO, Ship, To)
  – Expansion (PO -> Purchase, Order)
  – Then use e.g. Affix + Trigram
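
The tokenization and expansion steps of the Name matcher might look like the sketch below. The regular expression and the one-entry abbreviation table are assumptions chosen to reproduce the slide's example, not the rules COMA actually uses.

```python
import re

# Hypothetical abbreviation table; a real system would load a larger dictionary.
ABBREVIATIONS = {"PO": ["Purchase", "Order"]}


def tokenize(name):
    """Split a name at case boundaries: acronym runs, capitalised words, lowercase runs, digits."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", name)


def expand(tokens):
    """Replace known abbreviations by their full forms, keeping other tokens unchanged."""
    expanded = []
    for token in tokens:
        expanded.extend(ABBREVIATIONS.get(token, [token]))
    return expanded


tokens = tokenize("POShipTo")
print(tokens)
print(expand(tokens))
```

The expanded token lists of two names can then be compared token by token with simple matchers such as Affix or Trigram.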


Another Hybrid Matcher

• NamePath matcher:
  – Name + path (element + structure)
  – Build a long string from the path
  – Apply the Name matcher
  – E.g. PurchaseOrder.ShipTo.Street and PurchaseOrder.shipToStreet
  – Same in the Name matcher, but not in NamePath


Outline

• Introduction
• COMA system
• Overview of different matchers [Step 3]
• Reuse matcher for COMA [Step 3]
• Evaluation
• Conclusions
• Discussions
• References


Reuse of previous match result

• Based on the authors' observation:
  – Many schemas to be matched are similar (or identical) to previously matched schemas
  – Build a reuse-oriented matcher to save resources
  – A was matched with B before (A <-> B) [Match 101]
  – B was matched with C before (B <-> C) [Match 234]
  – Now a new match task: A <-> C

• The MatchCompose operation combines previous match results to obtain a new match result


MatchCompose operation

• Given 2 match results:
  – match1: S1 <-> S2
  – match2: S2 <-> S3

• MatchCompose derives a new match result S1 <-> S3

• PO1.Contact <-> PO2.Contact <-> PO3.Contact

[Figure: elements of PO1.Contact (Name, Email, Company), PO2.Contact (name, email, company) and PO3.Contact (lastName, firstName, email); the mappings S1 <-> S2 and S2 <-> S3 are composed by MatchCompose into Match: S1 <-> S3]


MatchCompose in relation

Match1:
PO1     PO2      SIM12
Name    name     1.0
Email   e-mail   1.0

Match2:
PO2      PO3         SIM23
name     lastName    0.6
name     firstName   0.6
e-mail   email       1.0

MatchCompose (Match1, Match2):
PO1     PO3         SIM13
Name    lastName    0.8
Name    firstName   0.8
Email   email       1.0
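
The tables above read like a join of Match1 and Match2 on the PO2 column. The sketch below assumes that the composed similarity is the average of the two joined values, which is consistent with the numbers shown (0.8 from 1.0 and 0.6) but is an inference from the slide; the dict-of-pairs encoding and the name `match_compose` are likewise illustrative.

```python
def match_compose(match12, match23):
    """Join two match results on the shared middle schema and average the similarities.

    Simplification: if several middle elements connect the same (a, c) pair,
    the last one seen wins.
    """
    composed = {}
    for (a, b), sim_ab in match12.items():
        for (b2, c), sim_bc in match23.items():
            if b == b2:
                composed[(a, c)] = (sim_ab + sim_bc) / 2
    return composed


match1 = {("Name", "name"): 1.0, ("Email", "e-mail"): 1.0}
match2 = {("name", "lastName"): 0.6, ("name", "firstName"): 0.6, ("e-mail", "email"): 1.0}

match13 = match_compose(match1, match2)
for pair, sim in match13.items():
    print(pair, sim)
```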


Re-use: Schema matcher

• All previous match results are stored in the repository
• A new match problem arrives, e.g. S1 to be matched with S2
• Find all match results involving schemas (Si, Sj and Sk) related to BOTH S1 and S2, in any order
• Each such pair undergoes MatchCompose


How to aggregate the results from k matchers?

• Step 1: Schema Representation
• Step 2: Schema Tree Distinct Elements
• Step 3: Matching Algorithm (Matcher)
• Step 4: Aggregation of k matcher values
• Step 5: Selection


How to combine similarity values from different matchers?

• Aggregate the values from the different matchers into a single similarity value
  – Max: return the maximum of the values from the k matchers
  – Weighted sum: weights assigned according to the expected importance of the matchers
  – Average
  – Min
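
The four aggregation strategies can be sketched as a single function over the k values of one element pair. The function name, the strategy keywords and the expectation that weights sum to 1 are assumptions made for this illustration.

```python
def aggregate(values, strategy="average", weights=None):
    """Collapse the k matcher values of one element pair into a single similarity."""
    if strategy == "max":
        return max(values)
    if strategy == "min":
        return min(values)
    if strategy == "average":
        return sum(values) / len(values)
    if strategy == "weighted":
        # Assumption: the caller supplies one weight per matcher, summing to 1.
        if weights is None or len(weights) != len(values):
            raise ValueError("weighted aggregation needs one weight per value")
        return sum(w * v for w, v in zip(weights, values))
    raise ValueError(f"unknown strategy: {strategy}")


# TypeName and NamePath values for ShipTo.shipToCity from the earlier sample.
values = [0.65, 0.78]
print(aggregate(values, "max"), aggregate(values, "min"))
print(aggregate(values, "average"))
print(aggregate(values, "weighted", weights=[0.25, 0.75]))
```

Max is the most optimistic choice and Min the most pessimistic; Average and Weighted sum sit in between and let less reliable matchers be compensated by the others.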


Among so many combinations, how to select the set of results returned to the user?

• Step 1: Schema Representation
• Step 2: Schema Tree Distinct Elements
• Step 3: Matching Algorithm (Matcher)
• Step 4: Aggregation of k matcher values
• Step 5: Selection


Select candidates from combined cube

• Direction of match candidate selection
  – Given 2 schemas S1 and S2 with |S2| <= |S1|
  – 3 directions: LargeSmall, SmallLarge, Both
  – LargeSmall: match the large schema S1 against the small target S2, i.e. elements from S1 are ranked and selected with respect to each S2 element


3 directions

Combined similarity values (rows: large schema elements, columns: small schema elements):

               DeliverToAddress   BillToAddress
shipToCity     0.72               0.71
custCity       0.67               0.68
shipToStreet   0.52               0.6

LargeSmall (for each small schema element):
– DeliverToAddress chooses shipToCity
– BillToAddress chooses shipToCity

SmallLarge (for each large schema element), and whether the pair is also kept by Both (LargeSmall + SmallLarge):
– shipToCity chooses DeliverToAddress (Both: YES)
– custCity chooses BillToAddress (Both: NO)
– shipToStreet chooses BillToAddress (Both: NO)
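
The three directions can be reproduced from the similarity values on this slide with the sketch below. It assumes that each direction keeps only the single best candidate and that Both is the intersection of the two directions, which matches the YES/NO column; the function names are illustrative.

```python
sim = {
    ("shipToCity", "DeliverToAddress"): 0.72, ("shipToCity", "BillToAddress"): 0.71,
    ("custCity", "DeliverToAddress"): 0.67, ("custCity", "BillToAddress"): 0.68,
    ("shipToStreet", "DeliverToAddress"): 0.52, ("shipToStreet", "BillToAddress"): 0.6,
}


def large_small(sim):
    """For each small-schema element, keep the best-ranked large-schema element."""
    best = {}
    for (large, small), s in sim.items():
        if small not in best or s > sim[(best[small], small)]:
            best[small] = large
    return {(large, small) for small, large in best.items()}


def small_large(sim):
    """For each large-schema element, keep the best-ranked small-schema element."""
    best = {}
    for (large, small), s in sim.items():
        if large not in best or s > sim[(large, best[large])]:
            best[large] = small
    return {(large, small) for large, small in best.items()}


both = large_small(sim) & small_large(sim)
print(sorted(large_small(sim)))
print(sorted(small_large(sim)))
print(sorted(both))
```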


Selecting candidates (cont)

• Along one direction, 3 ways to select:
  – MaxN: select the n candidates with the top similarity values
    • If n = 1, 1-to-1 correspondence
  – MaxDelta: select the candidate with the maximum similarity Max and, given a tolerance value d, also those candidates with similarity > Max - d
    • Selects those that are almost maximal
  – Threshold: all candidates with similarity > threshold t
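
A minimal sketch of the three selection strategies, applied to the candidates of one target element. The dict encoding and function names are assumptions; d is treated here as an absolute tolerance, and the example values are the BillToAddress column from the previous slide.

```python
def max_n(candidates, n):
    """The n candidates with the highest similarity values."""
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return ranked[:n]


def max_delta(candidates, d):
    """The best candidate plus every candidate within tolerance d of it."""
    top = max(candidates.values())
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return [c for c in ranked if candidates[c] == top or candidates[c] > top - d]


def threshold(candidates, t):
    """All candidates whose similarity exceeds t."""
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return [c for c in ranked if candidates[c] > t]


bill_to = {"shipToCity": 0.71, "custCity": 0.68, "shipToStreet": 0.6}
print(max_n(bill_to, 1))
print(max_delta(bill_to, 0.05))
print(threshold(bill_to, 0.7))
```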


Evaluation

• Tested on 5 real-world purchase order schemas
  – CIDX, Excel, Noris, Paragon and Apertum (from www.biztalk.org)
  – |Inner or leaf nodes| != |paths|: the schemas share some fragments


Data Sets

• 5 schemas, 10 match tasks
• Matched manually by domain experts
• #Matches = number of correspondences to be identified
  – Shows the problem size
• Schema similarity = #MatchedPaths / #AllPaths


Evaluation – match quality

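• The automatic match returns the set P of matches
• I is the set of real matches (determined by domain experts)
• c is the set of correct matches, i.e. those in both P and I
• Precision = |c|/|P|: reliability of the match predictions
• Recall = |c|/|I|: share of the real matches that were found
• Accuracy = Recall * (2 - 1/Precision)
• Accuracy estimates the labour saved, taking into account the effort to correct wrong matches and the effort to identify missed matches

[Figure: Venn diagram of P and I with intersection c]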
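
The three measures can be computed from the sets P and I as sketched below. Substituting the definitions into the Accuracy formula gives Accuracy = (2|c| - |P|) / |I|, which the code uses to avoid dividing by a zero precision; it also shows that Accuracy becomes negative as soon as fewer than half of the returned matches are correct. The function name and the example pairs are hypothetical.

```python
def match_quality(predicted, real):
    """Precision, recall and accuracy of a predicted match set against the real one."""
    correct = predicted & real
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(real) if real else 0.0
    # Equivalent to recall * (2 - 1 / precision), but defined for precision == 0.
    accuracy = (2 * len(correct) - len(predicted)) / len(real) if real else 0.0
    return precision, recall, accuracy


# Hypothetical correspondences, written as (S1 element, S2 element) pairs.
real = {("a", "x"), ("b", "y"), ("c", "z"), ("d", "v"), ("e", "w")}
predicted = {("a", "x"), ("b", "y"), ("c", "z"), ("d", "q")}

precision, recall, accuracy = match_quality(predicted, real)
print(precision, recall, accuracy)
```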


Experimental result

• Only in automatic mode
• Conducted 12,312 series of experiments
  – Different choices of matchers
  – Different choices of direction etc.
• Each combination runs on the 10 schema matching tasks (1<->2, …)


Distribution of no-reuse matchers

[Figure: chart not recoverable; axis label: Accuracy]

• 1 series = 1 combination
• Most (7077) no-reuse combinations have Accuracy < 0


Distribution w.r.t. aggregation

[Figure: chart not recoverable; axis label: Accuracy]


Distribution w.r.t. direction

[Figure: chart not recoverable; axis label: Accuracy]


Distribution w.r.t. selection

[Figure: chart not recoverable; axis label: Accuracy]


Outline

• Introduction
• COMA system
• Overview of different matchers
• Reuse matcher from COMA
• Evaluation
• Conclusions
• Discussions
• References


Conclusions

• COMA provides a framework for combining different matchers for different purposes

• A new matcher: the reuse-oriented matcher


Discussions

• Most are 1:1 matchings; what about n:1 and n:m?

• Accuracy metric
• Is time a problem?
• To match 2 schemas, is A <-> B a must?
  – What if A maps to B to some extent, and B maps to A to another extent?

[Figure: a <-> c (1:1) local and b <-> c (1:1) local, together (2:1) global; {a, b} <-> c (2:1) local]


References

• [VLDB02] COMA - A system for flexible combination of schema matching approaches
  – By Hong-Hai Do, Erhard Rahm
  – University of Leipzig

• [ICDE02] Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching
  – By Sergey Melnik, Hector Garcia-Molina, Erhard Rahm
  – Stanford and University of Leipzig

• [VLDB02] Translating Web Data
  – By Lucian Popa, Yannis Velegrakis, Renee J. Miller, et al.
  – IBM Almaden Research Center and University of Toronto


eNd


Interactive mode

• In contrast with automatic mode
• The user interacts with COMA in each iteration (optional)
• E.g.
  – Specify which matchers to use (simple / hybrid)
  – Accept / reject match candidates
• Improves match quality


Simple Matcher

• EditDistance: similarity derived from the number of edit operations needed to transform one string into the other

• Synonym: looks up the terminological relationship in a specified dictionary

• N-gram: compares the n-grams of the names, i.e. their sequences of n characters
• Soundex: based on phonetic similarity
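
Two of these simple matchers are easy to sketch. The normalisations below (Dice coefficient over the n-gram sets, and 1 minus the Levenshtein distance divided by the longer length) are common choices assumed for illustration; the slides do not say how COMA maps the raw counts to a value between 0 and 1.

```python
def ngrams(s, n=3):
    """Set of all character n-grams of a lowercased string."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Dice coefficient of the two n-gram sets (assumed normalisation)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """1 minus the edit distance relative to the longer string (assumed normalisation)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


print(ngram_similarity("shipToCity", "custCity"))
print(edit_similarity("shipToCity", "custCity"))
```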


Hybrid Matcher

• TypeName matcher:
  – DataType + Name matcher

• Children matcher:
  – Leaves are compared with the TypeName matcher
  – To compare two non-leaf elements A and B, compare A's children with B's children


Hybrid Matcher

• Leaves matcher:
  – Similar to the Children matcher, but only considers the leaves, compared with the TypeName matcher
  – PO1.ShipTo.shipToStreet
  – PO1.ShipTo.shipToCity
  – PO2.DeliverTo.Address.Street
  – PO2.DeliverTo.Address.City
  – Comparing ShipTo with DeliverTo by the Children matcher would compare shipToStreet with Address!
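
The difference between the two matchers comes down to which element sets are compared. The sketch below only extracts those sets for the example paths; the nested-dict encoding and the helper names are assumptions, and the actual similarity computation between the sets is left out.

```python
# Subtrees taken from the example paths above, encoded as nested dicts.
po1_ship_to = {"shipToStreet": {}, "shipToCity": {}}
po2_deliver_to = {"Address": {"Street": {}, "City": {}}}


def children(node):
    """Direct child elements, as used by the Children matcher."""
    return list(node)


def leaves(node):
    """All leaf elements below the node, as used by the Leaves matcher."""
    found = []
    for name, subtree in node.items():
        found.extend(leaves(subtree) if subtree else [name])
    return found


print(children(po1_ship_to), "vs", children(po2_deliver_to))
print(leaves(po1_ship_to), "vs", leaves(po2_deliver_to))
```

With children, the leaf shipToStreet is set against the inner node Address; with leaves, it is set against Street and City, which is the more meaningful comparison when the two schemas nest the same information at different depths.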