
HKU CSIS DB Seminar: COMA - A system for flexible combination of schema matching approaches

- VLDB 2002 - Hong-Hai Do and Erhard Rahm

Speaker: Eric Lo
http://www.csis.hku.hk/~dbgroup/seminar/seminar020927.htm

DB Seminar 2

What is Schema Matching?

• Finding semantic correspondences between elements of two schemas

• Input: 2 schemas
• Output: A set of mappings

Why Schema Matching?

• Done by domain experts
• Time consuming
• Goal: reduce user effort
• Semi-automatic
  – Need user to verify
  – Need user to modify

Application domains

• E-commerce:
  – E.g. a comparison shopping website
  – Aggregates product offers from multiple independent online stores
  – Match each product catalog against their combined catalog
    • [Amazon].product_code <-> [Combined].product_id
    • [Wrox].bookid <-> [Combined].product_id

Application domains

• Data warehouses and data integration systems
  – Preprocessing
• Data translation
  – XML <-> relational data mapping

Schema matching categories

• Goal: high match accuracy for a large variety of schemas
• A single technique is not enough for different schemas, so different approaches must be combined effectively
• Hybrid approach:
  – Most common
  – Different match criteria (e.g. name, data type, dictionary, thesaurus…) are used in a single algorithm
• Composite approach:
  – High flexibility
  – One match algorithm per match criterion
  – Combine the independent results of the algorithms

Outline

• Introduction
• COMA system
• Overview of different matchers
• Reuse matcher from COMA
• Evaluation
• Conclusions
• Discussions
• References

COMA: COmbining MAtch algorithms

• Composite approach
• No previous work on composite generic matching
• A generic match system
• Supports multiple schema types (e.g. XML and relational)

COMA

• Match algorithms are held in an extensible matcher library

• Supports different ways of combining the results of the matchers in the library

• An evaluation platform to systematically examine and compare the effectiveness of different matchers and combination strategies

COMA

• Interactive and iterative match process which allows user feedback

• Also proposes a new matcher that reuses previously obtained match results (the authors observed that many schemas to be matched are very similar to previously matched schemas)

Matching Process

(Figure: Schema1 and Schema2 are the inputs, matched in direction S1 -> S2 or S2 -> S1. Matcher1, Matcher2, Matcher3 and UserFeedback fill the SimilarityCube, and the match results are then combined.)

Matcher Library:
• Simple matchers: n-gram, synonym
• Hybrid: NamePath

Example output:
• <schema1.cname> <-> <schema2.companyname>, Sim = 0.95
• <schema1.cname> <-> <schema2.businessname>, Sim = 0.8
• <schema1.address> <-> <schema2.address>, Sim = 1

5 Steps

• Step 1: Schema Representation
• Step 2: Schema Tree Distinct Elements
• Step 3: Matching Algorithm (Matcher)
• Step 4: Aggregation of k matcher values
• Step 5: Selection


Step1: Schema Representation


XML Schema Representation

Step 2

• Traverse the schema tree
• Represent each schema element by its path
  – Sequence of nodes from the root
  – E.g. Address in PO2
  – Multiple paths
    • PO2.DeliverTo.Address
    • PO2.BillTo.Address
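
The traversal in Step 2 can be sketched as follows. This is only an illustration, assuming the schema is held as a dictionary from each node to its children. The dictionary layout, the function name `enumerate_paths` and the `Street`/`City` children of `Address` are assumptions for the example, not COMA's actual data structures.

```python
def enumerate_paths(schema, root):
    """Return every root-to-node path as a dot-joined string."""
    paths = []

    def walk(node, prefix):
        path = prefix + [node]
        paths.append(".".join(path))
        for child in schema.get(node, []):
            walk(child, path)

    walk(root, [])
    return paths


# Hypothetical fragment of PO2: Address is a fragment shared by two parents,
# so it is reached by two distinct paths.
po2 = {
    "PO2": ["DeliverTo", "BillTo"],
    "DeliverTo": ["Address"],
    "BillTo": ["Address"],
    "Address": ["Street", "City"],
}

paths = enumerate_paths(po2, "PO2")
address_paths = [p for p in paths if p.endswith(".Address")]
print(address_paths)
```

Because a shared fragment is visited once per parent, the number of paths can exceed the number of nodes, which is why each path is treated as a distinct element.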

Step 3: Match algorithms

• Take in each schema element path
• Return a similarity value
• If human feedback is involved:
  – A user-approved match has similarity 1 (a rejected one has 0)
• Different matchers return similarity values between 0 and 1
• COMA currently supports simple, hybrid and reuse-oriented matchers (discussed later)

Storing the results of k matchers in a similarity cube

• k matchers
• m schema 1 elements
• n schema 2 elements
• A cube of k x m x n similarity values is stored in the repository for the later combination and selection steps

Some samples from the similarity cube

Matcher             PO1 Elements              PO2 Elements            Sim
Matcher1: TypeName  ShipTo.shipToCity         DeliverTo.Address.City  0.65
                    ShipTo.shipToStreet                               0.3
                    ShipTo.Customer.custCity                          0.8
Matcher2: NamePath  ShipTo.shipToCity         DeliverTo.Address.City  0.78
                    ShipTo.shipToStreet                               0.73
                    ShipTo.Customer.custCity                          0.53

Step 4 and 5: Combine match result

• Combine the k results from the similarity cube
• Step 4: Aggregation
  – Aggregation of matcher-specific results
    • E.g. taking the average / max / min of the k values
    • ShipTo.shipToCity <-> DeliverTo.Address.City: 0.72
    • ShipTo.shipToStreet: 0.52
    • ShipTo.Customer.custCity: 0.67
• Step 5: Selection
  – Selection of match candidates
    • Select ShipTo.shipToCity <-> DeliverTo.Address.City (0.72)
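
The aggregated values on this slide are the averages of the two matcher results from the sample table. A minimal sketch of Steps 4 and 5 on those numbers follows. The nested-dictionary layout and the name `aggregate_average` are assumptions for illustration, and the cube is reduced to a single PO2 element.

```python
# Slice of the similarity cube for one PO2 element (DeliverTo.Address.City):
# cube[matcher][po1_element] -> similarity, values taken from the sample table.
cube = {
    "TypeName": {
        "ShipTo.shipToCity": 0.65,
        "ShipTo.shipToStreet": 0.3,
        "ShipTo.Customer.custCity": 0.8,
    },
    "NamePath": {
        "ShipTo.shipToCity": 0.78,
        "ShipTo.shipToStreet": 0.73,
        "ShipTo.Customer.custCity": 0.53,
    },
}


def aggregate_average(cube):
    """Step 4: collapse the matcher dimension by averaging the k values."""
    elements = next(iter(cube.values())).keys()
    return {e: sum(m[e] for m in cube.values()) / len(cube) for e in elements}


combined = aggregate_average(cube)

# Step 5: select the candidate with the highest combined similarity.
best = max(combined, key=combined.get)
print(best, combined[best])
```

The averages come out at about 0.715, 0.515 and 0.665, which the slide shows rounded to 0.72, 0.52 and 0.67.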

How do the matchers work?

• Step 1: Schema Representation
• Step 2: Schema Tree Distinct Elements
• Step 3: Matching Algorithm (Matcher)
• Step 4: Aggregation of k matcher values
• Step 5: Selection

COMA Matcher Library

Type            Name          Schema Info       Aux. Info
Simple          Affix         Element names     -
                N-gram        Element names     -
                Soundex       Element names     -
                EditDistance  Element names     -
                Synonym       Element names     External dictionaries
                DataType      Data types        Data type compatibility table
                UserFeedback  -                 User-specified (mis-)matches
Hybrid          NameMatcher   Element names     -
                NamePath      Names+Paths       -
                TypeName      Data types+Names  -
                Children      Child elements    -
                Leaves        Leaf elements     -
Reuse-oriented  Schema        -                 Existing schema-level match results

Simple Matcher

• Use the element name to compare
  – Name string
  – Name semantics

• Can use approximate string matching techniques (as applied in data cleansing)

• Affix: looks for common affixes (prefixes and suffixes) in the name strings

• DataType: similarity = degree of compatibility of the 2 data types (values are predefined)
  – E.g. int and bit = 0.6, text and hex = 0.1
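
A data type compatibility table of this kind can be sketched as a simple lookup. Only the two values quoted on the slide come from the source. Treating the table as symmetric, scoring identical types as 1.0 and unknown pairs as 0.0, and the names `COMPATIBILITY` and `data_type_similarity` are all assumptions for illustration.

```python
# Predefined compatibility values; frozenset keys make the lookup symmetric.
COMPATIBILITY = {
    frozenset(["int", "bit"]): 0.6,   # value from the slide
    frozenset(["text", "hex"]): 0.1,  # value from the slide
}


def data_type_similarity(t1, t2, default=0.0):
    """Look up the predefined compatibility of two data types."""
    if t1 == t2:
        return 1.0  # assumption: identical types are fully compatible
    return COMPATIBILITY.get(frozenset([t1, t2]), default)


print(data_type_similarity("bit", "int"))
print(data_type_similarity("text", "hex"))
```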

Hybrid Matcher

• Fixed combination of simple matchers
• E.g. EditDistance + DataType
• Hybrid Matcher 1 (Name Matcher):
  – Tokenization (POShipTo -> PO, Ship, To)
  – Expansion (PO -> Purchase, Order)
  – Then use e.g. Affix + Trigram
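
The tokenization and expansion steps of the Name matcher can be sketched as below, followed by a trigram comparison of the resulting tokens. This is an illustrative sketch, not COMA's implementation: the regular expression, the abbreviation table, the Dice coefficient over trigram sets and the way token similarities are averaged into one value are all assumptions.

```python
import re

# Hypothetical abbreviation table used for the expansion step.
ABBREVIATIONS = {"PO": ["Purchase", "Order"]}


def tokenize(name):
    """Split a name at case changes, e.g. POShipTo -> PO, Ship, To."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", name)


def expand(tokens):
    """Replace known abbreviations by their full tokens."""
    expanded = []
    for token in tokens:
        expanded.extend(ABBREVIATIONS.get(token, [token]))
    return expanded


def trigrams(token):
    """Set of 3-character substrings; short tokens count as one gram."""
    t = token.lower()
    if len(t) < 3:
        return {t}
    return {t[i:i + 3] for i in range(len(t) - 2)}


def trigram_similarity(a, b):
    """Dice coefficient of the two trigram sets (an assumed measure)."""
    ga, gb = trigrams(a), trigrams(b)
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def name_similarity(name1, name2):
    """Average, over all tokens, of each token's best match on the other side."""
    t1, t2 = expand(tokenize(name1)), expand(tokenize(name2))
    if not t1 or not t2:
        return 0.0
    best1 = [max(trigram_similarity(a, b) for b in t2) for a in t1]
    best2 = [max(trigram_similarity(a, b) for a in t1) for b in t2]
    return (sum(best1) + sum(best2)) / (len(t1) + len(t2))


print(tokenize("POShipTo"))
print(expand(tokenize("POShipTo")))
print(name_similarity("POShipTo", "PurchaseOrderShipTo"))
```

With the abbreviation expanded, `POShipTo` and `PurchaseOrderShipTo` yield the same token list, so a purely string-based comparison that would score them low now treats them as identical.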

Another Hybrid Matcher

• NamePath Matcher:
  – Name + Path (element + structure)
  – Build a long string from the path
  – Apply the Name Matcher
  – E.g. PurchaseOrder.ShipTo.Street and PurchaseOrder.shipToStreet
  – Same in Name Matcher, but not in NamePath

Outline

• Introduction
• COMA system
• Overview of different matchers [Step 3]
• Reuse matcher for COMA [Step 3]
• Evaluation
• Conclusions
• Discussions
• References

Reuse of previous match result

• Based on the authors' observation:
  – Many schemas to be matched are similar (or identical) to previously matched schemas
  – Build a reuse-oriented matcher to save resources
  – A was matched with B before (A <-> B) [Match 101]
  – B was matched with C before (B <-> C) [Match 234]
  – Now a new match task: A <-> C
• The MatchCompose operation combines previous match results to obtain a new match result

MatchCompose operation

• Given 2 match results:
  – match1: S1 <-> S2
  – match2: S2 <-> S3
• MatchCompose derives a new match result S1 <-> S3
• Example: PO1.Contact <-> PO2.Contact <-> PO3.Contact
  – PO1.Contact elements: Name, Email, Company
  – PO2.Contact elements: name, email, company
  – PO3.Contact elements: lastName, firstName, email

(Figure: the element mappings of match1 and match2 are composed by MatchCompose into Match: S1 <-> S3.)

MatchCompose in relation

Match1:
PO1    PO2     SIM12
Name   name    1.0
Email  e-mail  1.0

Match2:
PO2     PO3        SIM23
name    lastName   0.6
name    firstName  0.6
e-mail  email      1.0

MatchCompose result:
PO1    PO3        SIM13
Name   lastName   0.8
Name   firstName  0.8
Email  email      1.0
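
Viewed relationally, MatchCompose behaves like a join of the two match results on the shared PO2 element. The sketch below reproduces the SIM13 values of this slide by averaging the two similarities, which is consistent with the numbers shown (1.0 and 0.6 give 0.8). The dictionary representation, the name `match_compose` and the use of the maximum when several join paths reach the same pair are assumptions.

```python
def match_compose(match1, match2):
    """Join two match results on their shared middle schema element."""
    result = {}
    for (a, b1), sim1 in match1.items():
        for (b2, c), sim2 in match2.items():
            if b1 == b2:
                combined = (sim1 + sim2) / 2  # average of the two similarities
                # Assumption: keep the best value if a pair is derived twice.
                result[(a, c)] = max(result.get((a, c), 0.0), combined)
    return result


match1 = {("Name", "name"): 1.0, ("Email", "e-mail"): 1.0}  # PO1 <-> PO2
match2 = {                                                    # PO2 <-> PO3
    ("name", "lastName"): 0.6,
    ("name", "firstName"): 0.6,
    ("e-mail", "email"): 1.0,
}

match13 = match_compose(match1, match2)
for pair, sim in match13.items():
    print(pair, sim)
```

An element such as Company, which has no partner in the intermediate match result, simply produces no entry in the composed result.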

Re-use: Schema matcher

• All previous match results are stored in the repository
• A new matching problem comes in, e.g. S1 to be matched with S2
• Find all match results involving schemas (Si, Sj and Sk) related to BOTH S1 and S2, in any order
• Each such pair of match results undergoes MatchCompose

How to aggregate the results from k matchers?

• Step 1: Schema Representation
• Step 2: Schema Tree Distinct Elements
• Step 3: Matching Algorithm (Matcher)
• Step 4: Aggregation of k matcher values
• Step 5: Selection

How to combine similarity values from different matchers?

• Aggregate the values from the different matchers into a single similarity value
• Max: return the maximum of the values from the matchers
• Weighted sum: weights are assigned according to the expected importance of the matchers
• Average
• Min
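
The four aggregation strategies can be written as one small function over the k values for a single element pair. The function name `aggregate`, the strategy labels and the normalisation of the weights by their sum are assumptions made for this sketch.

```python
def aggregate(values, strategy="average", weights=None):
    """Combine the k matcher values for one element pair into one value."""
    if strategy == "max":
        return max(values)
    if strategy == "min":
        return min(values)
    if strategy == "average":
        return sum(values) / len(values)
    if strategy == "weighted":
        if weights is None or len(weights) != len(values):
            raise ValueError("weighted aggregation needs one weight per value")
        # Assumption: weights are normalised so the result stays in [0, 1].
        return sum(w * v for w, v in zip(weights, values)) / sum(weights)
    raise ValueError(f"unknown strategy: {strategy}")


values = [0.65, 0.78]  # TypeName and NamePath values for one element pair
print(aggregate(values, "max"))
print(aggregate(values, "min"))
print(aggregate(values, "average"))
print(aggregate(values, "weighted", weights=[1, 3]))
```

Max is the most optimistic choice and Min the most pessimistic, while Average is the special case of the weighted sum with equal weights.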

Among so many combinations, how do we select the set of results returned to the user?

• Step 1: Schema Representation
• Step 2: Schema Tree Distinct Elements
• Step 3: Matching Algorithm (Matcher)
• Step 4: Aggregation of k matcher values
• Step 5: Selection

Select candidates from combined cube

• Direction of match candidate selection
• Given 2 schemas S1 and S2 with |S2| <= |S1|
• 3 directions: LargeSmall, SmallLarge, Both
• LargeSmall: match the large schema S1 against the small target S2, i.e. elements from S1 are ranked and selected with respect to each S2 element

3 directions

Similarity values (rows: large schema, columns: small schema):

              DeliverToAddress  BillToAddress
shipToCity    0.72              0.71
custCity      0.67              0.68
shipToStreet  0.52              0.6

LargeSmall (for each small schema element):
- DeliverToAddress chooses shipToCity
- BillToAddress chooses shipToCity

SmallLarge (for each large schema element):
- shipToCity chooses DeliverToAddress
- custCity chooses BillToAddress
- shipToStreet chooses BillToAddress

Both (LargeSmall + SmallLarge):
- shipToCity <-> DeliverToAddress: YES
- shipToCity <-> BillToAddress: NO
- custCity <-> BillToAddress: NO
- shipToStreet <-> BillToAddress: NO
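
The three directions can be checked on the similarity values of this slide. In the sketch below each direction picks only the single best candidate, and Both keeps the pairs chosen in both directions. The nested dictionary and the function names are assumptions for illustration.

```python
# sim[large_schema_element][small_schema_element], values from the slide.
sim = {
    "shipToCity": {"DeliverToAddress": 0.72, "BillToAddress": 0.71},
    "custCity": {"DeliverToAddress": 0.67, "BillToAddress": 0.68},
    "shipToStreet": {"DeliverToAddress": 0.52, "BillToAddress": 0.6},
}


def large_small(sim):
    """For each small schema element, pick the best large schema element."""
    small_elements = next(iter(sim.values())).keys()
    return {(max(sim, key=lambda l: sim[l][s]), s) for s in small_elements}


def small_large(sim):
    """For each large schema element, pick the best small schema element."""
    return {(l, max(row, key=row.get)) for l, row in sim.items()}


def both(sim):
    """Keep only the pairs selected in both directions."""
    return large_small(sim) & small_large(sim)


print(sorted(large_small(sim)))
print(sorted(small_large(sim)))
print(sorted(both(sim)))
```

Only shipToCity and DeliverToAddress choose each other, so Both returns that single pair, matching the YES entry on the slide.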

Selecting candidates (cont)

• Along one direction, 3 ways to select:
  – MaxN: select the n candidates with the top sim. values
    • If n=1, 1-to-1 correspondence
  – MaxDelta: select the candidate with the maximum sim. value and, given a tolerance value d, also those candidates with sim. value > Max - d
    • Selects those that are almost maximum
  – Threshold: all candidates with sim. value > threshold t
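
The three selection strategies can be sketched over the candidates for one element, here the DeliverToAddress column of the earlier example. The function names are assumptions, and the tolerance d is treated as an absolute difference, as written on the slide.

```python
def max_n(candidates, n=1):
    """Select the n candidates with the highest similarity."""
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:n]]


def max_delta(candidates, d):
    """Select the maximum and every candidate within tolerance d of it."""
    top = max(candidates.values())
    return [name for name, s in candidates.items() if s == top or s > top - d]


def threshold(candidates, t):
    """Select every candidate whose similarity exceeds t."""
    return [name for name, s in candidates.items() if s > t]


# Candidates for DeliverToAddress, values from the 3 directions example.
candidates = {"shipToCity": 0.72, "custCity": 0.67, "shipToStreet": 0.52}

print(max_n(candidates, 1))
print(max_delta(candidates, 0.1))
print(threshold(candidates, 0.6))
```

MaxN with n=1 forces a single answer, whereas MaxDelta and Threshold may return several candidates and leave the final choice to the user.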

Evaluation

• Tested on 5 real-world purchase order schemas
  – CIDX, Excel, Noris, Paragon and Apertum (from www.biztalk.org)
  – |Inner or leaf nodes| != |paths| because the schemas share some fragments

Data Sets

• 5 schemas, 10 match tasks
• Matched manually by domain experts
• #Matches = number of correspondences to be identified
  – Shows the problem size
• Schema Similarity = #MatchedPaths/#AllPaths

Evaluation – match quality

• The automatic match returns a set P of matches
• I is the set of real matches (identified by domain experts)
• c is the set of correct matches found, i.e. the overlap of P and I
• Precision = |c|/|P|: reliability of the match predictions
• Recall = |c|/|I|: share of real matches found
• Accuracy = Recall*(2-1/Precision)
• Accuracy reflects the labour saved, after accounting for the effort to remove incorrect matches and to add missed matches
• Accuracy is negative when Precision < 0.5
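
The three measures can be computed directly from the two sets. The sketch below uses the algebraically equivalent form Accuracy = (2|c| - |P|)/|I|, obtained by substituting Recall = |c|/|I| and Precision = |c|/|P|, which avoids a division by zero when nothing correct is found. The function name and the toy match sets are assumptions.

```python
def match_quality(predicted, real):
    """Return (precision, recall, accuracy) for sets of matches P and I."""
    correct = predicted & real  # c = overlap of P and I
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(real) if real else 0.0
    # Recall * (2 - 1/Precision) rewritten as (2|c| - |P|) / |I|.
    accuracy = (2 * len(correct) - len(predicted)) / len(real) if real else 0.0
    return precision, recall, accuracy


# Toy example: 4 predicted matches, 3 real matches, 2 of them found.
P = {"a", "b", "c", "d"}
I = {"a", "b", "e"}
precision, recall, accuracy = match_quality(P, I)
print(precision, recall, accuracy)

# With mostly wrong predictions the accuracy becomes negative.
print(match_quality({"a", "w", "x", "y", "z"}, {"a"}))
```

In the first case half of the predictions are wrong, so the effort to remove them cancels the benefit of the correct ones and the accuracy is 0. In the second case removing the false matches costs more than matching by hand, hence the negative value.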

Experimental result

• Only in automatic mode
• Conducted 12,312 experiment series
  – Different choices of matchers
  – Different choices of direction etc.
• Each combination runs on the 10 schema match tasks (1<->2, …)

Distribution of no-reuse matchers

(Chart; axis: Accuracy)

• 1 series = 1 combination
• Most (7077) of the no-reuse series have Accuracy < 0

Distribution w.r.t. aggregation

(Chart; axis: Accuracy)

Distribution w.r.t. direction

(Chart; axis: Accuracy)

Distribution w.r.t. selection

(Chart; axis: Accuracy)

Outline

• Introduction
• COMA system
• Overview of different matchers
• Reuse matcher from COMA
• Evaluation
• Conclusions
• Discussions
• References

Conclusions

• COMA provides a framework for combining different matchers for different purposes

• A new matcher: the reuse-oriented matcher

Discussions

• Most matches are 1:1; what about n:1 and n:m?
• Accuracy metric
• Is time a problem?
• To match 2 schemas, is A <-> B a must?
  – What if A maps to B to some extent, and B maps to A to another extent?

(Figure: a <-> c and b <-> c are each 1:1 locally but 2:1 globally; {a, b} <-> c is 2:1 locally.)

References

• [VLDB02] COMA - A system for flexible combination of schema matching approaches
  – By Hong-Hai Do, Erhard Rahm
  – University of Leipzig

• [ICDE02] Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching
  – By Sergey Melnik, Hector Garcia-Molina, Erhard Rahm
  – Stanford and University of Leipzig

• [VLDB02] Translating Web Data
  – By Lucian Popa, Yannis Velegrakis, Renee J. Miller, et al.
  – IBM Almaden Research Center and University of Toronto


eNd

Interactive mode

• In contrast with automatic mode
• The user interacts with COMA in each iteration (optional)
• E.g.
  – Specify which matchers to use (simple / hybrid)
  – Accept / reject match candidates
• Improves match quality

Simple Matcher

• EditDistance: similarity is derived from the number of edits needed to transform one string into the other

• Synonym: looks up the terminological relationship in a specified dictionary

• N-gram: compares sequences of n characters
• Soundex: based on phonetic similarity
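
As an illustration of the EditDistance matcher, the sketch below computes the Levenshtein distance and turns it into a similarity between 0 and 1. The normalisation by the length of the longer string is an assumption, since the slide does not say how the edit count is mapped to a similarity value.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


print(edit_distance("kitten", "sitting"))
print(edit_similarity("shipToCity", "custCity"))
```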

Hybrid Matcher

• TypeName Matcher:
  – DataType + Name Matcher

• Children Matcher:
  – Leaves are compared with the TypeName Matcher
  – To compare two non-leaf elements A and B, compare A's children with B's children

Hybrid Matcher

• Leaves Matcher:
  – Similar to the Children Matcher, but only considers the leaves, compared with the TypeName Matcher
  – PO1.ShipTo.shipToStreet
  – PO1.ShipTo.shipToCity
  – PO2.DeliverTo.Address.Street
  – PO2.DeliverTo.Address.City
  – Comparing ShipTo with DeliverTo using the Children Matcher would compare shipToStreet with Address!
