
SeMap: A Generic Schema Matching System

by

Ting Wang

B.Sc., Zhejiang University, 2004

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

The Faculty of Graduate Studies

(Computer Science)

The University Of British Columbia

August, 2006

© Ting Wang 2006

Abstract

The rapidly growing number of autonomous data sources on the web makes the need for effective tools for creating semantic mappings increasingly crucial. Moreover, the goal of allowing applications to have more expressive semantics requires a change in focus. While most previous work focuses on creating mappings in specific data models for data transformation, it fails to capture a richer set of possible relationships between schema elements. For example, current schema matching approaches might discover that 'TA' in one schema is equivalent to 'grad TA' in another, even though the relationship can be modeled more accurately by saying that 'grad TA' is a specialization of 'TA'. Capturing this additional semantics in the mapping in turn enables applications with richer semantics.

In this thesis we concentrate on the following problem: given initial match (correspondence) information produced by current schema matching techniques, how can we construct a complex, semantically richer mapping that can be used across data models? Specifically, we aim at detecting the relationship types 'Has-a', 'Is-a', 'Associates' and 'Equivalent'. Technically, we achieve this goal in three main steps: (1) exploiting various types of semantic evidence for possible matches; (2) finding a globally optimal match assignment; (3) identifying the relationship embedded in the selected matches. We implemented our semantic matching approach within a prototype system, SeMap, and tested its accuracy and effectiveness.


Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements

1 Introduction
  1.1 Motivation
  1.2 Contribution
  1.3 Organization

2 Related Work
  2.1 Relationship Classification
    2.1.1 Equivalence Relationships
    2.1.2 Set-Theoretic Relationships
    2.1.3 Generic Relationships
  2.2 Schema Matching Techniques
    2.2.1 Rule-Based Solutions
    2.2.2 Learning-Based Solutions
  2.3 Ontology Alignment Techniques
  2.4 Sample Prototypes
    2.4.1 Rondo
    2.4.2 Cupid
    2.4.3 COMA
    2.4.4 iMAP

3 Problem Formulation
  3.1 Representation
  3.2 Problem Statement
  3.3 Semantic Resources
    3.3.1 Internal Resources
    3.3.2 External Resources
  3.4 Approach Overview
    3.4.1 Schema Matcher
    3.4.2 Match Selector
    3.4.3 Mapping Assembler

4 SeMap System
  4.1 Schema Matcher
    4.1.1 Base Matcher
    4.1.2 Similarity Score and Lineage Information
    4.1.3 Element-Level Matcher
    4.1.4 Structure-Level Matcher
    4.1.5 Architecture of Schema Matcher
  4.2 Match Selector
    4.2.1 Representation
    4.2.2 Bidirectional search
    4.2.3 Modeling user interaction
  4.3 Mapping Assembler
    4.3.1 Combining Maps and Mapt
    4.3.2 Identifying relationships
    4.3.3 Assembling mapping

5 Experimental Analysis
  5.1 Experimental Setting
    5.1.1 Data Set
    5.1.2 Expert Mapping
    5.1.3 Evaluation Metrics
    5.1.4 Experimental Methodology
  5.2 Experimental Result
    5.2.1 Matching Accuracy
    5.2.2 Component Contribution
    5.2.3 Incorporating User Feedback
    5.2.4 Discussion

6 Conclusion & Future Work

Bibliography

List of Tables

5.1 Characteristics of the input schemas.
5.2 Characteristics of the expert mappings.

List of Figures

1.1 An example of input schemas and output mapping.

2.1 A classification of current schema matching techniques. Courtesy of [22].

3.1 Representation of model. The left plot shows a graphical representation of a model, comprised of nodes (elements) and edges (relationships). The right table shows the tuple representation of edges.
3.2 Illustration of four relationship types handled by SeMap.
3.3 An example of complex mapping handled by SeMap.
3.4 Illustration of the matching process.
3.5 The basic system architecture of SeMap. It takes two models and external resources as input, and produces a generic semantic mapping. It consists of three main parts: the schema matcher, the match selector and the mapping assembler.

4.1 Architecture of the schema matcher. It consists of three layers: base matchers, a combining layer and a structure matcher.
4.2 Partial match assignments from the perspectives of the source and target schemas respectively.
4.3 Mapping assembly for matches of different types. Each 1-1 equivalence match corresponds to one mapping element, while each element of a complex match is associated with one mapping element.

5.1 Matching accuracy of SeMap. The three plots show the recall, precision and F-measure of the matching results for the three relationship types Equivalent, Has-a, Is-a and for the total correct matches respectively.
5.2 Error analysis of the resulting mappings.
5.3 The precision of SeMap after pruning incorrect matches. The bars from left to right show the matching results for the three relationship types Equivalent, Has-a, Is-a respectively.
5.4 Relative contribution of different types of semantic evidence to the matching results of SeMap. The two plots (from top to bottom) show the F-measure of identified matches (correspondences) and identified relationships respectively.
5.5 F-measure of correct correspondences versus the amount of user interaction (percentage of expert matches provided over the total number of matches). The curves for four datasets (Real Estate 1/2, Course Info 1/2) are shown.

Acknowledgements

I would like to express my gratitude to all those who have offered me help in completing this thesis. Especially, I owe the greatest thanks to my supervisor Rachel Pottinger, who provided me with excellent guidance and support throughout the entire process of this thesis project. I want to thank Dr. Tsiknis for giving me insightful comments on this work, and for being my second reader.

I would also like to thank all the members of the Database Management Lab, especially Jian Xu, for their constructive suggestions. Without their help, this work would not have been possible. Finally, I thank all my friends at the University of British Columbia. It has been a wonderful experience to grow up with them.

Chapter 1

Introduction

1.1 Motivation

Spurred by the growth of data sources on the web, information systems are witnessing a paradigm shift from monolithic databases to heterogeneous, interacting data sources. The fundamental problem in sharing data from multiple sources is dealing with the semantic heterogeneity inherent in their autonomous nature, and the key is identifying the semantic correspondences between them. The operation of finding such correspondences is called Match; it takes two schemas as input and produces a semantic mapping, specifying the relationships between elements of the two schemas. Such semantic mappings play a crucial role in numerous data sharing applications, including web data integration, schema evolution and migration, component-based development, etc.

Currently, the creation of semantic mappings, especially complex ones, is still mostly done manually, possibly supported by a graphical user interface. Manually creating semantic mappings is a tedious, error-prone process, and the labor intensity grows linearly with the number of matches to be performed. Hence the rapidly increasing number of web data sources necessitates automatic support for schema matching.


The problem of semi-automatically creating mappings has attracted intensive research in both the database and AI communities [2, 4, 10, 15, 28]. The procedure comprises two phases, schema matching and mapping construction. In schema matching, equivalence correspondences between elements of both schemas are identified. The equivalence correspondences can be one-to-one (1-1) matches, e.g., 'class' corresponds to 'course', or complex matches containing more than one element in each schema, e.g., 'TA' maps to some combination of 'grad TA' and 'ugrad TA'. Note that the focus of schema matching is to find such potential correspondences, rather than giving a final mapping to the users. Finding this mapping is done in mapping construction, where the identified correspondences are built on by adding more specific semantic information to generate a semantically rich mapping.


Figure 1.1: An example of input schemas and output mapping.

As a typical example of mapping construction, Clio [32] includes a set of user-interaction techniques to create SQL-style mappings, based on the output of an initial schema match. Such semantic mappings are necessary to transform data. Clio, however, like most other previous work on mapping construction, is restricted to relational and XML-style schemas; it does not capture the general richness of the possible relationships between elements in a data-model-independent fashion. Thus, although many common relationship types exist across SQL and XML (e.g., specialization), this work cannot be used to create the XML-style mapping.

Data sources on the web, however, are of various data models, e.g., XML, HTML, RDF, ontologies, text, etc. Hence exploring how to create richer, general relationships between schema elements, rather than concentrating on the specific data model under consideration, allows us to understand the general space of the possibilities. It also allows better reuse of ideas, since one does not have to create a separate algorithm for each ensuing data model. After a mapping with such general relationships is constructed, the transformations into a specific data model can be made more concretely. For example, it can easily be transformed into specific forms, e.g., SQL views or XSLT transformations, thus eliminating the need to maintain specific mappings separately. Also, a generic mapping can create a uniform interface between domain knowledge (ontologies) and web interfaces (database schemas), which is helpful for semantic web applications. Furthermore, it can be fed into a model management system [17], which aims to solve meta-data problems in a data-model-neutral fashion, or used for knowledge inference when applied to the ontology domain.

An example of a generic semantic mapping is shown in Figure 1.1, where two schemas S and T represent the concepts of 'class' and 'course' respectively. A generic mapping MapS−T is constructed, specifying a rich collection of semantic relationships between the elements of S and T, e.g., 'college' of T 'Has-A' 'dept' of S, while 'instructor' of S 'Is-A' 'faculty' of T. The relationship types adopted in this thesis follow the relationship classification of [21]. Compared with the equivalence relationships (1-1 or complex) considered in previous literature, this relationship classification is semantically richer and more expressive. Equipped with such generic mappings, one can envision a number of applications. For example, one problem facing current semantic web applications is the lack of domain-specific knowledge (e.g., ontologies). If domain knowledge in different representations can be mutually converted, the collection of knowledge will be significantly enriched.

1.2 Contribution

In this thesis we explore constructing such generic semantic mappings, based on initial match information that shows correspondences between the elements of both schemas. This initial match information can be produced by current schema matching techniques.

Mapping construction takes as input a set of initial matches produced by a set of schema matching algorithms, and generates a semantically richer mapping, such as the one in Figure 1.1, which describes complex relationships between elements of both schemas. Specifically, mapping construction is responsible for searching for a globally optimal match assignment from the pool of possible assignments, solving the conflicts among the selected matches, and identifying the complex relationships between the schema elements, e.g., the 'Has-A' relationships in Figure 1.1. However, constructing a generic semantic mapping is fundamentally difficult for several reasons:

• Finding correspondences with generic semantic relationships is substantially harder than finding simple equivalence, since the space of possibilities under consideration is much larger, and more semantic evidence is needed;

• The pool of initial matches is possibly quite large. Even for n:1 equivalence matches, the search space is large enough that most matching algorithms only consider 1:1 matches; when relationships other than simple equivalence are considered, it is infeasible to try all possible combinations to find the optimal assignment;

• Various semantic constraints can be imposed, rendering match selection a complicated constrained optimization problem;

• Identifying the relationships implicit in matches is a hard problem, and one that is made more difficult by attempting to make our output data model independent.

As in schema matching, mapping construction inherently cannot be fully automatic. The importance of user feedback is recognized in schema matching research [4, 31]; however, no systematic modeling of user interaction for mapping construction is available to date. One of the goals of our work is to limit interaction to critical points to help focus user attention and minimize user effort.

Aiming to overcome the problems listed above, in this thesis we describe a prototype system, SeMap, to create a generic, semantic mapping. We choose a graph-based representation that is similar to that used in model management [17], which is expressive enough to accommodate both schemas of many types and other meta-data, such as ontologies. Specifically, we make the following contributions:

• An architecture for semi-automatically constructing generic semantic mappings based on initial correspondence information;

• A novel probabilistic framework that incorporates match uncertainty and semantic constraints in a uniform way, and expresses match selection as a mathematical optimization problem;

• Effective modeling of user interaction to help focus user attention and minimize user effort, by detecting critical points where feedback is maximally useful;

• An effective solution for extracting the implicit relationships of initial matches based on various types of semantic evidence;

• A prototype system embodying the innovations above and a set of experiments to illustrate the correctness and effectiveness of our approach.

1.3 Organization

This thesis is a specification of our schema matching system SeMap. The goal is to present the technical details of implementing the system. Specifically, we intend to make clear mainly the following three aspects:

1. The formulation of the problem, including the exact representation of the input/output of the system, the resources we use and the assumptions we have made;

2. The specification of the system, including the system architecture, the exact input/output and interior structure of each component, and their interaction;

3. The experimental analysis, including the datasets we use, the metrics we use to evaluate our approach, and the experimental results and their explanation.

The remainder of the thesis is organized as follows: Chapter 2 presents a survey of related work; Chapter 3 formally defines the problem of mapping construction and gives an overview of the architecture of our system; in Chapter 4, we describe our mapping construction approach in more detail; Chapter 5 presents the experimental analysis of our approach; and Chapter 6 concludes this thesis and presents future work.


Chapter 2

Related Work

Semi-automatically creating semantic mappings has attracted intensive research in both the database (schema matching) and AI (ontology alignment) communities. The key differences and similarities of schema matching and ontology alignment include:

• Differences. Ontologies are logical systems, which obey some formal semantics, i.e., they can be interpreted as a set of logical axioms; however, database schemas often provide no explicit semantics for their data.

• Similarities. Schemas and ontologies are quite similar in the sense that (1) they both provide a vocabulary of terms that describe a domain of interest and (2) they both constrain the meaning of terms used in the vocabulary [30].

Due to their differences, schema matching is usually performed with techniques that guess the semantics implicit in the schemas, while ontology alignment is designed to exploit the knowledge explicitly encoded in the ontologies. Their similarities, however, make the solutions to these two problems mutually beneficial. In the following, we discuss the problems of schema matching and ontology alignment as a whole.


In this chapter, we present a survey of related work in three parts: first, in Section 2.1 we classify the current schema matching/ontology alignment techniques based on the relationships they can handle; we then discuss some typical techniques used in these approaches, specifically, schema matching in Section 2.2 and ontology alignment in Section 2.3; finally, we present several example prototype matching systems in Section 2.4.

2.1 Relationship Classification

The relationship types created by matching techniques can be roughly divided into three categories: equivalence relationships, set-theoretic relationships and generic relationships. Specifically, two schema elements having the equivalence relationship means they are semantically equivalent; the techniques to identify equivalence relationships are described in Section 2.1.1. The set-theoretic relationship classification regards each schema element as a set, and specifies their relationship as one of equivalence, subsumption, intersection, disjointness and incompatibility, as discussed in Section 2.1.2. The generic relationships refer to non-equivalence relationships, such as the Has-a and Is-a relationships discussed in this thesis. Two typical classifications of generic relationships can be found in ontology modelling [18] and meta-data management [21]. The techniques developed so far to handle generic relationships are presented in Section 2.1.3.


2.1.1 Equivalence Relationships

With the main goal of data transformation in specific data models, most schema matching/ontology alignment algorithms to date aim at discovering equivalence relationships [2, 3, 4, 10, 13, 14, 16, 31]. An equivalence correspondence found can be a 1-to-1 match (e.g., 'course' = 'class'), or a complex match (e.g., 'name' = concat('first-name' + 'last-name')).

Creating multi-arity (1-to-n or even n-to-m) matches is significantly harder than creating 1-to-1 matches for several reasons: (1) while the number of candidate matches is bounded for 1-to-1 matches (the product of the sizes of the two schemas), the number of match candidates to be considered in the complex case is much larger; (2) it is inherently difficult to generate a match to start with in the case of multi-arity matches, that is, in the case of an n-to-m match it is difficult to determine n and m in order to generate a set of candidate matches. Hence, to date most of the work on schema matching has focused on discovering 1-to-1 equivalence correspondences between schema elements [3, 4, 10, 13, 14, 16, 31]. R. Dhamankar et al. [2] proposed iMAP, a prototype for identifying 1-to-n correspondence matches, which reformulates schema matching as a search in an often very large match space. To search effectively, it employs a set of searchers, each discovering specific types of complex matches.

However, while attempting to discover semantically equivalent correspondences, it is possible that the matches identified by these techniques may not be exactly equivalence relationships; they may instead be the semantically richer relationships we are endeavoring to find, such as the relationship between 'TA' and {'grad TA', 'ugrad TA'} shown in Figure 1.1.

2.1.2 Set-Theoretic Relationships

The equivalence relationship can be considered as a special case of the set-theoretic relationships, which can specify the relative containment relationship between two sets. In [26], an effective solution is proposed to identify inter-set relationships by bidirectionally comparing the containment of data instances and meta-data of different schema elements. The problem with this approach is that the data instances associated with the two schemas must be in the same universe; otherwise the comparison of containment relationships is not meaningful. However, in many applications, especially web data integration, the data sources do not overlap.

2.1.3 Generic Relationships

There has been very little work on finding generic relationships between schema elements. The solution proposed by D. Embley et al. [5] relies heavily on a domain-specific ontology to find the relationships of Merge/Split (e.g., 'Address' consists of 'Street', 'City' and 'State'), Superset/Subset (e.g., 'Phone' contains both 'Phone day' and 'Phone evening'), and Set-Name as Value (e.g., the attribute 'Water-front' in one schema appears as a value of the attribute 'House-description' in the other schema).

The basic idea is to first map the schema elements to a comprehensive domain-specific ontology; the relationships between schema elements can then be determined by those of their counterparts in the ontology. This approach requires (1) a comprehensive ontology that covers all possible concepts that may appear in schemas in that domain, and (2) a domain-specific thesaurus that can map schema elements to their alternative representations in the ontology. Such an ontology and thesaurus are usually hard to obtain in real scenarios. Our work has fairly simple requirements for the needed semantic information, which is available in most schemas, and does not assume any comprehensive ontology. Nevertheless, the existence of such an ontology can improve the quality of the matching results of our system.

F. Giunchiglia et al. [7] proposed the concept of semantic matching, a pure schema-based approach. The basic idea is to first populate each element name with its meanings from some domain-specific dictionaries, and then compute the specialization relationship of schema elements based on the containment relationship of their meanings. Their approach, however, works only for identifying Is-a relationships and only for tree-structured schemas.

2.2 Schema Matching Techniques

The research on schema matching/ontology alignment provides a wealth of techniques to semi-automatically find semantic matches. The techniques can be classified by the information they exploit [22], as shown in Figure 2.1: the matches can be found by exploiting one type of semantic evidence (schema-level, data instance-level, etc.), or by combining multiple types of evidence (i.e., hybrid matchers, which integrate multiple matching criteria, and composite matchers, which combine results of independently executed matchers [22]). Matching techniques can also be classified by their methodologies into rule-based and learning-based solutions, which are discussed in Sections 2.2.1 and 2.2.2 respectively.


Figure 2.1: A classification of current schema matching techniques. Courtesy of [22].

2.2.1 Rule-Based Solutions

Rule-based matching techniques [7, 14, 16] constitute a rich collection of schema matching solutions, which have been used in both early and current matching applications. Rule-based techniques discover similar schema elements by exploiting schema-level information using hand-crafted rules. A broad variety of rules have been devised to exploit all possible information, including element names (labels), data types, structures, numbers of subelements, and integrity constraints. For example, F. Giunchiglia et al. [7] proposed to exploit the semantic meanings of element names to discover similar elements; Cupid [14] employs rules that categorize elements based on names, data types and domains; Similarity Flooding [16] measures pairwise similarity by propagating similarity from some fixed points according to the schema structures.


The rule-based techniques have some desirable features: (1) they are usually computationally inexpensive and require no training process, unlike learning-based approaches; (2) they usually require only schema-level information, which is available in most matching scenarios; (3) if some domain knowledge is available, one can specify domain-specific rules, which can work very well in certain types of applications. For example, users can write regular expressions that encode times or phone numbers, or quickly compile a collection of zip codes to help recognize these types of entities. Learning methods can hardly deal with these scenarios: they either cannot learn such complex rules, or require a large amount of training data with the correct representation for the desired result, which is usually hard to obtain.
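
To make this kind of hand-crafted rule concrete, the following is a minimal sketch of a rule that uses regular expressions over data instances to recognize phone numbers; the pattern, threshold and function names are illustrative assumptions, not rules taken from any particular system.

```python
import re

# Illustrative, hand-crafted pattern for North American phone numbers.
PHONE_PATTERN = re.compile(r"^\d{3}-\d{3}-\d{4}$")   # e.g., '604-822-1234'

def phone_fraction(values):
    """Fraction of data instances that look like phone numbers."""
    values = [str(v).strip() for v in values]
    if not values:
        return 0.0
    return sum(1 for v in values if PHONE_PATTERN.match(v)) / len(values)

def phone_rule_similarity(src_values, tgt_values, threshold=0.8):
    """Rule: if most instances of both elements look like phone numbers,
    count that as strong evidence that the two elements are similar."""
    if phone_fraction(src_values) >= threshold and phone_fraction(tgt_values) >= threshold:
        return 1.0
    return 0.0
```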

However, the rule-based techniques have several drawbacks: (1) they cannot effectively exploit data instance-level information, even though the data instances provide valuable information, e.g., precise data format, data distribution, statistical values, etc. It is possible in some cases that the schema-level information is opaque or very difficult to interpret, e.g., element names like A or B1 are too abstract to be interpreted. In contrast, learning methods such as Naive Bayes can easily construct probabilistic rules that find similarity in such scenarios, based on the distribution of data instances [11]. (2) Moreover, rule-based techniques cannot exploit previous matching results to improve the current matching process. Hence, in a matching application for a specific domain, the rule-based techniques are usually insufficient.


2.2.2 Learning-Based Solutions

Motivated by the drawbacks of rule-based matching methods, a collection of learning-based solutions has been proposed. These methods have considered a variety of learning techniques, and exploit both schema-level and data instance-level information. For example, Doan et al. proposed the LSD system, which employs a Naive Bayes learning method over data instances and also exploits the structure information of the XML data format; the iMAP system [2] pays attention to the description of elements, in addition to other schema information.

In developing learning techniques for schema matching, it has been realized that considering only schema-level or data instance-level evidence in the schemas being matched is often insufficient for accurate matching. Hence, several types of external resources have been considered to improve the matching quality. For example, assuming a domain-specific ontology is available, one technique is to first map the schemas/ontologies into that ontology, and then construct the matches based on the relationships inherent in the ontology [5]. For example, it is hard to identify the relationship between 'direct' and 'free toll' using regular approaches such as string comparison. However, by mapping them to a domain-specific ontology, one can find that they are both specializations of the concept 'phone', so it can be concluded that 'direct' is highly similar to 'free toll'.

Some recent work advocates exploiting past matching results to improve current ones [3, 4], with the basic idea of learning from past matches to predict unseen matching scenarios. An alternative solution considers learning from a corpus of schemas and matches [14]. Such a corpus provides alternative representations of concepts in the domain, i.e., it functions in the same way as an ontology, and thus can be leveraged to discover similarity between schema elements. However, it is not always practical to have such external resources, particularly since these resources must be domain-specific to be effective.

2.3 Ontology Alignment Techniques

Ontology alignment deals with finding corresponding concepts in different ontologies. In this section, we present some typical work on ontology alignment; a comprehensive survey can be found in [10].

OntoMorph [1] focuses on the problem of translating symbolically represented knowledge between different knowledge representations. It uses a description-logic-based approach, offering syntactic rewriting to support the translation between two different knowledge languages, and semantic rewriting to support inference-based transformation. OntoMorph requires users to provide transformation rules, and thus can be regarded as a type of rule-based technique.

Prompt [20] proposed an ontology alignment mechanism that finds corresponding concepts by refining an initial mapping (pairs of anchors) given by users or some simple linguistic matching approaches. Specifically, it analyzes the paths in sub-graphs limited by the anchors and determines which concepts frequently appear in similar positions on similar paths. The philosophy followed by Prompt is similar to that of Similarity Flooding [16].


FCA-Merge [28] is an example of an alignment technique that depends on external resources. The resources used in FCA-Merge are domain-specific documents, which cover the concepts in the ontologies. Through natural language analysis techniques, it generates a formal context for each document, which tells which documents contain which concepts. Based on these formal contexts, the Is-a relationships between concepts are inferred. However, since the formal context is built upon the generalization/specialization hierarchy of the concepts, this approach cannot be extended to other relationships, such as Has-a. Moreover, the requirement for domain-specific documents is not always feasible to meet.

MAFRA [15] proposed a framework for sharing distributed ontologies via mapping. A multi-strategy process is employed to calculate the similarities between ontology entities, including lexical similarity and property similarity (attributes or relations). Both top-down and bottom-up similarity propagation are employed. This can be considered as a counterpart of the hybrid matching techniques in schema matching.

To the best of our knowledge, though ontologies themselves can have complex relationships, e.g., Has-a or Is-a, the focus of most previous work on ontology alignment is finding semantically equivalent concepts, or one specific type of relationship (e.g., Is-a in FCA-Merge), in different ontologies [10], rather than discovering corresponding concepts with more types of generic relationships; the rich relationships in ontologies are used only as one type of semantic evidence.


2.4 Sample Prototypes

In this section, we consider some recent prototypes of schema matching systems.

2.4.1 Rondo

Rondo [17] is a complete prototype of a generic model-management system, in which high-level operators are used to manipulate models and mappings between models. As one of its main operators, match is implemented using the Similarity Flooding (SF) algorithm [16]. SF utilizes a hybrid matching approach based on the idea of similarity propagation. It starts from a string-based comparison (e.g., common prefixes and suffixes) of the schema element names to get an initial mapping, which is further refined using a fixpoint computation. The matching process is well formulated as a mathematical optimization problem in SF.

2.4.2 Cupid

Cupid [14] implements a hybrid matching algorithm that analyzes syntactic information of elements (e.g., string prefixes and suffixes) and structure information of schemas (e.g., tree matching weighted by leaves). Moreover, it exploits external resources, i.e., a pre-compiled thesaurus.

2.4.3 COMA

COMA [3] is a composite schema matching system. It provides a matcher library composed of different matching algorithms. Its framework allows the combination of partial results. The matcher library can be extended by adding new matching algorithms. Specifically, it contains 6 elementary matchers, 5 hybrid matchers and one reuse-oriented matcher. Compared with Cupid, this reuse-oriented matcher is a novel algorithm, which tries to leverage previously obtained results for new schemas.

2.4.4 iMAP

iMAP [2] is a matching system that considers 1-to-n equivalence matches. The authors regard the problem of matching as a search in a usually infinite match space. The overall goal is achieved in three steps: (1) a set of basic matchers, called searchers, are employed to detect similar elements according to different criteria (e.g., linguistic similarity, numerical equivalence, etc.); specifically, for each element in the target schema, a set of similar elements is found in the source schema by the searchers, including 1-to-1 and n-to-1 matches. (2) The match candidates generated in the first step are evaluated by a similarity evaluator module, and the result is a similarity matrix which indicates the similarity between the target element and the different match candidates. (3) A match selector module selects the best match candidate as the final result. iMAP also provides an explanation module which can provide explanations for the generated matches, e.g., the reason the match is selected, the implicit equivalence relationship, etc.

To the best of our knowledge, most previous work on schema matching focuses on one-to-one equivalence relationships when finding semantic correspondences between two schemas. Little work has been done on identifying multiple types of complex relationships. In the following chapters, we present SeMap, a prototype schema matching system which is designed to find generic semantic correspondences.


Chapter 3

Problem Formulation

As discussed in the related work (Chapter 2), most work on schema matching so far focuses on finding one-to-one equivalence relationships between schema elements. The overall goal of our schema matching system, SeMap, is to identify a generic semantic mapping between two schemas. Here "generic semantic mapping" means that (1) the matches may be non one-to-one, e.g., one element is mapped to multiple elements of the other schema, a.k.a. 1-to-n matches; and (2) the relationship types may be non-equivalence, e.g., Has-a, Is-a, etc., as classified in the Vanilla meta-meta model [21].

An example of a generic semantic mapping is shown in Figure 1.1, where two schemas represent the concept of 'class'/'course' in different ways. The mapping contains complex correspondences, such as 'TA' of schema S being mapped to 'undergrad TA' and 'grad TA' of schema T. Instead of only the equivalence relation considered in most schema matching approaches, the relationship types involved are also complex, e.g., the 'department' of schema S is considered as a member of the 'college' of schema T.


Figure 3.1: Representation of a model. The left plot shows a graphical representation of a model, comprised of nodes (elements) and edges (relationships). The right table shows the tuple representation of the edges.

3.1 Representation

In this thesis we consider how to form a generic semantic mapping. Because we are attempting to solve this problem in a data-model-neutral fashion that could be applied equally well to relational or XML schemas or an ontology, we adopt the terminology of Model Management [17], and say that we take as input two models (in what follows, we use schema and model interchangeably).

A model is a complex design artifact, such as a relational schema, an XML schema, an XML DTD, or an ontology. Technically, a model can be represented as a directed labelled graph (V, E). Specifically, V is the set of nodes, each denoting an element of the schema, e.g., attributes in a relational database table, type definitions in an XML schema, clauses of a SQL statement, etc. E is the set of binary, directed, typed edges over V. Formally, each edge is a tuple <s, p, o>, where s is the source node, p is the type of the edge, and o is the target node; p denotes the relationship between s and o. (The notation <s, p, o> follows the <subject, predicate, object> notation of ontologies.) An example of the model representation is depicted in Figure 3.1, which illustrates the concept of 'course'.
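
To make the graph representation concrete, the following is a minimal sketch (the class and function names are assumptions for illustration, not code from the thesis) of a model as a set of element nodes plus typed <s, p, o> edges:

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass
class Model:
    """A model as a directed labelled graph: nodes are schema elements and
    each edge is a tuple <s, p, o> whose label p names the relationship."""
    nodes: Set[str] = field(default_factory=set)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_edge(self, s: str, p: str, o: str) -> None:
        # Adding an edge implicitly registers its endpoints as elements.
        self.nodes.update({s, o})
        self.edges.append((s, p, o))

# A few edges in the spirit of the 'course' model of Figure 3.1.
course_model = Model()
course_model.add_edge("course", "Associates", "faculty")
course_model.add_edge("course", "Has-a", "grad TA")
course_model.add_edge("course", "Has-a", "ugrad TA")
```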

As indicated in [19, 24], in addition to the Equivalent relationship, the concepts of generalization/specialization and part-of/whole have long been recognized as ubiquitous and essential mechanisms in object-oriented modeling techniques, which have a large scope of applications, such as CAD, manufacturing, software development and computer graphics. In this thesis, we follow the relationship classification of the Vanilla meta-meta model [21], which embeds the concepts of generalization/specialization and part-of/whole. Specifically, in the Vanilla meta-meta model there are five relationship types, namely Associates, Is-a, Has-a, Contains, and Type-of. In this thesis, we concentrate on the first four, Associates, Is-a, Has-a and Contains, where Is-a represents the concept of generalization/specialization, Contains and Has-a represent the concept of part/whole, and Associates represents all other weak semantic relationships.

Strictly speaking, though both Has-a and Contains embed the concept of part-of/whole, they differ in semantics. As indicated in [19], part-of relationships can be categorized along two dimensions: (1) the degree of sharing of parts among whole objects and (2) the degree of dependence between some part objects and some whole object(s). Contains and Has-a differ in the second dimension in that part objects are highly dependent on whole object(s) in Contains, while this dependence is not so strong in Has-a. This difference leads to the rule that in a Contains relationship, the containee is a part of its container element and cannot exist on its own (delete propagation). Moreover, Contains is a transitive relationship and must be acyclic, while Has-a is weaker than Contains in that it does not propagate deletion and can be cyclic. Since we focus on the high-level part-of/whole relationship, we treat Has-a and Contains as the same in our framework.

In addition, we also consider the equivalence relationship, which is the main focus of previous schema matching approaches. In total, then, our framework considers four relationship types: Equivalent, Has-a, Is-a, and Associates. Their formal definitions are specified as follows, and their graphical representation is shown in Figure 3.2:

• Equivalent: E(x, y) means that x is semantically equivalent to y. This is a symmetric relationship type, i.e., E(x, y) ⇔ E(y, x);

• Has-a: H(x, y) indicates that x has y as a sub-component/member. This is an asymmetric relationship, i.e., H(x, y) does not imply H(y, x);

• Is-a: I(x, y) means that x is a specialization of y. This is an asymmetric relationship;

• Associates: A(x, y) indicates that x is associated with y. It is the weakest relationship that can be expressed; it has no constraints or special semantics. This is a symmetric relationship type.

This representation is complex enough to capture many of the semantic relationships that appear in models, and yet is simple enough for a reasonable initial foray into the problem.

Figure 3.2: Illustration of four relationship types handled by SeMap.

A mapping, MapS−T, is a formal description of the semantic relationships between two schemas, S and T. A mapping itself is a model consisting of a set of mapping elements E, and a set of relationships R on E.

The elements of the two schemas are related through the mapping elements. Each mapping element e ∈ E is like any other element in schemas S and T. In addition to being the origin or destination of any kind of relationship found in a model, i.e., R, each e ∈ E can be the origin of one or more mapping relationships, M(e, s), where s ∈ S ∪ T, which specifies that the origin element e corresponds to the destination element s. The semantics of a mapping relationship is such that for all s1, s2 ∈ S ∪ T such that M(e, s1), M(e, s2) and s1 ≠ s2, s1 corresponds to s2.

Given this rich mapping structure, generic semantic relationships, not just simple correspondences, between the elements of S and T can be expressed in the following way: two semantically equivalent elements are represented by one mapping element, while the relationship between two mapping elements indicates the relationship between their corresponding schema elements. For example, in Figure 3.3, the mapping element m1 corresponds to the elements 'class' and 'course', which represent the same concept; the relationship between m4 and m5 indicates that 'instructor' 'is-a' 'faculty'.
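
As a small illustration of this mapping structure (a sketch with assumed names, building on the Model sketch in Section 3.1 and not the thesis's implementation), each mapping element carries its M(e, s) correspondences into S and T, and relationships between mapping elements stand for relationships between the linked schema elements:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class MappingElement:
    name: str
    # M(e, s): the schema elements of S or T that this mapping element corresponds to.
    corresponds_to: List[str] = field(default_factory=list)

@dataclass
class Mapping:
    elements: List[MappingElement] = field(default_factory=list)
    # Relationships R between mapping elements, as (origin, type, destination) tuples.
    relationships: List[Tuple[str, str, str]] = field(default_factory=list)

# The examples discussed for Figure 3.3:
m1 = MappingElement("m1", ["class", "course"])   # 'class' and 'course' are equivalent
m4 = MappingElement("m4", ["instructor"])
m5 = MappingElement("m5", ["faculty"])

map_s_t = Mapping(elements=[m1, m4, m5],
                  relationships=[("m4", "Is-a", "m5")])  # 'instructor' Is-a 'faculty'
```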


Figure 3.3: An example of complex mapping handled by SeMap.

3.2 Problem Statement

Given the definitions of model and mapping, we are now ready to formally define the goal of SeMap: given two models, S and T, find the generic semantic relationships required to create the mapping MapS−T between S and T.

There may be some optional inputs to the matching process, specifically (1) an initial mapping MapS−T′, which provides an initial set of correspondences and needs to be refined by the process; and (2) external semantic resources r used by the matching process, e.g., domain-specific thesauri, ontologies, etc. The matching process is illustrated in Figure 3.4.


Figure 3.4: Illustration of the matching process.


3.3 Semantic Resources

The semantic resources used by matching techniques can be categorized as internal resources, which are contained in the input schemas or their associated data instances, and external resources, which are the semantic information not present in the schemas or data instances.

3.3.1 Internal Resources

The semantic resources of the input schemas include both element-level information, which refers to the information stored at each schema element (e.g., element name, data type, structure, etc.), and structure-level information, which refers to the information contained in the relationships between schema elements (e.g., relationship type, constraints, etc.). In Sections 3.3.1.1 and 3.3.1.2, we introduce the element-level and structure-level resources considered in our SeMap system respectively.

3.3.1.1 Element-Level Information

We consider the following element-level information:

• Element name (label). Each element name is of String type. The name (label) provides a first layer of semantic evidence about the possible meaning of the schema element.

• Element type. If an element contains data, it is usually associated with a type indicating the storage format of the data. Note that in many representations, the data type of a model element is considered a separate element, which is linked to the element itself by a Type-of relationship. In our system, we consider the data type as an attribute of the model element, e.g., String is an attribute of the element 'professor', rather than a separate element. The element type can provide hints in the sense that similar schema elements usually have the same or compatible data types.

• Element description. This is a short description of the semantic meaning of the element, which usually contains more information than the element name alone. For web interfaces where only schema-level information is available, the element description is especially valuable in determining the exact semantics of the elements. For example, it is hard to tell the semantics of an element only by its name 'people' in a flight ticket booking website. However, with the help of its description 'total passengers', one can conclude that 'people' stands for the overall number of tickets bought.

• Data instances. As discussed in Chapter 2, data instances can provide valuable information that cannot be found in the schemas, e.g., precise data format, data distribution, statistical values, etc. Specifically, the declared data type of an element may not reflect exactly how its data is stored, which can only be found in the data instances. For example, the element 'phone' may be of an Integer type; however, looking at its data instances, one may notice that its exact format is 'xxx-xxx-xxxx', which is not reflected in its data type. Meanwhile, the distribution of data instances is also useful in identifying similar schema elements, especially when the element names are obscure, e.g., A1 and B2 [11].


3.3.1.2 Structure-Level Information

In addition to the element-level information discussed above, we also consider structure-level information. In our system SeMap, we mainly consider two types of structure-level evidence:

• Relationship Type. Each edge between two schema elements is of a certain relationship type, which can be leveraged in the matching process. The basic intuition is that if two elements are semantically similar, elements having the same relationship with them are also highly likely to be semantically related.

• Constraints. Each edge can have constraints, including (1) cardinality in relational database tables, e.g., 1-n, 1-1, etc., and (2) key properties of elements, e.g., unique, primary, etc.

3.3.2 External Resources

Previous work on matching techniques has shown that internal semantic evidence is usually insufficient for achieving high-quality matching results; some additional external resources should be leveraged to improve the matching quality.

In SeMap, we consider two types of external resources:

• Thesaurus. A thesaurus is a dictionary which provides different representations of the same concept. Hence the element names can first be populated with their synonyms, so that one has a better chance of finding similar elements (a small lookup sketch follows this list). Specifically, SeMap uses WordNet as the thesaurus. WordNet is a comprehensive English lexical reference system, which organizes more than 60,000 nouns, 11,000 verbs, 6,000 adjectives and 3,000 adverbs into synonym sets (synsets). It is considered one of the most powerful tools for computational linguistics, and has been used in several matching applications [7].

• Ontology. Ontologies, especially domain-specific ontologies, are powerful tools for discovering similar elements, and even for identifying their implicit relationships. However, they are not always obtainable. The collection of ontologies we employed in SeMap is provided by OntoBuilder [6].
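
As a concrete example of the thesaurus lookup described above, the following sketch uses NLTK's WordNet interface; this is an assumption made for illustration only, since the thesis does not say which WordNet API SeMap uses.

```python
# Requires: pip install nltk, then nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def synonyms(label: str) -> set:
    """Populate an element name with its WordNet synonyms (lemma names)."""
    names = set()
    for synset in wn.synsets(label):
        names.update(lemma.replace("_", " ") for lemma in synset.lemma_names())
    return names

# Two element names have a better chance of matching once both are expanded:
print(synonyms("course") & synonyms("class"))   # overlapping synonyms, e.g. {'class', 'course', ...}
```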

3.4 Approach Overview

In this section, we present an overview of our generic matching system SeMap. As an implementation of the match operator, SeMap takes as input two schemas (models) S and T, and produces their generic semantic mapping MapS−T. SeMap also takes an external semantic resource r as an additional input.

In order to identify the generic semantic relationships between schema elements, SeMap not only has to identify the correspondences of complex relationships, but also has to extract the implicit relationship types. Figure 3.5 shows the basic architecture of this mapping construction system. SeMap implements this goal in three main phases. In the first phase, schema matching, the candidate matches (correspondences) of generic semantic relationships are identified; note that most previous work focuses on finding correspondences of Equivalent relationships, while in our work we also have to consider the correspondences of other relationships, which significantly increases the difficulty. In the second phase, match selection, a subset of candidate matches is selected to form the complete mapping; in this phase, we develop a novel probabilistic framework that incorporates both match uncertainty and domain constraints, and implements match selection as a constrained optimization problem. Finally, the implicit relationship types of the selected matches are determined in the third phase, mapping assembling. To the best of our knowledge, most previous work focuses either on finding Equivalent relationships or one specific type of relationship, i.e., Is-a; not much work has been done on the problem of exploiting different types of semantic information to identify generic semantic relationships.

Figure 3.5: The basic system architecture of SeMap. It takes two models and external resources as input, and produces a generic semantic mapping. It consists of three main parts: the schema matcher, the match selector and the mapping assembler.

3.4.1 Schema Matcher

The schema matcher takes as input two schemas S and T, and generates a set of initial matches showing the correspondences between the elements of both schemas, i.e., it constructs the input to the mapping construction problem. For each initial match, it also produces a similarity score indicating its uncertainty, and lineage information recording how it was identified; this lineage information is retained to help in mapping construction. While most matching techniques developed so far support the estimation of similarity scores, recording the lineage information for each match is novel, and is designed specifically to support identifying generic semantic relationships in the mapping assembler. A more detailed discussion of the schema matcher appears in Section 4.1.

Functionality: The schema matcher consists of a set of base matchers, as in many composite approaches (see Chapter 2). A base matcher is an algorithm that looks at some aspect of the model/schema and generates a series of candidate matches. Each candidate match shows a 1-to-1 or 1-to-n correspondence. The schema matcher is responsible for ensuring that (1) the two schemas/models are input to each base matcher (where applicable) and (2) the output results from each base matcher are combined to generate a similarity score and lineage information for each candidate match.

Notations: Candidate match: the schema matcher generates a list of candidate matches of the form e → e′, where e and e′ are single elements in either schema.

Tactics: Schema matching can be considered as a searching problem in a

huge space, i.e., the number of candidate matches is quite large. There are

an unbounded number of functions for combining attributes in a schema,

and each one of them could be a candidate match. To search the space

effectively, we employ a set of base matchers. Each base matcher considers


a meaningful subset of the space, corresponding to specific types of semantic

information, e.g., element name, type, etc.

The generic semantic match can be either of the form 1-to-1 or the

form 1-to-n (one source element corresponds to multiple target elements).

However, since the focus of our work is to identify the relationship between

schema elements, rather than finding the exact transformation rules, we

consider the case of 1-to-n matches as n 1-to-1 matches, and identify the

relationship between the source element and each target element, which

significantly reduces the search complexity.

There are a number of different kinds of matchers that we can consider.

In the current implementation of SeMap, the base matchers include the

label matcher, the sense matcher, the type matcher and the structure matcher.

SeMap’s architecture is flexible enough that it would be easy to extend it

to include other base matchers.

The partial results produced by the basic matchers are combined to

generate the similarity score and lineage information for each candidate

match. Currently, the similarity score is calculated as a weighted sum of the similarity estimates of all base matchers, and the lineage information is the union of all the semantic evidence found by the base matchers.

3.4.2 Match Selector

The match selector is responsible for assigning each schema element to a

match, such that the assignment as a whole has the minimum uncertainty, and outputs two best

match assignments Maps and Mapt for the elements of S and T respectively.

For example, as shown in Figure 3.3, from the perspective of source schema,


the Maps may be this set of matches: ‘class’ : ‘course’, ‘professor’ : ‘faculty’,

‘dept’ : ‘college’, ‘instructor’ : ‘faculty’, and ‘TA’ : ‘grad TA’, while the Mapt

may consist of this set of matches: ‘course’ : ‘class’, ‘college’ : ‘dept’, ‘ugrad

TA’ : ‘TA’, ‘grad TA’ : ‘TA’ and ‘faculty’ : ‘professor’. These two mappings

are then merged to form the final mapping as shown in Figure 3.3. A more detailed discussion is given in Section 4.2.2.

Functionality: The match selector takes as input the list of candidate matches with their associated similarity scores, and searches for the best global assignments from the set of candidate matches. Technically, from

the pool of candidate matches, it selects a proper candidate match for each

source/target element. Here ‘proper’ means that the candidate match has

high probability (similarity score) and it satisfies user-defined domain con-

straints to the maximum extent.

Notations: Constraint: Constraint refers to the regularities imposed

on the generated mapping. Roughly, in our system, we consider two kinds

of constraints. (1) Schema-independent constraints are those imposed by

the meta-meta data model languages, which express general rules that each

model needs to obey. For example, the Is-a relationship must be acyclic. (2)

Schema-dependent constraints refer to those regularities imposed on specific

schemas and data of the sources in the domain.

Tactics: The functionality of the match selector is similar in spirit to the

constraint handler as described in iMAP [4], which applies a set of domain

constraints to select a subset of candidate matches. However, we propose a

novel statistical model to incorporate both match uncertainty and domain

constraints in the same framework, and express the match selection as a


constrained optimization problem, for which effective solutions are available.

We believe that if user interaction is involved, the accuracy of prediction

can be greatly improved. In SeMap, we apply the technique of active learning

to identify the critical points in selecting the matches where user interaction

is maximally useful, so that the user effort can be significantly reduced.

3.4.3 Mapping Assembler

The mapping assembler combines Maps and Mapt, identifies the relation-

ship embedded in the selected matches, and assembles them into a generic

semantic mapping that includes richer semantic relationships. For example,

by consulting the associated lineage information, specifically, that the label of the element ‘TA’ is a substring of that of ‘grad TA’, the two elements are detected as being in an Is-a relationship. A detailed discussion of this part is given in Section 4.3.

Functionality: The best match assignments Maps and Mapt are com-

bined to form a final mapping. Then for each match in the final mapping,

the mapping assembler extracts related semantic evidences from lineage in-

formation to identify the implicit relationship type.

Tactics: The combining of the two match assignments Maps and Mapt

is achieved by ranking each match according to its contribution to the final mapping, that is, (1) the likelihood of this match and (2) the violation of domain constraints incurred by including this match in the final mapping. The highest-ranked subset of matches is selected to form the final mapping.

To identify the implicit relationship of each selected match, all types of semantic evidence generated by the schema matcher are considered. For each type of semantic evidence, a set of heuristic rules is specified. The results are then combined to vote for the final decision on the relationship type.

In this chapter, we formalize the problem of creating generic seman-

tic mapping, and give an overview of the schema matching system SeMap.

We propose the techniques that (1) identify the correspondences (candidate

matches) of generic semantic relationships; (2) select the matches from the

pool to form the final mapping; (3) determine the relationship types implicit

in the selected matches. In the next chapter, we present the SeMap system

in detail.


Chapter 4

SeMap System

In this chapter, we present the technical details of our schema matching

system SeMap. As shown in Chapter 3, the overall architecture of SeMap

consists of three main parts, the schema matcher, the match selector, and

the mapping assembler, which are responsible for finding correspondences

(candidate matches), selecting a subset of candidate matches to form a map-

ping and identifying the implicit relationships respectively. The three parts

are discussed in detail in Sections 4.1, 4.2 and 4.3, respectively. To illustrate how SeMap can produce the generic semantic matches, we show the matching process for the example in Figure 3.3 as we introduce the components of our system.

4.1 Schema Matcher

One problem facing schema matching/ontology alignment is the lack of suf-

ficient semantic evidence to discover matches, which is even more severe if

one intends to identify generic relationships between schema elements. A

key conclusion from previous research is that an effective schema match-

ing tool requires a combination of base techniques, e.g., linguistic matching,

structure matching, detecting overlapping of data instance, etc [22]. Hence


we follow a composite approach in detecting potential correspondences.

4.1.1 Base Matcher

The components of a schema matcher are a set of pre-existing matchers

that exploit any available information, and incorporate base techniques in

a uniform framework, such as COMA [3] and iMAP [2], as discussed in

Chapter 2. In these frameworks, a set of base matching approaches (aka

matchers) are organized as a matcher library. They discover initial corre-

spondences based on different types of semantic evidence, e.g., schema level,

data-instance level, corpus, and ontology.

Equipped with this set of base matchers, the problem of discovering initial matches can be modeled as a search of the space of possible matches using various matchers, each of which exploits a meaningful subset of the space [2].

Based on the level of granularity on which matching is performed, base

matchers can be classified as either element-level matchers, or structure-

level matchers. The former computes mapping elements between individ-

ual nodes, and the latter computes mapping elements between subgraphs.

Named after the semantic resource they exploit, the base matchers we imple-

ment in SeMap include sense matcher, label matcher, type matcher, ontology

matcher, data instance matcher and structure matcher. Among them, the

sense matcher, the label matcher, the type matcher, the ontology matcher

and the data instance matcher operate at the element level, while the structure matcher operates at the structure level.


4.1.2 Similarity Score and Lineage Information

In addition to the set of initial matches (correspondences), we also expect the

schema matcher to provide the following information: (1) similarity score.

Each candidate match m is associated with a similarity score Sim(m) ∈ [0, 1], indicating the belief about its uncertainty, with 1 meaning perfectly

certain; (2) lineage information. For each initial match, one records the

flow of information in and out of the system, such as assumptions, domain

knowledge, etc. From this lineage information, one can trace how this match

is generated. Lineage information will be valuable in discovering the generic

relationships.

The task of similarity evaluation has been discussed in previous work [2,

3, 4], which exploits various types of similarity information, and employs

learning, statistical or heuristic techniques. The module of recording lin-

eage information is similar in spirit to the explanation module in [2], which

provides explanation for user questions posed on the generated matches,

such as explaining existing match, absent match, or match ranking. How-

ever, our work focuses on distinguishing different semantic relationships. The implementation of similarity evaluation and lineage information recording in SeMap will be described as each base matcher is presented.

4.1.3 Element-Level Matcher

Element-level matching techniques analyze the information at individual

elements. At the element level we exploit all the techniques discussed in the

literature [7, 15, 22]. However, these techniques cannot be applied directly; instead, they must be extended to support the similarity score and lineage information.

4.1.3.1 Label Matcher

The label-based matcher finds semantically related elements by evaluating

the syntactic similarity of their labels (names). Typically it will find as

similar names ‘Match’ and ‘match’, but not ‘match’ and ‘alignment’.

Before applying strict string comparison, we employ a number of stan-

dard natural language preprocessing procedures that can help greatly im-

prove the results of comparison:

• Case normalization. It converts each alphabetic character in a string

to its lower-case counterpart;

• Soundex elimination. Soundex is an encoding of names based on their

pronunciation instead of their spelling, e.g., ‘4U’ and ‘for you’;

• Digit suppression. It suppresses digits and leaves characters only, how-

ever it needs to be used with care, since there are cases where digits are

semantically meaningful, e.g., soundex, or chemical names;

• Canonicalization. This is the procedure of converting names to their

standard form by stemming and some other techniques. This is impor-

tant to deal with symbols with special prefix/suffix [22], e.g., ‘cname’

→ ‘customer name’, and ‘empno’ → ‘employee number’;

• Tokenization. The label may be comprised of a set of tokens, e.g.,

‘french-course’ is segmented as ‘french’ and ‘course’;


• Stopword elimination. It eliminates those frequently-used words that

can be found in a stop list (e.g., ‘to’, ‘a’, . . .).

The preprocessing steps take as input element labels, and produce for

each label a set of tokens whose conjunction carries its meaning. For ex-

ample, the label of ‘grad-TA’ is converted to the tokens of ‘graduate’ and

‘TA’.

The similarity of two strings can be defined in various ways, e.g., Ham-

ming distance, substring similarity, N-gram distance, edit distance, etc. [10]. In our implementation, we adopt the N-gram distance, which has proved effective in information retrieval research, e.g., Rondo [17]. N-gram distance

works as follows: let ngram(s, n) be the set of substrings of string s of length

n, the N-gram distance of strings s1 and s2 is defined as

δ(s1, s2) = |ngram(s1, n) ∩ ngram(s2, n)| / (n · min(|s1|, |s2|))

The normalization guarantees that the N-gram distance lies within the range [0, 1].

Let T1 and T2 be the sets of tokens of the two elements respectively, the

similarity score is defined as

Sim(T1, T2) = 1 − min_{t1∈T1, t2∈T2} δ(t1, t2)

i.e., it corresponds to the minimum distance between the tokens of two

elements.
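A minimal sketch of this token-level comparison is given below. It assumes the tokens have already been produced by the preprocessing steps; δ is computed as the normalized n-gram overlap from the formula above, and the element-level score is taken over the best token pair (equivalently, one minus the minimum token-pair distance when the distance is read as one minus the overlap). The handling of tokens shorter than n is an illustrative choice, not something fixed by the thesis.

```python
def ngrams(s, n=3):
    """All length-n substrings of s, lower-cased."""
    s = s.lower()
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 0))}

def ngram_overlap(s1, s2, n=3):
    """Normalised n-gram overlap: |ngram(s1,n) & ngram(s2,n)| / (n * min(|s1|,|s2|))."""
    if min(len(s1), len(s2)) < n:              # assumption: very short tokens compared directly
        return 1.0 if s1.lower() == s2.lower() else 0.0
    common = ngrams(s1, n) & ngrams(s2, n)
    return len(common) / (n * min(len(s1), len(s2)))

def label_similarity(tokens1, tokens2, n=3):
    """Element-level label score: the best n-gram overlap over all token pairs."""
    return max(ngram_overlap(t1, t2, n) for t1 in tokens1 for t2 in tokens2)

# e.g. label_similarity(['graduate', 'TA'], ['TA']) scores the shared token 'TA' highly.
```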

The lineage information recorded by the label matcher includes (1) whether the token sets of the two elements are equivalent; (2) whether any tokens of the two labels overlap; (3) whether any tokens of the two elements are in a prefix/suffix or substring relationship. As an example, in Figure 3.3, the tokens of ‘TA’, ‘grad

TA’ and ‘ugrad TA’ overlap, which is recorded as the lineage information of

matches ‘TA’:‘grad TA’ and ‘TA’:‘ugrad TA’. From such semantic evidence,

one can infer the possible relationships existing between the two elements,

which will be discussed in Section 4.3.

4.1.3.2 Type Matcher

The type of a schema element carries information about its data type (e.g., string, integer, float, etc.), value domain (e.g., integer in the range of [1, 12]), and

key characteristics (e.g., unique, primary, foreign). The type matcher deter-

mines the similarity of schema elements based on such semantic information.

In our implementation, we consider six commonly used basic data types,

namely String, Integer, Float, Date, Nodata, Enum. The first four data types

are self-explanatory. The Nodata type means that the element has no asso-

ciated data instance, which usually appears at non-leaf node in the schema,

e.g., XML tree. Enum is short for Enumerate, e.g., an element may have the

instance from a set of color {‘red’, ‘yellow’, ‘blue’}. Note that all these basic

data types can be combined sequentially to from a complex data type, i.e.,

Composition data type. For example, the address format ‘Street + P.O.Box

+ City’ can be considered as a composition of String, Integer and String.

In implementing the type matcher, we define the possible relationship

between two data types as one of equivalent, compatible and incompatible. We

determine the relationship between two types using the following heuristic

rules:


• The same basic data types are equivalent;

• All the basic data types are mutually incompatible, except that the

Integer type is compatible with that of Float;

• Two composite data types are compatible if one appears as a subse-

quence of the other, e.g., {String, Integer, String} is compatible with

{Float, String, Integer, String}, while {Integer, String, String} is not;

• Two value domains are compatible if one is a subset of the other, e.g.,

positive integer number is compatible with positive float number.

In SeMap, the similarity score produced by the type matcher is defined on

three metrics, data type, value domain (if there is any), and key character-

istics (if there is any). For equivalent data types, it produces 1, compatible

0.5, and incompatible 0.
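These rules can be captured in a short routine; the sketch below covers only the data-type part of the score (value domains and key characteristics would be handled analogously), and representing a composite type as a Python list of basic type names is an assumption made for illustration.

```python
def is_subsequence(short, long):
    """True if the sequence `short` appears within `long` in order."""
    it = iter(long)
    return all(t in it for t in short)

def type_relationship(t1, t2):
    """Classify two (possibly composite) types as equivalent, compatible or incompatible."""
    t1, t2 = list(t1), list(t2)                      # composite type = sequence of basic types
    if t1 == t2:
        return "equivalent"
    if len(t1) == 1 and len(t2) == 1:                # basic types: only Integer/Float are compatible
        return "compatible" if {t1[0], t2[0]} == {"Integer", "Float"} else "incompatible"
    if is_subsequence(t1, t2) or is_subsequence(t2, t1):
        return "compatible"
    return "incompatible"

def type_score(t1, t2):
    """Similarity score produced by the type matcher: 1, 0.5 or 0."""
    return {"equivalent": 1.0, "compatible": 0.5, "incompatible": 0.0}[type_relationship(t1, t2)]

# e.g. type_score(["String", "Integer", "String"],
#                 ["Float", "String", "Integer", "String"])   # -> 0.5 (compatible)
```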

The lineage information recorded by the type matcher includes (1) whether the data types of the two elements are equivalent, compatible or incompatible; (2) whether the data types are composite, and the number of basic data types involved; (3) whether the value domains of the two elements overlap; (4) whether their key properties (if any) are the same. For example, in Figure 3.3, the elements ‘TA’, ‘grad TA’, ‘ugrad TA’, ‘professor’, ‘instructor’ and ‘faculty’ all have the same type String, which means they are possibly related, while ‘class id’ and ‘course no’ not only have the same data type but also the same key property, since they are both unique in the two schemas.


4.1.3.3 Sense Matcher

It is common that the same concept has different semantically equivalent rep-

resentations in different schemas. Such implicit similarity can hardly be de-

tected by syntactic analysis techniques. For example, ‘car’ and ‘automobile’

are synonyms, and refer to the same concept; while ‘book’ and ‘article’ have

the same hypernym ‘publication’, and are thus considered similar. Such semantic-based matching is especially valuable for schemas with relatively flat structure and scarce affiliated syntactic information (e.g., data instances). Moreover, since we are attempting to deal with generic rela-

tionships, this semantic matching technique provides important hints for

detecting the embedded relationships.

Exploiting synonyms and hypernyms however requires the use of thesauri

or dictionaries. In the implementation of SeMap, we employ WordNet as

the dictionary, which is an English lexical database. In WordNet, the senses

(atomic meanings of a word or expression) of words are organized into sets

of synonyms (synsets), and synsets are in turn organized into a hierarchy based on their semantic relationships.

The basic idea of the sense matcher is to first populate the tokens of

each element with their senses in WordNet. The senses of an element are thus

the union of the senses of all its tokens. By comparing their senses, one can

evaluate the similarity of the corresponding two elements.

The possible relationship between two synsets X and Y considered in

our system include:

• Holonym. Y is a holonym of X if X is a part of Y ;


• Hypernym. Y is a hypernym of X if X is a (kind of) Y ;

• Hyponym. X is a hyponym of Y if X is a (kind of) Y ;

• Meronym. X is a meronym of Y if X is a part of Y .

We measure the similarity score of two elements in terms of the rela-

tionship of their senses, defined as follows: (1) 1 if there is at least one

sense of the first label which is the same or a synonym of the second, i.e.,

they share the same synset; (2) 0.5 if there exists at least one sense of one

label that has a sense of the other as a hypernym, a holonym, a hyponym

or a meronym; (3) 0 if two labels share no sense in common. The lineage

information records the number of relationships detected for each kind of

holonym, hypernym, hyponym and meronym. For example, in Figure 3.3,

‘course’ and ‘class’ are synonyms; one sense of ‘department’ is a meronym of

that of ‘college’; while both ‘professor’ and ‘instructor’ have senses which are

hyponyms of that of ‘faculty’.
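As an illustration, the sense comparison could be realized with NLTK's WordNet interface as sketched below; the thesis does not prescribe a particular WordNet library, so the API used here is an assumption, while the 1 / 0.5 / 0 scoring follows the rules above.

```python
from nltk.corpus import wordnet as wn   # assumes the NLTK WordNet corpus is installed

def senses(tokens):
    """Union of the WordNet synsets of all tokens of an element."""
    return {s for t in tokens for s in wn.synsets(t)}

def sense_score(tokens1, tokens2):
    """1 if the elements share a synset, 0.5 if related via hypernym/hyponym/
    holonym/meronym, 0 otherwise."""
    s1, s2 = senses(tokens1), senses(tokens2)
    if s1 & s2:
        return 1.0
    related = set()
    for syn in s1:
        related.update(syn.hypernyms(), syn.hyponyms(),
                       syn.part_holonyms(), syn.part_meronyms())
    return 0.5 if related & s2 else 0.0

# e.g. sense_score(['course'], ['class']) should give 1.0, since the two words share a synset.
```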

Note that within the context of the schema, the token may have only a

small subset of its possible senses. The pruning of those spurious senses depends

on other matchers. This pruning will be explored in our future work.

4.1.3.4 Ontology Matcher

Domain-specific ontology provides a powerful tool to identify related schema

elements. The basic idea is to first map the schema elements to their counter-

parts in the domain-specific ontology, (which may requires enhancing schema

elements by populating with their semantically equivalent representations),

45

Chapter 4. SeMap System

and then identify the relationships of schema elements by that between their

counterparts in the ontology.

4.1.3.5 Data-Instance Matcher

Instance-level data provides important insight into the semantic meaning

of the schema elements. This is especially valuable when the schema-level

information is limited. Even when substantial schema information is available, instance-level matching can also help uncover incorrect interpretations of schema information.

Specifically, some useful matching techniques that can be applied to data

instances include the following: (1) For text data types, information retrieval techniques such as linguistic characterization evaluate the similarity of two schema elements by comparing the relative frequencies of words and combinations of words in their data instances; (2) For numerical data types, statistical characterization, such as value ranges and averages, can provide

insight into the similarity of the corresponding schema elements.

The main benefit of evaluating instances is a precise characterization of

the actual contents of schema elements, which can, moreover, help enhance the schema-level matchers. For example, the value ranges found by the

data-instance matcher can improve the effectiveness of the type matcher.
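For numeric elements, one simple instance-level characterization is the overlap of value ranges, as in the following sketch; the choice of range overlap (rather than, say, comparing averages or full distributions) is just one of the options mentioned above.

```python
def numeric_instance_similarity(values1, values2):
    """Similarity of two numeric elements based on the overlap of their value ranges."""
    lo1, hi1 = min(values1), max(values1)
    lo2, hi2 = min(values2), max(values2)
    overlap = max(0.0, min(hi1, hi2) - max(lo1, lo2))   # length of the common interval
    span = max(hi1, hi2) - min(lo1, lo2)                # length of the combined interval
    return overlap / span if span else 1.0

# e.g. numeric_instance_similarity([1, 5, 9], [4, 12]) -> (9 - 4) / (12 - 1) ≈ 0.45
```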

4.1.4 Structure-Level Matcher

While element-level matchers provide background or context information

for each pair of elements, one still needs to consider the matching problem

within the picture of the whole schema, which is achieved by a structure-level


matcher.

In structure-level matching techniques, each schema is viewed as a labeled graph representing schema elements and their inter-relationships, and the similarity comparison between a pair of nodes is based on their positions in the graph and their relationships with neighboring nodes. The intuition behind the structure-level matcher is that if two schema elements from two schemas are highly related, their neighbors having the same relationship to them might also be related to some extent. Matching graphs is a combinatorial problem that is computationally expensive; it can usually only be solved by approximate approaches.

The structure-level matcher employed in SeMap is Similarity Flood-

ing [15]. It solves the optimization problem by a fixed-point algorithm.

Based on an initial mapping generated by other matching techniques, simi-

larity flooding propagates the similarity of mapped elements to adjacent ones

which have similar relationship to the mapped elements. The algorithm ter-

minates after a fixed point has been reached. The output of the similarity

flooding is a similarity score for each pair of schema elements from the two

schemas. For example, in Figure 3.3, if the elements ‘class’ and ‘course’ are

detected as highly similar, then the elements ‘TA’ (which has Has-a relation-

ship to ‘class’), and ‘ugrad TA’ (which has Has-a relationship to ‘course’) are

also likely to be similar. Note that the structure-level matcher is based only on the initial mapping information and the structure of the graph, and does not consider

any other semantic information. Its output does not contain any lineage

information.
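A highly simplified propagation step in the spirit of similarity flooding is sketched below; the actual algorithm [15] builds a pairwise connectivity graph with propagation coefficients and checks convergence to a fix-point, details that are omitted here.

```python
def propagate(sim, neighbours_s, neighbours_t, alpha=0.5, iterations=10):
    """Spread similarity between element pairs to pairs of their neighbours.

    sim:          {(s, t): initial similarity} for source element s and target element t
    neighbours_s: {source element: list of its neighbours in the source schema}
    neighbours_t: {target element: list of its neighbours in the target schema}
    """
    for _ in range(iterations):
        new = {}
        for (s, t), score in sim.items():
            inflow = sum(sim.get((ns, nt), 0.0)
                         for ns in neighbours_s.get(s, ())
                         for nt in neighbours_t.get(t, ()))
            new[(s, t)] = score + alpha * inflow
        top = max(new.values(), default=1.0) or 1.0
        sim = {pair: value / top for pair, value in new.items()}   # keep scores in [0, 1]
    return sim
```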


Figure 4.1: Architecture of the schema matcher. It consists of three layers: base matcher, combining layer and structure matcher.

4.1.5 Architecture of Schema Matcher

Instead of applying the matchers discussed above individually, we explore

their interaction in order to achieve the best matching quality. The basic archi-

tecture of the schema matcher is illustrated in Figure 4.1. It mainly consists

of three layers, element-level matcher, combining layer and structure-level

matcher.

Element-level matchers include the label matcher, the sense matcher,

the type matcher, etc, which find the similarity between schema elements

by exploiting their element-level information. Note that some element-level

matchers need a preprocessing step to facilitate matching. For example, the

preprocessing step converts the labels of elements into meaningful tokens,

which can then be processed by the label matcher; the tokens of each element

should be first populated with their senses in WordNet, in order to be used by

the sense matcher. The output of each element-level matcher is (1) similarity

score and (2) lineage information, as discussed in previous section.

In the combining phase, the similarity scores produced by the element-

level matchers are integrated to form a unified score. The combining scheme


can be of various forms, as long as it guarantees that the final result is well

normalized, i.e., in the range of [0, 1]. In the implementation of SeMap, we

follow a simple linear combining scheme, that is, each element-level matcher

em is associated with a weight wem, and the similarity score Simem produced

by that matcher is damped by that weight when computing the unified score.

Formally, the similarity score is Sim = Σ_{em} w_{em} Sim_{em}, where the condition Σ_{em} w_{em} = 1

guarantees that the result is well normalized. This set of weight parameters must be tuned carefully in order to achieve the optimal result. The tuning of the parameters is problem-specific, i.e., specific to different input schemas, and is not the focus of this thesis; an automatic learning approach is described in [27].

As an example, in Figure 3.3, the elements ‘instructor’ and ‘faculty’ are detected as equivalent by the type matcher, similar by the sense matcher, and dissimilar by the label matcher. If all the matchers have the same weight, and equivalent, similar and dissimilar matches have similarity scores 1, 0.5 and 0 respectively, then ‘instructor’ and ‘faculty’ have the unified score (1 + 0.5 + 0)/3 = 0.5.
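The combining step itself is a plain weighted average, as the following sketch shows; the matcher names and the weights in the usage example are the ones from the worked example above.

```python
def combine(scores, weights):
    """Weighted linear combination of element-level matcher scores.
    Both arguments are dicts keyed by matcher name; the weights should sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[m] * scores[m] for m in scores)

# The 'instructor'/'faculty' example with equal weights:
combine({"type": 1.0, "sense": 0.5, "label": 0.0},
        {"type": 1/3, "sense": 1/3, "label": 1/3})   # -> 0.5
```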

The unified similarity scores generated in the second phase are fed into

the structure-level matcher as initial mapping. The structure matching is

then performed to produce the final similarity score for each pair of elements.

In conclusion, for each potential match, the schema matcher produces

(1) similarity score, indicating the uncertainty about this match, and (2)

lineage information, recording how this match is detected.


4.2 Match Selector

As shown in the SeMap architecture, Figure 3.5, the match selector is the

second main part of the architecture, which is responsible for selecting a sub-

set from the pool of initial matches produced by the schema matcher. It also

contains a user interaction module which exploits user feedback to improve

the mapping quality. In this chapter, we introduce a novel probabilistic

framework, which allows us to express the match uncertainty and domain

constraints in a uniform way; match selection can then be transformed into an optimization problem, as shown in Section 4.2.2. Within this framework,

we reduce the need of user interaction and focus user attention by identify-

ing critical points where user feedback is maximally helpful, as discussed in

Section 4.2.3.

Given the pool of initial matches and associated similarity scores, the

match selector searches for a global optimal match assignment that satisfies

a set of domain constraints, e.g., in the case of Figure 1.1, a user may specify that each class has only one instructor; mapping ‘instructor’ to two elements will then violate this constraint.

Most prior work studied this problem in the case of 1-1 correspondences.

They either first apply the constraints to narrow down the pool of possible

mappings, and then transform match selection to a stable marriage problem

in a bipartite graph [16], or assume that the initial matches are mutually

independent, and seek a trade-off between uncertainty and constraint satisfaction [4].

As indicated above, the similarity scores represent our belief about the


initial matches, and the semantic constraints represent our preference for

certain matches, i.e., they are both probabilistic in nature. Hence it is

natural to adopt a probabilistic model to express match uncertainty and

constraints in a uniform way. In this thesis, we present such a framework,

which allows us to in turn model match selection as an optimization prob-

lem. In Section 4.2.1, we introduce the representation of this probabilistic

framework, and discuss the match selection problem and its solution within

this framework in Section 4.2.2.

4.2.1 Representation

In this section we show how to incorporate match uncertainty and semantic

constraints in a probabilistic framework. This approach is inspired by the

work of [12]. As a research proposal, [12] indicates the need for a probabilistic model to express this uncertainty. Our work can be considered an implementation of this idea; moreover, we extend this framework to support

domain constraints. Hence it is a contribution to both schema matching and

mapping construction.

Formally, each schema element e is associated with a set of initial matches

Me, and can be assigned to a match m ∈Me. The probability of assigning

e to match m∗ ∈Me is defined as:

P(e = m∗) = Sim(m∗) / Σ_{m∈Me} Sim(m)

where Sim(m) is the similarity score of match m provided by the schema

matcher. Intuitively, this represents the preference for matches with lower


uncertainty. It is easy to verify that this model is well normalized:

Σ_{m∈Me} P(e = m) = 1

Each mapping consists of a set of matches, of which some may violate

the domain constraints specified by the users. To achieve the best quality

mapping, we take domain constraints into consideration when selecting matches.

Specifically, we associate each constraint c with a penalty function Πc(M),

checking whether a match assignment M violates c and returning the degree of violation. Describing Πc(M) is beyond the scope of this thesis, but it must increase with the severity of the violation. Each constraint c is also

assigned a weight αc, indicating its strictness. The weights can be hard

coded or learned from known mappings [14]. Given a set of constraints C, the probability of a set of elements E taking match assignment M (where ei ∈ E is assigned to mi ∈ M) is defined as:

P(E = M | C) = (1/Z) [ ∏_{ei∈E} P(ei = mi) ] exp( Σ_{c∈C} −αc Πc(M) )

where Z is a normalization constant that guarantees P(E = M | C) ∈ [0, 1]; formally, Z = Σ_M ∏_{ei∈E} P(ei = mi) exp( Σ_{c∈C} −αc Πc(M) ). In this equation, the total likelihood of the match assignment for E is captured in ∏_{ei∈E} P(ei = mi), while the exponential part represents the penalty of violating the constraints. The more constraints that are violated, the lower the probability is.
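In code, the model amounts to the small computation below; the normalization constant Z is omitted because it cancels when comparing assignments, and the representation of a constraint as a (weight, penalty function) pair is an assumption made for illustration.

```python
import math

def match_probability(candidates, m_star):
    """P(e = m*) = Sim(m*) / sum of Sim(m) over the candidate matches of e."""
    return candidates[m_star] / sum(candidates.values())

def assignment_score(assignment, candidates, constraints):
    """Unnormalised P(E = M | C): likelihood of the chosen matches times the
    exponential penalty for violated constraints.

    assignment:  {element: chosen match}
    candidates:  {element: {match: similarity score}}
    constraints: list of (weight, penalty_fn) pairs with penalty_fn(assignment) >= 0
    """
    likelihood = math.prod(match_probability(candidates[e], m)
                           for e, m in assignment.items())
    penalty = sum(alpha * pi(assignment) for alpha, pi in constraints)
    return likelihood * math.exp(-penalty)
```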


4.2.2 Bidirectional Search

We intend to express the match selection as a well-defined mathematical

problem, so that existing formal approaches can be used. Within the prob-

abilistic framework introduced in Section 4.2.1, one can model match selec-

tion as an optimization problem, and a rich set of tools can be applied to

the problem. Since in this framework, the joint probability of the selected

matches measures the uncertainty of the mapping, for a set of schema ele-

ments E, match selection amounts to finding the match assignment M that maximizes the probability P(E = M|C) under constraints C.

It is infeasible to consider all possible combinations from the domain of E to find the optimal match assignment. For example, if each e ∈ E is associated with k initial matches, one potentially has to consider k^|E| combinations. It is highly likely that E comprises a series of disjoint subsets E1, E2, . . ., which are mutually independent, i.e., in graphical-model terms, the graph of E consists of a set of separate sub-graphs. Hence we can optimize these independent parts separately, that is

max_M P(E = M|C) ≡ max_{M1,M2,...} ∏_i P(Ei = Mi|C)

However, it is possible that the size of an Ei ∈ E can still be quite large, leading us to consider more efficient solutions, including several effective graphical-model optimization algorithms proposed in the machine learning community [9], e.g., graph cuts, and several heuristic search methods, e.g., the A∗ algorithm. In the implementation of SeMap, we apply the A∗ algorithm [8] to this optimization problem. The A∗ algorithm is a graph search algorithm that


finds a path from a given initial node to a given goal node (or one passing a

given goal test). It employs a heuristic estimate that ranks each node by an

estimate of the best route that goes through that node. It visits the nodes

in order of this heuristic estimate. In our implementation, each schema el-

ement is considered as a node, and the candidate matches are the possible

paths to the schema elements on the other side. The value of each path is

heuristically estimated, and a globally optimal assignment is selected.
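For a small independent subset the optimal assignment can even be found by brute force, as in the sketch below (which reuses assignment_score from the previous sketch); SeMap replaces this enumeration with the A∗ search just described, but the objective being maximized is the same.

```python
from itertools import product

def best_assignment(elements, candidates, constraints):
    """Enumerate all assignments for one independent subset of elements and
    return the one maximising the constrained probability of Section 4.2.1."""
    best, best_score = None, float("-inf")
    for choice in product(*(candidates[e].keys() for e in elements)):
        assignment = dict(zip(elements, choice))
        score = assignment_score(assignment, candidates, constraints)
        if score > best_score:
            best, best_score = assignment, score
    return best
```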

The discussion above focuses on match selection for an arbitrary set

of elements E . In the context of complex match, the perspectives from

source and target schemas could be significantly different. For example, in

Figure 3.3, starting from the side of Schema T , one may discover that ‘grad

TA’ is a specialization of ‘TA’, but miss the fact that ‘TA’ consists of both

‘grad TA’ and ‘ugrad TA’.

To simplify the problem, we distinguish the perspectives of the source and target schemas, i.e., treat match selection for source and target elements separately. Thus, instead of searching for the best matches for only the source elements Es or only the target elements Et, as in previous work, e.g., iMAP [2], one runs the optimization algorithm for both the source elements Es and the target elements Et, i.e., a bidirectional search. The result of the bidirectional search is two sets of matches Maps and Mapt. An example of Maps and Mapt is shown in Figure 4.2, where from the side of S, the element ‘TA’ is assigned the correct match but ‘instructor’ is not, while from the perspective of T, the case is just the opposite. Maps and Mapt will then be

merged to form a final complete mapping. The merge operation will be

discussed in Section 4.3.


Figure 4.2: Partial match assignments from the perspectives of source and target schemas respectively.

4.2.3 Modeling User Interaction

Capturing user feedback is crucial for improving mapping quality. Though

user feedback is used for error correction, mapping refinement, etc. [32],

modeling user interaction is a comparatively under-studied problem. In the following, we show that the probabilistic framework discussed above can be extended to model user interaction.

The key to modeling user interaction is to identify those critical points

where feedback is maximally helpful, so that user attention can be focused on

important problems, and user workload is minimized. We present an active learning solution that simulates the effect of the user’s selection of candidate matches for schema elements. This is an extension of the approach discussed in [31], where active learning is applied to learning the optimal parameters for matching web interfaces. In our approach, the model elements are ranked based on their potential information value, and user feedback is requested for the most informative ones.

A natural measurement of information value is entropy, which represents

the uncertainty about a signal or random events. Formally, for a variable


x with distribution P (x), its entropy H(x) is defined as

H(x) = −Σ_x P(x) log P(x)

Intuitively, entropy is a measure of randomness or uncertainty. The higher

the uncertainty, the larger the entropy value. In the context of match se-

lection, the entropy corresponding to each schema element measures the

uncertainty about its match assignment, which, however, does not take into account its influence on the match assignments of other schema elements. This deficiency of entropy leads to another metric, mutual information (MI):

I(x, y) = H(x)−H(x|y) = H(y)−H(y|x)

Intuitively, the mutual information of two random variables x and y, I(x, y)

expresses the reduction in the uncertainty about x by virtue of being informed of the value of y (or vice versa). Hence one seeks the element having the maximum mutual information with other elements, i.e., the one whose match assignment can maximally help identify the best match for others. Formally, one seeks the most informative element e that maximizes Σ_{e′∈E} I(e, e′). The joint probability P(e, e′|C) can be calculated according to the model of Section 4.2.1, which will yield the most informative element.

Once the most informative element has been selected, this element should

be disambiguated by the user, which is the key tenet of active learning: the

system should prompt the user to disambiguate cases where the user’s input

will provide the most information to the system. Once the user’s selection on


the most informative element is obtained, the setting of the match selection

problem needs to be updated accordingly. Specifically, assuming that the

match assignment of an element e is fixed, e.g., e = m∗, then P (e = m∗) = 1,

and P(e = m) = 0 for m ≠ m∗. Based on the new belief about e, the joint

probability of variables is updated accordingly. Note that only the joint

probability of the subset of E involving e needs to be updated.
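A reduced sketch of this selection loop is shown below; it uses the entropy of each element's match distribution as the selection criterion, whereas SeMap additionally weighs the mutual information with other elements, so the ranking function here is a simplification.

```python
import math

def entropy(probs):
    """H(x) = -sum of p * log p over the distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain_element(candidates):
    """Pick the element whose match assignment is most uncertain.
    candidates: {element: {match: similarity score}}."""
    def element_entropy(scores):
        total = sum(scores.values())
        return entropy([s / total for s in scores.values()])
    return max(candidates, key=lambda e: element_entropy(candidates[e]))

def apply_user_choice(candidates, element, chosen_match):
    """Fix the user's selection: P(e = m*) becomes 1 and all other candidates 0."""
    candidates[element] = {m: (1.0 if m == chosen_match else 0.0)
                           for m in candidates[element]}
```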

In this section, we present a novel probabilistic framework, which incor-

porates both match uncertainty and domain constraints in a uniform way.

The match selection is solved as a constrained optimization problem, and

user interaction is significantly reduced by identifying those most critical

points.

4.3 Mapping Assembler

The bidirectional search (Section 4.2.2) produces two sets of matches Maps

and Mapt, representing the correspondences from the perspectives of source

and target schemas respectively. The next component of the mapping construction process is the mapping assembler, which is the final phase of the SeMap system, as shown in the architecture diagram, Figure 3.5. In the

mapping assembler, Maps and Mapt are combined to form a final generic

semantic mapping. Specifically, in this process, we aim to solve the fol-

lowing problems: select an optimal set of matches from both mappings

(Section 4.3.1); identify the relationship implicit in the selected matches

(Section 4.3.2); assemble these matches together to form a final, generic

semantic mapping (Section 4.3.3).


4.3.1 Combining Maps and Mapt

By using our novel bidirectional search to create Maps and Mapt, we have considerably narrowed the number of matches that must be considered in order to build our generic semantic mapping. This results in a much smaller search space to be examined when determining semantic relationships (such as the ones in Figure 3.3). To merge Maps and Mapt into a

final mapping, we present a heuristic approach, which our preliminary tests

have shown to work effectively in practice.

Let M denote the set of matches in Maps and Mapt. For each match

m ∈M, one calculates the reward of including it in the final mapping:

R(m) = Sim(m) exp( Σ_{c∈C} Σ_{m′∈M} −αc Πc(m, m′) )

Intuitively, this reward function takes into account the similarity score of the match and the constraints it violates together with the other matches in M. One then filters out those matches with reward R(m) lower than a certain threshold ε.
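The filtering step can be sketched as follows; representing a match as a hashable pair and a constraint as a (weight, penalty function) pair, as well as the concrete threshold value, are illustrative assumptions.

```python
import math

def merge_mappings(map_s, map_t, sim, constraints, eps=0.2):
    """Keep the matches from Maps and Mapt whose reward R(m) is at least eps.

    sim:         {match: similarity score}
    constraints: list of (weight, penalty_fn) pairs with penalty_fn(m, m2) >= 0
    """
    pool = set(map_s) | set(map_t)
    def reward(m):
        penalty = sum(alpha * pi(m, m2) for alpha, pi in constraints for m2 in pool)
        return sim[m] * math.exp(-penalty)
    return [m for m in pool if reward(m) >= eps]
```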

4.3.2 Identifying Relationships

One key step of the SeMap system is identifying the relationship implicit in the selected matches. Since we intend to create generic semantic relationships, there is little previous work on this problem. In SeMap, we

propose a rule-based method to identify the implicit relationships. Because

the matches are identified by various schema matching techniques, based on

different semantic evidence, a uniform solution is hard to obtain. Instead,


for each type of semantic evidence, we define specific rules to extract the

generic relationship embedded inside. In the following, we show how to identify the four specific relationships Equivalent, Has-a, Is-a and Associates with the help of semantic evidence.

We classify the semantic evidence into four categories: schema-level (e.g., label, type, structure), semantic-level (e.g., sense), instance-level (i.e., data-level), and ontology. Usually a match is associated with semantic ev-

idence from multiple categories. We describe each category in more detail

below.

Schema-level evidence includes label, type, and structure information,

etc. Generally speaking, schema-level information alone is insufficient to

determine the embedded generic relationship. However, it provides support

for the results claimed by other semantic evidence, from which it can further help deduce implicit relationships. In our implementation, corresponding to each kind of lineage information, we apply the following heuristic rules:

• Label. Two elements with an Equivalent relationship are likely to share similar names, while two elements with Is-a or Has-a relationships tend to have labels in a prefix/suffix relationship. For example, from the fact that ‘grad-TA’ contains ‘TA’ as a suffix, we know that ‘grad-TA’ is probably a part or a specialization of ‘TA’; which of the two holds depends on other types of semantic information;

• Type. Two elements with Equivalent or Is-a relationship probably

share the same data type, while if one element has a data type as a

subcomponent of that of the other, it is likely that they have Has-a


relationship;

• Structure. Two non-leaf nodes, or two leaf nodes, possibly have an Equivalent or Is-a relationship, while a non-leaf node and a leaf node are likely to be in a Has-a relationship.

Semantic-level evidence is the sense of the label (name) of the schema

element. Based on the lineage information produced by the sense matcher,

one can infer the embedded semantic relationships. The heuristic rules are

as follows:

• If two elements share a sense, it is highly likely that they are semantically similar or equivalent;

• If some senses of one element act as hypernyms of those of the other, then the two elements may be in an Is-a relationship;

• If some senses of one element are hyponyms of those of the other, the two elements are probably in an Is-a relationship;

• If some senses of one element appear as holonyms or meronyms of those of the other, the two elements can be in a Has-a relationship.

Instances (i.e., data) give entity-level clues about the relationship between schema elements. Hence instance-level evidence usually precisely characterizes the actual content of schema elements. By studying the subsumption or distributional similarity of the data instances of schema elements, one can discover relationships that are difficult to identify at the schema level, due to differences in schema designs.


• If two elements have similar data distribution, they are likely to be

‘equivalent’;

• If the instance of one element x subsumes that of another element y,

it is likely that x ‘has-a’ y as its member;

• If the instances of x and y intersect, it is possible that x ‘associates’

with y.

A domain-specific ontology provides alternative representations of con-

cepts in the domain, and their possible relationships. Since it directly provides information about the relationships of the identified matches, it is trivial to refine these relationships to generic ones. For example, one can simply follow a mapping between the two relationship classification

systems, e.g., that between ontology modeling and meta-data management,

and convert the relationship from ontology/corpus to our generic relation-

ship types.

Note that a match is usually associated with various kinds of semantic evidence, which may, however, indicate different types of generic relationships. In combining the results suggested by the various kinds of semantic evidence, we follow a voting scheme: each type of evidence is associated with a weight, and a unified conclusion is obtained by linearly combining these results, as in the case of the schema matcher (Section 4.1.5). Generally speaking, the weights are set according to the relative plausibility of the semantic evidence. In our system, we regard the types of semantic evidence as having the following rank: ontology > instance > sense > schema information, and set the weights accordingly.


Figure 4.3: Mapping assembling for matches of different types. Each 1-1 equivalence match corresponds to one mapping element, while each element of a complex match is associated with one mapping element.

As an example, in Figure 3.3, for the match of ‘professor’ and ‘faculty’, the sense evidence suggests an Is-a relationship, the type evidence indicates Equivalent or Is-a equally, while the label and structure evidence provide no advice. Assuming the four types of evidence have weights 0.4, 0.3, 0.2 and 0.1 respectively, the voting result correctly identifies the match ‘professor’:‘faculty’ as an Is-a relationship.
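The voting itself is a weighted tally, as in the sketch below; the encoding of an "equally split" suggestion as two 0.5 votes is an assumption made to reproduce the example above.

```python
def vote(evidence_votes, weights):
    """Combine per-evidence suggestions into one relationship decision.
    evidence_votes: {evidence type: {relationship: vote in [0, 1]}}
    weights:        {evidence type: importance weight}."""
    tally = {}
    for evidence, votes in evidence_votes.items():
        for relationship, score in votes.items():
            tally[relationship] = tally.get(relationship, 0.0) + weights.get(evidence, 0.0) * score
    return max(tally, key=tally.get)

# The 'professor':'faculty' example: Is-a scores 0.4*1 + 0.3*0.5 = 0.55 and wins.
vote({"sense": {"Is-a": 1.0},
      "type": {"Equivalent": 0.5, "Is-a": 0.5},
      "label": {}, "structure": {}},
     {"sense": 0.4, "type": 0.3, "label": 0.2, "structure": 0.1})   # -> 'Is-a'
```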

4.3.3 Assembling the Mapping

Finally, given that we have discovered the semantic relationships as in Sec-

tion 4.3.2, we must create the final mapping. In order to ensure that the final

mapping meets the specifications in the problem definition (Chapter 3), we

apply the following rules to create mapping elements: (1) For a 1-1 equivalent match, one mapping element is sufficient to represent both elements in the mapping; (2) For a 1-1 non-equivalent match or a complex match, one mapping element is created for each element of the match; (3) For each remaining input element not involved in any match, one corresponding mapping element is created. All


these cases are illustrated in Figure 4.3. Note that it is possible to iden-

tify the relationships between the mapping elements belonging to different

matches, in order to form a model-like complete mapping, using an approach similar to the merge operator [21], though this is not the focus of this thesis.


Chapter 5

Experimental Analysis

To evaluate the effectiveness of our schema matching system, we applied

SeMap to several real-world domains. Specifically, the experiments are per-

formed with two main goals:

• To evaluate the matching accuracy of our schema matching system.

Since our goal is to detect the complex relationship existing between

schema elements, the matching accuracy includes not only the corre-

spondences as measured in most previous schema matching systems,

but also the detected relationship types.

• To measure the relative contribution of different system components to

the result. Specifically, we are interested in measuring the performance

gain from (1) different base matchers (2) match selector including user

interaction.

This chapter presents the empirical analysis of the SeMap system. We

first describe the experimental setting in Section 5.1. This includes (1) the

dataset used in experiments (Section 5.1.1); (2) the expert mappings (Sec-

tion 5.1.2); (3) the metrics to evaluate matching results (Section 5.1.3) and

(4) the experimental methodology (Section 5.1.4). The evaluation results

of matching accuracy are then presented in Section 5.2.1. Next we show


the relative contribution of each system component to the final matching

results in Section 5.2.2. Finally, based on the evaluation results, we analyze

the strengths and weaknesses of our approach in Section 5.2.4.

5.1 Experimental Setting

5.1.1 Data Set

We evaluated SeMap on both synthetic and real datasets. The synthetic

dataset is the example shown in Figure 3.3. The real-life datasets are from

two domains, namely Real Estate, and Course Information. All the real

datasets are imported from the Illinois Semantic Integration Archive [29].

Specifically, the Real Estate dataset is a set of schemas describing the infor-

mation of houses for sale; And the Course dataset is a set of schemas on the

information of courses offered across different universities. All the schemas

used in evaluation are included in the Appendix A of this thesis. Some of

these schemas are associated with data instances (Real Estate and Course

Info). In our experiments, for the schemas with data instances, we exploit

the exact data format of the elements by looking at their data instances

in addition to the schema-level information. For schemas without data in-

stances (Synthetic), we evaluated our approach on schema-level information

only.

Since all the original real-life schemas are in DTD format, we performed

a conversion step to import the schemas into our model representation.

We faithfully mirror the structure and terms from the source schemas. Because every edge between two elements in the model representation has a relationship type, while this is not the case in DTD, we set all the relationships to Has-a by default. For example, if the schema in Figure 3.1 is represented as a DTD, which has no relationship type between the element ‘course’ and any other elements, SeMap by default regards all these links as Has-a relationships.

  domain        schema        # leaf/non-leaf   # relationships
  Real Estate   Homeseekers   25/3              27
                Texas         31/3              33
                Yahoo         23/2              24
  Course Info   Reed          12/3              14
                Rice          11/4              14
                UWM           15/4              18
  Synthetic     course        5/1               5
                class         5/1               5

Table 5.1: Characteristics of the input schemas.

In preparing the data, it is necessary to apply some trivial data cleaning

operations, such as splitting ‘custno’ into ‘cust’ and ‘no’. This is not the

focus of our SeMap system, so we will not discuss it further in the thesis.

More detailed discussion of data cleaning can be found in [23].

5.1.2 Expert Mapping

In preparing the real-life datasets, we extracted three schemas from each

domain. We chose schemas with complex structure, among which complex relationships can exist. The characteristics of these schemas after the

preprocessing step are shown in Table 5.1, including the number of elements

(leaf and non-leaf), the number of relationships, the maximum depth of the

tree, etc.


  domain        schemas S/T          Equivalent   Is-a   Has-a   total
  Real Estate   Homeseekers/Texas    18           6      12      36
                Homeseekers/Yahoo    20           0      11      31
  Course Info   Reed/Washington      11           0      7       18
                Reed/WSU             18           1      8       27
  Synthetic     course/class         2            4      3       9

Table 5.2: Characteristics of the expert mappings.

For each pair of schemas in the same domain, we created an expert

mapping, which acts as the ‘standard answer’ to the matching problem. The

characteristics of the expert mappings are shown in Table 5.2. The table

shows the total number of matches, the number of matches of each specific

relationship type, and the percentage of elements involved in the mapping from both source and target schemas. Note that in the expert mappings, we do not consider matches with the Associates relationship, because of its weak semantics, though in matching SeMap may find the Associates relationship when it lacks sufficient semantic evidence.

5.1.3 Evaluation Metrics

Following previous work on schema matching, we evaluated the performance

of our approach on three metrics: recall, precision, and F-measure [25]. Pre-

cision P represents the percentage of correctly identified matches over all

matches identified by the system; Recall R is the percentage of correctly

identified matches over the all correct matches in the given expert map-

ping. Formally, let #detected be the number of correct matches detected,

#mapping the total number of correct matches in the expert mapping, and


#result the total number of matches in the results, the recall R and precision P are defined as:

R = #detected / #mapping        P = #detected / #result

Recall and precision are inversely related, hence it is desirable to have one measure that incorporates both. The F-measure F (precisely, the F1 measure) equally weights recall R and precision P:

F = 2PR / (P + R)

As discussed above, we deal not only with correspondences but also with the implicit relationships. Thus a correct match means that (1) the found correspondence is correct and (2) the identified relationship is exactly that in the expert mapping. We therefore measured the matching accuracy in two ways: (1) the total number of correct correspondences detected; and (2) the number of correct matches for each type of relationship, e.g., Equivalent, Is-a, etc.
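The three metrics reduce to a few lines of code; here a match is represented as a (source, target, relationship) triple, which is an illustrative encoding of the two-fold correctness criterion just described.

```python
def evaluate(result, expert):
    """Recall, precision and F-measure over sets of (source, target, relationship) triples."""
    detected = len(result & expert)                       # correct correspondence and relationship
    recall = detected / len(expert) if expert else 0.0
    precision = detected / len(result) if result else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return recall, precision, f_measure
```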

5.1.4 Experimental Methodology

For each domain, we performed three sets of experiments, i.e., between

each possible pair of schemas. We first evaluated the matching accuracy

of SeMap for each set of data, and investigated how sensitive it is to the setting of parameters. We then evaluated the relative contribution of each component and of user interaction to the final mapping.

Currently, the parameters that need to be tuned in our system are the weights of the different types of semantic evidence produced by the various matchers, i.e., label (λl), type (λt), sense (λs), etc. The influence of these weight parameters is discussed in Section 4.1.5. The specific setting of parameters

for different datasets are as follows: (1) For the Real Estate dataset, the

parameters are set as λl = 0.3, λt = 0.5, and λs = 0.2, reflecting the ob-

servation that in the Real Estate dataset, the name, type and sense are all

important for identifying the implicit relationship; however, due to a number of abbreviations, e.g., ‘ac’ for ‘air conditioner’, the type information is especially important for finding the hidden correspondences; (2) For the Course

Info dataset, λl = 0.4, λt = 0.4, and λs = 0.2, which means the name and

type convey equally significant semantic information, and the sense carries

less importance weight due to the small number of synonyms, hyponyms, etc. in this dataset; (3) For the synthetic dataset, λl = 0.4, λt = 0.3, and λs = 0.3, i.e., all types of semantic information carry approximately equal importance weights.

5.2 Experimental Result

We present in the following the matching performance of the SeMap system

on the synthetic and real datasets. We first show the matching accuracy

of the system exploiting all the semantic information, and then analyze the

contribution from different types of semantic evidence. Finally, we measure

the performance gain from incorporating user interaction into the match

selector phase.


Figure 5.1: Matching accuracy of SeMap. The three plots show the recall, precision and F-measure of the matching results for the three relationship types Equivalent, Has-a, Is-a and the total correct matches respectively.

5.2.1 Matching Accuracy

Figure 5.1 shows the matching results of SeMap over the five datasets, mea-

sured using the metrics recall, precision and F-measure. As discussed above,

we measured both the accuracy of the identified matches and that of the identified relationships. The four bars (from left to right) show the matching accuracy for the three relationship types Equivalent, Has-a, Is-a (correct match and correct relationship) and the total number of correct matches (not including the relationship), respectively.

The result shows that SeMap achieved a high average matching accuracy

not only in detecting the correct correspondences (total number of correct


matches), but also in detecting the implicit relationships. Taking F-measure as an example, the percentage of correct correspondences ranges from 70% to 100%, and the average accuracy of detected relationships is 79%, 69% and 64% for the three relationship types respectively. The accuracy of correspon-

dences detected is comparable to the results claimed in iMAP [2] (Tested

on Real Estate dataset, 77-100% for 1-to-1 matches and 50-86% for 1-to-

n matches), and that produced by [5] (Tested on Real Estate dataset and

Course Info dataset, 73% recall and 67% precision without domain ontology,

and 94% recall and 90% precision with domain ontology).

Note that our system did not resort to any domain ontology or data

frames as used in [5]. Moreover, the problem we are tackling is more difficult than that addressed in these works, in the sense that we have to not only identify the

correct correspondences, but also extract the generic semantic relationships

implicit in the correspondences. In Section 5.2.4, we will identify the reasons

that prevent SeMap from achieving higher accuracy in identifying correspon-

dences and extracting implicit relationships.

Looking at these results in further detail, we note that in the Real Estate 1 and Course Info 2 datasets SeMap has low precision (about 40% for Real Estate 1 and 20% for Course Info 2) in identifying the Has-a relationship. To explore why this is the case, we analyze the detailed composition of the matches identified by SeMap, i.e., of the matches identified as each type, how many are (1) a correct correspondence with the correct relationship, (2) a correct correspondence with an incorrect relationship, or (3) an incorrect correspondence (non-match). The result is shown in Figure 5.2, where for each type of relationship we examine the composition of the matches identified by SeMap. In both Real Estate 1 and Course Info 2, incorrect correspondences are the main cause of the low precision in identifying matches of the Has-a relationship. Barring those incorrect correspondences, the precision of SeMap in identifying semantic relationships is much higher.

The accuracy of SeMap after barring incorrect correspondences is shown in Figure 5.3, where the precision reaches nearly 100% over most datasets. Note that the synthetic dataset is very small; despite the low precision in this case, SeMap misclassified only one match. We would argue that finding correspondences is the focus of schema matching, and our SeMap system is built on only a moderate set of current schema matching techniques; it can be expected that with more powerful schema matching techniques or domain knowledge, the number of incorrect matches can be significantly reduced. The detailed error analysis also shows that, on average, SeMap has higher accuracy in identifying the Is-a relationship than the Has-a relationship. This can be explained by the fact that the thesaurus employed in SeMap (WordNet) returns comprehensive information about meronym/holonym relationships between two words, even though the senses may not be exactly those appearing in the context of the schemas. The accuracy of identifying the Has-a relationship could be improved by (1) pruning the noise senses when using the thesaurus, which we will explore in future work, or (2) lowering the weight of the sense matcher, which however may affect finding other types of relationships.


(For each dataset and each relationship type identified by SeMap, the columns give the composition of the identified matches: the number whose true relationship is Equivalent, Is-a or Has-a, and the number that are non-matches.)

schemas          relationship   Equivalent   Is-a   Has-a   non-match
Real Estate 1    Equivalent         13         0      0        4
                 Is-a                0         2      0        0
                 Has-a               4         2      8        7
Real Estate 2    Equivalent          9         0      0        3
                 Is-a                0         5      0        3
                 Has-a               3         0      7        4
Synthetic        Equivalent          2         1      0        0
                 Is-a                0         1      0        0
                 Has-a               0         0      4        0
Course Info 1    Equivalent         11         1      0        1
                 Is-a                0         2      0        0
                 Has-a               0         0      0        0
Course Info 2    Equivalent         12         0      0        0
                 Is-a                0         4      0        1
                 Has-a               0         0      1        4

Figure 5.2: Error analysis of the resulting mappings.
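To illustrate how the per-relationship precision follows from these composition counts, the sketch below treats the Has-a rows of Figure 5.2 for Real Estate 1 and Course Info 2 as small confusion tables (the counts are copied from the figure; the "pruned" quantity is one plausible reading of the precision after barring incorrect correspondences, as in Figure 5.3, not a formula taken from the text):

    # Composition of matches identified as Has-a (from Figure 5.2):
    # counts of truly Equivalent, truly Is-a, truly Has-a, and non-match.
    composition = {
        "Real Estate 1": {"Equivalent": 4, "Is-a": 2, "Has-a": 8, "non-match": 7},
        "Course Info 2": {"Equivalent": 0, "Is-a": 0, "Has-a": 1, "non-match": 4},
    }

    for dataset, row in composition.items():
        total_identified = sum(row.values())
        # Precision: correct correspondence *and* correct relationship.
        precision = row["Has-a"] / total_identified
        # After barring the incorrect correspondences (non-matches).
        pruned = row["Has-a"] / (total_identified - row["non-match"])
        print(dataset, round(precision, 2), round(pruned, 2))
    # Real Estate 1: 8/21 = 0.38 raw          Course Info 2: 1/5 = 0.20 raw
    #                8/14 = 0.57 after pruning                1/1 = 1.00 after pruning

The raw values reproduce the roughly 40% and 20% Has-a precision figures quoted above.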


Figure 5.3: The precision of SeMap after pruning incorrect matches. The bars from left to right show the matching results for the three relationship types Equivalent, Has-a, Is-a, respectively.


5.2.2 Component Contribution

In this section, we studied the relative contribution of different types of semantic evidence to the matching results. Specifically, we tested three types of semantic evidence, element label (name), element type and element sense, which are available in most schemas.

In each test, we left out one type of semantic evidence and used the remaining two to identify the correspondences and the associated relationship types. The weights of the two remaining types of semantic evidence were both set to 0.5, reflecting the neutral assumption that they are equally important.

The two plots of Figure 5.4 show the F-measure of the identified matches (correspondences) and of the identified relationships, respectively. Within each plot, the bars from left to right represent the results produced by SeMap without label evidence, without type evidence, without sense evidence, and by the complete SeMap, respectively. In almost all cases, each type of semantic evidence contributes to the overall performance; the exception is the Real Estate 2 dataset, where the F-measure of the relationships identified by the complete system is worse than that of the system without type evidence. This can be explained by the fact that in some cases different types of semantic evidence conflict in determining the implicit relationship, and some evidence may dilute the correct diagnosis.
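A minimal sketch of this leave-one-out procedure is given below; the evaluation harness and function names are hypothetical, and only the re-weighting scheme (0.5 for each of the two remaining evidence types) is taken from the text:

    # Leave-one-out test of the three evidence types. When one type is
    # dropped, the remaining two are re-weighted to 0.5 each.
    EVIDENCE_TYPES = ["label", "type", "sense"]

    def leave_one_out_weights(excluded):
        """Weight assignment with one evidence type removed."""
        weights = {e: 0.0 for e in EVIDENCE_TYPES}
        for e in EVIDENCE_TYPES:
            if e != excluded:
                weights[e] = 0.5
        return weights

    def ablation(run_matcher, dataset):
        """run_matcher(dataset, weights) -> F-measure; a hypothetical harness."""
        results = {}
        for excluded in EVIDENCE_TYPES:
            results["without " + excluded] = run_matcher(dataset, leave_one_out_weights(excluded))
        # Complete system; the synthetic-dataset weights are used as an illustration.
        results["complete"] = run_matcher(dataset, {"label": 0.4, "type": 0.3, "sense": 0.3})
        return results

    # Dummy harness that just echoes the weights it was given.
    print(ablation(lambda data, w: w, dataset="Course Info 1"))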

5.2.3 Incorporating User Feedback

We also studied the performance gain obtained by incorporating user feedback into the SeMap system, applying the user interaction mechanism discussed in Chapter 4.



Figure 5.4: Relative contribution of different types of semantic evidence to the matching results of SeMap. The two plots (from top to bottom) show the F-measure of the identified matches (correspondences) and of the identified relationships, respectively.

Each candidate match of a schema element is associated with an uncertainty estimate, based on its similarity score and domain constraints. At each iteration, the schema element whose candidate matches have the largest mutual entropy with the other elements is selected, and the user is asked to provide the correct match for this element; the uncertainty estimates are then updated. The procedure repeats until a threshold is reached.
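The sketch below outlines this interaction loop in simplified form. It uses plain Shannon entropy over each element's candidate-match distribution rather than the mutual-entropy criterion described in Chapter 4, and all names (candidates, ask_user, threshold) are hypothetical:

    import math

    def entropy(probabilities):
        """Shannon entropy of a candidate-match probability distribution."""
        return -sum(p * math.log(p) for p in probabilities if p > 0)

    def interactive_matching(candidates, ask_user, threshold=0.5):
        """candidates: {element: {candidate_match: probability}}.
        ask_user(element) returns the correct match for that element.
        Stops once the most uncertain element falls below the threshold."""
        confirmed = {}
        while candidates:
            # Pick the element whose candidate distribution is most uncertain.
            element = max(candidates, key=lambda e: entropy(candidates[e].values()))
            if entropy(candidates[element].values()) < threshold:
                break
            confirmed[element] = ask_user(element)
            del candidates[element]
            # A full implementation would also propagate the answer to update
            # the uncertainty estimates of the remaining elements.
        return confirmed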

In this test, we measured the number of correct matches that need to be provided before a perfect set of matches is reached. Note that we only tested the accuracy of identifying the correspondences, not that of extracting the implicit relationships. The F-measure of correct correspondences versus the amount of user interaction needed (the percentage of expert matches provided over the total number of matches) is shown in Figure 5.5 (the synthetic dataset is skipped, since 100% correct correspondences are reached without any user interaction). Over the Real Estate 1, Real Estate 2 and Course Info 2 datasets, the F-measure reaches its maximum value when about 20% of the expert matches are provided, while over the Course Info 1 dataset about 10% of the expert matches suffice to reach a perfect set of matches. This result suggests that SeMap can effectively incorporate user interaction; that is, it needs only a few user-provided equality constraints to achieve high-accuracy matches. Note that the maximum possible F-measure is not necessarily 100%: F-measure incorporates both precision and recall, and it is not always possible to achieve 100% recall (all correct matches detected), due to the limits of the current matching techniques that SeMap takes as input.

5.2.4 Discussion

In this section, we discuss the difficulties that prevent SeMap from achieving better performance in identifying both correspondences and implicit relationships. Specifically, the current SeMap system faces three main issues:

• More types of base matchers are needed in order to fully exploit the available schema-level information. In our current implementation, only four types of base matchers are employed, which sometimes leads to low precision, as discussed in Section 5.2.1; it can be expected that more types of evidence would help uncover more hidden information;


Figure 5.5: F-measure of correct correspondences versus the amount of user interaction (percentage of expert matches provided over the total number of matches). The curves for four datasets (Real Estate 1/2, Course Info 1/2) are shown.


• Data instances should be taken into consideration. It is only on the data-instance level that one can understand the exact format and semantics of schema elements.

• Parameter tuning is critical for schema matching. In the SeMap system, there are five parameters (three evidence weights and two thresholds), which interact in complicated ways. A systematic tuning scheme (e.g., a learning approach [27]) could significantly improve the results.


Chapter 6

Conclusion & Future Work

In this thesis we presented an approach for identifying generic, semantic relationships between the elements of two models (e.g., database schemas, ontologies, web interfaces, etc.) based on initial match information provided by current schema matching techniques. Our main contributions are as follows: (1) we pointed out the importance of the problem of identifying generic semantic relationships between schema elements; (2) we designed an architecture for semi-automatically constructing generic semantic mappings based on initial correspondence information; (3) we created a novel probabilistic framework that transforms match selection into a well-defined mathematical optimization problem; (4) we modeled user interaction so as to focus user attention and minimize user effort, by detecting critical points where feedback is maximally useful; (5) we proposed an effective solution for extracting the relationships implicit in matches based on various types of semantic evidence; and (6) we implemented a prototype system embodying the innovations above and conducted a set of experiments to illustrate the effectiveness of our approach.

We envision several future directions. The first is to incorporate our system into a model management system [17] and explore the new possibilities in meta-data management brought by generic, semantically rich mappings. Second, we would like to enhance our current prototype by adding more matching techniques and considering more types of semantic evidence. Third, more domain constraints (e.g., frequency, contiguity, nesting [4], etc.) could be taken into consideration to enhance the current implementation of the match selector. Finally, in the process of judging which relationship is best represented by the input correspondences, SeMap takes into account additional information (e.g., lineage information) that the matchers themselves do not consider. As a result, if SeMap is unable to suggest an appropriate relationship, this may indicate that the input correspondence is wrong. One future direction is to redirect such signals to improve the quality of the input matches.


Bibliography

[1] Hans Chalupsky. OntoMorph: A translation system for symbolic knowledge. In Principles of Knowledge Representation and Reasoning, 2000.

[2] Robin Dhamankar, Yoonkyong Lee, AnHai Doan, Alon Halevy, and Pedro Domingos. iMAP: discovering complex semantic matches between database schemas. In SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pages 383–394, New York, NY, USA, 2004. ACM Press.

[3] Hong-Hai Do and Erhard Rahm. COMA - a system for flexible combination of schema matching approaches. In VLDB '02: Proceedings of the 28th international conference on Very Large Data Bases, 2002.

[4] AnHai Doan, Pedro Domingos, and Alon Y. Halevy. Reconciling schemas of disparate data sources: a machine-learning approach. In SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data, pages 509–520, New York, NY, USA, 2001. ACM Press.

[5] David W. Embley, Li Xu, and Yihong Ding. Automatic direct and indirect schema mapping: experiences and lessons learned. SIGMOD Record, 33(4):14–19, 2004.

[6] A. Gal, G.A. Modica, and H.M. Jamil. OntoBuilder: Fully automatic extraction and consolidation of ontologies from web sources, 2005.

[7] Fausto Giunchiglia and Mikalai Yatskevich. Semantic matching. Knowledge Engineering Review, 18(3):265–280, 2004.

[8] P.E. Hart, N.J. Nilsson, and B. Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics (SSC), pages 100–108, 1968.

[9] Finn V. Jensen. Bayesian Networks and Decision Graphs. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2001.


[10] Yannis Kalfoglou and Marco Schorlemmer. Ontology mapping: the state of the art. Knowledge Engineering Review, 18(1):1–31, 2003.

[11] Jaewoo Kang and Jeffrey F. Naughton. On schema matching with opaque column names and data values. In SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 205–216, New York, NY, USA, 2003. ACM Press.

[12] Jayant Madhavan. Learning mappings between models of data. Research Proposal, 1999.

[13] Jayant Madhavan, Philip A. Bernstein, AnHai Doan, and Alon Halevy. Corpus-based schema matching. In ICDE '05: Proceedings of the 21st International Conference on Data Engineering (ICDE'05), pages 57–68, Washington, DC, USA, 2005. IEEE Computer Society.

[14] Jayant Madhavan, Philip A. Bernstein, and Erhard Rahm. Generic schema matching with Cupid. In VLDB '01: Proceedings of the 27th International Conference on Very Large Data Bases, pages 49–58, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[15] Alexander Maedche, Boris Motik, Nuno Silva, and Raphael Volz. MAFRA - a mapping framework for distributed ontologies. In EKAW '02: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web, pages 235–250, London, UK, 2002. Springer-Verlag.

[16] Sergey Melnik, Hector Garcia-Molina, and Erhard Rahm. Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In ICDE '02: Proceedings of the 18th International Conference on Data Engineering, page 117, Washington, DC, USA, 2002. IEEE Computer Society.

[17] Sergey Melnik, Erhard Rahm, and Philip A. Bernstein. Rondo: a programming platform for generic model management. In SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 193–204, New York, NY, USA, 2003. ACM Press.

[18] Boris Motik, Alexander Maedche, and Raphael Volz. A conceptual modeling approach for semantics-driven enterprise applications. In On the Move to Meaningful Internet Systems, 2002 - DOA/CoopIS/ODBASE 2002 Confederated International Conferences DOA, CoopIS and ODBASE 2002, pages 1082–1099, London, UK, 2002. Springer-Verlag.

[19] Renate Motschnig-Pitrik and Jens Kaasboll. Part-whole relationship categories and their application in object-oriented analysis. IEEE Transactions on Knowledge and Data Engineering, 11(5):779–797, 1999.

[20] Natalya Fridman Noy and Mark A. Musen. PROMPT: Algorithm and tool for automated ontology merging and alignment. In Proceedings of AAAI/IAAI, 2000.

[21] Rachel Pottinger and Phil Bernstein. Merging models based on given correspondences. University of Washington Technical Report UW-CSE-03-02-03, 2003.

[22] Erhard Rahm and Philip A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10(4):334–350, 2001.

[23] Erhard Rahm and Honghai Do. Data cleaning: Problems and current approaches. IEEE Bulletin of the Technical Committee on Data Engineering, 23(4), 2000.

[24] Xiao Renguo, Tharam S. Dillon, J. Wenny Rahayu, Elizabeth Chang, and Narasimhaiah Gorla. An indexing structure for aggregation relationship in OODB. In DEXA '00: Proceedings of the 11th International Conference on Database and Expert Systems Applications, pages 21–30, London, UK, 2000. Springer-Verlag.

[25] C. J. Van Rijsbergen. Information Retrieval, 2nd edition. Dept. of Computer Science, University of Glasgow, 1979.

[26] Nikos Rizopoulos. Automatic discovery of semantic relationships between schema elements. In ICEIS '04: Proceedings of the 6th International Conference on Enterprise Information Systems, 2004.

[27] Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan, and Arnon S. Rosenthal. Tuning schema matching software using synthetic scenarios. In VLDB '05: Proceedings of the 31st international conference on Very Large Data Bases, pages 994–1005. VLDB Endowment, 2005.

[28] Gerd Stumme and Alexander Maedche. FCA-MERGE: Bottom-up merging of ontologies. In IJCAI '01: Proceedings of the 17th International Joint Conference on Artificial Intelligence, pages 225–230, Seattle, WA, USA, 2001.

[29] Department of Computer Science, University of Illinois. Illinois semantic integration archive. http://anhai.cs.uiuc.edu/archive, 2002.

[30] Michael Uschold and Michael Gruninger. Ontologies and semantics for seamless connectivity. SIGMOD Record, 33(4):58–64, 2004.

[31] Wensheng Wu, Clement Yu, AnHai Doan, and Weiyi Meng. An interactive clustering-based approach to integrating source query interfaces on the deep web. In SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pages 95–106, New York, NY, USA, 2004. ACM Press.

[32] Ling Ling Yan, Renee J. Miller, Laura M. Haas, and Ronald Fagin. Data-driven understanding and refinement of schema mappings. In SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data, pages 485–496, New York, NY, USA, 2001. ACM Press.
