
Searching Integrated Relational and Record-based Legacy Data Using Ontologies

L.L. Miller and Hsine-Jen Tsai
Department of Computer Science
Iowa State University, Ames, Iowa 50011

Sree Nilakanta
College of Business
Iowa State University, Ames, Iowa 50011

Mehdi Owrang
American University, Washington, D.C.

Abstract

Integration of data continues to be a problem. The number of databases available to a corporation continues to grow. Simply keeping track of the number and diversity of the attributes (fields) can be a difficult problem in large organizations. In this paper we define an ontology model for the domain that uses ontologies for object (entity) search over a set of integrated relational databases and record-based legacy systems. The integration process is based on a hypergraph model that makes use of the theory of universal relations. The design of the complete system model is given and a prototype of the model is briefly discussed.

1. Introduction

Managing the vast amounts of information in large computer networks presents a

number of difficulties to system users and designers. New applications and databases are

created on a regular basis to solve local problems as they arise. For large organizations, this means the number of databases can be staggering. We cannot expect users to know the terms for identifying specific information from multiple data sources. Most databases are created and maintained by local groups and/or organizations that use software that

optimizes local transactions. Even if we assume that all databases use a standard

hardware/software platform, language and protocol, there still is the issue of conceptual

heterogeneity. Assisting users in obtaining an integrated view of information from heterogeneous, distributed data sources continues to be an active research area.

Among the approaches taken by research groups working on this problem, the use of an ontology

[6,8,10,13,18,20,21] seems very appealing. Since the beginning of the nineties,

ontologies have become a popular research topic investigated by several artificial


intelligence research communities, including knowledge engineering, natural-language

processing and knowledge representation. More recently, the notion of ontologies has

become widespread in fields such as intelligent information integration, information

retrieval on the Internet, and knowledge management. Ontologies are popular in large part because of what they promise: a shared and common understanding of

some domain that can be communicated across people and computers. General ontologies

have not been effective. Therefore, the best one expects from an ontology is for it to be

domain specific. However, for imprecise queries, the first problem is to take query terms

and map them to database terms. Therefore, minimally we must modify the ontology to

make it database specific. The Summary Schemas Model (SSM) [2,3,4] provides a way

to link database terms to the ontology.

In spite of the large amount of research on database integration of heterogeneous

data sources that has been done, the problem continues to create difficulties for most

organizations. In the present work we look at a subproblem of the general integration

problem, that is the case where the data sources are controlled by one organization and

the data sources consist of relational databases and record-based legacy systems. While

this is a small part of the general problem, it covers a large number of applications that

typical organizations are concerned with integrating.

Our contribution in this paper is the development of an ontology-based model that

provides access to a distributed set of relational databases and record-based legacy

systems through imprecise queries. A database specific ontology is integrated with a set

of semantically disjoint universal relations over the set of data sources to provide access.

The use of universal relations simplifies the connection between the ontology and the set of distributed data sources. For any request for semantically related data, there is a single universal relation that is capable of responding to the request.

Specifically, we develop the notion of database specific weighted ontologies as a means

of determining the required universal relation. The use of universal relations in this

context is made possible due to our data integration scheme. The integration scheme is

based on the use of hypergraphs and the theory of relational databases. Such an

approach provides the additional capability of testing the correctness of any query

generated.


A brief overview of ontologies, the Summary Schema Model (SSM), and integration issues is presented in Section 2. The overall model is described in Section 3. In

Section 4 we present our approach to ontologies and look at the issue of generating SSM

tree fragments and database specific ontologies. Section 5 looks at the issues that

make up our data integration scheme. Section 6 overviews our current version of the

feasibility prototype. Finally, we conclude by summarizing our results.

2. Background

2.1 Ontologies

The word “ontology” is borrowed from philosophy, in which it refers to the

“subject of existence” [8]. It is the science of “what is”. It discusses the structures of

entities, the properties of entities and the relations between entities. In a word, it seeks to

find an appropriate classification of entities. In the context of artificial intelligence, an

ontology is a model of some portion of the world and is described by defining a set of

representational terms [6]. A formal definition is “a formal, explicit specification of a

shared conceptualization” [8]. “Conceptualization” refers to an abstract model of some

phenomena in the world by having identified the relevant concepts of those phenomena.

So, an ontology is a description of concepts and relationships between them.

The main motivation of an ontology is knowledge sharing and reuse [9,25]. In the

field of information systems, different groups gather data using their own terminologies. When all those data are integrated, a major problem that needs to be handled is terminological and conceptual incompatibility. This could be handled on a case-by-case basis, but a solution based on a "consistent and unambiguous description of concepts and their potential relation" [19] will be much better than a case-by-case one. In the Knowledge

Sharing Effort (KSE) project [18], ontologies are put forward as means to share

knowledge bases between various knowledge-based systems.

A major challenge in using ontologies lies in how to build them: what should they look like? Several groups have proposed solutions, describing how ontologies

should be constructed so that they contain the richest information in the least space and

can be efficiently retrieved for use. A solution based on the definition of a “core library”

has been proposed in [25]. More often, an ontology is considered as a taxonomic


hierarchy of words with the "is-a" relation between them [9]. Techniques have also been proposed to transform a poorly designed ontology into a better one [11].

In dealing with multi-database systems, ontologies can be used effectively to

organize keywords as well as database concepts by capturing the semantic relationships

among keywords or among tables and fields in a relational database. By using these

relationships, a network of concepts can be created to provide users with an abstract view

of an information space for their domain of interest. Ontologies are well suited for

knowledge sharing in a distributed environment where, if necessary, various ontologies

can be integrated to form a global ontology.

Database owners find ontologies useful because they form a basis for integrating

separate databases through identification of logical connections or constraints between

the information pieces. Ontologies can provide a simple conversational interface to

existing databases and support extraction of information from them. Because of the

distinctions made within an ontological structure, they have been used to support

database cleaning, semantic database integration, consistency-checking, and data mining

[20].

An example of using ontologies in databases is Ontolingua [9]. Ontolingua is

being built with the purpose of enabling databases (and the people and systems that

interface with them) to share an ontology specific to the computer science and mathematics domains, with the goal of enabling data sharing and reuse. Another

example of a database application is the Cyc ontology, which has a knowledge base built on a

core of approximately 400,000 hand-entered assertions (or rules) designed to capture a

large portion of what we normally consider consensus knowledge about the world [14].

The system is partitioned into an Upper Cyc Ontology and the full Cyc Knowledge Base: the Upper Cyc Ontology contains 3,000 terms for the most general concepts of human consensus reality, with literally millions of logical axioms for more specific concepts descending below them, populating the Cyc Knowledge Base. This foundation enables Cyc to effectively address a broad range of otherwise intractable software problems. The global ontology's objects, attributes, transitions, and relationships are accepted as forming the domain's universe.

2.2 Summary Schema Model (SSM)


The SSM was first proposed by M. Bright et al. [2,3,4]. The SSM was designed

to address the following issues [2,3,4]:

1. In a multi-database system, users cannot be expected to remember

voluminous specific access terms, so the global database should provide

system aids for matching user requests to system data access.

2. Because of different local requirements, independent database designers are

unlikely to use consistent terms in structuring data. The system must take

responsibility for matching user requests to precise system access terms.

The SSM provides the following capabilities: it allows imprecise queries and

automatically maps imprecise data references to the semantically closest system access

terms. Note that the SSM deals with imprecision in database access terms rather than data

values within the database.

The SSM uses a taxonomy of the English language that maintains synonym and

hypernym/hyponym links between terms. Roget’s original thesaurus provided just such a

taxonomy and is the current basis for the SSM. Identifying semantic similarity is the first

step in mapping local to global data representation.

The SSM creates an abstract view of the data available in local databases by

forming a hierarchy of summary schemas. A database schema is a group of access terms

that describe the structure and content of the data available in a database. A summary

schema is a concise, although more abstract, description of the data available in a group

of lower level schemas. In SSM, schemas are summarized by mapping each access term

to its hypernym. Hypernyms are semantically close to their hyponyms, so summary

schemas retain most of the semantic content of the input schemas.

The SSM trees structure the nodes of a multi-database into a logical hierarchy.

Each leaf node contributes a database schema, and each access term in a leaf schema is

associated with an entry-level term in the system taxonomy. Once these terms have been

linked to the taxonomy hierarchy, creating the summary schemas at the internal nodes is

automatic. Each internal node maintains a summary schema representing the schemas of

its children. Conceptually, only leaf nodes have participating DBMSs, while internal

nodes are responsible for the summary schema structure and most of the SSM

processing.
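The summarization step can be sketched in a few lines; the hypernym table and leaf schemas below are illustrative assumptions for the sketch, not part of the SSM's actual taxonomy.

```python
# Sketch of SSM schema summarization: each access term in a child
# schema is mapped to its hypernym to form the parent's summary schema.
# (Hypernym table and leaf schemas are illustrative, not from Roget's.)

HYPERNYMS = {               # entry-level term -> hypernym (assumed)
    "salary": "compensation",
    "bonus": "compensation",
    "ssn": "identifier",
    "emp_id": "identifier",
}

def summarize(schemas):
    """Merge child schemas into one summary schema of hypernyms."""
    summary = set()
    for schema in schemas:
        for term in schema:
            summary.add(HYPERNYMS.get(term, term))  # keep term if no hypernym
    return summary

leaf_a = {"salary", "ssn"}
leaf_b = {"bonus", "emp_id"}
print(sorted(summarize([leaf_a, leaf_b])))  # ['compensation', 'identifier']
```

Because hypernyms are semantically close to their hyponyms, the merged parent schema stays small while retaining most of the semantic content of its children.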


2.3. Integration

Bright et al. [1] define a multidatabase system as a system layer that allows global

access to multiple, autonomous, heterogeneous, and preexisting local databases. This

global layer provides full database functionality and interacts with the local DBMSs at

their external user interface. Both the hardware and software intricacies of the different

local systems are transparent to the user, and access to different local systems appears to

the user as a single, uniform system. The term multidatabase includes federated

databases, global schema multidatabases, multidatabase language systems, and

homogeneous multidatabase language systems.

Multidatabases inherit many of the problems associated with distributed

databases, but also must contend with the autonomy and heterogeneity of the databases

that they are trying to integrate. As the number of local systems and the degree of

heterogeneity among these systems rises, the cost of integration increases.

There has been considerable research on multidatabase systems; much of this work is surveyed in [12]. This work has focused on the problem from the

point of view of applying traditional database techniques to bridge the mismatch between

the underlying data sources.

Several researchers have explored the use of intelligent agents called mediators

[26,27] as a means of bridging the mismatch between the heterogeneous data sources. At

present, there is no implemented system that offers the full range of functionality

envisioned by Wiederhold in his paper [26]. Examples of such projects include HERMES at the University of Maryland [23], CoBase at UCLA [5], NCL at the University of Florida [22], and MIX at SDSC [15].

The advantage of such mediator-based systems is that to add a new data source it is only

necessary to find the set of rules that define the new data source.

More recently, a number of researchers have started to look at XML-based data

integration techniques as a way to attack the general data integration problem. The use of XML

in the general data integration problem is especially interesting, as the semistructured format that XML supports allows one to manipulate a variety of data types. Beyond simply storing the data

in XML format, data integration requires mechanisms to do the integration. Zamboulis makes

use of Graph Restructuring to accomplish the integration [30]. A number of groups have looked


at XQuery as the basis of their approach to XML-based data integration [6,7,8,12]. The Tukwila

Data Integration System provides a complete solution that involves not only integration, but

activities like optimizing network performance as well [23].

In the next section we overview the complete model before examining the two

principal components of our model in more detail.

3. Model Overview

The proposed model makes use of a database specific ontology and an integration

scheme based on universal relations to support imprecise queries over a distributed set of

relational databases and record-based legacy systems. Figure 3.1 illustrates the

relationship between the objects used to construct the physical state of our model. The

universal relations are used to provide a simple query interface to the set of distributed

relational databases and record-based legacy systems. The Summary Schema Model

(SSM) tree fragments are used to convert a domain entity ontology into a database

specific ontology. The result is that the model is capable of supporting imprecise

requests. Once the terms used in the user’s request are related to the appropriate database

terms (i.e., attribute names), the model automatically generates a result relation and

returns it to the user.

Figure 3.1. Block diagram showing relationships between the objects in the model: the Entity Ontology, SSM tree fragments, Universal Relations, and the Relational Databases and Legacy Systems.


Figure 3.2 looks at the model from the perspective of the processes that are

required to enable the model. The components inside the dotted rectangle provide an

illustration of the relationship between the components.

The interactions between the components of the model are best illustrated by

looking at the way that data flows within the model. The front end system passes the

model a set of terms and conditions as a request (query). The controller passes the terms,

including any terms in the conditions, to the Ontology Mediation Manager. The terms

are used to search the ontology to find the universal relation(s) that are needed to

generate the universal relation query to respond to the request. Terms that cannot be

located in the database specific ontology are typically mediated with the user. There are

multiple ways that this mediation could be implemented depending on the nature of the

front end. In our discussion (and prototype) we have assumed the use of a GUI to

conduct this mediation as a visual process, but this would not be required.

Locating the terms in the ontology would identify one or more universal relations

that can be used to answer the request. In general only one universal relation would be

identified due to the universal relations being semantically disjoint. More details on this

issue are discussed in Section 5. As a result, in the remainder of the paper we will

assume that only one universal relation is required to produce a result for a given request.

Based on the results of the ontology search, a universal relation query is generated.

The universal relation query is passed to the Query Engine along with a request

id. There it is converted into an integration query that makes use of the relations and

legacy system records that define the universal relation’s data space. The integration

query is partitioned by the Data/Query Manager and the resulting subqueries are sent to

the appropriate data sources. The relations that are generated by the subqueries are

returned to the Data/Query Manager where they are merged and the final result relation is

sent back to the front end system.
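The flow just described can be sketched as follows; the ontology mapping, term names, and query syntax here are hypothetical stand-ins, not the model's actual components.

```python
# Sketch of the request flow: terms are mapped to database terms via
# the database specific ontology; unknown terms trigger mediation with
# the user, otherwise a single universal relation query is generated.
# (Hypothetical data shapes; the real system is more involved.)

ONTOLOGY = {                 # term -> (universal relation, attribute)
    "wage":   ("UR1", "salary"),
    "worker": ("UR1", "emp_name"),
}

def handle_request(terms, ontology=ONTOLOGY):
    hits = {t: ontology[t] for t in terms if t in ontology}
    unknown = [t for t in terms if t not in ontology]
    if unknown:                            # mediate with the user
        return {"mediate": unknown}
    # The universal relations are semantically disjoint, so all of the
    # located terms should name the same universal relation.
    (ur,) = {rel for rel, _ in hits.values()}
    attrs = ", ".join(attr for _, attr in hits.values())
    return {"query": f"SELECT {attrs} FROM {ur}"}

print(handle_request(["wage", "worker"]))  # {'query': 'SELECT salary, emp_name FROM UR1'}
print(handle_request(["wage", "pay"]))     # {'mediate': ['pay']}
```

In the full model the generated universal relation query is then converted into an integration query, partitioned into subqueries, and the partial results merged.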


In the next two sections, we take a more detailed look at the components of the

model. Our approach to creating and searching database specific ontologies is examined

in Section 4. An overview of our integration scheme is given in Section 5.

4. Ontology Design

Ontologies are in general domain specific. In an environment where one is trying

to integrate a set of heterogeneous, distributed data sources, this means that it is

necessary to make the ontology used to search the data sources database specific.

Figure 3.2. Block diagram of the proposed model. (The User interacts with a Front End System; within the model, a Controller coordinates the Ontology Mediation Manager, Ontology Search Manager, Query Engine, and Data/Query Manager, which draw on the Database Specific Ontology, MetaData, and the Data Sources.)

For an ontology, this means that the attribute names used in the universal relations must be

incorporated into the ontology.

4.1 Ontology Design

The focus in this section of the paper is moving from domain specific ontologies

to database specific ontologies. We see ontologies as representing the entities (objects) in

the domain in which the user of the integrated databases is working. The domain is

represented by terms that define the problem area. Note that the user’s problem and the

available databases must come from the same domain in order for a solution to exist.

An ontology can be defined as a graph O = (T, E), where T is the set of terms used to represent the domain and E is the set of edges connecting the nodes representing the terms. Each term node can have properties assigned to it. In our ontology model there are four types of edges in E, namely, the is-a, is-part-of, synonym, and antonym edges. Is-a and is-part-of edges are directed, while synonym and antonym edges have no direction. Let I(O) be the set of is-a edges in the ontology O. Then (T, I(O)) represents a directed acyclic graph (dag) with the more general terms higher in the dag and more specific terms lower in the dag. As expected, synonym and antonym edges are used to connect terms with the same and opposite meaning, respectively.

To enhance the search operation, we add the notion of edge weights to create a weighted ontology. Let W be the set of weights such that wi ∈ W is the weight for edge ei ∈ E. We use the weights to prune the search of the ontology. For e ∈ I(O), the weights are used to estimate the relative closeness of the is-a relationship. A similar argument can be made for is-part-of edges. Going through a very general term like Physical Object would not be useful; to block the search, the weights assigned to the edges connected to such a term are set to large values. In our current ontology design, weights for is-a and is-part-of

edges are integers. Note that the use of weights is to reduce the number of questions that

a user must be asked during the search. In meaningful queries there are likely to be

several query terms. This, combined with the expected bushiness of the ontologies, gives rise to the possibility of an overwhelming number of questions that the user could be

asked if the user had to resolve all of the choices.

The weights on the synonym and antonym edges range from zero to one, where

one indicates an exact match for a synonym and an exact opposite for an antonym.


Using weights on these edges allows us to show the degree of the match. A small

example of a weighted ontology is shown in Figure 4.1.
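A weighted ontology of this kind can be encoded directly as an adjacency structure. The sketch below is our own illustrative encoding, with terms and weights loosely modeled on Figure 4.1; it is not the paper's implementation.

```python
from collections import defaultdict

# Sketch of a weighted ontology O = (T, E): is-a and is-part-of edges
# are directed; synonym and antonym edges are undirected. Weights on
# is-a/is-part-of edges are integers (large values block the search);
# synonym/antonym weights lie in [0, 1]. Illustrative encoding only.

class WeightedOntology:
    def __init__(self):
        self.terms = {}                    # term -> property dict
        self.edges = defaultdict(list)     # term -> [(kind, term, weight)]

    def add_term(self, term, **props):
        self.terms[term] = props

    def add_edge(self, kind, a, b, weight):
        self.edges[a].append((kind, b, weight))
        if kind in ("synonym", "antonym"):     # undirected edge kinds
            self.edges[b].append((kind, a, weight))

o = WeightedOntology()
for t in ("Entity", "Physical Object", "Apple", "Human Being", "Person"):
    o.add_term(t)
o.add_edge("is-a", "Physical Object", "Entity", 100)   # too general: blocked
o.add_edge("is-a", "Apple", "Physical Object", 20)
o.add_edge("synonym", "Human Being", "Person", 0.9)
```

Term properties (the empty dicts above) are where links to universal relations are later recorded when the ontology is made database specific.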

The method of generation of the weights depends on the builder of the ontology.

The weights can be assigned by hand or can be generated automatically. We have

generated the weights by hand in our current test sets, but we have designed an algorithm

for generating the weights from metadata and domain documents.

4.2 Creating a database specific ontology

To move from a domain specific ontology to a database specific ontology, we

make use of Summary Schema Model (SSM) tree fragments. The process of creating a

database specific ontology requires us to create SSM tree fragments that are relatively

specific. The SSM tree fragments are constructed starting with the attribute names used

in the schema of the universal relations that are defined by the data source data. To

successfully search a database specific ontology, it is critical that the SSM tree fragments

do not generalize. If the root term of an SSM tree fragment is too general, the database

terms will not be found by searches starting at meaningful domain terms.

To start the process of making an ontology database specific, we check the

attribute names in the universal relation defined by the data sources to determine if they

already exist as terms in the ontology. If the term exists, a pointer is added to the

ontology term property set to point to the universal relation that the attribute is located in.

Figure 4.1. A weighted ontology. (The figure shows weighted is-a and synonym edges among the terms Entity, Physical Object, Living Being, Social Entity, Animal, Human Being, Person, Country, Apple, Green Apple, and Green.)

For the remaining universal relation attributes, the metadata of the databases is used to

unify the attribute names into one or more SSM tree fragments. In particular, the definitions of the database fields named by the attribute names given in the metadata are used to determine related (i.e., unifiable) terms. The term that is used to unify a subset of

the remaining universal relation attributes is then matched against the ontology terms. If

it is found, the SSM tree fragment is attached to the ontology term. Weights are

assigned by the individual expanding the ontology. If the root term of the new fragment

is not in the ontology, the unification process asks the user for related terms and again

checks the ontology. If no match exists, our algorithm looks to incorporate more

universal relation attributes into the SSM tree fragment (i.e., grow the SSM tree

fragment). Our early attempts to completely automate the process have not been very

promising, so we are currently using a human aided approach. The metadata definitions

and related documents are used to determine likely unification terms. This gives the

human guiding the process the opportunity to choose a unifying term from an existing

list.

At each step, the root term of the SSM tree fragment is checked to see if it exists in the

ontology. When all of the attribute names have been incorporated into the ontology in

this manner, we say that the ontology is database specific. Figure 4.2 shows a block

diagram of the database specific ontology.
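The pointer-attachment step of this process can be sketched as follows; the data shapes are illustrative, and the SSM-fragment unification of the unmatched attributes is elided.

```python
# Sketch of making an ontology database specific: universal-relation
# attributes that already appear as ontology terms get a pointer (in
# the term's property set) to their universal relation; the rest are
# left for SSM tree fragment unification. (Illustrative structures.)

def attach_attributes(ontology_props, universal_relations):
    """ontology_props: term -> property dict.
    universal_relations: UR name -> list of attribute names.
    Returns the attributes still needing SSM-fragment unification."""
    remaining = []
    for ur, attrs in universal_relations.items():
        for attr in attrs:
            if attr in ontology_props:
                ontology_props[attr].setdefault("relations", []).append(ur)
            else:
                remaining.append((ur, attr))   # unify via SSM fragments
    return remaining

props = {"salary": {}, "person": {}}
left = attach_attributes(props, {"UR1": ["salary", "dept_code"]})
print(props["salary"])   # {'relations': ['UR1']}
print(left)              # [('UR1', 'dept_code')]
```

The returned attributes would then be grouped under unifying terms drawn from the metadata, with a human choosing among candidate terms as described above.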

Figure 4.2. Block diagram of ontology and SSM fragment design. (The Entity Ontology is connected through SSM tree fragments to the Universal Relations.)

4.3 Search

The basic premise of our ontology search is to allow the user to give a set of

search terms and proceed from the search terms to “near by” database terms. Weights

combined with user interaction are used to define what is meant by “near by”. To look at

the search, we provide a set of basic rules used in the search.

Ontology Search Rules for is-a, synonym, and antonym edges:

1. A user creates a request by supplying a set of search terms. A search algorithm searches the database specific ontology to locate the search terms. If some of the search terms are not found in the ontology, the user is asked to refine the query terms.

2. Weights are used to block paths that are unlikely to provide useful results. As an example, an is-a edge from a very general term to a specific term (e.g., Apple in Figure 4.1) is unlikely to yield a useful "near by" term. Weights are used in combination with user interaction to provide an effective search without overwhelming the user.

3. In a typical successful search, when no link to a universal relation is found at an original term node, the algorithm starts from the node by looking for synonym edges. If one is found, the weight is tested against the synonym threshold. If the weight is larger than the threshold, the search moves to the next node and continues. Since more than one synonym edge may be followed, the weights on synonym edges are multiplied and the product is tested against the threshold. Whether more edges are followed from the individual nodes depends on whether we are looking for all "near by" database terms or one. If no synonym edge exists, then the is-a edges are used as indicated in rule 2.

4. For a NOT search, the algorithm starts from the query term in the ontology and looks for an antonym edge leaving the term node. If one exists, its weight is tested against the antonym threshold. If an appropriate antonym edge is found, the search moves to the new term node and a positive search (rule 3) is initiated from that point.

5. In all cases, if no "near by" database term is found for a query term, the user is notified and asked to refine the query term.

6. When all query terms have been processed, the search algorithm returns a set of universal relations and attribute names that can be used to generate the required universal relation query.
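The synonym traversal of rule 3 can be sketched as follows; this is a simplification under assumed threshold values, and it omits the is-a fallback, is-part-of edges, and user interaction.

```python
# Sketch of rule 3: follow synonym edges from a query term, multiplying
# the weights along the path and testing the product against the
# synonym threshold, until terms linked to a universal relation are
# found. (Simplified; threshold value is an assumption.)

def near_by(term, edges, db_linked, syn_threshold=0.8):
    """edges: term -> [(kind, term, weight)].
    db_linked: set of terms carrying a universal relation link."""
    found, seen = set(), {term}
    stack = [(term, 1.0)]
    while stack:
        node, product = stack.pop()
        if node in db_linked:
            found.add(node)
            continue
        for kind, nxt, w in edges.get(node, ()):
            if kind == "synonym" and nxt not in seen and product * w >= syn_threshold:
                seen.add(nxt)
                stack.append((nxt, product * w))
    return found

edges = {"wage": [("synonym", "pay", 0.95)],
         "pay":  [("synonym", "salary", 0.9)]}
print(near_by("wage", edges, {"salary"}))  # {'salary'}
```

Multiplying the weights (0.95 × 0.9 = 0.855 ≥ 0.8 here) is what keeps long chains of weak synonyms from drifting away from the user's intent.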

5. Integration Scheme

While there has been a great deal of activity on integrating heterogeneous

databases, important questions remain. To address them, we use an integration model

designed to operate on a subset of the general integration problem where the data sources

are limited to relational databases and record-based legacy systems. Our approach takes

advantage of the work on universal relation interfaces (URIs) [7,17]. The idea behind a URI is to provide a single relation view of a set of relations found in the same database.

Figure 4.3. Relationship between the wrapper and the legacy system. (The Relation View Manager sends a request for data to the Wrapper; the Wrapper issues a request for records to the record-based legacy system and exports the returned set of records as a pseudo relation.)

The set of relations should have sufficient semantic overlap so that the single universal relation view is able to provide a semantically correct “view” of the data. In addition, an URI has to support development of a correct query.

The task of applying the earlier work on URIs to the integration of relational

databases and record-based legacy systems has three basic steps:

1. Give the record-based legacy systems a relational structure,

which we call a pseudo relation.

2. Group attributes so that only semantically equivalent attributes

have the same name in the integrated environment.

3. Model each set of connected relations (defined in Section 4.3) as

a universal relation.

The result of applying the three steps is a set of universal relations that are visible

to any software interacting with the integration model. The number of universal relations

will depend on the degree of overlap between relations and pseudo relations. The next

three subsections look at the three steps in more detail.

4.1 Defining Pseudo Relations

Our approach is to have the local data administrator of each record-based legacy

system define the set of export “relation view(s)” (records) that he/she is willing to

export into the integrated environment. This set can change over time. The local data

administrator defines these “relation views” as a set of requests to the legacy system at

the programmatic level (batch mode). Each “relation view” places a pseudo relation in

the integrated environment. A pseudo relation is a set of tuples with each column named

by a unique attribute name.

A wrapper for the legacy system is then created that resides on the same platform

as the legacy system. The wrapper is a static agent that interfaces with the integration

model by exporting the required “relation view” as a set of tuples (i.e., a pseudo

relation). To generate the pseudo relation, the view manager executes the appropriate

request to the


legacy system through the “relation views” defined by the local administrator. Figure 4.3

illustrates the relationship.

Each retrieval of data through a wrapper places a pseudo relation in the integrated environment. Selection of rows in the resulting table can easily

be implemented as part of the view manager.
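A wrapper of this kind can be sketched as a simple record-to-tuple converter. The fixed-width record layout and field names below are hypothetical; a real wrapper would issue the batch-mode “relation view” request defined by the local data administrator.

```python
# Sketch of a legacy-system wrapper producing a pseudo relation
# (Section 4.1): a set of tuples with each column named by a unique
# attribute name. The record layout is an illustrative assumption.

def wrap_records(raw_records, layout):
    """Convert legacy records into a pseudo relation.

    layout: list of (attribute_name, start, end) slices into each record.
    Returns (attribute names, list of tuples).
    """
    attrs = [name for name, _, _ in layout]
    tuples = [tuple(rec[s:e].strip() for _, s, e in layout)
              for rec in raw_records]
    return attrs, tuples

# Hypothetical fixed-width employee records exported by the legacy system.
records = ["0001Smith     Sales    ",
           "0002Jones     Shipping "]
layout = [("emp_id", 0, 4), ("name", 4, 14), ("dept", 14, 23)]
attrs, rows = wrap_records(records, layout)
print(attrs)   # ['emp_id', 'name', 'dept']
print(rows[0]) # ('0001', 'Smith', 'Sales')
```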

4.2. Attribute Names

In any set of database relations and legacy systems there are likely to be problems with attribute names. In particular, one expects some instances of semantically equivalent

attributes with different names and some cases of attributes with the same names, but

different meanings.

We use the typical solution to this problem, i.e., have the designer of the

integrated system evaluate the existing name set by reviewing the metadata defined over

the data sources. He/she can then rename attributes within the integrated system to

remove the problem. For relational databases, this can be accomplished by using views. Views can also be used by the local database administrator as a means of

controlling what data is exported into the integrated environment. Since the local data

administrator of a legacy system is already defining a “relation view” in the integrated

environment for each export schema, any required name changes can be handled at that

level.


The result is that we can look at the integrated environment as defining a set of attributes such that, if two attributes have the same semantics, they have the same name.

Also, if two attributes have the same name, they have the same semantics.

Another advantage of renaming the attributes in the proposed environment is that

attribute names can be chosen to provide more semantic meaning. This results in easier

SSM tree fragment construction.
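The renaming step can be sketched as a per-source rename map maintained by the designer of the integrated system. The source names, attribute names, and map entries below are purely illustrative assumptions.

```python
# Sketch of the renaming step (Section 4.2). The rename map makes
# semantically equivalent attributes share one name, and splits apart
# same-named attributes with different meanings.

RENAME = {
    ("payroll_db", "empno"):  "emp_id",
    ("hr_legacy",  "emp_no"): "emp_id",
    ("hr_legacy",  "state"):  "employment_status",  # not a geographic state
}

def rename_scheme(source, scheme):
    """Apply the rename map to one data source's attribute list."""
    return [RENAME.get((source, a), a) for a in scheme]

print(rename_scheme("payroll_db", ["empno", "salary"]))
# ['emp_id', 'salary']
```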

4.3 Universal Relations

A universal relation u(U) is seen as a virtual relation u over a scheme U. We use U and attr(U) interchangeably to mean the attributes in the scheme U. The universal relation u can be defined over a set of relations {r1(R1), r2(R2), …, rn(Rn)} where u = r1 ⋈ r2 ⋈ … ⋈ rn and attr(U) = attr(R1) ∪ attr(R2) ∪ … ∪ attr(Rn).

The universal relations used in our integration model are restricted to being

connected and maximal. A universal relation over a set of relations R = {R1, R2, …, Rn} is connected as long as it is not possible to partition the set of relations into two nonempty sets, say O1 and O2, such that O1 ∪ O2 = R and attr(O1) ∩ attr(O2) = ∅. A universal relation u(U) is considered to be maximal if attr(U) is the maximum set of attributes such that u is connected.

In the remainder of this presentation, we use the phrase universal relation to mean

a maximal and connected universal relation. In the next subsection, we look at the basic

aspects of our integration model.
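Grouping relations into connected, maximal universal relations amounts to a connected-components computation over schemes that share attributes. A hedged sketch, assuming schemes are given as attribute sets (the scheme names and contents are illustrative):

```python
# Sketch of Section 4.3: each connected component of the attribute-
# sharing graph over relation schemes yields one connected, maximal
# universal relation. Uses a small union-find structure.

def universal_relations(schemes):
    """schemes: dict name -> set of attributes.
    Returns a list of (set of scheme names, union of their attributes)."""
    names = list(schemes)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path halving
            n = parent[n]
        return n

    # Union any two schemes that share at least one attribute.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if schemes[a] & schemes[b]:
                parent[find(a)] = find(b)

    comps = {}
    for n in names:
        comps.setdefault(find(n), set()).add(n)
    return [(grp, set().union(*(schemes[n] for n in grp)))
            for grp in comps.values()]

schemes = {"R1": {"A", "B", "C"}, "R2": {"C", "D", "E"}, "R3": {"F", "G"}}
for grp, attrs in universal_relations(schemes):
    print(sorted(grp), sorted(attrs))
# ['R1', 'R2'] ['A', 'B', 'C', 'D', 'E']
# ['R3'] ['F', 'G']
```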

4.4. Data Integration

The Ontology Mediation Manager (Figure 3.2) sees the data through the

integration scheme as a set of disjoint universal relations. As such, it simply generates a

universal relation SQL query of the form Select attribute list From universal relation

Where condition. The Ontology Mediation Manager tags the universal relation query with the request id from the front end system, supplemented by the controller to identify the front end and the user making the request. The task of the integration system is to

1. Convert the universal relation query into a query over the relations that support

the universal relation.

2. Ensure the correctness of the query.


3. Partition the query with respect to the data sources.

4. Query the individual data sources, combine the results into a final relation, and

return it to the user.

The integration system is made up of two primary components, namely, a Query

Engine and a Data/Query Manager (Figure 3.2). The Query Engine makes use of a

hypergraph model of the set of relations that support the universal relation used in the

universal relation query to generate the query and test its correctness. The Data/Query

Manager receives the universal relation query from the Query Engine, partitions it with

respect to the location of the data, sends the resulting sub-queries to the appropriate data

sources, and combines the results of the subqueries if there is more than one sub-query.

In the next two sections we look briefly at the underlying concepts of the Query

Engine and Data/Query Manager, respectively.

5. Query Generation and Correctness Overview

Hypergraphs play a critical role in our approach to integration. A hypergraph is a

couple H = (N,E), where N is the set of vertices and E is the set of hyperedges, which are

nonempty subsets of N. There is a natural correspondence between database schemes

and hypergraphs. Consider the set of relation schemes R = {R1, R2, …, Rn}. We can

define the set of attributes of R as being attr(R) = R1 ∪ R2 ∪ … ∪ Rn. The hypergraph HR =

(attr(R),R) can be seen to be a hypergraph representation of the set of relations.

Typically, the hypergraph has been used to represent the scheme of a single database, but

there is no reason that we can not use the more general interpretation of having it

represent the scheme of the relations and pseudo relations that define the data in the

integrated environment.

Let L = {L1, L2, …, Lm} be the set of pseudo relations that are defined for the

record-based legacy systems as described in Section 4.1. Let R = {R1, R2, …, Rn} be the

set of relation schemes associated with the relational databases that exist within the

integrated environment. If RENAME() is the process described in Section 4.2, then S = RENAME(L) ∪ RENAME(R) can be perceived as the relation set for the integrated

Figure 5.1. Relationship between the hypergraph, complete intersection graph, and the ABFS tree for a query requiring the attributes A, B, and F: (a) hypergraph; (b) complete intersection graph; (c) adjusted BFS tree with root ABC.

environment. We can then look at HI = (attr(S),S) as a hypergraph representation of the

integrated environment. The hypergraph HI defines a set of one or more connected

subhypergraphs. The precise number of connected subhypergraphs is dependent on the

connectivity of the relations and pseudo relations in the integrated environment. Each

connected subhypergraph, say Hu = (attr(U), U) where U is a subset of S and attr(U) ∩ attr(S-U) = ∅, provides the basis of one universal relation.

Looking at the elements of S = {S1, S2, …, Sm+n}, where Si = RENAME(Li) for 1 ≤ i ≤ m and Sj+m = RENAME(Rj) for 1 ≤ j ≤ n, we assume that the Sk, 1 ≤ k ≤ m+n, define


meaningful groupings of attributes within the integrated environment. Using the results

of [7], we then have the join dependency ⋈[S] defined over the integrated environment.

The importance of this is that we can apply the strategy used in our earlier work on

universal relations [16,17] to check the correctness of any queries generated in the

integrated environment.

To translate a universal relation query to an integration query, we must translate

the request to the target data space (the hypergraph representing the collection of

connected operational databases). Finally, the target query hypergraph needs to be

mapped to an SQL query. To create the mapping, we convert the underlying hypergraph

into a set of Adjusted Breadth First Search (ABFS) trees [17]. An ABFS tree is created

by applying a variation of the breadth first search to the complete intersection graph

(CIG) defined by the underlying hypergraph model. An ABFS tree is created for each

node (relation) in the CIG that contains attributes required in the SQL query. Each path

from the root to a leaf of the ABFS tree defines a set of relations that can be joined.

From this set of paths, we choose a subset that covers the attributes required in the query.

The ABFS tree that requires joining the fewest relations is chosen to create the relation

list in the new SQL query. Figure 5.1 illustrates a simple example of this process. The

complete details of mapping the request to an SQL query are given in [17, Appendix A].

To ensure the correctness of the integration query, we need to have the join sequence define a lossless join. Using the result from [7], the join dependency ⋈[U] is defined over the relations and pseudo relations that make up the universal relation used in the universal relation query that is being translated. The importance of this is that an FD-Hinge of Hu defines a set of edges whose corresponding relations have a lossless join

[16]. The test for correctness starts by testing if the edges that correspond to the join

sequence define an FD-Hinge in Hu. Failing that, the set of edges is expanded to form an FD-Hinge.

6. Data/Query Manager Overview

The first task of the Data/Query Manager is to partition the integration query generated by the Query Engine into subqueries with respect to the location of the relations/pseudo relations involved in the query. Once the integration query has


been partitioned, the resulting subqueries are sent to the appropriate data sources.

Example 1 provides a simple example of the partition process.

Example 1: Example of query partition using SQL syntax.

Data layout: Site 1 tables: R1(A,B,C), R2(C,D,E)

Site 2 tables: R3(E,F,G)

Universal Relation Query:
Select G, B Where F = 10

Integration Query:
Select G, B From R1, R2, R3 Where R1.C = R2.C and R2.E = R3.E and F = 10

Partition results:

Query for Site 1 (Q1):
Select B, E From R1, R2 Where R1.C = R2.C

Query for Site 2 (Q2):
Select E, G From R3 Where F = 10

Request Framework Query:
Select G, B From Q1, Q2 Where Q1.E = Q2.E

The Data/Query Manager retains the Request Framework Query so that it can

combine the results when two or more subqueries are needed. Assuming Id1 is the

request identifier for the universal relation query, Site1 is the site location, and Q1 & Q2

are the subquery identifiers for the two subqueries in Example 1, then Example 2

illustrates the strings used by the Data/Query Manager to represent the subqueries and the

Request Framework Query.

Example 2: The query string for the result given in Example 1:


SubQuery Queue:

“Select B, E From R1, R2 Where R1.C = R2.C”:<Id1,Q1,Site1>

“Select E, G From R3 Where F = 10”:<Id1,Q2,Site2>

Request Framework Query Queue:

”Select G, B From Q1, Q2 Where Q1.E = Q2.E”:<Id1>

The results of the subqueries are placed in a temporary database at the site of the

Data/Query Manager. When results from all of the subqueries have returned and are

stored in the local database, the Request Framework Query is used to combine the

intermediate results before returning the final result relation to the front end system.
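The partition step above can be sketched as follows. This is a simplification that assumes the relation-to-site map is known and elides the analysis of join conditions; all names mirror Example 1 but are otherwise hypothetical.

```python
# Sketch of the Data/Query Manager's partition step (Section 6):
# relations in the integration query are grouped by site, and one
# subquery string is built per site.

def partition_by_site(relations, site_of):
    """Group relation names by the site that stores them."""
    groups = {}
    for r in relations:
        groups.setdefault(site_of[r], []).append(r)
    return groups

def subquery(attrs, relations, condition=""):
    """Build a simple SQL subquery string for one site."""
    q = "Select %s From %s" % (", ".join(attrs), ", ".join(relations))
    return q + (" Where " + condition if condition else "")

# The data layout of Example 1.
site_of = {"R1": "Site1", "R2": "Site1", "R3": "Site2"}
groups = partition_by_site(["R1", "R2", "R3"], site_of)
print(groups)  # {'Site1': ['R1', 'R2'], 'Site2': ['R3']}
print(subquery(["B", "E"], groups["Site1"], "R1.C = R2.C"))
# Select B, E From R1, R2 Where R1.C = R2.C
```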

7. Prototype

A prototype was implemented to test the feasibility of our approach. The

prototype was implemented in Java, developed on the Red Hat Linux platform, and tested on the Windows platform. Figure 7.1 illustrates a block diagram of the prototype.

It is made up of four primary components: the User Interface, the Ontology Search

Manager, the Query Engine, and the Data/Query Manager. The functionality for the

Ontology Mediation Manager has been incorporated into the User Interface in the current

version of the prototype. The User Interface allows a user to enter a set of domain search

terms and a condition. The beginning screen with an example in progress is shown in

Figure 7.2.

Figure 7.1. Block diagram of the prototype.

When the user is satisfied with what has been entered, he/she clicks on the Start Request

button. The Ontology Search System performs the search described in Section 3.3. The

ontology is searched for the domain terms provided by the user. If all of the domain

search terms are found in the ontology, the database information found through the SSM fragments is returned to the user interface module. The user is notified of a successful

ontology search with the screen shown in Figure 7.3. The user has the option to see the

SQL query that has been constructed, see the results of the query on screen, or restart the

query process. Note that the motivation for the prototype has been to test our underlying systems and not to develop a full-featured user interface.

Figure 7.2. The initial screen of the prototype with an example.

The discussion above assumes that the domain terms that the user entered were in

the ontology used by the system. When the ontology search doesn’t find all of the

domain search terms, the system creates a screen showing the terms that cannot be found. Two conditions exist: either the ontology search found a term that appears to be close, or no such term can be found. In the first case the system returns the fragment of

the ontology that it thinks may be relevant. The user can choose one of the terms shown

in the ontology fragment or enter another term. Figure 7.4 shows an example of the case

where a fragment of the ontology is presented to the user. The screen illustrates how the

system prototype engages the user to help out the ontology search. The example shows

three is-a relationships with "country" being the likely choice for the user. The

"geographic feature" node represents an is-closely-related relationship. Note that neither

the type of the arc nor the weights are shown at this point. We are hoping to get the

user's interpretation without biasing the user's choice.

Figure 7.3. Successful ontology search screen.

Figure 7.4. Screen showing user/ontology interaction.

In the case that no terms are considered close to the domain search term, the user

is asked to enter a new domain search term with the same meaning.

When the ontology search (with the user's assistance) resolves the search terms to

database terms, the information is passed to the Query Generation System, where an SQL

query is generated. The Query Generation System tests for the correctness of the

generated query [17]. If the Query Generation System is called through the View Query button, the SQL query is shown. Again, since we are in test mode, we have chosen to show the full SQL query with the tables or pseudo tables from the distributed data

sources as though they are in the same database. In a commercial package, more options

of how to show the query would have to be considered.

When the user clicks on the Results on Screen button, the query information is

passed to the Data Integration System. There the query is partitioned into queries for the


individual relational databases or legacy systems, and the subqueries are sent to the

individual data source sites. The results of the individual queries are sent back to the

Data Integration system and the query results for the user’s query are prepared using the

original query. The sub-queries are used in the prototype to define and spawn a set of

mobile agents. The agents are sent to the sites that contain the relevant data. Each agent

carries one of the SQL queries. The data returned by the agents is combined to produce

the required result. The result of the request is then returned to the user and displayed on

the screen.

The choice of mobile agents is not critical to the model, but rather represents a

method for quickly generating the necessary infrastructure. Client-Server models using

SOAP, CORBA or Java JDBC connections could also be used. We have used all four

types of connections in related projects.

8. Conclusions

A model has been given for using domain-specific ontologies, converting them to database-specific ontologies to aid in the interpretation of a user's query. The

model allows users to define both domain specific search terms and domain specific

functions to operate on the results of the query. The model was built on an integrated

database/legacy system environment. Our data integration scheme provides a universal

relation view of the distributed data sources. A prototype to test the feasibility of the

ontology and data integration model has been designed and implemented. The prototype

takes the user input and generates SQL queries for the relational databases/legacy

systems over which the ontology search operates.

9. References

1. Bright, M.W., A.R. Hurson and S.H. Pakzad. A taxonomy and current issues in multidatabase systems. IEEE Computer, Vol. 25, No. 3, pages 50-60.

2. Bright, M.W. and A. Hurson. Summary Schemas in multidatabase systems. Computer Engineering Technical Report, Penn State, 1990.

3. Bright, M.W., A. Hurson, S. Pakzad, and H. Sarma. The Summary Schemas Model - An approach for handling Multidatabases: Concept and Performance Analysis. Multidatabase Systems: An Advanced Solution for Global Information Sharing, p. 199, 1994.

4. Bright, M.W. and A. Hurson. Automated Resolution of Semantic Heterogeneity in Multidatabases. ACM Transactions on Database Systems, 19(2), p. 213, 1994.


5. Chu, W.W., H. Yang, K. Chiang, M. Minock, G. Chow and C. Larson. CoBase: A Scalable and Extensible Cooperative Information System. Journal of Intelligent Information Systems, Vol. 6, No. 2/3, 1996, p. 223.

6. Corazzon, Raul, ed. Descriptive and Formal Ontology. http://www.formalontology.it.

7. Fagin, R., A.O. Mendelzon, and J.D. Ullman. A simplified universal relation assumption and its properties. ACM Transactions on Database Systems, Vol. 7, 1982, pages 343-360.

[6] Gardarin, Georges, Antoine Mensch, Anthony Tomasic. An Introduction to the e-XML Data Integration Suite. EDBT 2002: 297-306.

[7] Gardarin, Georges, Fei Sha, Tuyet-Tram Dang-Ngoc. XML-based Components for Federating Multiple Heterogeneous Data Sources. ER 1999: 506-519.

[8] Gardarin, Georges, Antoine Mensch, Tuyet-Tram Dang-Ngoc, L. Smit. Integrating Heterogeneous Data Sources with XML and XQuery. DEXA Workshops 2002: 839-846.

8. Gruber, T. A translation approach to portable ontologies. Knowledge Acquisition, 5(2), 1993, pages 199-220.

9. Gruber, T. Toward Principles for the Design of Ontologies Used for Knowledge Sharing. In N. Guarino, ed., International Workshop on Formal Ontology, Padova, Italy, 1993.

10. Guarino, N. Formal Ontology, Conceptual Analysis and Knowledge Representation. International Journal of Human and Computer Studies, special issue on The Role of Formal Ontology in the Information Technology, N. Guarino and R. Poli, eds., Vol. 43, No. 5/6, 1995.

11. Guarino, N. and C. Welty. Ontological Analysis of Taxonomic Relationships. In A. Laender and V. Storey, eds., Proceedings of ER-2000: The 19th International Conference on Conceptual Modeling, October 2000.

12. Hurson, A., M. Bright, S. Pakzad, eds. Multidatabase Systems - An Advanced Solution for Global Information Sharing. IEEE Computer Society Press, 1994.

13. Karp, Peter D., Vinay K. Chaudhri and Jerome Thomere. XOL: An XML-Based Ontology Exchange Language. http://www.oasis-open.org/cover/xol-03.html.

[12] Lehti, Patrick, Peter Fankhauser. XML Data Integration with OWL: Experiences and Challenges. SAINT 2004: 160-170.

14. Lenat, D.B. Welcome to the Upper Cyc Ontology. http://www.cyc.com/overview.html, 1996.

15. Ludäscher, B., Y. Papakonstantinou, P. Velikhov. A Framework for Navigation-Driven Lazy Mediators. ACM Workshop on the Web and Databases, 1999.

16. Miller, L.L. Generating hinges from arbitrary subhypergraphs. Information Processing Letters, Vol. 41, No. 6, 1992, pages 307-312.

17. Miller, L.L. and M.M. Owrang. A dynamic approach for finding the join sequence in a universal relation interface. Journal of Integrated Computer-Aided Engineering, No. 4, 1997, pages 310-318.

18. Neches, R., R.E. Fikes, T. Finin, T.R. Gruber, T. Senator, W.R. Swartout. Enabling technology for knowledge sharing. AI Magazine, 12(3), 1991, pages 36-56.

19. Schulze-Kremer, S. Adding Semantics to Genome Databases: Towards an Ontology for Molecular Biology. Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology, T. Gaasterland et al., eds., Halkidiki, Greece, June 1997.


20. Slattery, N.J. A Study of Ontology and Its Uses in Information Technology Systems. http://www.mitre.org/support/papers/swee/papers/slattery/.

21. Sowa, John. Knowledge Representation: Logical, Philosophical, and Computational Foundations. Brooks/Cole, Pacific Grove, CA, 2000.

22. Su, S.Y.W., H.L. Yu, J.A. Arroyo-Figueroa, Z. Yang and S. Lee. NCL: A Common Language for Achieving Rule-Based Interoperability Among Heterogeneous Systems. Journal of Intelligent Information Systems, Vol. 6, No. 2/3, 1996, pages 171-198.

23. Subrahmanian, V.S., Sibel Adali, Anne Brink, Ross Emery, James J. Lu, Adil Rajput, Timothy J. Rogers, Robert Ross, Charles Ward. HERMES: Heterogeneous Reasoning and Mediator System. http://www.cs.umd.edu/projects/hermes/publications/abstracts/hermes.html.

24. Swartout, W.R., P. Patil, K. Knight, and T. Russ. Toward Distributed Use of Large-Scale Ontologies. In Proceedings of the 10th Knowledge Acquisition for Knowledge-Based Systems Workshop, Banff, Canada, 1996.

[23] Tukwila Data Integration System. University of Washington. http://data.cs.washington.edu/integration/tukwila. Accessed 10/5/2004.

25. van Heijst, G., A. Schreiber, B. Wielinga. Using explicit ontologies in KBS development. International Journal of Human-Computer Studies, Vol. 46, No. 2/3, February 1997, pages 183-292.

26. Wiederhold, G. Mediators in the Architecture of Future Information Systems. IEEE Computer, Vol. 25, No. 3, 1992, pages 38-49.

27. Wiederhold, G. and M. Genesereth. The Conceptual Basis for Mediation Services. IEEE Expert, Vol. 12, No. 5, 1997, pages 38-47.

[30] Zamboulis, L. XML Data Integration By Graph Restructuring. Proc. BNCOD 21, Edinburgh, July 2004. Springer-Verlag, LNCS 3112, pages 57-71.

Appendix A. Query Generation

To create a query, we must translate the request to the target data space (the

hypergraph representing the collection of connected operational databases). Finally, the


target query hypergraph is mapped to an SQL query. To look at this process in more

detail, we consider the basic data structures and algorithms. We start by looking at the

notion of a complete intersection graph.

Complete Intersection Graph (CIG)

Let H = (U, R) be a hypergraph where U = {A1, A2, ..., An} is a set of attributes and R = {R1, R2, ..., Rp} is a set of relation schemes over U. The complete intersection graph (CIG) [17] is an undirected graph (R, E) where E = {(Ri, Rj) : Ri ∩ Rj ≠ ∅, Ri ∈ R, Rj ∈ R, i ≠ j}. Note that the edge (Ri, Rj) between vertices (or nodes) Ri and Rj exists if and only if Ri and Rj have at least one attribute in common. The edge (Ri, Rj) will be labeled with Rij where Rij = Ri ∩ Rj. An example of a hypergraph

and its complete intersection graph is shown in Figure A.1.
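Under this definition, constructing the CIG from a set of relation schemes is straightforward. A sketch, using the hypergraph of Figure A.1 as the example:

```python
# Sketch of CIG construction: an edge joins two relation schemes iff
# they share an attribute, labeled with the intersection of the schemes.

from itertools import combinations

def complete_intersection_graph(schemes):
    """schemes: dict name -> set of attributes.
    Returns dict (name_i, name_j) -> shared attribute set (edge label)."""
    return {(a, b): schemes[a] & schemes[b]
            for a, b in combinations(sorted(schemes), 2)
            if schemes[a] & schemes[b]}

# The schemes of Figure A.1: ABC, CDE, AEF, BF.
schemes = {"ABC": set("ABC"), "CDE": set("CDE"),
           "AEF": set("AEF"), "BF": set("BF")}
for (a, b), label in sorted(complete_intersection_graph(schemes).items()):
    print(a, b, sorted(label))
```

Note that BF and CDE share no attribute, so no edge is produced for that pair, matching Figure A.1.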

Adjusted Breadth First Search (ABFS)

The adjusted breadth first search (ABFS) [17] is a variation of the breadth first

search (BFS) to determine the join sequence for a target hypergraph. ABFS supplements

BFS by including a path label for each node and an adjustment set in the search tree so

that the search is more efficient. The resulting search tree is called an ABFS tree [17].

The node from which the search is started is called the root of the ABFS tree. A sample

ABFS tree is shown in Figure A.2.

The path label [17] for an ABFS tree node is the union of all query attributes on

this ABFS tree node and its ancestors on the search path. So the path label of an ABFS

Figure A.1. A hypergraph and its complete intersection graph (CIG): (a) hypergraph; (b) complete intersection graph.

tree node should be a superset of its parent’s path label. In the process of creating an

ABFS tree, the path labels will be used to prune or delay the expansion of subsets where

the unused nodes that are adjacent to the current endpoint of the search path do not

contribute any new query attributes to the path label. Any nodes falling into this class

will be stored in the adjustment set [17] (denoted by ASet) with a pointer to the position

where they could be added to the ABFS tree during the further search or expansion. The

relevant CIG can be applied to determine which nodes are adjacent to the current

endpoint of the search path.

The expansion of the ABFS tree will continue until the union of the path labels of

all the leaves in the current ABFS tree contains all the query attributes. If the ABFS tree

can not be expanded and the union of the path labels of all the leaves in the current ABFS

tree does not contain all the query attributes, then a node can be taken from the adjustment set and the process can be restarted from the position pointed to by this node. Note that this

process of creating an ABFS tree should terminate successfully in a finite number of steps since all the

query attributes are in the hypergraph and can be reached eventually.

In addition, using the above approach, many different ABFS trees with the same

root may be generated. This is because the order of search is not unique. Also, there is more than one way (such as FIFO, LIFO, or random) to select nodes from the adjustment set.
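The core expansion idea can be sketched as follows. This is a much-simplified version of the ABFS algorithm of [17]: a neighbor joins the tree only if it contributes a new query attribute to the path label, non-contributing neighbors go to the adjustment set, and restarts from the adjustment set are omitted.

```python
# Simplified sketch of adjusted BFS over the CIG: track a path label
# (query attributes accumulated on the path) per tree node and delay
# nodes that contribute nothing new by placing them in the adjustment set.

from collections import deque

def abfs(schemes, cig_adj, root, query_attrs):
    """Return (path labels by node, adjustment set) for a tree at root.
    cig_adj: node -> set of adjacent nodes in the CIG."""
    labels = {root: schemes[root] & query_attrs}
    aset = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nbr in cig_adj[node]:
            if nbr in labels:
                continue
            contrib = schemes[nbr] & query_attrs
            if contrib - labels[node]:          # new query attribute on the path
                labels[nbr] = labels[node] | contrib
                queue.append(nbr)
            else:
                aset.add(nbr)                   # delayed; may be added later
    return labels, aset

# The example of Figure A.2: query attributes {A, B, F}, root ABC.
schemes = {"ABC": set("ABC"), "CDE": set("CDE"),
           "AEF": set("AEF"), "BF": set("BF")}
cig_adj = {"ABC": {"AEF", "BF", "CDE"}, "CDE": {"ABC", "AEF"},
           "AEF": {"ABC", "BF", "CDE"}, "BF": {"ABC", "AEF"}}
labels, aset = abfs(schemes, cig_adj, "ABC", set("ABF"))
print(labels)  # BF reaches the path label {'A', 'B', 'F'}
print(aset)    # {'CDE'} is delayed, as in Figure A.2
```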

Join Sequence

Finding an optimal join sequence for the selected query attributes (including the attributes that appear in the query condition) is a crucial part of the modeler design

and implementation. Once the ABFS tree with a given root is created, we can

determine the join sequence defined by this tree. The approach is to select a set

of paths connected to the root such that the union of the path labels contains all of

Figure A.2. Adjusted BFS tree using the CIG of Figure A.1 with root ABC and the query attributes ABF.

the desired query attributes. We use the following procedure to select the

appropriate paths [17]:

<Step 1.0> Set W := the set of query attributes. Go to <Step 1.1>.<Step 1.1> Mark every leaf and its ancestors if its path label has a query attribute that appears only once in the path labels of all leaves in the ABFS tree. Remove the query attributes included in the path labels of the marked nodes from W. If W is empty, stop; otherwise, go to <Step 1.2>.<Step 1.2> If there is a contributing query attribute in more than one path labels of the unmarked leaves with the same parent, then mark one (and only one) of those leaves and its ancestors. Remove the query attributes in the path labels of the marked nodes from W. If W is empty, stop; otherwise, go to <Step 1.3>. (By a contributing query attribute we mean a query attribute that occurs in the path label of a leaf but does not occur in its parent’s path label.)<Step 1.3> If there is a leaf which contains a remaining query attribute in W with the lowest frequency, then mark this leaf and its ancestors. In case of tie, choose the leaf with the shortest path and mark the nodes on this path. It is worth noting that the approach described in the previous subsection does not guarantee to create the optimal ABFS tree with a given root since the order of search in that approach is not necessarily optimal. The so-called optimal ABFS tree with a given root is actually the one with the minimum weight over all possible ABFS trees for this root. By weight of an ABFS tree we mean the length of the join sequence defined by the ABFS tree. The creation of a non-optimal ABFS tree with some given root does not cause serious problems. On one hand, our goal is to generate an optimal join sequence which is the one with the minimum weight over the optimal ABFS trees for all roots. The probability of creating non-optimal ABFS trees for all roots is very low. On the other hand, one can remove a redundant join from the resulting join sequence at a later stage. 
Another point worth noting is that we do not have to generate ABFS trees for all roots. We need only generate ABFS trees for the so-called legal roots. A root is called illegal if it contains no query attribute or if its set of query attributes is properly contained in the set of query attributes of an adjacent node. The algorithm to find a join sequence for a given hypergraph and a set of query attributes is summarized as follows:

<Step 2.0> Create the CIG for the hypergraph. Find the set LR of all legal roots in the CIG. Set minweight := the number of nodes in the CIG. Go to <Step 2.1>.

<Step 2.1> If LR is empty, then stop. Otherwise, choose a root r ∈ LR, set LR := LR − {r}, and go to <Step 2.2>.


<Step 2.2> Create an ABFS tree with root r. Find the weight and the corresponding join sequence for this tree. If the weight is smaller than minweight, then save this join sequence as the current best one, and replace minweight with the weight. Go to <Step 2.1>.
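The legal-root filter from Step 2.0 can be sketched as follows, under the assumption that the CIG is available as an adjacency map from node names to neighbor lists and that `labels` gives each node's attribute set; these names and the representation are illustrative, not taken from the paper.

```python
def legal_roots(adj, labels, query_attrs):
    """Return the legal roots of a CIG. A root is illegal if it holds
    no query attribute, or if its set of query attributes is properly
    contained in that of an adjacent node (Step 2.0)."""
    Q = set(query_attrs)
    roots = []
    for n, nbrs in adj.items():
        qn = labels[n] & Q
        if not qn:
            continue                 # no query attribute: illegal
        if any(qn < (labels[m] & Q) for m in nbrs):
            continue                 # properly contained in a neighbor: illegal
        roots.append(n)
    return roots
```

For instance, with nodes R1 = {A}, R2 = {A, B}, R3 = {C} on a chain R1–R2–R3 and query attributes {A, B}, only R2 is legal: R1's query attributes {A} are properly contained in R2's {A, B}, and R3 holds no query attribute.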


Appendix B. Query Correctness

Our correctness process is built on testing for fd-hinges [16]. To understand what this entails, we briefly review the underlying concepts of hinges and fd-hinges.

A hypergraph H is reduced if no hyperedge of H is properly contained in another hyperedge of H. H is connected if every pair of its hyperedges is connected by some path of hyperedges. If H is a reduced connected hypergraph with vertex set N and edge set E, then E′ is a complete subset of E if and only if E′ ⊆ E and, for each Ei in E, if Ei ⊆ attr(E′), then Ei belongs to E′. E′ is said to be a trivial subset of E if |E′| <= 1 or E′ = E.

Let E′ be a complete subset of E and E1, E2 ∈ E − E′. Then we say E1 and E2 are connected with respect to E′ if and only if they have common vertices not belonging to E′.

Let E′ be a nontrivial complete subset of E and let C1, C2, …, Cp be the connected components of E − E′ with respect to E′. Then E′ has the bridge property if and only if for every i = 1, 2, …, p there exists Ei ∈ E′ such that attr(E′) ∩ Ni ⊆ Ei, where Ni = attr(Ci); Ei is called a separating edge of E′ corresponding to Ci. A nontrivial complete subset E′ of E with the bridge property is called a hinge of H. An example of a hinge is shown in Figure B.1. Note that {E2,E3,E4} is not a hinge.
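Two of the ingredients of the hinge definition, the complete-subset test and the connectivity relation with respect to E′, can be sketched as follows. The hypergraph representation (a dict mapping edge names to attribute sets) and the example hypergraph in the test are our own illustrations, not the edge sets of Figure B.1.

```python
def attr(edges, H):
    """attr(E') = the union of the vertex sets of the edges in E'."""
    return set().union(*(H[e] for e in edges)) if edges else set()

def is_complete(Eprime, H):
    """E' is complete iff every edge of H whose vertices all lie in
    attr(E') is itself a member of E'."""
    A = attr(Eprime, H)
    return all(e in Eprime for e in H if H[e] <= A)

def connected_wrt(e1, e2, Eprime, H):
    """E1 and E2 are connected with respect to E' iff they share a
    vertex that does not belong to attr(E')."""
    return bool((H[e1] & H[e2]) - attr(Eprime, H))
```

On a four-edge cycle E1 = {a,b}, E2 = {b,c}, E3 = {c,d}, E4 = {d,a}, the singleton {E1} is complete, E2 and E3 are connected with respect to {E1} (they share c, which is outside attr({E1})), while E2 and E4 share no vertex at all.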

Let F be a set of functional dependencies (fds). Let TE′ be the tableau defined over the attributes in E for the schemes represented by the edges in E′ ⊆ E, and let chaseF(TE′) be the result of using the fds in F to chase the tableau TE′. Now let E* be the set defined by chaseF(TE′) such that E* = {Si | if wi(A) is a distinguished variable and wi ∈ chaseF(TE′), then A ∈ Si}. In other words, each element in E* corresponds to a row in the tableau chaseF(TE′) and consists of the attributes that have distinguished values in that row. Note that by the definition of the chase algorithm, each element of E* is a superset of the corresponding element in E′ that was used to initially define the row in the tableau. Construct the hypergraph HE*,F = (attr(E), (E − E′) ∪ E*). Then E′ is an F-fd-hinge of a hypergraph H when E* is a hinge of HE*,F.

Figure B.1. Hinge example with E1 as the separating edge and {E1,E2,E3,E4} as a hinge.
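The construction of E* from the chased tableau can be sketched as follows. This is a deliberately simplified chase, assuming edges are given as attribute sets and fds as (lhs, rhs) pairs; the symbol encoding (tuples starting with 'a' for distinguished variables, 'b' for nondistinguished ones) is our own convention for the sketch and not part of the paper's formalism.

```python
def chase_Estar(Eprime, all_attrs, fds):
    """Build the tableau for the edges in E', chase it with the fds in F,
    and return E*: per row, the attributes carrying distinguished values."""
    # Initial tableau: one row per edge; distinguished variable ('a', A)
    # where the edge contains A, nondistinguished ('b', i, A) elsewhere.
    rows = []
    for i, edge in enumerate(Eprime):
        rows.append({A: ('a', A) if A in edge else ('b', i, A)
                     for A in all_attrs})
    changed = True
    while changed:                        # apply fds until a fixpoint
        changed = False
        for lhs, rhs in fds:
            for r in rows:
                for s in rows:
                    if r is s or any(r[A] != s[A] for A in lhs):
                        continue          # rows must agree on the lhs
                    for B in rhs:         # equate the rhs symbols
                        if r[B] != s[B]:
                            keep = min(r[B], s[B])   # ('a', ...) sorts first,
                            old = max(r[B], s[B])    # so distinguished wins
                            for t in rows:           # rename everywhere
                                for A in all_attrs:
                                    if t[A] == old:
                                        t[A] = keep
                            changed = True
    # E*: for each row, the attributes whose value is distinguished.
    return [{A for A in all_attrs if r[A][0] == 'a'} for r in rows]
```

For example, chasing E′ = {AB, BC} with the fd B → C promotes the C-variable of the AB-row to a distinguished value, so E* = [{A,B,C}, {B,C}]; each element of E* is a superset of the edge that seeded its row, as noted above.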

In [16] we showed that an F-fd-hinge is equivalent to an embedded join dependency. In other words, whenever a set of edges defines an F-fd-hinge, the relation schemes that correspond to those edges define a lossless join. As a result, our test for query correctness comes down to determining whether the set of edges in question defines an F-fd-hinge.
