Searching Integrated Relational and Record-based Legacy Data Using Ontologies
L.L. Miller and Hsine-Jen Tsai, Department of Computer Science, Iowa State University, Ames, Iowa 50011
Sree Nilakanta, College of Business, Iowa State University, Ames, Iowa 50011
Mehdi Owrang, American University, Washington, D.C.
Abstract

Integration of data continues to be a problem. The number of databases available to a corporation continues to grow. Simply keeping track of the number and diversity of the attributes (fields) can be a difficult problem in large organizations. In this paper we define an ontology model for the domain that uses ontologies for object (entity) search over a set of integrated relational databases and record-based legacy systems. The integration process is based on a hypergraph model that makes use of the theory of universal relations. The design of the complete system model is given and a prototype of the model is briefly discussed.

1. Introduction
Managing the vast amounts of information in large computer networks presents a
number of difficulties to system users and designers. New applications and databases are
created on a regular basis to solve local problems as they arise. For large organizations, it
means the number of databases can be staggering. We cannot expect users to know terms
for identifying specific information from multiple data sources. Most databases are
created and maintained by local groups and/or organizations that use software that
optimizes local transactions. Even if we assume that all databases use a standard
hardware/software platform, language and protocol, there still is the issue of conceptual
heterogeneity. To assist users in obtaining an integrated view of information from
heterogeneous distributed data sources continues to be an active research area.
Among the approaches to this problem, the use of an ontology
[6,8,10,13,18,20,21] seems very appealing. Since the beginning of the nineties,
ontologies have become a popular research topic investigated by several artificial
intelligence research communities, including knowledge engineering, natural-language
processing and knowledge representation. More recently, the notion of ontologies has
become widespread in fields such as intelligent information integration, information
retrieval on the Internet, and knowledge management. The reason for ontologies being so
popular is in large part due to what they promise: a shared and common understanding of
some domain that can be communicated across people and computers. General ontologies
have not been effective. Therefore, the best one expects from an ontology is for it to be
domain specific. However, for imprecise queries, the first problem is to take query terms
and map them to database terms. Therefore, minimally we must modify the ontology to
make it database specific. The Summary Schemas Model (SSM) [2,3,4] provides a way
to link database terms to the ontology.
In spite of the large amount of research that has been done on integrating
heterogeneous data sources, the problem continues to create difficulties for most
organizations. In the present work we look at a subproblem of the general integration
problem, namely the case where the data sources are controlled by one organization and
consist of relational databases and record-based legacy systems. While this is a small
part of the general problem, it covers a large number of the applications that typical
organizations are concerned with integrating.
Our contribution in this paper is the development of an ontology-based model that
provides access to a distributed set of relational databases and record-based legacy
systems through imprecise queries. A database specific ontology is integrated with a set
of semantically disjoint universal relations over the set of data sources to provide access.
The use of universal relations simplifies the connection between the ontology and
the set of distributed data sources. For any request for semantically related
data, there is a single universal relation that is capable of responding to it.
Specifically, we develop the notion of database specific weighted ontologies as a means
of determining the required universal relation. The use of universal relations in this
context is made possible due to our data integration scheme. The integration scheme is
based on the use of hypergraphs and the theory of relational databases. Such an
approach provides the additional capability of testing the correctness of any query
generated.
A brief overview of ontologies, the Summary Schema Model (SSM), and integration
issues is presented in Section 2. The overall model is described in Section 3. In
Section 4 we present our approach to ontologies and look at the issue of generating SSM
tree fragments and database specific ontologies. Section 5 looks at the issues that
make up our data integration scheme. Section 6 overviews our current version of the
feasibility prototype. Finally, we conclude by summarizing our results.
2. Background
2.1 Ontologies
The word “ontology” is borrowed from philosophy, in which it refers to the
“subject of existence” [8]. It is the science of “what is”. It discusses the structures of
entities, the properties of entities and the relations between entities. In a word, it seeks to
find an appropriate classification of entities. In the context of artificial intelligence, an
ontology is a model of some portion of the world and is described by defining a set of
representational terms [6]. A formal definition is “a formal, explicit specification of a
shared conceptualization” [8]. “Conceptualization” refers to an abstract model of some
phenomena in the world by having identified the relevant concepts of those phenomena.
So, an ontology is a description of concepts and relationships between them.
The main motivation of an ontology is knowledge sharing and reuse [9,25]. In the
field of information systems, different groups gather data using their own terminologies.
When all those data are integrated, a major problem that needs to be handled is the
terminological and conceptual incompatibility. This could be done on a case-by-case basis.
But a solution based on a “consistent and unambiguous description of concepts and their
potential relation” [19] will be much better than a case-by-case one. In the Knowledge
Sharing Effort (KSE) project [18], ontologies are put forward as means to share
knowledge bases between various knowledge-based systems.
A major challenge in using ontologies lies in how to build them and what they
should look like. Several groups have given solutions. They describe how ontologies
should be constructed so that they contain the richest information in the least space and
can be efficiently retrieved for use. A solution based on the definition of a “core library”
has been proposed in [25]. More often, an ontology is considered as a taxonomic
hierarchy of words with the “is-a” relation between them [9]. Techniques have also
been proposed to modify a poorly designed ontology into a better one [11].
In dealing with multi-database systems, ontologies can be used effectively to
organize keywords as well as database concepts by capturing the semantic relationships
among keywords or among tables and fields in a relational database. By using these
relationships, a network of concepts can be created to provide users with an abstract view
of an information space for their domain of interest. Ontologies are well suited for
knowledge sharing in a distributed environment where, if necessary, various ontologies
can be integrated to form a global ontology.
Database owners find ontologies useful because they form a basis for integrating
separate databases through identification of logical connections or constraints between
the information pieces. Ontologies can provide a simple conversational interface to
existing databases and support extraction of information from them. Because of the
distinctions made within an ontological structure, they have been used to support
database cleaning, semantic database integration, consistency-checking, and data mining
[20].
An example of using ontologies in databases is Ontolingua [9]. Ontolingua is
being built to enable databases (and the people and systems that interface with them)
to share an ontology specific to the computer science and mathematics domains, with
the goal of enabling data sharing and reuse. Another example of a database application
is the Cyc ontology, whose knowledge base is built on a core of approximately 400,000
hand-entered assertions (or rules) designed to capture a large portion of what we
normally consider consensus knowledge about the world [14]. Cyc is partitioned into an
Upper Cyc Ontology, containing 3,000 terms for the most general concepts of human
consensus reality, and the full Cyc Knowledge Base, populated by literally millions of
logical axioms for more specific concepts descending below. This foundation enables
Cyc to effectively address a broad range of otherwise intractable software problems.
The global ontology’s objects, attributes, transitions, and relationships are accepted
as forming the domain’s universe.
2.2 Summary Schema Model (SSM)
The SSM was first proposed by M. Bright et al. [2,3,4]. The SSM was designed
to address the following issues [2,3,4]:
1. In a multi-database system, users cannot be expected to remember
voluminous specific access terms, so the global database should provide
system aids for matching user requests to system data access.
2. Because of different local requirements, independent database designers are
unlikely to use consistent terms in structuring data. The system must take
responsibility for matching user requests to precise system access terms.
The SSM provides the following capabilities: it allows imprecise queries and
automatically maps imprecise data references to the semantically closest system access
terms. Note that the SSM deals with imprecision in database access terms rather than data
values within the database.
The SSM uses a taxonomy of the English language that maintains synonym and
hypernym/hyponym links between terms. Roget’s original thesaurus provided just such a
taxonomy and is the current basis for the SSM. Identifying semantic similarity is the first
step in mapping local to global data representation.
The SSM creates an abstract view of the data available in local databases by
forming a hierarchy of summary schemas. A database schema is a group of access terms
that describe the structure and content of the data available in a database. A summary
schema is a concise, although more abstract, description of the data available in a group
of lower level schemas. In SSM, schemas are summarized by mapping each access term
to its hypernym. Hypernyms are semantically close to their hyponyms, so summary
schemas retain most of the semantic content of the input schemas.
The SSM trees structure the nodes of a multi-database into a logical hierarchy.
Each leaf node contributes a database schema, and each access term in a leaf schema is
associated with an entry-level term in the system taxonomy. Once these terms have been
linked to the taxonomy hierarchy, creating the summary schemas at the internal nodes is
automatic. Each internal node maintains a summary schema representing the schemas of
its children. Conceptually, only leaf nodes have participating DBMSs, while internal
nodes are responsible for the summary schema structure and most of the SSM
processing.
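As a rough illustration, the summarization step can be sketched as follows; the hypernym table here is a toy stand-in for the Roget-based taxonomy the SSM actually uses, and all names are hypothetical.

```python
# A minimal sketch of SSM schema summarization, assuming a toy
# hypernym table; the real SSM uses a full Roget-style taxonomy.
HYPERNYMS = {
    "salary": "compensation",
    "bonus": "compensation",
    "surname": "name",
    "forename": "name",
}

def summarize(schemas):
    """Map each access term in the child schemas to its hypernym,
    producing the (more abstract) summary schema of the parent node."""
    summary = set()
    for schema in schemas:
        for term in schema:
            summary.add(HYPERNYMS.get(term, term))
    return summary

# Two leaf schemas with four access terms collapse to a two-term summary schema.
print(sorted(summarize([{"salary", "surname"}, {"bonus", "forename"}])))
```

Because hypernyms are semantically close to their hyponyms, the summary schema retains most of the semantic content of the input schemas while being smaller.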
2.3. Integration
Bright et al. [1] define a multidatabase system as a system layer that allows global
access to multiple, autonomous, heterogeneous, and preexisting local databases. This
global layer provides full database functionality and interacts with the local DBMSs at
their external user interface. Both the hardware and software intricacies of the different
local systems are transparent to the user, and access to different local systems appears to
the user as a single, uniform system. The term multidatabase includes federated
databases, global schema multidatabases, multidatabase language systems, and
homogeneous multidatabase language systems.
Multidatabases inherit many of the problems associated with distributed
databases, but also must contend with the autonomy and heterogeneity of the databases
that they are trying to integrate. As the number of local systems and the degree of
heterogeneity among these systems rises, the cost of integration increases.
There has been considerable research on multidatabase systems; much of this
work is surveyed in [12]. This research has largely focused on applying traditional
database techniques to bridge the mismatch between the underlying data sources.
Several researchers have explored the use of intelligent agents called mediators
[26,27] as a means of bridging the mismatch between the heterogeneous data sources. At
present, there is no implemented system that offers the full range of functionality
envisioned by Wiederhold in his paper [26]. Examples of projects that have been
developed include HERMES being developed at the University of Maryland [23],
CoBase at UCLA [5], NCL at the University of Florida [22], and MIX at SDSC [15].
The advantage of such mediator-based systems is that to add a new data source it is only
necessary to find the set of rules that define the new data source.
More recently, a number of researchers have started to look at XML-based data
integration techniques as a way to attack the general data integration problem. The use of XML
in the general data integration problem is especially interesting as the unstructured format that
XML supports allows one to manipulate a variety of data types. Beyond simply storing the data
in XML format, data integration requires mechanisms to do the integration. Zamboulis makes
use of Graph Restructuring to accomplish the integration [30]. A number of groups have looked
at XQuery as the basis of their approach to XML-based data integration [6,7,8,12]. The Tukwila
Data Integration System provides a complete solution that involves not only integration, but
activities like optimizing network performance as well [23].
In the next section we overview the complete model before examining the two
principal components of our model in more detail.
3. Model Overview
The proposed model makes use of a database specific ontology and an integration
scheme based on universal relations to support imprecise queries over a distributed set of
relational databases and record-based legacy systems. Figure 3.1 illustrates the
relationship between the objects used to construct the physical state of our model. The
universal relations are used to provide a simple query interface to the set of distributed
relational databases and record-based legacy systems. The Summary Schema Model
(SSM) tree fragments are used to convert a domain entity ontology into a database
specific ontology. The result is that the model is capable of supporting imprecise
requests. Once the terms used in the user’s request are related to the appropriate database
terms (i.e., attribute names), the model automatically generates a result relation and
returns it to the user.
[Figure 3.1. Block diagram showing relationships between the objects in the model: the entity ontology, SSM tree fragments, universal relations, and the relational databases and legacy systems.]
Figure 3.2 looks at the model from the perspective of the processes that are
required to enable the model. The components inside the dotted rectangle provide an
illustration of the relationship between the components.
The interactions between the components of the model are best illustrated by
looking at the way that data flows within the model. The front end system passes the
model a set of terms and conditions as a request (query). The controller passes the terms,
including any terms in the conditions, to the Ontology Mediation Manager. The terms
are used to search the ontology to find the universal relation(s) that are needed to
generate the universal relation query to respond to the request. Terms that cannot be
located in the database specific ontology are typically mediated with the user. There are
multiple ways that this mediation could be implemented depending on the nature of the
front end. In our discussion (and prototype) we have assumed the use of a GUI to
conduct this mediation as a visual process, but this would not be required.
Locating the terms in the ontology would identify one or more universal relations
that can be used to answer the request. In general only one universal relation would be
identified due to the universal relations being semantically disjoint. More details on this
issue are discussed in Section 5. As a result, in the remainder of the paper we will
assume that only one universal relation is required to produce a result for a given request.
Based on the results of the ontology search, a universal relation query is generated.
The universal relation query is passed to the Query Engine along with a request
id. There it is converted into an integration query that makes use of the relations and
legacy system records that define the universal relation’s data space. The integration
query is partitioned by the Data/Query Manager and the resulting subqueries are sent to
the appropriate data sources. The relations that are generated by the subqueries are
returned to the Data/Query Manager where they are merged and the final result relation is
sent back to the front end system.
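The flow described above might be sketched as follows; the stage functions, relation name, and attribute mapping are purely illustrative assumptions, not the model's actual interfaces.

```python
# A hypothetical sketch of the data flow described above; each stage
# is stubbed, and all names are illustrative, not the paper's API.
def ontology_search(terms):
    # Stand-in for the Ontology Mediation Manager: map request terms
    # to a universal relation and its attribute names.
    return "UR_employee", {"name": "emp_name", "pay": "salary"}

def build_universal_query(relation, mapping, conditions):
    # Generate the universal relation query from the search results.
    cols = ", ".join(mapping.values())
    return f"SELECT {cols} FROM {relation} WHERE {conditions}"

def handle_request(terms, conditions):
    relation, mapping = ontology_search(terms)
    return build_universal_query(relation, mapping, conditions)  # on to the Query Engine

print(handle_request(["name", "pay"], "salary > 50000"))
# SELECT emp_name, salary FROM UR_employee WHERE salary > 50000
```

In the real model the generated universal relation query would then be converted into an integration query and partitioned by the Data/Query Manager.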
In the next two sections, we take a more detailed look at the components of the
model. Our approach to creating and searching database specific ontologies is examined
in Section 4. An overview of our integration scheme is given in Section 5.
4. Ontology Design
Ontologies are in general domain specific. In an environment where one is trying
to integrate a set of heterogeneous, distributed data sources, this means that it is
necessary to make the ontology used to search the data sources database specific. For an
ontology, this means that the attribute names used in the universal relations must be
incorporated into the ontology.

[Figure 3.2. Block diagram of the proposed model: the user interacts with a front end system, which passes requests to the model’s Controller; the Controller coordinates the Ontology Mediation Manager and Ontology Search Manager (backed by metadata and the database specific ontology), the Query Engine, and the Data/Query Manager, which accesses the data sources.]
4.1 The Ontology Model
The focus in this section of the paper is moving from domain specific ontologies
to database specific ontologies. We see ontologies as representing the entities (objects) in
the domain that the user of the integrated databases is working. The domain is
represented by terms that define the problem area. Note that the user’s problem and the
available databases must come from the same domain in order for a solution to exist.
An ontology can be defined as a graph O = (T, E), where T is the set of terms used
to represent the domain and E is the set of edges connecting the nodes representing the
terms. Each term node can have properties assigned to it. In our ontology model there
are four types of edges in E, namely, the is-a, is-part-of, synonym, and antonym edges.
Is-a and is-part-of edges are directed, while synonym and antonym edges have no
direction. Let I(O) be the set of is-a edges in the ontology O. Then (T, I(O)) represents a
directed acyclic graph (dag) with the more general terms higher in the dag and more
specific terms lower in the dag. As expected, synonym and antonym edges are used to
connect terms with the same and opposite meaning, respectively.

To enhance the search operation, we add the notion of edge weights to create a
weighted ontology. Let W be the set of weights such that wi is the weight for edge Ei ∈ E.
We use the weights to prune the search of the ontology. For Ei ∈ I(O), the weights are
used to estimate the relative closeness of the is-a relationship. A similar argument can be
made for is-part-of edges. Going through a term like Physical Object would not be
useful. To block the search, the weights assigned to the edges connected to such a term
are set to a large value. In our current ontology design, weights for is-a and is-part-of
edges are integers. Note that the use of weights is to reduce the number of questions that
a user must be asked during the search. In meaningful queries there are likely to be
several query terms. This, combined with the expected bushiness of the ontologies, gives
rise to the possibility of an overwhelming number of questions that the user could be
asked if the user had to resolve all of the choices.
The weights on the synonym and antonym edges range from zero to one, where
one indicates an exact match for a synonym and an exact opposite for an antonym.
Using weights on these edges allows us to show the degree of the match. A small
example of a weighted ontology is shown in Figure 4.1.
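As a minimal sketch (an assumption about representation, not the prototype's actual code), the weighted ontology might be stored as an adjacency structure in which is-a edges are directed and synonym/antonym edges are recorded in both directions:

```python
# A minimal sketch of the weighted ontology graph, assuming integer
# weights on directed is-a/is-part-of edges and [0, 1] weights on
# undirected synonym/antonym edges, as described above.
class Ontology:
    def __init__(self):
        self.edges = {}  # term -> list of (neighbor, kind, weight)

    def add_edge(self, a, b, kind, weight):
        self.edges.setdefault(a, []).append((b, kind, weight))
        if kind in ("synonym", "antonym"):  # undirected edge types
            self.edges.setdefault(b, []).append((a, kind, weight))

onto = Ontology()
onto.add_edge("Apple", "Physical Object", "is-a", 100)  # large weight blocks search
onto.add_edge("Person", "Human Being", "synonym", 0.9)

# Synonym edges are stored in both directions; is-a edges are not.
print(onto.edges["Human Being"])
```

The large is-a weight on the edge into Physical Object illustrates how overly general terms are blocked during search.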
The method of generation of the weights depends on the builder of the ontology.
The weights can be assigned by hand or can be generated automatically. We have
generated the weights by hand in our current test sets, but we have designed an algorithm
for generating the weights from metadata and domain documents.
4.2 Creating a database specific ontology
To move from a domain specific ontology to a database specific ontology, we
make use of Summary Schema Model (SSM) tree fragments. The process of creating a
database specific ontology requires us to create SSM tree fragments that are relatively
specific. The SSM tree fragments are constructed starting with the attribute names used
in the schema of the universal relations that are defined by the data source data. To
successfully search a database specific ontology, it is critical that the SSM tree fragments
do not generalize. If the root term of an SSM tree fragment is too general, the database
terms will not be found by searches starting at meaningful domain terms.
To start the process of making an ontology database specific, we check the
attribute names in the universal relation defined by the data sources to determine if they
already exist as terms in the ontology. If the term exists, a pointer is added to the
ontology term property set to point to the universal relation that the attribute is located in.
[Figure 4.1. A weighted ontology. Weighted is-a and synonym edges connect terms such as Entity, Physical Object, Living Being, Social Entity, Animal, Human Being, Person, Country, Apple, Green Apple, and Green.]

For the remaining universal relation attributes, the metadata of the databases is used to
unify the attribute names into one or more SSM tree fragments. In particular, the
definitions of the database fields named by the attribute names given in the metadata are
used to determine related (i.e., unifiable) terms. The term that is used to unify a subset of
the remaining universal relation attributes is then matched against the ontology terms. If
it is found, the SSM tree fragment is attached to the ontology term. Weights are
assigned by the individual expanding the ontology. If the root term of the new fragment
is not in the ontology, the unification process asks the user for related terms and again
checks the ontology. If no match exists, our algorithm looks to incorporate more
universal relation attributes into the SSM tree fragment (i.e., grow the SSM tree
fragment). Our early attempts to completely automate the process have not been very
promising, so we are currently using a human aided approach. The metadata definitions
and related documents are used to determine likely unification terms. This gives the
human guiding the process the opportunity to choose a unifying term from an existing
list.
At each step, the root term of the SSM tree fragment is checked to see if it exists in the
ontology. When all of the attribute names have been incorporated into the ontology in
this manner, we say that the ontology is database specific. Figure 4.2 shows a block
diagram of the database specific ontology.
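The first step of this process might be sketched as follows; the term set, universal relations, and function names are hypothetical, and the unresolved attributes are the ones that would go on to SSM tree-fragment unification.

```python
# A sketch of the first step of making an ontology database specific:
# attribute names already present as ontology terms get a pointer to
# their universal relation in the term's property set; the rest are
# left for SSM tree-fragment unification. All names are illustrative.
def link_attributes(ontology_terms, universal_relations):
    properties = {}   # term -> universal relations containing that attribute
    unresolved = []   # (universal relation, attribute) pairs still to unify
    for ur_name, attributes in universal_relations.items():
        for attr in attributes:
            if attr in ontology_terms:
                properties.setdefault(attr, []).append(ur_name)
            else:
                unresolved.append((ur_name, attr))
    return properties, unresolved

terms = {"salary", "department", "person"}
urs = {"UR_employee": ["salary", "emp_id"], "UR_org": ["department"]}
props, todo = link_attributes(terms, urs)
print(props)   # {'salary': ['UR_employee'], 'department': ['UR_org']}
print(todo)    # [('UR_employee', 'emp_id')]
```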
4.3 Search
[Figure 4.2. Block diagram of ontology and SSM fragment design: SSM tree fragments link the entity ontology to the universal relations.]
The basic premise of our ontology search is to allow the user to give a set of
search terms and proceed from the search terms to “nearby” database terms. Weights
combined with user interaction are used to define what is meant by “nearby”. To look at
the search, we provide a set of basic rules used in the search.
Ontology Search Rules for is-a, synonym, and antonym edges:
1. A user creates a request by supplying a set of search terms. A search algorithm searches the database specific ontology to locate the search terms. If some of the search terms are not found in the ontology, the user is asked to refine the query terms.

2. Weights are used to block paths that are unlikely to provide useful results. As an example, an is-a edge from a very general term to a specific term (e.g., Apple in Figure 4.1) is unlikely to yield a useful “nearby” term. Weights are used in combination with user interaction to provide an effective search without overwhelming the user.

3. In a typical successful search, when no link to a universal relation is found at an original term node, the algorithm starts from the node by looking for synonym edges. If one is found, the weight is tested against the synonym threshold. If the weight is larger than the threshold, the search moves to the next node and continues. Since more than one synonym edge may be followed, the weights on synonym edges are multiplied and the product is tested against the threshold. Whether more edges are followed from the individual nodes depends on whether we are looking for all “nearby” database terms or one. If no synonym edge exists, then the is-a edges are used as indicated in rule 2.

4. For a NOT search, the algorithm starts from the query term in the ontology and looks for an antonym edge leaving the term node. If one exists, its weight is tested against the antonym threshold. If an appropriate antonym edge is found, the search moves to the new term node and a positive search (rule 3) is initiated from that point.

5. In all cases, if no “nearby” database term is found for a query term, the user is notified and asked to refine the query term.

6. When all query terms have been processed, the search algorithm returns a set of universal relations and attribute names that can be used to generate the required universal relation query.
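Rule 3 in particular can be sketched as a graph walk that multiplies synonym weights and prunes below a threshold; the graph, threshold value, and function are illustrative assumptions, not the prototype's code.

```python
# A hedged sketch of rule 3: follow synonym edges from a query term,
# multiplying edge weights and pruning when the product falls below
# the synonym threshold. db_terms marks terms linked to a universal
# relation. The graph shape and threshold are illustrative.
def synonym_search(graph, start, db_terms, threshold=0.7):
    results = {}
    stack = [(start, 1.0)]
    seen = {start}
    while stack:
        term, product = stack.pop()
        if term in db_terms:
            results[term] = product   # a "nearby" database term
        for neighbor, weight in graph.get(term, []):
            p = product * weight
            if neighbor not in seen and p >= threshold:   # prune weak chains
                seen.add(neighbor)
                stack.append((neighbor, p))
    return results

graph = {"person": [("human being", 0.9)], "human being": [("employee", 0.85)]}
print(synonym_search(graph, "person", db_terms={"employee"}))
```

Here the chain person → human being → employee survives because 0.9 × 0.85 = 0.765 stays above the 0.7 threshold; a weaker chain would be pruned without asking the user.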
5. Integration Scheme
While there has been a great deal of activity on integrating heterogeneous
databases, important questions remain. To bridge this gap, we use an integration model
designed to operate on a subset of the general integration problem where the data sources
are limited to relational databases and record-based legacy systems. Our approach takes
advantage of the work on universal relation interfaces (URIs) [7,17]. The idea behind a
URI is to provide a single relation view of a set of relations found in the same database.
The set of relations should have sufficient semantic overlap that the single universal
relation view is able to provide a semantically correct “view” of the data. In addition,
a URI has to support development of a correct query.

[Figure 4.3. Relationship between wrapper and legacy system: the Relation View Manager sends a request for data to the wrapper, which issues a request for records to the record-based legacy system and returns the resulting set of records as a pseudo relation.]
The task of applying the earlier work on URIs to the integration of relational
databases and record-based legacy systems has three basic steps:
1. Give the record-based legacy systems a relational structure,
which we call a pseudo relation.
2. Group attributes so that only semantically equivalent attributes
have the same name in the integrated environment.
3. Model each set of connected relations (defined in Section 5.3) as
a universal relation.
The result of applying the three steps is a set of universal relations that are visible
to any software interacting with the integration model. The number of universal relations
will depend on the degree of overlap between relations and pseudo relations. The next
three subsections look at the three steps in more detail.
5.1 Defining Pseudo Relations
Our approach is to have the local data administrator of each record-based legacy
system define the set of export “relation view(s)” (records) that he/she is willing to
export into the integrated environment. This set can change over time. The local data
administrator defines these “relation views” as a set of requests to the legacy system at
the programmatic level (batch mode). Each “relation view” places a pseudo relation in
the integrated environment. A pseudo relation is a set of tuples with each column named
by a unique attribute name.
A wrapper for the legacy system is then created that resides on the same platform
as the legacy system. The wrapper is a static agent that interfaces with the integration
model by exporting the required “relation view” as a set of tuples (i.e., a pseudo
relation). To generate the pseudo relation the view manager executes the appropriate
request to the
legacy system through the “relation views” defined by the local administrator. Figure 4.3
illustrates the relationship.
Each retrieval of data through a wrapper places a pseudo relation in the
integrated environment. Selection of rows in the resulting table can easily
be implemented as part of the view manager.
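A wrapper of this kind might be sketched as follows, assuming fixed-width legacy records; the batch request is stubbed out and every name here is hypothetical.

```python
# A hypothetical wrapper sketch: the batch request to the legacy
# system is stubbed, and its fixed-width records are converted into
# a pseudo relation (a list of tuples with named columns).
def run_legacy_batch(view_name):
    # Stand-in for the local administrator's batch-mode "relation view".
    return ["00173Smith     ", "00209Jones     "]

def pseudo_relation(view_name, columns, widths):
    rows = []
    for record in run_legacy_batch(view_name):
        fields, pos = [], 0
        for width in widths:                       # slice each fixed-width field
            fields.append(record[pos:pos + width].strip())
            pos += width
        rows.append(tuple(fields))
    return columns, rows

cols, rows = pseudo_relation("emp_view", ("emp_id", "surname"), (5, 10))
print(rows)   # [('00173', 'Smith'), ('00209', 'Jones')]
```

Each call exports one pseudo relation into the integrated environment; row selection could be layered on top in the view manager.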
5.2. Attribute Names
In any set of database relations and legacy systems there are likely to be problems
with attribute names. In particular, one expects some instances of semantically equivalent
attributes with different names and some cases of attributes with the same names, but
different meanings.
We use the typical solution to this problem, i.e., have the designer of the
integrated system evaluate the existing name set by reviewing the metadata defined over
the data sources. He/she can then rename attributes within the integrated system to
remove the problem. For relational databases, this can be accomplished by using views.
Views can also be used by the local database administrator as a means of
controlling what data is exported into the integrated environment. Since the local data
administrator of a legacy system is already defining a “relation view” in the integrated
environment for each export schema, any required name changes can be handled at that
level.
The result is that we can look at the integrated environment as defining a set of
attributes, such that, if two attributes have the same semantics, they have the same name.
Also, if two attributes have the same name, they have the same semantics.
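One hedged sketch of this renaming step, with an illustrative designer-supplied map (the sources and attribute names are invented for the example):

```python
# A sketch of the renaming step: a designer-supplied map rewrites
# local attribute names to global ones so that, in the integrated
# environment, equal names always mean equal semantics. All source
# and attribute names here are hypothetical.
RENAME = {
    ("payroll_db", "wage"): "salary",
    ("hr_legacy", "pay"): "salary",
    ("hr_legacy", "state"): "employment_status",  # not a geographic state
}

def global_schema(source, attributes):
    # Attributes without an entry keep their local name.
    return [RENAME.get((source, a), a) for a in attributes]

print(global_schema("payroll_db", ["emp_id", "wage"]))
# ['emp_id', 'salary']
```

The map both merges semantically equivalent attributes under one name and splits same-named attributes with different meanings.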
Another advantage of renaming the attributes in the proposed environment is that
attribute names can be chosen to provide more semantic meaning. This results in easier
SSM tree fragment construction.
5.3 Universal Relations
A universal relation u(U) is seen as a virtual relation u over a scheme U. We use
U and attr(U) interchangeably to mean the attributes in the scheme U. The universal
relation u can be defined over a set of relations {r1(R1), r2(R2), …, rn(Rn)}, where
u = r1 ⋈ r2 ⋈ … ⋈ rn and attr(U) = attr(R1) ∪ attr(R2) ∪ … ∪ attr(Rn).
The universal relations used in our integration model are restricted to being
connected and maximal. A universal relation over a set of relations R = {R1, R2, …, Rn}
is connected as long as it is not possible to partition the set of relations into two
nonempty sets, say O1 and O2, such that O1 ∪ O2 = R and attr(O1) ∩ attr(O2) = ∅. A
universal relation u(U) is considered to be maximal if attr(U) is the maximum set of
attributes such that u is connected.
In the remainder of this presentation, we use the phrase universal relation to mean
a maximal and connected universal relation. In the next subsection, we look at the basic
aspects of our integration model.
4.4. Data Integration
The Ontology Mediation Manager (Figure 3.2) sees the data through the
integration scheme as a set of disjoint universal relations. As such, it simply generates a
universal relation SQL query of the form Select attribute list From universal relation
Where condition. The Ontology Mediation Manager tags the universal relation query
with the request id from the front end system, supplemented by the controller to
identify the front end and the user making the request. The task of the integration system
is to
1. Convert the universal relation query into a query over the relations that support
the universal relation.
2. Ensure the correctness of the query.
3. Partition the query with respect to the data sources.
4. Query the individual data sources, combine the results into a final relation, and
return it to the user.
The integration system is made up of two primary components, namely, a Query
Engine and a Data/Query Manager (Figure 3.2). The Query Engine makes use of a
hypergraph model of the set of relations that support the universal relation used in the
universal relation query to generate the query and test its correctness. The Data/Query
Manager receives the universal relation query from the Query Engine, partitions it with
respect to the location of the data, sends the resulting sub-queries to the appropriate data
sources, and combines the results of the subqueries if there is more than one sub-query.
In the next two sections we look briefly at the underlying concepts of the Query
Engine and Data/Query Manager, respectively.
5. Query Generation and Correctness Overview
Hypergraphs play a critical role in our approach to integration. A hypergraph is a
couple H = (N,E), where N is the set of vertices and E is the set of hyperedges, which are
nonempty subsets of N. There is a natural correspondence between database schemes
and hypergraphs. Consider the set of relation schemes R = {R1, R2, …, Rn}. We can
define the set of attributes of R as being attr(R) = R1 ∪ R2 ∪ … ∪ Rn. The hypergraph HR =
(attr(R), R) can be seen to be a hypergraph representation of the set of relations.
Typically, the hypergraph has been used to represent the scheme of a single database, but
there is no reason that we cannot use the more general interpretation of having it
represent the scheme of the relations and pseudo relations that define the data in the
integrated environment.
Let L = {L1, L2, …, Lm} be the set of pseudo relations that are defined for the
record-based legacy systems as described in Section 4.1. Let R = {R1, R2, …, Rn} be the
set of relation schemes associated with the relational databases that exist within the
integrated environment. If RENAME() is the process described in Section 4.2, then S =
RENAME(L) ∪ RENAME(R) can be perceived as the relation set for the integrated
environment.

Figure 5.1 Relationship between the hypergraph, complete intersection graph, and the ABFS tree for a query requiring the attributes A, B, and F. [Panels: a) hypergraph with edges ABC, BF, CDE, AEF; b) complete intersection graph; c) adjusted BFS tree with root ABC, path labels {A,B} and {A,B,F}, ASet = {CDE}.]

We can then look at HI = (attr(S), S) as a hypergraph representation of the
integrated environment. The hypergraph HI defines a set of one or more connected
subhypergraphs. The precise number of connected subhypergraphs is dependent on the
connectivity of the relations and pseudo relations in the integrated environment. Each
connected subhypergraph, say Hu = (attr(U),U) where U is a subset of S and
attr(U) ∩ attr(S−U) = ∅, provides the basis of one universal relation.
Looking at the elements of S = {S1, S2, …, Sm+n}, where Si = RENAME(Li),
1 ≤ i ≤ m, and Sj+m = RENAME(Rj), 1 ≤ j ≤ n, we assume that the Sk, 1 ≤ k ≤ m+n, define
meaningful groupings of attributes within the integrated environment. Using the results
of [7], we then have the join dependency *[S] defined over the integrated environment.
The importance of this is that we can apply the strategy used in our earlier work on
universal relations [16,17] to check the correctness of any queries generated in the
integrated environment.
To translate a universal relation query to an integration query, we must first map
the request to the target data space (the hypergraph representing the collection of
connected operational databases). The target query hypergraph then needs to be
mapped to an SQL query. To create the mapping, we convert the underlying hypergraph
into a set of Adjusted Breadth First Search (ABFS) trees [17]. An ABFS tree is created
by applying a variation of the breadth first search to the complete intersection graph
(CIG) defined by the underlying hypergraph model. An ABFS tree is created for each
node (relation) in the CIG that contains attributes required in the SQL query. Each path
from the root to a leaf of the ABFS tree defines a set of relations that can be joined.
From this set of paths, we choose a subset that covers the attributes required in the query.
The ABFS tree that requires joining the fewest relations is chosen to create the relation
list in the new SQL query. Figure 5.1 illustrates a simple example of this process. The
complete details of mapping the request to an SQL query are given in [17, Appendix A].
To ensure the correctness of the integration query, we need to have the join
sequence define a lossless join. Using the result from [7], the join dependency *[U] is
defined over the relations and pseudo relations that make up the universal relation used in
the universal relation query that is being translated. The importance of this is that an FD-hinge
of Hu defines a set of edges whose corresponding relations have a lossless join
[16]. The test for correctness starts by testing whether the edges that correspond to the join
sequence define an FD-hinge in Hu. Failing that, the set of edges is expanded to form
an FD-hinge.
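For illustration, the lossless-join condition itself can be verified with the classical chase test of Aho, Beeri, and Ullman, which underlies the result of [7]. The sketch below implements the generic tableau chase, not the FD-hinge procedure of [16]; the functional dependencies and schemes are hypothetical examples.

```python
def lossless_join(attrs, schemes, fds):
    """Chase test: the decomposition `schemes` of `attrs` has a lossless join
    under the fds iff chasing the tableau yields a row that is all
    distinguished symbols.  `fds` is a list of (lhs_set, rhs_set) pairs."""
    attrs = list(attrs)
    # Tableau: 'a' = distinguished symbol, ('b', i, A) = unique nondistinguished.
    rows = [{A: ("a" if A in R else ("b", i, A)) for A in attrs}
            for i, R in enumerate(schemes)]
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            for r1 in rows:
                for r2 in rows:
                    if r1 is r2:
                        continue
                    if all(r1[A] == r2[A] for A in lhs):
                        for A in rhs:
                            if r1[A] != r2[A]:
                                # Equate symbols, preferring the distinguished one.
                                v = "a" if "a" in (r1[A], r2[A]) else min(r1[A], r2[A])
                                old = {r1[A], r2[A]} - {v}
                                for r in rows:
                                    if r[A] in old:
                                        r[A] = v
                                changed = True
    return any(all(r[A] == "a" for A in attrs) for r in rows)

# {AB, BC} decomposing ABC is lossless under B -> C, but not without it.
print(lossless_join("ABC", [{"A", "B"}, {"B", "C"}], [({"B"}, {"C"})]))  # True
print(lossless_join("ABC", [{"A", "B"}, {"B", "C"}], []))                # False
```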
6. Data/Query Manager Overview
The first task of the Data/Query Manager is to partition the integration query
generated by the Query Engine into subqueries with respect to the location of
the relations/pseudo relations involved in the query. Once the integration query has
been partitioned, the resulting subqueries are sent to the appropriate data sources.
Example 1 provides a simple example of the partition process.
Example 1: Example of query partition using SQL syntax.

Data layout:
  Site 1 tables: R1(A,B,C), R2(C,D,E)
  Site 2 tables: R3(E,F,G)

Universal Relation Query:
  Select G, B Where F = 10

Integration Query:
  Select G, B From R1, R2, R3 Where R1.C = R2.C and R2.E = R3.E and F = 10

Partition results:
  Query for Site 1 (Q1): Select B, E From R1, R2 Where R1.C = R2.C
  Query for Site 2 (Q2): Select E, G From R3 Where F = 10

Request Framework Query:
  Select G, B From Q1, Q2 Where Q1.E = Q2.E
The Data/Query Manager retains the Request Framework Query so that it can
combine the results when two or more subqueries are needed. Assuming Id1 is the
request identifier for the universal relation query, Site1 is the site location, and Q1 & Q2
are the subquery identifiers for the two subqueries in Example 1, then Example 2
illustrates the strings used by the Data/Query Manager to represent the subqueries and the
Request Framework Query.
Example 2: The query strings for the result given in Example 1:
SubQuery Queue:
“Select B, E From R1, R2 Where R1.C = R2.C”:<Id1,Q1,Site1>
“Select E, G From R3 Where F = 10”:<Id1,Q2,Site2>
Request Framework Query Queue:
”Select G, B From Q1, Q2 Where Q1.E = Q2.E”:<Id1>
The results of the subqueries are placed in a temporary database at the site of the
Data/Query Manager. When results from all of the subqueries have returned and are
stored in the local database, the Request Framework Query is used to combine the
intermediate results before returning the final result relation to the front end system.
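The partition-and-combine flow of Example 1 can be sketched with SQLite standing in for the two data sources and for the temporary database at the Data/Query Manager's site. The table contents are hypothetical; the subqueries and Request Framework Query are those of Example 1.

```python
import sqlite3

# Two "sites" as separate in-memory databases.
site1 = sqlite3.connect(":memory:")
site1.executescript("""
    CREATE TABLE R1(A, B, C);  CREATE TABLE R2(C, D, E);
    INSERT INTO R1 VALUES (1, 'b1', 10), (2, 'b2', 20);
    INSERT INTO R2 VALUES (10, 'd1', 100), (20, 'd2', 200);
""")
site2 = sqlite3.connect(":memory:")
site2.executescript("""
    CREATE TABLE R3(E, F, G);
    INSERT INTO R3 VALUES (100, 10, 'g1'), (200, 99, 'g2');
""")

# Run the subqueries Q1 and Q2 at their sites.
q1 = site1.execute("SELECT B, E FROM R1, R2 WHERE R1.C = R2.C").fetchall()
q2 = site2.execute("SELECT E, G FROM R3 WHERE F = 10").fetchall()

# Temporary database holding the intermediate results; the Request
# Framework Query combines them into the final relation.
tmp = sqlite3.connect(":memory:")
tmp.execute("CREATE TABLE Q1(B, E)")
tmp.execute("CREATE TABLE Q2(E, G)")
tmp.executemany("INSERT INTO Q1 VALUES (?, ?)", q1)
tmp.executemany("INSERT INTO Q2 VALUES (?, ?)", q2)
result = tmp.execute("SELECT G, B FROM Q1, Q2 WHERE Q1.E = Q2.E").fetchall()
print(result)  # [('g1', 'b1')]
```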
7. Prototype
A prototype was implemented to test the feasibility of our approach. The
prototype was implemented in Java, developed on the Red Hat Linux platform, and
tested on the Windows platform. Figure 7.1 illustrates a block diagram of the prototype.
It is made up of four primary components: the User Interface, the Ontology Search
Manager, the Query Engine, and the Data/Query Manager. The functionality for the
Ontology Mediation Manager has been incorporated into the User Interface in the current
version of the prototype. The User Interface allows a user to enter a set of domain search
terms and a condition. The beginning screen with an example in progress is shown in
Figure 7.2.
Figure 7.1 Block diagram of the prototype. [Components: User Interface, Ontology Search Manager, Query Engine, Data/Query Manager.]
When the user is satisfied with what has been entered, he/she clicks on the Start Request
button. The Ontology Search System performs the search described in Section 3.3. The
ontology is searched for the domain terms provided by the user. If all of the domain
search terms are found in the ontology, the database information found through the SSM
fragments is returned to the user interface module. The user is notified of a successful
ontology search with the screen shown in Figure 7.3. The user has the option to see the
SQL query that has been constructed, see the results of the query on screen, or restart the
query process. Note that the motivation for the prototype has been to test our underlying
systems and not to develop a full featured user interface.
Figure 7.2 The initial screen of the prototype with an example. [The "Ontology Aided Search Environment" screen shows the domain search terms "name, department, sales", the condition "location = 'US'", and the Start Request and Help buttons.]
The discussion above assumes that the domain terms that the user entered were in
the ontology used by the system. When the ontology search doesn’t find all of the
domain search terms, the system creates a screen showing the terms that cannot be
found. Two conditions exist: either the ontology search found a term that appears to
be close, or no such term can be found. In the first case the system returns the fragment of
the ontology that it thinks may be relevant. The user can choose one of the terms shown
in the ontology fragment or enter another term. Figure 7.4 shows an example of the case
where a fragment of the ontology is presented to the user. The screen illustrates how the
system prototype engages the user to help out the ontology search. The example shows
three is-a relationships with "country" being the likely choice for the user. The
"geographic feature" node represents an is-closely-related relationship. Note that neither
the type of the arc nor the weights are shown at this point. We are hoping to get the
user's interpretation without biasing the user's choice.
Figure 7.3 Successful ontology search screen. [The screen reports "Ontology Search Successful", repeats the domain search terms ("name, department, sales") and the current condition ("location = 'US'"), and offers the View Query, Results on Screen, Restart Request, and Help buttons.]
Figure 7.4 Screen showing user/ontology interaction. [The screen reports "User help is required to complete the search!", names the domain search term in question ("location?"), shows an ontology fragment with the related terms country, state, city, geographic features, and geographic location, and asks the user to enter the best choice or a new search term.]
In the case that no terms are considered close to the domain search term, the user
is asked to enter a new domain search term with the same meaning.
When the ontology search (with the user's assistance) resolves the search terms to
database terms, the information is passed to the Query Generation System, where an SQL
query is generated. The Query Generation System tests for the correctness of the
generated query [17]. If the Query Generation System was called through the View Query
button, the SQL query is shown. Again, since we are in test mode, we have chosen to
show the full SQL query with the tables or pseudo tables from the distributed data
sources as though they are in the same database. In a commercial package, more options
of how to show the query would have to be considered.
When the user clicks on the Results on Screen button, the query information is
passed to the Data Integration System. There the query is partitioned into queries for the
individual relational databases or legacy systems and the subqueries are sent to the
individual data source sites. The results of the individual queries are sent back to the
Data Integration System, and the query results for the user’s query are prepared using the
original query. The sub-queries are used in the prototype to define and spawn a set of
mobile agents. The agents are sent to the sites that contain the relevant data. Each agent
carries one of the SQL queries. The data returned by the agents is combined to produce
the required result. The result of the request is then returned to the user and displayed on
the screen.
The choice of mobile agents is not critical to the model, but rather represents a
method for quickly generating the necessary infrastructure. Client-Server models using
SOAP, CORBA or Java JDBC connections could also be used. We have used all four
types of connections in related projects.
8. Conclusions
A model has been given for using domain specific ontologies, converting them to database
specific ontologies to aid in the interpretation of a user's query. The
model allows users to define both domain specific search terms and domain specific
functions to operate on the results of the query. The model was built on an integrated
database/legacy system environment. Our data integration scheme provides a universal
relation view of the distributed data sources. A prototype to test the feasibility of the
ontology and data integration model has been designed and implemented. The prototype
takes the user input and generates SQL queries for the relational databases/legacy
systems over which the ontology search operates.
9. References
1. Bright, M.W., A.R. Hurson and S.H. Pakzad. A taxonomy and current issues in multidatabase systems. IEEE Computer, Vol. 25, No. 3, pages 50-60.
2. Bright, M.W. and A. Hurson, “Summary Schemas in multidatabase systems”, Computer Engineering Technical Report at PennState, 1990.
3. Bright, M.W., A. Hurson, S. Pakzad, and H. Sarma, “The Summary Schemas Model – An approach for handling Multidatabases: Concept and Performance Analysis”, Multidatabase Systems: An Advanced Solution for Global Information Sharing, pp. 199, 1994.
4. Bright, M.W. and A. Hurson, “Automated Resolution of Semantic Heterogeneity in Multidatabases”, ACM Transactions on Database Systems, pp. 213, 19(2), 1994.
5. Chu, W.W., H. Yang, K. Chiang, M. Minock, G. Chow and C. Larson. CoBase: A Scalable and Extensible Cooperative Information System, Journal of Intelligent Information Systems, Vol. 6, No. 2/3, 1996, pp. 223.
6. Corazzon, Raul, ed. “Descriptive and Formal Ontology”, http://www.formalontology.it.
7. Fagin, R., A.O. Mendelzon, and J.D. Ullman. 1982. A simplified universal relation assumption and its properties. ACM Transactions on Database Systems, Vol. 7, pages 343-360.
[6] Gardarin, Georges, Antoine Mensch, Anthony Tomasic: An Introduction to the e-XML Data Integration Suite. EDBT 2002: 297-306.
[7] Gardarin, Georges, Fei Sha, Tuyet-Tram Dang-Ngoc: XML-based Components for Federating Multiple Heterogeneous Data Sources. ER 1999: 506-519.
[8] Gardarin, Georges, Antoine Mensch, Tuyet-Tram Dang-Ngoc, L. Smit: Integrating Heterogeneous Data Sources with XML and XQuery. DEXA Workshops 2002: 839-846.
8. Gruber, T. "A translation approach to portable ontologies," Knowledge Acquisition, pp. 199-220, 5(2), 1993.
9. Gruber, T. “Toward Principles for the Design of Ontologies Used for Knowledge Sharing”, ed. N. Guarino. International Workshop on Formal Ontology, Padova, Italy, 1993.
10. Guarino, N., “Formal Ontology, Conceptual Analysis and Knowledge Representation”. International Journal of Human and Computer Studies, special issue on The Role of Formal Ontology in the Information Technology, edited by N. Guarino and R. Poli, Vol. 43, No. 5/6, 1995.
11. Guarino, N. and C. Welty, "Ontological Analysis of Taxonomic Relationships", in A. Laender, V. Storey, eds., Proceedings of ER-2000: The 19th International Conference on Conceptual Modeling, October 2000.
12. Hurson, A., M. Bright, S. Pakzad (eds.): Multidatabase Systems – An Advanced Solution for Global Information Sharing. IEEE Computer Society Press, 1994.
13. Karp, Peter D., Vinay K. Chaudhri and Jerome Thomere, “XOL: An XML-Based Ontology Exchange Language”, http://www.oasis-open.org/cover/xol-03.html.
[12] Lehti, Patrick, Peter Fankhauser: XML Data Integration with OWL: Experiences and Challenges. SAINT 2004: 160-170.
14. Lenat, D.B. “Welcome to the Upper Cyc Ontology”, http://www.cyc.com/overview.html, 1996.
15. Ludäscher, B., Y. Papakonstantinou, P. Velikhov. A Framework for Navigation-Driven Lazy Mediators. ACM Workshop on the Web and Databases, 1999.
16. Miller, L.L. 1992. Generating hinges from arbitrary subhypergraphs. Information Processing Letters, Vol. 41, No. 6, pages 307-312.
17. Miller, L.L. and M.M. Owrang. 1997. A dynamic approach for finding the join sequence in a universal relation interface. Journal of Integrated Computer-Aided Engineering, No. 4, pages 310-318.
18. Neches, R., R.E. Fikes, T. Finin, T.R. Gruber, T. Senator, W.R. Swartout, "Enabling technology for knowledge sharing", AI Magazine, pp. 36-56, 12(3), 1991.
19. Schulze-Kremer, S. "Adding Semantics to Genome Databases: Towards an Ontology for Molecular Biology", Proceedings of The Fifth International Conference on Intelligent Systems for Molecular Biology, T. Gaasterland et al. (eds.), Halkidiki, Greece, June 1997.
20. Slattery, N.J., “A Study of Ontology and Its Uses in Information Technology Systems”, http://www.mitre.org/support/papers/swee/papers/slattery/.
21. Sowa, John. Knowledge Representation: Logical, Philosophical, and Computational Foundations. Brooks/Cole, Pacific Grove, CA, 2000.
22. Su, S.Y.W., H.L. Yu, J.A. Arroyo-Figueroa, Z. Yang and S. Lee. NCL: A Common Language for Achieving Rule-Based Interoperability Among Heterogeneous Systems, Journal of Intelligent Information Systems, Vol. 6, No. 2/3, 1996, pp. 171-198.
23. Subrahmanian, V.S., Sibel Adali, Anne Brink, Ross Emery, James J. Lu, Adil Rajput, Timothy J. Rogers, Robert Ross, Charles Ward. HERMES: Heterogeneous Reasoning and Mediator System, http://www.cs.umd.edu/projects/hermes/publications/abstracts/hermes.html.
24. Swartout, W.R., P. Patil, K. Knight, and T. Russ, “Toward Distributed Use of Large-Scale Ontologies”, in Proceedings of the 10th Knowledge Acquisition for Knowledge-Based Systems Workshop, Banff, Canada, 1996.
[23] Tukwila Data Integration System. University of Washington. http://data.cs.washington.edu/integration/tukwila. Accessed 10/5/2004.
25. van Heijst, G., A. Schreiber, B. Wielinga. "Using explicit ontologies in KBS development", International Journal of Human-Computer Studies, pp. 183-292, Vol. 46, No. 2/3, Feb. 1997.
26. Wiederhold, G. Mediators in the Architecture of Future Information Systems, IEEE Computer, Vol. 25, No. 3, 1992, pp. 38-49.
27. Wiederhold, G. and M. Genesereth. The Conceptual Basis for Mediation Services, IEEE Expert, Vol. 12, No. 5, 1997, pp. 38-47.
[30] Zamboulis, L., XML Data Integration By Graph Restructuring, Proc. BNCOD21, Edinburgh, July 2004. Springer-Verlag, LNCS 3112, pp. 57-71.
Appendix A. Query Generation

To create a query, we must translate the request to the target data space (the
hypergraph representing the collection of connected operational databases). The
target query hypergraph is then mapped to an SQL query. To look at this process in more
detail, we consider the basic data structures and algorithms. We start by looking at the
notion of a complete intersection graph.
Complete Intersection Graph (CIG)
Let H = (U, R) be a hypergraph where U = {A1, A2, …, An} is a set of
attributes and R = {R1, R2, …, Rp} is a set of relation schemes over U. The complete
intersection graph (CIG) [17] is an undirected graph (R, E) where E = {(Ri, Rj) : Ri ∩
Rj ≠ ∅, Ri ∈ R, Rj ∈ R, i ≠ j}. Note that the edge (Ri, Rj) between vertices (or
nodes) Ri and Rj exists if and only if Ri and Rj have at least one attribute in common.
The edge (Ri, Rj) will be labeled with Rij where Rij = Ri ∩ Rj. An example of a hypergraph
and its complete intersection graph is shown in Figure A.1.
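The CIG construction is mechanical. A short Python sketch over the hypergraph of Figure A.1 (the scheme names are simply the attribute strings used in the figure; this is an illustration, not prototype code):

```python
from itertools import combinations

def complete_intersection_graph(schemes):
    """Build the CIG of a hypergraph: nodes are relation schemes, and an edge
    (Ri, Rj) labeled with Ri & Rj exists iff the schemes share an attribute."""
    edges = {}
    for (ni, Ri), (nj, Rj) in combinations(schemes.items(), 2):
        shared = Ri & Rj
        if shared:
            edges[(ni, nj)] = shared
    return edges

# The hypergraph of Figure A.1: hyperedges ABC, BF, CDE, AEF.
H = {"ABC": {"A", "B", "C"}, "BF": {"B", "F"},
     "CDE": {"C", "D", "E"}, "AEF": {"A", "E", "F"}}
cig = complete_intersection_graph(H)
for edge, label in sorted(cig.items()):
    print(edge, label)  # five edges; BF and CDE share nothing, so no edge
```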
Adjusted Breadth First Search (ABFS)
The adjusted breadth first search (ABFS) [17] is a variation of the breadth first
search (BFS) to determine the join sequence for a target hypergraph. ABFS supplements
BFS by including a path label for each node and an adjustment set in the search tree so
that the search is more efficient. The resulting search tree is called an ABFS tree [17].
The node from which the search is started is called the root of the ABFS tree. A sample
ABFS tree is shown in Figure A.2.
The path label [17] for an ABFS tree node is the union of all query attributes on
this ABFS tree node and its ancestors on the search path. So the path label of an ABFS
tree node should be a superset of its parent’s path label.

Figure A.1 A hypergraph and its complete intersection graph (CIG). [Panels: a) hypergraph with edges ABC, BF, CDE, AEF; b) complete intersection graph.]

In the process of creating an
ABFS tree, the path labels will be used to prune or delay the expansion of subtrees where
the unused nodes that are adjacent to the current endpoint of the search path do not
contribute any new query attributes to the path label. Any nodes falling into this class
will be stored in the adjustment set [17] (denoted by ASet) with a pointer to the position
where they could be added to the ABFS tree during further search or expansion. The
relevant CIG can be applied to determine which nodes are adjacent to the current
endpoint of the search path.
The expansion of the ABFS tree will continue until the union of the path labels of
all the leaves in the current ABFS tree contains all the query attributes. If the ABFS tree
cannot be expanded and the union of the path labels of all the leaves in the current ABFS
tree does not contain all the query attributes, then a node can be taken from the adjustment
set and the process can be restarted from the position pointed to by this node. Note that this
process of creating an ABFS tree should terminate successfully in a finite number of steps
since all the query attributes are in the hypergraph and can be reached eventually.
In addition, using the above approach, many different ABFS trees with the same
root may be generated, because the order of search is not unique. Also, there is
more than one way (such as FIFO, LIFO, or random) to select nodes from the
adjustment set.
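The expansion just described can be sketched as follows. This is a simplified illustration, not the full algorithm of [17]: the neighbor ordering below is one arbitrary choice among the many possible, picked so that the result matches the tree of Figure A.2, and the sketch omits restarting from the adjustment set.

```python
from collections import deque

def abfs(root, schemes, cig_adj, query_attrs):
    """Sketch of the adjusted breadth-first search: expand a node's neighbors
    only when they contribute new query attributes to the path label; park the
    rest in the adjustment set (ASet).  Returns (parent map, path labels, ASet)."""
    query_attrs = set(query_attrs)
    parent = {root: None}
    label = {root: schemes[root] & query_attrs}   # path label of the root
    aset = set()
    covered = set(label[root])
    frontier = deque([root])
    while frontier and covered != query_attrs:
        node = frontier.popleft()
        for nbr in cig_adj.get(node, ()):
            if nbr in parent or nbr in aset:
                continue
            contrib = (schemes[nbr] & query_attrs) - label[node]
            if contrib:
                parent[nbr] = node
                label[nbr] = label[node] | contrib
                covered |= contrib
                frontier.append(nbr)
                if covered == query_attrs:
                    break
            else:
                aset.add(nbr)   # could be revisited if coverage stalls
    return parent, label, aset

# Figure A.2 instance: root ABC, query attributes {A, B, F}.
schemes = {"ABC": {"A", "B", "C"}, "BF": {"B", "F"},
           "CDE": {"C", "D", "E"}, "AEF": {"A", "E", "F"}}
adj = {"ABC": ["CDE", "BF", "AEF"], "BF": ["ABC", "AEF"],
       "CDE": ["ABC", "AEF"], "AEF": ["ABC", "BF", "CDE"]}
parent, labels, aset = abfs("ABC", schemes, adj, {"A", "B", "F"})
print(labels, aset)  # root ABC -> child BF; CDE parked in ASet
```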
Join Sequence
Finding an optimal join sequence for the selected query attributes (including the
attributes appearing in the query condition) is a crucial part of the model design
and implementation. Once the ABFS tree with a given root is created, we can
determine the join sequence defined by this tree. The approach is to select a set
of paths connected to the root such that the union of the path labels contains all of
the desired query attributes.

Figure A.2 Adjusted BFS tree using the CIG of Figure A.1 with root ABC and the query attributes A, B, and F. [Tree: root ABC with path label {A,B}; child BF with path label {A,B,F}; ASet = {CDE}.]

We use the following procedure to select the
appropriate paths [17]:
<Step 1.0> Set W := the set of query attributes. Go to <Step 1.1>.

<Step 1.1> Mark every leaf and its ancestors if its path label has a query attribute that appears only once in the path labels of all leaves in the ABFS tree. Remove the query attributes included in the path labels of the marked nodes from W. If W is empty, stop; otherwise, go to <Step 1.2>.

<Step 1.2> If there is a contributing query attribute in more than one path label of the unmarked leaves with the same parent, then mark one (and only one) of those leaves and its ancestors. Remove the query attributes in the path labels of the marked nodes from W. If W is empty, stop; otherwise, go to <Step 1.3>. (By a contributing query attribute we mean a query attribute that occurs in the path label of a leaf but does not occur in its parent’s path label.)

<Step 1.3> If there is a leaf which contains a remaining query attribute in W with the lowest frequency, then mark this leaf and its ancestors. In case of a tie, choose the leaf with the shortest path and mark the nodes on this path.

It is worth noting that the approach described in the previous subsection does not guarantee to create the optimal ABFS tree with a given root, since the order of search in that approach is not necessarily optimal. The so-called optimal ABFS tree with a given root is actually the one with the minimum weight over all possible ABFS trees for this root. By the weight of an ABFS tree we mean the length of the join sequence defined by the ABFS tree. The creation of a non-optimal ABFS tree with some given root does not cause serious problems. On one hand, our goal is to generate an optimal join sequence, which is the one with the minimum weight over the optimal ABFS trees for all roots, and the probability of creating non-optimal ABFS trees for all roots is very low. On the other hand, one can remove a redundant join from the resulting join sequence at a later stage.
Another point worth noting is that we do not have to generate ABFS trees for all roots. We need only generate ABFS trees for the so-called legal roots. A root is called illegal if it does not contain any query attribute or its set of query attributes is properly contained in the set of query attributes of an adjacent node. The algorithm to find a join sequence for a given hypergraph and a set of query attributes is summarized as follows:

<Step 2.0> Create the CIG for the hypergraph. Find the set LR of all legal roots in the CIG. Set minweight := the number of nodes in the CIG. Go to <Step 2.1>.

<Step 2.1> If LR is empty, then stop. Otherwise, choose a root r ∈ LR, set LR := LR - {r}, and go to <Step 2.2>.

<Step 2.2> Create an ABFS tree with root r. Find the weight and the corresponding join sequence for this tree. If the weight is smaller than minweight, then save this join sequence as the current best one, and replace minweight with the weight. Go to <Step 2.1>.
Appendix B. Query Correctness

Our correctness process has been built on the issue of testing for fd-hinges [16].
To come to an understanding of what this includes, it is necessary to briefly look at the
underlying concepts of hinges and fd-hinges.
A hypergraph H is reduced if no hyperedge of H is properly contained in another
hyperedge of H. H is connected if every pair of its hyperedges is connected by some
path of hyperedges. If H is a reduced connected hypergraph with the vertex set N and
edge set E, then E’ is a complete subset of E if and only if E’ ⊆ E and, for each Ei in E, if
Ei ⊆ attr(E’), then Ei belongs to E’. E’ is said to be a trivial subset of E if |E’| ≤ 1 or E
= E’.
Let E’ be a complete subset of E and E1, E2 ∈ E − E’.
Then we say E1 and E2 are connected with respect to E’ if and only if they have
common vertices not belonging to E’.
Let E’ be a nontrivial complete subset of E and C1, C2, …, Cp be the connected
components of E − E’ with respect to E’. Then E’ has the bridge property if and only if
for every i = 1, 2, …, p there exists Ei ∈ E’ such that attr(E’) ∩ Ni ⊆ Ei, where Ni =
attr(Ci). Ei is called a separating edge of E’ corresponding to Ci. A nontrivial complete
subset E’ of E with the bridge property is called a hinge of H. An example of a hinge is
shown in Figure B.1. Note that {E2, E3, E4} is not a hinge.
Let F be a set of functional dependencies (fds). Let TE’ be the tableau defined
over the attributes in E for the schemes represented by the edges in E’ ⊆ E;
chaseF(TE’) is the result of using the fds in F to chase the tableau TE’. Now let E* be the
Figure B.1. Hinge example with E1 as the separating edge and {E1,E2,E3,E4} as a Hinge.
set defined by chaseF(TE’) such that E* = {Si | if wi(A) is a distinguished variable and wi ∈
chaseF(TE’), then A ∈ Si}. In other words, each element in E* corresponds to a row in the
tableau chaseF(TE’) and consists of the attributes that have distinguished values in the
row. Note that by the definition of the chase algorithm each element of E* is a superset
of the corresponding element in E’ that was used to initially define the row in the
tableau. Construct the hypergraph HE*,F = (attr(E), (E − E’) ∪ E*). Then E’ is an F-fd-hinge
of a hypergraph H when E* is a hinge of HE*,F.
In [16] we showed that an F-fd-hinge is equivalent to an embedded join dependency.
In other words, any time that a set of edges defines an F-fd-hinge, the set of relation schemes that
correspond to the edges defines a lossless join. As a result, our test for query correctness comes
down to testing the set to determine if it defines an F-fd-hinge.