

IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 6, NO. 2, JUNE 2002

Architecture of a Mediator for a Bioinformatics Database Federation

Graham J. L. Kemp, Nicos Angelopoulos, and Peter M. D. Gray

Abstract—Developments in our ability to integrate and analyze data held in existing heterogeneous data resources can lead to an increase in our understanding of biological function at all levels. However, supporting ad hoc queries across multiple data resources and correlating data retrieved from these is still difficult. To address this, we are building a mediator based on the functional data model database, P/FDM, which integrates access to heterogeneous distributed biological databases. Our architecture makes use of the existing search capabilities and indexes of the underlying databases, without infringing on their autonomy. Central to our design philosophy is the use of schemas. We have adopted a federated architecture with a five-level schema, arising from the use of the ANSI-SPARC three-level schema to describe both the existing autonomous data resources and the mediator itself. We describe the use of mapping functions and list comprehensions in query splitting, producing execution plans, code generation, and result fusion. We give an example of cross-database querying involving data held locally in P/FDM systems and external data in SRS.

Index Terms—Bioinformatics, functional data model, mediator, multidatabase, Prolog, query optimization.

I. INTRODUCTION

THE internet is an increasingly important research tool for scientists working in biotechnology and the biological sciences, and many collections of biological data can be accessed via the World Wide Web. Some on-line data resources provide search facilities to enable scientists to find items of interest in a particular database more easily. However, working interactively with an internet browser is extremely limited when one wants to ask complex questions involving related data held at different locations and in different formats, since one must formulate a series of data access requests, run these against the various databanks and databases, and then combine the results retrieved from the different sources. This is both awkward and time-consuming for the user.

To streamline this process, we are developing a federated architecture and mediator to integrate access to heterogeneous, distributed biological databases. The suitability of a federated multidatabase approach for integrating biological databases is advocated by Robbins [17] and also proposed by Karp [10]. The spectrum of choices for data integration is summarized in Fig. 1.

Manuscript received December 4, 2000; revised December 6, 2001. This work was supported by the BBSRC/EPSRC Joint Programme in Bioinformatics under Grant 1/BIF06716. An earlier version of this paper appeared in the Proceedings of the 1st IEEE International Symposium on Bio-Informatics and Biomedical Engineering.

The authors are with the Department of Computing Science, University of Aberdeen, King’s College, Aberdeen, Scotland, AB24 3UE, U.K.

Publisher Item Identifier S 1089-7771(02)04903-8.

Fig. 1. Continuum from tightly coupled to loosely coupled systems, involving multiple databases (from [17]).

In order to get the desired federated information infrastructure, we believe, with Robbins, that we do not require the adoption of a common hardware platform or vendor DBMS, but we do need a “shared data model across participating sites.” Our approach does not require that the participating sites use the same data model. Rather, it is sufficient for the mediator to hold descriptions of the participating sites that are expressed in a common data model—in our system the functional data model [19] is used for this purpose.

Our work is based around the P/FDM object database system, which has been developed at Aberdeen for storing and integrating protein structure data. In this paper we discuss the different levels of schemas present in our federated architecture, and we relate these to the classic ANSI-SPARC three-layer model. We describe the implementation of our mediator software with particular emphasis on the use of list comprehensions in transforming queries against a conceptual schema into queries expressed against schemas in other layers of the system, with an example that accesses external bioinformatics databases. Finally, we discuss the role of mapping functions and wrappers in our architecture and compare these with similar software modules in other systems.

II. DATA RESOURCES IN THE FEDERATION

P/FDM [6] is an object database management system that is based on a semantic data model: the functional data model (FDM) [19]. The basic concepts in the P/FDM database are entities and functions. Entities are used to represent conceptual objects, while functions represent the properties of an object. Functions are used to model both scalar attributes and relationships. Functions may be single-valued or multivalued, and their values can either be stored or computed on demand. Entity classes can be arranged in subtype hierarchies, with subclasses inheriting the properties of their superclass, as well as having their own specialized properties.

P/FDM is being used at Aberdeen to store data on special classes of proteins, but it is increasingly used for data integration. The ad hoc query language used in P/FDM is an implementation of Daplex. Daplex queries are normally processed in P/FDM by first being compiled into an internal intermediate code (“ICode”) [5] which is a Prolog term structure that resembles a list comprehension. The ICode form of a query is then optimized, and a code generator translates the ICode into Prolog code for execution. As we shall see in Section IV, ICode plays an important role in processing multidatabase queries within the mediator. The P/FDM mediator, which is described in Section IV, is implemented as an extension of the P/FDM database management system.
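The Daplex text of a query and its ICode are not reproduced in this excerpt. Purely as a rough Prolog sketch of the general idea (the term shapes generate/2, filter/1 and emit/1, the predicate names and the toy facts are our own assumptions, not P/FDM's actual ICode format), a comprehension-like query can be held as a term structure and interpreted against stored facts:

% Toy facts standing in for a stored database (hypothetical data).
protein(p1).  protein(p2).
protein_name(p1, lysozyme).  protein_name(p2, trypsin).

% eval(+Exprs, -Values): interpret a list of generator/filter/emit terms,
% in the spirit of a list comprehension.
eval([], []).
eval([generate(Class, X)|Rest], Out) :- Goal =.. [Class, X], call(Goal), eval(Rest, Out).
eval([filter(Goal)|Rest], Out)       :- call(Goal), eval(Rest, Out).
eval([emit(V)|Rest], [V|Out])        :- eval(Rest, Out).

% "Names of proteins other than trypsin":
% ?- findall(Vs, eval([generate(protein, P),
%                      filter(protein_name(P, N)),
%                      filter(N \== trypsin),
%                      emit(N)], Vs), Results).
% Results = [[lysozyme]].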

The sequence retrieval system (SRS) [4] maintains indexes relating entries in one flat file databank (e.g., the SWISS-PROT protein sequence databank, or the BRENDA enzyme databank) to entries in others. Thus, SRS maintains a large network of related databank entries. SRS provides search tools that enable users to retrieve particular entries or to find the accession codes of entries that satisfy specified criteria, e.g., entries with fields that match a given pattern string, or fields with numeric values in a given range. SRS also provides “link” operators that enable the user to follow cross references and find all entries in one databank with corresponding entries in another. These can even follow a chain of cross references where there is no “direct” link between the two. The SRS system has its own query language that can be accessed via an HTTP server.
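SRS's own query syntax is not shown here. Purely as an illustration of what the “link” operators achieve conceptually, the Prolog sketch below follows chains of cross references between databank entries; the xref/4 facts are hypothetical stand-ins for SRS's indexes rather than real SRS data structures.

% Hypothetical cross-reference facts: xref(FromDatabank, FromEntry, ToDatabank, ToEntry).
xref(pdb, '1LYZ', swissprot, 'P00698').
xref(swissprot, 'P00698', brenda, 'EC 3.2.1.17').

% linked/4: two entries are linked if a chain of one or more cross
% references connects them (there need not be a "direct" link).
linked(Db1, E1, Db2, E2) :- xref(Db1, E1, Db2, E2).
linked(Db1, E1, Db2, E2) :- xref(Db1, E1, DbMid, EMid), linked(DbMid, EMid, Db2, E2).

% ?- linked(pdb, '1LYZ', brenda, Entry).
% Entry = 'EC 3.2.1.17'.   % reached via the intermediate SWISS-PROT entry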

III. SCHEMAS IN THE FEDERATION

The design philosophy of our architecture can be illustrated with reference to the three-level schema that was proposed by the ANSI-SPARC standards working party [1]. This promotes data independence by demanding that database systems be constructed so that they provide both logical and physical data independence. Logical data independence means that the conceptual data model must be able to evolve without changing external application programs. Only view definitions and mappings may need changing, for example to replace access to a stored field by access to a derived field calculated from others in the revised schema. Physical data independence allows us to refine the internal schema for improved performance, without needing to alter the way queries are formulated.

The clear separation between schemas at different levels helps us in building a database federation in a modular fashion. In our architecture we see the ANSI-SPARC three-level schema in two situations: in each of the individual data resources, and in the mediator itself.

First, let us consider an external data resource. The resource’s conceptual schema describes the logical structure of the data contained in that resource. If the resource is a relational database then this will include information about table names and column names, and type information about stored values. With SRS [4], it is the databank names and field names. These systems also provide a mechanism for querying the data resource in terms of the table/class/databank names and column/tag/attribute/field names that are presented in the conceptual schema.

Fig. 2. ANSI-SPARC schema architecture describing the mediator (left) and an external data resource (right).

The internal schema (or storage schema) contains details of the allocation of data records to storage areas, placement strategy, use of indexes, set ordering, and internal data structures that affect efficiency and implementation details [6]. In this paper, we do not concern ourselves with the internal schemas of individual data resources. The mapping from the conceptual schema to the internal schema has already been implemented by others within each of the individual resources that we use, and we assume that this has been done to make best use of the resources’ internal organization.

A resource’s external schema describes a view onto the data resource’s conceptual schema. At its simplest, the external schema could be identical to the conceptual schema. However, the ANSI-SPARC model allows for there to be differences between the schemas at these layers so that different users and application programmers can each be presented with a view that best suits their individual requirements and access privileges. Thus, there can be many external schemas, each providing users with a different view onto the resource’s conceptual schema. A resource’s external, conceptual, and internal schemas are represented on the right side of Fig. 2.

We can also use the ANSI-SPARC three-layer model in describing the mediator that is central to our database federation, and this is shown on the left-hand side of Fig. 2. The mediator’s conceptual schema describes the content of the data resources that are members of the federation, including the semantic relationships that hold between data items in these resources. We also refer to this as the federation’s integration schema. We have chosen to express this schema using the functional data model. As far as possible, the integration schema is designed based on the semantics of the domain, rather than on the actual partitioning and organization of data in the external resources. For example, different attributes of the same conceptual entity can be spread across different external data resources, and subclass–superclass relationships between entities in the conceptual model of the domain might not be present explicitly in the external resources [8].

Fig. 3. Schemas in a database federation.

We cannot expect scientists to agree on a single schema. Different scientists are interested in different aspects of the data and will want to see data structured in a way that matches the concepts, attributes, and relationships in their own personal model. This is made possible by following the ANSI-SPARC model; the principle of logical data independence means that the system can provide different users with different views onto the integration schema. In this paper, queries are expressed directly against the integration schema, but they could alternatively be expressed against an external schema presented to a user of the mediator. If that external schema is also an FDM schema, then an additional layer of mapping functions would be required to translate a query expressed against the user’s external schema into a query against the integration schema.

A vital task performed by the mediator is to map between the integration schema and the external schemas of the data resources. To facilitate this process, we introduce another schema layer which, in contrast to the integration schema, is based on the structure and content of the external data resources. This schema layer is internal to the mediator. The mediator needs to have a view onto each data resource that matches this internal layer; thus the mediator’s internal description of a resource and that resource’s external schema should be the same. In our system, we have chosen to use the functional data model to represent this internal layer. Having the same data model for this layer and for the integration schema brings advantages in processing multidatabase queries, as will be seen in Section IV.

Redrawing Fig. 2 for the situation where there are several different external schemas presented to users and several data resources, the relationship between schemas in the federation is as shown in Fig. 3. The five-level schema shown here is similar to that described by Sheth and Larson [18].

IV. MEDIATOR ARCHITECTURE

The role of the mediator is to process queries expressed against the federation’s integration schema. The mediator holds metadata describing the integration schema and also the external schemas of each of the federation’s data resources. In P/FDM, metadata takes the form of Prolog clauses that are compiled from high-level schema descriptions.
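The actual metadata predicates compiled by P/FDM are not given in this paper; the clauses below are only a hypothetical Prolog sketch of the kind of information the mediator needs: the classes, subclass links and functions of the integration schema, together with a note of which external resource holds which values.

% Hypothetical schema metadata held as Prolog facts (names invented).
class(protein).
class(enzyme).
subclass(enzyme, protein).
function(protein_name, protein, string).
function(ecnumber, enzyme, string).

% Which external resource holds which attribute of which class.
held_at(protein, protein_name, pfdm_db_i).
held_at(enzyme, ecnumber, pfdm_db_ii).

% A function declared on a superclass is inherited by its subclasses.
has_function(Class, F) :- function(F, Class, _).
has_function(Class, F) :- subclass(Class, Super), has_function(Super, F).

% ?- has_function(enzyme, protein_name).   % succeeds, inherited from protein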

The architecture of the P/FDM mediator is shown in Fig. 4. The main components of the mediator are described below.

Fig. 4. Mediator architecture. The components of the mediator are shown inside the dashed line.

1) Parser: This module reads a Daplex query (Daplex is the query language for the FDM), checks it for consistency against a schema (in this case the integration schema), and produces a list comprehension containing the essential elements of the query in a form that is easier to process than Daplex text (we call this internal form “ICode”).

2) Simplifier: The simplifier’s role is to produce shorter, more elegant, and more consistent ICode, mainly through removing redundant variables and expressions (e.g., if the ICode contains an expression equating two variables, then that expression can be eliminated, provided that all references to one variable are replaced by references to the other), and flattening out nested expressions where this does not change the meaning of the query. Significantly, simplifying the ICode form of a query makes the subsequent query processing steps more efficient by reducing the number of equivalent ICode combinations that need to be checked.
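As a concrete, but assumed, illustration of the variable-equality case, the Prolog sketch below removes eq/2 expressions from a list of ICode-like terms and substitutes one variable for the other simply by unifying them; the real simplifier works on P/FDM's own ICode representation.

:- use_module(library(lists)).   % for select/3

% simplify(+Exprs, -Simplified): drop each eq(X, Y) and merge the variables.
simplify(Exprs, Simplified) :-
    select(eq(X, Y), Exprs, Rest),   % pick out a variable equality
    X = Y,                           % references to Y now become references to X
    simplify(Rest, Simplified).
simplify(Exprs, Exprs).              % no (further) equalities to remove

% ?- simplify([generate(protein, P), eq(Q, P), filter(protein_name(Q, N))], S).
% S = [generate(protein, P), filter(protein_name(P, N))].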

3) Rewriter: The rule-based rewriter matches expressions in the query with patterns present on the left-hand side of declarative rewrite rules and replaces these with the right-hand side of the rewrite rule after making appropriate variable substitutions. Rewrite rules can be used to perform semantic query optimization. This capability is important since graphical interfaces make it easy for users to express inefficient queries which cannot always be optimized using general purpose query optimization strategies. This is because transforming the original query to a more efficient one may require domain knowledge, e.g., two or more alternative navigation paths may exist between distantly related object classes, but domain knowledge is needed to recognize that these are indeed equivalent.

Fig. 5. (a) Integration schema and (b) distributed databases (P/FDM and SRS).

A recent enhancement to the mediator is an extension to the Daplex compiler that enables generic rewrite rules to be expressed using a declarative high-level syntax [11]. This makes it easy to add new query optimization strategies to the mediator.
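The declarative rule syntax of [11] is not reproduced here. The Prolog sketch below only illustrates the mechanism described above, with rules held as rewrite(LHS, RHS) clauses whose left-hand side is matched against a contiguous run of query expressions and replaced by the right-hand side; the sample rule and the via/3 and direct/3 terms are invented for illustration.

:- use_module(library(lists)).   % for append/3

% Hypothetical domain rule: a two-step navigation path through an
% intermediate class is known to be equivalent to a direct relationship.
rewrite([via(X, complex, C), via(C, subunit, Y)], [direct(X, chain, Y)]).

% apply_rules(+Exprs, -Rewritten): apply rules until no rule matches.
apply_rules(Exprs, Rewritten) :-
    rewrite(LHS, RHS),
    append(Prefix, Tail, Exprs),
    append(LHS, Suffix, Tail),            % LHS matches a contiguous sublist
    append(Prefix, RHS, Front),
    append(Front, Suffix, Exprs1),
    apply_rules(Exprs1, Rewritten).
apply_rules(Exprs, Exprs).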

4) Optimizer: This module performs generic query optimization.

5) Reordering Module: The reordering module reorders expressions in the ICode to ensure that all variable dependencies are observed.
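A minimal Prolog sketch of dependency-respecting reordering, under the assumption that each expression can be summarized by the variables it needs and the variables it delivers (represented here as atoms); the actual module operates on ICode and is more involved.

:- use_module(library(lists)).   % for select/3, subtract/3, union/3

% reorder(+Exprs, -Ordered): find an order in which every expression's
% needed variables have been delivered by earlier expressions.
reorder(Exprs, Ordered) :- reorder(Exprs, [], Ordered).

reorder([], _, []).
reorder(Pending, Avail, [expr(Id, Needs, Gives)|Ordered]) :-
    select(expr(Id, Needs, Gives), Pending, Rest),
    subtract(Needs, Avail, []),           % all needed variables already available
    union(Avail, Gives, Avail1),
    reorder(Rest, Avail1, Ordered).

% ?- reorder([expr(print_name, [n], []),
%             expr(get_name, [p], [n]),
%             expr(gen_protein, [], [p])], O).
% O = [expr(gen_protein, [], [p]), expr(get_name, [p], [n]), expr(print_name, [n], [])].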

6) Constraint Compiler: This module reads declarative statements about the conditions that must hold between data items in different external data resources in order that these values can be mapped onto the integration schema.

7) ICode Rewriter: The original ICode is expanded in this step by applying mapping functions that transform references to the integration schema into references to the federation’s component databases. Essentially the same rewriter that was mentioned above is used here, but with a different set of rewrite rules. These rewrite rules enhance the ICode by adding tags to indicate the actual data sources that contain particular entity classes and attribute values. Thus, the ICode rewriter transforms the query expressed against the integration schema into a query expressed against the external schemas of one or more external databases.

8) Query Splitter: The mediator identifies which external databases hold data referred to by parts of an integrated query by inspecting the metadata, and adjacent query elements referring to the same database are grouped together into “chunks.” Query “chunks” are shuffled and variable dependencies are checked to produce alternative execution plans. A generic description of costs is used to select a good schedule/sequence of instructions for accessing the remote databases. The crucial idea is to move selective filter operations in the query down into the appropriate chunks so that they can be applied early and efficiently using local search facilities as registered with the mediator [12].
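The following Prolog sketch shows only the “chunking” step in isolation, assuming each ICode element has already been tagged with its data source by the ICode rewriter (the at/2 tagging and the goal names are invented); shuffling of chunks and cost-based plan selection are not shown.

% chunk(+TaggedExprs, -Chunks): group adjacent elements with the same source tag.
chunk([], []).
chunk([at(Src, G)|Rest], [chunk(Src, [G|Gs])|Chunks]) :-
    take_same(Src, Rest, Gs, Remaining),
    chunk(Remaining, Chunks).

take_same(Src, [at(Src, G)|Rest], [G|Gs], Remaining) :- !,
    take_same(Src, Rest, Gs, Remaining).
take_same(_, Rest, [], Rest).

% ?- chunk([at(pfdm_db_ii, gen_enzyme), at(pfdm_db_ii, filter_ec),
%           at(pfdm_db_i, get_name), at(srs, link_swissprot)], C).
% C = [chunk(pfdm_db_ii, [gen_enzyme, filter_ec]),
%      chunk(pfdm_db_i, [get_name]), chunk(srs, [link_swissprot])].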

9) Code Generators: Each ICode chunk is sent to one of several code generators. These translate ICode into queries that are executable by the remote databases, transforming query fragments into each resource’s own query language. New code generators can be linked into the mediator at runtime.

10) Wrappers: These deal with communication with the external data resources. They consist of two parts: code responsible for sending queries to remote resources and code that receives and parses the results returned from the remote resources. Wrappers for new resources can be linked into the mediator at runtime. Note that a wrapper can only make use of whatever querying facilities are provided by the federation’s component databases. Thus, the mediator’s conceptual model will only be able to map onto those data values that are identified in the remote resource’s conceptual model. Queries involving concepts like “gene” and “chromosome” in the integration schema can only be transformed into queries that run against a remote resource if that resource exports these concepts.

11) Result Fuser: The result fuser provides a synchronization layer, combining results retrieved from external databases so that the rest of the query can proceed smoothly. It interacts tightly with the wrappers.
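As an illustration of combining partial results, the Prolog sketch below joins tuples returned by two resources on the value they share for a key attribute; the attribute=value list representation and the field names are assumptions, and the real result fuser also handles synchronization with the wrappers.

:- use_module(library(lists)).   % for subtract/3, append/3

% fuse(+Results1, +Results2, +Key, -Fused): nested-loop join on Key.
fuse(Results1, Results2, Key, Fused) :-
    findall(Joined,
            ( member(R1, Results1),
              member(R2, Results2),
              memberchk(Key=V, R1),
              memberchk(Key=V, R2),           % same value for the join key
              subtract(R2, [Key=V], R2Rest),  % avoid repeating the key
              append(R1, R2Rest, Joined) ),
            Fused).

% ?- fuse([[code='1lyz', ec='3.2.1.17']],
%         [[code='1lyz', protein_name=lysozyme], [code='2ptn', protein_name=trypsin]],
%         code, F).
% F = [[code='1lyz', ec='3.2.1.17', protein_name=lysozyme]].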

When a new external resource is added to the federation, the contents of that resource must be described in terms of entities, attributes, and relationships—the basic concepts in the FDM. For example, entity classes and attributes are used to describe the tables and columns in a relational database, the classes and tags in an ACEDB database, and the databanks and fields accessed by SRS. The integration schema has to be extended to include concepts in the new resource, and mapping functions to be used by the ICode rewriter must be generated. Since the mediator has a modular architecture in which query transformation is done in stages, the only new software components that might have to be written are code generators and wrappers—the components shown with dark borders in Fig. 4. All other components within the mediator are generic. However, the federation administrator might want to add declarative rewrite rules that can be used by the rewriter to improve the performance of queries involving the new resource.

V. EXAMPLE

A prototype mediator has been used to combine access to databanks at the EBI via an SRS server [4] and (remote) P/FDM test servers. Remote access to a P/FDM database is provided through a CORBA server [13]. This example, using a mini-integration schema, illustrates the steps involved in processing multidatabase queries.

Fig. 6. Daplex query expressed against an integration schema.

Fig. 7. ICode corresponding to the query in Fig. 6.

In this example, three different databases are viewed through a unifying integration schema, which is shown in the upper part of Fig. 5. There are three classes in this schema: protein, enzyme, and swissprot_entry. A function representing the enzyme classification number (ecnumber) is defined on the class enzyme, and enzymes inherit those functions that are declared on the superclass protein. Each instance of the class protein can be related to a set of swissprot_entry instances.

The lower part of Fig. 5 shows the actual distribution of data across the three databases. Db I is a P/FDM database that contains the codes and names of proteins. Db II is also a P/FDM database, and contains the protein code (here called pdb_code) and enzyme classification code of enzymes. To identify SWISS-PROT entries at the EBI that are related to a given protein instance we must first identify the Protein Data Bank (PDB) entry whose id matches the protein code and then follow further links to find related SWISS-PROT entries. Relationships between data in remote databases can be defined by conditions that must hold between the values of the related objects. Constraints on identifying values are represented by dashed arrows in Fig. 5.

Fig. 6 shows a Daplex query expressed against this integration schema. This query prints information about enzymes that satisfy certain selection criteria, and their related SWISS-PROT entries. Fig. 7 shows a “pretty-printed” version of the ICode that is produced when this query is compiled. This ICode is then processed by the query splitter, producing ICode that will be turned into queries that will be sent to the three external data resources (Fig. 9). Note that a single variable is common to all three query fragments. Values for this variable that are retrieved from P/FDM Db II are used in constructing queries to be sent to the other data resources.

In the example, the class protein is related to the class swissprot_entry in the integration schema by a multivalued relationship function called swissprot_entries. The mapping function given in Fig. 8 is used in transforming queries that contain this relationship into “enhanced” ICode that refers to the external schemas of the actual data resources. This code shows an actual example of P/FDM ICode. Mapping functions such as this can be compiled from high-level declarative rewrite rules and do not have to be written by hand.
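Fig. 8 itself is not reproduced in this excerpt. The Prolog clause below is only a schematic stand-in for such a mapping function, expanding the integration-schema relationship swissprot_entries into source-tagged goals of the kind used in the query-splitter sketch above; the goal names (get_protein_code/2, pdb_entry/2, link/3) and the at/2 tagging are invented for illustration.

% Hypothetical expansion of "the swissprot_entries of protein P" into goals
% over the external schemas: fetch the protein code from P/FDM Db I, locate
% the matching PDB entry via SRS, then follow SRS links to SWISS-PROT entries.
map_relationship(rel(P, swissprot_entries, S),
                 [ at(pfdm_db_i, get_protein_code(P, Code)),
                   at(srs, pdb_entry(Code, PdbEntry)),
                   at(srs, link(PdbEntry, swissprot, S)) ]).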

VI. DISCUSSION

The use of mediators was originally proposed by Wiederhold [20] and became an important part of the Knowledge Sharing Effort architecture [15]. Examples of such intelligent information-seeking architectures are InfoSleuth [2] and KRAFT [7]. In this architecture, the mediator can run on the client machine, or else be available as middleware on some shared machine, while the wrapper is on the remote machine containing the knowledge source. The idea behind this is that existing knowledge sources can evolve their schemas, yet present a consistent interface to the mediator by suitable changes to the wrapper. For this purpose the wrapper may be as simple as an SQL view, or it may be more complex, involving mapping of code. In any case, the site is able to preserve some local autonomy. Other mediators do not have to worry about how the site evolves internally. Also, new sites can join a growing network by registering themselves with a facilitator. All the mediator needs to know is how to contact the facilitator and that any knowledge sources the facilitator recommends will conform to the integration schema.

In this paper, we describe an alternative architecture, where the wrappers reside with the mediator. This has the advantage that there is no need to get the knowledge source to install and maintain custom-provided wrapper software.

In our architecture (as shown in Fig. 4) the code generators produce code in a different query language or constraint language. Thus, they are used in two directions. In one direction, they map queries or constraints into a language that can be used directly at the knowledge source. This can be crucial for efficiency by allowing one to move selection predicates closer to the knowledge source in a form that is capable of using local indexes, etc. This can have a very big effect with database queries because it saves bringing many “penny packets” of data back through the interface, only to be filtered and rejected on the far side [12]. In the other direction, wrappers are used to map data values, for example, by using scaling factors to change units or by using a lookup table to replace values by their new identifiers.
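A small Prolog sketch of the wrapper-side value mapping mentioned above, with invented conversions: a scaling factor that changes units and a lookup table that replaces an old identifier with a new one; neither conversion is taken from the actual federation.

% Unit conversion by scaling factor (hypothetical: nanometres to angstroms).
to_angstrom(ValueNm, ValueA) :- ValueA is ValueNm * 10.

% Identifier replacement via a lookup table (hypothetical codes).
id_map('ABC1', 'abc1').

% map_value(+SourceTerm, -IntegrationTerm)
map_value(resolution_nm(V0), resolution_angstrom(V)) :- to_angstrom(V0, V).
map_value(old_code(C0), new_code(C)) :- id_map(C0, C).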

Note that we are not advocating building a so-called global integration schema. This has often been criticized on the grounds that attempting to map every single concept into one all-embracing schema is both laborious and never ending. Instead, we visualize an incrementally growing integration schema driven by user needs. Ideally the schema would be built interactively using a GUI and rules that suggest various mappings, as proposed by Mitra et al. [14] in their ONION system for incremental development of ontology mappings. The crucial thing to realize is that the integration schema represents a virtual database, which allows it to evolve much more easily than a physical database.

Related work in the bioinformatics field includes the Kleisli system [3], [21]. The query language used in Kleisli is the Collection Programming Language (CPL), which is a comprehension-based language in which the generators are calls to library functions that request data from specific databases according to specific criteria. Thus, when writing queries the user must be aware of how data are partitioned across external sites. This contrasts with our own approach where references to particular resources do not feature in the integration schema or in user queries. Of course, an interface based on domain concepts and without references to particular resources could be built on top of Kleisli.

Fig. 8. Mapping function used to expand the relationship swissprot_entries in the integration schema into ICode that refers to data held at the EBI.

Fig. 9. ICode subqueries against the actual data resources that need to be accessed in answering the query in Fig. 6.

The TAMBIS system [16] writes query plans in CPL. Plans in TAMBIS are based on following a classification hierarchy, whereas our plans are oriented toward ad hoc SQL3-like queries. However, the overall approach is similar in using a high-level intermediate code translated through wrappers.

Another related project is DiscoveryLink [9]. The architecture of the DiscoveryLink system is similar to that presented here. While we use the functional data model, DiscoveryLink uses the relational data model, and all of the databases accessed via DiscoveryLink must present an SQL interface.

VII. CONCLUSION

We have developed the P/FDM mediator—a computer program which supports transparent and integrated access to different data collections and resources. Ad hoc queries can be asked against an integration schema, which is a predefined collection of entity classes, attributes, and relationships. The integration schema can be extended at any time by adding declarative descriptions of new data resources to the mediator’s setup files.

Rather than building a data warehouse, we have developed a system that brings data across on demand from remote sites. The P/FDM mediator arranges for this to happen without further human intervention. Our approach preserves the autonomy of the external data resources, and makes use of existing search capabilities implemented in those systems.

Bioinformatics faces a “crisis of data integration” [17], which we believe is best addressed through federations that allow their constituent databases to develop autonomously and independently. The existence of schemas at different levels, as shown in Section III, makes apparent the requirements for query transformation in a mediator in a database federation.

This results in a modular design for the mediator that enables the federation to evolve incrementally.

REFERENCES

[1] “ACM SIGFIDET,” ANSI, vol. 7, Interim Rep. ANSI/X3/SPARC Study Group Data Base Management Syst., 1975.

[2] R. J. Bayardo, B. Bohrer, R. S. Brice, A. Cichocki, J. Fowler, A. Helal, V. Kashyap, T. Ksiezyk, G. Martin, M. H. Nodine, M. Rashid, M. Rusinkiewicz, R. Shea, C. Unnikrishnan, A. Unruh, and D. Woelk, “InfoSleuth: Semantic integration of information in open and dynamic environments (experience paper),” in Proc. ACM SIGMOD Int. Conf. Management Data, SIGMOD 1997, J. Peckham, Ed., pp. 195–206.

[3] P. Buneman, S. B. Davidson, K. Hart, G. C. Overton, and L. Wong, “A data transformation system for biological data sources,” in Proc. 21st Int. Conf. Very Large Data Bases (VLDB’95), U. Dayal, P. M. D. Gray, and S. Nishio, Eds., pp. 158–169.

[4] T. Etzold and P. Argos, “SRS—an indexing and retrieval tool for flat file data libraries,” CABIOS, vol. 9, pp. 49–57, 1993.

[5] P. M. D. Gray, S. M. Embury, K. Y. Hui, and G. J. L. Kemp, “The evolving role of constraints in the functional data model,” J. Intell. Inform. Syst., vol. 12, pp. 113–137, 1999.

[6] P. M. D. Gray, K. G. Kulkarni, and N. W. Paton, Object-Oriented Databases: A Semantic Data Model Approach. Englewood Cliffs, NJ: Prentice-Hall, 1992, Prentice-Hall Ser. Comput. Sci.

[7] P. M. D. Gray, A. D. Preece, N. J. Fiddian, W. A. Gray, T. J. M. Bench-Capon, M. J. R. Shave, N. Azarmi, M. Wiegand, M. Ashwell, M. Beer, Z. Cui, B. Diaz, S. M. Embury, K. Hui, A. C. Jones, D. M. Jones, G. J. L. Kemp, E. W. Lawson, K. Lunn, P. Marti, J. Shao, and P. R. S. Visser, “KRAFT: Knowledge fusion from distributed databases and knowledge bases,” in Proc. 8th Int. Workshop Database Expert Syst. Applicat., R. R. Wagner, Ed. Los Alamitos, CA, 1997, pp. 682–691.

[8] S. Grufman, F. Samson, S. M. Embury, and P. M. D. Gray, “Distributing semantic constraints between heterogeneous databases,” in Proc. 13th Int. Conf. Data Eng., A. Gray and P.-Å. Larson, Eds. Los Alamitos, CA, 1997, pp. 33–42.

[9] L. M. Haas, P. Kodali, J. E. Rice, P. M. Schwarz, and W. C. Swope, “Integrating life sciences data—with a little garlic,” Proc. IEEE Int. Symp. Bio-Informatics Biomed. Eng., pp. 5–12, 2000.

[10] P. D. Karp, “A vision of DB interoperation,” presented at the Proc. Meet. Interconnection Molecular Biology Databases, 1995.

[11] G. J. L. Kemp, P. M. D. Gray, and A. R. Sjöstedt, “Rewrite rules for quantified subqueries in a federated database,” in Proc. 13th Int. Conf. Sci. Statist. Database Management, L. Kerschberg and M. Kafatos, Eds. Los Alamitos, CA, 2001, pp. 134–143.

[12] G. J. L. Kemp, J. J. Iriarte, and P. M. D. Gray, “Efficient access to FDM objects stored in a relational database,” in Proc. 12th British Nat. Conf. Databases: Directions in Databases, D. S. Bowers, Ed. Berlin, Germany, 1994, pp. 170–186.

[13] G. J. L. Kemp, C. J. Robertson, P. M. D. Gray, and N. Angelopoulos, “CORBA and XML: Design choices for database federations,” in Proc. 17th British Nat. Conf. Databases, B. Lings and K. Jeffery, Eds. Berlin, Germany, 2000, pp. 191–208.

[14] P. Mitra, G. Wiederhold, and M. Kersten, “A graph-oriented model for articulation of ontology interdependencies,” in Proc. Advances in Database Technology—EDBT 2000, C. Zaniolo, P. C. Lockemann, M. H. Scholl, and T. Grust, Eds. Berlin, Germany, 2000, pp. 86–100.


[15] R. Neches, R. Fikes, T. Finin, T. Gruber, R. Patil, T. Senator, and W. R. Swartout, “Enabling technology for knowledge sharing,” AI Mag., vol. 12, no. 3, pp. 36–56, 1991.

[16] N. W. Paton, R. Stevens, P. Baker, C. A. Goble, S. Bechhofer, and A. Brass, “Query processing in the TAMBIS bioinformatics source integration system,” in Proc. 11th Int. Conf. Sci. Statist. Database Management. Los Alamitos, CA, 1999, pp. 138–147.

[17] R. J. Robbins, “Bioinformatics: Essential infrastructure for global biology,” J. Comput. Biol., vol. 3, pp. 465–478, 1996.

[18] A. P. Sheth and J. A. Larson, “Federated database systems for managing distributed, heterogeneous and autonomous databases,” ACM Comput. Surveys, vol. 22, pp. 183–236, 1990.

[19] D. W. Shipman, “The functional data model and the data language DAPLEX,” ACM Trans. Database Syst., vol. 6, no. 1, pp. 140–173, 1981.

[20] G. Wiederhold, “Mediators in the architecture of future information systems,” IEEE Comput., vol. 25, pp. 38–49, 1992.

[21] L. Wong, “Kleisli, its exchange format, supporting tools and an application in protein interaction extraction,” Proc. IEEE Int. Symp. Bio-Informatics Biomed. Eng., pp. 21–28, 2000.

Graham J. L. Kemp received the BSc Honours degree in computing science in 1987 and the Ph.D. degree on “Protein modelling using an object-oriented database” in 1991, both from the University of Aberdeen, Aberdeen, Scotland, U.K.

He was a Research Assistant, Research Fellow, and Lecturer at the University of Aberdeen. He joined Chalmers University of Technology, Sweden, as an Associate Professor in January 2002. He has been involved in bioinformatics and database research since 1987. He is one of the developers of the P/FDM object database management system, and has strong interests in protein structure.

Nicos Angelopoulos, photograph and biography not available at the time of publication.

Peter M. D. Gray has been with Aberdeen University since 1968, where he is now a full Professor, researching in AI logic and KR techniques applied to databases.

Since 1987, he has worked on bioinformatics applications, using the P/FDM database mediator. He is currently working on knowledge representations suitable for knowledge reuse and exchanging knowledge between agents, as part of Advanced Knowledge Technologies.

Prof. Gray was European PC Chair for the 21st International VLDB Conference, Zurich, Switzerland, in 1995.