20
Agent based data management in digital libraries Yanyan Yang a , Omer F. Rana a, * , David W. Walker a , Christos Georgousopoulos a , Giovanni Aloisio b , Roy Williams c a Department of Computer Science, Cardiff University, CF10 3XQ Cardiff, UK b Department of Innovation Engineering, Engineering Faculty, University of Lecce, Via per Monteroni, 73100 Lecce, Italy c CACR, 158-79 California Institute of Technology, Pasadena, CA 91125, USA Received 11 March 2001; received in revised form 20 November 2001; accepted 20 November 2001 Abstract Digital libraries (DLs) generally contain a collection of independently maintained data sets, in different formats, which may be queried by geographically dispersed users. A DL should enable new data sources to be dynamically added, and allow changes in content and in the schema of sources which are already part of the library, to take place. An agent based framework for managing access to data, supporting parallel queries to data repositories, and providing an XML based data model for integrating data from different repositories is outlined. Our approach utilises stationary agents which undertake specific roles, and mobile agents which can carry analysis algorithms to data repositories. We illustrate our approach with a DL of images of the Earth acquired by the space shuttle, obtained via the synthetic aperture radar. The DL described contains multi-spectral images, and text based data from various regional geographic information servers, and must support data fusion across these data sets. Ó 2002 Elsevier Science B.V. All rights reserved. Keywords: Data migration and management; Data fusion; Parallel data analysis; Mobile and intelligent agents 1. Introduction A digital library (DL) generally includes a large collection of objects, stored and maintained by multiple sources, such as databases, image banks, file systems, e-mail www.elsevier.com/locate/parco Parallel Computing 28 (2002) 773–792 * Corresponding author. Tel.: +44-29-2087-4000. E-mail address: [email protected] (O.F. Rana). 0167-8191/02/$ - see front matter Ó 2002 Elsevier Science B.V. All rights reserved. PII:S0167-8191(02)00099-6

Agent based data management in digital libraries

Embed Size (px)

Citation preview

Page 1: Agent based data management in digital libraries

Agent based data management in digital libraries

Yanyan Yang a, Omer F. Rana a,*, David W. Walker a,Christos Georgousopoulos a, Giovanni Aloisio b, Roy Williams c

a Department of Computer Science, Cardiff University, CF10 3XQ Cardiff, UKb Department of Innovation Engineering, Engineering Faculty, University of Lecce,

Via per Monteroni, 73100 Lecce, Italyc CACR, 158-79 California Institute of Technology, Pasadena, CA 91125, USA

Received 11 March 2001; received in revised form 20 November 2001; accepted 20 November 2001

Abstract

Digital libraries (DLs) generally contain a collection of independently maintained data

sets, in different formats, which may be queried by geographically dispersed users. A DL

should enable new data sources to be dynamically added, and allow changes in content and

in the schema of sources which are already part of the library, to take place. An agent based

framework for managing access to data, supporting parallel queries to data repositories, and

providing an XML based data model for integrating data from different repositories is outlined.

Our approach utilises stationary agents which undertake specific roles, and mobile agents

which can carry analysis algorithms to data repositories. We illustrate our approach with a

DL of images of the Earth acquired by the space shuttle, obtained via the synthetic aperture

radar. The DL described contains multi-spectral images, and text based data from various

regional geographic information servers, and must support data fusion across these data sets.

� 2002 Elsevier Science B.V. All rights reserved.

Keywords: Data migration and management; Data fusion; Parallel data analysis; Mobile and intelligent

agents

1. Introduction

A digital library (DL) generally includes a large collection of objects, stored andmaintained by multiple sources, such as databases, image banks, file systems, e-mail

www.elsevier.com/locate/parco

Parallel Computing 28 (2002) 773–792

*Corresponding author. Tel.: +44-29-2087-4000.

E-mail address: [email protected] (O.F. Rana).

0167-8191/02/$ - see front matter � 2002 Elsevier Science B.V. All rights reserved.

PII: S0167-8191 (02 )00099-6

Page 2: Agent based data management in digital libraries

servers, and Web based repositories. Such objects can be maintained on a diverserange of platforms and can be organised and managed in different ways, based onlocal, resource-specific policies. Research in DLs has therefore generally focusedon providing seamless and transparent access to such objects [1]. To support theserequirements, it is important to provide a means to organise objects within a DLto allow multiple heterogeneous data sources to co-exist, and to provide supportfor managing vast quantities of data representing each digital object. Integrating dif-ferent information sources generally requires the development of a system wide datamodel, and subsequent translation of each source specific data model into the systemwide model. Specifically, DLs require:

(1) Storage support for large quantities of data, and support for structured, semi-structured and unstructured data. The underlying models should enable updatesto both content and source schema.

(2) Support for multimedia data sets, enabling visual (image and video), streambased (video and audio) and textual (data sets) to be managed and archived.Metadata is needed to support the management of such data sets, and may beautomatically extracted, or obtained from catalogues which express attributesabout the data being stored.

(3) Support for modifying source schemas, such as in scientific computing wherenew experimental procedures may necessitate a change in the schema for record-ing the output of an experiment. Various approaches can be adopted, rangingfrom database triggers (active databases), to analysing database logs for identi-fying when a particular change should be effected.

(4) Support different user capabilities and needs––to enable users with differentphysical, technical, linguistic and domain expertise to access stored objects.For instance, depending on the device from which access is being made, theDL should automatically send data suitable for that device. Hence, usersequipped with a text-only display accessing a multimedia data store (with text,video, audio capability), should only be sent text. User preferences and profilesmay also be used for identifying requirements, based on past query history ofa user.

The ‘‘mediator’’ based approach is the most commonly employed technique forsupporting some of these requirements, and involves creating standard interfacesto various information sources (wrappers). Each wrapper translates a user query intoa query native to the source. The results are then combined by the mediator and pre-sented to the user. Each wrapper is therefore unique to a knowledge source, and con-tains information about the source structure. A mediator may also break down aquery into multiple sub-queries, which are issued to various information sources itis aware of. The mediator is therefore required to know about the structure of var-ious information sources, and which information sources are currently available forit to query. This is one of the limitations of this approach, as the number of infor-mation sources need to be pre-defined, and it is generally not possible for a mediatorto discover new sources of interest.

774 Y. Yang et al. / Parallel Computing 28 (2002) 773–792

Page 3: Agent based data management in digital libraries

Our work is centered on the use of an agent based infrastructure for supportingthe requirements identified above. Services offered to users can be dynamically addedor updated, based on the use of mobile code that can update the capability of agiven information source. A combination of mobile and stationary agents are utilisedto analyse spatial data obtained from various Earth observation satellites. Mobileagents that encapsulate executable code may be dispatched to a remote server hostinglarge data sets, in scenarios where moving the computation to the data sets is a morerealistic and feasible approach, compared to migrating large quantities of data to acentral server for analysis. The system architecture comprises a number of collaborat-ing agents, where each agent undertakes a pre-defined role––such as a user presenta-tion agent, a database query agent, a query-migration agent etc. (see Section 4). Eachagent can adapt its role within pre-defined constraints outlined in its role definition.Our approach enables new information sources to be integrated into the system, andalso provides support for handling a large number of concurrent users. Query-migra-tion agents enable the utilisation of user defined data analysis algorithms, and suchagents can modify their itinerary (i.e. which data repositories they visit) based oninformation gathered at each point in their travel path. In this way, new repositoriesmay be added to the system, enabling existing repositories to update their links––sim-ilar to the update of routing tables at network nodes. We describe our system archi-tecture with reference to a multi-spectral image based DL, although the proposedapproach is more general and can also be used in other domains. The main emphasislies on coordinating and balancing computation across distributed data sets, andexecuting user defined queries (either those maintained by the information source,or migratable user defined code) across parallel computing resources. A practicaldevelopment approach is adopted in this paper, and used to describe the particulartechnologies adopted, and design decision made in building the infrastructure.

2. Related work

Marchionini [16] has characterised DL research and development as falling intofour categories: content, services, technology, and culture. Research issues relatedto content includes the integration of multimedia objects; data acquisition, includinganalog to digital conversion; metadata extraction and standardisation; indexing,storage and retrieval; work-flow processes and management; and collection preserva-tion and maintenance. Service research issues are strongly dependent on user inter-faces and include search, filtering and browsing; reference support and instruction.Technology research efforts are mainly related to high-speed networking, securityand billing, and interoperability across many DLs. The culture issues include intel-lectual property; insuring data quality, privacy, and equity; and organisational in-terfaces for various communities of practice. In addition to these research anddevelopment challenges, meta issues related to managing and evaluating DLs andtheir impact on people and organisations are also active areas of study.

The Digital Libraries Initiative (DLI) announced in late 1993 [11], was focused atadvancing the means to collect, store, and organise information in digital forms, and

Y. Yang et al. / Parallel Computing 28 (2002) 773–792 775

Page 4: Agent based data management in digital libraries

make available such information for searching, retrieval, and processing via commu-nication networks in user-friendly ways. 1 With the selection and funding of the sixDLI projects, interest and activities related to DLs accelerated rapidly [20,21]. Webriefly discuss the six DLI projects below. The DELOS [8] initiative in the Europeancommunity has also increased interest in DLs––with various members of the appli-cations community (such as library sciences, cultural heritage, etc.) coming togetherto identify information recording needs of their particular community.

• The University of Michigan Digital Library (UMDL) [4] utilises specialised infor-mation agents to perform information retrieval across heterogeneous sources.Each agent has two properties: autonomy and negotiation. The UMDL architec-ture consists of a cooperating set of three software agent types: User InterfaceAgents (UIAs), mediation agents, and collection agents. UIAs conduct interviewswith users to establish their needs such as what they need to know, and thebreadth and depth of the information they require. The interface agents enablethe user to specify areas of interest so that the system can notify the user of itemsof potential relevance. Mediation agents coordinate searches of many distinct butnetworked collections by taking orders from the interface agents.

• The UC Berkeley Digital Library Project [23] aimed to develop the tools and tech-nologies to support highly improved models of the ‘‘scholarly information lifecycle’’, to facilitate the move from the current centralised, discrete publishingmodel, to a distributed, continuous, and self-publishing model.

• The Alexandria Digital Library (ADL) project was focussed at providing a broadaccess to distributed collections of spatially indexed information [3]. The ADLarchitecture consisted of four components: collections, catalogue, interfaces andingest facilities [12].

• The Carnegie Mellon informedia digital video library project focused on auto-mated video and audio indexing, navigation, visualisation, search and retrievaland embedding audio/video content in a system for use in education, informationand entertainment environments [6].

• The UIUC DL project mainly used Web technology to effectively search technicaldocuments on the Internet. An experimental testbed was built with a large numberof full-text journal articles from physics, engineering, and computer science, andthese articles were made available over the Web. A content indexing system fortext documents was developed to enable federated search across multiple sources,with tests were undertaken on millions of documents for supporting a ‘‘semanticfederation’’ [5].

• The Stanford Digital Library Project aimed to resolve issues of heterogeneity ofinformation and services [17]. Based on CORBA technology, the project devel-oped a prototype for providing uniform access to heterogeneous informationsources and services.

1http://www.cise.nsf.gov/iis/dli_home.html

776 Y. Yang et al. / Parallel Computing 28 (2002) 773–792

Page 5: Agent based data management in digital libraries

Besides these six projects, other work in DLs include Rutgers University’sDigiTerra [1] project, which is a space and land-based digital system consistingof multi-layered information and data management functions. The integrationand interoperability layer is concerned with the collection and assimilation of avast array of environmental data. Extensible markup language (XML) has beenchosen as a common language to be used by the mediators and wrappers to repre-sent queries and responses. The ontology layer enables users with diverse back-grounds to query across multiple domains. The data warehousing/data-mininglayer provides fast and efficient access to the integrated data, efficient data analysis,and historical, temporal and chronological views. The concept indexing and content-based retrieval layer provides efficient retrieval by suitably organising the multimediadata, based on the concepts associated with the objects. The universal access layerprovides methodologies to cater for different user characteristics, preferences, andcapabilities.

The Virtual Community Library (VCL) [18] is centered on the provision of a col-lection of interacting self-interested agents––where an agent encapsulates the knowl-edge and interests of an individual user. Each agent is perceived as a personal DLserving one individual user. Within the VCL, the agents support an individuals’ in-formation acquisition and dissemination needs by querying other agents and in-terpreting the results according to the querying agent’s knowledge, maintainingsubscriptions and publication commitments according to the users’ interests, andproviding speculative recommendations based both social and collaborative filtering.

3. System overview

We utilise an agent based system, with an emphasis on mobility, and utilising themobile agent paradigm to transfer analysis algorithms to the data repositories. Ourfocus differs from the projects outlined above, although we do utilise a common datamodel in XML to encode queries, and present results of queries to a user. XML isalso used to encode the type of data maintained by a server, and is used to planthe mobile agent itinerary. Our approach is also most pertinent in scenarios wheredata migration cannot be undertaken, due to the raw data being proprietary, ordue to physical limitations on how much of the data can be moved across a network.Mobile agent are used in our system to process, mine, and fuse distributed multi-agency archives of remote sensing data. Our basic model comprises independentlydeveloped agents situated within an environment where they interact with eachother, and with a user. Each agent is responsible for offering a particular type of ser-vice, and the integration of services is based on a user request. Both users and serverscan have specialised agents, which undertake specialised tasks. We use XML for en-coding agent communication, and to enable agents to query data repositories. Thisrepresentation forms an encoding format for a system wide data model for spatialdata repositories, and is defined using extensible scientific interchange language(XSIL) [25]. Our approach is similar to the DigiTerra project, in that we useXML to wrap queries and encode results of queries.

Y. Yang et al. / Parallel Computing 28 (2002) 773–792 777

Page 6: Agent based data management in digital libraries

There are two ways in which parallelism is achieved within the system:

• From a user perspective, parallelism is achieved by the mobile agents being sent tomultiple repositories simultaneously. A user can perform queries in parallel bywrapping analysis algorithms as mobile agents, and migrating these to remoteservers. Data processing then takes place locally on each of these servers. Thisis equivalent to performing multiple simultaneous processing requests, the resultsof which must then be integrated by the user in some way. Each mobile agentmaintains control of undertaking the processing, and the user is only responsiblefor integrating results. This therefore shifts control away from the user to themobile agents that carry the processing algorithm, and enables a mobile agentto adapt to the environment of the remote server.

• From a data server perspective, multiple mobile agents can be received, andqueued. It is then up to the server to determine which of these should be ableto access the data first. If the server supports replicated versions of the datasource, or has multiple processors which can access a file system simultaneously,then it can allow multiple mobile agent to run. In the same way, it is also possiblefor a single mobile agent to run an analysis on multiple processors on a givenserver, provided the server administrator allows this to happen.

Our emphasis is on an ‘active’ DL for accessing on-line data archives of the Syn-thetic Aperture Radar Atlas (SARA). The data maintained within the SARA systemwas acquired by the space shuttle, during a week-long flight, covering an area ofroughly 50 million square kilometer. The data was acquired using ‘‘active’’ radar,which measures the strength and round-trip time of the microwave signals that areemitted by a radar antenna and reflected off a distant surface object. Hence, eachpixel in the generated image corresponds to radar backscatter. Darker areas in theimage represent low backscatter, and bright areas represent a high backscatter.Bright features mean that a large fraction of the radar energy has been reflectedback, where the backscatter depends on the size of the scattering objects in the targetarea, the moisture content of the target area, the polarisation of the pulses and theobservation angles. The microwave transmissions are vertically and horizontally po-larised, and received on two separate channels. Hence, radar backscatter can be infour polarisation combinations: HH (horizontally transmitted, horizontally re-ceived), VV (vertically transmitted, vertically received) and VH and HV. This enablesthe derivation of the complete scattering matrix of a scene on a pixel by pixel basis.Subsequent analysis involves allocating colours to these polarisations to identify par-ticular surface features, such as vegetation cover, sub-surface discontinuities etc. Ad-ditional details of how the imaging radar works can be found at [13].

When using the SARA system, a user selects a particular part of the Earth, choos-ing an area where data is available, generally shown as a marked region on a map.Alternatively, the user can create a polygon over the surface of a globe/map, fromwhich latitude and longitude co-ordinates are derived. A URL is generated whosecontent is the multi-channel data set (image) to be analysed by a particular algo-rithm. The Generated URL corresponds to the region of interest as an SAR image.

778 Y. Yang et al. / Parallel Computing 28 (2002) 773–792

Page 7: Agent based data management in digital libraries

The processed multi-spectral data may be further processed by choosing a mappingfrom the frequency/polarisation channels to be red, green and blue components ofthe final image. This mapping may be optimised to highlight aspects such as groundecology or snow/ice conditions, for instance. A user-server communicates with ametadata server, which contains metadata descriptions such as the position of theimage on the surface of the Earth, and the user-server opens an HTTP connectionto a data-server. The data-server retrieves the requested data from a mass storage sys-tem, and delivers the image as a JPEG file. Additional compute-servers are providedfor performing activities such as image processing, land classification etc. The data-server is linked to an HPSS/Unitree file system, from which the required image tobe viewed, or analysed, is retrieved and passed to a compute-server. Currently, thedata sets are replicated on archives at Caltech (US) and the University of Lecce (It-aly).

4. Agent based architecture

Fig. 1 illustrates the various agents that participate in the SARA system––andgenerally consist of an overall Local Interface Agent (LIA), and a User InterfaceAgent (UIA). The UIA is primarily responsible for supporting the user in formulat-ing a query, building a profile of the user, for launching and receiving mobile agents,and for presenting the results of a query to a user. It is divided into two sub-agents, aUser Request Agent (URA) and a User Assistant Agent (UAA), which collectivelyperform the tasks of the UIA. Each of these agents, and the roles within the LIA are

Fig. 1. The agent based SARA system.

Y. Yang et al. / Parallel Computing 28 (2002) 773–792 779

Page 8: Agent based data management in digital libraries

defined in detail below. We separate mobile agents from static intelligent agents, lo-calising the most complex functionality in non-mobile LIAs, which remain at one lo-cation, and communicate with the mobile UIA, and provide resources and facilitiesto lightweight mobile agents that require less processor time to be serialised, and arequicker to transmit.

4.1. Local Interface Agents

LIAs are stationary agents that provide an interface between resource servers andrequesting mobile agents. Each LIA is responsible for managing a given informationsource, and perform various roles associated with that repository––such as providinga wrapper for XSIL. New image repositories must provide a translator to the XSILrepresentation. A new repository is added by publishing its location to UIAs andLIAs that are known at the time. The repository does not advertise its capabilitiesat the outset, requiring the UIA to do this at a later stage. Each mobile agents thatvisits an LIA can update the links at its UIA about additional repositories that theLIA is aware of. Each LIA is also responsible for managing the file space associatedwith an information source, such as storing results of partial queries generated by agiven user request. Each UIA owner is allocated temporary space on the repositoriesto store the results of their queries. This space is monitored by the LIA to ensure thatusers do not exceed their allocation. Each LIA has a number of sub-agents, eachof which undertakes a particular role:

• A Local Assistant Agent (LAA) supports interaction with any visiting URAs, andinforms them about the available data and computing resources, and supports thecompletion of the task carried by the URA.

• A Local Management Agent (LMA) co-ordinates access to other LIAs and sup-ports negotiation between agents. It is responsible for optimising the itinerary of avisiting agent, in order to ensure that the next site the agent visits is of relevancewith respect to data acquired at the current site. The LIA therefore minimises thebottlenecks inherent in parallel processing in selectively choosing the path for anagent, and ensures that the URA is transferred successfully.

• A Local InteGration Agent (LIGA) provides a gateway to a local workstationcluster, or a parallel machine. The agent ensures that suitable libraries are avail-able on the required machine to ensure execution of the program carried by a vis-iting URA can take place successfully. These libraries are dynamic link librarieswhich are necessary for the program carried by the URA to work, and are appli-cation specific libraries. The LIGA must also have its environment configuredto access the Java runtime engine.

• A Local Retrieval Agent (LRA) translates query tasks and performs data retrievalfrom the local archive. In addition to retrieval, a LRA may also perform otheroperations such as saving the results to a file before sending it to the user.

• A Local Security Agent (LSA) is responsible for authenticating and performing avalidation check on the incoming URA. The URA will be allocated an access per-mission level, which determines the range of data that the URA can access on the

780 Y. Yang et al. / Parallel Computing 28 (2002) 773–792

Page 9: Agent based data management in digital libraries

local repository. Access permissions are set by the server administrator, and re-strictions are defined using the local schema. Generally, agents from registeredusers may use and have access to more information resources than agents fromunregistered users.

The LMA is responsible for determining the itinerary of the mobile agent, andbases this on the URLs maintained by the LAA. When a repository is first started,it contains a pre-defined list of other servers which also provide a repository of im-ages and GIS data. This list is updated as new mobile agents visit this server, andalso by periodic announcements made by other servers. At present we do not usea ‘‘yellow pages’’ service to help a server locate other servers offering similar repos-itories. The data maintained at a given repository is recorded in an XML docu-ment, and searched by the LMA when deciding on an itinerary.

4.2. User Interface Agents

The UIA provides a front end to the end user, for checking user input, profiling auser and for displaying results. The UIA undertakes two roles:

• A URA is responsible for enabling a user to create a request––either in SQL, or toprovide a link to a Java executable, which is then carried to the appropriate localarchive sites. The URA interacts with various agents at each LIA based remotesite, executes the query based on permissions allocated to it, and either returnsback to the user site with the results of analysis, or with a URL reference forthe results of the analysis.

• A UAA is a static agent that remains at the client site. A UAA manages the profileof the user and provides control functions to enable the user to launch and receiveURAs. The UAA also tracks the progress of a URA and its current location, andprovides the dispatched URA with a contact point to which the URA can returnthe results. It provides a graphical interface to the user for displaying the resultsof analysis based on pre-defined user choices. For example, the retrieval resultsmay be presented as a simple list of URLs. It can also display the results asthumbnail images through a Web browser.

New analysis algorithms which are to be carried by the URA must be encodedusing the XSIL representation, and as used also by the LMA on the server thatthe URA is likely to visit. Any analysis algorithm then operates on the data retrievedby the LMA within the temporary file space allocated to it on the server. Each repos-itory publishes the range of data that it manages, and records this in a document thatit makes available to a UIA and to LIA at other similar servers that it is aware of.

4.3. Implementation

To handle the requirements outlined in Section 1, our system provides the follow-ing:

Y. Yang et al. / Parallel Computing 28 (2002) 773–792 781

Page 10: Agent based data management in digital libraries

• Modularity: The system is composed of interchangeable modules, each providingsome of the required functionality. For example, LRA can translate a query taskand retrieve information from a local archive, which could be a database systemor a file system. If the local archive system changes, only the LRA must be up-dated to support the appropriate data storage device.

• Scalability: The system must deal with a large number of requests coming frommany users simultaneously. As the load grows, the system should scale gracefully.For example, in our architecture, the LMA may intelligently assign computationresources––as the load on a given repository increases.

• Decentralisation: The system should be open and evolvable. There should be noglobal administrator agent, because agents are submitted from various remotesites and it would be inefficient to route all agents through a central site. Controlis therefore distributed.

• Extensibility: The system should allow new elements, such as new services andnew archive systems to be easily added.

• Reliability: The system should be reliable during network overload, failures andupdates. LMA is responsible for informing URAs of any changes and updatestaking place in the system dynamically.

• Security: The LSA stands as a shield for the system. Only authorised agents arepermitted to enter, communicate and perform in the SARA system.

Our system has been implemented in Java using the Voyager mobile object li-braries [22], as many end users are required to access it on a variety of softwareand hardware platforms. Currently we have implemented parts of the URA, UPA,LAA, and LRA agents. The system is capable of serving multiple users simulta-neously and it has been tested against an Oracle test-database. A user can interactwith the SARA system using a Web-based graphical user interface (GUI). Initially,a user visits the SARA Web-site and enters his/her private information (username/password) and an SQL query. Once s/he submits his/her information a UPA(through a Servlet) is generated for the user. The UPA launches the URA to its firstitinerary, when the URA moves to the destination server, where it interacts with theLAA to perform a JDBC connection. After the JDBC connection is generated theURA communicates with the LRA, which executes the URAs SQL query and storesthe results in a file. Once the results have been acquired the URA instructs the LAAto close the JDBC connection, sends to the UPA a file reference pointing to the re-sults, and either continues to another server, or is terminated. Any errors and/orwarnings detected during the process of acquiring the data are appended to the usersfile.

Each user has a fixed amount of physical storage on each server, where their filesare being stored. The users’ files are stored in a hierarchical structure, arranged ac-cording to the site from which the URA was launched, the date of access, the username, and a file reference. The root directory containing the site name is made avail-able via a local Web-server, and password protected by the users. Every user canview and upload the results of their computation on the local server. The objectiveof the LAAs resource-check is to maintain the file-space of each user, and to prevent

782 Y. Yang et al. / Parallel Computing 28 (2002) 773–792

Page 11: Agent based data management in digital libraries

a user from exceeding the fixed amount of physical storage space that s/he owns on agiven data server. The LAA performs a resource-check at fixed periods of time. Inthe case where a user has exceeded the limit of his/her file-space, the LAA displaysa warning-message on the server console for the appropriate user. In the next cycleof the LAAs resource-check, if the user has not deleted some of his/her files, the LAAdeletes the users oldest files, and informs the user via an e-mail message of their filespace updates.

Fig. 2 provides the user interface to enable the specification of an SQL query, orthe ability to launch an agent based on a reference file held locally. The reference filecontains an image analysis algorithm used to perform queries on the remote imagerepositories. Fig. 3 provide the capability to specify the query once an image hasbeen selected, and Fig. 4 shows the subsequent processing of the image by an edgedetection analysis algorithm.

4.4. XML Encoded data model

It is essential that agents used to access heterogeneous remote-sensing data ar-chives communicate and co-operate with each other in order to provide serviceand satisfy user requests within the pre-defined constraints. A simple way to do thisis to define an interaction protocol for communication in the particular problemarea. The best way to represent such a protocol and to define a standard messageformat with meaningful structure and semantics have become key issues.

Fig. 2. User interface for query specification.

Y. Yang et al. / Parallel Computing 28 (2002) 773–792 783

Page 12: Agent based data management in digital libraries

Fig. 3. The GUI showing a retrieved SARA image––the colours correspond to polarisation information

based on electro-magnetic polarisations.

Fig. 4. The edge detection algorithm applied to a retrieved SARA image.

784 Y. Yang et al. / Parallel Computing 28 (2002) 773–792

Page 13: Agent based data management in digital libraries

Most agent communication languages such as KQML [14] and FIPA [10] havebeen designed to minimise the size of the message and to function more as a data-passing protocol. Little emphasis has been placed on the flexibility or the transpar-ency of the semantics of the messages. Many other agent communication languagesuse the basic KQML/FIPA style, but replace or extend the sets for special purposes.The XML is becoming the standard for data interchange on the Internet, and en-ables Web services that are not meant for direct use by humans, but rather to be usedby other software. Its flexibility and ability to clearly represent and identify datamake XML ideal for transferring data between agents. Several agent communicationlanguages such as KQML and FIPA, the most well known, have been converted tosimple XML form. Several XML-based schemas are being designed, such as resourcedescription framework (RDF) [19], XML-data [24], and document content definition(DCD) [7]. Most existing XML schemas focus on strong data formats. For example,RDF focuses on how to represent semantic networks; XML-data considers basicdata types such as Integer, Long, and Date; DCD is a simplification of RDF thattakes account of the data types of XML-Data.

We use an XML schema for agent communication, that combines the most attrac-tive aspects of KQML, FIPA ACL and other agent communication languages, andthat enable agents to communication with each other by expressing intentions in theSARA ontology. Our XML schema allows efficient parsing and is modular andextensible to support evolving classes of XML documents. In addition it retainsits simplicity and clarity, and is readable by the user. Each message has a standardstructure, showing the message type, context information, message sequence, and thebody of the message. Agents cooperate by sending messages and using concepts fromthe SARA ontology.

In the SARA DL, the ontology describes terms and concepts (such as a track, alatitude/longitude coordinate, etc.) and their inter-relationships. The agent sendingand receiving the message must share an understanding of what the words are in-tended to mean. We represent the system ontology by listing terms, their meaningsand intended use in the document type definition (DTD). A DTD can define ele-ments, attributes, types, and required, optional, or default values for those attri-butes. While the XML specification contains the structured information, the DTDdefines the semantics of that structure, effectively defining the meaning of theXML-encoded message. Hence, we define a set of DTDs for agent communicationin the SARA system that specifies all of the legal message types, constraints onthe attributes, and message sequences, for instance:

DTD for track query:

<?xml version¼ 01.00 encoding¼ 00UTF-800?><!ELEMENT trackquery (condition+)><!ELEMENT condition (and | or)+><!ELEMENT and (Equal | LessThanOrEqual | MoreThanOrEqual)+><!ELEMENT or (Equal | LessThanOrEqual | MoreThanOrEqual)+><!ELEMENT Equal (left, right)>

Y. Yang et al. / Parallel Computing 28 (2002) 773–792 785

Page 14: Agent based data management in digital libraries

<!ELEMENT LessThanOrEqual (left, right)><!ELEMENT MoreThanOrEqual (left, right)><!ELEMENT left (#CDATA)><!ELEMENT right (#CDATA)>

A DTD for agent message communication:

<?xml version¼ 01.00 encoding¼ 00UTF-800?><!ELEMENT message (context+, content+)><!ATTLIST message type (request | response | failure | refuse)

#REQUIRED date CDATA #IMPLIED id CDATA #REQUIRED><!ELEMENT EMPTY)><!ATTLIST context sender CDATA #IMPLIED receiver CDATA #IM-

PLIED originator CDATA #IMPLIED returnby CDATA #IMPLIED><!ELEMENT content (itinerary+, querydef+, results)><!ELEMENT itinerary (server)+><!ELEMENT server (Cardiff | Lecce | Caltech , server2?)><!ELEMENT server2 (Cardiff | Lecce | Caltech)><!ENTITY query SYSTEM 00query.xml00><!ENTITY querydef (&query;)+><!ELEMENT results (#PCDATA)>

In XML-based messages, agents encode information with meaningful structureand commonly agreed semantics. On the receiving side, the information can be iden-tified and used by different services, based on predefined tags. A mobile agent cancarry an XML document to a remote data archive for data exchange, where bothqueries and responses are XML-encoded. We have currently identified various typesof messages for agent interaction, such as UPA–URA messages, URA–LIA mes-sages, and LIA–UPA message. Messaging is performed asynchronously so thatlaunching the URA is independent of receiving a message from the UPA. A LIA–UPA message is sent from a LIA to a related UPA when the tasks are finished. Inour system, we use the JAXP [15] interface to parse our XML documents, supportingboth SAX and the document object model (DOM) [9].

4.5. XML-based data specification

In choosing how our system exchanges results, trade-offs need to be made betweenefficiency and flexibility. The efficiency of the communication is maximised by bulkbinary transfers where sender and receiver know everything about the transfer beforeit begins. Flexibility is maximised by using ASCII text with redundancy and syntaxchecking. We believe that the key to the trade-off between efficiency and flexibility isto separate command and control from large binary objects. Our solution is that theresult information can be encoded in XML when there is no large binary object. Butif there is a large result, we will place it on a cache disk at the local resource, and sendthe URL back with the agent, so the user can download the result later.

786 Y. Yang et al. / Parallel Computing 28 (2002) 773–792

Page 15: Agent based data management in digital libraries

In our system we use XML to encode system structure as metadata. The metadataconsists of four tables. The ‘Track’ table houses information about the image such asits name, date of acquisition, unique id, width, height, and number of channels. The‘Coords’ table contains the latitude and longitude coordinates of the four vertices ofthe image. In the ‘File’ table the filenames of the files constituting the image are re-corded, and finally the ‘Stored’ table contains the information about where the imageis actually stored, that is, one of the data servers that comprises the distributed DL.The DTD for this metadata is presented above, also demonstrated is an XML doc-ument produced according to the DTD (see Fig. 5). Additional details about theXML implementation can be found in [2].

We have developed a visual interface called XML Viewer for integrated browsingof one or more XML documents. Fig. 6 presents a screenshot of our interface. Ourinterface is divided into two regions. The left frame contains a tree representation ofthe XML document that can be expanded as needed. The right frame displays anydetailed content associated with XML elements when the mouse is moved over thecorresponding node. The hiding of detailed information simplifies the representationof complex XML documents.

4.6. Application scenario

The current system can be employed in a wide range of application domains, suchas the analysis of multi-temporal images corresponding to changes in the ecology ofa particular region, and studies of environmental pollution. SARA images can alsobe compared based on phase and amplitude differences of the backscatter radiation,

Fig. 5. SARA DTD.

Y. Yang et al. / Parallel Computing 28 (2002) 773–792 787

Page 16: Agent based data management in digital libraries

to study seismic or volcanic processes, motions of ice-sheets or glaciers, or other sim-ilar geological events. Support for real time processing can facilitate frequent over-passing of satellites over a given region in case of natural disasters such as forestfires or flash floods. The agent-based approach provides a useful system for enablingsuch applications to be more effectively deployed due to the reasons mentionedabove, and involves an integration of data resources, various types of physical stor-age media, and various compute servers and analysis algorithms.

Fig. 7 illustrates a possible use case, with the sequence of steps necessary to pro-cess a given user request. A user defines a query, either an SQL query or a referenceto an analysis algorithm with the support of the UIA. This is facilitated by the useruploading a Java Applet from a Web server. This query is wrapped and sent as mo-bile code within the URA to the first point on its itinerary (step 1). At the informa-tion source, the agent is received by the LSA, which allocates a security-level to theincoming agent, based on the site from which the agent has been received. The LSAalso registers this incoming agent with the agent context existing on the local server,and alerts all other agents that a new request agent has arrived (steps 2 and 3). Oncethis agent has been registered, a reference to it is passed to the LRA, which translatesthe query carried by the URA to a query that can be executed on the local data re-pository. In case of an executable query (executable code), the LRA passes an exe-cution context to the incoming agent. Interaction between the URA and LRA isbased on the data model outlined in Section 4.4––whereas the query is encoded inXSIL [25]. The LRA is also responsible for ensuring that the database is on-lineand can accept queries from incoming agents (step 4). The results of the query arethen stored locally at the information resource, with the user-ID, a query-ID, andtime of the query by the LAA (step 5). The LAA also determines the next site thatURA must visit––either based on information encoded within the URA, or informa-tion maintained locally by the LAA (step 6). The URA can contain a maximum hop

Fig. 6. XML browser for SARA document.

788 Y. Yang et al. / Parallel Computing 28 (2002) 773–792

Page 17: Agent based data management in digital libraries

count––which provides an upper bound on the number of hosts a URA is willing tovisit before it is returned to the user who generated it. After visiting multiple sites, theURA keeps the results of analysis on the local information server, at which the re-sults were generated, and returns back to the user with a list of URLs of the results(steps 7 and 8). The user then has the option to upload all or some of the results ofanalysis.

A user may also employ an intermediate server to forward user requests to the in-formation repository, also illustrated in Fig. 7. In this case, the intermediate server canalso maintain profiles across a number of users, and support a cache of most used re-quests. This server is primarily provided to support a better scaling of the system.

5. Conclusions and further work

We provide an agent based framework for managing large data sets, and under-taking parallel queries on data repositories using mobile agents. Our framework isaimed at enabling multiple information repositories to co-exist, and to support dataintegration from multiple servers based on a system wide data model. The ability todecide when to migrate code vs. data is primarily user driven at present, and can besupported by system monitoring tools that gather statistics of network latency and

Fig. 7. Usage scenario for the SARA system.

Y. Yang et al. / Parallel Computing 28 (2002) 773–792 789

Page 18: Agent based data management in digital libraries

workload on user servers. Such tools can be easily integrated into the current systemat each LIA or UIA.

The agent based approach enables us to divide functionality into a set of collab-orating agents, where each agent is responsible for a part of the total work. The roleplayed by each agent can be modified over time, and agent interaction can also varydepending on usage characteristics of the system. We have provided infrastructureissues in establishing a DL that can cater for multiple information sources, usersand platforms. An agent based SARA system has been used as a proof-of-conceptapplication to outline the principles involved.

Appendix A

A sample of a SARA message:

<?xml version¼ 01.00><!DOCTYPE message SYSTEM 00message.dtd00><Message type¼ 00request00 id¼ 00CLIENTID00><Context sender¼ 00//cscardiff/scmly00 receiver¼ 00//caltech/laa100 returnby¼ 0028/02/01 5pm 00/><Content><itinerary><server>Lecce</server><server2>Caltech</server2></itinerary><querydef>&trackquery;</querydef><results>[email protected]</results></Content></Message>

Track query definition:

<?xml version¼ 01.00><!DOCTYPE trackquery SYSTEM 00trackquery.dtd00><trackquery><Condition><and><MoreThanOrEqual><left>latitude.upperleft</left><right>50</right></MoreThanOrEqual><MoreThanOrEqual><left>longtitude.upperleft</left><right>50</right></MoreThanOrEqual>

790 Y. Yang et al. / Parallel Computing 28 (2002) 773–792

Page 19: Agent based data management in digital libraries

<MoreThanOrEqual><left>latitude.upperright</left><right>85</right></MoreThanOrEqual><LessThanOrEqual><left>longtitude.upperright</left><right>75</right></LessThanOrEqual><LessThanOrEqual><left>latitude.lowerleft</left><right>105</right></LessThanOrEqual><MoreThanOrEqual><left>longitude.lowerleft</left><right>60</right></MoreThanOrEqual><LessThanOrEqual><left>latitude.lowerright</left><right>150</right></LessThanOrEqual><LessThanOrEqual><left>longitude.lowerright</left><right>150</right></LessThanOrEqual></and></Condition></trackquery>

Message type represents intentions such as request, response, failure, and re-fuse explicitly and allows the system to monitor and control the progress of the in-teraction. For example, we can define a message for a request to search for tracks,and another message for information passing to return tracks. Context is usedto identify the sender, the intended recipient of the message or originator for for-warded messages, using some form of local, regional, or global naming scheme. Re-turnby sets a deadline for user’s waiting time. Content defines itinerary of agentand user’s request wrapping in XML, as well as forms of returning results

References

[1] N.R. Adam, V. Atluri, I. Adiwijaya, SI in digital libraries, CACM 43 (6) (2000) 64–72.

[2] G. Aloisio, G. Milillo, R.D. Williams, An XML architecture for high-performance Web-based

analysis of remote-sensing archives, Future Generation Computer Systems 16 (1999) 91–100.

[3] D. Andresen et al., The WWW prototype of the Alexandria Digital Library, in: Proceedings of

ISDL’95: International Symposium on Digital Libraries, Japan, 22–25 August 1995, pp. 17–27.

Y. Yang et al. / Parallel Computing 28 (2002) 773–792 791

Page 20: Agent based data management in digital libraries

[4] W. Birmingham, E. Durfee, T. Mullen, M. Wellman, The distributed agent architecture of the

University of Michigan digital library, IEEE Computer (May) 1996, special issue on building large-

scale digital libraries.

[5] A.P. Bishop, Digital libraries and knowledge disaggregation: The use of journal article components,

in: Proceedings of the ACM Digital Libraries 1998 Conference, ACM, New York, 1998.

[6] M. Christel, D. Martin, Information visualization within a digital video library, Journal of Intelligent

Information Systems 11 (3) (1998) 235–257.

[7] W3C, Document content description for XML. Available from http://www.w3.org/TR/NOTE-

dcd.

[8] European Commission, Network of excellence in digital libraries. Available from http://

www.ercim.org/delos/.

[9] W3C, Document object model. Available from http://www.w3.org/DOM/.

[10] Foundation of Intelligent Physical Agents. Available from http://www.fipa.org/.

[11] E. Fox, Digital libraries, IEEE Computer 26 (11) (1993) 79–81.

[12] J. Frew, M. Freeston, N. Freitas, L. Hill, G. Janee, K. Lovette, R. Nideffer, T. Smith, Q. Zheng. The

Alexandria Digital Library Architecture. in: C. Nikolaou, C. Stephanidis (Eds.), Proceedings of the

Second European Conference on Research and Advanced Technology for Digital Libraries

(ECDL’98), Heraklion, Crete, Greece, September 1998, pp. 61–73.

[13] The NASA/JPL Imaging Radar. Available from http://southport.jpl.nasa.gov/.

[14] T. Finin, Y. Labrou, J. Mayfield, in: J.M. Bradshaw (Ed.), KQML as an Agent Communication

Language, in Software Agents, MIT Press, Cambridge, MA, 1997, pp. 291–316.

[15] Sun Microsystems, Java API for XML Parsing. Available from http://java.sun.com/

products/xml/index.html.

[16] G. Marchionini, in: A. Kent (Ed.), Digital Library Research and Development, Encyclopedia of

Library and Information Science. Available from http://www.glue.umd.edu/march/digi-

tal_library_R_and_D.html.

[17] A. Paepcke, M. Baldonado, C. Chang, S. Cousins, H. Garcia-Molina, Using distributed objects to

build the Stanford digital library Infobus, IEEE Computer (February) 1999.

[18] A. Rasmusson, T. Olsson, P. Hansen, A Virtual Community Library: SICS digital library

infrastructure project, research and advanced technology for digital libraries, in: Proceedings of

second European Conference, ECDL98, Heraklion, Crete, 19–23 September, 1998.

[19] W3C, Resource description framework (RDF) schema specification. Available from http://

www.w3.org/TR/WD-rdf-schema/.

[20] Special issue on digital libraries, Communications of the ACM 38 (4) (1995), ACM, New York.

[21] Special issue on digital libraries, IEEE Computer Magazine (May) 1996.

[22] ObjectSpace, Inc., The Voyager Mobile Object System. Available from http://www.object-

space.com/.

[23] R. Wilensky, UC Berkeley’s Digital Library project, CACM 38, 4, April 1995.

[24] W3C, XML-Data. Available from http://www.w3.org/TR/1998/NOTE-XML-data.

[25] K. Blackburn, A. Lazzarini, T. Prince, R. Williams, XSIL: extensible scientific interchange language,

in: Proceedings of HPCN’99, Amsterdam, February 1999, pp. 513–524.

792 Y. Yang et al. / Parallel Computing 28 (2002) 773–792