This manuscript has been reproduced from the microfilm master. UMI films the
text directly from the original or copy submitted. Thus, some thesis and
dissertation copies are in typewriter face, while others may be from any type of
computer printer.
The quality of this reproduction is dependent upon the quality of the copy
submitted. Broken or indistinct print, colored or poor quality illustrations and
photographs, print bleedthrough, substandard margins, and improper alignment
can adversely affect reproduction.
In the unlikely event that the author did not send UMI a complete manuscript and
there are missing pages, these will be noted. Also, if unauthorized copyright
material had to be removed, a note will indicate the deletion.
Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning
the original, beginning at the upper left-hand corner and continuing from left to
right in equal sections with small overlaps. Each original is also photographed in
one exposure and is included in reduced form at the back of the book.
Photographs included in the original manuscript have been reproduced
xerographically in this copy. Higher quality 6" x 9" black and white photographic
prints are available for any photographs or illustrations appearing in this copy for
an additional charge. Contact UMI directly to order.
Bell & Howell Information and Learning 300 North Zeeb Road, Ann Arbor, MI 48106-1346 USA
800-521-0600
A Robust Distributed Storage System for Large Information Retrieval Applications
Antonio S. Cheng
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering University of Toronto
© Copyright by Antonio S. Cheng 1998
National Library of Canada
Acquisitions and Bibliographic Services
395 Wellington Street, Ottawa ON K1A 0N4, Canada
The author has granted a non-exclusive licence allowing the National Library of
Canada to reproduce, loan, distribute or sell copies of this thesis in microform,
paper or electronic formats.
The author retains ownership of the copyright in this thesis. Neither the thesis
nor substantial extracts from it may be printed or otherwise reproduced without
the author's permission.
A Robust Distributed Storage System for Large Information Retrieval
Applications
Antonio S. Cheng
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
1998
Abstract
This thesis presents an architecture and design for a Robust Distributed Storage System
(RDSS) targeted at digital library, multimedia, and information retrieval applications,
and implemented on networks of low-cost workstations or personal computers. In partic-
ular, the system addresses problems associated with managing large distributed indices
in the context of these applications. The RDSS provides a framework for scaling a single-
node server to create a reliable distributed system. In addition to performance benefits
achieved by distributing these applications, the RDSS provides efficient data mirroring,
on-line failure recovery, and node management.
Acknowledgements
I would like to express my deepest thanks to my supervisor, Professor Charles L. A.
Clarke, for his guidance and support throughout my graduate study at University of
Toronto. Professor Clarke has been very kind in providing me with research direction
and very patient in correcting my mistakes. His endless support throughout the research
and the thesis write-up was crucial to the completion of this thesis. I sincerely hope that
the result of this study will benefit the future evolution of the MultiText project.
I would also like to thank my examination committee members: Professor S. A. Bortoff
(chair), Professor H. M. Hinton, and Professor M. Stumm. In addition, I would like to
express my gratitude to Michael Van Dam for helping me with the thesis layout, and to
Sam Griffiths and Dr. Rob Irish for proofreading the thesis on short notice.
Thanks to my fellow electrical graduate students for making the learning experience
enjoyable. Thanks are also due to other members of the Electrical Engineering Computer
Group (EECG), for the knowledge (technical or otherwise) that I have learned from them
during my time with the group. Also, my sincere gratitude to the technical support staff
and the administrative staff in the electrical engineering department.
Last but not least, I would like to thank my family and friends for their encouragement.
This work was supported by an NSERC PGS-A postgraduate scholarship.
Contents
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 MultiText Project . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Distribution Problem . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Index Mirroring Problem . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Design Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Overview of the Robust Distributed Storage System (RDSS) . . . . . . . 10
1.4.1 System Environment . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.2 Mirroring Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 Application Environment 16
2.1 Application Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Data Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
iv
2.2.1 Data Addressing Scheme . . . . . . . . . . . . . . . . . . . . . . .
2.2.2 Capacity Definition . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2.3 Data Organization Requirements Summary . . . . . . . . . . . . .
2.3 Application Server Design Requirements . . . . . . . . . . . . . . . . . .
2.3.1 Application Server Interfaces Requirements . . . . . . . . . . . . .
2.3.2 Application Server Protocols Requirements . . . . . . . . . . .
2.4 Application Front-End Design Requirements . . . . . . . . . . . . . . . .
2.4.1 Application Query Front-End . . . . . . . . . . . . . . . . . . . .
2.4.2 Application Update Front-End . . . . . . . . . . . . . . . . . . . .
3 Architectural Overview
3.1 Design Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1.1 Design Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1.2 Design Assumptions . . . . . . . . . . . . . . . . . . . . . . . . .
3.2 System Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2.1 Server Node Components . . . . . . . . . . . . . . . . . . . . . . .
3.2.2 Marshaller-Dispatcher Components . . . . . . . . . . . . . . . . .
3.2.3 Non-volatile Storage Management . . . . . . . . . . . . . . . . . .
3.3 Node Configuration Monitor (NCM) Module . . . . . . . . . . . . . . . .
3.3.1 Node Contact List . . . . . . . . . . . . . . . . . . . . . . . . . .
v
3.3.2 Guardian Node . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.3.3 Server Nodes Synchronization . . . . . . . . . . . . . . . . . . . .
3.3.4 Crash Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.4 Node State Machine (NSM) Module . . . . . . . . . . . . . . . . . . . . .
3.4.1 NSM Initialization State . . . . . . . . . . . . . . . . . . . . . . .
3.4.2 NSM Steady State . . . . . . . . . . . . . . . . . . . . . . . . . .
3.4.3 NSM Modifying State . . . . . . . . . . . . . . . . . . . . . . . .
3.4.4 NSM Degraded State . . . . . . . . . . . . . . . . . . . . . . . . .
3.4.5 NSM Failed State . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.5 Application Servers Manager (ASM) Module . . . . . . . . . . . . . . . .
3.6 Hardware Abstraction Layer (HAL) Module . . . . . . . . . . . . . . . .
3.7 Marshaller-Dispatcher Port Manager (MDPM) Module . . . . . . .
3.8 Query Marshaller-Dispatcher (QMD) Module . . . . . . . . . . . . . . .
3.9 Update Marshaller-Dispatcher (UMD) Module . . . . . . . . . . . . . . .
3.10 Sample Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.10.1 Successful System Startup . . . . . . . . . . . . . . . . . . . . . .
3.10.2 Successful Queries and Updates . . . . . . . . . . . . . . . . . . .
3.10.3 Successful Recovery . . . . . . . . . . . . . . . . . . . . . . . . . .
3.10.4 Changing Storage Size . . . . . . . . . . . . . . . . . . . . . . . .
vi
4 RDSS Detailed Design 60
4.1 Target Platform Assumptions . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2 Node Configuration Monitor (NCM) Detailed Design . . . . . . . . . . . 61
4.2.1 NCM Startup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.2 Detecting Node Failure . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.3 Initiating Remodelling Synchronization . . . . . . . . . . . . . . . 64
4.2.4 Completing Remodelling . . . . . . . . . . . . . . . . . . . . . . . 65
4.2.5 Node Reset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3 Node State Machine (NSM) Detailed Design . . . . . . . . . . . . . . . . 66
4.3.1 NSM Startup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3.2 From Initialization State to Steady State . . . . . . . . . . . . . . 67
4.3.3 From Initialization State to Degraded State . . . . . . . . . . . . 67
4.3.4 From Steady State to Modifying State . . . . . . . . . . . . . . . . 68
4.3.5 From Steady State or Modifying State to Degraded State . . . . . 68
4.3.6 From Modifying State to Steady State . . . . . . . . . . . . . . . . 69
4.3.7 From Degraded State to Steady State . . . . . . . . . . . . . . . 69
4.3.8 From Degraded State to Failed State . . . . . . . . . . . . . . . . 69
4.3.9 NSM Termination . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.10 On-line Removal of a Node . . . . . . . . . . . . . . . . . . . . . 70
4.3.11 On-line Addition of a New Node . . . . . . . . . . . . . . . . . . . 71
vii
4.3.12 Relocating Data from a Deleted Node to a New Node . . . . . . . 71
4.4 Application Servers Manager (ASM) Detailed Design . . . . . 72
4.4.1 Data Mirroring . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.4.2 Remodelling Interface . . . . . . . . . . . . . . . . . . . . . . . . 73
4.4.3 Controlling Port Visibility . . . . . . . . . . . . . . . . . . . . . . 76
4.4.4 Persistent Information . . . . . . . . . . . . . . . . . . . . . . . . 77
4.4.5 ASM Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.5 Hardware Abstraction Layer (HAL) Detailed Design . . . . . . . . . . . . 79
4.5.1 Virtual Disk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.5.2 Dynamic Repartitioning . . . . . . . . . . . . . . . . . . . . . . . 80
4.5.3 HAL Library Interfaces . . . . . . . . . . . . . . . . . . . . . . . 81
4.6 Marshaller-Dispatcher Port Manager (MDPM) Detailed Design . . . . . . 82
4.6.1 Port Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.6.2 Observing RDSS state changes . . . . . . . . . . . . . . . . . . . 84
4.6.3 Maintaining the Node List . . . . . . . . . . . . . . . . . . . . . . 84
4.6.4 Maintaining the Query Connection List . . . . . . . . . . . . . . . 84
4.7 Update Marshaller-Dispatcher (UMD) Detailed Design . . . . . . . . . . 85
4.7.1 Update Client Trust Model . . . . . . . . . . . . . . . . . . . . . 85
4.7.2 Default Application Update Front-End . . . . . . . . . . . . . . . 86
4.7.3 Update Transaction Integrity . . . . . . . . . . . . . . . . . . . . 87
viii
4.7.4 RDSS Update Marshaller-Dispatcher Library . . . . . . . . . . . . 87
4.8 Query Marshaller-Dispatcher (QMD) Detailed Design . . . . . . . . . . . 89
4.8.1 Query Marshaller-Dispatcher Library (QMD-Lib) . . . . . . . . . 89
5 Implementation Status 93
5.1 RDSS Prototype Implementation . . . . . . . . . . . . . . . . . . . . . . 94
5.1.1 Prototyping Framework . . . . . . . . . . . . . . . . . . . . . . . 94
5.1.2 Implementation Issues . . . . . . . . . . . . . . . . . . . . . . . . 96
5.2 Simple Text Snippet Server . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2.1 User Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2.2 Storage Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.2.3 Storage Resizing . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.3 Conversion of the MultiText System . . . . . . . . . . . . . . . . . . . . 101
6 Conclusion 103
6.1 Contribution of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
A Application Update Protocol BNF 108
B Hardware Abstraction Layer - Application Server Interface
C Query Marshaller-Dispatcher Library Interface
D Update Marshaller-Dispatcher Library Interface
Glossary
Bibliography
List of Tables
2.1 Application Server Update Protocol . . . . . . . . . . . . . . . . . . . . 28
4.1 Application Servers Manager Interfaces . . . . . . . . . . . . . . . . . . . 79
List of Figures
1.1 Architecture of the MultiText System . . . . . . . . . . . . . . . . . . . . 4
1.2 System View of RDSS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Data Replication in the RDSS . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Load Balancing of the Scattered Mirroring Strategy . . . . . . . . . . . . 15
2.1 Stand-Alone Application Server . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Application Distributed View with Front-End Modules . . . . . . . . . . 30
3.1 Network Overview of the RDSS Architecture . . . . . . . . . . . . . . . . 35
3.2 State Transitions Diagram of an RDSS Server Node . . . . . . . . . . . . 46
3.3 Sub-state Transitions with the Degraded NSM State . . . . . . . . . . . . 51
4.1 RDSS Server Node Components . . . . . . . . . . . . . . . . . . . . . . . 75
4.2 RDSS Marshaller-Dispatcher Node Components . . . . . . . . . . . . . . 83
5.1 Physical Storage Layout of Simple Text Snippet Server . . . . . . . . . . 98
5.2 Mapping Table Relocation during Storage Expansion . . . . . . . . . . . 101
xii
Chapter 1
Introduction
With the growth of the information age, the demand for timely information and data
also grows in leaps and bounds. Many technological trends are driven by the need to
satisfy this demand. From hardware advances like higher network bandwidth and data
storage capacity, to software developments in data mining and digital libraries, the driving
requirement is to get more information to end users quicker and cheaper.
The purpose of this thesis is to propose, develop and refine a flexible framework for a Ro-
bust Distributed Storage System (RDSS) that will satisfy this driving requirement. The
RDSS is a storage management framework targeted toward digital library, multimedia
and information retrieval applications. It provides the benefits of distributed computing
and data replication. Prototypes of the RDSS components are developed as part of the
design process.
1.1 Background
Most work in reliable, distributed database systems has targeted relational databases
and file systems [HHB96]. The work described in this thesis is aimed at digital libraries,
multimedia, and information retrieval applications. The focus of these applications is on
search and retrieval. Because of this focus, it is relatively easy to structure distributed
versions of these applications. Presently, the most visible applications in this class are
World Wide Web search engines (e.g., AltaVista, http://www.altavista.digital.com,
and Excite, http://www.excite.com).
With the growth of the World Wide Web, and other new digital library services such as
video-on-demand, the need for large-scale digital library systems is on the rise [Les97].
In 1996, the largest disk storage array was 20 terabytes in size. By the time the Sloan Digital
Sky Survey is completed, it will contain about 200 terabytes of digital data [Sky97]. It
is also estimated that the scanned version of the US national library will require a digital
library of 1 petabyte in size [Les97] (1 petabyte = 1,048,576 gigabytes, for comparison).
While building faster and bigger server hardware and storage subsystems is possible,
emerging database applications require efficient storage but cannot justify the cost of a
single computer super server. In 1998, a high bandwidth storage subsystem, which is
capable of storing 200 gigabytes or more, costs more than US$80,000 without disk or
server (Digital StorageWorks Enterprise Storage Array 10000). The goal of this study
is to provide a distributed framework, such that almost any stand-alone digital library
application can be easily converted to a scalable distributed system using a network of
low-cost personal computers.
1.1.1 MultiText Project
An example of a digital library is the parent project of this thesis - the MultiText
Project. The MultiText system allows search and retrieval of relevant text passages us-
ing innovative ranking methods - shortest sub-string ranking with passage-based refine-
ment [CCPT97], along with support for semi-structured data [CCB95a, CCB95bl. The
content of the text can be net-news, email, postscript or any other text based documents
in various formats and languages.
The current MultiText architecture consists of multiple replicated index engines with
one or more text servers on a local area network. Incoming client queries are routed
by a marshaller/dispatcher to the index engines in the network. The results from the
index engines are used for text passage retrieval from the text servers. From the
query clients' point of view, the whole network of index engines, the text servers and
the marshaller/dispatcher functions as a single search application. Figure 1.1 shows the
architecture of the MultiText system.
The issues of concern to the MultiText Project include data distribution, load balancing,
fault tolerance, fast update, compression, document structure, ranking and user interac-
tion [CBCG95]. The RDSS architecture proposed in this thesis addresses the first three
issues.
1.1.2 Distribution Problem
Compared to the retrieval server (multimedia or text), index engine operations are
computationally intensive. An expensive large RAID (Redundant Arrays of Inexpensive Disks)
subsystem may solve the storage problem, but it cannot solve the problem of increased
computational demands. Thus, an index engine supporting a sizable information man-
agement system would benefit in both cost and performance by spreading both data and
processing loads across many nodes.
Many large databases and digital libraries, including MultiText, utilize replicated nodes
(replicated index engines in MultiText) to achieve better performance. Many Read One
Write All (ROWA) techniques [HHB96] have been developed to handle the data con-
sistency problem in a distributed system. Data queries in some replicated systems are
served by multiple servers, each containing the same data set. The workload of each
query is not shared among nodes. As a result, although the throughput of the search
may be sped up by replicating the index list to multiple locations, the response time of
each query is not improved.
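To make this trade-off concrete, the following sketch contrasts Read One Write All access with per-query work. It is a toy model: the ReplicatedIndex class, its dictionary-based index, and the postings format are invented for illustration and are not taken from MultiText or any system cited here.

```python
import random

class ReplicatedIndex:
    """Toy ROWA replica set: every node holds a full copy of the index."""

    def __init__(self, num_nodes):
        self.nodes = [dict() for _ in range(num_nodes)]

    def write(self, term, postings):
        # Write All: an update must reach every replica so that any
        # replica can later answer a read consistently.
        for node in self.nodes:
            node[term] = postings

    def read(self, term):
        # Read One: any single replica can serve the query, so total
        # throughput scales with the number of replicas -- but each
        # query is still answered by one node doing all the work, so
        # the response time of an individual query is not improved.
        return random.choice(self.nodes).get(term)
```

A three-node instance of this kind can answer three independent queries concurrently, but each one only as quickly as a single server would.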
To make a completely distributed version of a search-enabled digital library, both the
Figure 1.1: Architecture of the MultiText System
index engine and the retrieval server need to be distributed across multiple processing
nodes. However, the data storage format in an index engine is different from that in its as-
sociated retrieval server. Therefore, a content independent distributed storage framework
is needed such that the index engine can use one storage format in one framework and
the associated retrieval (multimedia or document) server can employ a different storage
format in another framework.
1.1.3 Index Mirroring Problem
As digital libraries become more critical to users, fault tolerance becomes indispensable.
At the single node level, various RAID [CLG+94] and similar technologies have improved
storage reliability. At the system level of a distributed server, mirroring or correction
code based backup methods [GCCT96, BS92] have been proposed and developed.
However, unlike blocks in a storage device, index lists in an index engine cannot be
arbitrarily partitioned. For an index search to be completed, an index engine needs
to operate on index lists describing whole documents. Thus, a blind block scattering
strategy (e-g., the Tiger Video server [BBD+96]) for mirroring, where data is replicated
and scattered across the network in fixed blocks regardless of content, would not work.
While a blind scattering strategy for mirroring does not work for the index engine, a
strategy of mirroring all data to a single location (e.g., HotBot web search system [Ink96])
would upset load balancing in the event of a node failure. To circumvent this problem, this
thesis proposes a data-entry-based scattered-mirroring strategy. It avoids load balancing
problems when the backup data is needed, while keeping related information together.
In addition, it is flexible enough that various retrieval servers associated with the index
engine can also utilize the same mirroring scheme on a separate setup of the same RDSS
framework. Section 1.4.2 contains a more detailed description of the mirroring scheme
in the proposed storage system.
1.2 Related Work
There have been many studies on distributed databases [Bur90, CM96, Lin91, TG96],
and information retrieval systems utilizing clustering [BBD+96, DANO91, Ink96, SO95,
Sta90]. The parent project of this study, MultiText [CBCG95, CCPT97], also uses multiple
search and retrieval servers in its setup.
Many of these studies, including the information retrieval system by Tomasic and Garcia-
Molina [TG96] and the current MultiText system [CCPT97] have distributed retrieval
servers and index engines. Some other studies, like the distributed indexing system by
Danzig et al. [DANO91], focus on the index engine only and do not address the availability
issues. Commercial systems like the Tiger Video Server [BBD+96] and the HotBot
server [Ink96] include a backup strategy, but are narrow in their target applications.
None of them provides a complete solution framework for distributing a generic digital
library application that also includes on-line management with fault tolerance and load
balancing capabilities.
For the related work with respect to reliability, there is the proven RAID technology,
which is summarized by Chen et al. [CLG+94]. Other studies on reliable distributed
storage are mostly file-system related [BS92, GCCT96, GJSO91, GNA+97, LGG+91,
LS90].
Discussions on the problems associated with developing distributed applications, including
distributed agreement and transaction processing, are well documented by Coulouris
and Dollimore [CD88] and by Gray and Reuter [GR93]. The two-phase distributed
agreement and transaction protocol has been indispensable in the development of the
RDSS framework in this study.
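The two-phase agreement just mentioned can be sketched as follows. This is a minimal illustration under strong simplifying assumptions (reliable message delivery, no coordinator failure, no timeouts), and the Participant class and two_phase_commit function are hypothetical names for this sketch, not actual RDSS interfaces.

```python
class Participant:
    """One node taking part in a distributed transaction (toy model)."""

    def __init__(self):
        self.state = "init"

    def prepare(self, update):
        # Phase 1: stage the update and vote on whether it can be applied.
        self.pending = update
        self.state = "prepared"
        return True  # this toy participant always votes "commit"

    def commit(self):
        # Phase 2 (all voted yes): make the staged update durable.
        self.state = "committed"

    def abort(self):
        # Phase 2 (someone voted no): discard the staged update.
        self.pending = None
        self.state = "aborted"

def two_phase_commit(participants, update):
    votes = [p.prepare(update) for p in participants]  # phase 1: collect votes
    if all(votes):                                     # phase 2: unanimous yes
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:                             # phase 2: any no aborts all
        p.abort()
    return "aborted"
```

The key property is that every participant reaches the same outcome: either all apply the update or none do.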
Finally, background information on digital library applications can be found in Lesk's
book [Les97] or in Witten et al. [WMB94]. Also, a recent list of experimental and
commercial reliable distributed systems can be found in Helal, Heddaya and Bhargava's
book [HHB96] on replication techniques.
1.3 Design Objectives
Applications in the RDSS targeted design domain, index engine or retrieval server, re-
quire on-line retrieval (query) and batched data modification (update) capability. Any
application that falls within this domain could be adapted, in principle, to become an
RDSS-enabled distributed application. The concise specification of the application do-
main is given in chapter 2.
In this domain, the query clients are generally considered to represent external users,
while the update clients usually represent administrators or trusted users. From this
point on, the term 'user' refers to a 'query client' unless stated otherwise.
An ideal distributed storage system would support all distributed applications with high
efficiency and without discrimination. Also, like RAID storage, the distributed layer
would be transparent to the application. However, in reality the desires for efficiency
and transparency are in conflict. Thus, compromises and decisions are necessary
in order to make the final system practical. Attempts are made to make sure these
constraints (see chapter 2) are minimal and well-defined.
The main architectural design criteria are listed in this section. They are:
- location transparency,
- reliability,
- maintainability and extensibility,
- load balancing, and
- query performance.
While satisfying these design objectives, the RDSS architecture is designed to minimize
the restrictions on the target application and maximize the ease of its implementation.
Location Transparency
One of the main design goals of the RDSS as a distributed architecture is to maintain
a high degree of transparency. From a query client's point of view, the RDSS-enabled
application should be completely location transparent. That is, the multi-computer envi-
ronment is hidden and the application would appear to an external client as a stand-alone
application on a single server.
Similarly, the update client should be able to manage the whole distributed system as
an integrated unit. Depending on the implementation of the final system, however, the
RDSS is flexible enough to allow for individual node management for both the system
administrator and the update client.
In terms of the application, if the front-end marshaller-dispatcher module is excluded,
its external environment should behave like the external environment of an equivalent
stand-alone single server.
Reliability
The second main design objective of the RDSS architecture is reliability. To achieve
this goal, the RDSS architecture needs to provide better fault tolerance and higher re-
coverability compared to a stand-alone single server application. The ultimate goal is to
incorporate enough flexibility such that various fault tolerances and recovery technologies
and policies may be integrated into the system without affecting the application or the
users' environment.
In the current design, a single level of data redundancy is built into the architecture.
Only a node level failure is considered to be the responsibility of the RDSS; however, the
reliability of each node within the RDSS can be improved using RAID storage.
Maintainability and Extensibility
To further improve an application's availability over its corresponding stand-alone single
computer version, on-line administration is part of the RDSS. Reconfiguration of the
RDSS environment can be done while users' queries are being processed.
More specifically, on-line removal of a failed node and on-line addition of a new node are
supported. The latter operation allows the system to expand without any interruption
to the application's availability.
Load Balancing
The RDSS attempts to provide good load balancing among all nodes, whether or not
backup (mirrored) data is in use. During normal operations, with no node failure,
data entries are spread evenly among all nodes. In the event of a node failure, the
entry-based scattered-mirroring strategy keeps the workload of the remaining nodes balanced.
Load balancing minimizes the performance degradation in the event of a node failure.
The objective is to maximize the total serving capacity of the RDSS network, and to
reduce the performance impact of the reliability objective.
Query Performance
While adding the above capabilities, the RDSS should have minimal impact on query
performance. One of the main benefits of a multi-computer distributed application is the
increased throughput relative to a single computer system. Regardless of the through-
put gain due to its distributive nature, the RDSS attempts to minimize the penalty on
individual node throughput and latency on every query request.
1.4 Overview of the Robust Distributed Storage System (RDSS)
The RDSS architecture is designed to counter the distribution and replication problems
of an index engine, while meeting the design objectives listed in the previous section. The
rest of the thesis presents the detailed constraints, design, and behaviour of the system.
This section offers a brief summary of its features.
1.4.1 System Environment
Figure 1.2 shows the network view of a sample RDSS setup. Each RDSS server node
runs the same RDSS and application software and manages one or more physical storage
units. In addition to the RDSS server nodes, a marshaller-dispatcher module performs
routing of requests and responses to and from the clients. The current RDSS design
limits the number of update clients to one and relies on external means to guarantee the
trustworthiness of the update client. However, no limit is placed on the number of query
clients supported.
Each server node and the marshaller-dispatcher node run on separate workstations. However,
there is no technical reason (other than performance issues) preventing the
marshaller-dispatcher module from co-locating with a server node on the same workstation.
During normal operation, a query request from a query client is distributed across the
system to individual server nodes by the marshaller-dispatcher module to which the
query client is connected. The responses from the server nodes are then gathered by
the marshaller-dispatcher module, and the amalgamated result is returned to the query
client. The operation for a trusted update request is similar to this, except that the
replication strategy of the RDSS is enforced on any modification of data to ensure that
data consistency is maintained.
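The scatter/gather query path just described can be sketched as below. The node interface (a search method returning (document, score) pairs) and the merge-by-score step are assumptions made for illustration; the real RDSS defers both the query protocol and the result format to the application and its front-end modules.

```python
from concurrent.futures import ThreadPoolExecutor

def marshal_query(query, nodes):
    """Fan a query out to every server node and amalgamate the replies."""
    # Scatter: each node searches only the partition of data it holds.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda node: node.search(query), nodes))
    # Gather: merge the partial hit lists into one ranked result, so the
    # query client sees the whole network as a single stand-alone server.
    merged = [hit for part in partials for hit in part]
    merged.sort(key=lambda hit: hit[1], reverse=True)  # rank by score
    return merged
```

From the client's point of view, only the merged list is visible; the distribution of work across nodes is hidden, which is the location-transparency objective of section 1.3.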
Figure 1.2: System View of RDSS
1.4.2 Mirroring Strategy
A mirroring strategy is chosen as the basis for fault tolerance in the RDSS because of its
simplicity and its quick recovery ability. More importantly, mirrored data can be used
directly by the application servers without additional processing cost. This means that
the throughput performance of the system would be less affected in the degraded mode.
Instead of a data-blind strategy, the RDSS requires the application under its influence
to associate data entries with a range of addresses on a linear address space (0 to 2K). The
details of the RDSS data organization requirements are given in chapter 2.
In the current design, for simplicity, only one level of mirroring is included in the system.
Each server mirrors its primary data to the rest of the server nodes in the system.
Data scattering (striping) is done along the boundary of data entries, such that each
mirrored data entry can be accessed as a whole from a single mirroring location without
any recombination. The reason for striping the secondary data across N-1 server nodes
is to improve performance during the degraded mode (in an N-node system). If one
server becomes unavailable, each remaining node will only have to carry an additional
1/(N-1) of the workload.
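The placement rule behind this 1/(N-1) figure can be sketched with a pair of placement functions. The modulo-based mapping below is a hypothetical stand-in for the RDSS's actual address-range organization; it only demonstrates the invariant that a mirror never lands on its own primary node and that one node's mirrors are spread over the other N-1 nodes.

```python
def primary_node(entry_id, num_nodes):
    # Spread primary copies of data entries evenly over all N nodes.
    return entry_id % num_nodes

def mirror_node(entry_id, num_nodes):
    # Scatter each node's mirror copies over the other N-1 nodes
    # (assumes num_nodes >= 2): the offset cycles through 1..N-1, so
    # the mirror is never the primary, and when a node fails each
    # survivor picks up only about 1/(N-1) of the failed node's load.
    p = primary_node(entry_id, num_nodes)
    offset = 1 + (entry_id // num_nodes) % (num_nodes - 1)
    return (p + offset) % num_nodes
```

In a three-node instance of this sketch, the entries whose primary is one node are mirrored alternately to the other two, so a failure of that node splits its workload evenly between the survivors.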
Figure 1.3 depicts a simple distribution of data on a fictitious RDSS application with N
server nodes. The primary copy of each data entry is stored in one of the server nodes.
The secondary (mirrored) copy of each data entry is stored in another server node. For
example, if the primary copy of entry x is stored in node N, the mirrored copy of entry
x may be stored in any other node but node N.
If extra storage is needed, new server nodes can be added on-line to the RDSS without
interruption to the application services. Conversely, a server node can be deleted on-line
if necessary. To perform node addition and deletion, the RDSS may need to resize the
mirroring partitions on-line.
The entry based scattered mirroring strategy allows the workload to remain balanced
[Figure legend: primary data storage; secondary mirrored data storage.]
Figure 1.3: Data Replication in the RDSS
even if one node has failed. Figure 1.4 shows the distribution of data on a three-node
system before and after a node failure. The data entries in the primary partition on each
node are scattered evenly over the two other nodes. Data entries that are located on storage
partition 2P (node 2 primary partition) on node 2 are mirrored to either partition 2aS
(secondary partition for node 2 data) on node 1 or partition 2bS on node 3. Similarly,
data entries on 1P are mirrored to 1aS and 1bS, and data entries on 3P are mirrored to
3aS and 3bS.
After the failure of node 2, the secondary (mirrored) data partition of node 2 on node 1
(partition 2aS) and node 3 (partition 2bS) becomes active. In this example, the distribution
of data is assumed to be closely related to the actual processing workload usage,
such that after the failure of node 2, the workload is balanced between the two remaining
nodes.
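The load-balancing effect of the scattered mirroring strategy can be sketched numerically. The following is an illustrative model only, not the RDSS implementation; the round-robin mirror assignment is an assumption made for this example.

```python
# Toy model of entry-based scattered mirroring: each entry has a primary
# node, and its mirror is placed on one of the other N-1 nodes in
# round-robin order (an assumed placement policy, for illustration).
def place_entries(num_nodes, entries_per_node):
    placement = []  # list of (primary_node, mirror_node) per entry
    for primary in range(num_nodes):
        others = [n for n in range(num_nodes) if n != primary]
        for i in range(entries_per_node):
            mirror = others[i % len(others)]  # scatter across N-1 nodes
            placement.append((primary, mirror))
    return placement

def active_load(placement, failed_node):
    # An entry is served by its primary node, unless that node has
    # failed, in which case its mirror node takes over.
    load = {}
    for primary, mirror in placement:
        server = mirror if primary == failed_node else primary
        load[server] = load.get(server, 0) + 1
    return load

placement = place_entries(3, 100)
print(active_load(placement, failed_node=None))  # {0: 100, 1: 100, 2: 100}
print(active_load(placement, failed_node=2))     # {0: 150, 1: 150}
```

With three nodes of 100 entries each, the failure of node 2 leaves the remaining two nodes carrying 150 entries each, i.e., an extra 1/(N-1) = 50% of their original load, as described above.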
1.5 Thesis Outline
The rest of this thesis is structured as follows. Chapter 2 provides a complete discussion
of the application constraints and requirements that an application must follow to work
properly within the RDSS framework. Chapter 3 contains the architecture design
overview of the RDSS. The member components are shown, along with their relationship
to each other.
Chapter 4 is the detailed design chapter that presents an implementable design of the
RDSS prototype and describes the behaviour of the member components. Chapter 5
gives the implementation status of the RDSS prototype. Finally, a conclusion and a
list of possible future work are given in chapter 6. The appendices contain further
implementation information on the prototype.
[Figure: data distribution on three nodes before and after a node failure, showing primary data storage, secondary mirrored data storage, and active data storage serving queries.]
Figure 1.4: Load Balancing of the Scattered Mirroring Strategy
Chapter 2
Application Environment
This chapter presents the Robust Distributed Storage System (RDSS) application domain
and the constraints introduced by the RDSS. The applications within the target engi-
neering domain are described, and the design and implementation requirements on the
target application are given.
2.1 Application Domain
One of the goals of the RDSS is to minimize the differences seen by the application devel-
oper between a single node non-distributed version of an application and its equivalent
distributed version. Instead of designing for the universe, the RDSS targets a chosen
subset of applications.
The chosen domain of the RDSS encompasses most of the digital library applications,
regardless of their content - text, hypertext, multimedia streams, objects or indices.
It does not include all database applications, because the RDSS places constraints that
many database applications cannot meet.
The RDSS enabled application is assumed to reside on a local area network. Clients of
the application may reside outside of this network, provided that means of connection and
authentication are available. Many applications may be adapted to fit within the RDSS
architecture, provided their data organization and external interfaces can be adapted
to meet RDSS requirements. The data organization required by the RDSS environment
will be presented in section 2.2.
For an application to function properly in the scalable RDSS environment, it must comply
with certain constraints specified in this chapter. An application designer is given a list of
precise yet flexible constraints and tools, such that applications in the RDSS engineering
domain can be created without additional difficulties.
In terms of the application's environment, the RDSS software layers are mostly transpar-
ent. An RDSS compliant application can be used as a single stand-alone server without
any software modification. The system synchronization, and the duplication and recovery
processes are invisible to the application.
The goal is for any single stand-alone application within the application domain to be-
come a distributed application via the RDSS framework, with a few simple additions
and no major modification. To illustrate the application requirements, we will use a
simple text snippet application, which stores arbitrary text objects, as a running example
throughout the chapter. The implementation of the application will be used as a further
example in chapter 5.
Figure 2.1 depicts the interfaces that an RDSS compliant application is required to have.
In the following sections, the constraints required by the RDSS on each of the application
interfaces are described in more detail.
2.2 Data Organization
For the RDSS to provide automatic data mirroring and on-line storage management,
certain data organization constraints are needed. The RDSS addressing scheme and the
[Figure: query clients reach the Application Server (AS) query interface via the Application Query Protocol (AQP); the trusted update client reaches the update interface via the RDSS Compliant Application Update Protocol (RC-AUP); the application server accesses physical storage through its storage interface.]
Figure 2.1: Stand-Alone Application Server
notion of capacity are the two main requirements on any RDSS-compatible application.
2.2.1 Data Addressing Scheme
In the RDSS world, data entries are associated with addresses on a linear address space.
The size of the address space is determined by the RDSS implementation (usually 64
bits). Each data address location corresponds to a virtual storage quantum, which may
be a byte, a word of text, a video frame or any other arbitrary storage element. Only
active address locations may have physical storage allocated to them.
Externally, the application server stores a collection of data entries. Each data entry
is associated with a finite, non-zero set of storage quanta. More specifically, each data
entry must be mapped to a contiguous range of addresses in the RDSS address space.
The mechanism for translating a virtual storage quantum into a physical storage block
is internal to the application.
No two data entries may share a single address location. Each active address location
must correspond to a unique quantum in a unique data entry. Each data entry must
be mapped individually onto a unique unbroken range in the RDSS address space. If a
data entry X is assigned to RDSS addresses 1024 to 2047, then no RDSS position within
that range can refer to data outside X. Un-associated address locations, however, may
exist between addresses used by data entries. For example, the simple text snippet server
stores variable length generic text snippets. Each entry is associated with a single RDSS
address. In the simple text server, the RDSS address doubles as the retrieval handle
for the text snippet. Using the assigned RDSS address, specific text snippets can be
retrieved. For both data retrieval consistency and RDSS compliance, each entry needs a
unique RDSS address tag (a contiguous range of RDSS addresses).
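The addressing rules above can be checked mechanically. The sketch below is illustrative only (the class and its methods are not part of the RDSS prototype); it registers entries as contiguous, non-overlapping ranges on a linear address space, with gaps allowed between them:

```python
class AddressMap:
    """Toy model of the RDSS linear address space: each data entry owns
    one contiguous, non-overlapping range of addresses (inclusive)."""
    def __init__(self):
        self.ranges = {}  # entry_id -> (start, end)

    def add_entry(self, entry_id, start, end):
        if start > end:
            raise ValueError("range must be non-empty")
        for other_id, (s, e) in self.ranges.items():
            if start <= e and s <= end:  # the two ranges overlap
                raise ValueError(f"overlaps entry {other_id}")
        self.ranges[entry_id] = (start, end)

    def lookup(self, address):
        # Returns the entry owning this address, or None: unassociated
        # address locations may exist between data entries.
        for entry_id, (s, e) in self.ranges.items():
            if s <= address <= e:
                return entry_id
        return None

m = AddressMap()
m.add_entry("X", 1024, 2047)
print(m.lookup(1500))  # X
print(m.lookup(3000))  # None
```

Attempting to register a second entry overlapping addresses 1024 to 2047 raises an error, mirroring the rule that no two data entries may share an address location.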
2.2.2 Capacity Definition
Each data entry will also be assigned a value representing its capacity requirements.
Capacity is the notion of estimated storage usage. It represents the maximum storage
requirement of the data entry in units that are specific to the application. It may or
may not correspond directly to the actual physical storage size. The application may
employ compression and data correction techniques, such that the actual storage required
for a data entry may be smaller or larger than the external format size seen by a query
client. The notion of capacity is used for two purposes:
1. to measure the maximum application dependent storage requirements for a data
entry.
2. to measure the minimum available storage in an application server.
The RDSS uses the notion of capacity to allocate entries to servers. Given a set of data
entries with capacities c_0, ..., c_(N-1) and a server with capacity C_S, then the N entries can be
stored on the server if:

    c_0 + c_1 + ... + c_(N-1) <= C_S
In the simple text snippet server example, no data compression, encryption, or error
encoding is done within the application. The capacity usage for each data entry is the
size of the data rounded up to the next block size. The header block and translation
table usages are excluded from the capacity calculation by the application because they
are not tied to the amount of data in the storage. A 786-byte text snippet will have a
capacity of 2 on a system using 512-byte storage blocks.
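For the simple text snippet server, the capacity of an entry is just its size rounded up to a whole number of storage blocks, and a set of entries fits on a server when their capacities sum to no more than the server capacity. A minimal sketch of both calculations (the 512-byte block size is taken from the example above; the function names are ours):

```python
def capacity(entry_size_bytes, block_size=512):
    # Round up to the next whole block: ceil(size / block_size).
    return -(-entry_size_bytes // block_size)

def fits(entry_capacities, server_capacity):
    # Entries fit on a server if their capacities sum to at most C_S.
    return sum(entry_capacities) <= server_capacity

print(capacity(786))        # 2, matching the 786-byte example above
print(fits([2, 3, 5], 10))  # True
```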
Regardless of whether any compression, error correction, encryption or any other trans-
formation is applied to the data, the application must store the resulting data entries
in a virtual linear block device provided by the RDSS. Each data entry stored must be
independent from others. In particular, the application cannot use information from one
data entry for compression, error correction, or encryption of another data entry.
2.2.3 Data Organization Requirements Summary
To summarize, the RDSS assumes that the application in its environment has the fol-
lowing data organization:
- Data entries are quantized.
- Data entries are mapped onto the linear RDSS address scheme.
- An estimated storage capacity is associated with each data entry.
- Data entries must be independent in their stored format.
The usage of this data organization may be found in the architecture design of chapter 3.
In short, the quantized addressing scheme allows the RDSS to perform automatic data
mirroring, while the notion of capacity allows on-line storage management.
2.3 Application Server Design Requirements
There are two components in an RDSS compliant application: the application server
module and the application front-end module. The application server module is a self-
contained database server that can be executed alone as a single node database appli-
cation if desired. However, when multiple application servers are running together as a
distributed application, an application front-end module is needed. It acts as a multi-
plexor/demultiplexor for external queries.
Most of the RDSS compliance requirements for an application are focused on the external
interface of the application server. Internally, the application server has to comply with
the data organization restrictions described in the last section. The physical storage
limitations will be addressed in the interface section (2.3.1).
In addition, the application must support the transaction model when dealing with data
updates. An update request is not finalized until it has been committed by a specific
commit message. Support of the abort message is also required.
Finally, the application needs to support efficient transfer of large amounts of data be-
tween two instances of itself. This is needed for load balancing, mirroring and recovery
operations. On request, the application can extract a subset of its contents and send it
to another instance of itself through a specified TCP port. The receiving application
instance, on request, should establish a TCP connection to the aforementioned port lo-
cation and merge the incoming data with its own contents. The data format during the
transfer is not constrained by the RDSS. In the simple text server example, only the text
blocks and RDSS addresses are transferred.
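The extract/merge transfer can be pictured with a loopback sketch. This is purely illustrative (the real transfer format is application defined and the function names here are invented): one instance listens on a TCP port and streams its entries, while the receiving instance connects to that port and reads until the sender closes the connection.

```python
import json
import socket
import threading

def serve_extract(entries, port_holder, ready):
    # "EXTRACT" side: stream the selected entries to whoever connects.
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))              # OS-assigned port
    port_holder.append(srv.getsockname()[1])
    srv.listen(1)
    ready.set()
    conn, _ = srv.accept()
    conn.sendall(json.dumps(entries).encode())
    conn.close()
    srv.close()

def merge_from(port):
    # "MERGE" side: connect to the given port and read entries until
    # the sending instance closes the connection.
    sock = socket.create_connection(("127.0.0.1", port))
    data = b""
    while chunk := sock.recv(4096):
        data += chunk
    sock.close()
    return json.loads(data)

entries = {"4096": "hello", "8192": "world"}  # RDSS address -> snippet
port_holder, ready = [], threading.Event()
t = threading.Thread(target=serve_extract, args=(entries, port_holder, ready))
t.start()
ready.wait()
received = merge_from(port_holder[0])
t.join()
print(received == entries)  # True
```

The JSON encoding here stands in for whatever application-specific format the real servers agree on; the RDSS itself does not constrain it.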
The following sub-section details the interface requirements on the connection and com-
munication mechanisms of the application server module. The protocol sub-section gives
the format of the communication contents that are understood by the RDSS.
2.3.1 Application Server Interfaces Requirements
As shown in figure 2.1, the RDSS assumes and requires a compliant application server
to have the three depicted interfaces: a query interface, an update interface, and a storage
interface. The given application server may have other interfaces that do not belong
to any of the three interfaces described. However, for an application server to function
properly with other copies of itself in the RDSS environment, the three interfaces shown
must exist and comply with the requirements in this section.
There are two categories of requirements for any one of the three RDSS required applica-
tion interfaces. The first contains the requirements imposed by the RDSS architecture.
For example, the need for the three aforementioned interfaces is an architectural con-
straint. It is an RDSS design goal that the architectural interface constraints remain as
lightweight as possible.
The other category of requirements is due to the RDSS implementation. These require-
ments depend on the implementation and may change even if the RDSS architecture
remains unchanged. The limitations listed here are those placed by the current RDSS prototype.
Application Query Interface Constraints
The RDSS architectural constraints on the application query interface are the following:
- The interface is stream based.
- Concurrent multiple clients must be supported.
The prototype RDSS implementation constraints on the application query interface are
these:
- Each client has its own session.
- The protocol used is session based. (See the query protocol requirements in section 2.3.2.)
- Each session is connection based.
- Multiple simultaneous sessions are supported.
- The connection mechanism is implemented on the TCP transport layer.
Application Update Interface Constraints
The following lists the RDSS architectural constraints on the application update interface:
- The interface is stream based.
- The interface complies with the RDSS application update protocol restrictions. However, a superset of the protocol is possible. (See the update protocol requirements in section 2.3.2.)
The prototype RDSS implementation constraints on the application update interface are
the following:
- The protocol used is session based.
- Each session is connection based.
- The connection mechanism is implemented on the TCP transport layer.
Application Storage Interface Constraints
All storage operations of the application must be done via the Hardware Abstraction
Layer (HAL). It provides an abstraction of a linear block device accessed through a virtual
disk interface. The architectural constraints of the storage interface on the application
are these:
- It must use the HAL Application Server (HAL-AS) library to access any physical storage.
- No application persistent state may exist outside of HAL managed storage.
The HAL software library is designed to be a simplified version of the I/O routines
of the original low-level system library. See the HAL interface in the next section for
details. Appendix B contains the calling interfaces of the HAL-AS library prototype.
The implementation limitations imposed by the prototype are the following:
- The protocol is specified by the HAL-AS library interface.
- The application must link and execute with the HAL-AS library.
- Only one HAL setup is allowed for each physical storage unit (disk partition, disk, or group of disks).
Constraints on Other Interfaces
The RDSS does not use any other activation or communication interfaces provided by the
application server. If the application server has such interfaces, the following architectural
constraints apply:
- These interfaces are not necessary for the normal query and update operations of an application server.
- They do not conflict with the requirements on the other three interfaces mentioned previously.
An example interface that falls into this category would be a profiling interface. Any addi-
tional support for gathering application usage statistics is up to the application designers.
Additional modules may be added to combine information from all the servers. In the
simple text snippet server, an activity log is produced for performance benchmarking
and debugging. Its activities do not interfere with the three RDSS required interfaces in any
way.
2.3.2 Application Server Protocols Requirements
The protocol requirements specify the restrictions on the contents sent via the interfaces
listed above. Again, some of the constraints may be due to the architectural design of
the RDSS, while others are due to the implementation choices in the RDSS prototype.
The following details the limitations on the protocol used by the three RDSS compliant
interfaces on the application server. No protocol constraints are placed on the non-RDSS
interfaces.
Application Query Protocol Constraints
Other than the interface constraints given in the last section, there is no restriction on the
query protocol required by the RDSS. That is, the query format is completely application
dependent. Any session-based protocol (binary or plain text) may be used.
Application Update Protocol Constraints
In order for the RDSS to perform correctly, it has to manipulate data among the net-
worked application servers. To do so, it needs to know the protocol used by the appli-
cation server update interface. Thus, in addition to restricting the update interface, it
also needs certain features in the update protocol. The following is a description of the
update commands required of the update interface:
ADD data entry: A new data item is added to the server. The start and end RDSS
addresses are assigned externally by the update client. The capacity and external format
size of the entry will also be needed.

DELETE entries: All complete data entries within the given RDSS address range will
be removed from the system. Data entries that partly fall within the range are not
removed.

EXTRACT entries: All data entries within the specified address range (inclusive) are
either sent to the output stream of the application or to a given port number. The data
may be in an externally formatted form if desired.

MERGE entries: Add a number of data entries to the server from an external port
location (e.g., the output port on the EXTRACT command). The application will attempt
to connect to the given port location and read in the data entries. The transfer format
is the same one used by the EXTRACT command. No data entries will be committed
until the update commit command is issued.

Update COMMIT: For an add or delete operation, the application will first acknowledge
that it is ready. Only when the commit message arrives will the visible change occur.
The committing step in the application is assumed to be atomic (i.e., the transaction is
either committed or not at all).

Update ABORT: Instead of committing the current outstanding update requests, the
operations are aborted and the changes are discarded.

CAPACITY available: Upon receiving this command, the application returns the minimum
capacity available for new data. (See section 2.2.2 on the definition of capacity.)

STORAGE available: Upon receiving this command, the application returns how much
storage space in the virtual disk is available for reclamation. The block size and number
of blocks available are returned. It should be used before any storage is truncated using
the TRUNCATE command.

TRUNCATE storage: This command is used to notify the application of any pending
storage size decrease. The application should free up the necessary blocks at the end of
the virtual linear block device visible to the application. After a successful storage
truncation, the RDSS HAL can safely reduce the storage size allocated to that particular
application server.

EXPAND storage: This command is used to notify the application that a storage size
increase has occurred. The application can then make use of the additional storage
blocks at the end of the virtual linear block device.

SHUTDOWN server: This command instructs the application to terminate itself.
For the RDSS to operate correctly, the update protocol must be implemented exactly in
both the RDSS and the application. A superset of the given command set is possible,
but any additional features will not be used by the RDSS.
The complete BNF (Backus-Naur Form) notation of the protocol used by the RDSS
prototype can be found in appendix A. In the RDSS prototype, the update protocol
is implemented as plain text commands. Table 2.1 briefly summarizes the parameters
accepted by each update command listed above.
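The transactional flavour of the update protocol can be sketched with a small state machine. The real command syntax is given in appendix A; the grammar and class below are invented for illustration only. The key behaviour shown is that staged updates only become visible on COMMIT and can be discarded by ABORT:

```python
# Toy model of the transactional update protocol (hypothetical syntax;
# the prototype's actual grammar is in appendix A). Updates are staged
# and become visible atomically only on COMMIT.
class UpdateSession:
    def __init__(self):
        self.committed = {}  # (start, end) address range -> data
        self.pending = []    # staged operations awaiting COMMIT

    def handle(self, line):
        parts = line.split()
        cmd = parts[0].upper()
        if cmd == "ADD":                    # ADD <start> <end> <data>
            start, end, data = int(parts[1]), int(parts[2]), parts[3]
            self.pending.append(("add", (start, end), data))
            return "READY"
        if cmd == "DELETE":                 # DELETE <start> <end>
            self.pending.append(("delete", (int(parts[1]), int(parts[2])), None))
            return "READY"
        if cmd == "COMMIT":                 # apply all staged ops at once
            for op, rng, data in self.pending:
                if op == "add":
                    self.committed[rng] = data
                else:                       # remove entries falling entirely
                    lo, hi = rng            # inside the delete range
                    self.committed = {r: d for r, d in self.committed.items()
                                      if not (lo <= r[0] and r[1] <= hi)}
            self.pending = []
            return "OK"
        if cmd == "ABORT":                  # discard all staged ops
            self.pending = []
            return "OK"
        return "ERROR"

s = UpdateSession()
s.handle("ADD 1024 2047 x-data")
print(s.committed)   # {} - nothing visible before COMMIT
s.handle("COMMIT")
print(s.committed)   # {(1024, 2047): 'x-data'}
```

Note how DELETE removes only entries that fall completely within the given range, mirroring the command description above.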
Application Storage Protocol Constraints
To use the storage interface, the application server has to be linked with the Hardware
Abstraction Layer Application Server (HAL-AS) library. It is a software library containing
abstractions for the physical storage. Physical storage is presented as a virtual disk
partition to the application. This allows the RDSS to manipulate the application physical
storage allocation without interfering with the application operations.
Operation                 | Input Arguments                    | Return Parameters
--------------------------|------------------------------------|----------------------------
Add a data entry          | Formatted data size, RDSS address  | Ready for commit
                          | range, Capacity size, Data         | (or operation failed)
Delete data entries       | RDSS address range                 | Ready for commit
                          |                                    | (or operation failed)
Extract data entries      | RDSS address range, Extract format,| Ready for commit
                          | (Output port number)               | (or operation failed)
Merge in data entries     | Source server, Source port         | Ready for commit
                          |                                    | (or operation failed)
Commit a modification     | -                                  | Success or failure
Abort a modification      | -                                  | Success or failure
Check capacity available  | -                                  | Capacity available
Check storage available   | -                                  | Block size, Number of blocks
Truncate storage          | Storage block count                | Success or failure
Expand storage            | Storage block count                | Success or failure
Shutdown server           | -                                  | -

Table 2.1: Application Server Update Protocol
Instead of following a given message format protocol like the update interface, the ap-
plication server must use the routines provided by the HAL-AS library for storing and
retrieving data. The basic methods available are:
Open: This routine initializes the HAL-AS library and synchronizes with other HAL-AS
users on the current server node. It opens a single virtual disk for read and write.
This has to be called before other HAL-AS routines.

Read: This method allows the application to read a specific number of blocks from a
given virtual disk partition into a memory buffer.

Write: Using this method, the application writes blocks of data from memory to the
virtual disk partition.

Status: The current information about the virtual disk partition is returned, including
the virtual disk block size and the number of usable blocks.

Close: The opened HAL-AS virtual disk and the associated system resources are released.
The detailed capabilities and the calling interfaces to the above routines are available
in appendix B. The detailed design of the HAL itself is described in chapter 4. The
HAL also provides storage management routines to other RDSS software components;
however, those routines should not be used by the application server.
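The flavour of the HAL-AS storage interface can be conveyed with a small sketch. The class below is not the real HAL-AS library (whose calling interfaces are in appendix B); it is a hypothetical in-memory model of a linear block device exposing the five operations described above:

```python
class VirtualDisk:
    """Hypothetical in-memory stand-in for a HAL-AS virtual disk: a
    linear array of fixed-size blocks (not the real library)."""
    def __init__(self, block_size=512, num_blocks=1024):
        # Plays the role of Open: set up the virtual disk for read/write.
        self.block_size = block_size
        self.blocks = [bytes(block_size)] * num_blocks

    def read(self, start_block, count):
        # Read `count` whole blocks starting at `start_block`.
        return b"".join(self.blocks[start_block:start_block + count])

    def write(self, start_block, data):
        # Write whole blocks of data starting at `start_block`.
        assert len(data) % self.block_size == 0, "whole blocks only"
        for i in range(len(data) // self.block_size):
            self.blocks[start_block + i] = \
                data[i * self.block_size:(i + 1) * self.block_size]

    def status(self):
        # Report block size and number of usable blocks.
        return {"block_size": self.block_size, "num_blocks": len(self.blocks)}

    def close(self):
        # Release the virtual disk and associated resources.
        self.blocks = None

disk = VirtualDisk()
disk.write(0, b"snippet!".ljust(512, b"\0"))
print(disk.read(0, 1)[:8])           # b'snippet!'
print(disk.status()["num_blocks"])   # 1024
```

Because the application sees only a linear block device like this one, the RDSS HAL beneath it is free to grow, shrink, or relocate the underlying physical storage.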
2.4 Application Front-End Design Requirements
In addition to the application server module, an RDSS compliant application needs a
front-end module. The front-end module resides within the RDSS marshaller-dispatcher
layer. It is responsible for multiplexing and demultiplexing the client requests and their
corresponding responses. Figure 2.2 shows the relationship of the query and update
front-ends to the application server.
[Figure: query clients and the trusted update client connect through the query and update front-end modules to the distributed application servers.]
Figure 2.2: Application Distributed View with Front-End Modules
The application front-end is separated into two smaller modules, corresponding to the
application server query and update interfaces. The query front-end is responsible for the
query requests and responses, while the update front-end pre-filters the update commands.
In the current RDSS design, a generic update front-end is provided. It can be used in
place of a custom update front-end if necessary.
2.4.1 Application Query Front-End
When a new query client connects to the RDSS via the marshaller-dispatcher query port,
a new process encompassing the Application Query Front-End (AQFE) will be created
to handle the new query session. The AQFE is responsible for reading client requests,
sending them to the relevant application servers, collecting responses, and giving the
results to the client.
Different applications require different query front-ends. As the RDSS places no require-
ment on the application query protocol, a custom query front-end must be built in order
to interpret and combine the results returned from the attached application servers. De-
pending on the nature of the application and its query protocol, the application query
front-end could be a simple multiplexor/demultiplexor (as provided as a sample in the
RDSS prototype) or a complex module with a significant amount of logic.
Instead of directly interfacing with query clients and application servers, the application
query front-end must use the routines provided by the Query Marshaller-Dispatcher
library (QMD-Lib). The QMD-Lib is a software library that contains all the routines for
the AQFE to communicate with the query ports on the application servers. By doing
so, most of the RDSS related activities are hidden and automatically resolved. Thus,
the coding effort required by the AQFE is much smaller. The following are the callable
routines in the QMD-Lib. More detailed descriptions of their operations can be found in
the QMD design section in chapter 4. The exact calling interfaces for the QMD-Lib
in the RDSS prototype are available in appendix C.
New query session: This is called at the beginning of each query session. All necessary
initializations are performed. A list of visible application server query
ports (with an active flag for each) is returned.
Start query: This routine sends a new query request to the given list of application
server query ports. An updated list of visible application server query
ports is returned with the current list size (which may have grown).

Read select: Using a mask corresponding to the last returned port list, the caller
will be blocked until one of the acceptable read events or an exception
occurs. Slot 0 in the mask refers to the client port. Again, the port
status list is updated.
Read from port: Data from the connection corresponding to the given slot is read into
the read buffer. The updated buffer and the read size are returned.
Note that new query requests should not be started this way. A simple
macro extension of this routine allows writing to one port at a time.
End query: An end query message may be sent to all application servers involved.
Both the size of the message sent and an updated application query
port status list are returned.
Terminate session: This routine ends the current session of query operations. The appli-
cation query port status list is released.
Get status: This forces an update of the application query port status list.
The application query front-ends are responsible for handling out-of-band query re-
sponses. The QMD-Lib assumes one outstanding query per client, and responses from
any application servers after the end query call are considered out-of-band. Also, no
simultaneous queries are supported in QMD-Lib; however, it is possible to design an
AQFE to allow multiple outstanding queries.
During a query session, the index assignment of ports on the application query port
status list will remain unchanged. New ports will appear at the end of the list. If a query
port has previously gone inactive, it will reappear on the old index location when its
status returns to active. This feature is included to remove any possible ambiguity due
to the masked Read, Write or Select operations.
The AQFE is responsible for making sure that all returned data from active query ports
are properly handled. In the simple text server, there should only be one valid return,
because all data entries are unique. At the start of a query session, a request from the
client is sent to all text server node query ports. Each server searches its translation
table, and if the RDSS address tag is found, the actual data entry is returned to the
query client via the simple text server AQFE.
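This fan-out behaviour of the simple text server AQFE can be sketched as follows. The model is simplified for illustration: the real front-end fans the query out through the QMD-Lib calls listed above rather than direct function calls, and the server "translation tables" here are plain dictionaries.

```python
# Simplified model of the simple text server AQFE: a query (an RDSS
# address) is fanned out to every active server; since each entry lives
# on exactly one server, at most one server returns a result.
def fan_out_query(address, servers):
    responses = []
    for server in servers:            # the real AQFE uses QMD-Lib here
        result = server.get(address)  # look up the translation table
        if result is not None:
            responses.append(result)
    assert len(responses) <= 1, "RDSS addresses are unique across servers"
    return responses[0] if responses else None

servers = [
    {1024: "alpha"},   # node 1 translation table
    {2048: "beta"},    # node 2
    {4096: "gamma"},   # node 3
]
print(fan_out_query(2048, servers))  # beta
print(fan_out_query(9999, servers))  # None
```

Because every data entry is unique, gathering the responses reduces to picking the single non-empty reply, which is what makes a simple multiplexor/demultiplexor front-end sufficient for this application.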
If node X has gone temporarily off-line, all the servers with mirrored data from node X
will appear on the query port connection list with the active flag turned on, while the
active flag will be turned off for the query ports associated with node X. If the changes
happen in the middle of a query (i.e., between the start query and the end query calls),
the original query request will be automatically sent to these new ports by the QMD-Lib.
When node X returns to on-line status, its active flag will be restored. The secondary
mirroring query ports will remain active on the list until the end query call. At that
point, they will become dormant, with their active flag turned off.
2.4.2 Application Update Front-End
Unlike the query part of the marshaller-dispatcher, the application update protocol is
clearly specified above, and a default update front-end module is possible. By default,
all update requests are sent to all application servers, where update decisions are made in
a distributed fashion based on the RDSS address segments assigned to each server node.
However, if desired, it is possible for the application design to add an update front-end
for the application. The main role of the update front-end is to prefilter update requests.
For example, if the simple text server is to be integrated with a word indexing engine
(which may be implemented as another RDSS application), the combined Application
Update Front-End (AUFE) may automatically generate the necessary word list.
To allow for a custom application update front-end, an Update Marshaller-Dispatcher
Library (UMD-Lib) is provided. Instead of a list of application query ports, the update
front-end deals with a list of RDSS server nodes. Also, only one update client is allowed
for the whole RDSS at any given time. The calling interfaces to the UMD-Lib in the
RDSS prototype are available in appendix D.
Besides using the UMD-Lib, any custom application update front-end is constrained to
follow the other update interface and protocol constraints listed above. In the prototype
version of the RDSS, the default AUFE (Application Update Front-End) is assumed. No
custom update front-end will be required for the application.
Chapter 3
Architectural Overview
In the previous chapter, the requirements on an RDSS compatible application were de-
scribed. As the name implies, the RDSS (Robust Distributed Storage System) provides
additional fault tolerance to the application while increasing its performance by distribut-
ing the work load among server nodes.
In this chapter, an overview of the RDSS architecture is presented. First, the design
criteria are summarized. They are followed by an overview of the system configuration
and descriptions of the high level RDSS components. The final section provides typical
operating examples.
3.1 Design Criteria
Before going into the architectural design of the RDSS, this section reviews the design
criteria. First is a summary of the design goals as discussed in chapter 1. This summary
is followed by a discussion of the design assumptions.
3.1.1 Design Goals
The main design objectives for the RDSS architecture, as discussed in chapter 1, are to:
- Provide a high level of transparency
- Improve database consistency and fault tolerance
- Enhance application availability
- Increase query throughput performance
- Minimize application design constraints
- Maximize the ease of application implementation
3.1.2 Design Assumptions
There are several underlying assumptions in the RDSS architecture design. The first is
the application compatibility assumption, which basically means that the target appli-
cation will follow the constraints described in chapter 2. The application is assumed to
have a specific query and update profile, with queries occurring frequently and updates
batched together. The performance of the system is primarily quantified by its query
throughput capacity.
Aside from the application assumptions, the RDSS architecture assumes a certain trust
model. Namely, there is only one active update client, which is not malicious. Although
it is possible to incorporate authentication into the design, it is not a core design goal.
Instead, the data security can be provided externally. This can be done, for example,
via connection access control to the update ports and other non-query access points of
the server nodes. No trust assumption is made regarding the query clients by the RDSS
architecture.
The RDSS deals with node level behaviour. The only error reported to the RDSS level is a
node level failure, which renders the whole server node unusable. To keep the architecture
simple, all node level failures are treated as if the failed node has been disconnected from
the rest of the system. It is up to the application and the underlying operating system
to deal with other less far-reaching errors; for example, a query syntax error is not an
RDSS node level failure. These error handling mechanisms are beyond the scope of the
core RDSS architecture, and are assumed transparent to the RDSS components.
Finally, the RDSS architecture is designed on the assumption that the whole system
will be run on a single local area network (LAN) with typical local network latency and
throughput. More specifically, the inter-computer communication is not considered to
be a bottleneck in the architecture design of the RDSS. The throughput of the network
should also be sufficient to handle the expected query volume.
3.2 System Configuration
Figure 3.1 depicts a static snapshot of a simple RDSS setup. In each RDSS-enabled
application, there are N server nodes, depending on the system configuration. In addition
to the server nodes, there are one or more marshaller-dispatcher nodes that perform
multiplexing and de-multiplexing operations. All these nodes are located on the same
local area network. It is also possible to co-locate a marshaller-dispatcher and a server
on a single node.
In this section, a list of the high level software components residing on the server nodes
and the marshaller-dispatcher node is presented. Also, the usage of physical non-volatile
storage in the RDSS is given. Overviews of each of the top level RDSS components are
given in sections 3.3 to 3.9 of this chapter.
3.2.1 Server Node Components
A server node contains instances of the Application Server (AS). In addition, an instan-
tiation of a server node includes the following top level RDSS components:
[Figure 3.1 here: query clients reach the RDSS-compliant local area network through the application query protocol and the update protocol; N server nodes, each with its own physical storage, sit on that network.]
Figure 3.1: Network Overview of the RDSS Architecture
- Node Configuration Monitor (NCM)
- Node State Machine (NSM)
- Application Server Manager (ASM)
- Hardware Abstraction Layer (HAL)
Each server node contains a primary data storage segment, which is associated with a
copy of the Application Server. The primary data is distributively mirrored by all the
other nodes, and vice versa. Given N nodes in the system, each node will have N - 1
mirroring segments, each serviced by an instance of the application server.
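This mirroring layout can be sketched in a few lines of Python (an illustrative simplification only; how each primary segment is actually partitioned across its peers is determined by the mirroring strategy of chapter 1, which this sketch does not model):

```python
def mirror_layout(n_nodes):
    """Sketch of single-level distributed mirroring: each node holds its own
    primary segment plus one mirror segment for every other node, so an
    N-node system has N - 1 mirror segments (each served by its own
    application server instance) per node."""
    layout = {}
    for node in range(n_nodes):
        layout[node] = {
            "primary": node,
            # one mirror segment per peer node
            "mirrors": [peer for peer in range(n_nodes) if peer != node],
        }
    return layout
```

In a four-node system, for example, every node carries three mirror segments, and every node's primary data appears as a mirror segment on each of the other three nodes.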
The Application Server Manager (ASM) is responsible for coordinating all application
server instances on a server node with the help of the Hardware Abstraction Layer (HAL).
The Node Configuration Monitor (NCM) is responsible for coordinating the local node
with others in the system, and finally the Node State Machine (NSM) maintains the state
of the local node by reacting to stimuli from the other components.
3.2.2 Marshaller-Dispatcher Components
The other major subsystem of the RDSS architecture is the marshaller-dispatcher soft-
ware. It is a separate subsystem because it does not have to reside with any particular
RDSS server node. It acts as the gateway for the query and update clients to interact
with the whole RDSS-enabled application. A copy of the marshaller-dispatcher software
contains the following components:
- Marshaller-Dispatcher Port Manager (MDPM)
- Query Marshaller-Dispatcher (QMD)
- Update Marshaller-Dispatcher (UMD)
The Application Query Front-End (AQFE) is the application specific part of the Query
Marshaller-Dispatcher (QMD), which is responsible for distributing query requests to
RDSS server nodes and collating the query results. The Application Update Front-End
(AUFE), similar to its counterpart, is part of the Update Marshaller-Dispatcher (UMD),
which deals with update requests and results. The Marshaller-Dispatcher Port Manager
(MDPM) keeps track of the server nodes and client connection ports for both the UMD
and the QMD.
3.2.3 Non-volatile Storage Management
The RDSS environment also manages the partitioning and usage of physical disk storage.
Every server node includes a Hardware Abstraction Layer (HAL) software component,
through which the application accesses the data on physical storage. This allows the
RDSS to provide both storage management and data replication on the same underlying
physical storage without adding any complexity to the application design requirements.
To achieve node level fault tolerance, single level data mirroring provides the simplest
solution. The replication strategy was described in chapter 1. The HAL software layer
allows primary and secondary (mirrored) data partitions to be resized on-line.
Finally, aside from the disk storage needed for data, a small amount of non-volatile storage
is needed for tracking transactions within the RDSS. In addition, system parameters and
states are maintained in this storage. This information is vital to the recovery of the
system after a crash.
3.3 Node Configuration Monitor (NCM) Module
The Node Configuration Monitor (NCM) is the key software module for coordinating
server nodes within the RDSS. It allows for distributed decision-making on the current
status of the whole system. The Node State Machine (NSM) module, to be described in
section 3.4, is responsible for maintaining the current state of the local server node. The
NCM, on the other hand, is the component for monitoring the status of the rest of the
distributed environment.
3.3.1 Node Contact List
There is a copy of the NCM software in each server node. While operating, the NCM
periodically broadcasts its contact list to all server nodes. A contact list contains the
list of node locations of other NCM's in the current RDSS environment. Each node
location is accompanied by a join sequence number and a control flag. The join sequence
number is assigned to a node when it joins the system. The node location list is sorted
in increasing order according to this field. The control flag indicates the current mode of
the corresponding node as viewed by the local NCM. The possible settings for the control
flag are:
- Startup - At the start of a server node, the Startup flag is associated with all
nodes on an NCM's contact list (other than the local node). The control flag of a
node changes when a contact list is received from that server node.

- On-line - An On-line status indicates that a contact list has been received from the
corresponding node within a preset time period (the NCM on-line timeout).

- Off-line - A server node is deemed Off-line if a contact list has not been received
within this preset time period.

- Delete Request/Proceed/Ready - The Delete flags are used for remodelling
synchronization (deleting or relocating a node). Their roles are explained in the
following synchronization sub-section.

- Add Request/Proceed/Ready - The Add flags are used for remodelling
synchronization (adding or relocating a node). Their roles are explained in the following
synchronization sub-section.
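A contact-list entry and its control flag might be modelled as follows (a hedged sketch; the field names and the class itself are illustrative, not the actual structures of the RDSS prototype):

```python
from enum import Enum

class Flag(Enum):
    """The control-flag settings a local NCM can associate with a peer."""
    STARTUP = "Startup"
    ON_LINE = "On-line"
    OFF_LINE = "Off-line"
    DELETE_REQUEST = "Delete Request"
    DELETE_PROCEED = "Delete Proceed"
    DELETE_READY = "Delete Ready"
    ADD_REQUEST = "Add Request"
    ADD_PROCEED = "Add Proceed"
    ADD_READY = "Add Ready"

class ContactEntry:
    def __init__(self, location, join_seq):
        self.location = location   # network location of the peer NCM
        self.join_seq = join_seq   # assigned when the node joined the system
        self.flag = Flag.STARTUP   # every peer starts out as Startup

def sorted_contact_list(entries):
    """The contact list is kept in increasing join-sequence order."""
    return sorted(entries, key=lambda e: e.join_seq)
```

The join-sequence ordering matters because, as the next sub-section shows, guardianship is defined purely by a node's position on this sorted list.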
Finally, to provide a convenient way to shut down the whole RDSS, the NCM also
accepts a Shutdown command. To halt the system, the administrator can simply send the
Shutdown command to the NCM broadcasting address repeatedly.
CHAPTER 3. ARCHITECTURAL OVERVIEW
3.3.2 Guardian Node
The responsibility to tabulate the results of all distributed decisions is distributed among
the server nodes instead of being located in a single dedicated control node. For each
server node in the system, a designated guardian node is assigned. It is responsible for
initiating and mastering any NCM synchronization regarding the server node under its
protection. By using a different guardian for each node, no single node failure can cause
system-wide failure.
The join sequence number is assigned to a node when it is successfully added to the RDSS
environment. Since the node addition operation is synchronized system-wide (see next
section) and only one node can be added at a time, the contact list of every NCM within
the RDSS will, therefore, be the same. The guardianship for a particular node x is, by
convention, assigned to the previous node on the contact list (i.e., node x - 1). The last
node on the list is responsible for the first node on the list. If a new node wants to
join the system, it is temporarily placed at the end of the list. The guardian of the new
node is the last node on the original list, unless there is a node deletion synchronization
handshake in progress. See the next sub-section on node relocation in that scenario.
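This guardianship convention reduces to a simple predecessor rule on the join-sequence-sorted contact list, sketched below (the function name is illustrative only):

```python
def guardian_of(index, contact_list):
    """Return the guardian of the node at `index` in a contact list sorted
    by join sequence number: the previous node on the list, with the last
    node guarding the first (wrap-around)."""
    return contact_list[(index - 1) % len(contact_list)]
```

Because every NCM holds the same sorted list, every node computes the same guardian assignments without any extra coordination.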
3.3.3 Server Nodes Synchronization
During normal, steady-state operations, all N server nodes are available for data queries.
When a node, X, fails and needs to be removed from the system, the other N - 1 nodes
become responsible for the data that it contains. To restore a single level of mirroring,
data transfer among the remaining nodes is necessary. A system-wide remodelling
operation like this requires the synchronization of all the remaining nodes to ensure the
consistency of the data during the change. Supervising the remodelling synchronization
is one of the primary jobs of the NCM.
In the case of node deletion, the guardian of a failed node determines whether a
deletion remodelling is needed, depending on how long the failed node has been Off-line
and the NCM Off-line timeout parameter. Only one node can be synchronized for
deletion at any time. The following describes the protocol for deletion synchronization
on the guardian node:
1. During the initialization of the synchronization, the guardian asks the local
Application Server Manager (ASM) whether there is enough capacity to remodel the
local portion of the mirrored data from the failed node. If not, no node deletion
will be attempted.
2. If the local ASM reports there is enough room, the guardian changes the control
flag of the failed node from Off-line to Delete Request.
3. When the contact lists received from the remaining N - 2 server nodes have Delete
Request associated with the failed node, the guardian then changes the Delete
Request control flag to Delete Proceed and sends a remodelling trigger to the local
NSM.
4. After the local ASM has successfully completed the remodelling operation (but has
not yet committed the change), the guardian changes the control flag for the failed
node to Delete Ready.
5. Finally, after detecting that all other nodes have also raised the Delete Ready flag
for the failed node, the guardian removes the failed node from its contact list. A
remodel completion trigger is sent to the local NSM causing the ASM to commit
the remodel operation.
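The five guardian-side steps above can be sketched as a small flag-driven state machine (a deliberately simplified illustration: the real NCM exchanges broadcast contact lists over the network, which is reduced here to lists of peer flags passed in by the caller):

```python
from enum import Enum

class F(Enum):
    OFF_LINE = 0
    DELETE_REQUEST = 1
    DELETE_PROCEED = 2
    DELETE_READY = 3

class DeletionGuardian:
    """Guardian-side deletion synchronization, driven by the flags the
    remaining N - 2 peers echo back on their contact lists."""

    def __init__(self, asm_has_capacity):
        self.flag = F.OFF_LINE
        self.asm_has_capacity = asm_has_capacity
        self.removed = False

    def start(self):
        # Steps 1-2: only proceed if the local ASM can absorb the data.
        if self.asm_has_capacity:
            self.flag = F.DELETE_REQUEST

    def on_peer_flags(self, peer_flags):
        # Step 3: all peers echoed Delete Request -> trigger remodelling.
        if self.flag is F.DELETE_REQUEST and all(
                f is F.DELETE_REQUEST for f in peer_flags):
            self.flag = F.DELETE_PROCEED
        # Step 5: all peers Delete Ready -> remove node, commit the remodel.
        elif self.flag is F.DELETE_READY and all(
                f is F.DELETE_READY for f in peer_flags):
            self.removed = True

    def on_local_remodel_done(self):
        # Step 4: local remodelling finished but not yet committed.
        if self.flag is F.DELETE_PROCEED:
            self.flag = F.DELETE_READY
```

Note that each flag change only takes effect after every remaining peer has echoed the previous flag, which is what makes the remodelling decision system-wide rather than local.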
After the Delete Request is issued by the guardian, it can withdraw the synchronization
by reverting to either the On-line or the Off-line status for the failed node. A non-guardian
node must honour the Delete Ready flag. The corresponding protocol for node
deletion synchronization on the NCM of the remaining nodes is as follows:
1. If a new Delete Request control flag is detected on a received contact list, the NCM
must determine whether it is coming from a valid guardian. If the request is valid,
the NCM queries the local ASM as to whether there is enough capacity to remodel
the local portion of the mirrored data from the failed node. If the remodelling can
proceed, it changes the control flag associated with the failed node to Delete Request
on the local contact list. Once the synchronization is started, Delete Request and
Add Request control flags set by other guardians are ignored.
2. Upon receiving the Delete Proceed command from the guardian of the failed node,
the NCM sends a deletion remodelling trigger to the local NSM. It also changes
the failed node control flag on the contact list to Delete Proceed, to indicate that
remodelling is in progress.
3. After the local ASM reports that the remodelling operation is ready to be commit-
ted, the NCM changes the control flag of the failed node to Delete Ready. It must
honour this flag even after a system crash.
4. At the end, when the guardian commits the remodelling operation by removing
the failed node from its broadcast list, the local node can then commit the changes
pending on the local ASM. After the changes are committed, the local NSM then
deletes the failed node entry from its contact list.
No harm is done if the deletion synchronization is reverted by the guardian before the
remodelling operation is committed. An abort command will be sent to the local ASM
if the node deletion synchronization has gone beyond the Delete Proceed step. Again,
only one node can be remodelled in or out at a time. An NCM will not respond to other
requests until the first synchronization is successfully completed or reverted. Once the
synchronization protocol has started, contact lists received from the failed node can be
ignored by nodes other than the guardian.
The addition remodelling synchronization is similar to the deletion case, with the Add
control flags replacing the Delete control flags for the new node involved. The guardian
node is the last node on the contact list before the new node appeared. The new node is
slotted at the end of the contact list and assigned the next available join sequence number.
As in the case of the deletion, the ASM has to be made ready for remodelling (e.g.,
by creating an instance of the application server for the mirroring of the new node) before
the local NCM can switch the control flag associated with the new node to Add Request.
After all nodes have reported Add Ready, the guardian signals that the remodelling
is committed by changing the control flag associated with the new node to On-line. In
the case where two or more nodes attempt to join the system, the guardian decides which
will be allowed to join.
A relocation remodelling synchronization is a combination of both deletion and addition.
In the event that there is not enough room in the surviving nodes for the system to
recover to the single mirroring level, deletion remodelling is not feasible. A new node
may be added to replace the failed node. During relocation remodelling, data from the
failed node will be recreated on the new node, while the primary and mirrored data in
the other nodes remain unchanged.
At the beginning of the node deletion remodelling sequence, each NSM checks with
the local ASM to determine whether there is enough room for the remodelling. In the
case that there is not enough room, the Delete Request flag is not raised. Thus, the
guardian of the failed node cannot proceed with the deletion remodelling. Since simple
node addition cannot occur if there is a failed node, the guardian of the new node
does not initiate, or aborts, the node addition remodelling sequence. Instead, the deletion
guardian has priority over the addition guardian. It becomes the valid relocation guardian
responsible for both the failed node and the new node. To start the relocation remodelling
synchronization, the relocation guardian issues the Delete Request and the Add Request on
the corresponding failed node slot and the new node slot on the contact list respectively. A
non-guardian responds to the relocation request by mirroring both flags at the same time.
The same dual-flag convention is used to indicate the relocation proceed/ready/commit
commands.
3.3.4 Crash Recovery
The contact lists are tracked in a recovery log on every node. After an unexpected node
shutdown, the NCM will be able to continue the remodelling process by repeating the last
step in the synchronization protocols. Non-guardian nodes must honour their responses
to the guardian.
If a failed node reconnects to the RDSS after it has been remodelled out by the system,
it recognizes that it is no longer on any broadcasted contact lists. The failed node then
re-initializes as a new node and attempts to join the system.
3.4 Node State Machine (NSM) Module
At the centre of each RDSS server node is the Node State Machine (NSM). It is respon-
sible for maintaining a consistent state for the local node, and responding to transition
stimuli from both the Node Configuration Monitor (NCM) and the Application Server
Manager (ASM).
To keep the design simple, the NSM only maintains the non-transient states for the local
server node. Transitional behaviour between states (e.g., entering deletion remodelling)
is handled by other software modules (e.g., NCM). For each RDSS server node, there are
five basic states:
- Initialization
- Steady
- Modifying
- Degraded
- Failed
Within the Degraded state, there are four sub-states:
- Backup
- Remodel-Deletion
- Remodel-Addition
- Remodel-Relocation
The transition state machine is depicted in figure 3.2, with the sub-state diagram in
figure 3.3.
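Read alongside figures 3.2 and 3.3, the state machine can be sketched as a transition table (the allowed transitions below are inferred from the prose of sections 3.4.1 to 3.4.5, so they should be treated as an approximation of the actual diagrams rather than a definitive encoding):

```python
STATES = {"Initialization", "Steady", "Modifying", "Degraded", "Failed"}
SUB_STATES = {"Backup", "Remodel-Deletion", "Remodel-Addition",
              "Remodel-Relocation"}

# Allowed transitions between the non-transient states (inferred from text).
TRANSITIONS = {
    ("Initialization", "Steady"), ("Initialization", "Degraded"),
    ("Steady", "Modifying"), ("Modifying", "Steady"),
    ("Steady", "Degraded"), ("Modifying", "Degraded"),
    ("Degraded", "Steady"), ("Degraded", "Failed"),
    ("Failed", "Degraded"),
}

def transition(current, nxt):
    """Atomic transition check: the only way out of a state is another
    valid transition."""
    if (current, nxt) not in TRANSITIONS:
        raise ValueError("illegal transition %s -> %s" % (current, nxt))
    return nxt
```

For example, a node in the Steady state may enter Modifying on an update request, but a Failed node cannot jump straight back to Steady; it must first recover through Degraded.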
As mentioned in the assumptions (section 3.1.2), all error transitions in the NSM deal with
node level failures. The NSM is not responsible for classifying the severity of a failure, or
for initiating any application-specific recovery strategy.
Also, the NSM does not guarantee whether a transition is asynchronous (self-determined by
the local node) or RDSS synchronous (agreed upon by all server nodes). It is up to the
transition trigger's provider, the NCM or the ASM, to achieve network synchronization
for any particular transition. The NSM does not interface with other server nodes in the
system and thus does not deal with any synchronization issues.
All transitions in the state machine are atomic. The only way to exit a state is via
another transition. Following are brief summaries of the server node behaviour within
each state.
3.4.1 NSM Initialization State
Every NSM enters the Initialization state on startup of the server node software. For a
successful node startup, both NCM and ASM have to be started and initialized properly.
Only then can the NSM transition out of the initialization state and return to the last
known system state, Steady or Degraded, before the last system shutdown. If the last
known system state was the Modifying state, the outstanding update operations will be
aborted by the ASM containing the primary copy of the data affected.
Figure 3.2: State Transitions Diagram of an RDSS Server Node
3.4.2 NSM Steady State
The Steady state is entered when all server nodes are functioning properly, and there is
no outstanding update request for the node. The only activities in the RDSS are query
requests from the users.
Node level failure will cause the NSM to transition into the Degraded state. The node
will return to the Steady state when either the failure has reverted itself, or the RDSS has
remodelled to eliminate the failed node. The NSM enters the Modifying state to handle
an update request.
3.4.3 NSM Modifying State
In order to add or remove data via the external update interface of the RDSS, the NSM
must be in the Modifying state. Both query and update requests will be accepted during
this state. It is the responsibility of the UMD and the ASM modules to ensure the
integrity of update transactions.
When an update session is completed, the system will return to the Steady state. As in
the Steady state, a node level error in the Modifying state causes the NSM to transition to
the Degraded state. The current update request will either be abandoned or be restarted
when the system returns to the Steady state.
3.4.4 NSM Degraded State
If any server node ceases to function, the RDSS-enabled application is considered to be
in the Degraded state. The NSM remains in the Degraded state for as long as the system
has not fully recovered to the predefined level of fault tolerance (currently, a single level
of mirroring).
Each of the four different sub-states within the Degraded state supports a different
recovery operation. Their interaction is shown in figure 3.3, and briefly described below.
NSM Backup Sub-State
In this sub-state, the backup Application Servers are required to provide the missing
data set that corresponds to the node that has failed. This transition is asynchronous
and self-determining.
After a given waiting period in the Backup sub-state, the server nodes in the RDSS may
agree (via NCM synchronization) to remodel the node list by deleting the node that has
gone down. The NSM will then enter the Remodel-Deletion sub-state. To simplify the
design, remodelling may delete only one node at a time.
NSM Remodel-Deletion Sub-State
Only one node can be deleted on each deletion remodelling, provided there is enough
storage capacity reported by the ASM on each node, as indicated by the Delete Request
response on the NCM node contact list. Remodelling is not started if there is not enough
storage to restore the system to the single mirroring level required. NCM synchronization
must be used to ensure the synchronization of the system before any NSM can enter this
sub-state.
During remodelling deletion, backup data corresponding to the failed node becomes
primary data in the current node. New backup copies will be made at other locations
according to the mirroring strategy. These operations are handled by the ASM's in the
system. On receiving the remodel commit command, the failed node is removed from
the NCM node contact list. Upon successful recovery, the NSM will return to the Steady
state.
Figure 3.3: Sub-state Transitions with the Degraded NSM State
NSM Remodel-Addition Sub-State
As in the case of node deletion, the NCM will be responsible for synchronizing all server
nodes to accept a new node. Once the RDSS environment has agreed to the addition,
all nodes are notified to enter the corresponding sub-state.
Unlike the deletion situation, the Remodel-Addition sub-state can only be entered from
the Steady state. While in this sub-state, the RDSS will perform load balancing. When
the load balancing policy is satisfied, the NSM of the new node will initiate the completion
handshake via the NCM, causing the system to return to the Steady state.
NSM Remodel-Relocation Sub-State
If there is not enough capacity to handle a node deletion remodelling, a new node is
needed to increase the total capacity of the system. However, since a node has failed, a
node addition operation cannot be completed. Thus, a relocation remodelling operation
is necessary. It allows the recreation of the failed node on a new node.
The NCM synchronization for relocation is described in section 3.3 as a combination of
the deletion and addition protocols. In the NSM, the relocation transition is initiated
from the Backup sub-state. During the Remodel-Relocation sub-state, the missing node will be
reconstructed from the corresponding copies from other nodes. Upon successful recovery,
the NSM will return to the Steady state.
3.4.5 NSM Failed State
The Failed state is entered if and only if the number of node-level failures is greater
than the node-level fault tolerance of the system. If the system policy allows the RDSS
application to remain on-line despite the fact that the complete data set is not available,
each node will stay in this state. Otherwise, a shutdown command will be sent to the
NCM of all server nodes.
In this state, the RDSS system will continue to run. The only way out of this state
is for the failed nodes to come back on-line or for the system to terminate. In the future,
the architecture may be modified to allow the system to return to the Steady state with
the remaining data.
3.5 Application Server Manager (ASM) Module
Each node is responsible for mirroring a portion of the data from all the other server
nodes. There are N - 1 mirror segments on each node in an N-node RDSS. Each mirror
segment is serviced by a copy of the application server. The ASM is responsible for
managing these application servers.
At server startup, the ASM connects to the update ports of all the local Application
Servers. All update requests are filtered and incoming data is mirrored. Links are
established from each ASM to the update ports of the other ASM's in the system. The data
redirection required by the mirroring strategy during updates and remodelling is done
through these links.
Information is shared between ASM and NSM, such that the NSM can issue remodelling
commands to the ASM and the ASM can inform the NSM if there is an update request
from the marshaller-dispatcher. The ASM also interfaces with the Hardware Abstraction
Layer (HAL) software, through which resizing of the data segment can be performed if
needed.
In addition, the ASM is responsible for providing the location of its update port, along
with a list of visible application server query ports to the active marshaller-dispatcher.
Changes in these port locations are sent to the Marshaller-Dispatcher Port Manager
(MDPM) software module, via a control communication channel between the MDPM
and the ASM.
3.6 Hardware Abstraction Layer (HAL) Module
The main purpose for the Hardware Abstraction Layer (HAL) is to permit the RDSS to
control the storage usage of the application servers. It provides a simple storage interface
library for the application server regardless of the underlying operating system.
By interfacing with the ASM, the HAL allows dynamic on-line storage management.
Storage segments can be resized and re-mapped without having to restart any software
on the server node. Thus, the application server does not need a complicated storage
management component. The interface to the HAL library is described in chapter 4.
3.7 Marshaller-Dispatcher Port Manager (MDPM)
Module
The Marshaller-Dispatcher Port Manager (MDPM) is the coordinator in the RDSS
marshaller-dispatcher software. It maintains and exports two lists: an RDSS update
node list and an RDSS query connection list, which are used by the Update Marshaller-
Dispatcher (UMD) and the Query Marshaller-Dispatcher (QMD) respectively. It only
observes and does not influence state transition of any server node. The MDPM estab-
lishes and maintains a control connection with the ASM in each server node. The ASM
is responsible for providing up-to-date update and query port locations to the MDPM.
The RDSS update node list is used by the Update Marshaller-Dispatcher (UMD) software
module. Unlike the contact list broadcasted periodically by each NCM, there are no
control flags or join sequence numbers associated with an RDSS update node list entry.
The RDSS query connection list contains all the query ports visible from the Query
Marshaller-Dispatcher (QMD). It provides snapshots of the states of the current query
ports on all the visible Application Servers.
3.8 Query Marshaller-Dispatcher (QMD) Module
As the name implies, the QMD (Query Marshaller-Dispatcher) is responsible for
distributing (dispatching) the users' query requests to the appropriate RDSS nodes and
collecting (marshalling) the query results. Each query client is serviced by an instantiation
of the QMD, each of which contains an instantiation of the AQFE (Application
Query Front-End) described in the previous chapter.
The RDSS-specific portion of the QMD module is its library, which provides a restricted
view of the RDSS environment. The interaction between the library and the MDPM
is hidden from the AQFE. The MDPM communicates changes in the query connection
list to the AQFE. However, since each QMD instance communicates with each visible
application server directly, it is possible to experience local TCP connection timeouts
or errors. Thus, the universe view in each QMD instance is a combination of the query
connection list provided by the MDPM and its local history.
The process for marshalling data is application dependent. This approach is deliberately
chosen because of the broad target application domain. New instances of the QMD are
created by the MDPM as new query clients appear. On startup, the QMD establishes
its own communication with all the visible AS query ports available in the MDPM query
connection list.
3.9 Update Marshaller-Dispatcher (UMD) Module
The counterpart to the QMD above is the UMD (Update Marshaller-Dispatcher). It is
responsible for update requests (as opposed to query requests).
Due to the restrictions on the application update protocol specified in the last chapter
and the trust assumption of the update client, it is possible for the RDSS to include
a generic AUFE (Application Update Front-End). The generic AUFE distributes data
solely based on balancing the capacity usage on each node.
The UMD (excluding the AUFE) is stateless. The integrity of update operations is
handled by the two-phase transaction model. Changes are not committed until the
commit command is sent.
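The two-phase pattern can be sketched as below (a hedged illustration, assuming a prepare/commit/abort interface on each participant; the names and the stub participant class are hypothetical, not the prototype's actual API):

```python
def two_phase_update(participants, update):
    """Stateless coordinator sketch: the change is committed nowhere unless
    every participant acknowledges the prepare phase."""
    prepared = []
    for p in participants:
        if not p.prepare(update):
            # Any refusal aborts the transaction on all prepared participants.
            for q in prepared:
                q.abort(update)
            return False
        prepared.append(p)
    for p in prepared:
        p.commit(update)
    return True

class StubParticipant:
    """A stand-in for an ASM update port, for illustration only."""
    def __init__(self, ok=True):
        self.ok, self.committed, self.aborted = ok, False, False
    def prepare(self, update):
        return self.ok
    def commit(self, update):
        self.committed = True
    def abort(self, update):
        self.aborted = True
```

The key property shown here is atomicity at the node level: either every participant commits the update, or none of them does.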
In addition to the update client, the UMD interfaces with all ASM's and the MDPM,
via the UMD library. Instead of connecting to the update ports of the application server
instances on each node, the UMD communicates with the update port of each ASM,
which is responsible for data replication and mirroring.
The MDPM provides the UMD with the list of update-enabled nodes, along with the
number of nodes in the RDSS environment. Before any update request can be processed,
all NSM's must enter their corresponding Modifying states. Only in that state can the
local ASM accept an update session connection.
3.10 Sample Operations
To summarize the overview given in this chapter, several typical RDSS operations are
presented and the interactions among components are explained. Important features of
the system will be emphasized. Four sample operations are described in the following sub-
sections: the startup sequence of the system, the successful query and update operations,
the simple recovery sequence on a node failure, and the on-line resizing of storage.
3.10.1 Successful System Startup
There is no specific startup sequence among server nodes. The only prerequisites are
the number of application nodes in the system and the consistency of communication
parameters for all modules. Also, the existing RDSS address allocation to the nodes
must be completed and cannot contain any overlap. That is, the data in the system is
coherent.
A predefined startup window (or delay) is used by all application nodes. If a broadcasted
contact list is received from a valid RDSS server node, the NCM will change the
corresponding control flag from Startup to On-line in its local node contact list. Otherwise, if
no contact is established and the startup window has passed, the remaining nodes with
Startup flags are deemed Off-line.
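The startup-window rule reduces to a simple flag promotion, sketched here (illustrative only; the timer is abstracted into a boolean, and the entry layout is an assumption):

```python
def apply_startup_window(contact_list, heard_from, window_elapsed):
    """NCM startup sketch: peers whose contact list has been received go
    On-line; once the startup window has passed, peers still marked
    Startup are deemed Off-line."""
    for entry in contact_list:
        if entry["flag"] == "Startup" and entry["node"] in heard_from:
            entry["flag"] = "On-line"
        elif entry["flag"] == "Startup" and window_elapsed:
            entry["flag"] = "Off-line"
    return contact_list
```

After this pass, no entry is left in the Startup state, so the node can safely decide whether to enter the Steady or the Degraded state.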
After the NCM has established contact with the other nodes, the local ASM establishes a
replication update connection with the ASM on each other server node. On completion, the
NSM goes from the Initialization state to the Steady state.
Only after a server node has exited the Initialization state are the query and update
port locations made available to the MDPM. On startup, the MDPM establishes a
control connection to every ASM. The MDPM startup parameters include the locations
of the nodes. After the update node list and a query connection list (visible query port
list) are built, regular query and update sessions can commence.
3.10.2 Successful Queries and Updates
Once the RDSS application has successfully started on the network, including at least
one marshaller-dispatcher, query and update requests can proceed. When the MDPM
detects a request for a new query session, a new QMD instance is created to service that
session. Similarly, a new UMD instance is enabled for an update session, except that the
system is limited to one update session at a time.
The QMD is successfully started when it has established communication with the query
ports listed as available on the query connection list. Query requests sent by the user are
directed to some or all the AS instances, as determined by the AQFE logic. By collecting
all the return data from the AS instances, a merged result is returned to the query client.
Update operation of the UMD is similar, except that the modifications are handled
through a two-phased commit transaction. In the generic AUFE given, new data is
sent to the least full (in terms of capacity) node. The ASM of that particular node is
responsible for ensuring the data is replicated to other nodes. The UMD will not commit
the operation until the ASM returns a successful reply.
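The update routing above can be illustrated with a small sketch of the generic AUFE rule: send each new entry to the least full node and commit only on a successful ASM reply. The function names and the capacity-map representation are assumptions made for illustration:

```python
def least_full_node(capacity_used):
    """Pick the node with the lowest fraction of capacity in use."""
    return min(capacity_used, key=capacity_used.get)

def send_update(entry, capacity_used, send_to_asm):
    """Route the entry to the least full node; commit only on success.
    The target node's ASM is responsible for replication/mirroring."""
    target = least_full_node(capacity_used)
    ok = send_to_asm(target, entry)
    return ("commit" if ok else "abort", target)
```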
3.10.3 Successful Recovery
Consider the simple node failure case. If one node has temporarily failed and the local
NCM has lost contact with the failed node, the ASM will make the query port of the
secondary AS associated with the failed node visible to the marshaller-dispatcher by
sending the port location to the MDPM. If the failed node returns, the system returns to
the Steady state, and the reversing instruction will be sent from the ASM to the MDPM
through the control connection, removing the temporary secondary query port.
If the failure described above is permanent, the NCM of the guardian for the failed node
will timeout. Provided that the two-phase commit transaction described in section 3.3
succeeds, the ASM of all nodes will start remodelling to delete the failed node. Depending
on the actual storage allocation on each node, some partition resizing may need to be
done before the remodelling can be started.
Since data on each server node is mirrored to N - 1 nodes on an N node system, every
single server node has a secondary AS that deals with a subset of data from the failed
node. The ASM will try to export these to the primary AS on the same node. As new
primary data, these new data entries are replicated and mirrored to other remaining
nodes within the system. Only when the level of mirroring is restored and all data can
be found in a primary AS is the system considered to be recovered.
3.10.4 Changing Storage Size
To change the amount of storage available to a particular Application Server (AS) in-
stance, the HAL must be used to modify the size of the corresponding virtual disk
partition. To increase the usable partition size, the following sequence of operations is
needed:
1. The ASM increases the size of the specified virtual disk partition via the HAL
on-line maintenance library.
2. The EXPAND storage update command is sent to the AS instance that uses the
specified virtual disk partition, to inform it of the extra storage space available.
The operation sequence for reducing storage space is slightly more complicated. Storage
truncation should only be attempted if there is enough unused storage available. The operation
sequence for storage truncation is as follows:
1. Before attempting any size reduction in a virtual disk partition, the ASM uses the
STORAGE available update command on the affected AS instance to find out how
much free space is available for truncation.
2. If there is enough room, the TRUNCATE storage update command is sent to the
affected AS, asking it to reduce its storage usage.
3. When that is complete, the ASM then can use the HAL on-line maintenance library
to free up the unused storage in the specified virtual disk.
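The two resize sequences above can be sketched as pure functions over a small record of sizes. The `state` dictionary is a stand-in for the HAL virtual disk partition and the AS storage commands, which are not specified here:

```python
def expand_partition(state, extra):
    """EXPAND sequence: grow the virtual disk first, then inform the AS."""
    state["partition_size"] += extra    # 1. HAL grows the virtual disk
    state["as_visible"] += extra        # 2. EXPAND tells the AS instance
    return state

def truncate_partition(state, amount):
    """TRUNCATE sequence: check free space, shrink the AS, free the disk."""
    free = state["as_visible"] - state["used"]    # 1. STORAGE query
    if free < amount:
        return None                               # not enough unused storage
    state["as_visible"] -= amount                 # 2. TRUNCATE on the AS
    state["partition_size"] -= amount             # 3. HAL frees the storage
    return state
```

Note the opposite ordering: expansion grows the disk before the AS learns of it, while truncation shrinks the AS before the disk is reduced, so the AS never addresses storage that does not exist.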
Chapter 4
RDSS Detailed Design
In the last chapter, we described the overall architecture of the RDSS system. To achieve
a working prototype, more detailed design and implementation choices are required. In
this chapter, the detailed behaviour of each software component in the RDSS environment
will be discussed, along with the implementation choices made for the prototype.
To recapitulate the previous chapter, the RDSS server node software consists of four
major RDSS modules, along with instances of the application server (AS). The mod-
ules are the Node State Machine (NSM), the Node Configuration Monitor (NCM), the
Application Server Manager (ASM), and the Hardware Abstraction Layer (HAL).
The RDSS Marshaller-Dispatcher software has three main components: the Marshaller-
Dispatcher Port Manager (MDPM), the Query Marshaller-Dispatcher (QMD) and the
Update Marshaller-Dispatcher (UMD). The MDPM controls the startup and synchro-
nization of the QMD and the UMD. The QMD is responsible for multiplexing and de-
multiplexing the query requests and responses, while the UMD handles update traffic.
The next section lists the target platform and the underlying operating system assump-
tions for the RDSS prototype. The rest of the chapter describes the detailed design of
each RDSS component.
4.1 Target Platform Assumptions
The current RDSS design limits its implementation to a local area network (LAN). For
convenience, the prototype assumes an Ethernet-based LAN with average round trip
time on the order of 100 ms between any two nodes. More importantly, the underlying
network must support broadcasting of UDP packets.
At the operating system level, the TCP/IP protocol suite is assumed to be available
on all nodes. Appropriate security features, such as access control, should be available.
Also, multitasking support is required.
For prototyping, a network of Ethernet connected UNIX workstations satisfies the above
requirements. The prototype is designed with this choice in mind. However, it should be
easily portable to other networks and operating systems that satisfy the above constraints.
To simplify load balancing and mirroring, the RDSS prototype assumes a homogeneous
environment, particularly that all nodes have the same storage available.
4.2 Node Configuration Monitor (NCM) Detailed
Design
As described in the previous chapter, the Node Configuration Monitor (NCM) module
on each server node is responsible for monitoring changes in the rest of the RDSS server
nodes, excluding the local node. Based on the observed changes, it triggers the appro-
priate transitions in the local Node State Machine (NSM). It is also the guardian for the
next node on the node contact list.
In addition to the UDP transport layer, the NCM requires the support of the local real
time clock. Internally, a down counter is associated with each entry in the contact list.
On a clock tick (e.g., 2 seconds in the prototype), each counter is decreased by 1 and the
current local node contact list is sent to the network broadcast address. In the prototype,
the clock ticks are triggered by the timeout mechanism in the select operation.
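The tick mechanism can be sketched as follows; in the sketch the tick is a plain function call rather than a select timeout, and the counter map is an illustrative stand-in for the contact list entries:

```python
def tick(counters, broadcast):
    """One NCM clock tick: decrement every down counter, broadcast the
    local contact list, and report which counters just hit zero."""
    expired = []
    for node in counters:
        counters[node] -= 1
        if counters[node] == 0:
            expired.append(node)    # node will be flagged off-line
    broadcast()                     # send the local contact list
    return expired
```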
4.2.1 NCM Startup
On each server node, there is an NCM startup file that contains the locations of all nodes
in the RDSS, the reserved network broadcast address, and the port to be used by the NCM.
In the prototype, the NCM uses the UDP transport protocol to broadcast its contact
list. For convenience, the location of a node is denoted by its numeric IP address and the
UDP broadcasting port number. Therefore, the node contact list is a list of IP addresses
with UDP ports, sorted by the join sequence number.
Initially, each remote node on the local contact list has the startup control flag, with the
timeout down counter set to the startup value. The local NCM broadcasts the list at a
regular interval (e.g., every second). In addition, the NCM also checks with the ASM
to see if there was a remodelling operation in progress before the last re-start. If the
local node was the controlling guardian node, then it is responsible for restoring the last
known contact list flag for the node being remodelled.
Each NCM is also listening for the broadcasts from other server nodes. For every packet
that the NCM receives, it performs the following sequence of operations:
1. It identifies the source location of the contact list packet.
2. It changes the local contact list if the control flag associated with the source location
on the contact list received is on-line. The timeout down counter for that node is
set to the on-line timeout value.
3. The local node must assume it has been deleted from the system if the contact list
packets from other nodes do not contain the location of the local node. It should
then perform a reset, clear all local data and attempt to rejoin the RDSS as a new
node.
4. It checks the whole contact list for remodelling flag settings associated with other
nodes. The local node should honour the last remodelling flag in the local contact
list before the system restart, including any Ready flags.
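Steps 1 to 3 of the sequence above can be sketched as follows. The packet layout and flag strings are assumptions made for illustration:

```python
def handle_packet(local_node, local_list, packet, on_line_timeout):
    """Steps 1-3 above: note the source, refresh its entry if it reports
    itself on-line, and detect our own deletion from the system."""
    src, contacts = packet["source"], packet["contacts"]    # step 1
    if contacts.get(src) == "on-line":                      # step 2
        local_list[src] = {"flag": "on-line", "counter": on_line_timeout}
    if local_node not in contacts:                          # step 3
        return "reset"    # clear local data and rejoin as a new node
    return "ok"
```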
If any timeout down counter reaches zero, the corresponding node is flagged off-line.
When no more startup flags remain in the local contact list, the NCM startup is considered
complete. The node deletion and addition synchronization mechanism is then enabled.
A trigger is sent to the NSM, causing it to transition to the Steady state, if no node has
an off-line flag. Otherwise, for each off-line flag, a node failure trigger is sent to the
NSM. If remodelling was in progress, the appropriate remodelling trigger is sent as well
as the node failure trigger.
4.2.2 Detecting Node Failure
To maintain an active on-line control flag in the local node contact list, the contact
list broadcast packet from that particular node must be received periodically. On each
verified reception, the timeout down counter associated with that node is reset to the
NCM on-line timeout value.
If a remote node reports a failure on a third node (i.e., an off-line flag is present in a
received contact list), no action is needed on the local node. The transition to the off-line
flag is not synchronized.
On each clock tick, the down counters are decremented. A remote node is deemed off-line
when its counter hits zero. If this happens, a node failure trigger is sent to the local NSM. If the
local node is the guardian for the failed node, the down counter will then be set to the
deletion value. The role of a guardian was described in the last chapter. In the RDSS
prototype, each node is the guardian for the next node on the sorted contact list, with
the last node on the list responsible for the first node on the list.
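The guardian assignment is a simple rotation over the sorted contact list, which can be expressed as:

```python
def guardian_target(index, n_nodes):
    """Return the index of the node guarded by the node at `index`:
    each node guards its successor on the sorted contact list, and
    the last node on the list guards the first."""
    return (index + 1) % n_nodes
```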
If a node recovers after it has been considered off-line, a recover trigger is sent to the
local NSM. Its flag on the local contact list is changed back to on-line with the timeout
down counter set to the NCM on-line timeout value. Neither the node failure trigger
nor the recovery trigger is synchronized among server nodes. Each NSM is responsible for
determining these transitions for its local node.
4.2.3 Initiating Remodelling Synchronization
The synchronization handshake protocol between a guardian and other nodes was pre-
sented in chapter 3. Here, the details on the conditions for initiating the deletion, addi-
tion, and relocation remodelling are described.
First, when a node is deemed off-line by its guardian node, the NCM deletion timeout
value is placed on the timeout counter associated with the failed node. No action is
required of the NSM module on the non-guardian nodes. When the counter hits zero,
the remodelling protocol is initiated.
For node addition, the new node first broadcasts its own contact list adding itself to the
end of the contact node list with a startup flag. Its guardian node (i.e., the previous
node on the contact list) then initiates the addition remodelling synchronization process,
provided that there is no visible off-line node or node deletion in progress. There is a
short timeout (NCM addition timeout) before the addition process begins, to ensure the
stability of the connection to the new node.
Lastly, if there is a node deemed off-line, a node addition process cannot be completed.
Instead of doing node addition, the image of the failed node is recreated on the new node.
The guardian node for this relocation process is the same as the guardian node for the
failed node. After a deletion timeout for the failed node, the relocation guardian may
choose to start the relocation synchronization instead of the deletion synchronization.
The synchronization steps were described in the last chapter.
During the synchronization, the guardian uses a remodelling timeout counter. It has the
option of abandoning the synchronization before committing a remodelling operation.
To do so, the guardian reverts the control flag of the target node to a previous setting
(i.e., off-line or on-line).
In addition, the NSM should use the remodelling interface on the Application Server
Manager (ASM) of the local node to see whether there are enough remaining storage
resources on the local node before initiating a remodelling. Other nodes would do the
same after receiving a remodelling request. Only on a successful return from the local
ASM will a non-guardian NSM respond to the remodelling request.
4.2.4 Completing Remodelling
During remodelling, contact lists broadcasted by the node being remodelled are discarded.
On the contact list of each NCM, the node being remodelled continues to have the
deletion proceed and/or addition proceed flags associated with it. The current RDSS only
allows remodelling of one node at a time. Thus, during a remodelling, the mechanism
for initiating other remodelling operations is disabled. However, it is possible for other
nodes to fail during the process. In that case, the failure trigger is sent to the NSM. As
the RDSS only handles one node failure, another node failure before completion of node
deletion is fatal. The guardian is responsible for abandoning the doomed remodelling
effort in progress.
When the remodelling operation is ready to be committed, the ASM signals the local
NCM. The NCM then exercises the remainder of the remodelling synchronization proto-
col. Once the guardian has committed the changes in its contact node list, each NCM
instructs the local ASM to commit the changes.
4.2.5 Node Reset
If an NCM contact list is received from the node that has already been deleted, it will be
ignored by the nodes in the current RDSS environment. The deleted node would notice
that it is not on any of the NCM contact lists that it received. The deleted node should
reset itself by abandoning all its data and then attempting to rejoin the system as a new
node.
If a node is being deleted, it may find itself still on the list, but with remodelling flags. No
action is required in those cases. It is the guardian's choice to abandon the remodelling,
provided that the deletion or relocation has not been committed. Otherwise, if the
deletion has been committed by the guardian but not all the nodes have committed,
the deleted node must wait for the completion of all nodes before performing a reset
operation.
4.3 Node State Machine (NSM) Detailed Design
The Node State Machine software module resides on each server node. As it maintains
the logical state of each node, it is the first component to be started on the RDSS server
node software and is executed by the main thread of the node software.
In the RDSS prototype, the triggers for NSM state transitions come from both the
NSM and the ASM modules that reside on the same server node. The inter-thread
communication is handled by a single message queue.
In chapter 3, the behaviour of each of the node states is briefly described. Here, the focus
is on the transition behaviour of the state diagram. Both the prerequisites and the actual
stimuli for each transition are given in the following sub-sections. In addition, the sub-
sections list the design choices for how these conditions are established and communicated
within the RDSS to the node state machine.
4.3.1 NSM Startup
A startup file provides the necessary information. Since NSM is not involved in query
or update transactions directly, it is not directly responsible for the recovery of those
transactions. The recovery is performed by the ASM and the NCM modules that reside on
the same server node. Incomplete data modifications before the last system termination
are abandoned. The NSM waits for both the ASM and the NCM to successfully start up
before accepting any stimuli that would cause a state transition.
4.3.2 From Initialization State to Steady State
During the Initialization state, the NCM of each node attempts to establish communi-
cation with every other node. After the NCM has established that all nodes are up and
running, a trigger is sent to the NSM, causing it to transition to the Steady state.
4.3.3 From Initialization State to Degraded State
If some of the nodes listed in the NCM parameter file are not reachable, or cannot
be started properly, then each NCM will generate a transition trigger that causes the
corresponding NSM to enter the Degraded state.
Each node failure message from the NCM indicates that one node is not available. The
first node failure message causes the NSM to enter the Backup sub-state. Subsequent
messages may cause the NSM to transition to the Failed state.
One exception to the above is a node failure during the system start. If the remodelling
process was in progress before the last termination, the NCM will attempt to continue
the last remodelling effort. If this action is successful, instead of generating node fail-
ure stimulus, the NCM sends the corresponding remodelling message to the NSM. The
NSM transitions back to the same remodelling sub-states that it was in before the last
termination. See section 4.2.1 on NCM startup for more details.
4.3.4 From Steady State to Modifying State
When an update request is received by the ASM, a message is sent to the local NSM caus-
ing it to transition to the Modifying state. Once the node has entered the Modifying state,
the ASM can start processing update requests from the Update Marshaller-Dispatcher
(UMD).
4.3.5 From Steady State or Modifying State to Degraded State
If a node level failure is detected by the NCM module, a node failure message will be
sent to the NSM causing a transition to the Degraded state. If the NSM is currently
in its Steady or Modifying state, it will enter the Backup sub-state within the Degraded
state. This transition is asynchronous and self-determined. The NCM can initiate such
a transition as soon as it can no longer detect a node.
The difference between a transition from the Modifying state and a transition from the
Steady state is that, in the former, the current outstanding update request needs to be
aborted before the transition. Note that no new update requests are accepted once a
node exits the Modifying state and enters the Degraded state. The ASM is notified of the
transition.
The ASM is responsible for refusing further update requests. It is also responsible for
making the necessary backup servers available to the Query Marshaller-Dispatchers (via
MDPM) to provide the missing data sets that correspond to the node that has failed.
4.3.6 From Modifying State to Steady State
While in the Modifying state, the NSM does not enforce the transaction atomicity directly.
Enforcing atomicity is the responsibility of the UMD and the ASM modules. When the
update session is completed, the ASM informs the NCM, which in turn causes the NSM
to return to the Steady state.
4.3.7 From Degraded State to Steady State
The system can recover from a node failure either by a remodelling of the RDSS environ-
ment, deleting the node in question, or by re-establishing contact with that node. When
the NCM is satisfied that all nodes in the RDSS environment are ready and available,
the NSM is told to transition back to the Steady state.
4.3.8 From Degraded State to Failed State
For each node level failure detected, the NCM sends a node failure message to the corre-
sponding NSM. Depending on the number of mirroring or backup levels, the additional
node failure may lead to a less-than-complete data set being available for queries. In that
case, the NSM would exit the Degraded state and enter the Failed state.
This transition can occur in any of the Degraded sub-states. All remodelling efforts are
aborted when the Failed state is entered. In the RDSS prototype, the mirroring level is
one, which means that if more than one node fails, the system will enter the Failed state.
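The transitions described in sections 4.3.2 to 4.3.8 can be summarized as a lookup table. The message names are ours, and the Degraded sub-states are omitted for brevity:

```python
# Hypothetical transition table for the NSM; messages come from the NCM
# and ASM modules on the same node.
TRANSITIONS = {
    ("Initialization", "all-nodes-up"): "Steady",
    ("Initialization", "node-failure"): "Degraded",
    ("Steady",         "update-start"): "Modifying",
    ("Modifying",      "update-done"):  "Steady",
    ("Steady",         "node-failure"): "Degraded",
    ("Modifying",      "node-failure"): "Degraded",
    ("Degraded",       "all-nodes-up"): "Steady",
    ("Degraded",       "node-failure"): "Failed",   # mirroring level is one
}

def next_state(state, message):
    """Look up the transition; unknown messages leave the state unchanged."""
    return TRANSITIONS.get((state, message), state)
```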
4.3.9 NSM Termination
Depending on the application, it may or may not make sense to allow the system to
remain in the Failed state if the accessible data set is less than complete. If not, the
NSM will orchestrate an orderly shutdown of the RDSS environment via the NCM. Once
the shutdown decision has been made, the only method of recovery is to restart the
RDSS.
4.3.10 On-line Removal of a Node
Removal of a node while the RDSS environment is on-line is permitted with the system in
the Remodel-Deletion sub-state within the Degraded state. Unlike the Backup sub-state,
entering the Remodel-Deletion sub-state requires system-wide synchronization, which was
discussed in the NCM section.
The deletion transition can be initiated in one of the following ways:
- Incomplete remodelling detected by the NCM on restart.
- An NCM off-line timeout during the Backup sub-state in the guardian NCM.
- An administrative command by-passing the timeout mechanism in the guardian
NCM.
For the last two ways of entering the Remodel-Deletion sub-state, the guardian NCM
also has to make sure there is enough remaining capacity in the system. Once an NSM
has started on a remodelling synchronization process, it is no longer able to respond to
another remodelling request until the current remodelling has been completed.
Transitions to the Remodel-Deletion sub-state are tracked in persistent storage in case
of unexpected node termination. Only one RDSS node may be deleted during each
remodelling. This is not a problem for the RDSS prototype as only one level of fault
tolerance is being supported.
Once the NSM has entered the Remodel-Deletion sub-state, the corresponding ASM on
the local node is instructed to start merging secondary data corresponding to the failed
node into the local primary data. This operation includes making the new secondary
copies of the data on other RDSS nodes. Once the level of fault tolerance is restored,
the NCM's are again synchronized and a remodel complete message is sent to the NSM,
returning it to the Steady state.
Deletion remodelling cannot be completed if there is not enough storage to reestablish
the required fault tolerance level. Therefore, the NCM must ensure that there is enough
room before telling the NSM to enter the Remodel-Deletion sub-state.
4.3.11 On-line Addition of a New Node
When a new node is added to the system, its guardian NCM contacts all existing nodes for
synchronization. In the RDSS prototype, the last node on the node list is the designated
guardian for the new node. Once the RDSS environment has agreed to the addition, the
NSM's on all nodes are notified to enter the Remodel-Addition sub-state.
The Remodel-Addition sub-state can only be entered from the Steady state. If a node is
added when the RDSS application is in the Backup sub-state, the system transitions to
the Remodel-Relocation sub-state (see below) instead.
During this sub-state, the RDSS performs load balancing. When the load balancing
policy is satisfied and all data transfers are completed, the ASM informs the NCM. The
guardian NCM then performs the completion handshake via the NCM. If successful, the
NSM will be instructed to return to the Steady state.
4.3.12 Relocating Data from a Deleted Node to a New Node
When a new node is added during the Backup state of the guardian node, no node addition
operation is performed. Instead, the relocation guardian (same as the deletion guardian)
for the failed node may choose to initiate node relocation. The same list of starting
conditions that apply to node deletion also applies here, with the extra requirement that
there is a new node waiting to join the system.
During a node relocation, mirrored data from the failed node goes into the primary
application server instance on the new server node. Also, each surviving node must
recreate the missing mirrored portion of its own data on the new node. Like node
deletion, only one node can be relocated at a time.
4.4 Application Servers Manager (ASM) Detailed
Design
The main role for the ASM is to oversee the single primary instance and the multiple
secondary instances of the Application Server (AS) on the local node. All data mod-
ification requests flow through the ASM. In addition, all ASM modules participate in
node remodelling (deletion, addition and relocation). The ASM also communicates with
the Marshaller-Dispatcher Port Manager (MDPM) to control the visibility of AS query
ports. The following sub-sections describe each of the responsibilities of the ASM.
4.4.1 Data Mirroring
Each ASM has an update port and a control port. At initialization of the MDPM, a
TCP connection is established between the MDPM and the control port of the ASM.
The ASM sends the location of its visible update port to the MDPM. From then on, all
update traffic from the Update Marshaller-Dispatcher (UMD) is sent to this ASM update
port.
Within each RDSS server node, there is one (primary) AS instance and N - 1 (secondary)
AS instances on each node. The primary AS instance is responsible for storing the new
data entries from the UMD. Each secondary AS instance is responsible for mirroring a
portion of a remote node. Conversely, data in the primary AS instance is mirrored to
N - 1 secondary AS instances, each on a different remote node.
At the beginning of each update session, the UMD requests the available capacity of all
the server nodes via the ASM update port. Using the capacity estimates from all the
nodes, the UMD sends the next new data entry to the least full server node by way of
the ASM update port. The ASM uses the capacity usage of its remote mirroring location
to select the least full mirroring node. The data entry is then sent through the update
connection between the two ASM's. The primary ASM is responsible for remembering
the RDSS addresses of the data entries in its primary AS and the location of the mirroring
node for every entry.
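The mirror selection and bookkeeping described above can be sketched as follows; the RDSS-address and usage-map representations are illustrative assumptions:

```python
def record_entry(mirror_map, rdss_address, mirror_usage):
    """Mirror a new primary entry on the least full mirroring node and
    remember where the mirror lives, as the primary ASM must."""
    target = min(mirror_usage, key=mirror_usage.get)
    mirror_map[rdss_address] = target    # RDSS address -> mirroring node
    mirror_usage[target] += 1            # update the local usage estimate
    return target
```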
Removing data is the reverse of adding data. Merge and extract operations can be viewed
as multiple additions and deletions. Changes are not updated until they are committed
by the commit command. Multiple requests may be committed with a single commit
command. To ensure that no data is lost, once the update transaction is started, it can
only be terminated by either a commit or an abort message.
Before an ASM sends a ready reply to the UMD, it must first receive a ready reply from
the mirroring ASM's. When the commit command arrives, the primary ASM saves the
commit flag to the persistent storage. It then sends the commit command to the mirroring
ASM's. After the mirroring ASM's have acknowledged the commit, the changes on the
local node are then committed. Finally, the commit flag is removed from the persistent
storage and acknowledgment is returned to the UMD.
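The commit path above can be sketched with a plain dictionary standing in for the ASM's persistent storage; the callable arguments are illustrative stand-ins for the mirror connections and the local commit:

```python
def commit_transaction(persistent, mirror_commits, apply_local):
    """Commit path: persist the commit flag first, commit the mirrors,
    commit locally, then clear the flag and acknowledge the UMD."""
    persistent["commit_flag"] = True    # survives a crash mid-commit
    for send_commit in mirror_commits:  # commit on every mirroring ASM
        send_commit()
    apply_local()                       # commit the local changes
    del persistent["commit_flag"]       # safe to clear once all are done
    return "ack"
```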
4.4.2 Remodelling Interface
Before node deletion remodelling can begin, the guardian ASM has to ensure there is
enough capacity in every node for the operation. This occurs when there is enough room
to:
- transfer the data entries corresponding to the failed node from the local secondary
AS to the local primary AS.
- mirror new primary data to the other nodes by asking other ASM's to determine
whether there is enough storage for the data.
If both criteria are met, the ASM will let the NCM proceed with the deletion remodelling
synchronization. Repartitioning of storage between the primary AS instance and the
secondary AS instances is possible (but not implemented in the RDSS prototype for
remodelling).
When the NSM enters a remodelling state, the ASM performs the necessary transfer of
data from the secondary AS corresponding to the failed node to the local primary AS.
The ASM performs the following sequence of operations:
Removes the failed node from the mirror destination list.
Sends the extract update command to the secondary AS on the local server node
corresponding to the failed node.
Sends the merge update command to the ASM update port of the local node. The
merge routine in the ASM module handles the necessary mirroring by treating the
data as new data.
Sends a ready signal to the NCM after both the extract and merge operations are
ready.
Commits the outstanding remodelling changes on receiving the remodelling commit
command.
The operation sequence for node addition remodelling is slightly different. Since the
level of mirroring does not increase, the storage requirement on each node is not affected.
However, due to the remodelling sequence used, extra capacity is needed for moving
data. The ASM in the prototype needs to make sure that there is enough room for a
new secondary AS on the local node. For node addition, the ASM performs the following
sequence of operations:
1. Creates a new secondary AS corresponding to the new node. Adds the new node to
the mirroring destination list.
2. Sends the extract update command to the ASM update port on the local node for
roughly 1/N (on an N node system) of the data stored in the primary AS instance.
The extract routine performs the necessary deletion of mirrored data from remote
nodes.
3. Sends the merge update command to the ASM of the new node. The merge routine
automatically handles the necessary mirroring.
4. Sends a ready signal to the NCM after both the extract and merge operations are
ready.
5. Commits the outstanding remodelling changes on receiving the remodelling commit
command.
The sequence for relocation remodelling involves the recreation of the failed node. Pro-
vided the new node is at least the size of the failed node (by the homogeneous node size
assumption), no capacity check is necessary. The ASM performs the following sequence
of operations for node relocation:
1. Replaces the failed node with the new node in the mirroring destination list.
2. Searches the mirroring destinations of all the data entries on the local node. For each
data entry mirrored to the failed node, send it to the mirroring AS on the new
node.
3. Changes the local secondary AS instance that corresponds to the failed node to
correspond to the new node. Its data is extracted and sent to the new node and
merged into the new primary AS without generating new mirror data.
4. Sends a ready signal to the NCM after both the extract and merge operations are
ready.
5. Commits the outstanding remodelling changes on receiving the remodelling commit
command.
4.4.3 Controlling Port Visibility
On startup of a Marshaller-Dispatcher, its port manager, the MDPM, connects to the pre-
defined control port of the ASM on each server node. There are only two commands on
this control link:
Enable (update/query port)
Disable (update/query port)
Upon completing the connection, the ASM first uses the enable (port) command to send
its own ASM update port and then the query port of the primary application server (AS)
on the local node.
If a remote node has gone off-line, the NCM will instruct the NSM to enter the Degraded
state. The ASM makes visible the application server that contains mirrored data from
the failed node by sending the query port of that AS in an enable (port) command to the
MDPM. If the remote node returns to on-line status, the secondary query port will be
removed using the disable (port) command.
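The effect of these two control commands on the MDPM's set of visible ports can be modelled minimally as follows (in the prototype the commands travel over the TCP control link; the function and argument names here are illustrative only):

```python
def apply_port_command(visible_ports, command, port):
    """Toy model of the MDPM control link: only enable and disable exist.

    visible_ports: set of (address, port) locations currently exported.
    """
    if command == "enable":
        visible_ports.add(port)           # make a query/update port visible
    elif command == "disable":
        visible_ports.discard(port)       # tolerate a port never enabled
    else:
        raise ValueError("unknown control command: %r" % (command,))
    return visible_ports
```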
In the case of node deletion, the ASM of the guardian is responsible for sending the
disable (update port) command. If the update port location of the failed node is not
known, the IP address of the failed node alone will suffice. For node addition, the new
node itself is responsible for making its ASM update port and primary query port known,
after the node addition remodelling has been committed.
4.4.4 Persistent Information
The ASM must ensure the integrity of data modification operations. By the definition
of the update protocol specified in chapter 2, all update operations that modify data
are handled by transactions. Once the first phase of an update request is received by
the ASM, it needs to keep track of the transaction status until the operation is either
committed or aborted. Several outstanding update transactions may be committed or
aborted together at the same time. However, only one update session can be in progress
at any time. This is part of the assumed trust placed in the update client.
The transaction status is stored in persistent storage, so that a system crash cannot
compromise the transaction. The persistent storage also holds the commit flag for the
ASM. Once this flag is set, the outstanding transactions will be committed even after
a system crash. This commit flag is not removed until the commit acknowledgments
are received from the affected AS instances. After a restart from a system crash, the
primary ASM with any outstanding transactions would send out the appropriate update
commit or abort command, depending on whether the commit flag is set.
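The restart decision can be sketched as a small pure function (the field names are illustrative; the prototype's actual persistent format is not specified here):

```python
def recover_after_crash(persisted):
    """Decide what to do with outstanding update transactions on restart.

    persisted: dict with 'commit_flag' (bool) and 'outstanding' (list of
    transaction ids), as read back from persistent storage after a crash.
    Returns the command the ASM would issue, or None if nothing was in flight.
    """
    if not persisted["outstanding"]:
        return None                # no transaction was pending at crash time
    if persisted["commit_flag"]:
        return "commit"            # commit flag set: finish the commit
    return "abort"                 # otherwise the changes are abandoned
```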
4.4.5 ASM Interfaces
The interfaces of the ASM are depicted in Figure 4.1. For the ASM to perform all its
duties, it needs to have access to all the operations that may alter data in the storage.
It acts as the middle layer between the Update Marshaller-Dispatcher (UMD) and the
update ports of the local AS instances. For mirroring purposes, it needs to deal with
the ASM's on other nodes. To support backup and remodelling, and to report any node
failure, it needs to interface with the NSM on the local node. It is also responsible for
keeping the Marshaller-Dispatcher Port Manager (MDPM) informed of the availability of
backup query ports. Finally, the need for dealing with storage resizing and repartitioning
requires the ASM to interface with the Hardware Abstraction Layer (HAL) on the local
Figure 4.1: RDSS Server Node Components
Interface (to)             Communication Content        Implementation
AS update ports            Update Request               TCP socket connection
UMD-ASM port               Update Request               TCP socket connection
Other ASM update ports     Mirroring Request            TCP socket connection
Local NSM                  Remodel and status report    TCP socket connection
MDPM control port          Ports enable/disable         TCP socket connection
HAL library                Storage allocation           Linked during compilation

Table 4.1: Application Servers Manager Interfaces
node.
Table 4.1 summarizes the ASM interfaces, and the current implementation choices for
them. Other than the HAL interface, all communications are done through TCP socket
connections. This allows the ASM to use a simple select operation to service all of them.
4.5 Hardware Abstraction Layer (HAL) Detailed Design
Instead of directly accessing the physical storage, the RDSS software (including the
application server instances) performs data storage manipulation via the Hardware Ab-
straction Layer (HAL). In the current prototype, the HAL is implemented as two software
libraries. The HAL-AS (HAL Application Server) library needs to be linked with the application server executable, and the HAL-OM (HAL On-line Maintenance) library needs
to be linked with the ASM software module.
4.5.1 Virtual Disk
At the heart of the HAL design is the concept of a virtual disk. The HAL-AS library
presents a set of operations analogous to an actual physical disk. The actual storage,
however, may be re-routed to multiple storage destinations. The list of available HAL-AS
operations is shown in chapter 2.
Due to its hardware and operating system dependencies, portions of the HAL module
must be customized for each specific platform. Nevertheless, the interface used by the
application server should not be affected by the changes in the back-end of the HAL
module.
Each virtual disk is described by its block size and the number of blocks it contains.
These parameters, along with the mapping table to the physical storage, are stored in
the description header at the beginning of the virtual disk. As an extreme example, a
HAL virtual disk might be made up of three physical storage devices: a raw SCSI disk,
a portion of a RAID-5 disk array with a file system, and flash memory storage. The
HAL keeps track of the mappings of the virtual disk to the actual storage location. The
application server does not know that there are three different physical storage devices
involved. The main purpose of the HAL is to allow resizing and repartitioning of the
storage allocated to each application server.
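The virtual-to-physical mapping can be sketched as a lookup over an ordered extent list, a simplified stand-in for the description header's mapping table (device names and the extent representation are illustrative, not the prototype's actual header format):

```python
def resolve(vblock, extents):
    """Map a virtual block number to a (device, physical block) pair.

    extents: ordered list of (device_name, start_block, length) tuples
    describing the physical storage backing the virtual disk.
    """
    for device, start, length in extents:
        if vblock < length:
            return device, start + vblock   # falls inside this extent
        vblock -= length                    # skip past this extent
    raise IndexError("virtual block beyond end of virtual disk")
```

With the three-device example above, blocks 0-99 might live on the raw SCSI disk, the next 50 on the RAID-5 portion, and the rest in flash, yet the application server sees one flat block range.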
4.5.2 Dynamic Repartitioning
In addition to amalgamating multiple storage devices, one key feature of the HAL is to
allow for on-line manipulation of the virtual disk partition size. The capability to increase
or decrease a virtual disk partition size is included in the HAL-OM library. The HAL-OM
and HAL-AS libraries are synchronized through a queued mutually exclusive semaphore.
Only one thread may have access to the virtual disk description header information at a
time. If the header information is changed, all affected HAL libraries must also change
accordingly.
Unlike the HAL-AS library, which only provides visibility to one virtual disk at a time,
the HAL-OM can be used to maintain any number of virtual disks. In the RDSS, each
AS has its own virtual disk; the ASM can use the HAL-OM to truncate space from one
virtual disk and give it to another application server instance as needed.
4.5.3 HAL-OM Library Interfaces
In addition to resizing, the HAL-OM library contains routines to create and destroy a
virtual disk. The following are the available routines in the HAL-OM library:
Initialize:
Expand:
Truncate:
Create:
Destroy:
Shutdown:
Status:
Initializes the HAL-OM library, reads the HAL parameter file, and synchronizes
with other threads using the HAL libraries on the current server node. This
routine must be called before other HAL routines.
Increases the specified virtual disk size by the given number of blocks using
the given physical storage.
Decreases the specified virtual disk size by the given number of blocks, taken
from the end of the virtual disk. Truncated data is discarded. The location
of the freed physical storage is returned.
Creates a new virtual disk on the given physical storage devices with the given
number of blocks and block size.
Deletes a given virtual disk from the associated physical storage devices.
Frees the system resources associated with the HAL-OM library.
Returns status information about a virtual disk partition, including the virtual
disk block size and the number of usable blocks.
Plain text arguments are used for describing physical devices. This allows the HAL-
OM Library interface to remain unchanged regardless of the physical storage devices that
it supports. It is possible to put Expand/Truncate/Create/Destroy commands in the
HAL-OM startup parameter file. They are executed by the Initialize command.
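The Expand and Truncate semantics above can be modelled with the extent-list representation used earlier (a toy sketch; the real HAL-OM mutates the on-disk description header under the semaphore, and its routine names and signatures are not reproduced here):

```python
class VirtualDisk:
    """Toy model of HAL-OM resizing over a list of physical extents."""

    def __init__(self, block_size, extents):
        self.block_size = block_size
        self.extents = list(extents)       # (device, start, length) tuples

    def blocks(self):
        return sum(length for _, _, length in self.extents)

    def expand(self, device, start, length):
        """Grow the disk by appending blocks from the given physical storage."""
        self.extents.append((device, start, length))

    def truncate(self, length):
        """Shrink by `length` blocks taken from the end of the virtual disk.

        Truncated data is discarded; the freed physical extents are returned,
        mirroring the Truncate routine's behaviour described above.
        """
        freed = []
        while length > 0:
            device, start, extent_len = self.extents.pop()
            take = min(length, extent_len)
            freed.append((device, start + extent_len - take, take))
            if take < extent_len:          # partially freed extent stays
                self.extents.append((device, start, extent_len - take))
            length -= take
        return freed
```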
4.6 Marshaller-Dispatcher Port Manager (MDPM) Detailed Design
As described in the previous chapter, the role of the MDPM is to maintain both the list
of nodes in the RDSS and the list of visible AS query ports exported by the nodes. It is
also required to synchronize the lists with each other. In addition, it provides snapshots
of the lists to the QMD and the UMD.
In the RDSS prototype, connections are built on TCP sockets. The MDPM is responsible
for listening on the ports and for establishing and maintaining the communication channels.
4.6.1 Port Monitoring
The MDPM monitors and listens to all connection attempts. The port numbers are preset
at startup time, as specified by the startup parameter file for the marshaller-dispatcher.
The external query and update port numbers are unique to the marshaller-dispatcher.
They are the entry points to the RDSS to which clients connect. The node reporting
port is for synchronization between the MDPM and all server nodes.
When a new query client connection is established, a new instantiation of the query
front-end is started to handle the connection. The MDPM itself does not keep track of
the query connection spawned. It continues to listen for new query connections.
For client update connections, the MDPM does not spawn new tasks or connections. It is
a design choice to support only one trusted update connection at a time. No new update
sessions are accepted until the current one is completed.
In addition, the ASM on each RDSS server node also connects to the MDPM. In the
running state, the MDPM maintains a connection to all ASM's.
Figure 4.2: RDSS Marshaller-Dispatcher Node Components
4.6.2 Observing RDSS state changes
No new query or update connections are accepted until the minimum required number
of nodes have established connection with the MDPM. At the start of each connection, the
ASM of the RDSS server node sends its update port location and the query port location
of the primary AS instance to the MDPM.
The connection between an ASM and the MDPM is maintained at all times throughout
the operation of the system. Changes in the NSM are reported to the MDPM, which
monitors the changes via the TCP socket select mechanism. Besides changes reported
from the ASM, the MDPM also monitors any exceptions in these connections.
4.6.3 Maintaining the Node List
On start up, the marshaller-dispatcher parameter file provides the number of nodes in
the system and the minimum acceptable number. The node list contains a list of the
update port locations of each node.
On request from the update module in the marshaller-dispatcher, a shared memory copy
of the node list is made available. It is the responsibility of the UMD to obtain a new
copy of the node list before every new update session. The shared memory is protected
by a semaphore.
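The snapshot-under-lock semantics can be sketched as follows, with a lock standing in for the semaphore (the prototype shares the list between processes through shared memory rather than a Python object; the class and method names are illustrative):

```python
import threading

class NodeList:
    """Sketch of the MDPM node list with snapshot-on-request semantics."""

    def __init__(self, nodes):
        self._lock = threading.Lock()      # stands in for the semaphore
        self._nodes = list(nodes)          # update port locations per node

    def snapshot(self):
        """Return a private copy; the UMD obtains one before each session."""
        with self._lock:
            return list(self._nodes)

    def replace(self, nodes):
        """Install a new node list, e.g. after remodelling."""
        with self._lock:
            self._nodes = list(nodes)
```

An earlier snapshot is unaffected by later changes, which is exactly why the UMD must refresh its copy before every new update session.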
4.6.4 Maintaining the Query Connection List
The other list that the MDPM maintains is the query connection list, which is a list of
query ports made visible to the marshaller-dispatcher by the nodes. When all nodes are
operating in the Steady state, this list only contains the location of the query port of the
primary Application Server on each node. Associated with each query port is the index
of the associated node on the node list described in the last section.
On initialization, the QMD instance connects to shared memory that contains the up-to-date version of the query connection list. The latest query connection list can be
obtained through this shared memory buffer at any time. Only the active query ports
are on this query connection list.
If a node fails, the query port locations of the necessary secondary application server
instances will be made known to the MDPM. These query ports would be added to the
query connection list. The reverse happens when the secondary query ports are no longer
necessary. It is the job of the query modules to use this information correctly.
4.7 Update Marshaller-Dispatcher (UMD) Detailed Design
The UMD is responsible for handling all update requests from the update client, and
making sure the RDSS remains stable during and after update transactions.
4.7.1 Update Client Trust Model
The current RDSS design focuses on applications where data updates or modifications
occur infrequently or can be batched. These applications are usually controlled by a
single administrative entity. To keep the design simple, the architecture assumes that
only a single update session is needed at any given time.
The trust placed on the update client includes the following:

- Only one active update client is connected to the RDSS at a time.

- All update requests are authenticated and committed, and the information is verified.

- Restrictions imposed by the RDSS and the application are strictly followed.
An implication of this trust model is that each marshaller-dispatcher only needs one
static instantiation of the UMD. Also, there is no need for update synchronization among
UMD's.
4.7.2 Default Application Update Front-End
Unlike the query protocol, the update protocol of an RDSS compliant application is fully
specified (see chapter 2). Therefore, it is possible to include a default AUFE (Application
Update Front-End), which can be used in lieu of a custom application-specific update
front-end. The protocol supported by the AUFE is exactly the same as the one used by
the application update port as defined in appendix A.
The operation of the default AUFE is very simple. At the beginning of each update
session, the AUFE obtains the storage usage information from each node. Each new
data entry is sent to the least-full node. For other non-addition update operations (e.g.,
capacity available), the AUFE simply broadcasts the update requests to all active nodes
via the update connections to the ASM's. For any data-modifying update operations
to take effect, a commit command must be sent. Multiple update changes may be
accumulated before being committed together through a single commit command. See
the following section on data integrity for discussion of the commit operations of update
requests.
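The placement rule for new data entries reduces to a least-full selection; a minimal sketch, where `usage` is a hypothetical per-node storage-usage map of the kind the AUFE gathers at session start:

```python
def route_new_entry(usage):
    """Pick the destination node for a new data entry: the least-full node.

    usage: dict mapping node name -> fraction of its storage in use (0.0-1.0).
    """
    return min(usage, key=usage.get)   # node with the smallest usage fraction
```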
Update requests are allowed only on a successful start of the update session in the UMD.
To enter an update session, the UMD must establish a connection to every ASM update
port listed on the MDPM node list. The ASM of a server node that fails to enter the
Modifying state would not accept the update connection.
4.7.3 Update Transaction Integrity
To ensure data integrity during the modification, all update operations that affect the
data must be committed by the AUFE to take effect. The initial request is sent to one
or more server nodes. When the changes are ready, an acknowledgment is received
by the UMD. The AUFE can then send the commit command. At any time before the
commit command is sent, the update client can abort the changes. Once the commit
command reaches the ASM on the server node, the changes are finalized.
In the event of a failure during an update, the changes are aborted by the UMD. If the
failure is a system crash, the ASM will abort the update changes on the restart, unless
the commit flag is set. (See the ASM detailed design section.)
The UMD keeps track of whether there are any outstanding update changes that have
not been committed or aborted. The update session can only end if there is no outstand-
ing change. Note that non-data modifying update requests, like the storage available
requests, do not need the commit or abort command.
4.7.4 RDSS Update Marshaller-Dispatcher Library
The UMD library (UMD-Lib) has the following routines:
Start update session
Send update command
Read update command result
End update session
The UMD-Lib hides the connections between UMD and all ASM's on the nodes. It also
handles the shared resource between the MDPM and itself, specifically the active node
list. A brief description of each method is given below. The calling interfaces are listed
in appendix D.
Entering an Update Session
This start update session routine creates the TCP connections from the UMD to the
ASM on the server nodes. The server node list is obtained from the MDPM. Provided
that the connections to all these update ports are successful, the UMD-Lib would return
a success value to the caller. Otherwise, the connection attempts eventually timeout and
failure is returned.
Sending an Update Command
If no node failure has occurred since the beginning of the update session, the update
command is accepted by the UMD. The request string is then sent to the specified
nodes. Success is returned after the command is sent to the specified ASM's.
Reading an Update Command Result
If no node level failure has occurred since beginning the update session, the read update
command result routine will return the data from the first readable ASM connection
stream. Only an ASM whose corresponding flags are enabled in the select mask is read.
The order of results returned by this method is not determined.
Terminating an Update Session
The end update session routine terminates the update session by closing the update
connections to the ASM's. The UMD exits the update session regardless of the success of
this action.
4.8 Query Marshaller-Dispatcher (QMD) Detailed
Design
When a new query client connects to the RDSS via the marshaller-dispatcher query port,
the RDSS port manager creates a new process to handle all query requests from that
client connection. At the heart of each of the marshaller-dispatcher query processes is
the application query front-end module (AQFE), which is responsible for the logic of the
marshaller-dispatcher query process.
Different applications require different AQFE's. As the RDSS places no requirements on
the application query protocol, a custom AQFE must be built in order to interpret and
combine the results returned from the attached application servers. Depending on the
nature of the application and its query protocol, the AQFE could be a simple multiplexor-
demultiplexor (as provided as an example in the prototype) or a complex module with a
significant amount of logic.
Instead of directly interfacing with query clients and application servers, the AQFE must
use the routines provided by the QMD library (QMD-Lib). By doing so, most of the
RDSS related activities are hidden and automatically resolved. Thus, the coding effort
required by the AQFE is much smaller. The next section describes the callable routines
in the QMD-Lib.
4.8.1 Query Marshaller-Dispatcher Library (QMD-Lib)
The Query Marshaller-Dispatcher library (QMD-Lib) has the following routines:
- New query session
- Start query
- Read select
- Read
- Write
- End query
- Terminate query session
The AQFE is responsible for handling out-of-bound responses (query responses that arrive
after the end query). The QMD-Lib also assumes that there is only one outstanding
query per client; however, it is possible to design an AQFE to allow multiple outstanding
queries, provided that the query protocol has the proper support. The calling interfaces
can be found in appendix C.
Starting a New Query Session
This routine associates the query client to an AQFE instance. The query connection list
is updated from the MDPM. If the new process flag is set, a new AQFE process is started
to handle all queries from the given client. Normally, the routine is called by the MDPM
when a new query client appears on the marshaller-dispatcher query port. However, it
can also be used to change the query stream, provided that there is a mechanism in the
application query protocol to support this.
Note that slot 0 (offset 0) in the status list is reserved for the query client location.
There are two fields in a server query status record. The first one indicates whether it is
a primary server or not. This is a number showing the level of mirroring of the current
server (0 indicates it is a primary server). The second field is a boolean value indicating
whether the server is alive (true) or dead (false). Only the second field is relevant in the
query client slot (slot 0).
The status list is compacted at the start of every new query session (i.e., all "dead"
entries in the list are discarded). Within the same query session, the same slot in the
status list always refers to the same application server or query client. Thus, the size of
the status list does not decrease during a query session.
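The compaction rule can be sketched as follows, assuming each record is a (mirror_level, alive) pair as described above, with slot 0 reserved for the query client (a simplified model; the real QMD-Lib also reassigns slot numbers for the new session):

```python
def compact(status_list):
    """Start-of-session compaction of the QMD-Lib status list.

    Each record is (mirror_level, alive): level 0 marks a primary server.
    Slot 0 is the query client and is always kept; dead server entries are
    discarded, and survivors keep their relative order.
    """
    client = status_list[0]                                 # reserved slot 0
    servers = [rec for rec in status_list[1:] if rec[1]]    # drop dead servers
    return [client] + servers
```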
Issuing a New Query
This command broadcasts the query to all visible application servers. If the MDPM
synchronization flag is set, an up-to-date application server list will be obtained from
MDPM (recommended). Otherwise, the existing server list is used. The routine can
broadcast the given query buffer. If the read-from-client flag is set, it will read a single
line from the client file descriptor and broadcast it. If there is no data on the query client
stream to be read, the routine blocks the execution thread until the query appears or the
timer expires.
If a new application server appears during a query, after the start of the query and before
the end of the query, the same query request will be sent to it. The changes in the status
list and number of servers will be visible on the next QMD-Lib call.
Wait for Read Using Select
This routine blocks the execution thread until one of the following events occurs:
1. Data appears in the incoming stream of an application server, and the correspond-
ing bit in the read mask is set.
2. Data appears in the query client incoming stream and the bit in the read mask for
slot 0 is set.
3. An error occurs in the query client connection preventing future communication
and the bit in the exception mask for slot 0 is set.
4. An error occurs in an application server whose corresponding bit in the exception
mask is set.
5. A timeout occurs.
If an error occurs and the corresponding bit in the exception mask is not set, the QMD-
Lib will ignore the error and attempt to continue. The error is reported to the client via
the query connection status list.
Reading from the Query Client or an Application Server
This routine reads from the given stream corresponding to the given slot number. Data
from the stream is placed into the read buffer, until there is no more data available, or the
read buffer is full. If there is no data on a given stream, the routine returns immediately.
This routine does not block the execution thread.
Write to the Query Client or Application Servers
This routine writes the data in the given buffer to the given streams whose corresponding
write mask bits are set. This routine does not block the execution thread.
Ending a Query
This routine writes the data in the given buffer to the servers where the corresponding
query was sent. It also cleans up the internal state in QMD-Lib. If a new primary server
appears after the end of the query, no automatic query will be sent.
Ending the Query Session
This routine closes all QMD-Lib connections to the client and application servers regard-
less of any outstanding queries. The slot assignments of the status list are discarded at
the end of the query session.
Chapter 5
Implementation Status
A prototype of the RDSS has been implemented. However, the design has been evolving
and most of the RDSS components need to be updated or rewritten to reflect the latest
version of the RDSS architecture, described in this thesis. In addition, to test and verify
the application environment's design and the implementation, a reference application
was developed. Instead of using a full search application, this test application is a simple
text snippet server. This simple test application provides the following benefits:
- full control over its internal structure and interface design,
- reduced hardware requirements for the prototype,
- isolated testing of the RDSS software modules, and
- simplified debugging.
This chapter reports on the current status of the RDSS prototype implementation, fol-
lowed by a description of the design of the simple text snippet server, which is used as
the test application for the RDSS prototype. Finally, it outlines the necessary modifica-
tions to the current versions of the MultiText index engine and text server to make them
compatible with the RDSS.
5.1 RDSS Prototype Implementation
The current RDSS design is the result of many design and prototyping iterations. Both
the architecture and the detailed design have evolved dramatically over the iterations.
Instead of building a complete prototype at each iteration, individual modules were imple-
mented for evaluation. In this section, the implementation framework will be presented,
along with some of the issues encountered during implementation.
5.1.1 Prototyping Framework
The RDSS prototype is implemented on the LINUX operating system running on an
INTEL 486 compatible CPU with 20 MB or more system memory. The RCS version
control utility is used for maintaining the source files.
There are three development areas. The first one contains the common supporting source
code that is shared by all RDSS modules. They simplify the access to system dependent
routines, and make the RDSS module more portable. The components included here are:
- Constant definitions (const)
- Count semaphore library (cntsem)
- Binary semaphore library (binsem)
- Mutex library (mutex)
- TCP library (tcp)
- UDP library (udp)
- Mailbox library (mbox)
- Logging library (log)
The second area contains the various RDSS modules. Due to changes between the proto-
type versions, some of the modules are not up-to-date with respect to the latest design.
The current setup of the system contains:
RDSS shared components
- Application update protocol parser (aup-par)
- Application update protocol scanner (aupscan)
- RDSS common routines (rdss-com)
Node Configuration Monitor (NCM)
- NCM broadcast control (ncm-bc)
- NCM transaction control (ncm-tc)
- NCM main (ncmm)
Node State Machine (NSM)
- NSM transition states (nsmfs)
- NSM main (nsmm)
Application Server Manager (ASM)
- ASM mirror control (asmmc)
- ASM remodel control (asrnx)
- ASM query control (asm-qc)
- ASM main (asmm)
Hardware Abstraction Layer (HAL)
- HAL application server library (halas)
- HAL online maintenance library (hal-om)
- HAL common routines (hal-com)
Marshaller-Dispatcher Port Manager (MDPM)
- MDPM query connection control (mdpm-qc)
- MDPM node control (mdpmnc)
- MDPM common routines (mdpm-com)
Query Marshaller-Dispatcher (QMD)
- QMD query connection control (qmd-qc)
- QMD library (qmdlib)
- QMD test library (qmd-t)
Update Marshaller-Dispatcher (UMD)
- UMD node control (umdnc)
- UMD library (umdlib)
- UMD test library (umd-t)
The third area contains the simple text snippet server software. The test drivers and
simulation scripts are located in the last area of the prototyping framework.
5.1.2 Implementation Issues
The application update protocol parser and scanner are implemented with YACC and
LEX compatibility in mind. On LINUX, they are compiled with the GNU BISON and
FLEX compilers. GCC 2.6.3 and System V libraries are assumed to be available on the
compilation platform.
During prototyping, some shortcuts were used to simplify implementation. Instead of
writing to a raw disk partition, the prototype HAL uses a large file to simulate a disk
partition and to allow quick verification of the stored data. Some of the port numbers
and the broadcast address are hard-coded instead of being read in from a parameter file.
The location of the logs needed for restart after system crash are also hard-coded.
Currently, most of the RDSS modules exist but are not fully consistent with the latest
RDSS architecture, as described in chapters 3 and 4. A stand-alone simple text snippet
server with its front-end has been successfully integrated with the QMD, the ASM and
the HAL. The extra delay for a simple query due to the RDSS layers is:

    T_extra = (TCP_client-to-fe + TCP_fe-to-as - TCP_direct) + HAL_extra

where

TCP_client-to-fe: the TCP communication delay from the query client to the application
query front-end;

TCP_fe-to-as: the delay from the front-end to the application server;

TCP_direct: the delay in a direct TCP connection between the query client and the
application server; and

HAL_extra: the small additional delay due to the RDSS HAL library as opposed to a
direct physical storage access.

While the computational delays of the RDSS software components are included in the
above terms, the TCP delays are dominated by the node-to-node network communication
delay, and the HAL delay is dominated by the disk access response time.
5.2 Simple Text Snippet Server
As briefly described in chapter 2, the text snippet server is a simple server that is capable
of storing variable-length generic data segments ("blobs"). Only the core features of such
a server are implemented in the reference application for the RDSS. Most of the appli-
cation administrative functions and interfaces that would be included in a production
version are omitted unless they are needed for the RDSS integration.
The simple text snippet server stores blocks of character symbols in physical storage.
Each data entry (one or more blocks of characters) is indexed by a tag. For simplicity,
the snippet tag is the same as the RDSS address range for the snippet. To retrieve a
snippet, the valid snippet tag (RDSS address) is required.
5.2.1 User Interfaces
As a simple reference application, the update interface of the simple text snippet server is
a direct implementation of the RDSS update protocol. The only query command available
is to retrieve the stored data given a valid snippet tag or multiple tags. All snippet tags
Figure 5.1: Physical Storage Layout of Simple Text Snippet Server
that fall within a user-supplied RDSS address range in the query are considered valid,
and the associated snippets are returned.
5.2.2 Storage Format
To allow retrieval of snippets, a mapping table is used to correlate the snippet tags to
physical storage offsets. The mapping table, along with its header, is also stored in the
physical storage along with the text snippets. Figure 5.1 depicts the actual layout of the
physical storage used by the simple text snippet server.
The available storage is the total storage allocated to the server by the RDSS, less the
blocks used by the mapping table and the header block. For simplicity, an entry is reserved
for each available storage block in the server, and direct mapping is used. Slot x in
the mapping table corresponds to block x in the data storage. For a 64-bit RDSS
address space and a single-byte control flag, the storage overhead for the mapping table
is 17/(512 + 17) = 3.21% with a block size of 512 bytes. The storage used by the header
block is negligible for a large number of blocks.
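The overhead figure can be reproduced directly. The 17 bytes per entry would correspond to a 16-byte address-range tag (two 64-bit addresses) plus the one-byte control flag; that decomposition is an inference from the text, so it is exposed here as parameters:

```python
def mapping_overhead(block_size, tag_bytes=16, flag_bytes=1):
    """Fraction of raw storage consumed by the mapping table.

    One entry (tag + control flag) is reserved per data block under the
    direct-mapping scheme, so the overhead is entry / (block_size + entry).
    """
    entry = tag_bytes + flag_bytes
    return entry / (block_size + entry)
```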
At the startup of the application, the mapping table is read into memory. An inverted
map, sorted by the snippet tags (RDSS addresses) is created. To retrieve snippets, query
requests are matched with this inverted map in the memory.
The header block contains the start-end pair of indices for the snippet tag mapping
table. They indicate the range of indices that are active. The range wraps
around at the last index value: if the start index is greater than the end index,
then the active index range is wrapped around (i.e., from the start index to the last
possible index slot and from the first possible index slot to the end index).
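The wrap-around test can be written as a small predicate; the helper name and signature below are hypothetical, for illustration only.

```python
def is_active(slot, start, end, num_slots):
    """True if index `slot` lies in the active range from `start`
    to `end`, which may wrap around at the last index slot."""
    assert 0 <= slot < num_slots
    if start <= end:
        return start <= slot <= end     # contiguous range
    # wrapped: start..last slot, then first slot..end
    return slot >= start or slot <= end

print(is_active(1, 6, 3, 8))  # True: active slots are 6, 7, 0, 1, 2, 3
print(is_active(5, 6, 3, 8))  # False: slots 4 and 5 are inactive
```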
In addition to the start-end indices, the header block also includes other information
needed for storage resizing and other operations. The following list shows the content of
the simple text snippet server header block:
- number of storage blocks allocated
- maximum number of entries in the mapping table
- first mapping table block
- last mapping table block
- start index in the mapping table
- end index in the mapping table
- next available index entry in the mapping table
- current ending RDSS address
For efficient operation, the header block information and the translation table are cached
in memory. For data integrity, they are synchronized with their corresponding copies in
the persistent storage at all times.
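The header fields listed above map naturally onto a record; the field names below are illustrative, not taken from the implementation.

```python
from dataclasses import dataclass

@dataclass
class SnippetHeader:
    """In-memory copy of the header block (illustrative field names)."""
    num_blocks: int       # number of storage blocks allocated
    max_entries: int      # maximum number of mapping table entries
    map_first: int        # first mapping table block
    map_last: int         # last mapping table block
    start_index: int      # start index in the mapping table
    end_index: int        # end index in the mapping table
    next_index: int       # next available index entry
    end_addr: int         # current ending RDSS address

hdr = SnippetHeader(num_blocks=1024, max_entries=992, map_first=1,
                    map_last=32, start_index=0, end_index=0,
                    next_index=0, end_addr=0)
print(hdr.max_entries)  # 992
```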
5.2.3 Storage Resizing
To support the truncate and expand operations in the update protocol, the ability to
resize the usage of physical storage is needed. The following steps are done during resizing
to ensure that data integrity is maintained throughout the operation.
The first step is to determine whether the mapping table must be relocated. For
simplicity, relocation is only performed if the destination of the new mapping table
does not overlap with the current mapping table. A storage truncation
will return a failure if the mapping table relocation is not possible (unless the unused blocks
at the end of the storage are enough for the truncation request). A storage expansion, on
the other hand, will return a success without utilizing the extra storage blocks allocated.
Regardless of the direction of resizing, if relocation of the mapping table is necessary, the
next step is to defragment the current storage: empty storage blocks in the current
storage range (between the start-end indices) are filled by non-empty blocks from the
end of that range.
For storage truncation, the next step is to determine whether data block relocations are
needed. If the size reduction is not achievable (i.e., there are not enough empty blocks for
the resizing), the operation will stop. Otherwise, data blocks that fall within the
to-be-removed area at the end of the storage will be moved to the safe area (beginning of the
storage). The start-end indices will be adjusted accordingly, along with the maximum
number of entries in the header block. This step is not needed for storage size expansion.
Next, the mapping table is moved to its new storage location. The header block is then
updated with the new mapping table's location. For storage expansion, the mapping
table is first moved, and then its size is increased. For storage truncation, the mapping
table size is first decreased before the table is moved. After that, the number of allocated
storage blocks is reduced. The next available index will also be updated to reflect the
changes. During the mapping table relocation, no actual data is moved, and the content
of the index map is not changed. The resize operation is completed after the update of
the header block. A system crash during the resize operation will not result in any data
loss, as long as the header block update is atomic and protected from crashes. Figure 5.2
depicts the relocation of the index map (mapping table) when the storage size allocated
to the application is expanded from N to N'.
Figure 5.2: Mapping Table Relocation during Storage Expansion (the index map
moves within the storage as the allocation grows from N to N' blocks)

5.3 Conversion of the MultiText System

In the current MultiText system, neither the index engine nor the text retrieval server
includes any data mirroring or on-line maintenance features. A brief description and
a setup diagram are given in chapter 1. Further description can be found in various
MultiText papers [CCPT97, CBCG95].
To adapt MultiText to the RDSS framework, some changes are necessary. Most
of the RDSS application requirements, like the transaction based update operations,
described in chapter 2, are already included in the current system. Thus, no major
redesign is necessary. To ease the transition, the adaptation can be broken into five
testable stages.
Stage one is for the MultiText index engine and the text server to use the RDSS HALAS
library instead of directly accessing the physical storage. The application will be able to
run as if nothing has changed. Since both servers currently use physical disk partitions
for all persistent storage, this is not a difficult adaptation.
In the second stage, the update interface of MultiText needs to be modified to become
compatible with the RDSS application update protocol requirement. A translator may
be coded such that update requests in the old format may be translated into the new
RDSS compliant update requests. The current MultiText update protocol is very similar
to the RDSS compliant application update protocol, and only minimal syntax changes
are necessary.
The biggest change comes in stage three. The dynamic resizing ability is not included in
the current MultiText system and has to be added. The ability for a robust application
to truncate or expand its storage usage is one of the requirements that was
not anticipated before the development of the RDSS design.
The next stage is to design and implement the Application Query Front-End (AQFE) for
the MultiText system. The current MultiText Marshaller-Dispatcher is not suitable for
integration with the RDSS and must be replaced. The default Application Update Front-
End (AUFE) is adequate and no custom AUFE is needed for either the MultiText index
engine or the text server. Using the QMD and UMD test library, the application front-
end modules can be tested with the application server without other RDSS components.
The final stage is to integrate and test MultiText with the RDSS environment.
Chapter 6
Conclusion
In chapter 1, the idea of a robust distributed storage system (RDSS) was presented
along with its associated design goals. After many iterations of the design effort, a
workable architecture was devised and is presented in this thesis. The architecture
satisfies the stated transparency, availability, and throughput performance criteria, and
also encompasses many different kinds of real-world applications in its target domain.
6.1 Contribution of the Thesis
With Moore's observation on the doubling of microprocessor speeds continuing to hold,
and on-going advances in affordable, high-capacity storage technology taking place, the
power of a network of inexpensive personal computers cannot be ignored. The RDSS
architecture can be implemented on a network of very inexpensive personal computers,
and presents an attractive low-cost alternative to even a modest RAID storage subsystem.
One of the major hurdles in putting a large information system on a distributed platform
is the complexity of the software design required. Through the RDSS environment, most
of the tricky issues in the design are handled.
The RDSS not only provides data mirroring, it also scatters mirrored data across many
nodes, avoiding data loss due to a single node failure. It also keeps data within the
same entry together at the mirrored location, thus avoiding the cost and complexity of
combining data entries across the distributed system. This is important for an index engine,
as the indexing information of a document needs to be kept together for the search algorithm
to proceed.
On top of that, the RDSS provides load balancing, not only when the system is operating
on its primary data, but also when the system is using the mirrored data. Since data
entries are mirrored to the other N - 1 nodes in an N-node system, each server node only
has to handle 1/(N - 1) extra data if a node fails. To provide these benefits, a totally
transparent storage, like RAID, is not possible. For the applications that it is designed
to support, the inconveniences due to the compliance requirements (described in chapter 2)
are far outweighed by the benefits gained.
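The load figures follow directly from the scattering: if each node's data is mirrored evenly over the other N - 1 nodes, a single failure adds 1/(N - 1) of one node's load to each survivor. A quick sketch of the arithmetic:

```python
def survivor_load(n_nodes):
    """Relative load on each surviving node after one failure,
    with mirrors scattered evenly over the other n_nodes - 1 nodes
    (1.0 = normal single-node load)."""
    return 1.0 + 1.0 / (n_nodes - 1)

print(survivor_load(5))  # 1.25: each of 4 survivors takes on 25% more
```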
While more tests may be done on the RDSS prototype, the architecture and the detailed
design shown in this thesis provide a strong and solid foundation for future work in
networked distributed storage. The preliminary implementation has shown that the
RDSS framework is flexible and usable.
6.2 Future Work
In the short term, more work will be needed to take the prototype implementation to a
production level. The current prototype needs to be made consistent with the current
design described in this thesis. The hard-coded shortcuts mentioned in the last chapter
will need to be properly addressed by customizable parameters. Changes may be needed
in the hardware abstraction layer (HAL) for the RDSS to work in different workstation
setups and to enable dynamic virtual disk re-partitioning.
In addition, performance benchmarking with multiple nodes on multiple network loca-
tions will need to be performed to further refine the RDSS design. It is important to
find out if there are any throughput bottlenecks in the design as well as to discover
any scalability limits of the implementation. A comparison study between an
RDSS-enabled application and its single computer equivalent should be performed to show the
cost-saving potential of the RDSS.
In terms of design improvements, there are several areas that successors to the RDSS
architecture may need to consider. The following are a few worth considering.
Communication Improvements
With IPv6, which supports multicasting and resource reservation, getting closer to reality, it may
be possible to improve on the node-to-node communications done in the RDSS. For
example, the number of connections per node may be reduced by replacing the node to
node update connections with a multicasting network.
In addition, changes may be added to support prioritized queries. It may also be necessary
to use a different communication protocol if the TCP/IP protocol suite is not available
or efficient in the target network. Adopting a fault tolerant multicasting protocol may
help to reduce the complexity of the rest of the RDSS.
Multiple Node Failures
Another possible high profile improvement to the RDSS is to increase the level of fault
tolerance. To continue the scattered mirroring strategy, multiple mirroring levels will
be needed to achieve this. For each primary data partition in an N-node system,
its first level of mirroring data is scattered among the other N - 1 nodes, just like the
current RDSS described in this thesis. However, in a multiple level mirroring architecture,
this process is recursive. Each first level mirroring partition will have N - 2 secondary
mirroring segments located in the remaining nodes, excluding the node with the primary
copy of the data.
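A quick count shows how the recursion multiplies segments: each primary partition scatters to N - 1 first-level segments, and each of those to N - 2 second-level segments. The arithmetic below is illustrative only.

```python
def mirror_segment_counts(n_nodes, levels):
    """Segments created per parent partition at each mirroring level:
    N - 1 at level one, N - 2 at level two, and so on."""
    return [n_nodes - 1 - lvl for lvl in range(levels)]

print(mirror_segment_counts(6, 2))  # [5, 4]
# 5 first-level segments, each split into 4 second-level segments,
# so a 2-level scheme keeps 5 * 4 = 20 second-level segments in total.
```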
While the mirroring procedure is relatively simple, the difficult part is to modify the
Node Configuration Monitor (NCM) to handle the possibility of multiple node failures
and remodelling. So far, in the RDSS, the NCM does not need a majority agreement
mechanism, as only one node failure is handled. However, for the system to handle
multiple node failures, such a mechanism is needed to achieve network consistency.
Network Partitioning
With the inclusion of multiple node failures, there is the possibility of a network parti-
tioning problem. Again, with the single node failure and local area network assumptions,
the current RDSS design does not handle a partitioned network very well. The majority
agreement mechanism needs to be added to the NCM. Also, the restart sequence needs
to be changed so that two or more partitioned sub-networks can be recombined without
problem.
Wide Area Network
As stated in the assumption, the RDSS only works on a local area network. Because of
communication delays and unreliability of communication links, distributed applications
on a wide area network often suffer from virtual network partitioning. Unlike the local
area network case, multiple replicated primary data partitions may be desirable. Instead
of limiting the RDSS to a strict single primary data source, it may be worthwhile to
explore the inclusion of a replicated primary node in future designs if the support for a
wide area network is needed.
Other Fault Tolerance Strategies
The RDSS architecture provides node level fault tolerance, but none is provided at a data
entry level. It is possible to use RAID storage on all nodes, but that would be expensive
and defeat the low-cost purpose of the RDSS. One possible solution is to include error
correction codes [GCCT96]. They may complement the mirroring strategy by providing
sub-node level fault protection and a higher degree of fault tolerance.
Another possible variation on the fault tolerance strategy is to optimize the mirroring
rules. Instead of always mirroring to every remaining node, the system may only mirror
to some of them. In a multiple level mirroring system, this may or may not give better
protection with less replication, depending on the mirroring policy. Nevertheless, it is an
interesting problem to tackle.
Security Enhancements
There are weak spots in the current RDSS system in terms of security, namely the
node-to-node communications. In particular, the NCM broadcasting network is very
vulnerable to malicious attacks. It is acceptable on a private secure local area network
insulated from the outside world by a firewall, as long as the firewall is configured properly
to protect the RDSS. Ideally, some form of security should be built into the system to
guard against misuse.
Application Support
Last, but not least, samples from the various applications in the RDSS target domain
should be used to see where the strength of the RDSS lies. The update protocol may be
enhanced or changed to a binary format, which may be more appropriate for multimedia
and other non-text digital libraries.
Appendix A
Application Update Protocol BNF
The following is the syntax of the application update protocol implemented in the RDSS,
in extended BNF (Backus-Naur Form). The application server must implement
a superset of this protocol on its update port. Keywords are shown in quoted
form and variable tokens are italicized.
Tokens: string number new-line

string: '"' (any ASCII character)* '"'

new-line: (line-feed) carriage-return

command-line: update-command new-line

update-command: add-request
    | delete-request
    | extract-request
    | merge-request
    | update-commit
    | update-abort
    | storage-available-request
    | truncate-storage
    | expand-storage
    | quit-update-session
    | shutdown-server

add-request: 'ADD' 'AT' number 'TO' number 'CAPACITY' number 'SIZE'
    number 'DATA' string

delete-request: 'DELETE' 'FROM' number 'TO' number

extract-request: 'EXTRACT' 'FROM' number 'TO' number 'PORT' number
    ('EXTERNAL FORMAT')

merge-request: 'MERGE' 'FROM' 'SERVER' number 'PORT' number
    ('EXTERNAL FORMAT')

update-commit: 'COMMIT' ('ALL')

update-abort: 'ABORT' ('ALL')

storage-available-request: 'STORAGE' 'STATUS'

truncate-storage: 'TRUNCATE' number 'BLOCKS'

expand-storage: 'EXPAND' number 'BLOCKS'

quit-update-session: 'QUIT'

shutdown-server: 'SHUTDOWN'
Every update-command requires a response of 'ACK' or 'NACK', with the exception
of the storage-available-request, the capacity-available-request and the shutdown-server
command. The 'ACK' response indicates the operation was successful or is ready to be
committed.
For the storage-available-request, the proper return message contains two numbers: the
first is the storage block size and the second is the number of blocks avail-
able. For the capacity-available-request, the return message contains a single number
indicating the remaining free capacity.
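For illustration, requests conforming to the grammar can be assembled as plain strings. The helpers below are hypothetical, and the line terminator follows the new-line token above (an optional line-feed, then a carriage-return).

```python
NEWLINE = "\n\r"  # optional line-feed, then carriage-return

def add_request(at_addr, to_addr, capacity, size, data):
    """Build an add-request line per the update-protocol BNF."""
    return ('ADD AT %d TO %d CAPACITY %d SIZE %d DATA "%s"'
            % (at_addr, to_addr, capacity, size, data)) + NEWLINE

def truncate_request(blocks):
    """Build a truncate-storage line per the BNF."""
    return ("TRUNCATE %d BLOCKS" % blocks) + NEWLINE

print(add_request(0, 1, 512, 5, "hello").strip())
print(truncate_request(16).strip())  # TRUNCATE 16 BLOCKS
```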
Appendix B
Hardware Abstraction Layer -
Application Server Interface
The following is the calling interface to the Hardware Abstraction Layer - Application
Server (HAL-AS) Library.
Method name: bVDiskOpen
Input: Virtual disk description string (sVDisk).
Output: Virtual disk ID (iVDiskID).
Return value: Boolean, true for success.
Method name: bVDiskRead
Input: Virtual disk ID (iVDiskID), starting block offset (istart), number of blocks
(icount), pointer to the read buffer (sBuffer).
Output: Read buffer content (sBuffer).
Return value: Pointer to the read buffer. Returns NULL on error.
Method name: bVDiskWrite
Input: Virtual disk ID (iVDiskID), starting block offset (istart), number of blocks
(icount), pointer to the write buffer (sBuffer).
Output: none.
Return value: Number of blocks successfully written.
Method name: bVDiskStatus
Input: Virtual disk ID (iVDiskID).
Output: Virtual disk block size (iBlkSize), Number of blocks in virtual disk (iNum-
Blk).
Return value: Boolean, true for success.
Method name: bVDiskClose
Input: Virtual disk ID (iVDiskID).
Output: none.
Return value: Boolean, true for success.
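A typical calling sequence for these methods can be sketched with an in-memory stand-in. The class below mimics the documented semantics (NULL on a bad read, a count of blocks written), but it is not the real HAL-AS library, which operates on raw disk partitions.

```python
class VirtualDisk:
    """In-memory stand-in for a HAL-AS virtual disk."""
    def __init__(self, blk_size=512, num_blk=1024):
        self.blk_size, self.num_blk = blk_size, num_blk
        self._blocks = [bytes(blk_size) for _ in range(num_blk)]

    def read(self, start, count):          # cf. bVDiskRead
        if start < 0 or start + count > self.num_blk:
            return None                    # NULL on error
        return b"".join(self._blocks[start:start + count])

    def write(self, start, count, buf):    # cf. bVDiskWrite
        written = 0
        for i in range(count):
            if start + i >= self.num_blk:
                break                      # partial write past the end
            chunk = buf[i * self.blk_size:(i + 1) * self.blk_size]
            self._blocks[start + i] = chunk.ljust(self.blk_size, b"\0")
            written += 1
        return written                     # blocks successfully written

    def status(self):                      # cf. bVDiskStatus
        return self.blk_size, self.num_blk

disk = VirtualDisk(blk_size=512, num_blk=8)
assert disk.write(0, 1, b"snippet") == 1
print(disk.read(0, 1)[:7])   # b'snippet'
print(disk.read(7, 2))       # None: read past the end fails
```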
Appendix C
Query Marshaller-Dispatcher
Library Interface
The following is the calling interface to the Query Marshaller-Dispatcher Library (QMD-
Lib).
Method name: bNewQrySession
Input: Query client file descriptor (iClientFd), fork new process flag (bNewPro-
cess).
Output: Number of servers in the complete RDSS world (iNumServer), status list
size (iNumStatus), status list (asqsStatusList).
Return value: Boolean, true for success.
Method name: bStartQuery
Input: Direct from client flag (bDirect), update server list flag (bResync), query
buffer (sQuery), client timeout (iTimeout).
Output: Number of servers in the RDSS (iNumServer), status list size (iNumStatus),
status list (asqsStatusList).
Return value: Boolean, true if no error has occurred.
Method name: bReadSelect
Input: Number of servers and client (iNumStatus), array of read masks (abRead-
Mask), array of exception masks (abExceptMask), timeout value (iTime-
out).
Output: Array of read results (abReadMask), array of exception results (abExcept-
Mask), number of servers in the RDSS (iNumServer), status list size (iNum-
Status), status list (asqsStatusList).
Return value: Boolean, true for success.
Method name: bRead
Input: Read slot number (islot), read buffer (sReadBuffer), buffer size (iBufSize).
Output: Data in read buffer (sReadBuffer), relevant data size (iBufSize).
Return value: Boolean, true for success.
Method name: bWrite
Input: Number of servers and clients (iNumStatus), array of write masks (ab-
WriteMask), write buffer (sWriteBuffer), buffer size (iBufSize).
Output: Data size written (iBufSize).
Return value: Boolean, true for success.
Method name: bEndQuery
Input: Write buffer (sWriteBuffer), buffer size (iBufSize).
Output: Data size written (iBufSize), number of servers in the RDSS (iNumServer),
status list size (iNumStatus), status list (asqsStatusList).
Return value: Boolean, true for success.
Method name: bEndQuerySession
Return value: Process returns 0 on successful termination.
Appendix D
Update Marshaller-Dispatcher
Library Interface
The following is the calling interface to the Update Marshaller-Dispatcher Library (UMD-
Lib).
Method name: bNewUpdSession
Input: Update client file descriptor (iClientFd).
Output: Number of nodes that are part of the system (iNumNodes).
Return value: Boolean, true for success.
Method name: bUpdateCmd
Input: Pointer to update request (sUpdReq), broadcast mask (abBroadcastMask).
Output: Number of nodes to which the update request is sent (iNumSent).
Return value: Boolean, true for success.
Method name: bReadUpdRtn
Input: Pointer to return buffer (sUpdRtnBuf), size of buffer (iRtnBufSize), read
select mask (abBroadcastMask).
Output: Number of bytes used in the return buffer (iResultSize), data written to
buffer (sUpdRtnBuf).
Return value: Boolean, true for success.
Method name: bEndUpdSession
Return value: Boolean, true for success.
Glossary
AQFE: Application Query Front-End
AS: Application Server
ASM: Application Server Manager
AUFE: Application Update Front-End
GCC: GNU C Compiler
IP: Internet Protocol
LAN: Local Area Network
MDPM: Marshaller-Dispatcher Port Manager
NCM: Node Configuration Monitor
NSM: Node State Machine
RAID: Redundant Arrays of Inexpensive Disks
RCS: Revision Control System
RDSS: Robust Distributed Storage System
ROWA: Read One Write All
QMD: Query Marshaller-Dispatcher
SCSI: Small Computer System Interface
TCP: Transmission Control Protocol
UDP: User Datagram Protocol
UMD: Update Marshaller-Dispatcher
Bibliography
[AS91] I. J. Aalbersberg and F. Sijstermans. High-Quality and High-Performance
Full-Text Document Retrieval: The Parallel Infoguide Systems. In Int'l
Conference on Parallel and Distributed Information Systems, December 1991,
pp. 142-150.
[BBD+96] W. J. Bolosky, J. S. Barrera, R. P. Draves, R. P. Fitzgerald and G. Gibson.
The Tiger Video Fileserver. Microsoft Technical Report MSR-TR-96-09,
April 1996.
FTP location: ftp://ftp.research.microsoft.com/pub/tr/tr-96-09.ps
[Bir93] K. P. Birman. The Process Group Approach to Reliable Distributed Com-
puting. Communications of the ACM, December 1993, Vol. 36, No. 12, pp.
37-53.
[BS92] W. A. Burkhard and P. D. Stojadinovic. Storage-Efficient Reliable Files. In
Proceedings: USENIX Winter 1992 Technical Conference, San Francisco,
January 1992.
[Bur90] F. J. Burkowski. Retrieval Performance of a Distributed Text Database
Utilizing a Parallel Processor Document Server. In Int'l Symp. Databases
in Parallel and Distributed Systems, 1990, pp. 71-79.
[CBCG95] G. V. Cormack, F. J. Burkowski, C. L. A. Clarke and R. C. Good. A Global
Search Architecture. Technical Report CS-95-12, University of Waterloo
Computer Science Department, April 1995.
C. L. A. Clarke, G. V. Cormack and F. J. Burkowski. An Algebra for Struc-
tured Text Search and A Framework for its Implementation. The Computer
Journal, 1995, Vol. 38, No. 1, pp. 43-56.
C. L. A. Clarke, G. V. Cormack and F. J. Burkowski. Schema-Independent
Retrieval from Heterogeneous Structured Text. In Fourth Annual Sympo-
sium on Document Analysis and Information Retrieval, Las Vegas, Nevada,
April 1995, pp. 279-289.
[CCPT97] G. V. Cormack, C. L. A. Clarke, C. R. Palmer and S. L. To. Passage-
Based Refinement (MultiText Experiment for TREC-6). In Proceedings of
the Sixth Text REtrieval Conference (TREC-6), Gaithersburg, Maryland,
November 1997.
G. F. Coulouris and J. Dollimore. Distributed Systems: Concepts and De-
sign, Addison-Wesley Publishing Co., Wokingham, England, 1988.
P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz and D. A. Patterson.
RAID: High-Performance, Reliable Secondary Storage, ACM Computing
Surveys, June 1994, Vol. 26, No. 2, pp. 145-185.
B. Cahoon and K. S. McKinley. Performance Evaluation of a Distributed
Architecture for Information Retrieval. In Proceedings of the 19th Annual
Int'l ACM SIGIR Conference on Research and Development in Information
Retrieval, Zurich, Switzerland, August 1996.
P. B. Danzig, J. Ahn, J. Noll and K. Obraczka. Distributed Indexing: A
Scalable Mechanism for Distributed Information Retrieval. In ACM SIGIR
Conference, October 1991, pp. 220-229.
[GCCT96] R. C. Good, G. V. Cormack, C. L. A. Clarke and D. J. Taylor. A Robust
Storage System Architecture. In 8th Int'l Conference on Computing and
Information, June 1996.
D. K. Gifford, P. Jouvelot, M. A. Sheldon and J. W. O'Toole, Jr. Semantic
File Systems. Operating Systems Review: Proceedings of the 13th ACM
Symposium on Operating Systems Principles, Pacific Grove, CA, October
1991, Vol. 25, No. 5, pp. 16-25.
J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques,
Morgan Kaufmann Publishers, San Francisco, CA, 1993.
G. A. Gibson, D. F. Nagle, K. Amiri, F. W. Chang, H. Gobioff, E. Riedel,
D. Rochberg and J. Zelenka. File Systems for Network-Attached Secure
Disks. Technical Report CMU-CS-97-118, School of Computer Science,
Carnegie Mellon University, Pittsburgh, Pennsylvania, July 1997.
A. A. Helal, A. A. Heddaya and B. B. Bhargava. Replication Techniques
in Distributed Systems, Kluwer Academic Publishers, Boston, 1996.
Inktomi Corporation. The Inktomi Technology Behind Hot Bot (a White
Paper), 1996.
WWW location: http://www.inktomi.com/Tech/CoupClusterWhitePap.html
B. S. Jeong and E. Omiecinski. Inverted File Partitioning Schemes in Multi-
ple Disk Systems. IEEE Transactions on Parallel and Distributed Systems,
February 1995, Vol. 6, No. 2, pp. 142-153.
M. Lesk. Practical Digital Libraries: Books, Bytes, and Bucks, Morgan
Kaufmann Publishers, San Francisco, CA, 1997.
B. Liskov, S. Ghemawat, R. Gruber, P. Johnson, L. Shrira and M. Williams.
Replication in the Harp File System. Operating Systems Review: Proceed-
ings of the 13th ACM Symposium on Operating Systems Principles, Pacific
Grove, CA, October 1991, Vol. 25, No. 5, pp. 216-238.
Z. Lin. Cat: An Execution Model for Concurrent Full Text Search. In Int'l
Conference on Parallel and Distributed Information Systems, December 1991,
pp. 151-158.
[LS90] E. Levy and A. Silberschatz. Distributed File Systems: Concepts and Ex-
amples. ACM Computing Surveys, ACM Press, December 1990, Vol. 22,
No. 4, pp. 321-374.
In Sky and Telescope, July 1997, p. 44.
C. Stanfill. Partitioned Posting Files: A Parallel Inverted File Structure
for Information Retrieval. In ACM SIGIR Conference, September 1990,
pp. 413-428.
A. Tomasic and H. Garcia-Molina. Performance Issues in Distributed
Shared-Nothing Information-Retrieval Systems. Information Processing
and Management, 1996, Vol. 32, No. 6, pp. 647-665.
[WGSS96] J. Wilkes, R. Golding, C. Staelin and T. Sullivan. The HP AutoRAID
Hierarchical Storage System. ACM Transactions on Computer Systems, Vol.
14, No. 1, February 1996, pp. 108-136.
[WMB94] I. H. Witten, A. Moffat and T. C. Bell. Managing Gigabytes: Compressing
and Indexing Documents and Images, Van Nostrand Reinhold, New York,
1994.