This manuscript has been reproduced from the microfilm master. UMI films the
text directly from the original or copy submitted. Thus, some thesis and
dissertation copies are in typewriter face, while others may be from any type of
computer printer.
The quality of this reproduction is dependent upon the quality of the copy
submitted. Broken or indistinct print, colored or poor quality illustrations and
photographs, print bleedthrough, substandard margins, and improper alignment
can adversely affect reproduction.
In the unlikely event that the author did not send UMI a complete manuscript and
there are missing pages, these will be noted. Also, if unauthorized copyright
material had to be removed, a note will indicate the deletion.
Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning
the original, beginning at the upper left-hand corner and continuing from left to
right in equal sections with small overlaps. Each original is also photographed in
one exposure and is included in reduced form at the back of the book.
Photographs included in the original manuscript have been reproduced
xerographically in this copy. Higher quality 6" x 9" black and white photographic
prints are available for any photographs or illustrations appearing in this copy for
an additional charge. Contact UMI directly to order.
Bell & Howell Information and Learning 300 North Zeeb Road, Ann Arbor, MI 48106-1346 USA
800-521-0600
A Robust Distributed Storage System for Large Information Retrieval Applications
Antonio S. Cheng
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering University of Toronto
© Copyright by Antonio S. Cheng 1998
National Library of Canada
Acquisitions and Bibliographic Services
395 Wellington Street, Ottawa ON K1A 0N4, Canada
The author has granted a non-exclusive licence allowing the National Library of
Canada to reproduce, loan, distribute or sell copies of this thesis in microform,
paper or electronic formats.
The author retains ownership of the copyright in this thesis. Neither the thesis
nor substantial extracts from it may be printed or otherwise reproduced without
the author's permission.
A Robust Distributed Storage System for Large Information Retrieval
Applications
Antonio S. Cheng
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
1998
Abstract
This thesis presents an architecture and design for a Robust Distributed Storage System
(RDSS) targeted at digital library, multimedia, and information retrieval applications,
and implemented on networks of low-cost workstations or personal computers. In partic-
ular, the system addresses problems associated with managing large distributed indices
in the context of these applications. The RDSS provides a framework for scaling a single-
node server to create a reliable distributed system. In addition to performance benefits
achieved by distributing these applications, the RDSS provides efficient data mirroring,
on-line failure recovery, and node management.
Acknowledgements
I would like to express my deepest thanks to my supervisor, Professor Charles L. A.
Clarke, for his guidance and support throughout my graduate study at University of
Toronto. Professor Clarke has been very kind in providing me with research direction
and very patient in correcting my mistakes. His endless support throughout the research
and the thesis write-up was crucial to the completion of this thesis. I sincerely hope that
the result of this study will benefit the future evolution of the MultiText project.
I would also like to thank my examination committee members: Professor S. A. Bortoff
(chair), Professor H. M. Hinton, and Professor M. Stumm. In addition, I would like to
express my gratitude to Michael Van Dam for helping me with the thesis layout, and to
Sam Griffiths and Dr. Rob Irish for proofreading the thesis on short notice.
Thanks to my fellow electrical graduate students for making the learning experience
enjoyable. Thanks are also due to other members of the Electrical Engineering Computer
Group (EECG), for the knowledge (technical or otherwise) that I have learned from them
during my time with the group. Also, my sincere gratitude to the technical support staff
and the administrative staff in the electrical engineering department.
Last but not least, I would like to thank my family and friends for their encouragement.
This work was supported by an NSERC PGS-A postgraduate scholarship.
Contents
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 MultiText Project . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Distribution Problem . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Index Mirroring Problem . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Design Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Overview of the Robust Distributed Storage System (RDSS) . . . . . . . 10
1.4.1 System Environment . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.2 Mirroring Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 Application Environment 16
2.1 Application Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Data Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
iv
2.2.1 Data Addressing Scheme . . . . . . . . . . . . . . . . . . . . . . .
2.2.2 Capacity Definition . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2.3 Data Organization Requirements Summary . . . . . . . . . . . . .
2.3 Application Server Design Requirements . . . . . . . . . . . . . . . . . .
2.3.1 Application Server Interfaces Requirements . . . . . . . . . . . . .
2.3.2 Application Server Protocols Requirements . . . . . . . . . . .
2.4 Application Front-End Design Requirements . . . . . . . . . . . . . . . .
2.4.1 Application Query Front-End . . . . . . . . . . . . . . . . . . . .
2.4.2 Application Update Front-End . . . . . . . . . . . . . . . . . . . .
3 Architectural Overview
3.1 Design Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1.1 Design Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1.2 Design Assumptions . . . . . . . . . . . . . . . . . . . . . . . . .
3.2 System Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2.1 Server Node Components . . . . . . . . . . . . . . . . . . . . . . .
3.2.2 Marshaller-Dispatcher Components . . . . . . . . . . . . . . . . .
3.2.3 Non-volatile Storage Management . . . . . . . . . . . . . . . . . .
3.3 Node Configuration Monitor (NCM) Module . . . . . . . . . . . . . . . .
3.3.1 Node Contact List . . . . . . . . . . . . . . . . . . . . . . . . . .
v
3.3.2 Guardian Node . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.3.3 Server Nodes Synchronization . . . . . . . . . . . . . . . . . . . .
3.3.4 Crash Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.4 Node State Machine (NSM) Module . . . . . . . . . . . . . . . . . . . . .
3.4.1 NSM Initialization State . . . . . . . . . . . . . . . . . . . . . . .
3.4.2 NSM Steady State . . . . . . . . . . . . . . . . . . . . . . . . . .
3.4.3 NSM Modifying State . . . . . . . . . . . . . . . . . . . . . . . .
3.4.4 NSM Degraded State . . . . . . . . . . . . . . . . . . . . . . . . .
3.4.5 NSM Failed State . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.5 Application Servers Manager (ASM) Module . . . . . . . . . . . . . . . .
3.6 Hardware Abstraction Layer (HAL) Module . . . . . . . . . . . . . . . .
3.7 Marshaller-Dispatcher Port Manager (MDPM) Module . . . . . . .
3.8 Query Marshaller-Dispatcher (QMD) Module . . . . . . . . . . . . . . .
3.9 Update Marshaller-Dispatcher (UMD) Module . . . . . . . . . . . . . . .
3.10 Sample Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.10.1 Successful System Startup . . . . . . . . . . . . . . . . . . . . . .
3.10.2 Successful Queries and Updates . . . . . . . . . . . . . . . . . . .
3.10.3 Successful Recovery . . . . . . . . . . . . . . . . . . . . . . . . . .
3.10.4 Changing Storage Size . . . . . . . . . . . . . . . . . . . . . . . .
vi
4 RDSS Detailed Design 60
4.1 Target Platform Assumptions . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2 Node Configuration Monitor (NCM) Detailed Design . . . . . . . . . . . 61
4.2.1 NCM Startup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.2 Detecting Node Failure . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.3 Initiating Remodelling Synchronization . . . . . . . . . . . . . . . 64
4.2.4 Completing Remodelling . . . . . . . . . . . . . . . . . . . . . . . 65
4.2.5 Node Reset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3 Node State Machine (NSM) Detailed Design . . . . . . . . . . . . . . . . 66
4.3.1 NSM Startup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3.2 From Initialization State to Steady State . . . . . . . . . . . . . . 67
4.3.3 From Initialization State to Degraded State . . . . . . . . . . . . 67
4.3.4 From Steady State to Modifying State . . . . . . . . . . . . . . . . 68
4.3.5 From Steady State or Modifying State to Degraded State . . . . . 68
4.3.6 From Modifying State to Steady State . . . . . . . . . . . . . . . . 69
4.3.7 From Degraded State to Steady State . . . . . . . . . . . . . . . 69
4.3.8 From Degraded State to Failed State . . . . . . . . . . . . . . . . 69
4.3.9 NSM Termination . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.10 On-line Removal of a Node . . . . . . . . . . . . . . . . . . . . . 70
4.3.11 On-line Addition of a New Node . . . . . . . . . . . . . . . . . . . 71
vii
4.3.12 Relocating Data from a Deleted Node to a New Node . . . . . . . 71
4.4 Application Servers Manager (ASM) Detailed Design . . . . . 72
4.4.1 Data Mirroring . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.4.2 Remodelling Interface . . . . . . . . . . . . . . . . . . . . . . . . 73
4.4.3 Controlling Port Visibility . . . . . . . . . . . . . . . . . . . . . . 76
4.4.4 Persistent Information . . . . . . . . . . . . . . . . . . . . . . . . 77
4.4.5 ASM Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.5 Hardware Abstraction Layer (HAL) Detailed Design . . . . . . . . . . . . 79
4.5.1 Virtual Disk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.5.2 Dynamic Repartitioning . . . . . . . . . . . . . . . . . . . . . . . 80
4.5.3 HAL Library Interfaces . . . . . . . . . . . . . . . . . . . . . . . 81
4.6 Marshaller-Dispatcher Port Manager (MDPM) Detailed Design . . . . . . 82
4.6.1 Port Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.6.2 Observing RDSS state changes . . . . . . . . . . . . . . . . . . . 84
4.6.3 Maintaining the Node List . . . . . . . . . . . . . . . . . . . . . . 84
4.6.4 Maintaining the Query Connection List . . . . . . . . . . . . . . . 84
4.7 Update Marshaller-Dispatcher (UMD) Detailed Design . . . . . . . . . . 85
4.7.1 Update Client Trust Model . . . . . . . . . . . . . . . . . . . . . 85
4.7.2 Default Application Update Front-End . . . . . . . . . . . . . . . 86
4.7.3 Update Transaction Integrity . . . . . . . . . . . . . . . . . . . . 87
viii
4.7.4 RDSS Update Marshaller-Dispatcher Library . . . . . . . . . . . . 87
4.8 Query Marshaller-Dispatcher (QMD) Detailed Design . . . . . . . . . . . 89
4.8.1 Query Marshaller-Dispatcher Library (QMD-Lib) . . . . . . . . . 89
5 Implementation Status 93
5.1 RDSS Prototype Implementation . . . . . . . . . . . . . . . . . . . . . . 94
5.1.1 Prototyping Framework . . . . . . . . . . . . . . . . . . . . . . . 94
5.1.2 Implementation Issues . . . . . . . . . . . . . . . . . . . . . . . . 96
5.2 Simple Text Snippet Server . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2.1 User Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2.2 Storage Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.2.3 Storage Resizing . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.3 Conversion of the MultiText System . . . . . . . . . . . . . . . . . . . . 101
6 Conclusion 103
6.1 Contribution of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
A Application Update Protocol BNF 108
B Hardware Abstraction Layer - Application Server Interface
C Query Marshaller-Dispatcher Library Interface
D Update Marshaller-Dispatcher Library Interface
Glossary
Bibliography
List of Tables
2.1 Application Server Update Protocol . . . . . . . . . . . . . . . . . . . . 28
4.1 Application Servers Manager Interfaces . . . . . . . . . . . . . . . . . . . 79
List of Figures
1.1 Architecture of the MultiText System . . . . . . . . . . . . . . . . . . . . 4
1.2 System View of RDSS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Data Replication in the RDSS . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Load Balancing of the Scattered Mirroring Strategy . . . . . . . . . . . . 15
2.1 Stand-Alone Application Server . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Application Distributed View with Front-End Modules . . . . . . . . . . 30
3.1 Network Overview of the RDSS Architecture . . . . . . . . . . . . . . . . 35
3.2 State Transitions Diagram of an RDSS Server Node . . . . . . . . . . . . 46
3.3 Sub-state Transitions with the Degraded NSM State . . . . . . . . . . . . 51
4.1 RDSS Server Node Components . . . . . . . . . . . . . . . . . . . . . . . 75
4.2 RDSS Marshaller-Dispatcher Node Components . . . . . . . . . . . . . . 83
5.1 Physical Storage Layout of Simple Text Snippet Server . . . . . . . . . . 98
5.2 Mapping Table Relocation during Storage Expansion . . . . . . . . . . . 101
xii
Chapter 1
Introduction
With the growth of the information age, the demand for timely information and data
also grows in leaps and bounds. Many technological trends are driven by the need to
satisfy this demand. From hardware advances like higher network bandwidth and data
storage capacity, to software developments in data mining and digital libraries, the driving
requirement is to get more information to end users quicker and cheaper.
The purpose of this thesis is to propose, develop and refine a flexible framework for a Ro-
bust Distributed Storage System (RDSS) that will satisfy this driving requirement. The
RDSS is a storage management framework targeted toward digital library, multimedia
and information retrieval applications. It provides the benefits of distributed computing
and data replication. Prototypes of the RDSS components are developed as part of the
design process.
1.1 Background
Most work in reliable, distributed database systems has targeted relational databases
and file systems [HHB96]. The work described in this thesis is aimed at digital libraries,
multimedia, and information retrieval applications. The focus of these applications is on
search and retrieval. Because of this focus, it is relatively easy to structure distributed
versions of these applications. Presently, the most visible applications in this class are
World Wide Web search engines (e.g., AltaVista, http://www.altavista.digital.com,
and Excite, http://www.excite.com).
With the growth of the World Wide Web, and other new digital library services such as
video-on-demand, the need for large-scale digital library systems is on the rise [Les97].
In 1996, the largest disk storage array was 20 terabytes in size. By the time the Sloan Digital
Sky Survey is completed, it will contain about 200 terabytes of digital data [Sky97]. It
is also estimated that the scanned version of the US national library will require a digital
library of 1 petabyte in size [Les97] (1 petabyte = 1,048,576 gigabytes, for comparison).
While building faster and bigger server hardware and storage subsystems is possible,
emerging database applications require efficient storage but cannot justify the cost of a
single computer super server. In 1998, a high bandwidth storage subsystem, which is
capable of storing 200 gigabytes or more, costs more than US$80,000 without disk or
server (Digital StorageWorks Enterprise Storage Array 10000). The goal of this study
is to provide a distributed framework, such that almost any stand-alone digital library
application can be easily converted to a scalable distributed system using a network of
low-cost personal computers.
1.1.1 MultiText Project
An example of a digital library is the parent project of this thesis - the MultiText
Project. The MultiText system allows search and retrieval of relevant text passages us-
ing innovative ranking methods - shortest sub-string ranking with passage-based refine-
ment [CCPT97], along with support for semi-structured data [CCB95a, CCB95bl. The
content of the text can be net-news, email, postscript or any other text based documents
in various formats and languages.
The current MultiText architecture consists of multiple replicated index engines with
one or more text servers on a local area network. Incoming client queries are routed
by a marshaller/dispatcher to the index engines in the network. The results from the
index engines are used for text passage retrieval from the text servers. From the
query clients' point of view, the whole network of index engines, the text servers and
the marshaller/dispatcher functions as a single search application. Figure 1.1 shows the
architecture of the MultiText system.
The issues of concern to the MultiText Project include data distribution, load balancing,
fault tolerance, fast update, compression, document structure, ranking and user interac-
tion [CBCG95]. The RDSS architecture proposed in this thesis addresses the first three
issues.
1.1.2 Distribution Problem
Compared to the retrieval server (multimedia or text), index engine operations are
computationally intensive. An expensive large RAID (Redundant Arrays of Inexpensive Disks)
subsystem may solve the storage problem, but it cannot solve the problem of increased
computational demands. Thus, an index engine supporting a sizable information man-
agement system would benefit in both cost and performance by spreading both data and
processing loads across many nodes.
Many large databases and digital libraries, including MultiText, utilize replicated nodes
(replicated index engines in MultiText) to achieve better performance. Many Read One
Write All (ROWA) techniques [HHB96] have been developed to handle the data con-
sistency problem in a distributed system. Data queries in some replicated systems are
served by multiple servers, each containing the same data set. The workload of each
query is not shared among nodes. As a result, although the throughput of the search
may be sped up by replicating the index list to multiple locations, the response time of
each query is not improved.
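To make this trade-off concrete, the following sketch contrasts Read One Write All access with per-query work. It is a toy model: the ReplicatedIndex class, its dictionary-based index, and the postings format are invented for illustration and are not taken from MultiText or any system cited here.

```python
import random

class ReplicatedIndex:
    """Toy ROWA replica set: every node holds a full copy of the index."""

    def __init__(self, num_nodes):
        self.nodes = [dict() for _ in range(num_nodes)]

    def write(self, term, postings):
        # Write All: an update must reach every replica so that any
        # replica can later answer a read consistently.
        for node in self.nodes:
            node[term] = postings

    def read(self, term):
        # Read One: any single replica can serve the query, so total
        # throughput scales with the number of replicas -- but each
        # query is still answered by one node doing all the work, so
        # the response time of an individual query is not improved.
        return random.choice(self.nodes).get(term)
```

A three-node instance of this kind can answer three independent queries concurrently, but each one only as quickly as a single server would.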
To make a completely distributed version of a search-enabled digital library, both the
Figure 1.1: Architecture of the MultiText System
index engine and the retrieval server need to be distributed across multiple processing
nodes. However, the data storage format in an index engine is different from that in its as-
sociated retrieval server. Therefore, a content independent distributed storage framework
is needed such that the index engine can use one storage format in one framework and
the associated retrieval (multimedia or document) server can employ a different storage
format in another framework.
1.1.3 Index Mirroring Problem
As digital libraries become more critical to users, fault tolerance becomes indispensable.
At the single node level, various RAID [CLG+94] and similar technologies have improved
storage reliability. At the system level of a distributed server, mirroring or correction
code based backup methods [GCCT96, BS92] have been proposed and developed.
However, unlike blocks in a storage device, index lists in an index engine cannot be
arbitrarily partitioned. For an index search to be completed, an index engine needs
to operate on index lists describing whole documents. Thus, a blind block scattering
strategy (e-g., the Tiger Video server [BBD+96]) for mirroring, where data is replicated
and scattered across the network in fixed blocks regardless of content, would not work.
While a blind scattering strategy for mirroring does not work for the index engine, a
strategy of mirroring all data to a single location (e.g., HotBot web search system [Ink96])
would upset load balancing in the event of a node failure. To circumvent this problem, this
thesis proposes a data-entry-based scattered-mirroring strategy. It avoids load balancing
problems when the backup data is needed, while keeping related information together.
In addition, it is flexible enough that various retrieval servers associated with the index
engine can also utilize the same mirroring scheme on a separate setup of the same RDSS
framework. Section 1.4.2 contains a more detailed description of the mirroring scheme
in the proposed storage system.
1.2 Related Work
There have been many studies on distributed databases [Bur90, CM96, Lin91, TG96],
and information retrieval systems utilizing clustering [BBD+96, DANO91, Ink96, SO95,
Sta90]. The parent project of this study, MultiText [CBCG95, CCPT97], also uses multiple
search and retrieval servers in its setup.
Many of these studies, including the information retrieval system by Tomasic and Garcia-
Molina [TG96] and the current MultiText system [CCPT97] have distributed retrieval
servers and index engines. Some other studies, like the distributed indexing system by
Danzig et al. [DANO91], focus on the index engine only and do not address the availability
issues. Commercial systems like the Tiger Video Server [BBD+96] and the HotBot
server [Ink96] include a backup strategy, but are narrow in their target applications.
None of them provides a complete solution framework for distributing a generic digital
library application that also includes on-line management with fault tolerance and load
balancing capabilities.
For the related work with respect to reliability, there is the proven RAID technology,
which is summarized by Chen et al. [CLG+94]. Other studies on reliable distributed
storage are mostly file-system related [BS92, GCCT96, GJSO91, GNA+97, LGG+91,
LS90].
Discussions on the problems associated with developing distributed applications, including
distributed agreement and transaction processing, are well documented by Coulouris
and Dollimore [CD88] and by Gray and Reuter [GR93]. The two-phase distributed
agreement and transaction protocol has been indispensable in the development of the
RDSS framework in this study.
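The two-phase agreement just mentioned can be sketched as follows. This is a minimal illustration under strong simplifying assumptions (reliable message delivery, no coordinator failure, no timeouts), and the Participant class and two_phase_commit function are hypothetical names for this sketch, not actual RDSS interfaces.

```python
class Participant:
    """One node taking part in a distributed transaction (toy model)."""

    def __init__(self):
        self.state = "init"

    def prepare(self, update):
        # Phase 1: stage the update and vote on whether it can be applied.
        self.pending = update
        self.state = "prepared"
        return True  # this toy participant always votes "commit"

    def commit(self):
        # Phase 2 (all voted yes): make the staged update durable.
        self.state = "committed"

    def abort(self):
        # Phase 2 (someone voted no): discard the staged update.
        self.pending = None
        self.state = "aborted"

def two_phase_commit(participants, update):
    votes = [p.prepare(update) for p in participants]  # phase 1: collect votes
    if all(votes):                                     # phase 2: unanimous yes
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:                             # phase 2: any no aborts all
        p.abort()
    return "aborted"
```

The key property is that every participant reaches the same outcome: either all apply the update or none do.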
Finally, background information on digital library applications can be found in Lesk's
book [Les97] or in Witten et al. [WMB94]. Also, a recent list of experimental and
commercial reliable distributed systems can be found in Helal, Heddaya and Bhargava's
book [HHB96] on replication techniques.
1.3 Design Objectives
Applications in the RDSS targeted design domain, index engine or retrieval server, re-
quire on-line retrieval (query) and batched data modification (update) capability. Any
application that falls within this domain could be adapted, in principle, to become an
RDSS-enabled distributed application. The concise specification of the application do-
main is given in chapter 2.
In this domain, the query clients are generally considered to represent external users,
while the update clients usually represent administrators or trusted users. From this
point on, the term 'user' refers to a 'query client' unless stated otherwise.
An ideal distributed storage system would support all distributed applications with high
efficiency and without discrimination. Also, like RAID storage, the distributed layer
would be transparent to the application. However, in reality the desires for efficiency
and transparency are in conflict. Thus, compromises and decisions are necessary
in order to make the final system practical. Attempts are made to make sure these
constraints (see chapter 2) are minimal and well-defined.
The main architectural design criteria are listed in this section. They are:
- location transparency,
- reliability,
- maintainability and extensibility,
- load balancing, and
- query performance.
While satisfying these design objectives, the RDSS architecture is designed to minimize
the restrictions on the target application and maximize the ease of its implementation.
Location Transparency
One of the main design goals of the RDSS as a distributed architecture is to maintain
a high degree of transparency. From a query client's point of view, the RDSS-enabled
application should be completely location transparent. That is, the multi-computer envi-
ronment is hidden and the application would appear to an external client as a stand-alone
application on a single server.
Similarly, the update client should be able to manage the whole distributed system as
an integrated unit. Depending on the implementation of the final system, however, the
RDSS is flexible enough to allow for individual node management for both the system
administrator and the update client.
In terms of the application, if the front-end marshaller-dispatcher module is excluded,
its external environment should behave like the external environment of an equivalent
stand-alone single server.
Reliability
The second main design objective of the RDSS architecture is reliability. To achieve
this goal, the RDSS architecture needs to provide better fault tolerance and higher re-
coverability compared to a stand-alone single server application. The ultimate goal is to
incorporate enough flexibility such that various fault tolerances and recovery technologies
and policies may be integrated into the system without affecting the application or the
users' environment.
In the current design, a single level of data redundancy is built into the architecture.
Only a node level failure is considered to be the responsibility of the RDSS; however, the
reliability of each node within the RDSS can be improved using RAID storage.
Maintainability and Extensibility
To further improve an application's availability over its corresponding stand-alone single
computer version, on-line administration is part of the RDSS. Reconfiguration of the
RDSS environment can be done while users' queries are being processed.
More specifically, on-line removal of a failed node and on-line addition of a new node are
supported. The latter operation allows the system to expand without any interruption
to the application's availability.
Load Balancing
The RDSS attempts to provide good load balancing among all nodes, whether or not
backup (mirrored) data is in use. During normal operations, with no node failure,
data entries are spread evenly among all nodes. In the event of a node failure, the
entry-based scattered-mirroring strategy keeps the workload of the remaining nodes balanced.
Load balancing minimizes the performance degradation in the event of a node failure.
The objective is to maximize the total serving capacity of the RDSS network, and to
reduce the performance impact of the reliability objective.
Query Performance
While adding the above capabilities, the RDSS should have minimal impact on query
performance. One of the main benefits of a multi-computer distributed application is the
increased throughput relative to a single computer system. Regardless of the through-
put gain due to its distributive nature, the RDSS attempts to minimize the penalty on
individual node throughput and latency on every query request.
1.4 Overview of the Robust Distributed Storage System (RDSS)
The RDSS architecture is designed to counter the distribution and replication problems
of an index engine, while meeting the design objectives listed in the previous section. The
rest of the thesis presents the detailed constraints, design, and behaviour of the system.
This section offers a brief summary of its features.
1.4.1 System Environment
Figure 1.2 shows the network view of a sample RDSS setup. Each RDSS server node
runs the same RDSS and application software and manages one or more physical storage
units. In addition to the RDSS server nodes, a marshaller-dispatcher module performs
routing of requests and responses to and from the clients. The current RDSS design
limits the number of update clients to one and relies on external means to guarantee the
trustworthiness of the update client. However, no limit is placed on the number of query
clients supported.
Each server node and the marshaller-dispatcher node run on separate workstations. However,
there is no technical reason (other than performance issues) preventing the
marshaller-dispatcher module from co-locating with a server node on the same workstation.
During normal operation, a query request from a query client is distributed across the
system to individual server nodes by the marshaller-dispatcher module to which the
query client is connected. The responses from the server nodes are then gathered by
the marshaller-dispatcher module, and the amalgamated result is returned to the query
client. The operation for a trusted update request is similar to this, except that the
replication strategy of the RDSS is enforced on any modification of data to ensure that
data consistency is maintained.
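The scatter/gather query path just described can be sketched as below. The node interface (a search method returning (document, score) pairs) and the merge-by-score step are assumptions made for illustration; the real RDSS defers both the query protocol and the result format to the application and its front-end modules.

```python
from concurrent.futures import ThreadPoolExecutor

def marshal_query(query, nodes):
    """Fan a query out to every server node and amalgamate the replies."""
    # Scatter: each node searches only the partition of data it holds.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda node: node.search(query), nodes))
    # Gather: merge the partial hit lists into one ranked result, so the
    # query client sees the whole network as a single stand-alone server.
    merged = [hit for part in partials for hit in part]
    merged.sort(key=lambda hit: hit[1], reverse=True)  # rank by score
    return merged
```

From the client's point of view, only the merged list is visible; the distribution of work across nodes is hidden, which is the location-transparency objective of section 1.3.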
Figure 1.2: System View of RDSS
1.4.2 Mirroring Strategy
A mirroring strategy is chosen as the basis for fault tolerance in the RDSS because of its
simplicity and its quick recovery ability. More importantly, mirrored data can be used
directly by the application servers without additional processing cost. This means that
the throughput performance of the system would be less affected in the degraded mode.
Instead of a data-blind strategy, the RDSS requires the application under its influence
to associate data entries with a range of addresses on a linear address space (0 to 2K). The
details of the RDSS data organization requirements are given in chapter 2.
In the current design, for simplicity, only one level of mirroring is included in the system.
Each server mirrors its primary data to the rest of the server nodes in the system.
Data scattering (striping) is done along the boundary of data entries, such that each
mirrored data entry can be accessed as a whole from a single mirroring location without
any recombination. The reason for striping the secondary data across N-1 server nodes
is to improve performance during the degraded mode (in an N-node system). If one
server becomes unavailable, each remaining node will only have to carry an additional
1/(N-1) of the workload.
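The placement rule behind this 1/(N-1) figure can be sketched with a pair of placement functions. The modulo-based mapping below is a hypothetical stand-in for the RDSS's actual address-range organization; it only demonstrates the invariant that a mirror never lands on its own primary node and that one node's mirrors are spread over the other N-1 nodes.

```python
def primary_node(entry_id, num_nodes):
    # Spread primary copies of data entries evenly over all N nodes.
    return entry_id % num_nodes

def mirror_node(entry_id, num_nodes):
    # Scatter each node's mirror copies over the other N-1 nodes
    # (assumes num_nodes >= 2): the offset cycles through 1..N-1, so
    # the mirror is never the primary, and when a node fails each
    # survivor picks up only about 1/(N-1) of the failed node's load.
    p = primary_node(entry_id, num_nodes)
    offset = 1 + (entry_id // num_nodes) % (num_nodes - 1)
    return (p + offset) % num_nodes
```

In a three-node instance of this sketch, the entries whose primary is one node are mirrored alternately to the other two, so a failure of that node splits its workload evenly between the survivors.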
Figure 1.3 depicts a simple distribution of data on a fictitious RDSS application with N
server nodes. The primary copy of each data entry is stored in one of the server nodes.
The secondary (mirrored) copy of each data entry is stored in another server node. For
example, if the primary copy of entry x is stored in node N, the mirrored copy of entry
x may be stored in any other node but node N.
If extra storage is needed, new server nodes can be added on-line to the RDSS without
interruption to the application services. Conversely, a server node can be deleted on-line
if necessary. To perform node addition and deletion, the RDSS may need to resize the
mirroring partitions on-line.
The entry based scattered mirroring strategy allows the workload to remain balanced
[Figure legend: primary data storage; secondary mirrored data storage.]
Figure 1.3: Data Replication in the RDSS
even if one node has failed. Figure 1.4 shows the distribution of data on a three-node
system before and after a node failure. The data entries in the primary partition on each
node are scattered evenly over the two other nodes. Data entries that are located on storage
partition 2P (node 2 primary partition) on node 2 are mirrored to either partition 2aS
(secondary partition for node 2 data) on node 1 or partition 2bS on node 3. Similarly,
data entries on 1P are mirrored to 1aS and 1bS, and data entries on 3P are mirrored to
3aS and 3bS.
After the failure of node 2, the secondary (mirrored) data partition of node 2 on node 1
(partition 2aS) and node 3 (partition 2bS) becomes active. In this example, the distribution
of data is assumed to be closely related to the actual processing workload usage,
such that after the failure of node 2, the workload is balanced between the two remaining
nodes.
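The load-balancing effect of the scattered mirroring strategy can be sketched numerically. The following is an illustrative model only, not the RDSS implementation; the round-robin mirror assignment is an assumption made for this example.

```python
# Toy model of entry-based scattered mirroring: each entry has a primary
# node, and its mirror is placed on one of the other N-1 nodes in
# round-robin order (an assumed placement policy, for illustration).
def place_entries(num_nodes, entries_per_node):
    placement = []  # list of (primary_node, mirror_node) per entry
    for primary in range(num_nodes):
        others = [n for n in range(num_nodes) if n != primary]
        for i in range(entries_per_node):
            mirror = others[i % len(others)]  # scatter across N-1 nodes
            placement.append((primary, mirror))
    return placement

def active_load(placement, failed_node):
    # An entry is served by its primary node, unless that node has
    # failed, in which case its mirror node takes over.
    load = {}
    for primary, mirror in placement:
        server = mirror if primary == failed_node else primary
        load[server] = load.get(server, 0) + 1
    return load

placement = place_entries(3, 100)
print(active_load(placement, failed_node=None))  # {0: 100, 1: 100, 2: 100}
print(active_load(placement, failed_node=2))     # {0: 150, 1: 150}
```

With three nodes of 100 entries each, the failure of node 2 leaves the remaining two nodes carrying 150 entries each, i.e., an extra 1/(N-1) = 50% of their original load, as described above.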
1.5 Thesis Outline
The rest of this thesis is structured as follows. Chapter 2 provides a complete discussion
of the application constraints and requirements that an application must follow to work
properly within the RDSS framework. Chapter 3 contains the architecture design
overview of the RDSS. The member components are shown, along with their relationship
to each other.
Chapter 4 is the detailed design chapter that presents an implementable design of the
RDSS prototype and describes the behaviour of the member components. Chapter 5
gives the implementation status of the RDSS prototype. Finally, a conclusion and a
list of possible future work are given in chapter 6. The appendices contain further
implementation information on the prototype.
[Figure: data distribution on three nodes before and after a node failure, showing primary data storage, secondary mirrored data storage, and active data storage serving queries.]
Figure 1.4: Load Balancing of the Scattered Mirroring Strategy
Chapter 2
Application Environment
This chapter presents the Robust Distributed Storage System (RDSS) application domain
and the constraints introduced by the RDSS. The applications within the target engi-
neering domain are described, and the design and implementation requirements on the
target application are given.
2.1 Application Domain
One of the goals of the RDSS is to minimize the differences seen by the application devel-
oper between a single node non-distributed version of an application and its equivalent
distributed version. Instead of designing for the universe, the RDSS targets a chosen
subset of applications.
The chosen domain of the RDSS encompasses most of the digital library applications,
regardless of their content - text, hypertext, multimedia streams, objects or indices.
It does not include all database applications, because the RDSS places constraints that
many database applications cannot meet.
The RDSS enabled application is assumed to reside on a local area network. Clients of
the application may reside outside of this network, provided that means of connection and
authentication are available. Many applications may be adapted to fit within the RDSS
architecture, provided their data organization and external interfaces can be adapted
to meet RDSS requirements. The data organization required by the RDSS environment
will be presented in section 2.2.
For an application to function properly in the scalable RDSS environment, it must comply
with certain constraints specified in this chapter. An application designer is given a list of
precise yet flexible constraints and tools, such that applications in the RDSS engineering
domain can be created without additional difficulties.
In terms of the application's environment, the RDSS software layers are mostly transpar-
ent. An RDSS compliant application can be used as a single stand-alone server without
any software modification. The system synchronization, and the duplication and recovery
processes are invisible to the application.
The goal is for any single stand-alone application within the application domain to be-
come a distributed application via the RDSS framework, with a few simple additions
and no major modification. To illustrate the application requirements, we will use a
simple text snippet application, which stores arbitrary text objects, as a running example
throughout the chapter. The implementation of the application will be used as a further
example in chapter 5.
Figure 2.1 depicts the interfaces that an RDSS compliant application is required to have.
In the following sections, the constraints required by the RDSS on each of the application
interfaces are described in more detail.
2.2 Data Organization
For the RDSS to provide automatic data mirroring and on-line storage management,
certain data organization constraints are needed. The RDSS addressing scheme and the
[Figure: query clients reach the Application Server (AS) query interface via the Application Query Protocol (AQP); the trusted update client reaches the update interface via the RDSS Compliant Application Update Protocol (RC-AUP); the application server accesses physical storage through its storage interface.]
Figure 2.1: Stand-Alone Application Server
notion of capacity are the two main requirements on any RDSS-compatible application.
2.2.1 Data Addressing Scheme
In the RDSS world, data entries are associated with addresses on a linear address space.
The size of the address space is determined by the RDSS implementation (usually 64
bits). Each data address location corresponds to a virtual storage quantum, which may
be a byte, a word of text, a video frame or any other arbitrary storage element. Only
active address locations may have physical storage allocated to them.
Externally, the application server stores a collection of data entries. Each data entry
is associated with a finite, non-zero set of storage quanta. More specifically, each data
entry must be mapped to a contiguous range of addresses in the RDSS address space.
The mechanism for translating a virtual storage quantum into a physical storage block
is internal to the application.
No two data entries may share a single address location. Each active address location
must correspond to a unique quantum in a unique data entry. Each data entry must
be mapped individually onto a unique unbroken range in the RDSS address space. If a
data entry X is assigned to RDSS addresses 1024 to 2047, then no RDSS position within
that range can refer to data outside X. Un-associated address locations, however, may
exist between addresses used by data entries. For example, the simple text snippet server
stores variable length generic text snippets. Each entry is associated with a single RDSS
address. In the simple text server, the RDSS address doubles as the retrieval handle
for the text snippet. Using the assigned RDSS address, specific text snippets can be
retrieved. For both data retrieval consistency and RDSS compliance, each entry needs a
unique RDSS address tag (a contiguous range of RDSS addresses).
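The addressing rules above can be checked mechanically. The sketch below is illustrative only (the class and its methods are not part of the RDSS prototype); it registers entries as contiguous, non-overlapping ranges on a linear address space, with gaps allowed between them:

```python
class AddressMap:
    """Toy model of the RDSS linear address space: each data entry owns
    one contiguous, non-overlapping range of addresses (inclusive)."""
    def __init__(self):
        self.ranges = {}  # entry_id -> (start, end)

    def add_entry(self, entry_id, start, end):
        if start > end:
            raise ValueError("range must be non-empty")
        for other_id, (s, e) in self.ranges.items():
            if start <= e and s <= end:  # the two ranges overlap
                raise ValueError(f"overlaps entry {other_id}")
        self.ranges[entry_id] = (start, end)

    def lookup(self, address):
        # Returns the entry owning this address, or None: unassociated
        # address locations may exist between data entries.
        for entry_id, (s, e) in self.ranges.items():
            if s <= address <= e:
                return entry_id
        return None

m = AddressMap()
m.add_entry("X", 1024, 2047)
print(m.lookup(1500))  # X
print(m.lookup(3000))  # None
```

Attempting to register a second entry overlapping addresses 1024 to 2047 raises an error, mirroring the rule that no two data entries may share an address location.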
2.2.2 Capacity Definition
Each data entry will also be assigned a value representing its capacity requirements.
Capacity is the notion of estimated storage usage. It represents the maximum storage
requirement of the data entry in units that are specific to the application. It may or
may not correspond directly to the actual physical storage size. The application may
employ compression and data correction techniques, such that the actual storage required
for a data entry may be smaller or larger than the external format size seen by a query
client. The notion of capacity is used for two purposes:
1. to measure the maximum application dependent storage requirements for a data
entry.
2. to measure the minimum available storage in an application server.
The RDSS uses the notion of capacity to allocate entries to servers. Given a set of data
entries with capacities c_0, ..., c_(N-1) and a server with capacity C_S, then the N entries can be
stored on the server if:

    c_0 + c_1 + ... + c_(N-1) <= C_S
In the simple text snippet server example, no data compression, encryption, or error
encoding is done within the application. The capacity usage for each data entry is the
size of the data rounded up to the next block size. The header block and translation
table usages are excluded from the capacity calculation by the application because they
are not tied to the amount of data in the storage. A 786-byte text snippet will have a
capacity of 2 on a system using 512-byte storage blocks.
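For the simple text snippet server, the capacity of an entry is just its size rounded up to a whole number of storage blocks, and a set of entries fits on a server when their capacities sum to no more than the server capacity. A minimal sketch of both calculations (the 512-byte block size is taken from the example above; the function names are ours):

```python
def capacity(entry_size_bytes, block_size=512):
    # Round up to the next whole block: ceil(size / block_size).
    return -(-entry_size_bytes // block_size)

def fits(entry_capacities, server_capacity):
    # Entries fit on a server if their capacities sum to at most C_S.
    return sum(entry_capacities) <= server_capacity

print(capacity(786))        # 2, matching the 786-byte example above
print(fits([2, 3, 5], 10))  # True
```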
Regardless of whether any compression, error correction, encryption or any other trans-
formation is applied to the data, the application must store the resulting data entries
in a virtual linear block device provided by the RDSS. Each data entry stored must be
independent from others. In particular, the application cannot use information from one
data entry for compression, error correction, or encryption of another data entry.
2.2.3 Data Organization Requirements Summary
To summarize, the RDSS assumes that the application in its environment has the fol-
lowing data organization:
- Data entries are quantized.
- Data entries are mapped onto the linear RDSS address scheme.
- An estimated storage capacity is associated with each data entry.
- Data entries must be independent in their stored format.
The usage of this data organization may be found in the architecture design of chapter 3.
In short, the quantized addressing scheme allows the RDSS to perform automatic data
mirroring, while the notion of capacity allows on-line storage management.
2.3 Application Server Design Requirements
There are two components in an RDSS compliant application: the application server
module and the application front-end module. The application server module is a self-
contained database server that can be executed alone as a single node database appli-
cation if desired. However, when multiple application servers are running together as a
distributed application, an application front-end module is needed. It acts as a multi-
plexor/demultiplexor for external queries.
Most of the RDSS compliance requirements for an application are focused on the external
interface of the application server. Internally, the application server has to comply with
the data organization restrictions described in the last section. The physical storage
limitations will be addressed in the interface section (2.3.1).
In addition, the application must support the transaction model when dealing with data
updates. An update request is not finalized until it has been committed by a specific
commit message. Support of the abort message is also required.
Finally, the application needs to support efficient transfer of large amounts of data be-
tween two instances of itself. This is needed for load balancing, mirroring and recovery
operations. On request, the application can extract a subset of its contents and send it
to another instance of itself through a specified TCP port. The receiving application
instance, on request, should establish a TCP connection to the aforementioned port lo-
cation and merge the incoming data with its own contents. The data format during the
transfer is not constrained by the RDSS. In the simple text server example, only the text
blocks and RDSS addresses are transferred.
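The extract/merge transfer can be pictured with a loopback sketch. This is purely illustrative (the real transfer format is application defined and the function names here are invented): one instance listens on a TCP port and streams its entries, while the receiving instance connects to that port and reads until the sender closes the connection.

```python
import json
import socket
import threading

def serve_extract(entries, port_holder, ready):
    # "EXTRACT" side: stream the selected entries to whoever connects.
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))              # OS-assigned port
    port_holder.append(srv.getsockname()[1])
    srv.listen(1)
    ready.set()
    conn, _ = srv.accept()
    conn.sendall(json.dumps(entries).encode())
    conn.close()
    srv.close()

def merge_from(port):
    # "MERGE" side: connect to the given port and read entries until
    # the sending instance closes the connection.
    sock = socket.create_connection(("127.0.0.1", port))
    data = b""
    while chunk := sock.recv(4096):
        data += chunk
    sock.close()
    return json.loads(data)

entries = {"4096": "hello", "8192": "world"}  # RDSS address -> snippet
port_holder, ready = [], threading.Event()
t = threading.Thread(target=serve_extract, args=(entries, port_holder, ready))
t.start()
ready.wait()
received = merge_from(port_holder[0])
t.join()
print(received == entries)  # True
```

The JSON encoding here stands in for whatever application-specific format the real servers agree on; the RDSS itself does not constrain it.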
The following sub-section details the interface requirements on the connection and com-
munication mechanisms of the application server module. The protocol sub-section gives
the format of the communication contents that are understood by the RDSS.
2.3.1 Application Server Interfaces Requirements
As shown in figure 2.1, the RDSS assumes and requires a compliant application server
to have the three depicted interfaces: a query interface, an update interface, and a storage
interface. The given application server may have other interfaces that do not belong
to any of the three interfaces described. However, for an application server to function
properly with other copies of itself in the RDSS environment, the three interfaces shown
must exist and comply with the requirements in this section.
There are two categories of requirements for any one of the three RDSS required applica-
tion interfaces. The first contains the requirements imposed by the RDSS architecture.
For example, the need for the three aforementioned interfaces is an architectural con-
straint. It is an RDSS design goal that the architectural interface constraints remain as
lightweight as possible.
The other category of requirements is due to the RDSS implementation. These require-
ments depend on the implementation and may change even if the RDSS architecture
remains unchanged. The limitations listed here are those placed by the current RDSS prototype.
Application Query Interface Constraints
The RDSS architectural constraints on the application query interface are the following:
- The interface is stream based.
- Concurrent multiple clients must be supported.
The prototype RDSS implementation constraints on the application query interface are
these:
- Each client has its own session.
- The protocol used is session based. (See the query protocol requirements in section 2.3.2.)
- Each session is connection based.
- Multiple simultaneous sessions are supported.
- The connection mechanism is implemented on the TCP transport layer.
Application Update Interface Constraints
The following lists the RDSS architectural constraints on the application update interface:
- The interface is stream based.
- The interface complies with the RDSS application update protocol restrictions. However, a superset of the protocol is possible. (See the update protocol requirements in section 2.3.2.)
The prototype RDSS implementation constraints on the application update interface are
the following:
- The protocol used is session based.
- Each session is connection based.
- The connection mechanism is implemented on the TCP transport layer.
Application Storage Interface Constraints
All storage operations of the application must be done via the Hardware Abstraction
Layer (HAL). It provides an abstraction of a linear block device accessed through a virtual
disk interface. The architectural constraints of the storage interface on the application
are these:
- It must use the HAL Application Server (HAL-AS) library to access any physical storage.
- No application persistent state may exist outside of HAL managed storage.
The HAL software library is designed to be a simplified version of the I/O routines
of the original low-level system library. See the HAL interface in the next section for
details. Appendix B contains the calling interfaces of the HAL-AS library prototype.
The implementation limitations imposed by the prototype are the following:
- The protocol is specified by the HAL-AS library interface.
- The application must link and execute with the HAL-AS library.
- Only one HAL setup is allowed for each physical storage unit (disk partition, disk, or group of disks).
Constraints on Other Interfaces
The RDSS does not use any other activation or communication interfaces provided by the
application server. If the application server has such interfaces, the following architectural
constraints apply:
- These interfaces are not necessary for the normal query and update operations of an application server.
- They do not conflict with the requirements on the other three interfaces mentioned previously.
An example interface that falls into this category would be a profiling interface. Any addi-
tional support for gathering application usage statistics is up to the application designers.
Additional modules may be added to combine information from all the servers. In the
simple text snippet server, an activity log is produced for performance benchmarking
and debugging. Its activities do not interfere with the three RDSS required interfaces in any
way.
2.3.2 Application Server Protocols Requirements
The protocol requirements specify the restrictions on the contents sent via the interfaces
listed above. Again, some of the constraints may be due to the architectural design of
the RDSS, while others are due to the implementation choices in the RDSS prototype.
The following details the limitations on the protocol used by the three RDSS compliant
interfaces on the application server. No protocol constraints are placed on the non-RDSS
interfaces.
Application Query Protocol Constraints
Other than the interface constraints given in the last section, there is no restriction on the
query protocol required by the RDSS. That is, the query format is completely application
dependent. Any session-based protocol (binary or plain text) may be used.
Application Update Protocol Constraints
In order for the RDSS to perform correctly, it has to manipulate data among the net-
worked application servers. To do so, it needs to know the protocol used by the appli-
cation server update interface. Thus, in addition to restricting the update interface, it
also needs certain features in the update protocol. The following is a description of the
update commands required of the update interface:
ADD data entry: A new data item is added to the server. The start and end RDSS
addresses are assigned externally by the update client. The capacity and external format
size of the entry will also be needed.

DELETE entries: All complete data entries within the given RDSS address range will
be removed from the system. Data entries that partly fall within the range are not
removed.

EXTRACT entries: All data entries within the specified address range (inclusive) are
either sent to the output stream of the application or to a given port number. The data
may be in an externally formatted form if desired.

MERGE entries: Add a number of data entries to the server from an external port
location (e.g., the output port on the EXTRACT command). The application will attempt
to connect to the given port location and read in the data entries. The transfer format
is the same one used by the EXTRACT command. No data entries will be committed
until the update commit command is issued.

Update COMMIT: For an add or delete operation, the application will first acknowledge
that it is ready. Only when the commit message arrives will the visible change occur.
The committing step in the application is assumed to be atomic (i.e., the transaction is
either committed or not at all).

Update ABORT: Instead of committing the current outstanding update requests, the
operations are aborted and the changes are discarded.

CAPACITY available: Upon receiving this command, the application returns the minimum
capacity available for new data. (See section 2.2.2 on the definition of capacity.)

STORAGE available: Upon receiving this command, the application returns how much
storage space in the virtual disk is available for reclamation. The block size and number
of blocks available are returned. It should be used before any storage is truncated using
the TRUNCATE command.

TRUNCATE storage: This command is used to notify the application of any pending
storage size decrease. The application should free up the necessary blocks at the end of
the virtual linear block device visible to the application. After a successful storage
truncation, the RDSS HAL can safely reduce the storage size allocated to that particular
application server.

EXPAND storage: This command is used to notify the application that a storage size
increase has occurred. The application can then make use of the additional storage
blocks at the end of the virtual linear block device.

SHUTDOWN server: This command instructs the application to terminate itself.
For the RDSS to operate correctly, the update protocol must be implemented exactly in
both the RDSS and the application. A superset of the given command set is possible,
but any additional features will not be used by the RDSS.
The complete BNF (Backus-Naur Form) notation of the protocol used by the RDSS
prototype can be found in appendix A. In the RDSS prototype, the update protocol
is implemented as plain text commands. Table 2.1 briefly summarizes the parameters
accepted by each update command listed above.
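The transactional flavour of the update protocol can be sketched with a small state machine. The real command syntax is given in appendix A; the grammar and class below are invented for illustration only. The key behaviour shown is that staged updates only become visible on COMMIT and can be discarded by ABORT:

```python
# Toy model of the transactional update protocol (hypothetical syntax;
# the prototype's actual grammar is in appendix A). Updates are staged
# and become visible atomically only on COMMIT.
class UpdateSession:
    def __init__(self):
        self.committed = {}  # (start, end) address range -> data
        self.pending = []    # staged operations awaiting COMMIT

    def handle(self, line):
        parts = line.split()
        cmd = parts[0].upper()
        if cmd == "ADD":                    # ADD <start> <end> <data>
            start, end, data = int(parts[1]), int(parts[2]), parts[3]
            self.pending.append(("add", (start, end), data))
            return "READY"
        if cmd == "DELETE":                 # DELETE <start> <end>
            self.pending.append(("delete", (int(parts[1]), int(parts[2])), None))
            return "READY"
        if cmd == "COMMIT":                 # apply all staged ops at once
            for op, rng, data in self.pending:
                if op == "add":
                    self.committed[rng] = data
                else:                       # remove entries falling entirely
                    lo, hi = rng            # inside the delete range
                    self.committed = {r: d for r, d in self.committed.items()
                                      if not (lo <= r[0] and r[1] <= hi)}
            self.pending = []
            return "OK"
        if cmd == "ABORT":                  # discard all staged ops
            self.pending = []
            return "OK"
        return "ERROR"

s = UpdateSession()
s.handle("ADD 1024 2047 x-data")
print(s.committed)   # {} - nothing visible before COMMIT
s.handle("COMMIT")
print(s.committed)   # {(1024, 2047): 'x-data'}
```

Note how DELETE removes only entries that fall completely within the given range, mirroring the command description above.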
Application Storage Protocol Constraints
To use the storage interface, the application server has to be linked with the Hardware
Abstraction Layer Application Server (HAL-AS) library. It is a software library containing
abstractions for the physical storage. Physical storage is presented as a virtual disk
partition to the application. This allows the RDSS to manipulate the application physical
storage allocation without interfering with the application operations.
Operation                 | Input Arguments                    | Return Parameters
--------------------------|------------------------------------|----------------------------
Add a data entry          | Formatted data size, RDSS address  | Ready for commit
                          | range, Capacity size, Data         | (or operation failed)
Delete data entries       | RDSS address range                 | Ready for commit
                          |                                    | (or operation failed)
Extract data entries      | RDSS address range, Extract format,| Ready for commit
                          | (Output port number)               | (or operation failed)
Merge in data entries     | Source server, Source port         | Ready for commit
                          |                                    | (or operation failed)
Commit a modification     | -                                  | Success or failure
Abort a modification      | -                                  | Success or failure
Check capacity available  | -                                  | Capacity available
Check storage available   | -                                  | Block size, Number of blocks
Truncate storage          | Storage block count                | Success or failure
Expand storage            | Storage block count                | Success or failure
Shutdown server           | -                                  | -

Table 2.1: Application Server Update Protocol
Instead of following a given message format protocol like the update interface, the ap-
plication server must use the routines provided by the HAL-AS library for storing and
retrieving data. The basic methods available are:
Open: This routine initializes the HAL-AS library and synchronizes with other HAL-AS
users on the current server node. It opens a single virtual disk for read and write.
This has to be called before other HAL-AS routines.

Read: This method allows the application to read a specific number of blocks from a
given virtual disk partition into a memory buffer.

Write: Using this method, the application writes blocks of data from memory to the
virtual disk partition.

Status: The current information about the virtual disk partition is returned, including
the virtual disk block size and the number of usable blocks.

Close: The opened HAL-AS virtual disk and the associated system resources are released.
The detailed capabilities and the calling interfaces to the above routines are available
in appendix B. The detailed design of the HAL itself is described in chapter 4. The
HAL also provides storage management routines to other RDSS software components;
however, those routines should not be used by the application server.
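The flavour of the HAL-AS storage interface can be conveyed with a small sketch. The class below is not the real HAL-AS library (whose calling interfaces are in appendix B); it is a hypothetical in-memory model of a linear block device exposing the five operations described above:

```python
class VirtualDisk:
    """Hypothetical in-memory stand-in for a HAL-AS virtual disk: a
    linear array of fixed-size blocks (not the real library)."""
    def __init__(self, block_size=512, num_blocks=1024):
        # Plays the role of Open: set up the virtual disk for read/write.
        self.block_size = block_size
        self.blocks = [bytes(block_size)] * num_blocks

    def read(self, start_block, count):
        # Read `count` whole blocks starting at `start_block`.
        return b"".join(self.blocks[start_block:start_block + count])

    def write(self, start_block, data):
        # Write whole blocks of data starting at `start_block`.
        assert len(data) % self.block_size == 0, "whole blocks only"
        for i in range(len(data) // self.block_size):
            self.blocks[start_block + i] = \
                data[i * self.block_size:(i + 1) * self.block_size]

    def status(self):
        # Report block size and number of usable blocks.
        return {"block_size": self.block_size, "num_blocks": len(self.blocks)}

    def close(self):
        # Release the virtual disk and associated resources.
        self.blocks = None

disk = VirtualDisk()
disk.write(0, b"snippet!".ljust(512, b"\0"))
print(disk.read(0, 1)[:8])           # b'snippet!'
print(disk.status()["num_blocks"])   # 1024
```

Because the application sees only a linear block device like this one, the RDSS HAL beneath it is free to grow, shrink, or relocate the underlying physical storage.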
2.4 Application Front-End Design Requirements
In addition to the application server module, an RDSS compliant application needs a
front-end module. The front-end module resides within the RDSS marshaller-dispatcher
layer. It is responsible for multiplexing and demultiplexing the client requests and their
corresponding responses. Figure 2.2 shows the relationship of the query and update
front-ends to the application server.
[Figure: query clients and the trusted update client connect through the query and update front-end modules to the distributed application servers.]
Figure 2.2: Application Distributed View with Front-End Modules
The application front-end is separated into two smaller modules, corresponding to the
application server query and update interfaces. The query front-end is responsible for the
query requests and responses, while the update front-end pre-filters the update commands.
In the current RDSS design, a generic update front-end is provided. It can be used in
place of a custom update front-end if necessary.
2.4.1 Application Query Front-End
When a new query client connects to the RDSS via the marshaller-dispatcher query port,
a new process encompassing the Application Query Front-End (AQFE) will be created
to handle the new query session. The AQFE is responsible for reading client requests,
sending them to the relevant application servers, collecting responses, and giving the
results to the client.
Different applications require different query front-ends. As the RDSS places no require-
ment on the application query protocol, a custom query front-end must be built in order
to interpret and combine the results returned from the attached application servers. De-
pending on the nature of the application and its query protocol, the application query
front-end could be a simple multiplexor/demultiplexor (as provided as a sample in the
RDSS prototype) or a complex module with a significant amount of logic.
Instead of directly interfacing with query clients and application servers, the application
query front-end must use the routines provided by the Query Marshaller-Dispatcher
library (QMD-Lib). The QMD-Lib is a software library that contains all the routines for
the AQFE to communicate with the query ports on the application servers. By doing
so, most of the RDSS related activities are hidden and automatically resolved. Thus,
the coding effort required by the AQFE is much smaller. The following are the callable
routines in the QMD-Lib. More detailed descriptions of their operations can be found in
the QMD design section in chapter 4. The exact calling interfaces for the QMD-Lib
in the RDSS prototype are available in appendix C.
New query session: This is called at the beginning of each query session. All necessary
initializations are performed. A list of visible application server query
ports (with an active flag for each) is returned.
Start query: This routine sends a new query request to the given list of application
server query ports. An updated list of visible application server query
ports is returned with the current list size (which may have grown).

Read select: Using a mask corresponding to the last returned port list, the caller
will be blocked until one of the acceptable read events or an exception
occurs. Slot 0 in the mask refers to the client port. Again, the port
status list is updated.
Read from port: Data from the connection corresponding to the given slot is read into
the read buffer. The updated buffer and the read size are returned.
Note that new query requests should not be started this way. A simple
macro extension of this routine allows writing to one port at a time.
End query: An end query message may be sent to all application servers involved.
Both the size of the message sent and an updated application query
port status list are returned.
Terminate session: This routine ends the current session of query operations. The appli-
cation query port status list is released.
Get status: This forces an update of the application query port status list.
The application query front-ends are responsible for handling out-of-band query re-
sponses. The QMD-Lib assumes one outstanding query per client, and responses from
any application servers after the end query call are considered out-of-band. Also, no
simultaneous queries are supported in QMD-Lib; however, it is possible to design an
AQFE to allow multiple outstanding queries.
During a query session, the index assignment of ports on the application query port
status list will remain unchanged. New ports will appear at the end of the list. If a query
port has previously gone inactive, it will reappear on the old index location when its
status returns to active. This feature is included to remove any possible ambiguity due
to the masked Read, Write or Select operations.
The AQFE is responsible for making sure that all returned data from active query ports
are properly handled. In the simple text server, there should only be one valid return,
because all data entries are unique. At the start of a query session, a request from the
client is sent to all text server node query ports. Each server searches its translation
table, and if the RDSS address tag is found, the actual data entry is returned to the
query client via the simple text server AQFE.
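This fan-out behaviour of the simple text server AQFE can be sketched as follows. The model is simplified for illustration: the real front-end fans the query out through the QMD-Lib calls listed above rather than direct function calls, and the server "translation tables" here are plain dictionaries.

```python
# Simplified model of the simple text server AQFE: a query (an RDSS
# address) is fanned out to every active server; since each entry lives
# on exactly one server, at most one server returns a result.
def fan_out_query(address, servers):
    responses = []
    for server in servers:            # the real AQFE uses QMD-Lib here
        result = server.get(address)  # look up the translation table
        if result is not None:
            responses.append(result)
    assert len(responses) <= 1, "RDSS addresses are unique across servers"
    return responses[0] if responses else None

servers = [
    {1024: "alpha"},   # node 1 translation table
    {2048: "beta"},    # node 2
    {4096: "gamma"},   # node 3
]
print(fan_out_query(2048, servers))  # beta
print(fan_out_query(9999, servers))  # None
```

Because every data entry is unique, gathering the responses reduces to picking the single non-empty reply, which is what makes a simple multiplexor/demultiplexor front-end sufficient for this application.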
If node X has gone temporarily off-line, all the servers with mirrored data from node X
will appear on the query port connection list with the active flag turned on, while the
active flag will be turned off for the query ports associated with node X. If the changes
happen in the middle of a query (i.e., between the start query and the end query calls),
the original query request will be automatically sent to these new ports by the QMD-Lib.
When node X returns to on-line status, its active flag will be restored. The secondary
mirroring query ports will remain active on the list until the end query call. At that
point, they will become dormant, with their active flag turned off.
2.4.2 Application Update Front-End
Unlike the query part of the marshaller-dispatcher, the application update protocol is
clearly specified above, and a default update front-end module is possible. By default,
all update requests are sent to all application servers, where update decisions are made in
a distributed fashion based on the RDSS address segments assigned to each server node.
However, if desired, it is possible for the application design to add an update front-end
for the application. The main role of the update front-end is to prefilter update requests.
For example, if the simple text server is to be integrated with a word indexing engine
(which may be implemented as another RDSS application), the combined Application
Update Front-End (AUFE) may automatically generate the necessary word list.
To allow for a custom application update front-end, an Update Marshaller-Dispatcher
Library (UMD-Lib) is provided. Instead of a list of application query ports, the update
front-end deals with a list of RDSS server nodes. Also, only one update client is allowed
for the whole RDSS at any given time. The calling interfaces to the UMD-Lib in the
RDSS prototype are available in appendix D.
Besides using the UMD-Lib, any custom application update front-end is constrained to
follow the other update interface and protocol constraints listed above. In the prototype
version of the RDSS, the default AUFE (Application Update Front-End) is assumed. No
custom update front-end will be required for the application.
Chapter 3
Architectural Overview
In the previous chapter, the requirements on an RDSS compatible application were de-
scribed. As the name implies, the RDSS (Robust Distributed Storage System) provides
additional fault tolerance to the application while increasing its performance by distribut-
ing the work load among server nodes.
In this chapter, an overview of the RDSS architecture is presented. First, the design
criteria are summarized. They are followed by an overview of the system configuration
and descriptions of the high level RDSS components. The final section provides typical
operating examples.
3.1 Design Criteria
Before going into the architectural design of the RDSS, this section reviews the design
criteria. First is a summary of the design goals as discussed in chapter 1. This summary
is followed by a discussion of the design assumptions.
3.1.1 Design Goals
The main design objectives for the RDSS architecture, as discussed in chapter 1, are to:
- Provide a high level of transparency
- Improve database consistency and fault tolerance
- Enhance application availability
- Increase query throughput performance
- Minimize application design constraints
- Maximize the ease of application implementation
3.1.2 Design Assumptions
There are several underlying assumptions in the RDSS architecture design. The first is
the application compatibility assumption, which basically means that the target appli-
cation will follow the constraints described in chapter 2. The application is assumed to
have a specific query and update profile, with queries occurring frequently and updates
batched together. The performance of the system is primarily quantified by its query
throughput capacity.
Aside from the application assumptions, the RDSS architecture assumes a certain trust
model. Namely, there is only one active update client, which is not malicious. Although
it is possible to incorporate authentication into the design, it is not a core design goal.
Instead, the data security can be provided externally. This can be done, for example,
via connection access control to the update ports and other non-query access points of
the server nodes. No trust assumption is made regarding the query clients by the RDSS
architecture.
The RDSS deals with node level behaviour. The only error reported to the RDSS level is a
node level failure, which renders the whole server node unusable. To keep the architecture
simple, all node level failures are treated as if the failed node has been disconnected from
the rest of the system. It is up to the application and the underlying operating system
to deal with other less far-reaching errors; for example, a query syntax error is not an
RDSS node level failure. These error handling mechanisms are beyond the scope of the
core RDSS architecture, and are assumed transparent to the RDSS components.
Finally, the RDSS architecture is designed on the assumption that the whole system
will be run on a single local area network (LAN) with typical local network latency and
throughput. More specifically, the inter-computer communication is not considered to
be a bottleneck in the architecture design of the RDSS. The throughput of the network
should also be sufficient to handle the expected query volume.
3.2 System Configuration
Figure 3.1 depicts a static snapshot of a simple RDSS setup. In each RDSS-enabled
application, there are N server nodes, depending on the system configuration. In addition
to the server nodes, there are one or more marshaller-dispatcher nodes that perform
multiplexing and de-multiplexing operations. All these nodes are located on the same
local area network. It is also possible to co-locate a marshaller-dispatcher and a server
on a single node.
In this section, a list of the high level software components residing on the server nodes
and the marshaller-dispatcher node is presented. Also, the usage of physical non-volatile
storage in the RDSS is given. Overviews of each of the top level RDSS components are
given in sections 3.3 to 3.9 of this chapter.
3.2.1 Server Node Components
A server node contains instances of the Application Server (AS). In addition, an instan-
tiation of a server node includes the following top level RDSS components:
[Figure 3.1 here: query clients reach the RDSS-compliant local area network through the application query protocol and the update protocol; N server nodes, each with its own physical storage, sit on that network.]
Figure 3.1: Network Overview of the RDSS Architecture
- Node Configuration Monitor (NCM)
- Node State Machine (NSM)
- Application Server Manager (ASM)
- Hardware Abstraction Layer (HAL)
Each server node contains a primary data storage segment, which is associated with a
copy of the Application Server. The primary data is distributively mirrored by all the
other nodes, and vice versa. Given N nodes in the system, each node will have N - 1
mirroring segments, each serviced by an instance of the application server.
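This mirroring layout can be sketched in a few lines of Python (an illustrative simplification only; how each primary segment is actually partitioned across its peers is determined by the mirroring strategy of chapter 1, which this sketch does not model):

```python
def mirror_layout(n_nodes):
    """Sketch of single-level distributed mirroring: each node holds its own
    primary segment plus one mirror segment for every other node, so an
    N-node system has N - 1 mirror segments (each served by its own
    application server instance) per node."""
    layout = {}
    for node in range(n_nodes):
        layout[node] = {
            "primary": node,
            # one mirror segment per peer node
            "mirrors": [peer for peer in range(n_nodes) if peer != node],
        }
    return layout
```

In a four-node system, for example, every node carries three mirror segments, and every node's primary data appears as a mirror segment on each of the other three nodes.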
The Application Server Manager (ASM) is responsible for coordinating all application
server instances on a server node with the help of the Hardware Abstraction Layer (HAL).
The Node Configuration Monitor (NCM) is responsible for coordinating the local node
with others in the system, and finally the Node State Machine (NSM) maintains the state
of the local node by reacting to stimuli from the other components.
3.2.2 Marshaller-Dispatcher Components
The other major subsystem of the RDSS architecture is the marshaller-dispatcher soft-
ware. It is a separate subsystem because it does not have to reside with any particular
RDSS server node. It acts as the gateway for the query and update clients to interact
with the whole RDSS-enabled application. A copy of the marshaller-dispatcher software
contains the following components:
- Marshaller-Dispatcher Port Manager (MDPM)
- Query Marshaller-Dispatcher (QMD)
- Update Marshaller-Dispatcher (UMD)
The Application Query Front-End (AQFE) is the application specific part of the Query
Marshaller-Dispatcher (QMD), which is responsible for distributing query requests to
RDSS server nodes and collating the query results. The Application Update Front-End
(AUFE), similar to its counterpart, is part of the Update Marshaller-Dispatcher (UMD),
which deals with update requests and results. The Marshaller-Dispatcher Port Manager
(MDPM) keeps track of the server nodes and client connection ports for both the UMD
and the QMD.
3.2.3 Non-volatile Storage Management
The RDSS environment also manages the partitioning and usage of physical disk storage.
Every server node includes a Hardware Abstraction Layer (HAL) software component,
through which the application accesses the data on physical storage. This allows the
RDSS to provide both storage management and data replication on the same underlying
physical storage without adding any complexity to the application design requirements.
To achieve node level fault tolerance, single level data mirroring provides the simplest
solution. The replication strategy was described in chapter 1. The HAL software layer
allows primary and secondary (mirrored) data partitions to be resized on-line.
Finally, aside from the disk storage needed for data, a small amount of non-volatile storage
is needed for tracking transactions within the RDSS. In addition, system parameters and
states are maintained in this storage. This information is vital to the recovery of the
system after a crash.
3.3 Node Configuration Monitor (NCM) Module
The Node Configuration Monitor (NCM) is the key software module for coordinating
server nodes within the RDSS. It allows for distributed decision-making on the current
status of the whole system. The Node State Machine (NSM) module, to be described in
section 3.4, is responsible for maintaining the current state of the local server node. The
NCM, on the other hand, is the component for monitoring the status of the rest of the
distributed environment.
3.3.1 Node Contact List
There is a copy of the NCM software in each server node. While operating, the NCM
periodically broadcasts its contact list to all server nodes. A contact list contains the
list of node locations of other NCM's in the current RDSS environment. Each node
location is accompanied by a join sequence number and a control flag. The join sequence
number is assigned to a node when it joins the system. The node location list is sorted
in increasing order according to this field. The control flag indicates the current mode of
the corresponding node as viewed by the local NCM. The possible settings for the control
flag are:
- Startup - At the start of a server node, the Startup flag is associated with all
nodes on an NCM's contact list (other than the local node). The control flag of a
node changes when a contact list is received from that server node.

- On-line - An On-line status indicates that a contact list has been received from the
corresponding node within a preset time period (the NCM on-line timeout).

- Off-line - A server node is deemed Off-line if a contact list has not been received
within this preset time period.

- Delete Request/Proceed/Ready - The Delete flags are used for remodelling
synchronization (deleting or relocating a node). Their roles are explained in the
following synchronization sub-section.

- Add Request/Proceed/Ready - The Add flags are used for remodelling
synchronization (adding or relocating a node). Their roles are explained in the following
synchronization sub-section.
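A contact-list entry and its control flag might be modelled as follows (a hedged sketch; the field names and the class itself are illustrative, not the actual structures of the RDSS prototype):

```python
from enum import Enum

class Flag(Enum):
    """The control-flag settings a local NCM can associate with a peer."""
    STARTUP = "Startup"
    ON_LINE = "On-line"
    OFF_LINE = "Off-line"
    DELETE_REQUEST = "Delete Request"
    DELETE_PROCEED = "Delete Proceed"
    DELETE_READY = "Delete Ready"
    ADD_REQUEST = "Add Request"
    ADD_PROCEED = "Add Proceed"
    ADD_READY = "Add Ready"

class ContactEntry:
    def __init__(self, location, join_seq):
        self.location = location   # network location of the peer NCM
        self.join_seq = join_seq   # assigned when the node joined the system
        self.flag = Flag.STARTUP   # every peer starts out as Startup

def sorted_contact_list(entries):
    """The contact list is kept in increasing join-sequence order."""
    return sorted(entries, key=lambda e: e.join_seq)
```

The join-sequence ordering matters because, as the next sub-section shows, guardianship is defined purely by a node's position on this sorted list.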
Finally, to provide a convenient way to shut down the whole RDSS, the NCM also
accepts a Shutdown command. To halt the system, the administrator can simply send the
Shutdown command to the NCM broadcasting address repeatedly.
CHAPTER 3. ARCHITECTURAL OVERVIEW
3.3.2 Guardian Node
The responsibility to tabulate the results of all distributed decisions is distributed among
the server nodes instead of being located in a single dedicated control node. For each
server node in the system, a designated guardian node is assigned. It is responsible for
initiating and mastering any NCM synchronization regarding the server node under its
protection. By using a different guardian for each node, no single node failure can cause
system-wide failure.
The join sequence number is assigned to a node when it is successfully added to the RDSS
environment. Since the node addition operation is synchronized system-wide (see next
section) and only one node can be added at a time, the contact list of every NCM within
the RDSS will, therefore, be the same. The guardianship for a particular node x is, by
convention, assigned to the previous node on the contact list (i.e., node x - 1). The last
node on the list is responsible for the first node on the list. If a new node wants to
join the system, it is temporarily placed at the end of the list. The guardian of the new
node is the last node on the original list, unless there is a node deletion synchronization
handshake in progress. See the next sub-section on node relocation in that scenario.
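This guardianship convention reduces to a simple predecessor rule on the join-sequence-sorted contact list, sketched below (the function name is illustrative only):

```python
def guardian_of(index, contact_list):
    """Return the guardian of the node at `index` in a contact list sorted
    by join sequence number: the previous node on the list, with the last
    node guarding the first (wrap-around)."""
    return contact_list[(index - 1) % len(contact_list)]
```

Because every NCM holds the same sorted list, every node computes the same guardian assignments without any extra coordination.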
3.3.3 Server Nodes Synchronization
During normal, steady-state operations, all N server nodes are available for data queries.
When a node, X, fails and needs to be removed from the system, the other N - 1 nodes
become responsible for the data that it contains. To restore a single level of mirroring,
data transfer among the remaining nodes is necessary. A system-wide remodelling
operation like this requires the synchronization of all the remaining nodes to ensure the
consistency of the data during the change. Supervising the remodelling synchronization
is one of the primary jobs of the NCM.
In the case of node deletion, the guardian of a failed node determines whether a
deletion remodelling is needed, depending on how long the failed node has been Off-line
and the NCM Off-line timeout parameter. Only one node can be synchronized for
deletion at any time. The following describes the protocol for deletion synchronization
on the guardian node:
1. During the initialization of the synchronization, the guardian asks the local
Application Server Manager (ASM) whether there is enough capacity to remodel the
local portion of the mirrored data from the failed node. If not, no node deletion
will be attempted.
2. If the local ASM reports there is enough room, the guardian changes the control
flag of the failed node from Off-line to Delete Request.
3. When the contact lists received from the remaining N - 2 server nodes have Delete
Request associated with the failed node, the guardian then changes the Delete
Request control flag to Delete Proceed and sends a remodelling trigger to the local
NSM.
4. After the local ASM has successfully completed the remodelling operation (but has
not yet committed the change), the guardian changes the control flag for the failed
node to Delete Ready.
5. Finally, after detecting that all other nodes have also raised the Delete Ready flag
for the failed node, the guardian removes the failed node from its contact list. A
remodel completion trigger is sent to the local NSM causing the ASM to commit
the remodel operation.
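The five guardian-side steps above can be sketched as a small flag-driven state machine (a deliberately simplified illustration: the real NCM exchanges broadcast contact lists over the network, which is reduced here to lists of peer flags passed in by the caller):

```python
from enum import Enum

class F(Enum):
    OFF_LINE = 0
    DELETE_REQUEST = 1
    DELETE_PROCEED = 2
    DELETE_READY = 3

class DeletionGuardian:
    """Guardian-side deletion synchronization, driven by the flags the
    remaining N - 2 peers echo back on their contact lists."""

    def __init__(self, asm_has_capacity):
        self.flag = F.OFF_LINE
        self.asm_has_capacity = asm_has_capacity
        self.removed = False

    def start(self):
        # Steps 1-2: only proceed if the local ASM can absorb the data.
        if self.asm_has_capacity:
            self.flag = F.DELETE_REQUEST

    def on_peer_flags(self, peer_flags):
        # Step 3: all peers echoed Delete Request -> trigger remodelling.
        if self.flag is F.DELETE_REQUEST and all(
                f is F.DELETE_REQUEST for f in peer_flags):
            self.flag = F.DELETE_PROCEED
        # Step 5: all peers Delete Ready -> remove node, commit the remodel.
        elif self.flag is F.DELETE_READY and all(
                f is F.DELETE_READY for f in peer_flags):
            self.removed = True

    def on_local_remodel_done(self):
        # Step 4: local remodelling finished but not yet committed.
        if self.flag is F.DELETE_PROCEED:
            self.flag = F.DELETE_READY
```

Note that each flag change only takes effect after every remaining peer has echoed the previous flag, which is what makes the remodelling decision system-wide rather than local.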
After the Delete Request is issued by the guardian, it can withdraw the synchronization
by reverting to either the On-line or the Off-line status for the failed node. A non-guardian
node must honour the Delete Ready flag. The corresponding protocol for node
deletion synchronization on the NCM of the remaining nodes is as follows:
1. If a new Delete Request control flag is detected on a received contact list, the NCM
must determine whether it is coming from a valid guardian. If the request is valid,
the NCM queries the local ASM as to whether there is enough capacity to remodel
the local portion of the mirrored data from the failed node. If the remodelling can
proceed, it changes the control flag associated with the failed node to Delete Request
on the local contact list. Once the synchronization is started, Delete Request and
Add Request control flags set by other guardians are ignored.
2. Upon receiving the Delete Proceed command from the guardian of the failed node,
the NCM sends a deletion remodelling trigger to the local NSM. It also changes
the failed node control flag on the contact list to Delete Proceed, to indicate that
remodelling is in progress.
3. After the local ASM reports that the remodelling operation is ready to be commit-
ted, the NCM changes the control flag of the failed node to Delete Ready. It must
honour this flag even after a system crash.
4. At the end, when the guardian commits the remodelling operation by removing
the failed node from its broadcast list, the local node can then commit the changes
pending on the local ASM. After the changes are committed, the local NSM then
deletes the failed node entry from its contact list.
No harm is done if the deletion synchronization is reverted by the guardian before the
remodelling operation is committed. An abort command will be sent to the local ASM
if the node deletion synchronization has gone beyond the Delete Proceed step. Again,
only one node can be remodelled in or out at a time. An NCM will not respond to other
requests until the first synchronization is successfully completed or reverted. Once the
synchronization protocol has started, contact lists received from the failed node can be
ignored by nodes other than the guardian.
The addition remodelling synchronization is similar to the deletion case, with the Add
control flags replacing the Delete control flags for the new node involved. The guardian
node is the last node on the contact list before the new node appeared. The new node is
slotted at the end of the contact list and assigned the next available join sequence number.
As in the case of the deletion, the ASM has to be made ready for remodelling (e.g.,
by creating an instance of the application server for the mirroring of the new node) before
the local NCM can switch the control flag associated with the new node to Add Request.
After all nodes have reported Add Ready, the guardian signals that the remodelling
is committed by changing the control flag associated with the new node to On-line. In
the case where two or more nodes attempt to join the system, the guardian decides which
will be allowed to join.
A relocation remodelling synchronization is a combination of both deletion and addition.
In the event that there is not enough room in the surviving nodes for the system to
recover to the single mirroring level, deletion remodelling is not feasible. A new node
may be added to replace the failed node. During relocation remodelling, data from the
failed node will be recreated on the new node, while the primary and mirrored data in
the other nodes remain unchanged.
At the beginning of the node deletion remodelling sequence, each NSM checks with
the local ASM to determine whether there is enough room for the remodelling. In the
case that there is not enough room, the Delete Request flag is not raised. Thus, the
guardian of the failed node cannot proceed with the deletion remodelling. Since simple
node addition cannot occur if there is a failed node, the guardian of the new node
does not initiate, or aborts, the node addition remodelling sequence. Instead, the deletion
guardian has priority over the addition guardian. It becomes the valid relocation guardian
responsible for both the failed node and the new node. To start the relocation remodelling
synchronization, the relocation guardian issues the Delete Request and the Add Request on
the corresponding failed node slot and the new node slot on the contact list respectively. A
non-guardian responds to the relocation request by mirroring both flags at the same time.
The same dual-flag convention is used to indicate the relocation proceed/ready/commit
commands.
3.3.4 Crash Recovery
The contact lists are tracked in a recovery log on every node. After an unexpected node
shutdown, the NCM will be able to continue the remodelling process by repeating the last
step in the synchronization protocols. Non-guardian nodes must honour their responses
to the guardian.
If a failed node reconnects to the RDSS after it has been remodelled out by the system,
it recognizes that it is no longer on any broadcasted contact lists. The failed node then
re-initializes as a new node and attempts to join the system.
3.4 Node State Machine (NSM) Module
At the centre of each RDSS server node is the Node State Machine (NSM). It is respon-
sible for maintaining a consistent state for the local node, and responding to transition
stimuli from both the Node Configuration Monitor (NCM) and the Application Server
Manager (ASM).
To keep the design simple, the NSM only maintains the non-transient states for the local
server node. Transitional behaviour between states (e.g., entering deletion remodelling)
is handled by other software modules (e.g., NCM). For each RDSS server node, there are
five basic states:
- Initialization
- Steady
- Modifying
- Degraded
- Failed
Within the Degraded state, there are four sub-states:
- Backup
- Remodel-Deletion
- Remodel-Addition
- Remodel-Relocation
The transition state machine is depicted in figure 3.2, with the sub-state diagram in
figure 3.3.
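Read alongside figures 3.2 and 3.3, the state machine can be sketched as a transition table (the allowed transitions below are inferred from the prose of sections 3.4.1 to 3.4.5, so they should be treated as an approximation of the actual diagrams rather than a definitive encoding):

```python
STATES = {"Initialization", "Steady", "Modifying", "Degraded", "Failed"}
SUB_STATES = {"Backup", "Remodel-Deletion", "Remodel-Addition",
              "Remodel-Relocation"}

# Allowed transitions between the non-transient states (inferred from text).
TRANSITIONS = {
    ("Initialization", "Steady"), ("Initialization", "Degraded"),
    ("Steady", "Modifying"), ("Modifying", "Steady"),
    ("Steady", "Degraded"), ("Modifying", "Degraded"),
    ("Degraded", "Steady"), ("Degraded", "Failed"),
    ("Failed", "Degraded"),
}

def transition(current, nxt):
    """Atomic transition check: the only way out of a state is another
    valid transition."""
    if (current, nxt) not in TRANSITIONS:
        raise ValueError("illegal transition %s -> %s" % (current, nxt))
    return nxt
```

For example, a node in the Steady state may enter Modifying on an update request, but a Failed node cannot jump straight back to Steady; it must first recover through Degraded.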
As mentioned in the assumptions (section 3.1.2), all error transitions in the NSM deal with
node level failures. The NSM is not responsible for classifying the severity of a failure, or
for initiating any application-specific recovery strategy.
Also, the NSM does not guarantee whether a transition is asynchronous (self-determined by
the local node) or RDSS synchronous (agreed upon by all server nodes). It is up to the
transition trigger's provider, the NCM or the ASM, to achieve network synchronization
for any particular transition. The NSM does not interface with other server nodes in the
system and thus does not deal with any synchronization issues.
All transitions in the state machine are atomic. The only way to exit a state is via
another transition. Following are brief summaries of the server node behaviour within
each state.
3.4.1 NSM Initialization State
Every NSM enters the Initialization state on startup of the server node software. For a
successful node startup, both NCM and ASM have to be started and initialized properly.
Only then can the NSM transition out of the initialization state and return to the last
known system state, Steady or Degraded, before the last system shutdown. If the last
known system state was the Modifying state, the outstanding update operations will be
aborted by the ASM containing the primary copy of the data affected.
Figure 3.2: State Transitions Diagram of an RDSS Server Node
3.4.2 NSM Steady State
The Steady state is entered when all server nodes are functioning properly, and there is
no outstanding update request for the node. The only activities in the RDSS are query
requests from the users.
Node level failure will cause the NSM to transition into the Degraded state. The node
will return to the Steady state when either the failure has reverted itself, or the RDSS has
remodelled to eliminate the failed node. The NSM enters the Modifying state to handle
an update request.
3.4.3 NSM Modifying State
In order to add or remove data via the external update interface of the RDSS, the NSM
must be in the Modifying state. Both query and update requests will be accepted during
this state. It is the responsibility of the UMD and the ASM modules to ensure the
integrity of update transactions.
When an update session is completed, the system will return to the Steady state. As in
the Steady state, a node level error in the Modifying state causes the NSM to transition to
the Degraded state. The current update request will either be abandoned or be restarted
when the system returns to the Steady state.
3.4.4 NSM Degraded State
If any server node ceases to function, the RDSS-enabled application is considered to be
in the Degraded state. The NSM remains in the Degraded state for as long as the system
has not fully recovered to the predefined level of fault tolerance (currently, a single level
of mirroring).
Each of the four different sub-states within the Degraded state supports a different
recovery operation. Their interaction is shown in figure 3.3, and briefly described below.
NSM Backup Sub-State
In this sub-state, the backup Application Servers are required to provide the missing
data set that corresponds to the node that has failed. This transition is asynchronous
and self-determining.
After a given waiting period in the Backup sub-state, the server nodes in the RDSS may
agree (via NCM synchronization) to remodel the node list by deleting the node that has
gone down. The NSM will then enter the Remodel-Deletion sub-state. To simplify the
design, remodelling may delete only one node at a time.
NSM Remodel-Deletion Sub-State
Only one node can be deleted on each deletion remodelling, provided there is enough
storage capacity reported by the ASM on each node, as indicated by the Delete Request
response on the NCM node contact list. Remodelling is not started if there is not enough
storage to restore the system to the single mirroring level required. NCM synchronization
must be used to ensure the synchronization of the system before any NSM can enter this
sub-state.
During remodelling deletion, backup data corresponding to the failed node becomes
primary data in the current node. New backup copies will be made at other locations
according to the mirroring strategy. These operations are handled by the ASM's in the
system. On receiving the remodel commit command, the failed node is removed from
the NCM node contact list. Upon successful recovery, the NSM will return to the Steady
state.
Figure 3.3: Sub-state Transitions with the Degraded NSM State
NSM Remodel-Addition Sub-State
As in the case of node deletion, the NCM will be responsible for synchronizing all server
nodes to accept a new node. Once the RDSS environment has agreed to the addition,
all nodes are notified to enter the corresponding sub-state.
Unlike the deletion situation, the Remodel-Addition sub-state can only be entered from
the Steady state. While in this sub-state, the RDSS will perform load balancing. When
the load balancing policy is satisfied, the NSM of the new node will initiate the completion
handshake via the NCM, causing the system to return to the Steady state.
NSM Remodel-Relocation Sub-State
If there is not enough capacity to handle a node deletion remodelling, a new node is
needed to increase the total capacity of the system. However, since a node has failed, a
node addition operation cannot be completed. Thus, a relocation remodelling operation
is necessary. It allows the recreation of the failed node on a new node.
The NCM synchronization for relocation is described in section 3.3 as a combination of
the deletion and addition protocols. In the NSM, the relocation transition is initiated
from the Backup sub-state. During the Remodel-Relocation sub-state, the missing node will be
reconstructed from the corresponding copies from other nodes. Upon successful recovery,
the NSM will return to the Steady state.
3.4.5 NSM Failed State
The Failed state is entered if and only if the number of node-level failures is greater
than the node-level fault tolerance of the system. If the system policy allows the RDSS
application to remain on-line despite the fact that the complete data set is not available,
each node will stay in this state. Otherwise, a shutdown command will be sent to the
NCM of all server nodes.
In this state, the RDSS system will continue to run. The only way out of this state
is for the failed nodes to come back on-line or for the system to terminate. In the future,
the architecture may be modified to allow the system to return to the Steady state with
the remaining data.
3.5 Application Server Manager (ASM) Module
Each node is responsible for mirroring a portion of the data from all the other server
nodes. There are N - 1 mirror segments on each node in an N-node RDSS. Each mirror
segment is serviced by a copy of the application server. The ASM is responsible for
managing these application servers.
At server startup, the ASM connects to the update ports of all the local Application
Servers. All update requests are filtered and incoming data is mirrored. Links are
established from each ASM to the update ports of the other ASM's in the system. The data
redirection required by the mirroring strategy during updates and remodelling is done
through these links.
Information is shared between ASM and NSM, such that the NSM can issue remodelling
commands to the ASM and the ASM can inform the NSM if there is an update request
from the marshaller-dispatcher. The ASM also interfaces with the Hardware Abstraction
Layer (HAL) software, through which resizing of the data segment can be performed if
needed.
In addition, the ASM is responsible for providing the location of its update port, along
with a list of visible application server query ports to the active marshaller-dispatcher.
Changes in these port locations are sent to the Marshaller-Dispatcher Port Manager
(MDPM) software module, via a control communication channel between the MDPM
and the ASM.
3.6 Hardware Abstraction Layer (HAL) Module
The main purpose for the Hardware Abstraction Layer (HAL) is to permit the RDSS to
control the storage usage of the application servers. It provides a simple storage interface
library for the application server regardless of the underlying operating system.
By interfacing with the ASM, the HAL allows dynamic on-line storage management.
Storage segments can be resized and re-mapped without having to restart any software
on the server node. Thus, the application server does not need a complicated storage
management component. The interface to the HAL library is described in chapter 4.
3.7 Marshaller-Dispatcher Port Manager (MDPM)
Module
The Marshaller-Dispatcher Port Manager (MDPM) is the coordinator in the RDSS
marshaller-dispatcher software. It maintains and exports two lists: an RDSS update
node list and an RDSS query connection list, which are used by the Update Marshaller-
Dispatcher (UMD) and the Query Marshaller-Dispatcher (QMD) respectively. It only
observes and does not influence state transition of any server node. The MDPM estab-
lishes and maintains a control connection with the ASM in each server node. The ASM
is responsible for providing up-to-date update and query port locations to the MDPM.
The RDSS update node list is used by the Update Marshaller-Dispatcher (UMD) software
module. Unlike the contact list broadcasted periodically by each NCM, there are no
control flags or join sequence numbers associated with an RDSS update node list entry.
The RDSS query connection list contains all the query ports visible from the Query
Marshaller-Dispatcher (QMD). It provides snapshots of the states of the current query
ports on all the visible Application Servers.
3.8 Query Marshaller-Dispatcher (QMD) Module
As the name implies, the QMD (Query Marshaller-Dispatcher) is responsible for
distributing (dispatching) the users' query requests to the appropriate RDSS nodes and
collecting (marshalling) the query results. Each query client is serviced by an instantiation
of the QMD, each of which contains an instantiation of the AQFE (Application
Query Front-End) described in the previous chapter.
The RDSS-specific portion of the QMD module is its library, which provides a restricted
view of the RDSS environment. The interaction between the library and the MDPM
is hidden from the AQFE. The MDPM communicates changes in the query connection
list to the AQFE. However, since each QMD instance communicates with each visible
application server directly, it is possible to experience local TCP connection timeouts
or errors. Thus, the universe view in each QMD instance is a combination of the query
connection list provided by the MDPM and its local history.
The process for marshalling data is application dependent. This approach is deliberately
chosen because of the broad target application domain. New instances of the QMD are
created by the MDPM as new query clients appear. On startup, the QMD establishes
its own communication with all the visible AS query ports available in the MDPM query
connection list.
3.9 Update Marshaller-Dispatcher (UMD) Module
The counterpart to the QMD above is the UMD (Update Marshaller-Dispatcher). It is
responsible for update requests (as opposed to query requests).
Due to the restrictions on the application update protocol specified in the last chapter
and the trust assumption of the update client, it is possible for the RDSS to include
a generic AUFE (Application Update Front-End). The generic AUFE distributes data
solely based on balancing the capacity usage on each node.
The UMD (excluding the AUFE) is stateless. The integrity of update operations is
handled by the two-phase transaction model. Changes are not committed until the
commit command is sent.
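The two-phase pattern can be sketched as below (a hedged illustration, assuming a prepare/commit/abort interface on each participant; the names and the stub participant class are hypothetical, not the prototype's actual API):

```python
def two_phase_update(participants, update):
    """Stateless coordinator sketch: the change is committed nowhere unless
    every participant acknowledges the prepare phase."""
    prepared = []
    for p in participants:
        if not p.prepare(update):
            # Any refusal aborts the transaction on all prepared participants.
            for q in prepared:
                q.abort(update)
            return False
        prepared.append(p)
    for p in prepared:
        p.commit(update)
    return True

class StubParticipant:
    """A stand-in for an ASM update port, for illustration only."""
    def __init__(self, ok=True):
        self.ok, self.committed, self.aborted = ok, False, False
    def prepare(self, update):
        return self.ok
    def commit(self, update):
        self.committed = True
    def abort(self, update):
        self.aborted = True
```

The key property shown here is atomicity at the node level: either every participant commits the update, or none of them does.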
In addition to the update client, the UMD interfaces with all ASM's and the MDPM,
via the UMD library. Instead of connecting to the update ports of the application server
instances on each node, the UMD communicates with the update port of each ASM,
which is responsible for data replication and mirroring.
The MDPM provides the UMD with the list of update-enabled nodes, along with the
number of nodes in the RDSS environment. Before any update request can be processed,
all NSM's must enter their corresponding Modifying states. Only in that state can the
local ASM accept an update session connection.
3.10 Sample Operations
To summarize the overview given in this chapter, several typical RDSS operations are
presented and the interactions among components are explained. Important features of
the system will be emphasized. Four sample operations are described in the following sub-
sections: the startup sequence of the system, the successful query and update operations,
the simple recovery sequence on a node failure, and the on-line resizing of storage.
3.10.1 Successful System Startup
There is no specific startup sequence among server nodes. The only prerequisites are
the number of application nodes in the system and the consistency of communication
parameters for all modules. Also, the existing RDSS address allocation to the nodes
must be completed and cannot contain any overlap. That is, the data in the system is
coherent.
A predefined startup window (or delay) is used by all application nodes. If a broadcasted
contact list is received from a valid RDSS server node, the NCM will change the
corresponding control flag from Startup to On-line in its local node contact list. Otherwise, if
no contact is established and the startup window has passed, the remaining nodes with
Startup flags are deemed Off-line.
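The startup-window rule reduces to a simple flag promotion, sketched here (illustrative only; the timer is abstracted into a boolean, and the entry layout is an assumption):

```python
def apply_startup_window(contact_list, heard_from, window_elapsed):
    """NCM startup sketch: peers whose contact list has been received go
    On-line; once the startup window has passed, peers still marked
    Startup are deemed Off-line."""
    for entry in contact_list:
        if entry["flag"] == "Startup" and entry["node"] in heard_from:
            entry["flag"] = "On-line"
        elif entry["flag"] == "Startup" and window_elapsed:
            entry["flag"] = "Off-line"
    return contact_list
```

After this pass, no entry is left in the Startup state, so the node can safely decide whether to enter the Steady or the Degraded state.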
After the NCM has established contact with the other nodes, the local ASM establishes a
replication update connection with the ASM on each other server node. On completion, the
NSM goes from the Initialization state to the Steady state.
Only after a server node has exited the Initialization state are the query and update
port locations made available to the MDPM. On startup, the MDPM establishes a
control connection to every ASM. The MDPM startup parameters include the locations
of the nodes. After the update node list and a query connection list (visible query port
list) are built, regular query and update sessions can commence.
3.10.2 Successful Queries and Updates
Once the RDSS application has successfully started on the network, including at least
one marshaller-dispatcher, query and update requests can proceed. When the MDPM
detects a request for a new query session, a new QMD instance is created to service that
session. Similarly, a new UMD instance is enabled for an update session, except that the
system is limited to one update session at a time.
The QMD is successfully started when it has established communication with the query
ports listed as available on the query connection list. Query requests sent by the user are
directed to some or all the AS instances, as determined by the AQFE logic. By collecting
all the return data from the AS instances, a merged result is returned to the query client.
Update operation of the UMD is similar, except that the modifications are handled
through a two-phased commit transaction. In the generic AUFE given, new data is
sent to the least full (in terms of capacity) node. The ASM of that particular node is
responsible for ensuring the data is replicated to other nodes. The UMD will not commit
the operation until the ASM returns a successful reply.
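The update routing above can be illustrated with a small sketch of the generic AUFE rule: send each new entry to the least full node and commit only on a successful ASM reply. The function names and the capacity-map representation are assumptions made for illustration:

```python
def least_full_node(capacity_used):
    """Pick the node with the lowest fraction of capacity in use."""
    return min(capacity_used, key=capacity_used.get)

def send_update(entry, capacity_used, send_to_asm):
    """Route the entry to the least full node; commit only on success.
    The target node's ASM is responsible for replication/mirroring."""
    target = least_full_node(capacity_used)
    ok = send_to_asm(target, entry)
    return ("commit" if ok else "abort", target)
```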
3.10.3 Successful Recovery
Consider the simple node failure case. If one node has temporarily failed and the local
NCM has lost contact with the failed node, the ASM will make the query port of the
secondary AS associated with the failed node visible to the marshaller-dispatcher by
sending the port location to the MDPM. If the failed node returns, the system returns to
the Steady state, and the reversing instruction will be sent from the ASM to the MDPM
through the control connection, removing the temporary secondary query port.
If the failure described above is permanent, the NCM of the guardian for the failed node
will timeout. Provided that the two-phase commit transaction described in section 3.3
succeeds, the ASM of all nodes will start remodelling to delete the failed node. Depending
on the actual storage allocation on each node, some partition resizing may need to be
done before the remodelling can be started.
Since data on each server node is mirrored to N - 1 nodes on an N node system, every
single server node has a secondary AS that deals with a subset of data from the failed
node. The ASM will try to export these to the primary AS on the same node. As new
primary data, these new data entries are replicated and mirrored to other remaining
nodes within the system. Only when the level of mirroring is restored and all data can
be found in a primary AS is the system considered to be recovered.
3.10.4 Changing Storage Size
To change the amount of storage available to a particular Application Server (AS) in-
stance, the HAL must be used to modify the size of the corresponding virtual disk
partition. To increase the usable partition size, the following sequence of operations is
needed:
1. The ASM increases the size of the specified virtual disk partition via the HAL
on-line maintenance library.
2. The EXPAND storage update command is sent to the AS instance that uses the
specified virtual disk partition, to inform it of the extra storage space available.
The operation sequence for reducing storage space is slightly more complicated. Storage
truncation should only be attempted if there is enough unused storage available. The operation
sequence for storage truncation is as follows:
1. Before attempting any size reduction in a virtual disk partition, the ASM uses the
STORAGE available update command on the affected AS instance to find out how
much free space is available for truncation.
2. If there is enough room, the TRUNCATE storage update command is sent to the
affected AS, asking it to reduce its storage usage.
3. When that is complete, the ASM then can use the HAL on-line maintenance library
to free up the unused storage in the specified virtual disk.
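The two resize sequences above can be sketched as pure functions over a small record of sizes. The `state` dictionary is a stand-in for the HAL virtual disk partition and the AS storage commands, which are not specified here:

```python
def expand_partition(state, extra):
    """EXPAND sequence: grow the virtual disk first, then inform the AS."""
    state["partition_size"] += extra    # 1. HAL grows the virtual disk
    state["as_visible"] += extra        # 2. EXPAND tells the AS instance
    return state

def truncate_partition(state, amount):
    """TRUNCATE sequence: check free space, shrink the AS, free the disk."""
    free = state["as_visible"] - state["used"]    # 1. STORAGE query
    if free < amount:
        return None                               # not enough unused storage
    state["as_visible"] -= amount                 # 2. TRUNCATE on the AS
    state["partition_size"] -= amount             # 3. HAL frees the storage
    return state
```

Note the opposite ordering: expansion grows the disk before the AS learns of it, while truncation shrinks the AS before the disk is reduced, so the AS never addresses storage that does not exist.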
Chapter 4
RDSS Detailed Design
In the last chapter, we described the overall architecture of the RDSS system. To achieve
a working prototype, more detailed design and implementation choices are required. In
this chapter, the detailed behaviour of each software component in the RDSS environment
will be discussed, along with the implementation choices made for the prototype.
To recapitulate the previous chapter, the RDSS server node software consists of four
major RDSS modules, along with instances of the application server (AS). The mod-
ules are the Node State Machine (NSM), the Node Configuration Monitor (NCM), the
Application Server Manager (ASM), and the Hardware Abstraction Layer (HAL).
The RDSS Marshaller-Dispatcher software has three main components: the Marshaller-
Dispatcher Port Manager (MDPM), the Query Marshaller-Dispatcher (QMD) and the
Update Marshaller-Dispatcher (UMD). The MDPM controls the startup and synchro-
nization of the QMD and the UMD. The QMD is responsible for multiplexing and de-
multiplexing the query requests and responses, while the UMD handles update traffic.
The next section lists the target platform and the underlying operating system assump-
tions for the RDSS prototype. The rest of the chapter describes the detailed design of
each RDSS component.
4.1 Target Platform Assumptions
The current RDSS design limits its implementation to a local area network (LAN). For
convenience, the prototype assumes an Ethernet-based LAN with average round trip
time on the order of 100 ms between any two nodes. More importantly, the underlying
network must support broadcasting of UDP packets.
At the operating system level, the TCP/IP protocol suite is assumed to be available
on all nodes. Appropriate security features, such as access control, should be available.
Also, multitasking support is required.
For prototyping, a network of Ethernet connected UNIX workstations satisfies the above
requirements. The prototype is designed with this choice in mind. However, it should be
easily portable to other networks and operating systems that satisfy the above constraints.
To simplify load balancing and mirroring, the RDSS prototype assumes a homogeneous
environment, particularly that all nodes have the same storage available.
4.2 Node Configuration Monitor (NCM) Detailed
Design
As described in the previous chapter, the Node Configuration Monitor (NCM) module
on each server node is responsible for monitoring changes in the rest of the RDSS server
nodes, excluding the local node. Based on the observed changes, it triggers the appro-
priate transitions in the local Node State Machine (NSM). It is also the guardian for the
next node on the node contact list.
In addition to the UDP transport layer, the NCM requires the support of the local real
time clock. Internally, a down counter is associated with each entry in the contact list.
On a clock tick (e.g., 2 seconds in the prototype), each counter is decreased by 1 and the
current local node contact list is sent to the network broadcast address. In the prototype,
the clock ticks are triggered by the timeout mechanism in the select operation.
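The tick mechanism can be sketched as follows; in the sketch the tick is a plain function call rather than a select timeout, and the counter map is an illustrative stand-in for the contact list entries:

```python
def tick(counters, broadcast):
    """One NCM clock tick: decrement every down counter, broadcast the
    local contact list, and report which counters just hit zero."""
    expired = []
    for node in counters:
        counters[node] -= 1
        if counters[node] == 0:
            expired.append(node)    # node will be flagged off-line
    broadcast()                     # send the local contact list
    return expired
```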
4.2.1 NCM Startup
On each server node, there is an NCM startup file that contains the locations of all nodes
in the RDSS, the reserved network broadcast address, and the port to be used by the NCM.
In the prototype, the NCM uses the UDP transport protocol to broadcast its contact
list. For convenience, the location of a node is denoted by its numeric IP address and the
UDP broadcasting port number. Therefore, the node contact list is a list of IP addresses
with UDP ports, sorted by the join sequence number.
Initially, each remote node on the local contact list has the startup control flag, with the
timeout down counter set to the startup value. The local NCM broadcasts the list at a
regular interval (e.g., every second). In addition, the NCM also checks with the ASM
to see if there was a remodelling operation in progress before the last re-start. If the
local node was the controlling guardian node, then it is responsible for restoring the last
known contact list flag for the node being remodelled.
Each NCM is also listening for the broadcasts from other server nodes. For every packet
that the NCM receives, it performs the following sequence of operations:
1. It identifies the source location of the contact list packet.
2. It changes the local contact list if the control flag associated with the source location
on the contact list received is on-line. The timeout down counter for that node is
set to the on-line timeout value.
3. The local node must assume it has been deleted from the system if the contact list
packets from other nodes do not contain the location of the local node. It should
then perform a reset, clear all local data and attempt to rejoin the RDSS as a new
node.
4. It checks the whole contact list for remodelling flag settings associated with other
nodes. The local node should honour the last remodelling flag in the local contact
list before the system restart, including any Ready flags.
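Steps 1 to 3 of the sequence above can be sketched as follows. The packet layout and flag strings are assumptions made for illustration:

```python
def handle_packet(local_node, local_list, packet, on_line_timeout):
    """Steps 1-3 above: note the source, refresh its entry if it reports
    itself on-line, and detect our own deletion from the system."""
    src, contacts = packet["source"], packet["contacts"]    # step 1
    if contacts.get(src) == "on-line":                      # step 2
        local_list[src] = {"flag": "on-line", "counter": on_line_timeout}
    if local_node not in contacts:                          # step 3
        return "reset"    # clear local data and rejoin as a new node
    return "ok"
```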
If any timeout down counter reaches zero, the corresponding node is flagged off-line.
When no more startup flags remain in the local contact list, the NCM startup is considered
complete. The node deletion and addition synchronization mechanism is then enabled.
A trigger is sent to the NSM, causing it to transition to the Steady state, if no node has
an off-line flag. Otherwise, for each off-line flag, a node failure trigger is sent to the
NSM. If remodelling was in progress, the appropriate remodelling trigger is sent as well
as the node failure trigger.
4.2.2 Detecting Node Failure
To maintain an active on-line control flag in the local node contact list, the contact
list broadcast packet from that particular node must be received periodically. On each
verified reception, the timeout down counter associated with that node is reset to the
NCM on-line timeout value.
If a remote node reports a failure on a third node (i.e., an off-line flag is present in a
received contact list), no action is needed on the local node. The transition to the off-line
flag is not synchronized.
On each clock tick, the down counters are decremented. A remote node is deemed off-line
when its counter hits zero. If this happens, a node failure trigger is sent to the local NSM. If the
local node is the guardian for the failed node, the down counter will then be set to the
deletion value. The role of a guardian was described in the last chapter. In the RDSS
prototype, each node is the guardian for the next node on the sorted contact list, with
the last node on the list responsible for the first node on the list.
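The guardian assignment is a simple rotation over the sorted contact list, which can be expressed as:

```python
def guardian_target(index, n_nodes):
    """Return the index of the node guarded by the node at `index`:
    each node guards its successor on the sorted contact list, and
    the last node on the list guards the first."""
    return (index + 1) % n_nodes
```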
If a node recovers after it has been considered off-line, a recover trigger is sent to the
local NSM. Its flag on the local contact list is changed back to on-line with the timeout
down counter set to the NCM on-line timeout value. Neither the node failure trigger
nor the recovery trigger is synchronized among server nodes. Each NSM is responsible for
determining these transitions for its local node.
4.2.3 Initiating Remodelling Synchronization
The synchronization handshake protocol between a guardian and other nodes was pre-
sented in chapter 3. Here, the details on the conditions for initiating the deletion, addi-
tion, and relocation remodelling are described.
First, when a node is deemed off-line by its guardian node, the NCM deletion timeout
value is placed on the timeout counter associated with the failed node. No action is
required of the NSM module on the non-guardian nodes. When the counter hits zero,
the remodelling protocol is initiated.
For node addition, the new node first broadcasts its own contact list adding itself to the
end of the contact node list with a startup flag. Its guardian node (i.e., the previous
node on the contact list) then initiates the addition remodelling synchronization process,
provided that there is no visible off-line node or node deletion in progress. There is a
short timeout (NCM addition timeout) before the addition process begins, to ensure the
stability of the connection to the new node.
Lastly, if there is a node deemed off-line, a node addition process cannot be completed.
Instead of doing node addition, the image of the failed node is recreated on the new node.
The guardian node for this relocation process is the same as the guardian node for the
failed node. After a deletion timeout for the failed node, the relocation guardian may
choose to start the relocation synchronization instead of the deletion synchronization.
The synchronization steps were described in the last chapter.
During the synchronization, the guardian uses a remodelling timeout counter. It has the
option of abandoning the synchronization before committing a remodelling operation.
To do so, the guardian reverts the control flag of the target node to a previous setting
(i.e., off-line or on-line).
In addition, the NSM should use the remodelling interface on the Application Server
Manager (ASM) of the local node to see whether there are enough remaining storage
resources on the local node before initiating a remodelling. Other nodes would do the
same after receiving a remodelling request. Only on a successful return from the local
ASM will a non-guardian NSM respond to the remodelling request.
4.2.4 Completing Remodelling
During remodelling, contact lists broadcasted by the node being remodelled are discarded.
On the contact list of each NCM, the node being remodelled continues to have the
deletion proceed and/or addition proceed flags associated with it. The current RDSS only
allows remodelling of one node at a time. Thus, during a remodelling, the mechanism
for initiating other remodelling operations is disabled. However, it is possible for other
nodes to fail during the process. In that case, the failure trigger is sent to the NSM. As
the RDSS only handles one node failure, another node failure before completion of node
deletion is fatal. The guardian is responsible for abandoning the doomed remodelling
effort in progress.
When the remodelling operation is ready to be committed, the ASM signals the local
NCM. The NCM then exercises the remainder of the remodelling synchronization proto-
col. Once the guardian has committed the changes in its contact node list, each NCM
instructs the local ASM to commit the changes.
4.2.5 Node Reset
If an NCM contact list is received from the node that has already been deleted, it will be
ignored by the nodes in the current RDSS environment. The deleted node would notice
that it is not on any of the NCM contact lists that it received. The deleted node should
reset itself by abandoning all its data and then attempting to rejoin the system as a new
node.
If a node is being deleted, it may find itself still on the list, but with remodelling flags. No
action is required in those cases. It is the guardian's choice to abandon the remodelling,
provided that the deletion or relocation has not been committed. Otherwise, if the
deletion has been committed by the guardian but not all the nodes have committed,
the deleted node must wait for the completion of all nodes before performing a reset
operation.
4.3 Node State Machine (NSM) Detailed Design
The Node State Machine software module resides on each server node. As it maintains
the logical state of each node, it is the first component to be started on the RDSS server
node software and is executed by the main thread of the node software.
In the RDSS prototype, the triggers for NSM state transitions come from both the
NSM and the ASM modules that reside on the same server node. The inter-thread
communication is handled by a single message queue.
In chapter 3, the behaviour of each of the node states is briefly described. Here, the focus
is on the transition behaviour of the state diagram. Both the prerequisites and the actual
stimuli for each transition are given in the following sub-sections. In addition, the sub-
sections list the design choices for how these conditions are established and communicated
within the RDSS to the node state machine.
4.3.1 NSM Startup
A startup file provides the necessary information. Since NSM is not involved in query
or update transactions directly, it is not directly responsible for the recovery of those
transactions. The recovery is performed by the ASM and the NCM modules that reside on
the same server node. Incomplete data modifications before the last system termination
are abandoned. The NSM waits for both the ASM and the NCM to successfully start up
before accepting any stimuli that would cause a state transition.
4.3.2 From Initialization State to Steady State
During the Initialization state, the NCM of each node attempts to establish communi-
cation with every other node. After the NCM has established that all nodes are up and
running, a trigger is sent to the NSM, causing it to transition to the Steady state.
4.3.3 From Initialization State to Degraded State
If some of the nodes listed in the NCM parameter file are not reachable, or cannot
be started properly, then each NCM will generate a transition trigger that causes the
corresponding NSM to enter the Degraded state.
Each node failure message from the NCM indicates that one node is not available. The
first node failure message causes the NSM to enter the Backup sub-state. Subsequent
messages may cause the NSM to transition to the Failed state.
One exception to the above is a node failure during the system start. If the remodelling
process was in progress before the last termination, the NCM will attempt to continue
the last remodelling effort. If this action is successful, instead of generating node fail-
ure stimulus, the NCM sends the corresponding remodelling message to the NSM. The
NSM transitions back to the same remodelling sub-states that it was in before the last
termination. See section 4.2.1 on NCM startup for more details.
4.3.4 From Steady State to Modifying State
When an update request is received by the ASM, a message is sent to the local NSM caus-
ing it to transition to the Modifying state. Once the node has entered the Modifying state,
the ASM can start processing update requests from the Update Marshaller-Dispatcher
(UMD).
4.3.5 From Steady State or Modifying State to Degraded State
If a node level failure is detected by the NCM module, a node failure message will be
sent to the NSM causing a transition to the Degraded state. If the NSM is currently
in its Steady or Modifying state, it will enter the Backup sub-state within the Degraded
state. This transition is asynchronous and self-determined. The NCM can initiate such
a transition as soon as it can no longer detect a node.
The difference between a transition from the Modifying state and a transition from the
Steady state is that, in the former, the current outstanding update request needs to be
aborted before the transition. Note that no new update requests are accepted once a
node exits the Modifying state and enters the Degraded state. The ASM is notified of the
transition.
The ASM is responsible for refusing further update requests. It is also responsible for
making the necessary backup servers available to the Query Marshaller-Dispatchers (via
MDPM) to provide the missing data sets that correspond to the node that has failed.
4.3.6 From Modifying State to Steady State
While in the Modifying state, the NSM does not enforce the transaction atomicity directly.
Enforcing atomicity is the responsibility of the UMD and the ASM modules. When the
update session is completed, the ASM informs the NCM, which in turn causes the NSM
to return to the Steady state.
4.3.7 From Degraded State to Steady State
The system can recover from a node failure either by a remodelling of the RDSS environ-
ment, deleting the node in question, or by re-establishing contact with that node. When
the NCM is satisfied that all nodes in the RDSS environment are ready and available,
the NSM is told to transition back to the Steady state.
4.3.8 From Degraded State to Failed State
For each node level failure detected, the NCM sends a node failure message to the corre-
sponding NSM. Depending on the number of mirroring or backup levels, the additional
node failure may lead to a less-than-complete data set being available for queries. In that
case, the NSM would exit the Degraded state and enter the Failed state.
This transition can occur in any of the Degraded sub-states. All remodelling efforts are
aborted when the Failed state is entered. In the RDSS prototype, the mirroring level is
one, which means that if more than one node fails, the system will enter the Failed state.
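The transitions described in sections 4.3.2 to 4.3.8 can be summarized as a lookup table. The message names are ours, and the Degraded sub-states are omitted for brevity:

```python
# Hypothetical transition table for the NSM; messages come from the NCM
# and ASM modules on the same node.
TRANSITIONS = {
    ("Initialization", "all-nodes-up"): "Steady",
    ("Initialization", "node-failure"): "Degraded",
    ("Steady",         "update-start"): "Modifying",
    ("Modifying",      "update-done"):  "Steady",
    ("Steady",         "node-failure"): "Degraded",
    ("Modifying",      "node-failure"): "Degraded",
    ("Degraded",       "all-nodes-up"): "Steady",
    ("Degraded",       "node-failure"): "Failed",   # mirroring level is one
}

def next_state(state, message):
    """Look up the transition; unknown messages leave the state unchanged."""
    return TRANSITIONS.get((state, message), state)
```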
4.3.9 NSM Termination
Depending on the application, it may or may not make sense to allow the system to
remain in the Failed state if the accessible data set is less than complete. If not, the
NSM will orchestrate an orderly shutdown of the RDSS environment via the NCM. Once
the shutdown decision has been made, the only method of recovery is to restart the
RDSS.
4.3.10 On-line Removal of a Node
Removal of a node while the RDSS environment is on-line is permitted with the system in
the Remodel-Deletion sub-state within the Degraded state. Unlike the Backup sub-state,
entering the Remodel-Deletion sub-state requires system-wide synchronization, which was
discussed in the NCM section.
The deletion transition can be initiated in one of the following ways:
- Incomplete remodelling detected by the NCM on restart.
- An NCM off-line timeout during the Backup sub-state in the guardian NCM.
- An administrative command by-passing the timeout mechanism in the guardian
NCM.
For the last two ways of entering the Remodel-Deletion sub-state, the guardian NCM
also has to make sure there is enough remaining capacity in the system. Once an NSM
has started on a remodelling synchronization process, it is no longer able to respond to
another remodelling request until the current remodelling has been completed.
Transitions to the Remodel-Deletion sub-state are tracked in persistent storage in case
of unexpected node termination. Only one RDSS node may be deleted during each
remodelling. This is not a problem for the RDSS prototype as only one level of fault
tolerance is being supported.
Once the NSM has entered the Remodel-Deletion sub-state, the corresponding ASM on
the local node is instructed to start merging secondary data corresponding to the failed
node into the local primary data. This operation includes making the new secondary
copies of the data on other RDSS nodes. Once the level of fault tolerance is restored,
the NCM's are again synchronized and a remodel complete message is sent to the NSM,
returning it to the Steady state.
Deletion remodelling cannot be completed if there is not enough storage to reestablish
the required fault tolerance level. Therefore, the NCM must ensure that there is enough
room before telling the NSM to enter the Remodel-Deletion sub-state.
4.3.11 On-line Addition of a New Node
When a new node is added to the system, its guardian NCM contacts all existing nodes for
synchronization. In the RDSS prototype, the last node on the node list is the designated
guardian for the new node. Once the RDSS environment has agreed to the addition, the
NSM's on all nodes are notified to enter the Remodel-Addition sub-state.
The Remodel-Addition sub-state can only be entered from the Steady state. If a node is
added when the RDSS application is in the Backup sub-state, the system transitions to
the Remodel-Relocation sub-state (see below) instead.
During this sub-state, the RDSS performs load balancing. When the load balancing
policy is satisfied and all data transfers are completed, the ASM informs the NCM. The
guardian NCM then performs the completion handshake via the NCM. If successful, the
NSM will be instructed to return to the Steady state.
4.3.12 Relocating Data from a Deleted Node to a New Node
When a new node is added during the Backup state of the guardian node, no node addition
operation is performed. Instead, the relocation guardian (same as the deletion guardian)
for the failed node may choose to initiate node relocation. The same list of starting
conditions that apply to node deletion also applies here, with the extra requirement that
there is a new node waiting to join the system.
During a node relocation, mirrored data from the failed node goes into the primary
application server instance on the new server node. Also, each surviving node must
recreate the missing mirrored portion of its own data on the new node. Like node
deletion, only one node can be relocated at a time.
4.4 Application Servers Manager (ASM) Detailed
Design
The main role for the ASM is to oversee the single primary instance and the multiple
secondary instances of the Application Server (AS) on the local node. All data mod-
ification requests flow through the ASM. In addition, all ASM modules participate in
node remodelling (deletion, addition and relocation). The ASM also communicates with
the Marshaller-Dispatcher Port Manager (MDPM) to control the visibility of AS query
ports. The following sub-sections describe each of the responsibilities of the ASM.
4.4.1 Data Mirroring
Each ASM has an update port and a control port. At initialization of the MDPM, a
TCP connection is established between the MDPM and the control port of the ASM.
The ASM sends the location of its visible update port to the MDPM. From then on, all
update traffic from the Update Marshaller-Dispatcher (UMD) is sent to this ASM update
port.
Within each RDSS server node, there is one (primary) AS instance and N - 1 (secondary)
AS instances on each node. The primary AS instance is responsible for storing the new
data entries from the UMD. Each secondary AS instance is responsible for mirroring a
portion of a remote node. Conversely, data in the primary AS instance is mirrored to
N - 1 secondary AS instances, each on a different remote node.
At the beginning of each update session, the UMD requests the available capacity of all
the server nodes via the ASM update port. Using the capacity estimates from all the
nodes, the UMD sends the next new data entry to the least full server node by way of
the ASM update port. The ASM uses the capacity usage of its remote mirroring location
to select the least full mirroring node. The data entry is then sent through the update
connection between the two ASM's. The primary ASM is responsible for remembering
the RDSS addresses of the data entries in its primary AS and the location of the mirroring
node for every entry.
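The mirror selection and bookkeeping described above can be sketched as follows; the RDSS-address and usage-map representations are illustrative assumptions:

```python
def record_entry(mirror_map, rdss_address, mirror_usage):
    """Mirror a new primary entry on the least full mirroring node and
    remember where the mirror lives, as the primary ASM must."""
    target = min(mirror_usage, key=mirror_usage.get)
    mirror_map[rdss_address] = target    # RDSS address -> mirroring node
    mirror_usage[target] += 1            # update the local usage estimate
    return target
```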
Removing data is the reverse of adding data. Merge and extract operations can be viewed
as multiple additions and deletions. Changes are not updated until they are committed
by the commit command. Multiple requests may be committed with a single commit
command. To ensure that no data is lost, once the update transaction is started, it can
only be terminated by either a commit or an abort message.
Before an ASM sends a ready reply to the UMD, it must first receive a ready reply from
the mirroring ASM's. When the commit command arrives, the primary ASM saves the
commit flag to the persistent storage. It then sends the commit command to the mirroring
ASM's. After the mirroring ASM's have acknowledged the commit, the changes on the
local node are then committed. Finally, the commit flag is removed from the persistent
storage and acknowledgment is returned to the UMD.
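The commit path above can be sketched with a plain dictionary standing in for the ASM's persistent storage; the callable arguments are illustrative stand-ins for the mirror connections and the local commit:

```python
def commit_transaction(persistent, mirror_commits, apply_local):
    """Commit path: persist the commit flag first, commit the mirrors,
    commit locally, then clear the flag and acknowledge the UMD."""
    persistent["commit_flag"] = True    # survives a crash mid-commit
    for send_commit in mirror_commits:  # commit on every mirroring ASM
        send_commit()
    apply_local()                       # commit the local changes
    del persistent["commit_flag"]       # safe to clear once all are done
    return "ack"
```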
4.4.2 Remodelling Interface
Before node deletion remodelling can begin, the guardian ASM has to ensure there is
enough capacity in every node for the operation. This occurs when there is enough room
to:
- transfer the data entries corresponding to the failed node from the local secondary
AS to the local primary AS.
- mirror new primary data to the other nodes by asking other ASM's to determine
whether there is enough storage for the data.
If both criteria are met, the ASM will let the NCM proceed with the deletion remodelling
synchronization. Repartitioning of storage between the primary AS instance and the
secondary AS instances is possible (but not implemented in the RDSS prototype for
remodelling).
When the NSM enters a remodelling state, the ASM performs the necessary transfer of
data from the secondary AS corresponding to the failed node to the local primary AS.
The ASM performs the following sequence of operations:
Removes the failed node from the mirror destination list.
Sends the extract update command to the secondary AS on the local server node
corresponding to the failed node.
Sends the merge update command to the ASM update port of the local node. The
merge routine in the ASM module handles the necessary mirroring by treating the
data as new data.
Sends a ready signal to the NCM after both the extract and merge operations are
ready.
Commits the outstanding remodelling changes on receiving the remodelling commit
command.
The operation sequence for node addition remodelling is slightly different. Since the
level of mirroring does not increase, the storage requirement on each node is not affected.
However, due to the remodelling sequence used, extra capacity is needed for moving
data. The ASM in the prototype needs to make sure that there is enough room for a
new secondary AS on the local node. For node addition, the ASM performs the following
sequence of operations:
1. Creates a new secondary AS corresponding to the new node. Adds the new node to
the mirroring destination list.
2. Sends the extract update command to the ASM update port on the local node for
roughly 1/N (on an N node system) of the data stored in the primary AS instance.
The extract routine performs the necessary deletion of mirrored data from remote
nodes.
3. Sends the merge update command to the ASM of the new node. The merge routine
automatically handles the necessary mirroring.
4. Sends a ready signal to the NCM after both the extract and merge operations are
ready.
5. Commits the outstanding remodelling changes on receiving the remodelling commit
command.
The sequence for relocation remodelling involves the recreation of the failed node. Pro-
vided the new node is at least the size of the failed node (by the homogeneous node size
assumption), no capacity check is necessary. The ASM performs the following sequence
of operations for node relocation:
1. Replaces the failed node with the new node in the mirroring destination list.
2. Searches the mirroring destinations of all the data entries on the local node. For each
data entry mirrored to the failed node, send it to the mirroring AS on the new
node.
3. Changes the local secondary AS instance that corresponds to the failed node to
correspond to the new node. Its data is extracted and sent to the new node and
merged into the new primary AS without generating new mirror data.
4. Sends a ready signal to the NCM after both the extract and merge operations are
ready.
5. Commits the outstanding remodelling changes on receiving the remodelling commit
command.
4.4.3 Controlling Port Visibility
On startup of a Marshaller-Dispatcher, its port manager, the MDPM, connects to the pre-
defined control port of the ASM on each server node. There are only two commands on
this control link:
Enable (update/query port)
Disable (update/query port)
Upon completing the connection, the ASM first uses the enable (port) command to send
its own ASM update port and then the query port of the primary application server (AS)
on the local node.
If a remote node has gone off-line, the NCM will instruct the NSM to enter the Degraded
state. The ASM makes visible the application server that contains mirrored data from
the failed node by sending the query port of that AS in an enable (port) command to the
MDPM. If the remote node returns to on-line status, the secondary query port will be
removed using the disable (port) command.
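The effect of these two control commands on the MDPM's set of visible ports can be modelled minimally as follows (in the prototype the commands travel over the TCP control link; the function and argument names here are illustrative only):

```python
def apply_port_command(visible_ports, command, port):
    """Toy model of the MDPM control link: only enable and disable exist.

    visible_ports: set of (address, port) locations currently exported.
    """
    if command == "enable":
        visible_ports.add(port)           # make a query/update port visible
    elif command == "disable":
        visible_ports.discard(port)       # tolerate a port never enabled
    else:
        raise ValueError("unknown control command: %r" % (command,))
    return visible_ports
```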
In the case of node deletion, the ASM of the guardian is responsible for sending the
disable (update port) command. If the update port location of the failed node is not
known, the IP address of the failed node alone will suffice. For node addition, the new
node itself is responsible for making its ASM update port and primary query port known,
after the node addition remodelling has been committed.
4.4.4 Persistent Information
The ASM must ensure the integrity of data modification operations. By the definition
of the update protocol specified in chapter 2, all update operations that modify data
are handled by transactions. Once the first phase of an update request is received by
the ASM, it needs to keep track of the transaction status until the operation is either
committed or aborted. Several outstanding update transactions may be committed or
aborted together at the same time. However, only one update session can be in progress
at any time. This is part of the assumed trust placed in the update client.
The transaction status is stored in persistent storage, so that a system crash cannot
compromise the transaction. The persistent storage also holds the commit flag for the
ASM. Once this flag is set, the outstanding transactions will be committed even after
a system crash. This commit flag is not removed until the commit acknowledgments
are received from the affected AS instances. After a restart from a system crash, the
primary ASM with any outstanding transactions would send out the appropriate update
commit or abort command, depending on whether the commit flag is set.
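The restart decision can be sketched as a small pure function (the field names are illustrative; the prototype's actual persistent format is not specified here):

```python
def recover_after_crash(persisted):
    """Decide what to do with outstanding update transactions on restart.

    persisted: dict with 'commit_flag' (bool) and 'outstanding' (list of
    transaction ids), as read back from persistent storage after a crash.
    Returns the command the ASM would issue, or None if nothing was in flight.
    """
    if not persisted["outstanding"]:
        return None                # no transaction was pending at crash time
    if persisted["commit_flag"]:
        return "commit"            # commit flag set: finish the commit
    return "abort"                 # otherwise the changes are abandoned
```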
4.4.5 ASM Interfaces
The interfaces of the ASM are depicted in Figure 4.1. For the ASM to perform all its
duties, it needs to have access to all the operations that may alter data in the storage.
It acts as the middle layer between the Update Marshaller-Dispatcher (UMD) and the
update ports of the local AS instances. For mirroring purposes, it needs to deal with
the ASM's on other nodes. To support backup and remodelling, and to report any node
failure, it needs to interface with the NSM on the local node. It is also responsible for
keeping the Marshaller-Dispatcher Port Manager (MDPM) informed of the availability of
backup query ports. Finally, the need for dealing with storage resizing and repartitioning
requires the ASM to interface with the Hardware Abstraction Layer (HAL) on the local
Figure 4.1: RDSS Server Node Components
Interface (to)             Communication Content        Implementation
AS update ports            Update Request               TCP socket connection
UMD-ASM port               Update Request               TCP socket connection
Other ASM update ports     Mirroring Request            TCP socket connection
Local NSM                  Remodel and status report    TCP socket connection
MDPM control port          Ports enable/disable         TCP socket connection
HAL library                Storage allocation           Linked during compilation

Table 4.1: Application Servers Manager Interfaces
node.
Table 4.1 summarizes the ASM interfaces, and the current implementation choices for
them. Other than the HAL interface, all communications are done through TCP socket
connections. This allows the ASM to use a simple select operation to service all of them.
4.5 Hardware Abstraction Layer (HAL) Detailed Design
Instead of directly accessing the physical storage, the RDSS software (including the
application server instances) performs data storage manipulation via the Hardware Ab-
straction Layer (HAL). In the current prototype, the HAL is implemented as two software
libraries. The HAL-AS (HAL Application Server) library needs to be linked with the application server executable, and the HAL-OM (HAL On-line Maintenance) library needs
to be linked with the ASM software module.
4.5.1 Virtual Disk
At the heart of the HAL design is the concept of a virtual disk. The HAL-AS library
presents a set of operations analogous to an actual physical disk. The actual storage,
however, may be re-routed to multiple storage destinations. The list of available HAL-AS
operations is shown in chapter 2.
Due to its hardware and operating system dependencies, portions of the HAL module
must be customized for each specific platform. Nevertheless, the interface used by the
application server should not be affected by the changes in the back-end of the HAL
module.
Each virtual disk is described by its block size and the number of blocks it contains.
These parameters, along with the mapping table to the physical storage, are stored in
the description header at the beginning of the virtual disk. As an extreme example, a
HAL virtual disk might be made up of three physical storage devices: a raw SCSI disk,
a portion of a RAID-5 disk array with a file system, and flash memory storage. The
HAL keeps track of the mappings of the virtual disk to the actual storage location. The
application server does not know that there are three different physical storage devices
involved. The main purpose of the HAL is to allow resizing and repartitioning of the
storage allocated to each application server.
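The virtual-to-physical mapping can be sketched as a lookup over an ordered extent list, a simplified stand-in for the description header's mapping table (device names and the extent representation are illustrative, not the prototype's actual header format):

```python
def resolve(vblock, extents):
    """Map a virtual block number to a (device, physical block) pair.

    extents: ordered list of (device_name, start_block, length) tuples
    describing the physical storage backing the virtual disk.
    """
    for device, start, length in extents:
        if vblock < length:
            return device, start + vblock   # falls inside this extent
        vblock -= length                    # skip past this extent
    raise IndexError("virtual block beyond end of virtual disk")
```

With the three-device example above, blocks 0-99 might live on the raw SCSI disk, the next 50 on the RAID-5 portion, and the rest in flash, yet the application server sees one flat block range.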
4.5.2 Dynamic Repartitioning
In addition to amalgamating multiple storage devices, one key feature of the HAL is to
allow for on-line manipulation of the virtual disk partition size. The capability to increase
or decrease a virtual disk partition size is included in the HAL-OM library. The HAL-OM
and HAL-AS libraries are synchronized through a queued mutually exclusive semaphore.
Only one thread may have access to the virtual disk description header information at a
time. If the header information is changed, all affected HAL libraries must also change
accordingly.
Unlike the HAL-AS library, which only provides visibility to one virtual disk at a time,
the HAL-OM can be used to maintain any number of virtual disks. In the RDSS, each
AS has its own virtual disk; the ASM can use the HAL-OM to truncate space from one
virtual disk and give it to another application server instance as needed.
4.5.3 HAL-OM Library Interfaces
In addition to resizing, the HAL-OM library contains routines to create and destroy a
virtual disk. The following are the available routines in the HAL-OM library:
Initialize:
Expand:
Truncate:
Create:
Destroy:
Shutdown:
Status:
Initializes the HAL-OM library, reads the HAL parameter file, and synchronizes
with other threads using the HAL libraries on the current server node. This
routine must be called before other HAL routines.
Increases the specified virtual disk size by the given number of blocks using
the given physical storage.
Decreases the specified virtual disk size by the given number of blocks, taken
from the end of the virtual disk. Truncated data is discarded. The location
of the freed physical storage is returned.
Creates a new virtual disk on the given physical storage devices with the given
number of blocks and block size.
Deletes a given virtual disk from the associated physical storage devices.
Frees the system resources associated with the HAL-OM library.
Returns status information about a virtual disk partition, including the virtual
disk block size and the number of usable blocks.
Plain text arguments are used for describing physical devices. This allows the HAL-
OM Library interface to remain unchanged regardless of the physical storage devices that
it supports. It is possible to put Expand/Truncate/Create/Destroy commands in the
HAL-OM startup parameter file. They are executed by the Initialize command.
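The Expand and Truncate semantics above can be modelled with the extent-list representation used earlier (a toy sketch; the real HAL-OM mutates the on-disk description header under the semaphore, and its routine names and signatures are not reproduced here):

```python
class VirtualDisk:
    """Toy model of HAL-OM resizing over a list of physical extents."""

    def __init__(self, block_size, extents):
        self.block_size = block_size
        self.extents = list(extents)       # (device, start, length) tuples

    def blocks(self):
        return sum(length for _, _, length in self.extents)

    def expand(self, device, start, length):
        """Grow the disk by appending blocks from the given physical storage."""
        self.extents.append((device, start, length))

    def truncate(self, length):
        """Shrink by `length` blocks taken from the end of the virtual disk.

        Truncated data is discarded; the freed physical extents are returned,
        mirroring the Truncate routine's behaviour described above.
        """
        freed = []
        while length > 0:
            device, start, extent_len = self.extents.pop()
            take = min(length, extent_len)
            freed.append((device, start + extent_len - take, take))
            if take < extent_len:          # partially freed extent stays
                self.extents.append((device, start, extent_len - take))
            length -= take
        return freed
```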
4.6 Marshaller-Dispatcher Port Manager (MDPM) Detailed Design
As described in the previous chapter, the role of the MDPM is to maintain both the list
of nodes in the RDSS and the list of visible AS query ports exported by the nodes. It is
also required to synchronize the lists with each other. In addition, it provides snapshots
of the lists to the QMD and the UMD.
In the RDSS prototype, connections are built on TCP sockets. The MDPM is responsible
for listening on the ports and for establishing and maintaining the communication channels.
4.6.1 Port Monitoring
The MDPM monitors and listens to all connection attempts. The port numbers are preset
at startup time, as specified by the startup parameter file for the marshaller-dispatcher.
The external query and update port numbers are unique to the marshaller-dispatcher.
They are the entry points to the RDSS to which clients connect. The node reporting
port is for synchronization between the MDPM and all server nodes.
When a new query client connection is established, a new instantiation of the query
front-end is started to handle the connection. The MDPM itself does not keep track of
the query connection spawned. It continues to listen for new query connections.
For client update connections, the MDPM does not spawn new tasks or connections. It is
a design choice to support only one trusted update connection at a time. No new update
sessions are accepted until the current one is completed.
In addition, the ASM on each RDSS server node also connects to the MDPM. In the
running state, the MDPM maintains a connection to all ASM's.
Figure 4.2: RDSS Marshaller-Dispatcher Node Components
4.6.2 Observing RDSS state changes
No new query or update connections are accepted until the minimum required number
of nodes have established connection with the MDPM. At the start of each connection, the
ASM of the RDSS server node sends its update port location and the query port location
of the primary AS instance to the MDPM.
The connection between an ASM and the MDPM is maintained at all times throughout
the operation of the system. Changes in the NSM are reported to the MDPM, which
monitors the changes via the TCP socket select mechanism. Besides changes reported
from the ASM, the MDPM also monitors any exceptions in these connections.
4.6.3 Maintaining the Node List
On start up, the marshaller-dispatcher parameter file provides the number of nodes in
the system and the minimum acceptable number. The node list contains a list of the
update port locations of each node.
On request from the update module in the marshaller-dispatcher, a shared memory copy
of the node list is made available. It is the responsibility of the UMD to obtain a new
copy of the node list before every new update session. The shared memory is protected
by a semaphore.
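The snapshot-under-lock semantics can be sketched as follows, with a lock standing in for the semaphore (the prototype shares the list between processes through shared memory rather than a Python object; the class and method names are illustrative):

```python
import threading

class NodeList:
    """Sketch of the MDPM node list with snapshot-on-request semantics."""

    def __init__(self, nodes):
        self._lock = threading.Lock()      # stands in for the semaphore
        self._nodes = list(nodes)          # update port locations per node

    def snapshot(self):
        """Return a private copy; the UMD obtains one before each session."""
        with self._lock:
            return list(self._nodes)

    def replace(self, nodes):
        """Install a new node list, e.g. after remodelling."""
        with self._lock:
            self._nodes = list(nodes)
```

An earlier snapshot is unaffected by later changes, which is exactly why the UMD must refresh its copy before every new update session.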
4.6.4 Maintaining the Query Connection List
The other list that the MDPM maintains is the query connection list, which is a list of
query ports made visible to the marshaller-dispatcher by the nodes. When all nodes are
operating in the Steady state, this list only contains the location of the query port of the
primary Application Server on each node. Associated with each query port is the index
of the associated node on the node list described in the last section.
On initialization, the QMD instance connects to shared memory that contains the up-to-date version of the query connection list. The latest query connection list can be
obtained through this shared memory buffer at any time. Only the active query ports
are on this query connection list.
If a node fails, the query port locations of the necessary secondary application server
instances will be made known to the MDPM. These query ports would be added to the
query connection list. The reverse happens when the secondary query ports are no longer
necessary. It is the job of the query modules to use this information correctly.
4.7 Update Marshaller-Dispatcher (UMD) Detailed Design
The UMD is responsible for handling all update requests from the update client, and
making sure the RDSS remains stable during and after update transactions.
4.7.1 Update Client Trust Model
The current RDSS design focuses on applications where data updates or modifications
occur infrequently or can be batched. These applications are usually controlled by a
single administrative entity. To keep the design simple, the architecture assumes that
only a single update session is needed at any given time.
The trust placed on the update client includes the following:

- Only one active update client is connected to the RDSS at a time.

- All update requests are authenticated and committed, and the information is verified.

- Restrictions imposed by the RDSS and the application are strictly followed.
An implication of this trust model is that each marshaller-dispatcher only needs one
static instantiation of the UMD. Also, there is no need for update synchronization among
UMD's.
4.7.2 Default Application Update Front-End
Unlike the query protocol, the update protocol of an RDSS compliant application is fully
specified (see chapter 2). Therefore, it is possible to include a default AUFE (Application
Update Front-End), which can be used in lieu of a custom application-specific update
front-end. The protocol supported by the AUFE is exactly the same as the one used by
the application update port as defined in appendix A.
The operation of the default AUFE is very simple. At the beginning of each update
session, the AUFE obtains the storage usage information from each node. Each new
data entry is sent to the least-full node. For other non-addition update operations (e.g.,
capacity available), the AUFE simply broadcasts the update requests to all active nodes
via the update connections to the ASM's. For any data-modifying update operations
to take effect, a commit command must be sent. Multiple update changes may be
accumulated before being committed together through a single commit command. See
the following section on data integrity for discussion of the commit operations of update
requests.
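The placement rule for new data entries reduces to a least-full selection; a minimal sketch, where `usage` is a hypothetical per-node storage-usage map of the kind the AUFE gathers at session start:

```python
def route_new_entry(usage):
    """Pick the destination node for a new data entry: the least-full node.

    usage: dict mapping node name -> fraction of its storage in use (0.0-1.0).
    """
    return min(usage, key=usage.get)   # node with the smallest usage fraction
```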
Update requests are allowed only on a successful start of the update session in the UMD.
To enter an update session, the UMD must establish a connection to every ASM update
port listed on the MDPM node list. The ASM of a server node that fails to enter the
Modifying state would not accept the update connection.
4.7.3 Update Transaction Integrity
To ensure data integrity during the modification, all update operations that affect the
data must be committed by the AUFE to take effect. The initial request is sent to one
or more server nodes. When the changes are ready, an acknowledgment is received
by the UMD. The AUFE can then send the commit command. At any time before the
commit command is sent, the update client can abort the changes. Once the commit
command reaches the ASM on the server node, the changes are finalized.
In the event of a failure during an update, the changes are aborted by the UMD. If the
failure is a system crash, the ASM will abort the update changes on the restart, unless
the commit flag is set. (See the ASM detailed design section.)
The UMD keeps track of whether there are any outstanding update changes that have
not been committed or aborted. The update session can only end if there is no outstand-
ing change. Note that non-data modifying update requests, like the storage available
requests, do not need the commit or abort command.
4.7.4 RDSS Update Marshaller-Dispatcher Library
The UMD library (UMD-Lib) has the following routines:
Start update session
Send update command
Read update command result
End update session
The UMD-Lib hides the connections between UMD and all ASM's on the nodes. It also
handles the shared resource between the MDPM and itself, specifically the active node
list. A brief description of each method is given below. The calling interfaces are listed
in appendix D.
Entering an Update Session
This start update session routine creates the TCP connections from the UMD to the
ASM on the server nodes. The server node list is obtained from the MDPM. Provided
that the connections to all these update ports are successful, the UMD-Lib would return
a success value to the caller. Otherwise, the connection attempts eventually timeout and
failure is returned.
Sending an Update Command
If no node failure has occurred since the beginning of the update session, the update
command is accepted by the UMD. The request string is then sent to the specified
nodes. Success is returned after the command is sent to the specified ASM's.
Reading an Update Command Result
If no node level failure has occurred since beginning the update session, the read update
command result routine will return the data from the first readable ASM connection
stream. Only an ASM whose corresponding flags are enabled in the select mask is read.
The order of results returned by this method is not determined.
Terminating an Update Session
The end update session routine terminates the update session by closing the update
connections to the ASM's. The UMD exits the update session regardless of the success of
this action.
4.8 Query Marshaller-Dispatcher (QMD) Detailed
Design
When a new query client connects to the RDSS via the marshaller-dispatcher query port,
the RDSS port manager creates a new process to handle all query requests from that
client connection. At the heart of each of the marshaller-dispatcher query processes is
the application query front-end module (AQFE), which is responsible for the logic of the
marshaller-dispatcher query process.
Different applications require different AQFE's. As the RDSS places no requirements on
the application query protocol, a custom AQFE must be built in order to interpret and
combine the results returned from the attached application servers. Depending on the
nature of the application and its query protocol, the AQFE could be a simple multiplexor-
demultiplexor (as provided as an example in the prototype) or a complex module with a
significant amount of logic.
Instead of directly interfacing with query clients and application servers, the AQFE must
use the routines provided by the QMD library (QMD-Lib). By doing so, most of the
RDSS related activities are hidden and automatically resolved. Thus, the coding effort
required by the AQFE is much smaller. The next section describes the callable routines
in the QMD-Lib.
4.8.1 Query Marshaller-Dispatcher Library (QMD-Lib)
The Query Marshaller-Dispatcher library (QMD-Lib) has the following routines:
- New query session
- Start query
- Read select
- Read
- Write
- End query
- Terminate query session
The AQFE is responsible for handling out-of-bound responses (query responses that arrive
after the end query). The QMD-Lib also assumes that there is only one outstanding
query per client; however, it is possible to design an AQFE to allow multiple outstanding
queries, provided that the query protocol has the proper support. The calling interfaces
can be found in appendix C.
Starting a New Query Session
This routine associates the query client to an AQFE instance. The query connection list
is updated from the MDPM. If the new process flag is set, a new AQFE process is started
to handle all queries from the given client. Normally, the routine is called by the MDPM
when a new query client appears on the marshaller-dispatcher query port. However, it
can also be used to change the query stream, provided that there is a mechanism in the
application query protocol to support this.
Note that slot 0 (offset 0) in the status list is reserved for the query client location.
There are two fields in a server query status record. The first one indicates whether it is
a primary server or not. This is a number showing the level of mirroring of the current
server (0 indicates it is a primary server). The second field is a boolean value indicating
whether the server is alive (true) or dead (false). Only the second field is relevant in the
query client slot (slot 0).
The status list is compacted at the start of every new query session (i.e., all "dead"
entries in the list are discarded). Within the same query session, the same slot in the
status list always refers to the same application server or query client. Thus, the size of
the status list does not decrease during a query session.
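The compaction rule can be sketched as follows, assuming each record is a (mirror_level, alive) pair as described above, with slot 0 reserved for the query client (a simplified model; the real QMD-Lib also reassigns slot numbers for the new session):

```python
def compact(status_list):
    """Start-of-session compaction of the QMD-Lib status list.

    Each record is (mirror_level, alive): level 0 marks a primary server.
    Slot 0 is the query client and is always kept; dead server entries are
    discarded, and survivors keep their relative order.
    """
    client = status_list[0]                                 # reserved slot 0
    servers = [rec for rec in status_list[1:] if rec[1]]    # drop dead servers
    return [client] + servers
```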
Issuing a New Query
This command broadcasts the query to all visible application servers. If the MDPM
synchronization flag is set, an up-to-date application server list will be obtained from
MDPM (recommended). Otherwise, the existing server list is used. The routine can
broadcast the given query buffer. If the read-from-client flag is set, it will read a single
line from the client file descriptor and broadcast it. If there is no data on the query client
stream to be read, the routine blocks the execution thread until the query appears or the
timer expires.
If a new application server appears during a query, after the start of the query and before
the end of the query, the same query request will be sent to it. The changes in the status
list and number of servers will be visible on the next QMD-Lib call.
Wait for Read Using Select
This routine blocks the execution thread until one of the following events occurs:
1. Data appears in the incoming stream of an application server, and the correspond-
ing bit in the read mask is set.
2. Data appears in the query client incoming stream and the bit in the read mask for
slot 0 is set.
3. An error occurs in the query client connection preventing future communication
and the bit in the exception mask for slot 0 is set.
4. An error occurs in an application server whose corresponding bit in the exception
mask is set.
5. A timeout occurs.
If an error occurs and the corresponding bit in the exception mask is not set, the QMD-
Lib will ignore the error and attempt to continue. The error is reported to the client via
the query connection status list.
Reading from the Query Client or an Application Server
This routine reads from the given stream corresponding to the given slot number. Data
from the stream is placed into the read buffer, until there is no more data available, or the
read buffer is full. If there is no data on a given stream, the routine returns immediately.
This routine does not block the execution thread.
Write to the Query Client or Application Servers
This routine writes the data in the given buffer to the given streams whose corresponding
write mask bits are set. This routine does not block the execution thread.
Ending a Query
This routine writes the data in the given buffer to the servers where the corresponding
query was sent. It also cleans up the internal state in QMD-Lib. If a new primary server
appears after the end of the query, no automatic query will be sent.
Ending the Query Session
This routine closes all QMD-Lib connections to the client and application servers regard-
less of any outstanding queries. The slot assignments of the status list are discarded at
the end of the query session.
Chapter 5
Implementation Status
A prototype of the RDSS has been implemented. However, the design has been evolving
and most of the RDSS components need to be updated or rewritten to reflect the latest
version of the RDSS architecture, described in this thesis. In addition, to test and verify
the application environment's design and the implementation, a reference application
was developed. Instead of using a full search application, this test application is a simple
text snippet server. This simple test application provides the following benefits:
- full control over its internal structure and interface design,
- reduced hardware requirements for the prototype,
- isolated testing of the RDSS software modules, and
- simplified debugging.
This chapter reports on the current status of the RDSS prototype implementation, fol-
lowed by a description of the design of the simple text snippet server, which is used as
the test application for the RDSS prototype. Finally, it outlines the necessary modifica-
tions to the current versions of the MultiText index engine and text server to make them
compatible with the RDSS.
5.1 RDSS Prototype Implementation
The current RDSS design is the result of many design and prototyping iterations. Both
the architecture and the detailed design have evolved dramatically over the iterations.
Instead of building a complete prototype at each iteration, individual modules were imple-
mented for evaluation. In this section, the implementation framework will be presented,
along with some of the issues encountered during implementation.
5.1.1 Prototyping Framework
The RDSS prototype is implemented on the LINUX operating system running on an
INTEL 486 compatible CPU with 20 MB or more system memory. The RCS version
control utility is used for maintaining the source files.
There are three development areas. The first one contains the common supporting source
code that is shared by all RDSS modules. They simplify the access to system dependent
routines, and make the RDSS module more portable. The components included here are:
- Constant definitions (const)
- Count semaphore library (cntsem)
- Binary semaphore library (binsem)
- Mutex library (mutex)
- TCP library (tcp)
- UDP library (udp)
- Mailbox library (mbox)
- Logging library (log)
The second area contains the various RDSS modules. Due to changes between the proto-
type versions, some of the modules are not up-to-date with respect to the latest design.
The current setup of the system contains:
RDSS shared components
- Application update protocol parser (aup-par)
- Application update protocol scanner (aupscan)
- RDSS common routines (rdss-com)
Node Configuration Monitor (NCM)
- NCM broadcast control (ncm-bc)
- NCM transaction control (ncm-tc)
- NCM main (ncmm)
Node State Machine (NSM)
- NSM transition states (nsmfs)
- NSM main (nsmm)
Application Server Manager (ASM)
- ASM mirror control (asmmc)
- ASM remodel control (asrnx)
- ASM query control (asm-qc)
- ASM main (asmm)
Hardware Abstraction Layer (HAL)
- HAL application server library (halas)
- HAL online maintenance library (hal-om)
- HAL common routines (hal-com)
Marshaller-Dispatcher Port Manager (MDPM)
- MDPM query connection control (mdpm-qc)
- MDPM node control (mdpmnc)
- MDPM common routines (mdpm-com)
Query Marshaller-Dispatcher (QMD)
- QMD query connection control (qmd-qc)
- QMD library (qmdlib)
- QMD test library (qmd-t)
Update Marshaller-Dispatcher (UMD)
- UMD node control (umdnc)
- UMD library (umdlib)
- UMD test library (umd-t)
The third area contains the simple text snippet server software. The test drivers and
simulation scripts are located in the last area of the prototyping framework.
5.1.2 Implementation Issues
The application update protocol parser and scanner are implemented with YACC and
LEX compatibility in mind. On LINUX, they are compiled with the GNU BISON and
FLEX compilers. GCC 2.6.3 and System V libraries are assumed to be available on the
compilation platform.
During prototyping, some shortcuts were used to simplify implementation. Instead of
writing to a raw disk partition, the prototype HAL uses a large file to simulate a disk
partition and to allow quick verification of the stored data. Some of the port numbers
and the broadcast address are hard-coded instead of being read in from a parameter file.
The location of the logs needed for restart after system crash are also hard-coded.
Currently, most of the RDSS modules exist but are not fully consistent with the latest
RDSS architecture, as described in chapters 3 and 4. A stand-alone simple text snippet
server with its front-end has been successfully integrated with the QMD, the ASM and
the HAL. The extra delay for a simple query due to the RDSS layers is:

    T_extra = (TCP_client-to-fe + TCP_fe-to-as - TCP_direct) + HAL_extra

where

TCP_client-to-fe: the TCP communication delay from the query client to the application
query front-end;

TCP_fe-to-as: the delay from the front-end to the application server;

TCP_direct: the delay in a direct TCP connection between the query client and the
application server; and

HAL_extra: the small additional delay due to the RDSS HAL library as opposed to a
direct physical storage access.

While the computational delays of the RDSS software components are included in the
above terms, the TCP delays are dominated by the node-to-node network communication
delay, and the HAL delay is dominated by the disk access response time.
5.2 Simple Text Snippet Server
As briefly described in chapter 2, the text snippet server is a simple server that is capable
of storing variable-length generic data segments ("blobs"). Only the core features of such
a server are implemented in the reference application for the RDSS. Most of the appli-
cation administrative functions and interfaces that would be included in a production
version are omitted unless they are needed for the RDSS integration.
The simple text snippet server stores blocks of character symbols in physical storage.
Each data entry (one or more blocks of characters) is indexed by a tag. For simplicity,
the snippet tag is the same as the RDSS address range for the snippet. To retrieve a
snippet, the valid snippet tag (RDSS address) is required.
5.2.1 User Interfaces
As a simple reference application, the update interface of the simple text snippet server is
a direct implementation of the RDSS update protocol. The only query command available
is to retrieve the stored data given a valid snippet tag or multiple tags. All snippet tags
Figure 5.1: Physical Storage Layout of Simple Text Snippet Server
that fall within a user-supplied RDSS address range in the query are considered valid,
and the associated snippets are returned.
5.2.2 Storage Format
To allow retrieval of snippets, a mapping table is used to correlate the snippet tags to
physical storage offsets. The mapping table, along with its header, is also stored in the
physical storage along with the text snippets. Figure 5.1 depicts the actual layout of the
physical storage used by the simple text snippet server.
The available storage is the total storage allocated to the server by the RDSS, less the
blocks used by the mapping table and the header block. For simplicity, an entry is reserved
for each available storage block in the server, and direct mapping is used. Slot x in
the mapping table corresponds to block x in the data storage. For a 64-bit RDSS
address space and a single-byte control flag, the storage overhead for the mapping table
is 17/(512 + 17) = 3.21% with a block size of 512 bytes. The storage used by the header
block is negligible for a large number of blocks.
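The overhead figure can be reproduced directly. The 17 bytes per entry would correspond to a 16-byte address-range tag (two 64-bit addresses) plus the one-byte control flag; that decomposition is an inference from the text, so it is exposed here as parameters:

```python
def mapping_overhead(block_size, tag_bytes=16, flag_bytes=1):
    """Fraction of raw storage consumed by the mapping table.

    One entry (tag + control flag) is reserved per data block under the
    direct-mapping scheme, so the overhead is entry / (block_size + entry).
    """
    entry = tag_bytes + flag_bytes
    return entry / (block_size + entry)
```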
At the startup of the application, the mapping table is read into memory. An inverted
map, sorted by the snippet tags (RDSS addresses) is created. To retrieve snippets, query
requests are matched with this inverted map in the memory.
The header block contains the start-end pair of indices for the snippet tag mapping
table. They indicate the range of indices that are active. The range wraps
around at the last index value: if the start index is greater than the end index,
then the active index range is wrapped around (i.e., from the start index to the last
possible index slot and from the first possible index slot to the end index).
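The wrap-around test can be written as a small predicate; the helper name and signature below are hypothetical, for illustration only.

```python
def is_active(slot, start, end, num_slots):
    """True if index `slot` lies in the active range from `start`
    to `end`, which may wrap around at the last index slot."""
    assert 0 <= slot < num_slots
    if start <= end:
        return start <= slot <= end     # contiguous range
    # wrapped: start..last slot, then first slot..end
    return slot >= start or slot <= end

print(is_active(1, 6, 3, 8))  # True: active slots are 6, 7, 0, 1, 2, 3
print(is_active(5, 6, 3, 8))  # False: slots 4 and 5 are inactive
```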
In addition to the start-end indices, the header block also includes other information
needed for storage resizing and other operations. The following list shows the content of
the simple text snippet server header block:
- number of storage blocks allocated
- maximum number of entries in the mapping table
- first mapping table block
- last mapping table block
- start index in the mapping table
- end index in the mapping table
- next available index entry in the mapping table
- current ending RDSS address
For efficient operation, the header block information and the translation table are cached
in memory. For data integrity, they are synchronized with their corresponding copies in
the persistent storage at all times.
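The header fields listed above map naturally onto a record; the field names below are illustrative, not taken from the implementation.

```python
from dataclasses import dataclass

@dataclass
class SnippetHeader:
    """In-memory copy of the header block (illustrative field names)."""
    num_blocks: int       # number of storage blocks allocated
    max_entries: int      # maximum number of mapping table entries
    map_first: int        # first mapping table block
    map_last: int         # last mapping table block
    start_index: int      # start index in the mapping table
    end_index: int        # end index in the mapping table
    next_index: int       # next available index entry
    end_addr: int         # current ending RDSS address

hdr = SnippetHeader(num_blocks=1024, max_entries=992, map_first=1,
                    map_last=32, start_index=0, end_index=0,
                    next_index=0, end_addr=0)
print(hdr.max_entries)  # 992
```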
5.2.3 Storage Resizing
To support the truncate and expand operations in the update protocol, the ability to
resize the usage of physical storage is needed. The following steps are done during resizing
to ensure that data integrity is maintained throughout the operation.
The first step is to determine whether the mapping table must be relocated. For
simplicity, relocation is only performed if the destination of the new mapping table
does not overlap with the current mapping table. A storage truncation
will return a failure if the mapping table relocation is not possible (unless the unused blocks
at the end of the storage are enough for the truncation request). A storage expansion, on
the other hand, will return a success without utilizing the extra storage blocks allocated.
Regardless of the direction of resizing, if relocation of the mapping table is necessary, the
next step is to defragment the current storage: empty storage blocks in the current
storage range (between the start-end indices) are filled by non-empty blocks from the
end of that range.
For storage truncation, the next step is to determine whether data block relocations are
needed. If the size reduction is not achievable (i.e., there are not enough empty blocks for
the resizing), the operation will stop. Otherwise, data blocks that fall within the
to-be-removed area at the end of the storage will be moved to the safe area (beginning of the
storage). The start-end indices will be adjusted accordingly, along with the maximum
number of entries in the header block. This step is not needed for storage size expansion.
Next, the mapping table is moved to its new storage location. The header block is then
updated with the new mapping table's location. For storage expansion, the mapping
table is first moved, and then its size is increased. For storage truncation, the mapping
table size is first decreased before the table is moved. After that, the number of allocated
storage blocks is reduced. The next available index will also be updated to reflect the
changes. During the mapping table relocation, no actual data is moved, and the content
of the index map is not changed. The resize operation is completed after the update of
the header block. A system crash during the resize operation will not result in any data
loss, as long as the header block update is atomic and protected from crashes. Figure 5.2
depicts the relocation of the index map (mapping table) when the storage size allocated
to the application is expanded from N to N'.
Figure 5.2: Mapping Table Relocation during Storage Expansion (the index map
moves within the storage as the allocation grows from N to N' blocks)

5.3 Conversion of the MultiText System

In the current MultiText system, neither the index engine nor the text retrieval server
includes any data mirroring or on-line maintenance features. A brief description and
a setup diagram are given in chapter 1. Further description can be found in various
MultiText papers [CCPT97, CBCG95].
To adapt MultiText to the RDSS framework, some changes are necessary. Most
of the RDSS application requirements, like the transaction based update operations,
described in chapter 2, are already included in the current system. Thus, no major
redesign is necessary. To ease the transition, the adaptation can be broken into five
testable stages.
Stage one is for the MultiText index engine and the text server to use the RDSS HALAS
library instead of directly accessing the physical storage. The application will be able to
run as if nothing has changed. Since both servers currently use physical disk partitions
for all persistent storage, this is not a difficult adaptation.
In the second stage, the update interface of MultiText needs to be modified to become
compatible with the RDSS application update protocol requirement. A translator may
be coded such that update requests in the old format may be translated into the new
RDSS compliant update requests. The current MultiText update protocol is very similar
to the RDSS compliant application update protocol, and only minimal syntax changes
are necessary.
The biggest change comes in stage three. The dynamic resizing ability is not included in
the current MultiText system and has to be added. The ability for a robust application
to truncate or expand its storage usage is one of the requirements that was
not anticipated before the development of the RDSS design.
The next stage is to design and implement the Application Query Front-End (AQFE) for
the MultiText system. The current MultiText Marshaller-Dispatcher is not suitable for
integration with the RDSS and must be replaced. The default Application Update Front-
End (AUFE) is adequate and no custom AUFE is needed for either the MultiText index
engine or the text server. Using the QMD and UMD test library, the application front-
end modules can be tested with the application server without other RDSS components.
The final stage is to integrate and test MultiText with the RDSS environment.
Chapter 6
Conclusion
In chapter 1, the idea of a robust distributed storage system (RDSS) was presented
along with its associated design goals. After many iterations of the design effort, a
workable architecture was devised and is presented in this thesis. The architecture
satisfies the stated transparency, availability, and throughput performance criteria, and
also encompasses many different kinds of real-world applications in its target domain.
6.1 Contribution of the Thesis
With Moore's observation on the doubling of microprocessor speeds continuing to hold,
and on-going advances in affordable, high-capacity storage technology taking place, the
power of a network of inexpensive personal computers cannot be ignored. The RDSS
architecture can be implemented on a network of very inexpensive personal computers,
and presents an attractive low-cost alternative to even a modest RAID storage subsystem.
One of the major hurdles in putting a large information system on a distributed platform
is the complexity of the software design required. Through the RDSS environment, most
of the tricky issues in the design are handled.
The RDSS not only provides data mirroring, it also scatters mirrored data across many
nodes, avoiding data loss due to a single node failure. It also keeps data within the
same entry together at the mirrored location, thus avoiding the cost and complexity of
combining data entries across the distributed system. This is important for an index engine,
as the indexing information of a document needs to be kept together for the search algorithm
to proceed.
On top of that, the RDSS provides load balancing, not only when the system is operating
on its primary data, but also when the system is using the mirrored data. Since data
entries are mirrored to the other N - 1 nodes in an N-node system, each server node only
has to handle 1/(N - 1) extra data if a node fails. To provide these benefits, a totally
transparent storage, like RAID, is not possible. For the applications that it is designed
to support, the inconveniences due to the compliance requirements (described in chapter 2)
are far outweighed by the benefits gained.
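The load figures follow directly from the scattering: if each node's data is mirrored evenly over the other N - 1 nodes, a single failure adds 1/(N - 1) of one node's load to each survivor. A quick sketch of the arithmetic:

```python
def survivor_load(n_nodes):
    """Relative load on each surviving node after one failure,
    with mirrors scattered evenly over the other n_nodes - 1 nodes
    (1.0 = normal single-node load)."""
    return 1.0 + 1.0 / (n_nodes - 1)

print(survivor_load(5))  # 1.25: each of 4 survivors takes on 25% more
```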
While more tests may be done on the RDSS prototype, the architecture and the detailed
design shown in this thesis provide a strong and solid foundation for future work in
networked distributed storage. The preliminary implementation has shown that the
RDSS framework is flexible and usable.
6.2 Future Work
In the short term, more work will be needed to take the prototype implementation to a
production level. The current prototype needs to be made consistent with the current
design described in this thesis. The hard-coded shortcuts mentioned in the last chapter
will need to be properly addressed by customizable parameters. Changes may be needed
in the hardware abstraction layer (HAL) for the RDSS to work in different workstation
setups and to enable dynamic virtual disk re-partitioning.
In addition, performance benchmarking with multiple nodes on multiple network loca-
tions will need to be performed to further refine the RDSS design. It is important to
find out if there are any throughput bottlenecks in the design as well as to discover
any scalability limits of the implementation. A comparison study between an
RDSS-enabled application and its single computer equivalent should be performed to show the
cost-saving potential of the RDSS.
In terms of design improvements, there are several areas that successors to the RDSS
architecture may need to consider. The following are a few worth considering.
Communication Improvements
With IPv6, which supports multicasting and resource reservation, getting closer to reality, it may
be possible to improve on the node-to-node communications done in the RDSS. For
example, the number of connections per node may be reduced by replacing the node to
node update connections with a multicasting network.
In addition, changes may be added to support prioritized queries. It may also be necessary
to use a different communication protocol if the TCP/IP protocol suite is not available
or efficient in the target network. Adopting a fault tolerant multicasting protocol may
help to reduce the complexity of the rest of the RDSS.
Multiple Node Failures
Another possible high profile improvement to the RDSS is to increase the level of fault
tolerance. To continue the scattered mirroring strategy, multiple mirroring levels will
be needed to achieve this. For each primary data partition in an N-node system,
its first level of mirroring data is scattered among the other N - 1 nodes, just like the
current RDSS described in this thesis. However, in a multiple level mirroring architecture,
this process is recursive. Each first level mirroring partition will have N - 2 secondary
mirroring segments located in the remaining nodes, excluding the node with the primary
copy of the data.
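A quick count shows how the recursion multiplies segments: each primary partition scatters to N - 1 first-level segments, and each of those to N - 2 second-level segments. The arithmetic below is illustrative only.

```python
def mirror_segment_counts(n_nodes, levels):
    """Segments created per parent partition at each mirroring level:
    N - 1 at level one, N - 2 at level two, and so on."""
    return [n_nodes - 1 - lvl for lvl in range(levels)]

print(mirror_segment_counts(6, 2))  # [5, 4]
# 5 first-level segments, each split into 4 second-level segments,
# so a 2-level scheme keeps 5 * 4 = 20 second-level segments in total.
```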
While the mirroring procedure is relatively simple, the difficult part is to modify the
Node Configuration Monitor (NCM) to handle the possibility of multiple node failures
and remodelling. So far, in the RDSS, the NCM does not need a majority agreement
mechanism, as only one node failure is handled. However, for the system to handle
multiple node failures, such a mechanism is needed to achieve network consistency.
Network Partitioning
With the inclusion of multiple node failures, there is the possibility of a network parti-
tioning problem. Again, with the single node failure and local area network assumptions,
the current RDSS design does not handle a partitioned network very well. The majority
agreement mechanism needs to be added to the NCM. Also, the restart sequence needs
to be changed so that two or more partitioned sub-networks can be recombined without
problem.
Wide Area Network
As stated in the assumption, the RDSS only works on a local area network. Because of
communication delays and unreliability of communication links, distributed applications
on a wide area network often suffer from virtual network partitioning. Unlike the local
area network case, multiple replicated primary data partitions may be desirable. Instead
of limiting the RDSS to a strict single primary data source, it may be worthwhile to
explore the inclusion of a replicated primary node in future designs if the support for a
wide area network is needed.
Other Fault Tolerance Strategies
The RDSS architecture provides node level fault tolerance, but none is provided at a data
entry level. It is possible to use RAID storage on all nodes, but that would be expensive
and defeat the low-cost purpose of the RDSS. One possible solution is to include error
correction codes [GCCT96]. They may complement the mirroring strategy by providing
sub-node level fault protection and a higher degree of fault tolerance.
Another possible variation on the fault tolerance strategy is to optimize the mirroring
rules. Instead of always mirroring to every remaining node, the system may only mirror
to some of them. In a multiple level mirroring system, this may or may not give better
protection with less replication, depending on the mirroring policy. Nevertheless, it is an
interesting problem to tackle.
Security Enhancements
There are weak spots in the current RDSS system in terms of security, namely the
node-to-node communications. In particular, the NCM broadcasting network is very
vulnerable to malicious attacks. It is acceptable on a private secure local area network
insulated from the outside world by a firewall, as long as the firewall is configured properly
to protect the RDSS. Ideally, some form of security should be built into the system to
guard against misuse.
Application Support
Last, but not least, samples from the various applications in the RDSS target domain
should be used to see where the strength of the RDSS lies. The update protocol may be
enhanced or changed to a binary format, which may be more appropriate for multimedia
and other non-text digital libraries.
Appendix A
Application Update Protocol BNF
The following is the syntax of the application update protocol implemented in the RDSS,
in extended BNF (Backus-Naur Form). The application server must implement
a superset of this protocol on its update port. Keywords are shown in quoted
form and variable tokens are italicized.
Tokens: string number new-line

string: '"' (any ASCII character)* '"'

new-line: (line-feed) carriage-return

command-line: update-command new-line

update-command: add-request
    | delete-request
    | extract-request
    | merge-request
    | update-commit
    | update-abort
    | storage-available-request
    | truncate-storage
    | expand-storage
    | quit-update-session
    | shutdown-server

add-request: 'ADD' 'AT' number 'TO' number 'CAPACITY' number 'SIZE'
    number 'DATA' string

delete-request: 'DELETE' 'FROM' number 'TO' number

extract-request: 'EXTRACT' 'FROM' number 'TO' number 'PORT' number
    ('EXTERNAL FORMAT')

merge-request: 'MERGE' 'FROM' 'SERVER' number 'PORT' number
    ('EXTERNAL FORMAT')

update-commit: 'COMMIT' ('ALL')

update-abort: 'ABORT' ('ALL')

storage-available-request: 'STORAGE' 'STATUS'

truncate-storage: 'TRUNCATE' number 'BLOCKS'

expand-storage: 'EXPAND' number 'BLOCKS'

quit-update-session: 'QUIT'

shutdown-server: 'SHUTDOWN'
Every update-command requires a response of 'ACK' or 'NACK', with the exception
of the storage-available-request, the capacity-available-request and the shutdown-server
command. The 'ACK' response indicates the operation was successful or is ready to be
committed.
For the storage-available-request, the proper return message contains two numbers: the
first is the storage block size and the second is the number of blocks avail-
able. For the capacity-available-request, the return message contains a single number
indicating the remaining free capacity.
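For illustration, requests conforming to the grammar can be assembled as plain strings. The helpers below are hypothetical, and the line terminator follows the new-line token above (an optional line-feed, then a carriage-return).

```python
NEWLINE = "\n\r"  # optional line-feed, then carriage-return

def add_request(at_addr, to_addr, capacity, size, data):
    """Build an add-request line per the update-protocol BNF."""
    return ('ADD AT %d TO %d CAPACITY %d SIZE %d DATA "%s"'
            % (at_addr, to_addr, capacity, size, data)) + NEWLINE

def truncate_request(blocks):
    """Build a truncate-storage line per the BNF."""
    return ("TRUNCATE %d BLOCKS" % blocks) + NEWLINE

print(add_request(0, 1, 512, 5, "hello").strip())
print(truncate_request(16).strip())  # TRUNCATE 16 BLOCKS
```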
Appendix B
Hardware Abstraction Layer -
Application Server Interface
The following is the calling interface to the Hardware Abstraction Layer - Application
Server (HAL-AS) Library.
Method name: bVDiskOpen
Input: Virtual disk description string (sVDisk).
Output: Virtual disk ID (iVDiskID).
Return value: Boolean, true for success.
Method name: bVDiskRead
Input: Virtual disk ID (iVDiskID), starting block offset (istart), number of blocks
(icount), pointer to the read buffer (sBuffer).
Output: Read buffer content (sBuffer).
Return value: Pointer to the read buffer. Returns NULL on error.
Method name: bVDiskWrite
Input: Virtual disk ID (iVDiskID), starting block offset (istart), number of blocks
(icount), pointer to the write buffer (sBuffer).
Output: none.
Return value: Number of blocks successfully written.
Method name: bVDiskStatus
Input: Virtual disk ID (iVDiskID).
Output: Virtual disk block size (iBlkSize), Number of blocks in virtual disk (iNum-
Blk).
Return value: Boolean, true for success.
Method name: bVDiskClose
Input: Virtual disk ID (iVDiskID).
Output: none.
Return value: Boolean, true for success.
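A typical calling sequence for these methods can be sketched with an in-memory stand-in. The class below mimics the documented semantics (NULL on a bad read, a count of blocks written), but it is not the real HAL-AS library, which operates on raw disk partitions.

```python
class VirtualDisk:
    """In-memory stand-in for a HAL-AS virtual disk."""
    def __init__(self, blk_size=512, num_blk=1024):
        self.blk_size, self.num_blk = blk_size, num_blk
        self._blocks = [bytes(blk_size) for _ in range(num_blk)]

    def read(self, start, count):          # cf. bVDiskRead
        if start < 0 or start + count > self.num_blk:
            return None                    # NULL on error
        return b"".join(self._blocks[start:start + count])

    def write(self, start, count, buf):    # cf. bVDiskWrite
        written = 0
        for i in range(count):
            if start + i >= self.num_blk:
                break                      # partial write past the end
            chunk = buf[i * self.blk_size:(i + 1) * self.blk_size]
            self._blocks[start + i] = chunk.ljust(self.blk_size, b"\0")
            written += 1
        return written                     # blocks successfully written

    def status(self):                      # cf. bVDiskStatus
        return self.blk_size, self.num_blk

disk = VirtualDisk(blk_size=512, num_blk=8)
assert disk.write(0, 1, b"snippet") == 1
print(disk.read(0, 1)[:7])   # b'snippet'
print(disk.read(7, 2))       # None: read past the end fails
```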
Appendix C
Query Marshaller-Dispatcher
Library Interface
The following is the calling interface to the Query Marshaller-Dispatcher Library (QMD-
Lib).
Method name: bNewQrySession
Input: Query client file descriptor (iClientFd), fork new process flag (bNewPro-
cess).
Output: Number of servers in the complete RDSS world (iNumServer), status list
size (iNumStatus), status list (asqsStatusList).
Return value: Boolean, true for success.
Method name: bStartQuery
Input: Direct from client flag (bDirect), update server list flag (bResync), query
buffer (sQuery), client timeout (iTimeout).
Output: Number of servers in the RDSS (iNumServer), status list size (iNumStatus),
status list (asqsStatusList).
Return value: Boolean, true if no error has occurred.
Method name: bReadSelect
Input: Number of servers and client (iNumStatus), array of read masks (abRead-
Mask), array of exception masks (abExceptMask), timeout value (iTime-
out).
Output: Array of read results (abReadMask), array of exception results (abExcept-
Mask), number of servers in the RDSS (iNumServer), status list size (iNum-
Status), status list (asqsStatusList).
Return value: Boolean, true for success.
Method name: bRead
Input: Read slot number (islot), read buffer (sReadBuffer), buffer size (iBufSize).
Output: Data in read buffer (sReadBuffer), relevant data size (iBufSize).
Return value: Boolean, true for success.
Method name: bWrite
Input: Number of servers and clients (iNumStatus), array of write masks (ab-
WriteMask), write buffer (sWriteBuffer), buffer size (iBufSize).
Output: Data size written (iBufSize).
Return value: Boolean, true for success.
Method name: bEndQuery
Input: Write buffer (sWriteBuffer), buffer size (iBufSize).
Output: Data size written (iBufSize), number of servers in the RDSS (iNumServer),
status list size (iNumStatus), status list (asqsStatusList).
Return value: Boolean, true for success.
Method name: bEndQuerySession
Return value: Process returns 0 on successful termination.
Appendix D
Update Marshaller-Dispatcher
Library Interface
The following is the calling interface to the Update Marshaller-Dispatcher Library (UMD-
Lib).
Method name: bNewUpdSession
Input: Update client file descriptor (iClientFd).
Output: Number of nodes that are part of the system (iNumNodes).
Return value: Boolean, true for success.
Method name: bUpdateCmd
Input: Pointer to update request (sUpdReq), broadcast mask (abBroadcastMask).
Output: Number of nodes to which the update request is sent (iNumSent).
Return value: Boolean, true for success.
Method name: bReadUpdRtn
Input: Pointer to return buffer (sUpdRtnBuf), size of buffer (iRtnBufSize), read
select mask (abBroadcastMask).
Output: Number of bytes used in the return buffer (iResultSize), data written to
buffer (sUpdRtnBuf).
Return value: Boolean, true for success.
Method name: bEndUpdSession
Return value: Boolean, true for success.
Glossary
AQFE: Application Query Front-End
AS: Application Server
ASM: Application Server Manager
AUFE: Application Update Front-End
GCC: GNU C Compiler
IP: Internet Protocol
LAN: Local Area Network
MDPM: Marshaller-Dispatcher Port Manager
NCM: Node Configuration Monitor
NSM: Node State Machine
RAID: Redundant Arrays of Inexpensive Disks
RCS: Revision Control System
RDSS: Robust Distributed Storage System
ROWA: Read One Write All
QMD: Query Marshaller-Dispatcher
SCSI: Small Computer System Interface
TCP: Transmission Control Protocol
UDP: User Datagram Protocol
UMD: Update Marshaller-Dispatcher
Bibliography
[AS91] I. J. Aalbersberg and F. Sijstermans. High-Quality and High-Performance
Full-Text Document Retrieval: The Parallel Infoguide Systems. In Int'l
Conference on Parallel and Distributed Information Systems, December 1991,
pp. 142-150.
[BBD+96] W. J. Bolosky, J. S. Barrera, R. P. Draves, R. P. Fitzgerald and G. Gibson.
The Tiger Video Fileserver. Microsoft Technical Report MSR-TR-96-09,
April 1996.
FTP location: ftp://ftp.research.microsoft.com/pub/tr/tr-96-09.ps
[Bir93] K. P. Birman. The Process Group Approach to Reliable Distributed Com-
puting. Communications of the ACM, December 1993, Vol. 36, No. 12, pp.
37-53.
[BS92] W. A. Burkhard and P. D. Stojadinovic. Storage-Efficient Reliable Files. In
Proceedings: USENIX Winter 1992 Technical Conference, San Francisco,
January 1992.
[Bur90] F. J. Burkowski. Retrieval Performance of a Distributed Text Database
Utilizing a Parallel Processor Document Server. In Int'l Symp. Databases
in Parallel and Distributed Systems, 1990, pp. 71-79.
[CBCG95] G. V. Cormack, F. J. Burkowski, C. L. A. Clarke and R. C. Good. A Global
Search Architecture. Technical Report CS-95-12, University of Waterloo
Computer Science Department, April 1995.
C. L. A. Clarke, G. V. Cormack and F. J. Burkowski. An Algebra for Struc-
tured Text Search and A Framework for its Implementation. The Computer
Journal, 1995, Vol. 38, No. 1, pp. 43-56.
C. L. A. Clarke, G. V. Cormack and F. J. Burkowski. Schema-Independent
Retrieval from Heterogeneous Structured Text. In Fourth Annual Sympo-
sium on Document Analysis and Information Retrieval, Las Vegas, Nevada,
April 1995, pp. 279-289.
[CCPT97] G. V. Cormack, C. L. A. Clarke, C. R. Palmer and S. L. To. Passage-
Based Refinement (MultiText Experiment for TREC-6). In Proceedings of
the Sixth Text REtrieval Conference (TREC-6), Gaithersburg, Maryland,
November 1997.
G. F. Coulouris and J. Dollimore. Distributed Systems: Concepts and De-
sign, Addison-Wesley Publishing Co., Wokingham, England, 1988.
P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz and D. A. Patterson.
RAID: High-Performance, Reliable Secondary Storage, ACM Computing
Surveys, June 1994, Vol. 26, No. 2, pp. 145-185.
B. Cahoon and K. S. McKinley. Performance Evaluation of a Distributed
Architecture for Information Retrieval. In Proceedings of the 19th Annual
Int'l ACM SIGIR Conference on Research and Development in Information
Retrieval, Zurich, Switzerland, August 1996.
P. B. Danzig, J. Ahn, J. Noll and K. Obraczka. Distributed Indexing: A
Scalable Mechanism for Distributed Information Retrieval. In ACM SIGIR
Conference, October 1991, pp. 220-229.
[GCCT96] R. C. Good, G. V. Cormack, C. L. A. Clarke and D. J. Taylor. A Robust
Storage System Architecture. In 8th Int'l Conference on Computing and
Information, June 1996.
D. K. Gifford, P. Jouvelot, M. A. Sheldon and J. W. O'Toole, Jr. Semantic
File Systems. Operating Systems Review: Proceedings of the 13th ACM
Symposium on Operating Systems Principles, Pacific Grove, CA, October
1991, Vol. 25, No. 5, pp. 16-25.
J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques,
Morgan Kaufmann Publishers, San Francisco, CA, 1993.
G. A. Gibson, D. F. Nagle, K. Amiri, F. W. Chang, H. Gobioff, E. Riedel,
D. Rochberg and J. Zelenka. File Systems for Network-Attached Secure
Disks. Technical Report CMU-CS-97-118, School of Computer Science,
Carnegie Mellon University, Pittsburgh, Pennsylvania, July 1997.
A. A. Helal, A. A. Heddaya and B. B. Bhargava. Replication Techniques
in Distributed Systems, Kluwer Academic Publishers, Boston, 1996.
Inktomi Corporation. The Inktomi Technology Behind Hot Bot (a White
Paper), 1996.
WWW location: http://www.inktomi.com/Tech/CoupClusterWhitePap.html
B. S. Jeong and E. Omiecinski. Inverted File Partitioning Schemes in Multi-
ple Disk Systems. IEEE Transactions on Parallel and Distributed Systems,
February 1995, Vol. 6, No. 2, pp. 142-153.
M. Lesk. Practical Digital Libraries: Books, Bytes, and Bucks, Morgan
Kaufmann Publishers, San Francisco, CA, 1997.
B. Liskov, S. Ghemawat, R. Gruber, P. Johnson, L. Shrira and M. Williams.
Replication in the Harp File System. Operating Systems Review: Proceed-
ings of the 13th ACM Symposium on Operating Systems Principles, Pacific
Grove, CA, October 1991, Vol. 25, No. 5, pp. 216-238.
Z. Lin. Cat: An Execution Model for Concurrent Full Text Search. In Int'l
Conference on Parallel and Distributed Information Systems, December 1991,
pp. 151-158.
[LS90] E. Levy and A. Silberschatz. Distributed File Systems: Concepts and Ex-
amples. ACM Computing Surveys, ACM Press, December 1990, Vol. 22,
No. 4, pp. 321-374.
In Sky and Telescope, July 1997, p. 44.
C. Stanfill. Partitioned Posting Files: A Parallel Inverted File Structure
for Information Retrieval. In ACM SIGIR Conference, September 1990,
pp. 413-428.
A. Tomasic and H. Garcia-Molina. Performance Issues in Distributed
Shared-Nothing Information-Retrieval Systems. Information Processing
and Management, 1996, Vol. 32, No. 6, pp. 647-665.
[WGSS96] J. Wilkes, R. Golding, C. Staelin and T. Sullivan. The HP AutoRAID
Hierarchical Storage System. ACM Transactions on Computer Systems, Vol.
14, No. 1, February 1996, pp. 108-136.
[WMB94] I. H. Witten, A. Moffat and T. C. Bell. Managing Gigabytes: Compressing
and Indexing Documents and Images, Van Nostrand Reinhold, New York,
1994.