
Chapter 6

Distributed File Systems

Chapter Objectives

A file system is a subsystem of an operating system whose purpose is to organize, retrieve, store, and allow sharing of data files. A distributed file system is a distributed implementation of the classical time-sharing model of a file system, where multiple users who are geographically dispersed share files and storage resources. Accordingly, the file service activity in a distributed system has to be carried out across the network, and instead of a single centralized data repository there are multiple and independent storage devices. The objectives of this chapter are to study the design issues and the different implementations of distributed file systems. In addition, we give an overview of the architecture and implementation techniques of some well-known distributed file systems such as the Sun Network File System (NFS), the Andrew File System (AFS), and Coda.

Keywords: NFS, AFS, CACHE, FILE CACHING, TRANSPARENCY, CONCURRENCY CONTROL, LOCUS, ETC…

6.1 Introduction

The file system is the part of an operating system that provides the computing system with the ability to permanently store, retrieve, share, and manipulate data. In addition, the file system might provide other important features such as automatic backup and recovery, user mobility, and support for diskless workstations. The file system can be viewed as a system that provides users (clients) with a set of services. A service is a software entity running on a single machine [Levy, 1990]. A server is the machine that runs the service. Consequently, the file system service is accessed by clients or users through a well-defined set of file operations (e.g., create, delete, read, and write). The server is the computer system and its storage devices (disks and tapes) on which files are stored and from which they are retrieved according to client requests. The UNIX time-sharing file system is an example of a conventional centralized file system. A Distributed File System (DFS) is a distributed implementation of the traditional time-sharing model of a file system that enables users to store and access remote files in a similar way to local files. Consequently, the clients, servers, and storage devices of a distributed file system are geographically dispersed among the machines of a distributed system.


The file system design issues have experienced changes similar to those observed in operating system design, mainly with respect to the number of processes and users that the system can support. Based on the number of processes and users, file systems can be classified into four types [Mullender, 1990]: 1) single-user/single-process file systems; 2) single-user/multiple-processes file systems; 3) multiple-users/multiple-processes centralized time-sharing file systems; and 4) multiple-users/multiple-processes geographically distributed file systems. The design issues in a single-user/single-process file system include how to name files, how to allocate files to physical storage, how to perform file operations, and how to maintain file system consistency against hardware and software failures. When we move to a single-user/multiple-processes file system, we also need to address concurrency control and how to detect and avoid deadlock situations that result from sharing resources. These issues become even more involved when we move to a multiple-users/multiple-processes file system. In this system, we need to address all the issues related to multiple concurrent processes as well as those related to protecting and securing user processes; the main security issues include user identification and authentication. In the most general type (the multiple-users/multiple-processes geographically distributed file system), the file system is implemented using a set of geographically dispersed file servers. The design issues here are more complex and challenging because the servers, clients, and network(s) that connect them are typically heterogeneous and operate asynchronously. In this type, which we refer to as a distributed file system, the file system services need to provide access and name transparency, fault tolerance, high availability, security, and high performance. The design of a distributed file system that supports all of these features in an efficient and cost-effective way is a challenging research problem.

6.2 File System Characteristics and Requirements

Client applications and their file system requirements vary from one application to another. Some applications run on only one type of computer, while others run on a cluster of computers. Each application type places different requirements on the file system. One can characterize the application requirements for a file system in terms of the file system role, file access granularity, file type, protection, fault tolerance, and recovery [svobodova]. • File System Role. The file system role can be viewed in terms of two extremes: 1)

Storing Device, and 2) Full-scale filing system. The storing device appears to the users as a virtual disk that is mainly concerned with storage allocation, maintenance of data objects on the storage medium, and data transfer between the network and the medium. The full-scale filing system provides all the services offered by the storing device plus additional functions such as controlling concurrent file accesses, protecting user files, enforcing the required access control, and a directory service that maps textual file names into file identifiers recognized by the system software.

• File Access Granularity. There are three main granularities to access data from the file system: 1) File-Level Storage and Retrieval, 2) Page (block)-Level Access, and 3)


Byte-Level Access. The file access granularity significantly affects the latency and the sustainable file access rate. The appropriate access granularity depends on the type of applications and system requirements; some applications need the file system to support bulk transfer of whole files, so the appropriate access mode is file-level access, while other applications require efficient random access to small parts within a file, for which the appropriate access mode could be byte-level.

• File Access Type. The file access mode can be broadly defined in terms of two access models: the Upload/download model and the Remote Access model (a toy sketch contrasting the two is given at the end of this section). In the Upload/download model, when a client reads a file, the entire file is moved to the client’s host before the read operation is carried out. When the client is done with the file, it is sent back to the server for storage. The advantage of this approach is that once the file is loaded in the client’s memory, all file accesses are performed locally without the need to access the network. However, the disadvantages of this approach are twofold: 1) it increases the load on the network due to downloading/uploading entire files, and 2) the client computer might not have enough memory to hold large files. Consequently, this approach limits the size of the files that can be accessed. Furthermore, experimental results showed that the majority of file accesses read only a few bytes and then close the file; the life cycle of most files is within a few seconds [zip parallel file system]. In the Remote Access model, each file access is carried out by a request sent through the network to the appropriate server. The advantages of this approach are that 1) users do not need large local storage in order to access the required files, and 2) the messages are small and can be handled efficiently by the network.

• Transparency. Ideally, a distributed file system (DFS) should look to its clients like a conventional, centralized file system. That is, the multiplicity and dispersion of servers and storage devices should be transparent to the clients. Transparency measures the system's ability to hide the geographic separation and heterogeneity of resources from the user and the application programmer, so that the system is perceived as a whole rather than as a collection of independent resources. The cost of implementing full transparency is prohibitive. Instead, several weaker forms of transparency have been introduced, such as network transparency and mobile transparency.

• Network transparency: Network transparency allows clients/users to access remote files using the same set of operations used to access local files. That means accessing remote and local files becomes indistinguishable to users and applications. However, the time it takes to access remote files is longer because of the network delay.

• Mobile Transparency: This transparency defines the ability of the distributed file system to allow users to log in to any machine available in the system, regardless of the users' locations; that is, the system does not force users to log in to specific machines. This transparency facilitates user mobility by bringing the users' environment (e.g., home directory) to wherever they log in.

• Performance. In a centralized file system, the time it takes to access a file depends on the disk access time and the CPU processing time. In a distributed file system, a remote file access involves two additional factors: the time it takes to transfer and process the file request at the remote server, and the time it takes to deliver the requested data from the server to the client/user. Furthermore, there is also the


overhead associated with running the communication protocol on the client and server computers. The performance of a distributed file system can be interpreted as another dimension of its transparency; the performance of remote file access should be comparable to that of local file access [Levy, 1990].

• Fault tolerance. A distributed file system is considered fault-tolerant if it can continue providing its services in a degraded mode when one or more of its components experience failures. The failures could be due to communication faults, machine failures, storage device crashes, or decays of the storage media. The degradation can be in performance, functionality, or both. Fault tolerance is achieved by using redundant resources and transactions. In addition to redundancy, atomic operations and immutable files (files that can only be read but not written) have been used to guarantee the integrity of the file system and to facilitate fault recovery.

• Scalability. It measures the ability of the system to adapt to increased load and/or to

addition of heterogeneous resources. The distributed file system performance should degrade gracefully (moderately) as the system and network traffic increase. In addition, adding new resources should be smooth, with little overhead (e.g., adding new machines should not clog the network or increase file access time).

It is important to emphasize that it is the distribution property of a distributed file system, with its inherent redundancy of resources, that makes the system fault-tolerant and scalable. Furthermore, the geographic dispersion of the system resources and activities must be hidden from the users and made transparent. Because of these characteristics, the design of a distributed file system is more complicated than that of a file system for a single-processor system. Consequently, the main issues emphasized in this chapter are transparency, fault tolerance, and scalability.
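To make the two file access models described above concrete, the following Python fragment sketches both against a toy in-memory server. It is only an illustration: the ToyFileServer class and its methods are invented for this example and do not correspond to any real distributed file system interface.

# Toy illustration of the Upload/download and Remote Access models.
class ToyFileServer:
    """A trivial in-memory 'remote' file store (hypothetical example)."""

    def __init__(self):
        self.files = {"notes.txt": b"hello distributed world"}

    # Upload/download model: whole files move between client and server.
    def download(self, name):
        return self.files[name]

    def upload(self, name, data):
        self.files[name] = data

    # Remote access model: every individual operation is shipped to the server.
    def read(self, name, offset, count):
        return self.files[name][offset:offset + count]

    def write(self, name, offset, data):
        content = bytearray(self.files.get(name, b""))
        content[offset:offset + len(data)] = data
        self.files[name] = bytes(content)


server = ToyFileServer()

# Upload/download: one bulk transfer, then all accesses are local.
local_copy = bytearray(server.download("notes.txt"))
local_copy[0:5] = b"HELLO"                     # local edit, no network traffic
server.upload("notes.txt", bytes(local_copy))  # whole file shipped back

# Remote access: each small request crosses the network.
print(server.read("notes.txt", 0, 5))          # b'HELLO'
server.write("notes.txt", 6, b"DISTRIBUTED")

The sketch makes the trade-off visible: the upload/download path touches the network twice regardless of how much of the file is used, while the remote access path sends one small message per operation.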

6.3 File Model And Organization

The file model addresses the issues related to how a file should be represented and what types of operations can be performed on it. The types of files range from unstructured byte-stream files to highly structured files. For example, a file can be structured as a sequence of records or simply as a stream of bytes. Consequently, different file systems have different models and different operations that can be performed on their files. Some file systems provide a single file model, such as the byte stream in the UNIX file system. Other file systems provide several file types (e.g., Indexed Sequential Access Method (ISAM) and record files in the VMS file system). Hence, a bitmap image would be stored as a sequence of bytes in the UNIX file system, while it might be stored as a one- or two-record file in the VMS file system. The organization of a file system can be described in terms of three modules or services (see Figure 6.1): 1) Directory Service, 2) File Service, and 3) Block Service. These services can be implemented as individual, independent co-operating components or all integrated into one software component. In what follows, we review the design issues related to each of these three modules.


Figure 6.1. File service architecture: application programs and a client module on the client computer; the flat file service and the directory service on the server computer.

6.3.1 Directory Service

The naming and mapping of files are provided by the directory service, which is an abstraction used for mapping between text names and file identifiers. The structure of the directory module is system dependent. Some systems combine the directory and file services into a single server that handles all directory operations and file calls. Others keep them separate, so that opening a file requires going to the directory server to map its symbolic name onto its binary name and then passing the binary name to the file server to actually read or write the file. In addition to naming files, the directory service controls file access using two techniques: capability-based and identity-based techniques.

1. Capability-based approach: This approach is based on using a reference or name that acts as a token or key to access each file. A process is allowed to access a file when it possesses the appropriate capability. For example, a Unique File Identifier (UFID) can be used as a key or capability to provide protection against unauthorized access.

2. Identity-based approach: This approach requires each file to have an associated list that shows the users and the operations they are entitled to perform on that file. The file server checks the identity of each entity requesting a file access to determine whether the requested file operation can be granted based on the user's access rights.
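The two protection techniques can be contrasted with a short Python sketch. The token format, user names, and rights below are hypothetical and are used only for illustration.

import secrets

# Capability-based control: possession of an unguessable UFID is the key.
capability_store = {}                     # UFID -> file contents

def create_with_capability(data):
    ufid = secrets.token_hex(16)          # sparse, hard-to-guess identifier
    capability_store[ufid] = data
    return ufid

def cap_read(ufid):
    # Any process presenting a valid UFID is served; no identity check is made.
    if ufid not in capability_store:
        raise PermissionError("invalid capability")
    return capability_store[ufid]

# Identity-based control: each file carries an access list checked per request.
files = {"report.txt": b"quarterly report"}
access_list = {"report.txt": {"alice": {"read", "write"}, "bob": {"read"}}}

def acl_read(user, name):
    if "read" not in access_list.get(name, {}).get(user, set()):
        raise PermissionError(f"{user} may not read {name}")
    return files[name]

token = create_with_capability(b"quarterly report")
print(cap_read(token))
print(acl_read("bob", "report.txt"))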



6.3.2 File Service

The file service is concerned with issues related to file operations, access modes, file state and how it is maintained during file operations, file caching techniques, file sharing, and file replication. • File Operation: The file operations include open, close, read, write, delete, etc. The

properties of these operations define and characterize the file service type. Certain applications might require all file operations to be atomic; an atomic operation either completes successfully in its entirety or is aborted without any side effect on the system state. Other applications require files to be immutable; that means files cannot be modified once they are created, and therefore the set of allowed operations includes read and delete but not write. Immutable files are typically used to simplify recovery from faults and consistency algorithms.

• File State: This issue concerns whether or not file, directory, and other servers

should maintain state information about clients. Based on this issue, the file service can be classified into two types: Stateful and Stateless File Service.

1. Stateful File Service: A file server is stateful when it keeps information on its client

state and then uses this information to process client file requests. This service can be characterized by the existence of a virtual circuit between the client and the server during a file access session. The advantage of stateful service is performance; file information is cached in main memory and can be easily accessed, thereby saving disk accesses. The disadvantage of stateful service is that the state information will be lost when the server crashes, which complicates the fault recovery of the file server.

2. Stateless Server: The server is stateless when it does not maintain any information on a client once it has finished processing its file requests. A stateless server avoids keeping state information by making each request self-contained. That is, each request identifies the file and the position of the read/write operation, so there is no need for the Open/Close operations that are required in a stateful file service.

The distinction between stateful and stateless service becomes evident when considering the effects of a crash during a service activity. When the server crashes, a stateful server usually restores its state by following an appropriate recovery protocol. A stateless server avoids this problem altogether by making every file request self-contained, so that it can be processed by any new, reincarnated server. Conversely, when a client fails, a stateful server needs to become aware of the failure in order to reclaim the space allocated to record the state of the crashed client; in a stateless file service, no obsolete state needs to be cleaned up on the server side. The penalty for using a stateless file service is longer request messages and slower processing of file requests, since there is no in-core information to speed up their handling.
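To illustrate why a stateless service survives server restarts, the following Python sketch models each read as a self-contained request carrying a file identifier, an offset, and a length. The request layout is invented for this example and is not the wire format of NFS or any other real protocol.

from dataclasses import dataclass

@dataclass(frozen=True)
class ReadRequest:
    # A self-contained request: nothing in it refers to server-side session state.
    ufid: str
    offset: int
    count: int

class StatelessServer:
    def __init__(self, storage):
        self.storage = storage              # persistent data survives crashes

    def handle(self, req: ReadRequest) -> bytes:
        data = self.storage[req.ufid]
        return data[req.offset:req.offset + req.count]

storage = {"ufid-42": b"the quick brown fox"}
req = ReadRequest("ufid-42", offset=4, count=5)

server = StatelessServer(storage)
print(server.handle(req))                   # b'quick'

# Simulated crash: a brand-new ("reincarnated") server instance can process
# the very same request, because the request carries everything it needs.
server = StatelessServer(storage)
print(server.handle(req))                   # still b'quick'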


The file service issues related to caching, replication, sharing, and fault tolerance will be discussed in further detail next.

6.3.3 Block Service

The block service addresses the issues related to disk block operations and allocation techniques. The block operations can be implemented either as a software module embedded within the file service or as a separate service. In some systems a network disk server (e.g., in the Sun UNIX operating system) provides access to remote disk blocks for swapping and paging by diskless workstations. Separating the block service from the file service offers two advantages: 1) it separates the implementation of the file service from disk-specific optimizations and other hardware concerns, allowing the file service to use a variety of disks and other storage media; and 2) it allows several different file systems to be implemented on top of the same underlying block service.
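A minimal Python sketch of this separation is given below: a block service that knows only about numbered, fixed-size blocks, and a file service layered on top of it. The interfaces and sizes are assumptions made for the example, not those of any particular system.

class BlockService:
    """Knows nothing about files: only numbered, fixed-size blocks."""
    BLOCK_SIZE = 8

    def __init__(self):
        self.blocks = {}                # block number -> bytes
        self.next_block = 0

    def allocate(self) -> int:
        num, self.next_block = self.next_block, self.next_block + 1
        self.blocks[num] = b"\x00" * self.BLOCK_SIZE
        return num

    def write_block(self, num: int, data: bytes) -> None:
        self.blocks[num] = data.ljust(self.BLOCK_SIZE, b"\x00")[:self.BLOCK_SIZE]

    def read_block(self, num: int) -> bytes:
        return self.blocks[num]


class FileService:
    """Maps whole files onto blocks; could sit on any block service implementation."""
    def __init__(self, blocks: BlockService):
        self.blocks = blocks
        self.index = {}                 # file name -> list of block numbers

    def write_file(self, name: str, data: bytes) -> None:
        nums = []
        for i in range(0, len(data), BlockService.BLOCK_SIZE):
            num = self.blocks.allocate()
            self.blocks.write_block(num, data[i:i + BlockService.BLOCK_SIZE])
            nums.append(num)
        self.index[name] = nums

    def read_file(self, name: str) -> bytes:
        return b"".join(self.blocks.read_block(n) for n in self.index[name])


fs = FileService(BlockService())
fs.write_file("a.txt", b"hello block service")
print(fs.read_file("a.txt").rstrip(b"\x00"))

Because the file service only calls allocate, read_block, and write_block, the same file-level code could, in principle, be reused over a different block store (local disk, network disk server, etc.).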

6.4 Naming and Transparency

In a DFS, a user refers to a file by a textual name. Naming is the means of mapping textual (logical) names to physical devices. There is a multilevel mapping from this name to the actual blocks on a disk at some location, which hides from the user the details of how and where the file is located in the network. Furthermore, to improve file system availability and fault tolerance, files can be stored on multiple file servers. In this case, the mapping between a logical file name and the actual physical name returns multiple physical locations, each containing a replica of the logical file. The mapping task in a distributed file system is more complicated than in a centralized file system because of the geographic dispersion of file servers and storage devices. The most naive approach to naming files is to append the local file name to the name of the host at which the file is stored, as is done in the VMS operating system. This scheme guarantees that all files have unique names even without consulting other file servers. However, the main disadvantage of this approach is that files cannot be migrated from one system to another; if a file must be migrated, its name needs to be changed and, furthermore, all the file's users must be notified. In this subsection, we discuss transparency issues related to naming files, naming techniques, and implementation issues.

6.4.1 Transparency Support

Ideally, a distributed file system (DFS) should look to its clients like a conventional, centralized file system. That is, the multiplicity and dispersion of servers and storage


devices should be transparent to the clients. Transparency measures the system's ability to hide the geographic separation and heterogeneity of resources from the user and the application programmer, so that the system is perceived as a whole rather than as a collection of independent resources. The cost of implementing full transparency is prohibitive. Instead, several weaker forms of transparency have been introduced. We discuss forms of transparency that attempt to hide location, the network, and mobility. In addition, we address the transparency issues related to file names and how to interpret them in a distributed computing environment. • Naming transparency: Naming is a mapping between logical and physical objects.

Usually, a user refers to a file by a textual name. The latter is mapped to a lower-level numerical identifier, which in turn is mapped to disk blocks. This multilevel mapping provides users with an abstraction of a file that hides the details of how and where the file is actually stored on disk. In a transparent DFS, a new dimension is added to the abstraction: that of hiding where in the network the file is located. Naming transparency can be interpreted in terms of two notions:

1. Location transparency: the name of a file does not reveal its physical storage location. 2. Location independence: the name of a file need not be changed when the file's physical storage location changes. A location-independent naming scheme is a dynamic mapping, since it can map the same file name to different locations at different instances of time. Therefore, location independence is a stronger property than location transparency. When referring to location independence, one implicitly assumes that the movement of files is totally transparent to users. That is, files are migrated by the system without the users being aware of it.

In practice, most of the current file systems (e.g., Locus, Sprite) provide a static, location transparent mapping for user-level names. Only Andrew and some experimental file systems support location independence.

• Network transparency: Clients need to access remote files using the same set of

commands used to access local files; that means there is no difference between the commands used to access local and remote files. However, the time it takes to access remote files will be longer because of the network delay. Network transparency hides the differences between accessing remote and local files so that they become indistinguishable to users and applications.

• Mobile Transparency: This transparency defines the ability of the distributed file system to allow users to log in to any machine available in the system, regardless of the users' locations; that is, the system does not force users to log in to specific machines. This transparency facilitates user mobility by bringing the users' environment to wherever they log in.

The file names should not reveal any information about the location of the files and furthermore their names should not be changed when the files are moved from one


storage location to another. Consequently, we can define two types of naming transparencies: Location Transparency and Location Independence. • Location Transparency. The name of a file does not reveal any information about its

physical storage location. With location transparency, the file name is statically mapped to a set of physical disk blocks, although this mapping is hidden from the users. It provides users with the ability to share remote files as if they were local. However, sharing the storage is complicated because the file name is statically mapped to particular physical storage devices. Most current file systems (e.g., NFS, Locus, Sprite) provide location-transparent mapping for file names [Levy, 1990].

• Location Independence. The name of a file need not be changed when it is required to change the file’s physical location. A location-independent naming scheme can be viewed as a dynamic mapping, since it can map the same file name to different locations at different instances of time. Therefore, location independence is a stronger property than location transparency. When referring to location independence, one implicitly assumes that the movement of files is totally transparent to users. That is, files are migrated by the system, without the users being aware of it, to improve the overall performance of the distributed file system by balancing the load on its file servers. Only a few distributed file systems support location independence (e.g., Andrew and some experimental file systems).

6.4.2 Implementation Mechanisms

We now review the main techniques used to implement naming schemes in distributed file systems: pathname translation, mounting, unique file identifiers, and hints. • Pathname Translation: In this method, a logical file is identified by a path name (e.g.,

/level1/level2/filename), which is translated by recursively looking up the low-level identifier for each directory in the path, starting from the root (/). If the identifier indicates that the sub-directory (or the file) is located on another machine, the lookup procedure is forwarded to that machine. This continues until the machine that stores the requested file has been identified. That machine then returns to the requesting client the low-level identifier of the file (filename) in its local file system. In some DFSs, such as NFS and Sprite, the file name lookup request is passed on from one server to another until the server that stores the requested file is found. In the Andrew file system, each step of the lookup procedure is performed by the client. This option is more scalable because the servers are relieved from performing the lookups needed to translate client file access requests.

• Mount Mechanisms: This scheme provides a means to attach remote file systems (or directories) to a local name space via a mount mechanism, as in Sun's NFS. Once a remote directory is mounted, its files can be named independently of their location. This approach enables each machine on the network to specify the part of the file name space (such as executables and users' home directories) that is shared with other machines, while keeping machine-specific directories local. Consequently, each user can access local and remote files according to his or her own naming


tree. However, this tree might be different from one computer to another and thus accessing any file is not independent of the location of file requests.

• Unique File Identifier. In this approach, there is a single global name space that is

visible to all machines and spans all the files in the system. Consequently, all files are accessed using this single global name space. This approach assigns each file a Unique File Identifier (UFID) that is used to address the file regardless of its location. In this method, each file is associated with a component unit, and all files in a component unit are located on the same storage. A file name is translated into a UFID that has two fields: the first field contains the component unit to which the file belongs, and the second field is a low-level identifier within the file system of that component unit. At run time, a table is maintained indicating the physical location of each component unit. Note that this method is truly location independent, since files are associated with component units whose actual location is unspecified, except at bind time. There are a number of ways to ensure the uniqueness of the UFIDs associated with different files. [Needham and Herbert, 1982; Mullender, 1985; Leach, 1983] all emphasized the use of a relatively large, sparsely populated space for generating UFIDs. To achieve uniqueness, a UFID can be built by concatenating a number of identifying fields, together with a random number for further security. This can be done by concatenating the host address of the server creating the file with a field representing the position of the UFID in the chronological sequence of UFIDs created by that server. An extra field containing a random number is embedded in each UFID in order to combat any possible attempt at counterfeiting. This ensures that the distribution of valid UFIDs is sparse and that the UFID is long enough to make unauthorized access practically impossible. Figure 6.2 shows the format of a UFID, represented here as a 12-bit record.

Figure 6.2. The UFID format, with a long identification number to prevent unauthorized access.

In the above format, the server identifier is taken to be an internet address, ensuring uniqueness across all registered internet-based systems. Access control to a file is based on the fact that a UFID constitutes a 'key' or capability for accessing the file; in effect, access control is a matter of denying UFIDs to unauthorized clients. When a file is shared within a group, the owner of the file holds all the rights on the file, i.e., he or she can perform all types of operations (read, write, delete, truncate, etc.), whereas the other members of the group hold lesser rights on the file, e.g., they can

[Example server table mapping UFIDs to file control blocks (FCBs):

UFID    FCB
1       23422   location on disk, size, ...
2       5465    ...
3       65842   ...]


only read the file, but they are not authorized to perform the other operations. The most refined form of access control is obtained by embedding a permission field in the UFID, which encodes the access rights that the UFID confers upon its possessor. The permission field must be combined with the random part before the UFID is given to users. The permission field must be protected carefully so that it is not easily accessible; otherwise the access rights could easily be changed, e.g., from read to write. Whenever a file is created, the UFID is returned to the owner (creator) of the file. When the owner gives access rights to other users, some rights are removed to restrict the capabilities of those users, by a function in the file server meant for restricting capabilities. Two ways in which a file server may hide its permission field are suggested in [Coulouris and Dollimore, 1988] (a sketch of the second approach is given at the end of this subsection):

1. The permission field and the random part are encrypted with a secret key before the UFID is issued to clients. When clients present UFIDs for file access, the file server uses the secret key to decrypt them.

2. The file server may encrypt the two fields by using a one-way function to produce UFIDs issued to clients. When clients present the UFIDs for file access, the file server applies the one-way function to its copy of the UFID and compares the result with the client's UFID.

• Hints: Hinting is a technique often used for quick translation of file names. A hint is a

piece of information that directly gives the location of the requested file; it speeds up performance if it is correct, and it does not cause any semantically harmful effects if it is incorrect. Looking up path names is a time-consuming procedure, especially if multiple directory servers are involved, so some systems attempt to improve their performance by maintaining a cache of hints. When a file is opened, the cache is checked to see if the path name is there; if so, the directory-by-directory lookup is skipped and the binary address is taken from the cache.
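The second way of hiding the permission field can be sketched in Python as follows. The field layout, the use of SHA-256 as the one-way function, and the helper names are assumptions for illustration only.

import hashlib
import secrets

def one_way(permissions: str, random_part: str) -> str:
    """One-way function applied to the permission and random fields."""
    return hashlib.sha256(f"{permissions}|{random_part}".encode()).hexdigest()

class ToyCapabilityServer:
    def __init__(self):
        self.table = {}   # file id -> (permissions, random part) kept server-side

    def create(self, file_id: str, permissions: str) -> str:
        random_part = secrets.token_hex(8)
        self.table[file_id] = (permissions, random_part)
        # The client is given only the one-way digest, never the raw fields.
        return one_way(permissions, random_part)

    def access(self, file_id: str, presented: str, op: str) -> bool:
        permissions, random_part = self.table[file_id]
        # Recompute the digest from the server's own copy and compare it with
        # what the client presents; a tampered or guessed token will not match.
        return presented == one_way(permissions, random_part) and op in permissions

server = ToyCapabilityServer()
token = server.create("report", permissions="rw")
print(server.access("report", token, "w"))       # True: valid capability
print(server.access("report", "forged", "w"))    # False: digest does not match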

6.5 File Sharing Semantics

The semantics of file sharing are important criteria for evaluating any file system that allows multiple clients to share files. When two or more users share the same file, it is necessary to define the semantics of reading and writing to avoid problems such as data inconsistency or deadlock. The most common types of sharing semantics are 1) UNIX Semantics, 2) Session Semantics, 3) Immutable Shared File Semantics, and 4) Transaction Semantics.


6.5.1 UNIX Semantics

Under this semantics, writes to an open file by a client are immediately visible to other (possibly remote) clients that have the same file open at the same time. When a READ operation follows a WRITE operation, the READ returns the value just written. Similarly, when two WRITEs happen in quick succession, followed by a READ, the value read is the value stored by the last write. It is possible for clients to share the pointer to the current file location, so advancing the pointer by one client affects all sharing clients. The system enforces an absolute time ordering on all operations and always returns the most recent value. In a distributed system, UNIX semantics can be easily achieved as long as there is only one file server and clients do not cache files; all READs and WRITEs go directly to the file server, which processes them strictly sequentially, and this approach gives UNIX semantics. The sharing of the location pointer is needed primarily for compatibility of the distributed UNIX system with conventional UNIX software. In practice, the performance of a distributed system in which all file requests must be processed by a single server is frequently poor. This problem is often solved by allowing clients to maintain local copies of heavily used files in their private caches.

6.5.2 Session Semantics

Under this semantics, writes to an open file are visible immediately to local clients but invisible to remote clients that have the same file open simultaneously. Once a file is closed, the changes made to it are visible only in sessions starting later; these changes are not reflected in already open instances of the file. Using session semantics raises the question of what happens if two or more clients simultaneously cache and modify the same file. When each client closes the file, its copy is sent back to the server, so the client that closes last overwrites the previous writes and the updates of the other clients are lost. The yellow pages provide a good analogy. Every year, the phone company produces one telephone book that lists business and customer numbers; it is a database that is updated once a year, so the granularity is annual. The yellow pages are not accurate during the year because they are updated only at the end of the "session". Accuracy in such an example is not a big issue, because most customers search for some business in the area rather than for one particular business; the application favors simplicity at the expense of accuracy.
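The lost-update behaviour described above can be demonstrated with a toy Python model in which every client copies the whole file on open and ships it back on close; the class and file names are invented for the example.

class SessionServer:
    def __init__(self):
        self.files = {"phonebook": "edition 2023"}

    def open(self, name):
        return self.files[name]             # whole file copied to the client

    def close(self, name, cached):
        self.files[name] = cached           # whole file written back on close


server = SessionServer()

# Two clients open the same file concurrently: each works on a private copy.
copy_a = server.open("phonebook")
copy_b = server.open("phonebook")

copy_a += " + entries added by A"
copy_b += " + entries added by B"

server.close("phonebook", copy_a)           # A's session ends first
server.close("phonebook", copy_b)           # B closes last...

# ...so B's copy silently overwrites A's updates (last close wins).
print(server.files["phonebook"])            # edition 2023 + entries added by B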

6.5.3 Immutable Shared File Semantics

Under this semantics, a file can be opened for reading only. That is, once a file is created and declared as shared by its creator, it cannot be modified any more; clients cannot open a file for writing. What a client can do if it has to modify a file is to create an entirely new file and enter it into the directory under the name of a previously existing file, which then becomes inaccessible. Just as with session semantics, when two processes try to replace the same file at the same time, either the latest one, or non-deterministically one of them, will be chosen to be the new file. This approach makes the file implementation quite simple, since sharing is in read-only mode.


6.5.4 Transaction Semantics

Under this semantics, the operations on a file or a group of files are performed indivisibly. This is done by having the process declare the beginning of a transaction using some type of BEGIN TRANSACTION primitive; this signals that what follows must be executed indivisibly. When the work has been completed, an END TRANSACTION primitive is executed. The key property of this semantics is that the system guarantees that all the calls contained within a transaction will be carried out in order, without any interference from other concurrent transactions. If two or more transactions start up at the same time, the system ensures that the final result is the same as if they were all run in some (undefined) sequential order.

6.6 Fault Tolerance And Recovery

Fault tolerance is an important attribute of a distributed system, and it can be supported because of the system's inherent multiplicity of resources. There are many methods to improve the fault tolerance of a DFS; improving availability and using redundant resources are two common techniques.

6.6.1 Improving Availability

A file is called available if it can be accessed whenever needed, despite machine and storage device crashes and communication faults. Before discussing availability further, we first define two related file properties: a file is recoverable if it is possible to revert it to an earlier, consistent state when an operation on the file fails or is aborted by the client, and a file is robust if it is guaranteed to survive crashes of the storage device and decays of the storage medium. Availability is often confused with robustness, probably because both can be implemented by redundancy techniques; however, a robust file is guaranteed to survive failures but may not be available until the faulty component has recovered. Availability is a fragile and unstable property. First, it is temporal: availability varies as the system's state changes. Second, it is relative to a client: a file may be available to one client while being unavailable to another client on a different machine. Replicating files can enhance availability [Thompson, 1931]; however, merely replicating files is not sufficient. Some principles intended to ensure increased availability of files are described below.


• The number of machines involved in a file operation should be minimal, since the probability of failure grows with the number of involved parties.

• Once a file has been located, there is no reason to involve machines other than the client and the server machines. However, identifying the server that stores the file and establishing the client-server connection is more problematic. The file location mechanism is an important factor in determining the availability of files. Traditionally, locating a file is done by a pathname traversal, which in a DFS may cross machine boundaries several times and hence involve more than two machines [Thompson, 1931]. In principle, most systems (e.g., Locus, Andrew) approach the problem by requiring that each component (i.e., directory) in the pathname be looked up directly by the client. Therefore, when machine boundaries are crossed, the server in the client-server pair changes, but the client remains the same.

• If a file is located by pathname traversal, the availability of a file depends on the availability of all the directories in its pathname. A situation can arise whereby a file might be available to reading and writing clients, but it cannot be located by new clients since a directory in its pathname is unavailable. Replicating top-level directories can partially rectify the problem and is indeed used in Locus to increase the availability of files.

• Caching directory information can both speed up the pathname traversal and avoid the problem of unavailable directories in the pathname (i.e., if caching occurs before the directory in the pathname becomes unavailable). Andrew uses this technique. A better mechanism is used by Sprite. In Sprite, machines maintain prefix tables that map prefixes of pathnames to the servers that store the corresponding component units. Once a file in some component unit is open, all subsequent Opens of files within that same unit address the right server directly, without intermediate lookups at other servers. This mechanism is faster and guarantees better availability.

6.6.2 File Replication

Replication of files is a useful scheme for improving availability, reducing communication traffic in a distributed system, and improving response time. Replication schemes can be classified into three categories: primary-stand-by, modular redundancy, and weighted voting [Yap, Jalote and Tripathi, 1988; Bloch, Daniels and Spector, 1987]. • Primary-stand-by: This scheme selects one copy from the replicas and designates it as the

primary copy, whereas the others are standbys. All subsequent requests are then sent to the primary copy only. The stand-by copies are not responsible for the service and are only synchronized with the primary copy periodically. In case of failure, one of the standby copies is selected as the new primary copy, and the service goes on.

• Modular Redundancy: This approach makes no distinction between the primary copy and standby ones. Requests are sent to all the replicas simultaneously, and these requests are performed by all copies. Therefore, a file request can be processed regardless of failures in networks and servers, provided that there exists at least one accessible correct copy. This approach is costly because synchronization must be maintained


between the replicas. When the number of replicas increases, availability decreases, since any update operation will lock all the replicas.

• Weighted Voting: In this scheme, all replicas of a file, called representatives, are assigned a certain number of votes. Access operations are performed on a collection of representatives, called an access quorum. Any access quorum that holds a majority of the total votes of all the representatives is allowed to perform the access operation. Such a scheme has maximum flexibility, since the size of the access quorum can change under various conditions. On the other hand, it may be too complicated to be feasible in most practical implementations.
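A rough Python sketch of the majority rule used in weighted voting follows; the replica names and vote weights are invented for the example.

# Toy weighted-voting check: an access quorum must hold a majority of all votes.
votes = {"replica1": 3, "replica2": 2, "replica3": 1, "replica4": 1}
total_votes = sum(votes.values())                        # 7 votes in total

def quorum_allowed(reachable_replicas) -> bool:
    """Allow the operation only if the reachable replicas hold a strict majority."""
    gathered = sum(votes[r] for r in reachable_replicas)
    return 2 * gathered > total_votes

print(quorum_allowed({"replica1", "replica2"}))          # True: 5 of 7 votes
print(quorum_allowed({"replica3", "replica4"}))          # False: 2 of 7 votes

Because any two majorities must intersect in at least one representative, every allowed operation sees at least one copy touched by the most recent allowed update.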

A variant model [Chung, 1991], which combines the modular redundancy and primary-stand-by approaches, provides more flexibility with respect to system configuration. This model divides all copies of a file into several partitions. Each partition functions as a modular redundancy unit. One partition is selected as primary and the other partitions are backups. In this manner, it strikes a balance in the trade-off between the modular redundancy and primary-stand-by approaches. An important issue in file replication is how to determine the file replication level and the allocation of the replicated copies necessary to achieve satisfactory system performance. There are three strategies for solving the file allocation problem (FAP). • Static File Allocation: In this strategy the replicas are statically allocated to specified

sites. Based on an assessment of file access activity levels, costs, and system parameter values, the problem involves allocating file copies over a set of geographically dispersed computers so as to optimize an optimality criterion while satisfying a variety of system constraints. Static file allocations are for systems which have a stable level of file access intensities. The optimality objectives used in the past include system operating costs, transaction response time, and system throughput. Essentially, static file allocation problems are formulated as combinatorial optimization models where the goal addresses the allocation tradeoffs in terms of the selected optimality criterion. Investigations of static file allocation problems were pioneered by W.W. Chu [Chu, 1969]. Since the FAP is NP-complete, much attention has been given to the development of heuristics that can generate good allocations with lower computational complexity. Branch-and-bound and graph searching methods are the typical solution techniques used to avoid enumeration of the entire solution space.

• Dynamic File Allocation: If a file system is characterized by high variability in

usage patterns, the use of static file allocation will degrade performance and increase costs throughout the operational period. Dynamic file allocation is based on anticipated changes in the file access intensities. Of course, the file reallocation costs incurred in this scheme have to be taken into consideration in the initial design process. The dynamic file allocation problem is one of determining file reallocation policies over time. File reallocations involve simultaneous creation, relocation, and deletion of file copies. Dynamic file allocation models can be classified as non-adaptive and adaptive. Initial research focused on non-adaptive models, while more recent studies have concentrated on adaptive policies. Most recent research on


adaptive models for the dynamic FAP achieves lower computational complexity by restricting reallocations to single-file reallocations only. To improve the applicability of the research results on dynamic FAPs, it is necessary to study the problem structure under realistic schemes for file relocation, in conjunction with effective control mechanisms, and to develop specialized heuristics for practical implementations.

• File Migration: This is also referred to as file mobility or location independence. The

main difference between the dynamic FAP and file migration is in the operations used to change a file assignment. Dynamic file reallocation models rely heavily on prior usage patterns of the system database. File migrations are not very sensitive to prior estimates of system usage patterns; they automatically react to temporary changes in access intensity by making the necessary adjustments in file locations without human management or operator intervention.

Dynamic FAP considers file reallocations that might involve reallocating multiple replicas. These major changes could result in system-wide interruptions of information services. In the file migration problem, each migration operation deals with only a single file copy. Evaluation of file migration policies has been investigated by several researchers [Gavish, 1990]. Since file migration deals with only a single file copy, an individual migration operation might be less effective than a complete file reallocation in improving system performance. However, selecting an optimal or near-optimal single operation is less complex than determining complete file reallocations. Therefore, file migration can be invoked more frequently, thereby responding to system changes more rapidly than file reallocation.

6.6.3 Recoverability

A file server should ensure that the files it holds remain accessible after a failure of the system. The effect of failure in a distributed system is much more pervasive than in its centralized counterpart, because the clients and the servers may fail independently; there is therefore a greater need to design a server that can restore data after a system failure and protect it from permanent loss. In both conventional and distributed systems, disk hardware and driver software can be designed to ensure that if the system crashes during a block write operation or a data transfer, partially written or incorrect data are detected. The use of stable storage in XDFS is worth mentioning here. Stable storage is redundant storage for structural information, implemented as a separate abstraction provided by the block service. It is basically a means to protect data from permanent loss after a system failure during a disk write operation or after damage to any single disk block. Operations on stable blocks are implemented using two disk blocks which hold the content of each stable block in duplicate. This scheme was developed by Lampson [Lampson, 1981], who defined a set of operations on stable blocks that mirror the block service operations; special block pointers distinguish the stable storage blocks from ordinary blocks. Generally it is expected that the


duplicates of the stable storage are stored on two different disk drives, to ensure that the blocks are not damaged simultaneously by a single failure; each block thus acts as a back-up to the other. The following invariant is maintained for each pair of blocks: • Not more than one block of the pair is bad. • If both are good, they hold the most recent data, except during the execution of a stable

put. The stable get operation reads one of the blocks using get block; the other representative is read when an error condition is detected. If a server crashes or halts during a stable put, a recovery process is invoked when the server is restarted. The recovery procedure maintains the invariant by inspecting the pair of blocks and doing the following: • Both good and the same: do nothing. • Both good but different: copy one block of the pair to the other block of the pair. • One good and one bad: copy the good block over the bad block.
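The stable put, stable get, and recovery behaviour sketched above can be illustrated roughly in Python. The checksum-based detection of "bad" blocks and the in-memory representation are simplifications introduced for this example and are not Lampson's actual disk layout.

import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class StableBlock:
    """One logical stable block held as two physical copies (ideally on two drives)."""
    def __init__(self, data: bytes = b""):
        self.copies = [(data, checksum(data)), (data, checksum(data))]

    def _good(self, i: int) -> bool:
        data, chk = self.copies[i]
        return checksum(data) == chk

    def stable_put(self, data: bytes) -> None:
        # Write the copies one after the other, never both at once, so a crash
        # in the middle leaves at least one good, consistent copy.
        self.copies[0] = (data, checksum(data))
        self.copies[1] = (data, checksum(data))

    def stable_get(self) -> bytes:
        # Read one copy; fall back to the other if an error is detected.
        for i in (0, 1):
            if self._good(i):
                return self.copies[i][0]
        raise IOError("both copies damaged")

    def recover(self) -> None:
        # Restore the invariant after a crash or detected media decay.
        good = [i for i in (0, 1) if self._good(i)]
        if len(good) == 1:                           # one good, one bad: copy over
            self.copies[1 - good[0]] = self.copies[good[0]]
        elif len(good) == 2 and self.copies[0] != self.copies[1]:
            self.copies[1] = self.copies[0]          # both good but different


block = StableBlock(b"inode table v1")
block.stable_put(b"inode table v2")
block.copies[0] = (b"garbled", checksum(b"x"))       # simulate a damaged copy
block.recover()
print(block.stable_get())                            # b'inode table v2'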

6.7 File Caching

Caching is a common technique used to reduce the time it takes for a computer to retrieve information. The term cache is derived from the French word cacher, meaning "to hide". Ideally, recently accessed information is stored in a cache so that a subsequent repeat access to that same information can be handled locally, without additional access time or burden on network traffic. When a request for information is made, the system's caching software takes the request and looks in the cache to see if the information is available; if so, it is retrieved directly from the cache. If it is not present in the cache, the file is retrieved directly from its source, returned to the user, and a copy is placed in cache storage. Caching has been applied to the retrieval of data from numerous secondary devices such as hard and floppy disks, computer RAM, and network servers. Caching techniques are used to improve the performance of file access. The performance gain that can be achieved depends heavily on the locality of references and on the frequency of read and write operations. In a client-server system, each with main memory and a disk, there are four potential places to store (cache) files: the server's disk, the server's main memory, the client's disk, and the client's main memory. The server's disk is the most straightforward place to store all files. Files on the server's disk are accessible to all clients, and since there is only one copy of each file, no consistency problems arise. However, the main drawback is performance. Before a client can read a file, the file must first be transferred from the server's disk to the server's main memory, and then transferred over the network to the client's main memory.
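The look-then-fetch behaviour just described is the basic caching pattern; a minimal Python sketch follows, with an invented fetch_from_server function standing in for the remote file server.

cache = {}                      # file name -> data cached on the client

def fetch_from_server(name: str) -> bytes:
    """Stand-in for a (slow) network request to the file server."""
    print(f"fetching {name} over the network")
    return b"contents of " + name.encode()

def read_file(name: str) -> bytes:
    if name in cache:           # cache hit: no network traffic, no server load
        return cache[name]
    data = fetch_from_server(name)
    cache[name] = data          # keep a copy for subsequent accesses
    return data

read_file("report.txt")         # miss: goes to the server
read_file("report.txt")         # hit: served from the local cache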


Figure 2. Four file storage structures: Case A, storage on the server's disk; Case B, storage in the server's main memory; Case C, storage in the client's main memory; and Case D, storage on the client's disk.

Caches have been used in many operating systems to improve file system performance. Repeated accesses to a block in the cache can be handled without involving the disk. This feature has two advantages. First, caching reduces delays; a block in the cache can usually be returned to a waiting process five to ten times more quickly than one that must be fetched from the disk. Second, caching reduces contention for the disk arm, which may be advantageous if several processes attempt to access files on the same disk simultaneously. However, since main memory is invariably smaller than the disk, when the cache fills up, some of the currently cached blocks must be replaced. If an up-to-date copy exists on the disk, the cache copy of the replaced block is simply discarded; otherwise, the disk is first updated before the cached copy is discarded. A caching scheme in a distributed file system should address the following design decisions:

• The granularity of cached data.
• The location of the client's cache (main memory or local disk).
• How to propagate modifications of cached copies.
• How to determine if a client's cached data is consistent.

The choices for these decisions are intertwined and related to the selected sharing semantics.
6.7.1 Cache Unit Size: The unit of cached data can be either individual pages (blocks) of a file or the entire file. For access patterns that have strong locality of reference, caching a large part of the file results in a high hit ratio, but at the same time the potential for consistency problems also increases.


Furthermore, if the entire file is cached, it can be stored contiguously on the disk (or at least in several large chunks), allowing high-speed transfers between the disk and memory and thus improving performance. Caching entire files also offers other advantages, such as fault tolerance: remote failures are visible only at the time of open and close operations, which supports disconnected operation of clients that already have the file cached. Whole-file caching also simplifies cache management, since clients only have to keep track of files and not individual pages. However, caching entire files has two drawbacks. First, files larger than the local storage space (disk or main memory) cannot be cached. Second, the latency of open requests is proportional to the size of the file and can be intolerable for large files.

If parts (blocks) of a file are stored in the cache, the cache and disk space are used more efficiently. This scheme uses a read-ahead technique, which reads blocks from the server disk and buffers them on both the server and client sides before they are actually needed, in order to speed up reading. Increasing the caching unit size increases the likelihood that the data for the next access will be found locally (i.e., the hit ratio is increased); on the other hand, the time required for the data transfer and the potential for consistency problems are also increased. Selecting the unit size of caching depends on the network transfer unit and the communication protocol being used.
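A rough sketch of block caching with read-ahead follows. The block size, the prefetch depth, and the read_remote_block callable are assumptions made for the example and do not correspond to any particular system's interface.

```python
# Illustrative sketch of block caching with read-ahead.

class BlockCache:
    def __init__(self, read_remote_block, block_size=8192, read_ahead=2):
        self.read_remote = read_remote_block   # hypothetical fetch of one block from the server
        self.block_size = block_size
        self.read_ahead = read_ahead
        self.blocks = {}                       # (path, block_no) -> bytes

    def read(self, path, offset, length):
        first = offset // self.block_size
        last = (offset + length - 1) // self.block_size
        # Fetch the needed blocks plus a few following ones (read-ahead),
        # so that a sequential reader finds the next request already cached.
        for no in range(first, last + 1 + self.read_ahead):
            if (path, no) not in self.blocks:
                self.blocks[(path, no)] = self.read_remote(path, no)
        data = b"".join(self.blocks[(path, no)] for no in range(first, last + 1))
        start = offset - first * self.block_size
        return data[start:start + length]
```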

Earlier versions of the Andrew file system (AFS-1 and AFS-2), Coda, and Amoeba cached entire files. AFS-3 uses partial-file caching, but this has not demonstrated substantial advantages in usability or performance over the earlier versions. When caching is done at large granularity, considerable performance improvement can be obtained by the use of specialized bulk transfer protocols, which reduce the latency associated with transferring the entire file.

6.7.2 Cache Location: The cache can be located at the server side, at the client side, or both, and it can reside either in main memory or on disk. Server-side caching eliminates a disk access on each request, but it still requires using the network to reach the server. Caching at the client side can avoid using the network altogether.

Disk caches have a clear advantage in reliability and scalability. Modifications to cached data are not lost when the system crashes, and there is no need to fetch the data again during recovery. Disk caches also contribute to scalability by reducing network traffic and server load after client crashes. In Andrew and Coda, the cache is on the local disk, with a further level of caching provided by the UNIX kernel in main memory. On the other hand, caching in main memory has four advantages. First, main-memory caches permit workstations to be diskless, which makes them cheaper and quieter. Second, data can be accessed more quickly from a cache in main memory than from a cache on the disk. Third, physical memories on client workstations are now large enough to provide high hit ratios. Fourth, the server caches will be in main memory regardless of where the client caches are located. Thus main-memory caches emphasize reduced access time, while disk caches emphasize increased reliability and autonomy of individual machines. If the designers decide to put the cache in the client's main memory, three options are possible, as shown in Figure 3.

1. Caching within each process. The simplest way is to cache files directly inside the address space of each user process. Typically, the cache is managed by the system call library: as files are opened, closed, read, and written, the library simply keeps the most heavily used ones around, so that when a file is reused it may already be available. When the process exits, all modified files are written back to the server. Although this scheme has extremely low overhead, it is only effective if individual processes open and close files repeatedly.

2. Caching in the kernel. The kernel can dynamically decide how much memory to reserve for programs and how much for the cache. The disadvantage is that a kernel call is needed on every cache access, even on cache hits.

3. The cache manager as a user process. The advantage of a user-level cache manager is that it keeps the microkernel operating system free of file system code. In addition, the cache manager is easier to program because it is completely isolated, and it is more flexible.
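As a toy illustration of option 1 above, the sketch below keeps a per-process cache inside the file access library and writes modified files back when the process exits. The remote_read and remote_write callables are hypothetical stand-ins for the RPCs a real client library would issue to the file server.

```python
# Toy sketch of a per-process library cache with write-back on process exit.

import atexit

class LibraryFileCache:
    def __init__(self, remote_read, remote_write):
        self.remote_read = remote_read
        self.remote_write = remote_write
        self.files = {}              # name -> file contents
        self.dirty = set()           # names modified since they were fetched
        atexit.register(self.flush)  # write modified files back when the process exits

    def read(self, name):
        if name not in self.files:
            self.files[name] = self.remote_read(name)
        return self.files[name]

    def write(self, name, data):
        self.files[name] = data
        self.dirty.add(name)         # defer the remote write

    def flush(self):
        for name in self.dirty:
            self.remote_write(name, self.files[name])
        self.dirty.clear()
```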

• Write Policy: The write policy determines the way modified cache blocks (dirty blocks) are written back to their files on the server. The write policy has a critical effect on the system's performance and reliability. There are three write policies: write-through, delayed-write, and write-on-close.

1. Write-through: data is written through to the disk as soon as it is placed in any cache. A write-through policy is equivalent to using remote service for writes and exploiting caches for reads only. This policy has the advantage of reliability, since little data is lost if the client crashes. However, it requires each write access to wait until the information is written to the disk, which results in poor write performance. Write-through and variations of delayed-write policies are used to implement UNIX-like sharing semantics.

2. Delayed-write: blocks are initially written only to the cache and are written through to the disk or server some time later. This policy has two advantages over write-through. First, since writes go to the cache, write accesses complete much more quickly. Second, data may be deleted before it is written back, in which case it need not be written at all. Thus a policy that delays writes for several minutes can substantially reduce the traffic to the server or disk. Unfortunately, delayed-write schemes introduce reliability problems, since unwritten data is lost whenever a server or client crashes. The Sprite file system uses this policy with a 30-second delay interval.


3. Write-on-close: data is written back to the server when the file is closed. The write-on-close policy is suitable for implementing session semantics, but it fails to give a considerable performance improvement for files that are open only a short while, and it increases the latency of close operations. This approach is used in the Andrew file system and the Network File System (NFS).

There is a tight relation between the write policy and the semantics of file sharing. Write-on-close is suitable for session semantics. When concurrent updates to files are frequent and UNIX semantics are required, the use of delayed-write results in long delays; a write-through policy is more suitable for UNIX semantics under such circumstances.
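The following sketch contrasts the three write policies. The 30-second flush interval mirrors the delayed-write description above; the server object, exposing a write(path, data) method, is an assumption made for the example.

```python
# Sketch contrasting write-through, delayed-write, and write-on-close.

import threading
import time

class WriteThroughCache:
    def __init__(self, server):
        self.server, self.cache = server, {}
    def write(self, path, data):
        self.cache[path] = data
        self.server.write(path, data)        # propagate immediately: reliable but slow

class DelayedWriteCache:
    def __init__(self, server, interval=30):
        self.server, self.cache, self.dirty = server, {}, set()
        threading.Thread(target=self._flusher, args=(interval,), daemon=True).start()
    def write(self, path, data):
        self.cache[path] = data
        self.dirty.add(path)                 # propagate later: fast, but data may be lost
    def _flusher(self, interval):
        while True:
            time.sleep(interval)
            for path in list(self.dirty):
                self.server.write(path, self.cache[path])
                self.dirty.discard(path)

class WriteOnCloseCache:
    def __init__(self, server):
        self.server, self.cache, self.open_dirty = server, {}, set()
    def write(self, path, data):
        self.cache[path] = data
        self.open_dirty.add(path)
    def close(self, path):
        if path in self.open_dirty:          # propagate at close: session semantics
            self.server.write(path, self.cache[path])
            self.open_dirty.discard(path)
```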

6.7.3 Client Cache Coherence in DFS
Cache coherence means that a read always returns the latest copy of the file. In centralized UNIX systems, the user can always access the latest update; in a distributed computing environment, the coherence problem arises especially when caching is done at the client's machine. One way to relax the problem is to base coherence on sessions rather than on individual read and write operations: the client opens the file, updates it, and then closes it, and only then are the updates made visible. Propagating every individual user update, by contrast, generates high traffic. The following methods are used to maintain coherence (according to a model, e.g., UNIX semantics or session semantics) of copies of the same file at various clients:
• Write-through: writes are sent to the server as soon as they are performed at the client. This causes high traffic and requires cache managers to check the modification time with the server before they can provide cached content to any client.
• Delayed write: coalesces multiple writes; better performance but ambiguous semantics.
• Write-on-close: implements session semantics.
• Central control: the file server keeps a directory of open and cached files at the clients. This supports UNIX semantics but raises problems with robustness and scalability; invalidation messages are also problematic because clients did not solicit them.
6.7.4 Cache Validation and Consistency: Cache validation is required to find out whether the data in the cache is a stale copy of the master copy. If the client determines that its cached data is out of date, future accesses can no longer be served by that cached data, and an up-to-date copy must be brought over from the file server. There are basically two approaches to verifying the validity of cached data:
1. Client-initiated approach. A client-initiated approach to validation involves contacting the server to check whether both have the same version of the file. Checking is usually done by comparing header information such as a time-stamp of updates or a version number (e.g., the time stamp of the last update, which is maintained in the i-node information in UNIX). The frequency of the validity check is the crux of this approach; it can vary from a check on every access to a check initiated over a fixed interval of time. Depending on its frequency, this kind of validity check can cause severe network traffic and consume precious server CPU time, and when it is performed on every access, the file access experiences more delay than an access served immediately from the cache. This phenomenon is what caused the Andrew designers to withdraw from this approach.

2. Server-initiated approach. In the server-initiated approach, whenever a client caches an object, the server hands out a promise (called a callback or a token) that it will inform the client before allowing any other client to modify that object. This approach enhances performance by reducing network traffic, but it also increases the responsibility of the server in direct proportion to the number of clients being served, which is not a good feature for scalability. The server records, for each client, the (parts of) files the client caches; maintaining information about clients has significant fault-tolerance implications. A potential for inconsistency occurs when a file is cached in conflicting modes by two different clients (i.e., at least one of the clients specified a write mode). If session semantics is implemented, whenever a server receives a request to close a file that has been modified, it should react by notifying the clients to discard their cached data and consider it invalid. Clients having the file open at that time discard their copy when the current session is over; other clients discard their copy at once. Under session semantics, the server need not be informed about Opens of already cached files, but it is informed about the Close of a writing session. On the other hand, if a more restrictive sharing semantics is implemented, such as UNIX semantics, the server must be more involved: it must be notified whenever a file is opened, and the intended mode (Read or Write) must be indicated. Assuming such notification, the server can act when it detects that a file has been opened simultaneously in conflicting modes by disabling caching for that particular file (as done in Sprite); disabling caching results in switching to a remote-access mode of operation. The problem with the server-initiated approach is that it violates the traditional client-server model, where clients initiate activities by requesting the desired services. Such a violation can result in irregular and complex code for both clients and servers.
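The two validation approaches can be sketched side by side as below. Timestamps stand in for whatever version information is kept with cached data, and the Server and Client classes are illustrative only; they are not the interface of any particular DFS.

```python
# Hedged sketch of client-initiated validation (timestamp check on use) and
# server-initiated validation (callback promises broken before an update).

class Server:
    def __init__(self):
        self.files = {}        # path -> (mtime, data)
        self.callbacks = {}    # path -> set of clients holding a callback promise

    def fetch(self, path, client):
        self.callbacks.setdefault(path, set()).add(client)  # record the callback promise
        return self.files[path]

    def mtime(self, path):
        return self.files[path][0]                           # used by client-initiated checks

    def store(self, path, mtime, data, writer):
        # Before accepting the update, break the callbacks held by other clients.
        for client in self.callbacks.get(path, set()) - {writer}:
            client.invalidate(path)
        self.callbacks[path] = {writer}
        self.files[path] = (mtime, data)

class Client:
    def __init__(self, server):
        self.server, self.cache = server, {}

    def read_client_initiated(self, path):
        # Validate on use: compare the cached timestamp with the server's.
        if path in self.cache and self.cache[path][0] == self.server.mtime(path):
            return self.cache[path][1]
        self.cache[path] = self.server.fetch(path, self)
        return self.cache[path][1]

    def read_with_callbacks(self, path):
        # Trust the cache as long as the callback promise has not been broken.
        if path not in self.cache:
            self.cache[path] = self.server.fetch(path, self)
        return self.cache[path][1]

    def invalidate(self, path):
        self.cache.pop(path, None)
```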

The implementation techniques for cache consistency checking depend on the semantics used for sharing files. Caching entire files is a perfect match for session semantics: read and write accesses within a session can be handled by the cached copy, since the file can be associated with different images according to the semantics. The cache consistency problem is then reduced to propagating the modifications performed in a session to the master copy at the end of the session. This model is quite attractive because its implementation is simple. Observe that coupling this semantics with caching parts of files may complicate matters, since a session is supposed to read the image of the entire file that corresponds to the time it was opened.


A distributed implementation of UNIX semantics using caching has serious consequences. The implementation must guarantee that at all times only one client is allowed to write to any of the cached copies of the same file. A distributed conflict-resolution scheme must be used to arbitrate among clients wishing to access the same file in conflicting modes. In addition, once a cached copy is modified, the changes must be propagated immediately to the rest of the cached copies. Frequent writes can generate tremendous network traffic and cause long delays before requests are satisfied. This is why some implementations (e.g., Sprite) disable caching altogether and resort to remote service once a file is concurrently open in conflicting modes. Observe that such an approach implies some form of a server-initiated validation scheme, in which the server makes a note of all Open calls. As was stated, UNIX semantics lends itself to an implementation in which all requests are directed to and served by a single server.

The immutable-shared-files semantics eliminates the cache consistency problem entirely, since the files cannot be written. Transaction-like semantics can be implemented in a straightforward manner using locking: all requests for the same file are served by the same server on the same machine, as is done in remote service.

For session semantics, cache consistency can easily be implemented by propagating changes to the master copy after the file is closed. For implementing UNIX semantics, a write to a cached copy has to be propagated not only to the server but also to the other clients holding a stale copy in their caches. This may lead to poor performance, which is why many DFSs (such as Sprite) switch to remote service when a client opens a file in a conflicting mode. Write-back caching is used in Sprite and Echo. Andrew and Coda use a write-through policy for implementation simplicity and to reduce the chances of server data being stale due to client crashes. Both systems use deferred write-back while operating in disconnected mode, during server or network failures.

Maintaining cache coherence is unnecessary if the data in the cache is treated as a hint and is validated upon use. File data in a cache cannot be used as a hint, since the use of a cached copy would not reveal whether it is current or stale. Hints are most often used for file location information in a DFS: Andrew, for instance, caches individual mappings of volumes to servers, and Sprite caches mappings of pathname prefixes to servers.

Caching can thus handle a substantial number of remote accesses efficiently. This leads to performance transparency, and client caching is also believed to be one of the main contributing factors towards fault tolerance and scalability. Effective use of caching requires studying the usage properties of files. For instance, write-through could be used if it is known that sequential write-sharing of user files is uncommon. Executables are frequently read but rarely written, and they are therefore very good candidates for caching. In a distributed system, it may be very costly to enforce transaction-like semantics, as required by databases, which exhibit poor locality, fine granularity of update and query, and frequent concurrent and sequential write sharing. In such cases, it is best to provide explicit means outside the scope of the DFS. This is the approach followed in the Andrew and Coda DFSs.

6.7.5 Comparison of Caching and Remote Service
The choice between caching and remote service is a choice between potential for improved performance and simplicity. The following are the advantages and disadvantages of the two methods.
• When a caching scheme is used, a substantial amount of remote access can be handled efficiently by the local cache; in a DFS, the goal of such a scheme is to reduce network traffic. With remote access, there is excessive overhead in network traffic and an increase in server load.

• The total network overhead of transmitting big chunks of data, as done in caching, is lower than that of transmitting a series of short responses to specific requests.

• The cache consistency problem is the major drawback to caching. When writes are frequent, the consistency problems incur substantial overhead in terms of performance, network traffic, and server load.

• To use caching and benefit from it, clients must have either local disks or large main memories. Clients without disks can use the remote-service method without any problems.

• In caching, data is transferred in bulk between the server and client, and not in response to the specific needs of a file operation. Therefore, the interface of the server is quite different from that of the client. In remote service, on the other hand, the server interface is just an extension of the local file system interface across the network.

• It is hard to emulate the sharing semantics of a centralized system (UNIX sharing semantics) in a system using caching. With remote service, it is easier to implement and maintain UNIX sharing semantics.

6.8 Concurrency Control

6.8.1 Transactions in a Distributed File System
The term atomic transaction describes a single client carrying out a sequence of operations on a shared file without interference from another client. The net result of every transaction must be the same as if each transaction were performed at a completely separate instant of time. An atomic transaction (in a file service) thus enables a client program to define a sequence of operations on a file that is free from interference by any other client program, guaranteeing a consistent result. A file server that supports transactions must synchronize the operations to ensure this. Moreover, if a file undergoing modification by the file service experiences an unexpected server or client process halt, due to a hardware error or a software fault, before the transaction is completed, the server ensures the subsequent restoration of the file to the state it had before the transaction started. Though it is a sequence of operations, an atomic transaction can be viewed as a single-step operation from the client's point of view, taking the file from one stable state to another: either the transaction completes successfully or the file is restored to its original state. An atomic transaction must satisfy two criteria to prevent conflicts between two client processes requesting operations on the same data item concurrently. First, it should be recoverable. Second, the concurrent execution of several atomic transactions must be serially equivalent, i.e., the effect of several concurrent transactions must be the same as if they were executed one at a time. To ensure atomicity, concurrency control is performed via locking, time stamping, optimistic concurrency control, etc., the details of which are explained later in the concurrency control chapter.
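The two criteria can be illustrated with the minimal sketch below: a per-file lock serializes conflicting transactions (serial equivalence), and an undo copy restores the original contents if the transaction aborts (recoverability). This is only an illustration under those assumptions, not a real transaction system.

```python
# Minimal sketch of an atomic, recoverable file transaction.

import threading

class TransactionalFile:
    def __init__(self, data=b""):
        self.data = data
        self.lock = threading.Lock()

    def run_transaction(self, operations):
        """operations is a callable receiving and returning the file contents."""
        with self.lock:                          # one transaction at a time: serially equivalent
            undo_copy = self.data                # saved state for recovery
            try:
                self.data = operations(self.data)   # the whole sequence, seen as one step
            except Exception:
                self.data = undo_copy            # abort: restore the original state
                raise

# f = TransactionalFile(b"balance=10")
# f.run_transaction(lambda contents: contents.replace(b"10", b"20"))
```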

6.9 Security and Protection
To encourage sharing of files between users, the protection mechanism should allow a wide range of policies to be specified. As discussed for the directory service, there are two important techniques for controlling access to files: 1) capability-based access and 2) access control lists. A client process that has a valid Unique File Identifier (UFID) can use the file service to access the file, with the help of the directory service, which stores mappings from users' names for files to UFIDs. When a user or a service passes the authentication check (such as a name or password check), it is given a UFID, which generally contains a large, sparse number to reduce the chance of counterfeiting. UFIDs are issued after the user or the service is registered in the system. Authentication is done by the authentication service, which maintains a table of user names, service names, passwords, and the corresponding user identifiers (IDs). Each file has an owner (initially the creator), whose password is stored in the attributes of the created file and is subsequently used by the identity-based file access control scheme. An access control list contains the user IDs of all the users who are entitled to access the file directly or indirectly. Generally, the owner of a file can perform all file operations using the file service, while other users have lesser access to the same file (e.g., read-only). The users of any file can be classified, based on their requirements and their need to access a given file, as follows:
• The file's owner.
• The directory service, which is responsible for controlling access and mapping the file from its text names.
• A client who is given special permission by the owner to access the file and manage its contents, and who is thereby recognized by the system manager.


• All other clients.
In large distributed systems, simple extensions of the mechanisms used by time-sharing operating systems are not sufficient. For example, some systems implement authentication by sending a password to the server, which then validates it. Besides being risky, this leaves the client with no assurance of the identity of the server. Security can instead be built on the integrity of a relatively small number of servers, rather than a large number of clients, as is done in the Andrew file system. There, the authentication function is integrated with the RPC mechanism. When a user logs on, the user's password is used as a key to establish a connection with an authentication server. This server hands the user a pair of authentication tokens, which the user then uses to establish secure RPC connections with any other server. Tokens expire periodically, typically in 24 hours. When making an RPC call to a server, the client supplies a variable-length identifier and the encryption key to the server. The server looks up the key to verify the identity of the client; at the same time, the client is assured that the server is capable of looking up its key and hence can be trusted. Randomized information in the handshake guards against replays by suspicious clients. It is important that the authentication servers and file servers run on physically secured hardware and safe software. Furthermore, there may be multiple redundant instances of the authentication server, for greater availability.
Access Rights
In a DFS, there is more data to protect and a larger number of users. The access privileges provided by the native operating systems are either inadequate or absent, so some DFSs, such as Andrew and Coda, maintain their own schemes for deciding access rights. Andrew implements a hierarchical access-list mechanism, in which a protection domain consists of users and groups. Membership privileges in a group are inherited, and a user's privileges are the sum of the privileges of all the groups that he or she belongs to. Privileges are also specified for a unit of the file system, such as a directory, rather than for individual files. Both of these factors simplify the state information that must be maintained. Negative access rights can also be specified, for quick removal of a user from critical groups. In case of conflicts, negative rights overrule positive rights.
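A hedged sketch of an access-rights check in this spirit follows: rights are attached to directories, a user's positive rights are the union over all groups reached transitively, and negative rights overrule positive ones. The group names and right names ("read", "write") are illustrative only.

```python
# Sketch of a hierarchical access-list check with negative rights.

def groups_of(user, membership):
    """Transitive closure of group membership (membership: name -> set of parent groups)."""
    seen, frontier = set(), {user}
    while frontier:
        name = frontier.pop()
        for parent in membership.get(name, set()):
            if parent not in seen:
                seen.add(parent)
                frontier.add(parent)
    return {user} | seen

def effective_rights(user, acl_positive, acl_negative, membership):
    principals = groups_of(user, membership)
    allowed = set().union(*(acl_positive.get(p, set()) for p in principals))
    denied = set().union(*(acl_negative.get(p, set()) for p in principals))
    return allowed - denied          # negative rights overrule positive rights

# Example:
# membership   = {"alice": {"staff"}, "staff": {"all-users"}}
# acl_positive = {"staff": {"read", "write"}}
# acl_negative = {"alice": {"write"}}
# effective_rights("alice", acl_positive, acl_negative, membership)  -> {"read"}
```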

6.10 Case Studies

6.10.1 SUN Network File System (NFS)


The Network File System (NFS) has been designed, specified, and implemented by Sun Microsystems Inc. since 1985.

Overview
NFS views a set of interconnected workstations as a set of independent machines with independent file systems. It allows some degree of sharing among these file systems, in a transparent manner, based on a client-server relationship. A machine may be both a client and a server, and sharing is allowed between any pair of machines, not only with dedicated server machines. Consistent with the independence of machines, sharing a remote file system affects only the client and no other machine. Hence there is no notion of a globally shared file system as in Locus, Sprite, and Andrew.

Advantages of Sun's NFS
• It supports diskless Sun workstations entirely by way of the NFS protocol.
• It provides the facility for sharing files in a heterogeneous environment of machines, operating systems, and networks. Sharing is accomplished by mounting a remote file system, then reading or writing files in place.
• It is open-ended: users are encouraged to interface it with other systems. It was not designed by extending SunOS onto the network; instead, operating system independence was taken as an NFS design goal, along with machine independence, simple crash recovery, transparent access, maintenance of UNIX file system semantics, and reasonable performance. These advantages have made NFS a standard in the UNIX industry today.

NFS Description
NFS provides transparent file access among computers of different architectures over one or more networks and keeps the differences between file structures and operating systems transparent to users. A brief description of the salient points is given below. NFS is a set of primitives that defines the operations that can be performed on a distributed file system. The protocol is defined in terms of a set of Remote Procedure Calls (RPCs), their arguments and results, and their effects.

NFS protocol
1. RPC and XDR: The RPC mechanism is implemented as a library of procedures plus a specification for portable data transmission, known as the External Data Representation (XDR). Together with RPC, XDR provides a standard I/O library for interprocess communication. The RPCs used to define the NFS protocol are null(), lookup(), create(), remove(), getattr(), setattr(), read(), write(), rename(), link(), symlink(), readlink(), mkdir(), rmdir(), readdir(), and statfs(). The most common NFS procedure parameter is a structure called a file handle, which is provided by the server and used by the client to reference files; a call is retried until the packet gets through.

2. Stateless protocols: The NFS protocol is stateless because each transaction stands on its own. The server does not keep track of any past client requests.


3. Transport independence: A new transport protocol can be plugged into the RPC implementation without affecting the higher-level protocol code. In the current implementation, NFS uses UDP/IP as the transport protocol.

The UNIX operating system does not guarantee that internal file identifiers are unique across a local area network: in a distributed system, a file or file system identifier may coincide with that of another file on a remote system. To solve this problem, Sun added a new file system interface to the UNIX operating system kernel that can uniquely locate and identify both local and remote files. The interface consists of the Virtual File System (VFS) interface and the virtual node (vnode) interface; instead of the i-node, the operating system deals with the vnode. When a client makes a request to access a file, the request goes through the VFS, which uses the vnode to determine whether the file is local or remote. If it is local, the vnode refers to the i-node and the file is accessed like any other UNIX file. If it is remote, the file handle is identified and used with the RPC protocol to contact the remote server and obtain the required file. VFS defines the procedures and data structures that operate on the file system as a whole, and the vnode interface defines the procedures that operate on an individual file within that file system type.
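The local/remote dispatch at the VFS layer can be sketched as below. LocalInode, NfsClientStub, and the rpc_call parameter are hypothetical classes introduced for the illustration; they are not the actual SunOS data structures.

```python
# Illustrative sketch of VFS dispatch: the vnode carries file-system-specific
# operations, so the caller never needs to know whether a file is local or remote.

class Vnode:
    def __init__(self, ops, handle):
        self.ops = ops          # file-system-specific operations (local or NFS)
        self.handle = handle    # i-node for local files, file handle for remote ones

    def read(self, offset, length):
        return self.ops.read(self.handle, offset, length)

class LocalInode:
    def read(self, inode, offset, length):
        with open(inode, "rb") as f:        # stand-in for a local disk read
            f.seek(offset)
            return f.read(length)

class NfsClientStub:
    def __init__(self, rpc_call):
        self.rpc = rpc_call                 # hypothetical RPC transport
    def read(self, file_handle, offset, length):
        return self.rpc("read", file_handle, offset, length)

def vfs_read(vnode, offset, length):
    # The caller does not know (or care) whether the vnode is local or remote.
    return vnode.read(offset, length)
```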

Figure 6.5 Schematic view of the NFS architecture: on both client and server, the VFS interface sits below the system-call interface; local requests go to the UNIX 4.2 file system (or other file system types) and its disk, while remote requests pass from the NFS client through RPC/XDR and the network to the NFS server and the server's file system.


Pathname Translation
Pathname translation is done by breaking the path into component names and performing a separate NFS lookup call for every pair of component name and directory vnode; thus, lookups are performed remotely by the server. Once a mount point is crossed, every component lookup causes a separate RPC to the server. This expensive pathname traversal is needed because each client has a unique layout of its logical name space, dictated by the mounts it has performed. A directory name lookup cache at the client, which holds the vnodes for remote directory names, speeds up references to files with the same initial pathname. A cache entry is discarded when attributes returned from the server do not match the attributes of the cached vnode.
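A sketch of this component-by-component translation with a client-side directory name lookup cache is shown below. The nfs_lookup callable is a hypothetical RPC returning the vnode of a name inside a directory vnode; it only stands in for the real protocol call.

```python
# Sketch of per-component pathname translation with a directory name lookup cache.

def translate(path, root_vnode, nfs_lookup, dnlc):
    """Resolve an absolute path with one lookup RPC per uncached component."""
    vnode = root_vnode
    walked = ""
    for component in path.strip("/").split("/"):
        walked = walked + "/" + component
        if walked in dnlc:                           # cached directory vnode: skip the RPC
            vnode = dnlc[walked]
        else:
            vnode = nfs_lookup(vnode, component)     # one RPC for this component
            dnlc[walked] = vnode
    return vnode

# dnlc = {}
# vnode = translate("/usr/students/jon/report", root_vnode, nfs_lookup, dnlc)
# A second lookup under /usr/students reuses the cached directory vnodes.
```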

Caching
There is a one-to-one correspondence between the regular UNIX system calls for file operations and the NFS protocol RPCs, with the exception of opening and closing files. Hence a remote file operation can be translated directly to the corresponding RPC. Conceptually, NFS adheres to the remote service paradigm, but in practice buffering and caching techniques are used for the sake of performance; that is, no strict correspondence exists between a remote operation and an RPC. File blocks and file attributes are fetched by the RPCs and cached locally.

Caches are of two types: the file block cache and the file attribute (i-node information) cache. On a file open, the kernel checks with the remote server to decide whether to fetch or revalidate the cached attributes, by comparing the time-stamps of the last modification. The cached file blocks are used only if the corresponding cached attributes are up to date. Both read-ahead and delayed-write techniques are used between the client and the server. Clients do not free delayed-write blocks until the server confirms that the data has been written to disk.

Performance tuning of the system makes it difficult to characterize the sharing semantics of NFS. New files created on a machine may not be visible elsewhere for a short time. It is indeterminate whether writes to a file at one site are visible to other sites that have the file open for reading; new opens of the file observe only the changes that have already been flushed to the server. Thus NFS fails to provide strict emulation of UNIX semantics.

Summary
• Logical name structure: A global name hierarchy does not exist; every machine establishes its own view of the name structure. Each machine has its own root serving as a private and absolute point of reference for its own view of the name structure. Hence users enjoy some degree of independence, flexibility, and privacy, sometimes at the expense of administrative complexity.

• Remote service: When a file is accessed transparently, I/O operations are performed according to the remote service method; that is, the data in the file is not fetched all at once; instead, the remote site potentially participates in each read and write operation.

• Fault tolerance: The stateless approach to server design results in resilience to client, server, and network failures.


• Sharing semantics: NFS does not provide UNIX semantics for concurrently open files.
Figure 6.8 Local and remote file systems accessible on an NFS client. The file system mounted at /usr/students on the client is actually the sub-tree located at /export/people on Server 1; the file system mounted at /usr/staff on the client is actually the sub-tree located at /nfs/users on Server 2.

6.10.2 Sprite

Sprite is distinguished by its performance, and it uses memory caching as the main tool to achieve that performance. Sprite is an experimental distributed system developed at the University of California at Berkeley as part of the SPUR project, whose goal is the design and construction of a high-performance multiprocessor workstation.

Overview
The designers of Sprite envisioned the next generation of workstations as powerful machines with vast main memories of 100 to 500 MB. By caching files from dedicated servers, these large physical memories compensate for the lack of local disks in diskless workstations.

Features
Sprite uses ordinary files to store the data and stacks of running processes, instead of the special disk partitions used by many versions of UNIX. This simplifies process migration and enables flexibility in, and sharing of, the space allocated for swapping. In Sprite, clients can read random pages from a server's (physical) cache faster than from a local disk, which shows that a server with a large cache may provide better performance than a local disk. The interface provided by Sprite is very similar to the one provided by UNIX,


in which a single tree encompasses all the files and devices in the network, making them equally and transparently accessible from every workstation. Location transparency is complete, i.e., a file's network location cannot be discerned from its name.

Description
Caching
An important aspect of the Sprite file system is its capitalizing on large main memories and its advocacy of diskless workstations: file caches are stored in-core. The same caching scheme is used both to avoid local disk accesses and to speed up remote accesses. In Sprite, file information is cached in the main memories of both servers (workstations with disks) and clients (workstations wishing to access files on non-local disks). The caches are organized on a block basis, each block being 4 KB.

Sprite does not use read-ahead to speed up sequential reads; instead, it uses a delayed-write approach to handle file modifications. Exact emulation of UNIX semantics is one of Sprite's goals, and a hybrid cache validation method is used to this end. Files are associated with version numbers. When a client opens a file, it obtains the file's current version number from the server and compares this number with the version number associated with the cached blocks for that file. If the version numbers are different, the client discards all cached blocks for the file and reloads its cache from the server when the blocks are needed.
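This validation-on-open step can be sketched as follows. The fetch_version and fetch_block callables are hypothetical server calls introduced only for the illustration; they do not mirror Sprite's actual kernel interface.

```python
# Sketch of version-number cache validation on open, in the spirit described above.

class VersionedClientCache:
    def __init__(self, fetch_version, fetch_block):
        self.fetch_version = fetch_version
        self.fetch_block = fetch_block
        self.versions = {}    # path -> version number of the cached blocks
        self.blocks = {}      # (path, block_no) -> data

    def open(self, path):
        current = self.fetch_version(path)
        if self.versions.get(path) != current:
            # Stale cache: drop every cached block of this file; reload on demand.
            self.blocks = {k: v for k, v in self.blocks.items() if k[0] != path}
            self.versions[path] = current

    def read_block(self, path, block_no):
        if (path, block_no) not in self.blocks:
            self.blocks[(path, block_no)] = self.fetch_block(path, block_no)
        return self.blocks[(path, block_no)]
```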

Looking up files with prefix tables
Sprite presents its users with a single file system hierarchy that is composed of several subtrees called domains. Each server stores one or more domains. Each machine maintains a server map called a prefix table, which maps domains to servers. This mapping is built and updated dynamically by a broadcast protocol. Every entry in a prefix table corresponds to one of the domains. It contains the pathname of the topmost directory in the domain, the network address of the server storing the domain, and a numeric designator identifying the domain's root directory for the storing server. This designator is an index into the server's table of open files; it saves repeating expensive name translation.

Every lookup operation for an absolute pathname starts with the client searching its prefix table for the longest prefix matching the given file name. The client strips the matching prefix from the file name and sends the remainder of the name to the selected server, along with the designator from the prefix table entry. The server uses this designator to locate the root directory of the domain and then proceeds with the usual UNIX pathname translation for the remainder of the file name. When the server succeeds in completing the translation, it replies with a designator for the open file.
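The client's side of this lookup can be sketched as a longest-prefix match over the table, as below. The table contents and server names are invented for the example.

```python
# Sketch of a prefix-table lookup: find the longest matching domain prefix,
# strip it, and send the remainder to that domain's server with its designator.

def prefix_lookup(path, prefix_table):
    """prefix_table maps a domain prefix to (server, root_designator)."""
    candidates = [p for p in prefix_table
                  if path == p or path.startswith(p.rstrip("/") + "/")]
    if not candidates:
        raise FileNotFoundError(path)
    match = max(candidates, key=len)                 # longest matching prefix wins
    server, designator = prefix_table[match]
    remainder = path[len(match):].lstrip("/")
    return server, designator, remainder             # the server resolves `remainder`

# table = {"/": ("srv0", 1), "/users": ("srv1", 7), "/users/proj": ("srv2", 3)}
# prefix_lookup("/users/proj/notes.txt", table)  -> ("srv2", 3, "notes.txt")
```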

Location Transparency
Like almost all modern network file systems, Sprite achieves location transparency. This means that users should be able to manipulate files in the same ways they did under time-sharing on a single machine; the distributed nature of the file system and the techniques used to access remote files should be invisible to users under normal conditions. Most network file systems fail to meet this transparency goal in one or more ways. The earliest systems allowed remote file access only with a few special programs. Second-generation systems allow any application to access files on any machine in the network, but special names must be used for remote files. Third-generation network file systems, such as Sprite and Andrew, provide transparency. Sprite provides complete transparency, so applications running on different workstations see the same behavior they would see if all applications were executing on a single time-shared machine. Sprite also provides transparent access to remote I/O devices. Like UNIX, Sprite represents devices as special files; unlike most versions of UNIX, however, it allows any process to access any device, regardless of the device's location.

Summary
The Sprite file system can be summarized by the following points.
• Semantics: Sprite sacrifices performance in order to emulate UNIX semantics, thus eliminating the possibility and benefits of caching in big chunks.
• Extensive use of caching: Sprite is inspired by the vision of diskless workstations with huge main memories and accordingly relies heavily on caching.
• Prefix tables: For LAN-based systems, prefix tables are an efficient, dynamic, versatile, and robust mechanism for file lookup; their advantages are the built-in facility for processing whole prefixes of pathnames and the supporting broadcast protocol that allows dynamic changes to the tables.

6.10.3 Andrew File System
Andrew, which is distinguished by its scalability, is a distributed computing environment that has been under development since 1983 at CMU. The Andrew file system constitutes the underlying information-sharing mechanism among users of the environment. The most formidable requirement of Andrew is its scale.

Overview
Andrew distinguishes between client machines and dedicated server machines. Clients are presented with a partitioned space of file names: a local name space and a shared name space. A collection of dedicated servers, collectively called Vice, presents the shared name space to the clients as an identical and location-transparent file hierarchy. The local name space is the root file system of a workstation, from which the shared name space descends (Figure 6.8). Workstations are required to have local disks, where they store their local name space, whereas the servers are collectively responsible for the storage and management of the shared name space. The local name space is small and distinct for each workstation; it contains system programs essential for autonomous operation and better performance, temporary files, and files the workstation owner explicitly wants to store locally for privacy reasons.


Figure 6.8. Andrew's name space
The key mechanism selected for remote file operations is whole-file caching. Opening a file causes it to be cached, in its entirety, on the local disk. Reads and writes are directed to the cached copy without involving the servers. Entire-file caching has many merits, but it cannot efficiently accommodate remote access to very large files; a separate design therefore has to address the use of large databases in the Andrew environment.

Features
• User mobility: Users are able to access any file in the shared name space from any workstation. The only noticeable effect of a user accessing files from an unusual workstation is some initial degraded performance due to the caching of files.
• Heterogeneity: Defining a clear interface to Vice is a key to the integration of diverse workstation hardware and operating systems. To facilitate heterogeneity, some files in the local /bin directory are symbolic links pointing to machine-specific executable files residing in Vice.
• Protection: Andrew provides access lists for protecting directories and the regular UNIX bits for file protection. The access list mechanism is based on a recursive group structure.


Figure 6.9 Distribution of processes in the Andrew File System

Description
Scalability of the Andrew File System
There are no magic guidelines to ensure the scalability of a system, but the Andrew file system illustrates several methods that make it scalable.

Location Transparency
Andrew offers true location transparency: the name of a file contains no location information. Rather, this information is obtained dynamically by clients during normal operation. Consequently, administrative operations such as the addition or removal of servers and the redistribution of files among them are transparent to users. In contrast, some file systems require users to explicitly identify the site at which a file is located. Location transparency can be viewed as a binding issue: the binding of location to name is static and permanent when pathnames with embedded machine names are used, whereas the binding is dynamic and flexible in Andrew. Usage experience has confirmed the benefits of a fully dynamic location mechanism in a large distributed environment.


Client Caching

The caching of data at clients is undoubtedly the architectural feature that contributed most to scalability in a distributed file system. Caching has been an integral part of the Andrew designs from the beginning. In implementing caching one has to make three decisions: where to locate the cache, how to maintain cache coherence, and when to propagate modifications.

Andrew caches on the local disk, with a further level of file caching by the UNIX kernel in main memory. Disk caches contribute to scalability by reducing network traffic and server load after client reboots.

Cache coherence can be maintained in two ways. One approach is for the client to validate a cached object upon use. A more scalable approach is used in Andrew. When a client caches an object, the server hands out a promise (called callback or token) that it will notify the client before allowing any other client to modify that object. Although more complex to implement, this approach minimizes server load and network traffic, thus enhancing scalability. Callbacks further improve scalability by making it viable for clients to translate pathnames entirely locally.

Existing systems use one of two approaches to propagate modifications from client to server. Write-back caching, used in Sprite, is the more scalable approach. Andrew uses a write-through caching scheme; this is a notable exception to scalability being the dominant design consideration in Andrew.

Bulk Data Transfer

An important issue related to caching is the granularity of data transfers between client and server. Andrew uses whole-file caching. This enhances scalability by reducing server load, because clients need only contact servers on file open and close requests; the far more numerous read and write operations are invisible to the servers and cause no network traffic. Whole-file caching also simplifies cache management, because clients only have to keep track of files, not individual pages, in their cache. When caching is done at large granularity, considerable performance improvement can be obtained by the use of a specialized bulk data-transfer protocol. Network communication overhead caused by protocol processing typically accounts for a major portion of the latency in a distributed file system, and transferring data in bulk reduces this overhead.
Token-Based Mutual Authentication
The approach used in Andrew to implement authentication is to provide a level of indirection using authentication tokens. When a user logs in to a client, the password typed in is used as the key to establish a secure RPC connection to an authentication server. A pair of authentication tokens is then obtained for the user on this secure connection. These tokens are saved by the client and are used by it to establish secure RPC connections on behalf of the user to file servers. Like a file server, an authentication server runs on physically secure hardware. To improve scalability and to balance load, there are multiple instances of the authentication server; only one instance accepts updates, while the others are slaves and respond only to queries.

Page 36: Chapter 6 Distributed File Systemsacl.ece.arizona.edu/classes/ece677/chapters/ch10.pdfChapter 6 Distributed File Systems Chapter Objectives A file system is a subsystem of an operating

Hierarchical Groups and Access Lists

Controlling access to data is substantially more complex in large-scale systems than in smaller ones: there is more data to protect and there are more users about whom access control decisions must be made. To enhance scalability, Andrew organizes its protection domain hierarchically and supports a full-fledged access-list mechanism. The protection domain is composed of users and groups. Membership in a group is inherited, and a user's privileges are the cumulative privileges of all the groups he or she belongs to, either directly or indirectly.

Andrew uses an access-list mechanism for file protection. The total rights specified for a user are the union of the rights specified for the user and for the groups he or she belongs to. Access lists are associated with directories rather than with individual files. The reduction in state obtained by this design decision provides a conceptual simplicity that is valuable at large scale. Although the real enforcement of protection is done on the basis of access lists, Venus superimposes an emulation of UNIX protection semantics by honoring the owner component of the UNIX mode bits on a file. The combination of access lists on directories and mode bits on files has proved to be an excellent compromise between protection at fine granularity, scalability, and UNIX compatibility.

Data Aggregation

In a large system, considerations of interoperability and system administration assume major significance. To facilitate these functions, Andrew organizes file system data into volumes. A volume is a collection of files located on one server and forming a partial subtree of the Vice name space. Volumes are invisible to application programs and are manipulated only by system administrators. The aggregation of data provided by volumes reduces the apparent size of the system as perceived by operators and system administrators. Operational experience with Andrew confirms the value of the volume abstraction in a large distributed file system.

Decentralized Administration

A large distributed system is unwieldy to manage as a monolithic entity. For smooth and efficient operation, it is essential to delegate administrative responsibility along lines that parallel institutional boundaries. Such a decomposition has to balance site autonomy with the desirable but conflicting goal of system-wide uniformity in human and programming interfaces. The cell mechanism of AFS-3 is an example of a mechanism that provides this balance.

A cell corresponds to a completely autonomous Andrew system, with its own protection domain, authentication and file servers, and system administrators. A federation of cells can cooperate in presenting users with a uniform, seamless file name space.

Heterogeneity

As a distributed system evolves, it tends to grow more diverse. One factor contributing to diversity is the improvement in performance and the decrease in cost of hardware over time, which makes it likely that the most economical hardware configurations will change over the period of growth of the system. Another source of heterogeneity is the use of different computer platforms for different applications. Andrew did not set out to be a heterogeneous computing environment: initial plans envisioned a single type of client running one operating system, with the network constructed of a single type of physical medium. Yet heterogeneity appeared early in Andrew's history and proliferated with time. Some of this heterogeneity is attributable to the decentralized administration typical of universities, but much of it is intrinsic to the growth and evolution of any distributed system.

Coping with heterogeneity is inherently difficult because of the presence of multiple computational environments, each with its own notions of file naming and functionality. The PC Server [Rifkin, 1987] is one mechanism used to cope with this heterogeneity in the Andrew environment.

Summary
The highlights of the Andrew file system are:
• Name space and service model: Andrew explicitly distinguishes between local and shared name spaces, as well as between clients and servers. Clients have a small and distinct local name space and can access the shared name space managed by the servers.
• Scalability: Andrew is distinguished by its scalability. The strategy adopted to address scale is whole-file caching, in order to reduce server load; servers are not involved in read and write operations. The callback mechanism was invented to reduce the number of validity checks.
• Sharing semantics: Andrew's semantics, which are simple and well defined, ensure that a file's updates are visible across the network only after the file has been closed.

6.10.4 Locus

Overview
Locus is an ambitious project aimed at building a full-scale operating system. The features of Locus are automatic management of replicated data, atomic file updates, remote tasking, the ability to tolerate failures to a certain extent, and a full implementation of nested transactions. Locus has been operational at UCLA for several years on a set of mainframes and workstations connected by an Ethernet. The main component of Locus is its DFS, which presents a single tree-structured naming hierarchy to users and applications. This structure covers all objects of all the machines in the system. Locus is a location-transparent system, i.e., from the name of an object one cannot determine its location in the network.


Features

Fault tolerance issues receive special emphasis in Locus. Network failures may split the network into two or more disconnected subnetworks (partitions). As long as at least one copy of a file is available in a subnetwork, read requests are served, and it is still guaranteed that the version read is the most recent one available in that disconnected subnetwork. Upon reconnection of the subnetworks, automatic mechanisms take care of updating stale copies of files.

The quest for high performance in the design of Locus led to incorporating network functions, such as formatting, queuing, transmitting, and retransmitting messages, into the operating system. Specialized remote procedure call protocols were devised for kernel-to-kernel communication, and the lack of multilayering (as suggested in the ISO standard) made high performance for remote operations achievable. A file in Locus may correspond to a set of copies (replicas) distributed across the system. It is the responsibility of Locus to maintain consistency and coherence among the versions of a given file. Users have the option to choose the number and locations of the replicated files.

Description
Locus uses its logical name structure to hide both location and replication details from users and applications. A removable file system in Locus is called a filegroup; the filegroup is the component unit in Locus. Logically, filegroups are joined together to form a unified structure. Physically, a logical filegroup is mapped to multiple physical containers (packs) residing at various sites and storing replicas of the files of that filegroup. These containers correspond to disk partitions. One of the packs is designated as the primary copy. A file must be stored at the site of the primary copy and can be stored at any subset of the other sites where there exists a pack corresponding to its filegroup; hence, the primary copy stores the filegroup completely, whereas the other packs might be partial. The various copies of a file are assigned the same i-node slot on all packs, so a pack has an empty i-node slot for the files it does not store. Data page numbers may be different on different packs; hence references over the network to data pages use logical page numbers rather than physical ones, and each pack keeps a mapping of these logical numbers to physical numbers. To facilitate automatic replication management, each i-node of a file copy contains a version number, which determines which copy dominates the other copies.

Synchronization of Accesses to Files

Locus distinguishes between three logical roles in file accesses, each performed by a different site (a minimal sketch of their interaction follows the list):
1. Using Site (US): issues the requests to open and access a remote file.
2. Storage Site (SS): the site selected to serve those requests.
3. Current Synchronization Site (CSS): enforces a global synchronization policy for a filegroup and selects an SS for each open request referring to a file in that filegroup. There is at most one CSS for each filegroup in any set of communicating sites. The CSS maintains the version number and a list of physical containers for every file in the filegroup.
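As a rough illustration of how the three roles cooperate on an open request, the sketch below routes an open from a using site through the CSS, which picks a storage site holding the newest replica. The class names and the version-based selection rule are illustrative assumptions, not the actual Locus protocol.

# Schematic sketch of US / SS / CSS cooperation (names are invented).

class CSS:
    """At most one CSS per filegroup in a set of communicating sites."""
    def __init__(self, replicas):
        self.replicas = replicas               # filename -> {storage site: version}

    def open(self, filename):
        copies = self.replicas[filename]
        newest = max(copies.values())
        # Pick a storage site that holds the newest version of the file.
        ss = next(site for site, v in copies.items() if v == newest)
        return ss, newest


class UsingSite:
    def __init__(self, css):
        self.css = css

    def open(self, filename):
        ss, version = self.css.open(filename)  # synchronization decision is made at the CSS
        print(f"US: open({filename}) served by SS={ss}, version {version}")
        return ss


css = CSS({"report": {"siteA": 7, "siteB": 7, "siteC": 5}})
UsingSite(css).open("report")                  # siteC is never chosen: its copy is stale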


Reconciliation of filegroups at partitioned sites

The basic approach to fault tolerance in Locus is to maintain, within a single partition, consistency among the copies of a file. The policy is to allow updates only in the partition that has the primary copy, and it is guaranteed that the version of a file read within a partition is the most recent one there. To achieve this, the system maintains a commit count for each filegroup, enumerating each commit of every file in the filegroup; the commit operation consists of moving the in-core i-node to the disk i-node. Each pack has a low-water-mark (lwm), a commit count value up to which the system guarantees that all prior commits are reflected in the pack. The primary copy pack (stored at the CSS) keeps, in secondary storage, a list enumerating the files of the filegroup and the commit counts of the recent commits.

When a pack joins a partition, it attempts to contact the CSS and checks whether its lwm falls within the range of the recent commit list. If so, the pack site schedules a kernel process that brings the pack to a consistent state by copying only the files that reflect commits later than the site's lwm. If the CSS is not available, writing is disallowed in this partition, but reading is possible after a new CSS is chosen. The new CSS communicates with the partition members to keep itself informed of the most recent available version of each file in the filegroup; the other pack sites can then reconcile with it. As a result, all communicating sites see the same view of the filegroup, and this view is as complete as possible given a particular partition.

Since updates are allowed only within the partition that has the primary copy, while reads are allowed in the other partitions, it is possible to read out-of-date replicas of a file. Thus Locus sacrifices consistency for the ability to continue both updating and reading files in a partitioned environment. When a pack is too far out of date, the system invokes an application-level process to bring the filegroup up to date; at that point the system lacks sufficient knowledge of the most recent commits to identify the missing updates, so the site inspects the entire i-node space to determine which files in its pack are out of date.
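A simplified sketch of this reconciliation logic is given below. The function name, the data layout, and the exact test for "lwm within the recent commit list range" are assumptions made for illustration; they are not the Locus kernel code, but they show the two paths: a fast copy of missed commits versus a full inspection of the i-node space.

# Simplified sketch (invented names) of pack reconciliation after a partition merge.

def reconcile(pack_lwm, recent_commits, pack_files, primary_files):
    """recent_commits: list of (commit_count, filename) kept at the CSS, oldest first.
    pack_files / primary_files: filename -> i-node version number on that pack.
    Returns the filenames this pack must refresh."""
    if not recent_commits or pack_lwm < recent_commits[0][0] - 1:
        # lwm falls outside the recent-commit window: the fast path fails, so the
        # whole i-node space is inspected (application-level recovery) to find
        # every file whose primary version dominates the local copy.
        return sorted(f for f in primary_files
                      if primary_files[f] > pack_files.get(f, -1))
    # Fast path: copy just the files committed after this pack's lwm.
    return sorted({f for count, f in recent_commits if count > pack_lwm})


recent = [(101, "a"), (102, "b"), (103, "a")]        # commits recorded at the CSS
print(reconcile(101, recent, {"a": 2, "b": 1}, {"a": 3, "b": 2}))           # ['a', 'b']
print(reconcile(95, recent, {"a": 2, "b": 1}, {"a": 3, "b": 2, "c": 1}))    # ['a', 'b', 'c'] via full scan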

Summary

An overall profile of Locus can be summarized by the following points.
• Distributed operating system: Due to the multiple dimensions of transparency in Locus, it comes close to the definition of a distributed operating system, in contrast to a collection of network services.

• Implementation strategy: Implementation within the kernel is the strategy in Locus, the common pattern being kernel-to-kernel communication via specialized high-performance protocols.

• Replication: Locus uses a primary-copy replication scheme, whose main merit is the increased availability of directories, which exhibit a high read-to-write ratio.

• Access synchronization: UNIX semantics are emulated to the last detail in spite of caching at multiple USs. Alternatively, locking facilities are provided.

• Fault tolerance: Among the fault tolerance mechanisms in Locus are an atomic update facility, merging of replicated packs after recovery, and a degree of independent operation of partitions.


Conclusion:


In this chapter, we discussed the characteristics of the file system, which can be viewed as a set of services provided to the client (user). We introduced systems such as the Sun Network File System (NFS), which serves as a case study in this chapter, as well as the Andrew File System. The chapter started by studying file system characteristics and requirements; in that section, we defined the file system role, file access granularity, file access type, transparency, network transparency, mobile transparency, performance, fault tolerance, and scalability. In the file model and organization section, we compared the directory service, file service, and block service, and gave examples showing the differences between them. In the naming and transparency section, we discussed transparency issues related to naming files, naming techniques, and implementation issues; under naming files, we presented naming transparency, network transparency, mobile transparency, and location independence.

Later in the chapter, we defined the sharing semantics. The most common types of sharing semantics are UNIX semantics, session semantics, immutable shared file semantics, and transaction semantics. Fault tolerance is an important attribute of a distributed system, so we devoted a section of this chapter to methods for improving the fault tolerance of a DFS, which can be summarized as improving availability and using redundant resources.

Due to its importance in distributed file systems, caching occupied a large part of our discussion in this chapter. Caching is a common technique used to reduce the time it takes for a computer to retrieve information. Ideally, recently accessed information is stored in a cache so that subsequent accesses to that same information can be handled locally, without additional access time or added burden on network traffic. Designing a cache system is not easy; many factors must be taken into consideration, such as the cache unit size, the cache location, the client location, and cache validation and consistency. The chapter briefly goes over concurrency control and security issues, which will be discussed in depth in the coming chapters, and ends by discussing case studies such as Sun NFS and the Andrew File System.

Questions:


1) What is a Distributed File System?
2) Briefly summarize the file system requirements.
3) What are the most common types of file sharing semantics?
4) Why is fault tolerance an important attribute of file systems?
5) Why do we need caching? What factors should be taken into consideration when designing cache systems?
6) Briefly summarize the case studies provided in this chapter.
7) Present a file system that is not discussed in this chapter and compare it to NFS and AFS.

References:

[Chu, 1969] W. W. Chu, "Optimal file allocation in a minicomputer information system," IEEE Transactions on Computers, Vol. C-18, No. 10, Oct. 1969.


[Gavish, 1990] Bezalel Gavish, R. Olivia, and Liu Sheng, "Dynamic file migration in distributed computer systems," Communications of the ACM, Vol. 33, No. 2, Feb. 1990.
[Chung, 1991] Hsiao-Chung Cheng and Jang-Ping Sheu, "Design and implementation of a distributed file system," Software - Practice and Experience, Vol. 21(7), pp. 657-675, July 1991.
[Walker, 1983] B. Walker et al., "The LOCUS distributed operating system," Proceedings of the 1983 SIGOPS Conference, pp. 49-70, Oct. 1983.
[Coulouris and Dollimore, 1988] George F. Coulouris and Jean Dollimore, "Distributed Systems: Concepts and Design," Addison-Wesley, 1988, pp. 18-20.
[Nakano] X. Jia, H. Nakano, Kentaro Shimizu, and Mamoru Maekawa, "Highly Concurrent Directory Management in Galaxy Distributed Systems," Proceedings of the International Symposium on Databases in Parallel and Distributed Systems, pp. 416-426.
[Thompson, 1978] K. Thompson, "UNIX Implementation," Bell System Technical Journal, Vol. 57, No. 6, Part 2, pp. 1931-1946, 1978.
[Needham] R. M. Needham and A. J. Herbert, "The Cambridge Distributed Computing System," Addison-Wesley.
[Lampson, 1981] B. W. Lampson, "Atomic Transactions," in Lampson et al., 1981.
[Gifford, 1988] D. K. Gifford, "Violet: An experimental decentralized system," ACM Operating Systems Review, Vol. 3, No. 5.
"Network Programming," Sun Microsystems Inc., May 1988.
[Rifkin, 1987] Rifkin, R. L. Hamilton, M. P. Sabrio, S. Shah, and K. Yueh, "RFS Architectural Overview," USENIX, 1987.
[Gould, 1987] Gould, "The Network File System implemented on 4.3 BSD," USENIX, 1987.
[Howard, 1988] J. Howard et al., "Scale and Performance in a Distributed File System," ACM Transactions on Computer Systems, pp. 51-81, Feb. 1988.
"The design of a capability based operating system," Computer Journal, Vol. 29, No. 4, pp. 289-300.
[Levy, 1990A] E. Levy and A. Silberschatz, "Distributed File Systems: Concepts and Examples," ACM Computing Surveys, pp. 321-374, Dec. 1990.
[Nelson] M. Nelson et al., "Caching in the Sprite Network File System," ACM Transactions on Computer Systems.


[Mullender, 1990] Sape J. Mullender and Guido van Rossum, "Amoeba: A Distributed Operating System for the 1990s," CWI, Center for Mathematics and Computer Science.
[Needham, 1988] D. K. Gifford, R. M. Needham, and M. D. Schroeder, "The Cedar File System," Communications of the ACM, Vol. 31, pp. 288-298, March 1988.
[Levy, 1990B] E. Levy and A. Silberschatz, "Distributed File Systems: Concepts and Examples," Computing Surveys, Vol. 22, pp. 321-374, Dec. 1990.
[Tanenbaum, 1992] Andrew S. Tanenbaum, "Modern Operating Systems," Prentice-Hall Inc., Chapter 13, pp. 549-587, 1992.
[Nelson, 1988] M. N. Nelson, B. B. Welch, and J. K. Ousterhout, "Caching in the Sprite Network File System," ACM Transactions on Computer Systems, Vol. 6, pp. 134-154, Feb. 1988.
[Richard, 1997] Richard S. Vermut, "File Caching on the Internet: Technical Infringement or Safeguard for Efficient Network Operation?" 4 J. Intell. Prop. L. 273 (1997).