
  • 8/8/2019 Kanishk Tewari

    1/21

DISTRIBUTED STORAGE SYSTEMS

A Seminar Report done in partial fulfilment of
the requirements for the award of the degree of
Bachelor of Technology in
Computer Science and Engineering

by

KANISHK TEWARI
B070351CS

Department of Computer Science & Engineering
National Institute of Technology Calicut
Kerala - 673601

Monsoon 2010


National Institute of Technology Calicut
Department of Computer Science & Engineering

Certified that this Seminar Report entitled

DISTRIBUTED STORAGE SYSTEMS

is a bonafide record of the Seminar presented by

KANISHK TEWARI
B070351CS

in partial fulfilment of the requirements for the award of the degree of
Bachelor of Technology in
Computer Science & Engineering

SREENU NAIK BHUKYA
(Seminar Co-ordinator)
Assistant Professor
Dept. of Computer Science & Engineering


    Acknowledgment

I would like to thank Shri. Sreenu Naik Bhukya, Assistant Professor, and his associates in the Department of Computer Science and Engineering, National Institute of Technology Calicut, and LaTeX, for providing me an opportunity to present this work. I would also like to thank God and my parents.


    List of Figures

1  DISKOS Overview [4] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2  The Classic Web 2.0 Model [2] . . . . . . . . . . . . . . . . . . . . . . . . 16
3  Storage (Information Upload) [2] . . . . . . . . . . . . . . . . . . . . . . . 17
4  Access (Data Linkage) [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5  Data Presentation [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19


    Abstract

During the past years, anywhere, anytime access to large and reliable storage has become increasingly important in both enterprise and home computing environments. Web applications, moreover, store vast quantities of information online, and the storage, protection and maintenance of that information becomes the sole responsibility of the web application provider. This is where distributed storage systems make their foray, as computing has evolved from single-host infrastructures to distributed environments, and the need to separate computing from storage has become increasingly evident. In this paper we therefore discuss various aspects of distributed storage systems: data replication schemes, integration of data into the Web 2.0 model, and data access methods.


is large and the storage capacity is limited. To overcome these limitations, an efficient genetic-algorithm-based heuristic provides good-quality solutions.

    1.3 Data Access in Distributed Computing

The increasing demand for massive data processing has driven the evolution from single-host to distributed environments. Distributed computing has become the dominant method for massive data processing, and many platforms have emerged. At the same time, distributed storage systems have been used to supply massive amounts of distributed data. These storage systems vary greatly in design and implementation, so it is desirable to construct computing platforms that can adapt to a variety of applications; data access thus becomes the key problem. Some research has addressed this. SRB (Storage Resource Broker)[3], for example, supplies a range of data access methods and protocols for distributed storage; it focuses on supplying data from the resource side and can be seen as middleware between computing and storage. Other research targets the access of particular data types, such as the HENP project[3] at Berkeley on physics data processing. Here, access to distributed data is designed as an interface invoked from the computing side. The aim is to improve support for various storage systems and the reliability and validity of data processing.

    1.4 Web 2.0 and Data Storage

The current breed of web applications, known as Web 2.0, relies on monolithic storage models which place the sole burden of data management on the web application provider. Having the application provider manage this data has resulted in problems with data ownership, data freshness and data duplication, as covered by our previous work. As the majority of data in Web 2.0 is user-generated, it is suggested that the responsibility for storing user-generated content should be given to the data owners themselves. As such, a distributed storage model is presented which allows web applications to offload the storage and management of user-generated content to storage systems managed by the data owner. The Distributed Data Service (DDS) API[2] allows web applications to seamlessly present user-generated content to a 3rd-party user. The 3rd party interacts directly with both the web application and the numerous DDS systems which host the web application's content. Data owners can manage their data elements by interacting with a DDS directly, while also exposing this data to web applications via a publish/subscribe model.


    2 DISKOS System Overview

The distributed storage systems designed so far are tailored to specific storage environments and are mostly implemented at the file system level of the I/O stack of an operating system. Moreover, distributed file systems are optimized for specific application characteristics, but disk access patterns change over time as applications evolve, a fact that ensures a continuous need for new file systems. DISKOS, on the other hand, is able to achieve high performance in a storage environment consisting of more than two storage hosts.

Overall, most distributed storage systems assume the use of expensive storage and network equipment, which in some cases is entirely proprietary. Other systems assume that system reliability is maintained through some type of replication mechanism implemented by external software or hardware. Cluster file systems, likewise, require powerful, dedicated metadata servers for conducting all file metadata operations, resulting in a hierarchical structure that scales well only in combination with high-speed, low-latency, symmetric network links. DISKOS, on the other hand, is a completely decentralized storage system that may be deployed even in poor and heterogeneous environments in terms of CPU, storage and network resources. DISKOS decouples file-system-related semantics, such as directory and file metadata storage and management, concurrent file access policies and file locking mechanisms, from the actual distribution of data, and provides a low-level operating system (OS) storage interface which can be directly integrated with existing or new OS-compliant file systems. As a result, DISKOS becomes a file-system-neutral storage infrastructure whose main component is implemented as a regular disk driver embedded in an operating system kernel. File system neutrality is achieved because the disk driver is agnostic to the content of I/O requests.

DISKOS must scale to thousands of users in both wide-area networks and cooperative storage environments, and it must offer self-management, complete decentralization, data replication, resilience to failures, and transparency with respect to the underlying hardware and the interfaces exported to the upper layers, across a multitude of storage environments. To achieve this, we chose to exploit the basic principles of Distributed Hash Tables (DHTs)[4]. Here, the DHT acts as storage provider for the block requests that the disk driver has to satisfy.


Figure 1: DISKOS Overview [4]

    2.1 System Overview

In this section we briefly discuss the components of the proposed system and their interaction with each other and with other parts of the operating system's I/O subsystem. An application running in the user space of the operating system cannot directly access any kernel component except through the system calls exported by the kernel. For instance, when an application has to open a file in a local file system for reading, it simply issues an open() system call along with the OS-specific arguments.

One of the main components of the proposed system, the Virtual Disk Driver (VDD), is built into the kernel and therefore supports direct integration with every file system that implements the interfaces exported by the kernel. The VDD is directly connected to the Virtual Device Controller (VDC); this connection is accomplished using an IPC mechanism supported by the operating system. The VDD receives I/O requests for sectors from the upper levels of the I/O system and asynchronously forwards them to the VDC. The VDC provides the VDD with higher-level I/O services, such as request buffering and sector prefetching. It is also the component which, on behalf of the VDD, accesses the DHT, which is implemented in the Storage Provider (SP). The SP implements a DHT peer, which asynchronously receives


and satisfies requests from the VDC. The VDC and the SP are user-level processes.

The reason for choosing a three-tier architecture is twofold. Firstly, the VDD is a kernel component and therefore has to be lightweight in terms of memory and CPU usage. Secondly, separating the VDC from the SP allows a host that needs access to a distributed storage space to connect to a remote SP rather than implement one itself. Using this scheme, the proposed system can also be applied to devices with limited CPU and storage resources, such as mobile phones and PDAs.
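The interaction between these components can be sketched as a minimal Python model. This is an illustration only: the class and method names are my own, not DISKOS APIs, and the DHT is replaced by a plain dictionary. The point is how the VDC maps each sector-level request from the driver to a key in the DHT key space and lets a DHT peer (the SP) store or retrieve the block.

```python
import hashlib

class StorageProvider:
    """Stands in for the SP: a DHT peer that stores blocks by key."""
    def __init__(self):
        self._store = {}          # key -> block bytes (a real SP is distributed)

    def put(self, key, block):
        self._store[key] = block

    def get(self, key):
        return self._store.get(key)

class VirtualDeviceController:
    """Stands in for the VDC: turns sector requests into DHT operations."""
    SECTOR_SIZE = 512

    def __init__(self, volume_id, sp):
        self.volume_id = volume_id
        self.sp = sp

    def _key(self, sector):
        # Hash (volume, sector) into the DHT key space.
        return hashlib.sha1(f"{self.volume_id}:{sector}".encode()).hexdigest()

    def write_sector(self, sector, data):
        assert len(data) == self.SECTOR_SIZE
        self.sp.put(self._key(sector), data)

    def read_sector(self, sector):
        # An unwritten sector reads back as zeroes, like a blank disk.
        return self.sp.get(self._key(sector)) or bytes(self.SECTOR_SIZE)

# The kernel-level VDD would forward I/O requests here over IPC:
vdc = VirtualDeviceController("vol0", StorageProvider())
vdc.write_sector(7, b"\x01" * 512)
```

Because the controller only sees opaque sector contents, the file system above it is irrelevant to it, which is exactly the file-system-neutrality argument made in the text.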


    3 Replication Algorithms

Our replication policy assumes the existence of one primary copy of each object in the network. Let SP_k be the site which holds the primary copy of O_k, i.e., the only copy in the network that cannot be deallocated; we refer to it as the primary site of the k-th object. Each primary site SP_k contains information about the whole replication scheme R_k of O_k. This can be done by maintaining a list of the sites where the k-th object is replicated, called from now on the replicators of O_k. Moreover, every site S^(i) stores a two-field record for each object: the first field is its primary site SP_k, and the second is the site SN_k^(i) nearest to S^(i) which holds a replica of O_k. In other words, SN_k^(i) is the site for which reads from S^(i) for O_k, if served there, would incur the minimum possible communication cost. It is possible that SN_k^(i) = S^(i), if S^(i) is a replicator or the primary site of O_k. Another possibility is that SN_k^(i) = SP_k, if the primary site is the closest one holding a replica of O_k.

When a site S^(i) reads an object, it does so by addressing the request to the corresponding SN_k^(i). For updates we assume that every site can update every object. Updates of an object O_k are performed by sending the updated version to its primary site SP_k, which afterwards broadcasts it to every site in its replication scheme R_k. The simplicity of this policy allows us to develop a general cost model that can be used, with minor changes, to formalize various replication and consistency strategies.
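As a minimal sketch of this read/update policy, the following Python model (class and variable names are my own, not from the cited work) serves reads from the nearest replicator SN and routes every update through the primary SP, which broadcasts the new version to the scheme R_k:

```python
class PrimaryCopy:
    """Primary-copy replication for one object O_k (illustrative names).

    Reads are served by the nearest replicator SN; updates go to the
    primary SP, which broadcasts the new version to the scheme R_k."""

    def __init__(self, primary_site, replicators, value=None):
        self.primary = primary_site
        self.scheme = set(replicators) | {primary_site}   # R_k
        self.copies = {s: value for s in self.scheme}     # one copy per site

    def nearest(self, site, cost):
        # SN_k^(i): the replica holder with minimum communication cost.
        return min(self.scheme, key=lambda s: 0 if s == site else cost[site][s])

    def read(self, site, cost):
        return self.copies[self.nearest(site, cost)]

    def update(self, new_value):
        # Any site may update: the primary applies the new version,
        # then broadcasts it to every site in the replication scheme.
        for s in self.scheme:
            self.copies[s] = new_value

# Hypothetical 3-site network: object replicated at sites 0 (primary) and 2.
cost = [[0, 4, 9], [4, 0, 3], [9, 3, 0]]   # C(i, j), illustrative values
obj = PrimaryCopy(primary_site=0, replicators=[2], value="v1")
```

Here site 1 holds no replica, so its reads go to site 2 (cost 3) rather than the primary (cost 4), which is the SN_k^(i) selection described above.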

We are primarily interested in minimizing the total network transfer cost (NTC) due to object movement, since the communication cost of control messages has minor impact on the overall performance of the system. Two components contribute to NTC. The first is the NTC created by read requests.

Let R_k^(i) denote the total NTC due to S^(i)'s read requests for object O_k, addressed to the nearest site SN_k^(i):

    R_k^(i) = r_k^(i) · o_k · C(i, SN_k^(i))    (1)

where
    C(i, j) = communication cost (per unit) between sites i and j,
    r_k^(i) = number of reads from site i for object k,
    o_k = size of object k.

The second component of NTC is the cost arising from writes. Let W_k^(i) be the total


NTC due to S^(i)'s write requests for object O_k, addressed to the primary site SP_k:

    W_k^(i) = w_k^(i) · o_k · ( C(i, SP_k) + Σ_{j ∈ R_k, j ≠ i} C(SP_k, j) )    (2)

where
    w_k^(i) = number of writes from site i for object k.

Let X_ik = 1 if S^(i) holds a replica of object O_k, and 0 otherwise. The X_ik then define an M×N replication matrix X with boolean elements. Sites which are not replicators of an object create NTC equal to the communication cost of their reads from the nearest replicator, plus that of sending their writes to the primary site of O_k. Sites belonging to the replication scheme of O_k are associated with the cost of sending/receiving all of its updated versions. Using the above formulation, the Data Replication Problem (DRP) can be defined as: find the assignment of 0/1 values in the matrix X that minimizes the total NTC D.

There is a greedy method which tries to create replicas of objects by introducing a new term, B_k^(i), called the replication benefit of each site S^(i) and object O_k. The larger the replication benefit value, the better the chance of a replica of the object being created at that site.

We define the replication benefit value B_k^(i) as:

    B_k^(i) = ( R_k^(i) − Σ_{x=1..M} w_k^(x) · o_k · C(i, SP_k) − W_k^(i) ) / o_k    (3)
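A small numeric sketch of this cost model follows. All matrices and sizes are hypothetical, and the benefit in Eq. (3) is interpreted here as the read cost saved by a new replica, net of the update traffic it would attract and of the site's own write cost, per unit of object size:

```python
def read_cost(i, k, r, o, C, SN):
    """Eq. (1): NTC of S^(i)'s reads for O_k, served by the nearest replica."""
    return r[i][k] * o[k] * C[i][SN[i][k]]

def update_cost(i, k, w, o, C, SP, R):
    """Eq. (2): NTC of S^(i)'s writes for O_k: ship the update to the
    primary SP_k, which broadcasts it to every other member of R_k."""
    broadcast = sum(C[SP[k]][j] for j in R[k] if j != i)
    return w[i][k] * o[k] * (C[i][SP[k]] + broadcast)

def replication_benefit(i, k, r, w, o, C, SP, R, SN):
    """Eq. (3): benefit (per storage unit) of placing a replica of O_k
    at S^(i): read savings minus the update traffic the replica attracts."""
    M = len(C)
    total_writes = sum(w[x][k] for x in range(M))
    return (read_cost(i, k, r, o, C, SN)
            - total_writes * o[k] * C[i][SP[k]]
            - update_cost(i, k, w, o, C, SP, R)) / o[k]

# Hypothetical 3-site network with one object, primary at site 0.
C  = [[0, 2, 5], [2, 0, 3], [5, 3, 0]]   # C(i, j), cost per unit
r  = [[10], [4], [6]]                    # reads per site and object
w  = [[1], [2], [1]]                     # writes per site and object
o  = [8]                                 # object sizes
SP = [0]                                 # primary site of each object
R  = [[0]]                               # current replication scheme R_k
SN = [[0], [0], [0]]                     # nearest replicator per site/object
```

With these numbers, site 2's heavy reads over the expensive link make it the natural greedy candidate: the greedy method would repeatedly pick the (site, object) pair with the largest positive benefit and set the corresponding X_ik to 1.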


    4 Data Access Configuration

Efficient data access across different storage systems was a primary consideration in the implementation of the computing platform in our work. The method proposed here shields the differences between the underlying storage systems and supplies the computing platform with universal data access, to achieve higher flexibility and efficiency.

The data I/O model[3] covers the differences between the underlying storage systems and mediates the interaction between the computing platform and the storage. This enables the computing platform to focus only on data access through the I/O layer; data storage and management do not impact the computing.

The details of the storage system are hidden behind the I/O layer. When the different storage systems are described by a simple configuration (also called the driver), the computing side can call a universal API to perform read and write operations on the storage. This decreases the coupling between storage and computing: the computing can be configured with different storage back ends, and when constructing a distributed application the storage and the computing can be implemented separately, making applications much more flexible.

The distributed computing platform interacts with the distributed storage through this I/O layer, and the framework of the computing platform supplies this interaction. The properties of the different storage systems can be configured in configuration files, and the computing platform will then switch to the corresponding storage system.

The design and implementation of the configuration files is essential to universal data access. The configuration file for the different distributed file systems is DFS-Mapping.xml. It configures the I/O operations of each storage system and related details.

For example:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <name>dfs.name</name>
  <value></value>
  <description>The name of this Distributed File System, such as HDFS or GFS</description>
  <name>dfs.io.input</name>
  <value></value>
  <description>The DataInputStream of the Distributed File System</description>
  <name>dfs.io.output</name>
  <value></value>
  <description>The DataOutputStream of the Distributed File System</description>
  ...
</configuration>

The framework handles the mapping between the distributed file systems and the data I/O. Users can supply this simple configuration, and the computing platform then achieves universal data access over the low-level file systems. The differences are hidden, so the interaction between computing and storage becomes clearer and easier.
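As an illustration, a computing platform could load a DFS-Mapping.xml-style file and switch drivers behind a universal API as follows. Only the file's role comes from the text; the driver classes, element layout and values here are hypothetical:

```python
import xml.etree.ElementTree as ET

class HDFSDriver:
    """Illustrative stand-in for an HDFS-specific I/O implementation."""
    def read(self, path):
        return f"hdfs-read:{path}"

class GFSDriver:
    """Illustrative stand-in for a GFS-specific I/O implementation."""
    def read(self, path):
        return f"gfs-read:{path}"

DRIVERS = {"HDFS": HDFSDriver, "GFS": GFSDriver}

def load_driver(xml_text):
    # Read dfs.name from the mapping file and switch to the corresponding
    # storage driver; callers only ever see the universal read/write API.
    root = ET.fromstring(xml_text)
    names = [n.text for n in root.iter("name")]
    values = [v.text for v in root.iter("value")]
    config = dict(zip(names, values))
    return DRIVERS[config["dfs.name"]]()

mapping = """<configuration>
  <name>dfs.name</name><value>HDFS</value>
</configuration>"""
driver = load_driver(mapping)
```

Swapping the value of dfs.name is then enough to retarget the same computing code at a different storage system, which is the decoupling the text argues for.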


in-house or outsourced to a specialised data storage provider. When a Web 2.0 site requests content from the data owner, he or she provides a link into the DSS in which the data is stored. Later, when the data is needed for display as part of a page generated by the site, the displaying browser instance uses the embedded link to retrieve the data directly from the owner's DSS[2]. To aid integration of the DDS into the data owner's web experience, a new module called the DDS browser extension is introduced. Communication between the browser extension and the storage service is achieved by inserting a DDSStorageService-AttachRequest into the HTTP response. This request is transparently inspected on-the-fly by the browser extension, which allows the extension to re-write parts of the response HTML dynamically. For the end user, the extension re-writes links into a remote DSS with the data returned from that specific DSS.

Figure 3: Storage (Information Upload) [2]

Phase two of the process involves the data owner publishing content to the storage service. Again, the implementation is not restricted by the API, as long as each piece of stored data is given a unique identifier that is global in that data owner's domain. The unique identifier comprises three components:

[name]:[path]@[system]

The name component is an identifier for each specific piece of data (for example, credit card). The path component supports a hierarchical storage structure, allowing a data owner to store various groupings of data (for example, a data owner may keep separate sets of personal and business data). The system component is a unique identifier for the specific


    distributed data service.
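Such identifiers can be split mechanically into their three components. This sketch assumes the name never contains ':' and the system identifier follows the last '@'; the example identifier is hypothetical:

```python
def parse_dds_id(identifier):
    """Split a DDS unique identifier of the form [name]:[path]@[system]."""
    name, rest = identifier.split(":", 1)   # name is everything before ':'
    path, system = rest.rsplit("@", 1)      # system id follows the last '@'
    return {"name": name, "path": path, "system": system}

# e.g. a "creditcard" element in the owner's personal/finance grouping,
# hosted on a hypothetical DDS instance:
ident = parse_dds_id("creditcard:personal/finance@dds.example.org")
```

A web application that is handed such an identifier can resolve the system component to a DDS endpoint and request the named element from it directly.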

Once the data owner has uploaded data into their DDS, the next stage is to allow web applications to Access[2] this data. During the registration process for a DDS-enabled web application, the web browser extension inspects the HTTP response traffic from the application and detects a DDS-Application-Subscribe-AuthRequest. This request, and the corresponding response generated by the browser extension, is the basis for establishing a trust relationship between the web application and the distributed data service. Once this relationship is established, the data owner establishes a link between their data element and the web application. This link is akin to the data owner uploading content to the web application in a standard Web 2.0 scenario, except that in the DDS design the data owner provides the unique identifier of the data rather than uploading the content itself. Once the link is established between a piece of data referenced by the web application and the storage location of that data in a DDS, the web application is free to request the data directly from the DDS.

Figure 4: Access (Data Linkage) [2]

The final stage in the design is the Presentation[2] stage. Here we define how the data is transparently presented to end users. Again the browser extension plays a key role; in this instance the extension operates on behalf of a 3rd party that does not have any direct relationship to the


DDS holding the rendered data. For example, an end user may access a web application and request to view an image collage built from images stored in multiple DDSs owned by multiple data owners. In this instance the web application, instead of returning raw data as in the classic Web 2.0 case, will return a DDS-Present-DataRequest containing security information exchanged between the web application and the DDS during the initial authentication request. This security information is protected using PKI, to ensure that it cannot be abused to falsify links between a DDS and unauthorised web applications or clients. The trust relationship enforced in this case is between the web application and the DDS; hence the DDS itself does not need to be aware of all the end users who may render the data linked to a specific web application. The DDS-Present-DataRequest message triggers a handoff of the user from the web application to the DDS, allowing the browser extension to request and render the data directly from the DDS, under the web application's instruction.

Figure 5: Data Presentation [2]
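The essence of the Presentation handoff, the DDS refusing data unless the request carries valid security information from the application-DDS trust relationship, can be sketched as follows. This is purely illustrative: a shared-secret HMAC stands in for the PKI protection described in the text, and all names and values are hypothetical:

```python
import hashlib
import hmac

SHARED_SECRET = b"app-dds-secret"   # stands in for the PKI trust material

def sign_present_request(data_id):
    # The web application returns a DDS-Present-DataRequest carrying
    # security information instead of the raw data itself.
    tag = hmac.new(SHARED_SECRET, data_id.encode(), hashlib.sha256).hexdigest()
    return {"type": "DDS-Present-DataRequest", "data": data_id, "auth": tag}

def dds_serve(request, store):
    # The DDS verifies the security information before handing the data
    # to the browser extension; forged or tampered links are rejected.
    expected = hmac.new(SHARED_SECRET, request["data"].encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, request["auth"]):
        return None
    return store.get(request["data"])

store = {"img:holiday/01@dds.example.org": b"<jpeg bytes>"}
req = sign_present_request("img:holiday/01@dds.example.org")
```

Because verification happens at the DDS, the DDS never needs a list of end users: any browser extension presenting a validly signed request on the application's behalf is served.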


    6 Conclusion

In this report, I have tried to outline some of the basic but important points regarding distributed storage systems, ranging from one of their infrastructures, DISKOS, to their application in Web 2.0. Data replication schemes and methods of accessing data have also been outlined.

DISKOS has shown that we can create a file-system-neutral distributed storage environment while reaching high performance for several write access patterns.

The data replication problem has also been addressed, and a cost model has been developed which is applicable to very large distributed systems such as the WWW and distributed databases.

Light has also been shed on some data access methods and on how they are used to separate storage from computing; access efficiency and availability remain the focus of further research.

We have also seen concerns resulting from the growing popularity of Web 2.0 applications and defined a new paradigm for the distributed storage of data on the Internet. As user take-up of Web 2.0 applications continues, it is sensible to adopt a distributed approach that parallels the way content is originally generated.
