An Analysis of Cloud Reliability Approaches Based on Cloud Components and Reliability Techniques

International Journal of Advance Foundation and Research in Computer (IJAFRC)

Volume 1, Issue 6, June 2014. ISSN 2348 - 4853

107 | © 2014, IJAFRC All Rights Reserved www.ijafrc.org

Abishi Chowdhury1, Priyanka Tripathi2

National Institute of Technical Teachers’ Training and Research, Bhopal, India1,2

[email protected] 1, [email protected] 2

ABSTRACT

Cloud computing powers business and organizational growth by providing three basic services: Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS). Because cloud users worldwide demand multiple services from the cloud at the same time, the reliability of the system is one of the most important concerns for cloud service providers. The reliability of a system can be determined from the number of failures that occur in the cloud computing environment relative to the total number of tasks performed by the cloud. It depends on the reliability of every component of which the system is composed. This paper analyzes different cloud reliability techniques, the components used for reliability measurement and the methodology for measuring reliability. Based on these parameters, we have prepared a table that compares the techniques.

Index Terms: Virtual Machine; Reliability; Cloud Manager; Fault Tolerance; Fault Manager; Fault Tolerance Middleware

I. INTRODUCTION

Cloud computing provides on-demand services to its users, who can request any kind of service, such as software, platform or infrastructure, at any time and from anywhere [1][2]. The cloud preserves its abstract nature while providing these services. A cloud comprises many servers; a data center can be a collection of thousands of servers. Users request cloud infrastructure and use the servers to carry out their tasks. The cloud provides this infrastructure in the form of virtual machines (VMs), and it can also provide a whole virtual infrastructure. Different approaches are used to do this.

Any type of system can fail. Failure in the cloud environment is not exceptional, because faults are inherent in the technology and can occur at any time. Failure affects different aspects of the system. Most importantly, it affects the vast worldwide business of cloud computing: even a small failure can cause a great loss to the cloud service provider, affecting both its revenue and its long-term image.

Failures can be hardware failures or software failures, and the two require different strategies. Hardware failures include failure of the memory, failure of the disk, etc. Software failures include application failure, execution time failure, timeout failure, etc.

The reliability of cloud resources depends not only on the reliability of the individual resources but also on how reliably they work together. When calculating the reliability of cloud computing, one must therefore take into account whether the components work in parallel or not.
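To illustrate why the arrangement of components matters, the following minimal sketch applies the standard series and parallel reliability formulas. It assumes that components fail independently; the function names and sample values are illustrative and are not taken from the surveyed work.

```python
from math import prod


def series_reliability(reliabilities):
    """Series arrangement: every component must work, so R = product of R_i."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """Parallel arrangement: one working component suffices,
    so R = 1 - product of (1 - R_i)."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Two components, each with reliability 0.9 (illustrative values).
components = [0.9, 0.9]
print("series:  ", series_reliability(components))
print("parallel:", parallel_reliability(components))
```

With the same two components, the series arrangement is less reliable than either component alone, while the parallel arrangement is more reliable, which is why the combined figure cannot be read off the individual figures.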

This paper presents a study of cloud reliability and of the techniques proposed for measuring and improving it. First, the National Institute of Standards and Technology (NIST) view of cloud computing reliability is discussed. Then different techniques for measuring cloud reliability and for overcoming faults in the cloud environment are discussed.

II. RELIABILITY CONCEPT

A. NIST Standards

As stated by the National Institute of Standards and Technology (NIST) [3], reliability is broadly a function of four main components of cloud computing:

• The software and hardware offered by the cloud service providers

• The personnel resources provided

• Connectivity

• The consumers’ personnel

Measuring the reliability of a cloud computing environment is a very difficult task, for two main reasons. First, a cloud environment contains a number of components, and the individual reliability of these components differs from their reliability taken together. Second, reliability is highly dynamic and depends on the environment. A reliability model can therefore be considered only after all the possible failure conditions in the cloud environment have been taken into account.

B. Different Reliability Approaches

1. Adaptive Fault Tolerance (AFT)

Adaptive Fault Tolerance in Real Time Cloud Computing (AFTRC) [4] is an adaptive fault tolerance technique for real-time cloud computing environments. It tolerates faults on the basis of the reliability of each virtual machine, and a virtual machine is selected based on its reliability. There are two types of nodes: the virtual machines running on the cloud and the adjudication node. Each running virtual machine hosts the real-time application and an acceptance test that checks logical validity. The adjudication node hosts the decision system, the reliability assessor and the time checker. In brief, this technique uses:

• Acceptance test: It checks the results of the real-time algorithms.
• Time checker: It checks the time taken to produce the result of each module.
• Reliability assessor: It assesses the reliability of every virtual machine.
• Decision mechanism: It makes the decision about which virtual machine to select.
• Recovery cache: It holds the checkpoints.
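The interplay of these components can be sketched as follows. This is only an illustration under stated assumptions: the validity check, the multiplicative reliability update (`gain`, `penalty`) and all names are hypothetical, since the survey does not give the rules used in [4].

```python
from dataclasses import dataclass


@dataclass
class VMResult:
    vm_id: str
    value: int          # result produced by the real-time application on this VM
    elapsed_ms: float   # time the VM took to produce the result
    reliability: float  # current reliability estimate of the VM


def acceptance_test(result):
    # Hypothetical validity check: a valid result is non-negative.
    return result.value >= 0


def time_checker(result, deadline_ms):
    # The result is only useful if it arrived before the deadline.
    return result.elapsed_ms <= deadline_ms


def reliability_assessor(result, passed, gain=1.05, penalty=0.8):
    # Assumed update rule: reward a pass, penalize a failure, clamp to [0, 1].
    factor = gain if passed else penalty
    result.reliability = min(1.0, max(0.0, result.reliability * factor))
    return result.reliability


def decision_mechanism(results, deadline_ms, min_reliability):
    candidates = []
    for r in results:
        passed = acceptance_test(r) and time_checker(r, deadline_ms)
        reliability_assessor(r, passed)
        if passed and r.reliability >= min_reliability:
            candidates.append(r)
    # Pick the passing result from the most reliable VM. None signals that
    # recovery from the last checkpoint (recovery cache) would be needed.
    return max(candidates, key=lambda r: r.reliability, default=None)
```

A VM that misses the deadline or fails the acceptance test loses reliability and is excluded for that cycle, so repeated failures gradually push it out of consideration.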

2. Cloud Service Reliability Modelling and Analysis

The authors of [5] present a reliability model for cloud computing that deals with several types of failures affecting the success of cloud services. First, a cloud computing system in VGrADS (Virtual Grid Application Development Software) is described. In this system, a CMS (Cloud Management System), composed of a set of servers, has four responsibilities:
• Managing a request queue that contains the requests of different cloud users
• Managing computing resources, such as PCs
• Managing data resources, such as databases
• Scheduling requests and assigning them to different computing and data resources

Every request passes through the CMS, which then allocates resources to it. A number of failures were analyzed: computing resource missing, data resource missing, overflow failure, timeout failure, database failure, network failure, software failure and hardware failure [6][7][8]. The classification of reliability stages is given in Figure 1.

Figure 1. Reliability Stages

The request stage reliability is modeled with a Markov model and the execution stage reliability with a graph-based model, which is further enhanced by combining it with a Bayesian network model. The overall reliability is the product of the request stage reliability and the execution stage reliability.
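As a small worked example of this product rule, the sketch below combines two stage reliabilities. The function name and the numeric values are illustrative assumptions; the Markov and graph-based models that would produce the stage values are not reproduced here.

```python
def overall_service_reliability(request_stage, execution_stage):
    """Overall reliability as the product of the two stage reliabilities."""
    for r in (request_stage, execution_stage):
        if not 0.0 <= r <= 1.0:
            raise ValueError("a reliability must lie in [0, 1]")
    return request_stage * execution_stage


# Illustrative values: the result can never exceed the weaker stage.
print(overall_service_reliability(0.98, 0.95))
```

Because both factors are at most 1, the overall reliability is bounded by the less reliable of the two stages.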

3. Fault-Tolerant and Reliable Computation in Cloud Computing

In this paper the authors explore the security aspect of scientific computation in cloud computing. They investigate a proper cloud selection strategy and protection against faulty and malicious clouds [9], taking large matrix multiplication as the scientific computation. They assume that there are several clouds, each containing several servers. These servers are only partially trusted, based on the experience of the individual client, and the client knows the reliability and cost of each cloud; different clouds have different costs. The work is divided by distributing the multiplication of rows and columns over the different clouds, and the cost of using the different clouds is calculated.

Suppose a client dispatches l_i rows of matrix A to be multiplied with matrix B in cloud i, where cloud i has cost C_i per row and reliability R_i. The overall cost is then:

$$C = \sum_{i} l_i C_i$$

The reliability of the dispatched task is:

$$R = \frac{\sum_{i} l_i R_i}{l}$$

where l is the number of rows in matrix A and $\sum_{i} l_i = l$.

The main objective is to minimize the overall cost C subject to R >= R_s, where R_s is a previously specified minimum reliability requirement. The overall reliability of the task can also be expressed as the minimum of the reliabilities of all n clouds involved in the calculation, i.e.

$$R = \min\{R_1, R_2, \ldots, R_n\}$$

All clouds with reliability less than the minimum required reliability R_s can then simply be discarded. From the remaining clouds, the cloud with reliability greater than R_s and the lowest cost C_i is chosen first; other clouds with higher reliability are then chosen as necessary.
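The selection rule can be sketched as a simple greedy procedure. This is a simplified illustration rather than the algorithm of [9]: the per-cloud `capacity` field is a hypothetical addition that gives a reason to use more than one cloud, the remaining eligible clouds are taken in order of cost, and all names and values are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Cloud:
    name: str
    reliability: float   # reliability of the cloud, assumed known to the client
    cost_per_row: float  # cost of multiplying one row in this cloud
    capacity: int        # hypothetical: maximum number of rows the cloud accepts


def allocate_rows(clouds, total_rows, min_reliability):
    """Discard clouds below the required reliability, then fill cheapest first."""
    eligible = sorted(
        (c for c in clouds if c.reliability >= min_reliability),
        key=lambda c: c.cost_per_row,
    )
    allocation, remaining = {}, total_rows
    for cloud in eligible:
        if remaining == 0:
            break
        rows = min(cloud.capacity, remaining)
        if rows > 0:
            allocation[cloud.name] = rows
            remaining -= rows
    if remaining > 0:
        raise ValueError("eligible clouds cannot hold all rows")
    return allocation


def overall_cost(clouds, allocation):
    by_name = {c.name: c for c in clouds}
    return sum(n * by_name[name].cost_per_row for name, n in allocation.items())


def weighted_reliability(clouds, allocation):
    # Row-weighted average of the reliabilities of the clouds used.
    by_name = {c.name: c for c in clouds}
    total = sum(allocation.values())
    return sum(n * by_name[name].reliability for name, n in allocation.items()) / total


def minimum_reliability(clouds, allocation):
    # Alternative measure: the least reliable cloud that received rows.
    by_name = {c.name: c for c in clouds}
    return min(by_name[name].reliability for name in allocation)
```

Because every cloud that survives the filter already meets the reliability requirement, both the weighted and the minimum reliability of any allocation over those clouds meet it as well, so the remaining choice is purely about cost.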




4. Fault Tolerance and Resilience

In [10], the relationship between faults, errors and failures is expressed by the following chain:

Fault → Error → Failure

The failure behavior of the servers in a data center can be obtained by studying server failures and hardware failures. Fault tolerance must be applied to enhance the reliability of hard disks in order to considerably cut down the number of failures. According to the study, failed machines are replaced. The failure behavior of networks should also be studied, as a data center is built from several network components. The study observes that the reliability of the data center network as a whole is almost 99.99%. Fault tolerance is the capability of a system to perform its function in spite of the presence of failures. Faults are classified into two categories, as shown in Figure 2:

Figure 2: Classification of Faults

Crash faults cause system components to stop functioning or to remain idle at the time of failure, for example a hard disk crash or a power outage.

Byzantine faults cause system components to behave incorrectly at the time of failure, so that the system shows erratic behavior.

The most popular methods for handling these two types of faults are described below:

Checking and monitoring: The system is observed continuously at runtime in order to verify that it behaves according to its specification.

Checkpoint and restart: The state of the system is captured and stored so that, when the system fails, its correct state can be restored from the checkpoint information.

Replication: The essential system components are replicated so that a copy of each component is available during a failure.
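Checkpoint and restart can be illustrated with a toy job that sums a list of items. The sketch below is an assumption-laden simulation: crashes are injected at chosen indices and occur once each, the "stable storage" is just a local variable, and none of the names come from [10].

```python
import copy


class SimulatedCrash(Exception):
    """Stands in for a crash fault that wipes the in-memory state."""


def run_job(items, checkpoint_every, crash_at):
    """Sum `items`, surviving simulated crashes via checkpoint and restart."""
    pending_crashes = set(crash_at)        # each index crashes once (assumption)
    saved = {"next_index": 0, "total": 0}  # last checkpoint on stable storage
    restarts = 0
    while True:
        # Restart: the working state is rebuilt from the last checkpoint.
        state = copy.deepcopy(saved)
        try:
            while state["next_index"] < len(items):
                i = state["next_index"]
                if i in pending_crashes:
                    pending_crashes.discard(i)
                    raise SimulatedCrash(f"crash while processing item {i}")
                state["total"] += items[i]
                state["next_index"] = i + 1
                if state["next_index"] % checkpoint_every == 0:
                    saved = copy.deepcopy(state)  # checkpoint
            return state["total"], restarts
        except SimulatedCrash:
            restarts += 1
```

After a crash, only the work done since the last checkpoint is repeated, so a shorter checkpoint interval trades more checkpointing overhead for less recomputation.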

5. Fault Tolerance Middleware

The Low Latency Fault Tolerance (LLFT) middleware [11] provides fault tolerance for distributed applications deployed within data centers that comprise several servers, storage systems and networks. Using the leader/follower approach, the LLFT middleware replicates the application processes in order to protect the application against several kinds of faults, in particular crash faults and timing faults. With a crash fault, a process or processor produces no further results; with a timing fault, it does not produce a result within a specified time constraint. Byzantine faults are not handled by this middleware. The LLFT middleware supports two types of leader/follower replication:

• Semi-active replication: The primary replica orders the messages it receives, executes the operations and provides the ordering information for non-deterministic operations to the backup replicas.
• Semi-passive replication: In addition to the above, the primary replica always communicates state updates to the backup replicas. It uses less processing power than semi-active replication, but if the primary fails it incurs greater latency for recovery and reconfiguration.

The LLFT middleware comprises the following:

• Low Latency Fault Tolerance (LLFT) Messaging Protocol: It provides two main services for application messages:
   • Reliable delivery: All members of a group receive every message that is multicast to the group on a network connection.
   • Total ordering: The primary replica of a group communicates the ordering information to all the backups in the group, and all members of the group deliver the messages to the application in the same order.
• Low Latency Fault Tolerance (LLFT) Membership Protocol: This protocol ensures that all members of a group have a consistent view of the membership set and of the primary replica of the group. It is much faster than a multi-round consensus protocol, a property that matters mainly when the primary replica fails. The primary replica decides on the inclusion of backup replicas in the group and their exclusion from it on the basis of their ranks and precedence. The precedence of a group member is determined by the order in which the members join the group; the rank of the primary replica is 1, and the backup replicas have ranks 2, 3, 4, …, assigned by the primary replica on the basis of their precedence.
• Low Latency Fault Tolerance (LLFT) Virtual Determinizer Framework: Applications in a cloud computing environment commonly involve several sources of non-determinism. To preserve strong replica consistency, it is vital to mask these sources. The framework records the ordering information and the result of each non-deterministic operation performed by the primary replica, and the backup replicas follow the same ordering as the primary.
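The record-and-replay idea behind the Virtual Determinizer Framework can be sketched as below. This is not the LLFT API: the class names, the in-memory log that stands in for the ordering messages, and the sample operations are all hypothetical.

```python
import random
import time
from collections import deque


class PrimaryReplica:
    def __init__(self):
        # Ordered (operation name, result) records, as would be sent to backups.
        self.log = []

    def nondeterministic(self, name, operation):
        # The primary actually runs the operation and records its result.
        result = operation()
        self.log.append((name, result))
        return result


class BackupReplica:
    def __init__(self, log):
        self.pending = deque(log)

    def nondeterministic(self, name, operation):
        # The backup never runs `operation`; it reuses the primary's result
        # in the order in which the primary produced it.
        recorded_name, result = self.pending.popleft()
        if recorded_name != name:
            raise RuntimeError("replica diverged from the primary's ordering")
        return result


def application(replica):
    # Two typical sources of non-determinism: random numbers and clock reads.
    ticket = replica.nondeterministic("ticket", lambda: random.randint(1, 10**6))
    stamp = replica.nondeterministic("timestamp", time.time)
    return ticket, stamp
```

Running `application` on a `BackupReplica` built from the primary's log yields exactly the values the primary saw, which is what keeps the replicas consistent even though the underlying operations are not repeatable.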

6. A System Level Approach

The purpose of this approach [12] is to overcome the limitations of existing methodologies by providing fault tolerance properties as on-demand services. It gives applications the flexibility to adjust their fault tolerance properties and the required level of availability and reliability dynamically over time. The cost of resources can be reduced to a certain extent, and the performance level can be adjusted to particular business needs. Users obtain explicit fault tolerance support for their applications without needing comprehensive knowledge of the system-level details. By adding a dedicated service layer between the computing framework and the applications, fault tolerance support can be provided to each application while abstracting the complexity of the underlying infrastructure. To offer well-developed support, the service layer must accommodate a range of reliability mechanisms and construct fault tolerance solutions that can be delivered to different applications. To accomplish this, a fault tolerance solution is viewed as a combination of distinct activities: each fault tolerance mechanism, such as fault detection, replication of an application, masking and recovery, is treated as a distinct activity, and these activities are combined to build a fault tolerance solution.

Each individual activity can be implemented as a stand-alone configurable module that provides a consistent solution to a recurring system failure. Moreover, each module is associated with a set of metadata that characterizes its functional, structural and operational properties. These metadata can be inspected at runtime and compared with different users' requirements in order to choose the relevant activities. This approach can be realized by implementing each module separately as a web service described by a WSDL [13] document. The feasibility of the proposed approach is demonstrated by designing a scheme, the Fault Tolerance Manager (FTM).

FTM is composed of the following components:

• Replication Manager: It incorporates techniques to maintain consistency within a replica group by updating the state of the backup replicas and the primary replica.
• Fault Detection/Prediction Manager: It is used to detect faults promptly after their occurrence and to notify the FTM Kernel so that it can invoke services from the Fault Masking Manager and the Recovery Manager. It also notifies the Resource Manager about the faulty replica so that the resource state of the cloud can be updated.
• Fault Masking Manager: This component involves a collection of algorithms that mask the occurrence of failures and contain the faults in order to meet the high availability demands of the cloud users.
• Recovery Manager: It incorporates all the mechanisms used to return erroneous nodes to a normal state.
• Messaging Monitor: It is used to convey the necessary messages among the replicas of a replica group and also for inter-component communication.
• Client/Admin Interface: It is used to capture users' requirements and acts as an interface between FTM and the end users.
• FTM Kernel: It is the pivotal computing component of FTM, which manages the reliability mechanisms present in the scheme.
• Resource Manager: This component is used to allocate the required resources efficiently and to avoid under-provisioning and over-provisioning during failures.

7. Fault Tolerant Approaches in Cloud Infrastructure

In most recent approaches, fault tolerance is handled entirely by the cloud service providers or by the customers, with no collaboration between them. This sometimes leads to a partial or faulty solution. To overcome this issue, different fault tolerance policies in cloud computing are investigated in [14]. There are mainly two types of policies: in the first, fault tolerance mechanisms are handled solely by either the cloud service provider or the customer; in the second, fault tolerance is managed collaboratively by the customers and the service providers.

Figure 3. Cloud computing Architecture

In general, a cloud platform has three layers, shown in Figure 3: applications, virtual machines and resources. Each layer is associated with several failures, so there are three main types of failures: application failure, virtual machine failure and hardware failure. The fault tolerance solutions for these failures are described in Figure 4:

Figure 4. Fault Tolerance

The first fault tolerance method concentrates on stateless applications such as proxies, e.g. HAProxy or MySQL-Proxy. The second is a stateful method, in which the customer must implement functions for storing the state of the server so that this state can be resumed on the next start of the system. Sensors can be used for repairing VM faults. First, the faulty VM is deallocated from the job. Second, a new VM is allocated. Third, the tasks that were running on the failed VM are started. Fourth, the state of the VM is restored. For physical (hardware) fault tolerance, the customer cannot see all these types of fault; they can be monitored by the cloud service providers, using a monitoring system composed of sensors. These techniques are used in [15][16].

8. A Virtualization and Fault Tolerance Approach

Fault tolerance is provided to the cloud infrastructure by implementing a cloud manager [17], a load balancer, a fault handler and a decision maker. A parameter called the success rate is used specifically for fault tolerance: a job is given to a virtual machine whose success rate is higher than a specified value, which decreases the chance of a fault. The fault handler is responsible for updating the performance table of a VM when the VM is found to be faulty. Using the success rate and the performance table, the cloud infrastructure is made more fault tolerant.
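The success-rate rule of the virtualization and fault tolerance approach can be sketched as follows. This is a minimal illustration under stated assumptions: the structure of the performance table, the treatment of untried VMs and every name below are hypothetical and are not taken from [17].

```python
class PerformanceTable:
    """Per-VM counters from which a success rate is derived."""

    def __init__(self, vm_ids):
        self.stats = {vm: {"success": 0, "total": 0} for vm in vm_ids}

    def success_rate(self, vm):
        s = self.stats[vm]
        # Assumption: a VM that has not run any job yet counts as fully successful.
        return 1.0 if s["total"] == 0 else s["success"] / s["total"]

    def record(self, vm, succeeded):
        self.stats[vm]["total"] += 1
        if succeeded:
            self.stats[vm]["success"] += 1


def fault_handler(table, vm):
    # When a VM is found to be faulty, its performance table entry is updated.
    table.record(vm, succeeded=False)


def decision_maker(table, threshold):
    """Return the VM with the best success rate above `threshold`, or None."""
    eligible = [vm for vm in table.stats if table.success_rate(vm) > threshold]
    return max(eligible, key=table.success_rate, default=None)
```

Each reported fault lowers the success rate of the affected VM, so a VM that fails repeatedly eventually drops below the threshold and stops receiving jobs.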




III. TABULAR ANALYSIS

A tabular analysis of the different approaches is given in Table 1. For each technique, the table lists the component whose reliability is measured (for example a VM; some approaches broadly consider the reliability of the whole infrastructure or system), the methodology used for measuring cloud reliability, and the techniques and components used for reliability measurement together with their effect on the reliability of the system. The last column states what the technique is used for.

| Sl No. | Technique Name | Components for the reliability measurement | Methodology for measuring reliability | Reliability measurement | Used for |
|---|---|---|---|---|---|
| B.1 | Adaptive fault tolerance (AFT) | Reliability of each virtual machine is measured. | Virtual machines are divided into two categories: running and adjudication virtual machines. | Acceptance test, Time checker, Reliability assessor, Decision mechanism and Recovery cache are used. | Real time cloud computing |
| B.2 | Cloud Service Reliability Modelling and Analysis | Reliability of the system is measured. | Reliability is divided into two parts: request time reliability and execution time reliability. | Total reliability is measured by the product of the two reliabilities. | Handling different failures in Cloud Computing Environment |
| B.3 | Fault-Tolerant and Reliable Computation in Cloud Computing | Reliability of the server is measured. | Reliability and the cost relation are studied. | A reliable component with reliability greater than a threshold value and with lower cost is selected. | General scientific computation |
| B.4 | Fault Tolerance and Resilience | Failure of a machine | Faults are divided into two parts: Crash faults and Byzantine faults. | Reliability of the system increases with checkpoint, restart and replacement. | Characterizing recurrent failures in Cloud environment |
| B.5 | Fault Tolerance Middleware | Fault in the system | Faults are divided into two parts: Crash and Timing faults. | By providing a middleware service, the overall reliability of the system increases. | Distributed applications fault tolerance |
| B.6 | A System Level Approach | Reliability as a service | By introducing the FTM for reliability | Reliability of the system increases by using FTM. | Providing fault tolerance property as on demand service |
| B.7 | Fault Tolerant Approaches in Cloud Infrastructure | Faults in the system | Faults are divided into application, virtual machine and physical node faults. | Reliability is increased by the stateless and the stateful approaches. Replication and the sensors increase the fault tolerance. | Autonomic repairing of faults |
| B.8 | A Virtualization and Fault Tolerance Approach | Cloud infrastructure | Faults are handled by the fault handler. | Automatic updating of the system by the fault handler and use of the success rate parameter to increase the reliability of the system. | Reducing the service time and increasing the system availability in a Cloud environment |

IV. CONCLUSION AND FUTURE WORK

In this paper, we have studied different approaches to cloud computing reliability. Cloud reliability raises many issues, such as heterogeneity and the dynamic nature of the environment. The reliability of cloud computing depends on the reliability of its components, such as VMs, physical nodes and the applications running in the cloud environment. Several types of faults occur in the cloud environment, such as crash faults, timing faults and application faults. The reliability of the cloud environment can be increased by replication, restart, continuous auditing of the information about each component of the cloud environment and the use of efficient sensors for monitoring. In future, we will work on improving cloud reliability by proposing a mechanism that implements a collection of these reliability approaches.

V. REFERENCES

[1] Peter Mell, Timothy Grance, NIST Special Publication 800-145, National Institute of Standards and Technology, U.S. Department of Commerce.

[2] Sun Microsystems, “Introduction to Cloud Computing Architecture”, white paper.

[3] Lee Badger, Tim Grance, Robert Patt-Corner, Jeff Voas, “Cloud Computing Synopsis and Recommendations”, NIST Special Publication 800-146.

[4] Sheheryar Malik, Fabrice Huet, “Adaptive Fault Tolerance in Real Time Cloud Computing”, World Congress on Services, IEEE, 2011.

[5] Yuan-Shun Dai, Bo Yang, Jack Dongarra, Gewei Zhang, “Cloud Service Reliability: Modeling and Analysis”.

[6] D. Abramson, R. Buyya, J. Giddy, “A computational economy for grid computing and its implementation in the Nimrod-G resource broker”, Future Generation Computer Systems.

[7] Y.S. Dai, M. Xie, K.L. Poh, “Reliability of grid service systems”, Computers & Industrial Engineering, 50(1-2), 130-147.

[8] Y.S. Dai, M. Xie, K.L. Poh, “Reliability Analysis of Grid Computing Systems”, the 9th IEEE Pacific Rim Symposium on Dependable Computing, IEEE Computer Press.

[9] Jing Deng, Scott C.-H. Huang, Yunghsiang S. Han, Julia H. Deng, “Fault-Tolerant and Reliable Computation in Cloud Computing”.

[10] Ravi Jhawar, Vincenzo Piuri, “Fault Tolerance and Resilience in Cloud Computing Environments”.

[11] Wenbing Zhao, P. M. Melliar-Smith, L. E. Moser, “Fault Tolerance Middleware for Cloud Computing”, 978-0-7695-4130-3/10, IEEE.

[12] Ravi Jhawar, Vincenzo Piuri, Marco Santambrogio, “A Comprehensive Conceptual System-Level Approach to Fault Tolerance in Cloud Computing”, 978-1-4673-0750-5, IEEE, 2012.

[13] T. Erl, “Service-Oriented Architecture: Concepts, Technology, and Design”, USA: Prentice Hall PTR.

[14] Alain Tchana, Laurent Broto, Daniel Hagimont, “Fault Tolerance Approaches in Cloud Computing Infrastructures”, ISBN: 978-1-61208-187-8, IEEE, 2012.

[15] Microsoft, “Windows Azure: Microsoft’s cloud services platform”, http://www.microsoft.com/windowsazure/.

[16] John Paul Walters, Vipin Chaudhary, “A fault-tolerant strategy for virtualized HPC clusters”, The Journal of Supercomputing.

[17] Pranesh Das, Pabitra Mohan Khilar, “VFT: A Virtualization and Fault Tolerance Approach for Cloud Computing”, 978-1-4673-5758-6/13, IEEE, 2013.
