18
A survey of fault tolerance in cloud computing Priti Kumari , Parmeet Kaur Department of Computer Science & Information Technology, Jaypee Institute of Information Technology, Noida, India article info Article history: Received 22 June 2018 Revised 24 August 2018 Accepted 25 September 2018 Available online xxxx Keywords: Cloud computing Fault tolerance Data center Network topology abstract Cloud computing has brought about a transformation in the delivery model of information technology from a product to a service. It has enabled the availability of various software, platforms and infrastruc- tural resources as scalable services on demand over the internet. However, the performance of cloud computing services is hampered due to their inherent vulnerability to failures owing to the scale at which they operate. It is possible to utilize cloud computing services to their maximum potential only if the per- formance related issues of reliability, availability, and throughput are handled effectively by cloud service providers. Therefore, fault tolerance becomes a critical requirement for achieving high performance in cloud computing. This paper presents a comprehensive overview of fault tolerance-related issues in cloud computing; emphasizing upon the significant concepts, architectural details, and the state-of-art tech- niques and methods. The objective is to provide insights into the existing fault tolerance approaches as well as challenges yet required to be overcome. The survey enumerates a few promising techniques that may be used for efficient solutions and also, identifies important research directions in this area. Ó 2018 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Contents 1. Introduction .......................................................................................................... 00 2. Background of cloud computing .......................................................................................... 00 2.1. Cloud computing infrastructure ..................................................................................... 00 2.2. Cloud deployment models ......................................................................................... 00 2.3. Cloud service model .............................................................................................. 00 3. Fault tolerance approaches in distributed systems ........................................................................... 00 3.1. Fault tolerance in distributed computing environments ................................................................. 00 4. Fault tolerance approaches in the cloud computing .......................................................................... 00 4.1. System model ................................................................................................... 00 4.2. Proactive approaches ............................................................................................. 00 4.3. Reactive approaches .............................................................................................. 00 4.4. Other miscellaneous approaches used for FT .......................................................................... 00 4.5. Integration of fault tolerance in miscellaneous cloud-based and related applications .......................................... 00 5. Future directions for research ............................................................................................ 00 5.1. Deep learning ................................................................................................... 00 5.2. Blockchain ...................................................................................................... 00 5.3. Distributed deduplication systems ................................................................................... 00 5.4. Emphasis on performance issues .................................................................................... 00 https://doi.org/10.1016/j.jksuci.2018.09.021 1319-1578/Ó 2018 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Corresponding author. E-mail address: [email protected] (P. Kaur). Peer review under responsibility of King Saud University. Production and hosting by Elsevier Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx Contents lists available at ScienceDirect Journal of King Saud University – Computer and Information Sciences journal homepage: www.sciencedirect.com Please cite this article in press as: Kumari, P., Kaur, P. A survey of fault tolerance in cloud computing. Journal of King Saud University – Computer and Information Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.09.021

A survey of fault tolerance in cloud computingstatic.tongtianta.site/paper_pdf/a6abdf18-1add-11e9-a3de... · 2019. 1. 18. · Cloud computing resources are offered to users as services,

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: A survey of fault tolerance in cloud computingstatic.tongtianta.site/paper_pdf/a6abdf18-1add-11e9-a3de... · 2019. 1. 18. · Cloud computing resources are offered to users as services,

Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx

Contents lists available at ScienceDirect

Journal of King Saud University –Computer and Information Sciences

journal homepage: www.sciencedirect .com

A survey of fault tolerance in cloud computing

https://doi.org/10.1016/j.jksuci.2018.09.0211319-1578/� 2018 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University.This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

⇑ Corresponding author.E-mail address: [email protected] (P. Kaur).

Peer review under responsibility of King Saud University.

Production and hosting by Elsevier

Please cite this article in press as: Kumari, P., Kaur, P. A survey of fault tolerance in cloud computing. Journal of King Saud University – CompuInformation Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.09.021

Priti Kumari ⇑, Parmeet KaurDepartment of Computer Science & Information Technology, Jaypee Institute of Information Technology, Noida, India

a r t i c l e i n f o

Article history:Received 22 June 2018Revised 24 August 2018Accepted 25 September 2018Available online xxxx

Keywords:Cloud computingFault toleranceData centerNetwork topology

a b s t r a c t

Cloud computing has brought about a transformation in the delivery model of information technologyfrom a product to a service. It has enabled the availability of various software, platforms and infrastruc-tural resources as scalable services on demand over the internet. However, the performance of cloudcomputing services is hampered due to their inherent vulnerability to failures owing to the scale at whichthey operate. It is possible to utilize cloud computing services to their maximum potential only if the per-formance related issues of reliability, availability, and throughput are handled effectively by cloud serviceproviders. Therefore, fault tolerance becomes a critical requirement for achieving high performance incloud computing. This paper presents a comprehensive overview of fault tolerance-related issues in cloudcomputing; emphasizing upon the significant concepts, architectural details, and the state-of-art tech-niques and methods. The objective is to provide insights into the existing fault tolerance approachesas well as challenges yet required to be overcome. The survey enumerates a few promising techniquesthat may be used for efficient solutions and also, identifies important research directions in this area.� 2018 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University. This is anopen access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Contents

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 002. Background of cloud computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 00

2.1. Cloud computing infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 002.2. Cloud deployment models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 002.3. Cloud service model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 00

3. Fault tolerance approaches in distributed systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 00

3.1. Fault tolerance in distributed computing environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 00

4. Fault tolerance approaches in the cloud computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 00

4.1. System model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 004.2. Proactive approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 004.3. Reactive approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 004.4. Other miscellaneous approaches used for FT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 004.5. Integration of fault tolerance in miscellaneous cloud-based and related applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 00

5. Future directions for research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 00

5.1. Deep learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 005.2. Blockchain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 005.3. Distributed deduplication systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 005.4. Emphasis on performance issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 00

ter and

Page 2: A survey of fault tolerance in cloud computingstatic.tongtianta.site/paper_pdf/a6abdf18-1add-11e9-a3de... · 2019. 1. 18. · Cloud computing resources are offered to users as services,

2 P. Kumari, P. Kaur / Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx

PI

6. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 00References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 00

1. Introduction

Cloud computing refers to accessing, configuring and manipu-lating the resources (such as software and hardware) at a remotelocation (Patidar et al., 2012). Buyya et al. (2009) defined the Cloudcomputing in terms of distributed computing ‘‘A Cloud is a type ofparallel and distributed system containing a set of interconnectedand virtualized computers that are dynamically provisioned andpresented as one or more unified computing resources based onservice-level agreements established through negotiation betweenthe service provider and consumers”.

According to the U.S. National Institute of Standards and Tech-nology (NIST) definition: ‘‘Cloud computing is a model for enablingconvenient, on-demand network access to a shared pool of config-urable computing resources (for example servers, networks, stor-age, services, and applications) that can be quickly provisionedand released with least management effort or service providerinteraction‘‘ (Mell and Grance, 2011).

Cloud computing offers various resources in the form of servicesto the end users on-demand basis. It enables businesses and usersto use applications without installing them on physical machinesand allows access to required resources over the Internet. It pro-vides features as high performance, pay-as-you-go, connectivity,interactivity, reliability, ease of programmability, efficiency, scala-bility, management of large amount of data and elasticity to trans-form IT from a product to a service (Savu, 2011; Bokhari et al.,2016), as depicted in Fig. 1.

The cloud computing, as a fast advancing technology, is increas-ingly being used to host many business or enterprise applications.However, the extensive use of the cloud-based services for hostingbusiness or enterprise applications leads to service reliability andavailability issues for both service providers and users (Gokhrooet al., 2017; Nazari Cheraghlou et al., 2016). These issues are intrin-sic to cloud computing because of its highly distributed nature,heterogeneity of resources and the massive scale of operation. Con-sequently, several types of faults may occur in the cloud environ-ment leading to failures and performance degradation. The majortypes of faults (Amin et al., 2015; Saikia and Devi, 2014; Essa,2016) are listed as follows:

� Network fault: Since cloud computing resources are accessedover a network (Internet), a predominant cause of failures incloud computing are the network faults. These faults may occurdue to partitions in the network, packet loss or corruption, con-gestion, failure of the destination node or link, etc.

Fig. 1. Cloud c

lease cite this article in press as: Kumari, P., Kaur, P. A survey of fault toleranformation Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.09.021

� Physical faults: These are faults that mainly occur in hardwareresources, such as faults in CPUs, in memory, in storage, failureof power etc.

� Process faults: faults may occur in processes because of resourceshortage, bugs in software, incompetent processing capabilities,etc.

� Service expiry fault: If a resource’s service time expires while anapplication that leased it is using it, it leads to service failures.

Failures lead to system breakdown or shut down of a system.However, distributed computing and thus, cloud computing ischaracterized by the notion of partial failures. A fault may occurin any constituent node, process or network component. This leadsto a partial failure and consequently, performance degradationinstead of a complete breakdown. Though this results in robustand dependable systems, faults should be handled effectively byproper fault tolerance mechanisms for high-performance comput-ing. Fault tolerance enables the system to serve the request evensome of the components are not working properly (Gokhrooet al., 2017; Charity and Hua, 2016).

Fault tolerance (FT) is the capability of a system that keeps onperforming its anticipated function regardless of faults. In otherwords, FT is related to reliability, successful operation, and absenceof breakdowns. A FT based system should be capable to examinefaults in particular software or hardware components, failures ofpower or other varieties of unexpected adversities and still fulfilits specification (Dubrova, 2008).

The survey makes the following contributions:

� It is a comprehensive study of fault tolerance in cloud com-puting systems. It discusses the taxonomy of faults, errors,and failures along with their likely causes in a cloudenvironment.

� Along with the existing approaches for ensuring fault tolerancein cloud systems, the same has also been described for varieddistributed systems including mobile computing systems. Thus,an all-rounded perspective of the problem and its challenges arepresented.

� The dependency of fault tolerance approaches in cloud comput-ing systems upon underlying network topologies of data centresis explored. The survey discusses the most common networktopologies in data centers for the cloud systems and how faulttolerance approaches to leverage the same in theirimplementation.

omputing.

nce in cloud computing. Journal of King Saud University – Computer and

Page 3: A survey of fault tolerance in cloud computingstatic.tongtianta.site/paper_pdf/a6abdf18-1add-11e9-a3de... · 2019. 1. 18. · Cloud computing resources are offered to users as services,

Fig. 2. Cloud computing architecture.

P. Kumari, P. Kaur / Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx 3

� The survey lists a number of miscellaneous cloud-based prob-lems with which fault tolerance approaches have been inte-grated. We specifically emphasize on the integration of faulttolerance with cloud security.

� A number of promising techniques, such as deep learning andblockchains, that may be used effectively in this domain arediscussed.

� Based on the understanding of existing challenges and solu-tions, a few research directions have been enumerated.

The rest of the survey paper is classified as follows: An overviewof cloud computing environment, deployment models, and servicestack is presented in Section 2. In Section 3, the traditional faulttolerance approaches used in distributed systems is presented.Section 4 discusses the existing fault tolerance approaches in thecloud computing environment. Section 5 outlines the future direc-tions for research. Finally, Section 6 presents the concludingremarks.

2. Background of cloud computing

Cloud computing has evolved from efforts in various researchareas of computing such as distributed computing, grid computing,virtualization techniques and SOA (Service-Oriented Architecture).Therefore, it has imbibed their features, advances, and limitationsas well.

This section describes cloud computing in five dimensions: (a)Basic concepts (b) Cloud components (c) Cloud infrastructure (d)Cloud deployment model (e) Cloud service stack, as depicted inFig. 2.

2.1. Cloud computing infrastructure

Cloud computing infrastructure includes the computers, storagedevices, network facilities and other allied components necessaryfor offering cloud computing resources and services to users. Thesehardware components are mostly located within enterprise datacenters. These encompass multicore servers, solid-state drivesand hard disk drive offering stable storage and network devices,such as firewalls, switches, and routers; all on a large scale. Apartfrom these hardware components, software components that sup-port the cloud service model, such as the virtualization software,are also termed as the cloud computing infrastructure. The virtual-ization software provides an abstraction of cloud resources and

Please cite this article in press as: Kumari, P., Kaur, P. A survey of fault toleraInformation Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.09.021

offers these resources to users generally using APIs (ApplicationProgram Interfaces) or other command line and/or graphical inter-faces. The virtualized resources, hosted by cloud service providers(CSPs) are delivered to the users usually over the Internet (andsometimes over any other network).

Cloud computing resources are offered to users as services, ingeneral, on a shared and multitenant-based approach. Amultitenant-based approach is used by major CSPs like AmazonWeb Services (AWS) and/or Google Cloud Platform. This approachis used to share resources in a cost-efficient and yet, secure manneramongst the multiple applications and tenants (businesses, organi-zations, etc.) using the cloud. Virtualization softwaremay be used toensure isolation between the tenants. A typical cloud infrastructurecomprises clients, servers, applications, and other components.

Another cloud computing component is the distributed file sys-tem (DFS), like the Google File System (GFS) and/or Hadoop Dis-tributed File System (HDFS) which is mainly used to store dataon disks in form of objects or blocks. These file systems decouplestorage management from the actual physical storage and hence,ensure scalability of storage. Thus, cloud computing infrastructureconsists, broadly (Singh et al., 2016), of the following:

� Servers – The physical machines that act as host machines forone or more e virtual machines.

� Virtualization – Technology that abstracts physical componentssuch as servers, storage, and networking and provides these aslogical resources.

� Storage – In the form of Storage Area Networks (SAN), networkattached storage (NAS), disk drives etc. Along with facilities asarchiving and backup.

� Network – To provide interconnections between physical ser-vers and storage.

� Management – Various software for configuring, managementand monitoring of cloud infrastructure including servers, net-work, and storage devices.

� Security – Components that provide integrity, availability, andconfidentiality of data and security of information, in general.

� Backup and recovery services.

2.2. Cloud deployment models

The cloud deployment model is based on the motive and envi-ronment in which a cloud service is expected to be used. The selec-tion of the deployment model determines the incurred cost, power

nce in cloud computing. Journal of King Saud University – Computer and

Page 4: A survey of fault tolerance in cloud computingstatic.tongtianta.site/paper_pdf/a6abdf18-1add-11e9-a3de... · 2019. 1. 18. · Cloud computing resources are offered to users as services,

Fig. 3. Cloud service delivery models.

4 P. Kumari, P. Kaur / Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx

consumption by resources and other capital expenses (Rimal et al.,2009). The most commonly used deployment models in cloudenvironments are public cloud, private cloud, community cloud,and hybrid cloud.

� Public Cloud: The public cloud permits the general public toaccess the systems and services offered by an enterprise provi-der. It provides flexibility, scalability, location independencewith the very low cost since multi-tenancy is generally used(Patidar et al., 2012; Savu, 2011; Singh et al., 2016; Rimalet al., 2009). Resources are dynamically provisioned on an on-demand basis from a remote third-party provider who offersresources using a multi-tenant approach.

� Private Cloud: The private cloud is used within a particular orga-nization, i.e., the cloud resources and services can be accessed orused inside an organization. This model ensures high applica-tion and data security and privacy (Patidar et al., 2012; NazariCheraghlou et al., 2016; Singh et al., 2016; Rimal et al., 2009).

� Community Cloud: This model is used by various enterprises/organizations simultaneously and helps a particularcommunity/society that contains communal involvements (forexample security necessities, mission, and compliance consid-erations and so on). This model may be operated, owned, andmanaged by one or multiple organizations inside the commu-nity or/and a third party. (Mell and Grance, 2011; Savu, 2011;Zissis and Lekkas, 2012).

� Hybrid Cloud: The hybrid cloud is an alliance of both publiccloud and private cloud. In this cloud deployment, criticalevents (e.g. those requiring secure operations) are accomplishedusing the private cloud services and non-critical events areimplemented by using the public cloud (Savu, 2011; Rimalet al., 2009).

Public clouds are most suitable in scenarios where organiza-tions wish to use collaboration services like chat, and video confer-

Please cite this article in press as: Kumari, P., Kaur, P. A survey of fault toleraInformation Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.09.021

ences but sufficient IT resources or infrastructure are not availablelocally. In contrast, if strict security and privacy are issues of highpriority, a private deployment model should be used. On the otherhand, an organization that possesses a large IT infrastructure and isalso expanding its capabilities, a hybrid deployment model shouldbe the choice.

2.3. Cloud service model

Though cloud computing has highly evolved in recent years,services are still into three major service models (Patidar et al.,2012; Singh et al., 2016; Rimal et al., 2009). The basic service mod-els are demonstrated in Fig. 3.

� Software-as-a-Service (SaaS): In this model, the software appli-cations are presented by cloud service provider in the form ofservices to the consumer/end-users (Patidar et al., 2012;Bokhari et al., 2016; Rimal et al., 2009). An application deliveredas a service to the client removes the need to install and executethe cloud application on the user’s computer and simplifiesmaintenance. As for example, the web conferencing services,email applications, social media platforms etc. The list of SaaSproviders is Amazon AWS, Google Compute Engine, MicrosoftAzure, IBM SmartCloud Enterprise, CloudStack, OpenStack, Open-Nebula, CloudForge, Citrix, Qstack and so on (https://www.datamation.com,cloud-computing,50-leading-saas-companies.html).

� Platform-as-a-Service (Paas): This model provides a platform todevelop, run, test and manage applications in the cloud(Bokhari et al., 2016; Rimal et al., 2009; Abdelfattah et al.,2017). A user can lease an environment with a software stackfrom a CSP and use it for custom application development.The list of PaaS providers is Acquia Cloud, Amazon AWS, AppAgile, Apprenda, AppScale, Bluemix, Cloud 66, Cloudways andso on (https://stackify.com/top-paas-providers/).

nce in cloud computing. Journal of King Saud University – Computer and

Page 5: A survey of fault tolerance in cloud computingstatic.tongtianta.site/paper_pdf/a6abdf18-1add-11e9-a3de... · 2019. 1. 18. · Cloud computing resources are offered to users as services,

Fig. 4. Occurrence of failure.

Fig. 5. Classification of faults.

Fig. 6. Classification of error.

P. Kumari, P. Kaur / Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx 5

� Infrastructure-as-a-Service (IaaS): The IaaS model provides facil-ity to access to some primary resources i.e. physical machines,storage, networks, servers, virtual machines on the cloud etc

Please cite this article in press as: Kumari, P., Kaur, P. A survey of fault toleraInformation Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.09.021

(Bokhari et al., 2016; Saikia and Devi, 2014; Abdelfattah et al.,2017). The IaaS provider offers services such as dynamic virtualmachine provisioning and on-demand storage facilities. The list

nce in cloud computing. Journal of King Saud University – Computer and

Page 6: A survey of fault tolerance in cloud computingstatic.tongtianta.site/paper_pdf/a6abdf18-1add-11e9-a3de... · 2019. 1. 18. · Cloud computing resources are offered to users as services,

Fig. 7. Classification of failure.

Fig. 8. Classification of fault tolerance approaches.

6 P. Kumari, P. Kaur / Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx

of SaaS providers is Salesforce, Microsoft, AmazonWeb Services,Slack, Zendesk, GitHub, Oracle, Cisco and so on (https://stackify.com/top-iaas-providers/).

� Anything-as-a-Service (XaaS): The XaaS is another service modelthat may be anything or everything as a service. The cloud sys-tem is capable to maintain the huge amount of resources to ful-fill the personal, granular, and specific requirements usingSecurity-as-a-Service, Identity-as-a-Service, Communication-

Please cite this article in press as: Kumari, P., Kaur, P. A survey of fault toleraInformation Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.09.021

as-a-Service, DaaS (Database-as-a-Service) or Strategy-as-a-Service and so on (Singh et al., 2016).

3. Fault tolerance approaches in distributed systems

Fault tolerance is crucial for a system in order to permit it tooffer the needed services even in the presence of component fail-ures, or one or multiple faults (Charity and Hua, 2016; Valle

nce in cloud computing. Journal of King Saud University – Computer and

Page 7: A survey of fault tolerance in cloud computingstatic.tongtianta.site/paper_pdf/a6abdf18-1add-11e9-a3de... · 2019. 1. 18. · Cloud computing resources are offered to users as services,

Table 1Reactive fault tolerance mechanism with the description.

Reactive Fault ToleranceTechniques

Description

Check-pointing (Ataallah et al.,2015; Hosseini and Arani,2015)

Used to save the system’s stateperiodically. In case of a constituent task’sfailure, the job is restarted from the lastchecked pointed state rather than fromthe beginning. It prevents the loss ofuseful computation

Job Migration (Prathiba andSowvarnica, 2017)

If a job cannot complete its execution onsome specific physical machine due tosome reason and fails, then it is migratedto some other machine

Replication (Amin et al., 2015;Hosseini and Arani, 2015)

Used to create multiple copies of tasksand store replicas at different locations. Atask can continue execution in presence ofmalfunction or failures until all replicasare destroyed

S-Guard (Bala and Chana, 2012) It depends on rollback and recoveryprocess

Retry (Prathiba and Sowvarnica,2017)

In this approach, a task is executedrepeatedly until it succeeds. The sameresource is used to retry theunsuccessful/failed task

Task Resubmission (Amin et al.,2015; Ataallah et al., 2015)

In this method, the failed task is againsubmitted/resubmitted to the identicalresource and/or to a diverse machine forexecution

Rescue workflow (Prathiba andSowvarnica, 2017)

It enables a system to continue workingafter the failure of the task/job till it willnot be able to proceed without amendingthe fault

Table 2Proactive fault tolerance with the description.

Proactive Fault Tolerance Techniques Description

Self-Healing (Saikia and Devi, 2014;Ataallah et al., 2015; Park et al.,2005)

This method uses the divide andconquers technique, where a largetask is decomposed into multiplechunks. This partition is mainly doneto improve the system’sperformance. When numerousinstances of the same application runon various VMs (virtual machines),then the failure of application’sinstances are handled automatically.It permits the computing devices orsystems to itself identity, recognizesand heal dilemmas/problemsoccurring, without depending on theadministrator

Software Rejuvenation (Amin et al.,2015; Prathiba and Sowvarnica,2017)

In this approach, the systemundergoes periodic reboots andbegins from a new state every time

Pre-emptive Migration (Bala andChana, 2012; Engelmann et al.,2009, 2009)

In this approach, an application isobserved and analysed constantlyand thus, depends upon a feedback-loop control method

Load Balancing (Nazari Cheraghlouet al., 2016; Rimal et al., 2009;Singh and Kinger, 2013)

This approach is used to balance theload of memory and CPU when itexceeds a maximum/certain limit.The load of exceeded CPU istransferred to some other CPU thatdoes not exceed its maximum limit

Table 3Parameters for FT in cloud computing and their description.

Parameters/Metrics

Description

Adaptive All processes are automatically executed according to theconditions

Performance Used to ensure an efficiency of the systemResponse Time Total time that is taken to respond/reply to a specific

algorithmThroughput It computes the number of tasks whose implementation has

been completed successfullyReliability Its main motive is to provide accurate or acceptable result

in a certain time periodAvailability It is described as probability i.e. the system is functioning

properly after it is requested/intended for useUsability a user can make use of an invention/a product to

accomplish the target with efficiency, effectiveness, andsatisfaction

OverheadAssociated

determine the total overhead involved while executing afault-tolerance (FT) algorithm

Cost- effectivenessIt is a

descriptionof thesystemmonetarily

P. Kumari, P. Kaur / Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx 7

et al., 2008). Failures in a system occur as a result of errors whichare in turn due to faults (Fig. 4). These are described as follows:

� Faults: It is the incapability of a system to perform its necessary/needed task that is caused by some abnormal state or bug pre-sent in one or multiple parts of a system (Singh and Kinger,

Please cite this article in press as: Kumari, P., Kaur, P. A survey of fault toleraInformation Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.09.021

2013; Nazari Cheraghlou et al., 2016; Amin et al., 2015; Saikiaand Devi, 2014; Essa, 2016). Various types of faults may occurin the system, which is classified as depicted in Fig. 5.

� Error: A system component can move to an error state or anincorrect condition due to the presence of faults. A system com-ponent’s performing erroneously may result in the system’spartial or even complete failure (Amin et al., 2015; Singh andKinger, 2013). A distributed system can contain various typesof errors, which are as shown in Fig. 6.

� Failure: It refers to the misbehaviour of a system that may beobserved by a user (a human or some other computer system).A failure is recognized only when the system’s output or out-come is incorrect (Amin et al., 2015; Singh and Kinger, 2013;Prathiba and Sowvarnica, 2017). Failures may be classified asdepicted in Fig. 7.

Fault tolerance approaches are necessary as they aid in detect-ing and handling faults in the system that may occur either due tohardware (H/W) failure or software (S/W) faults. Fault tolerance isespecially crucial in cloud platform as it gives assurance regardingperformance, reliability, and availability of the applications. Toachieve the robustness in the cloud computing, need to access aswell as handle the failure effectively (Gokhroo et al., 2017;Nazari Cheraghlou et al., 2016; Amin et al., 2015; Saikia andDevi, 2014). Some fault tolerance approaches, identified from liter-ature can be categorized (Fig. 8) as follows:

� Reactive fault tolerance: This approach is mainly used todecrease the influence of failure in the cloud system after thefailures/faults have actually occurred. It provides robustnessor reliability to a system (Saikia and Devi, 2014; Charity andHua, 2016). Reactive fault tolerance approaches have beenexplored for cloud as well as other distributed systems. Thesehave been listed in Table 1.

� Proactive fault-tolerance: - This approach is used to predicts thefaults proactively and substitute the suspected component bysome running components, i.e., it avoids recovery from faultsand errors (Charity and Hua, 2016; Valle et al., 2008;Mukwevho and Celik, 2018; Engelmann et al., 2009, 2009). Anoverview of proactive FT techniques is described in Table 2.

nce in cloud computing. Journal of King Saud University – Computer and

Page 8: A survey of fault tolerance in cloud computingstatic.tongtianta.site/paper_pdf/a6abdf18-1add-11e9-a3de... · 2019. 1. 18. · Cloud computing resources are offered to users as services,

8 P. Kumari, P. Kaur / Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx

� Parameters Used for Fault tolerance in Cloud Computing: - Thefault tolerance approaches in the cloud computing areevaluated using various parameters to check the efficiencyand effectiveness of the cloud systems (Prathiba andSowvarnica, 2017; Bala and Chana, 2012). The possible param-eters is listed in the Table 3.

3.1. Fault tolerance in distributed computing environments

Fault tolerance (FT) is an essential concern in cloud computingplatform since it enables the system to provide the required ser-vices with good performance in presence of the one or more fail-ures of the system components(Gokhroo et al., 2017; Valle et al.,2008; Mukwevho and Celik, 2018). In past, fault toleranceapproaches have been applied to many different distributed com-puting environments, apart from cloud computing. Some of themare as follows: -

� Wired Distributed System: - It is a collection of autonomous com-puters, which appears as a single consistent system to its cli-ents/users. All computers in the distributed system contain anindividual set of resources and can share some general periph-eral devices, e.g. a printer. Message passing is generally used forcommunication in a distributed system. Designing a distributedsystem is a difficult task due to the existence of componentsthat may be located at different places/sites. One of the majorchallenges that the system designer has to face is providingfault tolerance (FT).

In general, FT is highly required in distributed network systems,especially in the large-scale environment. Users of a distributedsystem require the system to stay working continuously even incase of technical failures. If one or more of the members of the sys-tem have crashed, even then the system should be able to fulfil theclient’s requests. Therefore, an efficient system must be designedand implemented to handle the partial failure of its components.Failure detection (FD) and process monitoring are the mostcommon techniques for FT in the distributed systems. Reactivefault tolerance approaches such as checkpointing, replication,retry, resubmission, etc have been used to handle failures in thesesystems (Xiong et al., 2009; https://www.slideshare.net/sumitjain2013/fault-tolerance-in-distributed-systems).

� Mobile Computing System: - It is a type of distributed system,where some or all the constituent nodes are mobile computers.This system preserves continuous network connectivity even inthe existence of mobility of the hosts due to which their site/location within the network may vary with time. Each node inthe system works independently, with infrequent asynchronousmessage communication. The fixed nodes in the mobile systemmay be interconnected using a static network. Moreover, a fixednode (commonly the mobile base station) is used to establishthe communication between the connected nodes, i.e., themobile node and the other nodes inside the system. Nodes inthe mobile system communicate with each other using mes-sages (Park et al., 2002; Tantikul and Manivannan, 2005).

Some restrictions of mobile systems are limited bandwidth,mobile hosts with limited disk space, user mobility, narrow batterylife, etc. To overcome the limitation of the mobile computingsystem, some fault tolerance approaches are used. The most com-monly used FT approaches in the mobile system is check-pointingsince limited resources prevent the use of redundancy-basedschemes as replication etc. The FT technique needs the processesthat are check-pointed at regular intervals to move the error-freestate to some storage (fixed/constant). If any failure arises in the

Please cite this article in press as: Kumari, P., Kaur, P. A survey of fault toleraInformation Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.09.021

process, then that failure may be recovered by finding the newestsaved/maintained state (called rollback recovery). (Singh andCabillic, 2003).

Checkpointing schemes can be categorized as coordinated,communication-induced, and uncoordinated checkpointing. In acoordinated method, the processes adjust their checkpointingactions through transferring the checkpoint-based coordinationmessages. The coordinated check-pointing policies include hugemessage overhead, therefore not appropriate for the mobile sys-tems and have very less bandwidth wireless communication chan-nels/stations in the network. Furthermore, at the time ofcheckpointing coordination, the process execution may alsorequire to be suspended, which may result in degradation of theperformance.

The uncoordinated checkpointing methods permit processes toreceive checkpoints at regular intervals without synchronizationwith others (Prakash et al., 1996; Agbaria and Sanders, 2004) butthis method may suffer from the domino effect. Acommunication-induced check-pointing approach has been usedto handle the domino effect (Park et al., 2002).

� Mobile-Grid Computing: - Grids are very large-scale systems, dis-tributed in nature. These spread the amount of work to be doneamong the constituent systems. Grid computing facilitates thesharing of large-scale resources between loosely coordinated,distributed systems to solve the computational needs of large-sized tasks. Therefore, grid computing provides users with hugecomputational, bandwidth resources and storage. It is also pos-sible to use grid computing in conjunction with mobile comput-ing to get better performance. This approach is also importantto handle essential restrictions of mobile devices effectively(Buyya et al., 2009; Altameem, 2013). However, incorporationof mobile and grid devices for use of computing resources ischallenging because of unreliable connections, random nodemobility, battery dependence, small-bandwidth for communi-cation, restricted power for processing and fixed storage. Aneffective execution of the distributed applications in mobile gridcomputing (MoG) is desirable if the faults/failure of the mobiledevices are handled properly (Jaggi and Singh, 2014). Therefore,FT policies are required to handle the different types of faults inthe MoGs. The most frequently used FT technique in MoGs arecheckpointing and the rollback recovery. These FT methodshave been used extensively in conventional wired and cellularportable distributed systems. Jaggi and Singh (2014) presentedan adaptive approach for MoG based on checkpointing torecover the failure of mobile nodes.

Darby and Tzeng (2010) presented ReD (Reliability Driven) mid-dleware approach, which enabled the mobile grid scheduler tobuild informed conclusions/decisions, selectively submitting workportions to hosts having improved check-pointing arrangementsthat guarantee successful completion.

� MANET (Mobile ad hoc networks(N/W)): - It is a wireless ad hocnetwork which is self-configuring, and does not depend oninfrastructure, i.e., it is an infrastructure-less network of mobiledevices connected in a wireless manner. All devices in theMANET are autonomous and can change their path and direc-tion dynamically, and therefore alter links to the other devicesfrequently. MANETs are widely used for increasing the comput-ing abilities to existing mobile systems and MoG (mobile gridscomputing). But, MANETs are vulnerable to several transient/temporary and permanent failures. To handle the failure, anFT technique is required to be used effectively. The checkpoint-ing, as well as rollback recovery, is a widely used policy tohandle faults in static and/or cellular mobile systems. However,

nce in cloud computing. Journal of King Saud University – Computer and

Page 9: A survey of fault tolerance in cloud computingstatic.tongtianta.site/paper_pdf/a6abdf18-1add-11e9-a3de... · 2019. 1. 18. · Cloud computing resources are offered to users as services,

Table 4Summary of basic network topologies.

Topologies Features Advantages Disadvantages

� BusTopology

� It transfers data only in a singledirection

� All computer devices are linked to aone cable

� This topology is cost effective� Require less cable as compared to othertopologies

� Suitable for small networks� Easier to understand� Easier to expand by merging two cablestogether

� When cables get fail then the entire network fails� In case of heavy network traffic or by adding morenodes, the network’s performance is degraded

� Cable length is limited� Slower than another topology

� RingTopology

� This topology makes use of repea-ters with a huge number of thenodes

� Data is transmitted bit by bit i.e. in asequential manner

� The network transmission is not get affected byadding multiple nodes or due to high traffic

� Very cheaper to expand and install

� Troubleshooting is very critical� Addition or deletion of computer nodes may dis-turb the activity of the network

� If one computer node fails then disturbs the entirenetwork

� StarTopology

� All nodes are connected to the cen-tral hub

� Hub is used as a repeater and help inthe data flow

� Used with optical fiber, twisted pairor coaxial cable

� The performance will be fast in case of fewcomputer nodes and very less network traffic

� Easier to upgrade hub� Easier to troubleshoot� Easier to modify and setup

� Installation is costly� Very expensive to use� If the central hub fails then the entire networkstops working

� Performance depends on the hub

� MeshTopology

� Fully connected� Robust� Not flexible

� Each connection cable can transmit its owndata load

� It is very robust� The fault diagnosis can be done easily� Provides privacy and security

� Difficult to install and configure� Cost of cabling is high� It requires bulk wiring

� TreeTopology

� Ideal if workstations /computernodes are placed in groups

� Mainly used in WAN (Wide AreaNetwork)

� Expansion of bus topology and star topology� Easier to expand the nodes� Easier to manage and maintain� Easier to detect an error

� Heavily cabled� Costly� If multiple nodes are added then maintenance isvery difficult

� If the central hub gets fail, the whole network fails

� HybridTopology

� It is an association of two or morethan two topologies

� Very reliable� Effective� Scalable� Flexible

� Complex in design� Costly

P. Kumari, P. Kaur / Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx 9

the use of FT approaches with MANETs are less examined. Thepreceding recovery-based approaches are not applied to theMANETs directly because of some challenges such as deficiencyof the static infrastructure, regular movement of a node, limitedbandwidth as well as the restricted amount of stable storage. Tohandle faults in MANETs, rollback recovery protocol based oncheckpointing has been proposed, which determines the fre-quency of checkpoint of a mobile device/node depending onthe mobility/portability and therefore avoids unnecessarycheckpoints (Jaggi and Singh, 2015; Chandra and Reddy, 2016).

4. Fault tolerance approaches in the cloud computing

4.1. System model

Cloud computing provides various services and scalable com-puting resources through the internet (Nazari Cheraghlou et al.,2016). On the provider’s side, a DC (data center) provides facilityto keep computer systems as well as their associated components,like networking, storage, uninterruptible power supply, etc. (Liuet al., 2014; Wang et al., 2014, 2015). To provide services to the cli-ents, many virtual machines (VMs) run on the physical machines incloud DC. These DCs use different types of network topologies.Fault tolerance approaches in cloud computing systems are depen-dent on underlying network topologies. In this section, we discussbasic common network topologies in DCs for the cloud system aswell as reactive and proactive FT approaches used in the cloud.

The network topology is the arrangement of nodes within a net-work (Pandya, 2013). In other words, topology is a basic buildingblock of the network that connects computer systems to each other(Bisht and Singh, 2015). The basic network topologies are the bus,

Please cite this article in press as: Kumari, P., Kaur, P. A survey of fault toleraInformation Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.09.021

ring, star, mesh, tree, and hybrid topology (https://www.studytonight.com/computer-networks/network-topology-types)(Refer Table 4).

� Bus Topology: - This topology is used to connect all computersand network devices through a single cable. It transfers datain one direction. This topology is very cost effective, used insmall networks, easier to understand and expand. However, inthis topology when cables get fail, the entire network fails(Santra and Acharjya, 2013). If the network(n/w) is heavy thenthe performance of this topology is degraded. The cable containsa limited amount of length. This topology slower as compared toring topology (Bisht and Singh, 2015; https://www.studytonight.com/computer-networks/network-topology-types).

� Ring Topology: - In this topology, computer systems are con-nected to each other in a ring structure, where the last deviceis connected to the first one (Hegde et al., 2013). This topologyis very cheap to install as well as expand. In case of heavy net-work traffic and the addition of some extra node, the transmis-sion network is not affected. However, the failure of onecomputer system can affect the entire network. The networkactivities are disturbed by adding or deleting the computers.Troubleshooting is also very difficult in the ring topology(Bisht and Singh, 2015; https://www.studytonight.com/computer-networks/network-topology-types).

� Star Topology: - This topology is used to connect all computersystems to a single hub with the help of a cable. This hub is usedas a central node and all other available nodes are linked to it.This topology provides fast performance, easier to troubleshoot,set up and modify. However, this topology is expensive to useand if the hub gets fail, then the entire network stopped

nce in cloud computing. Journal of King Saud University – Computer and

Page 10: A survey of fault tolerance in cloud computingstatic.tongtianta.site/paper_pdf/a6abdf18-1add-11e9-a3de... · 2019. 1. 18. · Cloud computing resources are offered to users as services,

10 P. Kumari, P. Kaur / Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx

working. The installation cost is high (Pandya, 2013; Bisht andSingh, 2015; Hegde et al., 2013).

� Mesh Topology: - In this topology, all the node or computer sys-tems are fully linked to each other. This topology is very robust,easier to diagnose the failure. It also provides privacy and secu-rity. However, this topology is very difficult to install or config-ure. The cost of cabling is also high and it requires bulk wiring(Pandya, 2013; Bisht and Singh, 2015; Santra and Acharjya,2013).

� Tree Topology: - This topology contains a root node and all othercomputers or nodes are linked to it (known as hierarchicaltopology). The minimum level of the hierarchy should be three.This topology is an extended version of bus and star topology. Itis also easier to manage and maintain. Error detection is alsodone easily in this topology. However, this topology includes acostly process and is heavily cabled (Pandya, 2013; https://www.studytonight.com/computer-networks/network-topology-types; Santra and Acharjya, 2013).

� Hybrid Topology: - This topology is a combination of two ormore than two topologies. This topology provides features likereliability, scalability, and flexibility. However, its design iscomplex and involves a costly process (https://www.studytonight.com/computer-networks/network-topology-types;Santra and Acharjya, 2013).

The most commonly used network topologies in data centers ofcloud computing environment are as follows:

� Fat Tree topology: - This topology is the most widely used topol-ogy for the high-performance computing (HPC) and clusteringof cloud data centers (DC). It is a bidirectional multi-stage indi-rect topology, which provides fault tolerance and good perfor-mance levels. However, the hardware used by fat treetopology is very costly (Liu et al., 2014; Wang et al., 2014;Gómez et al., 2006; Coll et al., 2009; Sem-Jacobsen et al., 2011).

� RUFT (reduced unidirectional fat tree): - The RUFT topology isunidirectional MIN (multistage interconnection network),which provides good performance similar to the fat tree withvery less hardware cost. This topology has not been used byany fault tolerance (FT) approach (Bermúdez Garzón et al.,2016; Garzón et al., 2014; Gómez et al., 2008).

� RUFT-PL (reduced unidirectional fat tree with parallel link): - It isbased on creating duplicate copies of enterprise and the injec-tion links. It also distributes the traffic of the network in a bal-anced mode to decrease the HoL (Head of the line) blockingimpact between dual links. The number of switches used byRUFT-PL is similar to the number of switches in the RUFT andFat tree topologies. The switches of RUFT-PL can also twicethe amount of unidirectional ports of the RUFT switches(Bermúdez Garzón et al., 2016; Garzón et al., 2014).

� FT-RUFT-212 (Fault Tolerance RUFT 212): - This topology facili-tates the fault-tolerance (FT) feature by creating the duplicatecopy of the links, which connect to or from the ending nodesi.e. connect the nodes in a planned way that may entail a littleincrement in hardware(h/w) price. It uses the similar quantityof links as in RUFT topology as well as similar connection blue-print between switches (Bermúdez Garzón et al., 2016; Garzónet al., 2014).

� FT-RUFT-222 (Fault Tolerance RUFT 222): - It combines the per-formance features of RUFT-PL as well as the fault tolerance(FT) feature of FT-RUFT-212. It contains an FT variant of RUFTtopology. It makes use of two links to create the connectionbetween the network and the processing nodes, two (2) linksused for interconnecting the switch, and two (2) links to createthe connection between the last phase switches and processingnodes (Bermúdez Garzón et al., 2016; Garzón et al., 2014).

Please cite this article in press as: Kumari, P., Kaur, P. A survey of fault toleraInformation Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.09.021

� Z-Fat Tree topology: - It is an extension of the fat tree known asZ-fat tree (Zoned-Fat tree). The extension is mainly related tothe utilization of extra ports for each switch by providing someadditional degree of connectivity. The main purpose of Z-fattree is to deal with the issues of scalability, FT as well as routingto accomplish low latency also high bandwidth. It deals withoptimization issues related to the creation of optimal FT (fattree) topology with minimum complexity (Adda andPeratikou, 2017).

� Clos Network topology: - It is a type of multistage network. Itprovides a non-blocking network, multistage switching archi-tecture, which reduces the number of ports needed to establisha connection. This network topology contains three stages:ingress, middle, and egress stage. Each of these stages is pre-pared using a number of crossbar switches (Liu et al., 2014;Wang et al., 2014; Dong and Rojas-Cessa, 2011).

� VL2 topology: - This is an agile also very cost-efficient network(n/w) design. It is created using various switches, organizedinside a Clos network topology. This topology employs ValiantLoad Balancing (VLB) to distribute the traffic among networkpaths. It also makes use of address resolution to help great ser-ver pools (Liu et al., 2014; Wang et al., 2014; Greenberg et al.,2009; Zhang et al., 2018).

� DCell topology: - This topology makes use of servers throughseveral ports also low-end small switches to make the recur-sively defined construction. In this topology, the main basiccomponent is DCell0, contains n number of servers and singlen-port switch. Every server in DCell0 is linked to a switch inthe identical DCell0 (Liu et al., 2014; Zhang et al., 2018;Lebiednik et al., 2016; Guo et al., 2008).

� BCube topology: - It provides a recursively defined construction,which is particularly designed to deliver modular data centersbased on the container. The most basic element is BCube0 sim-ilar to DCell0, i.e. n number of servers linked to the single n-portswitch. For constructing a BCube1, n number of additionalswitches are used, linking to one server to every BCube0 (Liuet al., 2014; Wang et al., 2014; Lebiednik et al., 2016; Guoet al., 2009).

4.2. Proactive approaches

Some proposed models and frameworks based on proactivefault tolerance are as follows:

� Jialei Liu et al. (2016) presented a PCFT (Proactive Co-ordinatedfault tolerance) method that accepts a virtual machine (VM)harmonized/coordinated method to look for a deterioratingphysical machine in the data center (DC) and then migrates vir-tual machines from the deteriorating physical machines tosome optimal target PMs automatically. An algorithm has beenalso proposed for the initial virtual cluster allocation.

The proposed approach includes two-steps: In the first step, theCPU temperature-based framework is proposed to anticipate adepreciate physical machine. In the second step, an optimal targetphysical machine is searched using a metaheuristic algorithm, Par-ticle Swarm Optimization algorithm. The authors have also usedsuitable metrics to compute the proficiency or efficiency of theproposed approach and compared with other 5 (five) models: FF(First-fit), best-fit (BF), random first-fit (RFF), modified best fitdecreasing (MBFD) and IVCA. The experimental result illustratedthat the PCFT approach has less transmission overhead and thetotal execution time is shorter as compared to the other fivealgorithms because PCFT uses the improved PSO (Particle swarmoptimization) algorithm. The PCFT has used less edge, aggregation,root switch and overall network resources than five approaches.

nce in cloud computing. Journal of King Saud University – Computer and

Page 11: A survey of fault tolerance in cloud computingstatic.tongtianta.site/paper_pdf/a6abdf18-1add-11e9-a3de... · 2019. 1. 18. · Cloud computing resources are offered to users as services,

P. Kumari, P. Kaur / Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx 11

� Zhang et al. (2018) presented an online FD (fault detection)approach called SVM-Grid. This approach has been describedas significant for the stability of cloud. According to the author,several fault detection models are required to know moreregarding the internal structure of the cloud system. The mostfrequently used models are traditional SVM (support vectormachine), but it provides very low accuracy. To deal with thisissue, an online fault detection model has been proposed thatdepends on the SVM-Grid. The SVM-Grid predicts emergingissues in the clouds. The model’s input parameter has beenenhanced by using grid method to accomplish fine-tuned pre-diction for the better/higher accuracy. Further, to enhance theperformance of fault prediction and to minimize the time cost,a fine-tuned prediction algorithm also an updating FT algorithmfor sample DBs (databases) have been developed. Simulationexperiments have been done using a dataset of the Google cor-poration (Google2 application cluster). The proposed approachhas also been compared to existing approaches namely backpropagation, LVQ (Learning vector quantization), and tradi-tional SVM. The experimental outcomes illustrated that thenewly developed model has provided improved accuracy andlower time cost as compared to BP, LVQ, and traditional SVM.

4.3. Reactive approaches

Some proposed models and frameworks based on reactive faulttolerance are as follows:

� Wang et al. (2016) proposed an OPVMP (optimal redundant vir-tual machine placement) model. This approach has been devel-oped to improve the reliability of server-based cloud servicesusing a replication-based fault tolerance method. The proposedapproach consists of the three steps: - selection of the host ser-ver, optimal VM placement, and recovery strategy decision. Aheuristic algorithm has been used for appropriate host serverselection and also for optimal VM placement. The experimentsto verify the benefits of the approach have been performed onthe CloudSim simulator (Calheiros et al., 2009). Results of theproposed approach have been compared to five other existingmodels. An experimental result demonstrated that the pro-posed approach has used the fewer network resources thanother algorithms.

� Amoon (2016) has developed an adaptive model/framework tohandle the fault-related issue in the cloud environment. Theadaptive model contains both checkpointing and replicationfault tolerance techniques to get a highly reliable platform toserve the client requests. The proposed framework consists oftwo algorithms: one for selecting VMs to serve the consumer’srequests, and other for deciding/choosing FT method, i.e. repli-cation or checkpointing approaches. The performance of theproposed framework has been evaluated on the basis ofthroughput, overheads, availability and monetary cost. To per-form the simulation, CloudSim tool has been used. The perfor-mance of the proposed framework has been compared to theexisting OCI (optimal checkpoint algorithm). As a result, anadaptive nature of the proposed framework has improved theperformance of the cloud environment as compared to existingalgorithms.

� Zhou et al. (2017) proposed an Edge switch failure aware check-pointing (EDCKP) model. This model is designed with the aim toenhance service reliability in the cloud computing system. Fattree network topology has been considered and two algorithmshave been proposed to address the edge switch failure. Onealgorithm is to select storage server for the checkpoint imageand other is for the recovery server. The proposed model hasbeen compared with other existing models such as NOCKP

Please cite this article in press as: Kumari, P., Kaur, P. A survey of fault toleraInformation Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.09.021

(No checkpoint-based method) and NDCKP (is a network topol-ogy aware distributed delta checkpoint-based technique). Thesimulation result illustrated that the EDCKP method hasreduced the total execution time and consumes fewer networkresources to get the better service reliability.

� Darby and Tzeng (2010) presented a ReD (Reliability driven)approach for FT in a mobile grid (MoG) environment. Thisenabled a mobile host to transmit its check-pointed data to aselected neighbouring mobile host. The selected mobile hostserves as stable point storage for the checkpoints from theapproved neighbouring mobile host. The ReD approach hasbeen used to increase the likelihood of recovery of checkpointeddata. It thereby maximized the probability that a distributedapplication executing on the mobile grid (MoG) can be com-pleted without supporting an unrecoverable failure. In otherwords, ReD, Reliability Driven middleware enabled the mobilegrid scheduler in informed decision making by selectively sub-mitting work segments to the hosts that contain best-checkpointing arrangements to guarantee successful comple-tion. The ReD approach has been evaluated in the simulatorand test-bed environments. The outcome of ReD has been com-pared with the RCA (Random Checkpoint algorithm). As a result,stability control enhancements to the ReD produces the greatestpayoff at smaller wireless areas, resulted in superior average/normal reliability achievement and in that order less breakmessages. The outcome is practical also useful because, inReD, all host can decide its neighbourhood density, during mes-sage conversation/exchange among neighbours indirectly.

� Zhao et al. (2017) presented a JCSR (Joint Checkpoint Schedulingand Routing) method to provide flexible reliability optimizationin the cloud environment. A peer-to-peer checkpointingmethod has been used that allows client consistencylevels/points to be optimized on the basis of evaluation of theirindividual requirements and entirely resources available in thedata center. A distributed algorithm has been designed to solvethe joint optimization by dual decomposition. However, thegiven solution can improve the use of resources and presenteda supplementary source of profits to the data center operators.The validation outcome demonstrated a significant enhance-ment of reliability over existing methods. A heuristic algorithm,JCSR has been proposed with peer-to-peer check-pointing,which is used to locate the sub-optimal resolution for the jointcheckpoint scheduling problem and routing difficulty leverag-ing dual decomposition process. The motive of the proposedalgorithm is to come across an individual user optimizationproblem.

� Chen et al. (2016) presented task failure-modelling framework.In this framework, a hypothetical analysis has been conductedon the runtime performance to know the impact of transientfailures (a failure that occurs for short periods) of scientificworkflow executions. To improve the workflow’s runtime per-formance, three fault tolerance clustering strategies have beenproposed, these are selective re-clustering algorithm (a new-clustered job is used to keep the clustered job having only failedtasks), dynamic re-clustering algorithm (used to adjust thegranularity dynamically or size of the cluster or a number oftasks inside a job) and vertical re-clustering algorithm (usedto divide the clustered jobs into improved jobs that reducethe job granularity and then retries them). The WorkflowSimtool has been used for the simulation.

� Qiu et al. (2014) presented a ROCloud (reliability-based designoptimization) model/framework to enhance the reliability ofapplication using the FT approach. This framework has beenused to migrate the legacy applications on cloud. The proposedmodel included three steps. In the first part, the legacy applica-tion analysis has been done. In the second part, the important

nce in cloud computing. Journal of King Saud University – Computer and

Page 12: A survey of fault tolerance in cloud computingstatic.tongtianta.site/paper_pdf/a6abdf18-1add-11e9-a3de... · 2019. 1. 18. · Cloud computing resources are offered to users as services,

Table 6Summary of Machine Learning algorithms.

Algorithms Descriptions

Neural Network (NN) (Trivedi,2016)

NN is a supervised based learningapproach. With an NN, a computer systemis modelled to work like a human brain orthe nervous system. NN works bygenerating connections among processingelements. The processing elements of NNare non-linear and can be interconnectedusing the adjustable weights. The outcomeis determined by an association andweights of connections

KNN (K-Nearest Neighbour)(Sharma, 2017)

KNN is a form of supervised learning. Thisalgorithm is used to store the availablecases and categorizes new cases based onsimilarity measure called distancefunctions. It is used to solve bothclassification predictive and regression

12 P. Kumari, P. Kaur / Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx

component ranking has been done and in the last part, an opti-mal FT method has been chosen automatically. However, twoalgorithms have been proposed for ranking the components ofboth ordinary applications (all components of the applicationcan migrate to cloud) and hybrid application (only someparts/portion of their components can migrate to cloud). Theimportant value of every component has been computed basedon the construction of an application, the relationships betweencomponent invocations and failure rates of component andeffects of failure. A tool developed in C++ language has beenused for the simulation.

� Chen et al. (2015) presented a KNF (k-out-of-n reliability)framework, which dealt with issues of energy efficiency as wellas fault tolerance. The proposed framework has provided thefacility of data fragmentation at the time of data storage by anode. Other nodes are able to fetch data consistently with nom-inal energy consumption and also permits mobile nodes to pro-cess the distributed data. However, energy consumptionrequired for processing data is also diminished. This frameworkhas been evaluated in MATLAB.

� Yao et al. (2017) presented the ICFWS (Fault tolerance workflowscheduling) algorithm. This algorithm provides the benefits ofboth resubmission and replication-based FT. First, the ICFWSalgorithm partitions the soft deadline of the available workflowinto numerous sub-deadlines. The FT policies of replication orresubmission are selected for each task and then all availabletasks are scheduled for their initial implementation. Backupcopies of the tasks are retained using the replication method.After that, an online reservation, adjustment method has beenproposed to fine-tune the sub-deadlines of any non-executedtask during task execution process.

The main features of the discussed models and frameworks forFT have been summarized in Table 5.

4.4. Other miscellaneous approaches used for FT

Some miscellaneous approaches have also been used for inte-grating fault tolerance methods in cloud systems to enhance theirperformance and to make the systems robust. Some miscellaneousapproaches are discussed as follows:

� Machine learning based Approaches: - Machine learning is anapplication of artificial intelligence, which enables systems tolearn automatically and use experience for improvementwithout any explicit programming. Its main motive is todevelop computer programs, which can access the data, and

Table 5Summary of the proposed model/framework based on proactive and reactive faulttolerance.

Proposed Framework/Model Fault toleranceApproaches

Use of Fat-TreeTopology

OPVMP (Wang et al., 2016) Reactive YesPCFT (Jialei Liu et al., 2016) Proactive YesAdaptive Framework

(Amoon, 2016)Reactive No

EDCKP (Zhou et al., 2017) Reactive YesReD (Darby and Tzeng, 2010) Reactive YesJCSR (Zhao et al., 2017) Reactive YesROCloud (Qiu et al., 2014) Reactive NoKNF framework (Chen et al.,

2015)Reactive No

SVM-Grid (Zhang et al., 2018) Proactive NoICFWS (Yao et al., 2017) Reactive No

Please cite this article in press as: Kumari, P., Kaur, P. A survey of fault toleraInformation Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.09.021

they then use that data to learn (Chen et al., 2014; Macskassyand Provost, 2007). These machine learning techniques havealso been used to develop fault tolerance methods to enhanceservice reliability. In particular, machine learning is employedto develop proactive fault tolerance methods where failure pre-diction needs to be done before it occurs in the system based onprevious data of the system for the same (Leam, xxxx; Kochharand Hilda, 2017; Li-qun et al., 2011). Some machine learningbased algorithms are described in Table 6.

� Meta-Heuristics: - A meta-heuristic is an advanced level ofheuristic which is designed to create, find or choose a heuristic.Mainly used to direct the search procedure, the objective is toeffectively examine the search space to locate the near-optimal solutions. Approaches that comprise algorithms ofmeta-heuristic may range from easy local search processes tothe complex learning procedures (Frincu and Craciun, 2011).Meta-heuristic algorithms also approximate the outcome andare, in general, non-deterministic. The meta-heuristicsapproaches are usually not considered as being problem specificand therefore have been used for the efficient solution of mul-tiple optimization problems (Umamakeswari et al., 2017). Themost commonly used meta-heuristics algorithms are ACO algo-rithm (ant colony optimization), PSO algorithm (Particle swarmoptimization, SCO algorithm (Social cognitive optimization),etc. (Blum and Roli, 2003; Bianchi et al., 2009). The meta-heuristic algorithms can be applied in the many distinct areassuch that optimization of function, fuzzy system control,

predictive problems. In this approach,output interpretation consumes lowcalculation time and predictive power

SVM (Support vector machines)(Trivedi, 2016; He et al.,2017)

SVMs are a kind of supervised learning.The motive of the SVM algorithm is to finda hyperplane in N-dimensional (N standfor the number of features) space thatclearly classifies the data points. Thisalgorithm is commonly used in imageclassification, text and hypertextclassification, handwritten text/characterrecognition and so on

Reinforcement Learning (RL)(Shaw et al., 2017)

RL is a type of ML algorithm that permitsmachines and software representatives toautomatically determine the superlativebehavior in a specific situation tomaximize the performance. It is a goal-oriented learning based on the interactionwith the environment. It consists of twocomponents i.e. agent and anenvironment. The environment states tothe object where the agent is acting on

nce in cloud computing. Journal of King Saud University – Computer and

Page 13: A survey of fault tolerance in cloud computingstatic.tongtianta.site/paper_pdf/a6abdf18-1add-11e9-a3de... · 2019. 1. 18. · Cloud computing resources are offered to users as services,

Table 7Summary of Metaheuristics algorithms.

Algorithms Description

Ant Colony Optimization (ACO) (Liu et al., 2011) It is an optimization system proposed in the early 90s by Marco Dorigo []. ACO is a heuristic based,multi-agent optimization method inspired by the biological systems for solving complex combinatorialoptimization problems

Particle Swarm Optimization (PSO) (Pandey et al., 2010) This algorithm is a self-adaptive and robust global search-based optimization approach developed byRush Eberhart and Jim Kennedy in 1995. This algorithm is developed for the population-basedstochastic optimization and simulates the common behaviour of fish schooling or bird flocking. PSO iseasy to implement as only a few parameters need to be adjusted to obtain a near-optimal result

Artificial Bee Colony (ABC) (Kruekaew and Kimpan, 2014; Liuet al., 2012)

ABC algorithm is inspired by the intelligent behaviour of the honey bees. This algorithm is a swarmbased meta-heuristic method. It provides a search procedure based on population and has been used tosolve many distinct types of problems. The main motive of the bee is to determine source locations offood having high nectar amount. This algorithm comprises three forms of bees: employed (search foodthroughout the food source inside the memory then share the information about the food source toonlooker bees), onlooker (pick good food sources i.e. founded by employed bees) and scout bees(interpreted from very few employed bees that unrestraint their food sources and look for new ones)

Cuckoo Search Algorithm (CSA) (Mareli and Twala, 2017;https://www.slideshare.net/AnujaJoshi6/cuckoo-o-ptimization-ppt)

It is an optimization-based algorithm defined in the year 2009. This algorithm is stimulated by restrictbrood parasitism of a few cuckoo species by arranging the eggs in the nest of some other host birds. Thisalgorithm can also be used to resolve multi-criteria optimization issues. This algorithm can be appliedin many distinct fields such that engineering optimization, stability analysis, reliability problem andmany more

Bee Colony Optimization (BCO) (https://www.researchgate.net/publication/

259383112_Honey_Bees_Inspired_Optimization_Method_The_Bees_Algorithm; Yuce et al., 2015)

This algorithm, to deal with the combinatorial optimizationproblems, contains two passes, i.e., forward and backward.In the forward pass phase, the partial solution is producedwith individual search and collective knowledge which issubsequently employed to the backward pass phase. Inthe backward pass phase, the likelihood information isutilized to create decision whether to stay to explore thepresent solution for a subsequent forward pass or to beginsearching the neighborhood with newly selected ones

Firefly-Based Algorithm (FA) (https://www.slideshare.net/supriyashilwant/firefly-algorithm-49723859; Yang,2013)

It is a meta-heuristic optimization procedure inspired by fireflies in nature. This algorithm is used tosolve extremely non-linear as well as multi-modal optimization problems efficiently. The speed ofconvergence of the algorithm is high and this algorithm has been integrated with other optimizationapproaches to build hybrid tools

Genetic algorithm (GA) (Haider and Nazir, 2017; Lagwal andBhardwaj, 2017; Hamad and Omara, 2016)

Genetic algorithms are a metaheuristic inspired by the procedure of natural selection. This algorithminvolves five main components: initial population (process starts with the group of individuals calledpopulation), fitness function (determine capability of an individual to participate/compete with someother individual i.e. how appropriate an individual is), selection (fittest individuals are selected andthen their genes are passed to next generation), crossover (for every couple of parents to be matched, acrossover point is selected randomly from the genes i.e. it represents the mating among individuals),mutation (some of their genes having low random probability subjected to mutation i.e. some of thebits with low probability inside bit string is flipped)

Differential evolution (DE) (Rani, 2016; Tsai et al., 2013) It is a stochastic as well as population-based optimization method developed by Storn and Price. DEalgorithm is a type of evolutionary programming used to resolve optimization problems on continuousdomains where every variable is denoted by the set of real numbers. This algorithm has a simplestructure and provides robustness

P. Kumari, P. Kaur / Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx 13

scheduling application, cloud computing, image processing,clustering, data mining, to train artificial neural network andmany more. These algorithms have also been applied toimprove the service reliability, i.e. the meta-heuristic algo-rithms have been integrated with the fault toleranceapproaches to make a reliable system or enhance its perfor-mance (Jialei Liu et al., 2016; Li-qun et al., 2011; Mohanapriyaand Priya, 2016). Some meta-heuristic based algorithms aredescribed in Table 7.

� Clustering: - In order to implement the parallel processing appli-cation in the enterprises, clustering is used in the cloud,whereby several computers, VMs (virtual machines), serversare tightly or loosely connected to work jointly and appear asa single system to the users (known as computer cluster). Clus-ter computing is also used for the HPC (High-performance com-puting). However, the long-term trend in HPC needs increasingamounts of the node inside the parallel computing platform andthus, involves a higher probability of failure. To solve the failureprobability various clustering approaches are available, whichcan be used with the fault tolerance approaches to make thesystem reliable and robust (Buyya et al., 2009; Shwe et al.,2008; Patil et al., 2011).

Please cite this article in press as: Kumari, P., Kaur, P. A survey of fault toleraInformation Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.09.021

4.5. Integration of fault tolerance in miscellaneous cloud-based andrelated applications

Fault tolerance methods have been applied to implement reli-able cloud-based applications. Some of these are related to secu-rity, workflow scheduling, mobile data offloading, and IoT basedapplications, where there is an immense requirement of integra-tion of FT approaches.

A. Security and fault tolerance in cloud computing: - The develop-ment of a reliable cloud computing system should not onlyentail the development of techniques that tolerate benign faultsin the system but should also consider the handling of mali-cious attacks on the system. However, till date, reliable andtrustworthy systems’ construction has generally been viewedfrom two views; either security or fault tolerance, which isorthogonal to each other. For instance, from the perspective ofsecurity, it is required that the computing facilities be securedagainst unauthorized access and avoid redundancy. However,fault tolerance needs redundancy in the form of replication ofnodes as well as data. Also, fault tolerance schemes usuallyassume non-malicious nodes in the system.

nce in cloud computing. Journal of King Saud University – Computer and

Page 14: A survey of fault tolerance in cloud computingstatic.tongtianta.site/paper_pdf/a6abdf18-1add-11e9-a3de... · 2019. 1. 18. · Cloud computing resources are offered to users as services,

Table 8Security Attacks in the Cloud environment.

Distributed Denial of Service attack can affect the availability or accessibility of cloud services because of its multi-tenant architectureIt is applied to the numerous cooperated systems with two key motives (Badve, 2017; Chauhan and Prasad, 2015; Gupta, 2011; Sridaran, 2015)� To overpower the resources of the server such as CPU time or the bandwidth of the network, therefore the genuine users cannot access the resource� To hide the identity of malicious users or attackers

Type of DDoSAttack

Description Affected CloudComputing Layer

Smurf attack Attacker sends a massive amount of ICMP echo requests. The sent requests are deceived i.e. IP address of source will be theIP of the victim and IP destination address will be the broadcast IP address. Finally, as an outcome, the victim will beoverwhelmed with broadcasted addresses (Sridaran, 2015)

IaaS

PING of deathattack

Attacker transmits an IP packet having the size more than the maximum threshold of IP protocol i.e. 65,535 bytes. In case ofmanaging an oversized or large size packet, the victim’s machine and resources of the cloud inside the cloud environmentcan be affected (Darwish et al., 2011)

IaaS and PaaS

IP-spoofingattack

Transmissions of the packet among the client and cloud server can be interrupted and their headers are reformed byattackers i.e.IP source field in IP packet is affected by either a genuine and/or an unreachable/unavailable IP address. As anoutcome, server will reply to a legitimate/genuine user machine, and will either affect the genuine user machine, or servermay not be able to finish the transaction to an unreachable IP address, that disturbs the resources of server (Sridaran, 2015;Darwish et al., 2011)

PaaS

Buffer overflowattack

An executable piece of code is sent by the attacker to the victim to take benefit of buffer overflow vulnerability. Finally, asan output, the attacker will control the machine of the victim. The attacker might damage either victim’s machine or makeuse of the infected machine to accomplish a cloud-based DDoS attack internally (Darwish et al., 2011)

SaaS

Teardrop attack Attackers use the ‘‘Teardrop.c‘‘ program to send illegal overlying values of the IP fragments within header of the TCPpackets. As an outcome, in cloud system, the victim’s machine will crash during re-assembly process (Sridaran, 2015)

IaaS and PaaS

Land attack Attackers send ‘‘Land.c‘‘ program to counterfeit the TCP SYN packets along with the victim’s IP address inside the source anddestination fields. In this case, the request will be received by the machine itself and then crash the system (Sridaran, 2015)

IaaS and PaaS

SYNFloodattack

It occurs when the attacker transmits many packets on the server but not able to complete the procedure of 3-wayhandshake. In this circumstance, server hold-up to finish the execution of all those packets, that cause server not to processvalid requests. Similarly, SYN flooding can also be done by transferring packets through a spoofed IP address (Darwish et al.,2011)

IaaS and PaaS

Botnet-based Esraa Alomari et al. (Alomari et al., 2012, 2016) has presented a comprehensive study on Botnet-based DDoS attack and itseffect on application layer, particularly on the web server. The Botnet-based DDoS attack on application layer restrictsresources, revenue and produces consumer dissatisfaction between others. The main motive of this attack is:

a. To involve damage on victim sideb. The hidden goal of this attack is personal i.e. it blocks the available computing resources or reduces the service’s

performance required by any destination machine. So, this attack is done for the purpose of revengec. One more purpose is to accomplish this attack is to increase popularity in community of hacker. This type of attacks

may also be performed for material gain i.e. to interrupt the privacy also use available data/information for theirownStrayer et al. (2008) presented a method to detect botnets attack by examining flow characteristics i.e. packettiming, burst period and bandwidth for indication of botnet control action and command. The author has also cre-ated an architecture which first removes traffic i.e. unlikely to be a fragment of botnet, then categorizes the residualtraffic in a set/group i.e. possibly to be portion of botnet and then associates the probable traffic patterns to find thecommon/shared communications which would recommend activity/action of a botnet

XSS (Cross-Site Scripting) attack: In a cloud environment, XSS attack generally found on the multimedia web applications of online social network (OSN) i.e. Facebook,Linkedin, Twitter, etc. In XSS attack, the attacker inserts the untrusted JavaScript code on OSN web server. This is done by stealing the user’s/handler’s loginidentifications and other complex information such that session tokens and/or financial account data/information.XSS attack can lead to harsh consequences suchthat stealing the cookie, hijacking of account, misinformation, DOS (Denial-of-Service) attack, etc. (Chaudhary et al., 2016)

To deal with this attack, Gupta and Gupta (2018) proposed an XSS-secure framework to detect and alleviates the proliferation of the XSS worms from the webapplication of OSN in the cloud system. The developed framework is operated in two modes: -

� Training mode: The training mode is used to create the secure sanitized JavaScript (JS) code i.e. encapsulated in the templates of the web pages� Detection mode: During the offline mode, the discrepancy in injected sanitizers are detected using the detection mode. This mode also perceives the injection ofmalignant variables as well as links to JavaScript (JS) code

The developed framework has been implemented in Java language and its components are integrated on (VMs) of the cloud system

14 P. Kumari, P. Kaur / Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx

Cloud-based services are shared by the millions of consumers/users. Due to the distributed nature of the cloud, various securityissues may occur. The most common security issues that mayimpact a cloud user are related to data, availability, authenticationand authorization, privileged user access, etc. (Chen et al., 2015). Afew commonly occurring attacks are listed in Table 8. Fault Toler-ance approaches can be effectively used to minimize the effect ofthese attacks (Mishra, 2011). The FT methods can be applied tothree levels:

� At hardware level: if the attack on a hardware resource causesthe system failure, then its effect can be compensated by usingadditional hardware resources.

� At software (s/w) level: Fault tolerance techniques such as check-point restart and recovery methods can be used to progress sys-tem execution in the event of failures due to security attacks.

� At system level: At this level, fault tolerance measures can com-pensate failure in system amenities and guarantee the availabil-ity of network and other resources.

Please cite this article in press as: Kumari, P., Kaur, P. A survey of fault toleraInformation Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.09.021

Therefore, it is imperative that viewpoints of security and faulttolerance be aligned with respect to cloud computing systems.Only then, cloud computing services will be able to gain the confi-dence of individual users as well as enterprises with regard to theirdata and computations. The integration of fault tolerance and secu-rity methods should cause a minimal overhead on systemperformance.

B. Workflow scheduling: - The cloud system is widely used byresearchers and scientists for the implementation of scientificworkflows that perform data analysis and high throughputcomputing. Workflows enable easy definition of computationalcomponent, data and their dependencies in a declarative man-ner. It enables automatic execution of workflows, improvingthe performance of an application, and decreasing the amountof time that is needed to obtain scientific outcomes (Poolaet al., 2014). Furthermore, novel pricing models and frame-works have been established by the cloud service providers thatpermit users to provide the available resources and to make use

nce in cloud computing. Journal of King Saud University – Computer and

Page 15: A survey of fault tolerance in cloud computingstatic.tongtianta.site/paper_pdf/a6abdf18-1add-11e9-a3de... · 2019. 1. 18. · Cloud computing resources are offered to users as services,

Table 9Types of Blockchains and its brief description (https://www.guru99.com/blockchain-tutorial.html).

P. Kumari, P. Kaur / Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx 15

of those resources in an efficient way with major cost falls. Sig-nificant price reductions are accomplished due to poorer QoSthat make them less efficient/reliable and susceptible to fail-ures. To deal with failure, i.e., to make the system reliable faulttolerance approaches are considered. The most frequently usedfault tolerance method in workflows is checkpointing, whichcan tolerate an instance failure as well as decrease the execu-tion cost (Chen et al., 2016; Chen and Deelman, 2012).C. Mobile data offloading: - Offloading is the process used todecrease the amount of data, which is being carried by themobile phones or cellular bands and also to free the bandwidthfor other end users. Data offloading is referred to as the offload-ing or migration of data or traffic to cloud servers. The mobiledata offloading method is categorized as the offloading of datausing tiny cell networks, using WiFi networks, opportunisticmobile networks, and using heterogeneous networks (Zhouet al., 2018). Offloading can also be done using cloud computingand virtualization techniques. It also allows mobile devices tomigrate the computational element of an application to thepowerful servers of the cloud. As mobile devices are portable,an unstable network connectivity of mobile can affect theoffloading decision (Deng et al., 2014). To deal with this prob-lem, fault tolerance approaches can be applied effectively.D. Internet of Things (IoT) applications: - IoT has facilitated anumber of applications such as smart homes, intelligent auto-motive, agricultural and industrial applications and many more.IoT applications have enforced new needs on FT (fault toler-ance) and energy management. In other words, in some applica-tions, such as airplane control systems, a single or a particularfault can be very dangerous and life-threatening. FT is criticalin many other IoT applications, in domains like medical strategyand automotive devices, where only one fault can lead to theserious consequences (Xu and Potkonjak, 2016). Therefore, itis very significant to detect and to correct faults in IoT applica-tions, i.e. FT is required.E. Social Network (SN): These represent a network between indi-viduals or organizations using some medium to share interests,thoughts, and different activities. SN makes use of online mediai.e. social media based on internet help to establish contactswith friends, family, customers, and clients. SN can occur forbusiness as well as social purposes or/and both using sites likeFacebook, Twitter, LinkedIn etc. SN is also an important targetfield for marketers seeking to involve/engage users (Chardet al., 2010).

Nowadays, the social network has also integrated with thecloud computing because cloud computing enables appropriate,pervasive and on-demand access of the network resources suchthat application, storage, etc., which can be provisioned with verynominal management cost. In other words, the cloud providesmany sustainable resources to the social network in many differentareas such that online marketing, news, jobs, chatting, share a pic-ture, audio, video, etc. As the cloud environment is very prone tothe failure then need to use fault tolerance approaches with thesocial cloud to ensure the reliable communication among the users(Ali et al., 2012).

Types ofBlockchain

Descriptions

PublicBlockchain

Provide transaction within existing blocks publicly. Thereis no access restriction for examples: Bitcoin, Ethereum,etc.)

ConsortiumBlockchain

It is a semi-decentralized blockchain and is controlled by agroup of members or organizations. For examples- Ripple,R3 & Hyperledger1.0

PrivateBlockchain

Needs an invitation from the network administrator tojoinFor examples- Multichain, Block-stack

5. Future directions for research

The cloud paradigm has become a reality and has been adoptedby researchers, IT industry and other organizations. Fault toleranceapproaches are required to improve the quality of service in thecloud environment. However, the existing or proposed approacheshave considered, in general, only the reliability issue of the simplecloud workflows and evaluation has been done using some basic

Please cite this article in press as: Kumari, P., Kaur, P. A survey of fault toleraInformation Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.09.021

metrics, such as response time, availability, throughput and relia-bility. Some future directions for research in this domain are sug-gested as follows:

5.1. Deep learning

Today, the cloud computing system has become the mostattractive computing model in the field of academia and industryboth. The cloud service provider which has its own data center(DC) can use virtualization approach to structure physical machi-nes (PM) into virtual machines to offer resources, infrastructure,and services to the users in the form of utility. The main problemthat faced by cloud service providers is task scheduling, predictionof virtual machine workload, resource provisioning, security issuesand how to minimize energy cost. There are several different meth-ods have been developed to deal with this issue separately, but allthese methods consume more power, time, provide the result withless accuracy and quality. Each of these problems can be modelledas an optimization problem and solved more effectively using deeplearning approach.

Deep learning (DL) is a subclass of the ML, which learns featuresfrom data directly. When the volume of data is enlarged, ML tech-niques are not appropriate in terms of the performance, so in suchcase, DL has been found to provide better performance in terms ofaccuracy (Qiu et al., 2016). DL use various layers (hidden layer) ofthe nonlinear processing elements for the purpose of featureextraction as well as transformation. In this approach, every con-secutive layer uses the outcome of the previous layer in the formof input. DL approaches have been applied in many areas likespeech and audio recognition, NLP (natural language processing),computer vision, social network filtering, etc. (Cheng et al., 2018;Zhang et al., 2018).

It is envisaged that deep learning approaches can also be inte-grated with the fault tolerance methods to predict the faults andtake the corresponding preventive measures.

5.2. Blockchain

The blockchain is one of the new emerging technologies in com-puter science. It is a digitized, disruptive as well as decentralizedtechnology. It holds a chain of the blocks and each block holdsinformation or data without any central supervision. The data orinformation stored inside the blocks depend on types of the block-chain. This new technology is widely used to validate transactionsof digital currencies. Using this approach, the authenticity of every-one present in the block can be verified even though there is no anycentralized authority (Tosh et al., 2017; Park and Park, 2017). Thereare many different types of blockchains available, which aredefined in Table 9. There are few concepts which are used by theblockchain are:

nce in cloud computing. Journal of King Saud University – Computer and

Page 16: A survey of fault tolerance in cloud computingstatic.tongtianta.site/paper_pdf/a6abdf18-1add-11e9-a3de... · 2019. 1. 18. · Cloud computing resources are offered to users as services,

16 P. Kumari, P. Kaur / Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx

� Decentralized: no any central authority for supervisinganything).

� consensus mechanism: using this mechanism the decentralizednetwork derives to a consensus on some particular matter.

� Miners: users who make use of their computational power formining the blocks.

Apart from digital currencies, this technology has been used toprovide an effective security solution for multiple applications.Due to the dispersed behaviour of the cloud, many organizationsor individuals using the cloud to store their data face securityissues. The blockchain technology can be used to make the cloudsystem more secure and trustworthy. Besides, it can also beemployed to solve other fault tolerance related problems like asyn-chronous communication, unpredictable network delay complex-ity, i.e. backup and delay and so on (Park and Park, 2017; Xuet al., 2018).

5.3. Distributed deduplication systems

Deduplication is an approach being increasingly used in cloudcomputing systems to reduce the data storage costs. Data dedupli-cation method is employed for removing redundant copies of datain cloud repositories in order to the decrease the storage space,upload bandwidth and consistency of data between multiplecopies (Kalyani et al., 2016). The deduplication techniques are cat-egorized in terms of size (Li et al., 2015) as follows:

� File-level deduplication: - determines redundancies among dif-ferent files and eliminates these replications to diminish capac-ity demands.

� Block-level deduplication: - determines and eliminates redun-dancies among data blocks. The file can be decomposed intothe number of smaller blocks whose size can be either con-stant/fixed or variable. With help of fixed-size blocks calcula-tions, of boundaries of blocks are simplified, and blocks withvariable-size offers improved deduplication efficiency.

Data deduplication approach can be used for effective imple-mentation of scalable cloud computing systems by ensuring theconsistency of data. Since only one copy for each data/file is keptin the cloud system even if such a data/file is shared by a largenumber of users, it enhances the utilization of cloud storage. How-ever, this reduces system reliability. To solve the reliability and pri-vacy issue of the deduplication system, Jin Li et al. (2015) haveproposed a new distributed deduplication system. The authorshave proposed four constructions to help both levels i.e. file andblock-level of data deduplication. In this paper secret splittingmethod has been employed to support the secrecy of data. In thismethod, secure secret sharing systems have been used to dividedata into fragments and stored at the distinct location. Accordingto the authors, using the proposed model the reliability, integrity,and confidentiality can be achieved.

Implementation of fault tolerance in systems employing datadeduplication can be challenging. Development of solutions thatmeet the reliability expectations while also decreasing storagecosts and maintaining data consistency is a research topic thatneeds attention.

5.4. Emphasis on performance issues

� Fault tolerance approaches should be developed with a focus tosolve the reliability issues of the complex workflows.

� Appropriate metrics should be designed to evaluate the robust-ness of a cloud system. For an instance, a metric is required to

Please cite this article in press as: Kumari, P., Kaur, P. A survey of fault toleraInformation Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.09.021

assess the overall benefit of implementing FT approach by con-sidering the cost incurred and the achieved reliability.

� Routing methods and network topology designs of data centersshould be leveraged for the development of efficient fault toler-ance techniques.

� There is also a need to develop a framework/model using proac-tive fault tolerance methods, which can predict and locate thereason that causes the faults in the cloud.

� The work should also focus on ways of enhancing the reliabilityof real-time systems.

� Moreover, besides the fault, the performance variation of virtualmachines also provides a negative consequence or impact onthe implementation of any task workflow. Therefore, theresearchers should plan to propose a robust algorithm forscheduling through elastic resource provisioning approach forworkflows in the cloud environment

6. Conclusion

Computing in the cloud provides various features like scalabil-ity, elasticity, high availability and many more. The cloud-computing model has changed the IT industry as it brings severalbenefits to individuals, researchers, organizations, and even coun-tries. Despite providing numerous advantages, the cloud system isstill susceptible to failures. Failures are inevitable in cloud comput-ing due to the scale of operation. Fault tolerance policies are com-monly implemented to handle faults effectively in the cloudenvironment. Fault tolerance techniques help in preventing as wellas tolerating faults in the system, which may occur either due tohardware or software failure. The main motive to employ fault tol-erance techniques in cloud computing is to achieve failure recov-ery, high reliability and enhance availability.

This survey paper has discussed cloud computing concepts, itscomponents, service model, and deployment models. The com-monly used data center network topologies are also describedsince these affect the design of FT algorithms. Various fault toler-ance techniques, models, and algorithms to improve the reliabilityissue of cloud services have been outlined. Techniques to enhancethe performance of fault tolerance approaches, such as heuristics,meta-heuristics, clustering approaches etc. have been enumerated.Further, metrics and parameters to assess the effectiveness andefficiency of the proposed FT approaches have been discussed. Con-sidering limitations of existing fault tolerance techniques and theemerging technologies in related domains, we have put forth somedirections for future research initiatives.

References

AbdElfattah, E., Elkawkagy, M., El-Sisi, A., 2017. A reactive fault tolerance approachfor cloud computing. In: Computer Engineering Conference (ICENCO), 201713th International. IEEE, pp. 190–194.

Adda, M., Peratikou, A., 2017. Routing and fault tolerance in Z-fat tree. IEEE Trans.Parallel Distrib. Syst. 28 (8), 2373–2386.

Agbaria, A., Sanders, W.H., 2004. Distributed snapshots for mobile computingsystems. In: Proc. - Second IEEE Annu. Conf. Pervasive Comput. Commun.PerCom, no. Cic, pp. 151–186.

Ali, Z., Rasool, R.U., Bloodsworth, P., 2012. Social networking for sharing cloudresources. In: Cloud and Green Computing (CGC), 2012 Second InternationalConference on. IEEE, pp. 160–166.

Alomari, E., Manickam, S., Gupta, B.B., Karuppayah, S., Alfaris, R., 2012. Botnet-baseddistributed denial of service (DDoS) attacks on web servers: classification andart. arXiv preprint arXiv:1208.0403.

Alomari, E., Manickam, S., Gupta, B.B., Anbar, M., Saad, R.M., Alsaleem, S., 2016. Asurvey of botnet-based DDoS flooding attacks of application layer: detectionand mitigation approaches. In: Handbook of Research on Modern CryptographicSolutions for Computer and Cyber Security. IGI Global, pp. 52–79.

Altameem, T., 2013. Fault tolerance techniques in grid computing systems. Int. J.Comput. Sci. Inf. Technol. 4 (6), 858–862.

nce in cloud computing. Journal of King Saud University – Computer and

Page 17: A survey of fault tolerance in cloud computingstatic.tongtianta.site/paper_pdf/a6abdf18-1add-11e9-a3de... · 2019. 1. 18. · Cloud computing resources are offered to users as services,

P. Kumari, P. Kaur / Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx 17

Amin, Z., Singh, H., Sethi, N., 2015. Review on fault tolerance techniques in cloudcomputing. Int. J. Comput. Appl. 116 (18), 11–17.

Amoon, M., 2016. Adaptive framework for reliable cloud computing environment.IEEE Access 4. 9469–9430.

Ataallah, S.M.A., Nassar, S.M., Hemayed, E.E., 2015. Fault tolerance in cloudcomputing – Survey. In: 11th Int. Comput. Eng. Conf., no. 1, pp. 241–245.

Badve, B.B.G.O.P., 2017. Taxonomy of DoS and DDoS attacks and desirable defensemechanism in a Cloud computing environment. Neural Comput. Appl. 28 (12),3655–3682.

Bala, A., Chana, I., 2012. Fault tolerance- challenges, techniques and implementationin cloud computing. Int. J. Comput. Sci. 9 (1), 288–293.

Bermúdez Garzón, D.F., Requena, C.G., Gómez, M.E., López, P., Duato, J., 2016. Afamily of fault-tolerant efficient indirect topologies. IEEE Trans. Parallel Distrib.Syst 27 (4), 927–940.

Bianchi, L., Dorigo, M., Gambardella, L.M., Gutjahr, W.J., 2009. A survey onmetaheuristics for stochastic combinatorial optimization. Nat. Comput. 8 (2),239–287.

Bisht, N., Singh, S., 2015. Analytical study of different network topologies. Int. Res. J.Eng. Technol. 2 (1), 88–90.

Blum, C., Roli, A., 2003. Metaheuristics in combinatorial optimization: overview andconceptual comparison. ACM Comput. Surv. 35 (3), 189–213.

Bokhari, M.U., Shallal, Q.M., Tamandani, Y.K., 2016. Cloud computing servicemodels: a comparative study. In: IEEE Int. Conf. Comput. Sustain. Glob. Dev.INDIACom, pp. 16–18.

Buyya, R. et al., 2009. Cloud computing and emerging IT platforms: vision, hype, andreality for delivering computing as the 5th utility. Futur. Gener. Comput. Syst.25 (June 2009), 17.

Calheiros, R.N., Ranjan, R., De Rose, C.A.F., Buyya, R., 2009. CloudSim: A NovelFramework for Modeling and Simulation of Cloud Computing Infrastructuresand Services, 1–9.

Chandra, M.R., Reddy, P.C.S., 2016. Fault tolerance QoS routing protocol for MANETs.In: Advanced Computing (IACC), 2016 IEEE 6th International Conference on.IEEE, pp. 623–628.

Chard, K., Caton, S., Rana, O., Bubendorfer, K., 2010. Social cloud: cloud computing insocial networks. In: Cloud Computing (CLOUD), 2010 IEEE 3rd InternationalConference on. IEEE, pp. 99–106.

Charity, T.J., Hua, G.C., 2016. Resource reliability using fault tolerance in cloudcomputing. In: Next Generation Computing Technologies (NGCT), 2016 2ndInternational Conference on. IEEE, pp. 65–71.

Chaudhary, P., Gupta, S., Gupta, B.B., 2016. Auditing defense against XSS worms inonline social network-based web applications. In: Handbook of research onmodern cryptographic solutions for computer and cyber security. IGI Global, pp.216–245.

Chauhan, K., Prasad, V., 2015. Distributed denial of service (DDoS) attack techniquesand prevention on cloud environment. Int. J. Innovations Adv. Comput. Sci. 4(September), 210–215.

Chen, W., Deelman, E., 2012. WorkflowSim: A toolkit for simulating scientificworkflows in distributed environments. In: 2012 IEEE 8th Int. Conf. E-Science,e-Science 2012.

Chen, W., da Silva, R.F., Deelman, E., Fahringer, T., 2016. Dynamic and fault-tolerantclustering for scientific workflows. IEEE Trans. Cloud Comput. 4 (1), 49–62.

Chen, Z., Son, S.W., Hendrix, W., Agrawal, A., Liao, W.K., Choudhary, A., 2014.NUMARCK: machine learning algorithm for resiliency and checkpointing. In:Int. Conf. High Perform. Comput. Networking, Storage Anal. SC, vol. 2015–Janua,no. January, pp. 733–744.

Chen, X., Huang, X., Li, J., Ma, J., Lou, W., Wong, D.S., 2015b. New algorithms forsecure outsourcing of large-scale. Syst. Linear Eq. 10 (1), 69–78.

Chen, C., Won, M., Stoleru, R., Xie, G.G., 2015a. Energy-efficient fault-tolerant datastorage and processing in mobile cloud. IEEE Trans. Cloud Comput. 3 (1), 28–41.

Cheng, M., Li, J., Nazarian, S., 2018. DRL-cloud: deep reinforcement learning-basedresource provisioning and task scheduling for cloud service providers. In:Proceedings of the 23rd Asia and South Pacific Design Automation Conference.IEEE Press, pp. 129–134.

Coll, S., Mora, F.J., Duato, J., Petrini, F., 2009. Efficient and scalable hardware-basedmulticast in fat-tree networks. IEEE Trans. Parallel Distrib. Syst. 20 (9), 1285–1298.

Darby, P.J.D., Tzeng, N.F., 2010. Decentralized QoS-Aware checkpointingarrangement in mobile grid computing. IEEE Trans. Mob. Comput. 9 (8),1173–1186.

Darwish, M., Ouda, A., Capretz, L.F., 2011. Cloud-based DDoS attacks and defenses.In: Information Society (i-Society), 2013 International Conference on. IEEE, pp.67–71.

Deng, S., Huang, L., Taheri, J., Zomaya, A.Y., 2014. Computation offloading for serviceworkflow in mobile cloud computing, parallel and distributed systems. IEEETrans. Parallel Distrib. Syst. 26 (12), 3317–3329.

Dong, Z., Rojas-Cessa, R., 2011. Non-blocking memory-memory-memory Clos-network packet switch. In: 2011 34th IEEE Sarnoff Symp. SARNOFF 2011, pp. 1–5.

Dubrova, E., 2008. Fault tolerant design: an introduction. Microelectron. Inf.Technol. R.

Engelmann, C., Vallée, G.R., Naughton, T., Scott, S.L., 2009. Proactive fault toleranceusing preemptive migration. In: Proc. 17th Euromicro Int. Conf. Parallel, Distrib.Network-Based Process. PDP 2009, pp. 252–257.

Essa, Y.M., 2016. A survey of cloud computing fault tolerance: techniques andimplementation. Int. J. Comput. Appl. 138 (13), 8887.

Please cite this article in press as: Kumari, P., Kaur, P. A survey of fault toleraInformation Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.09.021

Frincu, M.E., Craciun, C., 2011. Multi-objective meta-heuristics for schedulingapplications with high availability requirements and cost constraints in multi-cloud environments. In: Proc. - 2011 4th IEEE Int. Conf. Util. Cloud Comput. UCC2011, pp. 267–274.

Garzón, D.B., Gómez, C., Gómez, M.E., López, P., Duato, J., 2014. FT-RUFT: aperformance and fault-tolerant efficient indirect topology. In: Proc. - 2014 22ndEuromicro Int. Conf. Parallel, Distrib. Network-Based Process. PDP 2014, pp.405–409.

Gokhroo, M.K., Govil, M.C., Pilli, E.S., 2017. Detecting and mitigating faults in cloudcomputing environment. IEEE Int. Conf.

Gómez, C., Gómez, M., López, P., Duato, J., 2006. A Dynamic and Compact Fault-Tolerant Strategy for Fat-tree. In: Proc. IFIP Int’l Conf. Netw. Parallel Comput.,no. January.

Gómez, C., Gilabert, F., Gómez, M.E., López, P., Duato, J., 2008. RUFT: Simplifying thefat-tree topology. In: Proc. Int. Conf. Parallel Distrib. Syst. - ICPADS, pp. 153–160.

Greenberg, A. et al., 2009. VL2: A Scalable and Flexible Data Center Network. In:ACM SIGCOMM Conf. Data Commun., pp. 51–62.

Guo, C., Wu, H., Tan, K., Shi, L., Zhang, Y., Lu, S., 2008. DCell: a scalable and fault-tolerant network structure for data centers. In: Proc. ACM SIGCOMM 2008 Conf.Data Commun. - SIGCOMM ’08, pp. 75–86.

Guo, C., Lu, G., Li, D., Wu, H., Zhang, X., 2009. BCube: a high performance, server-centric network architecture for modular data centers. In: SIGCOMM ’09 Proc.ACM SIGCOMM 2009 Conf. Data Commun., pp. 63–74.

Gupta, B.B., 2011. An Introduction to DDoS Attacks and Defense Mechanisms: AnAnalyst’s Handbook. Lap Lambert Academic Pub..

Gupta, S., Gupta, B.B., 2018. XSS-secure as a service for the platforms of online socialnetwork-based multimedia web applications in the cloud. Multimedia ToolsAppl. 77 (4), 4829–4861.

Haider, S., Nazir, B., 2017. Dynamic and adaptive fault tolerant scheduling with QoSconsideration in computational grid. IEEE Access 5, 7853–7873.

Hamad, S.A., Omara, F.A., 2016. Genetic-based task scheduling algorithm in cloudcomputing environment. Int. J. Adv. Comput. Sci. Appl. 7 (4), 550–556.

He, Z., Zhang, T., Lee, R.B., 2017. Machine learning based DDoS attack detection fromsource side in cloud. In: Cyber Security and Cloud Computing (CSCloud), 2017IEEE 4th International Conference on. IEEE, pp. 114–120.

Hegde, R., Kumar, S., Gurumurthy, K.S., 2013. The impact of network topologies onthe performance of the in-vehicle network. Int. J. Comput. Theory Eng. 5 (3),405–409.

Hosseini, S.M., Arani, M.G., 2015. Fault-tolerance techniques in cloud storage: asurvey. Int. J. Database Theory Appl. 8 (4), 183–190.

https://www.slideshare.net/supriyashilwant/firefly-algorithm-49723859 (availableonline 2018).

https://stackify.com/top-iaas-providers/ (available online 2018).https://stackify.com/top-paas-providers/ (available online 2018).https://www.datamation.com/cloud-computing/50-leading-saas-companies.html

(available online 2018).https://www.guru99.com/blockchain-tutorial.html (available online 2018).https://www.researchgate.net/publication/259383112_Honey_Bees_Inspired_

Optimization_Method_The_Bees_Algorithm (available online 2018).https://www.slideshare.net/AnujaJoshi6/cuckoo-o-ptimization-ppt (available

online 2018).https://www.slideshare.net/sumitjain2013/fault-tolerance-in-distributed-systems

(accessed 2018).https://www.studytonight.com/computer-networks/network-topology-types

(accessed 2018).Jaggi, P., Singh, A., 2014. Adaptive checkpointing for fault tolerance in an

autonomous mobile computing grid. Contemp. Comput. Informat., 553–557Jaggi, P.K., Singh, A.K., 2015. Movement-based checkpointing and message logging

for recovery in MANETs. Wirel. Pers. Commun. 83 (3), 1971–1993.Jialei Liu, I., Wang, Shangguang, Senior Member, IEEE, Zhou, Ao, Kumar, Sathish A.P.,

Yang, Fangchun, Senior Member, IEEE, Buyya, Rajkumar, Fellow, 2016. Usingproactive fault-tolerance approach to enhance cloud service reliability. IEEETrans. Cloud Comput., 1–1.

Kalyani, Z., Amune, A., Chas, M., Chas, M., 2016. Review on Secure DistributedDeduplication Systems with Improve, 5(1).

Kochhar, D., Hilda, A.K.J., 2017. An approach for fault tolerance in cloud computingusing machine learning technique. Int. J. Pure Appl. Math. 117 (22), 345–351.

Kruekaew, B., Kimpan, W., 2014. Virtual machine scheduling management on cloudcomputing using artificial bee colony. Proc. Int. MultiConf. Eng. Comput. Sci. 1,12–14.

Lagwal, M., Bhardwaj, N., 2017. Load balancing in cloud computing using geneticalgorithm. In: Intelligent Computing and Control Systems (ICICCS), 2017International Conference on. IEEE, pp. 560–565.

Leem, L. Title?: Fault-tolerant architecture for machine learning applications.Lebiednik, B., Mangal, A., Tiwari, N., 2016. A survey and evaluation of data center

network topologies, 1–12. arXiv preprint arXiv:1605.01701.Li, J., Chen, X., Huang, X., Tang, S., Xiang, Y., Member, S., 2015. Secure Distributed

Deduplication Systems with Improved Reliability, 64(12), 3569–3579Li-qun, Z., Shu-chen, L., Cheng-li, S., Chun-yan, Z., 2011. Fault-tolerant control

algorithm of neural network based on Particle Swarm Optimization. In: Controland Decision Conference (CCDC), 2011 Chinese. IEEE, pp. 700–704.

Liu, Y., Muppala, J.K., Veeraraghavan, M., Lin, D., Hamdi, M., 2012. Data centernetworks: topologies, architectures and fault-tolerance characteristics. SpringerScience & Business Media.

nce in cloud computing. Journal of King Saud University – Computer and

Page 18: A survey of fault tolerance in cloud computingstatic.tongtianta.site/paper_pdf/a6abdf18-1add-11e9-a3de... · 2019. 1. 18. · Cloud computing resources are offered to users as services,

18 P. Kumari, P. Kaur / Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx

Liu, Y., Muppala, J.K., Veeraraghavan, M., 2014. A survey of data center networkarchitectures. IEEE Commun. Surv. Tutorials 15 (2), 1–22.

Liu, H., Xu, D., Miao, H., 2011. Ant colony optimization based service flow schedulingwith various QoS requirements in cloud computing. In: First ACIS InternationalSymposium on Software and Network Engineering. IEEE, pp. 53–58.

Macskassy, S.A., Provost, F., 2007. A brief survey of machine learning methods forclassification in networked data and an application to suspicion scoring. In:Statistical network analysis: models, issues, and new directions. Springer,Berlin, Heidelberg, pp. 172–175.

Mareli, M., Twala, B., 2017. An adaptive Cuckoo search algorithm for optimisation.Appl. Comput. Informat. (September).

Mell, P., Grance, T., 2011. The NIST definition of cloud computing recommendationsof the national institute of standards and technology. Nist Spec. Publ. 145, 7.

Mishra, A., 2011. A comparative study of distributed denial of service attacks.Intrusion Tolerance Mitigat. Techn.

Mohanapriya, N., Priya, C., 2016. A fault tolerant router with optimized an colonyalgorithm. In: Advanced Communication Control and Computing Technologies(ICACCCT), 2016 International Conference on. IEEE, pp. 167–170.

Mukwevho, M.A., Celik, T., 2018. Toward a smart cloud: a review of fault-tolerancemethods in cloud systems. IEEE Trans. Serv. Comput. 1374 (c), 1–18.

Nazari Cheraghlou, M., Khadem-Zadeh, A., Haghparast, M., 2016. A survey of faulttolerance architecture in cloud computing. J. Netw. Comput. Appl. 61, 81–92.

Pandey, S., Wu, L., Guru, S.M., Buyya, R., 2010. A particle swarm optimization-basedheuristic for scheduling workflow applications in cloud computingenvironments. In: 24th IEEE International Conference on AdvancedInformation Networking and Applications, pp. 400–407.

Pandya, K., 2013. Network structure or topology. Netw. Struct. Topol. 1 (2), 22–27.Park, J.H., Park, J.H., 2017. Blockchain security in cloud computing: use cases,

challenges, and solutions. Symmetry 9 (8), 164.Park, T.,Woo,N., Yeom,H.Y., 2002. An efficient optimisticmessage logging scheme for

recoverablemobile computing systems. IEEETrans.Mob. Comput. 1 (4). 265–251.Park, J., Yoo, G., Lee, E., 2005. Proactive self-healing system based on multi-agent

technologies. In: Software Engineering Research, Management andApplications, 2005. Third ACIS International Conference on. IEEE, pp. 256–263.

Patidar, S., Rane, D., Jain, P., 2011. A survey paper on cloud computing. In: Proc. -2012 2nd Int. Conf. Adv. Comput. Commun. Technol. ACCT 2012, pp. 394–398.

Patil, A., Shah, A., Gaikwad, S., Mishra, A.a., Kohli, S.S., Dhage, S., 2011. FaultTolerance in Cluster Computing System. In: 2011 Int. Conf. P2P, Parallel, Grid,Cloud Internet Comput., pp. 408–412.

Poola, D., Ramamohanarao, K., Buyya, R., 2014. Fault-tolerant workflow schedulingusing spot instances on clouds. Procedia Comput. Sci. 29, 523–533.

Prakash, R., Member, S., Society, I.C., 1996. Low-cost checkpointingl and failurerecovery in mobile computing systems. IEEE Trans. Parallel Distrib. Syst. 7(October), 1035–1048.

Prathiba, S., Sowvarnica, S., 2017. Survey of failures and fault tolerance in cloud. In:Proc. 2017 2nd Int. Conf. Comput. Commun. Technol. ICCCT 2017, pp. 169–172.

Qiu, F., Zhang, B., Guo, J., 2016. A deep learning approach for VM workloadprediction in the cloud. In: 2016 17th IEEE/ACIS International Conference onSoftware Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD). IEEE, pp. 319–324.

Qiu, W., Zheng, Z., Wang, X., Yang, X., Lyu, M.R., 2014. Reliability-based designoptimization for cloud migration. IEEE Trans. Serv. Comput. 7 (2), 223–236.

Rani, A., 2016. Proposed model for multi-objective differential evolution (mode) incloud computing. Int. J. Adv. Res. Innovative Ideas Educ. 3, 4119–4124.

Rimal, B.P., Choi, E., Lumb, I., 2009. A taxonomy and survey of cloud computingsystems. In: NCM 2009 - 5th Int. Jt. Conf. INC, IMS, IDC, pp. 44–51.

Saikia, L.P., Devi, Y.L., 2014. Fault tolerance techniques and algorithms in cloudsystem. Int. J. Comput. Sci. Commun. Networks 4 (1), 1–8.

Santra, S., Acharjya, P.P., 2013. A study and analysis on computer network topologyfor data communication. Int. J. Emerg. Technol. Adv. Eng. 3 (1), 522–525.

Savu, L., 2011. Cloud computing: deployment models, delivery models, risks andresearch challenges. In: 2011 Int. Conf. Comput. Manag. CAMAN 2011.

Sem-Jacobsen, F.O., Skeie, T., Lysne, O., Duato, J., 2011. Dynamic fault tolerance in fattrees. IEEE Trans. Comput. 60 (4), 508–525.

Sharma, S., 2017. Enhance data security in cloud computing using machine learningand hybrid cryptography techniques. Int. J. Adv. Res. Comput. Sci. 8 (9), 393–397.

Shaw, R., Howley, E., Barrett, E., 2017. An Advanced Reinforcement LearningApproach for Energy-Aware Virtual Machine Consolidation in Cloud DataCenters, pp. 61–66.

Shwe, T., Aye, W., 2008. A fault tolerant approach in cluster computing system. In:5th Int. Conf. Electr. Eng. Comput. Telecommun. Inf. Technol. ECTI-CON 2008,vol. 1, pp. 149–152.

Singh, P., Cabillic, G., 2003. A Checkpointing Algorithm for Mobile ComputingEnvironment, 65–74.

Singh, S., Jeong, Y.S., Park, J.H., 2016. A survey on cloud computing security: issues,threats, and solutions. J. Netw. Comput. Appl. 75, 200–222.

Singh, G., Kinger, S., 2013. A survey on fault tolerance techniques and methods incloud computing. Int. J. Eng. Res. Technol. 2 (6), 1215–1217.

Sridaran, R., 2015. An overview of DDoS attacks in cloud environment. Int. J. Adv.Networking Appl. (IJANA), 124–127.

Strayer, W.T., Lapsely, D., Walsh, R., Livadas, C., 2008. Botnet detection based onnetwork behavior. In: Lee, W., Wang, C., Dagon, D. (Eds.), Botnet Detection.Advances in Information Security, vol 36. Springer, Boston, MA.

Tantikul, T., Manivannan, D., 2005. A communication-induced checkpointing andasynchronous recovery protocol for mobile computing systems. Parallel Distrib.Comput. Appl. Technol. PDCAT Proc. 2005, 70–74.

Please cite this article in press as: Kumari, P., Kaur, P. A survey of fault toleraInformation Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.09.021

Tosh, D.K., Shetty, S., Liang, X., Kamhoua, C.A., Kwiat, K.A., Njilla, L., 2017. Securityimplications of blockchain cloud with analysis of block withholding attack. In:Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloudand Grid Computing. IEEE Press, pp. 458–467.

Trivedi, M., 2016. A survey on resource provisioning using machine learning incloud computing. Int. J. Eng. Dev. Res. 4 (4), 546–549.

Tsai, J.T., Fang, J.C., Chou, J.H., 2013. Optimized task scheduling and resourceallocation on cloud computing environment using improved differentialevolution algorithm. Comput. Oper. Res. 40 (12), 3045–3055.

Umamakeswari, P.V.M.D.A., Shaileshdheep, S.G.B.G., Aishwarya, V., Subramanian, E.R.S., 2017. A study of various meta heuristic algorithms for scheduling in cloud.Int. J. Pure Appl. Math. 115 (7), 205–208.

Valle, G. et al., 2008. A framework for proactive fault tolerance. In: ARES 2008 - 3rdInt. Conf. Availability, Secur. Reliab. Proc., pp. 659–664.

Wang, S. et al., 2016. Cloud Service Reliability Enhancement via Virtual MachinePlacement Optimization Cloud, 10(June), 902–913.

Wang, B., Qi, Z., Ma, R., Guan, H., Vasilakos, A.V., 2015. A survey on data centernetworking for cloud computing. Comput. Networks 91, 528–547.

Wang, T., Su, Z., Xia, Y., Hamdi, M., 2014. Rethinking the data center networking:architecture, network protocols, and resource sharing. IEEE Access 2, 1481–1496.

Xiong, N., Yang, Y., Cao, M., He, J., Shu, L., 2009. A survey on fault-tolerance indistributed network systems. In: Proc. - 12th IEEE Int. Conf. Comput. Sci. Eng.CSE 2009, vol. 2, pp. 1065–1070.

Xu, T., Potkonjak, M., 2016. Energy-efficient fault tolerance approach for internet ofthings applications. In: Computer-Aided Design (ICCAD), 2016 IEEE/ACMInternational Conference on. IEEE, pp. 1–8.

Xu, C., Wang, K., Guo, M., 2018. Intelligent resource management in blockchain-based cloud datacenters. IEEE Cloud Comput. 4 (6), 50–59.

Yang, X.S., He, X., 2013. Firefly algorithm: recent advances and applications, pp. 1–14. arXiv preprint arXiv:1308.3898.

Yao, G., Ding, Y., Hao, K., 2017. Using imbalance characteristic for fault-tolerantworkflow scheduling in cloud systems. IEEE Trans. Parallel Distrib. Syst. 28 (12),3671–3683.

Yuce, B., Packianather, M.S., Mastrocinque, E., Pham, D.T., Lambiase, A., 2013. Honeybees inspired optimization method: the Bees algorithm. Insects 4 (4), 646–662.

Zhang, P., Shu, S., Zhou, M., 2018b. An online fault detection model and strategiesbased on SVM-grid in clouds. IEEE/CAA J. Autom. Sin. 5 (2), 445–456.

Zhang, Q., Yang, L.T., Yan, Z., Chen, Z., 2018c. An efficient deep learning model topredict cloud workload for industry informatics. IEEE Trans. Ind. Informat. 14(7), 3170–3178.

Zhang, J., Yu, F.R., Wang, S., Huang, T., Liu, Z., Liu, Y., 2018a. Load balancing in datacenter networks: a survey. IEEE Commun. Surv. Tutorials (c). 1–1.

Zhao, J., Xiang, Y., Lan, T., Huang, H.H., Subramaniam, S., 2017. Elastic reliabilityoptimization through peer-to-peer checkpointing in cloud computing. IEEETrans. Parallel Distrib. Syst. 28 (2), 491–502.

Zhou, A., Sun, Q., Li, J., 2017. Enhancing reliability via checkpointing in cloudcomputing systems. China Commun. 14 (7), 108–117.

Zhou, H., Wang, H., Li, X., Leung, V.C.M., 2018. A survey on mobile data offloadingtechnologies. IEEE Access 6, 5101–5111.

Zissis, D., Lekkas, D., 2012. Addressing cloud computing security issues. Futur.Gener. Comput. Syst. 28 (3), 583–592.

Priti Kumari, received the MCA degree from theBanasthali University of Rajasthan, India in 2014. Gotthe M.Tech degree in Computer Science & Engineeringfrom the Banasthali University of Rajasthan, India in2016. She is currently a Ph.D. student in Jaypee Instituteof Information Technology, Noida, India, Computer Sci-ence branch. Her research interests include CloudComputing, Fault Tolerance, and Distributed Comput-ing.

Dr. Parmeet Kaur, received the B.E. degree in ComputerScience & Engineering from the Punjab EngineeringCollege of Chandigarh, India. She received the M.Techdegree in Computer Science from the KurukshetraUniversity of Haryana and Ph.D. degree in ComputerScience from NIT, Kurukshetra, India. She is currently anAssistant Professor (Senior Grade) in Jaypee Institute ofInformation Technology, Noida, India, Computer Sciencebranch. Her research interests include Mobile Comput-ing, Fault Tolerance, Theory of computation, Compilerdesign and Distributed algorithms.

nce in cloud computing. Journal of King Saud University – Computer and