The Anatomy Study of Agreement in a Cloud Computing

The Anatomy Study of Efficient Agreement

in a Cloud Computing Environment with Fallible Processes

Shu-Ching Wang

Chaoyang University of Technology Taiwan, R.O.C.

[email protected]

Kuo-Qin Yan*


[email protected]

(*: Corresponding author)

Chia-Ping Huang


[email protected]

Abstract—Fault-tolerance is an important research topic in

the study of distributed systems. To counter the influence of

faulty components, it is essential to reach a common agreement

in the presence of faults before performing certain tasks.

However, the agreement problem is fundamental to fault-

tolerant distributed systems. In previous studies, protocols

dealing with the agreement problem have focused on a

network topology with faulty hardware components. However,

cloud computing, an Internet-based development in which

dynamically scalable and often virtualized resources are

provided as a service over the Internet has become a significant

issue. Therefore, previous protocols for the agreement problem

with fallible hardware are not suitable for a cloud computing

environment with fallible processes. To enhance fault tolerance,

the agreement problem in a cloud computing environment with

fallible processes is revisited in this study. The proposed

protocol can solve the agreement problem with a minimal

number of rounds of message exchange and tolerates a

maximal number of faulty processes.

Keywords- Agreement problem; Byzantine agreement;

Consensus problem; Interactive consistency; Distributed system;

Fault-tolerance; Cloud computing

I. INTRODUCTION

Cloud computing is a new concept in distributed systems.

It is currently used mainly in business applications in which

computers cooperate to perform a specific service together.

In addition, the internet applications are continuously

enhanced with multimedia, and vigorous development of the

device quickly occurs in the network system [1,6]. As

network bandwidth and quality outstrip computer

performance, various communication and computing

technologies previously regarded as being of different

domains can now be integrated, such as telecommunication,

multimedia, information technology, and construction

simulation. Thus, applications associated with network

integration have gradually attracted considerable attention.

Similarly, cloud computing facilitated through distributed

applications over networks has also gained increased

recognition. In a cloud computing environment, users have

access to faster operational capability on the Internet [1],

and the computer systems must have high stability to keep

pace with this level of activity. However, cloud computing has greatly encouraged

distributed system design and application to support user-

oriented service applications [5]. Therefore, the users can use the application platform of cloud computing to execute the personal software or program in their capacity of account [7]. The processes in the cloud computing environment must be synchronously completed. To ensure the cloud computing environment is reliable, a mechanism to ensure that all nodes can reach an agreement is thus necessary.

The Internet platform of cloud computing provides many applications for users. Therefore, each node of a cloud computing environment needs to run many processes and needs to execute user’s requests synchronously. In the cloud computing environment, many users can execute application services simultaneously. Therefore, the high fault-tolerant capability of a cloud computing environment needs to be considered. However, in the cloud computing environment, nodes receive the user’s requests maybe influence by the disorderly faulty processes. Hence, to remove the affect of disorderly faulty processes needs to be mitigated. In a cloud computing environment, achieving perfect reliability must be accomplished by allowing a given set of nodes to reach a common agreement even in the presence of faulty processes.

The agreement problem is one of the most important

issues for designing a fault-tolerant distributed system [3].

The definition of the problem is to make the correct nodes in

an n nodes distributed system to reach agreement. Each

node i has its initial value vi and agrees on a set of common

values. Therefore, agreement has been achieved if the

following conditions are met:

Agreement: Each correct node agrees on a set of

common values V=[v1, v2, …, vn].

Validity: If the initial value of each correct node is vi,

then the i-th value in the common vector V

should be vi. Previous solutions of the agreement problem focused on a

network topology with faulty hardware components. However, previous studies of the agreement problem [3,6] do not specifically address the cloud computing with faulty processes to order the application of internet. Therefore, the agreement problem in a cloud computing environment with disorderly faulty processes is revised in this study. The proposed protocol is named Protocol of Fallible Processes (PFP in short) and it can lead an agreement of all correct nodes in a cloud computing environment.

The rest of this paper is organized as follows. Section 2 discusses the applications of cloud computing and the topology of cloud computing. Then, the proposed protocol

PFP is illustrated in detail in Section 3. In Section 4, an example of the execution of PFP is given. Section 5 demonstrates the correctness and complexity of PFP. Section 6 concludes this paper.

II. RELATED WORK

The previous studies of the agreement problem [6] do not specifically address cloud computing to order the application of internet. Hence, in this section, the applications of cloud computing are illustrated first. Then, the network construction of cloud computing is discussed.

A. Practical Applications

Cloud computing is a style of computing where massively scalable IT-related capabilities are provided to multiple external customers “as a service” using internet technologies [7]. The cloud providers have to achieve a large, general-purpose computing infrastructure; and virtualization of infrastructure for different customers and services to provide the multiple application services. The ZEUS Company has developed several types of software [10] that can create, manage, and deliver exceptional online services from physical and virtual datacenters or from any cloud environment, such as ZXTM [10] and ZEUS Web Server (ZWS) [10], as shown in Fig 1 [10].

Fig. 1 The resource of cloud infrastructure with virtualization [10]

A cloud infrastructure virtualizes large-scale computing

resources and packages them up into smaller quantities [10]. Furthermore, the ZEUS Company develops software that can let the cloud provider easily and cost-effectively offer every customer a dedicated application delivery solution. The ZXTM software is much more than a shared load balancing service and it offers a low-cost starting point in hardware development, with a smooth and cost-effective upgrade path to scale as your service grows [10].

The ZEUS provided network framework can be utilized to develop new cloud computing methods, and is utilized in the current work. In this network composition that can support the network topology of cloud computing used in our study. According to the ZEUS network framework can testify the construction of network topology that we proposed in this paper, which let the company has to be considered. Hence, the proposed network topology of cloud computing that has the trustworthy example with the company provided to support our study.

B. Network Construction

Cloud computing is a new distributed system concept that has been implemented by businesses such as Google [9] and Amazon [8]. Google provides various applications on their internet platform [9]. In previous literature, the agreement problem has been solved in various network topologies. However, previous studies of agreement problems [3,6] are not specifically address cloud computing to order the application of internet. Hence, in this paper, the topology of a cloud computing environment is applied. Subsequently, the agreement problem with fallible processes is discussed. The nodes of cloud computing are interconnected with the internet; the network is assumed reliable and synchronous. Fig. 2 is the topology of cloud computing used in our study. The topology is composed of two levels, as follows: (1) The nodes in an A-Level group receive the request from

users of different types of applications. Therefore, the nodes of an A-Level group have higher computational capability than the nodes in a B-Level group. In addition, nodes in an A-Level group compute enormous amounts data and can communicate with other nodes in the same group directly through transmission media.

(2) According to the properties of nodes, the nodes are clustered into a cluster Bi in a B-Level group, where 1≤i≤cn and cn is the total number of clusters in a B-Level group.

(3) For the reliable communication, multiple inter transmission media are used to connect the nodes between an A-Level group and a B-Level group. In A-Level group, each node forwards the message to all nodes in the corresponding cluster of the B-Level group.

Fig. 2. Example of topology of cloud computing

However, a process is said to be correct if it follows the

protocol specifications during the execution of a protocol;

otherwise, the process is faulty. The fault symptom of

process is called disorderly faults. The behavior of a

disorderly fault is the fault can cause other components

cannot complete work correctly and synchronously. A

disorderly faulty process takes place when a node fails to

transmit the fake messages and other nodes receive the same

fake messages. However, the behavior of disorderly faulty

process is unpredictable and to confuse other nodes to

receive incorrect messages to complete a specific service for

user. Therefore, the disorderly faulty processes will

influence each node in the system finish user’s request

synchronization. Here, a solution of the agreement problem

in the cloud computing environment with disorderly faulty

processes is presented.

III. THE PROPOSED PROTOCOL

In this study, the proposed protocol named Protocol

of Fallible Processes (PFP), is invoked to solve the

agreement problem in a cloud computing with fallible

processes. Based on the network topology of cloud

computing, PFP provides each node to transmit

messages to other nodes without influence from

disorderly faulty processes and the proposed protocol is

shown in Fig. 3. The PFP executes the follow steps:

Step 1: The nodes of the A-Level group execute the

Agreement Process (for the node i in the A-Level

group with initial value vi; 1≤i≤nA) and nA is the

total number of nodes in the A-Level group then an

agreement vector DEC can be acquired.

Step 2: Each element of DEC is mapped to a specific

application that will be executed in the cluster of

the B-Level group. Then, each node of the cluster

in A-Level group sends the specific element of

DEC to the nodes of a specific application having

the cluster of B-Level group.

Step 3: Each node k in the same cluster of B-Level group

receives the element values from nodes of A-Level

group.

Step 4: The nodes of B-Level group’s cluster execute the

Poll Process then the agreement value can be

obtained. The PFP is proposed to solve the agreement problem

when caused by disorderly faulty processes that may send incorrect messages to influence the cloud computing environment to reach agreement. When the disorderly faulty processes exist in the cloud computing environment then two rounds of message exchange required can be estimated to solve the agreement problem. In the cloud computing topology, the main work of an A-Level group’s nodes is collecting user requests. Each node in an A-Level group receives the various requests from users, while the nodes in a B-Level group’s cluster provide many services for users.

Each node in an A-Level group that uses the service request as the initial value executes the PFP to obtain the common vector DEC. Then each node of the A-Level group forwards the element of vector DEC to the nodes in the B-Level group. However, the specific service request can be conformed by the nodes of the same group.

Each node in the same cluster of a B-Level group receives the element from the nodes of the A-Level group. In the B-Level group, nodes may receive the fake value by the disorderly faulty processes in an A-Level group. The nodes in a B-Level group receive the fake value from failure processes of A-Level group through the transmission media.

Therefore, the number of A-Level group’s failure processes must be less than half with those processes.

Sequentially, each node in the same cluster of a B-Level group takes the majority value of the received element values (DEC). Hence, the agreement can be achieve by the common agreement value.

PFP protocol

The nodes of A-Level group execute Agreement Process.

The nodes of B-Level’s group execute Poll Process.

Agreement Process

Message Exchange Phase:

r = 1

Node i broadcasts vi, then receives the initial value from the other nodes

in A-Level group, and construct vector Vi.

r = 2

Node i broadcasts Vi, then receives column vectors broadcasted by other

nodes, and construct MATi.

Decision Making Phase:

(1) Take the majority value of each column k of MATi to MAJk.

(2) Search for any MAJk. If (∃MAJk=¬vi), then DECi=φ; else if (∃MAJk=λ)

AND (vki=vi), then DECi=φ; else DECi=vi, and terminate.

(3) Each node i sends the specific element of DECi to the nodes of a

specific application having the cluster of B-Level group.

Function MAT

(1) Receive the initial value vj from node j, for 1≤j≤n and j≠i.

(2) Construct the vector Vi=[v1, v2,…, vn], 1≤j≤n and j≠i.

(3) Broadcast Vi to all nodes, and receive column vector Vj from node j,

1≤j≤n.

(4) Construct a MATi (Setting the vector vj in column j, for 1≤j ≤n).

Poll Process

(1) The node k in the cluster of the B-Level group receives the element

value of DEC with a specific application needs to be serviced.

(2) To take a majority value MAJk (1≤k≤nBj, where nBj is the number of

nodes in cluster j of the B-Level group) of the received element

values.

(3) The common value vk with the node k in cluster j of the B-Level

group can be obtained.

Fig. 3 Protocol PFP

IV. EXAMPLES OF EXECUTING PFP

An example of an A-Level group in the topology of cloud computing is shown in Fig. 4. The nodes in an A-Level group receive service requests. PFP requires two rounds to exchange the messages in an A-Level group. Each node can obtain the initial value in the A cluster as shown in Fig. 5. In the first round of the Message Exchange Phase, each node receives the initial value from other nodes to construct vector Vi, as shown in Fig. 6. In the second round of the Message Exchange Phase, each node broadcasts its vector Vi, and receives the vectors from other nodes to construct the matrix MATi, as shown in Fig. 7. Finally, the common value DECi (=1) is obtained by taking the majority value on matrix MATi.

The node in A-level group broadcasts the DECi (=1) value to the cluster of a B-level group. All nodes in the cluster of a B-Level group receive the element DECi from the nodes of an A-Level group for the specific applications needing to be serviced, as shown in Fig 8. Subsequently, the POLL function is applied to take a majority value. Then, the agreement value is gotten, as shown in Fig. 8.

Fig. 4. Example of cluster A in an A-Level group with the 1st round

Fig. 5. The initial value of each node in A cluster

Fig. 6. Each node in the A cluster at the 1st round

Fig. 7. Construct MAT in 2nd round and MAJ of MAT as decision value

Fig. 8. Each node of BⅡ-1 cluster receives element of DECA from an A-

Level group node

V. THE CORRECTNESS AND COMPLEXITY

The protocol is proposed and the following proofs for the agreement and validity property are given in this section. The following lemmas and theorems are used to prove the correctness and complexity of the PFP. Lemma 1. If the initial value of node i is vi, whether the process is correct or not, the majority value at the i-th row of

MATj, 1≤j≤nA, should be either vi or cannot be determined

with vij=¬ vi. Proof: When process is under the influence of disorderly fault, we consider the following two cases after the first round is run. Case 1: vij = vi.

Since there are at most nA/2-1 disorderly faulty

processes connected with node j, at most nA/2-1 values that

may be ¬ vi’s in the second round. The number of vi’s is nA-

(nA/2-1) =(nA+1)/2 in the i-th row; therefore, the majority of the i-th row in MATi is vi.

Case 2: vij = ¬¬¬¬ vi.

There are at most nA/2-1 faulty processes Therefore, in

the second round, the total number of ¬vi‘s does not exceed

(nA/2-1)+1=nA/2 and the number of vi‘s is at least nA-

(nA/2-1)=(nA+1)/2. If nA is an even number, then

nA/2=nA/2, the majority of the i-th row in MATj cannot

be determined. If nA is an odd number, then nA/2<nA/2. Hence, the majority of the i-th row in MATj is vi.

Lemma 2. If (¬∃MAJk=¬ vi) and {(∃MAJk=?) and (vki = vi)}

in MATi, then DECi=φ is correct. Proof: If MAJk does not exist or cannot be determined,

there are exactly nA/2 v’s and n/2 ¬v’s in the k-th row. Let vki

=vi in MATi, then, all nA/2 ¬v’s should be received in the

second round. Notably, nA/2-1 faulty processes are in the system. Therefore, in the second round, node i at least receives a value from node k without any disturbance. The initial value of node k should disagree with the initial value

of node i; hence it is correct to choose DECi=φ. If vki =¬vi,

we claim that ¬vi ought to be passed by a faulty process

from node k, and the initial value of node be ¬vki =vi. To prove, if faulty process is correct, then the initial value of

node k should be ¬vi. By Lemma 1, the majority value of the

k-th row in MATi is ¬vi. This is contradiction with the

condition of (¬∃MAJk=¬vi. If the initial value of node k was

¬vi, then, by Lemma 1, MAJk should be either ¬vi or cannot be determined for vki=vi. It is a contradiction. Theorem 1. PFP is valid. Proof: According to Lemmas 1 and 2, the validity of PFP is confirmed. Theorem 2. Protocol PFP can reach an agreement. Proof:

Agreement:

Part 1: If a node agrees on φ, then all nodes should agree

on φ. If the node i with initial value vi agrees on φ, by

Theorem 1, at least there is a node k with initial value ¬vi in the environment. By Lemma 2, the majority value in

the k-th row of MATj, 1≤j≤nA, should be either ¬vi or ? (Cannot be determined) for vjk =vi. All nodes with initial value vi are agreement. Similarly, for the node i with

initial value ¬vi, the majority value of the i-th row in

MATj, 1≤j≤nA, should be either vi or cannot be

determined with ¬vij = vi. All nodes with initial value ¬vi

agree on φ, too. Part 2: If a node agrees on vi, then all nodes should agree on vi. If the node i with initial value vi and DECi= vi, but

there exists some node j, j≠i, which has DECj ≠ vi. Therefore, such a situation is impossible. To demonstrate

this impossibility, if DECi =φ, by Part 1, then DECi =φ.

This contradicts the above assumption. If DECi =¬vi,

unless the initial value of node j is ¬vi; otherwise, it is impossible according to the definition of a agreement

problem. However, if the initial value of node j is ¬vi, by

Lemma 2, MAJk equals to ¬vi or cannot be determined

with vji =vi in MAT. Then, DECi =φ, it is a contradiction. Hence, all nodes should agree on the same value by the definition of agreement problem.

Validity: To prove this case, the initial value of all nodes

should be the same. If there is a value ¬vi in MATj,

1≤j≤nA, then the value must be attributed to a faulty

process. There are at most nA/2-1 faulty processes;

hence, there is at most nA/2-1 ¬vi‘s in each row. Since

the value received in the first round may be ¬vi, the majority of each row for all MATj should be vi. Therefore, all nodes should agree on vi.

Theorem 3. The maximal number of allowable faulty processes by PFP is TF= TFA+ TFB. TFA is the total number of allowable faulty processes in A-Level group and TFA

=nA/2-1 where nA is the number of nodes in an A-Level group. TFB is the total number of allowable faulty processes

in B-Level group and TFB = ∑=

−nc

jBjn

1

)1( .

Proof: In the A-Level group, if the allowable faulty

processes TFA is greater than nA/2-1, then the nodes in A-Level group cannot reach agreement. However, when a node of cluster j in B-Level group exists, then the specific application can possibly be carried out where 1≤j≤cn and cn is the number of clusters in B-Level group. Therefore, the

fault tolerant capability of B-Level group is ∑=

−nc

jBjn

1

)1(

where nBj is the number of nodes in group j of B-Level group. Hence, the maximal number of allowable faulty processes by PFP is TF= TFA+ TFB. Theorem 4. PFP requires 3 rounds to solve the agreement in

a cloud-computing environment.

Proof: The nodes in A-Level group need 2 rounds to

exchange message in the Message Exchange Phase. In

addition, in the Decision Making Phase, each node in A-

Level group sends the specific element of DEC to the nodes

in a B-Level group. Therefore, an additional round of

message exchange is required. In conclusion, the minimum

number of rounds of PFP is 3. Moreover, the number rounds

required is optimal.

VI. CONCLUSIONS

Cloud computing is a new concept of distributed systems [1,2,9]. It has greatly encouraged distributed system design and practice to support user-oriented services with application [5,7]. In the Internet platform of cloud computing where each node needs to complete the user’s requests synchronously and to reach the common agreement as specific service. Fault-tolerance is an important research topic in the study of distributed systems and it is a fundamental problem in distributed systems; there are many relative literatures in the past [3,6]. According to previous studies, network topology plays an important role in the agreement problem, but the results cannot cope with a cloud computing environment and the agreement problem thus needs to be reinvestigated. Moreover, in this paper, the agreement problem with disorderly faulty processes has been solved by the proposed protocol.

The proposed Processes Failure of Cloud computing (PFP in short) ensures that all nodes in a cloud computing environment can reach a common value. Moreover, the new protocol PFP is adapted to the cloud computing environment and the solution of PFP is applied to solve the disorderly faulty processes existence. PFP can derive the bound of allowable disorderly faulty processes. PFP uses the minimum number of rounds of message exchange and tolerates the maximum number of allowable disorderly faulty processes in a cloud computing environment. Furthermore, the fault-tolerance capacity is enhanced by PFP.

Merely considering processes faults in the agreement problem is insufficient for the highly reliable distributed system of a cloud computing environment. A related closely problem called the Fault Diagnosis Agreement (FDA) problem. The objective of solving the FDA problem is to make each node can detects or locates the common set of disorderly faulty processes in the distributed system. Therefore, solving the FDA problem for the highly reliable distributed system underlying topology of cloud computing environment is included in our future work.

ACKNOWLEDGMENT

This work was supported in part by the Taiwan National

Science Council under Grants NSC97-2221-E-324–007.

REFERENCE

[1] F.M. Aymerich, G. Fenu and S. Surcis, “An Approach to a Cloud Computing Network,” the First International Conference on the Applications of Digital Information and Web Technologies, pp. 113-118, Aug. 2008.

[2] R.L. Grossman, Y. Gu, M. Sabala and W. Zhang, “Compute and Storage Clouds Using Wide Area High Performance Networks,” Future Generation Computer Systems, Vol. 25, No. 2, pp. 179-183, Feb. 2009.

[3] L. Lamport, R. Shostak, and M. Pease, “The Byzantine General Problem,” ACM Transactions on Programming Language and Systems, Vol. 4, No. 3, pp. 382-401, Jul. 1982.

[4] M.A. Vouk, “Cloud Computing- Issues, Research and Implementations,” Information Technology Interfaces, pp. 31-40, Jun. 2008.

[5] L.H. Wang, J. Tao and M. Kunze, “Scientific Cloud Computing: Early Definition and Experience,” the 10th IEEE International Conference on High Performance Computing and Communications, pp. 825-830, 2008.

[6] S.C. Wang, K.Q. Yan, S.S. Wang and G.Y. Zheng, “Reaching Agreement Among Virtual Subnets in Hybrid Failure Mode,” IEEE Transactions on Parallel and Distributed Systems, Vol. 19, No. 9, pp. 1252-1262, Sept. 2008.

[7] A. Weiss, “Computing in The Clouds,” netWorker, Vol. 11, No. 4, pp. 16-25, 2007.

[8] (2010) Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more. [Online]. Available: http://www.amazon.com/

[9] (2010) More Google Product. [Online]. Available: http://www.google. com/options/

[10] (2010) ZXTM for Cloud Hosting Providers. [Online]. Available: http://www.zeus.com/cloud_computing/for_cloud_providers.html

Documents

The Anatomy Study of Agreement in a Cloud Computing