
DECENTRALIZED SYSTEMS PROJECT. MAY 2012

High Availability of Services in Wide-Area Shared Computing Networks

Mario Almeida (mario.almeida@est.fib.upc.edu), EMDC Student
Ozgur Seyhanli (ozgur.seyhanli@est.fib.upc.edu), CANS Student
Sergio Mendoza (sergio.mendoza@est.fib.upc.edu), CANS Student
Zafar Gilani (syed.zafar.ul.hussan.gilani@est.fib.upc.edu), EMDC Student

Abstract—Highly available distributed systems have been widely used and have proven to be resistant to a wide range of faults. Although these kinds of services are easy to access, they require an investment that developers might not always be willing to make. We present an overview of Wide-Area shared computing networks as well as methods to provide high availability of services in such networks. We make some references to highly available systems that were being used and studied at the time this paper was written.

Index Terms—High Availability, Wide-Area Networks, Replication, Quorum Consistency, Decentralized Systems, File Virtualization, Load Balancing, Migration of Services

I. Introduction

HIGHLY available distributed systems have been widely used and have proven to be resistant to a wide range of faults, such as power outages, hardware failures, security breaches, application failures, OS failures and even Byzantine faults. For example, services like Amazon Elastic Compute Cloud provide resizable computation capacity in the cloud with an annual uptime percentage of 99.95%.

Although these kinds of services are easy to access, they require an investment that developers might not always be willing to make. Also, some distributed systems have specific properties that make more sense when applied to shared non-dedicated computing networks. An example is a file-sharing peer-to-peer network, in which the developers might not want to be held responsible for the contents being shared.

In this report we present an overview of Wide-Area shared computing networks as well as methods to provide high availability of services in such networks. We make some references to highly available systems that were being used and studied at the time this paper was written.

II. Wide-Area Shared Computing Networks

A Wide-Area shared computing network is a heterogeneous, non-dedicated computer network. In these types of networks, machines have varying and limited resources and can fail at any time. Moreover, they are often not designed to deal with machine failures, which makes the challenge of having no planned downtimes or maintenance intervals even harder. These types of networks can be simulated using the PlanetLab testbed.

III. High Availability of Services

High availability is a system design approach and associated service implementation that ensures that a prearranged level of operational performance will be met during a contractual measurement period.

Availability is related to the ability of the user to access the system. If a user is unable to access the system, it is said to be unavailable, and the period in which the system is unavailable is called downtime. Unscheduled downtime can be due to multiple causes, such as power outages, hardware failures, security breaches or application/OS failures.

As stated in the CAP theorem [1], a distributed computer system has to decide which two of the following three properties will be provided: Consistency, Availability and Partition tolerance. This formulation tends to oversimplify the tensions between the properties, since the choice between consistency and availability only has to be made when there are partitions. Recent work [22] shows that there is a lot of flexibility for handling partitions and recovering from them, even for highly available and somewhat consistent systems. As an example of this spectrum, one can think of the NoSQL movement, which focuses on availability first and consistency second, while databases that provide ACID properties focus more on consistency.

Availability is usually expressed as a percentage of uptime in a year. Services generally provide service level agreements (SLAs) that constitute a contract on the minimum monthly downtime or availability. For example, services like Amazon Elastic Compute Cloud provide resizable computation capacity in the cloud with an annual uptime percentage of 99.95% [25]. These SLAs generally have a drawback, since they tend to cover only the core instances and not the services on which the instances depend. This was a big issue during EC2's Easter outage in 2011 [26].
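To make the relation between an availability percentage and the allowed downtime concrete, the following minimal Python sketch (our own illustration, not part of any system discussed here) converts an SLA uptime figure into yearly and monthly downtime budgets; 99.95% annual uptime corresponds to roughly 4.4 hours of downtime per year.

# Convert an SLA availability percentage into the allowed downtime budget.
# Illustrative helper only; the 99.95% figure is the EC2 SLA value cited above.

HOURS_PER_YEAR = 365 * 24

def downtime_budget(availability_percent):
    """Return (hours per year, minutes per month) of allowed downtime."""
    unavailability = 1.0 - availability_percent / 100.0
    hours_per_year = unavailability * HOURS_PER_YEAR
    minutes_per_month = hours_per_year * 60 / 12
    return hours_per_year, minutes_per_month

hours, minutes = downtime_budget(99.95)
print("99.95%% uptime allows %.2f hours/year (%.1f minutes/month)" % (hours, minutes))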

As downtimes in distributed systems generally occur due to faults, we will focus on a specific type of fault that is depicted by the Byzantine Generals Problem [8].

A. Byzantine faults

A Byzantine fault is an arbitrary fault that occurs during the execution of an algorithm by a distributed system. It can describe omission failures (crashes, lost communication, etc.) and commission failures (incorrect processing, corrupt states or incorrect responses). If a system is not tolerant to Byzantine faults, it might respond in unpredictable ways.

Several techniques have been widely used since the publication of [8] in 1999. Some open-source solutions, like UpRight, provide Byzantine fault tolerance using a Paxos-like consensus algorithm.

B. High Availability in Wide-Area Networks

High availability clusters usually rely on a set of techniques to make the infrastructure as reliable as possible, including disk mirroring, redundant network connections, redundant storage area networks and redundant power inputs on different circuits. In the case of Wide-Area networks, only a few of these techniques can be used, since such a network relies on heterogeneous machines that are not designed specifically for providing high availability.

As the main property of this type of network is the heterogeneity of nodes and their varying resources, it is crucial to scale a service's capacity depending on the incoming requests and the resources actually available to each node. Due to the limited resources, it is important to be able to scale the service to more nodes. This is one of the key points of service availability: if a node receives more requests than it can handle, it stops being able to provide the service and therefore no longer offers high availability. This means that a service needs to do load balancing and sometimes partition data or state across several machines in order to scale. Scaling the number of machines also increases the probability that some machines fail. This can be addressed by creating redundancy, by means of replication, to tolerate failures.

C. Load balancing

Load balancing is a methodology to distribute workload across multiple machines in order to achieve optimal resource utilization, maximize throughput, minimize response time and avoid overloading. A simple solution can be achieved through the domain name system, by associating multiple IP addresses with a single domain name.

In order to determine how to balance the workload, the load balancer can also take other characteristics into account, such as reported server load, recent response times, which nodes are alive, number of connections, traffic and geographic location.

Load balancing can be done at two levels: at the system level, when tracking services, and at the node level. At the node level, load balancing can be achieved either by redirecting requests or by redirecting clients. The nodes could also send tokens to each other in order to estimate how many requests they can redirect to one another.
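As a minimal sketch of the node-level balancing just described, the following Python fragment redirects each request to the least-loaded replica that is still considered alive; the node names, load values and liveness timeout are illustrative assumptions, not part of any system described above.

# Minimal sketch: redirect requests to the least-loaded replica that is alive.
import time

class Node:
    def __init__(self, name):
        self.name = name
        self.reported_load = 0.0       # e.g. pending requests or CPU usage
        self.last_heartbeat = time.time()

    def is_alive(self, timeout=30.0):
        return time.time() - self.last_heartbeat < timeout

def pick_replica(nodes):
    """Return the alive node with the lowest reported load, or None."""
    alive = [n for n in nodes if n.is_alive()]
    if not alive:
        return None
    return min(alive, key=lambda n: n.reported_load)

nodes = [Node("pt"), Node("gr"), Node("no")]
nodes[0].reported_load = 0.7
nodes[1].reported_load = 0.2
target = pick_replica(nodes)
print("redirect request to", target.name if target else "no replica available")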

D. Replication of Services

To explain how replicating a service can help it tolerate failures, let us consider the probability of failure of a single machine to be P and assume that machines fail independently. If we replicate data on N nodes to survive N-1 failures, the probability of losing a specific piece of data is P^N. A desired bound R on this loss probability can be met by increasing the number of replicas N until P^N < R.
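As a worked example of this formula, assuming independent failures as above, the following small Python function finds the smallest N for which P^N drops below the bound R.

# Find the smallest replica count N such that P**N < R,
# assuming independent per-machine failure probability P.

def replicas_needed(p, r):
    n = 1
    while p ** n >= r:
        n += 1
    return n

# e.g. P = 0.1 and R = 1e-4 require N = 5 replicas (0.1**4 = 1e-4 is not strictly below R)
print(replicas_needed(0.1, 1e-4))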

So we can reduce the probability of service downtime by increasing the number of replicas. This is not as easy as it seems, however, as the increasing number of replicas also has an impact on the performance and complexity of the system. For example, a higher number of replicas implies more messages to keep them consistent.

Fig. 1. Diagram of active replication architecture.

Fig. 2. Diagram of passive replication architecture.

Replication is important not only to create the redundancy needed to handle failures, but also to balance the workload by distributing the client requests to the nodes depending on their capacities.

When we talk about replication, two simple schemes come to mind: active and passive replication [11]. The architectures of the active and passive replication models are represented, respectively, in Figure 1 and Figure 2.

In active replication each request is processed by all the nodes. This requires that the process hosted by the nodes is deterministic, meaning that, given the same initial state and the same request sequence, all processes produce the same response and reach the same final state. This also introduces the need for atomic broadcast protocols, which guarantee that either all the replicas receive the messages in the same order or none receives them.
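The following minimal Python sketch illustrates this idea: every replica applies the same deterministic operations in the same, already agreed, order and therefore reaches the same state. The atomic broadcast itself is assumed and not implemented here.

# Sketch of active replication: every replica applies the same deterministic
# operations in the same total order and ends up in the same state.

class Replica:
    def __init__(self):
        self.state = {}

    def apply(self, op):
        """Deterministic operation: same input order implies same final state."""
        key, value = op
        self.state[key] = value
        return self.state[key]

replicas = [Replica() for _ in range(3)]
ordered_ops = [("x", 1), ("y", 2), ("x", 3)]     # the agreed total order
for op in ordered_ops:
    responses = [r.apply(op) for r in replicas]  # every replica processes every request
assert all(r.state == replicas[0].state for r in replicas)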


In passive replication there is a primary node that processes client requests. After processing a request, the primary replicates its state to the other backup nodes and sends the response back to the client. If the primary node fails, a leader election takes place and one of the backups takes its place as primary.

In regular passive replication, secondary replicas should only perform reads, while writes are performed by the primary replica and then propagated to the other replicas. There could be better workload balancing if every node could receive requests, but this also implies using another mechanism to keep consistency between nodes. Caching of reads can also greatly improve the overall performance of the system, but one may have to relax consistency properties to achieve this.
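A minimal sketch of this primary-backup scheme is shown below; it is our own illustration, with failover and the consistency of backup reads deliberately left out.

# Sketch of passive (primary-backup) replication: only the primary processes
# writes and then ships its state to the backups; backups may serve reads.

class PassiveReplica:
    def __init__(self):
        self.state = {}

    def install_state(self, state):
        self.state = dict(state)      # backup overwrites its state copy

class Primary(PassiveReplica):
    def __init__(self, backups):
        super().__init__()
        self.backups = backups

    def write(self, key, value):
        self.state[key] = value       # process the request once, locally
        for b in self.backups:        # then replicate the resulting state
            b.install_state(self.state)

backups = [PassiveReplica(), PassiveReplica()]
primary = Primary(backups)
primary.write("x", 42)
print(backups[0].state.get("x"))      # a (possibly stale) read served by a backup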

For passive replication, papers like A Robust and Lightweight Stable Leader Election Service for Dynamic Systems [3] describe implementations of fault-tolerant leader election services that use stochastic failure detectors [10] and link quality estimators to provide some degree of QoS control. These systems adapt to changing network conditions and have proven to be robust and not too expensive.

Active replication deals better with real-time systems that require fast responses, even when there are faults. The main disadvantage of active replication is that most services are non-deterministic, while the disadvantage of passive replication is that, in case of failure, the response is delayed.

Passive replication can be efficient enough if we consider that the type of services we want to provide performs significantly more reads than writes. Serializing all updates through a single leader can, however, be a performance bottleneck.

As replication also introduces communication and resource costs, some techniques are generally used to reduce them. An example is the use of read and write quorum sets, as we explain in the sections ahead.

E. Service Recovery

Another important characteristic of this kind of network is that a node can be shut down at any moment. In fact, some studies show that most failures in PlanetLab are due to machines being rebooted, which means that node regeneration capabilities are crucial in such an environment. It is noticeable that, in this case, re-accessing the data on secondary storage instead of creating a new node and performing replication from scratch could definitely improve the overall performance of the system (for systems that keep state). This also depends on the average observed downtime of the machines, which will be revisited in the evaluation.

F. Storage replication

Wide-Area shared computing networks are not the most propitious type of network for persistently storing files in a highly available way. Since nodes often reboot, the easiest approach would be to replicate data in the same way as the services. This solution highly depends on the amount of data that the service manages.

A common way to simplify access to remote files in a transparent way is to perform file virtualization. File virtualization eliminates the dependencies between the data accessed at the file level and the location where the files are physically stored. It allows the optimization of storage use and server consolidation, and makes it possible to perform non-disruptive file migrations.

Caching of data can be done in order to improve performance. There can also be a single management interface for all the distributed virtualized storage systems, which allows replication services across multiple heterogeneous devices.

Data replication can also be done in a hybrid way, storing less important content on the heterogeneous nodes and more important content in a more reliable distributed file system. An example of a somewhat hybrid system is Spotify: it makes use of files replicated on clients in order to offload some work from its servers, but when the clients have low throughput or the files are not available, the Spotify servers can provide the files in a more reliable way.

Amazon S3 also provides storage options such as reduced redundancy storage. This option reduces costs by storing non-critical, reproducible data at lower levels of redundancy. It provides a cost-effective, highly available solution for distributing or sharing content that is durably stored elsewhere, or for storing thumbnails, transcoded media, or other processed data that can be easily reproduced.

G. Migration of Services

Another problem in Wide-Area networks is that node resources can vary a lot. This means that although a node may have proven its worth during a period of time, its available resources, such as CPU, bandwidth or memory, can vary and affect the sustainability of the service. Also, if the level of replication is not aware of the variation of the nodes' resources, we might see the number of replicas needed to provide a service grow to a point where it affects the performance of the service. Due to this, a concept that has been researched lately is the resource-aware migration of services [5] between nodes.

It might seem that migration is a concept similar to replication, as it consists of copying data from one node to another. However, it is different, since it also aims to transfer the current state of execution in volatile storage as well as the archival state in secondary storage. Moreover, it also provides mechanisms for handling any ongoing client sessions.

Migration of services uses migration policies to decide when a service should migrate. These policies can be locality-aware and resource-aware; the resources taken into account can be, for example, CPU, bandwidth, memory and more.
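As a rough illustration of such a resource-aware policy (not taken from [5]; the thresholds and attribute names are our own assumptions), a migration decision could look like this:

# Illustrative resource-aware migration policy: migrate when the current node's
# free resources fall below thresholds, and pick the candidate with most headroom.

THRESHOLDS = {"cpu": 0.2, "bandwidth": 0.15, "memory": 0.25}   # fraction still free

def should_migrate(current, candidates):
    """current/candidates: dicts of available resource fractions per node."""
    overloaded = any(current[res] < limit for res, limit in THRESHOLDS.items())
    if not overloaded:
        return None
    # choose the candidate with the most headroom on its scarcest resource
    return max(candidates, key=lambda c: min(c[res] for res in THRESHOLDS))

current = {"cpu": 0.1, "bandwidth": 0.5, "memory": 0.4}
candidates = [{"cpu": 0.6, "bandwidth": 0.7, "memory": 0.5},
              {"cpu": 0.3, "bandwidth": 0.9, "memory": 0.6}]
print(should_migrate(current, candidates))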

Migration of services also introduces some issues, such as the need for a tracking system that allows clients to access a service whose location changes. Also, during migration there is a period in which the service might be unable to attend to requests, and so it needs to delegate responsibilities to another replica. This period is called the blackout period, and the aim of replication is to make this period negligible.

Recent research papers, such as Building Autonomically Scalable Services on Wide-Area Shared Computing [4], aim to provide models for estimating the service capacity that a replica is likely to provide in the near future, as well as models for dynamic control of the degree of service replication. This is done in order to provision the required aggregate service capacity based on the estimated service capacities of the replicas. They also describe techniques to provide reliable registry services for clients to locate service replicas. Their experimental evaluation shows the importance of these estimations, and they claim a prediction correctness of 95%.

In conclusion, the performance of this kind of system is highly dependent on the type of service provided. For services that make intensive use of secondary storage, migration is a very costly solution. One approach could consist of proactively selecting a potential target node and transferring secondary storage to it ahead of any future relocation.

H. Quorum Consistency

If we consider services that make large use of secondary storage, together with properties of Wide-Area shared computing networks such as the frequent shutdown of nodes, then we must be able to recover these nodes so that we do not have to replicate the whole data set again. If, on the other hand, we assume that this data is small, then we can simply replicate the server to a new node.

If we consider recovering services, we must have a way to keep track of the nodes that are alive and an efficient way to update them instead of copying the whole data set. For a small and fixed number of nodes, a simple heartbeat/versioning system is enough, but for a more dynamic number of replicas, a group membership protocol is probably more suitable for keeping track of the nodes.

In order to perform efficient updates in a dynamic set of replicas, a quorum system can be used to provide consistency. The main advantage of quorums is that their intersection properties can be exploited to propagate changes and reduce the number of messages needed. In the best case, they can reduce the number of messages needed to perform a critical action from three times the total number of nodes to three times the number of nodes in a quorum.

For example, in the case of passive replication, if the primary node needs to perform a write operation, it generates the vector clock for the new data version and performs the write locally. It then sends the new version to the nodes in its quorum; if all of those nodes respond, the write is considered successful. Thanks to the quorum properties, the primary does not need to contact all the backup nodes, only the nodes present in its quorum set. The latency is determined by the slowest node of this write quorum.

As the primary node can also fail, and it could hypothetically (depending on the design) hold the most recent version of the data, which it did not have time to replicate, it is important to be able to verify and merge the existing versions. This can be achieved by requesting all existing versions of the data from the read quorum and waiting for the responses from all those replicas. If there are multiple versions, all the causally unrelated versions are returned. Divergent versions are reconciled and written back to the write quorum.
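The following Python sketch illustrates the quorum idea with a write quorum W and a read quorum R chosen so that W + R > N; plain version numbers stand in for the vector clocks mentioned above, and reconciliation is reduced to picking the highest version. It illustrates the general technique, not any particular system.

# Quorum read/write sketch: W + R > N guarantees that every read quorum
# intersects the most recent write quorum.

N, W, R = 5, 3, 3                      # replicas, write quorum size, read quorum size
replicas = [{"version": 0, "value": None} for _ in range(N)]

def write(value):
    version = max(rep["version"] for rep in replicas) + 1
    for rep in replicas[:W]:           # contact only a write quorum
        rep["version"], rep["value"] = version, value
    return version

def read():
    answers = replicas[-R:]            # any R replicas form a read quorum
    latest = max(answers, key=lambda rep: rep["version"])
    return latest["version"], latest["value"]

write("v1")
print(read())                          # returns the latest version thanks to quorum overlap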

Quorum consistency is actually used in a variety of distributed systems and seems to perform well. An example is the quorum consistency of replicas used by Amazon's Dynamo [12]. Dynamo also manages group membership using a gossip-based protocol to propagate membership changes and maintain an eventually consistent view of membership. Other methods of achieving consistency include techniques like fuzzy snapshots to perceive the global state of the system composed by the replicas.

IV. Related work

Commercial approaches for replication have been evolving towards increasing tolerance to fail-stop faults. This is mainly because hardware costs keep falling, replication techniques have become better understood and easier to adopt, and systems have become larger, more complex, and more important.

There appears to be increasingly routine use of doubly-redundant storage. Similarly, although two-phase commit is often good enough, being always safe and only rarely un-live, an increasing number of deployments pay the extra cost of Paxos three-phase commit to simplify their design or avoid corner cases requiring operator intervention.

Distributed systems increasingly include limited Byzantine fault tolerance aimed at high-risk subsystems. For example, the ZFS [17], GFS [18], and HDFS [19] file systems provide checksums for on-disk data. As another example, after Amazon S3 was affected for several hours by a flipped bit, additional checksums on system state messages were added.

Some other systems that we have studied and include here are the UpRight fault tolerance infrastructure and the Zookeeper coordination service. We have studied many other systems that we do not list here; a special mention goes to Amazon's Dynamo storage system, which provides advanced techniques like the ones mentioned in previous chapters.

A. UpRight

UpRight is an open-source infrastructure and library for building fault-tolerant distributed systems [20]. It provides a simple library to ensure high availability and fault tolerance through replication. It claims to provide high availability, high reliability (the system remains correct even if Byzantine failures are present) and high performance. Figure 3 shows the architecture of UpRight.

Fig. 3. Diagram of the UpRight architecture.

As depicted in the architecture diagram, the application client sends its requests through the client library, and these requests are ordered by the UpRight Core. The application servers handle the ordered requests and send replies back to the clients. The redundancy provided by the UpRight replication engine guarantees that even if a given number of nodes are down, faulty, or even malicious, the system as a whole can still work correctly.

UpRight also uses some of the properties described in previous chapters, such as quorums. Its purpose is to optimistically send messages to the minimum number of nodes and resend to more nodes only if the observed progress is slow. It also provides Byzantine fault tolerance using a Paxos-like consensus algorithm.

B. Zookeeper

Zookeeper [16] is an open-source coordination service that has some similarities to Chubby [15]. It provides services like consensus, group management, leader election, presence protocols, and consistent storage for small files.

Zookeeper guards against omission failures. However, because a data center typically runs a single instance of a coordination service on which many cluster services depend, and because even a small control error can have dramatic effects, it seems reasonable to invest additional resources to protect against a wider range of faults.

Considering u as the total number of failures it can tolerate while remaining live, and r as the number of those failures that can be commission failures while maintaining safety, a Zookeeper deployment comprises 2u + 1 servers. A common configuration is 5 servers, for u = 2 and r = 0. Servers maintain a set of hierarchically named objects in memory. Writes are serialized via a Paxos-like protocol, and reads are optimized to avoid consensus where possible. A client can set a watch on an object so that it is notified if the object changes, unless the connection from the client to a server breaks, in which case the client is notified that the connection broke.

For crash tolerance, each server synchronously logs updates to stable storage. Servers periodically produce fuzzy snapshots to checkpoint their state: a thread walks the server's data structures and writes them to disk, but requests concurrent with snapshot production may alter these data structures as the snapshot is produced. If a Zookeeper server starts producing a snapshot after request S_start and finishes producing it after request S_end, the fuzzy snapshot representing the system's state after request S_end comprises the data structures written to disk plus the log of updates from S_start to S_end.
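The following simplified Python sketch illustrates the fuzzy-snapshot idea (it is not Zookeeper code): a checkpoint is taken while updates keep arriving, and recovery replays the update log from the point where the snapshot started.

# Simplified fuzzy-snapshot illustration: snapshot plus log replay from S_start.

state = {}          # in-memory hierarchy of named objects (simplified to a dict)
log = []            # synchronously persisted update log

def apply_update(key, value):
    log.append((key, value))
    state[key] = value

def fuzzy_snapshot():
    """Walk the data structures; concurrent updates may or may not be included."""
    return dict(state), len(log)       # snapshot plus log index at which it started

def recover(snapshot, log_start):
    restored = dict(snapshot)
    for key, value in log[log_start:]: # replay updates logged since the snapshot began
        restored[key] = value
    return restored

apply_update("/a", 1)
snap, start = fuzzy_snapshot()
apply_update("/a", 2)                  # update concurrent with or after the snapshot
print(recover(snap, start))            # yields the state after the last logged update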

V. Practical work

A. PlanetLab Overview

PlanetLab [21] is a heterogeneous infrastructure of computing resources shared across the Internet. Established in 2002, it is a global network of computers available as a testbed for computer networking and distributed systems research. In December 2011, PlanetLab was composed of 1024 nodes at 530 sites worldwide.

Accounts are available to people associated with companies and universities that host PlanetLab nodes. Each research project runs a "slice", which gives experimenters access to a virtual machine on each node attached to that slice.

Several efforts to improve the heterogeneity of PlanetLab have been made. OneLab, a European project funded by the European Commission, started in September 2006 with two overarching objectives: to extend the current PlanetLab infrastructure and to create an autonomous PlanetLab Europe.

PlanetLab Europe is a Europe-wide research testbed that is linked to the global PlanetLab through a peer-to-peer federation. During this project, different kinds of access technologies (such as UMTS, WiMAX and WiFi) were integrated, allowing the installation of new kinds of multi-homed PlanetLab nodes (e.g., nodes with an Ethernet interface plus one of these access interfaces) [23].

Since 2008, hundreds of researchers at top academic institutions and industrial research labs have tested their experimental technologies on PlanetLab Europe, including distributed storage, network mapping, peer-to-peer systems, distributed hash tables, and query processing. As of January 2012, PlanetLab Europe has 306 nodes at 152 sites.

B. PlanetLab Setup

An account is required to use the PlanetLab infrastructure. To use the resources offered by the various nodes, a slice has to be created. A slice is a collection of resources distributed across multiple PlanetLab nodes. When a node is added to a slice, a virtual server for that slice is created on that node; when a node is removed from a slice, that virtual server is destroyed. Each site's PI (Principal Investigator) is in charge of creating and managing slices at that site [24].

In order to measure a few metrics related to availability, we deployed a sample application on PlanetLab's UPC slice (upcple_sd). To run the experiments, we added a total of 8 nodes to our slice to create a virtual network over PlanetLab. The following table shows the hostnames of the nodes and their locations.

The map represented in Figure 4 shows the locations of the nodes in Europe.

We deployed a simple application to these nodes to evaluate the number of requests generated over time. More importantly, we evaluate the availability of the nodes over time based on the number of successful requests.

Fig. 4. Location of the nodes in Europe.

Apart from this, we also had to set up a web server at IST in Lisbon, Portugal. The web server is necessary for storing messages from the PlanetLab nodes; a message sent by a node to the web server is termed a heartbeat message. In our experiments, we set up each node to send a heartbeat message once every 10 minutes. We took measurements on data obtained for two periods of 6 hours each. These two periods correspond to day-time and night-time usage, in order to observe any difference in availability between day and night hours, since the percentage usage of a node can possibly affect its availability.

C. Monitoring

In order to automate the process of generating requests and having the nodes send heartbeat messages, we used cron jobs. Each node was instructed to execute a Python script once every 10 minutes. The Python script performed three simple tasks:

Get Node Credentials: get the PlanetLab node's credentials, such as its name and URL.

Get Site Credentials: get the PlanetLab site's credentials, such as site id, latitude, longitude and login base.

Post Data: encode this information into a URL-encoded string and send it to the web server as a POST message by calling a PHP script on the web server.

When called, the PHP script on the web server appends the node and site information sent by the PlanetLab node to a text file.
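The listing below is a rough reconstruction of what such a heartbeat cron job can look like; it is not the script we actually used, and the collector URL, field names and placeholder site values are assumptions.

# Illustrative heartbeat job: gather node/site information and POST it,
# URL-encoded, to the PHP collector. URL and field names are placeholders.
import socket
import urllib.parse
import urllib.request

COLLECTOR_URL = "http://example-webserver/heartbeat.php"   # placeholder address

def send_heartbeat():
    info = {
        "hostname": socket.gethostname(),      # node credentials
        "site_id": "unknown",                  # site credentials would go here
        "latitude": "0.0",
        "longitude": "0.0",
    }
    data = urllib.parse.urlencode(info).encode("utf-8")
    urllib.request.urlopen(COLLECTOR_URL, data=data, timeout=10)

if __name__ == "__main__":
    send_heartbeat()    # scheduled via cron every 10 minutes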

D. Results

Figure 5 shows the number of successful requests for each node during the night-time period. We represent each node by its location (i.e., country). The vertical axis shows the total number of requests. Each bar represents a different node and shows the number of successful requests for the night-time period from 2100 to 0300 hours (CEST). This time period is divided into 6 one-hour columns, as represented by the color-coded legend.

Fig. 5. Successful requests between 2200 and 2259 hours (CEST).

Fig. 6. Successful requests between 0900 and 0959 hours (CEST).

It can be observed from the bar chart that all the nodes responded successfully to requests, apart from the nodes in Portugal and Greece, which each failed for a request between 2200 and 2259 hours (CEST).

Figure 6 is similar to Figure 5 but shows the day-time period from 0600 to 1200 hours (CEST). It can be observed that the Norwegian node in our slice could not successfully reply to a request between 1000 and 1059 hours (CEST). Similarly, the node in Sweden failed to reply between 0900 and 0959 hours (CEST).

From these two bar charts we can conclude that most of the requests in a given time period were handled successfully and that the failure of one or more nodes does not affect the overall operation, since the application had replicas elsewhere.

Figure 7 shows a bar chart of the availability of the nodes in our PlanetLab slice. The vertical axis represents availability as a percentage, computed from the successful requests of each node. Each node has two bars: dark for night and light for day. As can be seen, most of the nodes show more than 97 percent availability. Some nodes, such as the ones in Portugal and Greece, were unavailable for a short period of time during night hours. Others, such as the ones in Norway and Sweden, were briefly unavailable during the day.


Fig. 7. Availability of nodes in our PlanetLab slice.

E. Issues

The standard approach to deploy software and applications on PlanetLab nodes is to use an application called CoDeploy [27]. However, using CoDeploy was neither convenient nor consistent: we observed that for most of the nodes the deployment failed altogether. As a workaround, we manually deployed the scripts on the PlanetLab nodes.

Similarly, the standard method of registering cron jobs on PlanetLab nodes is to use an application called MultiQuery [27], which is part of CoDeploy. We found that even though MultiQuery registers the cron jobs, it fails to start the crond daemon. As a workaround, we manually registered our cron jobs on the PlanetLab nodes.

VI. Evaluation of Highly Available Systems

The problem with theoretical reliability through replication is that it assumes that failures are indeed independent. If nodes share the same software and there can be corrupt requests, there is always some correlation between node failures (although machines in Wide-Area networks are less likely to share the same configuration). The reliability derived from P^N is therefore an upper bound that is never reached in practice. This is discussed in papers such as the one from Google [4], which shows empirical numbers on group failures demonstrating rates several orders of magnitude higher than the independence assumption would predict.

Reliability and high availability should not only be proved through theoretical methodologies; they have to be tested through empirical methods such as continuous hours of successful operation. There are two metrics that are often difficult to measure in academic research projects but give a very good measurement of the availability and reliability of a system: mean time between failures (MTBF) and mean time to recovery (MTTR).
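As a reference point, these two metrics are commonly combined into an availability estimate as MTBF / (MTBF + MTTR); the numbers below are made up for illustration.

# Availability estimated from mean time between failures and mean time to recovery.

def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

# e.g. one failure every 30 days with a 1-hour recovery gives about 99.86% availability
print("%.4f" % availability(30 * 24, 1.0))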

VII. Conclusion

We expected to perform further evaluation over PlanetLab. However, it took longer than expected to get an account and to get access to a slice and the respective nodes, mainly because each of these resources is managed by a different entity. Once we had an account, we were surprised by the time it takes for virtual machines to get configured on PlanetLab. Moreover, as mentioned in Section V-E, we consistently experienced failures of tools such as CoDeploy and MultiQuery. Ultimately, we had to accomplish things manually.

We also realized that some of the tools have not been updated for about ten years and that some of their dependencies are already deprecated.

We had to find a host in order to launch our server and gather the results from the PlanetLab nodes. As this host did not have a fixed IP, we had to constantly update our private/public keys to communicate with the nodes. If we had opted to use the PlanetLab tools, it would have taken even longer to evaluate our project, since it can take from a few minutes to a few hours to commit changes to virtual machine configurations.

To speed up the development of a highly available distributed system, one can use Amazon's EC2 to deploy highly available and resource-elastic services. As this is not always the most appropriate solution, one can instead set up one's own network and use, for example, the multiple open-source Hadoop technologies for reliable and scalable distributed systems. In the case of Wide-Area shared computing networks, however, solutions like the open-source UpRight may be more suitable, since it can be integrated either with Zookeeper or with Hadoop's distributed file system.

We have concluded that it is possible to provide highly available distributed systems in Wide-Area shared computing networks through the use of resource-aware replication [5] with reasonable results. Quorum sets help reduce the costs of replication, and Paxos-like algorithms can help tolerate Byzantine faults.

Finally, as an experiment, we replicated a simple application over a small network of PlanetLab PLE nodes using an active replication methodology. We found that even though a few nodes might fail at any given time, the application can still work without major issues.

References

[1] Nancy Lynch and Seth Gilbert, Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services, ACM SIGACT News, Volume 33, Issue 2, 2002.
[2] Mamoru Maekawa, Arthur E. Oldehoeft, Rodney R. Oldehoeft, Operating Systems: Advanced Concepts, Benjamin/Cummings Publishing Company, Inc., 1987.
[3] Nicolas Schiper, Sam Toueg, A Robust and Lightweight Stable Leader Election Service for Dynamic Systems, University of Lugano, 2008.
[4] V. Padhye, A. Tripathi, Building Autonomically Scalable Services on Wide-Area Shared Computing Platforms, Network Computing and Applications (NCA), 10th IEEE International Symposium, 2011.
[5] V. Padhye, A. Tripathi, D. Kulkarni, Resource-Aware Migratory Services in Wide-Area Shared Computing Environments, Reliable Distributed Systems (SRDS), 28th IEEE International Symposium, 2009.
[6] A. Tripathi, V. Padhye, Distributed Systems Research with Ajanta Mobile Agent Framework, 2002.
[7] Benjamin Reed, Flavio P. Junqueira, A simple totally ordered broadcast protocol, LADIS '08: Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware, 2008.
[8] Miguel Castro, Barbara Liskov, Practical Byzantine Fault Tolerance, Laboratory for Computer Science, Massachusetts Institute of Technology, 1999.
[9] W. Chen, S. Toueg, and M. K. Aguilera, On the quality of service of failure detectors, IEEE Transactions on Computers, 51(5):561-580, May 2002.
[10] Jay Kreps (LinkedIn), Getting Real About Distributed System Reliability.
[11] Jaksa, Active and Passive Replication in Distributed Systems, 2009.
[12] Werner Vogels, Amazon's Dynamo, 2007.
[13] Joydeep Sen Sarma, Dynamo: A flawed architecture, 2009.
[14] A. Rich, ZFS, Sun's cutting-edge file system, Technical report, Sun Microsystems, 2006.
[15] M. Burrows, The Chubby lock service for loosely-coupled distributed systems, OSDI, 2006.
[16] Apache, ZooKeeper, 2006.
[17] C. E. Killian, J. W. Anderson, R. Jhala, and A. Vahdat, Life, death, and the critical transition: Finding liveness bugs in systems code, NSDI, 2007.
[18] A. Clement et al., Life, death, and the critical transition: Finding liveness bugs in systems code, NSDI, 2007.
[19] Hadoop, Hadoop, 2007.
[20] A. Clement et al., UpRight Cluster Services, SOSP, 2009.
[21] Larry Peterson, Steve Muir, Timothy Roscoe, and Aaron Klingaman, PlanetLab Architecture: An Overview, Princeton University, 2006.
[22] Eric Brewer, CAP Twelve Years Later: How the "Rules" Have Changed, University of California, Berkeley, February 2012.
[23] Giovanni Di Stasi, Stefano Avallone, and Roberto Canonico, Integration of OMF-Based Testbeds in a Global-Scale Networking Facility, in N. Bartolini et al. (Eds.): QShine/AAA-IDEA, 2009.
[24] PlanetLab, PlanetLab.
[25] Amazon, Amazon EC2 Service Level Agreement, 2008.
[26] Charles Babcock, Amazon SLAs Didn't Cover Major Outage, InformationWeek, 2009.
[27] KyoungSoo Park, Vivek Pai, Larry Peterson and Aki Nakao, CoDeploy, Princeton.
[28] Leslie Lamport, A Document Preparation System: LaTeX User's Guide and Reference Manual, Addison-Wesley, Reading, MA, 2nd edition, 1994.
[29] Michel Goossens, Frank Mittelbach, and Alexander Samarin, The LaTeX Companion, Addison-Wesley, Reading, MA, 1994.