Vive La Différence: Paxos vs. Viewstamped Replication vs. Zab

Robbert van Renesse, Nicolas Schiper, and Fred B. Schneider, Fellow, IEEE

Abstract—Paxos, Viewstamped Replication, and Zab are replication protocols for high availability in asynchronous environments with crash failures. Claims have been made about their similarities and differences. But how does one determine whether two protocols are the same, and if not, how significant are the differences? We address these questions using refinement mappings. Protocols are expressed as succinct specifications that are progressively refined to executable implementations. Doing so enables a principled understanding of the correctness of design decisions for implementing the protocols. Additionally, differences that have a significant impact on performance are surfaced by this exercise.

Index Terms—Systems specification methodology, distributed systems, reliability


1 INTRODUCTION

A protocol expressed in terms of a state transition specification S refines another specification S′ if there exists a mapping of the state space of S to the state space of S′ and each state transition in S can be mapped to a state transition in S′ or to a no-op. This mapping between specifications is called refinement [1] or backward simulation [2]. If two protocols refine one another then we might argue that they are alike. But if they don't, how does one characterize the similarities and differences between two protocols?

We became interested in this question while comparing three replication protocols for high availability in asynchronous environments with crash failures.

- Paxos [3] is a state machine replication protocol [4], [5]. We consider a version of Paxos that uses the multi-decree Synod consensus algorithm described in [3], sometimes called Multi-Paxos. Implementations appear in Google's Chubby service [6], [7], Microsoft's Autopilot service [8] (used by Bing), and the Ceph distributed file system [9] (with interfaces now part of the standard Linux kernel).

- Viewstamped Replication (VSR) [10], [11] is a replication protocol originally targeted at replicating participants in a Two-Phase Commit (2PC) [12] protocol. VSR was used in the implementation of the Harp File System [13].

- ZooKeeper Atomic Broadcast (Zab) [14] is a replication protocol used for the ZooKeeper [15] configuration service. ZooKeeper was designed at Yahoo! and is now a popular open source product distributed by Apache.

These protocols seem to rely on many of the same principles. Citations [16], [17] assert that Paxos and Viewstamped Replication are "the same algorithm independently invented," "equivalent," or that "the view management protocols seem to be equivalent" [3]. The Zab paper [14] says that "Zab follows the abstract description of Paxos by Lampson [18]."

In this paper, we explore similarities and differences between these protocols using refinements. Refinement mappings induce an ordering relation on specifications; Fig. 1 shows a Hasse diagram of a set of eight specifications of interest, ordered by refinement. In Fig. 1, we write S′ → S if S refines S′, that is, if there exists a refinement mapping of S to S′.

We also give informal levels of abstraction in Fig. 1, ranging from a highly abstract specification of a linearizable service [19] to concrete, executable specifications. Active and passive replication are common approaches for replicating a service and ensuring that behaviors are still linearizable. Multi-Consensus protocols use a form of rounds in order to refine active and passive replication. Finally, we obtain protocols such as Paxos, VSR, and Zab.

Each refinement corresponds to a design decision, and as can be seen in Fig. 1, the same specification is derived by following different paths of such design decisions. There is a qualitative difference between refinements that cross abstraction boundaries in Fig. 1 and those that do not. When crossing an abstraction boundary, a refinement takes an abstract concept and replaces it with a more concrete one. For example, it may take an abstract decision and replace it by a majority of votes. Within the same abstraction, a refinement restricts behaviors. For example, one specification might decide commands out of order whereas a more restricted specification might decide them in order.

Using refinements, we identify and analyze commonalities and differences between the replication protocols. The refinements also suggest new variants of these protocols, characterizing conditions required for a linearizable service. Other publications have used refinements to specify replication protocols [18], [20], [21]. We employ refinement for comparing Paxos, Viewstamped Replication, and Zab.

The authors are with the Department of Computer Science, Cornell University, Ithaca, NY, 14853. E-mail: {rvr, fbs}@cs.cornell.edu, [email protected].

Manuscript received 30 Sept. 2013; revised 3 July 2014; accepted 16 Aug. 2014. Date of publication 7 Sept. 2014; date of current version 10 July 2015. Digital Object Identifier no. 10.1109/TDSC.2014.2355848.



This paper is organized as follows. Section 2 introduces state transition specifications for linearizable replication as well as for active and passive replication. Section 3 presents Multi-Consensus, a canonical protocol that generalizes Paxos, VSR, and Zab: it forms a basis for comparison. Progress summaries, a new class of invariants, enable constructive reasoning about why and how these protocols work. We show how passive replication protocols refine Multi-Consensus by adding prefix order (aka primary order), creating Multi-Consensus-PO. Section 4 presents implementation details of Paxos, VSR, and Zab that constitute the final refinement steps to executable protocols. Section 5 discusses the implications of identified differences on performance. Section 6 gives a short overview of the history of concepts used in these replication protocols, and Section 7 is a conclusion. A proof that active replication refines the linearizable service is presented in Section 2.2; the details of the other refinements are available online at the Computer Society Digital Library [22].

2 MASKING FAILURES

To improve the availability of a service, a common technique is replication. A consistency criterion defines expected responses to clients for concurrent operations. Ideally, the replication protocol ensures linearizability [19]: execution of concurrent client operations is equivalent to a sequential execution, where each operation is atomically performed at some point between its invocation and response.

2.1 Specification

We characterize linearizability by giving a state transition specification (see Specification 1). A specification defines states and gives legal transitions between states. A state is defined by a collection of variables and their current values. Transitions can involve parameters (listed in parentheses) that are bound within the defined scope. A transition definition gives a precondition and an action. If the precondition holds in a given state, then the transition is enabled in that state. The action relates the state after the transition to the state before. A transition is performed indivisibly, starting in a state satisfying the precondition. No two transitions are performed concurrently, and if multiple transitions are enabled simultaneously, then the choice of which transition to perform is unspecified.

Specification 1. Linearizable Service

var inputs_n, outputs_n, appState, invoked_clt, responded_clt
initially: appState = ⊥ ∧ inputs_n = outputs_n = ∅ ∧
           ∀clt : invoked_clt = responded_clt = ∅

interface transition invoke(clt, op):
  precondition:
    op ∉ invoked_clt
  action:
    invoked_clt := invoked_clt ∪ {op}
    inputs_n := inputs_n ∪ {(clt, op)}

internal transition execute(clt, op, result, newState):
  precondition:
    (clt, op) ∈ inputs_n ∧
    (result, newState) = nextState(appState, (clt, op))
  action:
    appState := newState
    outputs_n := outputs_n ∪ {((clt, op), result)}

interface transition response(clt, op, result):
  precondition:
    ((clt, op), result) ∈ outputs_n ∧ op ∉ responded_clt
  action:
    responded_clt := responded_clt ∪ {op}

There are interface variables and internal variables. Interface variables are subscripted with the location of the variable, which is either a process name or the network, n. Internal variables have no subscripts and their value is a function on the state of the underlying implementation. Specification Linearizable Service has the following variables:

- inputs_n: a set that contains (clt, op) messages sent by process clt. Here op is an operation invoked by clt.

- outputs_n: a set of (clt, op, result) messages sent by the service, containing the results of client operations that have been executed.

- appState: an internal variable containing the state of the application.

- invoked_clt: the set of operations invoked by process clt. This variable is maintained by clt itself.

- responded_clt: the set of completed operations, also maintained by clt.

There are interface transitions and internal transitions. Interface transitions model interactions with the environment, which consists of a collection of processes connected by a network. An interface transition is performed by the process that is identified by the first parameter to the transition. Interface transitions may not access internal variables. Internal transitions are performed by the service, and we demonstrate how this is done by implementing those transitions. The transitions of Specification Linearizable Service are:

- Interface transition invoke(clt, op) is performed when clt invokes operation op. Each operation is uniquely identified and can be invoked at most once by a client (enforced by the precondition). Adding (clt, op) to inputs_n models clt sending a message containing op to the service. The client records what operations it has invoked in invoked_clt.

Fig. 1. An ordering of refinement mappings and informal levels of abstraction.


- Transition execute(clt, op, result, newState) is an internal transition that is performed when the replicated service executes op for client clt. The application-dependent deterministic function nextState relates an application state and an operation from a client to a new application state and a result. Adding ((clt, op), result) to outputs_n models the service sending a response to clt.

- Interface transition response(clt, op, result) is performed when clt receives the response. The client keeps track of which operations have completed in responded_clt to prevent this transition from being performed more than once per operation. (A runnable sketch of these transitions follows this list.)
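To make the transition notation concrete, the following is a minimal Python sketch (our illustration, not code from the paper) of Specification 1 as guarded transitions: each method asserts its precondition and then performs its action atomically. The nextState function is a stand-in application, here a hypothetical key-value store chosen only for the example.

# Sketch of Specification 1 (Linearizable Service) as guarded transitions.
# next_state is a hypothetical application (a key-value store) for illustration.

def next_state(app_state, cmd):
    clt, op = cmd
    kind, key, *rest = op            # op = ("put", key, value) or ("get", key)
    state = dict(app_state or {})
    if kind == "put":
        state[key] = rest[0]
        return "ok", state
    return state.get(key), state     # "get"

class LinearizableService:
    def __init__(self):
        self.inputs, self.outputs = set(), set()
        self.app_state = None                        # ⊥
        self.invoked, self.responded = {}, {}        # per-client sets

    def invoke(self, clt, op):                       # interface transition
        assert op not in self.invoked.setdefault(clt, set())   # precondition
        self.invoked[clt].add(op)
        self.inputs.add((clt, op))

    def execute(self, clt, op):                      # internal transition
        assert (clt, op) in self.inputs              # precondition
        result, new_state = next_state(self.app_state, (clt, op))
        self.app_state = new_state
        self.outputs.add(((clt, op), result))
        return result

    def response(self, clt, op, result):             # interface transition
        assert ((clt, op), result) in self.outputs
        assert op not in self.responded.setdefault(clt, set())
        self.responded[clt].add(op)

svc = LinearizableService()
svc.invoke("c1", ("put", "x", 1))
res = svc.execute("c1", ("put", "x", 1))
svc.response("c1", ("put", "x", 1), res)

Any interleaving of such calls whose assertions pass corresponds to a legal behavior of the specification; the choice among enabled transitions is left unspecified, as in the paper.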

Specification Linearizable Service implies that a client cannot receive a response to an operation before invocation. However, the specification does allow each client operation to be executed an unbounded number of times. In an implementation, multiple executions could occur if the response to the client operation got lost by the network and the client retransmits its operation to the service. In practice, replicated services reduce or eliminate the probability that a client operation is executed more than once by keeping state about which operations have been executed. For example, a service could eliminate duplicate operations by attaching per-client sequence numbers to operations. In all of the specifications that follow, we omit logic for preventing operation duplication to simplify the presentation.

We make the following assumptions about interface transitions:

- Crash failures. A process follows its specification until it fails by crashing. Thereafter, it executes no transitions. Processes that never crash are called correct. A process that halts execution and later recovers and resumes execution is considered correct, albeit temporarily slow. A recovering process starts its execution exactly where it left it before halting, by relying on stable storage.

- Failure threshold. Bound f defines the maximum number of replica processes that may fail; the number of client processes that may fail is unbounded.

- Fairness. Except for interface transitions at a crashed process, a transition that becomes continuously enabled is eventually executed.

- Asynchrony. There is no bound on the delay before a continuously enabled transition is executed.

We use refinement to show only that design decisions are safe under the assumption of Crash Failures. Intermediate specifications we derive do not necessarily guarantee liveness properties.¹ To show that the final refinements support liveness properties such as "an operation issued by a correct client is eventually executed," we require the existence of the Ω failure detector (see Section 4.2).

2.2 Active and Passive Replication

Specification 1 has internal variables and transitions that have to be implemented. There are two well-known approaches to replication:

- With active replication, also known as state machine replication [4], [5], each replica implements a deterministic state machine. All replicas process the same operations in the same order.

- With passive replication, also known as primary backup [23], a primary replica runs a deterministic state machine, while backups only store copies of the primary replica's state. The primary computes a sequence of new application states by processing operations and forwards these states to each backup, in order of generation.

Fig. 2a illustrates a failure-free execution of a service implemented using active replication.

1) Clients submit operations to the service (op1 and op2 in Fig. 2a).
2) Replicas, starting out in the same state, execute received client operations in the same order.
3) Replicas send responses to the clients. Clients ignore all but the first response they receive.

The tricky part of active replication is ensuring that replicas execute operations in the same order, despite replica failures, message loss, and unpredictable delivery and processing delays. A fault-tolerant consensus protocol [24] is typically employed so that replicas will agree on the i-th operation for each i. Specifically, each replica proposes an operation that was received from one of the clients in instance i of the consensus protocol. Only one of the proposed operations can be decided. The service remains available provided each instance of consensus eventually terminates.

Fig. 2b depicts a failure-free execution of passive replication:

1) Clients submit operations only to the primary.
2) The primary orders operations and computes new states and responses.

Fig. 2. A failure-free execution of active and passive replication.

1. State transition specifications may include supplementary liveness conditions. If so, a specification S that refines a specification S′ preserves both the safety and liveness properties of S′.


3) The primary forwards new states (so-called state updates) to each backup in the order generated.
4) The primary sends the response from an operation only after the corresponding state update has been successfully decided (this is made precise in Section 3).

Because two primaries may compete to have their state updates applied at the backups, replicas apply a state update u on the same state used by the primary to compute u. This is sometimes called the prefix order or primary order property [14].

For example, consider a replicated integer variable with initial value 3. One client wants to increment the variable, while the other wants to double it. One primary receives both operations and submits state updates 4 followed by 8. Another primary receives the operations in the opposite order and submits updates 6 followed by 7. Without prefix ordering, it may happen that the decided states are 4 followed by 7, not corresponding to any sequential history of the two operations.
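The scenario can be replayed concretely. The following small Python sketch (an illustration built on the paper's example, not protocol code) shows how deciding speculative states from two competing primaries without checking the predecessor state yields a history that no sequential execution produces, and how a prefix-order check rules it out.

# Prefix-ordering example: a replicated integer, initially 3.
# Primary A applies increment then double; primary B applies double then increment.
inc = lambda x: x + 1
dbl = lambda x: x * 2

# Each primary speculatively proposes (old_state, new_state) per slot.
proposals_A = [(3, inc(3)), (inc(3), dbl(inc(3)))]   # slot 1: 3->4, slot 2: 4->8
proposals_B = [(3, dbl(3)), (dbl(3), inc(dbl(3)))]   # slot 1: 3->6, slot 2: 6->7

# Without prefix ordering, slot 1 might decide A's state and slot 2 B's state:
decided = [proposals_A[0][1], proposals_B[1][1]]
assert decided == [4, 7]   # corresponds to no sequential history of {inc, dbl} from 3

# With prefix ordering, slot 2 may only decide a proposal whose old state
# equals the state decided in slot 1:
slot1 = proposals_A[0][1]                            # 4
slot2_candidates = [new for old, new in (proposals_A[1], proposals_B[1]) if old == slot1]
assert slot2_candidates == [8]                       # only A's second proposal qualifies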

VSR and Zab employ passive replication; Paxos employs active replication. However, it is possible to implement one approach on the other. The Harp file system [13], for example, uses VSR to implement a replicated message queue containing client operations: the Harp primary proposes state updates that backups apply to the state of the message queue. Replicas, running deterministic NFS state machines, then execute NFS operations in queue order. In other words, Harp uses an active replication protocol built using a message queue that is passively replicated using VSR. In doing so, the existing NFS servers do not have to be modified.

2.3 Refinement

Below, we present a refinement of a linearizable service (Specification 1) using active replication. We then further refine active replication to obtain passive replication. The refinement of a linearizable service to passive replication follows transitively.

2.3.1 Active Replication

We omit interface transitions invoke(clt, op) and response(clt, op, result), which are the same as in Specification 1. Hereafter, a command cmd denotes a pair (clt, op).

Specification 2 uses a sequence of slots. A replica executes transition propose(replica, slot, cmd) to propose a command cmd for slot slot. We call the command a proposal. Transition decide(slot, cmd) guarantees that at most one proposal is decided for slot. Transition learn(replica, slot) models a replica learning a decision and assigning that decision to learned_replica[slot]. Replicas update their state by executing a learned operation in increasing order of slot number with transition update(replica, cmd, res, newState). The slot of the next operation to execute is denoted by version_replica.

Note that propose(replica, slot, cmd) requires that replica has not yet learned a decision for slot. While not necessary for safety, proposing a command for a slot that is known to be decided is wasted effort. It would make sense to require that replicas propose for the smallest value of slot where both proposals_replica[slot] = ∅ and learned_replica[slot] = ⊥. We do not require this only to simplify the refinement mapping between active and passive replication.

Specification 2. Specification Active Replication

var proposals_replica[1...], decisions[1...], learned_replica[1...],
    appState_replica, version_replica, inputs_n, outputs_n
initially:
  ∀s ∈ N+ : decisions[s] = ⊥ ∧
  ∀replica : appState_replica = ⊥ ∧ version_replica = 1 ∧
             ∀s ∈ N+ : proposals_replica[s] = ∅ ∧ learned_replica[s] = ⊥

interface transition propose(replica, slot, cmd):
  precondition:
    cmd ∈ inputs_n ∧ learned_replica[slot] = ⊥
  action:
    proposals_replica[slot] := proposals_replica[slot] ∪ {cmd}

internal transition decide(slot, cmd):
  precondition:
    decisions[slot] = ⊥ ∧ ∃r : cmd ∈ proposals_r[slot]
  action:
    decisions[slot] := cmd

internal transition learn(replica, slot):
  precondition:
    learned_replica[slot] = ⊥ ∧ decisions[slot] ≠ ⊥
  action:
    learned_replica[slot] := decisions[slot]

interface transition update(replica, cmd, res, newState):
  precondition:
    cmd = learned_replica[version_replica] ∧ cmd ≠ ⊥ ∧
    (res, newState) = nextState(appState_replica, cmd)
  action:
    outputs_n := outputs_n ∪ {(cmd, res)}
    appState_replica := newState
    version_replica := version_replica + 1

To show that active replication refines Specification 1, we first show how the internal state of Specification 1 is derived from the state of Specification 2. The internal state in Specification 1 is appState. For our refinement mapping, its value is the copy of the application state that is being maintained by the replica (or one of the replicas) with the highest version number.

To complete the refinement mapping, we also must show how transitions of active replication map onto enabled transitions of Specification 1, or onto stutter steps (no-ops with respect to Specification 1). The propose, decide, and learn transitions are always stutter steps, because they do not update appState_replica of any replica. An update(replica, cmd, res, newState) transition corresponds to execute(clt, op, res, newState) in Specification 1, where cmd = (clt, op) and replica is the first replica to apply op. Transition update is a stutter step if the executing replica is not the first to apply the update.
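As an illustration of the state part of this mapping (a sketch, not the proof from the online appendix), the abstract appState can be computed as a function of the Specification 2 variables; the dictionary layout used below is an assumption made for the example.

# Sketch of the state part of the refinement mapping from Specification 2
# (active replication) to Specification 1: appState is the application state
# of a replica with the highest version number.

def abstract_app_state(app_state, version):
    """app_state, version: dicts mapping replica id -> appState_replica / version_replica."""
    best = max(version, key=version.get)      # a replica with the highest version
    return app_state[best]

# Example: replica r2 has executed more decided commands than r1 and r3.
version   = {"r1": 3, "r2": 5, "r3": 4}
app_state = {"r1": "s3", "r2": "s5", "r3": "s4"}
assert abstract_app_state(app_state, version) == "s5"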

2.3.2 Passive Replication

Passive replication (Specification 3) also uses slots, and proposals are tuples (oldState, (cmd, res, newState)) consisting of the state prior to executing a command, a command, the output of executing the command, and a new state that results from applying the command. In an actual implementation, the old state and new state would each be represented by an identifier along with a state update, rather than by the entire value of the state.

Specification 3. Specification Passive Replication

var proposals_replica[1...], decisions[1...], learned_replica[1...],
    appState_replica, version_replica, inputs_n, outputs_n,
    shadowState_replica, shadowVersion_replica
initially:
  ∀s ∈ N+ : decisions[s] = ⊥ ∧
  ∀replica : appState_replica = ⊥ ∧ version_replica = 1 ∧
             shadowState_replica = ⊥ ∧ shadowVersion_replica = 1 ∧
             ∀s ∈ N+ : proposals_replica[s] = ∅ ∧ learned_replica[s] = ⊥

interface transition propose(replica, slot, cmd, res, newState):
  precondition:
    cmd ∈ inputs_n ∧ slot = shadowVersion_replica ∧
    learned_replica[slot] = ⊥ ∧
    (res, newState) = nextState(shadowState_replica, cmd)
  action:
    proposals_replica[slot] := proposals_replica[slot] ∪
        {(shadowState_replica, (cmd, res, newState))}
    shadowState_replica := newState
    shadowVersion_replica := slot + 1

internal transition decide(slot, cmd, res, newState):
  precondition:
    decisions[slot] = ⊥ ∧
    ∃r, s : (s, (cmd, res, newState)) ∈ proposals_r[slot] ∧
            (slot > 1 ⇒ decisions[slot - 1] = (-, -, s))
  action:
    decisions[slot] := (cmd, res, newState)

internal transition learn(replica, slot):
  precondition:
    learned_replica[slot] = ⊥ ∧ decisions[slot] ≠ ⊥
  action:
    learned_replica[slot] := decisions[slot]

interface transition update(replica, cmd, res, newState):
  precondition:
    (cmd, res, newState) = learned_replica[version_replica]
  action:
    outputs_n := outputs_n ∪ {(cmd, res)}
    appState_replica := newState
    version_replica := version_replica + 1

internal transition resetShadow(replica, version, state):
  precondition:
    version ≥ version_replica ∧
    (version = version_replica ⇒ state = appState_replica)
  action:
    shadowState_replica := state
    shadowVersion_replica := version

Any replica can act as primary. Primaries act speculatively, computing a sequence of states before they are decided. Because of this, primaries maintain a separate copy of the application state to which they apply the speculative updates. We call this copy the shadow state. Primaries may propose different state updates for the same slot.

Transition propose(replica, slot, cmd, res, newState) is performed when a primary replica proposes applying cmd to shadowState_replica for slot, resulting in output res. State shadowState_replica is what the primary calculated for the previous slot (even though that state is not necessarily decided yet, and it may never be decided). Proposals for a slot are stored in a set, since a primary may propose to apply different commands for the same slot if there are repeated changes of primaries.

Transition decide(slot, cmd, res, newState) specifies that only one of the proposed new states can be decided. Because cmd was performed speculatively, the decide transition checks that the state decided in the prior slot, if any, matches state s to which replica r applied cmd; prefix ordering is ensured.

Similar to active replication, transition learn models a replica learning the decision associated with a slot. With the update transition, a replica updates its state based on what was learned for the slot. With active replication, each replica performs each client operation; in passive replication, only the primary performs client operations and backups simply obtain the resulting states.

Replicas that are experiencing unexpected delay (e.g., because a failure detector suspects that the primary has crashed) can start acting as primary by performing transition resetShadow to update their speculative state and version, respectively stored in variables shadowState_replica and shadowVersion_replica. The new shadow state may itself be speculative; it must be a version at least as recent as the latest learned state.
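A minimal Python sketch of the decide transition of Specification 3, including its prefix-order check, is given below; the data layout (dictionaries keyed by slot and replica) is an assumption made for illustration, not part of the specification.

# Sketch of the decide transition of Specification 3 (passive replication).
# decisions[slot] holds (cmd, res, newState); a proposal is
# (oldState, (cmd, res, newState)). The prefix-order check requires that the
# old state the primary used matches the new state decided in the prior slot.

def decide(decisions, proposals, slot, proposal):
    old_state, (cmd, res, new_state) = proposal
    assert decisions.get(slot) is None                       # not yet decided
    assert any(proposal in props.get(slot, set())
               for props in proposals.values())              # proposed by some replica
    if slot > 1:                                             # prefix ordering
        assert decisions[slot - 1][2] == old_state
    decisions[slot] = (cmd, res, new_state)

# Toy run with one primary ("p") proposing for slots 1 and 2:
decisions = {}
proposals = {"p": {1: {(3, ("inc", "ok", 4))}, 2: {(4, ("dbl", "ok", 8))}}}
decide(decisions, proposals, 1, (3, ("inc", "ok", 4)))
decide(decisions, proposals, 2, (4, ("dbl", "ok", 8)))       # allowed: old state 4 was decided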

3 A GENERIC PROTOCOL

Specifications 2 and 3 contain internal variables and transitions that need to be refined for an executable implementation. We start by refining active replication. Multi-Consensus (Specification 4) refines active replication and contains no internal variables or transitions. As before, invoke and response transitions (and corresponding variables) have been omitted; they are the same as in Specification 1. Transitions propose and update of Specification 2 are omitted for the same reason. In this section we explain how and why Multi-Consensus works. Details of the refinement mappings are available online at the Computer Society Digital Library [22].

3.1 Certifiers and Rounds

Multi-Consensus has two basic building blocks:

- A static set of n processes called certifiers. A minority of these may crash. So for tolerating at most f faulty processes, we require that n ≥ 2f + 1 must hold.

- An unbounded number of rounds.²

In each round, Multi-Consensus assigns to at most one certifier the role of sequencer. The sequencer of a round certifies at most one command for each slot. Other certifiers can imitate the sequencer, certifying the same command for the same slot and round; if two certifiers certify a command in the same slot and the same round, then it must be the same command. Moreover, a certifier cannot retract a certification. Once a majority of certifiers certify the command within a round, the command is decided (and because certifications cannot be retracted the command will remain decided thereafter). In Section 3.4 we show why two rounds cannot decide different commands for the same slot.

2. Table 1 translates between terms used in this paper and those found in the papers describing the protocols under consideration.

Each round has a round-id that uniquely identifies the round. Rounds are totally ordered by round-ids. A round is in one of three modes: pending, operational, or wedged. One round is the first round (it has the smallest round-id), and initially only that round is operational. Other rounds initially are pending. Two possible transitions on the mode of a round are:

1) A pending round can become operational only if all rounds with lower round-id are wedged;
2) A pending or operational round can become wedged under any circumstance.

This implies that at any time at most one round is operational and that wedged rounds can never become unwedged.

3.2 Tracking Progress

In Specification 4, each certifier cert maintains a progress summary progrsum_cert[slot] for each slot, defined as:

Progress Summaries. A progress summary is a pair ⟨rid, cmd⟩ where rid is the identifier of a round and cmd is a proposed command or ⊥, satisfying:

- If cmd = ⊥, then the progress summary guarantees that no round with id less than rid can ever decide, or have decided, a proposal for the slot.

- If cmd ≠ ⊥, then the progress summary guarantees that if a round with id rid′ such that rid′ ≤ rid decides (or has decided) a proposal cmd′ for the slot, then cmd = cmd′.

- Given two progress summaries ⟨rid, cmd⟩ and ⟨rid, cmd′⟩ for the same slot, if neither cmd nor cmd′ equals ⊥, then cmd = cmd′.

We define a total ordering as follows: ⟨rid′, cmd′⟩ ≻ ⟨rid, cmd⟩ for the same slot iff

- rid′ > rid; or
- rid′ = rid ∧ cmd′ ≠ ⊥ ∧ cmd = ⊥.

At any certifier, the progress summary for a slot is monotonically non-decreasing.
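A small Python sketch of this ordering (with None standing in for ⊥) may make it easier to follow the recovery reasoning later; it is an illustration, not part of the specification.

# Sketch of the ordering on progress summaries <rid, cmd> (None stands for ⊥).
# <rid2, cmd2> dominates <rid1, cmd1> iff rid2 > rid1, or
# rid2 == rid1 and cmd2 is a command while cmd1 is ⊥.

def dominates(p2, p1):
    (rid2, cmd2), (rid1, cmd1) = p2, p1
    return rid2 > rid1 or (rid2 == rid1 and cmd2 is not None and cmd1 is None)

def max_summary(summaries):
    """Pick a maximal progress summary out of a non-empty collection."""
    best = summaries[0]
    for p in summaries[1:]:
        if dominates(p, best):
            best = p
    return best

assert dominates((3, None), (2, "a"))             # higher round-id wins
assert dominates((2, "a"), (2, None))             # same round: a command beats ⊥
assert max_summary([(2, "a"), (3, None), (2, None)]) == (3, None)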

Specification 4. Multi-Consensus

var rid_cert, isSeq_cert, progrsum_cert[1...], certifics_n, snapshots_n
initially:
  certifics_n = snapshots_n = ∅ ∧
  ∀cert : rid_cert = 0 ∧ isSeq_cert = false ∧
          ∀slot ∈ N+ : progrsum_cert[slot] = ⟨0, ⊥⟩

interface transition certifySeq(cert, slot, ⟨rid, cmd⟩):
  precondition:
    isSeq_cert ∧ rid = rid_cert ∧ progrsum_cert[slot] = ⟨rid, ⊥⟩ ∧
    (∀s ∈ N+ : progrsum_cert[s] = ⟨rid, ⊥⟩ ⇒ s ≥ slot) ∧
    ∃replica : cmd ∈ proposals_replica[slot]
  action:
    progrsum_cert[slot] := ⟨rid, cmd⟩
    certifics_n := certifics_n ∪ {(cert, slot, ⟨rid, cmd⟩)}

interface transition certify(cert, slot, ⟨rid, cmd⟩):
  precondition:
    ∃cert′ : (cert′, slot, ⟨rid, cmd⟩) ∈ certifics_n ∧
    rid_cert = rid ∧ ⟨rid, cmd⟩ ≻ progrsum_cert[slot]
  action:
    progrsum_cert[slot] := ⟨rid, cmd⟩
    certifics_n := certifics_n ∪ {(cert, slot, ⟨rid, cmd⟩)}

interface transition observeDecision(replica, slot, cmd):
  precondition:
    ∃rid : |{cert | (cert, slot, ⟨rid, cmd⟩) ∈ certifics_n}| > n/2 ∧
    learned_replica[slot] = ⊥
  action:
    learned_replica[slot] := cmd

interface transition supportRound(cert, rid, proseq):
  precondition:
    rid > rid_cert
  action:
    rid_cert := rid; isSeq_cert := false
    snapshots_n := snapshots_n ∪ {(cert, rid, proseq, progrsum_cert)}

interface transition recover(cert, rid, S):
  precondition:
    rid_cert = rid ∧ ¬isSeq_cert ∧ |S| > n/2 ∧
    S ⊆ {(id, prog) | (id, rid, cert, prog) ∈ snapshots_n}
  action:
    ∀s ∈ N+ :
      ⟨r, cmd⟩ := max_≻ {prog[s] | (id, prog) ∈ S}
      progrsum_cert[s] := ⟨rid, cmd⟩
      if cmd ≠ ⊥ then
        certifics_n := certifics_n ∪ {(cert, s, ⟨rid_cert, cmd⟩)}
    isSeq_cert := true

TABLE 1
Translation between Terms Used in This Paper and in the Various Replication Protocols under Consideration

Our term    | Paxos [3]     | VSR [10]     | Zab [14]           | meaning
replica     | learner       | cohort       | server/observer    | stores copy of application state
certifier   | acceptor      | cohort       | server/participant | maintains consensus state
sequencer   | leader        | primary      | leader             | certifier that proposes orderings
round       | ballot        | view         | epoch              | round of certification
round-id    | ballot number | view-id      | epoch number       | uniquely identifies a round
normal case | phase 2       | normal case  | normal case        | processing in the absence of failures
recovery    | phase 1       | view change  | recovery           | protocol to establish a new round
command     | proposal      | event record | transaction        | a pair of a client id and an operation to be performed
round-stamp | N/A           | viewstamp    | zxid               | uniquely identifies a sequence of proposals


3.3 Normal Case Processing

Each certifier cert supports exactly one round-id rid_cert, initially 0. The normal case holds when a majority of certifiers support the same round-id, and one of these certifiers is sequencer (signified by isSeq_cert being true).

Transition certifySeq(cert, slot, ⟨rid, cmd⟩) is performed when sequencer cert certifies command cmd for the given slot and round. The condition progrsum_cert[slot] = ⟨rid, ⊥⟩ holds only if no command can be decided in this slot by a round with an id lower than rid_cert. The transition requires that slot is the lowest empty slot of the sequencer. If the transition is performed, then cert updates progrsum_cert[slot] to reflect that a command decided in this round must be cmd. Sequencer cert also notifies all other certifiers by adding (cert, slot, ⟨rid, cmd⟩) to set certifics_n (modeling a broadcast to the certifiers).

A certifier that receives such a message checks whether the message contains the same round-id that it is currently supporting and whether the progress summary in the message exceeds its own progress summary for the same slot. If so, then the certifier updates its own progress summary and certifies the proposed command (transition certify(cert, slot, ⟨rid, cmd⟩)).

Transition observeDecision(replica, slot, cmd) at replica is enabled if a majority of certifiers in the same round have certified cmd in slot. If so, then the command is decided and, as explained in the next section, all replicas that undergo the observeDecision transition for this slot will decide on the same command. Indeed, internal variable decisions[slot] of Specification 2 is defined to be cmd if each certifier cert in a majority has certified (cert, slot, ⟨rid, cmd⟩) for some rid (and ⊥ if no such rid exists). The details of the refinement mapping are available at the Computer Society Digital Library [22].
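As an illustration, the following Python sketch (not part of the specification) shows the observeDecision check: given the set certifics_n of certifications, a command is decided for a slot once more than n/2 certifiers have certified it in the same round.

# Sketch of the observeDecision check over the (global) set of certifications.

def decided_command(certifics, n, slot):
    """certifics: set of (cert, slot, (rid, cmd)) tuples, as in Specification 4."""
    votes = {}                                    # (rid, cmd) -> set of certifiers
    for cert, s, (rid, cmd) in certifics:
        if s == slot:
            votes.setdefault((rid, cmd), set()).add(cert)
    for (rid, cmd), certs in votes.items():
        if len(certs) > n // 2:                   # strict majority within one round
            return cmd
    return None                                   # ⊥: no decision observable yet

n = 3
certifics = {("c1", 7, (2, "op")), ("c2", 7, (2, "op")), ("c3", 7, (1, "other"))}
assert decided_command(certifics, n, 7) == "op"   # two of three certified in round 2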

3.4 Recovery

When an operational round is no longer certifying proposals, perhaps because its sequencer has crashed or is slow, the round can be wedged and a round with a higher round-id can become operational. A certifier cert transitions to supporting a new round-id rid and prospective sequencer proseq (transition supportRound(cert, rid, proseq)). This transition only increases rid_cert. The transition sends the certifier's snapshot by adding it to the set snapshots_n. A snapshot is a four-tuple (cert, rid, proseq, progrsum_cert) containing the certifier's identifier, its current round-id, the identifier of proseq, and the certifier's list of progress summaries. Note that a certifier can send at most one snapshot for each round.

Round rid with sequencer proseq is operational, by definition, if a majority of certifiers support rid and added (cert, rid, proseq, progrsum_cert) to set snapshots_n. Clearly, the majority requirement guarantees that there cannot be two rounds simultaneously operational, nor can there be operational rounds that do not have exactly one sequencer. Certifiers that support rid can no longer certify commands in rounds prior to rid. Consequently, if a majority of certifiers support a round-id larger than x, then all rounds with an id of x or lower are wedged.

Transition recover(cert, rid, S) is enabled at cert if S contains snapshots for rid and sequencer cert from a majority of certifiers. The sequencer helps ensure that the round does not decide commands inconsistent with prior rounds. For each slot, sequencer cert determines the maximum progress summary ⟨r, cmd⟩ for the slot in the snapshots contained in S. It then sets its own progress summary for the slot to ⟨rid, cmd⟩. It is easy to see that rid ≥ r.

We argue that ⟨rid, cmd⟩ satisfies the definition of progress summary in Section 3.2. All certifiers in S support rid and form a majority. Thus, it is not possible for any round between r and rid to decide a command, because none of these certifiers can certify a command in those rounds. There are two cases:

- If cmd = ⊥, then no command can be decided before r, so no command can be decided before rid. Hence, ⟨rid, ⊥⟩ is a correct progress summary.

- If cmd ≠ ⊥, then a command decided by r or a round prior to r must be cmd. Since no command can be decided by rounds between r and rid, progress summary ⟨rid, cmd⟩ is correct.

The sequencer sets its isSeq_cert flag upon recovery. As a result, the sequencer can propose new commands. The normal case for the round begins and holds as long as a majority of certifiers support the corresponding round-id.
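The following Python sketch (an illustration under assumed data structures, not the specification itself) shows the core of the recover transition: per slot, the new sequencer takes the maximal progress summary over the snapshots of a majority, re-stamps it with its own round-id, and re-certifies any non-⊥ command.

# Sketch of the recover transition. A snapshot is modeled as a dict slot -> (round-id, cmd).

def dominates(p2, p1):                            # same ordering as in Section 3.2
    (rid2, cmd2), (rid1, cmd1) = p2, p1
    return rid2 > rid1 or (rid2 == rid1 and cmd2 is not None and cmd1 is None)

def recover(rid, snapshots):
    """snapshots: list of per-certifier snapshots from a majority supporting round rid."""
    slots = set().union(*(s.keys() for s in snapshots))
    progrsum, certifics = {}, []
    for slot in slots:
        best = (0, None)                          # ⟨0, ⊥⟩
        for snap in snapshots:
            p = snap.get(slot, (0, None))
            if dominates(p, best):
                best = p
        _, cmd = best
        progrsum[slot] = (rid, cmd)               # re-stamp with the new round-id
        if cmd is not None:
            certifics.append((slot, (rid, cmd)))  # re-certify possibly decided commands
    return progrsum, certifics

snaps = [{1: (2, "a"), 2: (2, None)}, {1: (1, "a"), 2: (3, "b")}]
progrsum, certifics = recover(4, snaps)
assert progrsum == {1: (4, "a"), 2: (4, "b")}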

3.5 Passive Replication

Specification 4 does not satisfy prefix ordering (Section 2.2) because any proposal can be decided in a slot, including a proposal that does not correspond to the state decided in the prior slot. Thus Multi-Consensus does not refine Passive Replication. One way of implementing prefix ordering would be for the primary to delay proposing a command for a slot until it knows decisions for all prior slots. But that would be slow.

A better solution is to refine Multi-Consensus and obtain a specification that also refines Passive Replication as well as satisfying prefix ordering. We call this specification Multi-Consensus-PO. Multi-Consensus-PO guarantees that each decision is the result of an operation applied to the state decided in the prior slot (except for the first slot). We complete the refinement by adding two preconditions:³

i) In Multi-Consensus-PO, slots have to be decided in sequence. To guarantee this, commands are certified in order by adding the following precondition to transition certify: slot > 1 ⇒ ∃c ≠ ⊥ : progrsum_cert[slot − 1] = ⟨rid, c⟩. Thus, if in some round there exists a majority of certifiers that have certified a command in slot, then there also exists a majority of certifiers that have certified a command in the prior slot.

3. These two preconditions, combined with the condition that isSeq_cert must hold in transition certifySeq, serve as the primary order and integrity properties defined in [25] to implement primary-backup replication. The condition isSeq_cert is equivalent to the barrier function tPaxos used in the same paper to implement primary-backup on top of a white-box consensus service.

478 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 12, NO. 4, JULY/AUGUST 2015

ii) To guarantee that a decision in slot is based on the state decided in the prior slot, we add the following precondition to transition certifySeq: slot > 1 ⇒ ∃s : cmd = (s, −) ∧ progrsum_cert[slot − 1] = ⟨rid, (−, (−, −, s))⟩. This works due to properties of progress summaries: if a command has been or will be decided in round rid or a prior round for slot − 1, then it is the command in progrsum_cert[slot − 1]. Therefore, if the sequencer's proposal for slot is decided in round rid, the proposal will be based on the state decided in slot − 1. If the primary and the sequencer are co-located, as they usually are, this transition precondition is satisfied automatically, because the primary computes states in order.
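As an illustration, the two added preconditions can be written as predicates over a certifier's progress summaries; the Python sketch below assumes the passive-replication command layout (oldState, (op, result, newState)) and is not part of the specification.

# progrsum is a certifier's map slot -> (rid, cmd), where a passive-replication
# command has the form (oldState, (op, result, newState)).

def certify_in_order(progrsum, slot, rid):
    """Precondition (i): only certify slot if slot-1 already holds a command of round rid."""
    if slot == 1:
        return True
    prev = progrsum.get(slot - 1)
    return prev is not None and prev[0] == rid and prev[1] is not None

def builds_on_prior_state(progrsum, slot, rid, cmd):
    """Precondition (ii): the sequencer's proposal must start from the state
    certified for slot-1 in this round."""
    if slot == 1:
        return True
    old_state = cmd[0]
    prev = progrsum.get(slot - 1)
    if prev is None or prev[0] != rid or prev[1] is None:
        return False
    _, (_op, _res, prev_new_state) = prev[1]
    return prev_new_state == old_state

progrsum = {1: (2, (3, ("inc", "ok", 4)))}
assert certify_in_order(progrsum, 2, 2)
assert builds_on_prior_state(progrsum, 2, 2, (4, ("dbl", "ok", 8)))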

Multi-Consensus-PO inherits transitions invoke and response from Specification 1 as well as transitions propose, update, and resetShadow from Specification 3. The variables contained in these transitions are inherited as well.

Passive replication protocols VSR and Zab share the following design decision in the recovery procedure: The sequencer broadcasts a single message containing its entire snapshot rather than sending separate certifications for each slot. Certifiers wait for this message and overwrite their own snapshot with its contents before they certify new commands in this round. As a result, these progrsum_cert slots have the same round identifier at each certifier and can thus be maintained as a separate variable.

4 IMPLEMENTATION

Specifications Multi-Consensus and Multi-Consensus-PO do not contain internal variables or transitions. However, they only specify which transitions may be performed, not when transitions should be performed. We discuss final refinements of these specifications that produce Paxos, VSR, and Zab.

4.1 Normal Case

We first turn to implementing state and transitions of Multi-Consensus and Multi-Consensus-PO. The first question is how to implement the variables. Variables inputs_n, outputs_n, certifics_n, and snapshots_n are not per-process but global. They model messages that have been sent. In actual protocols, this state is implemented by the network: a value in either set is implemented by a message on the network tagged with the appropriate type, such as snapshot.

The remaining variables are all local to a process such as a client, a replica, or a certifier. This state can be implemented as ordinary program variables. In Zab, progrsum_cert is implemented by a queue of commands. In VSR, progrsum_cert is replaced by the application state and a counter that records the number of updates made to the state in this round. In Paxos, a progress summary is simply a pair consisting of a round identifier and a command.

Fig. 3 illustrates normal case processing in the protocols. The figure shows three certifiers (f = 1). Upon receiving an operation from a client (not shown):

1) Sequencer cert proposes a command for the next open slot and sends a message to the other certifiers (maps to certifySeq(cert, slot, ⟨rid, cmd⟩)). In VSR and Zab, the command is a state update that results from executing the client operation; in Paxos, the command is the operation itself.
2) Upon receipt by a certifier cert, if cert supports the round-id rid in the message, then cert updates its slot and replies to the sequencer (transition certify(cert, slot, ⟨rid, cmd⟩)). If not, the message can be ignored. With VSR and Zab, prefix-ordering must be ensured, and cert only replies to the sequencer if its progress summary for slot − 1 contains a non-empty command for rid.
3) If the sequencer receives successful responses from a majority of certifiers (transition observeDecision), then the sequencer learns the decision and broadcasts a decide message for the command to the replicas (resulting in learn transitions that update the replicas, see Specifications 2 and 3).

The various protocols reflect other different design decisions:

- In VSR, a specific majority of certifiers is determined a priori and fixed for each round. We call this a designated majority. In Paxos and Zab, any certifier can certify proposals.

- In VSR, replicas are co-located with certifiers, and certifiers speculatively update their local replica as part of certification. A replica could be updated before some proposed command is decided, so if another command is decided then the state of the replica must be rolled back (as we shall see later). Upon learning that the command has been decided (Step 3), the sequencer responds to the client.

- Some versions of Paxos use leases [3], [26] for read-only operations. Leases have the advantage that read-only operations can be served at a single replica while still guaranteeing linearizability. This requires synchronized clocks (or clocks with bounded drift), and the sequencer obtains a lease for a certain time period. A sequencer that is holding a lease can thus forward read-only operations to any replica, inserting the operation in the ordered stream of commands sent to that replica.

- ZooKeeper, which is built upon Zab, offers the option to use leasing or to have any replica handle read-only operations individually, circumventing Zab. The latter is efficient, but a replica might not have learned the latest decided proposals so its clients can receive results based on stale state (such reads satisfy sequential consistency [27]).

Fig. 3. Normal case processing at three certifiers. Dots indicate transitions, and arrows between certifiers are messages.


For replicas to learn about decisions, two options exist:

- Certifiers can respond to the sequencer. The sequencer learns that its proposed command has been decided if the sequencer receives responses from a majority (counting itself). The sequencer then notifies the replicas.

- Certifiers can broadcast notifications to all replicas, and each replica can individually determine if a majority of the certifiers have certified a particular command.

There is a trade-off between the two options: with n certifiers and m replicas, the first approach requires n + m messages and two network latencies, whereas the second approach requires nm messages but involves only one network latency. For example, with n = 3 certifiers and m = 3 replicas, the first approach sends 6 messages over two latencies while the second sends 9 messages in a single latency. All implementations we know of use the first approach.

4.2 Recovery

Fig. 4 illustrates the recovery steps in the protocols. With Paxos, a certifier proseq that notices a lack of progress (typically the certifier designated by a weak leader election protocol Ω [28]) may start the following process to try to become sequencer itself (see Fig. 4a):

- Step 1: Prospective sequencer proseq supports a new round rid, proposing itself as sequencer (transition supportRound(proseq, rid, proseq)), and queries at least a majority of certifiers.

- Step 2: Upon receipt, a certifier cert that transitions to supporting round rid and certifier proseq as sequencer (supportRound(cert, rid, proseq)) responds with its snapshot.

- Step 3: Upon receiving responses from a majority, certifier proseq learns that it is sequencer of round rid (transition recover(proseq, rid, S)). Normal case operation resumes after proseq broadcasts the command with the highest rid for each slot not known to be decided.

Prospective sequencers in Zab are designated by Ω as well. When Ω determines that a sequencer has become unresponsive, it initiates a protocol (see Fig. 4b) in which a new sequencer is elected:

- Step 0: Ω proposes a prospective sequencer proseq and notifies the certifiers. Upon receipt, a certifier sends a message containing the round-id it supports to proseq.

- Step 1: Upon receiving such messages from a majority of certifiers, prospective sequencer proseq selects a round-id rid that is one larger than the maximum it received, transitions to supporting it (transition supportRound(proseq, rid, proseq)), and broadcasts this to the other certifiers for approval.

- Step 2: Upon receipt, if certifier cert can support rid and has not agreed to a certifier other than proseq becoming sequencer of the round, then it performs transition supportRound(cert, rid, proseq). Zab exploits prefix ordering to optimize the recovery protocol. Instead of sending its entire snapshot to the prospective sequencer, a certifier that transitions to supporting round rid sends a round-stamp. A round-stamp is a lexicographically ordered pair consisting of the round-id in the snapshot (the same for all slots) and the number of slots in the round for which it has certified a command (a round-stamp comparison is sketched after this list).

- Step 2.5: Once proseq receives responses from a majority, it computes the maximum round-stamp, determines whether commands are missing, and retrieves them from the certifier certmax with the highest received round-stamp. If proseq is missing too many commands (e.g., if proseq did not participate in the last round certmax participated in), certmax sends its entire snapshot to proseq.

- Step 3: After receiving the missing commands, proseq broadcasts its snapshot (in practice, a checkpoint of its state together with a sequence of state updates) to the certifiers (transition recover(proseq, rid, S)). Certifiers acknowledge receipt of this snapshot and, upon receiving acknowledgments from a majority, proseq learns that it is now the sequencer of rid and broadcasts a commit message before resuming the normal case protocol (not shown in the picture).
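For concreteness, comparing round-stamps reduces to lexicographic ordering of pairs; the Python sketch below (an illustration, not Zab or VSR code) picks the certifier with the latest state.

# Round-stamps (used by Zab and VSR during recovery): a pair
# (round-id, number of slots certified in that round), compared lexicographically.

def latest(round_stamps):
    """round_stamps: dict certifier -> (round_id, certified_slots).
    Returns a certifier with the maximal round-stamp."""
    return max(round_stamps, key=lambda c: round_stamps[c])

stamps = {"c1": (3, 17), "c2": (3, 20), "c3": (2, 42)}
assert latest(stamps) == "c2"    # round 3 beats round 2; within round 3, 20 > 17 slots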

In VSR, each round-id has a pre-assigned view manager v that is not necessarily the sequencer. A round-id is a lexicographically ordered pair comprising a number and the process identifier of the view manager.

The view manager v of round-id rid starts the following recovery procedure if the protocol seems stalled (see Fig. 4c):

- Step 1: v starts supporting round rid (transition supportRound(v, rid, v)), and it queries at least a majority of certifiers.

- Step 2: Upon receipt of such a query, a certifier cert starts supporting rid (transition supportRound(cert, rid, v)). Similar to Zab, cert sends its round-stamp to v.

- Step 2.5: Upon receiving round-stamps from a majority of certifiers, view manager v uses the set of certifiers that responded as the designated majority for the round and assigns the sequencer role to the certifier p that reported the highest round-stamp. The view manager then notifies certifier p, requesting it to become sequencer.

- Step 3: Sequencer p, having the latest state, broadcasts its snapshot (transition recover(p, rid, S)). In the case of VSR, the state that the new sequencer sends is its application state rather than a snapshot.

Fig. 4. Depiction of the recovery phase. Dots represent transitions and are labeled with s and r, respectively denoting a supportRound and a recover transition. Messages of the form (x, y)* contain multiple (x, y) tuples.

4.3 Garbage Collection

Multi-Consensus has each certifier accumulating state about all slots, which does not scale. For Paxos, the issue has received attention in [7], [29]. In VSR, no garbage collection is required. Certifiers and replicas are co-located, and they only store the most recent round-id they adopted; application state is updated upon certification of a command. During recovery, the sequencer simply sends application state to the replicas and, consequently, any decided commands need not be replayed.

4.4 Liveness

All of the protocols require, in order to make progress, that at most a minority of certifiers experience crash failures. If the current round is no longer making progress, a new round must become operational. If certifiers are slow at making this transition in the face of an actual failure, then performance may suffer. However, if certifiers are too aggressive about starting this transition, rounds will become wedged before being able to decide commands, even in the absence of failures.

To guarantee progress, some round with a correct sequencer must eventually not get preempted by a higher round [30], [31]. Such a guarantee is difficult or even impossible to make [32], but with careful failure detection a good trade-off can be achieved between rapid failure recovery and spurious wedging of rounds [29].

In this section, we look at how the various protocols optimize progress.

4.4.1 Partial Memory Loss

If certifiers keep state on stable storage (say, a disk), then a crash followed by a recovery is not treated as a failure but instead as the affected certifier being slow. Stable storage allows protocols like Paxos, VSR, and Zab to deal with such transients. Even if all machines crash, then as long as a majority eventually recovers their state from before the crash, the service can continue operating.

4.4.2 Total Memory Loss

In [7, Section 5.1], the developers of Google's Chubby service describe a way for Paxos to deal with permanent memory loss of a certifier (due to disk corruption). The memory loss is total, so the recovering certifier starts in an initial state. It copies its state from another certifier and then waits until it has seen one decision before starting to participate fully in the Paxos protocol again. This optimization is flawed and it breaks the invariant that a certifier's round-id can only increase over time (confirmed by the authors of [7]). By copying the state from another certifier, it may, as it were, go back in time, which can cause divergence.

Nonetheless, total memory loss can be tolerated by extending the protocols. The original Paxos paper [3] shows how the set of certifiers can be reconfigured to tolerate total memory loss, and this has been worked out in greater detail in Microsoft's SMART project [33] and later for Viewstamped Replication as well [11]. Zab also supports reconfiguration [34].

5 DISCUSSION

Table 2 summarizes some differences between Paxos, VSR, and Zab. These differences demonstrate that the protocols do not refine one another; they also have pragmatic consequences, as discussed below. The comparisons are based on published algorithms; actual implementations may vary. We organize the discussion around normal case processing and recovery overheads.

5.1 Normal Case

5.1.1 Passive vs. Active Replication

In active replication, at least f + 1 replicas each must execute operations. In passive replication, only the sequencer executes operations, but it has to propagate state updates to the backups. Depending on the overheads of executing operations and the size of state update messages, one may perform better than the other. Passive replication, however, has the advantage that execution at the sequencer does not have to be deterministic and can take advantage of parallel processing on multiple cores.

5.1.2 Read-Only Optimizations

Paxos and ZooKeeper support leasing for read-only operations, but there is no reason why leasing could not be added to VSR.

TABLE 2
Overview of Important Differences between the Various Protocols

What                    Section        Paxos                            VSR                     Zab
replication style       2.2            active                           passive                 passive
read-only operations    4.1            leasing                          certification           read any replica/leasing
designated majority     4.1, 4.2       no                               yes                     no
time of execution       4.1            upon decision                    upon certification      depends on role
sequencer selection     4.2            majority vote or deterministic   view manager assigned   majority vote
recovery direction      4.2            two-way                          from sequencer          two-way/from sequencer
recovery granularity    4.2            slot-at-a-time                   application state       command prefix
tolerates memory loss   4.4.1, 4.4.2   reconfigure                      partial                 reconfigure


Indeed, Harp (built on VSR) uses leasing. A lease improves the latency of read-only operations in the normal case, but it delays recovery in case of a failure. ZooKeeper offers the choice of whether to use leases. Without leases, ZooKeeper clients read any replica at any time; doing so may cause clients to read stale state.
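The following sketch (ours; names, durations, and the clock handling are hypothetical assumptions) illustrates the lease rule in the style of [26]: while the sequencer holds a lease granted by a majority of certifiers, it answers reads locally; the safety of this shortcut rests on clocks with bounded drift, and a new sequencer must wait out the old lease before serving reads, which is the failover delay mentioned above.

    import time

    LEASE_DURATION = 5.0   # seconds; hypothetical

    class LeaseholderSequencer:
        def __init__(self):
            self.lease_expiry = 0.0
            self.state = {}

        def on_lease_granted(self):
            # Called once a majority of certifiers have granted (or renewed) the
            # lease. Measuring from before the grant request was sent would be
            # more conservative in the presence of message delays.
            self.lease_expiry = time.monotonic() + LEASE_DURATION

        def read(self, key):
            if time.monotonic() < self.lease_expiry:
                return self.state.get(key)   # served locally, no quorum round trip
            raise RuntimeError("lease expired: fall back to the replication protocol")

    # A newly selected sequencer must wait until the old lease has certainly
    # expired before answering reads locally, which delays recovery.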

5.1.3 Designated Majority

VSR uses designated majorities. An advantage is that the other (typically f) certifiers and replicas are not employed during normal operation, and they play only a small role during recovery, thereby saving almost half of the overhead. There are two disadvantages: (1) if the designated majority contains the slowest certifier then the protocol will run as slow as the slowest, rather than the “median” certifier; and (2) if one of the certifiers in the designated majority crashes, becomes unresponsive, or is slow, then recovery is necessary. In Paxos and Zab, recovery is necessary only if the sequencer crashes. A middle ground is achieved by using 2f + 1 certifiers and f + 1 replicas.
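The quorum arithmetic behind this middle ground is summarized by the small helper below (ours, purely illustrative); it only restates the counts discussed above for tolerating f crash failures.

    def configurations(f):
        n = 2 * f + 1
        return {
            "certifiers": n,                       # needed by all three protocols
            "quorum": f + 1,                       # any two quorums intersect
            "designated majority (VSR)": f + 1,    # fixed set; recovery if any member fails
            "replicas (middle ground)": f + 1,     # 2f + 1 certifiers, f + 1 replicas
        }

    print(configurations(2))
    # {'certifiers': 5, 'quorum': 3, 'designated majority (VSR)': 3,
    #  'replicas (middle ground)': 3}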

5.1.4 Time of Command Execution

In VSR, replicas apply state updates speculatively when they are certified, possibly before they are decided. Commands are forgotten as soon as they are applied to the state, so no garbage collection is necessary. A disadvantage is that the response to a client operation must be delayed until all replicas in the designated majority have updated their application state. In the other protocols, only one replica must update its state and compute a response, because if that replica fails then another deterministic replica is guaranteed to compute the same response. At the time of the failure, each command that led to this state and response has been certified by a majority and, therefore, the state and response are recoverable.

In Zab, the primary also speculatively applies client operations to compute state updates before these updates are decided. Replicas apply state updates after those updates have been decided. In Paxos, there is no speculative execution.
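The difference in when a replica applies an update can be sketched as follows (ours; class names are hypothetical): a VSR backup applies a state update as soon as it is certified, whereas a Zab replica (or a Paxos replica, for commands) buffers certified updates and applies them, in slot order, only once they have been decided.

    class CertifyThenApplyBackup:          # VSR-style backup
        def __init__(self):
            self.state = {}

        def on_certify(self, slot, update):
            self.state.update(update)      # speculative: applied before decision

    class DecideThenApplyReplica:          # Zab/Paxos-style replica
        def __init__(self):
            self.state = {}
            self.certified = {}            # slot -> update, certified but not applied
            self.decided = set()
            self.next_slot = 0

        def on_certify(self, slot, update):
            self.certified[slot] = update  # buffered; nothing applied yet

        def on_decide(self, slot):
            self.decided.add(slot)
            # Apply decided updates in slot order.
            while self.next_slot in self.decided and self.next_slot in self.certified:
                self.state.update(self.certified.pop(self.next_slot))
                self.next_slot += 1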

5.2 Recovery

5.2.1 Sequencer Selection

VSR selects a sequencer that has the most up-to-date state (taking advantage of prefix ordering), so VSR does not have to recover state from the other certifiers, which simplifies and streamlines recovery.

5.2.2 Recovery Direction

Paxos allows the prospective sequencer to recover state of previous slots and, at the same time, propose new commands for slots where it has already retrieved sufficient state. However, all certifiers must send their certification state to the prospective sequencer before it can re-propose commands for slots. With VSR, the sequencer is the certifier with the highest round-stamp, and it does not need to recover state from the other certifiers. A similar optimization is sketched in the description of the Zab protocol (and implemented in the ZooKeeper service).
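The Paxos-style, two-way recovery step can be sketched as follows (ours; the report format and names are hypothetical): the prospective sequencer gathers per-slot certification state from a majority of certifiers and, for each slot, re-proposes the command certified in the highest round, while fresh commands can already be proposed for unused slots.

    # 'reports' maps each certifier to its per-slot certification state:
    #   certifier -> {slot: (round_id, command)}

    def recover_slots(reports, quorum_size):
        assert len(reports) >= quorum_size, "need responses from a majority"
        merged = {}
        for per_slot in reports.values():
            for slot, (round_id, command) in per_slot.items():
                # For every slot, keep the command certified in the highest round.
                if slot not in merged or round_id > merged[slot][0]:
                    merged[slot] = (round_id, command)
        # Commands that must be re-proposed before these slots are reused.
        return {slot: command for slot, (_, command) in merged.items()}

    reports = {
        "c1": {1: (3, "put x=1")},
        "c2": {1: (3, "put x=1"), 2: (4, "put y=2")},
        "c3": {},
    }
    print(recover_slots(reports, quorum_size=2))   # {1: 'put x=1', 2: 'put y=2'}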

With Paxos and Zab, garbage collection of the certification state is important for ensuring that the amount of state exchanged on recovery does not become too large. A recovering replica can be brought up-to-date faster by replaying decided commands that were missed rather than by copying state.

With VSR, the selected sequencer pushes its snapshot to the other certifiers. The snapshot has to be transferred and processed prior to processing new certification requests, possibly resulting in a performance interruption.

5.2.3 Recovery Granularity

In VSR, the entire application state is sent from the sequencer to the backups. In VSR’s original setting, this state is transaction manager state and is small, but the approach does not scale in general. However, in some cases that cost is unavoidable. For example, if a replica has a disk failure, then replaying all commands from time 0 is not scalable either, and the recovering replica instead will have to seed its state from another one. In this case, the replica will load a checkpoint and then replay missing commands to bring the checkpoint up-to-date, a technique used in Zab (and Harp, as well). With passive replication protocols, replaying missing commands simply means applying state updates; with active replication protocols, replaying commands entails re-executing commands. Depending on the overhead of executing operations and the size of state update messages, one or the other approach may perform better.
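Checkpoint-plus-replay recovery can be sketched as follows (ours; the log layout, names, and the toy key-value state are hypothetical): the recovering replica loads the most recent checkpoint and then replays only the entries after it, where "replay" means installing state updates in the passive style or re-executing commands in the active style.

    def recover(checkpoint_state, checkpoint_slot, log, apply_fn):
        state = dict(checkpoint_state)               # seed from the checkpoint
        for slot in sorted(s for s in log if s > checkpoint_slot):
            apply_fn(state, log[slot])               # replay only what is missing
        return state

    def passive_apply(state, update):
        # Passive replication: replaying means installing the state update.
        state.update(update)

    def active_apply(state, command):
        # Active replication: replaying means re-executing the command.
        key, value = command
        state[key] = value

    log = {6: {"x": 1}, 7: {"y": 2}}
    print(recover({"x": 0}, checkpoint_slot=5, log=log, apply_fn=passive_apply))
    # {'x': 1, 'y': 2}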

5.2.4 Tolerating Memory Loss

An option suggested by VSR is to retain on disk only a round-id; the remaining state is kept in memory. This technique works only in restricted situations, where at least one certifier has the most up-to-date state in memory.

6 A BIT OF HISTORY

We believe the first consensus protocol to use rounds and sequencers is due to Dwork et al. [30]. Rounds in DLS are countable, and round b + 1 cannot start until round b has run its course. Thus, DLS does not refine Multi-Consensus (even though Multi-Consensus does use rounds and sequencers).

Chandra and Toueg’s work on consensus [31] formalized conditions under which consensus protocols terminate. Their consensus protocol resembles the Paxos single-decree Synod protocol, and it refines Multi-Consensus.

To the best of our knowledge, majority intersection to avoid potential inconsistencies first appears in Thomas [35]. Quorum replication [35] supports only storage objects with read and write operations (or, equivalently, get and put operations in the case of a Key-Value Store).

7 CONCLUSION

Paxos, VSR, and Zab are three well-known replication protocols for asynchronous environments that admit bounded numbers of crash failures. This paper describes Multi-Consensus, a generic specification that contains important design features that the three protocols share. The features include an unbounded number of totally ordered rounds, a static set of certifiers, and at most one sequencer per round.

The three protocols differ in how they refine Multi-Consensus. We disentangled fundamentally different design decisions in the three protocols and considered their impact on performance. Compute-intensive services are better off with a passive replication strategy, such as that used in VSR and Zab (provided that state updates are of a reasonable size). To achieve predictable low-delay performance for short operations during both normal-case execution and recovery, an active replication strategy without designated majorities, such as that used in Paxos, is the best option.

ACKNOWLEDGMENTS

The authors are grateful for the anonymous reviews and discussions with Flavio Junqueira and Marco Serafini. The authors are supported in part by AFOSR grants FA2386-12-1-3008, F9550-06-0019, by the AFOSR MURI “Science of Cyber Security: Modeling, Composition, and Measurement” as AFOSR grant FA9550-11-1-0137, by NSF grants 0430161, 0964409, 1040689, 1047540, and CCF-0424422 (TRUST), by ONR grants N00014-01-1-0968 and N00014-09-1-0652, by DARPA grants FA8750-10-2-0238 and FA8750-11-2-0256, by DOE ARPA-e grant DE-AR0000230, by MDCN/iAd grant 54083, and by grants from Microsoft Corporation, Facebook Inc., and Amazon.com.

REFERENCES

[1] L. Lamport, “Specifying concurrent program modules,” ACM Trans. Programm. Lang. Syst., vol. 5, no. 2, pp. 190–222, Apr. 1983.
[2] N. A. Lynch and F. W. Vaandrager, “Forward and backward simulations, II: Timing-based systems,” Inf. Comput., vol. 128, no. 1, pp. 1–25, 1996.
[3] L. Lamport, “The part-time parliament,” ACM Trans. Comput. Syst., vol. 16, no. 2, pp. 133–169, 1998.
[4] L. Lamport, “Time, clocks, and the ordering of events in a distributed system,” Commun. ACM, vol. 21, no. 7, pp. 558–565, Jul. 1978.
[5] F. Schneider, “Implementing fault-tolerant services using the state machine approach: A tutorial,” ACM Comput. Surv., vol. 22, no. 4, pp. 299–319, Dec. 1990.
[6] M. Burrows, “The Chubby lock service for loosely-coupled distributed systems,” in Proc. 7th Symp. Oper. Syst. Des. Implementation, Nov. 2006, pp. 335–350.
[7] T. Chandra, R. Griesemer, and J. Redstone, “Paxos made live: An engineering perspective,” in Proc. 26th ACM Symp. Principles Distrib. Comput., May 2007, pp. 398–407.
[8] M. Isard, “Autopilot: Automatic data center management,” Oper. Syst. Rev., vol. 41, no. 2, pp. 60–67, Apr. 2007.
[9] S. Weil, S. Brandt, E. Miller, D. Long, and C. Maltzahn, “Ceph: A scalable, high-performance distributed file system,” in Proc. 7th Conf. Oper. Syst. Des. Implementation, Nov. 2006, pp. 307–320.
[10] B. Oki and B. Liskov, “Viewstamped Replication: A general primary-copy method to support highly-available distributed systems,” in Proc. 7th ACM Symp. Principles Distrib. Comput., Aug. 1988, pp. 8–17.
[11] B. Liskov and J. Cowling, “Viewstamped Replication revisited,” Comput. Sci. Artif. Intell. Laboratory, MIT, Tech. Rep. MIT-CSAIL-TR-2012-021, Jul. 2012.
[12] B. Lampson and H. Sturgis, “Crash recovery in a distributed data storage system,” Comput. Sci. Laboratory, Xerox PARC, Palo Alto, CA, 1976.
[13] B. Liskov, S. Ghemawat, R. Gruber, P. Johnson, L. Shrira, and M. Williams, “Replication in the Harp file system,” in Proc. 13th ACM Symp. Oper. Syst. Principles, Oct. 1991, pp. 226–238.
[14] F. Junqueira, B. Reed, and M. Serafini, “Zab: High-performance broadcast for primary-backup systems,” in Proc. 41st Int. Conf. Dependable Syst. Netw., 2011, pp. 245–256.
[15] P. Hunt, M. Konar, F. Junqueira, and B. Reed, “ZooKeeper: Wait-free coordination for Internet-scale systems,” in Proc. USENIX Annu. Technol. Conf., 2010, p. 11.
[16] B. Lampson, “How to build a highly available system using consensus,” in Distributed Systems, S. Mullender, Ed., 2nd ed. New York, NY: ACM Press, 1993, pp. 1–17.
[17] C. Cachin, “Yet another visit to Paxos,” IBM Res., Zurich, Switzerland, Tech. Rep. RZ3754, 2009.
[18] B. Lampson, “The ABCDs of Paxos,” in Proc. 20th ACM Symp. Principles Distrib. Comput., 2001, p. 13.
[19] M. P. Herlihy and J. M. Wing, “Linearizability: A correctness condition for concurrent objects,” ACM Trans. Programm. Lang. Syst., vol. 12, no. 3, pp. 463–492, Jul. 1990.
[20] J. Aizikowitz, “Designing distributed services using refinement mappings,” Ph.D. dissertation, Cornell Univ., Ithaca, NY, USA, 1990.
[21] L. Lamport, “Byzantizing Paxos by refinement,” in Proc. 25th Int. Conf. Distrib. Comput., 2011, pp. 211–224.
[22] [Online]. Available: http://www.computer.org/csdl/trans/tq/index.html
[23] P. Alsberg and J. Day, “A principle for resilient sharing of distributed resources,” in Proc. 2nd Int. Conf. Softw. Eng., Oct. 1976, pp. 627–644.
[24] M. Pease, R. Shostak, and L. Lamport, “Reaching agreement in the presence of faults,” J. ACM, vol. 27, no. 2, pp. 228–234, Apr. 1980.
[25] F. P. Junqueira and M. Serafini, “On barriers and the gap between active and passive replication,” in Proc. Distrib. Comput., 2013, vol. 8205, pp. 299–313.
[26] C. Gray and D. Cheriton, “Leases: An efficient fault-tolerant mechanism for distributed file cache consistency,” in Proc. 12th ACM Symp. Oper. Syst. Principles, Nov. 1989, pp. 202–210.
[27] L. Lamport, “How to make a multiprocessor computer that correctly executes multiprocess programs,” IEEE Trans. Comput., vol. 28, no. 9, pp. 690–691, Sep. 1979.
[28] N. Schiper and S. Toueg, “A robust and lightweight stable leader election service for dynamic systems,” in Proc. Int. Conf. Dependable Syst. Netw., 2008, pp. 207–216.
[29] J. Kirsch and Y. Amir, “Paxos for system builders: An overview,” in Proc. 2nd Workshop Large-Scale Distrib. Syst. Middleware, 2008, pp. 1–6.
[30] C. Dwork, N. Lynch, and L. Stockmeyer, “Consensus in the presence of partial synchrony,” in Proc. 3rd ACM Symp. Principles Distrib. Comput., Aug. 1984, pp. 103–118.
[31] T. Chandra and S. Toueg, “Unreliable failure detectors for asynchronous systems,” in Proc. 11th ACM Symp. Principles Distrib. Comput., Aug. 1991, pp. 325–340.
[32] M. Fischer, N. Lynch, and M. Patterson, “Impossibility of distributed consensus with one faulty process,” J. ACM, vol. 32, no. 2, pp. 374–382, Apr. 1985.
[33] J. Lorch, A. Adya, W. Bolosky, R. Chaiken, J. Douceur, and J. Howell, “The SMART way to migrate replicated stateful services,” in Proc. 1st Eurosys Conf., Apr. 2006, pp. 103–115.
[34] A. Shraer, B. Reed, D. Malkhi, and F. Junqueira, “Dynamic reconfiguration of primary/backup clusters,” in Proc. USENIX Annu. Tech. Conf., Jun. 2012, p. 39.
[35] R. Thomas, “A solution to the concurrency control problem for multiple copy databases,” in Proc. COMPCON 78 Spring, Feb. 1978, pp. 88–93.

Robbert van Renesse received the PhD degree from the Vrije Universiteit, Amsterdam, in 1989. He is a principal research scientist in the Department of Computer Science at Cornell University. After working at AT&T Bell Labs in Murray Hill, he joined Cornell in 1991. He was an associate editor of the IEEE Transactions on Parallel and Distributed Systems from 1997 to 1999, and he is currently an associate editor for ACM Computing Surveys. His research interests include the fault tolerance and scalability of distributed systems. He is a fellow of the ACM.


Nicolas Schiper received the PhD degree from the University of Lugano, Switzerland, in 2009. He is a postdoctoral associate at the Computer Science Department at Cornell University. His research lies at the intersection of the theory and practice of distributed systems. He is particularly interested in scalability and energy efficiency aspects of fault-tolerance.

Fred B. Schneider is Samuel B. Eckert Professor of Computer Science at Cornell University. He is a fellow of the AAAS (1992), ACM (1995), and IEEE (2008), a member of the NAE (2011), and a foreign member of its Norwegian equivalent NTVA (2012). He was named Professor-at-Large at the University of Tromsø (Norway) in 1996 and was awarded a Doctor of Science Honoris Causa by the University of Newcastle upon Tyne in 2003 for his work in computer dependability and security. He received the 2012 IEEE Emanuel R. Piore Award.

" For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.
