
Distributed Computing manuscript No. (will be inserted by the editor)

Randomized Leader Election

Murali Krishna Ramanathan · Ronaldo A. Ferreira · Suresh Jagannathan · Ananth Grama · Wojciech Szpankowski

Department of Computer Sciences, Purdue University, IN 47907

{rmk, rf, suresh, ayg, spa}@cs.purdue.edu

Summary. We present an efficient randomized algorithm for leader election in large-scale distributed systems. The proposed algorithm is optimal in message complexity (O(n) for a system of n processes), has round complexity logarithmic in the number of processes in the system, and provides high probabilistic guarantees on the election of a unique leader. The algorithm relies on a balls-and-bins abstraction and works in two phases. The main novelty of the work is in the first phase, where the number of contending processes is reduced in a controlled manner. Probabilistic quorums are used to determine a winner in the second phase. We discuss, in detail, the synchronous version of the algorithm, provide extensions to an asynchronous version, and examine the impact of failures.

1 Introduction

The problem of leader election in distributed systems is an important and well-studied one. This problem is at the core of a number of applications – it forms the basis for replication (or duplicate elimination) in unreliable systems, for establishing group communication primitives by facilitating and maintaining group memberships, and for load balancing and job scheduling in master-slave environments [31].

A leader election algorithm is formally characterized as follows [19,32]: given a distributed ensemble of processes, each executing the same local algorithm, (i) the algorithm is decentralized, i.e., a computation can be initiated by an arbitrary non-empty subset of the processes; (ii) the algorithm reaches a terminal configuration in each computation; and (iii) in each reachable terminal configuration, exactly one process in the subset that initiated the computation is in the state leader, and all other processes in the subset are in the state lost.

The central challenge of efficient, scalable, and robust leader election algorithms is to simultaneously minimize the number of messages and the overall execution time. With these objectives, a number of leader election protocols have been proposed [4,8,9,12,28]. The emergence of novel distributed paradigms and underlying platforms for resource sharing, such as peer-to-peer systems [11,26,27,29], poses interesting new variants of this problem, viz., scalability, relaxed correctness requirements, etc. This motivates the design of a variety of probabilistic leader election schemes [12,28]. In [12], Gupta et al. use a multicast approach to elect a leader in a group with high constant probability. In [28], Schooler et al. propose two variants of the leader election algorithm and analyze them in the context of multicast with respect to several metrics, including delay and message overhead.

In this paper, we present a randomized leader election algorithm (primarily non-fault-tolerant) that is optimal in terms of message complexity (O(n) for a distributed system with n processes), has round complexity logarithmic in n, and is correct with high probability (w.h.p.). In other words, as n → ∞, the probability that exactly one leader is elected tends to one. Throughout this paper, w.h.p.¹ (with high probability) denotes probability 1 − 1/log^{Ω(1)} n. This is in contrast to known algorithms (Section 8) that involve a large number of messages [18,20,21] or require a larger number of rounds [4]. In this sense, our algorithm is targeted towards large-scale distributed systems [11,26,27,29], in which scalability is an important issue. However, it is important to

¹ Our definition of the term "with high probability" is weaker than the customary one, which typically requires a high-probability event to occur with probability at least 1 − 1/n^α, for some α > 0, where n is a "system size" parameter. Under both definitions the probability of the event converges to 1 as the system size grows to infinity, but under our use of the term this convergence is exponentially slower than under the customary use.


Algorithm                                   Messages                       Rounds/Time             Correctness
Probabilistic Quorum [21]                   O(n√n)                         1 round                 with high probability
P.C.L.E. [12]                               O(K^4 · n) per round           unspecified             depends on system parameters
LE in multihop broadcast environment [4]    O(n log n)                     O(n) time               guaranteed
LE in a synchronous ring [9]                Ω(n log n)                     bounded rounds (t)      guaranteed
Highly available LE service [8]             O(n − 1) datagrams per round   unspecified             dependent on connectivity
LE for multicast groups [28]                dependent on failures          dependent on failures   guaranteed (in the absence of failures)
Randomized Leader Election (this paper)     O(n)                           O(log n) rounds         with high probability

Table 1. Comparison of election algorithms with respect to message complexity, round complexity and correctness.

note that while many traditional leader election algorithms provide absolute guarantees for the election of a unique leader, our algorithm guarantees leader election with high probability. We provide a comparison of leader election algorithms for different metrics in Table 1.

1.1 Technical Contributions

The main contributions of the paper are as follows:

1. A randomized leader election algorithm that is optimal in the number of messages O(n), has round complexity logarithmic in the number of processes in the system O(log n), and elects a unique leader w.h.p.

2. An approach in which the lack of global information is intelligently leveraged to prune the number of processes participating in the leader election algorithm.

3. An asynchronous version of the first phase of the algorithm and a partially synchronous [5] version of the second phase, so that the algorithm can be effectively realized for general distributed applications.

4. A rigorous analysis to prove the correctness and the complexity of the algorithm.

The rest of the paper is organized as follows: Section 2 formalizes definitions and terminology used in this paper. Section 3 presents the synchronous version of the protocol, along with proofs relating to message complexity and probabilistic bounds on the election of a unique leader. Section 4 presents an asynchronous version of the protocol with analytical performance bounds. The impact of failures is addressed in Section 5. Issues relating to current distributed systems are discussed in Section 6. We provide a proof of concept of our approach using simulations in Section 7. Related work is presented in Section 8. Conclusions and avenues for future research are outlined in Section 9.

2 System Definitions and Model

We start by formalizing definitions and terminology used in the rest of the paper.

• Process: A process is an individual entity in a distributed system that can participate in the leader election protocol. It can communicate with any other process by sending and receiving messages. It is capable of generating random numbers independently of other processes in the system. In a synchronous system, all processes share a common clock (or, equivalently, their local clocks are synchronized), and message transfers are assumed to take unit time. In an asynchronous system, a process maintains a local clock, which is not necessarily synchronized with those of other processes. Furthermore, message transfers may take arbitrary time. All processes are assumed to follow the specified algorithm.

• Contender: The leader election algorithm presented in this paper is fashioned as a game played by participating processes. The set of processes playing this game decreases as the algorithm proceeds. This set of processes, from which a winner is selected, is referred to as the set of contenders.

• Mediator: Winners at intermediate steps in the game are decided from among the set of contenders by processes referred to as mediators. Specifically, a mediator is a process that receives a message from a contender and arbitrates whether the contender participates in subsequent steps of the protocol.

• Round: A round is composed of a two-way exchange between a single process and a set of mediators. At the end of a round, if the process is still a contender, a new round (communication with a new set of mediators) is initiated.

• Winner: A winner is a process that has not received negative responses from any mediator throughout the entire execution of the protocol.

In this paper, we primarily discuss the leader election algorithm in a distributed system without process or communication failures. We initially provide the methodology and proofs of correctness and complexity for the synchronous version of the algorithm. The complexity measures considered are:

1. Message Complexity: The total number of messages exchanged by all processes (contenders and mediators) in the system across all rounds for one complete execution of the leader election algorithm.

2. Round Complexity: The total number of rounds taken by the process elected as 'leader'.

We also provide an asynchronous version of the algorithm. We assume partial asynchrony based on the models developed by Dolev et al. [5], where we assume a maximum time bound on message delay and processor speed, i.e., within a specific time (δ) known to all processes, a message sent by any process A to any process B is received and processed by B in at most δ time units. Subsequently, we consider the implications for our algorithm of fail-stop process failures, in which a failed process neither sends nor receives messages. We assume that each process fails with some probability f.


3 A Synchronous Leader Election Protocol

In this section, we present a randomized synchronous algorithm for leader election that is scalable with respect to system size and offers high probabilistic guarantees of correctness. By high probabilistic guarantee, we mean that the probability of two processes getting elected as leader tends to zero as the system size increases. We initially present the synchronous version of our algorithm for ease of understanding. In Section 4, we extend our algorithm to the asynchronous setting, which is more realistic for large-scale systems.

We first present the algorithm informally using a balls-and-bins abstraction and subsequently formalize it in the context of distributed systems. The algorithm is played as a tournament in two phases. The first phase consists of multiple rounds. In round i of this phase, each contender casts β_i balls into n bins. (We derive precise expressions for β_i later in this section.) A contender is said to 'win' a bin if its ball is the only one that lands in the bin. If a contender wins all the bins that its balls land in, it is considered a winner in this round and proceeds to the next round. An example of this process is illustrated in Figure 1 for n = 8. The first phase consists of log_2 8 − 1 = 2 rounds. In the first round, processes 2, 4, 5 and 6 proceed, and in the second round, processes 5 and 6 proceed.


Fig. 1. Illustration of the synchronous protocol. Contenders are illustrated by squares, mediators by bins, and messages from contenders to mediators by labeled balls (the labels indicate the source of the ball). In all rounds, contenders that are no longer in the running are shaded. In the first round, each contender casts one ball, and contenders 2, 4, 5, and 6 proceed since their balls uniquely occupy their respective bins. In round 2 (the last round of the first phase), each contender casts two balls, and contenders 5 and 6 proceed. Finally, in round 3 (the only round of the second phase), each contender casts 5 balls and contender 6 is selected as the leader.

In a realization of this balls-and-bins abstraction, a process can be a contender as well as a mediator (a process to which one or more contenders send messages for arbitration) at the same time. Casting a ball into a randomly chosen bin corresponds to sending a message from a contender to a process picked uniformly at random (which thereby becomes a mediator). A contender is a winner if none of the mediators it sends messages to receives messages from any other contender. The number of mediators to which a contender sends messages in round i is denoted β_i. This quantity depends on the number of processes and the current round number. We discuss how to choose the value of β_i in Section 3.1. This number, if carefully selected, reduces the number of contenders by half, on average, after every round. This results, with high probability, in a small, nonzero number of contenders being left after a number of rounds that is logarithmic in the number of processes. The exact number of initial contenders is not relevant, since each contender executes the algorithm assuming that every process is a contender; the number of rounds remains logarithmic in the total number of processes, and the message complexity is preserved.
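One first-phase round can be simulated directly from this balls-and-bins description. The sketch below is our own simplification (the helper name, the choice n = 1024, and casting a single ball per contender are illustrative, mirroring the first round of Figure 1); it is not the authors' implementation.

```python
import random

def first_phase_round(contenders, n, beta):
    """One balls-and-bins round: each contender casts `beta` balls
    (messages) into n bins (mediators); a contender survives iff it
    is the sole occupant of every bin it hit."""
    occupants = {}  # bin index -> set of contenders with a ball there
    picks = {}
    for c in contenders:
        bins = random.sample(range(n), beta)  # distinct mediators for this contender
        picks[c] = bins
        for b in bins:
            occupants.setdefault(b, set()).add(c)
    return [c for c in contenders
            if all(occupants[b] == {c} for b in picks[c])]

random.seed(1)
n = 1024
survivors = first_phase_round(list(range(n)), n, beta=1)
print(f"{len(survivors)} of {n} contenders survive")
```

With one ball per contender and all n processes contending, each contender survives with probability (1 − 1/n)^{n−1} ≈ 1/e, so roughly 37% of contenders remain; the protocol's choice of β_i sharpens this to halving per round.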

The second phase of the protocol consists of a single round and is based on probabilistic quorums [21]. Each remaining contender generates a random number in a specific range, such that the number is unique w.h.p. among all the random numbers generated by the other contenders, and sends this number to a set of mediators picked uniformly at random. A contender wins this phase w.h.p. if it generates the largest random number among all the contenders. The number of mediators in this phase is chosen so that there is a high probability of at least one overlapping mediator between any two contenders. This overlapping mediator arbitrates based on who generated the larger random number. A crucial difference between the two phases is as follows: in the first phase, a contender wins a round only if it sends messages to an exclusive set of mediators (a set of processes to which no other contender has sent a message in that round). In the second phase, however, the winner is decided based on the random numbers generated and the intersection of the selected mediator sets. The reason for this difference is that in the first phase, the number of contenders is reduced just enough so that only a few of them proceed to the next round, maintaining the message complexity; in the second phase, the objective is to have one final winner, and the message complexity is maintained because the set of contenders is small.
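The second-phase round can likewise be sketched as a probabilistic-quorum game. This is again a schematic of our own: the quorum size √(n ln n) is the one used in Section 3.1, but the mediator bookkeeping is simplified to tracking the largest value each mediator has seen.

```python
import math
import random

def second_phase(contenders, n):
    """Each contender draws a random value in [0, n^4] and contacts
    ~sqrt(n ln n) mediators; a mediator responds positively only to
    the largest value it received. A contender wins iff all of its
    mediators respond positively."""
    q = int(math.sqrt(n * math.log(n)))
    value = {c: random.randrange(n ** 4 + 1) for c in contenders}
    quorum = {c: random.sample(range(n), q) for c in contenders}
    largest = {}  # mediator -> largest value it received
    for c in contenders:
        for m in quorum[c]:
            largest[m] = max(largest.get(m, -1), value[c])
    return [c for c in contenders
            if all(largest[m] == value[c] for m in quorum[c])]

random.seed(7)
n = 1024
winners = second_phase(random.sample(range(n), 8), n)
print(winners)
```

The contender holding the globally largest value always wins; a second winner can survive only if its quorum misses every mediator that saw a larger value, which is exactly the low-probability event bounded in Theorem 3.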

3.1 The Synchronous Leader Election Protocol

We consider a distributed system of n processes represented by the set Π_n = {a_i | 1 ≤ i ≤ n}, for processes a_1, a_2, . . . , a_n. The set of contenders in round j of the protocol is represented by Γ_j = {γ_i | 1 ≤ i ≤ |Γ_j|}, where Γ_j ⊆ Π_n, Γ_{j+1} ⊆ Γ_j, and γ_1, γ_2, . . . , γ_{|Γ_j|} are the contenders. We define E[X_j] to be the expected number of contenders in round j, since the exact number of contenders |Γ_j| cannot be calculated in a distributed setting. We later show in our analysis that E[X_j] ≈ n/2^{j−1}. A random integer ρ_i is selected in the range [0..n^4] by each contender γ_i. This range guarantees that the chosen values are unique with high probability [19] (Chapter 4, page 72).


Notation   Description
Π_n        Distributed system with n processes.
a_i        A process in the distributed system.
γ_i        A contender in a round.
Γ_j        Set of contenders in round j.
ρ_i        Random number generated by contender γ_i.
w          Final round of the protocol.
β_j        Number of processes that will be mediators in round j for a contender.
M_ij       Set of mediators for contender γ_i in round j.
S_ij       Set of messages received by mediator a_i in round j.
C_n        Maximum number of contenders in the final round.
f          Failure probability of any process.

Table 2. System Parameters

The protocol concludes by declaring a unique member of Γ_w to be the final winner, where w corresponds to the final round of the protocol and |Γ_j| > 1 for all j < w. We take w = log n − log log^6 n + 1, which is a fixed value known by all processes. We show in Section 3.2 that this suffices to satisfy the probabilistic guarantees of our protocol. The various system parameters are summarized in Table 2.

The synchronous algorithm is as follows, with each contending process γ_i in round j performing the following steps:

• Set

      β_j = √(n ln 2 / (E[X_j] − 1)),  if j < w
      β_j = √(n ln n),                 otherwise

where β_j is the number of mediators in round j, E[X_j] is the expected number of contenders in round j, and n is the total number of processes in the system. The value of β_j is selected in this manner to guarantee bounds on the number of messages and rounds. A detailed analysis is provided in Section 3.2.

• Let M_ij be the set of β_j processes selected uniformly at random from Π_n. If j < w, send (γ_i) to all processes in M_ij. Otherwise, send the ordered pair (γ_i, ρ_i).

• Proceed to the next round j + 1 as a contender if and only if positive responses are received from all the processes in M_ij. Otherwise, move the election result to lost.

• If the round number is w + 1, move the election result to leader.

Each process a_i in Π_n performs the following steps in round j:

• Receive messages from the γ_k's in Γ_j and populate the set S_ij with the ordered pairs (γ_k, ρ_k).

• First Phase: j < w
  – If |S_ij| = 1, then send a positive response to the γ_k in S_ij and proceed to round j + 1.
  – Otherwise, send a decline message to every γ_k present in S_ij and proceed to round j + 1.

• Second Phase: j = w
  – For every (γ_k, ρ_k) in S_ij, find the largest² ρ_k and send a positive response to the corresponding γ_k. Send a decline message to the rest of the γ_k's.

² The probability that the random numbers generated by any two contenders are equal is extremely low. In the unlikely event that the two equal numbers are the largest among the
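The mediator schedule above can be made concrete by evaluating β_j against the approximation E[X_j] ≈ n/2^{j−1}. The helper below is our own sketch; in particular, rounding β_j up to at least one mediator is an assumption, since the raw formula falls below 1 in the first round, and the numbers only behave as intended for very large n.

```python
import math

def beta(j, n, w):
    """Mediator count for a contender in round j, using the
    approximation E[X_j] ~ n / 2**(j-1) (Section 3.2 gives the
    exact bounds with (1 +/- O(1/log^2 n)) factors)."""
    if j < w:
        exj = n / 2 ** (j - 1)
        return max(1, math.ceil(math.sqrt(n * math.log(2) / (exj - 1))))
    return math.ceil(math.sqrt(n * math.log(n)))

n = 2 ** 64  # the asymptotics require log n > log log^6 n, i.e. very large n
w = int(math.log2(n) - math.log2(math.log2(n) ** 6)) + 1
print(w, [beta(j, n, w) for j in (1, 2, 3, w)])
```

Note that β_1 = 1 and β_2 = 2, matching the one ball and two balls per contender cast in the first two rounds of Figure 1.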

3.2 Analysis

We quantify the overhead and the probability of electing a unique leader using our algorithm. On average, the number of contenders is halved after a single round in the first phase. Let X_i be a random variable representing the number of contenders proceeding to round i from round i − 1, and let E[X_i] be the expectation of X_i. Let f_n = log n − log log^6 n, 1 < i ≤ f_n, ε = O(1/log^2 n), l_i = (1 − ε) E[X_i], u_i = (1 + ε) E[X_i], g(t) = 1 − O(1/t), and h(t) = 1 + O(1/t).

Lemma 1.

1. (n/2^{i−1}) [g(log^2 n)]^{i−1} ≤ E[X_i] ≤ (n/2^{i−1}) [h(log^2 n)]^{i−1}

2. P(l_i < X_i < u_i) ≥ g(log^6 n)

Sketch of proof: We are given X_1 = n and β_i = √(n ln 2 / (E[X_i] − 1)). The analysis can be trivially extended to X_1 < n. The X_i's are binomially distributed in any round i. It can easily be shown that

E[X_{i+1} | X_i = x] = x (1 − β_i/n)^{β_i (x − 1)}.

We illustrate this computation using a few variables and then generalize for 1 < i ≤ f_n.

E[X_2] = Σ_x E[X_2 | X_1 = x] P(X_1 = x) = E[X_2 | X_1 = n] = (n/2) g(n)

We can bound the probabilities of X_2 using the Chernoff bound [30]:

P(X_2 ≥ u_2) ≤ e^{−ε^2 E[X_2]/3}
P(X_2 ≤ l_2) ≤ e^{−ε^2 E[X_2]/2}
P(l_2 ≤ X_2 ≤ u_2) ≥ g(n^2)

We calculate the bounds for X_2 directly since the value of X_1 is already known. For subsequent random variables, however, we give the bounds for X_i taking into account the variability of X_{i−1}. Splitting the expectation over the event l_2 < X_2 < u_2 and its complement,

E[X_3] = Σ_{l_2 < x < u_2} E[X_3 | X_2 = x] P(X_2 = x) + Σ_{x ≤ l_2 or x ≥ u_2} E[X_3 | X_2 = x] P(X_2 = x)
       ≤ Σ_{l_2 < x < u_2} E[X_3 | X_2 = x] P(X_2 = x) + n (1 − g(n^2)).

We use the following implication to bound E[X_3]: z < x < y ⇒ z e^{−y} < x e^{−x} < y e^{−z}. Thus

E[X_3 | X_2 = u_2] ≤ u_2 (1 − β_2/n)^{β_2 (l_2 − 1)} = (n/2^2) [h(log^2 n)]^2

numbers generated by all contenders, the algorithm fails, in that two leaders or no leader are elected even if the mediator sets intersect.


Similarly, we can prove that

E[X_3 | X_2 = l_2] ≥ (n/2^2) [g(log^2 n)]^2

From the above results, we get

(n/2^2) [g(log^2 n)]^2 ≤ E[X_3] ≤ (n/2^2) [h(log^2 n)]^2

The above bounds can be extended in a similar fashion to any i, 1 < i ≤ f_n.
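The halving behaviour used throughout the sketch can be checked numerically: with β_i = √(n ln 2 / (x − 1)), the conditional expectation x(1 − β_i/n)^{β_i(x−1)} evaluates to almost exactly x/2. This is a numeric sanity check we added, not part of the original proof.

```python
import math

def expected_next(x, n):
    """E[X_{i+1} | X_i = x] = x * (1 - b/n)**(b*(x-1)),
    with b = sqrt(n * ln 2 / (x - 1))."""
    b = math.sqrt(n * math.log(2) / (x - 1))
    return x * (1 - b / n) ** (b * (x - 1))

n = 10 ** 6
for x in (n, n // 2, n // 8):
    print(x, expected_next(x, n) / x)  # each ratio is very close to 0.5
```

The exponent satisfies β_i^2 (x − 1)/n = ln 2, so the survival probability is essentially e^{−ln 2} = 1/2, up to the (1 ± O(1/log^2 n)) factors tracked by g and h.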

Theorem 1. At the end of the first phase, there is at least one contender with probability 1 − O(1/log^6 n), i.e., P(X_{f_n} = 0) → 0 as n → ∞.

Proof: We use the second moment method, the results from Lemma 1, and the fact that the X_i's are binomially distributed.

Var[X_i | X_{i−1} = u_{i−1}] ≤ u_i (1 − (1 − β_{i−1}/n)^{β_{i−1} (u_{i−1} − 1)}) = (1/2) E[X_i] [h(log^2 n)]

Similarly, we can prove that

Var[X_i | X_{i−1} = l_{i−1}] ≥ (1/2) E[X_i] [g(log^2 n)]

From the above results, we get (for some c < 1)

c E[X_i] [g(log^2 n)] ≤ Var[X_i] ≤ E[X_i] [h(log^2 n)]

From Lemma 1, we have

E[X_{f_n}] ≥ (n/2^{f_n−1}) [g(log^2 n)]^{f_n−1} = (log^6 n / 2) [g(log n)]

Using the second moment method, we have

P(X_{f_n} = 0) ≤ Var[X_{f_n}] / E[X_{f_n}]^2 = O(1/log^6 n)

The theorem follows. □

Theorem 2. At the end of the first phase, the number of contenders remaining is at most (1/2)(1 + ε) log^6 n with probability 1 − O(1/log^6 n), i.e., P(X_{f_n} > (1/2)(1 + ε) log^6 n) → 0 as n → ∞, for some ε < O(1/log^2 n).

Proof: From Lemma 1, the theorem follows. □

Theorem 3. With high probability, there is exactly one contender remaining at the end of the protocol.

Proof: For all u, v ∈ Γ_w with u ≠ v, the probability that the two mediator sets M_uw and M_vw intersect is

P(M_uw ∩ M_vw ≠ ∅) = 1 − (1 − √(n ln n)/n)^{√(n ln n)} ≈ 1 − 1/n

Furthermore, at the end of the first phase, O(log^6 n) contenders are present w.h.p. To elect a unique leader, the mediator set of the contender with the highest value needs to intersect with the mediator sets of all other contenders. The corresponding probability of intersection is (1 − 1/n)^{O(log^6 n)} = 1 − O(log^6 n / n). Since the mediator sets intersect w.h.p., only the process γ_i with the highest value of ρ_i receives positive responses from all processes in M_iw. The rest of the contending processes each receive at least one decline message. Moreover, from Theorem 1, the probability that at least one contender remains at the end of the first phase is 1 − O(1/log^6 n). Therefore, the probability that a unique final winner is chosen at the end of the protocol is 1 − O(1/log^6 n). □
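The quorum-intersection estimate in the proof of Theorem 3 is easy to verify numerically. The check below is ours; it evaluates the miss probability (1 − q/n)^q for q = √(n ln n), which is ≈ e^{−q²/n} = e^{−ln n} = 1/n.

```python
import math

def miss_probability(n):
    """Probability that two independently chosen mediator sets of
    size q = sqrt(n ln n) fail to intersect: (1 - q/n)**q."""
    q = math.sqrt(n * math.log(n))
    return (1 - q / n) ** q

n = 10 ** 6
print(miss_probability(n), 1 / n)  # the two values are of the same order
```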

Theorem 4. The number of rounds in the protocol is O(log n).

Proof: The number of rounds in the first phase is log n − log log^6 n. There is a single round in the second phase. The theorem follows. □

Theorem 5. The total number of messages exchanged by all the processes across all rounds is O(n).

Proof: The total number of messages exchanged by all the processes across all rounds of the first phase is 2 (Σ_{j=1}^{w−1} X_j √(n ln 2 / (E[X_j] − 1))). Based on Lemma 1, in the worst case the number of messages in the first phase is O(n), as Σ_{j=1}^{w−1} 1/2^{(j−1)/2} converges. Since each contender sends at most √(n ln n) messages in the second phase, the theorem follows from Theorem 2. □
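The geometric decay behind the proof of Theorem 5 can be checked by summing X_j β_j directly, substituting E[X_j] ≈ n/2^{j−1} for X_j. The cutoff once the contender estimate is O(1) is our own choice; the resulting constant is illustrative.

```python
import math

def first_phase_message_bound(n):
    """Sum over first-phase rounds of X_j * beta_j, with
    X_j ~ n / 2**(j-1) and beta_j = sqrt(n ln 2 / (X_j - 1))."""
    total, j = 0.0, 1
    while n / 2 ** (j - 1) > 2:      # stop once the contender estimate is O(1)
        xj = n / 2 ** (j - 1)
        total += xj * math.sqrt(n * math.log(2) / (xj - 1))
        j += 1
    return total

n = 10 ** 6
print(first_phase_message_bound(n) / n)  # a small constant, independent of n
```

Each term is approximately n √(ln 2) / 2^{(j−1)/2}, so the sum is bounded by n √(ln 2) / (1 − 1/√2) ≈ 2.84 n, matching the O(n) claim.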

An improvement of the algorithm is to send request messages to only β_j − Σ_{l=1}^{j−1} β_l new mediators in round j of the protocol, with the mediators from previous rounds participating again in round j. This is in contrast to selecting a fresh set of β_j processes for every round j. This optimization does not change the asymptotic behavior of the algorithm, but reduces the number of messages by a constant factor.

4 An Asynchronous Leader Election Protocol

The design of the asynchronous protocol is largely motivated by its synchronous counterpart. However, asynchrony poses significant challenges, since we would like to preserve the message and round complexities of the synchronous protocol. We present a suitably modified asynchronous protocol and prove that it is correct and that it obeys the bounds presented in Section 3. While the first phase is totally asynchronous, the second phase, which is based on probabilistic quorums [21], is partially synchronous [5]: we assume a maximum time bound on message delay and processor speed, i.e., within a specific time (δ) known to all processes, a message sent by any process A to any process B is received and processed by B in at most δ time units.

In the first phase of the synchronous protocol, when messages from two distinct contenders are sent to the same mediator, both contenders receive a negative response. In the asynchronous case, messages from contenders to a mediator for a specific round need not be received at the same time. Therefore, in every round, a mediator responds positively to the first contender request for that round. A negative response is sent to subsequent requests from other contenders for that round. In the second phase of the partially synchronous version [5], messages from contenders need not be received at the same time, in contrast to the synchronous version. A contender in the second phase of the protocol wins if it gets positive responses from the required number of mediators within a bounded time (the time bound is provided as a parameter to the algorithm). If a contender receives no response (either positive or negative) from a mediator, this scenario is attributed to the failure of the mediator. An analysis of this case is presented in Section 5.

4.1 The First Phase

In the first phase of the asynchronous protocol, a contender γ_i sends requests, along with a round number j, to the mediators in the set M_ij. The mediators in M_ij are picked uniformly at random, as in the synchronous protocol. The number of mediators in any round of the asynchronous first phase is equal to that of the corresponding round in the synchronous protocol. A contender γ_i proceeds to round j + 1 if it receives positive responses from all the mediators in its set M_ij. We describe below the procedure that a mediator adopts to send a positive response to a contender.

A mediator M maintains a vector V of size equal to the number of rounds (O(log n)). All the entries in V are initially set to zero. On receiving a request from a contender γ_i in round j, if entry j of V at M is zero, M sends a positive response to γ_i and sets entry j to γ_i (signifying that the winner of the j-th round at M is γ_i). Otherwise, a negative response is sent to γ_i. The purpose of this step is to reduce the number of contenders that proceed to subsequent rounds. A better solution, which would further reduce the number of contenders without affecting the correctness of the protocol, would be for M to send a negative response to any request whose round number is smaller than the index of the highest non-zero entry γ_k in V. This is because M then knows that γ_k is in a more advanced round (ahead of γ_i) and hence is more likely to be elected the leader than γ_i. We use the first approach here, in which a mediator lets a contender proceed even when a higher-numbered entry in its vector is already set. This simplifies our analysis considerably. We show that the asymptotic message complexity and number of rounds hold even under this conservative approach.
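The per-round vector V can be sketched as a small mediator class. This is our own schematic (the protocol prescribes only the vector, not a particular data structure), using `None` in place of the zero entries.

```python
class Mediator:
    """Asynchronous first-phase mediator: one slot per round,
    granted to the first contender whose request arrives."""

    def __init__(self, num_rounds):
        self.V = [None] * num_rounds  # V[j] holds the round-j winner at this mediator

    def on_request(self, contender, j):
        """Return True (positive response) only for the first request
        seen for round j; all later requests for j are declined."""
        if self.V[j] is None:
            self.V[j] = contender
            return True
        return False

m = Mediator(4)
print(m.on_request("A", 1), m.on_request("B", 3), m.on_request("C", 3))
# prints: True True False -- mirroring the A/B/C sequence of Figure 2
```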

Figure 2 illustrates an example of the protocol executed at M. In (a), M initially sets all the entries in V to zero. M receives a request for round one from contender A. Since the corresponding entry is zero, it sets the entry to A and sends a positive response to A (see (b)). Similarly, M receives a request for round three from


Fig. 2. (a) Vector V before receiving any requests. (b) V after a request from A for round number 1. (c) V after a request from B for round number 3. (d) V after a request from C for round number 3.

contender B. It sets entry three to B and sends a positive response to B (see (c)). However, when M receives another request for round three from contender C, since the entry is already set, a negative response is sent to C (see (d)). A mediator responds to requests for a round based on the corresponding entry in V, independent of the other entries. The contenders that survive all the rounds in the first phase (i.e., those that do not receive even a single negative response) proceed to the second phase of the protocol.

Let A and B be two contenders in the same round i, and let C and D be two common mediators (in general, the probability of such an event is very low). Suppose C receives requests from A and B in that order, and D receives the requests in the reverse order. Then neither A nor B can proceed to future rounds. This is similar to a collision in the synchronous case. However, if C and D both receive the requests from A and B in the same order, then one of the contenders proceeds to the next round. We note that race conditions can occur in an asynchronous system. However, our probabilistic analysis takes these race conditions into account.³ We show in our analysis that the increased number of contenders proceeding to later rounds, as compared to the synchronous phase, does not affect the message and time complexity.
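The ordering effect described above can be reproduced directly with the per-round slot rule; the following toy sketch (with our own helper names, two contenders and two mediators) is illustrative only:

```python
def run_round(deliveries):
    """deliveries: list of (mediator_slots, contender) arrivals in order.
    Each mediator is a dict mapping round -> winning contender.
    Returns the set of contenders that received a positive response
    from every mediator they contacted."""
    positives = {}
    for slots, contender in deliveries:
        # setdefault: first arrival claims the round-1 slot at this mediator
        ok = slots.setdefault(1, contender) == contender
        positives.setdefault(contender, []).append(ok)
    return {c for c, oks in positives.items() if all(oks)}

# Opposite arrival orders at mediators C and D: neither A nor B survives.
C, D = {}, {}
survivors = run_round([(C, "A"), (C, "B"), (D, "B"), (D, "A")])
# Same arrival order at both mediators: A survives and proceeds.
C2, D2 = {}, {}
survivors_same = run_round([(C2, "A"), (C2, "B"), (D2, "A"), (D2, "B")])
```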

4.2 Details of the second phase of the asynchronous algorithm

Let δ represent the network delay and processing time associated with any message in the system. This is a parameter to the algorithm; messages in an asynchronous environment that exceed this time are treated as failures, and their handling is addressed in Section 5. In the second phase of the protocol, a contender χ_i sends a request

³ To provide an analogy, randomized quicksort can take O(n²) time in the worst case. However, such events are rare.

M.K. Ramanathan, R.A. Ferreira, S. Jagannathan, A. Grama, W. Szpankowski: Randomized Leader Election 7

Message | Description
REQ | A request from a contender to make the mediator recognize it as a potential winner.
POTW | The message from a contender to notify the mediator that the requisite O(√(n ln n)) mediators have recognized it as a potential winner and to recognize it as a final winner.
DEC | To notify the mediator that the contender is dropping out as a contender.

Table 3. Messages REQ (REQuest), POTW (POTential Winner), DEC (DECline) from contender to mediator.

Message | Description
ACK | Positive response for REQ, POTW messages.
NAK | Negative response for REQ, POTW messages.

Table 4. Messages from mediator to contender

to all its mediators Γ_i^w (Γ_i^w represents the set of mediators for the contender χ_i in round w, the last round of the protocol; see Table 2). Some mediators, in the best case, may receive and process the message within ε time (ε ≪ δ) and respond immediately. In the worst case, the mediator might have processed the message δ time units after it was sent, and the response might take another δ units. Therefore, χ_i waits for at least 2δ units of time to receive responses from Γ_i^w. In the actual protocol, though, χ_i waits for a maximum of 5δ time units. We explain in Theorem 8 the reason for using 5δ. If χ_i receives a negative response from one of Γ_i^w within this period, it sends a decline to the rest of Γ_i^w. Otherwise, upon receiving all the positive responses from Γ_i^w (a contender will receive a response from every mediator within 5δ units after sending a request), it sends a message to Γ_i^w hinting that it could be a potential winner. χ_i then waits for 2δ units to receive responses from Γ_i^w after sending the potential-winner message. If it does not receive a negative response from any of Γ_i^w, it becomes the final winner of the protocol. Otherwise, if it receives a negative response in the intervening period, it sends a decline message to the rest of Γ_i^w.

Fig. 3. State diagram of a contender.

A detailed state transition diagram for the contender is shown in Figure 3. The subscripts in the figure correspond to rows in Table 5. Table 5 provides the state transition conditions, with I representing a received message (input), C representing a condition, and A representing the action taken by the contender.

Index | I | C | A
1 | – | – | Send REQ_{χi} message to the mediators in the set Γ_i^w; counter ← β_w
2 | ACK | – | counter ← counter − 1
3 | ACK | counter = 1 | Send POTW_{χi} to all mediators in Γ_i^w; timer ← 2δ
4 | NAK | – | Send DEC messages to all mediators in Γ_i^w
5 | – | timer = 0 | Final Winner

Table 5. Contender Transitions

For example, (I3, C3, A3) in Figure 3 is represented by the third row of Table 5. It is read as: on an input of ACK, given that the condition counter = 1 holds, the action is to send a POTW message to Γ_i^w. A contender χ_i that enters the second phase of the protocol is initially in state S0. After it sends the request to Γ_i^w in the second phase, it goes to state S1. If it receives a negative response, it goes to the final state S3, where χ_i is not the final winner. Otherwise, after receiving all the positive responses, it sends potential-winner messages and goes from S1 to S2. If it does not receive any negative response from a mediator in state S2 within 2δ time units, it goes to the final state S4, where the contender is the final winner.
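The contender transitions described above (Table 5) can be rendered as a small state machine. The sketch below uses our own naming, and reduces timer handling to an explicit `timer_expired` event; it is an illustration, not the authors' implementation:

```python
class Contender:
    """Second-phase contender FSM: S0 -> S1 -> S2 -> S4 (winner) or S3 (lost)."""

    def __init__(self, num_mediators):
        self.state = "S0"
        self.counter = num_mediators  # beta_w pending ACKs
        self.sent = []                # messages this contender emits

    def start(self):
        self.sent.append("REQ")       # row 1: request all mediators in its set
        self.state = "S1"

    def on_ack(self):
        if self.state != "S1":
            return
        self.counter -= 1             # row 2: one fewer ACK outstanding
        if self.counter == 0:         # row 3: all ACKs in, claim potential winner
            self.sent.append("POTW")
            self.state = "S2"

    def on_nak(self):
        if self.state in ("S1", "S2"):
            self.sent.append("DEC")   # row 4: drop out, decline the rest
            self.state = "S3"

    def on_timer_expired(self):
        if self.state == "S2":        # row 5: no NAK within 2*delta
            self.state = "S4"         # final winner
```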

Fig. 4. States of a mediator (safe, post-safe, close-safe).

A mediator responds to a contender in the second phase based on its second-phase state. We define three second-phase states for the mediator (see Figure 4).⁴ A mediator is in a safe state if it sent a positive response to a request in the last 3δ time units. In this state, the mediator is committed to the positive response sent earlier. It responds negatively to a request from another contender with a lower random number value. Otherwise, it delays the response. A mediator is in a post-safe state if a potential-winner or decline message has not been received from the contender that received the most recent positive response from the mediator, and if the difference between the current time and the time when the positive response was sent is between 3δ and 6δ units. In this state, the mediator can pre-empt the earlier contender (send a negative response to the contender to which it sent a positive response earlier) and send a positive response to some other contender having a higher random number value. A mediator is in a close-safe state

⁴ The duration of each state is at most 3δ. However, it can be shorter than that, based on the time of receipt of different messages.


if it has received a potential-winner message, has not received a decline message from the same contender, and the difference between the current time and the time of receipt of the potential-winner message is not greater than 3δ units. In the close-safe state, the mediator delays responses to any contender requests (similar to the safe state). However, it takes different actions based on the messages received in the close-safe state. If a decline message is not received from the contender that sent the potential-winner message, it sends a negative response to the delayed contenders and declares the contender that sent the potential-winner message to be the final winner. Otherwise, the actions performed at the end of the close-safe state are similar to the actions at the end of the safe state.

On receiving a request from a contender A, if the mediator is not in any of the three second-phase states (i.e., it has not received a request from any contender for the second phase), it immediately sends a positive response. If it is in one of the second-phase states, the response is based on the random numbers sent by the contenders. In a safe/close-safe state, if the incoming request from a contender B carries the largest random number (compared to the other random numbers received by the mediator thus far), the response is delayed. Otherwise, a negative response is sent immediately. In a post-safe state, the mediator sends a positive response to the delayed request from B and preempts the earlier contender A. However, if a potential-winner message from contender A is received before the mediator enters the post-safe state, a negative response is sent to the delayed contender B. Also, if a request with a larger random number is received in a post-safe state, a positive response is sent back, and the earlier contender A is preempted.

On receiving a potential-winner message for the most recent positive response, the mediator goes to a close-safe state. If no decline message is received during the close-safe state, the mediator declares the contender that sent the potential-winner message to be the final winner.

Fig. 5. State diagram of a mediator.

In the tables below, ρ denotes a contender's random number, χ a contender identifier, and ⊥ an empty slot; (ρ_curr, χ_curr) is the contender holding the mediator's outstanding positive response, and (ρ_wait, χ_wait) is the delayed (waiting) contender, if any.

Index | I | C | A
1 | REQ_{χi} | – | init(ρ_i, χ_i, ⊥, ⊥, 1)
2 | REQ_{χj} | – | act()
– | DEC | ρ_wait ≠ ⊥ | ack(χ_wait); init(ρ_wait, χ_wait, ⊥, ⊥, 1)
– | POTW_{χj} | χ_j ≠ χ_curr | nak(χ_j)
– | timer = 0 | ρ_wait ≠ ⊥ | nak(χ_curr); ack(χ_wait); init(ρ_wait, χ_wait, ⊥, ⊥, 1)
3 | timer = 0 | ρ_wait = ⊥ | timer ← 3δ
4 | REQ_{χj} | ρ_j < ρ_curr | nak(χ_j)
5 | REQ_{χj} | ρ_j > ρ_curr | nak(χ_curr); send ACK to χ_j; init(ρ_j, χ_j, ⊥, ⊥, 1)
6 | POTW_{χj} | χ_curr = χ_j | timer ← 3δ; if χ_wait ≠ ⊥ then nak(χ_wait); init(ρ_curr, χ_curr, ⊥, ⊥, 0)
7 | POTW_{χj} | χ_j = χ_curr | timer ← 3δ; if χ_wait ≠ ⊥ then nak(χ_wait); init(ρ_curr, χ_curr, ⊥, ⊥, 0)
8 | REQ_{χj} | – | act()
9 | timer = 0 | – | Final Winner ← χ_curr
10 | DEC | ρ_wait ≠ ⊥ | ack(χ_wait); init(ρ_wait, χ_wait, ⊥, ⊥, 1)
11 | DEC | ρ_wait = ⊥ | –
12 | DEC | χ_j = χ_curr | –

Table 6. Mediator Transitions.

Name | Actions
init(ρ_c, χ_c, ρ_w, χ_w, set) | ρ_curr ← ρ_c; χ_curr ← χ_c; ρ_wait ← ρ_w; χ_wait ← χ_w; if set is not 0 then timer ← 3δ
act() | if ρ_j < ρ_curr then nak(χ_j); elseif ρ_wait = ⊥ then init(ρ_curr, χ_curr, ρ_j, χ_j, 0); elseif ρ_j < ρ_wait then nak(χ_j); else nak(χ_wait); init(ρ_curr, χ_curr, ρ_j, χ_j, 0)
ack(val) | Send ACK to val
nak(val) | Send NAK to val

Table 7. Mediator Actions.

A formal description of the mediator actions is presented in the transition diagram in Figure 5 and the corresponding Table 6. The subscripts in the diagram correspond to the rows, and the symbols in parentheses to the columns, of Table 6. State S4 is the final state where the eventual winner is declared. State S0 is the starting idle state, state S1 corresponds to the safe state, state S2 corresponds to the post-safe state, and state S3 corresponds to the close-safe state.
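The arbitration rule in act() of Table 7 amounts to keeping at most one committed contender and one waiting contender, always preferring larger random numbers. The sketch below uses our own data layout and is illustrative only:

```python
def act(med, rho_j, chi_j):
    """Handle a request (rho_j, chi_j) at mediator `med`, which tracks the
    committed pair (rho_curr, chi_curr) and an optional waiting slot.
    Returns the contender that receives an immediate NAK, or None if the
    request is parked (delayed) in the waiting slot."""
    if rho_j < med["rho_curr"]:
        return chi_j                          # smaller than committed: reject
    if med.get("chi_wait") is None:
        med["rho_wait"], med["chi_wait"] = rho_j, chi_j
        return None                           # delay: park as the waiter
    if rho_j < med["rho_wait"]:
        return chi_j                          # existing waiter is stronger
    loser = med["chi_wait"]                   # new request displaces the waiter
    med["rho_wait"], med["chi_wait"] = rho_j, chi_j
    return loser
```

The waiter is later ACKed (if the committed contender declines or is preempted) or NAKed (if the committed contender becomes a potential winner), per Table 6.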

We further elucidate the second phase of the protocol with a few examples. In Figure 6 (example 1), C1 and C2 are two contenders, and M1 and M2 are mediators (in reality, M1 may correspond to the same process as C1 or C2, but for the sake of clarity we present it as a different entity). Let the random number generated by C1 be the largest in the system. Contender C1 sends requests to M1

and M2. Since M2 has not received any requests previously, it sends a positive response immediately. However, M1 had already sent a positive response to some other contender in the system, say C3, and is in a safe state. The request from C1 must wait for a maximum of 3δ units at M1, since the random number value generated by C1 is the largest. In the meantime, C2 also sends a request to M2. Since M2 is in a safe state and the random number of C2 is smaller than that of C1, a negative response is sent to C2. Meanwhile, C3 sends a decline message to M1, since it might have received a negative response from some other mediator. Mediator M1 then sends a positive response to the next waiting contender, C1. After having received the requisite positive responses from all the mediators to which it had sent a request (the requisite number of responses is O(√(n ln n))), C1 sends a potential-winner message to the mediators from which it has received positive responses and waits for 2δ units of time. Because it receives a positive response from both M1 and M2, C1 becomes the final winner. If some other contender, with a larger random number than C1's, sends a request to M1 while M1 is in the close-safe state, that contender receives a negative response from M1, as shown in Figure 6.

Fig. 6. Illustration (example 1) of the second phase of the asynchronous algorithm; C1 is the final winner. Intervals between message arrivals and departures not drawn to scale.

In Figure 7 (example 2), C1 and C2 are contenders, and M1 and M2 are mediators. Let the random number generated by C1 be less than that generated by C2. As shown in the figure, C1 gets positive responses from its mediators, but before it can send a potential-winner message, M2 enters a post-safe state. When C2, which has a larger random number, sends a request, M2 sends a positive response immediately to C2 and a negative response to C1. This has a cascading effect, as C1 sends a decline to the other mediators (here M1) that had previously sent it a positive response. M1 is then free to respond positively to the request it has delayed, after the receipt of the decline from C1.

Fig. 7. An example (example 2) of the second phase of the asynchronous algorithm; intervals between message arrivals and departures not drawn to scale.

Fig. 8. An example of the second phase of the asynchronous protocol (example 3). Intervals between message arrivals and departures not drawn to scale.

In Figure 8 (example 3), let the random number generated by C1 be the largest. The response to the request from C1 is delayed by M1, since it is in a safe state. Before the end of the safe state, M1 receives a potential-winner message from the contender (say C3) to which it had sent a positive response earlier. Therefore, a negative response is sent to C1. In turn, C1 sends a decline to M2, from which it had received a positive response. Note that M2 has already sent a negative response to another contender, C2, whose random number value is lower than that generated by C1. A delay in the response to C2 would not help, because a request from C2 must intersect with a request from C3 at some mediator (the probability that the sets of mediators generated by two different contenders intersect is high), and C2 will receive a negative response from that mediator as well (given that a potential-winner message was received by M1 from C3 when it was in a safe state).

4.3 Analysis of the asynchronous protocol

We prove that the bounds established for the synchronous protocol hold for the asynchronous case as well.

Theorem 6. The number of rounds in the protocol is O(log n).

Proof: In the first phase of the asynchronous protocol, the contenders may arrive at distinct points in time. A contender can proceed to the next round if it sends requests to mediators that have not received a request from any other contender for the same round. Unlike the synchronous case, where none of the contenders involved in a collision proceeds to the next round, here the contender that arrives first at all the mediators for a specific round proceeds to the next round (in addition to the contenders that sent requests to unique mediators). The expected number of contenders in round j + 1 can be calculated as:

E[X_{j+1} | X_j = |χ_j|] ≤ Σ_{i=1}^{|χ_j|} e^{−β_j²(i−1)/n}

= Σ_{t=0}^{|χ_j|−1} e^{−β_j² t/n}   where t = i − 1

= 1 + Σ_{t=1}^{|χ_j|−1} e^{−β_j² t/n}

Since e^{−β_j² t/n} is a monotonically decreasing function of t, we can approximate E[X_{j+1} | X_j = |χ_j|] by the following integral:

E[X_{j+1} | X_j = |χ_j|] ≈ 1 + ∫_0^{|χ_j|−1} e^{−β_j² t/n} dt ≈ n/β^j

Since the number of contenders decreases on average by a factor of β, and there is a single round in the second phase, the number of rounds in the protocol is O(log n). □
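The sum-to-integral step in the proof can be sanity-checked numerically. The sketch below uses illustrative values of n and β_j and our own function names; it compares the exact sum with the bound 1 + ∫_0^{|χ_j|−1} e^{−β_j² t/n} dt ≤ 1 + n/β_j²:

```python
import math

def expected_next(chi, beta_j, n):
    # E[X_{j+1} | X_j = chi]: contender i survives round j iff its beta_j
    # mediators miss those of the i-1 earlier arrivals, prob ~ e^{-beta_j^2 (i-1)/n}
    return sum(math.exp(-beta_j ** 2 * (i - 1) / n) for i in range(1, chi + 1))

def integral_bound(chi, beta_j, n):
    # 1 + integral_0^{chi-1} e^{-beta_j^2 t / n} dt, in closed form
    a = beta_j ** 2 / n
    return 1 + (1 - math.exp(-a * (chi - 1))) / a

n = 100_000
chi, beta_j = n, 10        # e.g. round 1: all n contenders, 10 mediators each
s = expected_next(chi, beta_j, n)
b = integral_bound(chi, beta_j, n)
assert s <= b <= 1 + n / beta_j ** 2   # sum <= integral bound <= 1 + n/beta_j^2
```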

Theorem 7. The total number of messages in the protocol sent by all nodes is O(n).

Proof: The expected number of messages, N, sent in the first phase of the protocol is given by:

N = E[Σ_{j=1}^{w−1} X_j β_j] = Σ_{j=1}^{w−1} E[X_j] β_j < nβ Σ_{j=1}^{w−1} 1/β^{j/2}

Since β > 1, the above summation converges. The number of messages sent in the second phase of the protocol is O(log⁶ n · √(n ln n)) = o(n). Therefore, the number of messages sent by all nodes in the system is O(n). □
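Since β > 1, the series Σ_j β^{−j/2} is geometric. With an illustrative choice β = 4 (the names below are our own), the first-phase bound nβ Σ_{j≥1} β^{−j/2} evaluates to roughly 4n:

```python
# Partial sums of the first-phase message bound N < n * beta * sum_j beta^(-j/2)
beta, n = 4.0, 100_000
bound = n * beta * sum(beta ** (-j / 2) for j in range(1, 200))
# with beta = 4 the ratio is 1/2, so the series sums to 1 and the bound is ~4n
assert abs(bound - 4 * n) < 1e-3 * n
```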

Theorem 8. It takes a maximum of 7δ units for a contender in the second phase of the protocol to know whether it is a winner or not.

Proof: In the worst case, a request from a contender can take δ units. The mediator that received the request may have entered the safe state at approximately the same time the request was received. The mediator may respond positively after 3δ units (or, in the worst case, negatively after 3δ units, if the earlier contender sent a potential-winner message). The response from the mediator to the contender can take at most δ units. Therefore, a contender need not wait more than 5δ units of time after it sends a request. If the contender gets a positive response from all its mediators, it sends a potential-winner message, which takes at most δ units, and the response from the mediator takes δ units. Accounting for the time taken by all the messages, a contender need not wait more than 7δ units in the worst case to know whether it is a winner or not. □

Theorem 9. A contender will know whether it is a winner or not within O(δ log n) time.

Proof: For every contender, the time to know whether it is a winner or not is bounded by the total time taken for the first and second phases. There are O(log n) rounds in the first phase (by Theorem 6), and a contender takes at most O(δ log n) time there, because in every round of the first phase a contender spends 2δ time units communicating with its mediators. By Theorem 8, a contender takes no more than 7δ units in the second phase. Hence, a contender will know whether it is a winner or not in O(δ log n) time. □

Theorem 10. There is exactly one contender remaining at the end of the protocol w.h.p.

Proof: A contender can become the final winner only when it has sent potential-winner messages to its mediators. This can happen if and only if all its mediators sent a positive response. There are two states, viz. idle and post-safe, in which a mediator can send a positive response to a request. After sending the potential-winner messages, a contender can either win or lose. There are two scenarios in which two or more contenders receive positive responses from all of their mediators.

• Scenario 1: More than one contender receives positive responses from all its mediators while the mediators are in the idle state. This can happen only if the sets of mediators chosen by two distinct contenders do not intersect, which happens with probability 1/n.

• Scenario 2: Since the sets of mediators intersect w.h.p., two contenders might have received a positive response from the same mediator if and only if the mediator was in an idle/post-safe state for one contender and a post-safe state for the other. A mediator will not send a positive response in a safe or close-safe state, and the most recent positive response is the outstanding one, replacing the earlier positive response. The potential-winner message from the contender that received a positive response from the mediator (when it was in an idle state) will therefore draw a negative response from the mediator.

Therefore, at any point during the execution of the protocol, by Theorem 3, a mediator maintains only one contender as a potential winner, and when a set of √(n ln n) mediators hold the same contender as a potential winner, the corresponding contender becomes the final winner. □

5 Handling Failures

Failures are an integral part of large-scale distributed systems. In this section, we examine the impact of failures of mediators in our synchronous protocol. Contenders can also fail. However, as long as there is at least one contender that has not failed, a leader gets elected. In the event of all contenders failing, or the last remaining contender failing, the scenario is analogous to any other election system wherein the elected leader fails; this issue needs to be handled within the application framework. We assume fail-stop failures, i.e., a process can stop responding at any point in time, based on the failure probability. A process can fail either before or after responding to a contender. If a mediator fails after responding to a contender, the mediator has already arbitrated, and this does not pose a problem for the correct functioning of the algorithm. Consequently, we address the issue of mediators failing before they have sent any message to contenders.

In the first phase of the protocol, when a request is sent by a contender to m mediators, it expects responses from each of the mediators. When a mediator to which the contender sends a message fails, the contender will not receive any response from that mediator as to whether it can proceed to subsequent rounds. We can apply two different approaches in this scenario:

1. A contender can proceed to the next round as long as it does not receive a negative response.

2. A contender can proceed to the next round if and only if it receives positive responses from all the mediators.

If we adopt the first approach, the number of contenders that proceed to subsequent rounds may be greater than the number proceeding when there are no failures. This affects the message and round complexity bounds. In the second approach, the number of contenders proceeding in the face of failures will be fewer than when there are no failures. This can lead to a case where the number of contenders in successive rounds decreases by a large factor, resulting in a lower probability of having at least one contender at the end of the first phase of the protocol. This approach does, however, preserve the message and round complexity bounds. We adopt the second approach and identify the operating range over which our algorithm is correct with high probability by providing bounds on the failure probability of processes.

Theorem 11. In the first phase (1 ≤ j < w), for α < 1/β_j, the decrease in the number of contenders proceeding to the subsequent round (j + 1) is bounded by a factor of 1/e^{αβ_j} (see Table 2 for the definitions of the parameters).

Proof: Let X_j be the number of contenders that proceed to round j. Then

E[X_{j+1} | X_j = |χ_j|] = |χ_j| (1 − β_j/n)^{β_j(|χ_j|−1) + nα}    (1)

≤ |χ_j| e^{−β_j²(|χ_j|−1)/n − αβ_j}

The difference between this expectation of X_{j+1} and the one in Lemma 1 is the addition of nα in the exponent of (1). We do this to account for the failure of mediators and to let a contender proceed to the next round if and only if it has not sent a message to one of the failed processes.

The expected number of contenders proceeding to round j + 1 is given by |χ_j| / (2e^{αβ_j}). When β_j α < 1, between |χ_j|/(2e) and |χ_j|/2 contenders proceed to round j + 1 on average. Therefore, when α < 1/β_j, the decrease in the number of contenders proceeding to subsequent rounds is limited. □
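The stated range follows because 1/(2e^{αβ_j}) decreases from 1/2 at αβ_j = 0 toward 1/(2e) as αβ_j → 1. A quick numerical check (the helper name is our own):

```python
import math

def proceed_fraction(alpha_beta):
    # Fraction of contenders expected to proceed under failure rate alpha:
    # |chi_j| / (2 * e^{alpha * beta_j}) contenders, i.e. a factor 1/(2 e^{x})
    return 1.0 / (2.0 * math.exp(alpha_beta))

samples = [k / 100 for k in range(100)]      # alpha * beta_j ranging over [0, 1)
fracs = [proceed_fraction(x) for x in samples]
assert all(1 / (2 * math.e) < f <= 0.5 for f in fracs)
```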

It is necessary that every contender in the second phase intersect with the other contenders in the system. In the second phase of the protocol, more messages are sent to negate the presence of failures in the system; i.e., every contender in the second phase needs to get responses from √(n ln n) mediators. Given the bounds on α for the first phase, the message complexity of the second phase is maintained. Since α is the failure probability, the expected number of failed processes is nα. We use n − nα instead of n in calculating the number of rounds in the first phase.

Theorem 12. If αβ_{w−1} < 1, a single leader is elected with probability 1 − O(e^{αβ_{w−1}} / log⁶ n), the message complexity is O(n), and the round complexity is O(log n).

Proof: We specify the modifications to the earlier proof without failures. Noting that β_i is a monotonically increasing function of i, we have β_{w−1} > β_0. From Theorem 11, applying the method used in obtaining Lemma 1, we restate Lemma 1 as follows:

1. (n / (2^{i−1} e^{αβ_{i−1}})) [g(log² n)]^{i−1} ≤ E[X_i] ≤ (n / (2^{i−1} e^{αβ_{i−1}})) [h(log² n)]^{i−1}

2. P(ℓ_i < X_i < u_i) ≥ 1 − O(e^{αβ_{w−1}} / log⁶ n)

The result follows. The round and message complexities can be shown using the proofs of Theorems 4 and 5, respectively. □

6 Applications and Relevance to Current Distributed Systems

For many applications in large-scale distributed systems, probabilistic guarantees are sufficient as a tradeoff for scalability. One such application, which forms our larger research and development goal, is a versioning-based distributed file system called Plethora [6]. In this and related systems, two processes contending for a common data object may modify the object concurrently, each assuming it has exclusive access. If both of these contenders commit their changes, a branch in the version tree is created and the two committed versions are installed as siblings in the version tree. While these versions may subsequently be reconciled, it is desirable that the number of such branches in the version tree be minimized. In this scenario, failed mutual exclusion (multiple processes gaining access to a mutually exclusive resource) merely results in a branch in the version tree. Minimizing the probability of such an occurrence, while minimizing the associated overhead, can be achieved using the protocol presented in this paper. Similarly, duplicate elimination of files in distributed stores [7] can be efficiently realized using the algorithm presented here. In this section, we discuss issues that need to be addressed for a practical realization of the algorithm in realistic networks.

One of the parameters of our protocol is the size of the network. An estimate of the number of processes in the system is needed to determine the number of messages that need to be sent in each round of the protocol. A number of researchers have addressed the problem of estimating network size [2,14,24]. For example, in [14], Horowitz et al. present an estimation scheme that allows a peer to estimate the size of its network based only on local information, with constant overhead, by maintaining a logical ring. Our approach does not, however, require knowledge of the number of contenders.

Random walks are typically used to obtain uniform samples. The Metropolis-Hastings algorithm [3,13,22] provides a method to hasten the mixing time of a random walk to reach a stationary distribution and can be applied to different graphs. Gkantsidis et al. [10] show that taking successive steps of a random walk on an expander graph from a random point is equivalent to uniform sampling in the entire graph. King and Saia [16] present a non-random-walk approach to choosing a random peer in a structured network. These results provide the needed algorithms for uniformly selecting mediators. Uniform sampling has many other applications [16].

7 Simulation Results

To explore the performance of our approach (RE), we simulated it on a network of 50,000 processes with power-law connectivity, i.e., the degree distribution of the processes follows a power-law distribution. To generate the network, we first generate the degrees of the processes according to the power-law distribution with parameter 0.8 and then connect the processes randomly. We assume that a fraction of the processes participate as contenders and that these contenders are selected uniformly at random; we vary this fraction from 1% to 50%. These percentages are chosen to show the number of messages in the system as a function of the percentage of processes participating as contenders. We provide a comparison with the probabilistic quorum (PQ) approach, which is equivalent to the second phase of the RE approach. The simulations are performed over the asynchronous versions of the respective approaches.
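The network construction can be sketched as follows. This is our own illustrative construction (a configuration-model-style pairing of degree stubs, with a degree cutoff we chose), not the authors' simulator:

```python
import random

def power_law_degrees(n, exponent, max_deg, rng):
    # P(deg = d) proportional to d^(-exponent), for d = 1..max_deg
    weights = [d ** (-exponent) for d in range(1, max_deg + 1)]
    return rng.choices(range(1, max_deg + 1), weights=weights, k=n)

def random_power_law_graph(n, exponent=0.8, max_deg=50, seed=1):
    rng = random.Random(seed)
    degrees = power_law_degrees(n, exponent, max_deg, rng)
    # one "stub" (edge endpoint) per unit of degree, paired at random
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(stubs)
    if len(stubs) % 2:
        stubs.pop()              # drop one stub so stubs pair up evenly
    edges = set()
    for u, v in zip(stubs[::2], stubs[1::2]):
        if u != v:               # keep simple edges: no self-loops
            edges.add((min(u, v), max(u, v)))
    return degrees, edges

degrees, edges = random_power_law_graph(1000)
```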

7.1 Message Overhead

Fig. 9. Total number of messages in the system vs percentage of contenders.

We first compare the approaches with respect to the total number of messages exchanged in the system. Fig. 9 shows the number of messages for the two leader election approaches as the percentage of contenders varies. The probabilistic quorum approach does not scale well as the percentage of contenders in the network increases. When 50% of the processes are contenders, the number of messages in the PQ approach is more than one order of magnitude greater than that of our approach.

7.2 Accuracy of the Protocols

The final goal of both protocols (RE and PQ) is to find a unique leader in the system. To evaluate how closely the protocols meet this goal, we fix the percentage of contenders at 1% of the system size (500 processes). In ten thousand runs with different seeds, the protocols always produced a unique leader.

8 Related Work

A direct application of the probabilistic quorum algorithm developed by Malkhi et al. [21] to the leader election problem would result in O(n√(n ln n)) messages and a single round. In our protocol, we increase the number of rounds to O(log n) but reduce the number of messages sent by all processes across all rounds to O(n). This reduction is possible because the winner at each round in the first phase is chosen based on the non-intersection of its mediator set with any other contender's mediator set. We use the probabilistic quorum algorithm in the last round with a reduced, O(√(log n))-sized set of contenders. In this case, the O(n) message complexity is still preserved.

In [20], Maekawa proposes an O(√n) deterministic algorithm for mutual exclusion in distributed systems, where the distributed sites are arranged in a grid. A process that participates in the mutual exclusion algorithm wins if it can get exclusive access to the processes on an arbitrarily selected single row and column. A direct application of this result to leader election results in O(n√n) messages.

In [1], Agrawal et al. improve upon Maekawa's algorithm for mutual exclusion by realizing quorum sets as sites lying on paths similar to the trajectories of billiard balls. The size of the quorum generated is √(2n), compared to 2√n for Maekawa's algorithm. Even though their mutual exclusion algorithm can be extended to solve the leader election problem deterministically, the message complexity of their method is of the same order as that of Maekawa's algorithm (O(n√n)) and hence suboptimal compared to the algorithm presented in this paper.

Gupta et al. [12] propose a probabilistic leader election protocol for large groups. The complexity of the protocol is O(K^4 n) messages per election round, where K is the filter value on the number of processes that can participate in the relay phase. The success of the algorithm is guaranteed only with a probability that depends on various system parameters, viz., the view probability (the probability that a random process has another random process in its view), the K value, and the system size, among others. In their paper, it is necessary that the hash function and the K value, used to decide whether a process will participate in the relay phase, be the same (or approximately the same) for all processes in the group. While the hash function can be the same for all processes, agreement on the K value is not straightforward because it determines the number of processes that participate in the relay phase and hence is a function of the group size. If the K value is a predetermined constant, it is not clear how this method can be adapted to varying group sizes. Also, while simulation results in the paper show that the number of election rounds is quite small, the number of rounds as a function of the system size is not clearly specified. Furthermore, the probability of success of the protocol depends significantly on the filter value, the view probability, and other system parameters. However, a significant contribution of their work is in recognizing the usefulness of probabilistic approaches in large-scale systems, where correctness can occasionally be sacrificed for scalability. We also design a probabilistic algorithm such that the probability of electing exactly one leader tends to unity as the system size approaches infinity. Furthermore, our message complexity is linear in the system size, and no assumptions are made with respect to the knowledge of any constants. While the exact number of messages exchanged depends on the number of initial contenders (or the group size), knowledge of the number of contenders is irrelevant for our approach.
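The hash-and-filter participation test described above can be approximated by a short sketch; the hash function, the cutoff form `value < K`, and all parameter values here are illustrative assumptions, not the protocol's actual constants.

```python
import hashlib

def participates(process_id: str, round_no: int, K: int, M: int = 2**32) -> bool:
    """Illustrative filter: a process relays only if its hashed value falls
    below the cutoff K, so roughly a K/M fraction of processes participate.
    (Hash choice and cutoff form are assumptions for illustration.)"""
    digest = hashlib.sha256(f"{process_id}:{round_no}".encode()).digest()
    value = int.from_bytes(digest[:4], "big")  # roughly uniform in [0, M)
    return value < K

# With K/M = 1/64, about n/64 of n processes pass the filter on average.
ids = [f"proc-{i}" for i in range(10_000)]
survivors = [p for p in ids if participates(p, round_no=1, K=2**26)]
```

The sketch makes the agreement problem concrete: every process must use the same hash input format and the same K, and choosing K well requires some knowledge of the group size.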

In [28], Schooler et al. propose two interesting leader-election algorithms for multicast groups. Both algorithms (LE-A and LE-S) use announcements to decide the leader. In the first algorithm (LE-A), a process initially announces itself as the leader to the group and listens for announcements from the other members of the group. Leaders are selected based on their identifiers: the process with the greatest identifier is selected as leader. On receiving an announcement, a process decides locally whether the current leader must be changed, based on the process identifier received in the announcement. The second algorithm (LE-S) includes a suppression phase to avoid the initial announcement. In LE-S, processes wait for a random time before sending their leader announcements. An extension to the algorithm makes the suppression wake-up timer value the actual "id" that serves as the basis of the resolution algorithm; thus, processes with smaller timer values announce their leadership first and are appointed leaders earlier. The LE-S approach greatly reduces the number of messages, since it probabilistically suppresses unnecessary leadership announcements. This reduction in the number of messages shares a goal with the first phase of our approach, where the number of contenders is reduced in successive rounds. In [28], the value upon which the leadership decision is based is selected from a range that deliberately minimizes collisions. The contenders in our approach also generate a random number from a range ([0..n^4]) for the sake of uniqueness. Their approach has the potential, on the one hand, to lead to more messages in the network, yet on the other hand to reduce the likelihood of implosion for large n when coupled with suppression. A valuable aspect of their work is that it considers message loss as a fundamental concern for distributed leader election algorithms.
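The suppression idea in LE-S can be sketched as a small simulation; the function name, the timer range, and the zero-latency assumption below are ours, not details from [28].

```python
import random

def le_s_round(processes, max_delay=1.0, rng=None):
    """Sketch of LE-S-style suppression: each process draws a random
    wake-up delay, the first timer to fire announces leadership, and all
    later timers are suppressed.  The delay doubles as the tie-breaking
    'id'.  Idealized zero-latency model: exactly one announcement is sent."""
    rng = rng or random.Random()
    delays = {p: rng.random() * max_delay for p in processes}
    leader = min(delays, key=delays.get)  # earliest announcer wins
    announcements = 1                     # all other announcements suppressed
    return leader, announcements

procs = [f"p{i}" for i in range(100)]
leader, msgs = le_s_round(procs, rng=random.Random(7))
```

In a real network, processes whose timers fire before the first announcement arrives still send redundant messages, so the actual message count lies between this ideal of one and the n announcements of LE-A.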

Mullender and Vitanyi introduce the "distributed matchmaking" problem in [23] to study distributed control issues arising in name servers, mutual exclusion, and replicated data management, all of which involve making matches between processes. For a general network, they show that a connected graph of n processes can be divided into O(√n) connected subgraphs of diameter O(√n) each. Our algorithm (in the first phase) can be viewed as constructing such subgraphs step by step instead of in a single round, which results in a better message complexity.
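To illustrate the flavor of such O(√n) constructions, the classical √n × √n grid strategy (a standard instance of matchmaking, not the authors' general-graph construction) guarantees that any "post" and any "query" intersect while each touches only √n processes:

```python
import math

def post_set(p, side):
    """A poster advertises to every process in its grid row: side = sqrt(n) messages."""
    row = p // side
    return {row * side + c for c in range(side)}

def query_set(p, side):
    """A seeker asks every process in its grid column: side = sqrt(n) messages."""
    col = p % side
    return {r * side + col for r in range(side)}

n = 64                      # assume n is a perfect square for simplicity
side = math.isqrt(n)
# Any row and any column of the grid share exactly one process, so every
# (poster, seeker) pair is matched via that intersection with 2*sqrt(n) messages.
meet = post_set(5, side) & query_set(42, side)
```

The row/column intersection is the matchmaking analogue of a quorum intersection, which is also the idea behind the probabilistic quorums used in our second phase.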

There has been significant work in developing mutual exclusion algorithms for shared memory models [17,15]. These could be extended for purposes of leader election. In [17], Kushilevitz and Rabin correct the randomized mutual exclusion algorithm presented in [25]; the authors develop a randomized algorithm for mutual exclusion in a shared memory model. In [15], Joung proposes a new problem called Congenial Talking Philosophers to model the mutual exclusion problem and provides several criteria to evaluate solutions to the problem. Our leader election algorithm is for a large-scale distributed system where it is not feasible for processes to share memory.


9 Conclusion

This paper presents the design of an efficient randomized algorithm for leader election in large-scale distributed systems. The algorithm guarantees correctness with high probability and has optimal message complexity O(n). To our knowledge, this is the first result providing high probabilistic guarantees with optimal message complexity for a general topology. We propose variants of the algorithm for synchronous as well as asynchronous environments. We give an analysis of the correctness of the algorithm and bounds on the number of messages and the number of rounds.

Acknowledgements. We sincerely thank Prof. Vassos Hadzilacos and the anonymous reviewers for their insightful comments, which helped us improve the quality of the paper.

References

1. D. Agrawal, O. Egecioglu, and A.E. Abbadi. Billiard quorums on the grid. Inf. Process. Lett., 64(1):9–16, 1997.

2. M. Bawa, H. Garcia-Molina, A. Gionis, and R. Motwani. Estimating Aggregates on a Peer-to-Peer Network. Technical Report, CS Department, Stanford University, 2003.

3. S. Boyd, P. Diaconis, and L. Xiao. Fastest mixing Markov chain on a graph. SIAM Review, 46(4):667–689, Dec 2004.

4. I. Cidon and O. Mokryn. Propagation and leader election in a multihop broadcast environment. In DISC ’98: Proceedings of the 12th International Symposium on Distributed Computing, pages 104–118, London, UK, 1998. Springer-Verlag.

5. D. Dolev, C. Dwork, and L. Stockmeyer. On the minimal synchronism needed for distributed consensus. J. ACM, 34(1):77–97, 1987.

6. R.A. Ferreira, A. Grama, and S. Jagannathan. Plethora: A Locality Enhancing Peer-to-Peer Network. Journal of Parallel and Distributed Computing (in print).

7. R.A. Ferreira, M.K. Ramanathan, A. Grama, and S. Jagannathan. Randomized Protocols for Duplicate Elimination in Peer-to-Peer Storage Systems. In Proceedings of the Fifth IEEE International Conference on Peer-to-Peer Computing, Konstanz, Germany, Sep 2005.

8. C. Fetzer and F. Cristian. A highly available local leader election service. IEEE Trans. Software Eng., 25(5):603–618, 1999.

9. G.N. Frederickson and N.A. Lynch. Electing a leader in a synchronous ring. J. ACM, 34(1):98–115, 1987.

10. C. Gkantsidis, M. Mihail, and A. Saberi. Random walks in peer-to-peer networks. In Proceedings of IEEE INFOCOM 2004, Hong Kong, March 2004.

11. Gnutella. http://gnutella.wego.com/.

12. I. Gupta, R. Renesse, and K.P. Birman. A probabilistically correct leader election protocol for large groups. In DISC ’00: Proceedings of the 14th International Conference on Distributed Computing, pages 89–103, London, UK, 2000. Springer-Verlag.

13. W. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109, 1970.

14. K. Horowitz and D. Malkhi. Estimating network size from local information. Inf. Process. Lett., 88(5):237–243, December 2003.

15. Y. Joung. Asynchronous group mutual exclusion. Distributed Computing, 13(4):189–206, 2000.

16. V. King and J. Saia. Choosing a random peer. In PODC ’04: Proceedings of the twenty-third annual ACM symposium on Principles of distributed computing, pages 125–130, New York, NY, USA, 2004.

17. E. Kushilevitz and M.O. Rabin. Randomized mutual exclusion algorithms revisited. In PODC ’92: Proceedings of the eleventh annual ACM symposium on Principles of distributed computing, pages 275–283, 1992.

18. S. Lin, Q. Lian, M. Chen, and Z. Zhang. A practical distributed mutual exclusion protocol in dynamic peer-to-peer systems. In IPTPS, pages 11–21, 2004.

19. N.A. Lynch. Distributed Algorithms. Morgan Kaufmann Publishers, Inc., 1996.

20. M. Maekawa. A √N algorithm for mutual exclusion in decentralized systems. ACM Trans. Comput. Syst., 3(2):145–159, 1985.

21. D. Malkhi, M. Reiter, A. Wool, and R. Wright. Probabilistic quorum systems. Information and Computation, 170(2):184–206, November 2001.

22. N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equations of state calculations by fast computing machines. J. Chem. Phys., 21:1087–1092, 1953.

23. S.J. Mullender and P.M.B. Vitanyi. Distributed matchmaking. Algorithmica, 3:367–391, 1988.

24. D. Psaltoulis, D. Kostoulas, I. Gupta, K. Birman, and A. Demers. Practical algorithms for size estimation in large and dynamic groups. Spinglass project at Cornell.

25. M.O. Rabin. N-process mutual exclusion with bounded waiting by 4 log2 N-valued shared variable. Journal of Computer and System Sciences, 25(1):66–75, 1982.

26. S. Ratnasamy, P. Francis, M. Handley, R.M. Karp, and S. Shenker. A scalable content-addressable network. In SIGCOMM ’01: Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications, pages 161–172, New York, NY, USA, 2001.

27. A.I.T. Rowstron and P. Druschel. Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In Middleware 2001: Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms Heidelberg, pages 329–350, London, UK, 2001. Springer-Verlag.

28. E.M. Schooler, R. Manohar, and K.M. Chandy. An analysis of leader election for multicast groups. Technical Report, AT&T Labs-Research, Menlo Park, CA, Feb 2002.

29. I. Stoica, R. Morris, D. Liben-Nowell, D.R. Karger, M.F. Kaashoek, F. Dabek, and H. Balakrishnan. Chord: a scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Trans. Netw., 11(1):17–32, 2003.

30. W. Szpankowski. Average Case Analysis of Algorithms on Sequences. Wiley, 2001.

31. A.S. Tanenbaum and M.V. Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, 2002.

32. G. Tel. Introduction to Distributed Algorithms. Cambridge University Press, 1994.