How to Choose a Timing Model?
Idit Keidar and Alexander Shraer
Technion – Israel Institute of Technology
How do you survive failures and achieve high availability?
Replication
State Machine Replication
(Figure: replicas processing operations a, b, c)
• Replicas are identical deterministic state machines
• Processing operations in the same order keeps replicas consistent
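As a quick illustration (my own sketch, not from the talk): two deterministic replicas that apply the same operations in the same order always end up in the same state.

```python
# Illustration (not from the talk): deterministic replicas that process the
# same operations in the same order remain consistent.

class CounterReplica:
    """A trivial deterministic state machine: a counter."""
    def __init__(self):
        self.state = 0

    def apply(self, op):
        if op == "inc":
            self.state += 1
        elif op == "double":
            self.state *= 2

log = ["inc", "inc", "double"]     # agreed-upon operation order
r1, r2 = CounterReplica(), CounterReplica()
for op in log:
    r1.apply(op)
    r2.apply(op)
# Both replicas hold the same state: (1+1)*2 = 4
```

The hard part is agreeing on the order of the log – which is exactly what consensus provides.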
Consensus
• Building block for state machine replication
• Each process has an input and must decide on an output so that:
– Agreement: all decisions are the same
– Validity: the decision is the input of some process
– Termination: eventually, all correct processes decide
Basic Model
• Message passing
• Channels between every pair of processes
– do not create, duplicate, or alter messages (integrity)
• Failures
• What about timing?
Synchronous Model
(Figure: rounds with a known bound on message delay – messages a, b, c, d all arrive within their round)
• Very convenient for algorithms
– understanding performance
– early decision with no/few failures
Synchronous Model: Limitation
• Requires very conservative timeouts
– in practice: avg. latency < max. latency / 100 [Cardwell, Savage, Anderson 2000], [Bakr, Keidar 2002]
(Figure: a long timeout is needed to cover the worst-case latency)
Asynchronous Model
• Unbounded message delay
• Much more practical
• Fault-tolerant consensus is impossible [FLP85]
Eventually Stable (Indulgent) Models
• Initially asynchronous
– for an unbounded period of time
• Eventually reach stabilization
– GST (Global Stabilization Time)
– following GST, certain assumptions hold
• Examples:
– ES (Eventual Synchrony) – all links are ◊timely [Dwork, Lynch, Stockmeyer 88]
– failure detectors: Ω (eventual leader), ◊S [Chandra, Toueg 96], [Chandra, Hadzilacos, Toueg 96]
Why Eventual Stabilization?
• Because “eventually” formally models “most of the time” (in stable periods).
• In practice, stability does not have to last forever, just “long enough” for the algorithm (TA)
• TA depends on our synchrony assumptions!
Our Goals
1. Understand the relationship between:
– assumptions (number of timely links, with or without Ω, etc.) and
– performance of algorithms that exploit them
• In runs that eventually satisfy these assumptions
– unlike stable runs in previous work
• And only these assumptions– unlike synchronous runs in previous work
2. Understand how message complexity affects performance.
Algorithm for process pi:

upon receive m:
    add m to M (msg buffer)

upon end-of-round:    // waiting condition controlled by env.
    FD ← oracle(k)
    if k = 0 then ⟨out_msg, Dest⟩ ← initialize(FD)
    else ⟨out_msg, Dest⟩ ← compute(k, M, FD)
    k ← k + 1
    enable sending of out_msg to Dest
GIRAF – The Generic Algorithm
Your pet algorithm here
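The GIRAF round structure can be mirrored in Python as a rough illustration (my own sketch, not the authors' code; `oracle`, `initialize`, and `compute` stand in for the pluggable parts of "your pet algorithm"):

```python
# Minimal sketch of a GIRAF process loop (hypothetical API: `oracle`,
# `initialize`, and `compute` are the pluggable parts).

class GirafProcess:
    def __init__(self, pid, oracle, initialize, compute):
        self.pid = pid
        self.k = 0             # current round number
        self.M = []            # message buffer
        self.oracle = oracle
        self.initialize = initialize
        self.compute = compute

    def on_receive(self, m):
        self.M.append(m)       # buffer every received message

    def on_end_of_round(self):
        # The environment decides when the round ends (the waiting condition).
        fd = self.oracle(self.k)
        if self.k == 0:
            out_msg, dest = self.initialize(fd)
        else:
            out_msg, dest = self.compute(self.k, self.M, fd)
        self.k += 1
        return out_msg, dest   # the environment sends out_msg to dest
```

Different timing models are then obtained by constraining when the environment triggers `on_end_of_round` and which messages make it into `M` in time.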
Defining Properties in GIRAF
• Environment can have:
– perpetual properties, φ
– eventual properties, ◊φ
• In every run r, there exists a round GSR(r)
• GSR(r) – the first round from which:
– no process fails
– all eventual properties hold in each round
Example Communication Properties
• timely link in round k: pd receives the round-k message of ps in round k
– provided pd is correct and ps executes round k (its end-of-round occurs in round k)
• j-source: same j timely outgoing links in every round
• j-sourcev: j timely outgoing links in every round (can vary in each round)
• j-destination: same j incoming timely links from correct processes in every round
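The j-source vs. j-sourcev distinction can be made concrete with a small sketch (my own illustration, assuming a trace that records, per round, which links delivered their message within the round):

```python
# Sketch (my own illustration): given a trace mapping each round to the set
# of links (sender, receiver) whose message arrived within that round, check
# the j-source vs. j-source_v distinction for a sender s.

def j_source_v(trace, s, j):
    # At least j timely outgoing links in every round; the set may vary.
    return all(sum(1 for (a, b) in links if a == s) >= j
               for links in trace.values())

def j_source(trace, s, j):
    # The SAME j timely outgoing links in every round.
    per_round = [{b for (a, b) in links if a == s}
                 for links in trace.values()]
    return len(set.intersection(*per_round)) >= j
```

A sender that reaches a (possibly different) majority each round satisfies majority-sourcev but not majority-source.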
Example Oracle Properties
• leader: there is a correct process pi s.t. for every round k and every process pj: oraclej(k) = i
– its eventual version, ◊leader, corresponds to the failure detector Ω
Timing Models
• ES (Eventual Synchrony) [Dwork et al. 88]
– all links between correct processes are ◊timely
– consensus in 3 rounds (optimal) [Dutta et al. 04]
• ◊AFM (All-From-Majority), simplified:
– every correct process is ◊majority-destinationv and ◊majority-sourcev
– consensus in 5 rounds [Keidar & Shraer 06]
• ◊LM (Leader and Majority):
– Ω; the leader is ◊n-source, and every correct process is ◊majority-destinationv
– consensus in 3 rounds [Keidar & Shraer 06]
(Slide callouts: Ω – from some round onward, one process is trusted by all (the leader); ◊timely – from some round onward, the link delivers messages in the round they were sent; majority-destinationv – a majority of timely incoming links, where v means the majority can change each round; majority-sourcev – a majority of timely outgoing links.)
New Model: ◊WLM
• Ω; the leader is ◊n-source and ◊majority-destinationv
– unlike ◊LM, where all correct processes must be ◊majority-destinationv
– similar to [Malkhi et al. 05], a little stronger
Previous Work
• Most Ω-based algorithms wait for a majority in each round
• Paxos [Lamport 98] can make progress in ◊WLM
– takes a constant number of rounds in ES
– but how many rounds without ES?
Paxos Run in ES

BallotNum – number of attempts to decide initiated by leaders

(Figure: the Ω leader sends (“prepare”, 21); all processes answer yes and adopt ballot 21; the leader then sends (Commit, 21, v1) and all decide v1.)

(Figure: with a lower ballot, (“prepare”, 2) is answered no by processes whose BallotNum is already higher, e.g. 5 or 20.)
Paxos in ◊WLM (w/out ES)

(Figure: across rounds GSR, GSR+1, GSR+2, GSR+3, the Ω leader's (“prepare”, 2) draws replies no (5), no (8); each retry, e.g. (“prepare”, 9), (“prepare”, 14), is again outrun by higher BallotNums such as 13 and 20.)

Commit takes O(n) rounds!
New Consensus Algorithm for WLM
• Tolerates unbounded periods of asynchrony
• Minority can crash
• Message efficient: O(n) stable-state message complexity
• Achieves global decision in 4 rounds if leader is stable before GSR
– 5 otherwise
Our ◊WLM Algorithm in a Nutshell
• Commit with increasing ballot numbers; decide on a value committed by a majority
– like Paxos, etc.
• Challenge: we don't know all ballots – how do we choose a new one that is the highest?
• Solution: choose it to be the round number
• Challenge: rounds are wasted if a prepare/commit fails
• Solution: pipeline prepares and commits – try in each round
• Challenge: do processes really need to say no?
• Solution: support the leader's prepare even when holding a higher ballot number
– challenge: the higher number may reflect a later decision! Won't agreement be compromised?
– solution: a new field, “trustMe”, ensures the supported leader doesn't miss real decisions: it is set in round k+1 if a majority trusted the leader in round k
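The first idea – choosing the ballot to be the round number – can be sketched as a toy Python fragment (my own simplification, not the paper's protocol): since rounds only increase, a fresh ballot automatically dominates every ballot adopted in an earlier round, so no discovery phase is needed and prepares can be pipelined.

```python
# Toy illustration (my own simplification, not the full ◊WLM protocol) of
# "ballot = round number": ballots adopted in earlier rounds can never
# exceed the current round number, so each new ballot is the highest.

def run_rounds(n_procs, start_ballots, first_round, n_rounds):
    ballots = list(start_ballots)      # each process's highest adopted ballot
    accepted_rounds = []
    for k in range(first_round, first_round + n_rounds):
        b = k                          # leader's fresh ballot = round number
        if all(b >= x for x in ballots):
            ballots = [b] * n_procs    # everyone adopts the leader's ballot
            accepted_rounds.append(k)
    return accepted_rounds
```

Once the round number passes every stale ballot, each subsequent pipelined attempt succeeds – in contrast to the O(n) catch-up of plain Paxos shown above.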
Example Run: GSR=100

(Figure: processes start with mixed BallotNums 1, 5, 20, 8, 13; an earlier commit attempt did not lead to decision. In round GSR the Ω leader sends <Prepare, …, trustMe> while all others prepare with !trustMe; in GSR+1 all commit; by GSR+2 every process has adopted ballot 102. In GSR+3 the leader decides, and in GSR+4 all decide.)
Comparing The Models
Probabilistic Analysis
• Each link is timely with probability p in each round
– independent and identically distributed (IID) Bernoulli random variables
• Other simplifying assumptions:
– synchronous rounds
– no failures
• A good starting point for understanding behavior in real systems
• For each model M, calculate:
– PM – probability that the requirements of M hold in a round
– expected number of rounds until the requirements of M hold long enough
– E(DM) – expected number of rounds until (global) decision in M
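Under the IID assumption such quantities are easy to compute. As an illustration (my own simplified formulas for an ES-style requirement, not necessarily the paper's exact analysis): if a model needs all n(n-1) links timely in a round, that holds with probability q = p^(n(n-1)), and the expected wait for k consecutive such rounds follows the standard run-of-successes formula.

```python
# Illustrative IID computation (my own simplified formulas, not the paper's
# exact analysis): a requirement that all n*(n-1) links be timely holds in a
# round with probability q = p**(n*(n-1)); the expected number of rounds
# until k consecutive such rounds is (1 - q**k) / (q**k * (1 - q)).

def p_all_links(n, p):
    # Probability that every directed link among n processes is timely.
    return p ** (n * (n - 1))

def expected_rounds(q, k):
    # Expected waiting time for k consecutive successes of probability q.
    return (1 - q ** k) / (q ** k * (1 - q))
```

For n=8 and p=0.97, q is already below 0.2, so even a 3-round algorithm waits hundreds of rounds in expectation – which is why the weaker per-leader requirements of WLM pay off.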
Comparing the Models (IID)
Expected number of rounds for global decision (n=8)
ES requires 350 rounds for p=0.97
LAN Measurements
• How frequent is a stable round in each model?
– compare measured PM to the IID prediction
• For IID: p = fraction of timely messages (over all rounds)
– example: for timeout = 0.1ms, p = 0.7; for timeout = 0.2ms, p = 0.976
(Chart callouts: ES is slightly better in practice (a slow round); AFM is slightly worse in practice (a slow node); WLM and LM are better in practice (good leader); WLM rounds are the most frequent!)
GIRAF Implementation for WAN
• Some round synchronization is needed for all models
– in a LAN, computers often have synchronized clocks
• A simple algorithm to implement GIRAF:
– Li[j]: average latency between ni and nj as measured by ni (pings)
– timeout – input parameter

Receiver thread:
upon receive m:
    add m to M (msg buffer)
    if m belongs to round kj > ki: notify sender thread

Sender thread:
send message to peers
wait for timeout
compute next round msg.
upon notify:
    stop waiting; jump to round kj
    (wait duration after the jump: timeout – Li[j])
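A rough Python sketch of this round synchronization (my own, following the structure above but simplified – it omits the Li[j] latency correction): the sender runs timeout-driven rounds, and the receiver cuts a round short when it sees a message from a later round.

```python
# Rough sketch (my own simplification; omits the timeout - Li[j] correction):
# a sender thread runs timeout-driven rounds; the receiver ends the wait
# early when a message from a later round arrives, jumping the round forward.

import threading

class RoundSync:
    def __init__(self, timeout):
        self.timeout = timeout
        self.k = 0                       # current round
        self.M = []                      # message buffer
        self.wake = threading.Event()    # set by the receiver to end the wait
        self.jump_to = None
        self.lock = threading.Lock()

    def on_receive(self, m, round_of_m):
        with self.lock:
            self.M.append(m)
            if round_of_m > self.k:      # a peer is ahead: stop waiting early
                self.jump_to = round_of_m
                self.wake.set()

    def run_one_round(self, send_to_peers):
        send_to_peers(self.k)
        self.wake.wait(self.timeout)     # full timeout unless notified
        with self.lock:
            self.k = self.jump_to if self.jump_to is not None else self.k + 1
            self.jump_to = None
            self.wake.clear()
```

Jumping forward keeps slow nodes loosely in step with fast ones, which is what makes round-based properties measurable on a WAN.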
WAN Measurements
Questions:
• How frequent is a stable round in each model? (PM)
• For each model M, measure time and #rounds until global decision in M
• How to set the timeout?

The experiment:
• 33 runs for each timeout, 300 rounds per run
– a run is represented by the average over 15 different points in the run
• Asynchronous node startup – rounds before the model's first stable round are not counted
Question 1: Stable Rounds (PM)
• Up to 99% of messages arrive by timeout = 350ms
– waiting for 100% would take orders of magnitude longer [Cardwell et al. 98]
(Chart callouts: LM is sensitive to a single slow node – in some runs PLM = 95%, in others PLM = 15%. AFM is consistently low, around 40%. ES is consistently rare for small timeouts, occasionally good for larger ones – sensitive to individual slow messages. WLM rounds are the most frequent (15% better than LM at 160ms), with the lowest variance!)
Question 2: Global Decision
• WLM is best for timeout < 180ms; same as the others for higher timeouts
• The choice of leader matters… with a bad leader, use AFM
Question 3: Choosing the Timeout
• Tradeoff:
– longer timeouts: more stable rounds, fewer rounds until decision
– but each round takes longer, so overall decision time can grow
– the values are right for our system – they might be different for yours
(Chart callouts: fewer rounds, each one longer vs. more rounds, each one shorter)
With their optimal timeouts, WLM is just 80ms worse
Conclusions
• ◊WLM – a new timing model
• New algorithm for ◊WLM
– tolerates unbounded periods of asynchrony
– O(n) stable-state message complexity
– achieves global decision in 4 or 5 rounds
• Thanks to the weak stability requirements, our algorithm performs better than, or comparably to, algorithms that take fewer rounds
– even though those algorithms send more messages (Ω(n²))