How to Choose a Timing Model?
Idit Keidar and Alexander Shraer
Technion – Israel Institute of Technology
How do you survive failures and achieve high availability?
Replication
State Machine Replication
(Figure: replicas processing operations a, b, c)
• Replicas are identical deterministic state machines
• Processing operations in the same order keeps replicas consistent
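As a quick illustration (my own sketch, not from the talk): two deterministic replicas that apply the same operations in the same order always end up in the same state.

```python
# Illustration (not from the talk): deterministic replicas that process the
# same operations in the same order remain consistent.

class CounterReplica:
    """A trivial deterministic state machine: a counter."""
    def __init__(self):
        self.state = 0

    def apply(self, op):
        if op == "inc":
            self.state += 1
        elif op == "double":
            self.state *= 2

log = ["inc", "inc", "double"]     # agreed-upon operation order
r1, r2 = CounterReplica(), CounterReplica()
for op in log:
    r1.apply(op)
    r2.apply(op)
# Both replicas hold the same state: (1+1)*2 = 4
```

The hard part is agreeing on the order of the log – which is exactly what consensus provides.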
Consensus
• Building block for state machine replication
• Each process has an input and must decide on an output so that:
– Agreement: all decisions are the same
– Validity: the decision is the input of some process
– Termination: eventually, all correct processes decide
Basic Model
• Message passing
• Channels between every pair of processes
– do not create, duplicate, or alter messages (integrity)
• Failures
• What about timing?
Synchronous Model
(Figure: rounds with a known bound on message delay – messages a, b, c, d all arrive within their round)
• Very convenient for algorithms
– understanding performance
– early decision with no/few failures
Synchronous Model: Limitation
• Requires very conservative timeouts
– in practice: avg. latency < max. latency / 100 [Cardwell, Savage, Anderson 2000], [Bakr, Keidar 2002]
(Figure: a long timeout is needed to cover the worst-case latency)
Asynchronous Model
• Unbounded message delay
• Much more practical
• Fault-tolerant consensus is impossible [FLP85]
Eventually Stable (Indulgent) Models
• Initially asynchronous
– for an unbounded period of time
• Eventually reach stabilization
– GST (Global Stabilization Time)
– following GST, certain assumptions hold
• Examples:
– ES (Eventual Synchrony) – all links are ◊timely [Dwork, Lynch, Stockmeyer 88]
– failure detectors: Ω (eventual leader), ◊S [Chandra, Toueg 96], [Chandra, Hadzilacos, Toueg 96]
Why Eventual Stabilization?
• Because “eventually” formally models “most of the time” (in stable periods).
• In practice, stability does not have to last forever, just “long enough” for the algorithm (TA)
• TA depends on our synchrony assumptions!
Our Goals
1. Understand the relationship between:
– assumptions (number of timely links, with or without Ω, etc.) and
– performance of algorithms that exploit them
• In runs that eventually satisfy these assumptions
– unlike stable runs in previous work
• And only these assumptions– unlike synchronous runs in previous work
2. Understand how message complexity affects performance.
Algorithm for process pi:

upon receive m:
    add m to M (msg buffer)

upon end-of-round:    // waiting condition controlled by env.
    FD ← oracle(k)
    if k = 0 then ⟨out_msg, Dest⟩ ← initialize(FD)
    else ⟨out_msg, Dest⟩ ← compute(k, M, FD)
    k ← k + 1
    enable sending of out_msg to Dest
GIRAF – The Generic Algorithm
Your pet algorithm here
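The GIRAF round structure can be mirrored in Python as a rough illustration (my own sketch, not the authors' code; `oracle`, `initialize`, and `compute` stand in for the pluggable parts of "your pet algorithm"):

```python
# Minimal sketch of a GIRAF process loop (hypothetical API: `oracle`,
# `initialize`, and `compute` are the pluggable parts).

class GirafProcess:
    def __init__(self, pid, oracle, initialize, compute):
        self.pid = pid
        self.k = 0             # current round number
        self.M = []            # message buffer
        self.oracle = oracle
        self.initialize = initialize
        self.compute = compute

    def on_receive(self, m):
        self.M.append(m)       # buffer every received message

    def on_end_of_round(self):
        # The environment decides when the round ends (the waiting condition).
        fd = self.oracle(self.k)
        if self.k == 0:
            out_msg, dest = self.initialize(fd)
        else:
            out_msg, dest = self.compute(self.k, self.M, fd)
        self.k += 1
        return out_msg, dest   # the environment sends out_msg to dest
```

Different timing models are then obtained by constraining when the environment triggers `on_end_of_round` and which messages make it into `M` in time.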
Defining Properties in GIRAF
• Environment can have:
– perpetual properties, φ
– eventual properties, ◊φ
• In every run r, there exists a round GSR(r)
• GSR(r) – the first round from which:
– no process fails
– all eventual properties hold in each round
Example Communication Properties
• timely link in round k: pd receives the round-k message of ps in round k
– provided pd is correct and ps executes round k (its end-of-round occurs in round k)
• j-source: same j timely outgoing links in every round
• j-sourcev: j timely outgoing links in every round (can vary in each round)
• j-destination: same j incoming timely links from correct processes in every round
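The j-source vs. j-sourcev distinction can be made concrete with a small sketch (my own illustration, assuming a trace that records, per round, which links delivered their message within the round):

```python
# Sketch (my own illustration): given a trace mapping each round to the set
# of links (sender, receiver) whose message arrived within that round, check
# the j-source vs. j-source_v distinction for a sender s.

def j_source_v(trace, s, j):
    # At least j timely outgoing links in every round; the set may vary.
    return all(sum(1 for (a, b) in links if a == s) >= j
               for links in trace.values())

def j_source(trace, s, j):
    # The SAME j timely outgoing links in every round.
    per_round = [{b for (a, b) in links if a == s}
                 for links in trace.values()]
    return len(set.intersection(*per_round)) >= j
```

A sender that reaches a (possibly different) majority each round satisfies majority-sourcev but not majority-source.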
Example Oracle Properties
• leader: there is a correct process pi s.t. for every round k and every process pj: oraclej(k) = i
– its eventual version, ◊leader, corresponds to the failure detector Ω
Timing Models
• ES (Eventual Synchrony) [Dwork et al. 88]
– all links between correct processes are ◊timely
– consensus in 3 rounds (optimal) [Dutta et al. 04]
• ◊AFM (All-From-Majority), simplified:
– every correct process is ◊majority-destinationv and ◊majority-sourcev
– consensus in 5 rounds [Keidar & Shraer 06]
• ◊LM (Leader and Majority):
– Ω; the leader is ◊n-source, and every correct process is ◊majority-destinationv
– consensus in 3 rounds [Keidar & Shraer 06]
(Slide callouts: Ω – from some round onward, one process is trusted by all (the leader); ◊timely – from some round onward, the link delivers messages in the round they were sent; majority-destinationv – a majority of timely incoming links, where v means the majority can change each round; majority-sourcev – a majority of timely outgoing links.)
New Model: ◊WLM
• Ω; the leader is ◊n-source and ◊majority-destinationv
– unlike ◊LM, where all correct processes must be ◊majority-destinationv
– similar to [Malkhi et al. 05], a little stronger
Previous Work
• Most Ω-based algorithms wait for a majority in each round
• Paxos [Lamport 98] can make progress in ◊WLM
– takes a constant number of rounds in ES
– but how many rounds without ES?
Paxos Run in ES

BallotNum – number of attempts to decide initiated by leaders

(Figure: the Ω leader sends (“prepare”, 21); all processes answer yes and adopt ballot 21; the leader then sends (Commit, 21, v1) and all decide v1.)

(Figure: with a lower ballot, (“prepare”, 2) is answered no by processes whose BallotNum is already higher, e.g. 5 or 20.)
Paxos in ◊WLM (w/out ES)

(Figure: across rounds GSR, GSR+1, GSR+2, GSR+3, the Ω leader's (“prepare”, 2) draws replies no (5), no (8); each retry, e.g. (“prepare”, 9), (“prepare”, 14), is again outrun by higher BallotNums such as 13 and 20.)

Commit takes O(n) rounds!
New Consensus Algorithm for WLM
• Tolerates unbounded periods of asynchrony
• Minority can crash
• Message efficient: O(n) stable-state message complexity
• Achieves global decision in 4 rounds if leader is stable before GSR
– 5 otherwise
Our ◊WLM Algorithm in a Nutshell
• Commit with increasing ballot numbers; decide on a value committed by a majority
– like Paxos, etc.
• Challenge: we don't know all ballots – how do we choose a new one that is the highest?
• Solution: choose it to be the round number
• Challenge: rounds are wasted if a prepare/commit fails
• Solution: pipeline prepares and commits – try in each round
• Challenge: do processes really need to say no?
• Solution: support the leader's prepare even when holding a higher ballot number
– challenge: the higher number may reflect a later decision! Won't agreement be compromised?
– solution: a new field, “trustMe”, ensures the supported leader doesn't miss real decisions: it is set in round k+1 if a majority trusted the leader in round k
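The first idea – choosing the ballot to be the round number – can be sketched as a toy Python fragment (my own simplification, not the paper's protocol): since rounds only increase, a fresh ballot automatically dominates every ballot adopted in an earlier round, so no discovery phase is needed and prepares can be pipelined.

```python
# Toy illustration (my own simplification, not the full ◊WLM protocol) of
# "ballot = round number": ballots adopted in earlier rounds can never
# exceed the current round number, so each new ballot is the highest.

def run_rounds(n_procs, start_ballots, first_round, n_rounds):
    ballots = list(start_ballots)      # each process's highest adopted ballot
    accepted_rounds = []
    for k in range(first_round, first_round + n_rounds):
        b = k                          # leader's fresh ballot = round number
        if all(b >= x for x in ballots):
            ballots = [b] * n_procs    # everyone adopts the leader's ballot
            accepted_rounds.append(k)
    return accepted_rounds
```

Once the round number passes every stale ballot, each subsequent pipelined attempt succeeds – in contrast to the O(n) catch-up of plain Paxos shown above.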
Example Run: GSR=100

(Figure: processes start with mixed BallotNums 1, 5, 20, 8, 13; an earlier commit attempt did not lead to decision. In round GSR the Ω leader sends <Prepare, …, trustMe> while all others prepare with !trustMe; in GSR+1 all commit; by GSR+2 every process has adopted ballot 102. In GSR+3 the leader decides, and in GSR+4 all decide.)
Comparing The Models
Probabilistic Analysis
• Each link is timely with probability p in each round
– independent and identically distributed (IID) Bernoulli random variables
• Other simplifying assumptions:
– synchronous rounds
– no failures
• A good starting point for understanding behavior in real systems
• For each model M, calculate:
– PM – probability that the requirements of M hold in a round
– expected number of rounds until the requirements of M hold long enough
– E(DM) – expected number of rounds until (global) decision in M
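Under the IID assumption such quantities are easy to compute. As an illustration (my own simplified formulas for an ES-style requirement, not necessarily the paper's exact analysis): if a model needs all n(n-1) links timely in a round, that holds with probability q = p^(n(n-1)), and the expected wait for k consecutive such rounds follows the standard run-of-successes formula.

```python
# Illustrative IID computation (my own simplified formulas, not the paper's
# exact analysis): a requirement that all n*(n-1) links be timely holds in a
# round with probability q = p**(n*(n-1)); the expected number of rounds
# until k consecutive such rounds is (1 - q**k) / (q**k * (1 - q)).

def p_all_links(n, p):
    # Probability that every directed link among n processes is timely.
    return p ** (n * (n - 1))

def expected_rounds(q, k):
    # Expected waiting time for k consecutive successes of probability q.
    return (1 - q ** k) / (q ** k * (1 - q))
```

For n=8 and p=0.97, q is already below 0.2, so even a 3-round algorithm waits hundreds of rounds in expectation – which is why the weaker per-leader requirements of WLM pay off.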
Comparing the Models (IID)
Expected number of rounds for global decision (n=8)
ES requires 350 rounds for p=0.97
LAN Measurements
• How frequent is a stable round in each model?
– compare measured PM to the IID prediction
• For IID: p = fraction of timely messages (over all rounds)
– example: for timeout = 0.1ms, p = 0.7; for timeout = 0.2ms, p = 0.976
(Chart callouts: ES is slightly better in practice (a slow round); AFM is slightly worse in practice (a slow node); WLM and LM are better in practice (good leader); WLM rounds are the most frequent!)
GIRAF Implementation for WAN
• Some round synchronization is needed for all models
– in a LAN, computers often have synchronized clocks
• A simple algorithm to implement GIRAF:
– Li[j]: average latency between ni and nj as measured by ni (pings)
– timeout – input parameter

Receiver thread:
upon receive m:
    add m to M (msg buffer)
    if m belongs to round kj > ki: notify sender thread

Sender thread:
send message to peers
wait for timeout
compute next round msg.
upon notify:
    stop waiting; jump to round kj
    (wait duration after the jump: timeout – Li[j])
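A rough Python sketch of this round synchronization (my own, following the structure above but simplified – it omits the Li[j] latency correction): the sender runs timeout-driven rounds, and the receiver cuts a round short when it sees a message from a later round.

```python
# Rough sketch (my own simplification; omits the timeout - Li[j] correction):
# a sender thread runs timeout-driven rounds; the receiver ends the wait
# early when a message from a later round arrives, jumping the round forward.

import threading

class RoundSync:
    def __init__(self, timeout):
        self.timeout = timeout
        self.k = 0                       # current round
        self.M = []                      # message buffer
        self.wake = threading.Event()    # set by the receiver to end the wait
        self.jump_to = None
        self.lock = threading.Lock()

    def on_receive(self, m, round_of_m):
        with self.lock:
            self.M.append(m)
            if round_of_m > self.k:      # a peer is ahead: stop waiting early
                self.jump_to = round_of_m
                self.wake.set()

    def run_one_round(self, send_to_peers):
        send_to_peers(self.k)
        self.wake.wait(self.timeout)     # full timeout unless notified
        with self.lock:
            self.k = self.jump_to if self.jump_to is not None else self.k + 1
            self.jump_to = None
            self.wake.clear()
```

Jumping forward keeps slow nodes loosely in step with fast ones, which is what makes round-based properties measurable on a WAN.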
WAN Measurements
Questions:
• How frequent is a stable round in each model? (PM)
• For each model M, measure time and #rounds until global decision in M
• How to set the timeout?

The experiment:
• 33 runs for each timeout, 300 rounds per run
– a run is represented by the average over 15 different points in the run
• Asynchronous node startup – rounds before the model's first stable round are not counted
Question 1: Stable Rounds (PM)
• Up to 99% of messages arrive by timeout = 350ms
– waiting for 100% would take orders of magnitude longer [Cardwell et al. 98]
(Chart callouts: LM is sensitive to a single slow node – in some runs PLM = 95%, in others PLM = 15%. AFM is consistently low, around 40%. ES is consistently rare for small timeouts, occasionally good for larger ones – sensitive to individual slow messages. WLM rounds are the most frequent (15% better than LM at 160ms), with the lowest variance!)
Question 2: Global Decision
• WLM is best for timeout < 180ms; same as the others for higher timeouts
• The choice of leader matters… with a bad leader, use AFM
Question 3: Choosing the Timeout
• Tradeoff:
– longer timeouts: more stable rounds, fewer rounds until decision
– but each round takes longer, so overall decision time can grow
– the values are right for our system – they might be different for yours
(Chart callouts: fewer rounds, each one longer vs. more rounds, each one shorter)
With their optimal timeouts, WLM is just 80ms worse
Conclusions
• ◊WLM – a new timing model
• New algorithm for ◊WLM
– tolerates unbounded periods of asynchrony
– O(n) stable-state message complexity
– achieves global decision in 4 or 5 rounds
• Thanks to the weak stability requirements, our algorithm performs better than, or comparably to, algorithms that take fewer rounds
– even though those algorithms send more messages (Ω(n²))