1
Principles of Reliable Distributed Systems
Lecture 12: Disk Paxos and
Quorum Systems
Spring 2009
Idit Keidar
2
Today’s Material
• Shared memory Paxos from Sec. 5 of: Byzantine Disk Paxos: Optimal Resilience with Byzantine Shared Memory, Abraham, Chockler, Keidar, & Malkhi, PODC 2004
• Disk Paxos, Gafni & Lamport, DISC 2000
• Frangipani: A Scalable Distributed File System, Thekkath, Mann, & Lee, SOSP 1997
3
Reminder: Asynchronous R/W Shared Memory Model
• Shared memory registers
– Simple read/write (R/W) objects
• Accessed by processes with ids 1, 2, …
• All communication through shared memory!
• Algorithms must be wait-free
– Must tolerate any number of process (client) failures
– Possible thanks to reliable shared memory
4
Consensus in Shared Memory
• A shared object supporting a method decide(vi) that returns a value di to each invoking process i
• Satisfying:
– Agreement: for all i and j, di = dj
– Validity: di = vj for some j
– Termination: decide returns
5
Solving Consensus in/with Shared Memory
• Assume asynchronous shared memory system with atomic R/W registers
• Can we solve consensus?
– No: consensus is not solvable if even one process can fail. This is the shared-memory version of [FLP], with write standing for send and read for receive.
– Yes, if no process can fail
– Yes, with eventual synchrony or Ω
6
Shared Memory (SM) Paxos
• Consensus
– In asynchronous shared memory
– Using wait-free regular R/W registers
– And Ω (why?)
• Wait-free
– Any number of processes may fail (t < n)
• Unlike message-passing model (why?)
– Only the leader takes steps
7
Regular Registers
• SM Paxos can use registers that provide weaker semantics than atomicity
• SWMR regular register: a read returns
– Either a value written by an overlapping write
or – The register’s value before the first write that
overlaps the read
8
write(0)
Regular versus Atomic
time
read(1)
read(0)
write(1)
time
write(1) already
happened
Regular canreturn 0
not
linearizable
9
Variables
• Reminder: Paxos variables are:– BallotNum, AcceptVal, AcceptNum
• SM version uses shared SWMR regular registers:– xi = bal, val, num, decision i for each process i
– Initially 0,0, , 0,0, – Writeable by i, readable by all
• Each process keeps local variables b,v,n– Initially 0,0, , 0,0
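The sketches on the next few slides use this layout. Here is a minimal Python rendering of the shared state; the list name regs, the dict fields, and plain-integer ballots (made unique per process, e.g., round*m + i, instead of the ⟨counter, id⟩ pairs above) are illustrative assumptions, not part of the lecture's notation:

    # Hypothetical layout: one single-writer register per process, readable by all.
    BOTTOM = None                      # stands for the undefined value (⊥)
    m = 3                              # number of processes in this toy run
    regs = [{"bal": 0, "val": BOTTOM, "num": 0, "decision": BOTTOM} for _ in range(m)]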
10
Reminder: Paxos Phase I
• if leader (by Ω) then
  BallotNum ← choose new unique ballot
  send (“prepare”, BallotNum) to all
• Upon receive (“prepare”, bal) from i
  if bal ≥ BallotNum then
    BallotNum ← bal
    send (ack, bal, AcceptNum, AcceptVal) to i
• Upon receive (ack, BallotNum, num, val) from n−t
  if all vals = ⊥ then myVal ← initial value
  else myVal ← received val with highest num
n−t must have not moved on
11
SM Paxos: Phase I
if leader (by Ω) then
  b ← choose new unique ballot
  write ⟨b, v, n, ⊥⟩ to xi
  read all xj’s
  if some xj.bal > b then start over
  if all read xj.val’s = ⊥ then v ← my initial value
  else v ← read val with highest num
Write is like sending to all
Read instead of waiting for acks
No ack: someone
moved on!
Only b changed in this phase
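A compact sketch of this phase, building on the register layout sketched after the Variables slide; returning None to signal “start over”, and the parameter names, are assumptions of the sketch:

    def phase1(regs, i, b, v, n, initial_value):
        # Write <b, v, n, BOTTOM> to x_i -- like sending "prepare" to all.
        regs[i] = {"bal": b, "val": v, "num": n, "decision": BOTTOM}
        # Read all x_j -- instead of waiting for acks.
        snapshot = [dict(regs[j]) for j in range(len(regs))]
        if any(x["bal"] > b for x in snapshot):
            return None                      # someone moved on: caller starts over
        written = [x for x in snapshot if x["val"] is not BOTTOM]
        if not written:
            return initial_value             # all vals are BOTTOM: propose own value
        return max(written, key=lambda x: x["num"])["val"]   # adopt val with highest num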
12
Phase I Summary
• Classical Paxos:
– Leader chooses new ballot, sends to all
– Others ack if they did not move on to a later ballot
– If leader cannot get a majority, try again
– Otherwise, move to Phase 2
• SM Paxos:
– Leader chooses new ballot, writes its variable
– Leader reads to check if anyone moved on to a later ballot
– If anyone did move on, try again
– Otherwise, move to Phase 2
13
Reminder: Paxos Phase II
send (“accept”, BallotNum, myVal) to all
Upon receive (“accept”, b, v) with b ≥ BallotNum
  AcceptNum ← b; AcceptVal ← v
  send (“accept”, b, v) to all (first time only)
Upon receive (“accept”, b, v) from n−t
  decide v
  send (“decide”, v) to all
Accept messages change
AcceptNum and AcceptVal
Only if did not move on yet.
14
SM Paxos: Phase II (Leader, Cont’d)
n ← b
write ⟨b, v, n, ⊥⟩ to xi
read all xj’s
if some xj.bal > b then start over
write ⟨b, v, n, v⟩ to xi
return v
Read to see if all would
have accepted this proposal
When don’t they?
Like sending “accept” to all
v,n changed in this phase
Decide
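Continuing the same sketch for the leader's second phase; the decision is exposed through the register's decision field so that non-leaders can read it (see the non-leader code two slides ahead):

    def phase2(regs, i, b, v):
        n = b
        # Write <b, v, n, BOTTOM> to x_i -- like sending "accept" to all.
        regs[i] = {"bal": b, "val": v, "num": n, "decision": BOTTOM}
        # Read all x_j to check that nobody moved on to a higher ballot.
        snapshot = [dict(regs[j]) for j in range(len(regs))]
        if any(x["bal"] > b for x in snapshot):
            return None                      # start over
        # Decide: write <b, v, n, v> to x_i; v is now visible in the decision field.
        regs[i] = {"bal": b, "val": v, "num": n, "decision": v}
        return v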
15
Why Read Twice?
[Timeline figure: the leader performs write(b), a read, a second write, and a second read, while a competing leader performs write(b’) with b’ > b that does not complete before the first read; that read does not see b’, and only the second read can catch the competing ballot.]
16
Adding The Non-Leader Code
while (true)
  if leader (by Ω) then
    [ leader code from previous slides ]
  else
    read xld, where ld is the leader
    if xld.decision ≠ ⊥ then
      return xld.decision
“start over” means: go back to the top of this loop
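Putting the two phases together with the non-leader branch; am_leader(), leader_id(), and new_unique_ballot() are stand-ins for Ω and for ballot generation, assumed for this sketch:

    def sm_paxos(regs, i, initial_value, am_leader, leader_id, new_unique_ballot):
        v, n = BOTTOM, 0
        while True:                                  # "start over" returns here
            if am_leader():
                b = new_unique_ballot()              # e.g., next multiple of m plus i
                v1 = phase1(regs, i, b, v, n, initial_value)
                if v1 is None:
                    continue
                v, n = v1, b                         # Phase II's first write accepts <b, v1>
                decided = phase2(regs, i, b, v1)
                if decided is not None:
                    return decided
            else:
                x = regs[leader_id()]                # read x_ld, where ld is the leader
                if x["decision"] is not BOTTOM:
                    return x["decision"]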
17
Liveness
• The shared memory is reliable
• The non-leaders don’t write
– They don’t even need to be “around”
• The leader only fails if another leader competes with it
– Contention
– By Ω, eventually only one leader will compete
– In shared memory systems, Ω is called a contention manager
18
Validity
• Leader always proposes its own value or one previously proposed by an earlier leader
– Regular registers suffice
19
Agreement
[Timeline figure: the deciding leader performs write(b), write(v), its reads, and then writes the decision, while no write(b’) with b’ > b has completed; any later leader with b’ > b reads after these writes, so its read sees ⟨b, v⟩ and it writes v as well.]
20
Agreement Proof Idea
• Look at the lowest ballot b in which some process decides a value v
• By uniqueness of ballots, no other value is decided with b
• Prove by induction that every decision with a ballot b’ > b is also v
• Homework: complete the proof
– See argument in previous slide
– See Byzantine Disk Paxos paper
21
Termination
• When one correct leader exists
– It eventually chooses a higher b than all those written before
– No other process writes a higher ballot
– So it does not start over, and hence decides
• Any number of processes can fail
• How can this be possible? Didn’t we show that a majority of correct processes is needed?
22
Optimization
• As in the message-passing case…
• The first write does not write consensus values
• A leader running multiple consensus instances can perform the first write once and for all, and then perform only the second write for each consensus instance
23
Leases
• We need an eventually accurate leader (Ω)
– But what does this mean in shared memory?
• We would like to have mutual exclusion
– Not fault-tolerant!
• Lease: fault-tolerant, time-based mutual exclusion
– Live but not safe in the eventual synchrony model
24
Using Leases
• A client that has something to write tries to obtain the lease
– Lease holder = leader
– May fail…
• Example implementation:
– Upon failure, backoff period
• Leases have limited duration, and expire
• When is mutual exclusion guaranteed?
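One way to sketch such a lease on top of a shared register; the register contents, the duration, and the backoff bounds are illustrative assumptions, and mutual exclusion holds only while clocks and delays behave, as the next slide discusses:

    import random, time

    LEASE_SECONDS = 2.0                               # illustrative duration

    def try_acquire_lease(lease_reg, me):
        # lease_reg is a shared dict {"holder": ..., "expires": ...}, initially None/0.
        if lease_reg["holder"] is not None and lease_reg["expires"] > time.time():
            return False                              # an unexpired lease exists
        lease_reg.update(holder=me, expires=time.time() + LEASE_SECONDS)
        return lease_reg["holder"] == me              # re-check: last writer wins

    def obtain_lease(lease_reg, me):
        while not try_acquire_lease(lease_reg, me):
            time.sleep(random.uniform(0.1, 0.5))      # backoff period upon failure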
25
Lock versus Lease
Lock is blocking
– Using locks is not wait-free
– If the lock holder fails, we’re in trouble
Lease is non-blocking
– A lease expires regardless of whether its holder fails
Lock is always safe
– Never two lock-holders
Lease is not
– Two lease-holders are possible due to asynchrony
– OK for indulgent algorithms, like Paxos
27
Data-Centric Replication
• A fixed collection of persistent data items accessed by transient clients
• Data items have limited functionality
– E.g., R/W registers, or
– An object of a certain type
• Data items can fail
• Cannot communicate with one another
28
System Model: Fault-Prone Memory
• n fault-prone shared-memory objects
– Called base objects
– Can be n servers or disks storing base objects
– t out of n can fail
• m processes (clients)
– Any number can fail (wait-free)
29
What Is It Good For?
• Storage Area Networks (SAN)
– “Brick” storage
– Disk functionality is limited (R/W)
– Disks cannot communicate with each other
– Disks and disk servers can fail
• Large-scale client/server systems
– Simple servers that do not communicate with each other scale better and manage load better
– Servers can fail
30
Disk Paxos
• Consensus using n ≥ 2t+1 fault-prone disks
– Disks can incur crash failures
• Solution combines:
– m-process shared memory Paxos, and
– ABD-like emulation of shared registers from fault-prone ones
32
Disk Paxos Data Structures
[Figure: m processes (clients) and n disks (here 5 processes and 3 disks); disk j holds a block block[i][j] = ⟨b, v, n, d⟩ for every process i, and together the blocks of row i emulate register xi. Process i can write block[i][j] on each disk j, and can read all blocks.]
33
Read Emulation
• In order to read xi:
– Issue read block[i][j] for each disk j
– Wait for a majority of disks to respond
– Choose the block with the largest ⟨b, n⟩
• Is this enough?
• How did ABD’s read emulation work?
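A sketch of this emulation; disk_read(j, i) is a hypothetical helper that returns disk j's copy of block[i][j] as a dict {"b", "v", "n", "d"} and raises on failure (a real implementation would issue the reads in parallel rather than one by one):

    def emulate_read(i, n_disks, disk_read):
        replies = []
        for j in range(n_disks):                      # try disks until a majority replies
            try:
                replies.append(disk_read(j, i))
            except Exception:
                pass                                  # crashed or slow disk: skip it
            if len(replies) > n_disks // 2:           # a majority has responded
                break
        assert len(replies) > n_disks // 2, "needs a live majority of disks"
        return max(replies, key=lambda blk: (blk["b"], blk["n"]))   # largest <b, n>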
34
One Read Round Enough for Regular
[Timeline figure: as in the regular-vs-atomic example, write(0) has completed and write(1) is in progress; a read that does not find a written copy of the new value returns 0, which is OK for a regular register, while a read that finds the written copy returns 1.]
35
Write Emulation
• In order to write xi:
– Issue write block[i][j] for each disk j
– Wait for a majority of disks to respond
• Is this enough?
• Homework: put everything together
– Write complete Disk Paxos pseudo-code based on SM Paxos and the R/W emulations
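The matching write sketch, with a hypothetical disk_write(j, i, block) helper; replacing the register reads and writes of SM Paxos by emulate_read and emulate_write gives the skeleton the homework asks for:

    def emulate_write(i, n_disks, disk_write, block):
        acks = 0
        for j in range(n_disks):                      # store block into block[i][j] on disk j
            try:
                disk_write(j, i, dict(block))
                acks += 1
            except Exception:
                pass                                  # a crashed disk never acks
            if acks > n_disks // 2:                   # a majority acked: the write is done
                return
        raise RuntimeError("no live majority of disks")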
37
Why Majority?
• In indulgent algorithms (e.g., Paxos) we assumed a majority of the processes are correct
• But what we really need is:
If Q1, Q2 are sets of processes s.t. liveness is guaranteed whenever all processes in P−Q1 or all processes in P−Q2 crash, then Q1 and Q2 intersect.
38
1st Generalization: Weighted Voting [Gifford 79]
• Each process has a weight
– Like share-holders in a corporation
• In order to make progress, need “votes” from a set of processes that hold a majority of the weights (shares)
• Special cases:
– Each process has weight 1 – majority
– One process has all the weights – singleton
39
Definition of Quorum System
• A quorum system over a universe U of n processes is a collection of subsets of U (called quorums) such that every two quorums intersect
• Examples:
– Singleton: QS = {{pi}}
– Majority: QS = {Q ⊆ U : |Q| > n/2}
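The definition and the two examples, made concrete in a small brute-force sketch (only feasible for tiny universes; the function names are illustrative):

    from itertools import combinations

    def majority_quorums(universe):
        k = len(universe) // 2 + 1                    # any subset of more than n/2 processes
        return [set(q) for q in combinations(universe, k)]

    def is_quorum_system(quorums):
        # Defining property: every two quorums intersect.
        return all(q1 & q2 for q1, q2 in combinations(quorums, 2))

    assert is_quorum_system(majority_quorums(range(5)))
    assert is_quorum_system([{"p1"}])                 # singleton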
40
The Grid Quorum System
• A quorum consists of one row plus one cell from each row above it
p1 p2 p3 p4 p5
p6 p7 p8 p9 p10
p11 p12 p13 p14 p15
p16 p17 p18 p19 p20
p21 p22 p23 p24 p25
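A sketch of one grid quorum over the 5x5 grid above, using 0-based rows and columns and the p1..p25 numbering from the slide; the function name and parameters are illustrative:

    def grid_quorum(n_side, row, picks):
        # The whole row `row`, plus one cell (column picks[r]) from each row r above it.
        q = {row * n_side + c + 1 for c in range(n_side)}
        q |= {r * n_side + picks[r] + 1 for r in range(row)}
        return q

    # Row 2 in full (p11..p15), plus p4 from row 0 and p6 from row 1:
    print(sorted(grid_quorum(5, 2, {0: 3, 1: 0})))    # [4, 6, 11, 12, 13, 14, 15]

Any two such quorums intersect: the quorum built on the lower of the two rows picks a cell in every row above it, including the row that the other quorum contains in full.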
41
Advantages of Quorum Systems
• Availability
– Allows faulty/slow servers to be avoided (up to a certain threshold)
• Load balancing
– Each server participates only in a fraction of the quorums, and is therefore accessed by only a fraction of the overall accesses
• Fundamental tradeoff: load vs. availability
42
Coteries and Domination
• A coterie is a quorum system in which no quorum is a subset of another quorum
– Obtained from a quorum system by removing supersets and keeping only minimal quorums
• A coterie QS dominates a coterie QS’ if every quorum Q’ ∈ QS’ is a superset of some quorum Q ∈ QS
• A non-dominated coterie is one that no other coterie dominates
43
Quorum Sizes
• Majority: O(n)
• Grid: O(Sqrt(n))
• Primary Copy: O(1)
• Weighted Majority: varies
44
The Load of a Quorum System
• The load of a quorum system is the probability of accessing its busiest server in the best case, i.e., under an access strategy that minimizes this probability, and when no failures occur
• An access strategy for QS is a probability distribution over the quorums in QS
• The load of a server under a strategy is the probability that this server is in the accessed quorum
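A direct rendering of these definitions; quorums is a list of sets and strategy the matching probability vector, both illustrative names:

    def load(quorums, strategy):
        servers = set().union(*quorums)
        per_server = {s: sum(p for q, p in zip(quorums, strategy) if s in q)
                      for s in servers}
        return max(per_server.values())               # access probability of the busiest server

    assert load([{"p1"}], [1.0]) == 1.0               # singleton: load 1
    # Majority over 3 servers with the uniform strategy: each server is in 2 of 3 quorums.
    assert abs(load([{0, 1}, {0, 2}, {1, 2}], [1/3] * 3) - 2/3) < 1e-9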
45
Availability of a Quorum System
• The resilience f of QS is the number of failures QS is guaranteed to survive
– After f failures there is always a live quorum
• Failure probability
– Assume that each server fails independently with probability p
– Fp(QS) is the probability that all quorums in QS are hit, i.e., that no quorum survives
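A brute-force sketch of Fp that enumerates all 2^n failure patterns (illustrative only, for small n; servers are numbered 0..n−1):

    from itertools import product

    def failure_probability(quorums, n, p):
        total = 0.0
        for pattern in product([False, True], repeat=n):
            failed = {s for s in range(n) if pattern[s]}
            if all(q & failed for q in quorums):       # every quorum is hit: no live quorum
                prob = 1.0
                for s in range(n):
                    prob *= p if s in failed else (1 - p)
                total += prob
        return total

    assert abs(failure_probability([{0}], 1, 0.3) - 0.3) < 1e-9   # singleton: Fp = p, as on the next slide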
46
Examples
• Majority
– Best availability (smallest failure probability) for p < ½
– Worst availability for p > ½
– Load is close to ½
• Singleton
– Fp = p (optimal when p > ½)
– Load is 1
• Grid
– Load O(1/Sqrt(n))
– Resilience of Sqrt(n) − 1
– Failure probability goes to 1 as n grows