
CS294, Yelick Time, p1

CS 294-8: Time, Clocks, and Snapshots in Distributed Systems
http://www.cs.berkeley.edu/~yelick/294

CS294, Yelick Time, p2

Agenda
• Concurrency models
  – Partial orders, state machines, …
• Correctness conditions
  – Serializability, …
  – Safety and liveness
• Clock synchronization
• Bag of tricks for distributed algorithms
  – Timestamps, markers, …

CS294, Yelick Time, p3

Common Approach #1
• Reasoning about a concurrent system: partially order the events whose ordering you can observe
  – Messages, memory operations, events
• Consider all possible topological sorts to serialize
• Each of those serial “histories” must be “correct.”
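As a small illustration of the "all topological sorts" idea, the sketch below (an assumed representation, not from the slides) enumerates every serial history consistent with an observed partial order:

```python
# A small sketch (assumed representation): enumerate all topological
# sorts of a partial order over events, i.e., the serial histories
# consistent with the observed ordering.

def topological_sorts(events, order):
    """Yield every total order of `events` consistent with `order`,
    a set of (before, after) pairs."""
    if not events:
        yield []
        return
    for e in events:
        # e can come first only if no remaining event must precede it
        if not any((other, e) in order for other in events if other != e):
            rest = [x for x in events if x != e]
            for tail in topological_sorts(rest, order):
                yield [e] + tail

# a -> b observed (say, via a message); c is concurrent with both
sorts = list(topological_sorts(["a", "b", "c"], {("a", "b")}))
assert ["a", "b", "c"] in sorts
assert ["b", "a", "c"] not in sorts   # would violate a -> b
assert len(sorts) == 3                # c can fall in any of 3 positions
```

Each of the three serializations is a candidate history that must then be checked for correctness.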

CS294, Yelick Time, p4

Happens Before Relation
• A system is a set of processes
• Each process is a sequence of events (a total ordering on its events)
• Happens before, denoted ->, is defined as the smallest relation such that:
  – 1) if a and b are in the same process, and a is before b, then a -> b
  – 2) if a is a message send and b is the matching receive, then a -> b
  – 3) if a -> b and b -> c, then a -> c (transitivity)
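The three rules above can be sketched directly. This is a minimal illustration (event names and the pair-based encoding are assumptions for the example), computing the happens-before relation as the closure of program order plus message edges:

```python
# Minimal sketch: build the happens-before relation from per-process
# event orders (rule 1), send/receive pairs (rule 2), and transitive
# closure (rule 3).

from itertools import combinations

def happens_before(process_events, messages):
    """process_events: list of event sequences, one per process.
    messages: list of (send_event, receive_event) pairs."""
    hb = set()
    # Rule 1: program order within each process.
    for events in process_events:
        for a, b in combinations(events, 2):
            hb.add((a, b))
    # Rule 2: each send precedes its matching receive.
    for send, recv in messages:
        hb.add((send, recv))
    # Rule 3: transitive closure.
    changed = True
    while changed:
        changed = False
        for (a, b) in list(hb):
            for (c, d) in list(hb):
                if b == c and (a, d) not in hb:
                    hb.add((a, d))
                    changed = True
    return hb

# Two processes; process 1 sends a message at a2, process 2 receives at b1.
hb = happens_before([["a1", "a2"], ["b1", "b2"]], [("a2", "b1")])
assert ("a1", "b2") in hb      # ordered through the message
assert ("b1", "a1") not in hb  # b1 does not happen before a1
```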

CS294, Yelick Time, p5

Happens Before Example
• What if processes are multithreaded?
• How do we determine the events?
  – Send message, receive message, what else?
  – What about reads/writes? Nonblocking operations?
• Does this help if we’re trying to reason about a program or an algorithm?

CS294, Yelick Time, p6

Common Approach #2
• Reasoning about a concurrent system: view the system as a state machine that produces serial executions:
  – Invocation/response events, send/receive message (with an explicit model of the network in between)
• Each of the global serial executions must be “correct.”

CS294, Yelick Time, p7

State Machine Approach
• A distributed system is:
  – A finite set of processes
  – A finite set of channels (network links)
• A channel is:
  – Reliable, order-preserving, and with infinite buffer space
• A process is:
  – A set of states, with one initial state
  – A set of events
• Models differ on specific details

CS294, Yelick Time, p8

Comparison of Models
• Which approach (partial orders vs. state machines) is better?
• Is one lower level than the other?
• Is one for shared memory and the other for message passing?

CS294, Yelick Time, p9

Common Notions of Correctness
• Serializability, strong serializability
• Sequential consistency, total store ordering, weak ordering, …
• Linearizability (used in wait-free algorithms)
• All of these are based on the idea that operations (transactions) must appear to happen in some serial order
• Which of these (or others) are useful?

CS294, Yelick Time, p10

Variations on Correctness
• Why are there so many notions?
  – Do all processes observe the same “serial” order, or can some see different views?
  – Are the specifications of each operation deterministic? Can processes see different “correct” behaviors?
  – Are all operations executed by a process ordered? E.g., the order of read x -> read y may not matter.
  – Is the observed order consistent with real time?

CS294, Yelick Time, p11

Closed World Assumption
• Most of these correctness ideas assume that all communication is part of the system.
• Anomalies come from:
  – Phone calls between users
  – A second “external” network
• How does one prevent these anomalies?

CS294, Yelick Time, p12

Clock Condition
• A logical clock assigns a timestamp C<a> to each event a.
• Clock Condition: for any events a, b:
  – If a -> b then C<a> < C<b>
• To implement a clock:
  – Increment the clock between each pair of events within a process
  – When you send a message, append the current time
  – When you receive a message, update your own clock to max(timestamp, my-current-time) + 1
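The implementation rules above fit in a few lines. This is a minimal sketch (class and method names are assumptions for the example):

```python
# A minimal Lamport logical clock, following the three rules on the
# slide: tick between local events, stamp sends, jump on receives.

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event: increment the clock."""
        self.time += 1
        return self.time

    def send(self):
        """Tick, then return the timestamp to append to the message."""
        return self.tick()

    def receive(self, msg_timestamp):
        """On receive, jump past the sender's timestamp."""
        self.time = max(msg_timestamp, self.time) + 1
        return self.time

p, q = LamportClock(), LamportClock()
t_send = p.send()           # p's event: send a message stamped t_send
t_recv = q.receive(t_send)  # q's matching receive
assert t_send < t_recv      # the Clock Condition holds across the message
```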

CS294, Yelick Time, p13

Mutual Exclusion
• Lamport defines a total order => as the clock ordering with ties broken by PID
• “Clocks” are not synchronized a priori
  – Loosely synchronized within the algorithm
• Used for a mutual exclusion algorithm
  – How useful is this algorithm?
  – Is the specification useful?
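The total order => amounts to sorting events by the pair (clock, PID); a sketch (the dictionary encoding of events is an assumption for the example):

```python
# Lamport's total order =>: order events by (clock, pid), so process
# IDs break ties between events with equal logical clock values.

def total_order_key(event):
    return (event["clock"], event["pid"])

events = [
    {"name": "a", "clock": 2, "pid": 1},
    {"name": "b", "clock": 2, "pid": 0},  # same clock: PID breaks the tie
    {"name": "c", "clock": 1, "pid": 3},
]
ordered = sorted(events, key=total_order_key)
assert [e["name"] for e in ordered] == ["c", "b", "a"]
```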

CS294, Yelick Time, p14

Clock Synchronization
• Two notions of synchronization:
  – External: within some bound of UTC (Coordinated Universal Time)
  – Internal: within some bound of the other clocks in the system
• Huge literature on clock synchronization under various models (e.g., PODC).
  – Impossible in general due to message delays
  – Many algorithms synchronize clocks with high probability (within some bound)
• Are synchronized clocks useful? For fault tolerance or in general?

CS294, Yelick Time, p15

Clock Synchronization Algorithms

• Cristian [89] describes a centralized time server approach
  – Set time = t_server + t_round_trip/2
  – No fault tolerance to server failure
• Berkeley Algorithm [89]:
  – Master polls slaves and records round-trip time to estimate local times
  – Averages “non-faulty” clocks
  – Sends a delta (not a new time) to the slaves
  – Master can be re-elected
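Cristian's estimate is one line of arithmetic: assume the server's reply arrives roughly halfway through the round trip. A sketch (function name and timestamps are assumptions for the example):

```python
# Cristian's algorithm: estimate the server's current time as
# t_server + round_trip/2, assuming symmetric network delay.

def cristian_estimate(t_server, t_request_sent, t_reply_received):
    round_trip = t_reply_received - t_request_sent
    return t_server + round_trip / 2

# Client sends at local time 100.0, gets a reply at 100.8 carrying
# the server timestamp 205.1; half the 0.8s round trip is added.
estimate = cristian_estimate(205.1, 100.0, 100.8)
assert abs(estimate - 205.5) < 1e-9
```

The error is bounded by half the round-trip time, which is why the approach degrades over slow or asymmetric links.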

CS294, Yelick Time, p16

Clock Synchronization Algorithms

• Network Time Protocol (NTP) [Mills 95], meant for wide area:
  – Uses statistical techniques to filter clock data
  – Servers can reconfigure after faults
  – Protected (authentication used)
• Design:
  – Primary servers connected to a radio clock
  – Secondary servers synchronize with primaries, with distance from a primary defining strata
  – Modes: multicast, procedure call, symmetric
  – 1-10s of milliseconds accuracy reported

CS294, Yelick Time, p17

Stable Properties
• Termination, garbage, and deadlock are stable properties: once true, they remain true “forever.”
• Snapshot algorithms are useful for detecting stable properties.
• What properties related to faults are stable?

CS294, Yelick Time, p18

Safety and Liveness
• A safety property says:
  – informally: nothing bad ever happens
  – formally: any violation of a safety property can be observed on a finite prefix of the execution
  – this is a partial correctness condition
• A liveness property says:
  – informally: something good eventually happens
  – formally: it is a property of an infinite execution that cannot be checked on finite prefixes
  – this is a total correctness condition (combined with safety)
• What about real-time constraints?

CS294, Yelick Time, p19

Snapshot Algorithm
• Each process records its own state
• Each pair of processes sharing a channel coordinates to save the channel’s state
• Setup: processes p and q, with channel c from p to q and channel c’ from q to p
• Example: single-token conservation system
  – s0 = no token; s1 = has token
  – Token in p -> snapshot p -> token moves into c -> q snapshots itself and the channels => the recorded state shows 2 copies of the token

CS294, Yelick Time, p20

Snapshot Algorithm
• Idea: mark a point in the message stream
  – p sends 1 marker after recording its state and before sending any further message on channel c
  – When q receives the marker, it either:
    • records its current state and records c as empty, or
    • keeps its previously recorded state and records c as containing all messages received after it recorded and before the marker
  – Works for n strongly connected processes
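The marker rule can be simulated in a single process. The sketch below is an assumed encoding (not from the slides): FIFO channels are Python lists, a process records its state on the first marker it sees, and it logs each incoming channel until that channel's marker arrives. It replays the single-token example and shows the recorded state conserving exactly one token:

```python
# Single-run simulation of the marker (Chandy-Lamport style) rule.

MARKER = "MARKER"

class Process:
    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.recorded_state = None
        self.open_channels = set()   # incoming channels still being logged
        self.channel_state = {}      # channel name -> messages recorded in it

    def record(self, in_channels, out_channels):
        """Record local state, start logging inputs, send markers on outputs."""
        self.recorded_state = self.state
        self.open_channels = set(in_channels)
        self.channel_state = {c: [] for c in in_channels}
        for ch in out_channels:
            ch.append(MARKER)        # marker precedes any later message

    def receive(self, channel_name, msg, in_channels, out_channels):
        if msg == MARKER:
            if self.recorded_state is None:
                self.record(in_channels, out_channels)  # first marker seen
            self.open_channels.discard(channel_name)    # channel recording done
        elif channel_name in self.open_channels:
            # in-flight message: belongs to the channel's recorded state
            self.channel_state[channel_name].append(msg)

# Channels c: p -> q and c_prime: q -> p; p holds the token.
c, c_prime = [], []
p = Process("p", "has token")
q = Process("q", "no token")

c.append("token"); p.state = "no token"              # p sends the token on c
q.record(in_channels=["c"], out_channels=[c_prime])  # q initiates the snapshot

# Deliver q -> p: the marker; p records its state and relays a marker on c.
p.receive("c_prime", c_prime.pop(0), in_channels=["c_prime"], out_channels=[c])
# Deliver p -> q: first the in-flight token, then p's marker.
q.receive("c", c.pop(0), in_channels=["c"], out_channels=[c_prime])
q.receive("c", c.pop(0), in_channels=["c"], out_channels=[c_prime])

# The recorded global state holds exactly one token: in channel c.
assert p.recorded_state == "no token"
assert q.recorded_state == "no token"
assert q.channel_state["c"] == ["token"]
```

Because the marker trails the token on channel c, the token is attributed to the channel rather than double-counted in a process state.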

CS294, Yelick Time, p21

Related Algorithms
• Dijkstra-Scholten termination detection
  – Create an implicit spanning tree: the sender of your first message is your parent
  – Keep track of children awaiting termination and signal your parent when they (and you) are done.
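A simplified sketch of that parent/children bookkeeping (an assumed simulation, omitting the per-message deficit counting of the full Dijkstra-Scholten algorithm): each node's parent is the sender of its first message, and a node signals its parent once it is idle and all of its children have signaled.

```python
# Simplified Dijkstra-Scholten-style termination detection over an
# implicit spanning tree.

class Node:
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = set()
        self.idle = False
        self.done = False

    def receive_first_message(self, sender):
        """The sender of your first message becomes your parent."""
        if self.parent is None and not self.done:
            self.parent = sender
            sender.children.add(self)

    def try_signal(self):
        """Signal the parent when idle and every child has detached."""
        if self.idle and not self.children and not self.done:
            self.done = True
            if self.parent is not None:
                self.parent.children.discard(self)
                self.parent.try_signal()

root, a, b = Node("root"), Node("a"), Node("b")
a.receive_first_message(root)   # root -> a: a joins the tree under root
b.receive_first_message(a)      # a -> b: b joins under a

for node in (b, a, root):       # the computation quiesces bottom-up
    node.idle = True
    node.try_signal()

assert root.done                # root detects global termination
```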

CS294, Yelick Time, p22

Backup Example Slides

CS294, Yelick Time, p23

Using State Machines to Model Parallel Systems

• A large parallel application is built from a set of communicating objects
• The correctness can be divided into two problems:
  – the correctness of each of the objects
  – the correctness of the system that uses the objects
• Both can be modeled as a state machine
• Examples: distributed hash table, distributed task queue

CS294, Yelick Time, p24

Atomicity
• In any attempt to reason about concurrency, we need to determine the level of atomicity
  – what is the smallest indivisible event?
    • reads and writes to memory (usually)
    • basic arithmetic and conditional operations
    • message sends/receives
• Often too complex to do all reasoning at this level
  – group a set of low-level events within a function/method into a higher-level atomic event
• Leads to multi-level, modular proofs

CS294, Yelick Time, p25

Specifications and Implementations

• The first step in reasoning about correctness is having a specification of desired behavior
  – what sequences of operations (inputs and results) are legal?
• A specification and implementation may be written
  – in the same formal language
  – in two different languages
• Concentrate on the behaviors produced and avoid syntax here

CS294, Yelick Time, p26

Correctness of Serial Types
• A sequential ADT may be specified by pre/post conditions on the operations
  – insert(s, e)
    • pre: true
    • post: s = s + { e }
  – remove(s, e)
    • pre: e in s
    • post: s = s - { e }
• An implementation is correct if, given any sequence of these operations, they meet the specifications
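The pre/post conditions above can be checked mechanically against an implementation. A minimal sketch using Python sets (the wrapper functions are assumptions for the example):

```python
# Checking the insert/remove pre/post conditions from the slide
# against a Python set implementation.

def insert(s, e):
    # pre: true
    old = set(s)
    s.add(e)
    assert s == old | {e}      # post: s = s + { e }

def remove(s, e):
    assert e in s              # pre: e in s
    old = set(s)
    s.discard(e)
    assert s == old - {e}      # post: s = s - { e }

s = set()
insert(s, 1)
insert(s, 2)
remove(s, 1)
assert s == {2}
```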

CS294, Yelick Time, p27

Linearizability
• Taken from “Linearizability: A Correctness Condition for Concurrent Objects” by Herlihy and Wing, TOPLAS, July 1990.
• Intuition behind the correctness condition:
  – each operation should appear instantaneous
  – the order of non-concurrent operations should be preserved
• Using this intuition, examine some histories and determine if they are acceptable
• Example: a queue ADT with
  – enqueue (E) and
  – dequeue (D) operations

CS294, Yelick Time, p28

Some Queue Histories
• History 1, acceptable:
  – A: q.E(x)  q.D(y)  q.E(z)
  – B: q.E(y)  q.D(x)
• History 2, not acceptable:
  – A: q.E(x)  q.D(y)
  – B: q.E(y)
• History 3, not acceptable:
  – A: q.E(x)  q.D(y)
  – B: q.E(y)  q.D(y)

CS294, Yelick Time, p29

Execution Model Details
• A concurrent system is a
  – collection of sequential threads
  – communicating through shared objects
• Each object has
  – a unique name
  – a type, which defines the set of legal operations
• Formally, each operation invocation is 2 events
  – an invocation event: <obj op(args*) A>
  – a response event: <obj term(res*) A>
  – where term is a termination condition: OK, exception, etc.
  – an invocation and response match if they have the same process name and the same object name

CS294, Yelick Time, p30

Histories
• Assume invocation and response events are totally ordered (although operations may overlap)
• A history is a finite sequence of invocation and response events
• An invocation is pending in a history if no matching response follows.
• Given a history H, complete(H) is the maximal subsequence of H consisting only of invocations and matching responses
• A history H is sequential if:
  – the first event in H is an invocation
  – each invocation (except possibly the last) is immediately followed by a matching response
• A history that is not sequential is concurrent.

CS294, Yelick Time, p31

Ordering in Histories
• A history H induces an irreflexive partial order <H:
  – op1 <H op2 iff the response for op1 appears before the invocation for op2 in H
• Example: op1 and op3 overlap, then op2 and op4 overlap; the induced order is op1 < op2, op1 < op4, op3 < op2, op3 < op4
• The same history presented (non-graphically) as an event sequence: op1_inv, op3_inv, op1_res, op3_res, op2_inv, op4_inv, op2_res, op4_res

CS294, Yelick Time, p32

Linearizability

• A process subhistory H|P (H at P) is the subsequence of all events in H whose process name is P
• Two histories H1 and H2 are equivalent if for all processes P, H1|P = H2|P
• A history H is linearizable if it can be extended (by appending zero or more response events) to a history H’ such that:
  – L1: complete(H’) is equivalent to some legal sequential history S
  – L2: <H ⊆ <S

CS294, Yelick Time, p33

Observations on Linearizability

• Linearizability is stronger than sequential consistency
• Linearizability is composable, which is nice for building large systems of component objects
• Intuitively, linearizability states that operations must appear to take effect sometime between their invocation and response
• Too strong for large-scale machines and distributed data structures?

CS294, Yelick Time, p34

Relaxed Consistency Model
• In the Multipol project, we used a weaker notion of correctness
• Each operation was divided into an invocation part and a completion part
  – like put/sync and get/sync in Split-C
  – e.g., E.update_mesh & E.sync_update
  – e.g., T.insert_element & T.sync_insert
• This worked well for several example problems
• Data structures do not compose well
  – need more than 2 phases in some cases

CS294, Yelick Time, p35

Correctness
• An implementation is a set of histories in which events of two objects
  – a representation object, REP
  – an abstract object, ABS
  are interleaved in a constrained way:
• For each history H in the implementation:
  – 1) the subhistories H|REP and H|ABS are well-formed
  – 2) for each process P, each REP operation in H|P lies within an abstract operation in H|P
  – e.g., q.enq_inv(x)  lock(q.l), q.back++, q.dat[back]=x, unlock(q.l)  q.enq_res/OK
• An implementation is correct if H|ABS is linearizable

CS294, Yelick Time, p36

Safety and Liveness Examples
• Examples of safety properties
  – There are no race conditions
  – This queue is never empty
  – One thread never accesses another thread’s data
  – No arithmetic operation in the program ever overflows
• Examples of liveness properties
  – The program eventually terminates
  – The scheduler is “fair”: every enabled thread eventually gets to run
  – The scheduler is “fair”: a thread enabled infinitely often will get to run infinitely often
  – The set will eventually contain the Gröbner basis of the input

CS294, Yelick Time, p37

Proving Safety and Liveness
• Techniques for proving safety properties are relatively straightforward
  – sometimes many states and cases for a particular system
  – extends ideas from reasoning about sequential code
• Techniques for proving liveness are much more difficult
  – often want something stronger than liveness as well, such as a time bound
  – some of the formal frameworks make proofs of particular liveness properties easier, e.g., fairness
  – often ensure a liveness property using a stronger safety property

CS294, Yelick Time, p38

Methods for Proving Correctness (Safety)

• In serial implementations, the subset of REP values that are legal representations of an ABS value are those that satisfy the representation invariant
  – I: REP -> BOOL, a predicate on REP values
• The meaning of a legal representation is defined by an abstraction function
  – A: REP -> ABS
• In sequential programs, the abstract operations may go through a set of states in which the invariant I is not true

CS294, Yelick Time, p39

Existence of Abstraction Functions

• There are three cases that arise in trying to prove that an abstraction function exists in a concurrent system:
  – The function can be defined directly on the REP state
  – A history variable is needed to record a past event
  – A prophecy variable is needed to record a future event
• (Alternatively, you may use an abstraction relation.)

CS294, Yelick Time, p40

Example 1: Locked Queue
• Given a queue implementation containing
  – integers, back, front
  – an array of values, items
  – a lock, l

Enq = proc (q: queue, x: item)   // ignoring buffer overflow
  lock(l)
  i: int = q.back++              // allocate new slot
  q.items[i] = x                 // fill it
  unlock(l)

Deq = proc (q: queue) returns (item) signals empty
  lock(l)
  if (back == front) signal empty
  else
    front++
    ret: item = items[front]
  unlock(l)
  return(ret)

CS294, Yelick Time, p41

Simple Abstraction Function
• The abstraction function maps the elements in items[front…back] to the abstract queue value
• The proof is straightforward
• The lock prevents the “interesting” cases

CS294, Yelick Time, p42

Example 2: Statistical DB Type

• Given a “statistical DB” spec with the operations:
  – add(x): add a new number, x, to the DB
  – size(): report the number of elements in the DB
  – mean(): report the mean of all elements in the DB
  – variance(): report the variance of all elements in the DB
• A straightforward implementation keeps the set of values
• A more compact one uses three values only:
  – integer count, initially 0    // number of elements
  – float sum, initially 0        // sum of elements
  – float sumSquare, initially 0  // sum of squares of all elements
• Implementations of the operations are straightforward
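The compact representation can be sketched directly (the class shape is an assumption; the slide only names the three fields). Mean and variance are derived from count, sum, and sumSquare:

```python
# Compact statistical DB: only count, sum, and sumSquare are stored;
# mean and (population) variance are computed from them.

class StatDB:
    def __init__(self):
        self.count = 0        # number of elements
        self.sum = 0.0        # sum of elements
        self.sumSquare = 0.0  # sum of squares of all elements

    def add(self, x):
        self.count += 1
        self.sum += x
        self.sumSquare += x * x

    def size(self):
        return self.count

    def mean(self):
        return self.sum / self.count

    def variance(self):
        # population variance: E[x^2] - (E[x])^2
        m = self.mean()
        return self.sumSquare / self.count - m * m

db = StatDB()
for x in (1.0, 2.0, 3.0):
    db.add(x)
assert db.size() == 3
assert db.mean() == 2.0
assert abs(db.variance() - 2.0 / 3.0) < 1e-12
```

Note the representation cannot reproduce the individual elements, which is exactly what forces the history variable in the proof on the next slide.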

CS294, Yelick Time, p43

Need for History Variables
• The problem with verifying this example is that the specification contains more information than the implementation
• Proof idea:
  – add a variable items to the representation state (for the proof only)
  – the implementation may update items
  – it may not use items to compute operations
• The abstraction function, A, for the augmented DB simply maps the items field to the abstract state
• Need to prove the representation invariants:
  – count = items.size
  – sum = sum({x | x in items})
  – sumSquare = sum({x^2 | x in items})

CS294, Yelick Time, p44

Example 3: Queue Implementation

• Given a queue implementation containing:
  – an integer, back
  – an array of values, items

Enq = proc (q: queue, x: item)   // ignoring buffer overflow
  i: int = INC(q.back)           // allocate new slot, atomic
  STORE(q.items[i], x)           // fill it

Deq = proc (q: queue) returns (item)
  while true do
    range: int = READ(q.back) - 1
    for i: int in 1..range do
      x: item = SWAP(q.items[i], null)
      if x != null then return(x)

CS294, Yelick Time, p45

Queue Example Notes
• There are several atomic operations defined here
  – STORE, SWAP, INC
  – These may or may not be supported on given hardware, which would change the proof
• The deq operation starts at the front end of the queue
  – slots already dequeued will show up as nulls
  – slots not yet filled will also be nulls
  – picks the first non-empty slot
  – will repeat the scan until it finds an element, waiting for an enqueue to happen if necessary
• Many inefficiencies, such as the lack of a head pointer. The example is to illustrate the proof technique.

CS294, Yelick Time, p46

Need for Prophecy Variables
• Representation invariants:
  – to be useful, these must hold at every point in an execution, i.e., at the points observable by other concurrent operations
  – the sequential case was much weaker
• Abstraction function: consider the history
  – Enq(x) A
  – Enq(y) B
  – INC(q.back) A
  – OK(1) A
  – INC(q.back) B
  – OK(2) B
  – STORE(q.items[2], y) B
  – OK() B
  – OK() B
• For this execution, there is no way of defining an abstraction function without predicting the future, i.e., whether x or y will be dequeued first

CS294, Yelick Time, p47

Abstraction Relation
• An alternate approach to using history and prophecy variables is to use an abstraction relation, rather than a function
• Queue example:

  History       Linearized values
  (start)       {[]}
  Enq(x) A      {[], [x]}
  Enq(y) B      {[], [x], [y], [x,y], [y,x]}
  OK() B        {[y], [x,y], [y,x]}
  OK() A        {[x,y], [y,x]}
  Deq() C       {[x], [y], [x,y], [y,x]}
  OK(x) C       {[y]}

CS294, Yelick Time, p48

Key Idea in Queue Proof
• Lemma: If Enq(x), Enq(y), Deq(x), and Deq(y) are complete operations of H such that x’s Enq precedes y’s Enq, then y’s Deq does not precede x’s Deq.
• Proof: Suppose this is not true. Pick a linearization and let qi and qj be the queue values following the Deq operations of x and y, respectively. From the assumption that j < i, qj-1 = [y,…,x,…], which implies that y is enqueued before x, a contradiction.

CS294, Yelick Time, p49

Relaxed Consistency Model
• In the Multipol project, we used a weaker notion of correctness
• Each operation was divided into an invocation part and a completion part
  – like put/sync and get/sync in Split-C
  – e.g., E.update_mesh & E.sync_update
  – e.g., T.insert_element & T.sync_insert
• This worked well for several example problems
• Data structures do not compose well
  – need more than 2 phases in some cases

CS294, Yelick Time, p50

Generalizing Linearizability to Relaxed Consistency

• Generalization to split-phase operations is straightforward
• Each logical operation has:
  – op_start
  – op_complete
• In the formal model, each of these has invocation and response events (like other operations)
• The total order <H is defined by the completion of op_complete happening before the initiation of op_start
• Informally, the operation must take place atomically between the two phases
• Additional generalization: different processes see different total orders.