1
Consistent Global States of Distributed Systems: Fundamental Concepts and Mechanisms
Author: Ozalp Babaoglu and Keith Marzullo
Distributed Systems: 526 U1580
Professor: Ching-Chi Hsu
2
Introduction
Many problems in distributed computing can be cast as executing some notification or reaction when the state of the system satisfies a particular condition.
Global Predicate Evaluation (GPE): establish the truth of a Boolean expression whose variables may refer to the global system state.
A global state obtained by observation may not be consistent.
Asynchronous system:
  no bounds on the relative speeds of processes or on message delays
  impossible to maintain synchronized local clocks
  communication remains the only possible mechanism for synchronization
  channels are reliable but may deliver messages out of order
3
Outline
Two classes of solutions to the GPE problem:
  A reactive architecture: each process, when executing an event, notifies the monitor p0 by sending it a message describing the event.
  A snapshot architecture: the monitor p0 sends each process a 'state enquiry' message.
4
Definitions (1)
distributed system: a collection of sequential processes $p_1, p_2, \ldots, p_n$ networked by unidirectional communication channels
events: the activity of each sequential process; an event is either internal or a communication with another process, send(m) or receive(m)
local history of process $p_i$: $h_i = e_i^1 e_i^2 \ldots$
global history: $H = h_1 \cup h_2 \cup \cdots \cup h_n$
cause-effect relation $\to$:
  If $e_i^k, e_i^l \in h_i$ and $k < l$, then $e_i^k \to e_i^l$
  If $e_i = \mathrm{send}(m)$ and $e_j = \mathrm{receive}(m)$, then $e_i \to e_j$
  If $e \to e'$ and $e' \to e''$, then $e \to e''$
concurrent, $e \parallel e'$: neither $e \to e'$ nor $e' \to e$
5
Definitions (2)
distributed computation: a partially ordered set defined by the pair $(H, \to)$
space-time diagram: a graphical representation of a distributed computation
[Figure: space-time diagram of processes p1, p2, p3 with events $e_1^1 \ldots e_1^6$, $e_2^1 \ldots e_2^3$ and $e_3^1 \ldots e_3^6$]
6
Definitions (3)
local state of $p_i$ immediately after executing event $e_i^k$ is denoted by $\sigma_i^k$
global state: $\Sigma = (\sigma_1, \ldots, \sigma_n)$
a cut $C = (c_1, \ldots, c_n)$ is a subset of the global history H that contains an initial prefix of each local history, i.e. $C = h_1^{c_1} \cup \cdots \cup h_n^{c_n}$, where $h_i^{c_i}$ is the prefix made of the first $c_i$ events of $h_i$
a run R is a total ordering of all events in H that is consistent with each local history
Example: pp6
Note that a single distributed computation may have many runs
7
Example
Inconsistent cut and phantom deadlock
[Figure: the space-time diagram of p1, p2, p3 with two cuts, C and C', drawn across it; the messages are labeled req and resp]
8
Consistency
A consistent cut C is such that for all events e and e': $(e \in C) \land (e' \to e) \Rightarrow e' \in C$
A consistent global state is one corresponding to a consistent cut
A consistent run R is such that for all events e and e': $(e \to e') \Rightarrow$ e appears before e' in R
Example: pp6
If the run is consistent, then all the global states in the sequence will be consistent as well
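The consistent-cut condition can be checked mechanically. The sketch below is only an illustration under stated assumptions: the cut is given as a mapping from process index to $c_i$, and the computation is described by a hypothetical list of (send event, receive event) pairs. Because a cut already contains a prefix of every local history, it is enough to verify that no receive is included without its send.

```python
def is_consistent_cut(cut, messages):
    """Return True if the cut is consistent.

    cut[i] is c_i, the number of events of process i included in the cut.
    messages is a list of ((i, k), (j, l)) pairs meaning that the k-th event
    of p_i is send(m) and the l-th event of p_j is the matching receive(m).
    """
    for (i, k), (j, l) in messages:
        receive_in_cut = l <= cut[j]
        send_in_cut = k <= cut[i]
        if receive_in_cut and not send_in_cut:
            return False  # an effect is included without its cause
    return True


# Hypothetical computation: the 2nd event of p1 sends m,
# and the 1st event of p2 receives it.
messages = [((1, 2), (2, 1))]
ok_cut = is_consistent_cut({1: 2, 2: 1}, messages)   # send and receive included
bad_cut = is_consistent_cut({1: 1, 2: 1}, messages)  # receive without its send
```

Here `ok_cut` is true and `bad_cut` is false; the second cut is the kind of observation that can produce a phantom deadlock.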
9
Observing Distributed Computations
A monitor p0 will assume a passive role in that it will not send any messages of its own
The application processes notify p0 by sending it a message whenever they execute an event
The monitor p0 constructs an observation of the underlying distributed computation as the sequence of events in the order in which their notifications arrive
Due to the variability of message delays, an observation can correspond to a consistent run, an inconsistent run, or no run at all:
  $O_1 = e_2^1 e_1^1 e_3^1 e_3^2 e_3^4 e_1^2 e_2^2 e_3^3 e_1^3 e_1^4 e_3^5 \ldots$ => not a run
  $O_2 = e_1^1 e_3^1 e_2^1 e_3^2 e_1^2 e_3^3 e_3^4 e_1^3 e_2^2 e_3^5 e_3^6 \ldots$ => inconsistent run
  $O_3 = e_3^1 e_2^1 e_1^1 e_1^2 e_3^2 e_3^3 e_1^3 e_3^4 e_1^4 e_2^2 e_1^5 \ldots$ => consistent run
Order can be restored by defining a delivery rule that decides when received messages are presented to the application process
10
FIFO delivery
First-In-First-Out (FIFO) delivery: for all messages m and m' from $p_i$ to $p_j$,
  $\mathrm{send}_i(m) \to \mathrm{send}_i(m') \Rightarrow \mathrm{deliver}_j(m) \to \mathrm{deliver}_j(m')$
FIFO can be implemented by adding sequence numbers to messages
While FIFO delivery is sufficient to guarantee that observations correspond to runs, it is not sufficient to guarantee consistent observations
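A minimal sketch of FIFO delivery with sequence numbers, assuming each sender numbers its messages 0, 1, 2, ... per destination. The class name `FifoReceiver` and its methods are hypothetical, chosen for illustration only. Messages that arrive early are buffered until the gap before them is filled.

```python
class FifoReceiver:
    """Receiver-side FIFO delivery using per-sender sequence numbers."""

    def __init__(self):
        self.next_seq = {}  # sender -> next sequence number to deliver
        self.pending = {}   # sender -> {sequence number: message}

    def receive(self, sender, seq, msg):
        """Buffer msg and return the messages that become deliverable, in order."""
        buffered = self.pending.setdefault(sender, {})
        buffered[seq] = msg
        expected = self.next_seq.get(sender, 0)
        delivered = []
        while expected in buffered:
            delivered.append(buffered.pop(expected))
            expected += 1
        self.next_seq[sender] = expected
        return delivered


rx = FifoReceiver()
out = []
out += rx.receive("p1", 1, "b")  # arrives early, held back
out += rx.receive("p1", 0, "a")  # fills the gap, releases "a" then "b"
out += rx.receive("p1", 2, "c")
```

After these three calls `out` holds the messages in send order, even though the channel reordered them.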
11
Observing Distributed Computations with Real-Time Clocks
Environment:
  message delays are bounded by $\delta$
  channels are FIFO
  a global real-time clock exists
  each message includes RC(e), the value of the global real-time clock when event e occurs, as its timestamp
DR1: At time t, deliver all received messages with timestamps up to $t - \delta$ in increasing timestamp order
The observation is consistent iff the following is satisfied:
  Clock condition: $e \to e' \Rightarrow RC(e) < RC(e')$
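Rule DR1 can be sketched as a timestamp-ordered buffer. This is an illustration under the slide's assumptions (a global clock and a known delay bound), and the names `RealTimeMonitor`, `receive` and `deliver` are hypothetical. The current time is passed in explicitly so that the example stays deterministic.

```python
import heapq


class RealTimeMonitor:
    """Monitor applying DR1 with a known message-delay bound delta."""

    def __init__(self, delta):
        self.delta = delta
        self.buffer = []   # min-heap of (timestamp, arrival order, event)
        self.arrivals = 0  # tie-breaker for equal timestamps

    def receive(self, timestamp, event):
        heapq.heappush(self.buffer, (timestamp, self.arrivals, event))
        self.arrivals += 1

    def deliver(self, now):
        """Release every buffered event with timestamp <= now - delta, oldest first."""
        released = []
        while self.buffer and self.buffer[0][0] <= now - self.delta:
            released.append(heapq.heappop(self.buffer)[2])
        return released


mon = RealTimeMonitor(delta=5)
mon.receive(3, "b")
mon.receive(1, "a")
mon.receive(8, "c")
first = mon.deliver(now=8)    # timestamps up to 3 are safe to deliver
second = mon.deliver(now=13)  # timestamps up to 8 are safe to deliver
```

Waiting $\delta$ before delivering is what makes the rule safe: any message with a smaller timestamp must already have arrived.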
12
Observing Distributed Computations with Logical Clocks
Environment:
  channels are FIFO
  communication is asynchronous
  logical clocks are implemented
  each message includes LC(e), the value of the logical clock when event e occurs, as its timestamp
DR2: Deliver all messages that are stable at p0 in increasing timestamp order
Note: a message m is stable at p if no future message with timestamp < TS(m) can be received by p
Given FIFO channels, m is stable at p0 once p0 has received at least one message with timestamp > TS(m) from every other process
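The stability test behind DR2 can be sketched as follows, assuming FIFO channels and one notification per event. `StableDeliveryMonitor` is a hypothetical name, and the sketch tracks only the highest timestamp seen from each process, which is all the stability condition needs.

```python
class StableDeliveryMonitor:
    """Monitor applying DR2: deliver stable messages in timestamp order."""

    def __init__(self, processes):
        self.latest = {p: 0 for p in processes}  # highest timestamp seen per process
        self.pending = []                        # (timestamp, sender, event)

    def _stable(self, msg):
        ts, sender, _ = msg
        # Stable once every other process has sent something with a larger timestamp.
        return all(seen > ts for q, seen in self.latest.items() if q != sender)

    def receive(self, sender, ts, event):
        """Record the message and return the events that are now deliverable."""
        self.latest[sender] = max(self.latest[sender], ts)
        self.pending.append((ts, sender, event))
        ready = sorted(m for m in self.pending if self._stable(m))
        self.pending = [m for m in self.pending if not self._stable(m)]
        return [ev for _, _, ev in ready]


mon = StableDeliveryMonitor(["p1", "p2"])
r1 = mon.receive("p1", 1, "a")  # p2 has not been heard from yet
r2 = mon.receive("p2", 2, "b")  # makes "a" stable
r3 = mon.receive("p1", 3, "c")  # makes "b" stable
```

A process that stops sending blocks delivery indefinitely under this rule, which is one reason the later slides move on to causal delivery.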
13
Logical Clocks
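[Figure: space-time diagram of p1, p2, p3 with each event labeled by its logical clock value; the recoverable label rows are 1 2 4 5 6 7, then 1 5 6, then 1 2 3 4 5 7]
Logical clock: each process $p_i$ maintains a local variable $LC_i$
When a new event $e_i$ occurs, $p_i$ sets $LC_i$ to
  $LC_i + 1$ if $e_i$ is an internal or send event
  $\max\{LC_i, TS(m)\} + 1$ if $e_i = \mathrm{receive}(m)$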
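The two update rules translate directly into code. The sketch below assumes clocks start at 0; the class and method names are illustrative and not taken from the slides.

```python
class LogicalClock:
    """Logical clock LC_i of a single process."""

    def __init__(self):
        self.value = 0  # assumed initial value, before any event

    def internal_or_send(self):
        # LC_i := LC_i + 1; for a send, the result is also TS(m)
        self.value += 1
        return self.value

    def receive(self, ts):
        # LC_i := max(LC_i, TS(m)) + 1
        self.value = max(self.value, ts) + 1
        return self.value


p1, p2 = LogicalClock(), LogicalClock()
p1.internal_or_send()        # p1 executes an internal event, LC_1 = 1
ts = p1.internal_or_send()   # p1 sends m, TS(m) = 2
p2.internal_or_send()        # p2 executes an internal event, LC_2 = 1
recv = p2.receive(ts)        # p2 receives m, LC_2 = max(1, 2) + 1 = 3
```

The receive ends up with a larger clock value than the send, as the clock condition requires.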
14
Observing Distributed Computations with Causal Delivery
Causal Delivery (CD): $\mathrm{send}_i(m) \to \mathrm{send}_j(m') \Rightarrow \mathrm{deliver}_k(m) \to \mathrm{deliver}_k(m')$
If p0 uses a delivery rule satisfying CD, then all of its observations will be consistent
15
Efficient Delivering
For implementing causal delivery, what is really needed is an effective procedure for deciding: given events e and e' that are causally related, and their clock values, does there exist some other event e'' such that $e \to e'' \to e'$?
Given RC(e) < RC(e') (or LC(e) < LC(e')), it may be that $e \to e'$ or $e \parallel e'$; all that can be concluded is $\lnot(e' \to e)$
These observations suggest a timing mechanism TC whereby causal precedence relations between events can be deduced from their timestamps
Strong Clock Condition: $e \to e' \equiv TC(e) < TC(e')$
16
Causal History (1)
[Figure: space-time diagram of p1, p2, p3 with the causal history of event $e_1^4$ marked]
Causal history of event e: $\theta(e) = \{ e' \in H \mid e' \to e \} \cup \{e\}$
That is, $\theta(e)$ is the smallest consistent cut that includes e
17
Causal Histories (2)
Maintaining causal histories
  Each process $p_i$ initializes a local variable $\theta_i$ to $\emptyset$
  Each message m contains a timestamp TS(m), which is the causal history of its send event
Scheme
  If $e_i$ is an internal or send event, then $\theta(e_i) = \{e_i\} \cup$ the causal history of the previous local event
  If $e_i$ is the receive of message m by process $p_i$ from $p_j$, then $\theta(e_i) = \{e_i\} \cup$ the causal history of the previous local event of $p_i$ $\cup$ the causal history of the corresponding send event at $p_j$
The strong clock condition is satisfied if clock comparison is interpreted as set inclusion: $e \to e' \equiv \theta(e) \subset \theta(e')$, or equivalently $e \to e' \equiv e \in \theta(e')$ if $e \ne e'$
Problem: the causal histories grow rapidly
18
Vector Clocks
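The causal history of an event can be represented as a fixed-dimensional vector VC(e)[1..n] rather than a set, where $VC(e)[i] = k$ iff $\theta_i(e) = h_i^k$, for $i = 1, 2, \ldots, n$; here $\theta_i(e) = \theta(e) \cap h_i$ is the part of the causal history that belongs to $p_i$
[Figure: space-time diagram of p1, p2, p3 with each event labeled by its vector clock. Recoverable labels, p1: (1,0,0) (2,1,0) (3,1,3) (4,1,3) (5,1,3) (6,1,3); p2: (0,1,0) (1,2,4) (4,3,4); p3: (0,0,1) (1,0,2) (1,0,3) (1,0,4) (1,0,5) (1,0,6)]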
19
Maintaining Vector Clocks
Maintaining vector clocks
  Each process $p_i$ maintains a local vector $VC_i[1..n]$
  Each message m contains a timestamp TS(m), which is the vector clock value VC(e) of its send event e
Scheme
  if $e_i$ is an internal or send event:
    $VC_i[i] = VC_i[i] + 1$, and $VC(e_i) = VC_i$
  if $e_i = \mathrm{receive}(m)$:
    $VC_i = \max\{VC_i, TS(m)\}$ (componentwise)
    $VC_i[i] = VC_i[i] + 1$, and $VC(e_i) = VC_i$
$VC(e_i)[j]$ = number of events of $p_j$ that causally precede event $e_i$ of $p_i$ (for $j = i$ the count includes $e_i$ itself)
$V < V' \equiv (V \ne V') \land (\forall k: 1 \le k \le n: V[k] \le V'[k])$
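The scheme above can be sketched in a few lines. Assumptions for illustration only: processes are indexed from 0 rather than 1, timestamps are plain tuples, and the names `VectorClock` and `vc_less` are hypothetical.

```python
class VectorClock:
    """Vector clock VC_i of process i in a system of n processes (0-based index)."""

    def __init__(self, n, i):
        self.i = i
        self.v = [0] * n

    def internal_or_send(self):
        # VC_i[i] := VC_i[i] + 1; the result is VC(e_i), and TS(m) for a send
        self.v[self.i] += 1
        return tuple(self.v)

    def receive(self, ts):
        # VC_i := componentwise max(VC_i, TS(m)), then count the receive event itself
        self.v = [max(a, b) for a, b in zip(self.v, ts)]
        self.v[self.i] += 1
        return tuple(self.v)


def vc_less(v, w):
    """V < V': the vectors differ and every component of V is <= that of V'."""
    return v != w and all(a <= b for a, b in zip(v, w))


p, q = VectorClock(2, 0), VectorClock(2, 1)
ts_send = p.internal_or_send()   # (1, 0): p sends m
ts_local = q.internal_or_send()  # (0, 1): q executes an internal event
ts_recv = q.receive(ts_send)     # (1, 2): q receives m
```

Here `vc_less(ts_send, ts_recv)` holds because the send causally precedes the receive, while `ts_send` and `ts_local` are incomparable, which reflects that the two events are concurrent.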
20
Properties of Vector Clocks
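Strong Clock Condition: $e \to e' \equiv VC(e) < VC(e')$
Simple Strong Clock Condition: for events $e_i$ of $p_i$ and $e_j$ of $p_j$ with $i \ne j$, $e_i \to e_j \equiv VC(e_i)[i] \le VC(e_j)[i]$
Concurrent: $e_i \parallel e_j \equiv (VC(e_i)[i] > VC(e_j)[i]) \land (VC(e_j)[j] > VC(e_i)[j])$
Pairwise Inconsistent: for $i \ne j$, $(VC(e_i)[i] < VC(e_j)[i]) \lor (VC(e_j)[j] < VC(e_i)[j])$
Consistent Cut: $(c_1, c_2, \ldots, c_n)$ is consistent iff $\forall i, j: 1 \le i, j \le n: VC(e_i^{c_i})[i] \ge VC(e_j^{c_j})[i]$
Counting: the number of events that causally precede e is given by $\#(e) = \left( \sum_{j=1}^{n} VC(e)[j] \right) - 1$
Weak Gap-Detection: given $e_i$ and $e_j$, if $VC(e_i)[k] < VC(e_j)[k]$ for some $k \ne j$, then there exists an event $e_k$ such that $\lnot(e_k \to e_i) \land (e_k \to e_j)$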
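Several of these properties reduce to a few comparisons. The sketch below is illustrative only: it uses 0-based process indices, represents a cut by the vector clock of the last included event of each process (the zero vector if a process has no event in the cut), and all function names are hypothetical.

```python
def concurrent(vi, i, vj, j):
    """e_i || e_j for events of distinct processes i and j."""
    return vi[i] > vj[i] and vj[j] > vi[j]


def cut_is_consistent(frontier):
    """frontier[i] is the vector clock of the last event of process i in the cut."""
    n = len(frontier)
    return all(frontier[i][i] >= frontier[j][i]
               for i in range(n) for j in range(n))


def count_preceding(vc):
    """#(e): number of events that causally precede the event stamped vc."""
    return sum(vc) - 1


# Hypothetical two-process computation: process 0 sends m at (1, 0);
# process 1 executes an internal event at (0, 1) and receives m at (1, 2).
send, local, recv = (1, 0), (0, 1), (1, 2)
good = cut_is_consistent([send, recv])   # both the send and the receive are included
bad = cut_is_consistent([(0, 0), recv])  # the receive is included without the send
```

For this computation `concurrent(send, 0, local, 1)` is true, and `count_preceding(recv)` is 2 because the send and the internal event both precede the receive.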
21
Implementing Causal Delivery with Vector Clocks
Babaoglu & Marzullo: monitor p0 maintains an array D[1..n], where D[i] contains $TS(m_i)[i]$ and $m_i$ is the last message delivered from process $p_i$
DR3: Deliver message m from process $p_j$ when both of the following are satisfied:
  $D[j] = TS(m)[j] - 1$ (guarantees FIFO delivery)
  $D[k] \ge TS(m)[k], \forall k \ne j$ (guarantees the causal relation)
DR4: Monitor p0 maintains a counter D
  Deliver message m of event $e_i$ as soon as $D = \#(e_i) - 1$
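Rule DR3 can be sketched as a monitor that buffers notifications until both conditions hold. Assumptions for illustration: every event is reported to the monitor, the monitored processes are indexed 0..n-1 (standing for $p_1 \ldots p_n$), and the class name `CausalMonitor` is hypothetical.

```python
class CausalMonitor:
    """Monitor applying DR3 to notifications stamped with vector clocks."""

    def __init__(self, n):
        self.D = [0] * n     # D[i]: own component of the last timestamp delivered from i
        self.pending = []    # buffered (sender, timestamp, event)
        self.delivered = []  # the observation built so far

    def _deliverable(self, j, ts):
        fifo_ok = self.D[j] == ts[j] - 1
        causal_ok = all(self.D[k] >= ts[k] for k in range(len(ts)) if k != j)
        return fifo_ok and causal_ok

    def receive(self, j, ts, event):
        self.pending.append((j, ts, event))
        progress = True
        while progress:  # one delivery may unblock other buffered messages
            progress = False
            for item in list(self.pending):
                sender, stamp, ev = item
                if self._deliverable(sender, stamp):
                    self.pending.remove(item)
                    self.D[sender] = stamp[sender]
                    self.delivered.append(ev)
                    progress = True


# Process 0 sends m at (1, 0); process 1 executes an internal event at (0, 1)
# and receives m at (1, 2). The notifications reach the monitor out of order.
mon = CausalMonitor(2)
mon.receive(1, (1, 2), "recv")
mon.receive(1, (0, 1), "local")
mon.receive(0, (1, 0), "send")
```

The resulting observation is `["local", "send", "recv"]`: the receive is held back until the notification of its send has been delivered.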
22
Causal Delivery with Vector Clocks: Examples
[Figure: space-time diagram of p1, p2 and the monitor p0 illustrating causal delivery; recoverable vector timestamps include (0,0), (1,0), (1,1), (1,2), (2,2) and (3,2)]
23
Distributed Snapshots
In this strategy, p0 requests the states of the other processes and then combines them into a global state
Definition:
  channel state: for each channel from $p_i$ to $p_j$, $\chi_{i,j}$ = the set difference between the messages $p_i$ has sent on the channel (according to $\sigma_i$) and the messages $p_j$ has received on it (according to $\sigma_j$)
  incoming channels of process $p_i$: $IN_i$
  outgoing channels of process $p_i$: $OUT_i$
Snapshot Protocols
  Chandy and Lamport [1985]
  Morgan [1985]
24
Snapshot Protocol 1
Assumption:
  existence of a global real-time clock RC
  each message carries a timestamp
  message delays are bounded
Global clock algorithm
  p0 sends [take snapshot at $t_{ss}$] to all processes
  When clock RC reads $t_{ss}$, each process $p_i$ does the following:
    records its local state $\sigma_i$
    sends an empty message over all its outgoing channels and starts recording all messages received over each of its incoming channels
  The first time $p_i$ receives a message from $p_j$ with timestamp greater than or equal to $t_{ss}$, $p_i$ stops recording messages for that channel
25
Snapshot Protocol 2
Assumption:
  bounded message delays
  channels are FIFO
Chandy & Lamport
  p0 sends [take snapshot] to itself
  For each process $p_i$ receiving [take snapshot]:
    If it is the first time:
      records its local state $\sigma_i$
      sends [take snapshot] over each of its outgoing channels
      starts recording messages from its other incoming channels
    If it is not the first time:
      stops recording messages from that incoming channel
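The marker rules can be exercised in a small single-threaded simulation. This is a sketch under simplifying assumptions, not the protocol as deployed: two hypothetical processes exchange money over FIFO channels modeled as deques, the local state is an account balance, and the initiator is an ordinary process that starts the snapshot directly instead of a monitor sending the marker to itself.

```python
from collections import deque

MARKER = "take snapshot"


class Process:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance      # local state
        self.out = {}               # peer name -> FIFO channel (deque) to that peer
        self.incoming = []          # names of peers with a channel into this process
        self.recorded_state = None  # sigma_i once recorded
        self.recording = {}         # peer -> messages recorded so far on that channel
        self.channel_state = {}     # peer -> final recorded contents of that channel

    def send(self, peer, amount):
        self.balance -= amount
        self.out[peer].append(amount)

    def start_snapshot(self):
        # Record the local state, send markers, start recording all incoming channels.
        self.recorded_state = self.balance
        for channel in self.out.values():
            channel.append(MARKER)
        self.recording = {peer: [] for peer in self.incoming}

    def receive(self, sender, msg):
        if msg == MARKER:
            if self.recorded_state is None:  # first marker seen
                self.start_snapshot()
            # A marker ends recording on the channel it arrived on.
            self.channel_state[sender] = self.recording.pop(sender)
        else:
            self.balance += msg
            if sender in self.recording:
                self.recording[sender].append(msg)


def deliver(src, dst):
    """Deliver the oldest message in the channel from src to dst."""
    dst.receive(src.name, src.out[dst.name].popleft())


p, q = Process("p", 100), Process("q", 50)
p.out["q"], q.out["p"] = deque(), deque()
p.incoming, q.incoming = ["q"], ["p"]

p.send("q", 10)
q.send("p", 5)
p.start_snapshot()  # p initiates: records 90 and sends a marker after the 10
deliver(q, p)       # the 5 reaches p while p is recording the channel from q
deliver(p, q)       # the 10 reaches q before q has recorded its state
deliver(p, q)       # the marker reaches q: q records 55
deliver(q, p)       # q's marker reaches p: the channel from q is recorded as [5]

snapshot_total = (p.recorded_state + q.recorded_state
                  + sum(p.channel_state["q"]) + sum(q.channel_state["p"]))
```

The recorded states (90 and 55) plus the recorded in-transit message (5) add up to the 150 units present in the system, which is what a consistent global state should show.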
26
Chandy & Lamport (1985)
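[Figure: space-time diagram of p1 and p2 with the monitor p0, events $e_1^1 \ldots e_1^6$ and $e_2^1 \ldots e_2^5$; $e_1^*$ and $e_2^*$ mark where p1 and p2 first receive [take snapshot]]
Real computation: $R = e_2^1 e_1^1 e_1^2 e_1^3 e_2^2 e_1^4 e_2^3 e_2^4 e_1^5 e_2^5 e_1^6$
In terms of global states: $\Sigma^{00}\ \Sigma^{01}\ \Sigma^{11}\ \Sigma^{21}\ \Sigma^{31}\ \Sigma^{32}\ \Sigma^{42}\ \Sigma^{43}\ \Sigma^{44}\ \Sigma^{54}\ \Sigma^{55}\ \Sigma^{65}$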
27
Properties of Snapshots
Definition
  $\Sigma^a$: the global state in which the snapshot protocol is initiated; $\Sigma^f$: the global state in which the protocol terminates; $\Sigma^s$: the global state constructed
  $e_i^*$ denotes the event at which $p_i$ receives [take snapshot] for the first time, causing $p_i$ to record its state
  let $t_i$ be the time at which $e_i^*$ occurs
  $e_i$ is a prerecording event if $e_i \to e_i^*$; otherwise it is a post-recording event
Properties
  There exists a run R' such that $\Sigma^a \leadsto_{R'} \Sigma^s \leadsto_{R'} \Sigma^f$, where $\leadsto_{R'}$ reads "is followed in R' by"
  That is to say, $\Sigma^s$ could have happened
28
Argumentation (1)
Chandy & Lamport (1985)
  Consider any adjacent (post-recording, prerecording) pair of events (e, e') in R; then $\lnot(e \to e')$, so swapping e and e' yields another consistent run. Repeating until no such pair remains gives R'.
  swap($e_1^3$, $e_2^2$): $r_1 = e_2^1 e_1^1 e_1^2 e_2^2 e_1^3 e_1^4 e_2^3 e_2^4 e_1^5 e_2^5 e_1^6$
  swap($e_1^4$, $e_2^3$): $r_2 = e_2^1 e_1^1 e_1^2 e_2^2 e_1^3 e_2^3 e_1^4 e_2^4 e_1^5 e_2^5 e_1^6$
  swap($e_1^3$, $e_2^3$): $R' = e_2^1 e_1^1 e_1^2 e_2^2 e_2^3 e_1^3 e_1^4 e_2^4 e_1^5 e_2^5 e_1^6$
  The global state after executing the last prerecording event ($e_2^3$) in R' is $\Sigma^s = \Sigma^{23}$, the constructed global state
  Had the computation followed this run, $\Sigma^s$ could have happened
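The swapping argument can be replayed on the example run. In the sketch below an event is a (process, index) pair, and `last_pre` encodes that the prerecording events are the first two events of p1 and the first three of p2, as implied by the constructed state; the function names are hypothetical.

```python
# The run R from the example, as (process, event index) pairs.
R = [(2, 1), (1, 1), (1, 2), (1, 3), (2, 2), (1, 4),
     (2, 3), (2, 4), (1, 5), (2, 5), (1, 6)]

# Index of the last prerecording event of each process.
last_pre = {1: 2, 2: 3}


def is_prerecording(event):
    process, index = event
    return index <= last_pre[process]


def move_prerecording_first(run):
    """Swap adjacent (post-recording, prerecording) pairs until none is left."""
    run = list(run)
    swapped = True
    while swapped:
        swapped = False
        for x in range(len(run) - 1):
            if not is_prerecording(run[x]) and is_prerecording(run[x + 1]):
                run[x], run[x + 1] = run[x + 1], run[x]
                swapped = True
    return run


R_prime = move_prerecording_first(R)
# Number of events each process has executed once all prerecording events are done.
prefix = [e for e in R_prime if is_prerecording(e)]
state = (sum(1 for p, _ in prefix if p == 1), sum(1 for p, _ in prefix if p == 2))
```

`R_prime` matches the run R' listed above, and `state` is (2, 3), the constructed global state. The swaps never reorder two events of the same process, so every local history is preserved.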
29
Argumentation (2)
Lai & Yang (1987)
  Let GSN($t_i$: $p_i \in P$) be a snapshot taken between $\tau_1$ and $\tau_2$ during the computation R. Let $\delta = \tau_2 - \tau_1$ and construct R' as follows:
  R' is the same as R except that every post-recording event in R is postponed by $\delta$ units of time, that is,
    $R'(t) = R(t)$ if $R(t)$ is an event at $p_i$ and $t \le t_i$
    $R'(t) = R(t - \delta)$ if $R(t - \delta)$ is an event at $p_i$ and $t - \delta > t_i$
    $R'(t)$ is empty otherwise
Example
30
Properties of Global Predicates
Stable Predicates
  Many system properties one wishes to detect have the characteristic that once they become true, they remain true
  If $\Phi$ is a stable predicate, then since $\Sigma^a \leadsto \Sigma^s \leadsto \Sigma^f$:
    ($\Phi$ is true in $\Sigma^s$) => ($\Phi$ is true in $\Sigma^f$)
    ($\Phi$ is false in $\Sigma^s$) => ($\Phi$ is false in $\Sigma^a$)
Nonstable Predicates
  the condition encoded by the predicate may not persist long enough for it to be true when the predicate is evaluated
  if a predicate $\Phi$ is found to be true by the monitor, we do not know whether $\Phi$ ever held during the actual run
31
Nonstable Predicates
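Two problems
  The condition encoded by the predicate may not persist long enough for it to be true when the predicate is evaluated
  If a predicate $\Phi$ is found to be true by the monitor, we do not know whether $\Phi$ ever held during the actual run
The predicate may have held even if it is not detected, and even if it is detected it may never have held.
Extended nonstable global predicates: apply to the entire distributed computation
  Possibly($\Phi$): some consistent observation of the computation passes through a global state in which $\Phi$ holds
  Definitely($\Phi$): every consistent observation of the computation passes through a global state in which $\Phi$ holds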
32
Detecting Possibly and Definitely
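min($\sigma_i^k$): the global state with the smallest level in the lattice containing $\sigma_i^k$
max($\sigma_i^k$): the global state with the largest level in the lattice containing $\sigma_i^k$
Examples: min($\sigma_1^3$) = $\Sigma^{31}$, max($\sigma_1^3$) = $\Sigma^{33}$
min($\sigma_i^k$) = $(\sigma_1^{c_1}, \sigma_2^{c_2}, \ldots, \sigma_n^{c_n})$: $\forall j: VC(\sigma_j^{c_j})[j] = VC(\sigma_i^k)[j]$
max($\sigma_i^k$) = $(\sigma_1^{c_1}, \sigma_2^{c_2}, \ldots, \sigma_n^{c_n})$: $\forall j: VC(\sigma_j^{c_j})[i] \le VC(\sigma_i^k)[i]$ and $((\sigma_j^{c_j} = \sigma_j^f)$ or $(VC(\sigma_j^{c_j+1})[i] > VC(\sigma_i^k)[i]))$
The level of min($\sigma_i^k$), the lowest level containing $\sigma_i^k$, is the sum of the components of the vector timestamp VC($\sigma_i^k$)
An algorithm for detecting Definitely($\Phi$): $O(k^n)$, where k is the maximum number of events a monitored process has executed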
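A level-by-level walk over the lattice of consistent global states gives a direct, exponential-time way to evaluate both modalities. The sketch below is an illustration under stated assumptions: processes are indexed from 0, `vcs[i]` lists the vector clocks of the events of process i in order, a global state is the tuple of event counts $(c_1, \ldots, c_n)$, and the predicate is a function of that tuple. All names are hypothetical.

```python
def consistent(c, vcs):
    """A state is consistent if no included event depends on an excluded one."""
    n = len(c)
    return all(vcs[j][c[j] - 1][i] <= c[i]
               for j in range(n) if c[j] > 0
               for i in range(n))


def successors(c, vcs):
    """Consistent states reachable from c by executing one more event."""
    for i in range(len(c)):
        if c[i] < len(vcs[i]):
            nxt = c[:i] + (c[i] + 1,) + c[i + 1:]
            if consistent(nxt, vcs):
                yield nxt


def possibly(phi, vcs):
    level = {tuple(0 for _ in vcs)}
    while level:
        if any(phi(c) for c in level):
            return True
        level = {s for c in level for s in successors(c, vcs)}
    return False


def definitely(phi, vcs):
    level = {tuple(0 for _ in vcs)}
    final = tuple(len(h) for h in vcs)
    while True:
        # Keep only the states reachable without ever passing through phi.
        level = {c for c in level if not phi(c)}
        if not level:
            return True   # every run was forced through a phi state
        if final in level:
            return False  # some run reached the end while avoiding phi
        level = {s for c in level for s in successors(c, vcs)}


# Two processes with two events each and no messages: all interleavings are runs.
independent = [[(1, 0), (2, 0)], [(0, 1), (0, 2)]]
poss = possibly(lambda c: c == (1, 1), independent)
defi = definitely(lambda c: c == (1, 1), independent)
forced = definitely(lambda c: c[0] == 1, independent)

# Same shape, but the first event of process 1 receives a message sent by
# the first event of process 0, so state (0, 1) is not consistent.
with_message = [[(1, 0), (2, 0)], [(1, 1), (1, 2)]]
unreachable = possibly(lambda c: c == (0, 1), with_message)
```

In the independent computation the state (1, 1) is possible but not definite, since a run may execute both events of process 0 first, whereas a condition that depends only on process 0 having executed exactly one event is definite. The number of states visited is bounded by the size of the lattice, which matches the $O(k^n)$ cost noted above.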
33
Example