Finding Liveness Bugs In Distributed Systems
R. Jhala [C. Killian, J. Anderson, A. Vahdat]
UC San Diego
Concurrent, Distributed Systems
Stock Exchanges, Telecoms, Commuter Rail
System: Nodes exchanging Messages
Execution
1. Node gets message event
2. Executes event handler
   - Updates node state
   - Sends new messages
3. Repeat…
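The execution loop above can be sketched as follows. This is a minimal illustration, not Mace code: the `Node` class, its `handle` method, and the countdown message are all hypothetical names chosen for the example.

```python
from collections import deque

class Node:
    """Hypothetical node whose state is a count of handled messages."""
    def __init__(self, name):
        self.name = name
        self.handled = 0  # node state

    def handle(self, msg, peers):
        # Update node state.
        self.handled += 1
        # Send new messages: pass a decremented counter to the next peer.
        if msg > 0:
            nxt = peers[(peers.index(self.name) + 1) % len(peers)]
            return [(nxt, msg - 1)]
        return []

def run(nodes, initial_messages):
    """Deliver messages until none are pending; return the number of events."""
    pending = deque(initial_messages)
    peers = sorted(nodes)
    events = 0
    while pending:
        dest, msg = pending.popleft()        # 1. node gets message event
        out = nodes[dest].handle(msg, peers)  # 2. executes event handler
        pending.extend(out)                   #    (new messages are queued)
        events += 1                           # 3. repeat
    return events

nodes = {name: Node(name) for name in ("n1", "n2")}
events = run(nodes, [("n1", 3)])
print(events, nodes["n1"].handled, nodes["n2"].handled)
```

Here messages are delivered in FIFO order; a real system (and a model checker) would also consider reordering, loss, and node failure.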
Distributed Systems: Challenges
System: Nodes exchanging Messages
Challenges
• Nodes: enter, leave, fail
• Messages: reordered, lost
System must stay available
- Eventually, all nodes regroup
- Eventually, all packets delivered
- Eventually, some good happens
Liveness Properties
The Space of System Executions
[Figure: states of nodes 1 and 2, starting from the Initial State, connected by transitions labeled event@1, event@2, fail@1, fail@2]
At each state, scheduler picks:
1. Node n
2. Event @n
3. Executes code
An Execution = Sequence of Choices
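Because an execution is just the list of scheduler choices, it can be stored and replayed. The sketch below assumes a toy two-node state (alive flags plus per-node event counters) and a hypothetical choice sequence; neither comes from the slides.

```python
def step(state, choice):
    """Apply one scheduler choice (kind, node) to a toy two-node state."""
    kind, node = choice
    alive, counts = state
    alive, counts = dict(alive), dict(counts)  # copy, so states stay immutable
    if kind == "fail":
        alive[node] = False
    elif kind == "event" and alive[node]:
        counts[node] += 1  # a failed node ignores further events
    return alive, counts

def execute(initial, choices):
    """Replay a sequence of choices from the initial state."""
    state = initial
    for choice in choices:
        state = step(state, choice)
    return state

initial = ({1: True, 2: True}, {1: 0, 2: 0})
choices = [("event", 1), ("event", 2), ("fail", 1),
           ("event", 1), ("fail", 2), ("event", 2)]
final = execute(initial, choices)
print(final)
```

Replaying the same choices from the same initial state yields the same final state, which is what makes systematic exploration of executions possible.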
Safety Bugs
Execution that drives system to a bad state
Bad States
• Null dereferences
• Buffer overflows
• Assertion failures
• Low-level crash
How to find Safety Bugs?
Find a path from Initial to Bad by systematically exploring executions (iterating over sequences of choices)
Model Checking for Safety Bugs
Find a path from Initial to Bad by systematically exploring executions [Verisoft 97, Cmc 04, Chess 07]
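A minimal sketch of this kind of search, assuming a breadth-first strategy with a depth bound. It is not the algorithm of any of the cited tools; the toy counter system and the names `find_safety_bug` and `toy_step` are hypothetical.

```python
from collections import deque

def find_safety_bug(initial, choices, step, is_bad, max_depth):
    """Breadth-first search over choice sequences of length <= max_depth.

    Returns the shortest choice sequence that reaches a bad state, or None.
    States must be hashable so that revisits can be skipped.
    """
    frontier = deque([(initial, [])])
    seen = {initial}
    while frontier:
        state, path = frontier.popleft()
        if is_bad(state):
            return path
        if len(path) == max_depth:
            continue  # depth bound reached, do not expand further
        for choice in choices:
            nxt = step(state, choice)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [choice]))
    return None

# Toy system: the state is a counter, "inc" adds 1, "dbl" doubles, 6 is bad.
def toy_step(state, choice):
    return state + 1 if choice == "inc" else state * 2

trace = find_safety_bug(1, ["inc", "dbl"], toy_step, lambda s: s == 6, max_depth=5)
print(trace)
```

The returned list is the error trace: the sequence of choices that drives the system from the initial state to the bad state.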
Safety Properties are too Low Level
Distributed Systems: Designed for crashes & failures
Challenge: End-to-end Problems
Liveness bugs
[Figure: Initial State, Live States, Bad States]
Good States: all nodes regroup, all packets delivered
Live States: eventually Good happens
Live Executions
[Figure: Initial State, Live States]
Liveness Violations
Execution never reaches a live state
How to Find Liveness Violations?
Explore all executions?
Infinitely many ...
Explore all executions up to a bound?
Combinatorial explosion (depth < 50)
Liveness at depth >> 50
[Verisoft 97, Cmc 04, Chess 07]
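To see why bounded exhaustive search stalls, count the choice sequences. The branching factor of 4 is an assumed value for illustration only; the slides do not give one.

```python
def num_choice_sequences(branching, depth):
    """Choice sequences of exactly `depth` steps when exactly
    `branching` choices are enabled in every state."""
    return branching ** depth

for depth in (10, 25, 50):
    print(depth, num_choice_sequences(4, depth))
```

Even with only four choices per state, depth 50 gives 2^100 sequences, far beyond what can be enumerated.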
Looks pretty hopeless...
Idea 1: Dead States
Dead States
• No execution can reach live states
• Recovery is impossible
To find liveness bugs, look for dead executions.
How to tell if a state is Dead?
Idea 2: Random Walks
How to tell if a state is Dead?
Execute long random walks from the state
• Dead state: Pr[reaching live] = 0
• Any other state: Pr[reaching live] = 1
Executions and Random Walks
At each execution step,
1. Scheduler picks node n
2. Scheduler picks event @n
3. Executes event code
Random Walk: Scheduler picks randomly (from some Prob. Dist. over nodes, events)
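A random walk under these rules might look like the sketch below. It assumes a uniform distribution over the enabled (node, event) pairs and a toy system; `random_walk`, `enabled`, and `step` are hypothetical names, and `enabled` is assumed never to return an empty list.

```python
import random

def random_walk(state, enabled, step, is_live, max_steps, rng):
    """Walk randomly for at most max_steps steps.

    Returns the number of steps taken to reach a live state,
    or None if no live state was reached.
    """
    for taken in range(max_steps):
        if is_live(state):
            return taken
        node, event = rng.choice(enabled(state))  # scheduler picks node and event
        state = step(state, node, event)          # executes event code
    return max_steps if is_live(state) else None

# Toy system: an event advances a counter, a failure resets it; live at 3.
def enabled(state):
    return [(1, "event"), (2, "event"), (1, "fail")]

def step(state, node, event):
    return state + 1 if event == "event" else 0

def is_live(state):
    return state >= 3

steps = random_walk(0, enabled, step, is_live, 10_000, random.Random(0))
stuck = random_walk(0, enabled, lambda s, n, e: 0, is_live, 100, random.Random(0))
print(steps, stuck)
```

The second call uses a step function that can never make progress, so the walk returns `None`: that is the signal used to test whether a state is dead.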
Liveness Bugs = Search + Random Walks
1. Systematic Search: find candidates
2. Random Walk: test if candidate dead
Iterate
If walk length >> avg. steps to liveness, then a non-live walk is likely a liveness bug!
[Figure labels: 100k Events, 1k Events]
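One way to quantify the rule above, offered as a sketch rather than the authors' own argument: let T be the number of steps a random walk needs to reach a live state from a state that is not dead, with mean μ (the avg. steps to liveness). Markov's inequality bounds the chance that a walk of length L stays non-live even though the state is recoverable. The values μ = 1,000 and L = 100,000 are illustrative assumptions.

```latex
\Pr[T \ge L] \;\le\; \frac{\mathbb{E}[T]}{L} \;=\; \frac{\mu}{L},
\qquad \text{e.g. } \mu = 10^{3},\ L = 10^{5}
\;\Rightarrow\; \Pr[T \ge L] \le 10^{-2}.
```

So under these assumptions, a non-live walk started from a recoverable state is rare, and a walk that stays non-live points to a dead state.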
100,000 Step Execution (2 Gb Log file)
How to pinpoint the bug?
Idea 3: The Critical Transition
System transitions from a recoverable to a dead state
How to find the Critical Transition without knowing Dead States?
Binary Search using Random Walks!
The Critical Transition pinpoints the bug
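The binary search can be sketched as follows, assuming the recorded execution starts in a recoverable state, ends in a dead state, and never leaves the dead states once it enters them. The deadness oracle here is a bounded random walk; the toy states and the names `critical_transition` and `make_looks_dead` are hypothetical.

```python
import random

def make_looks_dead(step, is_live, walk_length, rng):
    """Build an oracle that calls a state dead if one random walk of
    walk_length steps from it never reaches a live state."""
    def looks_dead(state):
        for _ in range(walk_length):
            if is_live(state):
                return False
            state = step(state, rng)
        return not is_live(state)
    return looks_dead

def critical_transition(states, looks_dead):
    """Return i such that states[i - 1] is recoverable and states[i] looks dead."""
    lo, hi = 0, len(states) - 1  # invariant: states[lo] recoverable, states[hi] dead
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if looks_dead(states[mid]):
            hi = mid
        else:
            lo = mid
    return hi

# Toy trace: states 0..6 can still recover, states 7..9 are stuck for good.
trace = [("ok", i) if i < 7 else ("stuck", i) for i in range(10)]

def toy_step(state, rng):
    if state[0] == "ok" and rng.random() < 0.5:
        return ("live", state[1])
    return state

looks_dead = make_looks_dead(toy_step, lambda s: s[0] == "live", 200, random.Random(0))
index = critical_transition(trace, looks_dead)
print(index)
```

The search needs only about log2(n) oracle calls, so roughly 17 random-walk tests suffice for a 100,000-step execution.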
Recap
Liveness Bugs Found: System has shot itself (but doesn't know it)
Systematic Search: Finds candidate dead states
Random Walks: Determine if candidate is dead
Critical Transition: The event that makes recovery impossible
Bells and Whistles (1/2)
Random Walk Bias
• Assign “likely” events higher weight
• e.g. application > network > timer > fail
Bugs not missed
• Random walk only tests deadness
Live state reached sooner
• Error traces shorter, simpler
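A biased scheduler choice can be sketched with weighted sampling. The numeric weights below are assumptions; the slides give only the ordering application > network > timer > fail.

```python
import random
from collections import Counter

# Hypothetical weights that respect the ordering on the slide.
WEIGHTS = {"application": 8, "network": 4, "timer": 2, "fail": 1}

def pick_event(enabled_kinds, rng):
    """Pick one enabled event kind, favoring the more likely kinds."""
    kinds = list(enabled_kinds)
    return rng.choices(kinds, weights=[WEIGHTS[k] for k in kinds], k=1)[0]

rng = random.Random(1)
counts = Counter(pick_event(WEIGHTS, rng) for _ in range(15_000))
print(counts.most_common())
```

Every event kind keeps a nonzero weight, so failures still occur during the walk, only less often.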
Bells and Whistles (2/2)
Prefix-Based Search
• Restart search after reaching liveness
• Analyzes effect of failures in “steady-state”
Evaluation
[Figure: Mace (C++) System, Liveness Properties, MaceMC, Liveness Bugs, Critical Transition]
Systems and Liveness Properties
RandTree: Random overlay tree with max degree.
  Eventually, all nodes form a single tree.
MaceTransport: User-level, reliable messaging (transport) service.
  Eventually, all messages are acknowledged.
Pastry: Key-based routing, using an overlay ring.
  Eventually, all nodes form a ring.
Chord: Key-based routing, using an overlay ring.
  Eventually, all nodes form a ring.
Sample Bug: RandTree
Nodes with Child, Parent pointers
Property: Eventually nodes form a tree
[Figure: nodes A, B, C]
1. C requests to join under A
2. A sends ack
3. C fails and restarts
4. C ignores ack from A
5. C joins under B
Bug: System stuck as a DAG!
C’s failure not propagated to A
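The stuck state can be checked against the tree property with a small predicate over child pointers. This is a sketch under assumptions the slides do not state: A is taken to be the root and B a child of A, and `forms_single_tree` is a hypothetical name, not a Mace API.

```python
def forms_single_tree(children, root):
    """True if the child pointers describe a single tree rooted at `root`:
    the root has no parent, every other node has exactly one parent,
    and every node is reachable from the root."""
    parent_count = {node: 0 for node in children}
    for kids in children.values():
        for kid in kids:
            parent_count[kid] += 1
    if parent_count[root] != 0:
        return False
    if any(count != 1 for node, count in parent_count.items() if node != root):
        return False
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children[node])
    return seen == set(children)

# After the bug: A still lists C as a child, and C has also joined under B.
stuck = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
# If C's failure had reached A, A would have dropped its stale child pointer.
repaired = {"A": {"B"}, "B": {"C"}, "C": set()}
print(forms_single_tree(stuck, "A"), forms_single_tree(repaired, "A"))
```

In the stuck state C has two parents, so the pointers form a DAG and the predicate stays false no matter how long the system runs.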
Liveness Bugs Yield Safety Assertions
Dead States: Violations of a priori unknown safety properties
Critical Transition
• Helps identify dead states
• Yields new safety properties and bugs
New Safety Property: Chord
Nodes with Fwd, Back pointers
Property: Eventually nodes form a ring
Critical Transition To Dead State
Where: n.back = n, n.fwd = m
New Safety Property
IF n.back = n THEN n.fwd = n
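Once stated, the new safety property is cheap to check at every state. The sketch below assumes a minimal node record; `ChordNode` and `violates_ring_property` are hypothetical names, not the Mace implementation.

```python
from dataclasses import dataclass

@dataclass
class ChordNode:
    name: str
    fwd: str   # name of the forward neighbor
    back: str  # name of the back neighbor

def violates_ring_property(node):
    """Check IF n.back = n THEN n.fwd = n; True means the property is violated."""
    return node.back == node.name and node.fwd != node.name

dead = ChordNode("n", fwd="m", back="n")    # the state after the critical transition
alone = ChordNode("n", fwd="n", back="n")   # a node that is alone in its ring
member = ChordNode("n", fwd="m", back="k")  # a node with two distinct neighbors
print(violates_ring_property(dead), violates_ring_property(alone),
      violates_ring_property(member))
```

A violation can now be reported as an ordinary safety bug by bounded search, without running long random walks.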
Scorecard
System         Bugs  Liveness  Safety
MaceTransport    11         5       6
RandTree         17        12       5
Pastry            5         5       0
Chord            19         9      10
Totals           52        31      21
Several “protocol level” bugs
Routinely used by Mace programmers
Programming Challenges
How to handle unexpected events?
How to propagate effects of failures?
How to limit impact on performance?
Take Away Message
Liveness Bugs Are Very Important
Randomness Helps.
www.macesystems.org (papers, code, etc.)