Upload
vudung
View
218
Download
1
Embed Size (px)
Citation preview
Minimizing Faulty Executions of Distributed Systems
Colin Scott, Aurojit Panda, Vjekoslav Brajkovic, George Necula, Arvind Krishnamurthy, Scott Shenker
1 LaToza, Venolia, DeLine, ICSE’ 06
49% of developers’ time spent on debugging!1
Understanding How Bug Is Triggered
Fixing Problematic Code
OutlineIntroductionBackground
Minimization
Evaluation Conclusion
Randomized Testing with
DEMi
Node 1 Node N
Test Coordinator
QA Testbed
Software Under Test
OutlineIntroductionBackground
Minimization
Evaluation Conclusion
Randomized Testing with
DEMi
Node 1 Node N
Test Coordinator
QA Testbed
Software Under Test
OutlineIntroductionBackground
Minimization
Evaluation Conclusion
Randomized Testing with
DEMi
Node 1 Node N
Test Coordinator
QA Testbed
Software Under Test
OutlineIntroductionBackground
Minimization
Evaluation Conclusion
Randomized Testing with
DEMi
Node 1 Node N
Test Coordinator
QA Testbed
Software Under Test
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
msg dst: b
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
msg dst: b
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
msg dst: b
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
msg dst: b
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
msg dst: b
message delivery
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
msg dst: b
message delivery
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
timer dst: bmsg
dst: a
message delivery
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
timer dst: b
msg dst: a
message delivery
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
timer dst: b
msg dst: a
message delivery
External events (events outside
system’s control): Crash-recovery Process creation External message
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
timer dst: b
msg dst: a
message delivery
crash recovery
External events (events outside
system’s control): Crash-recovery Process creation External message
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
timer dst: b
msg dst: a
message delivery
crash recovery
External events (events outside
system’s control): Crash-recovery Process creation External message
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
timer dst: b
msg dst: a
message delivery
crash recovery
Invariant Checking
An invariant is a predicate P over the state of all processes.
a b c d e
{ ✔ ✗
✗
A faulty execution is one that ends in an invariant violation.
e1 i1 i2 i3 i4e2
OutlineIntroductionBackground
Node 1 Node N
Test Coordinator
QA Testbed
Software Under Test
Minimization
Evaluation Conclusion
Randomized Testing with
DEMi
Formal Problem Statement
Find: locally minimal reproducing sequence τ’:
τ’ violates P, |τ’| ≤ |τ|
τ’ contains a subsequence of the external events of τ
if we remove any external event e from τ’,
¬∃ τ’’ containing same external events - e, s.t. τ’’ violates P
Given: schedule τ that results in violation of P
Formal Problem Statement
After finding τ’, minimize internal events:
remove extraneous message deliveries from τ’
Minimization
τ :Given
Straightforward approach: Enumerate all schedules |τ’| ≤ |τ|, Pick shortest sequence that reproduces ✗
τ Schedule Space
… ✗e1 i1 i2 i4e2 enim
Minimization
τ :Given
Straightforward approach: Enumerate all schedules |τ’| ≤ |τ|, Pick shortest sequence that reproduces ✗
τ Schedule Space
… ✗e1 i1 i2 i4e2 enim
Observation #1: many schedules are commutative
Adopt DPOR: Dynamic Partial Order Reduction
C. Flanagan, P. Godefroid, “Dynamic Partial-Order Reduction for Model Checking Software”, POPL ‘05
Approach: prioritize schedule space exploration
Assume: fixed time budget Objective: quickly find small failing schedules
x
τ :
ene3ext: e5e1 e2 e4
… ✗e1 i1 i2 i4e2 enim
(Apply Delta Debugging1)
1A Zeller, R. Hildebrandt, “Simplifying and Isolating Failure-Inducing Input”, IEEE ‘02
Observation #2: selectively mask original events
τ :
ene3ext: e5
sub1:
e1 e2 e4
… ✗e1 i1 i2 i4e2 enim
e4 e5 en…
(Apply Delta Debugging1)
1A Zeller, R. Hildebrandt, “Simplifying and Isolating Failure-Inducing Input”, IEEE ‘02
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
ene5e4e1 e2 e3
foreach i in τ: if i is pending: deliver i # ignore unexpected
… e5e4 en
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
ene5e4e1 e2 e3
foreach i in τ: if i is pending: deliver i # ignore unexpected
i1 … e5e4 en
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
ene5e4e1 e2 e3
foreach i in τ: if i is pending: deliver i # ignore unexpected
i1 … e5e4 en
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
ene5e4e1 e2 e3
foreach i in τ: if i is pending: deliver i # ignore unexpected
i1 i4 … e5e4 enim
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
ene5e4e1 e2 e3
foreach i in τ: if i is pending: deliver i # ignore unexpected
i1 i4 … e5e4 enim
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
ene5e4e1 e2 e3
foreach i in τ: if i is pending: deliver i # ignore unexpected
i1 i4 ✗… e5e4 enim
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
ene5e4e1 e2 e3
foreach i in τ: if i is pending: deliver i # ignore unexpected
i1 i4 ✗… e5e4 enim
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
ene5e4
i1 i4 ✗… e5e4 enim
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
ene5e4
i1 i4 ✗… e5e4 enim
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
sub2:
ene5e4
i1 i4 ✗… e5e4 enim
e5 en
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
sub2: i1 i4 …
ene5e4
i1 i4 ✗… e5e4 enim
e5 enim
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
sub2: i1 i4 ✔…
ene5e4
i1 i4 ✗… e5e4 enim
e5 enim
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
sub2:
…. . .
i1 i4 ✔…Explore backtrack points until (i) ✗ or (ii) time budget for sub2 expired
ene5e4
i1 i4 ✗… e5e4 enim
e5 enim
Observation #2: selectively mask original events
Observation #4: shrink external message contents
Observation #1: many schedules are commutative
Approach: prioritize schedule space exploration
Goal: find minimal schedule that produces violation
Minimize internal events after externals minimized
Observation #2: selectively mask original events
Observation #3: some contents should be masked
OutlineIntroductionBackground
Node 1 Node N
Test Coordinator
QA Testbed
Software Under Test
Minimization
Evaluation Conclusion
Randomized Testing with
DEMi
How well does DEMi work?To
tal E
vent
s
0
300
600
900
1200
1500
1800
2100
2400
2700
3000
Case Studyraft-45 raft-46 raft-56 raft-58a raft-58b raft-42 raft-66 spark-2294 spark-3150 spark-9256
1114407718040226823523 300
600
1000
400
17101500
2850
2380
1250
2160
Initial ExecutionAfter Minimization
How well does DEMi work?To
tal E
vent
s
0
300
600
900
1200
1500
1800
2100
2400
2700
3000
Case Studyraft-45 raft-46 raft-56 raft-58a raft-58b raft-42 raft-66 spark-2294 spark-3150 spark-9256
1114407718040226823523 300
600
1000
400
17101500
2850
2380
1250
2160
Initial ExecutionAfter Minimization
80% - 97% Reduction!
How well does DEMi work?To
tal E
vent
s
0306090
120150180210240270300
Case Studyraft-45 raft-46 raft-56 raft-58a raft-58b raft-42 raft-66 spark-2294spark-3150spark-9256
11112529392851
212322 111440
77
180
40
226
82
3523
After MinimizationSmallest Manual Trace
How well does DEMi work?To
tal E
vent
s
0306090
120150180210240270300
Case Studyraft-45 raft-46 raft-56 raft-58a raft-58b raft-42 raft-66 spark-2294spark-3150spark-9256
11112529392851
212322 111440
77
180
40
226
82
3523
After MinimizationSmallest Manual Trace
Factor of 1x - 5x from hand-crafted
170 69
How quickly does DEMi work?Ru
ntim
e in
Sec
onds
0
400
800
1200
1600
2000
2400
2800
3200
3600
4000
Case Studyraft-45 raft-46 raft-56 raft-58a raft-58b raft-42 raft-66 spark-2294 spark-3150 spark-9256
210245427348
10676
69
43482
2132
282170
Overall Minimization(~12 hours) (~3 hours)
(~35 minutes)
170 69
How quickly does DEMi work?Ru
ntim
e in
Sec
onds
0
400
800
1200
1600
2000
2400
2800
3200
3600
4000
Case Studyraft-45 raft-46 raft-56 raft-58a raft-58b raft-42 raft-66 spark-2294 spark-3150 spark-9256
210245427348
10676
69
43482
2132
282170
Overall Minimization
<10 minutes except 3 cases
(~12 hours) (~3 hours)
(~35 minutes)
See the paper for… How we handle non-determinism Handling multithreaded processes Supporting other RPC libraries Sketch for minimizing production traces More in-depth evaluation Related work …
Conclusion
Open source tool: github.com/NetSys/demi
Read our paper! eecs.berkeley.edu/~rcs/research/nsdi16.pdf
Optimistic that these techniques can be successfully applied more broadly
Thanks for your time!Contact me! [email protected]
AttributionsInspiration for slide design: Jay Lorch’s IronFleet slides
Graphic Icons: thenounproject.org logfile: mantisshrimpdesign magnifying glass: Ricardo Moreira disk: Anton Outkine hook: Seb Cornelius bug report: Lemon Liu devil: Mourad Mokrane Putin: Remi Mercier
Production TracesModel: feed partially ordered log into
single machine DEMi
Require: - Partial ordering of all message deliveries - All crash-recoveries logged to disk
Related WorkThread Schedule Minimization
•Isolating Failure-Inducing Thread Schedules. SIGSOFT ’02. •A Trace Simplification Technique for Effective Debugging of
Concurrent Programs. FSE ’10. Program Flow Analysis.
•Enabling Tracing of Long-Running Multithreaded Programs via Dynamic Execution Reduction. ISSTA ’07.
•Toward Generating Reducible Replay Logs. PLDI ’11. Best-Effort Replay of Field Failures
•A Technique for Enabling and Supporting Debugging of Field Failures. ICSE ’07.
•Triage: Diagnosing Production Run Failures at the User’s Site. SOSP ’07.
Dealing With ThreadsIf you’re lucky: threads are largely independent (Spark)
If you’re unlucky: key insight: A write to shared memory is equivalent to a message delivery
Approach: •interpose on virtual memory, thread scheduler •pause a thread whenever it writes to shared memory / disk
Cf. “Enabling Tracing Of Long-Running Multithreaded Programs Via Dynamic Execution Reduction”, ISSTA ‘07
Dealing With Non-DeterminismInterpose on:
- Timers - Random number generators - Unordered hash values - ID allocation
Stop-gap: replay each schedule multiple times