Distributed Debugging
Presenter: Chi-Hung Lu
Problems
Distributed applications are hard to validate
  Application state is distributed across many distinct execution environments
  Protocols involve complex interactions among a collection of networked machines
  Applications must handle failures ranging from network problems to crashing nodes
  Intricate sequences of events can trigger complex errors as a result of mishandled corner cases
Problem Description
It is difficult to diagnose the source of a problem in an Internet application
Current network diagnostic tools each focus on one particular protocol
They do not share information about the application among the user, the service, and the network operators
Slide 6
Examples
traceroute
  Can locate IP connectivity problems
  Cannot reveal proxy or DNS failures
HTTP monitoring suite
  Can locate application problems
  Cannot diagnose routing problems
Slide 7
Examples
[Figure: User, DNS Server, Proxy, Web Server]
Slide 8
Examples
[Figure: User, DNS Server, Proxy, Web Server]
Slide 9
Examples
[Figure: User, DNS Server, Proxy, Web Server]
Slide 10
Examples
[Figure: User, DNS Server, Proxy, Web Server]
Slide 11
X-Trace
An integrated tracing framework
Records the network paths that were taken
Invoke X-Trace when initiating an application task
Insert X-Trace metadata with a task identifier into the request
Propagate the metadata down to lower layers through protocol interfaces
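The idea of carrying one task identifier from the application request down through the lower layers can be sketched as follows. This is only an illustration under assumptions: the three layer functions, the `xtrace` dictionary key, and the `new_task_id` helper are hypothetical names, not part of X-Trace itself.

```python
import uuid


def new_task_id() -> str:
    # Hypothetical helper: any unique identifier would do for this sketch.
    return uuid.uuid4().hex[:8]


def http_layer(request: dict, log: list) -> None:
    # The application layer records the task id, then hands a copy of the
    # metadata to the layer below it.
    log.append(("HTTP", request["xtrace"]["task_id"]))
    tcp_layer({"payload": request, "xtrace": dict(request["xtrace"])}, log)


def tcp_layer(segment: dict, log: list) -> None:
    log.append(("TCP", segment["xtrace"]["task_id"]))
    ip_layer({"payload": segment, "xtrace": dict(segment["xtrace"])}, log)


def ip_layer(packet: dict, log: list) -> None:
    # Lowest layer in this sketch: it only records what it saw.
    log.append(("IP", packet["xtrace"]["task_id"]))


def run_task() -> list:
    # Metadata is inserted once, when the application task is initiated.
    log = []
    request = {"url": "http://example.org/", "xtrace": {"task_id": new_task_id()}}
    http_layer(request, log)
    return log
```

Every entry returned by `run_task()` carries the same task identifier, which is what later allows operations at different layers to be tied back to one task.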
Slide 12
Task Tree
X-Trace tags all network operations resulting from a particular task with the same task identifier
The task tree is the set of network operations connected with an initial task
The task tree can be reconstructed after the trace data has been collected through reports
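Reconstructing a task tree from collected reports can be sketched as follows. The report layout (dictionaries with `task_id`, `op_id`, `parent_id`, and `label` keys) is an assumption made for this example and is not the actual X-Trace report format.

```python
from collections import defaultdict


def build_task_tree(reports, task_id):
    """Group the reports of one task into a parent -> children mapping."""
    children = defaultdict(list)
    labels = {}
    root = None
    for report in reports:
        if report["task_id"] != task_id:
            continue  # report belongs to a different task
        labels[report["op_id"]] = report["label"]
        if report["parent_id"] is None:
            root = report["op_id"]  # the operation that started the task
        else:
            children[report["parent_id"]].append(report["op_id"])
    return root, children, labels


def render(root, children, labels, depth=0):
    """Return the tree as indented text lines, depth first."""
    lines = ["  " * depth + labels[root]]
    for child in children.get(root, []):
        lines.extend(render(child, children, labels, depth + 1))
    return lines
```

Reports may arrive in any order and may be interleaved with reports from other tasks; filtering on the shared task identifier and linking each operation to its parent is enough to rebuild the tree in this sketch.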
Slide 13
An example of the task tree
A simple HTTP request through a proxy
Slide 14
X-Trace Components
Data
  X-Trace metadata
Network path
  Task tree
Report
  Reconstruct task tree
Slide 15
Propagation of X-Trace Metadata
The propagation of X-Trace metadata through the task tree
Slide 16
Propagation of X-Trace Metadata
The propagation of X-Trace metadata through the task tree
Slide 17
The X-Trace metadata
Field        Usage
Flags        Bits that specify which of the three optional components are present
TaskID       A unique integer ID
TreeInfo     ParentID, OpID, EdgeType
Destination  Specifies the address that X-Trace reports should be sent to
Options      Accommodates future extension mechanisms
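The field table can be mirrored in a small data structure, shown here as a sketch. The bit values in `Flags`, the Python types chosen for each field, and the class names are assumptions for illustration; the slide does not give the wire format.

```python
from dataclasses import dataclass
from enum import IntFlag
from typing import Optional, Tuple


class Flags(IntFlag):
    # Assumed bit assignments, one per optional component.
    TREE_INFO = 1
    DESTINATION = 2
    OPTIONS = 4


@dataclass
class XTraceMetadata:
    task_id: int                                        # always present
    tree_info: Optional[Tuple[int, int, str]] = None    # (ParentID, OpID, EdgeType)
    destination: Optional[str] = None                   # where reports are sent
    options: Optional[bytes] = None                     # room for extensions

    @property
    def flags(self) -> Flags:
        # Derive the flag bits from which optional components are set.
        value = Flags(0)
        if self.tree_info is not None:
            value |= Flags.TREE_INFO
        if self.destination is not None:
            value |= Flags.DESTINATION
        if self.options is not None:
            value |= Flags.OPTIONS
        return value
```

Deriving the flags from the populated fields, rather than storing them separately, keeps the two from drifting apart in this sketch.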
Slide 18
Operation of X-Trace Metadata
Slide 19
Operation of X-Trace Metadata
Slide 20
X-Trace Report Architecture
Slide 21
X-Trace Report Architecture
Slide 22
X-Trace Report Architecture
Slide 23
Usage Scenario (1)
Web request and recursive DNS queries
Slide 24
Usage Scenario (2)
A request fault annotated with user input
Slide 25
Usage Scenario (3)
A client and a server communicate over the I3 overlay network
Slide 26
Usage Scenario (3)
Internet Indirection Infrastructure (I3)
Slide 27
Usage Scenario (3)
Internet Indirection Infrastructure (I3)
Slide 28
Usage Scenario (3)
Internet Indirection Infrastructure (I3)
Problem Description
Log mining is both labor-intensive and fragile
Latent bugs are often distributed across multiple nodes
Logs reflect incomplete information about an execution
Distributed applications are non-deterministic
Slide 36
Goals
Efficiently verify application properties
Provide fairly complete information about an execution
Reproduce the buggy runs deterministically and faithfully
Slide 37
Approach
Log the actual execution of a distributed system
Apply predicate checking in a centralized simulator over a run driven by testing scripts or replayed from logs
Output a violation report along with message traces
An execution is interpreted as a sequence of events, which are dispatched to the corresponding handling routines
Slide 38
Components
A versatile scripting language
  Allows a developer to refine system properties into straightforward assertions
A checker
  Inspects for violations
Slide 39
Architecture
Components of WiDS Checker
Slide 40
Architecture
Reproduce real runs
  Log all non-deterministic events using Lamport's logical clock
Check user-defined predicates
  A versatile scripting language to specify the system states being observed and the predicates for invariants and correctness
Screen out false alarms with auxiliary information
  For liveness properties
Trace root causes using a visualization tool
Slide 41
Programming with WiDS
WiDS APIs are mostly member functions of the WiDSObject class
The WiDS runtime maintains an event queue to buffer pending events and dispatches them to the corresponding handling routines
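The event-queue model can be sketched as follows. The `EventLoop` class and its `on`, `post`, and `run` methods are hypothetical names chosen for this example and are not the WiDS API.

```python
from collections import deque


class EventLoop:
    """Buffers pending events and dispatches each one to its handler."""

    def __init__(self):
        self.queue = deque()
        self.handlers = {}

    def on(self, kind, handler):
        # Register the handling routine for one kind of event.
        self.handlers[kind] = handler

    def post(self, kind, payload):
        # Buffer an event; it runs when the loop reaches it.
        self.queue.append((kind, payload))

    def run(self):
        # Dispatch events in FIFO order until the queue is empty.
        handled = 0
        while self.queue:
            kind, payload = self.queue.popleft()
            self.handlers[kind](payload)
            handled += 1
        return handled
```

Because every state change happens inside a handler invoked by the loop, the order of dispatched events fully describes a run, which is the property that logging and replay rely on.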
Slide 42
Enabling Replay
Logging
  Log all WiDS nondeterminism
  Redirect OS calls and log the results
  Embed a Lamport clock in each outgoing message
Checkpoint
  Support partial replay
  Save the WiDS process context
Replay
  Start from the beginning or a checkpoint
  Replay events in serialized Lamport order
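A Lamport clock and a serialized replay order can be sketched as follows. The log entry layout and the tie-break on node name are assumptions for this example; the slide only states that events are replayed in serialized Lamport order.

```python
class LamportClock:
    """Logical clock: local events tick, receives jump past the sender."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # A local event advances the clock by one.
        self.time += 1
        return self.time

    def send(self):
        # The returned timestamp is embedded in the outgoing message.
        return self.tick()

    def receive(self, message_time):
        # Move past both the local clock and the sender's timestamp.
        self.time = max(self.time, message_time) + 1
        return self.time


def replay_order(log):
    # Sort logged events by Lamport time; break ties by node name
    # (an assumed, arbitrary but deterministic tie-break).
    return sorted(log, key=lambda entry: (entry["clock"], entry["node"]))
```

Sorting by Lamport time guarantees that a message's send is replayed before its receive, since the receive timestamp is always strictly larger.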
Slide 43
Checker
Observe memory state
Define states and evaluate predicates
  Refresh database for each event
  Maintain history
  Re-evaluate modified predicates
Auxiliary information for violations
  Liveness properties are only guaranteed to be true eventually
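Per-event predicate checking can be sketched as follows for a simple safety invariant. The `check` function, the event tuples, and the lock example are hypothetical; the WiDS Checker uses its own scripting language, which is not reproduced here. Liveness properties would need extra handling, since a momentary failure is not necessarily a violation.

```python
def check(events, apply_event, predicates):
    """Replay events, refresh the state, and record predicate violations."""
    state = {}
    violations = []
    for index, event in enumerate(events):
        apply_event(state, event)              # refresh observed state
        for name, predicate in predicates.items():
            if not predicate(state):           # re-evaluate after each event
                violations.append((index, name, event))
    return violations


def apply_lock_event(state, event):
    # Example state: the set of nodes that currently hold a lock.
    action, node = event
    holders = state.setdefault("holders", set())
    if action == "acquire":
        holders.add(node)
    elif action == "release":
        holders.discard(node)


# Example invariant: at most one node holds the lock at any time.
LOCK_PREDICATES = {
    "at_most_one_holder": lambda state: len(state.get("holders", set())) <= 1,
}
```

Each violation records the event index and the event itself, so the report can point back into the message trace.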
Slide 44
Slide 45
Slide 46
Slide 47
Visualization Tools
Message flow graph
Slide 48
Evaluation
Benchmark and result summary
Slide 49
Performance
Running time for evaluating predicates
Slide 50
Logging Overhead
Percentage of logging time
Slide 51
Discussion
The system is debugged by those who developed it
Bugs are hunted by those who are intimately familiar with the system