
Page 1: D 3 S: Debugging Deployed Distributed Systems

D3S: Debugging Deployed Distributed Systems

Xuezheng Liu et al., Microsoft Research, NSDI 2008

Presenter: Shuo Tang, CS525@UIUC

Page 2: D 3 S: Debugging Deployed Distributed Systems

Debugging distributed systems is difficult

• Bugs are difficult to reproduce
  – Many machines executing concurrently
  – Machines/network may fail
• Consistent snapshots are not easy to get
• Current approaches
  – Multi-threaded debugging
  – Model checking
  – Runtime checking

Page 3: D 3 S: Debugging Deployed Distributed Systems

State of the Art

• Example: distributed reader-writer locks
• Log-based debugging
  – Step 1: add logs

        void ClientNode::OnLockAcquired(…) {
          …
          print_log(m_NodeID, lock, mode);
        }

  – Step 2: collect logs
  – Step 3: write checking scripts
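
To make Step 3 concrete, here is a hedged sketch of such a checking script (the log format, with explicit acquire/release events, is an assumption for illustration):

    // Sketch of an offline checking script over merged logs (format assumed):
    // each line is "<event> <node> <lock> <mode>", event = acq|rel, mode = E|S.
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <set>
    #include <string>

    int main() {
        std::ifstream log("locks.log");
        // lock -> set of (node, mode) pairs currently holding it
        std::map<std::string, std::set<std::pair<std::string, std::string>>> holders;
        std::string ev, node, lock, mode;
        while (log >> ev >> node >> lock >> mode) {
            if (ev == "acq") holders[lock].insert({node, mode});
            else             holders[lock].erase({node, mode});
            // Invariant: an exclusive holder excludes all other holders.
            bool exclusive = false;
            for (const auto& h : holders[lock]) exclusive |= (h.second == "E");
            if (exclusive && holders[lock].size() > 1)
                std::cout << "conflict on lock " << lock << "\n";
        }
    }

Note that this only works if the per-machine logs are merged in a consistent global order, which is exactly what is hard to obtain; the next slide lists this among the problems.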

Page 4: D 3 S: Debugging Deployed Distributed Systems

Problems

• Too much manual effort
• Difficult to anticipate what to log
  – Too much?
  – Too little?
• Checking a large system is challenging
  – A central checker cannot keep up
  – Snapshots must be consistent

Page 5: D 3 S: Debugging Deployed Distributed Systems

D3S Contribution

• A simple language for writing distributed predicates

• Programmers can change what is being checked on-the-fly

• Failure-tolerant consistent snapshots for predicate checking

• Evaluation with five real-world applications

Page 6: D 3 S: Debugging Deployed Distributed Systems

D3S Workflow

[Workflow diagram: each process exposes its state; checkers evaluate the predicate "no conflicting locks" over the collected states and raise a violation when a conflict is found.]

Page 7: D 3 S: Debugging Deployed Distributed Systems

Glance at D3S Predicate

    V0: exposer { (client: ClientID, lock: LockID, mode: LockMode) }
    V1: V0 { (conflict: LockID) } as final
    after (ClientNode::OnLockAcquired) addtuple ($0->m_NodeID, $1, $2)
    after (ClientNode::OnLockReleased) deltuple ($0->m_NodeID, $1, $2)

    class MyChecker : vertex<V1> {
      virtual void Execute(const V0::Snapshot& snapshot) {
        …  // invariant logic, written in sequential style
      }
      static int64 Mapping(const V0::tuple& t);  // guidance for partitioning
    };
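
As an illustration of the elided invariant logic, a hedged sketch of the conflict check Execute might perform (the types and field names are stand-ins for the D3S-generated ones):

    // Hedged sketch: the reader-writer conflict check inside Execute.
    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    struct Tuple { int64_t client; int64_t lock; char mode; };  // mode: 'E' or 'S'

    // Returns the locks violating the invariant in one snapshot.
    std::vector<int64_t> FindConflicts(const std::vector<Tuple>& snapshot) {
        std::map<int64_t, std::pair<int, int>> holders;  // lock -> (#E, #S)
        for (const Tuple& t : snapshot)
            (t.mode == 'E' ? holders[t.lock].first : holders[t.lock].second)++;
        std::vector<int64_t> conflicts;
        for (const auto& [lock, c] : holders)
            if (c.first > 1 || (c.first == 1 && c.second > 0))  // E/E or E/S overlap
                conflicts.push_back(lock);
        return conflicts;
    }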

Page 8: D 3 S: Debugging Deployed Distributed Systems

D3S Parallel Predicate Checker

[Diagram: lock clients expose states individually, e.g. (C1, L1, E), (C2, L3, S), (C5, L1, S), …; the states are partitioned by key (LockID), so one checker receives (C1, L1, E) and (C5, L1, S) while another receives (C2, L3, S); each checker reconstructs snapshots SN1, SN2, … for its partition.]

Page 9: D 3 S: Debugging Deployed Distributed Systems

Summary of Checking Language

• Predicate
  – Any property calculated from a finite number of consecutive state snapshots
• Highlights
  – Sequential programs (w/ mapping)
  – Reuse app types in the script and C++ code
• Binary instrumentation
• Support for reducing the overhead (in the paper)
  – Incremental checking
  – Sampling the time or snapshots

Page 10: D 3 S: Debugging Deployed Distributed Systems

Snapshots

• Use Lamport clocks
  – Instrument the network library
  – 1000 logical clocks per second

• Problem: how does the checker know whether it has received all necessary states for a snapshot?
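
A minimal sketch of the Lamport-clock instrumentation described above (the wrapper names are assumptions; D3S actually hooks the network library via binary instrumentation, and a timer would additionally advance the clock about 1000 times per second):

    // Hedged sketch: Lamport clock maintained by instrumented send/receive.
    #include <algorithm>
    #include <atomic>
    #include <cstdint>

    std::atomic<uint64_t> lamport{0};

    // On every send: tick the clock and piggyback the timestamp on the message.
    uint64_t on_send() { return ++lamport; }

    // On every receive: advance past the sender's timestamp.
    void on_receive(uint64_t msg_ts) {
        uint64_t cur = lamport.load();
        uint64_t next;
        do { next = std::max(cur, msg_ts) + 1; }
        while (!lamport.compare_exchange_weak(cur, next));
    }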

Page 11: D 3 S: Debugging Deployed Distributed Systems

Consistent Snapshot

• Membership
• What if a process does not have state to expose for a long time?
• What if a checker fails?

[Timeline diagram: processes A and B report to a checker. A exposes { (A, L0, S) } at ts=2 and { (A, L1, E) } at ts=16; B exposes { (B, L1, E) } at ts=6 and an empty set at ts=10. The checker tracks membership M(t), e.g. M(2)={A,B}. When a process has nothing new to expose, its previous state is carried forward (SA(6)=SA(2), SB(10)=SB(6)) so that check(6) and check(10) can still run. Once B's failure is detected, M(16)={A} and check(16) proceeds without it.]
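
A hedged sketch of the carry-forward logic in this timeline (names and structures are assumptions, not the D3S implementation):

    // Hedged sketch: assembling a checkable snapshot at logical time t.
    #include <cstdint>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    using State = std::vector<std::string>;  // a process's exposed tuples, simplified

    struct SnapshotAssembler {
        // process -> (logical timestamp -> state exposed at that time)
        std::map<std::string, std::map<uint64_t, State>> reports;

        // check(t) may run once every live member has advanced to time t;
        // a member without a report at t contributes its last earlier state
        // (the carry-forward above, e.g. SA(6) = SA(2)).
        bool TryCheck(uint64_t t, const std::set<std::string>& members, State* out) {
            out->clear();
            for (const std::string& p : members) {
                const auto& byTs = reports[p];
                if (byTs.lower_bound(t) == byTs.end()) return false;  // p behind t
                auto it = byTs.upper_bound(t);                        // first after t
                if (it == byTs.begin()) return false;                 // nothing <= t
                --it;                                                 // latest <= t
                out->insert(out->end(), it->second.begin(), it->second.end());
            }
            return true;  // snapshot complete: run the predicate over *out
        }
    };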

Page 12: D 3 S: Debugging Deployed Distributed Systems

Experimental Method

• Debugging five real systems
  – Can D3S help developers find bugs?
  – Are predicates simple to write?
  – Is the checking overhead acceptable?
• Case: Chord implementation (i3)
  – Uses predecessor and successor lists to stabilize
  – "Holes" and overlaps in key coverage
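
A hedged sketch of how such a key-coverage predicate might be computed from exposed (node, key range) states (the representation is an assumption):

    // Hedged sketch: aggregate key-range coverage over exposed Chord states.
    #include <cstdint>
    #include <vector>

    struct Range { uint64_t begin, end; };  // [begin, end) on the unwrapped ring

    // Coverage ratio: total claimed length / ring size. 1.0 is a perfect ring;
    // < 1.0 implies holes, > 1.0 implies overlapping ownership.
    double CoverageRatio(const std::vector<Range>& owned, uint64_t ring_size) {
        uint64_t total = 0;
        for (const Range& r : owned) total += r.end - r.begin;
        return static_cast<double>(total) / ring_size;
    }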

Page 13: D 3 S: Debugging Deployed Distributed Systems

Chord Overlay

Perfect ring:
• No overlap, no hole
• Aggregated key coverage is 100%

[Chart: key range coverage ratio (0%-200%) over time (0-80,000 seconds), for 3 vs. 8 predecessors.]

Consistency vs. availability: cannot get both
• A global measure of the contributing factors
• Shows the tradeoff quantitatively, for performance tuning
• Capable of checking detailed key coverage

[Chart: number of Chord nodes holding each key (0-4) across key serials 0-256, for 3 vs. 8 predecessors.]

Page 14: D 3 S: Debugging Deployed Distributed Systems

Summary of Results

Application | LoC | Predicates | LoP | Results
PacificA (structured data storage) | 67,263 | membership consistency; leader election; consistency among replicas | 118 | 3 correctness bugs
Paxos implementation | 6,993 | consistency in consensus outputs; leader election | 50 | 2 correctness bugs
Web search engine | 26,036 | unbalanced response time of indexing servers | 81 | 1 performance problem
Chord (DHT) | 7,640 | aggregate key range coverage; conflicting key holders | 72 | tradeoff between availability & consistency
BitTorrent client | 36,117 | health of neighbor set; distribution of downloaded pieces; peer contribution rank | 210 | 2 performance bugs; free riders

(The first three are data-center applications; Chord and BitTorrent are wide-area applications. LoC = lines of code of the checked system; LoP = lines of predicate code.)

Page 15: D 3 S: Debugging Deployed Distributed Systems

Overhead (PacificA)

[Chart: time to complete (seconds) vs. number of clients (2-10), each sending 10,000 requests, with and without D3S checking; measured overheads: 7.21%, 4.38%, 3.94%, 4.20%, 7.24%.]

• Less than 8%, in most cases less than 4%
• I/O overhead < 0.5%
• Overhead is negligible in the other checked systems

Page 16: D 3 S: Debugging Deployed Distributed Systems

Related Work

• Log analysis
  – Magpie [OSDI'04], Pip [NSDI'06], X-Trace [NSDI'07]
• Predicate checking at replay time
  – WiDS Checker [NSDI'07], Friday [NSDI'07]
• P2-based online monitoring
  – P2-monitor [EuroSys'06]
• Model checking
  – MaceMC [NSDI'07], CMC [OSDI'04]

Page 17: D 3 S: Debugging Deployed Distributed Systems

Conclusion

• Predicate checking is effective for debugging deployed, large-scale distributed systems
• D3S enables:
  – Changing what is monitored on-the-fly
  – Checking with multiple checkers
  – Specifying predicates in a sequential, centralized manner

Page 18: D 3 S: Debugging Deployed Distributed Systems

Thank You

• Thanks to the authors for providing some of the slides

Page 19: D 3 S: Debugging Deployed Distributed Systems

PNUTS: Yahoo!'s Hosted Data Serving Platform

Brian F. Cooper et al. @ Yahoo! Research

Presented by Ying-Yi Liang
(Some slides come from the authors' version.)

Page 20: D 3 S: Debugging Deployed Distributed Systems

What is the Problem?

The web era: web applications
• Users are picky: low latency, high availability
• Enterprises are greedy: high scalability
• Things move fast: new ideas expire very soon

Two ways of developing a cool web application:
• Making your own fire: quick and cool, but tiring and error-prone
• Using huge "powerful" building blocks: wonderful, but the market will have shifted away by the time you are done
Neither way scales very well…

Something is missing: an infrastructure specially tailored to web applications!

Page 21: D 3 S: Debugging Deployed Distributed Systems

Web Application Model

• Object sharing: Blogs, Flickr, Web Picasa, YouTube, …
• Social: Facebook, Twitter, …
• Listing: Yahoo! Shopping, del.icio.us, news

They require:
• High scalability, availability, and fault tolerance
• Acceptable latency for geographically distributed requests
• A simplified query API
• Some consistency (weaker than sequential consistency)

Page 22: D 3 S: Debugging Deployed Distributed Systems

PNUTS – DB in the Cloud

[Diagram: a table of records, e.g. (A, 42342, E), (B, 42521, W), (C, 66354, W), (D, 12352, E), (E, 75656, C), (F, 15677, E), replicated across regions.]

    CREATE TABLE Parts (
      ID VARCHAR,
      StockNumber INT,
      Status VARCHAR,
      …
    )

• Parallel database
• Geographic replication
• Indexes and views
• Structured, flexible schema
• Hosted, managed infrastructure

Page 23: D 3 S: Debugging Deployed Distributed Systems

Basic Concepts

[Diagram: anatomy of a tablet. Each row is a record; its primary key (Grape, Lime, Apple, Strawberry, Orange, Avocado, Lemon, Tomato, Banana, Kiwi) maps to fields such as "Grapes are good to eat" or "Limes are green"; a group of such records forms a tablet.]

Page 24: D 3 S: Debugging Deployed Distributed Systems

A view from 10,000-ft

Page 25: D 3 S: Debugging Deployed Distributed Systems

PNUTS Storage Architecture

[Diagram: clients issue requests through a REST API to routers; a tablet controller manages the assignment of tablets to storage units; a message broker propagates updates.]

Page 26: D 3 S: Debugging Deployed Distributed Systems

Geographic Replication

[Diagram: the same stack (clients, REST API, routers, tablet controller, storage units, message broker) deployed in Region 1, Region 2, and Region 3, with the message broker replicating updates across regions.]

Page 27: D 3 S: Debugging Deployed Distributed Systems

In-region Load Balance

[Diagram: tablets are moved between storage units within a region to balance load.]

Page 28: D 3 S: Debugging Deployed Distributed Systems

Data and Query Models

• Simplified relational data model: tables of records
• Typed columns; typical data types plus the blob type
• Does not enforce inter-table relationships
• Operations: selection, projection (no join, aggregation, …)
• Options: point access, range query, multiget
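
To make the API surface concrete, a hedged sketch of what client calls might look like (all names here are illustrative, not the actual PNUTS interface; an in-memory map stands in for the real store):

    // Hypothetical PNUTS-style client API over an in-memory ordered table.
    #include <map>
    #include <optional>
    #include <string>
    #include <vector>

    struct Record { std::string key; std::string fields; };

    class TableClient {
        std::map<std::string, Record> rows_;  // ordered: enables range queries
    public:
        void Put(const Record& r) { rows_[r.key] = r; }
        // Point access by primary key.
        std::optional<Record> Get(const std::string& key) const {
            auto it = rows_.find(key);
            if (it == rows_.end()) return std::nullopt;
            return it->second;
        }
        // Range query: all records with lo <= key < hi.
        std::vector<Record> Scan(const std::string& lo, const std::string& hi) const {
            std::vector<Record> out;
            for (auto it = rows_.lower_bound(lo); it != rows_.end() && it->first < hi; ++it)
                out.push_back(it->second);
            return out;
        }
        // Multiget: a batch of point lookups.
        std::vector<std::optional<Record>> MultiGet(const std::vector<std::string>& keys) const {
            std::vector<std::optional<Record>> out;
            for (const auto& k : keys) out.push_back(Get(k));
            return out;
        }
    };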

Page 29: D 3 S: Debugging Deployed Distributed Systems

Record Assignment

[Diagram: the router maps key intervals to storage units: MIN-Canteloupe → SU1, Canteloupe-Lime → SU3, Lime-Strawberry → SU2, Strawberry-MAX → SU1. Records (Apple, Avocado, Banana, Blueberry, Canteloupe, Grape, Kiwi, Lemon, Lime, Mango, Orange, Strawberry, Tomato, Watermelon) live on the storage unit owning their interval.]

Page 30: D 3 S: Debugging Deployed Distributed Systems

Single Point Update

[Diagram: numbered write path through routers and message brokers. (1)-(2) "write key k" travels from the client via a router to the storage unit holding k's tablet; (3)-(4) the storage unit publishes the write to the message broker, which commits it; (5) SUCCESS returns to the client; (6) the broker delivers "write key k" to the replica storage units; (7)-(8) a sequence number for key k is assigned and propagated.]

Page 31: D 3 S: Debugging Deployed Distributed Systems

Range Query

[Diagram: a range query such as "Grapefruit…Pear?" reaches the router, which uses its interval map (MIN-Canteloupe → SU1, Canteloupe-Lime → SU3, Lime-Strawberry → SU2, Strawberry-MAX → SU1) to split it into sub-scans "Grapefruit…Lime?" for SU3 and "Lime…Pear?" for SU2.]
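
A hedged sketch of the router's interval lookup implied by the last two diagrams (the data structures are assumptions):

    // Hedged sketch: router interval map and range-query splitting.
    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // Interval map keyed by each interval's (exclusive) upper bound:
    // keys below "Canteloupe" go to SU1, ["Canteloupe","Lime") to SU3, etc.
    const std::map<std::string, std::string> kUpperBoundToSU = {
        {"Canteloupe", "SU1"}, {"Lime", "SU3"},
        {"Strawberry", "SU2"}, {"\x7f", "SU1"}};  // "\x7f" stands in for MAX

    // Point lookup: the first interval whose upper bound exceeds the key.
    std::string LookupSU(const std::string& key) {
        return kUpperBoundToSU.upper_bound(key)->second;
    }

    // Range query: visit every interval overlapping [lo, hi).
    std::vector<std::pair<std::string, std::string>> SplitRange(
            const std::string& lo, const std::string& hi) {
        std::vector<std::pair<std::string, std::string>> subscans;  // (SU, start)
        auto it = kUpperBoundToSU.upper_bound(lo);
        std::string start = lo;
        for (;;) {
            subscans.push_back({it->second, start});
            if (hi <= it->first) break;   // the range ends inside this interval
            start = it->first;            // continue from the interval boundary
            ++it;
        }
        return subscans;
    }

    int main() {
        for (const auto& [su, from] : SplitRange("Grapefruit", "Pear"))
            std::cout << su << " scans from " << from << "\n";
        // Prints SU3 (from Grapefruit) then SU2 (from Lime), matching the slide.
    }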

Page 32: D 3 S: Debugging Deployed Distributed Systems

Relaxed Consistency

• ACID transactions / sequential consistency: too strong
  – Non-trivial overhead in asynchronous settings
  – Users can tolerate stale data in many cases
• Go hybrid: eventual consistency + mechanisms for stronger consistency
• Use versioning to cope with asynchrony

[Timeline: a record is inserted, repeatedly updated, and finally deleted, producing versions v. 1 through v. 8 within Generation 1.]

Page 33: D 3 S: Debugging Deployed Distributed Systems

Relaxed Consistency

[Timeline, versions v. 1 … v. 8 of Generation 1: read_any() may return any version, including a stale one.]

Page 34: D 3 S: Debugging Deployed Distributed Systems

Relaxed Consistency

[Timeline: read_latest() always returns the current version.]

Page 35: D 3 S: Debugging Deployed Distributed Systems

Relaxed Consistency

[Timeline: read_critical("v.6") returns a version at least as new as v. 6.]

Page 36: D 3 S: Debugging Deployed Distributed Systems

Relaxed Consistency

[Timeline: write() unconditionally produces a new current version.]

Page 37: D 3 S: Debugging Deployed Distributed Systems

Relaxed Consistency

[Timeline: test_and_set_write(v.7) succeeds only if v. 7 is still the current version; here a newer version exists, so the call returns ERROR.]
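
A hedged sketch of these per-record API semantics over a versioned record (a simplification: the real calls also involve replicas and record mastership):

    // Hedged sketch: version-based read/write semantics on one record.
    #include <cstdlib>
    #include <optional>
    #include <string>
    #include <vector>

    struct VersionedRecord {
        std::vector<std::string> versions;  // versions[i] holds v.(i+1) of this generation

        // read_any(): any replica may answer, so any (possibly stale) version.
        std::string read_any() const { return versions[std::rand() % versions.size()]; }

        // read_latest(): always the current version.
        std::string read_latest() const { return versions.back(); }

        // read_critical(n): some version at least as new as v.n; the current
        // version trivially qualifies once v.n exists.
        std::optional<std::string> read_critical(size_t n) const {
            if (n > versions.size()) return std::nullopt;  // v.n not yet visible
            return versions.back();
        }

        // write(): unconditionally installs the next version.
        void write(const std::string& val) { versions.push_back(val); }

        // test_and_set_write(n): succeeds only if v.n is still current,
        // matching the ERROR on the slide when a newer version exists.
        bool test_and_set_write(size_t n, const std::string& val) {
            if (versions.size() != n) return false;  // ERROR
            versions.push_back(val);
            return true;
        }
    };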

Page 38: D 3 S: Debugging Deployed Distributed Systems

Membership Management

• Record timelines should be coherent for each replica
  – Updates must be applied to the latest version
• Use mastership
  – Per-record basis
  – Only one replica holds mastership at any time
  – All update requests are sent to the master to get ordered
  – Routers & YMB maintain mastership information
  – A replica receiving frequent write requests takes over the mastership
  – Leader election service provided by ZooKeeper
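
A hedged sketch of the per-record mastership handoff just described (the counter and threshold are assumptions; the actual PNUTS policy differs in detail):

    // Hedged sketch: per-record mastership with adaptive handoff.
    #include <string>
    #include <unordered_map>

    struct RecordMeta {
        std::string master;                                  // region holding mastership
        std::unordered_map<std::string, int> recent_writes;  // region -> recent writes
    };

    // Route an update originating in `region`: all updates are ordered by the
    // master; a region that keeps writing the record takes over mastership.
    std::string route_update(RecordMeta& m, const std::string& region) {
        int n = ++m.recent_writes[region];
        if (region != m.master && n > 3) {  // illustrative handoff threshold
            m.master = region;
            m.recent_writes.clear();        // restart counting after the handoff
        }
        return m.master;                    // forward the write here for ordering
    }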

Page 39: D 3 S: Debugging Deployed Distributed Systems

ZooKeeper

• A distributed system is like a zoo; someone needs to be in charge of it.
• ZooKeeper is a highly available, scalable coordination service.
• ZooKeeper plays two roles in PNUTS:
  – Coordination service
  – Publish/subscribe service
• Guarantees: sequential consistency; single system image; atomicity (as in ACID); durability; timeliness
• A tiny kernel for upper-level building blocks

Page 40: D 3 S: Debugging Deployed Distributed Systems

ZooKeeper: High Availability

• High availability via replication
• A fault-tolerant persistent store
• Provides sequential consistency

Page 41: D 3 S: Debugging Deployed Distributed Systems

ZooKeeper: Services

• Publish/Subscribe service
  – Contents stored in ZooKeeper are organized as directory trees
  – Publish: write to a specific znode
  – Subscribe: read a specific znode
• Coordination via automatic name resolution
  – By appending sequence numbers to names:
    CREATE("/…/x-", host, EPHEMERAL | SEQUENCE) → "/…/x-1", "/…/x-2", …
  – Ephemeral znodes live only as long as the session

Page 42: D 3 S: Debugging Deployed Distributed Systems

ZooKeeper Example: Lock

1) id = create("…/locks/x-", SEQUENCE | EPHEMERAL);
2) children = getChildren("…/locks", false);
3) if (children.head == id) exit();   // lowest sequence number holds the lock
4) test = exists(name of last child before id, true);
5) if (test == false) goto 2);
6) wait for modification to "…/locks";
7) goto 2);
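
For concreteness, a hedged sketch of the same recipe against the ZooKeeper C client API (error handling and actually blocking on the watch event are omitted for brevity):

    // Hedged sketch: the lock recipe on the ZooKeeper C client.
    #include <zookeeper/zookeeper.h>
    #include <algorithm>
    #include <string>
    #include <vector>

    std::string acquire_lock(zhandle_t* zh, const std::string& dir) {
        char buf[512];
        // Step 1: ephemeral + sequential znode, e.g. ".../locks/x-0000000007".
        zoo_create(zh, (dir + "/x-").c_str(), "", 0, &ZOO_OPEN_ACL_UNSAFE,
                   ZOO_EPHEMERAL | ZOO_SEQUENCE, buf, sizeof(buf));
        std::string me = std::string(buf).substr(dir.size() + 1);  // child name only
        for (;;) {
            // Step 2: list children; sorting works because suffixes are zero-padded.
            String_vector sv;
            zoo_get_children(zh, dir.c_str(), 0, &sv);
            std::vector<std::string> kids(sv.data, sv.data + sv.count);
            deallocate_String_vector(&sv);
            std::sort(kids.begin(), kids.end());
            // Step 3: the lowest sequence number holds the lock.
            auto pos = std::find(kids.begin(), kids.end(), me);
            if (pos == kids.begin()) return dir + "/" + me;
            // Steps 4-7: watch the immediate predecessor, then re-check.
            std::string prev = dir + "/" + *(pos - 1);
            struct Stat st;
            if (zoo_exists(zh, prev.c_str(), 1 /* set watch */, &st) == ZNONODE)
                continue;  // predecessor vanished: re-examine the children
            // A full implementation would block here until the watcher fires.
        }
    }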

Page 43: D 3 S: Debugging Deployed Distributed Systems

ZooKeeper Is Powerful

Many core services in distributed systems are built on ZooKeeper:
• Consensus
• Distributed locks (exclusive, shared)
• Membership
• Leader election
• Job tracker binding
• …

More information at http://hadoop.apache.org/zookeeper/

Page 44: D 3 S: Debugging Deployed Distributed Systems

Experimental Setup

• Production PNUTS code, enhanced with an ordered table type
• Three PNUTS regions
  – 2 west coast, 1 east coast
  – 5 storage units, 2 message brokers, 1 router
  – West: dual 2.8 GHz Xeon, 4 GB RAM, 6-disk RAID 5 array
  – East: quad 2.13 GHz Xeon, 4 GB RAM, 1 SATA disk
• Workload
  – 1200-3600 requests/second
  – 0-50% writes
  – 80% locality

Page 45: D 3 S: Debugging Deployed Distributed Systems

Scalability

[Chart: average latency (ms, 0-160) vs. number of storage units (2-5), for hash table vs. ordered table.]

Page 46: D 3 S: Debugging Deployed Distributed Systems

Sensitivity to R/W Ratio

[Chart: average latency (ms, 0-140) vs. write percentage (0-50%), for hash table vs. ordered table.]

Page 47: D 3 S: Debugging Deployed Distributed Systems

Sensitivity to Request Dist.

[Chart: average latency (ms, 0-100) vs. Zipf factor (0-1), for hash table vs. ordered table.]

Page 48: D 3 S: Debugging Deployed Distributed Systems

Related Work

• Google BigTable/GFS
  – Fault tolerance and consistency via Chubby
  – Strong consistency, but Chubby is not scalable
  – Lacks geographic replication support
  – Targets analytical workloads
• Amazon Dynamo
  – Unstructured data
  – Peer-to-peer style solution
  – Eventual consistency
• Facebook Cassandra (still kind of a secret)
  – Structured storage over a peer-to-peer network
  – Eventual consistency
  – Always-writable property: writes succeed even in the face of failures

Page 49: D 3 S: Debugging Deployed Distributed Systems

Discussion

• Can all web applications tolerate stale data?
• Is doing replication completely across the WAN a good idea?
• Single-level router vs. B+-tree style router hierarchy
• Tiny service kernel vs. standalone services
• Is relaxed consistency just right, or too weak?
• Is exposing record versions to applications a good idea?
• Should security be integrated into PNUTS?
• Using the pub/sub service as undo logs