
Page 1: D 3 S: Debugging Deployed Distributed Systems

D3S: Debugging Deployed Distributed Systems

Xuezheng Liu et al., Microsoft Research, NSDI 2008

Presenter: Shuo Tang, CS525@UIUC

Page 2: D 3 S: Debugging Deployed Distributed Systems

Debugging distributed systems is difficult

• Bugs are difficult to reproduce
  – Many machines executing concurrently
  – Machines/network may fail
• Consistent snapshots are not easy to get
• Current approaches
  – Multi-threaded debugging
  – Model checking
  – Runtime checking

Page 3: D 3 S: Debugging Deployed Distributed Systems

State of the Art

• Example: distributed reader-writer locks
• Log-based debugging
  – Step 1: add logs

        void ClientNode::OnLockAcquired(…) {
          …
          print_log(m_NodeID, lock, mode);
        }

  – Step 2: collect logs
  – Step 3: write checking scripts
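
To make Step 3 concrete, here is a hedged sketch of such a checking script (the log format, with explicit acquire/release events, is an assumption for illustration):

    // Sketch of an offline checking script over merged logs (format assumed):
    // each line is "<event> <node> <lock> <mode>", event = acq|rel, mode = E|S.
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <set>
    #include <string>

    int main() {
        std::ifstream log("locks.log");
        // lock -> set of (node, mode) pairs currently holding it
        std::map<std::string, std::set<std::pair<std::string, std::string>>> holders;
        std::string ev, node, lock, mode;
        while (log >> ev >> node >> lock >> mode) {
            if (ev == "acq") holders[lock].insert({node, mode});
            else             holders[lock].erase({node, mode});
            // Invariant: an exclusive holder excludes all other holders.
            bool exclusive = false;
            for (const auto& h : holders[lock]) exclusive |= (h.second == "E");
            if (exclusive && holders[lock].size() > 1)
                std::cout << "conflict on lock " << lock << "\n";
        }
    }

Note that this only works if the per-machine logs are merged in a consistent global order, which is exactly what is hard to obtain; the next slide lists this among the problems.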

Page 4: D 3 S: Debugging Deployed Distributed Systems

Problems

• Too much manual effort
• Difficult to anticipate what to log
  – Too much?
  – Too little?
• Checking a large system is challenging
  – A central checker cannot keep up
  – Snapshots must be consistent

Page 5: D 3 S: Debugging Deployed Distributed Systems

D3S Contribution

• A simple language for writing distributed predicates

• Programmers can change what is being checked on-the-fly

• Failure-tolerant consistent snapshots for predicate checking

• Evaluation with five real-world applications

Page 6: D 3 S: Debugging Deployed Distributed Systems

D3S Workflow

[Workflow diagram: each process exposes its state; checkers evaluate the predicate "no conflicting locks" over the collected states and raise a violation when a conflict is found.]

Page 7: D 3 S: Debugging Deployed Distributed Systems

Glance at D3S Predicate

    V0: exposer { (client: ClientID, lock: LockID, mode: LockMode) }
    V1: V0 { (conflict: LockID) } as final
    after (ClientNode::OnLockAcquired) addtuple ($0->m_NodeID, $1, $2)
    after (ClientNode::OnLockReleased) deltuple ($0->m_NodeID, $1, $2)

    class MyChecker : vertex<V1> {
      virtual void Execute(const V0::Snapshot& snapshot) {
        …  // invariant logic, written in sequential style
      }
      static int64 Mapping(const V0::tuple& t);  // guidance for partitioning
    };
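
As an illustration of the elided invariant logic, a hedged sketch of the conflict check Execute might perform (the types and field names are stand-ins for the D3S-generated ones):

    // Hedged sketch: the reader-writer conflict check inside Execute.
    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    struct Tuple { int64_t client; int64_t lock; char mode; };  // mode: 'E' or 'S'

    // Returns the locks violating the invariant in one snapshot.
    std::vector<int64_t> FindConflicts(const std::vector<Tuple>& snapshot) {
        std::map<int64_t, std::pair<int, int>> holders;  // lock -> (#E, #S)
        for (const Tuple& t : snapshot)
            (t.mode == 'E' ? holders[t.lock].first : holders[t.lock].second)++;
        std::vector<int64_t> conflicts;
        for (const auto& [lock, c] : holders)
            if (c.first > 1 || (c.first == 1 && c.second > 0))  // E/E or E/S overlap
                conflicts.push_back(lock);
        return conflicts;
    }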

Page 8: D 3 S: Debugging Deployed Distributed Systems

D3S Parallel Predicate Checker

[Diagram: lock clients expose states individually, e.g. (C1, L1, E), (C2, L3, S), (C5, L1, S), …; the states are partitioned by key (LockID), so one checker receives (C1, L1, E) and (C5, L1, S) while another receives (C2, L3, S); each checker reconstructs snapshots SN1, SN2, … for its partition.]

Page 9: D 3 S: Debugging Deployed Distributed Systems

Summary of Checking Language

• Predicate
  – Any property calculated from a finite number of consecutive state snapshots
• Highlights
  – Sequential programs (w/ mapping)
  – Reuse app types in the script and C++ code
• Binary instrumentation
• Support for reducing the overhead (in the paper)
  – Incremental checking
  – Sampling the time or snapshots

Page 10: D 3 S: Debugging Deployed Distributed Systems

Snapshots

• Use Lamport clocks
  – Instrument the network library
  – 1000 logical clocks per second

• Problem: how does the checker know whether it has received all necessary states for a snapshot?
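
A minimal sketch of the Lamport-clock instrumentation described above (the wrapper names are assumptions; D3S actually hooks the network library via binary instrumentation, and a timer would additionally advance the clock about 1000 times per second):

    // Hedged sketch: Lamport clock maintained by instrumented send/receive.
    #include <algorithm>
    #include <atomic>
    #include <cstdint>

    std::atomic<uint64_t> lamport{0};

    // On every send: tick the clock and piggyback the timestamp on the message.
    uint64_t on_send() { return ++lamport; }

    // On every receive: advance past the sender's timestamp.
    void on_receive(uint64_t msg_ts) {
        uint64_t cur = lamport.load();
        uint64_t next;
        do { next = std::max(cur, msg_ts) + 1; }
        while (!lamport.compare_exchange_weak(cur, next));
    }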

Page 11: D 3 S: Debugging Deployed Distributed Systems

Consistent Snapshot

• Membership
• What if a process does not have state to expose for a long time?
• What if a checker fails?

[Timeline diagram: processes A and B report to a checker. A exposes { (A, L0, S) } at ts=2 and { (A, L1, E) } at ts=16; B exposes { (B, L1, E) } at ts=6 and an empty set at ts=10. The checker tracks membership M(t), e.g. M(2)={A,B}. When a process has nothing new to expose, its previous state is carried forward (SA(6)=SA(2), SB(10)=SB(6)) so that check(6) and check(10) can still run. Once B's failure is detected, M(16)={A} and check(16) proceeds without it.]
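
A hedged sketch of the carry-forward logic in this timeline (names and structures are assumptions, not the D3S implementation):

    // Hedged sketch: assembling a checkable snapshot at logical time t.
    #include <cstdint>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    using State = std::vector<std::string>;  // a process's exposed tuples, simplified

    struct SnapshotAssembler {
        // process -> (logical timestamp -> state exposed at that time)
        std::map<std::string, std::map<uint64_t, State>> reports;

        // check(t) may run once every live member has advanced to time t;
        // a member without a report at t contributes its last earlier state
        // (the carry-forward above, e.g. SA(6) = SA(2)).
        bool TryCheck(uint64_t t, const std::set<std::string>& members, State* out) {
            out->clear();
            for (const std::string& p : members) {
                const auto& byTs = reports[p];
                if (byTs.lower_bound(t) == byTs.end()) return false;  // p behind t
                auto it = byTs.upper_bound(t);                        // first after t
                if (it == byTs.begin()) return false;                 // nothing <= t
                --it;                                                 // latest <= t
                out->insert(out->end(), it->second.begin(), it->second.end());
            }
            return true;  // snapshot complete: run the predicate over *out
        }
    };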

Page 12: D 3 S: Debugging Deployed Distributed Systems

Experimental Method

• Debugging five real systems
  – Can D3S help developers find bugs?
  – Are predicates simple to write?
  – Is the checking overhead acceptable?
• Case: Chord implementation (i3)
  – Uses predecessor and successor lists to stabilize
  – "Holes" and overlaps in key coverage
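
A hedged sketch of how such a key-coverage predicate might be computed from exposed (node, key range) states (the representation is an assumption):

    // Hedged sketch: aggregate key-range coverage over exposed Chord states.
    #include <cstdint>
    #include <vector>

    struct Range { uint64_t begin, end; };  // [begin, end) on the unwrapped ring

    // Coverage ratio: total claimed length / ring size. 1.0 is a perfect ring;
    // < 1.0 implies holes, > 1.0 implies overlapping ownership.
    double CoverageRatio(const std::vector<Range>& owned, uint64_t ring_size) {
        uint64_t total = 0;
        for (const Range& r : owned) total += r.end - r.begin;
        return static_cast<double>(total) / ring_size;
    }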

Page 13: D 3 S: Debugging Deployed Distributed Systems

Chord Overlay

Perfect ring:
• No overlap, no hole
• Aggregated key coverage is 100%

[Chart: key range coverage ratio (0%-200%) over time (0-80,000 seconds), for 3 vs. 8 predecessors.]

Consistency vs. availability: cannot get both
• A global measure of the contributing factors
• Shows the tradeoff quantitatively, for performance tuning
• Capable of checking detailed key coverage

[Chart: number of Chord nodes holding each key (0-4) across key serials 0-256, for 3 vs. 8 predecessors.]

Page 14: D 3 S: Debugging Deployed Distributed Systems

Summary of Results

Application | LoC | Predicates | LoP | Results
PacificA (structured data storage) | 67,263 | membership consistency; leader election; consistency among replicas | 118 | 3 correctness bugs
Paxos implementation | 6,993 | consistency in consensus outputs; leader election | 50 | 2 correctness bugs
Web search engine | 26,036 | unbalanced response time of indexing servers | 81 | 1 performance problem
Chord (DHT) | 7,640 | aggregate key range coverage; conflicting key holders | 72 | tradeoff between availability & consistency
BitTorrent client | 36,117 | health of neighbor set; distribution of downloaded pieces; peer contribution rank | 210 | 2 performance bugs; free riders

(The first three are data-center applications; Chord and BitTorrent are wide-area applications. LoC = lines of code of the checked system; LoP = lines of predicate code.)

Page 15: D 3 S: Debugging Deployed Distributed Systems

Overhead (PacificA)

[Chart: time to complete (seconds) vs. number of clients (2-10), each sending 10,000 requests, with and without D3S checking; measured overheads: 7.21%, 4.38%, 3.94%, 4.20%, 7.24%.]

• Less than 8%, in most cases less than 4%
• I/O overhead < 0.5%
• Overhead is negligible in the other checked systems

Page 16: D 3 S: Debugging Deployed Distributed Systems

Related Work

• Log analysis
  – Magpie [OSDI'04], Pip [NSDI'06], X-Trace [NSDI'07]
• Predicate checking at replay time
  – WiDS Checker [NSDI'07], Friday [NSDI'07]
• P2-based online monitoring
  – P2-monitor [EuroSys'06]
• Model checking
  – MaceMC [NSDI'07], CMC [OSDI'04]

Page 17: D 3 S: Debugging Deployed Distributed Systems

Conclusion

• Predicate checking is effective for debugging deployed, large-scale distributed systems
• D3S enables:
  – Changing what is monitored on-the-fly
  – Checking with multiple checkers
  – Specifying predicates in a sequential, centralized manner

Page 18: D 3 S: Debugging Deployed Distributed Systems

Thank You

• Thanks to the authors for providing some of the slides

Page 19: D 3 S: Debugging Deployed Distributed Systems

PNUTS: Yahoo!'s Hosted Data Serving Platform

Brian F. Cooper et al. @ Yahoo! Research

Presented by Ying-Yi Liang
(Some slides come from the authors' version.)

Page 20: D 3 S: Debugging Deployed Distributed Systems

What is the Problem?

The web era: web applications
• Users are picky: low latency, high availability
• Enterprises are greedy: high scalability
• Things move fast: new ideas expire very soon

Two ways of developing a cool web application:
• Making your own fire: quick and cool, but tiring and error-prone
• Using huge "powerful" building blocks: wonderful, but the market will have shifted away by the time you are done
Neither way scales very well…

Something is missing: an infrastructure specially tailored to web applications!

Page 21: D 3 S: Debugging Deployed Distributed Systems

Web Application Model

• Object sharing: Blogs, Flickr, Web Picasa, YouTube, …
• Social: Facebook, Twitter, …
• Listing: Yahoo! Shopping, del.icio.us, news

They require:
• High scalability, availability, and fault tolerance
• Acceptable latency for geographically distributed requests
• A simplified query API
• Some consistency (weaker than sequential consistency)

Page 22: D 3 S: Debugging Deployed Distributed Systems

PNUTS – DB in the Cloud

[Diagram: a table of records, e.g. (A, 42342, E), (B, 42521, W), (C, 66354, W), (D, 12352, E), (E, 75656, C), (F, 15677, E), replicated across regions.]

    CREATE TABLE Parts (
      ID VARCHAR,
      StockNumber INT,
      Status VARCHAR,
      …
    )

• Parallel database
• Geographic replication
• Indexes and views
• Structured, flexible schema
• Hosted, managed infrastructure

Page 23: D 3 S: Debugging Deployed Distributed Systems

Basic Concepts

[Diagram: anatomy of a tablet. Each row is a record; its primary key (Grape, Lime, Apple, Strawberry, Orange, Avocado, Lemon, Tomato, Banana, Kiwi) maps to fields such as "Grapes are good to eat" or "Limes are green"; a group of such records forms a tablet.]

Page 24: D 3 S: Debugging Deployed Distributed Systems

A view from 10,000-ft

Page 25: D 3 S: Debugging Deployed Distributed Systems

PNUTS Storage Architecture

[Diagram: clients issue requests through a REST API to routers; a tablet controller manages the assignment of tablets to storage units; a message broker propagates updates.]

Page 26: D 3 S: Debugging Deployed Distributed Systems

Geographic Replication

[Diagram: the same stack (clients, REST API, routers, tablet controller, storage units, message broker) deployed in Region 1, Region 2, and Region 3, with the message broker replicating updates across regions.]

Page 27: D 3 S: Debugging Deployed Distributed Systems

In-region Load Balance

[Diagram: tablets are moved between storage units within a region to balance load.]

Page 28: D 3 S: Debugging Deployed Distributed Systems

Data and Query Models

• Simplified relational data model: tables of records
• Typed columns; typical data types plus the blob type
• Does not enforce inter-table relationships
• Operations: selection, projection (no join, aggregation, …)
• Options: point access, range query, multiget
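
To make the API surface concrete, a hedged sketch of what client calls might look like (all names here are illustrative, not the actual PNUTS interface; an in-memory map stands in for the real store):

    // Hypothetical PNUTS-style client API over an in-memory ordered table.
    #include <map>
    #include <optional>
    #include <string>
    #include <vector>

    struct Record { std::string key; std::string fields; };

    class TableClient {
        std::map<std::string, Record> rows_;  // ordered: enables range queries
    public:
        void Put(const Record& r) { rows_[r.key] = r; }
        // Point access by primary key.
        std::optional<Record> Get(const std::string& key) const {
            auto it = rows_.find(key);
            if (it == rows_.end()) return std::nullopt;
            return it->second;
        }
        // Range query: all records with lo <= key < hi.
        std::vector<Record> Scan(const std::string& lo, const std::string& hi) const {
            std::vector<Record> out;
            for (auto it = rows_.lower_bound(lo); it != rows_.end() && it->first < hi; ++it)
                out.push_back(it->second);
            return out;
        }
        // Multiget: a batch of point lookups.
        std::vector<std::optional<Record>> MultiGet(const std::vector<std::string>& keys) const {
            std::vector<std::optional<Record>> out;
            for (const auto& k : keys) out.push_back(Get(k));
            return out;
        }
    };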

Page 29: D 3 S: Debugging Deployed Distributed Systems

Record Assignment

[Diagram: the router maps key intervals to storage units: MIN-Canteloupe → SU1, Canteloupe-Lime → SU3, Lime-Strawberry → SU2, Strawberry-MAX → SU1. Records (Apple, Avocado, Banana, Blueberry, Canteloupe, Grape, Kiwi, Lemon, Lime, Mango, Orange, Strawberry, Tomato, Watermelon) live on the storage unit owning their interval.]

Page 30: D 3 S: Debugging Deployed Distributed Systems

Single Point Update

[Diagram: numbered write path through routers and message brokers. (1)-(2) "write key k" travels from the client via a router to the storage unit holding k's tablet; (3)-(4) the storage unit publishes the write to the message broker, which commits it; (5) SUCCESS returns to the client; (6) the broker delivers "write key k" to the replica storage units; (7)-(8) a sequence number for key k is assigned and propagated.]

Page 31: D 3 S: Debugging Deployed Distributed Systems

Range Query

[Diagram: a range query such as "Grapefruit…Pear?" reaches the router, which uses its interval map (MIN-Canteloupe → SU1, Canteloupe-Lime → SU3, Lime-Strawberry → SU2, Strawberry-MAX → SU1) to split it into sub-scans "Grapefruit…Lime?" for SU3 and "Lime…Pear?" for SU2.]
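
A hedged sketch of the router's interval lookup implied by the last two diagrams (the data structures are assumptions):

    // Hedged sketch: router interval map and range-query splitting.
    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // Interval map keyed by each interval's (exclusive) upper bound:
    // keys below "Canteloupe" go to SU1, ["Canteloupe","Lime") to SU3, etc.
    const std::map<std::string, std::string> kUpperBoundToSU = {
        {"Canteloupe", "SU1"}, {"Lime", "SU3"},
        {"Strawberry", "SU2"}, {"\x7f", "SU1"}};  // "\x7f" stands in for MAX

    // Point lookup: the first interval whose upper bound exceeds the key.
    std::string LookupSU(const std::string& key) {
        return kUpperBoundToSU.upper_bound(key)->second;
    }

    // Range query: visit every interval overlapping [lo, hi).
    std::vector<std::pair<std::string, std::string>> SplitRange(
            const std::string& lo, const std::string& hi) {
        std::vector<std::pair<std::string, std::string>> subscans;  // (SU, start)
        auto it = kUpperBoundToSU.upper_bound(lo);
        std::string start = lo;
        for (;;) {
            subscans.push_back({it->second, start});
            if (hi <= it->first) break;   // the range ends inside this interval
            start = it->first;            // continue from the interval boundary
            ++it;
        }
        return subscans;
    }

    int main() {
        for (const auto& [su, from] : SplitRange("Grapefruit", "Pear"))
            std::cout << su << " scans from " << from << "\n";
        // Prints SU3 (from Grapefruit) then SU2 (from Lime), matching the slide.
    }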

Page 32: D 3 S: Debugging Deployed Distributed Systems

Relaxed Consistency

• ACID transactions / sequential consistency: too strong
  – Non-trivial overhead in asynchronous settings
  – Users can tolerate stale data in many cases
• Go hybrid: eventual consistency + mechanisms for stronger consistency
• Use versioning to cope with asynchrony

[Timeline: a record is inserted, repeatedly updated, and finally deleted, producing versions v. 1 through v. 8 within Generation 1.]

Page 33: D 3 S: Debugging Deployed Distributed Systems

Relaxed Consistency

[Timeline, versions v. 1 … v. 8 of Generation 1: read_any() may return any version, including a stale one.]

Page 34: D 3 S: Debugging Deployed Distributed Systems

Relaxed Consistency

[Timeline: read_latest() always returns the current version.]

Page 35: D 3 S: Debugging Deployed Distributed Systems

Relaxed Consistency

[Timeline: read_critical("v.6") returns a version at least as new as v. 6.]

Page 36: D 3 S: Debugging Deployed Distributed Systems

Relaxed Consistency

[Timeline: write() unconditionally produces a new current version.]

Page 37: D 3 S: Debugging Deployed Distributed Systems

Relaxed Consistency

[Timeline: test_and_set_write(v.7) succeeds only if v. 7 is still the current version; here a newer version exists, so the call returns ERROR.]
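
A hedged sketch of these per-record API semantics over a versioned record (a simplification: the real calls also involve replicas and record mastership):

    // Hedged sketch: version-based read/write semantics on one record.
    #include <cstdlib>
    #include <optional>
    #include <string>
    #include <vector>

    struct VersionedRecord {
        std::vector<std::string> versions;  // versions[i] holds v.(i+1) of this generation

        // read_any(): any replica may answer, so any (possibly stale) version.
        std::string read_any() const { return versions[std::rand() % versions.size()]; }

        // read_latest(): always the current version.
        std::string read_latest() const { return versions.back(); }

        // read_critical(n): some version at least as new as v.n; the current
        // version trivially qualifies once v.n exists.
        std::optional<std::string> read_critical(size_t n) const {
            if (n > versions.size()) return std::nullopt;  // v.n not yet visible
            return versions.back();
        }

        // write(): unconditionally installs the next version.
        void write(const std::string& val) { versions.push_back(val); }

        // test_and_set_write(n): succeeds only if v.n is still current,
        // matching the ERROR on the slide when a newer version exists.
        bool test_and_set_write(size_t n, const std::string& val) {
            if (versions.size() != n) return false;  // ERROR
            versions.push_back(val);
            return true;
        }
    };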

Page 38: D 3 S: Debugging Deployed Distributed Systems

Membership Management

• Record timelines should be coherent for each replica
  – Updates must be applied to the latest version
• Use mastership
  – Per-record basis
  – Only one replica holds mastership at any time
  – All update requests are sent to the master to get ordered
  – Routers & YMB maintain mastership information
  – A replica receiving frequent write requests takes over the mastership
  – Leader election service provided by ZooKeeper
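
A hedged sketch of the per-record mastership handoff just described (the counter and threshold are assumptions; the actual PNUTS policy differs in detail):

    // Hedged sketch: per-record mastership with adaptive handoff.
    #include <string>
    #include <unordered_map>

    struct RecordMeta {
        std::string master;                                  // region holding mastership
        std::unordered_map<std::string, int> recent_writes;  // region -> recent writes
    };

    // Route an update originating in `region`: all updates are ordered by the
    // master; a region that keeps writing the record takes over mastership.
    std::string route_update(RecordMeta& m, const std::string& region) {
        int n = ++m.recent_writes[region];
        if (region != m.master && n > 3) {  // illustrative handoff threshold
            m.master = region;
            m.recent_writes.clear();        // restart counting after the handoff
        }
        return m.master;                    // forward the write here for ordering
    }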

Page 39: D 3 S: Debugging Deployed Distributed Systems

ZooKeeper

• A distributed system is like a zoo; someone needs to be in charge of it.
• ZooKeeper is a highly available, scalable coordination service.
• ZooKeeper plays two roles in PNUTS:
  – Coordination service
  – Publish/subscribe service
• Guarantees: sequential consistency; single system image; atomicity (as in ACID); durability; timeliness
• A tiny kernel for upper-level building blocks

Page 40: D 3 S: Debugging Deployed Distributed Systems

ZooKeeper: High Availability

• High availability via replication
• A fault-tolerant persistent store
• Provides sequential consistency

Page 41: D 3 S: Debugging Deployed Distributed Systems

ZooKeeper: Services

• Publish/Subscribe service
  – Contents stored in ZooKeeper are organized as directory trees
  – Publish: write to a specific znode
  – Subscribe: read a specific znode
• Coordination via automatic name resolution
  – By appending sequence numbers to names:
    CREATE("/…/x-", host, EPHEMERAL | SEQUENCE) → "/…/x-1", "/…/x-2", …
  – Ephemeral znodes live only as long as the session

Page 42: D 3 S: Debugging Deployed Distributed Systems

ZooKeeper Example: Lock

1) id = create("…/locks/x-", SEQUENCE | EPHEMERAL);
2) children = getChildren("…/locks", false);
3) if (children.head == id) exit();   // lowest sequence number holds the lock
4) test = exists(name of last child before id, true);
5) if (test == false) goto 2);
6) wait for modification to "…/locks";
7) goto 2);
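
For concreteness, a hedged sketch of the same recipe against the ZooKeeper C client API (error handling and actually blocking on the watch event are omitted for brevity):

    // Hedged sketch: the lock recipe on the ZooKeeper C client.
    #include <zookeeper/zookeeper.h>
    #include <algorithm>
    #include <string>
    #include <vector>

    std::string acquire_lock(zhandle_t* zh, const std::string& dir) {
        char buf[512];
        // Step 1: ephemeral + sequential znode, e.g. ".../locks/x-0000000007".
        zoo_create(zh, (dir + "/x-").c_str(), "", 0, &ZOO_OPEN_ACL_UNSAFE,
                   ZOO_EPHEMERAL | ZOO_SEQUENCE, buf, sizeof(buf));
        std::string me = std::string(buf).substr(dir.size() + 1);  // child name only
        for (;;) {
            // Step 2: list children; sorting works because suffixes are zero-padded.
            String_vector sv;
            zoo_get_children(zh, dir.c_str(), 0, &sv);
            std::vector<std::string> kids(sv.data, sv.data + sv.count);
            deallocate_String_vector(&sv);
            std::sort(kids.begin(), kids.end());
            // Step 3: the lowest sequence number holds the lock.
            auto pos = std::find(kids.begin(), kids.end(), me);
            if (pos == kids.begin()) return dir + "/" + me;
            // Steps 4-7: watch the immediate predecessor, then re-check.
            std::string prev = dir + "/" + *(pos - 1);
            struct Stat st;
            if (zoo_exists(zh, prev.c_str(), 1 /* set watch */, &st) == ZNONODE)
                continue;  // predecessor vanished: re-examine the children
            // A full implementation would block here until the watcher fires.
        }
    }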

Page 43: D 3 S: Debugging Deployed Distributed Systems

ZooKeeper Is Powerful

Many core services in distributed systems are built on ZooKeeper:
• Consensus
• Distributed locks (exclusive, shared)
• Membership
• Leader election
• Job tracker binding
• …

More information at http://hadoop.apache.org/zookeeper/

Page 44: D 3 S: Debugging Deployed Distributed Systems

Experimental Setup

• Production PNUTS code, enhanced with an ordered table type
• Three PNUTS regions
  – 2 west coast, 1 east coast
  – 5 storage units, 2 message brokers, 1 router
  – West: dual 2.8 GHz Xeon, 4 GB RAM, 6-disk RAID 5 array
  – East: quad 2.13 GHz Xeon, 4 GB RAM, 1 SATA disk
• Workload
  – 1200-3600 requests/second
  – 0-50% writes
  – 80% locality

Page 45: D 3 S: Debugging Deployed Distributed Systems

Scalability

[Chart: average latency (ms, 0-160) vs. number of storage units (2-5), for hash table vs. ordered table.]

Page 46: D 3 S: Debugging Deployed Distributed Systems

Sensitivity to R/W Ratio

[Chart: average latency (ms, 0-140) vs. write percentage (0-50%), for hash table vs. ordered table.]

Page 47: D 3 S: Debugging Deployed Distributed Systems

Sensitivity to Request Dist.

[Chart: average latency (ms, 0-100) vs. Zipf factor (0-1), for hash table vs. ordered table.]

Page 48: D 3 S: Debugging Deployed Distributed Systems

Related Work

• Google BigTable/GFS
  – Fault tolerance and consistency via Chubby
  – Strong consistency, but Chubby is not scalable
  – Lacks geographic replication support
  – Targets analytical workloads
• Amazon Dynamo
  – Unstructured data
  – Peer-to-peer style solution
  – Eventual consistency
• Facebook Cassandra (still kind of a secret)
  – Structured storage over a peer-to-peer network
  – Eventual consistency
  – Always-writable property: writes succeed even in the face of failures

Page 49: D 3 S: Debugging Deployed Distributed Systems

Discussion

• Can all web applications tolerate stale data?
• Is doing replication completely across the WAN a good idea?
• Single-level router vs. B+-tree style router hierarchy
• Tiny service kernel vs. standalone services
• Is relaxed consistency just right, or too weak?
• Is exposing record versions to applications a good idea?
• Should security be integrated into PNUTS?
• Using the pub/sub service as undo logs