
Efficient replica maintenance for distributed storage systems

Byung-Gon Chun, Frank Dabek, Andreas Haeberlen, Emil Sit, Hakim Weatherspoon, M. Frans Kaashoek, John Kubiatowicz, and Robert Morris. 2006.


About the Paper

• 137 citations (Google Scholar)

• In Proceedings of the 3rd Conference on Networked Systems Design & Implementation (NSDI'06), Vol. 3. USENIX Association, Berkeley, CA, USA, 4-4.

• Research from the OceanStore project


Credit

• Modified version of http://nrl.iis.sinica.edu.tw/Web2.0/presentation/ERMfDSS.ppt


Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient costs
• Implementation
• Conclusion


Related terms

• Distributed Storage Systems
  • Storage systems that aggregate the disks of many nodes scattered throughout the Internet


Motivation

• One of the most important goals of a distributed storage system: robustness

• Solution? → Replication

• Durability: objects that an application has put into the system are not lost due to disk failure

• Availability: a get will be able to return the object promptly


Motivation

• Failure
  • Transient failure: affects availability
    • Out of power
    • Scheduled maintenance
    • Network problem
  • Permanent failure: affects durability
    • Disk failure


Contribution

• Develop and implement an algorithm, Carbonite, that stores immutable objects durably and at low bandwidth cost in a distributed storage system

  • Create replicas fast enough to handle failures
  • Keep track of all the replicas
  • Use a model to determine a reasonable number of replicas


Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient costs
• Implementation
• Conclusion


Model of the relationship between:

• Network capacity
• Amount of replicated data
• Number of replicas
• Durability

Providing Durability

• Durability is more practical and useful than availability

• Challenges to durability
  • Create new replicas faster than losing them
  • Reduce network bandwidth
  • Distinguish transient failures from permanent disk failures
  • Reintegration


Challenges to Durability

• Create new replicas faster than replicas are destroyed
  • Creation rate < failure rate → the system is infeasible
  • A higher number of replicas does not allow the system to survive a high average failure rate
  • Creation rate = failure rate + ε (ε small) → bursts of failures may destroy all of the replicas


Number of Replicas as a Birth-Death Process

• Assumption: independent, exponentially distributed inter-failure and inter-repair times

• λf : average failure rate

• μi: average repair rate at state i

• rL: a lower bound on the number of replicas; the target number of replicas needed to survive bursts of failures
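Not from the slides: a minimal Gillespie-style simulation sketch of this birth-death model. The per-object rates, trial count, and 4-year horizon are illustrative placeholders (the rates are borrowed from the PlanetLab example later in the deck, and μ is treated as a per-object repair rate, which is a simplification).

```python
import random

def object_survives(lam_f, mu, r_l, years, rng):
    """Simulate one object's replica count as a birth-death chain.
    In state i, each of the i replicas fails at rate lam_f; a new replica
    is created at rate mu whenever fewer than r_l replicas remain.
    Returns True if the object still has a replica after `years` years."""
    i, t = r_l, 0.0
    while t < years:
        fail_rate = i * lam_f
        repair_rate = mu if i < r_l else 0.0
        t += rng.expovariate(fail_rate + repair_rate)
        if t >= years:
            break
        if rng.random() < fail_rate / (fail_rate + repair_rate):
            i -= 1
            if i == 0:
                return False        # all replicas lost: the object is gone for good
        else:
            i += 1
    return True

rng = random.Random(42)
lam_f, mu, r_l = 0.439, 3.0, 3      # per-year rates, as in the PlanetLab example
trials = 5000
alive = sum(object_survives(lam_f, mu, r_l, 4.0, rng) for _ in range(trials))
print(f"fraction of objects surviving 4 years: {alive / trials:.3f}")
```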


Model Simplification

• With fixed μ and λ, the equilibrium number of replicas is Θ = μ/λ

• Higher values of Θ decrease the time it takes to repair an object


Example: PlanetLab

• 490 nodes
• Average inter-failure time: 39.85 hours
• 150 KB/s bandwidth

• Assumptions
  • 500 GB of data per node
  • rL = 3

• λ = 365 / (490 × (39.85 / 24)) = 0.439 disk failures / year

• μ = 365 days / (500 GB × 3 / 150 KB/s ≈ 116 days) = 3 disk copies / year

• Θ = μ/λ = 6.85


PlanetLab
• A large research testbed for computer networking and distributed systems research, with nodes located around the world
• Different organizations each donate one or more computers
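A quick sanity check of the numbers above (not from the slides; the unit conversions and the decimal GB→KB factor are my assumptions):

```python
nodes = 490
inter_failure_hours = 39.85        # mean time between failures anywhere in the system
bandwidth_kb_s = 150               # per-node maintenance bandwidth
data_gb_per_node = 500
r_l = 3

# Per-disk failure rate: one failure somewhere every 39.85 h, spread over 490 disks.
lam = 365 / (nodes * inter_failure_hours / 24)                 # failures / disk / year

# Per-disk repair rate: time to push r_L copies of 500 GB at 150 KB/s.
copy_days = data_gb_per_node * r_l * 1e6 / bandwidth_kb_s / 86400
mu = 365 / copy_days                                           # disk copies / year

print(f"lambda = {lam:.3f}/yr, mu = {mu:.2f}/yr, theta = {mu / lam:.2f}")
# Prints roughly lambda = 0.449, mu = 3.15, theta = 7.0 -- close to the 0.439,
# 3, and 6.85 quoted on the slide, which presumably come from the full trace.
```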


λ depends on
• Number of nodes
• Frequency of failure

μ depends on
• Amount of data
• Bandwidth

Θ = μ/λ = constant × bandwidth × #nodes × inter-failure time / (amount of data × rL)

Impact of Θ


• Bandwidth ↑ → μ ↑ → Θ ↑

• rL ↑ → μ ↓ → Θ ↓

• Θ is the theoretical upper limit on the number of extra replicas
• If Θ < 1, the system can no longer maintain full replication, regardless of rL

Θ = constant × bandwidth / rL

Choosing rL

• Guidelines
  • Large enough to ensure durability: at least one more than the maximum burst of simultaneous failures
  • Small enough not to waste bandwidth: if a low value of rL would suffice, the bandwidth spent maintaining extra replicas would be wasted


rL vs. Durability

• Higher rL means higher cost but tolerates more bursts of failures

• Larger data size or higher failure rate λ → a higher rL is needed

Analytical results from PlanetLab traces (4 years)


Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient costs
• Implementation
• Conclusion


Proper placement of replicas on servers

Node’s Scope

• Definition: Each node, n, designates a set of other nodes that can potentially hold copies of the objects that n is responsible for. We call the size of that set the node's scope.

• scope ∈ [rL, N]

• N: total number of nodes in the system


Effect of Scope

• Small scope
  • Easy to keep track of objects
  • Needs more time to create new replicas

• Big scope
  • Reduces repair time, thus increases durability
  • Needs to monitor more nodes
  • With a large number of objects and random placement, may increase the likelihood that simultaneous failures destroy all replicas of some object


Scope vs. Repair Time

• Scope ↑ → repair work is spread over more access links and completes faster

• rL ↓ → scope must be higher to achieve the same durability
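A rough back-of-the-envelope illustration of the first bullet (not from the slides): assume the data on a failed 500 GB disk is re-replicated by the nodes in its scope in parallel, each limited by a 150 KB/s access link, with the work split evenly.

```python
def repair_hours(failed_disk_gb, scope, link_kb_s=150):
    """Estimated time to re-replicate a failed disk when the repair work is
    split evenly across `scope` nodes, each bound by its own access link.
    Ignores coordination overhead and uneven object placement."""
    share_kb = failed_disk_gb * 1e6 / scope
    return share_kb / link_kb_s / 3600

for scope in (3, 10, 50, 490):
    print(f"scope = {scope:3d}: ~{repair_hours(500, scope):6.1f} hours")
# scope = 3 -> ~308.6 h; scope = 490 -> ~1.9 h: a larger scope shrinks the
# window during which further failures can destroy the remaining replicas.
```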


Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient costs
• Implementation
• Conclusion


Reduce bandwidth wasted due to transient failures

• How to distinguish transient from permanent failures
• Reintegrate replicas stored on nodes after transient failures

Reintegration

• Reintegrate replicas stored on nodes after transient failures

• System must be able to track all the replicas


Effect of Node Availability

• a: the average fraction of time that a node is available

• Pr[a new replica needs to be created] = Pr[fewer than rL replicas are available] (a cumulative binomial in n, the total number of replicas, assuming nodes are available independently with probability a)

• Chernoff bound: 2rL/a replicas are needed to keep rL copies available
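A small sketch of the calculation behind these two bullets (not from the slides): with each replica's node independently available with probability a, the chance that fewer than rL of n replicas are up is a cumulative binomial; the loop below finds the smallest n that pushes this below a threshold (the 10^-3 threshold is an illustrative choice of mine) and compares it with the 2rL/a rule of thumb.

```python
from math import comb

def pr_fewer_than(r_l, n, a):
    """P[fewer than r_l of n replicas are available], assuming each replica's
    node is up independently with probability a."""
    return sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(r_l))

def replicas_needed(r_l, a, target=1e-3):
    """Smallest n such that P[fewer than r_l available] <= target."""
    n = r_l
    while pr_fewer_than(r_l, n, a) > target:
        n += 1
    return n

r_l = 3
for a in (0.5, 0.7, 0.9):
    print(f"a = {a}: n = {replicas_needed(r_l, a)} replicas "
          f"(2*rL/a rule of thumb: {2 * r_l / a:.1f})")
```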


Node Availability vs. Reintegration

• Reintegration can work safely with 2rL/a replicas

• 2/a is the penalty for not being able to distinguish transient from permanent failures

• rL = 3


Create replicas as needed

• Probability of making replicas depends on a
• Estimate a?


Timeouts

• A heuristic to avoid misclassifying temporary failures as permanent

• Setting a timeout can reduce the response to transient failures, but its success depends greatly on its relationship to the downtime distribution and can, in some instances, reduce durability as well


Four Replication Algorithms

• Cates
  • Fixed number of replicas rL
  • Timeout

• Total Recall
  • Batch: in addition to rL replicas, make e additional copies so that repair is needed less frequently

• Carbonite
  • Timeout + reintegration

• Oracle
  • Hypothetical system that can differentiate transient failures from permanent failures
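A minimal sketch of a Carbonite-style maintenance pass as described above (a timeout before reacting to a failure, plus reintegration of replicas that come back); the data structures and function names are illustrative, not the paper's actual implementation.

```python
import time

class ObjectState:
    """Remembers every node that has ever held a replica of one object, so
    copies on nodes that return after a transient failure are reintegrated."""
    def __init__(self, replica_nodes):
        self.replica_nodes = set(replica_nodes)

def maintain(obj, r_l, is_reachable, last_seen, timeout_s, make_new_replica):
    """One maintenance pass for one object.
    is_reachable(node) -> bool: result of the latest probe
    last_seen[node]: timestamp of the last successful probe
    timeout_s: how long a node may be down before we react to its failure
    make_new_replica(obj) -> node: places a copy on a fresh node
    """
    now = time.time()
    reachable = {n for n in obj.replica_nodes if is_reachable(n)}
    # Nodes that are down but still within the timeout are counted anyway,
    # on the assumption that their failure may be transient.
    pending = {n for n in obj.replica_nodes
               if n not in reachable and now - last_seen.get(n, 0.0) <= timeout_s}
    counted = reachable | pending
    # Repair only when fewer than r_l copies are counted. Nothing is ever
    # removed from obj.replica_nodes, so a replica on a node that later
    # returns is simply counted again (reintegration).
    while len(counted) < r_l:
        new_node = make_new_replica(obj)
        obj.replica_nodes.add(new_node)
        counted.add(new_node)
    return counted
```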


Comparison


Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient costs
• Implementation
• Conclusion


Challenges like node monitoring

Node Monitoring for Failure Detection

• Carbonite requires that each node know the number of available replicas of each object for which it is responsible

• The goal of monitoring is to allow the nodes to track the number of available replicas

• Challenges:
  • Monitoring consistent hashing systems
  • Monitoring host availability


Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient costs
• Implementation
• Conclusion


Conclusion

• Described a set of techniques that allow wide-area systems to efficiently store and maintain large amounts of data

• Carbonite keeps all data durable and uses 44% more network traffic than a hypothetical system that only responds to permanent failures. In comparison, Total Recall and DHash require almost a factor of two more network traffic than this hypothetical system.


Questions or Comments?

Thanks!
