
Efficient replica maintenance for distributed storage systems

Byung-Gon Chun, Frank Dabek, Andreas Haeberlen, Emil Sit, Hakim Weatherspoon, M. Frans Kaashoek, John Kubiatowicz, and Robert Morris. 2006.


About the Paper

• 137 citations (Google Scholar)

• In Proceedings of the 3rd Conference on Networked Systems Design & Implementation (NSDI'06), Vol. 3. USENIX Association, Berkeley, CA, USA, 4-4.

• Research from the OceanStore project


Credit

• Modified version of http://nrl.iis.sinica.edu.tw/Web2.0/presentation/ERMfDSS.ppt


Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient costs
• Implementation
• Conclusion


Related terms

• Distributed Storage Systems
  • Storage systems that aggregate the disks of many nodes scattered throughout the Internet


Motivation

• One of the most important goals of a distributed storage system: robustness

• Solution? → Replication

• Durability: objects that an application has put into the system are not lost due to disk failure

• Availability: a get will be able to return the object promptly


Motivation

• Failure
  • Transient failure: affects availability
    • Out of power
    • Scheduled maintenance
    • Network problem
  • Permanent failure: affects durability
    • Disk failure


Contribution

• Develop and implement an algorithm, Carbonite, that stores immutable objects durably and at low bandwidth cost in a distributed storage system

  • Create replicas fast enough to handle failures
  • Keep track of all the replicas
  • Use a model to determine a reasonable number of replicas


Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient costs
• Implementation
• Conclusion


Model of the relationship between:

• Network capacity
• Amount of replicated data
• Number of replicas
• Durability

Providing Durability

• Durability is more practical and useful than availability

• Challenges to durability
  • Create new replicas faster than losing them
  • Reduce network bandwidth
  • Distinguish transient failures from permanent disk failures
  • Reintegration


Challenges to Durability

• Create new replicas faster than replicas are destroyed
  • Creation rate < failure rate → the system is infeasible
  • A higher number of replicas does not allow the system to survive a high average failure rate
  • Creation rate = failure rate + ε (ε small) → bursts of failures may destroy all of the replicas


Number of Replicas as a Birth-Death Process

• Assumption: independent, exponentially distributed inter-failure and inter-repair times

• λf : average failure rate

• μi: average repair rate at state i

• rL: a lower bound on the number of replicas; the target number of replicas needed to survive bursts of failures
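Not from the slides: a minimal Gillespie-style simulation sketch of this birth-death model. The per-object rates, trial count, and 4-year horizon are illustrative placeholders (the rates are borrowed from the PlanetLab example later in the deck, and μ is treated as a per-object repair rate, which is a simplification).

```python
import random

def object_survives(lam_f, mu, r_l, years, rng):
    """Simulate one object's replica count as a birth-death chain.
    In state i, each of the i replicas fails at rate lam_f; a new replica
    is created at rate mu whenever fewer than r_l replicas remain.
    Returns True if the object still has a replica after `years` years."""
    i, t = r_l, 0.0
    while t < years:
        fail_rate = i * lam_f
        repair_rate = mu if i < r_l else 0.0
        t += rng.expovariate(fail_rate + repair_rate)
        if t >= years:
            break
        if rng.random() < fail_rate / (fail_rate + repair_rate):
            i -= 1
            if i == 0:
                return False        # all replicas lost: the object is gone for good
        else:
            i += 1
    return True

rng = random.Random(42)
lam_f, mu, r_l = 0.439, 3.0, 3      # per-year rates, as in the PlanetLab example
trials = 5000
alive = sum(object_survives(lam_f, mu, r_l, 4.0, rng) for _ in range(trials))
print(f"fraction of objects surviving 4 years: {alive / trials:.3f}")
```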


Model Simplification

• With fixed μ and λ, the equilibrium number of replicas is Θ = μ/λ

• Higher values of Θ decrease the time it takes to repair an object


Example: PlanetLab

• 490 nodes
• Average inter-failure time: 39.85 hours
• 150 KB/s bandwidth

• Assumptions
  • 500 GB of data per node
  • rL = 3

• λ = 365 / (490 × (39.85 / 24)) = 0.439 disk failures / year

• μ = 365 days / (500 GB × 3 / 150 KB/s ≈ 116 days) = 3 disk copies / year

• Θ = μ/λ = 6.85


PlanetLab
• A large research testbed for computer networking and distributed systems research, with nodes located around the world
• Different organizations each donate one or more computers
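A quick sanity check of the numbers above (not from the slides; the unit conversions and the decimal GB→KB factor are my assumptions):

```python
nodes = 490
inter_failure_hours = 39.85        # mean time between failures anywhere in the system
bandwidth_kb_s = 150               # per-node maintenance bandwidth
data_gb_per_node = 500
r_l = 3

# Per-disk failure rate: one failure somewhere every 39.85 h, spread over 490 disks.
lam = 365 / (nodes * inter_failure_hours / 24)                 # failures / disk / year

# Per-disk repair rate: time to push r_L copies of 500 GB at 150 KB/s.
copy_days = data_gb_per_node * r_l * 1e6 / bandwidth_kb_s / 86400
mu = 365 / copy_days                                           # disk copies / year

print(f"lambda = {lam:.3f}/yr, mu = {mu:.2f}/yr, theta = {mu / lam:.2f}")
# Prints roughly lambda = 0.449, mu = 3.15, theta = 7.0 -- close to the 0.439,
# 3, and 6.85 quoted on the slide, which presumably come from the full trace.
```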


λ depends on
• Number of nodes
• Frequency of failure

μ depends on
• Amount of data
• Bandwidth

Θ = μ/λ = constant × bandwidth × #nodes × inter-failure time / (amount of data × rL)

Impact of Θ


• Bandwidth ↑ → μ ↑ → Θ ↑

• rL ↑ → μ ↓ → Θ ↓

• Θ is the theoretical upper limit on the number of extra replicas
• If Θ < 1, the system can no longer maintain full replication, regardless of rL

Θ = constant × bandwidth / rL

Choosing rL

• Guidelines
  • Large enough to ensure durability: at least one more than the maximum burst of simultaneous failures
  • Small enough not to waste bandwidth: if a low value of rL would suffice, the bandwidth spent maintaining extra replicas would be wasted


rL vs. Durability

• Higher rL means higher cost but tolerates more bursts of failures

• Larger data size or higher failure rate λ → a higher rL is needed

Analytical results from PlanetLab traces (4 years)


Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient costs
• Implementation
• Conclusion


Proper placement of replicas on servers

Node’s Scope

• Definition: Each node, n, designates a set of other nodes that can potentially hold copies of the objects that n is responsible for. We call the size of that set the node's scope.

• scope ∈ [rL, N]

• N: total number of nodes in the system


Effect of Scope

• Small scope
  • Easy to keep track of objects
  • Needs more time to create new replicas

• Big scope
  • Reduces repair time, thus increases durability
  • Needs to monitor more nodes
  • With a large number of objects and random placement, may increase the likelihood that simultaneous failures destroy all replicas of some object


Scope vs. Repair Time

• Scope ↑ → repair work is spread over more access links and completes faster

• rL ↓ → scope must be higher to achieve the same durability
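A rough back-of-the-envelope illustration of the first bullet (not from the slides): assume the data on a failed 500 GB disk is re-replicated by the nodes in its scope in parallel, each limited by a 150 KB/s access link, with the work split evenly.

```python
def repair_hours(failed_disk_gb, scope, link_kb_s=150):
    """Estimated time to re-replicate a failed disk when the repair work is
    split evenly across `scope` nodes, each bound by its own access link.
    Ignores coordination overhead and uneven object placement."""
    share_kb = failed_disk_gb * 1e6 / scope
    return share_kb / link_kb_s / 3600

for scope in (3, 10, 50, 490):
    print(f"scope = {scope:3d}: ~{repair_hours(500, scope):6.1f} hours")
# scope = 3 -> ~308.6 h; scope = 490 -> ~1.9 h: a larger scope shrinks the
# window during which further failures can destroy the remaining replicas.
```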


Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient costs
• Implementation
• Conclusion


Reduce bandwidth wasted due to transient failures

• How to distinguish transient from permanent failures
• Reintegrate replicas stored on nodes after transient failures

Reintegration

• Reintegrate replicas stored on nodes after transient failures

• System must be able to track all the replicas


Effect of Node Availability

• a: the average fraction of time that a node is available

• Pr[a new replica needs to be created] = Pr[fewer than rL replicas are available] (a cumulative binomial in n, the total number of replicas, assuming nodes are available independently with probability a)

• Chernoff bound: 2rL/a replicas are needed to keep rL copies available
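A small sketch of the calculation behind these two bullets (not from the slides): with each replica's node independently available with probability a, the chance that fewer than rL of n replicas are up is a cumulative binomial; the loop below finds the smallest n that pushes this below a threshold (the 10^-3 threshold is an illustrative choice of mine) and compares it with the 2rL/a rule of thumb.

```python
from math import comb

def pr_fewer_than(r_l, n, a):
    """P[fewer than r_l of n replicas are available], assuming each replica's
    node is up independently with probability a."""
    return sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(r_l))

def replicas_needed(r_l, a, target=1e-3):
    """Smallest n such that P[fewer than r_l available] <= target."""
    n = r_l
    while pr_fewer_than(r_l, n, a) > target:
        n += 1
    return n

r_l = 3
for a in (0.5, 0.7, 0.9):
    print(f"a = {a}: n = {replicas_needed(r_l, a)} replicas "
          f"(2*rL/a rule of thumb: {2 * r_l / a:.1f})")
```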


Node Availability vs. Reintegration

• Reintegration can work safely with 2rL/a replicas

• 2/a is the penalty for not being able to distinguish transient from permanent failures

• rL = 3


Create replicas as needed

• Probability of making replicas depends on a
• Estimate a?


Timeouts

• A heuristic to avoid misclassifying temporary failures as permanent

• Setting a timeout can reduce the response to transient failures, but its success depends greatly on its relationship to the downtime distribution and can, in some instances, reduce durability as well


Four Replication Algorithms

• Cates
  • Fixed number of replicas rL
  • Timeout

• Total Recall
  • Batch: in addition to rL replicas, make e additional copies so that repair is needed less frequently

• Carbonite
  • Timeout + reintegration

• Oracle
  • Hypothetical system that can differentiate transient failures from permanent failures
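A minimal sketch of a Carbonite-style maintenance pass as described above (a timeout before reacting to a failure, plus reintegration of replicas that come back); the data structures and function names are illustrative, not the paper's actual implementation.

```python
import time

class ObjectState:
    """Remembers every node that has ever held a replica of one object, so
    copies on nodes that return after a transient failure are reintegrated."""
    def __init__(self, replica_nodes):
        self.replica_nodes = set(replica_nodes)

def maintain(obj, r_l, is_reachable, last_seen, timeout_s, make_new_replica):
    """One maintenance pass for one object.
    is_reachable(node) -> bool: result of the latest probe
    last_seen[node]: timestamp of the last successful probe
    timeout_s: how long a node may be down before we react to its failure
    make_new_replica(obj) -> node: places a copy on a fresh node
    """
    now = time.time()
    reachable = {n for n in obj.replica_nodes if is_reachable(n)}
    # Nodes that are down but still within the timeout are counted anyway,
    # on the assumption that their failure may be transient.
    pending = {n for n in obj.replica_nodes
               if n not in reachable and now - last_seen.get(n, 0.0) <= timeout_s}
    counted = reachable | pending
    # Repair only when fewer than r_l copies are counted. Nothing is ever
    # removed from obj.replica_nodes, so a replica on a node that later
    # returns is simply counted again (reintegration).
    while len(counted) < r_l:
        new_node = make_new_replica(obj)
        obj.replica_nodes.add(new_node)
        counted.add(new_node)
    return counted
```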


Comparison


Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient costs
• Implementation
• Conclusion


Challenges like node monitoring

Node Monitoring for Failure Detection

• Carbonite requires that each node know the number of available replicas of each object for which it is responsible

• The goal of monitoring is to allow the nodes to track the number of available replicas

• Challenges:
  • Monitoring consistent hashing systems
  • Monitoring host availability


Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient costs
• Implementation
• Conclusion


Conclusion

• Described a set of techniques that allow wide-area systems to efficiently store and maintain large amounts of data

• Carbonite keeps all data durable and uses 44% more network traffic than a hypothetical system that only responds to permanent failures. In comparison, Total Recall and DHash require almost a factor of two more network traffic than this hypothetical system.


Questions or Comments?

Thanks!
