Page 1: MinCopysets:  Derandomizing  Replication in Cloud Storage

MinCopysets: Derandomizing Replication in Cloud Storage

Stanford University

Asaf Cidon, Ryan Stutsman, Stephen Rumble, Sachin Katti, John Ousterhout and Mendel Rosenblum

Unpublished – Please do not distribute

Page 2: MinCopysets:  Derandomizing  Replication in Cloud Storage

Overview

Assumptions: no geo-replication; Azure uses much smaller clusters in practice

Page 3: MinCopysets:  Derandomizing  Replication in Cloud Storage

RAMCloud

• Primary data stored on master (memory)
• Divide each master’s data into chunks
• Chunks are replicated on backups (disk)
  – When a master crashes, recover from thousands of backups

[Figure: masters, backups, and a crashed master]

Page 4: MinCopysets:  Derandomizing  Replication in Cloud Storage

Random Replication

[Figure: Chunks 1, 2 and 3 each have one primary and two secondary replicas, placed on randomly chosen nodes among Node 1 through Node 10]
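A minimal sketch of the random replication scheme pictured on this slide, assuming each chunk's replicas go to R distinct nodes chosen uniformly at random and that the first chosen node acts as the primary. The function name `random_replication` and the fixed seed are illustrative and not taken from any storage system's code.

```python
import random

def random_replication(num_nodes, num_chunks, r=3, seed=0):
    """Place each chunk on r distinct nodes chosen uniformly at random.

    Returns a dict: chunk id -> tuple of node ids (first entry = primary).
    """
    rng = random.Random(seed)
    nodes = range(1, num_nodes + 1)
    return {chunk: tuple(rng.sample(nodes, r))
            for chunk in range(1, num_chunks + 1)}

# Same shape as the slide: 10 nodes, 3 chunks, 3 replicas per chunk.
placement = random_replication(num_nodes=10, num_chunks=3)
for chunk, (primary, *secondaries) in placement.items():
    print(f"Chunk {chunk}: primary on Node {primary}, "
          f"secondaries on Nodes {secondaries}")
```

Because every chunk picks its nodes independently, the number of distinct node sets holding some chunk grows with the number of chunks.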

Page 5: MinCopysets:  Derandomizing  Replication in Cloud Storage

The Problem

• Randomized replication loses data in power outages
  – 0.5-1% of the nodes fail to reboot
  – 1-2 times a year
  – Result: a handful of chunks (GBs of data) are unavailable (LinkedIn ‘12)
• Sub-problem: managed power downs
  – Software upgrades
  – Reduced power consumption

Page 6: MinCopysets:  Derandomizing  Replication in Cloud Storage

Intuition

• If we have one chunk, we are safe:
  – Replicate the chunk on three nodes
  – Data is lost only if the failed nodes contain all three copies of the chunk
  – 1% of the nodes fail: 0.0001% probability of data loss
• If we have millions of chunks, we lose data:
  – A 1000 node HDFS cluster has 10 million chunks
  – 1% of the nodes fail: 99.93% probability of data loss

Page 7: MinCopysets:  Derandomizing  Replication in Cloud Storage

Mathematical Intuition

• A copyset of nodes is a single unit of failure
  – Each chunk is replicated on a single copyset
• For one chunk, the probability of data loss is: $\binom{F}{R} / \binom{N}{R}$
  – F = number of failed nodes
  – R = replication factor
  – N = number of nodes
• For all chunks, the probability is: $1 - \left(1 - \binom{F}{R} / \binom{N}{R}\right)^{B}$
  – B = number of chunks
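A minimal sketch of these two probabilities, assuming the F failed nodes form a uniformly random subset of the N nodes and each of the B chunks is placed independently on a random set of R nodes. Under that assumption a single chunk is lost with probability C(F, R) / C(N, R), and at least one of B chunks is lost with probability 1 - (1 - C(F, R) / C(N, R))^B. The function names are illustrative.

```python
from math import comb

def p_chunk_loss(n, f, r):
    """Probability that all r replicas of one chunk sit on failed nodes."""
    return comb(f, r) / comb(n, r)

def p_any_loss(n, f, r, b):
    """Probability that at least one of b independently placed chunks is lost."""
    return 1.0 - (1.0 - p_chunk_loss(n, f, r)) ** b

# The example from the Intuition slide: 1000 nodes, 1% (10 nodes) fail, R = 3.
one_chunk = p_chunk_loss(n=1000, f=10, r=3)
all_chunks = p_any_loss(n=1000, f=10, r=3, b=10_000_000)
print(f"one chunk:         {one_chunk:.7%}")
print(f"10 million chunks: {all_chunks:.2%}")
```

With these inputs the single-chunk value is on the order of the 0.0001% quoted earlier (which equals 0.01 cubed), and the 10 million chunk value rounds to 99.93%.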

Page 8: MinCopysets:  Derandomizing  Replication in Cloud Storage

Changing R Doesn’t Help


Page 9: MinCopysets:  Derandomizing  Replication in Cloud Storage

Changing the Chunk Size Doesn’t Help


Page 10: MinCopysets:  Derandomizing  Replication in Cloud Storage

MinCopysets: Decouple Load Balancing and Durability

• Split nodes into fixed replication groups
• Random Distribution: Place primary replica on random node
• Deterministic Replication: Place secondary replicas deterministically on same replication group as primary
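The three steps on this slide can be sketched as follows. This is a toy model under stated assumptions: groups are formed by shuffling the node list once and cutting it into disjoint groups of R, and leftover nodes are simply left ungrouped. The names `make_replication_groups` and `place_chunk` are hypothetical.

```python
import random

def make_replication_groups(node_ids, r=3, seed=0):
    """Partition nodes into fixed, disjoint replication groups of size r."""
    rng = random.Random(seed)
    shuffled = list(node_ids)
    rng.shuffle(shuffled)
    usable = len(shuffled) - len(shuffled) % r  # leftover nodes stay ungrouped
    groups = [tuple(shuffled[i:i + r]) for i in range(0, usable, r)]
    group_of = {node: group for group in groups for node in group}
    return groups, group_of

def place_chunk(group_of, eligible, rng):
    """Random primary (load balancing), deterministic secondaries (durability)."""
    primary = rng.choice(eligible)
    secondaries = [n for n in group_of[primary] if n != primary]
    return primary, secondaries

groups, group_of = make_replication_groups(range(1, 100))  # 99 nodes, 33 groups
eligible = sorted(group_of)
rng = random.Random(1)
copysets = set()
for _ in range(10_000):
    primary, secondaries = place_chunk(group_of, eligible, rng)
    copysets.add(frozenset([primary, *secondaries]))

print(f"{len(groups)} groups, {len(copysets)} distinct copysets in use")
```

However many chunks are written, the number of distinct copysets can never exceed the number of replication groups, which is what decouples durability from load balancing.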

Page 11: MinCopysets:  Derandomizing  Replication in Cloud Storage

MinCopysets Architecture

[Figure: nodes partitioned into three replication groups of three nodes each (Nodes 55, 7 and 24; Nodes 2, 83 and 8; Nodes 1, 22 and 47). Chunk 1 has its primary on Node 7 and secondaries on Nodes 55 and 24; Chunk 3 has its primary on Node 55 and secondaries on Nodes 7 and 24. Chunk 2 and Chunk 4 are each stored, as one primary and two secondaries, within one of the other two groups.]

Page 15: MinCopysets:  Derandomizing  Replication in Cloud Storage

Extreme Failure Scenarios

• In the extreme scenario where 3-4% of the cluster’s nodes fail to reboot, MinCopysets provides low data loss probabilities
• For example:
  – 4000 node HDFS cluster
  – 120 nodes fail to reboot after a power outage
  – Only 3.5% probability of data loss
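A rough check of this example, as a sketch under assumptions the slide does not spell out: R = 3, the cluster is split into N / R disjoint replication groups, the failed nodes are a uniformly random subset, and groups are treated as independent (an approximation). The function name is illustrative.

```python
from math import comb

def p_loss_mincopysets(n, f, r=3):
    """Approximate probability that some replication group fails entirely."""
    p_group = comb(f, r) / comb(n, r)  # one given group is fully inside the failed set
    groups = n // r                    # number of copysets under MinCopysets
    return 1.0 - (1.0 - p_group) ** groups

p = p_loss_mincopysets(n=4000, f=120)
print(f"probability of data loss: {p:.1%}")
```

Under these assumptions the result is close to the 3.5% quoted above.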

Page 16: MinCopysets:  Derandomizing  Replication in Cloud Storage

Extreme Failure Scenarios: Normal Clusters


Page 17: MinCopysets:  Derandomizing  Replication in Cloud Storage

Extreme Failure Scenarios: Big Clusters


Page 18: MinCopysets:  Derandomizing  Replication in Cloud Storage

MinCopysets’ Trade Off

• Trades off frequency and magnitude of failures
  – Expected data loss is the same
  – Data loss occurs very rarely
  – The magnitude of data loss is greater

Page 19: MinCopysets:  Derandomizing  Replication in Cloud Storage

Frequency vs. Magnitude of Failures

• Setup:
  – 5000 node HDFS cluster
  – 3 TB per machine
  – R = 3
  – Power outage once a year
• Random replication
  – Lose 5.5 GB every single year
• MinCopysets
  – Lose data once every 625 years
  – Lose an entire node in case of failure
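A back-of-the-envelope sketch of this comparison. The fraction of nodes that fail per outage is not given in the setup, so the code ASSUMES 1% (the upper end of the 0.5-1% range mentioned earlier in the talk); it also assumes one outage per year, uniformly random failures, and independent replication groups. All variable names are illustrative.

```python
from math import comb

N, R = 5000, 3
TB = 10**12
PER_NODE_BYTES = 3 * TB
F = N // 100  # ASSUMPTION: 1% of the nodes fail to reboot in each outage

# Probability that one specific set of R nodes is entirely among the failed nodes.
p_copyset = comb(F, R) / comb(N, R)

# Random replication with many chunks: every chunk is lost with probability
# p_copyset, so a roughly constant amount of data is lost in every outage.
unique_bytes = N * PER_NODE_BYTES / R
random_loss_per_outage = unique_bytes * p_copyset

# MinCopysets: data is lost only if one of the N // R groups fails entirely,
# and then one node's worth of unique data (3 TB) is lost at once.
groups = N // R
p_outage_loses_data = 1 - (1 - p_copyset) ** groups
years_between_losses = 1 / p_outage_loses_data
mincopysets_expected_loss = p_outage_loses_data * PER_NODE_BYTES

print(f"random replication: ~{random_loss_per_outage / 1e9:.1f} GB per outage")
print(f"MinCopysets: data loss about once every {years_between_losses:.0f} years")
print(f"MinCopysets expected loss: ~{mincopysets_expected_loss / 1e9:.1f} GB per outage")
```

With the assumed 1% failure rate the numbers land in the same range as the 5.5 GB and 625 years on the slide rather than matching them exactly, and the two expected losses come out nearly equal, which illustrates the claim that expected data loss is the same while frequency and magnitude differ.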

Page 20: MinCopysets:  Derandomizing  Replication in Cloud Storage

RAMCloud Implementation

• RAMCloud implementation was relatively straightforward
• Two non-trivial issues:
  1. Need to manage groups of nodes
     • Allocate chunks on entire groups
     • Manage nodes joining and leaving groups
  2. Machine failures are more complex
     • Need to re-replicate entire group, rather than individual nodes

Page 21: MinCopysets:  Derandomizing  Replication in Cloud Storage

RAMCloud Implementation

[Figure: RAMCloud Coordinator, RAMCloud Master and RAMCloud Backup, connected by three messages: "Request: Assign Replication Group RPC", "Request: Open New Chunk RPC" and "Reply: Replication Group"]

Coordinator Server List

Server ID    Replication Group ID
Server 0     5
Server 1     0
Server 2     5
Server 3     7
…            …
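A toy model of the bookkeeping behind such a server list, assuming the coordinator forms a new group whenever R unassigned servers are available and dissolves a group when one of its members leaves. The class `CoordinatorServerList` and its methods are hypothetical and are not RAMCloud's actual API; re-replication of the dissolved group's data is out of scope here.

```python
class CoordinatorServerList:
    """Maps server id -> replication group id (None while unassigned)."""

    def __init__(self, r=3):
        self.r = r
        self.group_of = {}
        self.next_group_id = 0

    def add_server(self, server_id):
        self.group_of[server_id] = None
        self._form_groups()

    def remove_server(self, server_id):
        """Drop a server and dissolve its group so the survivors can regroup."""
        group_id = self.group_of.pop(server_id)
        if group_id is not None:
            for sid, gid in self.group_of.items():
                if gid == group_id:
                    self.group_of[sid] = None
        self._form_groups()

    def _form_groups(self):
        unassigned = [s for s, g in self.group_of.items() if g is None]
        while len(unassigned) >= self.r:
            members, unassigned = unassigned[:self.r], unassigned[self.r:]
            for sid in members:
                # Stands in for sending an "Assign Replication Group" RPC.
                self.group_of[sid] = self.next_group_id
            self.next_group_id += 1

    def members(self, server_id):
        """Servers sharing a group with server_id (empty if unassigned)."""
        gid = self.group_of[server_id]
        if gid is None:
            return []
        return [s for s, g in self.group_of.items() if g == gid]

servers = CoordinatorServerList()
for sid in range(7):
    servers.add_server(sid)    # groups {0, 1, 2} and {3, 4, 5}; server 6 waits
servers.remove_server(1)       # group of server 1 dissolves; 0, 2 and 6 regroup
print(servers.group_of)
```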

Page 22: MinCopysets:  Derandomizing  Replication in Cloud Storage

HDFS Implementation

• Even simpler than RAMCloud
• In HDFS, replication decisions are centralized on the NameNode; in RAMCloud they are distributed
  – NameNode assigns DataNodes to replication groups
• Prototyped in 200 LoC

Page 23: MinCopysets:  Derandomizing  Replication in Cloud Storage

HDFS Issues

• Has the same issues as RAMCloud in managing groups of nodes
• Issue: Repair bandwidth
  – Solution: Hybrid scheme
• Issue: Network bottlenecks and load balancing
  – Solution: Kill replication group, re-replicate its data elsewhere
• Issue: Replication group’s capacity is limited by the node with the smallest capacity
  – Solution: Choose replication groups with similar capacities
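One way to picture the last issue and its solution is to sort nodes by capacity and group neighbours, so that no group is dragged down by a much smaller member. This is a sketch of that idea only; the function names, node names and capacities are made up for illustration and do not come from the HDFS prototype.

```python
def group_by_capacity(capacities, r=3):
    """Group nodes of similar capacity by sorting and cutting into runs of r."""
    ordered = sorted(capacities, key=capacities.get, reverse=True)
    usable = len(ordered) - len(ordered) % r  # leftover nodes stay ungrouped
    return [ordered[i:i + r] for i in range(0, usable, r)]

def group_capacity(group, capacities):
    """A group can hold only as much as its smallest member."""
    return min(capacities[node] for node in group)

# Hypothetical DataNode capacities in TB.
capacities = {"dn1": 4, "dn2": 1, "dn3": 4, "dn4": 1, "dn5": 4, "dn6": 1}

naive = [["dn1", "dn2", "dn3"], ["dn4", "dn5", "dn6"]]
similar = group_by_capacity(capacities)

naive_total = sum(group_capacity(g, capacities) for g in naive)
similar_total = sum(group_capacity(g, capacities) for g in similar)
print(f"naive grouping: {naive_total} TB usable")
print(f"capacity-aware grouping: {similar_total} TB usable")
```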

Page 24: MinCopysets:  Derandomizing  Replication in Cloud Storage

Facebook’s HDFS Replication

• Facebook constrains the placement of secondary replicas to a group of 10 nodes to prevent data loss
• Facebook’s algorithm:
  – The primary replica is placed on node j of rack k
  – Secondary replicas are placed on randomly selected nodes among (j+1, …, j+5), on racks (k+1, k+2)
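A sketch of the placement rule as described on this slide, not Facebook's actual code. It assumes nodes are addressed as (rack, node index) pairs, that indices wrap around at the end of a rack and at the last rack, and that the two secondaries may be any two distinct nodes in the window; none of these details are stated on the slide. The function name is hypothetical.

```python
import random

def facebook_style_placement(j, k, nodes_per_rack, num_racks, rng, window=5):
    """Return (primary, secondaries), each node given as (rack, node index)."""
    # 2 racks x 5 nodes = the group of 10 candidate nodes for secondaries.
    candidates = [((k + dk) % num_racks, (j + dj) % nodes_per_rack)
                  for dk in (1, 2)
                  for dj in range(1, window + 1)]
    secondaries = rng.sample(candidates, 2)
    return (k, j), secondaries

rng = random.Random(0)
primary, secondaries = facebook_style_placement(
    j=7, k=3, nodes_per_rack=20, num_racks=10, rng=rng)
print("primary:", primary)
print("secondaries:", secondaries)
```

The sketch needs more than five nodes per rack and at least three racks so that the ten candidates are distinct and never include the primary.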

Page 25: MinCopysets:  Derandomizing  Replication in Cloud Storage

Facebook’s Replication


Page 26: MinCopysets:  Derandomizing  Replication in Cloud Storage

Hybrid MinCopysets

• Split nodes into replication groups of 2 and 15
• First and second replica are always placed on the group of 2
• Third replica is randomly placed on the group of 15

Page 28: MinCopysets:  Derandomizing  Replication in Cloud Storage

Thank You!

Stanford University