Upload
hoangquynh
View
227
Download
1
Embed Size (px)
Citation preview
Giza: Erasure Coding Objects across Global Data Centers
Yu Lin Chen*ꭞ, Shuai Mu ꭞ, Jinyang Li ꭞ, Cheng Huang *, Jin li *, Aaron Ogus*, and Douglas Phillips*
ꭞ New York University, *Microsoft Corporation
USENIX ATC, Santa Clara, CA, July 2017
1
When is cross DC coding a win?
1. Inexpensive cross DC communication.
2. Coding friendly workload
a) Stores a lot of data in large objects (less coding overhead)
b) Infrequent reconstruction due to read or failure.
6
OneDrive is occupied mostly by large objects
Object > 1GB
occupies 53%
of total storage
capacity.
10
OneDrive is occupied mostly by large objects
Objects < 4MB
occupies only
0.9% of total
storage capacity
11
OneDrive reads are infrequent after a month
47% of the total
bytes read occur
within the first day
of the object creation
13
OneDrive reads are infrequent after a month
less than 2% of the
total bytes read occur
after a month
14
Outline
1. The case for erasure coding across data centers.
2. Giza design and challenges
3. Experimental Results
15
Giza: a strongly consistent versioned object store
• Erasure code each object individually.
• Optimize latency for Put and Get.
• Leverage existing single-DC cloud storage API.
16
Table Blob
Giza node Giza nodeGiza node
DC1
DC2
DC3
Giza’s architecture
17
Giza node
Table Blob
Giza node Giza nodeGiza node
Giza’s workflowClient
Put(Blob1, Data)
Table Blob
Giza node Giza nodeGiza node
DC1
DC2
DC318
Table Blob
Giza node Giza nodeGiza nodeGiza node
Table Blob
Giza node Giza nodeGiza node
DC1
DC2
DC3
Giza’s workflowClient
Put(Blob1, Data)
Put(GUID1, Fragment A)
19
Table Blob
Giza node Giza nodeGiza node
Table Blob
Giza node Giza nodeGiza nodeGiza node
Table Blob
Giza node Giza nodeGiza node
DC1
DC2
DC3
Giza’s workflowClient
Put(Blob1, Data)
{DC1: GUID1, DC2: GUID2, DC3: GUID3}
20
Table Blob
Giza node Giza nodeGiza node
Table Blob
Giza node Giza nodeGiza nodeGiza node
Table Blob
Giza node Giza nodeGiza node
Giza metadata versioning
Key Metadata (version #: actual metadata)
Blob1 v1: md1
Let mdthis = {DC1:GUID4, DC2:GUID5, DC3:GUID6}
Key Metadata (version #: actual metadata)
Blob1 v1: md1, v2: mdthis
21
DC1
DC2
DC3
Inconsistency during concurrent put operationsClient
Put(Blob1, Data1)
Client
Put(Blob1, Data2)
22
Table Blob
Giza node Giza nodeGiza node
Table Blob
Giza node Giza nodeGiza nodeGiza node
Table Blob
Giza node Giza nodeGiza node
Inconsistency during concurrent put operations
Key Metadata
Blob1 v1: md1
DC1
Key Metadata
Blob1 v1: md2
DC2
Key Metadata
Blob1 v1: md1
DC3
md1 md2 md1
Order of arrival
md2 md2md1
WARNING: INCONSISTENT STATE
23
Inconsistency during concurrent put operations
Key Metadata
Blob1 v1: md1
DC1
Key Metadata
Blob1 v1: md2
DC2
Key Metadata
Blob1 v1: md1
DC3
md1 md2 md1
Order of arrival
md2 md2md1
WARNING: INCONSISTENT STATE
24
FAILURE!
Choosing version value via consensus
Key Metadata
Blob1 v1: md1
v2: md2
DC1
Key Metadata
Blob1 v1: md1
v2: md2
DC2
Key Metadata
Blob1 v1: md1
v2: md2
DC3
OR
Key Metadata
Blob1 v1: md2
v2: md1
DC1
Key Metadata
Blob1 v1: md2
v2: md1
DC2
Key Metadata
Blob1 v1: md2
v2: md1
DC3 25
Implementing Fast Paxos with Azure table API
Let mdthis = ({DC1:GUID1, DC2:GUID2, DC3:GUID3}, Paxos states)
DC1
DC2
DC328
Table Blob
Giza node Giza nodeGiza node
Table Blob
Giza node Giza nodeGiza nodeGiza node
Table Blob
Giza node Giza nodeGiza node
Implementing Fast Paxos with Azure table API
Let mdthis = ({DC1:GUID1, DC2:GUID2, DC3:GUID3}, Paxos states)
DC1
DC2
DC329
Table Blob
Giza node Giza nodeGiza node
Table Blob
Giza node Giza nodeGiza nodeGiza node
Table Blob
Giza node Giza nodeGiza node
Giza node
Implementing Fast Paxos with Azure table API
Let mdthis = ({DC1:GUID1, DC2:GUID2, DC3:GUID3}, Paxos states)
Remote DC
Read metadata rowand decide to accept or rejectproposal (v2 = md
this)
30
Etag = 1
Table Blob
Giza node Giza nodeGiza node
Table
Giza node Giza nodeGiza nodeGiza node
Table Blob
Giza nodeGiza node
Implementing Fast Paxos with Azure table API
Let mdthis = ({DC1:GUID1, DC2:GUID2, DC3:GUID3}, Paxos states)
Remote DC
Etag = 2
Etag = 1
31
Fail toupdate tablesince etaghas changed
Giza node
Table Blob
Giza node Giza nodeGiza node
Table
Giza node Giza nodeGiza nodeGiza node
Table Blob
Giza nodeGiza node
Write metadata rowback to the table if proposal accepted
Implementing Fast Paxos with Azure table API
Let mdthis = ({DC1:GUID1, DC2:GUID2, DC3:GUID3}, Paxos states)
Remote DC
Etag = 1
Etag = 1
32
Write metadata rowback to the table if proposal accepted
Successfullyupdate tablesince etaghasn’t changed
Giza node
Table Blob
Giza node Giza nodeGiza node
Table
Giza node Giza nodeGiza nodeGiza node
Table Blob
Giza nodeGiza node
Parallelizing metadata and data path
DC1
DC2
DC333
Table Blob
Giza node Giza nodeGiza node
Table Blob
Giza node Giza nodeGiza nodeGiza node
Table Blob
Giza node Giza nodeGiza node
Data path fails, metadata path succeeds
• During read, Giza will revert back to the latest version where the data is available.
Ongoing, data path failed
Key Metadata (version #: actual metadata)
Blob1 v1: md1, v2: md2
34
Giza Put
• Execute metadata and data path in parallel.
• Upon success, return acknowledgement to the client.
• In the background, update highest committed version number in the metadata row.
Ongoing, not committed
Key Committed Metadata (version #: actual metadata)
Blob1 1 v1: md1, v2: md2
35
Giza Get
• Optimistically read the latest committed version locally.
• Execute both data path and metadata path in parallel:
• Data path retrieves the data fragments indicated by the local latest committed version.
• Metadata path will retrieve the metadata row from all data centers and validate the latest committed version.
36
Outline
1. The case for erasure coding across data centers.
2. Giza design and challenges
3. Experimental Results
37