Near-Optimal Latency Versus Cost Tradeoffs in Geo-Distributed Storage
Muhammed Uluyol, Anthony Huang, Ayush Goel, Mosharaf Chowdhury, Harsha V. Madhyastha
University of Michigan
Distribute Web Servers for Interactive Latency
Distribute Data for Availability
Distribute Data for Availability and Latency
[Figures: placement of web servers and data sites]
Linearizability Imposes Unavoidable Trade-offs
● Read vs Write Latency
● Read Latency vs Cost
[Figure: read and write quorums; overlap ensures consistency]
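One common way to guarantee that every read overlaps every write is to size the quorums so that they cannot miss each other. The sketch below assumes simple size-based quorums over n sites (the function names are illustrative, not from the talk) and checks by brute force that the counting rule r + w > n is exactly the condition under which overlap is guaranteed.

```python
from itertools import combinations

def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """Exhaustively check that every r-site read quorum shares
    at least one site with every w-site write quorum."""
    sites = range(n)
    return all(
        set(read_q) & set(write_q)
        for read_q in combinations(sites, r)
        for write_q in combinations(sites, w)
    )

def overlap_by_counting(n: int, r: int, w: int) -> bool:
    """Pigeonhole shortcut: r + w > n forces a shared site."""
    return r + w > n
```

Under this rule, shrinking the read quorum (for lower read latency) forces a larger write quorum, which is one way to see the read versus write latency trade-off.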
How Do Existing Approaches Perform?
[Figure: lowest possible read latency as a function of write latency budget and storage overhead budget]
● EPaxos: state-of-the-art geo-replication protocol
● Compare with estimate of theoretical lower bound
○ No particular protocol
○ Respects consistency and fault-tolerance constraints
[Figure: EPaxos versus the estimated lower bound; annotations: 2⨉ storage for same latency, 2⨉ read latency]
Core problem: replication. Each site stores a full copy.
Lowering Cost with Erasure Coding
● Each site stores 1/kth of the data
● RS-Paxos: Paxos on erasure-coded data
[Figure: utility of RS-Paxos; annotations: 1.5⨉ write latency for same read latency, 1.5⨉ latency at same storage]
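The storage saving from having each site store 1/kth of the data can be worked out directly. The sketch below assumes n data sites and compares full replication (k = 1) with erasure coding; the function name and the example values of n and k are illustrative only.

```python
from fractions import Fraction

def storage_overhead(num_sites: int, k: int = 1) -> Fraction:
    """Bytes stored in total per byte of user data when each of
    num_sites sites stores 1/k of the object. k = 1 is replication."""
    if k < 1 or num_sites < k:
        raise ValueError("need 1 <= k <= num_sites")
    return Fraction(num_sites, k)

replicated = storage_overhead(6)        # every site holds a full copy
erasure_coded = storage_overhead(6, 3)  # every site holds one third
```

With these example numbers, six full replicas cost 6 bytes per byte of data, while the same six sites with k = 3 cost 2.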
RS-Paxos Limitations
● Two-round writes
● k-site intersection between quorums
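To see why a k-site intersection requirement hurts latency, it helps to count. Two quorums of sizes r and w drawn from n sites are guaranteed to share at least r + w - n sites. The sketch below assumes equal-sized quorums and finds the smallest size that guarantees a k-site overlap; the helper names are illustrative and this is not RS-Paxos's actual configuration logic.

```python
def guaranteed_overlap(n: int, r: int, w: int) -> int:
    """Minimum number of sites shared by any r-site and w-site quorum."""
    return max(0, r + w - n)

def smallest_equal_quorum(n: int, k: int) -> int:
    """Smallest quorum size q such that any two q-site quorums
    out of n sites share at least k sites."""
    for q in range(1, n + 1):
        if guaranteed_overlap(n, q, q) >= k:
            return q
    raise ValueError("no quorum size gives a k-site overlap")
```

For example, with 5 sites a 1-site overlap needs quorums of 3, while a 3-site overlap needs quorums of 4, so each operation must wait for a farther site.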
Recap of the Problem
● Want to spread data across DCs, but consistency and fault-tolerance constraints impose trade-offs
● State-of-the-art falls short of the optimal
● Use erasure coding → hurts latency
Pando: Near-Optimal Trade-off
● Two-round writes → approximates latency of one-round writes
● k-site intersection between quorums → 1-site intersection (common case)
Paxos Review
● 2-Phase writes: first become leader, then write
[Figure: message timeline: "I am leader" (30 ms), Ack. (30 ms), "Write data" (30 ms), Ack. (30 ms)]
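The cost of the two phases can be tallied from the delays on the slide. The sketch below assumes each 30 ms label is a one-way message delay and that the four messages happen strictly one after another; both are readings of the figure, not statements from the talk.

```python
def sequential_latency_ms(one_way_delays_ms):
    """Total latency when messages are sent strictly one after another."""
    return sum(one_way_delays_ms)

phase1 = [30, 30]  # "I am leader", then Ack.
phase2 = [30, 30]  # "Write data", then Ack.

two_phase_write = sequential_latency_ms(phase1 + phase2)
one_round_write = sequential_latency_ms(phase2)
```

Under these assumptions the two-phase write takes 120 ms, twice the 60 ms a write would take if Phase 2 were the only round.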
One-round write protocol
Quickly Executing 2-Phase Writes
● Step 1: faster Phase 1
○ Flexible Paxos [OPODIS’16]: only Phase 1 and Phase 2 quorums need to intersect
○ Phase 1 quorums need not overlap with each other
[Figure: message timeline: "I am leader" (10 ms), Ack. (10 ms), "Write data" (30 ms), Ack. (30 ms)]
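The Flexible Paxos condition can be checked mechanically for a small configuration. The sketch below uses four hypothetical sites with size-based quorums (any 2 sites for Phase 1, any 3 for Phase 2); the site names, sizes, and helper names are assumptions chosen for illustration.

```python
from itertools import combinations, product

SITES = ["A", "B", "C", "D"]  # hypothetical site names

def quorums_of_size(sites, size):
    """All quorums made of exactly `size` sites."""
    return [frozenset(c) for c in combinations(sites, size)]

def phases_intersect(phase1_quorums, phase2_quorums):
    """True if every Phase 1 quorum shares a site with every Phase 2 quorum."""
    return all(q1 & q2 for q1, q2 in product(phase1_quorums, phase2_quorums))

small_phase1 = quorums_of_size(SITES, 2)
large_phase2 = quorums_of_size(SITES, 3)

valid = phases_intersect(small_phase1, large_phase2)
# Two Phase 1 quorums may be disjoint, e.g. {A, B} and {C, D}.
disjoint_phase1_exists = any(
    not (a & b) for a, b in combinations(small_phase1, 2)
)
```

Here `valid` holds even though some Phase 1 quorums are disjoint from each other, which is what lets Phase 1 use a small set of nearby sites.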
● Step 2: overlap latency cost of Phase 1 with Phase 2
○ RPC Chains [NSDI’09]: start Phase 2 at a delegate
[Figure: message timeline via a delegate; messages: "I am leader", Ack., "Write data", Ack.; delays shown: 10 ms, 10 ms, 15 ms, 30 ms]
Write to All, Wait for Quorum
[Figure: Phase 2 with k=2; the write is sent to all sites, which store v2]
Achieving 1-Site Intersection
[Figure: Phase 2 with k=2; panels "Rare Case" and "Common Case" show reads against sites holding v2, v2, v1, v2 and v2, v2, v2, v2; annotation: "Maybe a write finished"]
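The "Maybe a write finished" annotation can be illustrated with a toy reader. With erasure coding, a value can only be reconstructed from k fragments of the same version, so a reader that sees fewer than k fragments of the newest version learns that a write may exist but cannot decode it. The sketch below is an assumption-laden illustration of that distinction only; it does not model what Pando does next in the ambiguous case, which is covered in the paper.

```python
from collections import Counter

def classify_read(versions_seen, k):
    """versions_seen: version number of the fragment returned by each
    contacted site. Returns ("decode", v) if the newest version v has
    at least k fragments, else ("ambiguous", v)."""
    counts = Counter(versions_seen)
    newest = max(counts)
    if counts[newest] >= k:
        return ("decode", newest)
    return ("ambiguous", newest)

common_case = classify_read([2, 2], k=2)  # both contacted sites hold v2
rare_case = classify_read([1, 2], k=2)    # only one fragment of v2 seen
```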
Pando: Near-Optimal Trade-off
✔ Two-round writes → approximates latency of one-round writes
✔ k-site intersection between quorums → 1-site intersection (common case)
See paper:
● Correctness
● Bounding latency under conflicts
Evaluation: Proximity to Lower Bound
● Access set: DCs hosting web servers reading/writing data
● MIP solver selects data sites to minimize latency
● 500 access sets
[Figure: sample access set; measure gap]
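The evaluation relies on a MIP solver; as a much simpler stand-in, the sketch below brute-forces the choice of data sites for one access set. Everything here is an assumption for illustration: the objective (minimize the worst time any access site needs to hear from its q nearest data sites), the site names, and the made-up round-trip times. It is not the formulation used in the talk.

```python
from itertools import combinations

def quorum_latency(access_site, data_sites, q, rtt):
    """Time for access_site to hear back from its q nearest data sites."""
    return sorted(rtt[access_site][d] for d in data_sites)[q - 1]

def best_placement(access_set, candidates, num_sites, q, rtt):
    """Try every placement of num_sites data sites; keep the one whose
    worst-case quorum latency over the access set is lowest."""
    def worst_case(data_sites):
        return max(quorum_latency(a, data_sites, q, rtt) for a in access_set)
    return min(combinations(candidates, num_sites), key=worst_case)

# Made-up round-trip times in ms between hypothetical sites.
RTT = {
    "us": {"us": 0, "eu": 80, "asia": 150},
    "eu": {"us": 80, "eu": 0, "asia": 120},
}
chosen = best_placement(["us", "eu"], ["us", "eu", "asia"], 2, 2, RTT)
```

Exhaustive search only works for a handful of candidate sites, which is one reason a solver is the practical choice at scale.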
Pando is Close to the Lower Bound
[Figure: gap to the lower bound; annotations: 8⨉ higher, 11⨉ higher, 13⨉ higher]
Potential for true 1-round writes
● Cloud deployment confirms solver latency estimates
● Up to 46% cost ($) savings
Conclusion
● Pando: linearizability across geo-distributed DCs
● Achieves a near-optimal read–write–storage trade-off
○ Allow for erasure-coded data to minimize cost
○ Rethink how to use Paxos in the wide-area setting
Backup Slides
Deployment Latency
Latency Under Conflicts
Contributions of Each Technique
Throughput
Read Latency After Failure