HQ Replication: Efficient Quorum Agreement for Reliable Distributed Systems
James Cowling¹, Daniel Myers¹, Barbara Liskov¹, Rodrigo Rodrigues², Liuba Shrira³
¹ MIT CSAIL
² INESC-ID and Instituto Superior Técnico
³ Brandeis University
Byzantine Fault Tolerance
› Reliable client-server distributed systems
  » Server replicated across group of replica machines
› General operations
› Bounded number f of Byzantine replicas
› Must ensure correct system state
  » Consistent ordering of client operations
State of the Art
› Approaches:
  » State Machine Replication – BFT
      3f+1 replicas
  » Byzantine Quorums – Q/U
      5f+1 replicas
      Increased performance
      Degradation when writes contend
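To make the replica counts above concrete, here is a minimal Python sketch that evaluates 3f+1 and 5f+1 for a few values of f. The function and table names are illustrative and not part of the talk.

```python
# Group sizes quoted on the slide: BFT needs 3f+1 replicas, Q/U needs 5f+1.
REPLICAS_PER_FAULT = {"BFT": 3, "Q/U": 5}


def replicas_needed(protocol: str, f: int) -> int:
    """Smallest replica group that tolerates f Byzantine replicas."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return REPLICAS_PER_FAULT[protocol] * f + 1


if __name__ == "__main__":
    for f in (1, 2, 5):
        bft = replicas_needed("BFT", f)
        qu = replicas_needed("Q/U", f)
        print(f"f={f}: BFT needs {bft} replicas, Q/U needs {qu}")
```

For f=1 this gives 4 replicas for BFT and 6 for Q/U, which matches the replica counts drawn in the two protocol diagrams.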
Contributions
› Low-overhead Byzantine Fault Tolerance
  » Performance of Byzantine Quorums without 5f+1 replicas or contention degradation
› Hybrid Quorum scheme for Byzantine Fault Tolerance
  » Quorum approach in the normal case
  » Use Byzantine agreement to resolve write contention
Outline
› Current Approaches
› HQ Replication
› BFT Improvements
› Performance Evaluation
› Conclusions
State Machine Replication
› BFT - Castro and Liskov, TOCS ’02
  » Operations ordered by primary
  » Agreed upon by replicas
[Diagram: message flow among Client, Primary, and Replicas 2-4 through the phases Request, Pre-Prepare, Prepare, Commit, Reply]
Byzantine Quorums
› Q/U - Abd-El-Malek et al., SOSP ’05
› Client-controlled protocol
  » Replicas order operations independently
› Optimistic
  » Best case: one-phase protocol
  » Worst case: unbounded
      Randomized backoff
[Diagram: message flow between Client and Replicas 1-6 through the phases Update, Reply]
Advantages/Disadvantages
BFT
› Good
  » 3f+1 replicas
  » Bounded number of phases
› Bad
  » Higher latency
  » Quadratic communication
Q/U
› Good
  » Best-case performance
      One-phase write
      Low replica load
› Bad
  » 5f+1 replicas
  » Degraded performance when writes contend
HQ Replication
› 3f+1 replicas
› Supports general operations
› No all-to-all communication in the normal case
› BFT used to resolve contention
HQ Replication
› One-phase read
› Two-phase write
[Diagram: message flow between Client and Replicas 1-4 through the phases Write1, Write1 OK, Write2, Write2 OK]
High-level Write Protocol
› Two-phase write protocol
› Phase 1:
  » Client obtains timestamp grant from each replica
› Phase 2:
  » Client forms certificate from 2f+1 matching grants
  » Sends to replicas to complete write
Grants
› Promise to execute operation at given sequence number
  » Assuming agreement from quorum
› Grant
  » Client ID
  » Object ID
  » Hash over requested operation
  » Sequence Number (timestamp)
  » Replica signature
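The grant fields listed above can be sketched as a small Python record. This is an illustration under stated assumptions, not the authors' implementation: the `replica_id` field, the helper names, and the byte layout are hypothetical, and an HMAC over a per-replica key stands in for a real replica signature.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    client_id: int
    object_id: int
    op_hash: str      # hex digest of the requested operation
    seq_no: int       # sequence number (timestamp)
    replica_id: int   # hypothetical field: which replica issued the grant
    signature: bytes  # stand-in for the replica signature (HMAC, see lead-in)


def _signed_bytes(client_id, object_id, op_hash, seq_no, replica_id) -> bytes:
    # Hypothetical encoding of the fields covered by the signature.
    return f"{client_id}|{object_id}|{op_hash}|{seq_no}|{replica_id}".encode()


def make_grant(replica_id: int, replica_key: bytes, client_id: int,
               object_id: int, operation: bytes, seq_no: int) -> Grant:
    """Build a grant promising to execute `operation` at `seq_no`."""
    op_hash = hashlib.sha256(operation).hexdigest()
    body = _signed_bytes(client_id, object_id, op_hash, seq_no, replica_id)
    signature = hmac.new(replica_key, body, hashlib.sha256).digest()
    return Grant(client_id, object_id, op_hash, seq_no, replica_id, signature)


def verify_grant(grant: Grant, replica_key: bytes) -> bool:
    """Recompute the tag over the grant fields and compare in constant time."""
    body = _signed_bytes(grant.client_id, grant.object_id, grant.op_hash,
                         grant.seq_no, grant.replica_id)
    expected = hmac.new(replica_key, body, hashlib.sha256).digest()
    return hmac.compare_digest(grant.signature, expected)
```

Carrying a hash of the operation instead of the operation itself keeps the grant small regardless of payload size.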
Certificates
› Certificate
  » Quorum (2f+1) matching grants
› Proves quorum of replicas agree to ordering of operation
  » Uniquely identify client, operation and sequential ordering
  » Existence of certificate precludes existence of conflicting certificate
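A minimal sketch of how a client might assemble a certificate from 2f+1 matching grants. The `Grant` tuple and `form_certificate` are illustrative names. Signature checking is omitted, so the sketch assumes every grant has already been verified.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Grant(NamedTuple):
    replica_id: int
    client_id: int
    object_id: int
    op_hash: str
    seq_no: int


def form_certificate(grants: List[Grant], f: int) -> Optional[List[Grant]]:
    """Return 2f+1 matching grants from distinct replicas, or None."""
    # Count each replica at most once, keeping its first grant.
    by_replica: Dict[int, Grant] = {}
    for g in grants:
        by_replica.setdefault(g.replica_id, g)

    # Grants match when they agree on client, object, operation, and sequence number.
    groups: Dict[Tuple[int, int, str, int], List[Grant]] = {}
    for g in by_replica.values():
        key = (g.client_id, g.object_id, g.op_hash, g.seq_no)
        groups.setdefault(key, []).append(g)

    quorum = 2 * f + 1
    for matching in groups.values():
        if len(matching) >= quorum:
            return matching[:quorum]
    return None
```

With f=1, three matching grants from replicas 1, 2, and 3 yield a certificate, while two matching grants plus one conflicting grant do not.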
Replica State
› Multiple independent objects
› State per object
  » Certificate supporting most recent write
  » Operation status
      Active
        – Write in progress, outstanding grant
      Quiescent
        – No current write operation
Write Phase 1
› Client sends write request to replicas
  » If quiescent, replica assigns new grant to client
  » If active, replica sends currently outstanding grant
› Several possibilities
  » All grants match
  » Grants for different client
  » Grants conflict
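The phase 1 rules and the three possible outcomes can be sketched in Python as follows. This is a simplified model: grants are reduced to the `<client, seq no, operation>` triples shown in the slides, signatures and quorum-size checks are left to the caller, and the names `ObjectState` and `classify` are illustrative. The "stale grants" outcome is not one of the three cases listed above; it corresponds to the slow-replica case covered in the backup slides.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Grant = Tuple[int, int, str]  # (client ID, sequence number, operation)


@dataclass
class ObjectState:
    seq_no: int = 0                      # sequence number of the last completed write
    outstanding: Optional[Grant] = None  # None means quiescent, otherwise active

    def phase1(self, client_id: int, operation: str) -> Grant:
        """Quiescent: assign a new grant. Active: return the outstanding grant."""
        if self.outstanding is None:
            self.outstanding = (client_id, self.seq_no + 1, operation)
        return self.outstanding


def classify(grants: List[Grant], client_id: int, operation: str) -> str:
    """Decide the client's next step from a quorum of phase 1 replies."""
    if not grants:
        raise ValueError("no grants received")
    if len({seq for _, seq, _ in grants}) > 1:
        return "stale grants"   # some replicas are behind
    if len(set(grants)) > 1:
        return "resolve"        # grants conflict: request resolution
    granted_client, _, granted_op = grants[0]
    if granted_client != client_id or granted_op != operation:
        return "writeback"      # grants for a different client
    return "phase 2"            # all grants match this request
```

Replaying the slide scenarios: three quiescent replicas that all receive Write A from client 1 return `(1, 1, "A")`, so client 1 proceeds to phase 2. If client 2 then sends Write B, it receives the same outstanding grants and must perform a writeback.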
Isolated Write
[Animation: client 1 and replicas 1-3; each replica shows its State, Client, Seq No, Operation, and Grant. Frames in order:]
› Initially every replica is State: Quiescent, Client: ?, Seq No: 0, Operation: ?
› Client 1 sends Write A to replicas 1, 2, and 3
› Every replica becomes State: Active, Client: 1, Seq No: 1, Operation: A
› Replica i returns Grant <1,1,A>_i to client 1 (i = 1, 2, 3)
› Matching grants: Phase 2 write
› Client 1 sends Cert {G1,G2,G3} to every replica
› Every replica executes A
› Every replica becomes State: Quiescent, Client: 1, Seq No: 1, Operation: A, and returns Result A
› Client 1 has the result: Write Complete
Incomplete Write
[Animation: clients 1 and 2 and replicas 1-3; each replica shows its State, Client, Seq No, Operation, and Grant. Frames in order:]
› Initially every replica is State: Quiescent, Client: ?, Seq No: 0, Operation: ?
› Client 1 sends Write A to replicas 1, 2, and 3
› Every replica becomes State: Active, Client: 1, Seq No: 1, Operation: A
› Replica i returns Grant <1,1,A>_i to client 1 (i = 1, 2, 3)
› Client 1 slow or failed
› Client 2 sends Write B to replicas 1, 2, and 3
› Replicas active: Return current grant. Replica i returns Grant <1,1,A>_i to client 2
› Grants for different client: Perform Writeback
› Client 2 sends Cert {G1,G2,G3}, Write B to every replica
› Every replica executes A and becomes State: Quiescent, Client: 1, Seq No: 1, Operation: A
› Every replica then becomes State: Active, Client: 2, Seq No: 2, Operation: B, and replica i returns Grant <2,2,B>_i to client 2
› Matching grants: Phase 2 write
Write Contention
[Animation: clients 1 and 2 and replicas 1-3; each replica shows its State, Client, Seq No, Operation, and Grant. Frames in order:]
› Initially every replica is State: Quiescent, Client: ?, Seq No: 0, Operation: ?
› Client 1 sends Write A; replica 1 receives it and becomes State: Active, Client: 1, Seq No: 1, Operation: A
› Replica 2 receives Write A and becomes State: Active, Client: 1, Seq No: 1, Operation: A, while client 2 sends Write B
› Replica 3 receives Write B first and becomes State: Active, Client: 2, Seq No: 1, Operation: B
› Grants returned: Grant <1,1,A>_1, Grant <1,1,A>_2, Grant <2,1,B>_3
› Conflicting grants: Request resolution
› Resolve Request carrying Cert {G1,G2,G3} is sent to every replica
› Contention Resolution
› Every replica executes A, then executes B
› Every replica becomes State: Quiescent, Client: 2, Seq No: 2, Operation: B
› Every replica returns Result A, and client 1 obtains its result
› Every replica returns Result B, and client 2 obtains its result
Contention Resolution
› BFT module used to resolve contention
  » Establish sequential order on contending ops
› On receiving resolve request:
  » Freeze local object state
  » Send state to primary
› Primary runs BFT on combined state
› Replicas execute contending operations
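The last step, in which replicas execute the contending operations in one sequential order, can be sketched as follows. The BFT agreement itself is abstracted away: the sketch assumes every replica receives the same combined set of frozen grants. Ordering by client ID is an assumption chosen for illustration; the slides only require that the order be the same at every replica. `order_contending` is a hypothetical helper.

```python
from typing import Iterable, List, Optional, Tuple

Pending = Tuple[int, str]  # (client ID, operation) from a replica's frozen grant


def order_contending(frozen: Iterable[Optional[Pending]],
                     next_seq: int) -> List[Tuple[int, int, str]]:
    """Assign consecutive sequence numbers to the distinct contending operations.

    `frozen` holds one entry per replica (None for a quiescent replica).
    Returns (sequence number, client ID, operation) triples in execution order.
    """
    contending = {p for p in frozen if p is not None}
    # Assumed tie-break: sort by client ID, then by operation.
    ordered = sorted(contending)
    return [(next_seq + i, client, op) for i, (client, op) in enumerate(ordered)]
```

For the contention example in the slides, the frozen grants `(1, "A")`, `(1, "A")`, `(2, "B")` with `next_seq=1` give A at sequence number 1 and B at sequence number 2, matching the final replica state Seq No: 2, Operation: B.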
Additional Details
› Read protocol
› State transfer
› Multi-object transactions
› Performance enhancements
Performance Enhancements
› Preferred quorums
  » Core protocol run by only 2f+1 replicas
› Symmetric-key cryptography
  » Authenticators instead of signatures
      Collection of 3f+1 MACs <m_{i,1}, m_{i,2}, …, m_{i,n}>
  » Lower CPU overhead
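An authenticator, as described above, is a vector with one MAC per replica. The sketch below builds one with Python's standard `hmac` module. The pairwise session keys, the use of SHA-256, and the function names are assumptions for illustration.

```python
import hashlib
import hmac
from typing import Dict


def make_authenticator(message: bytes, keys: Dict[int, bytes]) -> Dict[int, bytes]:
    """Compute one MAC per replica, each under the key shared with that replica."""
    return {replica_id: hmac.new(key, message, hashlib.sha256).digest()
            for replica_id, key in keys.items()}


def check_entry(message: bytes, authenticator: Dict[int, bytes],
                replica_id: int, key: bytes) -> bool:
    """A replica verifies only its own entry of the authenticator."""
    tag = authenticator.get(replica_id)
    if tag is None:
        return False
    expected = hmac.new(key, message, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)


if __name__ == "__main__":
    f = 1
    # Hypothetical pairwise keys between one sender and each of the 3f+1 replicas.
    keys = {r: f"key-{r}".encode() for r in range(1, 3 * f + 2)}
    auth = make_authenticator(b"grant payload", keys)
    print(len(auth), check_entry(b"grant payload", auth, 2, keys[2]))
```

Each replica checks a single MAC, which is cheaper than verifying a public-key signature, at the cost of a message that grows with the number of replicas.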
BFT Improvements
› Preferred quorums
  » Reduces degree of quadratic communication
› Single MAC per message
  » Significant improvements over authenticators
Non-Contention Message Overhead
Messages sent/received at each replica per write request
Non-Contention Bandwidth Use
Total bandwidth at each replica per write request
Experimental Setup
› HQ and BFT prototypes deployed on Emulab
  » Up to 16 replicas (f=5), 200 clients (4 per machine)
› New BFT codebase
› Implement counter service
  » Negligible operation payload
  » Multiple objects
      Private non-contention objects
      Shared contention object
Non-contention Throughput
Maximum operation throughput
Resilience to Contention
Throughput degradation with increasing write-contention
BFT Batching
› BFT allows batching at primary
› Greatly reduces internal protocol communication
› Increased delay
[Diagram: message flow among Client, Primary, and Replicas 1-3 through the phases Request, Pre-Prepare, Prepare, Commit, Reply; annotation "once per batch"]
Batched Performance
Effect of BFT batching on maximum write throughput
Recommendations
› Use Q/U when
  » Latency critical
  » Contention low
  » 5f+1 replicas acceptable
› Use HQ when
  » Low latency important
  » Moderate contention
› Use BFT when
  » Contention high
  » Throughput more important than latency
Conclusions
› First Byzantine Quorum protocol with 3f+1 replicas
  » Supports general operations
  » Resilient to Byzantine clients
› Introduced Hybrid technique
  » Resolve contention without performance degradation
  » Applicable to general quorum systems
› Found optimized BFT to perform well under high load
Questions?
Further Details
› HQ Replication: Properties and optimizations
  » James Cowling, Daniel Myers, Barbara Liskov, Rodrigo Rodrigues, and Liuba Shrira. Technical Memo, in preparation, MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts, 2006.
› Contact:
  » [email protected]
  » http://people.csail.mit.edu/cowling/
Write-back Operation
› Write certificate paired with a subsequent request
› Used to ensure progress with slow replicas or clients
  » Completes phase 2 for a slow client
  » Advances state of slow replicas
› Replica processes write phase 2 based on certificate, then the paired request
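A minimal sketch of how a replica might process a write-back: it first completes phase 2 using the certificate, then handles the paired request as a fresh phase 1. The certificate is reduced to a single `<client, seq no, operation>` triple and is assumed to be already validated. The `Replica` class and its method names are illustrative. A certificate more than one step ahead of the replica would need state transfer, which is not modelled.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Grant = Tuple[int, int, str]  # (client ID, sequence number, operation)


@dataclass
class Replica:
    seq_no: int = 0                      # last executed sequence number
    outstanding: Optional[Grant] = None  # None means quiescent
    executed: List[str] = field(default_factory=list)

    def phase1(self, client_id: int, operation: str) -> Grant:
        if self.outstanding is None:
            self.outstanding = (client_id, self.seq_no + 1, operation)
        return self.outstanding

    def phase2(self, cert: Grant) -> None:
        _, seq_no, operation = cert
        if seq_no == self.seq_no + 1:
            # Next write in sequence: execute it and return to quiescent.
            self.executed.append(operation)
            self.seq_no = seq_no
            self.outstanding = None
        # Certificates at or below seq_no were already applied and are ignored.

    def writeback(self, cert: Grant, client_id: int, operation: str) -> Grant:
        """Process phase 2 for the certificate, then the paired request."""
        self.phase2(cert)
        return self.phase1(client_id, operation)
```

Replaying the incomplete-write example: a replica holding the outstanding grant `(1, 1, "A")` that receives `writeback((1, 1, "A"), 2, "B")` executes A and returns the new grant `(2, 2, "B")`. A slow replica that never saw Write A reaches the same state from the same call.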
Backups…
Slow Replicas
› Some grants in quorum have old timestamp
› Perform writeback to slow replicas, using certificate provided with highest grant
  » Brings replicas up to date and solicits new grants
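The client-side decision above can be sketched as a small helper that finds the highest grant and the replicas that lag behind it. The reply format (replica ID mapped to the sequence number in its grant) and the name `plan_writeback` are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def plan_writeback(reply_seq_nos: Dict[int, int]) -> Tuple[int, List[int]]:
    """Pick the reply with the highest grant and list the slow replicas.

    `reply_seq_nos` maps replica ID to the sequence number in the grant it
    returned. Returns (replica whose reply supplies the certificate, IDs of
    replicas that need a writeback).
    """
    if not reply_seq_nos:
        raise ValueError("no replies received")
    highest = max(reply_seq_nos.values())
    # Any reply at the highest sequence number will do; take the lowest ID.
    source = min(r for r, s in reply_seq_nos.items() if s == highest)
    slow = sorted(r for r, s in reply_seq_nos.items() if s < highest)
    return source, slow
```

If replicas 1 and 2 return grants at sequence number 3 and replica 3 returns one at sequence number 2, the client takes the certificate from replica 1's reply and sends a writeback to replica 3.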
Why 3f+1?
› 3f+1 replicas
  » f of which can be faulty
› 2f+1 agree on any ordering
  » f of these may be Byzantine
  » The remaining f may be slow
› Maximum of 2f can respond with old system state, but not 2f+1
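The counting argument above can be checked with a short Python sketch. The function name and the returned keys are illustrative. Any two quorums of 2f+1 out of 3f+1 replicas overlap in at least f+1 replicas, so at least one correct replica is in both. After a write completes at a quorum, at least f+1 correct replicas hold the new state, leaving at most 2f that can answer with old state.

```python
from typing import Dict


def quorum_arithmetic(f: int) -> Dict[str, int]:
    """Quorum sizes and overlaps for a group that tolerates f Byzantine replicas."""
    n = 3 * f + 1                       # total replicas
    quorum = 2 * f + 1                  # replicas that must agree on an ordering
    overlap = 2 * quorum - n            # minimum intersection of two quorums
    correct_in_overlap = overlap - f    # overlap members guaranteed to be correct
    max_stale = n - (quorum - f)        # replicas that may report old state
    return {
        "n": n,
        "quorum": quorum,
        "overlap": overlap,
        "correct_in_overlap": correct_in_overlap,
        "max_stale": max_stale,
    }


if __name__ == "__main__":
    for f in (1, 2, 5):
        print(f, quorum_arithmetic(f))
```

Because `max_stale` is 2f and a quorum needs 2f+1 replies, a full quorum can never consist solely of replicas reporting old state.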
› Won’t HQ have a higher rate of contention than Q/U, since it is two-phase (higher latency)?
  » No. The contention window only spans the time from the first replica receiving the phase 1 request to the last replica receiving it. Hence it is independent of the two-phase structure, and is actually smaller than in Q/U.