Carnegie Mellon
Increasing Intrusion Tolerance Via Scalable Redundancy
Greg Ganger (greg.ganger@cmu.edu)
Natassa Ailamaki, Mike Reiter, Priya Narasimhan, Chuck Cranor
Technical Objective
- To design, implement, and evaluate new protocols for implementing intrusion-tolerant services that scale better
- Here, "scale" refers to efficiency as the number of servers and the number of failures tolerated grows
Targeting three types of services
- Read-write data objects
- Custom "flat" object types for particular applications, notably directories for implementing an intrusion-tolerant file system
- Arbitrary objects that support object nesting
Expected Impact
- Significant efficiency and scalability benefits over today's protocols for intrusion tolerance
- For example, for data services, we anticipate:
  - At least a twofold latency improvement over the current best, even at small configurations (e.g., tolerating 3-5 Byzantine server failures), with improvements growing as the system scales up
  - A twofold improvement in throughput, again growing with system size
- Without such improvements, intrusion tolerance will remain relegated to small deployments in narrow application areas
The Problem Space
- Distributed services manage redundant state across servers to tolerate faults
- We consider tolerance to Byzantine faults, as might result from an intrusion into a server or client
  - A faulty server or client may behave arbitrarily
- We also make no timing assumptions in this work: an "asynchronous" system
- Primary existing practice: replicated state machines
  - Offers no load dispersion, requires full data replication, and degrades as the system scales, with O(N^2) messages
Our approach
- Combine techniques to eliminate work in common cases
- Server-side versioning
  - Allows optimism, with read-time repair if necessary
  - Allows work to be off-loaded to clients in lieu of server agreement
- Quorum systems (and erasure coding)
  - Allow load dispersion (and more efficient redundancy for bulk data)
- Several other techniques applied to defend against Byzantine actions
- Major risk: could be complex for arbitrary objects
Evaluation
- Scenario I: "centralized server setting"
- Baseline: the BFT library
  - Popular, publicly available implementation of Byzantine fault-tolerant state machine replication (by Castro & Liskov)
  - Reported to be an efficient implementation of that approach
- Two measures:
  - Average latency of operations, from the client's perspective
  - Peak sustainable throughput of operations
- Our consistency definition: linearizability of invocations
Outline
- Overview
- Read-write storage protocol
- Some results
- Continuing work
Read-write block storage
- Clients erasure-code/replicate blocks into fragments
- Storage-nodes version fragments on every write
[Figure: a client erasure-codes a data block into fragments F1-F5, one per storage-node]
Challenges: Concurrency
- Concurrent updates can violate linearizability
[Figure: two clients concurrently write fragments of different data blocks across the same five servers, interleaving their fragments]
Challenges: Server Failures
- Faulty servers can attempt to mislead clients
- Typically addressed by "voting"
[Figure: a client reads fragments 1-5; one faulty server returns a corrupted fragment 4' in place of 4]
Challenges: Client Failures
- Byzantine client failures can also mislead clients
- Typically addressed by submitting each request via an agreement protocol
[Figure: a Byzantine client writes mismatched fragments (2', 4'), so readers cannot decode a consistent data block]
Consistency via versioning
- Leverage versioning storage-nodes for consistency
- Allow writes to proceed with versioning
  - All writes create new data versions
  - Partial writes and concurrency won't destroy data
- Reader detects and resolves update conflicts
  - Concurrency is rare in file-system workloads (typically < 1%)
  - Offloads work to the client, resulting in greater scalability
- Only perform extra work when needed
  - Optimistically assume fault-free, concurrency-free operation
  - Single round-trip for reads and writes in the common case
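The write-side half of this idea can be sketched in a few lines: a toy versioning storage-node (names hypothetical) that only ever inserts new versions, so partial or concurrent writes cannot destroy earlier data, and readers resolve conflicts from the version history later.

```python
import bisect

class StorageNode:
    """Toy versioning storage-node: every write inserts a new
    (timestamp, fragment) version; nothing is overwritten, so a
    partial or concurrent write cannot destroy earlier data."""

    def __init__(self):
        self.versions = []  # kept sorted by logical timestamp

    def write(self, ts, fragment):
        bisect.insort(self.versions, (ts, fragment))

    def read_latest(self):
        # Readers resolve update conflicts from these histories at read time
        return self.versions[-1] if self.versions else None
```

Even out-of-order arrivals leave every version intact; the common-case cost is one insertion per write.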
Our system model
- Crash-recovery storage-node fault model
  - Up to t total bad storage-nodes (crashed or Byzantine)
  - Up to b ≤ t Byzantine (arbitrary faults)
  - So, t - b faults are crash-recovery faults
- Client fault model: any number of crash or Byzantine clients
- Asynchronous timing model
- Point-to-point authenticated channels
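Under this fault model, the configurations evaluated later use N = 2t + 2b + 1 storage-nodes; a one-line helper makes the sizing concrete (a sketch of that bound only, not of the full quorum-size constraints):

```python
def min_storage_nodes(t: int, b: int) -> int:
    """Smallest storage-node count for tolerating t total faults,
    b of them Byzantine (b <= t), per the N = 2t + 2b + 1 bound
    used in the evaluation."""
    assert 0 <= b <= t
    return 2 * t + 2 * b + 1
```

For example, t = b = 1 needs N = 5; t = 4 with b = 1 needs N = 11; t = b = 4 needs N = 17, matching the configurations in the response-time results.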
Read/write protocol
- Unit of update: a block
  - Complete blocks are read and written
  - Erasure coding may be used for space efficiency
- Update semantics: read-write
  - No guarantee about contents between read and write
  - Sufficient for block-based storage
- Consistency: linearizability
- Liveness: wait-freedom
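The slides do not fix a particular erasure code. As a stand-in for the fragment mechanics, here is a toy single-parity code (any m of m+1 fragments rebuild the block); PASIS's real m-of-n codes tolerate more erasures, so this is illustrative only:

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, m: int) -> list:
    """Split a block into m data fragments plus one XOR parity
    fragment; any m of the m+1 fragments rebuild the block."""
    assert len(data) % m == 0, "caller pads the block first"
    size = len(data) // m
    frags = [data[i * size:(i + 1) * size] for i in range(m)]
    frags.append(reduce(xor_bytes, frags))  # parity fragment
    return frags

def decode(frags: list, m: int) -> bytes:
    """Rebuild the block from fragments; at most one may be None."""
    missing = [i for i, f in enumerate(frags) if f is None]
    if missing:
        others = [f for f in frags if f is not None]
        frags = list(frags)
        frags[missing[0]] = reduce(xor_bytes, others)
    return b"".join(frags[:m])
```

With m = 3 a 16 KB block becomes four fragments of about 5.3 KB each, i.e., far less redundancy overhead than N full replicas.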
R/W protocol: Write
1. Client erasure-codes data-item into N data-fragments
2. Client tags write requests with a logical timestamp
   - A round-trip is required to read the logical time
3. Client issues requests to at least W storage-nodes
4. Storage-nodes validate integrity of request
5. Storage-nodes insert request into version history
6. Write completes after W requests have completed
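The steps above can be sketched synchronously (a minimal model with hypothetical names; integrity validation in step 4 and waiting for exactly W acks are elided):

```python
class StorageNode:
    """Minimal stand-in storage-node holding a version history."""
    def __init__(self):
        self.versions = {}               # logical timestamp -> fragment

    def max_ts(self):
        return max(self.versions, default=0)

    def write(self, ts, fragment):
        self.versions[ts] = fragment     # step 5: insert into history

def write_block(nodes, fragments, W):
    """Steps 2-6 of the write path (request validation omitted)."""
    ts = max(n.max_ts() for n in nodes) + 1    # step 2: read logical time
    acks = 0
    for node, frag in zip(nodes, fragments):   # step 3: issue requests
        node.write(frag and ts, frag) if False else node.write(ts, frag)
        acks += 1
    assert acks >= W                            # step 6: W requests completed
    return ts
```

In the real asynchronous protocol the client returns as soon as W acknowledgements arrive rather than contacting every node first.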
R/W protocol: Read
1. Client reads the latest version from a subset of storage-nodes
   - The read set is guaranteed to intersect with the latest complete write
2. Client determines the latest candidate write (the "candidate")
   - The set of responses containing the latest timestamp
3. Client classifies the candidate as one of: complete, incomplete, repairable
   - For consistency, only complete writes can be returned
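Step 2 is a one-liner over the responses (tuple layout is a hypothetical choice for illustration):

```python
def latest_candidate(responses):
    """Among (node_id, timestamp, fragment) responses, the candidate
    is the set of responses carrying the highest timestamp."""
    top = max(ts for _, ts, _ in responses)
    return top, [r for r in responses if r[1] == top]
```

The size of the returned set feeds directly into the classification step on the next slide.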
R/W protocol: Read classification
- Based on the client's (limited) system knowledge
  - Failures and asynchrony lead to imperfect information
- Candidate classification rules:
  - Complete: candidate exists on W nodes
    - Candidate is decoded and returned
  - Incomplete: candidate cannot exist on W nodes
    - Read the previous version to determine a new candidate
    - Iterate: perform classification on the new candidate
  - Repairable: candidate may exist on W nodes
    - Repair and return the data-item
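These three rules reduce to simple counting over what the client has observed (a sketch; the actual protocol's thresholds also account for Byzantine responders):

```python
def classify(candidate_count, responses_seen, N, W):
    """Classify a candidate from imperfect knowledge: candidate_count
    responders hold it, and up to N - responses_seen unseen nodes
    might also hold it."""
    possibly_holding = candidate_count + (N - responses_seen)
    if candidate_count >= W:          # provably on W nodes
        return "complete"
    if possibly_holding < W:          # cannot be on W nodes
        return "incomplete"
    return "repairable"               # may be on W nodes
```

For N = 5, W = 3: seeing the candidate on 3 of 5 responders is complete; on 1 of 5 is incomplete (fall back to the previous version); on 2 of 4 is repairable, since one unseen node might hold it.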
Example: Successful read (N=5, W=3, t=1, b=0)
[Figure: timeline across five storage-nodes. A complete write of D0 at T0 reaches three nodes; a partial write of D1 at T1 reaches only one. A client read after T1 finds D1 as the latest candidate, classifies it incomplete, takes D0 as the new candidate, determines D0 complete, and returns D0.]
Example: Repairable read (N=5, W=3, t=1, b=0)
[Figure: timeline across five storage-nodes. After a complete write of D0 and partial writes of D1 and D2, a client read after T2 finds D2 as the latest candidate, classifies it repairable, writes D2 to additional nodes to complete the repair, and returns D2.]
Protecting against Byzantine storage-nodes
- Must defend against servers that modify data in their possession
- Solution: cross checksums [Gong 89]
  - Hash each data-fragment
  - Concatenate all N hashes to form the cross checksum
  - Append the cross checksum to each fragment
  - Clients verify hashes against fragments and use cross checksums as "votes"
[Figure: a data-item encoded into data-fragments; the fragments' hashes are concatenated into the cross checksum]
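The construction is a straight concatenation of fragment hashes (a sketch using SHA-256; the slides do not specify the hash function):

```python
import hashlib

def cross_checksum(fragments):
    """Gong-style cross checksum: the concatenated hash of every
    fragment, stored by each storage-node alongside its fragment."""
    return b"".join(hashlib.sha256(f).digest() for f in fragments)

def verify(index, fragment, cc):
    """Reader-side check: does a returned fragment match its slot
    in the cross checksum? Matching checksums act as votes."""
    h = hashlib.sha256(fragment).digest()
    return cc[index * 32:(index + 1) * 32] == h
```

A server that alters its fragment fails verification against the cross checksum the other servers vouch for.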
Protecting against Byzantine clients
- Must ensure all fragment sets decode to the same value
- Solution: validating timestamps
  - Write: place the hash of the cross checksum in the timestamp
    - Also prevents multiple values being written at the same timestamp
  - Storage-nodes validate their fragment against the corresponding hash
  - Read: regenerate the fragments and cross checksum
[Figure: Byzantine encoding with a "poisonous" fragment — fragment sets F1-F5 that decode to different data-items (≠)]
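The write-side validation can be sketched as follows (hypothetical names; SHA-256 is an assumed hash, and the read-side fragment regeneration is elided):

```python
import hashlib

def cross_checksum(fragments):
    return b"".join(hashlib.sha256(f).digest() for f in fragments)

def make_timestamp(logical_time, fragments):
    """Validating timestamp: logical time plus the hash of the cross
    checksum, so two different values cannot share a timestamp."""
    cc_hash = hashlib.sha256(cross_checksum(fragments)).digest()
    return (logical_time, cc_hash)

def node_validates(ts, index, fragment, cc):
    """Storage-node check on write: the presented cross checksum must
    match the timestamp, and the node's fragment must match its slot."""
    if hashlib.sha256(cc).digest() != ts[1]:
        return False
    return cc[index * 32:(index + 1) * 32] == hashlib.sha256(fragment).digest()
```

A Byzantine client cannot get two inconsistent fragment sets accepted at one timestamp: any substituted fragment or checksum fails one of the two checks.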
Experimental setup
- Prototype system: PASIS
- 20-node cluster
  - Dual 1 GHz Pentium III storage-nodes
  - Single 2 GHz Pentium 4 clients
  - 100 Mb switched Ethernet
- 16 KB data-item size (before encoding)
  - Each fragment is the data-item size, i.e., a blowup of N over the data-item size
PASIS response time
[Figure: mean response time (ms, 0-20) vs. total failures tolerated (t, 1-4), with a 1-way 16 KB ping for reference. Curves: writes and reads under two fault models, b = t and b = 1, with N = 2t + 2b + 1 (up to N = 17 for b = t, N = 11 for b = 1). Read cost is dominated by decode computation; write cost by network delay for redundant fragments.]
Throughput experiment
- Same system set-up as the response-time experiment
- Clients issue read or write requests
  - Increase the number of clients to increase load
- Demonstrate the value of erasure codes
  - Increase m to reduce per storage-node load
- Compare with Byzantine atomic broadcast: the BFT library [Castro & Liskov 99]
  - Supports arbitrary operations
  - Replication (with multicast) limits write throughput
  - O(N^2) messages limit performance scalability
PASIS vs. BFT: Write throughput
[Figure: write throughput (req/s, 0-3500) vs. number of clients (0-8), for b = t = 1. PASIS configurations m = 2, N = 5 and m = 3, N = 6; BFT with m = 1, N = 4. PASIS has higher write throughput than BFT (annotated at 60%): erasure codes reduce per storage-node load, while BFT's replication increases it.]
PASIS vs. BFT: Read throughput
[Figure: read throughput (req/s, 0-3500) vs. number of clients (0-8), for b = t = 1. PASIS configurations m = 2, N = 5 and m = 3, N = 6; BFT with m = 1, N = 4.]
Continuing work
- New testbed: 70 servers connected with switched Gbit/s networking
  - Experiments can then explore higher scalability points
  - Both the baseline and our results will come from this testbed
- Protocol for arbitrary deterministic functions on objects
  - Built from the same basic primitives
- Protocol for objects with nested objects
  - Adds the requirement of replicated invocations
Summary
- Goal: to design, implement, and evaluate new protocols for implementing intrusion-tolerant services that scale better
  - Here, "scale" refers to efficiency as the number of servers and the number of failures tolerated grows
- Started with a protocol for read-write storage based on versioning and quorums
  - Scales efficiently (and much better than BFT)
  - Also flexible (can add assumptions to reduce costs)
- Going forward (in progress): generalize the types of objects and operations that can be supported
Questions?
Garbage collection
- Pruning old versions is necessary to reclaim space
- Versions prior to the latest complete write can be pruned
- Storage-nodes need to know the latest complete write
  - In isolation they do not have this information
  - Perform a read operation to classify the latest complete write
- Many possible policies exist for when to clean what
  - Best to clean during idle time (if possible)
  - Rank blocks in order of greatest potential gains
  - Work remains in this area
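Once a storage-node learns the latest complete write's timestamp (via a read operation), the pruning rule itself is simple (a sketch over a hypothetical timestamp-to-fragment map):

```python
def prune(versions, latest_complete_ts):
    """Reclaim versions strictly older than the latest complete
    write; that write itself (and anything newer) must be kept."""
    return {ts: frag for ts, frag in versions.items()
            if ts >= latest_complete_ts}
```

Keeping the latest complete write preserves read correctness; everything newer is kept because it may still become complete.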