My presentation on handling byzantine faults in distributed systems given for my graduate dependability course
1
Dealing with Byzantine Faults
CS 686 Final Project
brought to you by Chris Sosa
2
Overview
Motivation in Dependable Systems
Common Types of Byzantine Faults
Solutions in Real Systems
3
The Myths
• Hardware cannot be “traitorous”!
  • Anthropomorphic model
  • Any system with consensus is susceptible
• It’s never happened before
  • Often misclassified
  • Legionnaire's Disease
4
The Awful Truth
• Time-Triggered Architecture
  • Radioactive fault injection to one node
  • Messed up timing protocol (SOS)
  • Formed cliques until system failed
• Quad Redundant Control System
  • No message exchange
  • Lots of redundancy
  • One fault propagated to look like many
• Professor Knight’s Computer
5
Trends in Dependable Systems
1. Device Physics
  • Smaller and faster not always better
  • Cosmic rays, etc.
2. Movement to Distributed Topologies
3. Usage of Commercial off-the-shelf (COTS) Technology
6
Common Types of Observed Faults
1. Value
  • Issues related to digital values being the extremes of an analog signal
  • Propagation
2. Temporal
  • Different observations at same time
  • Synchronization doesn’t help very much
3. Value + Temporal
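As a rough illustration of a value fault, the sketch below assumes a hypothetical driver stuck at a marginal voltage and three receivers whose input thresholds differ slightly within tolerance. The receiver names, thresholds, and voltages are illustrative assumptions, not measurements from any real bus.

```python
# Hypothetical receivers: each digitizes the same wire using its own threshold (volts).
# The small differences stand in for normal component tolerances (an assumption).
THRESHOLDS = {"A": 1.4, "B": 1.5, "C": 1.6}

def read_bit(voltage, threshold):
    """Digitize an analog level the way a single receiver would."""
    return 1 if voltage >= threshold else 0

def observe(voltage):
    """Return the bit each receiver reports for the same analog signal."""
    return {name: read_bit(voltage, t) for name, t in THRESHOLDS.items()}

healthy = observe(3.3)   # a clean logic high: every receiver agrees
marginal = observe(1.5)  # a faulty driver stuck between the rails

print(healthy)   # {'A': 1, 'B': 1, 'C': 1}
print(marginal)  # {'A': 1, 'B': 1, 'C': 0}
```

The single faulty driver never "lies" on purpose, yet receivers A and B report a 1 while C reports a 0, which is exactly the inconsistent view that defines a Byzantine fault.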
7
Solutions?
8
Solutions (1)
• Full Exchange
  • Uses classical Byzantine agreement
  • SPIDER: bus (ROBUS) design
9
Solutions (2)
• Hierarchical
  • Uses hierarchy of different fault-tolerant techniques, including Byzantine agreement
  • Seen with fail-stop processors
• SAFEbus
  • Communication backplane for Boeing 777
  • Uses two buses which are themselves dual redundant; different forms of parity detect errors
  • Uses self-checking pairs on top of buses
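The self-checking pair idea can be sketched generically as two lanes that run the same computation and release an output only when they agree. This is a minimal sketch of the general technique, not SAFEbus's actual mechanism; the class name, lane functions, and injected fault are all hypothetical.

```python
class SelfCheckingPair:
    """Two lanes run the same computation; output is released only if they agree."""

    def __init__(self, lane_a, lane_b):
        self.lane_a = lane_a
        self.lane_b = lane_b
        self.halted = False

    def step(self, x):
        """Return the agreed result, or None once the pair has gone silent."""
        if self.halted:
            return None
        a, b = self.lane_a(x), self.lane_b(x)
        if a != b:
            # Disagreement: go silent rather than send a possibly wrong value.
            self.halted = True
            return None
        return a

def good(x):
    return 2 * x

def faulty(x):
    # Hypothetical fault: a wrong answer for one particular input.
    return 2 * x + (1 if x == 3 else 0)

pair = SelfCheckingPair(good, faulty)
outputs = [pair.step(x) for x in range(5)]
print(outputs)  # [0, 2, 4, None, None]
```

Turning an arbitrary fault into silence gives fail-stop behavior, which the layers above can handle with much cheaper techniques than full Byzantine agreement.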
10
Solutions (3)
• Filtering
  • Targets propagation of Byzantine faults
  • Tries to either:
    • Mask faults by forcing the output to a definite, valid value (removes value-type faults), or
    • Segment the system into Fault Containment Regions (FCRs), with protections at the boundaries to stop propagation
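A minimal sketch of the masking idea, assuming an idealized filter stage that sits between a possibly faulty driver and its receivers. The function names, voltage bands, and thresholds are illustrative assumptions; a real filter is itself hardware and may also behave marginally.

```python
def filter_stage(voltage, low=1.0, high=2.0, default=0):
    """Force a signal to a clean rail; marginal levels map to a fixed default."""
    if voltage >= high:
        return 3.3
    if voltage <= low:
        return 0.0
    # Marginal input: replace it with a definite value instead of passing it on.
    return 3.3 if default else 0.0

def receivers_see(voltage, thresholds=(1.4, 1.5, 1.6)):
    """Bits reported by three receivers with slightly different thresholds."""
    return [1 if voltage >= t else 0 for t in thresholds]

marginal = 1.5
unfiltered = receivers_see(marginal)                # [1, 1, 0]: receivers disagree
filtered = receivers_see(filter_stage(marginal))    # [0, 0, 0]: receivers agree

print(unfiltered, filtered)
```

The filtered value may still be wrong, but every receiver now sees the same wrong value, so the fault is no longer Byzantine and ordinary redundancy can deal with it.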
11
Ignorance is not Bliss
• Can invalidate failure model
  • Propagation of one fault can be disastrous
  • No amount of redundancy can help
• Large Economic Factor
  • Possible costs of recall and redeployment
12
Conclusions
• Byzantine faults are real!
• Problems with ignoring them
  • No amount of redundancy can tolerate them without message exchange
• Three categories of solutions to deal with them
13
Questions?
14
Byzantine Generals Problem (BGP) Quick Review
• Algorithm is expensive:
  • Each processor has to broadcast its values for many rounds
  • Chooses majority value
• Requires n > 3f, where f is the number of failures and n is the number of processors
• With signed messages
  • Can tolerate more failures
  • Still expensive
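The n > 3f bound can be illustrated with a small simulation of one relay round of oral-message agreement. This is a simplified sketch under stated assumptions: the commander is loyal, there is exactly one traitorous lieutenant, and the traitor always relays the opposite order. The function names and the "retreat" default are hypothetical choices, and the case of a traitorous commander is not modelled.

```python
from collections import Counter

DEFAULT = "retreat"  # assumed fallback when no strict majority exists

def majority(values):
    """Return the strict-majority value, or DEFAULT if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else DEFAULT

def flip(value):
    return "retreat" if value == "attack" else "attack"

def om1(order, lieutenants, traitor):
    """One relay round with a loyal commander and one traitorous lieutenant.

    Returns the decision reached by every loyal lieutenant.
    """
    received = {lt: order for lt in lieutenants}  # commander's broadcast
    decisions = {}
    for lt in lieutenants:
        if lt == traitor:
            continue
        seen = [received[lt]]
        for other in lieutenants:
            if other == lt:
                continue
            relayed = received[other]
            # Assumed traitor behaviour: always relay the opposite order.
            seen.append(flip(relayed) if other == traitor else relayed)
        decisions[lt] = majority(seen)
    return decisions

four = om1("attack", ["L1", "L2", "L3"], traitor="L3")  # n = 4, f = 1
three = om1("attack", ["L1", "L2"], traitor="L2")       # n = 3, f = 1

print(four)   # {'L1': 'attack', 'L2': 'attack'}
print(three)  # {'L1': 'retreat'}
```

With four processors the loyal lieutenants outvote the traitor and follow the commander. With three, the lone loyal lieutenant sees one "attack" and one "retreat", cannot tell who is lying, and falls back to the default even though the commander was loyal.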