Fundamentals Stream Session 8: Fault Tolerance & Dependability I
CSC 253
Gordon Blair, François Taïani
Distributed Systems
CSC253 / 2005-06 G. Blair/ F. Taiani 2
Overview of the day
  Today: Fault-Tolerance in DS / Guest Lecture
    9-10am: Fundamental Stream / Fault Tolerance (a)
    10-11am: Guest Lecturer Prof. Jean-Charles Fabre, "How to assess the robustness of OS and middleware"
    12-1pm: Fundamental Stream / Fault Tolerance (b)
  Next week: Advanced Fault-Tolerance
Overview of the session
  A famous example (video)
  Basic dependability concepts
    Also covered in CSc 365 (Year B) "Safety Critical Systems"
  Fault-tolerance at the node level: Replication
  Fault-tolerance at the network level: TCP
  Fault-tolerance at the environment level: Data centres
  Associated Reading: Chapter 7, "Fault Tolerance", of Tanenbaum & van Steen
A Famous Example
  (video)
  April 14, 1912
A Major Disaster
  One of the worst peacetime maritime disasters
    more than 1500 dead
    only 706 survivors
  The wreck was only discovered in 1985
    2 1/2 miles down on the ocean floor
The Titanic and Fault-Tolerance
  How fault-tolerant was the Titanic?
    best technology of the time: deemed "practically" unsinkable
    16 watertight compartments
    Titanic could float with its first 4 compartments flooded
  Some flaws in the design:
    poor turning ability
    bulkheads only went as high as E-deck
    not enough lifeboats!
What Does It Teach Us?
  Good to plan ahead for problems
    Like collision and hull damage for a ship
  Replication / redundancy is the key to survival
    Titanic contained 16 watertight compartments
    Could still float with the first four flooded (& other combinations)
    But there is always a limit to what you can tolerate
  But you also need diversity / independence
    Compartments are watertight
    Critical issue: prevent water propagation (deck E issue)
  The case of the Titanic can be applied to many other systems
    Generic concepts have been developed to discuss them
Dependability Concepts
  Failure: what we want to avoid
    Titanic: sinking and killing people
    Distributed system: stops functioning correctly
  Fault: a problem that could cause a failure
    Titanic: the iceberg, design flaws
    Distributed system: hardware crash, software bug
  Error: an erroneous internal state that can lead to a failure
    Titanic: damaged hull, water in compartment
    Distributed system: erroneous data, inconsistent internal behaviour
System and Sub-Systems
  Faults, errors and failures are recursive notions
  A system is usually made of components, i.e. sub-systems
    Titanic: hull, decks, compartments, smoke stacks
    Distributed system: nodes, network links, routers, hubs, services
    Nodes contain CPUs, hard disks, network cards, ...
    Network comprises routers, hubs, gateways, name servers, ...
  Failure of an internal component: a fault for the enclosing system
Means to Dependability
  Fault prevention
    Do not go through the Arctic Ocean (no icebergs)
    Buy good hardware (less faulty hardware)
    Hire skilled programmers (fewer bugs)
  Fault removal
    Blast the iceberg (not practicable, as it is an external fault)
    Test and replace faulty hardware before it is used
    Test and remove bugs
  Fault tolerance
    Use replication / redundancy and diversity
    Prevent water / error propagation
Quality of Service
  Notion of 'failure' is not black or white
    Consider a web server taking 5 minutes to deliver each request
    It fulfils its functional specification (to deliver web pages)
    It works better than a server that would not reply at all
    But can it be considered to be functioning correctly?
  Lots of intermediary situations
    [Figure: scale running from optimal service to complete failure]
  Quality of Service of a system: how 'good' a system is
    Multiple dimensions: latency, bandwidth, security, availability, reliability, ...
  'Failure' usually reserved for when a service becomes unusable
    Goal: highest QoS in spite of faults (and at the lowest cost)
Graceful Degradation
  Ideally fault tolerance should mask any internal failure/fault
    A router crashes but you do not even notice
  In practice not always possible
    Essentially because of cost + fault assumptions
  Goal is then to minimise the impact of faults on the system's users
    Avoid complete failure
    Maintain highest possible QoS in spite of faults
  This is called "Graceful Degradation"
    Extremely important: leaves time for administrators to react
    Users continue using the service, albeit with a lower QoS
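As a rough illustration of graceful degradation, the Python sketch below tries the best service level first and falls back to cheaper ones when a fault is detected. Every name in it (`ServiceFault`, `fetch_live`, `fetch_cached`, `fetch_static`, `serve`, `LEVELS`) is hypothetical and not part of the lecture; the "live" level is hard-wired to fail so that the fallback path is exercised.

```python
from typing import Callable, List, Tuple


class ServiceFault(Exception):
    """Raised by a (hypothetical) service level that cannot answer."""


def fetch_live(page: str) -> str:
    # Hypothetical full-quality service; always fails here to simulate a crash.
    raise ServiceFault("live server unreachable")


def fetch_cached(page: str) -> str:
    # Hypothetical degraded service: a possibly stale cached copy.
    return f"cached copy of {page}"


def fetch_static(page: str) -> str:
    # Hypothetical last resort: a fixed notice instead of the real page.
    return "service temporarily limited"


# Service levels ordered from highest to lowest QoS.
LEVELS: List[Tuple[str, Callable[[str], str]]] = [
    ("live", fetch_live),
    ("cached", fetch_cached),
    ("static", fetch_static),
]


def serve(page: str, levels=LEVELS) -> Tuple[str, str]:
    """Return (level name, reply) from the best level that still works."""
    for name, handler in levels:
        try:
            return name, handler(page)
        except ServiceFault:
            continue  # degrade to the next level instead of failing outright
    raise ServiceFault("complete failure: no level could answer")


if __name__ == "__main__":
    print(serve("index.html"))
```

The user still gets an answer, only with a lower QoS (a stale page), and the level name returned by `serve` gives administrators a signal that something needs attention.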
Distributed Systems and Fault Tolerance
  Fault tolerance requires distribution
    For replication and independence of failure modes
    E.g. to tolerate earthquakes, do not put all servers in San Francisco
  Distribution requires fault tolerance
    In large distributed systems, partial node failures are unavoidable
    Single points of failure to be avoided as much as possible
What Can Go Wrong in a DS?
  Nodes (servers, clients, peers)
    Faulty hardware → crash or data corruption
    Power failure → crash (and sometimes data corruption)
    Any "physical" accident: fire, flood, earthquake, ...
  Network
    Same as nodes
    Routers, gateways: whole subnet impacted
    Name servers: whole name domain impacted
    Congestion (dropped packets)
  Users / People
    Involuntary mistakes
    Security attacks (external & internal, i.e. from legitimate users)
Main Steps of Fault Tolerance
  Error detection
    Detecting that something is wrong (like water in the keel)
    The earlier the better
    The earlier the more difficult
  Error recovery
    Preventing errors from propagating (watertight door)
    Removing errors
    Does not mean the root cause (fault) is removed
  Fault treatment (usually off-line, manually)
    Fault diagnosis
    Fault removal
Error Detection
  Error detection in the Titanic quite "easy"
    No water should leak inside the boat (safety invariant)
  Distributed systems: more difficult
    What the result of a request should be is usually unknown
    In some cases a sanity or correctness check is possible
    In many cases some form of redundancy is needed
  Example: a component computing $\sqrt{x}$ returns 1.414213 for $x = 2$
    Check: $(1.414213)^2 = ?$
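The square-root example can be turned into a small correctness check: the result is squared and compared with the input. The sketch below is only an illustration; `checked_sqrt`, `faulty_sqrt` and the tolerance value are assumptions made for this example, not part of the lecture.

```python
import math


def checked_sqrt(x: float, compute=math.sqrt, tolerance: float = 1e-5) -> float:
    """Compute a square root and detect errors by squaring the result."""
    result = compute(x)
    # Correctness check: squaring the answer must give back the input
    # (up to a small tolerance, since floating-point results are inexact).
    if abs(result * result - x) > tolerance * max(1.0, abs(x)):
        raise ValueError(f"error detected: {result}**2 does not match {x}")
    return result


def faulty_sqrt(x: float) -> float:
    # Hypothetical faulty component that returns a slightly corrupted result.
    return math.sqrt(x) + 0.01


if __name__ == "__main__":
    print(checked_sqrt(2.0))
    try:
        checked_sqrt(2.0, compute=faulty_sqrt)
    except ValueError as exc:
        print(exc)
```

The check works here because verifying a square root is much cheaper than computing it. For most distributed services no such inverse exists, which is why some form of redundancy is usually needed.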
Error Recovery
  Backward error recovery
    Return the system to a previous (hopefully) error-free state
    Can be achieved in a generic manner (replication, checkpointing)
    Risk of inconsistency in the general case
    Can be minimised (see next week)
  Forward error recovery
    Bring the system into a valid future state
    Highly application dependent
    Makes sense as soon as real-time is involved
    Examples: video streaming, flight control software
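Backward error recovery by checkpointing can be sketched as follows. The class `CheckpointedAccount` and its invariant (the balance must never be negative) are hypothetical, chosen only to make the rollback visible; in-memory copies stand in for stable storage.

```python
import copy


class CheckpointedAccount:
    """Toy stateful service illustrating checkpoint and rollback."""

    def __init__(self):
        self.state = {"balance": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self) -> None:
        # Save a copy of the current, presumed error-free, state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        # Return to the last saved state; work done since then is lost.
        self.state = copy.deepcopy(self._checkpoint)

    def apply(self, amount: int) -> bool:
        self.state["balance"] += amount
        # Error detection: a simple invariant on the internal state.
        if self.state["balance"] < 0:
            self.rollback()  # backward error recovery
            return False
        return True


if __name__ == "__main__":
    svc = CheckpointedAccount()
    svc.apply(10)
    svc.checkpoint()
    svc.apply(5)      # not yet checkpointed
    svc.apply(-100)   # violates the invariant and triggers a rollback
    print(svc.state)
```

In this run the rollback also discards the valid `apply(5)` made after the last checkpoint. This is a small instance of the inconsistency risk mentioned above: anyone who already observed the balance of 15 now disagrees with the service.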
Node FT: Replication
  Why replicate?
    Performance: e.g. replication of heavily loaded web servers
    Allows error detection (when replicas disagree)
    Allows backward error recovery
  Importance of the fault model
    Effect of replication depends on how individual replicas fail
    Crash faults → availability = $1 - p^n$
    (where $n$ = number of replicas, $p$ = probability of individual failure)
  Requirements for a replication service
    Replication transparency ... in spite of network partitions, disconnections, etc.
Focus on Availability
  [Figure: whole-system availability (0 to 1) plotted against the probability of failure of 1 replica (0.5 down to 0), with one curve each for 1, 2, 3, 4 and 5 replicas]
  Assumptions (very important!)
    Replicas fail independently
    Replicas fail silently (crashes, no arbitrary behaviour)
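Under these two assumptions the system is unavailable only when all $n$ replicas are down at once, which happens with probability $p^n$, so availability is $1 - p^n$. The short Python sketch below tabulates this for the replica counts shown in the figure; the function name `availability` and the chosen values of $p$ are illustrative only.

```python
def availability(p: float, n: int) -> float:
    """Availability of n independent, crash-only replicas.

    p is the probability that one replica is down; the whole system is
    down only if every replica is down, i.e. with probability p ** n.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if n < 1:
        raise ValueError("need at least one replica")
    return 1.0 - p ** n


if __name__ == "__main__":
    probabilities = (0.5, 0.3, 0.1)
    print("replicas  " + "  ".join(f"p={p}" for p in probabilities))
    for n in range(1, 6):
        row = "  ".join(f"{availability(p, n):.5f}" for p in probabilities)
        print(f"{n:>8}  {row}")
```

For instance, with $p = 0.1$ three replicas give $1 - 0.001 = 0.999$. The gain disappears if the independence assumption does not hold, e.g. when all replicas share one power supply.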
A Generic Replication Architecture
  [Figure: clients (C) talk to front ends (FE), which exchange requests and replies with the service, made of several replica managers (RM)]
Passive Replication
  What is passive replication (primary backup)?
    FEs communicate with a single primary RM, which must then communicate with secondary RMs (slaves)
    Requires election of a new primary in case of primary failure
    Only tolerates crash faults (silent failure of replicas)
    Variant: primary's state saved to stable storage (cold passive)
    saving to stable storage is known as "checkpointing"
  [Figure: clients (C) connect through front ends (FE) to the primary RM, which updates two backup RMs]
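A minimal, single-process sketch of primary-backup replication is given below. It is not a real protocol: the classes `ReplicaManager` and `FrontEnd` are hypothetical, crashes are simulated with an `alive` flag, and the "election" simply picks the first live replica, where a real system would need an agreement protocol.

```python
class ReplicaCrashed(Exception):
    """Signals a silent (crash) failure of a replica manager."""


class ReplicaManager:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.store = {}

    def _check_alive(self):
        if not self.alive:
            raise ReplicaCrashed(self.name)

    def write(self, key, value, backups=()):
        self._check_alive()
        self.store[key] = value
        # The primary pushes every update to the live backups.
        for backup in backups:
            if backup.alive:
                backup.store[key] = value

    def read(self, key):
        self._check_alive()
        return self.store.get(key)


class FrontEnd:
    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.primary = self.replicas[0]

    def _elect(self):
        # Simplistic stand-in for an election: first live replica wins.
        for rm in self.replicas:
            if rm.alive:
                self.primary = rm
                return
        raise ReplicaCrashed("no replica left")

    def _backups(self):
        return [rm for rm in self.replicas if rm is not self.primary]

    def write(self, key, value):
        try:
            self.primary.write(key, value, self._backups())
        except ReplicaCrashed:
            self._elect()
            self.primary.write(key, value, self._backups())

    def read(self, key):
        try:
            return self.primary.read(key)
        except ReplicaCrashed:
            self._elect()
            return self.primary.read(key)


if __name__ == "__main__":
    rms = [ReplicaManager(f"rm{i}") for i in (1, 2, 3)]
    fe = FrontEnd(rms)
    fe.write("x", 1)
    rms[0].alive = False          # the primary crashes
    print(fe.read("x"), fe.primary.name)
```

Only the primary processes requests, so replicas need not be deterministic. The scheme cannot cope with a primary that returns wrong answers instead of crashing.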
Active Replication
  What is active replication?
    Front ends multicast requests to every replica manager
    Appropriate reliable group communication needed
    ordering guarantees crucial (see W4: Group Communication)
    Tolerates crash faults and arbitrary faults
  [Figure: clients (C) send requests through front ends (FE), which multicast them to all three replica managers (RM)]
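The following sketch illustrates active replication with majority voting at the front end. It is an assumption-laden toy: the classes `Replica`, `CrashedReplica` and `ArbitraryReplica` and the function `multicast` are hypothetical, and a plain loop stands in for the reliable, totally ordered multicast that a real system would need.

```python
from collections import Counter


class Replica:
    """Deterministic replica: the same requests in the same order
    always produce the same replies."""

    def __init__(self):
        self.total = 0

    def handle(self, amount):
        self.total += amount
        return self.total


class CrashedReplica(Replica):
    def handle(self, amount):
        # Crash fault: the replica never answers.
        raise ConnectionError("replica crashed")


class ArbitraryReplica(Replica):
    def handle(self, amount):
        super().handle(amount)
        return -1  # arbitrary fault: a wrong reply


def multicast(replicas, amount):
    """Send the request to every replica and return the majority reply."""
    replies = []
    for replica in replicas:
        try:
            replies.append(replica.handle(amount))
        except ConnectionError:
            pass  # no reply from a crashed replica
    if not replies:
        raise RuntimeError("all replicas failed")
    value, votes = Counter(replies).most_common(1)[0]
    # Require a strict majority of all replicas to mask wrong replies.
    if votes <= len(replicas) // 2:
        raise RuntimeError("no majority among replies")
    return value


if __name__ == "__main__":
    group = [Replica(), ArbitraryReplica(), Replica()]
    print(multicast(group, 5), multicast(group, 3))
```

Voting only works if correct replicas agree, which is why determinism and ordering guarantees are required. With a strict majority rule, three replicas mask one faulty replica.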
Comparing Active vs. Passive
                        Passive             Active
  Comm. overhead        Low                 High
  Processing overhead   Low                 High
  Recovery overhead     High                Low
  Fault model           Only crash faults   Arbitrary faults
  Determinism           Not required        Required
  Passive: less expensive, less complex, less powerful
  Active: more powerful, more expensive, more complex
Network FT: TCP
  TCP is a fault-tolerant protocol
    correct messages arrive even if packets get lost or corrupted
  A TCP packet:
    [Figure: TCP packet layout, annotated "Error detection against corruption"]
  Fault model
    (see C.SC 251: Networking)
TCP Checksum
  TCP checksum: 16-bit blocks of the packet added & complemented
    (additional one's-complementing here and there)
    $2^{16}$ (~65 000) possible combinations
    TCP packets typically around 1500 bytes: $2^{12000}$ combinations
    Several packets bound to have the same checksum
    Roughly: $2^{12000} \div 2^{16} = 2^{11984} \approx 16 \times 10^{3594}$ (taking $2^{10} \approx 10^{3}$)
  The Trick
    Usual corruption unlikely to produce a consistent checksum
    Notion of Hamming distance
    Additional checksums at the network (IP) and link (Ethernet) layers
    (!) multiple corruptions
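The "add and complement" idea can be sketched in a few lines of Python. The function below follows the one's-complement sum described in RFC 1071; it is a simplified illustration that checksums raw bytes only, whereas the real TCP checksum also covers a pseudo-header. The names `internet_checksum` and `verify` are chosen for this example.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum of `data` (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]        # next 16-bit block
        total = (total & 0xFFFF) + (total >> 16)     # fold the carry back in
    return ~total & 0xFFFF                           # complement the sum


def verify(data: bytes, checksum: int) -> bool:
    """True if `data` is consistent with the transmitted checksum."""
    return internet_checksum(data) == checksum


if __name__ == "__main__":
    packet = bytes.fromhex("0001f203f4f5f6f7")
    checksum = internet_checksum(packet)
    one_bit_flipped = bytes.fromhex("0101f203f4f5f6f7")
    blocks_swapped = bytes.fromhex("f2030001f4f5f6f7")
    print(hex(checksum))
    print(verify(one_bit_flipped, checksum))   # corruption detected
    print(verify(blocks_swapped, checksum))    # corruption NOT detected
```

Because addition is commutative, swapping two 16-bit blocks leaves the checksum unchanged. This is one concrete case of several packets sharing a checksum, while a single flipped bit is always detected.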
Hamming Distance
  Between 2 strings of bits: the number of bit flips required to go from one to the other
    Distance between 1011101 and 1001001? 2
  Minimum Hamming distance of a code
    minimal Hamming distance between 2 valid "code words"
    i.e. minimal number of corruptions needed to fall back on a valid packet
  Simple checksum: packets of 8 bits, checksum on 4 bits
    adds the two 4-bit blocks
    Example: packet 0110 1010 with checksum 1100 (0110 + 1010 = ????)
  What is the minimal Hamming distance of this code?
Hamming Distance
  Original packet: 0110 1010, checksum 1100
  One corruption: 0110 1000, checksum 1100
    0110 + 1000 does not match. Invalid packet. Corruption detected.
  Two corruptions: 0100 1000, checksum 1100
    0100 + 1000 matches again. Valid packet. Corruption not detected.
  Hamming distance is 2
    Note that the 2 corruptions must obey a particular pattern
    In a network, corruption usually occurs in bursts
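The toy 4-bit checksum can be checked exhaustively with a short Python sketch. One assumption is made loudly here: the two 4-bit blocks are "added" bit by bit without carry (XOR), because that is the operation that reproduces the checksum 1100 shown for the blocks 0110 and 1010. The names `hamming`, `toy_checksum`, `is_valid` and `min_distance` are illustrative.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def toy_checksum(block1: int, block2: int) -> int:
    # Assumption: carry-free bitwise addition (XOR) of the two 4-bit blocks.
    return block1 ^ block2


def is_valid(word: str) -> bool:
    """word = 8 data bits followed by 4 checksum bits, e.g. '011010101100'."""
    b1, b2, check = int(word[0:4], 2), int(word[4:8], 2), int(word[8:12], 2)
    return toy_checksum(b1, b2) == check


def min_distance() -> int:
    """Minimum Hamming distance over all pairs of valid 12-bit code words."""
    words = [
        format(b1, "04b") + format(b2, "04b") + format(toy_checksum(b1, b2), "04b")
        for b1, b2 in product(range(16), repeat=2)
    ]
    return min(
        hamming(u, v) for i, u in enumerate(words) for v in words[i + 1:]
    )


if __name__ == "__main__":
    print(hamming("1011101", "1001001"))
    print(is_valid("011010101100"))  # original packet
    print(is_valid("011010001100"))  # one corruption
    print(is_valid("010010001100"))  # two corruptions
    print(min_distance())
```

The exhaustive search confirms a minimum distance of 2 for this code: any single bit flip is detected, but a suitable pair of flips yields another valid code word.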
What Is in a Checksum?
  The TCP checksum is a form of error detection code
    Min. Hamming distance reflects the strength of the detection
    Far more complex error detection codes exist
    For instance CRC: cyclic redundancy check
  It is a form of redundancy
    information redundancy: if the packet is OK, no need for the checksum
    partial redundancy: impossible to reconstruct the packet from the checksum
  When the code used is "strong" enough (min. Hamming dist. > 2)
    possible to reconstruct the "most likely" correct packet
    Error correction code (not seen in this course)
    Provides error recovery
Environment FT: Data Centres
  Mission critical systems (your company's web server)
    Little point in addressing only one type of fault
    Major risks: power failure & overheating
  Collocation provider: rents luxury "parking places" for servers
    Multiple network operators
    Secured physical access
    Multiple power sources (including power generator)
    Highly robust air conditioning
  Example: Redbus Interhouse plc (http://www.interhouse.org/)
    3 centres in London, others in Amsterdam, Frankfurt, Milan, Paris
    Even there things can go wrong
    http://www.theregister.co.uk/2004/06/16/redbus_power_fails/
What we'll see next week
  Refinement of what we have seen today
  Replication scheme: the problem of consistency
    Consistent check-pointing for primary backup
    Fault-tolerant total ordering for active replication
  Contrast replication and distributed transactions
    Short flash-back to Week 3 lecture
References
  Chapter 7 of Tanenbaum & van Steen
  On the Titanic
    http://www.charlespellegrino.com/Time%20Line.htm
    Very detailed account http://octopus.gma.org/space1/titanic.html
  Fundamental Concepts of Dependability, by Algirdas Avizienis, Jean-Claude Laprie and Brian Randell
    http://www.cert.org/research/isw/isw2000/papers/56.pdf
Expected Learning Outcomes
  At the end of this 8th Unit:
    You should understand the basic dependability concepts of fault, error and failure
    You should know the basic steps of fault tolerance
    You should appreciate fundamental fault-tolerance techniques like replication and error-detection codes
    You should be able to explain and compare the active and passive replication styles