Dependability I: Fall 2004 Reliable Distributed Systems: Some figures © 2001 Veríssimo and Rodrigues
Dependability I: Reliable Distributed Systems
Prof. Neeraj Suri
www.deeds.informatik.tu-darmstadt.de
Andreas Johansson, Robert Lindstrom, Constantin Sarbu
Logistics
Track (Gebiet): Trusted Systems (Informatik II)
Exercises: (Tuesday: 9.50-10.35)
NEW LECTURE ROOM: E 215 (Piloty S2/02)
Assistants: Andreas Johansson, Robert Lindstrom, Constantin Sarbu
Related Seminars:
- Dependable Embedded Systems
- Dependability and Security (starts tomorrow!)
Exam: Oral exam (depends on # of students)
Website: www.deeds.informatik.tu-darmstadt.de
More logistical stuff…
Book(s)
There are very few books covering the extremely broad area spanning dependable systems, distributed systems, middleware and esp. SW Engg. For the initial part, I’ll utilize the first book.
• Distributed Systems for System Architects; P. Verissimo, L. Rodrigues; Kluwer Press, 2001. ISBN 0-7923-7266-2.
• Fault Tolerance in Distributed Systems; P. Jalote; Prentice Hall, 1994.
• Advanced Concepts in Operating Systems: Distributed, Database, and Multiprocessor Operating Systems; M. Singhal and N. Shivaratri; McGraw-Hill Publishing Company, New York, 1994, 525 pages (oriented more towards OS, though with excellent coverage of some distributed protocols).
• Distributed Systems – Principles & Paradigms; Tanenbaum; Prentice Hall, 2002.
Distributed Systems? What’s the reliability issue?
• 1 m/c System: up 95% of the time (.95)
• 2 m/c System: (.95) * (.95) = up 90% of the time
• 7 m/c System: (.95)^7 = up 70% of the time
• …
• 10 m/c Dist. System: (.95)^10 = up only ~60% of the time!
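The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: it assumes every machine is needed for service (a series system) and that machines fail independently; the function name `series_availability` is made up for illustration.

```python
def series_availability(per_machine: float, machines: int) -> float:
    """Probability that ALL machines are up at once.

    Assumes independent failures and that the service needs every machine.
    """
    if not 0.0 <= per_machine <= 1.0 or machines < 1:
        raise ValueError("need 0 <= per_machine <= 1 and machines >= 1")
    return per_machine ** machines


# Machine counts taken from the slide; 0.95 is the per-machine up-time.
for n in (1, 2, 7, 10):
    print(f"{n:2d} machine(s): up {series_availability(0.95, n):.1%} of the time")
```

The up-time shrinks geometrically with the number of machines, which is why redundancy and co-ordination are needed rather than simply adding nodes.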
so, how do we make distributed systems usably reliable/dependable?
Base definitions
Dependability I: Reliable Distributed Systems
• Dependability: The quantification of trust in a system’s ability to deliver the desired service (& at the desired time instant)
• Fault Tolerance/Reliability: Ensuring that a device or a system provides for “sustained delivery” of desired services in the presence of “faults/component failures”.
• Distributed: When a machine whose existence one is unaware of causes a loss of expected service… ☺
• Systems: HW, SW, Protocols,…
[Figure: performance vs. extent of failure, with a marked failure region. Two approaches: 1. Fault Avoidance, 2. Fault Tolerance]
What faults/perturbations are of interest?
• specification mistakes
• implementation mistakes
• operational/external perturbations
• component/application defects, aging
• ...
What aspects of faults are of interest?
physical, logical, environmental perturbations...
• fault nature: data faults, timing faults
• fault duration: permanent, transient, intermittent,…
• fault-severity: fail-stop, fail-omission, fail-safe (traffic lights red) ...
Cause & effect relationships
unless we can “detect” an error, any fault model is pretty useless!!
Faults --(fault latency)--> Errors --(error latency)--> Failures
...or cascaded across system levels
Metrics of FT/Dependability
FT/Dependability: Ensuring that a device or a system provides for “sustained delivery” of desired services in the presence of “faults”.
How do we measure FT attributes of “sustained delivery”?
Availability, Reliability, Safety ...
Availability
[Figure: timeline alternating between System Up (MTTF) and System Down (MTTR)]
Availability = Total Up-Time / Total (Up-Time + Down-Time)
Availability = MTTF / (MTTF + MTTR)
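A minimal sketch of the MTTF/MTTR formula, assuming both values are given in the same unit (hours here). The function name and the sample numbers are illustrative only.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system: fails on average every 999 h, takes 1 h to repair.
print(f"{availability(999.0, 1.0):.3%}")
```

Note that the same availability can come from a long MTTF with slow repair or a short MTTF with fast repair; the formula alone does not distinguish them.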
Availability   Downtime/year   Component
90%            > 1 month       Basic PC
99%            ~4 days         Server
99.9%          ~9 hours        Cluster
99.99%         ~1 hour         XMP
99.999%        ~5 mins         ATT Net Switches
99.9999%       ~32 sec         Engine Controller
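Downtime per year follows directly from an availability figure. The sketch below assumes a 365-day year and treats availability as a long-run average; the helper name is made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year


def downtime_seconds_per_year(availability: float) -> float:
    """Expected down-time per year for a given long-run availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * SECONDS_PER_YEAR


# Availability levels taken from the table.
for a in (0.90, 0.99, 0.999, 0.9999, 0.99999, 0.999999):
    seconds = downtime_seconds_per_year(a)
    print(f"{a:.4%}: {seconds / 3600:10.3f} hours/year ({seconds:12.1f} s)")
```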
- Does availability mean anything if the system goes down at a critical time?
- Does availability have diff. meaning if systems are used for diff. durations?
Reliability: Probability of a system being operational at (a) a given time instant or (b) over a desired time-interval
[Figure: probability of being operational (1 down to 0) vs. time, falling towards system failure]
Availability: Detection, Identification & Repair
Reliability: use of “extra resources” (redundancy) to keep prob. of operation at the desired level.
Note: Initially, we’ll start caring less and less about performance – though every feature is obtained at the cost of performance!!!
Initial key objective will be just to deliver desired services ...
What is a Distributed System for us?
Collection of resources (OS’s, CPU’s, memory) linked by a data/info. delivery mechanism... providing sustained delivery of desired services
- Distributed OS’s, Servers, Storage, Databases etc.
– high performance + high availability + fail-safe reliability!
– more redundancy → more things that can go wrong!
Reliability, Availability → Redundancy
• Flavor 1: Physical/Spatial Redundancy [Add resources HW/SW]
• Flavor 2: Temporal Redundancy [Redo tasks]
• ....and combinations… (information redundancy etc)
Classical FT & Redundancy
• Duplex: P1, P2 → compare (=) → output
• TMR: P1, P2, P3 → vote (=) → output
• Temporal: X → redo
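The three classical schemes can be sketched in a few lines. This is an illustration under simplifying assumptions: replica outputs are plain hashable values, the comparator/voter itself never fails, and all function names are invented here.

```python
from collections import Counter
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def duplex_compare(a: T, b: T) -> Optional[T]:
    """Duplex: a mismatch DETECTS an error but cannot say which copy is wrong."""
    return a if a == b else None


def tmr_vote(outputs: Sequence[T]) -> Optional[T]:
    """TMR: a 2-out-of-3 majority MASKS one faulty replica."""
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three outputs")
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= 2 else None


def temporal_redo(task: Callable[[], T], accept: Callable[[T], bool],
                  attempts: int = 3) -> Optional[T]:
    """Temporal redundancy: redo the task until a result passes the check."""
    for _ in range(attempts):
        result = task()
        if accept(result):
            return result
    return None  # every attempt failed: a transient fault did not clear


print(duplex_compare(42, 42), duplex_compare(42, 41))
print(tmr_vote([42, 42, 41]), tmr_vote([1, 2, 3]))
```

Duplex only detects, TMR masks a single fault, and temporal redundancy trades time for hardware; it helps with transients but not with permanent faults.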
Don’t they suffice for reliability, functionality & usability??
• poor performance!!!!
• lock-step synchronization [+ reliable clocks]
• limited extensibility: “spurious” transients
• SPF : “perfect” voter/error-detectors
• limited efficiency and fault coverage (transients?)
– real world driven by (a) heterogeneity and (b) performance!
[Figure: TMR (P1, P2, P3 → voter → o/p) and duplex (P1, P2 → comparator → output)]
Distributed systems... drivers
• efficiency concerns: high perf. + high rel. ?
... loose sync., no SPF
... user/appl. transparency – SW/OS interface
• possibly extended fault handling with capability of graceful degradation, recovery and repair (off-line, on-line)
DS paradigms
• centralized or distributed or networked or …
• OS/Middleware/System/Apps
• TDMA, CSMA/CD ...
• …
So...
+ lots of non-dedicated resources (possible redundancy…)
+ less restrictions on perfect sync. and voters etc.
- increased failure rates
- new fault models?
Fault models etc..
Earlier:
• Faults → Errors → Failures
• Transients, Intermittent and Permanent
Now:
• OS’s/ Middleware/Applications Processors/
• Timing Issues: msg. delivery aspects...
FT: What faults are of interest to us?
basis: physical, logical, environmental perturbations...
• fault duration: transients, intermittent, permanent
• fault nature:
  - data faults: out of range etc.
  - timing faults: early, late, omission of msgs...
• fault-severity: fail-stop, fail-omission, fail-safe, ...
unless we can “detect” an error, any fault model is pretty useless!!
Do these suffice ?
New fault models...
Byzantine Faults
[Figure: node A sends 30 to B and 70 to C (valid data: {0,100}). After B and C exchange what they received, B holds {30,70} and C holds {70,30}]
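The scenario in the figure can be replayed in code. This is a sketch under stated assumptions: “{0,100}” is read as the valid range 0..100, B and C relay honestly what they received, and all names are invented for illustration.

```python
from typing import Tuple

VALID_LOW, VALID_HIGH = 0, 100  # assumption: "{0,100}" means the range 0..100


def is_valid(value: int) -> bool:
    """Classical data-fault check: is the value within the valid range?"""
    return VALID_LOW <= value <= VALID_HIGH


def exchange(sent_to_b: int, sent_to_c: int) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """A sends one value to B and one to C; B and C then relay them to each other.

    Each view is (value received from A, value relayed by the other node).
    """
    view_b = (sent_to_b, sent_to_c)
    view_c = (sent_to_c, sent_to_b)
    return view_b, view_c


view_b, view_c = exchange(30, 70)  # A behaves inconsistently (Byzantine)
range_check_passes = all(is_valid(v) for v in view_b + view_c)
b_sees_conflict = view_b[0] != view_b[1]
print(view_b, view_c, range_check_passes, b_sees_conflict)
```

Every individual value passes the range check, so the earlier fault models do not flag anything; B only sees a conflict and cannot tell from its own view whether A or C is the faulty node.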
• basic distributed functions (sync. etc) also need to be tolerant to the expanded fault models → expensive!
DS Fault/Failure Semantics
• Need to now see what role a resource can play (by itself and combined with others) to affect the overall system ops....
• Failure Semantics: F.Stop, F.Safe, F.Silent (& how can we make nodes follow these semantics ???)
Fail Stop -> Crash -> Omission -> Timing -> Incorrect Computation -> Byzantine
DS Issues...
- new system + fault/failure semantics
- error occurrence + detection capabilities
- reliable data transfer issues
- dist. system co-ordination issues
- overheads, response time, stability
distr. systems: new capabilities, new headaches
Issues..
1. Basic objective is still the same: redundant resources, redundant info. processing to mask, detect and handle errors & aggregate system needs to be more reliable than its components…
2. Changes: strict HW paradigms change to include HW, SW, OS, middleware & comm. issues
3. Cycle by cycle co-ordination does not work
How do we get the dist. resources to co-ordinate to get useful output and FT functionality?
DS Co-ordination
Asynchronous or Synchronous?
1. Given redundant copies of information, how does one do “co-ordination” to come up with an agreed result?
2. How are distributed tasks/requests “ordered”?
3. Given distributed resources, how do they agree on a “single” course of action?
Asynchronous: “Single” Decision - Commit
2PC: Two Phase Commit Protocols
– coordinator (pre-specified or selected dynamically)
– multiple secondary sites (“cohorts”)
Objective: How do all nodes agree and execute a single decision? [all agree or no action taken...]
co-ordinator:
1. send COMMIT-REQ to all
   ....... “bounded waiting” .......
4. receive AGREE from all: put COMMIT in log & send COMMIT to all
4’. receive ABORT: send ABORT to all
5. ACK from all? DONE

cohorts:
2. get msg(COMMIT-REQ)
3. if ready, send AGREE (write undo/redo logs); else, send ABORT
------------------------------------
4. receive COMMIT: release resources, send ACK
4’. receive ABORT: undo actions
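The message flow above can be sketched as a failure-free, in-process simulation. This is not a full protocol implementation: the “bounded waiting” timeout, message loss and crash recovery are deliberately left out, and the class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cohort:
    name: str
    ready: bool = True
    state: str = "INIT"
    log: List[str] = field(default_factory=list)

    def on_commit_req(self) -> str:
        """Steps 2-3: vote AGREE (after writing undo/redo logs) or ABORT."""
        if self.ready:
            self.log.append("UNDO/REDO")
            self.state = "AGREED"
            return "AGREE"
        self.state = "ABORTED"
        return "ABORT"

    def on_commit(self) -> str:
        """Step 4: release resources and acknowledge."""
        self.state = "COMMITTED"
        return "ACK"

    def on_abort(self) -> None:
        """Step 4': undo any tentative actions."""
        if self.state == "AGREED":
            self.log.append("UNDONE")
        self.state = "ABORTED"


def two_phase_commit(cohorts: List[Cohort], coordinator_log: List[str]) -> str:
    # Phase 1: send COMMIT-REQ to all and collect the votes.
    votes = [c.on_commit_req() for c in cohorts]
    if all(v == "AGREE" for v in votes):
        # Phase 2 (commit): log the decision first, then tell everyone.
        coordinator_log.append("COMMIT")
        acks = [c.on_commit() for c in cohorts]
        return "DONE" if all(a == "ACK" for a in acks) else "PENDING"
    # Phase 2 (abort): a single ABORT vote aborts the whole decision.
    for c in cohorts:
        c.on_abort()
    return "ABORTED"


log: List[str] = []
print(two_phase_commit([Cohort("s1"), Cohort("s2")], log), log)
```

Even in this simplified form the all-or-nothing property is visible: either every cohort ends in COMMITTED, or none does.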
Comments...
• time-lag in making decisions – control applications??
• resources locked till voting/decision is completed
• msg. overhead
• reliable commn. assumptions
• possibilities of livelock + deadlock
• limited fault tolerance
...and..
Synchronous System Approaches
• How to achieve FT Synchronization, at what level?
• How to achieve FT Agreements?
• How to build operations on top of dist. sync. functions?
…course coverage
• Distributed Systems (DS) & Dependability
  – Chaps 1,2,3 :-: 6,7,8: DS foundations, DS services, Dependability Semantics, TCP/IP Servers
• SW/OS & Dependability
  – SW design techniques: recovery blocks, n-version SW, …
  – SW robustness methodologies: assertions, wrappers
• Verification & Validation/Testing (System, SW, OS)
  – Tools and Techniques (Experimental, Analytical, Formal)
• Security & Dependability
… labs
• L1: Semaphores
• L2: Ordering
• L3: Group Communication
• L4: Leader Election
• L5: Voting Algorithms
• L6: Byzantine Agreement
• L7: 2PC Protocols
Lab accounts after lecture next week
L1 handed out after Lecture 2 ~Nov 8th or so…
Mail Matrikelnummer + FB # to Andreas!!
Logistics
NEW LECTURE ROOM: E 215, Piloty Bldg., Hochschulstr. 10
Website: www.deeds.informatik.tu-darmstadt.de