Dependability I: Reliable Distributed Systems

Dependability I: Reliable Distributed Systems, Fall 2004. Some figures © 2001 Veríssimo and Rodrigues


Prof. Neeraj Suri

www.deeds.informatik.tu-darmstadt.de

Andreas Johansson, Robert Lindstrom, Constantin Sarbu


Logistics

Track (Gebiet): Trusted Systems (Informatik II)
Exercises: Tuesday, 9.50-10.35
NEW LECTURE ROOM: E 215 (Piloty S2/02)

Assistants: Andreas Johansson, Robert Lindstrom, Constantin Sarbu

Related Seminars:
Dependable Embedded Systems
Dependability and Security (starts tomorrow!)

Exam: Oral exam (depends on # of students)
Website: www.deeds.informatik.tu-darmstadt.de


More logistical stuff… Book(s)

There are very few books covering the extremely broad area spanning dependable systems, distributed systems, middleware and esp. SW Engg. For the initial part, I’ll utilize the first book.

Distributed Systems for System Architects; P. Verissimo, L. Rodrigues; Kluwer Press, 2001. ISBN 0-7923-7266-2.

Fault Tolerance in Distributed Systems; P. Jalote; Prentice Hall, 1994.

Advanced Concepts in Operating Systems: Distributed, Database, and Multiprocessor Operating Systems; M. Singhal and N. Shivaratri; McGraw-Hill Publishing Company, New York, 1994, 525 pages (oriented more towards OS, though with excellent coverage of some distributed protocols).

Distributed Systems – Principles & Paradigms; Tanenbaum; Prentice Hall, 2002.


Distributed Systems? What’s the reliability issue?

• 1 m/c System: up 95% of the time (.95)
• 2 m/c System: (.95) * (.95) = up 90% of the time
• 7 m/c System: (.95)^7 = up 70% of the time
• …
• 10 m/c Dist. System: (.95)^10 = up only ~60% of the time!
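The arithmetic in this list can be checked with a short sketch. It assumes that machines fail independently and that the system counts as up only when every machine is up; the function name `series_availability` is illustrative, not something from the course material.

```python
def series_availability(per_machine: float, machines: int) -> float:
    """Availability of a system that needs all machines up at once.

    Assumes independent failures, so per-machine availabilities multiply.
    """
    return per_machine ** machines


for n in (1, 2, 7, 10):
    print(f"{n:2d} machines: up {series_availability(0.95, n):.1%} of the time")
```

Under these assumptions every added machine multiplies the system availability by another factor of 0.95, which is why the figure drops so quickly.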

so, how do we make distributed systems usably reliable/dependable?


Base definitions


• Dependability: The quantification of trust in a system’s ability to deliver the desired service (& at the desired time instant)

Fault Tolerance/Reliability: Ensuring that a device or a system provides for “sustained delivery” of desired services in the presence of “faults/component failures”.

• Distributed: When a machine whose existence one is unaware of causes a loss of expected service… ☺
• Systems: HW, SW, Protocols,…


[Figure: performance plotted against extent of failure, marking the regions covered by (1) fault avoidance and (2) fault tolerance, followed by the failure region.]


What faults/perturbations are of interest?

• specification mistakes
• implementation mistakes
• operational/external perturbations
• component/application defects, aging
• ...


What aspects of faults are of interest?

physical, logical, environmental perturbations...

• fault nature: data faults, timing faults

• fault duration: permanent, transient, intermittent,…

• fault-severity: fail-stop, fail-omission, fail-safe (traffic lights red) ...


Cause & effect relationships

unless we can “detect” an error, any fault model is pretty useless!!

Faults -> Errors -> Failures

(fault latency: fault to error; error latency: error to failure)

...or cascaded across system levels


Metrics of FT/Dependability

FT/Dependability: Ensuring that a device or a system provides for “sustained delivery” of desired services in the presence of “faults”.

How do we measure FT attributes of “sustained delivery”?

Availability, Reliability, Safety ...


Availability

[Figure: timeline alternating between “System Up” periods (MTTF) and “System Down” periods (MTTR).]

Availability = Total Up-Time / (Total Up-Time + Total Down-Time)

Availability = MTTF / (MTTF + MTTR)
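As a quick worked example of the MTTF/MTTR formula, here is a minimal sketch. The function name and the sample numbers (1000 hours between failures, 10 hours to repair) are hypothetical and only illustrate the ratio.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system: fails every 1000 h on average, takes 10 h to repair.
print(f"{availability(1000, 10):.4f}")
```

The same availability can be reached either by failing less often (larger MTTF) or by repairing faster (smaller MTTR), since only the ratio matters.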


Availability   Downtime/year   Component
90%            > 1 month       Basic PC
99%            ~4 days         Server
99.9%          ~9 hours        Cluster
99.99%         ~1 hour         XMP
99.999%        ~5 mins         ATT Net Switches
99.9999%       ~30 sec         Engine Controller
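The downtime column follows directly from the availability figure: downtime per year is (1 - availability) times the length of a year. The sketch below assumes a 365-day year; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: ignores leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime in one year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


for a in (0.90, 0.99, 0.999, 0.9999, 0.99999, 0.999999):
    hours = downtime_hours_per_year(a)
    print(f"{a:.4%} available -> {hours:10.4f} h/year = {hours * 3600:12.1f} s/year")
```

Each extra "nine" cuts the allowed downtime by a factor of ten, so the step from hours to minutes to seconds per year happens within three rows.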

- Does availability mean anything if the system goes down at a critical time?
- Does availability have diff. meaning if systems are used for diff. durations?


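Reliability: Probability of a system being operational (a) at a given time instant or (b) over a desired time interval

[Figure: probability of being operational (0 to 1) plotted against time (ticks at 0, 10, 100), with the point of system failure marked.]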


Availability: Detection, Identification & Repair

Reliability: use of “extra resources” (redundancy) to keep prob. of operation at the desired level.

Note: Initially, we’ll start caring less and less about performance, though every feature is obtained at the cost of performance!!!

Initial key objective will be just to deliver desired services ...


What is a Distributed System for us?

Collection of resources (OS’s, CPU’s, memory) linked by a data/info. delivery mechanism... providing sustained delivery of desired services
- Distributed OS’s, Servers, Storage, Databases etc

– high performance + high availability + fail-safe reliability!
– more redundancy -> more things that can go wrong!


Reliability, Availability -> Redundancy

• Flavor 1: Physical/Spatial Redundancy [Add resources HW/SW]

• Flavor 2: Temporal Redundancy [Redo tasks]

• ....and combinations… (information redundancy etc)


Classical FT & Redundancy

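• Duplex: P1 and P2 run the same task; a comparator (=) checks that the two results match before producing the output

• TMR: P1, P2 and P3 run the same task; a voter (=) produces the output

• Temporal: on an error (X), redo the task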
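The three classical schemes (duplex comparison, TMR voting, temporal redo) can be sketched in a few lines. This is a minimal illustration under simplifying assumptions: results are plain integers, the comparator and voter are themselves fault-free, and all function names are hypothetical.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def duplex_compare(a: int, b: int) -> Optional[int]:
    """Duplex: a mismatch is detected, but we cannot tell which copy is wrong."""
    return a if a == b else None


def tmr_vote(outputs: Sequence[int]) -> Optional[int]:
    """TMR: return the value a strict majority agrees on, or None if none exists."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


def redo(task: Callable[[], int], check: Callable[[int], bool],
         attempts: int = 3) -> Optional[int]:
    """Temporal redundancy: rerun the task until a result passes the check."""
    for _ in range(attempts):
        result = task()
        if check(result):
            return result
    return None


print(duplex_compare(42, 42))  # both copies agree
print(duplex_compare(42, 41))  # mismatch: detected, but no output
print(tmr_vote([42, 41, 42]))  # the single faulty copy is outvoted
```

Duplex only detects a single faulty copy, TMR masks it, and the temporal variant trades extra time for extra hardware, which only helps against faults that do not recur on the retry.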


Don’t they suffice for reliability, functionality & usability??


• poor performance!!!!

• lock-step synchronization [+ reliable clocks]

• limited extensibility: “spurious” transients

• SPF (single point of failure): “perfect” voter/error-detectors

• limited efficiency and fault coverage (transients?)

– real world driven by (a) heterogeneity and (b) performance!

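[Figure: the TMR configuration (P1, P2, P3 feeding a voter that produces the o/p) and the duplex configuration (P1, P2 feeding a comparator that produces the output).]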


Distributed systems... drivers

• efficiency concerns: high perf. + high rel. ?

... loose sync., no SPF

... user/appl. transparency – SW/OS interface

• possibly extended fault handling with capability of graceful degradation, recovery and repair (off-line, on-line)


DS paradigms

• centralized or distributed or networked or …
• OS/Middleware/System/Apps
• TDMA, CSMA/CD ...
• …


So...

+ lots of non-dedicated resources (possible redundancy…)
+ fewer restrictions on perfect sync. and voters etc

- increased failure rates
- new fault models?


Fault models etc..

Earlier:

• Faults -> Errors -> Failures
• Transients, Intermittent and Permanent

Now:

• OS’s/Middleware/Applications/Processors
• Timing Issues: msg. delivery aspects...


FT: What faults are of interest to us?

basis: physical, logical, environmental perturbations...

• fault duration: transients, intermittent, permanent

• fault nature:
- data faults: out of range etc.
- timing faults: early, late, omission of msgs...

• fault-severity: fail-stop, fail-omission, fail-safe, ...

unless we can “detect” an error, any fault model is pretty useless!!
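To make the point about detection concrete, here is a hedged sketch of a receiver-side check that classifies one expected message by fault nature. The valid range, the timing window, and all names (`Verdict`, `check_message`) are assumptions made up for this illustration.

```python
from enum import Enum
from typing import Optional, Tuple


class Verdict(Enum):
    OK = "ok"
    DATA_FAULT = "data fault: out of range"
    EARLY = "timing fault: early"
    LATE = "timing fault: late"
    OMISSION = "timing fault: omission"


def check_message(value: Optional[float], arrival: Optional[float],
                  valid_range: Tuple[float, float] = (0.0, 100.0),
                  window: Tuple[float, float] = (10.0, 20.0)) -> Verdict:
    """Classify one expected message; arrival is None if it never showed up."""
    if arrival is None:
        return Verdict.OMISSION
    if arrival < window[0]:
        return Verdict.EARLY
    if arrival > window[1]:
        return Verdict.LATE
    if value is None or not (valid_range[0] <= value <= valid_range[1]):
        return Verdict.DATA_FAULT
    return Verdict.OK


print(check_message(50.0, 15.0))   # in range, on time
print(check_message(150.0, 15.0))  # on time, but value out of range
print(check_message(50.0, None))   # never arrived
```

Note what such a check cannot catch: a value that is wrong but still inside the valid range and on time passes as OK.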


Do these suffice?


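New fault models... Byzantine Faults

[Figure: node A sends 30 to B and 70 to C. Valid data: {0,100}, so both values look valid. After B and C exchange what they received, B holds {30,70} and C holds {70,30}.]

• basic distributed functions (sync. etc) also need to be tolerant to the expanded fault models -> expensive!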
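The Byzantine scenario can be replayed in a few lines. This sketch assumes the reading that A is the faulty sender, that B and C relay honestly, and that the only local check available is the 0 to 100 range test; the variable names are illustrative.

```python
VALID_RANGE = (0, 100)


def in_range(value: int) -> bool:
    return VALID_RANGE[0] <= value <= VALID_RANGE[1]


# Faulty sender A tells B one thing and C another.
sent_by_a = {"B": 30, "C": 70}

# B and C relay what they received to each other.
view_b = (sent_by_a["B"], sent_by_a["C"])  # own value, then C's report
view_c = (sent_by_a["C"], sent_by_a["B"])  # own value, then B's report

# The range check passes everywhere, so it cannot expose the fault.
range_check_passes = all(in_range(v) for v in view_b + view_c)

# Each receiver sees a disagreement but cannot tell whether A lied
# or the other receiver relayed a wrong value.
b_sees_conflict = view_b[0] != view_b[1]
c_sees_conflict = view_c[0] != view_c[1]

print(view_b, view_c, range_check_passes, b_sees_conflict, c_sees_conflict)
```

With only three nodes, B and C each hold one claim against another and have no majority to fall back on, which is the kind of inconsistency the earlier fault models do not cover.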


DS Fault/Failure Semantics

• Need to now see what role a resource can play (by itself and combined with others) to affect the overall system ops....

• Failure Semantics: F.Stop, F.Safe, F.Silent (& how can we make nodes follow these semantics ???)

Fail Stop -> Crash -> Omission -> Timing -> Incorrect Computation -> Byzantine


DS Issues...

- new system + fault/failure semantics
- error occurrence + detection capabilities
- reliable data transfer issues
- dist. system co-ordination issues
- overheads, response time, stability

distr. systems: new capabilities, new headaches


Issues..

1. Basic objective is still the same: redundant resources, redundant info. processing to mask, detect and handle errors & aggregate system needs to be more reliable than its components…

2. Changes: strict HW paradigms change to include HW, SW, OS, middleware & comm. issues

3. Cycle by cycle co-ordination does not work


How do we get the dist. resources to co-ordinate to get useful output and FT functionality?


DS Co-ordination

Asynchronous or Synchronous?

1. Given redundant copies of information, how does one do “co-ordination” to come up with an agreed result?

2. How are distributed tasks/requests “ordered”?

3. Given distributed resources, how do they agree on a “single” course of action?


Asynchronous: “Single” Decision - Commit

2PC: Two Phase Commit Protocols
– coordinator (pre-specified or selected dynamically)
– multiple secondary sites (“cohorts”)

Objective: How do all nodes agree and execute a single decision? [all agree or no action taken...]


Co-ordinator:

1. send COMMIT-REQ to all
   ....... “bounded waiting” .......
------------------------------------
4. receive AGREE from all: put COMMIT in log & send COMMIT to all
4’. receive ABORT: send ABORT to all
5. ACK from all? DONE

Cohorts:

2. get msg(COMMIT-REQ)
3. if ready, send AGREE (write undo/redo logs); else, send ABORT
------------------------------------
4. receive COMMIT: release resources, send ACK
4’. receive ABORT: undo actions
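The message exchange of one two-phase commit round can be sketched as follows. This is a simplified, single-process illustration, not a full protocol: messages are modelled as method calls, delivery is assumed reliable, there are no timeouts for the bounded waiting, and nobody crashes. The class and function names, and the `ready` flag, are hypothetical.

```python
from typing import List


class Cohort:
    """A secondary site; `ready` is a hypothetical flag for 'able to commit'."""

    def __init__(self, name: str, ready: bool = True) -> None:
        self.name = name
        self.ready = ready
        self.state = "INIT"

    def on_commit_req(self) -> str:
        # Steps 2-3: vote AGREE (after writing undo/redo logs) or ABORT.
        if self.ready:
            self.state = "PREPARED"
            return "AGREE"
        self.state = "ABORTED"
        return "ABORT"

    def on_commit(self) -> str:
        # Step 4: release resources and acknowledge.
        self.state = "COMMITTED"
        return "ACK"

    def on_abort(self) -> None:
        # Step 4': undo any tentative actions.
        self.state = "ABORTED"


def two_phase_commit(cohorts: List[Cohort], log: List[str]) -> str:
    # Phase 1: send COMMIT-REQ to all and collect the votes.
    votes = [c.on_commit_req() for c in cohorts]
    if all(v == "AGREE" for v in votes):
        # Phase 2, commit path: log the decision, then tell everyone.
        log.append("COMMIT")
        acks = [c.on_commit() for c in cohorts]
        assert all(a == "ACK" for a in acks)
        return "COMMIT"
    # Phase 2, abort path: a single ABORT vote aborts the whole round.
    for c in cohorts:
        c.on_abort()
    return "ABORT"


log: List[str] = []
print(two_phase_commit([Cohort("s1"), Cohort("s2")], log))
print(two_phase_commit([Cohort("s1"), Cohort("s2", ready=False)], log))
```

Even in this toy form the all-or-nothing property is visible: either every cohort ends in COMMITTED or every cohort ends in ABORTED. What the sketch leaves out (lost messages, a coordinator that fails between the two phases) is exactly where the real protocol gets difficult.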


Comments...

• time-lag in making decisions – control applications??
• resources locked till voting/decision is completed

• msg. overhead
• reliable commn. assumptions

• possibilities of livelock + deadlock
• limited fault tolerance


...and..

Synchronous System Approaches

• How to achieve FT Synchronization, at what level?
• How to achieve FT Agreements?
• How to build operations on top of dist. sync. functions?


…course coverage

• Distributed Systems (DS) & Dependability
– Chaps 1,2,3 :-: 6,7,8: DS foundations, DS services, Dependability Semantics, TCP/IP Servers

• SW/OS & Dependability
– SW design techniques: recovery blocks, n-version SW, …
– SW robustness methodologies: assertions, wrappers

• Verification & Validation/Testing (System, SW, OS)
– Tools and Techniques (Experimental, Analytical, Formal)

• Security & Dependability


… labs

• L1: Semaphores
• L2: Ordering
• L3: Group Communication
• L4: Leader Election
• L5: Voting Algorithms
• L6: Byzantine Agreement
• L7: 2PC Protocols

Lab accounts after lecture next week
L1 handed out after Lecture 2, ~Nov 8th or so…
Mail Matrikelnummer (matriculation number) + FB # to Andreas!!


Logistics

NEW LECTURE ROOM: E 215, Piloty Bldg., Hochschulstr. 10

Website: www.deeds.informatik.tu-darmstadt.de
