Dependability I: Fall 2004 Reliable Distributed Systems: Some figures © 2001 Verissímo and Rodrigues 1 1 Dependability I: Reliable Distributed Systems Prof. Neeraj Suri www.deeds.informatik.tu-darmstadt.de Andreas Johansson, Robert Lindstrom, Constantin Sarbu

Dependability I: Reliable Distributed Systems

Page 1: Dependability I: Reliable Distributed Systems

Dependability I: Fall 2004 Reliable Distributed Systems: Some figures © 2001 Verissímo and Rodrigues 11

Dependability I: Reliable Distributed Systems

Prof. Neeraj Suri

www.deeds.informatik.tu-darmstadt.de

Andreas Johansson, Robert Lindstrom, Constantin Sarbu

Page 2: Dependability I: Reliable Distributed Systems

Logistics

Track (Gebiet): Trusted Systems (Informatik II)
Exercises: Tuesday, 9.50-10.35
NEW LECTURE ROOM: E 215 (Piloty S2/02)

Assistants: Andreas Johansson, Robert Lindstrom, Constantin Sarbu

Related Seminars:
• Dependable Embedded Systems
• Dependability and Security (starts tomorrow!)

Exam: Oral exam (depends on # of students)
Website: www.deeds.informatik.tu-darmstadt.de

Page 3: Dependability I: Reliable Distributed Systems

More logistical stuff… Book(s)

Very few books cover the extremely broad area spanning dependable systems, distributed systems, middleware and especially software engineering. For the initial part, I’ll use the first book listed below.

• Distributed Systems for System Architects; P. Verissimo, L. Rodrigues; Kluwer Press, 2001. ISBN 0-7923-7266-2.
• Fault Tolerance in Distributed Systems; P. Jalote; Prentice Hall, 1994.
• Advanced Concepts in Operating Systems: Distributed, Database, and Multiprocessor Operating Systems; M. Singhal and N. Shivaratri; McGraw-Hill Publishing Company, New York, 1994, 525 pages (oriented more towards OS, though with excellent coverage of some distributed protocols).
• Distributed Systems: Principles and Paradigms; Tanenbaum; Prentice Hall, 2002.

Page 4: Dependability I: Reliable Distributed Systems

Distributed Systems? What’s the reliability issue?

• 1 m/c System: up 95% of the time (.95)
• 2 m/c System: (.95) * (.95) = up 90% of the time
• 7 m/c System: (.95)^7 = up 70% of the time
• …
• 10 m/c Dist. System: (.95)^10 = up only about 60% of the time!
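The multiplication above implicitly assumes that the machines fail independently and that the system is up only when every machine is up. Under those assumptions (which the slide does not state explicitly), the numbers can be reproduced with a minimal Python sketch; the function name `series_availability` is illustrative only.

```python
def series_availability(per_machine: float, n_machines: int) -> float:
    """Availability of a system that needs all n machines up at once.

    Assumes independent failures and the same availability for every
    machine (both are assumptions of this sketch).
    """
    if not 0.0 <= per_machine <= 1.0:
        raise ValueError("per_machine must be between 0 and 1")
    return per_machine ** n_machines


if __name__ == "__main__":
    for n in (1, 2, 7, 10):
        share = series_availability(0.95, n)
        print(f"{n:2d} machine(s): up {share:.1%} of the time")
```

The same function shows how quickly the product shrinks as machines are added, which is the motivation for the redundancy techniques that follow.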

so, how do we make distributed systems usably reliable/dependable?

Page 5: Dependability I: Reliable Distributed Systems

Base definitions

• Dependability: The quantification of trust in a system’s ability to deliver the desired service (& at the desired time instant)

• Fault Tolerance/Reliability: Ensuring that a device or a system provides for “sustained delivery” of desired services in the presence of “faults/component failures”.

• Distributed: When a machine whose existence one is unaware of causes a loss of expected service… ☺

• Systems: HW, SW, Protocols,…

Page 6: Dependability I: Reliable Distributed Systems

[Figure: performance plotted against extent of failure, with regions labeled “1. Fault Avoidance” and “2. Fault Tolerance”, and a “failure region”]

Page 7: Dependability I: Reliable Distributed Systems

What faults/perturbations are of interest?

• specification mistakes
• implementation mistakes
• operational/external perturbations
• component/application defects, aging
• ...

Page 8: Dependability I: Reliable Distributed Systems

What aspects of faults are of interest?

physical, logical, environmental perturbations...

• fault nature: data faults, timing faults

• fault duration: permanent, transient, intermittent,…

• fault-severity: fail-stop, fail-omission, fail-safe (traffic lights red) ...

Page 9: Dependability I: Reliable Distributed Systems

Cause & effect relationships

unless we can “detect” an error, any fault model is pretty useless!!

Faults --(fault latency)--> Errors --(error latency)--> Failures

...or cascaded across system levels

Page 10: Dependability I: Reliable Distributed Systems

Metrics of FT/Dependability

FT/Dependability: Ensuring that a device or a system provides for “sustained delivery” of desired services in the presence of “faults”.

How do we measure FT attributes of “sustained delivery”?

Availability, Reliability, Safety ...

Page 11: Dependability I: Reliable Distributed Systems

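Availability

[Figure: timeline alternating between “System Up” periods (mean length MTTF) and “System Down” periods (mean length MTTR)]

Availability = Total Up-Time / (Total Up-Time + Total Down-Time)

Availability = MTTF / (MTTF + MTTR)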
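As a worked example of the MTTF/MTTR formula, here is a minimal Python sketch. The function name and the sample figures (1000 h between failures, 10 h to repair) are hypothetical and not taken from the slides.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical system: fails every 1000 h on average, takes 10 h to repair.
    print(f"availability = {availability(1000, 10):.4f}")
    # Halving the repair time raises availability without touching MTTF.
    print(f"availability = {availability(1000, 5):.4f}")
```

The second call illustrates that availability can be improved from either side: by failing less often (larger MTTF) or by repairing faster (smaller MTTR).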

Page 12: Dependability I: Reliable Distributed Systems

Availability   Downtime/year   Component
90%            > 1 month       Basic PC
99%            ~4 days         Server
99.9%          ~9 hours        Cluster
99.99%         ~1 hour         XMP
99.999%        ~5 mins         ATT Net Switches
99.9999%       ~32 sec         Engine Controller
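The downtime column can be derived from the availability column by multiplying the unavailable fraction by the length of a year. A minimal Python sketch, assuming a 365-day year (the helper name is illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 h; leap years ignored (assumption)


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime per year, in hours, for an availability in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for a in (0.90, 0.99, 0.999, 0.9999, 0.99999, 0.999999):
        hours = downtime_hours_per_year(a)
        print(f"{a:.4%} available -> {hours:10.4f} h/year ({hours * 60:10.2f} min)")
```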

- Does availability mean anything if the system goes down at a critical time?
- Does availability have diff. meaning if systems are used for diff. durations?

Page 13: Dependability I: Reliable Distributed Systems

Reliability: Probability of a system being operational at (a) a given time instance or (b) over a desired time-interval

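[Figure: probability of being operational (starting at 1) plotted against time (axis ticks at 0, 10, 100), declining towards system failure]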
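The slides give no formula for the reliability curve. A common textbook model, assumed here and not taken from the slides, is a constant failure rate λ, which gives R(t) = exp(-λt). A minimal Python sketch under that assumption, with a hypothetical failure rate:

```python
import math


def reliability(t_hours: float, failure_rate_per_hour: float) -> float:
    """R(t) = exp(-lambda * t): probability of no failure in [0, t].

    Assumes a constant failure rate (exponential model); this is an
    assumption of the sketch, not something the slides specify.
    """
    if t_hours < 0 or failure_rate_per_hour < 0:
        raise ValueError("time and failure rate must be non-negative")
    return math.exp(-failure_rate_per_hour * t_hours)


if __name__ == "__main__":
    lam = 1 / 1000  # hypothetical: one failure per 1000 h on average
    for t in (0, 10, 100, 1000):
        print(f"R({t:4d} h) = {reliability(t, lam):.3f}")
```

Unlike availability, this measure says nothing about repair: it only asks whether the system survives the whole interval.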

Page 14: Dependability I: Reliable Distributed Systems

Availability: Detection, Identification & Repair

Reliability: use of “extra resources” (redundancy) to keep prob. of operation at the desired level.

Note: Initially, we’ll care less about performance, though every feature is obtained at the cost of performance!!!

Initial key objective will be just to deliver desired services ...

Page 15: Dependability I: Reliable Distributed Systems

What is a Distributed System for us?

Collection of resources (OS’s, CPU’s, memory) linked by a data/info. delivery mechanism... providing sustained delivery of desired services
- Distributed OS’s, Servers, Storage, Databases etc

– high performance + high availability + fail-safe reliability!
– more redundancy → more things that can go wrong!

Page 16: Dependability I: Reliable Distributed Systems

Reliability, Availability → Redundancy

• Flavor 1: Physical/Spatial Redundancy [Add resources HW/SW]

• Flavor 2: Temporal Redundancy [Redo tasks]

• ....and combinations… (information redundancy etc)

Page 17: Dependability I: Reliable Distributed Systems

Classical FT & Redundancy

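• Duplex: P1 and P2 run the same task; a comparator (=) checks their results before producing the output

• TMR: P1, P2 and P3 run the same task; a voter (=) produces the output

• Temporal: on an error (X), redo the task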
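The three classical schemes can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the function names are hypothetical, replica outputs are assumed to be hashable values that can be compared for equality, and the voter itself is assumed to be fault-free.

```python
from collections import Counter
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def duplex_compare(out1: T, out2: T) -> Optional[T]:
    """Duplex: detects a mismatch but cannot tell which replica is wrong.

    Returns the common value, or None when the two outputs disagree.
    """
    return out1 if out1 == out2 else None


def tmr_vote(outputs: Sequence[T]) -> Optional[T]:
    """TMR: majority vote over three outputs; masks one faulty replica.

    Returns None when all three outputs differ (no majority).
    """
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three outputs")
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= 2 else None


def temporal_retry(task: Callable[[], T],
                   check: Callable[[T], bool],
                   attempts: int = 3) -> Optional[T]:
    """Temporal redundancy: redo the task until its result passes the check."""
    for _ in range(attempts):
        result = task()
        if check(result):
            return result
    return None


if __name__ == "__main__":
    print(duplex_compare(42, 42))   # both replicas agree
    print(duplex_compare(42, 41))   # mismatch detected, no output
    print(tmr_vote([42, 42, 7]))    # one faulty replica is outvoted
```

Note that duplex only detects an error, TMR masks a single faulty replica, and temporal redundancy helps only against transient faults, since a permanent fault fails again on every retry.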

Page 18: Dependability I: Reliable Distributed Systems

Don’t they suffice for reliability, functionality & usability??

Page 19: Dependability I: Reliable Distributed Systems

• poor performance!!!!

• lock-step synchronization [+ reliable clocks]

• limited extensibility: “spurious” transients

• SPF (single point of failure): relies on “perfect” voters/error-detectors

• limited efficiency and fault coverage (transients?)

– real world driven by (a) heterogeneity and (b) performance!

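[Figure: the TMR configuration (P1, P2, P3 feeding a voter “=” that produces the o/p) and the duplex configuration (P1, P2 feeding a comparator “=” that produces the output)]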

Page 20: Dependability I: Reliable Distributed Systems

Distributed systems... drivers

• efficiency concerns: high perf. + high rel. ?

... loose sync., no SPF

... user/appl. transparency – SW/OS interface

• possibly extended fault handling with capability of graceful degradation, recovery and repair (off-line, on-line)

Page 21: Dependability I: Reliable Distributed Systems

DS paradigms

• centralized or distributed or networked or …
• OS/Middleware/System/Apps
• TDMA, CSMA/CD ...
• …

Page 22: Dependability I: Reliable Distributed Systems

So...

+ lots of non-dedicated resources (possible redundancy…)
+ fewer restrictions on perfect sync. and voters etc

- increased failure rates
- new fault models?

Page 23: Dependability I: Reliable Distributed Systems

Fault models etc..

Earlier:

• Faults → Errors → Failures
• Transients, Intermittent and Permanent

Now:

• OS’s/ Middleware/Applications Processors/
• Timing Issues: msg. delivery aspects...

Page 24: Dependability I: Reliable Distributed Systems

FT: What faults are of interest to us?

basis: physical, logical, environmental perturbations...

• fault duration: transients, intermittent, permanent

• fault nature:
  - data faults: out of range etc etc
  - timing faults: early, late, omission of msgs...

• fault-severity: fail-stop, fail-omission, fail-safe, ...

unless we can “detect” an error, any fault model is pretty useless!!
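To make the point about detection concrete, the data and timing fault classes listed above can be turned into a simple checker. This is a minimal Python sketch under assumed conventions: the valid range, the delivery window, and the use of `None` for a message that never arrived are all hypothetical choices of this example.

```python
from typing import Optional


def classify_timing(arrival: Optional[float],
                    earliest: float,
                    latest: float) -> str:
    """Classify a message against its expected delivery window.

    arrival is None when the message never arrived (omission).
    """
    if arrival is None:
        return "omission"
    if arrival < earliest:
        return "early"
    if arrival > latest:
        return "late"
    return "ok"


def classify_data(value: float, low: float, high: float) -> str:
    """Classify a value against its valid range (a simple data-fault check)."""
    return "ok" if low <= value <= high else "out of range"


if __name__ == "__main__":
    # Hypothetical window: messages are expected between t=10 and t=20.
    for t in (5.0, 15.0, 25.0, None):
        print(t, "->", classify_timing(t, earliest=10.0, latest=20.0))
    print(150, "->", classify_data(150, low=0, high=100))
```

Checks like these only catch faults that violate a known bound; a wrong value that is in range and on time passes both.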

Page 25: Dependability I: Reliable Distributed Systems

Do these suffice ?

Page 26: Dependability I: Reliable Distributed Systems

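New fault models... Byzantine Faults

[Figure: node A sends 30 to B and 70 to C; both values lie in the valid data range {0,100}. After B and C exchange what they received, B holds {30,70} and C holds {70,30}.]

• basic distributed functions (sync. etc) also need to be tolerant to the expanded fault models → expensive!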
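The scenario in the figure can be replayed in a few lines of Python. This is a minimal sketch under the assumption that B and C honestly relay to each other what they received from A; node names and values follow the figure, while the function names are illustrative.

```python
from typing import Dict, Tuple

VALID_RANGE = (0, 100)  # "valid data: {0,100}" from the figure


def in_range(value: int) -> bool:
    """Range check, the only data-fault detector available to B and C."""
    return VALID_RANGE[0] <= value <= VALID_RANGE[1]


def exchange(sent_by_a: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """A sends one value to each receiver; receivers then relay it to the peer.

    Each receiver's view is (value received from A, value relayed by peer).
    Assumes B and C relay honestly (assumption of this sketch).
    """
    views = {}
    for node, peer in (("B", "C"), ("C", "B")):
        views[node] = (sent_by_a[node], sent_by_a[peer])
    return views


if __name__ == "__main__":
    views = exchange({"B": 30, "C": 70})  # A is "two-faced"
    print(views)
    # Every value passes the range check, yet the two views conflict.
    print(all(in_range(v) for view in views.values() for v in view))
```

Both receivers see only valid values, so the earlier fault classes (out of range, early, late, omission) do not flag anything, and neither B nor C can tell from its own view whether A or the peer is the faulty node.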

Page 27: Dependability I: Reliable Distributed Systems

DS Fault/Failure Semantics

• Need to now see what role a resource can play (by itself and combined with others) to affect the overall system ops....

• Failure Semantics: F.Stop, F.Safe, F.Silent (& how can we make nodes follow these semantics ???)

Fail Stop -> Crash -> Omission -> Timing -> Incorrect Computation -> Byzantine

Page 28: Dependability I: Reliable Distributed Systems

DS Issues...

- new system + fault/failure semantics
- error occurrence + detection capabilities
- reliable data transfer issues
- dist. system co-ordination issues
- overheads, response time, stability

distr. systems: new capabilities, new headaches

Page 29: Dependability I: Reliable Distributed Systems

Issues..

1. Basic objective is still the same: redundant resources and redundant info. processing to mask, detect and handle errors; the aggregate system needs to be more reliable than its components…

2. Changes: strict HW paradigms change to include HW, SW, OS, middleware & comm. issues

3. Cycle by cycle co-ordination does not work

Page 30: Dependability I: Reliable Distributed Systems

How do we get the dist. resources to co-ordinate to get useful output and FT functionality?

Page 31: Dependability I: Reliable Distributed Systems

DS Co-ordination

Asynchronous or Synchronous?

1. Given redundant copies of information, how does one do “co-ordination” to come up with an agreed result?

2. How are distributed tasks/requests “ordered”?

3. Given distributed resources, how do they agree on a “single” course of action?

Page 32: Dependability I: Reliable Distributed Systems

Asynchronous: “Single” Decision - Commit

2PC: Two Phase Commit Protocols
– coordinator (pre-specified or selected dynamically)
– multiple secondary sites (“cohorts”)

Objective: How do all nodes agree and execute a single decision? [all agree or no action taken...]

Page 33: Dependability I: Reliable Distributed Systems

Co-ordinator:

1. send COMMIT-REQ to all
   .......“bounded waiting”.......
------------------------------------
4. receive AGREE from all: put COMMIT in log & send COMMIT to all
4’. receive ABORT: send ABORT to all
5. ACK from all? DONE

Cohorts:

2. get msg(COMMIT-REQ)
3. if ready, send AGREE (write undo/redo logs); else, send ABORT
------------------------------------
4. receive COMMIT: release resources, send ACK
4’. receive ABORT: undo actions
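The message flow above can be sketched as a small in-process simulation in Python. This is a simplified illustration under stated assumptions, not a production protocol: direct method calls stand in for messages, communication is assumed reliable, and the “bounded waiting” timeout and crash recovery are not modeled. The class and function names are hypothetical.

```python
from enum import Enum
from typing import List


class Vote(Enum):
    AGREE = "AGREE"
    ABORT = "ABORT"


class Cohort:
    """A secondary site taking part in the commit decision."""

    def __init__(self, name: str, ready: bool) -> None:
        self.name = name
        self.ready = ready
        self.log: List[str] = []
        self.state = "INIT"

    def on_commit_req(self) -> Vote:
        # Steps 2-3: write undo/redo logs before voting AGREE.
        if self.ready:
            self.log.append("undo/redo")
            self.state = "READY"
            return Vote.AGREE
        self.state = "ABORTED"
        return Vote.ABORT

    def on_commit(self) -> str:
        # Step 4: make the change permanent, release resources, send ACK.
        self.state = "COMMITTED"
        return "ACK"

    def on_abort(self) -> None:
        # Step 4': undo any tentative actions.
        if self.state == "READY":
            self.log.append("undone")
        self.state = "ABORTED"


def two_phase_commit(cohorts: List[Cohort], coordinator_log: List[str]) -> str:
    """Run one commit round; returns "COMMIT" or "ABORT"."""
    # Phase 1: send COMMIT-REQ to all and collect the votes.
    votes = [cohort.on_commit_req() for cohort in cohorts]

    # Phase 2: commit only if every cohort agreed.
    if all(vote is Vote.AGREE for vote in votes):
        coordinator_log.append("COMMIT")  # log the decision before sending it
        acks = [cohort.on_commit() for cohort in cohorts]
        return "COMMIT" if all(ack == "ACK" for ack in acks) else "ABORT"

    for cohort in cohorts:
        cohort.on_abort()
    return "ABORT"


if __name__ == "__main__":
    print(two_phase_commit([Cohort("c1", True), Cohort("c2", True)], []))
    print(two_phase_commit([Cohort("c1", True), Cohort("c2", False)], []))
```

A single unready cohort forces every site to abort, which is the “all agree or no action taken” property. The sketch also makes one weakness visible: cohorts in the READY state hold their resources until the co-ordinator's decision arrives.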

Page 34: Dependability I: Reliable Distributed Systems

Comments...

• time-lag in making decisions – control applications??
• resources locked till voting/decision is completed
• msg. overhead
• reliable commn. assumptions
• possibilities of livelock + deadlock
• limited fault tolerance

Page 35: Dependability I: Reliable Distributed Systems

...and..

Synchronous System Approaches

• How to achieve FT Synchronization, at what level?
• How to achieve FT Agreements?
• How to build operations on top of dist. sync. functions?

Page 36: Dependability I: Reliable Distributed Systems

…course coverage

• Distributed Systems (DS) & Dependability
  – Chaps 1,2,3 :-: 6,7,8: DS foundations, DS services, Dependability Semantics, TCP/IP Servers

• SW/OS & Dependability
  – SW design techniques: recovery blocks, n-version SW, …
  – SW robustness methodologies: assertions, wrappers

• Verification & Validation/Testing (System, SW, OS)
  – Tools and Techniques (Experimental, Analytical, Formal)

• Security & Dependability

Page 37: Dependability I: Reliable Distributed Systems

… labs

• L1: Semaphores
• L2: Ordering
• L3: Group Communication
• L4: Leader Election
• L5: Voting Algorithms
• L6: Byzantine Agreement
• L7: 2PC Protocols

Lab accounts after lecture next week
L1 handed out after Lecture 2 ~Nov 8th or so…
Mail Matrikelnummer + FB # to Andreas!!

Page 38: Dependability I: Reliable Distributed Systems

Logistics

NEW LECTURE ROOM: E 215: Piloty Bldg.- Hochschulstr. 10

Website: www.deeds.informatik.tu-darmstadt.de