22
Dependability and Resilience of Computing Systems Jean-Claude Laprie

Dependability and Resilience of Computing Systems

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Dependability and Resilience of Computing Systems

0

Dependability and Resilienceof Computing Systems

Jean-Claude Laprie

Page 2: Dependability and Resilience of Computing Systems

1

Dependability: an integrative concept☞ Based on, and elaboration upon ‘A. Avizienis, J.C. Laprie, B.

Randell, C. Landwehr, Basic Concepts and Taxonomy ofDependable and Secure Computing, IEEE Tr. Dependableand Secure Computing, 2004’

Resilience: a framework for ubiquitous computingchallenges☞ Based on, and elaboration upon outcomes of

the European Network of Excellence ReSIST(Resilience for Survivability in InformationSociety Technologies)

Conclusion

Page 3: Dependability and Resilience of Computing Systems

2

SafetyReliability ConfidentialityAvailability Integrity Maintainability

FaultPrevention

FaultTolerance

FaultRemoval

FaultForecasting

Faults Errors FailuresActivation Propagation Causation

FaultsFailures …… Causation

Attributes

Threats

Means

Dependability: delivery of service that can justifiably be trusted, thusavoidance of failures that are unacceptably frequent orsevere

Page 4: Dependability and Resilience of Computing Systems

3

Attributes

AvailabilityReliabilitySafetyConfidentialityIntegrityMaintainability

Dependability Means

Fault PreventionFault ToleranceFault RemovalFault Forecasting

ThreatsFaultsErrorsFailures

SecurityAuthorizedactions

Page 5: Dependability and Resilience of Computing Systems

4

Faultforecating

Ordinal or qualitative evaluation

Probabilistic orquantitative evaluation

ModelingOperational testing

Faulttolerance

Error detection

System recoveryError handlingFault handling

Development

Static verificationDynamic verification

VerificationDiagnosisCorrectionNon-regression verificationFault

removalOperational life Corrective or preventive maintenance

Means

Faultprevention

Attributes

Availability

SafetyConfidentialityIntegrityMaintainability

ThreatsFaultsErrorsFailures

DevelopmentPhysicalInteraction

Dependability

Reliability

Page 6: Dependability and Resilience of Computing Systems

5

Dependability definition

First part: delivery of service that can justifiably be trusted

☞ Enables to generalize availability, reliability, safety, confidentiality,integrity, maintainability, that are then attributes of dependability

Second part: avoidance of service failures that are unacceptably frequent orsevere

☞ Failure frequency and severity ➜ risk assessment and management

☞ A system can, and usually does, fail. Is it however still dependable?When does it become undependable?

criterion for deciding whether or not, in spite of service failures, asystem is still to be regarded as dependable

☞ Unacceptably frequent or severe service failures ➜ dependabilityfailure, connected to development process failure

Page 7: Dependability and Resilience of Computing Systems

6

Dependability attributes

Availability, Reliability, Safety, Confidentiality, Integrity, Maintainability:Property-based attributes

Event-based attributes

Robustness: persistence of dependability with respect to external faults

Survivability: persistence of dependability in the presence of active fault(s)

Resilience: persistence of dependability when facing functional,environmental, or technological, changes

Attributes based on distinguishing among various types of (meta-)information

Accountability: availability and integrity of the person who performed anoperation

Authenticity: integrity of a message content and origin, and possibly someother information, such as the time of emission

Non-repudiability: availability and integrity of the identity of the sender of amessage (non-repudiation of the origin), or of the receiver (non-repudiation of reception)

Page 8: Dependability and Resilience of Computing Systems

7

Fault Error Failure

Deviation of thedelivered service

from correct service,i.e., implementing

the system function

Part of systemstate that may

cause asubsequent

failure

Adjudged orhypothesizedcause of an

error

Failure… …Fault

System doesnot comply withthe specificationdescribing the

system function

Specificationdoes not

adequatelydescribe the

system function

Dependability Threats

PropagationActivation Causation

Taking advantage ofa vulnerability, i.e.,

an internal faultwhich enables the

system to be harmed

Computationprocess

Internalfault

Externalfault

Contextdependent

Page 9: Dependability and Resilience of Computing Systems

8

Faults Errors Failures FaultsFailures …… PropagationActivation CausationCausation

Content failures

DomainTiming failures

Signaled failures

Unsignaled failuresDetectability

ConsistencyConsistent failures

Inconsistent(Byzantine) failures

Catastrophic failuresConsequences

Minor failures

IdentificationLatent errors

Detected errors

ExtentSingle errors

Multiple related errors

Page 10: Dependability and Resilience of Computing Systems

9

Faults Errors Failures FaultsFailures …… PropagationActivation CausationCausation

Orig

in

Man

ifest

atio

n

Phase of creationor occurrence

Development faults

Operational faults

System boundaries

Internal faults

External faults

Phenomenologicalcause

Natural faults

Human-made faults

IntentMalicious faults

Non-malicious faults

Capability

Accidental faults

Incompetence faults

Deliberate faults

PersistencePermanent faults

Transient faults

Activationreproducibility

Solid faults

Elusive faults

Hardware faults

Software faults

System faults

Dimension

Omission faults

Commission faultsStyle

PluralitySingle faults

Multiple faults

CorrelationIndependent faults

Related faults

DamageSoft faults

Hard faults

StatusDormant faults

Active faults

Page 11: Dependability and Resilience of Computing Systems

10

Faults— by origin —

Non-malicious

Malicious Non-malicious.

Non-malicious.

Non-malicious Non-malicious MaliciousIntent

Internal Internal ExternalSystem boundaries

Development OperationalPhase of creationor occurrence

Accid. Incomp. Accid. Accidental Accidental Accid. Delib. Incomp. DeliberateDelib. Delib.Capability

Phenomenologicalcause Human-madeNatural Natural NaturalHuman-made

Development faults Physical faults Interaction faults

Maliciouslogic

PhysicalDeterior.

PhysicalInterference

IntrusionAttempts

VirusesWorms

Input Mistakes

Overload

Configuration andReconfiguration

Mistakes

Design Flaws

Production Defects

Page 12: Dependability and Resilience of Computing Systems

11

Style

Persistence

Faul

ts b

y m

anife

stat

ion

Dimension

Faults by origin

Physical faults Development faults Interaction faults

Plurality

Omission faults ● ● ●

Correlation

Damage

Status

Activationreproducibility

Commission faults ● ● ●

Permanent faults ● ● ●

Transient faults ● ●

Hardware faults ● ● ●

Software faults ● ●

System faults ● ● ●

Single faults ● ● ●

Multiple faults ● ● ●

Independent faults ● ● ●

Related faults ● ● ●

Hard faults ● ● ●

Soft faults ● ● ●

Dormant faults ● ●

Active faults ● ● ●

Solid faults ● ● ●

Elusive faults ● ● ●

Page 13: Dependability and Resilience of Computing Systems

12

A few milestones

12th IEEE Int. Symposium on Fault-Tolerant Computing, Santa Monica, June1982 Session ‘Fundamental Concepts of Fault Tolerance’, papers by

A. Avizienis, H. Kopetz, J.C. Laprie and A. Costes, A.S. Robinson,T. Anderson and P.A. Lee, P.A. Lee and D. Morgan

Session ‘Prospective on the State-of-the-Art’, position paper byW.C. Carter

Paper by J.C. Laprie, 15th IEEE Int. Symposium on Fault-TolerantComputing, Ann Arbor, June 1985, Dependable Computing and FaultTolerance: Concepts and Terminology

Publication by J.C. Laprie of the book Dependability: Basic Concepts andTerminology — in English, French, German, Italian, Japanese, SpringerVerlag, 1992, vol. 5 of the series ‘Dependable Computing and Fault-TolerantSystems’, A. Avizienis, H. Kopetz, J.C. Laprie, eds.

Paper by A. Avizienis, J.C. Laprie, B. Randell, and C. Landwehr, BasicConcepts and Taxonomy of Dependable and Secure Computing, IEEETransactions on Dependable and Secure Computing, 2004

Page 14: Dependability and Resilience of Computing Systems

13

☞ Strength of the dependability concept: integrative role Enables the more classical notions of reliability, availability, safety,

confidentiality, integrity, maintainability, etc. to be put into perspective, asattributes of dependability

Model provided for the means for achieving dependability Those means are much more orthogonal to each other than the more

classical classification according to the attributes of dependability Facilitates the unavoidable trade-offs, as the attributes tend to conflict

with each other Fault - Error - Failure model

Understanding and mastering the threats Unified presentation of the threats, while explaining their specificities

via the various fault classes

☞ Concept and terminology widely accepted by international community IEEE Fault-Tolerant Computing Symposium + IFIP Dependable Computing

for Critical Applications Conference ➠ IEEE/IFIP Conference onDependable Systems and Network

Regional (Europe, Pacific Rim, South America) Fault Tolerant ComputingConferences ➠ Dependable Computing Conferences

Page 15: Dependability and Resilience of Computing Systems

14

Ubiquitous systems: future large, networked, continuously evolving systemsconstituting complex information infrastructures —perhaps involving everything fromsuper-computers and huge server "farms" to myriads of small mobile computers andtiny embedded devices

Page 16: Dependability and Resilience of Computing Systems

15

Short term[e.g., second to hours, as indynamically changing systems]

Functional[incl. Specification faults]

Foreseen[e.g., new versioning]

Medium term[e.g., hours to months, as inreconfigurations or new versioning]

Environmental[incl. interaction faults]

Foreseeable[e.g., advent of newhardware platforms]

Long term[e.g., months to years, as inreorganisations resulting frommergers, or in military coaltions]

Technological[incl. developmentand physical faults]

Unforeseen[e.g., drastic changes inservivce requests ornew types of threats]

At stake: Maintain dependability in spite of changes

Threat evolution

Resilience: persistence of dependability when facing changes

TimingNature Prospect

Attacks Mismatches in evolutionary changes Side-effects in emerging behaviors Increasing importance of configuration faults Increasing proportion of hardware transient faults

Page 17: Dependability and Resilience of Computing Systems

16

Adjective Resilient In use for 30+ years Recently, escalating use ➔ buzzword Used essentially as synonym, or

substitute, to fault tolerant Noteworthy exception: preface

of Resilient Computing Systems,T. Anderson (Ed.), Collins, 1985

«The two key attributes here are dependabilityand robustness. […] A computing system canbe said to be robust if it retains its ability todeliver service in conditions which are beyondits normal domain of operation»

➠ in dependability and security ofcomputing systems

Adaptation tochanges, andgetting back

after a setback

Materialscience

Ecology

Childpsychiatryandpsychology

Industrialsafety

Business

Socialpsychology

➠ in other domains

Resilience

Generalisation of fault tolerance

change tolerance

Page 18: Dependability and Resilience of Computing Systems

17

Technologies for resilience

Changes Evolvability☞ Self evolvability: adaptation

Trusted service Assessability☞ Verification and evaluation

Complex systems Diversity☞ Taking advantage of existing diversity,

and augmenting it

Ambient intelligence Usability☞ Human and system users

Fault Removal

Fault Forecasting

Assessability

Fault Prevention

Fault Tolerance

Evolvability

Usability

Diversity

● ●Means fordependability

Page 19: Dependability and Resilience of Computing Systems

18

Assessability

Continuous changes

Need for complementingpre-deployment verification and evaluation

withoperational verification and evaluation

Emerging behaviors due tovast number of interactions

Vast number ofconfigurations,

includingdependencies

not existing priorto deployment

Elusive,context-

dependent,faults

Concurrentexecutions dueto prevalence of

multi-coresystems

Page 20: Dependability and Resilience of Computing Systems

19

Project on operational assessment under elaboration

Continuous testing Perturbation injection

System avatars created from system snapshots— in the deployment environment, without altering

system’s state —

Failure predictionvia estimation of

failure rateincrease

Uncoveringconfiguration faults,

elusive faults,concurrency faults

Page 21: Dependability and Resilience of Computing Systems

20

Expectations of the computing science and engineeringcommunity 50th anniversary issue of Communications of the ACM (January 2008) : most

articles mention dependability or resilience as a major factor (Jeannette Wing,Rodney Brooks, Gul Agha, John Crowcroft, Gordon Bell)

☞ Rodney Brooks : « New formalisms will let us analyze complex distributed systems,producing new theoretical insights that lead to practical real-world payoffs. Exactlywhat the basis for these formalisms will be is, of course impossible to guess. My ownbet is on resilience and adaptability »

March 2008 issue of IEEE Computer, feature section devoted to softwareengineering in the 21st century

☞ Barry Boehm : « In the 21st century, software engineers face the often formidablechallenges of simultaneously dealing with rapid change, uncertainty and emergence,dependability, diversity, and interdependence »

Page 22: Dependability and Resilience of Computing Systems

21

Ubiquitous computing systems: integral part of society

Dependability of currentcomputing systems• Web sites availability:

one to two 9s• Well managed systems

availability: four 9s

Dependability of classicalessential infrastructures —

public transportation,energy and water supply,

fixed phone: availability ofat least five 9s

Society problem

Societal reponsibility

//➊Mismatch

➋Increasingdependency

One 9

Two 9s

Three 9s

Four 9s

Five 9s

Six 9s

36j 12h0,9

3j 16h0,99

8h 46mn0,999

52mn 34s0,9999

5mn 15s0,99999

32s0,999999

Outageduration/yearAvailability