Page 1

Critical systems

Lecture 6

- Critical systems
- Critical systems specification
- Critical systems development
- Critical systems validation

This lecture is based on Chapters 3, 9, 20, 24

Page 2

Critical systems

Safety-critical systems
- A failure may result in injury, loss of life or major environmental damage.
- Examples: a chemical manufacturing plant, an aircraft.

Mission-critical systems
- A failure may result in the failure of some goal-directed activity.
- Example: the navigational system for a spacecraft.

Business-critical systems
- A failure may result in the failure of the business using that system.
- Example: a customer account system in a bank.

Critical systems are systems where failures can result in significant economic losses, physical damage or threats to human life. These failures have high severity and high cost.

Page 3

Dependability

- Availability: the ability of the system to deliver services when requested (measured as a probability).
- Reliability: the ability of the system to deliver services as specified (measured as a probability).
- Safety: the ability of the system to operate without catastrophic failure (a judgement).
- Security: the ability of the system to protect itself against accidental or deliberate intrusion (a judgement).

Dependability = Trustworthiness

A dependable system is trusted by its users. Trustworthiness essentially means the degree of user confidence that the system will operate as they expect and that the system will not 'fail' in normal use. Systems range from not dependable, through very dependable, to ultra-dependable.

Page 4

Insulin pump data-flow

The insulin pump system monitors blood sugar levels and delivers an appropriate dose of insulin. The system does not need to implement all dependability attributes. It must be:
- Available: to deliver insulin when required.
- Reliable: to deliver the correct amount of insulin.
- Safe: it should not cause excessive doses of insulin.

It need not be secure: it is not exposed to external attacks.

[Hardware diagram: needle assembly, sensor, display 1, display 2, alarm, pump, clock, controller, power supply, insulin reservoir.]

[Data-flow diagram: blood → sensor (blood parameters) → blood sugar analysis (blood sugar level) → insulin requirement computation (insulin requirement) → insulin delivery controller (pump control commands) → pump → insulin.]

Failures may occur in hardware, in software, or through human mistakes.

Page 5

Other dependability properties

- Repairability: reflects the extent to which the system can be repaired in the event of a failure.
- Maintainability: reflects the extent to which the system can be adapted to new requirements.
- Survivability: reflects the extent to which the system can deliver services whilst under hostile attack.
- Error tolerance: reflects the extent to which user input errors can be avoided and tolerated.

Page 6

Increasing dependability

Increasing the dependability of a system is expensive. High levels of dependability can only be achieved at the expense of system performance, because of the extra and redundant code needed for checking states and for recovery.

Reasons for prioritising dependability:
- Systems that are unreliable, unsafe or insecure are often unused.
- System failure costs may be enormous.
- It is difficult to retrofit dependability.

[Graph: cost rises steeply with the dependability level required (low, medium, high, very high, ultra-high), due to additional design, implementation and validation effort.]

Page 7

Availability and reliability

- Reliability: the ability of a system or component to perform its required function under stated conditions for a specified period of time.
- Availability: the degree to which a system or component is operational and accessible when required for use.

System A fails once a year and takes one week to repair. System B fails once a month and takes 5 minutes to repair. Some users may tolerate frequent failures as long as the system can recover quickly from them.
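The two systems above can be compared with the standard steady-state availability formula, MTTF / (MTTF + MTTR). A minimal sketch in Python, assuming hour units for all times:

```python
# Availability as the uptime fraction: MTTF / (MTTF + MTTR).
# Times are in hours; the figures are the slide's System A and System B.

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time to failure and mean time to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)

# System A: fails once a year, takes one week to repair.
avail_a = availability(mttf_hours=365 * 24, mttr_hours=7 * 24)

# System B: fails once a month, takes 5 minutes to repair.
avail_b = availability(mttf_hours=30 * 24, mttr_hours=5 / 60)

print(f"System A availability: {avail_a:.4f}")   # ~0.9812
print(f"System B availability: {avail_b:.6f}")   # ~0.999884
```

System B fails twelve times more often than System A yet is far more available, which is exactly the slide's point: users who need the service continuously may prefer frequent-but-quickly-repaired failures.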

Page 8

Reliability

- System failure: an event that occurs at some point in time when the system does not deliver a service as expected by its users.
- System error: erroneous system behaviour, where the behaviour of the system does not conform to its specification.
- System fault: an incorrect system state, i.e., a system state that is unexpected by the designers of the system.
- Human error: human behaviour that results in the introduction of faults or mistakes into a system.


[Class diagram contrasting an alternative terminology with Sommerville's definitions: a Problem (Report) — report ID, title — is either a Dynamic Problem (a failure) or a Static Problem (a defect); it is caused by a Cause — cause ID, cause type — which may be a non-software cause, a defect, or some other cause.]

Page 9

Approaches to improving reliability

- Fault avoidance: techniques that minimise the possibility of mistakes and/or trap mistakes before they result in the introduction of system faults (e.g. use of static analysis to detect faults).
- Fault detection and removal: verification and validation techniques that increase the chances of detecting and removing faults before the system is used (e.g. systematic testing).
- Fault tolerance: techniques ensuring that faults in a system do not result in system errors, or that system errors do not result in system failures (e.g. self-checking facilities, redundant system modules).
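Fault tolerance through redundant modules can be sketched as triple modular redundancy: run three independently written versions of a computation and take a majority vote, so a fault in any single version is masked. The three channels below are hypothetical stand-ins for diverse implementations, with one deliberately faulty to show the masking:

```python
# Minimal sketch of fault tolerance through redundancy (TMR): three
# independently implemented "channels" compute the same function and a
# majority vote masks a fault in any single channel.
from collections import Counter

def channel_a(x: float) -> float:
    return x * x          # correct implementation

def channel_b(x: float) -> float:
    return x ** 2         # independently written, also correct

def channel_c(x: float) -> float:
    return x * x + 1      # deliberately faulty, to demonstrate masking

def tmr(x: float) -> float:
    """Return the majority result of the three redundant channels."""
    results = [channel_a(x), channel_b(x), channel_c(x)]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: fault cannot be masked")
    return value

print(tmr(3.0))  # 9.0 -- channel_c's fault is outvoted
```

The design choice is that the channels must fail independently; a common specification error would defeat the vote, which is why real N-version programming uses separate teams.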

Page 10

Input/output mapping

The reliability of the system is the probability that a particular input will not lie in the set of inputs that cause erroneous outputs.

[Diagram: the program maps the set of possible inputs to outputs; a subset Ie of the inputs causes the erroneous outputs Oe.]

[Diagram: users 1, 2 and 3 each exercise different subsets of the possible inputs; only User 2's inputs overlap the erroneous inputs.]

Different people will use the system in different ways; in the figure, only User 2's inputs fall in the erroneous set, so only User 2 will encounter system failures.

Removing X% of the faults in a system will not necessarily improve the reliability by X%. At IBM, removing 60% of product defects resulted in only a 3% improvement in reliability. Program defects may be in rarely executed sections of the code, so they may never be encountered by users; removing them does not affect the perceived reliability. A program with known faults may therefore still be seen as reliable by its users.
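The IBM observation can be made concrete with a toy model (the numbers are invented for illustration, not IBM's data): if most faults sit in rarely executed code, removing the majority of them barely changes the failure probability users observe.

```python
# Illustrative model: each fault sits in a code region executed with some
# probability per run. Most faults live in rarely executed ("cold") code,
# so removing them barely moves the observed failure rate.

def failure_prob(fault_exec_probs):
    """Probability that at least one faulty code region is executed in a run."""
    p_no_failure = 1.0
    for p in fault_exec_probs:
        p_no_failure *= (1.0 - p)
    return 1.0 - p_no_failure

# 10 faults: one in a hot path, nine in rarely executed code.
all_faults = [0.05] + [0.0005] * 9
# Remove 60% of the faults -- the six that happen to be in cold code.
after_removal = [0.05] + [0.0005] * 3

before = failure_prob(all_faults)
after = failure_prob(after_removal)
improvement = (before - after) / before

print(f"failure prob before: {before:.4f}")           # ~0.0543
print(f"failure prob after:  {after:.4f}")            # ~0.0514
print(f"reliability improvement: {improvement:.1%}")  # ~5%, from removing 60% of faults
```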


Page 11

Safety

Safety is concerned with ensuring that the system cannot cause damage.

Types of safety-critical software:
- Primary safety-critical software: malfunctioning software directly causes hardware malfunction, resulting in human injury or environmental damage.
- Secondary safety-critical software: malfunctioning software results in design defects, which in turn pose a threat to humans and the environment.

Page 12

Unsafe reliable systems

- Specification defects: if the system specification is incorrect, then the system can behave as specified but still cause an accident.
- Hardware failures may cause the system to behave in an unpredictable way; these are hard to anticipate in the specification.
- Correct individual operator inputs may lead to system malfunction in specific contexts; this is often the result of operator mistakes.

Page 13

Safety terms

- Accident (or mishap): an unplanned event or sequence of events which results in human death or injury, or damage to property or to the environment. A computer-controlled machine injuring its operator is an example of an accident.
- Hazard: a condition with the potential for causing or contributing to an accident. A failure of the sensor that detects an obstacle in front of a machine is an example of a hazard.
- Damage: a measure of the loss resulting from a mishap. Damage can range from many people killed as a result of an accident to minor injury or property damage.


Page 14

Safety terms (continued)

- Hazard severity: an assessment of the worst possible damage which could result from a particular hazard. Hazard severity can range from catastrophic, where many people are killed, to minor, where only minor damage results.
- Hazard probability: the probability of the events occurring which create a hazard. Probability values tend to be arbitrary but range from probable (say a 1/100 chance of the hazard occurring) to implausible (no conceivable situations are likely in which the hazard could occur).
- Risk: a measure of the probability that the system will cause an accident. The risk is assessed by considering the hazard probability, the hazard severity and the probability that the hazard will result in an accident.


Page 15

Assuring safety

Assuring safety means ensuring either that accidents do not occur or that the consequences of an accident are minimal.

- Hazard avoidance: the system is designed so that some classes of hazard simply cannot arise (e.g. a cutting machine designed so that the operator's hands cannot be in the blade pathway).
- Hazard detection and removal: the system is designed so that hazards are detected and removed before they result in an accident (e.g. a chemical plant system detects excessive pressure and opens a relief valve to reduce the pressure before an explosion occurs).
- Damage limitation: the system includes protection features that minimise the damage that may result from an accident (e.g. fire extinguishers in an aircraft engine).

Most accidents are due to a combination of malfunctions rather than a single failure.
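Hazard avoidance by design can be sketched with the cutting-machine example: a two-hand interlock means the hands-in-blade-path hazard cannot arise while the blade runs, because both hands must be on separate start buttons. A toy sketch, with illustrative names:

```python
# Sketch of hazard avoidance by design: a two-hand interlock for a cutting
# machine. The blade may only run while BOTH of the operator's hands are on
# separate start buttons, so the hands cannot be in the blade pathway.

def blade_may_run(left_button_pressed: bool, right_button_pressed: bool) -> bool:
    # The hazard (a hand near the blade) cannot arise while both buttons
    # are held down, so the interlock requires both simultaneously.
    return left_button_pressed and right_button_pressed

print(blade_may_run(True, True))    # True  -- both hands accounted for
print(blade_may_run(True, False))   # False -- blade stops immediately
```

The hazard is designed out rather than detected: there is no state in which the blade runs while a hand could reach it.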

Page 16

Security

The security of a system is an assessment of the extent to which the system protects itself from external attacks, which may be accidental or deliberate:
- virus attacks
- unauthorised use of system services
- unauthorised modification of the system and its data.

Security is becoming increasingly important as systems are networked, so that external access to the system through the Internet is possible. Security is an essential prerequisite for availability, reliability and safety.

Security terms:
- Exposure: possible loss or harm in a computing system (analogous to an accident).
- Vulnerability: a weakness in a computer-based system that may be exploited to cause loss or harm (analogous to a hazard).
- Attack: an exploitation of a system vulnerability.
- Threat: circumstances that have the potential to cause loss or harm.
- Control: a protective measure that reduces a system vulnerability.


Page 17

Types of damage caused by external attack

For some types of critical system, security is the most important dimension of system dependability, for instance in systems managing confidential information.

- Denial of service: normal services are unavailable, or service provision is significantly degraded.
- Corruption of programs or data: the programs or data in the system may be modified in an unauthorised way.
- Disclosure of confidential information: information may be exposed to people who are not authorised to read or use it.

Page 18

Security assurance

- Vulnerability avoidance: the system is designed so that vulnerabilities do not occur (e.g. if there is no network connection, then external attack is impossible).
- Attack detection and elimination: the system is designed so that attacks on vulnerabilities are detected and neutralised before they result in an exposure (e.g. virus checkers find and remove viruses before they infect a system).
- Exposure limitation: the system is designed so that the adverse consequences of a successful attack are minimised (e.g. a backup policy allows damaged information to be restored).

Page 19

Critical systems

Lecture 6

- Critical systems
- Critical systems specification
- Critical systems development
- Critical systems validation

Page 20

Critical systems specification

The specification for critical systems must be of high quality and accurately reflect the needs of users. Types of requirements:

- System functional requirements define error checking, recovery facilities and other features.
- Non-functional requirements for availability and reliability.
- 'Shall not' requirements, for safety and security; these are sometimes decomposed into more specific functional requirements.

Examples:
- The system shall not allow users to modify access permissions on any files that they have not created (security).
- The system shall not allow reverse thrust mode to be selected when the aircraft is in flight (safety).

Page 21

Stages of risk-based analysis

- Risk identification: identify potential risks that may arise.
- Risk analysis and classification: assess the seriousness of each risk.
- Risk decomposition: decompose risks to discover their potential root causes.
- Risk reduction assessment: define how each risk must be eliminated or reduced when the system is designed.

[Process diagram: risk identification (producing a risk description) → risk analysis and classification (risk assessment) → risk decomposition (root cause analysis) → risk reduction assessment (dependability requirements).]

The process is applicable to any dependability attribute.

Page 22

Risk identification

Identify the risks faced by the critical system. In safety-critical systems, the risks are the hazards that can lead to accidents. In security-critical systems, the risks are the potential attacks on the system.

Risks identified for the insulin pump:
- Insulin overdose (service failure).
- Insulin underdose (service failure).
- Power failure due to exhausted battery (electrical).
- Electrical interference with other medical equipment (electrical).
- Poor sensor and actuator contact (physical).
- Parts of machine break off in body (physical).
- Infection caused by introduction of machine (biological).
- Allergic reaction to materials or insulin (biological).


Page 23

Risk analysis and classification

The process is concerned with understanding the likelihood that a risk will arise and the potential consequences if an accident or incident should occur.

[ALARP diagram of acceptability levels: an unacceptable region, where the risk cannot be tolerated; the ALARP region, where the risk is tolerated only if risk reduction is impractical or grossly expensive; and an acceptable region of negligible risk.]


Page 24

Risk analysis and classification (continued)

Estimate the risk probability and the risk severity. The aim must be to identify risks that are likely to arise or that have high severity.

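One common way to combine the two estimates is a probability × severity scoring mapped onto the ALARP acceptability regions. A sketch with invented scores and thresholds (real projects take these from domain standards, not from code):

```python
# Sketch of risk classification: combine an estimated probability class and a
# severity class into an acceptability region. The scoring and thresholds are
# invented for illustration.

PROBABILITY = {"implausible": 1, "improbable": 2, "probable": 3}
SEVERITY = {"minor": 1, "serious": 2, "catastrophic": 3}

def classify(probability: str, severity: str) -> str:
    score = PROBABILITY[probability] * SEVERITY[severity]
    if score >= 6:
        return "unacceptable"
    if score >= 3:
        return "ALARP"          # tolerated only if reduction is impractical
    return "acceptable"

# Insulin pump examples (the classifications are illustrative):
print(classify("probable", "catastrophic"))   # unacceptable
print(classify("improbable", "serious"))      # ALARP
print(classify("implausible", "minor"))       # acceptable
```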

Page 25

Risk decomposition: the fault-tree technique

[Fault tree for the hazard "incorrect insulin dose administered", an OR over three branches: (1) incorrect sugar level measured — sensor failure OR sugar computation error (arithmetic error OR algorithm error); (2) correct dose delivered at wrong time — timer failure; (3) delivery system failure — pump signals incorrect, due to incorrect insulin computation (arithmetic error OR algorithm error).]

A deductive top-down technique concerned with discovering the root causes of risks in a particular system. Put the risk or hazard at the root of the tree and identify the system states that could lead to that hazard. Where appropriate, link these with 'and' or 'or' conditions. A goal should be to minimise the number of single causes of system failure.
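A fault tree can be sketched as nested AND/OR nodes over basic events; evaluation then asks whether a given set of occurred events triggers the root hazard. The tree below is a simplified reading of the insulin pump example, with event names taken from it:

```python
# Sketch of a fault tree as AND/OR nodes over basic events (strings).
# evaluate() asks: given which basic events have occurred, does the root
# hazard arise?

def OR(*children):
    return ("or", children)

def AND(*children):
    return ("and", children)

incorrect_dose = OR(
    OR("sensor failure",                                 # incorrect sugar level measured
       OR("arithmetic error", "algorithm error")),
    "timer failure",                                     # correct dose at wrong time
    OR(OR("arithmetic error", "algorithm error"),        # pump signals incorrect
       "delivery system failure"),
)

def evaluate(node, occurred):
    if isinstance(node, str):                # basic event (leaf)
        return node in occurred
    op, children = node
    results = [evaluate(child, occurred) for child in children]
    return any(results) if op == "or" else all(results)

print(evaluate(incorrect_dose, {"timer failure"}))   # True: a single cause suffices
print(evaluate(incorrect_dose, set()))               # False
```

Because every path to the root passes only through OR gates, any single basic event triggers the hazard here, illustrating why the design goal is to minimise single causes (e.g. by adding AND-ed redundancy).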


Page 26

Risk reduction assessment

The aim of this process is to identify dependability requirements that specify how the risks should be managed, and to ensure that accidents/incidents do not arise.

Risk reduction strategies:
- Risk avoidance: the system is designed so that the risk or hazard cannot arise.
- Risk detection and removal: the system is designed so that risks are detected and neutralised before they result in an accident.
- Damage limitation: the system is designed so that the consequences of an accident are minimised.


Page 27

Safety specification

SR1: The system shall not deliver a single dose of insulin that is greater than a specified maximum dose for a system user.
SR2: The system shall not deliver a daily cumulative dose of insulin that is greater than a specified maximum for a system user.
SR3: The system shall include a hardware diagnostic facility that shall be executed at least 4 times per hour.
SR4: The system shall include an exception handler for all of the exceptions that are identified in Table 3.
SR5: The audible alarm shall be sounded when any hardware or software anomaly is discovered, and a diagnostic message as defined in Table 4 should be displayed.
SR6: In the event of an alarm in the system, insulin delivery shall be suspended until the user has reset the system and cleared the alarm.

The safety requirements of a system should be separately specified. These requirements should be based on an analysis of the possible hazards and risks. The system specification should be formulated so that the hazards are unlikely to result in an accident.
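SR1 and SR2 can be sketched as run-time guards in the delivery controller. The class, limits and names below are illustrative stand-ins, not the pump's real design:

```python
# Sketch of enforcing SR1 (single-dose maximum) and SR2 (daily cumulative
# maximum) as guards in a delivery controller. Limits and names are
# illustrative.

class DoseLimitError(Exception):
    pass

class InsulinController:
    def __init__(self, max_single_dose: float, max_daily_dose: float):
        self.max_single_dose = max_single_dose
        self.max_daily_dose = max_daily_dose
        self.delivered_today = 0.0

    def deliver(self, dose: float) -> float:
        # SR1: never deliver more than the specified single maximum.
        if dose > self.max_single_dose:
            raise DoseLimitError("single dose exceeds maximum")
        # SR2: never exceed the daily cumulative maximum.
        if self.delivered_today + dose > self.max_daily_dose:
            raise DoseLimitError("daily cumulative dose exceeded")
        self.delivered_today += dose
        return dose

pump = InsulinController(max_single_dose=4.0, max_daily_dose=25.0)
print(pump.deliver(3.0))      # 3.0 -- within both limits
```

Raising an exception here corresponds to SR4/SR6 behaviour: the anomaly is signalled and delivery is suspended rather than silently continued.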

Page 28

IEC 61508: the safety life-cycle

[Life-cycle diagram: concept and scope definition → hazard and risk analysis → safety requirements derivation → safety requirements allocation → planning and development (planning of validation, operation & maintenance, and installation; safety-related systems development; external risk reduction facilities) → installation and commissioning → safety validation → operation and maintenance → system decommissioning.]

Page 29

Security specification

Security specification has some similarities to safety specification:
- It is not possible to specify security requirements quantitatively.
- The requirements are often 'shall not' rather than 'shall' requirements.

Differences:
- There is no well-defined notion of a security life cycle for security management, and no standards.
- Generic threats apply, rather than system-specific hazards.
- Security technology is mature (encryption, etc.).
- The dominance of a single supplier (Microsoft) means that huge numbers of systems may be affected by a security failure.

Page 30

The security specification process

The approach to security analysis is based around the assets to be protected and their value to an organisation.

1. Asset identification: the assets (data and programs) and their required degree of protection are identified (output: system asset list).
2. Threat analysis and risk assessment: possible security threats are identified and the risks associated with each of these threats are estimated (output: threat and risk matrix).
3. Threat assignment: identified threats are related to the assets so that, for each identified asset, there is a list of associated threats (output: asset and threat description).
4. Security technology analysis: available security technologies and their applicability against the identified threats are assessed (output: technology analysis).
5. Security requirements specification: the security requirements are specified; where appropriate, these explicitly identify the security technologies that may be used to protect against different threats to the system (output: security requirements).

Page 31

Types of security requirement

- Identification requirements: should the system identify users before interacting with them?
- Authentication requirements: how are users identified?
- Authorisation requirements: the privileges and access permissions of the users.
- Immunity requirements: how to protect the system against threats.
- Integrity requirements: how to avoid data corruption.
- Intrusion detection requirements: how to detect attacks.
- Non-repudiation requirements: a third party in a transaction cannot deny its involvement.
- Privacy requirements: how to maintain data privacy.
- Security auditing requirements: how system use can be audited and checked.
- System maintenance security requirements: how not to accidentally destroy security during maintenance.

Examples:
SEC1: All system users shall be identified using their library card number and personal password.
SEC2: User privileges shall be assigned according to the class of user (student, staff, library staff).

Page 32

System reliability specification

- Hardware reliability: what is the probability of a hardware component failing, and how long does it take to repair that component?
- Software reliability: how likely is it that a software component will produce an incorrect output? Software failures differ from hardware failures in that software does not wear out: the system can continue in operation even after an incorrect result has been produced.
- Operator reliability: how likely is it that the operator of the system will make an error?

Page 33

Functional reliability requirements

Functional reliability requirements specify how failures may be avoided or tolerated. Examples:

- A predefined range for all values that are input by the operator shall be defined, and the system shall check that all operator inputs fall within this predefined range.
- The system must use N-version programming to implement the braking control system.
- The system must be implemented in a safe subset of Ada and checked using static analysis.
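The first requirement above (predefined input ranges with mandatory checking) can be sketched as follows; the parameter names and ranges are invented for illustration:

```python
# Sketch of operator-input range checking: every operator input is validated
# against a predefined range before it reaches the control software.
# Parameter names and ranges are illustrative.

OPERATOR_INPUT_RANGES = {
    "target_blood_sugar": (4.0, 10.0),    # mmol/L
    "max_single_dose":    (0.5, 5.0),     # units of insulin
}

class InputRangeError(ValueError):
    pass

def check_operator_input(name: str, value: float) -> float:
    low, high = OPERATOR_INPUT_RANGES[name]
    if not (low <= value <= high):
        raise InputRangeError(
            f"{name}={value} outside predefined range [{low}, {high}]")
    return value

print(check_operator_input("target_blood_sugar", 6.5))   # 6.5 -- accepted
```

The point of making the ranges data rather than scattered `if` statements is that the predefined ranges become part of the specification and can be reviewed independently of the code.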

Page 34

Non-functional reliability specification

Non-functional reliability requirements are expressed quantitatively. An appropriate reliability metric should be chosen to specify the overall system reliability. Note that reliability measurements do NOT take the consequences of failure into account.

- POFOD (probability of failure on demand): the likelihood that the system will fail when a service request is made. A POFOD of 0.001 means that 1 out of a thousand service requests may result in failure.
- ROCOF (rate of failure occurrence): the frequency with which unexpected behaviour is likely to occur. A ROCOF of 2/100 means that 2 failures are likely to occur in each 100 operational time units. This metric is sometimes called the failure intensity.
- MTTF (mean time to failure): the average time between observed system failures. An MTTF of 500 means that 1 failure can be expected every 500 time units.
- AVAIL (availability): the probability that the system is available for use at a given time. An availability of 0.998 means that in every 1000 time units, the system is likely to be available for 998 of these.
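The four metrics can be computed from an operational record. The data below is synthetic, and MTTF is approximated simply as observation time divided by failure count:

```python
# Computing the four reliability metrics from a synthetic operational log:
# failure timestamps over an observation period, plus demand counts.

failure_times = [120, 460, 700, 980]    # hours at which failures occurred
observation_hours = 1000
repair_hours_per_failure = 2
demands = 10_000                         # service requests in the period
failed_demands = 4

pofod = failed_demands / demands                    # probability of failure on demand
rocof = len(failure_times) / observation_hours      # failures per operational hour
mttf = observation_hours / len(failure_times)       # mean time to failure (approx.)
downtime = repair_hours_per_failure * len(failure_times)
avail = (observation_hours - downtime) / observation_hours

print(f"POFOD = {pofod}")      # 0.0004
print(f"ROCOF = {rocof}")      # 0.004 failures/hour
print(f"MTTF  = {mttf}")       # 250.0 hours
print(f"AVAIL = {avail}")      # 0.992
```

Note how the same log yields very different-looking numbers depending on the metric chosen, which is why the specification must name the metric, not just "reliability".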

Page 35

Failure classification

The types of failure are system-specific, and the consequences of a system failure depend on the nature of that failure. When specifying reliability, it is not just the number of system failures that matters but also the consequences of these failures.

- Transient: occurs only with certain inputs.
- Permanent: occurs with all inputs.
- Recoverable: the system can recover without operator intervention.
- Unrecoverable: operator intervention is needed to recover from the failure.
- Non-corrupting: the failure does not corrupt system state or data.
- Corrupting: the failure corrupts system state or data.


- Unrepeatable: occurred only once.

Page 36

Steps to a reliability specification

1. For each sub-system, analyse the consequences of possible system failures.
2. From the system failure analysis, partition failures into appropriate classes (transient, permanent, recoverable, unrecoverable, corrupting, non-corrupting) and by severity.
3. For each failure class identified, set out the reliability using an appropriate metric, e.g. POFOD = 0.002 for transient failures, POFOD = 0.00002 for permanent failures.
4. Identify functional reliability requirements to reduce the chances of critical failures.
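Step 3 can be sketched as a per-class target table checked against observed failure data; the targets are the example POFOD values above, and the observed counts are invented:

```python
# Sketch of checking observed reliability against per-failure-class POFOD
# targets. Targets are the slide's example values; observed counts are
# invented for illustration.

targets = {"transient": 0.002, "permanent": 0.00002}

def meets_target(failure_class: str, failures: int, demands: int) -> bool:
    observed_pofod = failures / demands
    return observed_pofod <= targets[failure_class]

print(meets_target("transient", failures=15, demands=10_000))   # True  (0.0015 <= 0.002)
print(meets_target("permanent", failures=1,  demands=10_000))   # False (0.0001 > 0.00002)
```

Splitting the target by failure class is the whole point of the classification: one permanent failure can be specification-breaking while fifteen transient ones are within budget.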

Page 37

Critical systems

Lecture 6

- Dependability
- Critical systems specification
- Critical systems development
- Critical systems validation

Page 38

Approaches to developing dependable software

Fault avoidance: design and implementation processes that minimise the introduction of faults.

Fault detection: verification and validation (V&V) to discover and remove faults.

Fault tolerance: detect unexpected behaviour and prevent system failure.

[Figure: the cost of removing residual errors rises steeply as their number falls from many, to few, to very few]

The cost of producing fault-free software is very high, so it is only cost-effective in exceptional situations. It is often cheaper to accept software faults and pay for their consequences than to expend resources on developing fault-free software.

A fault-tolerant system is a system that can continue in operation after some system faults have manifested themselves. The goal of fault tolerance is to ensure that system faults do not result in system failure.


Techniques for developing fault-free software

Dependable software processes; quality management; formal specification; static verification; strong typing; safe programming; protected information (information hiding and encapsulation).


Dependable processes

Documentable: The process should have a defined process model that sets out the activities in the process and the documentation that is to be produced during these activities.

Standardised: A comprehensive set of software development standards that define how the software is to be produced and documented should be available.

Auditable: The process should be understandable by people apart from process participants, who can check that process standards are being followed and make suggestions for process improvement.

Diverse: The process should include redundant and diverse verification and validation activities.

Robust: The process should be able to recover from failures of individual process activities.

A dependable software development process is well defined, repeatable and includes a spectrum of verification and validation activities, irrespective of the people involved in the process.


Process activities for fault avoidance and detection

Requirements inspections. Requirements management. Model checking, for internal and external consistency (e.g. that the dynamic and static models are consistent).

Design and code inspections. Static analysis. Test planning and management. Configuration management.


Programming constructs and techniques that contribute to fault avoidance and fault tolerance

Design for simplicity. Exception handling. Information hiding. Minimise the use of unsafe programming constructs:

Floating-point numbers, pointers, dynamic memory allocation, parallelism, recursion, interrupts, inheritance, aliasing, unbounded arrays, default input processing.

Some standards for safety-critical systems development completely prohibit the use of some of these constructs.


Fault tolerance actions

Fault detection: The system must detect that a fault (an incorrect system state) has occurred or will occur.

Damage assessment: The parts of the system state affected by the fault must be detected and assessed.

Fault recovery: The system must restore its state to a known safe state.

Fault repair: The system may be modified to prevent recurrence of the fault. As many software faults are transient, this is often unnecessary.

Detection, assessment and recovery are system actions; fault repair may involve human and/or system action.


Fault detection and damage assessment

Define constraints that must hold for all legal states, and check the state against these constraints.

Checksums are used for damage assessment in data transmission.

Redundant pointers can be used to check the integrity of data structures.

Watchdog timers can check for non-terminating processes: if there is no response after a certain time, a problem is assumed.

Preventative fault detection: The fault detection mechanism is initiated before the state change is committed. If an erroneous state is detected, the change is not made.

Retrospective fault detection: The fault detection mechanism is initiated after the system state has been changed.
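Two of these mechanisms can be sketched in a few lines of Python. The class name, the dose constraint and the payloads are invented for illustration, and CRC-32 stands in for whatever checksum a real transmission protocol would use.

```python
import zlib


class DoseState:
    """Preventative fault detection: validate a state change against the
    legal-state constraint BEFORE it is committed."""

    MAX_DOSE = 5  # invented constraint for illustration

    def __init__(self) -> None:
        self.dose = 0

    def set_dose(self, new_dose: int) -> None:
        # If an erroneous state is detected, the change is not made.
        if not (0 <= new_dose <= self.MAX_DOSE):
            raise ValueError(f"illegal state: dose {new_dose} out of range")
        self.dose = new_dose


def checksum(data: bytes) -> int:
    """Damage assessment for transmitted data using a CRC-32 checksum."""
    return zlib.crc32(data)


# Usage: corrupting one byte changes the checksum, so the damage is detected.
payload = b"dose=4"
assert checksum(b"dose=9") != checksum(payload)
```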


Backward recovery: Restore the system state to a known safe state.

Forward recovery: Apply repairs to a corrupted system state and set the system state to the intended value.

Fault recovery and repair
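Backward recovery can be sketched as a checkpoint-and-rollback pattern. The function names and the safety check are invented for illustration, with the update and check supplied by the caller.

```python
import copy


def backward_recovery(state: dict, update, is_safe) -> dict:
    """Backward recovery sketch: checkpoint the state, apply the update,
    and restore the checkpoint if the update fails or yields an unsafe state."""
    checkpoint = copy.deepcopy(state)  # the known safe state
    try:
        new_state = update(state)
        if not is_safe(new_state):
            raise ValueError("unsafe state after update")
        return new_state
    except Exception:
        return checkpoint  # roll back to the checkpoint


# Usage: a faulty update that drives the level negative is rolled back.
result = backward_recovery({"level": 10},
                           lambda s: {"level": s["level"] - 20},
                           lambda s: s["level"] >= 0)
assert result == {"level": 10}
```

Forward recovery would instead repair the corrupted state in place, which requires application-specific knowledge of what the intended value should be.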


Hardware fault tolerance

Depends on triple-modular redundancy (TMR). There are three replicated identical components that receive the same input and whose outputs are compared. If one output is different, it is ignored and component failure is assumed.

[Figure: components A1, A2 and A3 feeding an output comparator]


N-version programming

[Figure: the same input is fed to Version 1, Version 2 and Version 3; an output comparator selects the agreed result and reports disagreeing versions to a fault manager]

As in hardware systems, the output comparator is a simple piece of software that uses a voting mechanism to select the output.
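A minimal voting comparator can be sketched as follows; the function name and the no-majority policy are assumptions for illustration, since a real fault manager would also record which version disagreed.

```python
from collections import Counter


def vote(outputs):
    """Output comparator for N-version programming: majority voting.
    Returns the agreed result, or raises if no strict majority exists."""
    winner, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority among versions")
    return winner


# Usage: version 2 produces a different output, which is outvoted.
assert vote([42, 41, 42]) == 42
```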


Recovery blocks

The different system versions (alternative algorithms) are designed and implemented by different teams.

[Figure: try Algorithm 1 and apply an acceptance test; if the test fails, restore the state and retry with Algorithm 2, then Algorithm 3; continue execution if the acceptance test succeeds; signal an exception if all algorithms fail]
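The control flow in the figure can be sketched directly. The function names and the example algorithms are invented for illustration: a deliberately buggy primary and a correct backup.

```python
def recovery_block(algorithms, acceptance_test, x):
    """Recovery block sketch: try each alternative algorithm in turn,
    keeping the first result that passes the acceptance test.
    Signals an exception if every alternative fails."""
    for algorithm in algorithms:
        try:
            result = algorithm(x)
            if acceptance_test(result):
                return result  # acceptance test succeeds: continue execution
        except Exception:
            pass  # treat a crash as a failed attempt and retry the next one
    raise RuntimeError("all algorithms failed the acceptance test")


# Usage: the buggy primary fails the acceptance test; the backup passes.
buggy_sqrt = lambda x: x / 2        # wrong for most inputs
backup_sqrt = lambda x: x ** 0.5
accepts = lambda r: abs(r * r - 9) < 1e-6
assert recovery_block([buggy_sqrt, backup_sqrt], accepts, 9) == 3.0
```

Unlike N-version programming, only one algorithm runs at a time, and a single acceptance test (rather than a vote between versions) decides whether its result can be used.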


Critical systems (Lecture 6): Dependability, Critical systems specification, Critical systems development, Critical systems validation


Validation of critical systems

Verification and validation of critical systems is driven by the high cost of failure, which justifies a correspondingly high cost of V&V.

V&V costs typically take up more than 50% of the total system development costs.


Reliability validation

Reliability validation involves exercising the program to assess whether or not it has reached the required level of reliability.

The process: identify operational profiles, prepare the test data set, apply tests to the system, compute the observed reliability.

Difficulties: operational profile uncertainty, the high cost of test data generation, and statistical uncertainty.


Operational profiles

An operational profile is a set of test data whose frequency matches the actual frequency of these inputs from ‘normal’ usage of the system.

It can be generated from real data collected from an existing system or (more often) depends on assumptions made about the pattern of usage of a system.

[Figure: number of inputs per input class, with a few classes accounting for most inputs]

If you are not sure of your operational profile, then you cannot be confident about the accuracy of your reliability measurements.
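A test set matching an operational profile can be drawn with weighted sampling. The input classes and their frequencies below are invented for illustration.

```python
import random

# Illustrative operational profile: input classes and their relative
# frequencies in 'normal' usage (numbers invented).
PROFILE = {"normal_update": 0.90, "bulk_import": 0.07, "admin_override": 0.03}


def generate_test_inputs(n: int, seed: int = 0):
    """Draw a test set whose class frequencies match the operational profile."""
    rng = random.Random(seed)  # seeded so the test set is reproducible
    classes = list(PROFILE)
    weights = list(PROFILE.values())
    return rng.choices(classes, weights=weights, k=n)


# The sample's class frequencies approximate the profile, so reliability
# measured on it approximates reliability under normal usage.
tests = generate_test_inputs(1000)
assert abs(tests.count("normal_update") / 1000 - 0.90) < 0.05
```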


Reliability prediction

A reliability growth model is a mathematical model of how the system reliability changes as it is tested and faults are removed.

It is used as a means of reliability prediction by extrapolating from current data.

[Figure: fitted reliability model curve extrapolated over time to the estimated time at which the required reliability is achieved; measured reliability is expressed as ROCOF]
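One simple model form, used here purely for illustration, assumes the ROCOF decays exponentially as faults are removed; extrapolating the fitted curve gives the estimated time at which the required reliability is achieved. The function name, the model choice and all figures are assumptions.

```python
import math


def estimated_achievement_time(rocof_now: float,
                               target_rocof: float,
                               decay_rate: float) -> float:
    """Extrapolate a fitted exponential growth model,
    ROCOF(t) = rocof_now * exp(-decay_rate * t),
    to the time t at which the required ROCOF is reached."""
    return math.log(rocof_now / target_rocof) / decay_rate


# Invented figures: current ROCOF 0.01 failures/hour, requirement 0.001,
# fitted decay rate 0.05 per week of testing.
t = estimated_achievement_time(0.01, 0.001, 0.05)
print(f"estimated weeks of further testing: {t:.1f}")  # about 46 weeks
```

Such estimates are only as good as the fitted model, which is why the random-step behaviour described below matters in practice.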


Equal-step reliability growth

[Figure: reliability (ROCOF) improving in equal steps at times t1 to t5]

ROCOF: rate of occurrence of failures

The equal-step growth model is simple but it does not normally reflect reality.

Reliability does not necessarily increase with change as the change can introduce new faults.

The rate of reliability growth tends to slow down with time as frequently occurring faults are discovered and removed from the software.


Random-step reliability growth

A random-growth model where reliability changes fluctuate may be a more accurate reflection of real changes to reliability.

[Figure: reliability (ROCOF) over times t1 to t5 with different reliability improvements at each step; one fault repair adds a new fault and decreases reliability, i.e. increases ROCOF]

ROCOF: rate of occurrence of failures


Reliability prediction

Benefits: planning of testing; customer negotiations.

[Figure: fitted reliability model curve extrapolated over time to the estimated time at which the required reliability is achieved; measured reliability is expressed as ROCOF]


Safety assurance

Safety assurance is concerned with establishing a level of confidence in the system; quantitative measurement of safety is impossible.

Confidence in the safety of a system can vary from very low to very high.

Confidence is developed through: past experience with the company developing the software; the use of dependable processes and process activities geared to safety; extensive V&V including both static and dynamic validation techniques.


Mandatory reviews for safety-critical systems, suggested by D. Parnas

Review for correct intended function.

Review for maintainable, understandable structure.

Review to verify algorithm and data structure design against the specification.

Review to check code consistency with the algorithm and data structure design.

Review the adequacy of system test cases.


Proof of correctness

[Safety argument diagram for administerInsulin: the pre-condition for the unsafe state (an overdose) is currentDose > maxDose. On every execution path the final assignment gives currentDose = 0, currentDose = maxDose (if statement 2's then branch executed), or minimumDose <= currentDose <= maxDose (else branch executed, or statement 2 not executed), so each path contradicts the unsafe pre-condition.]

Safety arguments are intended to show that the system cannot reach an unsafe state.

They are generally based on proof by contradiction: assume that an unsafe state can be reached, then show that this is contradicted by the program code along all execution paths.
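The argument can be grounded in code. The sketch below is a hedged reconstruction of the insulin-dose guard pattern, not the actual pump software; MAX_DOSE, the function name and the doses are invented for illustration.

```python
MAX_DOSE = 5  # illustrative limit; in a real pump this is set at configuration


def administer_insulin(computed_dose: int) -> int:
    """Guarded dose administration. Whichever branch of the guard executes,
    the post-condition current_dose <= MAX_DOSE holds, so the unsafe
    pre-condition (current_dose > MAX_DOSE) is contradicted on every path."""
    if computed_dose > MAX_DOSE:   # then branch: clamp to the maximum
        current_dose = MAX_DOSE
    else:                          # else branch (or guard not triggered)
        current_dose = computed_dose
    assert current_dose <= MAX_DOSE  # the overdose state is unreachable
    return current_dose


# Usage: an excessive computed dose is clamped; a normal dose passes through.
assert administer_insulin(12) == MAX_DOSE
assert administer_insulin(3) == 3
```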


Process assurance

Explicit attention should be paid to safety during all stages of the software process. This means that specific safety assurance activities must be included in the process:

Hazard logging and monitoring. The appointment of project safety engineers. The extensive use of safety reviews. Formal certification of safety-critical components. The use of a very detailed configuration management system.

Assuring the quality of the system development process is important for all critical systems, but it is particularly important for safety-critical systems.

[Figure: the safety life cycle, from concept and scope definition, hazard and risk analysis, and safety requirements derivation and allocation, through planning and development (safety-related systems development, external risk reduction facilities), installation and commissioning, and safety validation, to operation and maintenance and finally system decommissioning]


Security assessment

Security assessment has something in common with safety assessment: it is intended to demonstrate that the system cannot enter some state (an unsafe or an insecure state), rather than to demonstrate that the system can do something.

However, there are differences: safety problems are accidental, while security problems are deliberate; security problems are more generic, in that many systems suffer from the same problems, whereas safety problems are mostly related to the application domain.


Security validation

Experience-based validation: The system is reviewed and analysed against the types of attack that are known to the validation team.

Tool-based validation: Various security tools, such as password checkers, are used to analyse the system in operation.

Tiger teams: A team is established whose goal is to breach the security of the system by simulating attacks on the system.

Formal verification: The system is verified against a formal security specification.


Safety and dependability cases

Safety and dependability cases are structured documents that set out detailed arguments and evidence that a required level of safety or dependability has been achieved.

They are normally required by regulators before a system can be certified for operational use.

[Figure: safety case documents flow from development and delivery to an external certification body]

A safety case is a documented body of evidence that provides a convincing and valid argument that a system is adequately safe for a given application in a given environment.

Component: Description

System description: An overview of the system and a description of its critical components.

Safety requirements: The safety requirements abstracted from the system requirements specification.

Hazard and risk analysis: Documents describing the hazards and risks that have been identified and the measures taken to reduce risk.

Design analysis: A set of structured arguments that justify why the design is safe.

Verification and validation: A description of the V&V procedures used and, where appropriate, the test plans for the system. Results of system V&V.

Review reports: Records of all design and safety reviews.

Team competences: Evidence of the competence of all of the team involved in safety-related systems development and validation.

Process QA: Records of the quality assurance processes carried out during system development.

Change management processes: Records of all changes proposed, actions taken and, where appropriate, justification of the safety of these changes.

Associated safety cases: References to other safety cases that may impact on this safety case.



Argument structure

[Figure: several items of evidence feed an argument, which supports a claim]

CLAIM: The maximum insulin dose will not exceed maxDose.

EVIDENCE: safety argument model; test data; static analysis report.

ARGUMENT: The safety argument shows that …. In 400 tests, ….. The analysis results show that ….

Together these constitute a safety case or a safety claim.


Claim hierarchy

Claims may be organised into a hierarchy. For the insulin pump:

Top-level claim: The insulin pump will not deliver a single dose of insulin that is unsafe.

Supporting claims: maxDose is a safe dose for the user of the insulin pump; maxDose is set up correctly when the pump is configured; the maximum single dose computed by the pump software will not exceed maxDose.

Sub-claims: In normal operation, the maximum dose computed will not exceed maxDose; if the software fails, the maximum dose computed will not exceed maxDose.

[Figure: a generic claim hierarchy in which each claim is supported by an argument, and each argument is in turn supported by sub-claims or items of evidence]