
ELEC-C7210 Modeling and analysis of communication networks

9. Reliability theory


Material based on original slides by Tuomas Tirronen


Contents

• Introduction
• Structural system models
• Reliability of structures of independent repairable components
• Reliable network topology design


History

• As a technological concept, reliability emerged after WW1; practical methods were developed during and after WW2
  – For example, Lusser's law, i.e., the product probability law of a series of components, was formulated by Robert Lusser during V1 flying bomb tests
  – It arose from the need to improve and control the quality of industrial products with many parts
• 50s, 60s
  – ballistic missiles, space programs
  – first journal, IEEE Transactions on Reliability, 1963
• 70s
  – safety of nuclear power plants
• 80s, 90s
  – oil and gas industries, computer programs to evaluate reliability, software reliability, ...
• 00s
  – new kinds of operation concepts (remote control/maintenance of systems) require reliability analysis, network reliability, etc.


Approaches to reliability

• Hardware reliability
  – Physical approach
    • Strength S of an item is a random variable
    • Load L the item is exposed to is another random variable
    • Reliability R = Pr(S > L)
    • Structural reliability analysis
  – Actuarial approach ← our approach
    • Time to failure T is studied using its distribution F(t)
    • All information about individual strengths, loads, etc. is conveyed in F(t)
    • System reliability analysis
• Software reliability
• Human reliability


Basic concepts (1)

• Reliability: The ability of an item to perform a required function, under given environmental and operational conditions and for a stated period of time (ISO 8402)
  – the item can be a single component or a larger entity (system)
  – required function may refer to a single function or many
• Quality: The totality of features and characteristics of a product or service that bear on its ability to satisfy stated or implied needs (ISO 8402)
  – i.e., "conformance to specifications"
  – reliability can be seen as an extension of quality into the time domain
• Availability: The ability of an item to perform its required function at a stated instant of time or over a stated period of time (BS 4778)
  – i.e., can the item be used at some time instant, or what is the fraction of time the item is usable (= average availability)


Basic concepts (2)

• Maintainability: The ability of an item, under stated conditions, to be retained in, or restored to, a state in which it can perform its required functions, when maintenance is performed under stated conditions and using prescribed procedures and resources (BS 4778)
  – if an item can be repaired, then maintainability determines the availability of the item
• Dependability: Collective term to describe availability performance and influencing factors: reliability performance, maintainability performance and maintenance support performance (IEC 60300)
  – umbrella term often used when covering reliability issues
• Safety: Freedom from those conditions that can cause death, injury, occupational illness, or damage to or loss of equipment or property (MIL-STD-882D)
• Security: Dependability with respect to prevention of deliberate hostile actions


Basic concepts (3)

• Fault
  – a defect or mistake which leads to an error; the reason for an error
• Error
  – a system state which can lead to a failure
• Failure
  – The termination of its ability to perform a required function (BS 4778)
  – An unacceptable deviation from the design tolerance or in the anticipated delivered service, an incorrect output, the incapacity to perform the desired function (NASA 2002)


Basic concepts (4)

Cause → Fault → Error → Failure

Fault prevention
• aim is to design a system without faults
• physical shielding of components, careful manufacturing, etc.

Fault tolerance
• aim is to be able to provide the service even in the presence of faults
• main tool: redundancy!
  – hardware
  – software
  – information
  – time


Repairable and nonrepairable items

• We can study two types of items
  – Nonrepairable items
    • The item can be a single item or a larger system
    • We are only interested in the time until the first failure; whatever happens after this is of no interest to us
    • Interesting measures include: mean time to failure, reliability (function) and failure rate
  – Repairable items (our focus)
    • Single item or larger system
    • Interesting measures include: availability, mean time between failures, mean down time, number of failures in some time interval
    • In some sources the term dependability is used instead of availability to mean the same thing


Systems of items

• We also study systems of many items or components. There are two possibilities for modeling systems:
  – Systems of independent components (our focus)
    • Easy analysis: independence of components → independence of probabilities
    • Most examples during this course assume independence
  – Systems with dependent components
    • Exact analysis is harder, even impossible, because of the dependencies
    • Analysis of the system as a stochastic process


Tools and models

• As a function of time, we have state models where systems are modelled as stochastic processes (cf. queueing models in earlier lectures)
  – especially repairable items/systems
  – the failure process, repair times, etc.
• Structure of systems and their subsystems → structural models
  – reliability block diagrams, structure function
• Tools:
  – Basic probability theory
  – Stochastic processes
    • Markov chains/processes
    • (Renewal processes)
  – Statistical methods
• Main limitations of "probabilistic reliability analysis": human errors, human factor


Applications

• Risk analysis
  – Identification of accidental events
  – Causal analysis
  – Consequence analysis
• Environmental protection
• Quality
• Optimization and maintenance
• Engineering design
• Verification of quality
• Research and development
• ...


Reliability in communications and networking (1)

• From the user's point of view, an interesting quality of service concept is the network availability
  – A(t) = Pr(user can access the agreed network services at time t)
  – Average availability tells us the fraction of time the system is available
• A way to understand the availability of networks is to study the downtime of a network (or the outage of some specific service) per year

# nines    Avg. availability    Downtime / year
2-nines    0.99                 87.6 hours
3-nines    0.999                8 hours 46 mins
4-nines    0.9999               52 mins 34 secs
5-nines    0.99999              5 mins 15 secs
6-nines    0.999999             31.5 secs
7-nines    0.9999999            3.15 secs
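The downtime figures in the table follow directly from the average availability. A minimal sketch of the arithmetic, assuming a year of 365 days (8760 hours); the function names are illustrative and not part of the course material:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; assumes a non-leap year


def nines(k: int) -> float:
    """Average availability with k nines, e.g. nines(3) is 0.999."""
    return 1.0 - 10.0 ** (-k)


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime per year in hours: (1 - A_av) * 8760."""
    return (1.0 - availability) * HOURS_PER_YEAR


for k in range(2, 8):
    hours = downtime_hours_per_year(nines(k))
    print(f"{k}-nines: {hours:.5f} hours = {hours * 3600:.2f} seconds per year")
```

For example, 2-nines gives 0.01 · 8760 = 87.6 hours, matching the first row of the table.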


Reliability in communications and networking (2)

• In addition to availability, network operators, service providers and equipment manufacturers are also interested in
  – reliability of components (mean times to failure, number of failures in some time interval, etc.)
  – maintainability
  – security of networks
• Reliability is an important factor when planning new services, networks or equipment
• Note that dependability, reliability and availability may have different definitions in different sources. Be careful to understand how each concept is defined in the source at hand.


Aim of the lecture

• We focus on
  – Repairable systems
  – Systems of independent components
  – Exponentially distributed times to failure and down times (with means MTTF and MDT)
  – Thus, we get simple models using Markovian analysis
  – Applying the models to topology design of communication networks, where availability is defined as connectivity of the network


Literature

• Reliability theory / Dependability
  – System Reliability Theory: Models, Statistical Methods and Applications, 2nd edition, Marvin Rausand and Arnljot Høyland, Wiley, 2004
  – Mesh-Based Survivable Networks: Options and Strategies for Optical, MPLS, SONET and ATM Networking, Wayne D. Grover, Prentice Hall, 2004
  – Handout: Luotettavuus, käytettävyys, huollettavuus (luotettavuusteoria.pdf), Keijo Ruohonen, TTKK, 2002
  – TKK courses AS-116.3180, Mat-2.3118




Reliability block diagrams

• Reliability block diagrams (RBD) are used to describe the function of a system of components
  – an RBD shows the logical connections between components
• A system works if there is a path of functioning components from the start point (a) to the end point (b)
• RBDs give a deterministic model for the structure of a system
  – the whole system works properly if and only if some set of the components function
• It is important to determine which specific function of the system is modelled: the logical structure may be different for different functions


Series and parallel structures

• When a system functions if and only if all of the components function, the logical structure is a series structure
• When a system functions if at least one of the n components functions, the logical structure is a parallel structure
• Series and parallel structures can be further combined to model more complex structures

[Figure: a series structure and a parallel structure of components 1-4, each between end points a and b]


Structure function (1)

• The state vector of a structure is $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, where each state variable $x_i$ is 1 when component $i$ is functioning and 0 when component $i$ is in a failed state

• The structure function of the system is

\[ \phi(\mathbf{x}) = \begin{cases} 1 & \text{if the system is functioning} \\ 0 & \text{if the system is in a failed state} \end{cases} \]

• For a series structure, the structure function is

\[ \phi(\mathbf{x}) = x_1 \cdot x_2 \cdots x_n = \prod_{i=1}^{n} x_i \]

  – the system works if and only if $x_i = 1$ for all $i$


Structure function (2)

• For a parallel structure, the structure function is

\[ \phi(\mathbf{x}) = 1 - (1 - x_1)(1 - x_2) \cdots (1 - x_n) = 1 - \prod_{i=1}^{n} (1 - x_i) = \coprod_{i=1}^{n} x_i \]

  – If any $x_i = 1$, then the system functions
  – The last operator $\coprod$ (upside-down product) is read "ip"

• Example: for a structure with 2 components in parallel we have

\[ \phi(x_1, x_2) = \coprod_{i=1}^{2} x_i = 1 - (1 - x_1)(1 - x_2) = x_1 + x_2 - x_1 x_2 \]
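The series and parallel structure functions can be written as two short functions. This is only a sketch: the names `series` and `parallel` are chosen here for illustration, and a state vector is represented as a tuple of 0/1 values.

```python
from itertools import product
from math import prod


def series(x):
    """Series structure: product of the states, 1 only if all x_i = 1."""
    return prod(x)


def parallel(x):
    """Parallel structure ("ip"): 1 - prod(1 - x_i), 1 if any x_i = 1."""
    return 1 - prod(1 - xi for xi in x)


# Check the two-component identity x1 + x2 - x1*x2 on every binary state.
for x1, x2 in product((0, 1), repeat=2):
    assert parallel((x1, x2)) == x1 + x2 - x1 * x2
    print((x1, x2), "series:", series((x1, x2)), "parallel:", parallel((x1, x2)))
```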


Path set and cut set methods

• For small systems the structure function $\phi(\mathbf{x})$ can be written down by visually inspecting the system as a combination of series and parallel structures

• However, for large systems this is not possible!

• Therefore, we need systematic computational methods for generating the structure function $\phi(\mathbf{x})$
  – Path set and cut set methods allow this


Path/cut sets (1)

• Definition: A path set P is a set of components which by functioning ensure that the system is functioning. A path set is minimal if it cannot be reduced without losing its status as a path set.

• Definition: A cut set K is a set of components which by failing cause the system to fail. A cut set is minimal if it cannot be reduced without losing its status as a cut set.

• Example:

[Figure: component 1 in series with the parallel pair of components 2 and 3]

Path sets: {1,2}, {1,3}, {1,2,3}
Cut sets: {1}, {2,3}, {1,2}, {1,3}, {1,2,3}
Minimal path sets: $P_1 = \{1,2\}$, $P_2 = \{1,3\}$
Minimal cut sets: $K_1 = \{1\}$, $K_2 = \{2,3\}$


Path set method

• Let $\rho_j(\mathbf{x})$ denote the structure function of the $j$th minimal path set, which is a series structure:

\[ \rho_j(\mathbf{x}) = \prod_{i \in P_j} x_i \]

• The whole structure functions if and only if at least one minimal path set is functioning:

\[ \phi(\mathbf{x}) = 1 - \prod_j \bigl(1 - \rho_j(\mathbf{x})\bigr) = \coprod_j \rho_j(\mathbf{x}) = \coprod_j \prod_{i \in P_j} x_i \]

• Path set method:
  1. Determine the path sets of the structure
  2. Determine the minimal path sets $P_j$
  3. Calculate the structure functions of the minimal path sets as series structures
  4. Take "ip" over all functions you get in step 3
  5. Simplify as needed (TIP: a power of a binary variable equals the variable itself, $x_i^j = x_i$)


Cut set method

• Let $\kappa_j(\mathbf{x})$ denote the structure function of the $j$th minimal cut set, which is a parallel structure:

\[ \kappa_j(\mathbf{x}) = \coprod_{i \in K_j} x_i = 1 - \prod_{i \in K_j} (1 - x_i) \]

• Now the structure fails if and only if at least one of the structures corresponding to the minimal cut sets fails:

\[ \phi(\mathbf{x}) = \prod_j \kappa_j(\mathbf{x}) = \prod_j \coprod_{i \in K_j} x_i \]

• Cut set method:
  1. Determine the cut sets of the structure
  2. Determine the minimal cut sets $K_j$
  3. Calculate the structure functions of the minimal cut sets as parallel structures
  4. Multiply all functions you get in step 3
  5. Simplify as needed
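Both methods can be checked against each other numerically. The sketch below evaluates the path set and cut set formulas for the example structure with minimal path sets {1,2}, {1,3} and minimal cut sets {1}, {2,3}, and confirms that they agree on all eight states. The function names and the dictionary representation of the state vector are illustrative choices, not part of the lecture material.

```python
from itertools import product

MIN_PATHS = [{1, 2}, {1, 3}]  # minimal path sets of the example
MIN_CUTS = [{1}, {2, 3}]      # minimal cut sets of the example


def phi_from_paths(x, min_paths):
    """Path set method: 1 - prod_j (1 - prod_{i in P_j} x_i)."""
    none_works = 1
    for path in min_paths:
        rho = 1  # series structure of the path's components
        for i in path:
            rho *= x[i]
        none_works *= 1 - rho
    return 1 - none_works


def phi_from_cuts(x, min_cuts):
    """Cut set method: prod_j (1 - prod_{i in K_j} (1 - x_i))."""
    result = 1
    for cut in min_cuts:
        all_failed = 1
        for i in cut:
            all_failed *= 1 - x[i]
        result *= 1 - all_failed  # parallel structure of the cut's components
    return result


for bits in product((0, 1), repeat=3):
    x = dict(zip((1, 2, 3), bits))  # x[i] is the state of component i
    assert phi_from_paths(x, MIN_PATHS) == phi_from_cuts(x, MIN_CUTS)
    print(bits, phi_from_paths(x, MIN_PATHS))
```

Evaluating on 0/1 states sidesteps the simplification step, because powers of binary values are handled automatically by the arithmetic.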


Demo/Exercise

• Determine the structure function of the system of independent components below
  a) directly (by using results for series/parallel structures and combining)
  b) using the path set method
  c) using the cut set method

[Figure: reliability block diagram with components 1, 2 and 3]




Repairable components/systems

• Now we study systems where components can be repaired or replaced upon failures (or even before), i.e., repairable components
• We are interested, for example, in
  – system reliability
  – component/system availability
  – mean number of failures during a time interval
  – mean time between failures, MTBF
  – mean downtime (or repair time) of systems, MDT (MTTR)
• For this purpose we can model the systems/failure processes as stochastic processes
  – thus, we have studied the theoretical background already in the beginning of this course


Reliability of maintained systems (1)

• The system is called maintainable when its components are repaired/restored to working condition using some kind of maintenance
  – Can be preventive, corrective, …
• Let X(t) denote the stochastic process of the system, with X(t) = 1 if the system is operational and 0 otherwise
• The main measure is the availability A(t)
  – the unavailability $\bar{A}(t) = 1 - A(t)$ is also studied

Availability:
\[ A(t) = P\{X(t) = 1\} \]

Average availability:
\[ A_{av}(\tau) = \frac{1}{\tau} \int_0^{\tau} A(t)\,dt \]

Long run average availability:
\[ A_{av} = \lim_{\tau \to \infty} A_{av}(\tau) = \lim_{\tau \to \infty} \frac{1}{\tau} \int_0^{\tau} A(t)\,dt \]

Limiting availability (when it exists):
\[ A = \lim_{t \to \infty} A(t) \]


Availability of single component as on-off process (1)

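• We can model a single component as an on-off type process X(t) with

\[ X(t) = \begin{cases} 1 & \text{if the component is operational} \\ 0 & \text{otherwise} \end{cases} \]

• Measures related to maintainable systems are
  – Mean time between failures, MTBF
  – Mean downtime, MDT
  – Mean time to failure, MTTF

[Figure: sample path of X(t) alternating between 1 and 0 over time t, with MTTF, MDT and MTBF marked]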


Availability of single component as on-off process (2)

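• Markov model
  – Times to failure are independent and exponentially distributed with mean MTTF = $1/\lambda$
  – Down times are independent and exponentially distributed with mean MDT = $1/\mu$

[Figure: two-state Markov chain; transition 1 → 0 with rate $\lambda$, transition 0 → 1 with rate $\mu$; time spent in state 1 ~ Exp($\lambda$), time spent in state 0 ~ Exp($\mu$)]

• The steady state distribution is simply:

\[ \pi_1 = \frac{\mu}{\lambda + \mu} = \frac{\text{MTTF}}{\text{MTTF} + \text{MDT}} = A_{av}, \qquad \pi_0 = \frac{\lambda}{\lambda + \mu} = \frac{\text{MDT}}{\text{MTTF} + \text{MDT}} = U_{av} \]

  – The steady-state distribution holds even when the times to failure and down times have general distributions (but are still independent): the insensitivity property
  – The process is then no longer Markovian but a so-called renewal process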


Examples (1)

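• Example 1: A machine has MTTF = 1000 hours and MDT = 5 hours

The average availability is

\[ A_{av} = \frac{\text{MTTF}}{\text{MTTF} + \text{MDT}} = \frac{1000}{1000 + 5} \approx 0.995 \]

• Example 2: An item has independent uptimes with constant failure rate $\lambda$. Downtimes are IID with mean MDT. Usually we have MDT << MTTF, and the average unavailability is then approximately

\[ \bar{A}_{av} = 1 - A_{av} = 1 - \frac{\text{MTTF}}{\text{MTTF} + \text{MDT}} = \frac{\text{MDT}}{\text{MTTF} + \text{MDT}} = \frac{\lambda \cdot \text{MDT}}{1 + \lambda \cdot \text{MDT}} \approx \lambda \cdot \text{MDT} \]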
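Both examples are easy to reproduce numerically. The sketch below uses the figures of Example 1 (MTTF = 1000 hours, MDT = 5 hours) and compares the exact unavailability with the approximation λ · MDT of Example 2; the function names are illustrative.

```python
def availability(mttf: float, mdt: float) -> float:
    """Average availability A_av = MTTF / (MTTF + MDT)."""
    return mttf / (mttf + mdt)


def unavailability(mttf: float, mdt: float) -> float:
    """Average unavailability 1 - A_av = MDT / (MTTF + MDT)."""
    return mdt / (mttf + mdt)


MTTF = 1000.0  # hours, from Example 1
MDT = 5.0      # hours, from Example 1

failure_rate = 1.0 / MTTF  # lambda, constant failure rate
print("A_av              :", availability(MTTF, MDT))
print("exact 1 - A_av    :", unavailability(MTTF, MDT))
print("approx lambda*MDT :", failure_rate * MDT)
```

With MDT << MTTF the approximation error is small: here the exact unavailability is about 0.004975, while λ · MDT is 0.005.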


Systems of independent components (1)

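• Consider a system consisting of n independent components
  – The state vector of the system is

\[ \mathbf{X}(t) = (X_1(t), X_2(t), \ldots, X_n(t)) \]

• The times to failure and down times of component $i$ are independent and exponentially distributed with means $1/\lambda_i$ and $1/\mu_i$, respectively
  – Let $p_i = P\{X_i = 1\} = \mu_i / (\lambda_i + \mu_i)$
  – That is, $p_i$ is the availability of component $i$

• Then the steady state distribution of state $\mathbf{x} = (x_1, \ldots, x_n)$ is simply the product of the Bernoulli distributions of each component $i$,

\[ \pi(\mathbf{x}) = \prod_{i=1}^{n} p_i^{x_i} (1 - p_i)^{1 - x_i} \]

  – Again, the distribution holds even under general distributions for the times to failure and down times (insensitivity)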



• In general, the average availability of the system is defined as

\[ A_{av} = P\{\phi(\mathbf{X}) = 1\} \]

• The state space $\Omega$ can be partitioned into two sets
  1. Up states $\Omega_U$
    • where the system is working. Note that some components may be in a failed state, but the system still provides the intended service.
  2. Down states $\Omega_D$
    • where the system does not perform the required function

• The (average) availability of the system is given by

\[ A_{av} = \sum_{\mathbf{x} \in \Omega_U} \pi(\mathbf{x}) = \sum_{\mathbf{x} \in \Omega} 1\{\phi(\mathbf{x}) = 1\}\, \pi(\mathbf{x}) \]

  – similarly, unavailability is the sum of the probabilities of the down states



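• As $\phi(\mathbf{X})$ is a binary-valued function,

\[ A_{av} = P\{\phi(\mathbf{X}) = 1\} = E[\phi(\mathbf{X})] \]

• For a series structure (independence of the $X_i$'s!):

\[ A_{av} = E[\phi(\mathbf{X})] = E\Bigl[\prod_{i=1}^{n} X_i\Bigr] = \prod_{i=1}^{n} E[X_i] = \prod_{i=1}^{n} p_i \]

• And similarly for a parallel structure:

\[ A_{av} = E[\phi(\mathbf{X})] = E\Bigl[\coprod_{i=1}^{n} X_i\Bigr] = E\Bigl[1 - \prod_{i=1}^{n} (1 - X_i)\Bigr] = 1 - \prod_{i=1}^{n} (1 - E[X_i]) = 1 - \prod_{i=1}^{n} (1 - p_i) = \coprod_{i=1}^{n} p_i \]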



• However, in general

\[ A_{av} = E[\phi(\mathbf{X})] \neq \phi(E[\mathbf{X}]) \]

  – Thus, to calculate the availability one cannot just write down the structure function $\phi(\mathbf{x})$ and replace the $x_i$'s by the corresponding $p_i$'s!

• Instead, the function must first be simplified
  – Note that $\phi(\mathbf{x})$ is a polynomial function
  – All higher powers of the $x_i$'s are equal to $x_i$, i.e., $x_i^2 \to x_i$, etc.
  – The expectation operator can then be applied to the simplified structure function
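The effect of skipping the simplification step can be seen on a small case. The sketch below takes the structure with minimal path sets {1,2} and {1,3}, whose unsimplified path set form 1 − (1 − x1·x2)(1 − x1·x3) contains the term x1²·x2·x3. The component availabilities 0.9, 0.8 and 0.7 are assumed values chosen only for illustration.

```python
from itertools import product

P = {1: 0.9, 2: 0.8, 3: 0.7}  # assumed component availabilities (illustrative)


def phi(x):
    """Unsimplified path set form; x maps component index to a value."""
    return 1 - (1 - x[1] * x[2]) * (1 - x[1] * x[3])


def exact_availability(p):
    """E[phi(X)]: sum of phi(x) times the product-form probability of x."""
    total = 0.0
    for bits in product((0, 1), repeat=len(p)):
        x = dict(zip(sorted(p), bits))
        prob = 1.0
        for i, xi in x.items():
            prob *= p[i] if xi else 1 - p[i]
        total += phi(x) * prob
    return total


naive = phi(P)  # plugging p_i into the unsimplified polynomial (wrong)
# Simplified polynomial x1*x2 + x1*x3 - x1*x2*x3 with x_i replaced by p_i:
simplified = P[1] * P[2] + P[1] * P[3] - P[1] * P[2] * P[3]
exact = exact_availability(P)

print("naive substitution :", naive)
print("simplified formula :", simplified)
print("exact enumeration  :", exact)
```

The naive substitution overestimates the availability because it effectively treats the two occurrences of component 1 as independent components.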


Demo/exercise

• Calculate the availability of the system below using the data given in the table

• Hint: Use the structure function derived earlier and the availabilities of the components

[Figure: reliability block diagram with components 1, 2 and 3]

i    MTTF_i (hours)    MDT_i (hours)
1    750               8
2    300               15
3    500               10


Models with state-dependent rates

• Earlier we assumed that components are completely independent of each other
• Markov models can have state-dependent rates
  – The dynamics (or transition rates) may depend on the state to reflect some physical causes resulting from the given state
  – For example, if there is only one repair person, the repair rates are affected when there are many faults
  – But we still assume that the times to failure and down times obey exponential distributions
• One can construct the associated Markov process and solve the steady state via the global balance equations


Example

• Consider a parallel structure of two components. Uptimes are exponentially distributed with rates $\lambda_1$ and $\lambda_2$. Repair rates are, correspondingly, $\mu_1$ and $\mu_2$. Also, there is only one person to do the repairs, who splits the time equally between components 1 and 2 when both are down.

System state    State of component 1    State of component 2
0               1                       1
1               0                       1
2               1                       0
3               0                       0

[Figure: state transition diagram with rates 0 → 1: $\lambda_1$, 0 → 2: $\lambda_2$, 1 → 0: $\mu_1$, 2 → 0: $\mu_2$, 1 → 3: $\lambda_2$, 2 → 3: $\lambda_1$, 3 → 1: $\mu_2/2$, 3 → 2: $\mu_1/2$]

• Now solve the equilibrium probabilities $\pi_i$. The average availability is the probability that at least one component works:

\[ A_{av} = \pi_0 + \pi_1 + \pi_2 \]
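The equilibrium probabilities of this four-state model can be obtained by solving the global balance equations together with the normalization condition. The sketch below does this with a small Gaussian elimination in plain Python. The numerical rates are assumed values for illustration only, and `steady_state` is a hypothetical helper name, not something defined in the course.

```python
LAM1, LAM2 = 0.01, 0.02  # assumed failure rates (illustrative)
MU1, MU2 = 1.0, 0.5      # assumed repair rates (illustrative)

# Transition rates between system states 0..3 of the example.
RATES = {
    (0, 1): LAM1, (0, 2): LAM2,        # a component fails while both are up
    (1, 0): MU1, (2, 0): MU2,          # the only failed component is repaired
    (1, 3): LAM2, (2, 3): LAM1,        # the remaining component fails
    (3, 1): MU2 / 2, (3, 2): MU1 / 2,  # repair effort is shared when both are down
}


def steady_state(rates, n):
    """Solve pi Q = 0 with sum(pi) = 1 by Gaussian elimination."""
    q = [[0.0] * n for _ in range(n)]
    for (i, j), r in rates.items():
        q[i][j] += r
        q[i][i] -= r
    # Row j of a is the balance equation of state j: sum_i pi_i q[i][j] = 0.
    a = [[q[i][j] for i in range(n)] for j in range(n)]
    b = [0.0] * n
    a[n - 1] = [1.0] * n  # replace one redundant equation by normalization
    b[n - 1] = 1.0
    for col in range(n):  # forward elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(a[r][c] * pi[c] for c in range(r + 1, n))
        pi[r] = (b[r] - s) / a[r][r]
    return pi


pi = steady_state(RATES, 4)
print("pi   =", pi)
print("A_av =", pi[0] + pi[1] + pi[2])
```

For larger models a linear algebra library would be the more usual choice; the hand-written solver is used here only to keep the sketch self-contained.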




Topology design problem

• Topology design is the starting point in network design
• Think of the network as a graph with nodes connected by links
  – Typically the network topology is heavily influenced by the set of physical locations that need connectivity, so the nodes are often given
  – Also, many of the primary links between nodes are defined by the node locations
  – In practice, the design space allows adding a few additional links and nodes
• The question is...
  – Given a network topology (nodes + links), what is a reliable network?
  – By considering the network as a graph, reliability/availability can be formalized by the notion of graph connectivity


Graphs and k-connectivity

• Consider the network as a graph G(N,J) consisting of a set of nodes N and a set of links J
• Definition: A graph is said to be connected if there exists a path between every pair of nodes in the graph.
• Definition: Graph G is k-edge-connected if it remains connected after removal of any k-1 edges.
  – Remember: edge = link
• Definition: Graph G is k-vertex-connected if it remains connected after removal of any k-1 vertices.
  – Remember: vertex = node
  – Removal of a node means that all links connected to the node are removed from the graph
• Efficient algorithms exist to check the k-connectivity of a graph


Examples

• 1-edge-connected

• 2-edge-connected



Topology design method (1)

• Topology design objective:
  – For redundancy, all nodes in the network need to be at least 2-(edge)-connected with probability 0.99999 (i.e., “5 nines”)
  – That is, the network must be resilient to single link failures
• Consider a given network topology represented by graph G(N,J)
• Assume that link $j \in J$ is operational with probability $p_j$, but the nodes are perfectly reliable
  – State $\mathbf{x} = (x_1, \ldots, x_J)$
  – State space $\Omega = \{0,1\}^J$



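• The structure function is then

\[ \phi(\mathbf{x}) = \begin{cases} 1 & \text{if the network in state } \mathbf{x} \text{ is 2-connected} \\ 0 & \text{otherwise} \end{cases} \]

• And the availability is defined as

\[ A = P\{\mathbf{X} \text{ is 2-connected}\} = \sum_{\mathbf{x} \in \Omega:\, \phi(\mathbf{x}) = 1} \pi(\mathbf{x}) \]

  – Note that the size of the state space is $2^J$ (grows exponentially!)

• If the availability is too low, new links need to be added
  – Need to define heuristics for identifying the most useful locations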
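For a small network, this availability can be computed by enumerating all link states. The sketch below checks 2-edge-connectivity by brute force (the graph must stay connected after removal of any single link) and weights each state by its product-form probability. The four-node ring, the extra chord and the link availability 0.99 are assumed values for illustration; the function names are hypothetical.

```python
from itertools import product


def is_connected(nodes, links):
    """True if every node is reachable from the first one (depth-first search)."""
    nodes = list(nodes)
    adj = {n: set() for n in nodes}
    for u, v in links:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        u = stack.pop()
        for v in adj[u] - seen:
            seen.add(v)
            stack.append(v)
    return len(seen) == len(nodes)


def is_2_edge_connected(nodes, links):
    """Connected, and still connected after removal of any single link."""
    links = list(links)
    if not is_connected(nodes, links):
        return False
    return all(is_connected(nodes, links[:k] + links[k + 1:])
               for k in range(len(links)))


def availability(nodes, links, p):
    """Sum of state probabilities over all 2^J link states that are 2-edge-connected."""
    links = list(links)
    total = 0.0
    for state in product((0, 1), repeat=len(links)):
        prob = 1.0
        for link, up in zip(links, state):
            prob *= p[link] if up else 1 - p[link]
        working = [link for link, up in zip(links, state) if up]
        if is_2_edge_connected(nodes, working):
            total += prob
    return total


NODES = [0, 1, 2, 3]
RING = [(0, 1), (1, 2), (2, 3), (3, 0)]
RING_WITH_CHORD = RING + [(0, 2)]
P_LINK = 0.99  # assumed availability of every link (illustrative)

a_ring = availability(NODES, RING, {l: P_LINK for l in RING})
a_chord = availability(NODES, RING_WITH_CHORD, {l: P_LINK for l in RING_WITH_CHORD})
print("ring:", a_ring, "ring + chord:", a_chord)
```

In this example the chord does not improve the availability: nodes 1 and 3 still have only two links each, so the loss of any ring link leaves one of them with a single link. This illustrates why the location of an added link matters.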



• Taking into account node failures
  – Node $n \in N$ is operational with probability $q_n$

• We still require that all nodes must stay 2-connected with 5-nines
  – Thus, all nodes must then be operational and

\[ A = P\{\mathbf{X} \text{ is 2-connected} \mid \text{all nodes on}\} \cdot P\{\text{all nodes on}\} = (q_1 \cdots q_N) \cdot P\{\mathbf{X} \text{ is 2-connected} \mid \text{all nodes on}\} \]

  – The conditional probability of 2-(edge)-connectedness is evaluated as before, assuming that nodes do not fail

• Note! This is just one version of the topology design objective, and new ones can easily be defined.


THE END

• What you should understand/remember:
  – what kind of things reliability theory studies
  – basic measures, MTTF, MDT
  – how to calculate the structure function of simple systems and how to use it to calculate the availability/reliability of a system
  – how to make Markov models of simple maintained systems and calculate the availability
  – how graph connectivity can be used as a measure of reliability in data network topology design
