7. Markov Models Reliable System Design 2011 by: Amir M. Rahmani


matlab1.ir

Markov Models

The primary difficulty with the combinatorial models is that many complex systems cannot be modeled easily in a combinatorial fashion.

The fault coverage is sometimes difficult to incorporate into the reliability expression in a combinatorial model.

The process of repair is very difficult to model in a combinatorial model.

Alternative: Markov models


Markov Process

In 1907 A.A. Markov published a paper in which he defined and investigated the properties of what are now known as Markov processes.

A Markov process with a discrete state space is referred to as a Markov Chain.

A set of random variables forms a Markov chain if the probability that the next state is $S_{n+1}$ depends only on the current state $S_n$, and not on any previous states.


Markov Process

A stochastic process is a function whose values are random variables

The classification of a random process depends on different quantities

• state space
• index (time) parameter
• statistical dependencies among the random variables X(t) for different values of the index parameter t


Markov Process

Categories of Markov state-space models:
• 1. Discrete space and discrete time
• 2. Discrete space and continuous time
• 3. Continuous space and discrete time
• 4. Continuous space and continuous time

The first two categories involve a discrete space; that is, the states of the system can be numbered with an integer.

In the first and the third categories, the system changes by discrete time steps.

The second category is the one most useful for modeling fault-tolerant systems.


Markov Process

States must be
• mutually exclusive
• collectively exhaustive

Let $P_i(t)$ = probability of being in state $S_i$ at time t.

Markov properties
• The future state probability depends only on the current state. It is independent of
  – the time already spent in the state
  – the path taken to reach the state


State Transition Diagrams

A Markov state transition diagram can graphically represent all:

• 1- System states and their initial conditions.
• 2- Transitions between system states and the corresponding transition rates.

When the state transition time Δt is very small, the transition rates are replaced with equivalent transition probabilities. This leads to two situations:

• 1- The system remains in the current state after the interval Δt with some probability.
• 2- The system goes to the next state(s) after the interval Δt with some probability, determined by the transition rates.


Construction of State Transition Diagram

The basic steps in constructing state transition diagrams are:

• 1- Define the failure criteria of the system.
• 2- Enumerate all of the possible states of the system and classify them into good or failed states.
• 3- Determine the transition rates between the various states and draw the state transition diagram.
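The first two steps can be sketched in Python. This is only an illustration, not taken from the slides: the function `enumerate_states`, the representation of a state as a tuple of per-module up/down flags, and the failure criterion (the system fails only when every module has failed) are all assumptions.

```python
from itertools import product

def enumerate_states(n_modules, system_works):
    """Enumerate all up/down combinations and classify them as good or failed.

    `system_works` encodes the failure criteria (step 1): it takes a tuple of
    booleans (True = module works) and returns True if the system still works.
    """
    good, failed = [], []
    for state in product((True, False), repeat=n_modules):
        (good if system_works(state) else failed).append(state)
    return good, failed

# Assumed criterion: a two-module system works while at least one module works.
good, failed = enumerate_states(2, any)
print("good states:  ", good)
print("failed states:", failed)
```

Step 3 (attaching a rate to each transition between these states) depends on the system being modeled, so it is left out of the sketch.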


Example

State diagram for one component

Let X denote the lifetime of a component. The Markov property is defined as follows:

$$P(t < X \le t + \Delta t \mid X > t) = \lambda \Delta t, \qquad \Delta t \to 0$$

The probability that a component fails in the small interval Δt is proportional to the length of the interval.

λ is the constant of proportionality. The probability above does not depend on the time t.


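Markov Process

Assume an exponential failure law with failure rate λ. The probability that the system has failed at t + Δt, given that it was working at time t, is given by

$$1 - e^{-\lambda \Delta t} \approx \lambda \Delta t \quad \text{for small } \Delta t$$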


Reliability for one component

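The probability that the component works at time t + Δt is

$$P_1(t + \Delta t) = (1 - \lambda \Delta t)\, P_1(t)$$

Rearranging and dividing by Δt gives

$$\frac{P_1(t + \Delta t) - P_1(t)}{\Delta t} = -\frac{\lambda \Delta t}{\Delta t}\, P_1(t)$$

Letting Δt → 0, we get

$$P_1'(t) = -\lambda P_1(t)$$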


Reliability for one component

The solution to this differential equation is

$$P_1(t) = C e^{-\lambda t}$$

Assuming that the component works at time t = 0, we have $P_1(0) = 1$, so $C = 1$.

The reliability of the component is:

$$R(t) = P_1(t) = e^{-\lambda t}$$
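As a numerical check, the difference equation $P_1(t + \Delta t) = (1 - \lambda \Delta t) P_1(t)$ can be stepped forward with a small Δt and compared with the closed form $e^{-\lambda t}$. The sketch below is illustrative only; the function name and the values of λ, the end time and Δt are assumptions.

```python
import math

def step_reliability(lam, t_end, dt):
    """Step P1(t + dt) = (1 - lam*dt) * P1(t) from t = 0 to t_end."""
    p1 = 1.0  # the component is assumed to work at t = 0
    for _ in range(round(t_end / dt)):
        p1 *= 1.0 - lam * dt
    return p1

lam, t_end = 0.001, 1000.0          # assumed failure rate (per hour) and horizon
approx = step_reliability(lam, t_end, dt=0.01)
exact = math.exp(-lam * t_end)      # closed form R(t) = exp(-lam * t)
print(approx, exact)
```

The two values agree more closely as Δt shrinks, which is the limit taken in the derivation.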


Failure probability for one component

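The probability that the component does not work at time t + Δt is

$$P_0(t + \Delta t) = \lambda \Delta t\, P_1(t) + P_0(t)$$

Rearranging and dividing by Δt gives

$$\frac{P_0(t + \Delta t) - P_0(t)}{\Delta t} = \lambda P_1(t)$$

Letting Δt → 0, we get

$$P_0'(t) = \lambda P_1(t)$$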


Failure probability for one component

Solving the differential equation, with $P_0(0) = 0$, yields

$$P_0(t) = 1 - e^{-\lambda t}$$


Markov chain model

The equation system can be written in matrix form, using the vector of state probabilities P(t) and a matrix Q.

Q is called the transition rate matrix.
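For the one-component model, the two differential equations $P_1'(t) = -\lambda P_1(t)$ and $P_0'(t) = \lambda P_1(t)$ can be collected as follows. This is a sketch assuming a row-vector convention with state order (1, 0); the slide's own matrices may be laid out as the transpose.

```latex
\begin{align*}
P'(t) &= P(t)\, Q, \\
P(t) &= \begin{bmatrix} P_1(t) & P_0(t) \end{bmatrix}, \\
Q &= \begin{bmatrix} -\lambda & \lambda \\ 0 & 0 \end{bmatrix}.
\end{align*}
```

Under this convention each row of Q sums to zero, and the row for the failed state is all zeros because no transition leaves it.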


Cold stand-by system with one spare

State diagram

State labeling
• 2: Primary module works
• 1: Spare module works (primary module does not work)
• 0: No module works, system failure

Assumption: The failure rate for the spare is zero.


Cold stand-by system with one spare

We calculate the reliability of the system by solving the equation system

Where


The Equation System

We solve this by Laplace transform using the following relation:

$$\mathcal{L}\{P'(t)\} = s\tilde{P}(s) - P(0)$$

Laplace transforms:

Time function          Laplace transform
$1$                    $1/s$
$t$                    $1/s^2$
$e^{-\lambda t}$       $1/(s+\lambda)$
$t\,e^{-\lambda t}$    $1/(s+\lambda)^2$


Solving the Equation System

Applying the Laplace transform, we get

where

which gives us



1- We compute

which gives the following time function

2- We compute

The reliability of the system can be written as:
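A sketch of how this derivation can go is given below. It assumes that the primary module and the active spare both fail at the same rate λ, and that the system starts in state 2; the slide's own equations are not reproduced here, so these should be read as assumptions.

```latex
\begin{align*}
P_2'(t) &= -\lambda P_2(t), \quad
P_1'(t) = \lambda P_2(t) - \lambda P_1(t), \quad
P_0'(t) = \lambda P_1(t) \\
\tilde{P}_2(s) &= \frac{1}{s + \lambda}
  \;\Rightarrow\; P_2(t) = e^{-\lambda t} \\
\tilde{P}_1(s) &= \frac{\lambda}{(s + \lambda)^2}
  \;\Rightarrow\; P_1(t) = \lambda t\, e^{-\lambda t} \\
R(t) &= P_2(t) + P_1(t) = (1 + \lambda t)\, e^{-\lambda t}
\end{align*}
```

The system works in states 2 and 1, so the reliability is the sum of those two state probabilities.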


Calculating MTTF

Let X1 and X2 denote the time spent in state 2 and state 1, respectively. MTTF for the system can then be written as

Alternatively, the MTTF can be computed as
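Both routes can be checked numerically. The sketch below assumes a cold stand-by system in which both modules have failure rate λ, so that the mean time in each of states 2 and 1 is 1/λ and $R(t) = (1 + \lambda t)e^{-\lambda t}$. It compares the sum of the mean sojourn times with the standard identity MTTF = ∫ R(t) dt, evaluated by the trapezoidal rule. Function names and numeric values are illustrative.

```python
import math

def reliability_cold_standby(lam, t):
    """Assumed closed form R(t) = (1 + lam*t) * exp(-lam*t)."""
    return (1.0 + lam * t) * math.exp(-lam * t)

def mttf_by_integration(reliability, t_max, steps):
    """Trapezoidal approximation of the integral of R(t) from 0 to t_max."""
    dt = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    for k in range(1, steps):
        total += reliability(k * dt)
    return total * dt

lam = 0.01                                   # assumed failure rate per hour
mttf_sum = 1.0 / lam + 1.0 / lam             # E[X1] + E[X2]
mttf_numeric = mttf_by_integration(
    lambda t: reliability_cold_standby(lam, t), t_max=3000.0, steps=30000)
print(mttf_sum, mttf_numeric)
```

The upper limit `t_max` only needs to be large enough that R(t) is negligible beyond it.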


Reliability


Coverage

Designing a fault-tolerant system that will correctly detect, mask or recover from every conceivable fault, or error, is not possible in practice.

Even if a system can be designed to tolerate a very large number of faults, or errors, there is for most systems a non-zero probability that a single fault will not be handled correctly. Such faults are known as “non-covered” faults.

The probability that a fault is covered (i.e., correctly handled by the fault-tolerance mechanisms) is known as the coverage factor, and denoted c.

The probability that a fault is non-covered can then be written as 1 - c.


Cold Stand-by system with Coverage factor

State diagram

We can write down the Q-matrix directly by inspecting the state diagram.



We have the following equation system

After applying the Laplace transform, we get

We then compute



The first transform can be computed directly from the first equation.

We then compute

Reliability for the system is
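The model can also be checked numerically. The sketch below is an illustration under stated assumptions, not the slide's own derivation: a covered fault in the primary (rate cλ) moves the system from state 2 to state 1, a non-covered fault (rate (1 - c)λ) moves it directly to state 0, and the active spare fails at rate λ. The row-vector convention P' = PQ, the helper names and the numeric values are also assumptions. Under these assumptions the closed form is $R(t) = e^{-\lambda t}(1 + c\lambda t)$.

```python
import math

def vec_mat(p, Q):
    """Row vector times matrix: (p Q)[j] = sum_i p[i] * Q[i][j]."""
    n = len(p)
    return [sum(p[i] * Q[i][j] for i in range(n)) for j in range(n)]

def integrate(Q, p_init, t_end, steps):
    """Integrate dP/dt = P Q with the classical fourth-order Runge-Kutta method."""
    h = t_end / steps
    p = list(p_init)
    for _ in range(steps):
        k1 = vec_mat(p, Q)
        k2 = vec_mat([x + 0.5 * h * k for x, k in zip(p, k1)], Q)
        k3 = vec_mat([x + 0.5 * h * k for x, k in zip(p, k2)], Q)
        k4 = vec_mat([x + h * k for x, k in zip(p, k3)], Q)
        p = [x + h / 6.0 * (a1 + 2 * a2 + 2 * a3 + a4)
             for x, a1, a2, a3, a4 in zip(p, k1, k2, k3, k4)]
    return p

lam, c, t = 0.001, 0.9, 1000.0   # assumed failure rate, coverage, mission time
# State order: 2 (primary works), 1 (spare works), 0 (system failure).
Q = [
    [-lam, c * lam, (1 - c) * lam],
    [0.0, -lam, lam],
    [0.0, 0.0, 0.0],
]
p2, p1, p0 = integrate(Q, [1.0, 0.0, 0.0], t, steps=1000)
reliability = p2 + p1
closed_form = math.exp(-lam * t) * (1 + c * lam * t)
print(reliability, closed_form)
```

Setting c = 1 recovers the cold stand-by model without coverage; setting c = 0 reduces it to a single component.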


The Reliability with Coverage factor


Calculating MTTF


Availability

Definition: the probability that a system is functioning properly at a given time t.

When calculating the availability we consider both failures and repairs. We must make assumptions about the function time (up time) and the repair time (down time).

The repair time consists of the time between the system failure and the start of the repair, the time it takes to perform the repair, and the time it takes to restart the system after the repair is completed.


Steady-state Availability

E [X0] = MTTFF (Mean Time To First Failure)

E [Xi] = MTTF (Mean Time To Failure)

E [Yi] = MTTR (Mean Time To Repair)

MTTR + MTTF = MTBF (Mean Time Between Failures)


Design Tradeoffs

How to make availability approach 100%?

MTTF → infinity (high reliability)

MTTR → zero (fast recovery)

$$\text{Availability} = \frac{MTTF}{MTTF + MTTR}$$


Availability vs. Reliability

– Reliability is measured by mean time to failure (MTTF).

– There is no repair in the state of system failure when modeling reliability.

– Availability is a function of MTTF and mean time to repair (MTTR): MTTF/(MTTF+MTTR).

– A system may have a high MTBF, but low availability
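The last point can be illustrated with a small calculation. The two systems below are hypothetical and the numbers (in hours) are chosen only to make the contrast visible.

```python
def availability(mttf, mttr):
    """Steady-state availability MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

# Hypothetical system A: fails fairly often but is repaired quickly.
mtbf_a = 1_000.0 + 1.0
avail_a = availability(mttf=1_000.0, mttr=1.0)

# Hypothetical system B: fails rarely but takes very long to repair.
mtbf_b = 10_000.0 + 10_000.0
avail_b = availability(mttf=10_000.0, mttr=10_000.0)

print(f"A: MTBF = {mtbf_a:.0f} h, availability = {avail_a:.4f}")
print(f"B: MTBF = {mtbf_b:.0f} h, availability = {avail_b:.4f}")
```

System B has the much higher MTBF, yet it is available only half of the time.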


Markov chain model for a simplex system

State 0: System OK
State 1: System failure
Failure rate: λ
Repair rate: μ

Availability: $A(t) = P_0(t)$

Reliability: $R(t) = e^{-\lambda t}$

Maintainability: $M(t) = 1 - e^{-\mu t}$


The availability for a simplex system
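A sketch of the derivation for this model follows, assuming state 0 is "system OK", state 1 is "system failure", failures occur at rate λ, repairs at rate μ, and the system starts in state 0.

```latex
\begin{align*}
P_0'(t) &= -\lambda P_0(t) + \mu P_1(t) \\
P_1'(t) &= \lambda P_0(t) - \mu P_1(t), \qquad P_0(0) = 1,\; P_1(0) = 0 \\
A(t) = P_0(t) &= \frac{\mu}{\lambda + \mu}
  + \frac{\lambda}{\lambda + \mu}\, e^{-(\lambda + \mu) t}
\end{align*}
```

The exponential term decays with rate λ + μ, so A(t) falls from 1 at t = 0 towards the constant μ/(λ + μ).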



Steady-state Availability

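Assuming exponentially distributed function times and repair times, MTTF = 1/λ and MTTR = 1/μ, so we get

$$\lim_{t \to \infty} A(t) = \frac{MTTF}{MTTF + MTTR} = \frac{\mu}{\lambda + \mu}$$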


Markov chain for a hot stand-by system

States 0, 1: System OK
State 2: System failure
Failure rate: λ
Repair rate: μ

Availability: $A(t) = P_0(t) + P_1(t)$

Assumption: Only one repair-person works with the system when a failure has occurred.
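The steady-state availability of such a model can be sketched from the balance equations. The transition rates used here are assumptions, since the state diagram itself is not reproduced: state 0 has both modules working and is left at rate 2λ, state 1 has one module working and fails at rate λ, and the single repair-person returns the system from state 2 to state 1 and from state 1 to state 0 at rate μ.

```python
def hot_standby_steady_state(lam, mu):
    """Steady-state probabilities (pi0, pi1, pi2) for the assumed chain.

    Balance equations: pi0 * 2*lam = pi1 * mu  and  pi1 * lam = pi2 * mu.
    """
    r1 = 2.0 * lam / mu        # pi1 / pi0
    r2 = r1 * lam / mu         # pi2 / pi0
    pi0 = 1.0 / (1.0 + r1 + r2)
    return pi0, r1 * pi0, r2 * pi0

lam, mu = 0.001, 0.1           # assumed failure and repair rates per hour
pi0, pi1, pi2 = hot_standby_steady_state(lam, mu)
availability = pi0 + pi1       # the system is OK in states 0 and 1
print(availability)
```

With a second repair-person the repair rate out of state 2 would change, which is why the single-repairer assumption matters for the result.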


Safety

Definition: The probability that a system is either functioning properly or is in a safe failed state.

Calculating safety is similar to calculating reliability.

In a reliability model there is usually only one absorbing state, while in a safety model there are at least two absorbing states.

Among the absorbing states in a safety model, at least one represents that the system is in a safe shut-down state, and at least one represents that a catastrophic failure has occurred.


Safety for a simplex system with coverage factor

We obtain the following Markov chain model

and the corresponding transition-rate matrix


Safety for a simplex system with coverage factor

The solutions of the differential equations are:

The safety of the system is:

The steady-state safety is:
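A minimal sketch of these results is given below, under an assumed three-state model: the working state is left at rate λ, a covered fault (probability c) leads to a safe shut-down state, and a non-covered fault (probability 1 - c) leads to a catastrophic failure state. The state names and function names are hypothetical. Under these assumptions the safety is $S(t) = c + (1 - c)e^{-\lambda t}$, which tends to c as t grows.

```python
import math

def state_probabilities(lam, c, t):
    """Return (P_ok, P_safe, P_unsafe) at time t for the assumed chain."""
    p_ok = math.exp(-lam * t)
    p_safe = c * (1.0 - p_ok)            # covered faults: safe shut-down
    p_unsafe = (1.0 - c) * (1.0 - p_ok)  # non-covered faults: catastrophic
    return p_ok, p_safe, p_unsafe

def safety(lam, c, t):
    """Safety = probability of working or being in the safe failed state."""
    p_ok, p_safe, _ = state_probabilities(lam, c, t)
    return p_ok + p_safe

lam, c = 0.001, 0.95          # assumed failure rate and coverage factor
s_start = safety(lam, c, 0.0)
s_late = safety(lam, c, 1.0e6)
print(s_start, s_late)
```

In this model the steady-state safety depends only on the coverage factor, not on the failure rate.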
