Software Reliability Research Pankaj Jalote Professor, CSE, IIT Kanpur, India


System Reliability

System – an entity that provides defined behavior at interfaces

• A system is a hierarchy of subsystems, each subsystem being a system

Reliability of a system – its ability to provide failure-free operation

Failure – the system behavior is incorrect or not as expected; it is a random phenomenon

Reliability Quantification

Reliability of a system is defined as the probability of failure-free operation over a time period

R(t) = probability that the system has not failed by time t

For reliability work, the distribution of R(t) is often specified

Reliability Quantification..

Reliability can also be quantified by Mean Time to Failure (MTTF)

Also by failure rate (number of failures per unit time)

From R(t), MTTF or failure rate can be determined

Under some assumptions, failure rate and MTTF are inversely related
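One common assumption under which the inverse relation holds is a constant failure rate, which gives the exponential form R(t) = exp(-rate * t) and MTTF = 1/rate. The sketch below assumes that model (the slides do not name one) and uses a hypothetical rate; it checks the relation by numerically integrating R(t).

```python
import math

def reliability(t, failure_rate):
    """R(t) under an assumed constant failure rate (exponential model)."""
    return math.exp(-failure_rate * t)

def mttf_numeric(failure_rate, steps=200_000):
    """Approximate MTTF as the integral of R(t) from 0 to a large horizon."""
    horizon = 50.0 / failure_rate  # R(t) is negligible beyond this point
    dt = horizon / steps
    # Trapezoidal rule: half weight on the two end points
    total = 0.5 * (reliability(0.0, failure_rate) + reliability(horizon, failure_rate))
    for k in range(1, steps):
        total += reliability(k * dt, failure_rate)
    return total * dt

rate = 0.02  # hypothetical value: failures per hour
print("R(10 h)      :", reliability(10, rate))
print("MTTF numeric :", mttf_numeric(rate))
print("1 / rate     :", 1 / rate)
```

For other distributions of time to failure the failure rate is not constant, and the simple reciprocal relation need not hold.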

Software Reliability

Software (un)reliability is caused not by aging but by bugs

The more the bugs, the lower the reliability of the software

Still, failures seem random, hence reliability theory can be applied

Software Reliability Research

Two main threads

• Software reliability modeling – how to model and predict software reliability

• Improving software reliability – by removing defects through program checking, verification, testing, …

Will discuss some work being done here in these two threads

Software Reliability Modeling

Software Reliability

Software systems often are one-off

• Measuring reliability in the lab is not practical, as too much failure data is needed; this requires time

Failures often result in fault removal, leading to reliability improvement

• Predicting future reliability from measured reliability is harder

Hence different models are needed

Software Reliability Growth Models

Assume that reliability is a function of the defect level and, as defects are removed, reliability improves

Model the failure-fix process of software evolution

Many models have been proposed in the last 3 decades

Model parameters are determined from past data on failures and fixes
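To make the idea concrete, here is a minimal sketch using the Goel-Okumoto model, one widely cited SRGM; the slides do not say which models are meant, so the choice is an assumption. Expected cumulative failures are a * (1 - exp(-b * t)), where a is the total expected number of failures and b the detection rate. The failure data is synthetic, and the fitting routine (a grid over b with a closed-form least-squares a) is only illustrative.

```python
import math

def expected_failures(t, a, b):
    """Goel-Okumoto mean value function: expected cumulative failures by time t."""
    return a * (1.0 - math.exp(-b * t))

def failure_intensity(t, a, b):
    """Derivative of the mean value function; it falls as defects are removed."""
    return a * b * math.exp(-b * t)

def fit(times, cumulative_failures, b_grid):
    """Least-squares fit: try each b, solve for the best a in closed form."""
    best = None
    for b in b_grid:
        g = [1.0 - math.exp(-b * t) for t in times]
        a = sum(y * gi for y, gi in zip(cumulative_failures, g)) / sum(gi * gi for gi in g)
        sse = sum((y - a * gi) ** 2 for y, gi in zip(cumulative_failures, g))
        if best is None or sse < best[0]:
            best = (sse, a, b)
    return best[1], best[2]

# Synthetic "past data": cumulative failures after each of 12 weeks of testing
weeks = list(range(1, 13))
observed = [expected_failures(t, 100.0, 0.1) for t in weeks]

b_grid = [k / 100 for k in range(1, 101)]
a_hat, b_hat = fit(weeks, observed, b_grid)
print("estimated a, b      :", a_hat, b_hat)
print("intensity at week 1 :", failure_intensity(1, a_hat, b_hat))
print("intensity at week 12:", failure_intensity(12, a_hat, b_hat))
```

With real data the fit is usually done by maximum likelihood rather than a grid search, and the choice of model matters.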

Reliability of Software Products

For software products, a large population exists in the field and faults are not removed as failures occur

According to SRGMs, the reliability should remain the same

I.e. the failure rate should be constant

Average Failure Rate of a MS Product

[Figure: failure intensity (failures/month/unit, axis from 0 to 0.09) plotted against months from release (1 to 11)]

Reasons for this Phenomenon

Users learn with time and avoid failure-causing situations

Users start by exploring more, then limit themselves to some part of the product

• Most users use only a few product features

Configuration-related failures are much more frequent at the start

These failures reduce with time

A New Model for Product Reliability

For a user, there is a transient failure rate, which decays by a decay factor

With time the transient dies out, and the failure rate reaches a steady state

Steady state failure rate – represents the reliability of the product

Failure Rate of a Unit

Failure rate for one unit is

λ(i) = λ0 * α^i + λf

λ0 is the initial transient rate

λf is the final steady state rate

α is the decay factor

[Figure: failure rate of a unit plotted against time]

Applying it to a Product

Considered the failure and sales data of a real product for MS

Applying the model to the data and determining parameters, we get

λ0 = 0.04 failures/month

λf = 0.008 failures/month

α = 0.4 (i.e. 40% decay each month)
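As a quick illustration, the sketch below evaluates the per-unit model with the fitted parameters. Two things are assumptions: the index i is taken as whole months since the unit entered use, starting at 0, and the formula is applied literally, so the transient term is multiplied by α each month.

```python
LAMBDA_0 = 0.04   # initial transient rate, failures/month (from the slide)
LAMBDA_F = 0.008  # steady state rate, failures/month (from the slide)
ALPHA = 0.4       # decay factor (from the slide)

def unit_failure_rate(i, lambda_0=LAMBDA_0, lambda_f=LAMBDA_F, alpha=ALPHA):
    """Failure rate of one unit in month i: transient part plus steady state."""
    return lambda_0 * alpha ** i + lambda_f

for month in range(6):
    rate = unit_failure_rate(month)
    # The ratio shows how far the unit still is from the steady state rate
    print(month, round(rate, 5), round(rate / LAMBDA_F, 2))
```

Under these assumptions the rate in month 0 is six times the steady state rate, and the transient term falls below the steady state rate within a few months.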

Example…

Steady state failure rate is 1/6th of the average rate in month 2, 1/3rd of the average rate in month 4

I.e. initial MTTF could be 1/6th the steady state MTTF

Steady state is reached quite soon – in two to three months
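The averages quoted here come from the product's real failure and sales data, which the slides do not reproduce. The following is only a sketch of how a population average could be built from the per-unit model: the monthly sales figures are hypothetical, and the aggregation rule (weighting each month's cohort by its age) is an assumption, not necessarily the procedure used in the study.

```python
def unit_failure_rate(age, lambda_0=0.04, lambda_f=0.008, alpha=0.4):
    """Per-unit failure rate for a unit that has been in use for `age` months."""
    return lambda_0 * alpha ** age + lambda_f

def average_failure_rate(sales, month):
    """Average failures/month/unit over all units sold in months 0..month."""
    cohorts = sales[: month + 1]
    total_units = sum(cohorts)
    expected_failures = sum(
        units * unit_failure_rate(month - sold_in)
        for sold_in, units in enumerate(cohorts)
    )
    return expected_failures / total_units

# Hypothetical units sold in each month after release
hypothetical_sales = [1000, 800, 600, 400, 300, 200]

for month in range(len(hypothetical_sales)):
    print(month, round(average_failure_rate(hypothetical_sales, month), 4))
```

Because new units keep entering the field with a high transient rate, the population average stays above the steady state rate for longer than any single unit does.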

Software Architecture Based Reliability Estimation

Software Architecture

Architecture is the components in the system and how they are connected

Is decided very early in a software project

If reliability and performance can be modeled from the architecture, the architecture can be improved

Some work is going on in architecture-based performance and reliability modeling
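One simple way to estimate reliability from an architecture is a path-based calculation: each usage scenario visits a sequence of components, and a scenario succeeds only if every component on its path works. The sketch below illustrates that general idea, not the specific models referred to above; the component names, reliabilities, and scenario probabilities are all hypothetical, and component failures are assumed independent.

```python
# Hypothetical per-invocation reliabilities of the components
COMPONENT_RELIABILITY = {"ui": 0.99, "logic": 0.98, "storage": 0.95}

# Hypothetical usage scenarios: (probability of scenario, components on its path)
SCENARIOS = [
    (0.7, ["ui", "logic"]),
    (0.3, ["ui", "storage"]),
]

def path_reliability(path, reliabilities):
    """Probability that every component on the path works (independence assumed)."""
    result = 1.0
    for component in path:
        result *= reliabilities[component]
    return result

def system_reliability(scenarios, reliabilities):
    """Scenario-weighted average of the path reliabilities."""
    return sum(p * path_reliability(path, reliabilities) for p, path in scenarios)

print(system_reliability(SCENARIOS, COMPONENT_RELIABILITY))

# What-if: improve the weakest component and compare
improved = dict(COMPONENT_RELIABILITY, storage=0.99)
print(system_reliability(SCENARIOS, improved))
```

The what-if at the end shows the intended use: comparing architectural alternatives before the system is built.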

Program Verification

Basic goal – to ensure that the program is free of defects (bugs) as much as possible

Good program verification leads to higher reliability

Program Verification Techniques

Testing – program is executed with test data to find bugs

Static analysis – program source code is analyzed

Dynamic analysis – program is run on some data and assertions are made

Model checking

Formal verification
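As a small illustration of the dynamic analysis entry, the sketch below runs a routine on sample data with assertions that check a precondition and a loop invariant while it executes. The function and the sample inputs are hypothetical, chosen only to show the idea.

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    # Precondition checked at run time: the input must be sorted
    assert all(a <= b for a, b in zip(sorted_items, sorted_items[1:])), "input must be sorted"
    lo, hi = 0, len(sorted_items)
    while lo < hi:
        # Invariant checked on every iteration: the search window is valid
        assert 0 <= lo < hi <= len(sorted_items)
        mid = (lo + hi) // 2
        if sorted_items[mid] < target:
            lo = mid + 1
        elif sorted_items[mid] > target:
            hi = mid
        else:
            return mid
    return -1

print(binary_search([1, 3, 5, 7], 5))
print(binary_search([1, 3, 5], 4))

try:
    binary_search([3, 1, 2], 1)  # violates the precondition
except AssertionError as err:
    print("assertion failed:", err)
```

Unlike a test, which compares the final output with an expected value, the assertions flag a violation at the point in the run where it occurs.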

Techniques

Most techniques work in isolation

Sometimes they are complementary in their defect detection capability

Combining techniques meaningfully can improve reliability

We are working on techniques for combining testing and static analysis

State-based Testing Automation

Testing

Testing remains the main verification activity – most reliance is on it

Consumes as much as half of the total effort in a software product

Testing: test case design, execution, checking the results, then debugging, fixing, retesting

Each step is expensive

Test Automation

Test automation can help reduce cost and make testing more effective

Most test automation approaches focus on data collection and re-testing

Little effort in complete end-to-end automation

We are working on automating OO testing using state-based models
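To show what state-based test automation can look like, the sketch below derives test sequences from a small state model of a class and runs them automatically. It is only an illustration of the general idea, not the approach being developed: the BoundedStack class, its three abstract states, and the transition-coverage strategy are all assumptions made for the example.

```python
from collections import deque

class BoundedStack:
    """Hypothetical class under test."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
    def push(self, x):
        if len(self.items) >= self.capacity:
            raise OverflowError("stack full")
        self.items.append(x)
    def pop(self):
        if not self.items:
            raise IndexError("stack empty")
        return self.items.pop()

def abstract_state(stack):
    """Map the concrete object to one of the model's abstract states."""
    if not stack.items:
        return "EMPTY"
    return "FULL" if len(stack.items) == stack.capacity else "PARTIAL"

# State model for capacity 2: (state, operation) -> expected next state
MODEL = {
    ("EMPTY", "push"): "PARTIAL",
    ("PARTIAL", "push"): "FULL",
    ("PARTIAL", "pop"): "EMPTY",
    ("FULL", "pop"): "PARTIAL",
}

def path_to(target, model, start="EMPTY"):
    """Shortest operation sequence from the start state to target (BFS)."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, ops = queue.popleft()
        if state == target:
            return ops
        for (src, op), dst in model.items():
            if src == state and dst not in seen:
                seen.add(dst)
                queue.append((dst, ops + [op]))
    raise ValueError("unreachable state: " + target)

def generate_tests(model):
    """One test per transition: reach its source state, then fire it."""
    return [path_to(src, model) + [op] for (src, op) in model]

def run_test(ops, model, capacity=2):
    """Execute the operations and compare each resulting state with the model."""
    stack, expected = BoundedStack(capacity), "EMPTY"
    for op in ops:
        stack.push(0) if op == "push" else stack.pop()
        expected = model[(expected, op)]
        if abstract_state(stack) != expected:
            return False
    return True

tests = generate_tests(MODEL)
for ops in tests:
    print(ops, "PASS" if run_test(ops, MODEL) else "FAIL")
```

Test design, execution, and result checking are all driven by the model here, which is the end-to-end aspect; a fuller treatment would also cover illegal transitions such as pop on an empty stack.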

Summary

Software reliability is a rich and wide area

Exciting work going on across the world in modeling, analysis, program checking, testing, etc.

Lots of open issues
