Stochastic Modelling and Applied Probability 41

Terje Aven • Uwe Jensen

Stochastic Models in Reliability

Second Edition


Stochastic Mechanics
Random Media
Signal Processing and Image Synthesis
Mathematical Economics and Finance
Stochastic Optimization
Stochastic Control
Stochastic Models in Life Sciences

Stochastic Modelling and Applied Probability
(Formerly: Applications of Mathematics)

Volume 41

Edited by P.W. Glynn and Y. Le Jan

Advisory Board: M. Hairer, I. Karatzas, F.P. Kelly, A. Kyprianou, B. Øksendal, G. Papanicolaou, E. Pardoux, E. Perkins, H.M. Soner

For further volumes: http://www.springer.com/series/602

Terje Aven • Uwe Jensen

Stochastic Models in Reliability

Second Edition


Terje Aven
University of Stavanger
Stavanger, Norway

Uwe Jensen
Fak. Naturwissenschaften
Inst. Angewandte Mathematik u. Statistik
Universität Hohenheim
Stuttgart, Germany

ISSN 0172-4568
ISBN 978-1-4614-7893-5
ISBN 978-1-4614-7894-2 (eBook)
DOI 10.1007/978-1-4614-7894-2
Springer New York Heidelberg Dordrecht London

Library of Congress Control Number: 2013942488

Mathematics Subject Classification (2010): 60G, 60K, 60K10, 60K20, 90B25

© Springer Science+Business Media New York 1999, 2013

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Preface

In this second edition of the book, two major topics have been added to the original version. The first relates to copula models (Sect. 2.3), which are used to study the effects of structural dependencies on system reliability. We believe that an introduction to the fundamental ideas and concepts of copula models is important when reviewing basic reliability theory. The second new topic we have included is maintenance optimization models under constraints (Sect. 5.5). These models have been addressed in some recent publications to meet the demand for models that adequately balance economic criteria and safety. We consider two specific models. The first is the so-called delay time model, where the aim is to determine optimal inspection intervals minimizing the expected discounted costs under some safety constraints. The second model is also about optimal inspection, but here the system is represented by a monotone (coherent) structure function. In addition, we have made a number of minor adjustments to increase precision, and we have corrected misprints.

We received positive feedback on the first edition from friends and colleagues. Their hints and suggestions have been incorporated into this second edition. We thank all who contributed, by whatever means, to preparing the new edition.

Stavanger, Norway    Terje Aven
Stuttgart, Germany    Uwe Jensen


Preface to the First Edition

As can be seen from the files of the databases of Zentralblatt/Mathematical Abstracts and Mathematical Reviews, about 1% of all mathematical publications are connected to the keyword reliability. This gives an impression of the importance of this field and makes it clear that it is impossible to include all the topics connected to reliability in one book. The existing literature on reliability covers, inter alia, lifetime analysis, complex systems, and maintenance models, and the books by Barlow and Proschan [31, 32] can be viewed as first milestones in this area. Since then the models and tools have been developed further. The aim of Stochastic Models in Reliability is to give a comprehensive, up-to-date presentation of some of the classical areas of reliability, based on a more advanced probabilistic framework using the modern theory of stochastic processes. This framework allows the analyst to formulate general failure models, establish formulas for computing various performance measures, and determine how to identify optimal replacement policies in complex situations. A number of special cases analyzed previously can be included in this framework. Our book presents a unifying approach to some of the key research areas of reliability theory, summarizing and extending results obtained in recent years. With future work in this area in mind, it will be useful to have at hand a general set-up where the conditions and assumptions are formulated independently of particular models.

This book comprises five chapters in addition to two appendices.

Chapter 1 gives a short introduction to stochastic models of reliability, linking existing theory and the topics treated in this book. It also contains an overview of some questions and problems to be treated in the book. In addition, Sect. 1.1.6 explains why martingale theory is a useful tool for describing and analyzing the structure of complex reliability models. In the final section of the chapter we briefly discuss some important aspects of reliability modeling and analysis, and present two real-life examples. To apply reliability models in practice successfully, there are many challenges related to modeling and analysis that need to be faced. However, it is not within the scope of this book to discuss these challenges in detail. Our text is an introduction to the topic and of a motivational character.

Chapter 2 presents an overview of some parts of basic reliability theory: the theory of complex (monotone) systems, both binary and multistate, as well as lifetime distributions and nonparametric classes of lifetime distributions. The aim of this chapter is not to give a complete overview of the existing theory, but to highlight important areas and give a basis for the coming chapters.

Chapter 3 presents a general set-up for analyzing failure-prone systems. A (semi-)martingale approach is adopted. This general approach makes it possible to formulate a unifying theory of both nonrepairable and repairable systems, and it includes point processes, counting processes, and Markov processes as special cases. The time evolution of the system can also be analyzed on different information levels, which is one of the main attractions of the (semi-)martingale approach. Attention is drawn to the failure rate process, which is a key parameter of the model. Several examples of applications of the set-up are given, including a monotone (coherent) system of possibly dependent components, and failure time and (minimal) repair models. A model for analyzing the time to failure based on risk reserves (the difference between total income and accumulated costs of repairs) is also covered.

In the next two chapters we look more closely at types of models for analyzing situations where the system and its components can be repaired or replaced in the case of failures, and where we model the downtime or the costs associated with downtimes.

Chapter 4 gives an overview of the availability theory of complex systems having components that are repaired upon failure. Emphasis is placed on monotone systems comprising independent components, each generating an alternating renewal process. Multistate systems are also covered, as well as systems comprising cold standby components. Different performance measures are studied, including the distributions of the number of system failures in a time interval and of the downtime of the system in a time interval. The chapter gives a rather comprehensive asymptotic analysis, providing a theoretical basis for approximation formulae used in cases where the time interval considered is long or the components are highly available.

Chapter 5 presents a framework for models of maintenance optimization, using the set-up described in Chap. 3. The framework includes a number of interesting special cases dealt with by other authors.

By allowing different information levels, it is possible to extend, for example, the classical age replacement model and the minimal repair/replacement model to situations where information is available about the underlying condition of the system and the replacement time is based on this information. Again we illustrate the applicability of the model by considering monotone systems.

Chapters 3–5 are based on stochastic process theory, including the theory of martingales and of point, counting, and renewal processes. For the sake of completeness, and to help the reader who is not familiar with this theory, two appendices have been included summarizing the mathematical basis and some key results. Appendix A gives a general introduction to probability and stochastic process theory, whereas Appendix B gives a presentation of results from renewal theory. Appendix A also summarizes basic notation and symbols.

Although conceived mainly as a research monograph, this book can also be used for graduate courses and seminars. It primarily addresses probabilists and statisticians with research interests in reliability, but at least parts of it should be accessible to a broader group of readers, including operations researchers and engineers. A solid basis in probability and stochastic processes is required, however. In some countries many operations researchers and reliability engineers now have a rather comprehensive theoretical background in these topics, so that it should be possible for them to benefit from reading the more sophisticated theory presented in this book. To bring the reliability field forward, we believe that more operations researchers and engineers should be familiar with the probabilistic framework of modern reliability theory. Chapters 1 and 2 and the first parts of Chaps. 4 and 5 are more elementary and do not require the more advanced theory of stochastic processes.

References are kept to a minimum throughout, but readers are referred to the bibliographic notes following each chapter, which give a brief review of the material covered and related references.

Acknowledgments

We express our gratitude to our institutions, Stavanger University College, the University of Oslo, and the University of Ulm, for providing a rich intellectual environment and facilities indispensable for the writing of this book. The authors are grateful for the financial support provided by the Norwegian Research Council and the Deutscher Akademischer Austauschdienst. We would also like to acknowledge our indebtedness to Jelte Beimers, Jørund Gåsemyr, Harald Haukås, Tina Herberts, Karl Hinderer, Günter Last, Volker Schmidt, Richard Serfozo, Marcel Smith, Fabio Spizzichino, and Rune Winther for making helpful comments and suggestions on the manuscript. Thanks for TeXnical support go to Jürgen Wiedmann.

We especially thank Bent Natvig, University of Oslo, for the great deal of time and effort he spent reading and preparing comments. Thanks also go to the three reviewers for providing advice on the content and organization of the book. Their informed criticism motivated several refinements and improvements. Of course, we take full responsibility for any errors that remain.

We also acknowledge the editing and production staff at Springer for their careful work. In particular, we appreciate the smooth cooperation of John Kimmel.

Stavanger, Norway    Terje Aven
Ulm, Germany    Uwe Jensen

Contents

1 Introduction
  1.1 Lifetime Models
    1.1.1 Complex Systems
    1.1.2 Damage Models
    1.1.3 Different Information Levels
    1.1.4 Simpson's Paradox
    1.1.5 Predictable Lifetime
    1.1.6 A General Failure Model
  1.2 Maintenance
    1.2.1 Availability Analysis
    1.2.2 Optimization Models
  1.3 Reliability Modeling
    1.3.1 Nuclear Power Station
    1.3.2 Gas Compression System

2 Basic Reliability Theory
  2.1 Complex Systems
    2.1.1 Binary Monotone Systems
    2.1.2 Multistate Monotone Systems
  2.2 Basic Notions of Aging
    2.2.1 Nonparametric Classes of Lifetime Distributions
    2.2.2 Closure Theorems
    2.2.3 Stochastic Comparison
  2.3 Copula Models of Complex Systems in Reliability
    2.3.1 Introduction to Copula Models
    2.3.2 The Influence of the Copula on the Lifetime Distribution of the System
    2.3.3 Archimedean Copulas
    2.3.4 The Expectation of the Lifetime of a Two-Component System with Exponential Marginals
    2.3.5 Marshall–Olkin Distribution

3 Stochastic Failure Models
  3.1 Notation and Fundamentals
    3.1.1 The Semimartingale Representation
    3.1.2 Transformations of SSMs
  3.2 A General Lifetime Model
    3.2.1 Existence of Failure Rate Processes
    3.2.2 Failure Rate Processes in Complex Systems
    3.2.3 Monotone Failure Rate Processes
    3.2.4 Change of Information Level
  3.3 Point Processes in Reliability: Failure Time and Repair Models
    3.3.1 Alternating Renewal Processes: One-Component Systems with Repair
    3.3.2 Number of System Failures for Monotone Systems
    3.3.3 Compound Point Process: Shock Models
    3.3.4 Shock Models with State-Dependent Failure Probability
    3.3.5 Shock Models with Failures of Threshold Type
    3.3.6 Minimal Repair Models
    3.3.7 Comparison of Repair Processes for Different Information Levels
    3.3.8 Repair Processes with Varying Degrees of Repair
    3.3.9 Minimal Repairs and Probability of Ruin

4 Availability Analysis of Complex Systems
  4.1 Performance Measures
  4.2 One-Component Systems
    4.2.1 Point Availability
    4.2.2 The Distribution of the Number of System Failures
    4.2.3 The Distribution of the Downtime in a Time Interval
    4.2.4 Steady-State Distribution
  4.3 Point Availability and Mean Number of System Failures
    4.3.1 Point Availability
    4.3.2 Mean Number of System Failures
  4.4 Distribution of the Number of System Failures
    4.4.1 Asymptotic Analysis for the Time to the First System Failure
    4.4.2 Some Sufficient Conditions
    4.4.3 Asymptotic Analysis of the Number of System Failures
  4.5 Downtime Distribution Given System Failure
    4.5.1 Parallel System
    4.5.2 General Monotone System
    4.5.3 Downtime Distribution of the ith System Failure
  4.6 Distribution of the System Downtime in an Interval
    4.6.1 Compound Poisson Process Approximation
    4.6.2 Asymptotic Analysis
  4.7 Generalizations and Related Models
    4.7.1 Multistate Monotone Systems
    4.7.2 Parallel System with Repair Constraints
    4.7.3 Standby Systems

5 Maintenance Optimization
  5.1 Basic Replacement Models
    5.1.1 Age Replacement Policy
    5.1.2 Block Replacement Policy
    5.1.3 Comparisons and Generalizations
  5.2 A General Replacement Model
    5.2.1 An Optimal Stopping Problem
    5.2.2 A Related Stopping Problem
    5.2.3 Different Information Levels
  5.3 Applications
    5.3.1 The Generalized Age Replacement Model
    5.3.2 A Shock Model of Threshold Type
    5.3.3 Information-Based Replacement of Complex Systems
    5.3.4 A Parallel System with Two Dependent Components
    5.3.5 Complete Information About T1, T2 and T
    5.3.6 A Burn-In Model
  5.4 Repair Replacement Models
    5.4.1 Optimal Replacement Under a General Repair Strategy
    5.4.2 A Markov-Modulated Repair Process: Optimization with Partial Information
    5.4.3 The Case of m = 2 States
  5.5 Maintenance Optimization Models Under Constraints
    5.5.1 A Delay Time Model with Safety Constraints
    5.5.2 Optimal Test Interval for a Monotone Safety System

A Background in Probability and Stochastic Processes
  A.1 Basic Definitions
  A.2 Random Variables, Conditional Expectations
    A.2.1 Random Variables and Expectations
    A.2.2 Lp-Spaces and Conditioning
    A.2.3 Properties of Conditional Expectations
    A.2.4 Regular Conditional Probabilities
    A.2.5 Computation of Conditional Expectations
  A.3 Stochastic Processes on a Filtered Probability Space
  A.4 Stopping Times
  A.5 Martingale Theory
  A.6 Semimartingales
    A.6.1 Change of Time
    A.6.2 Product Rule

B Renewal Processes
  B.1 Basic Theory of Renewal Processes
  B.2 Renewal Reward Processes
  B.3 Regenerative Processes
  B.4 Modified (Delayed) Processes

References

Index

1 Introduction

This chapter gives an introduction to the topics covered in this book: failure time models, complex systems, different information levels, maintenance, and optimal replacement. We also include a section on reliability modeling, where we draw attention to some important factors to be considered in the modeling process. Two real-life examples are presented: a reliability study of a system in a power plant and an availability analysis of a gas compression system.

1.1 Lifetime Models

In reliability we are mainly concerned with devices or systems that fail at an unforeseen or unpredictable (this term is defined precisely later) random age T > 0. This random variable is assumed to have a distribution F, F(t) = P(T ≤ t), t ∈ R, with a density f. The hazard or failure rate λ is defined on the support of the distribution by

\[ \lambda(t) = \frac{f(t)}{\bar{F}(t)}, \]

with the survival function \(\bar{F}(t) = 1 - F(t)\). The failure rate λ(t) measures the proneness to failure at time t in that \(\lambda(t)\Delta t \approx P(T \le t + \Delta t \mid T > t)\) for small Δt. The (cumulative) hazard function is denoted by Λ,

\[ \Lambda(t) = \int_0^t \lambda(s)\,ds = -\ln \bar{F}(t). \]

The well-known relation

\[ \bar{F}(t) = P(T > t) = \exp\{-\Lambda(t)\} \tag{1.1} \]

establishes the link between the cumulative hazard and the survival function. Modeling in reliability theory is mainly concerned with additional information about the state of a system, which is gathered during the operating time of the system. This additional information leads to updated predictions about the proneness to system failure. There are many ways to introduce such additional information into the model. In the following sections some examples of how to introduce additional information and how to model the lifetime T are given.
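The relation (1.1) is easy to check numerically. The following Python sketch (the book contains no code; the Weibull distribution and all parameter values are illustrative assumptions) integrates the failure rate by the midpoint rule and compares exp{−Λ(t)} with the survival function:

```python
import math

# Illustrative Weibull example (shape K and scale LAM are assumptions):
# survival F̄(t) = exp(-(t/LAM)^K), failure rate λ(t) = (K/LAM)(t/LAM)^(K-1)
K, LAM = 2.0, 1.0

def survival(t):
    return math.exp(-(t / LAM) ** K)

def failure_rate(t):
    # λ(t) = f(t) / F̄(t)
    return (K / LAM) * (t / LAM) ** (K - 1)

def cumulative_hazard(t, n=100_000):
    # Λ(t) = ∫₀ᵗ λ(s) ds, approximated by the midpoint rule
    h = t / n
    return h * sum(failure_rate((i + 0.5) * h) for i in range(n))

# F̄(t) = exp{-Λ(t)}, as in (1.1)
t = 1.7
print(abs(survival(t) - math.exp(-cumulative_hazard(t))))  # essentially 0
```

The same check works for any lifetime distribution with a density; only `survival` and `failure_rate` need to be replaced.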

1.1.1 Complex Systems

As will be introduced in detail in Chap. 2, a complex system comprises n components with positive random lifetimes Ti, i = 1, 2, …, n, n ∈ N. Let Φ : {0, 1}ⁿ → {0, 1} be the structure function of the system, which is assumed to be monotone. The possible states of the components and of the system, "intact" and "failed," are indicated by "1" and "0," respectively. Then Φt = Φ(Xt) describes the state of the system at time t, where Xt = (Xt(1), …, Xt(n)) and Xt(i) denotes the indicator function

\[ X_t(i) = I(T_i > t) = \begin{cases} 1 & \text{if } T_i > t, \\ 0 & \text{if } T_i \le t, \end{cases} \]

which is 1 if component i is intact at time t, and 0 otherwise. The lifetime T of the system is then given by T = inf{t ∈ R₊ : Φt = 0}.

Example 1.1. As a simple example the following system with three components is considered, which is intact if component 1 and at least one of the components 2 or 3 are intact:

[Reliability block diagram: component 1 in series with the parallel structure of components 2 and 3]

In this example Φt = Xt(1){1 − (1 − Xt(2))(1 − Xt(3))} is easily obtained, with T = inf{t ∈ R₊ : Φt = 0} = T1 ∧ (T2 ∨ T3), where as usual a ∧ b and a ∨ b denote min{a, b} and max{a, b}, respectively. The additional information about the lifetime T is given by the observation of the states of the single components. As long as all components are intact, only a failure of component 1 leads to system failure. If one of the components 2 or 3 fails first, then the next component failure is a system failure.
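Example 1.1 lends itself to a small simulation. The Python sketch below (the exponential component lifetimes, the seed, and the sample size are assumptions made for illustration) verifies the structure function against the min/max representation on all state vectors and estimates ET by Monte Carlo; for i.i.d. Exp(1) components the survival function of T is 2e^{-2t} − e^{-3t}, so ET = 2/3:

```python
import random

def phi(x1, x2, x3):
    # structure function of Example 1.1
    return x1 * (1 - (1 - x2) * (1 - x3))

# phi agrees with the representation T = T1 ∧ (T2 ∨ T3) on every state vector
for x1 in (0, 1):
    for x2 in (0, 1):
        for x3 in (0, 1):
            assert phi(x1, x2, x3) == min(x1, max(x2, x3))

def system_lifetime(t1, t2, t3):
    return min(t1, max(t2, t3))  # T1 ∧ (T2 ∨ T3)

random.seed(0)
n = 200_000  # i.i.d. Exp(1) component lifetimes (assumption)
est = sum(system_lifetime(random.expovariate(1), random.expovariate(1),
                          random.expovariate(1)) for _ in range(n)) / n
print(est)  # close to the exact value 2/3
```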

Under the classical assumption that all components work independently, i.e., that the random variables Ti, i = 1, …, n, are independent, certain characteristics of the system lifetime are of interest:

• Determining the system lifetime distribution from the known component lifetime distributions, or at least finding bounds for this distribution (see Sects. 2.1 and 2.2).

• Are certain properties of the component lifetime distributions, like increasing failure rate (IFR) or increasing failure rate average (IFRA), preserved by forming monotone systems? One of these closure theorems states, for example, that the distribution of the system lifetime is IFRA if all component lifetimes have IFRA distributions (see Sect. 2.2).

• In what way does a certain component contribute to the functioning of the whole system? The answer to this question leads to the definition of several importance measures (see Sect. 2.1).

1.1.2 Damage Models

Additional information about the lifetime T can also be introduced into the model in a quite different way. If the state or damage of the system at time t ∈ R₊ can be observed, and this damage is described by a random variable Xt, then the lifetime of the system may be defined as

\[ T = \inf\{t \in \mathbb{R}_+ : X_t \ge S\}, \]

i.e., as the first time the damage hits a given level S. Here S can be a constant or, more generally, a random variable independent of the damage process. Some examples of damage processes X = (Xt) of this kind are described in the following subsections.

Wiener Process

The damage process is a Wiener process with positive drift starting at 0, and the failure threshold S is a positive constant. The lifetime of the system is then known to have an inverse Gaussian distribution. Models of this kind are especially of interest if one considers different environmental conditions under which the system is working, as, for example, in so-called burn-in models. An accelerated aging caused by additional stress or different environmental conditions can be described by a change of time. Let τ : R₊ → R₊ be an increasing function. Then Zt = Xτ(t) denotes the actual observed damage. The time transformation τ drives the speed of the deterioration. One possible way to express different stress levels in time intervals [ti, ti+1), 0 = t0 < t1 < … < tk, i = 0, 1, …, k − 1, k ∈ N, is the choice

\[ \tau(t) = \sum_{j=0}^{i-1} \beta_j (t_{j+1} - t_j) + \beta_i (t - t_i), \quad t \in [t_i, t_{i+1}), \ \beta_j > 0. \]

In this case it is seen that if F0 is the inverse Gaussian distribution function of T = inf{t ∈ R₊ : Xt ≥ S}, and F is the distribution function of the lifetime Ta = inf{t ∈ R₊ : Zt ≥ S} under accelerated aging, then F(t) = F0(τ(t)). A generalization in another direction is to consider a random time change, which means that τ is a stochastic process. In this way, randomly varying environmental conditions can be modeled.
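The piecewise-linear time change is simple to implement. A minimal Python sketch (the interval endpoints tj and stress factors βj are assumed values chosen for illustration):

```python
import bisect

T_KNOTS = [0.0, 1.0, 2.5, 5.0]   # 0 = t0 < t1 < ... < tk (assumed values)
BETAS   = [1.0, 2.0, 4.0]        # stress factors beta_j > 0 (assumed values)

def tau(t):
    """tau(t) = sum_{j<i} beta_j (t_{j+1} - t_j) + beta_i (t - t_i)
    for t in [t_i, t_{i+1}); the last stress level is extended beyond t_k."""
    i = min(bisect.bisect_right(T_KNOTS, t) - 1, len(BETAS) - 1)
    acc = sum(BETAS[j] * (T_KNOTS[j + 1] - T_KNOTS[j]) for j in range(i))
    return acc + BETAS[i] * (t - T_KNOTS[i])

# tau is increasing and piecewise linear; the accelerated lifetime then
# satisfies F(t) = F0(tau(t)) for any distribution function F0.
print(tau(0.0), tau(1.0), tau(2.5), tau(3.0))  # 0.0 1.0 4.0 6.0
```

With these assumed values, operating on [2.5, 5.0) ages the system four times as fast as calendar time.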


Compound Point Processes

Processes of this kind describe so-called shock processes, where the system is subject to shocks that occur from time to time and add a random amount to the damage. The successive times of occurrence of shocks, Tn, are given by an increasing sequence 0 < T1 ≤ T2 ≤ … of random variables, where the inequality is strict unless Tn = ∞. Each time point Tn is associated with a real-valued random mark Vn, which describes the additional damage caused by the nth shock. The marked point process is denoted (T, V) = (Tn, Vn), n ∈ N. From this marked point process the corresponding compound point process X with

\[ X_t = \sum_{n=1}^{\infty} I(T_n \le t) V_n \tag{1.2} \]

is derived, which describes the accumulated damage up to time t. The simplest example is a compound Poisson process, in which the shock arrival process is Poisson and the shock amounts (Vn) are i.i.d. random variables. As before, the lifetime T is the first time the damage process (Xt) hits the level S. If we go one step further and assume that S is not deterministic and fixed, but a random failure level, then we can describe a situation in which the observed damage process does not carry complete information about the (failure) state of the system; the failure can occur at different damage levels S.

Another way to describe the failure mechanism is the following. Let the accumulated damage up to time t be given by the shock process Xt as in (1.2). If the system is up at t−, just before t, the accumulated damage equals Xt− = x, and a shock of magnitude y occurs at t, then the probability of failure at t is p(x + y), where p(x) is a given [0, 1]-valued function. In this model failures can only occur at shock times, and the accumulated damage determines the failure probability.
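A compound Poisson damage process of the form (1.2) is easy to simulate. In the Python sketch below (the Poisson rate, the exponential marks, and the level S are illustrative assumptions) the lifetime is the first shock time at which the accumulated damage reaches S; for exponential marks with mean 0.5 and S = 3, the partial sums of the marks form a Poisson process of rate 2, so the number of shocks to cross S has mean 1 + 2S = 7, and Wald's identity gives ET = 7:

```python
import random

random.seed(1)
RATE, MEAN_JUMP, S = 1.0, 0.5, 3.0  # illustrative assumptions

def shock_lifetime():
    """First shock time at which X_t = sum_n I(T_n <= t) V_n reaches S;
    shocks arrive as a Poisson(RATE) process with Exp marks V_n."""
    t, damage = 0.0, 0.0
    while damage < S:
        t += random.expovariate(RATE)                # next shock time T_n
        damage += random.expovariate(1 / MEAN_JUMP)  # mark V_n
    return t

n = 50_000
est = sum(shock_lifetime() for _ in range(n)) / n
print(est)  # close to E[T] = (1 + S/MEAN_JUMP)/RATE = 7
```

A random failure level S, or the state-dependent failure probability p(x + y), would only change the stopping condition inside the loop.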

1.1.3 Different Information Levels

It was pointed out above in what way additional information can lead to a reliability model. But it is also important to note that in one and the same model different observation levels are possible, i.e., the amount of actually available information about the state of a system may vary. The following examples will show the effect of different degrees of information.

1.1.4 Simpson’s Paradox

This paradox says that if one compares the death rates in two countries, say A and B, then it is possible that the crude overall death rate in country A is higher than in B, although all age-specific death rates in B are higher than in A. This can be transferred to reliability in the following way. Considering a two-component parallel system, the failure rate of the system lifetime may increase although the component lifetimes have decreasing failure rates. The following proposition, which can be proved by some elementary calculations, yields an example of this.

Proposition 1.2. Let T = T1 ∨ T2 with i.i.d. random variables Ti, i = 1, 2, following the common distribution F,

F(t) = 1 − e^{−u(t)}, t ≥ 0,   u(t) = γt + α(1 − e^{−βt}),   α, β, γ > 0.

If 2αe^α < (γ/β)^2 < 1, then the failure rate λ of the lifetime T increases, whereas the component lifetimes Ti have decreasing failure rates.

This example shows that it makes a great difference whether only the system lifetime can be observed (aging property: IFR) or additional information about the component lifetimes is available (aging property: DFR). The aging property of the system lifetime of a complex system does not only depend on the joint distribution of the component lifetimes but also, of course, on the structure function. Instead of a two-component parallel system, consider a series system where the component lifetimes have the same distributions as in Proposition 1.2. Then the failure rate of Tser = T1 ∧ T2 decreases, whereas Tpar = T1 ∨ T2 is IFR.
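The proposition can be checked numerically. The sketch below assumes the condition reads 2αe^α < (γ/β)^2 < 1 (the inequality is partly garbled in the source) and uses the identity λ(t) = 2F(t)u′(t)/(1 + F(t)) for the maximum of two i.i.d. lifetimes; the parameter values are illustrative:

```python
import math

# Illustrative parameters satisfying 2*alpha*e**alpha < (gamma/beta)**2 < 1.
alpha, beta, gamma = 0.1, 1.0, 0.7
assert 2 * alpha * math.exp(alpha) < (gamma / beta) ** 2 < 1

def u(t):  return gamma * t + alpha * (1 - math.exp(-beta * t))
def du(t): return gamma + alpha * beta * math.exp(-beta * t)  # component failure rate

def lam_sys(t):
    """Failure rate of T = T1 v T2: G = F^2, hence lam = 2*F*u'(t)/(1 + F)."""
    F = 1 - math.exp(-u(t))
    return 2 * F * du(t) / (1 + F)

grid = [0.05 * i for i in range(1, 201)]
sys_rates = [lam_sys(t) for t in grid]
comp_rates = [du(t) for t in grid]
print(all(b >= a - 1e-12 for a, b in zip(sys_rates, sys_rates[1:])))   # system IFR
print(all(b <= a + 1e-12 for a, b in zip(comp_rates, comp_rates[1:]))) # components DFR
```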

1.1.5 Predictable Lifetime

The Wiener process X = (Xt), t ∈ R+, with positive drift μ and variance scaling parameter σ, is a popular damage threshold model. The process X can be represented as Xt = σBt + μt, where B is standard Brownian motion. If one assumes that the failure level S is a fixed known constant, then the lifetime T = inf{t ∈ R+ : Xt ≥ S} follows an inverse Gaussian distribution with a finite mean ET = S/μ. One criticism of this model is that the paths of X are not monotone. As a partial answer, one can respond that maintenance actions also lead to improvements and thus X could be decreasing at some time points. A more severe criticism from the point of view of the available information is the following. It is often assumed that in this model the paths of the damage process can be observed continuously. But this would make the lifetime T a predictable random time (a precise definition follows in Chap. 3), i.e., there is an increasing sequence τn, n ∈ N, of random time points that announces the failure. In this model one could choose τn = inf{t ∈ R+ : Xt ≥ S − 1/n}, take n large enough, and stop operating the system at τn "just" before failure, to carry out some preventive maintenance, cf. Fig. 1.1. This does not usually apply in practical situations. This example shows that one has to distinguish carefully between the different information levels for the model formulation (complete information) and for the actual observation (partial information).
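A quick simulation illustrates the inverse Gaussian mean ET = S/μ. This is a simple Euler discretization (step size and parameter values are illustrative; the discretization slightly overestimates hitting times because excursions within a step are missed):

```python
import random

def hitting_time(mu, sigma, S, dt, rng, t_max=1000.0):
    """Euler simulation of X_t = sigma*B_t + mu*t; first time X_t >= S."""
    x, t = 0.0, 0.0
    while x < S:
        x += mu * dt + sigma * dt ** 0.5 * rng.gauss(0.0, 1.0)
        t += dt
        if t > t_max:      # safety cap for the rare very long path
            return t_max
    return t

rng = random.Random(7)
mu, sigma, S = 1.0, 0.5, 5.0
est = sum(hitting_time(mu, sigma, S, 0.01, rng) for _ in range(4000)) / 4000
print(round(est, 1))  # close to the inverse Gaussian mean ET = S/mu = 5.0
```

Replacing the threshold S by S − 1/n in the loop condition gives the announcing times τn discussed above.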

[Figure: a sample path of Xt crossing the level S − 1/n at time τn and the failure level S at time T]

Fig. 1.1. Predictable stopping time

1.1.6 A General Failure Model

The general failure model considered in Chap. 3 uses elements of the theory of stochastic processes and particularly some martingale theory. Some of the readers might wonder whether sophisticated theory like this is necessary and suitable in reliability, a domain with engineering applications. Instead of a comprehensive justification we give a motivating example.

Example 1.3. We consider a simple two-component parallel system with independent Exp(αi) distributed component lifetimes Ti, i = 1, 2. The system lifetime T = T1 ∨ T2 has distribution function

F(t) = P(T1 ≤ t, T2 ≤ t) = (1 − e^{−α1 t})(1 − e^{−α2 t})

with an ordinary failure rate

λ(t) = [α1 e^{−α1 t} + α2 e^{−α2 t} − (α1 + α2) e^{−(α1+α2) t}] / [e^{−α1 t} + e^{−α2 t} − e^{−(α1+α2) t}].

This formula is rather complicated for such a simple system and reveals nothing about the structure of the system. Using elementary calculus it can be shown that for α1 ≠ α2 the failure rate is increasing on (0, t*) and decreasing on (t*, ∞) for some t* > 0. This property of the failure rate, however, is neither obvious nor easy to see. We also know that F is of IFRA type.
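The increase-then-decrease behavior is easy to confirm numerically. A minimal sketch with illustrative rates α1 = 1, α2 = 2 (the function below is exactly the formula above):

```python
import math

a1, a2 = 1.0, 2.0  # illustrative component failure rates

def lam(t):
    """Ordinary failure rate of T = T1 v T2 with independent Exp rates a1, a2."""
    num = a1*math.exp(-a1*t) + a2*math.exp(-a2*t) - (a1 + a2)*math.exp(-(a1 + a2)*t)
    den = math.exp(-a1*t) + math.exp(-a2*t) - math.exp(-(a1 + a2)*t)
    return num / den

# lam rises from 0, peaks above min(a1, a2), then falls back toward min(a1, a2) = 1
print(round(lam(0.5), 3), round(lam(2.0), 3), round(lam(10.0), 3))
```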

But is it not more natural and simpler to say that a failure rate (process) should be 0 as long as both components work (no system failure can occur) and, when the first component failure occurs, the rate switches to α1 or α2, depending on which component survives? We want to derive a model that allows such a simple failure rate process and also includes the ordinary failure rate. Of course, this simple failure rate process, which can be expressed as


λt = α1I(T2 ≤ t < T1) + α2I(T1 ≤ t < T2),

needs knowledge about the random component lifetimes Ti. Now the failure rate λt is a stochastic process, and the information about the status of the components at time t is represented by a filtration. The model allows for changing the information level, and the ordinary failure rate can be derived from λt on the lowest level possible, namely no information about the component lifetimes.

The modern theory of stochastic processes allows for the development of a general failure model that incorporates the above aspects: time dynamics and different information levels. Chapter 3 presents this model. The failure rate process λt is one of the basic parameters of this set-up. If we consider the lifetime T, under some mild conditions we obtain the failure rate process on {T > t} as the limit of conditional expectations with respect to the pre-t-history (σ-algebra) Ft,

λt = lim_{h→0+} (1/h) P(T ≤ t + h | Ft),

extending the classical failure rate λ(t) of the system. To apply the set-up, focus should be placed on the failure rate process (λt). When this process has been determined, the model has basically been established. Using the above interpretation of the failure rate process, it is in most cases rather straightforward to determine its form. The formal proofs are, however, often quite difficult.

If we go one step further and consider a model in which the system can be repaired or replaced at failure, then attention is paid to the number Nt of system failures in [0, t]. Given certain conditions, the counting process N = (Nt), t ∈ R+, has an "intensity" that, as an extension of the failure rate process, can be derived as the limit of conditional expectations

λt = lim_{h→0+} (1/h) E[N_{t+h} − Nt | Ft],

where Ft denotes the history of the system up to time t. Hence we can interpret λt as the (conditional) expected number of system failures per unit of time at time t, given the available information at that time. Chapter 3 includes several special cases that demonstrate the broad spectrum of potential applications.

1.2 Maintenance

To prolong the lifetime, to increase the availability, and to reduce the probability of an unpredictable failure, various types of maintenance actions are implemented. The most important maintenance actions include:

• Preventive replacements of parts of the system or of the whole system
• Repairs of failed units
• Providing spare parts
• Inspections to check the state of the system if it is not observed continuously

Taking maintenance actions into account leads, depending on the specific model, to one of the following subject areas: Availability Analysis and Optimization Models.

1.2.1 Availability Analysis

If the system or parts of it are repaired or replaced when failures occur, the problem is to characterize the performance of the system. Different measures of performance can be defined, for example:

• The probability that the system is functioning at a certain point in time (point availability)
• The mean time to the first failure of the system
• The probability distribution of the downtime of the system in a given time interval

Traditionally, focus has been placed on analyzing the point availability and its limit (the steady-state availability). For a single component, the steady-state formula is given by MTTF/(MTTF + MTTR), where MTTF and MTTR represent the mean time to failure and the mean time to repair (mean repair time), respectively. The steady-state availability of a system comprising several components can then be calculated using the theory of complex (monotone) systems.
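For independent components, the single-component formula combines in the usual way for monotone structures. A minimal sketch with illustrative MTTF/MTTR values:

```python
def availability(mttf, mttr):
    """Steady-state (limiting) availability of a single component."""
    return mttf / (mttf + mttr)

# Independent components: a series system is up iff all components are up,
# so its availability is the product; a parallel pair is up unless both are down.
a1 = availability(1000.0, 10.0)   # illustrative values
a2 = availability(500.0, 25.0)
series_avail = a1 * a2
parallel_avail = 1 - (1 - a1) * (1 - a2)
print(round(series_avail, 4), round(parallel_avail, 6))
```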

Often, performance measures related to a time interval are used. Such measures include the distribution of the number of system failures and the distribution of the downtime of the system, or at least the means of these distributions. Measures related to the number of system failures are important from an operational and safety point of view, whereas measures related to the downtime are more interesting from a production point of view. Information about the probability of having a long downtime in a time interval is important for assessing the economic risk related to the operation of the system. For production systems, it is sometimes necessary to use a multistate representation of the system and some of its components, to reflect different production levels.

Compared to the steady-state availability, it is of course more complicated to compute the performance measures related to a time interval, in particular the probability distributions of the number of system failures and of the downtime. Using simplifications and approximations, it is however possible to establish formulas that can be used in practice. For highly available systems, a Poisson approximation for the number of system failures and a compound Poisson approximation for the downtime distribution are useful in many cases.
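The compound Poisson idea is simple to sketch: a Poisson number N of system failures in the interval, each contributing an i.i.d. repair time. The following toy simulation (illustrative parameters, not a result from the text) samples this downtime distribution; note that P(downtime = 0) = P(N = 0) = e^{−m}:

```python
import math, random

def sample_poisson(m, rng):
    """Poisson sampling via Knuth's product-of-uniforms method (fine for small m)."""
    L = math.exp(-m)
    k, p = 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def sample_downtime(mean_failures, mean_repair, rng):
    """Interval downtime ~ compound Poisson: N failures, i.i.d. Exp repair times."""
    n = sample_poisson(mean_failures, rng)
    return sum(rng.expovariate(1.0 / mean_repair) for _ in range(n))

rng = random.Random(3)
m, mr = 2.0, 5.0  # illustrative: 2 expected failures, mean repair time 5 h
sims = [sample_downtime(m, mr, rng) for _ in range(100000)]
p_zero = sum(d == 0.0 for d in sims) / len(sims)
p_long = sum(d > 30.0 for d in sims) / len(sims)
print(round(p_zero, 3), round(p_long, 3))  # p_zero should be near exp(-2)
```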

These topics are addressed in Chap. 4, which gives a detailed analysis of the availability of monotone systems. Emphasis is placed on performance measures related to a time interval. Sufficient conditions are given for when the Poisson and the compound Poisson distributions are asymptotic limits.

1.2.2 Optimization Models

If a valuation structure is given, i.e., costs of replacements, repairs, downtime, etc., and gains, then one is naturally led to the problem of planning the maintenance actions so as to minimize (maximize) the costs (gains) with respect to a given criterion. Examples of such criteria are expected costs per unit time and total expected discounted costs.

Example 1.4. We resume Example 1.3, p. 6, and consider the simple two-component parallel system with independent Exp(αi) distributed component lifetimes Ti, i = 1, 2, with the system lifetime T = T1 ∨ T2. We now allow preventive replacements at a cost of c units to be carried out before failure, and a replacement upon system failure at cost c + k. It seems intuitive that T1 ∧ T2, the time of the first component failure, should be a candidate for an optimal replacement time with respect to some cost criterion, at least if c is "small" compared to k. How can we prove that this random time T1 ∧ T2 is optimal among all possible replacement times? How can we characterize the set of all possible replacement times?

These questions can only be answered in the framework of martingale theory and are addressed in Chap. 5.
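Proving optimality requires that machinery, but comparing two candidate policies is elementary. The sketch below uses the renewal reward criterion (expected cost per unit time = expected cycle cost / expected cycle length) with illustrative values α1 = α2 = 1, c = 1, k = 10; it only shows that replacing at T1 ∧ T2 beats running to failure, not that it is optimal among all stopping times:

```python
import random

rng = random.Random(11)
a1 = a2 = 1.0        # illustrative component failure rates
c, k = 1.0, 10.0     # preventive cost c; failure replacement costs c + k

def cycle(replace_at_first_failure):
    """One replacement cycle: (cost incurred, cycle length)."""
    t1, t2 = rng.expovariate(a1), rng.expovariate(a2)
    if replace_at_first_failure:
        return c, min(t1, t2)       # preventive replacement at T1 ^ T2
    return c + k, max(t1, t2)       # replacement at system failure T

def cost_rate(policy, n=200000):
    data = [cycle(policy) for _ in range(n)]
    return sum(x for x, _ in data) / sum(t for _, t in data)  # renewal reward

preventive, run_to_failure = cost_rate(True), cost_rate(False)
print(round(preventive, 2), round(run_to_failure, 2))
# Analytically: c*(a1 + a2) = 2.0 versus (c + k)/ET = 11/1.5 ~ 7.33.
```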

One can imagine that thousands of models (and papers) can be created by combining the different types of lifetime models with different maintenance actions. The general optimization framework formulated in Chap. 5 incorporates a number of such models. Here the emphasis is placed on determining the optimal replacement time of a deteriorating system. The framework is based on the failure model of Chap. 3, which means that rather complex and very different situations can be studied. Special cases include monotone systems, (minimal) repair models, and damage processes, with different information levels.

1.3 Reliability Modeling

Models analyzed in this book are general, in the sense that they do not refer to any specific real-life situation but are applicable in a number of cases. This is the academic and theoretical approach of mathematicians (probabilists, statisticians) who provide tools that can be used in applications.

The reliability engineer, on the other hand, has a somewhat different starting point. He or she is faced with a real problem and has to analyze this problem using a mathematical model that describes the situation appropriately.


Sometimes it is rather straightforward to identify a suitable model, but often the problem is complex and it is difficult to see how to solve it. In many cases, a model needs to be developed. The modeling process requires both experience on the part of the practitioner and knowledge on the part of the theorist.

However, it is not within the scope of this book to discuss in detail the many practical aspects related to reliability modeling and analysis. Only a few issues will be addressed. In this introductory section we will highlight important factors to be considered in the modeling process, and two real-life examples will be presented.

The objectives of the reliability study can affect modeling in many ways, for example, by specifying which performance measures and which factors (parameters) are to be analyzed. Different objectives will require different approaches and methods for modeling and analysis. Is the study to provide decision support in a design process of a system where the problem is to choose between alternative solutions; is the problem to give a basis for specifying reliability requirements; or is the aim to search for an optimal preventive maintenance strategy? Clearly, these situations call for different models.

The objectives of the study may also influence the choice of the computational approach. If it is possible to use analytical calculation methods, these would normally be preferred. For complex situations, Monte Carlo simulation often represents a useful alternative, cf., e.g., [13, 64].

The modeling process starts by clarifying the characteristics of the situation to be analyzed. Some of the key points to address are:

Can the system be decomposed into a set of independent subsystems (components)? Are all components operating normally or are some on standby? What is the state of the component after a repair? Is it "as good as new"? What are the resources available for carrying out the repairs? Are some types of preventive maintenance being employed? Is the state of the components and the system continuously monitored, or is it necessary to carry out inspections to reveal their condition? Is information available about the underlying condition of the system and components, such as wear, stress, and damage?

Having identified important features of the system, we then have to look more specifically at the various elements of the model and resolve questions like the following:

• How should the deterioration process of the components and system be modeled? Is it sufficient to use a standard lifetime model where the age of the unit is the only information available? How should the repair/replacement times be modeled?

• How are the preventive maintenance activities to be reflected in the model? Are these activities to be considered fixed in the model, or is it possible to plan preventive maintenance actions so that costs (rewards) are minimized (maximized)?

• Is a binary (two-state) approach for components and system sufficiently accurate, or is multistate modeling required?


• How are the system and components to be represented? Is a reliability block diagram appropriate?

• Are time dynamics to be included, or is a time-stationary model sufficient?
• How are the parameters of the model to be determined? What kind of input data are required for using the model? How is uncertainty to be dealt with?

Depending on the answers to these questions, relevant models can be identified. It is a truism that no model can cover all aspects, and it is recommended that one starts with a simple model describing the main features of the system.

The following application examples give further insight into the situations that can be modeled using the theory presented in this book.

1.3.1 Nuclear Power Station

In this example we consider a small part of a very complex technical system, in which safety aspects are of great importance. The nuclear power station under consideration consists of two identical boiling water reactors in commercial operation, each with an electrical power of 1,344 MW. They started in 1984 and 1985, respectively, working with an efficiency of 35%.

Nuclear power plants have to shut down from time to time to exchange the nuclear fuel. This is usually performed annually. During the shutdown phase a lot of maintenance tasks and surveillance tests are carried out. One problem during such phases is that decay heat is still produced and thus has to be removed. Therefore, residual heat removal (RHR) systems are in operation. At the particular site, three identical systems are available, each with a capacity of 100%. They are designed to remove decay heat during accident conditions occurring at full power as well as for operational purposes in cooldown phases.

One of these RHR systems is schematically shown in Fig. 1.2. It consists of three different trains, including the closed cooling water system. Several pumps and valves are part of the RHR system. The primary cooling system can be modeled as a complex system comprising the following main components:

• Closed cooling water system pump (CCWS)
• Service water system pump (SWS)
• Low-pressure pump with a pre-stage (LP)
• High-pressure pump (HP)
• Nuclear heat exchanger (RHR)
• Valves (V1, V2, V3)

For the analysis we have to distinguish between two cases:

1. The RHR system is not in operation. Then the functioning of the system can be viewed as a binary structure of the main components, as shown in the reliability block diagram in


Fig. 1.2. Cooling system of a power plant

[Figure: reliability block diagram with V1 and V2 in parallel, in series with LP, RHR, SWS, CCWS, HP, and V3]

Fig. 1.3. Reliability block diagram

Fig. 1.3. When the system is needed, it is possible that single components or the whole system fail to start on demand. In this case, to calculate the probability of a failure on demand, we have to take all components in the reliability block diagram into consideration. Two of the valves, V1 and V2, are in parallel. Therefore, the RHR system fails on demand if either both V1 and V2 fail or at least one of the remaining components LP, . . . , HP, V3 fails. We assume that the time from a check of a component until a failure in the idle state is exponentially distributed. The failure rates are λv1, λv2, λv3 for the valves and λp1, λp2, λp3, λp4, λh for the other components. If the check (inspection or operating period) dates t time units back, then the probability of a failure on demand is given by

1 − {1 − (1 − e^{−λv1 t})(1 − e^{−λv2 t})} e^{−(λp1 + λp2 + λp3 + λp4 + λh + λv3) t}.


2. The RHR system is in operation. During an operation phase, only the pumps and the nuclear heat exchanger can fail to operate. If the valves have once opened on demand when the operation phase starts, these valves cannot fail during operation. Therefore, in this operation case, we can either ignore the valves in the block diagram or assign failure probability 0 to V1, V2, V3. The structure reduces to a simple series system. If we assume that the failure-free operating times of the pumps and the heat exchanger are independent and have distributions Fp1, Fp2, Fp3, Fp4, and Fh, respectively, then the probability that the system fails before a fixed operating time t is just

1 − F̄p1(t) F̄p2(t) F̄p3(t) F̄p4(t) F̄h(t),

where F̄(t) = 1 − F(t) denotes the survival probability.
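Both formulas are straightforward to evaluate. The sketch below uses hypothetical per-hour failure rates (purely illustrative; the text gives no numerical values) and, for case 2, exponential operating times as a further simplifying assumption:

```python
import math

# Hypothetical per-hour failure rates, for illustration only.
lam = {'v1': 1e-4, 'v2': 1e-4, 'v3': 5e-5,
       'p1': 2e-4, 'p2': 2e-4, 'p3': 2e-4, 'p4': 2e-4, 'h': 1e-4}

def p_fail_on_demand(t):
    """Case 1: V1 and V2 redundant in parallel; the remaining components in series."""
    both_valves = (1 - math.exp(-lam['v1'] * t)) * (1 - math.exp(-lam['v2'] * t))
    series_rate = lam['p1'] + lam['p2'] + lam['p3'] + lam['p4'] + lam['h'] + lam['v3']
    return 1 - (1 - both_valves) * math.exp(-series_rate * t)

def p_fail_in_operation(t):
    """Case 2: valves ignored; series system of the pumps and the heat exchanger
    (exponential operating times assumed here for illustration)."""
    rate = lam['p1'] + lam['p2'] + lam['p3'] + lam['p4'] + lam['h']
    return 1 - math.exp(-rate * t)

print(round(p_fail_on_demand(720.0), 3), round(p_fail_in_operation(720.0), 3))
```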

In both cases the failure time distributions and the failure rates have to be estimated. One essential condition for the derivation of the above formulae is that all components have stochastically independent failure times or lifetimes. In some cases such an independence condition does not apply. In Chap. 3 a general theory is developed that also includes the case of complex systems with dependent component lifetimes. The framework presented covers different information levels, which allow updating of reliability predictions using observations of the condition of the components of the system, for example.

1.3.2 Gas Compression System

This example outlines various aspects of the modeling process related to the design of a gas compression system.

A gas producer was designing a gas production system, and one of the most critical decisions was related to the design of the gas compression system.

At a certain stage of the development, two alternatives for the compression system were considered:

(i) One gas train with a maximum throughput capacity of 100%
(ii) Two trains in parallel, each with a maximum throughput capacity of 50%

Normal production is 100%. For case (i) this means that the train is operating normally and a failure stops production completely. For case (ii) both trains are operating normally. If one train fails, production is reduced to 50%. If both trains are down, production is 0.

Each train comprises a compressor–turbine, a cooler, and a scrubber. A failure of one of these "components" results in the shutdown of the train. Thus a train is represented by a series structure of the three components compressor–turbine, cooler, and scrubber.


The following failure and repair time data were assumed:

Component            Failure rate             Mean repair time
                     (unit of time: 1 year)   (unit of time: 1 h)
Compressor–turbine   10                       12
Cooler               2                        50
Scrubber             1                        20

To compare the two alternatives, a number of performance measures were considered. Particular interest was shown in performance measures related to the number of system shutdowns, the time the system has a reduced production level, and the total production loss due to failures of the system. The gas sales agreement states that the gas demand is to be met with a very high reliability, and failures could lead to considerable penalties and loss of goodwill, as well as worse sales prospects for the future.

Using models of the kind described in Chap. 4, it was possible to compute these performance measures, given certain assumptions.

It was assumed that each component generates an alternating renewal process, which means that the repair brings the component to a condition that is as good as new. The uptimes were assumed to be exponentially distributed, so that the component in the operating state has a constant failure rate. The failure rate used was based on experience data for similar equipment. Such a component model was considered to be sufficiently accurate for the purpose of the analysis. The exponential model represents a "first-order approximation," which makes it rather easy to gain insight into the performance of the system. For a complex "component" with many parts to be maintained, it is known that the overall failure rate is approximately of an exponential nature. Clearly, if all relevant information is utilized, the exponential model is rather crude. But again we have to draw attention to the purpose of the analysis: to provide decision support concerning the choice of design alternatives. Only the essential features should be included in the model.
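Under these assumptions, the long-run downtime fraction of a single train follows directly from the table above and the steady-state formula MTTF/(MTTF + MTTR) for each component in series:

```python
HOURS_PER_YEAR = 8760.0

# (failures per year, mean repair time in hours), taken from the table above
train = {'compressor-turbine': (10.0, 12.0),
         'cooler': (2.0, 50.0),
         'scrubber': (1.0, 20.0)}

train_availability = 1.0
for failures_per_year, mttr in train.values():
    mttf = HOURS_PER_YEAR / failures_per_year  # mean uptime in hours
    train_availability *= mttf / (mttf + mttr)

downtime_fraction = 1 - train_availability
print(round(downtime_fraction, 3))  # -> 0.027, i.e., the 2.7% figure reported below
```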

A similar type of reasoning applies to the problem of dependency betweencomponents. In this application all uptimes and downtimes of the compo-nents were assumed to be independent. In practice there are, of course, somedependencies present, but by looking into the failure causes and the way thecomponents were defined, the assumption of independence was not consideredto be a serious weakness of the model, undermining the results of the analysis.

To determine the repair time distribution, expert opinions were used. The repair times, which also include fault diagnosis, repair preparation, test, and restart, were assessed for different failure modes. As for the uptimes, it was assumed that no major changes over time take place concerning component design, operational procedures, etc.


Uncertainty related to the input quantities used was not considered. Instead, sensitivity studies were performed with the purpose of identifying how sensitive the results were with respect to variations in input parameters.

Of the results obtained, we include the following examples:

• The gas train is down 2.7% of the time in the long run.
• For alternative (i), the average system failure rate, i.e., the average number of system failures per year, equals 13. For alternative (ii), a distinction is made between failures resulting in production below 100% and below 50%. The average system failure rates for these levels are approximately 26 and 0.7, respectively. Alternative (ii) has a probability of about 50% of having one or more complete shutdowns during a year.
• The mean lost production equals 2.7% for both alternatives. The probability that the lost production during 1 year is more than 4% of demand is approximately equal to 0.16 for alternative (i) and 0.08 for alternative (ii).

This last result is based on assumptions concerning the variation of the repair times. Refer to Sect. 4.7.1, p. 162, where the models and methods used to compute these measures are summarized.

The results obtained, together with an economic analysis, gave the management a good basis for choosing the best alternative.

Bibliographic Notes. There are now many journals strongly devoted to reliability, for example, the IEEE Transactions on Reliability and Reliability Engineering and System Safety. In addition, there are many journals in probability and operations research that publish papers in this field.

As mentioned before, there is an extensive literature covering a variety of stochastic models of reliability. Instead of providing a long and inevitably incomplete list of references, some of the surveys and review articles are quoted, as well as some of the reliability books.

From time to time, the Naval Research Logistics Quarterly journal publishes survey articles in this field, among them the renowned article by Pierskalla and Voelker [130], which appeared with 259 references in 1976, updated by Sherif and Smith [144] with an extensive bibliography of 524 references in 1981, followed by Valdez-Flores and Feldman [158] with 129 references in 1989. Bergman's review [39] reflects the author's experience in industry and emphasizes the usefulness of reliability methods in applications. Gertsbakh's paper [75] reviews asymptotic methods in reliability and especially investigates under what conditions the lifetime of a complex system with many components is approximately exponentially distributed. Natvig [125] gives a concise overview of importance measures for monotone systems. The surveys of Arjas [4] and Koch [108] consider reliability models using more advanced mathematical tools such as marked point processes and martingales. A guided tour for the non-expert through point process and intensity-based models in reliability is presented in the article of Hokstad [89]. The book of Thompson [155] gives a more elementary presentation of point processes in reliability. Other reliability books that we would like to draw attention to are Aven [13], Barlow and Proschan [31, 32], Beichelt and Franken [36], Bergman and Klefsjö [40], Gaede [70], Gertsbakh [74], Høyland and Rausand [90], and Kovalenko, Kuznetsov, and Pegg [110]. Some of the models addressed in this introduction are treated in the overview of Jensen [94], where related references can also be found.

2

Basic Reliability Theory

This chapter presents some basic theory of reliability, including complex system theory and properties of lifetime distributions. Basic availability theory and models for maintenance optimization are included in Chaps. 4 and 5, respectively.

The purpose of this chapter is not to give a complete overview of the existing theory, but to introduce the reader to common reliability concepts, models, and methods. The exposition highlights basic ideas and results, and it provides a starting point for the more advanced theory presented in Chaps. 3–5.

2.1 Complex Systems

This section gives an overview of some basic theory of complex systems. Binary monotone (coherent) systems are covered, as well as multistate monotone systems.

2.1.1 Binary Monotone Systems

In this section we give an introduction to the classical theory of monotone (coherent) systems. First we study the structural relations between a system and its components. Then methods for calculation of system reliability are reviewed when the component reliabilities are known. When not stated otherwise, the random variables representing the state of the components are assumed to be independent.

Structural Properties

[Figure: components 1, 2, . . . , n connected in series between terminals a and b]

Fig. 2.1. Series structure

We consider a system comprising n components, which are numbered consecutively from 1 to n. In this section we distinguish between two states: a functioning state and a failure state. This dichotomy applies to the system as well as to each component. To indicate the state of the ith component, we assign a binary variable xi to component i:

xi = 1 if component i is in the functioning state,
     0 if component i is in the failure state.

(The term binary variable refers to a variable taking on the values 0 or 1.) Similarly, the binary variable Φ indicates the state of the system:

Φ = 1 if the system is in the functioning state,
    0 if the system is in the failure state.

We assume that

Φ = Φ(x),

where x = (x1, x2, . . . , xn), i.e., the state of the system is determined completely by the states of the components. We refer to the function Φ(x) as the structure function of the system, or simply the structure. In the following we will often use the phrase structure in place of system.

Example 2.1. A system that is functioning if and only if each component is functioning is called a series system. The structure function for this system is given by

Φ(x) = x1 · x2 · · · xn = ∏_{i=1}^n xi.

A series structure can be illustrated by the reliability block diagram in Fig. 2.1. "Connection between a and b" means that the system functions.

Example 2.2. A system that is functioning if and only if at least one component is functioning is called a parallel system. The corresponding reliability block diagram is shown in Fig. 2.2.

The structure function is given by

Φ(x) = 1 − (1 − x1)(1 − x2) · · · (1 − xn) = 1 − ∏_{i=1}^n (1 − xi).    (2.1)

The expression on the right-hand side in (2.1) is often written ∐ xi. Thus, a parallel system with two components has structure function

[Figure: components 1, 2, . . . , n in parallel between two terminals]

Fig. 2.2. Parallel structure

Φ(x) = 1 − (1 − x1)(1 − x2) = ∐_{i=1}^2 xi,

which we also write as Φ(x) = x1 ∐ x2.

Example 2.3. A system that is functioning if and only if at least k out of n components are functioning is called a k-out-of-n system. A series system is an n-out-of-n system, and a parallel system is a 1-out-of-n system. The structure function for a k-out-of-n system is given by

Φ(x) = 1 if Σ_{i=1}^n xi ≥ k,
       0 if Σ_{i=1}^n xi < k.

As an example, we will look at a 2-out-of-3 system. This system can be illustrated by the reliability block diagram shown in Fig. 2.3. An airplane that is capable of functioning if and only if at least two of its three engines are functioning is an example of a 2-out-of-3 system.
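The three structure functions introduced so far can be written down directly in code; the following sketch also checks the stated relations between them (a series system is n-out-of-n, a parallel system is 1-out-of-n):

```python
from itertools import product
from math import prod

def phi_series(x):    return prod(x)                      # all components must work
def phi_parallel(x):  return 1 - prod(1 - xi for xi in x) # at least one must work
def phi_k_of_n(k, x): return 1 if sum(x) >= k else 0      # at least k must work

# A series system is n-out-of-n; a parallel system is 1-out-of-n:
for x in product((0, 1), repeat=3):
    assert phi_series(x) == phi_k_of_n(3, x)
    assert phi_parallel(x) == phi_k_of_n(1, x)

# 2-out-of-3 system with engines 1 and 3 running and engine 2 failed:
print(phi_k_of_n(2, (1, 0, 1)))
```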

Definition 2.4. (Monotone system). A system is said to be monotone if

1. its structure function Φ is nondecreasing in each argument, and
2. Φ(0) = 0 and Φ(1) = 1.

Condition 1 says that the system cannot deteriorate (that is, change from the functioning state to the failure state) by improving the performance of a component (that is, replacing a failed component by a functioning component). Condition 2 says that if all the components are in the failure state, then the system is in the failure state, and if all the components are in the functioning state, then the system is in the functioning state.

All the systems we consider are monotone. In the reliability literature, much attention has been devoted to coherent systems, which form a subclass of monotone systems. Before we define a coherent system we need some notation.

20 2 Basic Reliability Theory

Fig. 2.3. 2-Out-of-3 structure

The vector (·i,x) denotes a state vector where the state of the ith component is equal to 1 or 0; (1i,x) denotes a state vector where the state of the ith component is equal to 1, and (0i,x) denotes a state vector where the state of the ith component is equal to 0; the state of component j, j ≠ i, equals xj. If we want to specify the states of some components, say i ∈ J (J ⊂ {1, 2, . . . , n}), we use the notation (·J,x). For example, (0J,x) denotes the state vector where the states of the components in J are all 0 and the state of component i, i ∉ J, equals xi.

Definition 2.5. (Coherent system). A system is said to be coherent if

1. its structure function Φ is nondecreasing in each argument, and
2. each component is relevant, i.e., there exists at least one vector (·i,x) such that Φ(1i,x) = 1 and Φ(0i,x) = 0.

It is seen that if Φ is coherent, then Φ is also monotone. We also need the following terminology.

Definition 2.6. (Minimal cut set). A cut set K is a set of components that by failing causes the system to fail, i.e., Φ(0K,1) = 0. A cut set is minimal if it cannot be reduced without losing its status as a cut set.

Definition 2.7. (Minimal path set). A path set S is a set of components that by functioning ensures that the system is functioning, i.e., Φ(1S,0) = 1. A path set is minimal if it cannot be reduced without losing its status as a path set.

Example 2.8. Consider the reliability block diagram presented in Fig. 2.4. The minimal cut sets of the system are: {1, 5}, {4, 5}, {1, 2, 3}, and {2, 3, 4}. Note that, for example, {1, 4, 5} is a cut set, but it is not minimal. The minimal path sets are {1, 4}, {2, 5}, and {3, 5}. In the following we will refer to this example as the “5-components example.”
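For small systems, minimal cut sets can be found by brute-force enumeration of the structure function. The sketch below is ours; it encodes our reading of Fig. 2.4 (components 1 and 4 in series, in parallel with components 2 and 3 in parallel followed by 5 in series) and recovers the four minimal cut sets listed above:

```python
from itertools import combinations

def phi(x):
    # 5-components example: (1 series 4) in parallel with ((2 par 3) series 5);
    # x maps component number -> binary state.
    a = x[1] * x[4]
    b = (1 - (1 - x[2]) * (1 - x[3])) * x[5]
    return 1 - (1 - a) * (1 - b)

def minimal_cut_sets(phi, n):
    cuts = []
    for r in range(1, n + 1):
        for K in combinations(range(1, n + 1), r):
            # K is a cut set if the system fails when exactly K fails.
            state = {i: 0 if i in K else 1 for i in range(1, n + 1)}
            # No smaller cut set inside K means K is minimal.
            if phi(state) == 0 and not any(set(c) <= set(K) for c in cuts):
                cuts.append(K)
    return cuts

print(minimal_cut_sets(phi, 5))  # [(1, 5), (4, 5), (1, 2, 3), (2, 3, 4)]
```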


Fig. 2.4. Example of a reliability block diagram

Computing System Reliability

Let Xi be independent binary random variables representing the state of the ith component at a given point in time, i = 1, 2, . . . , n. Let

pi = P(Xi = 1)
qi = P(Xi = 0)
h = h(p) = P(Φ(X) = 1) (2.2)
g = g(q) = P(Φ(X) = 0),

where p = (p1, p2, . . . , pn), q = (q1, q2, . . . , qn), and X = (X1, X2, . . . , Xn). The probabilities pi and qi are referred to as the reliability and unreliability of component i, respectively, and h and g the corresponding reliability and unreliability of the system.

The problem is to compute the system reliability h given the component reliabilities pi. Often it will be more efficient to let the starting point of the calculation be the unreliabilities. Note that h + g = 1 and pi + qi = 1.

Before we present methods for computing the system reliability of a general structure, we take a closer look at some special cases. We start with the series structure.

Example 2.9. (Reliability of a series structure). For a series structure the system functioning means that all the components function, hence

h = P(Φ(X) = 1) = P(∏_{i=1}^n Xi = 1)
  = P(X1 = 1, X2 = 1, . . . , Xn = 1)
  = ∏_{i=1}^n P(Xi = 1) = ∏_{i=1}^n pi. (2.3)


Example 2.10. (Reliability of a parallel structure). The reliability of a parallel structure is given by

h = 1 − ∏_{i=1}^n (1 − pi) = ∐_{i=1}^n pi. (2.4)

The proof of (2.4) is analogous to the proof of (2.3).

Example 2.11. (Reliability of a k-out-of-n structure). The reliability of a k-out-of-n structure of independent components, which all have the same reliability p, equals

h = ∑_{i=k}^n (n choose i) p^i (1 − p)^{n−i}.

This formula holds since ∑_{i=1}^n Xi has a binomial distribution with parameters n and p under the given assumptions. The case that the component reliabilities are not equal is treated later.
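Numerically, the binomial formula is a one-liner. The following check is our own sketch for the 2-out-of-3 case with an arbitrary illustrative reliability p:

```python
from math import comb

def h_k_out_of_n(n, k, p):
    # P(at least k of n i.i.d. components work) = P(Bin(n, p) >= k).
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p = 0.9
print(h_k_out_of_n(3, 2, p))        # approx 0.972
print(3 * p**2 * (1 - p) + p**3)    # same value via the closed form
```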

Next we look at an arbitrary series–parallel structure. By using the calculation formulae for a series structure and a parallel structure it is relatively straightforward to calculate the reliability of combinations of series and parallel structures, provided that each component is included in just one such structure. Let us consider an example.

Example 2.12. Consider again the reliability block diagram in Fig. 2.4. The system can be viewed as a parallel structure of two independent modules: the structure comprising the components 1 and 4, and the structure comprising the components 2, 3, and 5. The reliability of the former structure equals p1p4, whereas the reliability of the latter equals (1 − (1 − p2)(1 − p3))p5. Thus the system reliability is given by

h = 1 − {1 − p1p4}{1 − (1 − (1 − p2)(1 − p3))p5}.

Assuming that q1 = q2 = q3 = 0.02 and q4 = q5 = 0.01, this formula gives h = 0.9997, i.e., g = 3 · 10^−4.
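The arithmetic of this example is easy to reproduce; a short check of ours with the stated unreliabilities:

```python
q = {1: 0.02, 2: 0.02, 3: 0.02, 4: 0.01, 5: 0.01}
p = {i: 1 - qi for i, qi in q.items()}

# Parallel combination of the two independent modules of Fig. 2.4:
# {1 series 4} and {(2 parallel 3) series 5}.
h = 1 - (1 - p[1] * p[4]) * (1 - (1 - (1 - p[2]) * (1 - p[3])) * p[5])
print(round(h, 4))  # 0.9997, i.e., g is about 3e-4
```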

If, for example, a 2-out-of-3 structure of independent components with the same reliability p is in series with the above system, the total system reliability will be as above multiplied by the reliability of the 2-out-of-3 structure, which equals

(3 choose 2) p^2 (1 − p) + (3 choose 3) p^3 (1 − p)^0 = 3p^2(1 − p) + p^3.

Now consider a general monotone structure. Computation of system reliability for complex systems might be a formidable task (in fact, impracticable in some cases) unless an efficient method (algorithm) is used. Developing such methods is therefore an important area of research within reliability theory.


There exist a number of methods for reliability computation of a general structure. Many of these methods are based on the minimal cut (path) sets. For smaller systems the so-called inclusion–exclusion method may be applied, but this method is primarily a method for approximate calculations for systems that are either very reliable or unreliable.

Inclusion–Exclusion Method. Let Aj be the event that minimal cut set Kj is not functioning, j = 1, 2, . . . , k. Then clearly,

P(Aj) = ∏_{i∈Kj} qi

and

g = P(⋃_{j=1}^k Aj).

Furthermore, let

w1 = ∑_{j=1}^k P(Aj)
w2 = ∑_{i<j} P(Ai ∩ Aj)
...
wr = ∑_{1≤i1<i2<···<ir≤k} P(⋂_{j=1}^r A_{ij}).

Then the well-known inclusion–exclusion formula states that

g = w1 − w2 + w3 − · · · + (−1)^{k+1} wk (2.5)

and for r ≤ k

g ≤ w1 − w2 + w3 − · · ·+ wr, r odd

g ≥ w1 − w2 + w3 − · · · − wr, r even.

Although in general it is not true that the upper bounds decrease and the lower bounds increase, in practice it may be necessary to calculate only a few wr terms to obtain a close approximation. If the component unreliabilities qi are small, i.e., the reliabilities pi are large, then the w2 term will usually be negligible compared to w1, such that g ≈ w1. Note that w1 is an upper bound for g. By using w1 as an estimate for the system unreliability, we will overestimate the system unreliability. In most cases, such an underestimation of reliability is preferable to an overestimation of reliability.

With a large number of minimal cut sets, the exact calculation using (2.5) will be extensive. The number of terms in the sum wr equals (k choose r). Thus the total number of terms is

∑_{r=1}^k (k choose r) = (1 + 1)^k − 1 = 2^k − 1.


Example 2.13. (Continuation of Examples 2.8 and 2.12). The problem is to calculate the unreliability of the 5-components system of Fig. 2.4 by means of the approximation method described above. We assume that q1 = q2 = q3 = 0.02 and q4 = q5 = 0.01. We find that w1 = 3 · 10^−4, which means that g ≈ 3 · 10^−4. It is intuitively clear that the error term by using this approximation will not be significant. Calculating w2 confirms this:

w2 = q1q4q5 + q1q2q3q5 + q1q2q3q4q5 + q1q2q3q4q5 + q2q3q4q5 + q1q2q3q4

= 2.2 · 10−6.

There also exist other bounds and approximations for the system reliability. For example, it can be shown that

1 − ∏_{j=1}^k (1 − ∏_{i∈Kj} qi) = 1 − ∏_{j=1}^k ∐_{i∈Kj} pi

is an upper bound for g, and a good approximation for small values of the component unreliabilities qi; see Barlow and Proschan [32], p. 35. This bound is always as good as or better than w1. In the following we sketch some alternative methods for reliability computation.

Method Using the Minimal Cut Set Representation of the Structure Function. Using

Φ(X) = ∏_{j=1}^k ∐_{i∈Kj} Xi,

and by multiplying out the right-hand side of this expression, we can find an exact expression of h (or g). As an illustration consider a 2-out-of-3 system. Then

Φ = (X1 ∐ X2) · (X1 ∐ X3) · (X2 ∐ X3)

and by multiplication we obtain

Φ = X1 ·X2 +X1 ·X3 +X2 ·X3 − 2 ·X1 ·X2 ·X3.

We have used X_i^r = Xi for r = 1, 2, . . .. It follows by taking expectations that

h = p1 p2 + p1 p3 + p2 p3 − 2 p1 p2 p3.
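The multiplied-out formula can be verified by brute force over all 2^3 component states. The following check is our sketch, with arbitrary illustrative component reliabilities:

```python
from itertools import product

p = [0.9, 0.8, 0.7]  # arbitrary illustrative reliabilities

def phi(x):
    # 2-out-of-3 structure function.
    return 1 if sum(x) >= 2 else 0

def pr(x):
    # P(X = x) for independent binary components.
    out = 1.0
    for xi, pi in zip(x, p):
        out *= pi if xi == 1 else 1 - pi
    return out

h_enum = sum(phi(x) * pr(x) for x in product((0, 1), repeat=3))
h_formula = p[0]*p[1] + p[0]*p[2] + p[1]*p[2] - 2*p[0]*p[1]*p[2]
print(round(h_enum, 6), round(h_formula, 6))  # 0.902 0.902
```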

For systems with low reliabilities, it is possible to establish similar results based on the minimal path sets.

State Enumeration Method. Of the direct methods that do not use the minimal cut (path) sets, the state enumeration method is conceptually the simplest. With this method reliability is calculated using


Fig. 2.5. Bridge structure

h = EΦ(X) = ∑_x Φ(x) P(X = x) = ∑_{x:Φ(x)=1} ∏_{i=1}^n pi^{xi} (1 − pi)^{1−xi}.

This method, however, is not suitable for larger systems, since the number of terms in the sum can be extremely large, up to 2^n − 1.
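A direct implementation of state enumeration for the 5-components example of Fig. 2.4 (our sketch; the structure function encodes the module decomposition of Example 2.12) reproduces the exact h:

```python
from itertools import product

def phi(x):
    # Fig. 2.4: (1 series 4) in parallel with ((2 parallel 3) series 5).
    return 1 - (1 - x[0]*x[3]) * (1 - (1 - (1 - x[1])*(1 - x[2])) * x[4])

def h_enumeration(phi, p):
    # h = E[Phi(X)]: sum over states with Phi(x) = 1 of
    # prod p_i^{x_i} (1 - p_i)^{1 - x_i}.
    h = 0.0
    for x in product((0, 1), repeat=len(p)):
        if phi(x):
            w = 1.0
            for xi, pi in zip(x, p):
                w *= pi if xi else 1 - pi
            h += w
    return h

p = (0.98, 0.98, 0.98, 0.99, 0.99)
print(round(h_enumeration(phi, p), 4))  # 0.9997
```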

Factoring Method. Of other methods we will confine ourselves to describing the so-called factoring algorithm (pivot-decomposition method). The basic idea of this method is to make a conditional probability argument using the relation

h(p) = pi h(1i,p) + (1 − pi) h(0i,p), (2.6)

where h(xi,p) equals the reliability of the system given that the state of component i is xi. Formula (2.6) follows from the law of total probability. This process is repeated until the system comprises only series–parallel structures. To illustrate the method we will give an example.

Example 2.14. Consider a bridge structure as given by the diagram shown in Fig. 2.5. If we first choose to pivot on component 3, formula (2.6) holds with i = 3. It is not difficult to see that given x3 = 1, the system structure has the form

[components 1 and 2 in parallel, in series with components 4 and 5 in parallel]


and that given x3 = 0, the system structure has the form

[components 1 and 4 in series, in parallel with components 2 and 5 in series]

These two structures are both of series–parallel form, and we see that

h(13,p) = (p1 ∐ p2)(p4 ∐ p5)
h(03,p) = p1p4 ∐ p2p5.

Thus a formula for the exact computation of h(p) is established. Note that it was sufficient to perform only one pivotal decomposition in this case. If the structure given x3 = 1 had not been in a series–parallel form, we would have had to perform another pivotal decomposition, and so on.
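A recursive sketch of repeated pivoting (ours; for simplicity it decomposes all the way down to fully specified state vectors instead of stopping at series–parallel forms) agrees with the closed form obtained above by pivoting on component 3:

```python
def phi_bridge(x):
    # Bridge of Fig. 2.5; minimal path sets {1,4}, {2,5}, {1,3,5}, {2,3,4}.
    x1, x2, x3, x4, x5 = x
    return 1 if (x1*x4 or x2*x5 or x1*x3*x5 or x2*x3*x4) else 0

def h(phi, p, x=()):
    # Pivotal decomposition (2.6) on the next undetermined component.
    i = len(x)
    if i == len(p):
        return phi(x)
    return p[i] * h(phi, p, x + (1,)) + (1 - p[i]) * h(phi, p, x + (0,))

p = (0.9,) * 5
cop = lambda a, b: 1 - (1 - a) * (1 - b)   # the coproduct a "sum" b
h13 = cop(p[0], p[1]) * cop(p[3], p[4])    # structure given x3 = 1
h03 = cop(p[0]*p[3], p[1]*p[4])            # structure given x3 = 0
print(round(h(phi_bridge, p), 5), round(p[2]*h13 + (1 - p[2])*h03, 5))
```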

For a monotone structure Φ we have

Φ(x ∐ y) ≥ Φ(x) ∐ Φ(y), (2.7)

where x ∐ y = (x1 ∐ y1, . . . , xn ∐ yn). This is seen by noting that Φ(x ∐ y) is greater than or equal to both Φ(x) and Φ(y). It follows from (2.7) that

h(p ∐ p′) ≥ h(p) ∐ h(p′)

for all 0 ≤ p ≤ 1 and 0 ≤ p′ ≤ 1. These results state that redundancy at the component level is more effective than redundancy at the system level. This principle is well known among design engineers. Note that if the system is a parallel system, then equality holds in the above inequalities. If the system is coherent, then equality holds if and only if the system is a parallel system.
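The component-level vs. system-level redundancy principle can be seen already in a two-component series system; a tiny sketch of ours shows that inequality (2.7) can be strict:

```python
def phi_series(x):
    # Two-component series system.
    return x[0] * x[1]

def cop(a, b):
    # The coproduct: a "sum" b = 1 - (1 - a)(1 - b).
    return 1 - (1 - a) * (1 - b)

x, y = (1, 0), (0, 1)
componentwise = phi_series(tuple(cop(a, b) for a, b in zip(x, y)))
systemwise = cop(phi_series(x), phi_series(y))
print(componentwise, systemwise)  # 1 0: componentwise redundancy wins
```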

Time Dynamics

The above theory can be applied to different situations, covering both repairable and nonrepairable systems. As an example, consider a monotone system in a time interval [0, t0], and assume that the components of the system are “new” at time t = 0 and that a failed component stays in the failure state for the rest of the time interval. Thus the component is not repaired or replaced. This situation, for example, can describe a system with component failure states that can only be discovered by testing or inspection. We assume that the lifetime of component i is determined by a lifetime distribution Fi(t) having failure rate function λi(t). To calculate system reliability at a fixed point in time, i.e., the reliability function at this point, we can proceed as


above with qi = Fi(t) and pi = F̄i(t). Thus, for a series system the reliability at time t takes the form

h = ∏_{i=1}^n F̄i(t). (2.8)

But F̄i(t) can be expressed by means of the failure rate λi(t):

F̄i(t) = exp{−∫_0^t λi(u)du}. (2.9)

By putting (2.9) into formula (2.8) we obtain

h = exp{−∫_0^t [∑_{i=1}^n λi(u)] du}. (2.10)

From (2.10) we can conclude that the failure rate of a series structure of independent components equals the sum of the failure rates of the components of the structure. In particular this means that if the components have constant failure rates λi, i = 1, 2, . . . , n, then the series structure has constant failure rate ∑_{i=1}^n λi.

For a parallel structure we do not have a similar result. With constant failure rates of the components, the system will have a time-dependent failure rate; cf. Example 1.3, p. 6.
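For constant failure rates, the series-system conclusion of (2.10) is easy to verify numerically; the rates and time below are our illustrative choices:

```python
from math import exp

rates = (0.001, 0.002, 0.0005)  # illustrative constant failure rates
t = 500.0

# Product of component survival probabilities exp(-lambda_i * t) ...
h_series = 1.0
for lam in rates:
    h_series *= exp(-lam * t)

# ... equals the survival probability of one rate sum(rates), cf. (2.10).
print(round(h_series, 6), round(exp(-sum(rates) * t), 6))  # both 0.173774
```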

Reliability Importance Measures

An important objective of many reliability and risk analyses is to identify those components or events that are most important (critical) from a reliability/safety point of view and that should be given priority with respect to improvements. Thus, we need an importance measure. A large number of such measures have been suggested (see Bibliographic Notes, p. 55). Here we briefly describe two measures, Improvement Potential and Birnbaum’s measure.

Consider again the 5-components example (cf. pp. 20, 22, and 24). The unreliability of the system equals

g = {1 − p1p4}{1 − p5(p2 + p3 − p2p3)} ≈ w1,

w1 = q1q5 + q4q5 + q1q2q3 + q2q3q4
   = 0.02 · 0.01 + 0.01 · 0.01 + 0.02 · 0.02 · 0.02 + 0.02 · 0.02 · 0.01
   = 3 · 10^−4.

If we look at the subsystems comprising the minimal cut sets, it is clear from the above expression that subsystems {1, 5} and {4, 5} are most important in the sense that they are contributing most to unreliability. To decide which components are most important, we must define more precisely what is meant by important. For example, we might decide to let the component with the highest potential for increasing the system reliability be most important (measure for reliability improvement potential) or the component that has the


largest effect on system reliability by a small improvement of the component reliability (Birnbaum’s measure).

Improvement Potential. The following reliability importance measure for component i, I_i^A, is appropriate in a large number of situations, in particular during design:

I_i^A = h(1i,p) − h(p),

where h(p) is the reliability of the system and h(1i,p) is the reliability assuming that component i is in the best state 1. The measure I_i^A expresses the system reliability improvement potential of the component, in other words, the unreliability that is caused by imperfect performance of component i. This measure can be used for all types of reliability definitions, and it can be used for repairable or nonrepairable systems.

For a highly reliable monotone system the measure I_i^A is equivalent to the well-known Vesely–Fussell importance measure [86]. In fact, in this case I_i^A is approximately equal to the sum of the unreliabilities of the minimal cut sets that include component i, i.e.,

I_i^A ≈ ∑_{j: i∈Kj} ∏_{l∈Kj} ql. (2.11)

This is seen by applying the inclusion–exclusion formula. This formula states that 1 − h(p) ≈ ∑_{j=1}^k ∏_{l∈Kj} ql. Putting qi = 0 in this formula and subtracting, we obtain the desired approximation formula for I_i^A. Note that, like the Vesely–Fussell measure, the measure I_i^A gives the same importance to all the components of a parallel system, irrespective of component reliabilities, namely, I_i^A = ∏_{j=1}^n qj. This is as it should be because each one of the components has the potential of making the system unreliability negligible, for example, by introducing redundancy.

Example 2.15. Computation of I_i^A for the 5-components example gives

I_1^A = 2 · 10^−4, I_2^A = 1 · 10^−5, I_3^A = 1 · 10^−5, I_4^A = 1 · 10^−4, I_5^A = 3 · 10^−4.

Thus component 5 is the most important component based on this measure. Components 1 and 4 follow in second and third place, respectively.
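Using approximation (2.11), these values come straight from the minimal cut sets; a sketch of ours:

```python
from math import prod

q = {1: 0.02, 2: 0.02, 3: 0.02, 4: 0.01, 5: 0.01}
cuts = [{1, 5}, {4, 5}, {1, 2, 3}, {2, 3, 4}]

def IA(i):
    # (2.11): sum of unreliabilities of the minimal cut sets containing i.
    return sum(prod(q[l] for l in K) for K in cuts if i in K)

for i in range(1, 6):
    print(i, IA(i))
# Component 5 ranks first (3e-4), then 1 (about 2e-4) and 4 (about 1e-4).
```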

Birnbaum’s Measure. Birnbaum’s measure for the reliability importance of component i, I_i^B, is defined by

I_i^B = ∂h/∂pi.


Thus Birnbaum’s measure equals the partial derivative of the system reliability with respect to pi. The approach is well known from classical sensitivity analyses. We see that if I_i^B is large, a small change in the reliability of component i will give a relatively large change in system reliability.

Birnbaum’s measure might be appropriate, for example, in the operation phase where possible improvement actions are related to operation and maintenance parameters. Before looking closer into specific improvement actions of the components, it will be informative to measure the sensitivity of the system reliability with respect to small changes in the reliability of the components.

To compute I_i^B the following formula is often used:

I_i^B = h(1i,p) − h(0i,p). (2.12)

This formula is established using (2.6), p. 25.

Example 2.16. Using (2.12) we find that

I_1^B = 1.03 · 10^−2 ≈ 1 · 10^−2, I_2^B = I_3^B = 6 · 10^−4,
I_4^B = 1.02 · 10^−2 ≈ 1 · 10^−2, I_5^B = 3 · 10^−2.

We see that for this example the Birnbaum measure gives the same ranking of the components as the measure I_i^A. However, this is not true in general.
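Formula (2.12) makes these numbers a few lines of code; our sketch uses the exact h of the 5-components example:

```python
def h(p):
    # 5-components example: (1 series 4) parallel ((2 parallel 3) series 5).
    return 1 - (1 - p[1]*p[4]) * (1 - (1 - (1 - p[2])*(1 - p[3])) * p[5])

p = {1: 0.98, 2: 0.98, 3: 0.98, 4: 0.99, 5: 0.99}

def IB(i):
    # Birnbaum's measure (2.12): h(1_i, p) - h(0_i, p).
    return h({**p, i: 1}) - h({**p, i: 0})

for i in range(1, 6):
    print(i, round(IB(i), 5))
# 0.01029, 0.00059, 0.00059, 0.01019, 0.02979: same ranking as I^A here.
```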

It is not difficult to see that

I_i^B = E[Φ(1i,X) − Φ(0i,X)] = P(Φ(1i,X) − Φ(0i,X) = 1)
      = P(Φ(1i,X) = 1, Φ(0i,X) = 0).

If Φ(1i,x) − Φ(0i,x) = 1, we call (1i,x) a critical path vector and (0i,x) a critical cut vector for component i. For simplicity, we often say that component i is critical for the system.

Thus we have shown that I_i^B equals the probability that the system is in a state so that component i is critical for the system. If the components are dependent, this probability is often used as the definition of Birnbaum’s measure. Now set pj = 1/2 for all j ≠ i. Then

I_i^B = (1/2^{n−1}) ∑_{(·i,x)} [Φ(1i,x) − Φ(0i,x)] = (1/2^n) ∑_x [Φ(1i,x) − Φ(0i,x)].

This quantity is used as a measure of the structural importance of component i.

Some Comments on the Use of Importance Measures. The two importance measures presented in this section can be useful tools in the system optimization process/system improvement process. This process can be described as follows:


1. Identify the most important units by means of the chosen importance measure
2. Identify possible improvement actions/measures for these units
3. Estimate the effect on reliability by implementing the measure
4. Perform cost evaluations
5. Make an overall evaluation and take a decision.

The importance measure to be used in a particular case depends on the characteristics we want the measure to reflect. Undoubtedly, different situations call for different importance measures. In a design phase the system reliability improvement potential I_i^A might be the most informative measure, but for a system with frozen design, the Birnbaum measure might be more informative, since this measure reflects how small component reliability improvements affect system reliability.

Dependent Components

In the following some remarks on systems with dependent components are made. A more systematic treatment concerning copula models can be found in the last subsection of this chapter.

One of the most difficult tasks in reliability engineering is to analyze dependent components (often referred to as common mode failures). It is difficult to formulate the dependency in a mathematically stringent way and at the same time obtain a realistic model and to provide data for the model. Whether we succeed in incorporating a “correct” contribution from common mode failures is very much dependent on the modeling ability of the analyst. By defining the components in a suitable way, it is often possible to preclude dependency. For example, common mode failures that are caused by a common external cause can be identified and separated out so that the components can be considered as independent components. Another useful method for “elimination” of dependency is to redefine components. For example, instead of including a parallel structure of dependent components in the system, this structure could be represented by one component. Of course, this does not remove the dependency, but it moves it to a lower level of the analysis. Special techniques, such as Markov modeling, can then be used to analyze the parallel structure itself, or we can try to estimate/assign reliability parameters directly for this new component.

Although it is often possible to “eliminate” dependency between components by proper modeling, it will in many cases be required to establish a model that explicitly takes into account the dependency. Refer to Chap. 3 for examples of such models.

Another way of taking into account dependency is to obtain bounds on the system reliability, assuming that the components are associated and not necessarily independent. Association is a type of positive dependency, for example, as a result of components supporting loads. The precise mathematical definition is as follows (cf. [32]):


Definition 2.17. Random variables T1, T2, . . . , Tn are associated if

cov[f(T), g(T)] ≥ 0

for all pairs of increasing binary functions f and g.

A number of results are established for associated components, for example, the following inequalities:

max_{1≤j≤s} ∏_{i∈Sj} pi ≤ h ≤ 1 − max_{1≤j≤k} ∏_{i∈Kj} qi,

where Sj equals the jth minimal path set, j = 1, 2, . . . , s, and Kj equals the jth minimal cut set, j = 1, 2, . . . , k. This method usually leads to very wide intervals for the reliability.

2.1.2 Multistate Monotone Systems

In this section parts of the theory presented in Sect. 2.1.1 will be generalized to include multistate systems where components and system are allowed to have an arbitrary (finite) number of states/levels. Multistate monotone systems are used to model, e.g., production and transportation systems for oil and gas, and power transmission systems.

We consider a system comprising n components, numbered consecutively from 1 to n. As in the binary case, xi represents the state of component i, i = 1, 2, . . . , n, but now xi can be in one out of Mi + 1 states,

xi0, xi1, xi2, . . . , xiMi (xi0 < xi1 < xi2 < · · · < xiMi).

The set comprising these states is denoted Si. The states xij represent, for example, different levels of performance, from the worst, xi0, to the best, xiMi. The states xi0, xi1, . . . , x_{i,Mi−1} are referred to as the failure states of the components.

Similarly, Φ = Φ(x) denotes the state (level) of the system. The various values Φ can take are denoted

Φ0, Φ1, . . . , ΦM (Φ0 < Φ1 < · · · < ΦM ).

We see that if Mi = 1, i = 1, 2, . . . , n, and M = 1, then the model is identical with the binary model of Sect. 2.1.1.

Definition 2.18. (Monotone system). A system is said to be monotone if

1. its structure function Φ is nondecreasing in each argument, and
2. Φ(x10, x20, . . . , xn0) = Φ0 and Φ(x1M1, x2M2, . . . , xnMn) = ΦM.

In the following we will restrict attention to monotone systems. As usual, we use the convention that (x1, x2, . . . , xn) > (z1, z2, . . . , zn) means that xi ≥ zi, i = 1, 2, . . . , n, and there exists at least one i such that xi > zi.


Fig. 2.6. A simple example of a flow network

Definition 2.19. (Minimal cut vector). A vector z is a cut vector to level c if Φ(z) < c. A cut vector to level c, z, is minimal if Φ(x) ≥ c for all x > z.

Definition 2.20. (Minimal path vector). A vector y is a path vector to level c if Φ(y) ≥ c. A path vector to level c, y, is minimal if Φ(x) < c for all x < y.

Example 2.21. Figure 2.6 shows a simple example of a flow network model. The system comprises three components. Flow (gas/oil) is transmitted from a to b. The components 1 and 2 are binary, whereas component 3 can be in one out of three states: 0, 1, or 2. The states of the components are interpreted as flow capacity rates for the components. The state/level of the system is defined as the maximum flow that can be transmitted from a to b, i.e.,

Φ = Φ(x) = min{x1 + x2, x3}.

If, for example, the component states are x1 = 0, x2 = 1, and x3 = 2, then the flow throughput equals 1, i.e., Φ = Φ(0, 1, 2) = 1. The possible system levels are 0, 1, and 2. We see that Φ is a multistate monotone system. The minimal cut vectors and path vectors are as follows:

System level 2
Minimal cut vectors: (0, 1, 2), (1, 0, 2), and (1, 1, 1)
Minimal path vectors: (1, 1, 2)

System level 1
Minimal cut vectors: (0, 0, 2) and (1, 1, 0)
Minimal path vectors: (0, 1, 1) and (1, 0, 1).

Computing System Reliability

Assume that the state Xi of the ith component is a random variable, i = 1, 2, . . . , n. Let


pij = P(Xi = xij),
hj = P(Φ(X) ≥ Φj),
a = EΦ(X)/ΦM = ∑_j Φj P(Φ(X) = Φj)/ΦM.

We call hj the reliability of the system at system level j. For the flow network example above, a represents the expected throughput (flow) relative to the maximum throughput (flow) level.

The problem is to compute hj for one or more values of j, and a, based on the probabilities pij. We assume that the random variables Xi are independent.

Example 2.22. (Continuation of Example 2.21). Assume that

pi1 = 1 − pi0 = 0.96, i = 1, 2,
p32 = 0.97, p31 = 0.02, p30 = 0.01.

Then by simple probability calculus we find that

h2 = P(X1 = 1, X2 = 1, X3 = 2) = 0.96 · 0.96 · 0.97 = 0.894;

h1 = P({X1 = 1} ∪ {X2 = 1}, X3 ≥ 1)
   = P({X1 = 1} ∪ {X2 = 1}) P(X3 ≥ 1)
   = {1 − P(X1 = 0)P(X2 = 0)} P(X3 ≥ 1)
   = 0.9984 · 0.99 = 0.988;

a = (0.094 · 1 + 0.894 · 2)/2 = 0.941.
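These quantities can also be obtained by enumerating the 2 · 2 · 3 = 12 component state combinations; the sketch below is our illustration:

```python
from itertools import product

# Flow network of Fig. 2.6: Phi(x) = min(x1 + x2, x3).
probs = {1: {0: 0.04, 1: 0.96},
         2: {0: 0.04, 1: 0.96},
         3: {0: 0.01, 1: 0.02, 2: 0.97}}

dist = {}  # distribution of the system level Phi(X)
for x1, x2, x3 in product(probs[1], probs[2], probs[3]):
    level = min(x1 + x2, x3)
    pr = probs[1][x1] * probs[2][x2] * probs[3][x3]
    dist[level] = dist.get(level, 0.0) + pr

h2 = dist[2]
h1 = dist[1] + h2
a = sum(level * pr for level, pr in dist.items()) / 2  # E[Phi(X)] / Phi_M
print(round(h2, 3), round(h1, 3), round(a, 3))  # 0.894 0.988 0.941
```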

For the above example it is easy to calculate the system reliability directly by using elementary probability rules. For larger systems it will be very time-consuming (in some cases impossible) to perform these calculations if special techniques or algorithms are not used. If the minimal cut vectors or path vectors for a specific level are known, the system reliability for this level can be computed exactly, using, for example, the algorithm described in [17]. For highly reliable systems, which are most common in practice, simple approximations can be used as described in the following.

Analogous to the binary case, approximations can be established based on the inclusion–exclusion method. For example, we have

1 − hj = ∑_r ∏_{i=1}^n P(Xi ≤ z_i^r) − ε, (2.13)

where (z_1^r, z_2^r, . . . , z_n^r) represents the rth cut vector for level j and ε is a positive error term satisfying

ε ≤ ∑_{r<l} ∏_{i=1}^n P(Xi ≤ min{z_i^r, z_i^l}).


Example 2.23. (Continuation of Example 2.22). If we use (2.13) to calculate hj, we obtain

h2 ≈ 1 − (0.04 · 1 · 1 + 1 · 0.04 · 1 + 1 · 1 · 0.03) = 0.890,
h1 ≈ 1 − (0.04 · 0.04 · 1 + 1 · 1 · 0.01) = 0.988,
a ≈ (1 · 0.098 + 2 · 0.890)/2 = 0.939.

We can conclude that the approximations are quite good for this example.

The problem of determining the probabilities pij will, as in the binary case, depend on the particular situation considered. Often it will be appropriate to define pij by the limiting availabilities of the component, cf. Chap. 4.

Discussion

The traditional reliability theory based on a binary approach has recently been generalized by allowing components and system to have an arbitrary finite number of states. For most reliability applications, binary modeling should be sufficiently accurate, but for certain types of applications, such as gas and oil production and transportation systems and telecommunication, a multistate approach is usually required for the system and components. In a gas transportation system, for example, the state of the system is defined as the rate of delivered gas, and in most cases a binary model (100%, 0%) would be a poor representation of the system. A component in such a system may represent a compressor station comprising a certain number (M) of compressor units in parallel. The states of the component equal the capacity levels corresponding to M compressor units running, M − 1 compressor units running, and so on.

There also exist a number of reliability importance measures for multistate systems (see Bibliographic Notes, p. 55). Many of these measures represent natural generalizations of importance measures of binary systems. We see, for example, that the measure I^A can easily be extended to multistate models. For the Birnbaum measure, it is not so straightforward to generalize the measure. Several measures have been proposed, as, for example, the r, s-reliability importance I_i^{r,s} of component i, which is given by

I_i^{r,s} = P(Φ(ri,X) ≥ Φk) − P(Φ(si,X) ≥ Φk), (2.14)

where Φ(ji,X) equals the state of the system given that Xi = xij.

2.2 Basic Notions of Aging

In this section we introduce and recapitulate some properties of lifetime distributions. Let T be a positive random variable with distribution function F: T ∼ F, i.e., P(T ≤ t) = F(t). If F has a density f, then λ(t) = f(t)/F̄(t) is the failure or hazard rate, where as usual F̄(t) = 1 − F(t) denotes the survival probability. Here and in the following we sometimes simplify the notation and define a mapping by its values to avoid constructions like λ : D → R+, D ⊂ R ∖ {t ∈ R+ : F̄(t) = 0}, t ↦ λ(t) = f(t)/F̄(t), if there is no fear of ambiguity. Interpreting T as the lifetime of some component or system, the failure rate measures the proneness to failure at time t: λ(t)Δt ≈ P(T ≤ t + Δt | T > t). The well-known relation

F̄(t) = exp{−∫_0^t λ(s)ds}

shows that F is uniquely determined by the failure rate. One notion of aging could be an increasing failure rate (IFR). However, this IFR property is in some cases too strong and other intuitive notions of aging have been suggested. Among them are the increasing failure rate average (IFRA) property and the notions of new better than used (NBU) and new better than used in expectation (NBUE). In the following subsection these concepts are introduced formally and the relationships among them are investigated.

Furthermore, these notions should be applied to complex systems. If we consider the time dynamics of such systems, we want to investigate how the reliability of the whole system changes in time if the components have one of the mentioned aging properties.

Another question is how different lifetime (random) variables and their corresponding distributions can be compared. This leads to notions of stochastic ordering. The comparison of the lifetime distribution with the exponential distribution leads to useful estimates of the system reliability.

2.2.1 Nonparametric Classes of Lifetime Distributions

We first define the IFR and decreasing failure rate (DFR) properties of a lifetime distribution F by means of the conditional survival probability

P(T > t + x|T > t) = F̄(t + x)/F̄(t).

Definition 2.24. Let T be a positive random variable with T ∼ F.

(i) F is an IFR distribution if F̄(t + x)/F̄(t) is nonincreasing in t on the domain of the distribution for each x ≥ 0.

(ii) F is a DFR distribution if F̄(t + x)/F̄(t) is nondecreasing in t on the domain of the distribution for each x ≥ 0.

In the following we will restrict attention to the “increasing” part in the definition of the aging notion. The “decreasing part” can be treated analogously. The IFR property says that with increasing age the probability of surviving x further time units decreases. This definition does not make use


of the existence of a density f (failure rate λ). But if a density exists, then the IFR property is equivalent to a nondecreasing failure rate, which can immediately be seen as follows. From

λ(t) = lim_{x→0+} (1/x) {1 − F̄(t + x)/F̄(t)}

we obtain that the IFR property implies that λ is nondecreasing. Conversely, if λ is nondecreasing, then we can conclude that

P(T > t + x|T > t) = exp{−∫_t^{t+x} λ(s)ds}

is nonincreasing, i.e., F is IFR. If F has the IFR property, then it is continuous for all t < t∗ = sup{t ∈ R+ : F̄(t) > 0} (possibly t∗ = ∞) and a jump can only occur at t∗ if t∗ < ∞. This can be directly deduced from the IFR definition.

It seems reasonable that the aging properties of the components of a monotone structure are inherited by the system. However, the example of a parallel structure with two independent components, the lifetimes of which are distributed Exp(λ1) and Exp(λ2), respectively, shows that in this respect the IFR property is too strong. As was pointed out in Example 1.3, p. 6, for λ1 ≠ λ2, the failure rate of the system lifetime is increasing in (0, t∗) and decreasing in (t∗,∞) for some t∗ > 0, i.e., constant component failure rates lead in this case to a nonmonotone system failure rate. To characterize the class of lifetime distributions of systems with IFR components we are led to the IFRA property. We use the notation

Λ(t) = ∫_0^t dF(s)/(1 − F(s−)),

which is the accumulated failure rate. The distribution function F is uniquely determined by Λ, and the relation is given by

F̄(t) = exp{−Λ^c(t)} ∏_{s≤t} (1 − ΔΛ(s))

for all t such that Λ(t) < ∞, where ΔΛ(s) = Λ(s) − Λ(s−) is the jump height at time s and Λ^c(t) = Λ(t) − ∑_{s≤t} ΔΛ(s) is the continuous part of Λ (cf. [2], p. 91 or [115], p. 436). In the case that F is continuous, we obtain the simple exponential formula F̄(t) = exp{−Λ(t)} or Λ(t) = −ln F̄(t).

Definition 2.25. A distribution F is IFRA if −(1/t) ln F̄(t) is nondecreasing in t > 0 on {t ∈ R+ : F̄(t) > 0}.

Remark 2.26. (i) The "decreasing" analog is denoted DFRA.
(ii) If F is IFRA, then (F̄(t))^{1/t} is nonincreasing, which is equivalent to

F̄(αt) ≥ (F̄(t))^α

for 0 ≤ α ≤ 1 and t ≥ 0.
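This last inequality is easy to check numerically. A minimal sketch, assuming a Weibull survival function F̄(t) = exp{−t^β} with shape β = 2 (which is IFR and hence IFRA):

```python
import math

# Weibull survival function F̄(t) = exp(-t**beta); IFR (hence IFRA) for beta >= 1.
def surv(t, beta=2.0):
    return math.exp(-t**beta)

# Check the IFRA characterization of Remark 2.26(ii):
# F̄(alpha*t) >= F̄(t)**alpha on a grid of t and alpha values.
ok = all(
    surv(a * t) >= surv(t) ** a - 1e-12
    for t in [0.1 * k for k in range(1, 50)]
    for a in [0.0, 0.25, 0.5, 0.75, 1.0]
)
print(ok)  # True
```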


Next we will introduce two aging notions that are related to the residual lifetime of a component of age t. Let T ∼ F be a positive random variable with finite expectation. Then the distribution of the remaining lifetime after t ≥ 0 is given by

P(T − t > x | T > t) = F̄(x + t)/F̄(t)

with expectation

μ(t) = E[T − t | T > t] = (1/F̄(t)) ∫_0^∞ F̄(x + t)dx = (1/F̄(t)) ∫_t^∞ F̄(x)dx (2.15)

for 0 ≤ t < t∗ = sup{t ∈ R+ : F̄(t) > 0}. The conditional expectation μ(t) is called the mean residual life at time t.

Definition 2.27. Let T ∼ F be a positive random variable.

(i) F is NBU if

F̄(x + t) ≤ F̄(x)F̄(t) for x, t ≥ 0.

(ii) F is NBUE if μ = ET < ∞ and

μ(t) ≤ μ for 0 ≤ t < t∗.

Remark 2.28. (i) The corresponding notions with "better" replaced by "worse," NWU and NWUE, are obtained by reversing the inequality signs.

(ii) These properties are intuitive notions of aging. That F is NBU means that the probability of surviving x further time units for a component of age t decreases in t. For NBUE distributions the expected remaining lifetime of a component of age t is less than the expected lifetime of a new component.

Now we want to establish the relations between these four notions of aging.

Theorem 2.29. Let T ∼ F be a positive random variable with finite expectation. Then we have

F IFR ⇒ F IFRA ⇒ F NBU ⇒ F NBUE.

Proof. F IFR ⇒ F IFRA: Since an IFR distribution F is continuous for all t < t∗ = sup{t ∈ R+ : F̄(t) > 0}, the simple exponential formula F̄(t) = exp{−Λ(t)} holds true, and we see that the IFR property implies that exp{Λ(t + x) − Λ(t)} is increasing in t for all positive x. Therefore Λ is convex, i.e., Λ(αt + (1 − α)u) ≤ αΛ(t) + (1 − α)Λ(u), 0 ≤ α ≤ 1. Taking the limit u → 0 we have Λ(0) = 0 and Λ(αt) ≤ αΛ(t), which amounts to F̄(αt) ≥ (F̄(t))^α. But this is equivalent to the IFRA property (see Remark 2.26 above).

F IFRA ⇒ F NBU: With the abbreviations a = −(1/x) ln F̄(x) and b = −(1/y) ln F̄(y) we obtain from the IFRA property for positive x, y that −(1/(x + y)) ln F̄(x + y) ≥ a ∨ b = max{a, b} and

− ln F̄(x + y) ≥ (a ∨ b)(x + y) ≥ ax + by = − ln F̄(x) − ln F̄(y).

But this is the NBU property F̄(x + y) ≤ F̄(x)F̄(y).
F NBU ⇒ F NBUE: This inequality follows by integrating the NBU inequality

F̄(t)μ(t) = ∫_0^∞ F̄(x + t)dx ≤ F̄(t) ∫_0^∞ F̄(x)dx = F̄(t)μ,

which completes the proof. □

Examples can be constructed which show that none of the above implications can be reversed.

2.2.2 Closure Theorems

In the previous subsection it was mentioned that the lifetime of a monotone system with IFR components need not be of IFR type. This gave rise to the definition of the IFRA class of lifetime distributions, and we will show that this class is closed under forming monotone structures. There are also other reliability operations, among them mixtures of distributions or forming the sum of random variables, and the question arises whether certain distribution classes are closed under these operations. For example, convolutions arise in connection with the addition of lifetimes and cold reserves.

Before we come to the IFRA Closure Theorem we need a preparatory lemma to prove a property of the reliability function h(p) = P(Φ(X) = 1) of a monotone structure.

Lemma 2.30. Let h be the reliability function of a monotone structure. Then h satisfies the inequality

h(p^α) ≥ h^α(p) for 0 < α ≤ 1,

where p^α = (p1^α, . . . , pn^α).

Proof. We prove the result for binary structures, which are nondecreasing in each argument (nondecreasing structures) but do not necessarily satisfy Φ(0) = 0 and Φ(1) = 1. We use induction on n, the number of components in the system. For n = 1 the assertion is obviously true. The induction step is carried out by means of the pivotal decomposition formula:

h(p^α) = pn^α h(1n, p^α) + (1 − pn^α) h(0n, p^α).

Now h(1n, p^α) and h(0n, p^α) define reliability functions of nondecreasing structures with n − 1 components. Therefore we have h(·n, p^α) ≥ h^α(·n, p) and also

h(p^α) ≥ pn^α h^α(1n, p) + (1 − pn^α) h^α(0n, p).


The last step is to show that

pn^α h^α(1n, p) + (1 − pn^α) h^α(0n, p) ≥ (pn h(1n, p) + (1 − pn) h(0n, p))^α.

But since v(x) = x^α is a concave function for x ≥ 0, we have

v(x + a) − v(x) ≥ v(y + a) − v(y) for 0 ≤ x ≤ y, 0 ≤ a.

Setting a = pn(h(1n, p) − h(0n, p)), x = pn h(0n, p) and y = h(0n, p) yields the desired inequality. □

Now we can establish the IFRA Closure Theorem.

Theorem 2.31. If each of the independent components of a monotone structure has an IFRA lifetime distribution, then the system itself has an IFRA lifetime distribution.

Proof. Let F, Fi, i = 1, . . . , n, be the distributions of the lifetimes of the system and the components, respectively. The IFRA property is characterized by

F̄i(αt) ≥ (F̄i(t))^α

for 0 ≤ α ≤ 1 and t ≥ 0. The distribution F is related to the Fi by the reliability function h:

F̄(t) = h(F̄1(t), . . . , F̄n(t)).

By Lemma 2.30 above, using the monotonicity of h, we can conclude that

F̄(αt) = h(F̄1(αt), . . . , F̄n(αt)) ≥ h(F̄1^α(t), . . . , F̄n^α(t)) ≥ h^α(F̄1(t), . . . , F̄n(t)) = F̄^α(t)

for 0 < α ≤ 1. For α = 0 this inequality holds true since F(0) = 0. This proves the IFRA property of F. □
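The closure theorem can be illustrated numerically. A sketch, assuming a 2-out-of-3 structure with three independent Weibull components whose shape parameters are chosen arbitrarily for illustration; it checks that v(t) = −(1/t) ln F̄S(t) is nondecreasing, i.e., that the system lifetime is IFRA:

```python
import math

# Reliability function of a 2-out-of-3 monotone structure (assumed here
# purely for illustration).
def h(p1, p2, p3):
    return p1*p2 + p1*p3 + p2*p3 - 2*p1*p2*p3

# Independent Weibull components with shapes beta >= 1, hence IFRA.
betas = (1.5, 2.0, 3.0)

def sys_surv(t):
    return h(*(math.exp(-t**b) for b in betas))

# By Theorem 2.31 the system is IFRA, i.e., v(t) = -(1/t) ln F̄_S(t)
# is nondecreasing.
ts = [0.05 * k for k in range(1, 60)]
v = [-math.log(sys_surv(t)) / t for t in ts]
print(all(v[i] <= v[i+1] + 1e-12 for i in range(len(v) - 1)))  # True
```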

We know that independent IFR components form an IFRA monotone system and hence, if the components have exponentially distributed lifetimes, the system lifetime is of IFRA type. Since constant failure rates are also included in the DFR class, one cannot hope for a corresponding closure theorem for DFRA distributions. However, considering other reliability operations, things may change. For example, let {Fk : k ∈ N} be a family of distributions and F = ∑_{k=1}^∞ pk Fk be its mixture with respect to some probability distribution (pk). Then it is known that the DFR and the DFRA property are preserved, i.e., if all Fk are DFR(A), then the mixture F is also DFR(A) (for a proof of a slightly more general result see [32], p. 103). Of course, by the same argument as above a closure theorem for mixtures cannot hold true for IFRA distributions.

Finally, we state a closure theorem for convolutions. Since a complete proof is lengthy (and technical), we do not present it here; we refer to [32], p. 100, and [139], p. 23.


Theorem 2.32. Let X and Y be two independent random variables with IFR distributions. Then X + Y has an IFR distribution.

By induction this property extends to an arbitrary finite number of random variables. This shows, for example, that the Erlang distribution is of IFR type because it is the distribution of a sum of exponentially distributed random variables.

2.2.3 Stochastic Comparison

There are many possibilities to compare random variables or their distributions with each other, and a rich literature treats various ways of defining stochastic orders. One of the most important in reliability is the stochastic order. Let X and Y be two random variables. Then X is said to be smaller in the stochastic order, denoted X ≤st Y, if P(X > t) ≤ P(Y > t) for all t ∈ R+. In reliability terms we say that X is stochastically smaller than Y if the probability of surviving a given time t is smaller for X than for Y for all t. Note that the stochastic order compares two distributions; the random variables could even be defined on different probability spaces. One main point is now to compare a given lifetime distribution with the exponential one. The reason why we choose the exponential distribution is its simplicity and the special role it plays on the border between the IFR(A) and the DFR(A) classes. However, it turns out that in general a random variable with an IFR(A) distribution is not stochastically smaller than an exponentially distributed one, but their distributions cross at most once.

Lemma 2.33. Let T be a positive random variable with IFRA distribution F and let xp be fixed such that F(xp) = p (p-quantile). Then for 0 < p < 1

F̄(t) ≥ e^{−αt} for 0 ≤ t < xp and

F̄(t) ≤ e^{−αt} for xp ≤ t

hold true, where α = −(1/xp) ln(1 − p).

Proof. For an IFRA distribution, v(t) = (−ln F̄(t))/t is nondecreasing. Therefore the result follows by noting that v(t) ≤ v(xp) = α for t < xp and v(t) ≥ α for t ≥ xp. □

The last lemma compares an IFRA distribution with an exponential distribution with the same p-quantile. It is also of interest to compare F having expectation μ with a corresponding Exp(1/μ) distribution. The easiest way seems to be to set α = 1/μ in the above lemma. But an IFRA distribution function may have jumps, so that there might be no t with v(t) = 1/μ. If, on the other hand, F has the stronger IFR property, then it is continuous for t < t∗ = sup{t ∈ R+ : F̄(t) > 0} (possibly t∗ = ∞) and a jump can only occur at t∗ if t∗ < ∞. So we find a value tμ with v(tμ) = 1/μ, excluding the degenerate case F̄(μ) = 0, i.e., t∗ = μ. This leads to the following result.


Lemma 2.34. Let T be a positive random variable with IFR distribution F, mean μ, and let tμ = inf{t ∈ R+ : −(1/t) ln F̄(t) ≥ 1/μ}. Then

F̄(t) ≥ e^{−t/μ} for 0 ≤ t < tμ,

F̄(t) ≤ e^{−t/μ} for tμ ≤ t

and tμ ≥ μ hold true.

Proof. The inequality for the survival probability follows from Lemma 2.33 with α = 1/μ, where in the degenerate case t∗ = μ we have tμ = t∗ = μ. It remains to show tμ ≥ μ. To this end we first confine ourselves to the continuous case and assume that F has no jump at t∗. Then F(T) has a uniform distribution on [0, 1] and we obtain E[ln F̄(T)] = −1. Now

F̄(t + x)/F̄(t) = exp{−(Λ(t + x) − Λ(t))}

is nonincreasing in t for all x ≥ 0, which implies that Λ(t) = −ln F̄(t) is convex, and we can apply J.L. Jensen's inequality to yield

1 = E[−ln F̄(T)] ≥ −ln F̄(μ).

This is tantamount to −(1/μ) ln F̄(μ) ≤ 1/μ and hence tμ ≥ μ, which proves the assertion for continuous F.

In case F has a jump at t∗ we can approximate F by continuous distributions. Then t∗ is finite and all considerations can be carried over to the limit. We omit the details. □

Example 2.35. Let T follow a Weibull distribution F̄(t) = exp{−t^β} with mean μ = Γ(1 + 1/β), where Γ is the Gamma function. Then clearly F is IFR if β > 1. Lemma 2.34 yields F̄(t) ≥ exp{−t/μ} for 0 ≤ t < tμ = (1/μ)^{1/(β−1)} and tμ ≥ μ. Note that in this case tμ > μ, which extends slightly the well-known result F̄(t) ≥ exp{−t/μ} for 0 ≤ t < μ (see [32], Theorem 6.2, p. 111).
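A quick numerical check of this example, assuming the shape β = 2 (so μ = Γ(3/2) ≈ 0.886):

```python
import math

# Weibull survival F̄(t) = exp(-t**beta); beta = 2 is an assumed value.
beta = 2.0
mu = math.gamma(1.0 + 1.0/beta)        # mean, ≈ 0.886
t_mu = (1.0/mu) ** (1.0/(beta - 1.0))  # ≈ 1.128, so t_mu > mu

# Lemma 2.34: F̄(t) >= exp(-t/mu) for 0 <= t < t_mu.
ok = all(math.exp(-t**beta) >= math.exp(-t/mu)
         for t in [t_mu * k / 200 for k in range(200)])
print(t_mu > mu, ok)  # True True
```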

A lot of other bounds for the survival probability can be set up under various conditions (see the references listed in the Bibliographic Notes). Next we want to give one example of how such bounds can be carried over to monotone systems. As an immediate consequence of the last lemma we obtain the following corollary.

Corollary 2.36. Let h be the reliability function of a monotone system with lifetime distribution F. If the components are independent with IFR distributions Fi and means μi, i = 1, . . . , n, then we have

F̄(t) ≥ h(e^{−t/μ1}, . . . , e^{−t/μn}) for t < min{μ1, . . . , μn}.


Actually the inequality holds true for t < min{tμ1, . . . , tμn}. The idea of this inequality is to give a bound on the reliability of the system at time t based only on h and the μi and the knowledge that the Fi are of IFR type. If the reliability function h is unknown, then it could be replaced by that of a series system to yield

F̄(t) ≥ h(e^{−t/μ1}, . . . , e^{−t/μn}) ≥ ∏_{i=1}^n e^{−t/μi} = exp{−t ∑_{i=1}^n 1/μi}

for t < min{μ1, . . . , μn}.

These few examples given here indicate how aging properties lead to bounds on the reliability or survival probability of a single component and how these affect the lifetime of a system comprising independent components.
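A small sketch of these bounds, with illustrative (assumed) means μi; it compares the series-system fallback bound with the sharper bound obtained when h is known, here for a parallel structure:

```python
import math

# Independent IFR components with assumed means mu_i; the series-system
# fallback bound is exp(-t * sum(1/mu_i)) for t < min(mu_i).
mus = [2.0, 3.0, 5.0]           # illustrative means
rate = sum(1.0/m for m in mus)  # sum of 1/mu_i

def series_bound(t):
    return math.exp(-rate * t)

# If the structure is actually a parallel system, h(p) = 1 - prod(1 - p_i)
# gives the sharper bound of Corollary 2.36.
def parallel_bound(t):
    return 1.0 - math.prod(1.0 - math.exp(-t/m) for m in mus)

t = 1.0  # t < min(mus) = 2.0
print(series_bound(t) <= parallel_bound(t))  # True
```

The series bound is always the weaker (smaller) of the two, which is exactly the point of the fallback inequality.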

2.3 Copula Models of Complex Systems in Reliability

2.3.1 Introduction to Copula Models

We consider a complex system comprising n components. The lifetimes of the components are described by non-negative random variables T1, . . . , Tn, where Ti has continuous distribution Fi with support R+, i = 1, . . . , n. Usually, the lifetimes are assumed to be stochastically independent. But in a number of cases such an assumption is not likely to hold true, e.g., if all components of a system are exposed to the same environmental conditions or stresses. Therefore, we want to extend the model to possibly dependent lifetimes with joint cumulative distribution function H:

H(t1, . . . , tn) = P (T1 ≤ t1, . . . , Tn ≤ tn).

To investigate the influence of the dependence structure on the system reliability, it turns out to be useful to assume that the dependence structure is given by a copula. Such a copula C is defined as an n-variate distribution function on the cube [0, 1]^n with marginals that are uniform distributions on [0, 1], i.e.,

1. C(u) = 0 for any u ∈ [0, 1]^n, if at least one coordinate of u = (u1, . . . , un) is 0.

2. C(u) = ui for any u ∈ [0, 1]^n, if all coordinates of u are 1 except ui.

The link between the joint distribution function H and the marginal distribution functions Fi of the random variables Ti is given by a copula C. According to Sklar's theorem (see Nelsen [127]), for any n-variate distribution H with marginals Fi there exists an n-copula C such that

H(t1, . . . , tn) = C(F1(t1), . . . , Fn(tn))


for all t1, . . . , tn. If F1, . . . , Fn are continuous, as is assumed here, then this copula C is uniquely determined.

As before, we consider a binary monotone system admitting two states: working (coded as 1) and failed (coded as 0). The state of the system is uniquely determined by the binary states of the n components, i.e., there is a structure function Φ : {0, 1}^n → {0, 1} giving the state of the system according to the states of the components. We consider a monotone system, i.e., we assume that this structure function is monotone in each component and Φ(0, . . . , 0) = 0, Φ(1, . . . , 1) = 1. Let Xt(i) = I(Ti > t), i = 1, . . . , n, describe the state of the ith component at time t, t ∈ R+, where I is the indicator function. Then

FS(t) := P (Φ(Xt(1), . . . , Xt(n)) = 0),

is the distribution function of the system lifetime. Of course, in addition to the structure function Φ, this distribution also depends on the copula C.

One aim is to investigate how the dependence structure determines the lifetime distribution FS of the system, and in particular in which way properties such as expectation or quantiles depend on the copula. To this end we need the system lifetime distribution FS to be given explicitly in terms of Φ and C as follows (see [71]). Let C be an n-dimensional copula and 𝒞 the induced probability measure such that C(t1, . . . , tn) = 𝒞(∏_{i=1}^n [0, ti]). Note that since the support of the copula C is [0, 1]^n we have 𝒞([0, 1]^n) = 1. For 0 ≤ s ≤ 1 we denote the intervals B^s_0 = [0, s] and B^s_1 = (s, 1], where B^1_1 = ∅. We introduce the function GΦ,C : [0, 1]^n → [0, 1] with

FS(t) := P(Φ(Xt(1), . . . , Xt(n)) = 0) = GΦ,C(F1(t), . . . , Fn(t)),

to emphasize that the lifetime distribution FS depends on Φ and on C. This function GΦ,C can be determined as follows (for a proof see [71]).

Theorem 2.37. The system lifetime distribution FS is given for all t ≥ 0 by

FS(t) = GΦ,C(F1(t), . . . , Fn(t)),

where

GΦ,C(t1, . . . , tn) := 1 − ∑_{x∈{0,1}^n} Φ(x) · 𝒞(∏_{i=1}^n B^{ti}_{xi}).

Since this formula is rather complex, we will explain it in more detail for the case n = 2 and give some examples.

Let Y1, Y2 be random variables, each uniformly distributed on [0, 1], with joint distribution C(t1, t2) = P(Y1 ≤ t1, Y2 ≤ t2), t1, t2 ∈ [0, 1], and induced probability measure 𝒞. For the sets D1 = B^{t1}_0 × B^{t2}_0, D2 = B^{t1}_0 × B^{t2}_1, D3 = B^{t1}_1 × B^{t2}_1, D4 = B^{t1}_1 × B^{t2}_0 in Fig. 2.7 we get


Fig. 2.7. Example for n = 2 (the unit square partitioned by t1 and t2 into the rectangles D1, D2, D3, D4)

𝒞(D1) = P(Y1 ≤ t1, Y2 ≤ t2) = C(t1, t2),

𝒞(D2) = P(Y1 ≤ t1, t2 < Y2 ≤ 1) = C(t1, 1) − C(t1, t2) = t1 − C(t1, t2),

𝒞(D3) = P(t1 < Y1 ≤ 1, t2 < Y2 ≤ 1) = 1 − C(1, t2) − C(t1, 1) + C(t1, t2) = 1 − t1 − t2 + C(t1, t2),

𝒞(D4) = P(t1 < Y1 ≤ 1, Y2 ≤ t2) = C(1, t2) − C(t1, t2) = t2 − C(t1, t2).
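These four rectangle probabilities can be computed for any bivariate copula by inclusion–exclusion. A sketch, using the product copula as a stand-in (any copula function of two arguments works):

```python
# Measure of the four rectangles D1..D4 under a bivariate copula C, computed
# by inclusion-exclusion; the product copula is used here as an assumed example.
def C(u, v):
    return u * v

def rect_probs(C, t1, t2):
    d1 = C(t1, t2)                                   # [0,t1] x [0,t2]
    d2 = C(t1, 1.0) - C(t1, t2)                      # [0,t1] x (t2,1]
    d3 = 1.0 - C(1.0, t2) - C(t1, 1.0) + C(t1, t2)   # (t1,1] x (t2,1]
    d4 = C(1.0, t2) - C(t1, t2)                      # (t1,1] x [0,t2]
    return d1, d2, d3, d4

probs = rect_probs(C, 0.3, 0.6)
print(all(p >= 0 for p in probs), abs(sum(probs) - 1.0) < 1e-12)  # True True
```

The four probabilities partition the unit square, so they always sum to 1.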

Example 2.38.

(i) In the case of a parallel system with n components, the structure function is given by Φ(x1, . . . , xn) = 1 − ∏_{i=1}^n (1 − xi), which is 0 if and only if x = (0, . . . , 0). Therefore, the sum in GΦ,C extends over all possible x except the null vector, yielding

GΦ,C(t1, . . . , tn) = 1 − (1 − 𝒞(∏_{i=1}^n B^{ti}_0)) = C(t1, . . . , tn).

It follows, as was to be expected, that

FS(t) = GΦ,C(F1(t), . . . , Fn(t)) = C(F1(t), . . . , Fn(t)) = H(t, . . . , t).

(ii) For a series system with n components, we have Φ(x1, . . . , xn) = ∏_{i=1}^n xi, which is 1 if and only if x = (1, . . . , 1). Hence

GΦ,C(t1, . . . , tn) = 1 − 𝒞(∏_{i=1}^n B^{ti}_1).

If we denote by H̄(t1, . . . , tn) = P(T1 > t1, . . . , Tn > tn) the survival function of H and by C̄ the n-dimensional joint survival function corresponding to C, then we get for the lifetime distribution of a series system

FS(t) = 1 − H̄(t, . . . , t) = 1 − C̄(F1(t), . . . , Fn(t)).

In the special case n = 2 we have GΦ,C(t1, t2) = t1 + t2 − C(t1, t2), yielding

FS(t) = F1(t) + F2(t) − C(F1(t), F2(t)).

(iii) If the n component lifetimes are independent, then the copula C is the product copula ∏(t1, . . . , tn) = t1 · . . . · tn. Thus

GΦ,C(t1, . . . , tn) = 1 − ∑_{x∈{0,1}^n} Φ(x) ∏_{i=1}^n ti^{1−xi} (1 − ti)^{xi}.

The intact probabilities of the components at time t are F̄i(t) = 1 − Fi(t) = P(Xt(i) = 1), i = 1, . . . , n. The system reliability is then given by

F̄S(t) = ∑_{x∈{0,1}^n} Φ(x) ∏_{i=1}^n (F̄i(t))^{xi} (Fi(t))^{1−xi},

the well-known formula that results from the state enumeration method (see Chap. 2.1, p. 25).
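The state enumeration formula is straightforward to implement. A sketch, assuming a 2-out-of-3 structure (any monotone Φ works) and comparing against the closed-form reliability h(p) = p1p2 + p1p3 + p2p3 − 2p1p2p3:

```python
from itertools import product

# State enumeration for independent components:
# F̄_S = sum over x in {0,1}^n of Phi(x) * prod p_i^{x_i} (1-p_i)^{1-x_i}.
# The 2-out-of-3 structure below is an assumed example.
def phi(x):
    return 1 if sum(x) >= 2 else 0

def system_reliability(p):
    total = 0.0
    for x in product((0, 1), repeat=len(p)):
        w = 1.0
        for pi, xi in zip(p, x):
            w *= pi if xi == 1 else (1.0 - pi)
        total += phi(x) * w
    return total

p = (0.9, 0.8, 0.7)  # assumed component survival probabilities at time t
closed_form = 0.9*0.8 + 0.9*0.7 + 0.8*0.7 - 2*0.9*0.8*0.7
print(abs(system_reliability(p) - closed_form) < 1e-12)  # True
```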

2.3.2 The Influence of the Copula on the Lifetime Distribution of the System

In the following we want to investigate in which way the dependence structure, i.e., the copula, influences one-dimensional properties q(FS) of the system lifetime distribution FS, where the functional q : D → R̄ is a mapping from the space D of distribution functions of non-negative random variables to R̄ = R ∪ {−∞, ∞}.

Important examples of such functionals are

• the system reliability Rt at a fixed time t:

Rt(FS) = P(Φ(Xt(1), . . . , Xt(n)) = 1) = 1 − FS(t) = F̄S(t),

• the expectation E:

E(FS) = ∫_0^∞ F̄S(t)dt,

• the p-quantiles Qp of the system lifetime distribution:

Qp(FS) = inf{t ∈ R+ : FS(t) ≥ p}, 0 < p ≤ 1.

To investigate the influence of the copula on these one-dimensional quantities, we first have to compare different multivariate distributions. There are a lot of comparison methods, presented in some detail in [123, 99] and related to copulas in Nelsen [127]. We summarize briefly the notions we need.

We consider n non-negative random variables T1, . . . , Tn with joint distribution function H, marginals F1, . . . , Fn, and survival function H̄(t1, . . . , tn) = P(T1 > t1, . . . , Tn > tn). In the case n = 2 we have the relation H̄(t1, t2) = 1 − F1(t1) − F2(t2) + H(t1, t2). Now we want to compare two n-variate distribution functions H, G ∈ D(F1, . . . , Fn), where D(F1, . . . , Fn) denotes the set of distribution functions with marginals F1, . . . , Fn, each with support R+.

Definition 2.39. Let H, G ∈ D(F1, . . . , Fn), n ≥ 2.

(i) G is more positive lower orthant dependent (PLOD) than H, written H ≺cL G, if H(t) ≤ G(t) for all t = (t1, . . . , tn) ∈ R+^n.

(ii) G is more positive upper orthant dependent (PUOD) than H, written H ≺cU G, if H̄(t) ≤ Ḡ(t) for all t.

(iii) G is more concordant than H, written H ≺c G, if both H(t) ≤ G(t) and H̄(t) ≤ Ḡ(t) hold for all t.

For n = 2, parts (i) and (ii) of the above definition are equivalent, as can be seen from the relation between H and H̄. This does not hold true in higher dimensions. To compare two distributions H, G ∈ D(F1, . . . , Fn) with fixed marginals it is, of course, enough to compare their corresponding copulas.

For n = 2 random variables X, Y with continuous distribution functions F, G and copula C, there are well-known measures of the degree of dependence, such as Kendall's tau τX,Y or Spearman's rho ρX,Y, which can be expressed in terms of the copula C:

τX,Y = 4 ∫∫_{[0,1]^2} C(u, v)dC(u, v) − 1,  ρX,Y = 12 ∫∫_{[0,1]^2} C(u, v)dudv − 3.

This shows that monotonicity of copulas with respect to the PLOD ordering carries over to monotonicity of Kendall's tau and Spearman's rho. In a similar way we want to investigate the effect of an increase of dependence on one-dimensional properties q(FS) of the system lifetime distribution. We cannot hope for results for arbitrary systems, but for parallel and series systems, see Fig. 2.8, we can prove the following theorem. For this we need the usual stochastic order on D: F ≤s G iff F(t) ≥ G(t) for all t ≥ 0.


Fig. 2.8. (a) Parallel and (b) series system

Theorem 2.40. Let the functional q : D → R̄ be nondecreasing with respect to the usual stochastic order on D and let C1 and C2 be two n-dimensional copulas.

(i) If for a parallel system C1 ≺cL C2, then

q(F^S_C2) ≤ q(F^S_C1);

(ii) if for a series system C1 ≺cU C2, then

q(F^S_C1) ≤ q(F^S_C2).

If q is nonincreasing, then the inequalities in (i) and (ii) are reversed.

Proof. (i) For a parallel system, note that according to Example 2.38(i) it holds that

F^S_Ci(t) = Ci(F1(t), . . . , Fn(t)),

where i = 1, 2 and F1, . . . , Fn ∈ D. Since C1 ≺cL C2, it is clear that F^S_C1(t) ≤ F^S_C2(t) for all t ≥ 0, that is, F^S_C2 ≤s F^S_C1. Because of the monotonicity of q we get the assertion

q(F^S_C2) ≤ q(F^S_C1).

The proof of (ii) is similar: for a series system we have

F^S_Ci(t) = 1 − C̄i(F1(t), . . . , Fn(t)).

Therefore, the PUOD ordering of the Ci yields F^S_C1 ≤s F^S_C2 and consequently the assertion.

The case of nonincreasing q is obvious. □

The above theorem shall be applied to the three functionals mentioned earlier, namely the system reliability Rt(FS) = F̄S(t), the expectation E(FS) = ∫_0^∞ F̄S(t)dt, and the quantile Qp(FS) := inf{t ∈ R+ : FS(t) ≥ p}, 0 < p ≤ 1. Note that these functionals are all nondecreasing with respect to the usual stochastic ordering.


One is often interested in bounds for these reliability quantities in cases when the marginals are (approximately) known but the dependence structure is unknown. For this we can utilize the so-called Fréchet–Hoeffding bounds (see Nelsen [127])

W(u1, . . . , un) = max{1 − n + ∑_{i=1}^n ui, 0},
M(u1, . . . , un) = min{u1, . . . , un}.

While M itself is a copula, W is not a distribution function for n ≥ 3. It is known (see Nelsen [127]) that all copulas C lie within these two bounds, i.e.,

W ≺cL C ≺cL M.

Using the preceding theorem yields

(i) for a parallel system:

Rt(F^S_M) ≤ Rt(F^S_C) ≤ Rt(F^S_W),
E(F^S_M) ≤ E(F^S_C) ≤ E(F^S_W),
Qp(F^S_M) ≤ Qp(F^S_C) ≤ Qp(F^S_W),

where we use the notation F^S_C for the system lifetime distribution according to the copula C;

(ii) in the case n = 2 the relation W ≺cU C ≺cU M holds true, yielding the reverse inequalities for a series system:

Rt(F^S_W) ≤ Rt(F^S_C) ≤ Rt(F^S_M),
E(F^S_W) ≤ E(F^S_C) ≤ E(F^S_M),
Qp(F^S_W) ≤ Qp(F^S_C) ≤ Qp(F^S_M).

This example provides us with an upper bound Qp(F^S_W) and a lower bound Qp(F^S_M), respectively, for the quantile Qp(F^S_C) of a parallel system. The corresponding bounds for the quantile Qp(F^S_C) of a series system are Qp(F^S_M) and Qp(F^S_W), respectively. Note that the lower bound for a parallel system coincides with the upper bound for a series system.

This example also verifies that the stronger the dependence between the component lifetimes in a series system is, the more reliable the system is. But for a parallel system the reverse holds true: the system becomes weaker the stronger the dependence is, always under the assumption that the marginals remain the same.
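As a numerical illustration of these quantile bounds, consider a two-component parallel system with identical Exp(λ) marginals (λ and p are assumed values). The bounds can be solved in closed form, since F^par_M(t) = F(t) and F^par_W(t) = max(2F(t) − 1, 0):

```python
import math

# Quantile bounds for a two-component parallel system with identical
# Exp(lam) marginals; lam and p are assumed values.
lam, p = 1.0, 0.9

# F_par_M(t) = F(t): solve F(t) = p          -> lower quantile bound
q_lower = -math.log(1.0 - p) / lam
# F_par_W(t) = max(2F(t) - 1, 0): 2F - 1 = p -> upper quantile bound
q_upper = -math.log((1.0 - p) / 2.0) / lam
# Independence: F(t)**2 = p -> F(t) = sqrt(p)
q_indep = -math.log(1.0 - math.sqrt(p)) / lam

print(q_lower <= q_indep <= q_upper)  # True
```

The independence quantile lies strictly between the two Fréchet–Hoeffding extremes, as the ordering predicts.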


2.3.3 Archimedean Copulas

In general it is not easy to check whether multivariate copulas are PLOD, PUOD, or concordant ordered. But for an important subclass, the so-called Archimedean copulas, the concordance order can be checked by investigating the properties of the generators of the Archimedean copulas (see Nelsen [127]). A function ϕ : [0, 1] → [0, ∞] is a generator (of an n-dimensional Archimedean copula) if ϕ is continuous, strictly decreasing, ϕ(0) = ∞, ϕ(1) = 0, and the inverse ϕ^{−1} is completely monotonic, i.e.,

(−1)^k (d^k/dt^k) ϕ^{−1}(t) ≥ 0, t ≥ 0, k = 0, 1, 2, . . .

The function C : [0, 1]^n → [0, 1] defined by

C(u) = ϕ^{−1}(ϕ(u1) + ϕ(u2) + · · · + ϕ(un))

is then an n-dimensional Archimedean copula with generator ϕ.

Definition 2.41. A function f : R+ → R is subadditive if for all x1, . . . , xn ∈ R+

f(x1 + · · · + xn) ≤ f(x1) + · · · + f(xn). (2.16)

Using this definition, the following theorem supplies us with a necessary and sufficient condition to check the concordance order of two Archimedean copulas C1, C2 with generators ϕ1 and ϕ2, respectively.

Theorem 2.42. Let C1 and C2 be n-dimensional Archimedean copulas generated by ϕ1 and ϕ2. Then C1 ≺cL C2 if and only if ϕ1 ∘ ϕ2^{−1} is subadditive.

Proof. Let f = ϕ1 ∘ ϕ2^{−1}. The function f is continuous and nondecreasing with

f(0) = ϕ1 ∘ ϕ2^{−1}(0) = ϕ1(1) = 0.

According to the definition, C1 ≺cL C2 holds true if and only if for all x1, . . . , xn ∈ [0, 1]

ϕ1^{−1}(ϕ1(x1) + · · · + ϕ1(xn)) ≤ ϕ2^{−1}(ϕ2(x1) + · · · + ϕ2(xn)). (2.17)

Inserting ti = ϕ2(xi), i = 1, . . . , n, (2.17) is equivalent to

ϕ1^{−1}(f(t1) + · · · + f(tn)) ≤ ϕ2^{−1}(t1 + · · · + tn) (2.18)

for all t1, . . . , tn ≥ 0. Applying the strictly decreasing function ϕ1 to both sides of (2.18) one gets

f(t1 + · · · + tn) ≤ f(t1) + · · · + f(tn).

This shows the equivalence of the subadditivity of f = ϕ1 ∘ ϕ2^{−1} and C1 ≺cL C2. □


To verify whether ϕ1 ∘ ϕ2^{−1} is subadditive may still be a challenge. Therefore, we state three sufficient conditions for subadditivity in the following corollary. The elementary proofs can be found in Nelsen [127] for the case n = 2 and can easily be extended to the general case n ≥ 2.

Corollary 2.43. Under the assumptions of Theorem 2.42, C1 ≺cL C2 holds true if any of the following conditions is satisfied:

(i) ϕ1 ∘ ϕ2^{−1} is concave;
(ii) ϕ1/ϕ2 is nondecreasing on (0, 1);
(iii) ϕ1 and ϕ2 are continuously differentiable on (0, 1) and ϕ1′/ϕ2′ is nondecreasing on (0, 1).

2.3.4 The Expectation of the Lifetime of a Two-Component System with Exponential Marginals

As an example we consider a complex system with n = 2 components with lifetimes T1, T2, which are both exponentially distributed with the same parameter λ > 0. To model the dependence we consider the one-parameter Clayton or Pareto family of copulas

Cθ(u, v) = [(u^{−θ} + v^{−θ} − 1)+]^{−1/θ}, θ ∈ [−1, ∞) \ {0},

with generator ϕθ(t) = (1/θ)(t^{−θ} − 1). Is this family positively ordered in the sense that for θ1 ≤ θ2 we have Cθ1 ≺c Cθ2? Note that in the case n = 2 the PLOD and PUOD orderings coincide and are equivalent to the concordance ordering ≺c. To check whether the Clayton family is positively ordered we can use Corollary 2.43, part (iii). The generator ϕθ is continuously differentiable on (0, 1) with ϕθ′(t) = −t^{−θ−1}. The ratio ϕθ1′/ϕθ2′ = t^{θ2−θ1} is nondecreasing on (0, 1) for θ1 ≤ θ2, which is sufficient for Cθ1 ≺c Cθ2, i.e., the degree of dependence increases with θ. The extreme cases θ = −1 and θ → ∞ are the Fréchet–Hoeffding bounds C−1 = W and C∞ = M. The limiting case θ → 0 yields the product copula C0 = ∏ (independence).
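The positive ordering of the Clayton family can be spot-checked numerically on a grid. A sketch for two assumed positive parameter values (for θ > 0 the term u^{−θ} + v^{−θ} − 1 is always positive, so the (·)+ can be dropped):

```python
# Clayton copula C_theta(u,v) = (u**(-theta) + v**(-theta) - 1)**(-1/theta)
# for theta > 0; spot check of the concordance ordering on a grid.
def clayton(u, v, theta):
    return (u**(-theta) + v**(-theta) - 1.0) ** (-1.0/theta)

grid = [0.1 * k for k in range(1, 10)]
theta1, theta2 = 0.5, 2.0  # assumed parameter values, theta1 <= theta2
print(all(clayton(u, v, theta1) <= clayton(u, v, theta2) + 1e-12
          for u in grid for v in grid))  # True
```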

Parallel System

The lifetime T = T1 ∨ T2 of a parallel system has distribution function F^par_Cθ(t) = P(T ≤ t) = Cθ(F1(t), F2(t)). Since Cθ is positively ordered (concordance ordering), the expectation

E(F^par_Cθ) = ∫_0^∞ (1 − Cθ(F1(t), F2(t)))dt

is decreasing in θ. The extreme and special cases are:


• θ = −1, C−1 = W: E(F^par_W) = ∫_0^∞ (1 − W(F1(t), F2(t)))dt.
In the exponential case F1(t) = F2(t) = F(t) = 1 − exp(−λt) we get

E(F^par_W) = ∫_0^∞ [1 − (2F(t) − 1)+]dt = (1 + ln 2)(1/λ).

• θ = 0, C0 = ∏: E(F^par_∏) = ∫_0^∞ (1 − F1(t)F2(t))dt.
In the exponential case we get

E(F^par_∏) = ∫_0^∞ [1 − F^2(t)]dt = (3/2)(1/λ).

• θ = ∞, C∞ = M: E(F^par_M) = ∫_0^∞ [1 − M(F1(t), F2(t))]dt.
In the exponential case we get

E(F^par_M) = ∫_0^∞ [1 − F(t)]dt = 1/λ.

This shows that in the independence case the second component in this two-component parallel system prolongs the mean lifetime by 50%. The greatest possible prolongation is about 70% [ln 2 · 100] in the extreme negative correlation case, whereas, as was to be expected, the worst case is a correlation of 1 between the component lifetimes, in which case a second component does not pay.
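The three closed-form expectations above can be confirmed by crude numerical integration of E(F^par_C) = ∫_0^∞ (1 − C(F(t), F(t)))dt. A sketch with the assumed value λ = 1:

```python
import math

# Crude midpoint-rule check of the closed-form expectations for the parallel
# system with Exp(lam) marginals; lam = 1 is an assumed value.
lam = 1.0
F = lambda t: 1.0 - math.exp(-lam * t)

def expectation(copula, upper=30.0, n=300_000):
    dt = upper / n
    return sum((1.0 - copula(F(dt * (k + 0.5)), F(dt * (k + 0.5)))) * dt
               for k in range(n))

W = lambda u, v: max(u + v - 1.0, 0.0)   # lower Frechet-Hoeffding bound
P = lambda u, v: u * v                   # product copula (independence)
M = lambda u, v: min(u, v)               # upper Frechet-Hoeffding bound

print(round(expectation(W), 3),  # ≈ 1 + ln 2 ≈ 1.693
      round(expectation(P), 3),  # ≈ 3/2
      round(expectation(M), 3))  # ≈ 1
```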

Series System

The lifetime T = T1 ∧ T2 of a series system has distribution function F^ser_Cθ(t) = P(T ≤ t) = F1(t) + F2(t) − Cθ(F1(t), F2(t)) according to Example 2.38. For the expectation of the system lifetime we get

E(F^ser_Cθ) = E(T1) + E(T2) − E(T1 ∨ T2).

Therefore, the properties of the expectation can be transferred from the parallel system:

• θ = −1, C−1 = W: E(F^ser_W) = E(T1) + E(T2) − ∫_0^∞ (1 − W(F1(t), F2(t)))dt.
In the exponential case we get

E(F^ser_W) = 2/λ − (1 + ln 2)(1/λ) = (1 − ln 2)(1/λ).

• θ = 0, C0 = ∏: E(F^ser_∏) = E(T1) + E(T2) − ∫_0^∞ (1 − F1(t)F2(t))dt.
In the exponential case we get

E(F^ser_∏) = 2/λ − (3/2)(1/λ) = 0.5 · (1/λ).

• θ = ∞, C∞ = M: E(F^ser_M) = E(T1) + E(T2) − ∫_0^∞ [1 − M(F1(t), F2(t))]dt.
In the exponential case we get

E(F^ser_M) = 1/λ.

This shows that the expected system lifetime of a series system can be reduced to about 30% [(1 − ln 2) · 100] of the expected lifetime of one component.


2.3.5 Marshall–Olkin Distribution

In this subsection we consider the bivariate Marshall–Olkin (M–O) distribution and investigate the influence of the degree of dependence on the system reliability. The M–O distribution is interesting insofar as it can be interpreted physically. As before, we consider a complex system with two components. The system is subject to shocks that are always "fatal" to one or both of the components. The shocks occur at times Z1, Z2, Z12, where we differentiate whether only the first, only the second, or both components are destroyed. These random variables are assumed to be independent and exponentially distributed with parameters λ1, λ2, λ12 > 0, respectively. The component lifetimes T1, T2 are given by

T1 = Z1 ∧ Z12 and T2 = Z2 ∧ Z12

and follow exponential distributions with parameters λ1 + λ12 and λ2 + λ12, respectively. The joint distribution of T1 and T2 is called the Marshall–Olkin distribution, with joint distribution function

H(t1, t2) = H̄(t1, t2) + F1(t1) + F2(t2) − 1
= exp(−λ1t1 − λ2t2 − λ12(t1 ∨ t2)) − exp(−(λ1 + λ12)t1) − exp(−(λ2 + λ12)t2) + 1, t1, t2 ≥ 0.

The associated M–O copula is

Cα,β(u1, u2) = min((1 − u1)^{1−α}(1 − u2), (1 − u1)(1 − u2)^{1−β}) + u1 + u2 − 1,

where 0 ≤ u1, u2 ≤ 1 and α = λ12/(λ1 + λ12), β = λ12/(λ2 + λ12). As limiting cases we get for the M–O copula

C0,0(u1, u2) = lim_{α→0+} Cα,β(u1, u2) = lim_{β→0+} Cα,β(u1, u2) = ∏(u1, u2) = u1 · u2

and

C1,1(u1, u2) = M(u1, u2) = u1 ∧ u2.

This implies that the limits λ1 → ∞, λ2 → ∞ or λ12 = 0 result in the product copula, whereas the limit λ12 → ∞ or λ1 = λ2 = 0 yields the upper Fréchet–Hoeffding bound. The family Cα,β, 0 ≤ α, β ≤ 1, is positively ordered with respect to the concordance ordering in α (β fixed) as well as in β (α fixed). For 0 ≤ α, β ≤ 1 we get

∏ ≺c Cα,β ≺c M.

Now we are in a position to compare the reliabilities Rt(F^{par}_C) and Rt(F^{ser}_C) by means of Theorem 2.40 for different copulas and all t ≥ 0:

Rt(F^{ser}_∏) ≤ Rt(F^{ser}_{Cα,β}) ≤ Rt(F^{ser}_M) = Rt(F^{par}_M) ≤ Rt(F^{par}_{Cα,β}) ≤ Rt(F^{par}_∏).

2.3 Copula Models of Complex Systems in Reliability 53

The Parallel System

For a parallel system the reliability Rt(F^{par}_{Cα,β}) can be explicitly determined as follows:

Rt(F^{par}_{Cα,β}) = F̄S(t) = 1 − Cα,β(F1(t), F2(t))
= 1 − min((1 − F1(t))^{1−α}(1 − F2(t)), (1 − F1(t))(1 − F2(t))^{1−β}) − F1(t) − F2(t) + 1
= e^{−(λ1+λ12)t} + e^{−(λ2+λ12)t} − e^{−(λ1+λ2+λ12)t}, t ≥ 0.

The reliability functions for different copulas with the same marginals Fi(t) = 1 − exp(−10t), i = 1, 2, are displayed graphically in Fig. 2.9.

Fig. 2.9. Reliability functions of a parallel system

The dotted line in Fig. 2.9 represents the independence case with λ1 = 10, λ2 = 10, λ12 = 0. The dashed line corresponds to λ1 = 5, λ2 = 5, λ12 = 5, whereas the solid line represents the upper Fréchet–Hoeffding bound with λ1 = 0, λ2 = 0, λ12 = 10.


Figure 2.9 shows that with increasing degree of dependence between the component lifetimes, here increasing λ12, the reliability of a parallel system decreases. For example, for t = 0.1 the reliability ranges from R0.1 = 0.60 (λ12 = 0) to R0.1 = 0.37 (λ12 = 10), i.e., the reliability may decrease to about 60% of the reliability in the independence case due to correlation between the component lifetimes.

The Series System

Analogously we can analyze the reliability of a series system under the same conditions as above. The system reliability is

Rt(F^{ser}_{Cα,β}) = F̄S(t) = 1 − F1(t) − F2(t) + Cα,β(F1(t), F2(t))
= e^{−(λ1+λ2+λ12)t}, t ≥ 0.

Figure 2.10 shows the reliability functions for different copulas.

Fig. 2.10. Reliability functions of a series system

As before, the dotted line in Fig. 2.10 represents the independence case with λ1 = 10, λ2 = 10, λ12 = 0. The dashed line corresponds to λ1 = 5, λ2 = 5, λ12 = 5, whereas the solid line represents the upper Fréchet–Hoeffding bound with λ1 = 0, λ2 = 0, λ12 = 10.


With increasing degree of dependence the series system becomes better in that the reliability increases. Furthermore, a parallel system is always more reliable than a series system with the same marginals. For the upper Fréchet–Hoeffding bound the reliability functions of the parallel and the series system coincide, i.e., the best series system is as reliable as the worst parallel system. In this limiting case the correlation of the component lifetimes is ρ(T1, T2) = 1.
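The closed-form reliabilities above lend themselves to a quick numerical cross-check. The following minimal Python sketch (not part of the book; the parameters λ1 = λ2 = λ12 = 5 and t = 0.1 are an illustrative choice matching the dashed curves) simulates the fatal-shock construction T1 = Z1 ∧ Z12, T2 = Z2 ∧ Z12 and compares the empirical reliabilities of the parallel and series systems with the formulas derived above.

```python
import math, random

# Illustrative parameters (dashed curves in Figs. 2.9/2.10): lam1 = lam2 = lam12 = 5
lam1, lam2, lam12, t = 5.0, 5.0, 5.0, 0.1
random.seed(1)

n = 200_000
par_surv = ser_surv = 0
for _ in range(n):
    z1 = random.expovariate(lam1)
    z2 = random.expovariate(lam2)
    z12 = random.expovariate(lam12)        # common fatal shock
    t1, t2 = min(z1, z12), min(z2, z12)    # component lifetimes T1, T2
    par_surv += max(t1, t2) > t            # parallel system fails at max(T1, T2)
    ser_surv += min(t1, t2) > t            # series system fails at min(T1, T2)

# Closed-form reliabilities from the text
R_par = (math.exp(-(lam1 + lam12) * t) + math.exp(-(lam2 + lam12) * t)
         - math.exp(-(lam1 + lam2 + lam12) * t))
R_ser = math.exp(-(lam1 + lam2 + lam12) * t)
print(par_surv / n, R_par)   # ≈ 0.513
print(ser_surv / n, R_ser)   # ≈ 0.223
```

The agreement illustrates that the copula calculus and the shock-model construction describe the same joint law.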

Bibliographic Notes. The basic reliability theory of complex systems was developed in the 1960s and 1970s and is to a large extent covered by the two books of Barlow and Proschan [31] and [32]. Some more recent books in this field are Aven [13] and Høyland and Rausand [90]. Our presentation is based on Aven [13], which also includes the theory of multistate monotone systems. This theory was developed in the 1980s. Refer to Natvig [126] and Aven [17] for further details and references.

For specific references to methods (algorithms) for reliability computations, see [132] and the many papers on this topic appearing in reliability journals each year.

Birnbaum's reliability importance measure presented in Sect. 2.1.1 was introduced by Birnbaum [43]. The improvement potential measure has been used in different contexts, see, e.g., [13, 28]. The measure (2.14) was proposed by Butler [52]. For other references on reliability importance measures, see [13, 28, 39, 79, 86, 90, 125].

Section 2.2, which presents some well-known properties of lifetime distributions, is based on Barlow and Proschan [31], [32], Gertsbakh [74], and Shaked and Shanthikumar [139]. We have not dealt with stochastic comparisons and orders in detail. An overview of this topic with applications in reliability can be found in the book of Shaked and Shanthikumar [139].

Good sources for multivariate comparison methods and dependence concepts are Müller and Stoyan [123], Joe [99] and, in particular related to copulas, Nelsen [127].

3 Stochastic Failure Models

A general set-up should include all basic failure time models, should take into account the time-dynamic development, and should allow for different information and observation levels. Thus, one is led in a natural way to the theory of stochastic processes in continuous time, including (semi-)martingale theory, in the spirit of Arjas [3, 4] and Koch [108]. As was pointed out in Chap. 1, this theory is a powerful tool in reliability analysis. It should be stressed, however, that the purpose of this chapter is to present and introduce ideas rather than to give a far-reaching excursion into the theory of stochastic processes. So the mathematical technicalities are kept to the minimum level necessary to develop the tools to be used. Also, a number of remarks and examples are included to illustrate the theory. Yet, to benefit from reading this chapter a solid basis in stochastics is required. Section 3.1 summarizes the mathematics needed. For a more comprehensive and in-depth presentation of the mathematical basis, we refer to Appendix A and to monographs such as those by Brémaud [50], Dellacherie and Meyer [61, 62], Kallenberg [101], or Rogers and Williams [133].

3.1 Notation and Fundamentals

Let (Ω, F, P) be the basic probability space. The information up to time t is represented by the pre-t-history Ft, which contains all events of F that can be distinguished up to and including time t. The filtration F = (Ft), t ∈ R+, the increasing family of pre-t-histories, is assumed to satisfy the usual conditions of completeness and right continuity, i.e., Ft ⊂ F contains all P-negligible sets of F and Ft = Ft+ = ⋂_{s>t} Fs. We define F∞ = ⋁_{t≥0} Ft as the smallest σ-algebra containing all events of Ft for all t ∈ R+.

If {Xj, j ∈ J} is a family of random variables and {Aj, j ∈ J} is a system of subsets of F, then σ(Xj, j ∈ J) and σ(Aj, j ∈ J), respectively, denote the completion of the generated σ-field, i.e., the generated σ-field including all P-negligible sets of F. In many cases the information is determined by a stochastic process Z = (Zt), t ∈ R+, and the corresponding filtration is the so-called natural or internal one, which is generated by this stochastic process and denoted by F^Z = (F^Z_t), t ∈ R+, F^Z_t = σ(Zs, 0 ≤ s ≤ t). But since it is sometimes desirable to observe one stochastic process on different information levels, it seems more convenient to use filtrations as measures of information. On the basic filtered probability space we now consider a stochastic process Z = (Zt) that is adapted to a general filtration F, i.e., on the F-information level the process can be observed, or in mathematical terms: F^Z_t ⊂ Ft, which ensures that Zt is Ft-measurable for all t ∈ R+. All stochastic processes are, if not stated otherwise, assumed to be right-continuous and to have left limits.

A random variable X is integrable if E|X| < ∞. If the pth power of a random variable X is integrable, E|X|^p < ∞, 1 ≤ p < ∞, then X is sometimes said to be an element of L^p, the vector space of real-valued random variables with finite pth moment. A stochastic process (Xt), t ∈ R+, is called integrable if all Xt are integrable, i.e., Xt ∈ L¹ for all t ∈ R+. A family of random variables (Xt), t ∈ R+, is called uniformly integrable if

lim_{c→∞} sup_{t∈R+} E[|Xt| I(|Xt| ≥ c)] = 0.

To simplify the notation, we assume that relations such as ⊂, = or ≤, <, = between measurable sets and random variables, respectively, always hold with probability one, which means that the term P-a.s. is suppressed. For conditional expectations no difference is made between a version and the equivalence class of P-a.s. equal versions.

If we consider a stochastic process X = (Xt) and do not demand that it is right-continuous, then expressions like Yt = ∫₀ᵗ Xs ds have no meaning unless (Xt) fulfills some measurability condition in the argument t. One such condition is the following.

Definition 3.1. A stochastic process X is F-progressive or progressively measurable if for every t the mapping (s, ω) → Xs(ω) on [0, t] × Ω is measurable with respect to the product σ-algebra B([0, t]) ⊗ Ft, where B([0, t]) is the Borel σ-algebra on [0, t].

Every left- or right-continuous adapted process is progressively measurable. If X is progressive, then so is Y = (Yt), Yt = ∫₀ᵗ Xs ds. A further measurability restriction is needed in connection with stochastic processes in continuous time. This is the fundamental concept of predictability.

Definition 3.2. Let F be a filtration on the basic probability space and let P(F) be the σ-algebra on (0,∞) × Ω generated by the system of sets

(s, t] × A, 0 ≤ s < t, A ∈ Fs, t > 0.

P(F) is called the F-predictable σ-algebra on (0,∞) × Ω. A stochastic process X = (Xt) is called F-predictable if X0 is F0-measurable and the mapping (t, ω) → Xt(ω) on (0,∞) × Ω into R is measurable with respect to P(F).


Every left-continuous process adapted to F is F-predictable. In most applications we will be concerned with predictable processes that are left-continuous. Note that F-predictable processes are also F-progressive.

To get an impression of the meaning of the term predictable, we remark that for an F-predictable process X the value Xt can be predicted from the information available "just" before time t, i.e., Xt is measurable with respect to Ft− = ⋁_{s<t} Fs = σ(As : As ∈ Fs, 0 ≤ s < t). Processes of this kind are important elements in the framework of point processes. Additional information on these measurability concepts can be found in Appendix A.3, p. 254.

Some further important terms are introduced in the following definitions.

Definition 3.3. A random variable τ with values in R+ ∪ {∞} is called anF-stopping time if {τ ≤ t} ∈ Ft for all t ∈ R+.

Thus a stopping time is related to the given information in that at any time t it is possible to decide whether τ has occurred up to time t or not, using only information about the past and present but not anticipating the future.

If F = (Ft) is a filtration and τ an F-stopping time, then the information up to the random time τ is given by Fτ = {A ∈ F∞ : A ∩ {τ ≤ t} ∈ Ft for all t ∈ R+}. To understand the meaning of this definition, we specialize to a deterministic stopping time τ = t* ∈ R+. Then A ∈ Ft* is equivalent to A ∩ {t* ≤ t} ∈ Ft for all t ∈ R+, where {t* ≤ t} stands for Ω if t* ≤ t and for ∅ otherwise, i.e., for t = t* the event must be in Ft*, and then it is in Ft for all t > t* because the filtration is monotone.

Definition 3.4. An integrable F-adapted process (Xt), t ∈ R+, is called a martingale (submartingale, supermartingale) if for all s > t, s, t ∈ R+,

E[Xs|Ft] = (≥, ≤) Xt.

In the following we denote by M the set of martingales with paths that are right-continuous and have left-hand limits, and by M0 the set of martingales M ∈ M with M0 = 0.

3.1.1 The Semimartingale Representation

Semimartingale representations of stochastic processes play a key role in our set-up. They allow the process to be decomposed into a drift or regression part and an additive random fluctuation described by a martingale.

Definition 3.5. A stochastic process Z = (Zt), t ∈ R+, is called a smooth semimartingale (SSM) if it has a decomposition of the form

Zt = Z0 + ∫₀ᵗ fs ds + Mt, (3.1)

where f = (ft), t ∈ R+, is a progressively measurable stochastic process with E∫₀ᵗ |fs| ds < ∞ for all t ∈ R+, E|Z0| < ∞, and M = (Mt) ∈ M0. Short notation: Z = (f, M).


A martingale is the mathematical model of a fair game with constant expectation function EM0 = 0 = EMt for all t ∈ R+. The drift term is an integral over a stochastic process. To give this integral meaning, (ft) should also be measurable in the argument t, which is ensured, for example, if f has right-continuous paths or, more generally, if f is progressively measurable. Since the drift part in the above decomposition is continuous, a process Z that admits such a representation is called an SSM, or a smooth F-semimartingale if we would like to emphasize that Z is adapted to the filtration F. For some additional details concerning SSMs, see Appendix A.6, p. 266.

Below we formulate conditions under which a process Z admits a semimartingale representation and show how this decomposition can be found. To this end we denote D(t, h) = h⁻¹E[Zt+h − Zt|Ft], t, h ∈ R+.

C1 For all t, h ∈ R+, versions of the conditional expectation E[Zt+h|Ft] exist such that the limit

ft = lim_{h→0+} D(t, h)

exists P-a.s. for all t ∈ R+, and (ft), t ∈ R+, is F-progressively measurable with E∫₀ᵗ |fs| ds < ∞ for all t ∈ R+.

C2 For all t ∈ R+, (hD(t, h)), h ∈ R+, has P-a.s. absolutely continuous paths.

C3 For all t ∈ R+, a constant c > 0 exists such that {D(t, h) : 0 < h ≤ c} is uniformly integrable.

The following theorem shows that these conditions are sufficient for an SSM representation.

Theorem 3.6. Let Z = (Zt), t ∈ R+, be a stochastic process on the probability space (Ω, F, P), adapted to the filtration F. If C1, C2, and C3 hold true, then Z is an SSM with representation Z = (f, M), where f is the limit defined in C1 and M is an F-martingale given by

Mt = Zt − Z0 − ∫₀ᵗ fs ds.

Proof. We have to show that with (ft) from condition C1 the right-continuous process Mt = Zt − Z0 − ∫₀ᵗ fs ds is an F-martingale, i.e., that for all A ∈ Ft and s ≥ t, s, t ∈ R+, E[IA Ms] = E[IA Mt], where IA denotes the indicator variable of A. This is equivalent to

E[IA(Ms − Mt)] = ∫_A (Zs − Zt − ∫ₜˢ fu du) dP = 0.

For all r, t ≤ r ≤ s, and A ∈ Ft, IA is Fr-measurable. This yields

(1/h) E[IA(Zr+h − Zr)] = (1/h) E[E[IA(Zr+h − Zr)|Fr]] = E[IA (1/h) E[Zr+h − Zr|Fr]] = E[IA D(r, h)].

From C1 it follows that D(r, h) → fr as h → 0+ and therefore also IA D(r, h) → IA fr as h → 0+ P-a.s. Now IA D(r, h) is uniformly integrable by C3, which ensures that

lim_{h→0+} E[IA D(r, h)] = lim_{h→0+} (1/h) E[IA(Zr+h − Zr)] = E[IA fr]. (3.2)

Because of C2 there exists a process (gt) such that

E[IA(Zs − Zt)] = E[IA ∫ₜˢ gu du] = ∫ₜˢ E[IA gu] du, (3.3)

where the second equality follows from Fubini's theorem. Then (3.2) and (3.3) together yield

E[IA(Zs − Zt)] = ∫ₜˢ E[IA fu] du = E[IA ∫ₜˢ fu du],

which proves the assertion. □

Remark 3.7. (i) In the terminology of Dellacherie and Meyer [62] an SSM Z = (f, M) is a special semimartingale because the drift term ∫₀ᵗ fs ds is continuous and therefore predictable. Hence the decomposition of Z is unique P-a.s., because a second decomposition Z = (f′, M′) would lead to the continuous and therefore predictable martingale M − M′ of integrable variation, which is identically 0 (cf. Appendix A.5, Lemma A.39, p. 263). (ii) It can be shown that if Z = (f, M) is an SSM and for some constant c > 0 the family of random variables {|h⁻¹ ∫ₜ^{t+h} fs ds| : 0 < h ≤ c} is bounded by some integrable random variable Y, then the conditions C1–C3 hold true, i.e., under this boundedness condition C1–C3 are not only sufficient but also necessary for a semimartingale representation. The proof of the main part (C2) is based on the Radon–Nikodym theorem. The details are of a technical nature, and they are therefore omitted and left to the interested reader. (iii) For applications it is often of interest to find an SSM representation for point processes, i.e., to determine the compensator of such a process (cf. Definition 3.8 on p. 62). For such and other more specialized processes, specifically adapted methods to find the compensator can be applied, see below and [16, 50, 58, 103, 115].

One of the simplest examples of a process with an SSM representation is the Poisson process (Nt), t ∈ R+, with constant rate λ > 0. It is well known and easy to see from the definition of a martingale that Mt = Nt − λt defines a martingale with respect to the internal filtration F^N_t = σ(Ns, 0 ≤ s ≤ t). If we consider conditions C1–C3, we find that D(t, h) = λ for all t, h ∈ R+ because the Poisson process has independent and stationary increments: E[Nt+h − Nt|F^N_t] = E[Nt+h − Nt] = ENh = hλ. Therefore, we see that C1–C3 are satisfied with ft = λ for all ω ∈ Ω and all t ∈ R+, which results in the representation Nt = ∫₀ᵗ λ ds + Mt = λt + Mt.

The Poisson process is a point process as well as an example of a Markovprocess, and the question arises under which conditions point and Markovprocesses admit an SSM representation.


Point and Counting Processes

A point process over R+ can be described by an increasing sequence of random variables, by a purely atomic random measure, or by means of its corresponding counting process. Since we want to use the semimartingale structure of point processes, we will mostly use the last description of a point process. A (univariate) point process is an increasing sequence (Tn), n ∈ N, of positive random variables, which may also take the value +∞: 0 < T1 ≤ T2 ≤ .... The inequality is strict unless Tn = ∞. We always assume that T∞ = lim_{n→∞} Tn = ∞, i.e., that the point process is nonexplosive.

This point process is also completely characterized by the random measure μ on (0,∞) defined by

μ(ω, A) = Σ_{k≥1} I(Tk(ω) ∈ A)

for all Borel sets A of (0,∞).

Another equivalent way to describe a point process is by a counting process N = (Nt), t ∈ R+, with

Nt(ω) = Σ_{k≥1} I(Tk(ω) ≤ t),

which is, for each realization ω, a right-continuous step function with jumps of magnitude 1 and N0(ω) = 0. Nt counts the number of time points Tn that occur up to time t. Since (Nt), t ∈ R+, and (Tn), n ∈ N, obviously carry the same information, the associated counting process is sometimes also called a point process.

A slight generalization is the notion of a multivariate point process. Let (Tn), n ∈ N, be a point process as before and (Vn), n ∈ N, a sequence of random variables with values in a finite set {a1, ..., am}. Then the sequence of pairs (Tn, Vn), n ∈ N, is called a multivariate point process, and the associated m-variate counting process Nt = (Nt(1), ..., Nt(m)) is defined by

Nt(i) = Σ_{k≥1} I(Tk ≤ t) I(Vk = ai), i ∈ {1, ..., m}.

Let us now consider a univariate point process (Tn), n ∈ N, and its associated counting process (Nt), t ∈ R+, with ENt < ∞ for all t ∈ R+, on a filtered probability space (Ω, F, F, P). The traditional definition of the compensator of a point process is the following.

Definition 3.8. Let N be an integrable point process adapted to the filtration F. The unique F-predictable increasing process A = (At) such that

E ∫₀^∞ Cs dNs = E ∫₀^∞ Cs dAs (3.4)

is fulfilled for all nonnegative F-predictable processes C is called the compensator of N with respect to F.


The existence and uniqueness of the compensator can be proved by the so-called dual predictable projection. We refer to the work of Jacod [92]. The following martingale characterization of the compensator links the dynamical view of point processes with the semimartingale set-up (for a proof, see [103], p. 60).

Theorem 3.9. Let N be an integrable point process adapted to the filtration F. Then A is the F-compensator of N if and only if the difference process N − A is an F-martingale of M0.

Proof (Sketch). Let A be the compensator and C be the predictable process defined as the indicator of the set (t, s] × B, where s > t, B ∈ Ft. Then the definition of the compensator yields

E[IB(Ns − Nt)] = E[IB(As − At)], (3.5)

which gives

E[IB(Ns − As)] = E[IB(Nt − At)].

Hence, N − A is a martingale.

Conversely, if N − A is a martingale, then A is integrable and we obtain (3.5). In the general case, (3.4) can be established using the monotone class theorem. □

If we view the compensator as a random measure A(dt) on (0,∞), then we can interpret this measure in infinitesimal form by the heuristic expression

A(dt) = E[dNt|Ft−].

So, over an increment dt in time from t on, the increment A(dt) is what we can predict from the information gathered in [0, t) about the increase of Nt, and dMt = dNt − A(dt) is what remains unforeseen. Thus, M is sometimes called an innovation martingale and A(dt) the (dual) predictable projection.

In many cases (which are those we are mostly interested in) the F-compensator A of a counting process N can be represented as an integral of the form

At = ∫₀ᵗ λs ds

with some nonnegative (F-progressively measurable) stochastic process (λt), t ∈ R+, i.e., N has an SSM representation N = (λ, M).

Definition 3.10. Let N be an integrable counting process with an F-SSM representation

Nt = At + Mt = ∫₀ᵗ λs ds + Mt,

where (λt), t ∈ R+, is a nonnegative process. Then λ is called the F-intensity of N.


Remark 3.11. (i) To speak of the intensity is a little misleading (but harmless) because it is not unique. It can be shown (see Brémaud [50], p. 31) that if one can find a predictable intensity, then it is unique except on a set of measure 0 with respect to the product measure of P and Lebesgue measure. On the other hand, if there exists an intensity, then one can always find a predictable version. (ii) The heuristic interpretation

λt dt = E[dNt|Ft−]

is very similar to the ordinary failure or hazard rate of a random variable.

Theorem 3.9 and Definition 3.10 link the point process to the semimartingale representation, and using the definition of the compensator, it is possible to verify formally that a process λ is the F-intensity of the point process N. We have to show that

E ∫₀^∞ Cs dNs = E ∫₀^∞ Cs λs ds

for all nonnegative F-predictable processes C. Another way to verify that a process A is the compensator is to check the general conditions C1–C3 on page 60 or to use the conditions given by Aven [16].

To go one step further we now specialize to the internal filtration F^N = (F^N_t), F^N_t = σ(Ns, 0 ≤ s ≤ t), and determine the F^N-compensator of N in explicit form. The proof of the following theorem can be found in Jacod [92] and in Brémaud [50], p. 61. Regular conditional distributions are introduced in Appendix A.2, p. 252.

Theorem 3.12. Let N be an integrable point process and F^N its internal filtration. For each n let Gn(ω, B) be the regular conditional distribution of the interarrival time Un+1 = Tn+1 − Tn, n ∈ N0, T0 = 0, given the past F^N_{Tn} at the F^N-stopping time Tn: Gn(ω, B) = P(Un+1 ∈ B|F^N_{Tn})(ω).

(i) Then for Tn < t ≤ Tn+1 the compensator A is given by

At = A_{Tn} + ∫₀^{t−Tn} Gn(dx)/Gn([x, ∞)).

(ii) If the conditional distribution Gn admits a density gn for all n, then the F^N-intensity λ is given by

λt = Σ_{n≥0} [gn(t − Tn) / (1 − ∫₀^{t−Tn} gn(x) dx)] I(Tn < t ≤ Tn+1).

Note that expressions of the form "0/0" are always set equal to 0.

Example 3.13 (Renewal process). Let the interarrival times Un+1 = Tn+1 − Tn, n ∈ N0, T0 = 0, be i.i.d. random variables with common distribution function F, density f, and failure rate r: r(t) = f(t)/(1 − F(t)). Then it follows from Theorem 3.12 that with respect to the internal history F^N_t = σ(Ns, 0 ≤ s ≤ t) the intensity on {Tn < t ≤ Tn+1} is given by λt = r(t − Tn). This results in the SSM representation N = (λ, M),

Nt = ∫₀ᵗ λs ds + Mt,

with the intensity

λt = Σ_{n≥0} r(t − Tn) I(Tn < t ≤ Tn+1).

This corresponds to our intuition that the intensity at time t is the failure rate of the item last renewed before t, at age t − Tn.
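The representation can be illustrated numerically. The sketch below (an illustration of Example 3.13; the Weibull interarrival law with shape 2 and scale 1 is an assumed concrete choice, not from the text) uses the fact that on each interarrival interval the integrated intensity equals the cumulative hazard of the underlying lifetime distribution, and checks that E N_T = E A_T, i.e., that M = N − A has mean zero.

```python
import random

# Renewal process with Weibull(shape=2, scale=1) interarrivals; T is the horizon
shape, scale, T = 2.0, 1.0, 2.0
H = lambda u: (u / scale) ** shape     # cumulative hazard of the Weibull law

random.seed(3)
n = 20_000
mean_N = mean_A = 0.0
for _ in range(n):
    # simulate renewal epochs up to time T
    times, s = [], random.weibullvariate(scale, shape)
    while s <= T:
        times.append(s)
        s += random.weibullvariate(scale, shape)
    mean_N += len(times)
    # compensator A_T: one cumulative hazard per completed interval,
    # plus the partial last interval (T_N, T]
    last, A = 0.0, 0.0
    for epoch in times:
        A += H(epoch - last)
        last = epoch
    A += H(T - last)
    mean_A += A
print(mean_N / n, mean_A / n)          # close to each other: E N_T = E A_T
```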

Example 3.14 (Markov-modulated Poisson process). A Poisson process can be generalized by replacing the constant intensity with a randomly varying intensity, which takes one of the m values λi, 0 < λi < ∞, i ∈ S = {1, ..., m}, m ∈ N. The changes are driven by a homogeneous Markov chain Y = (Yt), t ∈ R+, with values in S and infinitesimal parameters qi, the rate to leave state i, and qij, the rate to reach state j from state i:

qi = lim_{h→0+} (1/h) P(Yh ≠ i|Y0 = i),
qij = lim_{h→0+} (1/h) P(Yh = j|Y0 = i), i, j ∈ S, i ≠ j,
qii = −qi = −Σ_{j≠i} qij.

The point process (Tn) corresponds to the counting process N = (Nt), t ∈ R+, with

Nt = Σ_{n=1}^∞ I(Tn ≤ t).

It is assumed that N has a stochastic intensity λ_{Yt} with respect to the filtration F generated by N and Y:

Ft = σ(Ns, Ys, 0 ≤ s ≤ t).

Then N is called a Markov-modulated Poisson process with SSM representation

Nt = ∫₀ᵗ λ_{Ys} ds + Mt.

Roughly speaking, in state i the point process is Poisson with rate λi. Note, however, that the ordinary failure rate of T1 is not constant. If we cannot observe the Markov chain Y but only the point process (Tn), then we look for an intensity with respect to the subfiltration A = (At), t ∈ R+, At = σ(Ns, 0 ≤ s ≤ t). For this we have to estimate the current state of the Markov chain, involving the infinitesimal parameters qi, qij; we refer to Sects. 3.2.4 and 5.4.2.
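A hedged simulation sketch (an illustrative two-state chain with assumed rates, not from the text): given the full filtration F, the compensator is At = ∫₀ᵗ λ_{Ys} ds, so along a long simulated path N_T and A_T should nearly coincide.

```python
import random

# Two-state Markov-modulated Poisson process (hypothetical parameters)
lam = [1.0, 5.0]          # event rate per state of Y
q = [2.0, 2.0]            # q_i: rate of leaving state i (chain alternates states)
T = 1000.0
random.seed(4)

t, state, N, A = 0.0, 0, 0, 0.0
while t < T:
    hold = random.expovariate(q[state])      # sojourn time in current state
    seg = min(hold, T - t)
    # events during this segment form a Poisson(lam[state]) stream;
    # by memorylessness the event clock may be restarted at each switch
    u = random.expovariate(lam[state])
    while u < seg:
        N += 1
        u += random.expovariate(lam[state])
    A += lam[state] * seg                    # increment of the compensator
    t += seg
    state = 1 - state                        # jump to the other state
print(N, A)   # M_T = N_T - A_T fluctuates around 0
```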


Markov Processes

The question whether Markov processes admit semimartingale representations can generally be answered in the affirmative: (most) Markov processes and bounded functions of such processes have an SSM representation.

Let (Xt), t ∈ R+, be a right-continuous homogeneous Markov process on (Ω, F, P^x) with respect to the (internal) filtration Ft = σ(Xs, 0 ≤ s ≤ t), with values in a measurable space (S, B(S)). For applications we will often confine ourselves to S = R with its Borel σ-field B. Here P^x, x ∈ S, denotes the probability measure on the set of paths that start in X0 = x: P^x(X0 = x) = 1.

Let B denote the set of bounded, measurable functions on S with values in R and let E^x denote expectation with respect to P^x. Then the infinitesimal generator A is defined as follows: If for f ∈ B the limit

lim_{h→0+} (1/h)(E^x f(Xh) − f(x)) = g(x)

exists for all x ∈ S with g ∈ B, then we set Af = g and say that f belongs to the domain D(A) of the infinitesimal generator A. It is known that if f ∈ D(A), then

M^f_t = f(Xt) − f(X0) − ∫₀ᵗ Af(Xs) ds

defines a martingale (cf., e.g., [101], p. 328). This shows that a function Zt = f(Xt) of a homogeneous Markov process has an SSM representation if f ∈ D(A).

Example 3.15 (Markov pure jump process). A homogeneous Markov process X = (Xt) with right-continuous paths, which are constant between isolated jumps, is called a Markov pure jump process. As before, P^x denotes the probability law conditioned on X0 = x, and τx = inf{t ∈ R+ : Xt ≠ x} the exit time of state x. It is known that τx follows an Exp(λ(x)) distribution if 0 < λ(x) < ∞ and that P^x(τx = ∞) = 1 if λ(x) = 0, for some suitable mapping λ on the set of possible outcomes of X0 with values in R+. Let v(x, ·) be the jump law or transition probability at x, defined by v(x, B) = P^x(X_{τx} ∈ B) for λ(x) > 0. If f belongs to the domain D(A) of the infinitesimal generator, then we obtain (cf. Métivier [122])

Af(x) = λ(x) ∫ (f(y) − f(x)) v(x, dy). (3.6)

Let us now consider some particular cases. (i) Poisson process N = (Nt) with parameter λ > 0. In this case we have jumps of height 1, i.e., v(x, {x + 1}) = 1. For f(x) = x we get Af(x) ≡ λ. This again shows that Nt − λt is a martingale. If we take f(x) = x², then we obtain Af(x) = λ(2x + 1), and for N² we have the SSM representation

N²t = f(Nt) = ∫₀ᵗ λ(2Ns + 1) ds + M^f_t.

(ii) Compound Poisson process X = (Xt). Let N be a Poisson process with an intensity λ : R → R+, 0 < λ(x) < ∞, and (Yn), n ∈ N, a sequence of i.i.d. random variables with finite mean μ. Then

Xt = Σ_{n=1}^{Nt} Yn

defines a Markov pure jump process with v(x, B) = P^x(X_{τx} ∈ B) = P(Y1 ∈ B − x). By formula (3.6) for the infinitesimal generator we get the SSM representation

Xt = ∫₀ᵗ λ(Xs) μ ds + Mt.
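Both SSM representations in this example can be checked by simulation. The sketch below is illustrative and not from the text: the parameters are arbitrary, and the compound Poisson part assumes a constant rate λ for simplicity. Taking expectations in the drift λ(2Ns + 1) gives E Nt² = λt + (λt)², and the drift λμ gives E Xt = λμt.

```python
import math, random

random.seed(5)

def poisson(mu):
    # Knuth's multiplication method; adequate for small mu
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

lam, t, n = 2.0, 1.5, 200_000

# (i) E N_t^2 = lam*t + (lam*t)^2, from the drift integral of lam*(2N_s + 1)
m2 = sum(poisson(lam * t) ** 2 for _ in range(n)) / n
pred2 = lam * t + (lam * t) ** 2

# (ii) compound Poisson with constant rate lam and jump mean mu_Y: E X_t = lam*mu_Y*t
mu_Y = 0.5
tot = 0.0
for _ in range(n // 2):
    tot += sum(random.expovariate(1.0 / mu_Y) for _ in range(poisson(lam * t)))
mean_X = tot / (n // 2)

print(m2, pred2)                 # ≈ 12 vs 12
print(mean_X, lam * mu_Y * t)    # ≈ 1.5 vs 1.5
```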

We now return to the general theory of Markov processes. The so-called Dynkin formula states that for a stopping time τ we have

E^x g(Xτ) = g(x) + E^x ∫₀^τ Ag(Xs) ds

if E^x τ < ∞ and g ∈ D(A) (see Dynkin [66], p. 133). This formula can now be extended to the more general case of SSMs. If Z = (f, M) is an F-SSM with (P-a.s.) bounded Z and f, then for all F-stopping times τ with Eτ < ∞ we obtain

EZτ = EZ0 + E ∫₀^τ fs ds.

Here EMτ = 0 is a consequence of the Optional Sampling Theorem (see Appendix A.5, Theorem A.34, p. 262). The following example shows how the Dynkin formula can be applied to determine the expectation of a stopping time.

Example 3.16. Let B = (Bt) be a k-dimensional Brownian motion with initial point B0 = x and g a bounded, twice continuously differentiable function on R^k with bounded derivatives. Then we obtain (cf. Métivier [122], p. 201) the SSM representation for g(Bt):

g(Bt) = g(x) + (1/2) ∫₀ᵗ Σ_{i=1}^k ∂²g/∂x_i² (Bs) ds + M^g_t.

For some R > 0 and |x| < R we consider the stopping time σ = inf{t ∈ R+ : |Bt| ≥ R} with respect to the internal filtration, which is the first exit time from the ball K_R = {y ∈ R^k : |y| < R}. By means of the Dynkin formula we can determine the expectation E^x σ in the following way. Let us assume E^x σ < ∞ and choose g(x) = |x|². Dynkin's formula then yields

E^x g(Bσ) = R² = |x|² + (1/2) E^x ∫₀^σ 2k ds = |x|² + k E^x σ,

which is tantamount to E^x σ = k⁻¹(R² − |x|²). To show E^x σ < ∞ we may replace σ by τn = n ∧ σ in the above formula: E^x τn ≤ k⁻¹(R² − |x|²), and together with the monotone convergence theorem the result is established.
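As a closing numerical illustration (not from the text; the dimension, radius, starting point, and step size are assumptions of the sketch), the identity E^x σ = k⁻¹(R² − |x|²) can be checked by an Euler discretization of Brownian motion. Discrete monitoring misses excursions between grid points, so the estimate carries a small upward bias of order √dt.

```python
import math, random

# Exit time of the ball of radius R for k-dimensional Brownian motion
k, R, dt, n = 2, 1.0, 5e-4, 1000
x = [0.3, 0.0]                          # starting point with |x| < R
random.seed(7)

sd = math.sqrt(dt)                      # increment std per coordinate
total = 0.0
for _ in range(n):
    pos, tau = list(x), 0.0
    while sum(c * c for c in pos) < R * R:
        pos = [c + random.gauss(0.0, sd) for c in pos]
        tau += dt
    total += tau

exact = (R * R - sum(c * c for c in x)) / k    # Dynkin: (R^2 - |x|^2)/k = 0.455
print(total / n, exact)
```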

3.1.2 Transformations of SSMs

Next we want to investigate under which conditions certain transformations of SSMs preserve the SSM property.

Random Stopping

One example is the stopping of a process Z, i.e., the transformation from Z = (Zt) to the process Z^ζ = (Zt∧ζ), where ζ is some stopping time. If Z = (f, M) is an F-SSM and ζ is an F-stopping time, then Z^ζ is again an F-SSM with representation

Z^ζ_t = Z0 + ∫₀ᵗ I(ζ > s) fs ds + Mt∧ζ, t ∈ R+.

This result is an immediate consequence of the fact that a stopped martingale is a martingale.

A Product Rule

A second example of a transformation is the product of two SSMs. To see under which conditions such a product of two SSMs again forms an SSM, some further notation and definitions are required, which are presented in Appendix A. Here we only give the general result. For the conditions and a detailed proof we refer to Appendix A.6, Theorem A.51, p. 269.

Let Z = (f, M) and Y = (g, N) be F-SSMs with M, N ∈ M₀² and MN ∈ M₀. Then, under suitable integrability conditions, ZY is an F-SSM with representation

ZtYt = Z0Y0 + ∫₀ᵗ (Ysfs + Zsgs) ds + Rt,

where R = (Rt) is a martingale in M₀.

Remark 3.17. (i) If Z = (f, M) and Y = (g, N) are two SSMs and f and g are considered as "derivatives," then Y f + Zg is the "derivative" of the product ZY, in accordance with the ordinary product rule. (ii) Martingales M, N for which MN is a martingale are called orthogonal. This property can be interpreted in the sense that the increments of the martingales are "conditionally uncorrelated," i.e.,

E[(Mt − Ms)(Nt − Ns)|Fs] = 0

for all 0 ≤ s ≤ t.


A Change of Filtration

Another transformation is a certain change of the filtration, which allows the observation of a stochastic process on different information levels.

Definition 3.18. Let A = (At), t ∈ R+, and F = (Ft), t ∈ R+, be two filtrations on the same probability space (Ω, F, P). Then A is called a subfiltration of F if At ⊂ Ft for all t ∈ R+.

In this case F can be viewed as the complete information filtration and A as the actual observation filtration on a lower level. If Z = (f, M) is an SSM with respect to the filtration F, then the projection onto the observation filtration A is given by the conditional expectation Ẑ with Ẑt = E[Zt|At]. The following projection theorem states that Ẑ is an A-semimartingale. Different versions of this theorem are proved in the literature. The version presented here for SSMs is based on [50], pp. 87, 108, [100], p. 202, and [161].

Theorem 3.19 (Projection Theorem). Let Z = (f, M) be an F-SSM and A a subfiltration of F. Then Ẑ with

Ẑt = Ẑ0 + ∫₀ᵗ f̂s ds + M̂t (3.7)

is an A-SSM, where

(i) Ẑ is A-adapted with a.s. right-continuous paths with left-hand limits and Ẑt = E[Zt|At] for all t ∈ R+;
(ii) f̂ is A-progressively measurable with f̂t = E[ft|At] for almost all t ∈ R+ (Lebesgue measure);
(iii) M̂ is an A-martingale.

If in addition Z0, ∫₀^∞ |fs| ds ∈ L² and M ∈ M₀², then Ẑ0, ∫₀^∞ |f̂s| ds ∈ L² and M̂ ∈ M₀².

Unfortunately, monotonicity properties of Z and f do not in general extend to Z̄ and f̄, respectively. So if, for example, f has monotone paths, this need not be true for the corresponding process f̄. Whether f̄ has monotone paths depends on the path properties of f as well as on the subfiltration A. If f is already adapted to the subfiltration A, then it is obvious that f̄ = f. In this case projecting onto the subfiltration only filters out information that does not affect the drift term.

The Projection Theorem will mainly be applied to solve optimal stopping problems on different information levels in the following manner. Let Z = (f, M) be an F-SSM and let Z̄ = (f̄, M̄) be the corresponding A-SSM with respect to a subfiltration A of F. To determine the maximum of EZ̄τ in the set CA of A-stopping times τ, i.e., to solve the optimal stopping problem on the lower A-information level, we can use the rule of successive conditioning for conditional expectations (cf. Appendix A.2, p. 251) to obtain


sup{EZ̄τ : τ ∈ CA} = sup{EZτ : τ ∈ CA}.

In Sect. 5.2.1, Theorem 5.9, p. 181, conditions are given under which the stopping problem for an SSM can be solved. If these conditions apply to Z̄, then we can solve this optimal stopping problem on the A-level according to Theorem 5.9. If the stopping problem can be solved on the F-level, then we obtain a bound for the stopping value on the A-level in view of the inequality

sup{EZτ : τ ∈ CA} ≤ sup{EZτ : τ ∈ CF}.

3.2 A General Lifetime Model

First let us consider the simple indicator process Zₜ = I(T ≤ t), where T is the lifetime random variable defined on the basic probability space. Obviously Z is the counting process corresponding to the simple point process (Tₙ) with T = T₁ and Tₙ = ∞ for n ≥ 2. The paths of this indicator process Z are constant, except for one jump from 0 to 1 at T. Let us assume that this indicator process has a smooth F-semimartingale representation with an F-martingale M ∈ M₀ and a nonnegative stochastic process λ = (λₜ):

I(T ≤ t) = ∫₀ᵗ I(T > s) λₛ ds + Mₜ,   t ∈ R+.   (3.8)

The general lifetime model is then defined by the filtration F and the corresponding F-SSM representation of the indicator process.

Definition 3.20. The process λ = (λₜ), t ∈ R+, in the SSM representation (3.8) is called the F-failure rate or the F-hazard rate process, and the compensator Λₜ = ∫₀ᵗ I(T > s) λₛ ds is called the F-hazard process.

We drop F when it is clear from the context. As was mentioned before (cf. Remark 3.11 on p. 64), the intensity of the indicator (point) process is not unique. If one F-failure rate λ is known, we may pass to a left-continuous version (λₜ₋) to obtain a predictable, unique intensity:

I(T ≤ t) = ∫₀ᵗ I(T ≥ s) λₛ₋ ds + Mₜ.

Before investigating under which conditions such a representation exists, some examples are given.

Example 3.21. If the failure rate process λ is deterministic, forming expectations leads to the integral equation

F(t) = P(T ≤ t) = EI(T ≤ t) = ∫₀ᵗ P(T > s) λₛ ds = ∫₀ᵗ (1 − F(s)) λₛ ds.


The unique solution

F̄(t) = 1 − F(t) = exp{−∫₀ᵗ λₛ ds}   (3.9)

is just the well-known relation between the standard failure rate and the distribution function. This shows that if the hazard rate process λ is deterministic, it coincides with the ordinary failure rate.

Example 3.22. In continuation of Example 1.1, p. 2, we consider a three-component system with one component in series with a two-component parallel system. It is assumed that the component lifetimes T₁, T₂, T₃ are i.i.d. exponentially distributed with parameter α > 0. What is the failure rate process corresponding to the system lifetime T = T₁ ∧ (T₂ ∨ T₃)? This depends on the information level, i.e., on the filtration F.

• Fₜ = σ(Xₛ, 0 ≤ s ≤ t), where Xₛ = (Xₛ(1), Xₛ(2), Xₛ(3)) and Xₛ(i) = I(Tᵢ > s), i = 1, 2, 3. Observing on the component level means that Fₜ is generated by the indicator processes of the component lifetimes up to time t. It can be shown (by means of the results of the next section) that the failure rate process of the system lifetime is given by λₜ = α{1 + (1 − Xₜ(2)) + (1 − Xₜ(3))} on {T > t}. As long as all components work, the rate is α, due to component 1. When one of the two parallel components 2 or 3 fails first, the rate switches to 2α.

• Fₜ = σ(I(T ≤ s), 0 ≤ s ≤ t). If only the system lifetime can be observed, the failure rate process reduces to the ordinary deterministic failure rate

λₜ = α(1 + 2(1 − e^{−αt})/(2 − e^{−αt})).
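As a quick numerical sanity check of the second bullet (our own illustration, not part of the original text): for i.i.d. Exp(α) component lifetimes the system survival function is P(T > t) = P(T₁ > t)P(T₂ ∨ T₃ > t) = 2e^{−2αt} − e^{−3αt}, so the ordinary failure rate −(d/dt) log P(T > t) must agree with the displayed formula. A small Python sketch, with arbitrarily chosen values of α and t:

```python
import math

def survival(t, a):
    # P(T > t) for T = T1 ∧ (T2 ∨ T3) with i.i.d. Exp(a) lifetimes:
    # e^{-at} * (2e^{-at} - e^{-2at}) = 2e^{-2at} - e^{-3at}
    return 2 * math.exp(-2 * a * t) - math.exp(-3 * a * t)

def hazard_closed_form(t, a):
    # the deterministic failure rate stated in Example 3.22
    return a * (1 + 2 * (1 - math.exp(-a * t)) / (2 - math.exp(-a * t)))

def hazard_numeric(t, a, h=1e-6):
    # failure rate as -(d/dt) log P(T > t), via a central difference
    return -(math.log(survival(t + h, a)) - math.log(survival(t - h, a))) / (2 * h)

alpha, t = 1.3, 0.7   # arbitrary test values
print(abs(hazard_closed_form(t, alpha) - hazard_numeric(t, alpha)))  # negligibly small
```

The two expressions agree to numerical precision, which is exactly the content of the second bullet.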

Example 3.23. Consider the damage threshold model in which the deterioration is described by the Wiener process Xₜ = σBₜ + μt, where B is standard Brownian motion and σ, μ > 0 are constants. In this case, whether and in what way the lifetime T = inf{t ∈ R+ : Xₜ ≥ K}, K ∈ R+, can be characterized by a failure rate process also depends on the available information.

• Fₜ = σ(Bₛ, 0 ≤ s ≤ t). Observing the actual state of the system proves to be too informative to be described by a failure rate process. The martingale part is identically 0; the drift part, or the predictable compensator, is the indicator process I(T ≤ t) itself. No semimartingale representation (3.8) exists, because the lifetime is predictable, as we will see in the following section.

• Fₜ = σ(I(T ≤ s), 0 ≤ s ≤ t). If only the system lifetime can be observed, conditions change completely. A representation (3.8) exists. The first hitting time T of the barrier K is known to follow a so-called inverse Gaussian distribution (cf. [133], p. 26). The failure rate process is then the ordinary failure rate corresponding to the density

f(t) = (K/√(2πσ²t³)) exp{−(K − μt)²/(2σ²t)},   t > 0.
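Since μ > 0, the drift eventually carries X across the barrier, so the density above integrates to 1 and the hitting time has mean K/μ (standard facts about the inverse Gaussian distribution). A small numerical sketch, with illustrative values K = μ = σ = 1 of our own choosing:

```python
import math

def ig_density(t, K, mu, sigma):
    # density of T = inf{t : sigma*B_t + mu*t >= K}, the inverse Gaussian law
    return (K / math.sqrt(2 * math.pi * sigma**2 * t**3)
            * math.exp(-(K - mu * t)**2 / (2 * sigma**2 * t)))

K, mu, sigma = 1.0, 1.0, 1.0          # illustrative parameters
dt = 1e-3
grid = [(i + 0.5) * dt for i in range(int(30 / dt))]   # midpoint rule on (0, 30]
total = sum(ig_density(t, K, mu, sigma) for t in grid) * dt
mean = sum(t * ig_density(t, K, mu, sigma) for t in grid) * dt
print(total, mean)   # total mass close to 1, mean close to K/mu = 1
```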


3.2.1 Existence of Failure Rate Processes

It is possible to formulate rather general conditions on Z to ensure a semimartingale representation (3.8), as shown by Theorem 3.6, p. 60. But in reliability models we often have more specific processes Vₜ = I(T ≤ t) for which a representation (3.8) has to be found. Whether such a representation exists should depend on the random variable T (or on the probability measure P) and on the filtration F. If T is a stopping time with respect to the filtration F, then a representation (3.8) only exists for stopping times that are totally inaccessible in the following sense:

Definition 3.24. An F-stopping time τ is called

• predictable if an increasing sequence (τₙ), n ∈ N, of F-stopping times τₙ < τ exists such that lim_{n→∞} τₙ = τ;

• totally inaccessible if P(τ = σ < ∞) = 0 for all predictable F-stopping times σ.

Roughly speaking, a stopping time τ is predictable if it is announced by a sequence of (observable) stopping times, and totally inaccessible if it occurs “suddenly,” without announcement. For example, a random variable T with an absolutely continuous distribution has the representation

Vₜ = I(T ≤ t) = ∫₀ᵗ I(T > s) λ(s) ds + Mₜ,   t ∈ R+,

with respect to the filtration Fᵀ = (Fₜ) generated by T, Fₜ = σ(T ∧ t), where λ is the ordinary failure rate.

In general it can be shown that, if V has an SSM representation (3.8), then T is a totally inaccessible stopping time. On the other hand, if T is totally inaccessible, then there is a (unique) decomposition V = Λ + M in which the process Λ is (P-a.s.) continuous. We state this result without proof (cf. [62], p. 137 and [122], p. 113).

Lemma 3.25. Let (Ω, F, F, P) be a filtered probability space and T an F-stopping time.

(i) If the process V = (Vₜ), Vₜ = I(T ≤ t), has an SSM representation

Vₜ = ∫₀ᵗ I(T > s) λₛ ds + Mₜ,   t ∈ R+,

then T is a totally inaccessible stopping time and the martingale M is bounded in L², M ∈ M₀².

(ii) If T is a totally inaccessible stopping time, then the process V = (Vₜ), Vₜ = I(T ≤ t), has a unique (P-a.s.) decomposition V = Λ + M, where M is a uniformly integrable martingale and Λ is continuous (P-a.s., the predictable compensator).


“Most” continuous functions are absolutely continuous (except for some pathological special cases). Therefore, we can conclude from Lemma 3.25 that the class of lifetime models with a compensator Λ of the form Λₜ = ∫₀ᵗ I(T > s) λₛ ds is rich enough to include models for most real-life systems in continuous time. In view of Example 3.23 the condition that V admits an SSM representation seems a natural restriction, because if the lifetime could be predicted by an announcing sequence of stopping times, maintenance actions would make no sense; they could be carried out “just” before a failure. In Example 3.23, τₙ = inf{t ∈ R+ : Xₜ = K − 1/n} is such an announcing sequence with respect to Fₜ = σ(Bₛ, 0 ≤ s ≤ t) (compare also Fig. 1.1, p. 6). In addition, Example 3.23 shows that one and the same random variable T can be predictable or totally inaccessible depending on the corresponding information filtration.

How can the failure rate process λ be ascertained or identified for a given information level F? In general, we can determine λ under the conditions of Theorem 3.6 as the limit

I(T > t) λₜ = lim_{h→0+} (1/h) P(t < T ≤ t + h | Fₜ)

in the sense of almost sure convergence. Another way to verify whether a given process λ is the failure rate is to show that the corresponding hazard process defines the compensator of I(T ≤ t). In some special cases λ can be represented in a more explicit form, for example for complex systems. This will be carried out in some detail in the next section.

3.2.2 Failure Rate Processes in Complex Systems

In the following we want to derive the hazard rate process for the lifetime T of a complex system under fairly general conditions. We make no independence assumption concerning the component lifetimes, and we allow two or more components to fail at the same time with positive probability.

Let Tᵢ, i = 1, ..., n, be n positive random variables that describe the component lifetimes of a monotone complex system with structure function Φ. Our aim is to derive the failure rate process for the lifetime

T = inf{t ∈ R+ : Φ(Xₜ) = 0}

with respect to the filtration F given by Fₜ = σ(Xₛ, 0 ≤ s ≤ t), where as before Xₛ = (Xₛ(1), ..., Xₛ(n)) and Xₛ(i) = I(Tᵢ > s), i = 1, ..., n. We call this filtration the complete information filtration or the filtration on the component level.

For a specific outcome ω let m(ω) be the number of different failure time points 0 < T₍₁₎ < T₍₂₎ < ··· < T₍ₘ₎ and J₍ₖ₎ = {i : Tᵢ(ω) = T₍ₖ₎(ω)} the set of components that fail at T₍ₖ₎. For completeness we define

T₍ᵣ₎ = ∞, J₍ᵣ₎ = ∅ for r ≥ m + 1.


Thus, the sequence (T₍ₖ₎, J₍ₖ₎), k ∈ N, forms a multivariate point process. Now we fix a certain failure pattern J ⊂ {1, ..., n} and consider the time T_J of occurrence of this pattern, i.e.,

T_J = T₍ₖ₎ if J₍ₖ₎ = J for some k, and T_J = ∞ if J₍ₖ₎ ≠ J for all k.

The corresponding counting process Vₜ(J) = I(T_J ≤ t) has a compensator Aₜ(J) with respect to F, which is assumed to be absolutely continuous, such that λₜ(J) is the F-failure rate process:

Vₜ(J) = ∫₀ᵗ I(T_J > s) λₛ(J) ds + Mₜ(J).

In the case P(T_J = ∞) = 1, we set λₜ(J) = 0 for t ∈ R+.

Example 3.26. If we assume that the component lifetimes are independent random variables, the only interesting (nontrivial) failure patterns are those consisting of one single component, J = {j}, j ∈ {1, ..., n}. In this case the F-failure rate processes λₜ({j}) are merely the ordinary failure rates λₜ(j) corresponding to Tⱼ.

Example 3.27. We now consider the special case n = 2 in which (T₁, T₂) follows the bivariate exponential distribution of Marshall and Olkin (cf. [121]) with parameters β₁, β₂ > 0 and β₁₂ ≥ 0. A plausible interpretation of this distribution is as follows. Three independent exponential random variables Z₁, Z₂, Z₁₂ with corresponding parameters β₁, β₂, β₁₂ describe the time points when a shock causes failure of component 1 or 2 or of all intact components at the same time, respectively. Then the component lifetimes are given by T₁ = Z₁ ∧ Z₁₂ and T₂ = Z₂ ∧ Z₁₂, and the joint survival probability is seen to be

P(T₁ > t, T₂ > s) = exp{−β₁t − β₂s − β₁₂(t ∨ s)},   s, t ∈ R+.

The three different patterns to distinguish are {1}, {2}, and {1, 2}. Note that T_{1} ≠ T₁, as we have, for example, T_{1} = ∞ on {T₁ = T₂}, i.e., on {Z₁₂ < Z₁ ∧ Z₂}. Calculations then yield

λₜ({1}) = β₁ on {T₁ > t, T₂ > t},  λₜ({1}) = β₁ + β₁₂ on {T₁ > t, T₂ ≤ t},  λₜ({1}) = 0 elsewhere;

λₜ({2}) is given by obvious index interchanges, and

λₜ({1, 2}) = β₁₂ on {T₁ > t, T₂ > t},  λₜ({1, 2}) = 0 elsewhere.

Now we have the F-failure rate processes λ(J) at hand for each pattern J. We are interested in deriving the F-failure rate process λ of T. The next theorem shows how this process λ is composed of the single processes λ(J) on the component observation level F. Here we remind the reader of some notation introduced in Chap. 2. For x ∈ Rⁿ and J = {j₁, ..., jᵣ} ⊂ {1, ..., n}, the vectors (1_J, x) and (0_J, x) denote those n-dimensional state vectors in which the components x_{j₁}, ..., x_{jᵣ} of x are replaced by 1s and 0s, respectively. Let D(t) be the set of components that have failed up to time t, formally

D(t) = J₍₁₎ ∪ ··· ∪ J₍ₖ₎ if T₍ₖ₎ ≤ t < T₍ₖ₊₁₎, and D(t) = ∅ if t < T₍₁₎.

Then we define a pattern J to be critical at time t ≥ 0 if

I(J ∩ D(t) = ∅)(Φ(1_J, Xₜ) − Φ(0_J, Xₜ)) = 1

and denote by

Γ_Φ(t) = {J ⊂ {1, ..., n} : I(J ∩ D(t) = ∅)(Φ(1_J, Xₜ) − Φ(0_J, Xₜ)) = 1}

the collection of all such patterns critical at t.

Theorem 3.28. Let (λₜ(J)) be the F-failure rate process corresponding to T_J, J ⊂ {1, ..., n}. Then for all t ∈ R+ on {T > t}:

λₜ = Σ_{J ⊂ {1,...,n}} I(J ∩ D(t) = ∅)(Φ(1_J, Xₜ) − Φ(0_J, Xₜ)) λₜ(J) = Σ_{J ∈ Γ_Φ(t)} λₜ(J).

Proof. By Definition 3.8, p. 62, a predictable increasing process (Aₜ) is the compensator of the counting process (Vₜ), Vₜ = I(T ≤ t), if

E ∫₀^∞ Cₛ dVₛ = E ∫₀^∞ Cₛ dAₛ

holds true for every nonnegative F-predictable process C. Thus, we have to show that

E ∫₀^∞ Cₛ dVₛ = E ∫₀^∞ Cₛ I(T > s) Σ_{J ∈ Γ_Φ(s)} λₛ(J) ds   (3.10)

for all nonnegative predictable processes C. Since (λₜ(J)) are the F-failure rate processes corresponding to T_J, we have for all J ⊂ {1, ..., n}

E ∫₀^∞ Cₛ(J) dVₛ(J) = E ∫₀^∞ Cₛ(J) I(T_J > s) λₛ(J) ds

and therefore

E ∫₀^∞ Σ_{J ⊂ {1,...,n}} Cₛ(J) dVₛ(J) = E ∫₀^∞ Σ_{J ⊂ {1,...,n}} Cₛ(J) I(T_J > s) λₛ(J) ds   (3.11)


holds true for all nonnegative predictable processes (Cₜ(J)). If for some nonnegative predictable process C we especially choose

Cₜ(J) = Cₜ fₜ₋,

where fₜ₋ is the left-continuous version of fₜ = I(J ∈ Γ_Φ(t)), we see that (3.11) reduces to (3.10), noting that under the integral sign we can replace fₜ₋ by fₜ, and the proof is complete. □

Remark 3.29. (i) The proof follows the lines of Arjas (Theorem 4.1 in [6]) except for the definition of the set Γ_Φ(t) of the critical failure patterns at time t. In [6] this set includes on {T > t} all cut sets, whereas in our definition those cut sets J are excluded for which at time t “it is known” that T_J = ∞. However, this deviation is harmless because in [6] only extra zeros are added. (ii) We now have a tool that allows us to determine the failure rate process corresponding to the lifetime T of a complex system in an easy way: add at time t the failure rates of those patterns that are critical at t.

As an immediate consequence we obtain the following corollary.

Corollary 3.30. Let Tᵢ, i = 1, ..., n, be independent random variables that have absolutely continuous distributions with ordinary failure rates λₜ(i). Then the F-failure rate processes λ({i}) are deterministic, λₜ({i}) = λₜ(i), and on {T > t}

λₜ = Σᵢ₌₁ⁿ (Φ(1ᵢ, Xₜ) − Φ(0ᵢ, Xₜ)) λₜ(i) = Σ_{{i} ∈ Γ_Φ(t)} λₜ(i),   t ∈ R+.   (3.12)

In the case of independent component lifetimes we only have to add the ordinary failure rates of those components critical at t to obtain the F-failure rate of the system at time t. If we drop the independence assumption, it is not enough to require that P(Tᵢ = Tⱼ) = 0 for i ≠ j, as the following example shows.

Example 3.31. Let U₁, U₂ be i.i.d. random variables with an Exp(β) distribution and let T₁ = U₁, T₂ = U₁ + U₂ be the component lifetimes of a two-component series system. Then we obviously have P(T₁ = T₂) = 0, but the F-failure rate of T_{2} = T₂ on {T₂ > t},

λₜ({2}) = β I(T₁ ≤ t),

is not deterministic. The system F-failure rate is seen to be

I(T > t)λₜ = I(T₁ > t)β.

To see how formula (3.12) can be used we resume Example 3.22, p. 71.


Example 3.32. Again we consider the three-component system with one component in series with a two-component parallel system, such that the lifetime of the system is given by T = T₁ ∧ (T₂ ∨ T₃). It is assumed that the component lifetimes T₁, T₂, T₃ are i.i.d. exponentially distributed with parameter α > 0. If at time t all three components work, then only component 1 belongs to Γ_Φ(t) and I(T > t)λₜ = α I(T₁ > t) on {T₂ > t, T₃ > t}. If one of the components 2 or 3 has failed first before time t, say component 2, then Γ_Φ(t) = {{1}, {3}} and I(T > t)λₜ = α(I(T₁ > t) + I(T₃ > t)) on {T₂ ≤ t}. Combining these two formulas yields the failure rate process on {T > t},

λₜ = α(1 + I(T₂ ≤ t) + I(T₃ ≤ t)),

given in Example 3.22.
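In the independent-component case, formula (3.12) translates directly into code: at time t, sum the ordinary failure rates of the working components whose failure would bring the system down. A sketch (the function names are ours) that recomputes the two cases of Example 3.32:

```python
def system_failure_rate(phi, states, rates):
    """F-failure rate of a monotone system with independent components,
    following formula (3.12): sum rates[i] over working components i that
    are critical, i.e. phi(1_i, x) - phi(0_i, x) = 1 at the current state x."""
    total = 0.0
    for i, up in enumerate(states):
        if not up:
            continue                       # failed components cannot be critical
        x1, x0 = list(states), list(states)
        x1[i], x0[i] = 1, 0
        if phi(x1) - phi(x0) == 1:
            total += rates[i]
    return total

# Example 3.32: component 1 in series with the parallel pair {2, 3}
phi = lambda x: x[0] * max(x[1], x[2])
alpha = 2.0
print(system_failure_rate(phi, [1, 1, 1], [alpha] * 3))   # alpha: only comp. 1 critical
print(system_failure_rate(phi, [1, 0, 1], [alpha] * 3))   # 2*alpha: comps. 1 and 3 critical
```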

Example 3.33. We now go back to the pair (T₁, T₂) of random variables following the bivariate exponential distribution of Marshall and Olkin with parameters β₁, β₂ > 0 and β₁₂ ≥ 0, and consider a parallel system with lifetime T = T₁ ∨ T₂. Then on {T > t} the critical patterns are

Γ_Φ(t) = {{1, 2}} on {T₁ > t, T₂ > t},  Γ_Φ(t) = {{1}} on {T₁ > t, T₂ ≤ t},  Γ_Φ(t) = {{2}} on {T₁ ≤ t, T₂ > t}.

Using the results of Example 3.27, p. 74, the F-failure rate process of the system lifetime is seen to be

I(T > t)λₜ = β₁₂ I(T₁ > t, T₂ > t) + (β₁ + β₁₂) I(T₁ > t, T₂ ≤ t) + (β₂ + β₁₂) I(T₁ ≤ t, T₂ > t),

which can be reduced to

I(T > t)λₜ = β₁₂ I(T > t) + β₁ I(T₁ > t, T₂ ≤ t) + β₂ I(T₁ ≤ t, T₂ > t).

3.2.3 Monotone Failure Rate Processes

We have investigated under which conditions failure rate processes exist and how they can be determined explicitly for complex systems. In reliability it plays an important role whether failure rates are monotone increasing or decreasing. So it is quite natural to extend such properties to F-failure rates in the following way.

Definition 3.34. Let an F-SSM representation (3.8) hold true for the positive random variable T with failure rate process λ. Then λ is called F-increasing (F-IFR, increasing failure rate) or F-decreasing (F-DFR, decreasing failure rate) if λ has P-a.s. nondecreasing or nonincreasing paths, respectively, for t ∈ [0, T).


Remark 3.35. (i) Clearly, monotonicity properties of λ are only of importance on the random interval [0, T). On [T, ∞) we can specify λ arbitrarily. (ii) In the case of complex systems the above definition reflects both the information level F and the structure function Φ. An alternative definition, derived from notions of multivariate aging, is given by Arjas [5]; see also Shaked and Shanthikumar [140].

In the case of a complex system with independent component lifetimes, the following closure result can be established.

Proposition 3.36. Assume that in a monotone system the component lifetimes Tᵢ, i = 1, ..., n, are independent random variables with absolutely continuous distributions and ordinary nondecreasing failure rates λₜ(i), and let F be the filtration on the component level. Then the F-failure rate process λ corresponding to the system lifetime T is F-IFR.

Proof. Under the assumptions of the proposition no patterns with two or more components are critical. Since the system is monotone, the number of elements in Γ_Φ(t) is nondecreasing in t. So from (3.12), p. 76, it can be seen that if all component failure rates are nondecreasing, the F-failure rate process λ is also nondecreasing for t ∈ [0, T). □

Such a closure theorem does not hold true for the ordinary failure rate of the lifetime T, as can be seen from simple counterexamples (see Sect. 2.2.1 or [32], p. 83). From the proof of Proposition 3.36 it is evident that we cannot draw an analogous conclusion for decreasing failure rates.

3.2.4 Change of Information Level

One of the advantages of the semimartingale technique is the possibility of studying the random evolution of a stochastic process on different information levels. This was described in general in Sect. 3.1.2 by the projection theorem, which says in which way an SSM representation changes when passing from the filtration F to a subfiltration A. This projection theorem can be applied to the lifetime indicator process

Vₜ = I(T ≤ t) = ∫₀ᵗ I(T > s) λₛ ds + Mₜ.   (3.13)

If the lifetime can be observed, i.e., {T ≤ s} ∈ Aₛ for all 0 ≤ s ≤ t, then the change of the information level from F to A leads from (3.13) to the representation

Vₜ = E[I(T ≤ t)|Aₜ] = I(T ≤ t) = ∫₀ᵗ I(T > s) λ̄ₛ ds + M̄ₜ,   (3.14)


where λ̄ₜ = E[λₜ|Aₜ]. Note that, in general, this formula only holds for almost all t ∈ R+. In all our examples we can find A-progressive versions of the conditional expectations. The projection theorem shows that, under some mild technical conditions, it is possible to obtain the failure rate on a lower information level merely by forming conditional expectations.

Remark 3.37. Unfortunately, monotonicity properties are in general not preserved when changing the observation level. As was noted above (see Proposition 3.36), if all components of a monotone system have independent lifetimes with increasing failure rates, then T is F-IFR on the component observation level. But switching to a subfiltration A may lead to a nonmonotone failure rate process λ̄.

The following example illustrates the role of partial information.

Example 3.38. Consider a two-component parallel system with i.i.d. random variables Tᵢ, i = 1, 2, describing the component lifetimes, which follow an exponential distribution with parameter α > 0. Then the system lifetime is T = T₁ ∨ T₂ and the complete information filtration is given by

Fₜ = σ(I(T₁ > s), I(T₂ > s), 0 ≤ s ≤ t).

In this case the F-semimartingale representation (3.13) is given by

I(T ≤ t) = ∫₀ᵗ I(T > s) α{I(T₁ ≤ s) + I(T₂ ≤ s)} ds + Mₜ = ∫₀ᵗ I(T > s) λₛ ds + Mₜ.

Now several subfiltrations can describe different lower information levels, where it is assumed that the system lifetime T can be observed on all observation levels. Examples of partial information and the formal description via subfiltrations A and A-failure rates are as follows:

a) Information about T until h, after h complete information:

Aᵃₜ = σ(I(T ≤ s), 0 ≤ s ≤ t) for 0 ≤ t < h, and Aᵃₜ = Fₜ for t ≥ h;
λᵃₜ = 2α(1 − (2 − e^{−αt})⁻¹) for 0 ≤ t < h, and λᵃₜ = λₜ for t ≥ h.

b) Information about the component lifetime T₁ and T:

Aᵇₜ = σ(I(T ≤ s), I(T₁ ≤ s), 0 ≤ s ≤ t),
λᵇₜ = α(I(T₁ ≤ t) + I(T₁ > t) P(T₂ ≤ t)).


c) Information about T only:

Aᶜₜ = σ(I(T ≤ s), 0 ≤ s ≤ t),
λᶜₜ = 2α(1 − (2 − e^{−αt})⁻¹).

The failure rate corresponding to Aᶜ in this example is the standard deterministic failure rate, because {T > t} is an atom of Aᶜₜ (there is no subset of {T > t} in Aᶜₜ of positive probability), so that λᶜ can always be chosen to be deterministic on {T > t}. This corresponds to our intuition, because on this information level we cannot observe any other random event before T. Example 3.21 shows that such deterministic failure rates satisfy the well-known exponential formula (3.9), p. 71. An interesting question to ask is then: under what conditions does such an exponential formula also extend to random failure rate processes? This question is touched on briefly in [4] and answered in [165] to some extent. The following treatment differs slightly in that the starting point is the basic lifetime model of this section. The failure rate process λ is assumed to be observable on some level A, i.e., λ is adapted to that filtration. This observation level can be somewhere between the trivial filtration G = (Gₜ), t ∈ R+, Gₜ = {∅, Ω}, which does not allow for any random information, and the basic complete information filtration F. So T itself need not be observable at level A (and should not be, if we want to arrive at an exponential formula). Using the projection theorem we obtain

E[I(T ≤ t)|Aₜ] = 1 − F̄ₜ = ∫₀ᵗ F̄ₛ λₛ ds + M̃ₜ,   (3.15)

where F̄ denotes the conditional survival probability,

F̄ₜ = E[I(T > t)|Aₜ] = P(T > t|Aₜ),

and M̃ is an A-martingale. In general, F̄ need not be monotone and can be rather irregular. But if F̄ has continuous paths of bounded variation, then the martingale M̃ is identically 0 and the solution of the resulting integral equation is

F̄ₜ = exp{−∫₀ᵗ λₛ ds},   (3.16)

which is a generalization of formula (3.9). If A is the trivial filtration G, then (3.16) coincides with (3.9). For (3.16) to hold, it is necessary that the observation of λ and other events on level A have only a “smooth” influence on the conditional survival probability.
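As a numerical illustration of our own (with arbitrarily chosen α and t): for the parallel system of Example 3.38 the failure rate observed on the lowest level is deterministic, so the exponential formula must reproduce the survival function P(T > t) = 2e^{−αt} − e^{−2αt} of T = T₁ ∨ T₂:

```python
import math

alpha = 1.0   # arbitrary component failure rate

def rate_c(u):
    # deterministic failure rate from Example 3.38 c)
    return 2 * alpha * (1 - 1 / (2 - math.exp(-alpha * u)))

def survival(u):
    # P(T > u) for T = T1 ∨ T2 with i.i.d. Exp(alpha) lifetimes
    return 2 * math.exp(-alpha * u) - math.exp(-2 * alpha * u)

t, n = 1.0, 100_000
dt = t / n
integral = sum(rate_c((k + 0.5) * dt) for k in range(n)) * dt   # midpoint rule
print(math.exp(-integral), survival(t))   # the two values agree
```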

Remark 3.39. This is a more technical remark to show how one can proceed if F̄ is not continuous. Let (F̄ₜ₋), t ∈ R+, be the left-continuous version of F̄. Equation (3.15) can be rewritten as

F̄ₜ = 1 − ∫₀ᵗ F̄ₛ₋ λₛ ds − M̃ₜ.


Under mild conditions an A-martingale L can be found such that M̃ can be represented as the (stochastic) integral M̃ₜ = ∫₀ᵗ F̄ₛ₋ dLₛ; take

Lₜ = ∫₀ᵗ (I(F̄ₛ₋ > 0)/F̄ₛ₋) dM̃ₛ.

With the semimartingale Z, Zₜ = −∫₀ᵗ λₛ ds − Lₜ, (3.15) becomes

F̄ₜ = 1 + ∫₀ᵗ F̄ₛ₋ dZₛ.

If Z is of locally finite variation, then the unique solution of this integral equation is given by the so-called Doléans exponential (see [101], p. 440)

F̄ₜ = E(Z)ₜ = exp{Zᶜₜ} ∏_{0<s≤t} (1 + ΔZₛ) = exp{−∫₀ᵗ λₛ ds} exp{−Lᶜₜ} ∏_{0<s≤t} (1 − ΔLₛ),

where Zᶜ (Lᶜ) denotes the continuous part of Z (L) and ΔZₛ = Zₛ − Zₛ₋ (ΔLₛ = Lₛ − Lₛ₋) denotes the jump height at s. This extended exponential formula shows that possible jumps of the conditional survival probability are not caused by jumps of the failure rate process but by (unpredictable) jumps of the martingale part.

3.3 Point Processes in Reliability: Failure Time and Repair Models

A number of models in reliability are described by point processes and their corresponding counting processes. As examples we can think of shock models, in which shocks affecting a technical system arrive at random time points Tₙ according to a point process, causing some damage of random amount Vₙ, or of repair models, in which failures occur at random time points Tₙ causing random repair costs Vₙ. In both cases the sequence (Tₙ, Vₙ) is a multivariate or marked point process, to be introduced as follows.

Definition 3.40. Let (Tₙ), n ∈ N, be a point process and (Vₙ), n ∈ N, a sequence of random variables taking values in a measurable space (S, S). Then a marked point process (Tₙ, Vₙ), n ∈ N, is the ordered sequence of time points Tₙ and marks Vₙ associated with the time points, and (S, S) is called the mark space.

The mark Vₙ describes the event occurring at time Tₙ, for example the magnitude of the shock arriving at a system at time Tₙ (see Fig. 3.1). With each A ∈ S we associate the counting process (Nₜ(A)), t ∈ R+,


Fig. 3.1. Marked point process (marks V₁, V₂, V₃ in S plotted against the time points T₁, T₂, T₃ on the t-axis)

Nₜ(A) = Σₙ₌₁^∞ I(Vₙ ∈ A) I(Tₙ ≤ t),

which counts the number of marked points up to time t with marks in A. This family of counting processes N carries the same information as the sequence (Tₙ, Vₙ) and is therefore an equivalent description of the marked point process.

Example 3.41. A point process (Tₙ) can be viewed as a marked point process for which S consists of a single point. Another link between point and marked point processes is given by the counting process N = (Nₜ), Nₜ = Nₜ(S), which corresponds to the sequence (Tₙ).

Example 3.42 (Alternating Renewal Process). Consider a system which is repaired or replaced after failure (models of this kind are treated in detail in Sect. 4.2). Let Uₖ represent the length of the kth operation period and Rₖ the length of the kth repair/replacement time. Assume that (Uₖ) and (Rₖ), k ∈ N, are independent i.i.d. sequences of positive random variables. Let the mark space be S = {0, 1}, where 0 and 1 stand for “repair/replacement completed” and “failure,” respectively. Then the random time points Tₙ are

Tₙ = Σₖ₌₁^{[(n+1)/2]} Uₖ + Σₖ₌₁^{[n/2]} Rₖ,   n = 1, 2, ...,

where [a] denotes the integer part of a. The mark sequence is deterministic, alternating between 1 and 0:

Vₙ = (1 + (−1)ⁿ⁺¹)/2.

We see that Nₜ({0}) counts the number of completed repairs and Nₜ({1}) the number of failures up to time t.
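The formulas for Tₙ and Vₙ can be exercised directly. In this sketch (the helper names are ours) we build the event times from given sequences (Uₖ) and (Rₖ) and count failures and completed repairs up to a time t:

```python
def event_times(U, R, n):
    # T_m = sum_{k <= [(m+1)/2]} U_k + sum_{k <= [m/2]} R_k
    return [sum(U[:(m + 1) // 2]) + sum(R[:m // 2]) for m in range(1, n + 1)]

def mark(m):
    # V_m = (1 + (-1)^(m+1)) / 2: alternating 1 (failure), 0 (repair completed)
    return (1 + (-1) ** (m + 1)) // 2

U = [1.0, 1.0, 1.0]   # operating periods (deterministic, for illustration)
R = [0.5, 0.5, 0.5]   # repair times
times = event_times(U, R, 6)   # [1.0, 1.5, 2.5, 3.0, 4.0, 4.5]
t = 2.6
failures = sum(1 for m, tm in enumerate(times, 1) if tm <= t and mark(m) == 1)
repairs = sum(1 for m, tm in enumerate(times, 1) if tm <= t and mark(m) == 0)
print(failures, repairs)   # 2 failures and 1 completed repair by t = 2.6
```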


We now want to extend the concept of stochastic intensities from point processes to marked point processes. The internal filtration Fᴺ of (Tₙ, Vₙ) is defined by

Fᴺₜ = σ(Nₛ(A), 0 ≤ s ≤ t, A ∈ S).

This filtration is equivalently generated by the history {(Tₙ, Vₙ) : Tₙ ≤ t} of the marked point process.

Definition 3.43. Let F be some filtration including Fᴺ: Fᴺₜ ⊂ Fₜ, t ∈ R+. A stochastic process (λₜ(A)), t ∈ R+, A ∈ S, is called the stochastic intensity of the marked point process N if (i) for each t, A → λₜ(A) is a random measure on S; (ii) for each A ∈ S, Nₜ(A) admits the F-intensity λₜ(A).

We can now formulate the extension of Theorem 3.12, p. 64, to marked point processes (cf. [50], p. 238, [92], and [115], p. 22).

Theorem 3.44. Let N be an integrable marked point process and Fᴺ its internal filtration. Suppose that for each n there exists a regular conditional distribution of (Uₙ₊₁, Vₙ₊₁), Uₙ₊₁ = Tₙ₊₁ − Tₙ, given the past Fᴺ_{Tₙ}, of the form

Gₙ(ω, A, B) = P(Uₙ₊₁ ∈ A, Vₙ₊₁ ∈ B | Fᴺ_{Tₙ})(ω) = ∫_A gₙ(ω, s, B) ds,

where gₙ(ω, s, B) is, for fixed B, a measurable function and, for fixed (ω, s), a finite measure on (S, S). Then the process given by

λₜ(C) = gₙ(t − Tₙ, C) / Gₙ([t − Tₙ, ∞), S) = gₙ(t − Tₙ, C) / (1 − ∫₀^{t−Tₙ} gₙ(s, S) ds)

on (Tₙ, Tₙ₊₁] is a stochastic intensity of N, and for each C ∈ S,

Nₜ(C) − ∫₀ᵗ λₛ(C) ds

is an Fᴺ-martingale.

To find the SSM representation of a stochastic process derived from a marked point process, we can make use of the intensity of the latter. The following theorem is proved in Brémaud [50], p. 235. For the formulation of this result it is more convenient to use a slightly different notation for the process Nₜ(C), namely,

Nₜ(C) = N(t, C) = Σₙ₌₁^∞ I(Vₙ ∈ C) I(Tₙ ≤ t).


Theorem 3.45. Let (N(t, C)), t ∈ R+, C ∈ S, be an integrable marked point process admitting the intensity λₜ(C) with respect to some filtration F. Let H(t, z) be an S-marked F-predictable process such that, for all t ∈ R+, we have

E ∫₀ᵗ ∫_S |H(s, z)| λₛ(dz) ds < ∞.

Then, defining M(ds, dz) = N(ds, dz) − λₛ(dz) ds,

∫₀ᵗ ∫_S H(s, z) M(ds, dz)

is an F-martingale.

In the following subsections we consider some examples and particular cases. As was mentioned in Example 3.41, a point process (Tₙ) and its associated counting process (Nₜ) are special cases of marked point processes. Point process models in our SSM set-up require the assumption that the counting process (Nₜ), t ∈ R+, on a filtered probability space (Ω, F, F, P) has an absolutely continuous compensator or, what amounts to the same, admits an F-SSM representation

Nₜ = ∫₀ᵗ λₛ ds + Mₜ.   (3.17)

This point process model is consistent with the general lifetime model considered in Sect. 3.2. If the process N is stopped at T₁, then (3.17) reduces to (3.13):

N_{t∧T₁} = I(T₁ ≤ t) = ∫₀^{t∧T₁} λₛ ds + M_{t∧T₁} = ∫₀ᵗ I(T₁ > s) λₛ ds + M′ₜ,

where M′ is the stopped martingale M, M′ₜ = M_{t∧T₁}. The time to first failure or shock corresponds to the lifetime T = T₁.

In general, N is determined by its compensator or by its intensity λ, and it is possible to construct a point process N (and a corresponding probability measure) from a given intensity λ (these problems are considered in some detail in [92]; see also [115], Chap. 8). This allows us to define point process models in reliability by considering a given intensity.

3.3.1 Alternating Renewal Processes: One-Component Systems with Repair

We resume Example 3.42, p. 82, and assume that the operating times Uk follow a distribution F with density f and failure rate ρ(t) = f(t)/(1 − F(t)), whereas the repair times follow a distribution G with density g and hazard rate η(t) = g(t)/(1 − G(t)). Note that the failure/hazard rate is always set to 0 outside the support of the distribution. Then Nt({0}) counts the number of failures up to time t with an intensity λt({0}) = ρ(t − Tn)X(t) on (Tn, Tn+1], where X(t) = Vn on (Tn, Tn+1] indicates whether the system is up or down at t. The corresponding internal intensity for Nt({1}) is λt({1}) = η(t − Tn)(1 − X(t)). If the operating times are exponentially distributed with rate ρ > 0, the expected number of failures up to time t is given by

E Nt({0}) = ρ ∫_0^t E X(s) ds.
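For exponential up and down times the right-hand side is available in closed form, since E X(s) is then the availability of a two-state Markov process, E X(s) = η/(ρ+η) + ρ/(ρ+η) e^{−(ρ+η)s}. A small Monte Carlo sketch (our illustration, not from the text; the parameter values are arbitrary) compares the simulated number of failures with ρ ∫_0^t E X(s) ds:

```python
import math
import random

def failures_up_to(t_max, rho, eta, rng):
    """Number of failures up to t_max for an alternating renewal process
    with Exp(rho) operating and Exp(eta) repair times, starting up."""
    t, up, n_fail = 0.0, True, 0
    while True:
        t += rng.expovariate(rho if up else eta)
        if t > t_max:
            return n_fail
        if up:
            n_fail += 1
        up = not up

rho, eta, t = 1.0, 4.0, 10.0
rng = random.Random(2)
mc = sum(failures_up_to(t, rho, eta, rng) for _ in range(20000)) / 20000

# rho * int_0^t EX(s) ds, with availability
# EX(s) = eta/(rho+eta) + rho/(rho+eta) * exp(-(rho+eta)s)
q = rho + eta
exact = rho * (eta * t / q + rho / q**2 * (1.0 - math.exp(-q * t)))
print(mc, exact)
```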

3.3.2 Number of System Failures for Monotone Systems

We now consider a monotone system comprising m independent components. For each component we define an alternating renewal process, indexed by "i." The operating and repair times Uik and Rik, respectively, are independent i.i.d. sequences with distributions Fi and Gi. We make the assumption that the up-time distributions Fi are absolutely continuous with failure rates λt(i). The point process (Tn) is the superposition of the m independent alternating renewal processes (Tin), i = 1, ..., m, and the associated counting process is merely the sum of the single counting processes. Since we are only interested in the occurrence of failures now, we denote by Nt(i) the number of failures of component i (omitting the argument {0}) and the total number of component failures by Nt = Σ_{i=1}^m Nt(i). The time Tn records the occurrence of a component failure or the completion of a repair. As in Chap. 2, Φ : A → {0, 1} is the structure function, where A = {0, 1}^m, and the process Xt = (Xt(1), ..., Xt(m)) denotes the vector of component states at time t with values in A. The mark space is S = A × A and the value of Vn = (X_{Tn−}, X_{Tn}) describes the change of the component states occurring at time Tn, where we set V0 = ((1, ..., 1), (1, ..., 1)), i.e., we start with intact components at T0 = 0. Note that Vn = (x, y) means that y = (0i, x) or y = (1i, x) for some i ∈ {1, ..., m}, because we have absolutely continuous up-time distributions, so that at time Tn only one component changes its status. Combining Corollary 3.30, p. 76, and Theorem 3.44, p. 83, we get the following result.

Corollary 3.46. Let Γ = {(x, y) ∈ S : Φ(x) = 1, Φ(y) = 0, y = (0j, x) for some j ∈ {1, ..., m}} be the set of marks indicating a system failure. Then the process

Nt(Γ) = Σ_{i=1}^m ∫_0^t {Φ(1i, Xs) − Φ(0i, Xs)} dNs(i)

counting the number of system failures up to time t admits the intensity

λt(Γ) = Σ_{i=1}^m {Φ(1i, Xt) − Φ(0i, Xt)} ρt(i) Xt(i)

with respect to the internal filtration, where

ρt(i) = Σ_{k=0}^∞ λ_{t−Tik}(i) I(Tik < t ≤ Ti,k+1).

Proof. We know that ρt(i)Xt(i) are intensities of Nt(i) and thus

Mt(i) = Nt(i) − ∫_0^t ρs(i) Xs(i) ds

defines a martingale (also with respect to the internal filtration of the superposition, because of the independence of the component processes). Define

ΔΦt(i) = Φ(1i, Xt) − Φ(0i, Xt)

and let ΔΦt−(i) be the left-continuous and therefore predictable version of this process. Since at a jump of Nt(i) no other components change their status (P-a.s.), we have

∫_0^t ΔΦs(i) dNs(i) = ∫_0^t ΔΦs−(i) dNs(i).

It follows that

Nt(Γ) − ∫_0^t λs(Γ) ds = ∫_0^t Σ_{i=1}^m ΔΦs(i) dMs(i)
                       = ∫_0^t Σ_{i=1}^m ΔΦs−(i) dMs(i).

But the last integral is the sum of integrals of bounded, predictable processes and so by Theorem 3.45 is a martingale, which proves the assertion. ∎

To determine the expected number of system failures up to time t, we observe that EMt(i) = 0, i.e., ENt(i) = ∫_0^t ms(i) ds with ms(i) = E[ρs(i)Xs(i)], and that ΔΦt(i) and ρt(i)Xt(i) are stochastically independent. This results in

ENt(Γ) = ∫_0^t Σ_{i=1}^m E[ΔΦs(i)] ms(i) ds.   (3.18)
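Formula (3.18) can be made concrete for a two-component parallel system with exponential up and down times: there ΔΦs(i) = 1 − Xs(j) for the other component j, so E[ΔΦs(i)] = 1 − Aj(s) and ms(i) = ρi Ai(s), with Ai the availability of component i. The sketch below (ours, not from the text; all rates and helper names are invented for illustration) compares a Monte Carlo count of system failures with a numerical evaluation of (3.18):

```python
import math
import random

def system_failures(t_max, rates, rng):
    """Monte Carlo count of failures of a 2-component parallel system.
    rates = [(rho_i, eta_i)]: exponential up/down rates per component.
    A system failure occurs when a component fails while the other is down."""
    state = [1, 1]                       # both components start up
    clocks = [rng.expovariate(rates[i][0]) for i in (0, 1)]
    t, n_sys = 0.0, 0
    while True:
        i = 0 if clocks[0] <= clocks[1] else 1
        t = clocks[i]
        if t > t_max:
            return n_sys
        if state[i] == 1:                # component i fails
            if state[1 - i] == 0:        # other one down -> system failure
                n_sys += 1
            state[i] = 0
            clocks[i] = t + rng.expovariate(rates[i][1])
        else:                            # repair of component i completed
            state[i] = 1
            clocks[i] = t + rng.expovariate(rates[i][0])

def avail(s, rho, eta):
    """Availability of one component at time s (starting up)."""
    q = rho + eta
    return eta / q + rho / q * math.exp(-q * s)

rates = [(1.0, 5.0), (1.5, 5.0)]
t_max, steps = 8.0, 4000
rng = random.Random(3)
mc = sum(system_failures(t_max, rates, rng) for _ in range(20000)) / 20000

# numerical evaluation of (3.18): int_0^t sum_i rho_i A_i(s)(1 - A_j(s)) ds
h = t_max / steps
exact = 0.0
for k in range(steps):
    s = (k + 0.5) * h
    a0, a1 = avail(s, *rates[0]), avail(s, *rates[1])
    exact += h * (rates[0][0] * a0 * (1 - a1) + rates[1][0] * a1 * (1 - a0))
print(mc, exact)
```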

3.3.3 Compound Point Process: Shock Models

Let us now assume that a system is exposed to shocks at random times (Tn). A shock occurring at Tn causes a random amount of damage Vn and these damages accumulate. The marked point process (Tn, Vn) with mark space (R, B(R)) describes this shock process. To avoid notational difficulties we write in this subsection N(t, C) for the associated counting processes, describing the number of shocks up to time t with amounts in C. We are interested in the so-called compound point process

Xt = Σ_{n=1}^{N(t)} Vn

with N(t) = N(t, R), which gives the total damage up to t, and we want to derive the infinitesimal characteristics or the "intensity" of this process, i.e., to establish an SSM representation. We might also think of repair models, in which failures occur at random time points Tn. Upon failure, repair is performed. If the cost for the nth repair is Vn, then Xt describes the accumulated costs up to time t.

To derive an SSM representation of X, we first assume that we are given a general intensity λt(C) of the marked point process with respect to some filtration F. The main point now is to observe that

Xt = ∫_0^t ∫_S z N(ds, dz).

Then we can use Theorem 3.45, p. 84, with the predictable process H(s, z) = z to see that

M^F_t = ∫_0^t ∫_S z (N(ds, dz) − λs(dz) ds)

is a martingale if E ∫_0^t ∫_S |z| λs(dz) ds < ∞. Equivalently, we see that X has the F-SSM representation X = (f, M^F), with

fs = ∫_S z λs(dz).

To come to a more explicit representation we make the following assumptions (A):

• The filtration is the internal one F^N;
• Un+1 = Tn+1 − Tn is independent of F^N_{Tn} ∨ σ(Vn+1);
• Un+1 has an absolutely continuous distribution with density gn(t) and (ordinary) failure or hazard rate rn(t);
• Vn+1 is a positive random variable, independent of F^N_{Tn}, with finite mean EVn+1.

Under these assumptions we get by Theorem 3.44, p. 83,

λt(C) = Σ_{n=0}^∞ rn(t − Tn) P(Vn+1 ∈ C) I(Tn < t ≤ Tn+1)

and therefore the SSM representation

Xt = ∫_0^t Σ_{n=0}^∞ E[Vn+1] rn(s − Tn) I(Tn < s ≤ Tn+1) ds + M^{F^N}_t.

In the case of constant expectations EVn = EV1 we have

fs = E[V1] λs(R).
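In the simplest case of a compound Poisson process (shock rate ν, i.i.d. damages), the SSM representation gives E Xt = E[V1] · ν t. A quick Monte Carlo sketch (ours, not from the text; the exponential damage distribution and the parameter values are arbitrary):

```python
import random

def total_damage(t_max, nu, mean_damage, rng):
    """Accumulated damage X_t of a compound Poisson process:
    shocks arrive at rate nu, damages are Exp with the given mean."""
    t, x = 0.0, 0.0
    while True:
        t += rng.expovariate(nu)
        if t > t_max:
            return x
        x += rng.expovariate(1.0 / mean_damage)

nu, m, t = 2.0, 0.5, 6.0
rng = random.Random(4)
mc = sum(total_damage(t, nu, m, rng) for _ in range(20000)) / 20000
exact = m * nu * t    # E X_t = E[V_1] * int_0^t lambda_s(R) ds = E[V_1]*nu*t
print(mc, exact)
```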

3.3.4 Shock Models with State-Dependent Failure Probability

Now we introduce a failure mechanism in which the marks Vn = (Yn, Wn) are pairs of random variables, where Yn, Yn > 0, represents the amount of damage caused by the nth shock and Wn equals 1 or 0 according to whether or not the system fails at the nth shock. Upon failure, repair is performed. So the marks Vn take values in S = R+ × {0, 1}. The associated counting process is N(t, R+ × {0, 1}), and N(t) = N(t, R+ × {1}) counts the number of failures up to time t. The accumulated damage is described by

Xt = Σ_{n=1}^{N(t,S)} Yn.

In addition to (A), p. 87, we now assume:

• Yn+1 is independent of F^N_{Tn} with distribution

Fn+1(y) = P(Yn+1 ≤ y);

• For each k ∈ N0 there exists a measurable function pk(x) such that 0 ≤ pk(x) ≤ 1 and

P(Wn+1 = 1 | F^N_{Tn} ∨ σ(Yn+1)) = p_{N(Tn)}(X_{Tn} + Yn+1).   (3.19)

Note that F^N_{Tn} = σ((Ti, Yi, Wi), i = 1, ..., n) and that

N(Tn) = Σ_{i=1}^n Wi,   X_{Tn} = Σ_{i=1}^n Yi.

The assumption (3.19) can be interpreted as follows: if the accumulated damage is x and k failures have already occurred, then an additional shock of magnitude y causes the system to fail with probability pk(x + y).

To derive the compensator of N(t, R+ × {1}), the number of failures up to time t, we observe that

P(Un+1 ∈ A, Yn+1 ∈ R+, Wn+1 = 1 | F^N_{Tn})
  = P(Un+1 ∈ A) P(Wn+1 = 1 | F^N_{Tn})
  = P(Un+1 ∈ A) E[p_{N(Tn)}(X_{Tn} + Yn+1) | F^N_{Tn}].

Then Theorem 3.44 yields the intensity on {Tn < t ≤ Tn+1}:

λt(R+ × {1}) = rn(t − Tn) E[p_{N(Tn)}(X_{Tn} + Yn+1) | F^N_{Tn}].

Example 3.47. As a shock arrival process we now consider a Poisson process with rate ν, 0 < ν < ∞, and an i.i.d. sequence of shock amounts with common distribution F. Then we get

λt(R+ × {1}) = ν ∫_0^∞ p_{N(t)}(Xt + y) dF(y).

If the failure probability does not depend on the number of failures N and the shock magnitudes are deterministic, Yn = 1, then we have

λt(R+ × {1}) = ν p(Nt + 1).

To derive a semimartingale description of the first time to failure

T = inf{Tn : Wn = 1},

we simply stop the counting process N at the F^N-stopping time T and get

I(T ≤ t) = N(t ∧ T) = ∫_0^{t∧T} λs(R+ × {1}) ds + M_{t∧T}
         = ∫_0^t I(T > s) λs(R+ × {1}) ds + M_{t∧T},

where M is a martingale. The time to first failure admits a failure rate process, which is just the intensity of the counting process N.
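If the failure probability is a constant p, independent of damage and failure count, the intensity reduces to the constant νp and the first failure time is Exp(νp) distributed — this is the classical thinning property of the Poisson process. A Monte Carlo sketch (ours, not from the text; the parameter values are arbitrary):

```python
import random

def first_failure(nu, p, rng):
    """First failure time when shocks arrive at Poisson rate nu and each
    shock, independently, causes a failure with probability p (thinning)."""
    t = 0.0
    while True:
        t += rng.expovariate(nu)
        if rng.random() < p:
            return t

nu, p = 3.0, 0.2
rng = random.Random(5)
mean_T = sum(first_failure(nu, p, rng) for _ in range(20000)) / 20000
print(mean_T, 1.0 / (nu * p))   # constant intensity nu*p => T ~ Exp(nu*p)
```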

3.3.5 Shock Models with Failures of Threshold Type

The situation is as above; we only change the failure mechanism in that the first time to failure T is defined as the first time the accumulated damage reaches or exceeds a given threshold K ∈ R+:

T = inf{t ∈ R+ : Σ_{i=1}^{N(t,S)} Yi ≥ K} = inf{Tn : Σ_{i=1}^n Yi ≥ K}.

This is the hitting time of the set [K, ∞). This failure model seems to be quite different from the previous one. However, we see that it is just a special case, obtained by setting the failure probability function pk(x) of (3.19) for all k equal to the indicator of the interval [K, ∞):

pk(x) = p(x) = I_{[K,∞)}(x).

Then we get

P(Wn+1 = 1 | F^N_{Tn}) = E[p(X_{Tn} + Yn+1) | F^N_{Tn}]
                       = P(Yn+1 + X_{Tn} ≥ K | F^N_{Tn})
                       = 1 − Fn+1((K − X_{Tn})−).

This can be interpreted as follows: if the accumulated damage after n shocks is x, then the system fails with probability P(Yn+1 ≥ K − x) when the next shock occurs, which is the probability that the total damage hits the threshold K. Obviously, all shocks after T are counted by N(t) = N(t, R+ × {1}). The failure counting process N has on {Tn < t ≤ Tn+1} the intensity

λt(R+ × {1}) = rn(t − Tn){1 − Fn+1((K − X_{Tn})−)}.   (3.20)

The first time to failure is described by

I(T ≤ t) = ∫_0^t I(T > s) λs(R+ × {1}) ds + Mt,

with a suitable martingale M.

Example 3.48. Let us again consider the compound Poisson case with shock arrival rate ν and Fn = F for all n ∈ N0. Since rn(s − Tn) = ν and (K − X_{Tn}) = (K − Xt) on {Tn < t < Tn+1}, we get

I(T ≤ t) = ∫_0^t I(T > s) ν{1 − F((K − Xs)−)} ds + Mt.
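For exponentially distributed damages the threshold hitting time has a simple mean: with Exp damages of mean μ, the partial sums form a Poisson process on the damage axis, so the number of shocks needed to reach K is 1 + Poisson(K/μ) in distribution and E T = (1 + K/μ)/ν. A Monte Carlo sketch (ours, not from the text; parameters arbitrary):

```python
import random

def hitting_time(nu, mu, K, rng):
    """Time until the accumulated damage of a compound Poisson process
    (shock rate nu, Exp damages with mean mu) reaches the threshold K."""
    t, x = 0.0, 0.0
    while x < K:
        t += rng.expovariate(nu)
        x += rng.expovariate(1.0 / mu)
    return t

nu, mu, K = 2.0, 1.0, 5.0
rng = random.Random(6)
mc = sum(hitting_time(nu, mu, K, rng) for _ in range(20000)) / 20000
# number of shocks needed is 1 + Poisson(K/mu), so E[T] = (1 + K/mu)/nu
print(mc, (1 + K / mu) / nu)
```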

3.3.6 Minimal Repair Models

In the literature covering repair models special attention has been given to so-called minimal repair models. Instead of replacing a failed system by a new one, a repair restores the system to a certain degree. These minimal repairs are often verbally described (and defined) as in the following:

• "The ... assumption is made that the system failure rate is not disturbed after performing minimal repair. For instance, after replacing a single tube in a television set, the set as a whole will be about as prone to failure after the replacement as before the tube failure" (Barlow and Hunter [30]).

• "A minimal repair is one which leaves the unit in precisely the condition it was in immediately before the failure" (Phelps [129]).

The definition of the state of the system immediately before failure depends to a considerable degree on the information one has about the system. So it makes a difference whether all components of a complex system are observed or only failure of the whole system is recognized. In the first case the lifetime of the repaired component (the tube of the TV set) is associated with the residual system lifetime. In the second case the only information about the condition of the system immediately before failure is the age. So a minimal repair in this case would mean replacing the system (the whole TV set) by another one of the same age that as yet has not failed. Minimal repairs of this kind are also called black box or statistical minimal repairs, whereas the component-wise minimal repairs are also called physical minimal repairs.

Example 3.49. We consider a simple two-component parallel system with independent Exp(1) distributed component lifetimes X1, X2 and allow for exactly one minimal repair.

• Physical minimal repair. After failure at T = T1 = X1 ∨ X2 the component that caused the system to fail is repaired minimally. Since the component lifetimes are exponentially distributed, the additional lifetime is given by an Exp(1) random variable X3 independent of X1 and X2. The total lifetime T1 + X3 has distribution

P(T1 + X3 > t) = e^{−t}(2t + e^{−t}).

• Black box minimal repair. The lifetime T = T1 = X1 ∨ X2 until the first failure of the system has distribution P(T1 ≤ t) = (1 − e^{−t})² and failure rate λ(t) = 2(1 − e^{−t})/(2 − e^{−t}). The additional lifetime T2 − T1 until the second failure is assumed to have conditional distribution

P(T2 − T1 ≤ x | T1 = t) = P(T1 ≤ t + x | T1 > t) = 1 − e^{−x}(2 − e^{−(t+x)})/(2 − e^{−t}).

Integrating leads to the distribution of the total lifetime T2:

P(T2 > t) = e^{−t}(2 − e^{−t})(1 + t − ln(2 − e^{−t})).

It is (perhaps) no surprise that the total lifetime after a black box minimal repair is stochastically greater than after a physical minimal repair:

P(T2 > t) ≥ P(T1 + X3 > t), for all t ≥ 0.
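Both survival functions are in closed form, so the stochastic ordering can be checked numerically on a grid (our sketch, not from the text):

```python
import math

def surv_physical(t):
    """P(T1 + X3 > t): survival after one physical minimal repair."""
    return math.exp(-t) * (2 * t + math.exp(-t))

def surv_blackbox(t):
    """P(T2 > t): survival after one black box minimal repair."""
    e = math.exp(-t)
    return e * (2 - e) * (1 + t - math.log(2 - e))

grid = [0.1 * k for k in range(1, 101)]
dominated = all(surv_blackbox(t) >= surv_physical(t) for t in grid)
print(dominated, surv_blackbox(1.0), surv_physical(1.0))
```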

Below we summarize some typical categories of minimal repair models and give some further examples. Let (Tn) be a point process describing the failure times at which instantaneous repairs are carried out and let N = (Nt), t ∈ R+, be the corresponding counting process

Nt = Σ_{n=1}^∞ I(Tn ≤ t).

We assume that N is adapted to some filtration F and has F-intensity (λt). Different types of repair processes are characterized by different intensities λ. The repairs are minimal if the intensity λ is not affected by the occurrence of failures or, in other words, if one cannot determine the failure time points from the observation of λ. More formally, minimal repairs can be characterized as follows.

92 3 Stochastic Failure Models

Definition 3.50. Let (Tn), n ∈ N, be a point process with an integrable counting process N and corresponding F-intensity λ. Suppose that F^λ = (F^λ_t), t ∈ R+, is the filtration generated by λ: F^λ_t = σ(λs, 0 ≤ s ≤ t). Then the point process (Tn) is called a minimal repair process (MRP) if none of the variables Tn, n ∈ N, for which P(Tn < ∞) > 0, is an F^λ-stopping time, i.e., for all n ∈ N with P(Tn < ∞) > 0 there exists t ∈ R+ such that {Tn ≤ t} ∉ F^λ_t.

This is a rather general definition that comprises the well-known special case of a nonhomogeneous Poisson process, as is seen below. A renewal process with a strictly increasing or decreasing hazard rate r of the interarrival times has intensity (compare Example 3.13, p. 64)

λt = Σ_{n≥0} r(t − Tn) I(Tn < t ≤ Tn+1),  T0 = 0, λ0 = r(0+),

and is therefore not an MRP, because Nt = |{s ∈ R+ : 0 < s ≤ t, λs+ = λ0}|. In the following we give some examples of (minimal) repair processes.

(a) In the basic statistical minimal repair model the intensity is a time-dependent deterministic function λt = λ(t), so that the process is a nonhomogeneous Poisson process. This means that the age (the failure intensity) is not changed as a result of a failure (minimal repair). Here F^λ_t = {Ω, ∅} for all t ∈ R+, so clearly the failure times Tn are no F^λ-stopping times. The following special cases have been given much attention in the literature:

λp(t) = λβ(λt)^{β−1} (power law),
λL(t) = λe^{βt} (log-linear model).

For the parallel system in Example 3.49, one has λ(t) = 2(1 − e^{−t})/(2 − e^{−t}). If the intensity is a constant, λt ≡ λ, the times between successive repairs are independent Exp(λ) distributed random variables. This is the case in which repairs have the same effect as replacements.

(b) If in (a) the intensity is not deterministic but a random variable λ(ω), which is known at the time origin (λ is F0-measurable), or, more generally, λ = (λt) is a stochastic process such that λt is F0-measurable for all t ∈ R+, i.e., F0 = σ(λs, s ∈ R+) and Ft = F0 ∨ σ(Ns, 0 ≤ s ≤ t), then the process is called a doubly stochastic Poisson process or a Cox process. The process generalizes the basic model (a); the failure (minimal repair) times are no F^λ-stopping times, since F^λ_t = σ(λ) ⊂ F0 and Tn is not F0-measurable.
Also the Markov-modulated Poisson process of Example 3.14, p. 65, where the intensity λt = λ_{Yt} is determined by a Markov chain (Yt), is an MRP. Indeed, it is a slight modification of a doubly stochastic Poisson process in that the filtration Ft = σ(Ns, Ys, 0 ≤ s ≤ t) does not include the information about the paths of λ in F0.

3.3 Point Processes in Reliability: Failure Time and Repair Models 93

(c) For the physical minimal repair in Example 3.49, λt = I(X1 ∧ X2 ≤ t). In this case F^λ is generated by the minimum of X1 and X2. The first failure time of the system, T1, equals X1 ∨ X2, which is not an F^λ-stopping time. The filtration generated by λt comprises no information about X1 ∨ X2.

In the following we give another characterization of an MRP.

Theorem 3.51. Assume that P(Tn < ∞) = 1 for all n ∈ N and that there exist versions of the conditional probabilities Ft(n) = E[I(Tn ≤ t) | F^λ_t] such that for each n ∈ N, (Ft(n)), t ∈ R+, is an (F^λ-progressive) stochastic process.

(i) Then the point process (Tn) is an MRP if and only if for each n ∈ N there exists some t ∈ R+ such that

P(0 < Ft(n) < 1) > 0.

(ii) If furthermore (Ft) = (Ft(1)) has P-a.s. continuous paths of bounded variation on finite intervals, then

1 − Ft = exp{−∫_0^t λs ds}.

Proof. (i) To prove (i) we show that P(Ft(n) ∈ {0, 1}) = 1 for all t ∈ R+ is equivalent to Tn being an F^λ-stopping time. Since we have F0(n) = 0 and, by the dominated convergence theorem for conditional expectations,

lim_{t→∞} Ft(n) = 1,

the assumption that P(Ft(n) ∈ {0, 1}) = 1 for all t ∈ R+ is equivalent to Ft(n) = I(Tn ≤ t) (P-a.s.). But as (Ft(n)) is adapted to F^λ, this means that Tn is an F^λ-stopping time. This shows that under the given assumptions P(0 < Ft(n) < 1) > 0 is equivalent to Tn being no F^λ-stopping time.

(ii) For the second assertion we apply the exponential formula (3.16) as described on p. 80. ∎

Example 3.52. In continuation of Example 3.49 of the two-component parallel system we allow for repeated physical minimal repairs. Let (Xk), k ∈ N, be a sequence of i.i.d. random variables following an exponential distribution with parameter 1: Xk ∼ Exp(1). Then we define

T1 = X1 ∨ X2,  Tn+1 = Tn + Xn+2, n ∈ N.

We consider the filtration generated by the sequence (Xk), k ∈ N. The intensity of the corresponding counting process Nt = Σ_{n=1}^∞ I(Tn ≤ t) with respect to this filtration is then λt = I(X1 ∧ X2 ≤ t). [If we had considered the filtration generated by the sequence (Tn), n ∈ N, we would have derived the deterministic intensity 2(1 − exp(−t))/(2 − exp(−t)).]

94 3 Stochastic Failure Models

By elementary calculations it can be seen that

E[I(T1 > t) | F^λ_t] = P(T1 > t | X1 ∧ X2 ∧ t)

is continuous and nonincreasing. According to Theorem 3.51 it follows that (Tn) is an MRP and that the time to the first failure has conditional distribution

1 − Ft = exp{−∫_0^t I(X1 ∧ X2 ≤ s) ds} = exp{−(t − X1 ∧ X2)+}.
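Averaging this conditional survival probability over X1 ∧ X2, which is Exp(2) distributed for two independent Exp(1) lifetimes, must recover the unconditional survival P(T1 > t) = 1 − (1 − e^{−t})² from Example 3.49. A quick Monte Carlo sketch (ours, not from the text):

```python
import math
import random

rng = random.Random(8)
t = 1.5
# average exp{-(t - X1^X2)^+} over the minimum X1^X2 ~ Exp(2)
n = 200000
acc = 0.0
for _ in range(n):
    m = rng.expovariate(2.0)
    acc += math.exp(-max(t - m, 0.0))
mc = acc / n
exact = 1.0 - (1.0 - math.exp(-t)) ** 2   # P(T1 > t) for the parallel system
print(mc, exact)
```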

Now we want to illustrate the above definition of a minimal repair in a more complex situation. We consider the shock damage repair model described in Sect. 3.3.4. We now assume that the shock arrival process (T*_k) is a nonhomogeneous Poisson process with intensity function ν(t) and that (Vk) with Vk = (Yk, Wk) is an i.i.d. sequence of pairs of random variables, independent of (T*_k). The common distribution of the positive variables Yk is denoted by F. The failure mechanism is as before, but the probability p(x) of failure at the occurrence of a shock, given that the accumulated damage is x, is independent of the number of previous failures. Then we obtain for the failure counting process the intensity

λt = ν(t) ∫_0^∞ p(X_{t−} + y) dF(y),   (3.21)

where

Xt = Σ_{k=1}^∞ Yk I(T*_k ≤ t)

denotes the accumulated damage up to time t. The following theorem shows under which condition the failure point process is an MRP.

Theorem 3.53. If 0 < p(x) < 1 holds true for all x, then the point process (Tn) driven by the intensity (3.21) is an MRP.

Proof. The random variables Wk equal 1 or 0 according to whether or not the system fails at the kth shock. The first failure time T1 can then be represented by

T1 = inf{T*_k : Wk = 1}.

At each occurrence of a shock a Bernoulli experiment is carried out with outcome Wk. The random variable Wk is not measurable with respect to σ(X_{T*_k}), because by the condition 0 < p(x) < 1 it follows that

E[I(Wk = 1) | X_{T*_k}] = P(Wk = 1 | X_{T*_k}) = p(X_{T*_k}) ∉ {0, 1}.

This shows that T1 cannot be an F^X-stopping time, where F^X is generated by the process X = (Xt). Since we have F^λ_t ⊂ F^X_t, T1 is no F^λ-stopping time either. By induction via

T_{n+1} = inf{T*_k > Tn : Wk = 1}

we infer that none of the variables Tn is an F^λ-stopping time, which shows that (Tn) is an MRP. ∎

Remark 3.54. (1) In the case p(x) = c for some c, 0 < c ≤ 1, the process is a nonhomogeneous Poisson process with intensity λt = ν(t)c and therefore an MRP. (2) The condition 0 < p(x) < 1 excludes the case of threshold models, for which p(x) = 1 for x ≥ K and p(x) = 0 otherwise, for some constant K > 0. For such a threshold model we have

T1 = inf{t ∈ R+ : λt ≥ ν(t)},

if P(Yk ≤ x) > 0 for all x > 0. In this case T1 is an F^λ-stopping time and consequently (Tn) is no MRP.

3.3.7 Comparison of Repair Processes for Different Information Levels

Consider a monotone system comprising m independent components with lifetimes Zi, i = 1, ..., m, and corresponding ordinary failure rates λt(i). Its structure function Φ : {0, 1}^m → {0, 1} represents the state of the system (1: intact, 0: failure), and the process Xt = (Xt(1), ..., Xt(m)) denotes the vector of component states at time t with values in {0, 1}^m. Example 3.49 suggests comparing the effects of minimal repairs on different information levels. However, it seems difficult to define such point processes for arbitrary information levels. One possible way is sketched in the following, where considerations are restricted to the complete information F-level (component level) and the "black-box level" A^T generated by T = T1, At = σ(I(T1 ≤ s), 0 ≤ s ≤ t). Note that T1 describes the time to first failure, i.e.,

T1 = inf{t ∈ R+ : Φ(Xt) = 0}.

This time to first system failure is governed by the hazard rate process λ for t ∈ [0, T) (cf. Corollary 3.30 on p. 76):

λt = Σ_{i=1}^m (Φ(1i, Xt) − Φ(0i, Xt)) λt(i).   (3.22)

Our aim is to extend the definition of λt also to {T1 ≤ t}. To this end we extend the definition of Xt(i) on {Zi ≤ t} following the idea that upon system failure the component which caused the failure is repaired minimally in the sense that it is restored and operates at the same failure rate as if it had not failed before. So we define Xt(i) = 0 on {Zi ≤ t} if the first failure of component i caused no system failure; otherwise we set Xt(i) = 1 on {Zi ≤ t} (note that in the latter case the value of Xt(i) is redefined for t = Zi). In this way we define Xt and, by (3.22), the process λt for all t ∈ R+. This completed intensity λt induces a point process (Nt) which counts the number of minimal repairs on the component level. The corresponding complete information filtration F = (Ft), t ∈ R+, is given by

Ft = σ(Ns, I(Zi ≤ s), 0 ≤ s ≤ t, i = 1, ..., m).

To investigate whether the process (Nt) is an MRP we define the random variables

Yi = inf{t ∈ R+ : Φ(1i, Xt) − Φ(0i, Xt) = 1}, i = 1, ..., m, inf ∅ = ∞,

which describe the time when component i becomes critical, i.e., the time from which on a failure of component i would lead to system failure. It follows that

λt = Σ_{i=1}^m I(Yi ≤ t) λt(i),
F^λ_t = σ(I(Yi ≤ s), 0 ≤ s ≤ t, i = 1, ..., m).

Obviously, on {Yi < ∞} we have Zi > Yi, and it can be shown that Zi is not measurable with respect to σ(Y1, ..., Ym). For a two-component parallel system this means that Z1 ∨ Z2 is not measurable with respect to σ(Z1 ∧ Z2), which holds true observing that E[I(Z1 ∨ Z2 > z) | Z1 ∧ Z2] ∉ {0, 1} for some z (note that the random variables Zi are assumed to be independent). The extension to the general case is intuitive, but the details of a formal, lengthy proof are omitted. We state that the time to the first failure

T1 = min_{i=1,...,m} Zi I(Yi < ∞)

is no F^λ-stopping time. By induction it can be seen that also Tn is no F^λ-stopping time and (Tn) is an MRP.

Now we want to consider the same system on the "black-box level". The change to the A^T-level by conditioning leads to the failure rate λ̄, λ̄t = E[λt | At]. This failure rate λ̄ can be chosen to be deterministic,

λ̄t = E[λt | T1 > t];

it is the ordinary failure rate of T1. For the time to the first system failure we have the two representations

I(T1 ≤ t) = ∫_0^t I(T1 > s) λs ds + Mt    (F-level)
          = ∫_0^t I(T1 > s) λ̄s ds + M̄t    (A^T-level).

From the deterministic failure rate λ̄ a nonhomogeneous Poisson process (T′n), n ∈ N, 0 < T′1 < T′2 < ···, can be constructed, where T1 and T′1 have the same distribution. This nonhomogeneous Poisson process with

N′t = Σ_{n=1}^∞ I(T′n ≤ t) = ∫_0^t λ̄s ds + M′t

describes the MRP on the A^T-level. Comparing these two information levels, Example 3.49 suggests ENt ≥ EN′t for all positive t. A general comparison, also for arbitrary subfiltrations, seems to be an open problem (cf. [4, 124]).

Example 3.55. In the two-component parallel system of Example 3.49 we have the failure rate process λt = I(X1 ∧ X2 ≤ t) on the component level and λ̄t = 2(1 − e^{−t})/(2 − e^{−t}) on the black-box level. So one has two descriptions of the same random lifetime T = T1:

I(T1 ≤ t) = ∫_0^t I(T1 > s) I(X1 ∧ X2 ≤ s) ds + Mt
          = ∫_0^t I(T1 > s) · 2(1 − e^{−s})/(2 − e^{−s}) ds + M̄t.

The process N counts the number of minimal repairs on the component level:

Nt = ∫_0^t I(X1 ∧ X2 ≤ s) ds + Mt.

This is a delayed Poisson process, the (repair) intensity of which is equal to 1 after the first component failure. The process N′ counts the number of minimal repairs on the black-box level:

N′t = ∫_0^t 2(1 − e^{−s})/(2 − e^{−s}) ds + M′t.

This is a nonhomogeneous Poisson process with an intensity which corresponds to the ordinary failure rate of T1. Elementary calculations indeed yield

ENt = t − (1/2)(1 − e^{−2t}) ≥ EN′t = t − ln(2 − e^{−t}).

To interpret this result one should note that on the component level only the critical component which caused the system to fail is repaired. A black box repair, which is a replacement by a system of the same age that has not yet failed, could be a replacement by a system with both components working.
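Both expected-count formulas of Example 3.55 are easy to verify numerically: on the component level Nt is a unit-rate Poisson count on [X1 ∧ X2, t], so ENt = E(t − X1 ∧ X2)+ with X1 ∧ X2 ∼ Exp(2). A sketch (ours, not from the text):

```python
import math
import random

def EN_component(t):
    """E N_t = t - (1 - e^{-2t})/2, minimal repairs on the component level."""
    return t - 0.5 * (1.0 - math.exp(-2.0 * t))

def EN_blackbox(t):
    """E N'_t = t - ln(2 - e^{-t}), minimal repairs on the black-box level."""
    return t - math.log(2.0 - math.exp(-t))

# Monte Carlo on the component level: E N_t = E (t - X1^X2)^+, X1^X2 ~ Exp(2)
rng = random.Random(9)
t = 2.0
mc = sum(max(t - rng.expovariate(2.0), 0.0) for _ in range(200000)) / 200000
more_repairs = all(EN_component(0.05 * k) >= EN_blackbox(0.05 * k)
                   for k in range(1, 201))
print(mc, EN_component(t), more_repairs)
```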

3.3.8 Repair Processes with Varying Degrees of Repair

As in the minimal repair section, let (Tn) be a point process describing failure times at which instantaneous repairs are carried out and let N = (Nt), t ∈ R+, be the corresponding counting process. We assume that N is adapted to some filtration F and has F-intensity (λt).


One way to model varying levels or degrees of repairs is the following. Consider a new item or system having lifetime distribution F with failure rate r(t). Assume that the nth repair has the effect that the distribution of the time to the next failure is that of an unfailed item of age An ≥ 0. Then An = 0 means complete repair (as good as new) or replacement, and An > 0 can be interpreted as a partial repair which sets the item back to the functioning state. Theorem 3.12, p. 64, immediately yields the intensity of such a repair process with respect to the internal filtration F^N: let (An), n ∈ N, be a sequence of nonnegative random variables such that An is F^N_{Tn}-measurable; then the F^N-intensity of N is given by

λt = Σ_{n=0}^∞ r(t − Tn + An) I(Tn < t ≤ Tn+1),  A0 = T0 = 0.

The two extreme cases are:

1. An = 0 for all n ∈ N. Then N is a renewal process with interarrival time distribution F; all repairs are complete restorations to the as-good-as-new state.
2. An = Tn for all n ∈ N. Then N is a nonhomogeneous Poisson process with intensity r(t); all repairs are (black box) minimal repairs.

In addition we can introduce random degrees Zn ≤ 1 of the nth repair. Starting with a new item, the first failure occurs at T1. A repair with degree Z1 is instantaneously carried out and results in a virtual age of A1 = (1 − Z1)T1. Continuing, we can define the sequence of virtual ages recursively by

A_{n+1} = (1 − Z_{n+1})(An + T_{n+1} − Tn),  A0 = 0.

Negative values of Zn may be interpreted as additional aging due to the nth failure or a clumsy repair. In the literature there exist many models describing different ways of generating or prescribing the random sequence of repair degrees; cf. Bibliographic Notes.
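The virtual-age recursion is straightforward to simulate: given virtual age a, the age at the next failure can be drawn from the conditional survival F̄(a + u)/F̄(a) by inverse transform. The sketch below (ours, not from the text; the Weibull lifetime, parameter values, and helper names are invented for illustration) recovers the minimal repair limit: for Zn ≡ 0 the counting process is a nonhomogeneous Poisson process with intensity r(t), so ENt equals the cumulative hazard.

```python
import math
import random

def next_failure_age(a, shape, scale, rng):
    """Age at the next failure for an item of virtual age a, drawn from the
    conditional Weibull survival F_bar(a + u)/F_bar(a) by inversion."""
    v = 1.0 - rng.random()                 # uniform in (0, 1]
    return scale * ((a / scale) ** shape - math.log(v)) ** (1.0 / shape)

def count_failures(t_max, degree, shape, scale, rng):
    """Failures up to t_max when every repair has degree Z_n = degree, so the
    virtual age after a repair is A_{n+1} = (1 - degree)(A_n + U_{n+1})."""
    t, a, n = 0.0, 0.0, 0
    while True:
        age = next_failure_age(a, shape, scale, rng)
        t += age - a                       # operating time of this cycle
        if t > t_max:
            return n
        n += 1
        a = (1.0 - degree) * age           # virtual age after the repair

shape, scale, t_max = 2.0, 1.0, 3.0
rng = random.Random(10)
# degree 0: all repairs minimal, N is a nonhomogeneous Poisson process and
# E N_t equals the cumulative hazard (t/scale)^shape = 9 here
mc = sum(count_failures(t_max, 0.0, shape, scale, rng)
         for _ in range(20000)) / 20000
print(mc, (t_max / scale) ** shape)
```

Setting degree = 1 instead gives An = 0 for all n, i.e., a renewal process (extreme case 1 above).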

3.3.9 Minimal Repairs and Probability of Ruin

In this section we investigate a model that combines a certain reward and cost structure with minimal repairs. Consider a one-unit system that fails from time to time according to a point process. After failure a minimal repair is carried out that leaves the state of the system unchanged. The system can work in one of m unobservable states. State "1" stands for new or in good condition and "m" is defective or in bad condition. Aging of the system is described by a link between the failure point process and the unobservable state of the system. The failure or minimal repair intensity may depend on the state of the system.

Starting with an initial capital of u ≥ 0, there is some constant flow of income on the one hand and, on the other hand, each minimal repair incurs a random cost. The risk process R = (Rt), t ∈ R+, describes the difference between the income, including the initial capital u, and the accumulated costs for minimal repairs up to time t. The time of ruin is defined as τ = τ(u) = inf{t ∈ R+ : Rt ≤ 0}. Since explicit formulas are rarely available, we are interested in bounds for P(τ < ∞) and P(τ ≤ t), the infinite and the finite horizon ruin probabilities.

A related question is when to stop processing the system and carry out an inspection or a renewal in order to maximize some reward functional. This problem is treated in Sect. 5.4.

For the mathematical formulation of the model, let the basic probability space (Ω, F, P) be equipped with a filtration F, the complete information level, to which all processes are adapted, and let S = {1, ..., m} be the set of unobservable states. We assume that the time points of failures (minimal repairs) 0 < T1 < T2 < ··· form a Markov-modulated Poisson process as described in Example 3.14, p. 65. Let us recapitulate the details:

• The changes of the states are driven by a homogeneous Markov process Y = (Yt), t ∈ R+, with values in S and infinitesimal parameters qi, the rate to leave state i, and qij, the rate to reach state j from state i:

qi = lim_{h→0+} (1/h) P(Yh ≠ i | Y0 = i),
qij = lim_{h→0+} (1/h) P(Yh = j | Y0 = i),  i, j ∈ S, i ≠ j,
qii = −qi = −Σ_{j≠i} qij.

• The time points (Tn) form a point process and N = (Nt), t ∈ R+, is the corresponding counting process Nt = Σ_{n≥1} I(Tn ≤ t), which has a stochastic intensity λ_{Yt} depending on the unobservable state, i.e., N admits the representation

Nt = ∫_0^t λ_{Ys} ds + Mt,

where M is an F-martingale and 0 < λi < ∞, i ∈ S. Since the filtration F^λ (F^λ = F^Y if λi ≠ λj for i ≠ j) generated by the intensity does not include F^N as a subfiltration, it follows that Tn, n ∈ N, is not an F^λ-stopping time. Therefore, according to Definition 3.50, p. 92, N is an MRP.
• (Xn), n ∈ N, is a sequence of positive i.i.d. random variables, independent of N and Y, with common distribution F and finite mean μ. The cost caused by the nth minimal repair at time Tn is described by Xn.
• There is an initial capital u and an income of constant rate c > 0 per unit time.

Now the process R, given by

Rt = u + ct − Σ_{n=1}^{Nt} Xn,

describes the available capital at time t as the difference of the income and the total amount of costs for minimal repairs up to time t.
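The risk process is easy to simulate. The sketch below (ours, not from the text) restricts to a single environmental state, m = 1, with exponential repair costs — the classical compound Poisson risk model — where the infinite-horizon ruin probability is known in closed form (Cramér–Lundberg: P(τ < ∞) = (λμ/c) e^{−(1/μ − λ/c)u} for c > λμ) and can be compared against simulation over a long horizon:

```python
import math
import random

def ruined(u, c, lam, mu, horizon, rng):
    """Simulate R_t = u + c t - sum X_n for a single state (m = 1):
    Poisson(lam) repairs, Exp costs with mean mu; report ruin by `horizon`."""
    t, r = 0.0, u
    while True:
        w = rng.expovariate(lam)
        if t + w > horizon:
            return False
        t += w
        r += c * w - rng.expovariate(1.0 / mu)   # income minus repair cost
        if r <= 0:
            return True                          # ruin at a repair epoch

u, c, lam, mu = 1.0, 2.0, 1.0, 1.0
rng = random.Random(11)
mc = sum(ruined(u, c, lam, mu, 100.0, rng) for _ in range(20000)) / 20000
# classical formula for exponential costs and positive drift c - lam*mu > 0
exact = (lam * mu / c) * math.exp(-(1.0 / mu - lam / c) * u)
print(mc, exact)
```

Note that ruin can only occur at a repair epoch, which is also the observation underlying the failure rate process of τ derived below.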

The process R is commonly used in other branches of applied probability such as queueing or collective risk theory. In risk theory one is mainly interested in the distribution of the time to ruin τ = inf{t ∈ R+ : Rt ≤ 0}.

The Failure Rate Process of the Ruin Time

We want to show that the indicator process V_t = I(τ(u) ≤ t) has a semimartingale representation

  V_t = I(τ ≤ t) = ∫_0^t I(τ > s) h_s ds + M_t,  t ∈ R+,  (3.23)

where M is a mean zero martingale with respect to the filtration F = (F_t), t ∈ R+, which is generated by all introduced random quantities:

  F_t = σ(N_s, Y_s, X_i, 0 ≤ s ≤ t, i = 1, ..., N_t).

The failure rate process h = (h_t), t ∈ R+, can be derived in the same way as was done for shock models with failures of threshold type (cf. p. 89). Note that ruin can only occur at a failure time; therefore, the ruin time is a hitting time of a compound point process:

  τ = inf{t ∈ R+ : A_t = Σ_{n=1}^{N_t} B_n ≥ u} = inf{T_n : A_{T_n} ≥ u},

where B_n = X_n − cU_n and U_n = T_n − T_{n−1}, n = 1, 2, .... Replacing X_t by A_t, r(t − T_n) by λ_{Y_t}, and the threshold S by u in formula (3.20) on p. 90, we get the following lemma.

Lemma 3.56. Let τ = τ(u) be the ruin time and F̄ the tail of the distribution of the claim sizes, F̄(x) = F((x,∞)) = P(X_1 > x), x ∈ R. Then the F-failure rate process h is given by

  h_t = λ_{Y_t} F̄(R_{t−}) = Σ_{i=1}^m λ_i I(Y_t = i) F̄(R_{t−}),  t ∈ R+.

The failure rate process h is bounded above by max{λ_i : i ∈ S}. If all claim arrival rates λ_i coincide, λ = λ_i, i ∈ S, we have the classical Poisson case, and it is not surprising that the hazard rate decreases when the risk reserve increases and vice versa. Of course, the paths of R are not monotone, and so the failure rate process does not have monotone paths either. But it has (stochastically) a tendency to increase or decrease in the following sense. As follows from the results of Sect. 3.3.3, the process R has an F-semimartingale representation

  R_t = ∫_0^t Σ_{i=1}^m I(Y_s = i)(c − λ_i μ) ds + L_t

with a mean zero F-martingale L. If we have positive drift in all environmental states, i.e., c − λ_i μ > 0, i = 1, ..., m, then R is a submartingale and it is seen that h tends to 0 as t → ∞ (P-a.s.). On the other hand, if the claim rate λ_{Y_t} is increasing (P-a.s.), the drift is nonpositive in all states, i.e., c − λ_i μ ≤ 0, i = 1, ..., m, and F̄ is convex on the support of the distribution, then R is a supermartingale and it follows by Jensen's inequality for conditional expectations:

  E[h_{t+s} | F_t] = E[λ_{Y_{t+s}} F̄(R_{(t+s)−}) | F_t] ≥ E[λ_{Y_t} F̄(R_{(t+s)−}) | F_t]
                  = λ_{Y_t} E[F̄(R_{(t+s)−}) | F_t] ≥ λ_{Y_t} F̄(E[R_{(t+s)−} | F_t])
                  ≥ λ_{Y_t} F̄(R_{t−}) = h_t,  t, s ∈ R+.

This shows that h is a submartingale, i.e., h is stochastically increasing.

3.3 Point Processes in Reliability: Failure Time and Repair Models 101

Bounds for Finite Time Ruin Probabilities

Except in simple cases, such as Poisson arrivals of exponentially distributed claims (P/E case), the finite time ruin probabilities ψ(u, t) = P(τ(u) ≤ t) cannot be expressed by the basic model parameters in an explicit form. So there is a variety of suggested bounds and approximations (see Asmussen [9] and Grandell [78] for overviews). In the following, bounds for the ruin probabilities in finite time will be derived that are based on the semimartingale representation given in Lemma 3.56. It turns out that especially for small values of t known bounds can be improved.

From now on we assume that the claim arrival process is Poisson with rate λ > 0. Then Lemma 3.56 yields the representation

  V_t = I(τ(u) ≤ t) = ∫_0^t I(τ(u) > s) λ F̄(R_s) ds + M_t,  t ∈ R+.  (3.24)

Note that the paths of R have only countably many jumps, so that under the integral sign R_{s−} can be replaced by R_s. Taking expectations on both sides of (3.24) one gets by Fubini's theorem

  ψ(u, t) = ∫_0^t E[I(τ(u) > s) λ F̄(R_s)] ds  (3.25)
          = ∫_0^t (1 − ψ(u, s)) λ E[F̄(R_s) | τ(u) > s] ds.

As a solution of this integral equation we have the following representation of the finite time ruin probability:

  ψ(u, t) = 1 − exp{ −∫_0^t λ E[F̄(R_s) | τ(u) > s] ds }.  (3.26)

This shows that the (possibly defective) distribution of τ(u) has the hazard rate

  λ E[F̄(R_t) | τ(u) > t].


Now let N^X be the renewal process generated by the sequence (X_i), i ∈ N, with N^X_t = sup{k ∈ N_0 : Σ_{i=1}^k X_i ≤ t}, and let A(u, t) = ∫_0^t a(u, s) ds, where a(u, s) = λ P(N^X_{u+cs} = N_s). Then bounds for ψ(u, t) can be established.

Theorem 3.57. For all u, t ≥ 0, the following inequality holds true:

  B(u, t) ≤ ψ(u, t) ≤ A(u, t),

where A is defined as above and B(u, t) = 1 − exp{−λ ∫_0^t F̄(u + cs) ds}.

Proof. For the lower bound we use the representation (3.26) and simply observe that E[F̄(R_s) | τ(u) > s] ≥ F̄(u + cs).

For the upper bound we start with formula (3.24). Since {τ(u) > t} ⊂ {R_t ≥ 0}, we have

  V_t = ∫_0^t I(τ(u) > s) λ F̄(R_s) ds + M_t ≤ ∫_0^t I(R_s ≥ 0) λ F̄(R_s) ds + M_t.

Taking expectations on both sides of this inequality we get

  ψ(u, t) = EV_t ≤ ∫_0^t λ E[I(R_s ≥ 0) F̄(R_s)] ds.

It remains to show that a(u, t) = λ E[I(R_t ≥ 0) F̄(R_t)]. Denoting the k-fold convolution of F by F^{*k}, it follows by the independence of the claim arrival process and (X_i), i ∈ N,

  E[ I(R_t ≥ 0) F̄(u + ct − Σ_{i=1}^{N_t} X_i) ]
   = Σ_{k=0}^∞ E[ I(u + ct − Σ_{i=1}^k X_i ≥ 0) F̄(u + ct − Σ_{i=1}^k X_i) ] P(N_t = k)
   = Σ_{k=0}^∞ ∫_0^{u+ct} F̄(u + ct − x) dF^{*k}(x) P(N_t = k)
   = Σ_{k=0}^∞ {F^{*k}(u + ct) − F^{*(k+1)}(u + ct)} P(N_t = k)
   = Σ_{k=0}^∞ P(N^X_{u+ct} = k) P(N_t = k)
   = P(N^X_{u+ct} = N_t),

which completes the proof. □


The bounds of the theorem seem to have several advantages: as numerical examples show, they perform well especially for small values of t for which ψ(u, t) ≪ ψ(u,∞) (see Aven and Jensen [25]). In addition, no assumptions have been made about the tail of the claim size distribution F and the drift of the risk reserve process, which are necessary for most of the asymptotic methods. This makes clear, on the other hand, that one cannot expect these bounds to perform well for t → ∞.
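In the P/E case the lower bound B(u, t) of Theorem 3.57 has a closed form, since for Exp(1/μ) claims ∫_0^t F̄(u + cs) ds = (μ/c) e^{−u/μ} (1 − e^{−ct/μ}). The sketch below compares it with a Monte Carlo estimate of ψ(u, t); the parameter values are illustrative assumptions.

```python
import math
import random

lam, mu, c, u, t = 1.0, 1.0, 1.2, 2.0, 10.0   # illustrative P/E parameters

# Lower bound B(u, t) = 1 - exp{-lam * int_0^t Fbar(u + c s) ds} of Thm 3.57;
# for Exp(1/mu) claims the integral is (mu/c) e^{-u/mu} (1 - e^{-c t/mu}).
B = 1.0 - math.exp(-lam * (mu / c) * math.exp(-u / mu)
                   * (1.0 - math.exp(-c * t / mu)))

def ruined(rng):
    """One path: ruin occurs iff R_{T_n} <= 0 at some claim time T_n <= t."""
    time, cost = 0.0, 0.0
    while True:
        time += rng.expovariate(lam)          # Poisson claim arrivals
        if time > t:
            return False
        cost += rng.expovariate(1.0 / mu)     # exponential claim size
        if u + c * time - cost <= 0:
            return True

rng = random.Random(7)
psi_hat = sum(ruined(rng) for _ in range(20000)) / 20000
print(B, psi_hat)   # the Monte Carlo estimate should lie above the lower bound
```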

Bibliographic Notes. The book of Bremaud [50] is one of the basic sources of the martingale dynamics of point process systems. The introduction (p. XV) also contains a sketch of the historical development. The SSM approach in connection with optimal stopping problems is considered by Jensen [98]. Comprehensive overviews over lifetime models in the martingale framework are those of Arjas [3, 4] and Koch [108]. An essential basis for the presentation of point processes in the martingale framework was laid by Jacod [92]. A number of books on point processes are available now. Among others, the martingale approach is exposed in Bremaud [50], Karr [103], and Daley and Vere-Jones [58], which also include the basic results about marked point processes. A full account of marked point processes can be found in the monograph of Last and Brandt [115].

Details on the theory of Markov processes, briefly mentioned in Sect. 3.1, can be found in the classic book of Dynkin [66] or in the more recent monographs on stochastic processes mentioned at the beginning of this chapter.

One of the first papers considering random hazard rates in lifetime models is that of Bergman [38]. Failure rate processes for multivariate reliability systems were introduced by Arjas in [6]. Shock processes have been investigated by a number of authors. Aven treated these processes in the framework of counting processes in some generality in [15]. Recent work on shock models of threshold type concentrates on deriving the distribution of the hitting (life-)time under general conditions. Wendt [163] considers a doubly stochastic Poisson shock arrival process, whereas Lehmann [119] investigates shock models with failure thresholds varying in time.

Models of minimal repairs have been considered by Barlow and Hunter [30], Aven [18], Bergman [39], Block et al. [48], Stadje and Zuckerman [151], Shaked and Shanthikumar [141], and Beichelt [35], among others. Our formulation of the minimal repair concept in a general counting process framework is taken from [24]. Varying degrees of repairs are investigated in a number of papers like Brown and Proschan [51], Kijima [107], and Last and Szekli [116, 117].

As was pointed out by Bergman [39], information plays an important role in minimal repair models. Further steps in investigating information-based minimal repair were carried out by Arjas and Norros [7] and Natvig [124].

General references to risk theory are, among others, the books of Grandell [77] and Rolski et al. [134]. Overviews over bounds and approximations of ruin probabilities can be found in Asmussen [9] and Grandell [78]. Most of the approximations are based on limit theorems for ψ(u, t) as u → ∞, t → ∞. One of the exceptions is the inverse martingales technique used by Delbaen and Haezendonck [60].

4 Availability Analysis of Complex Systems

In this chapter we establish methods and formulas for computing various performance measures of monotone systems of repairable components. Emphasis is placed on the point availability, the distribution of the number of failures in a time interval, and the distribution of downtime of the system. A number of asymptotic results are formulated and proved, mainly for systems having highly available components.

The performance measures are introduced in Sect. 4.1. In Sects. 4.3–4.6 results for binary monotone systems are presented. Since many of these results are based on the one-component case, we first give in Sect. 4.2 a rather comprehensive treatment of this case. Section 4.7 presents generalizations and related models. Section 4.7.1 covers multistate monotone systems. In Sects. 4.2–4.5 and 4.7.1 it is assumed that there are at least as many repair facilities (channels) as components. In Sect. 4.7.2 we consider a parallel system having r repair facilities, where r is less than the number of components. Attention is drawn to the case with r = 1. Finally, in Sect. 4.7.3 we present models for analysis of passive redundant systems.

In this chapter we focus on the situation that the components have exponential lifetime distributions. See Sect. 4.7.1, p. 163, and Bibliographic Notes, p. 173, for some comments concerning the more general case of nonexponential lifetimes.

4.1 Performance Measures

We consider a binary monotone system with state process (Φ_t) = (Φ(X_t)), as described in Sect. 2.1. Here Φ_t equals 1 if the system is functioning at time t and 0 if the system is not functioning at time t, and X_t = (X_t(1), X_t(2), ..., X_t(n)) ∈ {0, 1}^n describes the states of the components. The performance measures relate to one point in time t or an interval J, which has the form [0, u] or (u, v], 0 < u < v. To simplify notation, we simply write u instead of [0, u].

T. Aven and U. Jensen, Stochastic Models in Reliability, Stochastic Modelling and Applied Probability 41, DOI 10.1007/978-1-4614-7894-2_4, © Springer Science+Business Media New York 2013



Emphasis will be placed on the following performance measures:

(a) Point availability at time t, A(t), given by

  A(t) = EΦ_t = P(Φ_t = 1).

(b) Let N_J be equal to the number of system failures in the interval J. We consider the following performance measures:

  P(N_J ≤ k),  k ∈ N_0,
  M(J) = EN_J,
  A[u, v] = P(Φ_t = 1, ∀t ∈ [u, v]) = P(Φ_u = 1, N_{(u,v]} = 0).

The performance measure A[u, v] is referred to as the interval reliability.

(c) Let Y_J denote the downtime in the interval J, i.e.,

  Y_J = ∫_J (1 − Φ_t) dt.

We consider the performance measures

  P(Y_J ≤ y),  y ∈ R+,
  A_D(J) = EY_J / |J|,

where |J| denotes the length of the interval J. The measure A_D(J) is in the literature sometimes referred to as the interval unavailability, but we shall not use this term here.

The above performance measures relate to a fixed point in time or a finite time interval. Often it is more attractive, in particular from a computational point of view, to consider the asymptotic limit of the measure (as t, u, or v → ∞), suitably normalized (in most cases such limits exist). In the following we shall consider both the above measures and suitably defined limits.

4.2 One-Component Systems

We consider in this section a one-component system. Hence Φ_t = X_t = X_t(1). If the system fails, it is repaired or replaced. Let T_k, k ∈ N, represent the length of the kth operation period, and let R_k, k ∈ N, represent the length of the kth repair/replacement time for the system; see Fig. 4.1. We assume that (T_k), k ∈ N, and (R_k), k ∈ N, are independent i.i.d. sequences of positive random variables. We denote the probability distributions of T_k and R_k by F and G, respectively, and assume that they have finite means, i.e.,


  μ_F < ∞,  μ_G < ∞.

Fig. 4.1. Time evolution of a failure and repair process for a one-component system starting at time t = 0 in the operating state

In reliability engineering μ_F and μ_G are referred to as the mean time to failure (MTTF) and the mean time to repair (MTTR), respectively.

To simplify the presentation, we also assume that F is an absolutely continuous distribution, i.e., F has a density function f and failure rate function λ. We do not make the same assumption for the distribution function G, since that would exclude discrete repair time distributions, which are often used in practice.

In some cases we also need the variances of F and G, denoted σ²_F and σ²_G, respectively. In the following, when writing the variance of a random variable, or any other moment, it is tacitly assumed that these are finite.

The sequence

  T_1, R_1, T_2, R_2, ...

forms an alternating renewal process. We introduce the following variables

  S_n = T_1 + Σ_{k=1}^{n−1} (R_k + T_{k+1}),  n ∈ N,

and

  S°_n = Σ_{k=1}^n (T_k + R_k),  n ∈ N.

By convention, S_0 = S°_0 = 0, and sums over empty sets are zero. We see that S_n represents the nth failure time, and S°_n represents the completion time of the nth repair.

The S_n sequence generates a modified (delayed) renewal process N with renewal function M. The first interarrival time has distribution F. All other interarrival times have distribution F*G (convolution of F and G), with mean μ_F + μ_G. Let H^{(n)} denote the distribution function of S_n. Then

  H^{(n)} = F * (F*G)^{*(n−1)},


where B^{*n} denotes the n-fold convolution of a distribution B and, as usual, B^{*0} equals the distribution with mass 1 at 0. Note that we have

  M(t) = Σ_{n=1}^∞ H^{(n)}(t)

(cf. (B.2), p. 274, in Appendix B). The S°_n sequence generates an ordinary renewal process N° with renewal function M°. The interarrival times, T_k + R_k, have distribution F*G, with mean μ_F + μ_G. Let H°^{(n)} denote the distribution function of S°_n. Then

  H°^{(n)} = (F*G)^{*n}.

Let α_t denote the forward recurrence time at time t, i.e., the time from t to the next event:

  α_t = S_{N_t+1} − t on {X_t = 1}

and

  α_t = S°_{N°_t+1} − t on {X_t = 0}.

Hence, given that the system is up at time t, the forward recurrence time α_t equals the time to the next failure time. If the system is down at time t, the forward recurrence time equals the time to complete the repair. Let F_{α_t} and G_{α_t} denote the conditional distribution functions of α_t given that X_t = 1 and X_t = 0, respectively. Then we have for x ∈ R

  F_{α_t}(x) = P(α_t ≤ x | X_t = 1) = P(S_{N_t+1} − t ≤ x | X_t = 1)

and

  G_{α_t}(x) = P(α_t ≤ x | X_t = 0) = P(S°_{N°_t+1} − t ≤ x | X_t = 0).

Similarly for the backward recurrence time, we define β_t, F_{β_t}, and G_{β_t}. The backward recurrence time β_t equals the age of the system if the system is up at time t and the duration of the repair if the system is down at time t, i.e.,

  β_t = t − S°_{N°_t} on {X_t = 1}

and

  β_t = t − S_{N_t} on {X_t = 0}.

4.2.1 Point Availability

We will show that the point availability A(t) is given by

  A(t) = F̄(t) + ∫_0^t F̄(t − x) dM°(x) = F̄(t) + F̄ * M°(t).  (4.1)

Using a standard renewal argument conditioning on the duration of T_1 + R_1, it is not difficult to see that A(t) satisfies the following equation:


  A(t) = F̄(t) + ∫_0^t A(t − x) d(F*G)(x)

(cf. the derivation of the renewal equation in Appendix B, p. 275). Hence, by using Theorem B.2, p. 275, in Appendix B, formula (4.1) follows. Alternatively, we may use a more direct approach, writing

  X_t = I(T_1 > t) + Σ_{n=1}^∞ I(S°_n ≤ t, S°_n + T_{n+1} > t),

which gives

  A(t) = EX_t = F̄(t) + Σ_{n=1}^∞ ∫_0^t F̄(t − x) dH°^{(n)}(x)
       = F̄(t) + ∫_0^t F̄(t − x) dM°(x).

The point unavailability Ā(t) is given by Ā(t) = 1 − A(t) = F(t) − F̄ * M°(t). In the case that F is exponential with failure rate λ, it can be shown that

  Ā(t) ≤ λ μ_G;

see Proposition 4.11, p. 114. By the Key Renewal Theorem (Theorem B.7, p. 277, in Appendix B), it follows that

  lim_{t→∞} A(t) = μ_F / (μ_F + μ_G),  (4.2)

noting that the mean of F*G equals μ_F + μ_G and ∫_0^∞ F̄(t) dt = μ_F. The right-hand side of (4.2) is called the limiting availability (or steady-state availability) and is for short denoted A. The limiting unavailability is defined as Ā = 1 − A. Usually μ_G is small compared to μ_F, so that

  Ā = μ_G/μ_F + o(μ_G/μ_F),  μ_G/μ_F → 0.
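The renewal equation for A(t) can be solved numerically. The sketch below does this for exponential up and down times (rates chosen for illustration), where the cycle-length density of F*G has the closed form λν/(ν−λ)(e^{−λx} − e^{−νx}); the trapezoidal discretization is our own choice, and the result is checked against the limit (4.2).

```python
import math

lam, nu = 1.0, 4.0          # failure rate 1/mu_F and repair rate 1/mu_G (illustrative)
h, T = 0.005, 10.0
n = int(T / h)

def k(x):
    """Density of the cycle-length distribution F*G for Exp(lam) uptimes
    and Exp(nu) repair times (lam != nu)."""
    return lam * nu / (nu - lam) * (math.exp(-lam * x) - math.exp(-nu * x))

# Forward solve of A(t) = Fbar(t) + int_0^t A(t - x) k(x) dx by the
# trapezoidal rule; k(0) = 0, so the scheme is explicit in A(t_i).
A = [1.0]
kv = [k(j * h) for j in range(n + 1)]
for i in range(1, n + 1):
    conv = 0.5 * A[0] * kv[i] + sum(A[i - j] * kv[j] for j in range(1, i))
    A.append(math.exp(-lam * i * h) + h * conv)

limit = nu / (lam + nu)     # mu_F/(mu_F + mu_G) = (1/lam)/((1/lam) + (1/nu))
print(A[-1], limit)         # A(10) is already close to the limiting availability
```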

4.2.2 The Distribution of the Number of System Failures

Consider first the interval [0, v]. We see that

  {N_v ≤ n} = {S_{n+1} > v},  n ∈ N_0,

because if the number of failures in this interval is less than or equal to n, then the (n+1)th failure occurs after v, and vice versa. Thus, for n ∈ N_0,

  P(N_v ≤ n) = 1 − (F*G)^{*n} * F(v).  (4.3)

Some closely related results are stated below in Propositions 4.1 and 4.2.


Proposition 4.1. The probability of n failures occurring in [0, v] and the system being up at time v is given by

  P(N_v = n, X_v = 1) = ∫_0^v F̄(v − x) d(F*G)^{*n}(x),  n ∈ N_0.

Proof. The result clearly holds for n = 0. For n ≥ 1, the result follows by observing that

  {N_v = n, X_v = 1} = {S°_n + T_{n+1} > v, S°_n ≤ v}. □

Proposition 4.2. The probability of n failures occurring in [0, v] and the system being down at time v is given by

  P(N_v = n, X_v = 0) = ∫_0^v Ḡ(v − x) dH^{(n)}(x) for n ∈ N, and 0 for n = 0.

Proof. The proof is similar to the proof of Proposition 4.1. For n ∈ N, it is seen that

  {N_v = n, X_v = 0} = {S_n + R_n > v, S_n ≤ v}. □

From Propositions 4.1 and 4.2 we can deduce several results, for example, a formula for P(N_u = n | X_u = 1) using that

  P(N_u = n | X_u = 1) = P(N_u = n, X_u = 1) / A(u).

In the theorem below we establish general formulas for P(N_{(u,v]} ≤ n) and A[u, v].

Theorem 4.3. The probability that at most n (n ∈ N_0) failures occur during the interval (u, v] equals

  P(N_{(u,v]} ≤ n) = [1 − F_{α_u} * (F*G)^{*n}(v − u)] A(u)
                  + [1 − G_{α_u} * (F*G)^{*n} * F(v − u)] Ā(u),

and

  A[u, v] = F̄_{α_u}(v − u) A(u).

Proof. To establish the formula for P(N_{(u,v]} ≤ n), we condition on the state of the system at time u:

  P(N_{(u,v]} ≤ n) = Σ_{j=0}^1 P(N_{(u,v]} ≤ n | X_u = j) P(X_u = j).

From this equality the formula follows trivially for n = 0. For n ∈ N, we need to show that the following two equalities hold true:

  P(N_{(u,v]} > n | X_u = 1) = (F_{α_u} * G) * (F*G)^{*(n−1)} * F(v − u),  (4.4)
  P(N_{(u,v]} > n | X_u = 0) = G_{α_u} * (F*G)^{*n} * F(v − u).  (4.5)

But (4.4) follows directly from (4.3) with the forward recurrence time distribution given {X_u = 1} as the first operating time distribution. Formula (4.5) is established analogously.

The formula for A[u, v] is seen to hold observing that

  A[u, v] = P(X_u = 1, N_{(u,v]} = 0)
          = A(u) P(N_{(u,v]} = 0 | X_u = 1)
          = A(u) P(α_u > v − u | X_u = 1).

This completes the proof of the theorem. □

If the downtimes are much smaller than the uptimes in probability (which is the common situation in practice), then N is close to a renewal process generated by all the uptimes. Hence, if the times to failure are exponentially distributed, the process N is close to a homogeneous Poisson process. Formal asymptotic results will be established later; see Sect. 4.4.

In the following two propositions we relate the distributions of the forward and backward recurrence times to the renewal functions M and M°.

Proposition 4.4. The probability that the system is up (down) at time t and the forward recurrence time at time t is greater than w is given by

  A[t, t + w] = P(X_t = 1, α_t > w)
             = F̄(t + w) + ∫_0^t F̄(t − x + w) dM°(x),  (4.6)

  P(X_t = 0, α_t > w) = ∫_0^t Ḡ(t − x + w) dM(x).  (4.7)

Proof. Consider first formula (4.6). It is not difficult to see that

  X_t I(α_t > w) = Σ_{n=0}^∞ I(S°_n ≤ t, S°_n + T_{n+1} > t + w).  (4.8)

By taking expectations we find that

  P(X_t = 1, α_t > w) = F̄(t + w) + Σ_{n=1}^∞ ∫_0^t F̄(t − x + w) dH°^{(n)}(x)
                     = F̄(t + w) + ∫_0^t F̄(t − x + w) dM°(x).

This proves (4.6). To prove (4.7) we use a similar argument writing

  (1 − X_t) I(α_t > w) = Σ_{n=1}^∞ I(S_n ≤ t, S_n + R_n > t + w).  (4.9)

This completes the proof of the proposition. □

Proposition 4.5. The probability that the system is up (down) at time t and the backward recurrence time at time t is greater than w is given by

  P(X_t = 1, β_t > w) = F̄(t) + ∫_0^{t−w} F̄(t − x) dM°(x) for w ≤ t, and 0 for w > t,  (4.10)

  P(X_t = 0, β_t > w) = ∫_0^{t−w} Ḡ(t − x) dM(x) for w ≤ t, and 0 for w > t.  (4.11)

Proof. The proof is similar to the proof of Proposition 4.4. Replace the indicator functions in the sums in (4.8) and (4.9) by

  I(S°_n + T_{n+1} > t, S°_n + w < t)

and

  I(S_n + R_n > t, S_n + w < t),

respectively. □

Theorem 4.6. The asymptotic distributions of the state process (X_t) and the forward (backward) recurrence times at time t are given by

  lim_{t→∞} P(X_t = 1, α_t > w) = ∫_w^∞ F̄(x) dx / (μ_F + μ_G),
  lim_{t→∞} P(X_t = 0, α_t > w) = ∫_w^∞ Ḡ(x) dx / (μ_F + μ_G),
  lim_{t→∞} P(X_t = 1, β_t > w) = ∫_w^∞ F̄(x) dx / (μ_F + μ_G),  (4.12)
  lim_{t→∞} P(X_t = 0, β_t > w) = ∫_w^∞ Ḡ(x) dx / (μ_F + μ_G).

Proof. The results follow by applying the Key Renewal Theorem (see Appendix B, p. 277) to formulas (4.6), (4.7), (4.10), and (4.11). □

Let us introduce

  F_∞(w) = ∫_0^w F̄(x) dx / μ_F,  (4.13)
  G_∞(w) = ∫_0^w Ḡ(x) dx / μ_G.  (4.14)


The distribution F_∞ (G_∞) is the asymptotic limit distribution of the forward and backward recurrence times in a renewal process generated by the uptimes (downtimes) and is called the equilibrium distribution for F (G); cf. Theorem B.13, p. 279, in Appendix B. We would expect that F_∞ and G_∞ are equal to the asymptotic distributions of the forward and backward recurrence times in the alternating renewal process. As shown in the following proposition, this indeed holds true.

Proposition 4.7. The asymptotic distributions of the forward and backward recurrence times are given by

  lim_{t→∞} F_{α_t}(w) = lim_{t→∞} F_{β_t}(w) = F_∞(w)

and

  lim_{t→∞} G_{α_t}(w) = lim_{t→∞} G_{β_t}(w) = G_∞(w).  (4.15)

Proof. To establish these formulas, we use (4.2) (see p. 109), Theorem 4.6, and identities like

  P(α_t > w | X_t = 1) = P(X_t = 1, α_t > w) / A(t). □

The following theorem expresses the asymptotic distribution of N_{(t,t+w]} as a function of F, G, F_∞, G_∞, and A.

Theorem 4.8. For n ∈ N_0,

  lim_{t→∞} P(N_{(t,t+w]} ≤ n) = [1 − F_∞ * (F*G)^{*n}(w)] A
                              + [1 − G_∞ * (F*G)^{*n} * F(w)] Ā.

If the lifetime distribution F is exponential with failure rate λ, then we know that the forward recurrence time α_t has the same distribution for all t, and it is easily verified from the expression (4.13) for the equilibrium distribution for F that F_∞(t) = F(t).
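The identity F_∞ = F in the memoryless case is easy to check numerically from (4.13); the sketch below evaluates the integral with a midpoint rule (the rate and evaluation point are illustrative).

```python
import math

lam = 0.7                   # illustrative failure rate
mu_F = 1.0 / lam

def F_inf(w, steps=2000):
    """Equilibrium distribution F_inf(w) = (1/mu_F) int_0^w Fbar(x) dx of
    (4.13) for F = Exp(lam), evaluated by the midpoint rule."""
    h = w / steps
    s = sum(math.exp(-lam * (j + 0.5) * h) for j in range(steps))
    return h * s / mu_F

w = 2.3
print(F_inf(w), 1.0 - math.exp(-lam * w))   # both are F(w) in the memoryless case
```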

Next we consider an increasing interval (t, t + w], w → ∞. Then we can use the normal distribution to find an approximate value for the distribution of N. The asymptotic normality, as formulated in the following theorem, follows by applying the Central Limit Theorem for renewal processes; see Theorem B.12, p. 278, in Appendix B. The notation N(μ, σ²) is used for the normal distribution with mean μ and variance σ².


Theorem 4.9. The asymptotic distribution of N_{(t,t+w]} as w → ∞ is given by

  (N_{(t,t+w]} − w/(μ_F + μ_G)) / [w(σ²_F + σ²_G)/(μ_F + μ_G)³]^{1/2} →^D N(0, 1).  (4.16)
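Formula (4.16) translates directly into a normal approximation for P(N_{(t,t+w]} ≤ n); the helper below evaluates it with the standard normal CDF via `math.erf` (the numerical inputs are illustrative).

```python
import math

def approx_N_cdf(n, w, mu_F, mu_G, sd_F, sd_G):
    """Normal approximation of P(N_(t,t+w] <= n) for large w, per (4.16):
    mean w/(mu_F + mu_G), variance w (sd_F^2 + sd_G^2)/(mu_F + mu_G)^3."""
    mean = w / (mu_F + mu_G)
    var = w * (sd_F ** 2 + sd_G ** 2) / (mu_F + mu_G) ** 3
    z = (n - mean) / math.sqrt(var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Example: mu_F = 10, mu_G = 0.5 (MTTF >> MTTR), exponential-like spreads.
p = approx_N_cdf(n=12, w=105.0, mu_F=10.0, mu_G=0.5, sd_F=10.0, sd_G=0.5)
print(p)   # probability of at most 12 failures in a window of length 105
```

At n equal to the mean w/(μ_F + μ_G) the approximation returns exactly 1/2, as it should for a symmetric limit law.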

The expected number of system failures can be found from the distribution function. Obviously, M(v) ≈ M°(v) for large v. The exact relationship between M(v) and M°(v) is given in the following proposition.

Proposition 4.10. The difference between the renewal functions M(v) and M°(v) equals the unavailability at time v, i.e.,

  M(v) = M°(v) + Ā(v).

Proof. Using that P(N_v ≤ n) = 1 − (F*G)^{*n} * F(v) (by (4.3), p. 109) and the expression (4.1), p. 108, for the availability A(t), we obtain

  M(v) = Σ_{n=1}^∞ P(N_v ≥ n)
       = Σ_{n=0}^∞ (F*G)^{*n} * F(v) = F(v) + M° * F(v)
       = M°(v) + Ā(v),

which is the desired result. □

The number of system failures in [0, v], N_v, generates a counting process with stochastic intensity process

  η_v = λ(β_v) X_v,  (4.17)

where λ is the failure rate function and β_v is the backward recurrence time at time v, i.e., the relative age of the system at time v; cf. Sect. 3.3.2, p. 85. We have m(v) = Eη_v, where m(v) is the renewal density of M(v). Thus, if the system has an exponential lifetime distribution with failure rate λ,

  m(v) = λ A(v).  (4.18)

In general,

  m(v) ≤ [sup_{s≤v} λ(s)] A(v).  (4.19)

This bound can be used to establish an upper bound also for the unavailability Ā(t).

Proposition 4.11. The unavailability at time t, Ā(t), satisfies

  Ā(t) ≤ sup_{s≤t} λ(s) ∫_0^t Ḡ(u) du ≤ [sup_{s≤t} λ(s)] μ_G.  (4.20)


Proof. From (4.7), p. 111, we have

  Ā(t) = P(X_t = 0) = ∫_0^t Ḡ(t − x) dM(x) = ∫_0^t Ḡ(t − x) m(x) dx.  (4.21)

Using (4.19) this gives

  Ā(t) ≤ ∫_0^t Ḡ(t − x) [sup_{s≤x} λ(s)] A(x) dx.

It follows that

  Ā(t) ≤ sup_{s≤t} λ(s) ∫_0^t Ḡ(t − x) dx
       = sup_{s≤t} λ(s) ∫_0^t Ḡ(u) du ≤ [sup_{s≤t} λ(s)] μ_G,

which proves (4.20). □

Hence, if the system has an exponential lifetime distribution with failure rate λ, then

  Ā(t) ≤ λ ∫_0^t Ḡ(s) ds ≤ λ μ_G.  (4.22)

It is also possible to establish lower bounds on Ā(t). A simple bound is obtained by combining (4.21) and the fact that

  t ≤ ES_{N_t+1} ≤ (μ_F + μ_G)(1 + M(t))

(cf. Appendix B, p. 279), giving

  Ā(t) ≥ Ḡ(t) M(t) ≥ Ḡ(t) ( t/(μ_F + μ_G) − 1 ).

Now suppose at time t that the system is functioning and the relative age is u. What can we then say about the intensity process at time t + v (v > 0)? The probability distribution of η_{t+v} is determined if we can find the distribution of the relative age at time t + v. But the relative age is given by (4.10), p. 112, slightly modified to take into account that the first uptime has distribution given by F_u(x) = 1 − F̄(u + x)/F̄(u) for 0 ≤ u ≤ t:

  P(X_{t+v} = 1, β_{t+v} > w | X_t = 1, β_t = u)
    = F̄_u(v) + ∫_0^{v−w} F̄(v − x) dM°(x) for w ≤ v, and 0 for w > v.

The asymptotic distribution, as v → ∞, is the same as in formula (4.12), p. 112.

The (modified) renewal process (N_t) has cycle lengths T_k + R_k with mean μ_F + μ_G, k ≥ 2. Thus we would expect that the (mean) average number of failures per unit of time is approximately equal to 1/(μ_F + μ_G) for large t. In the following theorem some asymptotic results are presented that give precise formulations of this idea.

Theorem 4.12. With probability one,

  lim_{t→∞} N_t/t = 1/(μ_F + μ_G).  (4.23)

Furthermore,

  lim_{t→∞} EN_t/t = 1/(μ_F + μ_G),  (4.24)

  lim_{u→∞} E[N_{u+w} − N_u] = w/(μ_F + μ_G),  (4.25)

  lim_{t→∞} ( EN_t − t/(μ_F + μ_G) ) = (σ²_F + σ²_G)/(2(μ_F + μ_G)²) − 1/2.

Proof. These results follow directly from renewal theory; see Appendix B, pp. 276–278. □
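The law of large numbers (4.23) is easy to observe on a single long simulated path of the alternating renewal process; the mean up and down times below are illustrative assumptions.

```python
import random

mu_F, mu_G = 10.0, 0.5      # illustrative mean uptime and mean repair time
rng = random.Random(3)

def count_failures(horizon):
    """Count failures of one alternating renewal path on [0, horizon], with
    exponential uptimes (mean mu_F) and exponential repairs (mean mu_G)."""
    t, n = 0.0, 0
    while True:
        t += rng.expovariate(1.0 / mu_F)      # uptime T_k
        if t > horizon:
            return n
        n += 1                                 # failure number n at time S_n
        t += rng.expovariate(1.0 / mu_G)      # repair R_k
        if t > horizon:
            return n

horizon = 200000.0
rate = count_failures(horizon) / horizon
print(rate, 1.0 / (mu_F + mu_G))              # (4.23): N_t / t -> 1/(mu_F + mu_G)
```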

4.2.3 The Distribution of the Downtime in a Time Interval

First we formulate and prove some results related to the mean of the downtime in the interval [0, u]. As before (cf. Sect. 4.1, p. 106), we let Y_u represent the downtime in the interval [0, u].

Theorem 4.13. The expected downtime in [0, u] is given by

  EY_u = ∫_0^u Ā(t) dt.  (4.26)

Asymptotically, the (expected) portion of time the system is down equals the limiting unavailability, i.e.,

  lim_{u→∞} A_D(u) = lim_{u→∞} EY_u/u = Ā.  (4.27)

With probability one,

  lim_{u→∞} Y_u/u = Ā.  (4.28)


Proof. Using the definition of Y_u and Fubini's theorem we find that

  EY_u = E ∫_0^u (1 − Φ_t) dt = ∫_0^u E(1 − Φ_t) dt = ∫_0^u Ā(t) dt.

This proves (4.26). Formula (4.27) follows by using (4.26) and the limiting availability formula (4.2), p. 109. Alternatively, we can use the Renewal Reward Theorem (Theorem B.15, p. 280, in Appendix B), interpreting Y_u as a reward. From this theorem we can conclude that EY_u/u converges to the ratio of the expected downtime in a renewal cycle and the expected length of a cycle, i.e., to the limiting unavailability Ā. The Renewal Reward Theorem also proves (4.28). □
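Formula (4.26) can be checked directly in the memoryless case, where Ā(t) = (λ/(λ+ν))(1 − e^{−(λ+ν)t}) is a standard closed form for exponential up times (rate λ) and repair times (rate ν); the sketch below integrates it and compares with a Monte Carlo estimate of EY_u (parameter values illustrative).

```python
import math
import random

lam, nu = 1.0, 4.0          # failure rate and repair rate (illustrative)
u = 20.0
rng = random.Random(11)

def downtime(horizon):
    """Total downtime Y_u of one exponential up/down path on [0, horizon]."""
    t, down = 0.0, 0.0
    while t < horizon:
        t += rng.expovariate(lam)             # uptime
        if t >= horizon:
            break
        r = rng.expovariate(nu)               # repair
        down += min(r, horizon - t)           # clip a repair running past u
        t += r
    return down

mc = sum(downtime(u) for _ in range(20000)) / 20000

# (4.26) with Abar(t) = (lam/(lam+nu)) (1 - e^{-(lam+nu) t}):
abar = lam / (lam + nu)
exact = abar * (u - (1.0 - math.exp(-(lam + nu) * u)) / (lam + nu))
print(mc, exact)
```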

Now we look into the problem of finding formulas for the downtime distribution.

Let N^op_s denote the number of system failures after s units of operational time, i.e.,

  N^op_s = Σ_{n=1}^∞ I( Σ_{k=1}^n T_k ≤ s ).

Note that

  N^op_s ≥ n ⟺ Σ_{k=1}^n T_k ≤ s,  n ∈ N.  (4.29)

Let Z_s denote the total downtime associated with the operating time s, but not including s, i.e.,

  Z_s = Σ_{i=1}^{N^op_{s−}} R_i,

where

  N^op_{s−} = lim_{u→s−} N^op_u.

Define

  C_s = s + Z_s.

We see that C_s represents the calendar time after an operation time of s time units and the completion of the repairs associated with the failures that occurred up to, but not including, s.


Theorem 4.14. The distribution of the downtime in a time interval [0, u] is given by

  P(Y_u ≤ y) = Σ_{n=0}^∞ G^{*n}(y) P(N^op_{u−y} = n)  (4.30)
            = Σ_{n=0}^∞ G^{*n}(y) [F^{*n}(u − y) − F^{*(n+1)}(u − y)].  (4.31)

Proof. To prove the theorem we first argue that

  P(Y_u ≤ y) = P(C_{u−y} ≤ u) = P(u − y + Z_{u−y} ≤ u) = P(Z_{u−y} ≤ y).

The first equality follows by noting that the event Y_u ≤ y is equivalent to the event that the uptime in the interval [0, u] is equal to or longer than u − y. This means that the point in time when the total uptime of the system equals u − y must occur before or at u, i.e., C_{u−y} ≤ u. Now using a standard conditional probability argument it follows that

  P(Z_{u−y} ≤ y) = Σ_{n=0}^∞ P(Z_{u−y} ≤ y | N^op_{(u−y)−} = n) P(N^op_{(u−y)−} = n)
                = Σ_{n=0}^∞ G^{*n}(y) P(N^op_{(u−y)−} = n)
                = Σ_{n=0}^∞ G^{*n}(y) P(N^op_{u−y} = n).

We have used that the repair times are independent of the process N^op_s and that F is continuous. This proves (4.30). Formula (4.31) follows by using (4.29). □

In the case that F is exponential with failure rate λ, the following simple bounds apply:

  e^{−λ(u−y)} [1 + λ(u − y) G(y)] ≤ P(Y_u ≤ y) ≤ e^{−λ(u−y)[1−G(y)]}.

The lower bound follows by including only the first two terms of the sum in (4.30), observing that N^op_t is Poisson distributed with mean λt, whereas the upper bound follows by using (4.30) and the inequality

  G^{*n}(y) ≤ (G(y))^n.

In the case that the interval is rather long, the downtime will be approximately normally distributed, as is shown in Theorem 4.15 below.
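The bounds above are easy to evaluate for a concrete repair distribution. For degenerate repairs of fixed length r (so G^{*n}(y) = I(nr ≤ y)), (4.30) reduces to a finite Poisson sum; the sketch below compares it with the two bounds, with all parameter values chosen for illustration.

```python
import math

lam, u, y, r = 0.5, 10.0, 2.0, 0.8   # Exp(lam) uptimes, fixed repairs of length r
m = lam * (u - y)                     # N^op_{u-y} is Poisson with mean m

def pois(n):
    return math.exp(-m) * m ** n / math.factorial(n)

# For G = point mass at r, G^{*n}(y) = 1(n r <= y), so (4.30) is a finite sum.
exact = sum(pois(n) for n in range(int(y / r) + 1))
G_y = 1.0 if y >= r else 0.0
lower = math.exp(-m) * (1.0 + m * G_y)          # first two terms of (4.30)
upper = math.exp(-m * (1.0 - G_y))              # via G^{*n}(y) <= G(y)^n
print(lower, exact, upper)
```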


Fig. 4.2. Time evolution of a failure and repair process for a one-component system starting at time t = 0 in the failure state

Theorem 4.15. The asymptotic distribution of Y_u as u → ∞ is given by

  √u ( Y_u/u − Ā ) →^D N(0, τ²),  (4.32)

where

  τ² = (μ²_F σ²_G + μ²_G σ²_F) / (μ_F + μ_G)³.

Proof. The result follows by applying Theorem B.17, p. 280, in Appendix B, observing that the length of the first renewal cycle equals S°_1 = T_1 + R_1, the downtime in this cycle equals Y_{S°_1} = R_1, and

  Var[R_1 − Ā S°_1] / ES°_1 = Var[R_1 A − T_1 Ā] / ES°_1
                            = (A² Var[R_1] + Ā² Var[T_1]) / (μ_F + μ_G)
                            = (μ²_F σ²_G + μ²_G σ²_F) / (μ_F + μ_G)³. □

4.2.4 Steady-State Distribution

The asymptotic results established above provide good approximations for theperformance measures related to a given point in time or an interval. Basedon the asymptotic values we can define a stationary (steady-state) processhaving these asymptotic values as their distributions and means. To definesuch a process in our case, we generalize the model analyzed above by allowingX0 to be 0 or 1.

Thus the time evolution of the process is as shown in Fig. 4.2 or as shownin Fig. 4.1 (p. 107) beginning with an uptime. The process is characterizedby the parameters A(0), F ∗(t), F (t), G∗(t), G(t), where F ∗(t) denotes thedistribution of the first uptime provided that the system starts in state 1 attime 0 (i.e., X0 = 1) and G∗(t) denotes the distribution of the first downtime

120 4 Availability Analysis of Complex Systems

provided that the system starts in state 0 at time 0 (i.e., X0 = 0). Nowassuming that F ∗(t) and G∗(t) are equal to the asymptotic distributions ofthe recurrence times, i.e., F∞(t) and G∞(t), respectively, and A(0) = A, thenit can be shown that the process (Xt, αt) is stationary; see Birolini [44]. Thismeans that we have, for example,

A(t) = A, ∀t ∈ R+,

A[u, u + w] = (∫_w^∞ F̄(x) dx)/(μF + μG), ∀u, w ∈ R+,

M(u, u + w] = w/(μF + μG), ∀u, w ∈ R+.

4.3 Point Availability and Mean Number of System Failures

Consider now a monotone system comprising n independent components. For each component we define a model as in Sect. 4.2, indexed by "i". The uptimes and downtimes of component i are thus denoted Tik and Rik with distributions Fi and Gi, respectively. The lifetime distribution Fi is absolutely continuous with a failure rate function λi(t). The process (Nt) refers now to the number of system failures, whereas (Nt(i)) counts the number of failures of component i. The counting process (Nt(i)) has intensity process (ηt(i)) = (λi(βt(i))Xt(i)), where (Xt(i)) equals the state process of component i and (βt(i)) the backward recurrence time of component i. The mean of (Nt(i)) is denoted Mi(t), whereas the mean of the renewal process having interarrival times Tik + Rik, k ∈ N, is denoted M°i(t). If the process (Xt) is regenerative, we denote the consecutive cycle lengths S1, S2, . . .. We write S in place of S1. Remember that a stochastic process (Xt) is called regenerative if there exists a finite random variable S such that the process beyond S is a probabilistic replica of the process starting at 0. The precise definition is given in Appendix B, p. 281.

In the following we establish results similar to those obtained in the previous section. Some results are quite easy to generalize to monotone systems, others are extremely difficult. Simplifications and approximate methods are therefore sought. First we look at the point availability.

4.3.1 Point Availability

The following results show that the point availability (limiting availability) of a monotone system is equal to the reliability function h with the component reliabilities replaced by the component availabilities Ai(t) (Ai).

Theorem 4.16. The system availability at time t, A(t), and the limiting system availability, limt→∞ A(t), are given by


A(t) = h(A1(t), A2(t), . . . , An(t)) = h(A(t)), (4.33)

limt→∞A(t) = h(A1, A2, . . . , An) = h(A). (4.34)

Proof. Formula (4.33) is simply an application of the reliability function formula (2.2), see p. 21, with Ai(t) = P(Xt(i) = 1). Since the reliability function h(p) is a linear function in each pi (see Sect. 2.1, p. 25), and therefore a continuous function, it follows that A(t) → h(A1, A2, . . . , An) as t → ∞, which proves (4.34). □

The limiting system availability can also be interpreted as the expected proportion of time the system is operating in the long run, or as the long-run average availability, noting that

limt→∞ E[(1/t) ∫₀^t Φs ds] = limt→∞ (1/t) ∫₀^t A(s) ds = limt→∞ A(t).
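As an illustration of Theorem 4.16, the following sketch (our own example; the 2-out-of-3 structure, the parameter values, and all names are illustrative assumptions) computes h by enumerating component states. For the component availabilities it uses the familiar point availability of a two-state Markov component with exponential up- and downtimes started in the up state, Ai(t) = A + (1 − A)e^{−(λ+μ)t} with failure rate λ = 1/μF and repair rate μ = 1/μG.

```python
import math
from itertools import product

def h(phi, p):
    """Reliability function h(p) = E Phi(X) for independent binary components."""
    total = 0.0
    for x in product((0, 1), repeat=len(p)):
        w = 1.0
        for xi, pi in zip(x, p):
            w *= pi if xi else 1.0 - pi
        total += phi(x) * w
    return total

def phi_2_of_3(x):
    """Structure function of a 2-out-of-3 monotone system."""
    return 1 if sum(x) >= 2 else 0

def avail(t, mu_F, mu_G):
    """Point availability of a two-state Markov component started up at t = 0:
    A(t) = A + (1 - A) * exp(-(lam + mu) * t), A = mu_F / (mu_F + mu_G)."""
    lam, mu = 1.0 / mu_F, 1.0 / mu_G
    A = mu / (lam + mu)
    return A + (1.0 - A) * math.exp(-(lam + mu) * t)

mu_F, mu_G = 10.0, 1.0
A_t = [avail(1.0, mu_F, mu_G)] * 3            # component availabilities at t = 1
A_inf = [mu_F / (mu_F + mu_G)] * 3            # limiting availabilities
print("A(1)  =", h(phi_2_of_3, A_t))          # system availability at t = 1, cf. (4.33)
print("A(inf)=", h(phi_2_of_3, A_inf))        # limiting availability, cf. (4.34)
```

The brute-force enumeration in h is exponential in n and meant only to make the plug-in structure of (4.33) and (4.34) concrete.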

4.3.2 Mean Number of System Failures

We first state some results established in Sect. 3.3.2, cf. formula (3.18), p. 86. See also (4.17) and (4.18), p. 114.

Theorem 4.17. The expected number of system failures in [0, u] is given by

ENu = Σ_{i=1}^n ∫₀^u [h(1i, A(t)) − h(0i, A(t))] dMi(t)  (4.35)

    = Σ_{i=1}^n ∫₀^u [h(1i, A(t)) − h(0i, A(t))] mi(t) dt

    = Σ_{i=1}^n ∫₀^u [h(1i, A(t)) − h(0i, A(t))] Eηt(i) dt,

where mi(t) is the renewal density function of Mi(t).

Corollary 4.18. If component i has constant failure rate λi, i = 1, 2, . . . , n, then

ENu = Σ_{i=1}^n ∫₀^u [h(1i, A(t)) − h(0i, A(t))] λi Ai(t) dt  (4.36)

    ≤ uλ,

where λ = Σ_{i=1}^n λi.

Next we will generalize the asymptotic results (4.23)–(4.25), p. 116.


Theorem 4.19. The expected number of system failures per unit of time is asymptotically given by

limu→∞ ENu/u = Σ_{i=1}^n [h(1i, A) − h(0i, A)]/(μFi + μGi),  (4.37)

limu→∞ EN(u, u+w]/w = Σ_{i=1}^n [h(1i, A) − h(0i, A)]/(μFi + μGi).  (4.38)

Furthermore, if the process X is a regenerative process having finite expected cycle length, i.e., ES < ∞, then with probability one,

limu→∞ Nu/u = Σ_{i=1}^n [h(1i, A) − h(0i, A)]/(μFi + μGi).  (4.39)

Proof. To prove these results, we make use of formula (4.35). Dividing this formula by u and using the Elementary Renewal Theorem (see Appendix B, p. 277), formula (4.37) can be shown to hold noting that E[Φ(1i, Xt) − Φ(0i, Xt)] → [h(1i, A) − h(0i, A)] as t → ∞. Let h*i(t) = E[Φ(1i, Xt) − Φ(0i, Xt)] and h*i its limit as t → ∞. Then we can write formula (4.35) divided by u in the following form:

Σ_{i=1}^n { h*i Mi(u)/u + (1/u) ∫₀^u [h*i(t) − h*i] dMi(t) }.

Hence, in view of the Elementary Renewal Theorem, formula (4.37) follows if

limu→∞ (1/u) ∫₀^u [h*i(t) − h*i] dMi(t) = 0.  (4.40)

But (4.40) is seen to hold true by Proposition B.14, p. 279, in Appendix B. The formula (4.38) is shown by writing

E[Nu+w − Nu] = Σ_{i=1}^n ∫_u^{u+w} E[Φ(1i, Xt) − Φ(0i, Xt)] dMi(t)

and using Blackwell's Theorem, see Theorem B.9, p. 278, in Appendix B.

If we assume that the process X is regenerative with ES < ∞, it follows from the theory of renewal reward processes (see Appendix B, p. 280) that with probability one, limu→∞ Nu/u exists and equals

limu→∞ ENu/u = ENS/ES.

Combining this with (4.37), we can conclude that (4.39) holds true, and the proof of the theorem is complete. □


Definition 4.20. The limit of ENu/u, given by formula (4.37), is referred to as the system failure rate and is denoted λΦ, i.e.,

λΦ = limu→∞ ENu/u = Σ_{i=1}^n [h(1i, A) − h(0i, A)]/(μFi + μGi).  (4.41)
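Formula (4.41) is straightforward to evaluate numerically for small systems. The sketch below is our own illustration (the helper names and the two-component parallel example are assumptions, not from the book): it computes h by enumeration and plugs the limiting component availabilities Ai = μFi/(μFi + μGi) into (4.41).

```python
from itertools import product

def h(phi, p):
    """Reliability function h(p) = E Phi(X) for independent binary components."""
    total = 0.0
    for x in product((0, 1), repeat=len(p)):
        w = 1.0
        for xi, pi in zip(x, p):
            w *= pi if xi else 1.0 - pi
        total += phi(x) * w
    return total

def system_failure_rate(phi, mu_F, mu_G):
    """lambda_Phi = sum_i [h(1_i, A) - h(0_i, A)] / (mu_Fi + mu_Gi), cf. (4.41)."""
    A = [f / (f + g) for f, g in zip(mu_F, mu_G)]
    rate = 0.0
    for i in range(len(A)):
        h1 = h(phi, A[:i] + [1.0] + A[i + 1:])   # h(1_i, A)
        h0 = h(phi, A[:i] + [0.0] + A[i + 1:])   # h(0_i, A)
        rate += (h1 - h0) / (mu_F[i] + mu_G[i])
    return rate

parallel = lambda x: 1 - (1 - x[0]) * (1 - x[1])
mu_F, mu_G = [20.0, 20.0], [1.0, 1.0]            # lambda_i = 0.05, mu_Gi = 1
lam_phi = system_failure_rate(parallel, mu_F, mu_G)
print(lam_phi)   # 2*mu_G/(mu_F + mu_G)^2 = 2/441, roughly 2*lambda^2 = 0.005
```

For this two-component parallel system the exact value is 2μG/(μF + μG)² = 2/441 ≈ 0.0045, which agrees with the 2λ²μG approximation for highly available components up to higher-order terms in λ.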

Remark 4.21. 1. Heuristically, the limit (4.37) can easily be established: in the interval (t, t + w), t large and w small, the probability that component i fails equals approximately w/(μFi + μGi), and this failure implies a system failure if Φ(1i, Xt) = 1 and Φ(0i, Xt) = 0, i.e., the system fails if component i fails. But the probability that Φ(1i, Xt) = 1 and Φ(0i, Xt) = 0 is approximately equal to h(1i, A) − h(0i, A), which gives the desired result.

2. At time t we can define a system failure rate λΦ(t) by

λΦ(t) = Σ_{i=1}^n [Φ(1i, Xt) − Φ(0i, Xt)] ηt(i),

cf. Sect. 3.3.2, p. 85. Since

EλΦ(t) = Σ_{i=1}^n [h(1i, A(t)) − h(0i, A(t))] mi(t),

where mi(t) denotes the renewal density of Mi(t), we see that EλΦ(t) → λΦ as t → ∞ provided that mi(t) → 1/(μFi + μGi). From renewal theory, see Theorem B.10, p. 278, in Appendix B, we know that if the renewal cycle lengths Tik + Rik have a density function h with h(t)^p integrable for some p > 1, and h(t) → 0 as t → ∞, then Mi has a density mi such that mi(t) → 1/(μFi + μGi) as t → ∞. See the remark following Theorem B.10 for other sufficient conditions for mi(t) → 1/(μFi + μGi) to hold. If component i has an exponential lifetime distribution with parameter λi, then mi(t) = λi Ai(t) (cf. (4.18), p. 114), which converges to 1/(μFi + μGi).

It is intuitively clear that the process X is regenerative if the components have exponential lifetime distributions. Before we prove this formally, we formulate a result related to EN°u: the expected number of visits to the best state (1, 1, . . . , 1) in [0, u]. The result is analogous to (4.35) and (4.37).

Lemma 4.22. The expected number of visits to state (1, 1, . . . , 1) in [0, u] is given by

EN°u = Σ_{i=1}^n ∫₀^u Π_{j≠i} Aj(t) dM°i(t).  (4.42)

Furthermore,

limu→∞ EN°u/u = Π_{j=1}^n Aj · Σ_{i=1}^n 1/μFi.  (4.43)


Proof. Formula (4.42) is shown by arguing as in the proof of (4.35) (cf. Sect. 3.3.2, p. 85), writing

EN°u = E[ Σ_{i=1}^n ∫₀^u Π_{j≠i} Xj(t) dN°t(i) ].

To show (4.43) we can repeat the proof of (4.37) to obtain

limu→∞ EN°u/u = Σ_{i=1}^n Π_{j≠i} Aj · 1/(μFi + μGi) = Π_{j=1}^n Aj · Σ_{i=1}^n 1/μFi.

This completes the proof of the lemma. □

The above result can be shown heuristically using the same type of arguments as in Remark 4.21. For highly available components we have Ai ≈ 1; hence the limit (4.43) is approximately equal to

Σ_{i=1}^n 1/μFi.

This is as expected, noting that the number of visits to state (1, 1, . . . , 1) per unit of time then should be approximately equal to the average number of component failures per unit of time. If a component fails, it will normally be repaired before any other component fails, and, consequently, the process again returns to state (1, 1, . . . , 1).

Theorem 4.23. If all the components have exponential lifetimes, then X is a regenerative process.

Proof. Because of the memoryless property of the exponential distribution and the fact that all component uptimes and downtimes are independent, we can conclude that X is regenerative (as defined in Appendix B, p. 281) if we can prove that P(S < ∞) = 1, where S = inf{t > S′ : Xt = (1, 1, . . . , 1)} and S′ = min{Ti1 : i = 1, 2, . . . , n}. It is clear that if X returns to the state (1, 1, . . . , 1), then the process beyond S is a probabilistic replica of the process starting at 0.

Suppose that P(S < ∞) < 1. Then there exists an ε > 0 such that P(S < ∞) ≤ 1 − ε. Now let τi be the point in time of the ith visit of X to the state (1, 1, . . . , 1), i.e., τ1 = S and for i ≥ 2,

τi = inf{t > τi−1 + S′i : Xt = (1, 1, . . . , 1)},

where S′i has the same distribution as S′. We define inf ∅ = ∞. Since τi < ∞ is equivalent to τk − τk−1 < ∞, k = 1, 2, . . . , i (τ0 = 0), we obtain

P(τi < ∞) = [P(S < ∞)]^i ≤ (1 − ε)^i.


For all t ∈ R+,

P(N°t ≥ i) ≤ P(τi < ∞),

and it follows that

EN°t = Σ_{i=1}^∞ P(N°t ≥ i) ≤ Σ_{i=1}^∞ (1 − ε)^i = (1 − ε)/(1 − (1 − ε)) = (1 − ε)/ε < ∞.

Consequently, EN°t/t → 0 as t → ∞. But this result contradicts (4.43), and therefore P(S < ∞) = 1. □

Under the given set-up the regenerative property only holds true if the lifetimes of the components are exponentially distributed. However, this can be generalized by considering phase-type distributions with an enlarged state space, which also includes the phases; see Sect. 4.7.1, p. 163.

4.4 Distribution of the Number of System Failures

In general, it is difficult to calculate the distribution of the number of system failures N(u, v]. Only in some special cases is it possible to obtain practical computational formulas, and in the following we look more closely at some of these.

If the repair times are small compared to the lifetimes and the lifetimes are exponentially distributed with parameter λi, then clearly the number of failures of component i in the time interval (u, u + w], Nu+w(i) − Nu(i), is approximately Poisson distributed with parameter λiw. If the system is a series system, and we make the same assumptions as above, it is also clear that the number of system failures in the interval (u, u + w] is approximately Poisson distributed with parameter Σ_{i=1}^n λiw. The number of system failures in [0, t], Nt, is approximately a Poisson process with intensity Σ_{i=1}^n λi.

If the system is highly available and the components have constant failure rates, the Poisson distribution (with the asymptotic rate λΦ) will in fact also produce good approximations for more general systems. As motivation, we observe that EN(u, u+w]/w is approximately equal to the asymptotic system failure rate λΦ, and N(u, u+w] is "nearly independent" of the history of N up to u, noting that the process X frequently restarts itself probabilistically, i.e., X re-enters the state (1, 1, . . . , 1).

Refer to [22, 82] for Monte Carlo simulation studies of the accuracy of the Poisson approximation. As an illustration of the results obtained in these studies, consider a parallel system of two identical components where the failure rate λ is equal to 0.05, the repair times are all equal to 1, and the expected number of system failures is equal to 5. This means, as shown below, that the time interval is about 1,000 and the expected number of component failures is about 100. Using the definition of the system failure rate λΦ (cf. (4.41), p. 123) with μG = 1, we obtain

ENu/u = 5/u ≈ λΦ = 2Ā1 · 1/(μF1 + μG1) = 2μG/(1/λ + μG)² ≈ 2λ² = 0.005.

Hence u ≈ 1,000 and 2ENu(i) ≈ 2λu ≈ 100. Clearly, this is an approximate steady-state situation, and we would expect that the Poisson distribution gives an accurate approximation. The Monte Carlo simulations in [22] confirm this. The distance measure, which is defined as the maximum distance between the Poisson distribution (with mean λΦu) and the "true" distribution obtained by Monte Carlo simulation, is equal to 0.006. If we take instead λ = 0.2 and ENu = 0.2, we find that the expected number of component failures is about 1. Thus, we are far away from a steady-state situation and, as expected, the distance measure is larger: 0.02. But still the Poisson approximation produces relatively accurate results.
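A distance measure of this kind can be reproduced with a small Monte Carlo sketch. The code below is our own and not the code used in [22]; the seed, replication count, and function names are arbitrary assumptions. It simulates the two-component parallel system (exponential lifetimes, unit repair times), counts onsets of "both components down" on [0, u], and compares the empirical distribution of the count with the Poisson distribution with mean λΦu.

```python
import math
import random

def down_intervals(lam, repair, u, rng):
    """Down intervals on [0, u] for one component: exp(lam) uptimes, fixed repairs."""
    t, ivals = 0.0, []
    while t < u:
        t += rng.expovariate(lam)                 # uptime
        if t >= u:
            break
        ivals.append((t, min(t + repair, u)))
        t += repair
    return ivals

def system_failures(lam, repair, u, rng):
    """Number of onsets of 'both components down' (parallel system failures)."""
    a = down_intervals(lam, repair, u, rng)
    b = down_intervals(lam, repair, u, rng)
    count, j = 0, 0
    for s, e in a:
        while j < len(b) and b[j][1] <= s:        # skip b-intervals ending before s
            j += 1
        k = j
        while k < len(b) and b[k][0] < e:         # each overlap is one system failure
            count += 1
            k += 1
    return count

def poisson_cdf(m, k):
    term, s = math.exp(-m), 0.0
    for i in range(k + 1):
        s += term
        term *= m / (i + 1)
    return s

lam, repair, u, reps = 0.05, 1.0, 1000.0, 2000
rng = random.Random(7)
counts = [system_failures(lam, repair, u, rng) for _ in range(reps)]
lam_phi = 2 * repair / (1 / lam + repair) ** 2    # system failure rate, cf. (4.41)
kmax = max(counts)
emp = [sum(1 for c in counts if c <= k) / reps for k in range(kmax + 1)]
dist = max(abs(emp[k] - poisson_cdf(lam_phi * u, k)) for k in range(kmax + 1))
print(lam_phi * u, dist)
```

With this sample size the reported distance also contains Monte Carlo noise of a few percent, so it is only indicative of the order of magnitude found in the published studies.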

In the following we look at the problem of establishing formalized asymptotic results for the distribution of the number of system failures. We first consider the interval reliability.

4.4.1 Asymptotic Analysis for the Time to the First System Failure

The above discussion indicates that the interval reliability A[0, u], defined by A[0, u] = P(Nu = 0), is approximately exponential for highly available systems comprising components with exponentially distributed lifetimes. This result can also be formulated as a limiting result as shown in the theorem below. It is assumed that the process X is a regenerative process with regenerative state (1, 1, . . . , 1). The variable S denotes the length of the first renewal cycle of the process X, i.e., the time until the process returns to state (1, 1, . . . , 1). Let TΦ denote the time to the first system failure and q the probability that a system failure occurs in a renewal cycle, i.e.,

q = P(NS ≥ 1) = P(TΦ < S).

For q ∈ (0, 1), let P0 and P1 denote the conditional probabilities given NS = 0 and NS ≥ 1, i.e., P0(·) = P(·|NS = 0) and P1(·) = P(·|NS ≥ 1). The corresponding expectations are denoted E0 and E1. Furthermore, let c²0S = [E0S²/(E0S)²] − 1 denote the squared coefficient of variation of S under P0.

The notation →P is used for convergence in probability and →D for convergence in distribution, cf. Appendix A, p. 248. We write Exp(t) for the exponential distribution with parameter t, Poisson(t) for the Poisson distribution with mean t, and N(μ, σ²) for the normal distribution with mean μ and variance σ².

For each component i (i ∈ {1, 2, . . . , n}) we assume that there is a sequence of uptime and downtime distributions (Fij, Gij), j = 1, 2, . . ..

To simplify notation, we normally omit the index j. When assuming in the following that X is a regenerative process, it is tacitly understood for all j ∈ N. We shall formulate conditions which guarantee that αTΦ is asymptotically exponentially distributed with parameter 1, where α is a suitable normalizing "factor" (more precisely, a normalizing sequence depending on j). The following factors will be studied: q/E0S, q/ES, 1/ETΦ, and λΦ. These factors are asymptotically equivalent under the conditions stated in the theorem below, i.e., the ratio of any two of these factors converges to one as j → ∞. To motivate this, note that for a highly available system we have ETΦ ≈ E0S(1/q) ≈ ES(1/q), observing that E0S equals the expected length of a cycle having no system failures and 1/q equals the expected number of cycles until a system failure occurs (the number of such cycles is geometrically distributed with parameter q). We have E0S ≈ ES when q is small. Note also that

λΦ = ENS/ES  (4.44)

by the Renewal Reward Theorem (Theorem B.15, p. 280, in Appendix B). For a highly available system we have ENS ≈ q and hence λΦ ≈ q/ES. Results from Monte Carlo simulations presented in [22] show that the factors q/E0S, q/ES, and 1/ETΦ typically give slightly better results (i.e., a better fit to the exponential distribution) than the system failure rate λΦ. From a computational point of view, however, λΦ is much more attractive than the other factors, which are in most cases quite difficult to compute. We therefore normally use λΦ as the normalizing factor.

The basic idea of the proof of the asymptotic exponentiality of αTΦ is as follows: if we assume that X is a regenerative process and the probability that a system failure occurs in a renewal cycle, i.e., q, is small (converges to zero), then the time to the first system failure will be approximately equal to the sum of a number of renewal cycles having no system failures; and this number of cycles is geometrically distributed with parameter q. Now if q → 0 as j → ∞, the desired result follows by using Laplace transforms. The result can be formulated in general terms as shown in the lemma below.

Note that series systems are excluded since such systems have q = 1. We will analyze series systems later in this section; see Theorem 4.35, p. 143.

Lemma 4.24. Let S, Si, i = 1, 2, . . ., be a sequence of non-negative i.i.d. random variables with distribution function F(t) having finite mean a, a > 0, and finite variance, and let ν be a random variable independent of (Si), geometrically distributed with parameter q (0 < q ≤ 1), i.e., P(ν = k) = qp^{k−1}, k = 1, 2, . . . , p = 1 − q. Furthermore, let


S* = Σ_{i=1}^{ν−1} Si.

Consider now a sequence Fj, qj (j = 1, 2, . . .) satisfying the above conditions for each j. Then if (as j → ∞)

q → 0  (4.45)

and

qc²S → 0,  (4.46)

where c²S denotes the squared coefficient of variation of S, we have (as j → ∞)

qS*/a →D Exp(1).  (4.47)

Proof. Let S̄ = qS*/a. By conditioning on the value of ν, it is seen that the Laplace transform of S*, LS*(x) = Ee^{−xS*}, equals q/[1 − pL(x)], where L(x) is the Laplace transform of Si. Let ψ(x) = [L(x) − 1 + ax]/x. Then

LS*(x) = q/[1 − p(1 − ax + xψ(x))].

We need to show that

LS̄(x) = Ee^{−(qx/a)S*} → 1/(1 + x),

since the convergence theorem for Laplace transforms then gives the desired result. Noting that

Ee^{−(qx/a)S*} = 1/[1 + px − (px/a)ψ(qx/a)],

we must require that

(x/a)ψ(qx/a) → 0,

i.e.,

[L(qx/a) − 1 + qx]/q → 0.

Using ES = a and the inequalities 0 ≤ e^{−t} − 1 + t ≤ t²/2, we find that

0 ≤ [L(qx/a) − 1 + qx]/q = E[e^{−(qx/a)S} − 1 + (qx/a)S]/q ≤ E[((qx/a)S)²]/(2q) = (x²/2)(q/a²)ES² = (x²/2) q(1 + c²S).


The desired conclusion (4.47) follows now since q → 0 and qc²S → 0 (assumptions (4.45) and (4.46)). □
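Lemma 4.24 lends itself to a quick numerical check. The sketch below is our own illustration (the uniform cycle-length distribution, q = 0.01, the seed, and all names are arbitrary assumptions): it draws S* = Σ_{i=1}^{ν−1} Si with ν geometric with parameter q and compares qS*/a with the Exp(1) distribution via the sample mean and a tail frequency.

```python
import random

def geometric_sum(q, draw_cycle, rng):
    """S* = sum of nu - 1 i.i.d. cycle lengths, nu geometric with parameter q."""
    total = 0.0
    while rng.random() >= q:          # the run ends with probability q per cycle
        total += draw_cycle(rng)
    return total

q, a = 0.01, 1.5                      # a = mean of Uniform(1, 2) cycle lengths
rng = random.Random(3)
draw = lambda r: r.uniform(1.0, 2.0)
z = [q * geometric_sum(q, draw, rng) / a for _ in range(5000)]
mean_z = sum(z) / len(z)
tail = sum(1 for v in z if v > 1.0) / len(z)
print(mean_z, tail)                   # Exp(1) target: mean 1, P(Z > 1) = e^{-1}
```

Note that the exact mean of qS*/a is 1 − q, so for small q the agreement with the Exp(1) limit is already close at moderate sample sizes; here qc²S is tiny, so condition (4.46) clearly holds along any sequence with q → 0.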

Theorem 4.25. Assume that X is a regenerative process, and that Fij and Gij change in such a way that the following conditions hold (as j → ∞):

q → 0,  (4.48)

qc²0S → 0,  (4.49)

q E1S/ES → 0,  (4.50)

E1(NS − 1) → 0.  (4.51)

Then

A[0, u/λΦ] → e^{−u}, i.e., λΦTΦ →D Exp(1).  (4.52)

Proof. Using Lemma 4.24, we first prove that under conditions (4.48)–(4.50) we have

(q/E0S) TΦ →D Exp(1).  (4.53)

Let ν denote the renewal cycle index associated with the time of the first system failure, TΦ. Then it is seen that TΦ has the same distribution as

Σ_{k=1}^{ν−1} S0k + Wν,

where (S0k) and (Wk) are independent sequences of i.i.d. random variables with

P(S0k ≤ s) = P0(S ≤ s)

and

P(Wk ≤ w) = P1(TΦ ≤ w).

Both sequences are independent of ν, which has a geometric distribution with parameter q = P(NS ≥ 1). Hence, (4.53) follows from Lemma 4.24 provided that

(q/E0S) Wν →P 0.  (4.54)

By a standard conditional probability argument it follows that

ES = (1− q)E0S + qE1S,


and by noting that

q E1TΦ/E0S = q EW/E0S ≤ q E1S/E0S = q E1S(1 − q)/(ES − q E1S) = (1 − q)(q E1S/ES)/(1 − q E1S/ES) → 0,  (4.55)

we see that (4.54) holds.

Using (4.44) we obtain

λΦ/(q/E0S) = [λΦ/(q/ES)] (E0S/ES) = [(ENS/ES)/(q/ES)] (E0S/ES) = (ENS/q)(E0S/ES).

Now ENS/q = 1 + E1(NS − 1) → 1 in view of (4.51), and

E0S/ES = (1 − q E1S/ES)/(1 − q) → 1

by (4.48) and (4.50). Hence the ratio of λΦ and q/E0S converges to 1. Combining this with (4.53), the conclusion of the theorem follows. □

Remark 4.26. The above theorem shows that

αTΦ →D Exp(1)

for α equal to λΦ. But the result also holds for the normalizing factors q/E0S, q/ES, and 1/ETΦ. For q/E0S and q/ES this is seen from the proof of the theorem. To establish the result for 1/ETΦ, let

S* = Σ_{i=1}^{ν−1} S0i.

Then ES* = E0S(1 − q)/q, observing that the mean of ν equals 1/q. It follows that

ETΦ = E0S(1 − q)/q + E1TΦ,

which can be rewritten as

q ETΦ/E0S = 1 − q + q E1TΦ/E0S.

We see that the right-hand side of this expression converges to 1, remembering (4.48), (4.50), and (4.55). Hence, 1/ETΦ is also a normalizing factor. Note that the condition (4.51) is not required if the normalizing factor equals either q/E0S, q/ES, or 1/ETΦ.

We can conclude that the ratio between any of these normalizing factors converges to one if the conditions of the theorem hold true.


4.4.2 Some Sufficient Conditions

It is intuitively clear that if the components have constant failure rates, and the component unavailabilities converge to zero, then the conditions of Theorem 4.25 would hold. In Theorems 4.27 and 4.30 below this result will be formally established. We assume, for the sake of simplicity, that no single component is in series with the rest of the system. If there are one or more components in series with the rest of the system, we know that the time to failure of these components has an exact exponential distribution, and by independence it is straightforward to establish the limiting distribution of the total system.

Define

d = Σ_{i=1}^n λi μGi,  λ = Σ_{i=1}^n λi.

Theorem 4.27. Assume that the system has no components in series with the rest of the system, i.e., Φ(0i, 1) = 1 for i = 1, 2, . . . , n. Furthermore, assume that component i has an exponential lifetime distribution with failure rate λi > 0, i = 1, 2, . . . , n.

If d → 0 and there exist constants c1 and c2 such that λi ≤ c1 < ∞ and ER²i ≤ c2 < ∞ for all i, then the conditions (4.48), (4.49), and (4.50) of Theorem 4.25 are met, and, consequently, αTΦ →D Exp(1) for α equal to q/E0S, q/ES, or 1/ETΦ.

Proof. As will be shown below, it is sufficient to show that q → 0 holds (condition (4.48)) and that there exists a finite constant c such that

λ²E(S″)² ≤ c,  (4.56)

where S″ represents the "busy" period of the renewal cycle, which equals the time from the first component failure to the next regenerative point, i.e., to the time when the process again visits state (1, 1, . . . , 1). (The term "busy period" is taken from queueing theory. In the busy period at least one component is under repair.) Let S′ be an exponentially distributed random variable with parameter λ representing the time to the first component failure. This means that we can write

S = S′ + S″.

Assume that we have already proved (4.56). Then this condition and (4.48) imply (4.50), noting that

q E1S/ES ≤ λq E1S = λ(q E1S′ + q E1S″) = q + λq E[S″ | NS ≥ 1] = q + λE[S″ I(NS ≥ 1)] ≤ q + λq^{1/2}[E(S″)²]^{1/2} = q + q^{1/2}[λ²E(S″)²]^{1/2},

where the last inequality follows from Schwarz's inequality. Furthermore, condition (4.56) together with (4.48) imply (4.49), noting that

c²0S ≤ E0S²/(E0S)² ≤ λ²E0S² = λ²E[S²I(NS = 0)]/(1 − q) ≤ λ²ES²/(1 − q) = λ²{E(S′)² + E(S″)² + 2E[S′S″]}/(1 − q) ≤ λ²{(2/λ²) + E(S″)² + 2(E(S′)²E(S″)²)^{1/2}}/(1 − q) = {2 + λ²E(S″)² + 2·2^{1/2}(λ²E(S″)²)^{1/2}}/(1 − q),

where we again have used Schwarz's inequality. Alternatively, an upper bound on E[S′S″] can be established using that S′ and S″ are independent:

E[S′S″] = ES′ES″ = (1/λ)ES″ ≤ (1/λ){E(S″)²}^{1/2}.

Now, to establish (4.48), we note that with probability λ̄i = λi/λ, the busy period begins at the time of the failure of component i. If, in the interval of repair of this component, none of the remaining components fails, then the busy period comes to an end when the repair is completed. Therefore, since there are no components in series with the rest of the system,

1 − q ≥ Σ_{i=1}^n λ̄i ∫₀^∞ e^{−t(λ−λi)} dGi(t),

where Gi is the distribution of the repair time of component i. Hence,

q ≤ Σ_{i=1}^n λ̄i ∫₀^∞ [1 − e^{−t(λ−λi)}] dGi(t) ≤ Σ_{i=1}^n λi ∫₀^∞ t dGi(t) = d.

Consequently, d → 0 implies q → 0.


It remains to show (4.56). Clearly, the busy period will only increase if we assume that the flow of failures of component i is a Poisson flow with parameter λi, i.e., we adjoin failures that arise according to a Poisson process on intervals of repair of component i, assuming that repair begins immediately for each failure. This means that the process can be regarded as an M/G/∞ queueing process, where the Poisson input flow has parameter λ and there are an infinite number of devices with servicing time distributed according to the law

G(t) = Σ_{i=1}^n λ̄i Gi(t).

Note that the probability that a "failure is due to component i" equals λ̄i. It is also clear that the busy period increases still more if, instead of an infinite number of servicing devices, we take only one, i.e., the process is a queueing process M/G/1. Thus, E(S″)² ≤ E(S̃″)², where S̃″ is the busy period in a single-line system with a Poisson input flow λ and servicing distribution G(t). It is a well-known result from the theory of queueing processes (and branching processes) that the second-order moment of the busy period (extinction time) equals ER²G/(1 − λERG)³, where RG is the service time having distribution G, see, e.g., [80]. Hence, by introducing d2 = Σ_{i=1}^n λi ER²i we obtain

λ²E(S″)² ≤ λd2/(1 − d)³ ≤ n²c1²c2/(1 − d)³.

The conclusion of the theorem follows. □

We now give sufficient conditions for E1(NS − 1) → 0 (assumption (4.51) in Theorem 4.25).

We define

μi = sup_{0≤t<t*} E[Ri1 − t | Ri1 > t],

where t* = sup{t ∈ R+ : Ḡi(t) > 0}. We see that μi expresses the maximum expected residual repair time of component i. We might have μi = ∞, but we shall in the following restrict attention to the finite case. We know from Sect. 2.2, p. 37, that if Gi has the NBUE property, then

μi ≤ μGi.

If the repair times are bounded by a constant c, i.e., P(Rik ≤ c) = 1, then μi ≤ c. Let

μ = Σ_{i=1}^n μi.

Lemma 4.28. Assume that the lifetime of component i is exponentially distributed with failure rate λi, i = 1, 2, . . . , n. Then

P1(NS ≥ k) ≤ (λμ)^{k−1}, k = 2, 3, . . . .  (4.57)


Proof. The lemma will be shown by induction. We first prove that (4.57) holds true for k = 2. Suppose the first system failure occurs at time t. Let Lt denote the number of component failures after t until all components are again functioning for the first time. Furthermore, let Rit denote the remaining repair time of component i at time t (put Rit = 0 if component i is functioning at time t). Finally, let Vt = max_i Rit and let GVt(v) denote the distribution function of Vt. Note that Lt ≥ 1 implies that at least one component must fail in the interval (t, t + Vt) and that the probability of at least one component failure in this interval increases if we replace the failed components at t by functioning components. Using these observations and the inequality 1 − e^{−x} ≤ x, we obtain

P(Lt ≥ 1) = ∫₀^∞ P(Lt ≥ 1 | Vt = v) dGVt(v) ≤ ∫₀^∞ (1 − e^{−λv}) dGVt(v) ≤ λ ∫₀^∞ v dGVt(v) = λEVt ≤ λE Σ_i Rit ≤ λμ.

Since NS ≥ 2 implies Lt ≥ 1, formula (4.57) is shown for k = 2 and P1 conditional on the event that the first system failure occurs at time t. Integrating over the failure time t, we obtain (4.57) for k = 2. Now assume that P1(NS ≥ k) ≤ (λμ)^{k−1} for a k ≥ 2. We must show that

P1(NS ≥ k + 1) ≤ (λμ)^k.

We have

P1(NS ≥ k + 1) = P1(NS ≥ k + 1 | NS ≥ k) P1(NS ≥ k) ≤ P1(NS ≥ k + 1 | NS ≥ k) · (λμ)^{k−1},

thus it remains to show that

P1(NS ≥ k + 1 | NS ≥ k) ≤ λμ.  (4.58)

Suppose that the kth system failure in the renewal cycle occurs at time t. Then if at least one more system failure occurs in the renewal cycle, there must be at least one component failure before all components are again functioning, i.e., Lt ≥ 1. Repeating the above arguments for k = 2, the inequality (4.58) follows. □

Remark 4.29. The inequality (4.57) states that the number of system failures in a renewal cycle, given that at least one system failure occurs, is bounded in distribution by a geometric random variable with parameter λμ (provided this quantity is less than 1).


Theorem 4.30. Assume that the system has no components in series with the rest of the system. Furthermore, assume that component i has an exponential lifetime distribution with failure rate λi > 0, i = 1, 2, . . . , n. If d′ → 0, where d′ = λμ, and there exist constants c1 and c2 such that λi ≤ c1 < ∞ and ER²i ≤ c2 < ∞ for all i, then the conditions (4.48)–(4.51) of Theorem 4.25 (p. 129) are all met, and, consequently, the limiting result (4.52) holds, i.e.,

λΦTΦ →D Exp(1).

Proof. Since d ≤ d′, it suffices to show that condition (4.51) holds under the given assumptions. But from (4.57) of Lemma 4.28 we have

E1(NS − 1) ≤ d′/(1 − d′),

and the desired result follows. □

The above results show that the time to the first system failure is approximately exponentially distributed with parameter q/E0S ≈ q/ES ≈ 1/ETΦ ≈ λΦ. For a system comprising highly available components, it is clear that P(Xt = 1) would be close to one; hence the above approximations for the interval reliability can also be used for an interval (t, t + u].

4.4.3 Asymptotic Analysis of the Number of System Failures

For a highly available system, the downtimes will be small compared to the uptimes, and the time from when the system has failed until it returns to the state (1, 1, . . . , 1) will also be small. Hence, the above results also justify the Poisson process approximation for N. More formally, it can be shown that Nt/α converges in distribution to a Poisson distribution under the same assumptions as the first system failure time converges to the exponential distribution. Let T*Φ(k) denote the time between the (k − 1)th and the kth system failure. From this sequence we define an associated sequence TΦ(k) of i.i.d. variables, distributed as TΦ, by letting TΦ(1) = T*Φ(1), TΦ(2) be equal to the time to the first system failure following the first regenerative point after the first system failure, etc. Then it is seen that

TΦ(1) + TΦ(2)(1 − I(N(1) ≥ 2)) ≤ T*Φ(1) + T*Φ(2) ≤ TΦ(1) + TΦ(2) + Sν,

where N(1) equals the number of system failures in the first renewal cycle having one or more system failures, and Sν equals the length of this cycle (ν denotes the renewal cycle index associated with the time of the first system failure). For α being one of the normalizing factors (i.e., q/E0S, q/ES, 1/ETΦ, or λΦ), we will prove that αTΦ(2)I(N(1) ≥ 2) converges in probability to zero. It is sufficient to show that P(N(1) ≥ 2) → 0, noting that

P(αTΦ(2)I(N(1) ≥ 2) > ε) ≤ P(N(1) ≥ 2).


But

P(N(1) ≥ 2) = P1(NS ≥ 2) ≤ E1(NS − 1),

where the last expression converges to zero in view of (4.51), p. 129. The distribution of Sν is the same as the conditional distribution of the cycle length given that a system failure occurs in the cycle, cf. Theorem 4.25 and its proof. Thus, if (4.48)–(4.51) hold, it follows that α(T*Φ(1) + T*Φ(2)) converges in distribution to the sum of two independent exponentially distributed random variables with parameter 1, i.e.,

P(Nt/α ≥ 2) = P(α(T*Φ(1) + T*Φ(2)) ≤ t) → 1 − e^{−t} − te^{−t}.

Similarly, we establish the general distribution. We summarize the result in the following theorem.

Theorem 4.31. Assume that X is a regenerative process, and that Fij and Gij change in such a way that (as j → ∞) the conditions (4.48)–(4.51) hold. Then (as j → ∞)

Nt/α →D Poisson(t),  (4.59)

where α is a normalizing factor that equals either q/E0S, q/ES, 1/ETΦ, or λΦ.

Results from Monte Carlo simulations [22] indicate that the asymptotic system failure rate λΦ is normally preferable as the parameter in the Poisson distribution when the expected number of system failures is not too small (i.e., not less than one). When the expected number of system failures is small, the factor 1/ETΦ gives slightly better results. The system failure rate is, however, easier to compute.

Asymptotic Normality

Now we turn to a completely different way to approximate the distribution of Nt. Above, the uptime and downtime distributions are assumed to change such that the system availability increases, and after a time rescaling Nt converges to a Poisson variable. Now we leave the uptime and downtime distributions unchanged and establish a central limit theorem as t increases to infinity. The theorem generalizes (4.16), p. 114.

Theorem 4.32. If X is a regenerative process with cycle length S, Var[S] < ∞ and Var[N_S] < ∞, then as t → ∞,

√t ( (N_{u+t} − N_u)/t − λ_Φ ) →^D N(0, γ²_Φ),

where

γ²_Φ ES = Var[N_S − λ_Φ S].  (4.60)

4.4 Distribution of the Number of System Failures 137

Proof. Noting that the system failure rate λ_Φ is given by

λ_Φ = EN_S / ES,  (4.61)

the result follows from Theorem B.17, p. 280, in Appendix B. □

Below we argue that if the system failure rate is small, then we have

γ²_Φ ≈ λ_Φ.

We obtain

γ²_Φ = Var[N_S − λ_Φ S]/ES = E(N_S − λ_Φ S)²/ES ≈ EN_S²/ES ≈ EN_S/ES = λ_Φ,

where the last approximation follows by observing that if the system failure rate is small, then N_S is, with probability close to one, equal to the indicator function I(N_S ≥ 1). More formally, it is possible to show that under certain conditions, γ²_Φ/λ_Φ converges to one. We formulate the result in the following proposition.

Proposition 4.33. Assume X is a regenerative process with cycle length S and that F_{ij} and G_{ij} change in such a way that conditions (4.48)–(4.50) of Theorem 4.25 (p. 129) hold (as j → ∞). Furthermore, assume that (as j → ∞)

E_1(N_S − 1)² → 0  (4.62)

and

qc²_S → 0,  (4.63)

where c²_S denotes the squared coefficient of variation of S. Then (as j → ∞)

γ²_Φ/λ_Φ → 1.

Proof. Using (4.60) and writing N in place of N_S we get

γ²_Φ/λ_Φ = E(N − λ_Φ S)² / (λ_Φ ES)
= [q^{−1}EN² + q^{−1}(λ_Φ)²ES² − 2q^{−1}λ_Φ E[NS]] / (q^{−1}λ_Φ ES)
= [E_1N² + q^{−1}(λ_Φ)²ES² − 2q^{−1}λ_Φ E[NS]] / (q^{−1}λ_Φ ES).

Since the denominator converges to 1 (the denominator equals the ratio between two normalizing factors), the result follows if we can show that E_1N² converges to 1 and all the other terms of the numerator converge to zero. Writing

E_1N² = E_1[1 + (N − 1)]² = 1 + E_1(N − 1)² + 2E_1(N − 1)

and using condition (4.62), it is seen that E_1N² converges to 1. Now consider the term q^{−1}(λ_Φ)²ES². Using that λ_Φ = EN/ES (formula (4.61)) we obtain

q^{−1}(λ_Φ)²ES² = q^{−1}(EN/ES)²ES² = q^{−1}(EN)²{ES²/(ES)²}
= q(E_1N)²(1 + c²_S) = q[1 + E_1(N − 1)]²(1 + c²_S).

Letting q → 0 (condition (4.48)), and applying (4.62) and (4.63), we see that q^{−1}(λ_Φ)²ES² converges to zero. It remains to show that q^{−1}λ_Φ E[NS] converges to zero. But this is shown in the same way as the previous term, noting that

E[NS] ≤ (EN²)^{1/2}(ES²)^{1/2}

by Schwarz's inequality. This completes the proof of the proposition. □
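The convergence γ²_Φ/λ_Φ → 1 can also be observed numerically. The sketch below is our own illustration, not part of the text: it simulates regenerative cycles of a two-component parallel system (exponential lifetimes with rate lam, each component repaired independently with exponential rate mu, so that the cycle dynamics are Markovian), estimates λ_Φ = EN_S/ES and γ²_Φ = Var[N_S − λ_ΦS]/ES from the cycles, and prints their ratio.

```python
import random

def simulate_cycle(lam, mu, rng):
    """One regenerative cycle of a two-component parallel system:
    exponential lifetimes (rate lam), independent exponential repairs (rate mu).
    Returns (cycle length S, number of system failures N_S in the cycle)."""
    t = rng.expovariate(2 * lam)            # time until the first component failure
    n_failures = 0
    while True:
        repair = rng.expovariate(mu)        # repair of the failed component
        failure = rng.expovariate(lam)      # failure of the surviving component
        if repair < failure:
            return t + repair, n_failures   # back to state (1,1): cycle ends
        t += failure                        # both components down: system failure
        n_failures += 1
        t += rng.expovariate(2 * mu)        # down until the first repair completes

rng = random.Random(1)
lam, mu = 0.05, 1.0                         # highly available components
cycles = [simulate_cycle(lam, mu, rng) for _ in range(200_000)]
ES = sum(s for s, _ in cycles) / len(cycles)
lam_phi = sum(n for _, n in cycles) / len(cycles) / ES      # EN_S/ES, cf. (4.61)
gamma2 = sum((n - lam_phi * s) ** 2 for s, n in cycles) / len(cycles) / ES
print(lam_phi, gamma2, gamma2 / lam_phi)    # ratio tends to 1 as lam -> 0
```

With λμ_G = 0.05 the ratio comes out close to, but a little above, 1; decreasing lam moves it toward 1, in line with the proposition.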

Proposition 4.34. Under the same conditions as formulated in Theorem 4.30, p. 135, the following limiting result holds true (as j → ∞):

γ²_Φ/λ_Φ → 1.

Proof. It is sufficient to show that conditions (4.62) and (4.63) hold. Condition (4.62) follows by using that under P_1, N is bounded in distribution by a geometrically distributed random variable with parameter d′ = λμ, cf. (4.57) of Lemma 4.28, p. 133. Note that for a variable N that has a geometric distribution with parameter d′ we have

E(N − 1)² = Σ_{k=1}^∞ (k − 1)²(d′)^{k−1}(1 − d′) = d′(1 + d′)/(1 − d′)².

From this equality it follows that E_1(N_S − 1)² → 0 as d′ → 0. To establish (4.63) we can repeat the arguments in the proof of Theorem 4.27, p. 131, showing (4.49), observing that

c²_S ≤ ES²/(ES)² ≤ λ²ES². □
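The geometric-distribution computation used in this proof is easy to verify; the following small check (our addition, not from the text) compares a truncated version of the sum with the closed form d′(1 + d′)/(1 − d′)².

```python
def e_nminus1_sq(d, kmax=3000):
    # E(N-1)^2 for P(N = k) = d^(k-1) * (1-d), k = 1, 2, ...,
    # computed as a (truncated) sum
    return sum((k - 1) ** 2 * d ** (k - 1) * (1 - d) for k in range(1, kmax))

for d in (0.1, 0.3, 0.5):
    closed_form = d * (1 + d) / (1 - d) ** 2
    assert abs(e_nminus1_sq(d) - closed_form) < 1e-9
print("identity verified")   # and E(N-1)^2 -> 0 as d -> 0, as used above
```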

For a parallel system of two components it is possible to establish simple expressions for some of the above quantities, such as q and ET_Φ.


Parallel System of Two Identical Components

Consider a parallel system comprising two identical components having exponential life lengths with failure rate λ. Suppose one of the components has failed. Then a system failure occurs, i.e., the number of system failures in the cycle is at least 1 (N_S ≥ 1), if the operating component fails before the repair is completed. Consequently,

q = P(N_S ≥ 1) = ∫_0^∞ F(t) dG(t) = ∫_0^∞ (1 − e^{−λt}) dG(t),

where F(t) = P(T ≤ t) = 1 − e^{−λt} and G(t) = P(R ≤ t) equal the component lifetime and repair time distributions, respectively. It follows that

q ≤ ∫_0^∞ λt dG(t) = λμ_G.

Thus for a parallel system comprising two identical components, it is trivially verified that the convergence of λμ_G to zero implies that q → 0. From the Taylor formula we have 1 − e^{−x} = x − (1/2)x² + x³O(1), x → 0, where |O(1)| ≤ 1. Hence, if λμ_G → 0 and ER³/μ³_G is bounded by a finite constant, we have

q = λμ_G − (λ²/2)ER² + λ³ER³O(1)
= λμ_G − ((λμ_G)²/2)(1 + c²_G) + o((λμ_G)²),

where c²_G denotes the squared coefficient of variation of G, defined by c²_G = Var R/μ²_G. We can conclude that if λμ_G is small, then, comparing distributions G with the same mean, those with a large variance exhibit a small probability q.

If we instead apply the Taylor formula 1 − e^{−x} = x − x²O(1), we can write

q = λμ_G + o(λμ_G), λμ_G → 0.
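These expansions are easy to illustrate numerically. In the sketch below (our own example, not from the text), the repair time R is taken to be gamma distributed with shape k and scale θ, for which L_G(λ) = (1 + λθ)^{−k}, μ_G = kθ and c²_G = 1/k, so the exact q = 1 − L_G(λ) can be compared with the first- and second-order approximations.

```python
lam, k, theta = 0.01, 2.0, 0.5        # assumed Gamma(k, theta) repair times
muG, c2G = k * theta, 1.0 / k         # mean and squared coeff. of variation

q_exact = 1.0 - (1.0 + lam * theta) ** (-k)            # q = 1 - L_G(lam)
q_first = lam * muG                                    # q ~ lam * muG
q_second = lam * muG - (lam * muG) ** 2 / 2 * (1 + c2G)

print(q_exact, q_first, q_second)     # the second-order value is much closer
```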

For this example it is also possible to establish an explicit formula for E_0S. It is seen that

E_0S = E min{T_1, T_2} + E[R | R < T],

where T_1 and T_2 are the times to failure of components 1 and 2, respectively. But

E min{T_1, T_2} = 1/(2λ)

and

E[R | R < T] = E[RI(R < T)]/(1 − q) = ∫_0^∞ re^{−λr} dG(r) / (1 − q).


This gives

E_0S = 1/(2λ) + (1/(1 − q)) ∫_0^∞ re^{−λr} dG(r).

From the Taylor formula we have e^{−x} = 1 − xO(1), x → 0, where |O(1)| ≤ 1. Using this and noting that

∫_0^∞ re^{−λr} dG(r) = μ_G[1 + λμ_G(c²_G + 1)O(1)],

it can be shown that if the failure rate λ and the squared coefficient of variation c²_G are bounded by a finite constant, then the normalizing factor q/E_0S is asymptotically given by

q/E_0S = 2λ²μ_G + o(λμ_G), λμ_G → 0.

Now we will show that the system failure rate λ_Φ, defined by (4.41), p. 123, is also approximately equal to 2λ²μ_G. First note that the unavailability of a component, Ā, is given by Ā = λμ_G/(1 + λμ_G). It follows that

λ_Φ = 2Ā/(λ^{−1} + μ_G) = 2λ²μ_G + o(λμ_G), λμ_G → 0,  (4.64)

provided that the failure rate λ is bounded by a finite constant.

Next we will compute the exact distribution and mean of T_Φ. Let us denote this distribution by F_{TΦ}(t). In the following, F_X denotes the distribution of any random variable X and F_{iX}(t) = P_i(X ≤ t), i = 0, 1, where P_0(·) = P(·|N_S = 0) and P_1(·) = P(·|N_S ≥ 1). Observe that the length of a renewal cycle S can be written as S′ + S′′, where S′ represents the time to the first failure of a component, and S′′ represents the "busy" period, i.e., the time from when one component has failed until the process returns to the best state (1, 1). The variables S′ and S′′ are independent and S′ is exponentially distributed with rate 2λ. Now, assume a component has failed. Let R denote the repair time of this component and let T denote the time to failure of the operating component. Then

F_{1T}(t) = P(T ≤ t | T ≤ R) = (1/q) ∫_0^∞ (1 − e^{−λ(t∧r)}) dG(r),

where a ∧ b denotes the minimum of a and b. Furthermore,

F_{0R}(t) = P(R ≤ t | R < T) = (1/q̄) ∫_0^t e^{−λr} dG(r),

where q̄ = 1 − q. Now, by conditioning on whether a system failure occurs in the first renewal cycle or not, we obtain

F_{TΦ}(t) = qP(T_Φ ≤ t | N_S ≥ 1) + q̄P(T_Φ ≤ t | N_S = 0)
= qF_{1TΦ}(t) + q̄F_{0TΦ}(t).  (4.65)


To find an expression for F_{1TΦ}(t) we use a standard conditional probability argument, yielding

F_{1TΦ}(t) = ∫_0^t P_1(T_Φ ≤ t | S′ = s) dF_{S′}(s)
= ∫_0^t P(T ≤ t − s | T ≤ R) dF_{S′}(s)
= ∫_0^t F_{1T}(t − s) dF_{S′}(s).

Consider now F_{0TΦ}(t). By conditioning on S = s, we obtain

F_{0TΦ}(t) = ∫_0^t P_0(T_Φ ≤ t | S = s) dF_{0S}(s)
= ∫_0^t F_{TΦ}(t − s) dF_{0S}(s).

Inserting the above expressions into (4.65) gives

F_{TΦ}(t) = h(t) + q̄ ∫_0^t F_{TΦ}(t − s) dF_{0S}(s),

where

h(t) = q ∫_0^t F_{1T}(t − s) dF_{S′}(s).  (4.66)

Hence, F_{TΦ}(t) satisfies a renewal equation with the defective distribution q̄F_{0S}(s), and arguing as in the proof of Theorem B.2, p. 275, in Appendix B, it follows that

F_{TΦ}(t) = h(t) + ∫_0^t h(t − s) dM_0(s),  (4.67)

where the renewal function M_0(s) equals

M_0(s) = Σ_{j=1}^∞ q̄^j F_{0S}^{*j}(s).

Noting that F_{0S} = F_{S′} ∗ F_{0R}, that the Laplace transform of S′ equals 2λ/(2λ + v), that q̄ = L_G(λ), and that L_{F0R}(v) = L_G(v + λ)/L_G(λ), we see that the Laplace transform of M_0 takes the form

L_{M0}(v) = q̄(2λ/(2λ + v))L_{F0R}(v) / [1 − q̄(2λ/(2λ + v))L_{F0R}(v)]
= (2λ/(2λ + v))L_G(v + λ) / [1 − (2λ/(2λ + v))L_G(v + λ)].

It is seen that the Laplace transform of F_{1T} is given by

L_{F1T}(v) = [1/(1 − L_G(λ))] (1 − L_G(v + λ)) λ/(λ + v).


Now using (4.67) and (4.66) and the above expressions for the Laplace transforms, we obtain the following simple formula for L_{FTΦ}:

L_{FTΦ}(v) = (2λ²/(λ + v)) · (1 − L_G(v + λ)) / (v + 2λ(1 − L_G(v + λ))).

The mean ET_Φ can be found from this formula, or alternatively by using a direct renewal argument. We obtain

ET_Φ = ES′ + E(T_Φ − S′) = 1/(2λ) + E min{R, T} + (1 − q)ET_Φ,

noting that the time one component is down before system failure occurs or the renewal cycle terminates equals min{R, T}. If a system failure does not occur, the process starts over again. It follows that

ET_Φ = 1/(2qλ) + E min{R, T}/q.

Note that

E min{R, T} = ∫_0^∞ (1 − F(t))(1 − G(t)) dt = ∫_0^∞ e^{−λt}(1 − G(t)) dt.

It is also possible to write

ET_Φ = (3/(2λ)) · (1 − (2/3)L_G(λ)) / (1 − L_G(λ)).
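As a numerical illustration (ours, not from the text), assume exponential repair times with rate mu, so that L_G(λ) = μ/(μ + λ); the closed-form mean can then be compared with a direct simulation of the time to the first system failure.

```python
import random

def sample_T_phi(lam, mu, rng):
    """Time to the first system failure of a two-component parallel system
    with exponential lifetimes (rate lam) and exponential repairs (rate mu)."""
    t = 0.0
    while True:
        t += rng.expovariate(2 * lam)           # both up -> one component fails
        repair, fail = rng.expovariate(mu), rng.expovariate(lam)
        if fail < repair:
            return t + fail                     # second failure first: system down
        t += repair                             # repaired in time: back to (1,1)

rng = random.Random(2)
lam, mu = 0.2, 1.0
LG = mu / (mu + lam)                            # L_G(lam) for Exp(mu) repairs
formula = 3 / (2 * lam) * (1 - 2 * LG / 3) / (1 - LG)
mc = sum(sample_T_phi(lam, mu, rng) for _ in range(100_000)) / 100_000
print(formula, mc)                              # the two values should agree
```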

Now using the Taylor formula e^{−x} = 1 − xO(1), |O(1)| ≤ 1, we obtain

E min{R, T} = ∫_0^∞ e^{−λt}(1 − G(t)) dt = μ_G + λμ²_G(c²_G + 1)O(1),

where c²_G is the squared coefficient of variation of G. From this it can be shown that the normalizing factor 1/ET_Φ can be written in the same form as the other normalizing factors:

1/ET_Φ = 2λ²μ_G + o(λμ_G), λμ_G → 0,

assuming that λ and c²_G are bounded by a finite constant.

Asymptotic Analysis for Systems Having Components in Series with the Rest of the System

We now return to the general asymptotic analysis. Remember that d = Σλ_iμ_{Gi} and λ = Σλ_i. So far we have focused on nonseries systems (series systems have q = 1). Below we show that a series system also has a Poisson limit under the assumption that the lifetimes are exponentially distributed. We also formulate and prove a general asymptotic result for the situation where we have some components in series with the rest of the system. A component i is in series with the rest of the system if Φ(0_i, 1) = 0.

Theorem 4.35. Assume that Φ is a series system and the lifetimes are exponentially distributed. Let λ_i be the failure rate of component i. If d → 0 (as j → ∞), then (as j → ∞)

N_{t/λ} →^D Poisson(t).

Proof. Let N^P_t(i) be the Poisson process with intensity λ_i generated by the consecutive uptimes of component i. Then it is seen that

Σ_{i=1}^n N^P_{t/λ}(i) − D = N_{t/λ} ≤ Σ_{i=1}^n N^P_{t/λ}(i),

where

D = Σ_{i=1}^n N^P_{t/λ}(i) − N_{t/λ}.

We have D ≥ 0, and hence the conclusion of the theorem follows if we can show that ED → 0, since then D converges in probability to zero. Note that Σ_{i=1}^n N^P_{t/λ}(i) is Poisson distributed with mean

E Σ_{i=1}^n N^P_{t/λ}(i) = Σ_{i=1}^n (t/λ)λ_i = t.  (4.68)

From (4.36) of Corollary 4.18, p. 121, we have

EN_{t/λ} = Σ_{i=1}^n ∫_0^{t/λ} [h(1_i, A(s)) − h(0_i, A(s))] λ_i A_i(s) ds,

which gives

EN_{t/λ} = Σ_{i=1}^n ∫_0^{t/λ} ∏_{k≠i} A_k(s) λ_i A_i(s) ds = λ ∫_0^{t/λ} ∏_{k=1}^n A_k(s) ds.

Using this expression together with (4.68), the inequality 1 − ∏_i(1 − q_i) ≤ Σ_i q_i, and the component unavailability bound (4.22) of Proposition 4.11, p. 114 (Ā_i(t) ≤ λ_iμ_{Gi}), we find that

ED = λ ∫_0^{t/λ} [1 − ∏_{i=1}^n A_i(s)] ds ≤ λ ∫_0^{t/λ} Σ_{i=1}^n Ā_i(s) ds ≤ λ(t/λ) Σ_{i=1}^n λ_iμ_{Gi} = td.

Now if d → 0, we see that ED → 0 and the proof is complete. □
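To see the Poisson limit concretely, the event-driven sketch below (our own construction, not from the text) simulates a small series system with exponential lifetimes and constant repair times, counts the system failures on [0, t/λ], and compares the sample mean and variance of N_{t/λ} with t, the Poisson value.

```python
import heapq, random

def count_failures(lams, repairs, horizon, rng):
    """System failures of a series system on [0, horizon]: component i has
    Exp(lams[i]) uptimes and constant repair time repairs[i]; a system
    failure occurs when a component fails while all the others are up."""
    events, down, n_sys = [], set(), 0
    for i, lam in enumerate(lams):
        heapq.heappush(events, (rng.expovariate(lam), i, "fail"))
    while events[0][0] <= horizon:
        t, i, kind = heapq.heappop(events)
        if kind == "fail":
            if not down:
                n_sys += 1                  # failure while the system was up
            down.add(i)
            heapq.heappush(events, (t + repairs[i], i, "up"))
        else:
            down.discard(i)
            heapq.heappush(events, (t + rng.expovariate(lams[i]), i, "fail"))
    return n_sys

rng = random.Random(3)
lams, repairs, t = [0.01, 0.02, 0.03], [0.5, 0.5, 0.5], 3.0
lam = sum(lams)                             # total failure rate
samples = [count_failures(lams, repairs, t / lam, rng) for _ in range(20_000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(mean, var)                            # both close to t = 3, since d = 0.03
```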

Remark 4.36. Arguing as in the proof of the theorem above, it can be shown that if a_j → a as j → ∞, then

N_{a_jt/λ} →^D Poisson(ta).

Observe that Σ_{i=1}^n N^P_{a_jt/λ}(i) is Poisson distributed with parameter a_jt, and as j → ∞ this variable converges in distribution to a Poisson variable with parameter at.

Theorem 4.37. Assume that the components have exponentially distributed lifetimes, and let λ_i be the failure rate of component i. Let A denote the set of components that are in series with the rest of the system, and let B be the remaining components. Let N^A, λ^A, etc., denote the number of system failures, the total failure rate, etc., associated with the series system comprising the components in A. Similarly define N^B, α_B, d_B, etc., for the system comprising the components in B. Assume that the following conditions hold (as j → ∞):

1. d → 0
2. The conditions of Theorem 4.25, p. 129, i.e., (4.48)–(4.51), hold for system B
3. λ^A/α_B → a.

Then (as j → ∞)

N_{t/α_B} →^D Poisson(t(1 + a)).

Remark 4.38. The conditions of Theorem 4.25 ensure that

N^B_{t/α_B} →^D Poisson(t),

cf. Theorem 4.31, p. 136. Theorem 4.30, p. 135, gives sufficient conditions for (4.48)–(4.51).


Proof. First note that

N_{t/α_B} ≤ N^A_{t/α_B} + N^B_{t/α_B} = N^A_{a_jt/λ^A} + N^B_{t/α_B},

where a_j = λ^A/α_B. Now, in view of Remark 4.36 above and the conditions of the theorem, it is sufficient to show that D*, defined as the expected number of times system A fails while system B is down, or vice versa, converges to zero. But noting that the probability that system A (B) is not functioning is less than or equal to d (the unreliability of a monotone system is bounded by the sum of the component unreliabilities, which in turn is bounded by d, cf. (4.22), p. 115), it is seen that

D* ≤ d[EN^A_{a_jt/λ^A} + EN^B_{t/α_B}] ≤ d[λ^A a_jt/λ^A + EN^B_{t/α_B}] = d[a_jt + EN^B_{t/α_B}].

To find a suitable bound on EN^B_{t/α_B}, we refer to the argumentation in the proof of Theorem 4.43, formulas (4.88) and (4.93), p. 156. Using these results we can show that EN^B_{t/α_B} → t. Hence, D* → 0 and the theorem is proved. □

4.5 Downtime Distribution Given System Failure

In this section we study the downtime distribution of the system given that a failure has occurred. We investigate the downtime distribution given a failure at time t, the asymptotic (steady-state) distribution obtained by letting t → ∞, and the distribution of the downtime following the ith system failure. Recall that Φ represents the structure function of the system and N_t the number of system failures in [0, t]. Component i generates an alternating renewal process with uptime distribution F_i and downtime distribution G_i, with means μ_{Fi} and μ_{Gi}, respectively. The lifetime distribution F_i is absolutely continuous with a failure rate function λ_i. The n component processes are independent.

Let ΔN_t = N_t − N_{t−}. Define G_Φ(·, t) as the downtime distribution at time t, i.e.,

G_Φ(y, t) = P(Y ≤ y | ΔN_t = 1),

where Y is a random variable representing the downtime (we omit the dependency on t). The asymptotic (steady-state) downtime distribution is given by

G_Φ(y) = lim_{t→∞} G_Φ(y, t),

assuming that the limit exists. It turns out that it is quite simple to establish the asymptotic (steady-state) downtime distribution of a parallel system, so we first consider this category of systems.


4.5.1 Parallel System

Consider a parallel system comprising n stochastically identical components, with repair time distribution G. Since a system failure coincides with one and only one component failure, we have

P(Y > y | ΔN_t = 1) = Ḡ(y)[Ḡ_{αt}(y)]^{n−1},

where Ḡ_{αt}(y) = P(α_t(i) > y | X_t(i) = 0) denotes the distribution of the forward recurrence time in state 0 of a component. But we know from (4.14) and (4.15), p. 112, that the asymptotic distribution of Ḡ_{αt}(y) is given by

lim_{t→∞} Ḡ_{αt}(y) = ∫_y^∞ Ḡ(x) dx / μ_G = Ḡ_∞(y).  (4.69)

Thus we have proved the following theorem.

Theorem 4.39. For a parallel system of n identical components, the asymptotic (steady-state) downtime distribution given system failure equals

G_Φ(y) = 1 − Ḡ(y) [∫_y^∞ Ḡ(x) dx / μ_G]^{n−1}.  (4.70)
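Formula (4.70) is straightforward to evaluate numerically for any repair time distribution. The sketch below (our illustration) uses a simple trapezoidal rule; for constant repair times equal to 1, the formula reduces to G_Φ(y) = 1 − (1 − y)^{n−1} on [0, 1), which serves as a check.

```python
def g_phi(y, Gbar, muG, n, upper=50.0, steps=100_000):
    """Evaluate (4.70): 1 - Gbar(y) * (int_y^inf Gbar(x) dx / muG)^(n-1),
    truncating the integral at `upper` (trapezoidal rule)."""
    h = (upper - y) / steps
    total = sum(Gbar(y + k * h) for k in range(steps + 1))
    integral = h * (total - 0.5 * (Gbar(y) + Gbar(upper)))
    return 1.0 - Gbar(y) * (integral / muG) ** (n - 1)

Gbar = lambda x: 1.0 if x < 1.0 else 0.0    # constant repair times equal to 1
n = 3
for y in (0.0, 0.25, 0.5, 0.75):
    assert abs(g_phi(y, Gbar, 1.0, n) - (1 - (1 - y) ** (n - 1))) < 2e-3
print("matches the closed form 1 - (1-y)^(n-1)")
```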

Next we consider a parallel system of not necessarily identical components. We have the following result.

Theorem 4.40. Let m_i(t) be the renewal density function of M_i(t), and assume that m_i(t) is right-continuous and satisfies

lim_{t→∞} m_i(t) = 1/(μ_{Fi} + μ_{Gi}).  (4.71)

For a parallel system of not necessarily identical components, the asymptotic (steady-state) downtime distribution given system failure equals

G_Φ(y) = Σ_{i=1}^n c_i [1 − Ḡ_i(y) ∏_{k≠i} (∫_y^∞ Ḡ_k(x) dx / μ_{Gk})],

where

c_i = (1/μ_{Gi}) / Σ_{k=1}^n (1/μ_{Gk})  (4.72)

denotes the asymptotic (steady-state) probability that component i causes a system failure.


Proof. The proof follows the lines of the proof of Theorem 4.39, the difference being that we have to take into consideration which component causes system failure and the probability of this event given system failure. Clearly,

1 − Ḡ_i(y) ∏_{k≠i} (∫_y^∞ Ḡ_k(x) dx / μ_{Gk})

equals the asymptotic downtime distribution given that component i causes system failure. Hence it suffices to show (4.72). Since the system failure rate λ_Φ is given by λ_Φ = Σ_{i=1}^n λ_Φ^{(i)}, where

λ_Φ^{(i)} = ∏_{k≠i} Ā_k · 1/(μ_{Fi} + μ_{Gi})

represents the expected number of system failures per unit of time caused by failures of component i, an intuitive argument gives that the asymptotic (steady-state) probability that component i causes system failure equals

λ_Φ^{(i)}/λ_Φ = [(1/(μ_{Fi} + μ_{Gi})) ∏_{k≠i} Ā_k] / [Σ_{l=1}^n (1/(μ_{Fl} + μ_{Gl})) ∏_{k≠l} Ā_k]
= [(1/μ_{Gi}) ∏_{k=1}^n Ā_k] / [Σ_{l=1}^n (1/μ_{Gl}) ∏_{k=1}^n Ā_k] = c_i.

To establish sufficient conditions for this result to hold, we need to carry out a somewhat more formal proof. Let c_i(t) be defined as the conditional probability that component i causes system failure given that a system failure occurs at time t. For each h > 0 let

N^c_{[t,t+h)}(i) = ∫_{[t,t+h)} (Φ(1_i, X_s) − Φ(0_i, X_s)) dN_s(i),

N^c_{[t,t+h)} = Σ_{i=1}^n N^c_{[t,t+h)}(i).

Then

c_i(t) = lim_{h→0+} P(N^c_{[t,t+h)}(i) = 1) / P(N^c_{[t,t+h)} = 1)
= lim_{h→0+} [(1/h)EN^c_{[t,t+h)}(i) − o_i(1)] / [(1/h)EN^c_{[t,t+h)} − o(1)],  (4.73)

where

o_i(1) = E[N^c_{[t,t+h)}(i) I(N^c_{[t,t+h)}(i) ≥ 2)]/h

and

o(1) = E[N^c_{[t,t+h)} I(N^c_{[t,t+h)} ≥ 2)]/h.

Hence it remains to study the limit of the ratio of the first terms of (4.73). Using that

EN^c_{[t,t+h)}(i) = ∫_{[t,t+h)} (h(1_i, A(s)) − h(0_i, A(s))) m_i(s) ds,

where A_i(s) = P(X_s(i) = 1) equals the availability of component i at time s, it follows that

c_i(t) = {h(1_i, A(t)) − h(0_i, A(t))} m_i(t) / Σ_{k=1}^n {h(1_k, A(t)) − h(0_k, A(t))} m_k(t).

From this expression, we see that lim_{t→∞} c_i(t) = c_i provided that

lim_{t→∞} m_i(t) = 1/(μ_{Fi} + μ_{Gi}).

This completes the proof of the theorem. □
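A small sketch of the weights in (4.72) (our addition): they depend only on the mean repair times, with weight proportional to 1/μ_{Gi}.

```python
def cause_probs(muG):
    """c_i = (1/muG_i) / sum_k (1/muG_k), cf. (4.72): steady-state probability
    that component i is the one causing system failure."""
    inv = [1.0 / m for m in muG]
    total = sum(inv)
    return [x / total for x in inv]

c = cause_probs([1.0, 2.0, 4.0])
print(c)                          # weights proportional to 1/muG_i
assert abs(sum(c) - 1.0) < 1e-12  # the c_i form a probability distribution
```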

Remark 4.41. 1. From renewal theory (see Theorem B.10, p. 278, in Appendix B) sufficient conditions can be formulated for the limiting result (4.71) to hold true. For example, if the renewal cycle lengths T_{ik} + R_{ik} have a density function h with h(t)^p integrable for some p > 1, and h(t) → 0 as t → ∞, then M_i has a density m_i such that m_i(t) → 1/(μ_{Fi} + μ_{Gi}) as t → ∞. If component i has an exponential lifetime distribution with parameter λ_i, then we know that m_i(t) = λ_iA_i(t) (cf. (4.18), p. 114), which converges to 1/(μ_{Fi} + μ_{Gi}).

2. From the above proof it is seen that the downtime distribution at time t, G_Φ(y, t), is given by

G_Φ(y, t) = Σ_{i=1}^n c_i(t) [1 − Ḡ_i(y) ∏_{k≠i} Ḡ_{kαt}(y)].

4.5.2 General Monotone System

Consider now an arbitrary monotone system comprising the minimal cut sets K_k, k = 1, 2, . . . , k_0. No simple formula exists for the downtime distribution in this case. But for highly available systems the following formula can be used to approximate the downtime distribution:

Σ_k r_k G_{Kk}(y),

where

r_k = λ_{Kk} / Σ_l λ_{Kl}.

Here λ_{Kk} and G_{Kk} denote the asymptotic (steady-state) failure rate of minimal cut set K_k and the asymptotic (steady-state) downtime distribution of minimal cut set K_k, respectively, when this set is considered in isolation (i.e., we consider the parallel system comprising the components in K_k). We see that r_k is approximately equal to the probability that minimal cut set K_k causes system failure. Refer to [23, 72] for more detailed analyses in the general case. In [72] it is formally proved that the asymptotic downtime distribution exists and is equal to the steady-state downtime distribution.

4.5.3 Downtime Distribution of the ith System Failure

The above asymptotic (steady-state) formulas for G_Φ give in most cases good approximations to the downtime distribution of the ith system failure, i ∈ ℕ. Even for the first system failure observed, the asymptotic formulas produce relatively accurate approximations. This is demonstrated by Monte Carlo simulations in [23]. An example is given below. Let the distance measure D_i(y) be defined by

D_i(y) = |G_Φ(y) − G_{i,Φ}(y)|,

where G_{i,Φ}(y) equals the "true" downtime distribution of the ith system failure obtained by Monte Carlo simulations. In Fig. 4.3 the distance measure of the first and second system failures has been plotted as a function of y for a parallel system of two identical components with constant repair times and exponential lifetimes. As we can see from the figure, the distance is quite small; the maximum distance is about 0.012 for i = 1 and 0.004 for i = 2.


Fig. 4.3. The distance D_i(y), i = 1, 2, as a function of y for a parallel system of two components with constant repair times, μ_G = 1, λ = 0.1


Only for some special cases are explicit expressions for the downtime distribution of the ith system failure known. Below we present such expressions for the downtime distribution of the first failure for a two-component parallel system of identical components with exponentially distributed lifetimes.

Theorem 4.42. For a parallel system of two identical components with constant failure rate λ and repair time distribution G, the downtime distribution G_{1,p2}(y) of the first system failure is given by

G_{1,p2}(y) = 1 − Ḡ(y) · [∫_0^∞ ∫_0^s Ḡ(y + s − x) dF(x) dF(s)] / [∫_0^∞ ∫_0^s Ḡ(s − x) dF(x) dF(s)]  (4.74)

= 1 − Ḡ(y) · [∫_y^∞ (1 − e^{−λ(r−y)}) dG(r)] / [∫_0^∞ (1 − e^{−λr}) dG(r)].  (4.75)

Proof. Let T_i and R_i have distribution functions F and G, respectively, i = 1, 2, and let

Y = min_{1≤i≤2}(T_i + R_i) − max_{1≤i≤2}(T_i).

It is seen that the downtime distribution G_{1,p2}(y) equals the conditional distribution of Y given that Y > 0. The equality (4.74) follows if we can show that

P(Y > y) = Ḡ(y) ∫_0^∞ ∫_0^s 2Ḡ(y + s − x) dF(x) dF(s).  (4.76)

Consider the event that T_i = s, T_j = x, R_i > y, and T_j + R_j > y + s for x < s and j ≠ i. For this event it holds that Y is greater than y. The probability of this event, integrated over all s and x, is given by

∫_0^∞ ∫_0^s Ḡ(y + s − x)Ḡ(y) dF(x) dF(s).

By taking the union over i = 1, 2, we find that (4.76) holds. But the double integral in (4.76) can be written as

∫_0^∞ 2 ∫_0^s Ḡ(y + s − x) d(1 − e^{−λx}) d(1 − e^{−λs})
= 1 − ∫_0^∞ ∫_0^s G(y + s − x) 2λ²e^{−λ(x+s)} dx ds
= 1 − ∫_0^∞ ∫_x^∞ G(y + s − x) λe^{−λ(s−x)} 2λe^{−2λx} ds dx.

Introducing r = y + s − x gives

1 − ∫_0^∞ 2λe^{−2λx} ∫_y^∞ G(r) λe^{−λ(r−y)} dr dx
= 1 − ∫_y^∞ G(r) λe^{−λ(r−y)} dr
= ∫_y^∞ (1 − e^{−λ(r−y)}) dG(r).


Thus the formulas (4.75) and (4.74) in the theorem are identical. This completes the proof of the theorem. □
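Formula (4.75) can be checked by direct simulation of Y = min(T_i + R_i) − max(T_i). Assuming constant repair times equal to 1 (our choice; then Ḡ(y) = 1 for y < 1 and the dG-integrals reduce to point evaluations at r = 1), (4.75) becomes G_{1,p2}(y) = 1 − (1 − e^{−λ(1−y)})/(1 − e^{−λ}) for 0 ≤ y < 1, and the sketch compares this with the empirical conditional distribution of Y given Y > 0.

```python
import math, random

rng = random.Random(4)
lam, R = 0.1, 1.0                       # failure rate; constant repair time
downtimes = []
for _ in range(1_000_000):
    t1, t2 = rng.expovariate(lam), rng.expovariate(lam)
    y = min(t1 + R, t2 + R) - max(t1, t2)
    if y > 0:                           # condition on a system failure occurring
        downtimes.append(y)

def g1p2(y):                            # (4.75) for constant repair times = R
    return 1 - (1 - math.exp(-lam * (R - y))) / (1 - math.exp(-lam * R))

for y in (0.2, 0.5, 0.8):
    empirical = sum(1 for v in downtimes if v <= y) / len(downtimes)
    print(y, round(empirical, 3), round(g1p2(y), 3))
```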

Now what can we say about the limiting downtime distribution of the first system failure as the failure rate converges to 0? Is it equal to the steady-state downtime distribution G_Φ? Yes, for the above example we can show that if the failure rate converges to 0, the distribution G_{1,p2}(y) converges to the steady-state formula, i.e.,

lim_{λ→0} G_{1,p2}(y) = 1 − Ḡ(y) ∫_y^∞ Ḡ(r) dr / μ_G = G_Φ(y).

This is seen by noting that

lim_{λ→0} [∫_y^∞ (1 − e^{−λ(r−y)}) dG(r)] / [∫_0^∞ (1 − e^{−λr}) dG(r)] = [∫_y^∞ (r − y) dG(r)] / [∫_0^∞ r dG(r)] = [∫_y^∞ Ḡ(r) dr] / [∫_0^∞ Ḡ(r) dr].

This result can be extended to general monotone systems, and it is not necessary to establish an exact expression for the distribution of the first downtime; see [72]. Consider the asymptotic set-up introduced in Sect. 4.4 to study highly available components, with exponential lifetime distributions F_{ij}(t) = 1 − e^{−λ_{ij}t} and fixed repair time distributions G_i, and where we assume λ_{ij} → 0 as j → ∞. Then for a parallel system it can be shown that the distribution of the ith system downtime converges as j → ∞ to the steady-state downtime distribution G_Φ. For a general system it is more complicated. Assuming that the steady-state downtime distribution converges as j → ∞ to G*_Φ (say), it follows that the distribution of the ith system downtime converges to the same limit. See [72] for details.

4.6 Distribution of the System Downtime in an Interval

In this section we study the distribution of the system downtime in a time interval. The model considered is as described in Sect. 4.3, p. 120. The system analyzed is monotone and comprises n independent components. Component i generates an alternating renewal process with uptime distribution F_i and downtime distribution G_i.

We immediately observe that the asymptotic expression for the expected average downtime presented in Theorem 4.13, p. 116, also holds for monotone systems, with A = h(A). Formula (4.28) of Theorem 4.13 requires that the process X is a regenerative process with finite expected cycle length.

The rest of this section is organized as follows. First we present some approximative methods for computing the distribution of Y_u (the downtime in the time interval [0, u]) in the case that the components are highly available, utilizing that (Y_u) is approximately a compound Poisson process, denoted (CP_u), and the exact one-unit formula (4.30), p. 118, for the downtime distribution. Then we formulate some sufficient conditions for when the distribution of CP_u is an asymptotic limit. The framework is the same as described in Sect. 4.4.1, p. 126. Finally, we study the convergence to the normal distribution.

4.6.1 Compound Poisson Process Approximation

We assume that the components have constant failure rates and that the components are highly available, i.e., the products λ_iμ_{Gi} are small. Then it can be heuristically argued that the process (Y_u), u ∈ ℝ_+, is approximately a compound Poisson process,

Y_u ≈ Σ_{i=1}^{N_u} Y_i ≈ CP_u.  (4.77)

Here N_u is the number of system failures in [0, u] and Y_i is the downtime of the ith system failure. The dependency between N_u and the random variables Y_i is not "very strong," since N_u is mainly governed by the renewal cycles without system failures. We can ignore downtimes Y_i being the second, third, etc., system failure in a renewal cycle of the process X: the probability of having two or more system failures in a cycle is small, since we are assuming highly available components. This means that the random variables Y_i are approximately independent and identically distributed.

From this we can find an approximate expression for the distribution of Y_u.

A closely related approximation can be established by considering system operational time, as described in the following.

Let N^op_s be the number of system failures in [0, s] when we consider operational time. Similar to the reasoning in Sect. 4.4, p. 125, it can be argued that N^op_s is approximately a homogeneous Poisson process with intensity λ′_Φ, where λ′_Φ is given by

λ′_Φ = Σ_{i=1}^n [h(1_i, A) − h(0_i, A)] / [(μ_{Fi} + μ_{Gi}) h(A)].

To motivate this result, we note that the expected number of system failures per unit of time when considering calendar time is approximately equal to the asymptotic (steady-state) system failure rate λ_Φ, given by (cf. formula (4.41), p. 123)

λ_Φ = Σ_{i=1}^n [h(1_i, A) − h(0_i, A)] / (μ_{Fi} + μ_{Gi}).


Then, observing that the ratio between calendar time and operational time is approximately 1/h(A), we see that the expected number of system failures per unit of time when considering operational time, EN^op_{(u,u+w]}/w, is approximately equal to λ_Φ/h(A). Furthermore, N^op_{(u,u+w]} is "nearly independent" of the history of N^op up to u, noting that the state process X frequently restarts itself probabilistically, i.e., X re-enters the state (1, 1, . . . , 1). It can be shown, by repeating the proof of the Poisson limit Theorem 4.31, p. 136, and using the fact that h(A) → 1 as λ_iμ_{Gi} → 0, that N^op_{t/α} has an asymptotic Poisson distribution with parameter t. The system downtimes given system failure are approximately identically distributed with distribution function G(y), say, independent of N^op, and approximately independent, observing that the state process X with high probability restarts itself quickly after a system failure. The distribution function G(y) is normally taken as the asymptotic (steady-state) downtime distribution given system failure, or an approximation to this distribution; see Sect. 4.5.

Considering the system as a one-unit system, we can now apply the exact formula (4.30), p. 118, for the downtime distribution with the Poisson parameter λ′_Φ. It follows that

P(Y_u ≤ y) ≈ Σ_{n=0}^∞ G^{*n}(y) ([λ′_Φ(u − y)]^n / n!) e^{−λ′_Φ(u−y)} = P_u(y),  (4.78)

where the equality is given by definition. Formula (4.78) gives good approximations for "typical real life cases" with small component unavailabilities; see [82]. Figure 4.4 presents the downtime distribution for a parallel system of two components with repair times identically equal to 1 and μ_F = 10, using the steady-state formula G_Φ for G (formula (4.70), p. 146). The "true" distribution is found using Monte Carlo simulation. We see that formula (4.78) gives a good approximation.
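The sum in (4.78) can be estimated by sampling the compound Poisson variable directly. The sketch below (our illustration of the Fig. 4.4 setting: two components in parallel, constant repair times equal to 1, μ_F = 10) uses the fact that the steady-state downtime distribution (4.70) is then uniform on (0, 1): draw N from Poisson(λ′_Φ(u − y)), add N independent uniform downtimes, and record whether the total is at most y.

```python
import math, random

rng = random.Random(5)
muF, muG, u = 10.0, 1.0, 10.0
Abar = muG / (muF + muG)                 # component unavailability
lam_phi = 2 * Abar / (muF + muG)         # system failure rate, parallel n = 2
lam_prime = lam_phi / (1 - Abar ** 2)    # intensity in operational time

def sample_poisson(mean):                # simple Knuth-style sampler (small mean)
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p < limit:
            return k
        k += 1

def P_u(y, trials=200_000):
    """Monte Carlo estimate of (4.78) with G = steady-state downtime
    distribution, here uniform on (0, 1) for constant repairs equal to 1."""
    mean, hits = lam_prime * (u - y), 0
    for _ in range(trials):
        n = sample_poisson(mean)
        if sum(rng.random() for _ in range(n)) <= y:
            hits += 1
    return hits / trials

for y in (0.1, 0.3, 0.5):
    print(y, round(P_u(y), 3))           # compare with the curve in Fig. 4.4
```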

4.6.2 Asymptotic Analysis

We argued above that (Y_u) is approximately equal to a compound Poisson process when the system comprises highly available components. In the following theorem we formalize this result.

The set-up is the same as in Sect. 4.4.1, p. 126. We consider for each component i a sequence {F_{ij}, G_{ij}}, j ∈ ℕ, of distributions satisfying certain conditions. To simplify notation, we normally omit the index j. When assuming in the following that X is a regenerative process, it is tacitly understood for all j ∈ ℕ.

We say that the renewal cycle is a "success" if no system failure occurs during the cycle and a "fiasco" if a system failure occurs.

Let α be a suitable normalizing factor (or more precisely, a normalizing sequence in j) such that N_{t/α} converges in distribution to a Poisson variable



Fig. 4.4. P_{10}(y) and P(Y_{10} ≤ y) for a parallel system of two components with constant repair times, μ_G = 1, λ = 0.1

with mean t, cf. Theorem 4.31, p. 136. Normally we take α = λΦ, but we could also use q/E0S, q/ES, or 1/ETΦ, where q equals the probability that a system failure occurs in a cycle, S equals the length of a cycle, E0S equals the expected length of a cycle with no system failures, and TΦ equals the time to the first system failure. Furthermore, let Yi1 denote the length of the first downtime of the system in the ith “fiasco” renewal cycle, and Yi2 the length of the remaining downtime in the same cycle. We assume that the asymptotic distribution of Yi1 exists (as j → ∞): Yi1 →_D G*Φ (say).

A random variable is denoted CP(r, G) if it has the same distribution as ∑_{i=1}^N Yi, where N is a Poisson variable with mean r, the variables Yi are i.i.d. with distribution function G, and N and the Yi are independent. The distribution of CP(r, G) equals

    ∑_{i=0}^∞ G^{*i} (r^i / i!) e^{−r},

where G^{*i} denotes the ith convolution of G.
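A CP(r, G) variable is easy to sample, which is convenient for checking approximations of this kind numerically. The sketch below is our own illustration (not from the book); the choice of G as an exponential distribution is an assumption made only for the example:

```python
import math
import random

def sample_poisson(r, rng):
    # Knuth's multiplication method; adequate for the small means r used here
    limit, k, p = math.exp(-r), 0, 1.0
    while True:
        p *= rng.random()
        if p < limit:
            return k
        k += 1

def sample_cp(r, sample_g, rng):
    # CP(r, G): sum of N i.i.d. draws from G, with N ~ Poisson(r)
    n = sample_poisson(r, rng)
    return sum(sample_g(rng) for _ in range(n))

rng = random.Random(1)
# Illustration: r = 2, G exponential with mean 1, so E[CP(r, G)] = r * mu_G = 2
draws = [sample_cp(2.0, lambda g: g.expovariate(1.0), rng) for _ in range(100_000)]
print(sum(draws) / len(draws))  # close to 2, by Wald's identity
```

The empirical distribution of such draws can then be compared with an exact or simulated downtime distribution, as in Fig. 4.4.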

Theorem 4.43. Assume that X is a regenerative process, and that Fij and Gij change in such a way that the following conditions hold (as j → ∞):

    q → 0,   (4.79)

    q c²_{0S} → 0,   (4.80)

where c²_{0S} = [E0S²/(E0S)²] − 1 denotes the squared coefficient of variation of S under P0,

    q E1S / ES → 0,   (4.81)

    E1(NS − 1) → 0,   (4.82)

    Yi1 →_D G*Φ.   (4.83)

Then (as j → ∞)

    Y_{t/α} →_D CP(t, G*Φ),   (4.84)

where α = λΦ, q/E0S, q/ES, or 1/ETΦ.

Proof. First we will introduce two renewal processes, N′ and N″, having the same asymptotic properties as N_{t/α}. From Theorem 4.31, p. 136, we know that

    N_{t/α} →_D Poisson(t)

under conditions (4.79)–(4.82).

Let ν(1) equal the renewal cycle index associated with the first “fiasco” renewal cycle, and let U1 denote the time to the starting point of this cycle, i.e.,

    U1 = ∑_{i=1}^{ν(1)−1} Si.

Note that if the first cycle is a “fiasco” cycle, then U1 = 0. Starting from the beginning of the renewal cycle ν(1) + 1, we define U2 as the time to the starting point of the next “fiasco” renewal cycle. Similarly we define U3, U4, .... The random variables Ui equal the interarrival times of the renewal process N″_t, i.e.,

    N″_t = ∑_{k=1}^∞ I(∑_{i=1}^k Ui ≤ t).

By repeating the proofs of Theorem 4.25 (p. 129) and Theorem 4.31 it is seen that

    N″_{t/α} →_D Poisson(t).   (4.85)

Using that the process N″_t and the random variables Yi are independent, and the fact that Yi1 →_D G*Φ (assumption (4.83)), it follows that

    ∑_{i=1}^{N″_{t/α}} Yi1 →_D CP(t, G*Φ).   (4.86)

A formal proof of this can be carried out using moment generating functions.

Next we introduce N′_t as the renewal process having interarrival times with the same distribution as U1 + S_{ν(1)}, i.e., the renewal cycle also includes the “fiasco” cycle. It follows from the proof of Theorem 4.25, using condition (4.81), that N′_{t/α} has the same asymptotic Poisson distribution as N_{t/α}.


It is seen that

    N′_t ≤ N″_t,   (4.87)

    N′_t ≤ N_t ≤ ∑_{i=1}^{N″_t} N(i) = N″_t + ∑_{i=1}^{N″_t} (N(i) − 1),   (4.88)

where N(i) equals the number of system failures in the ith “fiasco” cycle. Note that N″_t is at least the number of “fiasco” cycles up to time t, including the one that is possibly running at t, and N′_t equals the number of finished “fiasco” cycles at time t without the one possibly running at t.

Now to prove the result (4.84) we will make use of the following inequalities:

    Y_{t/α} ≤ ∑_{i=1}^{N″_{t/α}} Yi1 + ∑_{i=1}^{N″_{t/α}} Yi2,   (4.89)

    Y_{t/α} ≥ ∑_{i=1}^{N″_{t/α}} Yi1 − ∑_{i=N′_{t/α}+1}^{N″_{t/α}} Yi1.   (4.90)

In view of (4.86), and the inequalities (4.89) and (4.90), we need to show that

    ∑_{i=1}^{N″_{t/α}} Yi2 →_P 0,   (4.91)

    ∑_{i=N′_{t/α}+1}^{N″_{t/α}} Yi1 →_P 0.   (4.92)

To establish (4.91) we first note that Yi2 →_P 0, since

    P(Yi2 > ε) ≤ P1(NS ≥ 2) ≤ E1(NS − 1) → 0

by (4.82). Using moment generating functions it can be shown that (4.91) holds.

The key part of the proof of (4.92) is to show that (N″_{t/α}) is uniformly integrable in j (t fixed). If this result is established, then since N″_{t/α} →_D Poisson(t) by (4.85), it follows that

    EN″_{t/α} → t.   (4.93)

And because of the inequality (4.87), (N′_{t/α}) is also uniformly integrable, so that EN′_{t/α} → t, and we can conclude that (4.92) holds noting that

    P(N″_{t/α} − N′_{t/α} ≥ 1) ≤ EN″_{t/α} − EN′_{t/α} → 0.

Thus it remains to show that (N″_{t/α}) is uniformly integrable. Let FU denote the probability distribution of U and let Vl = ∑_{i=1}^l Ui. Then we obtain

    E[N″_{t/α} I(N″_{t/α} ≥ k)] = ∑_{l=k}^∞ P(N″_{t/α} ≥ l) + (k − 1) P(N″_{t/α} ≥ k)
        = ∑_{l=k}^∞ P(Vl ≤ t/α) + (k − 1) P(Vk ≤ t/α)
        = ∑_{l=k}^∞ FU^{*l}(t/α) + (k − 1) FU^{*k}(t/α)
        ≤ ∑_{l=k}^∞ (FU(t/α))^l + (k − 1)(FU(t/α))^k
        = (FU(t/α))^k / (1 − FU(t/α)) + (k − 1)(FU(t/α))^k.

Since FU(t/α) → 1 − e^{−t} as j → ∞, it follows that for any sequence Fij, Gij satisfying the conditions (4.79)–(4.82), (N″_{t/α}) is uniformly integrable. To see this, let ε be given such that 0 < ε < e^{−t}. Then for j ≥ j0 (say) we have

    sup_{j≥j0} E[N″_{t/α} I(N″_{t/α} ≥ k)] ≤ (1 − e^{−t} + ε)^k / (e^{−t} − ε) + (k − 1)(1 − e^{−t} + ε)^k.

Consequently,

    lim_{k→∞} sup_j E[N″_{t/α} I(N″_{t/α} ≥ k)] = 0,

i.e., (N″_{t/α}) is uniformly integrable, and the proof is complete. □

Remark 4.44. The conditions (4.79)–(4.82) of Theorem 4.43 ensure the asymptotic Poisson distribution of N_{t/α}, cf. Theorem 4.31, p. 136. Sufficient conditions for (4.79)–(4.82) are given in Theorem 4.27, p. 131.

Asymptotic Normality

We now study convergence to the normal distribution. The theorem below is “a time average result”—it is not required that the system is highly available. The result generalizes (4.32), p. 119.

Theorem 4.45. If X is a regenerative process with cycle length S and associated downtime Y = YS, Var[S] < ∞, and Var[Y] < ∞, then as t → ∞,

    √t [Yt/t − Ā] →_D N(0, τ²Φ),   (4.94)

where

    τ²Φ = Var[Y − ĀS] / ES,   (4.95)

    Ā = EY / ES.   (4.96)

Proof. The result (4.94) follows by applying the Central Limit Theorem for renewal reward processes, Theorem B.17, p. 280, in Appendix B. □

In the case that the system is highly available, we have

    τ²Φ ≈ λΦ EY₁²,   (4.97)

where Y₁ is the downtime of the first system failure (note that Y₁ = Y₁₁). The idea used to establish (4.97) is the following: As before, let S be equal to the time of the first return to the best state (1, 1, ..., 1). Then (4.97) follows by using (4.95), (4.96), Ā ≈ 0, the fact that Y ≈ Y₁ if a system failure occurs in the renewal cycle, the fact that the probability of two or more failures occurring in the renewal cycle is negligible, and λΦ = ENS/ES (by the Renewal Reward Theorem, p. 280). We obtain

    τ²Φ = Var[Y − ĀS]/ES = E(Y − ĀS)²/ES
        ≈ EY²/ES = E1Y² q/ES ≈ EY₁² q/ES
        ≈ EY₁² ENS/ES = EY₁² λΦ,

which gives (4.97).

More formally, it is possible to show that under certain conditions the ratio τ²Φ/(λΦ EY₁²) converges to 1; see [26].
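As a quick numerical illustration of Theorem 4.45 (our own sketch, not an example from the book), consider a single binary component with exponential up times of mean 10 and a constant repair time 1. Then a cycle is S = T + 1 with downtime Y = 1, so Ā = EY/ES = 1/11 and, by (4.95), τ²Φ = Var[Y − ĀS]/ES = Ā² Var[T]/ES = 100/1331 ≈ 0.0751. Simulating cycles reproduces these values:

```python
import random

random.seed(2)
MU_T, R = 10.0, 1.0                 # mean up time; constant repair time
A_BAR = R / (MU_T + R)              # unavailability EY/ES = 1/11

n = 300_000
xs = []                             # samples of Y - A_BAR * S per cycle
total_s = total_y = 0.0
for _ in range(n):
    t = random.expovariate(1.0 / MU_T)
    s, y = t + R, R
    total_s += s
    total_y += y
    xs.append(y - A_BAR * s)

mean_x = sum(xs) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
tau2 = var_x / (total_s / n)        # estimate of (4.95)
print(total_y / total_s)            # about 1/11 = 0.0909, the unavailability
print(tau2)                         # about 100/1331 = 0.0751
```

The normal approximation then gives P(Yt/t > x) ≈ 1 − N((x − Ā)√t / τΦ) for large t.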

4.7 Generalizations and Related Models

4.7.1 Multistate Monotone Systems

We consider a multistate monotone system Φ as described in Sect. 2.1.2, p. 31, observed in a time interval J, with the following extensions of the model: We assume that there exists a reference level Dt at time t, t ∈ J, which expresses a desirable level of system performance at time t. The reference level Dt at time t is a positive random variable, taking values in {d0, d1, ..., dr}. For a flow network system we interpret Dt as the demand rate at time t. In the following we will use the word “demand rate” also in the general case. The state of the system at time t, which we in the following refer to as the throughput rate, is assumed to be a function of the states of the components and the demand rate, i.e.,

    Φt = Φ(Xt, Dt).

If Dt is a constant, we write Φ(Xt). The process (Φt) takes values in {Φ0, Φ1, ..., ΦM}.

Performance Measures

The performance measures introduced in Sect. 4.1, p. 105, can now be generalized to the above model.

(a) For a fixed time t we define the point availabilities

    P(Φt ≥ Φk | Dt = d),
    E[Φt | Dt = d],
    P(Φt ≥ Dt).

(b) Let NJ be defined as the number of times the system state is below demand in J. The following performance measures related to NJ are considered:

    P(NJ ≤ k), k ∈ N0,
    ENJ,
    P(Φt ≥ Dt, ∀t ∈ J) = P(NJ = 0).

Some closely related measures are obtained by replacing Dt by Φk and NJ by N^k_J, where N^k_J is equal to the number of times the process Φ is below state Φk during the interval J.

(c) Let

    YJ = ∫_J (Dt − Φt) dt = ∫_J Dt dt − ∫_J Φt dt.

We see that YJ represents the lost throughput (volume) in J, i.e., the difference between the accumulated demand (volume) and the actual throughput (volume) in J. The following performance measures related to YJ are considered:

    P(YJ ≤ y), y ∈ R+,
    EYJ / |J|,
    E∫_J Φt dt / E∫_J Dt dt,   (4.98)

where |J| denotes the length of the interval J. The measure (4.98) is called the throughput availability.

(d) Let

    ZJ = (1/|J|) ∫_J I(Φt ≥ Dt) dt.

The random variable ZJ represents the portion of time the throughput rate equals (or exceeds) the demand rate. We consider the following performance measures related to ZJ:

    P(ZJ ≤ y), y ∈ R+,
    EZJ.

The measure EZJ is called the demand availability.

As in the binary case, we will often in practice use the limiting values of these performance measures.
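For a piecewise-constant sample path, all these quantities reduce to finite sums. The following toy computation (our illustration, with made-up numbers) evaluates the lost volume YJ, a single-path analogue of the throughput availability (4.98), and ZJ:

```python
# Each segment: (duration, throughput rate Phi_t, demand rate D_t)
segments = [(4.0, 100.0, 100.0), (2.0, 50.0, 100.0),
            (3.0, 100.0, 100.0), (1.0, 0.0, 100.0)]

length_J   = sum(d for d, _, _ in segments)
vol_demand = sum(d * dem for d, _, dem in segments)   # integral of D_t over J
vol_thru   = sum(d * phi for d, phi, _ in segments)   # integral of Phi_t over J
Y_J        = vol_demand - vol_thru                    # lost volume, as defined in (c)
Z_J        = sum(d for d, phi, dem in segments if phi >= dem) / length_J

print(Y_J)                    # 200.0 lost volume units
print(Y_J / length_J)         # 20.0 lost volume per unit time
print(vol_thru / vol_demand)  # 0.8, single-path analogue of (4.98)
print(Z_J)                    # 0.7, fraction of time demand is met
```

In applications the expectations in (4.98) would of course be taken over the random processes (Φt) and (Dt), e.g., by averaging such path computations over simulations.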

The above performance measures are the most common measures used in reliability studies of offshore oil and gas production and transport systems; see, e.g., Aven [13]. In particular, the throughput availability is very much used when predicting the performance of various design options. For economic analysis and as a basis for decision-making, however, it is essential to be able to compute the total distribution of the throughput loss, and not only the mean. The measures related to the number of times the system is below a certain demand level are also useful, but more from an operational and safety point of view.

Computation

We now briefly look into the computation problem for some of the measures defined above. To simplify the analysis we shall make the following assumptions:

Assumptions

1. J = [0, u].
2. The demand rate Dt equals the maximum throughput rate ΦM for all t.
3. The n component processes (Xt(i)) are independent. Furthermore, with probability one, the n component processes (Xt(i)) make no transitions (“jumps”) at the same time.
4. The process (Xt(i)) generates an alternating renewal process Ti1, Ri1, Ti2, Ri2, ..., as described in Sect. 4.2, p. 106, where Tim represents the time spent in the state x_{iMi} during the mth visit to this state, and Rim represents the time spent in the states {x_{i0}, x_{i1}, ..., x_{i,Mi−1}} during the mth visit to these states. For all i and r, the limit

    a_{ir} = lim_{t→∞} P(Xt(i) = x_{ir})

exists.

Arguing as in the binary case, we can use results from regenerative and renewal reward processes to generalize the results obtained in the previous sections. To illustrate this, we formulate some of these extensions below. The proofs are omitted. We will focus here on the asymptotic results. Refer to Theorems 4.16, p. 120, and 4.19, p. 122, for the analogous results in the binary case. We need the following notation:

    μi = ETim + ERim
    Nt = N^k_{[0,t]} (k is fixed)
    p_{ir}(t) = P(Xt(i) = x_{ir}); if t is fixed, we write p_{ir} and X(i)
    p = (p_{10}, ..., p_{nMn})
    a = (a_{10}, ..., a_{nMn})
    Φk(X) = I(Φ(X) ≥ Φk)
    hk(p) = EΦk(X)
    h(p) = EΦ(X)
    (1_{ir}, p) = p with p_{ir} replaced by 1 and p_{il} = 0 for l ≠ r.

We see that μi is equal to the expected cycle length for component i, Nt represents the number of times the process Φ is below state Φk during the interval [0, t], and Φk(X) equals 1 if the system is in state Φk or better, and 0 otherwise.

Theorem 4.46. The limiting availabilities are given by

    lim_{t→∞} EΦ(Xt) = h(a),
    lim_{t→∞} P(Φ(Xt) ≥ Φk) = hk(a).

Theorem 4.47. Let

    γ_{ilr} = hk(1_{il}, a) − hk(1_{ir}, a),

and let f_{ilr} denote the expected number of times component i makes a transition from state x_{il} to state x_{ir} during a cycle of component i. Assume f_{ilr} < ∞. Then the expected number of times the system state is below Φk per unit of time in the long run equals

    lim_{u→∞} ENu/u = lim_{u→∞} E[N_{u+s} − Nu]/s = ∑_{i=1}^n ∑_{r<l} f_{ilr} γ_{ilr} / μi.   (4.99)

If X is a regenerative process having finite expected cycle length, then with probability one,

    lim_{u→∞} Nu/u = ∑_{i=1}^n ∑_{r<l} f_{ilr} γ_{ilr} / μi.

The limit (4.99) is denoted λΦ. If the random variables Tim are exponentially distributed, then X is regenerative, cf. Theorem 4.23, p. 124.

It is also possible to extend the asymptotic results related to the distribution of the number of system failures at level k, and the distribution of the lost volume (downtime). We can view the system as a binary system of binary components, and the asymptotic results of Sects. 4.4–4.6 apply.

Gas Compression System

Consider the gas compression system example in Sect. 1.3.2, p. 13. Two design alternatives were studied:

(i) One gas train with a maximum throughput capacity of 100%.
(ii) Two trains in parallel, each with a maximum throughput capacity of 50%.

Normal production is 100%. Each train comprises compressor–turbine, cooler, and scrubber. To analyze the performance of the system it was considered sufficient to use approximate methods developed for highly available systems, as presented in this chapter. In the system analysis, each train was treated as one component, having exponential lifetime distribution with a failure rate of 13 per year, and mean repair time equal to

    (10/13) · 12 + (2/13) · 50 + (1/13) · 20 ≈ 18.5 (h).

From this we find that the asymptotic unavailability Ā, given by formula (4.2), p. 109, for a train equals 0.027, assuming 8,760 h per year. The number of system failures per unit of time is given by the system failure rate λΦ. For alternative (i) there is only one failure level and λΦ = 13. For alternative (ii) we must distinguish between failures resulting in production below 100% and below 50%. The system in these two cases can be viewed as a series system of the two trains and a parallel system of the two trains, respectively. Hence the system failure rate for these levels is approximately equal to 26 and 0.7, respectively. Note that for the latter case (cf. (4.64), p. 140),

    λΦ ≈ 2 · Ā · 13.

Using that the number of system failures is approximately Poisson distributed, we can compute the probability that a certain number of failures occurs during a specific period of time. For example, we find that for alternative (ii) there is a probability of about e^{−0.7} ≈ 0.50 of having no complete shutdowns during a year.

Let EY denote the asymptotic mean lost production relative to the demand. For alternative (i) it is clear that EY equals 0.027, observing that a failure results in 100% loss and the unavailability equals 0.027. For alternative (ii), we obtain the same value for the asymptotic mean lost production, as is seen from the following calculation:

    EY = 0.5 · 2 · 0.027 · 0.973 + 1 · 0.027² = 0.027.

The first term in the sum represents the contribution from failures leading to 50% loss, whereas the second term represents the contribution from failures leading to 100% loss. The latter contribution is in practice negligible compared to the former one. To compute the distribution of the lost production, we need to know more about the distribution of the repair time R of the train. It was assumed in this application that ER² = 1,000, which corresponds to a squared coefficient of variation equal to 1.9 and a standard deviation equal to 25.7. The unit of time is hours. This assumption makes it possible to approximate the distribution of the lost production during a year, using the normal approximation. We know the mean (EY = 0.027) and need to estimate the variance of Y. To do this we make use of (4.97), p. 158, stating that the variance in the binary case is approximately equal to λΦ EY₁²/t, where t is the length of the time period considered and Y₁ is the downtime of the first system failure. For alternative (i) we find that the variance equals approximately

    (13/8760) · 1000/8760 = 1.7 · 10⁻⁴,

and for alternative (ii) (we ignore situations with both components down, so that the lost production is approximately 50% of the downtime)

    (50/100)² · (26/8760) · 1000/8760 = 0.85 · 10⁻⁴.

From this we estimate, for example, the probability that the lost production during 1 year is more than 4% of demand to be 0.16 for alternative (i) and 0.08 for alternative (ii).
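The numbers in this example can be reproduced with a few lines of arithmetic. The sketch below is our own re-evaluation of the formulas quoted above (using the example's rounded inputs, so the outputs agree with the text only to the precision shown):

```python
import math

HOURS_PER_YEAR = 8760.0
lam = 13.0                                        # train failures per year
mtr = (10/13) * 12 + (2/13) * 50 + (1/13) * 20    # mean repair time, hours (about 18.5)
A_bar = lam * mtr / HOURS_PER_YEAR                # train unavailability, about 0.027

lam_50 = 2 * A_bar * lam                          # rate of complete shutdowns, about 0.7
p_no_shutdown = math.exp(-lam_50)                 # about 0.50 per year

ER2 = 1000.0                                      # assumed second moment of repair time (h^2)

def tail(mean, var, x):
    # normal approximation of P(lost production fraction > x)
    z = (x - mean) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2))

var_i  = (lam / HOURS_PER_YEAR) * ER2 / HOURS_PER_YEAR             # about 1.7e-4
var_ii = 0.25 * (2 * lam / HOURS_PER_YEAR) * ER2 / HOURS_PER_YEAR  # about 0.85e-4

print(A_bar, p_no_shutdown)
print(tail(A_bar, var_i, 0.04))    # about 0.17 (0.16 in the text, which rounds A_bar to 0.027)
print(tail(A_bar, var_ii, 0.04))   # about 0.085 (0.08 in the text)
```

The small discrepancies against the text come only from rounding Ā and the mean repair time.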

Special Case: Phase-Type Distributions

In the asymptotic analysis in Sects. 4.4–4.6 main emphasis has been placed on the situation that the lifetimes are exponentially distributed. Using the so-called phase-type approach, we can show that the multistate model also provides a framework for covering other types of distributions. The phase-type approach makes use of the fact that a distribution function can be approximated by a mixture of Erlang distributions (with the same scale parameter); cf., e.g., Asmussen [8] and Tijms [156]. It is common to use a mixture of two Erlang distributions with the first two moments matching the distribution considered. Now assume that the lifetime of component i, Fi, can be described by the sum of Mi random variables, each of which is exponentially distributed with rate λ_{i0}, i.e., the lifetime of component i is Erlangian distributed with parameters λ_{i0} and Mi. Then we have a situation that fits into the above multistate framework and the asymptotic results can be applied. The state space for component i is {0, 1, ..., Mi}. The component process (Xt(i)) starts in state Mi, it stays there a time governed by an exponential random variable with rate λ_{i0} and jumps to state Mi − 1, it stays there a time governed by an exponential random variable with rate λ_{i0} and jumps to state Mi − 2, and this continues until the process reaches state 0. After a duration having distribution Gi in state 0 it returns to state Mi. We see that f_{ilr} = 1 if l = r + 1 and f_{ilr} = 0 otherwise (for r < l). Furthermore,

    μi = Mi (1/λ_{i0}) + μ_{Gi} = μ_{Fi} + μ_{Gi},

    λΦ = ∑_{i=1}^n [h1(1_{i1}, a) − h1(1_{i0}, a)] (1/μi) = ∑_{i=1}^n [h(1i, A) − h(0i, A)] (1/μi),

    a_{i0} = μ_{Gi}/μi = Āi,

using the terminology from the binary theory. Remember that the formulas established in Sects. 4.2 and 4.3 for the expected cycle length and the steady-state (un)availability of component i, and the system failure rate, are applicable also for nonexponential distributions.

Thus by modifying the state space, we have been able to extend the results of the previous sections, i.e., Theorems 4.25 (p. 129), 4.31 (p. 136), and 4.43 (p. 154), to Erlang distributions.

Now assume that the lifetime distribution of component i is a mixture of Erlang distributions, i.e., with probability p_{ir} > 0 the distribution equals an Erlang distribution with parameters λ_{i0} and M_{ir}, r = 1, 2, ..., ri. This situation can be analyzed as above with the state space for component i given by {0, 1, ..., Mi}, where Mi = max_r{M_{ir}}. If the component state process (Xt(i)) is in state 0, it will go to state M_{ir} with probability p_{ir}. Then the component stays in this state for a time governed by an exponential distribution with parameter λ_{i0}, before it jumps to state M_{ir} − 1, etc. As above we can use the formulas for the binary case to compute the expected cycle length and steady-state (un)availability of component i, and the system failure rate. It is seen that

    μ_{Fi} = ∑_{r=1}^{ri} p_{ir} M_{ir} (1/λ_{i0}).

We can conclude that the set-up also covers mixtures of Erlang distributions, and Theorems 4.25, 4.31, and 4.43 apply.

Note that we have not proved that the limiting results obtained in the previous sections hold true for general lifetime distributions Fij. We have shown that if the distributions Fij all belong to a certain class of mixtures of Erlang distributions, then the results hold. Starting from general distributions Fij, we can write Fij as a limit of F_{ijr}, r → ∞, where the F_{ijr} are mixtures of Erlang distributions. But interchanging the limits as j → ∞ and as r → ∞ is not justified in general. Refer also to Bibliographic Notes, p. 173, for some comments related to the non-exponential case.
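A standard two-moment recipe of this kind (this particular variant follows Tijms [156]; the code and helper names are our own illustration) matches a given mean and squared coefficient of variation c² with 0 < c² ≤ 1 by a mixture of Erlang(k − 1) and Erlang(k) distributions with a common rate. Values c² > 1 (such as the 1.9 in the gas compression example) require a different fit, e.g., a hyperexponential one:

```python
from math import ceil, sqrt

def fit_erlang_mixture(mean, scv):
    """Match (mean, scv) by: Erlang(k-1, mu) w.p. p, Erlang(k, mu) w.p. 1-p,
    where k is chosen so that 1/k <= scv <= 1/(k-1).  Assumes 0 < scv <= 1."""
    k = max(1, ceil(1.0 / scv))
    p = (k * scv - sqrt(k * (1 + scv) - k * k * scv)) / (1 + scv)
    mu = (k - p) / mean
    return k, p, mu

def mixture_moments(k, p, mu):
    # Erlang(n, mu) has mean n/mu and second moment n(n+1)/mu^2
    m1 = (p * (k - 1) + (1 - p) * k) / mu
    m2 = (p * (k - 1) * k + (1 - p) * k * (k + 1)) / mu ** 2
    return m1, m2 / m1 ** 2 - 1       # mean and squared coefficient of variation

# Illustration: fit a repair time with mean 18.5 h and c^2 = 0.75 (made-up value)
k, p, mu = fit_erlang_mixture(18.5, 0.75)
print(mixture_moments(k, p, mu))      # recovers (18.5, 0.75)
```

The fitted mixture slots directly into the multistate framework above, with the Erlang stages playing the role of the component states.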

4.7.2 Parallel System with Repair Constraints

Consider the model as described in Sect. 4.3, p. 120, but assume now that there are repair constraints, i.e., a maximum of r (r < n) components can be repaired at the same time. Hence if i, i > r, components are down, the remaining i − r components are waiting in a repair queue. We shall restrict attention to the case r = 1, i.e., there is only one repair facility (channel) available. The repair policy is first come first served. We assume exponentially distributed lifetimes.

Consider first a parallel system of two components, and the set-up of Sect. 4.4, p. 126. It is not difficult to see that ETΦ, q, and E0S are identical to the corresponding quantities when there are no repair constraints; see the section on a parallel system of two identical components, p. 139. We can also find explicit expressions for ES and λΦ. Since the time to the first component failure is exponentially distributed with parameter 2λ, ES = 1/2λ + ES″, where S″ equals the time from the first component failure until the process again returns to (1, 1). Denoting the repair time of the failed component by R, we see that

    ES″ = μG + qE[S″ − R | NS ≥ 1].

But E[S″ − R | NS ≥ 1] = ES″, and it follows that

    ES = 1/2λ + μG/(1 − q).

Hence

    λΦ = ENS/ES = [q/(1 − q)]/ES = 2λq/(1 − q + 2λμG).

Alternatively, and easier, we could have found λΦ by defining a cycle S as the time between two consecutive visits to a state with just one component functioning. Then it is seen that ES = μG + (1 − q)/2λ and ENS = q, resulting in the same λΦ as above.
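For exponential lifetimes with rate λ and a given repair time distribution G, these expressions are easy to evaluate. The sketch below is our own illustration; a constant repair time is assumed, so that q = P(T ≤ R) = 1 − e^{−λμG}. It checks that the two routes to λΦ above agree:

```python
import math

lam, mu_G = 0.1, 1.0                 # component failure rate; constant repair time
q = 1 - math.exp(-lam * mu_G)        # P(operating component fails during a repair)

# Route 1: cycles between consecutive visits to (1, 1)
lam_phi_1 = 2 * lam * q / (1 - q + 2 * lam * mu_G)

# Route 2: cycles between consecutive visits to a state with one component up
ES = mu_G + (1 - q) / (2 * lam)
lam_phi_2 = q / ES

print(lam_phi_1, lam_phi_2)          # both about 0.0172 system failures per unit time
```

The agreement of the two cycle definitions is a small consistency check on the renewal-reward argument.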


Now suppose we have n ≥ 2, and let Φt be defined as the number of components functioning at time t. To analyze the system, we can utilize that the state process Φt is a semi-Markov process with jump times at the completion of repairs. In states 0, 1, ..., n − 1 the time between transitions has distribution G(t) and the transition probability Pij is given by

    Pij = ∫₀^∞ (i choose i−j+1) F(s)^{i−j+1} (1 − F(s))^{j−1} dG(s),   1 ≤ j ≤ i ≤ n − 1,

    P_{i,i+1} = ∫₀^∞ (1 − F(s))^i dG(s),

    Pij = 0,   1 ≤ i < j − 1,

observing that if the state is i and the repair is completed at time s, then the probability that the process jumps to state j, where j ≤ i ≤ n − 1, equals the probability that i − j + 1 components fail before s and j − 1 components survive s; and, furthermore, if the state is i and the repair is completed at time s, then the probability that the process jumps to state i + 1 equals the probability that i components survive s. Now if the process is in state n, it stays there for an exponential time with rate nλ, and jumps to state n − 1.

Having established the transition probabilities, we can compute a number of interesting performance measures for the system using results from semi-Markov theory. For example, we have an explicit formula for the asymptotic probability P(Φt = k) as t → ∞, which depends on the mean time spent in each state and the limiting probabilities of the embedded discrete-time Markov chain; see Ross [135], p. 104.
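With exponential lifetimes and a constant repair time, the integrals above collapse to binomial probabilities, since F(s) is evaluated at the fixed repair duration. A small sketch (our illustration; the constant repair time is an assumption made for the example):

```python
from math import comb, exp

def embedded_matrix(n, lam, repair):
    """Transition probabilities P[i, j] of the embedded chain at repair
    completions, for states i = 1, ..., n-1.  State n has an exponential
    sojourn with rate n*lam and always jumps to n-1 (handled separately)."""
    f = 1 - exp(-lam * repair)     # P(an operating component fails during a repair)
    P = {}
    for i in range(1, n):
        for j in range(1, i + 2):
            if j <= i:             # i-j+1 failures, j-1 survivors
                k = i - j + 1
                P[i, j] = comb(i, k) * f ** k * (1 - f) ** (i - k)
            else:                  # j = i+1: all i operating components survive
                P[i, j] = (1 - f) ** i
    return P

P = embedded_matrix(4, 0.1, 1.0)
for i in (1, 2, 3):
    print(i, sum(P[i, j] for j in range(1, i + 2)))  # each row sums to 1
```

The limiting probabilities of this embedded chain, weighted by the mean sojourn times, then give the asymptotic distribution of Φt, as noted above.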

4.7.3 Standby Systems

In this section we study the performance of standby systems comprising n identical components, of which n − 1 are normally operating and one is in (cold) standby. Emphasis is placed on the case that the components have constant failure rates and the mean repair time is relatively small compared to the MTTF.

Standby systems as analyzed here are used in many situations in real life. As an example we return to the gas compression system in Sect. 1.3, p. 13, and Sect. 4.7.1, p. 162. To increase the availability for the alternatives considered, we may add a standby train such that when a failure of a train occurs, the standby train can be put into operation and a production loss is avoided.

Model

The following assumptions are made:

• Normally n − 1 components are running and one is in standby.
• Failed components are repaired. The repair regime is characterized by
  R1 Only one component can be repaired at a time (one repair facility/channel); the repair policy is “first come first served,” or
  R2 Up to n repairs can be carried out at a time (n repair facilities/channels).
• Switchover to the standby component is perfect, i.e., instantaneous and failure-free.
• A standby component that has completed its repair is functioning at demand, i.e., the failure rate is zero in the standby state.
• All failure times and repair times are independent with probability distributions F(t) and G(t), respectively. F is absolutely continuous and has finite mean, and G has finite third-order moment. We assume

    ∫₀^∞ F(t) dG(t) > 0.

In the following, T refers to a failure time of a component and R refers to a repair time. The squared coefficient of variation of the repair time distribution is denoted c²G.

Let Φt denote the state of the system at time t, i.e., the number of components functioning at time t (Φt ∈ {n, n − 1, ..., 0}). For repair regime R1, Φ is generally a regenerative process, or a modified regenerative process. For a two-component system it is seen that the time points when Φ jumps to state 1 are regenerative points, i.e., the time points when (i) the operating component fails and the second component is not under repair (the process jumps from state 2 to 1) or (ii) both components are failed and the repair of the component being repaired is completed (the process jumps from state 0 to 1). For n > 2, the points in time when the process jumps from state 0 to 1 are regenerative points, noting that the situation then is characterized by one “new” component, and n − 1 in a repair queue. Assuming exponential lifetimes, we can define other regenerative points, e.g., consecutive visits to the best state n, or consecutive visits to state n − 1.

Also for a two-component system under repair regime R2, the process generally generates a (modified) regenerative process. The regenerative points are given by the points when the process jumps from state 2 to 1 (case (i) above). If the system has more than two components (n > 2), the regenerative property does not hold for a general failure time distribution. However, under the assumption of an exponential time to failure, the process is regenerative. Regenerative points are given by consecutive visits to state n, or points when the process jumps from state n to state n − 1. In the following, when considering a system of more than two components, we assume an exponential lifetime distribution. Remember that a cycle refers to the length between two consecutive regenerative points.


Performance Measures

The system can be considered as a special case of a multistate monotone system, with the demand rate Dt set to n − 1. Hence the performance measures defined in Sect. 4.7.1, p. 158, also apply to the system analyzed in this section. Availability refers to the probability that at least n − 1 components are functioning, and system failure refers to the event that the state process Φ is below n − 1. Note that we cannot apply the computation results of Sect. 4.7.1, since the state processes of the components are not stochastically independent. The general asymptotic results obtained in Sects. 4.4–4.6 for regenerative processes are, however, applicable.

Of the performance measures, we will put emphasis on the limiting availability and the limiting mean of the number of system failures in a time interval.

We need the following notation for i = n, n − 1, ..., 0:

    pi(t) = P(Φt = i),
    pi = lim_{t→∞} pi(t),

provided the limits exist. Clearly, the availability at time t, A(t), is given by

    A(t) = pn(t) + p_{n−1}(t),

and the limiting availability, A, is given by

    A = pn + p_{n−1}.

Computation

First, we focus on the limiting unavailability Ā, i.e., the expected portion of time in the long run that at least two components are not functioning. Under the assumption of constant failure and repair rates this unavailability can easily be computed using Markov theory, noting that Φ is a birth and death process. The probability of having i components down is given by (cf. [13], p. 303)

    p_{n−i} = zi / (1 + ∑_{j=1}^n zj),   (4.100)

where

    zi = [(n − 1)(n − 1)!/(n − i)!] δ^i / ∏_{l=1}^i u_l,   i = 1, 2, ..., n,
    z0 = 1,
    δ = μG/μF,
    u_l = 1 under repair regime R1 and u_l = l under repair regime R2.

Note that if δ is small, then p_{n−i} ≈ zi for i ≥ 1. Hence

    Ā ≈ p_{n−2} ≈ [(n − 1)²/u2] δ².   (4.101)


We can also write

    Ā = [(n − 1)²/u2] δ² + o(δ²),   δ → 0.

In general we can find expressions for the limiting unavailability by using the regenerative property of the process Φ. Defining Y and S as the system downtime in a cycle and the length of a cycle, respectively, it follows from the Renewal Reward Theorem (Theorem B.15, p. 280, in Appendix B) that

    Ā = EY / ES.   (4.102)

Here system downtime corresponds to the time two or more of the components are not functioning. Let us now look closer into the problem of computing Ā, given by (4.102), under repair regime R1.

Repair Regime R1. In general, semi-Markov theory can be used to establish formulas for the unavailability, cf. [27]. In practice, we usually have μG relatively small compared to μF. Typically, δ = μG/μF is less than 0.1. In this case we can establish simple approximation formulas as shown below.

First we consider the case with two components, i.e., n = 2. The regenerative points for the process Φ are generated by the jumps from state 2 to 1. In view of (4.102) the limiting system unavailability Ā can be written as

    Ā = E[max{R − T, 0}] / (ET + E[max{R − T, 0}])   (4.103)
      = (μG − w) / (μF + (μG − w)),   (4.104)

where

    w = E[min{R, T}] = ∫₀^∞ F̄(t)Ḡ(t) dt,

noting that max{R − T, 0} = R − min{R, T}, and that the system downtime equals 0 if the repair of the failed component is completed before the failure of the operating component, and equals the difference between the repair time of the failed component and the time to failure of the operating component if this difference is positive. Thus we have proved the following theorem.

Theorem 4.48. If n = 2, then the unavailability Ā is given by (4.104).

We now assume an exponential failure time distribution F(t) = 1 − e^{−λt}. Then we have

    Ā ≈ A′,   (4.105)

where

    A′ = (λ²/2) ER² = (δ²/2)[1 + c²G].   (4.106)

This gives a simple approximation formula for computing Ā. The approximation (4.105) is established formally by the following proposition.

Proposition 4.49. If n = 2 and F(t) = 1 − e^{−λt}, then

    0 ≤ A′ − Ā ≤ (A′)² + (δ³/6)(ER³/μ³G).   (4.107)

Proof. Using that 1 − e^{−λt} ≤ λt and changing the order of integration, it follows that

    Ā = λ(μG − w) / (1 + λ(μG − w)) ≤ λ(μG − w)   (4.108)
      = λ ∫₀^∞ F(t)Ḡ(t) dt
      ≤ λ ∫₀^∞ (λt)Ḡ(t) dt
      = λ² (1/2) ER² = A′.   (4.109)

It remains to show the right-hand inequality of (4.107). Considering

    Ā(1 + λ(μG − w)) = λ ∫₀^∞ F(t)Ḡ(t) dt
      ≥ λ ∫₀^∞ (λt − (1/2)(λt)²) Ḡ(t) dt
      = A′ − (1/6) λ³ ER³

and the inequalities Ā ≤ λ(μG − w) ≤ A′ obtained above, it is not difficult to see that

    0 ≤ A′ − Ā ≤ Āλ(μG − w) + (1/6) λ³ ER³ ≤ (A′)² + (1/6) λ³ ER³,

which completes the proof. □

Hence A′ overestimates the unavailability, and the error term will be negligible provided that δ = μG/μF is sufficiently small.

Next, let us compare the approximation formula A′ with the standard “Markov formula” AM = δ², obtained by assuming exponentially distributed failure and repair times (replace c²G by 1 in the expression (4.106) for A′, or use the Markov formula (4.101), p. 168). It follows that

    A′ = AM · (1/2)[1 + c²G].   (4.110)

From this, we see that the use of the Markov formula when the squared coefficient of variation of the repair time distribution, c²G, is not close to 1 will introduce a relatively large error. If the repair time is a constant, then c²G = 0 and the unavailability using the Markov formula is two times A′. If c²G is large, say 2, then the unavailability using the Markov formula is 2/3 of A′.

Assume now n > 2. The repair regime is R1 as before. Assume that δ isrelatively small. Then it is possible to generalize the approximations obtainedabove for n = 2.

Since δ is small, there will be a negligible probability of having Φ ≤ n− 3,i.e., three or more components not functioning at the same time. By neglectingthis possibility we obtain a simplified process that is identical to the processfor the two-component system analyzed above, with failure rate (n − 1)λ.Hence by replacing λ with (n−1)λ, formula (4.105) is valid for general n, i.e.,A ≈ A′, where

A′ = ([(n − 1)δ]² / 2) [1 + c²G].

The error bounds are, however, more difficult to obtain; see [27]. The relation between the approximation formulas A′ and A_M, given by (4.101), p. 168, is the same for all n ≥ 2. Hence A′ = A_M · (1/2)[1 + c²G] (formula (4.110)) holds for n > 2 too.

Next we will establish results for the long run average number of system failures. It follows from the Renewal Reward Theorem that ENt/t and E[N_{t+s} − Nt]/s converge to λΦ = EN/ES as t → ∞, where N equals the number of system failures in one renewal cycle and S equals the length of the cycle as before. With probability one, Nt/t converges to the same value. Under repair regime R1, N ∈ {0, 1}. Hence EN equals the probability that the system fails in a cycle, i.e., EN = q using the terminology of Sects. 4.3 and 4.4. Below we find expressions for λΦ in the case that the repair regime is R1. The regenerative points are the consecutive visits to state n − 1.

Theorem 4.50. If n = 2, then

λΦ = q / (μF + EY),   (4.111)

where

q = ∫₀^∞ F(t) dG(t),   (4.112)

EY = ∫₀^∞ F(t) Ḡ(t) dt.

Proof. First note that EY equals the expected downtime in a cycle and is given by

EY = E[(R − T)I(T < R)] = E[R − min{R, T}],

cf. (4.103)–(4.104), p. 169. We have established above that

λΦ = EN/ES = q/ES,

where N equals the number of system failures in one renewal cycle, S equals the length of the cycle, and q = P(T ≤ R) equals the probability of having a system failure during a cycle. Thus it remains to show that

ES = μF + EY. (4.113)

Suppose the system has just jumped to state 1. We then have one component operating and one undergoing repair. Now if a system failure occurs (i.e., T ≤ R), then the cycle length equals R, and if a system failure does not occur (i.e., T > R), then the cycle length equals T. Consequently,

S = I(T ≤ R)R + I(T > R)T = T + (R − T)I(T < R).

Formula (4.113) follows and the proof is complete. □

We see from (4.111) that if F(t) is exponential with rate λ and the components are highly available, then

λΦ ≈ λ²μG.

If n > 2 and the repair regime is R1, it is not difficult to see that q is given by (4.112) with F(t) replaced by 1 − e^{−(n−1)λt}. It is, however, more difficult to find an expression for ES. For highly available components, we can approximate the system with a two-state system with failure rate (n − 1)λ; hence,

λΦ ≈ [(n − 1)λ]²μG,

ES ≈ 1 / ((n − 1)λ).

When the state process of the system jumps from state n to n − 1, it will return to state n with a high probability and the sojourn time in state n − 1 will be relatively short; consequently, the expected cycle length is approximately equal to the expected time in the best state n, i.e., 1/((n − 1)λ).
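As a numerical sanity check (our own, under the additional assumption of exponential repair times, G ∼ Exp(μ)), the quantities in (4.111) and (4.112) have closed forms for n = 2: q = λ/(λ + μ) and EY = 1/μ − 1/(λ + μ). These can be compared with the approximation λΦ ≈ λ²μG; the rates below are arbitrary illustrative values:

```python
# Exact lambda_Phi from (4.111)-(4.112) for exponential F ~ Exp(lam) and
# (assumed) exponential G ~ Exp(mu), versus the approximation lam^2 * mu_G.

lam, mu = 0.01, 1.0                 # failure and repair rates; delta = 0.01

q = lam / (lam + mu)                # q = P(T <= R) = int_0^inf F(t) dG(t)
EY = 1.0 / mu - 1.0 / (lam + mu)    # EY = int_0^inf F(t) (1 - G(t)) dt
mu_F = 1.0 / lam

lam_phi = q / (mu_F + EY)           # exact value, formula (4.111)
approx = lam ** 2 / mu              # lam^2 * mu_G
print(lam_phi, approx)
```

For δ = λ/μ = 0.01 the exact value and the approximation agree to within about one percent, consistent with the high-availability argument in the text.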

Repair Regime R2. Finally in this section we briefly comment on the repair regime R2. We assume constant failure rates. It can be argued that if there are ample repair facilities, i.e., the repair regime is R2, the steady-state unavailability is invariant with respect to the repair time distribution; cf., e.g., Smith [145] and Tijms [156], p. 175. This means that we can use the steady-state Markov formula (4.100), p. 168, also when the repair time distribution is not


exponential. The result depends on the repair time distribution only through its mean value. However, a strict mathematical proof of this invariance result does not seem to have been presented yet.

Bibliographic Notes. Alternating renewal processes are studied in many textbooks, e.g., Birolini [44] and Ross [135]. Different versions of the one-component downtime distribution formula in Theorem 4.14 (p. 118) have been formulated and proved in the literature, cf. [44, 45, 57, 65, 69, 154]. The first version was established by Takács. Theorem 4.14, which is taken from Haukas and Aven [82], seems to be the most general formulation and also has the simplest proof.

Some key references to the theory of point availability of monotone systems and the mean number of system failures are Barlow and Proschan [31, 32] and Ross [136]; see also Aven [13]. Parallel systems of two identical components have been studied by a number of researchers, see, e.g., [34, 73, 76]. Gaver [73] established formulas for the distribution and mean of the time to the first system failure, identical to those presented in Sect. 4.4, p. 139. Our derivation of these formulas is, however, different from Gaver's.

Asymptotic analysis of highly available systems has been carried out by a number of researchers. A survey is given by Gertsbakh [75], with emphasis on results related to the convergence of the distribution of the first system failure to the exponential distribution. See also the books by Gnedenko and Ushakov [76], Ushakov [157], and Kovalenko et al. [110, 111]. Some of the earliest results go back to work done by Keilson [104] and Solovyev [148]. A result similar to Lemma 4.24 (p. 127) was first proved by Keilson [104]; see also [76, 105, 109]. Our version of this lemma is taken from Aven and Jensen [26]. To establish the asymptotic exponential distribution, different normalizing factors are used, e.g., q/E0S, where q equals the probability of having at least one system failure in a renewal cycle and E0S equals the expected cycle length given that no system failures occur in the cycle. This factor, as well as the other factors considered in the early literature in this field (cf., e.g., the references [75, 76, 157]), are generally difficult to compute. The asymptotic failure rate of the system, λΦ, is more attractive from a computational point of view and is given most attention in this presentation. We find it somewhat difficult to read some of the earlier literature on availability. A large part of the research in this field has been developed outside the framework of monotone system theory. Using this framework it is possible to give a unified presentation of the results. Our set-up and results (Sect. 4.4) are to a large extent taken from the recent papers by Aven and Haukas [22] and Aven and Jensen [26]. These papers also cover convergence of the number of system failures to the Poisson distribution.

The literature includes a number of results proving that the exponential/Poisson distribution is the asymptotic limit of certain sums of point processes. Most of these results are related to the thinning of independent processes, see, e.g., Cinlar [55], Daley and Vere-Jones [58], and Kovalenko et al. [111]. See also Lam and Lehoczky [114] and the references therein. These results are not applicable to the availability problems studied in this book.

Sections 4.5 and 4.6 are to a large extent based on Gasemyr and Aven [72], Aven and Haukas [23], and Aven and Jensen [26]. Gasemyr and Aven [72] and Aven and Haukas [23] study the asymptotic downtime distribution given system failure. Theorem 4.42 is due to Haukas (see [26, 81]) and Smith [146]. Aven and Jensen [26] give sufficient conditions for when a compound Poisson distribution is an asymptotic limit for the distribution of the downtime of a monotone system observed in a time interval. An alternative approach for establishing the compound Poisson process limit is given by Serfozo [138]. There exist several asymptotic results in the literature linking the sums of independent point processes with integer marks to the compound Poisson process; see, e.g., [153]. It is, however, not possible to use these results for studying the asymptotic downtime distributions of monotone systems.

Section 4.7.1 generalizes results obtained in the previous sections to multistate systems. The presentation on multistate systems is based on Aven [11, 14]. For the analysis in Sect. 4.7.3 on standby systems, reference is given to the work by Aven and Opdal [27].

In this chapter we have primarily focused on the situation in which the component lifetime distributions are exponential. In Sect. 4.7.1 we outlined how some of the results can be extended to phase-type distributions. A detailed analysis of the nonexponential case (nonregenerative case) is, however, outside the scope of this book. Further research is needed to present formally proved results for the general case. Presently, the literature covers only some particular cases. Intuitively, it seems clear that it is possible to generalize many of the results obtained in this chapter. Consider, for example, the convergence to the Poisson process for the number of system failures. As long as the components are highly available, we would expect the number of failures to be approximately Poisson distributed. But formal asymptotic results are rather difficult to establish; see, for example, [102, 106, 112, 152, 162]. Strict conditions on the system structure and the component lifetime and downtime distributions have to be imposed to establish the results. Also, the general approach of showing that the compensator of the counting process converges in probability (see Daley and Vere-Jones [58], p. 552) is difficult to apply in our setting.

Of course, this chapter covers only a small number of availability models compared to the large number of models presented in the literature. We have, for example, not included models where some components remain in “suspended animation” while a component is being repaired/replaced, and models allowing preventive maintenance. For such models, and other related models, refer to the above cited references, Beichelt and Franken [36], Osaki [128], Srinivasan and Subramanian [150], Van Heijden and Schornagel [160], and Yearout et al. [166]. See also the survey paper by Smith et al. [147].

5 Maintenance Optimization

In this chapter we combine the general lifetime model of Chap. 3 with maintenance actions like repairs and replacements. Given a certain cost and reward structure, an optimal repair and replacement strategy will be derived. We begin with some basic and well-known models and then come to more complex ones, which show how the general approach can be exploited to open up a variety of different optimization models.

5.1 Basic Replacement Models

First of all we consider some basic models that are simple in both the lifetime modeling and the optimization criterion. These basic models include the age and the block replacement models, which are widely used and thoroughly investigated. A technical system is considered, the lifetime of which is described by a positive random variable T with distribution F. Upon failure the system is immediately replaced by an equivalent one and the process repeats itself. A preventive replacement can be carried out before failure. Each replacement incurs a cost of c > 0 and each failure adds a penalty cost k > 0.

5.1.1 Age Replacement Policy

For this policy a replacement age s, s > 0, is fixed for each system, at which a preventive replacement takes place. If Ti, i = 1, 2, . . ., are the successive lifetimes of the systems, then τi = Ti ∧ s denotes the operating time of the ith system and equals the ith cycle length. The random variables Ti are assumed to form an i.i.d. sequence with common distribution F, i.e., F(t) = P(Ti ≤ t). The costs for one cycle are described by the stochastic process Z = (Zt), t ∈ R+, Zt = c + kI(T ≤ t). Clearly, the average cost after n cycles is

T. Aven and U. Jensen, Stochastic Models in Reliability, Stochastic Modelling and Applied Probability 41, DOI 10.1007/978-1-4614-7894-2_5, © Springer Science+Business Media New York 2013


(∑_{i=1}^n Z_{τi}) / (∑_{i=1}^n τi)

and the total cost per unit time up to time t is given by

Ct = (1/t) ∑_{i=1}^{Nt} Z_{τi},

where (Nt), t ∈ R+, is the renewal counting process generated by (τi) and Zτ = c + kI(T ≤ τ) describes the incurred costs in one cycle. It is well known from renewal theory (see Appendix B, p. 280) that the limits of the expectations of these ratios, Ks, coincide and are equal to the ratio of the expected costs for one cycle and the expected cycle length:

Ks = lim_{n→∞} E[(∑_{i=1}^n Z_{τi}) / (∑_{i=1}^n τi)] = lim_{t→∞} E Ct = EZτ / Eτ.

The objective is to find the replacement age that minimizes this long run average cost per unit time. Inserting the cost function Zt = c + kI(T ≤ t) we get

Ks = (c + kF(s)) / ∫₀^s (1 − F(x)) dx.   (5.1)

Now elementary analysis can be used to find the optimal replacement age s, i.e., to find s∗ with

Ks∗ = inf{Ks : s ∈ R+ ∪ {∞}}.

Here s∗ = ∞ means that preventive replacements do not pay and it is optimal to replace only at failures. As can easily be seen, this case occurs if the lifetimes are exponentially distributed, i.e., if F(t) = 1 − exp{−λt}, t ≥ 0, λ > 0, then K∞ = λ(c + k) ≤ Ks for all s > 0.

Example 5.1. Using rudimentary calculus we see that, in the case of an increasing failure rate λ(t) = f(t)/F̄(t), the optimal replacement age is given by

s∗ = inf{ t ∈ R+ : λ(t) ∫₀^t F̄(x) dx − F(t) ≥ c/k },

where inf ∅ = ∞. By differentiating, it is not hard to show that the left-hand side of the inequality is increasing in the IFR case, so that s∗ can easily be determined. As an example, consider the Weibull distribution F(t) = 1 − exp{−(λt)^β}, t ≥ 0, with λ > 0 and β > 1. The corresponding failure rate is λ(t) = λβ(λt)^{β−1} and the optimal replacement age is the unique solution of

λ(t) ∫₀^t exp{−(λx)^β} dx − 1 + exp{−(λt)^β} = c/k.

The cost minimum is then given by Ks∗ = kλ(s∗).
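The Weibull equation above has no closed-form solution but is easy to solve numerically. The following sketch (our own; the parameter values λ = 1, β = 2, c = 1, k = 10 are arbitrary) finds s∗ by bisection, using the fact that the left-hand side is increasing in the IFR case, and then checks the identity Ks∗ = kλ(s∗):

```python
import math

lam, beta = 1.0, 2.0   # assumed Weibull parameters (beta > 1, IFR case)
c, k = 1.0, 10.0       # assumed replacement cost and failure penalty

def hazard(t):
    return lam * beta * (lam * t) ** (beta - 1)

def surv_int(t, n=5000):
    # int_0^t exp(-(lam*x)^beta) dx, trapezoidal rule
    h = t / n
    s = 0.5 * (1.0 + math.exp(-(lam * t) ** beta))
    for i in range(1, n):
        s += math.exp(-(lam * i * h) ** beta)
    return s * h

def g(t):
    # left-hand side of the optimality condition minus c/k
    F = 1.0 - math.exp(-(lam * t) ** beta)
    return hazard(t) * surv_int(t) - F - c / k

lo, hi = 1e-9, 10.0
for _ in range(60):               # bisection; g is increasing in the IFR case
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
s_star = 0.5 * (lo + hi)

F_star = 1.0 - math.exp(-(lam * s_star) ** beta)
K_star = (c + k * F_star) / surv_int(s_star)   # cost (5.1) at s_star
print(s_star, K_star, k * hazard(s_star))      # last two should agree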


The age replacement policy allows for planning of a preventive replacement only when a new item is installed. If one wants to fix the time points for preventive replacements in advance for a longer period, one is led to the block replacement policy.

5.1.2 Block Replacement Policy

Under this policy the item is replaced at times is, i = 1, 2, . . ., s > 0, and at failures. The preventive replacements occur at regular predetermined intervals at a cost of c, whereas failures within the intervals incur a cost of c + k.

The advantage of this policy is its simple structure and administration, because the time points of preventive replacements are fixed and determined in advance. On the other hand, preventive replacements are carried out irrespective of the age of the processing unit, so this policy is usually applied to several units at the same time and only if the replacement cost c is comparatively low.

For a fixed time interval s the long run average cost per unit time is

Ks = ((c + k)M(s) + c) / s,   (5.2)

where M is the renewal function, M(t) = ∑_{j=1}^∞ F^{∗j}(t) (see Appendix B, p. 274). If the renewal function is known explicitly, we can again use elementary analysis to find the optimal s, i.e., to find s∗ with

Ks∗ = inf{Ks : s ∈ R+ ∪ {∞}}.

In most cases the renewal function is not known explicitly. In such cases, asymptotic expansions like Theorem B.5, p. 277, in Appendix B or numerical methods have to be used. As is to be expected, in the case of an Exp(λ) distribution preventive replacements do not pay: M(s) = λs and s∗ = ∞.

Example 5.2. Let F be the Gamma distribution function with parameters λ > 0 and n = 2. The corresponding renewal function is

M(s) = λs/2 − (1/4)(1 − e^{−2λs})

(cf. [1], p. 274) and s∗ can be determined as the solution of

(d/ds) M(s) = M(s)/s + c/(s(c + k)).

The solution s∗ is finite if and only if c/(c + k) < 1/4, i.e., if failure replacements are at least four times as expensive as preventive replacements.
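A short numerical companion to this example (our own sketch, with arbitrary costs satisfying c/(c + k) = 0.1 < 1/4) confirms that a finite minimizer of (5.2) exists and beats the limiting cost K∞ = (c + k)λ/2 of replacing at failures only:

```python
import math

lam = 1.0
c, k = 1.0, 9.0      # c/(c + k) = 0.1 < 1/4, so a finite s* exists

def M(s):
    # renewal function of the Gamma(lam, 2) lifetime distribution
    return lam * s / 2.0 - (1.0 - math.exp(-2.0 * lam * s)) / 4.0

def K(s):
    # block replacement cost (5.2)
    return ((c + k) * M(s) + c) / s

grid = [0.01 * i for i in range(1, 2000)]   # s in (0, 20]
s_star = min(grid, key=K)
K_inf = (c + k) * lam / 2.0                 # limit of K_s as s -> infinity
print(s_star, K(s_star), K_inf)
```

Here K(s∗) ≈ 3.7 is well below K∞ = 5, illustrating the gain from preventive block replacements when failures are expensive.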


The age and block replacement policies will result in a finite optimal value of s only if there is some aging and wear-out of the units, i.e., in probabilistic terms, the lifetime distribution F fulfills some aging condition like IFR, NBU, or NBUE (see Chap. 2 for these notions). To judge whether it pays to follow a certain policy, and in order to compare the policies, it is useful to consider the number of failures and the number of planned preventive replacements in a time interval [0, t].

5.1.3 Comparisons and Generalizations

Let F be the underlying lifetime distribution that generates the renewal counting process (Nt), t ∈ R+, so that Nt describes the number of failures or completed replacements in [0, t] following the basic policy: replace at failure only. Let N_t^A(s) and N_t^B(s) denote the number of failures up to time t following policy A (age replacement) or B (block replacement), respectively, and R_t^A(s) and R_t^B(s) the corresponding total numbers of removals in [0, t], including failures and preventive replacements. We now want to summarize some early comparison results that can be found, including the proofs, in the monographs of Barlow and Proschan [31, 32]. We remind the reader of the notion of stochastic comparison of two positive random variables X and Y: X ≤st Y means P(X > t) ≤ P(Y > t) for all t ∈ R+.

Theorem 5.3. The following four assertions hold true:

(i) Nt ≥st N_t^B(s) for all t ≥ 0, s ≥ 0 ⟺ F is NBU;
(ii) Nt ≥st N_t^A(s) for all t ≥ 0, s ≥ 0 ⟺ F is NBU;
(iii) F IFR ⇒ Nt ≥st N_t^A(s) ≥st N_t^B(s) for all t ≥ 0, s ≥ 0;
(iv) R_t^A(s) ≤st R_t^B(s) for all t ≥ 0, s ≥ 0.

Parts (i) and (ii) say that under the weak aging notion NBU it is useful to apply a replacement strategy, since the number of failures is (stochastically) decreased under such a strategy. If, in addition, F has an increasing failure rate, block replacement results in stochastically fewer failures than age replacement, and it follows that EN_t^A(s) ≥ EN_t^B(s). On the other hand, for any lifetime distribution F (irrespective of aging notions), block policies have more removals than age policies.

Theorem 5.4. N_t^A(s) is stochastically increasing in s for each t ≥ 0 if and only if F is IFR.

This result says that IFR is characterized by the reasonable aging condition that the number of failures grows with increasing replacement age. Somewhat weaker results hold true for the block policy (see Shaked and Zhu [143] for proofs):

Theorem 5.5. If N_t^B(s) is stochastically increasing in s for each t ≥ 0, then F is IFR.


Theorem 5.6. The expected value EN_t^B(s) is increasing in s for each t ≥ 0 if and only if the renewal function M(t) is convex.

Since the monographs of Barlow and Proschan appeared, many possible generalizations have been investigated concerning (a) the comparison methods and (b) the lifetime models, replacement policies, and cost structures. It is beyond the scope of this book to describe all of these models and refinements. Some hints for further reading can be found in the Bibliographic Notes at the end of the chapter.

Berg [37] and Dekker [63], among others, use a marginal cost analysis for studying the optimal replacement problem. Let us, for example, consider this approach for block-type policies. In this model it is assumed that the long run average cost per unit time is given by

Ks = (c + R(s)) / s,   (5.3)

where c is the cost of a preventive replacement and R(s) = ∫₀^s r(x) dx denotes the total expected costs due to deterioration over an interval of length s. The derivative r, called the (marginal) deterioration cost rate, is assumed to be continuous and piecewise differentiable. If, in the block replacement model of the preceding Sect. 5.1.2, the lifetime distribution function F has a bounded density f, then it is known (see Appendix B, p. 278) that the corresponding renewal function M also admits a density m, and we have R(s) = ∫₀^s (c + k)m(x) dx, which shows that the block replacement model is a special case of this block-type model. Now certain properties of the marginal cost rate can be carried over to the cost function K. The proof of the following theorem is straightforward and can be found in [63].

Theorem 5.7. (i) If r(t) is nonincreasing on [t0, t1] for some 0 ≤ t0 < t1 and r(t0) < K_{t0}, then Ks is also nonincreasing in s on [t0, t1];

(ii) if r(t) increases strictly for t > t0 and some t0 ≥ 0, where r(t0) < K_{t0}, and if either

(a) lim_{t→∞} r(t) = ∞ or (b) lim_{t→∞} r(t) = a and lim_{t→∞}(at − R(t)) > c,

then Ks has a minimum, say K∗ at s∗, which is unique on [t0, ∞); moreover, K∗ = Ks∗ = r(s∗).

Thus a myopic policy, in which at every moment we consider whether to defer the replacement or not, is optimal. That is, the expected cost of deferring the replacement to t + Δt, being r(t)Δt, should be compared with the minimum average cost over an interval of the same length, being K∗Δt. Hence, if r(t) is larger than K∗, the deferment costs are larger and we should replace. This is the idea of marginal cost analysis as described, for example, in [37, 63].
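A toy instance (our own, with an assumed linear deterioration cost rate r(t) = t, so that R(s) = s²/2) makes the identity K∗ = r(s∗) of Theorem 5.7 concrete: here Ks = c/s + s/2 is minimized at s∗ = √(2c) with K∗ = √(2c) = r(s∗).

```python
import math

c = 2.0                           # assumed preventive replacement cost

def r(t):
    return t                      # assumed marginal deterioration cost rate

def K(s):
    return (c + s * s / 2.0) / s  # (c + R(s)) / s with R(s) = s^2 / 2

s_star = math.sqrt(2.0 * c)       # closed-form minimizer for this r
print(K(s_star), r(s_star))       # equal: the marginal cost identity K* = r(s*)
```

The myopic rule "replace as soon as r(t) exceeds K∗" therefore reproduces the optimal block-type policy in this instance.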

The above framework can be extended to age-type policies if we consider the following long run average cost per unit time:

Ks = (c + ∫₀^s r(x)F̄(x) dx) / ∫₀^s F̄(x) dx,   (5.4)

where c is the cost of a preventive replacement and r denotes the marginal deterioration cost rate. Again it can easily be seen that the basic age replacement model (5.1) is a special case, setting r(x) = kλ(x), where λ(x) = f(x)/F̄(x) is the failure rate. Now a very similar analysis can be carried out (see [63]) and the same theorem holds true for this cost criterion, except that condition (ii)(b) has to be replaced by

lim_{t→∞} r(t) = a and a > lim_{s→∞} Ks for some a > 0.

This shows that behind these two quite different models the same optimization mechanism works. This has been exploited by Aven and Bergman in [19] (see also [21]). They recognized that for many replacement models the optimization criterion can be written in the form

E[∫₀^τ at ht dt + c0] / E[∫₀^τ ht dt + p0],   (5.5)

where τ is a stopping time based on the information about the condition of the system, (at) is a nondecreasing stochastic process, (ht) is a nonnegative stochastic process, and c0 and p0 are nonnegative random variables; all variables are adapted to the information about the condition of the system. Both the block-type model (5.3) and the age-type model (5.4) are included. Take, for example, deterministic values for all random quantities, especially τ = t, ht = F̄(t), at = r(t), p0 = 0, and c0 = c. This leads to the age-type model. In (5.5) the stopping time τ is the control variable, which should be determined in such a way that (5.5) is minimized. This problem of choosing a minimizing stopping time is known as an optimal stopping problem and will be further developed in the next section.

5.2 A General Replacement Model

In this section we want to develop the tools that allow certain maintenance problems to be solved in a fairly general way, also considering the possibility of taking different levels of information into account.

5.2.1 An Optimal Stopping Problem

In connection with maintenance models as described above, we will have to solve optimization problems. Often an optimal point in time has to be determined that maximizes some reward functional. In terms of the theory of stochastic processes, this optimal point in time will be a stopping time τ that maximizes the expectation EZτ of some stochastic process Z. We will see that the smooth semimartingale (SSM) representation of Z, as introduced in detail in Sect. 3.1, is an excellent tool for carrying out this optimization. Therefore, we want to solve the stopping problem and to characterize optimal stopping times for the case in which Z is an SSM and τ ranges in a suitable class of stopping times, say

C^F = {τ : τ is an F-stopping time, τ < ∞, EZτ > −∞}.

Without any conditions on the structure of the process Z one cannot hope to find an explicit solution of the stopping problem. A condition called the monotone case in the discrete time setting can be transferred to continuous time as follows.

Definition 5.8 (MON). Let Z = (f, M) be an SSM. Then the condition

{ft ≤ 0} ⊂ {f_{t+h} ≤ 0} ∀ t, h ∈ R+,   ⋃_{t∈R+} {ft ≤ 0} = Ω   (5.6)

is said to be the monotone case and the stopping time

ζ = inf{t ∈ R+ : ft ≤ 0}

is called the ILA-stopping rule (infinitesimal-look-ahead).

Obviously, in the monotone case the process f driving the SSM Zt = ∫₀^t fs ds + Mt remains negative (nonpositive) once it crosses zero from above, and the ILA-stopping rule ζ is a natural candidate to solve the maximization problem.

Theorem 5.9. Let Z = (f, M) be an F-SSM and ζ the ILA-stopping rule. If the martingale M is uniformly integrable, then in the monotone case (5.6)

EZζ = sup{EZτ : τ ∈ C^F}.

Remark 5.10. The condition that the martingale is uniformly integrable can be relaxed; in [98] it is shown that the condition may be replaced by

Mζ ∈ L¹, ζ ∈ C^F,   lim_{t→∞} ∫_{{τ>t}} M_t^− dP = 0 ∀ τ ∈ C^F,

where, as usual, a^− denotes the negative part of a ∈ R: a^− = max{−a, 0}. But in most cases such a generalization will not be used in what follows.

Proof. Since M is uniformly integrable, we have EMτ = 0 for all τ ∈ C^F as a consequence of the optional sampling theorem (cf. Appendix A, p. 262). Also, ζ is an element of C^F because ζ < ∞ per definition and EZ_ζ^− ≤ E|Z0| + E|Mζ| < ∞. It remains to show that

E ∫₀^ζ fs ds ≥ E ∫₀^τ fs ds

for all τ ∈ C^F. But this is an immediate consequence of fs > 0 on {ζ > s} and fs ≤ 0 on {ζ ≤ s}. □

The following example demonstrates how this optimization technique can be applied.

Example 5.11. Let ρ be an exponentially distributed random variable with parameter λ > 0 on the basic probability space (Ω, F, F, P) equipped with the filtration F generated by ρ:

Ft = σ({ρ > s}, 0 ≤ s ≤ t) = σ(I(ρ > s), 0 ≤ s ≤ t) = σ(ρ ∧ t).

For the latter equality we make use of our agreement that σ(·) denotes the completion of the generated σ-algebra so that, for instance, the event {ρ = t} = ⋂_{n∈N} {t − 1/n < ρ ≤ t} is also included in σ(ρ ∧ t). Then we define

Zt = e^t I(ρ > t), t ∈ R+.

This process Z can be interpreted as the potential gain in a harvesting problem (in a wider sense): there is an exponentially growing potential gain, and at any time t the decision-maker has to decide whether to realize this gain or to continue observations with the chance of earning a higher gain. But the gain can only be realized up to a random time ρ, which is unknown in advance. So there is a risk of losing all potential gains, and the problem is to find an optimal harvesting time.

The process Z is adapted, right-continuous, and integrable with

E[Z_{t+h} | Ft] = e^{t+h} E[I(ρ > t + h) | Ft] = e^{(1−λ)h} Zt,   h, t ∈ R+.

Thus Z is a submartingale (martingale, supermartingale) if λ < 1 (λ = 1, λ > 1). Obviously we have

lim_{h→0+} (1/h) E[Z_{t+h} − Zt | Ft] = Zt(1 − λ) = ft.

Theorem 3.6, p. 60, states that Z is an SSM with representation

Zt = 1 + ∫₀^t Zs(1 − λ) ds + Mt.

Three cases will be discussed separately:

1. λ < 1. The monotone case (5.6) holds true with the ILA stopping time ζ = ρ. But ζ is not optimal, because EZζ = 0 and Z is a submartingale with unbounded expectation function: sup{EZτ : τ ∈ C^F} = ∞.


2. λ > 1. The monotone case holds true with the ILA stopping time ζ = 0. It is not hard to show that in this case the martingale

Mt = Zt − 1 − ∫₀^t Zs(1 − λ) ds

is uniformly integrable. Theorem 5.9 ensures that ζ is optimal with EZζ = 1.

3. λ = 1. Again the monotone case (5.6) holds true with the ILA stopping time ζ = 0. However, the martingale Mt = e^t I(ρ > t) − 1 is not uniformly integrable. But for all τ ∈ C^F we have EM_τ^− ≤ 1 and

lim_{t→∞} ∫_{{τ>t}} M_t^− dP ≤ lim_{t→∞} ∫_{{τ>t}} dP = 0,

so that the more general conditions mentioned in the above remark are fulfilled with Mζ = 0. This yields

EZζ = 1 = sup{EZτ : τ ∈ C^F}.
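Since EZt = e^{(1−λ)t} for a deterministic stopping time t, the three cases can be illustrated numerically (our own sketch; the values of λ and t are arbitrary):

```python
import math

def expected_gain(lam, t):
    # E[Z_t] = E[e^t I(rho > t)] = e^t * e^{-lam*t} = e^{(1 - lam) * t}
    return math.exp((1.0 - lam) * t)

for lam in (0.5, 1.0, 2.0):
    # lam < 1: gain grows without bound; lam = 1: constant 1; lam > 1: stop at 0
    print(lam, [expected_gain(lam, t) for t in (0.0, 1.0, 5.0)])
```

For λ > 1 every positive deferment loses in expectation, matching the optimality of ζ = 0, while for λ < 1 the supremum is infinite, matching case 1.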

5.2.2 A Related Stopping Problem

As was described in Sect. 5.1, replacement policies of age and block type are strongly connected to the following stopping problem: minimize

Kτ = EZτ / EXτ,   (5.7)

in a suitable class of stopping times, where Z and X are real stochastic processes. For a precise formulation and solution of this problem we use the set-up given in Chap. 3. On the basic complete probability space (Ω, F, P) a filtration F = (Ft), t ∈ R+, is given, which is assumed to fulfill the usual conditions concerning right continuity and completeness. Furthermore, let Z = (Zt) and X = (Xt), t ∈ R+, be real right-continuous stochastic processes adapted to the filtration F. Let T > 0 be a finite F-stopping time with EZ_T > −∞, E|X_T| < ∞, and

C^F_T = {τ : τ is an F-stopping time, τ ≤ T, EZτ > −∞, E|Xτ| < ∞}.

For τ ∈ C^F_T we consider the ratio Kτ in (5.7). The stopping problem is then to find a stopping time σ ∈ C^F_T with

K∗ = Kσ = inf{Kτ : τ ∈ C^F_T}.   (5.8)

In this model T describes the random lifetime of some technical system. The index t can be regarded as a time point and Ft as the σ-algebra which contains all information gathered up to time t. The stochastic processes Z and X are adapted to the stream of information F, i.e., Z and X are observable with respect to the given information or, in mathematical terms, Zt and Xt are Ft-measurable for all t ∈ R+. The replacement times can then be identified with stopping times not greater than the system lifetime T.


Example 5.12. In the case of block-type models no random information is to be considered, so the filtration reduces to the trivial one and all stopping times are constants, i.e., C^F_T = R+ ∪ {∞}. In this case elementary analysis yields the optimum and no additional efforts are necessary.

Example 5.13. Let Zt = c + kI(T ≤ t), Xt = t, and Ft = σ(Zs, 0 ≤ s ≤ t) = σ(I(T ≤ s), 0 ≤ s ≤ t) be the σ-algebra generated by Z, i.e., at any time t ≥ 0 it is known whether the system works or not. The F-stopping times τ ∈ C^F_T are of the form τ = t∗ ∧ T for some t∗ > 0. Then we have EZτ = c + kEI(T ≤ τ) = c + kP(T ≤ t∗) and EXτ = Eτ, which leads to the basic age replacement policy.

To solve the above-mentioned stopping problem, we will make use of semimartingale representations of the processes Z and X. It is assumed that Z and X are SSMs as introduced in Sect. 3.1 with representations

Zt = Z0 + ∫₀^t fs ds + Mt,

Xt = X0 + ∫₀^t gs ds + Lt.

As in Sect. 3.1 we use the short notation Z = (f, M) and X = (g, L). Almost all of the stochastic processes used in applications without predictable jumps admit such SSM representations. The following general assumption is made throughout this section:

Assumption (A). Z = (f, M) and X = (g, L) are SSMs with EZ0 > 0, EX0 ≥ 0, gs > 0 for all s ∈ R+, and M^T, L^T ∈ M0 are uniformly integrable martingales, where M^T_t = M_{t∧T}, L^T_t = L_{t∧T}.

Remember that all relations between real random variables hold (only) P-almost surely. The first step in solving the optimization problem is to establish bounds for K∗ in (5.8).

Lemma 5.14. Assume that (A) is fulfilled and

q = inf{ ft(ω)/gt(ω) : 0 ≤ t < T(ω), ω ∈ Ω } > −∞.

Then

b_l ≤ K∗ ≤ b_u

holds true, where the bounds are given by

b_u = EZ_T / EX_T,

b_l = E[Z0 − qX0]/EX_T + q   if E[Z0 − qX0] > 0,
b_l = EZ0/EX0                if E[Z0 − qX0] ≤ 0.


Proof. Because T ∈ C^F_T, only the lower bound has to be shown. Since the martingales M^T and L^T are uniformly integrable, the optional sampling theorem (see Appendix A, p. 262) yields EMτ = ELτ = 0 for all τ ∈ C^F_T and therefore

Kτ ≥ (EZ0 + qE[Xτ − X0]) / EXτ = (EZ0 − qEX0)/EXτ + q ≥ b_l.

The lower bound is derived observing that EX0 ≤ EXτ ≤ EX_T, which completes the proof. □

The following example gives these bounds for the basic age replacement policy.

Example 5.15 (Continuation of Example 5.13). Let us return to the simple cost process Zt = c + kI(T ≤ t) with the natural filtration as before. Then I(T ≤ t) has the SSM representation

I(T ≤ t) = ∫₀^t I(T > s)λ(s) ds + M′_t,

where λ is the usual failure rate of the lifetime T. It follows that the processes Z and X have representations

Zt = c + ∫₀^t I(T > s)kλ(s) ds + Mt,   Mt = kM′_t

and

Xt = t = ∫₀^t ds.

Assuming the IFR property, we obtain, with λ(0) = inf{λ(t) : t ∈ R+} and q = kλ(0), the following bounds for K∗ in the basic age replacement model:

b_u = EZ_T/EX_T = (c + k)/ET,

b_l = c/ET + kλ(0).

These bounds could also be established directly by using (5.1), p. 176. The benefit of Lemma 5.14 lies in its generality, which also allows the bounds to be found in more complex models, as the following example shows.

Example 5.16 (Shock Model). Consider now a compound point process model in which shocks arrive according to a marked point process (T_n, V_n) as was outlined in Sect. 3.3.3. Here we assume that (T_n) is a nonhomogeneous Poisson process with a deterministic intensity λ(s) integrating to Λ(t) = ∫_0^t λ(s)ds and that (V_n) forms an i.i.d. sequence of nonnegative random variables independent of (T_n) with V_n ∼ F. The accumulated damage up to time t is then described by

186 5 Maintenance Optimization

    R_t = Σ_{n=1}^{N_t} V_n,

where N_t = Σ_{n=1}^∞ I(T_n ≤ t) is the number of shocks arrived until t. The lifetime of the system is modeled as the first time R_t reaches a fixed threshold S > 0:

    T = inf{t ∈ R+ : R_t ≥ S}.

We stick to the simple cost structure of the basic age replacement model, i.e.,

Zt = c+ kI(T ≤ t).

But now we want to minimize the expected costs per number of arrived shocks in the long run, i.e.,

Xt = Nt.

This cost criterion is appropriate if we think, for example, of systems that are used by customers at times T_n. Each usage causes some random damage (shock). If the customers arrive with varying intensities governed by external circumstances, e.g., different intensities at different periods of a day, it makes no sense to relate the costs to time, and it is more reasonable to relate the costs to the number of customers served.

The semimartingale representations with respect to the internal filtration generated by the marked point process are (cf. Sect. 3.3.5, p. 89)

    Z_t = c + ∫_0^t I(T > s)kλ(s)F̄((S − R_s)−)ds + M_t,

    X_t = ∫_0^t λ(s)ds + L_t.

The martingaleM is uniformly integrable and so is LT = (Lt∧T ) if we assume

that E∫ T

0λ(s)ds = EΛ(T ) <∞. Lemma 5.14 yields, with

q = inf{kF ((S −Rt)−) : 0 ≤ t < T (ω), ω ∈ Ω} = kF (S−),

the following bounds for K∗ = inf{Kτ : τ ∈ CF

T } :

    b_u = (c + k)/EX_T,   b_l = c/EX_T + kF̄(S−),

where EX_T = EΛ(T). Observe that X_T = inf{n ∈ N : Σ_{i=1}^n V_i ≥ S} and {X_T > k} = {Σ_{i=1}^k V_i < S}. This yields

    EX_T = Σ_{k=0}^∞ P(Σ_{i=1}^k V_i < S) ≤ Σ_{k=0}^∞ F(S−)^k = 1/F̄(S−),


if F(S−) < 1. In addition, using Wald's equation EΣ_{n=1}^{X_T} V_n = EX_T · EV_1 ≥ S, we can derive the following alternative bounds

    b′_u = (c + k)EV_1/S,   b′_l = (c + k)F̄(S−),

which can easily be computed.
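These enclosures are easy to check by simulation. The following sketch (an exponential damage distribution and all parameter values are my assumptions) estimates EX_T by Monte Carlo and compares it with the Wald lower bound S/EV_1 and the geometric upper bound 1/F̄(S−):

```python
import math
import random

# Monte Carlo check (assumed V_n ~ Exp(nu)): X_T is the number of shocks
# needed for the accumulated damage to reach S; then
#   S/EV_1 <= E X_T <= 1/bar F(S-),  with bar F(S-) = P(V >= S) = e^(-nu*S).
random.seed(1)
nu, S = 2.0, 3.0
EV1 = 1.0 / nu
barF_S = math.exp(-nu * S)

def shocks_to_threshold():
    total, n = 0.0, 0
    while total < S:
        total += random.expovariate(nu)
        n += 1
    return n

N = 50_000
EXT = sum(shocks_to_threshold() for _ in range(N)) / N
print(f"estimated E X_T = {EXT:.3f} "
      f"(for Exp(nu) damage the exact value is nu*S + 1 = {nu * S + 1:.1f})")
```

For exponential damage the exact value νS + 1 used later in Example 5.23 lies well inside both enclosures.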

To solve the stopping problem (5.8) for a ratio of expectations, we use the solution of the simpler case in which we look for the maximum of the expectations EZ_τ, where Z is an SSM and τ ranges in a suitable class of stopping times, which has been considered in detail in Sect. 5.2. It is a well-known technique to replace the minimization problem (5.8) by an equivalent maximization problem. Observing that K_τ = EZ_τ/EX_τ ≥ K∗ is equivalent to K∗EX_τ − EZ_τ ≤ 0 for all τ ∈ C^F_T, where equality holds for an optimal stopping time, one has the maximization problem:

Find σ ∈ C^F_T with EY_σ = sup{EY_τ : τ ∈ C^F_T} = 0, where   (5.9)

    Y_t = K∗X_t − Z_t and K∗ = inf{K_τ : τ ∈ C^F_T}.

This new stopping problem can be solved by means of the semimartingale representation of the process Y = (Y_t) for t ∈ [0, T),

    Y_t = K∗X_0 − Z_0 + ∫_0^t (K∗g_s − f_s)ds + R_t,   (5.10)

where the martingale R = (R_t), t ∈ R+, is given by

    R_t = K∗L_t − M_t.

Now the procedure is as follows. If the integrand k_s = K∗g_s − f_s fulfills the monotone case (MON), then Theorem 5.9, p. 181, of Sect. 5.2 yields that the ILA-stopping rule σ = inf{t ∈ R+ : k_t ≤ 0} is optimal, provided the martingale part R is uniformly integrable. Note, however, that this stopping time σ depends on the unknown value K∗, which can be determined from the equality EY_σ = 0.

Next we want to define monotonicity conditions that ensure (MON). Obviously, under assumption (A), p. 184, the monotone case holds true if the ratio f_s/g_s is increasing (P-a.s.) with f_0/g_0 < K∗ and lim_{s→∞} f_s/g_s > K∗. The value K∗ is unknown, so we need to use the bounds derived above, and it seems too restrictive to demand that the ratio is increasing. Especially bath-tub-shaped functions, which decrease first up to some s_0 and increase for s > s_0, should be covered by the monotonicity condition. This results in the following definition.

Definition 5.17. Let a, b ∈ R ∪ {−∞, ∞} be constants with a ≤ b. Then a function r : R+ → R is called


(i) (a, b)-increasing, if for all t, h ∈ R+, r(t) ≥ a implies r(t + h) ≥ r(t) ∧ b;
(ii) (a, b)-decreasing, if for all t, h ∈ R+, r(t) ≤ b implies r(t + h) ≤ r(t) ∨ a.

Roughly speaking, an (a, b)-increasing function r(t) passes, as t increases, the levels a and b from below and never falls back below such a level. Between a and b the increase is monotone. Obviously a (0, 0)-decreasing function fulfills (MON) if r(∞) ≤ 0. A (−∞, ∞)-increasing (decreasing) function is monotone in the ordinary sense.
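On a sampled path, Definition 5.17 can be checked mechanically. The following helper is my own illustration (not part of the text); it tests the (a, b)-increasing property pairwise on a grid:

```python
# Grid-based illustration of Definition 5.17: r is (a, b)-increasing if
# r(t) >= a implies r(t + h) >= min(r(t), b) for all t, h >= 0; on a
# sampled path this becomes a pairwise check over later grid points.
def is_a_b_increasing(values, a, b):
    for i, rt in enumerate(values):
        if rt >= a:
            for rs in values[i + 1:]:
                if rs < min(rt, b):
                    return False
    return True

# bath-tub-shaped path: dips below a = 1 early, then climbs through b = 3
path = [0.9, 0.5, 0.3, 0.6, 1.2, 2.0, 3.5, 4.0]
print(is_a_b_increasing(path, 1.0, 3.0))    # True: once the path is >= 1 it never falls back
print(is_a_b_increasing(path, 0.4, 3.0))    # False: 0.5 >= 0.4 but the path drops to 0.3
```

The first call shows why bath-tub-shaped paths are covered: only the behavior after the level a is first passed matters.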

The main idea for solving the stopping problem is that, if the ratio f_s/g_s satisfies such a monotonicity condition, instead of considering all stopping times τ ∈ C^F_T one may restrict the search for an optimal stopping time to the class of indexed stopping times

    ρ_x = inf{t ∈ R+ : xg_t − f_t ≤ 0} ∧ T,   inf ∅ = ∞, x ∈ R.   (5.11)

The optimal stopping level x∗ for the ratio f_s/g_s can be determined from EY_σ = 0 and coincides with K∗, as is shown in the following theorem.

Theorem 5.18. Assume (A) (see p. 184) and let ρ_x, x ∈ R, and the bounds b_u, b_l be defined as above in (5.11) and in Lemma 5.14, p. 184, respectively. If the process (r_t), t ∈ R+, with r_t = f_t/g_t has (b_l, b_u)-increasing paths on [0, T), then

    σ = ρ_{x∗}, with x∗ = inf{x ∈ R : xEX_{ρ_x} − EZ_{ρ_x} ≥ 0},

is an optimal stopping time and x∗ = K∗.

Proof. Since r is (b_l, b_u)-increasing with b_l ≤ K∗ ≤ b_u, it follows that r is also (K∗, K∗)-increasing, i.e., passes K∗ at most once from below. Thus the monotone case holds true for the SSM Y. From the general assumption (A) on p. 184 we deduce that the martingale part of Y is uniformly integrable so that

    σ = inf{t ∈ R+ : K∗g_t − f_t ≤ 0} ∧ T = ρ_{K∗}

is optimal with EY_σ = sup{EY_τ : τ ∈ C^F_T} = 0. It remains to show that x∗ = K∗. Define

    v(x) = xEX_{ρ_x} − EZ_{ρ_x} = xEX_0 − EZ_0 + E∫_0^{ρ_x} (xg_s − f_s)ds.

Now v(x) is obviously nondecreasing in x, and by the definition of ρ_x and (A) we have v(x) ≥ −EZ_0. For x < K∗ and v(x) > −EZ_0 the following strict inequality holds, since in this case we have either EX_0 > 0, or EX_0 = 0 and P(ρ_x > 0) > 0:


    v(x) < K∗EX_0 − EZ_0 + E∫_0^{ρ_x} (K∗g_s − f_s)ds ≤ v(K∗) = 0.

Equally for x < K∗ and v(x) = −EZ_0 we have v(x) < v(K∗) = 0 because of EZ_0 > 0. Therefore,

    x∗ = inf{x ∈ R : v(x) ≥ v(K∗) = 0} = K∗,

which proves the assertion. □

Remark 5.19. 1. If E[Z_0 − qX_0] < 0, then the lower bound b_l in Lemma 5.14 is attained for σ = 0. So in this case K∗ = EZ_0/EX_0 is the minimum without any further monotonicity assumptions.

2. If no monotonicity conditions hold at all, then x∗ = inf{x ∈ R : xEX_{ρ_x} − EZ_{ρ_x} ≥ 0} is the cost minimum if only stopping times of type ρ_x are considered. But T = ρ_∞ is among this restricted class of stopping times, so that x∗ is at least an improved upper bound for K∗, i.e., b_u ≥ x∗. From the definition of x∗ we obtain x∗ ≥ K_{ρ_{x∗}}, which is obviously bounded below by the overall minimum K∗: b_u ≥ x∗ ≥ K_{ρ_{x∗}} ≥ K∗.

3. Processes r with (b_l, b_u)-increasing paths include especially unimodal or bath-tub-shaped processes provided that r_0 < b_l.

The case of a deterministic process r is of special interest and is stated as a corollary under the assumptions of the last theorem.

Corollary 5.20. If (f_t) and (g_t) are deterministic with inverse of the ratio r^{−1}(x) = inf{t ∈ R+ : r_t = f_t/g_t ≥ x}, x ∈ R, and X_0 ≡ 0, then σ = t∗ ∧ T is optimal with t∗ = r^{−1}(K∗) ∈ R+ ∪ {∞} and

    K∗ = inf{ x ∈ R : ∫_0^{r^{−1}(x)} (xg_s − f_s)P(T > s)ds ≥ EZ_0 }.

If, in addition, r is constant with r_t ≡ r_0 for all t ∈ R+, then

    K∗ = EZ_0/EX_T + r_0 and σ = T.

Remark 5.21. The bounds for K∗ in Lemma 5.14 are sharp in the following sense: for constant r_t ≡ r_0 in the above corollary the upper and lower bounds coincide.

5.2.3 Different Information Levels

As indicated in Sect. 3.2.4 in the context of the general lifetime model, the semimartingale set-up has its advantage in opening new fields of applications. One of these features is the aspect of partial information. In the framework of stochastic process theory, the information is represented by a filtration, an


increasing family of σ-fields. So it is natural to describe partial information by a family of smaller σ-fields. Let A = (A_t) be a subfiltration of F = (F_t), i.e., A_t ⊂ F_t for all t ∈ R+. The σ-field F_t describes the complete information up to time t, and A_t can be regarded as the available partial information that allows us to observe versions of the conditional expectations Ẑ_t = E[Z_t|A_t] and X̂_t = E[X_t|A_t], respectively. For all A-stopping times τ it holds true that EẐ_τ = EZ_τ and EX̂_τ = EX_τ. So the problem to find a stopping time σ in the class C^A_T of A-stopping times that minimizes K_τ = EZ_τ/EX_τ can be reduced to the ordinary stopping problem by the means developed in the last subsection, if Ẑ and X̂ admit A-SSM representations:

T of A-stopping times that minimizes Kτ = EZτ/EXτ can bereduced to the ordinary stopping problem by the means developed in the lastsubsection if Z and X admit A-SSM representations:

    K_σ = inf{ K_τ = EZ_τ/EX_τ : τ ∈ C^A_ζ } = inf{ K̂_τ = EẐ_τ/EX̂_τ : τ ∈ C^A_ζ }.

The projection theorem (Theorem 3.19, p. 69) yields: If Z is an F-SSM with representation Z = (f, M) and A is a subfiltration of F, then Ẑ_t = E[Z_t|A_t] is an A-SSM with Ẑ = (f̂, M̂), where f̂ is an A-progressively measurable version of (E[f_t|A_t]), t ∈ R+, and M̂ is an A-martingale.

Loosely speaking, if f is the "density" of Z, we get the "density" f̂ of Ẑ simply as the conditional expectation with respect to the subfiltration A. Then the idea is to use the projection Ẑ of Z to the A-level and apply the above-described optimization technique to Ẑ. Of course, on the lower information level the cost minimum is increased,

    inf{K̂_τ : τ ∈ C^A_ζ} ≥ inf{K_τ : τ ∈ C^F_ζ},

since all A-stopping times are also F-stopping times, and the question, to what extent the information level influences the cost minimum, has to be investigated.

5.3 Applications

The general set-up to minimize the ratio of expectations allows for many special cases covering a variety of maintenance models. A few of these will be presented in this section to show how the general approach can be exploited.

5.3.1 The Generalized Age Replacement Model

We first focus on the age replacement model with the long run average cost per unit time criterion: find σ ∈ C^F_T with

    K∗ = K_σ = EZ_σ/EX_σ = inf{K_τ : τ ∈ C^F_T},


where we now insert Z_t = c + I(T ≤ t) and X_t = t, t ∈ R+. Without loss of generality the constant k, the penalty cost for replacements at failures introduced in Sect. 5.1.1, is set equal to 1. We will now make use of the general lifetime model described in detail in Sect. 3.2. This means that it is assumed that the indicator process V_t = I(T ≤ t) has an F-SSM representation with a failure rate process λ:

    V_t = I(T ≤ t) = ∫_0^t I(T > s)λ_s ds + M_t.

We know then that λ has nonnegative paths, T is a totally inaccessible F-stopping time, and M is a uniformly integrable F-martingale (cf. Definition 3.24 and Lemma 3.25, p. 72). With λ_min = q = inf{λ_t(ω) : 0 ≤ t < T(ω), ω ∈ Ω} we get from Lemma 5.14, p. 184, the bounds

    b_l = c/ET + λ_min ≤ K∗ ≤ b_u = (c + 1)/ET.

Note that in contrast to Example 5.15, p. 185, λ may be a stochastic failure rate process. If the paths of λ are (b_l, b_u)-increasing, then the SSMs Z and X meet the requirements of Theorem 5.18, p. 188, and it follows that

    K∗ = x∗ = inf{x ∈ R : xEρ_x − EZ_{ρ_x} ≥ 0} and σ = ρ_{x∗},

where ρ_x = inf{t ∈ R+ : λ_t ≥ x} ∧ T. Consequently, if λ is nondecreasing or bath-tub-shaped starting at λ_0 < b_l, we get this solution of the stopping problem. The optimal replacement time is a control-limit rule for the failure rate process λ.

To give an idea of how partial information influences this optimal solution, we resume the example of a two-component parallel system with i.i.d. random variables X_i ∼ Exp(α), i = 1, 2, which describe the component lifetimes (cf. Example 3.38, p. 79). Then the system lifetime is T = X_1 ∨ X_2 with corresponding indicator process

    V_t = I(T ≤ t) = ∫_0^t I(T > s)α(I(X_1 ≤ s) + I(X_2 ≤ s))ds + M_t
        = ∫_0^t I(T > s)λ_s ds + M_t.

Possible different information levels were described in Sect. 3.2.4 in detail. We restrict ourselves now to four levels:

(a) The complete information level: F = (F_t),

    F_t = σ(I(X_1 ≤ s), I(X_2 ≤ s), 0 ≤ s ≤ t),

with failure rate process λ_t = λ^a_t = α(I(X_1 ≤ t) + I(X_2 ≤ t)).


(b) Information only about T until h > 0, after h complete information: A^b = (A^b_t),

    A^b_t = σ(I(T ≤ s), 0 ≤ s ≤ t) if 0 ≤ t < h,   A^b_t = F_t if t ≥ h,

and failure rate process

    λ^b_t = E[λ_t|A^b_t] = 2α(1 − (2 − e^{−αt})^{−1}) if 0 ≤ t < h,   λ^b_t = λ_t if t ≥ h.

(c) Information about component lifetime X_1: A^c = (A^c_t),

    A^c_t = σ(I(T ≤ s), I(X_1 ≤ s), 0 ≤ s ≤ t),

and failure rate process

    λ^c_t = E[λ_t|A^c_t] = α(I(X_1 ≤ t) + I(X_1 > t)P(X_2 ≤ t)).

(d) Information only about T: A^d = (A^d_t), A^d_t = σ(I(T ≤ s), 0 ≤ s ≤ t), and failure rate (process) λ^d_t = E[λ_t|A^d_t] = 2α(1 − (2 − e^{−αt})^{−1}).

In all four cases the bounds remain the same with ET = 3/(2α):

    b_l = (2α/3)c,   b_u = (2α/3)(c + 1).

Since A^b and A^c are subfiltrations of F and include A^d as a subfiltration, we must have for the optimal stopping values

    b_l ≤ K∗_a ≤ K∗_b ≤ K∗_d ≤ b_u,   K∗_a ≤ K∗_c ≤ K∗_d,

i.e., on a higher information level we can achieve a lower cost minimum. Let us consider the complete information case in more detail. The failure rate process is nondecreasing and the assumptions of Theorem 5.18, p. 188, are met. For the stopping times ρ_x = inf{t ∈ R+ : λ_t ≥ x} ∧ T we have to consider values of x in [b_l, b_u] and to distinguish between the cases 0 < x ≤ α and x > α:

• 0 < x ≤ α. In this case we have ρ_x = X_1 ∧ X_2, Eρ_x = 1/(2α), EZ_{ρ_x} = c, such that xEρ_x − EZ_{ρ_x} = 0 leads to x∗ = 2αc, where 0 < x∗ ≤ α is equivalent to c ≤ 1/2;
• α < x. In this case we have ρ_x = T, Eρ_x = 3/(2α), EZ_{ρ_x} = c + 1, such that x∗ = b_u; x∗ > α is equivalent to c > 1/2.

The other information levels are treated in a similar way. Only case (b) needs some special attention because the failure rate process λ^b is no longer monotone but only piecewise nondecreasing. To meet the (b_l, b_u)-increasing condition, we must have λ^b_h < b_l, i.e., 2α(1 − (2 − e^{−αh})^{−1}) < (2α/3)c. This inequality holds for all h ∈ R+ if c ≥ 3/2, and for h < h(α, c) = −(1/α)ln((3 − 2c)/(3 − c)) if 0 < c < 3/2.

We summarize these considerations in the following proposition, the proof of which follows the lines above and is elementary but not straightforward.


Proposition 5.22. For 0 < c ≤ 1/2 the optimal stopping times and values K∗ are

a) K∗_a = 2αc, σ_a = X_1 ∧ X_2;

b) K∗_b = α(c + (1 − e^{−αh})²)/(1/2 + (1 − e^{−αh})²), σ_b = ((X_1 ∧ X_2) ∨ h) ∧ T, if 0 < h < h(α, c);

c) K∗_c = α√(2c), σ_c = X_1 ∧ (−(1/α)ln(1 − √(2c)));

d) K∗_d = 2α(√(c²/4 + c) − c/2), σ_d = T ∧ (−(1/α)ln(1 − c/2 − √(c²/4 + c))).

For c > 1/2 we have on all levels K∗ = b_u and σ = T.

For decreasing c the differences between the cost minima increase. If the costs c for a preventive replacement are greater than half of the penalty costs, i.e., c > (1/2)k = 1/2, then extra information and preventive replacements are not profitable.
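The formulas of Proposition 5.22 and the ordering of the information levels are easy to verify numerically. In the sketch below the values of α, c, and h are my own sample choices, with 0 < c ≤ 1/2 and 0 < h < h(α, c):

```python
import math

# Numeric check of Proposition 5.22 (alpha, c, h are assumed sample values).
alpha, c, h = 1.0, 0.2, 0.05

h_bound = -(1.0 / alpha) * math.log((3.0 - 2.0 * c) / (3.0 - c))
assert 0.0 < h < h_bound           # condition for case (b)

b_l = 2.0 * alpha * c / 3.0
b_u = 2.0 * alpha * (c + 1.0) / 3.0

K_a = 2.0 * alpha * c
u = (1.0 - math.exp(-alpha * h)) ** 2
K_b = alpha * (c + u) / (0.5 + u)
K_c = alpha * math.sqrt(2.0 * c)
K_d = 2.0 * alpha * (math.sqrt(c * c / 4.0 + c) - c / 2.0)

print(f"K*_a={K_a:.4f}  K*_b={K_b:.4f}  K*_c={K_c:.4f}  K*_d={K_d:.4f}")
assert b_l <= K_a <= K_b <= K_d <= b_u     # ordering of the information levels
assert K_a <= K_c <= K_d
```

At c = 1/2 all four expressions collapse to α = b_u, consistent with the last sentence of the proposition.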

5.3.2 A Shock Model of Threshold Type

In the shock model of Example 5.16, p. 185, the shock arrivals were described by a marked point process (T_n, V_n), where at time T_n a shock causing damage of amount V_n occurs. Here we assume that (T_n) and (V_n) are independent and that (V_n) forms an i.i.d. sequence of nonnegative random variables with V_n ∼ F. As usual N_t = Σ_{n=1}^∞ I(T_n ≤ t) counts the number of shocks until t and

    R_t = Σ_{n=1}^{N_t} V_n

describes the accumulated damage up to time t. In the threshold-type model,the lifetime T is given by

T = inf{t ∈ R+ : Rt ≥ S}, S > 0.

Now F is the internal history generated by (T_n, V_n) and (λ_t) the F-intensity of (N_t). The costs of a preventive replacement are c > 0 and for a replacement at failure c + k, k > 0, which results in a cost process Z_t = c + kI(T ≤ t). The aim is to minimize the expected cost per arriving shock in the long run, i.e., to find σ ∈ C^F_T with

T with

K∗ = Kσ = inf

{Kτ =

EZτ

EXτ, τ ∈ CF

T

},

where Xt = Nt. The only assumption concerning the shock arrival process isthat the intensity λ is positive: λt > 0 on [0, T ). According to Example 5.16and Sect. 3.3.3 we have the following SSM representations:

    Z_t = c + ∫_0^t I(T > s)kλ_s F̄((S − R_s)−)ds + M_t,

    X_t = ∫_0^t λ_s ds + L_t.


Then the cost rate process r is given on [0, T) by r_t = kF̄((S − R_t)−), which is obviously nondecreasing. Under the integrability assumptions of Theorem 5.18, p. 188, we see that the optimal stopping time is σ = ρ_{x∗} = inf{t ∈ R+ : r_t ≥ x∗}, where the limit x∗ = inf{x ∈ R : xEX_{ρ_x} − EZ_{ρ_x} ≥ 0} = K∗ has to be found numerically. Thus the optimal stopping time is a control-limit rule for the process (R_t): replace the system the first time the accumulated damage hits a certain control limit.

Example 5.23. Under the above assumptions let (N_t) be a point process with positive intensity (λ_s) and V_n ∼ Exp(ν). Then we get with F̄(x) = exp{−νx} and EX_T = E[inf{n ∈ N : Σ_{i=1}^n V_i ≥ S}] = νS + 1 the bounds

    b_l = c/(νS + 1) + ke^{−νS},   b_u = (c + k)/(νS + 1),

and the control-limit rules

    ρ_x = inf{t ∈ R+ : k exp{−ν(S − R_t)} ≥ x} ∧ T
        = inf{t ∈ R+ : R_t ≥ (1/ν)ln(x/k) + S} ∧ T.

We set g(x) = (1/ν)ln(x/k) + S and observe that ρ_x = inf{t ∈ R+ : R_t ≥ g(x)}, if 0 < x ≤ k. For such values of x we find

    EX_{ρ_x} = νg(x) + 1,
    EZ_{ρ_x} = c + kP(T = ρ_x) = c + ke^{−ν(S−g(x))} = c + x.

The probability P(T = ρ_x) is just the probability that a Poisson process with rate ν has no event in the interval [g(x), S], which equals e^{−ν(S−g(x))}. By these quantities the optimal control limit x∗ = K∗ is the unique solution of

    x∗ = (c + x∗)/(νg(x∗) + 1),

provided that bl ≤ x∗ ≤ bu. As expected this solution does not depend on thespecific intensity of the shock arrival process.
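The fixed-point equation is easily solved numerically. A sketch with assumed parameter values (not from the text):

```python
import math

# Numeric sketch of Example 5.23: the optimal control limit x* = K* solves
#   x* = (c + x*) / (nu*g(x*) + 1),  g(x) = ln(x/k)/nu + S,
# equivalently phi(x) = x*ln(x/k) + nu*S*x - c = 0 for x in (0, k].
nu, S, c, k = 1.0, 2.0, 0.5, 1.0

b_l = c / (nu * S + 1.0) + k * math.exp(-nu * S)
b_u = (c + k) / (nu * S + 1.0)

def phi(x):
    return x * math.log(x / k) + nu * S * x - c

lo, hi = 1e-9, k            # phi(0+) = -c < 0 and phi(k) = nu*S*k - c > 0 here
for _ in range(80):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if phi(mid) < 0.0 else (lo, mid)

x_star = 0.5 * (lo + hi)
g_star = math.log(x_star / k) / nu + S      # optimal damage control limit g(x*)
print(f"K* = x* = {x_star:.4f}, control limit g(x*) = {g_star:.4f}")
assert b_l <= x_star <= b_u
```

As the text notes, the solution depends on ν, S, c, k but not on the intensity of the shock arrival process, so no arrival-process input appears in the computation.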

5.3.3 Information-Based Replacement of Complex Systems

In this section the basic lifetime model for complex systems is combined with the possibility of preventive replacements. A system with random lifetime T > 0 is replaced by a new equivalent one after failure. A preventive replacement can be carried out before failure. There are costs for each replacement and an additional amount has to be paid for replacements after failures. The aim is to determine an optimal replacement policy with respect to some cost criterion.


Several cost criteria are known, among which the long run average cost per unit time criterion is by far the most popular one. But the general optimization procedure also allows for other criteria. As an example the total expected discounted cost criterion will be applied in this section. We will also consider the possibility to take different information levels into account. This set-up will be applied to complex monotone systems for which in Sect. 3.2 some examples of various degrees of observation levels were given. For the special case of a two-component parallel system with dependent component lifetimes, it is shown how the optimal replacement policy depends on the different information levels and on the degree of dependence of the component lifetimes.

Consider a monotone system with random lifetime T, T > 0, with an F-semimartingale representation

    I(T ≤ t) = ∫_0^t I(T > s)λ_s ds + M_t,   (5.12)

for some filtration F. When the system fails it is immediately replaced by an identical one and the process repeats itself. A preventive replacement can be carried out before failure. Each replacement incurs a cost of c > 0 and each failure adds a penalty cost k > 0. The problem is to find a replacement (stopping) time that minimizes the total expected discounted costs.

Let α > 0 be the discount rate and (Z_τ, τ), (Z_{τ_1}, τ_1), (Z_{τ_2}, τ_2), ... a sequence of i.i.d. pairs of positive random variables, where τ_i represents the replacement age of the ith implemented system, i.e., the length of the ith cycle, and Z_{τ_i} describes the costs incurred during the ith cycle discounted to the beginning of the cycle. Then the total expected discounted costs are

    K_τ = E[Z_{τ_1} + e^{−ατ_1}Z_{τ_2} + e^{−α(τ_1+τ_2)}Z_{τ_3} + ···] = EZ_τ/E[1 − e^{−ατ}].

It turns out that K_τ is the ratio of the expected discounted costs for one cycle and E[1 − e^{−ατ}]. Again the set of admissible stopping (replacement) times less than or equal to T is

    C^F_T = {τ : τ is an F-stopping time, τ ≤ T, EZ^−_τ < ∞}.

The stopping problem is to find a stopping time σ ∈ C^F_T with

    K∗ = K_σ = inf{K_τ : τ ∈ C^F_T}.   (5.13)

Stopping at a fixed time t leads to the following costs for one cycle discounted to the beginning of the cycle:

Zt = (c+ kI(T ≤ t))e−αt, t ∈ R+.

Starting from (5.12) such a semimartingale representation can also be obtained for Z = (Z_t), t ∈ R+, by using the product rule for "differentiating"


semimartingales introduced in Sect. 3.1.2. Then Theorem A.51, p. 269, can be applied to yield for t ∈ [0, T]:

    Z_t = c + ∫_0^t I(T > s)αe^{−αs}(−c + λ_s k/α)ds + R_t
        = c + ∫_0^t I(T > s)αe^{−αs}r_s ds + R_t,   (5.14)

where r_s = α^{−1}(−αc + λ_s k) is a cost rate and R = (R_t), t ∈ R+, is a uniformly integrable F-martingale. Since X_t = 1 − e^{−αt} = ∫_0^t αe^{−αs}ds, the ratio of the "derivatives" of the two semimartingales Z and X is given by (r_t).

We now consider a monotone system with random component lifetimes T_i > 0, i = 1, 2, ..., n, n ∈ N, and structure function Φ : {0, 1}^n → {0, 1} as introduced in Chap. 2. The system lifetime T is given by T = inf{t ∈ R+ : Φ_t = 0}, where the vector process (X_t) describes the state of the components and Φ_t = Φ(X_t) = I(T > t) indicates the state of the system at time t. If the random variables T_i are independent with (ordinary) failure rates λ_t(i) and F = (F_t) is the (complete information) filtration generated by X, F_t = σ(X_s, 0 ≤ s ≤ t), then Corollary 3.30 in Sect. 3.2.2 yields the following semimartingale representation for Φ_t:

    1 − Φ_t = ∫_0^t I(T > s)λ_s ds + M_t,

    λ_t = Σ_{i=1}^n (Φ(1_i, X_t) − Φ(0_i, X_t))λ_t(i).

To find the minimum K∗ we will proceed as before. First of all, bounds b_l and b_u for K∗ are determined by means of q = inf{r_t : 0 ≤ t < T(ω), ω ∈ Ω}, the minimum of the cost rate, with q ≥ −c:

    b_l = c/E[1 − e^{−αT}] + q ≤ K∗ ≤ b_u = E[(c + k)e^{−αT}]/E[1 − e^{−αT}].   (5.15)

If all failure rates λ_t(i) are of IFR-type, then the F-failure rate process λ and the ratio process r are nondecreasing. Therefore, Theorem 5.18, p. 188, can be applied to yield σ = ρ_{x∗}. So the optimal stopping time is among the control-limit rules

    ρ_x = inf{t ∈ R+ : r_t ≥ x} ∧ T = inf{t ∈ R+ : λ_t ≥ (α/k)(c + x)} ∧ T.

This means: replace the system the first time the sum of the failure rates of critical components reaches a given level x∗. This level has to be determined as

    x∗ = inf{x ∈ R : xE[1 − e^{−αρ_x}] − E[(c + kI(T = ρ_x))e^{−αρ_x}] ≥ 0}.


The effect of partial information is in the following only considered for the case that no single component or only some of the n components are observed, say those with index in a subset {i_1, i_2, ..., i_r} ⊂ {1, 2, ..., n}, r ≤ n. Then the subfiltration A is generated by T, or by T and the corresponding component lifetimes, respectively. The projection theorem yields a representation on the corresponding observation level:

    1 − Φ̂_t = E[I(T ≤ t)|A_t] = I(T ≤ t) = ∫_0^t I(T > s)λ̂_s ds + M̂_t.

If the A-failure rate process λ̂_t = E[λ_t|A_t] is (b_l, b_u)-increasing, then the stopping problem can also be solved on the lower information level by means of Theorem 5.18. We want to carry this out in more detail in the next section, allowing also for dependencies between the component lifetimes. To keep the complexity of the calculations on a manageable level, we confine ourselves to a two-component parallel system.

5.3.4 A Parallel System with Two Dependent Components

A two-component parallel system is considered now to demonstrate how the optimal replacement rule can be determined explicitly. It is assumed that the component lifetimes T_1 and T_2 follow a bivariate exponential distribution. There are lots of multivariate extensions of the univariate exponential distribution. But it seems that only a few models, like those of Freund [68] and Marshall and Olkin [121], are physically motivated.

The idea behind Freund's model is that after failure of one component the stress, placed on the surviving component, is changed. As long as both components work, the lifetimes follow independent exponential distributions with parameters β_1 and β_2. When one of the components fails, the parameter of the surviving component is switched to β̄_1 or β̄_2, respectively.

Marshall and Olkin proposed a bivariate exponential distribution for a two-component system where the components are subjected to shocks. The components may fail separately or both at the same time due to such shocks. This model includes the possibility of a common cause of failure that destroys the whole system at once.

As a combination of these two models the following bivariate distribution can be derived. Let the pair (Y_1, Y_2) of random variables be distributed according to the model of Freund and let Y_12 be another positive random variable, independent of Y_1 and Y_2, exponentially distributed with parameter β_12. Then (T_1, T_2) with T_1 = Y_1 ∧ Y_12, T_2 = Y_2 ∧ Y_12 is said to follow a combined exponential distribution. For brevity the notation γ_i = β_1 + β_2 − β̄_i, i ∈ {1, 2}, and β = β_1 + β_2 + β_12 is introduced. The survival function

    F̄(x, y) = P(T_1 > x, T_2 > y) = P(Y_1 > x, Y_2 > y)P(Y_12 > x ∨ y)


is then given by

    F̄(x, y) = (β_1/γ_2)e^{−γ_2 x − (β̄_2+β_12)y} − ((β̄_2 − β_2)/γ_2)e^{−βy}   for x ≤ y,
    F̄(x, y) = (β_2/γ_1)e^{−γ_1 y − (β̄_1+β_12)x} − ((β̄_1 − β_1)/γ_1)e^{−βx}   for x > y,   (5.16)

where here and in the following γ_i ≠ 0, i ∈ {1, 2}, is assumed. For β̄_i = β_i this formula reduces to the Marshall–Olkin distribution, and for β_12 = 0 (5.16) gives the Freund distribution. From (5.16) the distribution H of the system lifetime T = T_1 ∨ T_2 can be obtained:

    H(t) = P(T ≤ t) = P(T_1 ≤ t, T_2 ≤ t)   (5.17)
         = 1 − (β_2/γ_1)e^{−(β̄_1+β_12)t} − (β_1/γ_2)e^{−(β̄_2+β_12)t} + ((β_1β̄_2 + β_2β̄_1 − β̄_1β̄_2)/(γ_1γ_2))e^{−βt}.

The optimization problem will be solved for three different informationlevels:

• Complete information about T_1, T_2 (and T). The corresponding filtration F is generated by both component lifetimes:

    F_t = σ(I(T_1 ≤ s), I(T_2 ≤ s), 0 ≤ s ≤ t), t ∈ R+.

• Information about T_1 and T. The corresponding filtration A is generated by one component lifetime, say T_1, and the system lifetime:

    A_t = σ(I(T_1 ≤ s), I(T ≤ s), 0 ≤ s ≤ t), t ∈ R+.

• Information about T . The filtration generated by T is denoted by B:

Bt = σ(I(T ≤ s), 0 ≤ s ≤ t), t ∈ R+.

In the following it is assumed that β_i ≤ β̄_i, i ∈ {1, 2}, and β̄_1 ≤ β̄_2, i.e., after failure of one component the stress placed on the surviving one is increased. Without loss of generality the penalty costs for replacements after failures are set to k = 1. The solution of the stopping problem will be outlined in the following. More details are contained in [84].

5.3.5 Complete Information About T1, T2 and T

The failure rate process λ on the F-observation level is given by (cf. Example 3.27, p. 74)

    λ_t = β_12 + β̄_2 I(T_1 < t < T_2) + β̄_1 I(T_2 < t < T_1).

Inserting q = −c + β_12 α^{−1} in (5.15) we get the bounds for the stopping value K∗:


    b_l = cv/(1 − v) + β_12/α   and   b_u = (c + 1)v/(1 − v),

where v = E[e^{−αT}] can be determined by means of the distribution H. Since the failure rate process is monotone on [0, T), the optimal stopping time can be found among the control limit rules ρ_x = inf{t ∈ R+ : r_t ≥ x} ∧ T:

    ρ_x = 0           for x ≤ β_12/α − c,
    ρ_x = T_1 ∧ T_2   for β_12/α − c < x ≤ (β̄_1 + β_12)/α − c,
    ρ_x = T_1         for (β̄_1 + β_12)/α − c < x ≤ (β̄_2 + β_12)/α − c,
    ρ_x = T           for x > (β̄_2 + β_12)/α − c.

The optimal control limit x∗ is the solution of the equation

    xE[1 − e^{−αρ_x}] − EZ_{ρ_x} = 0.

Since the optimal value x∗ lies between the bounds b_l and b_u, the considerations can be restricted to the cases x ≥ b_l > β_12 α^{−1} − c. In the first case, when β_12 α^{−1} − c < x ≤ (β̄_1 + β_12)α^{−1} − c, one has ρ_x = T_1 ∧ T_2 and

    E[1 − e^{−αρ_x}] = α/(β + α),

    EZ_{ρ_x} = cE[e^{−αρ_x}] + E[I(T ≤ ρ_x)e^{−αρ_x}] = cβ/(β + α) + β_12/(β + α).

The solution of the equation

    x∗α/(β + α) − (cβ/(β + α) + β_12/(β + α)) = 0

is given by

    x∗ = (1/α)(cβ + β_12)   if β_12/α − c < x∗ ≤ (β̄_1 + β_12)/α − c.

Inserting x∗ in the latter inequality we obtain the condition 0 < c ≤ c_1, where c_1 = β̄_1(β + α)^{−1}.

The remaining two cases (β̄_1 + β_12)α^{−1} − c < x ≤ (β̄_2 + β_12)α^{−1} − c and x > (β̄_2 + β_12)α^{−1} − c are treated in a similar manner. After some extensive calculations the following solution of the stopping problem is derived:

    ρ_{x∗} = T_1 ∧ T_2 for 0 < c ≤ c_1,   ρ_{x∗} = T_1 for c_1 < c ≤ c_2,   ρ_{x∗} = T for c_2 < c,

    x∗ = x∗_1 for 0 < c ≤ c_1,   x∗ = x∗_2 for c_1 < c ≤ c_2,   x∗ = x∗_3 for c_2 < c,


where c1 is defined as above and

    c_2 = β̄_2/(β + α) + β_2(β̄_2 − β̄_1)/((β̄_1 + β_12 + α)(β + α)),

    x∗_1 = (1/α)(cβ + β_12),

    x∗_2 = (1/α)( c(β_1 + β_12) + β_12 + ((c + 1)β_2β̄_1 − cβ_1β_2)/(β̄_1 + β_2 + β_12 + α) ),

    x∗_3 = b_u.

The explicit formulas for the optimal stopping value were only presented here to show how the procedure works and that even in seemingly simple cases extensive calculations are necessary. The main conclusion can be drawn from the structure of the optimal policy. For small values of c (note that the penalty costs for failures are k = 1) it is optimal to stop and replace the system at the first component failure. For mid-range values of c, the replacement should take place when the "better" component with a lower residual failure rate (β̄_1 ≤ β̄_2) fails. If the "worse" component fails first, this results in a replacement after system failure. For high values of c, preventive replacements do not pay, and it is optimal to wait until system failure. In this case the optimal stopping value is equal to the upper bound x∗ = b_u.

Information About T1 and T

The failure rate process corresponding to this observation level A is given by

    λ̂_t = g(t)I(T_1 > t) + (β̄_2 + β_12)I(T_1 ≤ t),

    g(t) = β̄_1 + β_12 − β̄_1γ_1/(β_2 e^{γ_1 t} + β_1 − β̄_1),

where the function g is derived by means of (5.16) as the limit

    g(t) = lim_{h→0+} (1/h)P(t < T_1 ≤ t + h, T_2 ≤ t + h | T_1 > t).

The paths of the failure rate process λ̂ depend only on the observable component lifetime T_1 and not on T_2. The paths are nondecreasing so that the same procedure as before can be applied. For γ_1 = β_1 + β_2 − β̄_1 > 0 the following results can be obtained:

    ρ_{x∗} = T_1 ∧ b∗ for 0 < c ≤ c_1,   ρ_{x∗} = T_1 for c_1 < c ≤ c_2,   ρ_{x∗} = T for c_2 < c,

    x∗ = x∗_1 for 0 < c ≤ c_1,   x∗ = x∗_2 for c_1 < c ≤ c_2,   x∗ = x∗_3 for c_2 < c.


The constants c_1, c_2 and the stopping values x∗_2, x∗_3 are the same as in the complete information case. What is optimal on a higher information level and can be observed on a lower information level must be optimal on the latter too. So only the case 0 < c ≤ c_1 is new. In this case the optimal replacement time is T_1 ∧ b∗ with a constant b∗, which is the unique solution of the equation

    d_1 exp{γ_1 b∗} + d_2 exp{−(β̄_1 + β_12 + α)b∗} + d_3 = 0.

The constants d_i, i ∈ {1, 2, 3}, are extensive expressions in α, the β and γ constants, and are therefore not presented here (see [84]). The values of b∗ and x∗_1 have to be determined numerically. For γ_1 < 0 a similar result can be obtained.

Information About T

On this lowest level B, no additional information about the state of the components is available up to the time of system failure. The failure rate is deterministic and can be derived from the distribution H:

    λ̂_t = −(d/dt)ln(1 − H(t)).

In this case the replacement times ρ_x = T ∧ b, b ∈ R+ ∪ {∞}, are the well-known age replacement policies. Even if λ̂ is not monotone, such a policy is optimal on this B-level. The optimal values b∗ and x∗ have to be determined by minimizing K_{ρ_x} as a function of b.
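This minimization is straightforward numerically. The sketch below is my own cross-check, reusing the parameters of Table 5.1 (β_1 = 1, β_2 = 3, β_12 = 0.5, β̄_1 = 1.5, β̄_2 = 3.5, α = 0.08, c = 0.1, k = 1); with these values (5.17) gives the survival function F̄_T(t) = 1.2e^{−2t} + 2e^{−4t} − 2.2e^{−4.5t}:

```python
import math

# B-level age replacement rho = T ^ b: minimize the discounted cost ratio
# K(b) = E[Z_rho] / E[1 - e^(-alpha*rho)] over b, using the explicit
# distribution of T for the Table 5.1 parameters.
alpha, c = 0.08, 0.1

def barF_T(t):
    return 1.2 * math.exp(-2.0 * t) + 2.0 * math.exp(-4.0 * t) - 2.2 * math.exp(-4.5 * t)

def dens(t):                        # density h(t) = -d/dt barF_T(t)
    return 2.4 * math.exp(-2.0 * t) + 8.0 * math.exp(-4.0 * t) - 9.9 * math.exp(-4.5 * t)

def disc_failure(b, n=2000):        # int_0^b e^(-alpha*t) dH(t), midpoint rule
    step = b / n
    return step * sum(math.exp(-alpha * (i + 0.5) * step) * dens((i + 0.5) * step)
                      for i in range(n))

def K(b):                           # E[Z_rho]/E[1 - e^(-alpha*rho)] for rho = T ^ b
    a_val = disc_failure(b)
    disc = a_val + math.exp(-alpha * b) * barF_T(b)     # E[e^(-alpha*rho)]
    return (c * disc + a_val) / (1.0 - disc)

K_min, b_star = min((K(0.05 * i), 0.05 * i) for i in range(1, 200))
print(f"b* ~ {b_star:.2f}, K* ~ {K_min:.3f} (Table 5.1 reports 19.678 on level B)")
```

A crude grid search suffices here; the printed minimum agrees with the level-B entry of Table 5.1 for c = 0.1.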

Numerical Examples

The following tables show the effects of changes of two parameters, the replacement cost parameter c and the "dependence parameter" β_12. To be able to compare the cost minima K∗ = x∗, both tables refer to the same set of parameters: β_1 = 1, β_2 = 3, β̄_1 = 1.5, β̄_2 = 3.5, α = 0.08. The optimal replacement times are denoted:

a: ρ_{x∗} = T_1 ∧ T_2   b: ρ_{x∗} = T_1   c: ρ_{x∗} = T_1 ∧ b∗
d: ρ_{x∗} = T ∧ b∗   e: ρ_{x∗} = T = T_1 ∨ T_2.

Table 5.1 shows the cost minima x∗ for different values of c. For small values of c, the influence of the information level is greater than for moderate values. For c > 1.394 preventive replacements do not pay, and additional information concerning T is not profitable.

Table 5.2 shows how the cost minimum depends on the parameter β_12. For increasing values of β_12 the difference between the cost minima on different information levels decreases, because the probability of a common failure of both components increases and therefore extra information about a single component is not profitable.


Table 5.1. β1 = 1, β2 = 3, β12 = 0.5, β1 = 1.5, β2 = 3.5, α = 0.08

                Information level
c      bl       F          A          B          bu
0.01   6.453    6.813 a    9.910 c    11.003 d   20.506
0.10   8.280    11.875 a   17.208 c   19.678 d   22.333
0.50   16.402   28.543 b   28.543 b   30.455 e   30.455
1.00   26.553   39.764 b   39.764 b   40.606 e   40.606
2.00   46.856   60.900 e   60.900 e   60.900 e   60.900

Table 5.2. β1 = 1, β2 = 3, β1 = 1.5, β2 = 3.5, c = 0.1, α = 0.08

                  Information level
β12     bl        F           A           B           bu
0.00    1.505     5.000 a     10.739 c    13.231 d    16.552
0.10    2.859     6.375 a     12.032 c    14.520 d    17.698
1.00    15.067    18.750 a    23.688 c    26.132 d    28.235
10.00   138.106   142.500 b   142.500 b   144.168 e   144.168
50.00   687.677   689.448 e   689.448 e   689.448 e   689.448

5.3.6 A Burn-In Model

Many manufactured items, for example, electronic components, tend either to last a relatively long time or to fail very early. A technique used to screen out the items with short lifelengths before they are delivered to the customer is the so-called burn-in. To burn-in an item means that before the item is released, it undergoes a test during which it is examined under factory conditions or is exposed to extra stress. After the test phase of (random) length τ, the item is put into operation.

Considering m produced items, and given some cost structure such as costs for failures during and after the test and gains per unit time for released items, one problem related to burn-in is to determine the optimal burn-in duration. This optimal burn-in time may either be fixed in advance, and is therefore deterministic, or one may consider the random information given by the lifelengths of the items failing during the test and obtain a random burn-in time.

We consider a semimartingale approach for solving the optimal stopping problem. In our model, the lifelengths of the items need not be identically distributed, and the stress level during burn-in may differ from the one after burn-in. The information at time t consists of whether and when components failed before t. Under these assumptions, we determine the optimal burn-in time ζ.

Let Tj, j = 1, . . . ,m, be independent random variables representing the lifelengths of the items that are burned in. We assume that ETj < ∞ for all j. We consider burn-in under severe conditions. That means that we assume the items to have different failure rates during and after burn-in, λ^0_j(t) and λ^1_j(t), respectively, where it is supposed that λ^0_j(t) ≥ λ^1_j(t) for all t ≥ 0. We assume that the lifelength Tj of the jth item admits the following representation:

I(Tj ≤ t) = ∫_0^t I(Tj > s) λ^{Ys}_j(s) ds + Mt(j), j = 1, . . . ,m, (5.18)

where Yt = I(τ < t), τ is the burn-in time, and M(j) ∈ M is bounded in L2. This representation can also be obtained by modeling the lifelength of the jth item in the following way:

Tj = Zj ∧ τ +RjI(Zj > τ), (5.19)

where Zj, Rj, j = 1, . . . ,m, are independent random variables and a ∧ b denotes the minimum of a and b; Zj is the lifelength of the jth item when it is exposed to a higher stress level and Rj is the operating time of the item if it survived the burn-in phase. Let Fj be the lifelength distribution, Hj denote the distribution function of Zj, j = 1, . . . ,m, and let Hj(0) = Fj(0) = 0, H̄j(t) = 1 − Hj(t), F̄j(t) = 1 − Fj(t). Furthermore, we assume that Hj and Fj admit densities hj and fj, respectively. It is assumed that the operating time Rj follows the conditional survival distribution corresponding to Fj:

P(Tj ≤ t + s | τ = t < Zj) = P(Rj ≤ s | τ = t < Zj) = (Fj(t + s) − Fj(t)) / F̄j(t), t, s ∈ R+.

In order to determine the optimal burn-in time, we introduce the following cost and reward structure: there is a reward of c > 0 per unit operating time of released items. In addition there are costs for failures, cB > 0 for a failure during burn-in and cF > 0 for a failure after the burn-in time τ, where cF > cB. If we fix the burn-in time for a moment to τ = t, then the net reward is given by

Zt = c Σ_{j=1}^m (Tj − t)+ − cB Σ_{j=1}^m I(Tj ≤ t) − cF Σ_{j=1}^m I(Tj > t), t ∈ R+. (5.20)

Since we assume that the failure time of any item can be observed during the burn-in phase, the observation filtration, generated by the lifelengths of the items, is given by

F = (Ft), t ∈ R+, Ft = σ(I(Tj ≤ s), 0 ≤ s ≤ t, j = 1, . . . ,m).

In order to determine the optimal burn-in time, we are looking for an F-stopping time ζ ∈ CF satisfying

EZζ = sup{EZτ : τ ∈ CF}.


In other words, at any time t the observer has to decide whether to stop or to continue with burn-in with respect to the available information up to time t. Since Z is not adapted to F, i.e., Zt cannot be observed directly, we consider the conditional expectation

Ẑt = E[Zt | Ft] = c Σ_{j=1}^m I(Tj > t) E[(Tj − t)+ | Tj > t] − mcF + (cF − cB) Σ_{j=1}^m I(Tj ≤ t). (5.21)

As an abbreviation we use

μj(t) = E[(Tj − t)+ | Tj > t] = (1/F̄j(t)) ∫_t^∞ F̄j(x) dx, t ∈ R+,

for the mean residual lifelength. The derivative with respect to t is given by μ′j(t) = −1 + λ^1_j(t)μj(t). We are now in a position to apply Theorem 5.9, p. 181, and formulate conditions under which the monotone case holds true.

Theorem 5.24. Suppose that the functions

gj(t) = −c − c μj(t)(λ^0_j(t) − λ^1_j(t)) + (cF − cB) λ^0_j(t)

satisfy the following condition:

Σ_{j∈J} gj(t) ≤ 0 implies gj(s) ≤ 0 ∀ j ∈ J, ∀ J ⊆ {1, . . . ,m}, ∀ s ≥ t. (5.22)

Then

ζ = inf{ t ∈ R+ : Σ_{j=1}^m I(Tj > t) gj(t) ≤ 0 }

is an optimal burn-in time:

EZζ = sup{EZτ : τ ∈ CF}.

Proof. In order to obtain a semimartingale representation for Ẑ in (5.21) we derive such a representation for I(Tj > t)μj(t). Since μj(·) and I(Tj > ·) are right-continuous and of bounded variation on [0, t], we can use the integration by parts formula for Stieltjes integrals (pathwise) to obtain

μj(t)I(Tj > t) = μj(0)I(Tj > 0) + ∫_0^t μj(s−) dI(Tj > s) + ∫_0^t I(Tj > s) dμj(s).

Substituting

I(Tj > s) = 1 + ∫_0^s (−I(Tj > x) λ^0_j(x)) dx + Mj(s)

in this formula and using the continuity of μ, we obtain

μj(t)I(Tj > t) = μj(0) + ∫_0^t [−μj(s)I(Tj > s)λ^0_j(s) + I(Tj > s)μ′j(s)] ds + ∫_0^t μj(s) dMj(s)

= μj(0) + ∫_0^t I(Tj > s)[−1 − μj(s)(λ^0_j(s) − λ^1_j(s))] ds + M̄j(t),

where M̄j is a martingale, which is bounded in L2. This yields the following semimartingale representation for Ẑ:

Ẑt = −mcF + c Σ_{j=1}^m μj(0) + ∫_0^t Σ_{j=1}^m c I(Tj > s)[−1 − μj(s)(λ^0_j(s) − λ^1_j(s))] ds + (cF − cB) ∫_0^t Σ_{j=1}^m I(Tj > s) λ^0_j(s) ds + Lt

= −mcF + c Σ_{j=1}^m μj(0) + ∫_0^t Σ_{j=1}^m I(Tj > s) gj(s) ds + Lt

with a uniformly integrable martingale

L = c Σ_{j=1}^m M̄j + (cF − cB) Σ_{j=1}^m Mj ∈ M,

where M̄j denotes the martingale part of the representation of μj(t)I(Tj > t) above.

Since for all ω ∈ Ω and all t ∈ R+ there exists some J ⊆ {1, . . . ,m} such that Σ_{j=1}^m I(Tj > t)gj(t) = Σ_{j∈J} gj(t), condition (5.22) in the theorem ensures that the monotone case (MON), p. 181, holds true. Therefore we get the desired result by Theorem 5.9 and the proof is complete. ∎

Remark 5.25. The structure of the optimal stopping time shows that high rewards per unit operating time lead to short burn-in times, whereas great differences cF − cB between costs for failures in different phases lead to long testing times, as expected.

Equivalent characterizations of condition (5.22) in Theorem 5.24 are given in the following lemma. The proof can be found in [87].


Lemma 5.26. Let tJ = inf{t ∈ R+ : Σ_{j∈J} gj(t) ≤ 0} and denote tj = t_{{j}} for all j ∈ {1, . . . ,m}. Then the following conditions are equivalent:

(i) Σ_{j∈J} gj(t) ≤ 0 implies gj(s) ≤ 0 ∀ j ∈ J, ∀ J ⊆ {1, . . . ,m} and ∀ s ≥ t.

(ii) tJ = max_{j∈J} tj ∀ J ⊆ {1, . . . ,m} and gj(s) ≤ 0 ∀ s ≥ tj, ∀ j ∈ {1, . . . ,m}.

(iii) | Σ_{j: gj(t)≤0} gj(t) | < min_{j: gj(t)>0} gj(t) ∀ t < max_{j=1,...,m} tj and gj(s) ≤ 0 ∀ s ≥ tj, ∀ j ∈ {1, . . . ,m}.

The following special cases illustrate the result of the theorem.

1. Burn-in forever. If gj(t) > 0 for all t ∈ R+, j = 1, . . . ,m, then ζ = max{T1, . . . , Tm}, i.e., burn-in until all items have failed.

2. No burn-in. If gj(0) ≤ 0, j = 1, . . . ,m, then ζ = 0 and no burn-in takes place. This case occurs for instance if the costs for failures during and after burn-in are the same: cB = cF.

3. Identical items. If all failure rates coincide, i.e., λ^0_1(t) = . . . = λ^0_m(t) and λ^1_1(t) = . . . = λ^1_m(t) for all t ≥ 0, then gj(t) = g1(t) for all j ∈ {1, . . . ,m} and condition (5.22) reduces to

g1(s) ≤ 0 for s ≥ t1 = inf{t ∈ R+ : g1(t) ≤ 0}.

If this condition is satisfied, the optimal stopping time is of the form ζ = t1 ∧ max{T1, . . . , Tm}, i.e., stop burn-in as soon as g1(s) ≤ 0 or as soon as all items have failed, whichever occurs first.

4. The exponential case. If all failure rates are constant, equal to λ^0_j and λ^1_j, respectively, then μj and therefore gj is constant, too, and ζ(ω) ∈ {0, T1(ω), . . . , Tm(ω)}, if condition (5.22) is satisfied. If, furthermore, the items are "identical," then we have ζ = 0 or ζ = max{T1, . . . , Tm}.

5. No random information. In some situations the lifelengths of the items cannot be observed continuously. In this case one has to maximize the expectation function

EZt = EẐt = −mcF + c Σ_{j=1}^m H̄j(t)μj(t) + (cF − cB) Σ_{j=1}^m Hj(t)

in order to obtain the (deterministic) optimal burn-in time. This can be done using elementary calculus.
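Case 5 can also be treated by a plain numerical search. The sketch below is built entirely on illustrative assumptions: m identical items, a Weibull post-burn-in lifetime F, and an assumed "infant mortality" excess hazard δe^{−θt} on top of the post-burn-in hazard during burn-in, so that H̄(t) = F̄(t)·exp(−(δ/θ)(1 − e^{−θt})). It evaluates EZt on a grid of candidate burn-in times and reports the maximizer.

```python
import math

# Numerical sketch of case 5 for m identical items; every parameter and
# the form of the burn-in hazard are illustrative assumptions.
m, c, c_B, c_F = 50, 1.0, 1.0, 13.0
shape, scale = 1.5, 10.0         # Weibull post-burn-in lifetime (assumed)
delta, theta = 2.0, 3.0          # excess burn-in hazard delta*exp(-theta*t) (assumed)

def F_bar(x):
    return math.exp(-((x / scale) ** shape))

def H_bar(t):
    # survival during burn-in: post-burn-in hazard plus the assumed excess
    return F_bar(t) * math.exp(-(delta / theta) * (1.0 - math.exp(-theta * t)))

def mrl(t, upper=80.0, n=4000):
    # mean residual life mu(t) = int_t^upper F_bar(x) dx / F_bar(t)
    h = (upper - t) / n
    s = h * (0.5 * (F_bar(t) + F_bar(upper))
             + sum(F_bar(t + i * h) for i in range(1, n)))
    return s / F_bar(t)

def EZ(t):
    # EZ_t = -m*c_F + m*( c*H_bar(t)*mu(t) + (c_F - c_B)*H(t) )
    return -m * c_F + m * (c * H_bar(t) * mrl(t) + (c_F - c_B) * (1.0 - H_bar(t)))

grid = [0.05 * i for i in range(101)]      # candidate burn-in times in [0, 5]
t_star = max(grid, key=EZ)
print(t_star, EZ(t_star))
```

With these particular numbers the excess hazard is large at t = 0, so some burn-in pays; with other choices the maximizer can sit at a boundary, matching the "no burn-in" and "burn-in forever" cases above.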


5.4 Repair Replacement Models

In this section we consider models in which repairs are carried out in negligible time up to the time of a replacement. So the observation of the system does not end with a failure, as in the first sections of this chapter, but is continued until it is decided to replace the system by a new one. Given a certain cost structure, the optimal replacement time is derived with respect to the available information.

5.4.1 Optimal Replacement Under a General Repair Strategy

We consider a system that fails at times Tn, according to a point process (Nt), t ∈ R+, with an intensity (λt) adapted to some filtration F. At failures a repair is carried out at a cost of c > 0, which takes negligible time. A replacement can be carried out at any time t at an additional cost k > 0. Following the average cost per unit time criterion, we have to find a stopping time σ, if there exists one, with

K∗ = Kσ = inf{ Kτ = (cENτ + k)/Eτ : τ ∈ CF },

where CF = {τ : τ F-stopping time, Eτ < ∞} is a suitable class of stopping times. To solve this problem we can adopt the procedure of Sect. 5.2.1 with some slight modifications.

First of all we have Kτ = EZτ/EXτ with SSM representations

Zt = k + ∫_0^t cλs ds + Mt, (5.23)

Xt = ∫_0^t ds.

Setting τ = T1, we derive the simple upper bound bu:

bu = (c + k)/ET1 ≥ K∗.

The process Y corresponding to (5.10) on p. 187 now reads

Yt = −k + ∫_0^t (K∗ − cλs) ds + Rt

and therefore we know that, if there exists an optimal finite stopping time σ, then it is among the indexed stopping times

ρx = inf{t ∈ R+ : λt ≥ x/c}, 0 ≤ x ≤ bu,

provided λ has nondecreasing paths. We summarize this in a corollary to Theorem 5.18, p. 188.


Corollary 5.27. Let the martingale M in (5.23) be such that (Mt∧ρbu) is uniformly integrable. If λ has nondecreasing paths and Eρbu < ∞, then

σ = ρx∗, with x∗ = inf{x ∈ R+ : xEρx − cENρx ≥ k},

is an optimal stopping time and x∗ = K∗.

Example 5.28. Considering a nonhomogeneous Poisson process with a nondecreasing deterministic intensity λt = λ(t), we observe that the stopping times ρx = λ−1(x/c) are constants. If λ−1(bu/c) < ∞, then the corollary can be applied and the optimal stopping time σ is a finite constant.

The simplest case is that of a Poisson process with constant rate λ > 0. In this case we have bu = cλ + kλ > cλ and ρbu = ∞, so that the corollary does not apply. But in this case it is easily seen that additional stopping (replacement) costs do not pay and we get that σ = ∞ is optimal with K∗ = cλ.
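Example 5.28 becomes fully explicit for a power-law (Weibull-type) intensity λ(t) = abt^{b−1}, for which ρx and ENρx = Λ(ρx) = aρx^b are closed-form. The sketch below, with illustrative parameters a, b, c, k, finds x∗ of Corollary 5.27 by bisection and cross-checks it against the direct minimization of the deterministic cost rate K(T) = (cΛ(T) + k)/T:

```python
import math

# Sketch for Example 5.28 with lambda(t) = a*b*t**(b-1), so that
# rho_x = (x/(c*a*b))**(1/(b-1)) is a constant and
# E[N_{rho_x}] = Lambda(rho_x) = a*rho_x**b. Parameters are illustrative.
a, b, c, k = 0.5, 2.0, 1.0, 4.0

def rho(x):
    return (x / (c * a * b)) ** (1.0 / (b - 1.0))

def slack(x):
    # x*E[rho_x] - c*E[N_{rho_x}] - k; x* is its root (Corollary 5.27)
    r = rho(x)
    return x * r - c * a * r ** b - k

lo, hi = 1e-9, 100.0                      # bisection: slack is increasing in x
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if slack(mid) >= 0.0:
        hi = mid
    else:
        lo = mid
x_star = hi

# cross-check: T* minimizing K(T) = (c*a*T**b + k)/T is
# T* = (k/(c*a*(b-1)))**(1/b), and K* = c*lambda(T*) should equal x*
t_star = (k / (c * a * (b - 1.0))) ** (1.0 / b)
k_star = (c * a * t_star ** b + k) / t_star
print(x_star, k_star)
```

With these numbers T∗ = √8 ≈ 2.83 and x∗ = K∗ = cλ(T∗), illustrating the identity x∗ = K∗ of the corollary.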

Example 5.29. Consider the shock model with state-dependent failure probability of Sect. 3.3.4 in which shocks arrive according to a Poisson process with rate ν (cf. Example 3.47, p. 89). The failure intensity is of the form

λt = ν ∫_0^∞ p(Xt + y) dF(y),

where p(Xt + y) denotes the probability of a failure at the next shock if the accumulated damage is Xt and the next shock has amount y. Here we assume that this probability function p does not depend on the number of failures in the past. Obviously λt is nondecreasing, so that Corollary 5.27 applies provided that the integrability conditions are met.

A variety of point process models as described in Sect. 3.3 can be used in this set-up. Also more general cost structures could be applied, as for example random costs k = (kt), if k admits an SSM representation. Other modifications (discounted cost criterion, different information levels) can be worked out easily apart from some technical problems.

5.4.2 A Markov-Modulated Repair Process: Optimization with Partial Information

In this section a model with a given reward structure is investigated in which an optimal operating time of a system has to be found that balances some flow of rewards and the increasing cost rate due to (minimal) repairs. Consider a one-unit system that fails from time to time according to a point process. After failure a minimal repair is carried out that leaves the state of the system unchanged. The system can work in one of m unobservable states. State "1" stands for new or in good condition and "m" is defective or in bad condition. Aging of the system is described by a link between the failure point process and the unobservable state of the system. The failure or minimal repair intensity may depend on the state of the system. There is some constant flow of income, on the one hand, and on the other hand, each minimal repair incurs a random cost amount. The question is when to stop processing the system and carry out an inspection or a renewal in order to maximize some reward functional.

For the basic set-up we refer to Example 3.14, p. 65, and Sect. 3.3.9. Here we recapitulate the main assumptions of the model:

The basic probability space (Ω, F, P) is equipped with a filtration F, the complete information level, to which all processes are adapted, and S = {1, . . . ,m} is the set of unobservable environmental states. The changes of the states are driven by a homogeneous Markov process Y = (Yt), t ∈ R+, with values in S and infinitesimal parameters qi, the rate to leave state i, and qij, the rate to reach state j from state i. The time points of failures (minimal repairs) 0 < T1 < T2 < · · · form a point process and N = (Nt), t ∈ R+, is the corresponding counting process:

Nt = Σ_{n=1}^∞ I(Tn ≤ t).

It is assumed that N has a stochastic intensity λYt that depends on the unobservable state, i.e., N is a so-called Markov-modulated Poisson process with representation

Nt = ∫_0^t λYs ds + Mt,

where M is an F-martingale and 0 < λi < ∞, i ∈ S. Furthermore, let (Xn), n ∈ N, be a sequence of positive i.i.d. random variables, independent of N and Y, with common distribution F and finite mean μ. The cost caused by the nth minimal repair at time Tn is described by Xn.

There is an initial capital u and an income of constant rate c > 0 per unit time.

Now the process R, given by

Rt = u + ct − Σ_{n=1}^{Nt} Xn,

describes the available capital at time t as the difference of the income and the total amount of costs for minimal repairs up to time t.

The process R is well known in other branches of applied probability like queueing or collective risk theory, where the time to ruin τ = inf{t ∈ R+ : Rt < 0} is investigated (cf. Sect. 3.3.9). Here the focus is on determining the optimal operating time with respect to the given reward structure. To achieve this goal one has to estimate the unobservable state of the system at time t, given the history of the process R up to time t. This can be done using results in filtering theory as is shown below. Stopping at a fixed time t results in the net gain

Zt = Rt − Σ_{j=1}^m kj Ut(j),

where Ut(j) = I(Yt = j) is the indicator of the state at time t and kj ∈ R, j ∈ S, are stopping costs (for inspection and replacement), which may depend on the stopping state. The process Z cannot be observed directly because only the failure time points and the costs for minimal repairs are known to an observer. The observation filtration A = (At), t ∈ R+, is given by

At = σ(Ns, Xi, 0 ≤ s ≤ t, i = 1, . . . , Nt).

Let CA = {τ : τ is a finite A-stopping time, E(Zτ)− < ∞} be the set of feasible stopping times in which the optimal one has to be found. As usual, a− = −min{0, a} denotes the negative part of a ∈ R. So the problem is to find τ∗ ∈ CA which maximizes the expected net gain:

EZτ∗ = sup{EZτ : τ ∈ CA}.

For the solution of this problem an F-semimartingale representation of the process Z is needed, where it is assumed that the complete information filtration F is generated by Y, N, and (Xn):

Ft = σ(Ys, Ns, Xi, 0 ≤ s ≤ t, i = 1, . . . , Nt).

Such a representation can be obtained by means of an SSM representation for the indicator process Ut(j),

Ut(j) = U0(j) + ∫_0^t Σ_{i=1}^m Us(i) qij ds + mt(j), m(j) ∈ M0, (5.24)

as follows (see [95] for details):

Zt = u − Σ_{j=1}^m kj U0(j) + ∫_0^t Σ_{j=1}^m Us(j) rj ds + Mt, t ∈ R+, (5.25)

where M = (Mt) is an F-martingale and the constants rj are defined by

rj = c − λjμ − Σ_{ν≠j} (kν − kj) qjν.

These constants can be interpreted as net gain rates in state j:

• c is the income rate.
• λj, the failure rate in state j, is the expected number of failures per unit of time, and μ is the expected repair cost for one minimal repair. So λjμ is the repair cost rate.
• The remaining sum is the stopping cost rate caused by leaving state j.


Since the state indicators U(j) and therefore Z cannot be observed, a projection to the observation filtration A is needed. As described in Sect. 3.1.2, such a projection from the F-level (5.25) to the A-level leads to the following conditional expectations:

Ẑt = E[Zt | At] = u − Σ_{j=1}^m kj Û0(j) + ∫_0^t Σ_{j=1}^m Ûs(j) rj ds + M̂t, t ∈ R+. (5.26)

The integrand Σ_{j=1}^m Ûs(j) rj with Ûs(j) = E[Us(j) | As] = P(Ys = j | As) is the conditional expectation of the net gain rate at time s given the observations up to time s. If this integrand has nonincreasing paths, then we know that we are in the "monotone case" (cf. p. 181) and the stopping problem could be solved under some additional integrability conditions. To state monotonicity conditions for the integrand in (5.26), an explicit representation of Ût(j) is needed, which can be obtained by means of results in filtering theory (see [50], p. 98, [93]) in the form of "differential equations":

• Between the jumps of N, Tn ≤ t < Tn+1:

Ût(j) = ÛTn(j) + ∫_{Tn}^t ( Σ_{i=1}^m Ûs(i){qij + Ûs(j)(λi − λj)} ) ds, qjj = −qj, (5.27)

Û0(j) = P(Y0 = j), j ∈ S.

• At jumps:

ÛTn(j) = λj ÛTn−(j) / Σ_{i=1}^m λi ÛTn−(i), (5.28)

where ÛTn−(j) denotes the left limit.

The following conditions ensure that the system ages, i.e., it moves from the "good" states with high net gains and low failure rates to the "bad" states with low and possibly negative net gains and high failure rates, and it is never possible to return to a "better" state:

qi > 0, i = 1, . . . ,m− 1, qij = 0 for i > j, i, j ∈ S,

r1 ≥ r2 ≥ · · · ≥ rm = c− λmμ, rm < 0, (5.29)

0 < λ1 ≤ λ2 ≤ · · · ≤ λm.

A reasonable candidate for an optimal A-stopping time is

τ∗ = inf{ t ∈ R+ : Σ_{j=1}^m Ût(j) rj ≤ 0 }, (5.30)

the first time the conditional expectation of the net gain rate falls below 0.
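The filter (5.27)/(5.28) and the candidate rule (5.30) are easy to put into code. The sketch below is illustrative only: all numerical values are assumptions chosen so that (5.29) and the extra condition qim > λm − λi of the theorem that follows hold. It simulates one hidden state path with its failure times, runs an Euler discretization of (5.27) with the Bayes update (5.28) at observed failures, and reports the first time the estimated net gain rate Σj Ût(j)rj becomes nonpositive.

```python
import numpy as np

rng = np.random.default_rng(1)

# All numbers are illustrative assumptions satisfying (5.29) and
# q_im > lambda_m - lambda_i.
m = 3
lam = np.array([0.5, 1.0, 2.0])               # failure rates per state
Q = np.array([[-2.0, 0.4, 1.6],               # upper-triangular generator:
              [0.0, -1.2, 1.2],               # the system never "recovers"
              [0.0, 0.0, 0.0]])
c, mu = 1.0, 1.2                              # income rate, mean repair cost
k = np.array([0.10, 0.12, 0.15])              # stopping costs per state
r = np.array([c - lam[j]*mu - sum((k[v] - k[j])*Q[j, v] for v in range(m) if v != j)
              for j in range(m)])             # net gain rates r_j

# simulate one hidden state path Y and its failure (minimal repair) times
dt, horizon = 0.002, 30.0
n = int(horizon / dt)
y, failures = 0, []
for i in range(n):
    if y < m - 1 and rng.random() < -Q[y, y]*dt:
        probs = Q[y, y+1:] / (-Q[y, y])
        y = y + 1 + rng.choice(len(probs), p=probs)
    if rng.random() < lam[y]*dt:
        failures.append(i*dt)

# filter (5.27)/(5.28) driven only by the observed failure times,
# combined with the candidate stopping rule (5.30)
u = np.array([1.0, 0.0, 0.0])                 # U_0(j) = P(Y_0 = j)
fail_iter = iter(failures)
next_fail = next(fail_iter, np.inf)
tau_star = None
for i in range(n):
    t = i*dt
    if t >= next_fail:                        # Bayes update (5.28) at a failure
        u = lam*u / (lam @ u)
        next_fail = next(fail_iter, np.inf)
    else:                                     # Euler step for the ODE (5.27)
        u = u + dt*(u @ Q + u*((lam @ u) - lam))
        u = np.clip(u, 0.0, None); u /= u.sum()
    if tau_star is None and r @ u <= 0.0:
        tau_star = t                          # first time estimated rate <= 0
print(tau_star)
```

Note the two mechanisms at work: between failures the prior drift u @ Q pushes mass toward the bad states, while each observed failure reweights the estimate toward the high-intensity states via (5.28).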


Theorem 5.30. Let τ∗ be the A-stopping time (5.30) and assume that conditions (5.29) hold true. If, in addition, qim > λm − λi, i = 1, . . . ,m − 1, then τ∗ is optimal:

EZτ∗ = sup{EZτ : τ ∈ CA}.

Proof. Because of EẐτ = EZτ for all τ ∈ CA we can apply Theorem 5.9, p. 181, of Chap. 3 taking the A-SSM representation (5.26) of Ẑ. We will proceed in two steps:

(a) First, we prove that the monotone case holds true.
(b) Second, we show that the martingale part M̂ in (5.26) is uniformly integrable.

(a) We start by showing that the integrand Σ_{j=1}^m Ûs(j) rj has nonincreasing paths. A simple rearrangement gives

Σ_{j=1}^m Ûs(j) rj = rm + (rm−1 − rm) Σ_{j=1}^{m−1} Ûs(j) + · · · + (r1 − r2) Ûs(1).

Since we have from (5.29) that rk−1 − rk ≥ 0, k = 2, . . . ,m, it remains to show that Σ_{ν=1}^j Ûs(ν) is nonincreasing in s for j = 1, . . . ,m − 1. Denoting λ̂(s) = Σ_{j=1}^m Ûs(j)λj, we get from (5.27) between jumps Tn < s < Tn+1, where T0 = 0,

d/ds ( Σ_{ν=1}^j Ûs(ν) ) = Σ_{ν=1}^j ( Σ_{i=1}^m Ûs(i){qiν + Ûs(ν)(λi − λν)} )

= Σ_{i=1}^m Σ_{ν=1}^j Ûs(i)qiν + Σ_{ν=1}^j Ûs(ν)(λ̂(s) − λν)

= Σ_{i=1}^j Ûs(i) ( −Σ_{k=j+1}^m qik + λ̂(s) − λi )

using qij = 0 for i > j and qii = −Σ_{k=i+1}^m qik, i = 1, . . . ,m − 1.

From qim > λm − λi ≥ λ̂(s) − λi it follows that

d/ds ( Σ_{ν=1}^j Ûs(ν) ) ≤ 0, j = 1, . . . ,m − 1.

At jumps Tn we have from (5.28)

Σ_{ν=1}^j (ÛTn(ν) − ÛTn−(ν)) = Σ_{ν=1}^j ÛTn−(ν) (λν − λ̂(Tn−))/λ̂(Tn−).


The condition λ1 ≤ · · · ≤ λm ensures that the latter sum is not greater than 0. This is obvious in the case λj ≤ λ̂(Tn−); otherwise, if λj > λ̂(Tn−), this follows from

0 = Σ_{ν=1}^m ÛTn−(ν)(λν − λ̂(Tn−))/λ̂(Tn−) ≥ Σ_{ν=1}^j ÛTn−(ν)(λν − λ̂(Tn−))/λ̂(Tn−).

For the monotone case to hold it is also necessary that

∪_{t∈R+} { Σ_{j=1}^m Ût(j) rj ≤ 0 } = Ω

or equivalently τ∗ < ∞. From (5.24) we obtain by means of the projection theorem

Ût(m) = Û0(m) + ∫_0^t Σ_{i=1}^{m−1} Ûs(i) qim ds + m̂t(m)

with a nonnegative integrand. This shows that Ût(m) is a bounded submartingale. Thus, the limit

Û∞(m) = lim_{t→∞} Ût(m) = E[U∞(m) | A∞]

exists and is identical to 1, since lim_{t→∞} Yt = m and hence U∞(m) = 1. Because rm < 0, it is possible to choose some ε > 0 such that (1 − ε)rm + ε Σ_{i=1}^{m−1} ri < 0. Therefore, we have

τ∗ = inf{ t ∈ R+ : Σ_{j=1}^m Ût(j) rj ≤ 0 } ≤ inf{ t ∈ R+ : Ût(m) ≥ 1 − ε } < ∞.

(b) To show that M̂ is uniformly integrable we consider a decomposition of the drift term of the F-SSM representation of Z:

∫_0^t Σ_{j=1}^m Us(j) rj ds = ∫_0^t Σ_{j=1}^m Us(j)(rj − rm) ds + t rm,

where t rm is obviously A-adapted. We use the projection Theorem 3.19, p. 69, in the extended version. To this end we have to show that

1. Z0 = u − Σ_{j=1}^m kj U0(j) and ∫_0^∞ |Σ_{j=1}^m Us(j)(rj − rm)| ds are square integrable, and that
2. M is square integrable.

The details of these parts are omitted here and can be found in [93, 95]. To sum up, by (a) the monotone case holds true for Ẑ with a martingale part M̂, which is by (b) square integrable and hence uniformly integrable. The monotone stopping Theorem 5.9 can then be applied and the assertion of the theorem follows. ∎


5.4.3 The Case of m=2 States

For two states the stopping problem can be reformulated as follows. At an unobservable random time, say σ, there occurs a switch from state 1 to state 2. Detect this change as well as possible (with respect to the given optimization criterion) by means of the failure process observations. The conditions (5.29) now read

q1 = q12 = q > 0, q2 = q21 = 0,

r1 = c− λ1μ− q(k2 − k1) > 0 > r2 = c− λ2μ, (5.31)

0 < λ1 ≤ λ2.

The conditional distribution of σ can be obtained explicitly as the solution of the above differential equations. To obtain this explicit solution we assume in addition P(Y0 = 1) = 1. The result of the (lengthy) calculations is

Ût(2) = P(σ ≤ t | At) = 1 − e^{−gn(t)} / ( dn + (λ2 − λ1) ∫_{Tn}^t e^{−gn(s)} ds ), Tn ≤ t < Tn+1,

ÛTn(2) = λ2 ÛTn−(2) / ( λ1 + (λ2 − λ1) ÛTn−(2) ),

where dn = (1 − ÛTn(2))^{−1} and gn(t) = (q − (λ2 − λ1))(t − Tn). The stopping time τ∗ in (5.30) can now be written as

τ∗ = inf{t ∈ R+ : Ût(2) > z∗}, z∗ = r1/(r1 − r2).

For 0 < q < λ2 − λ1, Ût(2) increases as long as Ût(2) < q/(λ2 − λ1) = r. When Ût(2) jumps above this level, then between jumps Ût(2) decreases, but not below the level r. So even in this case under conditions (5.31) the monotone case holds true if z∗ ≤ q/(λ2 − λ1). As a consequence of Theorem 5.30 we have the following corollary.

Corollary 5.31. Assume conditions (5.31) with stopping rule τ∗ = inf{t ∈ R+ : Ût(2) > z∗}. Then τ∗ is optimal in CA if either q > λ2 − λ1 or z∗ ≤ q/(λ2 − λ1).

Remark 5.32. If the failure rates in both states coincide, i.e., λ1 = λ2, the observation of the failure time points should give no additional information about the change time point from state 1 to state 2. Indeed, in this case the conditional distribution of σ is deterministic,

P(σ ≤ t | At) = P(σ ≤ t) = 1 − exp{−qt},

and τ∗ is a constant. As is to be expected, random observations are useless in this case.


In general, the value of the stopping problem sup{EZτ : τ ∈ CA}, the best possible expected net gain, cannot be determined explicitly. But it is possible to determine bounds for this value. For this, the semimartingale representation turns out to be useful again, because it allows, by means of the projection theorem, comparisons of different information levels. The constant stopping times are contained in CA and CA ⊂ CF. Therefore, the following inequality applies:

sup{EZt : t ∈ R+} ≤ sup{EZτ : τ ∈ CA} ≤ sup{EZτ : τ ∈ CF}.

At the complete information level F the change time point σ can be observed, and it is obvious that under conditions (5.31) the F-stopping time σ is optimal in CF. Thus, we have the following upper and lower bounds bu and bl:

bl ≤ sup{EZτ : τ ∈ CA} ≤ bu

with

bl = sup{EZt : t ∈ R+},
bu = sup{EZτ : τ ∈ CF} = EZσ.

Some elementary calculations yield

bl = u − k2 + (1/q)(c − λ1μ) − (r2/q) ln( −r2/(r1 − r2) ),

bu = u − k2 + (1/q)(c − λ1μ).

For λ1 = λ2 the optimal stopping time is deterministic, so that in this case the lower bound is attained.
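These bounds are trivially computable once the model parameters are fixed. A small numeric sketch, with assumed parameter values chosen to satisfy (5.31):

```python
import math

# Numeric sketch of the bounds b_l and b_u for the two-state model;
# every parameter value is an illustrative assumption.
u0, q, c, mu = 10.0, 0.5, 1.0, 1.0    # initial capital, switch rate, income, mean repair cost
lam1, lam2 = 0.5, 2.0                 # failure rates in states 1 and 2
k1, k2 = 0.2, 0.4                     # stopping costs

r1 = c - lam1 * mu - q * (k2 - k1)    # = 0.4 > 0
r2 = c - lam2 * mu                    # = -1.0 < 0
assert r1 > 0.0 > r2                  # conditions (5.31)

b_u = u0 - k2 + (c - lam1 * mu) / q
b_l = b_u - (r2 / q) * math.log(-r2 / (r1 - r2))
print(b_l, b_u)
```

Since r1 > 0 > r2, the logarithm's argument lies in (0, 1), so the correction term is negative and b_l < b_u, as it must be.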

5.5 Maintenance Optimization Models Under Constraints

In this section we consider two models: the first one is a so-called delay time model with safety constraints. The aim is to determine optimal inspection intervals minimizing the expected discounted costs under the safety constraints. The second model is also about optimal inspection, but here the system is represented by a monotone (coherent) structure function. The state of the components and the system is only revealed through inspections.

5.5.1 A Delay Time Model with Safety Constraints

In many cases, the presence of a fault in a system does not lead to an immediate system failure; the system stays in a "defective" state. There will be a time lapse between the occurrence of the fault and the failure of the system, a "delay time". This is the idea of the delay time models, which have been thoroughly discussed in the literature. See the Bibliographic Notes at the end of the chapter.

The delay time models are used as bases for determining monitoring strategies for detecting system defects or faults. The state of the system is revealed by inspections, except for failures, which are observed. The basic delay time model was introduced for analyzing inspection policies for systems regularly inspected every T time units. If an inspection is carried out during the delay time period, the defect is identified and removed. Thus, the delay time model is based on the simplest monitoring framework possible: a defective state and a nondefective state. In most of the models, the objective of the delay time analysis is to determine optimal inspection times that minimize the (expected) long-run average costs or downtimes.

The framework in the present analysis is the basic delay time model subject to regular inspections every T units of time. If a defect is detected by an inspection, a preventive replacement is performed. If the system fails, a corrective replacement is carried out. A replacement brings the system back to the initial state. A cost is incurred at each inspection.

Furthermore, safety constraints are introduced, related to two important safety aspects: the number of failures of the system and the time spent in the defective state (the delay time). The control of these quantities can be obtained by bounding the probability of at least one system failure occurring during a certain interval of time and by bounding the probability that the delay times are larger than a certain number.

The objective of the analysis is to determine an optimal inspection interval T that minimizes the total expected discounted costs under the two safety constraints.

If α is a positive discount factor, a cost C at time t has a value of Ce^{−αt} at time 0. Letting Ti be the length of the ith replacement cycle and Ci the total discounted costs associated with the ith replacement cycle, the total discounted costs incurred can be written (see Sect. 5.3.3)

EC1 / (1 − E[e^{−αT1}]). (5.32)
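Expression (5.32) is the discounted analogue of the renewal reward theorem. Under the assumption that the pairs (Ti, Ci) are i.i.d. over cycles, it follows in one line: writing S0 = 0 and Si = T1 + · · · + Ti,

```latex
E\sum_{i=1}^{\infty} C_i e^{-\alpha S_{i-1}}
  = \sum_{i=1}^{\infty} E[C_1]\,\bigl(E[e^{-\alpha T_1}]\bigr)^{i-1}
  = \frac{E C_1}{1 - E[e^{-\alpha T_1}]},
```

since Ci is independent of the earlier cycle lengths T1, . . . , Ti−1, so each summand has expectation E[C1](E[e^{−αT1}])^{i−1}, and the geometric series converges because E[e^{−αT1}] < 1.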

To explicitly take into account risk and uncertainties we introduce two safety constraints. Below these are defined and the results are compared.

In practice we may consider different levels for the safety constraint. The optimization produces decision support by providing information about the consequences of imposing various safety-level requirements.

Before we search for an optimal inspection time T, we need to specify the optimization model in detail.


Problem Definition and Formulation

We consider a system subject to failures and make the following assumptions.

1. The failure of the system is revealed immediately, and the system is replaced. The replacement time is negligible and the cost of this corrective maintenance is Cc.

2. Before failure occurs, the system passes through a defective state. Let X be a random variable representing the time to the occurrence of a fault and Y a random variable representing the time in the defective state, in case of no replacement of the system. We denote by F and G the distributions of X and Y, respectively. We assume that F and G have densities f and g, respectively. Furthermore, we assume that X and Y have finite expectations.

3. The random variables X and Y are independent.
4. Whether or not the system is in a defective state can only be determined by inspection.
5. An inspection takes place every T units of time, and the cost of each inspection is CI. These inspections are perfect in the sense that if the system is in a defective state, this will be identified by the inspection. If a defect is identified at an inspection, the system will be replaced by a new one. The replacement time is negligible. The cost of this preventive maintenance is Cp, where 0 < CI < Cp < Cc < ∞.

The assumption CI < Cp < Cc is justified by the following type of argument. The inspection tasks are assumed to be rather straightforward activities, whereas preventive maintenance tasks are more extensive operations that involve repairs and replacements of the units. Hence it is reasonable to assume CI < Cp. Furthermore, the corrective maintenance tasks cost more than the preventive maintenance tasks as the replacement of the system is unplanned; hence Cp < Cc.

Consider a replacement cycle defined by the time interval between replacements of the system caused by a preventive maintenance or by a corrective maintenance. For k = 0, 1, 2, . . ., let XT be a random variable representing the time between replacements of the system, i.e.,

XT = (k + 1)T   if kT < X < (k + 1)T ≤ X + Y,
XT = X + Y      if kT < X < X + Y ≤ (k + 1)T.

Let F̄T be the survival function of XT. By conditioning on X = u, we see that

F̄T(t) = F̄(t) + ∫_{[t/T]T}^t f(u) Ḡ(t − u) du, t ≥ 0, (5.33)

where [x] denotes the integer part of x. From (5.33) we obtain the following lemma:


Lemma 5.33.

1 − E[e^{−αXT}] = ∫_0^∞ αe^{−αt} F̄(t) dt + Σ_{k=0}^∞ ∫_{kT}^{(k+1)T} f(u) αe^{−αu} ( ∫_0^{(k+1)T−u} e^{−αv} Ḡ(v) dv ) du.

Proof. Denoting by fT the density function of XT, one obtains that

1 − E[e^{−αXT}] = 1 − ∫_0^∞ e^{−αt} fT(t) dt = ∫_0^∞ αe^{−αt} F̄XT(t) dt,

integrating by parts. Furthermore, using (5.33) we see that 1 − E[e^{−αXT}] can be written as

∫_0^∞ αe^{−αt} F̄(t) dt + ∫_0^∞ αe^{−αt} ( ∫_{[t/T]T}^t f(u) Ḡ(t − u) du ) dt

= ∫_0^∞ αe^{−αt} F̄(t) dt + Σ_{k=0}^∞ ∫_{kT}^{(k+1)T} αe^{−αt} ( ∫_{[t/T]T}^t f(u) Ḡ(t − u) du ) dt

= ∫_0^∞ αe^{−αt} F̄(t) dt + Σ_{k=0}^∞ ∫_{kT}^{(k+1)T} f(u) ( ∫_u^{(k+1)T} αe^{−αt} Ḡ(t − u) dt ) du

= ∫_0^∞ αe^{−αt} F̄(t) dt + Σ_{k=0}^∞ ∫_{kT}^{(k+1)T} αe^{−αu} f(u) ( ∫_u^{(k+1)T} e^{−α(t−u)} Ḡ(t − u) dt ) du

= ∫_0^∞ αe^{−αt} F̄(t) dt + Σ_{k=0}^∞ ∫_{kT}^{(k+1)T} αe^{−αu} f(u) ( ∫_0^{(k+1)T−u} e^{−αt} Ḡ(t) dt ) du,

which shows that the lemma holds. ∎

From the assumptions of the model, a cost $C_p$ is incurred whenever a preventive maintenance is performed. Hence, the expected discounted costs associated with the preventive maintenance in a replacement cycle are given by

$$C_p\sum_{k=0}^\infty e^{-\alpha(k+1)T}\int_{kT}^{(k+1)T} f(u)\,\bar G((k+1)T - u)\,du, \qquad (5.34)$$

5.5 Maintenance Optimization Models Under Constraints 219

noting that if $X = u$ and $kT < u \le (k+1)T$, the system is replaced at $(k+1)T$ if the delay time exceeds $(k+1)T - u$.

Analogously, we obtain that the expected discounted costs associated with the corrective maintenance in a replacement cycle equal

$$C_c\sum_{k=0}^\infty\int_{kT}^{(k+1)T} f(u)\left(\int_u^{(k+1)T} g(v-u)\,e^{-\alpha v}\,dv\right)du, \qquad (5.35)$$

observing that if $X = u$ and $kT < u \le (k+1)T$, the system is replaced at $v$ if the delay time is $v - u$ and $v < (k+1)T$.

Furthermore, a cost $C_I$ is incurred at each inspection, and the expected discounted costs associated with these actions equal

$$C_I\sum_{k=0}^\infty\sum_{i=1}^{k+1} e^{-\alpha iT}\int_{kT}^{(k+1)T} f(u)\,\bar G((k+1)T-u)\,du + C_I\sum_{k=1}^\infty\sum_{i=1}^{k} e^{-\alpha iT}\int_{kT}^{(k+1)T} f(u)\,G((k+1)T-u)\,du,$$

the first term covering cycles ending with a preventive maintenance ($k+1$ inspections) and the second cycles ending with a corrective maintenance ($k$ inspections). Rewritten, using $\bar G + G = 1$, this equals

$$C_I\sum_{k=0}^\infty e^{-\alpha(k+1)T}\int_{kT}^{(k+1)T} f(u)\,\bar G((k+1)T-u)\,du + C_I\sum_{k=1}^\infty\sum_{i=1}^{k} e^{-\alpha iT}\int_{kT}^{(k+1)T} f(u)\,du. \qquad (5.36)$$

Notice that the expression

$$\sum_{k=0}^\infty e^{-\alpha(k+1)T}\int_{kT}^{(k+1)T} f(u)\,\bar G((k+1)T-u)\,du,$$

which appears in (5.34) and (5.36), can be expressed as

$$\sum_{k=0}^\infty\int_0^T f(u+kT)\,e^{-\alpha(u+kT)}\,e^{-\alpha(T-u)}\bar G(T-u)\,du,$$

and finally, as a consequence of the Monotone Convergence Theorem (see Appendix A.2.3), we obtain

$$\sum_{k=0}^\infty e^{-\alpha(k+1)T}\int_{kT}^{(k+1)T} f(u)\,\bar G((k+1)T-u)\,du = \int_0^T h_T(u)\,e^{-\alpha(T-u)}\bar G(T-u)\,du,$$

where, for $T > 0$,

$$h_T(u) = \sum_{k=0}^\infty f(u+kT)\,e^{-\alpha(u+kT)}, \quad 0 \le u \le T. \qquad (5.37)$$

We denote by $C_d(T)$ the total expected discounted costs over $[0,\infty)$. By (5.32) we can focus on the first cycle. From Lemma 5.33 and (5.34)–(5.36) we obtain the following expression for $C_d(T)$:

$$C_d(T) = \frac{C_I\sum_{k=1}^\infty\sum_{i=1}^{k} e^{-\alpha iT}\int_{kT}^{(k+1)T} f(u)\,du + \int_0^T h_T(u)\,c(T-u)\,du}{1 + \int_0^T h_T(u)\,(D(T-u)-1)\,du}, \qquad (5.38)$$

where $h_T(u)$ is given by (5.37) and, for $0 \le u \le T$,

$$c(u) = (C_p + C_I)e^{-\alpha u}\bar G(u) + C_c\int_0^u g(v)e^{-\alpha v}\,dv, \qquad (5.39)$$

$$D(u) = \alpha\int_0^u e^{-\alpha v}\bar G(v)\,dv. \qquad (5.40)$$

Note that, by Lemma 5.33 and the substitution leading to (5.37), the denominator of (5.38) equals $1 - E[e^{-\alpha X_T}]$.

Two safety conditions are introduced in this model. The first one is related to the occurrence of system failures, whereas the second is related to the time spent in a defective state.

Safety Constraint 1: Bound on the Probability of a System Failure

The first constraint is implemented by bounding the probability of occurrence of one or more failures of the system in an interval $[0, A]$. Denoting by $N_{c,T}(A)$ the number of failures of the system in $[0, A]$ when inspections are performed every $T$ time units, the safety constraint is expressed as

P (Nc,T (A) ≥ 1) ≤ ω1,

with 0 < ω1 < 1 or equivalently

1− P (Nc,T (A) = 0) ≤ ω1.

Let $X_{c,T}$ be the time between successive corrective maintenances; then

$$P(N_{c,T}(A) = 0) = \bar F_{c,T}(A),$$

where $\bar F_{c,T}$ denotes the survival function of $X_{c,T}$. The following lemma gives an analytical expression for $\bar F_{c,T}$.


Lemma 5.34. The survival function $\bar F_{c,T}$ of $X_{c,T}$, the time between successive corrective maintenances, can be written in the following way:

$$\bar F_{c,T}(t) = \sum_{i=0}^{k} B_{i,T}\left(\bar F(t-iT) + \int_{kT}^{t} f(u-iT)\,\bar G(t-u)\,du\right), \quad kT \le t \le (k+1)T,\; k = 0,1,2,\ldots, \qquad (5.41)$$

where the coefficient $B_{i,T}$ equals the probability of a preventive maintenance at $iT$ and is obtained from the recursion

$$B_{0,T} = 1,$$

$$B_{k+1,T} = \sum_{i=0}^{k} B_{i,T}\int_{kT}^{(k+1)T} f(u-iT)\,\bar G((k+1)T-u)\,du, \quad k = 0,1,2,\ldots$$

Proof. Notice that we can express $\bar F_{c,T}(t)$ as

$$\bar F_{c,T}(t) = \sum_{i=0}^{k} B_{i,T}\,P_{k,i,T}(t), \quad kT \le t \le (k+1)T,$$

where $B_{i,T}$ represents the probability of a preventive maintenance at $iT$, $0 \le i \le k$, and $P_{k,i,T}(t)$ represents the probability that the system does not fail in $(iT, t]$ and no preventive maintenance is performed in this interval. If no preventive maintenance is performed in $(iT, t]$, then either no defect of the system arises in $(iT, t]$, or a defect arises in $[kT, t)$ but does not lead to a failure before $t$. Hence,

$$P_{k,i,T}(t) = \bar F(t-iT) + \int_{kT}^{t} f(u-iT)\,\bar G(t-u)\,du, \quad kT \le t \le (k+1)T,\; 0 \le i \le k.$$

The probabilities $B_{i,T}$ are obtained recursively as follows. For $i = 0$, $B_{0,T}$, the probability of a preventive maintenance at 0, equals 1. For $i = 1$, $B_{1,T}$ represents the probability of a preventive maintenance at $T$ and equals

$$B_{1,T} = \int_0^T f(u)\,\bar G(T-u)\,du.$$

Analogously, for $i = 2$, $B_{2,T}$ represents the probability of a preventive maintenance at $2T$. If a preventive maintenance is performed at $2T$, then the first preventive maintenance is performed either at $T$ or at $2T$. If the first preventive maintenance is at $T$ and the second at $2T$, then defects of the system arise in $(0, u)$ ($u < T$) and $(T, v)$ ($v < 2T$) but do not lead to failures before $T$ and $2T$, respectively. This event has probability

$$\left(\int_0^T f(u)\,\bar G(T-u)\,du\right)\left(\int_T^{2T} f(v-T)\,\bar G(2T-v)\,dv\right).$$

If the first preventive maintenance is performed at $2T$, then the defect arises at some $u \in (T, 2T)$ but does not lead to a failure before $2T$; the associated probability equals

$$\int_T^{2T} f(u)\,\bar G(2T-u)\,du.$$

Summing over these exclusive events, we obtain

$$\left(\int_0^T f(u)\,\bar G(T-u)\,du\right)\left(\int_T^{2T} f(u-T)\,\bar G(2T-u)\,du\right) + \int_T^{2T} f(u)\,\bar G(2T-u)\,du$$

$$= B_{0,T}\int_T^{2T} f(u)\,\bar G(2T-u)\,du + B_{1,T}\int_T^{2T} f(u-T)\,\bar G(2T-u)\,du = \sum_{i=0}^{1} B_{i,T}\int_T^{2T} f(u-iT)\,\bar G(2T-u)\,du = B_{2,T},$$

which is the desired result in this case.

In general, a preventive maintenance at $(k+1)T$ is equivalent to a preventive maintenance at $iT$ for some $0 \le i \le k$, no defect of the system in $(iT, kT)$, and a defect in $[kT, (k+1)T)$ which does not lead to a failure before $(k+1)T$. Following the same type of arguments as above, this event has probability

$$\sum_{i=0}^{k} B_{i,T}\int_{kT}^{(k+1)T} f(u-iT)\,\bar G((k+1)T-u)\,du.$$

Hence the result holds. $\square$

Using (5.41), the safety constraint can be formulated as

$$a_A(T) \le \omega_1, \qquad (5.42)$$

where $0 < \omega_1 < 1$ and

$$a_A(T) = \begin{cases} 1 - \displaystyle\sum_{i=0}^{[A/T]} B_{i,T}\left(\bar F(A-iT) + \int_{[A/T]T}^{A} f(u-iT)\,\bar G(A-u)\,du\right), & A \ge T,\\[6pt] \displaystyle\int_0^A f(u)\,G(A-u)\,du, & A < T. \end{cases} \qquad (5.43)$$
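The recursion for $B_{i,T}$ and the constraint function (5.43) translate directly into code. The sketch below assumes the Weibull inputs of the later numerical examples and a simple midpoint quadrature.

```python
import math

# Assumed Weibull inputs (the later numerical examples).
lam1, beta1, lam2, beta2 = 1.0, 2.0, 1.0, 3.0

Fbar = lambda t: math.exp(-(lam1 * t) ** beta1)
Gbar = lambda t: math.exp(-(lam2 * t) ** beta2)
G = lambda t: 1.0 - Gbar(t)
f = lambda t: beta1 * lam1 * (lam1 * t) ** (beta1 - 1) * Fbar(t)

def quad(fn, a, b, m=400):
    # Midpoint rule.
    h = (b - a) / m
    return sum(fn(a + (j + 0.5) * h) for j in range(m)) * h

def B_coeffs(T, kmax):
    # Recursion of Lemma 5.34: B[0] = 1 and
    # B[k+1] = sum_i B[i] * int_{kT}^{(k+1)T} f(u - iT) Gbar((k+1)T - u) du.
    B = [1.0]
    for k in range(kmax):
        B.append(sum(B[i] * quad(lambda u, i=i: f(u - i * T) * Gbar((k + 1) * T - u),
                                 k * T, (k + 1) * T) for i in range(k + 1)))
    return B

def a_A(T, A):
    # Probability of at least one system failure in [0, A], formula (5.43).
    if A < T:
        return quad(lambda u: f(u) * G(A - u), 0.0, A)
    k = int(A // T)
    B = B_coeffs(T, k)
    surv = sum(B[i] * (Fbar(A - i * T) +
                       quad(lambda u, i=i: f(u - i * T) * Gbar(A - u), k * T, A))
               for i in range(k + 1))
    return 1.0 - surv

print(round(a_A(0.9, 2.0), 3))
```

As a probability of failure in a growing window, $a_A$ should be nondecreasing in $A$ for fixed $T$, which gives a simple consistency check.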


Safety Constraint 2: Bound on the Limiting Fraction of Time Spent in a Defective State

The second safety constraint is related to the time spent in the defective state. What we would like to control is the proportion of time the system is in such a state. This is implemented by considering the asymptotic limit $b(T)$, which equals the expected time that the system is in the defective state in a replacement cycle divided by the expected length of the renewal cycle (see Appendix B.2). Hence we can formulate the safety criterion as

$$b(T) = \frac{E\left[\int_0^{X_T} 1_d(u)\,du\right]}{E[X_T]} \le \omega_2,$$

where $0 < \omega_2 < 1$ and $1_d(u)$ denotes the indicator function, equal to 1 if the system is defective at time $u$ and 0 otherwise. From (5.33), the expected length of a replacement cycle for this model equals

$$E[X_T] = \int_0^\infty \bar F_T(t)\,dt = E[X] + \sum_{k=0}^\infty\int_{kT}^{(k+1)T} f(u)\left(\int_0^{(k+1)T-u}\bar G(v)\,dv\right)du.$$

It follows that this second safety constraint can be expressed as

$$b(T) \le \omega_2, \qquad (5.44)$$

where $b(T)$ is given by

$$b(T) = \frac{\displaystyle\sum_{k=0}^\infty\int_{kT}^{(k+1)T} f(u)\left(\int_0^{(k+1)T-u}\bar G(v)\,dv\right)du}{\displaystyle E[X] + \sum_{k=0}^\infty\int_{kT}^{(k+1)T} f(u)\left(\int_0^{(k+1)T-u}\bar G(v)\,dv\right)du}, \quad 0 < T \le \infty. \qquad (5.45)$$
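The constraint function (5.45) can likewise be evaluated numerically. A sketch under the assumed Weibull inputs of the second numerical example:

```python
import math

# Assumed Weibull inputs (second numerical example).
lam1, beta1, lam2, beta2 = 1.0, 2.0, 1.0, 3.0

Gbar = lambda t: math.exp(-(lam2 * t) ** beta2)
f = lambda t: beta1 * lam1 * (lam1 * t) ** (beta1 - 1) * math.exp(-(lam1 * t) ** beta1)

def quad(fn, a, b, m):
    # Midpoint rule.
    h = (b - a) / m
    return sum(fn(a + (j + 0.5) * h) for j in range(m)) * h

# E[X] = int_0^inf Fbar(t) dt, evaluated numerically once.
EX = quad(lambda t: math.exp(-(lam1 * t) ** beta1), 0.0, 10.0, 4000)

def b(T, kmax=25):
    # Formula (5.45): expected defective time per cycle over the expected
    # cycle length; the inner integral is E[min(Y, (k+1)T - u)].
    defect = sum(quad(lambda u, k=k: f(u) * quad(Gbar, 0.0, (k + 1) * T - u, 120),
                      k * T, (k + 1) * T, 150) for k in range(kmax))
    return defect / (EX + defect)

print(round(b(0.3), 3))
```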

Optimization

The problem is to find a value of $T$ that minimizes $C_d(T)$ given by (5.38) under the safety constraint (5.42) or (5.44); that is, to find a value $T_{\mathrm{opt}}$ such that

$$C_d(T_{\mathrm{opt}}) = \inf\{C_d(T) : T \in \Upsilon\},$$

where $\Upsilon$ is the set of inspection intervals satisfying inequality (5.42) or (5.44), i.e.,

$$\Upsilon = \{T > 0 : a_A(T) \le \omega_1\}$$

or

$$\Upsilon = \{T > 0 : b(T) \le \omega_2\},$$

where $a_A(T)$ and $b(T)$ are given by (5.43) and (5.45), respectively.

Analyzing the terms of the function $C_d(T)$ given by (5.38), we will show that $C_d(T)$ is continuous in $T$, with

$$\lim_{T\to 0} C_d(T) = \infty.$$

To show the continuity of $C_d(T)$, we need to assume that the density function $f$ of $X$ is continuous. Then $h_T(u)$, given by (5.37), is continuous in $u$ and in $T$, and hence

$$\int_0^T h_T(u)\,c(T-u)\,du \quad\text{and}\quad 1 + \int_0^T h_T(u)\,(D(T-u)-1)\,du,$$

where $c$ and $D$ are given by (5.39) and (5.40), are continuous functions of $T$. Moreover,

$$\int_0^T h_T(u)\,c(T-u)\,du \le (C_p + C_I + C_c)\int_0^T h_T(u)\,du = (C_p + C_I + C_c)\int_0^\infty f(u)e^{-\alpha u}\,du,$$

and consequently

$$\lim_{T\to 0}\int_0^T h_T(u)\,c(T-u)\,du < \infty,$$

and

$$\lim_{T\to 0}\left(1 + \int_0^T h_T(u)\,(D(T-u)-1)\,du\right) = \int_0^\infty \alpha e^{-\alpha u}\bar F(u)\,du < \infty,$$

using that $E[X]$ is finite. Furthermore, notice that

$$\sum_{k=1}^\infty\sum_{i=1}^{k} e^{-\alpha iT}\int_{kT}^{(k+1)T} f(u)\,du = \sum_{k=1}^\infty \frac{e^{-\alpha T} - e^{-\alpha(k+1)T}}{1-e^{-\alpha T}}\int_{kT}^{(k+1)T} f(u)\,du$$

$$= \frac{e^{-\alpha T}}{1-e^{-\alpha T}}\left(\bar F(T) - \sum_{k=1}^\infty e^{-\alpha kT}\int_{kT}^{(k+1)T} f(u)\,du\right)$$

is continuous in $T$, and

$$\lim_{T\to 0}\sum_{k=1}^\infty\sum_{i=1}^{k} e^{-\alpha iT}\int_{kT}^{(k+1)T} f(u)\,du = \infty.$$


Taking these properties into account, the function $C_d(T)$ given by (5.38) is continuous in $T$ and $\lim_{T\to 0} C_d(T) = \infty$. Hence the minimum of $C_d(T)$ in the unconstrained case exists if we include the delay-time policy for $T = \infty$, i.e., a delay-time policy without inspections, for which the corresponding expected discounted costs are given by

$$\lim_{T\to\infty} C_d(T) = \frac{C_c\displaystyle\int_0^\infty f(u)e^{-\alpha u}\,du\,\int_0^\infty g(v)e^{-\alpha v}\,dv}{\displaystyle\int_0^\infty \alpha e^{-\alpha u}\bar F(u)\,du + \int_0^\infty \alpha e^{-\alpha u}f(u)\,du\,\int_0^\infty e^{-\alpha v}\bar G(v)\,dv}.$$

We see that $C_d(\infty) < \infty$. Let $T^*$ be an optimal value of $T$ in the unconstrained case, i.e.,

$$C_d(T^*) = \inf\{C_d(T) : T > 0\}.$$

Clearly, if $T^* \in \Upsilon$, then $T_{\mathrm{opt}} = T^*$, i.e., $T^*$ is also an optimal solution to the constrained optimization problem.

The analytical optimization of $C_d(T)$ is not straightforward, as the function $C_d(T)$ is not of the standard form seen for many maintenance models (nonincreasing up to a minimum value and then nondecreasing), even when $F$ and $G$ are assumed to have increasing failure rates. As we will show later, $C_d(T)$ may have several local minima. Also the safety constraint functions $a_A(T)$ and $b(T)$ can have rather irregular forms, compared to the common increasing shapes seen for other maintenance optimization models.

Numerical Examples

In this section we present some numerical examples for the above model. The aim is to find a value of $T$ that minimizes $C_d(T)$ given by (5.38) under the two safety constraints based on the occurrence of failures in an interval, (5.42), and the fraction of time in a defective state, (5.44). We refer to these constraints as criterion 1 and criterion 2, respectively.

We assume that the random variables $X$ and $Y$ follow Weibull distributions with nondecreasing failure rates, i.e.,

$$\bar F(t) = \exp\{-(\lambda_1 t)^{\beta_1}\}, \quad \bar G(t) = \exp\{-(\lambda_2 t)^{\beta_2}\}, \quad t \ge 0,$$

where $\beta_i > 1$ for $i = 1, 2$.

Intuitively, we may think that the proportion of time that the system is in a defective state is increasing in $T$. However, this is not true in general. A counterexample, based on rather extreme failure rates, is given in the following.


Let $\lambda_1 = 1$, $\lambda_2 = 1$, $\beta_1 = 20$, and $\beta_2 = 30$ be the parameters of the Weibull distributions. For these parameters

$$E[X] = 0.9735, \quad E[Y] = 0.9818.$$

Figure 5.1 shows a simulation of the long-run proportion of time that the system is in a defective state as a function of $T$. The simulation has been carried out using 500 points between 0.2 and 2.2, with 500,000 realizations at each point. We see from the figure that $b(T)$ in this case has a rather irregular form, with many local minima and maxima.
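The simulation behind this figure can be sketched by renewal-reward sampling: the defective period of a cycle runs from $X$ until $\min(X+Y,\ \text{next inspection after } X)$. The sketch below uses far fewer replications than the 500,000 per point reported above.

```python
import math
import random

# Extreme parameters of the counterexample above.
lam1, beta1, lam2, beta2 = 1.0, 20.0, 1.0, 30.0

def sample_weibull(lam, beta, rng):
    # Inverse-transform sampling.
    return (-math.log(1.0 - rng.random())) ** (1.0 / beta) / lam

def b_mc(T, n, rng):
    # Renewal-reward estimate: the system is defective from X until
    # min(X + Y, next inspection after X), at which point it is either
    # replaced (preventive maintenance) or fails (corrective maintenance).
    defective = total = 0.0
    for _ in range(n):
        x = sample_weibull(lam1, beta1, rng)
        y = sample_weibull(lam2, beta2, rng)
        end = min(x + y, T * (math.floor(x / T) + 1))
        defective += end - x
        total += end
    return defective / total

rng = random.Random(7)
vals = {T: b_mc(T, 50_000, rng) for T in (1.0, 1.5)}
print({T: round(v, 3) for T, v in vals.items()})
```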

Fig. 5.1. Function $b(T)$ versus $T$

A similar behavior is observed for the function $a_A(T)$ given by (5.43). This function represents the probability of occurrence of at least one failure in $[0, A]$. For the same numerical example as above, the monotonicity of $a_A(T)$ is not guaranteed, as we can see from Fig. 5.2, which displays a simulation of $a_A(T)$ for $A = 2$.

In the case $\lambda_1 = 1$, $\beta_1 = 20$ and $\lambda_2 = 1$, $\beta_2 = 30$, the distributions of $X$ and $Y$ are highly concentrated in the interval $[0.8, 1.1]$, i.e.,

$$P[0.8 \le X \le 1.1] = 0.9873, \quad P[0.8 \le Y \le 1.1] = 0.9988.$$

We focus on the function $a_2(T)$, the probability of occurrence of one or more failures in $[0, 2]$. For $T = 1.5$, the system is "always" in the defective state at the inspection, and the inspection avoids a corrective maintenance; hence $a_2(1.5) \approx 0$. However, for inspection intervals near 1, the system may or may not be in a defective state at the inspection. If it is not, the next inspection takes place at time $T = 2$, and a corrective maintenance could occur in this period. Hence $a_2(1) > a_2(1.5)$, and the monotonicity of $a_2(T)$ is not guaranteed.


Fig. 5.2. Function $a_2(T)$ versus $T$

Next, we specify the costs. Let $C_p = 400$, $C_c = 1000$, and $C_I = 100$ be the costs incurred for a preventive maintenance, a corrective maintenance, and an inspection, respectively. Furthermore, let $\alpha = 0.4$ be the discount rate. For $\lambda_1 = 1$, $\lambda_2 = 1$, $\beta_1 = 20$, and $\beta_2 = 30$, Fig. 5.3 displays a simulation of the total expected discounted costs versus $T$. This simulation has been performed using 500 points between 0.2 and 2.5, with 500,000 realizations at each point. As we can see, for this numerical example $C_d(T)$ has several local minima. The global minimum of $C_d(T)$ is attained at $T^* = 1.79$, with expected discounted costs $C_d(1.79) = 397.68$.
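The simulated cost curve and its grid minimization can be sketched in the same spirit: by the renewal argument, the total discounted cost equals the expected discounted first-cycle cost divided by $1 - E[e^{-\alpha X_T}]$. The replication count and grid below are much coarser than those used for the figures.

```python
import math
import random

# Extreme Weibull parameters and costs of this numerical example.
lam1, beta1, lam2, beta2 = 1.0, 20.0, 1.0, 30.0
alpha, Cp, Cc, CI = 0.4, 400.0, 1000.0, 100.0

def sample_weibull(lam, beta, rng):
    return (-math.log(1.0 - rng.random())) ** (1.0 / beta) / lam

def cd_mc(T, n, rng):
    # Renewal argument: total discounted cost over [0, inf) equals
    # E[discounted first-cycle cost] / (1 - E[exp(-alpha * X_T)]).
    cost_sum = disc_sum = 0.0
    for _ in range(n):
        x = sample_weibull(lam1, beta1, rng)
        y = sample_weibull(lam2, beta2, rng)
        k = math.floor(x / T)            # defect arises in (kT, (k+1)T]
        if x + y <= (k + 1) * T:         # failure before the next inspection
            end, cost, n_insp = x + y, Cc * math.exp(-alpha * (x + y)), k
        else:                            # defect caught, replacement at (k+1)T
            end = (k + 1) * T
            cost, n_insp = Cp * math.exp(-alpha * end), k + 1
        cost += CI * sum(math.exp(-alpha * i * T) for i in range(1, n_insp + 1))
        cost_sum += cost
        disc_sum += math.exp(-alpha * end)
    return (cost_sum / n) / (1.0 - disc_sum / n)

rng = random.Random(3)
grid = [round(0.2 + 0.1 * j, 1) for j in range(23)]   # T in [0.2, 2.4]
costs = {T: cd_mc(T, 20_000, rng) for T in grid}
best = min(costs, key=costs.get)
print(best, round(costs[best], 1))
```

Very frequent inspection (small $T$) should come out far more expensive than the best interval on the grid.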

Finally, we specify the safety constraints, starting with criterion 1. We assume that $\omega_1 = 0.2$ and $A = 2$, i.e., the probability of occurrence of one or more failures in two units of time should not exceed 0.2, that is,

$$P(N_c(2) \ge 1) \le 0.2.$$

Figure 5.4 shows the total expected discounted costs $C_d(T)$ along with the function $a_2(T)$. We find that

$$\Upsilon = \{T > 0 : a_2(T) \le 0.2\} = (0, 1.898].$$

In this case $T^* = 1.79 \in \Upsilon$, and hence the optimal value for the constrained optimization problem under criterion 1 is $T_{\mathrm{opt}} = 1.79$, with $C_d(1.79) = 397.68$.

Fig. 5.3. Total expected discounted costs $C_d(T)$ versus $T$

Consider now the constrained optimization problem under criterion 2. We assume that $\omega_2 = 0.15$, i.e., the proportion of time that the system is in a defective state should not exceed 0.15. Figure 5.5 shows the total expected discounted costs and the function $b(T)$ for this problem. In this case

$$\Upsilon = \{T > 0 : b(T) \le 0.15\} = (0, 0.291] \cup [0.3272, 0.3823] \cup [0.508, 0.5727] \cup [1.041, 1.1454].$$

By inspection, the optimal value for the constrained optimization problem is $T_{\mathrm{opt}} = 1.1454$, with $C_d(1.1454) = 687$.

In the following example we use a more realistic set of parameter values for the Weibull distributions: $\lambda_1 = 1$, $\lambda_2 = 1$, $\beta_1 = 2$, and $\beta_2 = 3$. In this case

$$E[X] = 0.8862, \quad E[Y] = 0.8930.$$

Let $C_p = 400$, $C_c = 1000$, and $C_I = 100$ be the costs incurred, with discount rate $\alpha = 0.4$. The functions $C_d(T)$, $a_A(T)$, and $b(T)$ are shown in Figs. 5.6–5.8.

Figure 5.6 shows a simulation of the total expected discounted costs $C_d(T)$ versus $T$ for this example. The function $C_d(T)$ is of the standard form, nonincreasing up to $T = 1.1511$ and nondecreasing for $T \ge 1.1511$. Hence $T^* = 1.1511$, with corresponding expected discounted costs $C_d(1.1511) = 804.0365$.

We analyze the constrained optimization problem for each safety requirement. As above, we put $\omega_1 = 0.2$ for criterion 1. From Fig. 5.7 we find that

$$\Upsilon = \{T > 0 : a_2(T) \le 0.2\} = (0, 0.975].$$


Fig. 5.4. (a) Total expected discounted costs $C_d(T)$ versus $T$. (b) Function $a_2(T)$ versus $T$

Due to the form of $C_d(T)$, the optimal value for the constrained optimization problem is $T_{\mathrm{opt}} = 0.975$, with $C_d(0.975) = 813.55$.

For criterion 2, we suppose ω2 = 0.15. From Fig. 5.8,

Υ = {T > 0; b(T ) ≤ 0.15} = (0, 0.313],

and using the same reasoning as above, the optimal value of $C_d(T)$ is attained at $T_{\mathrm{opt}} = 0.313$, with $C_d(0.313) = 1372$. Comparing the expected costs for the unconstrained and the constrained problem, we see that implementing the safety constraint introduces a rather large cost.

Both constraints can be used to control the safety level. However, we prefer criterion 1, as it is more directly related to the failures of the system.

5.5.2 Optimal Test Interval for a Monotone Safety System

Fig. 5.5. (a) Total expected discounted costs $C_d(T)$ versus $T$. (b) Function $b(T)$ versus $T$

In this section we consider a safety system represented by a monotone (coherent) structure function of $n$ components. The components and the system can each be in one of several states. The states of the components and the system are revealed only through inspections, which are carried out at intervals of length $T$. If an inspection shows that the system is in a critical state or has failed, it is overhauled and all components are restored to good-as-new condition. The system is in a critical state if further deterioration of a component (component $i$ jumps from state $j$ to state $j-1$) induces system failure. As the system is a safety system in standby position, the state of the system and its components is revealed only by testing. The aim of the testing and overhaul is to avoid that the system fails and stays in the failure state for a long period. However, this goal has to be balanced against the costs of inspections and overhauls; too frequent inspections would not be cost optimal. Costs are associated with tests, system downtime, and repairs. The optimization criterion is the expected long-run cost per unit of time.

Below we present a formal set-up for this problem and show how an optimal $T$ can be determined. A special case where the components have three states is given special attention. It corresponds to a "delay time type system," where the presence of a fault in a component does not lead to an immediate failure; there will be a "delay time" between the occurrence of the fault and the failure of the component. We refer to Sect. 5.5.1.

Model and Problem Definition

Fig. 5.6. Total expected discounted costs $C_d(T)$ versus $T$

We consider a safety system comprising $n$ components, numbered consecutively from 1 to $n$. The state of component $i$ at time $t$, $t \ge 0$, is denoted $X_t(i)$, $i = 1, 2, \ldots, n$, where $X_t(i)$ can be in one of $M_i + 1$ states, $0, 1, \ldots, M_i$. The paths $X_\cdot(i)$ are assumed to be right-continuous. The states represent different levels of performance, from the worst, 0, to the best, $M_i$. At time $t = 0$, all components are in the best state, i.e., $X_0(i) = M_i$, $i = 1, 2, \ldots, n$. The random duration time in state $M_i$ is denoted $U_{iM_i}$. The component then jumps to state $M_i - 1$, where it stays a random time $U_{i(M_i-1)}$, and so on until the component reaches the absorbing state 0. All sojourn times are positive random variables. The probability distribution of $U_{ij}$ is denoted $F_{ij}$. The distributions $F_{ij}$ are assumed absolutely continuous, with finite means. The density and "jump rate" of $F_{ij}(t)$ are denoted $f_{ij}(t)$ and $r_{ij}(t)$, respectively, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, M_i$. The jump rate $r_{ij}(t)$ is defined as usual as

$$r_{ij}(t) = \lim_{h\to 0}\frac{1}{h}\,P(U_{ij} \le t + h \mid U_{ij} > t).$$

Hence $r_{ij}(t)h$ ($h$ a small positive number) is approximately equal to the conditional probability that component $i$ makes a jump to state $j-1$ in the interval $(t, t+h]$, given that the component has stayed in state $j$ during the interval $[0, t]$. The sojourn times $U_{iM_i}, U_{i(M_i-1)}, \ldots, U_{i1}$, $i = 1, 2, \ldots, n$, are assumed independent. The distribution of the vector $\mathbf{U}$ of all $U_{ij}$ is denoted $F_{\mathbf{U}}$.

We denote by $G(t,\mathbf{x})$ the distribution of the vector of component states $X_t = (X_t(1), X_t(2), \ldots, X_t(n))$, i.e.,

$$G(t,\mathbf{x}) = P(X_t(1) = x_1, X_t(2) = x_2, \ldots, X_t(n) = x_n).$$

Here $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, where $x_i \in \{0, 1, \ldots, M_i\}$. The state of the system at time $t$ is denoted $\Phi_t$ and is a function of the states of the components, i.e.,

$$\Phi_t = \phi(X_t),$$

where $\phi$ is the structure function of the system. We assume that $\Phi$ and $\phi$ are binary, equal to 1 if the system is functioning and 0 otherwise (see Sect. 2.1). The system is a monotone system (see Sect. 2.1.2), i.e., its structure function $\phi$ is nondecreasing in each argument, and

$$\phi(0, 0, \ldots, 0) = 0 \quad\text{and}\quad \phi(M_1, M_2, \ldots, M_n) = 1.$$


Fig. 5.7. (a) Total expected discounted costs $C_d(T)$ versus $T$. (b) Function $a_2(T)$ versus $T$

Since at time $t = 0$ all components are in the best state, $\Phi_0 = 1$. The components deteriorate, and at time $\tau$ the system fails, i.e.,

$$\tau = \inf\{t > 0 : \phi(X_t) = 0\}.$$

The deterioration of the components and the system failure are revealed by inspections. It is assumed that the system is inspected every $T$ units of time. If the system is found to be in the failure state, a complete overhaul is carried out, meaning that all components are repaired to a good-as-new condition. Furthermore, a preventive policy is introduced: if the system is found to be in a critical state, a complete overhaul is also conducted. The system is said to be in a critical state if it is functioning and there exists at least one $i$ such that the system fails if component $i$ jumps to state $X_t(i) - 1$. Let $\tau_C$ be the first time the system becomes critical. Then

$$\tau_C = \inf\{t \ge 0 : \phi(X_t) = 1,\ \phi((X_t(i)-1)_i, X_t) = 0 \text{ for at least one } i\},$$

where $\phi(\cdot_i, \mathbf{x}) = \phi(x_1, \ldots, x_{i-1}, \cdot, x_{i+1}, \ldots, x_n)$. We assume $\tau_C > 0$, i.e., the system is not critical at time 0.

The distribution of $\tau_C$ is denoted $F_{\tau_C}$. The times $\tau$ and $\tau_C$ are functions of the duration times $U_{ij}$. Let $g$ and $g_C$ be defined by

$$\tau = g(\mathbf{U}) \quad\text{and}\quad \tau_C = g_C(\mathbf{U}).$$

The inspections and overhauls are assumed to take negligible time.

To further characterize the critical states, we introduce the concept of a critical path vector for system level 1:


Fig. 5.8. (a) Total expected discounted costs $C_d(T)$ versus $T$. (b) Function $b(T)$ versus $T$

Definition 5.35. A state vector $\mathbf{x}$ is a critical path vector for system level 1 (the functioning state of the system) if and only if $\phi(\mathbf{x}) = 1$ and $\phi((x_i - 1)_i, \mathbf{x}) = 0$ for at least one $i$.

From this definition we introduce a maximal critical path vector:

Definition 5.36. A critical path vector $\mathbf{x}$ is a maximal critical path vector for system level 1 if it cannot be increased without losing its status as a critical path vector.

Note that these concepts differ from the commonly defined path vectors and minimal path vectors of a monotone system; see Sect. 2.1.2.

Based on the maximal critical path vectors we introduce a new structure function, $\phi_C(\mathbf{x})$, which is equal to 1 if and only if there exists no maximal critical path vector $\mathbf{x}^k$ such that the state $\mathbf{x}$ is below or equal to $\mathbf{x}^k$, i.e.,

$$\phi_C(\mathbf{x}) = \prod_k\left(1 - I(\mathbf{x} \le \mathbf{x}^k)\right),$$

where $k$ runs through all maximal critical path vectors for the system at level 1. We see that the system $\phi_C$ fails as soon as the system state becomes critical. As an example, consider a binary parallel system. The maximal critical path vectors are $(1,0)$ and $(0,1)$, and $\phi_C(\mathbf{x}) = x_1 x_2$: if one component fails, the system state becomes critical.

A counting process $N$ is introduced that jumps to 1 at the time of system failure, i.e.,

$$N_t = I(\tau \le t).$$


Let $V_{ij,t}$ be the virtual age of component $i$ in state $j$ at time $t$. Then the intensity $\lambda_t$ of $N$ is given by

$$\lambda_t = \sum_{i=1}^{n}\sum_{j=1}^{M_i} r_{ij}(V_{ij,t})\,I(X_t(i) = j)\,\phi(X_t)\,(1 - \phi((j-1)_i, X_t)),$$

noting that $r_{ij}(V_{ij,t})$ is the rate at time $t$ for component $i$ to cause system failure by jumping from state $j$ to state $j-1$. A formal proof can be given following the approach in Sect. 3.2.2. Introducing $\phi_{ij}(\mathbf{x}) = I(x_i = j)\,\phi(\mathbf{x})\,(1 - \phi((j-1)_i, \mathbf{x}))$, the intensity $\lambda_t$ can be expressed as

$$\lambda_t = \sum_{i=1}^{n}\sum_{j=1}^{M_i} r_{ij}(V_{ij,t})\,\phi_{ij}(X_t).$$

Analogously, we define a counting process $N_C$ for the process $\phi_C$. This counting process jumps to 1 at the time the system becomes critical, i.e.,

$$N_{C,t} = I(\tau_C \le t).$$

The intensity $\lambda_{C,t}$ of $N_C$ is given by

$$\lambda_{C,t} = \sum_{i=1}^{n}\sum_{j=1}^{M_i} r_{ij}(V_{ij,t})\,I(X_t(i) = j)\,\phi_C(X_t)\,(1 - \phi_C((j-1)_i, X_t)).$$

Similarly to $\phi_{ij}$ we define $\phi_{ijC}(\mathbf{x}) = I(x_i = j)\,\phi_C(\mathbf{x})\,(1 - \phi_C((j-1)_i, \mathbf{x}))$, and hence the intensity $\lambda_{C,t}$ can be expressed as

$$\lambda_{C,t} = \sum_{i=1}^{n}\sum_{j=1}^{M_i} r_{ij}(V_{ij,t})\,\phi_{ijC}(X_t).$$

The following cost structure is assumed: the cost of a complete overhaul is $c_p$, whereas the cost of each inspection is $c_I$. If the system is not functioning, a cost $c$ is incurred per unit of time. All costs are positive numbers.

The problem is to find an optimal $T$ minimizing the long-run expected cost per unit of time.

Optimization

For a fixed test interval length $T$, $0 < T < \infty$, the system is overhauled at time $\tau_T$, where $\tau_T$ is the time of the first inspection following a critical state, i.e.,

$$\tau_T = T([\tau_C/T] + 1),$$

where $[x]$ equals the integer part of $x$. This inspection represents a renewal for the cost and time processes, and using the renewal reward theorem (see Appendix B.2), it follows that the long-run (expected) cost per unit of time, $B_T$, can be written as

$$B_T = \frac{EC_T}{E\tau_T}, \qquad (5.46)$$

where $E\tau_T$ expresses the expected length of the first renewal cycle (the time until renewal) and $EC_T$ expresses the expected cost associated with this cycle. It is seen that $E\tau_T < \infty$ and $EC_T < \infty$, observing that $E\tau_T \le \sum_{ij} EU_{ij} + T$ and $EC_T \le Tc + c_p + c_I(E\tau_T/T + 1)$. Theorem 5.37 establishes explicit formulas for $E\tau_T$ and $EC_T$, and hence for $B_T$.

Theorem 5.37. Under the above model assumptions, with $\tau = g(\mathbf{U})$ and $\tau_C = g_C(\mathbf{U})$, we have

$$E\tau_T = T\sum_{k=0}^\infty (k+1)\int_{u:\, kT < g_C(u) \le (k+1)T} dF_{\mathbf{U}}(u), \qquad (5.47)$$

$$EC_T = \sum_{k=0}^\infty\int_{u:\, kT < g_C(u) \le (k+1)T}\big[c_I(k+1) + c_p + c\,I(g(u) \le (k+1)T)\{(k+1)T - g(u)\}\big]\,dF_{\mathbf{U}}(u). \qquad (5.48)$$

Proof. To establish (5.47), we write

$$\tau_T = \sum_{k=0}^\infty I(kT < \tau_C \le (k+1)T)\,(k+1)T.$$

Taking expectations, we obtain

$$E\tau_T = E\sum_{k=0}^\infty I(kT < g_C(\mathbf{U}) \le (k+1)T)\,(k+1)T = T\sum_{k=0}^\infty (k+1)\int_{u:\, kT < g_C(u) \le (k+1)T} dF_{\mathbf{U}}(u),$$

which proves (5.47). To establish (5.48), we use a similar approach, writing the cost $C_T$ as a function of $\tau_C$ and $\tau$:

$$C_T = \sum_{k=0}^\infty I(kT < \tau_C \le (k+1)T)\big[c_I(k+1) + c_p + c\,I(\tau \le (k+1)T)\{(k+1)T - \tau\}\big], \qquad (5.49)$$

noting that the system is down for a period $(k+1)T - \tau$ if it enters a critical state in the interval $(kT, (k+1)T]$ and fails before the inspection at time $(k+1)T$. Taking expectations, we obtain (5.48). $\square$

In the following theorem we establish more explicit formulas for $E\tau_T$ and $EC_T$ using counting process theory. Then we do not need the distribution $F_{\mathbf{U}}$ but only the distribution of $X_t$, $G(t,\mathbf{x})$. We consider two special cases:


– The system is a binary system with binary components, i.e., $M_i = 1$ for $i = 1, 2, \ldots, n$. $\qquad$ (5.50)

– The rates $r_{ij}$ are independent of $t$, i.e., the sojourn times are all exponentially distributed. $\qquad$ (5.51)

Theorem 5.38. Let

$$H_{ij}(t,\mathbf{x}) = \int_0^t r_{ij}(s)\,G(s,\mathbf{x})\,ds.$$

For the cases (5.50) and (5.51), we then have

$$E\tau_T = \sum_{k=0}^\infty T(k+1)\sum_{i=1}^{n}\sum_{j=1}^{M_i}\sum_{\mathbf{x}}\phi_{ijC}(\mathbf{x})\big[H_{ij}((k+1)T,\mathbf{x}) - H_{ij}(kT,\mathbf{x})\big], \qquad (5.52)$$

where $\phi_{ijC}(\mathbf{x}) = I(x_i = j)\,\phi_C(\mathbf{x})\,(1 - \phi_C((j-1)_i, \mathbf{x}))$. Furthermore, if $G_s(t,\mathbf{x}\,|\,\mathbf{x}')$ denotes the conditional distribution of $X_t$ given $X_s = \mathbf{x}'$ ($t > s$), we have

$$EC_T = \sum_{k=0}^\infty\big[c_I(k+1)+c_p\big]\sum_{i=1}^{n}\sum_{j=1}^{M_i}\sum_{\mathbf{x}}\phi_{ijC}(\mathbf{x})\big[H_{ij}((k+1)T,\mathbf{x})-H_{ij}(kT,\mathbf{x})\big]$$

$$\quad + \sum_{k=0}^\infty\sum_{\mathbf{x}'}\phi_C(\mathbf{x}')\,G(kT,\mathbf{x}')\sum_{i=1}^{n}\sum_{j=1}^{M_i}\sum_{\mathbf{x}}\phi_{ij}(\mathbf{x})\int_{kT}^{(k+1)T} c\big((k+1)T-t\big)\,r_{ij}(t)\,G_{kT}(t,\mathbf{x}\,|\,\mathbf{x}')\,dt. \qquad (5.53)$$

Proof. To establish (5.52), we write

$$\tau_T = \sum_{k=0}^\infty I(kT < \tau_C \le (k+1)T)\,(k+1)T = \sum_{k=0}^\infty (k+1)T\int_{kT}^{(k+1)T} dN_{C,t}.$$

Taking expectations, using that $N_{C,t}$ has intensity $\lambda_{C,t}$, and noting that under (5.50) or (5.51) we can write $r_{ij}(V_{ij,t}) = r_{ij}(t)$, we obtain:

$$E\tau_T = E\sum_{k=0}^\infty (k+1)T\int_{kT}^{(k+1)T} dN_{C,t} = T\sum_{k=0}^\infty (k+1)\int_{kT}^{(k+1)T} E\lambda_{C,t}\,dt$$

$$= T\sum_{k=0}^\infty (k+1)\int_{kT}^{(k+1)T}\sum_{i=1}^{n}\sum_{j=1}^{M_i} r_{ij}(t)\,E\phi_{ijC}(X_t)\,dt$$

$$= T\sum_{k=0}^\infty (k+1)\sum_{i=1}^{n}\sum_{j=1}^{M_i}\sum_{\mathbf{x}}\phi_{ijC}(\mathbf{x})\int_{kT}^{(k+1)T} r_{ij}(t)\,G(t,\mathbf{x})\,dt$$

$$= T\sum_{k=0}^\infty (k+1)\sum_{i=1}^{n}\sum_{j=1}^{M_i}\sum_{\mathbf{x}}\phi_{ijC}(\mathbf{x})\big[H_{ij}((k+1)T,\mathbf{x}) - H_{ij}(kT,\mathbf{x})\big],$$

which proves (5.52). To establish (5.53), we rewrite (5.49) to obtain

$$EC_T = E\sum_{k=0}^\infty I(kT < \tau_C \le (k+1)T)\big[c_I(k+1)+c_p\big] + E\sum_{k=0}^\infty I(kT < \tau_C \le (k+1)T)\,c\,I(\tau \le (k+1)T)\{(k+1)T-\tau\}.$$

Similarly to the above analysis for $E\tau_T$, it is seen that the first term of this expression equals

$$\sum_{k=0}^\infty\big[c_I(k+1)+c_p\big]\sum_{i=1}^{n}\sum_{j=1}^{M_i}\sum_{\mathbf{x}}\phi_{ijC}(\mathbf{x})\big[H_{ij}((k+1)T,\mathbf{x})-H_{ij}(kT,\mathbf{x})\big].$$

Hence it remains to establish the desired expression for the downtime costs, the second term. This term can be expressed as

$$E\sum_{k=0}^\infty\phi_C(X_{kT})\int_{kT}^{(k+1)T} c\big((k+1)T-t\big)\,dN_t,$$

since $\phi_C(X_t)$ equals 1 as long as $t < \tau_C$. Then, using that $N_t$ has intensity $\lambda_t$, we obtain that this expected cost term equals

$$E\sum_{k=0}^\infty\phi_C(X_{kT})\int_{kT}^{(k+1)T} c\big((k+1)T-t\big)\lambda_t\,dt = E\sum_{k=0}^\infty\phi_C(X_{kT})\int_{kT}^{(k+1)T} c\big((k+1)T-t\big)\sum_{i=1}^{n}\sum_{j=1}^{M_i}\phi_{ij}(X_t)\,r_{ij}(t)\,dt$$

$$= \sum_{k=0}^\infty\sum_{\mathbf{x}'}\phi_C(\mathbf{x}')\,G(kT,\mathbf{x}')\sum_{i=1}^{n}\sum_{j=1}^{M_i}\sum_{\mathbf{x}}\phi_{ij}(\mathbf{x})\int_{kT}^{(k+1)T} c\big((k+1)T-t\big)\,r_{ij}(t)\,G_{kT}(t,\mathbf{x}\,|\,\mathbf{x}')\,dt.$$

Equation (5.53) follows, and the theorem is proved. $\square$


We seek an optimal $T_{\mathrm{opt}}$ minimizing $B_T$ given by (5.46) and the expressions for $EC_T$ and $E\tau_T$ in Theorems 5.37 and 5.38. Such a minimum always exists if we include the "perform no testing and overhaul" policy $T = \infty$, as $B_T$ is a continuous function and $\lim_{T\to 0} B_T = \infty$. We have $B_\infty = \lim_{T\to\infty} B_T = c$: the expected long-run average cost per unit of time when there is no testing and overhaul equals $c$. If we perform very frequent testing, the long-run expected average cost will be very high due to the large number of inspections.

To find $T_{\mathrm{opt}}$ it is convenient to search for values of $T$ minimizing the functions

$$B_T(\delta) = EC_T - \delta\,E\tau_T.$$

If $T_\delta$ minimizes $B_T(\delta)$ and $B_{T_\delta}(\delta) = 0$, then $T_\delta$ minimizes $B_T$, i.e., $T_\delta$ is optimal, and $\delta = B_{T_\delta} = \inf_{0 < T \le \infty} B_T$. This result is well known from the literature; see Aven and Bergman [19]. We also refer to (5.9).

Special Case: Parallel System of Two Components

Assume that $\phi(\mathbf{x}) = 1 - (1-x_1)(1-x_2)$, i.e., the system is a binary parallel system composed of two components. The time at which the system first becomes critical, $\tau_C$, can then be expressed as

$$\tau_C = \min\{U_{11}, U_{21}\},$$

noting that if one component fails, the system is functioning if and only if the other component is functioning. Furthermore, the time to system failure, $\tau$, equals the maximum component lifetime, i.e.,

$$\tau = \max\{U_{11}, U_{21}\}.$$

It follows that

$$E\tau_T = T\sum_{k=0}^\infty (k+1)\big[F_{\tau_C}((k+1)T) - F_{\tau_C}(kT)\big] = T\sum_{k=0}^\infty (k+1)\big[\bar F_{11}(kT)\bar F_{21}(kT) - \bar F_{11}((k+1)T)\bar F_{21}((k+1)T)\big],$$

where $\bar F = 1 - F$. By similar arguments, first considering the costs $c_I$ and $c_p$, and then for the downtime cost $c$ conditioning on $U_{11} = u_1$ and $U_{21} = u_2$, we obtain

$$EC_T = E\sum_{k=0}^\infty I(kT < \tau_C \le (k+1)T)\big[c_I(k+1) + c_p + c\,I(\tau \le (k+1)T)\{(k+1)T - \tau\}\big]$$

$$= \sum_{k=0}^\infty\big[\bar F_{11}(kT)\bar F_{21}(kT) - \bar F_{11}((k+1)T)\bar F_{21}((k+1)T)\big]\big[c_I(k+1)+c_p\big]$$

$$\quad + \sum_{k=0}^\infty\int_0^\infty\!\!\int_0^\infty I(kT < \min\{u_1,u_2\} \le (k+1)T)\,c\,I(\max\{u_1,u_2\} \le (k+1)T)\,\{(k+1)T - \max\{u_1,u_2\}\}\,dF_{11}(u_1)\,dF_{21}(u_2).$$

The last term, due to system downtime, can be simplified to

$$\sum_{k=0}^\infty\int_{kT}^{(k+1)T} c\{(k+1)T - u_1\}\big[F_{21}(u_1) - F_{21}(kT)\big]\,dF_{11}(u_1) + \sum_{k=0}^\infty\int_{kT}^{(k+1)T} c\{(k+1)T - u_2\}\big[F_{11}(u_2) - F_{11}(kT)\big]\,dF_{21}(u_2).$$

These results are collected in Proposition 5.39.

Proposition 5.39. For a parallel system of two binary components, the expected renewal cycle length and the expected associated costs are given by
$$E\tau_T = T\sum_{k=0}^{\infty}(k+1)\left[\bar{F}_{11}(kT)\bar{F}_{21}(kT) - \bar{F}_{11}((k+1)T)\bar{F}_{21}((k+1)T)\right],$$
$$\begin{aligned}
EC_T &= \sum_{k=0}^{\infty}\left[\bar{F}_{11}(kT)\bar{F}_{21}(kT) - \bar{F}_{11}((k+1)T)\bar{F}_{21}((k+1)T)\right]\left[c_I(k+1) + c_p\right]\\
&\quad + \sum_{k=0}^{\infty}\int_{kT}^{(k+1)T} c\{(k+1)T - u_1\}\left[F_{21}(u_1) - F_{21}(kT)\right]dF_{11}(u_1)\\
&\quad + \sum_{k=0}^{\infty}\int_{kT}^{(k+1)T} c\{(k+1)T - u_2\}\left[F_{11}(u_2) - F_{11}(kT)\right]dF_{21}(u_2).
\end{aligned}$$

An optimal T can then be determined. Similar expressions can easily be derived based on Theorem 5.38. Note that φi1C(x) is equal to 1 only if x1 = 1 and x2 = 1.
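As a hedged illustration of how Proposition 5.39 reduces the problem to a one-dimensional numerical search, the sketch below evaluates the sums and integrals for exponential component lifetimes. All rate and cost values (l1, l2, c, cI, cp) are made-up assumptions, not taken from the book; the infinite sums are truncated once the survival probability is negligible, and the downtime integrals use the midpoint rule.

```python
import math

# Numerical evaluation of Proposition 5.39 (illustrative sketch).
l1, l2 = 1.0, 0.5            # hypothetical component failure rates
c, cI, cp = 100.0, 1.0, 5.0  # downtime, inspection, overhaul costs

def Fbar(lam, t):
    """Survival function of an Exp(lam) lifetime."""
    return math.exp(-lam * t)

def F(lam, t):
    """Distribution function of an Exp(lam) lifetime."""
    return 1.0 - math.exp(-lam * t)

def E_tau(T):
    """E tau_T = T * sum_k (k+1)[Fbar11 Fbar21(kT) - Fbar11 Fbar21((k+1)T)]."""
    s, k = 0.0, 0
    while True:
        a = Fbar(l1, k * T) * Fbar(l2, k * T)
        b = Fbar(l1, (k + 1) * T) * Fbar(l2, (k + 1) * T)
        s += (k + 1) * (a - b)
        k += 1
        if b < 1e-12:          # truncate the negligible tail of the sum
            return T * s

def E_cost(T, m=40):
    """EC_T: inspection/overhaul terms plus midpoint-rule downtime integrals."""
    total, k = 0.0, 0
    while True:
        a = Fbar(l1, k * T) * Fbar(l2, k * T)
        b = Fbar(l1, (k + 1) * T) * Fbar(l2, (k + 1) * T)
        total += (a - b) * (cI * (k + 1) + cp)
        h = T / m
        for la, lb in ((l1, l2), (l2, l1)):   # the two downtime integrals
            for j in range(m):
                u = k * T + (j + 0.5) * h
                total += (c * ((k + 1) * T - u) * (F(lb, u) - F(lb, k * T))
                          * la * math.exp(-la * u) * h)
        k += 1
        if b < 1e-12:
            return total

def B(T):
    """Long-run expected average cost per unit of time."""
    return E_cost(T) / E_tau(T)

# Crude grid search for the optimal inspection interval.
grid = [0.05 * i for i in range(1, 81)]
Topt = min(grid, key=B)
print(Topt, B(Topt))
```

A finer grid or an adaptive one-dimensional minimizer can replace the crude grid search; the point is only that the proposition turns the optimization into routine numerical work.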

Special Case: Delay Time Model with Three Components

We consider a system comprising n = 3 components, with Mi = 2, i.e., each component has three states. State 2 is the perfect functioning state, whereas state 1 is a "partly defective" state resulting from a "fault." There is a time lapse between the occurrence of the fault and the failure of the component, a so-called "delay time." To simplify the mathematical analysis, we assume that all sojourn times Uij are exponentially distributed, with constant rates denoted rij. Components 1 and 2 are assumed to have the same rates. The rates for different arrival states j are assumed to differ, i.e., ri2 ≠ ri1 for i = 1, 2, 3.

The state of the system is given by the structure function
$$\phi(x) = I(x_1 + x_2 \ge 1)\,I(x_3 \ge 1).$$

Hence the system is functioning if either component 1 or 2 is in state 1 or better, and component 3 is in state 1 or better. We may think of the system as a parallel system comprising components 1 and 2, in series with component 3, with each component having a delay time before failure occurs.

The maximal critical path vectors for level 1 are (0,1,2), (1,0,2), and (2,2,1), and these define φC(x) and φijC(x). We see that φC(x) = 1 for x = (1,1,2) and x > (1,1,2), as well as for x = (0,2,2) and x = (2,0,2). Furthermore, φ32C(x1, x2, 2) = 1 for xi ≥ 1, i = 1, 2.

For two distribution functions F1 and F2, let F1 ∗ F2(t) = ∫0t F1(t − s) dF2(s). Then the distribution G(t, x) can be expressed as:
$$\begin{aligned}
G(t,(2,2,2)) &= \bar{F}_{12}(t)\bar{F}_{22}(t)\bar{F}_{32}(t) = e^{-t\sum_{i=1}^{3} r_{i2}},\\
G(t,(1,2,2)) &= \left[\bar{F}_{11} * F_{12}(t)\right]\bar{F}_{22}(t)\bar{F}_{32}(t)
= \frac{r_{12}}{r_{12}-r_{11}}\left(e^{-r_{11}t} - e^{-r_{12}t}\right)e^{-t\sum_{i=2}^{3} r_{i2}},\\
G(t,(1,1,2)) &= \left[\bar{F}_{11} * F_{12}(t)\right]\left[\bar{F}_{21} * F_{22}(t)\right]\bar{F}_{32}(t)
= \frac{r_{12}}{r_{12}-r_{11}}\left(e^{-r_{11}t} - e^{-r_{12}t}\right)\frac{r_{22}}{r_{22}-r_{21}}\left(e^{-r_{21}t} - e^{-r_{22}t}\right)e^{-r_{32}t},\\
G(t,(0,2,2)) &= \left[F_{12} * F_{11}(t)\right]\bar{F}_{22}(t)\bar{F}_{32}(t)
= \left\{1 - e^{-r_{12}t} - \frac{r_{12}}{r_{12}-r_{11}}\left[e^{-r_{11}t} - e^{-r_{12}t}\right]\right\}e^{-t\sum_{i=2}^{3} r_{i2}}.
\end{aligned}$$
From these expressions compact formulae can be derived for Hij(t, x) = rij ∫0t G(s, x) ds.
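The closed-form expressions for G(t, x) are straightforward to evaluate numerically. The sketch below implements the four state probabilities for illustrative rates (the same values used in the numerical example that follows: r12 = r22 = 0.5, r11 = r21 = 1.0, r32 = 1/3) and checks the convolution term F̄i1 ∗ Fi2 against direct numerical integration.

```python
import math

# Closed-form state probabilities G(t, x) for the delay time model
# (sketch); rates follow the book's numerical example.
r12, r11, r22, r21, r32 = 0.5, 1.0, 0.5, 1.0, 1.0 / 3.0

def p2(r_2, t):
    """P(component still in state 2 at t): survival of the Exp(r_2) sojourn."""
    return math.exp(-r_2 * t)

def p1(r_2, r_1, t):
    """P(component in state 1 at t): the convolution Fbar_1 * F_2(t) in closed form."""
    return r_2 / (r_2 - r_1) * (math.exp(-r_1 * t) - math.exp(-r_2 * t))

def G(t):
    """Distribution G(t, x) for the four states listed in the text."""
    return {
        (2, 2, 2): p2(r12, t) * p2(r22, t) * p2(r32, t),
        (1, 2, 2): p1(r12, r11, t) * p2(r22, t) * p2(r32, t),
        (1, 1, 2): p1(r12, r11, t) * p1(r22, r21, t) * p2(r32, t),
        (0, 2, 2): (1 - p2(r12, t) - p1(r12, r11, t)) * p2(r22, t) * p2(r32, t),
    }

def p1_numeric(r_2, r_1, t, m=2000):
    """Midpoint-rule evaluation of int_0^t exp(-r_1(t-s)) r_2 exp(-r_2 s) ds."""
    h = t / m
    return sum(math.exp(-r_1 * (t - (j + 0.5) * h)) * r_2
               * math.exp(-r_2 * (j + 0.5) * h) * h for j in range(m))

print(G(1.0))
```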

Similar equations can be established for Gs(t, x | x′), the conditional distribution of X(t) given X(s) = x′. We need to compute the conditional probabilities P(Xt(i) = j2 | Xs(i) = j1) for j2 ≤ j1, j1 = 1, 2, i = 1, 2. We see that P(Xt(i) = 2 | Xs(i) = 2) = F̄i2(t − s), P(Xt(i) = 1 | Xs(i) = 2) = F̄i1 ∗ Fi2(t − s), and P(Xt(i) = 1 | Xs(i) = 1) = F̄i1(t − s). Furthermore, P(Xt(i) = 0 | Xs(i) = 1) = Fi1(t − s) and P(Xt(i) = 0 | Xs(i) = 2) = Fi2 ∗ Fi1(t − s). From these formulae we see, for example, that
$$\begin{aligned}
G_s(t,(2,2,2)\,|\,(2,2,2)) &= \bar{F}_{12}(t-s)\bar{F}_{22}(t-s)\bar{F}_{32}(t-s) = e^{-(t-s)\sum_{i=1}^{3} r_{i2}},\\
G_s(t,(1,2,2)\,|\,(2,2,2)) &= \left[\bar{F}_{11} * F_{12}(t-s)\right]\bar{F}_{22}(t-s)\bar{F}_{32}(t-s)
= \frac{r_{12}}{r_{12}-r_{11}}\left(e^{-r_{11}(t-s)} - e^{-r_{12}(t-s)}\right)e^{-(t-s)\sum_{i=2}^{3} r_{i2}}.
\end{aligned}$$


In this way all terms in EτT and ECT can be derived and an optimal T determined.

Numerical Example

We assume the following failure rates: r12 = r22 = 0.5, r11 = r21 = 1.0, and r32 = 1/3, r31 = 1/2. Hence the expected times to failure for the three components are 2 + 1 = 3, 2 + 1 = 3, and 3 + 2 = 5, respectively. The following costs are assumed: c = 100, cI = 1, and cp = 5; i.e., the cost of an overhaul is five times the inspection cost, and the unit downtime cost is 100 times the inspection cost. We can then compute the BT function and determine an optimal inspection time. Figure 5.9 shows BT as a function of T, computed using Maple 10. By inspection, an optimal value is obtained for T = 0.43. A number of sensitivity analyses should be performed to see the effect of changes in the input data. Figure 5.10 shows an example where the unit downtime cost is increased by a factor of 10, from 100 to 1,000, to reflect the serious safety risk caused by downtime. The optimal inspection interval is then reduced to 0.18.


Fig. 5.9. The BT function for the base case example with c = 100


Final Remarks

The optimization of BT needs to be carried out by numerical methods. For the numerical example considered in the previous section, the optimization criterion has the standard form seen for many maintenance models (nonincreasing up to a minimum value and then nondecreasing). In general, however, this is not the case for the model studied in this chapter. Examples can be constructed in which the optimization function has several local minima, in line with the examples for a one-component system in Sect. 5.5.1.

The model can be extended in many ways, for example, by allowing a more general cost structure. As an example, we may distinguish between the cost of an overhaul when the system is in a critical state and when it has failed. The calculation of ECT in (5.49) then needs to be modified by considering a cost term cp + c′p I(τ < (k + 1)T), where c′p is the additional overhaul cost if the system has failed, compared to being in a critical state. The further analysis is analogous to the one carried out for ECT. The next step would be to allow the overhaul cost to depend on the state vector. The analysis would then become more complicated, but would remain within the framework and approach presented.


Fig. 5.10. The BT function for c = 1,000


Bibliographic Notes. A fundamental reference for basic replacement models is Barlow and Proschan [31]. There is an extensive literature on preventive replacement models, which is surveyed in the overviews of Pierskalla and Voelker [130], Sherif and Smith [144], Valdez-Flores and Feldman [158], and Jensen [94]. Block and Savits [47] and Boland and Proschan [49] give overviews of comparison methods and stochastic order in reliability theory. Shaked and Szekli [142] and Last and Szekli [116] compare replacement policies via point process methods. A good source for overviews of the vastly increasing literature on replacement and maintenance optimization models is the book Reliability and Maintenance of Complex Systems, edited by Ozekici in the NATO ASI Series.

The presentation in Sect. 5.2 follows the lines of [96]. A general setup for cost-minimizing problems is introduced in Jensen [96], similar to Bergman [38] and Aven and Bergman [19]. It allows for specialization in different directions. As an example, the model presented by Aven [18], covering the total expected discounted cost criterion, is included. What goes beyond the results in [19] is the possibility of taking different information levels into account.

There are many multivariate extensions of the univariate exponential distribution; for an overview see Hutchinson and Lai [91] or Basu [33], which also cover the models of Freund [68] and Marshall and Olkin [121]. A detailed derivation, statistical properties, and methods of parameter estimation of the combined exponential distribution can be found in [83]. The optimization problem, also for more general cost structures, is treated in Heinrich and Jensen [85]. An alternative approach to solving optimization problems of the kind treated in this chapter is to use Markov decision processes. It has not been within the scope of this book to develop this theory here. An introduction to this theory can be found in the books of Puterman [131], Bertsekas [41], Davis [59], and Van der Duyn Schouten [159], the last of which also contains applications in reliability.

An overview of several problems related to burn-in and the corresponding literature is given in the review articles by Block and Savits [46], Kuo and Kuo [113], and Leemis and Beneke [118]. The problem of sequential burn-in, where the failures of the items are observed and the burn-in time depends on these failures, is treated in the article of Marcus and Blumenthal [120]. In the papers of Costantini and Spizzichino [56] and Spizzichino [149], the assumption that the component lifelengths are independent is dropped and replaced by certain dependence models.

The problem of finding optimal replacement times for general repair processes has been treated by Aven in [12, 15]. The presentation of Markov-modulated minimal repair processes follows the lines of [93, 95], which include the technical details. A similar model considering interest rates has been investigated by Schöttl [137].

Section 5.5 is based on Aven and Castro [20] and Aven [10]. For reviews of the literature on delay time models, see Baker and Christer [29], Christer and Redmond [54], and Christer [53].

A Background in Probability and Stochastic Processes

This appendix serves as background for Chaps. 3–5. The focus is on stochastic processes on the positive real time axis R+ = [0, ∞). Our aim is to give the basis of the measure-theoretic framework that is necessary to make the text intelligible and accessible to those who are not familiar with the general theory of stochastic processes. For detailed presentations of this framework we recommend texts like Dellacherie and Meyer [61, 62], Rogers and Williams [133], and Kallenberg [101]. Point process theory is treated in Karr [103], Daley and Vere-Jones [58], and Brémaud [50]. A "nontechnical" introduction to parts of the general theory, accompanied by comprehensive historical and bibliographic remarks, can be found in Chap. II of the monograph of Andersen et al. [2]. A good introduction to basic results of probability theory is Williams [164].

A.1 Basic Definitions

We use the following notation:

N = {1, 2, . . .}
N0 = {0, 1, 2, . . .}
Z = {0, +1, −1, +2, −2, . . .}, the set of integers
Q = {p/q : p ∈ Z, q ∈ N}, the set of rationals
R = (−∞, +∞), the set of real numbers
R+ = [0, ∞), the set of nonnegative real numbers

f ∨ g and f ∧ g denote max{f, g} and min{f, g}, respectively, where f and g can be real-valued functions or real numbers. We denote f+ = f ∨ 0 and f− = −(f ∧ 0). By convention, inf ∅ = ∞ and sup ∅ = 0, and ratios of the form 0/0 are set equal to 0.

A function f from a set A to a set B is denoted by f : A → B, and f(a) is the value of f at a ∈ A. To simplify the notation, we also speak of f(a) as a function.

T. Aven and U. Jensen, Stochastic Models in Reliability, Stochastic Modellingand Applied Probability 41, DOI 10.1007/978-1-4614-7894-2,© Springer Science+Business Media New York 2013



For a function f : R → R we denote the left and right limits at a (in the case of existence) by
$$f(a-) = \lim_{t\to a-} f(t) = \lim_{h\to 0,\,h>0} f(a-h), \qquad
f(a+) = \lim_{t\to a+} f(t) = \lim_{h\to 0,\,h>0} f(a+h).$$

For two functions f, g : R → R we write f(h) = o(g(h)), h → h0, for some h0 ∈ R ∪ {∞}, if
$$\lim_{h\to h_0} \frac{f(h)}{g(h)} = 0;$$
we write f(h) = O(g(h)), h → h0, for some h0 ∈ R ∪ {∞}, if
$$\limsup_{h\to h_0} \frac{|f(h)|}{|g(h)|} < \infty.$$

An integral ∫f(s) ds of a real-valued measurable function is always an integral with respect to Lebesgue measure. Integrals over finite intervals ∫ab, a ≤ b, are always integrals ∫[a,b] over the closed interval [a, b].

The indicator function of a set A, taking only the values 1 and 0, is denoted I(A). This notation is preferred to IA or IA(a) in the case of descriptions of sets A by means of random variables.

In the following we always refer to a basic probability space (Ω, F, P), where

• Ω is a fixed nonempty set.
• F is a σ-algebra (or σ-field) on Ω, i.e., a collection of subsets of Ω including Ω that is closed under countable unions and finite differences.
• P is a probability measure on (Ω, F), i.e., a σ-additive, [0, 1]-valued function on F with P(Ω) = 1.

If A is a collection of subsets of Ω, then σ(A) denotes the smallest σ-algebra containing A, the σ-algebra generated by A.

If S is some set and S a σ-algebra of subsets of S, then the pair (S, S) is called a measurable space. Let S be a metric space (usually R or Rn) and O the collection of its open sets. Then the σ-algebra generated by O is called the Borel σ-algebra and denoted B(S); in particular, we write B = B(R).

If A and C are two sub-σ-algebras of F, then A ∨ C denotes the σ-algebra generated by the union of A and C. The product σ-algebra of A and C, generated by the sets A × C, where A ∈ A and C ∈ C, is denoted A ⊗ C.

A.2 Random Variables, Conditional Expectations

A.2.1 Random Variables and Expectations

On the fixed probability space (Ω, F, P) we consider a mapping X into the measurable space (R, B). If X is measurable (or, more exactly, F-B-measurable), i.e., X−1(B) = {X−1(B) : B ∈ B} ⊂ F, then it is called a random variable. The σ-algebra σ(X) = X−1(B) is the smallest one with respect to which X is measurable. It is called the σ-algebra generated by X.

Definition A.1 (Independence).

(i) Two events A, B ∈ F are called independent if P(A ∩ B) = P(A)P(B).
(ii) Suppose A1 and A2 are subfamilies of F: A1, A2 ⊂ F. Then A1 and A2 are called independent if P(A1 ∩ A2) = P(A1)P(A2) for all A1 ∈ A1, A2 ∈ A2.
(iii) Two random variables X and Y on (Ω, F) are called independent if σ(X) and σ(Y) are independent.

The expectation EX (or E[X]) of a random variable is defined in the usual way as the integral ∫X dP with respect to the probability measure P. If the expectation E|X| is finite, we call X integrable. The law or distribution of X on (R, B) is given by FX(B) = P(X ∈ B), B ∈ B, and FX(t) = FX((−∞, t]) is the distribution function. Often the index X in FX is omitted when it is clear which random variable is considered. Let g : R → R be a measurable function and suppose that g(X) is integrable. Then
$$Eg(X) = \int_{\Omega} g(X)\,dP = \int_{\mathbb{R}} g(t)\,dF_X(t).$$
If X has a density fX : R → R+, i.e., P(X ∈ B) = ∫B fX(t) dt, B ∈ B, then the expectation can be calculated as
$$Eg(X) = \int_{\mathbb{R}} g(t)f_X(t)\,dt.$$
The variance of a random variable X with E[X²] < ∞ is denoted Var[X] and defined by Var[X] = E[(X − EX)²].

We now present some classical inequalities:

• Markov's inequality: Suppose that X is a random variable and g : R+ → R+ a measurable nondecreasing function such that g(|X|) is integrable. Then for any real c > 0,
$$Eg(|X|) \ge g(c)\,P(|X| \ge c).$$
• Jensen's inequality: Suppose that g : R → R is a convex function and that X is a random variable such that X and g(X) are integrable. Then
$$g(EX) \le Eg(X).$$
• Hölder's inequality: Let p, q ∈ R be such that p > 1 and 1/p + 1/q = 1. Suppose X and Y are random variables such that |X|p and |Y|q are integrable. Then XY is integrable and
$$E|XY| \le E[|X|^p]^{1/p}\,E[|Y|^q]^{1/q}.$$
Taking p = q = 2, this inequality reduces to Schwarz's inequality.


• Minkowski's inequality: Suppose that X and Y are random variables such that |X|p and |Y|p are integrable for some p ≥ 1. Then we have the triangle law
$$E[|X+Y|^p]^{1/p} \le E[|X|^p]^{1/p} + E[|Y|^p]^{1/p}.$$
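A quick simulation can illustrate these inequalities. The sketch below checks Markov's, Jensen's, and Schwarz's inequalities on samples from exponential distributions (an arbitrary illustrative choice); note that each inequality in fact holds deterministically for the empirical distribution of any sample, so the checks cannot fail.

```python
import random, math

# Sample-based illustration of Markov's, Jensen's, and Schwarz's
# inequalities; the exponential distributions are arbitrary choices.
random.seed(0)
n = 100_000
xs = [random.expovariate(1.0) for _ in range(n)]
ys = [random.expovariate(2.0) for _ in range(n)]

mean = sum(xs) / n
second = sum(x * x for x in xs) / n

# Markov with g(t) = t and c = 2:  E|X| >= 2 P(|X| >= 2)
tail = sum(x >= 2 for x in xs) / n
markov_ok = mean >= 2 * tail

# Jensen with the convex g(t) = t^2:  (EX)^2 <= E[X^2]
jensen_ok = mean ** 2 <= second

# Schwarz (Hoelder with p = q = 2):  E|XY| <= (E X^2)^(1/2) (E Y^2)^(1/2)
lhs = sum(abs(x * y) for x, y in zip(xs, ys)) / n
rhs = math.sqrt(second) * math.sqrt(sum(y * y for y in ys) / n)
schwarz_ok = lhs <= rhs

print(markov_ok, jensen_ok, schwarz_ok)   # → True True True
```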

At the end of this section we list some types of convergence of real-valued random variables. Let X, Xn, n ∈ N, be random variables carried by the triple (Ω, F, P), taking values in (R, B), with distribution functions F, Fn. Then the following forms of convergence Xn → X are fundamental in probability theory.

• Almost sure convergence: We say Xn → X almost surely (P-a.s.) if
$$P(\lim_{n\to\infty} X_n = X) = 1.$$
• Convergence in probability: We say Xn → X in probability if, for every ε > 0,
$$\lim_{n\to\infty} P(|X_n - X| > \varepsilon) = 0.$$
• Convergence in distribution: We say Xn → X in distribution if, for every continuity point x of F,
$$\lim_{n\to\infty} F_n(x) = F(x).$$
• Convergence in the pth mean, or convergence in Lp: We say Xn → X in the pth mean, p ≥ 1, or in Lp, if |X|p, |Xn|p are integrable and
$$\lim_{n\to\infty} E|X_n - X|^p = 0.$$

The relationships between these forms of convergence are the following:
$$X_n \to X \; P\text{-a.s.} \;\Rightarrow\; X_n \stackrel{P}{\to} X, \qquad
X_n \to X \text{ in } L^p \;\Rightarrow\; X_n \stackrel{P}{\to} X, \qquad
X_n \stackrel{P}{\to} X \;\Rightarrow\; X_n \stackrel{D}{\to} X.$$

A.2.2 Lp-Spaces and Conditioning

We introduce the vector spaces Lp = Lp(Ω, F, P), p ≥ 1, of (equivalence classes of) random variables X such that |X|p is integrable, without distinguishing between random variables X, Y with P(X = Y) = 1. With the norm ‖X‖p = (E|X|p)1/p the space Lp is complete, in that for any Cauchy sequence (Yn), n ∈ N, there exists a Y ∈ Lp such that ‖Yn − Y‖p → 0 as n → ∞. A sequence (Yn) is called a Cauchy sequence if
$$\sup_{r,s\ge k} \|Y_r - Y_s\|_p \to 0 \quad \text{for } k \to \infty.$$


Lp is a complete metric vector space, or Banach space. For 1 ≤ p ≤ q and X ∈ Lq it follows by Jensen's inequality that
$$\|X\|_p \le \|X\|_q.$$
So Lq is a subspace of Lp if q ≥ p. For p = 2 we define the scalar product ⟨X, Y⟩ = E[XY], which makes L2 a Hilbert space, i.e., a Banach space with a norm induced by a scalar product.

We have introduced Lp-spaces to be able to look at conditional expectations from a geometrical point of view. Before we give a formal definition of conditional expectations, we consider the orthogonal projection in Hilbert spaces.

Theorem A.2. Let K be a complete vector subspace of L2 and X ∈ L2. Then there exists Y in K such that

(i) ‖X − Y‖2 = inf{‖X − Z‖2 : Z ∈ K};
(ii) X − Y ⊥ Z, i.e., E[(X − Y)Z] = 0, for all Z ∈ K.

Properties (i) and (ii) are equivalent, and if Y∗ shares either property (i) or (ii) with Y, then P(Y = Y∗) = 1.

The short proof of this result can be found in Williams [164]. The theorem states that there is one unique element in the subspace K that has the shortest distance to a given element of L2, and the projection direction is orthogonal to K. A similar projection can be carried out from L1(Ω, F, P) onto L1(Ω, A, P), where A ⊂ F is some sub-σ-algebra of F. Of course, any A-measurable random variable of L1(Ω, A, P) is also in L1(Ω, F, P). Thus, for a given X in L1(Ω, F, P), we are looking for the "best" approximation in L1(Ω, A, P). A solution to this problem is given by the following fundamental theorem and definition.

Theorem A.3. Let X be a random variable in L1(Ω, F, P) and let A be a sub-σ-algebra of F. Then there exists a random variable Y in L1(Ω, A, P) such that
$$\int_A Y\,dP = \int_A X\,dP \quad \text{for all } A \in \mathcal{A}. \qquad (A.1)$$
If Y∗ is another random variable in L1(Ω, A, P) with property (A.1), then P(Y = Y∗) = 1.

A random variable Y ∈ L1(Ω, A, P) with property (A.1) is called (a version of) the conditional expectation E[X|A] of X given A. We write Y = E[X|A], noting that equality holds P-a.s.

The standard proof of this theorem uses the Radon–Nikodym theorem (cf. for example Billingsley [42]). A more constructive proof is via the Orthogonal Projection Theorem A.2. In the case that EX² < ∞, i.e., X ∈ L2(Ω, F, P), we can use Theorem A.2 directly with K = L2(Ω, A, P). Let Y be the projection of X in K. Then property (ii) of Theorem A.2 yields E[(X − Y)Z] = 0 for all Z ∈ K. Take Z = IA, A ∈ A. Then E[(X − Y)IA] = 0 is just condition (A.1), which shows that Y is a version of the conditional expectation E[X|A]. If X is not in L2, we split X as X⁺ − X⁻ and approximate both parts by sequences Xn⁺ = X⁺ ∧ n and Xn⁻ = X⁻ ∧ n, n ∈ N, of L2-random variables. A limiting argument for n → ∞ yields the desired result (see [164] for a complete proof).

Conditioning with respect to a σ-algebra is in general not very concrete, so the idea of projecting onto a subspace may give some additional insight. Another point of view is to look at conditioning as an averaging operator. The sub-σ-algebra A lies between the extremes F and G = {∅, Ω}, the trivial σ-field. As can easily be verified from the definition, the corresponding conditional expectations of X are X = E[X|F] and EX = E[X|G]. So for A with G ⊂ A ⊂ F, the conditional expectation E[X|A] lies "between" X (no averaging, complete information about the value of X) and EX (overall average, no information about the value of X). The more events of F are included in A, the more E[X|A] varies and the closer this conditional expectation is to X, in a sense made precise in the following proposition.

Proposition A.4. Suppose X ∈ L2(Ω, F, P) and let A1 and A2 be sub-σ-algebras of F such that A1 ⊂ A2 ⊂ F. Then, denoting Yi = E[X|Ai], i = 1, 2, we have the following inequalities:

(i) ‖X − Y2‖2 ≤ ‖X − Y1‖2 ≤ ‖X − Y2‖2 + ‖Y2 − Y1‖2;
(ii) ‖Y1 − EX‖2 ≤ ‖Y2 − EX‖2 ≤ ‖Y1 − EX‖2 + ‖Y2 − Y1‖2.

Proof. The right-hand inequalities are just special cases of the triangle law for the L2-norm, or Minkowski's inequality. So we need to prove the left-hand inequalities.

(i) Since Y2 is the projection of X on L2(Ω, A2, P) and Y1 ∈ L2(Ω, A1, P) ⊂ L2(Ω, A2, P), we can use Theorem A.2 to yield
$$\|X - Y_2\|_2 = \inf\{\|X - Z\|_2 : Z \in L^2(\Omega, \mathcal{A}_2, P)\} \le \|X - Y_1\|_2.$$
(ii) Denoting Ȳi = Yi − EX, we see that Ȳ1 is the projection of Ȳ2 on L2(Ω, A1, P). Again from Theorem A.2 it follows that Ȳ2 − Ȳ1 and Ȳ1 are orthogonal. The Pythagoras theorem then takes the form
$$\|\bar{Y}_2\|_2^2 = \|\bar{Y}_2 - \bar{Y}_1 + \bar{Y}_1\|_2^2 = \|\bar{Y}_2 - \bar{Y}_1\|_2^2 + \|\bar{Y}_1\|_2^2,$$
which gives ‖Ȳ1‖2 ≤ ‖Ȳ2‖2. □

Remark A.5. 1. Using some of the properties of conditional expectations stated below, all the inequalities except the first in (i) of the proposition can be shown to hold also in the Lp-norm, p ≥ 1, provided that X ∈ Lp.


2. If we view E[X|A] as a predictor of the unknown X, then Proposition A.4 says that the closer A is to F, the better this estimate is in the mean square sense, and the bigger is the variance Var[E[X|A]] of this random variable.

In particular, if A is generated by a finite or countable partition of Ω, then the conditional expectation can be given explicitly.

Theorem A.6. Let X be an integrable random variable, i.e., X ∈ L1, and let A be a sub-σ-algebra of F generated by a finite or countable partition A1, A2, . . . of Ω. Then, for ω ∈ Ai with P(Ai) > 0,
$$E[X\,|\,\mathcal{A}](\omega) = \frac{1}{P(A_i)}\int_{A_i} X\,dP = \frac{E[I_{A_i}X]}{P(A_i)}.$$
If P(Ai) = 0, the value of E[X|A] over Ai is set to 0.
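Theorem A.6 can be illustrated with a short simulation. In the sketch below (all distributional choices are illustrative assumptions), Z is uniform on [−1, 1], X = Z, and A is generated by the two-set partition A1 = {Z < 0}, A2 = {Z ≥ 0}; on each cell the conditional expectation is the cell average E[I_{Ai}X]/P(Ai), estimated here from a sample.

```python
import random

# Illustration of Theorem A.6 with a two-set partition (hypothetical setup):
# Z uniform on [-1, 1], X = Z, A generated by A1 = {Z < 0}, A2 = {Z >= 0}.
random.seed(1)
zs = [random.uniform(-1, 1) for _ in range(200_000)]

def cond_exp(omega_z):
    """Value of E[X | A] on the partition cell containing the outcome."""
    cell = [z for z in zs if (z < 0) == (omega_z < 0)]  # sample points in A_i
    return sum(cell) / len(cell)                        # cell average

y_neg, y_pos = cond_exp(-0.5), cond_exp(0.5)
print(y_neg, y_pos)   # close to -0.5 and +0.5, the exact cell means
```

The check E[E[X|A]] = EX (property 1 of the next subsection) also holds here, since averaging the two cell means with their probabilities recovers the overall sample mean.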

A.2.3 Properties of Conditional Expectations

Here and in the following, relations like <, ≤, = between random variables are always assumed to hold with probability one, and the term P-a.s. is suppressed. All random variables in this subsection are assumed to be integrable, i.e., to be elements of L1(Ω, F, P). Let A and C denote sub-σ-algebras of F. Then the following properties of conditional expectations hold true.

1. If Y is any version of E[X|A], then EY = EX.
2. If X is A-measurable (σ(X) ⊂ A), then E[X|A] = X.
3. Linearity. E[aX + bY|A] = aE[X|A] + bE[Y|A], a, b ∈ R.
4. Monotonicity. If X ≤ Y, then E[X|A] ≤ E[Y|A].
5. Monotone Convergence. If (Xn) is an increasing sequence and Xn → X P-a.s., then E[Xn|A] converges almost surely: limn→∞ E[Xn|A] = E[X|A].
6. Dominated Convergence. If (Xn) is a sequence of random variables such that sup|Xn| is integrable and Xn → X P-a.s., then E[Xn|A] converges almost surely: limn→∞ E[Xn|A] = E[X|A].
7. Jensen's Inequality. If g : R → R is convex and g(X) is integrable, then E[g(X)|A] ≥ g(E[X|A]); in particular, ‖X‖p ≥ ‖E[X|A]‖p for p ≥ 1.
8. Successive Conditioning. If H is a sub-σ-algebra of A, then E[E[X|A]|H] = E[X|H].
9. Factoring. Let the random variable Z be A-measurable and suppose that ZX is integrable. Then E[ZX|A] = ZE[X|A].
10. Independent Conditioning. Let C and A be sub-σ-algebras of F such that C is independent of σ(X) ∨ A. Then E[X|C ∨ A] = E[X|A]. In particular, if X is independent of C, then E[X|C] = EX.

The proofs of all these properties are mainly based on the definition of the conditional expectation and follow the ideas of the corresponding proofs for unconditional expectations, e.g., for monotone and dominated convergence (cf. Williams [164], pp. 89–90).

A.2.4 Regular Conditional Probabilities

We define the conditional probability of an event A ∈ F, given a sub-σ-algebra A, as
$$P(A\,|\,\mathcal{A}) = E[I_A\,|\,\mathcal{A}].$$
Clearly, by the monotonicity, linearity, and monotone convergence properties, we have 0 ≤ P(A|A) ≤ 1, P(Ω|A) = 1, and
$$P\Big(\bigcup_{n=1}^{\infty} A_n \,\Big|\, \mathcal{A}\Big) = \sum_{n=1}^{\infty} P(A_n\,|\,\mathcal{A})$$
for a fixed sequence A1, A2, . . . of disjoint events of F. From this we cannot conclude that for almost all ω ∈ Ω the map A ↦ P(A|A)(ω) defines a probability on F. Although we often dispense with a discussion of P-zero sets, it is important here. For example, the last equation, showing the σ-additivity of conditional probability, holds only with probability 1. Except in trivial cases, there are uncountably many sequences of disjoint events, and each of these sequences determines an exceptional P-zero set. The union of all these exceptional sets need not have probability 0 (it need not even be an element of F). But fortunately, for most cases encountered in applications there exists a so-called regular conditional probability.

Definition A.7. A map Q : Ω × F → [0, 1] is called a regular conditional probability given A ⊂ F if

(i) for all A ∈ F, ω ↦ Q(ω, A) is a version of E[IA|A];
(ii) there exists some N0 ∈ F with P(N0) = 0 such that the map A ↦ Q(ω, A) is a probability measure on F for all ω ∉ N0.


A.2.5 Computation of Conditional Expectations

Besides the simple case of a sub-σ-algebra A generated by a countable partition of Ω, mentioned in Theorem A.6, we consider two further ways to determine conditional expectations E[X|A].

1. If there exists a regular conditional probability Q given A, we can determine the conditional distribution QX of a random variable X given A: QX(ω, B) = Q(ω, X−1(B)). Then for any measurable function g : R → R such that g(X) is integrable,
$$\int_{\mathbb{R}} g(x)\,Q_X(\omega, dx)$$
is a version of E[g(X)|A].
2. We consider two random variables X and Y and a measurable function g such that g(X) is integrable. We write
$$E[g(X)\,|\,Y] = E[g(X)\,|\,\sigma(Y)]$$
for the conditional expectation of g(X) given Y. By definition E[g(X)|Y] is σ(Y)-measurable, and by Doob's representation theorem (cf. [61], p. 12) there exists a measurable function h : Y(Ω) → R such that
$$E[g(X)\,|\,Y] = h(Y).$$

If we know such a function h, we can also determine h(y) = E[g(X)|Y = y], y ∈ R, the conditional expectation of g(X) given that Y has realization y. Of course, if P(Y = y) > 0, we have
$$h(y) = E[g(X)\,|\,Y = y] = \frac{1}{P(Y = y)}\int_{\{Y = y\}} g(X)\,dP.$$

But even if the set {Y = y} has probability 0, we are now able to determine the conditional expectation of g(X) given that Y takes the value y (provided we know h). Consider the case in which a joint density fXY(x, y) of X and Y is known. Let fY(y) = ∫R fXY(x, y) dx be the density of the (marginal) distribution of Y, and let
$$f_{X|Y}(x|y) = \begin{cases} f_{XY}(x,y)/f_Y(y) & \text{if } f_Y(y) \ne 0,\\ 0 & \text{otherwise} \end{cases}$$
be the elementary conditional density of X given Y. A natural choice for the function h would then be
$$h(y) = \int_{\mathbb{R}} g(x)\,f_{X|Y}(x,y)\,dx.$$


We claim that h(Y) is a version of the conditional expectation E[g(X)|Y]. To prove this, note that the elements of the σ-algebra σ(Y) are of the form Y−1(B) = {ω : Y(ω) ∈ B}, B ∈ B. Therefore, we have to show that
$$E[g(X)I_B(Y)] = \int\!\!\int g(x)\,I_B(y)\,f_{XY}(x,y)\,dx\,dy$$
equals
$$E[h(Y)I_B(Y)] = \int h(y)\,I_B(y)\,f_Y(y)\,dy$$
for all B ∈ B. But this follows directly from Fubini's theorem, which proves the assertion.
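The Fubini argument can be checked by simulation. For a standard bivariate normal pair with correlation ρ (an illustrative choice), the elementary conditional density yields h(y) = E[X|Y = y] = ρy, and the two sides E[g(X)I_B(Y)] and E[h(Y)I_B(Y)] with g(x) = x should agree for any Borel set B:

```python
import random, math

# Monte Carlo check (sketch) that h(Y) with h(y) = rho * y is a version
# of E[X | Y] for a standard bivariate normal pair; rho and the set B
# are illustrative choices.
random.seed(2)
rho, n = 0.6, 400_000
pairs = []
for _ in range(n):
    y = random.gauss(0.0, 1.0)
    x = rho * y + math.sqrt(1 - rho * rho) * random.gauss(0.0, 1.0)
    pairs.append((x, y))

in_B = lambda y: 0.5 < y < 1.5   # an arbitrary Borel set B = (0.5, 1.5)

# E[g(X) I_B(Y)] with g(x) = x ...
lhs = sum(x for x, y in pairs if in_B(y)) / n
# ... against E[h(Y) I_B(Y)] with h(y) = rho * y
rhs = sum(rho * y for _, y in pairs if in_B(y)) / n
print(lhs, rhs)   # the two averages nearly coincide
```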

A.3 Stochastic Processes on a Filtered Probability Space

Definition A.8.
1. A stochastic process is a family X = (Xt), t ∈ R+, of random variables all defined on the same probability space (Ω, F, P) with values in a measurable space (S, S).
2. For ω ∈ Ω the mapping t → Xt(ω) is called a path.
3. Two stochastic processes X, Y are called indistinguishable if P-almost all paths are identical: P(Xt = Yt, ∀t ∈ R+) = 1.

If it is claimed that a process is unique, we mean uniqueness up to indistinguishability. Also for conditional expectations no distinction will be made between one version of the conditional expectation and the equivalence class of P-a.s. equal versions. A real-valued process is called right- or left-continuous, nondecreasing, of bounded variation on finite intervals, etc., if P-almost all paths have this property, i.e., if the process is indistinguishable from a process all of whose paths have that property. In particular, a process is called cadlag (continu à droite, limite à gauche) if almost all paths are right-continuous and left-limited.

If not otherwise mentioned, we always refer in the following to real-valued stochastic processes, i.e., to processes X = (Xt) for which the Xt take values in (S, S) = (R, B), where B = B(R) is the Borel σ-algebra on R.

Definition A.9. A stochastic process X is called
1. integrable, if E|Xt| < ∞, ∀t ∈ R+;
2. square integrable, if EXt² < ∞, ∀t ∈ R+;
3. bounded in Lp, p ≥ 1, if supt∈R+ E|Xt|p < ∞;
4. uniformly integrable, if limc→∞ supt∈R+ E[|Xt| I(|Xt| > c)] = 0.

Deviating from our notation, some authors call an L2-bounded stochastic process square integrable.

Uniform integrability plays an important role in martingale theory. Therefore, we look for criteria for this property. A very useful one is given in the following proposition.


Proposition A.10. A stochastic process X is uniformly integrable if and only if there exists a positive increasing convex function G : R+ → R+ such that
1. limt→∞ G(t)/t = ∞ and
2. supt∈R+ EG(|Xt|) < ∞.

In particular, taking G(t) = tp, we see that a process X that is bounded in Lp for some p > 1 is uniformly integrable. A process bounded in L1 is not necessarily uniformly integrable. The property of uniform integrability links convergence in probability with convergence in L1.

Theorem A.11. Let (Xn), n ∈ N, be a sequence of integrable random variables that converges in probability to a random variable X, i.e., P(|Xn − X| > ε) → 0 as n → ∞ for all ε > 0. Then
$$X \in L^1 \text{ and } X_n \stackrel{L^1}{\to} X, \text{ i.e., } E|X_n - X| \to 0 \text{ as } n \to \infty,$$
if and only if (Xn) is uniformly integrable.

So if Xn → X P-a.s. and the sequence is uniformly integrable, then it follows that EXn → EX as n → ∞. At first sight it seems reasonable that under uniform integrability almost sure convergence can also be carried over to conditional expectations E[Xn|A] for some sub-σ-algebra A ⊂ F. But (surprisingly) this does not hold true in general; for a counterexample see Jensen [97]. The condition sup|Xn| ∈ L1 in the dominated convergence theorem for conditional expectations as stated above is necessary for the convergence result and cannot be weakened.

To describe the information that is gathered by observing some stochastic phenomenon in time, we introduce filtrations.

Definition A.12.
1. A family F = (Ft), t ∈ R+, of sub-σ-algebras of F is called a filtration if it is nondecreasing, i.e., if s ≤ t, then Fs ⊂ Ft. We denote F∞ = ⋁t∈R+ Ft = σ(⋃t∈R+ Ft).
2. If F = (Ft) is a filtration, then we write
$$\mathcal{F}_{t+} = \bigcap_{h>0} \mathcal{F}_{t+h} \quad \text{and} \quad \mathcal{F}_{t-} = \sigma\Big(\bigcup_{h>0} \mathcal{F}_{t-h}\Big).$$
3. A filtration (Ft) is called right-continuous if Ft+ = Ft for all t ∈ R+.
4. A probability space (Ω, F, P) together with a filtration F is called a stochastic basis: (Ω, F, F, P).
5. A stochastic basis (Ω, F, F, P) is called complete if F is complete, i.e., F contains all subsets of P-null sets, and if each Ft contains all P-null sets of F.
6. A filtration F is said to fulfill the usual conditions if it is right-continuous and complete.


The σ-algebra Ft is often interpreted as the information gathered up to time t, or, more precisely, the set of events of F that can be distinguished at time t. If a stochastic process X = (Xt), t ∈ R+, is observed, then a natural choice for a corresponding filtration would be Ft = FtX = σ(Xs, 0 ≤ s ≤ t), which is the smallest σ-algebra such that all random variables Xs, 0 ≤ s ≤ t, are Ft-measurable. Here we assume that FtX is augmented so that the generated filtration fulfills the usual conditions. Such an augmentation is always possible (cf. Dellacherie and Meyer [61], p. 115).

Remark A.13. Sometimes it is discussed whether such an augmentation affects the filtration too strongly. Indeed, if we consider, for example, two mutually singular probability measures P and Q on the measurable space (Ω,F) such that P(A) = 1 − Q(A) = 1 for some A ∈ F, then completing each Ft with all P- and Q-negligible sets may result in Ft = F for all t ∈ R+, which is a rather uninteresting case destroying the modeling of the evolution in time. But in the material we cover in this book such cases are not essential, and we always assume that a stochastic basis is given with a filtration meeting the usual conditions.

Definition A.14. A stochastic process X = (Xt), t ∈ R+, is called adapted to a filtration F = (Ft) if Xt is Ft-measurable for all t ∈ R+.

Definition A.15. A stochastic process X is F-progressive or progressively measurable if, for every t, the mapping (s, ω) → Xs(ω) on [0, t]×Ω is measurable with respect to the product σ-algebra B([0, t])⊗Ft, where B([0, t]) is the Borel σ-algebra on [0, t].

Theorem A.16. Let X be a real-valued stochastic process. If X is left- or right-continuous and adapted to F, then it is F-progressive. If X is F-progressive, then so is ∫_0^t Xs ds.

A further measurability restriction is needed in connection with stochastic processes in continuous time. This is the fundamental concept of predictability.

Definition A.17. Let F be a filtration on the basic probability space and let P(F) be the σ-algebra on (0,∞)×Ω generated by the system of sets

(s, t]×A, 0 ≤ s < t, A ∈ Fs.

P(F) is called the F-predictable σ-algebra on (0,∞)×Ω. A stochastic process X = (Xt) is called F-predictable if X0 is F0-measurable and the mapping (t, ω) → Xt(ω) on (0,∞)×Ω into R is measurable with respect to P(F).

Theorem A.18. Every left-continuous process adapted to F is F-predictable.

In all applications we will be concerned with predictable processes that are left-continuous. Note that F-predictable processes are also F-progressive. A property that explains the term predictable is given in the following theorem.

Theorem A.19. Suppose the process X is F-predictable. Then for all t > 0 the variable Xt is Ft−-measurable.


A.4 Stopping Times

Suppose we want to describe a point in time at which a stochastic process first enters a given set, say when it hits a certain level. This point in time is a random time because it depends on the random evolution of the process. Observing this stochastic process, it is possible to decide at any time t whether this random time has occurred or not. Such random times, which are based on the available information without anticipating the future, are defined as follows.

Definition A.20. Suppose F = (Ft), t ∈ R+, is a filtration on the measurable space (Ω,F). A random variable τ : Ω → [0,∞] is said to be a stopping time if for every t ∈ R+,

{τ ≤ t} = {ω : τ(ω) ≤ t} ∈ Ft.

In particular, a constant random variable τ = t0 ∈ R+ is a stopping time. Since we assume that the filtration is right-continuous, we can equivalently describe stopping times by the condition {τ < t} ∈ Ft: If {τ < t} ∈ Ft for all t ∈ R+, then

{τ ≤ t} = ⋂_{n∈N} {τ < t + 1/n} ∈ ⋂_{n∈N} F_{t+1/n} = Ft+.

Conversely, if {τ ≤ t} ∈ Ft for all t ∈ R+, then

{τ < t} = ⋃_{n∈N} {τ ≤ t − 1/n} ∈ Ft for t > 0, and {τ < 0} = ∅ ∈ F0.

Proposition A.21. Suppose σ and τ are stopping times. Then σ ∧ τ, σ ∨ τ, and σ + τ are stopping times. Let (τn), n ∈ N, be a sequence of stopping times. Then sup τn and inf τn are also stopping times.

Proof. First we show that σ + τ is a stopping time and consider the complement of the event {σ + τ ≤ t}:

{σ + τ > t} = {σ > t} ∪ {τ > t} ∪ {σ ≥ t, τ > 0} ∪ {0 < σ < t, σ + τ > t}.

The first three events of this union are clearly in Ft. The fourth event

{0 < σ < t, σ + τ > t} = ⋃_{r∈Q∩[0,t)} {r < σ < t, τ > t − r}

is a countable union of events of Ft, and therefore σ + τ is a stopping time. The proof of the remaining assertions follows from

{sup τn ≤ t} = ⋂_{n∈N} {τn ≤ t} ∈ Ft,
{inf τn < t} = ⋃_{n∈N} {τn < t} ∈ Ft,

using the fact that for a right-continuous filtration it suffices to show {inf τn < t} ∈ Ft. □


For a sequence of stopping times (τn) the random variables sup τn and inf τn are stopping times, so that lim sup τn, lim inf τn, and lim τn (if it exists) are also stopping times.

We now define the σ-algebra of the past of a stopping time τ .

Definition A.22. Suppose τ is a stopping time with respect to the filtration F. Then the σ-algebra Fτ of events occurring up to time τ is

Fτ = {A ∈ F∞ : A ∩ {τ ≤ t} ∈ Ft for all t ∈ R+}.

We note that τ is Fτ-measurable and that for a constant stopping time τ = t0 ∈ R+ we have Fτ = Ft0.

Theorem A.23. Suppose σ and τ are stopping times.

(i) If σ ≤ τ, then Fσ ⊂ Fτ.
(ii) If A ∈ Fσ, then A ∩ {σ ≤ τ} ∈ Fτ.
(iii) Fσ∧τ = Fσ ∩ Fτ.

Proof. (i) For B ∈ Fσ and t ∈ R+ we have

B ∩ {τ ≤ t} = B ∩ {σ ≤ t} ∩ {τ ≤ t} ∈ Ft,

which proves (i).
(ii) Suppose A ∈ Fσ. Then

A ∩ {σ ≤ τ} ∩ {τ ≤ t} = A ∩ {σ ≤ t} ∩ {τ ≤ t} ∩ {σ ∧ t ≤ τ ∧ t}.

Now A ∩ {σ ≤ t} and {τ ≤ t} are elements of Ft by assumption, and the random variables σ ∧ t and τ ∧ t are both Ft-measurable. This shows that {σ ∧ t ≤ τ ∧ t} ∈ Ft.

(iii) Since σ ∧ τ ≤ σ and σ ∧ τ ≤ τ we obtain from (i)

Fσ∧τ ⊂ Fσ ∩ Fτ.

Conversely, for A ∈ Fσ ∩ Fτ we have

A ∩ {σ ∧ τ ≤ t} = (A ∩ {σ ≤ t}) ∪ (A ∩ {τ ≤ t}) ∈ Ft,

which proves (iii). □

This theorem shows that some of the properties known for fixed time points s, t also hold true for stopping times σ, τ. Next we consider the link between a stochastic process X = (Xt), t ∈ R+, and a stopping time σ. It is natural to investigate variables Xσ(ω)(ω) with random index and the stopped process X^σ_t(ω) = Xσ∧t(ω) on {σ < ∞}. To ensure that Xσ is a random variable, we need that Xt fulfills a measurability requirement in t.

Theorem A.24. If σ is a stopping time and X = (Xt), t ∈ R+, is an F-progressive process, then Xσ is Fσ-measurable and the stopped process X^σ is F-progressive.


Proof. We must show that for any Borel set B ∈ B, {Xσ ∈ B} ∩ {σ ≤ t} belongs to Ft. This intersection equals {Xσ∧t ∈ B} ∩ {σ ≤ t}, so we need only show that X^σ is progressive. Now σ ∧ t is Ft-measurable. Hence, (s, ω) → (σ(ω) ∧ s, ω) is B([0, t])⊗Ft-measurable. Therefore, the map (s, ω) → Xσ(ω)∧s(ω) is measurable as the composition of two measurable maps. Hence X^σ is progressive. □

Most important for applications are those random times σ that are defined as first entrance times of a stochastic process X into a Borel set B: σ = inf{t ∈ R+ : Xt ∈ B}. In general, it is very difficult to show that σ is a stopping time. For a discussion of the usual conditions in this connection, see Rogers and Williams [133], pp. 183–191. For a complete proof of the following theorem we refer to Dellacherie and Meyer [61], p. 116.

Theorem A.25. Let X be an F-progressive process with respect to the complete and right-continuous filtration F, and let B ∈ B be a Borel set. Then

σ(ω) = inf{t ∈ R+ : Xt(ω) ∈ B}

is an F-stopping time.

Proof. We only show the simple case where X is right-continuous and B is an open set. Then the right-continuity implies that

{σ < t} = ⋃_{r∈Q∩[0,t)} {Xr ∈ B} ∈ Ft.

Using the right-continuity of F it is seen that σ is an F-stopping time. □

Note that the right-continuity of the paths was used to express {σ < t} as the union of events {Xr ∈ B}, and that we could restrict ourselves to a countable union because B is an open set.
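On a discretized path the first entrance time can be computed by exactly the scan over (countably many) grid points used in the proof above. A small simulation sketch, assuming a random-walk approximation of a Brownian path and the illustrative open set B = (1, ∞); the grid, seed, and set are not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(5)

# First entrance time sigma = inf{t : X_t in B} for B = (1, inf),
# computed on a grid: scan the grid points r with X_r in B,
# mirroring the countable union over rationals in the proof above.
n, t_max = 10000, 10.0
dt = t_max / n
X = np.sqrt(dt) * rng.normal(size=n).cumsum()  # random-walk path
hit = np.flatnonzero(X > 1.0)                  # grid indices with X_r in B
sigma = dt * (hit[0] + 1) if hit.size else np.inf
print(sigma)  # first grid time in B, or inf if the path never enters
```

On a finite grid the scan can only approximate σ from above; refining the grid (larger n) tightens the approximation, which is the discrete analogue of taking the union over all rationals below t.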

A.5 Martingale Theory

An overview of the historical development of martingale theory can be found in monographs such as Andersen et al. [2], pp. 115–120, or Kallenberg [101], pp. 464–485. We fix a stochastic basis (Ω,F,F, P) and define stochastic processes with certain properties, which are known as the stochastic analogues of constant, increasing, and decreasing functions.

Definition A.26. An integrable F-adapted process X = (Xt), t ∈ R+, is called a martingale if

Xt = E[Xs|Ft]   (A.2)

for all s ≥ t, s, t ∈ R+. A supermartingale is defined in the same way, except that (A.2) is replaced by

Xt ≥ E[Xs|Ft],

260 A Background in Probability and Stochastic Processes

and a submartingale is defined with (A.2) replaced by

Xt ≤ E[Xs|Ft].

Forming expectations on both sides of the (in)equality we obtain EXt = (≥, ≤) EXs, which shows that a martingale is constant on average, while a supermartingale decreases and a submartingale increases on average.

Example A.27. Let X be an integrable F-adapted process. Suppose that the increments Xs − Xt are independent of Ft for all s > t, s, t ∈ R+. If these increments have zero expectation (thus the expectation function EXt is constant), then X is a martingale:

E[Xs|Ft] = E[Xt|Ft] + E[Xs − Xt|Ft] = Xt.

Of particular importance are the following cases.

(i) If X is continuous, X0 = 0, and the increments Xs − Xt are normally distributed with mean 0 and variance s − t, then X is an F-Brownian motion. In addition to X, the process Yt = Xt² − t is also a martingale:

E[Ys|Ft] = E[(Xs − Xt)²|Ft] + 2Xt E[Xs − Xt|Ft] + Xt² − s = s − t + 0 + Xt² − s = Yt.

(ii) If X0 = 0 and the increments Xs − Xt follow a Poisson distribution with mean s − t for s > t, then X is a Poisson process. Now X is a submartingale because of

E[Xs|Ft] = Xt + E[Xs − Xt|Ft] = Xt + s − t ≥ Xt,

and Xt − t is a martingale.
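The compensated Poisson process of case (ii) is easy to check numerically. A minimal Monte Carlo sketch (rate 1; grid, seed, and sample sizes are arbitrary choices): the sample mean of Mt = Xt − t stays near 0 at every grid time, as the martingale property predicts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Check empirically that M_t = X_t - t has zero mean when X is a
# rate-1 Poisson process: build paths from independent Poisson(dt)
# increments and average over many paths.
n_paths, t_max, steps = 20000, 5.0, 50
dt = t_max / steps
incr = rng.poisson(dt, size=(n_paths, steps))  # Poisson(dt) increments
X = incr.cumsum(axis=1)                        # paths on the grid
times = dt * np.arange(1, steps + 1)
M = X - times                                  # compensated process
print(np.abs(M.mean(axis=0)).max())            # small at every grid time
```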

Example A.28. Let Y be an integrable random variable and define Mt = E[Y|Ft]. Then M is a martingale because of the successive conditioning property:

E[Ms|Ft] = E[E[Y|Fs]|Ft] = E[Y|Ft] = Mt, s ≥ t.

So Mt is a predictor of Y given the information Ft gathered up to time t. Furthermore, M is a uniformly integrable martingale. To see this we have to show that sup_{t∈R+} E[|Mt| I(|Mt| > c)] → 0 as c → ∞. By Jensen's inequality for conditional expectations we obtain

E[|Mt| I(|Mt| > c)] ≤ E[E[|Y| I(|Mt| > c)|Ft]] = E[|Y| I(|Mt| > c)].

Since Y is integrable and cP(|Mt| > c) ≤ E|Mt| ≤ E|Y|, it follows that P(|Mt| > c) → 0 uniformly in t, which shows that M is uniformly integrable.

Concerning the regularity of the paths of a supermartingale, the following result holds true.


Lemma A.29. Suppose X is a supermartingale such that t → EXt is right-continuous. Then X has a modification with all paths cadlag, i.e., there exists a process Y with cadlag paths such that Xt = Yt P-a.s. for all t ∈ R+.

So for a martingale, a submartingale, or a supermartingale with right-continuous expectation function, we can assume that it has cadlag paths. From now on we make the general assumption that all martingales, submartingales, and supermartingales are cadlag unless stated otherwise.

Lemma A.30. Let M be a martingale and consider a convex function g : R → R such that X = g(M) is integrable. Then X is a submartingale.

If g is also nondecreasing, then the assertion remains true for submartingales M.

Proof. Let M be a martingale. Then by Jensen's inequality we obtain for s ≥ t

Xt = g(Mt) = g(E[Ms|Ft]) ≤ E[g(Ms)|Ft] = E[Xs|Ft],

which shows that X is a submartingale. If M is a submartingale and g is nondecreasing, then

g(Mt) ≤ g(E[Ms|Ft])

shows that the conclusion remains valid. □

The last lemma is often applied with the functions g(x) = |x|^p, p ≥ 1. So, if M is a square integrable martingale, then X = M² defines a submartingale.

One key result in martingale theory is the following convergence theorem (cf. [62], p. 72).

Theorem A.31. Let X be a supermartingale (martingale). Suppose that

sup_{t∈R+} E|Xt| < ∞,

a condition that is equivalent to lim_{t→∞} E[Xt⁻] < ∞. Then the random variable X∞ = lim_{t→∞} Xt exists and is integrable.

If the supermartingale (martingale) X is uniformly integrable, X∞ exists and closes X on the right in that for all t ∈ R+

Xt ≥ E[X∞|Ft] (respectively, Xt = E[X∞|Ft]).

As a consequence we get the following characterization of the convergence of martingales.

Theorem A.32. Suppose M is a martingale. Then the following conditions are equivalent:


(i) M is uniformly integrable.
(ii) There exists a random variable M∞ such that Mt converges to M∞ in L1: lim_{t→∞} E|Mt − M∞| = 0.
(iii) Mt converges P-a.s. to an integrable random variable M∞, which closes M on the right: Mt = E[M∞|Ft].

Example A.33. If in Example A.28 we assume that Y is F∞-measurable, then we can conclude that the martingale Mt = E[Y|Ft] converges P-a.s. and in L1 to Y.

In Example A.27 (i) we see that Brownian motion (Xt) is not uniformly integrable, as for any c > 1 we can find a t > 0 such that P(|Xt| > c) ≥ ε for some ε, 0 < ε < 1. In this case we can conclude that Xt does not converge to any random variable for t → ∞, neither P-a.s. nor in L1.

Next we consider conditions under which the (super-)martingale property also extends from fixed time points s, t to stopping times σ, τ.

Theorem A.34. (Optional Sampling Theorem). Let X be a supermartingale and let σ and τ be two stopping times such that σ ≤ τ. Suppose either that τ is bounded or that (Xt) is uniformly integrable. Then Xσ and Xτ are integrable and

Xσ ≥ E[Xτ|Fσ],

with equality if X is a martingale.

An often used consequence of Theorem A.34 is the following: If X is a uniformly integrable martingale, then setting σ = 0 we obtain EX0 = EXτ for all stopping times τ (all quantities are related to the same filtration F). A kind of converse is the following proposition.

Proposition A.35. Suppose X is an adapted cadlag process such that for any bounded stopping time τ the random variable Xτ is integrable and EX0 = EXτ. Then X is a martingale.

A further consequence of the Optional Sampling Theorem is that a stopped (super-)martingale remains a (super-)martingale.

Corollary A.36. Let X be a right-continuous supermartingale (martingale) and τ a stopping time. Then the stopped process X^τ = (Xt∧τ) is a supermartingale (martingale). If either X is uniformly integrable or I(τ < ∞)Xτ is integrable and lim_{t→∞} ∫_{{τ>t}} |Xt| dP = 0, then X^τ is uniformly integrable.

Martingales are often constructed by subtracting an increasing process from a submartingale (cf. Example A.27 (ii), p. 260). This fact emanates from the celebrated Doob–Meyer decomposition, which is a cornerstone of modern probability theory.


Theorem A.37. (Doob–Meyer decomposition). Let the process X be right-continuous and adapted. Then X is a uniformly integrable submartingale if and only if it has a decomposition

X = A + M,

where A is a right-continuous, predictable, nondecreasing, and integrable process with A0 = 0 and M is a uniformly integrable martingale. The decomposition is unique up to indistinguishability.

Remark A.38. 1. Several proofs of this and more general results, not restricted to uniformly integrable processes, are known (cf. [62], p. 198 and [101], p. 412). Some of these also refer to local martingales, which are not needed for the applications we have presented and which are therefore not introduced here.

2. The process A in the theorem above is often called the compensator.

3. In the case of discrete time such a decomposition is easily constructed in the following way. Let (Xn), n ∈ N0, be a submartingale with respect to a filtration (Fn), n ∈ N0. Then we define

Xn = An + Mn,

where

An = An−1 + E[Xn|Fn−1] − Xn−1, n ∈ N, A0 = 0,
Mn = Xn − An, n ∈ N0.

The process M is a martingale and A is nondecreasing and predictable in that An is Fn−1-measurable for n ∈ N. This decomposition is unique, since for a second decomposition Xn = Ãn + M̃n with the same properties we must have Mn − M̃n = Ãn − An, which is a predictable martingale. Therefore,

0 = E[Ãn − An|Fn−1] = Ãn − An, n ∈ N,

and Ã0 = A0 = 0.
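The discrete-time construction in part 3 can be carried out explicitly. A sketch for the submartingale Xn = Sn², where Sn is a simple symmetric random walk; here E[Xn|Fn−1] = Xn−1 + 1, so the recursion yields the compensator An = n on every path (the walk and its length are illustrative choices, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(1)

# Doob decomposition X_n = A_n + M_n for X_n = S_n^2, with S_n a
# simple symmetric random walk.  E[X_n | F_{n-1}] = S_{n-1}^2 + 1,
# so A_n = A_{n-1} + E[X_n | F_{n-1}] - X_{n-1} gives A_n = n, and
# M_n = S_n^2 - n is the martingale part.
n_steps = 200
S = np.concatenate(([0], rng.choice([-1, 1], size=n_steps).cumsum()))
X = S ** 2

A = np.zeros(n_steps + 1)
for n in range(1, n_steps + 1):
    cond_exp = X[n - 1] + 1                # E[X_n | F_{n-1}] for this walk
    A[n] = A[n - 1] + cond_exp - X[n - 1]  # predictable, nondecreasing
M = X - A                                  # martingale part S_n^2 - n
print(A[-1])  # → 200.0
```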

The continuous-time result needs much more care and uses several lemmas, one of which is interesting in its own right and will be presented here.

Lemma A.39. A process M is a predictable martingale of integrable variation, i.e., E[∫_0^∞ |dMs|] < ∞, if and only if Mt = M0 for all t ∈ R+.

We will now use the Doob–Meyer decomposition to introduce two types of (co-)variation processes. For this we recall that M (M0) denotes the class of cadlag martingales (with M0 = 0) and denote by M^2 (M^2_0) the set of martingales in M (M0) that are bounded in L2, i.e., sup_{t∈R+} E[Mt²] < ∞.


Definition A.40. For M ∈ M^2 the unique compensator of M² in the Doob–Meyer decomposition, denoted ⟨M,M⟩ or ⟨M⟩, is called the predictable variation process. For M1, M2 ∈ M^2 the process

⟨M1,M2⟩ = (1/4)(⟨M1 + M2⟩ − ⟨M1 − M2⟩)

is called the predictable covariation process of M1 and M2.

Proposition A.41. Suppose that M1, M2 ∈ M^2. Then A = ⟨M1,M2⟩ is the unique predictable cadlag process with A0 = 0 such that M1M2 − A ∈ M.

Proof. The assertion follows from the Doob–Meyer decomposition and

M1M2 − ⟨M1,M2⟩ = (1/4)((M1 + M2)² − (M1 − M2)²) − ⟨M1,M2⟩
= (1/4)((M1 + M2)² − ⟨M1 + M2⟩) − (1/4)((M1 − M2)² − ⟨M1 − M2⟩). □

To understand what predictable variation means, we give a heuristic explanation. Recall that for a martingale M we have for all 0 < h < t

E[Mt − Mt−h|Ft−h] = 0,

or in heuristic form:

E[dMt|Ft−] = 0.

Since M² − ⟨M⟩ is a martingale and ⟨M⟩ is predictable, we obtain

E[d(Mt²)|Ft−] = E[d⟨M⟩t|Ft−] = d⟨M⟩t.

Furthermore,

d(Mt²) = Mt² − (Mt−)² = (Mt− + dMt)² − (Mt−)² = (dMt)² + 2Mt− dMt,

yielding

d⟨M⟩t = E[(dMt)²|Ft−] + 2Mt− E[dMt|Ft−] = E[(dMt)²|Ft−] = Var[dMt|Ft−].

This indicates (and it can be proved) that ⟨M⟩t is the stochastic limit of sums of the form

∑_{i=1}^n Var[Mti − Mti−1|Fti−1]

as n → ∞ and the span of the partition 0 = t0 < t1 < … < tn = t tends to 0.
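This limit can be illustrated numerically. For Brownian motion ⟨M⟩t = t, and the conditional variances reduce to the deterministic increments ti − ti−1; summing squared path increments over a fine partition recovers t (horizon, partition size, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# For Brownian motion <M>_t = t: the sum of squared increments over
# a fine partition of [0, t] is close to t.
t, n = 2.0, 100000
dW = rng.normal(0.0, np.sqrt(t / n), size=n)  # increments over the partition
print((dW ** 2).sum())  # ≈ t = 2.0
```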

A.5 Martingale Theory 265

Definition A.42. Two martingales M, L ∈ M^2 are called orthogonal if their product is a martingale: ML ∈ M.

For two martingales M, L of M^2 that are orthogonal we must have ⟨M,L⟩ = 0. If we equip M^2 with the scalar product

(M,L)_{M^2} = E[M∞L∞],

inducing the norm ‖M‖ = (E[M∞²])^{1/2}, then M^2 becomes a Hilbert space. Because of ML − ⟨M,L⟩ ∈ M and ⟨M,L⟩0 = 0, it follows that

(M,L)_{M^2} = E[M∞L∞] = E⟨M,L⟩∞ + E[M0L0].

So two orthogonal martingales M, L of M^2_0 are also orthogonal in the Hilbert space M^2 (cf. Elliott [67], p. 88).

The set of continuous martingales in M^2_0, denoted M^{2,c}_0, is a complete subspace of M^2_0, and M^{2,d}_0 is the space orthogonal to M^{2,c}_0. The martingales in M^{2,d}_0 are called purely discontinuous. As an immediate consequence we obtain that any martingale M ∈ M^2_0 has a unique decomposition M = M^c + M^d, where M^c ∈ M^{2,c}_0 and M^d ∈ M^{2,d}_0.

A process strongly connected to predictable variation is the so-called square bracket process introduced in the following definition.

Definition A.43. Suppose M ∈ M^2_0 and M = M^c + M^d is the unique decomposition with M^c ∈ M^{2,c}_0 and M^d ∈ M^{2,d}_0. The increasing cadlag process [M] with

[M]t = ⟨M^c⟩t + ∑_{s≤t} (ΔMs)²

is called the quadratic variation of M, where ΔMt = Mt − Mt− denotes the jump of M at time t > 0 (ΔX0 = X0). For martingales M, L ∈ M^2_0 we define the quadratic covariation [M,L] by

[M,L] = (1/4)([M + L] − [M − L]).

The following proposition helps to understand the name quadratic covariation.

Proposition A.44. Suppose M, L ∈ M^2_0.

1. Let (t^n_i) be a sequence of partitions 0 = t^n_0 < t^n_1 < … < t^n_n = t such that the span sup_i (t^n_{i+1} − t^n_i) tends to 0 as n → ∞. Then

∑_i (M_{t_{i+1}} − M_{t_i})(L_{t_{i+1}} − L_{t_i})

converges P-a.s. and in L1 to [M,L]t for all t > 0.
2. ML − [M,L] is a martingale.


A.6 Semimartingales

A decomposition of a stochastic process into a (predictable) drift part and a martingale, as presented for submartingales in the Doob–Meyer decomposition, also holds true for more general processes. We start with the motivating example of a sequence (Xn), n ∈ N0, of integrable random variables adapted to the filtration (Fn). This sequence admits a decomposition

Xn = X0 + ∑_{i=1}^n fi + Mn

with a predictable sequence f = (fn), n ∈ N (i.e., fn is Fn−1-measurable), and a martingale M = (Mn), n ∈ N0, M0 = 0. We can take

fn = E[Xn − Xn−1|Fn−1],
Mn = ∑_{i=1}^n (Xi − E[Xi|Fi−1]).

This decomposition is unique because a second decomposition of this type, say with a sequence f̃ and a martingale M̃, would imply that

Mn − M̃n = ∑_{i=1}^n (f̃i − fi)

defines a predictable martingale, i.e., E[Mn − M̃n|Fn−1] = Mn − M̃n = M0 − M̃0 = 0, which shows the uniqueness.
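The construction above is easy to carry out on a concrete sequence. A sketch for a hypothetical AR(1) recursion Xn = aXn−1 + εn with i.i.d. zero-mean noise (the recursion, coefficient, and seed are illustrative assumptions): here fn = E[Xn − Xn−1|Fn−1] = (a − 1)Xn−1, and the martingale increments are exactly the noise terms.

```python
import numpy as np

rng = np.random.default_rng(6)

# Drift/martingale decomposition X_n = X_0 + sum f_i + M_n for the
# hypothetical AR(1) recursion X_n = a*X_{n-1} + eps_n (E eps = 0).
# Here f_n = E[X_n - X_{n-1} | F_{n-1}] = (a - 1) * X_{n-1}.
a, n_steps = 0.9, 500
eps = rng.normal(size=n_steps)
X = np.zeros(n_steps + 1)
X[0] = 1.0
for n in range(1, n_steps + 1):
    X[n] = a * X[n - 1] + eps[n - 1]

f = (a - 1) * X[:-1]                     # predictable terms f_1, ..., f_n
M = X - X[0] - np.concatenate(([0], f.cumsum()))
print(np.allclose(M[1:] - M[:-1], eps))  # → True: increments are the noise
```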

Unlike the time-discrete case, corresponding decompositions cannot be found for all integrable processes in continuous time. The role of increasing processes in the Doob–Meyer decomposition will now be taken by processes of bounded variation.

Definition A.45. For a cadlag function g : R+ → R the variation is defined as

Vg(t) = lim_{n→∞} ∑_{k=1}^n |g(tk/n) − g(t(k−1)/n)|.

The function g is said to have finite variation if Vg(t) < ∞ for all t ∈ R+. The class of cadlag processes A with finite variation starting in A0 = 0 is denoted V.

For any A ∈ V there is a decomposition At = Bt − Ct with increasing processes B, C ∈ V and

Bt + Ct = VA(t) = ∫_0^t |dAs|.


Definition A.46. A process Z is a semimartingale if it has a decomposition

Zt = Z0 + At + Mt,

where A ∈ V and M ∈ M0.

There is a rich theory of semimartingales that relies on the remarkable property that semimartingales are stable under many sorts of operations: changes of time, of probability measures, and of filtrations preserve the semimartingale property, and products and convex functions of semimartingales are again semimartingales (cf. Dellacherie and Meyer [62], pp. 212–252). The importance of semimartingales also lies in the fact that stochastic integrals

∫_0^t Hs dZs

of predictable processes H with respect to a semimartingale Z can be introduced, replacing Stieltjes integrals. It is beyond the scope of this book to present the whole theory of semimartingales; we confine ourselves to the case where the process A in the semimartingale decomposition is absolutely continuous (with respect to Lebesgue measure). This class of processes is rich enough to contain most processes of interest in applications and allows the development of a kind of "differential" calculus.

Definition A.47. A semimartingale Z with decomposition Zt = Z0 + At + Mt is called a smooth semimartingale (SSM) if Z is integrable and A has the form

At = ∫_0^t fs ds,

where f is a progressive process and A has locally integrable variation, i.e.,

E∫_0^t |fs| ds < ∞

for all t ∈ R+. Short notation: Z = (f,M).

As submartingales can be considered as the stochastic analog of increasing functions, smooth semimartingales can be seen as the stochastic counterpart of differentiable functions. Some of the above-mentioned operations will be considered in the following.

A.6.1 Change of Time

Let (τt), t ∈ R+, be a family of stopping times with respect to F = (Ft) such that for all ω, τt(ω) is nondecreasing and right-continuous as a function of t. Then for an F-semimartingale Z we consider the transformed process Z̃t = Zτt, which is adapted to F̃ = (F̃t), where F̃t = Fτt.

268 A Background in Probability and Stochastic Processes

Theorem A.48. If Z is an F-semimartingale, then Z̃ is an F̃-semimartingale.

One example of such a change of time is stopping a process at some stopping time τ:

τt = t ∧ τ.

If we consider an SSM Z = (f,M), then the stopped process Z^τ = Z̃ = (f̃, M̃) is again an SSM with

f̃t = I(τ > t)ft.

A.6.2 Product Rule

It is known that the product of two semimartingales is a semimartingale (cf. [62], p. 219). However, this does not hold true in general for SSMs. As an example consider a martingale M ∈ M^2_0 with a predictable variation process ⟨M⟩ that is not continuous. Then Z = M is an SSM with f = 0, but Z² = M² has a decomposition

Zt² = ⟨M⟩t + Rt

with some martingale R, which shows that Z² is not an SSM. To establish conditions under which a product rule for SSMs holds true, we first recall the integration by parts formula for ordinary functions.

Proposition A.49. Let a and b be cadlag functions on R+ that are of finite variation. Then for each t ∈ R+

a(t)b(t) = a(0)b(0) + ∫_0^t a(s−) db(s) + ∫_0^t b(s) da(s)
= a(0)b(0) + ∫_0^t a(s−) db(s) + ∫_0^t b(s−) da(s) + ∑_{0<s≤t} Δa(s)Δb(s),

where a(s−) is the left limit at s and Δa(s) = a(s) − a(s−).

Replacing a and b by SSMs Z and Y in this integration by parts formula, we need to give

∫_0^t Ys− dZs

a meaning. The finite variation part can be defined as an ordinary (pathwise) Stieltjes integral. It remains to define ∫_0^t Ys− dMs, where M is a martingale possibly of unbounded variation. Because we do not want to develop the theory of stochastic integration, we only quote the following theorem stating conditions to be used in the product formula we aim at.


Theorem A.50. Suppose M ∈ M^2_0 and let X be a predictable process such that

E∫_0^∞ Xs² d⟨M⟩s < ∞.

Then there exists a unique process ∫_0^t Xs dMs ∈ M^2_0 with the characterizing property

⟨∫_0^t Xs dMs, L⟩ = ∫_0^t Xs d⟨M,L⟩s

for all L ∈ M^2_0.

For two SSMs Z and Y with martingale parts M and L, respectively, M, L ∈ M^2_0, we define the covariation [Z,Y] by

[Z,Y]t = ⟨M^c,L^c⟩t + ∑_{s≤t} ΔZsΔYs
= ⟨M^c,L^c⟩t + Z0Y0 + ∑_{s≤t} ΔMsΔLs
= Z0Y0 + [M,L]t.

After these preparations the following product rule can be established.

Theorem A.51. Let Z = (f,M) and Y = (g,L) be F-SSMs with orthogonal martingales M, L ∈ M^2_0, i.e., ML ∈ M0. Assume that

E∫_0^t (|Zsgs| + |Ysfs|) ds < ∞, E|Z0Y0| < ∞,
E∫_0^∞ (Ys−)² d⟨M⟩s < ∞, E∫_0^∞ (Zs−)² d⟨L⟩s < ∞.

Then ZY is an F-SSM with representation

ZtYt = Z0Y0 + ∫_0^t (Ysfs + Zsgs) ds + Rt,

where R = (Rt) is a martingale in M0.

Proof. To prove the product rule we use a form of integration by parts for semimartingales, which is an application of Itô's formula (see [67], p. 140):

ZtYt = ∫_(0,t] Zs− dYs + ∫_(0,t] Ys− dZs + [Z,Y]t.

The definition of stochastic integrals implies

∫_(0,t] Zs− dYs = ∫_(0,t] Zs− d(∫_0^s gu du) + ∫_(0,t] Zs− dLs.

The second term of the sum is a martingale of M^2_0 by virtue of

E∫_0^∞ (Zs−)² d⟨L⟩s < ∞.

The first term of the sum is an ordinary Stieltjes integral. Since the paths of Z have at most countably many jumps, it follows that

∫_(0,t] Zs− d(∫_0^s gu du) = ∫_0^t Zsgs ds.

The second integral in the integration by parts formula is treated in the same way.

It remains to show that in [Z,Y]t = Z0Y0 + [M,L]t the second term of the sum is a martingale. From Proposition A.44, p. 265, we know that ML − [M,L] is a martingale. By virtue of the assumption ML ∈ M0, the square bracket process [M,L] must then also have the martingale property. Altogether the product semimartingale has the representation

ZtYt = Z0Y0 + ∫_0^t (Zsgs + Ysfs) ds + Rt,

where

Rt = ∫_(0,t] Zs− dLs + ∫_(0,t] Ys− dMs + [M,L]t

is a martingale in M0. This completes the proof. □

Sometimes the product rule is used for a product one factor of which is the one-point process I(ζ ≤ t) with a stopping time ζ. Because of the special structure of this factor, less restrictive conditions are needed to establish a product rule.

Proposition A.52. Let Z = (f,M) be an F-SSM and ζ > 0 a (totally inaccessible) F-stopping time with

Yt = I(ζ ≤ t) = ∫_0^t gs ds + Lt.

Furthermore, it is assumed that for all t ∈ R+

E∫_0^t |Zsgs| ds < ∞, E∫_0^t |Zs−| |dLs| < ∞,

and ΔMζ = 0. Then ZY is an SSM with representation

ZtYt = ∫_0^t (Zsgs + Ysfs) ds + Rt,

where R ∈ M0.


Proof. The product ZY can be represented in the form

ZtYt = Zt − Zt∧ζ + ∫_0^t Zs dYs

with the pathwise defined Stieltjes integral

∫_0^t Zs dYs = ∫_0^t Zsgs ds + ∫_0^t Zs dLs.

The second term in this sum can be decomposed as

∫_0^t Zs dLs = ∫_0^t Zs− dLs + ∑_{s≤t} ΔMsΔLs.

The sum of jumps is 0, since L is continuous outside {(t, ω) : ζ(ω) = t} and ΔMζ = 0. The martingale L is of finite variation, and the condition E∫_0^t |Zs−| |dLs| < ∞ implies that the integral of the predictable process Zs− with respect to L is a martingale (cf. [101]).

To sum up we get

ZtYt = ∫_0^t (fs − I(ζ > s)fs + Zsgs) ds + Mt − M^ζ_t + ∫_0^t Zs− dLs,

which proves the assertion. □

B Renewal Processes

In this appendix we present some definitions and results from the theory of renewal processes, including renewal reward processes and regenerative processes. Key references are [1, 8, 44, 58, 135, 156].

The purpose of this appendix is not to give an all-inclusive presentation of the theory. Only definitions and results needed for establishing the results of Chaps. 1–5 (in particular Chap. 4) are covered.

B.1 Basic Theory of Renewal Processes

Let T, Tj, j = 1, 2, …, be a sequence of nonnegative independent identically distributed (i.i.d.) random variables with distribution function F. To avoid trivialities, we assume that P(T = 0) < 1. From the nonnegativity of T it follows that ET exists, although it may be infinite, and we denote

μ = ET = ∫_0^∞ P(T > t) dt.

The variance of T is denoted σ². Let

S0 = 0, Sj = ∑_{i=1}^j Ti, j ∈ N,

and define

Nt = sup{j : Sj ≤ t},

or equivalently,

Nt = ∑_{j=1}^∞ I(Sj ≤ t).   (B.1)

The processes (Nt), t ∈ R+, and (Sj), j ∈ N0, are both called a renewal process. We say that a renewal occurs at t if Sj = t for some j ≥ 1. The random variable

T. Aven and U. Jensen, Stochastic Models in Reliability, Stochastic Modelling and Applied Probability 41, DOI 10.1007/978-1-4614-7894-2, © Springer Science+Business Media New York 2013


Nt represents the number of renewals in [0, t]. Since the interarrival times Tj are independent and identically distributed, it follows that after each renewal the process restarts.

Let M(t) = ENt, 0 ≤ t < ∞. The function M(t) is called the renewal function. It can be shown that M(t) is finite for all t. From (B.1) we see that

M(t) = ∑_{j=1}^∞ F^{*j}(t),   (B.2)

where F^{*j} denotes the j-fold convolution of F. If, for example, F is a Gamma distribution with parameters 2 and λ, i.e., F(t) = 1 − e^{−λt} − λt e^{−λt}, it can be shown that

M(t) = λt/2 − (1 − e^{−2λt})/4.

Refer to [1, 31, 32] for more general formulas for the renewal function of the Gamma distribution and expressions and bounds for other distributions. In Proposition B.1 we show how M can be determined (at least in theory) from F. It turns out that M uniquely determines F.
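The closed form above is easy to check by simulation. A Monte Carlo sketch with λ = 1 and t = 10 (sample sizes and seed are arbitrary choices): the empirical mean of Nt should be close to M(10) = λt/2 − (1 − e^{−2λt})/4 ≈ 4.75.

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo check of the renewal function for Gamma(2, lam)
# interarrivals: M(t) = lam*t/2 - (1 - exp(-2*lam*t))/4.
lam, t, n_paths = 1.0, 10.0, 20000
n_arr = 40                                # enough arrivals so S_j > t a.s.
T = rng.gamma(shape=2, scale=1 / lam, size=(n_paths, n_arr))
S = T.cumsum(axis=1)                      # renewal epochs per path
N_t = (S <= t).sum(axis=1)                # renewals in [0, t] per path
exact = lam * t / 2 - (1 - np.exp(-2 * lam * t)) / 4
print(N_t.mean(), exact)                  # both close to 4.75
```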

Proposition B.1. There is a one-to-one correspondence between the interarrival distribution F and the renewal function M.

Proof. We introduce the Laplace transform L_B(s) = ∫_0^∞ e^{−sx} dB(x), where B : R+ → R+ is a nondecreasing and right-continuous function. By taking the Laplace transform L on both sides of formula (B.2) we obtain

L_M(s) = ∑_{j=1}^∞ L_{F^{*j}}(s) = ∑_{j=1}^∞ (L_F(s))^j = L_F(s)/(1 − L_F(s)),   (B.3)

or equivalently,

L_F(s) = L_M(s)/(1 + L_M(s)).

Hence L_F is determined by M, and since the Laplace transform determines the distribution, it follows that F is also determined by M. □

The function M(t) satisfies the following integral equation:

M(t) = F(t) + ∫_0^t M(t − x) dF(x),


i.e., M = F + M ∗ F, where ∗ denotes convolution. This equation is referred to as the renewal equation and is seen to hold by conditioning on the time of the first renewal. Upon doing so we obtain

M(t) = ∫_0^∞ E[Nt|T1 = x] dF(x) = ∫_0^t [1 + M(t − x)] dF(x) = F(t) + (M ∗ F)(t),

noting that if the first renewal occurs at time x, x ≤ t, then from this point on the process restarts, and thus the expected number of renewals in [0, t] is just 1 plus the expected number to arrive in a time t − x from an equivalent renewal process. A more formal proof is the following:

M(t) = ENt = E∑_{j=1}^∞ I(Sj ≤ t) = F(t) + E∑_{j=2}^∞ I(Sj ≤ t)
= F(t) + E∑_{j=2}^∞ I(Sj − S1 ≤ t − S1)
= F(t) + ∫_0^t E∑_{j=2}^∞ I(Sj − S1 ≤ t − s) dF(s)
= F(t) + ∫_0^t M(t − s) dF(s).

To generalize the renewal equation, we write

g(t) = h(t) + (g ∗ F)(t),   (B.4)

where h and F are known and g is an unknown function to be determined as a solution to (B.4). The solution of this equation is given by the following result.

Theorem B.2. If the function g satisfies (B.4) and h is bounded on finite intervals, then

g(t) = h(t) + (h ∗ M)(t)

is a solution to (B.4) and the unique solution that is bounded on finite intervals.

Proof. A proof of this result is given in Asmussen [8], p. 113. A simpler proof can, however, be given in the case where the Laplace transforms of $h$ and $g$ exist: taking Laplace transforms in (B.4) yields

$$L_g(s) = L_h(s) + L_g(s)L_F(s),$$

276 B Renewal Processes

and it follows that

$$L_g(s) = \frac{L_h(s)}{1 - L_F(s)} = L_h(s)\left[1 + \frac{L_F(s)}{1 - L_F(s)}\right] = L_h(s) + L_h(s)L_M(s) = L_{h + h*M}(s),$$

where the second last equality follows from (B.3). Since the Laplace transform uniquely determines the function, this gives the desired result. □

Using the (strong) law of large numbers, many results related to renewal processes can be established, including the following.

Theorem B.3. With probability one,

$$\frac{N_t}{t} \to \frac{1}{\mu} \quad \text{as } t \to \infty.$$

Proof. By definition of $N_t$, it follows that

$$S_{N_t} \le t \le S_{N_t+1}.$$

Hence,

$$\frac{S_{N_t}}{N_t} \le \frac{t}{N_t} \le \frac{S_{N_t+1}}{N_t}.$$

Now the strong law of large numbers states that with probability one, $S_j/j \to \mu$ as $j \to \infty$. As can be easily shown, $N_t \to \infty$ as $t \to \infty$, and thus

$$\frac{S_{N_t}}{N_t} \to \mu \quad \text{as } t \to \infty \ (P\text{-a.s.}).$$

By the same argument, we also see that with probability one,

$$\frac{S_{N_t+1}}{N_t} = \frac{S_{N_t+1}}{N_t+1} \cdot \frac{N_t+1}{N_t} \to \mu \cdot 1 = \mu \quad \text{as } t \to \infty.$$

The result follows. □
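Theorem B.3 is easy to observe by simulation. The sketch below (illustrative only; the Gamma interarrival law, horizon, and seed are arbitrary choices) generates one long renewal path and compares $N_t/t$ with $1/\mu$.

```python
import random

random.seed(42)

# interarrival times T ~ Gamma(2, 1), simulated as the sum of two Exp(1)
# variables, so mu = E[T] = 2
mu = 2.0
t_horizon = 10000.0

clock, n_renewals = 0.0, 0
while True:
    gap = random.expovariate(1.0) + random.expovariate(1.0)
    if clock + gap > t_horizon:
        break
    clock += gap
    n_renewals += 1

rate = n_renewals / t_horizon
print(rate)  # should be close to 1/mu = 0.5
```

By Theorem B.6 the standard deviation of $N_t$ here is roughly $\sqrt{t\sigma^2/\mu^3} = 50$, so $N_t/t$ deviates from $0.5$ by only about $0.005$ at this horizon.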

We now formulate some limiting results, without proof, including the Elementary Renewal Theorem, the Key Renewal Theorem, Blackwell's Theorem, and the Central Limit Theorem for renewal processes. Refer to Alsmeyer [1], Asmussen [8], Daley and Vere-Jones [58], and Ross [135] for proofs; see also Birolini [44]. Some of the results require that the distribution F is not periodic (lattice). We say that F is periodic if there exists a constant c, c > 0, such that T takes only values in {0, c, 2c, 3c, ...}.


Theorem B.4. (Elementary Renewal Theorem).

$$\lim_{t\to\infty} \frac{M(t)}{t} = \frac{1}{\mu}.$$

Theorem B.5. (Tightened Elementary Renewal Theorem). Assume that $\sigma^2 = \mathrm{Var}[T] < \infty$. If the distribution $F$ is not periodic, then

$$\lim_{t\to\infty}\left[M(t) - \frac{t}{\mu}\right] = \frac{\sigma^2 - \mu^2}{2\mu^2}.$$

Theorem B.6. Assume that $\sigma^2 = \mathrm{Var}[T] < \infty$. If the distribution $F$ is not periodic, then

$$\lim_{t\to\infty} \frac{\mathrm{Var}[N_t]}{t} = \frac{\sigma^2}{\mu^3}.$$

Before we state the Key Renewal Theorem, we need a definition. Let $g$ be a function defined on $\mathbb{R}_+$ and for $h > 0$ let

$$g_h^-(x) = \inf_{0 \le \delta \le h} g(x - \delta), \qquad g_h^+(x) = \sup_{0 \le \delta \le h} g(x - \delta).$$

We say that $g$ is directly Riemann integrable if for any $h > 0$,

$$h\sum_{n=1}^\infty |g_h^-(nh)| \quad \text{and} \quad h\sum_{n=1}^\infty |g_h^+(nh)|$$

are finite, and

$$\lim_{h\to 0+} h\sum_{n=1}^\infty g_h^-(nh) = \lim_{h\to 0+} h\sum_{n=1}^\infty g_h^+(nh).$$

In particular, a nonnegative, nonincreasing and integrable function is directly Riemann integrable. See [58, 88] for some other sufficient conditions for a function to be directly Riemann integrable.

Theorem B.7. (Key Renewal Theorem). Assume that the distribution $F$ is not periodic and $g$ is a directly Riemann integrable function. Then

$$\lim_{t\to\infty} \int_0^t g(t-s)\,dM(s) = \frac{1}{\mu}\int_0^\infty g(s)\,ds.$$

Remark B.8. An alternative formulation of the Key Renewal Theorem is the following: if $g$ is bounded and integrable with $g(t) \to 0$ as $t \to \infty$, then $\lim_{t\to\infty}\int_0^t g(t-s)\,dM(s) = (1/\mu)\int_0^\infty g(s)\,ds$, provided that $F$ is spread out. A distribution function is spread out if there exists an $n$ such that $F^{*n}$ has a nonzero absolutely continuous component with respect to Lebesgue measure, i.e., we can write $F^{*n} = G_1 + G_2$, where $G_1, G_2$ are nonnegative measures on $\mathbb{R}_+$, and $G_1$ has a density with respect to Lebesgue measure.


The Key Renewal Theorem is equivalent to Blackwell's Theorem below.

Theorem B.9. (Blackwell's Theorem). For a renewal process with a nonperiodic distribution $F$,

$$\lim_{t\to\infty} [M(t) - M(t-s)] = \frac{s}{\mu}.$$

If $F$ has a density $f$, then $M$ has a density $m$, and

$$m(t) = \sum_{j=1}^\infty f^{*j}(t),$$

where $f^{*1} = f$ and

$$f^{*j}(t) = \int_0^t f^{*(j-1)}(t-s)f(s)\,ds, \qquad j = 2, 3, \ldots.$$

Under certain conditions the renewal density $m(t)$ converges to $1/\mu$ as $t \to \infty$.

Theorem B.10. (Renewal Density Theorem). Assume that $F$ has a density $f$ with $f(t)^p$ integrable for some $p > 1$, and $f(t) \to 0$ as $t \to \infty$. Then $M$ has a density $m$ such that

$$\lim_{t\to\infty} m(t) = \frac{1}{\mu}.$$

Remark B.11. The conclusion of the theorem also holds true if $F$ has a density $f$ which is directly Riemann integrable, or if $F$ has finite mean and a bounded density $f$ satisfying $f(t) \to 0$ as $t \to \infty$.

Theorem B.12. (Central Limit Theorem). Assume that $\sigma^2 = \mathrm{Var}[T] < \infty$. Then $N_t$, suitably standardized, tends to a normal distribution as $t \to \infty$, i.e.,

$$\lim_{t\to\infty} P\left(\frac{N_t - t/\mu}{\sqrt{t\sigma^2/\mu^3}} \le x\right) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^x e^{-\frac{1}{2}u^2}\,du.$$

Next we formulate the limiting distribution of the forward and backward recurrence times $\alpha_t$ and $\beta_t$, defined by

$$\alpha_t = S_{N_t+1} - t, \qquad \beta_t = t - S_{N_t}.$$

The recurrence times $\alpha_t$ and $\beta_t$ are the time intervals from $t$ forward to the next renewal point and backward to the last renewal point (or to the time origin), respectively. Let $F_{\alpha_t}$ and $F_{\beta_t}$ denote the distribution functions of $\alpha_t$ and $\beta_t$, respectively. The following result is a consequence of the Key Renewal Theorem.


Theorem B.13. Assume that the distribution $F$ is not periodic. Then the asymptotic distributions of the forward and backward recurrence times are given by

$$\lim_{t\to\infty} F_{\alpha_t}(x) = \lim_{t\to\infty} F_{\beta_t}(x) = \frac{\int_0^x \bar{F}(s)\,ds}{\mu},$$

where $\bar{F} = 1 - F$. This asymptotic distribution of $\alpha_t$ and $\beta_t$ is called the equilibrium distribution.

A simple formula exists for the mean forward recurrence time; we have

$$ES_{N_t+1} = \mu(1 + M(t)). \qquad (B.5)$$

Formula (B.5) is a special case of Wald's equation (see, e.g., Ross [135]), and follows by writing

$$\begin{aligned}
ES_{N_t+1} &= E\sum_{k\ge 1} S_k I(N_t + 1 = k) = E\sum_{k\ge 1}\sum_{j=1}^k T_j I(N_t + 1 = k) \\
&= E\sum_{j\ge 1} T_j I(N_t + 1 \ge j) = E\sum_{j\ge 1} T_j I(S_{j-1} \le t) \\
&= \sum_{j\ge 1} ET_j\,EI(S_{j-1} \le t) = \mu\sum_{j\ge 0} F^{*j}(t) = \mu(1 + M(t)).
\end{aligned}$$
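Formula (B.5) can be checked by simulation in a case where $M$ is known in closed form. The sketch below (illustrative only; rate, horizon, and seed are arbitrary) uses a Poisson process, where $M(t) = \lambda t$, so $ES_{N_t+1} = \mu(1 + M(t)) = t + 1/\lambda$.

```python
import random

random.seed(7)

lam, t = 1.0, 10.0
reps = 20000
total = 0.0
for _ in range(reps):
    s = 0.0
    while s <= t:              # advance to the first renewal epoch after t
        s += random.expovariate(lam)
    total += s                 # s is S_{N_t + 1}

mean_first_after_t = total / reps
# (B.5) with M(t) = lam*t gives E S_{N_t+1} = (1/lam)*(1 + lam*t) = t + 1/lam
print(mean_first_after_t)
```

For the Poisson process the forward recurrence time $\alpha_t = S_{N_t+1} - t$ is again $\mathrm{Exp}(\lambda)$ by memorylessness, consistent with Theorem B.13: the exponential is its own equilibrium distribution.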

Finally in this section we prove a result used in the proof of Theorem 4.19, p. 122.

Proposition B.14. Let $g$ be a real-valued function which is bounded on finite intervals. Assume that

$$\lim_{t\to\infty} g(t) = g.$$

Then

$$\lim_{t\to\infty} \frac{1}{t}\int_0^t g(s)\,dM(s) = \frac{g}{\mu}.$$

Proof. To prove this result we use a standard $\varepsilon$ argument. Given $\varepsilon > 0$, there exists a $t_0$ such that $|g(t) - g| < \varepsilon$ for $t \ge t_0$. Hence for $t > t_0$ we have

$$\frac{1}{t}\int_0^t |g(s) - g|\,dM(s) \le \frac{1}{t}\int_0^{t_0} |g(s) - g|\,dM(s) + \frac{1}{t}\int_{t_0}^t \varepsilon\,dM(s).$$

Since $t_0$ is fixed, this gives, by applying the Elementary Renewal Theorem,

$$\limsup_{t\to\infty} \frac{1}{t}\int_0^t |g(s) - g|\,dM(s) \le \frac{\varepsilon}{\mu}.$$

The desired conclusion follows. □


B.2 Renewal Reward Processes

Let $(T, Y), (T_1, Y_1), (T_2, Y_2), \ldots$ be a sequence of independent and identically distributed pairs of random variables, with $T, T_j \ge 0$. We interpret $Y_j$ as the "reward" ("cost") associated with the $j$th interarrival time $T_j$. The random variable $Y_j$ may depend on $T_j$. Let $Z_t$ denote the total reward earned by time $t$. We see that if the reward is earned at the time of the renewal,

$$Z_t = \sum_{j=1}^{N_t} Y_j.$$

The limiting value of the average return is established using the law of large numbers and is given by the following result (cf. [135]).

Theorem B.15. If $E|Y|$ is finite, then

(i) with probability 1, $\dfrac{Z_t}{t} \to \dfrac{EY}{ET}$ as $t \to \infty$;

(ii) $\dfrac{EZ_t}{t} \to \dfrac{EY}{ET}$ as $t \to \infty$.

Remark B.16. The conclusions of Theorem B.15 also hold true if $Y \ge 0$, $EY = \infty$ and $ET < \infty$.

Many results from renewal theory can be generalized to renewal reward processes. For example, Blackwell's Theorem holds:

$$\lim_{t\to\infty} E[Z_t - Z_{t-s}] = \frac{s\,EY}{ET}.$$
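Theorem B.15(i) is straightforward to observe on a simulated path. The sketch below (an illustration, not from the book; the uniform interarrival law and the reward $Y = T^2$ are arbitrary choices made so that $Y$ depends on $T$) compares $Z_t/t$ with $EY/ET$.

```python
import random

random.seed(1)

# interarrival T ~ Uniform(0,1); reward Y = T^2 depends on T
# ET = 1/2, EY = E[T^2] = 1/3, so Z_t/t -> (1/3)/(1/2) = 2/3
t_horizon = 20000.0
clock, reward = 0.0, 0.0
while True:
    gap = random.random()
    if clock + gap > t_horizon:
        break
    clock += gap
    reward += gap * gap        # reward earned at the renewal epoch

avg = reward / t_horizon
print(avg)  # should be close to 2/3
```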

The following theorem, which is a reformulation of Theorem 3.2, p. 136, in [8], generalizes the Central Limit Theorem for renewal processes, Theorem B.12.

Theorem B.17. Suppose $\mathrm{Var}[Y] < \infty$ and $\mathrm{Var}[T] < \infty$. Then as $t \to \infty$,

$$\sqrt{t}\left[\frac{Z_t}{t} - \frac{EY}{ET}\right] \xrightarrow{D} N\left(0, \frac{\tau^2}{ET}\right),$$

where

$$\tau^2 = \mathrm{Var}\left[Y - \frac{EY}{ET}T\right] = \mathrm{Var}[Y] + \left(\frac{EY}{ET}\right)^2 \mathrm{Var}[T] - 2\,\frac{EY}{ET}\,\mathrm{Cov}[Y, T].$$


B.3 Regenerative Processes

The stochastic process $(X_t)$ is called regenerative if there exists a renewal process $(T_j)$ such that for $k \in \mathbb{N}$, $(X_t)_{t\ge 0} \stackrel{D}{=} (X_{t+S_k})_{t\ge 0}$, and

$$((X_{t+S_k})_{t\ge 0}, (T_j), j > k) \quad \text{and} \quad ((X_t)_{0\le t\le S_k}, T_1, T_2, \ldots, T_k)$$

are stochastically independent. Thus the continuation of the process beyond $S_k$ is a probabilistic replica of the whole process starting at 0. The random times $S_k$ are said to be regenerative points for the process $(X_t)$, and the time interval $[S_{k-1}, S_k)$ is called the $k$th cycle of the process.

In the following assume that the state space of $(X_t)$ equals $\mathbb{N}_0 = \{0, 1, 2, \ldots\}$. Let

$$P_k(t) = P(X_t = k), \qquad k \in \mathbb{N}_0.$$

The following result, taken from Ross [135], is stated without proof.

Theorem B.18. If the distribution of $T_1$ has an absolutely continuous component and $ET_1 < \infty$, then

$$\lim_{t\to\infty} P_k(t) = \frac{E\int_0^{T_1} I(X_t = k)\,dt}{ET_1}, \qquad k \in \mathbb{N}_0.$$

Remark B.19. We see that if $\lim_{t\to\infty} P_k(t) = P_k$ exists, then

$$\lim_{t\to\infty} \frac{1}{t}E\int_0^t I(X_s = k)\,ds = P_k.$$

The quantity $(1/t)E\int_0^t I(X_s = k)\,ds$ represents the expected portion of time the process is in state $k$ in $[0, t]$. Since

$$\frac{1}{t}E\int_0^t I(X_s = k)\,ds = \frac{1}{t}\int_0^t EI(X_s = k)\,ds = \frac{1}{t}\int_0^t P_k(s)\,ds,$$

this quantity is also equal to the average probability that the process is in state $k$.
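A standard regenerative example is the alternating renewal (up/down) process central to availability analysis: a cycle consists of an up period $U$ followed by a down period $D$, and Theorem B.18 gives the limiting probability of being up as $EU/(EU + ED)$. The sketch below (illustrative only; the exponential up/down laws, horizon, and seed are arbitrary) estimates this as a long-run fraction of time.

```python
import random

random.seed(3)

# alternating renewal (regenerative) process: up ~ Exp(1), down ~ Exp(2)
# cycle length T_1 = U + D; Theorem B.18 gives
#   lim P(up) = EU / (EU + ED) = 1 / (1 + 0.5) = 2/3
t_horizon = 5000.0
clock, up_time = 0.0, 0.0
while clock < t_horizon:
    up = random.expovariate(1.0)
    down = random.expovariate(2.0)
    up_time += min(up, t_horizon - clock)   # clip a partial last up period
    clock += up + down

frac_up = up_time / t_horizon
print(frac_up)  # should be close to 2/3
```

This is the time-average interpretation of Remark B.19 with $k$ the "up" state.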

B.4 Modified (Delayed) Processes

Consider a renewal process $(S_j)$ as defined in Sect. B.1, but assume now that the first interarrival time $T_1$ has a distribution $\tilde{F}$ that is not necessarily identical to $F$. The process is referred to as a modified renewal process (or a delayed renewal process). Similarly, we define a modified (delayed) renewal reward process and a modified (delayed) regenerative process. For the modified renewal reward process the distribution of the pair $(Y_1, T_1)$ is not necessarily the same as that of the pairs $(Y_i, T_i)$, $i = 2, 3, \ldots$.


It can be shown that all the asymptotic results presented in the previous sections of this appendix still hold true for the modified processes. If we take the first distribution to be equal to the asymptotic distribution of the recurrence times, given by Theorem B.13, p. 279, the renewal process becomes stationary in the sense that the distribution of the forward recurrence time $\alpha_t$ does not depend on $t$. Furthermore,

$$M(t+h) - M(t) = h/ET.$$
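The stationarity statement can be observed numerically. The sketch below (an illustration, not from the book; distribution, window, and seed are arbitrary) uses $\mathrm{Uniform}(0,1)$ interarrivals with $ET = 1/2$; the equilibrium distribution of Theorem B.13 then has density $2(1-x)$ on $[0,1]$, sampled by inversion as $x = 1 - \sqrt{1-u}$. Drawing the first gap from this distribution, the expected number of renewals in any window of length $h$ equals $h/ET$.

```python
import math
import random

random.seed(11)

# interarrivals ~ Uniform(0,1), ET = 1/2; equilibrium density 2*(1-x) on [0,1]
# is sampled by inversion: x = 1 - sqrt(1 - u)
t, h, reps = 5.0, 1.0, 20000
count = 0
for _ in range(reps):
    s = 1.0 - math.sqrt(1.0 - random.random())  # delayed (equilibrium) first gap
    while s < t:                                # advance past the window start
        s += random.random()
    while s < t + h:                            # count renewals in [t, t+h]
        count += 1
        s += random.random()

per_interval = count / reps
print(per_interval)  # should be close to h/ET = 2.0
```

With the ordinary (non-delayed) process the same count would oscillate with $t$ for moderate $t$; the equilibrium delay removes that transient exactly.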

References

[1] Alsmeyer, G. (1991) Erneuerungstheorie. Teubner Skripten zur Mathematischen Stochastik. B.G. Teubner, Stuttgart.

[2] Andersen, P. K., Borgan, Ø., Gill, R. and Keiding, N. (1992) Statistical Models Based on Counting Processes. Springer, New York.

[3] Arjas, E. (1993) Information and reliability: A Bayesian perspective. In: Barlow, R., Clarotti, C. and Spizzichino, F. (eds.): Reliability and Decision Making. Chapman & Hall, London, pp. 115-135.

[4] Arjas, E. (1989) Survival models and martingale dynamics. Scand. J. Statist. 16, 177-225.

[5] Arjas, E. (1981) A stochastic process approach to multivariate reliability systems: Notions based on conditional stochastic order. Mathematics of Operations Research 6, 263-276.

[6] Arjas, E. (1981) The failure and hazard processes in multivariate reliability systems. Mathematics of Operations Research 6, 551-562.

[7] Arjas, E. and Norros, I. (1989) Change of life distribution via hazard transformation: An inequality with application to minimal repair. Mathematics of Operations Research 14, 355-361.

[8] Asmussen, S. (1987) Applied Probability and Queues. Wiley, New York.

[9] Asmussen, S. (1984) Approximations for the probability of ruin within finite time. Scand. Actuarial J., 31-57.

[10] Aven, T. (2009) Optimal test interval for a monotone safety system. J. Applied Probability 46, 1-12.

[11] Aven, T. (1996) Availability analysis of monotone systems. In: S. Ozekici (ed.): Reliability and Maintenance of Complex Systems. NATO ASI Series F, Springer, Berlin, pp. 206-223.

[12] Aven, T. (1996) Condition based replacement times - a counting process approach. Reliability Engineering and System Safety. Special issue on Maintenance and Reliability 51, 275-292.

[13] Aven, T. (1992) Reliability and Risk Analysis. Elsevier Applied Science, London.

T. Aven and U. Jensen, Stochastic Models in Reliability, Stochastic Modellingand Applied Probability 41, DOI 10.1007/978-1-4614-7894-2,© Springer Science+Business Media New York 2013


[14] Aven, T. (1990) Availability evaluation of flow networks with varying throughput-demand and deferred repairs. IEEE Trans. Reliability 38, 499-505.

[15] Aven, T. (1987) A counting process approach to replacement models. Optimization 18, 285-296.

[16] Aven, T. (1985) A theorem for determining the compensator of a counting process. Scand. J. Statist. 12, 69-72.

[17] Aven, T. (1985) Reliability evaluation of multistate systems of multistate components. IEEE Trans. Reliability 34, 473-479.

[18] Aven, T. (1983) Optimal replacement under a minimal repair strategy - a general failure model. Adv. Appl. Prob. 15, 198-211.

[19] Aven, T. and Bergman, B. (1986) Optimal replacement times, a general set-up. J. Appl. Prob. 23, 432-442.

[20] Aven, T. and Castro, I. T. (2008) A delay time model with safety constraint. Reliability Engineering and System Safety 94, 261-267.

[21] Aven, T. and Dekker, R. (1997) A useful framework for optimal replacement models. Reliability Engineering and System Safety 58, 61-67.

[22] Aven, T. and Haukas, H. (1997) Asymptotic Poisson distribution for the number of system failures of a monotone system. Reliability Engineering and System Safety 58, 43-53.

[23] Aven, T. and Haukas, H. (1997) A note on the steady state availability of monotone systems. Reliability Engineering and System Safety 59, 269-276.

[24] Aven, T. and Jensen, U. (1998) A general minimal repair model. Research report, University of Ulm.

[25] Aven, T. and Jensen, U. (1998) Information based hazard rates for ruin times of risk processes. Research report, University of Ulm.

[26] Aven, T. and Jensen, U. (1997) Asymptotic distribution of the downtime of a monotone system. Mathematical Methods of Operations Research. Special issue on Stochastic Models of Reliability 45, 355-375.

[27] Aven, T. and Opdal, K. (1996) On the steady state unavailability of standby systems. Reliability Engineering and System Safety 52, 171-175.

[28] Aven, T. and Østebø, R. (1986) Two new component importance measures for a flow network system. Reliability Engineering 14, 75-80.

[29] Baker, R. D. and Christer, A. H. (1994) Review of delay-time OR modelling of engineering aspects of maintenance. European Journal of Operational Research 73, 407-422.

[30] Barlow, R. and Hunter, L. (1960) Optimum preventive maintenance policies. Operations Res. 8, 90-100.

[31] Barlow, R. and Proschan, F. (1965) Mathematical Theory of Reliability. Wiley, New York.

[32] Barlow, R. and Proschan, F. (1975) Statistical Theory of Reliability and Life Testing. Holt, Rinehart and Winston, New York.


[33] Basu, A. (1988) Multivariate exponential distributions and their applications in reliability. In: Krishnaiah, P. R. and Rao, C. R. (eds.): Handbook of Statistics 7. Quality Control and Reliability. North-Holland, Amsterdam, pp. 99-111.

[34] Baxter, L. A. (1981) Availability measures for a two-state system. J. Appl. Prob. 18, 227-235.

[35] Beichelt, F. (1993) A unifying treatment of replacement policies with minimal repair. Nav. Res. Log. Q. 40, 51-67.

[36] Beichelt, F. and Franken, F. (1984) Zuverlässigkeit und Instandhaltung. Carl Hanser Verlag, München.

[37] Berg, M. (1996) Economics oriented maintenance analysis and the marginal cost approach. In: Ozekici, S. (ed.): Reliability and Maintenance of Complex Systems. NATO ASI Series F, Springer, Berlin, pp. 189-205.

[38] Bergman, B. (1978) Optimal replacement under a general failure model. Adv. Appl. Prob. 10, 431-451.

[39] Bergman, B. (1985) On reliability theory and its applications. Scand. J. Statist. 12, 1-41.

[40] Bergman, B. and Klefsjö, B. (1994) Quality. Studentlitteratur, Lund.

[41] Bertsekas, D. (1995) Dynamic Programming and Optimal Control. Vol. 1 and 2. Athena Scientific, Belmont.

[42] Billingsley, P. (1979) Probability and Measure. Wiley, New York.

[43] Birnbaum, Z. W. (1969) On the importance of different components in a multicomponent system. In: Krishnaiah, P. R. (ed.): Multivariate Analysis II. Academic Press, pp. 581-592.

[44] Birolini, A. (1994) Quality and Reliability of Technical Systems. Springer, Berlin.

[45] Birolini, A. (1985) On the Use of Stochastic Processes in Modeling Reliability Problems. Lecture Notes in Economics and Mathematical Systems 252, Springer, Berlin.

[46] Block, H. W. and Savits, T. H. (1997) Burn-In. Statistical Science 12, 1-19.

[47] Block, H. W. and Savits, T. H. (1994) Comparison of maintenance policies. In: Shaked, M. and Shanthikumar, G. (eds.): Stochastic Orders and their Applications. Academic Press, Boston, pp. 463-484.

[48] Block, H. W., Borges, W. and Savits, T. H. (1985) Age-dependent minimal repair. J. Appl. Prob. 22, 370-385.

[49] Boland, P. and Proschan, F. (1994) Stochastic order in system reliability theory. In: Shaked, M. and Shanthikumar, G. (eds.): Stochastic Orders and their Applications. Academic Press, Boston, pp. 485-508.

[50] Bremaud, P. (1981) Point Processes and Queues. Martingale Dynamics. Springer, New York.

[51] Brown, M. and Proschan, F. (1983) Imperfect repair. J. Appl. Prob. 20, 851-859.


[52] Butler, D. A. (1979) A complete importance ranking for components of binary coherent systems, with extensions to multi-state systems. Nav. Res. Log. Q. 26, 565-578.

[53] Christer, A. H. (1999) Developments in delay time analysis for modelling plant maintenance. Journal of the Operational Research Society 50, 1120-1137.

[54] Christer, A. H. and Redmond, D. F. (1992) Revising models of maintenance and inspection. International Journal of Production Economics 24, 227-234.

[55] Cinlar, E. (1975) Superposition of point processes. In: Lewis, P. (ed.): Stochastic Point Processes. Wiley, New York, pp. 549-606.

[56] Constantini, C. and Spizzichino, F. (1997) Explicit solution of an optimal stopping problem: The burn-in of conditionally exponential components. J. Appl. Prob. 34, 267-282.

[57] Csenki, A. (1994) Cumulative operational time analysis of finite semi-Markov reliability models. Reliability Engineering and System Safety 44, 17-25.

[58] Daley, D. J. and Vere-Jones, D. (1988) An Introduction to the Theory of Point Processes. Springer, Berlin.

[59] Davis, M. H. A. (1993) Markov Models and Optimization. Chapman & Hall, London.

[60] Delbaen, F. and Haezendonck, J. (1985) Inversed martingales in risk theory. Insurance: Mathematics and Economics 4, 201-206.

[61] Dellacherie, C. and Meyer, P. A. (1978) Probabilities and Potential A. North-Holland, Amsterdam.

[62] Dellacherie, C. and Meyer, P. A. (1980) Probabilities and Potential B. North-Holland, Amsterdam.

[63] Dekker, R. (1996) A framework for single-parameter maintenance activities and its use in optimisation, priority setting and combining. In: Ozekici, S. (ed.): Reliability and Maintenance of Complex Systems. NATO ASI Series F, Springer, Berlin, pp. 170-188.

[64] Dekker, R. and Groenendijk, W. (1995) Availability assessment methods and their application in practice. Microelectron. Reliab. 35, 1257-1274.

[65] Donatiello, L. and Iyer, B. R. (1987) Closed-form solution for system availability distribution. IEEE Trans. Reliability 36, 45-47.

[66] Dynkin, E. B. (1965) Markov Processes. Springer, Berlin.

[67] Elliott, R. (1982) Stochastic Calculus and Applications. Springer, New York.

[68] Freund, J. E. (1961) A bivariate extension of the exponential distribution. J. Amer. Stat. Ass. 56, 971-977.

[69] Funaki, K. and Yoshimoto, K. (1994) Distribution of total uptime during a given time interval. IEEE Trans. Reliability 43, 489-492.

[70] Gaede, K.-W. (1977) Zuverlässigkeit, Mathematische Modelle. Carl Hanser Verlag, München.


[71] Gandy, A. (2005) Effects of uncertainties in components on the survival of complex systems with given dependencies. In: Wilson, A., Limnios, N., Keller-McNulty, S. and Armijo, Y. (eds.): Modern Statistical and Mathematical Methods in Reliability. World Scientific, New Jersey, pp. 177-189.

[72] Gasemyr, J. and Aven, T. (1999) Asymptotic distributions for the downtimes of monotone systems. J. Appl. Prob., to appear.

[73] Gaver, D. P. (1963) Time to failure and availability of paralleled systems with repair. IEEE Trans. Reliability 12, 30-38.

[74] Gertsbakh, I. B. (1989) Statistical Reliability Theory. Marcel Dekker, New York.

[75] Gertsbakh, I. B. (1984) Asymptotic methods in reliability: A review. Adv. Appl. Prob. 16, 147-175.

[76] Gnedenko, B. V. and Ushakov, I. A. (1995), edited by Falk, J. A. Probabilistic Reliability Engineering. Wiley, Chichester.

[77] Grandell, J. (1991) Aspects of Risk Theory. Springer, New York.

[78] Grandell, J. (1991) Finite time ruin probabilities and martingales. Informatica 2, 3-32.

[79] Griffith, W. S. (1980) Multistate reliability models. J. Appl. Prob. 15, 735-744.

[80] Grimmett, G. R. and Stirzaker, D. R. (1992) Probability and Random Processes. 2nd ed. Oxford Science Publications, Oxford.

[81] Haukas, H. and Aven, T. (1997) A general formula for the downtime of a parallel system. J. Appl. Prob. 33, 772-785.

[82] Haukas, H. and Aven, T. (1996) Formulae for the downtime distribution of a system observed in a time interval. Reliability Engineering and System Safety 52, 19-26.

[83] Heinrich, G. and Jensen, U. (1995) Parameter estimation for a bivariate lifetime distribution in reliability with multivariate extensions. Metrika 42, 49-65.

[84] Heinrich, G. and Jensen, U. (1996) Bivariate lifetime distributions and optimal replacement. Mathematical Methods of Operations Research 44, 31-47.

[85] Heinrich, G. and Jensen, U. (1992) Optimal replacement rules based on different information levels. Nav. Res. Log. Q. 39, 937-955.

[86] Henley, E. J. and Kumamoto, H. (1981) Reliability Engineering and Risk Assessment. Prentice Hall, New Jersey.

[87] Herberts, T. and Jensen, U. (1998) Optimal stopping in a burn-in model. Research report, University of Ulm.

[88] Hinderer, H. (1987) Remarks on directly Riemann integrable functions. Mathematische Nachrichten 130, 225-230.

[89] Hokstad, P. (1997) The failure intensity process and the formulation of reliability and maintenance models. Reliability Engineering and System Safety 58, 69-82.


[90] Høyland, A. and Rausand, M. (1994) System Reliability Theory. Wiley, New York.

[91] Hutchinson, T. P. and Lai, C. D. (1990) Continuous Bivariate Distributions, Emphasising Applications. Rumbsby Scientific Publishing, Adelaide.

[92] Jacod, J. (1975) Multivariate point processes: predictable projection, Radon-Nikodym derivatives, representation of martingales. Z. für Wahrscheinlichkeitstheorie und Verw. Gebiete 31, 235-253.

[93] Jensen, U. and Hsu, G. (1993) Optimal stopping by means of point process observations with applications in reliability. Mathematics of Operations Research 18, 645-657.

[94] Jensen, U. (1996) Stochastic models of reliability and maintenance: an overview. In: S. Ozekici (ed.): Reliability and Maintenance of Complex Systems. NATO ASI Series F, Springer, Berlin, pp. 3-36.

[95] Jensen, U. (1997) An optimal stopping problem in risk theory. Scand. Actuarial J., 149-159.

[96] Jensen, U. (1990) A general replacement model. ZOR - Methods and Models of Operations Research 34, 423-439.

[97] Jensen, U. (1990) An example concerning the convergence of conditional expectations. Statistics 21, 609-611.

[98] Jensen, U. (1989) Monotone stopping rules for stochastic processes in a semimartingale representation with applications. Optimization 20, 837-852.

[99] Joe, H. (1997) Multivariate Models and Dependence Concepts. Chapman & Hall, Boca Raton.

[100] Kallianpur, G. (1980) Stochastic Filtering Theory. Springer, New York.

[101] Kallenberg, O. (1997) Foundations of Modern Probability. Springer, New York.

[102] Kaplan, N. (1981) Another look at the two-lift problem. J. Appl. Prob. 18, 697-706.

[103] Karr, A. F. (1986) Point Processes and their Statistical Inference. Marcel Dekker, New York.

[104] Keilson, J. (1966) A limit theorem for passage times in ergodic regenerative processes. Ann. Math. Stat. 37, 866-870.

[105] Keilson, J. (1979) Markov Chain Models - Rarity and Exponentiality. Springer, Berlin.

[106] Keilson, J. (1987) Robustness and exponentiality in redundant repairable systems. Annals of Operations Research 9, 439-447.

[107] Kijima, M. (1989) Some results for repairable systems. J. Appl. Prob. 26, 89-102.

[108] Koch, G. (1986) A dynamical approach to reliability theory. Proc. Int. School of Phys. "Enrico Fermi," XCIV. North-Holland, Amsterdam, pp. 215-240.

[109] Kovalenko, I. N. (1994) Rare events in queueing systems - a survey. Queueing Systems 16, 1-49.


[110] Kovalenko, I. N., Kuznetsov, N. Y. and Pegg, P. A. (1997) Mathematical Theory of Reliability of Time Dependent Systems with Practical Applications. Wiley, New York.

[111] Kovalenko, I. N., Kuznetsov, N. Y. and Shurenkov, V. M. (1996) Models of Random Processes. CRC Press, London.

[112] Kozlov, V. V. (1978) A limit theorem for a queueing system. Theory of Probability and its Applications 23, 182-187.

[113] Kuo, W. and Kuo, Y. (1983) Facing the headaches of early failures: a state-of-the-art review of burn-in decisions. Proceedings of the IEEE 71, 1257-1266.

[114] Lam, T. and Lehoczky, J. (1991) Superposition of renewal processes. Adv. Appl. Prob. 23, 64-85.

[115] Last, G. and Brandt, A. (1995) Marked Point Processes on the Real Line - The Dynamic Approach. Springer, New York.

[116] Last, G. and Szekli, R. (1998) Stochastic comparison of repairable systems. J. Appl. Prob. 35, 348-370.

[117] Last, G. and Szekli, R. (1998) Time and Palm stationarity of repairable systems. Stoch. Proc. Appl., to appear.

[118] Leemis, L. M. and Beneke, M. (1990) Burn-in models and methods: a review. IIE Transactions 22, 172-180.

[119] Lehmann, A. (1998) Boundary crossing probabilities of Poisson counting processes with general boundaries. In: Kahle, W., Collani, E., Franz, J. and Jensen, U. (eds.): Advances in Stochastic Models for Reliability, Quality and Safety. Birkhäuser, Boston, pp. 153-166.

[120] Marcus, R. and Blumenthal, S. (1974) A sequential screening procedure. Technometrics 16, 229-234.

[121] Marshall, A. W. and Olkin, I. (1967) A multivariate exponential distribution. J. Amer. Stat. Ass. 62, 30-44.

[122] Metivier, M. (1982) Semimartingales, a Course on Stochastic Processes. De Gruyter, Berlin.

[123] Müller, A. and Stoyan, D. (2002) Comparison Methods for Stochastic Models and Risks. John Wiley & Sons, New York.

[124] Natvig, B. (1990) On information-based minimal repair and the reduction in remaining system lifetime due to the failure of a specific module. J. Appl. Prob. 27, 365-375.

[125] Natvig, B. (1988) Reliability: Importance of components. In: Johnson, N. and Kotz, S. (eds.): Encyclopedia of Statistical Sciences, vol. 8. Wiley, New York, pp. 17-20.

[126] Natvig, B. (1994) Multistate coherent systems. In: Johnson, N. and Kotz, S. (eds.): Encyclopedia of Statistical Sciences, vol. 5. Wiley, New York.

[127] Nelsen, R. B. (2006) An Introduction to Copulas. Springer, New York.

[128] Osaki, S. (1985) Stochastic System Reliability Modeling. World Scientific, Philadelphia.


[129] Phelps, R. (1983) Optimal policy for minimal repair. J. Opl. Res. 34, 425-427.

[130] Pierskalla, W. and Voelker, J. (1976) A survey of maintenance models: The control and surveillance of deteriorating systems. Nav. Res. Log. Q. 23, 353-388.

[131] Puterman, M. L. (1994) Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York.

[132] Rai, S. and Agrawal, D. P. (1990) Distributed Computing Network Reliability. 2nd ed. IEEE Computer Soc. Press, Los Alamitos, California.

[133] Rogers, C. and Williams, D. (1994) Diffusions, Markov Processes and Martingales, Vol. 1, 2nd ed. Wiley, Chichester.

[134] Rolski, T., Schmidli, H., Schmidt, V. and Teugels, J. (1999) Stochastic Processes for Insurance and Finance. Wiley, Chichester.

[135] Ross, S. M. (1970) Applied Probability Models with Optimization Applications. Holden-Day, San Francisco.

[136] Ross, S. M. (1975) On the calculation of asymptotic system reliability characteristics. In: Barlow, R. E., Fussel, J. B. and Singpurwalla, N. D. (eds.): Fault Tree Analysis. Society for Industrial and Applied Mathematics, SIAM, Philadelphia, PA.

[137] Schöttl, A. (1997) Optimal stopping of a risk reserve process with interest and cost rates. J. Appl. Prob. 35, 115-123.

[138] Serfozo, R. (1980) High-level exceedances of regenerative and semi-stationary processes. J. Appl. Prob. 17, 423-431.

[139] Shaked, M. and Shanthikumar, G. (1993) Stochastic Orders and their Applications. Academic Press, Boston.

[140] Shaked, M. and Shanthikumar, G. (1991) Dynamic multivariate aging notions in reliability theory. Stoch. Proc. Appl. 38, 85-97.

[141] Shaked, M. and Shanthikumar, G. (1986) Multivariate imperfect repair. Oper. Res. 34, 437-448.

[142] Shaked, M. and Szekli, R. (1995) Comparison of replacement policies via point processes. Adv. Appl. Prob. 27, 1079-1103.

[143] Shaked, M. and Zhu, H. (1992) Some results on block replacement policies and renewal theory. J. Appl. Prob. 29, 932-946.

[144] Sherif, Y. and Smith, M. (1981) Optimal maintenance models for systems subject to failure. A review. Nav. Res. Log. Q. 28, 47-74.

[145] Smith, M. (1998) Insensitivity of the k out of n system. Probability in the Engineering and Informational Sciences, to appear.

[146] Smith, M. (1997) On the availability of failure prone systems. PhD thesis, Erasmus University, Rotterdam.

[147] Smith, M., Aven, T., Dekker, R. and van der Duyn Schouten, F. A. (1997) A survey on the interval availability of failure prone systems. In: Proceedings ESREL'97 Conference, Lisbon, 17-20 June, 1997, pp. 1727-1737.

[148] Solovyev, A. D. (1971) Asymptotic behavior of the time to the first occurrence of a rare event. Engineering Cybernetics 9 (6), 1038-1048.


[149] Spizzichino, F. (1991) Sequential burn-in procedures. J. Statist. Plann. Inference 29, 187-197.

[150] Srinivasan, S. K. and Subramanian, R. (1980) Probabilistic Analysis of Redundant Systems. Lecture Notes in Economics and Mathematical Systems 175, Springer, Berlin.

[151] Stadje, W. and Zuckerman, D. (1991) Optimal maintenance strategies for repairable systems with general degree of repair. J. Appl. Prob. 28, 384-396.

[152] Szasz, D. (1977) A problem of two lifts. The Annals of Probability 5, 550-559.

[153] Szasz, D. (1975) On the convergence of sums of point processes with integer marks. In: Lewis, P. (ed.): Stochastic Point Processes. Wiley, New York, pp. 607-615.

[154] Takacs, L. (1957) On certain sojourn time problems in the theory of stochastic processes. Acta Math. Acad. Sci. Hungar. 8, 169-191.

[155] Thompson, W. A. (1988) Point Process Models with Applications to Safety and Reliability. Chapman and Hall, New York.

[156] Tijms, H. C. (1994) Stochastic Modelling and Analysis: A Computational Approach. Wiley, New York.

[157] Ushakov, I. A. (ed.) (1994) Handbook of Reliability Engineering. Wiley, Chichester.

[158] Valdez-Flores, C. and Feldman, R. (1989) A survey of preventive maintenance models for stochastically deteriorating single-unit systems. Nav. Res. Log. Q. 36, 419-446.

[159] Van der Duyn Schouten, F. A. (1983) Markov Decision Processes with Continuous Time Parameter. Math. Centre Tracts 164, Amsterdam.

[160] Van Heijden, M. and Schornagel, A. (1988) Interval uneffectiveness distribution for a k-out-of-n multistate reliability system with repair. European Journal of Operational Research 36, 66-77.

[161] Van Schuppen, J. (1977) Filtering, prediction and smoothing observations, a martingale approach. SIAM J. Appl. Math. 32, 552-570.

[162] Voina, A. (1982) Asymptotic analysis of systems with a continuous component. Kibernetika 18, 516-524.

[163] Wendt, H. (1998) A model describing damage processes and resulting first passage times. Research report, University of Magdeburg.

[164] Williams, D. (1991) Probability with Martingales. Cambridge University Press, Cambridge.

[165] Yashin, A. and Arjas, E. (1988) A note on random intensities and conditional survival functions. J. Appl. Prob. 25, 630-635.

[166] Yearout, R. D., Reddy, P. and Grosh, D. L. (1986) Standby redundancy in reliability - a review. IEEE Trans. Reliability 35, 285-292.

Index

Accumulated failure rate, 36
Age replacement, 175
Alternating renewal process, 14, 107, 161
Applications
  availability analysis of gas compression system, 13, 162
  reliability analysis of a nuclear power plant, 11
Associated variables, 30
Asymptotic results
  backward recurrence time, 113
  compound Poisson process, 152
  distribution of number of failures, 113, 125, 136
  distribution of time to failure, 126
  downtime distribution, 119, 145
  downtime distribution, interval, 153
  forward recurrence time, 113
  highly available systems, 127
  mean number of failures, 122
  multistate monotone system, 162
  number of failures, 116
  parallel system, 139
  series system, 142
Availability, 8, 106
  bound, 109, 114
  demand availability, 160
  interval (un)availability, 106
  interval reliability, 106
  limiting (un)availability, 109, 120, 168
  long run average, 121
  point availability, 106, 108, 120
  steady-state (un)availability, 109, 120
  throughput availability, 160

Backward recurrence time, 108, 113, 278, 279
Binomial distribution, 22
Birnbaum's measure, 28
Birth and death process, 168
Bivariate exponential distribution, 197
Blackwell's theorem, 278
Block replacement, 177
Bounded in Lp, 254
Bridge structure, 25
Brownian motion, 5, 67, 71
Burn-in, 202

Cadlag, 254
Central limit theorem, 280
Change of time, 267
Closure theorem, 38
Coefficient of variation, 126, 137, 154, 163, 167, 171
Coherent system, 20
Common mode failures, 30
Compensator, 62
Complex system
  binary monotone system, 2, 17
  hazard rate process, 73
  multistate monotone system, 31
Copula models, 42

T. Aven and U. Jensen, Stochastic Models in Reliability, Stochastic Modelling and Applied Probability 41, DOI 10.1007/978-1-4614-7894-2, © Springer Science+Business Media New York 2013


Compound Poisson process, 4, 67, 152
Concordant, 46
Conditional expectation, 249
Control limit rule, 194
Copula, 42
  Archimedian, 49
Counting process, 7, 62, 114
  compensator, 62
  intensity, 63
  predictable intensity, 64
Cox process, 92
Critical component, 29
Critical path vector, 232
Cut set, 20
Cut vector, 32

Damage models, 3
Decreasing
  (a,b)-decreasing, 188
Delay time model, 215
Delayed renewal process, 281
Demand availability, 160
Demand rate, 160
Dependence structure, 43
Dependent components, 30
  failure rate process, 73
  optimal replacement, 197
DFR (Decreasing Failure Rate), 5, 35
DFRA (Decreasing Failure Rate Average), 36
Discounted cost, 195
Doob–Meyer decomposition, 263
Downtime
  distribution bounds, 118
  distribution given failure, 145
  distribution of the ith failure, 149
  distribution, interval, 118, 151
  mean, interval, 116
  steady-state distribution, 145

Elementary renewal theorem, 277
Equilibrium distribution, 113, 279
Erlang distribution, 163
Expectation, 247
Exponential distribution, 131
  asymptotic limit, 127
  mean number of system failures, 121
  parallel system, 139
  regenerative process, 124
  renewal density, 114
  standby system, 169
  unavailability bound, 115
Exponential formula, 80

Factoring algorithm, 25
Failure rate, 1, 6, 12, 14, 26, 36, 64
  accumulated, 36
  process, 6, 65
  process, monotone, 77
  system, 123
Filtration, 57, 255
  complete, 255
  subfiltration, 69
Finite variation, 266
Flow network, 32
Forward recurrence time, 108, 113, 146, 278, 279

Gas compression system, 13, 162
General repair strategy, 207

Harvesting problem, 182
Hazard function (cumulative), 1
Hazard rate, 1, 64
Hazard rate process, 70

IFR (Increasing Failure Rate), 3, 5, 35
IFR closure theorem, 40
IFRA (Increasing Failure Rate Average), 3, 36
IFRA closure theorem, 39
Inclusion–exclusion method, 23, 33
Increasing
  (a,b)-increasing, 188
Independence, 247
Indicator process, 70
Indistinguishable, 254
Infinitesimal generator, 66
Infinitesimal look ahead, 181
Information levels, 4
  change of, 78
Information-based replacement, 194
Innovation martingale, 63
Inspection, 229
Integrability, 58
Integrable, 254
Intensity, 63
  marked point process, 83


Interval (un)availability, 106
Interval reliability, 106, 110, 129
Inverse Gaussian distribution, 3, 5

k-out-of-n system, 19
  reliability, 22
Key renewal theorem, 277

Lp-space, 248
Laplace transform, 128, 141, 274
Lifetime distribution, 1, 26, 34
Long run average cost, 195
Lost throughput distribution, 159

Maintenance, 7
Marginal cost analysis, 179
Marked point process, 4, 81
Markov modulated Poisson process, 65
Markov modulated repair process, 208
Markov process, 66
  pure jump process, 66
Markov theory, 168
Marshall–Olkin distribution, 52
Martingale, 59, 259
  innovation, 63
  orthogonal, 265
  submartingale, 260
  supermartingale, 259
Minimal cut set, 20
Minimal cut vector, 32
Minimal path set, 20
Minimal path vector, 32
Minimal repairs, 90
  black box, 91
  optimal operating time, 208
  physical, 91
  statistical, 91
Modified renewal process, 281
Monotone case, 181
Monotone system, 2, 17, 231
  distribution of number of system failures, 125
  downtime distribution, 148
  k-out-of-n system, 19
  mean number of system failures, 121
  multistate, 158
  parallel system, 18
  point availability, 120
  series system, 18, 142
  steady-state availability, 120
Monte Carlo simulation, 10
MTTF (Mean Time To Failure), 8, 107
MTTR (Mean Time To Repair), 8, 14, 107
Multistate monotone system, 31, 158, 168
Multivariate point process, 62

NBU (New Better than Used), 37
NBUE (New Better than Used in Expectation), 37
Normal distribution, 114, 119, 136, 157
Number of system failures
  asymptotic results, 135
  distribution, 109, 125
  limiting mean, 116
  mean, 109, 121
  standby system, 171
NWU (New Worse than Used), 37
NWUE (New Worse than Used in Expectation), 37

Optimal replacement, 9
  age replacement, 175
  block replacement, 177
  complex system, 194
  general repair strategy, 207
Optimal stopping problem, 180
Optimization criterion, 180
Optional sampling theorem, 67, 262

Parallel system, 6, 18, 139
  downtime distribution, 146
  downtime distribution of first failure, 150
  downtime distribution, interval, 153
  optimal replacement, 9
  reliability, 22
  repair constraints, 165
Partial information, 197, 208
Path set, 20
Path vector, 32
Performance measures, 14, 105, 168


Phase-type distribution, 125, 163
PLOD (positive lower orthant dependent), 46
Point process, 62
  compound, 87
  marked point process, 81
  multivariate, 62
Poisson approximation, 8, 125
Poisson distribution, 136, 143
Poisson process, 4, 65
  doubly stochastic, 92
  Markov modulated, 92
  nonhomogeneous, 92
Predictable
  intensity, 64
  projection, 63
  variation, 264
Predictable process, 58, 256
Preventive replacement, 175
Probability space, 57, 246
Product rule, 268
Progressively measurable process, 58, 256
PUOD (positive upper orthant dependent), 46

Quadratic variation, 265

Random variable, 247
Regenerative process, 124, 167, 281
Regular conditional expectation, 252
Reliability, 21
Reliability block diagram, 18
Reliability engineer, 9
Reliability importance measure, 27, 34
  Birnbaum's measure, 28
  improvement potential, 28
  Vesely–Fussell's measure, 28
Reliability modeling, 9
Renewal density, 114, 121, 148
Renewal density theorem, 278
Renewal equation, 275
Renewal function, 274
Renewal process, 64, 273
  alternating, 82, 107
  delayed, 281
  intensity, 65
  modified, 281
Renewal reward process, 280
Repair models, 81
  minimal repairs, 90
  varying degrees, 97
Repair replacement model, 207
Replacement model, 175
Risk process, 98
Ruin time, 99

Safety constraint, 216
Safety system, 229
Semi-Markov theory, 169
Semimartingale, 267
  change of filtration, 69
  product rule, 68
  semimartingale representation, 59
  smooth semimartingale (SSM), 59
  transformations, 68
Series system, 13, 18
  lifetime distribution, 27
  reliability, 21
Shock model, 86, 185, 193
Shock process, 4
Simpson's paradox, 4
Standby system, 166
  ample repair facilities, 172
  one repair facility, 169
Stationary process, 119
Steady-state, 119
Stochastic comparison, 40
Stochastic order, 46
Stochastic process
  predictable, 58
  progressively measurable, 58
Stopping problem, 183
Stopping time, 59, 257
  predictable, 72
  totally inaccessible, 72
Structural importance, 29
Structure function, 18
Subadditive, 49
Subfiltration, 69, 190
Submartingale, 59
Supermartingale, 59


Survival probability, 1, 13
System failure rate, 123, 127, 136
System failures, 85
System reliability, 21

Throughput availability, 160
Time to system failure
  asymptotic distribution, 126
  parallel system, 140

Unavailability, 109
Uniformly integrable, 58, 254
Usual conditions, 255

Wiener process, 3, 5