
FAULT TOLERANCE IN ARTIFICIAL NEURAL NETWORKS

Are Neural Networks Inherently Fault Tolerant?

George Ravuama Bolt

D.Phil. Thesis

University of York

Advanced Computer Architecture Group

Department of Computer Science

November 1992

ABSTRACT

This thesis has examined the resilience of artificial neural networks to the

effect of faults. In particular, it addressed the question of whether neural

networks are inherently fault tolerant. Neural networks were visualised from

an abstract functional level rather than a physical implementation level to

allow their computational fault tolerance to be assessed.

This high-level approach required a methodology to be developed for the

construction of fault models. Instead of abstracting the effects of physical

defects, the system itself was abstracted and fault modes extracted from this

description. Requirements for suitable measures to assess a neural network's

reliability in the presence of faults were given, and general measures

constructed. Also, simulation frameworks were evolved which could allow

comparative studies to be made between different architectures and models.

It was found that a major influence on the reliability of neural networks is

the uniform distribution of information. Without this property, critical faults may cause failure for certain regions of input space. This led to new techniques being developed which ensure uniform storage.

It was shown that the basic perceptron unit possesses a degree of fault

tolerance related to the characteristics of its input data. This implied that

complex perceptron based neural networks can be inherently fault tolerant

given suitable training algorithms. However, it was then shown that

back-error propagation for multi-layer perceptron networks (MLP's) does

not produce a suitable weight configuration.

A technique involving the injection of transient faults during back-error

propagation training of MLP's was studied. The computational factor in the

resulting MLP's responsible for their resilience to faults was then identified. This led to a much simpler construction method which does not involve lengthy

training times. It was then shown why the conventional back-error

propagation algorithm does not produce fault tolerant MLP's.

It was concluded that a potential for inherent fault tolerance does exist in

neural network architectures, but it is not exploited by current training

algorithms.

i

CONTENTS

Abstract
Contents
List of Figures
List of Tables
List of Graphs
Acknowledgements
Declaration

1. Introduction
1.1. Thesis Aims
1.2. Motivation
1.3. Terminology
1.3.1. Neural Networks
1.3.2. Reliability Theory
1.4. Thesis Overview
1.4.1. Chapter 2: Reliable Neural Networks
1.4.2. Chapter 3: Concepts
1.4.3. Chapter 4: A Methodology for Fault Tolerance
1.4.4. Chapter 5: ADAM
1.4.5. Chapter 6: Multi-Layer Perceptron Networks
1.4.6. Chapter 7: Conclusions
1.4.7. Appendix A: Fault Tolerance of Lateral Interaction Networks
1.4.8. Appendix B: Glossary
1.4.9. Appendix C: Data from ADAM Simulations

2. Reliable Neural Networks
2.1. Introduction
2.2. Frameworks for Analysing Fault Tolerance
2.2.1. Fault Models
2.2.2. Assessing Fault Tolerance
2.2.3. Simulation Frameworks
2.3. Redundancy
2.3.1. Modular Redundancy
2.3.2. Distributed vs. Local Representations
2.3.3. Input and Output Representations
2.3.4. Computational Complexity and Capacity
2.3.5. Basins of Attraction
2.4. Reliability during the Learning Phase
2.4.1. Retraining
2.5. Fault Management
2.6. Analysis of Specific Neural Network Models
2.6.1. Hopfield Neural Network Model
2.6.2. Multi-Layer Perceptron Model
2.6.3. CMAC Networks
2.6.4. Compacta Networks
2.7. Fault Tolerance Techniques for Neural Networks
2.8. Fault Tolerance of "Real" Neural Networks
2.9. Conclusions

3. Concepts
3.1. Introduction
3.2. Learning in Neural Networks
3.2.1. Supervised Learning
3.3. Distribution
3.4. Generalisation
3.4.1. Local vs. Global Generalisation
3.4.2. Interpolation vs. Inexact Classification
3.4.3. Fault Tolerance as a Constraint
3.5. Architectural Aspects of Neural Networks
3.6. Failure in Neural Networks
3.7. Problem Classification
3.7.1. Soft Problem Domains
3.7.2. Considerations for Graceful Degradation
3.8. Computational Fault Tolerance
3.9. Verifying an Adaptive System
3.10. Conclusions

4. A Methodology for Fault Tolerance
4.1. Introduction
4.2. Fault Models
4.3. Visualisation Levels for Neural Networks
4.3.1. Abstract Level
4.3.2. Role of Fault Models
4.4. Conventional Fault Models
4.5. Fault Locations
4.5.1. Fault Locations for Neural Networks
4.5.2. Example
4.6. Fault Manifestations
4.6.1. Example
4.6.2. Threshold Function
4.6.3. Differential of Threshold Function
4.6.4. Weights
4.6.5. Topology
4.6.6. Other Fault Locations
4.7. Spatial and Temporal Considerations
4.8. Summary
4.9. Functional Fault Models
4.10. Fault Coverage
4.11. Assessing Reliability
4.12. Failure in Neural Networks
4.12.1. Measuring Failure
4.12.2. Applying Failure Measures
4.12.3. Example
4.13. Relationship to Fault Tolerance
4.14. Empirical Frameworks
4.14.1. Timescales
4.14.2. Fault Injection Methods
4.14.3. Example
4.14.4. Mean-Time-Before-Failure Methods
4.14.5. Example
4.14.6. Service Degradation Methods
4.14.7. Example
4.14.8. Summary of Simulation Frameworks
4.15. Conclusions

5. ADAM
5.1. Introduction
5.2. The ADAM System
5.2.1. Recall of Stored Vectors
5.2.2. Teaching the ADAM System
5.2.3. Memory Saturation
5.3. Fault Tolerance
5.3.1. Fault Model
5.3.2. Software Simulation of Faults
5.3.3. Experimental Approaches
5.4. Uniform Storage in ADAM
5.4.1. Input Data
5.4.2. Analysis of Bit Density on Storage
5.4.3. Analysis for Tuple Storage P.d.f.
5.4.4. Input Data Independent ADAM
5.4.5. Implications for Fault Tolerance
5.4.6. Conclusions for Uniform Storage
5.5. Failure Prediction for Single Tuple ADAM Systems
5.5.1. Storage Distribution within a Memory Matrix
5.5.2. Effect of Faults
5.5.3. Failure
5.5.4. Comparison with Empirical Results
5.5.5. Relation of Tuple Size to Probability of Failure
5.6. Failure Prediction for Multiple Tuple ADAM Systems
5.7. Fault Tolerance Analysis
5.7.1. Varying Number of Tuple Units
5.7.2. Varying Number of Input Patterns
5.8. Conclusions
Graphs from Fault Analysis

6. Multi-Layer Perceptrons
6.1. Introduction
6.2. Construction of Training Sets
6.3. Perceptron Units
6.3.1. Fault Tolerance of Perceptron Units
6.3.2. Empirical Analysis
6.3.3. Alternative Visualisation of a Perceptron's Function
6.4. Multi-Layer Perceptrons
6.4.1. Back-Error Propagation
6.4.2. Fault Model for MLP's
6.5. Analysis of the Effect of Faults in MLP's
6.5.1. Bipolar Thresholded Units
6.5.2. Binary Thresholded Units
6.5.3. Comparison between Data Representations
6.5.4. Conversion of Binary to Bipolar Thresholded MLP
6.6. Fault Tolerance of MLP's
6.6.1. Distribution of Information in MLP's
6.6.2. Analysis of Back-Error Propagation Learning
6.7. Training for Fault Tolerance
6.7.1. Training with Weight Faults
6.7.2. Comparison with Clay and Sequin's Technique
6.8. Analysis of Trained MLP
6.8.1. Analysis of Fault Injection Training
6.8.2. Comparison with MLP trained injecting unit faults
6.8.3. New Technique for Fault Tolerant MLP's
6.9. Results of Scaled MLP Fault Tolerance
6.10. Consequences for Generalisation
6.11. Uniform Hidden Representations
6.12. Conclusions

7. Conclusions
7.1. Overview
7.2. Basis for Inherent Fault Tolerance
7.3. Fault Tolerance Mechanisms
7.3.1. Uniform Fault Tolerance
7.3.2. Modular Redundancy
7.3.3. Architectural Considerations in ADAM
7.3.4. Learning in Multi-Layer Perceptron Networks
7.4. Inherent Fault Tolerance?
7.5. Implications for Future Research
7.5.1. Generalisation
7.5.2. Internal Representations
7.5.3. Implementations
7.5.4. Neural Fault Tolerance

A. Fault Tolerance of Lateral Interaction Networks
A.1. Introduction
A.2. Soft/Rigid Application Areas
A.2.1. Implications for Reliability
A.2.2. Verification
A.3. Lateral Inhibition
A.3.1. Network Dynamics
A.3.2. Operational Behaviour
A.3.3. Stabilisation
A.4. Fault Model
A.4.1. Timescale
A.5. Definition of Failure
A.5.1. System Failure
A.5.2. Component Failure
A.6. Empirical Investigations
A.6.1. Edge Enhancing
A.6.2. Neighbourhood Formation
A.7. Conclusions

B. Glossary
C. Data from ADAM Simulations
References

LIST OF FIGURES

1.1 Connectivity in neural networks
1.2 Functional diagram of unit in neural network
3.1 Distribution of a noisy input pattern does not match its generalisation distribution in input space
3.2 Forms of Generalisation: a) Functional Interpolation, b) Inexact Classification
3.3 Require sufficient training examples to constrain a neural network to represent underlying problem
3.4 Effect of a fault in solution space
4.1 Visualisation Levels for Neural Networks: (a) Implementation, (b) Architectural, (c) Abstract
4.2 Multi-Layer Perceptron Neural Network
4.3 Multi-Layer Perceptron Neural Network
4.4 Graph of Threshold Function (a) Continuous, (b) Discrete
4.5 Active weight fault representing a unit which always tries to misclassify its input
4.6 Comparing reliability of systems with different characteristics to assess fault tolerance
5.1 Schematic of the ADAM System
5.2 Distribution of Storage in Matrix Span
5.3 Average Distribution of Storage in Matrix Span
5.4 Non-independence between tuple units in ADAM
6.1 Separating hyperplane for maximal fault tolerance
6.2 Multi-Layer Perceptron Neural Network
6.3 Plot of common multiplicative term in BP algorithm
6.4 Clustering of units' activations around +/- p
6.5 Positioning and width of squashing function's slope of three units' hyperplanes between two classes for (a) Normal BP, (b) Stretching weights during training
A.1 Lateral interaction network, dotted lines show how weights correspond to Mexican-hat function
A.2 Lateral interaction network functions (a) Clustering, and (b) High-frequency filter (LF - Low Frequency, HF - High Frequency)
A.3 Faults affecting global weight vector

LIST OF TABLES

5.1 Predicted vs. Experimental for 2, 3 and 4-Tuple Units
5.2 Memory saturation values when varying number of stored patterns
6.1 Change to fault-free activation of output unit caused by hidden unit failure
C.1 Probability of failure for various numbers of 2-tuple units
C.2 Probability of failure for various numbers of 3-tuple units
C.3 Probability of failure for various numbers of 4-tuple units
C.4 Probability of failure for various levels of memory saturation using 2-tuple units
C.5 Probability of failure for various levels of memory saturation using 3-tuple units
C.6 Probability of failure for various levels of memory saturation using 4-tuple units

LIST OF GRAPHS

5.1 Storage Distribution Results for Brodatz Texture Images
5.2 Preprocessing Technique applied to Basic System
5.3 Doubling Number of Tuple Units
5.4 Comparing Preprocessing Technique to Doubling Number of Units
5.5 Predicted vs. Experimental for 2, 3 and 4-Tuple Units
5.6 Probability of failure for varying sized tuple units but equal memory saturation
5.7 Predicted vs. Experimental for 2, 3 and 4-Tuple Units
5.8 Comparison of fault tolerance with varying number of patterns stored
5.9 Service degradation results using various numbers of 2-tuple units
5.10 Service degradation results using various numbers of 3-tuple units
5.11 Service degradation results using various numbers of 4-tuple units
5.12-5.15 Fault injection results for various numbers of 2-tuple units
5.16-5.19 Fault injection results for various numbers of 3-tuple units
5.20-5.23 Fault injection results for various numbers of 4-tuple units
5.24 Service degradation results for 2-tuple units using various numbers of patterns stored
5.25 Service degradation results for 3-tuple units using various numbers of patterns stored
5.26 Service degradation results for 4-tuple units using various numbers of patterns stored
5.27-5.30 Fault injection results for 2-tuple units using various numbers of patterns stored
5.31-5.34 Fault injection results for 3-tuple units using various numbers of patterns stored
5.35-5.38 Fault injection results for 4-tuple units using various numbers of patterns stored
6.1 Binary vs. Bipolar Representation in Perceptron Unit
6.2 Proportion of failed patterns due to 10% weight faults
6.3 Maximum output unit error due to 10% weight faults
6.4 Comparison of weight vector directions in MLP's trained with weight faults, a) single fault injection, and b) double fault injection
6.5 Comparison of weight vector lengths in MLP's trained with weight faults, a) single fault injection, and b) double fault injection
6.6 Comparing training with weight faults and unit faults
6.7 Comparison of operation tolerance to faults after weight injection training and unit injection training
6.8 Output error of MLP with 8 hidden units over time
6.9 Fault Tolerance of MLP for various numbers of hidden units
6.10 Number of weight faults tolerated before failure occurs given different values for weight stretching factors
6.11 Average and minimum Hamming distances between internal representations for various sized hidden layers
6.12 Theoretical bound to maximum Hamming Distance between internal representations
A.1 Effect of varying the excitatory/inhibitory weights
A.2 Variation in Pr(failure) due to dataset characteristics
A.3 Standard deviation for various datasets/weight values
A.4 Combined results for edge enhancing
A.5 Combined results for neighbourhood formation
A.6 Combined results for best-match

ACKNOWLEDGEMENTS

I am very grateful for the time and effort of my supervisor, Dr. James

Austin, in guiding me through this D.Phil. over the last three years. I would also especially like to thank Dr. Gary Morgan for much valuable

discussion on reliability and fault tolerance mechanisms. I am indebted

to Dr. David Martland for introducing me to neural networks during my

B.Sc. at Brunel University. Many thanks to all my friends and colleagues

who have helped me at various stages. I would particularly like to thank

Mike Carter, Bruce Segee, Tom Jackson and Alan Dix for their help.

Lastly, the patience and encouragement given by my family was of great

assistance.


DECLARATION

Various parts of this thesis have been published in conference proceedings, technical

reports, and journals. These are listed below by chapter:

Chapter 3:

Bolt, G.R., "Fault Tolerance and Robustness in Neural Networks",

IJCNN-91, Seattle 2, pp.A-986 (July 1991).

Chapter 4:

Bolt, G.R., "Assessing the Reliability of Artificial Neural Networks",

IJCNN-91, Singapore 1, pp.578-583 (November 1991).

Bolt, G.R., "Fault Models for Artificial Neural Networks", IJCNN-91,

Singapore 3, pp.1918-1923 (November 1991).

Bolt, G.R., "Investigating Fault Tolerance in Artificial Neural

Networks", YCS 154, Dept. of Computer Science, University of York,

UK (March 1991).

Chapter 5:

Bolt, G.R., Austin, J. and Morgan, G., "Operational Fault Tolerance of

the ADAM Neural Network System", IEE 2nd Int. Conf. Artificial

Neural Networks, Bournemouth, pp.285-289 (November 1991).

Bolt, G.R., Austin, J. and Morgan, G., "Uniform Tuple Storage", Pattern

Recognition Letters 13, pp.339-344 (May 1992).


Chapter 6:

Bolt, G.R., Austin, J. and Morgan, G., "Fault Tolerant Multi-Layer

Perceptrons", YCS 180, Dept. of Computer Science, University of York,

UK (1992).

Appendix A:

Bolt, G.R., "Fault Tolerance of Lateral Interaction Networks",

IJCNN-91, Singapore 2, pp.1373-1378 (November 1991).


CHAPTER ONE

Introduction

1.1. Thesis Aims

This thesis has two principal objectives which address the reliability of artificial neural

networks. The first will be to investigate and quantify any innate reliability which they

may possess. This will involve gaining an understanding of any existing fault tolerance

mechanisms in artificial neural networks. The second objective will be to find ways to

increase their reliability. To limit the scope of the study, only feedforward neural

networks are considered.

One of the principal questions that will be addressed in meeting the first objective is

whether neural networks are inherently fault tolerant. This property has often been

attributed to neural networks, but no sound arguments have been given to confirm or

deny it.

An essential stage in achieving the above objectives is that the consequences of inherent

or new fault tolerance mechanisms in neural networks can be analysed. Therefore, an

underlying aim will be to define a methodology for analysing the effect of faults on the

reliability of a neural network. Note that the effect of faults will only be considered at

an abstract computational level; no implementations will be analysed. This will allow a neural network's computational fault tolerance, arising from the nature of its processing method, to be understood. It will then be possible for future implementations to be guided by this information, as well as by the application of conventional fault tolerance techniques.

It is assumed that the reader has a basic knowledge of both neural networks and fault

tolerance. However, an overview will be given of the neural network models that are

examined in various chapters. For introductory texts on neural networks, see [1], [2],


and [3]. Reliability theory (see section 1.3.2) and descriptions of various conventional

fault tolerance techniques can be found in [4], [5], and [6].

1.2. Motivation

It is important that the reliability of neural networks can be assessed since it is very

likely that the correct operation of potential applications will need to be ensured over part of their lifetime. This will be especially true for safety-critical systems. For instance,

neural networks appear to be well-suited for use in control systems, and it will be vital

to know the effect of faults on the system's operation. Also, very high levels of

reliability may be required for some applications. This implies that suitable fault

tolerance techniques need to be developed for neural networks to achieve this aim.

Considering the general architecture of neural networks, it is apparent that they consist

of a very large number of functionally simple components. Ensuring that individual components are reliable would require a very large degree of redundancy, which may not be cost-effective or even practical in reality. However, if the overall operation of a neural network can inherently resist the effect of such faults, then

it would imply that fault tolerance techniques need only be applied at a higher

functional level.

1.3. Terminology

This section will provide a brief overview of the main terminology used in this thesis. An

extended glossary of terms is given in appendix B for reference. Various terms and

concepts will be described in the next two sections on neural networks and reliability

theory respectively.

1.3.1. Neural Networks

Neural networks provide a parallel processing environment capable of learning to solve

problems from certain domains such as pattern recognition, control of dynamical

systems, and content addressable memory. However, they are not suitable for problems

normally associated with conventional logic based computing systems, such as

performing rapid arithmetic operations.


A neural network's architecture consists of many processing units possessing very

simple computational abilities, their inputs and output being either a discrete or

continuous scalar value. The connectivity between units is unidirectional, but often

extremely complex. Its nature gives rise to a taxonomy of neural networks (see figure

1.1 below). In feedforward neural networks, the output from a unit has no direct or indirect effect on its own operation (ignoring feedback that arises from the operation of the neural network on the environment in which it exists), i.e. no loops exist. When this restriction does not hold, neural networks are termed feedback or recurrent.

Figure 1.1 Connectivity in neural networks: (a) Feedforward Neural Network, with inputs, hidden units and outputs joined by feedforward connections; (b) Feedback Neural Network, which also contains feedback connections.

Associated with each connection between two units is a numerical value termed a

weight that modifies the scalar output value fed to the receiving unit. In general, these

weights are the only parameters that can be modified in a neural network to determine

its operation. However, the basic nature of its operation is influenced by the sequence in

which its units are updated, i.e. when they output new values based on their current

input. Three updating rules can be identified: synchronous, sequential or asynchronous.

In a synchronous neural network, all units are updated simultaneously. A sequential

neural network is similar, except that units are updated one at a time in a fixed order.

Finally, if units are updated on an individual basis with no fixed ordering, then its

operation is termed asynchronous.

The function computed by an individual unit can now be defined as

$$\mathrm{output} = f_{\mathrm{squashing}}\big( g(\mathbf{w}, \mathbf{x}) \big)$$

where the components of vector w are the weights on its incoming connections, and the components of x are the input values applied to each connection (see figure 1.2). Note


that these could be the output of other processing units in the neural network. The

function f_squashing, which modifies the result of the joint interaction g of inputs and weights, is often called the squashing function or thresholding function. These terms refer to its role of limiting the absolute magnitude of a unit's output. The activation of a

unit is the result of g.
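To make this concrete, the following minimal Python sketch (illustrative code, not part of the thesis) computes the output of a single unit, taking g to be the usual weighted sum and f_squashing to be a sigmoid; the sigmoid is an assumption here, since discrete threshold functions are equally valid choices.

    import math

    def g(weights, inputs):
        # Joint interaction of weights and inputs: here, the usual weighted sum.
        return sum(w * x for w, x in zip(weights, inputs))

    def f_squashing(activation):
        # Sigmoid squashing function, limiting the magnitude of the unit's output.
        return 1.0 / (1.0 + math.exp(-activation))

    def unit_output(weights, inputs):
        # output = f_squashing(g(w, x)); the activation of the unit is the result of g.
        return f_squashing(g(weights, inputs))

    # Example: a unit with three weighted input connections.
    print(unit_output([0.5, -0.2, 0.1], [1.0, 0.0, 1.0]))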

Learning is the process by which the free parameters in a neural network are chosen

such that its operation solves the desired problem. The functionality of a neural network

is not programmed, but rather a learning algorithm modifies its behaviour depending

upon its environment, and any external guidance that might be provided. Three

different styles of learning algorithm can be identified depending upon the information

that is made available.

In supervised learning, the correct output is known for a set of inputs.

Reinforcement learning algorithms only have access to a scalar value

indicating the degree of correctness of the neural network's output.

If no external guiding feedback is supplied, then learning is termed

unsupervised.

Two important properties of neural networks are generalisation and distribution:

Generalisation refers to a neural network producing reasonable outputs for

inputs that it did not encounter during training. For example, if a neural

network is trained to behave as a content addressable memory, then an input

that is corrupted by noise should still recall the correct output.

Figure 1.2 Functional diagram of unit in neural network: inputs x1, x2, x3, ..., xn weighted by w1, w2, w3, ..., wn feed the interaction function g(), whose result is passed through f() to produce the output.


During learning, the presentation of any input can potentially result in the

modification of any neural network parameter. This is often termed

information distribution. During operation, all elements in a neural network

are involved in processing an input, and this has been described as distribution of

processing.

Other terminology relating to neural networks can be found in appendix B.

1.3.2. Reliability Theory

Reliability is defined as the probability that a system is still operating correctly, i.e.

according to its specification, at time t given that it was correct at time t=0. When the

operation of a system no longer meets its specification, a failure is deemed to have

occurred.

The reliability of a system can be decreased due to various factors such as incorrect

operation of system components, noise affecting inputs, design inaccuracy, and change

of environment. These influences can all be viewed as faults. More formally, a fault can

be defined as the cause of errors in a system's computation, where an error is that part

in the state of a system that is likely to lead to failure. The particular class of faults

which is of interest in this thesis is that due to the physical failure of components

within a system.

However, physical defects cannot be considered directly in any analysis due to

modelling issues such as complexity and computational cost. This results in the

requirement for a fault model to be developed. It supplies a high level representation of

the effect of faults on the operation of a system's components. Associated with each

fault is a failure rate, which is defined as the proportion of components that are likely to fail over a unit period of time, i.e. it describes the rate at which they become defective.

These failure rates allow the occurrence of a variety of faults in a system to be

realistically simulated, provided that the failure rate is accurately known.
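As an illustration of how a known failure rate can drive a simulation, the Python sketch below (a hypothetical example, not taken from the thesis) treats the failure rate as the probability that each still-working component becomes defective during each unit time step.

    import random

    def simulate_failures(num_components, failure_rate, time_steps, seed=0):
        # failure_rate: proportion of working components expected to fail per unit time.
        rng = random.Random(seed)
        working = set(range(num_components))
        failed_at = {}
        for t in range(1, time_steps + 1):
            for component in list(working):
                if rng.random() < failure_rate:
                    working.remove(component)
                    failed_at[component] = t    # time step at which the component failed
        return failed_at

    # Example: 100 components with a 1% failure rate per time step, over 50 steps.
    print(len(simulate_failures(100, 0.01, 50)))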

One approach to improving the reliability of a system is to increase the resilience of

its operation to the effect of faults. Methods which perform this task are termed fault

tolerance techniques. Generally such methods act by increasing the redundancy in a

system. Two types of redundancy exist: Spatial and Temporal. The former refers to


duplicating the function of groups of physical components, which results in increasing a

system's computational capacity. The technique of N-modular redundancy (NMR) is a

good example. The latter type of redundancy involves solving a sub-problem many

times and using the results to construct some form of average final solution.
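A minimal sketch of spatial redundancy in the NMR style is given below; it is illustrative only and assumes identical replicated modules whose outputs are combined by a simple majority vote.

    from collections import Counter

    def nmr_vote(modules, inputs):
        # Spatial redundancy: run N replicated modules on the same inputs
        # and take a majority vote over their outputs.
        outputs = [module(inputs) for module in modules]
        winner, count = Counter(outputs).most_common(1)[0]
        if count <= len(modules) // 2:
            raise RuntimeError("no majority: too many modules disagree")
        return winner

    # Triple modular redundancy (N = 3): one faulty module is outvoted by the other two.
    modules = [lambda x: x * 2, lambda x: x * 2, lambda x: 0]
    print(nmr_vote(modules, 21))    # prints 42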

Another property of a system often resulting from the application of fault tolerance

techniques is known as graceful degradation. It can also be inherent within the system

itself. For example, the storage capacity of a memory device merely decreases as

portions of its memory space are lost. Graceful degradation can be defined to be the

ability of a system to provide useful service in the presence of faults.

1.4. Thesis Overview

This section will describe the contents of each chapter in this thesis. Various concepts

for fault tolerance in neural networks are discussed in chapter 3, and the central theme

of studying computational fault tolerance is proposed. Chapter 4 supplies a methodology

for investigating fault tolerance in neural networks which is used in investigating two

neural network paradigms, ADAM (chapter 5) and multi-layer perceptrons (chapter 6).

The concept of requiring distribution in the form of producing uniform fault tolerance is

identified as crucial to developing fault tolerant neural networks.

The contents of each chapter will now be given in more detail. An extended glossary is

provided in appendix B which describes the various technical terms used in this thesis.

1.4.1. Chapter 2: Reliable Neural Networks

Chapter 2 presents a review and critique of known past and current research which

either directly or indirectly considers the effect of faults in neural networks. Work

relating to the construction of methodologies for investigating fault tolerance is

considered first. The requirement for a sound methodology is important since it

provides a basis for rigorous research and will allow meaningful comparisons to be

made between results obtained from various neural network models.

Next, various proposed computational concepts in neural networks that promote fault

tolerance will be discussed. Such concepts include distribution of information and

processing, resilience of learning algorithms to faults, and re-learning to recover from


the effect of faults. The related problems of fault detection, location and recovery will

also be included.

There exists a considerable number of empirical investigations into the fault tolerance

of various neural network models. Work relating to the various models will be

described separately. Particular attention will be paid to whether there is any evidence

of neural networks possessing any inherent fault tolerance.

Finally, various methods that are claimed to improve the fault tolerance of certain

neural network models will be described. The relevance of examining the fault

tolerance of artificial neural networks as opposed to that of their implementation will be

considered.

1.4.2. Chapter 3: Concepts

This chapter will discuss the consequences of various features arising from the style of

computation performed by neural networks on their fault tolerance and other related

properties. The features include learning, distribution of information and processing,

generalisation, and various architectural characteristics. It will also be discussed how

requiring fault tolerant operation can be viewed as a learning constraint to induce

generalisation in a neural network.

The notion of how failure occurs in a neural network will then be considered, and

contrasted to that which occurs in conventional computational systems. The related

concept of graceful degradation in neural networks will also be discussed. Following

this, a classification of problems will be proposed which is based on the nature of their

solution space. This is then used to explain how graceful degradation occurs in neural

networks.

Finally, the idea of computational fault tolerance will be introduced, and contrasted to

the more conventional physical fault tolerance. Reasons for studying neural networks at

such an abstract level are also given.

1.4.3. Chapter 4: A Methodology for Fault Tolerance

Chapter 4 will present a methodology for investigating the fault tolerance of neural

networks. This is required to provide a common baseline that will allow results between


various neural network models, architectures, etc. to be contrasted. The chapter will

first consider how a fault model can be constructed from an abstract definition of a

neural network. The two basic steps in this process will be described, and an example

given to demonstrate its use. Various concepts relating to the application of fault

models will then be considered.

The second part of the chapter will examine how the fault tolerance of neural networks

can be assessed. Finally, various simulation frameworks will be defined which allow

empirical results to be obtained.

1.4.4. Chapter 5: ADAM

This chapter will examine the fault tolerance of a binary weighted neural network

system called ADAM. After describing the neural network's architecture, training and

operation, a fault model will be constructed following the methodology to be given in

chapter 4.

The first area that will be examined is the effect on fault tolerance arising from the

storage distribution properties of tuple units. It will be shown that fault tolerance can be

improved by a new technique which ensures uniform storage. Empirical simulations

will be given which support this. A prediction model will then be constructed for the

fault tolerance of tuple units.

Finally, the fault tolerance of the first stage of ADAM will be analysed, and

comprehensive empirical simulations described and results given.

1.4.5. Chapter 6: Multi-Layer Perceptron Networks

Chapter 6 examines the fault tolerance of perceptron units and the more complex

multi-layer perceptron neural networks. First, the number of defective input

connections a single perceptron unit can tolerate is determined in terms of its input data

characteristics. This gives rise to an alternative visualisation technique for a perceptron

unit's operation.

The fault tolerance of multi-layer perceptron networks is then examined, and found to

be very sensitive to relatively small numbers of weight faults. A technique involving

transient fault injection which has been shown to improve fault tolerance is then


analysed. This leads to an understanding of the underlying mechanisms which allow

fault tolerant multi-layer perceptron networks to be developed. Finally, empirical

simulations are carried out investigating the operational fault tolerance of multi-layer

perceptron networks created using these new construction techniques.

1.4.6. Chapter 7: Conclusions

This chapter draws together the results found in preceding chapters and discusses the

mechanisms in neural networks leading to fault tolerance. The question of whether

neural networks are inherently fault tolerant is at least partially answered.

Finally, avenues for future work extending the research presented in this thesis are

given.

1.4.7. Appendix A: Fault Tolerance of Lateral Interaction Networks

An empirical study of the fault tolerance of single layer neural networks with lateral

connections between units is presented in appendix A. It is given as an example of how

the degree of failure in a neural network can be assessed from a specification of its

functionality, rather than by using a test set of data. This is one of the concepts

described in chapter 4.

1.4.8. Appendix B: Glossary

An extended glossary of terms relating to neural networks and reliability theory is given.

1.4.9. Appendix C: Data from ADAM Simulations

Data from simulations probing the reliability of ADAM are given.


CHAPTER TWO

Reliable Neural Networks

2.1. Introduction

Until recently, there have been few major pieces of work which study the field of fault

tolerant neural networks, or their reliability. Early papers or technical reports tended

either to contain a passing comment that fault tolerance existed, a general discussion of

fault tolerance, or very basic experimental results of the effects of noise or faults in

neural networks [7,8,9,10,11,12,13,14,15,16,17,18]. A common misunderstanding was

confusing resilience to faults with robustness to noisy inputs. Over the last two years, however, more substantial investigations of the fault tolerance of neural networks have been published, though there is still very little theoretical work. Overall, no

consensus exists on how to investigate the reliability of neural networks and the result

of applying fault tolerance techniques, and so the vast majority of work tends to be

rather fragmented.

Various methodologies for investigating fault tolerance will be reviewed in section 2.2,

including the definition of fault models (section 2.2.1), reliability measures used to

assess fault tolerance (section 2.2.2), and simulation frameworks (section 2.2.3). The concept of redundancy, which is central to

developing fault tolerant systems, will be examined in section 2.3. It will be considered

in terms of the internal and external representations employed by neural networks

(sections 2.3.2 and 2.3.3), computational learning theory concepts (section 2.3.4), and basins

of attraction (section 2.3.5). Literature concerning various other concepts arising from

the style of computation in neural networks will then be examined. This includes

training algorithms' resilience to faults and relearning in section 2.4, and fault

detection/location/recovery in section 2.5. The results from investigations into various

neural network models will be discussed in section 2.6, and conclusions drawn as to the

current ideas on fault tolerance in neural networks. The question of whether neural

networks have inherent fault tolerance will receive particular attention. Section 2.7


examines various techniques for developing fault tolerance in neural networks. Finally,

section 2.8 discusses the relevance of examining the fault tolerance of neural network

implementations as opposed to the computational fault tolerance of artificial neural

networks.

2.2. Frameworks for Analysing Fault Tolerance

A requirement exists for a methodology which directs the analysis of the fault tolerance

and reliability of neural networks (c.f. chapter 4). It should consider areas such as the

construction of fault models, methods of assessing fault tolerance, and simulation

techniques to probe fault tolerance.

These requirements have also been noted by Carter [19]. This paper is by far the

most wide-ranging published work on the notion of fault tolerance in neural networks,

although it is understandably far from comprehensive. The scope is limited to

"applications of pattern recognition and signal processing", recognising that neural

network systems which solve optimisation problems are qualitatively different to those

solving function evaluation problems. To distinguish from classical terminology, in which the generally accepted definition of fault tolerance is the notion that a system provides "error-free computation in the presence of faults", Carter uses the term "robust" to describe a neural network, since neural networks only ever give approximate solutions [14]. However, this change in terminology does not persist in later publications, owing to the confusion that arises when the term is also used to describe resilience to noise affecting inputs. A very

significant distinction with respect to analysing fault tolerance is drawn between the

two phases of neural network application: training and operation. The effects of faults

are likely to be different during these two distinct periods in a neural network's

lifecycle. Carter also identifies implementation-specific fault tolerance to be another

area for separate analysis. However, this seems to be an incorrect partitioning for the

analysis of reliability in neural networks since the implementation method used is quite

likely to affect the fault tolerance properties of neural networks very differently during

the operational and training phases. For example, the weights of connections are only

changed during the learning cycle, and so the method used in the implementation for

weight alteration will lead to reliability issues that are only relevant during this

cycle. Also, it does not take into account systems which continuously adapt during


actual operation. Although Carter's paper considers many questions for the development

of a methodology to study the fault tolerance of neural networks, it does not provide

any specific techniques which could be used in such an analysis.

2.2.1. Fault Models

A fault model is a representation of the effect of physical faults on the operation of a system (c.f.

chapter 4). The faults in the model are generally abstract descriptions of the effects of

physical defects for reasons of computational simplicity and cost. The fault model can

then be used in empirical simulations and theoretical analysis of the system, such as

examining its fault tolerance. However, no technique is known to exist for the

construction of fault models for artificial neural networks viewed at an abstract level,

although many such studies have been made of their fault tolerance.

The fault model employed by Bedworth and Lowe [20] in their investigation of the

multi-layer perceptron network (MLP) [21] was based on physical defects of the

components required by plausible implementation methods. This contrasts with trying

to abstract faults from the description of the MLP itself. For example, linear weight

noise was likened to the effects of thermal fluctuations, and non-linear weight noise to capacitive-type errors introduced by crosstalk. Belfore and Johnson [22] examined an

implementation of Hopfield networks using an electrical neuron model. Based on the

implementation level faults that would occur in this model, more abstract faults were

defined using the stuck-at class. However, with this method it is very likely that some

faults could not be so easily abstracted due to the difference in visualisation levels, and

indeed for a particular fault "a special simulation option was implemented to model

[this fault]."

However, in the vast majority of the literature no justification is given for the fault types

defined, and generally only the basic processing unit is selected as the component that

can become defective by becoming stuck at some output value. It will be shown in

chapter 4 that this is not a suitable choice due to the existence of simpler components at

this abstract level of visualisation which give rise to a more realistic and accurate fault

model.
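To illustrate the stuck-at style of fault model discussed above, the following Python sketch (hypothetical, not drawn from any of the cited studies) wraps a unit's output function so that an injected fault forces its output to a fixed value.

    def make_unit(output_fn, stuck_at=None):
        # Returns a unit whose output is either computed normally or, if a
        # stuck-at fault has been injected, forced to a constant value.
        def unit(inputs):
            if stuck_at is not None:
                return stuck_at           # faulty unit: output stuck at a fixed value
            return output_fn(inputs)      # fault-free behaviour
        return unit

    # Example: a fault-free unit compared with the same unit stuck at 0.
    healthy = make_unit(lambda xs: max(xs))
    faulty = make_unit(lambda xs: max(xs), stuck_at=0)
    print(healthy([0.2, 0.9]), faulty([0.2, 0.9]))    # prints 0.9 0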


2.2.2. Assessing Fault Tolerance

To measure the reliability due to the fault tolerance of a neural network when operating

as an associative memory or classification system, a common technique is to evaluate

the sample probability that a pattern will be recalled correctly [23,22,24] for various

fault levels. Conversely, for function approximation a continuous measure of deviation

from correct evaluation is more appropriate [25,26]. A similar approach to evaluating

the outcome of a neural network solving an optimisation problem is given in [27].

However, these measures only assess the reliability of a neural network for each

particular instance of fault distribution, rather than describing the actual resilience of

the neural network's operation to faults. The fault tolerance of a neural network is

indicated by a curve describing the neural network's reliability of operation over a range

of fault levels.

Segee and Carter use the RMS measure to assess the effect of faults on neural networks

solving function approximation problems [26], where the RMS error is given by

$$\mathrm{RMS\;error} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \big( F(x_i) - y_i \big)^2}$$

where F(x) is the output of the neural network and y the desired output. This measure is

then scaled appropriately with the function RMS to give a normalised value which

allows results from differing neural networks to be compared. However, the baseline

which is used to assess the effect of faults, the number of faults injected, does not allow

different sized neural networks to be compared directly. This is because faults are

injected sequentially rather than at a rate scaling with the size of the neural network or

according to some time-based probability function (see chapter 4). By comparing

different sized neural networks the effect of having varying computational capacity, and

hence potential redundancy, could be investigated. However, this kind of comparison is

very uncommon in the literature.
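
To make the form of such a measure concrete, the following sketch (in Python, with hypothetical function names; the normalisation by the RMS of the target function is an interpretation of the phrase "scaled appropriately with the function RMS" above) computes a normalised RMS error for a possibly faulty network over a test set:

import numpy as np

def rms_error(net_output, inputs, targets):
    # Root-mean-square deviation between network outputs F(x_i) and targets y_i.
    errors = np.array([net_output(x) - y for x, y in zip(inputs, targets)])
    return np.sqrt(np.mean(errors ** 2))

def normalised_rms(net_output, inputs, targets):
    # Scale by the RMS of the target function itself ("function RMS") so that
    # results from differing networks and problems can be compared.
    function_rms = np.sqrt(np.mean(np.asarray(targets, dtype=float) ** 2))
    return rms_error(net_output, inputs, targets) / function_rms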

Assessing neural networks solving optimisation problems is particularly difficult since

the optimum solution is generally unknown at run-time, and so no convenient reference

point exists by which its output can be judged. Protzel et al [27] have investigated the

fault tolerance of the Hopfield model [13] applied to such problems as the Travelling

Salesman Problem and the Assignment Problem. To assess the solution provided by a

Hopfield network, possibly defective due to faults, the measure used is

$q = \frac{c_{ave} - c}{c_{ave} - c_{opt}}$

where c is the cost of a solution provided by the optimisation network¹, cave is the

average cost of current solutions, and copt is the cost of the optimal solution. Note that

this implies that the optimal solution for a problem must be known in advance.

However, Protzel et al make the point that they are comparing a new method using

neural networks to existing methods, and so studying problems whose solutions are

already known (or at least very well approximated) is not a relevant issue for such

comparisons.

¹ Note that the value c is directly available, since the operation of a Hopfield network applied to an optimisation problem is governed by a cost function whose global minimum corresponds to the optimal solution of the problem.

This measure of the quality of a solution allows results to be independent of any

problem instance and neural network size. The effect of faults is shown by the resulting

change in the quality value q of solutions as compared to those from a fault free neural

network, and hence an indication of the fault tolerance of Hopfield networks applied to

optimisation problems can be gained.
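
A minimal sketch of this quality measure, assuming the three costs are already available (c from the network's own cost function, cave from averaging current solutions, and copt from a known optimum); the numerical values below are purely illustrative:

def solution_quality(c, c_ave, c_opt):
    # q = (c_ave - c) / (c_ave - c_opt): 1 for an optimal solution,
    # 0 for an average solution, and negative if worse than average.
    return (c_ave - c) / (c_ave - c_opt)

# Illustrative comparison of a fault-free and a faulted network on one instance.
q_fault_free = solution_quality(c=105.0, c_ave=160.0, c_opt=100.0)   # ~0.92
q_faulted = solution_quality(c=130.0, c_ave=160.0, c_opt=100.0)      # 0.50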

Neti et al [28] state that a neural network is ε-fault tolerant if

$\left\| H(\cdot,w) - H_v(\cdot,w_v) \right\|_2 \leq \epsilon \quad \forall\, v \in V_1$

where ‖·‖2 denotes Euclidean distance, V1 is the set of vertices (units) in neural network N(w), and H(.,w) is the mapping

performed by N(w). Hv(.,wv) is the mapping performed when unit v is removed. This

measure says that a neural network is ε-fault tolerant if, for all possible single unit

faults, the mapping differs by at most ε from the original. However, it should be noted

that an implicit limitation in this definition is that only the occurrence of single faults is

considered, and so it is too limited to be of general use. The idea of uniformity of

fault tolerance is also considered in the paper, i.e. that the damage caused by the

removal of any unit is approximately equivalent. This is achieved by considering the

deviation of fault tolerance of each hidden node from the desired ε:

$e_{\epsilon\text{-tol}} = \frac{1}{N}\sum_{v=1}^{N}\left(E_v - \epsilon\right)^2$

where Ev is the error caused by removing hidden node v.
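
A sketch of how this definition, and the associated uniformity measure, might be evaluated in practice (the network mapping and the single-unit removal operation are hypothetical interfaces supplied by the caller, and the worst case is taken over a finite sample of inputs):

import numpy as np

def epsilon_fault_tolerance(net, remove_unit, units, inputs, eps):
    # net(x)              : mapping H(., w) of the intact network
    # remove_unit(net, v) : returns the mapping H_v(., w_v) with unit v removed
    deviations = []
    for v in units:
        faulted = remove_unit(net, v)
        # Worst-case Euclidean distance over the sampled inputs.
        deviations.append(max(np.linalg.norm(net(x) - faulted(x)) for x in inputs))
    deviations = np.array(deviations)
    is_tolerant = bool(np.all(deviations <= eps))   # the eps-fault tolerance test
    uniformity = np.mean((deviations - eps) ** 2)   # deviation from the desired eps
    return is_tolerant, deviations, uniformity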

Similar measures of fault tolerance to these are given by Bugmann et al [29], though

they also consider the maximum damage as well as the average damage caused by loss

of single hidden units. This recognises the important distinction which exists in

different types of application areas regarding how fault tolerance should be considered.

For example, in a safety-critical application it is more sensible to assess the maximum

degradation of the system due to faults.

Lansner and Ekeberg [30] examine an associative memory system which iteratively

activates output units. In order to assess its reliability, they define two very useful terms

which are the expected local recall reliability (LRR) and the global recall reliability

(GRR). They define LRR as the probability that the next unit to activate will be in the

correct associated pattern given n units already activated. GRR is merely the extension

of LRR to the probability that the input pattern is associated correctly. Note that these

definitions rely on a relaxation process occurring in a neural network, and also that at

most only one unit is allowed to become active at each relaxation step. However, they

could well be useful in the examination of asynchronous Hopfield networks for

example.

Overall, methods for assessing the fault tolerance of neural networks do so by

measuring their reliability or degree of output error, and then plotting this for increasing

fault levels. However, this leads to a highly qualitative measure which only allows

different cases to be ranked in comparison with one another, rather than a quantitative

measure which would allow generic assessments to be made of the fault tolerance of

neural networks. This is not surprising though, since it will be seen in chapter 4 that the

latter is an extremely hard task to solve. Also, the consequences of unequal system

complexity for the assessment of fault tolerance are commonly ignored.

2.2.3. Simulation Frameworks

The last aspect of methodologies for investigating neural networks which will be

examined is how simulations are performed to assess the fault tolerance of various

neural network models. This is important because only through simulation can wide

ranging results be obtained. Clearly, both the construction of fault models and

development of fault tolerance measures as discussed above will be important

components in such simulations. However, the surrounding framework is equally

significant if general results are to be obtained.

Most work examines the fault tolerance of various neural network models by

sequentially injecting faults, and examining their effect at each stage

[16,20,22,24,27,31,32,33]. This approach suffers from two deficits. First, it does not

allow the comparison of different sized neural networks since a large network will

suffer more faults than a smaller version over some fixed time period. Secondly, fault

injection techniques do not allow multiple faults to be examined in conjunction with

each other. The various faults' effects may well interact with each other, especially in

large neural networks, and so examining the effects of each fault individually will not

give an accurate picture of their combined effect in an implemented system. Prater and

Morley avoid this problem in their investigations [34] by only concentrating on the

effects of single faults, but this neglects the need to consider the combined effect of

multiple fault types.

However, Segee and Carter [26] use a similar fault simulation method to May and

Hammerstrom [35] which is based on fault injection, differing only in that at each step,

the fault causing the worst damage is injected. This method overcomes the problem of

not simulating multiple differing faults occurring, but does not, as is recognised by

Segee and Carter [26], guarantee that the overall worst sequence of faults is generated.

This is because the effect of a fault which does not cause much damage when it first

occurs could become much worse given the occurrence of some subsequent fault.
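
A sketch of the "inject the worst remaining fault at each step" procedure described above, assuming that candidate faults can be applied and reverted through hypothetical apply/revert operations and that evaluate returns an error measure such as the RMS error:

def worst_case_fault_injection(network, candidate_faults, evaluate, steps):
    # Greedily injects, at each step, the single remaining fault that causes the
    # largest error; as noted above, this does not guarantee the globally worst
    # sequence of faults.
    remaining = list(candidate_faults)
    history = [evaluate(network)]
    for _ in range(min(steps, len(remaining))):
        damages = []
        for fault in remaining:
            fault.apply(network)
            damages.append(evaluate(network))
            fault.revert(network)
        worst_idx = max(range(len(damages)), key=lambda i: damages[i])
        remaining.pop(worst_idx).apply(network)   # make the worst fault permanent
        history.append(evaluate(network))
    return history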

In chapter 4, various other frameworks will be suggested which allow the problems

associated with simulating the effects of multiple fault types to be overcome.

2.3. Redundancy

Redundancy of neural network components has often been identified as the factor

producing a reliable system, which corresponds with fault tolerance techniques applied

in conventional digital systems [4,6]. Moore [36] compares such conventional

techniques used to introduce fault tolerance into a computer system with apparent

mechanisms in biological neural networks, and then draws various conclusions from

this. He points out that biological networks use both spatial and temporal redundancy

(in relation to components and input/output representations), as do conventional

computing systems to achieve a greater degree of fault tolerance. However, von Seelen

and Mallot [37] question whether redundancy really is the key issue in determining the

reliability of a neural network. They say, very plausibly, that a neural network does not

have redundancy in the sense of reserve capacity, but rather it utilises all of its resources

to gain the best trade-off between accuracy and computation time. Fault tolerance

comes from "isomorphic implementation, natural representation, a small number of

computation steps, and a balanced utilization of all available resources." [37]. By

isomorphic implementation, they mean that the output of the neural network can be

directly related to the internal processing within the network. However, though this is

reasonable for simple neural networks such as the Hopfield model where a clear

trajectory is followed through state space, for more complex models it does not seem to

be quite such a valid claim. Also, a natural representation is often a redundant signal in

its own right (e.g. a retinal image), so some of the influences on fault tolerance in neural

networks that they identify are not completely justified.

An interesting statement is made by McCulloch [38], "The reliability you can buy with

redundancy of calculation cannot be bought with redundancy of code or of channel".

This is in agreement with the work of von Neumann [39]. This type of redundancy

moves beyond the simple duplication of units/weights or small modules within a neural

network. It considers the possibility of inherent fault tolerance existing due to the

computational nature of neural networks introducing redundancy of calculation. For

example, temporal redundancy as often occurs in biological systems where calculations

are continuously repeated [36] can be viewed in this context.

2.3.1. Modular Redundancy

As well as redundancy of units and connections, it can also exist at a higher level in terms

of groups of units or sub-networks. For instance, Izui and Pentland [40] replicate

hidden units to provide redundancy which they claim improves fault tolerance.

However, since more faults would occur in the larger neural network over a fixed time

period, this result would need more careful consideration before it could be accepted.

For similar reasons, the work by Clay and Sequin on duplication of hidden nodes also

needs further analysis [41].

At a higher level, Lincoln and Skrzypek [42] consider having many separate hidden

layers feeding into the output units, with each output acting in a similar fashion to the

judging elements in N-Modular Redundancy systems [4]. Each hidden layer is trained

separately to solve the problem, and then all are clustered together to form the final

system. However, it is again not clear whether increased reliability is achieved despite

the increased size of the system.

The implementation of neural networks has given rise to various architectures whose

design introduces a degree of redundancy [43,44,45]. For example, a mixture of spatial

and temporal redundancy together with coding has been used by Chu and Wah [43] to

achieve a fault tolerant neural network system. Such designs make use of the regular

architectural and computational structure of neural networks to achieve redundancy.

2.3.2. Distributed vs. Local Representations

The formation of distributed representations is often presented as a mechanism to

develop fault tolerant neural networks [8,16], though Biswas and Venkatesh take a

more pragmatic view [46] by terming such general statements as "folk theorems". They

point out that local representations lead to the existence of critical units whose failure

results in the impaired computational ability of the whole neural network. However,

they do acknowledge that evidence does exist for distributed representations which lead

to redundancy.

Baum et al examine the consequences for fault tolerance of various local and distributed

representations in an associative memory system [9]. A unary representation (related to

the concept of Grandmother Cells) with simple replication is shown to provide a robust

associative memory system with excellent retrieval properties, though its storage

capacity is very limited. They also point out that the unary representation gives rise to

fault intolerance, although redundancy can be introduced by duplicating the

grandmother units. This is similar to ideas applied in Legendy's compacta networks

[15]. However, the claim of fault intolerance is not completely true since redundancy

will still occur in the connections feeding each grandmother unit. Faults affecting the

connections may well not cause a sufficiently large change in the unit's internal state to

alter the outcome of the winner-take-all process.

As an alternative, Baum et al [9] examine a distributed representation which is formed

by an intermediate layer of units in a layered neural network. They stress that such a

representation must not be simply several unary representations combined together

where individual units in the hidden layer still respond to only one stored pattern.

Instead it must be truly distributed in the sense that units in the hidden layer respond to

several stored patterns, though this overlap must be controlled to minimise interference.

They then go on to consider the effect on reliability of faults causing a proportion of

input bits to be forced to zero, and for a particular training algorithm, derive the

resulting memory capacity given a required output accuracy. It is pointed out that the

redundancy introduced by the distributed representation is balanced by the need for

connection weights to take more than simply one of two states. They also point out that

sparsification of the internal representation improves the memory capacity still further,

though it is likely that this reduces the fault tolerance due to the movement towards a

unary representation. A compromise clearly exists between fault tolerance and the

capacity of the neural network arising from the chosen internal representation. This has

many similarities with the ADAM system [47] where a sparse data representation is

created using tuple units, and a distributed intermediate representation is used for

association.

The nature of the representation created by the brain-state-in-a-box model (BSB) [48] is

considered by Anderson [8]. The units in this neural network model are interconnected

via a positive feedback loop with limits placed on their absolute output values. This

produces a neural network which acts as an associative memory. Anderson suggests that

the system might well be useful as a preprocessor for noisy input data due to its

auto-associative properties. Wood [49] has carried out simulations of faults occurring in

the feedback matrix, and found that the results led to a mixed conclusion. Although a

gradual decrease in accuracy of recall as faults occur might be expected from statistical

predictions, the results showed that as well as distributed representations, localisation

also existed which led to critical connections. It can be concluded that it would be very

useful to have a measure which indicates the degree of the information distribution in a

neural network.

Anderson [8] has also differentiated between unary and distributed representations by

considering feature detectors which consist of either one neuron (microfeature) or

several (macrofeature). The vector feature model employs lateral excitation based on

the cerebral cortex, which results in several units behaving as a feature group. However,

this is not a distributed representation as defined by Baum et al above, and this may be

the reason for the rather mixed results which Wood found, as described above.

2.3.3. Input and Output Representations

As well as distributed internal representations leading to a more reliable neural network,

the same can also be said for the input and output representation used [36,51]. Such

distribution leads to redundancy, for example, overlapping groups of output units where

each represents a particular classification. Methods for forming distributed

representations are discussed by Miikkulainen and Dyer [52], and include extending the

back-error propagation algorithm [21] to modify the input vectors passed to the neural

network. They found that the final neural network was fault tolerant to damage in its

input layer of units due to the learned distributed representation, and also that it

degraded in an approximately linear manner. However, the system requires a lexicon to

map actual "world" input vectors to the distributed input vector the neural network

requires, and this could become the keystone for the reliability of the overall system.

Various input representations for numerical values are considered by Takeda and

Goodman [53], such as a binary scheme and a simple sum scheme, to examine how the

chosen representation affects the learning capabilities of the neural network. However,

they also note that the binary scheme is not particularly fault tolerant, but the simple

sum scheme is. This is because a single bit error then only causes a small change in the

number represented. Hancock [54] describes various other possible data representations,

but only considers their effect on learning.
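
A small worked illustration of this point, assuming a standard positional binary code and taking the "simple sum" code to mean that the represented value is simply the number of active bits:

import numpy as np

def binary_value(bits):
    # Positional binary code: bit i contributes 2**i to the value.
    return int(sum(int(b) << i for i, b in enumerate(bits)))

def simple_sum_value(bits):
    # Simple sum code: the value is the count of active bits.
    return int(np.sum(bits))

bits = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # example 8-bit pattern
flipped = bits.copy()
flipped[5] ^= 1                             # a single bit error

print(abs(binary_value(bits) - binary_value(flipped)))          # 32: a large change
print(abs(simple_sum_value(bits) - simple_sum_value(flipped)))  # 1: a small change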

A method to increase the storage capacity of a neural network has been considered by

Venkatesh [50]. A proportion of the output units in an associative neural network are

specified to be redundant (i.e. their values do not matter), and then errors from a

known distribution are allowed to occur in the output layer. It can be viewed that extra

redundancy is introduced into the output layer of units, and this then improves the

network's robustness to noise and also to any unit failures. The results of this rather

strange method are not based on any particular neural network model but are generally

applicable. It is found that the memory capacity is increased, and is determined to be

related to the number of units in the neural network and the proportion of allowable

errors made by an output unit.

2.3.4. Computational Complexity and Capacity

The concept proposed by von Seelen and Mallot [37], that a neural network utilises its

resources to the full, advances Carter's [19] explanation of fault tolerance in a neural

network. This explanation states that redundancy exists because of "spare capacity"

when the complexity of the problem to be solved is less than the computational capacity

of the neural network.

If redundancy does originate in this manner, then it would be useful to be able to

determine both the computational capacity of a neural network and the computational

complexity of a problem. Much of the work considering the memory capacity of various

neural networks can be applied here [7,10,13], though some more general work has also

been done.

Abu-Mostafa [55] shows that a neural network can solve any finite problem by

simulating boolean logic gates. However, due to practicality and efficiency

considerations, and also since feedback is not incorporated into the argument, this

interesting result is somewhat weakened. He observes that the complexity of the problem

to be solved greatly depends upon the representation of the input presented to the neural

network, since it can be viewed that a network is just performing a change of

representation. This observation can be extended to also include the choice of output

representation in the case of supervised learning. Another important point made is that

the capacity (in a general sense) grows faster than the number of neurons/units in a

neural network, so a network will be more efficient at solving random problems [56],

such as complicated pattern recognition, than at solving small structured problems.

This incidentally corresponds with the problem solving capabilities of humans.

Hartley and Szu [57] have found that a large number of neural network models can be

shown to be equivalent in power to a Turing machine if an infinite number of units are

used; otherwise they have the power of a finite state machine. This result also supports

Abu-Mostafa's claim [55]. They point out that if further restrictions are placed on the

neural network, such as having a symmetrical weight matrix, then its power is greatly

decreased to below that of a finite state machine.

Some related work directed towards determining the correct sized linear threshold

network for valid generalisation has been done by Baum and Haussler [58]. Two

measures of capacity are used. These are the maximum number of dichotomies that can

be induced on m inputs and the Vapnik-Chervonenkis (VC) dimension [59]. This latter

value is the size of the largest set of input points that can be completely dichotomised

by some set of functions. The first can be related to the maximum possible

computational complexity of a problem set in m dimensional space, and the second can

be seen to closely relate to the computational capacity of a neural network. These

measures may well be of use in determining the redundancy in a neural network.

However, the VC dimension only applies to units which output a boolean value, and so

its application is somewhat limited.
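
For a single linear threshold unit, the first of these capacity measures can be made concrete using Cover's function-counting result (a standard result, not taken from the papers cited above): the number of dichotomies of m points in general position in d dimensions separable by a hyperplane through the origin is $C(m,d) = 2\sum_{k=0}^{d-1}\binom{m-1}{k}$. A small sketch:

from math import comb

def cover_dichotomies(m, d):
    # Number of dichotomies of m points in general position in d dimensions that
    # are separable by a hyperplane through the origin (Cover, 1965). For a
    # threshold unit with a bias weight, substitute d + 1 for d.
    return 2 * sum(comb(m - 1, k) for k in range(d))

# Four points in the plane, threshold unit with a bias (so d + 1 = 3):
print(cover_dichotomies(4, 3))   # 14 of the 16 labellings; for the corners of a
                                 # square the two missing ones are the XOR labellings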

Segee and Carter [60] have examined the effects on the reliability of feedforward

multi-layer neural networks by applying various pruning algorithms. These algorithms

are intended to reduce the number of free parameters in the neural network without

impairing its function, and so improve generalisation. This can be considered as trying

to match the neural network's computational capacity with the complexity of the

problem. They used the RMS measure, as described previously in section 2.2.2, to

measure the effect of setting single weights to zero. It was found that the reliability of

the pruned neural networks was not significantly different to that of the original. This

might be explained by considering that the spare capacity in the neural network was not

used to provide fault tolerance through redundancy, and so could be pruned away. Since only units

which did not contribute to the function of the neural network were removed, no effect

on its fault tolerance would be expected.

2.3.5. Basins of Attraction

The concept of a basin of attraction in a neural network is linked to the visualisation of

its energy landscape. A basin of attraction can be viewed as a bounded region in this

landscape over which a stored pattern has complete influence. Basins of attraction have

mainly been associated with

Hopfield networks employed as auto-associative memories [7,32]. The size of the basin

of attraction has been stated in terms of the maximum allowable number of erroneous

bits in the initial state vector as compared to that of the stored pattern, while still being

able to recover it. Basins of attraction can be viewed as a form of internal redundancy

of computation, i.e. a group of internal system states is used to represent a particular

computation state.
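
A sketch of estimating basin size in this sense for a small Hopfield-style associative memory (synchronous recall and simple Hebbian weights are used purely for brevity; asynchronous update is more usual, and the patterns are assumed to be bipolar vectors):

import numpy as np

def hebbian_weights(patterns):
    # Outer-product (Hebbian) weight matrix with zero self-connections.
    n = patterns.shape[1]
    w = sum(np.outer(p, p) for p in patterns) / n
    np.fill_diagonal(w, 0.0)
    return w

def recall(w, state, steps=20):
    # Simple synchronous recall (asynchronous update is more usual).
    for _ in range(steps):
        state = np.sign(w @ state)
        state[state == 0] = 1
    return state

def basin_radius(w, pattern, trials=50, seed=0):
    # Largest number of flipped bits in the initial state from which the stored
    # pattern is still recovered in every trial.
    rng = np.random.default_rng(seed)
    n = len(pattern)
    for k in range(1, n + 1):
        for _ in range(trials):
            probe = pattern.copy()
            probe[rng.choice(n, size=k, replace=False)] *= -1
            if not np.array_equal(recall(w, probe), pattern):
                return k - 1
    return n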

More generally, Krauth et al [61] have studied basins of attraction in neural networks

that are based on the perceptron model [62,63,64]. The architectures that they study are

two-layer feedforward networks composed of binary units. They claim that the

existence and large size of basins of attraction in these neural networks are important

factors in rendering them fault tolerant. However, it should more strictly be said that

they give rise to robustness, or resistance to noise. It is unclear how the size and

existence of basins of attraction will be affected by faults occurring within the network.

2.4. Reliability during the Learning Phase

The nature of the reliability of a neural network during the training process is important

to determine, as has been noted by Carter [19]. He questions what output accuracy the

neural network can achieve, how much longer it will take to train the network, and how

fault tolerant the trained neural network will be when faults occur during training. The

answers will depend in part on the difference in computational capacity of the network

and the complexity of the problem. Moore [36] claims that neural networks will adapt

to any faults due to their learning ability, though this claim is presented with no

justification other than that biological neural networks behave in this fashion. However,

von Seelen and Mallot [37] note that in lesion experiments carried out in their

laboratory, there was "no compensation for [the] deficits even after prolonged

learning". These lesions were carried out on the visual cortex areas. It is unlikely

though that this result will be exactly equivalent to what will happen in an artificial

neural network during learning. Localised representations are less robust than

distributed representations to such structured damage, whereas they are probably just as

robust to random damage.

The bit precision required in a digital implementation of multi-layer perceptron

networks trained using back-error propagation has been studied by Holt and Hwang

[65]. They found that 14-16 bit precision is needed for the weights during training,

while only 8 bit precision is satisfactory during actual operation. This implies that it

may be more critical for a system to exhibit fault tolerance during learning since a small

error caused by defects will have more effect than during operation.

A study has been made by Pemberton and Vidal [66] on the consequences of having

noisy training data during learning in a single threshold logic unit, trained using the

perceptron (discrete), the Widrow-Hoff (linear), and the generalised delta (non-linear)

learning algorithms. It was found that the output error rate of a unit followed

approximately linearly to the introduced training signal error rate when trained using

the perceptron rule. However, for the linear and non-linear training algorithms, the

output error rate did not significantly increase until the training signal error rate reached

about 40%, the non-linear rule being slightly better. It was also found that the choice of

learning rate greatly affected the robustness of the unit to training signal noise for the

linear and non-linear rules, and in the latter case, the threshold function scaling factor as

well. These results imply a degree of reliability will exist during the learning phase of a

neural network since the error caused by faults will initially be small for some learning

algorithms.
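
A compact sketch of this kind of experiment for the perceptron rule alone (the task, the label-noise model and all parameters here are illustrative rather than those used by Pemberton and Vidal):

import numpy as np

rng = np.random.default_rng(0)

def output_error_rate(label_noise, n_train=500, n_test=500, epochs=20, lr=0.1):
    # Linearly separable task: the sign of a fixed random hyperplane.
    true_w = rng.normal(size=3)
    def make(n):
        x = np.hstack([rng.normal(size=(n, 2)), np.ones((n, 1))])   # bias input
        return x, np.sign(x @ true_w)
    x_tr, y_tr = make(n_train)
    x_te, y_te = make(n_test)
    # Corrupt a fraction of the training labels (the training signal error rate).
    flip = rng.random(n_train) < label_noise
    y_noisy = np.where(flip, -y_tr, y_tr)
    w = np.zeros(3)
    for _ in range(epochs):                    # perceptron learning rule
        for xi, yi in zip(x_tr, y_noisy):
            if np.sign(xi @ w) != yi:
                w += lr * yi * xi
    return np.mean(np.sign(x_te @ w) != y_te)  # output error rate on clean test data

for noise in (0.0, 0.1, 0.2, 0.4):
    print(noise, output_error_rate(noise))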

2.4.1. Retraining

If faults affecting a neural network could be detected (see section 2.5 below), a possible

method for recovery exists by retraining the neural network to alleviate the problem and

restore correct processing [8,67,68]. This has also been found to occur in some

biological neural networks, for example cortical reorganisation in adult macaques [69].

Tanaka [70] has experimented with relearning in a multi-layer perceptron (MLP)

network, and found that it was possible to regain high levels of output accuracy even

after many faults had occurred. However, the time taken to recover increased beyond

reasonable limits for many faults. This possibly is due to the computational capacity of

the neural network diminishing towards the minimum needed to solve the problem, and

so increasing the difficulty of the learning task.

Bedworth and Lowe [20] have also investigated relearning in a MLP network. Their

network was considerably larger than Tanaka's, and even with half the connections in

the network removed, it could still recover near original performance after a fifth of the

initial training time.

However, the result of Holt and Hwang described above on the precision needed during

learning [65] means that if a MLP is to be retrained after faults occur, provision must be

made for the higher bit precision required for the storage of weight values. This will

increase complexity in the neural network, as well as implementation cost, and may

possibly have a detrimental effect on reliability.

Sequin and Clay [71] compared retraining a damaged MLP network which originally

had more hidden units than needed to solve the problem, with one in which faulty units

were located and replaced, though such fault location is likely to be extremely difficult

(see section 2.5 below). However, their results indicate that fault location would not be

necessary since the time taken to retrain both MLP's was approximately equivalent.

They conclude that adding redundant units initially is sufficient to allow retraining to

regain correct operation without requiring physical reconfiguration of the architecture.

Plaut examines relearning in attractor neural networks [72], particularly concentrating

on issues relating to rehabilitation of cognitive deficits due to brain damage. It was

found that errors due to faults near output units were corrected very quickly, but for

faults occurring lower in the neural network, retraining was not so effective.

The ability to retrain a neural network at intervals to recover from damage caused by

faults may well produce a system that can easily meet long term reliability goals [36],

though it must be noted that the retraining system will then become a key factor in

determining the reliability of the overall system.

2.5. Fault Management

Sufficient numbers of faults occurring in a system will cause errors unless protected

against using fault tolerance techniques. However, it would also be useful if faults could

be detected, then located, and finally removed. This is because the limit of a system's fault

tolerance may be exceeded given sufficient time, and failure will then occur. If faults

can be removed, then the potential reliability of a system will be greatly increased. The

problems of fault detection and location in neural networks will be reviewed below.

Fault removal can be achieved by retraining as discussed above, though other methods

are also conceivable; however, no literature on such alternative methods relating to neural

networks is known to exist.

As seen above, many people have considered retraining; however, generally no

consideration is given to deciding when to apply retraining. It would initially

seem that the task of fault detection in neural networks is surprisingly simple. When

faults have occurred in a neural network system, its performance will be degraded since

computation in a neural network is distributed evenly amongst its components, as von

Seelen and Mallot [37] have noted, so a fault will always result in a deficit. However,

this may not be true in neural networks with non-linear thresholded elements since

although a fault will manifest itself in a change of unit activation, it is possible that no

significant change in the unit's output will occur [73]. Non-linear thresholding functions

can thus hide internal errors caused by faults.

Anderson [8] has made an interesting point with respect to fault location in distributed

neural networks: errors caused by faults occurring during teaching will also be

distributed, and so will be hard both to locate and to remove. Also, since neural

networks are essentially black-box systems in the sense that the functionality of internal

units is unknown, unless complex analysis is performed, individual units cannot be

tested to locate possible faults. No other work is known which examines this problem.

However, it is proposed that for multi-layer perceptron networks, the calculation of

errors involved in back-error propagation could be applied to determine the unit which

is erroneous, and so indicate the approximate location of the fault(s).

2.6. Analysis of Specific Neural Network Models

There exists a large amount of literature which has investigated the reliability or

robustness (tolerance to input noise, often confused with resistance to defective components)

of specific neural network models. Often, the central question of such studies is "Are

neural networks inherently fault tolerant?". However, it will be seen that inconsistent

answers are given to this question. The two neural network models which have been

examined in greatest depth are a feedback neural network developed by Hopfield and

feedforward multi-layer neural networks such as the frequently applied multi-layer

perceptron network.

2.6.1. Hopfield Neural Network Model

The Hopfield neural network [13] has been analysed for various objectives by many

researchers, and since it is closely related to spin-glass models, it is extremely amenable to

such mathematical analysis. However, only a few people have examined the fault

tolerance properties of the model, and the work that exists tends to cover the same

issues. Amit et al [7] present a classic example of the application of spin-glass theory to

the Hopfield network. They also highlight some issues which are important for

robustness in the model, but do not consider the effects of faults. The relaxing of the

rather extreme condition of full interconnectivity in a Hopfield network was found to

decrease the storage capacity and quality of recall of the network only gradually. Also,

when the weight matrix was not restricted to be symmetrical, a similar result was observed. This

condition is one which Hartley and Szu [57] noted would decrease the power of a

neural network. Two other issues, the saturation of the network and noise at synapses,

are also shown to decrease the storage capacity only gradually.

A purely empirical study of the Hopfield network has been carried out by Palumbo [74]

which examines the effects of unit faults (stuck-at-0 and stuck-at-1) for networks

trained to solve the travelling salesman problem (TSP), an assignment problem, and

also a task allocation and load balancing problem. A degree of fault tolerance is shown

to exist in the neural network, but no measure of performance is given. An indication is

made that this latter point is an important problem that needs more research.

A VLSI architecture for implementing the Hopfield neural network [75] was found to

exhibit the inherent fault tolerance of the Hopfield model claimed to exist by many

researchers, but the network was so small (stored only 2 patterns) that it is unclear

whether the results would scale to larger networks.

The theoretical implications of faults affecting both units and connections in a Hopfield

network have been partially studied in [32] by considering the probability of correct

recall and signal-to-noise ratio. Results from simulations show that if relatively few

patterns are stored in the neural network, reliability with respect to faulty connections is

very good; with up to 40% faults there is still a good probability of accurate recall.

Analogies are drawn between the fault modes possible in the abstract model and

physical faults that might occur in some implementation of the neural network, but it is

doubtful whether these are valid, since so few fault modes are considered and these

are extremely simplistic. Theoretical results are then given for faults affecting both

units and connections, and are supported by the experimental data. The results are

obtained by considering the probability that the number of failed units exceeds the size

of the basin of attraction [76]. Tai concludes that "the network fault tolerance depends

heavily on the chosen fault model and on the number of stored patterns."

Belfore and Johnson [22] present an analogue implementation of the Hopfield network,

and consider the fault tolerance of the neural network model viewed from this physical

level. They first list the possible fault modes that could occur in their "electrical

neurons", and then examine their effects on the neural network's operation when

solving the travelling salesman problem (TSP), and also for it acting as an associative

memory. Once again, the conclusion is drawn that neural networks seem to be

inherently fault tolerant, and they draw an analogy with holograms only losing

resolution when portions are cut away, though this analogy is somewhat doubtful.

Protzel et al concentrate on the role of Hopfield networks in solving optimisation

problems such as the Travelling Salesman Problem (as above) and the Assignment

Problem for example [27,77,78]. They take the view that the Hopfield network in this

role does exhibit inherent fault tolerance, and that this is a major incentive for its

application in critical systems. Since the Hopfield network acts as an auto-association

system, all units are equal in the sense that no "real" difference exists between input,

hidden and output units. For optimisation problems, all of the units are used as output

units. Given this, they note that faults can be viewed as acting as constraints on the final

solution found by the Hopfield network [78].

However, Nijhuis and Spaanenburg argue that neural networks are not by definition

inherently fault tolerant given their results from an investigation of the fault tolerant

properties of the Hopfield network [31]. They claim that the fault tolerance is very

much dependent on the fault model chosen (broken connections as opposed to changes in

weight values), and the characteristics of the stored patterns (Hamming distance

between patterns). However, although the characteristics of the fault tolerance exhibited

under various fault models does differ, it is by no means absent. Also, the effect of

stored patterns' characteristics on fault tolerance is not indicative that the neural

structure is not inherently fault tolerant; rather, it indicates that the training algorithm

does not develop weights which allow this natural fault tolerance to be employed.

2.6.2. Multi-Layer Perceptron Model

Several researchers have looked at the multi-layer perceptron model trained using the

back-propagation algorithm [21] with respect to its fault tolerant properties. Damarla

and Bhagat [33] trained both two and three layered networks (2-10-1 and 2-5-5-1) to

solve the boolean exclusive-or problem (XOR), keeping the number of hidden units

constant in an attempt to allow the results from both neural networks to be comparable.

However, it is generally accepted that the number of connections in the network is the

significant factor, and there are only 30 weighted connections in the first network as

compared to 40 in the second. They reported that if the weights were left unconstrained

then no significant results were found. This was probably due to units with large

weights dominating the MLP. Constraining the weights greatly increased the training

time for small networks, but the robustness of the MLP to noise and units being

removed was greatly improved compared with the unconstrained case. However, given that many

more than the minimum of two hidden units required to solve the XOR problem were

used, the reliability observed was probably due to the MLP "overtraining" and

developing a unary representation [79,80] rather than extracting categorisation rules

from the input data. Further, the vast majority of the constrained weights had saturated

to the clipping value. Considering Abu-Mostafa's [56] indication that neural networks

are more likely to perform better on a random than on a structured problem (such as

XOR), the results of this paper are unlikely to be representative.

A more in-depth, although again only empirical, investigation has been carried out by

Bedworth and Lowe [20] in which they state that fault tolerance arises due to MLP

networks "lead[ing] to distributed rather than localised representations." They

investigated a large neural network (760-16-8) trained to recognise the confusable 'EE'

sounds from the English alphabet. The neural network was corrupted in various ways

which were designed to be similar to what would occur if it was implemented in

hardware. Performance was measured by two factors; the first being the number of

correct classifications, and the second the normalised error between the actual output

and the desired output. In general, no constraints were placed on the weights. They

found that the robustness of the neural network was very good, except for faults which

occurred in the connections feeding the output units and also when the output of hidden

units was forced to zero, which is equivalent to removing many connections to the

output units.

Tanaka [70] also claims that fault tolerance in neural networks is due to their "isotropic

architecture", as well as referring to the fact that every day many neurons in the brain

die without undue consequence. Again, a large neural network is used to collect

experimental results (90-50-10), but the problem is merely one of classifying the figures

'0' to '9' represented as dots in a 15 by 6 matrix. For only 10 input patterns, the number

of hidden units seems excessive, and the neural network is likely to construct a unary

representation rather than developing feature detectors leading to a distributed

representation since no constraints are placed on the solution which the MLP finds.

Once again a high degree of fault tolerance is displayed by the network.

However, Prater and Morley [34] state that feedforward neural networks (such as the

MLP) are not inherently fault tolerant. They examine the fault tolerance of feedforward

networks for a variety of problems and using several training algorithms. The faults

they consider are based on the stuck-at model, and are assumed to be permanent. Unit

outputs can be stuck-at 0, 1, or ½. Similarly, weights can be stuck-at-0 or saturated to

the magnitude of the largest valued weight in the neural network. They claim that a

fault causes both information loss and a bias change in a unit. It is this bias change

which results in large errors from faults other than stuck-at-0. Also, their results show

that expanding the number of layers in a neural network increases the effect of faults on

the output error. This is also shown by Stevenson et al [81] in an analysis of Madaline

networks [82]. Prater and Morley also note that the location of a fault is directly related

to its effect on the neural network's reliability. Weights closer to the output layer cause

more damage than those lower in the network. This corresponds with the relearning

results given by Plaut [72] (section 2.4.1). Their conclusion that inherent fault tolerance

does not exist in feedforward neural networks is tempered by an acknowledgement that

new training techniques can improve their fault tolerance (see section 2.7).
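
A sketch of applying this kind of stuck-at fault model to a generic one-hidden-layer feedforward network (the network structure and the sign convention for saturated weights are assumptions, not details taken from Prater and Morley):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w1, w2, stuck_units=None):
    # stuck_units maps a hidden unit index to its stuck output value
    # (0, 1 or 0.5 in the fault model described above).
    h = sigmoid(w1 @ x)
    for unit, value in (stuck_units or {}).items():
        h[unit] = value
    return sigmoid(w2 @ h)

def stuck_weight(w, index, mode):
    # Weight faults: stuck at zero, or saturated to the largest weight magnitude
    # (taken here within the given matrix, with the original sign preserved).
    w = w.copy()
    w[index] = 0.0 if mode == "zero" else np.sign(w[index]) * np.max(np.abs(w))
    return w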

The conflict between the conclusions made regarding the question of the existence of

inherent fault tolerance in neural networks arises from a difference in its definition. In

some work, which includes this thesis, inherent fault tolerance is taken to exist in the

structure of the neural architecture and its computation. However, Prater and some

other researchers take the statement to mean that a trained neural network is fault

tolerant. Because current training algorithms do not produce a weight configuration

which leads to a fault tolerant neural network, this conflict arises.

Segee and Carter [26] compare the fault tolerance of MLP's with Gaussian Radial Basis

Function networks (GRBF's) performing function approximation. The MLP networks

are trained using several variations of the back-error propagation algorithm: standard,

adding momentum, and using a flexible learning rate. They found that training using

momentum produced the MLP network whose RMS error was least increased by faults.

However, the most critical fault would often cause a total failure in the MLP's

operation. Conversely, the fault tolerance of GRBF's was found to be excellent, and no

single fault would cause a failure. Since the units used in GRBF's are local, i.e. only

respond to a bounded region of the input space, this is perhaps not surprising given the

large number of units used (100 and 200) resulting in a considerable degree of overlap

between them.

Although many empirical investigations of the fault tolerance of multi-layer

perceptron networks have been carried out, little theoretical work has been done. This is

mainly because MLP's are notoriously hard to analyse mathematically. Zymslowski has

given some very general equations on the effects of parameter changes in a neural

network [83], but unfortunately there is no obvious way in which these can be applied

to construct a reliable neural network. However, he does show that both redundancy of

connections and feedback lead to reliability, but stresses that more effective

mechanisms should be sought. Stevenson et al have analysed the effect of weight and

input perturbations on adaline units and multiple layered feedforward networks

composed from them (madalines) [81]. They considered the volume of an adaline's

input space that is swept out when one of its weights is perturbed by a small amount,

and by using this, could define the probability of misclassification. This was then

extended to multiple layers of adaline units using several approximation techniques.

Simulations were performed whose results closely matched the theoretical predictions.

The results do show that madalines are very resistant to weight perturbations, though

less so as more layers are used, but this is not the main aim of the paper.

Instead, it concentrates on the matching of the theoretical model with simulation results.

This work was extended by Dzwonczyk [84], who considered more realistic faults:

weights forced to zero, saturated, or sign-reversed. Similar results were obtained,

though the failure model developed is computationally expensive. An

interesting conclusion is that sparse connectivity may provide benefits for reliability,

though further investigation is indicated as being required.

A probabilistic multi-layer perceptron network (PNN) has been looked at by Specht

[85] where the conventional sigmoid threshold functions are replaced by probabilistic

ones. It is shown that given certain trivial conditions, the PNN will asymptotically

approach the Bayes optimal decision surface. A very useful calculation can be made on

the input values to an output unit which gives the probability that the input to the PNN

belongs to the class which that particular output unit represents. This, although not a

suitable measure for the reliability of the neural network with respect to fault tolerance,

does give a confidence value for the output classification which might indicate the noise

level in the inputs. However, if the input is corrupted such that it resembles another

input class, as might happen for input classes that are not too dissimilar, the confidence

value will be incorrect, so it is not totally reliable.

2.6.3. CMAC Networks

Carter et al [25] have looked at the fault tolerance during operational use (rather than

during the training phase) of the Cerebellar Model Arithmetic Computer (CMAC)

described by Albus [86]. This paper follows guidelines for investigating fault tolerance

given by Carter [19] in an earlier paper.

The CMAC network was designed to be used for robot manipulator control, and as such

it can learn to approximate non-linear functions. The object of the paper is to study the

sensitivity of the network's output to faults, though these were limited only to the

adjustable weight layer due to the complexity of analysing the effects of faults in the

rest of the network. Two fault modes were considered, the first being the loss of a

weight, and the second, a weight value being saturated. They followed a strategy of

aiming to cause the greatest possible effect in the network by placing loss of weight

faults where weights were large, and saturated weight faults where weights were small.

This strategy was adopted so that the limits of the fault tolerance of the network would

hopefully be reflected in the results. They found that as the generalisation parameter in

the CMAC network was increased, the network became more tolerant to loss of weight

faults, but not for saturated weight faults. However, for discrete mappings the

generalisation parameter had to be decreased to improve the robustness of the network.

They concluded that the CMAC network is not so fault tolerant as it would at first

intuitively appear, and that the robustness to faults is not uniform. Quite rightly they

also stressed that "one must be cautious in making assessments of the fault-tolerance of

a fixed network on the basis of tests using a single mapping."

2.6.4. Compacta Networks

Compacta networks are based on a theory for information storage in the human brain

advanced by Legendy [15] in 1967. The neural network model has a sparse, random

interconnectivity of input and output units based on McCulloch-Pitts neurons [62]

arranged in a hierarchical fashion, and a simple learning mechanism. The network is

updated synchronously. Diffuse groups of units (minor compacta) represent a single

entity in the overall distributed representation, so a loss of many units will only cause a

few units to be lost in each compacta on average, and so fault tolerance emanates from

the redundancy of units within each minor compacta. Legendy considers both the loss

of units and also the effects of noise from spuriously firing units (internal noise, not

external). He shows that with 10% or less units faulty, the effects on the "ignition" of

minor compacta (all units in the minor compacta firing) is negligible. However, for

internal noise, less than 0.4% of units must be spuriously firing, else the dynamic

threshold is temporarily raised by the system causing all activity within the network to

cease temporarily.

Worden and Womack [17] have proposed a more detailed study of the fault tolerance of

compacta networks. They are interested in the effects of faults with respect to both the

capacity and the accuracy in such a network. The paper merely lays out the guidelines

for their proposed study, and mentions possible factors that might affect the fault

tolerant properties of compacta networks. However, no simulations or analyses are known

to have been performed yet.

2.7. Fault Tolerance Techniques for Neural Networks

This section examines various methods which have been proposed for improving the

fault tolerance of neural networks trained using current learning algorithms.

Sequin and Clay [71] have proposed a method for improving the operational fault

tolerance of MLP's (w.r.t. hidden unit failure) by injecting a single fault during each

training epoch. In their simulations, the effect of the fault injected was to set the output

of a hidden unit to zero. They found that such training produced a MLP network which

would withstand multiple faults. They concluded that this was due to a more robust

internal representation being created (though results in this thesis, chapter 6, do not

support this). A similar method was also given by Neti et al

[28] in which constraints were placed on the MLP's fault tolerance, and then a set of

weights solving the problem was estimated using a large-scale constrained nonlinear

programming technique. They note that by using fault tolerance as a constraint, better

generalisation is obtained.
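
A sketch of injecting a single transient hidden-unit fault during each training epoch, in the spirit of the technique described above (the network, task and all parameter values are placeholders; biases are carried by a constant input):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_with_fault_injection(x, y, n_hidden=8, epochs=2000, lr=0.5):
    # Plain back-error propagation for a one-hidden-layer MLP, except that in
    # each epoch one randomly chosen hidden unit has its output forced to zero
    # (a transient stuck-at-0 fault) for that epoch's forward and backward pass.
    w1 = rng.normal(scale=0.5, size=(n_hidden, x.shape[1]))
    w2 = rng.normal(scale=0.5, size=(1, n_hidden))
    for _ in range(epochs):
        faulty = rng.integers(n_hidden)          # the transient fault for this epoch
        h = sigmoid(x @ w1.T)
        h[:, faulty] = 0.0
        out = sigmoid(h @ w2.T)
        delta_out = (y - out) * out * (1 - out)
        delta_h = (delta_out @ w2) * h * (1 - h) # zero for the faulted unit
        w2 += lr * delta_out.T @ h
        w1 += lr * delta_h.T @ x
    return w1, w2

# e.g. the XOR problem, with a constant third input acting as a bias
x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
w1, w2 = train_with_fault_injection(x, y)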

Work submitted for publication by Murray and Edwards [87] looks at the consequence

for operational reliability of training with weight perturbation and destruction. The

research described is similar to results presented in this thesis in chapter 6. Their

method of training with weight perturbations is similar to Sequin and Clay's technique

of injecting transient unit faults described above. The recognition that weight faults

should be modelled rather than considering units to be defective agrees with the

methodology described in chapter 4.

Segee and Carter [26] have applied Sequin and Clay's training technique in Gaussian

Radial Basis Function networks. They found that the results obtained were much more

marked in GRBF's than in MLP's. If during each training epoch many units were

faulted in the GRBF network, it was found that the operation of the final trained

network would not degrade even after about 10% of the network's weights were

deleted. They also examined initialising the weights in the MLP in a "well-chosen

manner" as occurs in GRBF's. It was found that this did improve the fault tolerance of

the MLP, though only slightly.

Clay and Sequin have also looked at improving the fault tolerance of layered

feedforward neural networks [41] solving function approximation problems. Gaussian

hidden units were used which implies that they only respond over a limited volume of

the input space. Their method limits the maximum output of each hidden unit to a small

value which greatly reduces its contribution to the final output, though this does imply

that a large number of hidden units will be required. Every output is formed from the

additive response of several units, and this implies that a degree of redundancy exists.

However, their simulations do not take account of the increased level of faults that will

occur in such a neural network due to its increased size and complexity. For this reason, it

is unclear whether this method does actually lead to improved fault tolerance.

An alternative approach to improving the fault tolerance of MLP's has been undertaken

by Bugmann et al [29] who consider extending the error function used by the

back-error propagation learning algorithm to include information on the "robustness" of

the network. The term they add to the error function is

$E_{ym} = \frac{1}{2N_h}\sum_{p}\sum_{i}\sum_{k}\left(y_i - y_{i,k}\right)^2$

where yi is the actual output, yi,k is the output with hidden node k faulted, and Nh is the number of hidden nodes. This measures the

normalised increase in error due to the effect of all possible single faults in all hidden

nodes and for all patterns. An MLP network with 10 hidden units was then trained on

the XOR function using the modified back-error propagation algorithm. It was found

that fault tolerance was increased, but the solution found was not ideal, resulting in

reduced accuracy. They suggest that this may have been due to the MLP being trapped

in a local minimum. Prater and Morley [88] also examined this method, though they

used a conjugate training technique, and found that it did give consistent results. This

may be due to the improved optimisation properties of conjugate gradient descent over

that of back-error propagation.
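
A sketch of computing this additional term (the interfaces for obtaining fault-free and faulted outputs are hypothetical, and Nh is taken to be the number of hidden nodes):

import numpy as np

def robustness_penalty(net_output, faulted_output, patterns, n_hidden):
    # E_ym = 1/(2*N_h) * sum over patterns p, outputs i and faulted hidden
    # nodes k of (y_i - y_{i,k})^2, as in the modified error function above.
    total = 0.0
    for p in patterns:
        y = np.asarray(net_output(p))              # fault-free outputs y_i
        for k in range(n_hidden):
            y_k = np.asarray(faulted_output(p, k)) # outputs with hidden node k faulted
            total += np.sum((y - y_k) ** 2)
    return total / (2.0 * n_hidden)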

Another approach considered by Bugmann et al was to locate the unit which would

cause the maximum damage when removed, and to replace another unit with a copy of

it. The pruning method employed was to select the hidden unit which had the largest

weight on its output. A similar approach has also been taken by Prater and Morley

[88]. Bugmann et al found that the final trained MLP was "robust", and its accuracy

greater than that resulting from the first method described above. However, their results

are very limited. Prater and Morley considered both larger networks and more realistic

problem domains, and they concluded that this technique gave very inconsistent results.

Prater and Morley have also considered another approach which involves adjusting the

bias of units. This results from their observation that a fault causes both information

loss and a bias change [34]. The technique involves storing the average input excitation

for every hidden and output unit. When a fault occurs, biases are altered such that the

displaced excitation values are restored to their stored values. They found that this was

very effective for improving reliability, though it is obviously ineffective against faults

occurring in an output unit's bias weight.
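
A sketch of the bias-adjustment idea (the way the average excitations are recorded and the way a bias absorbs the displaced excitation are inferred from the description above rather than taken from Prater and Morley's implementation):

import numpy as np

def record_mean_excitations(excitation_samples):
    # excitation_samples: array of shape (n_samples, n_units) holding the summed
    # input excitation of every hidden and output unit over representative data.
    return excitation_samples.mean(axis=0)

def restore_biases(biases, stored_means, current_means):
    # After a fault, shift each unit's bias by the displacement of its average
    # excitation so that the stored mean excitation is restored.
    return biases + (stored_means - current_means)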

2.8. Fault Tolerance of "Real" Neural Networks

It is vital to investigate the fault tolerance of neural network models at the abstract

level, but consideration must also be made of how to implement a neural network

model using some fabrication technology (electronic, optical, biological, etc.) such that

the inherent fault tolerance of the model is retained. Also, additional techniques specific

to the fabrication technology can be applied to further enhance the reliability of the

system (see chapter 4). This latter consideration might also be vital to protect the

inherent fault tolerance of the model, depending on the implementation.

In large systems, due to the multiplicity of individual units, the loss of a few units is

unlikely to cause any noticeable decrease in accuracy in the overall system. Indeed,

some work has shown that unit losses of up to 40% can be tolerated in the Hopfield

model [32]. This tends to suggest that non-critical components within the neural

network system need not be particularly reliable due to the inherent overall fault

tolerance, as has also been noted by Belfore and Johnson [22]. However, Moore [36]

makes a contentious claim that because of the large number of components "neural

computers will need to be designed with high quality components." This seems most

unlikely.

When neural networks are included in commercial systems, it will be necessary that the

reliability of the implementation can be assessed. So, although it is vital to analyse

neural network models at the abstract level initially, eventually it will also be very

important to analyse various implementation architectures, as well as taking into

consideration the technology chosen.

2.9. Conclusions

This chapter has reviewed past and current literature which considers the fault tolerance

of neural networks. It can be seen that very few rigorous approaches to studying fault

tolerance in artificial neural networks have been made, and those which do exist tend to

raise yet more questions to be answered. Also, much of the work lacks a sound

framework for investigating fault tolerance issues. For instance, very few papers

consider how fault models should be constructed for an artificial neural network. Little

or no consideration is given as to which abstract components in a neural network should

be chosen as fault locations, and similarly for their manifestation(s). When techniques


for constructing a fault model for a neural network are discussed, the tendency is to

base it on some possible implementation. However, this approach greatly reduces the

fault model's generality and introduces aspects relating to faults which are inherent to

the particular fabrication technology employed.

The absence of a sound methodology underlying published work limits the conclusions

that can be drawn from simulations performed to quantify the reliability of various

neural network models. For instance, it is difficult to compare the results from one

model to another in a meaningful way since there is no agreed quantitative measure,

independent of the particular neural network being studied, for assessing the effect of

faults on reliability.

Also, the fault model chosen for such a study has been shown to strongly influence the

perceived fault tolerance of the neural network, and so a more sound methodology is

needed for constructing fault models. The various correlations and interdependencies

between these and other factors which can influence the inherent and measured fault

tolerance will need to be placed into the investigative framework mentioned above.

The central question of whether neural networks possess inherent fault tolerance still

appears to be in doubt in the literature. However, the root of this disagreement would

seem to arise from two different sets of preconditions. The style of neural computation

and the architecture of neural networks have been proposed as the reasons for inherent

fault tolerance existing. However, some work presents simulations which show that

trained neural networks do not exhibit resilience to faults. The difference between these

two standpoints centres on comparing the paradigm of neural networks with trained

neural systems. The negative empirical results cannot be taken as absolute confirmation

that inherent fault tolerance does not exist, they merely show that current training

algorithms do not produce neural network configurations which allow inherent fault

tolerance mechanisms to be employed.


CHAPTER THREE

Concepts1

3.1. Introduction

This chapter examines various concepts relating to the fault tolerance of the

computation performed by neural networks, such as information and processing

distribution, generalisation, etc. The notion of how failure occurs in a neural network

will also be discussed, and the general consequences of this in assessing neural

networks' fault tolerance. Also, the question of how suitable neural networks are for

application in various problem domains will be considered. These various areas lead to

the notion of studying the computational fault tolerance of neural networks in this thesis

rather than that of various physical implementation technologies. Note that, to limit the

scope of this thesis, not all the concepts described in this chapter will be investigated further.

Section 3.2 discusses the general notion of learning in neural networks, especially that

of supervised learning. Sections 3.3 and 3.4 respectively consider the properties of

distribution and generalisation in neural networks, and consider their implications for

the fault tolerance of a neural network. Section 3.5 examines the architecture of neural

networks, and the influence it has on fault tolerance. The concept of failure is then

discussed in section 3.6, which leads to the development of a classification of problem

domains in section 3.7. Graceful degradation is also discussed. The idea of

computational fault tolerance is then described, and reasons given for its study in this

thesis based on the contents of the previous sections in this chapter. Finally, section 3.9

discusses the problems in verifying adaptive systems based on neural networks.

1 Parts of this chapter have been published in [102].


3.2. Learning in Neural Networks

A property of neural networks which strengthens the case for their future application is their

capability to learn how to solve a problem rather than being specifically programmed.

In general, a neural network can be considered to consist of the following:

Various architectural components such as perceptron units, weights,

preprocessing elements, etc.

Possibly a control algorithm which specifies how the separate components

operate as a whole. Alternatively, their operation could be autonomous.

A learning algorithm.

The numerous variables in the neural network (e.g. topology, weights, squashing

functions, etc.) decide the ultimate function computed by it. These can either initially

be, as is generally the case, set to random values, or else some inbuilt knowledge can be

supplied. This latter initialisation technique has been termed learning with hints by

Abu-Mostafa [89].

The learning algorithm attempts to determine the values for these variables which will

solve the problem assigned to it. To perform this task, some information must be

supplied by an outside agent as to the nature of the problem. This may either be

precisely defined by supplying the required output for each input (supervised training),

or at least an indication of the neural network's error (reinforcement training). Another

alternative is that only inputs are made available to the learning algorithm

(unsupervised training), and the function which the trained neural network performs is

peculiar to its architecture, and its associated control and learning algorithms.

It can be seen that neural networks provide a general purpose computing system which

can be used (theoretically) to solve any problem2, though in some cases the costs

incurred may be unacceptable in terms of required architectural resources and/or

training time [90].

A conventional computing environment is reasonably flexible with a small number of

basic computational and structural constructs with a large range of behaviour which

2 It has been shown that a multi-layer perceptron network is equivalent in computational power to a

Turing Machine [57]


must be organised by a programmer to solve a problem. For example, arithmetic and

logic computation constructs such as ADD and OR, and structural constructs such as

IF-THEN, WHILE-DO. In contrast, a neural system consists of many computational

elements with very limited behaviour which are topologically structured in a highly

complex fashion coupled with a specific learning algorithm. It will be seen that, at least

in part, these differences require unusual fault tolerant techniques to be developed to

improve the reliability of neural networks.

3.2.1. Supervised Learning

The concept of supervised learning is not that of memorisation, i.e. storing associated

input-output data. In both supervised and reinforcement learning it is clear that the

neural network is taught to represent the particular problem. However, in the case of

unsupervised learning, the problems which can actually be taught are limited to those

which naturally match the neural network's dynamics (e.g. topological mapping [91]).

Reinforcement learning is more flexible, though it is difficult to develop reliable and

fast training schemes. Supervised learning provides a much more flexible approach.

However, it must be noted that the operation of the learning function is not that of

storing individual input and output vector pairs, but rather to abstract the underlying

problem and represent its solution using available resources. Merely performing

memorisation would not produce a system with any tangible benefits over that of a

conventional computing system.

During supervised training, patterns are presented to the neural network (often termed

loading [90]) which can be viewed as forming a functional mapping between the

associated input and output vectors. Hopefully the neural network will then have learnt

the real-world problem which the training patterns exemplified [92]. Various theorems

exist in computational learning theory [93] which supply a framework to examine this

area theoretically. For example, one result specifies the lower bound on the number of

examples which must be supplied to specify sufficiently the required functional

mapping [59].

The ability of neural networks to learn how to solve a problem provides interesting

possibilities for achieving a fault tolerant system. In chapter 2, various studies were

described which have examined retraining. However, this thesis concentrates on the


operational fault tolerance [19] of neural networks and does not directly consider how

retraining can be used to enhance the reliability of a neural network. It is shown in

chapters 5 and 6 that the process of learning does have great influence on the final

operational fault tolerance of a neural network.

3.3. Distribution

One of the features of neural networks' computation which is a major incentive for their

application in solving problems is that of distribution. Two distinct forms of this

property can be considered to exist: Distributed Information Storage, and Distributed

Processing. These will now be discussed in turn.

During learning, patterns are loaded into the neural network by modifying the various

weight vectors feeding each unit. For a multi-layer perceptron network [21] this is

reflected by the changes in the representations at the various hidden layers. More

generally, it can be viewed that the representations formed by functionally separate

modules in a neural network are changed. Since potentially every weight is altered in a

training cycle, one can say that the information supplied by that training example is

incorporated (or stored) in every weight. This is the intuitive understanding of the

term distributed. During recall every weight takes part in the neural network's

operation, and so the information collected is seen to be stored in a distributed fashion

across all of the weights.

Distributed processing refers to how every unit performs its own function

independently of any other units in the neural network. However, it should be noted that

the correctness of its inputs may be dependent on other units. The global recall or

functional evaluation performed by the entire neural network results from the joint

(parallel) operation of all the units.

The distributed computational nature of neural networks seems at first to present a

serious drawback to developing a fault tolerant neural system. Although faults will

always identify themselves to some degree, they cannot be located easily. In an

implementation each component would require extra circuitry to detect and signal the

occurrence of a fault. The cost and reduction in overall reliability of the system might

render this approach unsatisfactory. It seems that this would constitute a major


disincentive to the use of neural networks for reliable computation. However, neural

networks also have another important property: they can learn (c.f. section 3.2). This

feature will allow a faulty system, once detected, to be retrained either to remove or to

compensate for the faults without requiring them to be located. The re-learning process

will be relatively fast compared to the original learning time since the neural network

will only be distorted by the faults, not completely randomised.

It will be seen, especially from the results in chapter 6 on multi-layer perceptron

networks, that these holistic concepts are very general and can easily be misapplied

when dealing with specific architectures. For example, the notion of information being

stored in a distributed fashion across all the weights in a neural network should

correctly only be viewed as such between separate modules. A module is defined as a

large-scale functional unit performing a discrete operation with respect to overall

operation. In the case of a multi-layer perceptron network, such a module comprises

a single layer, i.e. a change in representation.

As well as distributing information across all units within a neural network it is also

beneficial if the information load on every unit is approximately equivalent. This will

decrease the chance of having critical components which might cause system failure,

even if the remainder are free from faults. It will be seen in chapter 5 that significant

improvement in the degree of fault tolerance exhibited by a neural network results from

ensuring that information distribution is uniform. Also, the effective capacity of a

system should be increased by ensuring uniform distribution of information since

resources will be used more efficiently.

3.4. Generalisation

One of the most important attributes of a neural network's style of computation is that

of generalisation. This refers to the ability of a neural network which has been trained

using a limited set of training data, to supply a reasonable output to an input which it

did not encounter during training. As an adaptive system, generalisation in a neural

network can be considered to refer to it learning to represent the underlying problem

rather than just memorising the particular inputs in the training set. An unknown input

then merely becomes an input to be processed. The quality of the generalisation

exhibited will depend on how accurately the neural network represents the problem.


Lippman [2] reflects this view of generalisation by assessing how well a neural

network generalises in terms of whether any learning progress can be made given a new

set of training data chosen from the same problem.

Robustness to noisy inputs in classification systems can be a product of generalisation.

This is because inputs from regions surrounding training input patterns will produce the

same output as for the original training pattern due to generalisation. However, this

robustness should not be confused with resilience to defects affecting components, i.e.

fault tolerant behaviour. Note that there is an implicit assumption on the distribution of

patterns in input space resulting from noise affecting a particular input pattern. It

requires that this noise distribution is approximately equivalent to the generalisation

distribution of input patterns, i.e. input patterns considered to be in the same class.

Figure 3.1 illustrates noise distribution different to that for generalisation. An example

occurs when continuous inputs are binary encoded: a single bit error can result in a

large error in the underlying continuous space. To satisfy this assumption, a suitable

choice for the input data representation must be made.

The rest of this section will initially examine various characteristics of generalisation

that can occur in neural network computation, and then discuss their consequences for

fault tolerance. Next, the concept of constraining a neural network to be fault tolerant,

as a mechanism to improve generalisation, will be examined.

3.4.1. Local vs. Global Generalisation

Two distinct computational techniques by which a neural network generalises can be

identified by considering the nature of the response of internal units to inputs ranging

Figure 3.1 Distribution of a noisy input pattern does not match its generalisation distribution in input space (the distributions due to noise and due to generalisation are shown)


over the input space. Some neural networks employ units which only activate for inputs

in a limited bounded region of input space, e.g. Radial Basis Function networks [94]

and CMAC networks [86]. An unknown input will only activate those units whose

activation regions include the new input. Other units will remain inactive. So, the

generation of a suitable output is only influenced by the units whose activation regions

surround the input pattern. Due to the limited region of input space involved in

generalising a new input, this form of generalisation is termed local.

The other computational method by which unknown inputs are processed is termed

global generalisation. This is where the internal units of a neural network respond to all

inputs lying anywhere within the input space. An example is the multi-layer perceptron

network [21] where the output of a unit is determined by a function of the distance of

its input from a hyperplane. An unknown input will cause all units to respond, and their

combined computation provides a suitable output.
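
The difference between the two styles can be seen by comparing the response of a single locally tuned unit with that of a single globally responding unit over a one-dimensional input range; the Gaussian and sigmoid forms below are standard choices used purely for illustration:

    import numpy as np

    x = np.linspace(-5.0, 5.0, 11)

    # Local unit (radial basis function): responds only near its centre c.
    c, width = 1.0, 0.5
    local_resp = np.exp(-((x - c) ** 2) / (2.0 * width ** 2))

    # Global unit (perceptron with sigmoid threshold): responds, to some degree, to
    # every input, as a function of its distance from the hyperplane w*x + b = 0.
    w, b = 2.0, -1.0
    global_resp = 1.0 / (1.0 + np.exp(-(w * x + b)))

    for xi, lv, gv in zip(x, local_resp, global_resp):
        print(f"x={xi:5.1f}   local (RBF) = {lv:5.3f}   global (sigmoid) = {gv:5.3f}")

The local unit's output is effectively zero outside a small neighbourhood of its centre, so its loss only matters there, whereas the global unit contributes to the output everywhere in the input space.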

Relationships can be drawn between these two facets of generalisation and the concept

of distribution in neural networks. Global generalisation implies that a more distributed

representation can be formed since the function of all units must be altered during

learning for every input pattern. However, locally generalising neural networks will

clearly favour local representations. Also, information processing will be more

distributed given global generalisation since all units are functionally active, as opposed

to only a few active units in the case of local generalisation.

These two computational techniques for generalisation result in different characteristics

for the possible fault tolerance of a neural network. If generalisation is local then it

implies that faults will cause generalisation to be suddenly unreliable in limited regions

of input space. Outside these local regions generalisation will be unaffected. This will

result in a system whose reliability is highly uncertain. Only when an input falls into a

region where the neural network's operation is affected will the effect of faults be

apparent and possible failure occur. However, faults affecting a neural network which

exhibits global generalisation will cause a small loss of generalisation for any input

pattern. This results in the effect of faults on the reliability of the neural network being

more uniform across the input space, but less drastic.


It is more difficult to compare the redundancy gained between globally and locally

generalising neural networks when their capacity is increased, for example by adding an

extra unit. Global generalisation implies that all units are involved in any computation,

and so increasing capacity will be observed across the whole system. However, for a

neural network which exhibits local generalisation, extra units will only increase

redundancy for patterns from a limited region of input space. This corresponds with the

above discussion on graceful degradation in locally and globally generalising neural

networks.

Overall, due to the combination of redundancy potentially increasing reliability

throughout a neural network and a more uniform effect of faults across the input space,

it would seem preferable to use a globally generalising neural network in a system. For

these reasons, only globally generalising neural networks will be examined in this

thesis.

However, it should be noted that if a large number of extra units are used in a locally

generalising neural network, the degree of overlap between the input space regions of

each unit can be increased such that a general improvement in fault tolerance will be

achieved. It is not obvious whether a similarly increased capacity in an equivalent

globally generalising neural network will be more effective.

3.4.2. Interpolation vs. Inexact Classification

As well as the various properties of generalisation due to the functional operation of a

neural network as described above, generalisation can also be differentiated depending

upon the nature of its application. The most common problem domains in which neural

networks have been employed are either to categorise input patterns into a set of

classes, or else to evaluate a functional mapping. The choice of thresholding function

employed in its output units is the principal influence determining which type of

operation a neural network performs. A pattern classification system uses non-linear or

hard-limited thresholded units to form discrete output patterns, while linear output units

are employed for functional mapping systems to produce continuous output values.
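
As a minimal illustration of this point (the weights and names below are invented for the example), the same hidden representation can be read out through hard-limited units for classification, or through linear units for functional mapping:

    import numpy as np

    def readout(hidden, V, c, mode):
        # Output layer applied to a hidden activation vector.
        # mode='classify'    -> hard-limited (Heaviside) units giving discrete outputs
        # mode='interpolate' -> linear units giving continuous output values
        a = V @ hidden + c
        if mode == "classify":
            return (a > 0.0).astype(int)
        return a

    hidden = np.array([0.2, 0.9, 0.4])
    V = np.array([[1.0, -2.0, 0.5],
                  [0.3,  1.0, -1.0]])
    c = np.array([0.1, -0.2])
    print(readout(hidden, V, c, "classify"))     # discrete output pattern, e.g. [0 1]
    print(readout(hidden, V, c, "interpolate"))  # continuous mapped values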

The style of generalisation required in both cases is very different. In the case of pattern

classification, inexact classification is required, while for functional evaluation,

interpolation is more suitable (see figure 3.2). These two areas will now be considered


in more detail.

Generalisation in the form of inexact classification applies when a neural network is

operating as a pure classification system. The reasonable response to an unknown input

is to match it to one of a fixed set of existing classes, or to none at all if the input

presented is too dissimilar from any of the known classes (see figure 3.2b). Note that

both local and global generalisation can occur dependent on the nature of the

functionality of the units in the neural network.

Generalisation in the form of functional interpolation is required when a neural network

is performing a mapping between two vector spaces which can either be discrete or, as

is more generally the case, continuous. An unknown input is assigned a (new) output

which is constructed from either nearby known mappings or else using more global

information (see figure 3.2a). These two cases correspond to local and global

generalisation respectively.

Due to the differences in architecture required for a neural network to solve problems

from these two application paradigms, it would seem reasonable that tolerance to faults

will also differ. However, it is not clear where such differences will arise, or what their

nature might be. It has been noted that when applied to function approximation

problems [26] locally generalising radial basis function networks (RBF's) will tolerate

more faults than multi-layer perceptron networks which are globally generalising. This

Figure 3.2 Forms of Generalisation: (a) Functional Interpolation, (b) Inexact Classification (key: training examples, unknown inputs, class A, class B, function trajectory)


is because the effect of faults is strictly limited within the problem space due to the local

nature of the operation of the units used in RBF's. For regions away from the fault, no

effect will occur. When an input is presented that lies within the region affected by the

fault(s), successful operation is only possible if sufficient units exist in that locality.

However, this seems rather wasteful in terms of resources and

implies that the holistic properties of distributed information storage will be weakened

due to the more local representation formed. For this reason globally generalising

neural networks are preferable. To achieve fault tolerance better training techniques are

required.

Note that the functional interpolation style of generalisation does not just apply to

neural networks performing a continuous function mapping (e.g. in robot control [95]),

it can also apply to classification problems where an interpolation to an unknown input

can also be appropriate. For example, if a neural network is trained to distinguish

different line segment orientations then it is useful if it can also recognise and give a

suitably interpolated output for line segments at orientations between those given as

examples during training. This occurs naturally in Kohonen networks [91].

3.4.3. Fault Tolerance as a Constraint

It has been proposed by various researchers that a requirement for fault tolerance can be

applied as a constraint in a neural network to achieve generalisation3 [28]. One study

has shown that this concept does have some empirical justification. A training method

which induces fault tolerance by injecting transient faults (described by Sequin and

Clay [71]) was shown to improve generalisation [96]. Although the classification

problem used was extremely simple, it is an indication that this concept bears

further investigation.
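
The general form of such fault-injection training can be sketched as follows; this is not the scheme of Sequin and Clay reproduced exactly, and the network layout, learning rate and the choice of one transiently faulted hidden unit per presentation are assumptions:

    import numpy as np

    rng = np.random.default_rng(0)

    def train_step_with_transient_fault(x, target, W1, b1, W2, b2, lr=0.1):
        # One back-error propagation step on a single-hidden-layer sigmoid network,
        # with one randomly chosen hidden unit transiently stuck at zero.
        n_hidden = W1.shape[0]
        mask = np.ones(n_hidden)
        mask[rng.integers(n_hidden)] = 0.0              # transient fault for this step

        h = mask * (1.0 / (1.0 + np.exp(-(W1 @ x + b1))))   # faulted hidden activations
        y = 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))            # network output

        # Standard delta rules, computed on the faulted network.
        delta_out = (y - target) * y * (1.0 - y)
        delta_hid = (W2.T @ delta_out) * h * (1.0 - h) * mask

        W2 -= lr * np.outer(delta_out, h)
        b2 -= lr * delta_out
        W1 -= lr * np.outer(delta_hid, x)
        b1 -= lr * delta_hid
        return W1, b1, W2, b2

Because the weight updates are always computed on a network containing a fault, the solution found cannot rely too heavily on any single hidden unit.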

One of the questions which arises when training a neural network (section 3.2) is

deciding its size so that it will just be able to solve the problem. Too few units result in

the neural network being unable to learn the problem fully. Conversely, it has been commonly

found that if too many units are used in a neural network then its ability to generalise is

much diminished. This has been variously ascribed to overtraining, learning noise

within the training set, and memorising individual members of the training set rather

3 Personal communication, Dr. Bruce Segee, University of New Hampshire (August 1991).


than forming a distributed representation. Alternatively, it can be considered that the

computational capacity of the neural network far exceeds the computational complexity

of the problem leading to this lack of generalisation. The degrees of freedom of the

neural network are not sufficiently constrained by the problem examples in the training

set (see figure 3.3) [92]. However, it is not always feasible to supply additional

examples of the problem due to physical restrictions, cost, etc. For example, sunspot

data and stock exchange information.

Given this situation, alternative mechanisms are required to constrain sufficiently the

training of the neural network by reducing its excess computational capacity, preferably

such that it exactly matches the complexity of the problem. One method of achieving

this has been by constraining groups of connections to share weight values, and was

very successfully applied in a multi-layer perceptron network trained using back-error

propagation to recognise handwritten digits [97]. Another method has been proposed by

Abu-Mostafa [89] which embeds a neural network with "hints" before training to

constrain the solution.
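
For comparison, the weight-sharing constraint mentioned above can be illustrated in a minimal form; the layer size, index table and placeholder gradient are invented for the example, and this is not the network of [97]:

    import numpy as np

    # A 4x6 layer in which every connection points at one of only three shared
    # weight values; training updates `shared`, not the expanded matrix W.
    shared = np.array([0.5, -0.3, 1.2])
    index = np.array([[0, 1, 2, 0, 1, 2],
                      [1, 1, 0, 0, 2, 2],
                      [2, 0, 1, 2, 0, 1],
                      [0, 2, 2, 1, 1, 0]])

    W = shared[index]                    # expand the shared table into the weight matrix
    x = np.ones(6)
    print(W @ x)                         # the layer output uses the tied weights

    # A gradient with respect to W is folded back onto the shared table by summing
    # the contributions of all connections that share each entry.
    grad_W = np.ones_like(W)             # placeholder gradient, for illustration only
    grad_shared = np.zeros_like(shared)
    np.add.at(grad_shared, index, grad_W)
    print(grad_shared)                   # one accumulated gradient per shared weight

Tying connections in this way reduces the number of free parameters, and hence the excess computational capacity, without reducing the number of units or connections.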

However, given the biological foundation of neural networks only the latter of these

two constraint methods has much credence. It seems most unlikely that disparate

synapses would have exactly the same effectiveness. However, the possibility that a

brain has inbuilt knowledge about problems which it will encounter is plausible but

Figure 3.3 Require sufficient training examples to constrain a neural network to represent the underlying problem (an under-constrained problem representation versus one in which an extra training example constrains the solution to match the problem)


only at a very basic level, such as the initial structure of synaptic connections,

interconnections between various modules, etc. The amount of information which can

potentially be stored in the brain far exceeds that which can be represented in the genes.

So only very basic constraints would be plausible.

The alternative method of constraining learning in a neural network by imposing the

condition that it is fault tolerant, as mentioned above, will now be considered. To

achieve fault tolerance, a degree of redundancy in some form will have to be developed

in the neural network during training, and this will reduce its computational capacity as

required. It is evident that biological neural networks are extremely fault

tolerant. However, it will be seen that this is not an inherent feature in artificial neural

networks, and must be developed using suitable techniques during training. This

discrepancy is an indication that fault tolerance is perhaps a constraint employed by

nature to develop neural systems which can successfully generalise.

Another significant advantage of applying fault tolerance as a constraint is that it

provides a mechanism for a neural network to be scalable [98], i.e. it can increase or

decrease the complexity of the problem which it solves. This is because the degree of

redundancy evolved during initial training to produce fault tolerant operational

behaviour can be reduced if the complexity of the problem which the neural network

solves increases, or vice versa. This can be seen to occur in a limited manner in the

Hopfield model [13] where a tradeoff exists between the number of patterns that can

be stored and the network's fault tolerance [32]. As more patterns are stored the

complexity of the problem is increased and the observed fault tolerance decreases.

3.5. Architectural Aspects of Neural Networks

It has already been noted that neural networks consist of many simple (often

homogeneous) processing elements connected via a complex communication network.

Typical neural networks which have been taught to solve problems such as reading text

[99], recognising sonar images [100], etc. have a few hundred units and many thousand

weighted connections. The combination of extreme simplicity in individual processing

units and the multitude of units and connections implies that the failure of a particular

computational element should not be critical for the operation of the system. This is


often argued to imply that a degree of fault tolerance should exist in artificial neural

networks due to the existence of such redundancy.

In terms of information distribution each processing unit has a large fan-in of data on

which a relatively simple computation is performed. It seems likely, even if a

proportion of the incoming information is erroneous or nonexistent, that the unit should

still be able to function correctly in some limited fashion. Also, a processing unit feeds

many others and the information produced can be seen to be widely distributed in the

rest of the network. So even if some of its output communication paths are destroyed,

the information is not totally lost. It will be seen in chapter 6 that these two

architectural concepts do give rise to fault tolerance in multi-layer perceptron networks

as proposed here.

Some neural networks operate in an iterative manner, i.e. the final output is produced

by a repeated series of identical processing steps rather than just a single sequence of

operations being performed in one stage. An example is the Hopfield model [13]. At

each step the output converges towards the ultimate answer. This can be related to

temporal fault tolerance in the sense that small errors produced by faults at any stage

can be corrected by later steps in the processing due to their identical nature.

It can be seen that the peculiar architectural nature of neural networks tends to support

the reasoning that a degree of inherent fault tolerance should exist in artificial neural

networks. However, it will be seen that current training methods do not always produce

a neural network in which these fault tolerance inducing features are used to their

greatest extent.

3.6. Failure in Neural Networks

Failure in a system can be defined as the system not functioning as specified. This may

result in it producing erroneous results or not meeting performance goals such as timing

or consistency, etc. A more general, but weaker, definition would be that the system

does not meet the users' requirements. However, since the specification should define

precisely what these requirements are, the former definition of failure is more generally

accepted. In contrast, the user's definition of what the system should perform can


change or even be inconsistent. Also, in this case failure is not easy to define formally,

and so is difficult to recognise.

In a conventional computing system failure tends to be an abrupt halt of service.

However, neural networks naturally tend to exhibit graceful degradation (c.f. chapter

4), i.e. the service provided by the system deviates gradually from that specified. The

argument underlying this claim tends to rely on the distributed nature of neural

networks. However, it will be shown below that it is the computational style of neural

networks coupled with the nature of the problem domains for which neural networks

are best suited which gives rise to this graceful degradation.

Since this continuous manner of failure occurs naturally in neural networks it allows

systems to be developed which have very useful innate properties. For instance, in the

control of dynamic systems faults will not cause a catastrophic event, but rather the

precision of control will be degraded. An example is the truck-backer-upper system

developed by Widrow et al [101] which controls the direction of a reversing articulated

truck such that it eventually docks at a loading bay. When faults occur within the neural

network, the truck still reverses to the docking bay, but in a similar fashion to a "drunk"

driver4 (c.f. chapter 2). It should also be noted that special design techniques to achieve

graceful degradation which would normally have to be applied during the development

of a system using conventional computational structures will no longer be necessary

when using neural networks.

3.7. Problem Classification

Neural networks are particularly successful in learning to solve certain types of

problems such as image recognition, classification, controlling dynamic systems, etc.

These are termed soft problems. Similarly though, other (rigid) problems prove

incredibly difficult for a neural network to learn, e.g. digital arithmetic operations. In

general, a complex problem can be split into many sub-problems, and each will fall into

one of the two classes above.

It is no coincidence that the capabilities of neural networks in these two classes of

problems correspond closely with those of biological neurocomputers, such as our own

4 Personal communication, Professor Widrow (July 1991)


brain, since neural networks are based on simplified models of such. This section will

first define the characteristics of a soft problem, and then relate how it naturally maps

onto the computation performed by a neural network. Lastly, it will be shown how

learning in a neural network also corresponds with soft rather than rigid problems.

3.7.1. Soft Problem Domains

The defining characteristic of a soft problem is that the property of adjacency exists in the

space formed by the parameters which describe the problem5. By adjacency it is meant

that if the problem's parameters are slightly altered then the nature of the problem also

only slightly alters. So if the function f_w describes a problem, with solution parameter space

w and ranging over inputs x, then

    f_{w+δw}(x) → f_w(x) as δw → 0, ∀ x    (7.1)

Thus, for a system which solves a soft problem, as the current location in its solution

space moves, the function of that system only changes gradually (see figure 3.4). This

closely matches the operational nature of neural networks where, as the variables which

determine a network's function are slightly changed, the nature of its operation also only

gradually changes. It is due to this correspondence that soft problems can be naturally

represented by neural networks.

This reasoning can now be linked to learning in neural networks. Assuming that a local

learning algorithm is used, i.e. it only uses information derived close to the current

status of the neural network, a soft problem is far easier to learn than a rigid

problem where a slight change in system parameters results in a wildly different

solution or problem definition. In other words the local information available to the

learning algorithm is sufficient to determine the required variable values in a soft

problem due to its very nature. The solution space of a rigid problem does not supply

enough information about the direction in which the solution to the problem might be

found.
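
The adjacency property can be demonstrated numerically. Under the assumption of a small sigmoid network with arbitrary weights (all values below are illustrative), the change in the function computed over a sample of the input space shrinks smoothly as the perturbation δw of the solution point shrinks, in line with the relation given above:

    import numpy as np

    rng = np.random.default_rng(1)
    W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)
    W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)
    X = rng.uniform(-1.0, 1.0, size=(100, 2))       # sample of the input space

    def f(X, W1, b1, W2, b2):
        H = 1.0 / (1.0 + np.exp(-(X @ W1.T + b1)))
        return 1.0 / (1.0 + np.exp(-(H @ W2.T + b2)))

    y = f(X, W1, b1, W2, b2)
    for delta in (1.0, 0.1, 0.01, 0.001):
        dW1 = delta * rng.normal(size=W1.shape)     # perturb the solution point w
        y_pert = f(X, W1 + dW1, b1, W2, b2)
        print(f"|dw| ~ {delta:7.3f}   max output change = {np.max(np.abs(y - y_pert)):.5f}")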

5 These are not the parameters which describe instances of the problem; they describe the nature of the

problem given some computational reference. So for neural networks, these parameters would be the

weights, biases, etc. Instances of the problem are the inputs fed to the neural network.



3.7.2. Considerations for Graceful Degradation

Considering a soft problem solved using a neural network whose computation naturally

fits in such a problem domain, it can now be seen why neural networks exhibit graceful

degradation as described above in section 3.5. A fault causes the functionality of the

neural network only to change slightly due to a small movement in its solution space

(see figure 3.4). This is because the neural network's computation is operating in a soft

problem domain, and so the service provided by the system is almost identical to the

original behaviour. Hence graceful degradation is evident. However, if a neural network

was contrived to solve a rigid problem, this graceful degradation would be unlikely to

exist when faults occur. This is because a small change in the parameters controlling the

functionality of the neural network would not map to a corresponding small change in

the nature of the problem being solved due to the lack of adjacency in its solution space.

3.8. Computational Fault Tolerance

Given the various features of neural networks' computation described above together

with their effect on the fault tolerance and resulting reliability of a potential system

employing neural networks, the aim of this thesis will now be explained in this section.

Figure 3.4 Effect of a fault in solution space (movement from the current point to a faulted point within the solution space)


As already stated, the aim of this thesis is to examine the computational fault tolerance

of artificial neural networks rather than that arising from any implementation

technologies. More specifically, the operational fault tolerance rather than the learning

fault tolerance of neural networks is to be examined [19], i.e. the resistance of trained

neural networks to the effects of faults during actual operation rather than during their

learning phase.

The term computational fault tolerance refers to a system being resilient to changes in

its overall functionality arising from the operation of abstract components in the system

being defective. The methodology for both selecting these components from an abstract

system definition and the manner in which they are defective (i.e. constructing a fault

model) will be explained in chapter 4.

Computational fault tolerance is subtly different to the physical fault tolerance of an

implementation since it considers the effect of "faults" on a system at a much higher

level of abstraction. For physical fault tolerance the faults modelled are based on

physical defects which could occur in the system, though abstracted due to

computational necessity (see chapter 4). However, for computational fault tolerance it is

the system which is abstracted, and then faults are based on actual components in this

abstracted definition. This alternative technique for studying the fault tolerance of a

system allows the possible fault tolerance within a computational paradigm to be

analysed before any implementation level design questions have to be answered.

However, it is not viewed as supplanting studies of the fault tolerance of possible

implementations, but rather it is an approach to guide development decisions during

system design. Both methods supply information on the effect of faults in a system but

just at different levels of abstraction.

The reasons for choosing to study this abstract form of fault tolerance rather than that at

an implementation level are:

There currently exists a lack of a suitable implementation technology for

neural networks. Individual units have very simple computation which does

not need high-speed components, but a fast, complex (three dimensional) and

dense communication medium is required. Silicon technology is at a tangent

to this requirement since it provides the ability to develop fast processors but


slow communications. Of more potential benefit is optical technology, but this

is still not fully developed.

It allows the commonly asserted statement that neural networks are inherently

fault tolerant due to their style of computation, distributed storage, etc. to be

examined. Also, the fault tolerance of neural networks can be studied free

from the influences arising from any particular fabrication technology

employed.

It will allow general results to be found on the fault tolerance of neural

networks that will provide information for many models and also supply

guide-lines for future implementations.

It allows analysis of the effect on fault tolerance of the specific features of

neural networks' computation described in this chapter.

It limits the scope of this research.

Although only the computational fault tolerance of neural networks will be examined in

this thesis for the various reasons given above, the results should have relevance for

future implementation designs. By studying the resistance of the computational style of

neural networks to certain deformations, implementations will be able to take advantage

of such computational fault tolerance that inherently exists. However, note that it

clearly is also possible that conventional fault tolerant techniques could be applied in a

design such as N-Modular Redundancy [4], etc.

3.9. Verifying an Adaptive System

This area will not be examined in any depth in this dissertation, but it is useful to

discuss a few of the major aspects in the problem of verifying an adaptive system

considering the potential promise of neural networks.

In section 3.2, the process of learning in neural networks was discussed. Although this

capability has great potential benefits, it also gives rise to some complications. An

adaptive system is generally considered unacceptable for use in a situation where

reliability of operation is paramount. This is because it is unclear how to verify that the

learning process always produces a system with the required functionality. Also, in the

case of on-line adaptation, i.e. where the function of a system changes during operational use,


it must be shown that the learning process causes the system's functionality always to

match more closely the required functionality. In the vast majority of neural network

applications to date, teaching is only performed before operational use. The verification

of the trained neural network's operation is very difficult, though it is believed that

computational learning theory [93] may be a tool which would be useful in undertaking

this task.

3.10. Conclusions

This chapter has examined various concepts of neural networks' computational nature

such as distribution of information, generalisation, graceful degradation, etc. Also, the

types of problems which neural networks are best suited to have been discussed, and the

influence these have on the potential reliability of a system employing neural networks

examined. The graceful degradation which is prevalent in neural networks has been

shown to arise from the soft problem domains in which neural networks are typically

applied coupled with the computational nature of neural networks lending itself to such

problem domains. Given these issues, the aim of this thesis in studying the

computational, rather than physical, fault tolerance of neural networks has been

explained and justified.


CHAPTER FOUR

A Methodology for Fault Tolerance1

4.1. Introduction

To study the fault tolerance of artificial neural networks as proposed in this thesis, two

issues will first have to be examined. The first addresses the question of which

components in a neural network could become defective and also the nature of their

defect, i.e. defining a fault model. A useful property would be if the fault model could

be made generic across many neural network models.

The second issue concerns how a neural network's reliability should be assessed. The

methodology described in the following sections provides a base from which research

into the fault tolerance of a neural network can be performed.

Although in this thesis only the computational fault tolerance of a neural network is

studied (c.f. chapter 3), it is proposed that the techniques given below for defining a

fault model and assessing reliability are also suitable when examining the fault

tolerance of neural hardware.

Section 4.2 considers general notions about the construction of fault models. Various

levels of visualisation for neural networks are then described in section 4.3, and the

problems of considering neural networks at an abstract level are discussed. Section 4.4

examines the locations chosen for defects in various conventional fault models, and

from this, section 4.5 describes how locations for faults can be selected from an abstract

definition of a neural network. Section 4.6 then gives two rules that should be followed

in defining the nature of such faults. Various considerations pertaining to spatial and

temporal aspects of neural networks and their application are then considered with

regard to assessing fault tolerance in section 4.7. Finally, the construction of fault

1 Parts of this chapter have been published in IJCNN-91 Singapore [111,112].


models for artificial neural networks is summarised in section 4.8. Section 4.9 briefly

considers the role of functional fault models. The concept of fault coverage is described

in section 4.10, and the degree of coverage in fault models constructed using the

method described in this chapter is considered.

Section 4.11 discusses reliability in neural networks, and then section 4.12 considers

how to measure the degree of failure in a neural network. Section 4.13 relates this to

assessing how fault tolerant a neural network is. Finally, section 4.14 discusses various

simulation frameworks within which the fault tolerance of a neural network can be

assessed, especially so that comparative results can be obtained.

4.2. Fault Models

The development of fault models is an essential part of the process in determining the

reliability of a neural network system. A fault model describes the types of faults that a

system can develop, specifying where and how they will occur in it. However, faults

become more difficult to formulate sensibly as a system is viewed at an increasingly

abstract level, especially the definition of how a fault manifests itself. It will be

shown below how sensible locations for faults can be defined in a neural network

viewed at the abstract level, and then the complex problem of how to detail the effect of

faults in these locations will be approached.

The entities listed in a fault model need not necessarily physically exist, but may be

abstractions of real-world objects. In general, a fault model is an abstracted

representation of the physical defects which can occur in a system, such that it can be

employed to usefully, and reasonably accurately, simulate the behaviour of the system

over its intended lifetime with respect to its reliability. Four major goals exist when

devising a fault model:

1. The abstract faults described in the model should adequately cover the

effects of the physical faults which occur in the real-world system.

2. The computational requirements for simulation should be satisfiable.

3. The fault model should be conceptually simple and easy to use.

4. It should provide an insight into introducing fault tolerance in a design.


However, these four requirements often conflict with each other resulting in the fault

model being compromised. For instance, simplicity, which leads to lower

computational requirements, may result in an inaccurate model if carried to excess.

4.3. Visualisation Levels for Neural Networks

Several levels of abstraction exist for visualising neural networks (see figure 4.1) at

which fault models can be developed, namely the abstract, architectural and

implementation levels. These relate in a complex manner to the levels of abstraction

(electrical, logical and functional) for which fault models are defined in digital systems

[103]. Considering only the implementation level, all three fault model abstraction

levels can be applied, just as for conventional computer systems. For the architectural

level, only the logical and functional abstraction levels of digital systems are

appropriate since the lower level electrical variables are hidden. Finally, the abstract

visualisation level for neural networks can only be related loosely to the functional level

of digital systems. The topological and mathematical state equations of a neural

network can be considered to be representative of a digital circuit's boolean state

function.


Figure 4.1 Visualisation Levels for Neural Networks: (a) Implementation, (b) Architectural, (c) Abstract. At the abstract level (c) the network is described by its state equations: output o_i = H( Σ_{j≠i} w_ij o_j ), where H() is the Heaviside function, and weights w_ij = Σ_s (2I_i^s − 1)(2I_j^s − 1), with w_ii = 0.
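
To make the abstract level concrete, the state equations summarised in figure 4.1(c) can be written out directly; this is an illustrative sketch of a Hopfield-style network defined purely by its topology and equations, and the stored pattern and initial state are invented for the example:

    import numpy as np

    def hopfield_weights(patterns):
        # w_ij = sum_s (2*I_i^s - 1)(2*I_j^s - 1), with w_ii = 0 (cf. figure 4.1c).
        P = 2 * np.asarray(patterns) - 1          # map {0,1} patterns to {-1,+1}
        W = P.T @ P
        np.fill_diagonal(W, 0)
        return W

    def update(W, o):
        # o_i = H( sum_{j != i} w_ij * o_j ), H being the Heaviside function.
        return (W @ o > 0).astype(int)

    patterns = [[1, 0, 1, 0, 1]]                  # a single stored pattern I^1
    W = hopfield_weights(patterns)
    state = np.array([1, 1, 1, 0, 1])             # the stored pattern with one bit flipped
    for _ in range(3):                            # iterate the state equation
        state = update(W, state)
    print(state)                                  # recovers the stored pattern

At this level of description no reference is made to how the weights, units or update rule would be realised physically, which is precisely what permits an implementation-independent fault model.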

In general, just as for the levels of abstraction for viewing digital systems, it becomes

progressively harder to define good fault models as one moves from the implementation

level to the abstract level for neural networks. For example, in digital systems it is

impossible to model current leakage at the logical level, but it can be modelled at the

lower electrical level. Restrictions also arise due to the simplification of continuous

electrical parameters to logical values.

4.3.1. Abstract Level

The objective in this thesis is to investigate the inherent fault tolerance of neural

networks which arises from their unusual computational features, such as distribution of

information, generalisation, etc. It is not to directly examine the characteristics of

physical implementations, either at the architectural or implementational level, where

reliability will be influenced by the physical components and fabrication techniques

employed. This means that employing concepts from the electrical and architectural

levels will not be appropriate. The main reason for taking this approach is that

implementation and architectural levels are too specific; technologies change and

architectures are numerous. This investigative direction will allow implementations of

neural networks to be designed in such a way so as to retain the inherent fault tolerance

within the model, as well as to enhance it by means of standard fault tolerance design

procedures.

Examining neural networks at the abstract level suggests that the definition of a fault

model is likely to be difficult, and also at first view, there seems to be a conflict with

the requirement that a fault model should adequately cover real-world defects.

However, although it is possible to identify reasonably the potential faulty entities in the

abstract model of the neural network, it will be seen that reference occasionally must be

made to implementation aspects of possible designs when describing the nature of their

deviation from proper behaviour. This leads to acceptable fault coverage. Also, viewing

a neural network at an abstract level eases the further goal of disassociating the fault

model from any particular class of neural network. This will allow comparisons

between results from such neural models, thus

indicating their relative merits.


4.3.2. Role of Fault Models

The fault model once defined can be used for two purposes. Firstly, if it covers physical

faults satisfactorily, then it can be used in the generation and application of a test

procedure to ensure that an implemented system operates according to specification.

Secondly, the fault model can be used in simulations of the neural network system to

evaluate a measure for its reliability. It is this latter case which is of interest here. Two

approaches exist for developing a measure for reliability in a neural network system.

The first is to use measures from existing reliability theory and apply them to neural

network systems; the other is to develop new measures. Both of these avenues will be

explored further in section 4.9. The overall objective is to define measures which are

generic in nature; they should apply across a wide range of neural network architectures

such that valid comparisons can be made between them, and again, as for the fault

models, simplicity and ease of use should be major considerations.

4.4. Conventional Fault Models

In formulating the locations for faults in an abstract visualisation of a neural network, it

is helpful to first examine existing fault models for conventional digital systems.

Considering a system at its most basic level, the physical faults which occur depend

upon the fabrication techniques used to implement the circuit, such as TTL or CMOS

for example. A few examples of such physical faults for the latter technology are

defects in the silicon, short circuits in metal, and holes in oxides used in transistors.

These very real faults are modelled by some more abstract representation in the related

fault model such that both accuracy and simplicity are hopefully achieved. Three levels

of abstraction tend to be considered for viewing systems, each with its own associated

class of fault model. See figure 4.2 for an example of a component represented in

electrical, logical, and functional form.

At the very detailed electrical level, example definitions of faults are changes in various

continuous variables such as voltage, resistance, and current levels. However, such a

fault model can only be useful for very small and simple systems. The computational

cost of modelling these very detailed variables quickly becomes prohibitive.

Next, the logical level only considers signal values which map to the logic (and

discrete) 0 and 1 values, and the corresponding faults are similarly more abstract. The


faults defined at this level include, but are not limited to, the well-known stuck-at faults,

e.g. stuck-at-1, stuck-at-0. Although the faults defined in the logical fault model are a

grossly simplified version of the physical faults which actually occur, they do bear a

reasonably acceptable functional relation to them. Also, computational costs are

reduced, though they are still considerable for present day circuit sizes. However,

physical faults such as current leaks and threshold voltage shifts cannot be represented

by the logical fault model.
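
As a reminder of what a stuck-at fault amounts to, it simply clamps a signal line to a fixed logic value regardless of the value driven onto it; a toy sketch (the line names are invented for the example):

    def apply_stuck_at(signals, faults):
        # signals: dict mapping line name -> logic value driven by the circuit
        # faults:  dict mapping line name -> value the line is stuck at (0 or 1)
        return {line: faults.get(line, value) for line, value in signals.items()}

    driven = {"a": 1, "b": 0, "carry": 1}
    observed = apply_stuck_at(driven, {"carry": 0})   # line 'carry' is stuck-at-0
    print(observed)                                   # {'a': 1, 'b': 0, 'carry': 0}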

Finally, at the functional level the fault model is defined using high-level information,

such as input/output specifications and circuit diagrams for example. This highly

abstract fault model is required when very large circuits are being considered or when

no information is available on the internal structure of the circuits' components. It may

also be used when the computational requirements of using a reasonably abstract logical

fault model are far too great. The very high level of abstraction results in the functional

fault model being implementation independent, but often also very imprecise and

heuristic in nature [4].

The quality of fault models is very variable. Although some may be uncomplicated and

conceptually simple to apply in simulations, they may not be particularly accurate with

respect to the faults a system would suffer in actual use. Although it is obviously

beneficial to have the former characteristics, the latter feature should be treated as a

primary objective. This then leads to a measure which has been used to indicate a fault

model's quality. Fault coverage is defined as the percentage of physical faults which are

identified by the fault model. However, this value is often very difficult to determine.

A B | A NAND B
0 0 | 1
0 1 | 1
1 0 | 1
1 1 | 0

Figure 4.2 NAND gate at 3 levels: (a) MOS, (b) Logic, (c) Truth table


4.5. Fault Locations

As each level (electrical, logical, functional) in a digital system requires a

corresponding fault model, the case is similar when considering fault models for neural

networks. When they are viewed from the implementation level, the conventional fault

models above can be employed, and for the higher architectural viewpoint, a functional

fault model is suitable. However, fault models for the implementation and architectural

levels cannot be specified generically since they are highly dependent upon the design

(e.g. the fabrication technology used), although standard existing fault models for

individual components (e.g. diodes, transistors, shift registers, etc.) could be envisaged

as building blocks in developing such a fault model. However, the objective here is that

of formulating a fault model for a neural network visualised from the abstract level. As

a further goal, the fault model should apply across a wide range of neural network

models. It is possible that functional fault models could be used since they are independent

of system implementation, though they may be limited since the abstract description of

a neural network differs widely from model to model. Functional fault models applied

directly to neural networks will be discussed later in section 4.8.

Although not an objective of this thesis, examining the application of more

conventional fault models to neural networks viewed at the implementation and

architectural levels (c.f. section 4.3) illustrates how locations for faults in an abstractly

defined neural network can be found. By noting the common features, and then

extrapolating from these observations, it will be shown how new fault models can be

devised for the abstract visualisation level.

At the implementation level the electrical fault model involves physical objects such as

connection wires, capacitors, and transistors. The logical fault model refers to the

slightly more abstract signal lines which interconnect logic gates, and it is these entities

which are chosen to be possibly faulty, for example s-a-0, s-a-1, or short-circuited.

Next, at the higher architectural level, components such as individual IC's or

communication lines are chosen to be candidates for fault locations.

In both of these cases it can be seen that locations considered eligible for faults are

atomic entities with respect to the conceptual level at which the system is being viewed,


or the tight interaction of a few such atomic entities. Also, these entities can be seen to

be either acting as functional units or information channels.

4.5.1. Fault Locations for Neural Networks

The above observation that the elements selected to construct the fault model at these two conceptual levels cannot be subdivided suggests that, similarly, at the abstract level the entities selected from the mathematical model should not be capable of being fragmented in terms of their role. For example, weights, links, and

threshold functions could all be candidates. Note that in addition, the entities from the

abstract definition which are eligible as fault locations should also have some

operational function or substance, rather than just being a temporary variable which is

used to connect various equations in the abstract definition together. For example,

output values associated with units are not considered eligible. A fault may cause an

output value to be erroneous, but faults cannot directly affect an output value.

In summary, the entities in a neural network, viewed at an abstract level, which should be considered eligible for possible inclusion in a fault model are the non-trivial atomic entities in its abstract definition. These entities should be limited to those which have potential for changing

information within the neural network, rather than those which merely transfer

information. Note that the generally large number of possible candidates for fault

locations arising from this procedure will be reduced when defining the manifestations

of the faults (section 4.6) and considering other factors (section 4.7).

Faults should be considered for both the operational and training phases of a neural

network, though only the former is vital when operational systems are going to be

"cloned" from a single once-only trained neural network. However, note that this is not

the case for an autonomous system employing neural networks since learning will be an

active function throughout its lifetime. Due to this, the abstract definition of a neural

network should also describe the learning algorithm.

This methodology provides a basis for a fault model which is independent of any

possible implementation, and simulation results should indicate the fault tolerance

inherent within the neural network model, i.e. that which arises as a consequence of the

nature of the computational method of neural networks.


4.5.2. Example

As an illustration of the above technique for the determination of fault locations, the

multi-layer perceptron neural network architecture [21] will be considered, and

reasonable fault locations will be identified from the abstract model. The description of

this abstract model is given in figure 4.3 which shows a graphical representation of the

neural network and the mathematical equations governing the system.

The various entities from the abstract definition for a multi-layer perceptron which can

act as possible fault locations are given below. For completeness, both the operational

and training phases are considered here.

Weights wij, not only for the operational phase where they are fixed values in the

multi-layer perceptron network once training has finished, but also for the

training phase. For simplicity, bias values θi are viewed as weights on

connections from a dummy unit which is permanently active.

Threshold Functions fi, a fault in a threshold function will alter the

transformation of the activation to an output value in some manner. This will

obviously affect both phases.

Derivative of Threshold Functions fi', this fault will only affect the system during

the training phase. It is identified as a separate fault since its function is generally

different to that of fi.

Constant Values, i.e. faults affecting any values which are fixed by definition. During the training phase an example would be the learning rate η.


Figure 4.3 Multi-Layer Perceptron Neural Network (layers of units from inputs to outputs), defined abstractly by:

Evaluation: output $o_i = f_i\big(\sum_j w_{ij} o_j\big)$, where $f_i$ is a differentiable monotonic function and the feeding units $j$ have already been evaluated.

Training: weight change $\Delta w_{ij} = \eta\, \delta_i\, o_j$, where for output units $\delta_i = (t_i - o_i)\, f_i'\big(\sum_k w_{ik} o_k\big)$ and for hidden units $\delta_i = f_i'\big(\sum_k w_{ik} o_k\big) \sum_l \delta_l w_{il}$.

Target Values ti, these are not included in the constant values above since it is

conceivable that a MLP network may be trained on data that is modified as time

progresses (e.g. Miikkulainen and Dyer [52]).

Topology, the connectivity of the neural network could easily be subject to faults

in various ways such as the loss of a connection between two units.

There also exist some entities which, although they represent information, have a strictly limited lifetime. For example, delta values δi have to be kept at each backward pass so

that errors can be evaluated at hidden units. However, they must be considered for

inclusion in the fault model due to their functional role in the operation of the

multi-layer perceptron network.

Activation Values ai = Σj wij oj.

Delta δi, faults in these are only relevant during the training phase.

Weight Change ∆wij, these are the alterations applied to the stable base weight values and, as for δi, faults are only applicable during the training phase.

Note that the concept of a "unit" becoming faulty is not specified above; it is only a

further abstraction from fault locations such as the threshold function, activation values

ai, input weights wij, etc. This is analogous to a stuck-at fault in a digital circuit covering

many (more concrete) physical faults.

It can be seen that a large number of possible fault locations exist for a multi-layer

perceptron network. However, when the actual manifestations for these faults are

defined, it will be found that a large proportion of them can be discarded.
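For reference, the fault locations identified above can be collected into a simple data structure; the grouping below is a sketch using assumed names rather than anything given in the thesis, recording the phase(s) in which a fault at each location can affect the network.

# Phases in which a fault at each abstract location can influence the MLP.
MLP_FAULT_LOCATIONS = {
    "weights w_ij (biases treated as weights from a dummy unit)": ("operation", "training"),
    "threshold functions f_i": ("operation", "training"),
    "derivatives of threshold functions f_i'": ("training",),
    "constant values (e.g. learning rate eta)": ("training",),
    "target values t_i": ("training",),
    "topology (connections between units)": ("operation", "training"),
    "activation values a_i": ("operation", "training"),
    "delta values delta_i": ("training",),
    "weight changes dw_ij (only if temporarily stored, e.g. with momentum)": ("training",),
}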

4.6. Fault Manifestations

Although the entities acting as possible locations for faults have been identified at the

abstract level from the mathematical model of a neural network, the actual nature of the

faults they suffer has yet to be defined. For instance, a threshold function might be

said to saturate (i.e. output one of its extreme values), a link in the topology might be

lost, a weight might be distorted in some way, etc. The latter example is especially

difficult to define sensibly since it is uncertain as to what form of distortion might


reasonably occur in the abstract universe which is considered here. It is due to the

abstract conceptual level at which the fault tolerance of artificial neural network models

is being viewed that this difficulty arises.

It is proposed that two main concepts exist for defining the manifestation of faults. The

first is to look solely at the abstract description of the neural network, and from this

distortions in the abstract universe can be applied. The alternative is to relate fault

locations to high-level implementation details or physical components. The details of

the fault can then be extracted from these comparisons and any constraints that arise.

The use of the first guiding principle above can be approached by defining the

manifestation(s) of a fault to be such that the maximum harm is caused to the system's

operation by the fault. This will capture all possible lesser manifestations, whether

likely or unlikely. The notion of maximum harm will depend specifically on the

component's context for which the fault is to be defined. In some cases, several

manifestations may suggest themselves, for instance, due to symmetry such as in the

sigmoid thresholding function. Generally, the fault manifestation will be dynamic rather

than causing a static change to normal function, and can be viewed as being an active

fault mode. This is because it is unlikely that a static fault mode will cause maximum

damage in all possible operational states of an entity.

The alternative concept of considering the faults which could occur given certain

implementational restrictions will have the consequence of degrading the generality of

the fault model. This is because various design questions will have to be answered in

applying these restrictions, such as the fabrication technology to be used, storage

method of weights, etc. Since different neural network models may well lead to

different answers to these questions, possible generality in the fault model will be lost.

However, in certain cases it may be possible to minimise this by developing fault

models whose construction only relies on an abstracted view of possible

implementations.

The technique for defining fault manifestations used in this thesis is a combination of

the two directions described above. First, possible faults are defined using the

maximisation of damage principle. However, this tends to lead to extreme fault modes

being developed. The second direction using information derived from implementation


considerations can then be applied to these fault modes, either restricting their effect or ruling them out altogether. This joint methodology for

defining fault manifestations will be seen in the example given below to construct

useful fault models which will allow the computational fault tolerance of neural

networks to be investigated. It also allows generic fault models to be constructed which

are largely independent of fabrication technologies, design techniques, neural

architectures, etc.

4.6.1. Example

Using these concepts and the fault locations identified in the previous example, the fault

model for the multi-layer perceptron neural network can be fully defined. It will be seen

that no absolute fault model can be developed, only a general framework. From this a

fault model can be selected according to requirements such as the degree of

implementation independence, simplicity to achieve computationally feasible

simulations, etc.


Figure 4.4 Graph of the threshold function f(act) against activation act, ranging over (-1, +1): (a) Continuous, (b) Discrete

4.6.2. Threshold Function

The failure modes of the threshold function f, when considered only from the abstract

viewpoint, can best be defined by examining its graph (see figure 4.4a). The clear

symmetries in the threshold function suggests three possible failure modes. The first

two relate to the well-known stuck-at faults, and are defined here to be

stuck-at-minus-one and stuck-at-plus-one. However, this method of constructing a fault

model is not conducive to assessing the quality of fault coverage.

The alternative technique of applying the maximisation of damage principle suggests

another failure mode, and this is dynamic, rather than the static stuck-at faults. The

threshold function is defined to saturate to +1 when the fault-free output would be less

than zero, and to -1 otherwise. This is a rather harsh fault during operational use2, since it

implies that the associated unit always outputs the incorrect value irrespective of "how

sure" it is, i.e. how large its activation is, and hence how close its fault-free output

would be to ±1. It could possibly be modified to take account of this by making the

fault probabilistic based on some function of the activation, for example Pr(Fault) = 1 − |f(act)|. This shows how the two concepts for defining fault manifestations discussed previously can be jointly applied to develop a reasonable but still wide-ranging fault mode.
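A small sketch of this fault mode is given below, assuming a tanh-style threshold function with range (-1, +1); the function name and the use of NumPy are choices made for illustration and are not taken from the thesis.

import numpy as np

def faulty_threshold(act, rng):
    """Saturating 'maximum damage' fault on a tanh threshold function:
    the output is forced to the extreme opposite in sign to the fault-free
    value, but only with probability 1 - |f(act)|, so units with strongly
    saturated (confident) activations are the least likely to be inverted."""
    fault_free = np.tanh(act)
    inverted = np.where(fault_free < 0.0, 1.0, -1.0)
    fires = rng.random(np.shape(act)) < 1.0 - np.abs(fault_free)
    return np.where(fires, inverted, fault_free)

# A confident unit (act = 4.0) is almost never inverted; an uncertain
# one (act = 0.1) shows the fault on most presentations.
rng = np.random.default_rng(0)
print(faulty_threshold(np.array([4.0, 0.1]), rng))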

The faults considered so far only apply to a continuous or analogue system. It is also

possible to define failure modes for a digital form of the threshold function (see figure

4.4b), though of course, this implicitly introduces reliance by the fault model on

implementation details. Example failure modes include elements of the digitised

function being corrupted, either randomly, or, by following the concept of maximum

damage, set to the opposite extreme of their fault-free value. This latter fault may again

be tempered by applying a similar probabilistic fault mode as above.

4.6.3. Differential of Threshold Function

The failure modes associated with the differential of the threshold function f' are similar

to the above. Since the graph peaks at +1 when act=0, and is symmetric about this point (see figure 4.4), the two stuck-at faults should be stuck-at-plus-one and stuck-at-zero.

2 Note that during training, although more time will be required to teach a neural network if a unit permanently reverses its threshold function, it is still very likely that the neural network will be able to learn the training set to the same degree as if no such fault had occurred.

To maximise the damage caused to the multi-layer perceptron's learning, the output

should be set to +1 as the fault-free value tends to 0, otherwise it should be set to 0. The

point of change could be defined to be where the sign of the curvature alters. This

causes the applied weight change to be always in the wrong direction. Similar failure

modes could be introduced for a discrete version of the function as described above.
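The corresponding fault on f' can be sketched as follows for f = tanh, whose derivative 1 − tanh²(act) peaks at +1 at act = 0; the switch point used here, where the curvature of the fault-free derivative changes sign, is a worked assumption for tanh since the text only states the principle.

import numpy as np

def faulty_dthreshold(act, switch=0.658):
    """Maximum-damage fault on the derivative of tanh: output +1 where the
    fault-free derivative would be small (large |act|) and 0 where it would
    be large (small |act|).  For tanh, the curvature of the derivative
    changes sign at |act| = arctanh(1/sqrt(3)) ~= 0.658."""
    return np.where(np.abs(act) > switch, 1.0, 0.0)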

4.6.4. Weights

Faults which affect the weights wij in a neural network are very hard to define sensibly

at an abstract level, though it will be seen that by using the maximisation of damage

principle this can be achieved. The direction which applies "vague" implementation information will be examined first. Two possibilities can be identified.

The weights can be considered as being held in a discrete form, such as binary

encoding, and then individual components can be corrupted in a similar fashion to that

for the discrete threshold function. The other alternative would be storing weights using

a continuous representation such as resistors for example. The model of the fault can

then be based on the fault characteristics of the component(s) used. For example, a

resistor is likely either to go open-circuit, which can be modelled by causing the weight to saturate to its maximum value, or to become noisy, which can be modelled by adding noise from a Gaussian distribution.

However, the interest in this thesis is defining purely abstract failure modes for weights,

and as such, they will be independent of any implementation, and so the computational

fault tolerance of neural networks can be investigated. Two very simple failure modes

would be to either set the weight to zero, thus causing the loss of any partial

information (due to the distributed nature of neural network processing) that the weight

held. The other fault mode, following the concept of maximising damage, suggests that

the weight should be multiplied by -1. This represents a unit always trying to

misclassify an input.


It will now be shown how this latter failure mode can also be derived from the abstract

definition of the multi-layer perceptron network, with a slight modification to decrease

its rather fierce nature, by examining the activation equation of a unit:

$$\mathrm{act}_i \;=\; \sum_{j=1}^{n} w_{ij}\, o_j \;-\; \theta_i \;=\; \mathbf{W}_i \cdot \mathbf{O} \;-\; \theta_i$$

The vector Wi is normal to a hyperplane in n-D space positioned such that the minimum scalar distance from the origin is θi. Input vectors O are then classified into a dichotomy depending upon which side of the hyperplane they fall (see figure 4.5). Following the notion of causing maximum damage, the failure mode of the weight should be chosen such that the probability of any input vector O being misclassified is maximised. So, if Wi' is the faulty weight vector, then for a particular input vector O:

$$\mathbf{W}_i \cdot \mathbf{O} > \theta_i \;\Rightarrow\; \mathbf{W}_i' \cdot \mathbf{O} < \theta_i \qquad\text{and}\qquad \mathbf{W}_i \cdot \mathbf{O} < \theta_i \;\Rightarrow\; \mathbf{W}_i' \cdot \mathbf{O} > \theta_i$$

For the first case (the second being similar), say wi2 is faulty, then for input vector O=(o1, o2, ..., on) to be misclassified3:

$$\left| w_{i2}' \right| \;\geq\; \frac{1}{\left| o_2 \right|} \left|\, w_{i1} o_1 + w_{i3} o_3 + \dots + w_{in} o_n - \theta_i \,\right|$$

Since oi is a continuous value over the range of the threshold function f, defined here to be the interval (-1,+1), this implies that for o2 very small, wi2' will have to be very large to cause an incorrect classification; in general,

$$w_{i2}' \to \infty \quad \text{as} \quad o_2 \to 0$$

Even disregarding the size of o2, a single weight would still have to be of large magnitude to dominate all of the other inputs to the unit for many input vectors. So, this fault definition is clearly too severe since the result of one weight being faulty will cause the overall unit to always give the incorrect answer. This destroys the notion of high fan-in causing individual inputs to be unimportant globally. Also, it would be unlikely that an implementation would allow potentially infinite weights, and this suggests a constraint which can be applied. A saturation limit can be applied on any faulty weight by restricting weights to the range [-W, +W]. Note that if O is discrete, then this constraint will still apply.

3 This assumes that all weights always contribute correctly to forming a unit's output.

So, the fault manifestation suggested by this analysis is to cause negative weights to

saturate to +W, and positive weights to -W.
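A sketch of the resulting weight fault is shown below; the saturation limit w_max is an assumed implementation constraint, and setting the weight to zero is the milder "loss of stored information" alternative mentioned earlier.

import numpy as np

def inject_weight_fault(weights, i, j, w_max=5.0, zero_fault=False):
    """Abstract weight fault for weight w_ij: either wipe the stored value
    (zero_fault=True) or, following the maximum-damage analysis above,
    drive the weight to the saturation limit of the opposite sign."""
    faulty = weights.copy()
    if zero_fault:
        faulty[i, j] = 0.0
    else:
        faulty[i, j] = -w_max if weights[i, j] > 0.0 else w_max
    return faulty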

4.6.5. Topology

The topology of a neural network was another area identified as potentially being

affected by faults, and manifestations need to be defined. An obvious fault is the loss of

a connection between two units, and this relates to the loss of an arc in a directed

acyclic graph which abstractly represents the topology of a neural network. Another

possible failure mode would be to randomly reconnect a link to another unit in the

neural network (possibly due to a short-circuit), though this fault would be far less

likely than the simple loss of a link. However, the consequences of this type of fault

would be more severe than the first since the nature of the neural network might be

completely distorted, e.g. the MLP becoming a feedback network and so possibly

non-deterministic.
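If the topology is represented abstractly as a 0/1 connectivity mask over the weight matrix (an assumption of this sketch, not a representation given in the thesis), the two topology faults described above reduce to simple mask operations.

import numpy as np

def lose_connection(mask, i, j):
    """Loss of the link feeding unit i from unit j."""
    faulty = mask.copy()
    faulty[i, j] = 0
    return faulty

def reconnect_link(mask, i, j, rng):
    """Rarer 'short-circuit' fault: the link from unit j is rerouted to a
    randomly chosen destination unit instead of unit i."""
    faulty = lose_connection(mask, i, j)
    faulty[rng.integers(mask.shape[0]), j] = 1
    return faulty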

4.6.6. Other Fault Locations

Other entities which were classed as fault locations are various constants such as the

learning rate η, and the target values ti. Constant values tend to be chosen from a

limited interval, and the definition of how the fault will affect them will depend upon

Figure 4.5 Active weight fault representing a unit which always tries to misclassify its input (fault-free weight vector W and faulty weight vector W'; one input X remains correctly classified while another input Y is now misclassified)

their function in the case of trying to cause maximum damage. For the learning rate η,

extreme failure modes would be to set it to zero or to its highest possible value. Its

value will typically be in the range (0,1]. A fault affecting a target value ti could be that

its value is changed to be opposite to that in a fault-free situation, thus maximising

damage to the neural network's learning. Target values generally only take the two

values at the extreme ends of the threshold function range. As a less severe fault,

randomisation, either supplying an offset or a new absolute value, might be more suitable

in both cases. Considering general implementation details, constants or target values

might be encoded in binary form or produced by dedicated hardware, and so similar

failure modes could be used as described above for the weights.

Other entities which were identified in the MLP network as possible fault locations hold

information whose lifetime is strictly limited. These are the activation values ai, delta

values δi, and weight changes ∆wij . The activation values are required throughout both

the evaluation and training stages, though delta values are required only in the latter

stage, but all will need to be temporarily stored. To cause maximum damage, their

respective failure modes would be either to set them to their opposite value, possibly constraining the activation and delta values to some limited range, or to apply some

randomisation process to them in a similar manner to that for the weights (see above).

Weight changes need only be considered if they are required to be temporarily stored,

such as when momentum is used during training, and then their failure modes will be

similar to those already given.

4.7. Spatial and Temporal Considerations

The fault model can be simplified by considering the relative merits of each fault in the

system. If a fault only occurs in relatively few places whilst another is widespread, then

so long as the former does not occur with very large probability with relation to the

latter, it would be acceptable to disregard it. However, if a fault which occurs with low

probability has catastrophic effects, then it should be included. For example, in a RAM

chip the amount of circuitry for the actual storage of bits far outweighs that for

addressing, so when simulating the chip to examine its reliability, the addressing

circuitry is considered to be fault-free. Since in a large neural network the number of

weighted connections is likely to be far larger than the number of unit associated


entities (e.g. threshold functions), then it would be reasonable that these could be

ignored during investigations.

Faults can be classified by two temporal characteristics: they are either permanent or

transient in nature. The most frequently occurring case has been found to be the latter

[104], and it can be further subdivided into "transient" and "intermittent" categories.

Transient faults are non-recurring, but intermittent faults occur given a set of

internal/external conditions, i.e. recurring. This latter form of temporary fault can

become permanent.

Due to the observed relative domination of transient faults [104], it is suggested that

any simulations or theory developed for neural network models should be based on only

transient faults occurring, though this will only become especially relevant for feedback

neural networks. This restriction is made since any fault tolerance analysis will then produce

realistic data as to the behaviour of an implemented system. Also, it allows the

complexity of any potential fault model to be greatly, but reasonably, decreased.

The effect a fault has on a system will also depend upon when it actually occurs with

respect to the system's operation. For example, a fault affecting a weight which occurs

sometime between the forward and backward passes of the back-propagation algorithm

for multi-layer perceptron neural networks will have different consequences than if it had occurred at the start of the forward pass. However, modelling this in the fault

model would greatly increase the complexity of any fault tolerance analysis. For this

reason it would be sensible to assume that the manifestation of a fault only occurs when

a functional sub-system of a neural network is not processing an input. This is a

reasonable assumption to make if the time taken for such a functional sub-system to

process its input is much shorter than the interval between input presentations.

4.8. Summary

In section 4.5 it was shown how locations for faults could be identified from an abstract

definition of a neural network. The manifestation of these faults was then considered in

section 4.6. Together with section 4.7, a selection of the possible fault modes can then


be taken to compose the fault model. To summarise, the methodology for producing

such a fault model is as follows:

1. The atomic entities within the system viewed at the conceptual level at

which its fault tolerance is being examined must be extracted.

2. Discard from these entities any which would not have a significant effect on

the reliability of the system. This may be due to the number of such entities

in the overall system being very small as compared to other entities selected

in step 1.

3. For each entity, the manifestation of the faults affecting it can be defined by

applying the principle of causing maximum damage to the system's

computation, restricted by considering certain implementation details.

4.9. Functional Fault Models

The role of functional fault models for neural networks will now be examined. A

functional fault model for conventional digital systems offers independence from

implementation details, though often at the expense of exactness and completeness. For

combinatorial circuits, the faults can be described by modifications to the truth table,

and similarly the state transition table for sequential circuits. When considering higher

level components (e.g. RAMs) as the atomic entities of the circuit, more complex

descriptions than truth or state transition tables need to be employed, and generally,

some formal descriptive language, embedding boolean expressions, is used [103].

However, since the majority of neural networks are of a continuous nature (rather than

the logical 1 and 0 of digital circuits), such methods are not applicable. For Boolean

neural networks [105] though, they can be directly applied since each unit can be

viewed as performing a fixed boolean expression which can be described by a truth

table. If the neural network involves feedback, then obviously a state transition table

must be used. However, such functional fault models are generally only suitable for

testing systems for faults rather than acting as a model to aid in the simulation of a

system to identify its fault tolerance characteristics. Also, for large systems, the

computational requirements of the fault model quickly become impracticable.


4.10. Fault Coverage

The measure indicating to what extent the fault model captures the multitude of

physical faults that occur in an implementation is termed fault coverage. To evaluate

the coverage of the fault models which have been discussed in the previous sections,

their two aspects of fault location and fault manifestation need to be considered

separately. Since the aim was to develop fault models for neural networks visualised at

the abstract level, the location of faults in the fault model cannot easily be related to that

which would occur in any possible implementation, and so the fault coverage is hard to

determine. Obviously, when "vague" implementation details are considered in defining

the failure modes this is improved, but implementation independence, and hence

generality, is decreased. However, combining this with the use of the damage

maximisation principle, good fault coverage is possible from purely abstract failure

modes which will be implementation independent. This is because, by causing maximum

damage to the functionality of the neural network, any lesser faults will be

encompassed.

It must be recognised though that for both of these fault models, and also the briefly

mentioned functional fault models, the fault coverage is generally very hard to

determine with any degree of accuracy. However, the abstract nature of the fault models

increases the possibility of them being generic in nature, due to the independence of

implementation. The fault models developed here can now be used in the process of

measuring the reliability of a neural network system, and this is the area covered in the

next section of this chapter.

4.11. Assessing Reliability

A basic requirement for almost all systems is some knowledge of how long it will

continue to function correctly. The reliability of a system depends upon a number of

factors such as the environment in which it will be used (e.g. spaceborne as opposed to

an air conditioned computer room), the design of the system which includes the quality

and type of parts used, fault tolerance techniques employed, and quality control during

assembly. All of these factors are related in a complex manner to each other involving

many trade-offs and mutual reinforcements. However, since neural network systems are

only being considered abstractly here, their inherent fault tolerance (which is one factor


for reliability) can be observed by investigating their reliability. Only in an actual

implementation will the other factors become relevant in determining the reliability

of the system. However, although the emphasis will be on abstract neural network

models, the reliability measures discussed will be equally applicable for

implementations in producing results, though for some methodologies, such as fault

injection for instance, it may be difficult to do so due to physical limitations.

Although it appears that neural networks do seem to exhibit some inherent fault

tolerance [32,70,77,106], a requirement exists for a generic approach towards

measuring just how fault tolerant such a neural network system is. This will allow

comparisons between various neural network architectures, and also hopefully between

models as well. Two standard methods which could supply the required assessment for

a neural network system are Fault Injection and Mean-Time-Before-Failure [4]. Techniques for assessing reliability such as these, as well as others which may be developed

in the future, all require a detailed description of the faults which can occur in the

neural network system which is being investigated. The fault models described above

will be used to meet this requirement.

4.12. Failure in Neural Networks

The nature of neural networks' style of computation does not lend itself to applications requiring exact and precise answers; rather, they are suitable for soft problem areas (c.f.

chapter 2). This means that failure will likewise be an imprecise event in most

situations. The assumption of failure in conventional systems being a discrete event is

not realistic for neural networks. This implies that the measurement of a neural

network's degree of failure must be done in a continuous manner. This is difficult since

they are essentially black-box systems and so their functionality can only be judged

from their interfaces. Thus measures which indicate the reliability of a neural network

can only use external information such as inputs, outputs, training data, etc. Although

specific measures may suggest themselves for particular neural networks, more generic

measures can be defined by considering various characteristics of neural networks.


4.12.1. Measuring Failure

There are various areas which must be considered in defining a reliability measure:

Continuous vs. discrete output units

Problem domain; classification, function approximation, etc.

Redundancy in output representation

Neural networks controlling a dynamic system

Neural network models which use some form of continuous threshold unit do not

compute definite, clear-cut answers for classification problems, but instead their output

merely indicates a tendency for a particular answer, and so the question of whether a

neural network has failed is hard to address. This problem is made worse still if the

neural network exhibits graceful degradation since the output units will not suddenly

change in value, but rather will slowly degrade towards uncertainty.

To define the failure of neural networks solving classification problems, a continuous

measure must be employed which reflects either the degree of certainty in its response

with respect to the wrong answer(s), or else the uncertainty in its response with respect

to what the answer(s) should be, fu. Note that this includes neural networks which use

their output units to indicate confidence since reliability measures relate to failure, and

only indirectly to faults. In this case, as an output unit degrades towards increasing

uncertainty, failure occurs with respect to the specification, and so will be detected by

the reliability measure. However, the increase in uncertainty may be due to the input

presented to the neural network and not caused by faults.

Conversely, for neural network models which require output units to be either on or off

(i.e. discrete valued rather than continuous representation), generally a Heaviside

function is used. These are possibly substituted for sigmoid threshold functions in the

output layer if used during training. To gauge failure in these units, the variable which

should be used is the activation, and then a similar method can be followed as above for

continuous threshold units. Activation must be considered since the thresholded output

value does not indicate where a unit falls between the extremes of absolute certainty

(saturated activation) and near uncertainty, that is, in the worst case a unit may be on

the verge of misclassifying an input. This can only be judged by examining the unit's

underlying activation.


However, output representations can also be redundant, and so the overall degree of

failure in the output units considered as a whole will be reduced, possibly completely,

fo. An example where this occurs is with Kohonen networks [107] in which a group of

output units are activated. So, any measure for the degree of failure of the neural

network must not solely consider failure of output units individually and independently,

but must also take into account this data representation redundancy. It might be argued

that if the output representation is redundant, then the degree of failure of individual

output units can be disregarded and only the entire output vector considered. However,

unless it is possible to measure in a continuous fashion how close the redundant output

is to the critical point where the redundancy becomes insufficient to mask multiple

partial unit failures, i.e. the redundancy is not hidden, the output units must still be

considered individually as well, fu. Another reason to consider only the entire output

vector (or subgroups of it) is if an output representation is used which defines the neural

network's response as an interpolation of several adjacent output units [108].

As well as the above, for applications which require a stream of outputs from a neural

network system (e.g. controlling a dynamic system) rather than just presenting a single

input to obtain a result, qualitative aspects of their function must also be taken into

consideration when evaluating the degree of failure of the system, ft. For example, a

neural network which balances a pole may do so in many different equally successful

ways, one of which might require very gentle motions to keep the pole balanced, but

another might involve large forceful oscillations to do so. There is a clear qualitative

difference between them, but a quantitative measure is required which will take account

both of these differences and also of how correct the output is, irrespective of

application or neural network model.

All of these factors must be combined together to produce a function which will supply a continuous value indicating the overall degree of failure within the neural network:

$$\text{Failure} \;=\; F\big(\, f_u(o_1), \dots, f_u(o_n),\; f_o(\mathbf{o}),\; f_t(\mathbf{o}_0, \mathbf{o}_1, \dots, \mathbf{o}_t) \,\big)$$

To summarise, correctness of output must obviously be incorporated which must take account of the appropriate value attribute of individual output units with respect to target values, and also the overall output vector due to possible data representation

redundancy. To include information on the degree of failure in a dynamic system, the

derivative of the output of a unit can be used to indicate fluctuating behaviour, and

some measure of deviation to capture extreme swings for example. Both of the latter

values are needed since fast small changes or slow large changes would not be

adequately detected by either on its own. The actual way in which these various factors

are combined will depend upon the application, focus of interest, etc.
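One way of combining these factors is sketched below; the particular sub-measures and the weighted sum are illustrative assumptions, since the text deliberately leaves the actual combination to the application.

import numpy as np

def degree_of_failure(outputs, targets, history, w_unit=1.0, w_vec=1.0, w_temp=1.0):
    """Illustrative combination of the unit-level (f_u), vector-level (f_o)
    and temporal (f_t) failure factors into one continuous value."""
    f_u = np.abs(outputs - targets).mean()          # per-unit deviation from target
    f_o = np.linalg.norm(outputs - targets)         # whole output vector, after redundancy
    history = np.asarray(history)                   # sequence of past output vectors
    f_t = 0.0
    if len(history) > 1:                            # fluctuation plus extreme-swing terms
        f_t = np.abs(np.diff(history, axis=0)).mean() + history.std(axis=0).mean()
    return w_unit * f_u + w_vec * f_o + w_temp * f_t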

4.12.2. Applying Failure Measures

To detect failure in a system, a monitor must have pre-knowledge of the correct

processing results for any input presented, and all of the above techniques for

measuring the degree of failure have implicitly required this. Generally, it is possible

either to specify exactly the mapping which the neural network is supposed to have

learned, or else a suitable test set can be constructed which reflects the nature of the

input domain of the problem. However, for neural networks which are required to

generalise and where the mapping cannot be exactly specified, this test set may be more

difficult to construct. In cases where an acceptable test set cannot be formed, the failure

measure adopted can be determined by characteristics of the application area, though

this will greatly reduce its generality. See appendix A for an example of this method of

assessing failure. It describes how the reliability of a neural network which performed either edge enhancement or clustering was assessed [109].

Since neural networks are black-box systems, the function for measuring the degree of

failure can only judge them based on the results at the output units for presented input

data. Hidden units cannot be used. This implies that the choice of the input test data,

which is used to assess the degree of failure in the neural network, may be critical for

certain applications. For example, a neural network may not generalise correctly in a

particular input region, and so cause a failure which can only be discovered if an input

is presented to the neural network from this incorrectly generalised region of input

space [110]. However, such failures will only result from deficits during training, or

perhaps due to faults in units which act as specific feature detectors. Any faults

occurring during operational use will cause an identifiable change in the output

independent of the input presented since neural networks process their inputs in a

distributed and parallel fashion; all components are actively involved in processing any

input presentation. This is unlike conventional computer systems where a fault may


only cause a failure for a specific input, and so the selection of a test set can be

extremely difficult. The problem of choosing a wide-ranging input test set for neural

networks is not so critical, though if reliance is placed upon generalisation, then

difficulties may arise.

4.12.3. Example

For the multi-layer perceptron network (see figure 4.3) the definition of failure is based

on the existence of a training set composed of pairs of input and output patterns. Two

cases exist for the definition of failure depending upon whether generalisation is

required or not. Note that if generalisation is relied upon, then the training set should

adequately sample the input-output space.

First, if generalisation is not required, then the distance of the output pattern op to the nearest incorrect target pattern ti can be considered. For failure not to occur,

$$\forall p.\; \forall i \neq p.\;\; \lVert t_p - o_p \rVert \;<\; \lVert t_i - o_p \rVert$$

This defines that the distance of the output from the correct output is less than that from any other output classification. The Euclidean metric $\lVert x - y \rVert$ could be used to determine the distance, though other metrics could be substituted as appropriate.

However, if generalisation is required, then a threshold HD can be set on the maximum distance that the actual output pattern op can differ from the correct pattern tp:

$$\forall p.\;\; \lVert t_p - o_p \rVert \;<\; H_D$$

Note that the concept of a distance threshold HD has analogies to that of basins of attraction, and its value should be set to a fairly small value if generalisation is heavily relied upon. It should certainly not exceed the minimum distance of any output pattern to another in the training set.

If the MLP is required to exhibit some degree of generalisation, then the target values should be augmented by additional input-output vectors which were not used in the training set, and represent suitable choices for testing required generalisation properties. There obviously exists a trade-off between degree of coverage of the input-output range

and the available simulation resources which may not meet the computational

requirements of large test sets.
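The two failure definitions can be expressed directly as a predicate; the Euclidean metric is the one suggested above, while the function name and argument layout are choices made for this sketch.

import numpy as np

def mlp_failed(o_p, targets, p, H_D=None):
    """Failure test for output o_p whose correct target has index p.
    H_D=None: failure unless o_p is closer to its correct target than to
    every other target.  H_D given: failure if o_p lies further than H_D
    from the correct target (generalisation case)."""
    d_correct = np.linalg.norm(targets[p] - o_p)
    if H_D is not None:
        return d_correct >= H_D
    d_nearest_wrong = min(np.linalg.norm(t - o_p)
                          for i, t in enumerate(targets) if i != p)
    return d_correct >= d_nearest_wrong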

4.13. Relationship to Fault Tolerance

Measures for reliability should not be confused with measures for fault tolerance;

reliability and fault tolerance correspond in some areas, but are unrelated in others.

Fault tolerant design methods are a technique employed to improve reliability. The

definition of reliability is the probability that a system does not suffer a failure for a

time period T, given that it was working correctly at time t=0. Clearly it is perfectly

possible for faults to be suffered without diminishing reliability if they do not cause a

failure during time T. Also, the reliability can be diminished not by faults, but due to

the system not meeting its specification and failures resulting because of this.

Conversely, fault tolerance characterises how the system behaves as faults are

introduced into it. High fault tolerance indicates that the system will not be adversely

affected by faults, whereas low fault tolerance implies that it will be very sensitive to

any faults which occur. Hence measures of reliability do not strictly assess the fault

tolerance of a system, but, if the system is correct (i.e. it meets its specification), they

can give an indication of the effects of faults by the length of the time period before the

reliability begins to decrease. Also, any fault will potentially be able to influence a

neural network's output due to its processing being both distributed and parallel; all

components are involved in any computation. This is in contrast to a conventional

computer system where a fault will only become a factor when that part of the system is

used, except for common mode faults.

To assess fault tolerance, the reliability of a system can be measured for a range of fault

levels. However, plots of the reliability of differently configured systems can only be

compared if the base on which it is measured accounts for varying complexity between

systems. For example, using time as the base and assigning a failure rate to each

member of the fault model would be suitable. To compare the fault tolerance of two

systems based on their reliability curves, a further condition must hold. The reliability

curves must not cross, and ideally should be of the same general shape. This can be

seen in chapter 5 where graphs plotting the reliability of a neural network are all of the

same characteristic S-form.


If systems being compared do not have the same characteristic reliability curves, then

this method for quantitative assessment of a system's fault tolerance will not be

applicable. Figure 4.6 illustrates this point. It can be seen that at time a, system A is

more fault tolerant than system B, but at time b, the converse is true.

4.14. Empirical Frameworks

It has been shown in sections 4.5 and 4.6 how a fault model can be defined for a neural

network viewed from an abstract level, and also in section 4.11 how the neural

network's reliability (and hence fault tolerance) can be assessed. In the following

sections, several methods are given by which an empirical investigation of a neural

network's fault tolerance can be undertaken. These are Fault Injection, Mean-Time-

Before-Failure, and Service Degradation. However, before these three methods are

discussed, the problem of defining a suitable timescale such that different neural

networks can be compared will be described, and approaches given to meet such a

requirement.

4.14.1. Timescales

Some techniques for assessing the reliability of a neural network will require the

concept of time to be defined so that, for instance, fault rates can be specified, or the

time before failure occurs can be measured. The choice of timescale (e.g. real-world

seconds, CPU seconds, number of transactions, etc.) is determined by various factors,

Figure 4.6 Comparing reliability (plotted against time) of systems A and B with different characteristics to assess fault tolerance

which are often in conflict with each other. Generally, the timescale should relate

sensibly to the characteristics of the application area, and to a lesser extent to the neural

network architecture used and the method of implementation.

For instance, a choice of measuring time in real-world seconds might be suitable for a

neural network system controlling some dynamical system, but not for a classification

application area where time would be better given in units of the number of patterns

presented. Similarly, it would not be suitable to choose real-world seconds for a

software simulation of a neural network; CPU seconds or the number of transactions would

be better. However, where a neural network model takes a non-deterministic number of

iterations to process an input (e.g. the Hopfield model), the units of time cannot be

based on a transaction count, but must rather be related to the number of iterations

performed by the system in evaluating an output, i.e. a measure that is invariant to

external controls or influences.

Not only must the timescale provide a suitable base from which to assess a particular

individual neural network's reliability, it must also allow valid comparisons to be made

between various different systems. These may or may not be based on the same neural

network model, and may even be non-neural systems. This means that the timescale

chosen must also take into account various factors such as the architecture and

implementation of the neural network model (e.g. evaluation algorithm, internal

components, etc.).

When comparing similar neural networks based on the same model all performing the

same task (e.g. MLP's with varying numbers of layers, hidden units), a large network

may well have better reliability when time is measured in number of pattern

presentations due to higher redundancy. However, an actual implementation of it will

take longer to process an input pattern than a smaller network performing the same task,

and so the number of faults occurring may well be greater in the long term. This

discrepancy should be compensated for in any comparative studies made. Producing

results that can be compared when using different types of neural network models or

non-neural systems requires similar consideration.

Two possible guidelines exist for choosing a timescale for a neural network system. The first is to examine the architecture, grouping together all of the parallel operations that are required during its processing stages, and then defining one unit of time to be the execution (in parallel) of any particular group. The other possibility is to examine the

abstract description of the neural network model (e.g. see figure 4.3), and to define a

unit of time to be a recognizable mathematical operation. Both of these will allow

comparisons between the same neural networks model but with varying internal

structure, though to compare different (or non-deterministic) neural network models,

some allowance must be made for the complexity of operation for each possible time

unit such that the various models are evenly balanced.

4.14.2. Fault Injection Methods

Fault injection techniques involve subjecting a system to a known number of faults,

then measuring the subsequent degradation. This has to be repeated many times to

achieve a statistically significant result. The measure used to assess the system must be

related to the degree of failure of the system, since it is reliability which is of interest

here. Note that a system may maintain perfect performance until a fault threshold is

reached, when it suffers total failure. The discussion above on measuring the degree of

failure of a neural network is applicable here.

The resulting plots from experiments of the reliability measure against the number, and possibly the various types, of faults injected into a system, which can be termed fault

curves, will indicate how an operational system will behave if the rate at which each

type of fault occurs is known.

Fault injection techniques do suffer from a number of shortcomings. By far the most

damaging is when a system can suffer more than a single type of fault, as will almost

certainly be the case. Fault injection simulations are very good at indicating the isolated

effects of a number of identical faults occurring in a system, but are not effective in

analysing a system when many different fault types have to be taken into account since

their effects will not be independent. This makes it very difficult to predict with any

degree of accuracy the effects of various faults on a system which would occur in

real-life use. Combining in some fashion the effects of particular individual faults

occurring in isolation is very unlikely to be similar to the effects of all faults occurring

together over a period of time; the effects of individual fault types cannot simply be

added together due to correlations between them.


In conclusion, fault injection methods are only useful to gain a very basic indication of

the reliability of a neural network system, though they may identify especially critical

faults which can then be protected against in any implementation design.

4.14.3. Example

For the multi-layer perceptron network (see figure 4.3), the fault model defined

previously in section 4.5.1 can be used for fault injection experiments. Since a

continuous measure is required for fault injection techniques, the partial failure

characteristic of neural networks due to their soft application areas can be exploited.

The definition of failure in the previous example can be used in defining a function f

measuring reliability, and since this is a probability, its codomain must range over

[0,1]. It should also be a continuous monotonic mapping since as the degree of failure

increases, the reliability should decrease. As before, two cases exist depending upon

whether generalisation is required, though they only differ in the argument given to f.

If generalisation is not required, then for a single pattern p, the measure of reliability can be given by

$$f_p\!\left( \max\!\left( \min_{i \neq p} \lVert t_i - o_p \rVert \;-\; \lVert t_p - o_p \rVert ,\; 0 \right) \right) \qquad \text{such that } f_p(0) = 0 \text{ and } f_p\!\left( \min_{i \neq p} \lVert t_i - o_p \rVert \right) = 1$$

which measures the difference between the distance of the closest incorrect output classification to the actual output and the distance of the correct output classification from the actual output. The difference is scaled to be in the range [0,1]. If the output is closer to an incorrect classification, then the reliability is 0.

However, if generalisation is relied upon, then for a single pattern p, the measure of reliability is given by

$$f_p\!\left( \max\!\left( H_D - \lVert t_p - o_p \rVert ,\; 0 \right) \right) \qquad \text{such that } f_p(0) = 0 \text{ and } f_p(H_D) = 1$$


To extend these two definitions to cover all patterns p, the maximum degree of failure should be chosen to gain an idea of the on-line performance,

$$f = \max_{p} f_p$$

and their average (possibly weighted) for off-line, i.e. if ρp is an indication of the importance of input-output space around pattern p,

$$f = \sum_{p} \rho_p f_p$$
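A linear choice of f_p satisfying the scaling conditions above is sketched here; the thesis only constrains f_p at its end points, so the linear interpolation and the function names are assumptions of this sketch.

import numpy as np

def f_p_no_generalisation(o_p, targets, p):
    """Reliability for pattern p: margin between the nearest incorrect
    target and the correct target, clipped at 0 and scaled into [0, 1]."""
    d_correct = np.linalg.norm(targets[p] - o_p)
    d_wrong = min(np.linalg.norm(t - o_p) for i, t in enumerate(targets) if i != p)
    return max(d_wrong - d_correct, 0.0) / d_wrong if d_wrong > 0 else 0.0

def f_p_generalisation(o_p, t_p, H_D):
    """Reliability for pattern p when generalisation is relied upon."""
    return max(H_D - np.linalg.norm(t_p - o_p), 0.0) / H_D

def overall(f_values, weights=None):
    """On-line figure (maximum over patterns, as in the formula above) and
    the weighted sum used for the off-line figure."""
    weights = np.ones(len(f_values)) if weights is None else np.asarray(weights)
    return max(f_values), float(np.dot(weights, f_values))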

4.14.4. Mean-Time-Before-Failure Methods

An alternative method for judging the reliability of a system is to measure the average

time period before failure first occurs. Just as for fault injection methods, the results

obtained are statistical in nature, and so precise conclusions cannot be made. However,

a major difference between these two methods is that failure is considered as a discrete

event for mean-time-before-failure, rather than as a continuous variable.

The discussion in section 4.14.1 on the definition of a suitable timescale is clearly

relevant for this method. Note that both the timescale chosen and the definition of

discrete failure will be somewhat dependent upon the application and neural network

architecture being considered, though some generalities may exist between sub-groups.

As mentioned previously, failure of a neural network is difficult to define since

generally, unlike most conventional computing systems, they do not suddenly and

totally fail when faults occur; some degree of graceful degradation or fail-soft nature is

apparent. Also, many of the possible applications for which they could be applied are

equally flexible when it comes to defining failure, such as in the neural network system

which balances a pole mentioned previously. However, the treatment of "failure" is

different for MTBF methods from that used in fault injection methods. Here, failure is a

discrete event: it either happens or does not happen, and so the continuous measures of

failure used in fault injection investigations cannot be directly applied. Instead, some

rules need to be defined which specify when failure has been deemed to have occurred.

A general definition of failure is that it occurs whenever the system does not meet its

specification. This places the burden of responsibility onto the specifier of a system,

and the specification must define in detail the acceptable behaviour of the system. This

will include the limits to which degradation can occur, and so creates the distinction


between failure and non-failure. These limits can be defined using the various general

conditions that were discussed above for fault injection methods, though others which

are specific to the neural network or application may be included by the designer as

appropriate. For example, an output unit could be defined to have failed when its output

deviates by at least 20%. A more global definition might be that failure occurs when a

neural network incorrectly classifies more than 5% of its inputs.

The basic MTBF technique can be extended when investigating neural networks to

assess the time between sequential failures since they can have the property of

automatic recovery from failures. This occurs since their functionality is unaffected by

errors in information processing caused either by transient faults or due to uneven

distribution of information. However, if feedback occurs in the neural network's

topology, then this might disrupt recovery since errors could be amplified.

In conclusion however, the rather gross simplification of failure from the continuous

degradation which actually occurs in a neural network to the discrete on-off event used

here, detracts from the usefulness of MTBF models for assessing the reliability of a

neural network system.

4.14.5. Example

To apply MTBF methods to the multi-layer perceptron (MLP) neural network, the

following requirements need to be met. A reasonable fault model needs to be

developed, a suitable timescale needs to be chosen, and the notion of failure in the MLP needs to be defined. The fault model defined earlier in section 4.5.1 can be applied here. A suitable choice of timescale will depend to a large extent upon the application chosen; for a

classification problem, the timescale could relate to the number of patterns presented.

Failure can be treated similarly to the above example, but replacing the function f by one which jumps from 0 to 1 when the distance threshold HD is reached if generalisation is relied upon, or else, if it is not, when the output pattern $o_p$ becomes closer to some other target $t_q$ with $q \neq p$.

By running many simulations, a plot of the cumulative number of simulation runs

against MTBF against the number of times a simulation has already failed (i.e. a 3D

graph) can be made. This will show the distribution of the MLP's failure rate, and also


it will show how a system will behave after it has suffered N previous failures.

However, it will not indicate the degree of graceful degradation exhibited due to the

discrete failure event.

4.14.6. Service Degradation Methods

As mentioned above, both the fault injection and MTBF methods for measuring the

reliability of a neural network have their shortcomings. However, a combination of the

two methods can be devised which draws on their strengths, and removes their

associated problems. The continuous measures used in fault injection experiments are

combined with the timescales and fault rates of the extended MTBF methods to produce

a means by which to assess the global reliability of a neural network system as time

progresses. Since most neural networks exhibit graceful degradation, this method

provides a clear indication of impending catastrophe in the system.

To achieve a continuous-valued indication of the global reliability of the system, it is

possible to assign a probability to each particular fault mode which indicates both how

likely it will manifest itself in a single unit of time, and also the fraction of locations in

which it will occur. Faults can then be generated probabilistically during the simulation

run. It is important to take into account both of these factors since a fault which is

unlikely to occur, but has numerous fault locations could well be more likely to occur

than a highly probable fault that can only occur in a very few locations. By dynamically

generating various types of faults during the simulation, any correlations between their

effects will automatically be taken into account. The degree of failure in the system can

then be probed by using the reliability measures discussed above in section 4.10.1.

As with the MTBF reliability methods, another problem is that of choosing a valid and reasonable timescale for faults, and the discussion in section 4.12.1 applies

equally well here to service degradation methods. However, although this method

results in a clear picture of a neural network's graceful degradation of reliability, to

collect statistically meaningful results using this method, many simulation runs will

have to be performed, and the total computation cost could be very large. For

safety-critical systems though, failure would be far more costly.


4.14.7. Example

By using the fault model developed earlier, the timescale as given in the example for

MTBF methods, and also the continuous reliability measure defined in the example for

fault injection techniques, the reliability of the MLP can be assessed. This is done by

running many simulations (to collect statistically valid data), placing faults

probabilistically according to the predefined fault rates, and measuring the reliability of

the MLP at each time step. This produces a plot of the reliability of the MLP against

time, and its performance can then be judged. Depending upon the generic nature of the

fault model, timescale and reliability measure used, the results obtained from various

different experiments (e.g. different size MLP's) can be compared and contrasted.

4.14.8. Summary of Simulation Frameworks

The three empirical simulation procedures given above are summarised below.

Fault Injection Procedure:

1. Train a neural network to final state and save parameters.

2. Start with fault-free trained neural network and choose single defect mode

from fault model:

3. Choose a (new) random location and apply defect.

4. Evaluate reliability of neural network.

5. Repeat from step 3 until some proportion of all possible locations

chosen.

6. Repeat from step 2 many times to average results.

Mean-Time-Before-Failure Procedure:

1. Train a neural network to final state and save parameters.

2. Start at time 0 with fault-free trained neural network and assign time-based

pdf to each mode in fault model:

3. For every possible fault location in neural network, apply pdf's to check

for defects.

4. Test whether neural network has failed. If so, record time and repeat

from step 2 until sufficient results obtained for MTBF.


5. Increment time and repeat from 3.

Service Degradation Procedure:

1. Train a neural network to final state and save parameters.

2. Start at time 0 with fault-free trained neural network and assign time-based

pdf to each mode in fault model:

3. For every possible fault location in neural network, apply pdf's to check

for defects.

4. Evaluate reliability of neural network for this time step.

5. Increment time and repeat from step 3 until maximum time reached or

reliability decreases below set minimum level.

6. Repeat from 2 many times to average results.
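The service degradation procedure can be sketched in the same style. Here each fault mode is assumed to carry a per-time-step probability and a set of possible locations, as suggested in section 4.14.6; again the network, fault-mode and reliability objects are hypothetical placeholders.

import copy
import random

def service_degradation_run(trained_net, fault_modes, reliability,
                            max_time=1000, min_reliability=0.5):
    """One run of the service degradation procedure: at each time step every
    possible fault location is tested against its mode's per-step probability,
    newly generated faults are applied, and the reliability is recorded."""
    net = copy.deepcopy(trained_net)                 # step 2: fault-free start
    history = []
    for t in range(max_time):                        # steps 3-5
        for mode in fault_modes:
            for loc in mode.locations(net):
                if random.random() < mode.rate:      # time-based pdf check
                    mode.apply(net, loc)
        r = reliability(net)
        history.append((t, r))
        if r < min_reliability:
            break
    return history

# Step 6: repeat service_degradation_run many times and average the histories.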

4.15. Conclusions

This chapter has provided a methodology by which the fault tolerance of neural

networks can be examined. This consists of defining a fault model, a measure for reliability and a suitable timescale, and then, for empirical investigations, setting

up an experimental framework. Although primarily concerned with studying neural

networks at an abstract level to understand the fault tolerance that arises from their

particular computational nature, the same techniques could be used when considering

neural networks at more concrete levels.

To define a fault model for a neural network given its abstract computational definition,

the atomic entities at this level of visualisation must first be identified. These serve as

the locations for faults in the neural network model. The final step is to define the effect

of faults at these locations. This is achieved by considering a fault always to cause the

maximum (harmful) change to the neural network's overall function, though limitations

may be suggested by certain physical implementation constraints.

The manner in which a neural network fails is continuous rather than an abrupt event. As

such, appropriate measures for their reliability are required. These should take into

account the computational nature of the output units, redundancy of output

representations, the particular nature of the application, etc.


Once both a fault model and reliability measure have been defined for a neural network,

the effect of faults on its operation can then be investigated using one of the three

simulation frameworks described above, i.e. fault injection, MTBF, or service

degradation. Fault injection is suitable for determining the consequence of a certain

fault on a neural network's operation. MTBF is applicable only if a neural network has an

abrupt failure mode imposed upon it. Service degradation is more useful since it

recognises that a neural network exhibits continuous failure.

In summary, this chapter has presented a methodology that allows the fault tolerance of

different neural network configurations, or even neural models, to be compared by

using the results obtained from simulations provided their failure curves are of the same

characteristic family.


Graphs from Fault Analysis

Graph 5.9 Service degradation results using various numbers of 2-tuple units (Pr(Failure) against Time x10,000 hrs)

Graph 5.10 Service degradation results using various numbers of 3-tuple units (Pr(Failure) against Time x10,000 hrs)

Graph 5.11 Service degradation results using various numbers of 4-tuple units (Pr(Failure) against Time x10,000 hrs)

Graph 5.12 Fault injection results for various numbers of 2-tuple units (Pr(Failure) against %s-a-0 Faults in Key Vector)

Graph 5.13 Fault injection results for various numbers of 2-tuple units (Pr(Failure) against %s-a-1 Faults in Key Vector)

Graph 5.14 Fault injection results for various numbers of 2-tuple units (Pr(Failure) against %s-a-0 Faults in Memory Links)

Graph 5.15 Fault injection results for various numbers of 2-tuple units (Pr(Failure) against %s-a-1 Faults in Memory Links)

Graph 5.16 Fault injection results for various numbers of 3-tuple units (Pr(Failure) against %s-a-0 Faults in Key Vector)

Graph 5.17 Fault injection results for various numbers of 3-tuple units (Pr(Failure) against %s-a-1 Faults in Key Vector)

Graph 5.18 Fault injection results for various numbers of 3-tuple units (Pr(Failure) against %s-a-0 Faults in Memory Links)

Graph 5.19 Fault injection results for various numbers of 3-tuple units (Pr(Failure) against %s-a-1 Faults in Memory Links)

Graph 5.20 Fault injection results for various numbers of 4-tuple units (Pr(Failure) against %s-a-0 Faults in Key Vector)

Graph 5.21 Fault injection results for various numbers of 4-tuple units (Pr(Failure) against %s-a-1 Faults in Key Vector)

Graph 5.22 Fault injection results for various numbers of 4-tuple units (Pr(Failure) against %s-a-0 Faults in Memory Links)

Graph 5.23 Fault injection results for various numbers of 4-tuple units (Pr(Failure) against %s-a-1 Faults in Memory Links)

Graph 5.24 Service degradation results for 2-tuple units using various numbers of patterns stored (Pr(Failure) against Time x10,000 hrs)

Graph 5.25 Service degradation results for 3-tuple units using various numbers of patterns stored (Pr(Failure) against Time x10,000 hrs)

Graph 5.26 Service degradation results for 4-tuple units using various numbers of patterns stored (Pr(Failure) against Time x10,000 hrs)

Graph 5.27 Fault injection results for various numbers of 2-tuple units (Pr(Failure) against %s-a-0 Faults in Key Vector)

Graph 5.28 Fault injection results for various numbers of 2-tuple units (Pr(Failure) against %s-a-1 Faults in Key Vector)

Graph 5.29 Fault injection results for various numbers of 2-tuple units (Pr(Failure) against %s-a-0 Faults in Memory Links)

Graph 5.30 Fault injection results for various numbers of 2-tuple units (Pr(Failure) against %s-a-1 Faults in Memory Links)

Graph 5.31 Fault injection results for various numbers of 3-tuple units (Pr(Failure) against %s-a-0 Faults in Key Vector)

Graph 5.32 Fault injection results for various numbers of 3-tuple units (Pr(Failure) against %s-a-1 Faults in Key Vector)

Graph 5.33 Fault injection results for various numbers of 3-tuple units (Pr(Failure) against %s-a-0 Faults in Memory Links)

Graph 5.34 Fault injection results for various numbers of 3-tuple units (Pr(Failure) against %s-a-1 Faults in Memory Links)

Graph 5.35 Fault injection results for various numbers of 4-tuple units (Pr(Failure) against %s-a-0 Faults in Key Vector)

Graph 5.36 Fault injection results for various numbers of 4-tuple units (Pr(Failure) against %s-a-1 Faults in Key Vector)

Graph 5.37 Fault injection results for various numbers of 4-tuple units (Pr(Failure) against %s-a-0 Faults in Memory Links)

Graph 5.38 Fault injection results for various numbers of 4-tuple units (Pr(Failure) against %s-a-1 Faults in Memory Links)

CHAPTER SIX

Multi-Layer Perceptrons1

6.1. Introduction

Perceptrons were devised by McCulloch and Pitts in 1943 [62] as a crude model of

neurons in the brain. They are very simple computational devices which can perform

binary classification on linearly separable sets of data. A binary input vector is sampled

by a number of fixed predicate functions, whose weighted binary outputs are fed into a

threshold logic unit. There exists a training algorithm (Perceptron Learning Rule [62])

for linearly separable problems which is guaranteed to find the required weights that are

applied to each predicate output.

Due to the limited capabilities of the perceptron unit, an obvious advance was to

connect layers of perceptrons together. The perceptron units were simplified by only

allowing the first layer to have predicate functions sampling the input, if at all. This

architecture became known as a multi-layer perceptron network (MLP). However, it

was not clear how to train it since the original perceptron learning rule relied on

knowing the correct response for every unit given some input. For the internal units of a

multi-layer perceptron network this is not possible. This problem of spatial credit

assignment was a major stumbling block to neural network research in the late 60's. The

publication of Minsky and Papert's book [64] which comprehensively analysed

perceptron units and single layer networks composed from them discouraged many

researchers who were trying to develop learning algorithms for more complex neural

networks composed of many layers of perceptron units. However, in 1974 Werbos

[117] gave an algorithm which could train such a network, though continuous activation

functions were used instead of the original binary decision threshold. It was

subsequently rediscovered by other researchers including the Parallel Distributed

1 Part of this chapter has been published in [119].


Processing (PDP) research group [21] in 1986 who termed the learning algorithm

Back-Error Propagation (BP).

This new learning algorithm has become almost synonymous with multi-layer

perceptron networks to such an extent that a clear distinction between the architecture

and learning algorithm has been lost in many cases. Back-error propagation is only one

particular method for configuring the weights in a MLP. The work presented in this

chapter leads to the conclusion that the back-error propagation algorithm is inherently

flawed with respect to developing neural networks exhibiting fault tolerance. However,

it will be seen later that it is possible to derive a set of weights which do lead to fault

tolerance.

Firstly section 6.2 describes how complex training sets used in the various simulations

were constructed. Section 6.3 then analyses the fault tolerance of perceptron units, and

experimental results are shown to support the theoretical model. Section 6.3.3 discusses

an alternative view of a perceptron's function to that of hyperplane separation.

Section 6.4 constructs a fault model for multi-layer perceptron networks. This is used in

section 6.5 to analyse the effect of faults on the functionality of a multi-layer perceptron

network. Section 6.6 then analyses the reliability of multi-layer perceptron networks

trained using back-error propagation, and methods of improving their rather poor

tolerance to faults given in section 6.7. Section 6.8 then analyses the resulting fault

tolerant multi-layer perceptron networks, and a new technique is developed which

produces similar networks at far less computational cost. Section 6.9 analyses the fault

tolerance of the MLP networks trained using the new algorithm, and the consequences

of this method for generalisation in multi-layer perceptron networks are given in section

6.10. Finally, section 6.11 examines properties of the hidden representations formed in

multi-layer perceptron networks with regard to their resilience to faults.

6.2. Construction of Training Sets

For the purposes of this study training sets were constructed artificially rather than

using a real data source. This allowed many training sessions to be performed quickly,

and more importantly, the characteristics of the data set to be fully known.


An algorithm was devised which would produce a number of classes (c) with a number

of examples (cp) drawn from each class in a n-dimensional bipolar {-1,+1}n or binary

{0,1}n space. Each class centre was chosen randomly, but with the constraint that each

was a certain minimum distance from any other centre. This is required so that a certain

minimum number of pattern examples can definitely be chosen from every class. The

class centres can be viewed as pattern exemplars. The output patterns associated with

the inputs were of type 1-in-c (exactly one output component set), i.e. 00010 would represent inputs sampled from the

second of 5 pattern exemplars.

The selection criterion for accepting a set of class centres was defined to be those cases

where the following inequality held:

$\sum_{r=0}^{\frac{1}{2}d} {}^{n}C_{r} > 2p \quad \text{where } d = \min \|c_i - c_j\| \ \forall i \neq j$

It accepts any class set in which at least twice the number (p) of examples required from

each class could be found in the space owned by a particular class exemplar. This space

extends one half of the minimum interclass distance d. This condition is placed on

training set construction so that classes do not have too large a degree of overlap.

The example patterns drawn from each class were based on the class exemplar with

components randomly reversed with probability

$\Pr = \frac{\frac{1}{2} \min_{\forall i \neq j} \|c_i - c_j\|}{n}$

This method selects pattern examples with high probability from the space owned by a

class exemplar, though it also allows for possible class overlap.

A seed value was specified for the pseudorandom number generator so that training sets

could be reproduced.
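A compact sketch of this construction is given below. It follows the acceptance inequality and the flip probability given above; the function name and the exact acceptance loop are illustrative rather than a record of the simulator actually used.

import numpy as np
from math import comb

def make_training_set(c, cp, n, d_min, bipolar=True, seed=0):
    """Construct c class exemplars in {-1,+1}^n (or {0,1}^n) that are at
    least d_min apart, then draw cp noisy examples from each class by
    flipping components with probability d/(2n), where d is the minimum
    interclass distance of the accepted centre set."""
    rng = np.random.default_rng(seed)               # seeded for reproducibility
    low, high = (-1, 1) if bipolar else (0, 1)
    while True:
        centres = rng.choice([low, high], size=(c, n))
        dists = [np.sum(centres[i] != centres[j])
                 for i in range(c) for j in range(i + 1, c)]
        d = min(dists)
        # acceptance test: enough room around each exemplar for 2*cp examples
        if d >= d_min and sum(comb(n, r) for r in range(int(d / 2) + 1)) > 2 * cp:
            break
    X, Y = [], []
    p_flip = d / (2.0 * n)
    for k, centre in enumerate(centres):
        target = np.zeros(c)
        target[k] = 1.0                             # 1-in-c output coding
        for _ in range(cp):
            flips = rng.random(n) < p_flip
            example = np.where(flips, low + high - centre, centre)
            X.append(example)
            Y.append(target)
    return np.array(X), np.array(Y)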

6.3. Perceptron Units

The operation of the simplified perceptron units used in multi-layer networks can be described by the following equation

$\text{output} = \sigma\!\left(\sum_{k=1}^{n} i_k w_k - \theta\right) = \sigma(i \cdot w - \theta) \qquad (6.1)$

where $i_k$ is the kth input component, and $w_k$ is the weight on the connection from that


input. The constant θ offsets the weight input sum, and is normally termed the bias. The

function σ applied to the final result of the summation (activation) generally maps it

into a limited range [a,b], and hence is often called a squashing function.

The function of a perceptron unit is to classify its inputs into two classes, possibly with

some notion of certainty added. This is a crude model of the behaviour of neurons in

the brain which given certain stimuli, fire in bursts with frequency relating to the

closeness of the input stimulus to its exemplar [118].

There are three main classes of squashing function (σ) which have been developed and

used in perceptron units:

Binary: The output of units is hard-limited to binary {0,1} or bipolar {-1,+1}

values.

Linear: The squashing function maps x to ax. Generally the output represents

two classes based on the sign of the output, and the absolute magnitude

the certainty of response.

Non-Linear: The activation is mapped to a limited range as with the binary units,

though here the mapping is continuous. In accordance with the notion of

a perceptron unit representing two classes, the function tends to be

monotonically increasing. This is the class of units employed in MLP

networks.
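As a point of reference for the analysis which follows, equation 6.1 together with the three classes of squashing function can be written out directly; the slope a of the linear case and the particular bipolar sigmoid are assumptions for the sketch only.

import numpy as np

def perceptron_output(x, w, theta, squash="nonlinear", a=1.0):
    """Equation 6.1: output = sigma(i.w - theta) for the three classes of
    squashing function described above."""
    act = float(np.dot(x, w)) - theta
    if squash == "binary":                    # hard-limited bipolar output
        return 1.0 if act >= 0 else -1.0
    if squash == "linear":                    # sign gives the class, magnitude the certainty
        return a * act
    return 2.0 / (1.0 + np.exp(-act)) - 1.0   # non-linear (bipolar sigmoid)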

6.3.1. Fault Tolerance of Perceptron Units

This section examines the fault tolerance arising from a perceptron unit's style of

computation. First, a simple fault model will be constructed. From equation 6.1 it can

be seen that the majority of entities in a unit occur in the summation of weight and

input components. Since the number of weights far outweighs the single bias θ, then it

can be considered to be masked by the weighted summation terms if its value is no

larger than typical weight values. However, if this is not the case, then since the bias is

often considered as being a weight from a unit with fixed output -1, it could be included

without special provision as an extra summation term.


Faults affecting the squashing function can be ignored for similar scaling reasons as in

the case of the bias value since it is again much more likely that weight faults will occur

first.

Inputs ik for classification problems are generally binary {0,1} or bipolar {-1,+1}, and

so the dominating term in the computation performed by a perceptron unit, with respect

to its tolerance to faults, is a sum of weights wk. The result of this sum is then classified

by comparison with the bias θ. For now, faults affecting weights will be considered to

have the effect of forcing their value wk to zero, which can also be viewed as removing

a connection between the unit and input component ik. The fault model will be

discussed in more rigorous detail later when considering multi-layer perceptron

networks.

Notice that the consequence of faults affecting weights in this way is to reduce the

relative difference between a unit's activation and its bias value, i.e. the unit will move

closer to the point at which an input is misclassified and failure occurs2.

Since a single perceptron unit can only distinguish linearly separable patterns, the two

classes can be viewed as non-intersecting regions in n-dimensional space. The optimal separating hyperplane for maximising resilience to the effect of faults is the one whose weight vector lies along the line connecting the class centroids3 and which perpendicularly bisects that line (see figure 6.1). This is because the associated weight vector maximises the distance of every input pattern from the separating hyperplane and hence minimises the possibility of misclassification. Note that this assumes that the volumes of input space covered by the two classes are similar.

2 This assumes that all weights contribute correctly to the output of a unit for all inputs.

3 Defined as average member of class where every member is weighted by its likelihood of occurring.

Figure 6.1 Separating hyperplane for maximal fault tolerance (showing the centroids of Class 1 and Class 2, the weight vector w and the bias θ)


More formally, if class $C_k$ has n members $c_i$, each with associated weighting $p_i$ which indicates the probability of $c_i$ occurring as an input, then its centroid $c_k^*$ is defined as

$c_k^* = \sum_{i=1}^{n} p_i c_i \quad \text{where } \sum p_i = 1$

The separating hyperplane which optimises fault tolerance is specified by the weight vector w and bias value θ as follows

$w = c_2^* - c_1^* \qquad \text{and} \qquad \theta = \left(c_2^* - c_1^*\right) \cdot \left(c_1^* + \tfrac{1}{2} w\right) \qquad (6.2)$

Note again that it is assumed that the input space volumes of the two classes are similar. If this is not the case, then if $v_i$ is the volume of class $c_i$, the factor of ½ in the expression for θ should be changed to $\frac{v_1}{v_1 + v_2}$. However, it will be seen that unequal sized classes reduce a perceptron unit's tolerance to weight faults.
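Equation 6.2 translates directly into a few lines of code. The sketch below computes the weight vector and bias from the weighted class centroids, with the optional v1/(v1+v2) correction for unequal class volumes; the function name is illustrative.

import numpy as np

def optimal_perceptron(class1, class2, p1=None, p2=None, v1=1.0, v2=1.0):
    """Weights and bias maximising fault tolerance (equation 6.2).
    class1/class2: arrays of patterns; p1/p2: optional pattern probabilities."""
    c1 = np.average(class1, axis=0, weights=p1)     # centroid of class 1
    c2 = np.average(class2, axis=0, weights=p2)     # centroid of class 2
    w = c2 - c1
    frac = v1 / (v1 + v2)                           # equals 1/2 when the volumes are equal
    theta = float(np.dot(w, c1 + frac * w))
    return w, theta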

The above claim in equation 6.2 can be shown by considering that the following function must be maximised to optimise fault tolerance

$F = \sum_{i=1}^{n} p_i\, d\!\left(c_i, H(w, \theta)\right) \qquad (6.3)$

where H(w, θ) defines the separating hyperplane, and function d gives the distance of input $c_i$ from this hyperplane measured positive in the direction towards the class $t_i$ to which $c_i$ belongs. For bipolar representations equation 6.3 is defined as

$F = \sum_{i=1}^{n} p_i (w \cdot c_i - \theta)\, t_i \qquad (6.4a)$

whilst for binary representations

$F = \sum_{i=1}^{n} p_i (w \cdot c_i - \theta)(2 t_i - 1) \qquad (6.4b)$

Taking the case for bipolar representations, the method for binary being similar, maximising F requires that

$\frac{dF}{dH} = 0$

Note that function F has no minimum since the separating hyperplane could be placed infinitely far away from either of the two classes.

The differentiation of F can be simplified by incorporating the bias as an extra weight

on a connection from a unit which always outputs -1. This has the effect of moving the


separating hyperplane to pass through the origin. Notating the new weight vector as w*,

for bipolar data representations

$\frac{dF}{dw^*} = \sum_{i=1}^{n} p_i c_i t_i = \sum_{t_i = 1} p_i c_i - \sum_{t_i = -1} p_i c_i = 0$

The case for binary data representations is similar. This result shows that maximum

resilience to faults is achieved when the class centroids are equidistant from each other

about the origin since the bias was incorporated into the weights. Hence the separating

hyperplane must be such that it perpendicularly bisects the line joining the class

centroids as required. Note that this result also emphasises the need to incorporate a bias

into a perceptron unit.

It is interesting to consider the effects of the chosen input representation on the potential

resilience to faults of perceptron units. The functionality of a perceptron unit implies

that 0-valued input components in a binary representation do not actively provide

information in computing the output of a unit, unlike their counterparts in a bipolar

representation. This is since the activation of a perceptron unit is a sum of

multiplicative terms. For a given weight w, a 0-valued input does not contribute to the

activation value, whereas a -1 input will. It can be viewed that the perceived difference

between classes is smaller for a perceptron unit in the case of binary inputs. The internal

functional resolution of a perceptron unit given bipolar inputs is twice that when binary

inputs are supplied.

Given that the components of the centroids of the two input classes $c_1^*$ and $c_2^*$ are defined to be $c_i^1$ and $c_i^2$ respectively, then a suitable measure of the distance between them with respect to the functional nature of a perceptron unit is supplied by

$D(c_1^*, c_2^*) = \sum_{i=1}^{n} \left| c_i^1 - c_i^2 \right|$

This measure reflects the difference in resolution of the binary and bipolar data representations being considered here. On average, the distance of a particular input from either of the two classes will be $\tfrac{1}{2} D(c_1^*, c_2^*)$ due to the position of the separating hyperplane. Since the fault tolerance of a perceptron unit can be considered as the sum of weighted input components, this implies that $\tfrac{1}{2} D(c_1^*, c_2^*)$ weight faults could be tolerated before failure (i.e. misclassification) would occur.
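The measure D and the resulting estimate of ½D tolerable weight faults can be computed directly, as in the small sketch below.

import numpy as np

def functional_distance(c1_star, c2_star):
    """D(c1*, c2*) = sum over components of |c1*_i - c2*_i|."""
    return float(np.sum(np.abs(np.asarray(c1_star) - np.asarray(c2_star))))

def predicted_faults_tolerated(c1_star, c2_star):
    """On average roughly D/2 weight faults should be tolerated before
    an input is misclassified."""
    return 0.5 * functional_distance(c1_star, c2_star)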


This also indicates that a bipolar representation will lead to improved reliability in

perceptron units since the distance between class centroids in bipolar space with respect

to their function will be twice that for a binary representation. This shows that the data

representation chosen for external inputs and internal units' outputs is critical for

providing tolerance to faults.

6.3.2. Empirical Analysis

To test this theory a simulation was run training a single perceptron unit to distinguish

between two pattern classes. The two class centres were randomly chosen and the

Hamming Distance between their centres varied between 1 and 10. The training set was

then constructed by selecting 5 examples of each class (see section 6.2) and then the

back-error propagation algorithm was used to find a weight vector solving the problem.

This particular learning algorithm was used instead of the simpler (but sufficient)

perceptron learning rule for consistency with later experiments.

For every training set, the perceptron unit was trained until the mean error was less than

0.1. Both 10 input and 20 input perceptron units were used. Then weights were

randomly chosen and removed (i.e. setting w to zero) and the unit tested for failure. The

definition of failure used was inability to distinguish the two classes. Each experiment

was carried out many times until the standard deviation of the number of faults

tolerated fell below 1.0.

Graph 6.1 Binary vs. Bipolar Representation in Perceptron Unit (Faults Tolerated against Hamming Distance, for 10- and 20-input units with binary and bipolar inputs)

Graph 6.1 shows the results of these experiments. The value for faults tolerated given on the y-axis is the average minimum number of weights/connections that can be

removed without failure occurring. This is plotted against a data set's Hamming

Distance between class centres. It can be seen that the data collected closely matches the

theoretical predictions (marked with stars). Also, it clearly shows that bipolar

representations lead to improved tolerance to faults as expected above.
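The experiment just described can be reproduced in outline as follows. To keep the sketch short the perceptron learning rule is used in place of back-error propagation (the simpler rule is sufficient for this linearly separable task); the fault model of zeroing randomly chosen weights is the same.

import numpy as np

def train_perceptron(X, y, epochs=200, lr=0.1, seed=0):
    """Perceptron learning rule on bipolar targets y in {-1,+1}
    (used here in place of back-error propagation for brevity)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0, 0.1, X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            out = 1.0 if np.dot(w, x) - b >= 0 else -1.0
            w += lr * (t - out) * x
            b -= lr * (t - out)
    return w, b

def faults_tolerated(w, b, X, y, seed=0):
    """Zero randomly chosen weights one at a time and count how many can be
    removed before any training pattern is misclassified."""
    rng = np.random.default_rng(seed)
    w = w.copy()
    order = rng.permutation(len(w))
    for count, idx in enumerate(order):
        w[idx] = 0.0                      # weight fault: connection removed
        outputs = np.sign(X @ w - b)
        if np.any(outputs != y):
            return count                  # number of faults survived
    return len(w)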

6.3.3. Alternative Visualisation of a Perceptron's Function

The predominant technique for visualising the operation of a perceptron unit is by

considering that it classifies patterns based on a dichotomy of its input space. This is

formed by a hyperplane which is normal to the weight vector w and distance θ from the

origin. An alternative understanding of a perceptron unit's computation at a lower

functional level is more appropriate in this chapter. A unit's function is viewed in terms

of its internal operation rather than by its output representation. Although both

visualisations precisely describe the operation of a perceptron unit, hyperplane

separation does not naturally extend to allow intuitive insight into visualising the effect

of faults, as was seen in the previous section.

The alternative concept proposed here for visualising a perceptron unit's computation

starts from considering the scalar value of the vector projection of input x onto weight

vector w. It can be viewed that this indicates the degree by which x matches w. This

value is then compared to the bias θ, and the output of the unit indicates if the match

was sufficient.

The weight vector w defines the feature which the perceptron unit represents in a subset

of its input space. A subset is specified since it has been found that not all the weights

on connections feeding a unit are used, some decay to near zero during training and

play no significant part in the units operation4. Note that by the term feature used

above, it is not meant that a unit's weight vector corresponds to some semantic object in

the problem domain.

The bias represents the degree to which the feature represented by the weight vector has to be present in the input x. If there is enough evidence, i.e. w.x > θ, then

it will cause the unit to "fire". A non-linear squashing function saturates the unit's

activation as appropriate.

4 This is the basis for the various pruning algorithms which have been developed [60].


This alternative visualisation for the operation of a perceptron unit has various

advantages over that of hyperplane separation. The effect on a hyperplane due to

removing weights is difficult to visualise, whereas for feature recognition it is clear that

information is lost or corrupted and the projection of the input onto the weight vector

will be less precise.

Also, the notion of distribution of information storage in neural networks becomes more

obvious since it can be viewed that the feature which a unit represents consists of many

components, not all of which have to be present for a pattern match to be performed.

These components could either be inputs fed to the network, or also the outputs of

previous units so combining multiple features to form more complex ones. As stated

above, it is not intended that these features should be viewed as corresponding to any

semantic item.

6.4. Multi-Layer Perceptrons

For ease of description later in this chapter, the MLP neural network and its associated

training algorithm back-error propagation will now be defined. The architecture of a

MLP is shown in figure 6.2 which shows how units are arranged in layers, with full

connectivity between the units in neighbouring layers. This is the standard pattern of

connectivity commonly used, though others such as having connections between units

and layers past its immediate neighbour are possible.

Figure 6.2 Multi-Layer Perceptron Neural Network (input layer, hidden layer and output layer, with weighted connections $w_{ij}$ from unit j in one layer to unit i in the next)


Each unit computes the following function based on its inputs from feeding units:

$o_i = f_i\!\left(\sum_j o_j w_{ij}\right) \qquad (6.5)$

Note that an ordering of the units in a MLP is specified since feeding units j must have

already been evaluated. Also, the bias θ has been incorporated as a special weight link

as described previously. The activation or squashing function fi can be any bounded

differentiable monotonically increasing function. The input units merely take on the

value of their corresponding component in the input pattern.

6.4.1. Back-Error Propagation

The back-error propagation learning algorithm [21] supplies a weight change for every

connection in the MLP network given an input vector i and its associated target output

vector t. The change for each weight is

$\Delta w_{ij} = \eta\, \delta_i\, o_j \qquad (6.6)$

where for output units

$\delta_i = (t_i - o_i)\, f_i'\!\left(\sum_k w_{ik} o_k\right) \qquad (6.7)$

and for hidden units

$\delta_i = f_i'\!\left(\sum_k w_{ik} o_k\right) \sum_l \delta_l w_{li} \qquad (6.8)$

This last equation shows how the error for unit i, δi, is constructed from errors of units

in previous layers. This meets the problem of credit assignment.
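For reference, equations 6.5 to 6.8 for a single hidden layer can be written out as below. This is a minimal sketch using the bipolar sigmoid; biases are omitted for brevity, though they can be folded in as weights from a unit with fixed output -1 as described above.

import numpy as np

def f(a):                      # bipolar sigmoid squashing function
    return 2.0 / (1.0 + np.exp(-a)) - 1.0

def f_prime(a):                # its derivative
    return 0.5 * (1.0 - f(a) ** 2)

def bp_step(x, t, W_hid, W_out, eta=0.1):
    """One back-error propagation update for a single-hidden-layer MLP.
    W_hid and W_out are the input-to-hidden and hidden-to-output matrices."""
    # forward pass (equation 6.5)
    a_hid = W_hid @ x
    o_hid = f(a_hid)
    a_out = W_out @ o_hid
    o_out = f(a_out)
    # output-layer errors (equation 6.7)
    delta_out = (t - o_out) * f_prime(a_out)
    # hidden-layer errors (equation 6.8)
    delta_hid = f_prime(a_hid) * (W_out.T @ delta_out)
    # weight changes (equation 6.6)
    W_out += eta * np.outer(delta_out, o_hid)
    W_hid += eta * np.outer(delta_hid, x)
    return o_out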

6.4.2. Fault Model for MLP's

A fault model must be constructed for multi-layer perceptron networks before a study

of their reliability can be performed. The development of fault models from an abstract

description of a neural network has been described in chapter 4. For a multi-layer

perceptron network as defined above the various atomic entities during operational use

are the weights, a unit's activation, and the squashing function. Only the weights need

be considered in a multi-layer perceptron due to the massive number of weights as

compared to the entities associated with units.


The manifestation of weight faults in a multi-layer perceptron must now be defined. To

cause maximum harm, a weight should be multiplied by -∞ (see section 4.5.4).

However, it would be unlikely in any realistic implementation that potentially infinite

valued weights could exist. Instead it is probable that weights will be constrained to fall

in a range [-W,+W], and so a weight fault should cause its value to become the opposite

extreme value. The loss of a connection can be modelled by a weight value becoming 0.

For simplification, only the latter fault mode was considered in this chapter.
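The two manifestations of a weight fault translate into a few lines of code; W denotes the weight bound assumed by a particular implementation.

def lose_connection(weights, i, j):
    """Loss of a connection: the weight is forced to zero (the fault mode
    used throughout this chapter)."""
    weights[i, j] = 0.0

def saturate_weight(weights, i, j, W):
    """Worst-case fault for implementations whose weights are bounded to
    [-W, +W]: the weight jumps to the opposite extreme value."""
    weights[i, j] = W if weights[i, j] < 0 else -W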

Note that a unit becoming defective in some way is not considered eligible for the fault

model since the concept of a unit entity exists at a much higher visualisation level than

that taken here. An error applied to a unit's output does not satisfactorily represent the

effect of internal faults within a unit since it is too complex. Such an abstract definition

of a neural network would not be particularly useful since it hides far too much of the

underlying computation of the system, and so would not provide beneficial information

on the tolerance to faults of multi-layer perceptron networks. This is especially true if

results obtained on fault tolerance were used in the development of a physical

implementation.

6.5. Analysis of the Effect of Faults in MLP's

The analysis in section 6.3 for the effect of weight faults in perceptron units can be

extended to multi-layer perceptron networks. The aim of this section is to specify the

nature of processing errors in the output layer caused by faults occurring anywhere in a

MLP network. There are two separate functional regions which can be identified with

respect to the effect of faults. First, weight values on connections from input units to

hidden units, and secondly, connections between hidden units and output units.

A weight fault occurring on a connection between a hidden and an output unit will cause the absolute magnitude of the output unit's activation to decrease:

$act_o \rightarrow act_o - w_{ho} x_h$

where the weight from hidden unit h to output unit o becomes zero. This case is exactly the same as for the effect of faults in an individual perceptron unit as considered previously in section 6.3.


The second case describes the effect on the output of a MLP network of a weight fault

occurring on a connection between an input unit and a hidden unit. This is more

complex. Considering a particular hidden unit, as more faults affect weights on

connections feeding it, its absolute activation will decrease as described above in the

case of an output unit. Eventually, this degradation results in the output of the hidden

unit inverting and becoming erroneous. This now means that all output units which are

connected to the failed hidden unit will be supplied erroneous information, and so each

will have an increased likelihood of failure.

For simplicity, it is assumed that a hard-limiting squashing function is used for all units

in a MLP network. This means that the output of a hidden unit will suddenly change

polarity when its tolerance to faults is exceeded. It was shown in section 6.3.2 that for a

bipolar input representation failure happens when the number of weight faults equals or

exceeds the average Hamming Distance HDi between input patterns. For binary inputs,

only ½HDi will be tolerated.

The effect of connection faults occurring between the input and hidden layers will now

be analysed, considering the two cases of using bipolar and binary thresholding units

separately.

6.5.1. Bipolar Thresholded Units

The two cases for the output of a hidden unit reversing are

1. $x_i = +1 \rightarrow x_i = -1$
2. $x_i = -1 \rightarrow x_i = +1$

and will be considered separately. The activation of an output unit is given by

$act_o = \sum_i w_{io} x_i - \theta$

For case 1, the effect on the activation of an output unit due to the output value of connected hidden unit f becoming erroneous is

$act_o = \left(\sum_i w_{io} x_i - \theta\right) - 2 w_f$

Case 2 is similar, except for a reversal in the sign of the change to the activation of a fed output unit:

$act_o = \left(\sum_i w_{io} x_i - \theta\right) + 2 w_f$


These can be combined in the following equation which specifies the effect of input to hidden weight faults on the activation of an output unit in a multi-layer perceptron network:

$act_o = \left(\sum_i w_{io} x_i - \theta\right) - 2 x_f w_f$

6.5.2. Binary Thresholded Units

A similar analysis allows the effect of input to hidden weight faults on the activation of

an output unit to be ascertained for multi-layer perceptron networks using binary

squashing functions. Since the working is almost identical to that given above for

bipolar squashing functions, only the final result is given here:

$act_o = \left(\sum_i w_{io} x_i - \theta\right) - (2 x_f - 1) w_f$

It is interesting to note that this implies a constant bias wf affects the activation of an

output unit independent of the hidden unit's output value. This explains the observation

made by Prater and Morley that a weight fault "causes a loss of information and a bias

change" [34].

              Binary                          Bipolar
  wf          xf      change in acto          xf      change in acto
  -ve         0       -wf                     -1      -2wf
  +ve         0       +wf                     -1      +2wf
  -ve         1       +wf                     +1      +2wf
  +ve         1       -wf                     +1      -2wf

Table 6.1 Change to fault-free activation of output unit caused by hidden unit failure

6.5.3. Comparison between Data Representations

The analysis given above for the effect on the operation of a binary thresholded MLP of

an erroneous hidden unit due to weight faults occurring in the connections between

input and hidden units, shows that the change induced in an output unit's activation is

only half that if a bipolar thresholding function was used in the MLP's units (see table


6.1). This suggests that a binary thresholding method should be used for all of the units

in a MLP network. However, using a binary threshold method would also have the

consequence of halving the number of weight faults which could be tolerated by

individual units in their incoming connections (c.f. section 6.3.2), and so the decision of

which data representation for the thresholding function to use is not trivial.

The two cases of using either bipolar or binary squashing functions in a MLP network

will now be considered separately. Note that HDi is the average Hamming Distance

between input patterns, and HDh is the average Hamming Distance between

representations formed in the hidden layer of a MLP.

First, if a bipolar squashing function is used in a MLP, then for an output unit to just

fail, either ½HDh hidden units must fail, or HDh hidden to output weight faults must

occur, or else some combination of these two events must occur:

$i$ hidden units fail $\wedge$ $\left(HD_h - 2i\right)$ hidden-to-output weight faults occur, for $i = 0, \ldots, \tfrac{1}{2}HD_h$

However, if binary squashing functions are used, then either ½HDh hidden units must fail

(as before), or ½HDh hidden-output weight faults must occur. The various combinations of these two events can be expressed as

$i$ hidden units fail $\wedge$ $\left(\tfrac{1}{2}HD_h - i\right)$ hidden-to-output weight faults occur, for $i = 0, \ldots, \tfrac{1}{2}HD_h$

It can be seen that a MLP network using bipolar squashing functions will exhibit better

tolerance to faults than the case when binary squashing functions are used. This

conclusion is drawn by noting that in both cases an equal number of hidden unit failures

causes similar damage to the function of output units. However, binary thresholded

hidden units are more likely to fail since they will only tolerate ½HDi faults in their input

connections, while bipolar thresholded hidden units will tolerate HDi weight faults.

6.5.4. Conversion of Binary to Bipolar Thresholded MLP

A trained MLP which employs binary thresholded units can easily be transformed into

an equivalent MLP with bipolar units. However, although the function of the MLP will

remain unchanged, the results given above imply that its reliability will be greatly

increased due to the improvement in the fault tolerance of its individual units.


6.6. Fault Tolerance of MLP's

As seen in chapter 2, many studies of the fault tolerance of multi-layer perceptron

networks have been carried out. However, nothing approaching a comprehensive

analysis of the nature of fault tolerance mechanisms in MLP's is known to exist. In the

rest of this chapter this task will be approached, and in part met. Clearly the results

from the single perceptron unit studies as described above will be of great use.

Given that a single perceptron unit seems to be very reliable, a simulation was run to

gauge the effect of faults in a multi-layer network. A complex training set was used

following the method described in section 6.2. Four class exemplars were randomly

chosen in a 10-dimensional bipolar space, with 5 pattern examples selected from each

making a training set of 20 vector associations. A MLP network was then trained to

solve this classification problem using the back-error propagation algorithm until the

maximum output unit error diminished to 0.05. This was considered a suitably low

value for the final error. Two training sessions were run, the first on a MLP network

having 5 hidden units, and the second for 10 hidden units. The values for these various

parameters were chosen fairly arbitrarily. A number of example patterns were selected

from each class to produce a dataset which reflected class membership. Also, rather

more than the required number of hidden units were used to provide extra capacity for

redundancy.

The trained MLP network was then subjected to faults. This consisted of randomly

selecting approximately 10% of the weights in each MLP network, and forcing their

values to 0 (see section 6.4.2). This proportion seemed appropriate as a baseline for the

required tolerance to faults. The proportion of patterns in the training set that were then

misclassified (i.e. the maximum output unit error was over 1.0) was used as a measure of the damage inflicted on the MLP network.

Graph 6.2 Proportion of failed patterns due to 10% weight faults (% Failed Classifications against Combined Weight Values, for MLPs with 5 and 10 hidden units)

The surprising result was that so few weight faults (8 in the case of the 10-5-4 network

containing 79 weights) would cause a considerable proportion of the input set to fail,

whilst the recognition of the remaining input patterns would not be appreciably

degraded. It was also found that certain individual weights would cause failure to occur.

Graph 6.2 above shows how the percentage of input patterns incorrectly classified in the

training set varies with the total absolute magnitude of the faulted weights. It clearly

illustrates that defective weights which contribute most towards features represented by

units (i.e. sum of faulted weights is large) cause an appreciable percentage of the

training set to be incorrectly classified. Graph 6.3 shows the maximum unit error over

all training patterns. It further reinforces the result that significant weights exist in the

classification of particular input patterns.

Graph 6.3 Maximum output unit error due to 10% weight faults (Maximum Error against Combined Weight Values, for MLPs with 5 and 10 hidden units)

This result contradicts many remarks made by previous work (see chapter 2) that

multi-layer perceptron networks are fault tolerant. It also brings into question the view

that they store information in a distributed manner since the destruction of only a few

weights causes a non-trivial failure among certain stored associations, and has little or

no effect on the remainder.

Note that this result explains the "drunken driving" behaviour described by Widrow5 in

the truck-backer upper application [101] when the controlling MLP was injected with a

few faults. Errors will occur in the stream of control commands issued by the MLP for


those inputs affected by the specific faults, and will cause the truck to turn in the wrong

direction. However, this causes it to move away from the particular region of input

space in which failure occurred, and so a correct output will eventually be generated

which turns the truck back on course. This sequence of events is repeated as the truck

reverses towards the loading bay. It would be interesting to study how many faults

would be tolerated before the overall behaviour of the truck is such that it does not

successfully align itself with the loading bay. Note that this is an example of the

problem noted in chapter 4 where the reliability of a neural network controlling a

dynamic system is not prejudiced by a single incorrect output, but by a sequence of

incorrect and correct outputs whose overall result combines to cause system failure.

5 Personal communication (July 1991)

6.6.1. Distribution of Information in MLP's

The traditional view of information distribution in neural networks, and multi-layer

perceptrons in particular, is by analogy to holographic storage; no single storage

element (normally taken to be a weight, or occasionally a unit) in a neural network

stores a particular pattern. Instead, patterns are stored in a distributed fashion across all

of the weights in a neural network. The conventional argument for fault tolerance is

that, as for a hologram, each weight in a neural network is unimportant globally, and so

its loss will not seriously impair the operation of the network. However, it is doubtful

whether this argument is valid for MLP's given the above results which showed that for

a small number of weight faults, a significant proportion of the training set is

misclassified. However, for a single perceptron unit it has been shown that a certain

number of weights can be viewed as being redundant in this fashion.

It is more appropriate for MLP networks to view each layer as transforming patterns into a

different space, such that in the last hidden layer a representation is developed which is

linearly separable to produce the required output. This process can be viewed as

distributing the complex task of classification into several simpler steps at each hidden

layer. However, each layer of perceptron units can be viewed as being distributed in the

sense given in the previous paragraph. Reliability will arise from fault tolerance in each

layer of perceptron units, and overall will principally be governed by the least fault

tolerant layer.


6.6.2. Analysis of Back-Error Propagation Learning

This section will consider why the back-error propagation algorithm does not produce a

MLP network configuration which exhibits the fault tolerant behaviour that might be

expected given the reliability of its individual perceptron units. This will be approached

by considering the effect of small changes in unit activation caused by weight faults. It

will then be shown that back-error propagation trained MLP networks are sensitive to

such changes.

The empirical results described above can be explained if the operation of a perceptron

unit is considered using the alternative visualisation described in section 6.3.3. The

projection of an input x onto its weight vector w' which suffers a fault in component f

can be described as follows:

$s = w' \cdot x = \sum_{i=1}^{n} w_i x_i - w_f x_f$

This scalar value s is now compared against the unit's bias θ to see if the degree by which input x matches the feature w is sufficient to activate the unit. Looking at the absolute difference between s and θ:

$w' \cdot x - \theta = \sum_{i=1}^{n} w_i x_i - w_f x_f - \theta = \left(\sum_{i=1}^{n} w_i x_i - \theta\right) - w_f x_f = (w \cdot x - \theta) - w_f x_f \qquad (6.9)$

It can be seen that the absolute difference between the fault-free projection and θ is

decreased, assuming every weight correctly contributes to the decision made by a

perceptron unit. If this value becomes negative, local failure will result since the unit

will then misclassify its input.

Although this describes the effect of a weight fault, it does not explain why only a few

faults generally cause such a dramatic failure in a multi-layer perceptron network for

some subset of the training set. It will now be shown how the back-error propagation

algorithm used to train the MLP network causes this lack of tolerance to faults. The common multiplicative term in the weight update $\Delta w_{ij} = \eta \delta_i o_j$ (equation 6.6) is

$f_i'\!\left(\sum_k w_{ik} o_k\right) \cdot o_j = f_i'\!\left(\sum_k w_{ik} o_k\right) \cdot f_j\!\left(\sum_l w_{jl} o_l\right) \qquad (6.10)$


by examination of equations 6.7 and 6.8. If it is assumed that the same squashing

function f is used for all units (as is generally the case), then this term can be considered

as the multiple of f and its derivative f'. Note that their two arguments will not

necessarily have the same value since f is computed from the activation of the unit

feeding the unit where f' occurs. A plot is shown in figure 6.3 below using the sigmoid

function (bipolar representation) for f:

$f(act) = \frac{2.0}{1.0 + e^{-act}} - 1.0$

Three plots of the common term in ∆w are shown. These correspond to three offsets (-6,

0, +6) applied to the argument of f with respect to the argument of f'. These offsets were

chosen since they indicate the envelope of the common multiplicative term given in

equation 6.10 for all possible offset values. It can be seen that for values outside the

range [-p,+p] this term is very small for large unit activation values, irrespective of the

offset between f and f'. This means that the change ∆wij applied to weights on the

connections feeding into a unit will also become very small as the unit's activation

increases.

When training the MLP network weight vectors move towards a stable point, which

implies that the weight changes must decrease towards zero. In figure 6.3 it can be seen

that there are at most three points where this occurs, namely when a unit's activation tends outwards from ±p or lies at some point between them. However, a unit having an activation

corresponding to a zero output, but still within the envelope range, is very unstable

since a slight disturbance causes a rapid rise in the weight change, and so this case is

considered most unlikely to occur. This means that units in a back-error propagation

trained MLP network will have activation values clustered around ±p (see figure 6.4).

This is supported by simulation results given in section 6.8.1 which show that hidden units tend to output their extreme values, and also by results given in a preprint by Murray and Edwards [87].


Given this knowledge it becomes clear why a back-error propagation trained MLP is

not fault tolerant despite being composed of reliable perceptron units. A single weight

fault (either forcing its value to 0 or the opposite extreme value) will decrease the

projection of the input onto the unit's weight vector, and so move the activation towards

0 (equation 6.9). Since the unit's activation was already close to the point where the

squashing function rapidly moves away from its asymptotes (see figure 6.4), this causes

a large error in the unit's output. This now greatly increases the likelihood of overall

system failure. However, if unit's activation lay in region ±q then faults would not cause

an immediate error in output value, and this problem could be avoided. It will be seen

in later sections how this result is employed to increase the reliability of MLP's.

Figure 6.3 Plot of common multiplicative term in BP algorithm (plotted against the activation act for offsets of -6, 0 and +6 between the arguments of f and f'; the points ±p are marked)

Figure 6.4 Clustering of units' activations around +/- p (unit output against activation, with the points ±p, the regions ±q and the direction in which faults push the activation marked)
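The behaviour plotted in figure 6.3 can be reproduced numerically with a short sketch; the offsets of -6, 0 and +6 are those plotted above.

import numpy as np

def f(a):
    return 2.0 / (1.0 + np.exp(-a)) - 1.0

def f_prime(a):
    return 0.5 * (1.0 - f(a) ** 2)

# common multiplicative term f'(act) * f(act + offset) from equation 6.10,
# evaluated over the activation range shown in figure 6.3
act = np.linspace(-6.0, 6.0, 121)
for offset in (-6.0, 0.0, +6.0):
    term = f_prime(act) * f(act + offset)
    print(f"offset {offset:+.0f}: term at act=+6 is {term[-1]:+.4f}, "
          f"max |term| = {np.abs(term).max():.3f}")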


6.7. Training for Fault Tolerance

Various studies were undertaken into producing a technique which would produce a fault

tolerant neural network based on the MLP. This work was motivated given that a MLP

trained using back-error propagation is not as fault tolerant as might have been

concluded from the results obtained in section 6.3 examining the reliability of a single

perceptron unit. The techniques included:

Limited interconnectivity

Local feedback at hidden and/or output layers

Training with weight faults injected

However, only the technique of injecting weight faults during training produced clear

results with respect to developing a MLP network which exhibits resilience to faults.

6.7.1. Training with Weight Faults

This method is similar to that used by Clay and Sequin which produces a fault tolerant

MLP network by injecting transient unit faults during training [71]. However, in section

6.4.2 it was shown that the basic functional entities in a MLP network which should be

considered are the weights on connections between units rather than the actual units.

Hence, weights were randomly set to 0 during training so that tolerance to weight faults

would be introduced. Work described by Murray and Edwards in a paper submitted for

publication [87] also uses this technique, though it concentrates on synaptic weight

noise rather than weight elimination. A training session consists of the following steps:

1. Randomly choose a fixed number of weights and fail them.

2. Apply back-error propagation algorithm for all patterns in training set.

3. Restore faulted weights and repeat from step 1 until the maximum output

unit error diminishes to an acceptable value.

Generally only a single weight was faulted during each training step, though

simulations were also carried out faulting multiple weights. However, the number of possible faulted weight combinations increases combinatorially, and so training rapidly

becomes prohibitively expensive.
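A minimal sketch of this training session is given below. The network interface assumed here (a flat list of weights, a `bp_epoch` pass over the training set, and a `max_output_error` measure) is hypothetical and simply stands in for whichever MLP implementation is being trained.

```python
import random

def train_with_weight_faults(network, training_set, n_faults=1,
                             tolerance=0.1, max_sessions=10000):
    """Back-error propagation with transient weight faults injected:
    fail a few weights, train one epoch, restore, and repeat until the
    maximum output unit error is acceptable."""
    for _ in range(max_sessions):
        # 1. Randomly choose a fixed number of weights and fail them (stuck at 0).
        indices = random.sample(range(len(network.weights)), n_faults)
        saved = [network.weights[i] for i in indices]
        for i in indices:
            network.weights[i] = 0.0

        # 2. Apply the back-error propagation algorithm for all patterns.
        network.bp_epoch(training_set)

        # 3. Restore the faulted weights and repeat until the maximum output
        #    unit error diminishes to an acceptable value.
        for i, w in zip(indices, saved):
            network.weights[i] = w
        if network.max_output_error(training_set) <= tolerance:
            break
    return network
```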


6.7.2. Comparison with Clay and Sequin's Technique

Superficially, there seems little difference between injecting weight faults during

training as against units being faulted. However, the argument for training with faults is

to imbue a neural network with resistance to those particular faults. Since the

construction of a fault model for a MLP (section 6.4.2) showed that only weight faults

are important in a MLP system, then it seems more reasonable to train injecting weight

faults. Unit faults are too abstract and unlikely to be representative of the effect of

physical faults in an implemented MLP. Due to this, it is expected that training with

weight faults will lead to better overall reliability.

Note that the technique of injecting weight faults during back-error propagation training

as a fault tolerance mechanism for a MLP network is not the major work described in

this chapter. Instead, this chapter concentrates on analysing the MLP networks

produced by fault injection training given that the back-error propagation algorithm

inherently produces non-fault tolerant classification systems. The results of this

analysis, combined with the previous analysis of the tolerance to faults of a single

perceptron unit, is used to show how a fault tolerant MLP network can be constructed

after normal back-error propagation training. This is a great advantage since the

extremely long training times required when training with faults injected in each

learning cycle will not be needed.

6.8. Analysis of Trained MLP

MLP networks trained with transient fault injection have been demonstrated to form

fault tolerant systems [28,71], and several reasons have been proposed to explain why this should

be so. Similar reasoning can be applied for training with unit faults.

The first line of reasoning views the faulted MLP network during training as a

sub-network due to the loss of a unit/weight. These sub-networks are then individually

trained to solve the problem, and their individual solutions converge such that global

agreement between them is reached. Once fully trained, the loss of a single weight can

easily be tolerated, and tolerance to more than one weight is due to distribution over the

sub-networks.


An alternative view is that the MLP forms a distributed representation [96], i.e. the

hidden layer representation is different to that normally found by plain back-error

propagation. This is redundant in some way and so leads to resilience to faults.

However, it will be shown in this section that neither of these two lines of reasoning is

correct. Also, it is shown how to produce a fault tolerant MLP in the style of the MLP

networks produced by training with faults, though with little extra computational

expense over basic back-error propagation training.

6.8.1. Analysis of Fault Injection Training

To identify the difference between a MLP trained with plain back-error propagation and

one with transient fault injection, MLP's with varying numbers of hidden units were

trained using both methods and the resulting network configurations compared. The

previous training set used in section 6.6 was used for consistency. It consists of 4 class

exemplars with 5 input patterns drawn from each producing a training set of 20

associated pairs. The dimension of the input space was 10.

Graph 6.4 Comparison of weight vector directions in MLP's trained with weight faults, a) single fault injection, and b) double fault injection [average dot product against number of hidden units (4 to 12), for hidden and output layers]

The first area examined was the internal representation developed for each of the four

class exemplars. It was found that all hidden units had a value of near -1 or +1 (a

bipolar representation was used) for every input pattern. Further, comparing the hidden

representations of matching MLP network configurations trained using the two

methods, it was found that they were identical in every case. The comparison allowed

for the possibility of a fixed permutation of the hidden units. This result implies that the

second of the two reasons given above explaining the fault tolerance induced by

training with faults is incorrect.

The next comparison performed was between the vector direction of the weights

feeding every unit in each MLP network. As above, the possibility of a fixed

permutation in the hidden units was allowed for. Graph 6.4 above shows the average

dot product between the weight vectors of matching hidden and output units in MLP

networks trained with and without injected faults. The number of hidden units in each

network varied between 5 and 12. Once again, it can be seen that no significant

difference exists between the various pairs of matching networks, though less so for the

second graph. However, the internal representations were still identical. This means that not only are the hidden representations identical, but the dichotomies formed by all

units in their input space are also almost exactly the same.

Finally, the length of weight vectors for matching units was compared between the two

sets of trained MLP networks, where the length of a weight vector was found using the

Euclidean measure. Graph 6.5 shows the average ratio of the length of weight vectors

from a MLP trained with faults injected to that of the corresponding weight vector

when plain back-error propagation is used. It can be seen that in the former the length

of weight vectors is greater than in the original network. When two faults are injected

on each training step, this ratio is even more accentuated for hidden units. Note that this

difference is far more pronounced than the slight change in angles between weight vectors for

double fault injection above.
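The comparison measures used above can be sketched as follows, assuming that the per-unit weight vectors of the two networks have already been matched up (i.e. any fixed permutation of hidden units resolved); the nested-list layout is an assumption made purely for illustration.

```python
import math

def compare_weight_vectors(w_fault_trained, w_plain_trained):
    """For matching units, return the average normalised dot product
    (direction agreement) and the average Euclidean length ratio of the
    fault-injection trained weight vectors to the plain BP trained ones."""
    dots, ratios = [], []
    for wf, wp in zip(w_fault_trained, w_plain_trained):
        len_f = math.sqrt(sum(w * w for w in wf))
        len_p = math.sqrt(sum(w * w for w in wp))
        dots.append(sum(a * b for a, b in zip(wf, wp)) / (len_f * len_p))
        ratios.append(len_f / len_p)   # > 1 means the fault-trained vector is longer
    return sum(dots) / len(dots), sum(ratios) / len(ratios)
```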


6.8.2. Comparison with MLP trained injecting unit faults

For comparison with the above results, simulations were also performed examining the

nature of MLP networks developed when training with unit faults injected. The

parameters of the simulation were similar in all other respects to its counterparts

above which analysed the weight vectors produced when training with weight faults.

Graph 6.6 below compares the MLP networks produced by training with a single

weight fault injected to those when a single unit fault is injected.

It can be seen that the directions of the weight vectors in both the hidden and output layers of both MLP networks are almost identical. However, the lengths of weight vectors in the MLP trained with unit faults injected are less than in the corresponding

MLP trained with weight faults. It will now be shown that this leads to a less fault

tolerant MLP network, as was expected in section 6.7.

Graph 6.5 Comparison of weight vector lengths in MLP's trained with weight faults, a) single fault injection, and b) double fault injection [ratio of weight vector lengths against number of hidden units (4 to 12), for hidden and output layers]

To compare the two fault injection training techniques, a simulation was run training a

MLP network on the training set used previously. Graph 6.7 below shows the results for

a MLP network with 8 hidden units. It can be seen that training with weight faults gives

improved fault tolerance over unit fault injection training. However, both fault injection

training methods do produce a MLP network which is more fault tolerant than if simply

trained using back-error propagation.

Graph 6.6 Comparing training with weight faults and unit faults [dot product of weight vector directions, and ratio of weight vector lengths, against number of hidden units (4 to 12), for hidden and output layers]

6.8.3. New Technique for Fault Tolerant MLP's

It was shown in section 6.6.2 that conventional back-error propagation training would

not produce fault tolerant MLP networks. Also, it was conjectured that increasing a

unit's activation would lead to increased resilience to faults. The above analysis of fault

injection training supports this. However, the associated training times are typically

much longer than when using conventional back-error propagation. This section

presents a new technique for producing similarly fault tolerant MLP networks, but

without the lengthy training times.

In figure 6.4, it can be seen that in the asymptote region ±q of the activation function, a weight fault will not cause an error in a unit's output. This avoids overall

failure of the MLP network. To achieve this, the weight vector of a unit can be scalar

multiplied by some suitable constant ζ which will cause the activation of a unit to be

likewise increased:

\[ \mathrm{act} = (\zeta \mathbf{w}) \cdot \mathbf{x} = \zeta\,(\mathbf{w} \cdot \mathbf{x}) \]

Graph 6.7 Comparison of operation tolerance to faults after weight injection training and unit injection training [average and maximum error against number of weight faults injected (0 to 12), for normal BP, unit injection and weight injection]

This will produce a unit which will tolerate a certain number of weight faults since the

output of the unit will not become erroneous, even though its absolute activation will

decrease. If every unit's weight vector is processed in this way, the entire MLP network

will tolerate a number of weight faults before failure occurs. This result is supported by

the previous analysis of MLP networks trained with faults injected in section 6.8.1

where it was found that the magnitude of weight vectors was greater than those in a

normal back-error propagation trained MLP network.
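A minimal sketch of this construction is given below, assuming the trained weights are held as per-unit weight vectors with the bias weight included; the layer-wise factors ζh and ζo anticipate the values used in the simulations of section 6.9 and are quoted purely for illustration.

```python
def stretch_weights(hidden_weights, output_weights, zeta_h=1.4, zeta_o=100.0):
    """Scalar-multiply every unit's weight vector after plain back-error
    propagation training: a modest factor for hidden units preserves some
    graceful degradation, a large factor for output units masks weight faults."""
    stretched_hidden = [[zeta_h * w for w in unit] for unit in hidden_weights]
    stretched_output = [[zeta_o * w for w in unit] for unit in output_weights]
    return stretched_hidden, stretched_output
```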

The feature of neural networks of indicating approaching failure due to graceful

degradation (c.f. section 3.7) will still be exhibited since as more weight faults affect a

unit, its absolute activation will decrease into the region where the squashing function

transits between output values. This will cause the output of the unit to become

increasingly erroneous, and so failure will not be a sudden discrete event.

Note that as ζ → ∞, a unit will behave as if it were hard thresholded (c.f. section 6.3.1)

and provides failure free service until the number of weight faults equals the Hamming

Distance between the centroids of its input classes6. However, at this point failure will

be abrupt since the change in activation caused by each weight fault will not be

mirrored by a gradual increase in the error of a unit's output as above. It can be seen

that a trade-off exists between the degree of graceful degradation required and the

degree of tolerance to faults, depending on the value ζ.

The enormous advantage of this technique to produce a fault tolerant MLP network

over that of fault injection training is that the training time is essentially only that

required for plain back-error propagation. This is a great improvement over the long

training times required to produce essentially the same MLP network configuration

when injecting faults during the training session.

Note that stretching a unit's weight vector is equivalent to sharpening its activation

function, i.e. compressing the activation region over which its output transitions

between asymptotic values. Sharpening an activation function is achieved by

multiplying the exponential term in the sigmoid function by a constant τ, which is often

referred to as the temperature:

\[ \mathrm{output} = \frac{1.0}{1.0 + e^{-\tau\,(\mathbf{w} \cdot \mathbf{x} - \theta)}} \]

If τ = ζ and the bias θ is incorporated into the weights, then

\[ \mathrm{output} = \frac{1.0}{1.0 + e^{-\tau\,\mathbf{w} \cdot \mathbf{x}}} = \frac{1.0}{1.0 + e^{-\zeta\,\mathbf{w} \cdot \mathbf{x}}} \]

which shows the required equivalence.

6 Note that these input classes are not necessarily the training set classes. Also, the same assumptions apply as in section 6.3.1.

6.9. Results of Scaled MLP Fault Tolerance

Simulations were performed to examine empirically the resilience to faults of MLP

networks with scaled weight vectors. The same training set as used in previous

simulations was used so that comparison with their results could be made. The number

of hidden units in the simulations ranged from 5 to 12. Note that the MLP networks

were trained using the normal back-error propagation algorithm. However, the final

weight vectors feeding into the hidden units were then scaled by a factor ζh, and

similarly ζo for output units to produce a fault tolerant MLP network.

To allow results from MLP networks with various numbers of hidden units to be

directly compared, the service degradation method (c.f. section 4.14.6) was used to

collect reliability data. This requires each fault to be assigned a constant failure rate λ,

which together with equation 6.11 below probabilistically models the occurrence of the

fault type at time t:

\[ \Pr(\text{fault occurs}) = 1 - e^{-\lambda t} \qquad (6.11) \]

The service degradation method implies that a simulation is started from time t0, and at

each time step the fault status of every weight is assessed according to equation 6.11.

The degree of failure of the MLP network is then measured by some means, and the

process repeated for the next time increment.
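A sketch of this simulation loop is shown below. The `classify` helper, the stuck-at-0 fault value and the flat weight list are assumptions; the fault status of every weight is simply re-drawn at each time step from the cumulative probability of equation 6.11, following the description above.

```python
import math
import random

def service_degradation(weights, classify, training_set, lam=0.01,
                        t_max=20, fault_value=0.0):
    """At each time step every weight is independently faulty with probability
    1 - exp(-lambda * t) (equation 6.11); the degree of failure is measured as
    the proportion of training inputs that are misclassified."""
    failure_curve = []
    for t in range(1, t_max + 1):
        p_fault = 1.0 - math.exp(-lam * t)
        faulted = [fault_value if random.random() < p_fault else w for w in weights]
        errors = sum(1 for x, target in training_set
                     if classify(faulted, x) != target)
        failure_curve.append(errors / len(training_set))
    return failure_curve
```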

The measure of failure employed can either be discrete, or as is more appropriate for

neural networks (c.f. chapter 4), a continuous assessment of the system's reliability. The

measure used in these simulations was the proportion of inputs in the training set which


were misclassified. This can be related to the probability of failure at time t if the

selection of input patterns is uniformly distributed.

Graph 6.8 below shows the results of the service degradation simulations on a MLP

with 8 hidden units. Plots labelled original are of a normal back-error propagation

trained MLP network, those labelled stretched are the results obtained from the same

MLP network but with factors ζh=1.4 and ζo=100. These factors were chosen to allow a

degree of graceful degradation to occur at the hidden layer, and to completely mask

weight faults at the output layer. Maximum error is defined as the maximum error over

all output units for all input patterns. Average error is the average maximum error over

the input patterns.

It can be seen that the maximum output unit error of the modified MLP network is far

less than that of the original network at initial times t<4. Over the time period t=1 to t=4 the

output of the modified MLP network is not in error at any time, and no failure occurs.

However, the conventional back-error propagation trained MLP network is showing

significant output error. At later times, t>4, the maximum error in both networks is

over 1.0, and hence failure due to misclassification occurs.

However, during this latter period, the average output unit error was approximately the

same for both MLP networks. This shows that the fault tolerant network is not

sacrificing classification ability to achieve increased reliability. If this was not the case

it would be expected that the average error be more than that of the unmodified MLP network.

Graph 6.8 Output error of MLP with 8 hidden units over time [maximum and average error plots over time for the stretched and original networks]

The resulting increased reliability arises purely by allowing the inherent

resilience to faults of a perceptron unit to be apparent in the MLP networks' units by

increasing their absolute activation levels.

The plots in graph 6.8 are termed failure curves since they depict the probability of

failure in the system due to faults defined in the fault model. A measure for a system's

fault tolerance can be defined as the area bounded by the maximum error curve until it

rises to a point at which system failure occurs. Since a bipolar representation was used

in the simulations here, this is when the maximum output unit error reaches 1.0:

\[ FT = \int_{t=0}^{t_f} \bigl(1.0 - \mathrm{Error}(t)\bigr)\, dt \qquad \text{where } \mathrm{Error}(t_f) = 1.0 \]
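As a sketch, the measure can be evaluated numerically from a sampled failure curve with the trapezium rule; the crossing point at which the error first reaches 1.0 is not interpolated, which is an approximation of this sketch.

```python
def fault_tolerance_measure(times, max_errors, failure_level=1.0):
    """Area of (1 - Error(t)) from t = 0 up to the first sampled time at which
    the maximum output unit error reaches the failure level."""
    ft = 0.0
    for i in range(1, len(times)):
        e0, e1 = max_errors[i - 1], max_errors[i]
        if e0 >= failure_level:
            break
        dt = times[i] - times[i - 1]
        ft += 0.5 * ((1.0 - e0) + (1.0 - min(e1, failure_level))) * dt
        if e1 >= failure_level:
            break
    return ft
```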

Note that the area above the failure curve is measured so that increasing values of FT

imply a more fault tolerant system. A similar measure was used in chapter 5 for the

ADAM network.

Graph 6.9 Fault Tolerance of MLP for various numbers of hidden units [FT measure against number of hidden units (4 to 12), for the original and stretched networks]

Using this measure, graph 6.9 above shows how the fault tolerance of networks trained

with the previous weight scaling parameters ζ changes as more hidden units are added

to the MLP network. The fault tolerance of the original MLP network is also shown for

comparison. It can be seen that the fault tolerance increases as more hidden units are

added for both the original trained network and the modified network. As expected

though, the fault tolerance of the latter MLP networks is higher than the original.

In the above simulations ζh was kept small so that hidden units would exhibit a degree

of graceful degradation. If both ζh and ζo are set to large values, all units will tend to act


as if binary thresholded. Similar simulations as above were run to determine the number

of weight faults that would be tolerated before an output unit gave an erroneous output

value for any input pattern. Graph 6.10 below shows that, as expected, binary

thresholded hidden units decrease the reliability of a MLP as compared to using

partially soft thresholded hidden units. The weight factors ζh and ζo for the "soft

hidden" units were the same as above (1.4 and 100.0 respectively). For "hard hidden"

units, both ζh=100.0 and ζo=100.0 to approximate binary units. The close match

between the plots for actual binary units and simulated binary units indicates that this is

achieved. Overall, the results emphasise that a degree of graceful degradation in hidden

units is necessary for overall reliability since otherwise large errors are fed to output

units.

Graph 6.10 Number of weight faults tolerated before failure occurs given different values for weight stretching factors [weight faults at failure against number of hidden units (up to 50), for binary/hard hidden and soft hidden units]

6.10. Consequences for Generalisation

Clay and Sequin have shown that training with transient fault injection improves

generalisation and reduces the overfitting problem [96]. They attribute this to "a

suitably redundant internal representation" being developed due to their training

technique. However, from the results given above in section 6.8.1 this clearly cannot be

the source of the improved generalisation. In chapter 2, it was hypothesised that a

neural network would exhibit better generalisation if it was constrained to be fault

tolerant. This is due to its excess computational capacity becoming the redundancy

which supports such fault tolerance. With the new knowledge that training with faults

(affecting units or weights) causes the magnitude of weight vectors feeding units to be


increased, and does not result in any change in internal representation, a more accurate

analysis of the effect on generalisation can now be made.

Increasing the magnitude of a weight vector feeding a unit has the same effect as

sharpening its squashing function, i.e. decreasing the activation range over which it

jumps from near one asymptote to the other. This implies that inputs lying close to the

decision boundary between two classes categorised by a unit will result in near

saturation outputs rather than values mapping from activations on the sloping section of

the squashing function. As a consequence, inputs which are just incorrectly classified will

result in large errors. A more accurate fitting of units' hyperplane boundaries to the

required decision boundary would be possible (see figure 6.5) if weight stretching is

performed during training. This can also be seen in the results given by Clay and

Sequin in [96]. This claim is also made by Murray and Edwards [87].

Figure 6.5 Positioning and width of squashing function's slope of three units' hyperplanes between two classes for (a) Normal BP, (b) Stretching weights during training [hyperplanes of the units lie between classes A and B, with fuzzy or sharp transition regions either side of the decision boundary]

Applying weight stretching after training will not alter the actual decision boundary, but

it will still improve generalisation. This is since input vectors to some unit which lie

near its decision boundary would normally result in an abnormally low output value,

and this would adversely affect the operation of fed units. However, weight stretching


decreases the range of activation over which the squashing function transitions between

asymptotic values, and so inputs near to the unit's hyperplane decision boundary will

still output near asymptotic values (see figure 6.5). This then implies that no damaging

output value errors will be fed to subsequent units.

6.11. Uniform Hidden Representations

It can be seen from graph 6.9 above that the degree of tolerance to faults existing in a

MLP network increases with the number of hidden units. It is interesting to compare

this to the average Hamming Distance between the internal class patterns formed in the

hidden layer corresponding to each output class in the representation. Graph 6.11 below

shows this for an extension of the various MLP simulations used above in section 6.9. It

can be seen that as the number of hidden units increases, so does the average Hamming

Distance between the internal representation patterns.
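The measurement can be sketched as follows; the four class representations below are hypothetical stand-ins for the thresholded (±1) hidden layer patterns recorded in the simulations.

```python
from itertools import combinations

def hamming(u, v):
    """Hamming distance between two equal-length bipolar (+1/-1) vectors."""
    return sum(1 for a, b in zip(u, v) if a != b)

def representation_separation(hidden_reps):
    """Average and minimum Hamming Distance over all pairs of internal
    (hidden layer) class representations."""
    dists = [hamming(u, v) for u, v in combinations(hidden_reps, 2)]
    return sum(dists) / len(dists), min(dists)

# Hypothetical 4-class example with 6 hidden units:
reps = [[ 1,  1, -1,  1, -1, -1],
        [-1,  1,  1, -1, -1,  1],
        [ 1, -1,  1,  1,  1, -1],
        [-1, -1, -1, -1,  1,  1]]
print(representation_separation(reps))
```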

Resilience to the effect of faults occurring in the hidden to output connections will

improve as the average Hamming Distance, HDh, increases between internal

representations due to the computational fault tolerance of individual output units (c.f.

section 6.3.1). This leads to the increased overall resilience to the effect of faults

observed in graph 6.9 above. For instance, if the output of hidden units are not

erroneous, then a bipolar output unit will tolerate, on average, HDh weight faults.

However, if faults affecting input to hidden weights do cause errors to occur in the

outputs of some hidden units, then these will reduce the number of weight faults that

will be tolerated in the hidden to output connections as described above in section 6.5.

An important observation from graph 6.11 is that the standard deviation of Hamming

Distance between internal representations7 is small, though less so for very large

numbers of hidden units. This implies that resilience to faults will be uniform across all

the hidden to output connections since each output unit will tolerate approximately the

same number of weight faults. This is analogous to uniform storage of information

which was induced in the ADAM system as described in chapter 5.

7 Vertical bars on graph indicate one standard deviation each way from mean value.


Graph 6.12 below compares the average Hamming Distance between internal

representations as compared to a theoretical upper bound. This unconstrained upper

bound is simply given by

\[ p \le \frac{2^{n_h}}{V\!\left(n_h, \tfrac{1}{2} HD_h\right)} \qquad \text{where } V(n, d) = \sum_{r=0}^{d} \binom{n}{r} \]

The function V is the volume of a sphere of radius d in n-dimensional binary space.

However, this upper bound is rarely achievable in practice, especially since the hidden

representations formed in MLP's must be linearly separable with respect to the MLP's

output.
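A sketch of this bound, as reconstructed above: the number of internal representations that can be mutually at least HDh apart is limited by how many Hamming spheres of radius ½HDh fit into the 2^nh points of the hidden space. The example figures are hypothetical.

```python
from math import comb

def sphere_volume(n, d):
    """Number of binary points within Hamming distance d of a centre in
    n-dimensional binary space: V(n, d) = sum_{r=0..d} C(n, r)."""
    return sum(comb(n, r) for r in range(d + 1))

def max_representations(n_hidden, hd):
    """Unconstrained upper bound p <= 2^n_h / V(n_h, hd/2) on the number of
    internal representations separated by an average Hamming Distance hd."""
    return (2 ** n_hidden) // sphere_volume(n_hidden, hd // 2)

print(max_representations(10, 6))   # hypothetical: 10 hidden units, HD of 6
```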

Graph 6.11 Average and minimum Hamming distances between internal representations for various sized hidden layers [average and minimum Hamming Distance against number of hidden units (up to 60)]

Graph 6.12 Theoretical bound to maximum Hamming Distance between internal representations [empirical average and theoretical upper bound against number of hidden units (up to 60)]

It can be seen from graph 6.12 that the average Hamming Distance between internal

representations formed by back-error propagation training in MLP's diverges from the

theoretical maximum as more hidden units are used. It may be possible that better

internal representations with increased class separation could be formed during training

which will lead to increased tolerance to faults. If this could be achieved, the reliability

of a MLP network would be further improved.

6.12. Conclusions

This chapter has analysed the fault tolerance of perceptron units, and concluded that

individually they are extremely reliable. However, it was found that a MLP network

was not as fault tolerant as might be expected given this result. It was shown that training with weight faults develops a fault tolerant multi-layer perceptron network in a similar fashion to the injection of unit faults described in [71]. The trained fault tolerant MLP networks were extensively analysed to locate the mechanism which led to their robustness. It was found that both the hidden representations and the directions of weight vectors were not significantly different to a MLP network trained with normal

back-error propagation. The only discrepancy was in the magnitude of the weight

vectors.

Separate analysis of the effect of faults in a MLP, and the activation of units in a trained

MLP, showed how the back-error propagation algorithm results in individual units not

being fault tolerant due to insufficient unit activation levels. It was then shown that by

scalar multiplying every weight vector by factor ζ, each unit in the MLP would then be

capable of exhibiting fault tolerance as suggested by the initial analysis of a single

perceptron unit. This leads to better overall tolerance to faults in the entire MLP. An

advantage of this new technique as opposed to training with transient fault injection is

that training times are much reduced. Simulations were carried out which showed that

these two methods give comparable results, as would be expected.

An analysis of the hidden representations formed by back-error propagation training

showed that with more hidden units, the average Hamming Distance between internal

representations in the hidden layer of a MLP increases. As expected from the analysis

of the perceptron unit, it was found that the increased class separation led to improved

tolerance to the effect of faults in the entire MLP. This was then verified by the results


from various simulations. Further, the small standard deviation in the Hamming

Distance between internal representations implied that the effect of faults would be

approximately uniform across the hidden to output connections. This can be compared

to the uniform storage technique developed in chapter 5 for the ADAM network.

In conclusion, this chapter has shown how to allow a MLP network to use the inherent

fault tolerance of its perceptron-like units to produce an overall fault tolerant system. As

discussed in section 6.6.1, this is only one area of distributed processing which results

in fault tolerance being exhibited by a MLP. The other is to force the development of

redundant representations in each hidden layer. Although the simulations above showed

that as more hidden units are added to a MLP the Hamming Distance between internal

representations increases, and hence also resilience to faults, it is unlikely that the

maximum fault tolerance possible is achieved.


CHAPTER SEVEN

Conclusions

7.1. Overview

This thesis has examined the effect of faults on the reliability of the operation of

artificial neural networks. Their functionality was visualised at an abstract level rather

than considering actual implementations so that the fault tolerance arising from their

computational nature could be analysed. It also allowed the question to be posed "do

neural networks possess inherent fault tolerance?". Other reasons for making this

decision are given in chapter 3. Various concepts relating to techniques for achieving

fault tolerance in neural networks were also discussed in this chapter. These included

distribution of information and processing, generalisation, and the architectural

structure of neural networks. It was also considered whether requiring fault tolerant

behaviour could be applied as a constraint in a neural network to improve its

generalisation. The style of failure in neural networks was studied with respect to the

type of problems for which their computational nature is most suited, and the reasons

for neural networks exhibiting graceful degradation analysed.

A methodology was defined in chapter 4 by which the effect of fault tolerance

techniques on a neural network's reliability could be assessed. It addressed issues such

as the construction of fault models for systems visualised at an abstract level,

approaches to measuring the effect of faults on a system's reliability, and also various

simulation frameworks. In appendix A, it was shown how this methodology can be

applied to assess the reliability of a feedback neural network based only on a high-level

functional specification of its operation. This provided a very general approach to

reliability assessment in cases where no error function is provided by the learning

algorithm.


Using this methodology, various neural network models were then investigated to gain

an understanding of the effect of faults on their operation. This knowledge identified

various potentials for inherent fault tolerance in neural networks, and led to techniques

being developed to improve reliability by increasing resilience to the effect of faults.

These results will be summarised in this chapter.

7.2. Basis for Inherent Fault Tolerance

The simple perceptron unit which is used in various forms in many neural network

models has been shown to be highly fault tolerant. It was shown in chapter 6 that the

number of weight faults that can be tolerated before it fails to distinguish between two classes is dependent on the Hamming Distance (HD) between them. Another factor is whether it operates on bipolar or binary inputs. It was shown that the maximum number of weights which can be defective in the case of a bipolar perceptron unit is HD, while only ½HD for a binary unit.

This basic result implies that neural networks employing perceptron-like units can be

made fault tolerant by ensuring that the characteristics of the input domain1 to a unit

meet the above requirements.

7.3. Fault Tolerance Mechanisms

This section will combine the various results from investigations in to ADAM in

chapter 5 and MLP's in chapter 6. The various computational properties in artificial

neural networks that lead to resilience to faults on their operation will be summarised.

These include ensuring uniform distribution, modular redundancy, architectural

constructs, and learning algorithms.

7.3.1. Uniform Fault Tolerance

A major factor in neural computation which leads to fault tolerant behaviour is that of

uniform distribution of information. By this it is meant that in addition to information

being distributed throughout a neural network's components during training, the

functional load placed on each component is approximately equal. Uniform distribution

implies that the effect of faults is not limited to a particular region of input space.

1 Note that this is not necessarily the input domain to the neural network. The input domain to an output

unit in a MLP comes from the hidden layer.


Instead faults cause degradation to the neural network's operation over a wide range of

inputs. This can be viewed as providing uniform fault tolerance. It was noted in chapter

3 that this characteristic would not occur in neural networks exhibiting local

generalisation, and so only globally generalising neural networks were studied in this

thesis.

In ADAM it was found that by ensuring all rows in the associative storage matrix

would store an equal number of class vectors on average, a great improvement in

resilience to faults could be achieved. This uniform storage was accomplished by the

addition of an extra preprocessing stage to ADAM which incurs very little extra

computational cost. Although the technique implies that twice the number of resources

are required, it was shown that the benefits with respect to increased reliability

outweighed these costs.

A similar result was found for the multi-layer perceptron network. The internal

representations formed by training with a modified back-error propagation algorithm in

MLP's with various numbers of hidden units were examined. It was found that the

average Hamming Distance between the hidden representations formed for each class

centre was proportional to the number of hidden units. More importantly, the standard

deviation was small in comparison which implies uniform storage occurs. This can be

explained by considering that, as described above, the fault tolerance of a unit in a MLP is dependent upon the Hamming Distance between its input classes. Since all hidden representations are approximately equidistant in terms of Hamming Distance, this implies that each output unit's resilience to faults will be near uniform.

7.3.2. Modular Redundancy

A more well known fault tolerance mechanism for achieving increased reliability has

been examined for the ADAM network, and also indirectly for the MLP network.

Redundancy can be achieved by replicating sub-systems, and so improve reliability if

the increased complexity of the overall system does not prejudice this. In ADAM, the

basic system module is a tuple unit together with the matrix region which it addresses.

Its output consists of the required class vector plus noise due to some level of memory

saturation.


It was shown that increasing the number of tuple modules improved the overall

reliability of the system without being compromised by too rapid a rise in complexity.

This analysis was achieved by modelling the occurrence of faults using a time-based

probability density function which allows varying sized systems to be compared

realistically, as was described in chapter 3.

In MLP networks, the hidden unit can be viewed as the basic functional entity

controlling the capacity of the overall system. As with ADAM, it was shown that

employing more hidden units increases overall reliability even with the resulting

heightened system complexity.

7.3.3. Architectural Considerations in ADAM

The function of specific functional components in some neural network models may

have bearing on the overall system if they occur in sufficient numbers and have a

significant role. In ADAM, such a component is the tuple unit. These comprise the

preprocessing layer which forms the vector input to the associative matrix. Their

function and number require that their reliability must be taken into account. Results

given in chapter 5 showed that small tuple units should be used in ADAM systems for

greatest reliability. This is due to their lower potential noise levels in the presence of

faults (activating extra matrix rows).

This result reinforces the conclusions given for modular redundancy above where using

many tuple units increases ADAM's reliability in the presence of faults. This is since the

dimensions of a problem's input space specify the dimensions of ADAM, and a small

tuple size implies that a large number of tuple units will be required.

Another objective in assessing computational fault tolerance is that of locating potential

critical faults. These are important to identify since it allows future implementation

designs to specifically protect against them. Fault injection experiments in ADAM

indicated that stuck-at-1 faults in the key vector and stuck-at-0 matrix link faults have

the greatest effect on its reliability.

7.3.4. Learning in Multi-Layer Perceptron Networks

It was found that a few critical weights will exist in MLP's trained using the back-error

propagation learning algorithm. This was surprising since perceptron-like units which


are the basic building blocks of MLP's can be fault tolerant (section 7.2). Due to this

result a training method which develops fault tolerant MLP's was then examined. The

MLP is trained using the normal back-error propagation algorithm, but small numbers

of transient faults are injected at each step. This results in a MLP which tolerates many

faults, though the training time can be very long.

First, a more appropriate fault model than those which other researchers have used for

the MLP network was developed using the methodology described in chapter 3. Rather

than considering unit faults, weights were identified as the basic defect. Transient fault

injection training was then performed, and it was found that the MLP's exhibited better

fault tolerance than when unit faults are injected during training.

These fault tolerant MLP's were then analysed to determine the source of their

increased reliability. It was found that both the internal representations formed and the

direction of the units' weight vectors were essentially unchanged. The only difference

observed was that the magnitude of the weight vectors was greatly increased. The

mechanism by which this change lead to increased reliability was discovered by

considering the effect of faults on a unit's activation. It was shown that a weight fault

causes the absolute activation to decrease. If weights are small, then a loss of unit

activation causes the absolute output of a unit to decrease in the region where the

thresholding function transits between its two output extremes. By increasing the

magnitude of weights the average activation of units lies further away from this region

of the thresholding function. This results in faults not causing an immediate decrease in

a unit's output. It was also shown that this is functionally equivalent to sharpening a

unit's thresholding function. This technique is another fault tolerance mechanism for

perceptron-like units (c.f. section 7.2). To summarise, this mechanism decreases the

sensitivity of a unit's output to changes in its activation caused by faults.

The back-error propagation learning algorithm was then analysed to discover why it

produced weight configurations resulting in such limited unit activation. This involved

studying the dominant terms in the weight change equations. It was shown that units'

activations will be limited in magnitude to values clustering around the region where

the thresholding function begins to approach its asymptotes. This led to the lack of

resilience to the effect of faults as described above.


An extremely useful result from this analysis was that a fault tolerant MLP, similar to

one trained with transient fault injection, can be constructed merely by the scalar

multiplication of weight vectors after training with basic back-error propagation. This

avoids the extremely long learning times required when transient fault injection

training is employed. It was also shown that a similar result can be obtained by merely

sharpening the thresholding functions in each unit.

In section 7.2, another fault tolerance mechanism was described which was found to

result in increased reliability in a perceptron unit depending on the Hamming Distance

between the two classes which it distinguishes. To assess this, MLP's with varying

numbers of hidden units in their intermediate layer were trained on a fixed classification

problem. As expected, reliability increased with the number of hidden units used. The

Hamming Distance between the internal representations formed for each input class

were then measured. It was found that the standard deviation of the Hamming Distances between internal representations was low, implying that they were fairly uniformly distributed in Hamming space (c.f.

section 7.3.1). Also, it was shown that the average Hamming Distance was close to a

theoretical upper bound implying that the back-error propagation algorithm does

develop internal representations which will lead to fault tolerance in this respect.

7.4. Inherent Fault Tolerance?

In conclusion, results given in this thesis have shown that neural networks do have the

potential to be inherently fault tolerant, although current learning algorithms do not

always develop appropriate weight configurations. For example, it was shown how the

activations of units in a MLP trained using the back-error propagation algorithm lie at a

critical point on the thresholding function, and faults cause their absolute output to

decrease. In ADAM, class vectors are not stored in a uniform manner in the associative

matrix, and localised memory saturation occurs.

It was noted in chapter 2 that the question of whether neural networks are inherently

fault tolerant is currently undecided in the literature. This conflict has been shown to

arise due to a distinction not being made between considering the neural computational

paradigm and trained neural networks. Given their implicit assumptions, both views are

essentially correct. Neural networks do have the potential to be inherently fault tolerant

given a suitable learning algorithm. However, current algorithms such as back-error


propagation do not develop suitable weight configurations. To achieve fault tolerance in

the one layer binary weighted neural networks in ADAM, the loss of information

during training when new links have already been previously set must be minimised.

7.5. Implications for Future Research

The research presented in this thesis has shown that inherent fault tolerance mechanisms

do exist in neural networks, and various constructive techniques have been developed

which promote these. However, the research has also indicated various avenues which

seem promising for future research.

7.5.1. Generalisation

In chapter 3 it was proposed that applying fault tolerance as a constraint during learning

will improve generalisation in neural networks. This thesis has not examined this area

in any detail; instead it has concentrated on the initial problem of developing fault

tolerance mechanisms. However, it would be useful to determine if this proposal has

any justification. Generalisation in the presence of input noise seems likely if the

distortion caused by faults is functionally similar. In particular, the effect of uniform

distribution of information, which has been shown to be a fault tolerance mechanism,

on generalisation deserves examination. For instance, maximising distance between

internal representations in MLP's could result in decreased generalisation if too diverse

representations are formed.

The area of computational learning theory (CLT) could also be used to examine

rigorously the effects of imposing fault tolerance as a constraint in neural networks.

This mathematical framework addresses the question of whether a general learning

device will correctly generalise, i.e. learn to represent the underlying problem. A central

equation in CLT considers the number of training examples that are required to

constrain a model with some given capacity. Recognising that improving fault tolerance

reduces the capacity of a system, the number of training examples required will be

reduced. Lines of work would be to develop bounds on the capacity of a neural network

when certain fault tolerance mechanisms are imposed.


7.5.2. Internal Representations

This also suggests another line of research. Current bounds on the theoretical maximum

Hamming Distance between internal representations could be improved. For example,

the constraint that each class must be linearly separable from the other classes

corresponding to internal representations could be introduced. This will allow the

effectiveness of learning algorithms, such as back-error propagation, to be assessed in

neural networks composed of perceptron-like units with respect to the resilience of

individual units to faults.

7.5.3. Implementations

Another important area is to consider how computational fault tolerance mechanisms as

described in this thesis can be preserved in an implementation design. This will allow

inherent fault tolerance to be achieved at little or no extra complexity. Also, the effect

of conventional fault tolerance techniques applied at the implementation level to further

enhance reliability should be assessed. However, an objective should be that the fault

tolerance due to a neural network's computational nature is not compromised, and this

provides another area which can be examined.

7.5.4. Neural Fault Tolerance

Finally, additional fault tolerance mechanisms should be sought at the computational

level in artificial neural networks. In particular, neural network models involving forms

of feedback should be considered. The question of whether errors caused by transient

faults can be self-corrected in an iterative neural network is of great interest, especially

when applied to control problems.

Certainly this final section does not cover all areas connected with fault tolerance

mechanisms and reliability deserving future research, but it can be seen that there is a

large scope for study.


APPENDIX A

Fault Tolerance of Lateral

Interaction Networks

This appendix was a paper published in IJCNN-91, Singapore [109]. It is included in

this thesis as an example of how the degree of failure in an artificial neural network can

be assessed from a specification of its functionality, rather than by using a test set of

data (c.f. chapter 4).

A.1. Introduction

Neural networks offer a parallel distributed method of processing information unlike

that of conventional serial computing systems; moreover, their underlying basis of

computation is analogue rather than digital. Although they were inspired from studies

of the structure of the brain [62], artificial neural networks are a very simplified model

of biological neural networks, and are also very much smaller. However, it is generally

accepted that neural networks are well suited to solve problems which are very

successfully tackled by biological neural systems such as our brain.

Artificial neural networks consist of a large number of simple processing units (often

termed neurons) which are highly interconnected. Each unit forms a weighted sum of

its inputs, then thresholds it with respect to some internal bias value using some

bounded non-linear function. The selection of suitable weights and biases such that a

problem is solved is performed by some (normally iterative) algorithm; this process has

been termed learning.

It has commonly been mentioned that neural networks are naturally fault tolerant

[7,8,13,22], i.e. will continue to provide acceptable service in the presence of faults.

The intuitive reasoning behind this assertion is that their distributed processing is


resilient to errors caused by faults, and large fan-in to individual units renders

insignificant the effect that faults can cause.

The objective of this paper is to examine the fault tolerance of lateral inhibition arrays.

Section A.2 discusses the suitability of applying neural networks to application areas

with respect to their solution characteristics. The structure and operation of lateral

inhibition arrays is then described in section A.3. Section A.4 defines a fault model.

Failure is considered in section A.5, and it is discussed how its occurrence depends

upon the lateral inhibition array's application area. Section A.6 details the empirical

results obtained. Finally conclusions are drawn.

A.2. Soft/Rigid Application Areas

Application area solutions can be identified as either soft or rigid. Considering the

solution of a problem to be represented by a function from some N-dimensional space

to M-dimensional space, then it can be termed soft if the function is fairly smooth and

continuous, i.e. as an input vector traverses its space, the output vector will also do

likewise.

Conversely, a rigid problem is characterised by a discrete mapping, and an instance of

the problem has a clear-cut exact solution. It will generally be the case that solutions

given for various similar instances will not themselves be similar.

Most neural networks can perform either a functional mapping or classification depending on whether the thresholding function applied to their output units is linear or saturating/hard-limiting. The output of the former will span the entire output space,

whilst the latter will always produce a restricted set of output vectors. The concept of

generalisation in neural networks should be distinguished between these two categories.

For a classification system, generalisation implies that input patterns close to a stored

pattern will be given the same class. For a neural network performing a functional

mapping though, generalisation will generally involve some form of interpolation.

Neural networks will exhibit generalisation when solving soft problems since regions

around input-output pairs are related rather than only the actual points, as is the case

with rigid problems.

\[ \text{if } x \mapsto f(x) \text{ then } x + \delta x \mapsto f(x + \delta x) \to f(x) \text{ as } \delta x \to 0 \]

Note that it may be possible in some cases to change the representation of the instance

of a problem such that a seemingly rigid problem may become soft. This will happen if

the new representation has the property of adjacency, i.e. nearby members of the

representation are nearby in problem space. For example, binary addition is rigid, but

by representing it using real numbers it becomes soft. This technique can be a key

element in helping to attain generalisation in neural network systems.

A.2.1. Implications for Reliability

Any system will inevitably be affected by factors such as noise and uncertainty, and so any particular instance of a problem will actually be represented by a

small region in input space. However, if the problem is soft, then this will map to

another region in output space. This leads to noise tolerance, though it also implies that

the concept of a precise answer being produced by such a system is meaningless.

The inherent fault tolerance of neural networks can be reinforced given suitable

input-output representation and internal computational processing. High fan-in to units

means that although faults may cause severe local damage (e.g. weight set to opposite

extreme), this will only cause a deviation from fault-free unit output values. The

influence of a single input to a unit is limited. Even if a unit was faulty, the extreme

case of its output going to its opposite extreme would only cause a single input to all

subsequent units to be affected. So, if the input-output representation chosen is such

that the problem is soft, then this deviation from fault-free values will be tolerated due

to adjacency.

A.2.2. Verification

Neural network learning algorithms tend only to approach the optimum set of weights

and biases for a problem, i.e. it is possible that some problems cannot be perfectly

solved. This implies that even a fault-free neural network may not produce exactly the

desired response for any given input. If generalisation is relied upon, then the output

quality will be even more degraded. However, for soft problem domains where exact

solutions are not appropriate, this is acceptable behaviour, though verifying that a

neural network meets its specification will be very hard. For example, it could be that a

small portion of input space is not properly generalised, and that failure will occur if it


is accessed. This will not be detected with absolute certainty by testing, and exhaustive

testing is likely to be unfeasible.

Some neural network paradigms (e.g. Kohonen [91], Barto et al [120]) incorporate the

idea of continual adaptation to both environment and/or internal structure. These

present a special problem for verification. It is possible that a system could adapt in

stages, each stage being built on a previously verified core, and then the new system

itself being verified.

A.3. Lateral Inhibition

Lateral inhibition arrays, also known as centre-on surround-off cells, are a class of

single-layer neural networks with feedback between output units. They developed from

studies of Limulus (the Horseshoe crab) by Hartline and Ratliff [121] where it was

found that lateral inhibitory feedback occurred between nearby receptor units in its

optical system. This can be generalised to include excitatory feedback as well, and in

general, the functional structure of such feedback to a particular neuron from

surrounding neurons depends on the distance between them. The central neuron is

excited by nearby neurons, a ring of neurons surrounding these exert an inhibitory

influence, whilst more distant neurons supply weak excitation. These can be termed

lateral interaction networks.

To simulate such a system, N units are arranged in a single layer, and connections are

made to each unit from neighbouring units. The value of the weight of each connection

is derived from a Mexican-hat function (see figure A.1). For simplicity only 1D-arrays

of units are considered in this paper, though it is expected that results can be generalised

to higher dimensions. Simulations used a more discrete form of the Mexican-hat

function ignoring the long-distant weak excitation. Note that units near the array edges

will be unduly influenced from the interior due to imbalance between their incoming

excitatory and inhibitory influences, and so various boundary effects will occur. Since

the size of arrays used in simulations are limited, these effects would cause noticeable

distortion. To overcome this, the array of units was joined together at its ends, thus


forming a circle and effectively simulating an infinite array.

Figure A.1 Lateral interaction network, dotted lines show how weights

correspond to Mexican-hat function

A.3.1. Network Dynamics

During operational use, an input vector I is initially imposed on the array of N units

forcing their outputs $O = (o_1, o_2, \ldots, o_N)$ to assume this value. The array of units then

synchronously updates over some time period T in discrete time steps δt. Note that in a

biological system this would actually happen in a continuous manner. Each unit

evaluates its total activation and passes it through some non-linear function σ:

\[ o_i(t + \delta t) = o_i(t) + \sigma\!\left( \sum_{j=-k}^{k} o_{i+j}(t)\, w_{ij} \right) \]

where $W_i = (w_{i(-k)}, w_{i(-k+1)}, \ldots, w_{i0}, \ldots, w_{i(k-1)}, w_{i(+k)})$ are the weights on lateral connections

from a radius of k surrounding units either side of any particular fed unit i. Note that no

learning takes place; the dynamics of the system depend upon the ratio of excitation to

inhibition and their actual magnitudes, and also the ratio between the ranges over which

excitatory and inhibitory connections extend and also their actual radius.
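A minimal simulation sketch of these dynamics is given below. The particular Mexican-hat values, the choice of tanh as the bounded non-linear function σ, and the clamping of outputs to [0, 1] are assumptions of the sketch rather than the parameters used in the simulations reported here.

```python
import math

def mexican_hat_weights(k_exc=2, k_inh=4, w_exc=0.2, w_inh=-0.15):
    """Discretised Mexican-hat weight vector indexed by lateral offset j:
    excitation within k_exc units either side, inhibition out to k_inh,
    long-distance weak excitation ignored (as in the simulations)."""
    return {j: (w_exc if abs(j) <= k_exc else w_inh)
            for j in range(-k_inh, k_inh + 1)}

def step(outputs, weights, gain=1.0):
    """One synchronous update o_i(t+dt) = o_i(t) + sigma(sum_j o_{i+j}(t) w_j),
    using the ring (wrap-around) boundary described above; outputs are clamped
    to [0, 1], an assumption of this sketch."""
    n = len(outputs)
    sigma = lambda a: math.tanh(gain * a)   # bounded non-linear function (assumed)
    return [min(1.0, max(0.0, o + sigma(sum(outputs[(i + j) % n] * w
                                            for j, w in weights.items()))))
            for i, o in enumerate(outputs)]

# Impose a smooth unimodal input stimulus and iterate the array synchronously.
state = [math.exp(-((i - 10) ** 2) / 20.0) for i in range(32)]
w = mexican_hat_weights()
for _ in range(20):
    state = step(state, w)
```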

The application role of lateral interaction networks is seen as providing pre-processing

to a system or filtering of communications between sub-systems. Two functions which

they can perform are described below.


Figure A.2 Lateral interaction network functions (a) Clustering, and (b) High-

frequency filter (LF - Low Frequency, HF - High Frequency)

A.3.2. Operational Behaviour

The behaviour of lateral interaction networks as defined above is that of forming a

cluster from an input stimulus around its centre of activity, an example of this is

depicted in figure A.2(a). This has been termed an activity bubble by Kohonen [91]. For

a more realistic input stimulus, i.e. one that is not a smooth unimodal distribution, the

behaviour is not so simplistic. For example, the output might join together two separate

input peaks into one, or more than one stable cluster might form in the final output.

Given certain conditions, a lateral interaction network can also act as a high-frequency

filter. If, instead of nearby units exciting the central unit, it is only inhibited by

neighbouring units (i.e. as in non-primates), and the extent of this inhibition is only

local, then low frequencies are blocked and high frequencies passed. Figure A.2(b)

shows edge detection from a stationary square-wave input stimulus. The high frequency

areas (HF) are retained whilst units in low frequency areas (LF) are forced inactive. The

width of the final peaks in the output is proportional to the difference between the inhibition radius and the width of the initial image.

A.3.3. Stabilisation

In both cases it has been assumed that a lateral interaction network is iterated over some

time period T, but no mention has been made of how long this should be. However, in

any implementation some mechanism must be included which will indicate when the

network's output is ready for further processing. Two such methods could be to either



specify that a fixed time period Tf is required, or else the outputs could be monitored

for stability, and a signal then sent to indicate completion of processing. For reasons of

locality, this latter option is used in simulations.
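A minimal sketch of this stability-monitoring option follows, assuming the update function from the earlier sketch; the tolerance eps and the iteration limit are illustrative values rather than parameters taken from the simulations.

    import numpy as np

    def run_until_stable(o, w, eps=1e-4, max_steps=1000):
        # Iterate the lateral interaction dynamics until the outputs change by
        # less than eps between successive steps (taken here as the signal that
        # processing is complete), or until a safety limit is reached.
        # 'update' is the synchronous update function sketched in section A.3.1.
        for t in range(max_steps):
            new_o = update(o, w)
            if np.max(np.abs(new_o - o)) < eps:
                return new_o, t + 1
            o = new_o
        return o, max_steps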

A.4. Fault Model

A fault model must list which abstract components of a system could go wrong, and also the effect of faults on their fault-free behaviour. A good fault model should adequately cover all physical faults that could occur, though simulation must be computationally

feasible. A difficulty that arises with the majority of neural networks is that no suitable

implementation technology yet exists; their connectivity implies a three-dimensional

implementation medium. For this reason, and also since it may lead to deeper

understanding, it is best to examine the fault tolerance of neural networks from an

abstract viewpoint. A framework for constructing a fault model from an abstract

definition is given in Bolt [111].

Constructing the fault model initially requires fault locations to be identified. By

examining the definition of lateral interaction networks as given in section A.3, the

construction of a suitable fault model can be based purely on the Mexican-hat function

which determines the weight values. Individual unit attributes are not included since

they are insignificant with respect to the number of connections between units. Since

the weight vectors applied to every unit are identical, it is reasonable to assume that an

implementation would store them globally, and so any weight fault will affect every

unit.

Figure A.3 Faults affecting global weight vector



Now that the components of the global weight vector have been identified as the fault

model's locations, it only remains to define faulty behaviour. Operating on the principle

of maximum damage, two failure modes can be constructed for a faulty weight element.

These are stuck-at-0 and inverted. The latter refers to an excitatory connection becoming

inhibitory and vice versa. Note that the loss of a connection is incorporated by default

into the above fault definitions affecting the global weight vector, though more severely

since the matching connection will be lost for every unit. Figure A.3 illustrates faults

affecting a simplified discrete Mexican-hat function.
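The following sketch illustrates how these two failure modes might be injected into the shared global weight vector; the function name and its parameters (fraction, mode, rng) are illustrative and do not correspond to the actual simulation code.

    import numpy as np

    def inject_weight_faults(w, fraction, mode, rng):
        # Apply faults to a copy of the global weight vector.  Because the same
        # vector is shared by every unit, each faulty element is faulty for
        # every unit in the array (the maximum-damage principle above).
        faulty = np.array(w, dtype=float)
        n_faults = int(round(fraction * len(w)))
        idx = rng.choice(len(w), size=n_faults, replace=False)
        if mode == "stuck-at-0":
            faulty[idx] = 0.0
        elif mode == "inverted":
            faulty[idx] = -faulty[idx]          # excitatory <-> inhibitory
        else:
            raise ValueError("unknown fault mode: " + mode)
        return faulty

    # e.g. inject_weight_faults(w, 0.1, "inverted", np.random.default_rng(0))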

A.4.1. Timescale

Faults can be classified as either transient or permanent. The lifetime of transient faults

is only some short period of time, whereas permanent faults persist forever. By far the most common

are transient faults [104], and it is these that will be modelled in simulations.

The timing of when faults should realistically be introduced, and for how long they

should last, must be defined before simulations can be performed. This will depend on

the type of application area as well as the functionality of lateral interaction networks. If

the application involves the use of a lateral interaction network as a component, then its

operation should be viewed as a single step. Any faults should be injected when the

input is initially presented to it, and they should be defined to last for the complete

processing of the input pattern (this case applied to the simulations performed). However, if a subsequent system is sensitive to changes in the network's outputs, or the network is considered as the entire system under

investigation, then any faults should be injected at each iteration of the lateral

interaction process. The duration of each fault should be only one iteration since the

evolution of the output is paramount.

A.5. Definition of Failure

Due to the correspondence between soft applications and the nature of neural network computation, failure is not a clearly observable discrete event; rather, it is a degradation in the quality of the solution which is represented by the outputs of the neural network.

This implies that a continuous measure of failure is more suitable. Since, as mentioned

in section A.2, even the fault-free response of a neural network may vary around the



correct output, defining failure sensibly can be a difficult task. Failure of a system will

also depend upon the structural level at which it is viewed, either as an entity in its own

right or as a component of a larger system.

The equations for measuring failure given below should not be viewed as the only ones possible; they are merely examples. However, they are designed to give a good

representation of the degree of failure as required by the circumstances of each

situation.

A.5.1. System Failure

Considering an isolated neural network when no training data exists (i.e. either unsupervised learning or fixed dynamics, as is the case with lateral interaction networks), two methods exist by which the definition of failure can be approached. First, requirements can be placed on what operation the

neural network is supposed to perform which can be used to produce a specification.

This can then be used as a base from which to define failure. Alternatively, current

deviation from previously obtained fault-free results can be assessed to indicate degree

of failure. Note that such test data will have to be obtained under strict conditions.

As an example of the first method, a lateral interaction network can be viewed as an

edge enhancer, i.e. a high-frequency spatial filter. By describing these operational

characteristics, failure can be defined as either low-frequencies being passed or

high-frequencies blocked:

    F = \frac{1}{N} \sum_{u=1}^{N} \left[ o_u (1 - \Delta i_u) + (1 - o_u)\, \Delta i_u \right]        (A.1)

where Δi_u is the normalised maximum increase in the initial input of unit u with respect to its immediate neighbours.

The second method is particularly applicable when the operation of a neural network is

very complex or when it is unknown, i.e. a black-box system. Also, it resembles the method for

assessing error in a supervised learning neural network, the only difference being that

the associated pairs of input and output data are not supplied externally as a goal to

achieve, but must be carefully collected from fault-free operation. Note that failure

cannot be monitored on-line, but periodically the neural network must be assessed on

the test data. The degree of deviation of actual output from the known result for specific




inputs could be defined as

    F = \frac{1}{N} \sum_{p} \sum_{u} \left| t_{pu} - o_{pu} \right|        (A.2)
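A sketch of this deviation-based measure, in the spirit of equation (A.2), is given below. It assumes the reference outputs were collected from fault-free operation; whether any further normalisation over the number of test patterns was applied is not specified here.

    import numpy as np

    def deviation_failure(test_inputs, reference_outputs, network):
        # Sum of absolute deviations of the (possibly faulty) network's outputs
        # from fault-free reference outputs on the same test inputs, divided by
        # the number of output units N, as in equation (A.2).  'network' is any
        # callable mapping an input pattern to an output vector.
        N = len(reference_outputs[0])
        total = 0.0
        for x, t in zip(test_inputs, reference_outputs):
            o = np.asarray(network(x), dtype=float)
            total += np.sum(np.abs(np.asarray(t, dtype=float) - o))
        return total / N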

A.5.2. Component Failure

When viewing a neural network as a component of a larger system, failure has to be

considered somewhat differently. In this case, failure of the neural network can be

defined as occurring when the surrounding system cannot correctly perform its

computation due to erroneous input fed from the neural network component. The way

in which failure occurs will depend to a large extent on whether the subsequent system

is rigid or soft. If it is rigid, then the definition of failure of the neural network will be

discrete, whilst if the fed system is soft (possibly another neural network) then failure

can be continuously measured.

As an example of the latter case, a lateral interaction network could be used in

conjunction with a Kohonen network [91] during the training phase to select the

neighbourhood of units eligible for change. Failure will be related to the inaccuracy of

the neighbourhood indicated, i.e. maximally active input areas not being selected, and

input areas which are not maximally active being selected:

    F = \frac{1}{N} \sum_{u} \left[ o_u (1 - i_u) + (1 - o_u)\, i_u \right]        (A.3)

Note that the failure measure

also penalises selection when the difference between maximally and minimally active

inputs is small, as is required behaviour for adaptation of Kohonen networks.

However, if only the maximally active input area is required, measuring failure must

also penalise the case of more than one distinct area being selected.

An example of a lateral interaction network feeding a rigid system could be that it

selects the highest input value which is then discretely mapped (the rigid system) to an

address, e.g. selecting a winner based on competition marks. It would not be acceptable if

the wrong input was selected, even if it was near to the correct input element since there

is no representation adjacency in the rigid system. Failure is then the discrete event

    \text{Failure} \iff \exists\, i \,.\; h(t_i) \neq h(o_i)        (A.4)

where h is the Heaviside function.


A.6. Empirical Investigations

The application of lateral interaction networks both for edge enhancing and

neighbourhood formation as a component in Kohonen networks was examined.

Simulations for both cases were performed using appropriate lateral interaction network

configurations and failure measures (equations A.1 and A.3 respectively). Data used

was constructed manually to reflect a wide range of variability. The global weight array

spanned the entire array of units and was scaled as required to match the range of lateral

interaction. For simplicity, a square approximation to the interaction function as

displayed in figure A.1 was used. Faults were randomly introduced with equal

probability in 10% increments to the initially fault-free global weight array. Since

scaling was used in accessing the array, this meant that faults during operation occurred

probabilistically. All simulations were repeated 25 times with different random number

seeds for statistical analysis.
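The following sketch outlines the experimental procedure just described; build_network and run_trial are hypothetical stand-ins for the actual simulation code, while the 10% fault increments, the 25 random seeds and the mean/standard-deviation summary follow the description above.

    import numpy as np

    def estimate_pr_failure(build_network, run_trial, n_seeds=25):
        # For each fault level (10% increments), run n_seeds independent trials
        # with different random seeds and record the mean and standard deviation
        # of the observed failure indicator or measure.
        results = {}
        for level in np.arange(0.0, 1.0 + 1e-9, 0.1):
            outcomes = []
            for seed in range(n_seeds):
                rng = np.random.default_rng(seed)
                net = build_network(rng)                 # fault-free configuration
                outcomes.append(run_trial(net, level, rng))
            outcomes = np.asarray(outcomes, dtype=float)
            results[round(float(level), 1)] = (outcomes.mean(), outcomes.std())
        return results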

Plots of results show the probability of failure for various ranges of lateral interaction

against the percentage of faults injected. On the title line, E and I refer to the fault-free

values of excitatory and inhibitory weights respectively.

A.6.1. Edge Enhancing

Simulations were carried out on six different types of data. Four sets consisted of

variously positioned/sized sharp bars: a single bar decreasing in size, and two bars changing in

size and/or moving together. The remaining datasets were constructed from members of

the first four, but the edges were changed to smooth curves. The total number of

different patterns used was 43. The standard deviation of the probability of failure in all

simulations was no more than 0.08.

It was found that the range of fault-free weight values for the excitatory and inhibitory

links did not alter the basic operational behaviour of the network with respect to the type of data processed, and only slightly altered the degree of fault

tolerance exhibited. The results for the single bar dataset and the effect of varying the

weight values are given in graph A.1. Note that good graceful degradation is exhibited.

For high levels of faults, it appears that networks with small ranges of lateral interaction

have significantly better fault tolerance than those with large ranges. However, since

even with a small interaction range the probability of failure is large for high fault


levels, this result is not particularly useful.

Graph A.1 Effect of varying the excitatory/inhibitory weights: (a) Single bar (E=0.3, I=0.6); (b) Ranging over E=0.3, I=0.6-0.9

Results also indicate that similar behaviour was exhibited by edge enhancing lateral

interaction networks for each dataset (graph A.2a), though not unexpectedly their

performance on smooth edged data was somewhat degraded (graph A.2b). However,

the structure of the standard deviation was found to depend on the type of data processed, though similarity did exist across the various choices of

excitation/inhibition weight values (see graph A.3).

Graph A.2 Variation in Pr(failure) due to dataset characteristics: (a) Ranging over all datasets; (b) Only hard-edged datasets


Graph A.3 Standard deviation for various datasets/weight values

The results from simulations performed on the four hard-edged datasets were combined

such that the maximum probability of failure was selected (see graph A.4). From this, it

is concluded that for a reliable system with respect to faults the interaction range should

be set to 3.

Graph A.4 Combined results for edge enhancing

A.6.2. Neighbourhood Formation

As with edge enhancing, simulations were performed using several datasets. Both

unimodal and bimodal curves were included, changing size and position of one or both

maxima. In total, this came to 25 input patterns with 4 different characteristics. Once

again, in all cases good graceful degradation existed.

(Graph panel titles: (a) Combined results (E=0.3, I=0.6), (b) Ranging over E=0.3, I=0.6-0.9; (a) Combined results (E=0.6, I=0.4), (b) Ranging over all datasets.)


As might be expected with only a minor difference between the configuration of a

lateral interaction network for edge enhancing and neighbourhood formation, very

similar results to those above were obtained. The fault tolerance exhibited was

independent of the characteristics of the data processed. Also, similarity existed

between results over a range of excitatory/inhibitory values. As above, the maximum

probability of failure over all datasets is given in graph A.5. From this, it is concluded that a lateral interaction range of 5 will lead to good fault tolerance being exhibited.

Graph A.5 Combined results for neighbourhood formation: (a) Combined results (E=0.6, I=0.4); (b) Ranging over all datasets

Graph A.6 Combined results for best-match: (a) Best-match (E=0.2, I=0.6); (b) Ranging over E=0.2, I=0.3-0.6

A slightly different network configuration was tested in which the ratio of excitation to

inhibition was less than 1, unlike in the previous neighbourhood formation simulations.

This was designed to choose only the maximally active region of input, i.e. best-match.

It was tested on the dataset which contained bimodal input patterns with differing sized



maxima. Results are shown in graph A.6. It is again evident that the functionality is not

influenced greatly by the particular choice of excitation/inhibition weight values.

A.7. Conclusions

The effect of faults on lateral interaction networks functioning both as an edge

enhancing system and a clustering system has been investigated. Results show that the

change in behaviour due to faults is not influenced by the type of data processed.

Similarly the operational quality is not altered drastically over a range of values for the

excitation and inhibition weights. Selection of the interaction range appears to be the

most critical element in designing a fault tolerant system, though here again results have

shown that some flexibility exists.

The similarity of behaviour over a wide range of parameters and data presented

suggests that lateral interaction networks provide a system that is robust against faults and also

external noise. Also, graceful degradation is exhibited as the level of faults increases.


APPENDIX B

Glossary

This appendix presents an extended glossary of terms used in this thesis.

Activation This is the internal state of a unit in a neural network formed

from its combined weighted inputs before thresholding is

applied.

Classification Inputs are labelled as belonging to one of a discrete number

of types or classes. A null class is sometimes included to

represent unknown inputs.

Computation Fault

Tolerance

Resistance of the computation performed by the abstracted system to the effect of faults.

Data Representation The format of external inputs presented to a neural network.

Distribution Refers to holistic nature of a neural network's operation and

information storage. All components of a neural network are

involved during training and operation. No part of a neural

network's function can be attributed to a local region in its

architecture.


Epoch Completion of a training pass. Often applied to supervised

training to refer to presentation of entire training set.

Error An internal result in a system's computation which is likely to

lead to failure.

Failure Event that the operation of a system no longer meets its

required specification.

Failure Rate Constant applied to components describing the rate at which

they become defective. Defined as the number of components

failing from time t0 to t1 relative to the original number of

surviving components at time t0.

Faults The cause of an error. For example, defects occurring in a

system's components, erroneous inputs, design inaccuracies.

Fault Model Abstractly describes the effect of physical defects on a

system's operation.

Fault Tolerance A technique used to increase the reliability of a system by

imbuing it with resilience to the effect of faults occurring.

Feedback Neural

Network

A neural network which has loops in its connectivity, i.e.

internal feedback exists.

Feedforward Neural

Network

A neural network which has no loops in its connectivity, i.e.

no internal feedback exists.


Function

Approximation

Type of problem given to neural network. Requires it to learn

a continuous or discrete mapping between two vector spaces.

Generalisation Refers to the ability of a neural network to produce a sensible

output for an input which did not occur during training.

Graceful Degradation Property of a system to deliver useful service in the presence

of faults.

Hidden Units Processing units in a neural network that are fed only by, and feed only, other processing units within the neural network.

Internal

Representation

The representation of a problem formed in the hidden units of

a neural network during learning. Points in the hidden unit

space are mapped from points in the input space.

Learning Algorithm Constructive method by which the free parameters in a neural

network can be changed so that it solves a given problem.

Modular Neural

Networks

Systems composed of several smaller neural networks which

are individually trained and operated.

N-Modular

Redundancy

A fault tolerance mechanism which duplicates a sub-

system N times, and then takes a majority vote to determine

the final output.

Neural Network A large number of simple processing elements with complex

interconnectivity. Has basis in biological neural systems.


Recurrent Neural

Network

See Feedback Neural Network.

Redundancy Spare capacity in a system either to actively or passively

broaden computational load. When applied to data it allows

reconstruction of damaged entries. In duplicating sub-system

processing, overall system operation no longer requires the

correct operation of all system components. Often introduced

to a system by fault tolerance techniques.

Reliability Probability of a system operating correctly at time t.

Rigid vs. Soft

Problem

Classification of problems based on the degree of adjacency

existing in their solution space. Soft problems are described

by a large adjacency factor; rigid problems are not.

Squashing Function See Thresholding Function.

Thresholding

Function

Function applied to a unit's activation to form its output.

Limits absolute magnitude of a unit's activation.

Training Cycle See Epoch.

Training Set Set of input, output pairs used during the supervised training

of a neural network.


Tuple Unit A function taking an n-dimensional binary input i coding an integer a in the range [0, 2^n − 1], and producing a 2^n-dimensional binary output with a 1 in the position corresponding to a, and

0's in the remainder. For example, 010 maps to 00000100.

Uniform Information

Distribution

As for distribution of information, but with the computational load of the neural network's elements being equal.

Weight Scalar value associated with a connection between two units

which modifies the communicated data. Generally acts as a

multiplicative factor.


APPENDIX C

Data from ADAM Simulations

This appendix tabulates the experimental data from ADAM simulations described in

chapter 5.

Time Probability of failure for various numbers of 2-tuple units

2 4 6 11 21 31

Avg s.d. Avg s.d. Avg s.d. Avg s.d. Avg s.d. Avg s.d.

0 0.318 0.097 0.012 0.026 0.0020.01 0 0 0 0 0 0

1 0.552 0.099 0.126 0.06 0.03 0.038 0 0 0 0 0 0

2 0.82 0.128 0.36 0.085 0.116 0.057 0.012 0.026 0 0 0 0

3 0.934 0.068 0.622 0.106 0.278 0.086 0.058 0.0450 0 0 0

4 0.984 0.041 0.812 0.1160.52 0.095 0.128 0.052 0.018 0.032 0.006 0.017

5 0.998 0.027 0.932 0.0660.72 0.076 0.31 0.122 0.042 0.033 0.016 0.025

6 1 0.01 0.986 0.041 0.844 0.083 0.504 0.091 0.126 0.069 0.048 0.064

7 1 0 0.998 0.026 0.934 0.056 0.658 0.076 0.274 0.125 0.116 0.075

8 1 0.01 0.982 0.051 0.79 0.085 0.444 0.087 0.23 0.081

9 1 0 0.996 0.027 0.894 0.071 0.574 0.078 0.374 0.086

10 0.998 0.01 0.956 0.044 0.694 0.074 0.552 0.084

11 1 0.01 0.974 0.032 0.816 0.083 0.666 0.074

12 1 0 0.986 0.026 0.908 0.055 0.766 0.075

13 0.996 0.02 0.942 0.043 0.86 0.077

14 1 0.014 0.968 0.039 0.91 0.056

15 1 0 0.98 0.022 0.948 0.056

16 0.984 0.014 0.976 0.029

17 1 0 0.99 0.031

18 0.996 0.017

Table C.1 Probability of failure for various numbers of 2-tuple units


Time Probability of failure for various numbers of 3-tuple units

2 4 6 11 21 31

Avg s.d. Avg s.d. Avg s.d. Avg s.d. Avg s.d. Avg s.d.

0 0.32 0.138 0.02 0.05 0 0 0 0 0 0 0 0

1 0.468 0.092 0.1 0.082 0.036 0.064 0 0 0 0 0 0

2 0.668 0.108 0.28 0.122 0.108 0.061 0.012 0.033 0 0 0 0

3 0.844 0.123 0.472 0.1440.24 0.107 0.044 0.069 0 0 0 0

4 0.932 0.078 0.676 0.124 0.408 0.1140.08 0.064 0.004 0.02 0 0

5 0.98 0.071 0.824 0.096 0.58 0.134 0.124 0.065 0.024 0.041 0 0

6 0.996 0.037 0.872 0.077 0.744 0.119 0.232 0.0950.08 0.065 0.012 0.033

7 1 0.02 0.92 0.065 0.86 0.103 0.336 0.089 0.128 0.077 0.024 0.033

8 1 0 0.972 0.065 0.92 0.076 0.508 0.102 0.248 0.1 0.06 0.076

9 0.988 0.037 0.988 0.09 0.644 0.108 0.344 0.098 0.1280.09

10 1 0.033 0.992 0.02 0.764 0.129 0.44 0.084 0.196 0.08

11 1 0 0.996 0.02 0.856 0.1 0.568 0.106 0.316 0.108

12 0.996 0 0.92 0.07 0.7 0.125 0.412 0.073

13 0.944 0.052 0.784 0.085 0.508 0.106

14 0.96 0.037 0.848 0.07 0.616 0.108

15 0.976 0.037 0.904 0.065 0.6840.08

16 0.992 0.037 0.936 0.056 0.784 0.096

17 0.996 0.02 0.952 0.037 0.876 0.1

18 1 0.02 0.968 0.037 0.912 0.064

19 0.972 0.02 0.936 0.052

20 0.976 0.02 0.96 0.052

21 0.98 0.02 0.984 0.052

22 0.984 0.02 0.992 0.028

23 0.992 0.028 0.996 0.02

24 0.992 0 0.996 0

25 0.996 0.02 0.996 0

26 0.996 0 0.996 0

27 0.996 0

Table C.2 Probability of failure for various numbers of 3-tuple units


Time Probability of failure for various numbers of 4-tuple units

2 4 6 11 21 31

Avg s.d. Avg s.d. Avg s.d. Avg s.d. Avg s.d. Avg s.d.

0 0.336 0.236 0.056 0.108 0 0 0.008 0.04 0 0 0 0

1 0.496 0.153 0.12 0.125 0.04 0.082 0.024 0.055 0 0 0 0

2 0.632 0.111 0.28 0.183 0.08 0.082 0.048 0.066 0 0 0 0

3 0.816 0.162 0.44 0.183 0.152 0.128 0.072 0.066 0.0080.04 0 0

4 0.912 0.131 0.6 0.173 0.328 0.133 0.136 0.095 0.0080 0 0

5 0.936 0.066 0.736 0.138 0.4960.17 0.192 0.092 0.024 0.055 0.016 0.055

6 1 0.095 0.84 0.117 0.632 0.138 0.312 0.129 0.056 0.075 0.032 0.055

7 1 0 0.92 0.141 0.72 0.154 0.4 0.117 0.08 0.066 0.056 0.066

8 0.96 0.082 0.824 0.131 0.5120.13 0.12 0.082 0.08 0.088

9 0.968 0.04 0.912 0.101 0.64 0.199 0.184 0.138 0.112 0.075

10 0.992 0.066 0.944 0.075 0.704 0.111 0.272 0.117 0.176 0.095

11 1 0.04 0.976 0.075 0.784 0.115 0.376 0.117 0.232 0.108

12 1 0 0.992 0.055 0.864 0.115 0.496 0.163 0.328 0.131

13 0.992 0 0.912 0.119 0.576 0.115 0.384 0.108

14 0.992 0 0.928 0.055 0.656 0.1 0.48 0.143

15 0.992 0 0.976 0.087 0.752 0.131 0.584 0.131

16 0.984 0.04 0.832 0.129 0.632 0.105

17 0.984 0 0.88 0.087 0.696 0.095

18 0.984 0 0.92 0.1 0.784 0.13

19 0.992 0.04 0.96 0.082 0.832 0.133

20 1 0.04 0.968 0.04 0.856 0.066

21 0.984 0.055 0.88 0.066

22 1 0.055 0.896 0.055

23 0.912 0.055

24 0.936 0.088

25 0.968 0.075

26 0.976 0.04

27 0.984 0.04

28 0.984 0

29 0.992 0.04

30 0.992 0

31 0.992 0

32 1 0.04

Table C.3 Probability of failure for various numbers of 4-tuple units


Time Probability of failure for various levels of memory saturation using

2-tuple units

0.08 0.14 0.19 0.26 0.3

Avg s.d. Avg s.d. Avg s.d. Avg s.d. Avg s.d.

0 0 0 0.016 0.08 0.011 0.057 0.032 0.063 0.047 0.084

1 0 0 0.024 0.04 0.04 0.058 0.088 0.092 0.1270.07

2 0 0 0.064 0.082 0.097 0.071 0.208 0.096 0.2670.11

3 0.013 0.067 0.096 0.075 0.166 0.084 0.356 0.136 0.483 0.125

4 0.04 0.092 0.12 0.066 0.274 0.103 0.5 0.123 0.66 0.097

5 0.08 0.111 0.2 0.115 0.371 0.122 0.684 0.131 0.777 0.076

6 0.093 0.067 0.296 0.131 0.5090.12 0.812 0.134 0.883 0.07

7 0.107 0.067 0.4 0.154 0.617 0.119 0.8760.07 0.937 0.058

8 0.2 0.181 0.488 0.101 0.709 0.108 0.932 0.077 0.973 0.054

9 0.24 0.111 0.584 0.117 0.76 0.081 0.96 0.061 0.983 0.028

10 0.32 0.199 0.656 0.14 0.851 0.123 0.98 0.05 0.997 0.031

11 0.453 0.167 0.736 0.115 0.909 0.082 0.996 0.0371 0.017

12 0.52 0.136 0.824 0.13 0.949 0.077 0.996 0

13 0.613 0.181 0.88 0.092 0.977 0.058 0.996 0

14 0.747 0.236 0.928 0.087 0.994 0.063 1 0.02

15 0.84 0.153 0.944 0.055 0.994 0

16 0.853 0.067 0.968 0.088 0.994 0

17 0.867 0.067 0.992 0.066 1 0.029

18 0.893 0.092 0.992 0

19 0.907 0.067 1 0.04

20 0.973 0.136

21 0.973 0

22 0.987 0.067

23 1 0.067

Table C.4 Probability of failure for various levels of memory

saturation using 2-tuple units


Time Probability of failure for various levels of memory saturation using

3-tuple units

0.07 0.14 0.2 0.26 0.31

Avg s.d. Avg s.d. Avg s.d. Avg s.d. Avg s.d.

0 0 0 0 0 0 0 0.002 0.01 0.006 0.019

1 0 0 0 0 0.013 0.033 0.05 0.051 0.099 0.064

2 0 0 0.024 0.052 0.067 0.054 0.168 0.0910.32 0.119

3 0 0 0.052 0.046 0.184 0.085 0.368 0.076 0.573 0.097

4 0.016 0.055 0.124 0.084 0.323 0.086 0.538 0.0660.75 0.064

5 0.032 0.055 0.22 0.089 0.456 0.102 0.71 0.089 0.859 0.065

6 0.08 0.133 0.372 0.116 0.635 0.0880.83 0.065 0.946 0.065

7 0.152 0.114 0.532 0.1 0.763 0.084 0.92 0.048 0.99 0.037

8 0.168 0.055 0.656 0.105 0.872 0.081 0.972 0.042 0.997 0.015

9 0.256 0.13 0.776 0.087 0.923 0.059 0.992 0.032 1 0.011

10 0.36 0.117 0.86 0.062 0.957 0.044 0.996 0.014

11 0.512 0.145 0.932 0.089 0.971 0.027 1 0.014

12 0.68 0.138 0.96 0.046 0.992 0.032

13 0.8 0.163 0.972 0.033 0.997 0.018

14 0.848 0.087 0.98 0.028 1 0.013

15 0.904 0.108 0.992 0.033

16 0.936 0.075 0.996 0.02

17 0.968 0.075 0.996 0

18 0.976 0.04 1 0.02

19 1 0.066

Table C.5 Probability of failure for various levels of memory

saturation using 3-tuple units


Time Probability of failure for various levels of memory saturation using

4-tuple units

0.07 0.14 0.2 0.26 0.31

Avg s.d. Avg s.d. Avg s.d. Avg s.d. Avg s.d.

0 0 0 0 0 0 0 0 0 0.001 0.004

1 0 0 0.004 0.014 0.017 0.024 0.066 0.041 0.107 0.056

2 0 0 0.02 0.024 0.103 0.065 0.224 0.066 0.349 0.087

3 0.012 0.033 0.082 0.071 0.257 0.079 0.479 0.068 0.661 0.063

4 0.04 0.054 0.184 0.074 0.499 0.083 0.729 0.067 0.872 0.074

5 0.068 0.046 0.358 0.088 0.719 0.066 0.896 0.068 0.964 0.041

6 0.172 0.068 0.552 0.085 0.865 0.071 0.958 0.055 0.991 0.027

7 0.28 0.108 0.708 0.077 0.936 0.051 0.985 0.031 0.996 0.018

8 0.428 0.105 0.834 0.075 0.968 0.024 0.996 0.021 0.999 0.007

9 0.584 0.096 0.92 0.053 0.995 0.032 0.998 0.007 1 0.004

10 0.76 0.097 0.962 0.047 0.997 0.009 0.999 0.005

11 0.888 0.098 0.99 0.043 1 0.009 1 0.005

12 0.94 0.065 0.992 0.01

13 0.96 0.041 0.998 0.017

14 0.984 0.044 1 0.01

15 0.992 0.028

16 1 0.028

Table C.6 Probability of failure for various levels of memory

saturation using 4-tuple units


REFERENCES

1. Beale, R. and Jackson, T., Neural Computing: An Introduction, IOP Publishing

(1990).

2. Lippmann, R.P., "An introduction to computing with neural nets", IEEE

Acoustics Speech Signal Processing Magazine 4, pp.4-22 (1987).

3. Khanna, T., Foundation of Neural Networks, Addison-Wesley (1990).

4. Lala, P.K., Fault Tolerant and Fault Testable Hardware Design, (1985).

5. Kaufmann, A., Reliability - A Mathematical Approach, Transworld Publishers,

London (1972).

6. Anderson, T. and Lee, P.A., Fault Tolerance, principles and practice,

Prentice-Hall International (1981).

7. Amit, D.J. and Gutfreund, H., "Statistical Mechanics of Neural Networks near

Saturation", Annals of Physics 173, pp.30-67 (1987).

8. Anderson, J.A., "Cognitive and Psychological Computation with Neural Models",

IEEE Trans Systems Man and Cybernetics SMC-13, pp.799-815 (1983).

9. Baum, E.B., Moody, J. and Wilczek, F., "Internal Representations for Associative

Memory", Biological Cybernetics 59, pp.217-228 (1988).

10. Bruce, A., Canning, A., Forrest, B., Gardner, E. and Wallace, D.J., "Learning and Memory Properties in Fully Connected Networks", AIP Conference Proceedings 151, pp.65-70 (1986).

11. Fogelman-Soulie, F., Gallinari, P., Le Cun, Y. and Thiria, S., "Evaluation of

network architectures on test learning tasks", Proceedings of the first IEEE

International Conference on Neural Networks, San-Diego II , pp.653-660 (1987).

12. Cannon, S.C., Robinson, D.A. and Shamma, S., "A Proposed Neural Network for

the Integrator of the Oculomotor System", Biological Cybernetics 49, pp.127-36

(1983).


13. Hopfield, J.J., "Neural networks and physical systems with emergent collective

computational abilities", Proceedings of the National Academy of Sciences, USA

79, pp.2554-8 (1982).

14. Kung, S.Y., "Parallel Architectures for Artificial Neural Nets", Proc.

International Conference on Systolic Arrays, pp.163-74 (1988).

15. Legendy, C.R., "On the Scheme by Which the Human Brain Stores Information",

Mathematical Biosciences 1, pp.555-97 (1967).

16. Char, J.M., Cherkassy, V., Wechsler, H. and Zimmerman, G.L., "Distributed and

fault-tolerant computation for retrieval tasks using distributed associative

memories", IEEE Transactions on Computers A15(4), pp.484-90 (April 1988).

17. Worden, S.J. and Womack, B.F., "Analysis of small compacta networks",

Proceedings of 1986 IEEE Conference on Systems, Man and Cybernetics, pp.61-4

(1986).

18. Zhou, Y.T., Chellappa, R. and Jenkins, B.K., "A Novel Approach to Image

Restoration Based on a Neural Network", Proceedings of the IEEE First Annual

International Conference on Neural Networks 4, pp. 269-76 (1987).

19. Carter, M.J., "The 'Illusion' of Fault Tolerance in Neural Networks for Pattern

Recognition and Signal Processing", Proc. Technical Session on Fault-Tolerant

Integrated Systems, Durham NH: University of New Hampshire (1988).

20. Bedworth, M.D. and Lowe, D., Fault Tolerance in Multi-Layer Perceptrons: a

preliminary study, RSRE: Pattern Processing and Machine Intelligence Division

(July 1988).

21. Rumelhart, D.E., Hinton, G.E. and Williams, R.J., "Learning Internal

Representations by Error Propagation" pp. 318-362 in Parallel Distributed

Processing, ed. Rumelhart, D.E. and McClelland, J.L. (Eds), MIT Press (1986).

22. Belfore, L.A. and Johnson, B.W., "The fault-tolerance of neural networks", The

International Journal of Neural Networks Research and Applications 1, pp.24-41

(Jan 1989).

23. Warkowski, F., Leenstra, J., Nijhuis, J. and Spaanenburg, L., "Issues in the Test

of Artificial Neural Networks", Digest ICCD '89, pp.487-490 (Oct 1989).

24. Hinton, G.E. and Shallice, T., "Lesioning an Attractor Network: Investigations of

Acquired Dyslexia", Psychological Review 98(1), pp.74-94 (1991).


25. Carter, M.J., Rudolph, F. and Nucci, A., "Operational Fault Tolerance of CMAC

Networks", NIPS-90, Denver, Morgan Kaufmann (1990).

26. Segee, B.E. and Carter, M.J., "Comparative Fault Tolerance of Parallel

Distributed Processing Networks (Debunking the Myth of Inherent Fault

Tolerance)", Intelligent Structures Group Report ECE.IS.92.07 (1992).

27. Protzel, P.W., Palumbo, D.L. and Arras, M.K., "Performance and

Fault-Tolerance of Neural Networks for Optimization", ICASE Report No. 91-45,

NASA Langley Research Centre (1991).

28. Neti, C., Schneider, M.H. and Young, E.D., "Maximally fault-tolerant neural

networks and nonlinear programming", Proceedings of IJCNN-90, San Diego 2,

pp.483-496 (June 1990).

29. Bugmann, G., Sojka, P., Reiss, M., Plumbley, M. and Taylor, J.G., "Direct

Approaches to Improving the Robustness of Multilayer Neural Networks",

Proceedings of the International Conference on Artificial Neural Networks,

Brighton UK (1992).

30. Lansner, A. and Ekeburg, O., "Reliability and Speed of Recall in an Associative

Network", IEEE Trans Pattern Analysis and Machine Intelligence PAMI-7(1985).

31. Nijhuis, J.A.G. and Spaanenburg, L., "Fault tolerance of neural associative

memories", IEE Proceedings 136-E(5), pp.389-394 (Sept 1989).

32. Heng-Ming, T., "Fault Tolerance in Neural Networks", WNN-AIND-90, pp.59

(Feb 1990).

33. Damarla, T.R. and Bhagat, P.K., "Fault Tolerance in Neural Networks",

Southeastcon '89 Proceedings: Energy and Information Technologies in the S.E.

1, pp.328-31 (1989).

34. Prater, J.S. and Morley Jr., R.E., "Characterization of Fault Tolerance in

Feedforward Neural Networks", submitted to IEEE Transactions on Neural

Networks, in review.

35. May, N. and Hammerstrom, D., "Fault Simulation of a Wafer-Scale Integrated

Neural Network", Abstracts of the First INNS Meeting, Boston, pp.393 (1988).

36. Moore, W.R., "Conventional Fault-Tolerance and Neural Computers" pp. 29-37

in Neural Computers, ed. C von der Malsburg, Berlin: Springer-Verlag (1988).


37. von Seelen, W. and Mallot, H.A., "Parallelism and Redundancy in Neural

Networks" pp. 50-60 in Neural Computers, ed. C von der Malsburg, Berlin:

Springer-Verlag (1988).

38. McCulloch, W.S., "The Reliability of Biological Systems", Self-Organizing

Systems, pp.264-281 (1959).

39. von Neumann, J., "Probabilistic Logics and the Synthesis of Reliable Components

from Unreliable Elements" pp.43-98 in Automata Studies, ed. Shannon, C.E. and

McCarthy, J., Princeton University Press (1956).

40. Izui, Y. and Pentland, A., "Analysis of Neural Networks with Redundancy",

Neural Computation 2(2), pp.226-238 (Summer 1990).

41. Clay, R.D. and Sequin, C.H., "Limiting Fault-Induced Output Errors in ANN's",

IJCNN-91, Seattle, supplementary poster session (1991).

42. Lincoln, W. and Skrzypek, J., "Synergy of Clustering Multiple Back Propagation

Networks", Proceedings of NIPS-89, pp.650-657 (1989).

43. Chu, L. and Wah, B.W., "Fault Tolerant Neural Networks with Hybrid

Redundancy", IJCNN-90, San Diego 2, pp.639-649 (1990).

44. Distante, F., Sami, M.G., Stefanelli, R. and Gajani, G.S., "Fault-Tolerance

Aspects in Silicon Structures for Neural Networks", NIMES-90, pp.284-295

(1990).

45. Fernandes, P.M.L. and Silva, K.M.C., "Nerve cell soma model with high

reliability and low power consumption", Med. & Biol. Eng. & Comput. 18,

pp.261-264 (1980).

46. Biswas, S. and Venkatesh, S.S., "The Devil and the Network: What Sparsity

Implies to Robustness and Memory", NIPS-3, pp.883-889 (1991).

47. Austin, J., "ADAM: A Distributed Associative Memory For Scene Analysis" pp.

285 in Proceedings of first international conference on neural networks, ed.

M.Caudill, C.Butler, IEEE, San Diego (June, 1987).

48. Anderson, J., "Neural models with cognitive implications." pp. 27-90 in Basic

processes in reading perception and comprehension, ed. D. LaBerge and S. J.

Samuels, Erlbaum (1977).

49. Wood, C., "Implications of simulated lesion experiments for the interpretation of

lesions in real nervous systems" in Neural Models of Language Processes, ed.

Arbib, M.A., Caplan, D. and Marshall, J.C., New York: Academic (1983).


50. Venkatesh, S.S., "Epsilon Capacity of Neural Networks", AIP Conference

Proceedings 151, pp.440-445 (1986).

51. Tanaka, H., Matsuda, S. and Ogi, H., "Redundant Coding for Fault Tolerant

Computing on Hopfield Network", Abstracts of the First Annual INNS Meeting,

Boston, pp.141 (1988).

52. Miikkulainen, R. and Dyer, M., "Encoding Input/Output Representations in

Connectionist Cognitive Systems", 1988 Connectionist Models Summer School,

Carnegie-Mellon University, Morgan Kaufmann (1988).

53. Takeda, M. and Goodman, J.W., "Neural Networks for computation: number

representations and programming complexity", Applied Optics 25 (1986).

54. Hancock, P., "Data representation in neural nets: an empirical study", 1988

Connectionist Models Summer School, Carnegie-Mellon University, Morgan

Kaufmann (1988).

55. Abu-Mostafa, Y.S., "Neural Networks for Computing?", AIP Conference

Proceedings 151, pp.1-7 (1986).

56. Abu-Mostafa, Y.S., "Complexity of random problems" in Complexity in

Information Theory, Springer-Verlag (1986).

57. Hartley, R. and Szu, H., "A Comparison of the Computational Power of Neural

Network Models", Proceedings of the first IEEE International Conference on

Neural Networks, San-Diego 3, pp.15-22 (1987).

58. Baum, E.B. and Haussler, D., "What Size Net gives Valid Generalization?",

NIPS-89, Denver, Morgan Kaufmann (1987).

59. Vapnik, V.N. and Chervonenkis, A., "On the uniform convergence of relative

frequencies of events to their probabilities", Theory Prob. Appl. 16, pp.264-280

(1971).

60. Segee, B.E. and Carter, M.J., "Fault Tolerance of Pruned Multilayer Networks",

IJCNN-91, Seattle 2, pp.447-452 (1991).

61. Krauth, W., Mezard, M. and Nadal, J.P., "Basins of Attraction in a

Perceptron-Like Neural Network", Complex Systems 2, pp.387-408 (1988).

62. McCulloch, W.S. and Pitts, W., "A logical calculus of the ideas immanent in

nervous activity", Bulletin of Mathematical Biophysics 5, pp.115-133 (1943).

63. Rosenblatt, F., Principles of Neurodynamics, (1962).


64. Minsky, M. and Papert, S., Perceptrons: An introduction to computational

geometry, MIT Press (1969).

65. Holt, J.L. and Hwang, J., "Finite Precision Error Analysis of Neural Network

Hardware Implementations", FT-10, Dept. of Elect. Engr., University of

Washington (1990).

66. Pemberton, J.C. and Vidal, J.J., "The effect of training signal errors on node

learning", Technical Report: CSD-890041, University of California (1989).

67. Hodges, R.E. and Wu, C., "The Neural Network Self-Healing Process by using a

Reconstructed Sample Space", WNN-AIND-90, pp.65 (1990).

68. Petsche, T. and Dickinson, B.W., "Trellis Codes, Receptive Fields, and Fault

Tolerant, Self-Repairing Neural Networks", IEEE Transactions on Neural

Networks 1 (2), pp.154-166 (1990).

69. Pons, T.P., Garraghty, P.E., Ommaya, A.K., Kaas, J.H., Taub, E. and Mishkin,

M., "Massive Cortical Reorganization After Sensory Deafferentation in Adult

Macaques", Science 252, pp.1857-1860 (1991).

70. Tanaka, H., "A Study of a High Reliable System against Electric Noises and

Element Failures", Proceedings of the 1989 International Symposium on Noise

and Clutter Rejection in Radars and Imaging Sensors, pp.415-20 (1989).

71. Sequin, C. and Clay, D., "Fault-Tolerance in Artificial Neural Networks", Proc.

IJCNN 90, San Diego 1, pp.703-708 (June 1990).

72. Plaut, D.C., "Connectionist Neuropsychology: The Breakdown and Recovery of

Behaviour in Lesioned Attractor Networks", Thesis Summary, (1991).

73. Brause, R., "Fault Tolerance in Neural Network Associative Memory", Technical

Report, Johann Wolfgang Goethe University (1989).

74. Palumbo, D., "Assessing the Fault Tolerance of Neural Networks",

WNN-AIND-90, pp.3 (Feb 1990).

75. Sivilotti, M.A., Emerling, M.R. and Mead, C.A., "VLSI Architectures for

Implementation of Neural Networks", AIP Conference Proceedings 151,

pp.408-413 (1986).

76. McEliece, R.J., Posner, E., Rodemich, E. and Venkatesh, S., "The Capacity of the

Hopfield Associative Memory", IEEE Trans. Info. Theory IT-33, pp.461-82

(1987).


77. Protzel, P.W., "Comparative Performance Measure for Neural Networks Solving

Optimization Problems", IJCNN-90, Washington DC (1990).

78. Protzel, P.W. and Arras, M.K., "Fault-Tolerance of Optimization Networks:

Treating Faults as Additional Constraints", IJCNN-90, Washington DC (1990).

79. Tesauro, G. and Sejnowski, T.J., "A Parallel Network that Learns to Play

Backgammon", Technical Report CCSR-88-2, Center for Complex Systems

Research, University of Illinois (1988).

80. Scalettar, R. and Zee, A., "A feed-forward memory with decay", Institute for

Theoretical Physics preprint: NSF-ITP-86-118 (1986).

81. Stevenson, M., Winter, R. and Widrow, B., "Sensitivity of Feedforward Neural

Networks to Weight Errors", IEEE. Trans. on Neural Networks 1(1), pp.71-80

(March 1990).

82. Widrow, B., "Generalisation and information storage in networks of Adaline

'neurons'" pp.435-461 in Self-Organizing Systems, ed. M.C. Yovitz, G.T. Jacobi

and G.D. Goldstein, Washington, DC: Spartan Books (1962).

83. Zymslowski, W., "Some problems of sensitivity of neuronal nets to variations of

parameters of their elements", IFAC Symposium on Automatic Control and

Computers in the Medical Field, pp.133-7 (1971).

84. Dzwonczyk, M.J., "Quantitative Failure Models of Feed-Forward Neural

Networks", CSDL-T-1068, M.Sc. Thesis, Massachusetts Institute of Technology

(1991).

85. Specht, D.F., "Probabilistic Neural Networks", Neural Networks 3, pp. 109-118

(1990).

86. Albus, J.S., "A new approach to manipulator control: the Cerebellar Model

Articulation Controller (CMAC)", Trans. ASME-J. Dynamic Syst., Meas., Contr.

97, pp.220-7 (1975).

87. Murray, A.F. and Edwards, P.J., "Enhanced MLP Performance and Fault

Tolerance Resulting from Synaptic Weight Noise During Training", submitted to

IEEE Transactions on Neural Networks, (July 1992).

88. Prater, J.S. and Morley Jr., R.E., "Improving Fault Tolerance in Feedforward

Neural Networks", submitted to IEEE Transactions on Neural Networks, in

review.


89. Abu-Mostafa, Y., "Learning from Hints in Neural Networks", Journal of

Complexity 6, pp.192-198 (1990).

90. Judd, S.J., "Neural Network Design and the Complexity of Learning",

Caltech-CS-TR-88-20, California Institute of Technology (Sep 88).

91. Kohonen, T., "Self-organized formation of topologically correct feature maps",

Biological Cybernetics 43, pp.59-69 (1982).

92. Abu-Mostafa, Y., "The Vapnik-Chervonenkis Dimension: Information versus

Complexity in Learning", Neural Computation 1, pp.312-317 (1989).

93. Valiant, L.G., "A theory of the learnable", Commun. ACM 27, pp.1134-1142

(1984).

94. Poggio, T. and Girosi, F., "Networks for Approximation and Learning",

Proceedings of the IEEE 78, pp.1481-1497 (1990).

95. Martinetz, T., Ritter, H. and Schulten, K., "Learning of Visuomotor-Coordination

of a Robot Arm with Redundant Degrees of Freedom" pp. 431--434 in Parallel

Processing in Neural Systems and Computers, ed. G. Hauske (1990).

96. Clay, R.D. and Sequin, C.H., "Fault Tolerance Training Improves Generalisation

and Robustness", IJCNN-92, Baltimore 1, pp.769-774 (1992).

97. Le Cun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.

and Jackel, L.D., "Handwritten Digit Recognition with a Back-Propagation

Network", Proceedings of NIPS-89, pp.396-404 (1989).

98. Grossberg, S., Neural networks and natural intelligence, (1989).

99. Sejnowski, T.J. and Rosenberg, C., "NetTalk: A Parallel Network that Learns to

Read Aloud", Johns Hopkins University (1988).

100. Gorman, R.P. and Sejnowski, T.J., "Analysis of hidden units in a layered network

trained to classify sonar targets", Neural Networks 1, pp.75-89 (1988).

101. Nguyen, D. and Widrow, B., "The Truck Backer-Upper: An Example of

Self-Learning in Neural Networks", Proceedings of the International Joint

Conference on Neural Networks 2, pp.357-363 (June 1989).

102. Bolt, G.R., "Fault Tolerance and Robustness in Neural Networks", IJCNN-91,

Seattle 2, pp.A-986 (July 1991).

103. Hayes, J.P., Computer Architecture and Organization, McGraw-Hill (1985).


104. Maestri, G., "The retryable processor", AFIPS, Fall Joint Computer Conference

41(1), pp.273 - 277 (1972).

105. Kauffman, S.A., "Metabolic stability and epigenesis in randomly connected

genetic nets", Journal of Theoretical Biology 22, pp.437-467 (1969).

106. Brause, R., "Pattern Recognition and Fault Tolerance in Non-Linear Neural

Networks", Abstracts of the First Annual INNS Meeting, Boston 1, pp.13 (1988).

107. Kohonen, T., "Analysis of a simple self organizing process", Biological

Cybernetics 44, pp.135-140 (1982).

108. Lehky, S.R. and Sejnowski, T.J., "Network model of shape-from-shading: neural

function arises from both receptive and projective fields", Nature 333, pp.452-454

(1988).

109. Bolt, G.R., "Fault Tolerance of Lateral Interaction Networks", IJCNN-91,

Singapore 2, pp.1373-1378 (November 1991).

110. Ammann, P.E. and Knight, J.C., "Data Diversity: An Approach to Software Fault

Tolerance", IEEE Transactions on Computers 37(4), pp.418-425 (April 1988).

111. Bolt, G.R., "Fault Models for Artificial Neural Networks", IJCNN-91, Singapore

3, pp.1918-1923 (November 1991).

112. Bolt, G.R., "Assessing the Reliability of Artificial Neural Networks", IJCNN-91,

Singapore 1, pp.578-583 (November 1991).

113. Willshaw, D.J., Buneman, D.P. and Longuet-Higgins, H.C., "Non-holographic

associative memory", Nature 222, pp.960-962 (1969).

114. Stonham, J., "Practical Pattern Recognition" pp. 231-272 in Advanced Digital

Information Systems, ed. I. Aleksander, Prentice Hall International (1985).

115. Bolt, G.R., Austin, J. and Morgan, G., "Operational Fault Tolerance of the

ADAM Neural Network System", IEE 2nd Int. Conf. Artificial Neural Networks,

Bournemouth, pp.285-289 (November 1991).

116. Bolt, G.R., Austin, J. and Morgan, G., "Uniform Tuple Storage", Pattern

Recognition Letters 13, pp.339-344 (May 1992).

117. Werbos, P.J., "Beyond regression: New tools for prediction and analysis in the

behavioural sciences", PhD Thesis, Harvard University, Cambridge (1974).


118. von der Malsburg, C., "Self-Organization of Orientation Sensitive Cells in the

Striate Cortex", Kybernetik 14, pp.85-100 (1973).

119. Bolt, G.R., Austin, J. and Morgan, G., "Fault Tolerant Multi-Layer Perceptrons",

YCS 180, Dept. of Computer Science, University of York (1992).

120. Barto, A.G., Sutton, R.S. and Anderson, C.W., "Neuronlike elements that solve

difficult learning control problems", IEEE Transactions on Systems Man and

Cybernetics SMC-13, pp.834-846 (1983).

121. Hartline, H.K. and Ratliff, F., "Inhibitory interaction of receptor units in the eye

of the Limulus", J.Gen. Physiol. 40, pp.357-376 (1959).
