Application Aware Functional Safety
Analysis Techniques
A Thesis Submitted
for the Degree of Doctor of Philosophy
in the Faculty of Engineering
by
Prasanth V
Electrical Communication Engineering
Indian Institute of Science
Bangalore – 560 012
November 28, 2019
© Copyright by Prasanth V, 2019
All rights reserved
Abstract
Integrated Circuits (ICs) are used to realize a multitude of real-life systems. These
real-life systems have ICs interacting with physical systems (this combination being
referred to as hybrid systems), and many of them are used in safety critical applications. The
implications of a fault in any of the constituent components of the system must be
analysed and appropriately addressed to mitigate its potentially dangerous after-effects.
Given the increasing dependence on ICs to meet the functional requirements of safety
critical applications, safety analysis of ICs plays an important role in ensuring safety of
the application performed by the system.
When it comes to designing hybrid systems, we are gradually moving away from
the paradigm of independently designing the digital and physical parts of hybrid systems
towards simultaneous considerations for both. This helps in providing an optimal system
design solution. However, the same does not hold true when it comes to safety analysis.
Today, safety analysis of ICs used in such systems is typically done in isolation of the
end application and associated physical system due to practical considerations like safety
analysis complexity, lack of a proper physical system model, etc. This results in the need
to take recourse to conservative design techniques incorporating costly redundancy.
Many hybrid systems have an acceptable tolerance determined by the application
due to the inertial nature of the physical system, error tolerance capability in closed loop
applications, built-in hardware and software functionality, etc. These tolerances can be
beneficially employed to reduce the hardware overhead required to implement safety. In
this thesis, we investigate the problem of building affordably robust soft-error resilient
systems based upon flip-flop protection. We develop methods to identify the minimal set
of critical flip-flops which must be protected in an integrated circuit, keeping in mind the
inherent tolerances (resiliency) of the system into which it is incorporated.
This thesis first proposes a set of techniques to map tolerances available at
application level, to the individual circuit modules of the IC, to make the safety analysis
less pessimistic. The circuit modules of the IC can then be analysed standalone using the
mapped tolerances to reduce the analysis pessimism.
Fault injection is the preferred technique used to ascertain the safety worthiness
(robustness) of the circuit. We then look at the limitations of fault injection based safety
analysis and propose the use of formal techniques to ensure a comprehensive analysis.
We show how the input constraints and output tolerances can be modelled in a formal
verification framework to reduce the analysis pessimism. We also propose a technique to
enhance the workload used for fault injection to provide a more comprehensive
functional safety analysis for larger modules which cannot be handled using formal
techniques.
Traditionally, protection for critical flip-flops is offered with the help of application
agnostic hardware or software based techniques. However, these may not offer the most
cost optimal solution. The thesis also proposes two application based techniques to
protect critical flip-flops which are identified through the functional safety analysis
process. These techniques also consider the practical scenario where the
application needs to be rendered safe (also termed as “safed”), even when the underlying
components used to build the system are not robust.
In summary, this thesis provides a set of techniques to address the application level
functional safety requirements with lower cost and better robustness. More specifically, it
proposes a divide and conquer approach to make application level functional safety
analysis feasible, demonstrates application of formal methods, illustrates workload
augmentation based techniques to render the safety analysis comprehensive, and
addresses the requirements for the development of functional safety systems using
non-robust components.
Acknowledgements
This work, in its present form, would not have been possible without the kind
support and help from many individuals. I take this opportunity to express my gratitude
to the people who have been instrumental in the successful completion of this thesis.
I owe my sincere thanks to my advisors, Dr. Rubin Parekhji and Prof. Bharadwaj
Amrutur, without whom this work would not have been possible. Right from the moment
I expressed my wish to pursue a PhD, Dr. Rubin Parekhji has provided guidance,
encouragement and support. During the course of the work, we did explore different
adjacent topics and Dr. Parekhji always helped to steer it in the right direction. We had
long and continuous discussions on these which helped to shape the thesis the way it is
today. Prof. Bharadwaj Amrutur provided the motivation to go out of my comfort zone
and take an altogether new problem for the thesis work. He has constantly monitored the
progress and provided the right guidance in taking the research forward. This helped me
gain confidence in exploring the unknown and come up with good solutions.
I am thankful to all my teachers at IISc, especially in the ECE and DESE
departments, for the wonderful lectures I got to attend.
I owe my sincere thanks to Padmini Sampath and Sumedha Limaye for supporting
my wish to pursue higher studies and agreeing to sponsor me for the research program
(through Texas Instruments (India) Pvt. Ltd., Bangalore). I am grateful to my mentors at
Texas Instruments Shailesh Ghotgalkar, Jaya Singh, Venkatesh Natarajan, Srivaths Ravi
and Bharat Rajaram for supporting me and helping me at different times during the
course of my PhD. They have made sure that I have the required support to pursue
research and come out successful.
In addition to this, many colleagues from Texas Instruments and IISc have helped
me with the experiments, providing suggestions and directions. The list is long and, to
name a few, they are Abhishek, Amal, Arif, Ashish, Chakra, Han, Jeff, Kaustubh, Nidhin,
Pooja, Prashant, Prashanth, Richard, Rupin, Sai and Swathi. Given that so many people
have helped me during the course of my work, I might have missed one or two names. I
would like to sincerely thank all my colleagues who have helped me.
Every PhD student has a story to tell of personal sacrifices and the struggle to
maintain the balance between personal life and PhD work. Mine is similar, if not a bit
worse, due to the addition of office commitments. Maintaining a fair balance among
research work, office assignments and family life always proved tough. I have prioritized
the first two many times, and my wife and two daughters have borne the brunt of it.
They are looking forward to me submitting my thesis. However, they have been quite
supportive, understanding and modulating their expectations during this time. I would like to
thank my parents, in-laws, sister, wife, daughters, colleagues and friends for providing
their full support and encouragement for my PhD ambitions.
Publications Based on This Thesis
1. Prasanth V, Rubin Parekhji, Amrutur Bharadwaj, “Improved Methods for Accurate
Safety Analysis of Real-life Systems,” in IEEE Asian Test Symposium, 2015.
2. Prasanth V, Rubin Parekhji, Amrutur Bharadwaj, “Safety Analysis for Integrated
Circuits in the Context of Hybrid Systems,” IEEE International Test Conference,
2017. (Selected for Honourable Mention Award)
3. Prasanth V, Rubin Parekhji, Amrutur Bharadwaj, “Perturbation based Workload
Augmentation for Comprehensive Functional Safety Analysis”, International
Conference on VLSI Design, 2019.
Related Publication and Presentations
4. Prasanth V, Rubin Parekhji, “Low Overhead Design and Test Techniques for
Application Specific Functional Safety,” Innovative Practices Session, VLSI Test
Symposium, 2017.
5. Prasanth V, David Foley, Srivaths Ravi, “Demystifying Automotive Safety and
Security for Semiconductor Developer,” IEEE International Test Conference, 2017.
6. Prasanth V, Srivaths Ravi, “Safety and Security in Automotive 2.0 Era”, half day
tutorial, Design, Automation and Test in Europe, 2019.
7. Prasanth V, Srivaths Ravi, “Safety and Security in Automotive 2.0 Era”, half day
tutorial, IEEE Asian Test Symposium, 2019.
Keywords
Functional safety, application tolerance, transient faults, soft errors, value tolerance,
time tolerance.
Contents
Abstract ................................................................................................................................ i
Acknowledgements ............................................................................................................ iii
Publications Based on This Thesis ..................................................................................... v
Keywords ........................................................................................................................... vi
Contents ............................................................................................................................ vii
List of Tables ...................................................................................................................... x
List of Figures .................................................................................................................... xi
1. Introduction ................................................................................................................. 1
1.1 Evolution of Functional Safety Systems .............................................................. 2
1.2 Integrated Circuits Functional Safety Concerns................................................... 2
1.3 IC Functional Safety Research Challenges .......................................................... 4
1.4 Contributions of This Thesis ................................................................................ 5
1.5 Thesis Organization.............................................................................................. 6
2. Functional Safety of Integrated Circuits ..................................................................... 8
2.1 Application Case Study: EV Traction System ..................................................... 8
2.2 Functional Safety Standards ............................................................................... 11
2.2.1 Deriving Semiconductor Safety Requirements from End Application ...... 12
2.2.2 SEooC Design Process ............................................................................... 14
2.3 IC Design Evaluation for Safety ........................................................................ 16
2.3.1 Types of Failures ........................................................................................ 16
2.3.2 Circuit Failure Mode Analysis: Qualitative ............................................... 18
2.3.3 Circuit Failure Mode Analysis: Quantitative ............................................. 20
2.4 Protecting Against Systematic and Random Failures ........................................ 21
2.4.1 Robust Development Process ..................................................................... 21
2.4.2 Safety Mechanisms ..................................................................................... 22
2.4.3 Development Process to Address Random Failures ................................... 24
2.5 Limitations with Existing Safety Analysis Methods .......................................... 25
2.5.1 Making Safety Analysis Comprehensive ................................................... 25
2.5.2 Reduction of Implementation Overheads ................................................... 27
3. Safety Analysis Pessimism Reduction by Utilizing Application Tolerance ............. 29
3.1 Background and Related Work .......................................................................... 31
3.2 Improved Safety Analysis Technique ................................................................ 33
3.2.1 Value and Time Tolerance ......................................................................... 33
3.2.2 Divide and Conquer Safety Analysis Approach ......................................... 36
3.3 Evaluation of Value Tolerance and Time Tolerance ......................................... 38
3.3.1 Determination of Tolerance Using Actual System ..................................... 38
3.3.2 Analytical Estimation of Tolerance Values ................................................ 47
3.3.3 Determination of Tolerance Using High Level Models ............................. 49
3.4 Conclusion .......................................................................................................... 51
4. Formal Verification Based Approach for Accurate Safety Analysis ........................ 52
4.1 Background and Related Work .......................................................................... 53
4.2 Improved Safety Analysis Framework ............................................................... 55
4.3 Analysis on Benchmark Circuits ........................................................................ 57
4.4 Analysis on Industrial Modules.......................................................................... 58
4.5 Conclusion .......................................................................................................... 60
5. Improved Fault Injection Based Safety Analysis Approaches ................................. 62
5.1 Fault Injection Based Safety Analysis Approach ............................................... 63
5.1.1 Experimental Setup .................................................................................... 64
5.2 Fault Injection Workload Analysis .................................................................... 68
5.3 Workload Perturbation Approach ...................................................................... 70
5.4 Experimental Results.......................................................................................... 71
5.4.1 Control Functions ....................................................................................... 71
5.4.2 Inverter Application ................................................................................... 73
5.5 Conclusion .......................................................................................................... 76
6. Application Driven Protection Mechanisms ............................................................. 77
6.1 Hardware Based Protection Techniques ............................................................ 77
6.1.1 Device Level Techniques ........................................................................... 77
6.1.2 Circuit Level Techniques ........................................................................... 78
6.1.3 Module Level Techniques .......................................................................... 80
6.2 Software Based Protection Techniques .............................................................. 80
6.2.1 Control Flow Checking .............................................................................. 81
6.2.2 Vulnerability Reduction Techniques .......................................................... 81
6.2.3 Software Redundancy Techniques ............................................................. 83
6.3 Application Based Protection Techniques ......................................................... 84
6.3.1 Critical Flip-flop Reduction by Altering Application Execution ............... 87
6.3.2 Detection of Critical Flip-flops by Selective Redundant Execution .......... 94
6.4 Conclusion ........................................................................................................ 100
7. Conclusions and Future Work ................................................................................ 101
7.1 Future Work ..................................................................................................... 102
References ....................................................................................................................... 104
List of Tables
Table 2.1. Quantitative metric requirements..................................................................... 20
Table 2.2. Calibrating typical safety mechanisms. ........................................................... 23
Table 2.3. Workload coverage and number of dangerous flip-flops. ............................... 27
Table 4.1. Safety analysis on benchmark circuits. ............................................................ 57
Table 4.2. Safety analysis on industrial modules. ............................................................. 59
Table 5.1. Control functions used for evaluation. ............................................................. 72
Table 5.2. Workload iterations for inverter. ..................................................................... 74
Table 6.1. Different tasks executed by motor control application. ................................... 85
Table 6.2. Tradeoffs associated with changing control loop frequency. .......................... 88
Table 6.3. Comparison of traditional and proposed fault tolerant approaches. ................ 97
List of Figures
Figure 1.1. Representative fault tolerant systems. .............................................................. 3
Figure 2.1. Block diagram of an EV traction system. ......................................................... 9
Figure 2.2. Closed loop control system. ........................................................................... 10
Figure 2.3. Functional safety standards. ........................................................................... 11
Figure 2.4. Derivation of semiconductor safety requirements from end-application. ...... 13
Figure 2.5. SEooC requirements. ...................................................................................... 15
Figure 2.6. ISO26262 failure classification. ..................................................................... 16
Figure 2.7. Bathtub curve. ................................................................................................. 18
Figure 2.8. Dependent failures. (a) Common cause (b) Cascading. ................................. 19
Figure 2.9. Development process to address systematic faults. ........................................ 21
Figure 2.10. Functional safety development flow to address random failures. ................ 24
Figure 3.1. Safety analysis complexity and hardware overhead tradeoffs. ...................... 29
Figure 3.2. Illustration of a hybrid system. ....................................................................... 33
Figure 3.3. Tolerance in value and time over the control input range. ............................. 34
Figure 3.4. Motor speed variation for various injected errors. ......................................... 35
Figure 3.5. Computation of value and time tolerance. ...................................................... 36
Figure 3.6. Closed loop control system operation. ........................................................... 38
Figure 3.7. DRV8312, F2805x and BLDC motor. ........................................................... 40
Figure 3.8. Dataflow of closed loop motor control application. ....................................... 41
Figure 3.9. Application time tolerance in the presence of worst case errors. ................... 42
Figure 3.10. Application value tolerance as percentage of CPU output value. ................ 43
Figure 3.11. AC inverter application kit. .......................................................................... 44
Figure 3.12. AC inverter dataflow diagram. ..................................................................... 44
Figure 3.13. Application time tolerance in the presence of worst case errors. ................. 45
Figure 3.14. Application value tolerance as percentage of CPU output value. ................ 46
Figure 3.15. First order system. ........................................................................................ 47
Figure 3.16. Input and output values of the control system. ............................................. 48
Figure 3.17. PMSM control system. ................................................................................. 49
Figure 3.18. PMSM output in the presence of worst case error. ...................................... 50
Figure 3.19. Magnified version of PMSM output in the presence of worst case error. .... 50
Figure 3.20. PMSM output used for determining value tolerance. ................................... 51
Figure 4.1. Illustration of FV based safety analysis.......................................................... 55
Figure 4.2. IEV property specification. ............................................................................ 56
Figure 4.3. Reference application. .................................................................................... 58
Figure 5.1. Safety analysis approaches. ............................................................................ 63
Figure 5.2. Software fault injection flow. ......................................................................... 64
Figure 5.3. Critical elements identified at different operating conditions. ........................ 65
Figure 5.4. Critical flip-flops identified using each approach. ......................................... 66
Figure 5.5. Critical flip-flops identified for inverter application. ..................................... 67
Figure 5.6. Critical flip-flops identified in three approaches for inverter application. ..... 68
Figure 5.7. Number of unique critical flip-flops identified for each workload. ............... 69
Figure 5.8. Algorithm for workload augmentation. .......................................................... 71
Figure 5.9. Variation of critical elements with workload perturbation. ............................ 72
Figure 5.10. Critical flip-flops identified with perturbed workloads for AC inverter
application. ........................................................................................................................ 74
Figure 6.1. SOI transistor. ................................................................................................. 78
Figure 6.2. DICE flip-flop. ............................................................................................... 79
Figure 6.3. BISER flip-flop............................................................................................... 79
Figure 6.4. Sequencing of different tasks executed by motor control application. .......... 85
Figure 6.5. Change in criticality over time. ...................................................................... 86
Figure 6.6. Variation of time tolerance with control loop frequency for BLDC motor. .. 89
Figure 6.7. Variation of time tolerance with control loop frequency for AC inverter. ..... 89
Figure 6.8. Critical flip-flop identification for different time tolerance values. ............... 90
Figure 6.9. Variation in the number of critical flip-flops with time tolerance (# number of
control loop cycles) for BLDC motor control application. ............................................... 91
Figure 6.10. Variation in the number of critical flip-flops with time tolerance (# number
of control loop cycles) for AC inverter application. ......................................................... 91
Figure 6.11. Execution variation with different control loop frequencies. ....................... 92
Figure 6.12. PI function implemented by the control system. .......................................... 93
Figure 6.13. Selective redundant execution. ..................................................................... 95
Figure 6.14. Memory and MIPS overhead reduction for selective redundant execution
approach as compared to EDDI. ....................................................................................... 95
Figure 6.15. Selective redundant execution for error recovery. ....................................... 96
Figure 6.16. Memory and MIPS overhead reduction for proposed system recovery
approach as compared to TMR implemented in software. ............................................... 98
Figure 6.17. Protection approaches for safety critical application.................................... 98
1. Introduction
Advancements in technology have led to reductions in transistor feature sizes and
power consumption. This has helped to integrate more components into an Integrated
Circuit (IC). As
we move to newer technology nodes, the factors which are helping to shrink the transistor
size and reduce the power consumption are having an adverse impact on reliability. The
risk of device failure due to aging-induced phenomena like Negative Bias Temperature
Instability (NBTI) and Hot Carrier Injection (HCI) [1], and Single Event Upsets (SEU)
due to particle strikes [2] has increased. Of the different failure mechanisms, random
failures due to atmospheric particle strikes pose the biggest threat to the reliable operation
of ICs. While systematic design techniques [3] can be deployed to reduce the risk due to
life-time failures, cost effective solutions for random failures are still evolving.
In addition to the increased failure rate, ICs pose further challenges due to the
complexity of analyzing their potential failure modes [4]. The failure modes of mechanical
systems (which the modern electronic components are replacing) are predictable and
lend themselves to easy analysis. But due to the inherent complexity in the design,
manufacturing and functionality of ICs today, a much more rigorous approach is
required. For example, if we compare a mechanical steering to a steer-by-wire system [5],
a failure in mechanical steering will lead to predictable failure modes of loss of steering
or insufficient steering. But for a steer-by-wire system, a bit flip in the IC can cause the
steer-by-wire system to even steer in the opposite direction, which is not a potential
failure mode in the case of mechanical steering.
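The steer-by-wire example can be made concrete with a small sketch. Assuming, purely for illustration, that the steering command is held as a 16-bit two's-complement value in a register, a single event upset in the sign bit reverses the commanded direction:

```python
def flip_bit16(value, bit):
    """Flip one bit of a 16-bit two's-complement word and reinterpret it.

    Models a Single Event Upset (SEU) in a register holding a signed
    steering command. The 16-bit encoding is an illustrative assumption.
    """
    raw = (value & 0xFFFF) ^ (1 << bit)
    return raw - 0x10000 if raw & 0x8000 else raw

cmd = 300                    # steer right by 300 units
upset = flip_bit16(cmd, 15)  # SEU in the sign bit
# The corrupted command is large and negative: the system steers in the
# opposite direction, a failure mode with no mechanical analogue.
assert cmd > 0 and upset < 0
```

A flip in a low-order bit, by contrast, perturbs the command only slightly; the severity of a fault depends strongly on where it lands, which is what makes IC-level failure mode analysis harder than its mechanical counterpart.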
Along with the challenges of technology node shrinking and complexity in
analyzing failure modes, there is a requirement to keep the overheads of performance,
area and implementation incurred for functional safety to a minimum. This is forcing us
to think along new vectors for IC safety analysis, together with cost effective hardened
circuit components [6], improved design techniques [7] and improved architectural
methods [8].
1.1 Evolution of Functional Safety Systems
With evolution of technology, transistors become smaller, faster and more power
efficient, resulting in larger integration. The number of transistors in ICs has increased
significantly, driven by Moore's law [9] and Dennard scaling [10]. A paradigm
shift which happened during the evolution of ICs is the design of System-on-Chips
(SoCs) [11,12]. This allowed integration of multiple components into a single IC leading
to more efficient and cost effective design. Along with other advancements, the firmware
/ software foot-print used in these systems [13] has also increased. It has become more
complex with the emergence of machine learning and artificial intelligence implemented
using these ICs [14,15,16,17].
IC failures due to faults were initially a concern for safety critical systems in
transportation, industrial plants, space and medical domains. However, with the increase
in failure rate and the rapid proliferation of ICs as replacements for mechanical parts, it
has become necessary to comprehend functional safety requirements even for consumer
electronic systems.
1.2 Integrated Circuits Functional Safety Concerns
Functional safety requirements have traditionally been addressed using redundancy
techniques. Concerns in such systems were addressed by having redundancy based
control architectures [18,19,20] like 1oo2 (1-out-of-2 also known as Dual Modular
Redundancy), 2oo3 (2-out-of-3 also known as Triple Modular Redundancy), etc. as
shown in Figure 1.1. Redundancy based architectures lead to significant increase in
implementation overheads. Since the deployment of earlier systems was restricted, the
additional cost incurred for functional safety was of lesser concern than it is today. With
more wide-spread deployment into several applications in recent years, the higher cost
incurred in designing such redundant systems is driving the need to impart functional
safety with reduced design overheads. As redundancy is replaced with other design
techniques, functional safety analysis techniques are becoming more architecture and
application dependent and hence more complex. This requires detailed analysis of the
circuit, understanding the effect of the failure of each flip-flop / logic gate on the system,
and imparting resilience by addressing the dangerous after-effects using other techniques.
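The 2oo3 (TMR) voting mentioned above can be captured in a minimal sketch (the function name and operand widths are illustrative, not from any specific design): three redundant channels feed a bitwise majority vote, so a single faulty channel is outvoted by the two healthy ones.

```python
def vote_2oo3(a, b, c):
    """2-out-of-3 (TMR) majority vote over three redundant channel outputs.

    Computed bitwise, so it works for single-bit and word-wide signals:
    a single faulty channel is outvoted by the two healthy ones.
    """
    return (a & b) | (b & c) | (a & c)

correct = 0b1011
faulty = correct ^ 0b0100   # single event upset in one channel
assert vote_2oo3(correct, faulty, correct) == correct
```

The correctness comes at the cost of triplicated logic plus the voter, which is exactly the overhead that the application-aware techniques in this thesis seek to avoid.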
As ICs started getting larger and more complex, thorough investigation of failure
modes using standard techniques like Failure Mode and Effect Analysis (FMEA) [21,22]
and Fault Tree Analysis (FTA) [23] became difficult. Several functional safety incidents
like Toyota’s unintended acceleration recalls [24,25], Ford’s unintended gear shift related
recalls [26], and multiple other automotive recalls [27] illustrate this difficulty. The
advent of autonomous systems which are capable of taking independent decisions based
on a set of parameters has made the safety analysis more critical than ever before. Recent
accidents with Uber [28,29] and Tesla [30,31] point to the limitations of the functional
safety analysis methods deployed today.
As functional safety became important to different end applications, different
functional safety standards evolved to reduce risks by providing necessary requirements
and processes. For example, IEC 61508 [32] addresses the safety requirements of
electrical, electronic and programmable electronic safety related systems. This is the base
standard from which other functional safety standards are derived. ISO 26262 [33] is an
adaptation of IEC 61508 specifically for automotive electric and electronic systems.
DO178 [34] addresses the functional safety requirements of avionics systems. Different
functional safety standards are required for different end systems since the safety
requirements vary widely among the different applications.

Figure 1.1. Representative fault tolerant systems.
1.3 IC Functional Safety Research Challenges
The fundamental question raised in hardware safety is “How can I build an
affordably robust soft-error resilient system?”. Techniques and methods should be
available for the designer to comprehensively identify the minimum set of critical
components which need to be protected to make the system safe. The safety analysis
methods should reduce the analysis complexity and make it feasible to analyse very
large systems. In addition, it should reduce the impact on time to market as well.
Comprehensive safety analysis must ensure that the IC is safe for all application
scenarios. This implies the need to cover the impact of every fault for all valid functional
states of operation. This becomes particularly challenging when all the different
applications in which the IC will be used are not known at the time of design. It is a
common scenario that an IC built for one application finds use in several other, often
unrelated, applications.
When performing IC functional safety analysis, we need to understand that not
every random fault occurring in an IC will result in application failure. Different masking
effects (e.g. logical masking, electrical masking, latching window masking and
application level masking) [35,36] can prevent the fault from propagating to the output
and causing the application to fail. Functional safety analysis should consider the
different masking effects and reduce the overall hardware overhead incurred.
Many real life systems have ICs interacting with physical systems in safety critical
applications. These physical systems are inherently analog in nature, and the accuracy of
the analysis is dependent upon the performance deviation which can be tolerated under
different use conditions. These systems are typically designed as closed loop control
systems and, by their very nature, can correct certain errors, since such a system builds
resilience across subsequent iterations of the control loop. In addition, the interacting
physical system has latency and tolerance. The control algorithms [37,38] for these
systems are designed to accommodate the variability in the physical system and the
environment. The design of ICs used in such systems can beneficially employ these
system level tolerances to identify the minimum set of components that are required to be
protected, thereby reducing the hardware overhead.
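To make the closed-loop tolerance argument concrete, the sketch below simulates a simple first-order plant under proportional control with feedforward, and injects a one-iteration sensor error (the kind a transient flip-flop fault might cause). The loop pulls the output back within an assumed ±2% application tolerance without any hardware redundancy. The plant model, gain, fault magnitude and tolerance band are illustrative assumptions, not taken from this thesis.

```python
# Sketch: a closed-loop application absorbing a transient IC fault.
# All parameters (gain, plant pole, tolerance band) are illustrative.

def run_loop(steps=60, fault_step=30, fault_offset=0.5):
    setpoint, y = 1.0, 0.0
    kp = 0.5                       # proportional gain (illustrative)
    history = []
    for t in range(steps):
        sensed = y + (fault_offset if t == fault_step else 0.0)  # transient sensing error
        u = setpoint + kp * (setpoint - sensed)   # feedforward + proportional correction
        y = 0.5 * y + 0.5 * u                     # first-order plant response
        history.append(y)
    return history

hist = run_loop()
print(abs(hist[29] - 1.0) < 0.02)   # True: settled before the fault
print(abs(hist[30] - 1.0) > 0.05)   # True: visible transient deviation
print(abs(hist[-1] - 1.0) < 0.02)   # True: recovered within tolerance
```

The point of the sketch is that the error is absorbed by subsequent control iterations, so a flip-flop whose upset produces only such a transient deviation need not be protected in hardware.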
As we include the system level information to reduce the pessimism associated
with functional safety analysis, the analysis complexity increases significantly. As an
illustration, an application level analysis for functional safety of a motor control system
will require consideration of the motor and its different operating conditions involving
motor speed, load, etc. This requires inclusion of motor models, and the whole
analysis may become impractical due to the added complexity.
1.4 Contributions of This Thesis
In this thesis, we address some of the major challenges associated with IC
functional safety analysis and soft error mitigation. Broadly, this thesis investigates
functional safety analysis as it is performed today, determines the limitations and
proposes techniques to address them. More specifically, the major contributions from the
thesis can be grouped as given below.
Firstly, this thesis analyses functional safety requirements from an application
perspective and identifies the analysis complexities and potential optimizations. It proposes
a new technique by which the application level tolerances can be mapped to tolerances at
IC level. This approach helps to reduce the analysis complexity and at the same time
optimize the hardware overhead incurred for including functional safety. It further
proposes a new divide and conquer approach by which the tolerances at IC level are
mapped to individual modules whereby the analysis can be performed at standalone
module level.
Secondly, this thesis evaluates an alternate approach using formal techniques for
comprehensive functional safety evaluation. Incorporation of formal techniques makes
the analysis more accurate. In order to make the analysis less pessimistic, application
tolerance information is also incorporated. This involves capturing application diversity
as input constraints (values and sequences), modelling application specific performance
tolerances as an output range across time intervals and illustrating how the physical
system can be included into this analysis using a suitable representation.
Thirdly, this thesis proposes a workload perturbation approach to augment
workloads used in the traditional fault injection approach to address the dual problems of
safety analysis complexity and comprehensiveness. A method is proposed for systematic
perturbation of workloads whereby new workloads are generated iteratively; these are
shown to be effective in detecting additional critical flip-flops. These three contributions
together provide a new framework to identify the critical flip-flops which must be
protected.
Lastly, the thesis proposes two new application level techniques to protect critical
flip-flops. The first technique relies on altering application execution (e.g. increase
frequency of the control loop operations) to reduce the number of critical flip-flops. The
second technique uses selective redundant execution to protect the critical flip-flops. A
novel technique to correct the identified faulty flip-flops is also presented. These
techniques analyse realistic application scenarios and propose an optimal approach for
implementing application aware methods to protect critical flip-flops to incorporate
functional safety.
1.5 Thesis Organization
This chapter gave an overview of the evolution of functional safety systems and the
safety concerns associated with ICs used in these systems, discussed the open problems
in IC safety, and summarized the major contributions of this thesis in addressing them.
Chapter 2 presents an overview of IC functional safety describing the fundamental
safety concepts, derivation of semiconductor functional safety requirements from an
application level, various functional safety standards, and challenges faced in functional
safety analysis of ICs.
Chapter 3 proposes an improved functional safety analysis technique where the
application level tolerance is mapped to an IC as value tolerance and time tolerance. It
proposes a divide and conquer approach to make the functional safety analysis practically
viable. It shows several examples at various abstraction levels to show how the
application level tolerance can be mapped to IC hardware (or functional modules inside
an IC).
Chapter 4 illustrates the use of formal techniques to address the dual challenges of
analysis comprehensiveness and pessimism. It shows using a set of benchmark circuits
and two industry modules how the proposed techniques can be used to perform safety
analysis.
Chapter 5 proposes a workload augmentation approach to address the design size
limitations associated with the formal approach and the non-comprehensiveness associated
with practical workloads used in the traditional fault injection approach. The chapter
demonstrates the benefits of the proposed approach using a set of control benchmark
algorithms and code routines. It further explains a more directed perturbation approach to
solve the practical challenges associated with random workload perturbation.
Chapter 6 proposes two new application level techniques to protect the critical flip-
flops (wherein a flip-flop is termed critical when, in the presence of a fault, its erroneous
value results in unacceptable application behaviour) identified using the new approaches
proposed in Chapters 3, 4 and 5. The first technique relies on altering application
execution to reduce the number of critical flip-flops. The second technique uses selective
redundant execution to protect the critical flip-flops. A novel technique to correct the
identified faulty flip-flops is also presented. They consider realistic application scenarios
and arrive at an optimal approach to detect / protect critical flip-flops.
Chapter 7 concludes the thesis, and lists areas of further research.
2. Functional Safety of Integrated Circuits
The increasing and pervasive use of semiconductors defines many modern
applications today. If we take the case of automotive applications, semiconductors can be
found in multiple subsystems of a vehicle – from the basic EPS (Electric Power Steering)
and ABS (Antilock Brake System) to advanced safety and control (collision warning and
parking assistant systems), state-of-the-art infotainment systems, networking units (from
CAN to Bluetooth), and advanced control and comfort systems [39]. Established and
emerging automotive trends such as Electric Vehicle (EV) / Hybrid Electric Vehicle
(HEV), advanced vehicle intelligence, autonomous driving, etc., are only pushing the
semiconductor content even higher. While at one end, semiconductors continue to
provide newer functionality to the consumer, they also bring in increased risk of failures
which elevate the functional safety concerns of the system. This has led to functional
safety becoming a critical (and in some cases defining) parameter for the semiconductors
and ICs used for realizing important functions.
This chapter gives a brief overview of IC functional safety. The rest of this chapter
is organized as follows. Section 2.1 illustrates using an example, the impact of
semiconductor failures on the EV traction application. Section 2.2 introduces the
different functional safety standards. Section 2.3 introduces the key safety concepts
covered in these standards. Section 2.4 describes an example functional safety aware
development process practised to develop ICs, and Section 2.5 lists the limitations in this
existing safety analysis methodology.
2.1 Application Case Study: EV Traction System
ICs can fail in an application due to a variety of reasons, including excess
temperature, excess voltage, ageing, ionizing radiation and package stress. As
semiconductor content in mission critical applications like automobiles and aviation
increases, the chance of application failure due to semiconductor failures also increases.
The impact of IC failure on an application will differ based on the type of failure and type
of application. In this section, we analyse an Electric Vehicle (EV) traction application
and see how a failure in the control IC can impact the application.
Figure 2.1 shows the block diagram for the traction control system of an EV. The
system consists of an AC induction motor connected directly to the drive train. The motor
is driven by power stages which draw energy from the high capacity (20 – 100 kWh)
battery. The power stage is controlled by a control IC such as the C2000 microcontroller
(MCU) [40]. This control IC receives a torque/speed set-point command from the
supervisor IC, based on information from input functions like the accelerator, brake pedal,
etc. There is a separate Battery Management System (BMS) IC which monitors the
battery to keep track of its charging and discharging operation and voltage levels. The
control IC senses torque by measuring the motor phase current using the ADC. The
processing element in the control IC will determine the system error by comparing the
set-point torque received from the supervisor IC with the sensed torque and execute the
control algorithm to minimize this error.
The result of the control algorithm execution is conveyed to the motor as change of
applied power, by varying duty cycle of Pulse Width Modulation (PWM) module output
as indicated in Figure 2.2. The figure represents two torque scenarios. We have used a
simplistic assumption that the torque conveyed to the motor is proportional to the pulse
width. We analyse the impact of a fault in the different modules of the control IC on the
motor control application.
Figure 2.1. Block diagram of an EV traction system.
(i) A fault in the ADC module can lead to incorrect sensing (higher/lower than
actual) of speed. Incorrectly sensed speed, when input to the control algorithm,
can result in incorrect processing which can in turn lead to inadvertent
acceleration or deceleration of the motor.
(ii) A fault in the CPU will cause incorrect execution of the control algorithm
causing inadvertent acceleration or deceleration.
(iii) A fault in the PWM module can cause the output duty cycle to change, resulting
in inadvertent acceleration or deceleration.
In the above analysis, we have only considered the worst-case impact that a fault in
the IC can have on the application, (as a result of acceleration / deceleration, as against
the case when there is no appreciable change in the speed at all). The exact impact of a
fault will depend on the logic gate / flip-flop impacted and the type of fault. A transient
fault in one of the flip-flops in the data logic may not lead to much difference in the EV
motor output due to the mechanical components involved. The error will get corrected
subsequently due to the closed loop operation, albeit with a longer latency.
Figure 2.2. Closed loop control system.
However, if
the error is in the control path of the CPU, it is possible that the entire program flow
changes, resulting in irrecoverable and/or catastrophic behaviour. As an example, a
transient fault in the program counter will have much larger impact on the application
when compared to a transient fault in the LSB of adder logic output which drives the
PWM.
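The dependence of fault impact on the affected bit can be shown numerically. The sketch below flips single bits of a hypothetical 12-bit PWM duty register: an LSB upset perturbs the commanded duty by a fraction of a percent (well within typical application tolerance), while an MSB upset halves it, an abrupt torque change. The register width and duty value are assumptions for illustration.

```python
# Sketch: the application impact of a single-event upset depends on which
# bit flips. A 12-bit PWM duty register is assumed for illustration.

PERIOD = 4095  # full-scale count of a hypothetical 12-bit PWM timer

def duty_after_flip(duty, bit):
    return (duty ^ (1 << bit)) & 0xFFF   # transient single-bit upset

duty = 2048                               # ~50% commanded duty
lsb = duty_after_flip(duty, 0)            # LSB flip -> 2049
msb = duty_after_flip(duty, 11)           # MSB flip -> 0

print(abs(lsb - duty) / PERIOD)           # ~0.0002: negligible duty error
print(abs(msb - duty) / PERIOD)           # ~0.5: gross duty error
```

A program counter upset is worse still, since it corrupts control flow rather than a single data value and cannot be modelled as a bounded output deviation.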
2.2 Functional Safety Standards
Various functional safety standards have been developed and deployed over the
years to set guidelines for design and implementation to avoid hazards caused by
malfunctioning behavior of electric and electronic devices in end systems. IEC 61508
[32] is the fundamental functional safety standard that encapsulates basic safety concepts
and design requirements applicable to a wide range of electric and electronic end
equipment across industry. Application specific functional safety standards have since
evolved from IEC 61508 as shown in Figure 2.3 [41], in order to address the
requirements and considerations (e.g. availability, integrity requirements, etc.) specific to
an application.
Figure 2.3. Functional safety standards derived from the IEC 61508 meta standard: EN 62061 (factory automation), IEC 60730 (household goods), ISO 13849 (machinery), IEC 60880 (nuclear station), IEC 50158 (furnaces), RTCA / DO 178B (aerospace), IEC 61800 (power drive), IEC 60601 (medical equipment), EN 50128 (railway) and ISO 26262 (automotive).
Of these standards, ISO 26262 functional safety standard [33] addresses the
requirements from an automotive standpoint. This standard’s scope addresses the
functional safety requirements due to malfunctioning behaviour of electronic and
electrical systems in passenger cars, motor bikes, trucks and buses. The standard defines
requirements during various phases of system life cycle - management, development,
production, operation, service, and decommissioning.
Similarly, there are standards for nuclear power plants (IEC 60880) [42], aerospace
(DO 178B) [34], railway (EN 50128) [43], medical equipment (IEC 60601) [44], factory
automation (EN 62061) [45], machinery (ISO 13849) [46], power drive (IEC 61800)
[47], household goods (IEC 60730)[48], etc. One of the key challenges with different
system level functional safety standards is that it becomes difficult for a semiconductor
developer to easily infer what really needs to be done at an SoC level. Therefore, it is key
to understand the fundamental concepts underlying the standards to better appreciate the
safety requirements. Such an understanding will in turn enable a semiconductor
developer to come up with an optimal and cost-effective chip development process, and a
solution that can meet the requirements.
2.2.1 Deriving Semiconductor Safety Requirements from End Application
In this section, we will use the automotive functional safety standard ISO 26262 to
describe how semiconductor safety requirements are derived from the application
safety requirements. The standard uses a risk analysis based approach for deriving the
safety requirements from an end application. We will use the EV traction application
to examine how semiconductor requirements are derived from the application safety
requirements. The various steps involved are enumerated below and
further described in Figure 2.4.
(i) Identify the specific application or function for which the safety analysis needs
to be performed. In this example, application considered is EV traction.
(ii) Identify various operating conditions, (e.g. driving on highway, driving in city,
plug-in charging), of the application and hazards possible, (e.g. unintended
positive torque causing acceleration), in these operating conditions.
(iii) Determine risk associated with each of the hazards based on severity (extent of
injury a malfunction can lead to), exposure (duration for which the vehicle is in
a particular operating condition which has the potential to cause the hazard) and
controllability (ability of the driver to control the vehicle and prevent injury when a
hazard happens), e.g. unintended acceleration in a city with pedestrians can cause
fatal accidents and will be assigned a higher risk rating. Recommended practices
published by the Society of Automotive Engineers (SAE), such as SAE J2980 [49],
provide guidance for identifying and classifying hazardous events.
Figure 2.4. Derivation of semiconductor safety requirements from the end application: item definition; situational analysis and hazard identification (operating conditions such as plug-in charging and driving); hazard classification (risk rating based on exposure, severity and controllability); ASIL assignment for the hazards (e.g. unintended positive torque – ASIL-X); functional safety requirements (e.g. prevent unintended positive torque – ASIL-X); technical safety requirements; and hardware safety requirements (individual module level safety mechanisms used to achieve diagnostic coverage).
(iv) Assign an Automotive Safety Integrity Level (ASIL) based on the determined
risk. ASIL varies from A to D, with D being the most stringent in terms of the
requirements. ISO26262 provides a reference table which will help to derive the
ASIL from severity, exposure and controllability values. Steps (ii), (iii) and (iv)
constitute the Hazard Analysis and Risk Assessment (HARA) [50] process.
(v) Derive functional safety requirements (which are implementation independent,
i.e. guaranteed to hold irrespective of how the function is implemented, e.g.
avoid unintended acceleration) and associated integrity levels (ASIL levels)
using Steps (iii) and (iv), (e.g. prevention of unintended acceleration – ASIL-C).
(vi) Derive technical safety requirements required for the implementation of the
functional safety requirements, (e.g. MCU shall check correctness of generated
torque).
(vii) Derive IC safety requirements from the system level technical safety
requirements, (e.g. redundant sensing of generated torque using two ADCs,
protection of critical flip-flops implementing Program Counter (PC), etc.).
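Steps (iii) and (iv) above can be sketched as a small lookup. The function below uses the well-known additive shorthand for the ISO 26262 ASIL determination table (severity S1–S3, exposure E1–E4, controllability C1–C3): the class indices are summed and mapped to QM or ASIL A–D. This shorthand reproduces the standard's table, but the table itself should be consulted for any normative use; the hazard examples are illustrative.

```python
# Sketch of HARA risk-to-ASIL mapping using the additive shorthand for the
# ISO 26262 determination table. Inputs are the class indices (S1..S3 -> 1..3,
# E1..E4 -> 1..4, C1..C3 -> 1..3).

def asil(severity, exposure, controllability):
    score = severity + exposure + controllability   # S + E + C
    return {7: "ASIL-A", 8: "ASIL-B", 9: "ASIL-C", 10: "ASIL-D"}.get(score, "QM")

# Unintended acceleration in dense city traffic: life-threatening severity,
# high exposure, hard to control -> most stringent level.
print(asil(3, 4, 3))   # ASIL-D
print(asil(3, 4, 1))   # ASIL-B
print(asil(1, 2, 1))   # QM
```

Anything below ASIL-A falls to QM (quality management), i.e. no dedicated safety requirement beyond the normal development process.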
2.2.2 SEooC Design Process
In Section 2.2.1, we have analysed how semiconductor safety requirements are
derived from the end application. However, in most of the cases, semiconductor devices
may not be designed and manufactured for a particular application. They may be either
developed to cater to a set of applications, (e.g. a micro-controller may be used in diverse
applications like motor control, digital power control, control of home appliances, etc.) or
may be a standalone function (e.g. voltage regulator) which can be used in several
applications. In order to cater to the requirements for a diverse set of applications,
ISO26262 has laid out the Safety Element out of Context (SEooC) development process
[51].
For deriving semiconductor safety requirements, SEooC development mandates that the
IC manufacturer consider a set of applications which the device can potentially cater to,
and analyse the safety requirements from these applications. These requirements (termed
assumptions) will be used for the IC design. The assumptions will contain both the
requirements for the IC design, termed as ‘assumed requirements’ (e.g. safety
mechanisms to be implemented within the device like ECC) and additional requirements
for components external to the device, termed as ‘assumptions on design external to
SEooC’ (e.g. requirement to have an external power monitor since the device does not
have a built-in power monitor). While integrating an SEooC device, a system integrator
must ensure that the system conforms to the assumptions used during the design of the
component. The SEooC development flow outlined in the ISO26262 is shown in Figure
2.5.
The SEooC development process deployed in the IC industry today does not use
system level information (e.g. tolerance available at the system level due to physical
components, repeated execution in closed loop control, etc.) for performing safety
analysis, leading to increased analysis pessimism. Incorporating such system level
effects has so far been avoided because it increases the analysis complexity. This thesis
illustrates how IC safety requirements can be accurately derived from the application
safety requirements and utilized for IC safety analysis without an increase in analysis
complexity.
Figure 2.5. SEooC requirements: assumptions (assumed requirements, and assumptions on design external to the SEooC) feed the SEooC requirements, which drive the SEooC design.
2.3 IC Design Evaluation for Safety
Once the IC safety requirements are determined from the application, the design is
evaluated and further augmented to make sure that the safety requirements can be
addressed even in the presence of faults. The design evaluation process involves
understanding the different types of faults, performing systematic analysis to identify
their implications and protecting against them.
2.3.1 Types of Failures
IC failures can be classified into systematic failures and random failures as shown in
Figure 2.6.
2.3.1.1 Systematic Failures
Systematic failures are failures related in a deterministic way to a certain cause that
can be eliminated by change of design, manufacturing process, operational procedures,
etc. Such failures typically arise due to design quality issues (e.g. bugs in the design),
non-adherence to operating conditions (e.g. device operating in conditions outside the
specified temperature, voltage, humidity conditions), etc. These failures are repeatable in
Figure 2.6. ISO26262 failure classification: systematic failures (design quality, manufacturing quality, adherence to operating conditions) and random failures (permanent and transient faults, further classified as single point, residual, multiple point and safe faults).
nature and can be controlled to a large extent by following a regimented approach
(process) during the product life cycle.
2.3.1.2 Random Failures
Random failures can be due to permanent faults or transient faults. Permanent
faults are caused by ageing phenomena like Negative Bias Temperature Instability
(NBTI) and Hot Carrier Injection (HCI) [1]. Transient faults are caused by alpha
particle and neutron strikes. A permanent fault, as the name indicates, causes
permanent damage to the chip and will stay until it is removed or repaired. On the other
hand, the impact of a transient fault is temporary in nature. Given the random occurrence of
both permanent and transient failures, the probability of occurrence is best estimated
using statistical information based on historical data, accelerated test procedures like
High Temperature Operating Life (HTOL) [52], burn-in, etc. (for permanent faults) and
radiation testing (for transient faults) [53].
Random faults can be further classified into single point fault, residual fault,
multiple point fault and safe fault [54]. The ISO26262 definitions for these faults are
included here. A single point fault is a fault which is not covered by any safety
mechanism and whose failure can lead to violation of the system safety goal. Among
the faults which are covered by a safety mechanism, residual faults refer to the portion
not actually covered by that mechanism, which can therefore still lead to violation of a
safety goal. A multiple point fault is a fault which by itself cannot cause violation of
the safety goal, but which in combination with one or more other independent faults
can lead to violation of the safety goal. A fault
whose occurrence will not cause violation of the safety goal is called a safe fault.
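The four categories can be summarised as a small decision procedure. The sketch below is an illustrative simplification of the ISO 26262 definitions above: it classifies a fault from three assumed boolean attributes (whether it can contribute to violating the safety goal, whether a safety mechanism covers it, and whether that mechanism actually detects it).

```python
# Illustrative simplification of the ISO 26262 random fault categories;
# the three boolean attributes are modelling assumptions, not standard terms.

def classify(can_violate_goal, covered_by_sm, detected_by_sm):
    if not can_violate_goal:
        return "safe fault"            # can never violate the safety goal
    if not covered_by_sm:
        return "single point fault"    # no safety mechanism covers it
    if not detected_by_sm:
        return "residual fault"        # covered, but detection misses it
    return "multiple point fault"      # dangerous only together with another fault

print(classify(False, False, False))   # safe fault
print(classify(True, False, False))    # single point fault
print(classify(True, True, False))     # residual fault
print(classify(True, True, True))      # multiple point fault
```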
2.3.1.3 Device Failure Rate and Bathtub Curve
The failures occurring during the semiconductor device lifetime are described in
terms of the classical bathtub curve [55] illustrated in Figure 2.7 [56]. During the early
life of the device, failures termed infant mortality occur with a probability that decreases
with time. Towards the end of life, wear-out failures occur with a probability that
increases with time. In between, there is a constant failure rate during
the operational lifetime of the device. This constant failure rate is attributed mainly to the
random failures occurring in devices.
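The constant failure-rate region implies an exponential reliability model, R(t) = exp(−λt). The sketch below converts an assumed rate of 100 FIT (failures per 10⁹ device-hours) into a probability of failure over an assumed 10,000-hour operating life; both figures are invented for illustration and are not claims about any particular device.

```python
# Sketch relating the bathtub curve's constant failure-rate region to a
# reliability estimate. The 100 FIT rate and mission length are assumptions.

import math

FIT = 100                        # assumed rate: failures per 1e9 device-hours
lam = FIT / 1e9                  # constant failure rate lambda, per hour
mission_hours = 10_000           # assumed operating life

reliability = math.exp(-lam * mission_hours)   # R(t) = exp(-lambda * t)
prob_failure = 1 - reliability

print(lam)                        # 1e-07 failures per hour
print(prob_failure < 1e-3)        # True: just under 0.1% over the mission
```

Note that a per-hour rate of 10⁻⁷ is exactly the order of magnitude that the PMHF targets in Section 2.3.3 bound, which is why failure rates are usually quoted in FIT.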
2.3.2 Circuit Failure Mode Analysis: Qualitative
Qualitative failure mode analysis approach performs a systematic analysis of the
circuit to identify the potential vulnerabilities and provide suggestions / recommendations
to change the design of the circuit to address the vulnerabilities. The common methods
used for qualitative analysis are Failure Mode and Effects Analysis (FMEA) [21] and
Fault Tree Analysis (FTA) [23]. FMEA is an inductive bottoms-up approach where faults
in each of the lower level components is analysed for its impact on the higher level
function failure (e.g. impact of faults in logic gates / flip-flops at the IC level). FMEA
analysis consists of (i) determining the faults which can cause a function to fail, (ii)
identifying diagnostic mechanisms which can detect such faults, (iii) determining the
faults in the diagnostic logic which can make such detection ineffective, and (iv)
identifying checks (diagnostic mechanisms) for diagnostics that can detect a diagnostic
malfunction. FTA is a deductive top-down approach by which each of the system failure
conditions is analysed and further mapped to low level contributing elements (e.g.
flip-flops, gates, etc.).
Figure 2.7. Bathtub curve: decreasing failure rate (early infant mortality failures), constant (random) failure rate, and increasing failure rate (wear-out failures) together make up the observed failure rate over time.
A combination of both these techniques is recommended to be used for the
functional safety analysis. However, system level considerations are not used in the IC
safety analysis practiced today. Hence, there is insufficient information available to
perform FTA. In the absence of system level information, all faults at the IC level are
classified as critical for FMEA analysis. This leads to analysis pessimism. In this thesis,
we propose a two-step FMEA process to address this limitation. In the first step, we
analyse the impact of different errors at IC level which can cause failure at the system
level. In the second step, we identify critical flip-flops whose faults can cause errors at IC
level which can lead to system level failures.
2.3.2.1 Dependent Failure Analysis
Qualitative failure analysis should also consider the effect of dependent failures,
which can reduce effectiveness of the employed diagnostics. Dependent failures can be
either common cause failures or cascading failures. Common cause failures, indicated in
Figure 2.8(a), are the ones in which the logic common to both functional and diagnostic
modules fails. This can make the diagnostic ineffective, e.g. if the power to the mission
logic and diagnostic logic is the same, failure of power will lead to non-detection of a
fault by the diagnostic module. Cascading failures as indicated in Figure 2.8(b) are the
ones where the failure in one module propagates to another module making it fail. This is
applicable for cases where functions with different integrity levels co-exist within the
chip and there is a possibility of a fault from a lower integrity module impacting a
higher integrity module, e.g. a fault propagating from lower integrity debug / emulation
logic to the higher integrity functional logic, thus impacting the critical high integrity
functions executing on the CPU.
Figure 2.8. Dependent failures: (a) common cause failure, where logic common to the mission logic, redundant logic and comparator fails; (b) cascading failure, where a failure in lower integrity logic propagates to higher integrity logic.
2.3.3 Circuit Failure Mode Analysis: Quantitative
Quantitative analysis provides an objective means to ascertain the safety worthiness
of a given circuit. The resulting metrics can be used to (i) objectively assess the safety effectiveness
of the design to cope with the random hardware failures, (ii) provide guidance towards
making design enhancements for safety, (iii) compare different diagnostic architectures
and (iv) ascertain the ASIL which can be achieved using the specified architecture.
Quantitative analysis requires derivation of some key metrics, namely Single Point
Fault Metric (SPFM), Latent Fault Metric (LFM) and Probabilistic Metric for Hardware
Random Failures (PMHF). The device needs to meet the specific target values (indicated
in Table 2.1) corresponding to these metrics to achieve a specific ASIL. SPFM and LFM
can be computed using the equations provided below. (Refer Section 2.3.1.2 for fault
classification). PMHF indicates the average probability of failure per hour and is obtained
by adding the failure rates of the constituent components.
SPFM = 1 − (Residual Faults + Single Point Faults) / (Total Faults − Safe Faults)

LFM = 1 − (Undetected Multiple Point Faults) / (Total Faults − (Residual Faults + Single Point Faults))
Table 2.1. Quantitative metric requirements.

Metric | ASIL-B      | ASIL-C      | ASIL-D
SPFM   | ≥ 90%       | ≥ 97%       | ≥ 99%
LFM    | ≥ 60%       | ≥ 80%       | ≥ 90%
PMHF   | < 10^-7 h^-1 | < 10^-7 h^-1 | < 10^-8 h^-1
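The SPFM and LFM expressions above can be exercised on an illustrative fault inventory; the counts below stand in for failure rates and are invented purely for the example.

```python
# Illustrative SPFM / LFM computation using the formulas above; the fault
# inventory (counts standing in for failure rates) is invented.

faults = {
    "safe": 400.0,
    "single_point": 1.0,
    "residual": 4.0,
    "mp_detected": 565.0,   # multiple point faults that are detected / perceived
    "mp_latent": 30.0,      # multiple point faults that remain undetected (latent)
}
total = sum(faults.values())                      # 1000.0

dangerous = faults["single_point"] + faults["residual"]
spfm = 1 - dangerous / (total - faults["safe"])
lfm = 1 - faults["mp_latent"] / (total - dangerous)

print(round(spfm, 4))                             # ~0.9917
print(round(lfm, 4))                              # ~0.9698
print(spfm >= 0.99 and lfm >= 0.90)               # True: meets the Table 2.1 ASIL-D targets
```

In practice the inputs are failure rates (in FIT) per fault class derived from the FMEA, not raw counts, but the arithmetic is the same.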
2.4 Protecting Against Systematic and Random Failures
We have seen in Section 2.3.1 that application failures are caused by systematic
and random failures. Therefore, functional safety aware chip development should
comprehend both. Systematic failures are controlled by a robust development process,
while random failures are addressed through protection mechanisms which can either
detect or avoid the faults.
2.4.1 Robust Development Process
The end system should have a robust development process for the different phases
corresponding to concept, design, production, operation and decommissioning, to
mitigate systematic failures. A part of the robust development process addressing the
safety requirements in IC design covering concept, design and production stages is shown
in Figure 2.9. This includes:
(i) Requirement Phase: Collection of safety requirements from the application or
from a set of applications.
(ii) Design phase: Hardware and software design based on the specification.
Figure 2.9. Development process to address systematic faults: system level functional safety requirements are refined into hardware safety requirements, implemented, and then checked through verification of the hardware safety requirements and validation of the functional safety requirements.
(iii) Verification phase: Check to ensure that the design meets the functional
specification and safety requirements.
ISO26262 specifies representative best practices for each of the different
development phases. These include proper traceability from requirements to design to
verification and validation phases, how IPs used in the devices are selected and assessed
(development interface agreement), the change management process during development,
and best practices in design (e.g. conservative design with sufficient margin), verification
(e.g. use of formal methods) and validation (e.g. fault insertion testing).
2.4.2 Safety Mechanisms
Along with addressing systematic failures through a robust development process,
ICs should have in-built mechanisms to handle random failures. Devices need to have
additional protection mechanisms called safety mechanisms in place to detect or mitigate
such failures. Safety mechanisms can be classified into three categories based on the
protection they offer.
(i) Primary safety mechanisms: detect single point faults, e.g. ECC for memory.
(ii) Tests for diagnostics: detect faults in the diagnostic mechanisms themselves, e.g.
a test for the ECC logic.
(iii) Fault avoidance measures: help avoid a particular fault, e.g. spacing apart
memory bit-cells which form a logical word such that neighbouring multiple bit
upsets can still be detected by ECC.
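As an example of a primary safety mechanism, the sketch below implements a minimal Hamming(7,4) code, the simplest single-error-correcting ECC: any one flipped bit in the stored 7-bit word is located by the syndrome and corrected. Production memories use wider SECDED codes (which also detect double errors), so this is an illustration of the principle only.

```python
# Sketch of a primary safety mechanism: Hamming(7,4) ECC correcting any
# single bit flip in a stored nibble. Real memory ECC uses wider SECDED codes.

def encode(nibble):
    d = [(nibble >> i) & 1 for i in range(4)]          # data bits d0..d3
    p1 = d[0] ^ d[1] ^ d[3]                            # parity over positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]                            # parity over positions 2,3,6,7
    p4 = d[1] ^ d[2] ^ d[3]                            # parity over positions 4,5,6,7
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]        # codeword, positions 1..7

def decode(bits):
    syndrome = 0
    for pos, bit in enumerate(bits, start=1):
        if bit:
            syndrome ^= pos                            # XOR of set-bit positions
    if syndrome:                                       # non-zero -> that position flipped
        bits[syndrome - 1] ^= 1                        # correct it in place
    d = [bits[2], bits[4], bits[5], bits[6]]
    return sum(b << i for i, b in enumerate(d))

word = encode(0b1011)
word[4] ^= 1                    # single-event upset in a data bit
print(decode(word) == 0b1011)   # True: the upset is corrected
```

The interleaving measure in item (iii) exists precisely because such a code corrects only one bit per word: physically separating the bits of a logical word turns a multi-cell strike into single-bit errors in several different words.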
2.4.2.1 Factors Involved in Selection of Safety Mechanisms
The effectiveness of a safety mechanism is measured using Diagnostic Coverage
(DC). DC indicates the proportion of the hardware logic faults which can be detected by
the implemented safety mechanisms. In addition to diagnostic coverage, factors like area
and power overhead, MIPS required for safety mechanism execution, safety mechanism
development effort, etc., are considered before deciding on a particular safety
mechanism. Some of the common safety mechanisms and their typical diagnostic
coverages and overheads (largely implementation dependent and subject to application
scenarios) are listed in Table 2.2. The assessment given is approximate and will differ
based upon the actual function being defined and its implementation.
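The DC metric itself is straightforward to compute from fault injection results. A minimal sketch (the fault counts below are illustrative placeholders, not measured data):

```python
def diagnostic_coverage(detected_faults: int, total_faults: int) -> float:
    """DC = proportion of injected hardware faults detected by the safety mechanism."""
    if total_faults == 0:
        raise ValueError("no faults injected")
    return detected_faults / total_faults

# Illustrative numbers only: 9,000 of 10,000 injected faults flagged
# by a hypothetical ECC checker gives 90% diagnostic coverage.
dc = diagnostic_coverage(9000, 10000)
print(f"DC = {dc:.0%}")  # DC = 90%
```

In practice this ratio is evaluated per failure mode and per safety mechanism, and weighed against the area, power and MIPS overheads discussed above.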
2.4.2.2 Classification of Safety Mechanisms
Safety mechanisms can also be classified based on the level of design abstraction at
which they are implemented.
(i) Device level, where the components used for building SoC such as standard cell
library are augmented with safety mechanisms (e.g. RAZOR flip-flop [57],
DICE flip-flop [6], BISER flip-flop [58]).
(ii) Gate level, where the components are protected by means of circuit level
techniques (e.g. delayed capture methodology [59,60,61], lock-step operation)
(iii) Function / Architecture level, where higher levels of abstraction are analysed
and leveraged to provide protection (e.g. redundant execution [8,62], monitoring
mechanisms [63], etc.).
Table 2.2. Calibrating typical safety mechanisms.

Safety mechanism            | Transient fault DC | Permanent fault DC | Area overhead | MIPS overhead
Lock-step CPU               | High               | High               | High          | None
Hardware self-test for CPU  | None               | High               | Medium        | Medium
Software self-test for CPU  | None               | Medium             | Low           | High
Parity for memories         | Low                | Low                | Low           | Low
ECC for memories            | High               | High               | Medium        | Low
Self-test for memories      | None               | High               | Medium        | Medium
2.4.3 Development Process to Address Random Failures
The functional safety development flow to address random failures should
comprehend requirements mentioned in previous sections. A typical development flow is
illustrated in Figure 2.10.
(i) Derive device architecture based on the functional requirements of the target
application(s).
(ii) Derive functional safety requirements from the target application or set of
applications considered in the case of SEooC development.
(iii) Perform qualitative Failure Mode and Effect Analysis (FMEA) of design based
on device architecture and functional safety requirements. This helps to identify
safety mechanisms.
Figure 2.10. Functional safety development flow to address random failures.
(iv) Update hardware and software safety requirements based on the qualitative
analysis.
(v) Update device specification to capture the new safety mechanisms.
(vi) This is followed by design implementation. After implementation, the
qualitative analysis is repeated using the additional implementation-related
information to re-evaluate the design.
(vii) Perform design verification to check correct implementation of functional logic
and safety mechanism.
(viii) Perform quantitative analysis to check whether the goals associated with the
functional safety requirements are met. If the goals are not met, device
architecture is modified. This iteration continues till the goals are met.
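The iterative flow in steps (i)-(viii) can be sketched as a loop that keeps adding safety mechanisms until the quantitative goals are met. The helper logic below is a hypothetical stand-in for the real engineering activities, intended purely to illustrate the control structure of Figure 2.10:

```python
# Toy sketch of the iterative development flow. The coverage model is an
# assumption made for illustration (each added mechanism raises achieved
# diagnostic coverage by 25 percentage points), not an engineering rule.
def safety_development_flow(architecture, goal_dc=0.90):
    dc = 0.0
    iterations = 0
    while dc < goal_dc:               # step (viii): quantitative goal check
        iterations += 1
        # steps (iii)-(vi): identify a mechanism, update spec, implement
        architecture = architecture + [f"safety_mechanism_{iterations}"]
        dc = min(1.0, 0.25 * iterations)   # stand-in for quantitative analysis
    return architecture, dc

arch, dc = safety_development_flow(["cpu", "pwm", "cap"])
print(len(arch), dc)  # 7 1.0
```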
2.5 Limitations with Existing Safety Analysis Methods
This chapter provided an overview of IC safety analysis as practised today. It can
be noticed that several additional steps are added to the IC development process to make
it functional safety compliant. These additional steps lead to additional effort and
implementation overheads, and significantly increase the development time (and hence
time to market) for the ICs and the systems incorporating them. In spite of these
additional steps existing functional safety analysis methods suffer from two major gaps.
This section describes these limitations.
2.5.1 Making Safety Analysis Comprehensive
Identification of critical (dangerous) gates / flip-flops in IC whose faults can cause
the application to fail and protecting them is one of the important steps in the functional
safety compliant IC design process. A failure during application life time can occur due
to several reasons, namely, permanent fault in logic gates / flip-flops, Single Event Upset
in flip-flops and other storage elements, Single Event Transients (SET) in combinational
logic gate, etc. In this thesis, we focus on the robustness analysis of the circuit in the
presence of SEU in flip-flops.
Fault injection is widely used to provide a metric on the suitability of a circuit
module to be used in safety critical application. Fault injection provides evidence of the
robustness (or otherwise) of the hardware executing the given application in the presence
of SEUs. For SEU robustness evaluation, faults which model SEUs are injected in the
presence of application workload. (A workload is a popular term used to denote an
application execution sequence in terms of input values, internal CPU or firmware
program, etc.). Outputs and critical state elements in the circuit with faults injected in
simulation are compared with those in the circuit with no faults injected (good)
simulation. If the injected faults result in a changed output and this is not detected by the
safety mechanisms, the gate / flip-flop in which this fault is injected is classified as
dangerous [4]. A flip-flop / gate, on the other hand, is safe if for all injected faults, there
is no output change, or if the change is detected by the built-in safety mechanisms. Fault
injection methods use either actual application workloads or synthesized workloads for
the safety evaluation. The comprehensiveness of the safety analysis critically depends on
comprehensiveness of the workloads used during fault injection based evaluation.
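The safe/dangerous classification rule described above can be sketched as follows. Traces are simplified here to plain lists of sampled output values, and the function name is our own, not part of any standard flow:

```python
# A flip-flop is "dangerous" if at least one injected fault changes an output
# and no safety mechanism detects it; otherwise it is "safe".
def classify_flip_flop(good_trace, faulty_traces, detected_flags):
    """faulty_traces[i]: output trace with fault i injected in this flip-flop;
    detected_flags[i]: True if a safety mechanism flagged fault i."""
    for trace, detected in zip(faulty_traces, detected_flags):
        if trace != good_trace and not detected:
            return "dangerous"
    return "safe"

good = [0, 1, 1, 0]
print(classify_flip_flop(good, [[0, 1, 1, 0], [0, 1, 0, 0]], [False, False]))  # dangerous
print(classify_flip_flop(good, [[0, 1, 0, 0]], [True]))                        # safe
```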
The toggle coverage metric is often used to ascertain the suitability of workloads to
identify critical flip-flops, (i.e. those which are identified as dangerous and must hence be
protected). For example, a set of workloads is considered adequate if the toggle coverage
is greater than a prescribed value [22]. Practical considerations of circuit size and
simulation time require that this coverage threshold (e.g. 70%, 90%, 99%) be bounded
accordingly. However, the relation between the
number of dangerous flip-flops and the toggle coverage is not well established [64].
Table 2.3 shows this data for an industrial circuit, (a digital filter used for filtering noise
from low frequency analog signals). Different coverage metrics, (namely block,
expression, code and toggle coverage), are evaluated independently for each workload
and the number of dangerous flip-flops is identified. Workloads (TC_0 to TC_9 which
are run independently) are arranged in the increasing order of toggle coverage. Contrary
to expectations, workloads with higher toggle coverage do not necessarily correspond to
a larger number of dangerous flip-flops. TC_9 has higher toggle coverage than TC_8;
however the number of dangerous flip-flops is smaller. TC_2 has lower toggle coverage
than TC_7; however, the number of dangerous flip-flops is larger. A higher code
coverage also does not indicate a better workload quality. It is therefore important to
consider more comprehensive workloads and perform exhaustive fault injection.
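The non-monotonic relation discussed above can be checked directly from the toggle-coverage and dangerous-flip-flop columns of Table 2.3:

```python
# Data reproduced from Table 2.3: (toggle coverage %, # dangerous flip-flops),
# workloads TC_0..TC_9 in increasing toggle-coverage order.
table = [(41.02, 54), (44.49, 69), (44.62, 69), (45.47, 84), (45.59, 82),
         (68.93, 54), (73.07, 68), (73.20, 67), (74.05, 84), (74.17, 82)]

# If toggle coverage predicted workload quality, the dangerous-flip-flop
# counts would be non-decreasing along this ordering. They are not:
counts = [d for _, d in table]
monotone = all(a <= b for a, b in zip(counts, counts[1:]))
print(monotone)  # False: e.g. TC_5 (68.93%) finds fewer than TC_3 (45.47%)
```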
2.5.2 Reduction of Implementation Overheads
Not every random fault occurring in an IC will result in application failure.
Different masking effects (e.g. logical masking, electrical masking, latching window
masking and application level masking) [35] can prevent the fault from propagating to
the output and causing the application to fail. Safety analysis without considering these
masking effects will result in over-design leading to an increased hardware and / or
design implementation overhead. Analysis at higher abstraction levels (e.g. at the
application level) helps to evaluate and take advantage of additional masking effects, and
these can be beneficially employed to reduce the hardware overhead.
As an illustration, we can consider a real life functional safety system like traction
control or ABS used in automobiles. These systems demand high accuracy and self-
correction to adjust for the physical system’s non-linearity and noise effects. They are
designed as closed loop control systems and typically have an associated acceptable
tolerance in the extent to which the behaviour can deviate from the ideal or centre
Table 2.3. Workload coverage and number of dangerous flip-flops.

Test case | Block coverage | Expression coverage | Code coverage | Toggle coverage | # Dangerous flip-flops
TC_0      | 93.33%         | 100.00%             | 72.91%        | 41.02%          | 54
TC_1      | 93.33%         | 100.00%             | 74.48%        | 44.49%          | 69
TC_2      | 93.33%         | 100.00%             | 74.58%        | 44.62%          | 69
TC_3      | 93.33%         | 100.00%             | 74.87%        | 45.47%          | 84
TC_4      | 93.33%         | 100.00%             | 74.78%        | 45.59%          | 82
TC_5      | 100.00%        | 100.00%             | 84.87%        | 68.93%          | 54
TC_6      | 100.00%        | 100.00%             | 86.44%        | 73.07%          | 68
TC_7      | 100.00%        | 100.00%             | 86.54%        | 73.20%          | 67
TC_8      | 100.00%        | 100.00%             | 86.84%        | 74.05%          | 84
TC_9      | 100.00%        | 100.00%             | 86.74%        | 74.17%          | 82
position. The control algorithms [37,38] for these systems are also designed to
accommodate the variability in the physical system. Today, there is no systematic way to
include such application level tolerance information in the IC functional safety analysis
process. Hence, the design is often pessimistic leading to further increase in the
overheads due to functional safety.
This thesis proposes new methods to address these limitations.
3. Safety Analysis Pessimism Reduction by
Utilizing Application Tolerance
Not every random fault occurring in an IC will result in application failure.
Different masking effects (e.g. logical masking, electrical masking, latching window
masking and application level masking) [35] can prevent the fault from propagating to
the output and causing the application to fail. Safety analysis without considering these
masking effects will result in over-design leading to an increased hardware overhead.
While there is reported work on effects of logical, electrical and timing masking, those
due to application masking have not been as well studied.
Analysis at higher abstraction levels helps to incorporate additional masking effects
and these can be beneficially employed to reduce the hardware overhead. On the other
hand, analysis at higher abstraction levels significantly increases the analysis complexity.
As an illustration, an application level analysis for functional safety of a motor control
system will require consideration of motor and its different operating conditions
involving motor speed, load, etc. This requires inclusion of the motor models and makes
the analysis more complex, as compared to that of the IC considered standalone. Figure
3.1 indicates the hardware overhead and analysis complexity trade-offs associated with
functional safety analysis when carried out at different abstraction levels.
Figure 3.1. Safety analysis complexity and hardware overhead tradeoffs (abstraction levels: Device, Module, SoC, Application).
Many of today’s real life safety critical systems have high accuracy and self-
correction requirements to adjust for the physical system’s non-linearity and noise
effects. These systems are designed as closed loop control systems and typically have an
associated acceptable tolerance in the extent to which the behaviour can deviate from the
ideal or centre position. The control algorithms [37,38] for these systems are also
designed to accommodate the variability in physical system. The design of ICs used in
such systems can beneficially employ these system level tolerances to identify the
minimum set of components that are required to be protected, thereby reducing the
hardware overhead. However, this significantly increases the safety analysis complexity.
In order to address the complexity issues associated with system level analysis, a
new divide and conquer approach is proposed in this chapter. The approach involves a
two-step process. The first step apportions the system level tolerance to the individual
modules in the system, including the IC (or ICs). The apportioned tolerance information
can then be used to identify the critical flip-flops which need to be protected. Various
techniques used for apportioning of the system level tolerance to ICs and their associated
modules are covered in this chapter. The main contributions of this chapter of the thesis
are: (i) proposing an integrated approach for the functional safety analysis of hybrid
systems (ii) creating a framework for including application level attributes like closed
loop operation and acceptable error consideration for IC safety analysis and (iii)
apportioning of system level tolerances as value tolerance and time tolerance to various
modules in the system.
The rest of this chapter is organized as follows. Section 3.1 gives a brief overview
of related work in this area. Section 3.2 describes the improved safety analysis technique
where a new technique for mapping the application tolerance to digital system as value
and time tolerance is proposed. The section also details how the proposed
technique can be used to include system level tolerance for IC level safety analysis.
Section 3.3 describes the different ways by which the IC level value and time tolerance
can be derived from the application and Section 3.4 concludes the chapter.
Background and Related Work 3.1
Many real-life systems have ICs interacting with physical systems in safety critical
applications. These systems are typically designed as closed loop control systems. A
closed loop system, by its very nature, can correct certain errors since such a system can
build its resilience across subsequent iterations of the control loop. In addition, the
interacting physical system has latency and tolerance. A change in the control value
driving the physical system may not reflect immediately on its behaviour due to the
inherent inertia. As an example, on a system level safety analysis we performed for a
BLDC motor, a stuck-at fault on the control input takes about 92 cycles of closed loop
operation (4.6 ms) to result in a 5% variation in the motor speed. However, incorporation
of the system level artefacts results in significant increase in the analysis complexity. In
this chapter, we propose improved safety analysis techniques to consider these effects
while designing an optimised system which meets functional safety requirements.
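The quoted figures are mutually consistent: at the 20 kHz control-loop rate used in the BLDC experiments of Section 3.3.1.1, 92 loop iterations correspond to 4.6 ms:

```python
loop_rate_hz = 20_000        # control loop frequency (Section 3.3.1.1)
cycles = 92                  # closed-loop iterations before 5% speed drift
print(cycles / loop_rate_hz)  # 0.0046 s = 4.6 ms
```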
Several analytical and statistical methods [65,66,67,68] have also been proposed to
speed-up the functional safety analysis of ICs. In RAVEN (RApid Vulnerability
EstimatioN) [66], the authors propose partitioning the IC into smaller sub-blocks
followed by separate analysis of each sub-block. The sub-block analysis results are then
used to estimate the system level error probability and identification of critical
components. [67] proposes an approach for computation of output error probability using
individual signal probability and fault propagation probability. The authors introduce the
concept of using time tolerance for soft error evaluation and propose the use of Markov
models to estimate error probability after the time tolerance window. [68] proposes a
correlation based approach to identify dependencies of signal probabilities of error free
signals and error probability of erroneous signals to provide higher SER estimation
accuracy.
These methods, however, treat IC safety as a standalone problem without
considering its interaction with the surrounding physical system. Even when system level
effects are considered, they are dependent on a particular state and not on the closed loop
behaviour [66,67]. Hybrid systems, on the other hand, have multiple such states which
are acceptable across different control loop iterations. In the absence of hybrid system
consideration, the analysis will be pessimistic [64].
Multiple research works have focused on implementing robust control in the
presence of errors [69] for generic control systems as well as for specific motor control
systems [70]. These works have been limited to robustness to sensor and actuator errors
within certain parametric bounds and do not consider interactions between the closed
loop modules. There are several observer based approaches employed for fault detection.
VDA consortium (Verband Der Automobilindustrie - German Association of the
Automotive Industry) came up with a monitoring concept to ensure functional safety of
gasoline and diesel engines known as VDA E-Gas architecture [71]. This approach is
widely used in the engine control system of many automobiles [72]. In [73], the use of
residual generator and residual evaluation module for fault detection purposes is
proposed. [63] introduces the concept of using mapped predictive check states for the
detection of transient faults in the controller. These approaches help in online fault
detection, but do not help to localize the fault or take the corrective action. As fail
operational system requirements, (i.e. systems which continue to operate even after a
fault is detected), are becoming more prominent in automotive applications [74], these
approaches have some limitations which restrict their applicability.
The methodology in this chapter addresses some limitations of earlier work
reported in the literature by including some new artefacts into the analysis. (i) Safety
analysis for the IC is performed together with the interacting physical system making the
analysis more reliable. (ii) Application level attributes like closed loop operation and
acceptable error are considered for the analysis making it less pessimistic. (iii) In order to
reduce the analysis complexity due to the additional system information being
considered, a divide and conquer approach is used. Critical flip-flops required to be
protected can be directly identified using this analysis, as against restricting it to just
identifying the system drifting into an unsafe state, thereby enabling the implementation
of low cost monitoring and correction mechanisms. Low overhead design robustness
(either through component hardening or fault tolerance) techniques can then be used for
protecting the right sub-set of critical components (e.g. flip-flops) as against the entire
design.
Improved Safety Analysis Technique 3.2
A representative hybrid system is shown in Figure 3.2. It consists of a physical
system comprising of the motor, power stage providing current to the motor and Hall
sensor determining the motor position. This physical system is controlled by a digital
controller whose main function is to maintain the speed of the motor within an acceptable
range as set using the communication interface.
The digital controller consists of: (i) A CAP (CAPture) [75] module which decodes
the Hall sensor output and provides rotor position information to CPU. (ii) The CPU
receives the speed set-point information and rotor position information of the motor. It
executes the control algorithm to determine the actuation required to maintain the desired
motor speed. (iii) A PWM (Pulse Width Modulator) [76] which provides pulsed inputs to
the power stage which drives the motor. The motor speed is proportional to the PWM
duty cycle. (The feedback control loop adjusts the motor speed under different operating
conditions). Since the digital system and physical system operate in a closed loop and
there is an inherent inertia associated with physical system, the system is robust to some
errors. This is specified as the application tolerance.
3.2.1 Value and Time Tolerance
Hybrid systems are typically associated with a tolerance range around the control
point wherein they can operate in an acceptably correct manner and any minor
perturbation around this point can be corrected during the course of time, (i.e. in the
Figure 3.2. Illustration of a hybrid system.
subsequent operating loops), without the application being perceptibly impacted. We
refer to this as value tolerance. This depicts the set of control values around the correct
control value using which the system can operate in an acceptably correct manner. This
can also be visualized as the maximum steady state error which can be tolerated by the
system.
Additionally, these systems can be slow to react (with respect to the frequency of
the clock used in the digital control). It will take some finite time, (i.e. a few operating
loops), termed as time tolerance, for the system to respond to a change in the control
input, even for real-time applications. Time tolerance depicts the time duration for which
the system is able to function in an acceptably correct manner for a maximum deviation
in the control values. The tolerance in value and time are depicted in Figure 3.3. Beyond
this tolerance range, i.e. beyond a range of control values and / or beyond a given time
duration, the system can drift and go outside the acceptable operating range. The closed
loop operation can either contain or correct many of the errors within an IC without
causing the system to fail.
The combined impact due to value tolerance and time tolerance can be complex
since not all value tolerances may be applicable across the time tolerance windows. In
this evaluation, therefore, we calculate the time tolerance across the entire operating
range for the worst-case variation in the values of control inputs.
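Under this worst-case simplification, a per-module check against the apportioned tolerance tuple {Vt, Tt} might look as follows (a sketch with our own naming: an injected fault is tolerable only if every excursion beyond the value tolerance ends before the time tolerance expires):

```python
# Combined value/time tolerance check. Time tolerance is measured here in
# control-loop iterations; deviations are absolute errors per iteration.
def within_tolerance(deviations, vt, tt):
    """deviations: per-loop |output - reference|; vt: value tolerance;
    tt: time tolerance in loops. Returns False if any excursion beyond
    vt persists for tt or more consecutive loops."""
    run = 0
    for d in deviations:
        run = run + 1 if d > vt else 0
        if run >= tt:
            return False
    return True

print(within_tolerance([0.0, 0.8, 0.8, 0.1], vt=0.5, tt=3))  # True: excursion too short
print(within_tolerance([0.9, 0.9, 0.9, 0.0], vt=0.5, tt=3))  # False: excursion too long
```

A flip-flop whose injected fault makes this check fail at the module output would be marked for protection.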
Figure 3.3. Tolerance in value and time over the control input range.
To understand the tolerance in a motor control system, we performed random fault
injections (explained further in Section 2.5.1) and observed the controlled parameter
(speed) variation with time after the faults are injected. The various profiles thus obtained
are illustrated in Figure 3.4. Case (i) indicates the fault free scenario. Certain injected
faults depicted in Case (ii), (e.g. those causing minor perturbation in closed loop
algorithm co-efficients which are not corrected across iterations), cause perturbation in
the motor speed; however the motor settles to a new speed which is within the acceptable
range. Other faults depicted in Case (iii), (e.g. those in the data computation logic which
get corrected in closed loop iterations), cause perturbation in speed which also gets
corrected within an acceptable time. Faults depicted in Case (iv), (e.g. those in the
program counter which change the control flow causing an unrecoverable error), cause the
motor speed to drift out of the acceptable range, resulting in application failure. Safety
analysis techniques employed to identify the critical elements within the IC should
consider these system level tolerance effects. In its absence, the analysis will be
pessimistic.
Figure 3.4. Motor speed variation for various injected errors.
3.2.2 Divide and Conquer Safety Analysis Approach
We have seen in Section 2.5.1 that fault injection is the preferred technique to
identify the set of critical elements (e.g. flip-flops) which must be protected during the IC
design phase. Fault injection driven simulation is performed wherein the faults are
injected one at a time on every flip-flop inside the circuit, and the corresponding outputs
are observed. If the injected fault in a flip-flop results in an unacceptable change in at
least one output (across value and time) which is not detected by any safety mechanism
within the circuit, then this flip-flop must be protected.
In an ideal scenario, the physical system can be modelled and the key parameters of
this system can also be observed during the fault injection experiments. Such co-
simulation environments have been demonstrated for smaller logic [77] but have not
been adopted for larger ICs due to limitations of slow progression of digital simulation,
longer simulation time due to slower response of the physical system in response to any
change in control input, etc. The increase in the simulation time along with the
requirement to inject a fault in every operating cycle and on every flip-flop of the circuit
significantly increases the computational complexity of such evaluations.
In order to address the simulation complexity issue, we propose a divide and
conquer approach. This is carried out in two steps: (i) Application tolerance is mapped
(apportioned) to individual module outputs as value and time tolerance. (ii) Safety
evaluation is carried for the individual modules using these tolerance values. As an
Figure 3.5. Computation of value and time tolerance.
illustration for the mapping of application tolerance, consider the case of the closed loop
motor control system in Figure 3.5. A ±5% variation is considered the acceptable
tolerance for the motor speed. (This was determined based upon the peak overshoot for
which the control system is designed). In this system, the application tolerance δOT (±5%
variation in speed) has to be mapped to individual module outputs as value and time
tolerance. In the figure, δO1, δO2 and δO3 denote the value and time tolerance (can be
represented as a tuple {Vt, Tt}) at the output of the CAP, CPU and PWM modules.
The process of mapping of the application tolerance to individual circuit modules
can be considered similar to the problem of timing budgeting performed during the SoC
design phase, wherein the timing slack available at the SoC level (due to the clock
frequency as well as the I/Os interacting with the external system) is apportioned to
individual IP modules therein. Timing closure for these modules is carried out
individually using these apportioned budgets and integrated into the SoC. Such a
methodology may not lead to an optimal design as the combined effect of interacting
signals is not considered. However, such a divide and conquer approach is widely
adopted due to practical limitations, (e.g. design and analysis complexity, EDA tool run-
times, etc.), associated with a flat timing analysis approach. This has also motivated the divide
and conquer methodology presented in this chapter.
Once the individual module tolerances are identified, they are used for standalone
fault injection driven module analysis. (A simulation model, (e.g. Verilog netlist), or an
emulation platform can be used). If the injected fault causes the output of the flip-flop to
deviate beyond the value tolerance and beyond the time tolerance, then this flip-flop is
marked for protection. The probability of soft errors is considered low. Recent works
have indicated at least four orders of magnitude difference between SEU (single event
upset) and MBU (multiple bit upset) rate [78]. As a result, in the fault injection
campaigns, only single faults modelling SEUs are considered.
3.3 Evaluation of Value Tolerance and Time Tolerance
Estimation of value tolerance and time tolerance is an important step in divide and
conquer functional safety evaluation approach. Depending upon the system complexity
and model / hardware availability, this can be performed in different ways. This includes
(i) Determination of tolerance using actual system
(ii) Analytical estimation of tolerance values for simple systems
(iii) Simulation based evaluation using higher level models
3.3.1 Determination of Tolerance Using Actual System
Sometimes, functional safety evaluation is performed towards the later phases of
the system development when we already have the digital system and physical system
integrated. There may also be cases when the accurate model of the system is not
available and it may be easier to perform the evaluation using the end system itself. In
such scenarios, we may have to evaluate the tolerances using the actual system. This
section explains the steps involved to evaluate the value and time tolerance for the CPU
module using the actual system.
Consider a motor control system as shown in Figure 3.6. We can map the output
application tolerance δOT to value and time tolerance at various module interfaces. In
Figure 3.6. Closed loop control system operation.
order to estimate the time tolerance, values corresponding to maximum error, (i.e. the
CPU outputs generating the maximum and minimum possible actuator input values in
this case), are forced and the time required for the motor to go beyond the ±5% tolerable
range of speed variation is obtained. i.e. for a motor operating speed of 900 RPM, the
lower and upper speed limits are set at 855 and 945 RPM respectively. Similarly, to
ascertain the value tolerance, the maximum and minimum permissible CPU output values
during the steady state operating condition, are ascertained. This process can be repeated
at the output of each module to determine the value and time tolerance of individual
modules. We enumerate the steps to compute the time tolerance and value tolerance of
CPU below.
Time tolerance:
(i) Force the maximum value at the CPU output (data bus interface of the module
forced to 0xFF…) during application execution.
(ii) Determine the time taken by the application to go out of acceptable application
tolerance (Tt-max).
(iii) Force minimum value at the CPU output (data bus interface of the module
forced to 0x0…) during application execution.
(iv) Determine the time taken by the application to go out of acceptable application
tolerance (Tt-min).
(v) Time tolerance Tt = min (Tt-max, Tt-min).
Value tolerance:
(i) Strobe the CPU output during application execution for the entire operating
speed of the motor (corresponding to the full range of outputs of the PWM
module). The module output changes within a particular range due to steady
state error of the control system. (We could have varied the CPU output to
determine the maximum and minimum acceptable values. However, for ease of
implementation, we took a more conservative approach of determining the value
tolerance based on the maximum and minimum values the particular output takes
during execution. This will still result in reduced pessimism).
(ii) Determine the maximum (Vt-max) and minimum (Vt-min) values at the module
output.
(iii) Value tolerance Vt = (Vt-max - Vt-min).
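The steps above reduce to simple min/max arithmetic. In the sketch below, the 900 RPM operating speed and ±5% limits follow the text; the forced-output loop counts and strobed CPU output samples are illustrative placeholders, not measured data:

```python
# Acceptable speed band: 900 RPM +/- 5% gives 855 and 945 RPM limits.
nominal_rpm = 900
lower, upper = nominal_rpm * 95 / 100, nominal_rpm * 105 / 100
print(lower, upper)  # 855.0 945.0

# Time tolerance: worst case of forcing the CPU output to all-ones (0xFF...)
# and all-zeros (0x0...); loop counts here are illustrative.
tt_max, tt_min = 120, 92
Tt = min(tt_max, tt_min)

# Value tolerance: spread of the strobed CPU output in steady state
# (illustrative samples).
observed_outputs = [5020, 5075, 5044, 4990]
Vt = max(observed_outputs) - min(observed_outputs)
print(Tt, Vt)  # 92 85
```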
This section gives two representative examples, BLDC motor control and AC
inverter, to illustrate how the tolerance information can be derived from the end system.
3.3.1.1 BLDC Motor Control System
The experimental setup for deriving the tolerance for BLDC motor control system
consists of a three phase eight pole BLDC motor driven by a voltage source inverter
(DRV8312) [79] and Microcontroller (F2805x) [80] shown in Figure 3.7. The power
stage of the inverter is actuated by a Pulse Width Modulator (PWM) [76] which is part of
the Microcontroller (MCU). The MCU has a CPU which executes the closed loop control
algorithm to drive the BLDC motor to rotate at the commanded speed. The BLDC motor
is equipped with three Hall sensors which produce voltage outputs depending on the rotor
position. The CAPture (CAP) module [75] of the MCU uses these outputs to estimate the
rotor position. The motor speed is estimated based on the rate of change of position.
The data flow of the application is shown in Figure 3.8. The CPU executes the PID
(Proportional Integral Derivative) control algorithm every time it receives an interrupt
from the timer. For every Interrupt Service Routine (ISR), the CPU samples the CAP
input and estimates the motor speed and position. This information, along with the
desired speed input received from the communication interface, is used to determine the
magnitude of current to be applied on the three phase windings for torque generation.
For this experiment, the CPU frequency is 80 MHz and the control loop frequency is
20 kHz. To demonstrate application tolerance, we have selected a speed tolerance of 5%,
which corresponds to the peak overshoot for which the control system is designed.

Figure 3.7. DRV8312, F2805x and BLDC motor.
During closed loop operation, there can be changes in operating conditions (e.g.
load), changes in commanded speed, errors in the sampling of input values (e.g. sensor
linearity error), etc. The control algorithm execution depends upon these input
parameters, which are captured into an associated memory and an associated set of states.
(The integral part of the Proportional Integral Derivative (PID) control keeps a memory
of the input values.) Their combined effect creates multiple different application threads
for the single application. These different application threads can drive multiple {Vt, Tt}
tuples, since different variations in the value (and time) tolerances can affect the
individual modules differently. As a result, the complexity of the analysis can increase,
requiring multiple threads of execution across the fault injection experiments. In our
work, we have considered the worst case value at the output and estimated the time
tolerance.

Figure 3.8. Dataflow of closed loop motor control application.
In order to evaluate time tolerance at the CPU output, we inject faults leading to
worst case variation at the CPU output and determine the number of cycles required for
the motor speed to move out of the application tolerance range. The CPU provides a 32
bit binary value to the PWM module. To compute the time tolerance at the CPU output,
values of 0x0 (denoting zero duty cycle for PWM i.e. zero electrical energy applied to the
motor) and 0xFFFFFFFF (denoting 100% duty cycle i.e. maximum electrical energy
applied to the motor) are forced on the CPU outputs.
The time tolerance results thus obtained are plotted in Figure 3.9. The X-axis
denotes the motor speed in RPM (Revolutions Per Minute) under different operating
conditions and the Y-axis denotes the number of closed loop cycles required for the
motor to go out of the tolerable speed range. Two different values of time tolerance,
corresponding to minimum and maximum values of CPU output, are calculated at
different speed settings of the motor. These results indicate that the closed loop motor
control application has a time tolerance of 92-228 cycles of closed loop operation (which
is equivalent to 276K to 684K cycles of CPU operation, translating to 4.6 to 11.4 ms),
which corresponds to ±5% speed variation for different operating speed settings. The
application can operate in an acceptably correct manner for at least 92 cycles of closed
loop operation across the full operating range of the motor, in the presence of worst case
error in the system.

Figure 3.9. Application time tolerance in the presence of worst case errors.
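The worst case fault injection experiment can be mimicked at a high level; a minimal sketch, assuming a simple first order decay model for the motor speed with an illustrative rate constant a (the real experiment uses the actual motor and injects faults on hardware, so the constants below are not taken from the thesis setup):

```python
import math

def time_tolerance_cycles(set_speed, a=5.0, dt=1 / 20000.0, tol=0.05):
    """Count closed loop cycles (20 kHz loop) until the motor speed, with
    the CPU output forced to the worst case of zero duty cycle, leaves the
    +/-tol band around the set speed. First order model: ds/dt = -a*s."""
    speed = set_speed                      # steady state before the fault
    cycles = 0
    while abs(speed - set_speed) / set_speed <= tol:
        speed *= math.exp(-a * dt)         # exact one-step solution
        cycles += 1
    return cycles

print(time_tolerance_cycles(1650.0))  # → 206 with these illustrative constants
```

With this model the cycle count is independent of the set speed; the measured spread of 92-228 cycles in the real system reflects operating-point dependent dynamics that the toy model does not capture.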
In order to compute value tolerance, the CPU output (PWM input) is monitored
during application execution. Due to variations in the operating conditions and error in
position / speed determination, there is a slight variation in the sensed speed and the CPU
output will change based on the algorithm being executed. The tolerable variation (steady
state error) in CPU output thus obtained for different speed settings is tabulated and its
variation from the value corresponding to desired speed is noted. The minimum
(maximum) tolerance is observed at an RPM of 1650 (2400). The tolerable CPU output
variation at the RPM of 1650 (2400) is 0.54-0.568 (0.785-0.842) which equals a variation
of 5% around 0.55 (7% around 0.8). The CPU output tolerance expressed as a percentage
of CPU output value is shown in Figure 3.10. Here the X-axis denotes the different motor
speeds, and the Y-axis denotes the CPU output value tolerance.
3.3.1.2 AC Inverter System
We next used a digital power system to evaluate the value and time tolerances. We
used a microcontroller (MCU) based AC inverter development kit [81] for determining
the tolerance values. The experimental setup consists of an AC load driven by an MCU as
shown in Figure 3.11. The MCU has a CPU which executes the closed loop control
algorithm to drive the AC load (e.g. a motor) at the commanded voltage and current
levels. The AC load voltage and current are measured by the MCU using the on chip
Analog to Digital Converter (ADC).

Figure 3.10. Application value tolerance as percentage of CPU output value.
The dataflow of the AC inverter application is similar to that of the BLDC motor
control application and is represented in Figure 3.12. The timer module periodically
generates an interrupt which triggers the ADC module to sample the load current value.
The measured load current value is compared with the reference value to determine the
control error. The CPU periodically executes the Proportional Integral (PI) control
algorithm to minimize this error. The CPU output drives the Pulse Width Modulator
(PWM), which controls the power stages that determine the voltage / current provided to
the AC load. For this experiment, a CPU frequency of 60 MHz and a control loop
frequency of 20 kHz are chosen. We have selected a ±4% output tolerance (δOT), which
corresponds to
the peak overshoot in current for which the control system is designed. We need to
compute the value and time tolerance tuple δCT at the CPU output.

Figure 3.11. AC inverter application kit.
Figure 3.12. AC inverter dataflow diagram.
In order to evaluate time tolerance at the CPU output, we inject faults leading to
worst case variation at the CPU output and determine the number of cycles required for
the Root Mean Square (RMS) value of AC inverter load current to move out of the
application tolerance range. The CPU provides a 32 bit binary value to the PWM module.
To compute the time tolerance at the CPU output, values of 0x0 (denoting zero duty cycle
for PWM i.e. zero electrical energy applied to the AC inverter load) and 0xFFFFFFFF
(denoting 100% duty cycle i.e. maximum electrical energy applied to the AC inverter
load) are forced on the CPU outputs.
The time tolerance results thus obtained are plotted in Figure 3.13. The X-axis
denotes the normalized value (output voltage of 0-20V is normalized to a range from 0 to
1) of set-point voltage and the Y-axis denotes the number of closed loop cycles required
for the inverter to go out of the tolerable current range. Two different values of time
tolerance, corresponding to minimum and maximum values of CPU output, are calculated
at different voltage settings of the AC inverter. These results indicate that the closed loop
AC inverter application has a time tolerance of 10-55 cycles of closed loop operation
(which is equivalent to 30K to 165K cycles of CPU operation, translating to 0.5 to 2.75
ms), which corresponds to ±4% RMS current for different operating voltage settings.
The application can operate in an acceptably correct manner for at least 10 cycles of
closed loop operation across the full operating range of the AC inverter, in the presence
of worst case error.

Figure 3.13. Application time tolerance in the presence of worst case errors.
It can also be observed that the tolerance range for the BLDC motor control system is
about 10 times higher than that of the AC inverter system. This is expected, since the
BLDC motor control physical system involves mechanical components, whereas the AC
inverter system is purely electrical. The response of a mechanical system is sluggish
compared to that of an electrical system.
In order to compute value tolerance, the CPU output is monitored during
application execution. Due to variations in the operating conditions and errors in output
voltage determination, there is a slight variation in the sensed values, and the CPU output
changes based on the algorithm being executed. The tolerable variation (steady state
error) in CPU output thus obtained for different set-point settings is tabulated, and its
variation from the value corresponding to the desired set-point is noted. The minimum
(maximum) tolerance is observed at a set-point of 0.2 (0.15). The tolerable CPU output
variation at the set-point value of 0.2 (0.15) is 0.292-0.321 (0.219-0.273), which equals a
variation of 9% around 0.307 (21% around 0.246). The CPU output tolerance expressed
as a percentage of CPU output value is shown in Figure 3.14. Here the X-axis denotes the
different set-point values for output voltage, and the Y-axis denotes the CPU output value
tolerance.

Figure 3.14. Application value tolerance as percentage of CPU output value.
3.3.2 Analytical Estimation of Tolerance Values
It may not always be practical to evaluate the tolerances on the actual system (e.g. the
system may not be available), and we may have to resort to other techniques. If
the system model is simple, we can evaluate the tolerance analytically. In this section, we
use a simple first order control system as indicated in Figure 3.15 to illustrate analytical
estimation of value tolerance and time tolerance.
Consider the scenario where the system gets input set-point value from an external
source and the controller will modulate the input to the physical system based on the
difference in value between the set-point input and physical system output. Consider an
output tolerance of δOT. We need to map this application output tolerance δOT to tolerance
δCT at the controller output. δCT is a tuple consisting of both value and time tolerance.
In order to simplify the computation, we assume that the system is operating in
steady state at $t = 0$, with $u(0) = u_0$ and $y(0) = \frac{b}{a}u_0$. Consider the scenario where a
fault in the controlling digital system causes its output to change from $u_0$ to $u_1$ at time
$t = 0$. We need to derive the system evolution over time due to this change in the digital
system output to compute the value tolerance and time tolerance.
The transfer function of the first order physical system is
\[ \frac{Y(s)}{U(s)} = \frac{b}{s+a} \qquad (1) \]
The equivalent differential equation for the system can be written as
\[ \frac{d}{dt}y(t) = -a\,y(t) + b\,u(t) \qquad (2) \]
with initial conditions $u(0) = u_0$ and $y(0) = \frac{b}{a}u_0$.

Figure 3.15. First order system.
We need to solve the differential equation in (2) to identify how the system evolves
due to a change in control input from $u_0$ to $u_1$. On solving (2),
\[ y(t) = e^{-at}\int_0^t b\,e^{ax}u(x)\,dx + \frac{b\,u_0\,e^{-at}}{a} = \frac{b\,u_0\,e^{-at}}{a} + \frac{b\,u_1\,e^{-at}\left(e^{at}-1\right)}{a} \qquad (3) \]
This evolution of the system due to the change in digital system output can be used
to determine the time tolerance and value tolerance. The reference, input and output
values of the system are represented in Figure 3.16.
Time Tolerance Determination
Time tolerance is defined as the maximum time for which the physical system
output remains within the application tolerance range for a worst case variation in the
digital system output. Assuming the worst case error is $u_1$, and $y_1$ is the output value
at the boundary of the application tolerance range,
\[ \text{Time Tolerance} = \frac{1}{a}\log\!\left(\frac{b\,u_1 - b\,u_0}{b\,u_1 - a\,y_1}\right) \qquad (5) \]
Figure 3.16. Input and output values of the control system.
Value Tolerance Determination
Value tolerance is defined as the maximum error in the digital system output for
which the physical system output is within the defined application tolerance δOT even if
the error persists for an infinitely long time. Assuming $y_1$ is the acceptable output
considering the application tolerance (i.e. the maximum variation that can be allowed for
the physical system),
\[ \text{Value Tolerance} = \frac{a\,y_1}{b} \qquad (4) \]
We have used the simplifying assumption that the system continues without any
change in load or control input throughout the evaluation period.
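The two closed-form expressions can be checked numerically; a small sketch with illustrative constants $a$, $b$, $u_0$ (chosen for the example, not taken from any real system), verifying that the output trajectory from (3) reaches the tolerance boundary exactly at the computed time tolerance:

```python
import math

def value_tolerance(a, b, y1):
    # Eq. (4): controller output whose steady state b*u/a equals y1
    return a * y1 / b

def time_tolerance(a, b, u0, u1, y1):
    # Eq. (5): time for y(t) in Eq. (3) to reach the tolerance boundary y1
    return (1.0 / a) * math.log((b * u1 - b * u0) / (b * u1 - a * y1))

# Illustrative first order system: a = 2, b = 4, steady state input u0 = 1
a, b, u0 = 2.0, 4.0, 1.0
y0 = (b / a) * u0                 # nominal output: 2.0
y1 = 1.05 * y0                    # +5% application tolerance boundary
u1 = 2.0                          # worst case faulty controller output

# Eq. (3) rewritten as y(t) = b*u1/a + (b/a)*(u0 - u1)*exp(-a*t)
t = time_tolerance(a, b, u0, u1, y1)
y_t = b * u1 / a + (b / a) * (u0 - u1) * math.exp(-a * t)
print(round(y_t, 6), round(y1, 6))  # both 2.1: boundary reached at t
```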
3.3.3 Determination of Tolerance Using High Level Models
Analytical estimation of value and time tolerance becomes impractical as system
complexity increases. In such cases, higher level models can be used to determine the
tolerance values. In this section, we show how tolerance can be evaluated from higher
level models using a representative system: a Permanent Magnet Synchronous Motor
(PMSM) driven by a traditional PID controller.
We modelled a Permanent Magnet Synchronous Motor (PMSM) driven by a
traditional PID controller, as shown in Figure 3.17, using Matlab Simulink [82]. The
PMSM output tolerance δOT needs to be mapped to the controller output tolerance
δCT. The control loop is executed at 10 kHz and the motor is set to run at a speed of 40
revolutions per second (rps). We added a fault injection infrastructure between the
controller and the PMSM. The infrastructure supports two types of injection: (i) forcing
the output to zero and maximum value, which is used for determining the time
tolerance; (ii) increasing and decreasing the input to the motor by a constant value,
which is used for determining the value tolerance.

Figure 3.17. PMSM control system.
In the simulation setup, the output is first forced to zero and maximum value. The
worst case (i.e. the one with the lesser tolerance) output waveform thus obtained is
shown in Figure 3.18. A zoomed in version of the figure, to aid computation of the time
tolerance, is shown in Figure 3.19. The time tolerance determined from the figure is
308 µs.

Figure 3.18. PMSM output in the presence of worst case error.
Figure 3.19. Magnified version of PMSM output in the presence of worst case error.

We then forced different values at the input of the PMSM using the fault
injection setup shown in Figure 3.17 and observed the change in motor speed. We found
the value which is required for the motor to be just within the application tolerance;
this is called the value tolerance. The output thus obtained is shown in Figure 3.20. The
value tolerance thus obtained is 33.3%, i.e. a variation in the peak to peak value of the
digital system output from 0.15 to 0.20 which keeps the application output within 5%
variation. It can also be noticed that the error tries to pull the output away from its
stable value while the closed loop action tries to bring it back, leading to a sinusoidal
output.
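The 33.3% figure follows directly from the injected band; a one-line check:

```python
# Peak to peak digital system output band found by fault injection
u_low, u_high = 0.15, 0.20
value_tol_pct = (u_high - u_low) / u_low * 100  # relative to the nominal 0.15
print(round(value_tol_pct, 1))  # 33.3
```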
3.4 Conclusion
We have used a motor control example to demonstrate the tolerance present in
practical systems. However, directly utilizing application tolerance for IC safety
analysis is not practical due to the significant increase in analysis complexity. We
propose a divide and conquer approach to address the practical limitations in performing
such an evaluation during the IC design phase. In this approach, the application tolerance
at the system level is apportioned to individual modules as value and time tolerances,
thus enabling the IC safety analysis to use this information and reduce the pessimism.
We have also discussed the different approaches by which the application tolerance can
be mapped to the IC level.
Figure 3.20. PMSM output used for determining value tolerance.
4. Formal Verification Based Approach for
Accurate Safety Analysis
Once the application tolerance is apportioned to individual modules, the analysis
can be performed on these modules standalone. Fault injection is widely used in the
industry to provide a metric on the suitability of a circuit module for use in safety
critical applications. Current fault injection methods used in industry do not provide a
systematic way to include application specific tolerance, thereby resulting in pessimistic
evaluation. In addition, the evaluation is not exhaustive as the workloads used for fault
injection typically cover only a small subset of the functional scenarios (conditions and
states) possible. Ensuring sufficiency of the workloads used for fault injection based
safety evaluation is an open problem. Other practical limitations of fault injection based
techniques are also highlighted in recent works [66,83].
Formal verification techniques have been proposed in the literature as an alternative
to simulation methods using fault injection. However, they introduce new challenges: the
requirement to model the complete set of properties for the system makes them effort
intensive, and the absence of application related input constraints leads to almost all the
elements being identified as critical.
Real-life systems are complex and pose the challenge of not being able to
comprehensively categorise their constituent components as safe or dangerous. This is
because these systems must include physical components (which are inherently
analog in nature), and the accuracy of the analysis depends upon the performance
deviations which can be tolerated under different use / application conditions. This
chapter proposes new methods for safety analysis of such systems. Its main contributions
are: (i) capturing the application diversity as input constraints (values and sequences);
(ii) modelling application specific tolerances as an output range across time intervals;
(iii) incorporating formal techniques into the analysis framework to make it more
accurate and less pessimistic than is possible using existing methods. Results of this
analysis are presented on ITC benchmark circuits and two industrial circuits.
The rest of this chapter is organized as follows. Section 4.1 briefly overviews
various techniques for safety analysis with specific focus on formal verification based
techniques. Section 4.2 introduces the proposed safety analysis framework. Experimental
results using the proposed approach are presented in Section 4.3 and Section 4.4. Section
4.5 concludes the chapter.
4.1 Background and Related Work
Safety analysis is required to ascertain the safe behaviour of a system under
different operating conditions. Several analysis techniques exist at different design stages
and abstraction levels (e.g. hardware / circuit level, software / application level, system
level etc.). Workload based fault injection [84,85] is widely practiced in industry to
ascertain the robustness of a circuit to faults. However, experimental evaluation presented
in Section 2.5.1 indicated lack of comprehensive workloads as a major limitation in
addressing the quality requirements for functional safety evaluation. This motivated us to
explore the use of formal property checking methods.
The use of formal property checking for safety analysis was first proposed in [86],
wherein the deterministic occurrence / non-occurrence of events in the presence of
modelled faults was analysed based on properties derived from the specification. [87] proposed the
use of model checking for identifying latches which must be protected. The analysis was
based on creating equivalent formal models, considering an SEU on each latch, one at a
time. A failing property indicates a dangerous latch, which must be protected. Properties
internal to the circuit and at the circuit outputs are considered. The application behaviour
and tolerances are not considered, resulting in pessimistic results. Besides, these methods
require a set of golden properties which fully capture the circuit behaviour. In cases
where verification is not performed using formal methods, significant effort is needed to
code these properties.
The work in [88,89] uses bounded model checking along with module duplication
(one copy is golden and the other is fault injected) and properties based upon output
mismatch. If the mismatch property is satisfied, the flip-flop on which the fault is injected
is classified as dangerous. The process is repeated for each flip-flop in the module. Using
this approach, no circuit behaviour specific properties are needed. [90] uses a similar
approach to check for robustness of the error detection logic in the presence of faults in
the main circuit. The above methods have not considered application specific scenarios,
e.g. input constraints, sets of valid and invalid input and internal state sequences, and
output constraints and tolerances. As a result, the above approaches can be pessimistic,
thereby identifying more dangerous elements.
[91] discusses the impact of imperfect hardware (structural defects) on application
performance. However, it approaches the problem from the manufacturing test
perspective, and the safety implications of the imperfect hardware are not considered.
[67] proposes the use of
application tolerance provided the output recovers within a specified time as determined
by the application. Equal probabilities are assigned to flip-flop states (0 or 1), and the
output error probability for a single transient fault on all combinational gates and flip-
flops is computed. The proposed approach is static in nature and dynamic input
sequences (including closed loop feedback behaviour) are not considered. [92] proposes
application of system level considerations in performing the safety evaluation, owing to
substantial error masking occurring at higher design abstractions. The methodology
employs fault injection on selected workloads for identifying the contribution of
individual gates to the system level FIT. This approach solves the problem of safety
analysis of standalone applications with limited workload.
All the above approaches help to generate a quantitative metric for safety (in terms
of which flip-flops are safe and which are dangerous) based upon the circuit construction.
The work presented in this chapter augments the solution space to provide a more
accurate safety metric by incorporating these additions: (i) Specific input constraints and
sequences to match functional workloads are considered. (ii) Tolerances in the output
values and time required to attain them are included. (iii) Safety requirements across
different modules are budgeted more accurately by including the physical system
behaviour.
It is therefore important to consider more comprehensive workloads and perform an
exhaustive fault injection. This is done in two steps. (i) Workload specification using
input constraints (input values and relationship between values) and input sequences.
(This can also be represented using FSMs). (ii) Analysis covering SEU injection on each
flip-flop across all cycles of execution.
4.2 Improved Safety Analysis Framework
The proposed safety evaluation framework is shown in Figure 4.1. Two copies of
the circuit are created. A fault is injected in one copy by conditionally flipping the input
of each of its flip-flops. This is achieved by XOR-ing the inputs to the flip-flops with an
additional signal driven as a primary input (denoted error_input). The number of such
error_input ports is equal to the number of flip-flops. Incisive Enterprise Verifier (IEV)
from Cadence [93] is used for the safety evaluation. IEV performs this injection
exhaustively, i.e. it endeavours to inject a pulse at any one of the error_input ports over
any one cycle of the evaluation, causing the safety property to be verified exhaustively.
This is in contrast to simulation, wherein such an analysis would require 2*N*M runs,
where N is the number of flip-flops and M is the number of cycles of evaluation.
We now explain how the circuit and its inputs and outputs are represented in IEV
for safety analysis. Input constraints and sequences are modelled as an FSM. It represents
the conditions required for workload execution by emulating functional scenarios
corresponding to register settings (constraints), periodic register updates, relation
between register settings, etc. The output representation includes tolerances in the range
of output values and the duration for which outputs can be in error. These are captured as
assertions against which the circuit robustness is verified. The error_input ports used for
fault injection must satisfy the following properties during safety evaluation: (i) only one
error input is modified at a time; (ii) this error input pulses only once during the
evaluation.

Figure 4.1. Illustration of FV based safety analysis.
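The exhaustive injection can be mimicked in simulation for a toy circuit; a minimal sketch (a 4-bit accumulator with a ±4 value tolerance), purely illustrative and unrelated to IEV's internal algorithms: every flip-flop bit is flipped at every cycle, and a bit is marked safe only if no injection ever drives the output outside the tolerance band.

```python
def run(inputs, flip_bit=None, flip_cycle=None):
    """4-bit accumulator; optionally flip one state bit (SEU) at one cycle."""
    state, trace = 0, []
    for cycle, x in enumerate(inputs):
        state = (state + x) & 0xF           # next-state logic (mod-16 add)
        if cycle == flip_cycle:             # SEU: XOR at the flip-flop input
            state ^= (1 << flip_bit)
        trace.append(state)
    return trace

def safe_bits(inputs, tol=4):
    """A bit is safe if, for every injection cycle, the injected output
    stays within +/-tol of the golden output (value tolerance)."""
    golden = run(inputs)
    safe = []
    for bit in range(4):
        ok = all(
            all(abs(g - i) <= tol for g, i in zip(golden, run(inputs, bit, c)))
            for c in range(len(inputs))     # exhaustive over injection cycles
        )
        if ok:
            safe.append(bit)
    return safe

print(safe_bits([1, 2, 3, 1]))  # → [0, 1, 2]
```

Even this toy exhaustive loop is N×M simulation runs; the formal tool instead proves the property over all injection choices symbolically.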
A sample IEV property file for SEU analysis is shown in Figure 4.2. The property
specification for flip-flop number 64 (from amongst 66, numbered from 0 to 65) is
shown. Part 1 indicates that all the bits except the 63rd bit of the error_input are
constrained to zero. Part 2 indicates that once the value of each bit in the error_input is
high, it must go low in the next cycle and remain low throughout. Part 1 and Part 2
together ensure that the 63rd bit of the error_input will be high for exactly one cycle
during the entire formal verification run. Once a bit in the error_input is made high for
one cycle, the flip-flop corresponding to that bit will have its output flipped. This
emulates the behaviour of SEUs in flip-flops. Clocks driving the flip-flops must be
enabled to allow the flip-flop to capture the toggled value at the error input. IEV injects
the faults on
the cycle in which the clock is enabled so as to cause the property to fail. Part 3 indicates
the specification of the tolerance property. The tolerance can be specified in terms of
time (the outputs of the reference and injected modules can differ for a certain time
duration), as indicated in Part 3(i); in terms of value (the outputs of the reference and
injected modules can differ in each cycle by a certain value), as indicated in Part 3(ii); or
as a combination of both. Part 3(iii) indicates the tolerance specification where the PWM
output can be wrong for two PWM pulses and needs to be correct afterwards.

Part 1 // Constraining inputs except for pin 64
const -add -pin error_input[62:0] 0
const -add -pin error_input[65:64] 0

Part 2 // Create a single pulse for error input of pin 64
generate
for (ip = 0; ip <= 65; ip = ip + 1) begin : generic
  fall_all_zero: assume always (fell(error_input[ip]) -> always (error_input[ip] == 0))
    @(posedge clk);
  rise_edge_zero: assume always (error_input[ip] -> next (!error_input[ip]))
    @(posedge clk);
end
endgenerate

Part 3 // Properties to be checked
(i) output_always_equal: assert property ( @(posedge clk) $rose(out_compare) |=>
    ##[0:7] $fell(out_compare));
(ii) output_always_equal: assert always (((out_inject <= out_golden + 4) &&
    (out_inject >= out_golden - 4))) @(posedge clock);
(iii) {fell(onehot(error_input)); rose(sig_pwma_sync_G); rose(sig_pwma_sync_G)} |=>
    (sig_pwma_sync_G == sig_pwma_sync_I) [*] @(posedge clk);

Figure 4.2. IEV property specification.
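A trace-level analogue of the time tolerance assertion in Part 3(i) can be sketched in a few lines; a minimal illustration (not the SVA semantics themselves), where a mismatch between golden and injected outputs is tolerated as long as no mismatch run exceeds the allowed number of cycles:

```python
def time_tolerance_holds(golden, injected, max_err_cycles=8):
    """Trace-level analogue of the Part 3(i) assertion: any run of
    cycles where the two outputs differ must end within max_err_cycles."""
    run = 0
    for g, i in zip(golden, injected):
        run = run + 1 if g != i else 0   # length of current mismatch run
        if run > max_err_cycles:
            return False                 # error persisted too long
    return True

print(time_tolerance_holds([1] * 10, [1, 1, 2, 2, 2, 1, 1, 1, 1, 1]))  # True
```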
We now illustrate the application of our method to a few circuits. We begin with a
set of ITC benchmark circuits [94] and then extend the analysis to two industrial circuits
[76,75].
4.3 Analysis on Benchmark Circuits
The proposed framework is first used on various ITC benchmark circuits [94].
Output value tolerances of ±2 and ±4 are considered. The tolerance values for the
benchmark circuits are specified to illustrate the application of the proposed method and
do not depend on the circuit functionality. The number of flip-flops identified as safe
under these error conditions is evaluated. Flip-flop savings is calculated by comparing
the number of safe flip-flops thus obtained against the total number of flip-flops. (This
indicates that there is no cost of meeting safety for these flip-flops.) The results on five
ITC benchmark circuits are shown in Table 4.1. The percentage of flip-flops marked as
safe and the evaluation time for the entire circuit for two conditions of output value
tolerance (±2 and ±4 over the correct output) are also indicated.
Table 4.1. Safety analysis on benchmark circuits.

Circuit | #flip-flops | Tolerance = ±2                  | Tolerance = ±4
        |             | #safe FF | % Savings | Time (s) | #safe FF | % Savings | Time (s)
b03     | 30          | 7        | 23.33     | 53       | 12       | 40.00     | 56
b04     | 66          | 4        | 6.06      | 192      | 6        | 9.09      | 182
b08     | 21          | 4        | 19.05     | 33       | 6        | 28.57     | 32
b10     | 17          | 2        | 11.76     | 36       | 3        | 17.65     | 31
b11     | 31          | 3        | 9.68      | 64       | 4        | 12.90     | 69
Average |             |          | 13.98     |          |          | 21.64     |
4.4 Analysis on Industrial Modules
The proposed framework is then applied to two industrial modules, PWM and
ECAP in the system context described in Figure 4.3. Here, the variation in magnetic field
of the rotor is measured using a Hall Effect sensor. The output of Hall Effect sensor is fed
to the ECAP module which decodes the motor speed. The reference value for the motor
speed is set using software. The PID controller adjusts the PWM pulse width based on the
difference between the observed and reference motor speed values.
The PWM module has different operating modes like up count mode, down count
mode, asynchronous trip protection configuration, dead band configuration, etc. Similarly
the ECAP module has different operating modes like single-shot mode, continuous mode,
edge polarity selection, etc. These are one time configuration parameters and are
configured in the beginning. Once the configuration is over, the module starts its closed
loop operation. The periodic operation of measuring the motor speed, calculating the new
PWM pulse width value and updating the new value is typically called one period of the
closed loop operation. An input model has been constructed to capture various input
constraints and sequences for the PWM corresponding to the operating modes (up mode,
down mode, up-down mode), input range (pulse duration, pulse width) and relationship
between the inputs (pulse width and pulse period). The ECAP module provides an
interrupt to the PID controller once the sensor decode is complete. An FSM is created for
the ECAP module to periodically read the register contents from both the injected and
reference modules, based on the interrupt. The register values thus obtained from the
injected and reference modules are compared to determine the error condition.

Figure 4.3. Reference application.
The lifetime analysis of PWM and ECAP module flip-flops indicates that they can
be categorised into three groups, namely: (i) flip-flops which are updated only once
during application initialization; (ii) flip-flops which are updated once during every
period of closed loop operation; (iii) flip-flops forming the datapath, which are updated
many times during one period of closed loop operation.
A few knobs have been used to manage the complexity of the analysis. (a) Only one error occurs during the evaluation window. (b) Proper modelling of inputs and outputs to contain the state space explosion problem. The improvement with input modelling is supported by the fact that analysis of a similar sized ITC benchmark circuit (e.g. b12 with 121 flip-flops) with unconstrained input could not complete. (c) The lifetime analysis of the flip-flop values illustrates that any SEU will manifest as an output deviation in the first period of closed loop operation and will either get corrected from the second period onwards or will continue to cause errors. These optimization knobs helped to perform standalone analysis of these modules and reduce the sequential depth of the safety properties.
A reference property thus modelled for PWM after application of these optimization
knobs is shown in Part 3(iii) of Figure 4.2. The property asserts the requirement for the injected PWM output to be the same as the reference PWM output in the second period of closed loop operation following the fault injection.
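The intent of this property can be mimicked at the trace level. The sketch below is an illustrative simplification of the formal property: the cycle-accurate output traces, the fixed period length and the alignment of the injection cycle to a period boundary are all assumptions.

```python
# Sketch: after a fault injection, a deviation of the injected PWM output
# from the reference output is tolerated only during the first closed loop
# period following the injection; from the second period onwards the two
# outputs must match cycle by cycle.

def property_holds(ref_trace, inj_trace, period_len, inject_cycle):
    check_from = inject_cycle + period_len   # start of the second period after injection
    return all(r == i for r, i in zip(ref_trace[check_from:], inj_trace[check_from:]))
```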
The results thus obtained for PWM and ECAP modules are shown in Table 4.2.
The percentage of flip-flops which are marked as safe and the evaluation time for the entire circuit are indicated.

Table 4.2. Safety analysis on industrial modules.

Circuit  Mode     #FF  #safe FF  Time (S)  % Savings  Total safe FF  % Total savings
PWM      up       56   24        3132      42.86      18             32.14
PWM      down     56   24        12787     42.86
PWM      up-down  56   20        90526     35.71
ECAP     int      202  173       165338    85.64      173            85.64

Note that the PWM module analysis had to be split into three
operating modes to manage the complexity. The number of safe flip-flops for the PWM
module is arrived at by doing an intersection across the three operating modes. The
savings are 32% and 85% for PWM and ECAP modules respectively.
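The per-mode intersection used to arrive at the PWM count can be expressed directly over sets of safe flip-flops; the flip-flop names below are hypothetical placeholders.

```python
# Sketch: a flip-flop is safe for the module only if it is safe in every
# operating mode, i.e. the intersection of the per-mode safe sets.

def safe_across_modes(per_mode_safe_sets):
    return set.intersection(*per_mode_safe_sets)
```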
The approximation used for ECAP and PWM (all flip-flops having a wrong value will either be corrected in the first period of closed loop operation following the fault injection or will remain in error forever) helped to reduce the analysis complexity. However, this approximation cannot be generalized and applied to other circuits. In its absence, we found that the proposed approach is not scalable to large circuits. We tried different
techniques like design partitioning, transforming system level properties to module level
properties etc. to reduce the analysis complexity and make the formal verification based
safety analysis approach tractable. We were only partly successful. Due to the inherent
limitation of design space explosion with formal verification based approaches, we
decided to focus on simulation based techniques to augment formal verification based
approach for handling large circuits.
4.5 Conclusion
In this chapter, we have proposed improved methods for accurate safety analysis of
real-life systems. Some of the limitations of earlier methods have been overcome and
new capabilities have been created. A system view has been used to perform a systematic
analysis, taking into account application specific workloads and error tolerances. The
physical system behavior is also included and it is shown how the safety budget for individual modules can be suitably set to obtain a more accurate (less pessimistic) reliability assessment, and hence incur a lower design cost for robustness. A formal
verification framework has been used and suitable properties which account for
application specific behaviors are illustrated. Experimental results are provided on
benchmark circuits and two industrial circuits. They indicate that the right inclusion of
the application tolerance specification allows for a more accurate analysis, together with a less costly design. As examples, the number of flip-flops required to be hardened was reduced by an average of 13% to 21% for the ITC benchmark circuits as against the
61
classical approach of 100% hardening. For the superset application scenarios considered
with PWM and ECAP modules, the number of such flip-flops to be hardened reduced by
32% and 85% respectively.
5. Improved Fault Injection Based Safety Analysis Approaches
As discussed in Section 2.5, traditional fault injection based functional safety
evaluation has two major limitations. (i) Application level tolerance due to interaction
with the physical system is not accounted for. (ii) Comprehensiveness of the application
workloads selected for functional safety evaluation is not guaranteed [64]. The limitation
in (i) causes the analysis to be pessimistic, while that in (ii) causes it to be optimistic.
Incorporating application level tolerance in safety evaluation, as discussed in Chapter 3, will help address (i). The proposal to use formal approaches to address (ii) and make the analysis comprehensive is only partially successful, as it fails to handle very large circuits.
In order to address the limitations, we propose a new workload augmentation method
based upon judicious perturbations of a given workload. These perturbations are carried
out using systematic analysis of a given workload to identify the input parameters and
data variables, and application specific understanding to determine the impact on
performance and safety due to these perturbations.
The main contributions of this chapter are: (i) Detailed evaluation is performed on
two representative systems to identify limitations with the existing approaches. (ii)
Different experiments are performed with alternate workloads to profile the increase in
the number of identified critical flip-flops. (iii) A method is proposed for systematic
perturbation of workloads whereby new workloads are generated iteratively, and are
shown to be effective in detecting additional critical flip-flops. (These are termed derivative workloads.) (iv) This method is evaluated on two examples. In one case,
several closed loop routines for electric motor control and digital power conversion
applications are analysed and perturbed. In the second case, a physical system for inverter
operation driving an AC load using a microcontroller based system is analysed and its
operation similarly perturbed. The results of these experiments indicate that the proposed
perturbation technique is both effective and affordable.
The rest of the chapter is organized as follows. Section 5.1 covers fault injection
based safety analysis approach where the traditional fault injection based approach is
compared with the ideal application based method and proposed divide and conquer
approach (introduced in Section 3.2) to evaluate the benefits and limitations. Section 5.2
performs a detailed analysis of workloads used for fault injection based safety analysis.
Section 5.3 introduces the proposed workload perturbation approach. Experimental
results with the proposed approach are covered in Section 5.4 and Section 5.5 concludes
the chapter.
5.1 Fault Injection Based Safety Analysis Approach
Fault injection [95] is the preferred technique used to ascertain the safety
worthiness (robustness) of the circuit. In order to evaluate the effectiveness of the
proposed divide and conquer method in reducing analysis pessimism and understanding
the limitations of the fault injection approach, three experiments on the BLDC motor
control system and AC inverter system have been performed and their results compared.
Figure 5.1. Safety analysis approaches.
These are indicated in Figure 5.1 and are explained below:
(i) Existing approach: This consists of analysing individual modules in a standalone
way without any tolerance consideration, as is the practice today.
(ii) Ideal approach: This consists of identifying critical flip-flops at the application level using the full system model together with the system level tolerance information. No approximation or tolerance budgeting is employed here.
(iii) Proposed approach: This consists of performing module level analysis with
tolerance allocation, i.e. divide and conquer method described in Section 3.2.2.
5.1.1 Experimental Setup
A hardware emulation setup was chosen as the platform for performing this
comparative study. This has enabled a faster evaluation over simulation methods, and
avoids the requirement for having a co-simulation environment with the physical model
of the motor and the Verilog netlist of the controller. We can use any of the techniques
defined in Section 3.3 for deriving the individual module tolerances.
Fault injection to identify the critical flip-flops is performed using software mutation.

Figure 5.2. Software fault injection flow.

The approach involves creating a fault injection thread in parallel with the
application thread. The modified application thread is shown in Figure 5.2. The fault
injection thread inverts the values stored in the flip-flops, one at a time, at every cycle of
the application execution, across multiple iterations corresponding to the workload. In
this particular evaluation we have restricted the fault injection to memory mapped
registers i.e. flip-flops which are directly accessible using software code. (This was the
level of granularity feasible in the emulation setup used in this experiment). The output (the module output in the case of the existing and proposed approaches, and the system output in the case of the ideal approach) is observed for each fault. If the output drifts beyond the expected value (the golden output in the case of the existing approach, and the acceptable output with tolerance considerations in the case of the ideal and proposed approaches), the flip-flop in which the fault is injected is classified as critical and must be protected.
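The per-register classification step of this flow can be summarised as follows. This is a minimal sketch: `run_iteration` and the numeric output comparison are hypothetical stand-ins for the emulation platform and the output-monitoring specifics, and a zero tolerance reproduces the existing approach while a non-zero tolerance mimics the ideal and proposed ones.

```python
# Sketch of the classification loop of Figure 5.2: flip one memory-mapped
# register at a time, run the workload, and mark the register critical if
# the observed output drifts beyond the acceptable deviation.

def classify_critical(registers, run_iteration, expected_output, tolerance=0.0):
    critical = set()
    for reg in registers:
        observed = run_iteration(inject_into=reg)    # single bit-flip injected in reg
        if abs(observed - expected_output) > tolerance:
            critical.add(reg)                        # deviation exceeds what is tolerated
    return critical
```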
5.1.1.1 BLDC Motor Control System Results
Fault injection is performed on the BLDC motor control system described in
Section 3.3.1.1 to identify critical flip-flops using the three approaches as illustrated in
Figure 5.1. The number of flip-flops thus identified is shown in Figure 5.3. The X axis
denotes the various motor speed settings for the BLDC motor. The Y axis denotes the
number of critical flip-flops identified.

Figure 5.3. Critical elements identified at different operating conditions.

For the motor speed setting of 900 RPM, 190 flip-flops were identified as dangerous in the existing approach, while the numbers were 48
and 50 for the ideal approach and the proposed approach respectively. Similar results
were observed for other speed settings as well.
Since the control must be robust for any operating speed of the motor, the union of
flip-flops identified as critical at different speed settings is considered. 239 flip-flops
were accordingly identified as critical using the existing analysis technique, 54 using the
ideal approach and 55 using the proposed approach. The results thus indicate a 4.3x
reduction in the number of critical flip-flops using the proposed divide and conquer
approach.
In order to compare the detection capability associated with the three approaches,
we performed a detailed analysis of the flip-flops identified as critical using the three
approaches. The results thus obtained are shown in Figure 5.4. We further analyse the
results and make a few observations:
(i) In the absence of any tolerance considerations, the existing approach (most
pessimistic) must identify the maximum number of flip-flops as dangerous. A
subset of these will be identified as critical using the proposed approach with the
mapped application tolerances applied at the module level. This may not be the
minimal set since interactions between modules are not considered. A smaller
subset will be identified as critical using the ideal approach with closed loop
analysis performed using system model together with application tolerances.
Figure 5.4. Critical flip-flops identified using each approach.
The flip-flops identified as critical are mapped into the three main groups in
Figure 5.4.
(ii) Contrary to expectations, the following discrepancies were observed. Six flip-
flops identified as critical in the ideal approach were not identified using the
proposed approach. Three flip-flops identified as critical in the ideal approach
were not so identified even using the existing (most pessimistic) approach.
5.1.1.2 AC Inverter System Results
Fault injection is performed on 384 flip-flops (belonging to the CPU module which
controls PWM) of the AC Inverter system covered in Section 3.3.1.2 to identify the
critical ones using each of the three approaches mentioned in Figure 5.1. The number of
flip-flops thus identified is shown in Figure 5.5. The X axis denotes the various set-points
associated with the AC Inverter. The Y axis denotes the number of critical flip-flops
identified. For the load current setting of 0.15, 79 flip-flops were identified as dangerous
in the existing approach, while the numbers were 32 and 52 for the ideal approach and
the proposed approach respectively. Similar results were observed for other load current
settings as well. Considering all the different operating points, 96, 36 and 53 critical flip-flops were obtained for the traditional approach, the application based approach and the proposed divide and conquer approach respectively. The results thus indicate a 1.8x reduction in the number of
critical flip-flops using the proposed divide and conquer approach.
Figure 5.5. Critical flip-flops identified for inverter application.
In order to compare the detection capability with the three approaches, we
performed a detailed analysis of the flip-flops identified as critical. The results thus
obtained are shown in Figure 5.6. Similar observations can be drawn in the case of AC
Inverter application as that of BLDC motor control application.
(i) Application based (ideal) approach identifies the least number of critical flip-
flops. On the other hand, the traditional approach identifies the highest number
(indicating pessimism in the analysis). 21 additional flip-flops are identified as
critical using both the divide and conquer approach and traditional approach,
while 40 flip-flops are identified as critical using only the traditional approach.
(These numbers indicate the pessimism associated with these two approaches).
(ii) One flip-flop is uniquely identified as critical by application based approach,
which is missed by the other two approaches.
(iii) On the other hand, there are three flip-flops identified by application based and
traditional approaches, which are missed by the divide and conquer approach.
(iv) Overall, four critical flip-flops have escaped detection using the divide and
conquer approach.
Figure 5.6. Critical flip-flops identified in three approaches for inverter application.

5.2 Fault Injection Workload Analysis

Fault injection based functional safety analysis involves ascertaining the circuit behavior in the presence of faults and determining whether the protection mechanisms incorporated in the chip are capable of detecting them. The set of workloads chosen for
safety evaluation plays an important role in ensuring analysis comprehensiveness. We have seen in Section 2.5.1 that it is difficult to get a comprehensive set of workloads for performing fault injection. In its absence we select workloads considering the toggle coverage [22]. Practical considerations of circuit size and simulation time require that an upper bound (e.g. 70%, 90%, 99%) be set on this coverage.
In order to understand this better, we have performed safety evaluation on a
representative module with 25 different workloads satisfying the toggle coverage criteria.
The results thus obtained are shown in Figure 5.7. The X axis denotes the workload
number and the Y axis denotes the number of unique flip-flops identified as critical using every new workload. It can be observed that the number of critical flip-flops identified saturates around the tenth workload, and each additional workload identifies a much smaller number of critical flip-flops. We term these the trailing-end flip-flops. In a typical analysis, we are not sure whether all the critical flip-flops in the trailing end have been detected. In this chapter, we present an affordable way to identify such trailing-end flip-flops.
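The trailing-end behaviour amounts to counting, per workload, the critical flip-flops not seen before; the per-workload sets in the usage comment are made-up illustrations, not the measured data of Figure 5.7.

```python
# Illustrative computation of the curve in Figure 5.7: for an ordered
# sequence of workloads, count how many previously unseen critical
# flip-flops each workload contributes.

def unique_per_workload(critical_sets):
    seen, counts = set(), []
    for s in critical_sets:
        new = s - seen            # flip-flops not identified by any earlier workload
        counts.append(len(new))
        seen |= new
    return counts

# e.g. unique_per_workload([{1, 2, 3}, {2, 3, 4}, {4}, {5}]) yields [3, 1, 0, 1]
```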
Figure 5.7. Number of unique critical flip-flops identified for each workload.
5.3 Workload Perturbation Approach
We analyzed the flip-flops which escape detection in the experiments covered in Sections 5.1.1.1 and 5.1.1.2. These appear as critical flip-flops at the trailing end of the curve in Figure 5.7. A new methodology is proposed for detection of these critical flip-flops using the following three steps:
(i) The flip-flops which escape detection were functional neighbours of the critical
flip-flops which are already identified. Functional neighbourhood consists of the
set of flip-flops which implement / control the same functionality. For example,
the 32 flip-flops storing the value of integration constant (Ki) in the Proportional
Integral (PI) control function form a functional neighbourhood.
(ii) A detailed analysis revealed that small perturbations in the workload are
required to excite the specific conditions for controlling and observing a
particular fault location. These perturbations were mapped to changes in input
parameters and other embedded control parameters. The changes in values thus
obtained were used to set the perturbation range. These additional workloads are
called derivative workloads.
(iii) A derivative workload causes better excitation in certain areas of the design, e.g.
forcing the application behaviour to exceed the permissible tolerance values.
This excites additional flip-flops (corresponding to the trailing end flip-flops in
Figure 5.7).
On the other hand, an application agnostic approach to generating additional
workloads with random perturbations of control / state settings may lead to the generation
and / or excitation of several states and conditions which may not be functionally
relevant. Such an approach is hence pessimistic. A desirable approach would be to
identify the set of acceptable input conditions and internal states which influence the
closed loop operation in the desired way.
An algorithm for such systematic identification and perturbation to generate
derivative workloads is shown in Figure 5.8. The objective function for the Workload
Augmentation algorithm is to maximize the number of critical flip-flops identified. We
first identify the set of acceptable input variables and embedded control parameters. We
perturb the workload within the acceptable parameter variation range and perform
functional safety evaluation. If the generated workload identifies at least one additional
critical flip-flop, it is accepted. The algorithm is iteratively executed to evaluate the
search space. With each succeeding iteration, the number of new flip-flops identified
reduces (corresponding to the trailing end flip-flops in Figure 5.7).
5.4 Experimental Results
In order to evaluate the effectiveness of the proposed approach in identifying
critical flip-flops, experiments are first conducted on a set of control functions and later
repeated on the inverter application.
5.4.1 Control Functions
We selected a representative set of control functions as given in Table 5.1. This
includes a mix of closed loop control applications and related critical computation
functions (which are used in safety critical applications). We determined the set of acceptable input values, the set of acceptable values for embedded control parameters, and the output tolerances. Fault injection experiments were performed to identify the set of critical flip-flops for each of the control functions, and repeated for these variations, using the Workload_Augmentation algorithm in Figure 5.8.

Figure 5.8. Algorithm for workload augmentation.

Workload_Augmentation(S, WL)
    Input: set of design elements S, initial workload WL
    Identify critical elements using workload WL: S_crit = critical(S, WL)
    for iter = 0 through max
        perturb the workload: WL_iter = perturb(WL)
        identify critical elements: S_iter = critical(S, WL_iter)
        if S_crit ∪ S_iter ≠ S_crit, accept the workload
        S_crit = S_crit ∪ S_iter
        if the search is not leading to new workload acceptance, exit()

perturb(WL)
    Identify all the embedded control variables (v) and input parameters (i) in the workload: WL = f(i, v)
    set of acceptable values of v: V = acceptable(v)
    set of acceptable values of i: I = acceptable(i)
    ∀ v′ ∈ V and ∀ i′ ∈ I: neighbor(WL) = f(i′, v′)
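The Workload_Augmentation algorithm of Figure 5.8 can be rendered as a short executable sketch. Here `critical` and `perturb` abstract the fault injection run and the parameter perturbation, whose concrete forms are platform specific, and the stopping heuristic (`patience` consecutive rejected perturbations) is an assumption standing in for "search is not leading to new workload acceptance".

```python
# Sketch of the workload augmentation loop: perturb the initial workload
# repeatedly, keeping only perturbations that expose at least one
# additional critical element of the design element set S.

def workload_augmentation(S, WL, critical, perturb, max_iter=10, patience=3):
    s_crit = critical(S, WL)          # critical elements under the initial workload
    accepted, misses = [WL], 0
    for _ in range(max_iter):
        wl_iter = perturb(WL)
        s_iter = critical(S, wl_iter)
        if not s_iter <= s_crit:      # at least one new critical element found
            accepted.append(wl_iter)
            s_crit |= s_iter
            misses = 0
        else:
            misses += 1
        if misses >= patience:        # search no longer accepting new workloads
            break
    return s_crit, accepted
```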
The results thus obtained are shown in Figure 5.9. For initial evaluation, the
maximum number of iterations was set to six, i.e. the original workload was perturbed
five times. We injected faults on different memory mapped data variables. (This number
varies with the function and is shown in brackets). The cumulative number of critical flip-flops identified for each iteration is shown along the Y axis.

Table 5.1. Control functions used for evaluation.

Function   Purpose
mppt_PnO   Perturb and observe algorithm to extract maximum power.
mppt_incc  Incremental conductance algorithm to extract maximum power.
clarke     Transformation to convert three-phase quantities into two-phase quantities.
iclarke    Transformation to convert two-phase quadrature quantities into three-phase quantities.
park       Transformation to convert stationary reference frame to rotating reference frame.
ipark      Transformation to convert rotating reference frame to stationary reference frame.
2p2z       Two pole two zero digital control algorithm.
3p3z       Three pole three zero digital control algorithm.

Figure 5.9. Variation of critical elements with workload perturbation.

As can be seen, with
successive iterations, no new trailing end flip-flops are identified, and eventually the
workload perturbation stops. (The initial iteration count was set to 6. However, for 3p3z, it was extended to 10 since new critical flip-flops were still being identified in the 6th iteration.)
5.4.2 Inverter Application
For practical workloads, exhaustive iterations using additional workloads
(generated using the algorithm in Figure 5.8) will result in a very large number of
workloads and thus increase the analysis complexity. For the inverter application, the
estimated workloads are in excess of 10,000. We hence propose a more directed perturbation method, based upon knowledge of the application, its inputs and the embedded control variables for closed loop operation, wherein only those variations which can result in the generation of workloads relevant to the application are considered.
While a detailed explanation of the control loop algorithm for the inverter
application is beyond the scope of this section, we can condense the analysis using the
equations below.
y(i) = Up + Ui
     = [Kp × e(i)] + [Ki × e(i) + U(i − 1)]

where y(i), Up, Ui, e(i), Kp and Ki denote the PI function output, the proportional path output, the integral path output, the error input, the proportional path constant and the integration path constant respectively, and U(i − 1) is the integral path output of the previous iteration. In order to control the output y(i), the error term e(i) must be suitably controlled through Kp and Ki.
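The update above can be written as a one-step function whose only state is the previous integral path output; the gain values used in the test are arbitrary illustrative numbers, not the gains of the actual inverter control loop.

```python
# Sketch of the discrete PI update: y(i) = Up + Ui with
# Up = Kp * e(i) and Ui = Ki * e(i) + U(i-1).

def pi_step(e, u_prev, kp, ki):
    up = kp * e              # proportional path output Up
    ui = ki * e + u_prev     # integral path output Ui
    return up + ui, ui       # y(i), and Ui to carry forward as U(i-1)
```

Perturbing kp and ki within their acceptable ranges, as done for workload WL3, changes which parts of the control path get exercised.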
For the inverter application, fault injection was performed with an initial workload
to identify the critical flip-flops using each of the three approaches mentioned in Figure
5.1. The set of critical flip-flops which remain undetected using the divide and conquer
approach (as explained in Section 5.1.1.2) were analysed and the workload was
augmented iteratively. This is indicated in Table 5.2.
The additional number of critical flip-flops identified using the additional workload
is shown in Figure 5.10. (The dotted ellipse indicates the critical flip-flops which escaped
detection for that workload). The following analysis was performed.
Table 5.2. Workload iterations for inverter.

Workload  Change
WL0       Initial workload.
WL1       WL0 updated with time varying (sinusoidal) error.
WL2       WL1 updated to optimize the control loop for open loop scenario.
WL3       WL2 updated to change the integral control parameter of the closed loop.
Figure 5.10. Critical flip-flops identified with perturbed workloads for AC inverter application.
(i) 4 flip-flops escaped detection using the initial workload WL0 (Figure 5.10(a)).
This is because while in the actual application, the control system error value
(which is the input to the control function) in each control loop iteration shows
significant variation due to the physical system behaviour, in the divide and
conquer approach the same workload is executed on the module in isolation with
lesser variation. The module (in this case the CPU on which fault injection is performed and which provides controls to the PWM) can be more comprehensively excited by replacing the constant reference provided to the closed loop control with a continuously changing reference (e.g. a sinusoidal reference). A new workload WL1 is thereby created. The results shown in Figure 5.10(b) indicate
that 2 out of these 4 flip-flops are now detected.
(ii) Further analysis of the 2 unidentified critical flip-flops indicated that the control
loop parameters required for open loop operation (divide and conquer approach)
were different compared to those required for closed loop operation (application
approach). A new workload WL2 was generated by updating the control loop
parameters to recalibrate the output value for the open loop operation. The
results in Figure 5.10(c) indicate that 1 additional flip-flop is now detected.
(iii) The single flip-flop which escaped detection using Workload WL2 was part of
the integrator logic. In a typical PI control loop, the 𝐾𝑝 (proportional) term
impacts the loop gain and helps to reach the optimal performance point faster.
The 𝐾𝑖 (integral) term helps to remove the steady state error in the system. A
new workload WL3 was added with new value for 𝐾𝑖 which can force the
application behaviour to exceed the permissible tolerance values. As shown in
Figure 5.10(d), WL3 detected all the critical flip-flops.
In this particular study, we have used the proposed approach to reduce the number of critical flip-flops which escape detection. This approach can also be utilized to perform a trade-off between the hardware overhead incurred and the reliability gained, by reducing the number of false positives, i.e. the number of non-critical flip-flops which are pessimistically marked as critical using the proposed approach. As can be seen in Figure
5.10, the total number of critical flip-flops identified varies across workloads. This
includes flip-flops not identified as critical in the application approach as well as flip-
flops so identified pessimistically. The number of flip-flops pessimistically identified
using this algorithm is contained to 23 (out of 384, i.e. 5.99%), whereas with random
perturbation, this number is much higher, tending to cover all the flip-flops.
The workload perturbation algorithm in Figure 5.8 is carried out accordingly. Generic recommendations include: (a) Inputs applied and parameter changes performed in open loop analysis must be representative of the actual closed loop scenario. (b) Flip-flops whose criticality depends on their values must be so identified and updated.
5.5 Conclusion
In this chapter, we proposed a perturbation based workload augmentation technique
for performing comprehensive functional safety evaluation. Experiments are performed
on a set of safety critical control functions and an inverter application, and the
effectiveness of the proposed method in identifying additional critical flip-flops is
demonstrated. 12% to 26% additional critical flip-flops are identified. Through these
experiments, we have illustrated how optimisations in safety evaluation methods using
application level tolerance can be traded off with additional workloads to arrive at a more
comprehensive list of critical flip-flops to meet the overall hardware overhead and
reliability requirements. Together, these results indicate that the proposed perturbation
technique is effective in identifying additional critical flip-flops within an affordable overhead of analysis complexity and pessimism.
6. Application Driven Protection Mechanisms
In Chapters 3, 4 and 5, different techniques were proposed to perform
comprehensive functional safety analysis and identify the minimal set of critical flip-
flops which must be suitably managed in the application (e.g. either by preventing the
occurrence of SEUs or detecting SEU occurrence followed by remedial action). In this
chapter, we will analyse the techniques which can be used to protect the identified set of
such critical flip-flops.
The techniques used for protection can be grouped into three categories, namely
hardware, software and application level techniques. Protection using hardware
[96,6,58,57,97] and software techniques [98,99,100,101,102] are well researched topics.
Cross-layer techniques which involve both hardware and software have also been
proposed [103,104]. However, there is not much reported work on utilizing the
application information to protect critical flip-flops. In this chapter, we review
representative hardware and software techniques and propose two new application level
techniques.
6.1 Hardware Based Protection Techniques
Hardware based techniques implement protection using a combination of spatial
and temporal redundancy. They can be implemented at different abstraction levels,
namely at the device level, circuit level and module level. Tradeoffs associated with
implementing the protection at different abstraction levels must be considered during the
IC design and system design stages.
6.1.1 Device Level Techniques
High energy alpha and neutron particle strikes create additional electron-hole pairs
as they pass through the semiconductor device. Depending upon the energy of the
incoming particle, there can be sufficient amount of charge accumulation to invert a
stored logic value leading to a soft error. Device level soft error mitigation techniques
introduce measures, either through design robustness or through additional manufacturing steps, to reduce the impact of alpha and neutron particle strikes. These
techniques either increase the critical charge of state holding element / transistor (i.e.
amount of charge which is required to change the flip-flop state) or reduce the charge
collection, thus reducing the probability of the flip-flop changing its state. (Charge
collection refers to the process by which the excess electron-hole pairs created due to a
particle strike are swept into the source / drain regions instead of recombining and
neutralizing.).
Silicon on Insulator (SOI) is a device level technique deployed for SER mitigation.
This technique introduces a layer of insulator between source / drain and substrate as
shown in Figure 6.1 [96]. The charge collected on account of a particle strike is much
lesser compared to that with the traditional bulk process as the buried oxide layer
prevents charge flow from the substrate to the source and drain. This, in turn, reduces the
probability of the flip-flop losing its state. A similar reduction in SER is observed for
FinFET technologies due to the charge dissipation in the substrate itself before reaching
the source or drain [105].
6.1.2 Circuit Level Techniques
Circuit level soft error mitigation methodologies use a combination of transistor
and logic gate / flip-flop design techniques to build components which are hardened or
tolerant to the effects of radiation. A few common examples are listed below.
Figure 6.1. SOI transistor.
Dual Interlocked storage Cell (DICE) [6] is a composite flip-flop which provides
protection from single event upsets, by using spatial redundancy to store its value as a
pair of elements, each element of which has a set of complementary values. Refer to
Figure 6.2. If the flip-flop's state changes due to a particle strike, it can be restored using the available redundancy. Various optimizations to DICE flip-flops have been proposed to
reduce the area and power overhead [106].
Built in Soft Error Resilience (BISER) [58] is another technique used for protecting
latches and flip-flops. A BISER flip-flop consists of two flip-flops joined with a C-
element [107] as shown in Figure 6.3. In the fault-free condition, both the flip-flops have
the same value and the C element provides the inverted value. Upon a particle strike, the
values in the two flip-flops are opposite. Thereupon, the C element will tristate the
output. The BISER flip-flop will retain the previous value at the output due to the bus
Figure 6.2. DICE flip-flop.
Figure 6.3. BISER flip-flop.
keeper circuit. The two flip-flops will once again take an identical value when they are
updated by the next clock cycle.
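The BISER behaviour can be sketched with a small behavioural model of the C element and bus keeper; this is an illustrative software model of the logic described above, not a circuit description.

```python
def c_element(a, b, keeper):
    """Behavioural model of a C element with a bus keeper.

    When both inputs agree, the output is the inverted common value;
    when they disagree (e.g. after a particle strike upsets one of the
    two flip-flops), the output tristates and the keeper holds the
    previously driven value.
    """
    if a == b:
        return 1 - a          # C element drives the inverted value
    return keeper             # tristate: bus keeper retains old value

# A strike flips one of the two flip-flop copies; the output is unaffected.
q_main, q_shadow = 0, 0
keeper = c_element(q_main, q_shadow, keeper=None)   # fault-free: drives 1
q_main = 1                                          # soft error in one copy
out = c_element(q_main, q_shadow, keeper)           # keeper holds the value
```

On the next clock edge both flip-flops reload the same value and the C element resumes driving, matching the recovery behaviour described in the text.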
There are other circuit level techniques like the use of Razor flip-flop [57], delayed
capture methodology [59], SEU mitigation using error control coding techniques [61],
DF-DICE flip-flop [108], etc., which offer different tradeoffs in terms of implementation
overhead and detection capabilities.
6.1.3 Module Level Techniques
Module level redundancy is commonly used for implementing protection against
soft errors. In this approach, an entire module is replicated, redundant instances are fed
with the same input values and the outputs of redundant modules are continuously
compared. (This method is practically very relevant since it does not require new design
or characterisation of flip-flops and transistors used to build them. It instead uses standard
library components). Dual Core Lock Step (DCLS) architecture [97] is the simplest
example of module level redundancy, wherein two instances of a given module are
operated in tandem. Additionally, considerations like staggered execution in redundant
streams, different physical spacing requirements between the redundant units, etc., are
also incorporated to mitigate effects of common cause failures (i.e. common faults
propagating to both the modules, e.g. through power supply and clock networks, thereby
rendering the checker ineffective). For applications requiring higher availability (e.g.
fault tolerance), triple core lock-step [109] is also deployed.
6.2 Software Based Protection Techniques
Software based protection techniques implement protection in programmable
systems (e.g. those with CPUs) using smart program generation methods [110] that shape
the CPU instructions and code execution for a given application. These techniques
can be classified mainly into three types, namely, (i) control flow checking, (ii)
vulnerability reduction techniques, and (iii) redundancy techniques. These techniques are
briefly described, and the related implementation overheads, protection offered and
suitability from an application context (e.g. data processing intensive vs control
intensive) are explained.
6.2.1 Control Flow Checking
Control flow checking uses assertions to check the program flow sequencing. It can
detect when the execution takes a different path in the presence of a fault by using some
property of the path, e.g. time, signature, end state, etc. A watchdog [111] is a classic
example of a mechanism used for checking gross control flow errors: the time behaviour
of execution along an erroneous path deviates from the profiled behaviour, allowing the
fault to be detected.
More sophisticated control flow checking measures [98,112,113,114] have been
reported for better detection capability. These measures divide the program into sub-
routines, where a sub-routine is a collection of instructions with a unique entry and exit
point. Various sub-routines are connected using arcs to form a full program and each sub-
routine is allocated a unique signature. Additional instructions are added in the
sub-routines to perform control flow checking. If the program flow is not as expected, an
error is flagged. As an illustration, 97% of transient faults in control logic can be
detected using the control flow checking measure described in [113].
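The signature-based scheme can be sketched as follows; the basic blocks, signature values and legal arcs below are invented for illustration.

```python
# Each sub-routine (basic block) has a unique signature; legal arcs define
# the expected control flow graph (signatures and arcs are illustrative).
SIGNATURES = {"entry": 0xA1, "loop": 0xB2, "exit": 0xC3}
LEGAL_ARCS = {0xA1: {0xB2}, 0xB2: {0xB2, 0xC3}, 0xC3: set()}

def check_flow(executed_blocks):
    """Flag an error if the executed block sequence follows an arc
    that is not present in the expected control flow graph."""
    prev = None
    for block in executed_blocks:
        sig = SIGNATURES[block]
        if prev is not None and sig not in LEGAL_ARCS[prev]:
            return False   # control flow error detected
        prev = sig
    return True

ok = check_flow(["entry", "loop", "loop", "exit"])   # expected path
bad = check_flow(["entry", "exit"])                  # faulty branch skips loop
```

In a real implementation the checking instructions are embedded in the sub-routines themselves rather than run as a separate pass.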
These measures detect faults only in the control blocks of the processor which
result in an incorrect call or branch. They are incapable of detecting faults that do not
cause a change in the control flow. In a typical processor, the control logic constitutes
about 10-15% of the entire logic. In the sample fault simulation experiments performed
with the
BLDC motor control and AC inverter application, we have observed that there are many
faults, (e.g. faults in the data path logic), which impact accuracy of data but do not impact
the control flow. Such faults will remain undetected and hence these approaches are not
suitable for the applications considered here.
6.2.2 Vulnerability Reduction Techniques
Algorithm Based Fault Tolerance (ABFT) [99] has been one of the earliest
vulnerability reduction techniques deployed to improve reliability of software programs.
The methodology encodes data in a specific form and the algorithms are designed to
operate on this encoded data and produce encoded output data. The various computations
required for the algorithm are performed in different computation units such that a fault
in any of the units affects only a portion of the data, which can then be detected. The
benefit of this methodology has been demonstrated using a set of matrix operations.
The work in [115] has defined Program Vulnerability Factor (PVF) to demonstrate
the impact of a set of instruction sequences on the dependability of the overall application
in much the same way as Architectural Vulnerability Factor (AVF) [116,117] is used to
demonstrate the impact of architectural and micro-architectural components on the
dependability of the overall application. PVF is a property of the dynamic execution of
the program and helps identify subsets of instruction sequences which are vulnerable to
transient faults. A 20% reduction in vulnerability is observed by reordering of
instructions in the identified vulnerable subset of instruction sequences.
The work in [100] has proposed code re-ordering and critical variable duplication
to reduce the vulnerability due to soft errors. A tool RECCO (Reliable Code Compiler) is
used to map the input source code to a more reliable source code. The tool allows user
configurability to establish tradeoffs between dependability improvement and
performance degradation. The tool assigns a reliability weight to each of the variables
based on the functional dependencies and life-time of the variable. The number of places
a variable is used determines the functional dependency. The duration between a
variable’s creation (i.e. write operation) and last consumption (i.e. read operation) is
called a life period. The sum of life periods for the entire program duration gives the life-
time of the variable. Code re-ordering is performed to reduce the reliability weight of
each variable. Variable duplication is proposed to further reduce the vulnerability. The
vulnerability reduction reported is lower (5-9%) for generic program sequences, however,
for specific program sequences, it is higher (up to 65%).
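The life-time metric can be illustrated as below; the access-trace representation is a hypothetical simplification of what RECCO derives from source analysis.

```python
def variable_lifetime(trace, var):
    """Sum of life periods for `var`: each life period spans a write and
    the last read before the next write. Trace entries are
    (time, op, variable) tuples; this format is illustrative."""
    lifetime, birth, last_read = 0, None, None
    for t, op, v in trace:
        if v != var:
            continue
        if op == "write":
            if birth is not None and last_read is not None:
                lifetime += last_read - birth   # close previous life period
            birth, last_read = t, None
        elif op == "read":
            last_read = t
    if birth is not None and last_read is not None:
        lifetime += last_read - birth           # close final life period
    return lifetime

trace = [(0, "write", "x"), (4, "read", "x"),
         (6, "write", "x"), (9, "read", "x")]
# life periods: 0..4 and 6..9, so the life-time of x is 7
```

A variable with a long life-time is exposed to soft errors for longer, which is why the tool reorders code to shorten these intervals.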
Though these approaches can be deployed for the applications considered in this
thesis, based on the results already observed (i.e. vulnerability reduction of 5-9% for
generic program sequences), the benefits obtained by using these approaches are restricted
by the ability to identify and recode specific instruction sequences. In addition, these
techniques require instruction sequences to be generated in a given form, which requires
compiler changes. This may not be practical for commercial processor platforms (e.g.
ARM cores).
6.2.3 Software Redundancy Techniques
Software redundancy based protection techniques use a combination of spatial and
temporal methods to reduce vulnerability. One of the first redundancy based techniques
implemented for fault tolerance is N-Version programming [118]. Though originally
proposed for finding systematic faults (e.g. bugs) in the software program, it can also be
used to implement fault tolerance. In this approach, different versions of the program are
created from the same original specification by different teams (individuals or groups of
individuals). A supervisor program is used to compare the output of these different
versions and select the correct ones based on majority voting to proceed to the next stage
of program execution. With the increase in software complexity, the effort required for
creating multiple versions has become prohibitive. In addition, new tools have emerged
to improve the quality of software programs, thus reducing the need for the N-Version
programming technique. For these reasons, this technique finds less acceptance
nowadays.
Error Detection by Duplicated Instruction (EDDI) [101] and SoftWare
Implemented Fault Tolerance (SWIFT) [102] techniques implement fault tolerance using
duplicated execution and result comparison for concurrent error detection. The approach
uses different resources for storing variables and duplicated instructions. The duplicated
operations are spaced apart in time. The results of the duplicated operations are compared
before a write to a memory or a branch operation is performed. This approach can detect
transient faults in control logic, data and instruction memory, functional units and
interconnects. High transient fault detection coverage is reported for both the EDDI
(96.2% to 99.2%) and SWIFT (98.05%) techniques.
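The duplicate-and-compare idea behind EDDI can be sketched at source level as follows (EDDI itself works at the instruction level with separate registers and memory locations); the compute function and address below are illustrative.

```python
def eddi_style_store(compute, memory, addr):
    """Run the computation twice on independent resources and compare
    the results before the value is committed to memory; a mismatch
    indicates a transient fault during one of the executions."""
    primary = compute()      # original instruction stream
    shadow = compute()       # duplicated instructions, spaced apart in time
    if primary != shadow:
        raise RuntimeError("transient fault detected before store")
    memory[addr] = primary   # comparison passed: commit the update

memory = {}
eddi_style_store(lambda: 2 * 21, memory, addr=0x100)
```

The comparison is placed before memory updates and branches because those are the points where a corrupted value escapes into architectural state.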
The fault detection limitations of EDDI and SWIFT techniques (e.g. detection
escapes which can happen when a fault happens after comparison and before memory
update, a fault causing a normal instruction to transform into a memory update instruction,
etc.), were addressed by the CompileR Assisted Fault Tolerance (CRAFT) [119]
technique, thereby improving transient fault detection to 99.29%. The CRAFT technique
consumes additional MIPS and hence increases the execution time by 31.4%. PROfile
guided Fault Tolerance (PROFiT) [120] technique addresses the problem of performance
impact by identifying and protecting only the critical sections of the complete software
program, which consists of both critical and non-critical sections, (e.g. an automotive
processor executing critical driver assistance function along with non-critical
infotainment functions).
Practical adoption of hardware based fault tolerance techniques is still limited due
to lack of their availability in application specific SoCs catering to functional safety
requirements, (e.g. Application domain Specific Instruction Set Processor (ASIP) for
functional safety [121]), or due to the prohibitive cost associated with them particularly
when the fault tolerance is required only for a small subset of tasks implemented by the
SoC. Software based techniques are, therefore, used for protecting Commercial Off The
Shelf (COTS) components [122] used in building functional safety systems. However,
there is a significant performance and memory footprint overhead associated with
software techniques, together with limited protection in some cases. In order to address
these issues, we propose methods which utilise application tolerance for incorporating
functional safety.
6.3 Application Based Protection Techniques
Typical closed loop applications consist of periodic execution of different tasks to
perform various functional operations, (e.g. PID control loop, periodic communication,
etc.) and non-functional operations, (e.g. related to safety, security, operating power /
voltage modes, etc.). A small subset of tasks from amongst all the different tasks
executed by the application will be classified as safety critical. To ensure functional
safety for the application, the fault tolerance requirements of safety critical tasks must be
met even if it is at the expense of non-critical tasks. This section describes two
application oriented techniques which can provide fault tolerance.
Consider an SoC executing a motor control application. The motor control
application consists of various tasks of different criticality as given in Table 6.1 (For
simplicity and to make the notation generic, we represent criticality using numbers from
1 to 4; such that a higher number indicates higher criticality). These include obtaining
set-point information (to control motor speed) from the higher level system, motor
control task which controls the motor speed based on the obtained set-point information,
motor control monitoring function to ensure that the speed matches the set-point and
enters into a fail-safe mode if the speed is not within the tolerance range, data-logging
function to periodically save the status information to aid debug, and speed information
to be updated on the display panel. The criticality of each of these tasks is typically
determined based upon factors described in Section 2.2.1.
The different tasks in the application are triggered either by an interrupt or upon
completion of a previous task. The task triggering interrupts occur at varying times / have
varying frequency and have different priorities [123] determined by the application
requirements. In this example, the priority of interrupts is indicated using tags I1, I2 and
Table 6.1. Different tasks executed by motor control application.
No | Task | Task Trigger | Criticality
T1 | Motor control monitoring function | Interrupt I1 | 4
T2 | Motor control | T1 completion | 3
T3 | Periodic communication of set point information | Interrupt I2 | 3
T4 | Speed intimation to display panel | T3 completion | 2
T5 | Data-logging | Interrupt I3 | 1
Figure 6.4. Sequencing of different tasks executed by motor control application.
I3 where I1 is the highest priority interrupt and I3 is the lowest priority interrupt. A
higher priority interrupt can always preempt a lower priority Interrupt Service
Routine (ISR) that is in progress; however, a lower priority interrupt cannot preempt a higher
priority ISR. The sequencing of different tasks of the motor control application in the
time domain is represented in Figure 6.4. We can see that a lower criticality task T5
initiated by a lower priority interrupt I3 is interrupted by the higher priority interrupts I1
and I2.
The criticality of the processor in the SoC and the associated value and time
tolerance at any instant during application execution are determined based on the task the
processor is executing. We augment the notations introduced in [124] to represent this.
We can consider the motor control application to be made up of various tasks 𝑇𝑖. In a
typical application scenario, a particular peripheral will always be part of a task, e.g.
PWM output is always driving the motor, and the CPU bandwidth is time division
multiplexed for the various application tasks. Each of these tasks can be considered to be
running periodically with a period 𝑃𝑖, computation time 𝐶𝑖, permissible time (deadline)
𝐷𝑖 to complete a particular operation and criticality 𝐿𝑖. The value and time tolerance
associated with the task can be represented as 𝑉𝑇𝑖 and 𝑇𝑇𝑖 respectively. This change in
tolerance of the processor of the SoC over time is illustrated in Figure 6.5. These
different tasks have different tolerance requirements and different deadlines. It is possible
to utilize this information to reduce the implementation overhead for protecting the
critical application tasks.
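The task model above can be captured as a simple structure; the numeric periods and tolerances below are illustrative placeholders, loosely following Table 6.1.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    period_ms: float      # P_i
    compute_ms: float     # C_i
    deadline_ms: float    # D_i
    criticality: int      # L_i (1 = lowest, 4 = highest)
    value_tol: float      # VT_i
    time_tol: float       # TT_i, in control loop cycles

# Illustrative values only; a non-critical task has effectively
# unbounded tolerance ({infinity, infinity} in Figure 6.5).
tasks = [
    Task("motor control monitoring", 1.0, 0.2, 1.0, 4, 0.02, 1),
    Task("motor control",            1.0, 0.4, 1.0, 3, 0.05, 2),
    Task("data-logging",            10.0, 0.5, 10.0, 1,
         float("inf"), float("inf")),
]

# Only tasks above a criticality threshold need protection.
critical = [t.name for t in tasks if t.criticality >= 3]
```

The tolerance of the processor at any instant is then the tolerance of the task it is currently executing.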
Figure 6.5. Change in criticality over time.
6.3.1 Critical Flip-flop Reduction by Altering Application
Execution
A typical control system involves sampling the input, processing the input to
determine the required actuation, and then performing the actuation. The number of
times this sequence of sampling, processing and actuation takes place in a second is the
control loop frequency. The control loop frequency is determined based upon one or
more parameters of the physical system, i.e. the motor control system here. In application based
functional safety analysis, we have mapped application tolerance as value tolerance and
time tolerance. The value tolerance is a result of the acceptable set of values around the
control point (driving the actuator) for which the application can behave in an acceptably
correct manner. The time tolerance is a result of inertia of the physical system where a
change in the controller output takes a much larger time to have any perceptible impact
on the physical system that the application is controlling. We expect that a higher number
of repeated executions, (i.e. higher control loop frequency), can help increase both the
value tolerance as well as the time tolerance, thereby rendering fewer number of flip-
flops as critical.
In order to ascertain the impact of the change in control loop frequency on the set
of critical flip-flops identified, we perform evaluation on the same two reference designs:
(a) BLDC motor control and (b) AC inverter circuit. The evaluation is performed as a
two-step process.
(i) Compute the variation in value tolerance and time tolerance (in terms of number
of control loop cycles) with change in control loop frequency.
(ii) Compute the change in the identified number of critical flip-flops with change in
the value tolerance and time tolerance.
6.3.1.1 Computation of Value and Time Tolerance for Different Control Loop Frequencies
For a typical control system, there will be a range of control loop frequencies for
which the system can operate in an acceptably correct manner. Hence for these
experiments, we have limited the control loop frequency changes to lie within this
acceptable range. We have evaluated the variation in value tolerance and time tolerance
with change in control loop frequency.
The control loop frequency can be changed in two ways. (i) Changing the PLL lock
frequency, whereby the whole device operates at a new frequency and the entire set of
operations is performed at that frequency. (ii) Updating
the frequency of interrupt which initiates the control loop operation. (For example, a
timer module generates the periodic interrupt and the interrupt frequency can be changed
by re-configuring the timer module). Updating the interrupt frequency will cause a
change only in the frequency of control loop operations. In case of an increase in the
control loop frequency, the CPU may not have enough bandwidth to process the
additional loop operations. In such a scenario, the operation of some of the less critical
tasks (e.g. datalogging) can be slowed down, (i.e. frequency of processing lowered), to
provide the bandwidth to process the more critical tasks. The tradeoffs associated with
these techniques to increase control loop frequency are illustrated in Table 6.2. (Since the
device is already operating at the maximum frequency and there is no frequency
headroom available for changing the device operating frequency, the period configuration
of the timer module is changed for this experimental evaluation).
The control loop frequency is changed and the time tolerance associated with the
Table 6.2. Tradeoffs associated with changing control loop frequency.
No | Change using timer module period configuration | Change using PLL clock frequency configuration
1 | Device operating frequency remains the same. | Device operating frequency changes. Higher frequency configuration is possible only if the device is rated to operate at the increased frequency.
2 | Change is instantaneous on changing the timer period configuration. | Device PLL must relock at the higher clock frequency. This typically takes a few microseconds.
3 | Only the critical control loop operation takes place at the higher frequency. | All operations in the device take place at the higher frequency.
4 | Timing of all other cyclic operations must be updated to accommodate the higher bandwidth required for the critical control loop operation. | There is almost no change required in the system configuration.
application is determined. Since value tolerance indicates the impact of a change in
value over an infinitely long time, it will not change with the control loop frequency
(refer to Section 3.2.1). The time tolerance values are determined as given in Section
3.3.1. The time tolerance, expressed as the number of control loop iterations, increases
with increase in control loop frequency.
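The relation between an absolute time tolerance and the tolerance expressed in loop cycles can be sketched as follows; the tolerance window and frequencies are illustrative numbers, not measured values from the reference designs.

```python
def time_tolerance_cycles(tolerance_s, loop_freq_hz):
    """Express the physical-system time tolerance as a whole number of
    control loop iterations: a faster loop fits more corrective
    iterations into the same absolute tolerance window."""
    return int(tolerance_s * loop_freq_hz)

# The same 5 ms tolerance window spans more control loop iterations
# at 40 kHz than at 10 kHz, giving more chances to correct a fault.
low = time_tolerance_cycles(0.005, 10_000)
high = time_tolerance_cycles(0.005, 40_000)
```

This is why raising the control loop frequency increases the time tolerance in loop cycles even though the physical tolerance window is unchanged.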
The computed time tolerance values for the BLDC motor control system for
different control loop frequencies and different motor speeds are shown in Figure 6.6. In
this figure, the X axis denotes the various control loop frequencies in kHz and the Y axis
denotes the time tolerance in number of control loop cycles. The various lines in the plot
indicate the time tolerance values for different motor speeds (in rpm).
Figure 6.6. Variation of time tolerance with control loop frequency for BLDC motor.
Figure 6.7. Variation of time tolerance with control loop frequency for AC inverter.
The computed time tolerance values for the AC inverter application at different
control loop frequencies are shown in Figure 6.7. The different lines in the graph
correspond to the different input current values at which the AC inverter is operating. (In
the figure, the range of input current values from 0 A – 30A is represented over a range
from 0 to 1).
6.3.1.2 Computation of Number of Critical Flip-flops
In Section 3.3.1, we showed that application information can be used for reducing
the number of critical flip-flops and thus the hardware overhead incurred for “safeing”
the application. In Section 6.3.1.1, we showed that the tolerance value increases with
increase in control loop frequency. In this sub-section, we evaluate whether it is possible
to further optimize the hardware overhead by using the additional tolerance gained by
virtue of running the control loop at a higher frequency.
The time tolerance and value tolerance thus obtained from the system level
evaluation are used to perform the divide and conquer analysis as shown in Figure 6.8. In
this analysis, the system is configured to execute the control loop at the targeted
frequency. The value tolerance and time tolerance are configured in the fault injection
set-up. The critical flip-flops are identified for each targeted time tolerance value.
Figure 6.8. Critical flip-flop identification for different time tolerance values.
The critical flip-flops thus obtained for BLDC motor control application are shown
in Figure 6.9. We have determined the number of critical flip-flops for several different
discrete time tolerance values from 0 to 100. A significant reduction (from 55 to 30, i.e.
45%) in the number of critical flip-flops is seen as we increase the time tolerance from 0
to 1. Thereafter, the number of critical flip-flops does not reduce further with increase in
time tolerance.
We repeated the experiment for the AC inverter application. The results are plotted
in Figure 6.10. Here the critical flip-flop count reduces more significantly with increase
Figure 6.9. Variation in the number of critical flip-flops with time tolerance (in number
of control loop cycles) for BLDC motor control application.
Figure 6.10. Variation in the number of critical flip-flops with time tolerance (in number
of control loop cycles) for AC inverter application.
in time tolerance. It reduces from 59 for a time tolerance of zero to 48 for a time
tolerance of 30 control loop cycles and further to 44 for a time tolerance of 100 control
loop cycles. By increasing the time tolerance from 20 to 40, we observed a 6% (51 to 48)
reduction in the number of critical flip-flops.
6.3.1.3 Observations for BLDC Motor Control and AC Inverter System
We observe a different behaviour for the variation in critical flip-flop count with
variation in control loop frequency for the BLDC motor control and AC inverter
applications. In the case of the BLDC motor control application, the number of critical
flip-flops does not reduce for time tolerance values beyond one. However, for the AC
inverter application, the number of critical flip-flops continues to reduce with increase in
the control loop frequency.
Figure 6.11 indicates the difference in execution with two different control loop
frequencies in the presence of a fault. Figure 6.11(a) profiles the application with a lower
Figure 6.11. Execution variation with different control loop frequencies: (a) execution
with a lower control loop frequency; (b) execution with a higher control loop frequency.
(Each panel shows the CPU output over time, with the fault instant and the value and
time tolerance windows marked.)
control loop frequency and Figure 6.11(b) profiles the same application with a higher
control loop frequency. It can be observed that a higher control loop frequency helps
correct the error in the system before the time tolerance interval elapses, thereby making the
system operate in an acceptably correct manner. This behaviour depends on the function
influenced by the flip-flop on which the fault is injected.
The difference in behaviours for the control function implemented by BLDC motor
control and AC inverter system is now explained. The Proportional Integral (PI) function
implemented by the controllers (shown in Figure 6.12) can be denoted as:
𝑦(𝑖) = 𝑈𝑝 + 𝑈𝑖 = [𝐾𝑝 ∗ 𝑒(𝑖)] + [𝐾𝑖 ∗ 𝑒(𝑖) + 𝑈(𝑖 − 1)]
where 𝑦(𝑖), 𝑈𝑝, 𝑈𝑖 and 𝑒(𝑖) denote the PI function output, proportional path output,
integral path output and error input respectively, 𝐾𝑝 and 𝐾𝑖 are the proportional and
integral path constants, and 𝑈(𝑖 − 1) is the integral path output of the previous iteration.
In order to control the output 𝑦(𝑖), the error term 𝑒(𝑖) must be suitably weighted by 𝐾𝑝
and 𝐾𝑖. [𝐾𝑝 ∗ 𝑒(𝑖)] forms the proportional path and [𝐾𝑖 ∗ 𝑒(𝑖) + 𝑈(𝑖 − 1)] forms the
integral path of the controller.
A fault in one of the flip-flops implementing the proportional path is immediately
visible at the function output. This will also get corrected in the next iteration. There is no
storage (memory) of values in the proportional path. However, we notice that the integral
path has inherent memory as it keeps on accumulating the results of previous iterations.
A fault in one of the flip-flops implementing the integral path will take time to get
corrected, i.e. get updated to its new value. Such systems which have inherent memory
Figure 6.12. PI function implemented by the control system.
will benefit from increasing the control loop frequency. For the BLDC motor control
application, the integration parameter (𝐾𝑖) used in the PI control is very close to zero and
the control loop operates very close to a proportional control system. Hence, changing
the control loop operating frequency does not reduce the number of critical flip-flops.
We therefore conclude that the proposed approach of increasing the control loop
frequency to render fewer flip-flops as critical will be beneficial when the application has
inherent memory. A given application must be analyzed accordingly. The higher the
inherent memory, the greater the benefits associated with the proposed approach.
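The contrasting fault behaviour of the two paths can be illustrated with a minimal discrete PI iteration; the gains and the injected error magnitude below are invented for illustration.

```python
def pi_step(e, state, kp, ki):
    """One PI iteration: y(i) = Kp*e(i) + [Ki*e(i) + U(i-1)].
    The integral path accumulates state (inherent memory); the
    proportional path is memoryless."""
    ui = ki * e + state
    return kp * e + ui, ui

# Fault-free iteration.
ui = 0.0
y, ui = pi_step(1.0, ui, kp=0.5, ki=0.1)

# A soft error in the integral state persists in U(i-1): after the
# next clean iteration the output still carries the injected error,
# whereas a corrupted proportional term would vanish in one step.
ui_faulty = ui + 8.0
y2, ui2 = pi_step(1.0, ui_faulty, kp=0.5, ki=0.1)
```

Running more iterations within the same tolerance window gives the accumulated error more opportunities to be driven out by feedback, which is why applications with inherent memory benefit from a higher control loop frequency.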
6.3.2 Detection of Critical Flip-flops by Selective Redundant
Execution
In this section, we propose a new technique to detect faults in critical flip-flops
(which are identified as part of the application based functional safety evaluation
described in Chapters 3, 4 and 5), by selective redundant execution of critical portions of
the application. Unlike other software based fault tolerant approaches covered in Section
6.2, which attempt to protect the entire application, the proposed optimization detects
faults only in the critical flip-flops, thereby reducing the implementation overhead.
proposed approach is further extended to provide system recovery on detection of a fault.
6.3.2.1 Selective Redundant Execution
In the proposed approach, the application is first evaluated to identify the safety
critical tasks. Application based functional safety evaluation is then performed on these
critical tasks to identify critical flip-flops. Once identified, dual streams of execution
operate on independent sets of such critical flip-flops. Once execution is complete, the
two sets of critical flip-flops are compared before a memory update or before a control
flow operation (e.g. call, branch, etc.). This will help detect any soft errors which occur
during the execution. The flow chart for proposed approach is shown in Figure 6.13.
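A minimal sketch of the proposed selective redundant execution, modelling the critical flip-flops as duplicated program variables updated in two independent streams; the step function and memory map are illustrative.

```python
def run_critical_section(step, state_a, state_b, memory, addr):
    """Execute the critical code on two independent copies of the
    critical state and compare them before the memory update; a
    mismatch indicates a soft error in a critical flip-flop."""
    out_a = step(state_a)    # redundant stream 1
    out_b = step(state_b)    # redundant stream 2
    if out_a != out_b:
        return False         # indicate error to the external world
    memory[addr] = out_a     # comparison passed: commit the update
    return True

memory = {}
ok = run_critical_section(lambda s: s + 1, 10, 10, memory, "speed")
faulty = run_critical_section(lambda s: s + 1, 10, 11, memory, "x")
```

In the same way, the comparison is also performed before control flow operations (calls, branches) so that a corrupted value cannot divert execution undetected.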
Compared to a non-fault tolerant implementation, the proposed approach will lead
to (i) increase in memory footprint due to the additional storage required for saving the
redundant code and redundant variables, and (ii) increase in MIPS required since the
critical portion of the code must be redundantly executed. However, overheads associated
with the proposed approach are lower than those with duplication as implemented by
EDDI approach [101]. In order to evaluate the overheads associated with the proposed
approach, we perform the evaluation on a set illustrative of control functions. The
Figure 6.13. Selective redundant execution (flow chart: critical flip-flops are duplicated
and updated in two independent threads; redundant values are compared before every
control flow operation and memory update, and an error is indicated to the external
world on a miscompare).
Figure 6.14. Memory and MIPS overhead reduction for selective redundant execution
approach as compared to EDDI.
benefits of the proposed approach (memory overhead reduction and MIPS savings) are
compared with EDDI (having the same error tolerance) in Figure 6.14. It can be seen that
both the memory and MIPS overhead reductions compared to the EDDI approach [101]
vary in the range of 5% to 41%.
In the case of a complete application as shown in Figure 6.4, the proposed approach
will be applied only for the safety critical threads. Additional memory and MIPS
overhead for protection will be incurred only for them, as compared to the application
agnostic duplication employed in the EDDI approach.
6.3.2.2 System Recovery on Identification of an Error
The proposed selective redundant execution will help detect an error. The system
will enter into a fail-safe mode [125] thereupon. For many of the functional safety
applications like aircraft autopilot, autonomous driving, etc., availability is an equally
important requirement along with functional safety. Such systems cannot stop on
detection of a fault and have to be fail-operational [126]. Traditional fault
tolerant architectures for high availability hold / store the critical variables with a
redundancy of three. Safety critical application threads are executed with a redundancy of
three and majority voting is performed using the outputs from the three execution streams
to determine the final output. This implementation results in a threefold increase in
memory and MIPS overhead. In this section, we extend the selective redundant execution
approach to meet the fail-operational requirements, but at a much lower overhead.
Figure 6.15. Selective redundant execution for error recovery. [timeline: redundant threads RT-1 and RT-2 execute and a software voting and check-pointing thread (V) compares their outputs; on a voting pass, execution continues; on a voting fail, the third redundant thread RT-3 is executed, majority voting identifies the correct output, and an error recovery thread (R) recovers the erroneous thread]
The proposed selective redundant execution for error recovery is shown in Figure
6.15. The various steps involved in this method are listed below.
(i) Three different copies of the variables are saved (check-pointed) at
predetermined phases of the application.
(ii) Safety critical application threads are executed with a redundancy of two.
(iii) Outputs of the two execution streams are compared. If the outputs match, critical
variables are check-pointed with a redundancy of three.
(iv) If the outputs do not match, the third redundant thread is executed and majority
voting is performed to identify the correct output. Once the correct output is
ascertained, the states (flip-flops) leading to the correct output are identified and
check-pointed.
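Steps (i) to (iv) above can be sketched as follows. This is a minimal single-threaded illustration; state_t, checkpoint3, duplex_step and the fault-injecting demo thread are hypothetical names, and a real implementation would run the redundant streams as separate threads and checkpoint at predetermined application phases.

```c
#include <stdint.h>
#include <string.h>

/* Assumes at most one faulty execution stream per step. */
#define N_VARS 4

typedef struct { int32_t vars[N_VARS]; } state_t;

/* Step (i): three check-pointed copies of the critical variables. */
static state_t ckpt[3];

static void checkpoint3(const state_t *s) {
    ckpt[0] = *s; ckpt[1] = *s; ckpt[2] = *s;
}

/* Fault injection knobs for the demo thread below. */
static int call_count = 0;
static int fault_on_call = -1;

static void demo_thread(const state_t *in, state_t *out) {
    for (int i = 0; i < N_VARS; i++) out->vars[i] = in->vars[i] + 1;
    if (call_count++ == fault_on_call) out->vars[0] ^= 1; /* bit flip */
}

/* Steps (ii)-(iv): execute with a redundancy of two, fall back to a
 * third run and 2-of-3 voting on a miscompare.
 * Returns 1 if the third thread (RT-3) was needed, 0 otherwise. */
static int duplex_step(state_t *out,
                       void (*run)(const state_t *, state_t *)) {
    state_t r1, r2, r3;
    run(&ckpt[0], &r1);
    run(&ckpt[1], &r2);
    if (memcmp(&r1, &r2, sizeof r1) == 0) {
        *out = r1;
        checkpoint3(out);              /* outputs match: re-checkpoint */
        return 0;
    }
    run(&ckpt[2], &r3);                /* third redundant thread */
    *out = (memcmp(&r1, &r3, sizeof r1) == 0) ? r1 : r3; /* majority */
    checkpoint3(out);                  /* checkpoint the voted output */
    return 1;
}
```

Executing the third thread only on a miscompare is the key design choice: in error-free operation the redundancy stays at two, which is the source of the overhead savings over TMR.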
The comparison of the proposed approach with the traditional approach (triple modular redundancy implemented in software) is shown in Table 6.3.
In order to evaluate the benefits of the proposed approach, we performed the system recovery implementation on the set of illustrative control functions. The memory and MIPS overhead savings for a no-error scenario (with no RT-3 as indicated in Figure 6.15) when compared to the traditional Triple Modular Redundancy (TMR) implementation in software (having the same error tolerance) are indicated in Figure 6.16. As can be noted from the figure, the memory overhead savings vary from 7% to 54% depending on the application. Similarly, the MIPS overhead savings vary from 36% to 61% depending upon the application (i.e. the number of flip-flops identified as critical within each control function). Upon an error (phase RT-3 indicated in Figure 6.15), the overhead depends on when the fault is detected and how many register values must be restored.

Table 6.3. Comparison of traditional and proposed fault tolerant approaches.

(1) Traditional approach: All variables are saved with a redundancy of three.
    Proposed approach: Only critical variables identified using the application based functional safety evaluation are stored redundantly.
(2) Traditional approach: All application threads (irrespective of whether the thread is safety critical or not) are executed with a redundancy of three.
    Proposed approach: Only safety critical application threads are redundantly executed.
(3) Traditional approach: A redundancy of three is maintained irrespective of the occurrence of an error.
    Proposed approach: A redundancy of two is used in error-free conditions.
6.3.2.3 Practical Considerations in Implementing Redundancy
A typical safety critical application consists of both safety critical and non-safety
critical application threads. A safety critical application thread can be further divided into
various phases where the different phases can be classified as safety critical or not. For
example, a control loop operation may bypass certain computations and hold the previous values if there is no set-point or load change.

Figure 6.16. Memory and MIPS overhead reduction for proposed system recovery approach as compared to TMR implemented in software. [bar chart: memory and MIPS overhead reduction, 0% to 70%, across the illustrative control functions]

Figure 6.17. Protection approaches for safety critical application. [timeline: threads T1 (safety critical), T2 (non-safety critical) and T3 (safety critical) across phases P1 to P9; S marks the safety critical phases; Approach 1 protects registers throughout the application, Approach 2 during safety critical application threads, and Approach 3 only during the safety critical phases of an application thread]

The application safety requirements during
different time segments are represented in Figure 6.17. The different application threads
are shown as T1 – T3 and different phases are shown as P1 – P9. The safety critical
application phases are marked S.
The proposed redundancy based approach (both for fault detection and system
recovery) can be implemented in three ways.
(i) Approach 1: Once the critical flip-flops are identified, they are protected
throughout the application. The protection for the flip-flops remains in effect even during the execution of non-safety critical code.
(ii) Approach 2: The critical flip-flops are protected only during the critical task
execution phase of the application.
(iii) Approach 3: The application is profiled to understand the scenarios under which
a flip-flop is classified as critical. During the application execution, an
independent checker is run alongside to identify whether a certain application
phase is critical or not. Critical flip-flops are protected for the phases in which
the application thread is classified as critical.
Approach 1 is sub-optimal in implementation as the critical registers are
continuously protected irrespective of whether the thread is safety critical or not.
Approach 3 requires the identification of critical flip-flops during application execution.
This would require additional hardware that monitors the application and its input values, and determines whether a flip-flop needs to be protected. This also leads to a significant increase in the computational complexity. Due to the limitations of
Approach 1 and Approach 3, Approach 2 is better suited for a practical implementation.
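Approach 2 can be sketched as a scheduler hook that enables the redundant copy only while a safety critical thread executes. The names used here (on_thread_dispatch, protected_store, protected_check) are illustrative assumptions, not from the thesis.

```c
#include <stdbool.h>
#include <stdint.h>

static bool protection_enabled = false;
static int32_t shadow;   /* redundant copy of one critical variable */

/* Scheduler hook: turn duplication on only while a safety critical
 * application thread is running. */
static void on_thread_dispatch(bool thread_is_safety_critical) {
    protection_enabled = thread_is_safety_critical;
}

/* Store to a critical variable; the shadow copy is maintained only
 * while protection is enabled. */
static void protected_store(int32_t *var, int32_t value) {
    *var = value;
    if (protection_enabled) shadow = value;
}

/* Consistency check before the value is consumed; always passes while
 * protection is disabled (non-safety critical phases). */
static bool protected_check(const int32_t *var) {
    return !protection_enabled || *var == shadow;
}
```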
In addition to the fault tolerance requirements of the critical tasks which are part of
the application, there are requirements with respect to prevention of fault propagation
from a task of lower criticality to that of higher criticality [127,128]. However, such
requirements are not addressed as part of this thesis, as these are considered independent
of the application tolerance.
6.4 Conclusion
This chapter proposed two new application based techniques for robust execution
in the presence of faults in critical components. The first technique is based on changing the application execution (e.g. the control loop frequency) to reduce the number of critical
components. It utilizes the additional time tolerance (measured in terms of number of
control loop cycles) gained as a result of increased control loop frequency to protect the
identified critical flip-flops. For the AC inverter application, we observed 6% reduction
in the number of critical flip-flops when the time tolerance is doubled. However,
experiments on the BLDC motor control application indicated that the approach cannot
be generically applied to all systems. We assessed the conditions under which the
proposed approach can be applied.
The second technique used selective redundant execution for protecting critical components. Experimental results on representative control functions indicate 5% to 41%
reduction in both memory footprint and MIPS overhead when compared to EDDI
approach. An enhancement to aid system recovery in case of detection of a fault is also
proposed. Compared to software based TMR approach, the proposed approach resulted in
7% to 54% reduction in memory footprint and 36% to 61% reduction in MIPS. We also
evaluated how the two proposed approaches can be applied to a complete application
which comprises both safety critical and non-safety critical threads.
7. Conclusions and Future Work
The use of Integrated Circuit (IC) components in end applications continues to rise,
particularly in safety critical applications like automotive, industrial, navigation and
medical. Due to the complexity of functions implemented using semiconductor devices
and the complexity of the device manufacturing process, a semiconductor component can
fail in multiple different ways, thus increasing the risk to the application. This risk is
addressed by having additional mechanisms to enable timely detection of faults before
they can lead to a catastrophic application failure. These safety mechanisms must be
optimal, i.e. must incur low overhead in terms of area, power, application MIPS, etc.,
while ensuring the required levels of safety. However, the methods and techniques
available and deployed today do not address these requirements in an effective manner.
The low overhead solutions are less comprehensive, while more comprehensive solutions
come with significant overhead. This thesis addresses these challenges by presenting a
set of techniques for performing comprehensive functional safety analysis to enable
adequate protection. It also proposes new techniques to offer protection while incurring
lower overheads.
Chapter 2 gave an overview of semiconductor functional safety as practiced in
industry today. It analysed the application level implications of semiconductor failures
with a representative EV Traction application, different functional safety standards
covering the different end applications, and different safety analysis techniques.
Understanding the functional safety compliant IC development process helped to
highlight the dual challenges of comprehensiveness of safety analysis and reducing the
additional overhead incurred due to functional safety.
Chapter 3 introduced a new safety analysis technique whereby the tolerance
available in the safety critical applications is included while performing the safety
analysis, thereby limiting the hardware overhead incurred due to safety. For the safety
analysis to utilize the application tolerance, we proposed a new technique to map the
application tolerance as value tolerance and time tolerance at the IC level and for the
modules inside it. Experiments were performed on two real-life applications, a brushless
DC motor control system and an AC inverter control system. We also showed how the
application level tolerance can be mapped to different modules internal to the IC at
different abstraction levels.
Chapter 4 demonstrated the use of formal techniques to identify critical flip-flops in
the presence of tolerance. It also indicated ways by which application specific behaviours
(including tolerance) can be included in the analysis framework to obtain a more accurate
(less pessimistic) reliability assessment, and hence incur lesser design cost for robustness.
Experimental results were provided on benchmark circuits and two industrial circuits.
Chapter 5 proposed a perturbation based workload augmentation technique for
performing comprehensive functional safety evaluation. Experiments were performed on
a set of safety critical control functions and a real-life application, and the effectiveness of the proposed method in identifying additional critical flip-flops was demonstrated.
Through these experiments, we have illustrated how optimisations in safety evaluation
methods using application level tolerances can be traded off with additional workloads to
arrive at a more comprehensive list of critical flip-flops to meet the overall hardware
overhead and reliability requirements. (This helped identify directed variants of workloads, as opposed to unconstrained search using formal techniques.)
Chapter 6 proposed two new application based techniques for robust execution in
the presence of faults in critical components. The first technique was based on changing
the application execution, (e.g. control loop frequency), to reduce the number of critical
components. The second technique used selective redundant execution for protecting
critical components while still reducing the memory footprint and MIPS overhead.
7.1 Future Work
This thesis has proposed methods for optimal functional safety analysis of ICs,
application driven identification of a minimal number of critical flip-flops, and techniques
to protect the critical flip-flops using application profiling for safety. We consider a few
potential directions to enhance this work.
In this thesis, we have considered Single Event Transient (SET) events leading to
Single Event Upsets (SEUs) for safety analysis. SETs leading to Multiple Bit Upsets
(MBU) and Multiple Event Transients (METs) are not considered. As newer technologies
are getting rapidly adopted into automotive applications with the increase in performance
requirements for ADAS, MBUs and METs are also likely to occur. Hence, in order to
scale the proposed approach for automotive with newer technology semiconductors, these
effects must also be considered.
We have demonstrated the use of Formal Verification (FV) techniques for
functional safety analysis. However, FV techniques in the presence of large workloads
(with and without perturbation) for complex control functions have not been investigated.
Options include partitioning larger circuits and workloads into smaller ones, and abstracting portions of the circuit into behavioural models (with optional use of assertions) to reduce analysis complexity while avoiding state space explosion.
The work in this thesis mainly investigates the control and protection provided by
digital circuits interacting with the physical system. Analog functions, e.g. data
converters, etc. which can also implement safety critical functions are not considered.
Their investigation will require analysis of newer transient conditions and associated faults, apportioning of function tolerance to sub-functions, simulation artefacts for abstracted models, and accuracy versus speed tradeoffs.
The above investigations can build on the contributions of this thesis. The solutions
thus obtained will help in the development of future safety critical applications.
References
[1] D. Lorenz, G. Georgakos, and U. Schlichtmann, “Aging analysis of circuit timing
considering NBTI and HCI,” in International On-Line Testing Symposium, 2009.
[2] R. C. Baumann, “Radiation induced soft errors in advanced semiconductor
technologies,” IEEE Transactions on Device and Materials Reliability, 2005.
[3] M. Alam, “Reliability and process variation aware design of integrated circuits,”
Journal for Microelectronics Reliability, Elsevier, 2008.
[4] R. Mariani and G. Boschi, “A systematic approach for failure modes and effects
analysis of system-on-chips,” in International On-Line Testing Symposium, 2007.
[5] R. Isermann, R. Schwarz, and S. Stolzl, “Fault-tolerant drive-by-wire systems,” IEEE
Control Systems, 2002.
[6] T. Calin, M. Nicolaidis, and R. Velazco, “Upset hardened memory design for
submicron cmos technology,” IEEE Transactions on Nuclear Science, 1996.
[7] V. Prasanth, V. Singh, and R. Parekhji, “Derating based hardware optimizations in
soft error tolerant designs,” in VLSI Test Symposium, 2012.
[8] P. Subramanyan, V. Singh, K. K. Saluja, and E. Larsson, “Multiplexed redundant
execution: A technique for efficient fault tolerance in chip multiprocessors,” in
Design, Automation & Test in Europe, 2010.
[9] R. R. Schaller, “Moore’s law: past, present and future,” IEEE spectrum, 1997.
[10] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R. LeBlanc,
“Design of ion-implanted MOSFET’s with very small physical dimensions,” IEEE
Journal of Solid-State Circuits, 1974.
[11] R. Saleh, S. Wilton, S. Mirabbasi, A. Hu, M. Greenstreet, G. Lemieux, P. P. Pande,
C. Grecu, and A. Ivanov, “System-on-chip: Reuse and integration,” Proceedings of
the IEEE, 2006.
[12] W. Wolf, A. A. Jerraya, and G. Martin, “Multiprocessor system-on-chip
technology,” IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, 2008.
[13] D. Edenfeld, A. B. Kahng, M. Rodgers, and Y. Zorian, “2003 technology roadmap
for semiconductors,” Computer, 2004.
[14] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. S. Chung,
“Accelerating deep convolutional neural networks using specialized hardware,”
Microsoft Research Whitepaper, 2015.
[15] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun,
S. Zhao, H. Larochelle, D. Englund et al., “Deep learning with coherent
nanophotonic circuits,” Nature Photonics, 2017.
[16] S.-C. Lin, Y. Zhang, C.-H. Hsu, M. Skach, M. E. Haque, L. Tang, and J. Mars, “The
architectural implications of autonomous driving: Constraints and acceleration,” in
ACM SIGPLAN, 2018.
[17] S. Liu, J. Tang, Z. Zhang, and J.-L. Gaudiot, “Computer architectures for
autonomous driving,” Computer, 2017.
[18] A. Hayek and J. Börcsök, “Safety chips in light of the standard IEC 61508: survey
and analysis,” in International Symposium on Fundamentals of Electrical
Engineering, 2014.
[19] E. Ugljesa and J. Börcsök, “Evaluation of sophisticated hardware architectures for
safety applications,” in International Symposium on Information, Communication
and Automation Technologies, 2009.
[20] W. M. Goble and H. Cheddie, Safety Instrumented Systems verification: practical
probabilistic calculations. ISA, 2004.
[21] D. H. Stamatis, Failure mode and effect analysis: FMEA from theory to execution.
ASQ Quality Press, 2003.
[22] R. Mariani, G. Boschi, and F. Colucci, “Using an innovative SoC-level FMEA
methodology to design in compliance with IEC61508,” 2007.
[23] A. Mearns, “Fault tree analysis- the study of unlikely events in complex systems,” in
System Safety Symposium, Seattle, Wash, 1965.
[24] P. Koopman, “A case study of Toyota unintended acceleration and software safety,”
2014. [Online]. Available: https://users.ece.cmu.edu/~koopman/toyota/koopman-09-
18-2014_toyota_slides.pdf
[25] R. E. Cole, “What really happened to Toyota?” MIT Sloan Management Review,
2011.
[26] “Ford issues extensive recall on f-150 models over downshifting problem.” [Online].
Available: https://www.hlmlawfirm.com/blog/ford-issues-extensive-recall-on-f-150-
models-over-downshifting-problem/
[27] K. Kalaignanam, T. Kushwaha, and M. Eilert, “The impact of product recalls on
future product reliability and future accidents: Evidence from the automobile
industry,” Journal of Marketing, 2013.
[28] N. A. Stanton, P. M. Salmon, G. H. Walker, and M. Stanton, “Models and methods
for collision analysis: A comparison study based on the uber collision with a
pedestrian,” Safety Science, 2019.
[29] N. Bomey, “Uber self-driving car crash: Vehicle detected Arizona pedestrian 6 seconds before accident,” USA Today, https://www.usatoday.com/story/money/cars/2018/05/24/uber-self-driving-car-crash-ntsb-investigation/640123002, 2018.
[30] V. A. Banks, K. L. Plant, and N. A. Stanton, “Driver error or designer error: Using
the perceptual cycle model to explore the circumstances surrounding the fatal Tesla crash on 7th May 2016,” Safety Science, 2018.
[31] F. Lambert, “Understanding the fatal Tesla accident on Autopilot and the NHTSA probe,” Electrek, July 2016.
[32] IEC 61508, International standard for functional safety of electrical / electronic /
programmable electronic safety-related systems, 2010.
[33] ISO 26262, International standard for functional safety of electrical and electronic
systems in production automobiles, 2018.
[34] RTCA DO-178B, Software considerations in airborne systems and equipment certification. RTCA, Incorporated, 1992.
[35] M. Ebrahimi, A. Evans, M. B. Tahoori, R. Seyyedi, E. Costenaro, and
D. Alexandrescu, “Comprehensive analysis of alpha and neutron particle-induced
soft errors in an embedded processor at nanoscales,” in Design, Automation & Test
in Europe, 2014.
[36] S. Mukherjee, Architecture design for soft errors. Morgan Kaufmann, 2011.
[37] Q. Zhao and J. Jiang, “Reliable state feedback control system design against actuator
failures,” Automatica, 1998.
[38] G.-H. Yang, J. L. Wang, and Y. C. Soh, “Reliable h-infinity controller design for
linear systems,” Automatica, 2001.
[39] C. T. Doug Parker, “Winning share in automotive semiconductor,” 2013. [Online].
Available: http://www.mckinsey.com/~/media/mckinsey/dotcom/client_service/-
semiconductors/issue%203%20autumn%202013/pdfs/-
5_automotivesemiconductors.ashx
[40] “Featured applications for real-time control.” [Online]. Available: http://-
www.ti.com/lsds/ti/microcontrollers-16-bit-32-bit/c2000-performance/real-time-
control/applications-featured-applications.page
[41] R. Mariani, “Applying iso 26262 to adas and automated driving,” in AutoSens, 2014.
[42] IEC 60880, International standard for Nuclear power plants - Instrumentation and
control systems important to safety - Software aspects for computer-based systems
performing category A functions, 2006.
[43] EN 50128, International standard for Railway applications-Communication,
Signaling and Processing Systems-Software for Railway Control and Protection
Systems, 2011.
[44] IEC 60601, International standard for common aspects of electrical equipment used
in medical practice, 2015.
[45] IEC 62061, International standard for safety of machinery, 2005.
[46] ISO 13849, International standard for Safety of machinery—Safety-related parts of
control systems, 2015.
[47] IEC 61800, International standard for adjustable speed electrical power drive
systems, 2017.
[48] IEC 60730, International standard for household and similar electrical appliances
safety, 2003.
[49] SAE J2980, Considerations for ISO 26262 ASIL Hazard Classification, 2015.
[50] T. Stolte, G. Bagschik, A. Reschka et al., “Hazard analysis and risk assessment for
an automated unmanned protective vehicle,” arXiv, 2017.
[51] R. Schneider, W. Brandstaetter, M. Born, O. Kath, T. Wenzel, R. Zalman, and
J. Mayer, “Safety element out of context - a practical approach,” SAE Technical
Paper, 2012.
[52] B. Peng, Y. Chen, S.-Y. Kuo, and C. Bolger, “IC HTOL test stress condition
optimization,” in IEEE International Symposium on Defect and Fault Tolerance in
VLSI Systems, 2004.
[53] P. W. Lisowski and K. F. Schoenberg, “The Los Alamos neutron science center,”
Nuclear Instruments and Methods in Physics Research Section A: Accelerators,
Spectrometers, Detectors and Associated Equipment, 2006.
[54] M. Bellotti and R. Mariani, “How future automotive functional safety requirements
will impact microprocessors design,” Microelectronics Reliability, 2010.
[55] G.-A. Klutke, P. C. Kiessler, and M. A. Wortman, “A critical look at the bathtub
curve,” IEEE Transactions on Reliability, 2003.
[56] Bathtub curve. [Online]. Available: https://en.wikipedia.org/wiki/Bathtub_curve
[57] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw,
T. Austin, K. Flautner et al., “Razor: A low-power pipeline based on circuit-level
timing speculation,” in Microarchitecture, 2003.
[58] M. Zhang, S. Mitra, T. Mak, N. Seifert, N. J. Wang, Q. Shi, K. S. Kim, N. R.
Shanbhag, and S. J. Patel, “Sequential element design with built-in soft error
resilience,” IEEE Transactions on Very Large Scale Integration Systems, 2006.
[59] V. Prasanth, V. Singh, and R. Parekhji, “Robust detection of soft errors using
delayed capture methodology,” in International On-Line Testing Symposium, 2010.
[60] N. D. P. Avirneni, V. Subramanian, and A. K. Somani, “Low overhead soft error
mitigation techniques for high-performance and aggressive systems,” Dependable
Systems & Networks, 2009.
[61] V. Prasanth, V. Singh, and R. Parekhji, “Reduced overhead soft error mitigation
using error control coding techniques,” in International On-Line Testing Symposium,
2011.
[62] E. T. Grochowski, W. Rash, N. Quach, H. Nguyen, and A. Rabago, “Microprocessor
with dual execution core operable in high reliability mode,” Patent 6615366, 2003.
[63] S. Banerjee, A. Chatterjee, and J. A. Abraham, “Efficient cross-layer concurrent
error detection in nonlinear control systems using mapped predictive check states,”
in International Test Conference, 2016.
[64] V. Prasanth, R. Parekhji, and B. Amrutur, “Improved methods for accurate safety
analysis of real-life systems,” in Asian Test Symposium, 2015.
[65] M. A. Sabet, B. Ghavami, and M. Raji, “Gpu-accelerated soft error rate analysis of
large-scale integrated circuits,” IEEE Design & Test, 2018.
[66] H. Cho, S. Mirkhani, C.-Y. Cher, J. A. Abraham, and S. Mitra, “Quantitative
evaluation of soft error injection techniques for robust system design,” in Design
Automation Conference, 2013.
[67] I. Polian, J. P. Hayes, S. M. Reddy, and B. Becker, “Modeling and mitigating
transient errors in logic circuits,” IEEE Transactions on Dependable and Secure
Computing, 2011.
[68] L. Chen, M. Ebrahimi, and M. B. Tahoori, “CEP: Correlated Error Propagation for
hierarchical soft error analysis,” Journal of Electronic Testing, 2013.
[69] T. Maeba, M. Deng, A. Yanou, and T. Henmi, “Swing-up controller design for
inverted pendulum by using energy control method based on lyapunov function,” in
IEEE International Conference on Modelling, Identification and Control, 2010.
[70] M. I. Momtaz, S. Banerjee, and A. Chatterjee, “Real-time DC motor error detection
and control compensation using linear checksums,” in VLSI Test Symposium, 2016.
[71] Standardized e-gas monitoring concept for gasoline and diesel engine control units.
[Online]. Available: https://www.iav.com/sites/default/files/attachments/seite/ak-
egas-v6-0-en-150922_1.pdf
[72] D. Geyer, M. Kick, and M. Kraus, “Monitoring the functional reliability of an
internal combustion engine,” Patent 8392046, 2013.
[73] P. Pisu, Fault Detection and Isolation with Applications to Vehicle Systems.
Springer, 2016.
[74] A. Kohn, R. Schneider, A. Vilela, U. Dannebaum, and A. Herkersdorf, “Markov
chain-based reliability analysis for automotive fail-operational systems,” SAE
International Journal of Transportation Safety, 2017.
[75] Enhanced Capture Module (eCAP) Reference Guide. [Online]. Available: http://-
www.ti.com/lit/ug/sprufz8a/sprufz8a.pdf
[76] Enhanced Pulse Width Modulator (ePWM) Reference Guide. [Online]. Available:
http://www.ti.com/lit/ug/spruge9e/spruge9e.pdf
[77] Y.-S. Kung, N. V. Quynh, N. T. Hieu, C.-C. Huang, and L.-C. Huang,
“Simulink/Modelsim co-simulation and FPGA realization of speed control IC for
PMSM drive,” Procedia Engineering, 2011.
[78] C. Bottoni, M. Glorieux, J. Daveau, G. Gasiot, F. Abouzeid, S. Clerc, L. Naviner,
and P. Roche, “Heavy ions test result on a 65nm Sparc-v8 radiation-hard
microprocessor,” in IEEE International Reliability Physics Symposium, 2014.
[79] DRV8312 - Three Phase Brushless DC Motor Driver IC. [Online]. Available: http://-
www.ti.com/product/DRV8312
[80] F2805x - Real time control MCU. [Online]. Available: http://www.ti.com/product/-
TMS320F28055
[81] Texas Instruments Development Kit Application Note. [Online]. Available: http://-
www.ti.com/tool/TMDSSOLARPEXPKIT
[82] C.-M. Ong, Dynamic simulation of electric machinery: using MATLAB/SIMULINK.
Prentice hall, 1998.
[83] S. Mirkhani and J. A. Abraham, “Fast evaluation of test vector sets using a
simulation-based statistical metric,” in VLSI Test Symposium (VTS), 2014.
[84] A. L. Silburt, A. Evans, I. Perryman, S.-J. Wen, and D. Alexandrescu, “Design for
soft error resiliency in internet core routers,” IEEE Transactions on Nuclear Science,
2009.
[85] G. Boschi, R. Mariani, and S. Lorenzini, “A verification strategy for fault-detection
and fault-tolerance circuits,” in International On-Line Testing Symposium, 2011.
[86] R. Leveugle, “A new approach for early dependability evaluation based on formal
property checking and controlled mutations,” in International On-Line Testing
Symposium, 2005.
[87] S. A. Seshia, W. Li, and S. Mitra, “Verification guided soft error resilience,” Design
Automation and Test in Europe, 2007.
[88] G. Fey and R. Drechsler, “A basis for formal robustness checking,” International
Symposium on Quality Electronic Design, 2008.
[89] G. Fey, A. Sülflow, and R. Drechsler, “Computing bounds for fault tolerance using
formal techniques,” in Design Automation Conference, 2009.
[90] U. Krautz, M. Pflanz, C. Jacobi, H.-W. Tast, K. Weber, and H. T. Vierhaus,
“Evaluating coverage of error detection logic for soft errors using formal methods,”
Design Automation and Test in Europe, 2006.
[91] M. Breuer, “Hardware that produces bounded rather than exact results,” in Design
Automation Conference, 2010.
[92] D. Holcomb, W. Li, and S. A. Seshia, “Design as you see fit: System-level soft error
analysis of sequential circuits,” Design Automation and Test in Europe, 2009.
[93] Cadence Incisive Enterprise Verifier. [Online]. Available: http://www.cadence.com/-
products/fv/enterprise_verifier/pages/default.aspx
[94] F. Corno, M. S. Reorda, and G. Squillero, “RT-level ITC’99 benchmarks and first
ATPG results,” IEEE Design & Test of Computers, 2000.
[95] A. Benso and P. Prinetto, Fault injection techniques and tools for embedded systems
reliability evaluation. Springer Science & Business Media, 2003.
[96] E. H. Cannon, D. D. Reinhardt, M. S. Gordon, and P. S. Makowenskyj, “SRAM
SER in 90, 130 and 180 nm bulk and SOI technologies,” in IEEE International Reliability Physics Symposium, 2004.
[97] K. Greb and D. Pradhan, “Hercules microcontrollers: Real-time mcus for safety-
critical products,” white paper, 2011.
[98] R. Venkatasubramanian, J. P. Hayes, and B. T. Murray, “Low-cost on-line fault
detection using control flow assertions,” in IEEE On-Line Testing Symposium, 2003.
[99] K.-H. Huang and J. Abraham, “Algorithm-based fault tolerance for matrix
operations,” IEEE transactions on computers, 1984.
[100] A. Benso, S. Chiusano, P. Prinetto, and L. Tagliaferri, “A C/C++ source-to-source
compiler for dependable applications,” in International Conference on Dependable
Systems and Networks, 2000.
[101] N. Oh, P. P. Shirvani, and E. J. McCluskey, “Error detection by duplicated
instructions in super-scalar processors,” IEEE Transactions on Reliability, 2002.
[102] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August, “Swift:
Software implemented fault tolerance,” in International Symposium on Code
generation and optimization, 2005.
[103] S. Rehman, K.-H. Chen, F. Kriebel, A. Toma, M. Shafique, J.-J. Chen, and
J. Henkel, “Cross-layer software dependability on unreliable hardware,” IEEE
Transactions on Computers, 2015.
[104] J. Henkel, L. Bauer, H. Zhang, S. Rehman, and M. Shafique, “Multi-layer
dependability: From microarchitecture to application level,” in 2014 51st
ACM/EDAC/IEEE Design Automation Conference (DAC), 2014.
[105] G. Hubert, L. Artola, and D. Regis, “Impact of scaling on the soft error sensitivity
of bulk, FDSOI and FinFET technologies due to atmospheric radiation,” Integration,
the VLSI journal, 2015.
[106] P. Hazucha, T. Karnik, S. Walstra, B. A. Bloechel, J. W. Tschanz, J. Maiz,
K. Soumyanath, G. E. Dermer, S. Narendra, V. De et al., “Measurements and
analysis of SER-tolerant latch in a 90-nm dual-Vt CMOS process,” IEEE Journal of
Solid-State Circuits, 2004.
[107] T.-Y. Wuu and S. B. Vrudhula, “A design of a fast and area efficient multi-input
muller c-element,” IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, 1993.
[108] R. Naseer and J. Draper, “DF-DICE: A scalable solution for soft error tolerant
circuit design,” in IEEE International Symposium on Circuits and Systems,
2006.
[109] X. Iturbe, B. Venu, E. Ozer, and S. Das, “A Triple Core Lock-Step (TCLS) ARM
Cortex-R5 Processor for Safety-Critical and Ultra-Reliable Applications,” in
International Conference on Dependable Systems and Networks, 2016.
[110] M. Werner, K. Devarajegowda, M. Chaari, and W. Ecker, “Increasing soft error
resilience by software,” in Design Automation Conference, 2019.
[111] D. J. Lu, “Watchdog processors and structural integrity checking,” IEEE
Transactions on Computers, 1982.
[112] O. Goloubeva, M. Rebaudengo, M. S. Reorda, and M. Violante, “Soft-error
detection using control flow assertions,” in IEEE Symposium on Defect and Fault
Tolerance in VLSI Systems, 2003.
[113] N. Oh, P. P. Shirvani, and E. J. McCluskey, “Control-flow checking by software
signatures,” IEEE Transactions on Reliability, 2002.
[114] S. Schuster, P. Ulbrich, I. Stilkerich, C. Dietrich, and W. Schröder-Preikschat,
“Demystifying soft-error mitigation by control-flow checking: A new perspective on
its effectiveness,” ACM Transactions on Embedded Computing Systems, 2017.
[115] V. Sridharan and D. R. Kaeli, “Eliminating microarchitectural dependency from
architectural vulnerability,” in IEEE International Symposium on High Performance
Computer Architecture, 2009.
[116] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin, “A
systematic methodology to compute the architectural vulnerability factors for a high-
performance microprocessor,” in IEEE/ACM International Symposium on
Microarchitecture, 2003.
[117] X. Li, S. V. Adve, P. Bose, and J. A. Rivers, “Online estimation of architectural
vulnerability factor for soft errors,” in ACM SIGARCH Computer Architecture News,
2008.
[118] L. Chen and A. Avizienis, “N-version programming: A fault-tolerance approach to
reliability of software operation,” in International Symposium on Fault-Tolerant
Computing, 1995.
[119] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, D. I. August, and S. S.
Mukherjee, “Design and evaluation of hybrid fault-detection systems,” in ACM
SIGARCH Computer Architecture News, 2005.
[120] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, D. I. August, and S. S.
Mukherjee, “Software-controlled fault tolerance,” ACM Transactions on
Architecture and Code Optimization (TACO), 2005.
[121] M. Imai, Y. Takeuchi, K. Sakanushi, and N. Ishiura, “Advantage and possibility of
application-domain specific instruction-set processor (ASIP),” IPSJ Transactions on
System LSI Design Methodology, 2010.
[122] P. Winokur, G. Lum, M. Shaneyfelt, F. Sexton, G. Hash, and L. Scott, “Use of
COTS microelectronics in radiation environments,” IEEE Transactions on Nuclear
Science, 1999.
[123] J. Yiu, The definitive guide to the ARM Cortex-M3. Newnes, 2009.
[124] S. Vestal, “Preemptive scheduling of multi-criticality systems with varying degrees
of execution time assurance,” in 28th IEEE International Real-Time Systems
Symposium, 2007.
[125] R. Mariani and P. Fuhrmann, “Comparing fail-safe microcontroller architectures in
light of IEC 61508,” in 22nd IEEE International Symposium on Defect and Fault-
Tolerance in VLSI Systems (DFT 2007), 2007.
[126] A. Kohn, M. Käßmeyer, R. Schneider, A. Roger, C. Stellwag, and A. Herkersdorf,
“Fail-operational in safety-related automotive multi-core systems,” in IEEE
International Symposium on Industrial Embedded Systems, 2015.
[127] A. Burns and R. I. Davis, “A survey of research into mixed criticality systems,”
ACM Computing Surveys (CSUR), 2018.
[128] S. Fei, G. Prashant, and Z. Min, “On freedom from interference in mixed criticality
systems: A causal learning approach,” in International Test Conference, 2019.