Application Aware Functional Safety
Analysis Techniques
A Thesis Submitted
for the Degree of Doctor of Philosophy
in the Faculty of Engineering
by
Prasanth V
Electrical Communication Engineering
Indian Institute of Science
Bangalore – 560 012
November 28, 2019
© Copyright by Prasanth V, 2019
All rights reserved
Abstract
Integrated Circuits (ICs) are used to realize a multitude of real-life systems. These
real-life systems have ICs interacting with physical systems (this combination being
referred to as hybrid systems), and many of them are used in safety critical applications. The
implications of a fault in any of the constituent components of the system must be
analysed and appropriately addressed to mitigate its potentially dangerous after-effects.
Given the increasing dependence on ICs to meet the functional requirements of safety
critical applications, safety analysis of ICs plays an important role in ensuring safety of
the application performed by the system.
When it comes to designing hybrid systems, we are gradually moving away from
the paradigm of independently designing the digital and physical parts of hybrid systems
towards simultaneous considerations for both. This helps in providing an optimal system
design solution. However, the same does not hold true when it comes to safety analysis.
Today, safety analysis of ICs used in such systems is typically done in isolation of the
end application and associated physical system due to practical considerations like safety
analysis complexity, lack of a proper physical system model, etc. This results in the need
to take recourse to conservative design techniques incorporating costly redundancy.
Many hybrid systems have an acceptable tolerance determined by the application
due to the inertial nature of the physical system, error tolerance capability in closed loop
applications, built-in hardware and software functionality, etc. These tolerances can be
beneficially employed to reduce the hardware overhead required to implement safety. In
this thesis, we investigate the problem of building affordably robust soft-error resilient
systems based upon flip-flop protection. We develop methods to identify the minimal set
of critical flip-flops which must be protected in an integrated circuit, keeping in mind the
inherent tolerances (resiliency) of the system into which it is incorporated.
This thesis first proposes a set of techniques to map tolerances available at
application level, to the individual circuit modules of the IC, to make the safety analysis
less pessimistic. The circuit modules of the IC can then be analysed standalone using the
mapped tolerances to reduce the analysis pessimism.
Fault injection is the preferred technique used to ascertain the safety worthiness
(robustness) of the circuit. We then look at the limitations of fault injection based safety
analysis and propose the use of formal techniques to ensure a comprehensive analysis.
We show how the input constraints and output tolerances can be modelled in a formal
verification framework to reduce the analysis pessimism. We also propose a technique to
enhance the workload used for fault injection to provide a more comprehensive
functional safety analysis for larger modules which cannot be handled using formal
techniques.
Traditionally, protection for critical flip-flops is offered with the help of application
agnostic hardware or software based techniques. However, these may not offer the most
cost optimal solution. The thesis also proposes two application based techniques to
protect critical flip-flops which are identified through the functional safety analysis
process. These techniques also consider the practical scenario where the
application needs to be rendered safe (also termed as “safed”), even when the underlying
components used to build the system are not robust.
In summary, this thesis provides a set of techniques to address the application level
functional safety requirements with lower cost and better robustness. More specifically, it
proposes a divide and conquer approach to make application level functional safety
analysis feasible, demonstrates application of formal methods, illustrates workload
augmentation based techniques to render the safety analysis comprehensive, and
addresses the requirements for the development of functional safety systems using
non-robust components.
Acknowledgements
This work, in its present form, would not have been possible without the kind
support and help from many individuals. I take this opportunity to express my gratitude
to the people who have been instrumental in the successful completion of this thesis.
I owe my sincere thanks to my advisors, Dr. Rubin Parekhji and Prof. Bharadwaj
Amrutur, without whom this work would not have been possible. Right from the moment
I expressed my wish to pursue a PhD, Dr. Rubin Parekhji has provided guidance,
encouragement and support. During the course of the work, we did explore different
adjacent topics and Dr. Parekhji always helped to steer it in the right direction. We had
long and continuous discussions on these which helped to shape the thesis the way it is
today. Prof. Bharadwaj Amrutur provided the motivation to go out of my comfort zone
and take an altogether new problem for the thesis work. He has constantly monitored the
progress and provided the right guidance in taking the research forward. This helped me
gain confidence in exploring the unknown and come up with good solutions.
I am thankful to all my teachers at IISc, especially in the ECE and DESE
departments, for the wonderful lectures I got to attend.
I owe my sincere thanks to Padmini Sampath and Sumedha Limaye for supporting
my wish to pursue higher studies and agreeing to sponsor me for the research program
(through Texas Instruments (India) Pvt. Ltd., Bangalore). I am grateful to my mentors at
Texas Instruments Shailesh Ghotgalkar, Jaya Singh, Venkatesh Natarajan, Srivaths Ravi
and Bharat Rajaram for supporting me and helping me at different times during the
course of my PhD. They have made sure that I have the required support to pursue
research and come out successful.
In addition to this, many colleagues from Texas Instruments and IISc have helped
me with the experiments, providing suggestions and directions. The list is long and, to
name a few, they are Abhishek, Amal, Arif, Ashish, Chakra, Han, Jeff, Kaustubh, Nidhin,
Pooja, Prashant, Prashanth, Richard, Rupin, Sai and Swathi. Given that so many people
have helped me during the course of my work, I might have missed one or two names. I
would like to sincerely thank all my colleagues who have helped me.
Every PhD student has a story to tell of personal sacrifices and the struggle to
maintain the balance between personal life and PhD work. Mine is similar, if not a bit
worse, due to the addition of office commitments. Maintaining a fair balance among
research work, office assignments and family life always proved tough. I have prioritized
the first two many times, and my wife and two daughters have borne the brunt of it.
They are looking forward to me submitting my thesis. However, they have been quite
supportive, understanding and modulating their expectations during this time. I would like to
thank my parents, in-laws, sister, wife, daughters, colleagues and friends for providing
their full support and encouragement for my PhD ambitions.
Publications Based on This Thesis
1. Prasanth V, Rubin Parekhji, Amrutur Bharadwaj, “Improved Methods for Accurate
Safety Analysis of Real-life Systems,” in IEEE Asian Test Symposium, 2015.
2. Prasanth V, Rubin Parekhji, Amrutur Bharadwaj, “Safety Analysis for Integrated
Circuits in the Context of Hybrid Systems,” IEEE International Test Conference,
2017. (Selected for Honourable Mention Award)
3. Prasanth V, Rubin Parekhji, Amrutur Bharadwaj, “Perturbation based Workload
Augmentation for Comprehensive Functional Safety Analysis”, International
Conference on VLSI Design, 2019.
Related Publication and Presentations
4. Prasanth V, Rubin Parekhji, “Low Overhead Design and Test Techniques for
Application Specific Functional Safety,” Innovative Practices Session, VLSI Test
Symposium, 2017.
5. Prasanth V, David Foley, Srivaths Ravi, “Demystifying Automotive Safety and
Security for Semiconductor Developer,” IEEE International Test Conference, 2017.
6. Prasanth V, Srivaths Ravi, “Safety and Security in Automotive 2.0 Era”, half day
tutorial, Design, Automation and Test in Europe, 2019.
7. Prasanth V, Srivaths Ravi, “Safety and Security in Automotive 2.0 Era”, half day
tutorial, IEEE Asian Test Symposium, 2019.
Keywords
Functional safety, application tolerance, transient faults, soft errors, value tolerance,
time tolerance.
Contents
Abstract ................................................................................................................................ i
Acknowledgements ............................................................................................................ iii
Publications Based on This Thesis ..................................................................................... v
Keywords ........................................................................................................................... vi
Contents ............................................................................................................................ vii
List of Tables ...................................................................................................................... x
List of Figures .................................................................................................................... xi
1. Introduction ................................................................................................................. 1
1.1 Evolution of Functional Safety Systems .............................................................. 2
1.2 Integrated Circuits Functional Safety Concerns................................................... 2
1.3 IC Functional Safety Research Challenges .......................................................... 4
1.4 Contributions of This Thesis ................................................................................ 5
1.5 Thesis Organization.............................................................................................. 6
2. Functional Safety of Integrated Circuits ..................................................................... 8
2.1 Application Case Study: EV Traction System ..................................................... 8
2.2 Functional Safety Standards ............................................................................... 11
2.2.1 Deriving Semiconductor Safety Requirements from End Application ...... 12
2.2.2 SEooC Design Process ............................................................................... 14
2.3 IC Design Evaluation for Safety ........................................................................ 16
2.3.1 Types of Failures ........................................................................................ 16
2.3.2 Circuit Failure Mode Analysis: Qualitative ............................................... 18
2.3.3 Circuit Failure Mode Analysis: Quantitative ............................................. 20
2.4 Protecting Against Systematic and Random Failures ........................................ 21
2.4.1 Robust Development Process ..................................................................... 21
2.4.2 Safety Mechanisms ..................................................................................... 22
2.4.3 Development Process to Address Random Failures ................................... 24
2.5 Limitations with Existing Safety Analysis Methods .......................................... 25
2.5.1 Making Safety Analysis Comprehensive ................................................... 25
2.5.2 Reduction of Implementation Overheads ................................................... 27
3. Safety Analysis Pessimism Reduction by Utilizing Application Tolerance ............. 29
3.1 Background and Related Work .......................................................................... 31
3.2 Improved Safety Analysis Technique ................................................................ 33
3.2.1 Value and Time Tolerance ......................................................................... 33
3.2.2 Divide and Conquer Safety Analysis Approach ......................................... 36
3.3 Evaluation of Value Tolerance and Time Tolerance ......................................... 38
3.3.1 Determination of Tolerance Using Actual System ..................................... 38
3.3.2 Analytical Estimation of Tolerance Values ................................................ 47
3.3.3 Determination of Tolerance Using High Level Models ............................. 49
3.4 Conclusion .......................................................................................................... 51
4. Formal Verification Based Approach for Accurate Safety Analysis ........................ 52
4.1 Background and Related Work .......................................................................... 53
4.2 Improved Safety Analysis Framework ............................................................... 55
4.3 Analysis on Benchmark Circuits ........................................................................ 57
4.4 Analysis on Industrial Modules.......................................................................... 58
4.5 Conclusion .......................................................................................................... 60
5. Improved Fault Injection Based Safety Analysis Approaches ................................. 62
5.1 Fault Injection Based Safety Analysis Approach ............................................... 63
5.1.1 Experimental Setup .................................................................................... 64
5.2 Fault Injection Workload Analysis .................................................................... 68
5.3 Workload Perturbation Approach ...................................................................... 70
5.4 Experimental Results.......................................................................................... 71
5.4.1 Control Functions ....................................................................................... 71
5.4.2 Inverter Application ................................................................................... 73
5.5 Conclusion .......................................................................................................... 76
6. Application Driven Protection Mechanisms ............................................................. 77
6.1 Hardware Based Protection Techniques ............................................................ 77
6.1.1 Device Level Techniques ........................................................................... 77
6.1.2 Circuit Level Techniques ........................................................................... 78
6.1.3 Module Level Techniques .......................................................................... 80
6.2 Software Based Protection Techniques .............................................................. 80
6.2.1 Control Flow Checking .............................................................................. 81
6.2.2 Vulnerability Reduction Techniques .......................................................... 81
6.2.3 Software Redundancy Techniques ............................................................. 83
6.3 Application Based Protection Techniques ......................................................... 84
6.3.1 Critical Flip-flop Reduction by Altering Application Execution ............... 87
6.3.2 Detection of Critical Flip-flops by Selective Redundant Execution .......... 94
6.4 Conclusion ........................................................................................................ 100
7. Conclusions and Future Work ................................................................................ 101
7.1 Future Work ..................................................................................................... 102
References ....................................................................................................................... 104
List of Tables
Table 2.1. Quantitative metric requirements..................................................................... 20
Table 2.2. Calibrating typical safety mechanisms. ........................................................... 23
Table 2.3. Workload coverage and number of dangerous flip-flops. ............................... 27
Table 4.1. Safety analysis on benchmark circuits. ............................................................ 57
Table 4.2. Safety analysis on industrial modules. ............................................................. 59
Table 5.1. Control functions used for evaluation. ............................................................. 72
Table 5.2. Workload iterations for inverter. ..................................................................... 74
Table 6.1. Different tasks executed by motor control application. ................................... 85
Table 6.2. Tradeoffs associated with changing control loop frequency. .......................... 88
Table 6.3. Comparison of traditional and proposed fault tolerant approaches. ................ 97
List of Figures
Figure 1.1. Representative fault tolerant systems. .............................................................. 3
Figure 2.1. Block diagram of an EV traction system. ......................................................... 9
Figure 2.2. Closed loop control system. ........................................................................... 10
Figure 2.3. Functional safety standards. ........................................................................... 11
Figure 2.4. Derivation of semiconductor safety requirements from end-application. ...... 13
Figure 2.5. SEooC requirements. ...................................................................................... 15
Figure 2.6. ISO26262 failure classification. ..................................................................... 16
Figure 2.7. Bathtub curve. ................................................................................................. 18
Figure 2.8. Dependent failures. (a) Common cause (b) Cascading. ................................. 19
Figure 2.9. Development process to address systematic faults. ........................................ 21
Figure 2.10. Functional safety development flow to address random failures. ................ 24
Figure 3.1. Safety analysis complexity and hardware overhead tradeoffs. ...................... 29
Figure 3.2. Illustration of a hybrid system. ....................................................................... 33
Figure 3.3. Tolerance in value and time over the control input range. ............................. 34
Figure 3.4. Motor speed variation for various injected errors. ......................................... 35
Figure 3.5. Computation of value and time tolerance. ...................................................... 36
Figure 3.6. Closed loop control system operation. ........................................................... 38
Figure 3.7. DRV8312, F2805x and BLDC motor. ........................................................... 40
Figure 3.8. Dataflow of closed loop motor control application. ....................................... 41
Figure 3.9. Application time tolerance in the presence of worst case errors. ................... 42
Figure 3.10. Application value tolerance as percentage of CPU output value. ................ 43
Figure 3.11. AC inverter application kit. .......................................................................... 44
Figure 3.12. AC inverter dataflow diagram. ..................................................................... 44
Figure 3.13. Application time tolerance in the presence of worst case errors. ................. 45
Figure 3.14. Application value tolerance as percentage of CPU output value. ................ 46
Figure 3.15. First order system. ........................................................................................ 47
Figure 3.16. Input and output values of the control system. ............................................. 48
Figure 3.17. PMSM control system. ................................................................................. 49
Figure 3.18. PMSM output in the presence of worst case error. ...................................... 50
Figure 3.19. Magnified version of PMSM output in the presence of worst case error. .... 50
Figure 3.20. PMSM output used for determining value tolerance. ................................... 51
Figure 4.1. Illustration of FV based safety analysis.......................................................... 55
Figure 4.2. IEV property specification. ............................................................................ 56
Figure 4.3. Reference application. .................................................................................... 58
Figure 5.1. Safety analysis approaches. ............................................................................ 63
Figure 5.2. Software fault injection flow. ......................................................................... 64
Figure 5.3. Critical elements identified at different operating conditions. ........................ 65
Figure 5.4. Critical flip-flops identified using each approach. ......................................... 66
Figure 5.5. Critical flip-flops identified for inverter application. ..................................... 67
Figure 5.6. Critical flip-flops identified in three approaches for inverter application. ..... 68
Figure 5.7. Number of unique critical flip-flops identified for each workload. ............... 69
Figure 5.8. Algorithm for workload augmentation. .......................................................... 71
Figure 5.9. Variation of critical elements with workload perturbation. ............................ 72
Figure 5.10. Critical flip-flops identified with perturbed workloads for AC inverter
application. ........................................................................................................................ 74
Figure 6.1. SOI transistor. ................................................................................................. 78
Figure 6.2. DICE flip-flop. ............................................................................................... 79
Figure 6.3. BISER flip-flop............................................................................................... 79
Figure 6.4. Sequencing of different tasks executed by motor control application. .......... 85
Figure 6.5. Change in criticality over time. ...................................................................... 86
Figure 6.6. Variation of time tolerance with control loop frequency for BLDC motor. .. 89
Figure 6.7. Variation of time tolerance with control loop frequency for AC inverter. ..... 89
Figure 6.8. Critical flip-flop identification for different time tolerance values. ............... 90
Figure 6.9. Variation in the number of critical flip-flops with time tolerance (# number of
control loop cycles) for BLDC motor control application. ............................................... 91
Figure 6.10. Variation in the number of critical flip-flops with time tolerance (# number
of control loop cycles) for AC inverter application. ......................................................... 91
Figure 6.11. Execution variation with different control loop frequencies. ....................... 92
Figure 6.12. PI function implemented by the control system. .......................................... 93
Figure 6.13. Selective redundant execution. ..................................................................... 95
Figure 6.14. Memory and MIPS overhead reduction for selective redundant execution
approach as compared to EDDI. ....................................................................................... 95
Figure 6.15. Selective redundant execution for error recovery. ....................................... 96
Figure 6.16. Memory and MIPS overhead reduction for proposed system recovery
approach as compared to TMR implemented in software. ............................................... 98
Figure 6.17. Protection approaches for safety critical application.................................... 98
1. Introduction
Advancements in technology have led to reductions in transistor feature sizes and
power consumption. This has helped to integrate more components into an Integrated
Circuit (IC). As
we move to newer technology nodes, the factors which are helping to shrink the transistor
size and reduce the power consumption are having an adverse impact on reliability. The
risk of device failure due to aging-induced phenomena like Negative Bias Temperature
Instability (NBTI) and Hot Carrier Injection (HCI) [1], and Single Event Upsets (SEU)
due to particle strikes [2] has increased. Of the different failure mechanisms, random
failures due to atmospheric particle strikes pose the biggest threat to the reliable operation
of ICs. While systematic design techniques [3] can be deployed to reduce the risk due to
life-time failures, cost effective solutions for random failures are still evolving.
In addition to the increased failure rate, ICs pose further challenges due to the
complexity of analyzing their potential failure modes [4]. The failure modes of mechanical
systems (which the modern electronic components are replacing) are predictable and
lend themselves to easy analysis. But due to the inherent complexity in the design,
manufacturing and functionality of ICs today, a much more rigorous approach is
required. For example, if we compare a mechanical steering to a steer-by-wire system [5],
a failure in mechanical steering will lead to predictable failure modes of loss of steering
or insufficient steering. But for a steer-by-wire system, a bit flip in the IC can cause the
steer-by-wire system to even steer in the opposite direction, which is not a potential
failure mode in the case of mechanical steering.
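The steer-by-wire example can be made concrete with a small sketch. Assuming, purely for illustration, that the steering command is held as a 16-bit two's-complement value in a register, a single event upset in the sign bit reverses the commanded direction:

```python
def flip_bit16(value, bit):
    """Flip one bit of a 16-bit two's-complement word and reinterpret it.

    Models a Single Event Upset (SEU) in a register holding a signed
    steering command. The 16-bit encoding is an illustrative assumption.
    """
    raw = (value & 0xFFFF) ^ (1 << bit)
    return raw - 0x10000 if raw & 0x8000 else raw

cmd = 300                    # steer right by 300 units
upset = flip_bit16(cmd, 15)  # SEU in the sign bit
# The corrupted command is large and negative: the system steers in the
# opposite direction, a failure mode with no mechanical analogue.
assert cmd > 0 and upset < 0
```

A flip in a low-order bit, by contrast, perturbs the command only slightly; the severity of a fault depends strongly on where it lands, which is what makes IC-level failure mode analysis harder than its mechanical counterpart.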
Along with the challenges of technology node shrinking and complexity in
analyzing failure modes, there is a requirement to keep the overheads of performance,
area and implementation incurred for functional safety to a minimum. This is forcing us
to think along new vectors for IC safety analysis, together with cost effective hardened
circuit components [6], improved design techniques [7] and improved architectural
methods [8].
1.1 Evolution of Functional Safety Systems
With evolution of technology, transistors become smaller, faster and more power
efficient, resulting in larger integration. The number of transistors in ICs has increased
significantly, driven by Moore's law [9] and Dennard scaling [10]. A paradigm
shift which happened during the evolution of ICs is the design of System-on-Chips
(SoCs) [11,12]. This allowed integration of multiple components into a single IC leading
to more efficient and cost effective design. Along with other advancements, the firmware
/ software foot-print used in these systems [13] has also increased. It has become more
complex with the emergence of machine learning and artificial intelligence implemented
using these ICs [14,15,16,17].
IC failures due to faults were initially a concern for safety critical systems in
transportation, industrial plants, space and medical domains. However, with the increase
in failure rate and the rapid proliferation of ICs as replacements for mechanical parts, it
has become necessary to comprehend functional safety requirements even for consumer
electronic systems.
1.2 Integrated Circuits Functional Safety Concerns
Functional safety requirements have traditionally been addressed using redundancy
techniques. Concerns in such systems were addressed by having redundancy based
control architectures [18,19,20] like 1oo2 (1-out-of-2 also known as Dual Modular
Redundancy), 2oo3 (2-out-of-3 also known as Triple Modular Redundancy), etc. as
shown in Figure 1.1. Redundancy based architectures lead to significant increase in
implementation overheads. Since the deployment of earlier systems was restricted, the
additional cost incurred for functional safety was of lesser concern than it is today. With
more wide-spread deployment into several applications in recent years, the higher cost
incurred in designing such redundant systems is driving the need to impart functional
safety with reduced design overheads. As redundancy is replaced with other design
techniques, functional safety analysis techniques are becoming more architecture and
application dependent and hence more complex. This requires detailed analysis of the
circuit, understanding the effect of the failure of each flip-flop / logic gate on the system,
and imparting resilience by addressing the dangerous after-effects using other techniques.
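The 2oo3 (TMR) voting mentioned above can be captured in a minimal sketch (the function name and operand widths are illustrative, not from any specific design): three redundant channels feed a bitwise majority vote, so a single faulty channel is outvoted by the two healthy ones.

```python
def vote_2oo3(a, b, c):
    """2-out-of-3 (TMR) majority vote over three redundant channel outputs.

    Computed bitwise, so it works for single-bit and word-wide signals:
    a single faulty channel is outvoted by the two healthy ones.
    """
    return (a & b) | (b & c) | (a & c)

correct = 0b1011
faulty = correct ^ 0b0100   # single event upset in one channel
assert vote_2oo3(correct, faulty, correct) == correct
```

The correctness comes at the cost of triplicated logic plus the voter, which is exactly the overhead that the application-aware techniques in this thesis seek to avoid.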
As ICs started getting larger and more complex, thorough investigation of failure
modes using standard techniques like Failure Mode and Effect Analysis (FMEA) [21,22]
and Fault Tree Analysis (FTA) [23] became difficult. Several functional safety incidents
like Toyota’s unintended acceleration recalls [24,25], Ford’s unintended gear shift related
recalls [26], and multiple other automotive recalls [27] illustrate this difficulty. The
advent of autonomous systems which are capable of taking independent decisions based
on a set of parameters has made the safety analysis more critical than ever before. Recent
accidents with Uber [28,29] and Tesla [30,31] point to the limitations of the functional
safety analysis methods deployed today.
As functional safety became important to different end applications, different
functional safety standards evolved to reduce risks by providing necessary requirements
and processes. For example, IEC 61508 [32] addresses the safety requirements of
electrical, electronic and programmable electronic safety related systems. This is the base
standard from which other functional safety standards are derived. ISO 26262 [33] is an
adaptation of IEC 61508 specifically for automotive electric and electronic systems.
DO178 [34] addresses the functional safety requirements of avionics systems. Different
functional safety standards are required for different end systems since the safety
requirements vary widely among the different applications.

Figure 1.1. Representative fault tolerant systems.
1.3 IC Functional Safety Research Challenges
The fundamental question raised in hardware safety is “How can I build an
affordably robust soft-error resilient system?”. Techniques and methods should be
available for the designer to comprehensively identify the minimum set of critical
components which need to be protected to make the system safe. The safety analysis
methods should reduce the analysis complexity and make it feasible to analyse very
large systems. In addition, it should reduce the impact on time to market as well.
Comprehensive safety analysis must ensure that the IC is safe for all application
scenarios. This implies the need to cover the impact of every fault for all valid functional
states of operation. This becomes particularly challenging when all the different
applications in which the IC will be used are not known at the time of design. It is a
common scenario that an IC built for one application finds use in several other, often
unrelated, applications.
When performing IC functional safety analysis, we need to understand that not
every random fault occurring in an IC will result in application failure. Different masking
effects (e.g. logical masking, electrical masking, latching window masking and
application level masking) [35,36] can prevent the fault from propagating to the output
and causing the application to fail. Functional safety analysis should consider the
different masking effects and reduce the overall hardware overhead incurred.
Many real life systems have ICs interacting with physical systems in safety critical
applications. These physical systems are inherently analog in nature, and the accuracy of
the analysis is dependent upon the performance deviation which can be tolerated under
different use conditions. These systems are typically designed as closed loop control
systems and, by their very nature, can correct certain errors, since such a system builds
resilience across subsequent iterations of the control loop. In addition, the interacting
physical system has latency and tolerance. The control algorithms [37,38] for these
systems are designed to accommodate the variability in the physical system and the
environment. The design of ICs used in such systems can beneficially employ these
system level tolerances to identify the minimum set of components that are required to be
protected, thereby reducing the hardware overhead.
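To make the closed-loop tolerance argument concrete, the sketch below simulates a simple first-order plant under proportional control with feedforward, and injects a one-iteration sensor error (the kind a transient flip-flop fault might cause). The loop pulls the output back within an assumed ±2% application tolerance without any hardware redundancy. The plant model, gain, fault magnitude and tolerance band are illustrative assumptions, not taken from this thesis.

```python
# Sketch: a closed-loop application absorbing a transient IC fault.
# All parameters (gain, plant pole, tolerance band) are illustrative.

def run_loop(steps=60, fault_step=30, fault_offset=0.5):
    setpoint, y = 1.0, 0.0
    kp = 0.5                       # proportional gain (illustrative)
    history = []
    for t in range(steps):
        sensed = y + (fault_offset if t == fault_step else 0.0)  # transient sensing error
        u = setpoint + kp * (setpoint - sensed)   # feedforward + proportional correction
        y = 0.5 * y + 0.5 * u                     # first-order plant response
        history.append(y)
    return history

hist = run_loop()
print(abs(hist[29] - 1.0) < 0.02)   # True: settled before the fault
print(abs(hist[30] - 1.0) > 0.05)   # True: visible transient deviation
print(abs(hist[-1] - 1.0) < 0.02)   # True: recovered within tolerance
```

The point of the sketch is that the error is absorbed by subsequent control iterations, so a flip-flop whose upset produces only such a transient deviation need not be protected in hardware.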
As we include the system level information to reduce the pessimism associated
with functional safety analysis, the analysis complexity increases significantly. As an
illustration, an application level analysis for functional safety of a motor control system
will require consideration of the motor and its different operating conditions involving
motor speed, load, etc. This requires inclusion of motor models, and the whole
analysis may become impractical due to the added complexity.
1.4 Contributions of This Thesis
In this thesis, we address some of the major challenges associated with IC
functional safety analysis and soft error mitigation. Broadly, this thesis investigates
functional safety analysis as it is performed today, determines the limitations and
proposes techniques to address them. More specifically, the major contributions from the
thesis can be grouped as given below.
Firstly, this thesis analyses functional safety requirements from an application
perspective and identifies the analysis complexities and potential optimizations. It proposes
a new technique by which the application level tolerances can be mapped to tolerances at
IC level. This approach helps to reduce the analysis complexity and at the same time
optimize the hardware overhead incurred for including functional safety. It further
proposes a new divide and conquer approach by which the tolerances at IC level are
mapped to individual modules whereby the analysis can be performed at standalone
module level.
Secondly, this thesis evaluates an alternate approach using formal techniques for
comprehensive functional safety evaluation. Incorporation of formal techniques makes
the analysis more accurate. In order to make the analysis less pessimistic, application
tolerance information is also incorporated. This involves capturing application diversity
as input constraints (values and sequences), modelling application specific performance
tolerances as an output range across time intervals and illustrating how the physical
system can be included into this analysis using a suitable representation.
Thirdly, this thesis proposes a workload perturbation approach to augment
workloads used in the traditional fault injection approach to address the dual problems of
safety analysis complexity and comprehensiveness. A method is proposed for systematic
perturbation of workloads whereby new workloads are generated iteratively; these are
shown to be effective in detecting additional critical flip-flops. These three contributions
together provide a new framework to identify the critical flip-flops which must be
protected.
Lastly, the thesis proposes two new application level techniques to protect critical
flip-flops. The first technique relies on altering application execution (e.g. increase
frequency of the control loop operations) to reduce the number of critical flip-flops. The
second technique uses selective redundant execution to protect the critical flip-flops. A
novel technique to correct the identified faulty flip-flops is also presented. These
techniques analyse realistic application scenarios and propose an optimal approach for
implementing application aware methods to protect critical flip-flops to incorporate
functional safety.
1.5 Thesis Organization
This chapter gave an overview of the evolution of functional safety systems and the
safety concerns associated with ICs used in these systems, discussed the open problems
in IC safety, and summarized the major contributions of this thesis in addressing them.
Chapter 2 presents an overview of IC functional safety describing the fundamental
safety concepts, derivation of semiconductor functional safety requirements from an
application level, various functional safety standards, and challenges faced in functional
safety analysis of ICs.
Chapter 3 proposes an improved functional safety analysis technique where the
application level tolerance is mapped to an IC as value tolerance and time tolerance. It
proposes a divide and conquer approach to make the functional safety analysis practically
viable. It shows several examples at various abstraction levels to show how the
application level tolerance can be mapped to IC hardware (or functional modules inside
an IC).
Chapter 4 illustrates the use of formal techniques to address the dual challenges of
analysis comprehensiveness and pessimism. It shows using a set of benchmark circuits
and two industry modules how the proposed techniques can be used to perform safety
analysis.
Chapter 5 proposes a workload augmentation approach to address the design size
limitations associated with the formal approach and the non-comprehensiveness associated
with practical workloads used in the traditional fault injection approach. The chapter
demonstrates the benefits of the proposed approach using a set of control benchmark
algorithms and code routines. It further explains a more directed perturbation approach to
solve the practical challenges associated with random workload perturbation.
Chapter 6 proposes two new application level techniques to protect the critical flip-
flops (wherein a flip-flop is termed critical when, in the presence of a fault, its erroneous
value results in unacceptable application behaviour) identified using the new approaches
proposed in Chapters 3, 4 and 5. The first technique relies on altering application
execution to reduce the number of critical flip-flops. The second technique uses selective
redundant execution to protect the critical flip-flops. A novel technique to correct the
identified faulty flip-flops is also presented. They consider realistic application scenarios
and arrive at an optimal approach to detect / protect critical flip-flops.
Chapter 7 concludes the thesis, and lists areas of further research.
2. Functional Safety of Integrated Circuits
The increasing and pervasive use of semiconductors defines many modern
applications today. If we take the case of automotive applications, semiconductors can be
found in multiple subsystems of a vehicle – from the basic EPS (Electric Power Steering)
and ABS (Antilock Brake System) to advanced safety and control (collision warning and
parking assistant systems), state-of-the-art infotainment systems, networking units (from
CAN to Bluetooth), and advanced control and comfort systems [39]. Established and
emerging automotive trends such as Electric Vehicle (EV) / Hybrid Electric Vehicle
(HEV), advanced vehicle intelligence, autonomous driving, etc., are only pushing the
semiconductor content even higher. While at one end, semiconductors continue to
provide newer functionality to the consumer, they also bring in increased risk of failures
which elevate the functional safety concerns of the system. This has led to functional
safety becoming a critical (and in some cases defining) parameter for the semiconductors
and ICs used for realizing important functions.
This chapter gives a brief overview of IC functional safety. The rest of this chapter
is organized as follows. Section 2.1 illustrates using an example, the impact of
semiconductor failures on the EV traction application. Section 2.2 introduces the
different functional safety standards. Section 2.3 introduces the key safety concepts
covered in these standards. Section 2.4 describes an example functional safety aware
development process practised to develop ICs, and Section 2.5 lists the limitations in this
existing safety analysis methodology.
2.1 Application Case Study: EV Traction System
ICs can fail in an application due to a variety of reasons, including excess
temperature, excess voltage, ageing, ionizing radiation and package stress. As
semiconductor content in mission critical applications like automobiles and aviation
increases, the chance of application failure due to semiconductor failures also increases.
The impact of IC failure on an application will differ based on the type of failure and type
of application. In this section, we analyse an Electric Vehicle (EV) traction application
and see how a failure in the control IC can impact the application.
Figure 2.1 shows the block diagram for the traction control system of an EV. The
system consists of an AC induction motor connected directly to the drive train. The motor
is driven by power stages which draw energy from the high capacity (20 – 100 kWh)
battery. The power stage is controlled by a control IC such as the C2000 microcontroller
(MCU) [40]. This control IC receives a torque/speed set-point command from the
supervisor IC, based on information from input functions like the accelerator, brake pedal,
etc. There is a separate Battery Management System (BMS) IC which monitors the
battery to keep track of its charging and discharging operation and voltage levels. The
control IC senses torque by measuring the motor phase current using the ADC. The
processing element in the control IC will determine the system error by comparing the
set-point torque received from the supervisor IC with the sensed torque and execute the
control algorithm to minimize this error.
The result of the control algorithm execution is conveyed to the motor as change of
applied power, by varying duty cycle of Pulse Width Modulation (PWM) module output
as indicated in Figure 2.2. The figure represents two torque scenarios. We have used a
simplistic assumption that the torque conveyed to the motor is proportional to the pulse
width. We analyse the impact of a fault in the different modules of the control IC on the
motor control application.
Figure 2.1. Block diagram of an EV traction system.
(i) A fault in the ADC module can lead to incorrect sensing (higher/lower than
actual) of speed. Incorrectly sensed speed, when input to the control algorithm,
can result in incorrect processing which can in turn lead to inadvertent
acceleration or deceleration of the motor.
(ii) A fault in the CPU will cause incorrect execution of the control algorithm
causing inadvertent acceleration or deceleration.
(iii) A fault in the PWM module can cause the output duty cycle to change, resulting
in inadvertent acceleration or deceleration.
In the above analysis, we have only considered the worst-case impact that a fault in
the IC can have on the application, (as a result of acceleration / deceleration, as against
the case when there is no appreciable change in the speed at all). The exact impact of a
fault will depend on the logic gate / flip-flop impacted and the type of fault. A transient
fault in one of the flip-flops in the data logic may not lead to much difference in the EV
motor output due to the mechanical components involved. The error will get corrected
subsequently due to the closed loop operation, albeit with a longer latency.
Figure 2.2. Closed loop control system.
However, if
the error is in the control path of the CPU, it is possible that the entire program flow
changes, resulting in irrecoverable and/or catastrophic behaviour. As an example, a
transient fault in the program counter will have much larger impact on the application
when compared to a transient fault in the LSB of adder logic output which drives the
PWM.
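The dependence of fault impact on the affected bit can be shown numerically. The sketch below flips single bits of a hypothetical 12-bit PWM duty register: an LSB upset perturbs the commanded duty by a fraction of a percent (well within typical application tolerance), while an MSB upset halves it, an abrupt torque change. The register width and duty value are assumptions for illustration.

```python
# Sketch: the application impact of a single-event upset depends on which
# bit flips. A 12-bit PWM duty register is assumed for illustration.

PERIOD = 4095  # full-scale count of a hypothetical 12-bit PWM timer

def duty_after_flip(duty, bit):
    return (duty ^ (1 << bit)) & 0xFFF   # transient single-bit upset

duty = 2048                               # ~50% commanded duty
lsb = duty_after_flip(duty, 0)            # LSB flip -> 2049
msb = duty_after_flip(duty, 11)           # MSB flip -> 0

print(abs(lsb - duty) / PERIOD)           # ~0.0002: negligible duty error
print(abs(msb - duty) / PERIOD)           # ~0.5: gross duty error
```

A program counter upset is worse still, since it corrupts control flow rather than a single data value and cannot be modelled as a bounded output deviation.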
2.2 Functional Safety Standards
Various functional safety standards have been developed and deployed over the
years to set guidelines for design and implementation to avoid hazards caused by
malfunctioning behavior of electric and electronic devices in end systems. IEC 61508
[32] is the fundamental functional safety standard that encapsulates basic safety concepts
and design requirements applicable to a wide range of electric and electronic end
equipment across industry. Application specific functional safety standards have since
evolved from IEC 61508 as shown in Figure 2.3 [41], in order to address the
requirements and considerations (e.g. availability, integrity requirements, etc.) specific to
an application.
Figure 2.3. Functional safety standards derived from the IEC 61508 meta standard: EN 62061 (factory automation), IEC 60730 (household goods), ISO 13849 (machinery), IEC 60880 (nuclear station), IEC 50158 (furnaces), RTCA / DO 178B (aerospace), IEC 61800 (power drive), IEC 60601 (medical equipment), EN 50128 (railway) and ISO 26262 (automotive).
Of these standards, ISO 26262 functional safety standard [33] addresses the
requirements from an automotive standpoint. This standard’s scope addresses the
functional safety requirements due to malfunctioning behaviour of electronic and
electrical systems in passenger cars, motor bikes, trucks and buses. The standard defines
requirements during various phases of system life cycle - management, development,
production, operation, service, and decommissioning.
Similarly, there are standards for nuclear power plants (IEC 60880) [42], aerospace
(DO 178B) [34], railway (EN 50128) [43], medical equipment (IEC 60601) [44], factory
automation (EN 62061) [45], machinery (ISO 13849) [46], power drive (IEC 61800)
[47], household goods (IEC 60730)[48], etc. One of the key challenges with different
system level functional safety standards is that it becomes difficult for a semiconductor
developer to easily infer what really needs to be done at an SoC level. Therefore, it is key
to understand the fundamental concepts underlying the standards to better appreciate the
safety requirements. Such an understanding will in turn enable a semiconductor
developer to come up with an optimal and cost-effective chip development process, and a
solution that can meet the requirements.
2.2.1 Deriving Semiconductor Safety Requirements from End Application
In this section, we will use the automotive functional safety standard ISO 26262 to
describe how semiconductor safety requirements are derived from the application
safety requirements. The standard uses a risk analysis based approach for deriving the
safety requirements from an end application. We will use the EV traction application
to examine how semiconductor requirements are derived from the application safety
requirements. The various steps involved are enumerated below and
further described in Figure 2.4.
(i) Identify the specific application or function for which the safety analysis needs
to be performed. In this example, application considered is EV traction.
(ii) Identify various operating conditions, (e.g. driving on highway, driving in city,
plug-in charging), of the application and hazards possible, (e.g. unintended
positive torque causing acceleration), in these operating conditions.
(iii) Determine risk associated with each of the hazards based on severity (extent of
injury a malfunction can lead to), exposure (duration for which the vehicle is in
a particular operating condition which has the potential to cause the hazard) and
controllability (ability of the driver to control the vehicle and prevent injury when a
hazard happens), e.g. unintended acceleration in a city with pedestrians can cause
fatal accidents and will be assigned a higher risk rating. Recommended practices
published by the Society of Automotive Engineers (SAE), such as SAE J2980 [49],
provide guidance for identifying and classifying hazardous events.
Figure 2.4. Derivation of semiconductor safety requirements from the end application: item definition; situational analysis and hazard identification (operating conditions such as plug-in charging and driving); hazard classification (risk rating based on exposure, severity and controllability); ASIL assignment for the hazards (e.g. unintended positive torque – ASIL-X); functional safety requirements (e.g. prevent unintended positive torque – ASIL-X); technical safety requirements; and hardware safety requirements (individual module level safety mechanisms used to achieve diagnostic coverage).
(iv) Assign an Automotive Safety Integrity Level (ASIL) based on the determined
risk. ASIL varies from A to D, with D being the most stringent in terms of the
requirements. ISO26262 provides a reference table which will help to derive the
ASIL from severity, exposure and controllability values. Steps (ii), (iii) and (iv)
constitute the Hazard Analysis and Risk Assessment (HARA) [50] process.
(v) Derive functional safety requirements (which are implementation independent,
i.e. guaranteed to hold irrespective of how the function is implemented, e.g.
avoid unintended acceleration) and associated integrity levels (ASIL levels)
using Steps (iii) and (iv), (e.g. prevention of unintended acceleration – ASIL-C).
(vi) Derive technical safety requirements required for the implementation of the
functional safety requirements, (e.g. MCU shall check correctness of generated
torque).
(vii) Derive IC safety requirements from the system level technical safety
requirements, (e.g. redundant sensing of generated torque using two ADCs,
protection of critical flip-flops implementing Program Counter (PC), etc.).
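Steps (iii) and (iv) above can be sketched as a small lookup. The function below uses the well-known additive shorthand for the ISO 26262 ASIL determination table (severity S1–S3, exposure E1–E4, controllability C1–C3): the class indices are summed and mapped to QM or ASIL A–D. This shorthand reproduces the standard's table, but the table itself should be consulted for any normative use; the hazard examples are illustrative.

```python
# Sketch of HARA risk-to-ASIL mapping using the additive shorthand for the
# ISO 26262 determination table. Inputs are the class indices (S1..S3 -> 1..3,
# E1..E4 -> 1..4, C1..C3 -> 1..3).

def asil(severity, exposure, controllability):
    score = severity + exposure + controllability   # S + E + C
    return {7: "ASIL-A", 8: "ASIL-B", 9: "ASIL-C", 10: "ASIL-D"}.get(score, "QM")

# Unintended acceleration in dense city traffic: life-threatening severity,
# high exposure, hard to control -> most stringent level.
print(asil(3, 4, 3))   # ASIL-D
print(asil(3, 4, 1))   # ASIL-B
print(asil(1, 2, 1))   # QM
```

Anything below ASIL-A falls to QM (quality management), i.e. no dedicated safety requirement beyond the normal development process.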
2.2.2 SEooC Design Process
In Section 2.2.1, we have analysed how semiconductor safety requirements are
derived from the end application. However, in most of the cases, semiconductor devices
may not be designed and manufactured for a particular application. They may be either
developed to cater to a set of applications, (e.g. a micro-controller may be used in diverse
applications like motor control, digital power control, control of home appliances, etc.) or
may be a standalone function (e.g. voltage regulator) which can be used in several
applications. In order to cater to the requirements for a diverse set of applications,
ISO26262 has laid out the Safety Element out of Context (SEooC) development process
[51].
For deriving semiconductor safety requirements, SEooC development mandates that the
IC manufacturer consider a set of applications which the device can potentially cater to,
and analyse the safety requirements from these applications. These requirements (termed
assumptions) will be used for the IC design. The assumptions will contain both the
requirements for the IC design, termed as ‘assumed requirements’ (e.g. safety
mechanisms to be implemented within the device like ECC) and additional requirements
for components external to the device, termed as ‘assumptions on design external to
SEooC’ (e.g. requirement to have an external power monitor since the device does not
have a built-in power monitor). While integrating an SEooC device, a system integrator
must ensure that the system conforms to the assumptions used during the design of the
component. The SEooC development flow outlined in the ISO26262 is shown in Figure
2.5.
The SEooC development process deployed in the IC industry today does not use
system level information (e.g. tolerance available at the system level due to physical
components, repeated execution in closed loop control, etc.) for performing safety
analysis, leading to increased analysis pessimism. Incorporating such system level
effects has so far been avoided because it increases the analysis complexity. This thesis
illustrates how IC safety requirements can be accurately derived from the application
safety requirements and utilized for IC safety analysis without an increase in analysis
complexity.
Figure 2.5. SEooC requirements: assumptions (assumed requirements, and assumptions on design external to the SEooC) feed the SEooC requirements, which drive the SEooC design.
2.3 IC Design Evaluation for Safety
Once the IC safety requirements are determined from the application, the design is
evaluated and further augmented to make sure that the safety requirements can be
addressed even in the presence of faults. The design evaluation process involves
understanding the different types of faults, performing systematic analysis to identify
their implications and protecting against them.
2.3.1 Types of Failures
IC failures can be classified into systematic failures and random failures as shown in
Figure 2.6.
2.3.1.1 Systematic Failures
Systematic failures are failures related in a deterministic way to a certain cause that
can be eliminated by change of design, manufacturing process, operational procedures,
etc. Such failures typically arise due to design quality issues (e.g. bugs in the design),
non-adherence to operating conditions (e.g. device operating in conditions outside the
specified temperature, voltage, humidity conditions), etc. These failures are repeatable in
Figure 2.6. ISO26262 failure classification: systematic failures (design quality, manufacturing quality, adherence to operating conditions) and random failures (permanent and transient faults, further classified as single point, residual, multiple point and safe faults).
nature and can be controlled to a large extent by following a regimented approach
(process) during the product life cycle.
2.3.1.2 Random Failures
Random failures can be due to permanent faults or transient faults. Permanent
faults are caused by ageing phenomena like Negative Bias Temperature Instability
(NBTI) and Hot Carrier Injection (HCI) [1]. Transient faults are caused by alpha
particle and neutron strikes. A permanent fault, as the name indicates, causes
permanent damage to the chip and will stay until it is removed or repaired. On the other
hand, the impact of a transient fault is temporary in nature. Given the random occurrence of
both permanent and transient failures, the probability of occurrence is best estimated
using statistical information based on historical data, accelerated test procedures like
High Temperature Operating Life (HTOL) [52], burn-in, etc. (for permanent faults) and
radiation testing (for transient faults) [53].
Random faults can be further classified into single point fault, residual fault,
multiple point fault and safe fault [54]. The ISO26262 definitions for these faults are
included here. A single point fault is a fault which is not covered by any safety
mechanism and whose failure can lead to violation of the system safety goal. Among
the faults which are covered by a safety mechanism, residual faults refer to the portion
not actually covered by that mechanism, which can therefore still lead to violation of a
safety goal. A multiple point fault is a fault which by itself cannot cause violation of
the safety goal, but which in combination with one or more other independent faults
can lead to violation of the safety goal. A fault
whose occurrence will not cause violation of the safety goal is called a safe fault.
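The four categories can be summarised as a small decision procedure. The sketch below is an illustrative simplification of the ISO 26262 definitions above: it classifies a fault from three assumed boolean attributes (whether it can contribute to violating the safety goal, whether a safety mechanism covers it, and whether that mechanism actually detects it).

```python
# Illustrative simplification of the ISO 26262 random fault categories;
# the three boolean attributes are modelling assumptions, not standard terms.

def classify(can_violate_goal, covered_by_sm, detected_by_sm):
    if not can_violate_goal:
        return "safe fault"            # can never violate the safety goal
    if not covered_by_sm:
        return "single point fault"    # no safety mechanism covers it
    if not detected_by_sm:
        return "residual fault"        # covered, but detection misses it
    return "multiple point fault"      # dangerous only together with another fault

print(classify(False, False, False))   # safe fault
print(classify(True, False, False))    # single point fault
print(classify(True, True, False))     # residual fault
print(classify(True, True, True))      # multiple point fault
```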
2.3.1.3 Device Failure Rate and Bathtub Curve
The failures occurring during the semiconductor device lifetime are described in
terms of the classical bathtub curve [55] illustrated in Figure 2.7 [56]. During the early
life of the device, failures termed infant mortality occur with a probability that decreases
with time. Towards the end of life, wear-out failures occur with a probability that
increases with time. In between, there is a constant failure rate during
the operational lifetime of the device. This constant failure rate is attributed mainly to the
random failures occurring in devices.
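The constant failure-rate region implies an exponential reliability model, R(t) = exp(−λt). The sketch below converts an assumed rate of 100 FIT (failures per 10⁹ device-hours) into a probability of failure over an assumed 10,000-hour operating life; both figures are invented for illustration and are not claims about any particular device.

```python
# Sketch relating the bathtub curve's constant failure-rate region to a
# reliability estimate. The 100 FIT rate and mission length are assumptions.

import math

FIT = 100                        # assumed rate: failures per 1e9 device-hours
lam = FIT / 1e9                  # constant failure rate lambda, per hour
mission_hours = 10_000           # assumed operating life

reliability = math.exp(-lam * mission_hours)   # R(t) = exp(-lambda * t)
prob_failure = 1 - reliability

print(lam)                        # 1e-07 failures per hour
print(prob_failure < 1e-3)        # True: just under 0.1% over the mission
```

Note that a per-hour rate of 10⁻⁷ is exactly the order of magnitude that the PMHF targets in Section 2.3.3 bound, which is why failure rates are usually quoted in FIT.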
2.3.2 Circuit Failure Mode Analysis: Qualitative
Qualitative failure mode analysis approach performs a systematic analysis of the
circuit to identify the potential vulnerabilities and provide suggestions / recommendations
to change the design of the circuit to address the vulnerabilities. The common methods
used for qualitative analysis are Failure Mode and Effects Analysis (FMEA) [21] and
Fault Tree Analysis (FTA) [23]. FMEA is an inductive bottoms-up approach where faults
in each of the lower level components is analysed for its impact on the higher level
function failure (e.g. impact of faults in logic gates / flip-flops at the IC level). FMEA
analysis consists of (i) determining the faults which can cause a function to fail, (ii)
identifying diagnostic mechanisms which can detect such faults, (iii) determining the
faults in the diagnostic logic which can make such detection ineffective, and (iv)
identifying checks (diagnostic mechanisms) for diagnostics that can detect a diagnostic
malfunction. FTA is a deductive top-down approach by which each of the system failure
conditions is analysed and further mapped to low level contributing elements (e.g.
flip-flops, gates, etc.).
Figure 2.7. Bathtub curve: decreasing failure rate (early infant mortality failures), constant (random) failure rate, and increasing failure rate (wear-out failures) together make up the observed failure rate over time.
A combination of both these techniques is recommended to be used for the
functional safety analysis. However, system level considerations are not used in the IC
safety analysis practiced today. Hence, there is insufficient information available to
perform FTA. In the absence of system level information, all faults at the IC level are
classified as critical for FMEA analysis. This leads to analysis pessimism. In this thesis,
we propose a two-step FMEA process to address this limitation. In the first step, we
analyse the impact of different errors at IC level which can cause failure at the system
level. In the second step, we identify critical flip-flops whose faults can cause errors at IC
level which can lead to system level failures.
2.3.2.1 Dependent Failure Analysis
Qualitative failure analysis should also consider the effect of dependent failures,
which can reduce effectiveness of the employed diagnostics. Dependent failures can be
either common cause failures or cascading failures. Common cause failures, indicated in
Figure 2.8(a), are the ones in which the logic common to both functional and diagnostic
modules fails. This can make the diagnostic ineffective, e.g. if the power to the mission
logic and diagnostic logic is the same, failure of power will lead to non-detection of a
fault by the diagnostic module. Cascading failures as indicated in Figure 2.8(b) are the
ones where the failure in one module propagates to another module making it fail. This is
applicable for cases where functions with different integrity levels co-exist within the
chip and there is a possibility of a fault from a lower integrity module impacting a
higher integrity module, e.g. a fault propagating from lower integrity debug / emulation
logic to the higher integrity functional logic, thus impacting the critical high integrity
functions executing on the CPU.
Figure 2.8. Dependent failures: (a) common cause failure, where logic common to the mission logic, redundant logic and comparator fails; (b) cascading failure, where a failure in lower integrity logic propagates to higher integrity logic.
2.3.3 Circuit Failure Mode Analysis: Quantitative
Quantitative analysis provides an objective means to ascertain the safety worthiness
of a given circuit. The resulting metrics can be used to (i) objectively assess the safety effectiveness
of the design to cope with the random hardware failures, (ii) provide guidance towards
making design enhancements for safety, (iii) compare different diagnostic architectures
and (iv) ascertain the ASIL which can be achieved using the specified architecture.
Quantitative analysis requires derivation of some key metrics, namely Single Point
Fault Metric (SPFM), Latent Fault Metric (LFM) and Probabilistic Metric for Hardware
Random Failures (PMHF). The device needs to meet the specific target values (indicated
in Table 2.1) corresponding to these metrics to achieve a specific ASIL. SPFM and LFM
can be computed using the equations provided below. (Refer Section 2.3.1.2 for fault
classification). PMHF indicates the average probability of failure per hour and is obtained
by adding the failure rates of the constituent components.
SPFM = 1 − (Residual Faults + Single Point Faults) / (Total Faults − Safe Faults)

LFM = 1 − (Undetected Multiple Point Faults) / (Total Faults − (Residual Faults + Single Point Faults))
Table 2.1. Quantitative metric requirements.

Metric | ASIL-B      | ASIL-C      | ASIL-D
SPFM   | ≥ 90%       | ≥ 97%       | ≥ 99%
LFM    | ≥ 60%       | ≥ 80%       | ≥ 90%
PMHF   | < 10^-7 h^-1 | < 10^-7 h^-1 | < 10^-8 h^-1
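The SPFM and LFM expressions above can be exercised on an illustrative fault inventory; the counts below stand in for failure rates and are invented purely for the example.

```python
# Illustrative SPFM / LFM computation using the formulas above; the fault
# inventory (counts standing in for failure rates) is invented.

faults = {
    "safe": 400.0,
    "single_point": 1.0,
    "residual": 4.0,
    "mp_detected": 565.0,   # multiple point faults that are detected / perceived
    "mp_latent": 30.0,      # multiple point faults that remain undetected (latent)
}
total = sum(faults.values())                      # 1000.0

dangerous = faults["single_point"] + faults["residual"]
spfm = 1 - dangerous / (total - faults["safe"])
lfm = 1 - faults["mp_latent"] / (total - dangerous)

print(round(spfm, 4))                             # ~0.9917
print(round(lfm, 4))                              # ~0.9698
print(spfm >= 0.99 and lfm >= 0.90)               # True: meets the Table 2.1 ASIL-D targets
```

In practice the inputs are failure rates (in FIT) per fault class derived from the FMEA, not raw counts, but the arithmetic is the same.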
2.4 Protecting Against Systematic and Random Failures
We have seen in Section 2.3.1 that application failures are caused by systematic
and random failures. Therefore, functional safety aware chip development should
comprehend both. Systematic failures are controlled by a robust development process,
while random failures are addressed through protection mechanisms which can either
detect or avoid the faults.
2.4.1 Robust Development Process
The end system should have a robust development process for the different phases
corresponding to concept, design, production, operation and decommissioning, to
mitigate systematic failures. A part of the robust development process addressing the
safety requirements in IC design covering concept, design and production stages is shown
in Figure 2.9. This includes:
(i) Requirement Phase: Collection of safety requirements from the application or
from a set of applications.
(ii) Design phase: Hardware and software design based on the specification.
Figure 2.9. Development process to address systematic faults: system level functional safety requirements are refined into hardware safety requirements, implemented, and then checked through verification of the hardware safety requirements and validation of the functional safety requirements.
(iii) Verification phase: Check to ensure that the design meets the functional
specification and safety requirements.
ISO26262 specifies representative best practices for each of the different
development phases. These include proper traceability from requirements to design to
verification and validation phases, how IPs used in the devices are selected and assessed
(development interface agreement), the change management process during development,
and best practices in design (e.g. conservative design with sufficient margin), verification
(e.g. use of formal methods) and validation (e.g. fault insertion testing).
2.4.2 Safety Mechanisms
Along with addressing systematic failures through a robust development process,
ICs should have in-built mechanisms to handle random failures. Devices need to have
additional protection mechanisms called safety mechanisms in place to detect or mitigate
such failures. Safety mechanisms can be classified into three categories based on the
protection they offer.
(i) Primary safety mechanisms: detect single point faults, e.g. ECC for memory.
(ii) Tests for diagnostics: detect faults in the diagnostic mechanisms themselves, e.g.
a test for the ECC logic.
(iii) Fault avoidance measures: help avoid a particular fault, e.g. spacing apart
memory bit-cells which form a logical word such that neighbouring multiple bit
upsets can still be detected by ECC.
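As an example of a primary safety mechanism, the sketch below implements a minimal Hamming(7,4) code, the simplest single-error-correcting ECC: any one flipped bit in the stored 7-bit word is located by the syndrome and corrected. Production memories use wider SECDED codes (which also detect double errors), so this is an illustration of the principle only.

```python
# Sketch of a primary safety mechanism: Hamming(7,4) ECC correcting any
# single bit flip in a stored nibble. Real memory ECC uses wider SECDED codes.

def encode(nibble):
    d = [(nibble >> i) & 1 for i in range(4)]          # data bits d0..d3
    p1 = d[0] ^ d[1] ^ d[3]                            # parity over positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]                            # parity over positions 2,3,6,7
    p4 = d[1] ^ d[2] ^ d[3]                            # parity over positions 4,5,6,7
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]        # codeword, positions 1..7

def decode(bits):
    syndrome = 0
    for pos, bit in enumerate(bits, start=1):
        if bit:
            syndrome ^= pos                            # XOR of set-bit positions
    if syndrome:                                       # non-zero -> that position flipped
        bits[syndrome - 1] ^= 1                        # correct it in place
    d = [bits[2], bits[4], bits[5], bits[6]]
    return sum(b << i for i, b in enumerate(d))

word = encode(0b1011)
word[4] ^= 1                    # single-event upset in a data bit
print(decode(word) == 0b1011)   # True: the upset is corrected
```

The interleaving measure in item (iii) exists precisely because such a code corrects only one bit per word: physically separating the bits of a logical word turns a multi-cell strike into single-bit errors in several different words.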
2.4.2.1 Factors Involved in Selection of Safety Mechanisms
The effectiveness of a safety mechanism is measured using Diagnostic Coverage
(DC). DC indicates the proportion of the hardware logic faults which can be detected by
the implemented safety mechanisms. In addition to diagnostic coverage, factors like area
and power overhead, MIPS required for safety mechanism execution, safety mechanism
development effort, etc., are considered before deciding on a particular safety
mechanism. Some of the common safety mechanisms and their typical diagnostic
coverages and overheads (largely implementation dependent and subject to application
scenarios) are listed in Table 2.2. The assessment given is approximate and will differ
based upon the actual function being defined and its implementation.
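The DC metric itself is straightforward to compute from fault injection results. A minimal sketch (the fault counts below are illustrative placeholders, not measured data):

```python
def diagnostic_coverage(detected_faults: int, total_faults: int) -> float:
    """DC = proportion of injected hardware faults detected by the safety mechanism."""
    if total_faults == 0:
        raise ValueError("no faults injected")
    return detected_faults / total_faults

# Illustrative numbers only: 9,000 of 10,000 injected faults flagged
# by a hypothetical ECC checker gives 90% diagnostic coverage.
dc = diagnostic_coverage(9000, 10000)
print(f"DC = {dc:.0%}")  # DC = 90%
```

In practice this ratio is evaluated per failure mode and per safety mechanism, and weighed against the area, power and MIPS overheads discussed above.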
2.4.2.2 Classification of Safety Mechanisms
Safety mechanisms can also be classified based on the level of design abstraction at
which they are implemented.
(i) Device level, where the components used for building SoC such as standard cell
library are augmented with safety mechanisms (e.g. RAZOR flip-flop [57],
DICE flip-flop [6], BISER flip-flop [58]).
(ii) Gate level, where the components are protected by means of circuit level
techniques (e.g. delayed capture methodology [59,60,61], lock-step operation)
(iii) Function / Architecture level, where higher levels of abstraction are analysed
and leveraged to provide protection (e.g. redundant execution [8,62], monitoring
mechanisms [63], etc.).
Table 2.2. Calibrating typical safety mechanisms.

Safety mechanism            | Transient fault DC | Permanent fault DC | Area overhead | MIPS overhead
Lock-step CPU               | High               | High               | High          | None
Hardware self-test for CPU  | None               | High               | Medium        | Medium
Software self-test for CPU  | None               | Medium             | Low           | High
Parity for memories         | Low                | Low                | Low           | Low
ECC for memories            | High               | High               | Medium        | Low
Self-test for memories      | None               | High               | Medium        | Medium
2.4.3 Development Process to Address Random Failures
The functional safety development flow to address random failures should
comprehend requirements mentioned in previous sections. A typical development flow is
illustrated in Figure 2.10.
(i) Derive device architecture based on the functional requirements of the target
application(s).
(ii) Derive functional safety requirements from the target application or set of
applications considered in the case of SEooC development.
(iii) Perform qualitative Failure Mode and Effect Analysis (FMEA) of design based
on device architecture and functional safety requirements. This helps to identify
safety mechanisms.
Figure 2.10. Functional safety development flow to address random failures.
(iv) Update hardware and software safety requirements based on the qualitative
analysis.
(v) Update device specification to capture the new safety mechanisms.
(vi) This is followed by design implementation. After implementation, the
qualitative analysis is repeated using the additional implementation-related
information to re-evaluate the design.
(vii) Perform design verification to check correct implementation of functional logic
and safety mechanism.
(viii) Perform quantitative analysis to check whether the goals associated with the
functional safety requirements are met. If the goals are not met, device
architecture is modified. This iteration continues till the goals are met.
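The iterative flow in steps (i)-(viii) can be sketched as a loop that keeps adding safety mechanisms until the quantitative goals are met. The helper logic below is a hypothetical stand-in for the real engineering activities, intended purely to illustrate the control structure of Figure 2.10:

```python
# Toy sketch of the iterative development flow. The coverage model is an
# assumption made for illustration (each added mechanism raises achieved
# diagnostic coverage by 25 percentage points), not an engineering rule.
def safety_development_flow(architecture, goal_dc=0.90):
    dc = 0.0
    iterations = 0
    while dc < goal_dc:               # step (viii): quantitative goal check
        iterations += 1
        # steps (iii)-(vi): identify a mechanism, update spec, implement
        architecture = architecture + [f"safety_mechanism_{iterations}"]
        dc = min(1.0, 0.25 * iterations)   # stand-in for quantitative analysis
    return architecture, dc

arch, dc = safety_development_flow(["cpu", "pwm", "cap"])
print(len(arch), dc)  # 7 1.0
```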
2.5 Limitations with Existing Safety Analysis Methods
This chapter provided an overview of IC safety analysis as practised today. It can
be noticed that several additional steps are added to the IC development process to make
it functional safety compliant. These additional steps lead to additional effort and
implementation overheads, and significantly increase the development time (and hence
time to market) for the ICs and the systems incorporating them. In spite of these
additional steps existing functional safety analysis methods suffer from two major gaps.
This section describes these limitations.
2.5.1 Making Safety Analysis Comprehensive
Identification of critical (dangerous) gates / flip-flops in IC whose faults can cause
the application to fail and protecting them is one of the important steps in the functional
safety compliant IC design process. A failure during application life time can occur due
to several reasons, namely, permanent fault in logic gates / flip-flops, Single Event Upset
in flip-flops and other storage elements, Single Event Transients (SET) in combinational
logic gate, etc. In this thesis, we focus on the robustness analysis of the circuit in the
presence of SEU in flip-flops.
Fault injection is widely used to provide a metric on the suitability of a circuit
module to be used in safety critical application. Fault injection provides evidence of the
robustness (or otherwise) of the hardware executing the given application in the presence
of SEUs. For SEU robustness evaluation, faults which model SEUs are injected in the
presence of application workload. (A workload is a popular term used to denote an
application execution sequence in terms of input values, internal CPU or firmware
program, etc.). Outputs and critical state elements in the circuit with faults injected in
simulation are compared with those in the circuit with no faults injected (good)
simulation. If the injected faults result in a changed output and this is not detected by the
safety mechanisms, the gate / flip-flop in which this fault is injected is classified as
dangerous [4]. A flip-flop / gate, on the other hand, is safe if for all injected faults, there
is no output change, or if the change is detected by the built-in safety mechanisms. Fault
injection methods use either actual application workloads or synthesized workloads for
the safety evaluation. The comprehensiveness of the safety analysis critically depends on
comprehensiveness of the workloads used during fault injection based evaluation.
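The safe/dangerous classification rule described above can be sketched as follows. Traces are simplified here to plain lists of sampled output values, and the function name is our own, not part of any standard flow:

```python
# A flip-flop is "dangerous" if at least one injected fault changes an output
# and no safety mechanism detects it; otherwise it is "safe".
def classify_flip_flop(good_trace, faulty_traces, detected_flags):
    """faulty_traces[i]: output trace with fault i injected in this flip-flop;
    detected_flags[i]: True if a safety mechanism flagged fault i."""
    for trace, detected in zip(faulty_traces, detected_flags):
        if trace != good_trace and not detected:
            return "dangerous"
    return "safe"

good = [0, 1, 1, 0]
print(classify_flip_flop(good, [[0, 1, 1, 0], [0, 1, 0, 0]], [False, False]))  # dangerous
print(classify_flip_flop(good, [[0, 1, 0, 0]], [True]))                        # safe
```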
The toggle coverage metric is often used to ascertain the suitability of workloads to
identify critical flip-flops, (i.e. those which are identified as dangerous and must hence be
protected). For example, a set of workloads is considered adequate if the toggle coverage
is greater than a prescribed value [22]. Practical considerations of circuit size and
simulation time require that this coverage threshold (e.g. 70%, 90%, 99%) be bounded
accordingly. However, the relation between the
number of dangerous flip-flops and the toggle coverage is not well established [64].
Table 2.3 shows this data for an industrial circuit, (a digital filter used for filtering noise
from low frequency analog signals). Different coverage metrics, (namely block,
expression, code and toggle coverage), are evaluated independently for each workload
and the number of dangerous flip-flops is identified. Workloads (TC_0 to TC_9 which
are run independently) are arranged in the increasing order of toggle coverage. Contrary
to expectations, workloads with higher toggle coverage do not necessarily correspond to
a larger number of dangerous flip-flops. TC_9 has higher toggle coverage than TC_8;
however the number of dangerous flip-flops is smaller. TC_2 has lower toggle coverage
than TC_7; however, the number of dangerous flip-flops is larger. A higher code
coverage also does not indicate a better workload quality. It is therefore important to
consider more comprehensive workloads and perform exhaustive fault injection.
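The non-monotonic relation discussed above can be checked directly from the toggle-coverage and dangerous-flip-flop columns of Table 2.3:

```python
# Data reproduced from Table 2.3: (toggle coverage %, # dangerous flip-flops),
# workloads TC_0..TC_9 in increasing toggle-coverage order.
table = [(41.02, 54), (44.49, 69), (44.62, 69), (45.47, 84), (45.59, 82),
         (68.93, 54), (73.07, 68), (73.20, 67), (74.05, 84), (74.17, 82)]

# If toggle coverage predicted workload quality, the dangerous-flip-flop
# counts would be non-decreasing along this ordering. They are not:
counts = [d for _, d in table]
monotone = all(a <= b for a, b in zip(counts, counts[1:]))
print(monotone)  # False: e.g. TC_5 (68.93%) finds fewer than TC_3 (45.47%)
```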
2.5.2 Reduction of Implementation Overheads
Not every random fault occurring in an IC will result in application failure.
Different masking effects (e.g. logical masking, electrical masking, latching window
masking and application level masking) [35] can prevent the fault from propagating to
the output and causing the application to fail. Safety analysis without considering these
masking effects will result in over-design leading to an increased hardware and / or
design implementation overhead. Analysis at higher abstraction levels (e.g. at the
application level) helps to evaluate and take advantage of additional masking effects, and
these can be beneficially employed to reduce the hardware overhead.
As an illustration, we can consider a real life functional safety system like traction
control or ABS used in automobiles. These systems demand high accuracy and self-
correction to adjust for the physical system’s non-linearity and noise effects. They are
designed as closed loop control systems and typically have an associated acceptable
tolerance in the extent to which the behaviour can deviate from the ideal or centre
Table 2.3. Workload coverage and number of dangerous flip-flops.

Test case | Block coverage | Expression coverage | Code coverage | Toggle coverage | # Dangerous flip-flops
TC_0      | 93.33%         | 100.00%             | 72.91%        | 41.02%          | 54
TC_1      | 93.33%         | 100.00%             | 74.48%        | 44.49%          | 69
TC_2      | 93.33%         | 100.00%             | 74.58%        | 44.62%          | 69
TC_3      | 93.33%         | 100.00%             | 74.87%        | 45.47%          | 84
TC_4      | 93.33%         | 100.00%             | 74.78%        | 45.59%          | 82
TC_5      | 100.00%        | 100.00%             | 84.87%        | 68.93%          | 54
TC_6      | 100.00%        | 100.00%             | 86.44%        | 73.07%          | 68
TC_7      | 100.00%        | 100.00%             | 86.54%        | 73.20%          | 67
TC_8      | 100.00%        | 100.00%             | 86.84%        | 74.05%          | 84
TC_9      | 100.00%        | 100.00%             | 86.74%        | 74.17%          | 82
position. The control algorithms [37,38] for these systems are also designed to
accommodate the variability in the physical system. Today, there is no systematic way to
include such application level tolerance information in the IC functional safety analysis
process. Hence, the design is often pessimistic leading to further increase in the
overheads due to functional safety.
This thesis proposes new methods to address these limitations.
3. Safety Analysis Pessimism Reduction by
Utilizing Application Tolerance
Not every random fault occurring in an IC will result in application failure.
Different masking effects (e.g. logical masking, electrical masking, latching window
masking and application level masking) [35] can prevent the fault from propagating to
the output and causing the application to fail. Safety analysis without considering these
masking effects will result in over-design leading to an increased hardware overhead.
While there is reported work on effects of logical, electrical and timing masking, those
due to application masking have not been as well studied.
Analysis at higher abstraction levels helps to incorporate additional masking effects
and these can be beneficially employed to reduce the hardware overhead. On the other
hand, analysis at higher abstraction levels significantly increases the analysis complexity.
As an illustration, an application level analysis for functional safety of a motor control
system will require consideration of motor and its different operating conditions
involving motor speed, load, etc. This requires inclusion of the motor models and makes
the analysis more complex, as compared to that of the IC considered standalone. Figure
3.1 indicates the hardware overhead and analysis complexity trade-offs associated with
functional safety analysis when carried out at different abstraction levels.
Figure 3.1. Safety analysis complexity and hardware overhead tradeoffs (abstraction levels: Device, Module, SoC, Application).
Many of today’s real life safety critical systems have high accuracy and self-
correction requirements to adjust for the physical system’s non-linearity and noise
effects. These systems are designed as closed loop control systems and typically have an
associated acceptable tolerance in the extent to which the behaviour can deviate from the
ideal or centre position. The control algorithms [37,38] for these systems are also
designed to accommodate the variability in physical system. The design of ICs used in
such systems can beneficially employ these system level tolerances to identify the
minimum set of components that are required to be protected, thereby reducing the
hardware overhead. However, this significantly increases the safety analysis complexity.
In order to address the complexity issues associated with system level analysis, a
new divide and conquer approach is proposed in this chapter. The approach involves a
two-step process. The first step apportions the system level tolerance to the individual
modules in the system, including the IC (or ICs). The apportioned tolerance information
can then be used to identify the critical flip-flops which need to be protected. Various
techniques used for apportioning of the system level tolerance to ICs and their associated
modules are covered in this chapter. The main contributions of this chapter of the thesis
are: (i) proposing an integrated approach for the functional safety analysis of hybrid
systems (ii) creating a framework for including application level attributes like closed
loop operation and acceptable error consideration for IC safety analysis and (iii)
apportioning of system level tolerances as value tolerance and time tolerance to various
modules in the system.
The rest of this chapter is organized as follows. Section 3.1 gives a brief overview
of related work in this area. Section 3.2 describes the improved safety analysis technique
where a new technique for mapping the application tolerance to digital system as value
and time tolerance is proposed. The section also details how the proposed
technique can be used to include system level tolerance for IC level safety analysis.
Section 3.3 describes the different ways by which the IC level value and time tolerance
can be derived from the application and Section 3.4 concludes the chapter.
Background and Related Work 3.1
Many real-life systems have ICs interacting with physical systems in safety critical
applications. These systems are typically designed as closed loop control systems. A
closed loop system, by its very nature, can correct certain errors since such a system can
build its resilience across subsequent iterations of the control loop. In addition, the
interacting physical system has latency and tolerance. A change in the control value
driving the physical system may not reflect immediately on its behaviour due to the
inherent inertia. As an example, on a system level safety analysis we performed for a
BLDC motor, a stuck-at fault on the control input takes about 92 cycles of closed loop
operation (4.6 ms) to result in a 5% variation in the motor speed. However, incorporation
of the system level artefacts results in significant increase in the analysis complexity. In
this chapter, we propose improved safety analysis techniques to consider these effects
while designing an optimised system which meets functional safety requirements.
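The quoted figures are mutually consistent: at the 20 kHz control-loop rate used in the BLDC experiments of Section 3.3.1.1, 92 loop iterations correspond to 4.6 ms:

```python
loop_rate_hz = 20_000        # control loop frequency (Section 3.3.1.1)
cycles = 92                  # closed-loop iterations before 5% speed drift
print(cycles / loop_rate_hz)  # 0.0046 s = 4.6 ms
```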
Several analytical and statistical methods [65,66,67,68] have also been proposed to
speed-up the functional safety analysis of ICs. In RAVEN (RApid Vulnerability
EstimatioN) [66], the authors propose partitioning the IC into smaller sub-blocks
followed by separate analysis of each sub-block. The sub-block analysis results are then
used to estimate the system level error probability and identification of critical
components. [67] proposes an approach for computation of output error probability using
individual signal probability and fault propagation probability. The authors introduce the
concept of using time tolerance for soft error evaluation and propose the use of Markov
models to estimate error probability after the time tolerance window. [68] proposes a
correlation based approach to identify dependencies of signal probabilities of error free
signals and error probability of erroneous signals to provide higher SER estimation
accuracy.
These methods, however, treat IC safety as a standalone problem without
considering its interaction with the surrounding physical system. Even when system level
effects are considered, they are dependent on a particular state and not on the closed loop
behaviour [66,67]. Hybrid systems, on the other hand, have multiple such states which
are acceptable across different control loop iterations. In the absence of hybrid system
consideration, the analysis will be pessimistic [64].
Multiple research works have focused on implementing robust control in the
presence of errors [69] for generic control systems as well as for specific motor control
systems [70]. These works have been limited to robustness to sensor and actuator errors
within certain parametric bounds and do not consider interactions between the closed
loop modules. There are several observer based approaches employed for fault detection.
VDA consortium (Verband Der Automobilindustrie - German Association of the
Automotive Industry) came up with a monitoring concept to ensure functional safety of
gasoline and diesel engines known as VDA E-Gas architecture [71]. This approach is
widely used in the engine control system of many automobiles [72]. In [73], the use of
residual generator and residual evaluation module for fault detection purposes is
proposed. [63] introduces the concept of using mapped predictive check states for the
detection of transient faults in the controller. These approaches help in online fault
detection, but do not help to localize the fault or take the corrective action. As fail
operational system requirements, (i.e. systems which continue to operate even after a
fault is detected), are becoming more prominent in automotive applications [74], these
approaches have some limitations which restrict their applicability.
The methodology in this chapter addresses some limitations of earlier work
reported in the literature by including some new artefacts into the analysis. (i) Safety
analysis for the IC is performed together with the interacting physical system making the
analysis more reliable. (ii) Application level attributes like closed loop operation and
acceptable error are considered for the analysis making it less pessimistic. (iii) In order to
reduce the analysis complexity due to the additional system information being
considered, a divide and conquer approach is used. Critical flip-flops required to be
protected can be directly identified using this analysis, as against restricting it to just
identifying the system drifting into an unsafe state, thereby enabling the implementation
of low cost monitoring and correction mechanisms. Low overhead design robustness
(either through component hardening or fault tolerance) techniques can then be used for
protecting the right sub-set of critical components (e.g. flip-flops) as against the entire
design.
Improved Safety Analysis Technique 3.2
A representative hybrid system is shown in Figure 3.2. It consists of a physical
system comprising of the motor, power stage providing current to the motor and Hall
sensor determining the motor position. This physical system is controlled by a digital
controller whose main function is to maintain the speed of the motor within an acceptable
range as set using the communication interface.
The digital controller consists of: (i) A CAP (CAPture) [75] module which decodes
the Hall sensor output and provides rotor position information to CPU. (ii) The CPU
receives the speed set-point information and rotor position information of the motor. It
executes the control algorithm to determine the actuation required to maintain the desired
motor speed. (iii) A PWM (Pulse Width Modulator) [76] which provides pulsed inputs to
the power stage which drives the motor. The motor speed is proportional to the PWM
duty cycle. (The feedback control loop adjusts the motor speed under different operating
conditions). Since the digital system and physical system operate in a closed loop and
there is an inherent inertia associated with physical system, the system is robust to some
errors. This is specified as the application tolerance.
3.2.1 Value and Time Tolerance
Hybrid systems are typically associated with a tolerance range around the control
point wherein they can operate in an acceptably correct manner and any minor
perturbation around this point can be corrected during the course of time, (i.e. in the
Figure 3.2. Illustration of a hybrid system.
subsequent operating loops), without the application being perceptibly impacted. We
refer to this as value tolerance. This depicts the set of control values around the correct
control value using which the system can operate in an acceptably correct manner. This
can also be visualized as the maximum steady state error which can be tolerated by the
system.
Additionally, these systems can be slow to react (with respect to the frequency of
the clock used in the digital control). It will take some finite time, (i.e. a few operating
loops), termed as time tolerance, for the system to respond to a change in the control
input, even for real-time applications. Time tolerance depicts the time duration for which
the system is able to function in an acceptably correct manner for a maximum deviation
in the control values. The tolerance in value and time are depicted in Figure 3.3. Beyond
this tolerance range, i.e. beyond a range of control values and / or beyond a given time
duration, the system can drift and go outside the acceptable operating range. The closed
loop operation can either contain or correct many of the errors within an IC without
causing the system to fail.
The combined impact due to value tolerance and time tolerance can be complex
since not all value tolerances may be applicable across the time tolerance windows. In
this evaluation, therefore, we calculate the time tolerance across the entire operating
range for the worst-case variation in the values of control inputs.
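Under this worst-case simplification, a per-module check against the apportioned tolerance tuple {Vt, Tt} might look as follows (a sketch with our own naming: an injected fault is tolerable only if every excursion beyond the value tolerance ends before the time tolerance expires):

```python
# Combined value/time tolerance check. Time tolerance is measured here in
# control-loop iterations; deviations are absolute errors per iteration.
def within_tolerance(deviations, vt, tt):
    """deviations: per-loop |output - reference|; vt: value tolerance;
    tt: time tolerance in loops. Returns False if any excursion beyond
    vt persists for tt or more consecutive loops."""
    run = 0
    for d in deviations:
        run = run + 1 if d > vt else 0
        if run >= tt:
            return False
    return True

print(within_tolerance([0.0, 0.8, 0.8, 0.1], vt=0.5, tt=3))  # True: excursion too short
print(within_tolerance([0.9, 0.9, 0.9, 0.0], vt=0.5, tt=3))  # False: excursion too long
```

A flip-flop whose injected fault makes this check fail at the module output would be marked for protection.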
Figure 3.3. Tolerance in value and time over the control input range.
To understand the tolerance in a motor control system, we performed random fault
injections (explained further in Section 2.5.1) and observed the controlled parameter
(speed) variation with time after the faults are injected. The various profiles thus obtained
are illustrated in Figure 3.4. Case (i) indicates the fault free scenario. Certain injected
faults depicted in Case (ii), (e.g. those causing minor perturbation in closed loop
algorithm co-efficients which are not corrected across iterations), cause perturbation in
the motor speed; however the motor settles to a new speed which is within the acceptable
range. Other faults depicted in Case (iii), (e.g. those in the data computation logic which
get corrected in closed loop iterations), cause perturbation in speed which also gets
corrected within an acceptable time. Faults depicted in Case (iv), (e.g. those in the
program counter which change the control flow causing an unrecoverable error), cause the
motor speed to drift out of the acceptable range, resulting in application failure. Safety
analysis techniques employed to identify the critical elements within the IC should
consider these system level tolerance effects. In its absence, the analysis will be
pessimistic.
Figure 3.4. Motor speed variation for various injected errors.
3.2.2 Divide and Conquer Safety Analysis Approach
We have seen in Section 2.5.1 that fault injection is the preferred technique to
identify the set of critical elements (e.g. flip-flops) which must be protected during the IC
design phase. Fault injection driven simulation is performed wherein the faults are
injected one at a time on every flip-flop inside the circuit, and the corresponding outputs
are observed. If the injected fault in a flip-flop results in an unacceptable change in at
least one output (across value and time) which is not detected by any safety mechanism
within the circuit, then this flip-flop must be protected.
In an ideal scenario, the physical system can be modelled and the key parameters of
this system can also be observed during the fault injection experiments. Such co-
simulation environments have been demonstrated for smaller logic [77] but have not
been adopted for larger ICs due to limitations of slow progression of digital simulation,
longer simulation time due to slower response of the physical system in response to any
change in control input, etc. The increase in the simulation time along with the
requirement to inject a fault in every operating cycle and on every flip-flop of the circuit
significantly increases the computational complexity of such evaluations.
In order to address the simulation complexity issue, we propose a divide and
conquer approach. This is carried out in two steps: (i) Application tolerance is mapped
(apportioned) to individual module outputs as value and time tolerance. (ii) Safety
evaluation is carried for the individual modules using these tolerance values. As an
Figure 3.5. Computation of value and time tolerance.
illustration for the mapping of application tolerance, consider the case of the closed loop
motor control system in Figure 3.5. A ±5% variation is considered the acceptable
tolerance for the motor speed. (This was determined based upon the peak overshoot for
which the control system is designed). In this system, the application tolerance δOT (±5%
variation in speed) has to be mapped to individual module outputs as value and time
tolerance. In the figure, δO1, δO2 and δO3 denote the value and time tolerance (can be
represented as a tuple {Vt, Tt}) at the output of the CAP, CPU and PWM modules.
The process of mapping of the application tolerance to individual circuit modules
can be considered similar to the problem of timing budgeting performed during the SoC
design phase, wherein the timing slack available at the SoC level (due to the clock
frequency as well as the I/Os interacting with the external system) is apportioned to
individual IP modules therein. Timing closure for these modules is carried out
individually using these apportioned budgets and integrated into the SoC. Such a
methodology may not lead to an optimal design as the combined effect of interacting
signals is not considered. However, such a divide and conquer approach is widely
adopted due to practical limitations, (e.g. design and analysis complexity, EDA tool run-
times, etc.), associated with a flat timing analysis approach. This has also motivated the divide
and conquer methodology presented in this chapter.
Once the individual module tolerances are identified, they are used for standalone
fault injection driven module analysis. (A simulation model, (e.g. Verilog netlist), or an
emulation platform can be used). If the injected fault causes the output of the flip-flop to
deviate beyond the value tolerance and beyond the time tolerance, then this flip-flop is
marked for protection. The probability of soft errors is considered low. Recent works
have indicated at least four orders of magnitude difference between SEU (single event
upset) and MBU (multiple bit upset) rate [78]. As a result, in the fault injection
campaigns, only single faults modelling SEUs are considered.
3.3 Evaluation of Value Tolerance and Time Tolerance
Estimation of value tolerance and time tolerance is an important step in divide and
conquer functional safety evaluation approach. Depending upon the system complexity
and model / hardware availability, this can be performed in different ways. This includes
(i) Determination of tolerance using actual system
(ii) Analytical estimation of tolerance values for simple systems
(iii) Simulation based evaluation using higher level models
3.3.1 Determination of Tolerance Using Actual System
Sometimes, functional safety evaluation is performed towards the later phases of
the system development when we already have the digital system and physical system
integrated. There may also be cases when the accurate model of the system is not
available and it may be easier to perform the evaluation using the end system itself. In
such scenarios, we may have to evaluate the tolerances using the actual system. This
section explains the steps involved to evaluate the value and time tolerance for the CPU
module using the actual system.
Consider a motor control system as shown in Figure 3.6. We can map the output
application tolerance δOT to value and time tolerance at various module interfaces. In
Figure 3.6. Closed loop control system operation.
order to estimate the time tolerance, values corresponding to maximum error, (i.e. the
CPU outputs generating the maximum and minimum possible actuator input values in
this case), are forced and the time required for the motor to go beyond the ±5% tolerable
range of speed variation is obtained. i.e. for a motor operating speed of 900 RPM, the
lower and upper speed limits are set at 855 and 945 RPM respectively. Similarly, to
ascertain the value tolerance, the maximum and minimum permissible CPU output values
during the steady state operating condition, are ascertained. This process can be repeated
at the output of each module to determine the value and time tolerance of individual
modules. We enumerate the steps to compute the time tolerance and value tolerance of
CPU below.
Time tolerance:
(i) Force the maximum value at the CPU output (data bus interface of the module
forced to 0xFF…) during application execution.
(ii) Determine the time taken by the application to go out of acceptable application
tolerance (Tt-max).
(iii) Force minimum value at the CPU output (data bus interface of the module
forced to 0x0…) during application execution.
(iv) Determine the time taken by the application to go out of acceptable application
tolerance (Tt-min).
(v) Time tolerance Tt = min (Tt-max, Tt-min).
Value tolerance:
(i) Strobe the CPU output during application execution for the entire operating
speed of the motor (corresponding to the full range of outputs of the PWM
module). The module output changes within a particular range due to steady
state error of the control system. (We could have varied the CPU output to
determine the maximum and minimum acceptable values. However, for ease of
implementation, we took a more conservative approach of determining the value
tolerance based on the maximum and minimum values the particular output takes
during execution. This will still result in reduced pessimism).
(ii) Determine the maximum (Vt-max) and minimum (Vt-min) values at the module
output.
(iii) Value tolerance Vt = (Vt-max - Vt-min).
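The steps above reduce to simple min/max arithmetic. In the sketch below, the 900 RPM operating speed and ±5% limits follow the text; the forced-output loop counts and strobed CPU output samples are illustrative placeholders, not measured data:

```python
# Acceptable speed band: 900 RPM +/- 5% gives 855 and 945 RPM limits.
nominal_rpm = 900
lower, upper = nominal_rpm * 95 / 100, nominal_rpm * 105 / 100
print(lower, upper)  # 855.0 945.0

# Time tolerance: worst case of forcing the CPU output to all-ones (0xFF...)
# and all-zeros (0x0...); loop counts here are illustrative.
tt_max, tt_min = 120, 92
Tt = min(tt_max, tt_min)

# Value tolerance: spread of the strobed CPU output in steady state
# (illustrative samples).
observed_outputs = [5020, 5075, 5044, 4990]
Vt = max(observed_outputs) - min(observed_outputs)
print(Tt, Vt)  # 92 85
```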
This section gives two representative examples, BLDC motor control and AC
inverter, to illustrate how the tolerance information can be derived from the end system.
3.3.1.1 BLDC Motor Control System
The experimental setup for deriving the tolerance for BLDC motor control system
consists of a three phase eight pole BLDC motor driven by a voltage source inverter
(DRV8312) [79] and Microcontroller (F2805x) [80] shown in Figure 3.7. The power
stage of the inverter is actuated by a Pulse Width Modulator (PWM) [76] which is part of
the Microcontroller (MCU). The MCU has a CPU which executes the closed loop control
algorithm to drive the BLDC motor to rotate at the commanded speed. The BLDC motor
is equipped with three Hall sensors which produce voltage outputs depending on the rotor
position. The CAPture (CAP) module [75] of the MCU uses these outputs to estimate the
rotor position. The motor speed is estimated based on the rate of change of position.
The data flow of the application is shown in Figure 3.8. The CPU executes the PID
(Proportional Integral Derivative) control algorithm every time it receives an interrupt
from the timer. For every Interrupt Service Routine (ISR), the CPU samples the CAP
input and estimates the motor speed and position. This information, along with the
desired speed input received from the communication interface, is used to determine the
magnitude of current to be applied on the three phase windings for torque generation.
For this experiment, the CPU frequency is 80 MHz and the control loop frequency is
20 kHz. To demonstrate application tolerance, we have selected a speed tolerance of 5%,
which corresponds to the peak overshoot for which the control system is designed.

Figure 3.7. DRV8312, F2805x and BLDC motor.
During closed loop operation, there can be changes in operating conditions (e.g.
load), changes in commanded speed, errors in the sampling of input values (e.g. sensor
linearity error), etc. The control algorithm execution depends upon these input
parameters, which are captured into an associated memory and an associated set of states.
(The integral part of the Proportional Integral Derivative (PID) control keeps a memory
of the input values.) Their combined effect creates multiple different application threads
for the single application. These different application threads can drive multiple {Vt, Tt}
tuples, since different variations in the value (and time) tolerances can affect the
individual modules differently. As a result, the complexity of the analysis can increase,
requiring multiple threads of execution across the fault injection experiments. In our
work, we have considered the worst case value at the output and estimated the time
tolerance.

Figure 3.8. Dataflow of closed loop motor control application.
In order to evaluate time tolerance at the CPU output, we inject faults leading to
worst case variation at the CPU output and determine the number of cycles required for
the motor speed to move out of the application tolerance range. The CPU provides a 32
bit binary value to the PWM module. To compute the time tolerance at the CPU output,
values of 0x0 (denoting zero duty cycle for PWM i.e. zero electrical energy applied to the
motor) and 0xFFFFFFFF (denoting 100% duty cycle i.e. maximum electrical energy
applied to the motor) are forced on the CPU outputs.
The time tolerance results thus obtained are plotted in Figure 3.9. The X-axis
denotes the motor speed in RPM (Revolutions Per Minute) under different operating
conditions and the Y-axis denotes the number of closed loop cycles required for the
motor to go out of the tolerable speed range. Two different values of time tolerance,
corresponding to minimum and maximum values of CPU output, are calculated at
different speed settings of the motor. These results indicate that the closed loop motor
control application has a time tolerance of 92-228 cycles of closed loop operation (which
is equivalent to 276K to 684K cycles of CPU operation, translating to 4.6 to 11.4 ms),
which corresponds to ±5% speed variation for different operating speed settings. The
application can operate in an acceptably correct manner for at least 92 cycles of closed
loop operation across the full operating range of the motor, in the presence of worst case
error in the system.

Figure 3.9. Application time tolerance in the presence of worst case errors.
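The worst case fault injection experiment can be mimicked at a high level; a minimal sketch, assuming a simple first order decay model for the motor speed with an illustrative rate constant a (the real experiment uses the actual motor and injects faults on hardware, so the constants below are not taken from the thesis setup):

```python
import math

def time_tolerance_cycles(set_speed, a=5.0, dt=1 / 20000.0, tol=0.05):
    """Count closed loop cycles (20 kHz loop) until the motor speed, with
    the CPU output forced to the worst case of zero duty cycle, leaves the
    +/-tol band around the set speed. First order model: ds/dt = -a*s."""
    speed = set_speed                      # steady state before the fault
    cycles = 0
    while abs(speed - set_speed) / set_speed <= tol:
        speed *= math.exp(-a * dt)         # exact one-step solution
        cycles += 1
    return cycles

print(time_tolerance_cycles(1650.0))  # → 206 with these illustrative constants
```

With this model the cycle count is independent of the set speed; the measured spread of 92-228 cycles in the real system reflects operating-point dependent dynamics that the toy model does not capture.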
In order to compute value tolerance, the CPU output (PWM input) is monitored
during application execution. Due to variations in the operating conditions and error in
position / speed determination, there is a slight variation in the sensed speed and the CPU
output will change based on the algorithm being executed. The tolerable variation (steady
state error) in CPU output thus obtained for different speed settings is tabulated and its
variation from the value corresponding to desired speed is noted. The minimum
(maximum) tolerance is observed at an RPM of 1650 (2400). The tolerable CPU output
variation at the RPM of 1650 (2400) is 0.54-0.568 (0.785-0.842) which equals a variation
of 5% around 0.55 (7% around 0.8). The CPU output tolerance expressed as a percentage
of CPU output value is shown in Figure 3.10. Here the X-axis denotes the different motor
speeds, and the Y-axis denotes the CPU output value tolerance.
3.3.1.2 AC Inverter System
We next used a digital power system to evaluate the value and time tolerances. We
used a microcontroller (MCU) based AC inverter development kit [81] for determining
the tolerance values. The experimental setup consists of an AC load driven by an MCU as
shown in Figure 3.11. The MCU has a CPU which executes the closed loop control
algorithm to drive the AC load (e.g. a motor) at the commanded voltage and current
levels. The AC load voltage and current are measured by the MCU using the on chip
Analog to Digital Converter (ADC).

Figure 3.10. Application value tolerance as percentage of CPU output value.
The dataflow of the AC inverter application is similar to that of the BLDC motor
control application and is represented in Figure 3.12. The timer module periodically
generates an interrupt which triggers the ADC module to sample the load current value.
The measured load current value is compared with the reference value to determine the
control error. The CPU periodically executes the Proportional Integral (PI) control
algorithm to minimize this error. The CPU output drives the Pulse Width Modulator
(PWM), which controls the power stages that determine the voltage / current provided to
the AC load. For this experiment, a CPU frequency of 60 MHz and a control loop
frequency of 20 kHz are chosen. We have selected a ±4% output tolerance (δOT), which
corresponds to
the peak overshoot in current for which the control system is designed. We need to
compute the value and time tolerance tuple δCT at the CPU output.

Figure 3.11. AC inverter application kit.
Figure 3.12. AC inverter dataflow diagram.
In order to evaluate time tolerance at the CPU output, we inject faults leading to
worst case variation at the CPU output and determine the number of cycles required for
the Root Mean Square (RMS) value of AC inverter load current to move out of the
application tolerance range. The CPU provides a 32 bit binary value to the PWM module.
To compute the time tolerance at the CPU output, values of 0x0 (denoting zero duty cycle
for PWM i.e. zero electrical energy applied to the AC inverter load) and 0xFFFFFFFF
(denoting 100% duty cycle i.e. maximum electrical energy applied to the AC inverter
load) are forced on the CPU outputs.
The time tolerance results thus obtained are plotted in Figure 3.13. The X-axis
denotes the normalized value (output voltage of 0-20V is normalized to a range from 0 to
1) of set-point voltage and the Y-axis denotes the number of closed loop cycles required
for the inverter to go out of the tolerable current range. Two different values of time
tolerance, corresponding to minimum and maximum values of CPU output, are calculated
at different voltage settings of the AC inverter. These results indicate that the closed loop
AC inverter application has a time tolerance of 10-55 cycles of closed loop operation
(which is equivalent to 30K to 165K cycles of CPU operation, translating to 0.5 to 2.75
ms), which corresponds to ±4% RMS current for different operating voltage settings.
The application can operate in an acceptably correct manner for at least 10 cycles of
closed loop operation across the full operating range of the AC inverter, in the presence
of worst case error.

Figure 3.13. Application time tolerance in the presence of worst case errors.
It can also be observed that the tolerance range for the BLDC motor control system is
about 10 times higher than that of the AC inverter system. This is expected, since the
BLDC motor control physical system involves mechanical components, whereas the AC
inverter system is purely electrical. The response of a mechanical system is sluggish
compared to that of an electrical system.
In order to compute value tolerance, the CPU output is monitored during
application execution. Due to variations in the operating conditions and errors in output
voltage determination, there is a slight variation in the sensed values, and the CPU output
changes based on the algorithm being executed. The tolerable variation (steady state
error) in CPU output thus obtained for different set-point settings is tabulated, and its
variation from the value corresponding to the desired set-point is noted. The minimum
(maximum) tolerance is observed at a set-point of 0.2 (0.15). The tolerable CPU output
variation at the set-point value of 0.2 (0.15) is 0.292-0.321 (0.219-0.273), which equals a
variation of 9% around 0.307 (21% around 0.246). The CPU output tolerance expressed
as a percentage of CPU output value is shown in Figure 3.14. Here the X-axis denotes the
different set-point values for output voltage, and the Y-axis denotes the CPU output value
tolerance.

Figure 3.14. Application value tolerance as percentage of CPU output value.
3.3.2 Analytical Estimation of Tolerance Values
It may not always be practical to evaluate the tolerances on the actual system (e.g. the
system may not be available), and we may have to resort to other techniques. If
the system model is simple, we can evaluate the tolerance analytically. In this section, we
use a simple first order control system as indicated in Figure 3.15 to illustrate analytical
estimation of value tolerance and time tolerance.
Consider the scenario where the system gets input set-point value from an external
source and the controller will modulate the input to the physical system based on the
difference in value between the set-point input and physical system output. Consider an
output tolerance of δOT. We need to map this application output tolerance δOT to tolerance
δCT at the controller output. δCT is a tuple consisting of both value and time tolerance.
In order to simplify the computation, we assume that the system is operating in
steady state at $t = 0$, with $u(0) = u_0$ and $y(0) = \frac{b}{a}u_0$. Consider the scenario where a
fault in the controlling digital system causes its output to change from $u_0$ to $u_1$ at time
$t = 0$. We need to derive the system evolution over time due to this change in the digital
system output to compute the value tolerance and time tolerance.
The transfer function of the first order physical system is
\[ \frac{Y(s)}{U(s)} = \frac{b}{s+a} \qquad (1) \]
The equivalent differential equation for the system can be written as
\[ \frac{d}{dt}y(t) = -a\,y(t) + b\,u(t) \qquad (2) \]
with initial conditions $u(0) = u_0$ and $y(0) = \frac{b}{a}u_0$.

Figure 3.15. First order system.
We need to solve the differential equation in (2) to identify how the system evolves
due to a change in control input from $u_0$ to $u_1$. On solving (2),
\[ y(t) = e^{-at}\int_0^t b\,e^{ax}u(x)\,dx + \frac{b\,u_0\,e^{-at}}{a} = \frac{b\,u_0\,e^{-at}}{a} + \frac{b\,u_1\,e^{-at}\left(e^{at}-1\right)}{a} \qquad (3) \]
This evolution of the system due to the change in digital system output can be used
to determine the time tolerance and value tolerance. The reference, input and output
values of the system are represented in Figure 3.16.
Time Tolerance Determination
Time tolerance is defined as the maximum time for which the physical system
output remains within the application tolerance range for a worst case variation in the
digital system output. Assuming the worst case error is $u_1$, and $y_1$ is the output value
at the boundary of the application tolerance range,
\[ \text{Time Tolerance} = \frac{1}{a}\log\!\left(\frac{b\,u_1 - b\,u_0}{b\,u_1 - a\,y_1}\right) \qquad (5) \]
Figure 3.16. Input and output values of the control system.
Value Tolerance Determination
Value tolerance is defined as the maximum error in the digital system output for
which the physical system output is within the defined application tolerance δOT even if
the error persists for an infinitely long time. Assuming $y_1$ is the acceptable output
considering the application tolerance (i.e. the maximum variation that can be allowed for
the physical system),
\[ \text{Value Tolerance} = \frac{a\,y_1}{b} \qquad (4) \]
We have used the simplifying assumption that the system continues without any
change in load or control input throughout the evaluation period.
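The two closed-form expressions can be checked numerically; a small sketch with illustrative constants $a$, $b$, $u_0$ (chosen for the example, not taken from any real system), verifying that the output trajectory from (3) reaches the tolerance boundary exactly at the computed time tolerance:

```python
import math

def value_tolerance(a, b, y1):
    # Eq. (4): controller output whose steady state b*u/a equals y1
    return a * y1 / b

def time_tolerance(a, b, u0, u1, y1):
    # Eq. (5): time for y(t) in Eq. (3) to reach the tolerance boundary y1
    return (1.0 / a) * math.log((b * u1 - b * u0) / (b * u1 - a * y1))

# Illustrative first order system: a = 2, b = 4, steady state input u0 = 1
a, b, u0 = 2.0, 4.0, 1.0
y0 = (b / a) * u0                 # nominal output: 2.0
y1 = 1.05 * y0                    # +5% application tolerance boundary
u1 = 2.0                          # worst case faulty controller output

# Eq. (3) rewritten as y(t) = b*u1/a + (b/a)*(u0 - u1)*exp(-a*t)
t = time_tolerance(a, b, u0, u1, y1)
y_t = b * u1 / a + (b / a) * (u0 - u1) * math.exp(-a * t)
print(round(y_t, 6), round(y1, 6))  # both 2.1: boundary reached at t
```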
3.3.3 Determination of Tolerance Using High Level Models
Analytical estimation of value and time tolerance becomes impractical as system
complexity increases. In such cases, higher level models can be used to determine the
tolerance values. In this section, we show how tolerance can be evaluated from higher
level models using a representative system: a Permanent Magnet Synchronous Motor
(PMSM) driven by a traditional PID controller.
We modelled a Permanent Magnet Synchronous Motor (PMSM) driven by a
traditional PID controller, as shown in Figure 3.17, using Matlab Simulink [82]. The
PMSM output tolerance δOT needs to be mapped to the controller output tolerance
δCT. The control loop is executed at 10 kHz and the motor is set to run at a speed of 40
revolutions per second (rps). We added a fault injection infrastructure between the
controller and the PMSM. The infrastructure supports two types of injection: (i) forcing
the output to zero and maximum value, which is used for determining the time
tolerance; (ii) increasing and decreasing the input to the motor by a constant value,
which is used for determining the value tolerance.

Figure 3.17. PMSM control system.
In the simulation setup, the output is first forced to zero and maximum value. The
worst case (i.e. the one with the lesser tolerance) output waveform thus obtained is
shown in Figure 3.18. A zoomed in version of the figure, to aid computation of the time
tolerance, is shown in Figure 3.19. The time tolerance determined from the figure is
308 µs.

Figure 3.18. PMSM output in the presence of worst case error.
Figure 3.19. Magnified version of PMSM output in the presence of worst case error.

We then forced different values at the input of the PMSM using the fault
injection setup shown in Figure 3.17 and observed the change in motor speed. We found
the value which is required for the motor to be just within the application tolerance;
this is called the value tolerance. The output thus obtained is shown in Figure 3.20. The
value tolerance thus obtained is 33.3%, i.e. a variation in the peak to peak value of the
digital system output from 0.15 to 0.20 which keeps the application output within 5%
variation. It can also be noticed that the error tries to pull the output away from its
stable value while the closed loop action tries to bring it back, leading to a sinusoidal
output.
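The 33.3% figure follows directly from the injected band; a one-line check:

```python
# Peak to peak digital system output band found by fault injection
u_low, u_high = 0.15, 0.20
value_tol_pct = (u_high - u_low) / u_low * 100  # relative to the nominal 0.15
print(round(value_tol_pct, 1))  # 33.3
```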
3.4 Conclusion
We have used a motor control example to demonstrate the tolerance present in
practical systems. However, directly utilizing application tolerance for IC safety
analysis is not practical due to the significant increase in analysis complexity. We
propose a divide and conquer approach to address the practical limitations in performing
such an evaluation during the IC design phase. In this approach, the application tolerance
at the system level is apportioned to individual modules as value and time tolerances,
thus enabling the IC safety analysis to use this information and reduce the pessimism.
We have also discussed the different approaches by which the application tolerance can
be mapped to the IC level.
Figure 3.20. PMSM output used for determining value tolerance.
4. Formal Verification Based Approach for
Accurate Safety Analysis
Once the application tolerance is apportioned to individual modules, the analysis
can be performed on these modules standalone. Fault injection is widely used in the
industry to provide a metric on the suitability of a circuit module for use in safety
critical applications. Current fault injection methods used in industry do not provide a
systematic way to include application specific tolerance, thereby resulting in pessimistic
evaluation. In addition, the evaluation is not exhaustive as the workloads used for fault
injection typically cover only a small subset of the functional scenarios (conditions and
states) possible. Ensuring sufficiency of the workloads used for fault injection based
safety evaluation is an open problem. Other practical limitations of fault injection based
techniques are also highlighted in recent works [66,83].
Formal verification techniques have been proposed in the literature as an alternative
to simulation methods using fault injection. However, they introduce new challenges: the
requirement to model the complete set of properties for the system makes them effort
intensive, and the absence of application related input constraints leads to almost all the
elements being identified as critical.
Real-life systems are complex and pose the challenge of not being able to
comprehensively categorise their constituent components as safe or dangerous. This is
because these systems must include physical components (which are inherently
analog in nature), and the accuracy of the analysis depends upon the performance
deviations which can be tolerated under different use / application conditions. This
chapter proposes new methods for safety analysis of such systems. Its main contributions
are: (i) capturing the application diversity as input constraints (values and sequences);
(ii) modelling application specific tolerances as an output range across time intervals;
(iii) incorporating formal techniques into the analysis framework to make it more
accurate and less pessimistic than is possible using existing methods. Results of this
analysis are presented on ITC benchmark circuits and two industrial circuits.
The rest of this chapter is organized as follows. Section 4.1 briefly overviews
various techniques for safety analysis with specific focus on formal verification based
techniques. Section 4.2 introduces the proposed safety analysis framework. Experimental
results using the proposed approach are presented in Section 4.3 and Section 4.4. Section
4.5 concludes the chapter.
4.1 Background and Related Work
Safety analysis is required to ascertain the safe behaviour of a system under
different operating conditions. Several analysis techniques exist at different design stages
and abstraction levels (e.g. hardware / circuit level, software / application level, system
level etc.). Workload based fault injection [84,85] is widely practiced in industry to
ascertain the robustness of a circuit to faults. However, experimental evaluation presented
in Section 2.5.1 indicated lack of comprehensive workloads as a major limitation in
addressing the quality requirements for functional safety evaluation. This motivated us to
explore the use of formal property checking methods.
The use of formal property checking for safety analysis was first proposed in [86],
wherein the deterministic occurrence / non-occurrence of events in the presence of
modelled faults was analysed based on properties derived from the specification. [87] proposed the
use of model checking for identifying latches which must be protected. The analysis was
based on creating equivalent formal models, considering an SEU on each latch, one at a
time. A failing property indicates a dangerous latch, which must be protected. Properties
internal to the circuit and at the circuit outputs are considered. The application behaviour
and tolerances are not considered, resulting in pessimistic results. Besides, these methods
require a set of golden properties which fully capture the circuit behaviour. In cases
where verification is not performed using formal methods, significant effort is needed to
code these properties.
The work in [88,89] uses bounded model checking along with module duplication
(one copy is golden and the other is fault injected) and properties based upon output
mismatch. If the mismatch property is satisfied, the flip-flop on which the fault is injected
is classified as dangerous. The process is repeated for each flip-flop in the module. Using
this approach, no circuit behaviour specific properties are needed. [90] uses a similar
approach to check for robustness of the error detection logic in the presence of faults in
the main circuit. The above methods have not considered application specific scenarios,
e.g. input constraints, sets of valid and invalid input and internal state sequences, and
output constraints and tolerances. As a result, the above approaches can be pessimistic,
thereby identifying more dangerous elements.
[91] discusses the impact of imperfect hardware (structural defects) on application
performance. However, it approaches the problem from the manufacturing test
perspective, and the safety implications of the imperfect hardware are not considered.
[67] proposes the use of
application tolerance provided the output recovers within a specified time as determined
by the application. Equal probabilities are assigned to flip-flop states (0 or 1), and the
output error probability for a single transient fault on all combinational gates and flip-
flops is computed. The proposed approach is static in nature and dynamic input
sequences (including closed loop feedback behaviour) are not considered. [92] proposes
application of system level considerations in performing the safety evaluation, owing to
substantial error masking occurring at higher design abstractions. The methodology
employs fault injection on selected workloads for identifying the contribution of
individual gates to the system level FIT. This approach solves the problem of safety
analysis of standalone applications with limited workload.
All the above approaches help to generate a quantitative metric for safety (in terms
of which flip-flops are safe and which are dangerous) based upon the circuit construction.
The work presented in this chapter augments the solution space to provide a more
accurate safety metric by incorporating these additions: (i) Specific input constraints and
sequences to match functional workloads are considered. (ii) Tolerances in the output
values and time required to attain them are included. (iii) Safety requirements across
different modules are budgeted more accurately by including the physical system
behaviour.
It is therefore important to consider more comprehensive workloads and perform an
exhaustive fault injection. This is done in two steps. (i) Workload specification using
input constraints (input values and relationship between values) and input sequences.
(This can also be represented using FSMs). (ii) Analysis covering SEU injection on each
flip-flop across all cycles of execution.
4.2 Improved Safety Analysis Framework
The proposed safety evaluation framework is shown in Figure 4.1. Two copies of
the circuit are created. A fault is injected in one copy by conditionally flipping the input
of each of its flip-flops. This is achieved by XOR-ing the inputs to the flip-flops with an
additional signal driven as a primary input (denoted error_input). The number of such
error_input ports is equal to the number of flip-flops. Incisive Enterprise Verifier (IEV)
from Cadence [93] is used for the safety evaluation. IEV performs this injection
exhaustively, i.e. it endeavours to inject a pulse at any one of the error_input ports over
any one cycle of the evaluation, causing the safety property to be verified exhaustively.
This is in contrast to simulation, wherein such an analysis would require 2*N*M runs,
where N is the number of flip-flops and M is the number of cycles of evaluation.
We now explain how the circuit and its inputs and outputs are represented in IEV
for safety analysis. Input constraints and sequences are modelled as an FSM. It represents
the conditions required for workload execution by emulating functional scenarios
corresponding to register settings (constraints), periodic register updates, relation
between register settings, etc. The output representation includes tolerances in the range
of output values and the duration for which outputs can be in error. These are captured as
assertions against which the circuit robustness is verified. The error_input ports used for
fault injection must satisfy the following properties during safety evaluation: (i) only one
error input is modified at a time; (ii) this error input pulses only once during the
evaluation.

Figure 4.1. Illustration of FV based safety analysis.
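The exhaustive injection can be mimicked in simulation for a toy circuit; a minimal sketch (a 4-bit accumulator with a ±4 value tolerance), purely illustrative and unrelated to IEV's internal algorithms: every flip-flop bit is flipped at every cycle, and a bit is marked safe only if no injection ever drives the output outside the tolerance band.

```python
def run(inputs, flip_bit=None, flip_cycle=None):
    """4-bit accumulator; optionally flip one state bit (SEU) at one cycle."""
    state, trace = 0, []
    for cycle, x in enumerate(inputs):
        state = (state + x) & 0xF           # next-state logic (mod-16 add)
        if cycle == flip_cycle:             # SEU: XOR at the flip-flop input
            state ^= (1 << flip_bit)
        trace.append(state)
    return trace

def safe_bits(inputs, tol=4):
    """A bit is safe if, for every injection cycle, the injected output
    stays within +/-tol of the golden output (value tolerance)."""
    golden = run(inputs)
    safe = []
    for bit in range(4):
        ok = all(
            all(abs(g - i) <= tol for g, i in zip(golden, run(inputs, bit, c)))
            for c in range(len(inputs))     # exhaustive over injection cycles
        )
        if ok:
            safe.append(bit)
    return safe

print(safe_bits([1, 2, 3, 1]))  # → [0, 1, 2]
```

Even this toy exhaustive loop is N×M simulation runs; the formal tool instead proves the property over all injection choices symbolically.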
A sample IEV property file for SEU analysis is shown in Figure 4.2. The property
specification for flip-flop number 64 (from amongst 66, numbered from 0 to 65) is
shown. Part 1 indicates that all the bits except the 63rd bit of the error_input are
constrained to zero. Part 2 indicates that once the value of each bit in the error_input is
high, it must go low in the next cycle and remain low throughout. Part 1 and Part 2
together ensure that the 63rd bit of the error_input will be high for exactly one cycle
during the entire formal verification run. Once a bit in the error_input is made high for
one cycle, the flip-flop corresponding to that bit will have its output flipped. This
emulates the behaviour of SEUs in flip-flops. Clocks driving the flip-flops must be
enabled to allow the flip-flop to capture the toggled value at the error input. IEV injects
the faults on
the cycle in which the clock is enabled so as to cause the property to fail. Part 3 indicates
the specification of the tolerance property. The tolerance can be specified in terms of
time (the outputs of the reference and injected modules can differ for a certain time
duration), as indicated in Part 3(i); in terms of value (the outputs of the reference and
injected modules can differ in each cycle by a certain value), as indicated in Part 3(ii); or
as a combination of both. Part 3(iii) indicates the tolerance specification where the PWM
output can be wrong for two PWM pulses and needs to be correct afterwards.

Part 1 // Constraining inputs except for pin 64
const -add -pin error_input[62:0] 0
const -add -pin error_input[65:64] 0

Part 2 // Create a single pulse for error input of pin 64
generate
for (ip = 0; ip <= 65; ip = ip + 1) begin : generic
  fall_all_zero: assume always (fell(error_input[ip]) -> always (error_input[ip] == 0))
    @(posedge clk);
  rise_edge_zero: assume always (error_input[ip] -> next (!error_input[ip]))
    @(posedge clk);
end
endgenerate

Part 3 // Properties to be checked
(i) output_always_equal: assert property ( @(posedge clk) $rose(out_compare) |=>
    ##[0:7] $fell(out_compare));
(ii) output_always_equal: assert always (((out_inject <= out_golden + 4) &&
    (out_inject >= out_golden - 4))) @(posedge clock);
(iii) {fell(onehot(error_input)); rose(sig_pwma_sync_G); rose(sig_pwma_sync_G)} |=>
    (sig_pwma_sync_G == sig_pwma_sync_I) [*] @(posedge clk);

Figure 4.2. IEV property specification.
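A trace-level analogue of the time tolerance assertion in Part 3(i) can be sketched in a few lines; a minimal illustration (not the SVA semantics themselves), where a mismatch between golden and injected outputs is tolerated as long as no mismatch run exceeds the allowed number of cycles:

```python
def time_tolerance_holds(golden, injected, max_err_cycles=8):
    """Trace-level analogue of the Part 3(i) assertion: any run of
    cycles where the two outputs differ must end within max_err_cycles."""
    run = 0
    for g, i in zip(golden, injected):
        run = run + 1 if g != i else 0   # length of current mismatch run
        if run > max_err_cycles:
            return False                 # error persisted too long
    return True

print(time_tolerance_holds([1] * 10, [1, 1, 2, 2, 2, 1, 1, 1, 1, 1]))  # True
```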
We now illustrate the application of our method to a few circuits. We begin with a
set of ITC benchmark circuits [94] and then extend the analysis to two industrial circuits
[76,75].
4.3 Analysis on Benchmark Circuits
The proposed framework is first used on various ITC benchmark circuits [94].
Output value tolerances of ±2 and ±4 are considered. The tolerance values for the
benchmark circuits are specified to illustrate the application of the proposed method and
do not depend on the circuit functionality. The number of flip-flops identified as safe
under these error conditions is evaluated. Flip-flop savings is calculated by comparing
the number of safe flip-flops thus obtained against the total number of flip-flops. (This
indicates that there is no cost of meeting safety for these flip-flops.) The results on five
ITC benchmark circuits are shown in Table 4.1. The percentage of flip-flops marked as
safe and the evaluation time for the entire circuit for two conditions of output value
tolerance (±2 and ±4 over the correct output) are also indicated.
Table 4.1. Safety analysis on benchmark circuits.

Circuit | #flip-flops | Tolerance = ±2                  | Tolerance = ±4
        |             | #safe FF | % Savings | Time (s) | #safe FF | % Savings | Time (s)
b03     | 30          | 7        | 23.33     | 53       | 12       | 40.00     | 56
b04     | 66          | 4        | 6.06      | 192      | 6        | 9.09      | 182
b08     | 21          | 4        | 19.05     | 33       | 6        | 28.57     | 32
b10     | 17          | 2        | 11.76     | 36       | 3        | 17.65     | 31
b11     | 31          | 3        | 9.68      | 64       | 4        | 12.90     | 69
Average |             |          | 13.98     |          |          | 21.64     |
4.4 Analysis on Industrial Modules
The proposed framework is then applied to two industrial modules, PWM and
ECAP in the system context described in Figure 4.3. Here, the variation in magnetic field
of the rotor is measured using a Hall Effect sensor. The output of Hall Effect sensor is fed
to the ECAP module which decodes the motor speed. The reference value for the motor
speed is set using software. The PID controller adjusts the PWM pulse width based on the
difference between the observed and reference motor speed values.
The PWM module has different operating modes like up count mode, down count
mode, asynchronous trip protection configuration, dead band configuration, etc. Similarly
the ECAP module has different operating modes like single-shot mode, continuous mode,
edge polarity selection, etc. These are one time configuration parameters and are
configured in the beginning. Once the configuration is over, the module starts its closed
loop operation. The periodic operation of measuring the motor speed, calculating the new
PWM pulse width value and updating the new value is typically called one period of the
closed loop operation. An input model has been constructed to capture various input
constraints and sequences for the PWM corresponding to the operating modes (up mode,
down mode, up-down mode), input range (pulse duration, pulse width) and relationship
between the inputs (pulse width and pulse period). The ECAP module provides an
interrupt to the PID controller once the sensor decode is complete. An FSM is created for
the ECAP module to periodically read the register contents from both the injected and
reference modules, based on the interrupt. The register values thus obtained from the
injected and reference modules are compared to determine the error condition.

Figure 4.3. Reference application.
The lifetime analysis of PWM and ECAP module flip-flops indicates that they can
be categorised into three groups, namely: (i) flip-flops which are updated only once
during application initialization; (ii) flip-flops which are updated once during every
period of closed loop operation; (iii) flip-flops forming the datapath, which are updated
many times during one period of closed loop operation.
A few knobs have been used to manage the complexity of the analysis. (a) Only one error occurs during the evaluation window. (b) Proper modelling of inputs and outputs to contain the state space explosion problem. The improvement with input modelling is supported by the fact that analysis of a similar sized ITC benchmark circuit (e.g. b12 with 121 flip-flops) with unconstrained input could not complete. (c) The lifetime analysis of the flip-flop values illustrates that any SEU will manifest as an output deviation in the first period of closed loop operation and will either get corrected from the second period onwards or will continue to cause errors. These optimization knobs helped to perform standalone analysis of these modules and reduce the sequential depth of the safety properties.
A reference property thus modelled for PWM after application of these optimization
knobs is shown in Part 3(iii) of Figure 4.2. The property asserts the requirement for the injected PWM output to be the same as the reference PWM output in the second period of closed loop operation following the fault injection.
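The intent of this property can be mimicked at the trace level. The sketch below is an illustrative simplification of the formal property: the cycle-accurate output traces, the fixed period length and the alignment of the injection cycle to a period boundary are all assumptions.

```python
# Sketch: after a fault injection, a deviation of the injected PWM output
# from the reference output is tolerated only during the first closed loop
# period following the injection; from the second period onwards the two
# outputs must match cycle by cycle.

def property_holds(ref_trace, inj_trace, period_len, inject_cycle):
    check_from = inject_cycle + period_len   # start of the second period after injection
    return all(r == i for r, i in zip(ref_trace[check_from:], inj_trace[check_from:]))
```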
The results thus obtained for PWM and ECAP modules are shown in Table 4.2.
The percentage of flip-flops which are marked as safe and the evaluation time for the entire circuit are indicated.

Table 4.2. Safety analysis on industrial modules.

Circuit  Mode     #FF  #safe FF  Time (S)  % Savings  Total safe FF  % Total savings
PWM      up       56   24        3132      42.86      18             32.14
PWM      down     56   24        12787     42.86
PWM      up-down  56   20        90526     35.71
ECAP     int      202  173       165338    85.64      173            85.64

Note that the PWM module analysis had to be split into three
operating modes to manage the complexity. The number of safe flip-flops for the PWM
module is arrived at by doing an intersection across the three operating modes. The
savings are 32% and 85% for PWM and ECAP modules respectively.
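The per-mode intersection used to arrive at the PWM count can be expressed directly over sets of safe flip-flops; the flip-flop names below are hypothetical placeholders.

```python
# Sketch: a flip-flop is safe for the module only if it is safe in every
# operating mode, i.e. the intersection of the per-mode safe sets.

def safe_across_modes(per_mode_safe_sets):
    return set.intersection(*per_mode_safe_sets)
```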
The approximation used for ECAP and PWM (all flip-flops having a wrong value will either be corrected in the first period of closed loop operation following the fault injection or will remain in error forever) helped to reduce the analysis complexity. However, this approximation cannot be generalized and applied to other circuits. In its absence, we found that the proposed approach is not scalable to large circuits. We tried different
techniques like design partitioning, transforming system level properties to module level
properties etc. to reduce the analysis complexity and make the formal verification based
safety analysis approach tractable. We were only partly successful. Due to the inherent
limitation of design space explosion with formal verification based approaches, we
decided to focus on simulation based techniques to augment formal verification based
approach for handling large circuits.
4.5 Conclusion
In this chapter, we have proposed improved methods for accurate safety analysis of
real-life systems. Some of the limitations of earlier methods have been overcome and
new capabilities have been created. A system view has been used to perform a systematic
analysis, taking into account application specific workloads and error tolerances. The
physical system behavior is also included and it is shown how the safety budget for individual modules can be suitably set to obtain a more accurate (less pessimistic) reliability assessment, and hence incur a lower design cost for robustness. A formal
verification framework has been used and suitable properties which account for
application specific behaviors are illustrated. Experimental results are provided on
benchmark circuits and two industrial circuits. They indicate that the right inclusion of
the application tolerance specification allows for a more accurate analysis, together with a less costly design. As examples, the number of flip-flops required to be hardened was reduced by an average of 13% to 21% for the ITC benchmark circuits as against the
61
classical approach of 100% hardening. For the superset application scenarios considered
with PWM and ECAP modules, the number of such flip-flops to be hardened reduced by
32% and 85% respectively.
5. Improved Fault Injection Based Safety Analysis Approaches
As discussed in Section 2.5, traditional fault injection based functional safety
evaluation has two major limitations. (i) Application level tolerance due to interaction
with the physical system is not accounted for. (ii) Comprehensiveness of the application
workloads selected for functional safety evaluation is not guaranteed [64]. The limitation
in (i) causes the analysis to be pessimistic, while that in (ii) causes it to be optimistic.
Incorporating application level tolerance in safety evaluation, as discussed in Chapter 3, will help address (i). The proposal to use formal approaches to address (ii) and make the analysis comprehensive is only partially successful, as it fails to handle very large circuits.
In order to address the limitations, we propose a new workload augmentation method
based upon judicious perturbations of a given workload. These perturbations are carried
out using systematic analysis of a given workload to identify the input parameters and
data variables, and application specific understanding to determine the impact on
performance and safety due to these perturbations.
The main contributions of this chapter are: (i) Detailed evaluation is performed on
two representative systems to identify limitations with the existing approaches. (ii)
Different experiments are performed with alternate workloads to profile the increase in
the number of identified critical flip-flops. (iii) A method is proposed for systematic
perturbation of workloads whereby new workloads are generated iteratively, and are
shown to be effective in detecting additional critical flip-flops. (These are termed derivative workloads.) (iv) This method is evaluated on two examples. In one case,
several closed loop routines for electric motor control and digital power conversion
applications are analysed and perturbed. In the second case, a physical system for inverter
operation driving an AC load using a microcontroller based system is analysed and its
operation similarly perturbed. The results of these experiments indicate that the proposed
perturbation technique is both effective and affordable.
The rest of the chapter is organized as follows. Section 5.1 covers fault injection
based safety analysis approach where the traditional fault injection based approach is
compared with the ideal application based method and proposed divide and conquer
approach (introduced in Section 3.2) to evaluate the benefits and limitations. Section 5.2
performs a detailed analysis of workloads used for fault injection based safety analysis.
Section 5.3 introduces the proposed workload perturbation approach. Experimental
results with the proposed approach are covered in Section 5.4 and Section 5.5 concludes
the chapter.
5.1 Fault Injection Based Safety Analysis Approach
Fault injection [95] is the preferred technique used to ascertain the safety
worthiness (robustness) of the circuit. In order to evaluate the effectiveness of the
proposed divide and conquer method in reducing analysis pessimism and understanding
the limitations of the fault injection approach, three experiments on the BLDC motor
control system and AC inverter system have been performed and their results compared.
Figure 5.1. Safety analysis approaches.
These are indicated in Figure 5.1 and are explained below:
(i) Existing approach: This consists of analysing individual modules in a standalone
way without any tolerance consideration, as is the practice today.
(ii) Ideal approach: This consists of identifying critical flip-flops at the application level using the full system model together with the system level tolerance information. No approximation or tolerance budgeting is employed here.
(iii) Proposed approach: This consists of performing module level analysis with
tolerance allocation, i.e. divide and conquer method described in Section 3.2.2.
5.1.1 Experimental Setup
A hardware emulation setup was chosen as the platform for performing this
comparative study. This has enabled a faster evaluation over simulation methods, and
avoids the requirement for having a co-simulation environment with the physical model
of the motor and the Verilog netlist of the controller. We can use any of the techniques
defined in Section 3.3 for deriving the individual module tolerances.
Fault injection to identify the critical flip-flops is performed using software mutation.

Figure 5.2. Software fault injection flow.

The approach involves creating a fault injection thread in parallel with the
application thread. The modified application thread is shown in Figure 5.2. The fault
injection thread inverts the values stored in the flip-flops, one at a time, at every cycle of
the application execution, across multiple iterations corresponding to the workload. In
this particular evaluation we have restricted the fault injection to memory mapped
registers i.e. flip-flops which are directly accessible using software code. (This was the
level of granularity feasible in the emulation setup used in this experiment). The output (the module output in the case of the existing and proposed approaches, and the system output in the case of the ideal approach) is observed for each fault. If the output drifts beyond the expected value (the golden output in the case of the existing approach, and the acceptable output with tolerance considerations in the case of the ideal and proposed approaches), the flip-flop in which the fault is injected is classified as critical and must be protected.
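The per-register classification step of this flow can be summarised as follows. This is a minimal sketch: `run_iteration` and the numeric output comparison are hypothetical stand-ins for the emulation platform and the output-monitoring specifics, and a zero tolerance reproduces the existing approach while a non-zero tolerance mimics the ideal and proposed ones.

```python
# Sketch of the classification loop of Figure 5.2: flip one memory-mapped
# register at a time, run the workload, and mark the register critical if
# the observed output drifts beyond the acceptable deviation.

def classify_critical(registers, run_iteration, expected_output, tolerance=0.0):
    critical = set()
    for reg in registers:
        observed = run_iteration(inject_into=reg)    # single bit-flip injected in reg
        if abs(observed - expected_output) > tolerance:
            critical.add(reg)                        # deviation exceeds what is tolerated
    return critical
```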
5.1.1.1 BLDC Motor Control System Results
Fault injection is performed on the BLDC motor control system described in
Section 3.3.1.1 to identify critical flip-flops using the three approaches as illustrated in
Figure 5.1. The number of flip-flops thus identified is shown in Figure 5.3. The X axis
denotes the various motor speed settings for the BLDC motor. The Y axis denotes the
number of critical flip-flops identified.

Figure 5.3. Critical elements identified at different operating conditions.

For the motor speed setting of 900 RPM, 190 flip-flops were identified as dangerous in the existing approach, while the numbers were 48
and 50 for the ideal approach and the proposed approach respectively. Similar results
were observed for other speed settings as well.
Since the control must be robust for any operating speed of the motor, the union of
flip-flops identified as critical at different speed settings is considered. 239 flip-flops
were accordingly identified as critical using the existing analysis technique, 54 using the
ideal approach and 55 using the proposed approach. The results thus indicate a 4.3x
reduction in the number of critical flip-flops using the proposed divide and conquer
approach.
In order to compare the detection capability associated with the three approaches,
we performed a detailed analysis of the flip-flops identified as critical using the three
approaches. The results thus obtained are shown in Figure 5.4. We further analyse the
results and make a few observations:
(i) In the absence of any tolerance considerations, the existing approach (most
pessimistic) must identify the maximum number of flip-flops as dangerous. A
subset of these will be identified as critical using the proposed approach with the
mapped application tolerances applied at the module level. This may not be the
minimal set since interactions between modules are not considered. A smaller
subset will be identified as critical using the ideal approach with closed loop
analysis performed using system model together with application tolerances.
Figure 5.4. Critical flip-flops identified using each approach.
The flip-flops identified as critical are mapped into the three main groups in
Figure 5.4.
(ii) Contrary to expectations, the following discrepancies were observed. Six flip-
flops identified as critical in the ideal approach were not identified using the
proposed approach. Three flip-flops identified as critical in the ideal approach
were not so identified even using the existing (most pessimistic) approach.
5.1.1.2 AC Inverter System Results
Fault injection is performed on 384 flip-flops (belonging to the CPU module which
controls PWM) of the AC Inverter system covered in Section 3.3.1.2 to identify the
critical ones using each of the three approaches mentioned in Figure 5.1. The number of
flip-flops thus identified is shown in Figure 5.5. The X axis denotes the various set-points
associated with the AC Inverter. The Y axis denotes the number of critical flip-flops
identified. For the load current setting of 0.15, 79 flip-flops were identified as dangerous
in the existing approach, while the numbers were 32 and 52 for the ideal approach and
the proposed approach respectively. Similar results were observed for other load current
settings as well. Considering all the different operating points, 96, 36 and 53 critical flip-flops were obtained for the traditional approach, the application based approach and the proposed divide and conquer approach respectively. The results thus indicate a 1.8x reduction in the number of
critical flip-flops using the proposed divide and conquer approach.
Figure 5.5. Critical flip-flops identified for inverter application.
In order to compare the detection capability with the three approaches, we
performed a detailed analysis of the flip-flops identified as critical. The results thus
obtained are shown in Figure 5.6. Similar observations can be drawn in the case of AC
Inverter application as that of BLDC motor control application.
(i) Application based (ideal) approach identifies the least number of critical flip-
flops. On the other hand, the traditional approach identifies the highest number
(indicating pessimism in the analysis). 21 additional flip-flops are identified as
critical using both the divide and conquer approach and traditional approach,
while 40 flip-flops are identified as critical using only the traditional approach.
(These numbers indicate the pessimism associated with these two approaches).
(ii) One flip-flop is uniquely identified as critical by application based approach,
which is missed by the other two approaches.
(iii) On the other hand, there are three flip-flops identified by application based and
traditional approaches, which are missed by the divide and conquer approach.
(iv) Overall, four critical flip-flops have escaped detection using the divide and
conquer approach.
Figure 5.6. Critical flip-flops identified in three approaches for inverter application.

5.2 Fault Injection Workload Analysis

Fault injection based functional safety analysis involves ascertaining the circuit behavior in the presence of faults and determining whether the protection mechanisms incorporated in the chip are capable of detecting them. The set of workloads chosen for
safety evaluation plays an important role in ensuring analysis comprehensiveness. We have seen in Section 2.5.1 that it is difficult to get a comprehensive set of workloads for performing fault injection. In its absence we select workloads considering the toggle coverage [22]. Practical considerations of circuit size and simulation time require that an upper bound (e.g. 70%, 90%, 99%) be set on this coverage.
In order to understand this better, we have performed safety evaluation on a
representative module with 25 different workloads satisfying the toggle coverage criteria.
The results thus obtained are shown in Figure 5.7. The X axis denotes the workload
number and the Y axis denotes the number of unique flip-flops identified as critical using every new workload. It can be observed that the number of critical flip-flops identified saturates around the tenth workload, and each additional workload identifies a much smaller number of critical flip-flops. We term these the trailing-end flip-flops. In a typical analysis, we are not sure whether all the critical flip-flops in the trailing end have been detected. In this chapter, we present an affordable way to identify such trailing-end flip-flops.
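The trailing-end behaviour amounts to counting, per workload, the critical flip-flops not seen before; the per-workload sets in the usage comment are made-up illustrations, not the measured data of Figure 5.7.

```python
# Illustrative computation of the curve in Figure 5.7: for an ordered
# sequence of workloads, count how many previously unseen critical
# flip-flops each workload contributes.

def unique_per_workload(critical_sets):
    seen, counts = set(), []
    for s in critical_sets:
        new = s - seen            # flip-flops not identified by any earlier workload
        counts.append(len(new))
        seen |= new
    return counts

# e.g. unique_per_workload([{1, 2, 3}, {2, 3, 4}, {4}, {5}]) yields [3, 1, 0, 1]
```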
Figure 5.7. Number of unique critical flip-flops identified for each workload.
5.3 Workload Perturbation Approach
We analyzed the flip-flops which escape detection in the experiments covered in Sections 5.1.1.1 and 5.1.1.2. These appear as critical flip-flops at the trailing end of the curve in Figure 5.7. A new methodology is proposed for detection of these critical flip-flops using the following three steps:
(i) The flip-flops which escape detection were functional neighbours of the critical
flip-flops which are already identified. Functional neighbourhood consists of the
set of flip-flops which implement / control the same functionality. For example,
the 32 flip-flops storing the value of integration constant (Ki) in the Proportional
Integral (PI) control function form a functional neighbourhood.
(ii) A detailed analysis revealed that small perturbations in the workload are
required to excite the specific conditions for controlling and observing a
particular fault location. These perturbations were mapped to changes in input
parameters and other embedded control parameters. The changes in values thus
obtained were used to set the perturbation range. These additional workloads are
called derivative workloads.
(iii) A derivative workload causes better excitation in certain areas of the design, e.g.
forcing the application behaviour to exceed the permissible tolerance values.
This excites additional flip-flops (corresponding to the trailing end flip-flops in
Figure 5.7).
On the other hand, an application agnostic approach to generating additional
workloads with random perturbations of control / state settings may lead to the generation
and / or excitation of several states and conditions which may not be functionally
relevant. Such an approach is hence pessimistic. A desirable approach would be to
identify the set of acceptable input conditions and internal states which influence the
closed loop operation in the desired way.
An algorithm for such systematic identification and perturbation to generate
derivative workloads is shown in Figure 5.8. The objective function for the Workload
Augmentation algorithm is to maximize the number of critical flip-flops identified. We
first identify the set of acceptable input variables and embedded control parameters. We
perturb the workload within the acceptable parameter variation range and perform
functional safety evaluation. If the generated workload identifies at least one additional
critical flip-flop, it is accepted. The algorithm is iteratively executed to evaluate the
search space. With each succeeding iteration, the number of new flip-flops identified
reduces (corresponding to the trailing end flip-flops in Figure 5.7).
5.4 Experimental Results
In order to evaluate the effectiveness of the proposed approach in identifying
critical flip-flops, experiments are first conducted on a set of control functions and later
repeated on the inverter application.
5.4.1 Control Functions
We selected a representative set of control functions as given in Table 5.1. This
includes a mix of closed loop control applications and related critical computation
functions (which are used in safety critical applications). We determined the set of acceptable input values, the set of acceptable values for embedded control parameters, and the output tolerances. Fault injection experiments were performed to identify the set of critical flip-flops for each of the control functions, and repeated for these variations, using the Workload_Augmentation algorithm in Figure 5.8.

Figure 5.8. Algorithm for workload augmentation.

Workload_Augmentation(S, WL)
    Input: set of design elements S, initial workload WL
    Identify critical elements using workload WL: S_crit = critical(S, WL)
    for iter = 0 through max
        perturb the workload: WL_iter = perturb(WL)
        identify critical elements: S_iter = critical(S, WL_iter)
        if S_crit ∪ S_iter ≠ S_crit, accept the workload
        S_crit = S_crit ∪ S_iter
        if the search is not leading to new workload acceptance, exit()

perturb(WL)
    Identify all the embedded control variables (v) and input parameters (i) in the workload: WL = f(i, v)
    set of acceptable values of v: V = acceptable(v)
    set of acceptable values of i: I = acceptable(i)
    ∀ v′ ∈ V and ∀ i′ ∈ I: neighbor(WL) = f(i′, v′)
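The Workload_Augmentation algorithm of Figure 5.8 can be rendered as a short executable sketch. Here `critical` and `perturb` abstract the fault injection run and the parameter perturbation, whose concrete forms are platform specific, and the stopping heuristic (`patience` consecutive rejected perturbations) is an assumption standing in for "search is not leading to new workload acceptance".

```python
# Sketch of the workload augmentation loop: perturb the initial workload
# repeatedly, keeping only perturbations that expose at least one
# additional critical element of the design element set S.

def workload_augmentation(S, WL, critical, perturb, max_iter=10, patience=3):
    s_crit = critical(S, WL)          # critical elements under the initial workload
    accepted, misses = [WL], 0
    for _ in range(max_iter):
        wl_iter = perturb(WL)
        s_iter = critical(S, wl_iter)
        if not s_iter <= s_crit:      # at least one new critical element found
            accepted.append(wl_iter)
            s_crit |= s_iter
            misses = 0
        else:
            misses += 1
        if misses >= patience:        # search no longer accepting new workloads
            break
    return s_crit, accepted
```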
The results thus obtained are shown in Figure 5.9. For initial evaluation, the
maximum number of iterations was set to six, i.e. the original workload was perturbed
five times. We injected faults on different memory mapped data variables. (This number
varies with the function and is shown in brackets). The cumulative number of critical flip-flops identified for each iteration is shown along the Y axis.

Table 5.1. Control functions used for evaluation.

Function   Purpose
mppt_PnO   Perturb and observe algorithm to extract maximum power.
mppt_incc  Incremental conductance algorithm to extract maximum power.
clarke     Transformation to convert three-phase quantities into two-phase quantities.
iclarke    Transformation to convert two-phase quadrature quantities into three-phase quantities.
park       Transformation to convert stationary reference frame to rotating reference frame.
ipark      Transformation to convert rotating reference frame to stationary reference frame.
2p2z       Two pole two zero digital control algorithm.
3p3z       Three pole three zero digital control algorithm.

Figure 5.9. Variation of critical elements with workload perturbation.

As can be seen, with
successive iterations, no new trailing end flip-flops are identified, and eventually the
workload perturbation stops. (The initial iteration count was set to 6. However, for 3p3z, it was extended to 10 since new critical flip-flops were still being identified in the 6th iteration.)
5.4.2 Inverter Application
For practical workloads, exhaustive iterations using additional workloads
(generated using the algorithm in Figure 5.8) will result in a very large number of
workloads and thus increase the analysis complexity. For the inverter application, the
estimated workloads are in excess of 10,000. We hence propose a more directed perturbation method, based upon knowledge of the application, its inputs and the embedded control variables for closed loop operation, wherein only those variations which can result in the generation of workloads relevant to the application are considered.
While a detailed explanation of the control loop algorithm for the inverter
application is beyond the scope of this section, we can condense the analysis using the
equations below.
y(i) = Up + Ui
     = [Kp × e(i)] + [Ki × e(i) + U(i − 1)]

where y(i), Up, Ui, e(i), Kp and Ki denote the PI function output, the proportional path output, the integral path output, the error input, the proportional path constant and the integration path constant respectively, and U(i − 1) is the integral path output of the previous iteration. In order to control the output y(i), the error term e(i) must be suitably controlled through Kp and Ki.
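The update above can be written as a one-step function whose only state is the previous integral path output; the gain values used in the test are arbitrary illustrative numbers, not the gains of the actual inverter control loop.

```python
# Sketch of the discrete PI update: y(i) = Up + Ui with
# Up = Kp * e(i) and Ui = Ki * e(i) + U(i-1).

def pi_step(e, u_prev, kp, ki):
    up = kp * e              # proportional path output Up
    ui = ki * e + u_prev     # integral path output Ui
    return up + ui, ui       # y(i), and Ui to carry forward as U(i-1)
```

Perturbing kp and ki within their acceptable ranges, as done for workload WL3, changes which parts of the control path get exercised.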
For the inverter application, fault injection was performed with an initial workload
to identify the critical flip-flops using each of the three approaches mentioned in Figure
5.1. The set of critical flip-flops which remain undetected using the divide and conquer
approach (as explained in Section 5.1.1.2) were analysed and the workload was
augmented iteratively. This is indicated in Table 5.2.
The additional number of critical flip-flops identified using the additional workload
is shown in Figure 5.10. (The dotted ellipse indicates the critical flip-flops which escaped
detection for that workload). The following analysis was performed.
Table 5.2. Workload iterations for inverter.

Workload  Change
WL0       Initial workload.
WL1       WL0 updated with time varying (sinusoidal) error.
WL2       WL1 updated to optimize the control loop for open loop scenario.
WL3       WL2 updated to change the integral control parameter of the closed loop.
Figure 5.10. Critical flip-flops identified with perturbed workloads for AC inverter application.
(i) 4 flip-flops escaped detection using the initial workload WL0 (Figure 5.10(a)).
This is because while in the actual application, the control system error value
(which is the input to the control function) in each control loop iteration shows
significant variation due to the physical system behaviour, in the divide and
conquer approach the same workload is executed on the module in isolation with
lesser variation. The module (in this case the CPU on which fault injection is performed and which provides controls to the PWM) can be more comprehensively excited by replacing the constant reference provided to the closed loop control with a continuously changing reference (e.g. a sinusoidal reference). A new workload WL1 is thereby created. The results shown in Figure 5.10(b) indicate
that 2 out of these 4 flip-flops are now detected.
(ii) Further analysis of the 2 unidentified critical flip-flops indicated that the control
loop parameters required for open loop operation (divide and conquer approach)
were different compared to those required for closed loop operation (application
approach). A new workload WL2 was generated by updating the control loop
parameters to recalibrate the output value for the open loop operation. The
results in Figure 5.10(c) indicate that 1 additional flip-flop is now detected.
(iii) The single flip-flop which escaped detection using Workload WL2 was part of
the integrator logic. In a typical PI control loop, the 𝐾𝑝 (proportional) term
impacts the loop gain and helps to reach the optimal performance point faster.
The 𝐾𝑖 (integral) term helps to remove the steady state error in the system. A
new workload WL3 was added with new value for 𝐾𝑖 which can force the
application behaviour to exceed the permissible tolerance values. As shown in
Figure 5.10(d), WL3 detected all the critical flip-flops.
In this particular study, we have used the proposed approach to reduce the number of critical flip-flops which escape detection. This approach can also be utilized to perform a trade-off between the hardware overhead incurred and the reliability gained, by reducing the number of false positives, i.e. the number of non-critical flip-flops which are pessimistically marked as critical using the proposed approach. As can be seen in Figure
5.10, the total number of critical flip-flops identified varies across workloads. This
includes flip-flops not identified as critical in the application approach as well as flip-
flops so identified pessimistically. The number of flip-flops pessimistically identified
using this algorithm is contained to 23 (out of 384, i.e. 5.99%), whereas with random
perturbation, this number is much higher, tending to cover all the flip-flops.
The workload perturbation algorithm in Figure 5.8 is carried out accordingly. Generic recommendations include: (a) Inputs applied and parameter changes performed in open loop analysis must be representative of the actual closed loop scenario. (b) Flip-flops whose criticality depends on their values must be so identified and updated.
5.5 Conclusion
In this chapter, we proposed a perturbation based workload augmentation technique
for performing comprehensive functional safety evaluation. Experiments are performed
on a set of safety critical control functions and an inverter application, and the
effectiveness of the proposed method in identifying additional critical flip-flops is
demonstrated. 12% to 26% additional critical flip-flops are identified. Through these
experiments, we have illustrated how optimisations in safety evaluation methods using
application level tolerance can be traded off with additional workloads to arrive at a more
comprehensive list of critical flip-flops to meet the overall hardware overhead and
reliability requirements. Together, these results indicate that the proposed perturbation
technique is effective in identifying additional critical flip-flops within an affordable overhead of analysis complexity and pessimism.
6. Application Driven Protection Mechanisms
In Chapters 3, 4 and 5, different techniques were proposed to perform
comprehensive functional safety analysis and identify the minimal set of critical flip-
flops which must be suitably managed in the application (e.g. either by preventing the
occurrence of SEUs or detecting SEU occurrence followed by remedial action). In this
chapter, we will analyse the techniques which can be used to protect the identified set of
such critical flip-flops.
The techniques used for protection can be grouped into three categories, namely
hardware, software and application level techniques. Protection using hardware
[96,6,58,57,97] and software techniques [98,99,100,101,102] are well researched topics.
Cross-layer techniques which involve both hardware and software have also been
proposed [103,104]. However, there is not much reported work on utilizing the
application information to protect critical flip-flops. In this chapter, we review
representative hardware and software techniques and propose two new application level
techniques.
6.1 Hardware Based Protection Techniques
Hardware based techniques implement protection using a combination of spatial
and temporal redundancy. They can be implemented at different abstraction levels,
namely at the device level, circuit level and module level. Tradeoffs associated with
implementing the protection at different abstraction levels must be considered during the
IC design and system design stages.
6.1.1 Device Level Techniques
High energy alpha and neutron particle strikes create additional electron-hole pairs
as they pass through the semiconductor device. Depending upon the energy of the
incoming particle, there can be sufficient amount of charge accumulation to invert a
stored logic value leading to a soft error. Device level soft error mitigation techniques
introduce measures, either through design robustness or through additional manufacturing steps, to reduce the impact of alpha and neutron particle strikes. These
techniques either increase the critical charge of state holding element / transistor (i.e.
amount of charge which is required to change the flip-flop state) or reduce the charge
collection, thus reducing the probability of the flip-flop changing its state. (Charge
collection refers to the process by which the excess electron-hole pairs created due to a
particle strike are swept into the source / drain regions instead of recombining and
neutralizing.).
Silicon on Insulator (SOI) is a device level technique deployed for SER mitigation.
This technique introduces a layer of insulator between source / drain and substrate as
shown in Figure 6.1 [96]. The charge collected on account of a particle strike is much
lesser compared to that with the traditional bulk process as the buried oxide layer
prevents charge flow from the substrate to the source and drain. This, in turn, reduces the
probability of the flip-flop losing its state. A similar reduction in SER is observed for
FinFET technologies due to the charge dissipation in the substrate itself before reaching
the source or drain [105].
6.1.2 Circuit Level Techniques
Circuit level soft error mitigation methodologies use a combination of transistor
and logic gate / flip-flop design techniques to build components which are hardened or
tolerant to the effects of radiation. A few common examples are listed below.
Figure 6.1. SOI transistor.
Dual Interlocked storage Cell (DICE) [6] is a composite flip-flop which provides
protection from single event upsets, by using spatial redundancy to store its value as a
pair of elements, each element of which has a set of complementary values. Refer to
Figure 6.2. If the flip-flop's state changes due to a particle strike, it can be restored using the available redundancy. Various optimizations to DICE flip-flops have been proposed to
reduce the area and power overhead [106].
Built in Soft Error Resilience (BISER) [58] is another technique used for protecting
latches and flip-flops. A BISER flip-flop consists of two flip-flops joined with a C-
element [107] as shown in Figure 6.3. In the fault-free condition, both the flip-flops have
the same value and the C element provides the inverted value. Upon a particle strike, the
values in the two flip-flops are opposite. Thereupon, the C element will tristate the
output. The BISER flip-flop will retain the previous value at the output due to the bus
Figure 6.2. DICE flip-flop.
Figure 6.3. BISER flip-flop.
keeper circuit. The two flip-flops will once again take an identical value when they are
updated by the next clock cycle.
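The BISER behaviour can be sketched with a small behavioural model of the C element and bus keeper; this is an illustrative software model of the logic described above, not a circuit description.

```python
def c_element(a, b, keeper):
    """Behavioural model of a C element with a bus keeper.

    When both inputs agree, the output is the inverted common value;
    when they disagree (e.g. after a particle strike upsets one of the
    two flip-flops), the output tristates and the keeper holds the
    previously driven value.
    """
    if a == b:
        return 1 - a          # C element drives the inverted value
    return keeper             # tristate: bus keeper retains old value

# A strike flips one of the two flip-flop copies; the output is unaffected.
q_main, q_shadow = 0, 0
keeper = c_element(q_main, q_shadow, keeper=None)   # fault-free: drives 1
q_main = 1                                          # soft error in one copy
out = c_element(q_main, q_shadow, keeper)           # keeper holds the value
```

On the next clock edge both flip-flops reload the same value and the C element resumes driving, matching the recovery behaviour described in the text.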
There are other circuit level techniques like the use of Razor flip-flop [57], delayed
capture methodology [59], SEU mitigation using error control coding techniques [61],
DF-DICE flip-flop [108], etc., which offer different tradeoffs in terms of implementation
overhead and detection capabilities.
6.1.3 Module Level Techniques
Module level redundancy is commonly used for implementing protection against
soft errors. In this approach, an entire module is replicated, redundant instances are fed
with the same input values and the outputs of redundant modules are continuously
compared. (This method is practically very relevant since it does not require new design
or characterisation of flip-flops and transistors used to build them. It instead uses standard
library components). Dual Core Lock Step (DCLS) architecture [97] is the simplest
example of module level redundancy, wherein two instances of a given module are
operated in tandem. Additionally, considerations like staggered execution in redundant
streams, different physical spacing requirements between the redundant units, etc., are
also incorporated to mitigate effects of common cause failures (i.e. common faults
propagating to both the modules, e.g. through power supply and clock networks, thereby
rendering the checker ineffective). For applications requiring higher availability (e.g.
fault tolerance), triple core lock-step [109] is also deployed.
6.2 Software Based Protection Techniques
Software based protection techniques implement protection in programmable
systems (e.g. those with CPUs) using smart program generation methods [110] that shape
the CPU instructions and code execution for a given application. These techniques
can be classified mainly into three types, namely, (i) control flow checking, (ii)
vulnerability reduction techniques, and (iii) redundancy techniques. These techniques are
briefly described, and the related implementation overheads, protection offered and
suitability from an application context (e.g. data processing intensive vs control
intensive) are explained.
6.2.1 Control Flow Checking
Control flow checking uses assertions to check the program flow sequencing. It can
detect when the execution takes a different path in the presence of a fault by using some
property of the path, e.g. time, signature, end state, etc. A watchdog [111] is a classic
example of a mechanism used for checking gross control flow errors: the time behaviour
of execution along an erroneous path deviates from the profiled behaviour, allowing the
fault to be detected.
More sophisticated control flow checking measures [98,112,113,114] have been
reported for better detection capability. These measures divide the program into sub-
routines, where a sub-routine is a collection of instructions with a unique entry and exit
point. Various sub-routines are connected using arcs to form a full program and each sub-
routine is allocated a unique signature. Additional instructions are added in the
sub-routines to perform control flow checking. If the program flow is not as expected, an
error is flagged. As an illustration, 97% of transient faults in control logic can be
detected using the control flow checking measure described in [113].
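The signature-based scheme can be sketched as follows; the basic blocks, signature values and legal arcs below are invented for illustration.

```python
# Each sub-routine (basic block) has a unique signature; legal arcs define
# the expected control flow graph (signatures and arcs are illustrative).
SIGNATURES = {"entry": 0xA1, "loop": 0xB2, "exit": 0xC3}
LEGAL_ARCS = {0xA1: {0xB2}, 0xB2: {0xB2, 0xC3}, 0xC3: set()}

def check_flow(executed_blocks):
    """Flag an error if the executed block sequence follows an arc
    that is not present in the expected control flow graph."""
    prev = None
    for block in executed_blocks:
        sig = SIGNATURES[block]
        if prev is not None and sig not in LEGAL_ARCS[prev]:
            return False   # control flow error detected
        prev = sig
    return True

ok = check_flow(["entry", "loop", "loop", "exit"])   # expected path
bad = check_flow(["entry", "exit"])                  # faulty branch skips loop
```

In a real implementation the checking instructions are embedded in the sub-routines themselves rather than run as a separate pass.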
These measures detect faults only in the control blocks of the processor which
result in an incorrect call or branch. They are incapable of detecting faults that do not
cause a change in the control flow. In a typical processor, the control logic constitutes
about 10-15% of the entire logic. In the sample fault simulation experiments performed
with the
BLDC motor control and AC inverter application, we have observed that there are many
faults, (e.g. faults in the data path logic), which impact accuracy of data but do not impact
the control flow. Such faults will remain undetected and hence these approaches are not
suitable for the applications considered here.
6.2.2 Vulnerability Reduction Techniques
Algorithm Based Fault Tolerance (ABFT) [99] has been one of the earliest
vulnerability reduction techniques deployed to improve reliability of software programs.
The methodology encodes data in a specific form and the algorithms are designed to
operate on this encoded data and produce encoded output data. The various computations
required for the algorithm are performed in different computation units such that a fault
in any of the units affects only a portion of the data, which can then be detected. The
benefit of this methodology has been demonstrated using a set of matrix operations.
The work in [115] has defined Program Vulnerability Factor (PVF) to demonstrate
the impact of a set of instruction sequences on the dependability of the overall application
in much the same way as Architectural Vulnerability Factor (AVF) [116,117] is used to
demonstrate the impact of architectural and micro-architectural components on the
dependability of the overall application. PVF is a property of the dynamic execution of
the program and helps identify subsets of instruction sequences which are vulnerable to
transient faults. A 20% reduction in vulnerability is observed by reordering of
instructions in the identified vulnerable subset of instruction sequences.
The work in [100] has proposed code re-ordering and critical variable duplication
to reduce the vulnerability due to soft errors. A tool RECCO (Reliable Code Compiler) is
used to map the input source code to a more reliable source code. The tool allows user
configurability to establish tradeoffs between dependability improvement and
performance degradation. The tool assigns a reliability weight to each of the variables
based on the functional dependencies and life-time of the variable. The number of places
a variable is used determines the functional dependency. The duration between a
variable’s creation (i.e. write operation) and last consumption (i.e. read operation) is
called a life period. The sum of life periods for the entire program duration gives the life-
time of the variable. Code re-ordering is performed to reduce the reliability weight of
each variable. Variable duplication is proposed to further reduce the vulnerability. The
vulnerability reduction reported is lower (5-9%) for generic program sequences, however,
for specific program sequences, it is higher (up to 65%).
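The life-time metric can be illustrated as below; the access-trace representation is a hypothetical simplification of what RECCO derives from source analysis.

```python
def variable_lifetime(trace, var):
    """Sum of life periods for `var`: each life period spans a write and
    the last read before the next write. Trace entries are
    (time, op, variable) tuples; this format is illustrative."""
    lifetime, birth, last_read = 0, None, None
    for t, op, v in trace:
        if v != var:
            continue
        if op == "write":
            if birth is not None and last_read is not None:
                lifetime += last_read - birth   # close previous life period
            birth, last_read = t, None
        elif op == "read":
            last_read = t
    if birth is not None and last_read is not None:
        lifetime += last_read - birth           # close final life period
    return lifetime

trace = [(0, "write", "x"), (4, "read", "x"),
         (6, "write", "x"), (9, "read", "x")]
# life periods: 0..4 and 6..9, so the life-time of x is 7
```

A variable with a long life-time is exposed to soft errors for longer, which is why the tool reorders code to shorten these intervals.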
Though these approaches can be deployed for the applications considered in this
thesis, based on the results already observed (i.e. vulnerability reduction of 5-9% for
generic program sequences), the benefits obtained by using these approaches are restricted
by the ability to identify and recode specific instruction sequences. In addition, these
techniques require instruction sequences to be generated in a given form, which requires
compiler changes. This may not be practical for commercial processor platforms (e.g.
ARM cores).
6.2.3 Software Redundancy Techniques
Software redundancy based protection techniques use a combination of spatial and
temporal methods to reduce vulnerability. One of the first redundancy based techniques
implemented for fault tolerance is N-Version programming [118]. Though originally
proposed for finding systematic faults (e.g. bugs) in the software program, it can also be
used to implement fault tolerance. In this approach, different versions of the program are
created from the same original specification by different teams (individuals or groups of
individuals). A supervisor program is used to compare the output of these different
versions and select the correct ones based on majority voting to proceed to the next stage
of program execution. With the increase in software complexity, the effort required for
creating multiple versions has become prohibitive. In addition, new tools have emerged
to improve the quality of software programs, thus reducing the need for the N-Version
programming technique. For these reasons, this technique finds less acceptance
nowadays.
Error Detection by Duplicated Instruction (EDDI) [101] and SoftWare
Implemented Fault Tolerance (SWIFT) [102] techniques implement fault tolerance using
duplicated execution and result comparison for concurrent error detection. The approach
uses different resources for storing variables and duplicated instructions. The duplicated
operations are spaced apart in time. The results of the duplicated operations are compared
before a write to a memory or a branch operation is performed. This approach can detect
transient faults in control logic, data and instruction memory, functional units and
interconnects. High transient fault detection coverage is reported for both the EDDI
(96.2% to 99.2%) and SWIFT (98.05%) techniques.
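The duplicate-and-compare idea behind EDDI can be sketched at source level as follows (EDDI itself works at the instruction level with separate registers and memory locations); the compute function and address below are illustrative.

```python
def eddi_style_store(compute, memory, addr):
    """Run the computation twice on independent resources and compare
    the results before the value is committed to memory; a mismatch
    indicates a transient fault during one of the executions."""
    primary = compute()      # original instruction stream
    shadow = compute()       # duplicated instructions, spaced apart in time
    if primary != shadow:
        raise RuntimeError("transient fault detected before store")
    memory[addr] = primary   # comparison passed: commit the update

memory = {}
eddi_style_store(lambda: 2 * 21, memory, addr=0x100)
```

The comparison is placed before memory updates and branches because those are the points where a corrupted value escapes into architectural state.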
The fault detection limitations of EDDI and SWIFT techniques (e.g. detection
escapes which can happen when a fault happens after comparison and before memory
update, a fault causing a normal instruction to transform into a memory update instruction,
etc.), were addressed by the CompileR Assisted Fault Tolerance (CRAFT) [119]
technique, thereby improving transient fault detection to 99.29%. The CRAFT technique
consumes additional MIPS and hence increases the execution time by 31.4%. PROfile
guided Fault Tolerance (PROFiT) [120] technique addresses the problem of performance
impact by identifying and protecting only the critical sections of the complete software
program, which consists of both critical and non-critical sections, (e.g. an automotive
processor executing critical driver assistance function along with non-critical
infotainment functions).
Practical adoption of hardware based fault tolerance techniques is still limited due
to lack of their availability in application specific SoCs catering to functional safety
requirements, (e.g. Application domain Specific Instruction Set Processor (ASIP) for
functional safety [121]), or due to the prohibitive cost associated with them particularly
when the fault tolerance is required only for a small subset of tasks implemented by the
SoC. Software based techniques are, therefore, used for protecting Commercial Off The
Shelf (COTS) components [122] used in building functional safety systems. However,
there is a significant performance and memory footprint overhead associated with
software techniques, together with limited protection in some cases. In order to address
these issues, we propose methods which utilise application tolerance for incorporating
functional safety.
6.3 Application Based Protection Techniques
Typical closed loop applications consist of periodic execution of different tasks to
perform various functional operations, (e.g. PID control loop, periodic communication,
etc.) and non-functional operations, (e.g. related to safety, security, operating power /
voltage modes, etc.). A small subset of tasks from amongst all the different tasks
executed by the application will be classified as safety critical. To ensure functional
safety for the application, the fault tolerance requirements of safety critical tasks must be
met even if it is at the expense of non-critical tasks. This section describes two
application oriented techniques which can provide fault tolerance.
Consider an SoC executing a motor control application. The motor control
application consists of various tasks of different criticality as given in Table 6.1 (For
simplicity and to make the notation generic, we represent criticality using numbers from
1 to 4; such that a higher number indicates higher criticality). These include obtaining
set-point information (to control motor speed) from the higher level system, motor
control task which controls the motor speed based on the obtained set-point information,
motor control monitoring function to ensure that the speed matches the set-point and
enters into a fail-safe mode if the speed is not within the tolerance range, data-logging
function to periodically save the status information to aid debug, and speed information
to be updated on the display panel. The criticality of each of these tasks is typically
determined based upon factors described in Section 2.2.1.
The different tasks in the application are triggered either by an interrupt or upon
completion of a previous task. The task triggering interrupts occur at varying times / have
varying frequency and have different priorities [123] determined by the application
requirements. In this example, the priority of interrupts is indicated using tags I1, I2 and
Table 6.1. Different tasks executed by motor control application.
No | Task | Task Trigger | Criticality
T1 | Motor control monitoring function | Interrupt I1 | 4
T2 | Motor control | T1 completion | 3
T3 | Periodic communication of set point information | Interrupt I2 | 3
T4 | Speed intimation to display panel | T3 completion | 2
T5 | Data-logging | Interrupt I3 | 1
Figure 6.4. Sequencing of different tasks executed by motor control application.
I3 where I1 is the highest priority interrupt and I3 is the lowest priority interrupt. A
higher priority interrupt can always preempt a lower priority Interrupt Service
Routine (ISR) that is in progress; however, a lower priority interrupt cannot preempt a higher
priority ISR. The sequencing of different tasks of the motor control application in the
time domain is represented in Figure 6.4. We can see that a lower criticality task T5
initiated by a lower priority interrupt I3 is interrupted by the higher priority interrupts I1
and I2.
The criticality of the processor in the SoC and the associated value and time
tolerance at any instant during application execution are determined based on the task the
processor is executing. We augment the notations introduced in [124] to represent this.
We can consider the motor control application to be made up of various tasks 𝑇𝑖. In a
typical application scenario, a particular peripheral will always be part of a task, e.g.
PWM output is always driving the motor, and the CPU bandwidth is time division
multiplexed for the various application tasks. Each of these tasks can be considered to be
running periodically with a period 𝑃𝑖, computation time 𝐶𝑖, permissible time (deadline)
𝐷𝑖 to complete a particular operation and criticality 𝐿𝑖. The value and time tolerance
associated with the task can be represented as 𝑉𝑇𝑖 and 𝑇𝑇𝑖 respectively. This change in
tolerance of the processor of the SoC over time is illustrated in Figure 6.5. These
different tasks have different tolerance requirements and different deadlines. It is possible
to utilize this information to reduce the implementation overhead for protecting the
critical application tasks.
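The task model above can be captured as a simple structure; the numeric periods and tolerances below are illustrative placeholders, loosely following Table 6.1.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    period_ms: float      # P_i
    compute_ms: float     # C_i
    deadline_ms: float    # D_i
    criticality: int      # L_i (1 = lowest, 4 = highest)
    value_tol: float      # VT_i
    time_tol: float       # TT_i, in control loop cycles

# Illustrative values only; a non-critical task has effectively
# unbounded tolerance ({infinity, infinity} in Figure 6.5).
tasks = [
    Task("motor control monitoring", 1.0, 0.2, 1.0, 4, 0.02, 1),
    Task("motor control",            1.0, 0.4, 1.0, 3, 0.05, 2),
    Task("data-logging",            10.0, 0.5, 10.0, 1,
         float("inf"), float("inf")),
]

# Only tasks above a criticality threshold need protection.
critical = [t.name for t in tasks if t.criticality >= 3]
```

The tolerance of the processor at any instant is then the tolerance of the task it is currently executing.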
Figure 6.5. Change in criticality over time.
6.3.1 Critical Flip-flop Reduction by Altering Application
Execution
A typical control system involves sampling the input, processing the input to
determine the required actuation, and then performing the actuation. The number of
times this sequence of sampling, processing and actuation takes place in a second is the
control loop frequency. The control loop frequency is determined based upon one or
more parameters of the physical system, i.e. the motor control system here. In application based
functional safety analysis, we have mapped application tolerance as value tolerance and
time tolerance. The value tolerance is a result of the acceptable set of values around the
control point (driving the actuator) for which the application can behave in an acceptably
correct manner. The time tolerance is a result of inertia of the physical system where a
change in the controller output takes a much larger time to have any perceptible impact
on the physical system that the application is controlling. We expect that a higher number
of repeated executions, (i.e. higher control loop frequency), can help increase both the
value tolerance as well as the time tolerance, thereby rendering fewer number of flip-
flops as critical.
In order to ascertain the impact of the change in control loop frequency on the set
of critical flip-flops identified, we perform evaluation on the same two reference designs:
(a) BLDC motor control and (b) AC inverter circuit. The evaluation is performed as a
two-step process.
(i) Compute the variation in value tolerance and time tolerance (in terms of number
of control loop cycles) with change in control loop frequency.
(ii) Compute the change in the identified number of critical flip-flops with change in
the value tolerance and time tolerance.
6.3.1.1 Computation of Value and Time Tolerance for Different Control Loop Frequencies
For a typical control system, there will be a range of control loop frequencies for
which the system can operate in an acceptably correct manner. Hence for these
experiments, we have limited the control loop frequency changes to lie within this
acceptable range. We have evaluated the variation in value tolerance and time tolerance
with change in control loop frequency.
The control loop frequency can be changed in two ways. (i) Changing the PLL lock
frequency, whereby the whole device operates at a new frequency and the entire set of
operations is performed at that frequency. (ii) Updating
the frequency of interrupt which initiates the control loop operation. (For example, a
timer module generates the periodic interrupt and the interrupt frequency can be changed
by re-configuring the timer module). Updating the interrupt frequency will cause a
change only in the frequency of control loop operations. In case of an increase in the
control loop frequency, the CPU may not have enough bandwidth to process the
additional loop operations. In such a scenario, the operation of some of the less critical
tasks (e.g. datalogging) can be slowed down, (i.e. frequency of processing lowered), to
provide the bandwidth to process the more critical tasks. The tradeoffs associated with
these techniques to increase control loop frequency are illustrated in Table 6.2. (Since the
device is already operating at the maximum frequency and there is no frequency
headroom available for changing the device operating frequency, the period configuration
of the timer module is changed for this experimental evaluation).
The control loop frequency is changed and the time tolerance associated with the
Table 6.2. Tradeoffs associated with changing control loop frequency.
No | Change using timer module period configuration | Change using PLL clock frequency configuration
1 | Device operating frequency remains the same. | Device operating frequency changes. Higher frequency configuration is possible only if the device is rated to operate at the increased frequency.
2 | Change is instantaneous on changing the timer period configuration. | Device PLL must relock at the higher clock frequency. This typically takes a few microseconds.
3 | Only the critical control loop operation takes place at the higher frequency. | All operations in the device take place at the higher frequency.
4 | Timing of all other cyclic operations must be updated to accommodate the higher bandwidth required for the critical control loop operation. | There is almost no change required in the system configuration.
application is determined. Since value tolerance indicates the impact of a change in
value over an infinitely long time, it will not change with the control loop frequency
(refer to Section 3.2.1). The time tolerance values are determined as given in Section
3.3.1. The time tolerance, expressed as the number of control loop iterations, increases
with increase in control loop frequency.
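The relation between an absolute time tolerance and the tolerance expressed in loop cycles can be sketched as follows; the tolerance window and frequencies are illustrative numbers, not measured values from the reference designs.

```python
def time_tolerance_cycles(tolerance_s, loop_freq_hz):
    """Express the physical-system time tolerance as a whole number of
    control loop iterations: a faster loop fits more corrective
    iterations into the same absolute tolerance window."""
    return int(tolerance_s * loop_freq_hz)

# The same 5 ms tolerance window spans more control loop iterations
# at 40 kHz than at 10 kHz, giving more chances to correct a fault.
low = time_tolerance_cycles(0.005, 10_000)
high = time_tolerance_cycles(0.005, 40_000)
```

This is why raising the control loop frequency increases the time tolerance in loop cycles even though the physical tolerance window is unchanged.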
The computed time tolerance values for the BLDC motor control system for
different control loop frequencies and different motor speeds are shown in Figure 6.6. In
this figure, the X axis denotes the various control loop frequencies in kHz and the Y axis
denotes the time tolerance in number of control loop cycles. The various lines in the plot
indicate the time tolerance values for different motor speeds (in rpm).
Figure 6.6. Variation of time tolerance with control loop frequency for BLDC motor.
Figure 6.7. Variation of time tolerance with control loop frequency for AC inverter.
The computed time tolerance values for the AC inverter application at different
control loop frequencies are shown in Figure 6.7. The different lines in the graph
correspond to the different input current values at which the AC inverter is operating. (In
the figure, the range of input current values from 0 A – 30A is represented over a range
from 0 to 1).
6.3.1.2 Computation of Number of Critical Flip-flops
In Section 3.3.1, we showed that application information can be used for reducing
the number of critical flip-flops and thus the hardware overhead incurred for “safeing”
the application. In Section 6.3.1.1, we showed that the tolerance value increases with
increase in control loop frequency. In this sub-section, we evaluate whether it is possible
to further optimize the hardware overhead by using the additional tolerance gained by
virtue of running the control loop at a higher frequency.
The time tolerance and value tolerance thus obtained from the system level
evaluation are used to perform the divide and conquer analysis as shown in Figure 6.8. In
this analysis, the system is configured to execute the control loop at the targeted
frequency. The value tolerance and time tolerance are configured in the fault injection
set-up. The critical flip-flops are identified for each targeted time tolerance value.
Figure 6.8. Critical flip-flop identification for different time tolerance values.
The critical flip-flops thus obtained for BLDC motor control application are shown
in Figure 6.9. We have determined the number of critical flip-flops for several different
discrete time tolerance values from 0 to 100. A significant reduction (from 55 to 30, i.e.
45%) in the number of critical flip-flops is seen as we increase the time tolerance from 0
to 1. Thereafter, the number of critical flip-flops does not reduce further with increase in
time tolerance.
We repeated the experiment for the AC inverter application. The results are plotted
in Figure 6.10. Here the critical flip-flop count reduces more significantly with increase
Figure 6.9. Variation in the number of critical flip-flops with time tolerance (in number
of control loop cycles) for BLDC motor control application.
Figure 6.10. Variation in the number of critical flip-flops with time tolerance (in number
of control loop cycles) for AC inverter application.
in time tolerance. It reduces from 59 for a time tolerance of zero to 48 for a time
tolerance of 30 control loop cycles and further to 44 for a time tolerance of 100 control
loop cycles. By increasing the time tolerance from 20 to 40, we observed a 6% (51 to 48)
reduction in the number of critical flip-flops.
6.3.1.3 Observations for BLDC Motor Control and AC Inverter System
We observe a different behaviour for the variation in critical flip-flop count with
variation in control loop frequency for the BLDC motor control and AC inverter
applications. In the case of the BLDC motor control application, the number of critical
flip-flops does not reduce for time tolerance values beyond one. However, for the AC
inverter application, the number of critical flip-flops continues to reduce with increase in
the control loop frequency.
Figure 6.11 indicates the difference in execution with two different control loop
frequencies in the presence of a fault. Figure 6.11(a) profiles the application with a lower
Figure 6.11. Execution variation with different control loop frequencies: (a) execution
with a lower control loop frequency; (b) execution with a higher control loop frequency.
(Each panel shows the CPU output over time, with the fault instant and the value and
time tolerance windows marked.)
control loop frequency and Figure 6.11(b) profiles the same application with a higher
control loop frequency. It can be observed that a higher control loop frequency helps
correct the error in the system before the time tolerance interval elapses, thereby making the
system operate in an acceptably correct manner. This behaviour depends on the function
influenced by the flip-flop on which the fault is injected.
The difference in behaviours for the control function implemented by BLDC motor
control and AC inverter system is now explained. The Proportional Integral (PI) function
implemented by the controllers (shown in Figure 6.12) can be denoted as:
𝑦(𝑖) = 𝑈𝑝 + 𝑈𝑖 = [𝐾𝑝 ∗ 𝑒(𝑖)] + [𝐾𝑖 ∗ 𝑒(𝑖) + 𝑈(𝑖 − 1)]
where 𝑦(𝑖), 𝑈𝑝, 𝑈𝑖 and 𝑒(𝑖) denote the PI function output, proportional path output,
integral path output and error input respectively, 𝐾𝑝 and 𝐾𝑖 are the proportional and
integral path constants, and 𝑈(𝑖 − 1) is the integral path output of the previous iteration.
In order to control the output 𝑦(𝑖), the error term 𝑒(𝑖) must be suitably weighted by 𝐾𝑝
and 𝐾𝑖. [𝐾𝑝 ∗ 𝑒(𝑖)] forms the proportional path and [𝐾𝑖 ∗ 𝑒(𝑖) + 𝑈(𝑖 − 1)] forms the
integral path of the controller.
A fault in one of the flip-flops implementing the proportional path is immediately
visible at the function output. This will also get corrected in the next iteration. There is no
storage (memory) of values in the proportional path. However, we notice that the integral
path has inherent memory as it keeps on accumulating the results of previous iterations.
A fault in one of the flip-flops implementing the integral path will take time to get
corrected, i.e. get updated to its new value. Such systems which have inherent memory
Figure 6.12. PI function implemented by the control system.
will benefit from increasing the control loop frequency. For the BLDC motor control
application, the integration parameter (𝐾𝑖) used in the PI control is very close to zero and
the control loop operates very close to a proportional control system. Hence, changing
the control loop operating frequency does not reduce the number of critical flip-flops.
We therefore conclude that the proposed approach of increasing the control loop
frequency to render fewer flip-flops as critical will be beneficial when the application has
inherent memory. A given application must be analyzed accordingly. The higher the
inherent memory, the greater the benefits associated with the proposed approach.
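The contrasting fault behaviour of the two paths can be illustrated with a minimal discrete PI iteration; the gains and the injected error magnitude below are invented for illustration.

```python
def pi_step(e, state, kp, ki):
    """One PI iteration: y(i) = Kp*e(i) + [Ki*e(i) + U(i-1)].
    The integral path accumulates state (inherent memory); the
    proportional path is memoryless."""
    ui = ki * e + state
    return kp * e + ui, ui

# Fault-free iteration.
ui = 0.0
y, ui = pi_step(1.0, ui, kp=0.5, ki=0.1)

# A soft error in the integral state persists in U(i-1): after the
# next clean iteration the output still carries the injected error,
# whereas a corrupted proportional term would vanish in one step.
ui_faulty = ui + 8.0
y2, ui2 = pi_step(1.0, ui_faulty, kp=0.5, ki=0.1)
```

Running more iterations within the same tolerance window gives the accumulated error more opportunities to be driven out by feedback, which is why applications with inherent memory benefit from a higher control loop frequency.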
6.3.2 Detection of Critical Flip-flops by Selective Redundant
Execution
In this section, we propose a new technique to detect faults in critical flip-flops
(which are identified as part of the application based functional safety evaluation
described in Chapters 3, 4 and 5), by selective redundant execution of critical portions of
the application. Unlike other software based fault tolerant approaches covered in Section
6.2, which attempt to protect the entire application, the proposed optimization detects
faults only in the critical flip-flops, thereby reducing the implementation overhead.
proposed approach is further extended to provide system recovery on detection of a fault.
6.3.2.1 Selective Redundant Execution
In the proposed approach, the application is first evaluated to identify the safety
critical tasks. Application based functional safety evaluation is then performed on these
critical tasks to identify critical flip-flops. Once identified, dual streams of execution
operate on independent sets of such critical flip-flops. Once execution is complete, the
two sets of critical flip-flops are compared before a memory update or before a control
flow operation (e.g. call, branch, etc.). This will help detect any soft errors which occur
during the execution. The flow chart for proposed approach is shown in Figure 6.13.
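A minimal sketch of the proposed selective redundant execution, modelling the critical flip-flops as duplicated program variables updated in two independent streams; the step function and memory map are illustrative.

```python
def run_critical_section(step, state_a, state_b, memory, addr):
    """Execute the critical code on two independent copies of the
    critical state and compare them before the memory update; a
    mismatch indicates a soft error in a critical flip-flop."""
    out_a = step(state_a)    # redundant stream 1
    out_b = step(state_b)    # redundant stream 2
    if out_a != out_b:
        return False         # indicate error to the external world
    memory[addr] = out_a     # comparison passed: commit the update
    return True

memory = {}
ok = run_critical_section(lambda s: s + 1, 10, 10, memory, "speed")
faulty = run_critical_section(lambda s: s + 1, 10, 11, memory, "x")
```

In the same way, the comparison is also performed before control flow operations (calls, branches) so that a corrupted value cannot divert execution undetected.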
Compared to a non-fault tolerant implementation, the proposed approach will lead
to (i) increase in memory footprint due to the additional storage required for saving the
redundant code and redundant variables, and (ii) increase in MIPS required since the
critical portion of the code must be redundantly executed. However, overheads associated
with the proposed approach are lower than those with duplication as implemented by
EDDI approach [101]. In order to evaluate the overheads associated with the proposed
approach, we perform the evaluation on a set illustrative of control functions. The
Figure 6.13. Selective redundant execution (flow chart: critical flip-flops are duplicated
and updated in two independent threads; redundant values are compared before every
control flow operation and memory update, and an error is indicated to the external
world on a miscompare).
Figure 6.14. Memory and MIPS overhead reduction for selective redundant execution
approach as compared to EDDI.
benefits of the proposed approach (memory overhead reduction and MIPS savings) are
compared with EDDI (having the same error tolerance) in Figure 6.14. It can be seen that
both the memory and MIPS overhead reductions compared to the EDDI approach [101]
vary in the range of 5% to 41%.
In the case of a complete application as shown in Figure 6.4, the proposed approach
will be applied only for the safety critical threads. Additional memory and MIPS
overhead for protection will be incurred only for them, as compared to the application
agnostic duplication employed in the EDDI approach.
6.3.2.2 System Recovery on Identification of an Error
The proposed selective redundant execution will help detect an error. The system
will enter into a fail-safe mode [125] thereupon. For many of the functional safety
applications like aircraft autopilot, autonomous driving, etc., availability is an equally
important requirement along with functional safety. Such systems cannot stop on
detection of a fault and have to be fail-operational [126]. Traditional fault
tolerant architectures for high availability hold / store the critical variables with a
redundancy of three. Safety critical application threads are executed with a redundancy of
three and majority voting is performed using the outputs from the three execution streams
to determine the final output. This implementation results in a threefold increase in
memory and MIPS overhead. In this section, we extend the selective redundant execution
approach to meet the fail-operational requirements, but at a much lower overhead.
Figure 6.15. Selective redundant execution for error recovery. [timeline: redundant threads RT-1 and RT-2 execute and a software voting and check-pointing thread (V) compares their outputs; on a voting pass, execution continues; on a voting fail, the third redundant thread RT-3 is executed, majority voting identifies the correct output, and an error recovery thread (R) recovers the erroneous thread]
The proposed selective redundant execution for error recovery is shown in Figure
6.15. The various steps involved in this method are listed below.
(i) Three different copies of the variables are saved (check-pointed) at
predetermined phases of the application.
(ii) Safety critical application threads are executed with a redundancy of two.
(iii) Outputs of the two execution streams are compared. If the outputs match, critical
variables are check-pointed with a redundancy of three.
(iv) If the outputs do not match, the third redundant thread is executed and majority
voting is performed to identify the correct output. Once the correct output is
ascertained, the states (flip-flops) leading to the correct output are identified and
check-pointed.
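Steps (i) to (iv) above can be sketched as follows. This is a minimal single-threaded illustration; state_t, checkpoint3, duplex_step and the fault-injecting demo thread are hypothetical names, and a real implementation would run the redundant streams as separate threads and checkpoint at predetermined application phases.

```c
#include <stdint.h>
#include <string.h>

/* Assumes at most one faulty execution stream per step. */
#define N_VARS 4

typedef struct { int32_t vars[N_VARS]; } state_t;

/* Step (i): three check-pointed copies of the critical variables. */
static state_t ckpt[3];

static void checkpoint3(const state_t *s) {
    ckpt[0] = *s; ckpt[1] = *s; ckpt[2] = *s;
}

/* Fault injection knobs for the demo thread below. */
static int call_count = 0;
static int fault_on_call = -1;

static void demo_thread(const state_t *in, state_t *out) {
    for (int i = 0; i < N_VARS; i++) out->vars[i] = in->vars[i] + 1;
    if (call_count++ == fault_on_call) out->vars[0] ^= 1; /* bit flip */
}

/* Steps (ii)-(iv): execute with a redundancy of two, fall back to a
 * third run and 2-of-3 voting on a miscompare.
 * Returns 1 if the third thread (RT-3) was needed, 0 otherwise. */
static int duplex_step(state_t *out,
                       void (*run)(const state_t *, state_t *)) {
    state_t r1, r2, r3;
    run(&ckpt[0], &r1);
    run(&ckpt[1], &r2);
    if (memcmp(&r1, &r2, sizeof r1) == 0) {
        *out = r1;
        checkpoint3(out);              /* outputs match: re-checkpoint */
        return 0;
    }
    run(&ckpt[2], &r3);                /* third redundant thread */
    *out = (memcmp(&r1, &r3, sizeof r1) == 0) ? r1 : r3; /* majority */
    checkpoint3(out);                  /* checkpoint the voted output */
    return 1;
}
```

Executing the third thread only on a miscompare is the key design choice: in error-free operation the redundancy stays at two, which is the source of the overhead savings over TMR.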
The comparison of the proposed approach with the traditional approach (triple modular redundancy implemented in software) is shown in Table 6.3.
In order to evaluate the benefits of the proposed approach, we performed the system recovery implementation on the set of illustrative control functions. The memory and MIPS overhead savings for a no-error scenario (with no RT-3 as indicated in Figure 6.15) when compared to the traditional Triple Modular Redundancy (TMR) implementation in software (having the same error tolerance) are indicated in Figure 6.16. As can be noted from the figure, the memory overhead savings vary from 7% to 54% depending on the application. Similarly, the MIPS overhead savings vary from 36% to 61% depending upon the application (i.e. the number of flip-flops identified as critical within each control function). Upon an error (phase RT-3 indicated in Figure 6.15), the overhead depends on when the fault is detected and how many register values must be restored.

Table 6.3. Comparison of traditional and proposed fault tolerant approaches.

(1) Traditional approach: All variables are saved with a redundancy of three.
    Proposed approach: Only critical variables identified using the application based functional safety evaluation are stored redundantly.
(2) Traditional approach: All application threads (irrespective of whether the thread is safety critical or not) are executed with a redundancy of three.
    Proposed approach: Only safety critical application threads are redundantly executed.
(3) Traditional approach: A redundancy of three is maintained irrespective of the occurrence of an error.
    Proposed approach: A redundancy of two is used in error-free conditions.
6.3.2.3 Practical Considerations in Implementing Redundancy
A typical safety critical application consists of both safety critical and non-safety
critical application threads. A safety critical application thread can be further divided into
various phases where the different phases can be classified as safety critical or not. For
example, a control loop operation may bypass certain computations and hold the previous values if there is no set-point or load change.

Figure 6.16. Memory and MIPS overhead reduction for proposed system recovery approach as compared to TMR implemented in software. [bar chart: memory and MIPS overhead reduction, 0% to 70%, across the illustrative control functions]

Figure 6.17. Protection approaches for safety critical application. [timeline: threads T1 (safety critical), T2 (non-safety critical) and T3 (safety critical) across phases P1 to P9; S marks the safety critical phases; Approach 1 protects registers throughout the application, Approach 2 during safety critical application threads, and Approach 3 only during the safety critical phases of an application thread]

The application safety requirements during
different time segments are represented in Figure 6.17. The different application threads
are shown as T1 – T3 and different phases are shown as P1 – P9. The safety critical
application phases are marked S.
The proposed redundancy based approach (both for fault detection and system
recovery) can be implemented in three ways.
(i) Approach 1: Once the critical flip-flops are identified, they are protected
throughout the application. The protection for the flip-flops remains in effect even during the execution of non-safety critical code.
(ii) Approach 2: The critical flip-flops are protected only during the critical task
execution phase of the application.
(iii) Approach 3: The application is profiled to understand the scenarios under which
a flip-flop is classified as critical. During the application execution, an
independent checker is run alongside to identify whether a certain application
phase is critical or not. Critical flip-flops are protected for the phases in which
the application thread is classified as critical.
Approach 1 is sub-optimal in implementation as the critical registers are
continuously protected irrespective of whether the thread is safety critical or not.
Approach 3 requires the identification of critical flip-flops during application execution.
This would require additional hardware that monitors the application and its input values, and determines whether a flip-flop needs to be protected. This also leads to a significant increase in the computational complexity. Due to the limitations of
Approach 1 and Approach 3, Approach 2 is better suited for a practical implementation.
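Approach 2 can be sketched as a scheduler hook that enables the redundant copy only while a safety critical thread executes. The names used here (on_thread_dispatch, protected_store, protected_check) are illustrative assumptions, not from the thesis.

```c
#include <stdbool.h>
#include <stdint.h>

static bool protection_enabled = false;
static int32_t shadow;   /* redundant copy of one critical variable */

/* Scheduler hook: turn duplication on only while a safety critical
 * application thread is running. */
static void on_thread_dispatch(bool thread_is_safety_critical) {
    protection_enabled = thread_is_safety_critical;
}

/* Store to a critical variable; the shadow copy is maintained only
 * while protection is enabled. */
static void protected_store(int32_t *var, int32_t value) {
    *var = value;
    if (protection_enabled) shadow = value;
}

/* Consistency check before the value is consumed; always passes while
 * protection is disabled (non-safety critical phases). */
static bool protected_check(const int32_t *var) {
    return !protection_enabled || *var == shadow;
}
```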
In addition to the fault tolerance requirements of the critical tasks which are part of
the application, there are requirements with respect to prevention of fault propagation
from a task of lower criticality to that of higher criticality [127,128]. However, such
requirements are not addressed as part of this thesis, as these are considered independent
of the application tolerance.
6.4 Conclusion
This chapter proposed two new application based techniques for robust execution
in the presence of faults in critical components. The first technique is based on changing the application execution (e.g. the control loop frequency) to reduce the number of critical
components. It utilizes the additional time tolerance (measured in terms of number of
control loop cycles) gained as a result of increased control loop frequency to protect the
identified critical flip-flops. For the AC inverter application, we observed 6% reduction
in the number of critical flip-flops when the time tolerance is doubled. However,
experiments on the BLDC motor control application indicated that the approach cannot
be generically applied to all systems. We assessed the conditions under which the
proposed approach can be applied.
The second technique used selective redundant execution for protecting critical components. Experimental results on representative control functions indicate 5% to 41%
reduction in both memory footprint and MIPS overhead when compared to EDDI
approach. An enhancement to aid system recovery in case of detection of a fault is also
proposed. Compared to software based TMR approach, the proposed approach resulted in
7% to 54% reduction in memory footprint and 36% to 61% reduction in MIPS. We also
evaluated how the two proposed approaches can be applied to a complete application
which comprises both safety critical and non-safety critical threads.
7. Conclusions and Future Work
The use of Integrated Circuit (IC) components in end applications continues to rise,
particularly in safety critical applications like automotive, industrial, navigation and
medical. Due to the complexity of functions implemented using semiconductor devices
and the complexity of the device manufacturing process, a semiconductor component can
fail in multiple different ways, thus increasing the risk to the application. This risk is
addressed by having additional mechanisms to enable timely detection of faults before
they can lead to a catastrophic application failure. These safety mechanisms must be
optimal, i.e. must incur low overhead in terms of area, power, application MIPS, etc.,
while ensuring the required levels of safety. However, the methods and techniques
available and deployed today do not address these requirements in an effective manner.
The low overhead solutions are less comprehensive, while more comprehensive solutions
come with significant overhead. This thesis addresses these challenges by presenting a
set of techniques for performing comprehensive functional safety analysis to enable
adequate protection. It also proposes new techniques to offer protection while incurring
lower overheads.
Chapter 2 gave an overview of semiconductor functional safety as practiced in
industry today. It analysed the application level implications of semiconductor failures
with a representative EV Traction application, different functional safety standards
covering the different end applications, and different safety analysis techniques.
Understanding the functional safety compliant IC development process helped to
highlight the dual challenges of comprehensiveness of safety analysis and reducing the
additional overhead incurred due to functional safety.
Chapter 3 introduced a new safety analysis technique whereby the tolerance
available in the safety critical applications is included while performing the safety
analysis, thereby limiting the hardware overhead incurred due to safety. For the safety
analysis to utilize the application tolerance, we proposed a new technique to map the
application tolerance as value tolerance and time tolerance at the IC level and for the
modules inside it. Experiments were performed on two real-life applications, a brushless
DC motor control system and an AC inverter control system. We also showed how the
application level tolerance can be mapped to different modules internal to the IC at
different abstraction levels.
Chapter 4 demonstrated the use of formal techniques to identify critical flip-flops in
the presence of tolerance. It also indicated ways by which application specific behaviours
(including tolerance) can be included in the analysis framework to obtain a more accurate
(less pessimistic) reliability assessment, and hence incur lesser design cost for robustness.
Experimental results were provided on benchmark circuits and two industrial circuits.
Chapter 5 proposed a perturbation based workload augmentation technique for
performing comprehensive functional safety evaluation. Experiments were performed on
a set of safety critical control functions and a real-life application, and the effectiveness of the proposed method in identifying additional critical flip-flops was demonstrated.
Through these experiments, we have illustrated how optimisations in safety evaluation
methods using application level tolerances can be traded off with additional workloads to
arrive at a more comprehensive list of critical flip-flops to meet the overall hardware
overhead and reliability requirements. (This helped identify directed variants of workloads, as opposed to unconstrained search using formal techniques.)
Chapter 6 proposed two new application based techniques for robust execution in
the presence of faults in critical components. The first technique was based on changing
the application execution, (e.g. control loop frequency), to reduce the number of critical
components. The second technique used selective redundant execution for protecting
critical components while still reducing the memory footprint and MIPS overhead.
7.1 Future Work
This thesis has proposed methods for optimal functional safety analysis of ICs,
application driven identification of a minimal number of critical flip-flops, and techniques
to protect the critical flip-flops using application profiling for safety. We consider a few
potential directions to enhance this work.
In this thesis, we have considered Single Event Transient (SET) events leading to
Single Event Upsets (SEUs) for safety analysis. SETs leading to Multiple Bit Upsets
(MBU) and Multiple Event Transients (METs) are not considered. As newer technologies
are getting rapidly adopted into automotive applications with the increase in performance
requirements for ADAS, MBUs and METs are also likely to occur. Hence, in order to
scale the proposed approach for automotive with newer technology semiconductors, these
effects must also be considered.
We have demonstrated the use of Formal Verification (FV) techniques for
functional safety analysis. However, FV techniques in the presence of large workloads
(with and without perturbation) for complex control functions have not been investigated.
Options include partitioning larger circuits and workloads into smaller ones, and abstracting portions of the circuit into behavioural models (with optional use of assertions) to reduce analysis complexity while avoiding state space explosion.
The work in this thesis mainly investigates the control and protection provided by
digital circuits interacting with the physical system. Analog functions, e.g. data
converters, etc. which can also implement safety critical functions are not considered.
Their investigation will require analysis of newer transient conditions and associated faults, apportioning of function tolerance to sub-functions, simulation artefacts for abstracted models, and accuracy versus speed tradeoffs.
The above investigations can build on the contributions of this thesis. The solutions
thus obtained will help in the development of future safety critical applications.
References
[1] D. Lorenz, G. Georgakos, and U. Schlichtmann, “Aging analysis of circuit timing
considering NBTI and HCI,” in International On-Line Testing Symposium, 2009.
[2] R. C. Baumann, “Radiation induced soft errors in advanced semiconductor
technologies,” IEEE Transactions on Device and Materials Reliability, 2005.
[3] M. Alam, “Reliability and process variation aware design of integrated circuits,”
Journal for Microelectronics Reliability, Elsevier, 2008.
[4] R. Mariani and G. Boschi, “A systematic approach for failure modes and effects
analysis of system-on-chips,” in International On-Line Testing Symposium, 2007.
[5] R. Isermann, R. Schwarz, and S. Stolzl, “Fault-tolerant drive-by-wire systems,” IEEE
Control Systems, 2002.
[6] T. Calin, M. Nicolaidis, and R. Velazco, “Upset hardened memory design for
submicron cmos technology,” IEEE Transactions on Nuclear Science, 1996.
[7] V. Prasanth, V. Singh, and R. Parekhji, “Derating based hardware optimizations in
soft error tolerant designs,” in VLSI Test Symposium, 2012.
[8] P. Subramanyan, V. Singh, K. K. Saluja, and E. Larsson, “Multiplexed redundant
execution: A technique for efficient fault tolerance in chip multiprocessors,” in
Design, Automation & Test in Europe, 2010.
[9] R. R. Schaller, “Moore’s law: past, present and future,” IEEE spectrum, 1997.
[10] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R. LeBlanc,
“Design of ion-implanted MOSFET’s with very small physical dimensions,” IEEE
Journal of Solid-State Circuits, 1974.
[11] R. Saleh, S. Wilton, S. Mirabbasi, A. Hu, M. Greenstreet, G. Lemieux, P. P. Pande,
C. Grecu, and A. Ivanov, “System-on-chip: Reuse and integration,” Proceedings of
the IEEE, 2006.
[12] W. Wolf, A. A. Jerraya, and G. Martin, “Multiprocessor system-on-chip
technology,” IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, 2008.
[13] D. Edenfeld, A. B. Kahng, M. Rodgers, and Y. Zorian, “2003 technology roadmap
for semiconductors,” Computer, 2004.
[14] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. S. Chung,
“Accelerating deep convolutional neural networks using specialized hardware,”
Microsoft Research Whitepaper, 2015.
[15] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun,
S. Zhao, H. Larochelle, D. Englund et al., “Deep learning with coherent
nanophotonic circuits,” Nature Photonics, 2017.
[16] S.-C. Lin, Y. Zhang, C.-H. Hsu, M. Skach, M. E. Haque, L. Tang, and J. Mars, “The
architectural implications of autonomous driving: Constraints and acceleration,” in
ACM SIGPLAN, 2018.
[17] S. Liu, J. Tang, Z. Zhang, and J.-L. Gaudiot, “Computer architectures for
autonomous driving,” Computer, 2017.
[18] A. Hayek and J. Börcsök, “Safety chips in light of the standard IEC 61508: survey
and analysis,” in International Symposium on Fundamentals of Electrical
Engineering, 2014.
[19] E. Ugljesa and J. Börcsök, “Evaluation of sophisticated hardware architectures for
safety applications,” in International Symposium on Information, Communication
and Automation Technologies, 2009.
[20] W. M. Goble and H. Cheddie, Safety Instrumented Systems verification: practical
probabilistic calculations. ISA, 2004.
[21] D. H. Stamatis, Failure mode and effect analysis: FMEA from theory to execution.
ASQ Quality Press, 2003.
[22] R. Mariani, G. Boschi, and F. Colucci, “Using an innovative SoC-level FMEA
methodology to design in compliance with IEC61508,” 2007.
[23] A. Mearns, “Fault tree analysis- the study of unlikely events in complex systems,” in
System Safety Symposium, Seattle, Wash, 1965.
[24] P. Koopman, “A case study of Toyota unintended acceleration and software safety,”
2014. [Online]. Available: https://users.ece.cmu.edu/~koopman/toyota/koopman-09-
18-2014_toyota_slides.pdf
[25] R. E. Cole, “What really happened to Toyota?” MIT Sloan Management Review,
2011.
[26] “Ford issues extensive recall on f-150 models over downshifting problem.” [Online].
Available: https://www.hlmlawfirm.com/blog/ford-issues-extensive-recall-on-f-150-
models-over-downshifting-problem/
[27] K. Kalaignanam, T. Kushwaha, and M. Eilert, “The impact of product recalls on
future product reliability and future accidents: Evidence from the automobile
industry,” Journal of Marketing, 2013.
[28] N. A. Stanton, P. M. Salmon, G. H. Walker, and M. Stanton, “Models and methods
for collision analysis: A comparison study based on the uber collision with a
pedestrian,” Safety Science, 2019.
[29] N. Bomey, “Uber self-driving car crash: Vehicle detected Arizona pedestrian 6 seconds before accident,” USA Today, https://www.usatoday.com/story/money/cars/2018/05/24/uber-self-driving-car-crash-ntsb-investigation/640123002, 2018.
[30] V. A. Banks, K. L. Plant, and N. A. Stanton, “Driver error or designer error: Using
the perceptual cycle model to explore the circumstances surrounding the fatal Tesla crash on 7th May 2016,” Safety Science, 2018.
[31] F. Lambert, “Understanding the fatal Tesla accident on Autopilot and the NHTSA probe,” Electrek, July 2016.
[32] IEC 61508, International standard for functional safety of electrical / electronic /
programmable electronic safety-related systems, 2010.
[33] ISO 26262, International standard for functional safety of electrical and electronic
systems in production automobiles, 2018.
[34] RTCA DO-178B, Software considerations in airborne systems and equipment certification. RTCA, Incorporated, 1992.
[35] M. Ebrahimi, A. Evans, M. B. Tahoori, R. Seyyedi, E. Costenaro, and
D. Alexandrescu, “Comprehensive analysis of alpha and neutron particle-induced
soft errors in an embedded processor at nanoscales,” in Design, Automation & Test
in Europe, 2014.
[36] S. Mukherjee, Architecture design for soft errors. Morgan Kaufmann, 2011.
[37] Q. Zhao and J. Jiang, “Reliable state feedback control system design against actuator
failures,” Automatica, 1998.
[38] G.-H. Yang, J. L. Wang, and Y. C. Soh, “Reliable h-infinity controller design for
linear systems,” Automatica, 2001.
[39] C. T. Doug Parker, “Winning share in automotive semiconductor,” 2013. [Online].
Available: http://www.mckinsey.com/~/media/mckinsey/dotcom/client_service/-
semiconductors/issue%203%20autumn%202013/pdfs/-
5_automotivesemiconductors.ashx
[40] “Featured applications for real-time control.” [Online]. Available: http://-
www.ti.com/lsds/ti/microcontrollers-16-bit-32-bit/c2000-performance/real-time-
control/applications-featured-applications.page
[41] R. Mariani, “Applying iso 26262 to adas and automated driving,” in AutoSens, 2014.
[42] IEC 60880, International standard for Nuclear power plants - Instrumentation and
control systems important to safety - Software aspects for computer-based systems
performing category A functions, 2006.
[43] EN 50128, International standard for Railway applications-Communication,
Signaling and Processing Systems-Software for Railway Control and Protection
Systems, 2011.
[44] IEC 60601, International standard for common aspects of electrical equipment used
in medical practice, 2015.
[45] IEC 62061, International standard for safety of machinery, 2005.
[46] ISO 13849, International standard for Safety of machinery—Safety-related parts of
control systems, 2015.
[47] IEC 61800, International standard for adjustable speed electrical power drive
systems, 2017.
[48] IEC 60730, International standard for household and similar electrical appliances
safety, 2003.
[49] SAE J2980, Considerations for ISO 26262 ASIL Hazard Classification, 2015.
[50] T. Stolte, G. Bagschik, A. Reschka et al., “Hazard analysis and risk assessment for
an automated unmanned protective vehicle,” arXiv, 2017.
[51] R. Schneider, W. Brandstaetter, M. Born, O. Kath, T. Wenzel, R. Zalman, and
J. Mayer, “Safety element out of context - a practical approach,” SAE Technical
Paper, 2012.
[52] B. Peng, Y. Chen, S.-Y. Kuo, and C. Bolger, “IC HTOL test stress condition
optimization,” in IEEE International Symposium on Defect and Fault Tolerance in
VLSI Systems, 2004.
[53] P. W. Lisowski and K. F. Schoenberg, “The Los Alamos neutron science center,”
Nuclear Instruments and Methods in Physics Research Section A: Accelerators,
Spectrometers, Detectors and Associated Equipment, 2006.
[54] M. Bellotti and R. Mariani, “How future automotive functional safety requirements
will impact microprocessors design,” Microelectronics Reliability, 2010.
[55] G.-A. Klutke, P. C. Kiessler, and M. A. Wortman, “A critical look at the bathtub
curve,” IEEE Transactions on Reliability, 2003.
[56] Bathtub curve. [Online]. Available: https://en.wikipedia.org/wiki/Bathtub_curve
[57] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw,
T. Austin, K. Flautner et al., “Razor: A low-power pipeline based on circuit-level
timing speculation,” in Microarchitecture, 2003.
[58] M. Zhang, S. Mitra, T. Mak, N. Seifert, N. J. Wang, Q. Shi, K. S. Kim, N. R.
Shanbhag, and S. J. Patel, “Sequential element design with built-in soft error
resilience,” IEEE Transactions on Very Large Scale Integration Systems, 2006.
[59] V. Prasanth, V. Singh, and R. Parekhji, “Robust detection of soft errors using
delayed capture methodology,” in International On-Line Testing Symposium, 2010.
[60] N. D. P. Avirneni, V. Subramanian, and A. K. Somani, “Low overhead soft error
mitigation techniques for high-performance and aggressive systems,” Dependable
Systems & Networks, 2009.
[61] V. Prasanth, V. Singh, and R. Parekhji, “Reduced overhead soft error mitigation
using error control coding techniques,” in International On-Line Testing Symposium,
2011.
[62] E. T. Grochowski, W. Rash, N. Quach, H. Nguyen, and A. Rabago, “Microprocessor
with dual execution core operable in high reliability mode,” Patent 6615366, 2003.
[63] S. Banerjee, A. Chatterjee, and J. A. Abraham, “Efficient cross-layer concurrent
error detection in nonlinear control systems using mapped predictive check states,”
in International Test Conference, 2016.
[64] V. Prasanth, R. Parekhji, and B. Amrutur, “Improved methods for accurate safety
analysis of real-life systems,” in Asian Test Symposium, 2015.
[65] M. A. Sabet, B. Ghavami, and M. Raji, “Gpu-accelerated soft error rate analysis of
large-scale integrated circuits,” IEEE Design & Test, 2018.
[66] H. Cho, S. Mirkhani, C.-Y. Cher, J. A. Abraham, and S. Mitra, “Quantitative
evaluation of soft error injection techniques for robust system design,” in Design
Automation Conference, 2013.
[67] I. Polian, J. P. Hayes, S. M. Reddy, and B. Becker, “Modeling and mitigating
transient errors in logic circuits,” IEEE Transactions on Dependable and Secure
Computing, 2011.
[68] L. Chen, M. Ebrahimi, and M. B. Tahoori, “CEP: Correlated Error Propagation for
hierarchical soft error analysis,” Journal of Electronic Testing, 2013.
[69] T. Maeba, M. Deng, A. Yanou, and T. Henmi, “Swing-up controller design for
inverted pendulum by using energy control method based on lyapunov function,” in
IEEE International Conference on Modelling, Identification and Control, 2010.
[70] M. I. Momtaz, S. Banerjee, and A. Chatterjee, “Real-time DC motor error detection
and control compensation using linear checksums,” in VLSI Test Symposium, 2016.
[71] Standardized e-gas monitoring concept for gasoline and diesel engine control units.
[Online]. Available: https://www.iav.com/sites/default/files/attachments/seite/ak-
egas-v6-0-en-150922_1.pdf
[72] D. Geyer, M. Kick, and M. Kraus, “Monitoring the functional reliability of an
internal combustion engine,” Patent 8392046, 2013.
[73] P. Pisu, Fault Detection and Isolation with Applications to Vehicle Systems.
Springer, 2016.
[74] A. Kohn, R. Schneider, A. Vilela, U. Dannebaum, and A. Herkersdorf, “Markov
chain-based reliability analysis for automotive fail-operational systems,” SAE
International Journal of Transportation Safety, 2017.
[75] Enhanced Capture Module (eCAP) Reference Guide. [Online]. Available: http://-
www.ti.com/lit/ug/sprufz8a/sprufz8a.pdf
[76] Enhanced Pulse Width Modulator (ePWM) Reference Guide. [Online]. Available:
http://www.ti.com/lit/ug/spruge9e/spruge9e.pdf
[77] Y.-S. Kung, N. V. Quynh, N. T. Hieu, C.-C. Huang, and L.-C. Huang,
“Simulink/Modelsim co-simulation and FPGA realization of speed control IC for
PMSM drive,” Procedia Engineering, 2011.
[78] C. Bottoni, M. Glorieux, J. Daveau, G. Gasiot, F. Abouzeid, S. Clerc, L. Naviner,
and P. Roche, “Heavy ions test result on a 65nm Sparc-v8 radiation-hard
microprocessor,” in IEEE International Reliability Physics Symposium, 2014.
[79] DRV8312 - Three Phase Brushless DC Motor Driver IC. [Online]. Available: http://-
www.ti.com/product/DRV8312
[80] F2805x - Real time control MCU. [Online]. Available: http://www.ti.com/product/-
TMS320F28055
[81] Texas Instruments Development Kit Application Note. [Online]. Available: http://-
www.ti.com/tool/TMDSSOLARPEXPKIT
[82] C.-M. Ong, Dynamic simulation of electric machinery: using MATLAB/SIMULINK.
Prentice hall, 1998.
[83] S. Mirkhani and J. A. Abraham, “Fast evaluation of test vector sets using a
simulation-based statistical metric,” in VLSI Test Symposium (VTS), 2014.
[84] A. L. Silburt, A. Evans, I. Perryman, S.-J. Wen, and D. Alexandrescu, “Design for
soft error resiliency in internet core routers,” IEEE Transactions on Nuclear Science,
2009.
[85] G. Boschi, R. Mariani, and S. Lorenzini, “A verification strategy for fault-detection
and fault-tolerance circuits,” in International On-Line Testing Symposium, 2011.
[86] R. Leveugle, “A new approach for early dependability evaluation based on formal
property checking and controlled mutations,” in International On-Line Testing
Symposium, 2005.
[87] S. A. Seshia, W. Li, and S. Mitra, “Verification guided soft error resilience,” Design
Automation and Test in Europe, 2007.
[88] G. Fey and R. Drechsler, “A basis for formal robustness checking,” International
Symposium on Quality Electronic Design, 2008.
[89] G. Fey, A. Sülflow, and R. Drechsler, “Computing bounds for fault tolerance using
formal techniques,” in Design Automation Conference, 2009.
[90] U. Krautz, M. Pflanz, C. Jacobi, H.-W. Tast, K. Weber, and H. T. Vierhaus,
“Evaluating coverage of error detection logic for soft errors using formal methods,”
Design Automation and Test in Europe, 2006.
[91] M. Breuer, “Hardware that produces bounded rather than exact results,” in Design
Automation Conference, 2010.
[92] D. Holcomb, W. Li, and S. A. Seshia, “Design as you see fit: System-level soft error
analysis of sequential circuits,” Design Automation and Test in Europe, 2009.
[93] Cadence Incisive Enterprise Verifier. [Online]. Available: http://www.cadence.com/-
products/fv/enterprise_verifier/pages/default.aspx
[94] F. Corno, M. S. Reorda, and G. Squillero, “RT-level ITC’99 benchmarks and first
ATPG results,” IEEE Design & Test of Computers, 2000.
[95] A. Benso and P. Prinetto, Fault injection techniques and tools for embedded systems
reliability evaluation. Springer Science & Business Media, 2003.
[96] E. H. Cannon, D. D. Reinhardt, M. S. Gordon, and P. S. Makowenskyj, “SRAM
SER in 90, 130 and 180 nm bulk and SOI technologies,” in IEEE International Reliability Physics Symposium, 2004.
[97] K. Greb and D. Pradhan, “Hercules microcontrollers: Real-time mcus for safety-
critical products,” white paper, 2011.
[98] R. Venkatasubramanian, J. P. Hayes, and B. T. Murray, “Low-cost on-line fault
detection using control flow assertions,” in IEEE On-Line Testing Symposium, 2003.
[99] K.-H. Huang and J. Abraham, “Algorithm-based fault tolerance for matrix
operations,” IEEE transactions on computers, 1984.
[100] A. Benso, S. Chiusano, P. Prinetto, and L. Tagliaferri, “A C/C++ source-to-source
compiler for dependable applications,” in International Conference on Dependable
Systems and Networks, 2000.
[101] N. Oh, P. P. Shirvani, and E. J. McCluskey, “Error detection by duplicated
instructions in super-scalar processors,” IEEE Transactions on Reliability, 2002.
[102] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August, “Swift:
Software implemented fault tolerance,” in International Symposium on Code
generation and optimization, 2005.
[103] S. Rehman, K.-H. Chen, F. Kriebel, A. Toma, M. Shafique, J.-J. Chen, and
J. Henkel, “Cross-layer software dependability on unreliable hardware,” IEEE
Transactions on Computers, 2015.
[104] J. Henkel, L. Bauer, H. Zhang, S. Rehman, and M. Shafique, “Multi-layer
dependability: From microarchitecture to application level,” in 2014 51st
ACM/EDAC/IEEE Design Automation Conference (DAC), 2014.
[105] G. Hubert, L. Artola, and D. Regis, “Impact of scaling on the soft error sensitivity
of bulk, FDSOI and FinFET technologies due to atmospheric radiation,” Integration,
the VLSI journal, 2015.
[106] P. Hazucha, T. Karnik, S. Walstra, B. A. Bloechel, J. W. Tschanz, J. Maiz,
K. Soumyanath, G. E. Dermer, S. Narendra, V. De et al., “Measurements and
analysis of SER-tolerant latch in a 90-nm dual-Vt CMOS process,” IEEE Journal of
Solid-State Circuits, 2004.
[107] T.-Y. Wuu and S. B. Vrudhula, “A design of a fast and area efficient multi-input
muller c-element,” IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, 1993.
[108] R. Naseer and J. Draper, “DF-DICE: A scalable solution for soft error tolerant
circuit design,” in IEEE International Symposium on Circuits and Systems,
2006.
[109] X. Iturbe, B. Venu, E. Ozer, and S. Das, “A Triple Core Lock-Step (TCLS) ARM
Cortex-R5 Processor for Safety-Critical and Ultra-Reliable Applications,” in
International Conference on Dependable Systems and Networks, 2016.
[110] M. Werner, K. Devarajegowda, M. Chaari, and W. Ecker, “Increasing soft error
resilience by software,” in Design Automation Conference, 2019.
[111] D. J. Lu, “Watchdog processors and structural integrity checking,” IEEE
Transactions on Computers, 1982.
[112] O. Goloubeva, M. Rebaudengo, M. S. Reorda, and M. Violante, “Soft-error
detection using control flow assertions,” in IEEE Symposium on Defect and Fault
Tolerance in VLSI Systems, 2003.
[113] N. Oh, P. P. Shirvani, and E. J. McCluskey, “Control-flow checking by software
signatures,” IEEE Transactions on Reliability, 2002.
[114] S. Schuster, P. Ulbrich, I. Stilkerich, C. Dietrich, and W. Schröder-Preikschat,
“Demystifying soft-error mitigation by control-flow checking: A new perspective on
its effectiveness,” ACM Transactions on Embedded Computing Systems, 2017.
[115] V. Sridharan and D. R. Kaeli, “Eliminating microarchitectural dependency from
architectural vulnerability,” in IEEE International Symposium on High Performance
Computer Architecture, 2009.
[116] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin, “A
systematic methodology to compute the architectural vulnerability factors for a high-
performance microprocessor,” in IEEE/ACM International Symposium on
Microarchitecture, 2003.
[117] X. Li, S. V. Adve, P. Bose, and J. A. Rivers, “Online estimation of architectural
vulnerability factor for soft errors,” in ACM SIGARCH Computer Architecture News,
2008.
[118] L. Chen and A. Avizienis, “N-version programming: A fault-tolerance approach to
reliability of software operation,” in International Symposium on Fault-Tolerant
Computing, 1995.
[119] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, D. I. August, and S. S.
Mukherjee, “Design and evaluation of hybrid fault-detection systems,” in ACM
SIGARCH Computer Architecture News, 2005.
[120] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, D. I. August, and S. S.
Mukherjee, “Software-controlled fault tolerance,” ACM Transactions on
Architecture and Code Optimization (TACO), 2005.
[121] M. Imai, Y. Takeuchi, K. Sakanushi, and N. Ishiura, “Advantage and possibility of
application-domain specific instruction-set processor (ASIP),” IPSJ Transactions on
System LSI Design Methodology, 2010.
[122] P. Winokur, G. Lum, M. Shaneyfelt, F. Sexton, G. Hash, and L. Scott, “Use of
COTS microelectronics in radiation environments,” IEEE Transactions on Nuclear
Science, 1999.
[123] J. Yiu, The definitive guide to the ARM Cortex-M3. Newnes, 2009.
[124] S. Vestal, “Preemptive scheduling of multi-criticality systems with varying degrees
of execution time assurance,” in 28th IEEE International Real-Time Systems
Symposium, 2007.
[125] R. Mariani and P. Fuhrmann, “Comparing fail-safe microcontroller architectures in
light of IEC 61508,” in 22nd IEEE International Symposium on Defect and Fault-
Tolerance in VLSI Systems (DFT 2007), 2007.
[126] A. Kohn, M. Käßmeyer, R. Schneider, A. Roger, C. Stellwag, and A. Herkersdorf,
“Fail-operational in safety-related automotive multi-core systems,” in IEEE
International Symposium on Industrial Embedded Systems, 2015.
[127] A. Burns and R. I. Davis, “A survey of research into mixed criticality systems,”
ACM Computing Surveys (CSUR), 2018.
[128] S. Fei, G. Prashant, and Z. Min, “On freedom from interference in mixed criticality
systems: A causal learning approach,” in International Test Conference, 2019.