RECONFIGURABLE FAULT TOLERANCE FOR SPACE SYSTEMS
By
ADAM M. JACOBS
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2013
© 2013 Adam M. Jacobs
To my parents for all of their patience and support
ACKNOWLEDGMENTS
This work was supported in part by the I/UCRC Program of the National Science
Foundation under Grant No. EEC-0642422 and IIP-1161022. The author gratefully
acknowledges equipment and tools provided by various vendors that helped
make this work possible. The author also thanks fellow graduate student Grzegorz
Cieslewski for developing the FPGA fault-injection tool used to gather many of the
experimental results for this work.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 BACKGROUND AND RELATED RESEARCH . . . . . . . . . . . . . . . . . . . 18
2.1 FPGA Performance and Power Efficiency . . . . . . . . . . . . . . . . . . . . 18
2.2 Partial Reconfiguration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Single-Event Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 FPGAs in Space Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5 Low-Overhead Fault Tolerance Methods . . . . . . . . . . . . . . . . . . . . . 23
2.6 Algorithm-Based Fault Tolerance . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.7 Task Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3 FRAMEWORK FOR RECONFIGURABLE FAULT TOLERANCE . . . . . . . . 28
3.1 RFT Hardware Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
    3.1.1 RFT Controller Operation . . . . . . . . . . . . . . . . . . . . . . . . . 30
    3.1.2 MicroBlaze Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
    3.1.3 Environment-Based Fault Mitigation . . . . . . . . . . . . . . . . . . . 34
    3.1.4 RFT Controller Resource and Performance Overheads . . . . . . . . 35
3.2 RFT Fault-Rate Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 RFT Performability Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
    3.4.1 Validation Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
    3.4.2 Orbital Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
        3.4.2.1 Low-Earth orbit case study . . . . . . . . . . . . . . . . . . 48
        3.4.2.2 Highly-elliptical orbit case study . . . . . . . . . . . . . . . 50
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4 ALGORITHM-BASED FAULT TOLERANCE FOR FPGA SYSTEMS . . . . . . 63
4.1 Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
    4.1.1 Checksum-Based ABFT for Matrix Multiplication . . . . . . . . . . . 64
    4.1.2 Matrix-Multiplication Architectures . . . . . . . . . . . . . . . . . . . 66
        4.1.2.1 Baseline, serial architecture . . . . . . . . . . . . . . . . . . 66
        4.1.2.2 Fine-grained parallel architecture . . . . . . . . . . . . . . . 66
        4.1.2.3 Coarse-grained parallel architecture . . . . . . . . . . . . . 67
        4.1.2.4 Architectural modifications for ABFT . . . . . . . . . . . . . 67
    4.1.3 Resource-Overhead Experiments . . . . . . . . . . . . . . . . . . . . 68
        4.1.3.1 Resource overhead of serial architectures . . . . . . . . . . 69
        4.1.3.2 Resource overhead of parallel architectures . . . . . . . . . 70
    4.1.4 Fault-Injection Experiments . . . . . . . . . . . . . . . . . . . . . . . 72
        4.1.4.1 Design vulnerability of serial architectures . . . . . . . . . . 74
        4.1.4.2 Design vulnerability of parallel architectures . . . . . . . . . 75
    4.1.5 Analysis of Matrix-Multiplication Architectures . . . . . . . . . . . . 76
4.2 Fast Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
    4.2.1 Checksum-Based ABFT for FFTs . . . . . . . . . . . . . . . . . . . . 78
    4.2.2 FFT Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
        4.2.2.1 Radix-2 Burst-IO FFT architecture . . . . . . . . . . . . . . 80
        4.2.2.2 Radix-2 Pipelined FFT architecture . . . . . . . . . . . . . . 80
        4.2.2.3 Architectural modifications for ABFT . . . . . . . . . . . . . 80
    4.2.3 Resource-Overhead Experiments . . . . . . . . . . . . . . . . . . . . 82
        4.2.3.1 Resource overhead of Burst-IO architecture . . . . . . . . . 82
        4.2.3.2 Resource overhead of Pipelined architecture . . . . . . . . . 83
    4.2.4 FFT Fault Injection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
    4.2.5 Analysis of FFT Architectures . . . . . . . . . . . . . . . . . . . . . . 85
4.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5 RFT SYSTEM INTEGRATION FOR RAPID SYSTEM DEVELOPMENT . . . . 94
5.1 Dynamically Generated RFT Components . . . . . . . . . . . . . . . . . . . . 94
    5.1.1 RFT Controller Point-to-Point Interface . . . . . . . . . . . . . . . . . 95
    5.1.2 Parameterized and Configurable Voting Logic . . . . . . . . . . . . . 96
5.2 Task Scheduling for RC Systems in Dynamic Fault-Rate Environments . . . 97
    5.2.1 Selection Criteria for Fault-Tolerant Mode . . . . . . . . . . . . . . . 98
        5.2.1.1 FT-mode selection using thresholds . . . . . . . . . . . . . . 98
        5.2.1.2 Time-resource metric for FT-mode selection . . . . . . . . . 99
    5.2.2 Scheduler for RFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
        5.2.2.1 RFT architecture description . . . . . . . . . . . . . . . . . 100
    5.2.3 Software Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
    5.2.4 Analysis and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
        5.2.4.1 Constant fault rates . . . . . . . . . . . . . . . . . . . . . . 102
        5.2.4.2 Dynamic fault-rate case studies . . . . . . . . . . . . . . . . 103
        5.2.4.3 Scheduling improvements . . . . . . . . . . . . . . . . . . . 104
5.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
LIST OF TABLES
Table page
3-1 RFT fault-tolerance modes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3-2 RFT controller resource usage. . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3-3 Fault-injection results for RFT components. . . . . . . . . . . . . . . . . . . . . 54
3-4 RFT Markov model validation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3-5 Unavailability and performability for LEO case study. . . . . . . . . . . . . . . . 55
3-6 Unavailability and performability for HEO case study. . . . . . . . . . . . . . . . 55
4-1 Resource utilization and overhead of serial MM designs. . . . . . . . . . . . . . 87
4-2 Serial matrix multiplication fault-injection results. . . . . . . . . . . . . . . . . . 87
4-3 Resource utilization and overhead of FFT designs. . . . . . . . . . . . . . . . . 87
4-4 FFT fault-injection results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5-1 RFT controller resource usage. . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5-2 Dynamic scheduling results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
LIST OF FIGURES
Figure page
3-1 System-on-chip architecture with RFT controller. . . . . . . . . . . . . . . . . . 55
3-2 RFT controller PLB-to-PRR interface. . . . . . . . . . . . . . . . . . . . . . . . 56
3-3 RFT fault-rate model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3-4 Phased-mission Markov model transitioning between TMR and DWC modes. . 57
3-5 Markov models of RFT modes. . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3-6 RFT validation Markov models. . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3-7 LEO fault rates using the RFT fault-rate model. . . . . . . . . . . . . . . . . . . 58
3-8 LEO system availability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3-9 Effects of adaptive thresholds on availability and performability. . . . . . . . . . 60
3-10 HEO fault rates using the RFT fault-rate model. . . . . . . . . . . . . . . . . . . 61
3-11 HEO system availability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4-1 Matrix-multiplication architectures. . . . . . . . . . . . . . . . . . . . . . . . . . 88
4-2 Matrix-multiplication pseudocode. . . . . . . . . . . . . . . . . . . . . . . . . 88
4-3 Matrix-multiplication ABFT-Extra architecture. . . . . . . . . . . . . . . . . . . . 88
4-4 Slice overhead of fine-grained parallel matrix multiplication. . . . . . . . . . . . 89
4-5 Slice overhead of coarse-grained parallel matrix multiplication. . . . . . . . . . 89
4-6 DSP48 and BlockRAM overhead of parallel matrix multiplication. . . . . . . . . 90
4-7 Fault vulnerability of fine-grained parallel matrix multiplication. . . . . . . . . . . 91
4-8 Fault vulnerability of coarse-grained parallel matrix multiplication. . . . . . . . . 92
4-9 FFT architectures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4-10 Required ABFT threshold value for floating-point FFTs. . . . . . . . . . . . . . 93
5-1 Research areas for Phase 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5-2 Comparison of RFT architectures. . . . . . . . . . . . . . . . . . . . . . . . . . 107
5-3 Flowchart of scheduling simulator. . . . . . . . . . . . . . . . . . . . . . . . . . 107
5-4 Effect of arrival rate on fault-free operation. . . . . . . . . . . . . . . . . . . . . 108
5-5 Effect of fault rate on task rejection. . . . . . . . . . . . . . . . . . . . . . . . . 108
5-6 Fault-rate profile for case studies. . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5-7 Resource fragmentation from adaptive placement. . . . . . . . . . . . . . . . . 110
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
RECONFIGURABLE FAULT TOLERANCE FOR SPACE SYSTEMS
By
Adam M. Jacobs
May 2013
Chair: Alan D. George
Major: Electrical and Computer Engineering
Commercial SRAM-based, field-programmable gate arrays (FPGAs) have the
capability to provide space applications with the necessary performance, energy-efficiency,
and adaptability to meet next-generation mission requirements. However, mitigating
an FPGA’s susceptibility to radiation-induced faults is challenging. Triple-modular
redundancy (TMR) techniques are traditionally used to mitigate radiation effects, but
TMR incurs substantial overheads such as increased area and power requirements. In
order to reduce overhead while providing sufficient radiation mitigation, this research
proposes a framework for reconfigurable fault tolerance (RFT) that enables system
designers to dynamically adjust a system’s level of redundancy and fault mitigation
based on the varying radiation incurred at different orbital positions. To realize this
goal and validate the effectiveness of the approach, three areas are investigated and
addressed.
First, a method for accurately estimating time-varying fault rates in space systems
and a reliability and performance model for adaptive systems are needed to quantify the
effectiveness of the RFT approach. Using multiple case-study orbits, our models predict
that adaptive fault-tolerance strategies can reduce unavailability by 85% relative to
low-overhead fault-tolerance techniques and improve performability by 128% over
traditional, static TMR fault tolerance.
Second, low-overhead fault-tolerance techniques which can be used within the
RFT framework for improved performance must be investigated. The effectiveness
of Algorithm-Based Fault Tolerance (ABFT) for FPGA-based systems is explored for
matrix multiplication and FFT. ABFT kernels were developed for an FPGA platform, and
reliability was measured using fault-injection testing. We show that matrix multiplication
and FFTs with ABFT can provide improved reliability (vulnerability reduced by 98%) with
low resource overhead, and scale favorably with additional parallelism.
Third, methods for facilitating the integration of RFT hardware into existing
PR-based systems and architectures are explored. We expand the RFT framework
to be used with bus-based or point-to-point architectures. We design a fault-tolerant
task-scheduling algorithm which can schedule RFT tasks in a dynamically-changing fault
environment in order to maximize system performability.
Combined, these three areas demonstrate the capability of RFT to provide both
performance and reliability in space. Using low-overhead fault-tolerance techniques and
reconfiguration, RFT can meet the strict constraints of next-generation space systems.
CHAPTER 1
INTRODUCTION
As remote sensor technology for space systems increases in fidelity, the amount
of data collected by orbiting satellites and other space vehicles will continue to
outpace the ability to transmit that data to other stations (e.g., ground stations, other
satellites). Increasing future systems’ onboard data-processing capabilities can
alleviate this downlink bottleneck, which is caused by bandwidth limitations and
high-latency transmission. Onboard data processing enables much of the raw data to be
interpreted, reduced, and/or compressed onboard the space system before transmitting
the results to ground stations or other space systems, thus reducing data transmission
requirements. For applications with low data transmission requirements, improved
onboard data processing can enable more complex and autonomous capabilities. These
autonomous capabilities will enable new classes of space missions such as in situ
scientific studies, constellations of multiple, coordinated space systems, or automated
maneuvering and guidance of deep-space systems. Finally, increased onboard data
processing can enable future space systems to keep up with increasingly stringent
real-time constraints. However, increasing the onboard data-processing capabilities
requires high-performance computing, which has largely been absent from space
systems.
In addition to the high performance requirements of these enhanced future systems,
space environments impose several other stringent, high-priority design constraints.
System design decisions must consider the system’s size, weight, and power (SWaP)
and heat dissipation, which are dictated by the system’s physical platform configuration.
The satellite’s physical dimensions restrict the system’s size, the photovoltaic solar
array’s capacity restricts the power generation, and all heat dissipation must occur
from passive, radiative cooling, which can be slow. In addition to SWaP requirements,
system designers must consider system component reliability and system availability
because faulty space systems are either impossible or prohibitively expensive to
service. Radiation-hardened devices, with increased protection from long-term
radiation exposure (total ionizing dose), provide system reliability and correctness.
While device-level radiation hardening increases component lifetimes and reliability,
these hardened devices are expensive and have dramatically reduced performance as
compared to non-hardened commercial-off-the-shelf (COTS) components.
In order to design an optimal system, the performance, SWaP, and reliability
requirements of future space applications must be considered together. System design
philosophy must consider the worst-case operating scenario (e.g., typically worst-case
radiation), which may dramatically limit the overall system performance even though
the worst-case scenario may be infrequent (e.g., radiation levels typically change based
on orbital position). However, if components used for redundancy and reliability can be
dynamically repurposed to perform useful, additional computation, future space systems
could meet both high-performance and high-reliability requirements. Future systems
could maintain high reliability during worst-case operating environments while achieving
high performance during less radiation-intensive periods. Current design methodologies
do not account for this type of adaptability. Therefore, in order for future space systems
to achieve high levels of performance, more sophisticated and adaptive system design
methodologies are necessary.
One approach for adaptive high-performance space system design leverages
hardware-adaptive devices such as field-programmable gate arrays (FPGAs), which
provide parallel computations at a high level of performance per unit size, mass, and
power [Williams et al. 2010]. Fortunately, many space applications, such as synthetic
aperture radar (SAR) [Le et al. 2004], hyperspectral imaging (HSI) [Hsueh and Chang
2008], image compression [Gupta et al. 2006], and other image processing applications
[Dawood et al. 2002], where onboard data processing can significantly reduce data
transmission requirements, are amenable to an FPGA’s highly parallel architecture.
Since reconfiguration enables FPGAs to perform a wide variety of application-specific
tasks, these systems can rival application-specific integrated circuit (ASIC) performance
while maintaining a general-purpose processor’s flexibility. An SRAM-based FPGA can
be reconfigured multiple times within a single application, allowing a single FPGA to
be used for multiple functions by time-multiplexing the FPGA’s hardware resources,
reducing the number of concurrently active processing modules when an application
does not require all processing modules all of the time. Thus, FPGA reconfiguration
facilitates small, lightweight, yet powerful systems that can be optimized for a space
application’s time-varying hardware requirements.
In order to leverage FPGAs in space systems, the FPGA must operate correctly
and reliably in high-radiation environments, such as those found in near-Earth orbits.
Currently, most radiation-hardened FPGAs have antifuse-based configuration memories
that are immune to single-event upsets (SEUs). However, these hardened FPGAs have
reconfiguration limitations and small capacities, reducing the primary performance
benefits offered by COTS SRAM-based FPGAs. Fortunately, when combined with
special system design techniques, SRAM-based FPGAs can be viable for space
systems. An SRAM-based FPGA’s primary computational limitation is the possibility
of SEUs causing errors within the FPGA user logic and routing resources, which
can manifest as configuration memory upsets or logic memory (e.g., flip-flops, user
RAM) upsets (i.e., resulting in deviations from the expected application behavior).
Fault-tolerant techniques, such as triple-modular redundancy (TMR) and memory
scrubbing, can protect the system from most SEUs and significantly decrease the
SEU-induced errors, but designing an FPGA-based space system using TMR introduces
at least 200% area overhead for each protected module. Depending on the expected
upset rates for a given space system, other lower-overhead fault tolerance methods
could be used to provide sufficient reliability while maximizing the resources available for
performance.
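The masking behavior of TMR can be sketched in a few lines of software. The following is an illustrative Python model of a bitwise majority voter, not the dissertation's hardware implementation (real designs realize the voter in FPGA logic):

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three replica outputs.

    Each output bit follows the majority of the corresponding input
    bits, so an upset in any single replica is masked.
    """
    return (a & b) | (b & c) | (a & c)

# A single-bit upset in one replica is outvoted by the other two:
golden = 0b1011
faulty = golden ^ 0b0100  # bit flip in replica c
assert tmr_vote(golden, golden, faulty) == golden
```

The 200% area overhead cited above follows directly from this structure: every protected module is instantiated three times, plus the voter.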
When designing a traditional space system, system designers estimate the
expected worst-case upset rates and include an additional safety margin. However,
since single-event upset (SEU) rates vary based on orbital position and the majority
of orbital positions experience relatively low upset rates, a system designed for
the worst-case upset-rate scenario contains processing resources that are wasted
during the frequent low-upset-rate periods. In order to provide the necessary reliability
during high-upset-rate periods and reduce the processing overhead incurred during
low-upset-rate periods, the fault tolerance method must change based on the current
upset rate. For example, during high-upset-rate periods, the system can be reconfigured
to provide high reliability at the expense of reduced processing capabilities, while during
low-upset-rate periods the system can be reconfigured to provide higher performance
by re-provisioning the excess hardware (used for high reliability during high-upset-rate
periods) to application functionality. This upset-rate-based adaptability provides high
performance while maintaining reliability.
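A minimal Python sketch of this upset-rate-based adaptation follows. The mode names echo those used later in this document (TMR, duplication with compare), but the threshold values are hypothetical, chosen only for illustration:

```python
def select_ft_mode(upset_rate: float,
                   high_threshold: float = 1e-3,
                   low_threshold: float = 1e-5) -> str:
    """Choose a fault-tolerance mode from the estimated upset rate (upsets/s)."""
    if upset_rate >= high_threshold:
        return "TMR"  # triplicated modules: maximum masking, lowest throughput
    if upset_rate >= low_threshold:
        return "DWC"  # duplication with compare: detect errors, then recover
    return "HP"       # high performance: all resources perform computation

# During a radiation-intensive period, the system trades throughput for masking:
assert select_ft_mode(1e-2) == "TMR"
# In a benign orbital position, hardware is re-provisioned for performance:
assert select_ft_mode(1e-7) == "HP"
```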
This research proposes a framework for reconfigurable fault tolerance (RFT) that
enables FPGA-based space systems to dynamically adapt the amount of fault tolerance
based on the current upset rate. To realize this goal and validate the effectiveness of
the approach, three areas must be addressed and investigated. First, a method for
accurately estimating time-varying fault rates in space systems and a reliability and
performance model for adaptive systems are needed to quantify the effectiveness of the RFT
approach. Second, techniques for low-overhead fault tolerance which can be used within
the RFT framework for improved performance must be investigated. Third, tools which
facilitate the creation of the RFT hardware components are required to enable the RFT
integration into existing PR-based systems and architectures. The research presented in
this document is divided into three phases, each proposing a solution to the goals listed
above.
The first phase of this research addresses the need for performance and reliability
modelling of adaptive systems with changing environments, such as FPGA-based
space systems. A fault-rate estimation methodology for systems in near-Earth orbits is
developed to estimate the time-varying fault rates experienced during a specified orbit.
A phased-mission Markov model is then used to estimate performance and reliability of
an adaptive system using several adaptation schedules. In this phase, we also develop
and implement an RFT controller design which can provide adaptive fault tolerance in an
FPGA. The reliability and performance models are then experimentally validated using
fault-injection testing on the RFT controller.
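As a point of reference for the modeling in this phase, the simplest Markov availability model has two states (operational and failed) with fault rate λ and repair rate μ, giving steady-state availability μ/(λ+μ). This toy calculation is not the phased-mission model developed here, which chains per-mode models across orbital phases, but it illustrates the quantity being estimated:

```python
def steady_state_availability(fault_rate: float, repair_rate: float) -> float:
    """Steady-state 'up' probability of a two-state (up/down) Markov chain.

    fault_rate and repair_rate are transition rates in the same time unit
    (e.g., events per hour).
    """
    return repair_rate / (fault_rate + repair_rate)

# One fault per 1000 hours, with recovery (e.g., reconfiguration plus
# scrubbing) taking one hour on average:
availability = steady_state_availability(1 / 1000, 1.0)
assert abs(availability - 1000 / 1001) < 1e-12
```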
The second phase of this research investigates the effectiveness of techniques
for low-overhead fault tolerance in FPGA systems. Algorithm-based fault tolerance
(ABFT) is a technique that can be used with many linear-algebra operations, such as
matrix multiplication or LU decomposition [Huang and Abraham 1984], to provide fault
tolerance with as little as 5-10% overhead. Traditionally, ABFT has been implemented
in software, with multiprocessor arrays, and in hardware, with systolic arrays, to protect
application datapaths. Our ABFT approach may be used in FPGA applications to
provide both datapath and configuration memory protection with low overhead. Other
fault-tolerance techniques (e.g., duplication with compare, error-correcting codes,
concurrent error detection, reduced-precision redundancy) are also examined and
evaluated for FPGA resource overhead and reliability within the context of the RFT
framework.
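The checksum idea behind ABFT for matrix multiplication [Huang and Abraham 1984] is compact enough to sketch in pure Python. This is only an illustrative software model of the scheme, not the FPGA architectures evaluated later: a column-checksum row is appended to A and a row-checksum column to B, so the product carries checksums that verify the result.

```python
def matmul(A, B):
    """Plain matrix product of nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def abft_matmul(A, B):
    """Multiply checksum-augmented matrices and verify the product."""
    Ac = A + [[sum(col) for col in zip(*A)]]  # append column-checksum row
    Br = [row + [sum(row)] for row in B]      # append row-checksum column
    Cf = matmul(Ac, Br)                       # full-checksum product
    C = [row[:-1] for row in Cf[:-1]]
    # The appended row/column of Cf must equal the sums of C itself.
    ok = (all(Cf[-1][j] == sum(row[j] for row in C) for j in range(len(C[0])))
          and all(Cf[i][-1] == sum(C[i]) for i in range(len(C))))
    return C, ok

C, ok = abft_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
assert ok and C == [[19, 22], [43, 50]]
```

A single erroneous element of C perturbs exactly one row checksum and one column checksum, localizing the error; floating-point versions compare against a threshold rather than testing exact equality.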
In the third phase of this research, methods for integrating the RFT framework
with pre-existing partial reconfiguration architectures will be examined, enabling fault
tolerance for pre-existing systems with minimal design changes. Three aspects of
system integration will be considered. First, a tool that will enable the dynamic creation
of system-specific RFT controllers, allowing mission-specific, resource-optimized
hardware. Second, support and integration of an RFT controller with the existing
PR architectures to enable fault tolerance with minimal design modifications. Third,
investigation of fault-tolerant task scheduling in the presence of changing fault rates.
The remainder of this document is organized as follows. Chapter 2 surveys
previous work related to topics common to all three phases of this research.
Chapter 3 describes the RFT hardware architecture, the RFT fault-rate model, and
the RFT performability model and provides case-study examples for two near-Earth
orbits. Chapter 4 evaluates the use of low-overhead fault tolerance techniques in FPGA
systems, analyzes reliability results obtained using fault-injection in MM and FFT case
studies, and suggests design modifications for higher reliability. Chapter 5 demonstrates
a dynamic RFT hardware-creation methodology and a point-to-point RFT controller
implementation, and explores heuristics for an environmentally-aware task scheduler
for RFT systems. Finally, Chapter 6 presents conclusions and outlines directions for
possible future research.
CHAPTER 2
BACKGROUND AND RELATED RESEARCH
In this chapter, we motivate the need for SRAM-based FPGAs in space systems.
The FPGAs used in space systems must be rated for a sufficient total ionizing
dose (TID) to ensure long-term device functionality over a given mission
duration. Single-event effects (SEEs), caused by collisions with high-energy protons
and heavy ions (i.e., radiation), are the primary short-term reliability concern in space
systems. In order for FPGA-based systems to maintain reliability, a combination of
radiation-hardened FPGAs and complex fault-tolerance techniques is required
to mitigate errors. Since many common fault-tolerance techniques require
substantial time or area overhead, new, low-overhead techniques allow more
reconfigurable logic to be used for actual, useful computation instead of for redundancy.
2.1 FPGA Performance and Power Efficiency
SRAM-based FPGAs offer a very large amount of configurable logic and have the
ability to modify portions of a design during run-time, giving these FPGAs the capability
to efficiently perform a wide range of applications by using a high degree of parallelism
while running at low clock rates. Williams et al. [2010] developed computational density
metrics to quantify and predict performance of SRAM-based FPGAs for specific types
of algorithms and applications. Their analysis showed that FPGAs were capable
of providing between 3 and 60 times more performance per unit power than many
conventional general-purpose processors, depending on the types of operations being
considered. The performance and power efficiency of FPGAs is extremely desirable for
the small power budgets of space systems.
2.2 Partial Reconfiguration
Partial reconfiguration (PR) enables a user to modify a portion of an FPGA’s
configuration while the remainder of the FPGA remains operational. PR time-multiplexes
mutually exclusive application-specific processing modules on the same hardware, and
only the modules being reconfigured halt operation, which makes PR attractive
for real-time systems. Currently, Xilinx supports PR for the Virtex-4 and newer devices
[Xilinx 2010a], while Altera has recently announced PR support for Stratix-V devices
[Altera 2010].
During the system design phase, system designers define the FPGA’s partially
reconfigurable regions (PRRs) and partially reconfigurable modules (PRMs) and route
signals to/from the PRRs through bus macro PR primitives (Xilinx 9.2 PR tool flow)
or partition pins (Xilinx 12.1 PR tool flow). Partial bitstreams, communicated through
external configuration interfaces (e.g., SelectMAP, JTAG), are used to reconfigure the
PRRs with the PRMs. On Xilinx devices, the Internal Configuration Access Port (ICAP)
is an internal configuration interface, allowing user logic to directly reconfigure PRRs,
removing the need for additional external configuration support devices. Additionally,
since partial bitstreams are typically much smaller than full bitstreams (which are used
to configure the entire FPGA), PR reduces bitstream storage requirements. A partial
bitstream's size scales with the size of its corresponding PRR.
2.3 Single-Event Effects
SEEs occur when high-energy particles, such as protons, neutrons, or heavy
ions, collide with silicon atoms on a device, depositing the ion’s electric charge into the
device’s circuit. Protons and electrons are trapped within the Earth’s Van Allen belts,
while heavy ions are mainly produced by galactic cosmic rays and solar flares. When a
high-energy particle collides with a silicon device, the energy of the collision can cause
the logical values stored in sequential memory elements to be inverted [Karnik and
Hazucha 2004]. Errors caused by these particles are often referred to as soft errors or
SEUs, as there is no permanent circuitry damage and any affected memories can be
corrected by re-writing the correct values. Single-event functional interrupts (SEFIs),
another type of SEE, can cause a semi-permanent fault that requires a circuit to be
power-cycled to restore correct operation. Single-event latchups (SELs) are destructive
SEEs that occur when a particle causes a parasitic forward-biased structure within the
device substrate that can allow destructively high amounts of current to flow through the
substrate, potentially damaging the device. SELs can be avoided by using appropriate
device manufacturing processes, and devices produced using silicon-on-insulator (SOI)
processes are largely immune. SEL immunity is an important property for selecting
devices for many space systems.
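The soft-error recovery mechanism mentioned above, re-writing the correct values, is the basis of memory scrubbing. The following is a toy Python model of that loop; real scrubbers compare configuration frames read through a configuration port against a golden copy or error-correcting codes:

```python
def scrub(memory: list, golden: list) -> int:
    """Rewrite any word that deviates from the golden copy; return fix count."""
    fixes = 0
    for i, (word, good) in enumerate(zip(memory, golden)):
        if word != good:
            memory[i] = good  # soft error: rewriting restores correctness
            fixes += 1
    return fixes

golden = [0b1010, 0b0110, 0b1111]
memory = list(golden)
memory[1] ^= 0b0100  # inject an SEU: one flipped bit
assert scrub(memory, golden) == 1 and memory == golden
```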
2.4 FPGAs in Space Systems
FPGA configuration memories can be constructed using several different technologies
(e.g., antifuse, flash, and SRAM), each with performance and reliability tradeoffs.
Traditionally, antifuse-based FPGAs have been used in space systems to provide
simple processing capabilities and “glue logic” to interconnect multiple peripherals
or to combine/replace the functionality of multiple ASICs. Antifuse-based FPGAs are
one-time programmable where the configuration process creates/fixes the FPGA’s
physical routing interconnect structure. This configuration process provides an inherent
level of fault tolerance from SEUs since the antifuse-based routing cannot be reversed.
Additionally, many commercially available antifuse-based FPGAs (e.g., Actel RTSX-SU
[Actel 2010a]) include replicated flip-flop cells to prevent upsets in the sequential logic.
Antifuse devices also generally have a high TID threshold and immunity to SELs.
Unfortunately, when compared to flash-based or SRAM-based FPGAs, antifuse-based
FPGAs contain a relatively small amount of available logic gates and the fixed-logic
structure limits performance potential.
Flash-based FPGAs (e.g., Actel RT-ProASIC3 [Actel 2010b]) attempt to maintain
an antifuse-based FPGA’s reliability while increasing the amount of configurable
logic (logic available for a system designer to implement application functionality) and
allowing reconfiguration. Flash-based FPGA configuration memories are composed
of radiation-tolerant flash cells that provide reliability for combinational logic; however,
system designers must insert sequential logic replication to fully protect the FPGA from
faults. Even though flash-based FPGAs can be fully reconfigured to support multiple
applications, flash-based FPGAs do not support the PR capability that is available on
some SRAM-based FPGAs due to a lack of architectural and vendor support for such a
capability. Concerns over the TID effects on flash-based logic (floating-gate transistors)
have prevented wide-spread acceptance of flash-based FPGAs in space systems [Wang
2003].
SRAM-based FPGAs are the most radiation-susceptible type of FPGA since the
design functionality is stored in vulnerable SRAM cells (i.e., configuration memory),
and configuration memory upsets cause functional changes to the design’s logic.
Traditionally, this vulnerability has prevented SRAM-based FPGAs from being used in
highly critical space applications; however, some space systems use space-qualified
SRAM-based FPGAs for onboard processing. Space-qualified FPGAs are similar to
COTS FPGAs but are produced using epitaxial wafers, use ceramic, hermetically-sealed
packaging, and have been tested to ensure that damaging SEL events will not
compromise the system. These devices are rated for TID levels high enough to be used
in space systems [Xilinx 2010c]. Even with these reliability techniques, these space
systems must still use several fault-mitigation strategies, such as TMR and configuration
scrubbing, to ensure that system upsets are detectable and recoverable. (We note that
unless specified, all further FPGA references implicitly refer to SRAM-based FPGAs.)
Traditional FPGA-based space system designs leverage spatial TMR. TMR uses
a reliable majority voter connected to three identical module replicas in order to detect
and mask errors in any one module. In the context of TMR, a module refers to the
functional unit being replicated, which can range from a single logic gate to an entire
device. There are two primary TMR variations: external and internal. External TMR
uses three independent FPGAs working in lockstep where each FPGA implements a
module replica and the outputs are connected to an external radiation-hardened voter
that compares the results. External TMR requires significant hardware overhead
(each protected module is triplicated and board layout complexity is significantly
increased), but is reliable. The RCC board produced by SEAKR engineering [Troxel
et al. 2008] uses external TMR to provide reliable computation using Xilinx Virtex-4
FPGAs. Alternatively, internal TMR creates three identical modules within a single
FPGA, and the majority voter resides internally or externally [Carmichael et al. 1999].
Internal TMR can reduce the number of physical FPGAs required to implement a space
application, but may increase the chance of a common-mode failure, where multiple
modules fail simultaneously from a single fault. For example, a SEFI may cause multiple
internally-replicated modules to fail, whereas externally-replicated FPGAs would be
immune. Several tools assist system designers in incorporating TMR into space system
designs [Pratt et al. 2006; Xilinx 2004].
In addition to TMR, configuration scrubbing prevents error accumulation in FPGA
configuration memory. While TMR masks individual errors, TMR does not correct the
underlying fault and cannot correct errors that occur in multiple modules. Scrubbing
uses an external device to read back the FPGA’s configuration memory and compares
the read configuration memory to a known “good” copy. Alternatively, some FPGAs
calculate an error correction code (ECC) during configuration read-back for every
configuration frame, which can be used to detect and correct configuration faults. If a
mismatch is detected, the correct configuration can be written using PR without halting
the entire FPGA operation [Xilinx 2010a]. Traditionally, scrubbing is performed by an
external radiation-hardened microcontroller to ensure reliability of the reconfiguration
process. However, a self-scrubber may be implemented within the FPGA using the ICAP
available in Xilinx Virtex-4 and newer devices. Xilinx has also developed a single-event
mitigation (SEM) IP core which can perform configuration memory error detection and
correction for user designs [Xilinx 2013b].
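Readback scrubbing reduces to a compare-and-rewrite loop over configuration frames. In this hedged Python sketch, `read_frame` and `write_frame` are hypothetical device-access callbacks standing in for the real SelectMAP/ICAP frame accesses (or per-frame ECC checks):

```python
def scrub(read_frame, write_frame, golden_frames):
    """Compare each configuration frame against a known-good copy and
    rewrite (via partial reconfiguration) any frame that differs."""
    corrected = []
    for addr, golden in enumerate(golden_frames):
        if read_frame(addr) != golden:
            write_frame(addr, golden)   # repair only the corrupted frame
            corrected.append(addr)
    return corrected                    # frame addresses that were repaired
```

Because only mismatching frames are rewritten, the rest of the FPGA continues operating during the repair, which is the property that makes scrubbing compatible with continuous processing.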
Despite these drawbacks, SRAM-based FPGAs have been used in many space
systems, including earth-observing science satellites, communication satellites, and
satellites and rovers for the Venus and Mars missions. For instance, space-qualified
Xilinx Virtex-1000 (XQVR1000) devices were used on the Mars Exploration Rovers for
motor control functions and four XQR4000XLs were used for the lander pyrotechnics
[Ratter 2004]. Configuration read-back and scrubbing were used for detection and
correction of SEUs, and the full system was cycled once per Martian evening to remove
persistent errors. These systems, which contained very little logic as compared to
today’s standards and lacked the ability for PR, were not used for data processing.
The increased logic resources on recent FPGA families enable future systems to
use FPGAs for image processing and other computation-intensive applications. The
SpaceCube created at NASA Goddard Space Flight Center uses multiple commercial
Xilinx Virtex-4 FPGAs along with a radiation-hardened processor to provide onboard
processing capabilities for a variety of missions [Flatley 2010]. The SpaceCube provided
computational power for the Relative Navigation Sensors (RNS) experiment during
Hubble Servicing Mission 4 and is currently being used as an on-orbit test platform
aboard the Naval Research Laboratory’s MISSE-7 experiment on the International
Space Station (ISS). Each FPGA in the system contains a self-scrubber module,
protected with TMR, to correct errors and prevent error accumulation.
2.5 Low-Overhead Fault Tolerance Methods
While TMR is the most common fault tolerance method for FPGAs, the high area
overhead due to replicating modules partially negates some FPGA benefits. Therefore,
much research has focused on developing alternative, low-overhead fault-tolerance
methods for low-upset environments, such as replication-based fault tolerance [Laprie
et al. 1990], ECCs [Rao and Fujiwara 1989], and application-specific optimizations
to provide low-cost reliability. Many of these alternative techniques can detect errors
quickly, but may require additional processing or complete re-computation to correct
the errors. When the expected upset rates are low, the re-computation rates may be
acceptably low, depending on application throughput requirements.
Replication-based fault tolerance represents the most commonly used type of
fault-mitigation strategy, due to conceptual simplicity and high fault coverage. TMR
can detect single errors and can correct/mask single errors using a majority voter.
Duplication with compare (DWC) is an alternative replication-based method that
compares the outputs of duplicated modules. DWC reduces the resource overhead
by one half as compared to TMR, but DWC cannot correct errors and must fully
re-compute data when errors are detected [Johnson et al. 2008]. Shim et al. [2004]
proposed reduced-precision redundancy (RPR) for numerical computation. RPR
triplicates an application module as in TMR, but the replicas have lower precision
or only operate on the most-significant bits of application data, ensuring that the
most-significant bits are protected and errors in the least-significant bits are treated
as noise in the system. RPR reduced the number of detectable faults (fault coverage),
and resulted in significant resource savings while maintaining sufficient signal-to-noise
ratios for many DSP applications. Morgan et al. [2007] investigated the use of temporal
redundancy and “quadded logic” as additional methods for providing fault tolerance
through redundancy. However, due to inefficient mapping to the underlying FPGA
architecture, their methods did not provide fault tolerance improvement over TMR and
imposed a large area overhead. The effects of the fault-tolerant methods were offset by
the increased cross-sectional vulnerability of the larger FPGA designs.
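One possible reading of the RPR selection logic can be sketched numerically: two reduced-precision replicas bound the full-precision result, and a disagreement larger than the replicas' precision indicates a fault in the most-significant bits. The names and tolerance handling below are illustrative assumptions, not Shim et al.'s exact formulation:

```python
def rpr_vote(full: float, lo_a: float, lo_b: float, tol: float) -> float:
    """If the full-precision output agrees with both low-precision
    replicas to within `tol`, trust it; otherwise fall back to a
    low-precision value, treating sub-threshold discrepancies as noise."""
    if abs(full - lo_a) <= tol and abs(full - lo_b) <= tol:
        return full                     # MSBs verified by both replicas
    return lo_a if abs(lo_a - lo_b) <= tol else full
```

The resource savings come from the replicas operating on far fewer bits than the primary module, at the cost of leaving least-significant-bit errors undetected.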
Even though these replication-based methods for fault tolerance are suitable
for FPGAs, several new fault-tolerant FPGA architectures leverage an FPGA’s high
capacity and flexibility while maintaining reliability. Alnajiar et al. [2009] proposed a
hypothetical coarse-grained multi-context FPGA architecture that supported TMR, DWC,
and single-context modes. Their work explored the effects of soft errors and aging on
the proposed architecture, with an example Viterbi decoder module mapped to the
hypothetical architecture. Kyriakoulakos et al. [2009] proposed simple modifications
to the current Virtex-5 and Virtex-6 architectures to allow native support for DWC and
TMR. By adding an XOR-gate between each 5-input LUT within a larger 6-input LUT, the
existing architecture and synthesis tools required only minimal changes to support their
approach, while incurring only 17.5% to 76% slice utilization overhead for DWC or TMR,
respectively.
2.6 Algorithm-Based Fault Tolerance
Algorithm-based fault tolerance (ABFT) is a method that can be used with
many linear-algebra operations to provide fault tolerance without the use of explicit
replication. The ABFT method was originally described for matrix multiplication and
LU decomposition [Huang and Abraham 1984] but has been expanded to protect
other algorithms comprised of linear operations, such as QR decomposition, Fast
Fourier Transform [Tao and Hartmann 1993; Wang and Jha 1994], and finite element
analysis [Mishra and Banerjee 2003; Roy-Chowdhury et al. 1996]. The traditional
description of ABFT was designed for systolic arrays, but the method has also been
used in multiprocessor-based, high-performance computing [Yao et al. 2012].
ABFT augments an original data matrix with row and/or column checksums
and the linear-algebra operation is performed on the new, augmented matrix. If
the linear-algebra operation is computed successfully, the resulting augmented
matrix will contain valid, consistent checksums [Huang and Abraham 1984]. ABFT
checksum generation and comparison has lower computational complexity than the
primary linear-algebra operation. ABFT computational overhead is generally low, and
as a proportion of total computation, decreases as the matrix size increases. The
mathematical basis of ABFT for the matrix multiplication and FFT algorithms will be
shown in Section 3 and Section 4, respectively.
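The checksum property for matrix multiplication can be demonstrated directly: augmenting A with a column-checksum row and B with a row-checksum column yields a product whose checksums remain consistent, and a single data error violates them. A minimal Python demonstration (using exact integer arithmetic, so no rounding threshold is needed):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def add_col_checksum(A):          # append a row of column sums
    return A + [[sum(col) for col in zip(*A)]]

def add_row_checksum(B):          # append a column of row sums
    return [row + [sum(row)] for row in B]

def abft_consistent(C):
    """True when every row and column of the augmented product sums
    to its checksum element."""
    rows_ok = all(sum(row[:-1]) == row[-1] for row in C[:-1])
    cols_ok = all(sum(col[:-1]) == col[-1] for col in zip(*C))
    return rows_ok and cols_ok

C = matmul(add_col_checksum([[1, 2], [3, 4]]),
           add_row_checksum([[5, 6], [7, 8]]))
assert abft_consistent(C)      # fault-free computation: checksums hold
C[0][0] += 1                   # inject a single data error
assert not abft_consistent(C)  # checksum mismatch flags the error
```

Note that generating and checking the checksums costs O(n^2) operations against the O(n^3) multiplication itself, which is the source of ABFT's low relative overhead for large matrices.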
When using ABFT with floating-point algorithms, the introduction of rounding and
precision errors requires the use of a threshold comparison to differentiate rounding
errors from incorrectly computed data. The determination of a sufficient threshold is
significantly influenced by properties of the input data, and can affect error coverage and
the number of false positive results [Chowdhury and Banerjee 1996; Roy-Chowdhury
and Banerjee 1993]. Additionally, the original description of ABFT considered algorithms
executed on systolic arrays, with each processing element computing a single data
element of the result matrix. In these systems, a single fault could not propagate to other
processing elements and would only result in a single erroneous result element. Silva
et al. [1998] investigated these vulnerabilities in traditional ABFT implementations and
proposed methods for improving fault coverage using a Robust ABFT approach.
2.7 Task Scheduling
Task scheduling algorithms can be categorized as either online or offline. Offline
scheduling algorithms have complete knowledge of all tasks that must be scheduled.
The general scheduling optimization problem is NP-hard, but efficient heuristics exist,
and offline schedules can be pre-determined at compile-time. With online scheduling,
tasks arrive at the scheduler periodically over time, and must be placed around
previously scheduled tasks. Task arrival rates and patterns can greatly affect the
quality of the scheduler’s results. Arndt et al. [2000] examined many online scheduling
algorithms for distributed parallel computers, and used simulation to evaluate their
performance. Of the several algorithms studied, the FirstFit algorithm, which prioritizes
scheduling by the arrival time of each task, provided good schedule lengths while
minimizing the average wait times.
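A minimal interpretation of such a FirstFit policy can be sketched as follows; the task model and the release/wait mechanics here are illustrative assumptions, not Arndt et al.'s exact simulator:

```python
import heapq

def first_fit(tasks, num_procs):
    """FirstFit sketch: tasks = [(arrival, duration, procs_needed)],
    already ordered by arrival time; each task starts as soon as
    enough processors are free. Returns (task_index, start_time)."""
    free = num_procs
    running = []                      # min-heap of (finish_time, procs)
    schedule = []
    for i, (arrival, duration, need) in enumerate(tasks):
        assert need <= num_procs, "task can never fit"
        t = arrival
        while True:
            # release processors from tasks finished by time t
            while running and running[0][0] <= t:
                free += heapq.heappop(running)[1]
            if free >= need:
                break
            t = running[0][0]         # wait for the next finish event
        free -= need
        heapq.heappush(running, (t + duration, need))
        schedule.append((i, t))
    return schedule
```

Because tasks are served strictly in arrival order, early arrivals are never delayed by later ones, which is the behavior behind the low average wait times reported for FirstFit.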
Real-time systems introduce additional constraints for task scheduling. Each task
must be completed by a deadline, otherwise the results will no longer be relevant or
needed. A hard deadline must be met, otherwise the system is considered failed. A firm
deadline can be missed, but the usefulness of the result after the deadline is zero. A
soft deadline can be missed, but the value of the result decreases after the deadline has
passed. For a hard real-time system all deadlines must be met, but the goal of a soft
real-time system is to meet as many deadlines as possible while optimizing for other
criteria. Traditionally, schedulers attempt to minimize criteria such as makespan (total
schedule length) or average task latency.
Han et al. [2003] created a fault-tolerant scheduling algorithm for periodic real-time
software tasks. For each primary task, an alternate less-precise task is also used to
generate a sufficient result before the deadline. These alternate tasks are scheduled as
close to the task deadline as possible. In the case of a primary task failure, the alternate
task will be executed. If the primary task succeeds, the alternate task is discarded.
Their algorithm is designed for offline use and was intended to protect systems against
software faults. Pathan [2006] extends the rate-monotonic scheduling algorithm to
support temporal TMR (RM-FT), scheduling multiple copies of tasks to mask faults in
any one copy. RM-FT also requires periodic tasks to perform a scheduling analysis.
Scheduling aperiodic real-time tasks is a more difficult problem and is currently being
studied.
Scheduling tasks for reconfigurable computing creates an additional level of
complexity. Instead of scheduling a task for a one-dimensional array of processors,
scheduling on a two-dimensional FPGA fabric becomes a constrained placement
problem. Additionally, tasks may have multiple hardware or software implementations,
increasing the overall search space. Banerjee et al. [2005] present an offline KLFM
heuristic [Kernighan and Lin 1970] which incorporates detailed placement information in
order to provide high-quality schedules. Mei et al. [2000] combine a genetic algorithm
to determine HW/SW placement with a traditional list scheduling algorithm to enable
online scheduling of real-time reconfigurable embedded systems. Steiger et al. [2004]
developed two heuristic scheduling algorithms, Horizon and Stuffing, which provide good
results while limiting the computational requirements.
CHAPTER 3
FRAMEWORK FOR RECONFIGURABLE FAULT TOLERANCE
RFT leverages COTS components in space systems to achieve high performance
and flexibility while maintaining reliability. Beyond traditional, spatial TMR, alternative
fault-mitigation methods may be appropriate given a particular application’s performance
requirements and set of expected environmental factors (e.g., radiation). Other
methods, such as temporal TMR, ABFT, software-implemented fault tolerance (SIFT), or
checkpointing and rollback, may be suitable for system-level protection. Each alternative
fault-mitigation method has tradeoffs between performance, reliability, and overhead,
and Pareto-optimal operation may change over a system’s lifetime. Therefore, the main
goal of our proposed RFT framework is to enable a system to autonomously adapt to
Pareto-optimal operation based on the current system’s environmental situation.
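The classical closed-form reliability of TMR with a perfect voter illustrates why the optimal mode shifts with the environment: TMR improves on a single module only while per-module reliability stays above 0.5, so heavier redundancy is not uniformly better. A quick numerical sketch (the rates below are illustrative, not mission data):

```python
import math

def module_reliability(lam: float, t: float) -> float:
    """R(t) = exp(-lambda * t) for a module with constant fault rate."""
    return math.exp(-lam * t)

def tmr_reliability(r: float) -> float:
    """TMR with a perfect voter survives when at least 2 of 3
    replicas are fault-free: R_TMR = 3R^2 - 2R^3."""
    return 3 * r**2 - 2 * r**3

r_benign = module_reliability(1e-4, 100.0)   # low upset rate
r_harsh = module_reliability(1e-1, 100.0)    # high upset rate
assert tmr_reliability(r_benign) > r_benign  # redundancy helps
assert tmr_reliability(r_harsh) < r_harsh    # redundancy hurts
```

The crossover at R = 0.5, combined with TMR's threefold resource cost, is exactly the kind of environment-dependent tradeoff the RFT framework is designed to navigate at runtime.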
Our RFT framework consists of three main elements: a PR-based hardware
architecture; a fault-rate model for estimating orbital SEU rates; and a methodology
for modeling RFT system performability. The hardware architecture, described in
Section 3.1, is similar to other traditional SoC architectures used for reconfigurable
computing, with multiple, identical PRRs, which are leveraged for module redundancy.
The hardware architecture allows the system to execute each processing module
(PRM) in several possible fault-tolerance modes, each with differing performance and
reliability characteristics. RFT software adapts the amount of per-module redundancy in
accordance with the current environment and temporarily pauses only the reconfigured
hardware modules to preserve and record application state while changing the
fault-tolerance mode, allowing the remainder of the system to continue processing.
Additionally, modules with constant state can continue processing during the adaptation
process. Section 3.2 presents a model for estimating expected fault rates in potential
space-system orbits. Finally, Section 3.3 presents a performability model that quantifies
the RFT benefits in environments with varying upset rates.
3.1 RFT Hardware Architecture
Figure 3-1 shows the high-level architecture of an FPGA-based SoC design with
PRRs integrated with an RFT controller. The main architectural components include
a microprocessor, a memory controller, I/O ports, PRRs (1 . . . N) for PRMs, and the
system interconnect, which connects all of these components to the microprocessor.
Since we leverage Xilinx FPGAs, the microprocessor is a MicroBlaze (less resource-intensive
processors such as PicoBlaze can also be used) and the system interconnect is a
processor local bus (PLB). The MicroBlaze orchestrates PRR reconfiguration using the
ICAP, maintains the state of the currently active PRMs, and initiates fault-tolerance mode
switching.
All architectural components except for the PRRs are protected using TMR since
these components’ functionality is crucial to the entire system’s reliability. Although the
ICAP cannot be replicated, the signals to and from the ICAP are also protected with
TMR. Tools such as Xilinx’s TMRTool [Xilinx 2004] or BYU’s EDIF-based TMR tool [Pratt
et al. 2006] automate TMR design creation by applying low-level TMR voting on the
design’s original, unprotected netlist. For additional SEU protection, FPGA configuration
scrubbing should be performed in order to prevent error accumulation. Scrubbing can
be performed with an external scrubber and radiation-hardened configuration storage or
with an internal scrubber using the internal configuration ECC present in Virtex-4 and
later devices.
Each PRM uses a PLB-compatible interface that connects to the RFT controller
if the PRM is an RFT-enabled module (a module replicated/instantiated by the RFT
controller), or directly to the PLB otherwise. The RFT
controller instantiates the bus macros or other low-level components required for
interfacing with the PRRs. The RFT controller also contains multiple majority voters
and comparators (voting logic) that can be used to detect or correct errors by evaluating
the replicated PRMs’ outputs. The RFT-enabled modules are used in parallel to create
redundancy-based, fault-protection modes (e.g., DWC, TMR) by interfacing with the RFT
controller’s voting logic. Additionally, other single-module fault-protection modes, such as
ABFT, can be used for individual PRRs and the RFT controller provides additional fault
tolerance components, such as watchdog timers, to detect hanging conditions within
PRMs.
Table 3-1 lists the currently supported fault-tolerance modes of RFT and the modes’
fault-tolerance type and PRR requirements. The MicroBlaze evaluates the system’s
current performance requirements and monitors external stimuli (radiation) using
external sensors to determine when the fault-tolerance mode should be switched, at
which time the MicroBlaze reconfigures the appropriate PRRs and the RFT controller’s
internal voting scheme between the PRRs’ outputs for the new fault-tolerance mode.
3.1.1 RFT Controller Operation
Figure 3-2 illustrates the interface between PRRs and the PLB for an RFT controller
that can operate in single-module and redundancy-based fault-tolerance modes.
This RFT controller interface routes input signals from the system interconnect
to the appropriate PRRs and routes voting output signals back to the PLB. The
abstraction of this logic into the RFT controller enables pre-existing PRMs to leverage
the fault-tolerance modes and interface with the RFT controller with minimum modifications.
To communicate data and control signals between the MicroBlaze, the PRRs, and the
RFT controller, at design time, the system designer assigns a large memory-mapped
region of the MicroBlaze’s address space to the RFT controller. In order to route
signals to specified PRRs, the system designer also subdivides this region into smaller
subregions and assigns a subregion to each PRR, while taking into consideration the
memory interface requirements of each potential PRM. Each PRM must implement
the actual memory interface (e.g., dual-port RAM, FIFO-based interface), which allows
pre-existing PRMs to interface with the RFT controller with minor modifications since no
specific interface is required. When communication data from the MicroBlaze arrives
over the PLB, the RFT controller’s address decoder processes the input address and
determines the destination PRR(s) based on the current fault-tolerance mode. While
operating in single-module fault-tolerance modes, the data is passed directly to the PRR
specified by the decoded address. While operating in redundancy-based fault-tolerance
modes, the data is passed to multiple PRRs using appropriate enable signals to the
PRRs’ interfaces.
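The decoding step can be summarized as a small function. The base address, subregion size, and the grouping of consecutive PRRs into DWC pairs and TMR triples below are illustrative assumptions consistent with the six-PRR prototype described later, not the controller's actual register map:

```python
RFT_BASE = 0xC000_0000    # hypothetical memory-mapped base address
PRR_SIZE = 0x0001_0000    # hypothetical per-PRR subregion size
NUM_PRRS = 6

def decode(addr: int, ft_mode: str):
    """Map a PLB address to its destination PRR(s). In redundancy-based
    modes the same write fans out to every replica in the group."""
    offset = addr - RFT_BASE
    prr = offset // PRR_SIZE
    assert 0 <= prr < NUM_PRRS, "address outside RFT region"
    if ft_mode == "TMR":
        group = prr - prr % 3         # triples of neighboring PRRs
        return [group, group + 1, group + 2]
    if ft_mode == "DWC":
        group = prr - prr % 2         # pairs of neighboring PRRs
        return [group, group + 1]
    return [prr]                      # single-module mode
```

Fanning out writes in the decoder, rather than in software, is what keeps the replicas' input streams identical without any changes to the PRMs themselves.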
The RFT controller routes the outputs from each individual PRR, the outputs
from the voting logic via the FT-mode register, or recorded internal status information
over the PLB to the MicroBlaze. When the MicroBlaze requests a PRR’s output, the
RFT controller’s output multiplexer (Output Mux in Fig. 3-2) selects the appropriate
PRR output to route to the PLB controller based on the decoded address from the
MicroBlaze and the current FT-Mode value. In single-module fault tolerance modes,
the RFT controller’s output multiplexer routes signals directly from the PRRs to the
PLB controller, which bypasses the RFT’s internal voting logic. In redundancy-based
fault-tolerance modes, the output multiplexer routes the verified outputs from the RFT
controller’s voting logic to the PLB controller.
In addition to providing routing and voting logic for redundancy-based fault-tolerance
modes, the RFT controller supports additional fault-tolerance capabilities that do not
require redundant PRMs. If the system is operating in a single-module fault-tolerance
mode, each PRM may use the RFT controller’s watchdog timers or perform internal
fault detection. The RFT controller provides each PRR with an optional watchdog timer
interface using a signal that must be asserted periodically within a user-defined time
interval (usually on the order of seconds). If the PRM does not assert the watchdog
timer reset signal within the time interval, the RFT controller’s interrupt generator
asserts an interrupt signal to the MicroBlaze, alerting the MicroBlaze of a possible
failed PRR. Additionally, PRMs that perform internal fault detection must include an
interrupt signal to notify the RFT controller of internally detected errors, which the RFT
controller propagates to the MicroBlaze. PRMs using ABFT perform self-checking (or
self-correction if the ABFT operation permits) on internally generated checksums and
send an interrupt signal to the RFT controller when data errors are detected. Similarly,
PRMs that use internal TMR can also signal detected and corrected errors to the
RFT controller. PRMs without any specific fault-tolerance features can also use the
RFT controller’s watchdog timer, which still allows the RFT controller to detect module
hang-ups or other operational errors.
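The watchdog interface described above reduces to a simple down-counter; this Python sketch models the behavior (the tick granularity and names are illustrative, and the real counter is hardware inside the RFT controller):

```python
class WatchdogTimer:
    """The PRM must kick the timer within `interval` ticks; otherwise
    `expired` latches, modeling the interrupt to the MicroBlaze that
    flags a possibly hung module."""
    def __init__(self, interval: int):
        self.interval = interval
        self.remaining = interval
        self.expired = False

    def kick(self):                   # PRM asserts the reset signal
        self.remaining = self.interval

    def tick(self):                   # one elapsed time step
        self.remaining -= 1
        if self.remaining <= 0:
            self.expired = True       # interrupt: possible failed PRR
```

A hang is the one failure mode redundancy voting cannot see (no output is produced to vote on), which is why even unprotected PRMs benefit from the watchdog.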
3.1.2 MicroBlaze Operation
In an RFT system, the MicroBlaze enables additional operational features, such
as fault-tolerance methods to complement the fault-tolerance modes offered by the
RFT controller. Additionally, the MicroBlaze maintains the system’s fault-tolerance state
(e.g., active PRMs, current PRRs’ fault-tolerance modes, etc.) and orchestrates PRR
reconfiguration and the fault-tolerance mode switching process.
To protect the application state while reconfiguring PRRs, the MicroBlaze provides
support for PRM checkpointing and rollback. A checkpoint, which the MicroBlaze
stores in external memory, consists of the minimum set of application state information
needed to restore the application’s current state. In the event of a fault, an application
can use the previous checkpoint to roll back to a known good state, instead of wasting
execution time by beginning execution at an initial starting state. A PRM’s state can
be checkpointed if the PRM can be read and modified by the MicroBlaze. In an RFT
system, the state of all PRMs should be checkpointed periodically in order to reduce
wasted computation in the event of a fault-induced PRM restart. The stored checkpoints
can then be used during fault recovery procedures to improve system availability.
In addition to handling PRM checkpointing, the MicroBlaze also handles fault
recovery and reconfiguration procedures. If the RFT controller’s voting logic detects
faults in the PRMs’ outputs, the RFT controller records fault status information about the
faulty PRM (e.g., error location, time, etc.) in internal FT-status registers and sends an
interrupt to the MicroBlaze, which initiates the reconfiguration procedure. The FT-status
registers record fault status information that may be used by the MicroBlaze to make
fault-tolerance mode decisions or to provide system log information to system operators.
For single-module fault-tolerance modes, the MicroBlaze corrects the faulty PRM by
reconfiguring the PRR with the PRM’s original bitstream over the ICAP. For the DWC
fault-tolerance mode, both of the associated PRMs must be reconfigured since the faulty
PRM cannot be identified. If checkpoints exist for the PRMs, the MicroBlaze initializes
the new PRMs using these checkpoints. Alternatively, if the RFT system is operating in
the TMR fault-tolerance mode, the MicroBlaze checkpoints one of the non-faulty PRMs
while the faulty PRM is reconfigured. After the non-faulty PRM has been checkpointed,
the RFT controller pauses the non-faulty PRMs using clock gating to keep the PRMs
synchronized. Once the faulty PRM has been reconfigured, the MicroBlaze initializes the
newly reconfigured PRM from the most recent checkpoint, and the RFT controller can
re-enable the clocks for all three of the replicated PRMs to resume operation.
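The TMR recovery sequence can be summarized as an ordered procedure. Here `hw` is a hypothetical handle bundling the checkpoint, clock-gating, and ICAP operations the text describes; in the real system these steps are split between the MicroBlaze and the RFT controller, and checkpointing may overlap the reconfiguration:

```python
def recover_tmr_fault(faulty_prr, healthy_prrs, hw):
    """Sketch of TMR fault recovery: repair one replica and rejoin it
    to the group without restarting the application from scratch."""
    state = hw.checkpoint(healthy_prrs[0])     # save a known-good state
    hw.gate_clocks(healthy_prrs)               # pause replicas to stay in sync
    hw.reconfigure(faulty_prr)                 # rewrite the PRM bitstream via ICAP
    hw.restore(faulty_prr, state)              # initialize from the checkpoint
    hw.enable_clocks([faulty_prr] + healthy_prrs)  # resume all three replicas
```

The key point is that the checkpoint comes from a *healthy* replica, so the repaired module rejoins the group with state identical to its peers and voting can resume immediately.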
Finally, the MicroBlaze orchestrates the switching procedure between different RFT
fault-tolerance modes. Fault-tolerance mode switching can be triggered by external
events, by a priori knowledge of the operating environment, or by application-triggered
events, and the fault-tolerance mode switching procedure may vary on a per-system
basis depending on the specific system performance and reliability requirements. The
MicroBlaze can use information from the fault-status registers to make Pareto-optimal
decisions about future fault-tolerance modes. Before fault-tolerance mode switching,
the MicroBlaze ensures that sufficient PRRs are available for the new fault-tolerance
mode. Additional PRRs may be required or PRRs may be freed when switching from a
single-module fault-tolerance mode to a redundancy-based fault-tolerance mode or vice
versa, respectively. The MicroBlaze signals the RFT controller to change fault-tolerance
modes by writing to the RFT controller’s FT-Mode register. The MicroBlaze reconfigures
the PRRs involved in the fault-tolerance mode switching via the ICAP with partial
bitstreams for the appropriate PRM or a blank partial bitstream if the PRR is not required
for the new fault-tolerance mode. When the reconfiguration process is complete, the
MicroBlaze signals the RFT controller, by rewriting the FT-Mode register, to resume PRM
operation.
3.1.3 Environment-Based Fault Mitigation
RFT fault-tolerance mode switching can be triggered by a priori knowledge of the
operating environment, by application-triggered events, or by external events. While a
priori knowledge and application-triggered events are convenient for modeling purposes,
real-world systems leverage measurements from attached sensors to determine the
system’s current environmental status. Additionally, due to the unpredictability of space
weather conditions, such as solar flare events, an RFT system must be able to respond
dynamically to the changing environment.
In an RFT system, the current expected fault rate can be estimated either directly or
indirectly. An external radiation sensor can be directly interfaced with the FPGA, allowing
the MicroBlaze to track the current fault rate and predict future fault rates. Alternatively,
the RFT system can indirectly determine fault rates by tracking the number of data and
configuration faults detected during operation. Since an FPGA’s fabric is composed of
large SRAM arrays, these arrays can be used as makeshift radiation detectors. If a fault
is detected in the FPGA configuration or data memory, either through readback during
scrubbing or from the RFT controller’s logic, the fault can be recorded or can be used to
make decisions about which RFT fault-tolerance mode to use. One simple, rule-based
method for choosing the RFT fault-tolerance mode is to use a sliding window approach.
For instance, a system may use the following rules:
1. If there were any faults in the past 5 minutes, use the DWC fault-tolerance mode.

2. If there were more than 5 faults in the past 5 minutes, use the TMR fault-tolerance
mode.

3. Only transition from the TMR fault-tolerance mode to a lower-reliability mode after
5 minutes of fault-free operation.
The size of the sliding window and the RFT fault-tolerance mode choices are system-
and mission-dependent. A larger window size can provide a more conservative
fault-tolerance strategy, while a small window size can more quickly adapt to spikes
in the experienced fault rate. Implementing a rule with hysteresis, such as rule (3), can
produce a more reliable strategy, while requiring a smaller window size.
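The sliding-window rules above can be sketched directly. The 5-minute window and fault thresholds are the example's illustrative values, and the mode names follow the RFT modes discussed earlier (HP denoting an unprotected high-performance mode):

```python
from collections import deque

class SlidingWindowPolicy:
    """Rule-based fault-tolerance mode selection over a sliding window
    of fault timestamps (all values are the example's, not fixed by RFT)."""
    WINDOW = 5 * 60                 # seconds

    def __init__(self):
        self.faults = deque()       # timestamps of detected faults
        self.mode = "HP"

    def record_fault(self, now: float):
        self.faults.append(now)

    def select_mode(self, now: float) -> str:
        while self.faults and now - self.faults[0] > self.WINDOW:
            self.faults.popleft()   # drop faults outside the window
        n = len(self.faults)
        if n > 5:
            self.mode = "TMR"       # rule 2
        elif n >= 1:
            if self.mode != "TMR":  # rule 3 hysteresis: hold TMR while
                self.mode = "DWC"   # any fault remains in the window
        else:
            self.mode = "HP"        # 5 fault-free minutes: step down
        return self.mode
```

The hysteresis branch ensures the policy only leaves TMR once the window is completely empty, matching rule (3).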
3.1.4 RFT Controller Resource and Performance Overheads
To quantify the RFT controller’s resource overhead, we implemented an RFT-based
system on a Virtex-4 FX60-based platform, similar to the SpaceCube. The design uses
six PRRs connected to a MicroBlaze through a PLB-connected RFT controller and
operates at 100 MHz. Each PRR contains 2,000 slices for user logic. Each PRR can
operate in high-performance (HP) mode, DWC-mode with the PRR’s neighboring PRRs,
or TMR-mode with two consecutive, neighboring PRRs. The resources (LUT, FF, and
slice) required for the RFT controller and constituent modules are detailed in Table 3-2.
The final column of Table 3-2 shows the percentage of the Virtex-4 FX60 used by each
module. The RFT controller logic, excluding the PLB controller that would be required
for non-RFT designs, requires approximately 900 slices. The largest modules within
the RFT controller are the (optional) watchdog timers and the voting logic. Overall, the
complete RFT controller uses approximately 5% of the total FPGA.
Since frequent PRR reconfiguration combined with lengthy PRR reconfiguration
time can impose a performance overhead, we measured the reconfiguration of a
single PRR. The PRR reconfiguration time was measured using timing functions on a
MicroBlaze with a PLB-attached ICAP controller running at 100 MHz. We measured
the PRR reconfiguration time for a 2,000 slice PRR as approximately 15 ms. Even
though the RFT mode-switching overhead is dominated by this PRR reconfiguration
time, the mode-switching time incurs a negligible performance penalty due to the likely
low frequency of mode-switching. Additionally, the switching overhead only occurs when
increasing the fault tolerance redundancy (i.e., switching from DWC to TMR). When
decreasing the fault tolerance redundancy, the retained modules can continue running
without interruption.
3.2 RFT Fault-Rate Model
In order to analyze RFT’s reliability benefits, a suitable fault-rate model that
incorporates varying fault rates, system performance capabilities, and system fault-tolerance
modes is required. Traditional reliability analysis focuses on quantifying permanent
hardware failures in systems with long lifetimes. Since most processors and FPGAs
have sufficiently long lifetimes, permanent hardware failures due to long-term, end-of-life
failures can be ignored. Therefore, reliability analysis for space systems focuses on
short-term failure analysis by modeling SEU-induced computational failures. For
FPGA-based systems, these failures can cause either single or multiple data errors
through corrupted data or configuration memory. FPGA upset rates for space systems
are correlated with the magnetic field strength of the system’s current orbital position.
For example, as a system passes into the Van Allen radiation belt, the trapped charged
particles in the belt have a higher likelihood of interacting with the system. For systems
in a low-earth orbit, only a portion of the orbit passes through the inner Van Allen belt,
at a location known as the South Atlantic Anomaly (SAA). The SAA is the low-altitude
region where the inner Van Allen belt dips closest to the Earth, exposing systems to a
large number of trapped particles.
Existing fault-rate modeling generally produces a single, average orbital upset
rate; however, this average is not sufficient for RFT. In order for RFT to adapt the
fault-tolerance mode, the fault-rate model must estimate the expected fault rates based
on the instantaneous orbital position. To account for the orbit-dependent, time-varying
fault rates, data from several sources can be combined to form a more accurate
estimate than a single average rate. Our fault-rate model combines orbital position
and trajectory, magnetic field strength, and cosmic ray particle interaction data to provide
an accurate estimate of instantaneous fault rates. We use these fault-rate estimates as
input into multiple system-level Markov models in order to calculate reliability, availability,
and performability of RFT systems.
Our RFT fault-rate model combines three existing models to estimate time-varying
fault rates. Figure 3-3 illustrates the three models used as well as the inputs and
outputs to each model. To generate accurate fault estimates, a system’s time-varying
orbital position must be modeled. Orbital position can be estimated using the SGP4
model, which is a simplified general perturbation modeling algorithm. NORAD, which is
responsible for tracking space objects and space debris, developed SGP4 for tracking
near-earth satellites [Hoots and Roehrich 1980]. SGP4 accurately calculates a satellite’s
position given a set of orbital elements (apogee, perigee, inclination, etc.) collectively
referred to as a two-line element (TLE). Given a system’s TLE, the RFT fault-rate model
uses SGP4 to generate the system’s position information, in terms of latitude, longitude,
and altitude, over the user-defined modeling time period.
Next, the RFT fault-rate model passes the SGP4 positioning information for each
point along a specified orbit to the International Association of Geomagnetism and
Aeronomy’s (IAGA) International Geomagnetic Reference Field (IGRF) model [Maus
et al. 2005], which models the Earth’s magnetosphere. IGRF combines magnetosphere
data collected from satellites and observatories around the world in order to create the
most accurate and up-to-date model possible (the model is updated every five years).
For a given orbital position, IGRF outputs a McIlwain L-parameter, representing the set
of magnetic field lines that cross the Earth’s magnetic equator at a number of Earth-radii
equal to the value of the L-parameter. The inner Van Allen radiation belt corresponds
to L-values between 1.5 and 2.5. The outer Van Allen belt corresponds to L-values
between 4 and 6. In addition to identifying regions with trapped particles, the McIlwain
L-parameter can be used to estimate the effect of geomagnetic shielding (cutoff rigidity)
from galactic cosmic rays. The estimated L-parameters are then used, along with the
outputs of CREME96, to estimate SEU rates.
The CREME96 model [Tylka et al. 1997], a publicly available SEU estimation
tool, generates fault-rate estimates and has been used extensively to predict heavy ion
and proton-induced SEU rates, as well as estimate the expected total ionizing dose in
modern electronics. CREME96 combines orbital parameters, space system physical
characteristics, and silicon device process information to create a highly accurate SEU
simulation. Traditionally, CREME96 generates an average fault rate for a particular orbit,
obtained by averaging hundreds of orbits together. However, CREME96 can also
generate fault rates for orbital segments, which we delimit using the McIlwain
L-parameter. By running several simulations over very narrow L-parameter ranges,
we obtain an estimated fault rate for each segment. As the width of each
segment decreases, the generated fault rates become more precise and continuous.
The L-parameter outputs from the IGRF model are then mapped to the appropriate
orbital segment and the associated fault rate.
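The mapping from an L-parameter to its pre-computed segment fault rate is a simple interval lookup, which might be sketched as follows (the segment boundaries, rates, and function name below are invented placeholders for illustration, not CREME96 output):

```python
import bisect

# Hypothetical pre-computed table: upper L-bound of each segment and an
# invented upsets/hour estimate for that segment (placeholder values only).
l_bounds  = [1.2, 1.5, 2.0, 2.5, 4.0, 6.0]
seg_rates = [0.02, 0.10, 0.80, 0.30, 0.05, 0.40]

def fault_rate_for_l(l_value: float) -> float:
    """Map a McIlwain L-parameter to its segment's fault-rate estimate."""
    i = bisect.bisect_left(l_bounds, l_value)
    return seg_rates[min(i, len(seg_rates) - 1)]

print(fault_rate_for_l(1.8))  # inner-belt segment -> 0.8
```

Narrower segments simply mean longer `l_bounds`/`seg_rates` tables and a smoother rate profile.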
The SGP4, IGRF, and CREME96 models collectively generate a time-varying
fault-rate estimate over the course of a specified orbit. We implemented the RFT
fault-rate model as a C++-based program that connects the separate models together
and passes data between them. SGP4’s algorithms, along with C++/Java reference
code, are publicly available via the Internet. A FORTRAN-based implementation of
the IGRF algorithm for calculating position-based McIlwain L-parameters is publicly
available from NASA Goddard [Macmillan and Maus 2010]. Fault-rate information is
generated from the CREME96 model through a web-based interface and stored
for efficient offline use. Orbits described by TLEs can be visualized using open-source
tools, such as JSatTrak [Gano 2010]. The RFT fault-rate model program accepts TLE
data as input and generates time-varying fault-rate estimates as output. These fault-rate
estimates are then used by the RFT performability model.
3.3 RFT Performability Model
System reliability is the probability that a system is operating without faults after a
specified time period. Assuming exponentially-distributed random faults at rate λ, the
system reliability is traditionally defined as:
R(t) = e^{-\lambda t}    (3–1)
Mean-time-to-failure (MTTF) and mean-time-to-repair (MTTR) are the average amounts
of time before a system encounters a failure (or repair) event. For a fault rate λ (or repair
rate µ), MTTF (or MTTR) is defined as:

MTTF = \int_0^{\infty} R_\lambda(t)\, dt = \frac{1}{\lambda} \qquad MTTR = \int_0^{\infty} R_\mu(t)\, dt = \frac{1}{\mu}    (3–2)
System availability, which is similar to system reliability, estimates the long-term,
steady-state probability that the system is operating correctly, and is defined as:
A = \frac{MTTF}{MTTF + MTTR}    (3–3)
System unavailability is the opposite of availability, and is often used for convenience
when discussing systems with very high availability. Unavailability is defined as:
UA = 1 - A = \frac{MTTR}{MTTF + MTTR}    (3–4)
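These closed-form relations can be checked numerically, as in the minimal sketch below (the rates are arbitrary placeholders, not values from this work):

```python
import math

def reliability(lam: float, t: float) -> float:
    """R(t) = exp(-lambda * t): probability of no failure by time t."""
    return math.exp(-lam * t)

def availability(lam: float, mu: float) -> float:
    """Steady-state availability A = MTTF / (MTTF + MTTR)."""
    mttf = 1.0 / lam   # mean time to failure for exponential faults
    mttr = 1.0 / mu    # mean time to repair for exponential repairs
    return mttf / (mttf + mttr)

lam, mu = 0.1, 1.0  # placeholder fault and repair rates (per hour)
print(reliability(lam, 10.0))       # R(10)
print(availability(lam, mu))        # equivalently mu / (lam + mu)
print(1.0 - availability(lam, mu))  # unavailability UA = 1 - A
```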
Space system reliability and availability can be accurately modeled using Markov
models. A Markov model is composed of states and transition rates. A state represents
the current operating state of the system and the transition rates represent the
transitions from an operating state to a failure state (i.e., failure rates), or from a failure
state to an operating state (i.e., repair rates). The Markov model can be transformed
into a series of equations, which can be solved or approximated numerically using
tools such as SHARPE, an open-source fault-modeling tool [Sahner and Trivedi 1987],
to determine probabilities of each state. System reliability and availability can be
directly determined from the calculated state probabilities. For Markov models, the
instantaneous availability of repairable systems measures the probability of being in
an “available” vs. “failed” state at a given point in time. These types of models are
frequently used to estimate the effects of TMR, scrubbing, and other fault-tolerance
methods in FPGAs and other electronics [Dobias et al. 2005; Garvie and Thompson
2004; Pratt et al. 2007].
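As a concrete sketch of the simplest such model, a single repairable component with fault rate λ and repair rate µ forms a two-state Markov chain whose instantaneous availability has a closed form (an illustration only; the multi-PRR models used in this work have more states):

```python
import math

def instantaneous_availability(lam, mu, t, p_up0=1.0):
    """P(up at time t) for a two-state CTMC with failure rate lam,
    repair rate mu, and initial up-probability p_up0."""
    a_ss = mu / (lam + mu)  # steady-state availability
    return a_ss + (p_up0 - a_ss) * math.exp(-(lam + mu) * t)

# Starting in the "available" state, availability decays from 1.0
# toward the steady-state value mu / (lam + mu).
print(instantaneous_availability(0.1, 1.0, 0.0))    # 1.0
print(instantaneous_availability(0.1, 1.0, 100.0))  # ~0.909
```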
Markov reward models, a type of weighted Markov model, can be used to extend
the concept of system availability to measure system performability of adaptable
systems [Ciardo et al. 1990; Meyer 1982]. Performability is a metric that combines
system availability with the amount of work produced by the system and gives a
measure of total work performed. Performability is especially useful for gracefully
degradable systems or other systems whose characteristics change over time.
Assuming that X(t) is a semi-Markov process with state space S, continuous over
time t > 0, the instantaneous performability is defined by

Performability(t) = \sum_{a \in S} Perf(a) \cdot P\{X(t) = a\}    (3–5)

where Perf(a) is the system performance in state a. System performance can be
defined using any desired performance metric (e.g., throughput, execution time)
and performability is measured in the same units. In this context, instantaneous
availability can be viewed as the special case where Perf(a) = 1 in available states
and Perf(a) = 0 otherwise.
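A performability computation over a state-probability vector follows directly from Eq. 3–5, as in this sketch (the states, rewards, and probabilities are illustrative placeholders, not results from this work):

```python
def performability(probs, rewards):
    """Sum of Perf(a) * P{X(t) = a} over all states a."""
    assert abs(sum(probs.values()) - 1.0) < 1e-9  # sanity: probabilities sum to 1
    return sum(rewards[a] * p for a, p in probs.items())

# Hypothetical 3-state system: two modules working, one working, all failed.
probs   = {"2up": 0.90, "1up": 0.08, "0up": 0.02}
rewards = {"2up": 2.0,  "1up": 1.0,  "0up": 0.0}
print(performability(probs, rewards))  # ~1.88

# Availability is the special case Perf(a) = 1 for available states.
avail_rewards = {"2up": 1.0, "1up": 1.0, "0up": 0.0}
print(performability(probs, avail_rewards))  # ~0.98
```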
For reconfigurable FPGA systems where the system configuration changes over
time (e.g., our RFT architecture), the system must be modeled as a phased-mission
system. A phased-mission system is described using a set of unique models for each
phase of the mission. The states at the end of a given phase’s model map to the states
of the following phase’s model during phase transitions and the phase duration can be
modeled as either probabilistic or deterministic [Alam et al. 2006; Kim and Park 1994].
We use the RFT fault-rate model’s generated fault-rate estimates to drive multiple
system-level Markov models in order to calculate reliability, availability, and performability
of RFT systems. In order to incorporate varying fault rates (due to orbital position)
and varying system topologies (due to RFT), we leverage a phased-mission Markov
approach. We model the RFT system as a collection of individual phases, with each
phase consisting of a period of time where the fault-tolerance mode, failure rates,
and repair rates are constant. The phase lengths are both application-dependent
and orbit-dependent. At the end of each phase, the pre-transition state probabilities
are mapped onto initial probabilities for the post-transition Markov model. Figure 3-4
illustrates an example high-level model transitioning from TMR to DWC at time t1, and
then transitioning from DWC back to TMR at time t2. Fault rates (λ) and repair rates
(µ) are represented as directed graph edges between states. At each phase transition
(denoted by the dashed vertical lines), state probabilities are re-mapped (dashed
arrows). When re-mapping state probabilities from TMR to DWC, two TMR states are
merged into a single operational state. When re-mapping from DWC to TMR, the single
operational DWC state maps to the most-similar TMR state.
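The state re-mapping at a phase transition amounts to a simple transfer of probability mass, as in this minimal sketch (the state names and probabilities are invented placeholders; the real models contain more states):

```python
def tmr_to_dwc(p_tmr):
    """Merge the two operational TMR states (all-up and degraded)
    into the single operational DWC state; failed maps to failed."""
    return {"ok": p_tmr["3up"] + p_tmr["2up"], "failed": p_tmr["failed"]}

def dwc_to_tmr(p_dwc):
    """Map the single operational DWC state onto the most-similar
    (fully operational) TMR state."""
    return {"3up": p_dwc["ok"], "2up": 0.0, "failed": p_dwc["failed"]}

p_tmr = {"3up": 0.85, "2up": 0.10, "failed": 0.05}
p_dwc = tmr_to_dwc(p_tmr)
print(p_dwc)             # operational mass merged: ok ~0.95
print(dwc_to_tmr(p_dwc)) # mapped back onto the fully operational TMR state
```

Note that total probability is conserved in both directions, which is what keeps the availability function continuous across phase transitions.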
Each fault-tolerance mode has an associated Markov model. Each state of the
Markov model corresponds to the number of operational devices in a system (e.g.,
PRRs in an FPGA-based SoC). Transitions between states occur when a device
changes state from operational to failed, or vice-versa. FPGA upset rates are estimated
using the RFT fault-rate model described in Section 3.2. The system repair rates are
based on the system-designer-specified scrub rate and checkpointing rate and the
system and PRR reconfiguration time, which can be obtained experimentally. Transitions
between fault-tolerance modes are mission-dependent. State-mapping functions ensure
a continuous availability function, although performability may contain discontinuities at
phase transitions.
Figure 3-5 shows Markov model representations for each fault-tolerance mode
in a system with six PRRs. Figures 3-5(a) and 3-5(b) represent systems where each
PRR is operating in the HP or ABFT fault-tolerance modes, respectively. Figure 3-5(c)
represents a system using three independent pairs of PRMs operating in the DWC
fault-tolerance mode. Figure 3-5(d) represents a system using two independent sets
of three PRMs operating in the TMR fault-tolerance mode. Solid circles represent the
Markov model’s available operating states and dashed circles represent failed operating
states. Using the definition of performability from Eq. 3–5, each state in a Markov
reward model is assigned a performance throughput value. For this analysis, system
performance is normalized to the work performed by a single PRR. The performance
throughput value of each state is represented in the Markov model by the value in
the rectangles. For example, six independent PRRs running concurrently in the HP
fault-tolerance mode would have a performance throughput value of 6 while a system
using six PRRs with two independent sets of PRRs operating in the TMR fault-tolerance
mode would have a performance throughput value of 2.
The ABFT Markov model is used to demonstrate a generic reliability model
for modules that contain some form of internal fault tolerance and are capable of
self-detecting data errors. In particular, ABFT provides a low-overhead method for
detecting errors in certain linear algebra operations. While hardware-implemented
ABFT may not have 100% fault coverage, it still improves on the HP model, which
cannot detect when corrupt data is returned.
In the Markov model for the single-module ABFT mode, we estimate the performance
of a single module as 80% of the default, unprotected module due to the performance
overhead associated with generating and comparing checksums for fault detection
[Acree et al. 1993]. Additionally, ABFT modules are modeled with a 20% higher
fault rate than unprotected modules due to the increased module size from the ABFT
logic. The repair rate for the ABFT model uses the system’s scrubbing rate, supplying a
worst-case repair rate for cases when ABFT logic does not detect (or even introduces)
an error, relying on external scrubbing and configuration readback to detect errors.
Although these numbers are application- and implementation-specific, the numbers
represent overhead costs that must be considered. The ABFT model can also be used
to model other modules using user-implemented fault tolerance techniques.
For a given phased-mission Markov model, a TLE is used to determine the
expected fault rates using the RFT fault-rate model in Section 3.2. These fault rates,
along with a description of the mission-specific criteria for using each fault-tolerance
mode, are used to split a full space mission into distinct phases for Markov modeling.
RFT phases are periods of time with constant fault rates, repair rates, and fault-tolerance
modes. For each phase, the previous phase’s state probabilities are mapped onto the
new phase’s Markov model as initial probabilities. SHARPE processes the new model
to numerically solve the state probabilities and reliability metrics for the current phase.
SHARPE’s results provide the initial probabilities for the next phase. This process is
repeated iteratively for each phase for the entire mission’s duration and SHARPE’s
results are aggregated to produce overall reliability results.
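The iterative phase-by-phase process can be sketched as a loop that solves each phase's transient and carries the end-of-phase probability forward as the next phase's initial condition. The sketch below uses a two-state closed form as a stand-in for the SHARPE solve, with invented phase rates and durations:

```python
import math

def phase_end_availability(lam, mu, duration, p_up0):
    """Transient up-probability of a two-state CTMC after `duration`,
    starting from up-probability p_up0 (stand-in for a SHARPE solve)."""
    a_ss = mu / (lam + mu)
    return a_ss + (p_up0 - a_ss) * math.exp(-(lam + mu) * duration)

# Hypothetical mission: alternating low/high fault-rate phases.
# Each tuple is (fault rate per hour, repair rate per hour, phase hours).
phases = [(0.01, 60.0, 2.0), (0.5, 60.0, 0.5), (0.01, 60.0, 2.0)]
p_up = 1.0  # system starts fully operational
for lam, mu, hours in phases:
    # End-of-phase probability becomes the next phase's initial probability.
    p_up = phase_end_availability(lam, mu, hours, p_up)
    print(f"end-of-phase availability: {p_up:.6f}")
```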
3.4 Results and Analysis
In this section, we present one validation case study and two orbital case studies
to evaluate the potential reliability and performance benefits of our RFT framework for
space systems. The validation case study uses FPGA fault injection to estimate fault
rates and error coverage, which can then be used by the other case studies. The orbital
case studies represent FPGA-based space systems operating in two common orbits,
with multiple performance and reliability requirements. The first case study represents a
space system operating in low-Earth orbit (LEO), the Earth Observing-1 (EO-1) satellite.
Space systems in a LEO experience relatively low radiation. The second case study
represents a space system operating in a highly elliptical orbit (HEO), which is a much
harsher radiation orbit. Each case study will compare multiple adaptive fault-tolerance
strategies to a traditional static TMR strategy.
3.4.1 Validation Case Study
In order to validate our reliability models, faults must be injected into an executing
RFT system. In this section, we present FPGA fault-injection results gathered using
the Simple, Portable Fault Injector (SPFI) [Cieslewski et al. 2010]. SPFI performs fault
injection using both full and partial reconfiguration, which reduces the time required to
modify configuration memory and improves the speed of fault injection. We validate our
reliability models by correlating SPFI’s results with analytical Markov model results.
For the validation case study, we implemented a simplified RFT-based system on
a Xilinx ML505 FPGA development platform. The RFT-based system had a 3-PRR
RFT controller that allowed HP and TMR voting and had watchdog timer functionality.
Each PRR contained a matrix multiplication (MM) PRM and a MicroBlaze processor
in the static region streamed data from a UART to an MM PRM and streamed results
back to the UART. The SPFI fault-injection tool enabled individual system components
(e.g., PRMs, RFT controller, MicroBlaze) to be tested independently without modifying
the entire system. Table 3-3 shows the RFT system’s fault-injection results. For each
system component, Table 3-3 indicates the number of injections performed and the
number of data errors and system hangs detected. The fault vulnerability for each
component is scaled to the FPGA’s total number of configuration bits to estimate the
components’ design vulnerability factor (DVF). A component’s/device’s DVF represents
the percentage of bits that are vulnerable to faults and can result in observable errors.
Most Xilinx FPGA designs have a DVF that ranges from 1%-10% [Xilinx 2010b]
due to the large amount of configuration memory devoted to routing. The DVF for
each component is calculated by measuring the component’s fault rate, estimating
the number of vulnerable bits by scaling the fault rate to the size of the area occupied
by the component, and dividing the number of vulnerable bits by the total number
of FPGA configuration bits. For a single MM PRM with the RFT controller using HP
mode, only 1.6% of faults that occurred in the PRR were found to cause observable
errors in the output. The approximately 41,497 vulnerable bits in each PRR, which
occupies roughly one-eighth of the FPGA, result in a DVFMM of 0.197%. In this case, the
majority of the PRR is unused, resulting in very few vulnerable bits and a low DVF. Faults
injected into the RFT controller resulted in a DVFRFT of 0.022%. The MicroBlaze was not
protected using TMR or other design techniques and had a DVFMB of 0.342%. Based on
these component fault-injection results, the total FPGA DVF is estimated to be 0.955%
(3·DVFMM + DVFRFT + DVFMB).
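The total-DVF estimate is a simple weighted sum of the component DVFs, which can be checked arithmetically:

```python
# Component design vulnerability factors (%), from the fault-injection results.
dvf_mm, dvf_rft, dvf_mb = 0.197, 0.022, 0.342

# Three MM PRRs plus the RFT controller and the MicroBlaze.
dvf_total = 3 * dvf_mm + dvf_rft + dvf_mb
print(f"{dvf_total:.3f}%")  # 0.955%
```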
Figure 3-6 shows Markov model representations for each fault-tolerance mode in
a system with three PRRs. Figures 3-6(a) and 3-6(b) represent systems where each
PRR is operating in the HP or the TMR fault-tolerance mode, respectively. These
models are similar to the models presented in Section 3.3, without additional states for
performability, and with the addition of state transitions to account for fault coverage.
In a TMR system, coverage refers to the percentage of faults that cause the system to
immediately enter the “failed” state. These non-covered faults occur due to designs not
being fully protected by TMR.
Initial fault-injection testing revealed two possible fault scenarios. In the first
scenario, a fault could cause the system to remain operational
but produce erroneous data. This state was recoverable using periodic scrubbing. In the
second scenario, a fault could cause the system to hang until a full reconfiguration was
performed. We expanded the Markov models to include these behaviors as additional
states. Figure 3-6(a) shows the HP Markov model with two unavailable states that
account for the two fault scenarios. The probabilities of transitioning to “Faulty Data”
or “System Hang” were determined from the component fault-injection testing. The
DVFFPGA was estimated from the sum of each of the components tested.
A similar approach was taken for the TMR Markov model shown in Figure 3-6(b).
The additional “degraded” state is used to model faults that have been masked by
the TMR protection provided by the RFT controller. The probability of transitioning
to the “degraded” state is provided by the DVFMM from previous testing. The DVFsys
term represents faults in the RFT controller (DVFRFT) and MicroBlaze (DVFMB). From
fault-injection testing, we estimate the DVFsys to be approximately 0.364%.
By assigning fault rates, repair rates, and coverage to the Markov model, we can
calculate the system availability. Table 3-4 shows the fault and repair rates used in this
analysis. Using a fault rate of 1 fault per 10 seconds and a scrub rate of 1 scrub per
5 seconds, the RFT system will have a 98.82% availability in HP mode and 99.31%
availability in TMR mode. When the fault-rate to scrub-rate ratio is increased, the
benefits of TMR become more pronounced. Using a fault rate of 1 fault per 2 seconds
and a scrub rate of 1 scrub per 10 seconds, the RFT system will have a 92.25%
availability in HP mode and 95.32% availability in TMR mode. These high availabilities,
even in HP mode, are due to frequent scrubbing and the very low DVF of the FPGA
design.
Finally, we validate the Markov model results using fault injection. In a continuously-running
RFT system, faults are injected at a specified average rate with Poisson-distributed
arrivals, matching the error model used in the Markov model. Scrubbing,
using partial reconfiguration, occurs at user-defined periodic intervals. Full reconfiguration
occurs at the next scrubbing cycle after a PRM error has been detected by the RFT
controller. Full reconfigurations also occur if the external testing program detects that the
FPGA system has entered the “System Hang” state. Availability can be experimentally
determined by the ratio of time the system is operating correctly to the total experiment
run time. For each run, 10,000 faults were randomly injected into a running system.
Table 3-4 shows the results of the availability experiment and the relative error from the
analytical model. At low fault rates, the HP and TMR modes both provide approximately
99% availability. With high fault rates, the system had an availability of 92.59% in the
HP-mode and 93.89% in the TMR-mode.
The Markov model availability methodology provides a simple and effective method
for determining the effects of fault and repair rates on system availability without
exhaustive testing. In general, the HP models predicted slightly lower availability than
was observed during experimental testing, while the TMR models overestimated
the availability of the system. All availability results were within 1.5% of the Markov
models’ predictions. The Markov model accuracy can be improved by providing more
accurate fault-injection results. The availability values obtained through experiments can
be improved by increasing the length of the testing period.
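The experimental availability measurement can be mimicked with a tiny Monte Carlo sketch: inject faults with exponentially distributed inter-arrival times, repair at the next periodic scrub, and report the up-time fraction. This is a simplified toy with invented parameters, not the SPFI campaign:

```python
import random

def simulate_availability(fault_rate, scrub_period, total_time, seed=1):
    """Fraction of time the system is fault-free, given Poisson faults
    and repair at the next periodic scrub boundary."""
    rng = random.Random(seed)
    t, down_time = 0.0, 0.0
    while t < total_time:
        t += rng.expovariate(fault_rate)  # time to next fault while up
        if t >= total_time:
            break
        # System is down from the fault until the next scrub cycle.
        next_scrub = (int(t / scrub_period) + 1) * scrub_period
        down_time += min(next_scrub, total_time) - t
        t = next_scrub  # repaired at the scrub
    return 1.0 - down_time / total_time

print(simulate_availability(fault_rate=0.1, scrub_period=5.0, total_time=1e5))
```

With a mean up-time of 10 time units and an average of half a scrub period of downtime per fault, the result should land near 0.8.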
Fault-injection testing did highlight implementation issues that must be handled in
any high-reliability design. The use of TMR had a lower than expected benefit due to
the unprotected MicroBlaze processor. Since the DVFMB was larger than the DVFMM,
the availability of the system was dominated by the MicroBlaze’s availability. For high
reliability, the MicroBlaze must be protected with TMR or an alternative fault-tolerant
processor (e.g., FT-LEON3).
3.4.2 Orbital Case Studies
For each orbital case study, the system under test is an FPGA System-on-Chip
implementation of the RFT hardware architecture described in Section 3.1. In order to
calculate device vulnerability, the parameters for the Xilinx Virtex-4 FX60 will be used as
inputs to the RFT fault-rate model described in Section 3.2. The radiation susceptibility
parameters of the Virtex-4 device family are obtained from the Xilinx Radiation Test
Consortium’s (XRTC) published results [Swift et al. 2008]. The generated fault rate is
linearly scaled from a full device to the size of the PRR to produce PRR fault rates. The
RFT controller is connected to 6 PRRs, allowing for several combinations of the TMR,
DWC, and ABFT fault-tolerance modes discussed in Section 3.3. The performability
model from Section 3.3 is used to evaluate the effectiveness of each fault-tolerance
strategy.
3.4.2.1 Low-Earth orbit case study
Space systems in LEO are commonly used for earth-observing science applications,
such as HSI or SAR. Both HSI and SAR produce large datasets that can be significantly
reduced through on-board processing. The underlying algorithms decompose
into basic kernels that can be parallelized and implemented on an FPGA system for
high performance and power efficiency. For example, HSI can be decomposed into a
sequence of matrix multiplications and matrix inversions, and SAR can be decomposed
into vector multiplication and FFTs. Since these mathematical operations are linear,
ABFT can be used to protect the computation results from SEUs. For applications that
cannot be protected with ABFT, TMR or DWC must be used to provide fault tolerance.
The TLE used to generate the fault rates for this LEO case study is from the EO-1
satellite. The EO-1 orbit is circular at an altitude of 700 km and a 98.12° inclination with
a mean travel time of 98 minutes. Figure 3-7(a) shows the orbital track of EO-1 and the
shaded circle represents the EO-1’s field of view. Figure 3-7(b) shows the estimated
number of upsets per hour that occur in the EO-1 orbit over several orbital periods. The
average fault rate of the Virtex-4 FX60 in the EO-1’s orbit is 16.5 faults per device-day
(combined configuration memory and BRAM vulnerability). Each local maximum occurs
when the satellite is closest to the Earth’s magnetic poles. Fault rates in EO-1’s orbit
are low because the orbit is lower than the Van Allen Belts and is fully within the Earth’s
magnetosphere, which deflects a large amount of radiation.
We examine the availability and performability of TMR, DWC, and ABFT fault-tolerance
modes in LEO, as well as three adaptive fault-tolerance strategies to maximize
application performability: 10% two-mode, 50% two-mode, and three-mode adaptive
strategies. For the adaptive fault-tolerance strategies, the fault-tolerance mode switching
is determined by comparing the current upset rate with a fault-rate threshold. The 10%
two-mode adaptive strategy uses the ABFT fault-tolerance mode when the upset rate is in
the lowest 10% of the expected fault rates and the TMR fault-tolerance mode otherwise.
The 50% two-mode adaptive strategy uses a similar strategy as the 10% two-mode, but
with a higher fault-rate threshold. The three-mode adaptive strategy uses ABFT when
the upset rate is in the lowest 10% of the expected fault rates, TMR when the upset rate
is in the highest 50% of the expected fault rates, and DWC otherwise. In all modes, the
system performs scrubbing to ensure that configuration memory errors are removed
from the system. For this LEO case study, the system uses a 60-second scrub cycle,
which is also used as the system repair rate for the Markov models.
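The threshold test that drives mode switching can be sketched as a comparison against percentile thresholds of the expected fault-rate profile. The function name, percentile handling, and rate values below are illustrative assumptions following the three-mode strategy described above:

```python
def select_mode(current_rate, expected_rates, low_pct=0.10, high_pct=0.50):
    """Three-mode adaptive strategy: ABFT in the lowest `low_pct` of
    expected rates, TMR in the highest `high_pct`, DWC otherwise."""
    ranked = sorted(expected_rates)
    low_thresh  = ranked[int(low_pct * (len(ranked) - 1))]
    high_thresh = ranked[int((1.0 - high_pct) * (len(ranked) - 1))]
    if current_rate <= low_thresh:
        return "ABFT"
    if current_rate >= high_thresh:
        return "TMR"
    return "DWC"

# Hypothetical expected upset rates over one orbit (upsets/hour).
rates = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0]
print(select_mode(0.5, rates))    # ABFT
print(select_mode(512.0, rates))  # TMR
```

Setting `high_pct=1.0 - low_pct` collapses this to a two-mode strategy with a single threshold.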
Figure 3-8(a) shows the availability of the LEO system while statically using the four
fault-tolerance models described in Figure 3-5. While the static HP strategy’s availability
quickly declines (due to the lack of any fault-tolerance mechanisms), the static TMR,
DWC, and ABFT strategies all maintain availability above 95%. The dynamically
changing availability is directly related to the current fault rate; however, since the repair
rate of the system (through scrubbing) is much larger than the expected fault rates, most
configuration memory faults are rapidly mitigated. The system availability while using
the three adaptive fault-tolerance strategies is shown in Figure 3-8(b). The average
availability for each adaptive strategy improves availability over the static ABFT strategy,
and the adaptive strategies can increase the minimum availability to above 99.5%.
Table 3-5 displays the availability results for each fault-tolerance strategy in terms of
unavailability, showing the probability of a system failure. The 10% two-mode adaptive
strategy reduces average unavailability by 88% as compared to the static ABFT strategy.
For extremely high availability, static TMR is required, as the maximum unavailability for
the adaptive strategies is more than 100 times higher than TMR.
The system performability for the static and adaptive fault-tolerance strategies
for a LEO is also shown in Table 3-5. Due to the low overall upset rates and good
availability of the static ABFT strategy, the static ABFT strategy achieves the highest
performability. The 50% two-mode adaptive strategy achieves an average performability
throughput of 4.01, a 100% improvement over static TMR, while improving unavailability
over the static ABFT strategy by 73%. The 10% two-mode strategy has lower average
performability, while maintaining better system availability. The three-mode strategy has
better performability than the static DWC mode while having better availability than the
static DWC or ABFT modes, but is outperformed by the 50% two-mode strategy.
We point out that the fault-rate threshold for the two-mode strategies can be used
to adjust the availability and performability parameters for a space system. Figure 3-9
illustrates the effect of changing the threshold of the two-mode adaptive strategy for
the LEO case study. As the threshold is raised, more time is spent using the ABFT
mode, which lowers system availability while increasing performability. For the LEO
case study, most of the performance gains from using ABFT can be obtained by using
a low threshold value because much of the orbit will be under this threshold. With a
10% threshold, an RFT system would spend an approximately equal amount of time
in each of the ABFT and TMR modes. Raising the adaptive threshold higher than 50%
results in limited performance gains at the expense of decreased availability. Further
analysis of these thresholds in mode-switching strategies can enable optimization
toward Pareto-optimal availability/performability tradeoffs.
3.4.2.2 Highly-elliptical orbit case study
The HEO is a common type of orbit used mostly by communication satellites.
From the ground, satellites traveling in an HEO can appear stationary in the sky for
long periods of time. HEOs also offer visibility of the Earth’s polar regions, which most
geosynchronous satellites do not cover. The HEO used for this case study is a Molniya orbit,
named for the communication satellites that first used this orbit. A TLE for a Molniya-1
satellite was used to generate fault rates for the HEO case study. This orbit has a
perigee of 1,100 km, an apogee of 39,000 km, and a 63.4° inclination with a mean travel
time of 12 hours. The average amount of radiation throughout the orbit is much higher
than the LEO case study, and much larger amounts of radiation are encountered when
the satellite passes through the Van Allen belts. The average fault rate in the Molniya-1
orbit is 62 faults per device-day. For most of the orbit, the fault rate averages 7 faults per
device-day, but the large fault-rate peaks that occur near perigee increase the overall
fault rate. Figure 3-10(a) illustrates the Molniya-1 orbit and Figure 3-10(b) shows the
estimated number of upsets per hour that an FPGA might experience in an HEO.
Space systems outside of the Earth’s magnetosphere, either in geosynchronous
orbit or in interplanetary space, experience constant fault rates that vary based on the
occurrence of solar flares and other space weather conditions. Solar flares can send
a wave of high-energy particles into space, causing a brief period of extremely high
fault rates. These fault-rate spikes, while different in origin, look similar to the fault-rate
peaks that occur in this case study. The analysis used in this case study can also be
used to estimate RFT system performance in the presence of different space weather
conditions.
We evaluate the availability and performability of TMR, DWC, and ABFT fault-tolerance
modes in an HEO. In order to maximize performability of applications in an HEO, we
also examine two adaptive fault-tolerance strategies. The ABFT/TMR adaptive strategy
uses the ABFT fault-tolerance mode when upset rates are in the lowest 5% of expected
fault rates and the TMR fault-tolerance mode otherwise. The DWC/TMR adaptive
strategy uses the same fault-rate thresholds as ABFT/TMR, but switches between the
DWC and TMR modes. In each mode, the system performs scrubbing to ensure that
configuration memory errors are removed from the system. For the HEO case study, the
system uses a 10-second scrub cycle to account for the increased average fault rates
experienced in an HEO; this scrub rate serves as the system repair rate in the Markov models.
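As a concrete illustration of the threshold rule used by the adaptive strategies, the per-segment mode-switching decision can be sketched in a few lines. This is a minimal sketch assuming a precomputed per-segment fault-rate profile; the profile values and function name below are illustrative, not output of the Molniya-1 model.

```python
import numpy as np

def select_modes(fault_rates, low_mode="ABFT", high_mode="TMR", pct=5.0):
    """Pick a fault-tolerance mode for each orbit segment: use the
    low-overhead mode while the predicted fault rate sits in the lowest
    pct percent of expected rates, and fall back to TMR otherwise."""
    threshold = np.percentile(fault_rates, pct)
    return [low_mode if r <= threshold else high_mode for r in fault_rates]

# Illustrative orbit profile (faults per device-day): a quiet stretch,
# a long baseline, and a sharp fault-rate spike near perigee.
rates = np.concatenate([np.full(95, 7.0), np.full(5, 1200.0)])
rates[:5] = 2.0                      # a brief very-quiet stretch
modes = select_modes(rates, pct=5.0)
```

The same function sketches the DWC/TMR strategy by passing `low_mode="DWC"`; a three-mode strategy would simply apply two thresholds instead of one.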
Figures 3-11(a) and 3-11(b) show the availability of the HEO case study system
while using three static fault-tolerance strategies and two adaptive strategies. The
static TMR strategy maintains an average availability of 99.93%, but the availability
drops as low as 95.1% while passing through the peak fault-rate periods. While using
the static DWC and ABFT strategies, the availability drops significantly during the
peak fault-rate periods, making these strategies unsuitable for systems that must
maintain continuous operation (due to the extremely high upset rate, many systems
shut down operation during peak fault-rate periods). However, outside of the peak
fault-rate periods, the minimum availability for the static DWC (99.8%) and ABFT
(99.3%) strategies are high enough to be tolerable by many applications. The two
adaptive strategies use DWC and ABFT fault-tolerance modes during low fault-rate
periods and TMR fault-tolerance mode during the peak fault-rate periods. Using the
adaptive strategies, the availability never falls below the TMR strategy’s minimum
availability of 95.1% during peak fault-rate periods, while maintaining higher availability
at other times. (Note the change of scale in Figure 3-11(b).) Table 3-6 shows the
unavailability and performability of the fault-tolerance strategies described in this
section. The DWC/TMR adaptive strategy reduces average unavailability by 80%
as compared to the static DWC strategy. The ABFT/TMR adaptive strategy reduces
average unavailability by 85% as compared to the static ABFT strategy.
The system performability for the static and adaptive fault-tolerance strategies in an
HEO is shown in Table 3-6. While the TMR strategy exhibits the lowest performability,
the TMR strategy also has the lowest variation in performability and highest availability,
allowing for predictable performance levels. The performability of the static ABFT and
DWC strategies is significantly reduced in the peak fault-rate periods, but quickly returns
to acceptable levels after passing through the peak fault-rate periods. The maximum
unavailability for each of the adaptive strategies occurs during the peak fault-rate
periods. Since the RFT system is using the TMR fault-tolerance mode during these
segments, the adaptive strategies have the same maximum unavailability as the static
TMR strategy. The DWC/TMR adaptive strategy increases performability over the TMR
strategy by 46%. The ABFT/TMR adaptive strategy increases average performability
over the TMR strategy by 128%, while sacrificing only 4% of performability relative to
the static ABFT strategy and significantly improving unavailability over it.
3.5 Conclusions
In this work, we have presented a novel and comprehensive framework for
reconfigurable fault tolerance capable of creating and modeling FPGA-based reconfigurable
architectures that enable a system to self-adapt its fault-tolerance strategy in accordance
with dynamically varying fault rates. The PR-based RFT hardware architecture enables
several redundancy-based fault-tolerance modes (e.g., DWC, TMR) and additional
fault-tolerance features (e.g., watchdog timers, ABFT, checkpointing and rollback), as
well as a mechanism for dynamically switching between modes. The combination of
these fault-tolerance features enables the use of COTS SRAM-based FPGAs in harsh
environments. Future work will automate and optimize the creation of RFT controllers
for specific system configurations in order to increase developer productivity, facilitate
adoption, and reduce system overhead.
In addition to the hardware architecture, we have demonstrated a fault-rate model
for RFT to accurately estimate upset rates and capture time-varying radiation effects
for arbitrary satellite orbits using a collection of existing, publicly available tools and
models. Our model provides important characterization of fault rates for space systems
over the course of a mission that cannot be captured by average fault rates. The HEO
case study demonstrated the large range of fault rates experienced in certain elliptical
orbits, and the potential for performance improvements when using RFT.
Using the results from the fault-rate model, our Markov-model-based RFT
performability model demonstrated the benefits of an adaptive fault-tolerance
system architecture. The performability model was validated using FPGA fault
injection on an experimental RFT system, and multiple static and adaptive
fault-tolerance strategies for space systems in dynamically changing environments
were evaluated. The RFT performability model demonstrated that while TMR provides
a lower bound on availability, less reliable, low-overhead methods can be used during
low-fault-rate periods to improve performance. We evaluated two case study orbits
and observed that adaptive fault-tolerance strategies are able to improve unavailability
by 85% over static ABFT and performability by 128% over traditional, static TMR fault
tolerance. This additional performance can lead to more capable and power-efficient
FPGA-based onboard processing systems in the future.
Table 3-1. RFT fault-tolerance modes.

Fault-tolerance mode                          Fault-tolerance type  PRRs required
Triple Modular Redundancy (TMR)               Redundancy            3
Duplication with Compare (DWC)                Redundancy            2
High-Performance (HP) - no fault protection   Single-module         1
Algorithm-Based Fault Tolerance (ABFT)        Single-module         1
Internal TMR                                  Single-module         1
Table 3-2. RFT controller resource usage.

Module name      LUTs used  FFs used  Slices used  FPGA utilization
PLB Controller   479        375       345          1.4%
Address Decoder  238        0         136          0.5%
RFT Registers    82         64        48           0.2%
Watchdog Timers  498        390       366          1.4%
Voting Logic     438        0         234          0.9%
Output Mux       227        0         132          0.5%
Total            1962       829       1261         5.0%
Table 3-3. Fault-injection results for RFT components.

System component      Faults injected  Data errors  System hangs  Vulnerable bits (est.)  DVF (%)
Matrix Multiply (MM)  100,000          1,501        71            41,497                  0.197%
RFT Controller (RFT)  100,000          63           157           4,576                   0.022%
MicroBlaze (MB)       100,000          150          1,219         86,584                  0.342%
Full FPGA             --               --           --            --                      0.955%
Table 3-4. RFT Markov model validation.

Fault period  Repair period  FT mode  Markov model  Experimental  Model
(time/fault)  (time/scrub)            availability  availability  error
10s           5s             HP       98.82%        98.99%        0.2%
10s           5s             TMR      99.31%        99.11%        0.2%
2s            10s            HP       92.25%        92.59%        0.4%
2s            10s            TMR      95.32%        93.89%        1.5%
Table 3-5. Unavailability and performability for LEO case study.

Strategy      Average         Maximum         Average         Minimum
              unavailability  unavailability  performability  performability
TMR           9.8 × 10^-6     3.8 × 10^-5     2.00            2.00
DWC           3.9 × 10^-3     1.1 × 10^-2     3.00            2.98
ABFT          1.6 × 10^-2     4.7 × 10^-2     4.79            4.76
2-Mode (10%)  1.9 × 10^-3     4.6 × 10^-3     3.51            2.00
2-Mode (50%)  4.4 × 10^-3     1.8 × 10^-2     4.01            2.00
3-Mode        2.7 × 10^-3     5.4 × 10^-3     3.69            2.00
Table 3-6. Unavailability and performability for HEO case study.

Strategy  Average         Maximum         Average         Minimum
          unavailability  unavailability  performability  performability
TMR       7.2 × 10^-4     4.9 × 10^-2     2.00            1.95
DWC       1.4 × 10^-2     4.1 × 10^-1     2.98            2.46
ABFT      4.9 × 10^-2     9.2 × 10^-1     4.73            2.62
DWC/TMR   2.6 × 10^-3     4.9 × 10^-2     2.92            1.95
ABFT/TMR  7.2 × 10^-3     4.9 × 10^-2     4.56            1.95
Figure 3-1. System-on-chip architecture with RFT controller.
Figure 3-2. RFT controller PLB-to-PRR interface.
Figure 3-3. RFT fault-rate model.
Figure 3-4. Phased-mission Markov model transitioning between TMR and DWC modes.
Figure 3-5. Markov models of RFT modes. A) 6x HP. B) 6x ABFT. C) 3x DWC. D) 2x TMR.
Figure 3-6. RFT validation Markov models. A) HP mode. B) TMR mode.
Figure 3-7. LEO fault rates using the RFT fault-rate model. A) Visualization of the EO-1 TLE data using JSatTrak [Gano 2010]. B) Expected fault rates over several orbits.
Figure 3-8. LEO system availability. A) System availability of static strategies. B) System availability of adaptive strategies.
Figure 3-9. Effects of adaptive thresholds on availability and performability.
Figure 3-10. HEO fault rates using the RFT fault-rate model. A) Visualization of Molniya-1 TLE data using JSatTrak [Gano 2010]. B) Expected fault rates over time.
Figure 3-11. HEO system availability. A) System availability of static strategies (zoomed image on right). B) System availability of adaptive strategies (zoomed image on right).
CHAPTER 4
ALGORITHM-BASED FAULT TOLERANCE FOR FPGA SYSTEMS
Algorithm-based fault tolerance (ABFT) is a fault-tolerance method that can be used
with many linear-algebra operations, such as matrix multiplication or LU decomposition
[Huang and Abraham 1984]. Additionally, the ABFT approach can be extended to
other linear operators such as the Fourier transform. Fortunately, many common space
applications are composed of linear-algebra operations; e.g., HSI features matrix
multiplication [Jacobs et al. 2008], while SAR features Fast Fourier transforms. Other
algorithms can often be converted to fit an algebraic framework. Traditionally, ABFT
has been implemented in software, with multiprocessor arrays, and in hardware, with
systolic arrays, to protect application datapaths. Our ABFT approach may be used in
FPGA applications to provide both datapath and configuration memory protection with
low overhead. By demonstrating the effectiveness of ABFT in FPGA systems, we can
enable the use of FPGAs in future space missions where resource constraints and
reliability are the major challenges. In the larger perspective of an RFT system, ABFT
enables additional design-space options for optimizing total system performability.
In this chapter, we present an analysis of multiple fault-tolerance methods on
Xilinx FPGAs including TMR and ABFT. We examine the resource usage of each
method and measure the vulnerability of the design using a fault-injection tool. We then
examine possible design tradeoffs and modifications that can enable higher reliability.
The remainder of this chapter is organized as follows. Section 4.1 describes
multiple hardware architectures for ABFT matrix multiplication, along with analysis of
resource usage and fault vulnerability. Section 4.2 describes an ABFT-based algorithm
for FFTs, FFT case study architectures, and fault vulnerability analysis. Section 4.3
presents conclusions, provides suggestions for developing more reliable designs, and
outlines directions for future research.
4.1 Matrix Multiplication
Matrix multiplication (MM) is used as a key kernel in a large number of signal-processing
applications, and has been shown to benefit from the performance of FPGAs [Dave
et al. 2007; Wu et al. 2010; Zhuo and Prasanna 2004]. The traditional formulation of
ABFT uses MM as a motivating case, demonstrating the method’s ability to detect and
correct errors while limiting computational complexity and overhead. MM also has the
benefit of simple parallel decomposition strategies that trade additional FPGA resource
usage for decreased total execution time and combine well with ABFT. In Section 4.1.1,
the mathematical description of ABFT for MM is presented. In Section 4.1.2, several
possible MM-ABFT architectures are described. Section 4.1.3 analyzes the resource
overhead tradeoffs for each of the proposed architectures. Section 4.1.4 presents an
analysis of fault injection results obtained from each of the proposed architectures.
Finally, Section 4.1.5 summarizes the results from the matrix multiplication analysis.
4.1.1 Checksum-Based ABFT for Matrix Multiplication
The following definitions provide the mathematical background for ABFT. To obtain
the weighted checksums, the initial data will have to be multiplied by an encoder matrix.
Without loss of generality and to simplify the notation, we assume that the data matrix
is square with dimensions N × N.
Definition 1: An encoder matrix is a matrix whose product with the data matrix will
yield the desired checksums. For the remainder of this chapter we will refer to the encoder
matrix as E_N. Depending on the size of the encoding matrix, E_N may encode more than
one checksum row or column. The E_N used in this chapter has dimensions of N × 1.
E_N = \begin{bmatrix} 1 & 1 & \cdots & 1 & 1 \end{bmatrix}^T    (4–1)
Definition 2: A column checksum matrix A_C is an initial data matrix A that has
been augmented with extra rows of checksums. Such a matrix will have dimensions of
(N + 1) × N and has the form:
A_C = \begin{bmatrix} A \\ E_N^T \cdot A \end{bmatrix}    (4–2)
Similarly, a row checksum matrix A_R can be obtained by augmenting a data matrix
A with additional columns. Such a matrix will have dimensions of N × (N + 1) and has
the following form:
A_R = \begin{bmatrix} A & A \cdot E_N \end{bmatrix}    (4–3)
Definition 3: The product of a column checksum matrix A_C and a row checksum
matrix B_R will produce a full checksum matrix C_F. Such a matrix will have dimensions of
(N + 1) × (N + 1) and the form:
A_C \cdot B_R = \begin{bmatrix} A \\ E_N^T \cdot A \end{bmatrix} \cdot \begin{bmatrix} B & B \cdot E_N \end{bmatrix}
            = \begin{bmatrix} A \cdot B & A \cdot B \cdot E_N \\ E_N^T \cdot A \cdot B & E_N^T \cdot A \cdot B \cdot E_N \end{bmatrix}
            = \begin{bmatrix} C & C \cdot E_N \\ E_N^T \cdot C & E_N^T \cdot C \cdot E_N \end{bmatrix} = C_F    (4–4)
The associative property of the matrix product allows for verification of the
multiplication procedure by simply recalculating the checksums and comparing them
with those obtained through the matrix multiplication. In general, operations that preserve
weighted checksums are called checksum-preserving and the matrix product is an
example of such a function.
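The checksum relations above can be verified numerically. The sketch below is illustrative only, using NumPy rather than hardware: it builds the full checksum matrix for N = 4 with the all-ones encoder and shows that a single corrupted element is flagged by exactly one row and one column checksum.

```python
import numpy as np

# Build column/row checksum matrices per Definitions 1-3 (N = 4, E_N all ones).
N = 4
rng = np.random.default_rng(0)
A = rng.integers(0, 10, size=(N, N))
B = rng.integers(0, 10, size=(N, N))
E = np.ones((N, 1), dtype=int)

A_C = np.vstack([A, E.T @ A])        # column checksum matrix, (N+1) x N
B_R = np.hstack([B, B @ E])          # row checksum matrix, N x (N+1)
C_F = A_C @ B_R                      # full checksum matrix, (N+1) x (N+1)

# Verification: recompute checksums from the data block C and compare them
# with the checksums produced by the multiplication itself.
C = C_F[:N, :N]
col_ok = np.array_equal(C_F[N, :N], C.sum(axis=0))
row_ok = np.array_equal(C_F[:N, N], C.sum(axis=1))

# A single corrupted element breaks exactly one row and one column checksum,
# locating the fault at their intersection.
C_F[1, 2] += 99
bad_rows = np.flatnonzero(C_F[:N, N] != C_F[:N, :N].sum(axis=1))
bad_cols = np.flatnonzero(C_F[N, :N] != C_F[:N, :N].sum(axis=0))
```

Before the injected corruption both checks pass; afterward the mismatching row and column indices pinpoint the faulty element.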
4.1.2 Matrix-Multiplication Architectures
Several matrix multiplication modules were created to examine the reliability of
hardware-based ABFT. This section discusses a serial architecture and two possible
parallelization strategies for MM and the design decisions that were made for the ABFT
architectures. For this analysis, 32-bit integer precision was used in each design.
4.1.2.1 Baseline, serial architecture
The minimal-hardware, serial architecture for the matrix multiplication function
consists of a single multiply-accumulator (MAC), an address generation module, and
three data-storage modules (RAM) as shown in Figure 4-1(a). Two memories are used
to store the input matrices A and B, and one memory is used for the resulting output
matrix C . The address generator iterates through the correct matrix indices (i, j, k in
Figure 4-2), sending data stored in the two input RAMs to the MAC, and generates the
appropriate address for output values. This MM module can be used for any size matrix,
the limiting factor being data storage. For this analysis, the input and output matrices
are each stored in a 1024-deep 32-bit Xilinx BlockRAM (BRAM) component. This
architecture requires O(N3) cycles to fully calculate an output matrix. MM computational
throughput can be improved by exploiting parallelism with additional MAC units.
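The serial datapath just described, a single MAC fed by an i, j, k address generator, behaves like the following software sketch (an analogue for illustration, not the HDL implementation):

```python
# Software sketch of the serial MM datapath: one multiply-accumulate unit,
# an i/j/k address generator, and three memories (A, B, C). It performs
# O(N^3) MAC operations, matching the baseline architecture described above.

def serial_matmul(A, B):
    N = len(A)
    C = [[0] * N for _ in range(N)]
    for i in range(N):          # row index into the A RAM
        for j in range(N):      # column index into the B RAM
            acc = 0             # the single MAC's accumulator
            for k in range(N):  # dot-product index
                acc += A[i][k] * B[k][j]
            C[i][j] = acc       # one write to the output RAM
    return C
```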
4.1.2.2 Fine-grained parallel architecture
The fine-grained parallel MM architecture unrolls the inner loop of the MM algorithm
as shown in Figure 4-2. Each element in the result matrix C is the dot product of a
row from matrix A and a column from matrix B. This parallel architecture uses multiple
processing elements to compute the dot-product in parallel. With the fine-grained
parallel architecture shown in Figure 4-1(b), the outputs of several multipliers are
connected to an adder-tree structure, allowing the parallel computation of partial
dot-products, which are then accumulated into the final, full dot product. By fully
parallelizing the dot product (using N multipliers), the execution time of the full algorithm
can be reduced from O(N3) to O(N2). This method requires accessing multiple memory
elements in parallel, and may be limited by the total number of memory elements or
DSP components on the FPGA.
4.1.2.3 Coarse-grained parallel architecture
The coarse-grained parallel approach essentially unrolls the outer loop of the MM
algorithm in Figure 4-2. Each row in the result matrix is calculated from a row in matrix A
and every element of matrix B. This data parallelism enables each processing element
to calculate a unique portion of matrix C by using a portion of matrix A and all of matrix
B, as shown in Figure 4-1(c). When each processing element computes a single row,
the execution time of the full algorithm can be reduced from O(N3) to O(N2). This
method requires reading multiple values from matrix A and writing multiple values to
matrix C in parallel, and requires the same total number of memory and DSP elements
as in the fine-grained parallel architecture.
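The two unrolling strategies can be contrasted with a software analogue. This is a sketch only: in the actual architectures the parallel multiplies map to DSP48 slices and the reduction to an adder tree in fabric, and the function names here are illustrative.

```python
# Fine-grained: unroll the inner (dot-product) loop into N parallel multiplies
# followed by a log-depth adder-tree reduction. Coarse-grained: unroll the
# outer loop so each processing element (PE) computes one output row. Both
# cut the sequential step count from O(N^3) to O(N^2) with N PEs.

def dot_adder_tree(row, col):
    """Fine-grained PE array: N parallel multiplies, then a tree of adds."""
    partial = [a * b for a, b in zip(row, col)]   # N multipliers in parallel
    while len(partial) > 1:                       # adder-tree reduction
        if len(partial) % 2:
            partial.append(0)                     # pad odd levels
        partial = [partial[i] + partial[i + 1]
                   for i in range(0, len(partial), 2)]
    return partial[0]

def coarse_grained_matmul(A, B):
    """Coarse-grained: PE p computes row p of C from row p of A and all of B."""
    N = len(A)
    cols = [[B[k][j] for k in range(N)] for j in range(N)]
    return [[dot_adder_tree(A[p], cols[j]) for j in range(N)]
            for p in range(N)]
```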
4.1.2.4 Architectural modifications for ABFT
The addition of ABFT logic requires the creation of two functions, ABFT checksum
generation and ABFT checksum verification. Each of these functions requires a
simple accumulator (or a MAC for weighted checksums). In order to accommodate
the calculated checksum data, the BRAM storage must be large enough to hold the
ABFT-augmented matrices. Checksum generation sums each column of matrix A and
writes the checksum into the matrix A BRAM, creating the column checksum matrix,
AC . Next, checksum generation performs the same process for the rows of matrix B
to create a row checksum matrix, BR . The checksum verification function sums the
columns and rows of matrix C and compares the sums to the checksum values in the
matrix C BRAM. If a mismatch is detected, an error signal is asserted until the module
is reset. For some applications, such as image processing, data errors that occur
in low-significance bits may be ignored. ABFT accomplishes this by comparing the
difference of two generated checksums to a user-defined threshold value. However, for
maximum coverage, this threshold should be set to zero for integer operations.
Figure 4-3 shows an example of an ABFT-enabled MM architecture where an
additional MAC is used for the checksum generation and verification functions. The MAC
hardware that exists for the main MM operation may be reused for creating checksums
(ABFT-Shared), or an additional accumulator can be used for this purpose (ABFT-Extra).
For the baseline architecture, an additional ABFT accumulator would incur almost 100%
overhead. However, when the ABFT encoding matrix in Equation 4–1 is used, multipliers
are not required and the ABFT hardware can be simplified. For parallel designs with
multiple processing elements, the overhead of a single MAC for ABFT calculations is
amortized.
To implement ABFT error correction, the column and row indices of faulty rows can
be temporarily stored in registers. Faulty elements exist at the intersection of a faulty row
and faulty column. To correct the faulty element, the checksum-generation module must
recalculate the column checksum, ignoring the value at the faulty row index. This sum
would be subtracted from the matrix C checksum value to obtain the correct value, and
stored in the matrix C RAM. Using the encoding matrix defined in Equation 4–1, ABFT will
be able to detect single- and multiple-element errors. However, for specific distributions
of multiple-element errors, the correction algorithm may fail. Weighted-checksum
encoding matrices can be used to provide additional fault localization capabilities for
multiple-element errors. The ABFT designs discussed in later sections perform error
detection only.
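Although the designs evaluated below perform detection only, the correction step just outlined is straightforward. The sketch below (NumPy, with illustrative names) locates a single faulty element at the intersection of the mismatching row and column checksums and rebuilds it from the stored column checksum:

```python
import numpy as np

# Single-element ABFT correction: find the faulty element at the intersection
# of the mismatching row and column checksums, then restore it from the stored
# column checksum minus the sum of the healthy elements in that column.

def correct_single_error(C_F):
    """C_F is an (N+1) x (N+1) full checksum matrix; fixes one bad element."""
    N = C_F.shape[0] - 1
    bad_rows = np.flatnonzero(C_F[:N, N] != C_F[:N, :N].sum(axis=1))
    bad_cols = np.flatnonzero(C_F[N, :N] != C_F[:N, :N].sum(axis=0))
    if bad_rows.size == 1 and bad_cols.size == 1:
        r, c = bad_rows[0], bad_cols[0]
        others = C_F[:N, c].sum() - C_F[r, c]     # column sum without the fault
        C_F[r, c] = C_F[N, c] - others            # rebuild from the checksum
    return C_F
```

As noted above, this scheme assumes a single faulty element; specific multi-element error patterns require weighted checksums to localize.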
4.1.3 Resource-Overhead Experiments
In this section we analyze the overhead of the architectures presented in Section
4.1.2 and compare them to traditional fault-tolerance mitigation strategies. For this
analysis, we use a 32-bit integer MM module and vary the number of processing
elements and the type of parallelization. These MM modules can perform computation
on matrices up to 32 × 32 elements in size, limited only by the amount of BRAM
dedicated to storage. We compare fine-grained and coarse-grained parallel MM designs
with several fault-tolerant designs (TMR, ABFT, and hybrid TMR/ABFT). The baseline
and ABFT designs were synthesized using the Xilinx synthesis tool while the TMR
designs used Synplify Premier. Each design was implemented on a Xilinx ML605
development board with a Virtex6-LX240T FPGA. The results of this comparison, with 1
processing element per design, are shown in Table 4-1, and will be discussed below.
4.1.3.1 Resource overhead of serial architectures
The baseline (non-fault-tolerant) MM architecture with a single processing
element uses 3 BRAMs to store input and output data. Matrix A and B are each
stored in separate BRAM components in order to allow simultaneous accesses,
while matrix C only requires a single BRAM. DSP48 components are reserved for
the multiply-accumulate unit. The multiply-accumulator can be implemented in 3 DSP48
units (3 per 32-bit MAC). All remaining addressing and indexing logic (counters, state
machines, etc.) is implemented using slices (LUTs and FFs).
The ABFT-Shared design does not require any additional BRAM or DSP48 units
over the baseline design. The additional logic needed to handle addressing matrices
during checksum generation and verification increases the number of required slices by
14%. The ABFT-Extra design uses 100% more DSP48s and 32% more slices for the
additional MAC; no additional BRAMs are needed.
For comparison, Table 4-1 also shows the resource usage of a TMR design. The
serial TMR design has 146% overhead as compared to the baseline design. This
TMR design was created using the high-reliability features of the Synplify Premier
synthesis tool, which inserts low-level TMR voting into the ABFT-MM core’s netlist.
Alternative methods for creating TMR designs are available from Xilinx [Xilinx 2004],
BYU-LANL [Pratt et al. 2006], Mentor Graphics [Mentor Graphics 2013], and others. As
expected, 200% more DSPs and BRAMs are required for TMR. However, slice usage in
the TMR design increased less than expected. This difference may be caused by
optimization during the Xilinx mapping and place-and-route processes. Since a Xilinx
logic slice has many internal components (multiple lookup tables), additional logic may
be packed into partially-used slices, reducing the need for additional slices.
We also examine a hybrid design based on the ABFT-Extra design which uses
TMR on the address generator and all state machines within the design, but only uses
ABFT along the data path. This hybrid approach results in a fine-grained design that has
approximately 156% overhead for slices but only 100% overhead on the limited DSP48
resources.
4.1.3.2 Resource overhead of parallel architectures
As additional processing elements are added to the existing serial MM designs,
the resources required for the data path should increase much more than the control
path. This asymmetric scaling results in resource overhead being dependent on the
amount of parallelism. Figure 4-4 and Figure 4-5 compare the slice overhead of each
of the designs discussed in Section 4.1.3.1 while varying the number of processing
elements from 1 to 32, which is a fully parallel implementation for the 32× 32 MM. Figure
4-6 compares the DSP48 resource overhead for each of the designs. For each parallel
design, both coarse-grained and fine-grained parallelizations are examined. Overhead
of each fault-tolerant design is calculated based on the resource usage of the equivalent
non-fault-tolerant design.
In the serial designs, the overhead of both fine-grained and coarse-grained
ABFT-Extra designs is higher than that of the ABFT-Shared designs. However, the ABFT-Extra
slice overhead decreases for highly parallel designs. Alternatively, the ABFT-Shared
designs show the opposite trend. For modules with a large number of processing
elements, the ABFT-Shared control logic can require significantly more slices than the
baseline, unprotected design. The overhead in the ABFT-Shared designs is attributed
to additional multiplexers needed to correctly share access to the design’s processing
elements. Each of the explored ABFT designs has less than 60% slice overhead.
The overhead of the TMR designs ranges from 65% to 150% additional slices,
significantly lower than the expected 200% overhead. In the fine-grained designs,
relative overhead was smaller in the highly-parallel implementations. However, for
the coarse-grained designs, the overhead increased. The TMR design has a higher
overhead (200%+) of LUTs; however, each slice is composed of multiple LUTs, and a
slice is considered used if any of the LUTs are used. The TMR design can have a lower
slice overhead when the design can be more efficiently packed into slices by using more
LUTs per slice. The fine-grained architecture is able to take advantage of the efficient
packing, while the coarse-grained architectures do not. For large designs (e.g., 32
PEs), the number of fully-used slices in the original design increases, increasing TMR
overhead by requiring additional slices for the TMR logic.
The slice overhead of the hybrid ABFT/TMR design is comparable to the TMR
design. Designs with many processing elements have less resource overhead than
smaller designs. The fine-grained design’s slice overhead scales much more favorably
than the coarse-grained design, which is correlated to the slice overhead for the TMR
design. This slice overhead could be reduced by more selectively choosing which
portions of logic to protect with TMR.
The limiting resources for the MM designs are the DSP48 and BRAM components.
Figure 4-6 shows the overhead of DSP48 and BRAM components for the various
designs. The results are identical for the fine-grained and coarse-grained parallelizations.
As more processing elements are used in the ABFT-Extra MM module, the DSP48
overhead created by the additional ABFT MAC unit becomes extremely low (3.125%
with 32 processing elements). The ABFT-Shared design requires zero additional
DSP48 units. Additionally, both the ABFT-Shared and ABFT-Extra methods do not
require additional BRAMs. The Hybrid design has the same DSP48 overhead as the
ABFT-Extra design because only control logic is replicated. The DSP48 and BRAM
overhead for the TMR design was expected to be 200%, independent of the amount
of parallelism. However, the BRAM usage actually increased by less than 100% for
non-serial designs due to the underlying FPGA architecture, the design size, and
optimizations performed by Synopsys' synthesizer. The Virtex-6 BRAM can be used
as one 1024-element RAM or as two 512-element RAMs. The Xilinx tool will only infer
the larger, 1024-element RAMs, while Synopsys can more efficiently use the smaller
512-element RAMs. Therefore, the 200% overhead is mitigated by the need for half as
many large BRAMs, and the BRAM overhead will asymptotically approach 50% with
more processing elements.
From a resource overhead perspective, the ABFT designs have the desired
properties of a low-overhead fault tolerance method. All of the examined ABFT designs
exhibited a slice overhead of less than 60%, with extremely low DSP48 resource
overhead. The hybrid ABFT/TMR should provide additional fault tolerance while not
requiring replication of the limited DSP48 resources on the FPGA. In the next section,
fault-injection testing will determine the effectiveness of the designs discussed in this
section.
4.1.4 Fault-Injection Experiments
While Section 4.1.3 has shown that ABFT provides lower overhead compared to
other fault-tolerance strategies, the reliability of ABFT must also be evaluated. In order
to validate our ABFT design, faults must be injected into an executing system. In this
section, we present FPGA fault-injection results gathered using the Simple, Portable
Fault Injector (SPFI) [Cieslewski et al. 2010]. SPFI performs fault injection by modifying
FPGA configuration frames within a design’s bitstream, re-programming the FPGA, and
comparing the resulting output against known values. Partial reconfiguration is used to
reduce the time required to modify configuration memory and to improve the speed of
fault injection. However, due to the length of time required to exhaustively test an entire
FPGA design, statistical sampling is used to estimate the total number of vulnerable bits
in a given design.
We implemented multiple fault-tolerant designs discussed in Section 4.1.2 on a
Xilinx ML605 FPGA development platform. Each design used a 32-bit integer MM
module with up to 32 processing elements. A UART was also implemented on the FPGA
to stream input test vectors to the MM module and to report results back to a verification
program. During the design phase, the MM module is constrained to a small portion of
the FPGA. The SPFI fault-injection tool enables targeted injections, allowing the UART
to be avoided during fault injection.
Table 4-2 shows the fault-injection results for each of the fault-tolerant MM designs
from Section 4.1.2. For each design, the table indicates the number of injections
performed, the number of undetected data errors, and system hangs detected. The
measured percentage of faults is then scaled to the size of the injection area to estimate
the total number of vulnerable bits. In this analysis, vulnerable bits are the total number
of bits which cause silent (undetected) data errors or system hangs. For ABFT designs,
bits resulting in false-positive results were not considered vulnerable, since they do not
allow faulty data to be propagated back to the host system. The fault vulnerability
for each component is divided by the FPGA’s total number of configuration bits to
estimate the components’ design vulnerability factor (DVF). A design’s configuration
DVF represents the percentage of configuration bits that are vulnerable to faults and
can result in errors. The total DVF of a component includes vulnerable configuration bits
and BRAM data bits which may affect computation results. Reliability of a given design
can then be calculated from the FPGA’s total fault rate scaled by the design’s DVF. Most
Xilinx FPGA designs have a DVF that ranges from 1% to 10% [Xilinx 2010b] due to the
large amount of configuration memory devoted to routing.
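As a sketch of the arithmetic (the configuration-memory size and device upset rate below are assumed placeholder values, not figures from the test platform):

```python
def dvf_percent(vulnerable_bits, total_config_bits):
    # DVF: percentage of configuration bits whose upset causes an error
    return 100.0 * vulnerable_bits / total_config_bits

def design_fault_rate(device_upset_rate, dvf):
    # Effective design error rate: device upset rate scaled by the DVF
    return device_upset_rate * dvf / 100.0

# Example with an assumed 57 Mb configuration memory and an assumed
# device rate of 1.0 upsets/day:
dvf = dvf_percent(14_802, 57_000_000)   # ~0.026%
rate = design_fault_rate(1.0, dvf)      # ~0.00026 errors/day
```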
4.1.4.1 Design vulnerability of serial architectures
The non-fault-tolerant baseline MM design has an estimated 14,802 vulnerable
configuration bits. The majority of these vulnerable bits results in undetected data errors.
Other bits cause the system to become unresponsive, or hang. These system hangs
may be caused by errors in the MM state machine logic, or by errors in FPGA routing
logic which prevents connections from the MM module to the communication UART.
The baseline design is small and uses less than 0.5% of the total logic slices available
on the FPGA. However, the configuration DVF of the baseline design is 0.026%. When
vulnerable data stored in BRAMs is also included, the total DVF of the baseline design is
0.198%.
The ABFT-Shared design has an estimated 8,756 vulnerable bits (approximately
60% of the unprotected design) and a DVF of 0.016%. The vulnerable bits can affect
address generation, the checksum generation or verification, or the error detection
status register. BRAM data bits are initially unprotected, but are effectively safe once
the ABFT checksum generation process has taken place. If the input BRAMs, prior to
the checksum calculation, are considered vulnerable, the ABFT-Shared design has a
DVF of 0.127%. In the ABFT-Extra design, the independent MAC unit for checksum
generation and verification was expected to isolate faults in the main data path from
the ABFT checksum calculations, leading to a more reliable design. However, in the
comparison of serial designs, the two ABFT designs had similar vulnerabilities, with the ABFT-Shared design having 5% fewer vulnerable bits than the ABFT-Extra design. Both ABFT designs experience a similar number of data errors, but more faults cause the ABFT-Extra design to hang.
The TMR design has an estimated 1,390 vulnerable bits and a total DVF of 0.002%.
The majority of these bits are the result of routing faults or TMR majority voter faults.
This result represents a realistic lower bound on total vulnerability of the MM design.
The TMR reliability results in Table 4-2 were obtained by using the Synplify Premier
high-reliability tool, which resulted in significantly more reliable designs than a naive
VHDL-level TMR approach, since the tool performs TMR with finer granularity and more
frequent voting.
The hybrid ABFT design has approximately 30% lower vulnerability than the
ABFT-Extra and ABFT-Shared designs, but higher vulnerability than the TMR design.
However, the hybrid design has lower resource usage compared to the TMR design
(see Table 4-1). For applications where BRAM or DSP48 resources are constrained, the
hybrid ABFT design provides a good compromise between low vulnerability (0.012%
configuration DVF) and low DSP48 usage (100% overhead).
4.1.4.2 Design vulnerability of parallel architectures
Although the serial implementations of the ABFT designs demonstrated a modest
reduction in vulnerability, the benefits of ABFT are expected to be more visible in parallel
implementations. With multiple processing elements, faulty elements will only affect
a subset of the output data, improving fault localization. Figures 4-7 and 4-8 show
the number of vulnerable bits for each design while varying the number of processing
elements.
In each of the baseline designs, the number of vulnerable bits increases linearly
with the amount of parallelization. Intuitively, as the total area required for the design
increases, the number of vulnerable bits increases proportionally. The coarse-grained
design has a higher vulnerability per processing element than the fine-grained design
due to larger resource requirements.
As more processing elements are used, the vulnerability of the ABFT designs
increases, up to the 16-PE design. The rate of increase is much slower than that of the baseline designs, so the improvement in reliability grows with the amount of parallelism employed. Additionally, in the fully-parallel 32-PE designs, the
ABFT designs exhibit their highest reliability. When fully parallelized, the control logic for
the MM calculation is significantly simplified, resulting in fewer system hangs. For each
of the parallel implementations, the ABFT-Shared design was more vulnerable than the
ABFT-Extra design, with 20%-50% more vulnerable bits. Meanwhile, there was not a
significant difference in vulnerability between the coarse-grained and fine-grained ABFT
designs. Any benefits from fault localization in the coarse-grained design were offset by
the increased resource requirements.
As expected, the TMR design has significantly fewer vulnerable bits than any of
the other designs. Additionally, as the number of processing elements scales up, the
number of vulnerable bits stays constant.
Finally, the hybrid ABFT design has a similar fault vulnerability as the ABFT-Extra
design upon which it was based. The fine-grained hybrid design has lower vulnerability
for implementations with up to 4 processing elements. For larger designs, the vulnerabilities
are similar. For the coarse-grained designs, the hybrid design has 5%-15% fewer
vulnerable bits than the ABFT-Extra design. In both cases, the number of false positives
and system hangs are up to 70% lower in the hybrid design.
4.1.5 Analysis of Matrix-Multiplication Architectures
The DVFs of the serial matrix-multiplication designs tested in this section are very low, due to their small resource requirements in comparison to the size of the FPGA
being used. However, the effectiveness of each design is readily measurable. While the
TMR design is the most reliable design, the serial hybrid ABFT design does reduce the
number of vulnerable bits by 56% over the baseline design. For the 32-MAC parallel
design, the fine-grained hybrid ABFT design has 98% fewer vulnerable bits than the
baseline design, while only incurring 25% slice overhead, 3.125% DSP48 overhead, and
0% BRAM overhead.
The observed errors were categorized into three types: silent data errors, system
hangs, and false positives. While silent data errors must be reduced as much as
possible, system hangs and false positives are correctable within a system-level
reliability framework. Each of the tested designs, including the TMR design, experienced
system hangs which prevented the system from returning data. In order to prevent
system hangs, an external watchdog timer may be employed to reset the system,
reconfigure the FPGA, and resume processing. Additionally, each of the ABFT designs
experienced false positive results. False positives cannot be distinguished from true errors at run-time, so the error flag triggers the recovery procedure, causing an FPGA reconfiguration and re-computation of the last data set. By reducing false positives,
the performance overhead incurred by recomputing data can be reduced. The hybrid
ABFT design experienced up to 70% fewer false positives than the ABFT-Shared or
ABFT-Extra designs. Additionally, an examination of result matrices with data errors
revealed that many faulty matrices contained all zero values, producing an incorrect
result with a valid (zero) checksum. It may be possible to further improve the reliability
of the ABFT designs by modifying the ABFT encoding matrix to ensure that output
matrices cannot contain all zeros.
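This aliasing is easy to reproduce with a plain summation checksum; a minimal sketch, assuming the last row of the result matrix holds the column checksums:

```python
def verify_checksum(matrix):
    """Rows 0..n-2 are data; the last row holds the column checksums
    (plain summation encoding, as in the ABFT designs above)."""
    *data, checksum_row = matrix
    expected = [sum(col) for col in zip(*data)]
    return expected == checksum_row

good = [[1, 2], [3, 4], [4, 6]]       # checksum row = column sums
corrupt = [[1, 2], [3, 5], [4, 6]]    # single data error -> detected
zeroed = [[0, 0], [0, 0], [0, 0]]     # wiped output -> passes anyway
```

A fault that zeroes the entire result, checksum row included, verifies successfully because the expected and stored checksums are both zero.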
4.2 Fast Fourier Transform
The Fast Fourier Transform (FFT) is another key kernel in many space-based
applications, such as synthetic-aperture radar and beamforming. While the FFT can be
computed using traditional general-purpose processors or DSPs, the FFT has an efficient,
high-throughput FPGA implementation. Unlike matrix multiplication, the FFT does not
naturally fit into the traditional ABFT framework. In Section 4.2.1, the mathematical
description of our block-based ABFT approach for the FFT is presented. In Section
4.2.2, multiple FFT architectures are discussed for use within a generic ABFT-enabled
FFT architecture. Section 4.2.3 analyzes the resource overhead tradeoffs for each of
the proposed architectures. Section 4.2.4 presents an analysis of fault-injection results
obtained from each of the proposed architectures. Finally, Section 4.2.5 summarizes the
results from the FFT analysis.
4.2.1 Checksum-Based ABFT for FFTs
Since the Fourier Transform is a linear operator, the properties of linearity can be
used to show that ABFT techniques can be applied to the Fast Fourier Transform. For
two vectors a and b, the following equality holds for all linear operators (including the
FFT):
F(a + b) = F(a) + F(b) (4–5)
Using this property we can create a block-based ABFT technique that can be used
to detect errors in blocks of FFTs. Instead of performing detection on each input vector,
we form a matrix A composed of individual signal vectors (x_0, x_1, x_2, …). Following the
approach in Section 4.1, we can then augment this matrix with an additional checksum row by using the same encoding vector E_N as in Equation 4–1.
A = [ x_0 ; x_1 ; … ; x_{N-2} ; x_{N-1} ]   (4–6)

A_C = [ A ; E_N^T · A ] = [ A ; Σ_{i=0}^{N-1} x_i ]   (4–7)

(Rows of the bracketed matrices are written top to bottom, separated by semicolons.)
By performing an FFT on each row of the augmented matrix A_C, we produce an output matrix B_C which contains the transformed values of each input vector and
the transformed checksum vector. The transformed checksum vector will be a valid checksum for the output matrix, as shown in Equation 4–8.
F(A_C) = [ F(x_0) ; … ; F(x_{N-1}) ; F(Σ_{i=0}^{N-1} x_i) ]   (4–8)

B_C = [ X_0 ; … ; X_{N-1} ; Σ_{i=0}^{N-1} X_i ]   (4–9)
Using the property of linearity described in Equation 4–5, the checksum rows of F(A_C) and B_C are equivalent. In the event of an error during computation, the error will be
detectable, although more complex encoding vectors are required to detect which vector
is faulty.
The absolute overhead of using this ABFT technique is dependent on the number
of vectors used per ABFT block. Given N vectors of M elements, the additional
overhead of checksum generation and verification is O(NM), while the additional core
computation is O(Mlog(M)). As a function of ABFT block size, the relative overhead of
ABFT is O(1/N).
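The block-checksum property can be demonstrated numerically with a naive DFT standing in for the hardware FFT core (a floating-point sketch, not the dissertation's FPGA implementation):

```python
import cmath

def dft(x):
    """Naive O(M^2) DFT, standing in for the FFT core."""
    m = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / m)
                for k in range(m)) for j in range(m)]

def block_abft_ok(vectors, inject_fault=False, tol=1e-9):
    checksum_in = [sum(col) for col in zip(*vectors)]   # E_N^T * A
    outputs = [dft(v) for v in vectors]                 # rows of B_C
    if inject_fault:
        outputs[0][0] += 1.0                            # simulated upset
    lhs = dft(checksum_in)                              # F(sum of x_i)
    rhs = [sum(col) for col in zip(*outputs)]           # sum of X_i
    return all(abs(a - b) <= tol for a, b in zip(lhs, rhs))

block = [[1, 2, 3, 4], [0, 1, 0, -1], [2, 0, 2, 0]]
```

By linearity, the DFT of the summed inputs matches the sum of the individual DFT outputs, so any fault injected into an output vector breaks the equality.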
4.2.2 FFT Architectures
Multiple FFT modules were created to examine the reliability of hardware-based
ABFT. Unlike the previous matrix-multiplication architectures, this analysis uses
pre-built, single-precision floating-point operators to construct each design. Commonly,
pre-existing IP is used to reduce development costs by improving design and verification
time. The FFT operators and floating-point arithmetic operators used in the following
sections are pre-existing IP generated using the Xilinx CoreGen tool [Xilinx 2013a]. The
Xilinx FFT operator has several possible architectures including a high-performance
pipelined architecture and a minimal-hardware serial architecture which will be
compared in this section. The ABFT logic for each design was generated by combining
these pre-existing IP cores to perform the checksum operations. This section discusses
trade-offs of each FFT architecture and the design decisions that were made for the
ABFT architecture.
4.2.2.1 Radix-2 Burst-IO FFT architecture
The basic, serial architecture for the Fast Fourier Transform consists of a
single butterfly operator (complex multiplication and addition), an address generation
module, and two data storage modules (BRAM) as shown in Figure 4-9(a). One memory
is used to store the input matrix A, and one memory is used for the resulting output
matrix B. The address generator iterates through the correct matrix indices, sending
data stored in the input BRAMs to the FFT operator, and generates the appropriate
address for output values. This FFT module can be used for any size matrix as long
as enough data storage is supplied. For an M-point FFT, this architecture requires
O(Mlog(M)) cycles to calculate an output vector.
4.2.2.2 Radix-2 Pipelined FFT architecture
The pipelined FFT architecture has the same high-level architectural diagram
as in Figure 4-9(a). However, internally it uses log(M) butterfly operators to enable
high throughput. The latency of the pipelined architecture is approximately O(2M)
cycles, and the time to compute N vectors is O(NM) cycles, resulting in a significant
improvement in processing speed while calculating multiple, consecutive FFTs.
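As a rough cycle-count model of the two architectures (asymptotic constants simplified, so this is a sketch rather than a cycle-accurate model):

```python
import math

def burst_io_cycles(n_vectors, m_points):
    # One butterfly unit: ~M*log2(M) cycles per M-point FFT
    return int(n_vectors * m_points * math.log2(m_points))

def pipelined_cycles(n_vectors, m_points):
    # log2(M) butterfly stages: ~M cycles per FFT once the pipeline
    # fills, plus an ~2M-cycle initial latency
    return int(2 * m_points + n_vectors * m_points)

# 16 consecutive 64-point FFTs:
serial = burst_io_cycles(16, 64)   # 6144 cycles
piped = pipelined_cycles(16, 64)   # 1152 cycles
```

For blocks of consecutive FFTs, the pipelined architecture's advantage approaches the log2(M) factor as the fill latency amortizes.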
4.2.2.3 Architectural modifications for ABFT
The addition of ABFT logic requires the creation of two functions, ABFT checksum
generation and ABFT checksum verification, each requiring a floating-point accumulator.
Unlike the matrix-multiplication case study, the ABFT checksum logic is significantly
different from the algorithm’s main computation, and the logic must be implemented
separately. Checksum generation sums each column of matrix A using a floating-point
accumulator and writes the checksum into a checksum BRAM. After the FFT operation,
the checksum verification function sums the columns of matrix B and writes the value
to a checksum BRAM. Then, the checksum verification function uses a floating-point
comparison operator to compare checksum values in the checksum BRAMs. If a
mismatch is detected, an “error found” signal is asserted until the module is reset. In
order to reduce the BRAM requirements, the last row of the matrix A and B BRAMs may
be reserved for checksums. In this case, the initially generated checksum will be written
to the matrix A BRAM and the verification checksum will overwrite that checksum value
after computation is complete. Figure 4-9(b) shows an example of an ABFT-enabled FFT
architecture where checksums are stored within the matrix A and B BRAMs.
Due to the nature of floating-point arithmetic, the calculated checksums may not
be exact. Quantization error is introduced due to the order of additions as well as
the internal rounding within the FFT module. For most applications using FFTs, such
as image processing, data errors that occur in low-significance bits may be safely
ignored, and errors that avoid detection should not significantly impact the higher-level
application. In order to handle errors in the highly significant bits, ABFT detects errors by
comparing the difference of the two generated checksums to a user-defined threshold
value. For maximum coverage, this threshold should be set as close to zero as possible.
Unfortunately, determination of the threshold value is both algorithm-specific and
data-specific. In order to estimate the required threshold values while calculating blocks
of FFTs, MATLAB simulations were used to measure the maximum quantization error
encountered while using various input data sets. For each simulation, test vectors
were generated using a uniform random distribution on the interval (−x/2, x/2), where x
is the range of the input data. Figure 4-10 shows the required threshold to avoid false
positives for the block-based ABFT for FFT while varying the range of the input data.
The observed linear trend simplifies the selection of an acceptable threshold value.
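A range-proportional threshold can then be applied when comparing checksums; the slope below is only an illustrative value read from the scale of Figure 4-10, not a fitted constant from the dissertation:

```python
def abft_threshold(input_range, slope=3.0e-5):
    # Linear model: worst-case quantization error in the checksums
    # grows in proportion to the input data range
    return slope * input_range

def checksums_agree(expected, observed, threshold):
    # Declare an error only when the mismatch exceeds the threshold
    return all(abs(a - b) <= threshold
               for a, b in zip(expected, observed))

thr = abft_threshold(1000)   # 0.03 for inputs spanning (-500, 500)
```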
4.2.3 Resource-Overhead Experiments
In this section we analyze the overhead of the ABFT architectures presented in
Section 4.2.2 and compare them to traditional fault-tolerance mitigation strategies. For
this analysis, 64-point FFTs will be used, enabling up to 16 complex input vectors to fit
within two BRAM components. The floating-point Burst-IO and Pipelined FFT modules
are generated using the Xilinx CoreGen utility. We compare the two baseline FFT
architectures with several fault-tolerant designs (TMR, ABFT, and hybrid TMR/ABFT).
The results of this comparison are shown in Table 4-3.
4.2.3.1 Resource overhead of Burst-IO architecture
In the baseline Burst-IO architecture, 4 BRAMs are used to store input and output
data. The FFT module uses an additional 4 BRAMs to hold intermediate values during
processing. Additionally, 8 DSP48 units are used for the FFT butterfly operation
computation. The ABFT design requires 4 additional DSP48 units and 260 slices for
the floating-point addition, floating-point comparison, and control logic functionality,
representing 42% slice overhead and 50% DSP overhead. Unlike the MM case study,
the FFT TMR design has approximately 200% overhead of each of the slice, BRAM,
and DSP resources. Since the FFT components are assembled from pre-built netlists
produced by CoreGen, little optimization is performed during synthesis, resulting in the
expected amount of resource overhead. In the hybrid FFT design, each of the state
machines controlling FFT or ABFT components in the design are protected using TMR
while the FFT and ABFT floating-point operators remain unchanged. The hybrid design
has 48% slice overhead and 50% DSP overhead.
For larger FFTs (e.g., 128+ points), the Burst-IO DSP48 resource requirements
remain constant while additional BRAMs are required for intermediate result storage.
The 50% resource overhead of the ABFT design is independent of the parameters of the
FFT operator.
4.2.3.2 Resource overhead of Pipelined architecture
In the baseline Pipelined architecture, 4 BRAMs are used to store input and output
data. The FFT module uses 16 DSP48 units for the FFT butterfly computation and one
additional BRAM to hold intermediate values during processing. The Pipelined FFT uses
almost twice as many slices and DSP48s, but fewer BRAMs than the Burst-IO version.
The ABFT design increases the resource requirements by the same amount as the
Burst-IO design (i.e., 4 DSP48s and 260 slices). Since the Pipelined design is approximately
twice as large as the Burst-IO design, the overhead is correspondingly lower at 23%
slice overhead and 25% DSP overhead. The TMR design has approximately 200%
overhead of each of the slice, BRAM, and DSP resources. The Pipelined hybrid design
uses the same methodology as the Burst-IO hybrid design and has only 26% slice
overhead and 25% DSP overhead.
For larger FFTs (e.g., 128+ points), the size of the Pipelined design increases
because more pipeline stages and butterfly operators are required. As the size of the
FFT increases, the ABFT requirements remain static, and the ABFT resource overhead
is reduced. For example, a 1024-point FFT would require 34 DSP48 components, so the 4 DSP48 components for the ABFT checksums add only 12% overhead to the baseline design.
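A quick check of this overhead arithmetic (DSP48 counts taken from Table 4-3 and the 1024-point estimate above):

```python
def abft_dsp_overhead_pct(fft_dsp48, abft_dsp48=4):
    # ABFT adds a fixed 4 DSP48s regardless of FFT size, so its
    # relative overhead shrinks as the FFT core grows
    return 100.0 * abft_dsp48 / fft_dsp48

overhead_64pt = abft_dsp_overhead_pct(16)     # 25.0%
overhead_1024pt = abft_dsp_overhead_pct(34)   # ~11.8%, i.e. ~12%
```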
4.2.4 FFT Fault Injection
In order to validate our ABFT FFT designs, faults must be injected into an executing
system. In this section, we present FPGA fault-injection results gathered using the
methodology described in Section 4.1.4. For the ABFT designs, the value for the
checksum error threshold was calculated from the input test vectors during a fault-free
processing run. Table 4-4 shows the fault-injection results for each of the fault-tolerant
FFT designs from Section 4.2.2. For each design, the table indicates the number of
injections performed, the number of undetected data errors, and system hangs detected.
The measured percentage of faults is then scaled to the size of the injection area to
estimate the total number of vulnerable bits. In this analysis, vulnerable bits are the total
number of bits which cause silent (undetected) data errors or system hangs. For ABFT
designs, bits resulting in false-positive results were not considered vulnerable, since they
do not allow faulty data to be propagated back to the host system.
The Burst-IO baseline design has an estimated 48,012 vulnerable configuration
bits and a configuration DVF of 0.084%. The total DVF includes an additional 131,072
memory bits in BRAM, increasing the baseline design’s total DVF to 0.314%. In the
baseline design, 80% of the configuration memory faults cause data errors while 20%
cause system hangs. The Burst-IO ABFT design reduces the total number of vulnerable
configuration bits by 80%, to 9,624. Additionally, in the ABFT design, the Matrix A BRAM
data bits are only vulnerable before the checksum value has been computed and the
Matrix B BRAM values are always protected, improving the total DVF of the design to
0.125%. Silent data errors, the most dangerous error type, are reduced by 95% over
the baseline design. The distribution of error types in the ABFT design is reversed from
the baseline design’s distribution; 80% of configuration faults result in system hangs
while only 20% cause data errors. The hybrid ABFT/TMR design reduces the design
vulnerability by an additional 30%, to only 6,662 vulnerable bits. The hybrid design
decreases the occurrence of system hangs more than data errors. Finally, the TMR FFT
design has 3,278 vulnerable bits, 94% fewer vulnerable bits than the baseline design,
while protecting all BRAM data bits.
The Pipelined baseline design is much more vulnerable than its Burst-IO counterpart, with 182,192 vulnerable configuration bits (0.320% configuration DVF).
The increased vulnerability comes from the larger resource requirements and increased
routing complexity of the pipelined architecture. The Pipelined ABFT design reduces
data errors by 97% but only reduces system hangs by 33%. The Pipelined ABFT
design’s error-type distribution is more skewed than the Burst-IO case; 85% of errors
result in system hangs and 15% result in data errors. The Pipelined hybrid ABFT/TMR
design reduces the design vulnerability by an additional 20%, to 22,251 vulnerable
configuration bits. The hybrid design decreases the occurrence of system hangs more
than data errors. Finally, the pipelined TMR FFT design has 9,306 vulnerable bits, 95%
fewer vulnerable configuration bits than the baseline design, while also protecting all
BRAM data bits.
4.2.5 Analysis of FFT Architectures
The FFT architectures examined in this section are significantly more complex than
the matrix-multiplication architectures of Section 4.1. Due to algorithm complexity and
the use of floating-point precision operators, the resource requirements are much higher.
For example, the baseline Burst-IO design is 300% larger than the baseline MM design.
Additionally, the floating-point precision also increased the resource requirements of
the ABFT logic. However, even with the smaller Burst-IO architecture, the overhead of
the ABFT design was only 50%, significantly better than the TMR alternative. For the
pipelined FFT designs, larger FFTs do not increase the resources required by the ABFT
component, further improving resource overhead.
In each of the FFT designs, the number of vulnerable bits that cause system hangs
is much higher than in the MM case studies. The generated FFT components have
substantially more control logic than the MM designs, which can lead to system hangs
when upset. In the baseline FFT designs, most vulnerable bits result in corrupt data but
the proportion of vulnerable bits causing system hangs is higher than in the MM case
studies. In each of the fault-tolerant designs, more vulnerable bits result in system hangs
than corrupt data. While the hybrid design does not significantly reduce data corruption,
it does lower the occurrence of system hangs. Although TMR provides the highest
reliability designs, ABFT is effective in reducing undetected data errors. If a system-level
mechanism (e.g., watchdog timer) is used to recover from system hangs, the ABFT
designs can approach the reliability of TMR.
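The system-level recovery flow described here (watchdog or error flag, then reconfigure and recompute) can be expressed as a retry loop; `compute_once` and `reconfigure` are hypothetical placeholders for the platform-specific FPGA calls:

```python
def run_with_recovery(compute_once, reconfigure, max_retries=3):
    """compute_once() returns (result, error_flag); error_flag covers
    ABFT mismatches, watchdog timeouts, and detected hangs."""
    for _ in range(max_retries):
        result, error = compute_once()
        if not error:
            return result
        reconfigure()   # scrub or reprogram the FPGA, then retry
    raise RuntimeError("fault persisted after reconfiguration")
```

False positives cost only the reconfiguration and one recomputation, which is why reducing their rate lowers the performance overhead without affecting correctness.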
4.3 Conclusions
In this chapter, we have presented a novel analysis of ABFT for low-overhead fault
tolerance in FPGA systems. Several matrix-multiplication and FFT designs employing
TMR and ABFT fault-tolerance techniques were developed and tested using an FPGA
fault-injection tool. The results demonstrated that ABFT was capable of reducing
the number of vulnerable configuration bits in a design while also protecting most
memory bits. While the TMR fault-mitigation approach had the lowest vulnerability, a
hybrid ABFT/TMR design approach was able to lower configuration vulnerability while
maintaining low overhead. With matrix multiplication, configuration vulnerability was
reduced by 90% in highly-parallel ABFT architectures while using less than 20% slice
and DSP48 resource overhead. For FFT architectures, configuration DVF was lowered
by 85% while using 50% or less additional resources. As the size and parallelism
of each of the ABFT architectures increase, the relative effectiveness of ABFT also
increases.
Matrix Multiplication and Fast Fourier Transforms are the key kernels in many
space-based processing systems. The ABFT architectures described in this work
are generic and can be applied to many of these systems. Additionally, other linear
operations, such as LU or QR decomposition, can also be protected using ABFT.
The vulnerability results show that ABFT can be used as part of a comprehensive
fault-tolerance strategy for space-based FPGA systems. ABFT detects most data
errors that occur during processing. ECC techniques (e.g., Hamming codes, parity) can be used to protect data memory with minimal overhead. A higher-level
watchdog mechanism can be used to recover from system hangs and crashes. Many
of these techniques, in combination with TMR, are already in use in space systems.
By selectively removing replicated components and using ABFT in their place, space
systems can increase their processing capabilities without sacrificing reliability.
Table 4-1. Resource utilization and overhead of serial MM designs.

Fault         Slice  Slice     BRAM   BRAM      DSP48  DSP48
tolerance     count  overhead  count  overhead  count  overhead
None          158    –         3      –         3      –
ABFT-Shared   180    14%       3      0%        3      0%
ABFT-Extra    208    32%       3      0%        6      100%
TMR           388    146%      9      200%      9      200%
Hybrid        404    156%      3      0%        6      100%
Table 4-2. Serial matrix multiplication fault-injection results.

Design name   Faults    Data    System  Vulnerable          Configuration  Total
              injected  errors  hangs   config bits (est.)  DVF (%)        DVF (%)
Baseline      100,000   3,114   368     14,802              0.026%         0.198%
ABFT-Shared   100,000   1,056   633     8,756               0.016%         0.127%
ABFT-Extra    100,000   1,055   712     9,161               0.016%         0.127%
TMR           100,000   118     16      1,390               0.002%         0.002%
Hybrid ABFT   100,000   348     281     6,522               0.012%         0.123%
Table 4-3. Resource utilization and overhead of FFT designs.

Architecture  Fault      Slice  Slice     BRAM   BRAM      DSP48  DSP48
              tolerance  count  overhead  count  overhead  count  overhead
Burst-IO      None       607    –         8      –         8      –
Burst-IO      ABFT       867    43%       8      0%        12     50%
Burst-IO      TMR        1834   202%      24     200%      24     200%
Burst-IO      Hybrid     901    48%       8      0%        12     50%
Pipelined     None       1129   –         5      –         16     –
Pipelined     ABFT       1390   23%       5      0%        20     25%
Pipelined     TMR        3404   202%      15     200%      48     200%
Pipelined     Hybrid     1424   26%       5      0%        20     25%
Table 4-4. FFT fault-injection results.

Design name           Faults    Data    System  Vulnerable          Configuration  Total
                      injected  errors  hangs   config bits (est.)  DVF (%)        DVF (%)
Burst-IO - Baseline   100,000   1,809   461     48,012              0.084%         0.314%
Burst-IO - ABFT       100,000   98      357     9,624               0.017%         0.125%
Burst-IO - TMR        100,000   27      128     3,278               0.006%         0.006%
Burst-IO - Hybrid     100,000   80      235     6,662               0.012%         0.120%
Pipelined - Baseline  100,000   6,891   1,723   182,192             0.320%         0.550%
Pipelined - ABFT      100,000   180     1,112   27,327              0.048%         0.156%
Pipelined - TMR       100,000   68      97      9,306               0.016%         0.016%
Pipelined - Hybrid    100,000   162     890     22,251              0.039%         0.147%
Figure 4-1. Matrix-multiplication architectures. A) Baseline. B) Fine-grained parallel. C) Coarse-grained parallel.
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        C[i][j] = 0;
        for (k = 0; k < N; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

Figure 4-2. Matrix-multiplication pseudocode.
Figure 4-3. Matrix-multiplication ABFT-Extra architecture.
Figure 4-4. Slice overhead of fine-grained parallel matrix multiplication.
Figure 4-5. Slice overhead of coarse-grained parallel matrix multiplication.
Figure 4-6. DSP48 and BlockRAM overhead of parallel matrix multiplication.
Figure 4-7. Fault vulnerability of fine-grained parallel matrix multiplication.
Figure 4-8. Fault vulnerability of coarse-grained parallel matrix multiplication.
Figure 4-9. FFT architectures. A) Serial. B) ABFT.
Figure 4-10. Required ABFT threshold value for floating-point FFTs (required ABFT threshold vs. range of input data).
CHAPTER 5
RFT SYSTEM INTEGRATION FOR RAPID SYSTEM DEVELOPMENT
The RFT-based system presented in Section 3.1 is intended to be a reference
design that can be adapted to actual space-system architectures. In this chapter, we
investigate possible methods and/or tools that will improve the usability of RFT-based
systems for system developers to facilitate adoption of our framework. Our approach,
shown in Figure 5-1, has identified two areas of RFT system design that may be
improved through external tools. The first area, generating RFT hardware for arbitrary
system configurations, can be facilitated by creating multiple system templates for
various interconnection architectures. Specifically, we target architectural support in
order to provide compatibility with commonly used partial reconfiguration architectures,
such as VAPRES [Jara-Berrocal and Gordon-Ross 2010]. The second area, software-based
support for intelligent fault-tolerance mode selection, will combine fault-rate modeling
and task scheduling to optimize job placement for high performability. Research into
each of these areas will facilitate the adoption of the RFT framework into new and
existing space systems.
In this section, we outline our approach for improving hardware and software
support for RFT systems. Section 5.1 discusses RFT architectural templates, integration
of an RFT controller into an existing VAPRES PR-based system, and parameterized
VHDL designs for RFT controller logic. RFT controller generation and system integration
will reduce the time required to develop RFT systems by using existing designs with little
modification and dynamically creating new RFT-related components. Section 5.2
discusses reliable task scheduling, which will reduce the complexity of RFT software
requirements.
5.1 Dynamically Generated RFT Components
The RFT controller created for Section 3.1 was designed for a small RFT system
consisting of only three PRRs and connected to a host CPU through a bus-based
94
interconnect. However, for maximum flexibility, the RFT framework must support
systems with an arbitrary number of PRRs and interconnect types. Additionally, a
system design may decide to support only a subset of the fault-tolerance modes allowed
by the RFT framework. Dynamic RFT controller generation allows a system designer to
specify the size and supported features of an RFT controller customized for their specific
system. By disabling unneeded features, the overhead of the RFT controller can be
reduced.
The RFT controller can be divided into two categories: system interconnect
interfaces and fault-tolerance features. The type of system interconnect affects the
structure of the Address Decoder component. The fault-tolerance features affect the
Voting Logic, Output Mux, and Watchdog Timers components.
5.1.1 RFT Controller Point-to-Point Interface
The RFT hardware architecture from Section 3.1 was used as an example
bus-based PR architecture. However, to improve usability with pre-existing systems,
it is possible to adapt the RFT controller to be used in other PR architectures. By
adapting the RFT controller to be used within a point-to-point VAPRES PR system, it will
be possible to quickly provide fault tolerance to pre-existing VAPRES systems. Other
existing PR architectures can be classified as either bus-based or point-to-point.
The initial RFT controller connects to the rest of the system through the Processor
Local Bus (PLB). PRMs connected to the RFT controller must act as slave devices.
Slave devices can respond to bus requests, but cannot initiate transfers. For example,
the processor can read from a PRM’s memory, but the PRM cannot write a value directly
to the host processor’s memory. The RFT controller instantiates a slave PLB controller
to handle communication with the host processor. The PLB controller converts the bus
signals into a simplified set of user signals.
The primary difference between the system from Section 3.1 and a VAPRES-based
system is the use of Fast Simplex Links (FSL) for communications between the
MicroBlaze processor and the PRRs. An FSL template for the RFT controller must
be created in order to operate with a point-to-point architecture (VAPRES). An FSL link
is a FIFO queue that directly connects the MicroBlaze with user logic. The MicroBlaze
instruction set has specific intrinsic functions for reading and writing to the FSL links.
These direct links from the PRMs to the MicroBlaze increase throughput and reduce
latency from bus contention.
For an RFT controller operating in high-performance mode, data from an FSL could
be passed directly to its connected PRR. However, an FSL-based RFT controller must
be able to send data from an arbitrary FSL to any other FSL in order to operate
in TMR or DWC modes. For example, in TMR mode, any data written to FSL #0 must
also be simultaneously written to FSL #1 and FSL #2 to ensure correct functionality.
In order to accomplish this, the FSL-based RFT controller must include an additional
bi-directional FIFO queue for each PRR. The RFT controller must process all incoming
data, route the data to the appropriate PRM, and write any output data to the correct
output FSL. This requires an additional 2 BRAMs per PRR plus the necessary control
logic. Table 5-1 shows a comparison of the resource requirements for the bus-based
and point-to-point RFT controllers.
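The replication behavior of the FSL-based controller can be sketched in software. The following Python model is illustrative only; the actual controller is implemented in VHDL, and the function name, the FIFO representation, and the convention that TMR replicas occupy PRRs 0 through 2 are assumptions of ours:

```python
from collections import deque

def route_write(mode, src_fsl, word, prr_fifos):
    """Model of input routing in the FSL-based RFT controller: a word arriving
    on an FSL is appended to the input FIFO of each PRR holding a replica of
    the target module. (Replica grouping here is an assumed convention.)"""
    if mode == "tmr" and src_fsl == 0:
        targets = [0, 1, 2]       # replicate to all three TMR replicas
    elif mode == "dwc" and src_fsl == 0:
        targets = [0, 1]          # replicate to both DWC replicas
    else:
        targets = [src_fsl]       # high-performance mode: direct pass-through
    for t in targets:
        prr_fifos[t].append(word)
    return targets
```

A single write in TMR mode lands in three FIFOs, which is why the FSL-based controller requires a bidirectional FIFO (2 BRAMs) per PRR in addition to the routing control logic.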
Since most PR-based systems can be classified as either bus-based or point-to-point,
the two RFT controllers created in this work can serve as reference designs.
Although alternative buses or protocols may be used in future systems, the overall
architecture should remain applicable.
5.1.2 Parameterized and Configurable Voting Logic
While the RFT controller’s input- and output-address decoder hardware may be
dependent on the interface to the larger PR system, many other components are
only dependent on the number of PRRs in the design. These components, created
using VHDL, can be parameterized in order to be easily adapted to arbitrarily large
systems. The main parameter, NUM_PRRS, scales the RFT controller interfaces to the
appropriate size for the system. The parameterized components we examine are the
watchdog timers, RFT status registers, output mux, and voting logic.
The ENABLE_WATCHDOG parameter enables the creation of watchdog-timer
logic for each PRR in the design. Additionally, a watchdog configuration register is
created for each PRR. For PRMs that do not use the watchdog functionality,
the configuration register can be used to disable the timer. NUM_PRRS status registers
are also created.
There are several parameters related to the creation of voting logic within the
RFT controller. The ENABLE_TMR and ENABLE_DWC parameters create TMR or DWC voters,
respectively. One TMR voter is created for every three PRRs, and one DWC voter is
created for every two PRRs. By not creating $\binom{\mathrm{NUM\_PRRS}}{3}$ or
$\binom{\mathrm{NUM\_PRRS}}{2}$ voters (one for every possible PRR combination), respectively,
we reduce the total number of required resources at the expense of some flexibility.
Based on the parameterized voters, the output multiplexer can then select from the
complete enumeration of the possible RFT-mode combinations.
These simple parameters allow for the creation of a scalable RFT controller
design. By enabling or disabling specific fault-tolerance features, the RFT controller
can be targeted and optimized for specific architectures or applications. As system
size increases, the parameterized design automatically creates the logic required for
additional PRRs.
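The component counts implied by these parameters can be sketched as follows. This is a minimal Python model of what a generated controller would instantiate, not the VHDL generator itself; the function name and dictionary layout are our assumptions:

```python
def rft_controller_plan(num_prrs, enable_watchdog=True,
                        enable_tmr=True, enable_dwc=True):
    """Component counts a generated RFT controller would instantiate:
    one watchdog timer and one status register per PRR, one TMR voter per
    three PRRs, and one DWC comparator per two PRRs (rather than one per
    possible PRR combination)."""
    return {
        "watchdog_timers": num_prrs if enable_watchdog else 0,
        "status_registers": num_prrs,
        "tmr_voters": (num_prrs // 3) if enable_tmr else 0,
        "dwc_voters": (num_prrs // 2) if enable_dwc else 0,
    }
```

For example, a 12-PRR system with all features enabled yields 12 watchdog timers, 4 TMR voters, and 6 DWC comparators; disabling a feature removes its logic entirely.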
5.2 Task Scheduling for RC Systems in Dynamic Fault-Rate Environments
In order to optimize a system for performance and reliability, the fault-tolerant mode
must be selected carefully based upon the current fault conditions experienced by the
system. In this section, we propose a fault-tolerant scheduler that can schedule real-time
tasks and apply a heuristic to select an RFT fault-tolerance mode for each task.
We then evaluate the effectiveness of the proposed heuristic using two case-study
simulations.
5.2.1 Selection Criteria for Fault-Tolerant Mode
RFT mode switching can be triggered by a priori knowledge of the operating
environment, application-triggered events, or external events. In an RFT system, the
expected fault rate can be estimated either directly or indirectly. An external radiation
sensor can be directly interfaced with the FPGA, allowing the system to track the current
fault rate and predict future fault rates. Alternatively, the RFT system can indirectly
determine fault rates using models of the expected fault environment from Chapter 3. By
correlating the space system’s current position to an existing model, a fault-rate estimate
can be used to make scheduling decisions.
In the following sections we assume that tasks can be scheduled with no fault
tolerance (Simplex), duplication with compare (DWC), or triple-modular redundancy
(TMR). Data errors in tasks are detected by comparing or voting on the output of each
task replica. We also assume that the system can estimate the current fault environment
using a pre-existing model of the system’s orbit.
5.2.1.1 FT-mode selection using thresholds
One of the most straightforward methods for selecting an appropriate fault-tolerance
mode is the use of thresholds. At very high fault rates, TMR is required to maintain
reliability. At low fault rates, DWC or Simplex modes may provide sufficient reliability
while increasing performance. At each time step, the current fault rate is measured, and
new tasks are assigned based on the pre-selected rules. The optimal threshold occurs
at the fault rate where the reliable performance of TMR and DWC are equal. The ideal
scheduling heuristic will select TMR when the current fault rate is above the fault-rate
threshold, fthresh, and will select DWC otherwise. Selecting the appropriate values for the
threshold is dependent on the application and environment. Determining fthresh requires
information about fault rates, task frequency, task load, and other factors which may not
be static throughout a system’s operation.
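The threshold heuristic reduces to a single comparison. In this sketch the default of 0.0025 faults/sec is the crossover rate measured later in Section 5.2.4.1; the function name and the restriction to the DWC and TMR modes are assumptions of ours:

```python
def threshold_mode(fault_rate, f_thresh=0.0025):
    """Threshold heuristic: schedule new tasks in TMR when the current fault
    rate (faults/sec) exceeds f_thresh, and in DWC otherwise."""
    return "tmr" if fault_rate > f_thresh else "dwc"
```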
5.2.1.2 Time-resource metric for FT-mode selection
Instead of depending upon a user-defined threshold value to determine a fault-tolerance
strategy, we explore a possible metric which can estimate the optimal threshold. The
metric combines computation time (τ ) and fault probability of a single task (f ) in order
to select between FT modes. By developing a scheduling metric that incorporates a
dynamic fault rate, we intend to improve overall system performance without the need for
expert user input. Given the current fault rate, f , the reliability of a task at the end of its
computation time, R, is given by the following equations (depending on mode):
R_{\mathrm{Simplex}} = (1 - f)^{\tau}
R_{\mathrm{DWC}} = (1 - f)^{2\tau}
R_{\mathrm{TMR}} = 3(1 - f)^{2\tau} - 2(1 - f)^{3\tau}    (5-1)
If tasks are running in DWC or TMR mode, faults are discovered at the end of
the task’s computation time. Faulty tasks are then rescheduled until they complete
successfully. The average number of times a task must be executed in order to
successfully complete is 1/R; the effective execution time, τ_eff, is then given by the
following geometric series:

\tau_{\mathrm{eff}} = \sum_{n=0}^{\infty} \tau (1 - R)^{n} = \frac{\tau}{R}    (5-2)
We define a time-resource coefficient, α, which combines the effective execution
time with the required resources of a given task (α = N × τ_eff, where N is the number
of PRRs the task occupies). Then, by comparing the α of a DWC or TMR task, we can
determine which mode is optimal for reliability (lower is better). In Equation 5-3, we
solve for conditions where DWC will provide a lower α than TMR.
\frac{2\tau}{(1 - f)^{2\tau}} \leq \frac{3\tau}{3(1 - f)^{2\tau} - 2(1 - f)^{3\tau}}    (5-3)
Simplifying Equation 5-3 provides the following simple relation:

(1 - f)^{\tau} \geq \frac{3}{4}    (5-4)
Based on the definition of α, DWC provides more reliable performance than TMR
when RSimplex is greater than 0.75. For low fault rates, DWC provides higher overall
performance. For very high fault rates, or very long execution times, the reliability of
TMR scheduling is preferred. A similar analysis can be performed for simplex tasks;
however, simplex scheduling has no method for detecting faults. We use this conclusion
as the basis for an adaptive fault-tolerance threshold.
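Equations 5-1 through 5-4 can be checked with a short sketch. The function names below are ours, and ties at the crossover are broken in favor of DWC; this is a model of the metric, not the scheduler implementation:

```python
def reliability(mode, f, tau):
    """Per-task reliability at the end of computation time tau (Equation 5-1)."""
    r = (1.0 - f) ** tau                       # simplex (single-replica) reliability
    return {"simplex": r,
            "dwc": r ** 2,                     # both replicas must be fault-free
            "tmr": 3 * r**2 - 2 * r**3}[mode]  # voter masks one faulty replica

def alpha(mode, f, tau):
    """Time-resource coefficient: replicas x effective execution time tau/R."""
    n_replicas = {"simplex": 1, "dwc": 2, "tmr": 3}[mode]
    return n_replicas * tau / reliability(mode, f, tau)

def adaptive_mode(f, tau):
    """Pick the fault-detecting mode with the lower alpha; by Equation 5-4
    this is DWC exactly when (1 - f)**tau >= 3/4."""
    return "dwc" if alpha("dwc", f, tau) <= alpha("tmr", f, tau) else "tmr"
```

Because the comparison depends on the product of fault rate and execution time through (1 − f)^τ, a long task can be scheduled in TMR at the same fault rate at which a short task is scheduled in DWC.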
5.2.2 Scheduler for RFT
Traditional fault-tolerant scheduling algorithms assume that the fault rate experienced
by the system will be constant, and that the fault-tolerance strategy will also be constant.
For an RFT-based system, a scheduler which uses the current fault rate is necessary
to maximize system utilization while maintaining system availability. The fault-tolerant
scheduler presented in this section can schedule tasks in any FT mode based on
user-defined thresholds or the α-metric.
5.2.2.1 RFT architecture description
The RFT system described in Chapter 3 contains a microprocessor connected
to several large partially-reconfigurable regions (PRRs) through a shared system
bus. During normal system operation, unique tasks can be scheduled to any of the
PRRs. Depending on system configuration, the outputs of three contiguous PRRs can
be voted on to provide coarse-grained TMR functionality, or two PRRs can provide
DWC functionality. Each of these PRRs is large and identical in size, and can be
represented with a 1D area model, reducing many of the scheduling problems presented
in Section 2.7.
5.2.3 Software Simulation
In order to evaluate our scheduling technique and possible heuristics, a software-based
discrete-time simulator was developed in C++. The simulator enables us to specify
task arrival rates, task deadlines, dynamic fault rates, and scheduling algorithms. In
addition to scheduling tasks, the simulator can also inject faults into tasks and force
re-scheduling of failed tasks.
Figure 5-3 shows the basic overview of how the simulator is used. At each time
step, tasks are randomly added to a task pool. This process is modeled as a Poisson
process with mean λ_arrival. All tasks in the task pool are scheduled, if possible, and
then moved to the reservation list. When multiple tasks arrive simultaneously, tasks
are scheduled using an earliest-deadline-first (EDF) heuristic. The scheduler does not
employ preemption; tasks are scheduled on arrival only, and new tasks must be placed
around the existing schedule. Tasks which cannot be scheduled before their deadlines
are rejected. If a task is scheduled to begin at the current time step, the simulator moves
the task from the reserved list to the execution list.
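The arrival and ordering steps above can be sketched as follows, in Python rather than the simulator's C++. Knuth's multiplication method stands in for whatever Poisson sampler the simulator actually uses, and the dictionary task representation is hypothetical:

```python
import math
import random

def poisson_arrivals(lam, rng):
    """Number of task arrivals in one time step, drawn from a Poisson
    distribution with mean lam (Knuth's multiplication method)."""
    threshold = math.exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= threshold:
            return k
        k += 1

def edf_order(tasks):
    """Earliest-deadline-first ordering for simultaneously arriving tasks."""
    return sorted(tasks, key=lambda t: t["deadline"])
```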
After tasks have been scheduled, faults are injected into each PRR with probability
f in order to simulate the dynamic fault environment. At the end of the task’s execution,
the outcome of the task is determined based upon the number of faults encountered
(i.e., Simplex and DWC tasks fail with 1 fault, TMR tasks fail with faults in 2 or more
PRRs). Multiple faults within a single PRR have no additional effect on the system. If
the task fails, the scheduler then returns the task to the task queue to be re-scheduled.
All other reserved tasks (scheduled, but not yet executing) are also returned to the task
queue to be rescheduled. Fault-tolerant tasks are rescheduled until they successfully
complete or can no longer meet their deadline.
When tasks must be re-scheduled due to faults, they are treated as a new task for
scheduling purposes, although their original deadline is maintained. The fault-tolerant
mode for rescheduled tasks will be based on the fault rate at the time of rescheduling.
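The fault-injection and outcome rule can be expressed compactly. The function below is a modeling sketch, not the simulator's code; the per-replica Bernoulli draw reflects the assumption that a replica is simply faulty or not, since multiple faults within one PRR have no additional effect:

```python
import random

REPLICAS = {"simplex": 1, "dwc": 2, "tmr": 3}

def inject_and_judge(mode, f, rng):
    """Inject faults into each PRR used by a finishing task with per-execution
    probability f, then judge the outcome: Simplex and DWC executions fail
    with one faulty replica, while TMR fails only with faults in two or more
    PRRs. Returns True when the task completes successfully."""
    faulty = sum(1 for _ in range(REPLICAS[mode]) if rng.random() < f)
    if mode == "tmr":
        return faulty < 2   # voter outvotes a single faulty replica
    return faulty == 0      # any fault fails (DWC detects it; simplex does not)
```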
5.2.4 Analysis and Results
In the following analysis, task execution times (t_exec) are uniformly distributed in
[10, 100] time steps, with deadlines (t_deadline) in [100, 200] time steps. For simplicity, we
assume that a time step is 1 second. Simplex tasks use one processing region, DWC
tasks use two processing regions, and TMR tasks use three processing regions. The
simulated system uses 12 processing regions (N_PRRs) in order to enable flexibility in
placing TMR and DWC tasks. For each experiment, we measure the performance of
each metric with the scheduler’s guarantee ratio (percentage of total tasks scheduled
successfully) while attempting to schedule 100,000 tasks.
5.2.4.1 Constant fault rates
Initially, the simulator is used to get a fault-free baseline for comparison purposes.
Figure 5-4 shows the effect of arrival rate on the performance of the system. At
low arrival rates, Simplex, DWC, and TMR scheduling can all meet the system
demand. However, arrival rates higher than 0.06 tasks per second begin to impact
the schedulability of the TMR system because there are not enough resources to handle
all incoming tasks. Using DWC for fault tolerance will result in higher guarantee ratios
since more DWC tasks can be scheduled at any one time. The lack of a fault-tolerance
mechanism excludes the use of Simplex scheduling in the presence of faults.
In order to investigate the effect of fault rates on our system, we chose a constant
arrival rate of 0.075 tasks per second. With this arrival rate, the DWC system can
successfully schedule all incoming tasks, while the TMR system cannot. Using the
arrival rate in this way, we attempt to define a system which requires DWC to meet
performance demands but can temporarily use TMR to meet reliability constraints.
The effect on the guarantee ratio is shown in Figure 5-5. At low fault rates the DWC
mode provides higher throughput, while TMR outperforms DWC at high fault rates.
Our adaptive metric produces high throughput at low fault rates, closely tracking the
performance of DWC, but performs between TMR and DWC at intermediate fault rates.
At higher fault rates, the guarantee ratio of the adaptive heuristic produces results close
to TMR. From these constant-fault-rate results, an ideal-threshold heuristic can be
determined. The crossover fault rate for TMR and DWC occurs at 0.0025 faults/sec.
5.2.4.2 Dynamic fault-rate case studies
To benefit from the adaptive scheduling methods, the fault rates
experienced by the system must vary. We present two fault profiles which represent
patterns commonly seen in space missions. Figure 5-6 shows the fault profiles used
for the following analysis, based on the fault model in [Jacobs et al. 2012]. The first
profile is a sinusoidal pattern with a 90-minute period which is characteristic of fault
rates in Low-Earth Orbit (LEO). The second pattern (Burst) represents Highly-Elliptical
Orbits (HEO), where the system experiences low fault rates for most of the orbit, with
a large burst when making the closest approach to Earth, once every 12 hours. For
the following case studies, four different scheduling heuristics are examined. The
TMR-only and DWC-only heuristics will schedule every task in their respective mode.
The ideal-threshold heuristic will use the fault-rate threshold measured in the previous
section (0.0025 faults/sec) to choose between the DWC and TMR modes. The adaptive
heuristic uses Equation 5–4 to determine the FT mode for each task. Each heuristic will
be evaluated using an arrival rate of 0.075 tasks/sec and the same parameters used in
Section 5.2.4.1.
For the Sinusoidal case study, fault rates are low compared to the average task
execution time. For this fault profile, scheduling tasks with the DWC-only, ideal-threshold,
or adaptive heuristics provides an equivalent rejection ratio of 0.2%. There are enough
system resources to make re-computation of failed DWC tasks better than simply
using TMR to protect against all failures. The fault rate rarely gets high enough for the
threshold or adaptive heuristics to schedule tasks in TMR mode. The adaptive heuristic
performs well, reducing the number of rejected tasks over the TMR-only strategy by
94%, while maintaining a low average task latency of 8 seconds per task.
In the Burst case study, fault rates are low except during a short window of time with
extremely high fault rates. Unlike in the previous case study, the adaptive heuristics will
benefit from the large range of fault rates and each heuristic has different performance
characteristics. For this fault-rate profile, the adaptive heuristics perform the best, with
11% fewer rejected tasks than the DWC-only strategy and 48% fewer than the TMR-only
strategy. Additionally, the adaptive heuristic has only 3% more rejected tasks than the
ideal-threshold heuristic. TMR is only optimal when the high-fault-rate burst occurs.
Otherwise, DWC will better utilize the system resources. The Burst fault profile is ideal
for all dynamic metrics, since the two phases are highly separated.
5.2.4.3 Scheduling improvements
At extreme fault rates (high or low), the adaptive heuristic will schedule all incoming
tasks in the same mode. However, for moderate fault rates, both DWC and TMR tasks
will be scheduled depending on task computation time. One drawback to the dynamic
fault-tolerant selection metrics is the resource fragmentation that occurs when different
sized objects are placed on the FPGA fabric. This effect can produce schedules similar
to Figure 5-7, where fragmentation causes poor utilization of the available FPGA
resources. As tasks arrive to the system, they are placed in PRRs p_0 through p_5, in
either DWC or TMR mode. At time t_6 there are enough unused PRRs in the system
for a DWC task, but because the resources are not contiguous the task cannot be
placed. Unfortunately, the simplistic EDF scheduling heuristic currently in use does
not account for FPGA fragmentation. The reduced effectiveness of the adaptive metric in
Figure 5-5 can be explained by this FPGA fragmentation. By using a placement-aware
scheduler, the adaptive heuristic should perform closer to optimal for all fault rates. For
example, delaying the placement of task F_DWC until t_6 would enable a more compact
placement in PRRs p_2 and p_3, leaving space for the placement of an additional DWC
task. Alternatively, task F_DWC could be placed in PRRs p_4 and p_5 at time t_5, leaving
room for future DWC tasks in PRRs p2 and p3. An FPGA placement-aware scheduling
algorithm such as the Horizon or Stuffing scheduler [Steiger et al. 2004] should be
incorporated in order to improve the performance of the adaptive heuristic.
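Under the 1D area model, the fragmentation problem reduces to searching for a contiguous run of free PRRs. The helper below is a sketch; its name and the boolean free-map representation are our assumptions:

```python
def find_contiguous(free, n):
    """Return the start index of the first run of n contiguous free PRRs in a
    1D area model, or None if no such run exists even when the total number
    of free PRRs is at least n (i.e., the fabric is fragmented)."""
    run = 0
    for i, is_free in enumerate(free):
        run = run + 1 if is_free else 0
        if run == n:
            return i - n + 1
    return None
```

With free map [True, False, True] there are two free PRRs, yet a DWC task (n = 2) cannot be placed, which is exactly the situation depicted in Figure 5-7.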
Fault-rate lag is another possible problem. If a task is scheduled using a specific
mode but does not execute for a long period of time, a different FT mode may become
more appropriate. Limiting the scheduling window to only schedule a few tasks at a
time may prevent this lag. Alternatively, using a prediction of the future fault rate during
scheduling may reduce this effect. Finally, preemption enables the scheduler to return a
currently running task to the task queue in order to start a higher priority task. By adding
preemption capabilities to the scheduler, the guarantee ratio of all the tested metrics can
be improved.
5.3 Conclusions
In Phase 3 we have adapted the original bus-based RFT hardware architecture
for use with a PR architecture based on point-to-point connections (VAPRES) and
quantified the design tradeoffs. Additionally, by providing bus-based and point-to-point
hardware templates, the RFT architecture can now be more easily ported to future
systems. The additional parameterization of internal TMR components allows for user
customization of RFT features such as watchdog timers or voting logic configurations.
We have also presented a novel metric for determining optimal fault-tolerance
settings for reconfigurable fault-tolerant systems. An RFT scheduler and simulator were
developed in order to test the effectiveness of the adaptive scheduling heuristic and to
compare its performance to traditional static fault-tolerance strategies. When using our
adaptive FT strategy in Burst-like fault environments, we maintain system reliability while
reducing the number of rejected tasks by 48% compared to a static TMR fault-tolerance
strategy and 11% compared to static DWC. In the Sinusoidal case study, the static DWC,
ideal-threshold, and adaptive heuristics each reduce the number of rejected tasks by 94%
compared to the static TMR strategy. We have demonstrated that the adaptive heuristic performs similarly to
an optimal user-defined threshold, without the need for detailed system simulation and
measurement.
Table 5-1. RFT controller resource usage.
Module name      Slices used   BRAMs used
PLB Controller   345           0
FSL Controller   768           12
Table 5-2. Dynamic scheduling results.
Case study   FT metric   Guarantee ratio   Reject ratio   Avg. latency (s)
Sinusoidal   TMR-Only    0.970             0.030          51.2
Sinusoidal   DWC-Only    0.998             0.002          8.0
Sinusoidal   Ideal       0.998             0.002          7.9
Sinusoidal   Adaptive    0.998             0.002          8.2
Burst        TMR-Only    0.935             0.065          53.6
Burst        DWC-Only    0.962             0.038          9.9
Burst        Ideal       0.967             0.033          11.7
Burst        Adaptive    0.966             0.034          11.2
Figure 5-1. Research areas for Phase 3.
Figure 5-2. Comparison of RFT architectures. A) PLB-based architecture. B) FSL-based architecture.
Figure 5-3. Flowchart of scheduling simulator.
Figure 5-4. Effect of arrival rate on fault-free operation. (N_PRRs = 12, t_exec ∈ [10, 100], t_deadline ∈ [100, 200])
Figure 5-5. Effect of fault rate on task rejection. (N_PRRs = 12, t_exec ∈ [10, 100], t_deadline ∈ [100, 200], λ_arrival = 0.075)
Figure 5-6. Fault-rate profile for case studies.
Figure 5-7. Resource fragmentation from adaptive placement.
CHAPTER 6
CONCLUSIONS
In this research, a comprehensive framework for providing reconfigurable fault
tolerance for FPGA-based space systems has been developed. The framework features
three primary areas each addressed by one of the research phases presented in this
document. In Phase 1, the initial RFT framework consisting of a hardware architecture,
fault-rate model, and a performability model is presented. The hardware architecture
was developed and implemented on a Virtex-5 platform enabling reconfigurable
fault tolerance with support for TMR, DWC, or user-defined fault-tolerance modes.
Additionally, RFT fault-rate and performability models were used to predict the
performance and reliability of an RFT system in multiple specified orbits. For highly-elliptical
orbits, adaptive fault-tolerance strategies were shown to increase system performability
by 128% over TMR while reducing unavailability by 85% compared to ABFT. Fault injection was
used to validate the reliability of the architecture and the accuracy of the performance
model (1.5% error). In Phase 2, an in-depth reliability and overhead analysis of FPGA
designs using ABFT is presented. By identifying fault tolerance techniques with low
overhead and high reliability, a spectrum of reliability and performance characteristics
become available for RFT systems, enabling system flexibility. Fault-injection testing of
matrix multiplication and Fast Fourier Transform FPGA designs show that ABFT can
reduce design vulnerability by up to 98% with less than 25% overhead. Selectively
applying redundancy (TMR) can allow for even more reliable designs. In Phase 3,
methods for integrating the RFT framework with pre-existing PR systems, and making
the fault-tolerance features of the architecture easier to use, are demonstrated. RFT
hardware is dynamically generated based on individual system parameters.
The adaptive scheduling heuristic simplifies the decision to use a specific RFT
fault-tolerant mode and enables the use of environmentally-aware task schedulers.
Combined, these three phases of research provide an RFT framework which is capable
of providing adaptive fault-tolerance to existing FPGA systems, enabling their possible
use as space systems.
The contributions of Phase 1 include the RFT hardware architecture, the orbital
fault-rate model, and the phased-mission performability model. The fault-rate and
performability models are applicable to many space systems, enabling a time-varying
fault model where only static, time-averaged models are normally considered. From
Phase 2, the analysis of ABFT reliability on FPGA architectures demonstrates the
usefulness of ABFT as a reliable, low-overhead alternative technique in low-to-medium
fault-rate environments. The design techniques discovered in this research will promote
the use of ABFT as an alternative fault-tolerance technique for space systems, enabling
higher performance, lower power consumption, and lower costs by reducing the number
of processors needed to perform onboard data processing. The contributions from
Phase 3 facilitate the use of RFT techniques in future systems by partially automating
the process of fault-tolerant hardware design, allowing system designers to focus their
efforts on other parts of their potential system. Optimizing the selection of fault-tolerance
modes through fault-rate prediction and scheduler heuristics will enable systems to
maintain high performability automatically.
REFERENCES
ACREE, R., ULLAH, N., KARIA, A., RAHMEH, J., AND ABRAHAM, J. 1993. An object-oriented approach for implementing algorithm-based fault tolerance. In Twelfth Annual International Phoenix Conference on Computers and Communications. 210–216.

ACTEL. 2010a. Actel product page. http://www.actel.com/products/milaero/rtsxsu/default.aspx.

ACTEL. 2010b. Actel product page. http://www.actel.com/products/milaero/rtpa3/default.aspx.

ALAM, M., SONG, M., HESTER, S., AND SELIGA, T. 2006. Reliability analysis of phased-mission systems: a practical approach. In Annual Reliability and Maintainability Symposium, 2006. RAMS '06. 551–558.

ALNAJJAR, D., KO, Y., IMAGAWA, T., KONOURA, H., HIROMOTO, M., MITSUYAMA, Y., HASHIMOTO, M., OCHI, H., AND ONOYE, T. 2009. Coarse-grained dynamically reconfigurable architecture with flexible reliability. In International Conference on Field Programmable Logic and Applications, 2009. FPL 2009. 186–192.

ALTERA. 2010. Stratix V FPGAs: Ultimate Flexibility Through Partial and Dynamic Reconfiguration. http://www.altera.com/products/devices/stratix-fpgas/stratix-v/overview/partial-reconfiguration/stxv-part-reconfig.html.

ARNDT, O., FREISLEBEN, B., KIELMANN, T., AND THILO, F. 2000. A comparative study of online scheduling algorithms for networks of workstations. Cluster Computing 3, 95–112.

BANERJEE, S., BOZORGZADEH, E., AND DUTT, N. 2005. Physically-aware hw-sw partitioning for reconfigurable architectures with partial dynamic reconfiguration. In Design Automation Conference, 2005. Proceedings. 42nd. 335–340.

CARMICHAEL, C., FULLER, E., BLAIN, P., AND CAFFREY, M. 1999. SEU mitigation techniques for Virtex FPGAs in space applications. In 2nd Annual Military and Aerospace Applications of Programmable Devices and Technologies Conference. Laurel, MD.

CHOWDHURY, A.-R. AND BANERJEE, P. 1996. A new error analysis based method for tolerance computation for algorithm-based checks. Computers, IEEE Transactions on 45, 2, 238–243.

CIARDO, G., MARIE, R., SERICOLA, B., AND TRIVEDI, K. 1990. Performability analysis using semi-Markov reward processes. IEEE Transactions on Computers 39, 10, 1251–1264.
CIESLEWSKI, G., GEORGE, A., AND JACOBS, A. 2010. Acceleration of FPGA fault injection through multi-bit testing. In 2010 Engineering of Reconfigurable Systems and Algorithms.

DAVE, N., FLEMING, K., KING, M., PELLAUER, M., AND VIJAYARAGHAVAN, M. 2007. Hardware acceleration of matrix multiplication on a Xilinx FPGA. In Formal Methods and Models for Codesign, 2007. MEMOCODE 2007. 5th IEEE/ACM International Conference on. 97–100.

DAWOOD, A., VISSER, S., AND WILLIAMS, J. 2002. Reconfigurable FPGAs for real time image processing in space. In 14th International Conference on Digital Signal Processing, 2002. DSP 2002. Vol. 2. 845–848.

DOBIAS, R., KUBALIK, P., AND KUBATOVA, H. 2005. Dependability computations for fault-tolerant system based on FPGA. In 12th IEEE International Conference on Electronics, Circuits and Systems (ICECS). 1–4.

FLATLEY, T. 2010. Advanced hybrid on-board science data processor - SpaceCube 2.0. Earth Science Technology Forum.

GANO, S. 2010. JSatTrak. http://www.gano.name/shawn/JSatTrak/index.html.

GARVIE, M. AND THOMPSON, A. 2004. Scrubbing away transients and jiggling around the permanent: long survival of FPGA systems through evolutionary self-repair. In 10th IEEE International On-Line Testing Symposium (IOLTS). 155–160.

GUPTA, A., NOOSHABADI, S., TAUBMAN, D., AND DYER, M. 2006. Realizing low-cost high-throughput general-purpose block encoder for JPEG2000. IEEE Transactions on Circuits and Systems for Video Technology 16, 7, 843–858.

HAN, C.-C., SHIN, K., AND WU, J. 2003. A fault-tolerant scheduling algorithm for real-time periodic tasks with possible software faults. Computers, IEEE Transactions on 52, 3, 362–372.

HOOTS, F. R. AND ROEHRICH, R. L. 1980. SPACETRACK REPORT NO. 3 - Models for Propagation of NORAD Element Sets. http://celestrak.com/NORAD/documentation/spacetrk.pdf.

HSUEH, M. AND CHANG, C.-I. 2008. Field programmable gate arrays (FPGA) for pixel purity index using blocks of skewers for endmember extraction in hyperspectral imagery. Int. J. High Perform. Comput. Appl. 22, 408–423.

HUANG, K.-H. AND ABRAHAM, J. 1984. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers C-33, 6, 518–528.

JACOBS, A., CIESLEWSKI, G., GEORGE, A. D., GORDON-ROSS, A., AND LAM, H. 2012. Reconfigurable fault tolerance: A comprehensive framework for reliable and adaptive FPGA-based space computing. ACM Trans. Reconfigurable Technol. Syst. 5, 4, 21:1–21:30.
JACOBS, A., CONGER, C., AND GEORGE, A. 2008. Multiparadigm space processing forhyperspectral imaging. In Aerospace Conference, 2008 IEEE. 1 –11.
JARA-BERROCAL, A. AND GORDON-ROSS, A. 2010. VAPRES: A virtual architecture forpartially reconfigurable embedded systems. In Design, Automation Test in EuropeConference Exhibition (DATE), 2010. 837 –842.
JOHNSON, J., HOWES, W., WIRTHLIN, M., MCMURTREY, D., CAFFREY, M., GRAHAM, P.,AND MORGAN, K. 2008. Using duplication with compare for on-line error detection inFPGA-based designs. In 2008 IEEE Aerospace Conference. 1–11.
KARNIK, T. AND HAZUCHA, P. 2004. Characterization of soft errors caused by singleevent upsets in CMOS processes. IEEE Transactions on Dependable and SecureComputing 1, 2, 128 – 143.
KERNIGHAN, B. AND LIN, S. 1970. An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal 49, 2, 291–307.
KIM, K. AND PARK, K. 1994. Phased-mission system reliability under Markov environment. IEEE Transactions on Reliability 43, 2, 301–309.
KYRIAKOULAKOS, K. AND PNEVMATIKATOS, D. 2009. A novel SRAM-based FPGA architecture for efficient TMR fault tolerance support. In International Conference on Field Programmable Logic and Applications (FPL), 2009. 193–198.
LAPRIE, J.-C., ARLAT, J., BEOUNES, C., AND KANOUN, K. 1990. Definition and analysis of hardware- and software-fault-tolerant architectures. Computer 23, 7, 39–51.
LE, C., CHAN, S., CHENG, F., FANG, W., FISCHMAN, M., HENSLEY, S., JOHNSON, R., JOURDAN, M., MARINA, M., PARHAM, B., ROGEZ, F., ROSEN, P., SHAH, B., AND TAFT, S. 2004. Onboard FPGA-based SAR processing for future spaceborne systems. In Proceedings of the IEEE Radar Conference, 2004. 15–20.
MACMILLAN, S. AND MAUS, S. 2010. IGRF10 Model Coefficients for 1945-2010.http://modelweb.gsfc.nasa.gov/magnetos/igrf.html.
MAUS, S., MACMILLAN, S., CHERNOVA, T., CHOI, S., DATER, D., GOLOVKOV, V., LESUR, V., LOWES, F., LÜHR, H., MAI, W., MCLEAN, S., OLSEN, N., ROTHER, M., SABAKA, T., THOMSON, A., AND ZVEREVA, T. 2005. The 10th generation international geomagnetic reference field. Physics of The Earth and Planetary Interiors 151, 3-4, 320–322.
MEI, B., SCHAUMONT, P., AND VERNALDE, S. 2000. A hardware-software partitioning and scheduling algorithm for dynamically reconfigurable embedded systems. In Proceedings of ProRISC. 405–411.
MENTOR GRAPHICS. 2013. Precision Hi-Rel Technology Overview. http://www.mentor.com/products/fpga/multimedia/overview/precision-hi-rel-technology-overview.
MEYER, J. 1982. Closed-form solutions of performability. IEEE Transactions on Computers C-31, 7, 648–657.
MISHRA, A. AND BANERJEE, P. 2003. An algorithm-based error detection scheme for the multigrid method. IEEE Transactions on Computers 52, 9, 1089–1099.
MORGAN, K., MCMURTREY, D., PRATT, B., AND WIRTHLIN, M. 2007. A comparison of TMR with alternative fault-tolerant design techniques for FPGAs. IEEE Transactions on Nuclear Science 54, 6, 2065–2072.
PATHAN, R. 2006. Fault-tolerant real-time scheduling algorithm for tolerating multiple transient faults. In International Conference on Electrical and Computer Engineering (ICECE), 2006. 577–580.
PRATT, B., CAFFREY, M., GRAHAM, P., MORGAN, K., AND WIRTHLIN, M. 2006. Improving FPGA design robustness with partial TMR. In 44th Annual IEEE International Reliability Physics Symposium Proceedings, 2006. 226–232.
PRATT, B., WIRTHLIN, M., CAFFREY, M., GRAHAM, P., MORGAN, K., QUINN, H., AND SHELLEY, S. 2007. Improving FPGA reliability in harsh environments using triple modular redundancy with more frequent voting. In Military and Aerospace FPGA Applications.
RAO, T. AND FUJIWARA, E. 1989. Error-Control Coding for Computer Systems. Prentice-Hall.
RATTER, D. 2004. FPGAs on Mars. Xcell Journal, 8–11.
ROY-CHOWDHURY, A. AND BANERJEE, P. 1993. Tolerance determination for algorithm-based checks using simplified error analysis techniques. In The Twenty-Third International Symposium on Fault-Tolerant Computing (FTCS-23), Digest of Papers. 290–298.
ROY-CHOWDHURY, A., BELLAS, N., AND BANERJEE, P. 1996. Algorithm-based error-detection schemes for iterative solution of partial differential equations. IEEE Transactions on Computers 45, 4, 394–407.
SAHNER, R. A. AND TRIVEDI, K. S. 1987. Reliability modeling using SHARPE. IEEE Transactions on Reliability R-36, 2, 186–193.
SHIM, B., SRIDHARA, S., AND SHANBHAG, N. 2004. Reliable low-power digital signal processing via reduced precision redundancy. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 12, 5, 497–510.
SILVA, J., PRATA, P., RELA, M., AND MADEIRA, H. 1998. Practical issues in the use of ABFT and a new failure model. In Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing. 26–35.
STEIGER, C., WALDER, H., AND PLATZNER, M. 2004. Operating systems for reconfigurable embedded platforms: online scheduling of real-time tasks. IEEE Transactions on Computers 53, 11, 1393–1407.
SWIFT, G., ALLEN, G., TSENG, C. W., CARMICHAEL, C., MILLER, G., AND GEORGE, J. 2008. Static upset characteristics of the 90nm Virtex-4QV FPGAs. In IEEE Radiation Effects Data Workshop. 98–105.
TAO, D. AND HARTMANN, C. 1993. A novel concurrent error detection scheme for FFT networks. IEEE Transactions on Parallel and Distributed Systems 4, 2, 198–221.
TROXEL, I., FEHRINGER, M., AND CHENOWETH, M. 2008. Achieving multipurpose space imaging with the ARTEMIS reconfigurable payload processor. In 2008 IEEE Aerospace Conference. 1–8.
TYLKA, A., ADAMS, J. H., JR., BOBERG, P., BROWNSTEIN, B., DIETRICH, W., FLUECKIGER, E., PETERSEN, E., SHEA, M., SMART, D., AND SMITH, E. 1997. CREME96: A revision of the cosmic ray effects on micro-electronics code. IEEE Transactions on Nuclear Science 44, 6, 2150–2160.
WANG, J. 2003. Radiation effects in FPGAs. In 9th Workshop on Electronics for LHC Experiments.
WANG, S.-J. AND JHA, N. 1994. Algorithm-based fault tolerance for FFT networks. IEEE Transactions on Computers 43, 7, 849–854.
WILLIAMS, J., MASSIE, C., GEORGE, A. D., RICHARDSON, J., GOSRANI, K., AND LAM, H. 2010. Characterization of fixed and reconfigurable multi-core devices for application acceleration. ACM Transactions on Reconfigurable Technology and Systems 3, 19:1–19:29.
WU, G., DOU, Y., AND WANG, M. 2010. High performance and memory efficient implementation of matrix multiplication on FPGAs. In International Conference on Field-Programmable Technology (FPT), 2010. 134–137.
XILINX. 2004. XTMR Tool User Guide. Xilinx User Guide UG156.
XILINX. 2010a. Partial Reconfiguration User Guide. Xilinx User Guide UG702.
XILINX. 2010b. SEU Strategies for Virtex-5 Devices. Xilinx Application Note XAPP864.
XILINX. 2010c. Space-Grade Virtex-4QV Family Overview. Xilinx Product SpecificationDS653.
XILINX. 2013a. Xilinx CORE Generator System. Xilinx CORE Generator Product Page,http://www.xilinx.com/tools/coregen.htm.
XILINX. 2013b. Xilinx Soft Error Mitigation (SEM) Core. http://www.xilinx.com/products/intellectual-property/SEM.htm.
YAO, E., WANG, R., CHEN, M., TAN, G., AND SUN, N. 2012. A case study of designing efficient algorithm-based fault tolerant application for exascale parallelism. In 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2012. 438–448.
ZHUO, L. AND PRASANNA, V. 2004. Scalable and modular algorithms for floating-point matrix multiplication on FPGAs. In Proceedings of the 18th International Parallel and Distributed Processing Symposium, 2004. 92.
BIOGRAPHICAL SKETCH
Adam Jacobs earned his Bachelor of Science degree in electrical engineering from
the University of Florida in 2005. After graduation, he participated in an internship at
Honeywell International in Clearwater, FL before returning to the University of Florida
for graduate studies. Adam received his Master of Science degree in electrical and
computer engineering in 2007 before joining the doctoral program.
While pursuing his degree, Adam worked as a research assistant in the High-Performance
Computing and Simulation (HCS) Research Lab and the NSF Center for High-Performance
Reconfigurable Computing (CHREC). In support of his studies, Adam interned at
Goddard Space Flight Center in 2010, gaining experience in embedded processing
systems for space. After graduation, he will be moving to Austin, TX, where he has
accepted a position in the processor design group of ARM.