RECONFIGURABLE FAULT TOLERANCE FOR SPACE SYSTEMS
By
ADAM M. JACOBS
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2013
© 2013 Adam M. Jacobs
To my parents for all of their patience and support
ACKNOWLEDGMENTS
This work was supported in part by the I/UCRC Program of the National Science
Foundation under Grant No. EEC-0642422 and IIP-1161022. The author gratefully
acknowledges equipment and tools provided by various vendors that helped
make this work possible. The author also thanks fellow graduate student Grzegorz
Cieslewski for developing the FPGA fault-injection tool used to gather many of the
experimental results for this work.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 BACKGROUND AND RELATED RESEARCH . . . . . . . . . . . . . . . . . . . 18
2.1 FPGA Performance and Power Efficiency . . . . . . . . . . . . . . . . . . . . 18
2.2 Partial Reconfiguration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Single-Event Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 FPGAs in Space Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5 Low-Overhead Fault Tolerance Methods . . . . . . . . . . . . . . . . . . . . . 23
2.6 Algorithm-Based Fault Tolerance . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.7 Task Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3 FRAMEWORK FOR RECONFIGURABLE FAULT TOLERANCE . . . . . . . . 28
3.1 RFT Hardware Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
    3.1.1 RFT Controller Operation . . . . . . . . . . . . . . . . . . . . . . . . . 30
    3.1.2 MicroBlaze Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
    3.1.3 Environment-Based Fault Mitigation . . . . . . . . . . . . . . . . . . . 34
    3.1.4 RFT Controller Resource and Performance Overheads . . . . . . . . 35
3.2 RFT Fault-Rate Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 RFT Performability Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
    3.4.1 Validation Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
    3.4.2 Orbital Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
        3.4.2.1 Low-Earth orbit case study . . . . . . . . . . . . . . . . . . 48
        3.4.2.2 Highly-elliptical orbit case study . . . . . . . . . . . . . . . 50
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4 ALGORITHM-BASED FAULT TOLERANCE FOR FPGA SYSTEMS . . . . . . 63
4.1 Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
    4.1.1 Checksum-Based ABFT for Matrix Multiplication . . . . . . . . . . . 64
    4.1.2 Matrix-Multiplication Architectures . . . . . . . . . . . . . . . . . . . 66
        4.1.2.1 Baseline, serial architecture . . . . . . . . . . . . . . . . . . 66
        4.1.2.2 Fine-grained parallel architecture . . . . . . . . . . . . . . . 66
        4.1.2.3 Coarse-grained parallel architecture . . . . . . . . . . . . . 67
        4.1.2.4 Architectural modifications for ABFT . . . . . . . . . . . . . 67
    4.1.3 Resource-Overhead Experiments . . . . . . . . . . . . . . . . . . . . 68
        4.1.3.1 Resource overhead of serial architectures . . . . . . . . . . 69
        4.1.3.2 Resource overhead of parallel architectures . . . . . . . . . 70
    4.1.4 Fault-Injection Experiments . . . . . . . . . . . . . . . . . . . . . . . 72
        4.1.4.1 Design vulnerability of serial architectures . . . . . . . . . . 74
        4.1.4.2 Design vulnerability of parallel architectures . . . . . . . . . 75
    4.1.5 Analysis of Matrix-Multiplication Architectures . . . . . . . . . . . . 76
4.2 Fast Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
    4.2.1 Checksum-Based ABFT for FFTs . . . . . . . . . . . . . . . . . . . . 78
    4.2.2 FFT Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
        4.2.2.1 Radix-2 Burst-IO FFT architecture . . . . . . . . . . . . . . 80
        4.2.2.2 Radix-2 Pipelined FFT architecture . . . . . . . . . . . . . . 80
        4.2.2.3 Architectural modifications for ABFT . . . . . . . . . . . . . 80
    4.2.3 Resource-Overhead Experiments . . . . . . . . . . . . . . . . . . . . 82
        4.2.3.1 Resource overhead of Burst-IO architecture . . . . . . . . . 82
        4.2.3.2 Resource overhead of Pipelined architecture . . . . . . . . . 83
    4.2.4 FFT Fault Injection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
    4.2.5 Analysis of FFT Architectures . . . . . . . . . . . . . . . . . . . . . . 85
4.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5 RFT SYSTEM INTEGRATION FOR RAPID SYSTEM DEVELOPMENT . . . . 94
5.1 Dynamically Generated RFT Components . . . . . . . . . . . . . . . . . . . . 94
    5.1.1 RFT Controller Point-to-Point Interface . . . . . . . . . . . . . . . . . 95
    5.1.2 Parameterized and Configurable Voting Logic . . . . . . . . . . . . . 96
5.2 Task Scheduling for RC Systems in Dynamic Fault-Rate Environments . . . 97
    5.2.1 Selection Criteria for Fault-Tolerant Mode . . . . . . . . . . . . . . . 98
        5.2.1.1 FT-mode selection using thresholds . . . . . . . . . . . . . . 98
        5.2.1.2 Time-resource metric for FT-mode selection . . . . . . . . . 99
    5.2.2 Scheduler for RFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
        5.2.2.1 RFT architecture description . . . . . . . . . . . . . . . . . 100
    5.2.3 Software Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
    5.2.4 Analysis and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
        5.2.4.1 Constant fault rates . . . . . . . . . . . . . . . . . . . . . . 102
        5.2.4.2 Dynamic fault-rate case studies . . . . . . . . . . . . . . . . 103
        5.2.4.3 Scheduling improvements . . . . . . . . . . . . . . . . . . . 104
5.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
LIST OF TABLES
Table page
3-1 RFT fault-tolerance modes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3-2 RFT controller resource usage. . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3-3 Fault-injection results for RFT components. . . . . . . . . . . . . . . . . . . . . 54
3-4 RFT Markov model validation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3-5 Unavailability and performability for LEO case study. . . . . . . . . . . . . . . . 55
3-6 Unavailability and performability for HEO case study. . . . . . . . . . . . . . . . 55
4-1 Resource utilization and overhead of serial MM designs. . . . . . . . . . . . . . 87
4-2 Serial matrix multiplication fault-injection results. . . . . . . . . . . . . . . . . . 87
4-3 Resource utilization and overhead of FFT designs. . . . . . . . . . . . . . . . . 87
4-4 FFT fault-injection results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5-1 RFT controller resource usage. . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5-2 Dynamic scheduling results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
LIST OF FIGURES
Figure page
3-1 System-on-chip architecture with RFT controller. . . . . . . . . . . . . . . . . . 55
3-2 RFT controller PLB-to-PRR interface. . . . . . . . . . . . . . . . . . . . . . . . 56
3-3 RFT fault-rate model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3-4 Phased-mission Markov model transitioning between TMR and DWC modes. . 57
3-5 Markov models of RFT modes. . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3-6 RFT validation Markov models. . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3-7 LEO fault rates using the RFT fault-rate model. . . . . . . . . . . . . . . . . . . 58
3-8 LEO system availability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3-9 Effects of adaptive thresholds on availability and performability. . . . . . . . . . 60
3-10 HEO fault rates using the RFT fault-rate model. . . . . . . . . . . . . . . . . . . 61
3-11 HEO system availability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4-1 Matrix-multiplication architectures. . . . . . . . . . . . . . . . . . . . . . . . . . 88
4-2 Matrix-multiplication pseudocode. . . . . . . . . . . . . . . . . . . . . . . . . 88
4-3 Matrix-multiplication ABFT-Extra architecture. . . . . . . . . . . . . . . . . . . . 88
4-4 Slice overhead of fine-grained parallel matrix multiplication. . . . . . . . . . . . 89
4-5 Slice overhead of coarse-grained parallel matrix multiplication. . . . . . . . . . 89
4-6 DSP48 and BlockRAM overhead of parallel matrix multiplication. . . . . . . . . 90
4-7 Fault vulnerability of fine-grained parallel matrix multiplication. . . . . . . . . . . 91
4-8 Fault vulnerability of coarse-grained parallel matrix multiplication. . . . . . . . . 92
4-9 FFT architectures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4-10 Required ABFT threshold value for floating-point FFTs. . . . . . . . . . . . . . 93
5-1 Research areas for Phase 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5-2 Comparison of RFT architectures. . . . . . . . . . . . . . . . . . . . . . . . . . 107
5-3 Flowchart of scheduling simulator. . . . . . . . . . . . . . . . . . . . . . . . . . 107
5-4 Effect of arrival rate on fault-free operation. . . . . . . . . . . . . . . . . . . . . 108
5-5 Effect of fault rate on task rejection. . . . . . . . . . . . . . . . . . . . . . . . . 108
5-6 Fault-rate profile for case studies. . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5-7 Resource fragmentation from adaptive placement. . . . . . . . . . . . . . . . . 110
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
RECONFIGURABLE FAULT TOLERANCE FOR SPACE SYSTEMS
By
Adam M. Jacobs
May 2013
Chair: Alan D. George
Major: Electrical and Computer Engineering
Commercial SRAM-based, field-programmable gate arrays (FPGAs) have the
capability to provide space applications with the necessary performance, energy-efficiency,
and adaptability to meet next-generation mission requirements. However, mitigating
an FPGA’s susceptibility to radiation-induced faults is challenging. Triple-modular
redundancy (TMR) techniques are traditionally used to mitigate radiation effects, but
TMR incurs substantial overheads such as increased area and power requirements. In
order to reduce overhead while providing sufficient radiation mitigation, this research
proposes a framework for reconfigurable fault tolerance (RFT) that enables system
designers to dynamically adjust a system’s level of redundancy and fault mitigation
based on the varying radiation incurred at different orbital positions. To realize this
goal and validate the effectiveness of the approach, three areas are investigated and
addressed.
First, a method for accurately estimating time-varying fault rates in space systems
and a reliability and performance model for adaptive systems are needed to quantify the
effectiveness of the RFT approach. Using multiple case-study orbits, our models predict
that adaptive fault-tolerance strategies can reduce unavailability by 85% relative to
low-overhead fault-tolerance techniques and improve performability by 128% over
traditional, static TMR fault tolerance.
Second, low-overhead fault-tolerance techniques which can be used within the
RFT framework for improved performance must be investigated. The effectiveness
of Algorithm-Based Fault Tolerance (ABFT) for FPGA-based systems is explored for
matrix multiplication and FFT. ABFT kernels were developed for an FPGA platform, and
reliability was measured using fault-injection testing. We show that matrix multiplication
and FFTs with ABFT can provide improved reliability (vulnerability reduced by 98%) with
low resource overhead, and scale favorably with additional parallelism.
Third, methods for facilitating the integration of RFT hardware into existing
PR-based systems and architectures are explored. We expand the RFT framework
to be used with bus-based or point-to-point architectures. We design a fault-tolerant
task-scheduling algorithm which can schedule RFT tasks in a dynamically-changing fault
environment in order to maximize system performability.
Combined, these three areas demonstrate the capability of RFT to provide both
performance and reliability in space. Using low-overhead fault-tolerance techniques and
reconfiguration, RFT can meet the strict constraints of next-generation space systems.
CHAPTER 1
INTRODUCTION
As remote sensor technology for space systems increases in fidelity, the amount
of data collected by orbiting satellites and other space vehicles will continue to
outpace the ability to transmit that data to other stations (e.g., ground stations, other
satellites). Increasing future systems’ onboard data-processing capabilities can
alleviate this downlink bottleneck, which is caused by bandwidth limitations and
high-latency transmission. Onboard data processing enables much of the raw data to be
interpreted, reduced, and/or compressed onboard the space system before transmitting
the results to ground stations or other space systems, thus reducing data transmission
requirements. For applications with low data transmission requirements, improved
onboard data processing can enable more complex and autonomous capabilities. These
autonomous capabilities will enable new classes of space missions such as in situ
scientific studies, constellations of multiple, coordinated space systems, or automated
maneuvering and guidance of deep-space systems. Finally, increased onboard data
processing can enable future space systems to keep up with increasingly stringent
real-time constraints. However, increasing the onboard data-processing capabilities
requires high-performance computing, which has largely been absent from space
systems.
In addition to the high performance requirements of these enhanced future systems,
space environments impose several other stringent, high-priority design constraints.
System design decisions must consider the system’s size, weight, and power (SWaP)
and heat dissipation, which are dictated by the system’s physical platform configuration.
The satellite’s physical dimensions restrict the system’s size, the photovoltaic solar
array’s capacity restricts the power generation, and all heat dissipation must occur
from passive, radiative cooling, which can be slow. In addition to SWaP requirements,
system designers must consider system component reliability and system availability
because faulty space systems are either impossible or prohibitively expensive to
service. Radiation-hardened devices, with increased protection from long-term
radiation exposure (total ionizing dose), provide system reliability and correctness.
While device-level radiation hardening increases component lifetimes and reliability,
these hardened devices are expensive and have dramatically reduced performance as
compared to non-hardened commercial-off-the-shelf (COTS) components.
In order to design an optimal system, the performance, SWaP, and reliability
requirements of future space applications must be considered together. System design
philosophy must consider the worst-case operating scenario (e.g., typically worst-case
radiation), which may dramatically limit the overall system performance even though
the worst-case scenario may be infrequent (e.g., radiation levels typically change based
on orbital position). However, if components used for redundancy and reliability can be
dynamically repurposed to perform useful, additional computation, future space systems
could meet both high-performance and high-reliability requirements. Future systems
could maintain high reliability during worst-case operating environments while achieving
high performance during less radiation-intensive periods. Current design methodologies
do not account for this type of adaptability. Therefore, in order for future space systems
to achieve high levels of performance, more sophisticated and adaptive system design
methodologies are necessary.
One approach for adaptive high-performance space system design leverages
hardware-adaptive devices such as field-programmable gate arrays (FPGAs), which
provide parallel computations at a high level of performance per unit size, mass, and
power [Williams et al. 2010]. Fortunately, many space applications, such as synthetic
aperture radar (SAR) [Le et al. 2004], hyperspectral imaging (HSI) [Hsueh and Chang
2008], image compression [Gupta et al. 2006], and other image processing applications
[Dawood et al. 2002], where onboard data processing can significantly reduce data
transmission requirements, are amenable to an FPGA’s highly parallel architecture.
Since reconfiguration enables FPGAs to perform a wide variety of application-specific
tasks, these systems can rival application-specific integrated circuit (ASIC) performance
while maintaining a general-purpose processor’s flexibility. An SRAM-based FPGA can
be reconfigured multiple times within a single application, allowing a single FPGA to
be used for multiple functions by time-multiplexing the FPGA’s hardware resources,
reducing the number of concurrently active processing modules when an application
does not require all processing modules all of the time. Thus, FPGA reconfiguration
facilitates small, lightweight, yet powerful systems that can be optimized for a space
application’s time-varying hardware requirements.
In order to leverage FPGAs in space systems, the FPGA must operate correctly
and reliably in high-radiation environments, such as those found in near-Earth orbits.
Currently, most radiation-hardened FPGAs have antifuse-based configuration memories
that are immune to single-event upsets (SEUs). However, these hardened FPGAs have
reconfiguration limitations and small capacities, reducing the primary performance
benefits offered by COTS SRAM-based FPGAs. Fortunately, when combined with
special system design techniques, SRAM-based FPGAs can be viable for space
systems. An SRAM-based FPGA’s primary computational limitation is the possibility
of SEUs causing errors within the FPGA user logic and routing resources, which
can manifest as configuration memory upsets or logic memory (e.g., flip-flops, user
RAM) upsets (i.e., resulting in deviations from the expected application behavior).
Fault-tolerant techniques, such as triple-modular redundancy (TMR) and memory
scrubbing, can protect the system from most SEUs and significantly decrease the
SEU-induced errors, but designing an FPGA-based space system using TMR introduces
at least 200% area overhead for each protected module. Depending on the expected
upset rates for a given space system, other lower-overhead fault tolerance methods
could be used to provide sufficient reliability while maximizing the resources available for
performance.
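The masking behavior of TMR can be sketched in a few lines of software. The following is an illustrative Python model of a bitwise majority voter, not the dissertation's hardware implementation (real designs realize the voter in FPGA logic):

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three replica outputs.

    Each output bit follows the majority of the corresponding input
    bits, so an upset in any single replica is masked.
    """
    return (a & b) | (b & c) | (a & c)

# A single-bit upset in one replica is outvoted by the other two:
golden = 0b1011
faulty = golden ^ 0b0100  # bit flip in replica c
assert tmr_vote(golden, golden, faulty) == golden
```

The 200% area overhead cited above follows directly from this structure: every protected module is instantiated three times, plus the voter.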
When designing a traditional space system, system designers estimate the
expected worst-case upset rates and include an additional safety margin. However,
since single-event upset (SEU) rates vary based on orbital position and the majority
of orbital positions experience relatively low upset rates, a system designed for
the worst-case upset-rate scenario contains processing resources that are wasted
during the frequent low-upset-rate periods. In order to provide the necessary reliability
during high-upset-rate periods and reduce the processing overhead incurred during
low-upset-rate periods, the fault tolerance method must change based on the current
upset rate. For example, during high-upset-rate periods, the system can be reconfigured
to provide high reliability at the expense of reduced processing capabilities, while during
low-upset-rate periods the system can be reconfigured to provide higher performance
by re-provisioning the excess hardware (used for high reliability during high-upset-rate
periods) to application functionality. This upset-rate-based adaptability provides high
performance while maintaining reliability.
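A minimal Python sketch of this upset-rate-based adaptation follows. The mode names echo those used later in this document (TMR, duplication with compare), but the threshold values are hypothetical, chosen only for illustration:

```python
def select_ft_mode(upset_rate: float,
                   high_threshold: float = 1e-3,
                   low_threshold: float = 1e-5) -> str:
    """Choose a fault-tolerance mode from the estimated upset rate (upsets/s)."""
    if upset_rate >= high_threshold:
        return "TMR"  # triplicated modules: maximum masking, lowest throughput
    if upset_rate >= low_threshold:
        return "DWC"  # duplication with compare: detect errors, then recover
    return "HP"       # high performance: all resources perform computation

# During a radiation-intensive period, the system trades throughput for masking:
assert select_ft_mode(1e-2) == "TMR"
# In a benign orbital position, hardware is re-provisioned for performance:
assert select_ft_mode(1e-7) == "HP"
```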
This research proposes a framework for reconfigurable fault tolerance (RFT) that
enables FPGA-based space systems to dynamically adapt the amount of fault tolerance
based on the current upset rate. To realize this goal and validate the effectiveness of
the approach, three areas must be addressed and investigated. First, a method for
accurately estimating time-varying fault rates in space systems and a reliability and
performance model for adaptive systems are needed to quantify the effectiveness of the RFT
approach. Second, techniques for low-overhead fault tolerance which can be used within
the RFT framework for improved performance must be investigated. Third, tools which
facilitate the creation of the RFT hardware components are required to enable the RFT
integration into existing PR-based systems and architectures. The research presented in
this document is divided into three phases, each proposing a solution to the goals listed
above.
The first phase of this research addresses the need for performance and reliability
modelling of adaptive systems with changing environments, such as FPGA-based
space systems. A fault-rate estimation methodology for systems in near-Earth orbits is
developed to estimate the time-varying fault rates experienced during a specified orbit.
A phased-mission Markov model is then used to estimate performance and reliability of
an adaptive system using several adaptation schedules. In this phase, we also develop
and implement an RFT controller design which can provide adaptive fault tolerance in an
FPGA. The reliability and performance models are then experimentally validated using
fault-injection testing on the RFT controller.
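As a point of reference for the modeling in this phase, the simplest Markov availability model has two states (operational and failed) with fault rate λ and repair rate μ, giving steady-state availability μ/(λ+μ). This toy calculation is not the phased-mission model developed here, which chains per-mode models across orbital phases, but it illustrates the quantity being estimated:

```python
def steady_state_availability(fault_rate: float, repair_rate: float) -> float:
    """Steady-state 'up' probability of a two-state (up/down) Markov chain.

    fault_rate and repair_rate are transition rates in the same time unit
    (e.g., events per hour).
    """
    return repair_rate / (fault_rate + repair_rate)

# One fault per 1000 hours, with recovery (e.g., reconfiguration plus
# scrubbing) taking one hour on average:
availability = steady_state_availability(1 / 1000, 1.0)
assert abs(availability - 1000 / 1001) < 1e-12
```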
The second phase of this research investigates the effectiveness of techniques
for low-overhead fault tolerance in FPGA systems. Algorithm-based fault tolerance
(ABFT) is a technique that can be used with many linear-algebra operations, such as
matrix multiplication or LU decomposition [Huang and Abraham 1984], to provide fault
tolerance with as little as 5-10% overhead. Traditionally, ABFT has been implemented
in software, with multiprocessor arrays, and in hardware, with systolic arrays, to protect
application datapaths. Our ABFT approach may be used in FPGA applications to
provide both datapath and configuration memory protection with low overhead. Other
fault-tolerance techniques (e.g., duplication with compare, error-correcting codes,
concurrent error detection, reduced-precision redundancy) are also examined and
evaluated for FPGA resource overhead and reliability within the context of the RFT
framework.
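The checksum idea behind ABFT for matrix multiplication [Huang and Abraham 1984] is compact enough to sketch in pure Python. This is only an illustrative software model of the scheme, not the FPGA architectures evaluated later: a column-checksum row is appended to A and a row-checksum column to B, so the product carries checksums that verify the result.

```python
def matmul(A, B):
    """Plain matrix product of nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def abft_matmul(A, B):
    """Multiply checksum-augmented matrices and verify the product."""
    Ac = A + [[sum(col) for col in zip(*A)]]  # append column-checksum row
    Br = [row + [sum(row)] for row in B]      # append row-checksum column
    Cf = matmul(Ac, Br)                       # full-checksum product
    C = [row[:-1] for row in Cf[:-1]]
    # The appended row/column of Cf must equal the sums of C itself.
    ok = (all(Cf[-1][j] == sum(row[j] for row in C) for j in range(len(C[0])))
          and all(Cf[i][-1] == sum(C[i]) for i in range(len(C))))
    return C, ok

C, ok = abft_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
assert ok and C == [[19, 22], [43, 50]]
```

A single erroneous element of C perturbs exactly one row checksum and one column checksum, localizing the error; floating-point versions compare against a threshold rather than testing exact equality.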
In the third phase of this research, methods for integrating the RFT framework
with pre-existing partial reconfiguration architectures will be examined, enabling fault
tolerance for pre-existing systems with minimal design changes. Three aspects of
system integration will be considered. First, a tool that will enable the dynamic creation
of system-specific RFT controllers, allowing mission-specific, resource-optimized
hardware. Second, support and integration of an RFT controller with the existing
PR architectures to enable fault tolerance with minimal design modifications. Third,
investigation of fault-tolerant task scheduling in the presence of changing fault rates.
The remainder of this document is organized as follows. Chapter 2 surveys
previous work related to topics common to all three phases of this research.
Chapter 3 describes the RFT hardware architecture, the RFT fault-rate model, and
the RFT performability model and provides case-study examples for two near-Earth
orbits. Chapter 4 evaluates the use of low-overhead fault tolerance techniques in FPGA
systems, analyzes reliability results obtained using fault-injection in MM and FFT case
studies, and suggests design modifications for higher reliability. Chapter 5 demonstrates
a dynamic RFT hardware-creation methodology and a point-to-point RFT controller
implementation, and explores heuristics for an environmentally-aware task scheduler
for RFT systems. Finally, Chapter 6 presents conclusions and outlines directions for
possible future research.
CHAPTER 2
BACKGROUND AND RELATED RESEARCH
In this chapter, we motivate the need for SRAM-based FPGAs in space systems.
The FPGAs used in space systems must be rated for a sufficient total ionizing
dose (TID) to ensure long-term device functionality over a given mission
duration. Single-event effects (SEEs), caused by collisions with high-energy protons
and heavy ions (i.e., radiation), are the primary short-term reliability concern in space
systems. In order for FPGA-based systems to maintain reliability, a combination of
radiation-hardened FPGAs and complex fault-tolerance techniques is required
to mitigate errors. Since many common fault-tolerance techniques require
substantial time or area overhead, new, low-overhead techniques allow more
reconfigurable logic to be used for actual, useful computation instead of for redundancy.
2.1 FPGA Performance and Power Efficiency
SRAM-based FPGAs offer a very large amount of configurable logic and have the
ability to modify portions of a design during run-time, giving these FPGAs the capability
to efficiently perform a wide range of applications by using a high degree of parallelism
while running at low clock rates. Williams et al. [2010] developed computational density
metrics to quantify and predict performance of SRAM-based FPGAs for specific types
of algorithms and applications. Their analysis showed that FPGAs were capable
of providing between 3 and 60 times more performance per unit power than many
conventional general-purpose processors, depending on the types of operations being
considered. The performance and power efficiency of FPGAs is extremely desirable for
the small power budgets of space systems.
2.2 Partial Reconfiguration
Partial reconfiguration (PR) enables a user to modify a portion of an FPGA’s
configuration while the remainder of the FPGA remains operational. PR time-multiplexes
mutually exclusive application-specific processing modules on the same hardware, and
only the modules being reconfigured halt operation, which makes PR attractive
for real-time systems. Currently, Xilinx supports PR for the Virtex-4 and newer devices
[Xilinx 2010a], while Altera has recently announced PR support for Stratix-V devices
[Altera 2010].
During the system design phase, system designers define the FPGA’s partially
reconfigurable regions (PRRs) and partially reconfigurable modules (PRMs) and route
signals to/from the PRRs through bus macro PR primitives (Xilinx 9.2 PR tool flow)
or partition pins (Xilinx 12.1 PR tool flow). Partial bitstreams, communicated through
external configuration interfaces (e.g., SelectMAP, JTAG), are used to reconfigure the
PRRs with the PRMs. On Xilinx devices, the Internal Configuration Access Port (ICAP)
is an internal configuration interface, allowing user logic to directly reconfigure PRRs,
removing the need for additional external configuration support devices. Additionally,
since partial bitstreams are typically much smaller than full bitstreams (which are used
to configure the entire FPGA), PR reduces bitstream storage requirements. A partial
bitstream's size scales with the size of its corresponding PRR.
2.3 Single-Event Effects
SEEs occur when high-energy particles, such as protons, neutrons, or heavy
ions, collide with silicon atoms on a device, depositing the ion’s electric charge into the
device’s circuit. Protons and electrons are trapped within the Earth’s Van Allen belts,
while heavy ions are mainly produced by galactic cosmic rays and solar flares. When a
high-energy particle collides with a silicon device, the energy of the collision can cause
the logical values stored in sequential memory elements to be inverted [Karnik and
Hazucha 2004]. Errors caused by these particles are often referred to as soft errors or
SEUs, as there is no permanent circuitry damage and any affected memories can be
corrected by re-writing the correct values. Single-event functional interrupts (SEFIs),
another type of SEE, can cause a semi-permanent fault that requires a circuit to be
power-cycled to restore correct operation. Single-event latchups (SELs) are destructive
SEEs that occur when a particle causes a parasitic forward-biased structure within the
device substrate that can allow destructively high amounts of current to flow through the
substrate, potentially damaging the device. SELs can be avoided by using appropriate
device manufacturing processes, and devices produced using silicon-on-insulator (SOI)
processes are largely immune. SEL immunity is an important property for selecting
devices for many space systems.
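The soft-error recovery mechanism mentioned above, re-writing the correct values, is the basis of memory scrubbing. The following is a toy Python model of that loop; real scrubbers compare configuration frames read through a configuration port against a golden copy or error-correcting codes:

```python
def scrub(memory: list, golden: list) -> int:
    """Rewrite any word that deviates from the golden copy; return fix count."""
    fixes = 0
    for i, (word, good) in enumerate(zip(memory, golden)):
        if word != good:
            memory[i] = good  # soft error: rewriting restores correctness
            fixes += 1
    return fixes

golden = [0b1010, 0b0110, 0b1111]
memory = list(golden)
memory[1] ^= 0b0100  # inject an SEU: one flipped bit
assert scrub(memory, golden) == 1 and memory == golden
```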
2.4 FPGAs in Space Systems
FPGA configuration memories can be constructed using several different technologies
(e.g., antifuse, flash, and SRAM), each with performance and reliability tradeoffs.
Traditionally, antifuse-based FPGAs have been used in space systems to provide
simple processing capabilities and “glue logic” to interconnect multiple peripherals
or to combine/replace the functionality of multiple ASICs. Antifuse-based FPGAs are
one-time programmable where the configuration process creates/fixes the FPGA’s
physical routing interconnect structure. This configuration process provides an inherent
level of fault tolerance from SEUs since the antifuse-based routing cannot be reversed.
Additionally, many commercially available antifuse-based FPGAs (e.g., Actel RTSX-SU
[Actel 2010a]) include replicated flip-flop cells to prevent upsets in the sequential logic.
Antifuse devices also generally have a high TID threshold and immunity to SELs.
Unfortunately, when compared to flash-based or SRAM-based FPGAs, antifuse-based
FPGAs contain a relatively small amount of available logic gates and the fixed-logic
structure limits performance potential.
Flash-based FPGAs (e.g., Actel RT-ProASIC3 [Actel 2010b]) attempt to maintain
an antifuse-based FPGA’s reliability while increasing the amount of configurable
logic (logic available for a system designer to implement application functionality) and
allowing reconfiguration. Flash-based FPGA configuration memories are composed
of radiation-tolerant flash cells that provide reliability for combinational logic; however,
system designers must insert sequential logic replication to fully protect the FPGA from
faults. Even though flash-based FPGAs can be fully reconfigured to support multiple
applications, flash-based FPGAs do not support the PR capability that is available on
some SRAM-based FPGAs due to a lack of architectural and vendor support for such a
capability. Concerns over the TID effects on flash-based logic (floating-gate transistors)
have prevented wide-spread acceptance of flash-based FPGAs in space systems [Wang
2003].
SRAM-based FPGAs are the most radiation-susceptible type of FPGA since the
design functionality is stored in vulnerable SRAM cells (i.e., configuration memory),
and configuration memory upsets cause functional changes to the design’s logic.
Traditionally, this vulnerability has prevented SRAM-based FPGAs from being used in
highly critical space applications; however, some space systems use space-qualified
SRAM-based FPGAs for onboard processing. Space-qualified FPGAs are similar to
COTS FPGAs but are produced using epitaxial wafers, use ceramic, hermetically-sealed
packaging, and have been tested to ensure that damaging SEL events will not
compromise the system. These devices are rated for TID levels high enough to be used
in space systems [Xilinx 2010c]. Even with these reliability techniques, these space
systems must still use several fault-mitigation strategies, such as TMR and configuration
scrubbing, to ensure that system upsets are detectable and recoverable. (We note that
unless specified, all further FPGA references implicitly refer to SRAM-based FPGAs.)
Traditional FPGA-based space system designs leverage spatial TMR. TMR uses
a reliable majority voter connected to three identical module replicas in order to detect
and mask errors in any one module. In the context of TMR, a module refers to the
functional unit being replicated, which can range from a single logic gate to an entire
device. There are two primary TMR variations: external and internal. External TMR
uses three independent FPGAs working in lockstep where each FPGA implements a
module replica and the outputs are connected to an external radiation-hardened voter
that compares the results. External TMR requires significant hardware overhead
(each protected module is triplicated and board layout complexity is significantly
increased), but is reliable. The RCC board produced by SEAKR engineering [Troxel
et al. 2008] uses external TMR to provide reliable computation using Xilinx Virtex-4
FPGAs. Alternatively, internal TMR creates three identical modules within a single
FPGA, and the majority voter resides internally or externally [Carmichael et al. 1999].
Internal TMR can reduce the number of physical FPGAs required to implement a space
application, but may increase the chance of a common-mode failure, where multiple
modules fail simultaneously from a single fault. For example, a SEFI may cause multiple
internally-replicated modules to fail, whereas externally-replicated FPGAs would be
immune. Several tools assist system designers in incorporating TMR into space system
designs [Pratt et al. 2006; Xilinx 2004].
In addition to TMR, configuration scrubbing prevents error accumulation in FPGA
configuration memory. While TMR masks individual errors, TMR does not correct the
underlying fault and cannot correct errors that occur in multiple modules. Scrubbing
uses an external device to read back the FPGA’s configuration memory and compares
the read configuration memory to a known “good” copy. Alternatively, some FPGAs
calculate an error correction code (ECC) during configuration read-back for every
configuration frame, which can be used to detect and correct configuration faults. If a
mismatch is detected, the correct configuration can be written using PR without halting
the entire FPGA operation [Xilinx 2010a]. Traditionally, scrubbing is performed by an
external radiation-hardened microcontroller to ensure reliability of the reconfiguration
process. However, a self-scrubber may be implemented within the FPGA using the ICAP
available in Xilinx Virtex-4 and newer devices. Xilinx has also developed a single-event
mitigation (SEM) IP core which can perform configuration memory error detection and
correction for user designs [Xilinx 2013b].
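Readback scrubbing reduces to a compare-and-rewrite loop over configuration frames. In this hedged Python sketch, `read_frame` and `write_frame` are hypothetical device-access callbacks standing in for the real SelectMAP/ICAP frame accesses (or per-frame ECC checks):

```python
def scrub(read_frame, write_frame, golden_frames):
    """Compare each configuration frame against a known-good copy and
    rewrite (via partial reconfiguration) any frame that differs."""
    corrected = []
    for addr, golden in enumerate(golden_frames):
        if read_frame(addr) != golden:
            write_frame(addr, golden)   # repair only the corrupted frame
            corrected.append(addr)
    return corrected                    # frame addresses that were repaired
```

Because only mismatching frames are rewritten, the rest of the FPGA continues operating during the repair, which is the property that makes scrubbing compatible with continuous processing.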
Despite these drawbacks, SRAM-based FPGAs have been used in many space
systems, including earth-observing science satellites, communication satellites, and
satellites and rovers for the Venus and Mars missions. For instance, space-qualified
Xilinx Virtex-1000 (XQVR1000) devices were used on the Mars Exploration Rovers for
motor control functions and four XQR4000XLs were used for the lander pyrotechnics
[Ratter 2004]. Configuration read-back and scrubbing were used for detection and
correction of SEUs, and the full system was cycled once per Martian evening to remove
persistent errors. These systems, which contained very little logic as compared to
today’s standards and lacked the ability for PR, were not used for data processing.
The increased logic resources on recent FPGA families enable future systems to
use FPGAs for image processing and other computation-intensive applications. The
SpaceCube created at NASA Goddard Space Flight Center uses multiple commercial
Xilinx Virtex-4 FPGAs along with a radiation-hardened processor to provide onboard
processing capabilities for a variety of missions [Flatley 2010]. The SpaceCube provided
computational power for the Relative Navigation Sensors (RNS) experiment during
Hubble Servicing Mission 4 and is currently being used as an on-orbit test platform
aboard the Naval Research Laboratory’s MISSE-7 experiment on the International
Space Station (ISS). Each FPGA in the system contains a self-scrubber module,
protected with TMR, to correct errors and prevent error accumulation.
2.5 Low-Overhead Fault Tolerance Methods
While TMR is the most common fault tolerance method for FPGAs, the high area
overhead due to replicating modules partially negates some FPGA benefits. Therefore,
much research has focused on developing alternative, low-overhead fault-tolerance
methods for low-upset environments, such as replication-based fault tolerance [Laprie
et al. 1990], ECCs [Rao and Fujiwara 1989], and application-specific optimizations
to provide low-cost reliability. Many of these alternative techniques can detect errors
quickly, but may require additional processing or complete re-computation to correct
the errors. When the expected upset rates are low, the re-computation rates may be
acceptably low, depending on application throughput requirements.
Replication-based fault tolerance represents the most commonly used type of
fault-mitigation strategy, due to conceptual simplicity and high fault coverage. TMR
can detect single errors and can correct/mask single errors using a majority voter.
Duplication with compare (DWC) is an alternative replication-based method that
compares the outputs of duplicated modules. DWC reduces the resource overhead
by one half as compared to TMR, but DWC cannot correct errors and must fully
re-compute data when errors are detected [Johnson et al. 2008]. Shim et al. [2004]
proposed reduced-precision redundancy (RPR) for numerical computation. RPR
triplicates an application module as in TMR, but the replicas have lower precision
or only operate on the most-significant bits of application data, ensuring that the
most-significant bits are protected and errors in the least-significant bits are treated
as noise in the system. RPR reduced the number of detectable faults (fault coverage),
and resulted in significant resource savings while maintaining sufficient signal-to-noise
ratios for many DSP applications. Morgan et al. [2007] investigated the use of temporal
redundancy and “quadded logic” as additional methods for providing fault tolerance
through redundancy. However, due to inefficient mapping to the underlying FPGA
architecture, their methods did not provide fault tolerance improvement over TMR and
imposed a large area overhead. The effects of the fault-tolerant methods were offset by
the increased cross-sectional vulnerability of the larger FPGA designs.
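One possible reading of the RPR selection logic can be sketched numerically: two reduced-precision replicas bound the full-precision result, and a disagreement larger than the replicas' precision indicates a fault in the most-significant bits. The names and tolerance handling below are illustrative assumptions, not Shim et al.'s exact formulation:

```python
def rpr_vote(full: float, lo_a: float, lo_b: float, tol: float) -> float:
    """If the full-precision output agrees with both low-precision
    replicas to within `tol`, trust it; otherwise fall back to a
    low-precision value, treating sub-threshold discrepancies as noise."""
    if abs(full - lo_a) <= tol and abs(full - lo_b) <= tol:
        return full                     # MSBs verified by both replicas
    return lo_a if abs(lo_a - lo_b) <= tol else full
```

The resource savings come from the replicas operating on far fewer bits than the primary module, at the cost of leaving least-significant-bit errors undetected.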
Even though these replication-based methods for fault tolerance are suitable
for FPGAs, several new fault-tolerant FPGA architectures leverage an FPGA’s high
capacity and flexibility while maintaining reliability. Alnajiar et al. [2009] proposed a
hypothetical coarse-grained multi-context FPGA architecture that supported TMR, DWC,
and single-context modes. Their work explored the effects of soft errors and aging on
the proposed architecture, with an example Viterbi decoder module mapped to the
hypothetical architecture. Kyriakoulakos et al. [2009] proposed simple modifications
to the current Virtex-5 and Virtex-6 architectures to allow native support for DWC and
TMR. By adding an XOR-gate between each 5-input LUT within a larger 6-input LUT, the
existing architecture and synthesis tools required only minimal changes to support their
approach, while incurring only 17.5% to 76% slice utilization overhead for DWC or TMR,
respectively.
2.6 Algorithm-Based Fault Tolerance
Algorithm-based fault tolerance (ABFT) is a method that can be used with
many linear-algebra operations to provide fault tolerance without the use of explicit
replication. The ABFT method was originally described for matrix multiplication and
LU decomposition [Huang and Abraham 1984] but has been expanded to protect
other algorithms comprised of linear operations, such as QR decomposition, Fast
Fourier Transform [Tao and Hartmann 1993; Wang and Jha 1994], and finite element
analysis [Mishra and Banerjee 2003; Roy-Chowdhury et al. 1996]. The traditional
description of ABFT was designed for systolic arrays, but the method has also been
used in multiprocessor-based, high-performance computing [Yao et al. 2012].
ABFT augments an original data matrix with row and/or column checksums
and the linear-algebra operation is performed on the new, augmented matrix. If
the linear-algebra operation is computed successfully, the resulting augmented
matrix will contain valid, consistent checksums [Huang and Abraham 1984]. ABFT
checksum generation and comparison has lower computational complexity than the
primary linear-algebra operation. ABFT computational overhead is generally low, and
as a proportion of total computation, decreases as the matrix size increases. The
mathematical basis of ABFT for the matrix multiplication and FFT algorithms will be
shown in Section 3 and Section 4, respectively.
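The checksum property for matrix multiplication can be demonstrated directly: augmenting A with a column-checksum row and B with a row-checksum column yields a product whose checksums remain consistent, and a single data error violates them. A minimal Python demonstration (using exact integer arithmetic, so no rounding threshold is needed):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def add_col_checksum(A):          # append a row of column sums
    return A + [[sum(col) for col in zip(*A)]]

def add_row_checksum(B):          # append a column of row sums
    return [row + [sum(row)] for row in B]

def abft_consistent(C):
    """True when every row and column of the augmented product sums
    to its checksum element."""
    rows_ok = all(sum(row[:-1]) == row[-1] for row in C[:-1])
    cols_ok = all(sum(col[:-1]) == col[-1] for col in zip(*C))
    return rows_ok and cols_ok

C = matmul(add_col_checksum([[1, 2], [3, 4]]),
           add_row_checksum([[5, 6], [7, 8]]))
assert abft_consistent(C)      # fault-free computation: checksums hold
C[0][0] += 1                   # inject a single data error
assert not abft_consistent(C)  # checksum mismatch flags the error
```

Note that generating and checking the checksums costs O(n^2) operations against the O(n^3) multiplication itself, which is the source of ABFT's low relative overhead for large matrices.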
When using ABFT with floating-point algorithms, the introduction of rounding and
precision errors requires the use of a threshold comparison to differentiate rounding
errors from incorrectly computed data. The determination of a sufficient threshold is
significantly influenced by properties of the input data, and can affect error coverage and
the number of false positive results [Chowdhury and Banerjee 1996; Roy-Chowdhury
and Banerjee 1993]. Additionally, the original description of ABFT considered algorithms
executed on systolic arrays, with each processing element computing a single data
element of the result matrix. In these systems, a single fault could not propagate to other
processing elements and would only result in a single erroneous result element. Silva
et al. [1998] investigated these vulnerabilities in traditional ABFT implementations and
proposed methods for improving fault coverage using a Robust ABFT approach.
2.7 Task Scheduling
Task scheduling algorithms can be categorized as either online or offline. Offline
scheduling algorithms have complete knowledge of all tasks that must be scheduled.
The general scheduling optimization problem is NP-hard, but efficient heuristics exist,
and offline schedules can be pre-determined at compile-time. With online scheduling,
tasks arrive at the scheduler periodically over time, and must be placed around
previously scheduled tasks. Task arrival rates and patterns can greatly affect the
quality of the scheduler’s results. Arndt et al. [2000] examined many online scheduling
algorithms for distributed parallel computers, and used simulation to evaluate their
performance. Of the several algorithms studied, the FirstFit algorithm, which prioritizes
scheduling by the arrival time of each task, provided good schedule lengths while
minimizing the average wait times.
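A minimal interpretation of such a FirstFit policy can be sketched as follows; the task model and the release/wait mechanics here are illustrative assumptions, not Arndt et al.'s exact simulator:

```python
import heapq

def first_fit(tasks, num_procs):
    """FirstFit sketch: tasks = [(arrival, duration, procs_needed)],
    already ordered by arrival time; each task starts as soon as
    enough processors are free. Returns (task_index, start_time)."""
    free = num_procs
    running = []                      # min-heap of (finish_time, procs)
    schedule = []
    for i, (arrival, duration, need) in enumerate(tasks):
        assert need <= num_procs, "task can never fit"
        t = arrival
        while True:
            # release processors from tasks finished by time t
            while running and running[0][0] <= t:
                free += heapq.heappop(running)[1]
            if free >= need:
                break
            t = running[0][0]         # wait for the next finish event
        free -= need
        heapq.heappush(running, (t + duration, need))
        schedule.append((i, t))
    return schedule
```

Because tasks are served strictly in arrival order, early arrivals are never delayed by later ones, which is the behavior behind the low average wait times reported for FirstFit.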
Real-time systems introduce additional constraints for task scheduling. Each task
must be completed by a deadline, otherwise the results will no longer be relevant or
needed. A hard deadline must be met, otherwise the system is considered failed. A firm
deadline can be missed, but the usefulness of the result after the deadline is zero. A
soft deadline can be missed, but the value of the result decreases after the deadline has
passed. For a hard real-time system all deadlines must be met, but the goal of a soft
real-time system is to meet as many deadlines as possible while optimizing for other
criteria. Traditionally, schedulers attempt to minimize criteria such as makespan (total
schedule length) or average task latency.
Han et al. [2003] created a fault-tolerant scheduling algorithm for periodic real-time
software tasks. For each primary task, an alternate less-precise task is also used to
generate a sufficient result before the deadline. These alternate tasks are scheduled as
close to the task deadline as possible. In the case of a primary task failure, the alternate
task will be executed. If the primary task succeeds, the alternate task is discarded.
Their algorithm is designed for offline use and was intended to protect systems against
software faults. Pathan [2006] extends the rate-monotonic scheduling algorithm to
support temporal TMR (RM-FT), scheduling multiple copies of tasks to mask faults in
any one copy. RM-FT also requires periodic tasks to perform a scheduling analysis.
Scheduling aperiodic real-time tasks is a more difficult problem and is currently being
studied.
Scheduling tasks for reconfigurable computing creates an additional level of
complexity. Instead of scheduling a task for a one-dimensional array of processors,
scheduling on a two-dimensional FPGA fabric becomes a constrained placement
problem. Additionally, tasks may have multiple hardware or software implementations,
increasing the overall search space. Banerjee et al. [2005] present an offline KLFM
heuristic [Kernighan and Lin 1970] which incorporates detailed placement information in
order to provide high-quality schedules. Mei et al. [2000] combine a genetic algorithm
to determine HW/SW placement with a traditional list scheduling algorithm to enable
online scheduling of real-time reconfigurable embedded systems. Steiger et al. [2004]
developed two heuristic scheduling algorithms, Horizon and Stuffing, which provide good
results while limiting the computational requirements.
CHAPTER 3
FRAMEWORK FOR RECONFIGURABLE FAULT TOLERANCE
RFT leverages COTS components in space systems to achieve high performance
and flexibility while maintaining reliability. Beyond traditional, spatial TMR, alternative
fault-mitigation methods may be appropriate given a particular application’s performance
requirements and set of expected environmental factors (e.g., radiation). Other
methods, such as temporal TMR, ABFT, software-implemented fault tolerance (SIFT), or
checkpointing and rollback, may be suitable for system-level protection. Each alternative
fault-mitigation method has tradeoffs between performance, reliability, and overhead,
and Pareto-optimal operation may change over a system’s lifetime. Therefore, the main
goal of our proposed RFT framework is to enable a system to autonomously adapt to
Pareto-optimal operation based on the current system’s environmental situation.
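The classical closed-form reliability of TMR with a perfect voter illustrates why the optimal mode shifts with the environment: TMR improves on a single module only while per-module reliability stays above 0.5, so heavier redundancy is not uniformly better. A quick numerical sketch (the rates below are illustrative, not mission data):

```python
import math

def module_reliability(lam: float, t: float) -> float:
    """R(t) = exp(-lambda * t) for a module with constant fault rate."""
    return math.exp(-lam * t)

def tmr_reliability(r: float) -> float:
    """TMR with a perfect voter survives when at least 2 of 3
    replicas are fault-free: R_TMR = 3R^2 - 2R^3."""
    return 3 * r**2 - 2 * r**3

r_benign = module_reliability(1e-4, 100.0)   # low upset rate
r_harsh = module_reliability(1e-1, 100.0)    # high upset rate
assert tmr_reliability(r_benign) > r_benign  # redundancy helps
assert tmr_reliability(r_harsh) < r_harsh    # redundancy hurts
```

The crossover at R = 0.5, combined with TMR's threefold resource cost, is exactly the kind of environment-dependent tradeoff the RFT framework is designed to navigate at runtime.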
Our RFT framework consists of three main elements: a PR-based hardware
architecture; a fault-rate model for estimating orbital SEU rates; and a methodology
for modeling RFT system performability. The hardware architecture, described in
Section 3.1, is similar to other traditional SoC architectures used for reconfigurable
computing, with multiple, identical PRRs, which are leveraged for module redundancy.
The hardware architecture allows the system to execute each processing module
(PRM) in several possible fault-tolerance modes, each with differing performance and
reliability characteristics. RFT software adapts the amount of per-module redundancy in
accordance with the current environment and temporarily pauses only the reconfigured
hardware modules to preserve and record application state while changing the
fault-tolerance mode, allowing the remainder of the system to continue processing.
Additionally, modules with constant state can continue processing during the adaptation
process. Section 3.2 presents a model for estimating expected fault rates in potential
space-system orbits. Finally, Section 3.3 presents a performability model that quantifies
the RFT benefits in environments with varying upset rates.
3.1 RFT Hardware Architecture
Figure 3-1 shows the high-level architecture of an FPGA-based SoC design with
PRRs integrated with an RFT controller. The main architectural components include
a microprocessor, a memory controller, I/O ports, PRRs (1 . . . N) for PRMs, and the
system interconnect, which connects all of these components to the microprocessor.
Since we leverage Xilinx FPGAs, the microprocessor is a MicroBlaze (less resource-intensive
processors such as PicoBlaze can also be used) and the system interconnect is a
processor local bus (PLB). The MicroBlaze orchestrates PRR reconfiguration using the
ICAP, maintains the state of the currently active PRMs, and initiates fault-tolerance mode
switching.
All architectural components except for the PRRs are protected using TMR since
these components’ functionality is crucial to the entire system’s reliability. Although the
ICAP cannot be replicated, the signals to and from the ICAP are also protected with
TMR. Tools such as Xilinx’s TMRTool [Xilinx 2004] or BYU’s EDIF-based TMR tool [Pratt
et al. 2006] automate TMR design creation by applying low-level TMR voting on the
design’s original, unprotected netlist. For additional SEU protection, FPGA configuration
scrubbing should be performed in order to prevent error accumulation. Scrubbing can
be performed with an external scrubber and radiation-hardened configuration storage or
with an internal scrubber using the internal configuration ECC present in Virtex-4 and
later devices.
Each PRM uses a PLB-compatible interface that connects to the RFT controller
if the PRM is an RFT-enabled module (a module replicated/instantiated by the RFT
controller), or directly to the PLB otherwise. The RFT
controller instantiates the bus macros or other low-level components required for
interfacing with the PRRs. The RFT controller also contains multiple majority voters
and comparators (voting logic) that can be used to detect or correct errors by evaluating
the replicated PRMs’ outputs. The RFT-enabled modules are used in parallel to create
redundancy-based, fault-protection modes (e.g., DWC, TMR) by interfacing with the RFT
controller’s voting logic. Additionally, other single-module fault-protection modes, such as
ABFT, can be used for individual PRRs and the RFT controller provides additional fault
tolerance components, such as watchdog timers, to detect hanging conditions within
PRMs.
Table 3-1 lists the currently supported fault-tolerance modes of RFT and the modes’
fault-tolerance type and PRR requirements. The MicroBlaze evaluates the system’s
current performance requirements and monitors external stimuli (radiation) using
external sensors to determine when the fault-tolerance mode should be switched, at
which time the MicroBlaze reconfigures the appropriate PRRs and the RFT controller’s
internal voting scheme between the PRRs’ outputs for the new fault-tolerance mode.
3.1.1 RFT Controller Operation
Figure 3-2 illustrates the interface between PRRs and the PLB for an RFT controller
that can operate in single-module and redundancy-based fault-tolerance modes.
This RFT controller interface routes input signals from the system interconnect
to the appropriate PRRs and routes voting output signals back to the PLB. The
abstraction of this logic into the RFT controller enables pre-existing PRMs to leverage
the fault-tolerance modes and interface with the RFT controller with minimum modifications.
To communicate data and control signals between the MicroBlaze, the PRRs, and the
RFT controller, at design time, the system designer assigns a large memory-mapped
region of the MicroBlaze’s address space to the RFT controller. In order to route
signals to specified PRRs, the system designer also subdivides this region into smaller
subregions and assigns a subregion to each PRR, while taking into consideration the
memory interface requirements of each potential PRM. Each PRM must implement
the actual memory interface (e.g., dual-port RAM, FIFO-based interface), which allows
pre-existing PRMs to interface with the RFT controller with minor modifications since no
specific interface is required. When communication data from the MicroBlaze arrives
over the PLB, the RFT controller’s address decoder processes the input address and
determines the destination PRR(s) based on the current fault-tolerance mode. While
operating in single-module fault-tolerance modes, the data is passed directly to the PRR
specified by the decoded address. While operating in redundancy-based fault-tolerance
modes, the data is passed to multiple PRRs using appropriate enable signals to the
PRRs’ interfaces.
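The decoding step can be summarized as a small function. The base address, subregion size, and the grouping of consecutive PRRs into DWC pairs and TMR triples below are illustrative assumptions consistent with the six-PRR prototype described later, not the controller's actual register map:

```python
RFT_BASE = 0xC000_0000    # hypothetical memory-mapped base address
PRR_SIZE = 0x0001_0000    # hypothetical per-PRR subregion size
NUM_PRRS = 6

def decode(addr: int, ft_mode: str):
    """Map a PLB address to its destination PRR(s). In redundancy-based
    modes the same write fans out to every replica in the group."""
    offset = addr - RFT_BASE
    prr = offset // PRR_SIZE
    assert 0 <= prr < NUM_PRRS, "address outside RFT region"
    if ft_mode == "TMR":
        group = prr - prr % 3         # triples of neighboring PRRs
        return [group, group + 1, group + 2]
    if ft_mode == "DWC":
        group = prr - prr % 2         # pairs of neighboring PRRs
        return [group, group + 1]
    return [prr]                      # single-module mode
```

Fanning out writes in the decoder, rather than in software, is what keeps the replicas' input streams identical without any changes to the PRMs themselves.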
The RFT controller routes the outputs from each individual PRR, the outputs
from the voting logic via the FT-mode register, or recorded internal status information
over the PLB to the MicroBlaze. When the MicroBlaze requests a PRR’s output, the
RFT controller’s output multiplexer (Output Mux in Fig. 3-2) selects the appropriate
PRR output to route to the PLB controller based on the decoded address from the
MicroBlaze and the current FT-Mode value. In single-module fault tolerance modes,
the RFT controller’s output multiplexer routes signals directly from the PRRs to the
PLB controller, which bypasses the RFT’s internal voting logic. In redundancy-based
fault-tolerance modes, the output multiplexer routes the verified outputs from the RFT
controller’s voting logic to the PLB controller.
In addition to providing routing and voting logic for redundancy-based fault-tolerance
modes, the RFT controller supports additional fault-tolerance capabilities that do not
require redundant PRMs. If the system is operating in a single-module fault-tolerance
mode, each PRM may use the RFT controller’s watchdog timers or perform internal
fault detection. The RFT controller provides each PRR with an optional watchdog timer
interface using a signal that must be asserted periodically within a user-defined time
interval (usually on the order of seconds). If the PRM does not assert the watchdog
timer reset signal within the time interval, the RFT controller’s interrupt generator
asserts an interrupt signal to the MicroBlaze, alerting the MicroBlaze of a possible
failed PRR. Additionally, PRMs that perform internal fault detection must include an
interrupt signal to notify the RFT controller of internally detected errors, which the RFT
controller propagates to the MicroBlaze. PRMs using ABFT perform self-checking (or
self-correction if the ABFT operation permits) on internally generated checksums and
send an interrupt signal to the RFT controller when data errors are detected. Similarly,
PRMs that use internal TMR can also signal detected and corrected errors to the
RFT controller. PRMs without any specific fault-tolerance features can also use the
RFT controller’s watchdog timer, which still allows the RFT controller to detect module
hang-ups or other operational errors.
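The watchdog interface described above reduces to a simple down-counter; this Python sketch models the behavior (the tick granularity and names are illustrative, and the real counter is hardware inside the RFT controller):

```python
class WatchdogTimer:
    """The PRM must kick the timer within `interval` ticks; otherwise
    `expired` latches, modeling the interrupt to the MicroBlaze that
    flags a possibly hung module."""
    def __init__(self, interval: int):
        self.interval = interval
        self.remaining = interval
        self.expired = False

    def kick(self):                   # PRM asserts the reset signal
        self.remaining = self.interval

    def tick(self):                   # one elapsed time step
        self.remaining -= 1
        if self.remaining <= 0:
            self.expired = True       # interrupt: possible failed PRR
```

A hang is the one failure mode redundancy voting cannot see (no output is produced to vote on), which is why even unprotected PRMs benefit from the watchdog.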
3.1.2 MicroBlaze Operation
In an RFT system, the MicroBlaze enables additional operational features, such
as fault-tolerance methods to complement the fault-tolerance modes offered by the
RFT controller. Additionally, the MicroBlaze maintains the system’s fault-tolerance state
(e.g., active PRMs, current PRRs’ fault-tolerance modes, etc.) and orchestrates PRR
reconfiguration and the fault-tolerance mode switching process.
To protect the application state while reconfiguring PRRs, the MicroBlaze provides
support for PRM checkpointing and rollback. A checkpoint, which the MicroBlaze
stores in external memory, consists of the minimum set of application state information
needed to restore the application’s current state. In the event of a fault, an application
can use the previous checkpoint to roll back to a known good state, instead of wasting
execution time by beginning execution at an initial starting state. A PRM’s state can
be checkpointed if the PRM can be read and modified by the MicroBlaze. In an RFT
system, the state of all PRMs should be checkpointed periodically in order to reduce
wasted computation in the event of a fault-induced PRM restart. The stored checkpoints
can then be used during fault recovery procedures to improve system availability.
In addition to handling PRM checkpointing, the MicroBlaze also handles fault
recovery and reconfiguration procedures. If the RFT controller’s voting logic detects
faults in the PRMs’ outputs, the RFT controller records fault status information about the
faulty PRM (e.g., error location, time, etc.) in internal FT-status registers and sends an
interrupt to the MicroBlaze, which initiates the reconfiguration procedure. The FT-status
registers record fault status information that may be used by the MicroBlaze to make
fault-tolerance mode decisions or to provide system log information to system operators.
For single-module fault-tolerance modes, the MicroBlaze corrects the faulty PRM by
reconfiguring the PRR with the PRM’s original bitstream over the ICAP. For the DWC
fault-tolerance mode, both of the associated PRMs must be reconfigured since the faulty
PRM cannot be identified. If checkpoints exist for the PRMs, the MicroBlaze initializes
the new PRMs using these checkpoints. Alternatively, if the RFT system is operating in
the TMR fault-tolerance mode, the MicroBlaze checkpoints one of the non-faulty PRMs
while the faulty PRM is reconfigured. After the non-faulty PRM has been checkpointed,
the RFT controller pauses the non-faulty PRMs using clock gating to keep the PRMs
synchronized. Once the faulty PRM has been reconfigured, the MicroBlaze initializes the
newly reconfigured PRM from the most recent checkpoint, and the RFT controller can
re-enable the clocks for all three of the replicated PRMs to resume operation.
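The TMR recovery sequence can be summarized as an ordered procedure. Here `hw` is a hypothetical handle bundling the checkpoint, clock-gating, and ICAP operations the text describes; in the real system these steps are split between the MicroBlaze and the RFT controller, and checkpointing may overlap the reconfiguration:

```python
def recover_tmr_fault(faulty_prr, healthy_prrs, hw):
    """Sketch of TMR fault recovery: repair one replica and rejoin it
    to the group without restarting the application from scratch."""
    state = hw.checkpoint(healthy_prrs[0])     # save a known-good state
    hw.gate_clocks(healthy_prrs)               # pause replicas to stay in sync
    hw.reconfigure(faulty_prr)                 # rewrite the PRM bitstream via ICAP
    hw.restore(faulty_prr, state)              # initialize from the checkpoint
    hw.enable_clocks([faulty_prr] + healthy_prrs)  # resume all three replicas
```

The key point is that the checkpoint comes from a *healthy* replica, so the repaired module rejoins the group with state identical to its peers and voting can resume immediately.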
Finally, the MicroBlaze orchestrates the switching procedure between different RFT
fault-tolerance modes. Fault-tolerance mode switching can be triggered by external
events, by a priori knowledge of the operating environment, or by application-triggered
events, and the fault-tolerance mode switching procedure may vary on a per-system
basis depending on the specific system performance and reliability requirements. The
MicroBlaze can use information from the fault-status registers to make Pareto-optimal
decisions about future fault-tolerance modes. Before fault-tolerance mode switching,
the MicroBlaze ensures that sufficient PRRs are available for the new fault-tolerance
mode. Additional PRRs may be required or PRRs may be freed when switching from a
single-module fault-tolerance mode to a redundancy-based fault-tolerance mode or vice
versa, respectively. The MicroBlaze signals the RFT controller to change fault-tolerance
modes by writing to the RFT controller’s FT-Mode register. The MicroBlaze reconfigures
the PRRs involved in the fault-tolerance mode switching via the ICAP with partial
bitstreams for the appropriate PRM or a blank partial bitstream if the PRR is not required
for the new fault-tolerance mode. When the reconfiguration process is complete, the
MicroBlaze signals the RFT controller, by rewriting the FT-Mode register, to resume PRM
operation.
3.1.3 Environment-Based Fault Mitigation
RFT fault-tolerance mode switching can be triggered by a priori knowledge of the
operating environment, by application-triggered events, or by external events. While a
priori knowledge and application-triggered events are convenient for modeling purposes,
real-world systems leverage measurements from attached sensors to determine the
system’s current environmental status. Additionally, due to the unpredictability of space
weather conditions, such as solar flare events, an RFT system must be able to respond
dynamically to the changing environment.
In an RFT system, the current expected fault rate can be estimated either directly or
indirectly. An external radiation sensor can be directly interfaced with the FPGA, allowing
the MicroBlaze to track the current fault rate and predict future fault rates. Alternatively,
the RFT system can indirectly determine fault rates by tracking the number of data and
configuration faults detected during operation. Since an FPGA’s fabric is composed of
large SRAM arrays, these arrays can be used as makeshift radiation detectors. If a fault
is detected in the FPGA configuration or data memory, either through readback during
scrubbing or from the RFT controller’s logic, the fault can be recorded or can be used to
make decisions about which RFT fault-tolerance mode to use. One simple, rule-based
method for choosing the RFT fault-tolerance mode is to use a sliding window approach.
For instance, a system may use the following rules:
1. If there were any faults in the past 5 minutes, use the DWC fault-tolerance mode.

2. If there were more than 5 faults in the past 5 minutes, use the TMR fault-tolerance
mode.

3. Only transition from the TMR fault-tolerance mode to a lower-reliability mode after
5 minutes of fault-free operation.
The size of the sliding window and the RFT fault-tolerance mode choices are system-
and mission-dependent. A larger window size can provide a more conservative
fault-tolerance strategy, while a small window size can more quickly adapt to spikes
in the experienced fault rate. Implementing a rule with hysteresis, such as rule (3), can
produce a more reliable strategy, while requiring a smaller window size.
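The sliding-window rules above can be sketched directly. The 5-minute window and fault thresholds are the example's illustrative values, and the mode names follow the RFT modes discussed earlier (HP denoting an unprotected high-performance mode):

```python
from collections import deque

class SlidingWindowPolicy:
    """Rule-based fault-tolerance mode selection over a sliding window
    of fault timestamps (all values are the example's, not fixed by RFT)."""
    WINDOW = 5 * 60                 # seconds

    def __init__(self):
        self.faults = deque()       # timestamps of detected faults
        self.mode = "HP"

    def record_fault(self, now: float):
        self.faults.append(now)

    def select_mode(self, now: float) -> str:
        while self.faults and now - self.faults[0] > self.WINDOW:
            self.faults.popleft()   # drop faults outside the window
        n = len(self.faults)
        if n > 5:
            self.mode = "TMR"       # rule 2
        elif n >= 1:
            if self.mode != "TMR":  # rule 3 hysteresis: hold TMR while
                self.mode = "DWC"   # any fault remains in the window
        else:
            self.mode = "HP"        # 5 fault-free minutes: step down
        return self.mode
```

The hysteresis branch ensures the policy only leaves TMR once the window is completely empty, matching rule (3).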
3.1.4 RFT Controller Resource and Performance Overheads
To quantify the RFT controller’s resource overhead, we implemented an RFT-based
system on a Virtex-4 FX60-based platform, similar to the SpaceCube. The design uses
six PRRs connected to a MicroBlaze through a PLB-connected RFT controller and
operates at 100 MHz. Each PRR contains 2,000 slices for user logic. Each PRR can
operate in high-performance (HP) mode, DWC-mode with the PRR’s neighboring PRRs,
or TMR-mode with two consecutive, neighboring PRRs. The resources (LUT, FF, and
slice) required for the RFT controller and constituent modules are detailed in Table 3-2.
The final column of Table 3-2 shows the percentage of the Virtex-4 FX60 used by each
module. The RFT controller logic, excluding the PLB controller that would be required
for non-RFT designs, requires approximately 900 slices. The largest modules within
the RFT controller are the (optional) watchdog timers and the voting logic. Overall, the
complete RFT controller uses approximately 5% of the total FPGA.
Since frequent PRR reconfiguration combined with lengthy PRR reconfiguration
time can impose a performance overhead, we measured the reconfiguration of a
single PRR. The PRR reconfiguration time was measured using timing functions on a
MicroBlaze with a PLB-attached ICAP controller running at 100 MHz. We measured
the PRR reconfiguration time for a 2,000 slice PRR as approximately 15 ms. Even
though the RFT mode-switching overhead is dominated by this PRR reconfiguration
time, the mode-switching time incurs a negligible performance penalty due to the likely
low frequency of mode-switching. Additionally, the switching overhead only occurs when
increasing the fault tolerance redundancy (i.e., switching from DWC to TMR). When
decreasing the fault tolerance redundancy, the retained modules can continue running
without interruption.
3.2 RFT Fault-Rate Model
In order to analyze RFT’s reliability benefits, a suitable fault-rate model that
incorporates varying fault rates, system performance capabilities, and system fault-tolerance
modes is required. Traditional reliability analysis focuses on quantifying permanent
hardware failures in systems with long lifetimes. Since most processors and FPGAs
have sufficiently long lifetimes, permanent hardware failures due to long-term, end-of-life
failures can be ignored. Therefore, reliability analysis for space systems focuses on
short-term failure analysis by modeling SEU-induced computational failures. For
FPGA-based systems, these failures can cause either single or multiple data errors
through corrupted data or configuration memory. FPGA upset rates for space systems
are correlated with the magnetic field strength of the system’s current orbital position.
For example, as a system passes into the Van Allen radiation belt, the trapped charged
particles in the belt have a higher likelihood of interacting with the system. For systems
in a low-earth orbit, only a portion of the orbit passes through the inner Van Allen belt,
at a location known as the South Atlantic Anomaly (SAA). The SAA is the low-altitude
region where the inner Van Allen belt dips closest to the Earth, exposing systems to a
large number of trapped particles.
Existing fault-rate modeling generally produces a single, average orbital upset
rate; however, this average is not sufficient for RFT. In order for RFT to adapt the
fault-tolerance mode, the fault-rate model must estimate the expected fault rates based
on the instantaneous orbital position. To account for the orbit-dependent, time-varying
fault rates, data from several sources can be combined to form a more accurate
estimate than a single average rate. Our fault-rate model combines orbital position
and trajectory, magnetic field strength, and cosmic ray particle interaction data to provide
an accurate estimate of instantaneous fault rates. We use these fault-rate estimates as
input into multiple system-level Markov models in order to calculate reliability, availability,
and performability of RFT systems.
Our RFT fault-rate model combines three existing models to estimate time-varying
fault rates. Figure 3-3 illustrates the three models used as well as the inputs and
outputs to each model. To generate accurate fault estimates, a system’s time-varying
orbital position must be modeled. Orbital position can be estimated using the SGP4
model, which is a simplified general perturbation modeling algorithm. NORAD, which is
responsible for tracking space objects and space debris, developed SGP4 for tracking
near-earth satellites [Hoots and Roehrich 1980]. SGP4 accurately calculates a satellite’s
position given a set of orbital elements (apogee, perigee, inclination, etc.) collectively
referred to as a two-line element (TLE). Given a system’s TLE, the RFT fault-rate model
uses SGP4 to generate the system’s position information, in terms of latitude, longitude,
and altitude, over the user-defined modeling time period.
Next, the RFT fault-rate model passes the SGP4 positioning information for each
point along a specified orbit to the International Association of Geomagnetism and
Aeronomy’s (IAGA) International Geomagnetic Reference Field (IGRF) model [Maus
et al. 2005], which models the Earth’s magnetosphere. IGRF combines magnetosphere
data collected from satellites and observatories around the world in order to create the
most accurate and up-to-date model possible (the model is updated every five years).
For a given orbital position, IGRF outputs a McIlwain L-parameter, representing the set
of magnetic field lines that cross the Earth’s magnetic equator at a number of Earth-radii
equal to the value of the L-parameter. The inner Van Allen radiation belt corresponds
to L-values between 1.5 and 2.5. The outer Van Allen belt corresponds to L-values
between 4 and 6. In addition to identifying regions with trapped particles, the McIlwain
L-parameter can be used to estimate the effect of geomagnetic shielding (cutoff rigidity)
from galactic cosmic rays. The estimated L-parameters are then used, along with the
outputs of CREME96, to estimate SEU rates.
The CREME96 model [Tylka et al. 1997], a publicly available SEU estimation
tool, generates fault-rate estimates and has been used extensively to predict heavy ion
and proton-induced SEU rates, as well as estimate the expected total ionizing dose in
modern electronics. CREME96 combines orbital parameters, space system physical
characteristics, and silicon device process information to create a highly accurate SEU
simulation. Traditionally, CREME96 generates an average fault rate for a particular orbit,
obtained by averaging hundreds of orbits together. However, CREME96 can also
generate fault rates for orbital segments, which we delimit using the McIlwain
L-parameter. By running several simulations over very narrow L-parameter ranges,
we obtain an estimated fault rate for each segment. As the width of each
segment decreases, the generated fault rates become more precise and continuous.
The L-parameter outputs from the IGRF model are then mapped to the appropriate
orbital segment and the associated fault rate.
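The mapping from an L-parameter to its pre-computed segment fault rate is a simple interval lookup, which might be sketched as follows (the segment boundaries, rates, and function name below are invented placeholders for illustration, not CREME96 output):

```python
import bisect

# Hypothetical pre-computed table: upper L-bound of each segment and an
# invented upsets/hour estimate for that segment (placeholder values only).
l_bounds  = [1.2, 1.5, 2.0, 2.5, 4.0, 6.0]
seg_rates = [0.02, 0.10, 0.80, 0.30, 0.05, 0.40]

def fault_rate_for_l(l_value: float) -> float:
    """Map a McIlwain L-parameter to its segment's fault-rate estimate."""
    i = bisect.bisect_left(l_bounds, l_value)
    return seg_rates[min(i, len(seg_rates) - 1)]

print(fault_rate_for_l(1.8))  # inner-belt segment -> 0.8
```

Narrower segments simply mean longer `l_bounds`/`seg_rates` tables and a smoother rate profile.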
The SGP4, IGRF, and CREME96 models collectively generate a time-varying
fault-rate estimate over the course of a specified orbit. We implemented the RFT
fault-rate model as a C++-based program that connects the separate models together
and passes data between them. SGP4’s algorithms, along with C++/Java reference
code, are publicly available via the Internet. A FORTRAN-based implementation of
the IGRF algorithm for calculating position-based McIlwain L-parameters is publicly
available from NASA Goddard [Macmillan and Maus 2010]. Fault-rate information is
generated from the CREME96 model through a web-based interface and stored
for efficient offline use. Orbits described by TLEs can be visualized using open-source
tools, such as JSatTrak [Gano 2010]. The RFT fault-rate model program accepts TLE
data as input and generates time-varying fault-rate estimates as output. These fault-rate
estimates are then used by the RFT performability model.
3.3 RFT Performability Model
System reliability is the probability that a system is operating without faults after a
specified time period. Assuming exponentially-distributed random faults at rate λ, the
system reliability is traditionally defined as:
R(t) = e^{-\lambda t}    (3–1)
Mean-time-to-failure (MTTF) and mean-time-to-repair (MTTR) are the average amounts
of time before a system encounters a failure (or repair) event. For a fault rate λ (or repair
rate µ), MTTF (or MTTR) is defined as:

MTTF = \int_0^{\infty} R_\lambda(t)\, dt = \frac{1}{\lambda} \qquad MTTR = \int_0^{\infty} R_\mu(t)\, dt = \frac{1}{\mu}    (3–2)
System availability, which is similar to system reliability, estimates the long-term,
steady-state probability that the system is operating correctly, and is defined as:
A = \frac{MTTF}{MTTF + MTTR}    (3–3)
System unavailability is the opposite of availability, and is often used for convenience
when discussing systems with very high availability. Unavailability is defined as:
UA = 1 - A = \frac{MTTR}{MTTF + MTTR}    (3–4)
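These closed-form relations can be checked numerically, as in the minimal sketch below (the rates are arbitrary placeholders, not values from this work):

```python
import math

def reliability(lam: float, t: float) -> float:
    """R(t) = exp(-lambda * t): probability of no failure by time t."""
    return math.exp(-lam * t)

def availability(lam: float, mu: float) -> float:
    """Steady-state availability A = MTTF / (MTTF + MTTR)."""
    mttf = 1.0 / lam   # mean time to failure for exponential faults
    mttr = 1.0 / mu    # mean time to repair for exponential repairs
    return mttf / (mttf + mttr)

lam, mu = 0.1, 1.0  # placeholder fault and repair rates (per hour)
print(reliability(lam, 10.0))       # R(10)
print(availability(lam, mu))        # equivalently mu / (lam + mu)
print(1.0 - availability(lam, mu))  # unavailability UA = 1 - A
```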
Space system reliability and availability can be accurately modeled using Markov
models. A Markov model is composed of states and transition rates. A state represents
the current operating state of the system and the transition rates represent the
transitions from an operating state to a failure state (i.e., failure rates), or from a failure
state to an operating state (i.e., repair rates). The Markov model can be transformed
into a series of equations, which can be solved or approximated numerically using
tools such as SHARPE, an open-source fault-modeling tool [Sahner and Trivedi 1987],
to determine probabilities of each state. System reliability and availability can be
directly determined from the calculated state probabilities. For Markov models, the
instantaneous availability of repairable systems measures the probability of being in
an “available” vs. “failed” state at a given point in time. These types of models are
frequently used to estimate the effects of TMR, scrubbing, and other fault-tolerance
methods in FPGAs and other electronics [Dobias et al. 2005; Garvie and Thompson
2004; Pratt et al. 2007].
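As a concrete sketch of the simplest such model, a single repairable component with fault rate λ and repair rate µ forms a two-state Markov chain whose instantaneous availability has a closed form (an illustration only; the multi-PRR models used in this work have more states):

```python
import math

def instantaneous_availability(lam, mu, t, p_up0=1.0):
    """P(up at time t) for a two-state CTMC with failure rate lam,
    repair rate mu, and initial up-probability p_up0."""
    a_ss = mu / (lam + mu)  # steady-state availability
    return a_ss + (p_up0 - a_ss) * math.exp(-(lam + mu) * t)

# Starting in the "available" state, availability decays from 1.0
# toward the steady-state value mu / (lam + mu).
print(instantaneous_availability(0.1, 1.0, 0.0))    # 1.0
print(instantaneous_availability(0.1, 1.0, 100.0))  # ~0.909
```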
Markov reward models, a type of weighted Markov model, can be used to extend
the concept of system availability to measure system performability of adaptable
systems [Ciardo et al. 1990; Meyer 1982]. Performability is a metric that combines
system availability with the amount of work produced by the system and gives a
measure of total work performed. Performability is especially useful for gracefully
degradable systems or other systems whose characteristics change over time.
Assuming that X(t) is a semi-Markov process with state space S, continuous over
time t > 0, the instantaneous performability is defined by

Performability(t) = \sum_{a \in S} Perf(a) \cdot P\{X(t) = a\}    (3–5)

where Perf(a) is the system performance in state a. System performance can be
defined using any desired performance metric (e.g., throughput, execution time)
and performability is measured in the same units. In this context, instantaneous
availability can be viewed as the special case where Perf(a) = 1 in available states
and Perf(a) = 0 otherwise.
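A performability computation over a state-probability vector follows directly from Eq. 3–5, as in this sketch (the states, rewards, and probabilities are illustrative placeholders, not results from this work):

```python
def performability(probs, rewards):
    """Sum of Perf(a) * P{X(t) = a} over all states a."""
    assert abs(sum(probs.values()) - 1.0) < 1e-9  # sanity: probabilities sum to 1
    return sum(rewards[a] * p for a, p in probs.items())

# Hypothetical 3-state system: two modules working, one working, all failed.
probs   = {"2up": 0.90, "1up": 0.08, "0up": 0.02}
rewards = {"2up": 2.0,  "1up": 1.0,  "0up": 0.0}
print(performability(probs, rewards))  # ~1.88

# Availability is the special case Perf(a) = 1 for available states.
avail_rewards = {"2up": 1.0, "1up": 1.0, "0up": 0.0}
print(performability(probs, avail_rewards))  # ~0.98
```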
For reconfigurable FPGA systems where the system configuration changes over
time (e.g., our RFT architecture), the system must be modeled as a phased-mission
system. A phased-mission system is described using a set of unique models for each
phase of the mission. The states at the end of a given phase’s model map to the states
of the following phase’s model during phase transitions and the phase duration can be
modeled as either probabilistic or deterministic [Alam et al. 2006; Kim and Park 1994].
We use the RFT fault-rate model’s generated fault-rate estimates to drive multiple
system-level Markov models in order to calculate reliability, availability, and performability
of RFT systems. In order to incorporate varying fault rates (due to orbital position)
and varying system topologies (due to RFT), we leverage a phased-mission Markov
approach. We model the RFT system as a collection of individual phases, with each
phase consisting of a period of time where the fault-tolerance mode, failure rates,
and repair rates are constant. The phase lengths are both application-dependent
and orbit-dependent. At the end of each phase, the pre-transition state probabilities
are mapped onto initial probabilities for the post-transition Markov model. Figure 3-4
illustrates an example high-level model transitioning from TMR to DWC at time t1, and
then transitioning from DWC back to TMR at time t2. Fault rates (λ) and repair rates
(µ) are represented as directed graph edges between states. At each phase transition
(denoted by the dashed vertical lines), state probabilities are re-mapped (dashed
arrows). When re-mapping state probabilities from TMR to DWC, two TMR states are
merged into a single operational state. When re-mapping from DWC to TMR, the single
operational DWC state maps to the most-similar TMR state.
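The state re-mapping at a phase transition amounts to a simple transfer of probability mass, as in this minimal sketch (the state names and probabilities are invented placeholders; the real models contain more states):

```python
def tmr_to_dwc(p_tmr):
    """Merge the two operational TMR states (all-up and degraded)
    into the single operational DWC state; failed maps to failed."""
    return {"ok": p_tmr["3up"] + p_tmr["2up"], "failed": p_tmr["failed"]}

def dwc_to_tmr(p_dwc):
    """Map the single operational DWC state onto the most-similar
    (fully operational) TMR state."""
    return {"3up": p_dwc["ok"], "2up": 0.0, "failed": p_dwc["failed"]}

p_tmr = {"3up": 0.85, "2up": 0.10, "failed": 0.05}
p_dwc = tmr_to_dwc(p_tmr)
print(p_dwc)             # operational mass merged: ok ~0.95
print(dwc_to_tmr(p_dwc)) # mapped back onto the fully operational TMR state
```

Note that total probability is conserved in both directions, which is what keeps the availability function continuous across phase transitions.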
Each fault-tolerance mode has an associated Markov model. Each state of the
Markov model corresponds to the number of operational devices in a system (e.g.,
PRRs in an FPGA-based SoC). Transitions between states occur when a device
changes state from operational to failed, or vice-versa. FPGA upset rates are estimated
using the RFT fault-rate model described in Section 3.2. The system repair rates are
based on the system-designer-specified scrub rate and checkpointing rate and the
system and PRR reconfiguration time, which can be obtained experimentally. Transitions
between fault-tolerance modes are mission-dependent. State-mapping functions ensure
a continuous availability function, although performability may contain discontinuities at
phase transitions.
Figure 3-5 shows Markov model representations for each fault-tolerance mode
in a system with six PRRs. Figures 3-5(a) and 3-5(b) represent systems where each
PRR is operating in the HP or ABFT fault-tolerance modes, respectively. Figure 3-5(c)
represents a system using three independent pairs of PRMs operating in the DWC
fault-tolerance mode. Figure 3-5(d) represents a system using two independent sets
of three PRMs operating in the TMR fault-tolerance mode. Solid circles represent the
Markov model’s available operating states and dashed circles represent failed operating
states. Using the definition of performability from Eq. 3–5, each state in a Markov
reward model is assigned a performance throughput value. For this analysis, system
performance is normalized to the work performed by a single PRR. The performance
throughput value of each state is represented in the Markov model by the value in
the rectangles. For example, six independent PRRs running concurrently in the HP
fault-tolerance mode would have a performance throughput value of 6 while a system
using six PRRs with two independent sets of PRRs operating in the TMR fault-tolerance
mode would have a performance throughput value of 2.
The ABFT Markov model is used to demonstrate a generic reliability model
for modules that contain some form of internal fault tolerance and are capable of
self-detecting data errors. In particular, ABFT provides a low-overhead method for
detecting errors in certain linear algebra operations. While hardware-implemented
ABFT may not have 100% fault coverage, it still improves on the HP model, which
cannot detect when corrupt data is returned.
In the Markov model for the single-module ABFT mode, we estimate the performance
of a single module as 80% of the default, unprotected module due to the performance
overhead associated with generating and comparing checksums for fault detection
[Acree et al. 1993]. Additionally, ABFT modules are modeled with a 20% higher
fault rate than unprotected modules due to the increased module size from the ABFT
logic. The repair rate for the ABFT model uses the system’s scrubbing rate, supplying a
worst-case repair rate for cases when ABFT logic does not detect (or even introduces)
an error, relying on external scrubbing and configuration readback to detect errors.
Although these numbers are application- and implementation-specific, the numbers
represent overhead costs that must be considered. The ABFT model can also be used
to model other modules using user-implemented fault tolerance techniques.
For a given phased-mission Markov model, a TLE is used to determine the
expected fault rates using the RFT fault-rate model in Section 3.2. These fault rates,
along with a description of the mission-specific criteria for using each fault-tolerance
mode, are used to split a full space mission into distinct phases for Markov modeling.
RFT phases are periods of time with constant fault rates, repair rates, and fault-tolerance
modes. For each phase, the previous phase’s state probabilities are mapped onto the
new phase’s Markov model as initial probabilities. SHARPE processes the new model
to numerically solve the state probabilities and reliability metrics for the current phase.
SHARPE’s results provide the initial probabilities for the next phase. This process is
repeated iteratively for each phase for the entire mission’s duration and SHARPE’s
results are aggregated to produce overall reliability results.
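The iterative phase-by-phase process can be sketched as a loop that solves each phase's transient and carries the end-of-phase probability forward as the next phase's initial condition. The sketch below uses a two-state closed form as a stand-in for the SHARPE solve, with invented phase rates and durations:

```python
import math

def phase_end_availability(lam, mu, duration, p_up0):
    """Transient up-probability of a two-state CTMC after `duration`,
    starting from up-probability p_up0 (stand-in for a SHARPE solve)."""
    a_ss = mu / (lam + mu)
    return a_ss + (p_up0 - a_ss) * math.exp(-(lam + mu) * duration)

# Hypothetical mission: alternating low/high fault-rate phases.
# Each tuple is (fault rate per hour, repair rate per hour, phase hours).
phases = [(0.01, 60.0, 2.0), (0.5, 60.0, 0.5), (0.01, 60.0, 2.0)]
p_up = 1.0  # system starts fully operational
for lam, mu, hours in phases:
    # End-of-phase probability becomes the next phase's initial probability.
    p_up = phase_end_availability(lam, mu, hours, p_up)
    print(f"end-of-phase availability: {p_up:.6f}")
```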
3.4 Results and Analysis
In this section, we present one validation case study and two orbital case studies
to evaluate the potential reliability and performance benefits of our RFT framework for
space systems. The validation case study uses FPGA fault injection to estimate fault
rates and error coverage, which can then be used by the other case studies. The orbital
case studies represent FPGA-based space systems operating in two common orbits,
with multiple performance and reliability requirements. The first case study represents a
space system operating in low-Earth orbit (LEO), the Earth Observing-1 (EO-1) satellite.
Space systems in a LEO experience relatively low radiation. The second case study
represents a space system operating in a highly elliptical orbit (HEO), which is a much
harsher radiation orbit. Each case study will compare multiple adaptive fault-tolerance
strategies to a traditional static TMR strategy.
3.4.1 Validation Case Study
In order to validate our reliability models, faults must be injected into an executing
RFT system. In this section, we present FPGA fault-injection results gathered using
the Simple, Portable Fault Injector (SPFI) [Cieslewski et al. 2010]. SPFI performs fault
injection using both full and partial reconfiguration, which reduces the time required to
modify configuration memory and improves the speed of fault injection. We validate our
reliability models by correlating SPFI’s results with analytical Markov model results.
For the validation case study, we implemented a simplified RFT-based system on
a Xilinx ML505 FPGA development platform. The RFT-based system had a 3-PRR
RFT controller that allowed HP and TMR voting and had watchdog timer functionality.
Each PRR contained a matrix multiplication (MM) PRM and a MicroBlaze processor
in the static region streamed data from a UART to an MM PRM and streamed results
back to the UART. The SPFI fault-injection tool enabled individual system components
(e.g., PRMs, RFT controller, MicroBlaze) to be tested independently without modifying
the entire system. Table 3-3 shows the RFT system’s fault-injection results. For each
system component, Table 3-3 indicates the number of injections performed and the
number of data errors and system hangs detected. The fault vulnerability for each
component is scaled to the FPGA’s total number of configuration bits to estimate the
components’ design vulnerability factor (DVF). A component’s/device’s DVF represents
the percentage of bits that are vulnerable to faults and can result in observable errors.
Most Xilinx FPGA designs have a DVF that ranges from 1%-10% [Xilinx 2010b]
due to the large amount of configuration memory devoted to routing. The DVF for
each component is calculated by measuring the component’s fault rate, estimating
the number of vulnerable bits by scaling the fault rate to the size of the area occupied
by the component, and dividing the number of vulnerable bits by the total number
of FPGA configuration bits. For a single MM PRM with the RFT controller using HP
mode, only 1.6% of faults that occurred in the PRR were found to cause observable
errors in the output. The approximately 41,497 vulnerable bits in each PRR, which
occupies roughly one-eighth of the FPGA, result in a DVFMM of 0.197%. In this case, the
majority of the PRR is unused, resulting in very few vulnerable bits and a low DVF. Faults
injected into the RFT controller resulted in a DVFRFT of 0.022%. The MicroBlaze was not
protected using TMR or other design techniques and had a DVFMB of 0.342%. Based on
these component fault-injection results, the total FPGA DVF is estimated to be 0.955%
(3·DVFMM + DVFRFT + DVFMB).
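The total-DVF estimate is a simple weighted sum of the component DVFs, which can be checked arithmetically:

```python
# Component design vulnerability factors (%), from the fault-injection results.
dvf_mm, dvf_rft, dvf_mb = 0.197, 0.022, 0.342

# Three MM PRRs plus the RFT controller and the MicroBlaze.
dvf_total = 3 * dvf_mm + dvf_rft + dvf_mb
print(f"{dvf_total:.3f}%")  # 0.955%
```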
Figure 3-6 shows Markov model representations for each fault-tolerance mode in
a system with three PRRs. Figures 3-6(a) and 3-6(b) represent systems where each
PRR is operating in the HP or the TMR fault-tolerance mode, respectively. These
models are similar to the models presented in Section 3.3, without additional states for
performability, and with the addition of state transitions to account for fault coverage.
In a TMR system, coverage refers to the percentage of faults that cause the system to
immediately enter the “failed” state. These non-covered faults occur due to designs not
being fully protected by TMR.
Initial fault-injection testing revealed two possible fault scenarios. In the first
scenario, a fault could cause the system to remain operational
but produce erroneous data. This state was recoverable using periodic scrubbing. In the
second scenario, a fault could cause the system to hang until a full reconfiguration was
performed. We expanded the Markov models to include these behaviors as additional
states. Figure 3-6(a) shows the HP Markov model with two unavailable states that
account for the two fault scenarios. The probabilities of transitioning to “Faulty Data”
or “System Hang” were determined from the component fault-injection testing. The
DVFFPGA was estimated from the sum of each of the components tested.
A similar approach was taken for the TMR Markov model shown in Figure 3-6(b).
The additional “degraded” state is used to model faults that have been masked by
the TMR protection provided by the RFT controller. The probability of transitioning
to the “degraded” state is provided by the DVFMM from previous testing. The DVFsys
term represents faults in the RFT controller (DVFRFT) and MicroBlaze (DVFMB). From
fault-injection testing, we estimate the DVFsys to be approximately 0.364%.
By assigning fault rates, repair rates, and coverage to the Markov model, we can
calculate the system availability. Table 3-4 shows the fault and repair rates used in this
analysis. Using a fault rate of 1 fault per 10 seconds and a scrub rate of 1 scrub per
5 seconds, the RFT system will have a 98.82% availability in HP mode and 99.31%
availability in TMR mode. When the fault-rate to scrub-rate ratio is increased, the
benefits of TMR become more pronounced. Using a fault rate of 1 fault per 2 seconds
and a scrub rate of 1 scrub per 10 seconds, the RFT system will have a 92.25%
availability in HP mode and 95.32% availability in TMR mode. These high availabilities,
even in HP mode, are due to frequent scrubbing and the very low DVF of the FPGA
design.
Finally, we validate the Markov model results using fault injection. In a continuously-running
RFT system, faults are injected at a specified average rate with Poisson-distributed
arrivals, matching the error model used in the Markov model. Scrubbing,
using partial reconfiguration, occurs at user-defined periodic intervals. Full reconfiguration
occurs at the next scrubbing cycle after a PRM error has been detected by the RFT
controller. Full reconfigurations also occur if the external testing program detects that the
FPGA system has entered the “System Hang” state. Availability can be experimentally
determined by the ratio of time the system is operating correctly to the total experiment
run time. For each run, 10,000 faults were randomly injected into a running system.
Table 3-4 shows the results of the availability experiment and the relative error from the
analytical model. At low fault rates, the HP and TMR modes both provide approximately
99% availability. With high fault rates, the system had an availability of 92.59% in the
HP-mode and 93.89% in the TMR-mode.
The Markov model availability methodology provides a simple and effective method
for determining the effects of fault and repair rates on system availability without
exhaustive testing. In general, the HP models predicted slightly lower availability than
was observed during experimental testing, while the TMR models overestimated
the availability of the system. All availability results were within 1.5% of the Markov
models’ predictions. The Markov model accuracy can be improved by providing more
accurate fault-injection results. The availability values obtained through experiments can
be improved by increasing the length of the testing period.
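The experimental availability measurement can be mimicked with a tiny Monte Carlo sketch: inject faults with exponentially distributed inter-arrival times, repair at the next periodic scrub, and report the up-time fraction. This is a simplified toy with invented parameters, not the SPFI campaign:

```python
import random

def simulate_availability(fault_rate, scrub_period, total_time, seed=1):
    """Fraction of time the system is fault-free, given Poisson faults
    and repair at the next periodic scrub boundary."""
    rng = random.Random(seed)
    t, down_time = 0.0, 0.0
    while t < total_time:
        t += rng.expovariate(fault_rate)  # time to next fault while up
        if t >= total_time:
            break
        # System is down from the fault until the next scrub cycle.
        next_scrub = (int(t / scrub_period) + 1) * scrub_period
        down_time += min(next_scrub, total_time) - t
        t = next_scrub  # repaired at the scrub
    return 1.0 - down_time / total_time

print(simulate_availability(fault_rate=0.1, scrub_period=5.0, total_time=1e5))
```

With a mean up-time of 10 time units and an average of half a scrub period of downtime per fault, the result should land near 0.8.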
Fault-injection testing did highlight implementation issues that must be handled in
any high-reliability design. The use of TMR had a lower than expected benefit due to
the unprotected MicroBlaze processor. Since the DVFMB was larger than the DVFMM,
the availability of the system was dominated by the MicroBlaze’s availability. For high
reliability, the MicroBlaze must be protected with TMR or an alternative fault-tolerant
processor (e.g., FT-LEON3).
3.4.2 Orbital Case Studies
For each orbital case study, the system under test is an FPGA System-on-Chip
implementation of the RFT hardware architecture described in Section 3.1. In order to
calculate device vulnerability, the parameters for the Xilinx Virtex-4 FX60 will be used as
inputs to the RFT fault-rate model described in Section 3.2. The radiation susceptibility
parameters of the Virtex-4 device family are obtained from the Xilinx Radiation Test
Consortium’s (XRTC) published results [Swift et al. 2008]. The generated fault rate is
linearly scaled from a full device to the size of the PRR to produce PRR fault rates. The
RFT controller is connected to 6 PRRs, allowing for several combinations of the TMR,
DWC, and ABFT fault-tolerance modes discussed in Section 3.3. The performability
model from Section 3.3 is used to evaluate the effectiveness of each fault-tolerance
strategy.
3.4.2.1 Low-Earth orbit case study
Space systems in LEO are commonly used for earth-observing science applications,
such as HSI or SAR. Both HSI and SAR produce large datasets that can be significantly
reduced through on-board processing. The underlying algorithms decompose
into basic kernels that can be parallelized and implemented on an FPGA system for
high performance and power efficiency. For example, HSI can be decomposed into a
sequence of matrix multiplications and matrix inversions, and SAR can be decomposed
into vector multiplication and FFTs. Since these mathematical operations are linear,
ABFT can be used to protect the computation results from SEUs. For applications that
cannot be protected with ABFT, TMR or DWC must be used to provide fault tolerance.
The TLE used to generate the fault rates for this LEO case study is from the EO-1
satellite. The EO-1 orbit is circular at an altitude of 700 km and a 98.12° inclination with
a mean travel time of 98 minutes. Figure 3-7(a) shows the orbital track of EO-1 and the
shaded circle represents the EO-1’s field of view. Figure 3-7(b) shows the estimated
number of upsets per hour that occur in the EO-1 orbit over several orbital periods. The
average fault rate of the Virtex-4 FX60 in the EO-1’s orbit is 16.5 faults per device-day
(combined configuration memory and BRAM vulnerability). Each local maximum occurs
when the satellite is closest to the Earth’s magnetic poles. Fault rates in EO-1’s orbit
are low because the orbit is lower than the Van Allen Belts and is fully within the Earth’s
magnetosphere, which deflects a large amount of radiation.
We examine the availability and performability of TMR, DWC, and ABFT fault-tolerance
modes in LEO, as well as three adaptive fault-tolerance strategies to maximize
application performability: 10% two-mode, 50% two-mode, and three-mode adaptive
strategies. For the adaptive fault-tolerance strategies, the fault-tolerance mode switching
is determined by comparing the current upset rate with a fault-rate threshold. The 10%
two-mode adaptive strategy uses the ABFT fault-tolerance mode when the upset rate is in
the lowest 10% of the expected fault rates and the TMR fault-tolerance mode otherwise.
The 50% two-mode adaptive strategy uses a similar strategy as the 10% two-mode, but
with a higher fault-rate threshold. The three-mode adaptive strategy uses ABFT when
the upset rate is in the lowest 10% of the expected fault rates, TMR when the upset rate
is in the highest 50% of the expected fault rates, and DWC otherwise. In all modes, the
system performs scrubbing to ensure that configuration memory errors are removed
from the system. For this LEO case study, the system uses a 60-second scrub cycle,
which is also used as the system repair rate for the Markov models.
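The threshold test that drives mode switching can be sketched as a comparison against percentile thresholds of the expected fault-rate profile. The function name, percentile handling, and rate values below are illustrative assumptions following the three-mode strategy described above:

```python
def select_mode(current_rate, expected_rates, low_pct=0.10, high_pct=0.50):
    """Three-mode adaptive strategy: ABFT in the lowest `low_pct` of
    expected rates, TMR in the highest `high_pct`, DWC otherwise."""
    ranked = sorted(expected_rates)
    low_thresh  = ranked[int(low_pct * (len(ranked) - 1))]
    high_thresh = ranked[int((1.0 - high_pct) * (len(ranked) - 1))]
    if current_rate <= low_thresh:
        return "ABFT"
    if current_rate >= high_thresh:
        return "TMR"
    return "DWC"

# Hypothetical expected upset rates over one orbit (upsets/hour).
rates = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0]
print(select_mode(0.5, rates))    # ABFT
print(select_mode(512.0, rates))  # TMR
```

Setting `high_pct=1.0 - low_pct` collapses this to a two-mode strategy with a single threshold.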
Figure 3-8(a) shows the availability of the LEO system while statically using the four
fault-tolerance models described in Figure 3-5. While the static HP strategy’s availability
quickly declines (due to the lack of any fault-tolerance mechanisms), the static TMR,
DWC, and ABFT strategies all maintain availability above 95%. The dynamically
changing availability is directly related to the current fault rate; however, since the repair
rate of the system (through scrubbing) is much larger than the expected fault rates, most
configuration memory faults are rapidly mitigated. The system availability while using
the three adaptive fault-tolerance strategies is shown in Figure 3-8(b). The average
availability for each adaptive strategy improves availability over the static ABFT strategy,
and the adaptive strategies can increase the minimum availability to above 99.5%.
Table 3-5 displays the availability results for each fault-tolerance strategy in terms of
unavailability, showing the probability of a system failure. The 10% two-mode adaptive
strategy reduces average unavailability by 88% as compared to the static ABFT strategy.
For extremely high availability, static TMR is required, as the maximum unavailability for
the adaptive strategies is more than 100 times higher than TMR.
The system performability for the static and adaptive fault-tolerance strategies
for a LEO is also shown in Table 3-5. Due to the low overall upset rates and good
availability of the static ABFT strategy, the static ABFT strategy achieves the highest
performability. The 50% two-mode adaptive strategy achieves an average performability
throughput of 4.01, a 100% improvement over static TMR, while improving unavailability
over the static ABFT strategy by 73%. The 10% two-mode strategy has lower average
performability, while maintaining better system availability. The three-mode strategy has
better performability than the static DWC mode while having better availability than the
static DWC or ABFT modes, but is outperformed by the 50% two-mode strategy.
We point out that the fault-rate threshold for the two-mode strategies can be used
to adjust the availability and performability parameters for a space system. Figure 3-9
illustrates the effect of changing the threshold of the two-mode adaptive strategy for
the LEO case study. As the threshold is raised, more time is spent using the ABFT
mode, which lowers system availability while increasing performability. For the LEO
case study, most of the performance gains from using ABFT can be obtained by using
a low threshold value because much of the orbit will be under this threshold. With a
10% threshold, an RFT system would spend an approximately equal amount of time
in each of the ABFT and TMR modes. Raising the adaptive threshold higher than 50%
results in limited performance gains at the expense of decreased availability. Further
analysis of these thresholds in mode-switching strategies can enable optimization
toward Pareto-optimal availability/performability tradeoffs.
3.4.2.2 Highly-elliptical orbit case study
The HEO is a common type of orbit used mostly by communication satellites.
From the ground, satellites traveling in an HEO can appear stationary in the sky for
long periods of time. HEOs also offer visibility of the Earth’s polar regions, which most
geosynchronous satellites do not cover. The HEO used for this case study is a Molniya orbit,
named for the communication satellites that first used this orbit. A TLE for a Molniya-1
satellite was used to generate fault rates for the HEO case study. This orbit has a
perigee of 1,100 km, an apogee of 39,000 km, and a 63.4° inclination with a mean travel
time of 12 hours. The average amount of radiation throughout the orbit is much higher
than the LEO case study, and much larger amounts of radiation are encountered when
the satellite passes through the Van Allen belts. The average fault rate in the Molniya-1
orbit is 62 faults per device-day. For most of the orbit, the fault rate averages 7 faults per
device-day, but the large fault-rate peaks that occur near perigee increase the overall
fault rate. Figure 3-10(a) illustrates the Molniya-1 orbit and Figure 3-10(b) shows the
estimated number of upsets per hour that an FPGA might experience in an HEO.
Space systems outside of the Earth’s magnetosphere, either in geosynchronous
orbit or in interplanetary space, experience constant fault rates that vary based on the
occurrence of solar flares and other space weather conditions. Solar flares can send
a wave of high-energy particles into space, causing a brief period of extremely high
fault rates. These fault-rate spikes, while different in origin, look similar to the fault-rate
peaks that occur in this case study. The analysis used in this case study can also be
used to estimate RFT system performance in the presence of different space weather
conditions.
We evaluate the availability and performability of TMR, DWC, and ABFT fault-tolerance
modes in an HEO. In order to maximize performability of applications in an HEO, we
also examine two adaptive fault-tolerance strategies. The ABFT/TMR adaptive strategy
uses the ABFT fault-tolerance mode when upset rates are in the lowest 5% of expected
fault rates and the TMR fault-tolerance mode otherwise. The DWC/TMR adaptive
strategy uses the same fault-rate thresholds as ABFT/TMR, but switches between the
DWC and TMR modes. In each mode, the system performs scrubbing to ensure that
configuration memory errors are removed from the system. For the HEO case study, the
system uses a 10-second scrub cycle to account for the increased average fault rates
experienced in an HEO; this scrub rate serves as the system repair rate in the Markov models.
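As a concrete illustration of the threshold rule used by the adaptive strategies, the per-segment mode-switching decision can be sketched in a few lines. This is a minimal sketch assuming a precomputed per-segment fault-rate profile; the profile values and function name below are illustrative, not output of the Molniya-1 model.

```python
import numpy as np

def select_modes(fault_rates, low_mode="ABFT", high_mode="TMR", pct=5.0):
    """Pick a fault-tolerance mode for each orbit segment: use the
    low-overhead mode while the predicted fault rate sits in the lowest
    pct percent of expected rates, and fall back to TMR otherwise."""
    threshold = np.percentile(fault_rates, pct)
    return [low_mode if r <= threshold else high_mode for r in fault_rates]

# Illustrative orbit profile (faults per device-day): a quiet stretch,
# a long baseline, and a sharp fault-rate spike near perigee.
rates = np.concatenate([np.full(95, 7.0), np.full(5, 1200.0)])
rates[:5] = 2.0                      # a brief very-quiet stretch
modes = select_modes(rates, pct=5.0)
```

The same function sketches the DWC/TMR strategy by passing `low_mode="DWC"`; a three-mode strategy would simply apply two thresholds instead of one.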
Figures 3-11(a) and 3-11(b) show the availability of the HEO case study system
while using three static fault-tolerance strategies and two adaptive strategies. The
static TMR strategy maintains an average availability of 99.93%, but the availability
drops as low as 95.1% while passing through the peak fault-rate periods. While using
the static DWC and ABFT strategies, the availability drops significantly during the
peak fault-rate periods, making these strategies unsuitable for systems that must
maintain continuous operation (due to the extremely high upset rate, many systems
shut down operation during peak fault-rate periods). However, outside of the peak
fault-rate periods, the minimum availability for the static DWC (99.8%) and ABFT
(99.3%) strategies are high enough to be tolerable by many applications. The two
adaptive strategies use DWC and ABFT fault-tolerance modes during low fault-rate
periods and TMR fault-tolerance mode during the peak fault-rate periods. Using the
adaptive strategies, the availability never falls below the TMR strategy’s minimum
availability of 95.1% during peak fault-rate periods, while maintaining higher availability
at other times. (Note the change of scale in Figure 3-11(b).) Table 3-6 shows the
unavailability and performability of the fault-tolerance strategies described in this
section. The DWC/TMR adaptive strategy reduces average unavailability by 80%
as compared to the static DWC strategy. The ABFT/TMR adaptive strategy reduces
average unavailability by 85% as compared to the static ABFT strategy.
The system performability for the static and adaptive fault-tolerance strategies in an
HEO is shown in Table 3-6. While the TMR strategy exhibits the lowest performability,
the TMR strategy also has the lowest variation in performability and highest availability,
allowing for predictable performance levels. The performability of the static ABFT and
DWC strategies is significantly reduced in the peak fault-rate periods, but quickly returns
to acceptable levels after passing through the peak fault-rate periods. The maximum
unavailability for each of the adaptive strategies occurs during the peak fault-rate
periods. Since the RFT system is using the TMR fault-tolerance mode during these
segments, the adaptive strategies have the same maximum unavailability as the static
TMR strategy. The DWC/TMR adaptive strategy increases performability over the TMR
strategy by 46%. The ABFT/TMR adaptive strategy increases average performability
over the TMR strategy by 128%, while sacrificing only 4% of performability relative to
the static ABFT strategy and significantly improving unavailability over it.
3.5 Conclusions
In this work, we have presented a novel and comprehensive framework for
reconfigurable fault tolerance capable of creating and modeling FPGA-based reconfigurable
architectures that enable a system to self-adapt its fault-tolerance strategy in accordance
with dynamically varying fault rates. The PR-based RFT hardware architecture enables
several redundancy-based fault-tolerance modes (e.g., DWC, TMR) and additional
fault-tolerance features (e.g., watchdog timers, ABFT, checkpointing and rollback), as
well as a mechanism for dynamically switching between modes. The combination of
these fault-tolerance features enables the use of COTS SRAM-based FPGAs in harsh
environments. Future work will automate and optimize the creation of RFT controllers
for specific system configurations in order to increase developer productivity, facilitate
adoption, and reduce system overhead.
In addition to the hardware architecture, we have demonstrated a fault-rate model
for RFT to accurately estimate upset rates and capture time-varying radiation effects
for arbitrary satellite orbits using a collection of existing, publicly available tools and
models. Our model provides important characterization of fault rates for space systems
over the course of a mission that cannot be captured by average fault rates. The HEO
case study demonstrated the large range of fault rates experienced in certain elliptical
orbits, and the potential for performance improvements when using RFT.
Using the results from the fault-rate model, our Markov-model-based RFT
performability model demonstrated the benefits of an adaptive fault-tolerance
system architecture. The performability model was validated using FPGA fault
injection on an experimental RFT system, and multiple static and adaptive
fault-tolerance strategies for space systems in dynamically changing environments
were evaluated. The RFT performability model demonstrated that while TMR provides
a lower bound on availability, less reliable, low-overhead methods can be used during
low-fault-rate periods to improve performance. We evaluated two case study orbits
and observed that adaptive fault-tolerance strategies are able to improve unavailability
by 85% over static ABFT and performability by 128% over traditional, static TMR fault
tolerance. This additional performance can lead to more capable and power-efficient
FPGA-based onboard processing systems in the future.
Table 3-1. RFT fault-tolerance modes.

Fault-tolerance mode                          Fault-tolerance type  PRRs required
Triple Modular Redundancy (TMR)               Redundancy            3
Duplication with Compare (DWC)                Redundancy            2
High-Performance (HP) - no fault protection   Single-module         1
Algorithm-Based Fault Tolerance (ABFT)        Single-module         1
Internal TMR                                  Single-module         1
Table 3-2. RFT controller resource usage.

Module name      LUTs used  FFs used  Slices used  FPGA utilization
PLB Controller   479        375       345          1.4%
Address Decoder  238        0         136          0.5%
RFT Registers    82         64        48           0.2%
Watchdog Timers  498        390       366          1.4%
Voting Logic     438        0         234          0.9%
Output Mux       227        0         132          0.5%
Total            1962       829       1261         5.0%
Table 3-3. Fault-injection results for RFT components.

System component      Faults injected  Data errors  System hangs  Vulnerable bits (est.)  DVF (%)
Matrix Multiply (MM)  100,000          1,501        71            41,497                  0.197%
RFT Controller (RFT)  100,000          63           157           4,576                   0.022%
MicroBlaze (MB)       100,000          150          1,219         86,584                  0.342%
Full FPGA             --               --           --            --                      0.955%
Table 3-4. RFT Markov model validation.

Fault period  Repair period  FT mode  Markov model  Experimental  Model
(time/fault)  (time/scrub)            availability  availability  error
10s           5s             HP       98.82%        98.99%        0.2%
10s           5s             TMR      99.31%        99.11%        0.2%
2s            10s            HP       92.25%        92.59%        0.4%
2s            10s            TMR      95.32%        93.89%        1.5%
Table 3-5. Unavailability and performability for LEO case study.

Strategy      Average         Maximum         Average         Minimum
              unavailability  unavailability  performability  performability
TMR           9.8 × 10^-6     3.8 × 10^-5     2.00            2.00
DWC           3.9 × 10^-3     1.1 × 10^-2     3.00            2.98
ABFT          1.6 × 10^-2     4.7 × 10^-2     4.79            4.76
2-Mode (10%)  1.9 × 10^-3     4.6 × 10^-3     3.51            2.00
2-Mode (50%)  4.4 × 10^-3     1.8 × 10^-2     4.01            2.00
3-Mode        2.7 × 10^-3     5.4 × 10^-3     3.69            2.00
Table 3-6. Unavailability and performability for HEO case study.

Strategy  Average         Maximum         Average         Minimum
          unavailability  unavailability  performability  performability
TMR       7.2 × 10^-4     4.9 × 10^-2     2.00            1.95
DWC       1.4 × 10^-2     4.1 × 10^-1     2.98            2.46
ABFT      4.9 × 10^-2     9.2 × 10^-1     4.73            2.62
DWC/TMR   2.6 × 10^-3     4.9 × 10^-2     2.92            1.95
ABFT/TMR  7.2 × 10^-3     4.9 × 10^-2     4.56            1.95
Figure 3-1. System-on-chip architecture with RFT controller.
Figure 3-2. RFT controller PLB-to-PRR interface.
Figure 3-3. RFT fault-rate model.
Figure 3-4. Phased-mission Markov model transitioning between TMR and DWC modes.
Figure 3-5. Markov models of RFT modes. A) 6x HP. B) 6x ABFT. C) 3x DWC. D) 2x TMR.
Figure 3-6. RFT validation Markov models. A) HP mode. B) TMR mode.
Figure 3-7. LEO fault rates using the RFT fault-rate model. A) Visualization of the EO-1 TLE data using JSatTrak [Gano 2010]. B) Expected fault rates over several orbits.
Figure 3-8. LEO system availability. A) System availability of static strategies. B) System availability of adaptive strategies.
Figure 3-9. Effects of adaptive thresholds on availability and performability.
Figure 3-10. HEO fault rates using the RFT fault-rate model. A) Visualization of Molniya-1 TLE data using JSatTrak [Gano 2010]. B) Expected fault rates over time.
Figure 3-11. HEO system availability. A) System availability of static strategies (zoomed image on right). B) System availability of adaptive strategies (zoomed image on right).
CHAPTER 4
ALGORITHM-BASED FAULT TOLERANCE FOR FPGA SYSTEMS
Algorithm-based fault tolerance (ABFT) is a fault-tolerance method that can be used
with many linear-algebra operations, such as matrix multiplication or LU decomposition
[Huang and Abraham 1984]. Additionally, the ABFT approach can be extended to
other linear operators such as the Fourier transform. Fortunately, many common space
applications are composed of linear-algebra operations; e.g., HSI features matrix
multiplication [Jacobs et al. 2008], while SAR features Fast Fourier transforms. Other
algorithms can often be converted to fit an algebraic framework. Traditionally, ABFT
has been implemented in software, with multiprocessor arrays, and in hardware, with
systolic arrays, to protect application datapaths. Our ABFT approach may be used in
FPGA applications to provide both datapath and configuration memory protection with
low overhead. By demonstrating the effectiveness of ABFT in FPGA systems, we can
enable the use of FPGAs in future space missions where resource constraints and
reliability are the major challenges. In the larger perspective of an RFT system, ABFT
enables additional design-space options for optimizing total system performability.
In this chapter, we present an analysis of multiple fault-tolerance methods on
Xilinx FPGAs including TMR and ABFT. We examine the resource usage of each
method and measure the vulnerability of the design using a fault-injection tool. We then
examine possible design tradeoffs and modifications that can enable higher reliability.
The remainder of this chapter is organized as follows. Section 4.1 describes
multiple hardware architectures for ABFT matrix multiplication, along with analysis of
resource usage and fault vulnerability. Section 4.2 describes an ABFT-based algorithm
for FFTs, FFT case study architectures, and fault vulnerability analysis. Section 4.3
presents conclusions, provides suggestions for developing more reliable designs, and
outlines directions for future research.
4.1 Matrix Multiplication
Matrix multiplication (MM) is used as a key kernel in a large number of signal-processing
applications, and has been shown to benefit from the performance of FPGAs [Dave
et al. 2007; Wu et al. 2010; Zhuo and Prasanna 2004]. The traditional formulation of
ABFT uses MM as a motivating case, demonstrating the method’s ability to detect and
correct errors while limiting computational complexity and overhead. MM also has the
benefit of simple parallel decomposition strategies that trade additional FPGA resource
usage for decreased total execution time and combine well with ABFT. In Section 4.1.1,
the mathematical description of ABFT for MM is presented. In Section 4.1.2, several
possible MM-ABFT architectures are described. Section 4.1.3 analyzes the resource
overhead tradeoffs for each of the proposed architectures. Section 4.1.4 presents an
analysis of fault injection results obtained from each of the proposed architectures.
Finally, Section 4.1.5 summarizes the results from the matrix multiplication analysis.
4.1.1 Checksum-Based ABFT for Matrix Multiplication
The following definitions provide the mathematical background for ABFT. To obtain
the weighted checksums, the initial data will have to be multiplied by an encoder matrix.
Without loss of generality and to simplify the notation, we assume that the data matrix
is square with dimensions N × N.
Definition 1: An encoder matrix is a matrix whose product with the data matrix will
yield the desired checksums. For the remainder of this chapter we will refer to the encoder
matrix as E_N. Depending on the size of the encoding matrix, E_N may encode more than
one checksum row or column. The E_N used in this chapter has dimensions of N × 1.
E_N = \begin{bmatrix} 1 & 1 & \cdots & 1 & 1 \end{bmatrix}^T    (4–1)
Definition 2: A column checksum matrix A_C is an initial data matrix A that has
been augmented with extra rows of checksums. Such a matrix will have dimensions of
(N + 1) × N and has the form:
A_C = \begin{bmatrix} A \\ E_N^T \cdot A \end{bmatrix}    (4–2)
Similarly, a row checksum matrix A_R can be obtained by augmenting a data matrix
A with additional columns. Such a matrix will have dimensions of N × (N + 1) and has
the following form:
A_R = \begin{bmatrix} A & A \cdot E_N \end{bmatrix}    (4–3)
Definition 3: The product of a column checksum matrix A_C and a row checksum
matrix B_R will produce a full checksum matrix C_F. Such a matrix will have dimensions of
(N + 1) × (N + 1) and the form:
A_C \cdot B_R = \begin{bmatrix} A \\ E_N^T \cdot A \end{bmatrix} \cdot \begin{bmatrix} B & B \cdot E_N \end{bmatrix}
            = \begin{bmatrix} A \cdot B & A \cdot B \cdot E_N \\ E_N^T \cdot A \cdot B & E_N^T \cdot A \cdot B \cdot E_N \end{bmatrix}
            = \begin{bmatrix} C & C \cdot E_N \\ E_N^T \cdot C & E_N^T \cdot C \cdot E_N \end{bmatrix} = C_F    (4–4)
The associative property of the matrix product allows for verification of the
multiplication procedure by simply recalculating the checksums and comparing them
with those obtained through the matrix multiplication. In general, operations that preserve
weighted checksums are called checksum-preserving and the matrix product is an
example of such a function.
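The checksum relations above can be verified numerically. The sketch below is illustrative only, using NumPy rather than hardware: it builds the full checksum matrix for N = 4 with the all-ones encoder and shows that a single corrupted element is flagged by exactly one row and one column checksum.

```python
import numpy as np

# Build column/row checksum matrices per Definitions 1-3 (N = 4, E_N all ones).
N = 4
rng = np.random.default_rng(0)
A = rng.integers(0, 10, size=(N, N))
B = rng.integers(0, 10, size=(N, N))
E = np.ones((N, 1), dtype=int)

A_C = np.vstack([A, E.T @ A])        # column checksum matrix, (N+1) x N
B_R = np.hstack([B, B @ E])          # row checksum matrix, N x (N+1)
C_F = A_C @ B_R                      # full checksum matrix, (N+1) x (N+1)

# Verification: recompute checksums from the data block C and compare them
# with the checksums produced by the multiplication itself.
C = C_F[:N, :N]
col_ok = np.array_equal(C_F[N, :N], C.sum(axis=0))
row_ok = np.array_equal(C_F[:N, N], C.sum(axis=1))

# A single corrupted element breaks exactly one row and one column checksum,
# locating the fault at their intersection.
C_F[1, 2] += 99
bad_rows = np.flatnonzero(C_F[:N, N] != C_F[:N, :N].sum(axis=1))
bad_cols = np.flatnonzero(C_F[N, :N] != C_F[:N, :N].sum(axis=0))
```

Before the injected corruption both checks pass; afterward the mismatching row and column indices pinpoint the faulty element.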
4.1.2 Matrix-Multiplication Architectures
Several matrix multiplication modules were created to examine the reliability of
hardware-based ABFT. This section discusses a serial architecture and two possible
parallelization strategies for MM and the design decisions that were made for the ABFT
architectures. For this analysis, 32-bit integer precision was used in each design.
4.1.2.1 Baseline, serial architecture
The minimal-hardware, serial architecture for the matrix multiplication function
consists of a single multiply-accumulator (MAC), an address generation module, and
three data-storage modules (RAM) as shown in Figure 4-1(a). Two memories are used
to store the input matrices A and B, and one memory is used for the resulting output
matrix C . The address generator iterates through the correct matrix indices (i, j, k in
Figure 4-2), sending data stored in the two input RAMs to the MAC, and generates the
appropriate address for output values. This MM module can be used for any size matrix,
the limiting factor being data storage. For this analysis, the input and output matrices
are each stored in a 1024-deep 32-bit Xilinx BlockRAM (BRAM) component. This
architecture requires O(N3) cycles to fully calculate an output matrix. MM computational
throughput can be improved by exploiting parallelism with additional MAC units.
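The serial datapath just described, a single MAC fed by an i, j, k address generator, behaves like the following software sketch (an analogue for illustration, not the HDL implementation):

```python
# Software sketch of the serial MM datapath: one multiply-accumulate unit,
# an i/j/k address generator, and three memories (A, B, C). It performs
# O(N^3) MAC operations, matching the baseline architecture described above.

def serial_matmul(A, B):
    N = len(A)
    C = [[0] * N for _ in range(N)]
    for i in range(N):          # row index into the A RAM
        for j in range(N):      # column index into the B RAM
            acc = 0             # the single MAC's accumulator
            for k in range(N):  # dot-product index
                acc += A[i][k] * B[k][j]
            C[i][j] = acc       # one write to the output RAM
    return C
```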
4.1.2.2 Fine-grained parallel architecture
The fine-grained parallel MM architecture unrolls the inner loop of the MM algorithm
as shown in Figure 4-2. Each element in the result matrix C is the dot product of a
row from matrix A and a column from matrix B. This parallel architecture uses multiple
processing elements to compute the dot-product in parallel. With the fine-grained
parallel architecture shown in Figure 4-1(b), the outputs of several multipliers are
connected to an adder-tree structure, allowing the parallel computation of partial
dot-products, which are then accumulated into the final, full dot product. By fully
parallelizing the dot product (using N multipliers), the execution time of the full algorithm
can be reduced from O(N3) to O(N2). This method requires accessing multiple memory
elements in parallel, and may be limited by the total number of memory elements or
DSP components on the FPGA.
4.1.2.3 Coarse-grained parallel architecture
The coarse-grained parallel approach essentially unrolls the outer loop of the MM
algorithm in Figure 4-2. Each row in the result matrix is calculated from a row in matrix A
and every element of matrix B. This data parallelism enables each processing element
to calculate a unique portion of matrix C by using a portion of matrix A and all of matrix
B, as shown in Figure 4-1(c). When each processing element computes a single row,
the execution time of the full algorithm can be reduced from O(N3) to O(N2). This
method requires reading multiple values from matrix A and writing multiple values to
matrix C in parallel, and requires the same total number of memory and DSP elements
as in the fine-grained parallel architecture.
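The two unrolling strategies can be contrasted with a software analogue. This is a sketch only: in the actual architectures the parallel multiplies map to DSP48 slices and the reduction to an adder tree in fabric, and the function names here are illustrative.

```python
# Fine-grained: unroll the inner (dot-product) loop into N parallel multiplies
# followed by a log-depth adder-tree reduction. Coarse-grained: unroll the
# outer loop so each processing element (PE) computes one output row. Both
# cut the sequential step count from O(N^3) to O(N^2) with N PEs.

def dot_adder_tree(row, col):
    """Fine-grained PE array: N parallel multiplies, then a tree of adds."""
    partial = [a * b for a, b in zip(row, col)]   # N multipliers in parallel
    while len(partial) > 1:                       # adder-tree reduction
        if len(partial) % 2:
            partial.append(0)                     # pad odd levels
        partial = [partial[i] + partial[i + 1]
                   for i in range(0, len(partial), 2)]
    return partial[0]

def coarse_grained_matmul(A, B):
    """Coarse-grained: PE p computes row p of C from row p of A and all of B."""
    N = len(A)
    cols = [[B[k][j] for k in range(N)] for j in range(N)]
    return [[dot_adder_tree(A[p], cols[j]) for j in range(N)]
            for p in range(N)]
```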
4.1.2.4 Architectural modifications for ABFT
The addition of ABFT logic requires the creation of two functions, ABFT checksum
generation and ABFT checksum verification. Each of these functions requires a
simple accumulator (or a MAC for weighted checksums). In order to accommodate
the calculated checksum data, the BRAM storage must be large enough to hold the
ABFT-augmented matrices. Checksum generation sums each column of matrix A and
writes the checksum into the matrix A BRAM, creating the column checksum matrix,
AC . Next, checksum generation performs the same process for the rows of matrix B
to create a row checksum matrix, BR . The checksum verification function sums the
columns and rows of matrix C and compares the sums to the checksum values in the
matrix C BRAM. If a mismatch is detected, an error signal is asserted until the module
is reset. For some applications, such as image processing, data errors that occur
in low-significance bits may be ignored. ABFT accomplishes this by comparing the
difference of two generated checksums to a user-defined threshold value. However, for
maximum coverage, this threshold should be set to zero for integer operations.
Figure 4-3 shows an example of an ABFT-enabled MM architecture where an
additional MAC is used for the checksum generation and verification functions. The MAC
hardware that exists for the main MM operation may be reused for creating checksums
(ABFT-Shared), or an additional accumulator can be used for this purpose (ABFT-Extra).
For the baseline architecture, an additional ABFT accumulator would incur almost 100%
overhead. However, when the ABFT encoding matrix in Equation 4–1 is used, multipliers
are not required and the ABFT hardware can be simplified. For parallel designs with
multiple processing elements, the overhead of a single MAC for ABFT calculations is
amortized.
To implement ABFT error correction, the column and row indices of faulty rows can
be temporarily stored in registers. Faulty elements exist at the intersection of a faulty row
and faulty column. To correct the faulty element, the checksum-generation module must
recalculate the column checksum, ignoring the value at the faulty row index. This sum
would be subtracted from the matrix C checksum value to obtain the correct value, and
stored in the matrix C RAM. Using the encoding matrix defined in Equation 4–1, ABFT will
be able to detect single- and multiple-element errors. However, for specific distributions
of multiple-element errors, the correction algorithm may fail. Weighted-checksum
encoding matrices can be used to provide additional fault localization capabilities for
multiple-element errors. The ABFT designs discussed in later sections perform error
detection only.
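Although the designs evaluated below perform detection only, the correction step just outlined is straightforward. The sketch below (NumPy, with illustrative names) locates a single faulty element at the intersection of the mismatching row and column checksums and rebuilds it from the stored column checksum:

```python
import numpy as np

# Single-element ABFT correction: find the faulty element at the intersection
# of the mismatching row and column checksums, then restore it from the stored
# column checksum minus the sum of the healthy elements in that column.

def correct_single_error(C_F):
    """C_F is an (N+1) x (N+1) full checksum matrix; fixes one bad element."""
    N = C_F.shape[0] - 1
    bad_rows = np.flatnonzero(C_F[:N, N] != C_F[:N, :N].sum(axis=1))
    bad_cols = np.flatnonzero(C_F[N, :N] != C_F[:N, :N].sum(axis=0))
    if bad_rows.size == 1 and bad_cols.size == 1:
        r, c = bad_rows[0], bad_cols[0]
        others = C_F[:N, c].sum() - C_F[r, c]     # column sum without the fault
        C_F[r, c] = C_F[N, c] - others            # rebuild from the checksum
    return C_F
```

As noted above, this scheme assumes a single faulty element; specific multi-element error patterns require weighted checksums to localize.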
4.1.3 Resource-Overhead Experiments
In this section we analyze the overhead of the architectures presented in Section
4.1.2 and compare them to traditional fault-tolerance mitigation strategies. For this
analysis, we use a 32-bit integer MM module and vary the number of processing
elements and the type of parallelization. These MM modules can perform computation
on matrices up to 32 × 32 elements in size, limited only by the amount of BRAM
dedicated to storage. We compare fine-grained and coarse-grained parallel MM designs
with several fault-tolerant designs (TMR, ABFT, and hybrid TMR/ABFT). The baseline
and ABFT designs were synthesized using the Xilinx synthesis tool while the TMR
designs used Synplify Premier. Each design was implemented on a Xilinx ML605
development board with a Virtex6-LX240T FPGA. The results of this comparison, with 1
processing element per design, are shown in Table 4-1, and will be discussed below.
4.1.3.1 Resource overhead of serial architectures
The baseline (non-fault-tolerant) MM architecture with a single processing
element uses 3 BRAMs to store input and output data. Matrix A and B are each
stored in separate BRAM components in order to allow simultaneous accesses,
while matrix C only requires a single BRAM. DSP48 components are reserved for
the multiply-accumulate unit. The multiply-accumulator can be implemented in 3 DSP48
units (3 per 32-bit MAC). All remaining addressing and indexing logic (counters, state
machines, etc.) is implemented using slices (LUTs and FFs).
The ABFT-Shared design does not require any additional BRAM or DSP48 units
over the baseline design. The additional logic needed to handle addressing matrices
during checksum generation and verification increases the number of required slices by
14%. The ABFT-Extra design uses 100% more DSP48s and 32% more slices for the
additional MAC; no additional BRAMs are needed.
For comparison, Table 4-1 also shows the resource usage of a TMR design. The
serial TMR design has 146% overhead as compared to the baseline design. This
TMR design was created using the high-reliability features of the Synplify Premier
synthesis tool, which inserts low-level TMR voting into the ABFT-MM core’s netlist.
Alternative methods for creating TMR designs are available from Xilinx [Xilinx 2004],
BYU-LANL [Pratt et al. 2006], Mentor Graphics [Mentor Graphics 2013], and others. As
expected, 200% more DSPs and BRAMs are required for TMR. However, slice usage in
the TMR design increased less than expected. This difference may be caused by
optimization during the Xilinx mapping and place-and-route processes. Since a Xilinx
logic slice has many internal components (multiple lookup tables), additional logic may
be packed into partially-used slices, reducing the need for additional slices.
We also examine a hybrid design based on the ABFT-Extra design which uses
TMR on the address generator and all state machines within the design, but only uses
ABFT along the data path. This hybrid approach results in a fine-grained design that has
approximately 156% overhead for slices but only 100% overhead on the limited DSP48
resources.
4.1.3.2 Resource overhead of parallel architectures
As additional processing elements are added to the existing serial MM designs,
the resources required for the data path should increase much more than the control
path. This asymmetric scaling results in resource overhead being dependent on the
amount of parallelism. Figure 4-4 and Figure 4-5 compare the slice overhead of each
of the designs discussed in Section 4.1.3.1 while varying the number of processing
elements from 1 to 32, which is a fully parallel implementation for the 32× 32 MM. Figure
4-6 compares the DSP48 resource overhead for each of the designs. For each parallel
design, both coarse-grained and fine-grained parallelizations are examined. Overhead
of each fault-tolerant design is calculated based on the resource usage of the equivalent
non-fault-tolerant design.
In the serial designs, the overhead of both fine-grained and coarse-grained
ABFT-Extra designs is higher than that of the ABFT-Shared designs. However, the ABFT-Extra
slice overhead decreases for highly parallel designs. Alternatively, the ABFT-Shared
designs show the opposite trend. For modules with a large number of processing
elements, the ABFT-Shared control logic can require significantly more slices than the
baseline, unprotected design. The overhead in the ABFT-Shared designs is attributed
to additional multiplexers needed to correctly share access to the design’s processing
elements. Each of the explored ABFT designs has less than 60% slice overhead.
The overhead of the TMR designs ranges from 65% to 150% additional slices,
significantly lower than the expected 200% overhead. In the fine-grained designs,
relative overhead was smaller in the highly-parallel implementations. However, for
the coarse-grained designs, the overhead increased. The TMR design has a higher
overhead (200%+) of LUTs; however, each slice is composed of multiple LUTs, and a
slice is considered used if any of the LUTs are used. The TMR design can have a lower
slice overhead when the design can be more efficiently packed into slices by using more
LUTs per slice. The fine-grained architecture is able to take advantage of the efficient
packing, while the coarse-grained architectures do not. For large designs (e.g., 32
PEs), the number of fully-used slices in the original design increases, increasing TMR
overhead by requiring additional slices for the TMR logic.
The slice overhead of the hybrid ABFT/TMR design is comparable to the TMR
design. Designs with many processing elements have less resource overhead than
smaller designs. The fine-grained design’s slice overhead scales much more favorably
than the coarse-grained design, which is correlated to the slice overhead for the TMR
design. This slice overhead could be reduced by more selectively choosing which
portions of logic to protect with TMR.
The limiting resources for the MM designs are the DSP48 and BRAM components.
Figure 4-6 shows the overhead of DSP48 and BRAM components for the various
designs. The results are identical for the fine-grained and coarse-grained parallelizations.
As more processing elements are used in the ABFT-Extra MM module, the DSP48
overhead created by the additional ABFT MAC unit becomes extremely low (3.125%
with 32 processing elements). The ABFT-Shared design requires zero additional
DSP48 units. Additionally, both the ABFT-Shared and ABFT-Extra methods do not
require additional BRAMs. The Hybrid design has the same DSP48 overhead as the
ABFT-Extra design because only control logic is replicated. The DSP48 and BRAM
overhead for the TMR design was expected to be 200%, independent of the amount
of parallelism. However, the BRAM usage actually increased by less than 100% for
non-serial designs due to the underlying FPGA architecture, the design size, and
optimizations performed by Synopsys' synthesizer. The Virtex-6 BRAM can be used
as one 1024-element RAM or as two 512-element RAMs. The Xilinx tool will only infer
the larger, 1024-element RAMs, while Synopsys can more efficiently use the smaller
512-element RAMs. Therefore, the 200% overhead is mitigated by the need for half as
many large BRAMs, and the BRAM overhead will asymptotically approach 50% with
more processing elements.
From a resource overhead perspective, the ABFT designs have the desired
properties of a low-overhead fault tolerance method. All of the examined ABFT designs
exhibited a slice overhead of less than 60%, with extremely low DSP48 resource
overhead. The hybrid ABFT/TMR should provide additional fault tolerance while not
requiring replication of the limited DSP48 resources on the FPGA. In the next section,
fault-injection testing will determine the effectiveness of the designs discussed in this
section.
4.1.4 Fault-Injection Experiments
While Section 4.1.3 has shown that ABFT provides lower overhead compared to
other fault-tolerance strategies, the reliability of ABFT must also be evaluated. In order
to validate our ABFT design, faults must be injected into an executing system. In this
section, we present FPGA fault-injection results gathered using the Simple, Portable
Fault Injector (SPFI) [Cieslewski et al. 2010]. SPFI performs fault injection by modifying
FPGA configuration frames within a design’s bitstream, re-programming the FPGA, and
comparing the resulting output against known values. Partial reconfiguration is used to
reduce the time required to modify configuration memory and to improve the speed of
fault injection. However, due to the length of time required to exhaustively test an entire
FPGA design, statistical sampling is used to estimate the total number of vulnerable bits
in a given design.
We implemented multiple fault-tolerant designs discussed in Section 4.1.2 on a
Xilinx ML605 FPGA development platform. Each design used a 32-bit integer MM
module with up to 32 processing elements. A UART was also implemented on the FPGA
to stream input test vectors to the MM module and to report results back to a verification
program. During the design phase, the MM module is constrained to a small portion of
the FPGA. The SPFI fault-injection tool enables targeted injections, allowing the UART
to be avoided during fault injection.
Table 4-2 shows the fault-injection results for each of the fault-tolerant MM designs
from Section 4.1.2. For each design, the table indicates the number of injections
performed, the number of undetected data errors, and system hangs detected. The
measured percentage of faults is then scaled to the size of the injection area to estimate
the total number of vulnerable bits. In this analysis, vulnerable bits are the total number
of bits which cause silent (undetected) data errors or system hangs. For ABFT designs,
bits resulting in false-positive results were not considered vulnerable, since they do not
allow faulty data to be propagated back to the host system. The fault vulnerability
for each component is divided by the FPGA’s total number of configuration bits to
estimate the components’ design vulnerability factor (DVF). A design’s configuration
DVF represents the percentage of configuration bits that are vulnerable to faults and
can result in errors. The total DVF of a component includes vulnerable configuration bits
and BRAM data bits which may affect computation results. Reliability of a given design
can then be calculated from the FPGA’s total fault rate scaled by the design’s DVF. Most
Xilinx FPGA designs have a DVF that ranges from 1% to 10% [Xilinx 2010b] due to the
large amount of configuration memory devoted to routing.
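As a sketch of the arithmetic (the configuration-memory size and device upset rate below are assumed placeholder values, not figures from the test platform):

```python
def dvf_percent(vulnerable_bits, total_config_bits):
    # DVF: percentage of configuration bits whose upset causes an error
    return 100.0 * vulnerable_bits / total_config_bits

def design_fault_rate(device_upset_rate, dvf):
    # Effective design error rate: device upset rate scaled by the DVF
    return device_upset_rate * dvf / 100.0

# Example with an assumed 57 Mb configuration memory and an assumed
# device rate of 1.0 upsets/day:
dvf = dvf_percent(14_802, 57_000_000)   # ~0.026%
rate = design_fault_rate(1.0, dvf)      # ~0.00026 errors/day
```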
4.1.4.1 Design vulnerability of serial architectures
The non-fault-tolerant baseline MM design has an estimated 14,802 vulnerable
configuration bits. The majority of these vulnerable bits results in undetected data errors.
Other bits cause the system to become unresponsive, or hang. These system hangs
may be caused by errors in the MM state machine logic, or by errors in FPGA routing
logic which prevents connections from the MM module to the communication UART.
The baseline design is small and uses less than 0.5% of the total logic slices available
on the FPGA. However, the configuration DVF of the baseline design is 0.026%. When
vulnerable data stored in BRAMs is also included, the total DVF of the baseline design is
0.198%.
The ABFT-Shared design has an estimated 8,756 vulnerable bits (approximately
60% of the unprotected design) and a DVF of 0.016%. The vulnerable bits can affect
address generation, the checksum generation or verification, or the error detection
status register. BRAM data bits are initially unprotected, but are effectively safe once
the ABFT checksum generation process has taken place. If the input BRAMs, prior to
the checksum calculation, are considered vulnerable, the ABFT-Shared design has a
DVF of 0.127%. In the ABFT-Extra design, the independent MAC unit for checksum
generation and verification was expected to isolate faults in the main data path from
the ABFT checksum calculations, leading to a more reliable design. However, in the
comparison of serial designs, the two ABFT designs had similar vulnerabilities, with the ABFT-Shared design having 5% fewer vulnerable bits than the ABFT-Extra design. Both ABFT designs experience a similar number of data errors, but more faults cause the ABFT-Extra design to hang.
The TMR design has an estimated 1,390 vulnerable bits and a total DVF of 0.002%.
The majority of these bits are the result of routing faults or TMR majority voter faults.
This result represents a realistic lower bound on total vulnerability of the MM design.
The TMR reliability results in Table 4-2 were obtained by using the Synplify Premier
high-reliability tool, which resulted in significantly more reliable designs than a naive
VHDL-level TMR approach, since the tool performs TMR with finer granularity and more
frequent voting.
The hybrid ABFT design has approximately 30% lower vulnerability than the
ABFT-Extra and ABFT-Shared designs, but higher vulnerability than the TMR design.
However, the hybrid design has lower resource usage compared to the TMR design
(see Table 4-1). For applications where BRAM or DSP48 resources are constrained, the
hybrid ABFT design provides a good compromise between low vulnerability (0.012%
configuration DVF) and low DSP48 usage (100% overhead).
4.1.4.2 Design vulnerability of parallel architectures
Although the serial implementations of the ABFT designs demonstrated a modest
reduction in vulnerability, the benefits of ABFT are expected to be more visible in parallel
implementations. With multiple processing elements, faulty elements will only affect
a subset of the output data, improving fault localization. Figures 4-7 and 4-8 show
the number of vulnerable bits for each design while varying the number of processing
elements.
In each of the baseline designs, the number of vulnerable bits increases linearly
with the amount of parallelization. Intuitively, as the total area required for the design
increases, the number of vulnerable bits increases proportionally. The coarse-grained
design has a higher vulnerability per processing element than the fine-grained design
due to larger resource requirements.
As more processing elements are used, the vulnerability of the ABFT designs
increases, up to the 16-PE design. The rate of increase is much slower than that of the baseline designs, so the improvement in reliability grows with the amount of parallelism employed. Additionally, in the fully-parallel 32-PE designs, the
ABFT designs exhibit their highest reliability. When fully parallelized, the control logic for
the MM calculation is significantly simplified, resulting in fewer system hangs. For each
of the parallel implementations, the ABFT-Shared design was more vulnerable than the
ABFT-Extra design, with 20%-50% more vulnerable bits. Meanwhile, there was not a
significant difference in vulnerability between the coarse-grained and fine-grained ABFT
designs. Any benefits from fault localization in the coarse-grained design were offset by
the increased resource requirements.
As expected, the TMR design has significantly fewer vulnerable bits than any of
the other designs. Additionally, as the number of processing elements scales up, the
number of vulnerable bits stays constant.
Finally, the hybrid ABFT design has a similar fault vulnerability as the ABFT-Extra
design upon which it was based. The fine-grained hybrid design has lower vulnerability
for implementations with up to 4 processing elements. For larger designs, the vulnerabilities
are similar. For the coarse-grained designs, the hybrid design has 5%-15% fewer
vulnerable bits than the ABFT-Extra design. In both cases, the number of false positives
and system hangs are up to 70% lower in the hybrid design.
4.1.5 Analysis of Matrix-Multiplication Architectures
The DVFs of the serial matrix-multiplication designs tested in this section are very low, due to their small resource requirements in comparison to the size of the FPGA
being used. However, the effectiveness of each design is readily measurable. While the
TMR design is the most reliable design, the serial hybrid ABFT design does reduce the
number of vulnerable bits by 56% over the baseline design. For the 32-MAC parallel
design, the fine-grained hybrid ABFT design has 98% fewer vulnerable bits than the
baseline design, while only incurring 25% slice overhead, 3.125% DSP48 overhead, and
0% BRAM overhead.
The observed errors were categorized into three types: silent data errors, system
hangs, and false positives. While silent data errors must be reduced as much as
possible, system hangs and false positives are correctable within a system-level
reliability framework. Each of the tested designs, including the TMR design, experienced
system hangs which prevented the system from returning data. In order to prevent
system hangs, an external watchdog timer may be employed to reset the system,
reconfigure the FPGA, and resume processing. Additionally, each of the ABFT designs
experienced false positive results. False positives cannot be distinguished from true errors at run-time, so the error flag triggers the recovery procedure, causing an FPGA reconfiguration and re-computation of the last data set. By reducing false positives,
the performance overhead incurred by recomputing data can be reduced. The hybrid
ABFT design experienced up to 70% fewer false positives than the ABFT-Shared or
ABFT-Extra designs. Additionally, an examination of result matrices with data errors
revealed that many faulty matrices contained all zero values, producing an incorrect
result with a valid (zero) checksum. It may be possible to further improve the reliability
of the ABFT designs by modifying the ABFT encoding matrix to ensure that output
matrices cannot contain all zeros.
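This aliasing is easy to reproduce with a plain summation checksum; a minimal sketch, assuming the last row of the result matrix holds the column checksums:

```python
def verify_checksum(matrix):
    """Rows 0..n-2 are data; the last row holds the column checksums
    (plain summation encoding, as in the ABFT designs above)."""
    *data, checksum_row = matrix
    expected = [sum(col) for col in zip(*data)]
    return expected == checksum_row

good = [[1, 2], [3, 4], [4, 6]]       # checksum row = column sums
corrupt = [[1, 2], [3, 5], [4, 6]]    # single data error -> detected
zeroed = [[0, 0], [0, 0], [0, 0]]     # wiped output -> passes anyway
```

A fault that zeroes the entire result, checksum row included, verifies successfully because the expected and stored checksums are both zero.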
4.2 Fast Fourier Transform
The Fast Fourier Transform (FFT) is another key kernel in many space-based
applications, such as synthetic-aperture radar and beamforming. While the FFT can be
computed using traditional general-purpose processors or DSPs, the FFT has an efficient,
high-throughput FPGA implementation. Unlike matrix multiplication, the FFT does not
naturally fit into the traditional ABFT framework. In Section 4.2.1, the mathematical
description of our block-based ABFT approach for the FFT is presented. In Section
4.2.2, multiple FFT architectures are discussed for use within a generic ABFT-enabled
FFT architecture. Section 4.2.3 analyzes the resource overhead tradeoffs for each of
the proposed architectures. Section 4.2.4 presents an analysis of fault-injection results
obtained from each of the proposed architectures. Finally, Section 4.2.5 summarizes the
results from the FFT analysis.
4.2.1 Checksum-Based ABFT for FFTs
Since the Fourier Transform is a linear operator, the properties of linearity can be
used to show that ABFT techniques can be applied to the Fast Fourier Transform. For
two vectors a and b, the following equality holds for all linear operators (including the
FFT):
F(a + b) = F(a) + F(b) (4–5)
Using this property we can create a block-based ABFT technique that can be used
to detect errors in blocks of FFTs. Instead of performing detection on each input vector,
we form a matrix A composed of individual signal vectors (x_0, x_1, x_2, …). Following the
approach in Section 4.1, we can then augment this matrix with an additional checksum row by using the same encoding vector E_N as in Equation 4–1.
A = [ x_0 ; x_1 ; … ; x_{N-2} ; x_{N-1} ]   (4–6)

A_C = [ A ; E_N^T · A ] = [ A ; Σ_{i=0}^{N-1} x_i ]   (4–7)

(Rows of the bracketed matrices are written top to bottom, separated by semicolons.)
By performing an FFT on each row of the augmented matrix A_C, we produce an output matrix B_C which contains the transformed values of each input vector and
the transformed checksum vector. The transformed checksum vector will be a valid checksum for the output matrix, as shown in Equation 4–8.
F(A_C) = [ F(x_0) ; … ; F(x_{N-1}) ; F(Σ_{i=0}^{N-1} x_i) ]   (4–8)

B_C = [ X_0 ; … ; X_{N-1} ; Σ_{i=0}^{N-1} X_i ]   (4–9)
Using the property of linearity described in Equation 4–5, the checksum rows of F(A_C) and B_C are equivalent. In the event of an error during computation, the error will be
detectable, although more complex encoding vectors are required to detect which vector
is faulty.
The absolute overhead of using this ABFT technique is dependent on the number
of vectors used per ABFT block. Given N vectors of M elements, the additional
overhead of checksum generation and verification is O(NM), while the additional core
computation is O(Mlog(M)). As a function of ABFT block size, the relative overhead of
ABFT is O(1/N).
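The block-checksum property can be demonstrated numerically with a naive DFT standing in for the hardware FFT core (a floating-point sketch, not the dissertation's FPGA implementation):

```python
import cmath

def dft(x):
    """Naive O(M^2) DFT, standing in for the FFT core."""
    m = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / m)
                for k in range(m)) for j in range(m)]

def block_abft_ok(vectors, inject_fault=False, tol=1e-9):
    checksum_in = [sum(col) for col in zip(*vectors)]   # E_N^T * A
    outputs = [dft(v) for v in vectors]                 # rows of B_C
    if inject_fault:
        outputs[0][0] += 1.0                            # simulated upset
    lhs = dft(checksum_in)                              # F(sum of x_i)
    rhs = [sum(col) for col in zip(*outputs)]           # sum of X_i
    return all(abs(a - b) <= tol for a, b in zip(lhs, rhs))

block = [[1, 2, 3, 4], [0, 1, 0, -1], [2, 0, 2, 0]]
```

By linearity, the DFT of the summed inputs matches the sum of the individual DFT outputs, so any fault injected into an output vector breaks the equality.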
4.2.2 FFT Architectures
Multiple FFT modules were created to examine the reliability of hardware-based
ABFT. Unlike the previous matrix-multiplication architectures, this analysis uses
pre-built, single-precision floating-point operators to construct each design. Commonly,
pre-existing IP is used to reduce development costs by improving design and verification
time. The FFT operators and floating-point arithmetic operators used in the following
sections are pre-existing IP generated using the Xilinx CoreGen tool [Xilinx 2013a]. The
Xilinx FFT operator has several possible architectures including a high-performance
pipelined architecture and a minimal-hardware serial architecture which will be
compared in this section. The ABFT logic for each design was generated by combining
these pre-existing IP cores to perform the checksum operations. This section discusses
trade-offs of each FFT architecture and the design decisions that were made for the
ABFT architecture.
4.2.2.1 Radix-2 Burst-IO FFT architecture
The basic, serial architecture for the Fast Fourier Transform consists of a
single butterfly operator (complex multiplication and addition), an address generation
module, and two data storage modules (BRAM) as shown in Figure 4-9(a). One memory
is used to store the input matrix A, and one memory is used for the resulting output
matrix B. The address generator iterates through the correct matrix indices, sending
data stored in the input BRAMs to the FFT operator, and generates the appropriate
address for output values. This FFT module can be used for any size matrix as long
as enough data storage is supplied. For an M-point FFT, this architecture requires
O(Mlog(M)) cycles to calculate an output vector.
4.2.2.2 Radix-2 Pipelined FFT architecture
The pipelined FFT architecture has the same high-level architectural diagram
as in Figure 4-9(a). However, internally it uses log(M) butterfly operators to enable
high throughput. The latency of the pipelined architecture is approximately O(2M)
cycles, and the time to compute N vectors is O(NM) cycles, resulting in a significant
improvement in processing speed while calculating multiple, consecutive FFTs.
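As a rough cycle-count model of the two architectures (asymptotic constants simplified, so this is a sketch rather than a cycle-accurate model):

```python
import math

def burst_io_cycles(n_vectors, m_points):
    # One butterfly unit: ~M*log2(M) cycles per M-point FFT
    return int(n_vectors * m_points * math.log2(m_points))

def pipelined_cycles(n_vectors, m_points):
    # log2(M) butterfly stages: ~M cycles per FFT once the pipeline
    # fills, plus an ~2M-cycle initial latency
    return int(2 * m_points + n_vectors * m_points)

# 16 consecutive 64-point FFTs:
serial = burst_io_cycles(16, 64)   # 6144 cycles
piped = pipelined_cycles(16, 64)   # 1152 cycles
```

For blocks of consecutive FFTs, the pipelined architecture's advantage approaches the log2(M) factor as the fill latency amortizes.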
4.2.2.3 Architectural modifications for ABFT
The addition of ABFT logic requires the creation of two functions, ABFT checksum
generation and ABFT checksum verification, each requiring a floating-point accumulator.
Unlike the matrix-multiplication case study, the ABFT checksum logic is significantly
different from the algorithm’s main computation, and the logic must be implemented
separately. Checksum generation sums each column of matrix A using a floating-point
accumulator and writes the checksum into a checksum BRAM. After the FFT operation,
the checksum verification function sums the columns of matrix B and writes the value
to a checksum BRAM. Then, the checksum verification function uses a floating-point
comparison operator to compare checksum values in the checksum BRAMs. If a
mismatch is detected, an “error found” signal is asserted until the module is reset. In
order to reduce the BRAM requirements, the last row of the matrix A and B BRAMs may
be reserved for checksums. In this case, the initially generated checksum will be written
to the matrix A BRAM and the verification checksum will overwrite that checksum value
after computation is complete. Figure 4-9(b) shows an example of an ABFT-enabled FFT
architecture where checksums are stored within the matrix A and B BRAMs.
Due to the nature of floating-point arithmetic, the calculated checksums may not
be exact. Quantization error is introduced due to the order of additions as well as
the internal rounding within the FFT module. For most applications using FFTs, such
as image processing, data errors that occur in low-significance bits may be safely
ignored, and errors that avoid detection should not significantly impact the higher-level
application. In order to handle errors in the highly significant bits, ABFT detects errors by
comparing the difference of the two generated checksums to a user-defined threshold
value. For maximum coverage, this threshold should be set as close to zero as possible.
Unfortunately, determination of the threshold value is both algorithm-specific and
data-specific. In order to estimate the required threshold values while calculating blocks
of FFTs, MATLAB simulations were used to measure the maximum quantization error
encountered while using various input data sets. For each simulation, test vectors
were generated using a uniform random distribution on the interval (−x/2, x/2), where x
is the range of the input data. Figure 4-10 shows the required threshold to avoid false
positives for the block-based ABFT for FFT while varying the range of the input data.
The observed linear trend simplifies the selection of an acceptable threshold value.
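A range-proportional threshold can then be applied when comparing checksums; the slope below is only an illustrative value read from the scale of Figure 4-10, not a fitted constant from the dissertation:

```python
def abft_threshold(input_range, slope=3.0e-5):
    # Linear model: worst-case quantization error in the checksums
    # grows in proportion to the input data range
    return slope * input_range

def checksums_agree(expected, observed, threshold):
    # Declare an error only when the mismatch exceeds the threshold
    return all(abs(a - b) <= threshold
               for a, b in zip(expected, observed))

thr = abft_threshold(1000)   # 0.03 for inputs spanning (-500, 500)
```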
4.2.3 Resource-Overhead Experiments
In this section we analyze the overhead of the ABFT architectures presented in
Section 4.2.2 and compare them to traditional fault-tolerance mitigation strategies. For
this analysis, 64-point FFTs will be used, enabling up to 16 complex input vectors to fit
within two BRAM components. The floating-point Burst-IO and Pipelined FFT modules
are generated using the Xilinx CoreGen utility. We compare the two baseline FFT
architectures with several fault-tolerant designs (TMR, ABFT, and hybrid TMR/ABFT).
The results of this comparison are shown in Table 4-3.
4.2.3.1 Resource overhead of Burst-IO architecture
In the baseline Burst-IO architecture, 4 BRAMs are used to store input and output
data. The FFT module uses an additional 4 BRAMs to hold intermediate values during
processing. Additionally, 8 DSP48 units are used for the FFT butterfly operation
computation. The ABFT design requires 4 additional DSP48 units and 260 slices for
the floating-point addition, floating-point comparison, and control logic functionality,
representing 42% slice overhead and 50% DSP overhead. Unlike the MM case study,
the FFT TMR design has approximately 200% overhead of each of the slice, BRAM,
and DSP resources. Since the FFT components are assembled from pre-built netlists
produced by CoreGen, little optimization is performed during synthesis, resulting in the
expected amount of resource overhead. In the hybrid FFT design, each of the state
machines controlling FFT or ABFT components in the design are protected using TMR
while the FFT and ABFT floating-point operators remain unchanged. The hybrid design
has 48% slice overhead and 50% DSP overhead.
For larger FFTs (e.g., 128+ points), the Burst-IO DSP48 resource requirements
remain constant while additional BRAMs are required for intermediate result storage.
The 50% resource overhead of the ABFT design is independent of the parameters of the
FFT operator.
4.2.3.2 Resource overhead of Pipelined architecture
In the baseline Pipelined architecture, 4 BRAMs are used to store input and output
data. The FFT module uses 16 DSP48 units for the FFT butterfly computation and one
additional BRAM to hold intermediate values during processing. The Pipelined FFT uses
almost twice as many slices and DSP48s, but fewer BRAMs than the Burst-IO version.
The ABFT design increases the resource requirements by the same amount as the
Burst-IO design (i.e., 4 DSP48s and 260 slices). Since the Pipelined design is approximately
twice as large as the Burst-IO design, the overhead is correspondingly lower at 23%
slice overhead and 25% DSP overhead. The TMR design has approximately 200%
overhead of each of the slice, BRAM, and DSP resources. The Pipelined hybrid design
uses the same methodology as the Burst-IO hybrid design and has only 26% slice
overhead and 25% DSP overhead.
For larger FFTs (e.g., 128+ points), the size of the Pipelined design increases
because more pipeline stages and butterfly operators are required. As the size of the
FFT increases, the ABFT requirements remain static, and the ABFT resource overhead
is reduced. For example, a 1024-point FFT would require 34 DSP48 components, so the 4 DSP48 components for the ABFT checksums add only 12% overhead to the baseline design.
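A quick check of this overhead arithmetic (DSP48 counts taken from Table 4-3 and the 1024-point estimate above):

```python
def abft_dsp_overhead_pct(fft_dsp48, abft_dsp48=4):
    # ABFT adds a fixed 4 DSP48s regardless of FFT size, so its
    # relative overhead shrinks as the FFT core grows
    return 100.0 * abft_dsp48 / fft_dsp48

overhead_64pt = abft_dsp_overhead_pct(16)     # 25.0%
overhead_1024pt = abft_dsp_overhead_pct(34)   # ~11.8%, i.e. ~12%
```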
4.2.4 FFT Fault Injection
In order to validate our ABFT FFT designs, faults must be injected into an executing
system. In this section, we present FPGA fault-injection results gathered using the
methodology described in Section 4.1.4. For the ABFT designs, the value for the
checksum error threshold was calculated from the input test vectors during a fault-free
processing run. Table 4-4 shows the fault-injection results for each of the fault-tolerant
FFT designs from Section 4.2.2. For each design, the table indicates the number of
injections performed, the number of undetected data errors, and system hangs detected.
The measured percentage of faults is then scaled to the size of the injection area to
estimate the total number of vulnerable bits. In this analysis, vulnerable bits are the total
number of bits which cause silent (undetected) data errors or system hangs. For ABFT
designs, bits resulting in false-positive results were not considered vulnerable, since they
do not allow faulty data to be propagated back to the host system.
The Burst-IO baseline design has an estimated 48,012 vulnerable configuration
bits and a configuration DVF of 0.084%. The total DVF includes an additional 131,072
memory bits in BRAM, increasing the baseline design’s total DVF to 0.314%. In the
baseline design, 80% of the configuration memory faults cause data errors while 20%
cause system hangs. The Burst-IO ABFT design reduces the total number of vulnerable
configuration bits by 80%, to 9,624. Additionally, in the ABFT design, the Matrix A BRAM
data bits are only vulnerable before the checksum value has been computed and the
Matrix B BRAM values are always protected, improving the total DVF of the design to
0.125%. Silent data errors, the most dangerous error type, are reduced by 95% over
the baseline design. The distribution of error types in the ABFT design is reversed from
the baseline design’s distribution; 80% of configuration faults result in system hangs
while only 20% cause data errors. The hybrid ABFT/TMR design reduces the design
vulnerability by an additional 30%, to only 6,662 vulnerable bits. The hybrid design
decreases the occurrence of system hangs more than data errors. Finally, the TMR FFT
design has 3,278 vulnerable bits, 94% fewer vulnerable bits than the baseline design,
while protecting all BRAM data bits.
The Pipelined baseline design is much more vulnerable than its Burst-IO counterpart, with 182,192 vulnerable configuration bits (0.320% configuration DVF).
The increased vulnerability comes from the larger resource requirements and increased
routing complexity of the pipelined architecture. The Pipelined ABFT design reduces
data errors by 97% but only reduces system hangs by 33%. The Pipelined ABFT
design’s error-type distribution is more skewed than the Burst-IO case; 85% of errors
result in system hangs and 15% result in data errors. The Pipelined hybrid ABFT/TMR
design reduces the design vulnerability by an additional 20%, to 22,251 vulnerable
configuration bits. The hybrid design decreases the occurrence of system hangs more
than data errors. Finally, the pipelined TMR FFT design has 9,306 vulnerable bits, 95%
fewer vulnerable configuration bits than the baseline design, while also protecting all
BRAM data bits.
4.2.5 Analysis of FFT Architectures
The FFT architectures examined in this section are significantly more complex than
the matrix-multiplication architectures of Section 4.1. Due to algorithm complexity and
the use of floating-point precision operators, the resource requirements are much higher.
For example, the baseline Burst-IO design is 300% larger than the baseline MM design.
Additionally, the floating-point precision also increased the resource requirements of
the ABFT logic. However, even with the smaller Burst-IO architecture, the overhead of
the ABFT design was only 50%, significantly better than the TMR alternative. For the
pipelined FFT designs, larger FFTs do not increase the resources required by the ABFT
component, further improving resource overhead.
In each of the FFT designs, the number of vulnerable bits that cause system hangs
is much higher than in the MM case studies. The generated FFT components have
substantially more control logic than the MM designs, which can lead to system hangs
when upset. In the baseline FFT designs, most vulnerable bits result in corrupt data but
the proportion of vulnerable bits causing system hangs is higher than in the MM case
studies. In each of the fault-tolerant designs, more vulnerable bits result in system hangs
than corrupt data. While the hybrid design does not significantly reduce data corruption,
it does lower the occurrence of system hangs. Although TMR provides the highest
reliability designs, ABFT is effective in reducing undetected data errors. If a system-level
mechanism (e.g., watchdog timer) is used to recover from system hangs, the ABFT
designs can approach the reliability of TMR.
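The system-level recovery flow described here (watchdog or error flag, then reconfigure and recompute) can be expressed as a retry loop; `compute_once` and `reconfigure` are hypothetical placeholders for the platform-specific FPGA calls:

```python
def run_with_recovery(compute_once, reconfigure, max_retries=3):
    """compute_once() returns (result, error_flag); error_flag covers
    ABFT mismatches, watchdog timeouts, and detected hangs."""
    for _ in range(max_retries):
        result, error = compute_once()
        if not error:
            return result
        reconfigure()   # scrub or reprogram the FPGA, then retry
    raise RuntimeError("fault persisted after reconfiguration")
```

False positives cost only the reconfiguration and one recomputation, which is why reducing their rate lowers the performance overhead without affecting correctness.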
4.3 Conclusions
In this chapter, we have presented a novel analysis of ABFT for low-overhead fault
tolerance in FPGA systems. Several matrix-multiplication and FFT designs employing
TMR and ABFT fault-tolerance techniques were developed and tested using an FPGA
fault-injection tool. The results demonstrated that ABFT was capable of reducing
the number of vulnerable configuration bits in a design while also protecting most
memory bits. While the TMR fault-mitigation approach had the lowest vulnerability, a
hybrid ABFT/TMR design approach was able to lower configuration vulnerability while
maintaining low overhead. With matrix multiplication, configuration vulnerability was
reduced by 90% in highly-parallel ABFT architectures while using less than 20% slice
and DSP48 resource overhead. For FFT architectures, configuration DVF was lowered
by 85% while using 50% or less additional resources. As the size and parallelism
of each of the ABFT architectures increase, the relative effectiveness of ABFT also
increases.
Matrix Multiplication and Fast Fourier Transforms are the key kernels in many
space-based processing systems. The ABFT architectures described in this work
are generic and can be applied to many of these systems. Additionally, other linear
operations, such as LU or QR decomposition, can also be protected using ABFT.
The vulnerability results show that ABFT can be used as part of a comprehensive
fault-tolerance strategy for space-based FPGA systems. ABFT detects most data
errors that occur during processing. ECC techniques (e.g., Hamming codes, parity) can be used to protect data memory with minimal overhead. A higher-level
watchdog mechanism can be used to recover from system hangs and crashes. Many
of these techniques, in combination with TMR, are already in use in space systems.
By selectively removing replicated components and using ABFT in their place, space
systems can increase their processing capabilities without sacrificing reliability.
Table 4-1. Resource utilization and overhead of serial MM designs.

Fault         Slice  Slice     BRAM   BRAM      DSP48  DSP48
tolerance     count  overhead  count  overhead  count  overhead
None          158    –         3      –         3      –
ABFT-Shared   180    14%       3      0%        3      0%
ABFT-Extra    208    32%       3      0%        6      100%
TMR           388    146%      9      200%      9      200%
Hybrid        404    156%      3      0%        6      100%
Table 4-2. Serial matrix multiplication fault-injection results.

Design name   Faults    Data    System  Vulnerable          Configuration  Total
              injected  errors  hangs   config bits (est.)  DVF (%)        DVF (%)
Baseline      100,000   3,114   368     14,802              0.026%         0.198%
ABFT-Shared   100,000   1,056   633     8,756               0.016%         0.127%
ABFT-Extra    100,000   1,055   712     9,161               0.016%         0.127%
TMR           100,000   118     16      1,390               0.002%         0.002%
Hybrid ABFT   100,000   348     281     6,522               0.012%         0.123%
Table 4-3. Resource utilization and overhead of FFT designs.

Architecture  Fault      Slice  Slice     BRAM   BRAM      DSP48  DSP48
              tolerance  count  overhead  count  overhead  count  overhead
Burst-IO      None       607    –         8      –         8      –
Burst-IO      ABFT       867    43%       8      0%        12     50%
Burst-IO      TMR        1834   202%      24     200%      24     200%
Burst-IO      Hybrid     901    48%       8      0%        12     50%
Pipelined     None       1129   –         5      –         16     –
Pipelined     ABFT       1390   23%       5      0%        20     25%
Pipelined     TMR        3404   202%      15     200%      48     200%
Pipelined     Hybrid     1424   26%       5      0%        20     25%
Table 4-4. FFT fault-injection results.

Design name           Faults    Data    System  Vulnerable          Configuration  Total
                      injected  errors  hangs   config bits (est.)  DVF (%)        DVF (%)
Burst-IO - Baseline   100,000   1,809   461     48,012              0.084%         0.314%
Burst-IO - ABFT       100,000   98      357     9,624               0.017%         0.125%
Burst-IO - TMR        100,000   27      128     3,278               0.006%         0.006%
Burst-IO - Hybrid     100,000   80      235     6,662               0.012%         0.120%
Pipelined - Baseline  100,000   6,891   1,723   182,192             0.320%         0.550%
Pipelined - ABFT      100,000   180     1,112   27,327              0.048%         0.156%
Pipelined - TMR       100,000   68      97      9,306               0.016%         0.016%
Pipelined - Hybrid    100,000   162     890     22,251              0.039%         0.147%
Figure 4-1. Matrix-multiplication architectures. A) Baseline. B) Fine-grained parallel. C) Coarse-grained parallel.
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        C[i][j] = 0;
        for (k = 0; k < N; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

Figure 4-2. Matrix-multiplication pseudocode.
Figure 4-3. Matrix-multiplication ABFT-Extra architecture.
Figure 4-4. Slice overhead of fine-grained parallel matrix multiplication.
Figure 4-5. Slice overhead of coarse-grained parallel matrix multiplication.
Figure 4-6. DSP48 and BlockRAM overhead of parallel matrix multiplication.
Figure 4-7. Fault vulnerability of fine-grained parallel matrix multiplication.
Figure 4-8. Fault vulnerability of coarse-grained parallel matrix multiplication.
Figure 4-9. FFT architectures. A) Serial. B) ABFT.
Figure 4-10. Required ABFT threshold value for floating-point FFTs (required ABFT threshold vs. range of input data).
CHAPTER 5
RFT SYSTEM INTEGRATION FOR RAPID SYSTEM DEVELOPMENT
The RFT-based system presented in Section 3.1 is intended to be a reference
design that can be adapted to actual space-system architectures. In this chapter, we
investigate possible methods and/or tools that will improve the usability of RFT-based
systems for system developers to facilitate adoption of our framework. Our approach,
shown in Figure 5-1, has identified two areas of RFT system design that may be
improved through external tools. The first area, generating RFT hardware for arbitrary
system configurations, can be facilitated by creating multiple system templates for
various interconnection architectures. Specifically, we target architectural support in
order to provide compatibility with commonly used partial reconfiguration architectures,
such as VAPRES [Jara-Berrocal and Gordon-Ross 2010]. The second area, software-based
support for intelligent fault-tolerance mode selection, will combine fault-rate modeling
and task scheduling to optimize job placement for high performability. Research into
each of these areas will facilitate the adoption of the RFT framework into new and
existing space systems.
In this section, we outline our approach for improving hardware and software
support for RFT systems. Section 5.1 discusses RFT architectural templates, integration
of an RFT controller into an existing VAPRES PR-based system, and parameterized
VHDL designs for RFT controller logic. RFT controller generation and system integration
will reduce the time required to develop RFT systems by using existing designs with little
modification and dynamically creating new RFT-related components. Section 5.2
discusses reliable task scheduling, which will reduce the complexity of RFT software
requirements.
5.1 Dynamically Generated RFT Components
The RFT controller created for Section 3.1 was designed for a small RFT system
consisting of only three PRRs and connected to a host CPU through a bus-based
94
interconnect. However, for maximum flexibility, the RFT framework must support
systems with an arbitrary number of PRRs and interconnect types. Additionally, a
system design may decide to support only a subset of the fault-tolerance modes allowed
by the RFT framework. Dynamic RFT controller generation allows a system designer to
specify the size and supported features of an RFT controller customized for their specific
system. By disabling unneeded features, the overhead of the RFT controller can be
reduced.
The RFT controller can be divided into two categories: system interconnect
interfaces and fault-tolerance features. The type of system interconnect affects the
structure of the Address Decoder component. The fault-tolerance features affect the
Voting Logic, Output Mux, and Watchdog Timers components.
5.1.1 RFT Controller Point-to-Point Interface
The RFT hardware architecture from Section 3.1 was used as an example
bus-based PR architecture. However, to improve usability with pre-existing systems,
it is possible to adapt the RFT controller to be used in other PR architectures. By
adapting the RFT controller to be used within a point-to-point VAPRES PR system, it will
be possible to quickly provide fault tolerance to pre-existing VAPRES systems. Other
existing PR architectures can be classified as either bus-based or point-to-point.
The initial RFT controller connects to the rest of the system through the Processor
Local Bus (PLB). PRMs connected to the RFT controller must act as slave devices.
Slave devices can respond to bus requests, but cannot initiate transfers. For example,
the processor can read from a PRM’s memory, but the PRM cannot write a value directly
to the host processor’s memory. The RFT controller instantiates a slave PLB controller
to handle communication with the host processor. The PLB controller converts the bus
signals into a simplified set of user signals.
The primary difference between the system from Section 3.1 and a VAPRES-based
system is the use of Fast Simplex Links (FSL) for communications between the
MicroBlaze processor and the PRRs. An FSL template for the RFT controller must
be created in order to operate with a point-to-point architecture (VAPRES). An FSL link
is a FIFO queue that directly connects the MicroBlaze with user logic. The MicroBlaze
instruction set has specific intrinsic functions for reading and writing to the FSL links.
These direct links from the PRMs to the MicroBlaze increase throughput and reduce
latency from bus contention.
For an RFT controller operating in high-performance mode, data from an FSL could
be passed directly to its connected PRR. However, an FSL-based RFT controller must
be able to send data from an arbitrary FSL to any other FSL in order to operate
in TMR or DWC modes. For example, in TMR mode, any data written to FSL #0 must
also be simultaneously written to FSL #1 and FSL #2 to ensure correct functionality.
In order to accomplish this, the FSL-based RFT controller must include an additional
bi-directional FIFO queue for each PRR. The RFT controller must process all incoming
data, route the data to the appropriate PRM, and write any output data to the correct
output FSL. This requires an additional 2 BRAMs per PRR plus the necessary control
logic. Table 5-1 shows a comparison of the resource requirements for the bus-based
and point-to-point RFT controllers.
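The replication behavior of the FSL-based controller can be sketched in software. The following Python model is illustrative only; the actual controller is implemented in VHDL, and the function name, the FIFO representation, and the convention that TMR replicas occupy PRRs 0 through 2 are assumptions of ours:

```python
from collections import deque

def route_write(mode, src_fsl, word, prr_fifos):
    """Model of input routing in the FSL-based RFT controller: a word arriving
    on an FSL is appended to the input FIFO of each PRR holding a replica of
    the target module. (Replica grouping here is an assumed convention.)"""
    if mode == "tmr" and src_fsl == 0:
        targets = [0, 1, 2]       # replicate to all three TMR replicas
    elif mode == "dwc" and src_fsl == 0:
        targets = [0, 1]          # replicate to both DWC replicas
    else:
        targets = [src_fsl]       # high-performance mode: direct pass-through
    for t in targets:
        prr_fifos[t].append(word)
    return targets
```

A single write in TMR mode lands in three FIFOs, which is why the FSL-based controller requires a bidirectional FIFO (2 BRAMs) per PRR in addition to the routing control logic.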
Since most PR-based systems can be classified as either bus-based or point-to-point,
the two RFT controllers created in this work can serve as reference designs.
Although alternative buses or protocols may be used in future systems, the overall
architecture should remain applicable.
5.1.2 Parameterized and Configurable Voting Logic
While the RFT controller’s input- and output-address decoder hardware may be
dependent on the interface to the larger PR system, many other components are
only dependent on the number of PRRs in the design. These components, created
using VHDL, can be parameterized in order to be easily adapted to arbitrarily large
systems. The main parameter, NUM_PRRS, scales the RFT controller interfaces to the
appropriate size for the system. The parameterized components we examine are the
watchdog timers, RFT status registers, output mux, and voting logic.
The ENABLE_WATCHDOG parameter enables the creation of watchdog-timer
logic for each PRR in the design. Additionally, a watchdog configuration register is
created for each PRR. For PRMs that do not use the watchdog functionality,
the configuration register can be used to disable the timer. NUM_PRRS status registers
are also created.
There are several parameters related to the creation of voting logic within the
RFT controller. The ENABLE_TMR and ENABLE_DWC parameters create TMR or DWC voters,
respectively. One TMR voter is created for every three PRRs, and one DWC voter is
created for every two PRRs. By not creating $\binom{\mathrm{NUM\_PRRS}}{3}$ or
$\binom{\mathrm{NUM\_PRRS}}{2}$ voters (one for every possible PRR combination), respectively,
we reduce the total number of required resources at the expense of some flexibility.
Based on the parameterized voters, the output multiplexer can then select from the
complete enumeration of the possible RFT-mode combinations.
These simple parameters allow for the creation of a scalable RFT controller
design. By enabling or disabling specific fault-tolerance features, the RFT controller
can be targeted and optimized for specific architectures or applications. As system
size increases, the parameterized design automatically creates the logic required for
additional PRRs.
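The component counts implied by these parameters can be sketched as follows. This is a minimal Python model of what a generated controller would instantiate, not the VHDL generator itself; the function name and dictionary layout are our assumptions:

```python
def rft_controller_plan(num_prrs, enable_watchdog=True,
                        enable_tmr=True, enable_dwc=True):
    """Component counts a generated RFT controller would instantiate:
    one watchdog timer and one status register per PRR, one TMR voter per
    three PRRs, and one DWC comparator per two PRRs (rather than one per
    possible PRR combination)."""
    return {
        "watchdog_timers": num_prrs if enable_watchdog else 0,
        "status_registers": num_prrs,
        "tmr_voters": (num_prrs // 3) if enable_tmr else 0,
        "dwc_voters": (num_prrs // 2) if enable_dwc else 0,
    }
```

For example, a 12-PRR system with all features enabled yields 12 watchdog timers, 4 TMR voters, and 6 DWC comparators; disabling a feature removes its logic entirely.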
5.2 Task Scheduling for RC Systems in Dynamic Fault-Rate Environments
In order to optimize a system for performance and reliability, the fault-tolerant mode
must be selected carefully based upon the current fault conditions experienced by the
system. In this section, we propose a fault-tolerant scheduler that can schedule real-time
tasks and apply a heuristic to select an RFT fault-tolerance mode for each task.
We then evaluate the effectiveness of the proposed heuristic using two case-study
simulations.
5.2.1 Selection Criteria for Fault-Tolerant Mode
RFT mode switching can be triggered by a priori knowledge of the operating
environment, application-triggered events, or external events. In an RFT system, the
expected fault rate can be estimated either directly or indirectly. An external radiation
sensor can be directly interfaced with the FPGA, allowing the system to track the current
fault rate and predict future fault rates. Alternatively, the RFT system can indirectly
determine fault rates using models of the expected fault environment from Chapter 3. By
correlating the space system’s current position to an existing model, a fault-rate estimate
can be used to make scheduling decisions.
In the following sections we assume that tasks can be scheduled with no fault
tolerance (Simplex), duplication with compare (DWC), or triple-modular redundancy
(TMR). Data errors in tasks are detected by comparing or voting on the output of each
task replica. We also assume that the system can estimate the current fault environment
using a pre-existing model of the system’s orbit.
5.2.1.1 FT-mode selection using thresholds
One of the most straightforward methods for selecting an appropriate fault-tolerance
mode is the use of thresholds. At very high fault rates, TMR is required to maintain
reliability. At low fault rates, DWC or Simplex modes may provide sufficient reliability
while increasing performance. At each time step, the current fault rate is measured, and
new tasks are assigned based on the pre-selected rules. The optimal threshold occurs
at the fault rate where the reliable performance of TMR and DWC are equal. The ideal
scheduling heuristic will select TMR when the current fault rate is above the fault-rate
threshold, fthresh, and will select DWC otherwise. Selecting the appropriate values for the
threshold is dependent on the application and environment. Determining fthresh requires
information about fault rates, task frequency, task load, and other factors which may not
be static throughout a system’s operation.
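The threshold heuristic reduces to a single comparison. In this sketch the default of 0.0025 faults/sec is the crossover rate measured later in Section 5.2.4.1; the function name and the restriction to the DWC and TMR modes are assumptions of ours:

```python
def threshold_mode(fault_rate, f_thresh=0.0025):
    """Threshold heuristic: schedule new tasks in TMR when the current fault
    rate (faults/sec) exceeds f_thresh, and in DWC otherwise."""
    return "tmr" if fault_rate > f_thresh else "dwc"
```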
5.2.1.2 Time-resource metric for FT-mode selection
Instead of depending upon a user-defined threshold value to determine a fault-tolerance
strategy, we explore a possible metric which can estimate the optimal threshold. The
metric combines computation time (τ ) and fault probability of a single task (f ) in order
to select between FT modes. By developing a scheduling metric that incorporates a
dynamic fault rate, we intend to improve overall system performance without the need for
expert user input. Given the current fault rate, f , the reliability of a task at the end of its
computation time, R, is given by the following equations (depending on mode):
R_{\mathrm{Simplex}} = (1 - f)^{\tau}
R_{\mathrm{DWC}} = (1 - f)^{2\tau}
R_{\mathrm{TMR}} = 3(1 - f)^{2\tau} - 2(1 - f)^{3\tau}    (5-1)
If tasks are running in DWC or TMR mode, faults are discovered at the end of
the task’s computation time. Faulty tasks are then rescheduled until they complete
successfully. The average number of times a task must be executed in order to
successfully complete is 1/R; the effective execution time, τ_eff, is then given by the
following geometric series:

\tau_{\mathrm{eff}} = \sum_{n=0}^{\infty} \tau (1 - R)^{n} = \frac{\tau}{R}    (5-2)
We define a time-resource coefficient, α, which combines the effective execution
time with the required resources of a given task (α = N × τ_eff, where N is the number
of PRRs the task occupies). Then, by comparing the α of a DWC or TMR task, we can
determine which mode is optimal for reliability (lower is better). In Equation 5-3, we
solve for conditions where DWC will provide a lower α than TMR.
\frac{2\tau}{(1 - f)^{2\tau}} \leq \frac{3\tau}{3(1 - f)^{2\tau} - 2(1 - f)^{3\tau}}    (5-3)
Simplifying Equation 5-3 provides the following simple relation:

(1 - f)^{\tau} \geq \frac{3}{4}    (5-4)
Based on the definition of α, DWC provides more reliable performance than TMR
when RSimplex is greater than 0.75. For low fault rates, DWC provides higher overall
performance. For very high fault rates, or very long execution times, the reliability of
TMR scheduling is preferred. A similar analysis can be performed for simplex tasks;
however, simplex scheduling has no method for detecting faults. We use this conclusion
as the basis for an adaptive fault-tolerance threshold.
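Equations 5-1 through 5-4 can be checked with a short sketch. The function names below are ours, and ties at the crossover are broken in favor of DWC; this is a model of the metric, not the scheduler implementation:

```python
def reliability(mode, f, tau):
    """Per-task reliability at the end of computation time tau (Equation 5-1)."""
    r = (1.0 - f) ** tau                       # simplex (single-replica) reliability
    return {"simplex": r,
            "dwc": r ** 2,                     # both replicas must be fault-free
            "tmr": 3 * r**2 - 2 * r**3}[mode]  # voter masks one faulty replica

def alpha(mode, f, tau):
    """Time-resource coefficient: replicas x effective execution time tau/R."""
    n_replicas = {"simplex": 1, "dwc": 2, "tmr": 3}[mode]
    return n_replicas * tau / reliability(mode, f, tau)

def adaptive_mode(f, tau):
    """Pick the fault-detecting mode with the lower alpha; by Equation 5-4
    this is DWC exactly when (1 - f)**tau >= 3/4."""
    return "dwc" if alpha("dwc", f, tau) <= alpha("tmr", f, tau) else "tmr"
```

Because the comparison depends on the product of fault rate and execution time through (1 − f)^τ, a long task can be scheduled in TMR at the same fault rate at which a short task is scheduled in DWC.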
5.2.2 Scheduler for RFT
Traditional fault-tolerant scheduling algorithms assume that the fault rate experienced
by the system will be constant, and that the fault-tolerance strategy will also be constant.
For an RFT-based system, a scheduler which uses the current fault rate is necessary
to maximize system utilization while maintaining system availability. The fault-tolerant
scheduler presented in this section can schedule tasks in any FT mode based on
user-defined thresholds or the α-metric.
5.2.2.1 RFT architecture description
The RFT system described in Chapter 3 contains a microprocessor connected
to several large partially-reconfigurable regions (PRRs) through a shared system
bus. During normal system operation, unique tasks can be scheduled to any of the
PRRs. Depending on system configuration, the outputs of three contiguous PRRs can
be voted on to provide coarse-grained TMR functionality, or two PRRs can provide
DWC functionality. Each of these PRRs is large and identical in size, and can be
represented with a 1D area model, reducing many of the scheduling problems presented
in Section 2.7.
5.2.3 Software Simulation
In order to evaluate our scheduling technique and possible heuristics, a software-based
discrete-time simulator was developed in C++. The simulator enables us to specify
task arrival rates, task deadlines, dynamic fault rates, and scheduling algorithms. In
addition to scheduling tasks, the simulator can also inject faults into tasks and force
re-scheduling of failed tasks.
Figure 5-3 shows the basic overview of how the simulator is used. At each time
step, tasks are randomly added to a task pool. This process is modeled as a Poisson
process with mean λ_arrival. All tasks in the task pool are scheduled, if possible, and
then moved to the reservation list. When multiple tasks arrive simultaneously, tasks
are scheduled using an earliest-deadline-first (EDF) heuristic. The scheduler does not
employ preemption; tasks are scheduled on arrival only, and new tasks must be placed
around the existing schedule. Tasks which cannot be scheduled before their deadlines
are rejected. If a task is scheduled to begin at the current time step, the simulator moves
the task from the reserved list to the execution list.
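The arrival and ordering steps above can be sketched as follows, in Python rather than the simulator's C++. Knuth's multiplication method stands in for whatever Poisson sampler the simulator actually uses, and the dictionary task representation is hypothetical:

```python
import math
import random

def poisson_arrivals(lam, rng):
    """Number of task arrivals in one time step, drawn from a Poisson
    distribution with mean lam (Knuth's multiplication method)."""
    threshold = math.exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= threshold:
            return k
        k += 1

def edf_order(tasks):
    """Earliest-deadline-first ordering for simultaneously arriving tasks."""
    return sorted(tasks, key=lambda t: t["deadline"])
```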
After tasks have been scheduled, faults are injected into each PRR with probability
f in order to simulate the dynamic fault environment. At the end of the task’s execution,
the outcome of the task is determined based upon the number of faults encountered
(i.e., Simplex and DWC tasks fail with 1 fault, TMR tasks fail with faults in 2 or more
PRRs). Multiple faults within a single PRR have no additional effect on the system. If
the task fails, the scheduler then returns the task to the task queue to be re-scheduled.
All other reserved tasks (scheduled, but not yet executing) are also returned to the task
queue to be rescheduled. Fault-tolerant tasks are rescheduled until they successfully
complete or can no longer meet their deadline.
When tasks must be re-scheduled due to faults, they are treated as a new task for
scheduling purposes, although their original deadline is maintained. The fault-tolerant
mode for rescheduled tasks will be based on the fault rate at the time of rescheduling.
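The fault-injection and outcome rule can be expressed compactly. The function below is a modeling sketch, not the simulator's code; the per-replica Bernoulli draw reflects the assumption that a replica is simply faulty or not, since multiple faults within one PRR have no additional effect:

```python
import random

REPLICAS = {"simplex": 1, "dwc": 2, "tmr": 3}

def inject_and_judge(mode, f, rng):
    """Inject faults into each PRR used by a finishing task with per-execution
    probability f, then judge the outcome: Simplex and DWC executions fail
    with one faulty replica, while TMR fails only with faults in two or more
    PRRs. Returns True when the task completes successfully."""
    faulty = sum(1 for _ in range(REPLICAS[mode]) if rng.random() < f)
    if mode == "tmr":
        return faulty < 2   # voter outvotes a single faulty replica
    return faulty == 0      # any fault fails (DWC detects it; simplex does not)
```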
5.2.4 Analysis and Results
In the following analysis, task execution times (t_exec) are uniformly distributed in
[10, 100] time steps, with deadlines (t_deadline) in [100, 200] time steps. For simplicity, we
assume that a time step is 1 second. Simplex tasks use one processing region, DWC
tasks use two processing regions, and TMR tasks use three processing regions. The
simulated system uses 12 processing regions (N_PRRs) in order to enable flexibility in
placing TMR and DWC tasks. For each experiment, we measure the performance of
each metric with the scheduler’s guarantee ratio (percentage of total tasks scheduled
successfully) while attempting to schedule 100,000 tasks.
5.2.4.1 Constant fault rates
Initially, the simulator is used to get a fault-free baseline for comparison purposes.
Figure 5-4 shows the effect of arrival rate on the performance of the system. At
low arrival rates, Simplex, DWC, and TMR scheduling can all meet the system
demand. However, arrival rates higher than 0.06 tasks per second begin to impact
the schedulability of the TMR system because there are not enough resources to handle
all incoming tasks. Using DWC for fault tolerance will result in higher guarantee ratios
since more DWC tasks can be scheduled at any one time. The lack of a fault-tolerance
mechanism excludes the use of Simplex scheduling in the presence of faults.
In order to investigate the effect of fault rates on our system, we chose a constant
arrival rate of 0.075 tasks per second. With this arrival rate, the DWC system can
successfully schedule all incoming tasks, while the TMR system cannot. Using the
arrival rate in this way, we attempt to define a system which requires DWC to meet
performance demands but can temporarily use TMR to meet reliability constraints.
The effect on the guarantee ratio is shown in Figure 5-5. At low fault rates the DWC
mode provides higher throughput, while TMR outperforms DWC at high fault rates.
Our adaptive metric produces high throughput at low fault rates, closely tracking the
performance of DWC, but performs between TMR and DWC at intermediate fault rates.
At higher fault rates, the guarantee ratio of the adaptive heuristic produces results close
to TMR. From these constant-fault-rate results, an ideal-threshold heuristic can be
determined. The crossover fault rate for TMR and DWC occurs at 0.0025 faults/sec.
5.2.4.2 Dynamic fault-rate case studies
To benefit from the adaptive scheduling methods, the fault rates
experienced by the system must vary. We present two fault profiles which represent
patterns commonly seen in space missions. Figure 5-6 shows the fault profiles used
for the following analysis, based on the fault model in [Jacobs et al. 2012]. The first
profile is a sinusoidal pattern with a 90-minute period which is characteristic of fault
rates in Low-Earth Orbit (LEO). The second pattern (Burst) represents Highly-Elliptical
Orbits (HEO), where the system experiences low fault rates for most of the orbit, with
a large burst when making the closest approach to Earth, once every 12 hours. For
the following case studies, four different scheduling heuristics are examined. The
TMR-only and DWC-only heuristics will schedule every task in their respective mode.
The ideal-threshold heuristic will use the fault-rate threshold measured in the previous
section (0.0025 faults/sec) to choose between the DWC and TMR modes. The adaptive
heuristic uses Equation 5–4 to determine the FT mode for each task. Each heuristic will
be evaluated using an arrival rate of 0.075 tasks/sec and the same parameters used in
Section 5.2.4.1.
For the Sinusoidal case study, fault rates are low compared to the average task
execution time. For this fault profile, scheduling tasks with the DWC-only, ideal-threshold,
or adaptive heuristics provides an equivalent rejection ratio of 0.2%. There are enough
system resources to make re-computation of failed DWC tasks better than simply
using TMR to protect against all failures. The fault rate rarely gets high enough for the
threshold or adaptive heuristics to schedule tasks in TMR mode. The adaptive heuristic
performs well, reducing the number of rejected tasks over the TMR-only strategy by
94%, while maintaining a low average task latency of 8 seconds per task.
In the Burst case study, fault rates are low except during a short window of time with
extremely high fault rates. Unlike in the previous case study, the adaptive heuristics will
benefit from the large range of fault rates and each heuristic has different performance
characteristics. For this fault-rate profile, the adaptive heuristics perform the best, with
11% fewer rejected tasks than the DWC-only strategy and 48% fewer than the TMR-only
strategy. Additionally, the adaptive heuristic has only 3% more rejected tasks than the
ideal-threshold heuristic. TMR is only optimal when the high-fault-rate burst occurs.
Otherwise, DWC will better utilize the system resources. The Burst fault profile is ideal
for all dynamic metrics, since the two phases are highly separated.
5.2.4.3 Scheduling improvements
At extreme fault rates (high or low), the adaptive heuristic will schedule all incoming
tasks in the same mode. However, for moderate fault rates, both DWC and TMR tasks
will be scheduled depending on task computation time. One drawback to the dynamic
fault-tolerant selection metrics is the resource fragmentation that occurs when different
sized objects are placed on the FPGA fabric. This effect can produce schedules similar
to Figure 5-7, where fragmentation causes poor utilization of the available FPGA
resources. As tasks arrive to the system, they are placed in PRRs p_0 through p_5, in
either DWC or TMR mode. At time t_6 there are enough unused PRRs in the system
for a DWC task, but because the resources are not contiguous the task cannot be
placed. Unfortunately, the simplistic EDF scheduling heuristic currently in use does
not account for FPGA fragmentation. The reduced effectiveness of the adaptive metric in
Figure 5-5 can be explained by this FPGA fragmentation. By using a placement-aware
scheduler, the adaptive heuristic should perform closer to optimal for all fault rates. For
example, delaying the placement of task F_DWC until t_6 would enable a more compact
placement in PRRs p_2 and p_3, leaving space for the placement of an additional DWC
task. Alternatively, task F_DWC could be placed in PRRs p_4 and p_5 at time t_5, leaving
room for future DWC tasks in PRRs p2 and p3. An FPGA placement-aware scheduling
algorithm such as the Horizon or Stuffing scheduler [Steiger et al. 2004] should be
incorporated in order to improve the performance of the adaptive heuristic.
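Under the 1D area model, the fragmentation problem reduces to searching for a contiguous run of free PRRs. The helper below is a sketch; its name and the boolean free-map representation are our assumptions:

```python
def find_contiguous(free, n):
    """Return the start index of the first run of n contiguous free PRRs in a
    1D area model, or None if no such run exists even when the total number
    of free PRRs is at least n (i.e., the fabric is fragmented)."""
    run = 0
    for i, is_free in enumerate(free):
        run = run + 1 if is_free else 0
        if run == n:
            return i - n + 1
    return None
```

With free map [True, False, True] there are two free PRRs, yet a DWC task (n = 2) cannot be placed, which is exactly the situation depicted in Figure 5-7.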
Fault-rate lag is another possible problem. If a task is scheduled using a specific
mode but does not execute for a long period of time, a different FT mode may become
more appropriate. Limiting the scheduling window to only schedule a few tasks at a
time may prevent this lag. Alternatively, using a prediction of the future fault rate during
scheduling may reduce this effect. Finally, preemption enables the scheduler to return a
currently running task to the task queue in order to start a higher priority task. By adding
preemption capabilities to the scheduler, the guarantee ratio of all the tested metrics can
be improved.
5.3 Conclusions
In Phase 3 we have adapted the original bus-based RFT hardware architecture
for use with a PR architecture based on point-to-point connections (VAPRES) and
quantified the design tradeoffs. Additionally, by providing bus-based and point-to-point
hardware templates, the RFT architecture can now be more easily ported to future
systems. The additional parameterization of internal TMR components allows for user
customization of RFT features such as watchdog timers or voting logic configurations.
We have also presented a novel metric for determining optimal fault-tolerance
settings for reconfigurable fault-tolerant systems. An RFT scheduler and simulator were
developed in order to test the effectiveness of the adaptive scheduling heuristic and to
compare its performance to traditional static fault-tolerance strategies. When using our
adaptive FT strategy in Burst-like fault environments, we maintain system reliability while
reducing the number of rejected tasks by 48% compared to a static TMR fault-tolerance
strategy and 11% compared to static DWC. In the Sinusoidal case study, the static DWC,
ideal-threshold, and adaptive heuristics each reduce the number of rejected tasks by 94%
compared to the static TMR strategy. We have demonstrated that the adaptive heuristic performs similarly to
an optimal user-defined threshold, without the need for detailed system simulation and
measurement.
Table 5-1. RFT controller resource usage.
Module name      Slices used   BRAMs used
PLB Controller   345           0
FSL Controller   768           12
Table 5-2. Dynamic scheduling results.
Case study   FT metric   Guarantee ratio   Reject ratio   Avg. latency (s)
Sinusoidal   TMR-Only    0.970             0.030          51.2
Sinusoidal   DWC-Only    0.998             0.002          8.0
Sinusoidal   Ideal       0.998             0.002          7.9
Sinusoidal   Adaptive    0.998             0.002          8.2
Burst        TMR-Only    0.935             0.065          53.6
Burst        DWC-Only    0.962             0.038          9.9
Burst        Ideal       0.967             0.033          11.7
Burst        Adaptive    0.966             0.034          11.2
Figure 5-1. Research areas for Phase 3.
Figure 5-2. Comparison of RFT architectures. A) PLB-based architecture. B) FSL-based architecture.
Figure 5-3. Flowchart of scheduling simulator.
Figure 5-4. Effect of arrival rate on fault-free operation. (N_PRRs = 12, t_exec ∈ [10, 100], t_deadline ∈ [100, 200])
Figure 5-5. Effect of fault rate on task rejection. (N_PRRs = 12, t_exec ∈ [10, 100], t_deadline ∈ [100, 200], λ_arrival = 0.075)
Figure 5-6. Fault-rate profile for case studies.
Figure 5-7. Resource fragmentation from adaptive placement.
CHAPTER 6
CONCLUSIONS
In this research, a comprehensive framework for providing reconfigurable fault
tolerance for FPGA-based space systems has been developed. The framework features
three primary areas each addressed by one of the research phases presented in this
document. In Phase 1, the initial RFT framework consisting of a hardware architecture,
fault-rate model, and a performability model is presented. The hardware architecture
was developed and implemented on a Virtex-5 platform enabling reconfigurable
fault tolerance with support for TMR, DWC, or user-defined fault-tolerance modes.
Additionally, RFT fault-rate and performability models were used to predict the
performance and reliability of an RFT system in multiple specified orbits. For highly-elliptical
orbits, adaptive fault-tolerance strategies were shown to increase system performability
by 128% over TMR while reducing unavailability by 85% compared to ABFT. Fault injection was
used to validate the reliability of the architecture and the accuracy of the performance
model (1.5% error). In Phase 2, an in-depth reliability and overhead analysis of FPGA
designs using ABFT is presented. By identifying fault tolerance techniques with low
overhead and high reliability, a spectrum of reliability and performance characteristics
become available for RFT systems, enabling system flexibility. Fault-injection testing of
matrix multiplication and Fast Fourier Transform FPGA designs show that ABFT can
reduce design vulnerability by up to 98% with less than 25% overhead. Selectively
applying redundancy (TMR) can allow for even more reliable designs. In Phase 3,
methods for integrating the RFT framework with pre-existing PR systems, and making
the fault-tolerance features of the architecture easier to use, are demonstrated. RFT
hardware is dynamically generated based on individual system parameters.
The adaptive scheduling heuristic simplifies the decision to use a specific RFT
fault-tolerant mode and enables the use of environmentally-aware task schedulers.
Combined, these three phases of research provide an RFT framework which is capable
of providing adaptive fault-tolerance to existing FPGA systems, enabling their possible
use as space systems.
The contributions of Phase 1 include the RFT hardware architecture, the orbital
fault-rate model, and the phased-mission performability model. The fault-rate and
performability models are applicable to many space systems, enabling a time-varying
fault model where only static, time-averaged models are normally considered. From
Phase 2, the analysis of ABFT reliability on FPGA architectures demonstrates the
usefulness of ABFT as a reliable, low-overhead alternative technique in low-to-medium
fault-rate environments. The design techniques discovered in this research will promote
the use of ABFT as an alternative fault-tolerance technique for space systems, enabling
higher performance, lower power consumption, and lower costs by reducing the number
of processors needed to perform onboard data processing. The contributions from
Phase 3 facilitate the use of RFT techniques in future systems by partially automating
the process of fault-tolerant hardware design, allowing system designers to focus their
efforts on other parts of their potential system. Optimizing the selection of fault-tolerance
modes through fault-rate prediction and scheduler heuristics will enable systems to
maintain high performability automatically.
REFERENCES
ACREE, R., ULLAH, N., KARIA, A., RAHMEH, J., AND ABRAHAM, J. 1993. An object-oriented approach for implementing algorithm-based fault tolerance. In Twelfth Annual International Phoenix Conference on Computers and Communications. 210–216.

ACTEL. 2010a. Actel product page. http://www.actel.com/products/milaero/rtsxsu/default.aspx.

ACTEL. 2010b. Actel product page. http://www.actel.com/products/milaero/rtpa3/default.aspx.

ALAM, M., SONG, M., HESTER, S., AND SELIGA, T. 2006. Reliability analysis of phased-mission systems: a practical approach. In Annual Reliability and Maintainability Symposium, 2006. RAMS '06. 551–558.

ALNAJJAR, D., KO, Y., IMAGAWA, T., KONOURA, H., HIROMOTO, M., MITSUYAMA, Y., HASHIMOTO, M., OCHI, H., AND ONOYE, T. 2009. Coarse-grained dynamically reconfigurable architecture with flexible reliability. In International Conference on Field Programmable Logic and Applications, 2009. FPL 2009. 186–192.

ALTERA. 2010. Stratix V FPGAs: Ultimate Flexibility Through Partial and Dynamic Reconfiguration. http://www.altera.com/products/devices/stratix-fpgas/stratix-v/overview/partial-reconfiguration/stxv-part-reconfig.html.

ARNDT, O., FREISLEBEN, B., KIELMANN, T., AND THILO, F. 2000. A comparative study of online scheduling algorithms for networks of workstations. Cluster Computing 3, 95–112.

BANERJEE, S., BOZORGZADEH, E., AND DUTT, N. 2005. Physically-aware hw-sw partitioning for reconfigurable architectures with partial dynamic reconfiguration. In Design Automation Conference, 2005. Proceedings. 42nd. 335–340.

CARMICHAEL, C., FULLER, E., BLAIN, P., AND CAFFREY, M. 1999. SEU mitigation techniques for Virtex FPGAs in space applications. In 2nd Annual Military and Aerospace Applications of Programmable Devices and Technologies Conference. Laurel, MD.

CHOWDHURY, A.-R. AND BANERJEE, P. 1996. A new error analysis based method for tolerance computation for algorithm-based checks. Computers, IEEE Transactions on 45, 2, 238–243.

CIARDO, G., MARIE, R., SERICOLA, B., AND TRIVEDI, K. 1990. Performability analysis using semi-Markov reward processes. IEEE Transactions on Computers 39, 10, 1251–1264.
CIESLEWSKI, G., GEORGE, A., AND JACOBS, A. 2010. Acceleration of FPGA fault injection through multi-bit testing. In 2010 Engineering of Reconfigurable Systems and Algorithms.

DAVE, N., FLEMING, K., KING, M., PELLAUER, M., AND VIJAYARAGHAVAN, M. 2007. Hardware acceleration of matrix multiplication on a Xilinx FPGA. In Formal Methods and Models for Codesign, 2007. MEMOCODE 2007. 5th IEEE/ACM International Conference on. 97–100.

DAWOOD, A., VISSER, S., AND WILLIAMS, J. 2002. Reconfigurable FPGAs for real time image processing in space. In 14th International Conference on Digital Signal Processing, 2002. DSP 2002. Vol. 2. 845–848.

DOBIAS, R., KUBALIK, P., AND KUBATOVA, H. 2005. Dependability computations for fault-tolerant system based on FPGA. In 12th IEEE International Conference on Electronics, Circuits and Systems (ICECS). 1–4.

FLATLEY, T. 2010. Advanced hybrid on-board science data processor - SpaceCube 2.0. Earth Science Technology Forum.

GANO, S. 2010. JSatTrak. http://www.gano.name/shawn/JSatTrak/index.html.

GARVIE, M. AND THOMPSON, A. 2004. Scrubbing away transients and jiggling around the permanent: long survival of FPGA systems through evolutionary self-repair. In 10th IEEE International On-Line Testing Symposium (IOLTS). 155–160.

GUPTA, A., NOOSHABADI, S., TAUBMAN, D., AND DYER, M. 2006. Realizing low-cost high-throughput general-purpose block encoder for JPEG2000. IEEE Transactions on Circuits and Systems for Video Technology 16, 7, 843–858.

HAN, C.-C., SHIN, K., AND WU, J. 2003. A fault-tolerant scheduling algorithm for real-time periodic tasks with possible software faults. Computers, IEEE Transactions on 52, 3, 362–372.

HOOTS, F. R. AND ROEHRICH, R. L. 1980. SPACETRACK REPORT NO. 3 - Models for Propagation of NORAD Element Sets. http://celestrak.com/NORAD/documentation/spacetrk.pdf.

HSUEH, M. AND CHANG, C.-I. 2008. Field programmable gate arrays (FPGA) for pixel purity index using blocks of skewers for endmember extraction in hyperspectral imagery. Int. J. High Perform. Comput. Appl. 22, 408–423.

HUANG, K.-H. AND ABRAHAM, J. 1984. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers C-33, 6, 518–528.

JACOBS, A., CIESLEWSKI, G., GEORGE, A. D., GORDON-ROSS, A., AND LAM, H. 2012. Reconfigurable fault tolerance: A comprehensive framework for reliable and adaptive FPGA-based space computing. ACM Trans. Reconfigurable Technol. Syst. 5, 4, 21:1–21:30.
JACOBS, A., CONGER, C., AND GEORGE, A. 2008. Multiparadigm space processing forhyperspectral imaging. In Aerospace Conference, 2008 IEEE. 1 –11.
JARA-BERROCAL, A. AND GORDON-ROSS, A. 2010. VAPRES: A virtual architecture forpartially reconfigurable embedded systems. In Design, Automation Test in EuropeConference Exhibition (DATE), 2010. 837 –842.
JOHNSON, J., HOWES, W., WIRTHLIN, M., MCMURTREY, D., CAFFREY, M., GRAHAM, P.,AND MORGAN, K. 2008. Using duplication with compare for on-line error detection inFPGA-based designs. In 2008 IEEE Aerospace Conference. 1–11.
KARNIK, T. AND HAZUCHA, P. 2004. Characterization of soft errors caused by singleevent upsets in CMOS processes. IEEE Transactions on Dependable and SecureComputing 1, 2, 128 – 143.
KERNIGHAN, B. AND LIN, S. 1970. An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal 49, 2, 291–307.
KIM, K. AND PARK, K. 1994. Phased-mission system reliability under Markov environment. IEEE Transactions on Reliability 43, 2, 301–309.
KYRIAKOULAKOS, K. AND PNEVMATIKATOS, D. 2009. A novel SRAM-based FPGA architecture for efficient TMR fault tolerance support. In International Conference on Field Programmable Logic and Applications (FPL), 2009. 193–198.
LAPRIE, J.-C., ARLAT, J., BEOUNES, C., AND KANOUN, K. 1990. Definition and analysis of hardware- and software-fault-tolerant architectures. Computer 23, 7, 39–51.
LE, C., CHAN, S., CHENG, F., FANG, W., FISCHMAN, M., HENSLEY, S., JOHNSON, R., JOURDAN, M., MARINA, M., PARHAM, B., ROGEZ, F., ROSEN, P., SHAH, B., AND TAFT, S. 2004. Onboard FPGA-based SAR processing for future spaceborne systems. In Proceedings of the IEEE Radar Conference, 2004. 15–20.
MACMILLAN, S. AND MAUS, S. 2010. IGRF10 Model Coefficients for 1945-2010.http://modelweb.gsfc.nasa.gov/magnetos/igrf.html.
MAUS, S., MACMILLAN, S., CHERNOVA, T., CHOI, S., DATER, D., GOLOVKOV, V., LESUR, V., LOWES, F., LÜHR, H., MAI, W., MCLEAN, S., OLSEN, N., ROTHER, M., SABAKA, T., THOMSON, A., AND ZVEREVA, T. 2005. The 10th generation international geomagnetic reference field. Physics of The Earth and Planetary Interiors 151, 3-4, 320–322.
MEI, B., SCHAUMONT, P., AND VERNALDE, S. 2000. A hardware-software partitioning and scheduling algorithm for dynamically reconfigurable embedded systems. In Proceedings of ProRISC. 405–411.
MENTOR GRAPHICS. 2013. Precision Hi-Rel Technology Overview. http://www.mentor.com/products/fpga/multimedia/overview/precision-hi-rel-technology-overview.
MEYER, J. 1982. Closed-form solutions of performability. IEEE Transactions on Computers C-31, 7, 648–657.
MISHRA, A. AND BANERJEE, P. 2003. An algorithm-based error detection scheme for the multigrid method. IEEE Transactions on Computers 52, 9, 1089–1099.
MORGAN, K., MCMURTREY, D., PRATT, B., AND WIRTHLIN, M. 2007. A comparison of TMR with alternative fault-tolerant design techniques for FPGAs. IEEE Transactions on Nuclear Science 54, 6, 2065–2072.
PATHAN, R. 2006. Fault-tolerant real-time scheduling algorithm for tolerating multiple transient faults. In International Conference on Electrical and Computer Engineering (ICECE), 2006. 577–580.
PRATT, B., CAFFREY, M., GRAHAM, P., MORGAN, K., AND WIRTHLIN, M. 2006. Improving FPGA design robustness with partial TMR. In 44th Annual IEEE International Reliability Physics Symposium Proceedings, 2006. 226–232.
PRATT, B., WIRTHLIN, M., CAFFREY, M., GRAHAM, P., MORGAN, K., QUINN, H., AND SHELLEY, S. 2007. Improving FPGA reliability in harsh environments using triple modular redundancy with more frequent voting. In Military and Aerospace FPGA Applications.
RAO, T. AND FUJIWARA, E. 1989. Error-Control Coding for Computer Systems. Prentice-Hall.
RATTER, D. 2004. FPGAs on Mars. Xcell Journal, 8–11.
ROY-CHOWDHURY, A. AND BANERJEE, P. 1993. Tolerance determination for algorithm-based checks using simplified error analysis techniques. In The Twenty-Third International Symposium on Fault-Tolerant Computing (FTCS-23), Digest of Papers. 290–298.
ROY-CHOWDHURY, A., BELLAS, N., AND BANERJEE, P. 1996. Algorithm-based error-detection schemes for iterative solution of partial differential equations. IEEE Transactions on Computers 45, 4, 394–407.
SAHNER, R. A. AND TRIVEDI, K. S. 1987. Reliability modeling using SHARPE. IEEE Transactions on Reliability R-36, 2, 186–193.
SHIM, B., SRIDHARA, S., AND SHANBHAG, N. 2004. Reliable low-power digital signal processing via reduced precision redundancy. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 12, 5, 497–510.
SILVA, J., PRATA, P., RELA, M., AND MADEIRA, H. 1998. Practical issues in the use of ABFT and a new failure model. In Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing. 26–35.
STEIGER, C., WALDER, H., AND PLATZNER, M. 2004. Operating systems for reconfigurable embedded platforms: online scheduling of real-time tasks. IEEE Transactions on Computers 53, 11, 1393–1407.
SWIFT, G., ALLEN, G., TSENG, C. W., CARMICHAEL, C., MILLER, G., AND GEORGE, J. 2008. Static upset characteristics of the 90nm Virtex-4QV FPGAs. In IEEE Radiation Effects Data Workshop. 98–105.
TAO, D. AND HARTMANN, C. 1993. A novel concurrent error detection scheme for FFT networks. IEEE Transactions on Parallel and Distributed Systems 4, 2, 198–221.
TROXEL, I., FEHRINGER, M., AND CHENOWETH, M. 2008. Achieving multipurpose space imaging with the ARTEMIS reconfigurable payload processor. In 2008 IEEE Aerospace Conference. 1–8.
TYLKA, A., ADAMS, J. H., JR., BOBERG, P., BROWNSTEIN, B., DIETRICH, W., FLUECKIGER, E., PETERSEN, E., SHEA, M., SMART, D., AND SMITH, E. 1997. CREME96: A revision of the cosmic ray effects on micro-electronics code. IEEE Transactions on Nuclear Science 44, 6, 2150–2160.
WANG, J. 2003. Radiation effects in FPGAs. In 9th Workshop on Electronics for LHC Experiments.
WANG, S.-J. AND JHA, N. 1994. Algorithm-based fault tolerance for FFT networks. IEEE Transactions on Computers 43, 7, 849–854.
WILLIAMS, J., MASSIE, C., GEORGE, A. D., RICHARDSON, J., GOSRANI, K., AND LAM, H. 2010. Characterization of fixed and reconfigurable multi-core devices for application acceleration. ACM Transactions on Reconfigurable Technology and Systems 3, 19:1–19:29.
WU, G., DOU, Y., AND WANG, M. 2010. High performance and memory efficient implementation of matrix multiplication on FPGAs. In International Conference on Field-Programmable Technology (FPT), 2010. 134–137.
XILINX. 2004. XTMR Tool User Guide. Xilinx User Guide UG156.
XILINX. 2010a. Partial Reconfiguration User Guide. Xilinx User Guide UG702.
XILINX. 2010b. SEU Strategies for Virtex-5 Devices. Xilinx Application Note XAPP864.
XILINX. 2010c. Space-Grade Virtex-4QV Family Overview. Xilinx Product SpecificationDS653.
XILINX. 2013a. Xilinx CORE Generator System. Xilinx CORE Generator Product Page,http://www.xilinx.com/tools/coregen.htm.
XILINX. 2013b. Xilinx Soft Error Mitigation (SEM) Core. http://www.xilinx.com/products/intellectual-property/SEM.htm.
YAO, E., WANG, R., CHEN, M., TAN, G., AND SUN, N. 2012. A case study of designing efficient algorithm-based fault tolerant application for exascale parallelism. In 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2012. 438–448.
ZHUO, L. AND PRASANNA, V. 2004. Scalable and modular algorithms for floating-point matrix multiplication on FPGAs. In Proceedings of the 18th International Parallel and Distributed Processing Symposium, 2004. 92.
BIOGRAPHICAL SKETCH
Adam Jacobs earned his Bachelor of Science degree in electrical engineering from
the University of Florida in 2005. After graduation, he participated in an internship at
Honeywell International in Clearwater, FL before returning to the University of Florida
for graduate studies. Adam received his Master of Science degree in electrical and
computer engineering in 2007 before joining the doctoral program.
While pursuing his degree, Adam worked as a research assistant in the High-Performance
Computing and Simulation (HCS) Research Lab and the NSF Center for High-Performance
Reconfigurable Computing (CHREC). In support of his studies, Adam interned at
Goddard Space Flight Center in 2010, gaining experience in embedded processing
systems for space. After graduation, he will be moving to Austin, TX, where he has
accepted a position in the processor design group of ARM.