54
Rochester Institute of Technology Rochester Institute of Technology RIT Scholar Works RIT Scholar Works Theses 5-2014 Reliable and Fault-Resilient Schemes for Efficient Radix-4 Reliable and Fault-Resilient Schemes for Efficient Radix-4 Complex Division Complex Division Niranjan Manoharan Follow this and additional works at: https://scholarworks.rit.edu/theses Recommended Citation Recommended Citation Manoharan, Niranjan, "Reliable and Fault-Resilient Schemes for Efficient Radix-4 Complex Division" (2014). Thesis. Rochester Institute of Technology. Accessed from This Master's Project is brought to you for free and open access by RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact [email protected].

Reliable and Fault-Resilient Schemes for Efficient Radix-4

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Reliable and Fault-Resilient Schemes for Efficient Radix-4

Rochester Institute of Technology Rochester Institute of Technology

RIT Scholar Works RIT Scholar Works

Theses

5-2014

Reliable and Fault-Resilient Schemes for Efficient Radix-4 Reliable and Fault-Resilient Schemes for Efficient Radix-4

Complex Division Complex Division

Niranjan Manoharan

Follow this and additional works at: https://scholarworks.rit.edu/theses

Recommended Citation Recommended Citation Manoharan, Niranjan, "Reliable and Fault-Resilient Schemes for Efficient Radix-4 Complex Division" (2014). Thesis. Rochester Institute of Technology. Accessed from

This Master's Project is brought to you for free and open access by RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact [email protected].

Page 2: Reliable and Fault-Resilient Schemes for Efficient Radix-4

Reliable and Fault-Resilient Schemes for Efficient Radix-4 Complex

Division

by

Niranjan Manoharan

A Gradute Paper Submitted

in

Partial Fulfillment

of the

Requirements for the Degree of

Master of Science

in

Electrical Engineering

Approved by:

Prof.(Dr. Mehran Mozaffari-Kermani)

Prof.(Dr. Sohail A. Dianat)

Department of Electrical and Microelectronic Engineering

College of Engineering

Rochester Institute of Technology

Rochester, New York

May, 2014

Page 3: Reliable and Fault-Resilient Schemes for Efficient Radix-4

Acknowledgements

I am grateful to my advisor, Dr. Mehran Mozaffari-Kermani, whose support and guidance

helped me to understand the concepts clearly and complete my Masters. Special thanks

also go to other professors, who helped me in my field of study, Dr. Dorin Patru, Dr.

Marcin Lukowiak and Dr. Reza Azarderakhsh, for their constructive comments and

participation.

This graduate paper is dedicated to my parents.

Page 4: Reliable and Fault-Resilient Schemes for Efficient Radix-4

Declaration

I hereby declare that except where specific reference is made to the work of others, the

contents of this dissertation are original and have not been submitted in whole or in part

for consideration for any other degree or qualification in this, or any other University.

This dissertation is the result of my own work and includes nothing which is the outcome

of work done in collaboration, except where specifically indicated in the text.

Niranjan Manoharan

2014

Page 5: Reliable and Fault-Resilient Schemes for Efficient Radix-4

Abstract

Complex division is commonly used in various applications in signal processing and con-

trol theory including astronomy and nonlinear RF measurements. Nevertheless, unless

reliability and assurance are embedded into the architectures of such structures, the sub-

optimal (and thus erroneous) results could undermine the objectives of such applications.

As such, in this thesis, we present schemes to provide complex number division archi-

tectures based on (Sweeney, Robertson, and Tocher) SRT-division with fault diagnosis

mechanisms. Different fault resilient architectures are proposed in this thesis which can

be tailored based on the eventual objectives of the designs in terms of area and time re-

quirements, among which we pinpoint carefully the schemes based on recomputing with

shifted operands (RESO) to be able to detect both natural and malicious faults and with

proper modification achieve high throughputs. The design also implements a minimized

look up table approach which favors in error detection based designs and provides high

fault coverage with relatively-low overhead. Additionally, to benchmark the effective-

ness of the proposed schemes, extensive fault diagnosis assessments are performed for the

proposed designs through fault simulations and FPGA implementations; the design is

implemented on Xilinx Spartan-VI and Xilinx Virtex-VI FPGA families.

Page 6: Reliable and Fault-Resilient Schemes for Efficient Radix-4

Contents

Contents ii

List of Figures iv

List of Tables v

1 Introduction 1

1.1 Complex Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Fault Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2.1 Faults and Degradation . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2.2 Fault Detection Techniques . . . . . . . . . . . . . . . . . . . . . 4

1.2.2.1 Concurrent Error Detection . . . . . . . . . . . . . . . . 4

1.2.2.2 Off-Line Fault Detection . . . . . . . . . . . . . . . . . . 5

1.2.2.3 Roving Fault Detection . . . . . . . . . . . . . . . . . . 6

1.3 Reconfigurable Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.4 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.5 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2 Preliminaries 11

2.1 Hardware Based Dividers . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2 Radix-β SRT Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.3 Golub’s Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3 Presented Complex Division Radix-4 SRT Module 18

Page 7: Reliable and Fault-Resilient Schemes for Efficient Radix-4

Contents iii

3.1 ROM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4 Proposed Error Detection Schemes 22

4.1 Unified Parity Check and Hardware Redundancy . . . . . . . . . . . . . 23

4.1.1 Golub’s Multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.1.2 Datapath Registers . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.1.3 ROM (Quotient Selection Logic) . . . . . . . . . . . . . . . . . . . 27

4.1.4 Carry Save Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.2 Error Detection through RESO . . . . . . . . . . . . . . . . . . . . . . . 28

5 Error Simulations 31

6 FPGA Implementations and Benchmark 33

7 Conclusion 36

References 37

Page 8: Reliable and Fault-Resilient Schemes for Efficient Radix-4

List of Figures

2.1 Dot representation of basic hardware division. . . . . . . . . . . . . . . . 13

2.2 New shifted remainder and old shifted remainder in radix-β (Robertson

diagram). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.3 p-d plot quotient selection for radix-βdivision. . . . . . . . . . . . . . . . 15

2.4 The quotient selection scale. . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.5 Golub’s multiplication. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.1 Presented complex radix-4 SRT module. . . . . . . . . . . . . . . . . . . 19

3.2 Quotient selection logic. . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.3 Minimized ROM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.1 Radix-4 complex SRT module with unified parity check and hardware re-

dudancy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4.2 CED Multiplier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

4.3 Parity check registers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

4.4 Minimized ROM with the proposed parity prediction logic. . . . . . . . . 27

4.5 Full added with parity check. . . . . . . . . . . . . . . . . . . . . . . . . 28

4.6 Proposed divider with RESO for fault diagnosis. . . . . . . . . . . . . . . 30

Page 9: Reliable and Fault-Resilient Schemes for Efficient Radix-4

List of Tables

6.1 FPGA Implementation results of the original complex division module and

the proposed error detection designs on Spartan-6 FPGA device xc6s1x16. 34

6.2 FPGA Implementation results of the original complex division module and

the proposed error detection designs on Virtex-6 FPGA device xc6vlx75t. 34

Page 10: Reliable and Fault-Resilient Schemes for Efficient Radix-4

Chapter 1

Introduction

1.1 Complex Division

Complex division is a critical mathematical operation with applications in various fields

such as signal processing, control theory, microwave systems, quantum mechanics, and the

like. Control theory uses complex division arithmetic to find the root locus [1], Nyquist

plot [2], and Nichols plot [3]. Microwave systems also use complex division arithmetic to

find the frequency response [4] and transfer functions. Because of its complexity, the oper-

ation has been mainly implemented in software [5], [6]. This has been further improved by

using different algorithms [7], [8] to prevent overflows and provide precise results. Other

optimizations [9] have been proposed to make use of the fused multiply–add (FMA)

instructions available on different processors to improve the component-wise accuracy.

A technique for high radix complex division has been proposed in [10], this approach

is based on operand prescaling and digit recurrence. Such an algorithm has later been

implemented on FPGAs with different radices [11], [12]. Furthermore, a radix-16 com-

bined complex divider/square root module has also been presented in [13] based on the

same algorithm. Moreover, a complex divider has been presented in [14] which uses the

standard formula for complex division but uses an optimized architecture to reduce area

and improve the operating frequency. Another complex divider is implemented using

the coordinate rotational digital computer (CORDIC)-like algorithms [15]. There exist

Page 11: Reliable and Fault-Resilient Schemes for Efficient Radix-4

1.2 Fault Detection 2

other complex division techniques which are based on complex binary number system

(CBNS) [16], where complex numbers are represented in binary. Additionally, a divi-

sion algorithm based on [16] has been implemented in [17]. This design has drawback in

terms of accuracy as it is based on a new system and extensive research is still needed to

reach more accurate results. There has also been a complex division scheme which uses

dichotomous coordinate descent (DCD) algorithm to calculate the results by converting

the division operation to a system of linear equations [18]. Implementing this technique is

very complicated and area inefficient. It is noted that none of the aforementioned systems

provides reliability and hardware assurance for the underlying architectures. Indeed, in

the presence of defects in very-large-scale integration (VLSI) architectures of such im-

portant computer arithmetic calculations, erroneous outputs resulting from sub-optimal

reliability assurance could undermine the respective eventual objectives.

1.2 Fault Detection

Fault is a problem that results in a complete failure of a piece of equipment, or even

involves specific hardware. A problem in digital system can be defined as a bit inversion

in digital hardware, i.e., 0 to 1 or 1 to 0. As technology becomes scaled, manufacturing

large defect-free integrated circuits becomes difficult. There is also the issue of device

degradation over large periods of time. ASICs and FPGAs are primarily affected by these

degradation issues which make them less reliable over time. Future FPGAs, beyond the

45nm technology will have low reliability such that fault tolerance or other recovery

methods will be unavoidable in large FPGAs. This section provides some insight on

some common faults and degradation. FPGAs are highly reconfigurable; this provides

interesting opportunities for fault detection and tolerance.

1.2.1 Faults and Degradation

Digital circuits incur degradation in many ways [19], some of the leading ways are as

follows:

Page 12: Reliable and Fault-Resilient Schemes for Efficient Radix-4

1.2 Fault Detection 3

• The hot-carrier effect leads to a buildup of trapped charges in the gate-channel

interface region [20]. This leads to degradation in electron mobility and increased

threshold voltage in CMOS transistors. This in turn leads to reduced switching

speeds and hence leads to increased delays. This phenomenon is also caused as

a result of negative-bias temperature instability, which exhibits a similar behavior

[21].

• Electromigration is a phenomenon in which metal ions migrate, which leads to a lack

of holes and voids in interconnect. Eventually these can cause faults by creation of

shorted circuits or open circuits [22].

• Time-Dependent Dielectric Breakdown causes an increase in leakage current by

affecting the transistor gates, eventually this leads to a short circuit. The reason

for this is charge trap creation within the gate dielectrics which diminishes the

potential barrier if forms [23, 24]

In addition to degradation, there are a couple of other faults that can affect FPGAs.

These are highly relevant to this thesis because the proposed techniques uses these fault

models.

• Manufacturing defects can affect circuit nodes which cause a stuck-at-0 or 1 or they

may switch slowly and cause a timing issue. Interconnect networks can also be

affected, leading to short or open circuits and open or closed transistors [25].

• The most common type of fault in FPGAs comprises of Single Event Upsets (SEUs)

and Single Event Transient (SETs) caused by certain types of radiation. These types

of faults primarily affect circuits in aviation, space applications and nuclear research

where devices are subjected to higher levels of radiation. Another most common

and frequently occuring type of fault is in SRAM cells, where particular cells maybe

flipped. This error remains in the memory until a refresh occurs; this process is

known as scrubbing. Scrubbing may not be relevant during permanent faults but

is highly useful during transient faults.

Page 13: Reliable and Fault-Resilient Schemes for Efficient Radix-4

1.2 Fault Detection 4

1.2.2 Fault Detection Techniques

Fault detection primarily has two purposes; alerting the supervising process that action

needs to be taken for the system to remain operational and secondly, the defective com-

ponents are identified so that it can be repaired. Usually, these two stages are covered

simultaneously or it can have more than one stage comprising of different strategies. Fault

detection strategies can be categorized into three types:

• Redundant/concurrent error detection: This technique uses additional circuitry to

detect a potential fault/error. The most frequently used techniques are parity

detection and hardware redundancy.

• Off-line test methods: This methodology uses external circuitry to detect faults in

an FPGA/ASIC when it is not in operation. Some examples of off-line test circuits

are Built-In-Self-Test (BIST) and Automated-Test-Pattern-Generator (ATPG).

• Roving test methods: These techniques take a complicated approach but are useful

in pinpointing a faulty location in a FPGA circuit. Roving performs a scan of

the entire FPGA structure and checks for defects by replacing them with a test

function.

1.2.2.1 Concurrent Error Detection

Error detection of this kind is mainly used to detect errors due to SEUs and SETs. These

detection methods as mentioned before involve adding logic to the original module to

implement the detection mechanism. In the event of an error, there is a disagreement

between the included logic and the original module, over which a particular calculation

is processed and this is indicated by some sort of error detection mechanism, usually

a comparator. One of the simplest and most used methods is modular redundancy. A

functional block is replicated, usually two or three times. The outputs of these replicated

modules are compared; any differences in the results trigger the error mechanism. Con-

current Error Detection (CED) allows a more area efficient design compared to modular

redundancy. The data lines and registers are widened to store error coding algorithms

Page 14: Reliable and Fault-Resilient Schemes for Efficient Radix-4

1.2 Fault Detection 5

such as parity. Error validation algorithms are included at the outputs of these lines or

registers to detect faults (parity check registers).

The major drawback of modular redundancy is the additional overhead required for

its implementation, which can be three times in case of TMR [26] or more in NMR. In

addition to this, it provides very limited resolution for the identification of the faulty

component. The fault detection can only be limited to a particular block which imple-

ments modular redundancy. This can be overcome by breaking down the circuit and

adding additional error detection logic, with the expense of area. In CED, the efficiency

of error coverage comes at the expense of additional area. These methods do not provide

coverage of multiple SEUs or SETs.

Error coverage through redundancy does not have to be restricted to circuit area. It

is possible to detect errors with the trade-off with latency/throughput. [27], [28] proposes

a method where operations are carried out twice. During the second run the operations

are encoded in a different way so that they yield a different output. The output of both

runs are passed through a suitable decoder and then compared at the output. In certain

cases, data-checking (parity) and redundant systems are incorporated into an FPGA

configuration. [29] implements a FPGA system with built in redundancy.

1.2.2.2 Off-Line Fault Detection

One of the most widely used techniques for fault detection is off-line fault detection.

This is usually used to identify manufacturing defects in FPGAs. Any circuit which does

this without the help of an external circuit, i.e., built into the FPGA but not a part of

the original design. These techniques work by loading one or more test patterns into

the FPGA. The entire system consists of a test pattern generator, an output response

analyzer and between them the device under test. To include full coverage the system

will have to test not only the logic and interconnect, but also the configuration network.

Many recent consumer grade FPGAs have this built into the development boards. This

eliminates the need of a large number of different test configurations. Compared to ASICs,

FPGAs have the advantage of a fixed reconfigurable structure, which removes the need of

Page 15: Reliable and Fault-Resilient Schemes for Efficient Radix-4

1.2 Fault Detection 6

dedicated test structures to be built into the circuit. As FPGAs are reconfigurable, test

patterns should be optimized so that they can be used for various designs. The major

advantage of BIST is that it does not interfere with normal operation of the FPGA. It

also covers complicated systems such as PLLs and clock networks. The major drawback

of BIST is that it can detect faults only when the circuit is not operational, i.e., only

when a dedicated test mode is run, it can detect faults. This is usually done at system

startup or when an error event triggers the BIST check.

Recent trends use BIST to test individual LUT [30–32] properties like timing and

stuck-at-faults. [33] uses a BIST system to test interconnects which reduces time through

a large degree. In [25] and [34] a hierarchical approach is used to locate stuck-at faults,

short circuits and open circuits with high accuracy.

1.2.2.3 Roving Fault Detection

This technique exploits run-time configuration to adapt BIST techniques on-line, with

minimum area overhead. In roving detection, the FPGA is split equally into a certain

number of regions. One of the regions is subjected to BIST testing while the rest carry

out the desired operation. Over time the region under test is switched with the opera-

tional region, in this order the entire FPGA array is scanned for faults. This method is

very popular in partial reconfiguration. Roving techniques have lower area compared to

modular techniques; the overhead comprises of just one self-test region and a controller to

manage the reconfigurable (swapping) process and better than off-line methods because

the circuit can be operational. The speed, while better than off-line testing methods, it

is not comparable to redundancy techniques. The detection speed of a roving technique

depends on the speed of the roving cycle and the operation time. The best roving tests

are reported to have latency of less than one second [35].

In roving tests, performance is impacted in two ways. Firstly, the connections between

adjacent functional areas are stretched as the test region is moved though the FPGA,

resulting in longer signal delays which in turn affect the system clock speed, in the range

of 2.5% to 15% [36]. Secondly, in current FPGAs the functional blocks are halted as

Page 16: Reliable and Fault-Resilient Schemes for Efficient Radix-4

1.3 Reconfigurable Hardware 7

they are swapped. It is reported that a 250 µs pause is required between swaps. Most

of the initial testing in the field of roving test has been carried out by Emmert, Stroud

and Abramovici [35, 36]. The system called Roving STARs uses two test sections one for

rows and other for columns. Another variation of roving test was proposed in [37] which

uses buses rather than segmented interconnects. Though, this system has no impact on

system clock, it limits the application base.

1.3 Reconfigurable Hardware

Hardware modules containing a large number of switching-circuit components are called

reconfigurable hardware modules. The modules in general are called field programmable

gate arrays (FPGAs), general FPGAs contain an array of simple logic cells, some storage,

and interconnection path in each cell that can be configured to perform specific functions

by loading a predetermined bit pattern. The predetermined bit patterns are loaded

into a layer called a configurable logic block (CLB) which rests below the logic layer.

This layer is responsible for the connection pattern between different components in the

FPGA. Once the bit pattern is loaded into the CLB, the actual customization is very fast

compared to the standard process of producing custom ICs. This allows the user to use

the same hardware for different implementations. Even though, the designs presented in

this material are suitable for ASIC, FPGAs have gained popularity in recent years, hence

this is an attractive option to implement the proposed designs.

Some notable FPGA implementations: Girau and Tisserand [38] implemented mul-

tilayer perceptron back-propagation algorithms. Tisserand and Dimmler [39] designed

real-time digital controllers. Tenca and Ercegovac [40] designed a variable long-precision

arithmetic unit. Tisserand, Marchal, and Piguet [41] developed an on-line arithmetic

based FPGA architecture for low power custom computing. Mosanya and Sanchez [42]

implemented a generalized profile search. Lau, et. al [43] implemented signal processing

algorithms for Fast Fourier Transform and Discrete Cosine Transform.

Page 17: Reliable and Fault-Resilient Schemes for Efficient Radix-4

1.4 Objectives 8

1.4 Objectives

The numbers of soft errors occurring in digital circuits is rising due to a number of relia-

bility challenges including reductions in feature sizes and supply voltages. Counteracting

natural faults has been a subject for a number of hardware architectures in different do-

mains. For cryptographic architectures, many research works have been carried out to

achieve reliable and fault-immune structures, see, for instance, [44–49]. Moreover, con-

cerning the finite field arithmetic architectures, various concurrent error detection (CED)

multipliers for polynomial basis and normal basis of GF (2m) have been proposed using

parity codes and recomputing with shifted operands (RESO) schemes [27].

Hardware redundancy techniques in such architectures leads to increase in hardware

resources while making the design highly fault resilient, without considerable loss in the

total time. Indeed, such a reliability assurance method for error detection is commonly

used in systems where efficiency is preferred over area. On the other hand, time redun-

dancy provides a more area effective way of detecting faults. The technique involves two

runs; in the first run, the actual operands are used, during the second run the operands

are encoded in such a way the original result can be obtained after the operation is

complete. The encoding is usually done by shifting through RESO, rotating through

recomputing with rotated operands (RERO), or other reversible encoding scheme based

on the application and architecture.

Various real number division techniques for hardware have been discussed in [50], [51].

The proposed design uses a slightly modified radix-4 (Sweeney, Robertson, and Tocher)

SRT division [50] to calculate the quotient and the remainder of the complex dividend

and divisor. The contributions of this are summarized as follows:

• We propose fault diagnosis approaches for the presented slightly modified radix-4

SRT division considering the reliability, resiliency, and performance metrics objec-

tives. Unified and combined fault resilience approaches are used in conjunction

with performance boost modifications to achieve high throughput and frequency

architectures while maintaining high error coverage. Although, such designs pro-

vide error detection capability, it is limited to 16-bit precision result. Increasing the

Page 18: Reliable and Fault-Resilient Schemes for Efficient Radix-4

1.4 Objectives 9

precision leads to larger area due to the additional logic used for error detection.

As such, a reduced lookup table approach is employed to the quotient selection

logic which favors in concurrent error detection. To the author’s knowledge, the

aforementioned designs are not presented before.

• Through simulations for various fault models, we benchmark the error detection

capabilities of the proposed schemes. The results of these simulations show accept-

able error detection capabilities which ensures reliability and hardware assurance

of the proposed approaches.

• Finally, we implement the proposed fault immune architectures on Xilinx Spartan-

VI and Xilinx Virtex-VI FPGA families. Our results show that the proposed effi-

cient error detection architectures can be feasibly utilized for reliable architectures

of the presented complex division structures making them suitable for the required

performance, reliability, and implementation metrics to achieve for constrained ap-

plications.

Page 19: Reliable and Fault-Resilient Schemes for Efficient Radix-4

1.5 Thesis Outline 10

1.5 Thesis Outline

The structure of the thesis is as follows:

• Chapter 2: This chapter explains briefly different hardware based dividers in

general. This is followed by a detailed description of Radix-β (High radix dividers)

dividers. In addition, a basic description of Golub’s multiplier is also described.

• Chapter 3: This chapter describes the presented design used for complex division

and the optimized ROM (quotient selection table) used in the design.

• Chapter 4: In this chapter a brief description of the utilized concurrent error

detection (CED) schemes is given. The hardware designs of the two proposed fault

detection schemes are described. In addition, the fault detection techniques used

in different modules are also presented

• Chapter 5: Describes the fault model used for testing the proposed deigns. The

chapter also discusses the error coverage results of the two architectures.

• Chapter 6: The FPGA benchmarks of two different FPGAs (Spartan-6 and

Virtex-6) for the proposed architectures are given in this section.

• Chapter 7: Possible future work and applications are described in this chapter.

Page 20: Reliable and Fault-Resilient Schemes for Efficient Radix-4

Chapter 2

Preliminaries

2.1 Hardware Based Dividers

Division process in hardware based systems has a similar nature to multiplication. Mul-

tiplication in general hardware systems is performed using shift and add operations, i.e.,

repeated or sequential additions. Likewise, division also involves repeated or sequential

subtractions. Unlike multiplication operations, division techniques have the added com-

plexity of estimating a quotient digit based on the value of the partial remainder. Another

property of hardware division different from multiplication is that; in multiplication the

product of two k-bit numbers can always be represented by 2k-bits. Whereas, the quo-

tient of a 2k-bit number divided by a k-bit number is not always k-bits, it can have a

size more than k-bits, one example of this is the division of floating point or fractional

numbers. Thus, an overflow check is required before the commencement of a division op-

eration. In addition, the division process has to be carried out from top to bottom, i.e.,

divisor must be subtracted from the dividend. Unlike multiplication, where the partial

products can be produced from top to bottom or bottom to top. The general division is

always iterative since it requires a quotient fetch phase; the division process for hardware

is as follows, initially the partial remainder r0 is set to r0 = z, where z is the dividend. In

ith sequence, the quotient digit qi is selected based on the partial remainder riand divisor

d . This is followed by the subtraction of 2 × qi × d (shifted version) from the partial

Page 21: Reliable and Fault-Resilient Schemes for Efficient Radix-4

2.1 Hardware Based Dividers 12

remainder ri. Each successive number to be subtracted from the partial remainder has to

be shifted by one bit to the left. An alternative approach is to shift the partial remainder

so that it aligns with the next term to be subtracted. A better representation of the

operation is given in (2.1)

ri = 2 × ri − qi × (2i × d) with r0 = z and ri = 2 × r (2.1)

On microprocessor based systems, where there is no specific hardware for division.

One can use shift and add instructions to perform division. It is to be noted that the num-

ber of sequential subtractions increase relative to the bit length of the division operands,

i.e., larger the size of the dividend and divisor, greater the number of subtractions. The

Fig. 2.1, below shows a clear representation of the aforementioned division technique.

Most advanced arithmetic circuits use dedicated division circuits. This provides faster

calculation times with low power consumption. Some examples of the most commonly

used division techniques are:

• Restoring hardware dividers: The quotient fetch operation is performed on a non-

redundant digit set 0, 1, 2, . . . . . . , a − 1, where a is in the power of 2. The subtrac-

tion is continued until the partial remainder reaches a value less than the divisor

or until the partial remainder is negative. In the end, the addition of the divisor

to the partial remainder produces the correct remainder. This step is called the

restoring step, hence the name.

• Non-restoring hardware dividers: This divider architecture follows the same process

as above but eliminates the final negative partial remainder by a slight modification.

This makes it faster compared to the latter.

• Radix-β dividers or high radix dividers : This architecture is the most practical to

implement compared to the previous ones. Based on the radix-β, the number of

quotient digits obtained each iteration can be increased hence reducing the total

number of iterations.

The next subsection explains in detail the operation of radix-β dividers.

Page 22: Reliable and Fault-Resilient Schemes for Efficient Radix-4

2.2 Radix-β SRT Division 13

q

z

-q4d x 23

-q3d x 23

-q2d x 23

-q1d x 23

-q0d x 23

r

d

Figure 2.1: Dot representation of basic hardware division.

2.2 Radix-β SRT Division

The most critical part of the complex divider module is the SRT divider which is applied

to both the real and imaginary parts in parallel. The presented design uses a radix-β

division algorithm which yields log2(β) quotient bits every iteration. To achieve a 16, 32,

or 64 bit precision, we require 16log2(β) ,

32log2(β) , or 64

log2(β) iterations, respectively. In order to

achieve higher precision with less iterations, higher radix-β (β > 64) SRT dividers can

be used at the expense of more area on respective platforms.

The general iterative formula used in the SRT division for radix-β is:

R [j + 1] = β × (R [j] − q [j] × D) , (2.2)

where R [j] is the previous partial remainder after j iterations and R [j + 1] is the next

partial remainder at iteration j. At the end of iteration, the radix-β quotient is calcu-

lated based on the first few bits of the divisor (D) and the partial remainder (R [j]).

The next partial remainder (R [j + 1]) is then calculated based on R [j] and the prod-

uct of q [j] and the divisor D. The quotient selection is done by referring to an LUT

(look-up table) using the bits in D and R (j). The selected quotient bit is in the set

{−α, ..., −3, −2, −1, 0, 1, 2, 3, ..., α}, where α is in the range (β−1)2 ≤ α ≤ (β − 1).

The Robertson diagram in Fig. 2.2 shows the relation between the new shifted partial

Page 23: Reliable and Fault-Resilient Schemes for Efficient Radix-4

2.2 Radix-β SRT Division 14

R[j+1]

R[j]

1 α

D

-D

0 2 3-1-2-3-α

-βD βD

-hβD

-hD

hβD

hD

Figure 2.2: New shifted remainder and old shifted remainder in radix-β (Robertsondiagram).

remainder and the old partial remainder based on the quotient. From Fig. 2.2, it is seen

that the remainder is bounded by [−hd, hd), where h is a constant which determines the

quotient set {−α, ..., −3, −2, −1, 0, 1, 2, 3, ..., α} and h < 1. Then, the new shifted partial

remainder β × R [j + 1] will be in the range [−β × hd, β × hd). Applying the worst case

values to the original range, we have β × hd − α × d ≤ hd or h ≤ α(β−1) . Using this, the

quotient selection digit set can be adjusted based on the specific requirements.

We can also use a p-d plot (shifted partial remainder vs divisor) as a graphical aid

to better understand the quotient digit selection process. The p-d plot can also be

used to derive the required precision, i.e., the number of bits (partial remainder and

divisor) required for the quotient selection. For example in Fig. 2.3, assume point A

is the intersection of 4 bits of p and 3 bits of d, the rectangle around A represents the

truncation range from point A. The region between the lines are the quotient selection

regions. The quotient qi is selected based on the position of the uncertainty rectangle, in

this case both qi = β or β + 1 will yield the correct result.

The proposed design uses a radix-4 SRT division technique. Thus, from (2.2), the

division scheme for radix-4 is R [j + 1] = 4 × (R [j] –q [j] × D) with two quotient sets

{−3, −2, −1, 0, 1, 2, 3} called the maximally redundant set and {−2, −1, 0, 1, 2} the min-

imally redundant set. The maximally redundant set is faster and smaller compared to

the minimally redundant; nonetheless, it requires the computation of 3× which leads to

additional hardware and delay [52].

The quotient selection logic for the minimally redundant quotient set is shown in the

Page 24: Reliable and Fault-Resilient Schemes for Efficient Radix-4

2.2 Radix-β SRT Division 15

A

4 bits of p3 bits of d

(h+β+1)d

(h+β)d

(-h+β+1)d

(-h+β)d

Note: h=α÷(β-1)

d

p

dmaxdmin

Figure 2.3: p-d plot quotient selection for radix-βdivision.

Page 25: Reliable and Fault-Resilient Schemes for Efficient Radix-4

2.3 Golub’s Multiplication 16

q(j)= -1 or -2 q(j) = 0

q(j) = 1 q(j) = 2

q(j)= -2 q(j)= -1

q(j)= -1 or 0 q(j)= 1 or 0

q(j)= 2 or 1

Figure 2.4: The quotient selection scale.

following and detailed in Fig. 2.4, where the quotient selection scale is illustrated based

on:

q(j + 1) =

−2 −83 × D < R [j + 1] < −4

3 × D

−1 −53 × D < R [j + 1] < −1

3 × D

0 −23 × D < R [j + 1] < 2

3 × D

1 13 × D < R [j + 1] < 5

3 × D

2 43 × D < R [j + 1] < 8

3 × D

. (2.3)

2.3 Golub’s Multiplication

Complex multiplications in DSP systems generally use a more efficient indirect approach

[53], referred to as the Golub’s method. For two complex numbers with a and c as the

Page 26: Reliable and Fault-Resilient Schemes for Efficient Radix-4

2.3 Golub’s Multiplication 17

X

- + X

-

X + +

a b c da c b d

xreal

ximag

Figure 2.5: Golub’s multiplication.

real parts and b and d as imaginary parts, one can reach:

t1 = (a + b) × (c + d) , t2 = a × c, t3 = b × d

x = (t2 − t3) + j × (t1 − t2 − t3)

xreal = t2 − t3

ximag = t1 − t2 − t3

. (2.4)

The direct implementation of complex multiplication requires four real multiplications

and two additions. The indirect implementation, on the other hand, requires three real

multiplications and five additions, as shown in Fig. 2.5. We note that the latter is more

area efficient because multiplication requires more area compared to addition.

Page 27: Reliable and Fault-Resilient Schemes for Efficient Radix-4

Chapter 3

Presented Complex Division Radix-4

SRT Module

The design of the complex divider consists of several modules: multiplication module,

normalizing module, real and imaginary iteration module, shared ROM for the quotient

selection, and an on-the-fly converter. Fig. 3.1 shows the high level block diagram of the

complete design. In this figure, the multiplication in the numerator is performed using

Golub’s multiplication method. This leads to fewer resources with lower hardware com-

plexity compared to the traditional multiplication approach. The multiplication module

multiplies the complex conjugate of the denominator to the numerator and denominator

to rationalize the denominator to a real number, i.e., c2 + d2:

a + jb

c + jd= (a + jb) × (c − jd)

(c + jd) × (c − jd) = x

c2 + d2 + jy

c2 + d2 . (3.1)

The normalizer normalizes the numbers such that 1 ≤ x, y, (c2 + d2) ≤ 2. The real and

imaginary iteration modules are essentially two radix-4 SRT blocks. As seen in Fig. 3.1,

a single dual-port ROM is shared between the real and imaginary modules for quotient

selection. The normalized divisor is saved in register D and the dividend is saved in

U , V in carry-save form (for both real and imaginary sections), as shown in Fig. 3.1.

The dividend is then subjected to repeated subtractions and table look-up to obtain the

required number of quotient bits, in this case 16 bits. The MUX in this figure selects

Page 28: Reliable and Fault-Resilient Schemes for Efficient Radix-4

3.1 ROM 19

0,d,-d,2d,-2d

D (real)V (carry)

U (sum)

MUX (4:1)ROM

Adder

4 bits

q-j(real)

Partial remainder

(2R)

Shift left

CSA tree

D (imag)V (carry)

U (sum)

MUX (4:1)

Adder

Partial remainder

(2R)

CSA tree

q-j(imag)

R[j+1](imag)R[j+1](real)

0,d,-d,2d,-2d

5 bits4 bits

5 bits Shift left

4 16

16 16 16 16

16 16 1616 16 16 16

5 5

5

5 5

4 16

1616

1616

5

33

Q (real) Q (imag)

16 16

16

Figure 3.1: Presented complex radix-4 SRT module.

the divisor D based on the value of the quotient fetched (initial 5 bits of U and V and

4 bits of D), i.e., 0, D, −D, 2D, and−2D. The carry-save adder (CSA) calculates the

new partial remainder based on the MUX output and the present values and saves it in

U and V . This process is repeated until the desired number of quotient bits is reached

for both real and imaginary parts. The on-the-fly converter is essentially an adder which

subtracts the positive quotients from the negative quotients and saves them in Qreal and

Qimag, respectively. The shared ROM used for quotient fetch is minimized to save area;

the minimized ROM structure is explained in the followings.

3.1 ROM

The ROM look-up consists of two steps; rounding and quotient fetch. The partial remain-

der from the CSA is rounded off to 5 bits, round (R [j]). A combination of the partial

remainder (5 bits) and the divisor (4 bits) is used as the address to the ROM for the

quotient fetch. The ideal ROM table contains a total of 29 = 512 entries as shown in Fig.

3.2.

The ROM is symmetrical between the positive and negative quotients. When the

round (R[j]) is positive, the positive quotient is fetched and vice versa. The shared ROM

Page 29: Reliable and Fault-Resilient Schemes for Efficient Radix-4

3.1 ROM 20

used in the design is condensed to just the positive section of the table and a combinational

circuit which generates the 2’s complement of q[j] if R[j] is negative. This can easily be

determined by verifying the MSB of R[j] which reduces the ROM size to half of the

original size, i.e., to 256 entries.

The “quotient fetch” operation pertaining to the minimized shared ROM architecture

is presented below:

round (RR [j]) = r2Rr1

Rr0R.r−1

R r−2R (3.2)

round (RI [j]) = r2Ir1

Ir0I .r−1

I r−2I (3.3)

addrR =

r1Rr0

R.r−1R r−2

R |d0d−1d−2d−3 if r2R = 0

not(r1

Rr0R.r−1

R r−2R

)+ 1|d0d−1d−2d−3 else

(3.4)

addrI =

r1Ir0

I .r−1I r−2

I |d0d−1d−2d−3 if r2I = 0

not(r1

Ir0I .r−1

I r−2I

)+ 1|d0d−1d−2d−3 else

(3.5)

qR (j) =

q2

Rq1Rq0

R if r2R = 0

not (q2Rq1

Rq0R) + 1 else

(3.6)

qI (j) =

q2

I q1I q0

I if r2I = 0

not (q2I q1

I q0I ) + 1 else

(3.7)

Page 30: Reliable and Fault-Resilient Schemes for Efficient Radix-4

3.1 ROM 21

1 1.0625 1.125 1.1875 1.25 1.3125 1.375 1.4375 1.5 1.5625 1.625 1.6875 1.75 1.8125 1.875 1.9375

3.75 2 2 2 2 2 2 2 2 2

3.5 2 2 2 2 2 2 2 2 2 2

3.25 2 2 2 2 2 2 2 2 2 2 2 2

3 2 2 2 2 2 2 2 2 2 2 1,2 1,2 1,2

2.75 2 2 2 2 2 2 2 2 2 2 1,2 1,2 1,2 1,2 1,2

2.5 2 2 2 2 2 2 2 2 2 1,2 1,2 1,2 1,2 1 1 1

2.25 2 2 2 2 2 2 1,2 1,2 1,2 1,2 1,2 1 1 1 1 1

2 2 2 2 2 2 1,2 1,2 1,2 1 1 1 1 1 1 1 1

1.75 2 1,2 1,2 1,2 1,2 1 1 1 1 1 1 1 1 1 1 1

1.5 1,2 1,2 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1.25 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0,1

1 1 1 1 1 1 1 1 1 1 0,1 0,1 0,1 0,1 0,1 0,1 0,1

0.75 1 1 1 0,1 0,1 0,1 0,1 0,1 0,1 0,1 0,1 0,1 0,1 0,1 0,1 0,1

0.5 0,1 0,1 0,1 0,1 0,1 0,1 0,1 0,1 0 0 0 0 0 0 0 0

0.25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

-0.25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

-0.5 0,-1 0,-1 0,-1 0,-1 0,-1 0,-1 0,-1 0,-1 0 0 0 0 0 0 0 0

-0.75 -1 -1 -1 0,-1 0,-1 0,-1 0,-1 0,-1 0,-1 0,-1 0,-1 0,-1 0,-1 0,-1 0,-1 0,-1

-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0,-1 0,-1 0,-1 0,-1 0,-1 0,-1

-1.25 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0,-1

-1.5 -1,-2 -1/-2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

-1.75 -2 -1/-2 -1,-2 -1,-2 -1,-2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

-2 -2 -2 -2 2 -2 -1,-2 -1,-2 -1,-2 -1 -1 -1 -1 -1 -1 -1 -1

-2.25 -2 -2 -2 2 -2 -2 -1,-2 -1,-2 -1,-2 -1,-2 -1,-2 -1 -1 -1 -1 -1

-2.5 -2 -2 -2 2 -2 -2 -2 -2 -2 -1,-2 -1,-2 -1,-2 -1,-2 -1 -1 -1

-2.75 -2 -2 2 -2 -2 -2 -2 -2 -2 -2 -1,-2 -1,-2 -1,-2 -1,-2 -1,-2

-3 2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -1,-2 -1,-2 -1,-2

-3.25 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2

-3.5 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2

-3.75 -2 -2 -2 -2 -2 -2 -2 -2 -2

Figure 3.2: Quotient selection logic.

As shown in (3.4) and (3.5), the first 4 bits of D and R (U + V ) are concatenated

to form the ROM addresses. The first bit of R (U + V ) is used as a select line for the

multiplexers to select the two inputs, i.e., original or 2’s complement. Similarly, in (3.6)

and (3.7), the same bit of R (U + V ) is used to determine the output q [j] or −q [j]. The

circuit operation of the proposed ROM is denoted in Fig. 3.3.

2:1

MU

X ROM (256

Entries) 2:1

MU

X

2's complement adder

2's complement adder

D

R[j+1]

qj

4

5

5

11

3

3

3

5

Figure 3.3: Minimized ROM.

Page 31: Reliable and Fault-Resilient Schemes for Efficient Radix-4

Chapter 4

Proposed Error Detection Schemes

The most prominent method for error detection is information redundancy (usually di-

vided into time and hardware redundancy). General hardware redundancy, though sim-

ple, is very efficient in detecting faults. We do not present a modular hardware redundant

scheme separately, for the sake of brevity, but its implementation has been carried out

to record results. Specifically, since the scheme is general hardware duplication, all the

registers in the data path are duplicated and the content is compared with its duplicate

in real time. The main drawback of this scheme is that it results in a 100% increase

in area. Register duplication is not just limited to the registers in the data path but

also in the ROM leading to an increased ROM size of 512 × 3. Other arithmetic blocks

in the design are also duplicated and checked in real time. A more practical approach

and low-complexity technique is using error detecting codes, such as parity, for error de-

tection. The logic relation for parity is simple to implement and is more area efficient

compared to duplication. The challenging part is incorporating the parity checker in the

proposed design. Error checks need not be performed on all the sections of the design.

It is sufficient to check the parts which are capable of propagating errors. This reduces

the area by eliminating unnecessary logic in the design.

The second scheme incorporates time redundancy using RESO. Using this technique,

an internment error, i.e., transient faults occurring during one of the runs can be detected;

moreover, it can detect permanent faults. This provides RESO with an added advantage

Page 32: Reliable and Fault-Resilient Schemes for Efficient Radix-4

4.1 Unified Parity Check and Hardware Redundancy 23

compared to traditional time redundancy schemes. As mentioned before RESO involves

two runs, one uses the actual operands and the other uses shifted operands. Since this

is a division operation there is an added advantage in using RESO for error detection,

the shifting performed in RESO is a mere multiplication by 2 in the numerator and

the denominator which leads to the same final result, i.e., (a+jb)×2(c+jd)×2 = X + jY = (a+jb)

(c+jd) .

This requires no additional decoding hardware at the end of the operation to obtain the

original result. In order to incorporate all possible 2n combinations of an n-bit register

in RESO, an n + 1 bit register design is required.

In what follows, we present two error detection schemes, where details for the error

detection technique and implementation used in the design are presented.

4.1 Unified Parity Check and Hardware Redundancy

To implement an effective error detection approach, the propagation of faults through-

out the circuit needs to be assessed. In the proposed complex divider architecture, the

occurrence of single or multiple faults may lead to random error propagation through

the circuit. Considering the operation’s iterative nature, faults may also propagate to

circuit locations which lay before the affected region. This leads to the inclusion of parity

registers throughout the circuit which are prone to propagating faults. The error detec-

tion structure is developed by comparing the actual parity and the predicted parity. The

error detection division architecture is divided into 5 modules; each individual module

has its own parity or hardware redundancy schemes as described in the following. Fig.

4.1, shows the main CED blocks in the datapath of the proposed error detection scheme.

The following subsections explain the individual fault technique applied to the modules.

Page 33: Reliable and Fault-Resilient Schemes for Efficient Radix-4

4.1 Unified Parity Check and Hardware Redundancy 24

D (

real

)P

MU

X (

4:1

) CSA

tre

e

1616

16

V (

carr

y)P

U (

sum

)P

Ad

de

rA

dd

er

RO

M(w

ith

P

arit

y)

Q (

real

)P

D (

imag

)P

MU

X (

4:1

) CSA

tre

e

1616

16

V (

carr

y)P

U (

sum

)P

Ad

de

rA

dd

er

Q (

imag

)P

Erro

r_re

alEr

ror_

imag

Erro

r_R

OM

4 b

its

4 b

its

Part

ial

rem

ain

der

(2R

)

5 b

its

5 b

its

Shif

t le

ftSh

ift le

ft

Part

ial

rem

ain

de

r (2

R)

R[j

+1] (i

mag

)R

[j+1

] (re

al)

q-j

(re

al)

q-j

(im

ag)

Figure 4.1: Radix-4 complex SRT module with unified parity check and hardware redu-

dancy.

Page 34: Reliable and Fault-Resilient Schemes for Efficient Radix-4

4.1 Unified Parity Check and Hardware Redundancy 25

4.1.1 Golub’s Multiplier

As mentioned before, Golub’s technique of complex multiplication is the most efficient way

to multiply two complex numbers. For error detection in this particular module, we use

the checker 3 method mentioned in [54]. This was found to be the most efficient in terms

of area compared to other error detection techniques for adders and multipliers. Such an

implementation slightly alters the design of the multiplication module presented in Fig.

2.5 in order to incorporate the error detection scheme. This separates the calculation of

real and imaginary parts and, in turn, prevents the error propagation form a × c + b × d

to ximag as shown in Fig. 2.5. The checker verifies (4.1),(4.2) which are obtained by

rearranging (2.3). The structure of the concurrent error detection multiplier is shown in

Fig. 4.2.

xreal + ximag = t1 − 2 × b × d, (4.1)

xreal + ximag + 2 × b × d = t1. (4.2)

4.1.2 Datapath Registers

The registers in the datapath are very vital to the operation and are also capable of

randomizing the error propagation in the design. Each register is incorporated with a

parity bit and the contents are checked in real time to detect faults. The collective output

comparisons of the actual and predicted parity are connected to an OR gate which rises an

error flag in case of a detection. Fig. 4.3, shows a generalized representation of the theory

explained above. The bit P is the individual parity bit of registers R1, R2,. . . . . . . . ., Rn.

Some cases may also contain multiple parity bits in the same register, to detect errors

[55, 56].

Page 35: Reliable and Fault-Resilient Schemes for Efficient Radix-4

4.1 Unified Parity Check and Hardware Redundancy 26

X

- X

X + +

a b c da c b d

xreal

X

+

X

a d b c

ximag

1<<

+

+

error

Figure 4.2: CED Multiplier.

PR1 PR2 PR3 PRn

error

Figure 4.3: Parity check registers.

Page 36: Reliable and Fault-Resilient Schemes for Efficient Radix-4

4.1 Unified Parity Check and Hardware Redundancy 27

2:1

MU

X ROM (256

Entries) 2:1

MU

X

2's complement adder

2's complement adder

D

R[j+1]

qj

4

5

5

11

3

3

3

5

3

1

par

ity 4

3

error

Figure 4.4: Minimized ROM with the proposed parity prediction logic.

4.1.3 ROM (Quotient Selection Logic)

The memory module used in the design is very critical to the operation and, thus, if

it is prone to faults, it can undermine the entire objectives. The width of the ROM is

extended to one extra bit to incorporate parity. The arrangement is capable of detecting

all single stuck-at faults. During a quotient fetch operation, the quotient corresponding to

the address bits is checked for correctness at the output of the ROM using the respective

parity bits. In the case of a bad memory location, the circuit raises an error flag. The

ROM with the proposed parity prediction logic is depicted in Fig. 4.4.

4.1.4 Carry Save Adder

The proposed design uses a carry-save adder to compute the next partial remainder based

on the previous partial remainder R [j] and the product of the quotient and divisor,

i.e., q [j] × D. Since the design is aimed towards error detection, CSAs provide faster

computation of the results as the carry is not propagated but saved; this reduces the

delay of the adder. In addition to this, the CSAs provide easy incorporation of error

detection. The error detection scheme used in the CSA incorporates parity checks for

individual full adders [57]. The parity check equations (4.3),(4.4) used in the design is

given below:

Sume = a ⊕ b ⊕ cin ⊕ Sum, (4.3)

Page 37: Reliable and Fault-Resilient Schemes for Efficient Radix-4

4.2 Error Detection through RESO 28

abcin

sum

cout

sume

ce

Figure 4.5: Full added with parity check.

Ce = (a • b) ⊕ (a ⊕ b) • (cin ⊕ cout) . (4.4)

The parity check equations are used to check the correctness of both “sum” and “carry”.

In the case where one of them is incorrect, the module raises an error flag. The structure

for a single fault detection full adder is shown in Fig. 4.5. In case of a n-bit full adder this

can be arrayed, like in a traditional n-bit adder and the error signals can be connected

to an OR gate.

4.2 Error Detection through RESO

In this error detection model, we use RESO [27] to detect faults in the design. The

RESO method, as explained before, uses the same hardware without any modification.

This makes it efficient in applications where low-area is a requirement. Let us assume F

is the function to be performed on a particular operand x. Then, F (x) is the result of

the functional module. The initial result F (x) is stored in a register. The operation is

repeated with a shifted version of the same operand x, in this case x′. The operation is

repeated with x′ to obtain F (x′), in such a way the original result F (x) can be restored

with a simple operation on F (x′). The division architecture for the implemented RESO

Page 38: Reliable and Fault-Resilient Schemes for Efficient Radix-4

4.2 Error Detection through RESO 29

structure is shown in Fig. 4.6.

As shown in Fig. 4.6, the size of the registers are increased to an extra bit to incor-

porate shifting of all 28 possible combinations. The applied RESO method in this case is

limited to one left shift, i.e., multiplication by 2. Multiple left shifts are possible, but at

the expense of increased register size which leads to increased CSA, ROM, and address

registers. In accordance with the principle explained above, the initial complex dividend

and divisor are passed through the divider. The quotient is stored in a register. In the

second run, the initial operands are shifted and passed through the divider. This new

result is compared with the previous quotient and in case of a mismatch, an error flag

is triggered. Time-redundancy techniques tend to increase the number of cycles for a

particular operation. Hence, we have sub-pipelined the design in order to alleviate the

throughput. The addition of the pipeline registers does indeed increase the area slightly,

but it is negligible compared to hardware redundant or parity-based schemes.

Page 39: Reliable and Fault-Resilient Schemes for Efficient Radix-4

4.2 Error Detection through RESO 30

0,d

,-d

,2d

,-2

dD (

real

)V

(ca

rry)

U (

sum

)

MU

X (

4:1

)R

OM

Ad

de

r

4 b

its

q-j

(re

al)

Part

ial

rem

ain

der

(2R

)

Shift

left

CSA

tre

e

D (

imag

)V

(ca

rry)

U (

sum

)

MU

X (

4:1

)

Ad

de

r

Part

ial

rem

ain

der

(2R

)

CSA

tre

e

q-j

(im

ag)

R[j

+1] (i

mag

)R

[j+1

] (re

al)

0,d

,-d

,2d

,-2

d

5 b

its

4 b

its

5 b

its

Shif

t le

ft

418

1818

1818

1818

1818

181

818

55

5

55

418

18

18

18

18

5

33

18

Q (

real

)Q

(im

ag)

Q_R

ESO

(re

al)

Q_R

ESO

(im

ag)

erro

rer

ror

1st_r

un

/2n

d_r

un

1818

1818

Figu

re4.

6:Pr

opos

eddi

vide

rw

ithR

ESO

for

faul

tdi

agno

sis.

Page 40: Reliable and Fault-Resilient Schemes for Efficient Radix-4

Chapter 5

Error Simulations

For the fault model in this thesis, both single and multiple stuck-at-fault are considered

to cover natural failures and counteract VLSI defects [58]. The fault model is constructed

using linear-feedback shift registers (LFSRs) through maximum tap length polynomials,

which are ORed or ANDed with the actual output to simulate both stuck-at-one and

stuck-at-zero faults for transient and permanent faults [58]. The faults are injected at

different locations in the circuit and checked for the error indication flags. This provides

a more real-world scenario, because naturally-occurring faults does not affect a particular

part of the circuit but the entire circuit as a whole with a uniform distribution. Moreover,

the false-positive cases are excluded from the error analysis, i.e., the cases where the

injected faults also produce the same output as the original. The same fault model is

used for both the presented error detection modules (parity and RESO). The model

simulates both single and multiple faults in the circuit by flipping the bits from 0 to 1

and vice versa. Since we use a 16-bit LFSR, the probability of the bits flipping is 1216 and

the error stays for exactly one clock cycle. This perfectly simulates both transient and

permanent faults.

The first proposed scheme uses a combination of signatures and hardware redundancy

and it can be analytically proven that the detection rate for single-bit stuck-at-fault for

parity prediction blocks is 100%. The simulations for single stuck-at-fault are performed

exhaustively in every byte and every operation to confirm the theoretical results. We

Page 41: Reliable and Fault-Resilient Schemes for Efficient Radix-4

32

mainly concentrate on the multiple stuck-at-fault scenarios, by performing extensive anal-

ysis through simulations. For hardware redundant blocks, it is highly unlikely for two

transient faults or permanent faults of the same nature to occur simultaneously. Hence,

the detection rate for these modules will achieve a 100% error coverage. As for parity

prediction blocks used in the design, the faults may or may not be detected, based on

the parity they generate. To counter this, different parts of the circuit are equipped with

parity blocks so that at least one of them alarm the errors. The RESO module uses two

runs to detect faults, both permanent and transient ones. Since the operands are shifted

on the second run, a transient fault leads to different results, because both the shifted and

original operands are supposed to generate the same result. A similar condition occurs

in the case of a permanent fault. Consider a stuck-at-one fault at the LSB of a register in

the design. During the shifted second run, the fault at the LSB, leads to different results

due to the iterative nature of the design.

The simulations have been performed by applying 1, 000 random inputs and 9, 17, 504

multiple faults. The results of the simulations show more than 99.999% of the errors

are detected for both the hardware and time redundant modules. A theoretical analysis

has also been performed to confirm the simulation results. Let p be the probability of

a parity module detecting a fault. Then, 1 − pn is the probability of error detection

for n parity modules. The hardware redundant modules consist of a total of 6 × 2 + 1

parity modules (real and imaginary iterative modules). Each iterative module performs 8

iterations. Therefore, we have p = 0.5 and n= 13 × 8 × 2, the error detection probability

is calculated as 1 − 0.5208 = 99.999%, which matches with the simulation results.

Page 42: Reliable and Fault-Resilient Schemes for Efficient Radix-4

Chapter 6

FPGA Implementations and

Benchmark

In this section, through FPGA implementations on two diverse families, we present the

overhead evaluation results, i.e., area, delay, and throughput. The benchmarking has been

carried out for the original and the error detection structures of the discussed complex

division SRT module. The FPGA implementations are done using ISE version 14.5

and synthesized for Spartan-6 xc6slx16-2cgs324 and Virtex-6 xc6vlx75t-3ff484 devices

[77]. Through this analysis, the performance and implementation metrics of the complex

divider for both low-end and high-end FPGAs can be observed with VHDL as the design

entry.

The complex division architecture is structured hierarchically. Specifically, it is di-

vided into 4 functional modules: Golub’s multiplier, normalizer, SRT divider (real and

imaginary), and ROM. Each of these parts is implemented individually and port-mapped

at the top level. The error detection designs for individual modules are tested to verify

functionality. The original design (no error detection) runs at a maximum frequency of

268.557 MHz for the Virtex-6 and 139.758 MHz for the Spartan-6, after performing syn-

thesis and place-and-route. Each iteration takes two clock cycles; the operation requires

a total of eight iterations. The critical path delay of 5.23 ns for the Spartan-6 and 3.76 ns

for the Virtex-6 is considered from the input of the Golub’s multiplier through the ROM

Page 43: Reliable and Fault-Resilient Schemes for Efficient Radix-4

34

Table 6.1: FPGA Implementation results of the original complex division module andthe proposed error detection designs on Spartan-6 FPGA device xc6s1x16.

Architecture Area[Slices]

Throughput[Gbps]

Power[mW ]

Original 228 0.559 5.43Scheme-I 304

(33.33%)0.510

(8.76%)5.79

(6.62%)Scheme-II 269

(17.98%)0.367

(34.34%)6.60

(21.54%)

Table 6.2: FPGA Implementation results of the original complex division module andthe proposed error detection designs on Virtex-6 FPGA device xc6vlx75t.

Architecture Area[Slices]

Throughput[Gbps]

Power[mW ]

Original 248 1.074 9.86Scheme-I 346

(39.52%)1.064

(0.93%)10.93

(10.85%)Scheme-II 283

(14.11%)0.677

(36.64%)12.36

(25.35%)

till the output of the quotient register. The total delay of Scheme-I (unified parity and

hardware redudancy design) is 5.237 ns (Spartan-6) and 1.600 ns (Virtex-6). The total

delay increase from the original design to Scheme-I is 36.452% (Spartan-6) and 29.56%

(Virtex-6), this is due to storing the result in carry save form rather than using traditional

CPA’s, which may lead to higher delays. Scheme-II employs the RESO architecture dis-

cussed in previous sections and it leads to additional processing time since it requires

twice the original number of iterations, eight for the original operands and eight for the

shifted operands. As mentioned before, the design in Scheme-II has been sub-pipelined to

relatively improve the throughput. A delay increase of 16.326% (Spartan-6) and 10.625%

(Virtex-6) from the original design is observed. Though, this leads to an increase in area,

it is low compared to Scheme-I. The synthesis and place and route are performed using

the same settings, i.e., speed optimization, minimum area, and no DSP blocks, for both

FPGA platforms.

The implementation results are tabulated in Tables 6.1 and 6.2 and the area (slices),

throughput, and power consumption (at 140 MHz) for both designs are presented. The

overheads are also denoted in the parentheses in both tables. The maximum area over-

Page 44: Reliable and Fault-Resilient Schemes for Efficient Radix-4

35

heads are 33.33% (Spartan-6) and 39.516% (Virtex-6) which correspond to Scheme-I

and 17.98% (Spartan-6) and 14.11% (Virtex-6) for Scheme-II, respectively. The maxi-

mum power overheads are 6.62% (Spartan-6) and 10.85% (Virtex-6) which correspond to

Scheme-I and 21.54% (Spartan-6) and 25.35% (Virtex-6) for Scheme-II, respectively. As for

Scheme-II, sub-pipelining the design leads to higher frequencies at the expense of area.

Since Scheme-II is aimed towards area-efficient designs compared to Scheme-I, further

sub-pipelining has not been implemented to maintain the area within acceptable limits.

Based on the simulation results, both schemes provide the high error coverage of

99.99%. One can refer to [59]-[76] for similar sub-block works on fault detection in cryptography.

Page 45: Reliable and Fault-Resilient Schemes for Efficient Radix-4

Chapter 7

Conclusion

We have presented two complex divider architectures capable of error detection. The di-

vision scheme uses the SRT division algorithm to obtain the quotient and remainder. We

also propose a new ROM look-up technique which reduces the number of ROM entries

50%, and, thus, significantly reduces the area and power consumption. Scheme-I utilizes

signatures and hardware redundant blocks to incorporate error detection. Traditional

error detection architectures are limited to the use of either parity (less area but not

complete error coverage) or hardware redundancy (maximum error coverage but imprac-

tical hardware complexity). Scheme-I uses a combination of both the former and latter

to achieve maximum error coverage using relatively less hardware complexity. Scheme-II

achieves its error detection by utilizing redundancy in time. Due to the nature of the

operation, no extra logic is required to compare the results, apart from the pipeline reg-

isters. Benchmark circuits were evaluated with both single and multiple stuck-at-fault.

The error coverage simulations show results of more than 99.999% coverage for the pro-

posed designs. The proposed designs are extendable through increasing the efficiency of

the original design with fault detection capability by moving to higher radices such as

radix-8 or radix-16. Moreover, they provide different options to embed error detection in

many complex time-efficient and area-efficient arithmetic applications.

Page 46: Reliable and Fault-Resilient Schemes for Efficient Radix-4

References

[1] M. C. M. Teixeria, E. Assuncao, and E. R. M. D. Machado, “A method for plot-

ting the complementary root locus using the root-locus (positive gain) rules,” IEEE

Trans. Educ., vol. 47, no. 3, pp. 405–409, Aug. 2004.

[2] P. J. S. G. Ferreira, “Concerning the nyquist plots of rational functions of nonzero

type,” IEEE Trans. Educ., vol. 42, no. 3, pp. 228–229, Aug. 1999.

[3] A. C. Bartlett, “Nyquist, bode, and nichols plots of uncertain systems,” in Proc.

Conf. American Control, May. 1990, pp. 2033–2037.

[4] S. Mijalkovic, “Using frequency response coherent structures for model-order reduc-

tion in microwave applications,” IEEE Trans. Microwave Theory and Techniques,

vol. 52, no. 9, pp. 2292–2297, Sep. 2004.

[5] (2011) Exocortex.dsp. [Online]. Available: http://www.exocortex.org/dsp/.

[6] (2011) Costly: Complex standard functions library. [Online]. Available: http:

//iamlasun8.mathematik.unikarlsruhe.de/~ae16/CoStLy.html/.

[7] R. L. Smith, “Algorithm 116: Complex division,” Communications of the ACM,

vol. 5, no. 8, p. 435, 1962.

[8] G. W. Stewart, “A note on complex division,” ACM Trans. Mathematical Software,

vol. 11, no. 3, pp. 238–241, 1985.

[9] C. P. Jeannerod, N. Louvet, and J. M. Muller, “On the component-wise accuracy

of complex floating-point division with an FMA,” in Proc. IEEE Symp. Computer

Arithmetic, 2013, pp. 83–90.

Page 47: Reliable and Fault-Resilient Schemes for Efficient Radix-4

References 38

[10] M. D. Ercegovac and J. M. Muller, “Complex division with prescaling of operands,”

in Proc. IEEE Int. Conf. Application-Specific Systems, Architectures, and Processors,

2003, pp. 304–314.

[11] P. Dormiani, M. D. Ercegovac, and J. M. Muller, “Design and implementation of a

radix-4 complex division unit with prescaling,” in Proc. IEEE Int. Conf. Application-

specific Systems, Architectures and Processors, 2009, pp. 83–90.

[12] D. Wang, M. D. Ercegovac, and N. Zheng, “A radix-8 complex divider for FPGA im-

plementation,” in Proc. IEEE Int. Conf. Field Programmable Logic and Applications,

2009, pp. 236–241.

[13] D. Wang and M. D. Ercegovac, “A radix-16 combined complex division/square root

unit with operand prescaling,” IEEE Trans. Comput., vol. 61, no. 9, pp. 1243–1255,

2012.

[14] F. Edman and V. Oewall, “Fixed-point implementation of a robust complex valued

divider architecture,” in Proc. European Conf. Circuit Theory and Design, Aug. 2005.

[15] J. Liu, B. Weaver, and Y. Zakharov, “FPGA implementation of multiplication-free

complex division,” Electronic Letters, vol. 44, no. 2, pp. 95–96, Jan. 2008.

[16] T. Jamil, “An introduction to complex binary number system,” in Proc. Int. Conf.

Information and Computing, 2011, pp. 229–232.

[17] T. Jamil and S. S. Al-Abri, “Design of a divider circuit for complex binary numbers,”

in Proc. World Congress on Engineering and Computer Science, vol. 2, 2010.

[18] J. Liu, “DCD algorithm: Architectures, FPGA implementations and applications,”

in Diss. University of York, 2008.

[19] S. Srinivasan, “FLAW: FPGA lifetime awareness,” in Design Automation Confer-

ence, 2006, pp. 630–635.

Page 48: Reliable and Fault-Resilient Schemes for Efficient Radix-4

References 39

[20] C. Guérin, V. Huard, and A. Bravaix, “The energy-driven hot-carrier degradation

modes of nMOSFETs,” IEEE Trans. Device and Materials Reliability, vol. 7, no. 2,

pp. 225–235, Jun. 2007.

[21] D. K. Schroder and J. A. Babcock, “Negative bias temperature instability: Road to

cross in deep submicron silicon semiconductor manufacturing,” Journal of Applied

Physics, vol. 94, no. 1, pp. 1–18, Jul. 2003.

[22] P. J. Clarke, A. K. Ray, and C. A. Hogarth, “Electromigration-a tutorial introduc-

tion,” Int. Journal of Electronics, vol. 69, no. 3, pp. 333–338, Feb. 1990.

[23] D. Esseni, J. D. Bude, and L. Selmi, “On interface and oxide degradation in VLSI

MOSFETs-part I: Deuterium effect in che stress regime,” IEEE Trans. Electron

Devices, vol. 49, no. 2, pp. 247–253, Feb. 2002.

[24] D. Esseni, J. D. Bude, and L. Selmi, “On interface and oxide degradation in VLSI

MOSFETs-part II: Fowler-nordheim stress regime,” IEEE Trans. Electron Devices,

vol. 59, no. 2, pp. 254–263, Feb. 2002.

[25] I. G. Harris and R. Tessier, “Testing and diagnosis of interconnect faults in cluster-

based FPGA architectures,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst.,

vol. 21, no. 11, pp. 1337–1343, Nov. 2002.

[26] M. Berg, “Fault tolerance implementation within SRAM based FPGA designs based

upon the increased level of single event upset susceptibility,” in Proc. IEEE Int.

On-Line Testing Symp. (IOLTS’06), 2006.

[27] J. H. Patel and L. Y. Fung, “Concurrent error detection in ALUs by recomputing

with shifted operands,” IEEE Trans. Comput., vol. C-31, no. 7, pp. 589–595, Jul.

1982.

[28] J. Li and E. E. Swartzlander, “Concurrent error detection in ALUs by recomputing

with rotated operands,” in Proc. IEEE Int. Workshop on Defect and Fault Tolerance

in VLSI Systems, Nov. 1992, pp. 109–116.

Page 49: Reliable and Fault-Resilient Schemes for Efficient Radix-4

References 40

[29] S. Durand, C. Piguet, and C. suisse, “FPGA with selfrepair capabilities,” in Proc.

Int. Workshop on Field Programmable Gate Arrays, 1994, pp. 1–6.

[30] C. Stroud, S. Konala, P. Chen, and M. Abramovici, “Built-in self-test of logic blocks

in FPGAs,” in Proc. VLSI Test Symp., 1996, pp. 387–392.

[31] S.-K. Lu and C.-Y. Chen, “Fault detection and fault diagnosis techniques for lookup

table FPGAs,” in Proc. Asian Test Symp. (ATS’02), vol. 15, no. 1, 2002, pp. 397–

406.

[32] A. Alaghi, M. S. Yarandi, and Z. Navabi, “An optimum ORA BIST for multiple

fault FPGA look-up table testing,” in Proc. Asian Test Symp. (ATS’06), 2006, pp.

293–298.

[33] J. Smith, T. Xia, and C. Stroud, “An automated BIST architecture for testing and

diagnosing FPGA interconnect faults,” Journal of Electronic Testing: Theory and

Applications, vol. 22, no. 3, pp. 239–253, Jun. 2006.

[34] I. G. Harris and R. Tessier, “Diagnosis of interconnect faults in cluster-based FPGA

architectures,” in Proc. Int. Conf. Computer Aided Design, 2000, pp. 472–475.

[35] M. Abramovici, J. M. Emmert, and C. E. Stroud, “Roving stars: An integrated

approach to on-line testing, diagnosis, and fault tolerance for FPGAs in adaptive

computing systems,” in Proc. NASA/DoD Workshop on Evolvable Hardware, 2001,

pp. 73–92.

[36] J. M. Emmert, C. E. Stroud, and M. Abramovici, “Online fault tolerance for FPGA

logic blocks,” IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 15,

no. 2, pp. 216–226, Feb. 2007.

[37] N. R. Shnidman, W. H. Mangione-Smith, and M. Potkonjak, “On-line fault detec-

tion for bus-based field programmable gate arrays,” IEEE Trans. Very Large Scale

Integration (VLSI) Systems, vol. 6, no. 4, pp. 656–667, Dec. 1998.

Page 50: Reliable and Fault-Resilient Schemes for Efficient Radix-4

References 41

[38] B. Girau and A. Tisserand, “On-line arithmetic-based reprogrammable hardware

implementation of multilayer perceptron back-propagation,” in Proc. of MicroNeuro

’96, 1996, pp. 168–175.

[39] A. Tisserand and M. Dimmler, “FPGA implementation of real-time digial controllers

using on-line arithmetic,” in Proc. Int. Workshop on Field-programmable Logic and

Applications, Sep. 1997, pp. 472–481.

[40] A. Tenca and M. D. Ercegovac, “A variable long-precision arithmetic unit design

for reconfigurable coprocessor architectures,” in IEEE Symp. FPGAs for Custom

Computing Machines, 1998, pp. 216–225.

[41] A. Tisserand, P. Marchal, and C. Piguet, “An on-line arithmetic based FPGA for

low power custom computing,” in Proc. Int. Workshop on Field Programmable Logic

and Applications, 1999, pp. 264–273.

[42] E. Mosanya and E. Sanchez, “A FPGA-based hardware implementation of general-

ized profile search using online arithmetic,” in Proc. AGM/SIGDA Int. Symp. Field

Programmable Gate Arrays, 1999, pp. 101–111.

[43] D. Lau, A. Schneider, M. D. Ercegovac, and J. Villasenor, “FPGA-based structures

for on-line FFT and DCT,” in Proc. IEEE Symp. Field-programmable Custom Com-

puting Machines, 1999, pp. 310–311.

[44] C. H. Yen and B. F. Wu, “Simple error detection methods for hardware implemen-

tation of advanced encryption standard,” IEEE Trans. Comput., vol. 55, no. 6, pp.

720–731, Jun. 2006.

[45] T. G. Malkin, F. X. Standaert, and M. Yung, “A comparative cost/security anal-

ysis of fault attack countermeasures,” in Proc. Int. Workshop Fault Diagnosis and

Tolerance in Cryptography, Oct. 2006, pp. 159–172.

[46] G. D. Natale, M. Doulcier, M. L. Flottes, and B. Rouzeyre, “A reliable architecture

for parallel implementations of the advanced encryption standard,” J. Electronic

Testing: Theory and Applications, vol. 25, no. 4, pp. 269–278, Aug. 2009.

Page 51: Reliable and Fault-Resilient Schemes for Efficient Radix-4

References 42

[47] M. Mozaffari-Kermani and A. Reyhani-Masoleh, “Concurrent structure-independent

fault detection schemes for the advanced encryption standard,” IEEE Trans. Com-

put., vol. 59, no. 5, pp. 608–622, May. 2010.

[48] M. Mozaffari-Kermani and R. Azarderakhsh, “Efficient fault diagnosis schemes for re-

liable lightweight cryptographic ISO/IEC standard CLEFIA benchmarked on ASIC

and FPGA,” IEEE Trans. Ind. Electron., vol. 60, no. 12, pp. 5925–5932, Dec. 2013.

[49] P. Maistri and R. Leveugle, “Double-data-rate computation as a countermeasure

against fault analysis,” IEEE Trans. Comput., vol. 57, no. 11, pp. 1528–1539, Nov.

2008.

[50] B. Parhami, “High-radix dividers,” in Computer arithmetic Algorithms and Hard-

ware Designs, N. York, Ed. New York: Oxford University Press, 2000, pp. 228–245.

[51] M. D. Ercegovac and T. Lang, “Division by digit recurrence,” in Digital Arithmetic,

D. E. M. Penrose, Ed. San Francisco: Morgan Kaufmann Publishers, 2004, pp.

247–319.

[52] D. L. Harris, S. F. Oberman, and M. A. Horowitz, “SRT division architectures and

implementations,” in Proc. IEEE Symp. Computer Arithmetic, 1997, pp. 18–25.

[53] J. W. Hartwell, “A procedure for implementing the fast fourier transform on small

computers,” IBM J. Research and Development, vol. 15, pp. 355–363, 1971.

[54] S. Pontarelli, P. Reviriego, C. J. Bleakley, and J. A. Maestro, “Low complexity con-

current error detection for complex multiplication,” IEEE Trans. Comput., vol. 62,

no. 9, pp. 1899–1903, Sep. 2013.

[55] M. Nicolaidis, “Carry checking/parity prediction adders and ALUs,” IEEE Trans.

Very Large Scale Integration (VLSI) Systems, vol. 11, no. 1, pp. 121–128, Feb. 2013.

[56] M. Nicolaidis and R. O. Duarte, “Design of fault-secure parity-prediction booth

multipliers,” IEEE Des. Test, vol. 14, no. 2, pp. 60–71, 1997.

Page 52: Reliable and Fault-Resilient Schemes for Efficient Radix-4

References 43

[57] S. H. Mozafari, M. Fazeli, S. Hessabi, and S. G. Miremadi, “A low cost circuit level

fault detection technique to full adder design,” in Proc. IEEE Int. Conf. Electronics,

Circuits and Systems (ICECS), 2011, pp. 446–450.

[58] L. Breveglieri, I. Koren, and P. Maistri, “An operation-centered approach to fault

detection in symmetric cryptography ciphers,” IEEE Trans. Comput., vol. 56, no. 5,

pp. 635–649, May. 2007.

[59] M. Mozaffari Kermani and R. Azarderakhsh, "Reliable hash trees for post-quantum

stateless cryptographic hash-based signatures," in Proc. IEEE Int. Symp. Defect and Fault

Tolerance in VLSI Systems (DFT), pp. -, Oct. 2015.

[60] M. Mozaffari Kermani and A. Reyhani-Masoleh, "Reliable Hardware Architectures for

the Third-Round SHA-3 Finalist Grostl Benchmarked on FPGA Platform," in Proc. IEEE Int.

Symp. Defect and Fault Tolerance in VLSI Systems (DFT), pp. 325-331, Vancouver, Canada,

Oct. 2011.

[61] M. Mozaffari Kermani and A. Reyhani-Masoleh, "A High-Performance Fault Diagnosis

Approach for the AES SubBytes Utilizing Mixed Bases," in Proc. IEEE Workshop Fault

Diagnosis and Tolerance in Cryptography (FDTC), pp. 80-87, Nara, Japan, Sep. 2011.

[62] M. Mozaffari Kermani and A. Reyhani-Masoleh, "A Lightweight Concurrent Fault

Detection Scheme for the AES S-Boxes Using Normal Basis," in Proc. LNCS Cryptographic

Hardware and Embedded Systems (CHES), pp. 113-129, Washington, D.C., USA, Aug. 2008

[63] M. Mozaffari Kermani and A. Reyhani-Masoleh, "A Structure-independent Approach

for Fault Detection Hardware Implementations of the Advanced Encryption Standard," in

Proc. IEEE Workshop Fault Diagnosis and Tolerance in Cryptography (FDTC), pp. 47-53,

Vienna, Austria, Sep. 2007.

[64] M. Mozaffari Kermani and A. Reyhani-Masoleh, "Parity-based Fault Detection

Architecture of S-box for Advanced Encryption Standard," in Proc. IEEE Int. Symp. Defect

and Fault Tolerance in VLSI Systems (DFT), pp. 572-580, Washington, D.C., USA, Oct.

2006.

Page 53: Reliable and Fault-Resilient Schemes for Efficient Radix-4

References 44

[65] M. Mozaffari Kermani and A. Reyhani-Masoleh, "Parity Prediction of S-box for AES,"

in Proc. IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), pp.

2357-2360, Ottawa, Canada, May 2006.

[66] M. Mozaffari Kermani, R. Azarderakhsh, and A. Aghaie, "Reliable and error detection

architectures of Pomaranch for false-alarm-sensitive cryptographic applications," IEEE

Trans. Very Large Scale Integr. (VLSI) Syst., Accepted and to be published in 2015.

[67] M. Mozaffari Kermani, N. Manoharan, and R. Azarderakhsh, "Reliable radix-4

complex division for fault-sensitive applications," IEEE Trans. Comput.-Aided Design Integr.

Circuits Syst., vol. 34, no. 4, pp. 656-667, Apr. 2015.

[68] M. Mozaffari Kermani, K. Tian, R. Azarderakhsh, and S. Bayat-Sarmadi, "Fault-

resilient lightweight cryptographic block ciphers for secure embedded systems," IEEE

Embedded Sys., vol. 6, no. 4, pp. 89-92, Dec. 2014.

[69] S. Bayat-Sarmadi, M. Mozaffari Kermani, and A. Reyhani-Masoleh, "Efficient

and concurrent reliable realization of the secure cryptographic SHA-3 algorithm,"

IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 33, no. 7, pp. 1105-1109,

Jul. 2014.

[70] M. Mozaffari Kermani, R. Azarderakhsh, C. Lee, and S. Bayat-Sarmadi,

"Reliable concurrent error detection architectures for extended Euclidean-based

division over GF(2^m)," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no.

5, pp. 995-1003, May 2014.

[71] M. Mozaffari Kermani and A. Reyhani-Masoleh, "A Low-Power High-

Performance Concurrent Fault Detection Approach for the Composite Field S-box

and Inverse S-box," IEEE Trans. Comput., vol. 60, no. 9, pp. 1327-1340, Sep. 2011.

[72] M. Mozaffari Kermani and A. Reyhani-Masoleh, "A Lightweight High-

Performance Fault Detection Scheme for the Advanced Encryption Standard

Using Composite Fields," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19,

no. 1, pp. 85-91, Jan. 2011.

[73] M. Mozaffari Kermani and A. Reyhani-Masoleh, "Fault Detection Structures of the

S-boxes and the Inverse S-boxes for the Advanced Encryption Standard," J. Electronic

Testing: Theory and Applications (JETTA), vol. 25, no. 4, pp. 225-245, Aug. 2009.

Page 54: Reliable and Fault-Resilient Schemes for Efficient Radix-4

References 45

[74] M. Mozaffari Kermani, "Fault Detection Schemes for High Performance VLSI

Implementations of the Advanced Encryption Standard", M.E.SC. Thesis, The University of

Western Ontario.

[75] M. Mozaffari-Kermani, "Reliable and High-Performance Hardware Architectures for

the Advanced Encryption Standard/Galois Counter Mode," Ph.D. Thesis, University of

Western Ontario.

[76] M. Mozaffari-Kermani, R. Azarderakhsh, "Integrating emerging cryptographic

engineering research and security education", ASEE Conf., scholarworks.rit.edu.

[77] Xilinx. [Online]. Available: http://www.xilinx.com.