F E R M A T
Formal Engineering Research using Methods, Abstractions and Transformations

Technical Report No: 2004-13

Reliability Evaluation of Multiplexing Based Defect-Tolerant Majority Circuits

Debayan Bhaduri Sandeep Shukla
[email protected] [email protected]

Reliability Evaluation of Multiplexing Based Defect-Tolerant Majority Circuits ∗

Debayan Bhaduri Sandeep Shukla

FERMAT Lab, The Bradley Department of Electrical & Computer Engineering

Virginia Polytechnic Institute and State University, Blacksburg, VA 24060

E-mail: {dbhaduri, shukla}@vt.edu

∗This work was supported by NSF Grant CCR-0340740


Contents

1 Introduction

2 Background

3 Model Construction

4 Experiments and Results

5 Conclusion and Future Work

List of Figures

1 A majority multiplexing unit
2 Reliability for I/O Bundle Size of 10
3 Reliability for I/O Bundle Size of 20


Abstract

Defect tolerant architectures are gaining importance for building economical computing systems with billions of devices of nanometer dimension. This is because, at the nanoscale, devices will be prone to errors due to manufacturing defects, ageing, transient faults and quantum physical effects. Given the increase in device density, micro-architects may opt for redundancy based defect tolerant techniques. Many non-silicon manufacturing methodologies, such as Quantum Dot Cellular Automata, implement logic circuits using three-input majority gates as the basic logic devices. Analytical probabilistic models for evaluating reliability/redundancy trade-offs are error prone and cumbersome. We have therefore extended our previous work and analyzed redundancy based majority gate architectures using probabilistic model checking techniques; such analysis provides efficient evaluation of the reliability/redundancy trade-offs.


1 Introduction

In the future, nanotechnology will let us combine the fundamental building blocks of nature easily, inexpensively and in most of the ways permitted by the laws of physics. Continued improvements in lithography have resulted in line widths of less than one micron. Sub-micron lithography is clearly very valuable, but it is equally clear that conventional lithography will not let us build semiconductor devices in which individual dopant atoms are located at specific lattice sites. There is fairly widespread belief that silicon based technologies will continue for at least several more years before reaching their practical limits. If we are to continue these miniaturization trends, we will have to develop new manufacturing technologies that let us inexpensively build computer systems with mole quantities of logic elements that are molecular in both size and precision and are interconnected in complex and highly idiosyncratic patterns. Such technologies will increase defect density, and the assumption of error-free computation may no longer be tenable. Because of the small feature sizes, redundancy based defect-tolerance will be adopted, and conventional techniques such as von Neumann multiplexing [9] may be implemented to obtain high reliability.

Non-silicon manufacturing methodologies such as quantum dots [8] and quantum cellular automata [1, 4] use majority logic devices as the fundamental building blocks of a Boolean network. In this paper, we analyze reliability/redundancy trade-offs for multiplexing based majority circuits by building a generic multiplexing library, which is also an enhancement of our probabilistic model checking based tool NANOPRISM [2]. Such a library can be used to model any arbitrary Boolean circuit, or a portion of a large Boolean network, at different levels of granularity: gate level, logic block level, logic function level, unit level, etc.

2 Background

Defect-Tolerant Computing: Formally, a defect-tolerant architecture is one that uses techniques to mitigate the effects of defects in the devices making up the architecture, and guarantees a given level of reliability. In 1952, von Neumann introduced a redundancy technique called NAND multiplexing [9] for constructing reliable computation from unreliable devices (motivated by the unreliable valve-based computers of that time). He showed that, if the failure probabilities of the gates are sufficiently small and failures are independent, then computations may be done with a high probability of correctness. Pippenger [6] showed that von Neumann's construction works only when the probability of failure per gate is strictly less than 1/2, and that computation in the presence of noise (which can be seen as the presence of defects) requires more layers of redundancy. In [3, 5], NAND multiplexing was compared to other fault-tolerance techniques, and theoretical calculations showed that the redundancy level must be quite high to obtain acceptable levels of reliability.

[Figure 1. A majority multiplexing unit: three input bundles X, Y and Z pass through a random permutation unit U into an executive stage of N majority gates (M); a restorative stage, built from the same permutation-plus-majority-gate pattern, processes the executive stage's output bundle.]

Multiplexing Based Defect-Tolerance: The basic technique of multiplexing is to replace a processing unit by a multiplexed unit with N copies of every input and output of the processing unit. In a multiplexing unit, N devices process the copies of the inputs in parallel to give N outputs. If the inputs and devices are reliable, then each element of the output set will be identical and equal to the output of the processing unit. However, when there are errors in the inputs and the devices are faulty, the outputs will not be identical. Instead, after defining some critical level ∆ ∈ (0, 0.5), the output of the multiplexing unit is considered stimulated (taking logical value true) if at least (1 − ∆) · N of the outputs are stimulated, and non-stimulated (taking logical value false) if no more than ∆ · N of the outputs are stimulated. If the number of stimulated outputs meets neither criterion, i.e. it lies in the interval (∆ · N, (1 − ∆) · N), then the output is undecided and a malfunction occurs. The basic design of a multiplexing unit consists of two stages: the executive stage, which performs the basic function of the processing unit being replaced, and the restorative stage, which reduces the degradation caused in the executive stage by input errors and faulty devices.

In this paper, we consider multiplexing when the processing unit is a single majority gate. We therefore replace the inputs and output of the gate with N copies and, in the executive stage, duplicate the majority gate N times, as in Figure 1. The unit U represents a random permutation of the input signals; that is, each signal of the first input bundle is randomly paired with a signal from the second input bundle to form an input pair for one of the copies of the gate. Also shown in Figure 1 is the restorative stage, which takes the output of the executive stage as its inputs. For a more effective restoration mechanism, this stage can be iterated [9].
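The ∆-threshold interpretation of an output bundle can be sketched as follows (a minimal sketch; the function name and the True/False/None return convention are ours, not from the report):

```python
def interpret_bundle(outputs, delta):
    """Interpret a bundle of N Boolean outputs under critical level delta.

    Returns True (stimulated) if at least (1 - delta) * N outputs are
    stimulated, False (non-stimulated) if no more than delta * N are,
    and None (undecided, i.e. a malfunction) otherwise.
    """
    assert 0 < delta < 0.5
    n = len(outputs)
    k = sum(outputs)              # number of stimulated (logic-high) outputs
    if k >= (1 - delta) * n:
        return True
    if k <= delta * n:
        return False
    return None                   # k lies in (delta*N, (1-delta)*N)
```

For example, with N = 20 and ∆ = 0.1, a bundle with 18 stimulated outputs is read as true, one with 2 as false, and anything in between is a malfunction.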

Probabilistic Model Checking and PRISM: Probabilistic model checking is a range of techniques for calculating the likelihood of the occurrence of certain events during the execution of unreliable or unpredictable systems. The system is usually specified as a state transition system, with probability values attached to the transitions. A probabilistic model checker applies algorithmic techniques to analyze the state space and calculate performance measures. We use PRISM [7], a probabilistic model checker developed at the University of Birmingham, and we use discrete-time Markov chains (DTMCs) to model the generic multiplexing library for majority logic gates. This model of computation is suitable for conventional digital circuits and the fault models considered: manufacturing defects in the gates and transient errors that can occur at any point of time in a Boolean network.
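The kind of computation a probabilistic model checker performs over a DTMC can be illustrated with a toy chain (the states and transition probabilities below are invented for illustration, not taken from the NANOPRISM models):

```python
# A toy DTMC: a single device that is correct, suffering a transient
# error, or permanently defective. Probabilities are illustrative only.
P = [
    [0.97, 0.02, 0.01],   # correct -> {correct, transient, defective}
    [0.90, 0.09, 0.01],   # transient errors usually clear
    [0.00, 0.00, 1.00],   # a permanent defect is absorbing
]

def step(dist, P):
    """One step of the chain: multiply the state distribution by P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

dist = [1.0, 0.0, 0.0]    # start in the correct state
for _ in range(10):
    dist = step(dist, P)
# dist[2] is the probability of having become defective within 10 steps,
# the kind of transient measure a model checker computes exactly over
# the whole state space rather than by simulation
```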

3 Model Construction

In this section we explain the PRISM model of a majority gate multiplexing configuration. The first approach is to model the system directly as shown in Figure 1: a PRISM module is constructed for each multiplexing stage, comprising N majority gates, and these modules are combined through synchronous parallel composition. However, this construction leads to the well-known state space explosion problem. At the same time, we observe that the actual values of the inputs and outputs of each stage are not important; one only needs to keep track of the total number of stimulated (and non-stimulated) inputs and outputs. Furthermore, to compute these values without having to store all the outputs of the majority gates in each stage, we replace the set of N majority gates working in parallel with N majority gates working in sequence. The same methodology is applied across the multiplexing stages of the system, so that the same module is reused for each stage while keeping a record of the outputs from the previous stage. This folds space into time: the same majority gate (and stage) is reused over time rather than replicated in space. The approach does not affect the computed performance of the system, since each majority gate works independently and the failure probabilities of the gates are also independent.

The unit U in Figure 1 performs a random permutation. Consider the case when k outputs from the previous stage are stimulated, for some 0 < k < N. Since there are exactly k stimulated outputs, the next stage will have exactly k of its inputs stimulated if U performs a random permutation; therefore, the probability of either all or none of the inputs being stimulated is 0. This implies that the majority gates in a stage are dependent on one another: for example, if one majority gate has a stimulated input, the probability of another gate having that input stimulated decreases. It is difficult to calculate the reliability of such a system by means of analytical techniques. Changing the number of restorative stages, the bundle size, the input probabilities or the probability of the majority gates failing requires only modifying parameters given at the start of the model description. Since PRISM can also represent non-deterministic behavior, one can set upper and lower bounds on the probability of gate failure and then obtain best- and worst-case reliability characteristics for the system under these bounds.
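The count-tracking construction can be sketched as an exact dynamic program. This is our own sketch, under one stated assumption: we take the restorative stage to feed each majority gate one signal from each of three independently permuted copies of the previous stage's output bundle (the report does not spell out this wiring). Drawing without replacement from each copy is what makes the gates dependent, exactly as discussed above.

```python
from collections import defaultdict

def stage_output_distribution(N, k, p_fail):
    """Exact distribution of stimulated outputs of one restorative stage.

    The previous stage produced N outputs, k of them stimulated. Each
    gate's three inputs are drawn without replacement from three
    independently permuted copies of that bundle (our assumed wiring).
    State: (a, b, c, s) = stimulated signals remaining in each copy,
    and s = stimulated outputs produced so far.
    """
    dist = {(k, k, k, 0): 1.0}
    for gate in range(N):
        m = N - gate                          # signals left in each copy
        nxt = defaultdict(float)
        for (a, b, c, s), pr in dist.items():
            for xa, pa in ((1, a / m), (0, 1 - a / m)):
                if pa == 0.0:
                    continue
                for xb, pb in ((1, b / m), (0, 1 - b / m)):
                    if pb == 0.0:
                        continue
                    for xc, pc in ((1, c / m), (0, 1 - c / m)):
                        if pc == 0.0:
                            continue
                        maj = 1 if xa + xb + xc >= 2 else 0
                        # von Neumann fault: output inverted with p_fail
                        for out, po in ((maj, 1 - p_fail), (1 - maj, p_fail)):
                            if po == 0.0:
                                continue
                            nxt[(a - xa, b - xb, c - xc, s + out)] += pr * pa * pb * pc * po
        dist = nxt
    result = defaultdict(float)
    for (a, b, c, s), pr in dist.items():
        result[s] += pr
    return dict(result)
```

Because each gate consumes one slot per copy, the gates are processed in sequence while only counts are tracked, mirroring the fold of space into time described above.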

[Figure 2. Reliability for I/O bundle size of 10: (a) probability that at most 10% of the outputs are incorrect, and (b) expected percentage of incorrect outputs (large probabilities of failure), each plotted against the number of restorative stages for gate failure probabilities of 0.01, 0.02, 0.03 and 0.04.]

4 Experiments and Results

In this section we report the reliability measures of multiplexing based majority systems for I/O bundles of sizes 10 and 20. These bundle sizes are for illustration only; we have also investigated the performance of these systems for larger bundle sizes. In all the experiments reported in this paper, we assume that the inputs X, Y and Z are identically distributed (often true in circuits containing similar devices), that two of the inputs have a high probability (0.9) of being logic high, and that the third input has a 0.9 probability of being logic low. The circuit's correct output should therefore be stimulated. We also assume that a gate failure is a von Neumann fault, i.e. when a gate fails, the value of its output is inverted. In Figure 2, we consider a bundle size of 10 and gate failure probabilities varying from 0.01 to 0.04. The probability that the system error is less than 10% and the expected percentage of incorrect outputs are plotted against the number of restorative stages. As the results show, beyond a certain number of restorative stages, adding more does not make the system appreciably more reliable: the reliability tends toward a steady state. This is because, at large gate failure probabilities, the restorative stages are themselves significantly affected, and adding them to the architectural configuration does not reduce the degradation in the reliability of computation. From these results we conclude that, for a bundle size of 10, if the gate failure probability is greater than or equal to 0.01, then the system cannot be made more reliable once a sufficient number of restorative stages have been added (in this case, 5).
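Trends of this kind can be cross-checked with a quick Monte Carlo simulation. The sketch below is ours, not the NANOPRISM model: it assumes, as before, that each restorative stage feeds every gate one signal from each of three independently shuffled copies of the previous bundle, and it estimates probabilities by sampling rather than computing them exactly.

```python
import random

def majority(a, b, c):
    """Three-input majority vote."""
    return 1 if a + b + c >= 2 else 0

def faulty_majority(a, b, c, p_fail):
    """Majority gate with a von Neumann fault: output inverted with p_fail."""
    out = majority(a, b, c)
    return 1 - out if random.random() < p_fail else out

def run_multiplexing_unit(N, stages, p_fail):
    """One run of the unit. X and Y are logic high with probability 0.9,
    Z with probability 0.1, so the correct output is stimulated."""
    x = [int(random.random() < 0.9) for _ in range(N)]
    y = [int(random.random() < 0.9) for _ in range(N)]
    z = [int(random.random() < 0.1) for _ in range(N)]
    bundle = [faulty_majority(x[i], y[i], z[i], p_fail) for i in range(N)]
    for _ in range(stages):
        a, b, c = bundle[:], bundle[:], bundle[:]
        random.shuffle(a); random.shuffle(b); random.shuffle(c)
        bundle = [faulty_majority(a[i], b[i], c[i], p_fail) for i in range(N)]
    return bundle

def prob_error_at_most(N, stages, p_fail, frac=0.1, trials=2000):
    """Estimate the probability that at most frac*N outputs are incorrect."""
    hits = 0
    for _ in range(trials):
        incorrect = N - sum(run_multiplexing_unit(N, stages, p_fail))
        if incorrect <= frac * N:
            hits += 1
    return hits / trials
```

Sweeping `stages` for a fixed `p_fail` with `prob_error_at_most(10, stages, p_fail)` should reproduce the saturation trend of Figure 2(a), though with sampling noise that the exact model checking approach avoids.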

[Figure 3. Reliability for I/O bundle size of 20: (a) probability that at most 10% of the outputs are incorrect, plotted against the error of an individual gate (10^x, for x from −8 to −3) with 3, 4, 5 and 7 restorative stages; (b) distribution of the number of non-stimulated outputs for 7 restorative stages, for gate failure probabilities of 0.2, 0.1, 0.02 and 0.0001.]

On the other hand, Figure 3(a) plots the probability that at most 10% of the outputs of the overall system are incorrect (non-stimulated), for small gate failure probabilities and a bundle size of 20, with the number of restorative stages varying between 1 and 7. It indicates that increasing the number of stages can greatly enhance the reliability of the system. However, the rate of increase in reliability decreases as more restorative stages are added, and there is a limit to the reliability that can be gained by adding stages. Figure 3(b) reports the distribution of the non-stimulated (erroneous) outputs for different majority gate failure probabilities, with 7 restorative stages and the same bundle size. We have also computed the output distribution of the system for different numbers of restorative stages, and hence any measure of reliability can be calculated from these results; PRISM can also be used to compute other measures of reliability directly. As expected, the output distributions and the results in Figure 3(a) show that, as the probability of gate failure decreases, the reliability of the multiplexing system increases.
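Deriving such measures from an output distribution is straightforward. The sketch below uses a small invented distribution purely for illustration (the numbers are not taken from the experiments):

```python
def reliability_measures(dist, N, frac=0.1):
    """Given dist mapping (number of non-stimulated, i.e. incorrect,
    outputs) -> probability, for a bundle of size N, return the
    probability that at most frac*N outputs are incorrect and the
    expected percentage of incorrect outputs."""
    p_ok = sum(p for k, p in dist.items() if k <= frac * N)
    expected_pct = 100.0 * sum(k * p for k, p in dist.items()) / N
    return p_ok, expected_pct

# Hypothetical output distribution for a bundle of size 20
example = {0: 0.55, 1: 0.20, 2: 0.12, 3: 0.08, 4: 0.05}
p_ok, pct = reliability_measures(example, 20)
```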

5 Conclusion and Future Work

This paper focuses on the need for automated methodologies for analyzing the reliability of defect-tolerant architectural configurations. Such architectures will be used to implement logic built from emerging non-silicon manufacturing technologies. We have extended our tool NANOPRISM [2] with a DTMC based generic multiplexing framework. A fragment of, or an entire, arbitrary Boolean network can be plugged into this framework to evaluate redundancy/reliability trade-offs. Analytical approaches can be error prone and cumbersome for complex networks of gates; our probabilistic model checking methodology offers a complementary approach for defect-tolerant nano architectures.

It is important to note the difference between the bounds on the probability of gate failure required here for reliable computation and the theoretical bounds presented in the literature. This difference is to be expected: in this paper we evaluate the performance of the system under a fixed configuration (bundle size and number of restorative stages), whereas the bounds in the literature correspond to the scenario where the bundle size or the number of restorative stages can be increased arbitrarily in order to achieve a reliable system.

References

[1] Islamshah Amlani, Alexei O. Orlov, Geza Toth, Gary H. Bernstein, Craig S. Lent, and Gregory L. Snider, Digital logic gate using quantum-dot cellular automata, Science 284 (1999), 289–291. Available at: http://www.nd.edu/~qcahome/reprints/Amlani2.pdf.

[2] Debayan Bhaduri and Sandeep Shukla, NANOPRISM: A tool for evaluating granularity vs. reliability trade-offs in nano architectures, GLSVLSI (Boston, MA), ACM, April 2004. Available at: http://fermat.ece.vt.edu/Publications/pubs/techrep/techrep0318.pdf.

[3] J. Han and P. Jonker, A system architecture solution for unreliable nanoelectronic devices, IEEE Transactions on Nanotechnology 1 (2002), 201–208.

[4] C. Lent, A device architecture for computing with quantum dots, Proceedings of the IEEE 85 (April 1997).

[5] K. Nikolic, A. Sadek, and M. Forshaw, Architectures for reliable computing with unreliable nanodevices, Proc. IEEE-NANO'01, IEEE, 2001, pp. 254–259.

[6] N. Pippenger, Reliable computation by formulas in the presence of noise, IEEE Transactions on Information Theory 34 (1988), no. 2, 194–197.

[7] PRISM web page: www.cs.bham.ac.uk/~dxp/prism/.

[8] R. Turton, The quantum dot: A journey into the future of microelectronics, Oxford University Press, U.K., 1995.

[9] J. von Neumann, Probabilistic logics and synthesis of reliable organisms from unreliable components, Automata Studies (1956), 43–98.
