
Error-Correcting Codes and Self-Checking Circuits




Error-control coding techniques, implemented by means of self-checking circuits, will improve system reliability.

D. K. Pradhan, Oakland University

J. J. Stiffler, Raytheon Company

It is not surprising that error-control coding techniques have been used in computers for many years, especially since they have proven effective against both transient and permanent faults. What is surprising is that coding techniques have not found even more extensive use (in commercial computers, for example), in view of their potential for improving the overall reliability of computers. In this article, therefore, we will bring out some of the reasons for the limited acceptance of error-control coding techniques. In addition, we will examine some code properties, as well as certain implementation techniques, that might help overcome these currently perceived limitations.

Coding for error control

Techniques for coding digital information in order to protect it from errors while it is being stored, transferred, or otherwise manipulated have been the subject of intense investigation for several decades. There is a large body of literature detailing the remarkable theoretical and practical developments resulting from this effort. Although most of this work has been concerned with the error models and decoding constraints encountered in communications, the potential utility of these codes in protecting information in computers has long been recognized. Indeed, such considerations motivated some of the initial work on error-control codes.

Stated somewhat formally, error-control coding entails mapping elements of a data set X = {xi} onto elements of a code-word set Y = {yi}. The code words thus represent the information to be manipulated but are presumably less vulnerable to errors (induced, for example, by failures in the circuitry used to carry out the manipulation) than the original unencoded data. If F denotes the set of independent faults to which the circuitry in question is subject, and if E is the set of errors that can be produced as a result of these faults, then each e ∈ E is an error that can occur as the result of some fault f ∈ F. And if the distance d(y, y') from y to y' is defined as the minimum number of errors in E needed to change y to y', and the distance d associated with a code Y is defined as the minimum distance from any one of its code words to any other code word, then it is not difficult to see that, subject to some mild conditions (for example, the condition that d(y1, y') + d(y2, y') > min[d(y1, y2), d(y2, y1)]), a distance-d code can be used to detect up to at least d−1 errors; to correct up to (d−1)/2 errors; or to correct all errors if their total number does not exceed some c < (d−1)/2, and otherwise to detect any number of errors in the range c+1 to d−c−1.

In the present context, X generally consists of the 2^k binary k-tuples, and Y a subset of the 2^n binary n-tuples for some n > k. The most widely postulated class of errors is that in which individual code-word bits are erroneously complemented; that is, E consists of the n errors ei, i = 1, 2, ..., n, with ei the error caused by complementing the ith bit of any y ∈ Y. The errors encountered in transferring (or transmitting) information from one point to another can often be characterized in this way, as can those observed in retrieving information from certain storage media. The distance between two code words in this case is simply the Hamming distance, i.e., the number of corresponding bit positions in which they differ. However, the faults, and hence the errors, in a given situation strongly depend on the circuitry in question, and the Hamming distance may or may not be the relevant measure.
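Under this bit-complement error model, the distance bounds above are easy to check mechanically. The following sketch (plain Python; the five-bit even-parity code is an illustrative choice, not one taken from the article) computes Hamming distance and verifies that a d = 2 code detects every single-bit error:

```python
def hamming(y1, y2):
    # Hamming distance: number of corresponding bit positions that differ
    return sum(a != b for a, b in zip(y1, y2))

def min_distance(code):
    # minimum distance of a code: smallest pairwise Hamming distance
    return min(hamming(a, b) for i, a in enumerate(code) for b in code[i + 1:])

# Illustrative code-word set: all 5-bit words of even weight, i.e. a
# single-parity code with d = 2, so any d - 1 = 1 complemented bit is detectable.
code = [tuple((n >> i) & 1 for i in range(5))
        for n in range(32) if bin(n).count("1") % 2 == 0]
assert min_distance(code) == 2
flips = [y[:k] + (1 - y[k],) + y[k + 1:] for y in code for k in range(5)]
assert all(w not in code for w in flips)   # every single-bit error is detected
```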

0018-9162/80/0300-0027$00.75 © 1980 IEEE. March 1980, p. 27


As already noted, communications problems have motivated most of the study of error-control codes. Although much of the resulting knowledge can also be exploited in the design of fault-tolerant computers, there are some important differences between the two applications:

(1) Information is generally transmitted serially over a communications channel; it is generally handled in a parallel format in a computer. Consequently, the serial encoding and decoding algorithms developed for communications applications have a much more limited applicability in computer systems.

(2) The time allowed for encoding and decoding is generally more constrained in computer applications than in communications. A few milliseconds' or even a few seconds' delay in decoding information received over a communications channel may be entirely acceptable, whereas even a microsecond's delay in handling critical-path information in a computer could be intolerable.

(3) The complexity of the encoding and decoding circuitry is frequently a much more serious limitation in computer applications than it is in communications. If a code is sufficiently effective in combatting transmission errors, its use may be justified regardless of the complexity of its associated hardware. But in a computer the main reason for encoding information is to protect it against hardware faults. Unless the hardware needed to generate and check the code is relatively simple compared to the hardware thus monitored, a fault-prone decoder could increase rather than decrease the likelihood of erroneous information propagation.

(4) Anticipated errors in a computer may be different from those in a communications system. Even when the first-order error statistics are the same, differences in higher-order statistics may considerably change the relative effectiveness of various coding techniques in the two applications (see "Coding for RAMs" later in this article).

(5) Communication codes are designed to protect information as it is transferred from one place to another. Although this function is important in computer applications as well, and is in fact the function of major interest here, it should be noted that other constraints on computer error-control codes are also sometimes desirable. In particular, it is frequently desirable to be able to protect information not only when it is being stored or transferred, but also when it is being modified. Many computer operations entail the evaluation of functions of two variables: xk = f(xi, xj). If a code is to protect information even while it is being manipulated in this manner, the code must be preserved under such operations; that is, for every function f of interest and for every xi, xj, xk ∈ X, there must exist a function g such that yk = g(yi, yj) whenever xk = f(xi, xj), with yi, yj, and yk representing the encoded versions of xi, xj, and xk.
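For one particular choice of f, such a g is easy to exhibit: an even-parity code is preserved under bit-by-bit exclusive-OR, since the parity of a bitwise XOR is the XOR of the parities. A minimal sketch (the four-bit data width is an arbitrary assumption):

```python
def encode(x, k=4):
    # append an even-parity check bit to a k-bit value
    bits = [(x >> i) & 1 for i in range(k)]
    return tuple(bits + [sum(bits) % 2])

def is_codeword(y):
    return sum(y) % 2 == 0

# For f(xi, xj) = xi XOR xj, the corresponding g is bitwise XOR on the
# code words: the check bit of the result is again the parity of the
# data bits, so the code is preserved under this operation.
for xi in range(16):
    for xj in range(16):
        yk = tuple(a ^ b for a, b in zip(encode(xi), encode(xj)))
        assert is_codeword(yk) and yk == encode(xi ^ xj)
```

Bitwise AND or OR, by contrast, does not preserve this code, which illustrates why no single known class of codes covers all operations usually implemented in a computer.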

As might be expected, this last constraint can be severe, particularly when the class of functions f(xi, xj) is large. Codes that are preserved under certain arithmetic operations have been developed [15]; others are preserved under bit-by-bit logical operations [21]; but no known class of codes is preserved under all operations usually implemented in a computer. Although the codes available for logical operations can be used for arithmetic operations [13,14], and vice versa [19], such use is likely to be too inefficient to be of interest. To date, no efficient code useful for both logical and arithmetic operations is available. To achieve a breakthrough, then, one has to look into new and unconventional coding schemes. Unfortunately, research in this area has been limited, partly because of certain negative results established earlier by Elias [17] and by Peterson and Rabin [20] in the area of single-error detection in logical computations. However, it is important to note that more recent work by Pradhan and Reddy [21] exhibits an efficient code for multiple-error detection/correction in logical computations.

With the possible exception of (4), the differences between communications and computer applications of error-control coding militate against the latter application. Nevertheless, error-control coding techniques, judiciously applied, can be highly effective in increasing computer reliability. In many cases, codes developed primarily for communications applications can be used to advantage in computers, particularly if the peculiarities of the error patterns likely to occur in a computer are fully exploited. A good example of this is the type of error coding used in RAMs (discussed in detail later in this article).

The principles of error-correcting codes can also be used to design fault-tolerant logic. Significant work in this area has been reported both in the US [40-45] and in the USSR [33-39]. This research includes the design of counters (Reed and Chiang [45]; Hsiao et al. [40]); synchronous sequential machines (Larsen and Reed [42]; Meyers [41]); and asynchronous sequential machines (Pradhan [44]; Pradhan and Reddy [43]; Sagalovich [37-39]).

The basic technique used in all of these designs incorporates static redundancy as an integral part of the design. The states of the circuit are encoded into code words of an error-correcting code (in the case of synchronous circuits) or its analog (in the case of asynchronous circuits). The output and next-state functions are defined so that the effect of any fault from a prescribed set is masked; i.e., the network produces correct outputs in spite of the fault. This is accomplished by defining the behavior of the machine for potentially faulty states so that it corresponds to that of the correct state.

This coding technique differs from the replication technique in two respects: the redundancy is implicit, and there is no explicit error-correcting logic, such as majority logic. The advantage of coding over replication is that the former may require fewer I/O pins and chips. This follows from the fact that the number of state variables required by the coding scheme is much smaller than that required by replication, and from the fact that the fault-tolerant circuit that uses coding can be implemented as a single circuit.

Two important possible applications of this technique are yield enhancement and reliability improvement [30]. Yield is improved when chips with e or fewer cell failures are not discarded in acceptance testing, where e is some number less than the error-correction capability, t. The remaining (t−e) error-correction capability is used for reliability improvement.

The principles of coding have also found use in the design of self-checking circuits [46-49], which are discussed next.
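The table-driven flavor of this masking approach can be sketched for a hypothetical modulo-4 counter whose states are encoded in a three-fold repetition code (both the machine and the code are illustrative assumptions, not designs from the article):

```python
from itertools import product

def encode(s):
    # encode a 2-bit state as a 6-bit word by 3-fold repetition (distance 3)
    bits = [(s >> 1) & 1, s & 1]
    return tuple(bits * 3)

def nearest_state(word):
    # the valid state whose code word is closest in Hamming distance
    return min(range(4),
               key=lambda s: sum(a != b for a, b in zip(encode(s), word)))

# The next-state function is defined for EVERY 6-bit pattern, so a single
# flipped state bit is masked implicitly, with no separate corrector logic:
# a faulty state simply behaves like the correct state it is closest to.
NEXT = {w: encode((nearest_state(w) + 1) % 4) for w in product((0, 1), repeat=6)}

state = NEXT[encode(0)]                  # normal step: state now encodes 1
faulty = (1 - state[0],) + state[1:]     # a fault flips one state flip-flop
state = NEXT[faulty]                     # masked: next state encodes 2 anyway
assert state == encode(2)
```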

Self-checking circuits

Self-checking circuits, in general, are that class of circuits in which the occurrence of a fault can be determined by observing the outputs of the circuits. An important subclass of these self-checking circuits is known as "totally self-checking," or TSC, circuits. TSC circuits are of special interest and so will be discussed here in detail.

Loosely speaking, a circuit is TSC if a fault in the circuit cannot cause an error in the outputs without detection of the fault. Also, the information regarding the presence of the fault is available in TSC circuits as an integral part of the outputs. These TSC circuits are particularly useful in reducing or eliminating the "hard core" of a fault-tolerant system. (A "hard core" is that part of a system that cannot tolerate failures, e.g., decoders for error-correcting codes.) The following is an example of a circuit which has built-in fault-detection capability. Through this example, we develop the motivation for the TSC design, which is discussed next.

The autonomous linear shift register shown in Figure 1 is a linear counter with a cycle length of 15. A special feature of this counter is that it can detect faults in itself on-line. The shift register produces the nonzero code words of a five-bit single-parity code in a cyclic chain. The states of the shift register are five-bit code words: the first four bits are information bits, and the last bit is the parity bit.

To begin with, the shift register is initialized to a nonzero code word in the single-bit parity code; for each successive shift, it produces a new code word, as shown in Table 1. After 15 shifts, the shift-register counter starts repeating the cycle.

Any single error in the outputs of the shift register can be detected by checking the parity. This requires the addition of extra logic, as shown in Figure 1. The output of this extra logic will be 0 in the absence of any errors, and 1 in the presence of a single error.

It can be shown that this design detects all single faults that may occur on the shift register [58]. However, the inherent difficulty in this design is that faults cannot be allowed to occur in the parity-check logic, because such a fault could prevent the circuit

Figure 1. A linear counter designed for fault-detection.

Table 1. Shift register contents after different shifts.

SHIFT   CONTENT
 0      1 0 0 1 0
 1      0 1 0 0 1
 2      1 0 0 0 1
 3      1 1 1 0 1
 4      1 1 0 1 1
 5      1 1 0 0 0
 6      0 1 1 0 0
 7      0 0 1 1 0
 8      0 0 0 1 1
 9      1 0 1 0 0
10      0 1 0 1 0
11      0 0 1 0 1
12      1 0 1 1 1
13      1 1 1 1 0
14      0 1 1 1 1
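Because Figure 1 is not reproduced here, the exact feedback network is not recoverable; the following sketch assumes a maximal-length four-bit feedback register with illustrative taps, appending an even-parity bit to each state as Table 1 does:

```python
def counter_states(seed=0b1001):
    # Hypothetical 4-bit maximal-length LFSR (taps for x^4 + x^3 + 1, an
    # assumption); each state is extended with an even-parity bit.
    s, words = seed, []
    while True:
        info = [(s >> i) & 1 for i in (3, 2, 1, 0)]
        words.append(tuple(info + [sum(info) % 2]))
        s = ((s << 1) | (((s >> 3) ^ (s >> 2)) & 1)) & 0xF
        if s == seed:
            return words

states = counter_states()
assert len(states) == 15                      # cycle length 15
assert len(set(states)) == 15                 # all 15 states distinct
assert all(sum(w) % 2 == 0 for w in states)   # every state is a code word
# any single bit error makes the parity check fire:
assert all(sum(w[:k] + (1 - w[k],) + w[k + 1:]) % 2 == 1
           for w in states for k in range(5))
```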

from detecting any errors thereafter in the outputs of the shift register (for example, a stuck-at-0 fault at the output of this extra logic). Thus, a subsequent fault in the shift register and the corresponding error could go unnoticed.

The presence of a hard core in the design is responsible for the difficulty described above. TSC circuits have been developed to overcome precisely this type of problem. One fundamental feature of TSC design is that redundancy is embedded directly into the outputs, making it possible to partition the set of outputs into a set of code-word (valid) outputs and a set of non-code-word (invalid) outputs. The code-word outputs are the set of outputs that normally appear at the outputs of the network; the appearance of a non-code word at the outputs of the network, on the other hand, signals the presence of a fault in the network. However, the features that distinguish the TSC network from other self-checking networks are its fault-secure [46] and self-testing [49] properties.

These two properties can best be described in terms of the input/output mapping of the network. Let X and Y represent the sets of inputs and outputs, respectively. Let Y = Y1 ∪ Y2, where Y1 and Y2 represent the code-word (valid) and non-code-word (invalid) outputs, respectively.

Figure 2. The fault-secure property of TSC networks.

Figure 3. The self-testing property of TSC networks.

Let F represent a prescribed set of faults for which the design is TSC. Let y be the correct output for the input x. (Note that x ∈ X and y ∈ Y1.) Let y' be the output of the network for the same input x in the presence of a fault f ∈ F. Fault-secureness guarantees that if y' is a code word, y' ∈ Y1, then y' = y. In other words, the output of the faulty network cannot be a code word and, at the same time, be different from the correct output y. Thus, as long as the output is a code word, it can safely be assumed to be correct. This is illustrated in Figure 2.

On the other hand, the self-testing property

guarantees that for every fault f ∈ F, there exists at least one input x ∈ X for which the resulting output y' is a non-code word, y' ∈ Y2; i.e., input x will result in the signaling of the presence of the fault. In other words, the input set contains at least one test for every fault in the prescribed set; thus, the occurrence of any fault is bound to be detected sometime during operation (see Figure 3).

While TSC networks possess these interesting features, they do have some practical limitations. For one thing, although every fault is detectable sometime during operation, there is no guarantee that the first fault will be detected before a second fault occurs. Such a pathological case might be the occurrence of a fault detectable only by a particular input; the possibility exists that this input can become invalid as a test if a second fault occurs before the input is actually applied. This has been referred to as the error latency problem [59].

Second, the application of TSC circuits has been limited by the absence of any systematic technique that realizes the self-testing property in an economical way. For example, although fault-secureness can be readily achieved in asynchronous networks, incorporating the self-testing property can be too formidable a task [57].

However, a class of TSC circuits known as TSC

checkers has a significant potential for large-scale use in the design of fault-tolerant computers. A TSC checker [46] is a TSC circuit designed to detect errors in the error-detecting codes used in fault-tolerant computers. The most basic application of TSC checkers, however, is their use in monitoring a general TSC network, as described below.

The outputs of the TSC network are fed to a TSC checker designed so that any non-code word at its inputs produces only a non-code word at its outputs. Thus, by observing the output of the checker, one can detect any fault in the TSC network or the checker itself. (However, the checker output does not provide any information as to the location of the fault, i.e., whether the fault is in the TSC circuit or in the checker itself.)

TSC checkers have two output leads and, hence, four output combinations: (00, 01, 10, 11). Two combinations are considered valid (code-word) outputs; they are usually (01, 10). The appearance of an invalid (non-code-word) output combination indicates either the presence of an error in the input code word or a fault in the checker. The function of TSC checkers (see Figure 4) is to detect any errors in the input code word, as well as any faults that may occur in the checker itself, as long as the two do not occur at the same time. Table 2 presents the checker outputs for the four possible cases.

In the first case, the checker is fault-free and the code word is error-free; the output of the checker is always one of the valid outputs. In the second case, there is an error in the input code word, but there are no faults in the checker; here, the output will always be an invalid output, so that the error in the code word can be detected. In the third case, there is a fault in the checker, but there is no error in the code word; the output may be either the correct, valid output or an invalid output, depending on whether or not the input code word is a test for the fault in the checker. Finally, when there is an error in the code word and a fault in the checker, the output is indeterminate.

As an example, consider a TSC checker design for single-parity codes, where the prescribed set of faults is the set of single faults. This design requires the use of two separate parity checkers. (Figure 4 shows the

Table 2. Outputs for TSC checker.

                         CHECKER: FAULT-FREE    CHECKER: FAULTY
CODE WORD: ERROR-FREE    VALID                  VALID/INVALID
CODE WORD: WITH ERROR    INVALID                INDETERMINATE



design for a nine-bit code.) The bits in the code word u1, u2, ..., um, um+1, ..., un are divided into two groups: u1, u2, ..., um and um+1, um+2, ..., un. (For optimal design, m is equal to n/2.) These two groups then form the inputs to the two different parity-check circuits. The first circuit produces the output g = u1 ⊕ u2 ⊕ ... ⊕ um, and the second one yields the output h = um+1 ⊕ um+2 ⊕ ... ⊕ un.

To illustrate that this design is indeed TSC, first let us consider even-parity codes. Since any code word has an even number of 1's, both groups of bits contain either an even or an odd number of 1's. Thus, the valid outputs from the network correspond to (01, 10). In the presence of a single error in the input code word, on the other hand, one of the two groups will have an odd number of 1's and the other an even number, so a single error at the inputs will produce either 11 or 00 as the output. Since single faults can produce an error only at one of the two outputs, this TSC checker is fault-secure.

The self-testing property of the checker can be deduced from the following observations: the set of input code words applies all possible input combinations to each of the parity-check circuits. Thus, a fault in one of these parity-check circuits will result in an error at its output sometime during operation and will therefore be detected as an invalid network output. As an example, consider a stuck-at-1 fault at the output lead of the EX-OR gate shown in Figure 4. This fault is detected by a large number of code words, including the one shown.
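The behavior of such a two-group parity checker can be sketched as follows (n = 9; the 4/5 split of the bits and the inverter on the h rail are assumptions about Figure 4, which is not reproduced here):

```python
def checker(u):
    # split the 9-bit word into two groups; for even-parity inputs the
    # two rails always disagree (01 or 10), so 00 or 11 signals an error
    g = sum(u[:4]) % 2          # parity of u1..u4
    h = 1 - (sum(u[4:]) % 2)    # parity of u5..u9, inverted
    return (g, h)

even_words = [tuple((n >> i) & 1 for i in range(9))
              for n in range(512) if bin(n).count("1") % 2 == 0]
assert all(checker(w) in {(0, 1), (1, 0)} for w in even_words)
# any single input error drives both rails to agree, an invalid output:
assert all(checker(w[:k] + (1 - w[k],) + w[k + 1:]) in {(0, 0), (1, 1)}
           for w in even_words for k in range(9))
```

Every even-parity input makes the two rails disagree, while any single input error (or an error confined to one parity tree) forces them to agree, which is exactly the invalid indication.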

Similarly, for odd-parity codes, the above design can be modified by deleting the inverter at the output of h.

Now consider the design of the linear counter shown in Figure 1. This design can be made TSC by replacing the error-detection logic with a TSC checker, as shown in Figure 5. Although the checker never receives the all-0 code word, it is still self-testing for all single faults.

A point worth noting is that since all TSC checkers have at least two output leads, there may be a need for some hard-core logic to monitor the checker outputs. However, the value of the TSC checker is its capacity to significantly reduce the hard core in a fault-tolerant system.

TSC checkers for several other codes, such as constant-weight codes [46,52,55,59-61], Hamming codes [7], and Berger codes [48-56], are available in the literature. The next two sections discuss the use of codes for error control in RAMs and certain integrated circuits.

Coding for RAMs

Figure 4. A TSC checker for 9-bit single-parity code.

Figure 5. A TSC linear counter.

Errors in binary communications channels generally affect the transmitted bits either independently or in bursts, and they tend to be symmetric; that is, an error is roughly as likely to convert a 1 to a 0 as conversely. The same can often be said about errors in computers due to bus faults or faults in the storage medium (although other types of errors can also occur; see "Coding for LSI circuits and transient faults" below). Codes designed to combat such faults are sometimes referred to as transfer-error-control codes to distinguish them from codes used to control other types of errors (e.g., arithmetic or logical errors). The minimum Hamming distance between any two distinct words in such a code clearly provides an indication of its effectiveness against independent bit errors. Moreover, if the word "bit" in the previous definition of Hamming distance (see "Coding for error control") is replaced by "symbol," the same measure can also be used to gauge the effectiveness of codes in combatting errors (e.g., byte-oriented errors) confined to discrete groups of bits [2,24] (with each group of bits treated as a single symbol).

Virtually all useful transfer-error-control coding

techniques involve generalizations of the familiar parity-check concept. A simple parity check on the binary digits representing the information (i.e., a (k+1)th bit representing the modulo-two sum of the k



data bits) obviously provides a means of detecting a change (error) in any single bit, or indeed in any odd number of bits. More powerful codes (codes having minimum distances d > 2) are constructed by simply appending more parity-check bits, with each sum bit representing the parity of a different subset of the k information bits. Techniques for constructing error-control codes in this manner are well documented and need not be discussed further here [2,3].

Such codes protect the contents of memory against hardware malfunctions simply by storing the parity-check bits belonging to each word along with the word itself. When a word is retrieved from memory, the parity bits that should be associated with that word can be redetermined and compared with those actually retrieved. Any difference between the calculated and retrieved parity bits, called the "error syndrome," indicates the presence of one or more errors. Moreover, if the number of errors does not exceed the number that can be corrected, the syndrome uniquely identifies those bits that are erroneous.

This procedure, while conceptually straightforward, can be complex to implement when even moderate numbers of errors are to be corrected. The widespread use of multiple-error-correcting codes in communication systems has come about largely because the mathematical structure used to define good error-control codes (the cyclic code structure) can also be exploited to reduce significantly the complexity of their decoders. Unfortunately, the resulting decoders, while well suited to the serial data flow encountered in communications, would significantly increase the time needed to retrieve a corrected word from memory. (This is not to say that the cyclic structure imposed on error-control codes cannot be used to advantage in computer applications; see, for example, Brown and Sellers [27] and Chien [28].)

The use of error-control codes to protect the contents of a computer's main memory is therefore limited by two factors: fast (parallel) decoders tend to be too complex, and simple (serial) decoders tend to be too slow. As a result, the codes used in main memories have been restricted to single-error-detection (d = 2); single-error-correction (d = 3); or single-error-correction, double-error-detection (d = 4) codes, since either the delay or the cost of the circuitry entailed in correcting more than a single error is generally unacceptable.

There are several ways to correct multiple errors in

memories, however, without incurring either the delays or the complexities usually associated with multiple-error correction. These methods, to be effective, must take advantage of the fact that certain error patterns are much more likely than others to be encountered in random-access memories. The error-pattern distribution clearly depends on both the memory technology and organization.

Consider, for example, a RAM in which faults occur independently, each fault affecting only a single bit position (but possibly affecting the same bit in every word). Such error patterns are likely to occur, for example, in plated-wire memories and in N-word × 1-bit semiconductor memories. One method of handling

this fault pattern is simply to use a d = 2 or d = 3 code to detect errors as they occur, and then to isolate the defective bit position and switch in a spare bit line to replace it on all subsequent stores. This, of course, requires extra hardware to implement the spare bit lines and the switch needed to make them accessible. The code is used only to detect errors (or, at most, to correct errors when the memory's contents are being recovered following a detected error). The time needed to detect an error, even when added to the delay introduced by the switch, can be considerably less than the time needed to correct even a single error. The number of errors that can be corrected using this method, so long as they occur singly, is limited only by the number of spare bit lines. (If errors tend to occur in clusters, a suitable single-symbol error-detecting or -correcting code [2] might be used instead of the single-bit error-detecting or -correcting code assumed here.) If the memory were implemented with N-word by m-bit semiconductor devices, for example, the symbol alphabet might be defined as the set of 2^m binary m-tuples.

Another method for coping with multiple errors in

random-access memories without introducing excessive decoding delays or excessively complex hardware is to take advantage of the erasure-correcting capability of transfer-error-control codes [32]. A potentially erroneous bit is called an erasure if its location is known and the only uncertainty is whether or not it is actually correct. This, of course, is the situation typically encountered in a RAM once a bit line has been diagnosed as defective (and is not replaced). Erasure correction has two major advantages: (1) Distance-d transfer codes can be used to correct up to d−1 erasures, as opposed to a maximum of (d−1)/2 errors. (This is easily verified: if two code words are distance d apart, at least d bits have to be erased in each before the two words can be identical in their unerased positions.) And (2) erasures can be corrected more quickly than errors. It is only necessary to determine and store the syndromes associated with the various combinations of erasures to be corrected, and to compare the calculated syndrome of each word read from memory with this stored set. The erased positions containing errors are known as soon as a match is found. If the calculated syndrome does not match a stored syndrome (and if it is not the all-0 syndrome, indicating no errors), a new bit position contains an error. It is then only necessary to determine the location of that error, identify that position as an erasure, and augment the set of stored syndromes to reflect this added erasure. This procedure can continue until the erasure-correction capability of the code has been exhausted.
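The erasure advantage can be seen already with the simplest (d = 2) single-parity code, which corrects one erasure even though it corrects no errors; a minimal sketch:

```python
def correct_erasure(word, erased):
    # word: list of bits where word[erased] is untrusted (known-bad bit line);
    # with an even-parity (d = 2) code, d - 1 = 1 erasure is correctable,
    # although (d - 1) // 2 = 0 errors are.
    rest = [b for i, b in enumerate(word) if i != erased]
    fixed = list(word)
    fixed[erased] = sum(rest) % 2   # the value forcing overall even parity
    return fixed

stored = [1, 0, 1, 1, 1]            # even-parity code word
garbled = [1, 0, 0, 1, 1]           # bit line 2 known to be defective
assert correct_erasure(garbled, 2) == stored
```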

It is important to reemphasize that the effectiveness both of a code and of the procedure used to decode it strongly depends on the failure modes of the hardware. Consider, for example, the reliability of a 32-bit, 4096-word RAM array protected by a distance-4 code. (Any single-bit error is corrected; if a second error is detected, the memory is no longer used.) Suppose the memory array is to be implemented with 1024 x 1 semiconductor chips, each having a λ = 10-6 failure/hour hazard rate; and suppose all failed chips exhibit one of two symptoms: either a single cell (bit-storage element) fails without affecting the rest of the device, or the entire chip fails. Let γλ denote the hazard rate associated with the first type of failure, then, and (1-γ)λ the rate associated with the second type of failure. The probability R(t) that the array is still usable after six months of operation is plotted in Figure 6. As the plot shows, the effectiveness of the code is highly dependent on γ, the fraction of failures confined to a single cell. The probability that the code is inadequate (i.e., that the array is no longer usable) varies by nearly three orders of magnitude, from .0056 percent when γ = 1 to 5 percent when γ = 0. The conclusion is apparent: a coding scheme may be considerably less effective than expected if the types of failures are considerably different than expected. So unless the likelihood of various types of failures can be reliably predicted, it is generally better to select as robust a coding procedure as possible (i.e., one that works well regardless of the type of failure). In this example, a multiple-erasure-decoding scheme, in which each chip is treated as unreliable as soon as it exhibits a malfunction, might well be preferable.
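The Figure 6 trend can be reproduced with a small Monte Carlo sketch. The 4096-word by 32-bit array and the 1024 x 1 chips are from the example above; the chip-to-array layout (four groups of 1024 words, one chip per bit position per group), the constant names, and the trial counts are assumptions made for this illustration.

```python
import math
import random

LAMBDA = 1e-6                    # per-chip hazard rate, failures/hour
HOURS = 24 * 365 / 2             # six months of operation
P_FAIL = 1 - math.exp(-LAMBDA * HOURS)   # per-chip failure probability
BITS, GROUPS, WORDS_PER_CHIP = 32, 4, 1024   # 128 chips in all

def array_unusable(gamma, rng):
    """One trial: draw chip failures and report True if some word ends
    up with two or more bad bits, defeating single-error correction."""
    failures = []    # (group, bit, word), with word=None for a whole-chip failure
    for bit in range(BITS):
        for group in range(GROUPS):
            if rng.random() < P_FAIL:
                if rng.random() < gamma:              # single-cell failure
                    failures.append((group, bit, rng.randrange(WORDS_PER_CHIP)))
                else:                                 # whole-chip failure
                    failures.append((group, bit, None))
    # Two failures defeat the code if they hit the same word in
    # different bit positions.
    for i, (g1, b1, w1) in enumerate(failures):
        for g2, b2, w2 in failures[i + 1:]:
            if g1 == g2 and b1 != b2 and (w1 is None or w2 is None or w1 == w2):
                return True
    return False

def unusable_fraction(gamma, trials=10000, seed=1):
    rng = random.Random(seed)
    return sum(array_unusable(gamma, rng) for _ in range(trials)) / trials
```

Under these assumptions the estimated probability of inadequacy is a few percent at γ = 0 and drops by orders of magnitude as γ approaches 1, mirroring the dependence on failure mode shown in Figure 6.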

Coding for LSI circuits and transient faults

As illustrated by the example in the last section, a code can be effective if there is a good match between the type of errors for which the code is designed and the type of errors that occur. This section focuses on codes specifically designed for a class of errors different from the types discussed so far: the so-called unidirectional error, one that contains either 1-to-0 or 0-to-1 errors, but not both. (There is evidence that unidirectional errors occur in many integrated circuits;69 faults such as short-circuit faults are a likely source of these errors.)

Existing codes developed to control unidirectional errors are reviewed here. Their inadequacies are discussed, since these inadequacies have prompted development of new codes68 particularly effective against transient faults.

The various faults in a standard LSI device, the read-only memory, provide a clear illustration of how unidirectional errors are caused in practice. (It is important to note that the following discussion of the sources of unidirectional errors in ROMs is relevant to certain technologies and not to all.)

Unidirectional errors in ROMs have a number of likely sources:

(1) Address decoder. Single and multiple faults in address decoders result in either no access or multiple access.51 No access yields an all-0-word readout, and multiple access causes the OR of several words to be read out. In both cases, the resulting errors are unidirectional, as they contain either 0-to-1 or 1-to-0 type errors, but not both. (In the case of multiple access, when the correct code word is not contained in the accessed set of code words, the error in the correct code word is not necessarily unidirectional. However, this error can be modeled as a unidirectional error in some other code word that is contained in the accessed set. Hence, any unidirectional error-detection scheme can detect the error.)

Figure 6. Code effectiveness as a function of failure mode. (Log probability that the code is inadequate, plotted against γ, the fraction of single-bit failures.)

(2) Word line. An open word line may cause all bits in the word beyond the point of failure to be stuck at 0. On the other hand, two word lines shorted together will form an OR function beyond the point where they are shorted. In either case, the resulting errors are unidirectional.

(3) Power supply. A failure in the power supply usually results in a unidirectional error.
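The multiple-access case in source (1) is easy to make concrete: an OR-of-words readout can only turn 0s into 1s relative to any one of the accessed words, so the error is unidirectional with respect to each of them. A tiny sketch (the word values are arbitrary examples):

```python
def multiple_access_readout(words):
    """Model a decoder fault that selects several words at once: the
    bit lines see the OR of all selected words."""
    out = 0
    for w in words:
        out |= w
    return out

# Relative to any one accessed word, the readout differs only in
# 0-to-1 positions, so a unidirectional error-detecting code catches it.
words = [0b001101, 0b011010, 0b100110]
readout = multiple_access_readout(words)
for w in words:
    assert readout & w == w      # every 1 of w survives: only 0-to-1 errors
```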

There are two classes of codes that can detect all unidirectional errors: constant-weight codes, which are nonseparable, and Berger codes,63 which are separable. A code is separable if the information contained in any code word is represented directly by the k-bit number in some fixed k positions. In other words, a code C with M code words is separable if, for every i, 0 ≤ i ≤ M-1, there exists a single code word, xi ∈ C, in which the number i appears in some fixed k positions of xi.



On the other hand, the information contained in a code word in a nonseparable code cannot be obtained without using a special decoder circuit. The nonseparable codes are, therefore, not useful in most computer applications, such as error-control coding of operands in arithmetic and logic processors, addresses for data in memory, and horizontal microinstructions in control units.5

The nonseparable m-out-of-n code consists simply of all possible binary n-vectors with m 1's in them. Unidirectional errors result in a change in the number of 1's in the code word; hence, these errors are detected. Because these codes are of the nonseparable type, they have limited use. However, significant work has already been performed on the design of TSC checkers for m-out-of-n codes. (In fact, the availability of efficient TSC checkers56-59 is precisely what makes these codes of some practical interest.) The ESS computer is the first known application of m-out-of-n codes.

In contrast, Berger codes63 are separable and therefore have a much greater potential for application in fault-tolerant computers. Recently, efficient TSC designs for Berger codes have become available.55 Although these codes have not yet found implementation in fault-tolerant computers, there is significant potential for their use in both synchronous and asynchronous circuits.57
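Both detection mechanisms are simple enough to state directly in code. The sketch below checks a constant-weight word by counting 1s, and forms the Berger check symbol as the count of 0s among the information bits; any unidirectional error either changes the weight (constant-weight case) or moves the information's zero count and the stored check value in opposite directions (Berger case). Function names and bit widths are our own illustrative choices.

```python
def is_valid_m_out_of_n(word, n, m):
    """Constant-weight (m-out-of-n) code: a word is valid iff exactly
    m of its n bits are 1.  A unidirectional error only adds 1s or
    only removes 1s, so it always changes the weight and is detected."""
    return bin(word & ((1 << n) - 1)).count("1") == m

def berger_check_bits(info, k):
    """Berger code check symbol: the number of 0s among the k
    information bits, written in binary (a separable code)."""
    return k - bin(info & ((1 << k) - 1)).count("1")

def berger_valid(info, check, k):
    # 0-to-1 errors cannot raise the zero count yet can only raise the
    # stored check value (and vice versa for 1-to-0), so any nonzero
    # unidirectional error pattern makes this comparison fail.
    return check == berger_check_bits(info, k)
```

The separability of the Berger code is visible in the interface: the information bits pass through untouched, with the check symbol carried alongside.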

Error correction is one of the most effective error-control techniques for transient faults, since these faults are often environmentally induced and hard to diagnose.

It is interesting to note, however, that there are two reasons why neither of the above-described codes may ever find widespread application in fault-tolerant computers: their inability to correct any errors, and their incompatibility with parity-check codes, currently the primary codes used in computers.

Consequently, recent research64,68 has focused on developing codes not only compatible with parity-check codes and able to correct a limited number of errors, but also able to detect certain unidirectional errors. Before we illustrate this further, we will provide the motivation for error correction in the context of transient faults.

Error correction is one of the most effective error-control techniques for transient (or intermittent) faults. These faults constitute a large share of the faults encountered during operation and have the following general characteristics:

(1) They are often environmentally induced; the circuits most vulnerable to transient faults are those operating close to their tolerance limits because of aging, overloading, etc. As a result, transient faults are extremely difficult to diagnose during initial (acceptance) testing. Therefore, some form of on-line protection is essential to control errors resulting from these faults.

(2) They cause errors that are often nonrecurring. Therefore, dynamic redundancy techniques such as on-line testing and switching to spares may not be effective, since all that is needed may be to restore the correct information, not to discard physical components.

(3) They produce two types of errors: independent errors or "bursty" errors. The errors caused by a single transient fault are likely to be limited in number if they are independent, and unidirectional if they are bursty.

All these characteristics lead to an error-control approach that may prove to be most effective against transient faults: to use codes that can correct some t random errors, as well as detect all unidirectional errors. Recently, some attempt has been made to construct precisely such codes.64,67,68

An example of a random error-correcting and unidirectional error-detecting code is shown in Table 3. The code, C, is a systematic code and can both correct single errors and detect all unidirectional errors.

It is important to note that this code is a parity-check code as well as a systematic (separable) one. Therefore, it does not have the shortcomings of either Berger codes (which are not parity-check codes) or m-out-of-n codes (which are not separable codes).

The following equations describe the parity-check relationship between the check bits and the information bits:

p1 = u1 ⊕ 1
p2 = u2 ⊕ 1
p3 = u1 ⊕ u2
p4 = u1 ⊕ u2 ⊕ 1

(p1, p2, and p4 are odd parities.)

The equations for the four-bit error syndrome (s1, s2, s3, s4) are as follows:

s1 = p1 ⊕ u1 ⊕ 1
s2 = p2 ⊕ u2 ⊕ 1
s3 = p3 ⊕ u1 ⊕ u2
s4 = p4 ⊕ u1 ⊕ u2 ⊕ 1
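The parity and syndrome equations, together with the syndrome table, transcribe directly into a small encoder/decoder. This is a sketch of the Table 3 code; the function names and list representation are ours.

```python
def encode(u1, u2):
    """Compute the four check bits of Table 3 from the two
    information bits; returns bit positions 1..6 as a list."""
    p1 = u1 ^ 1
    p2 = u2 ^ 1
    p3 = u1 ^ u2
    p4 = u1 ^ u2 ^ 1
    return [u1, u2, p1, p2, p3, p4]

SINGLE_ERROR = {                 # Table 4: syndrome -> bit in error
    (1, 0, 1, 1): 1, (0, 1, 1, 1): 2, (1, 0, 0, 0): 3,
    (0, 1, 0, 0): 4, (0, 0, 1, 0): 5, (0, 0, 0, 1): 6,
}

def decode(word):
    """Correct a single error or raise on a detected multiple
    (e.g., unidirectional) error; returns the information bits."""
    u1, u2, p1, p2, p3, p4 = word
    s = (p1 ^ u1 ^ 1, p2 ^ u2 ^ 1, p3 ^ u1 ^ u2, p4 ^ u1 ^ u2 ^ 1)
    if s == (0, 0, 0, 0):
        return word[:2]                  # no error
    if s in SINGLE_ERROR:
        fixed = list(word)
        fixed[SINGLE_ERROR[s] - 1] ^= 1  # flip the located bit
        return fixed[:2]
    raise ValueError("multiple (unidirectional) error detected")
```

Because the code has distance 4, every double error lands at distance at least 2 from all code words, so its syndrome falls in the "all other combinations" row of Table 4 rather than being miscorrected.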

Table 4 describes the combinations of syndrome bit patterns and the corresponding error locations.

The encoding and decoding circuit for the code shown in Table 3 can easily be implemented. Note that the code is derived from the maximal-length (7,3) code by code puncturing and expurgation techniques and is also a coset code.2 Other techniques for constructing such codes can be found in Pradhan.68

The motivations for using certain types of codes in LSI devices, then, are different from those for using certain other codes in communications. We hope that the material here will suggest new research: the construction of unconventional codes that provide error protection more in line with the types of errors encountered in LSI devices.

Error-control coding techniques can be highly effective in improving computer reliability, in spite of



Table 3. A random error-correcting and unidirectional error-detecting code.

         INFORMATION   CHECK
         BITS          BITS
         u1  u2        p1  p2  p3  p4
         0   0         1   1   0   1
C =      0   1         1   0   1   0
         1   0         0   1   1   0
         1   1         0   0   0   1

         1   2         3   4   5   6    BIT POSITION

Table 4. Syndrome bits and error positions.

s1  s2  s3  s4    BIT IN ERROR
0   0   0   0     NONE
1   0   1   1     1
0   1   1   1     2
1   0   0   0     3
0   1   0   0     4
0   0   1   0     5
0   0   0   1     6

ALL OTHER COMBINATIONS: MULTIPLE UNIDIRECTIONAL ERROR DETECTION

the generally tougher constraints imposed on encoding and decoding hardware in this application than in more conventional communications applications. The effective use of such codes requires both the code and the decoding procedure to be tailored to the peculiarities of the hardware.

The benefits of error-control codes are demonstrated in the Fault-Tolerant Spaceborne Computer.4 The FTSC protects all addresses and data by means of various versions of the same (shortened) cyclic code. This code is appended to each data word as it enters the computer (if the word is already encoded, the code is checked at the entry port) and remains with that word throughout all computer operations except those taking place in the central processing unit (in which case, all operations are performed in duplicate). The properties of the code used in the FTSC4 include the

(1) ability to detect bursts of up to eight adjacent errors (serial data bus);

(2) ability to detect all errors confined to an eight-bit byte (address and data buses);

(3) ability to correct a single erasure and simultaneously detect a second error (memory);

(4) ability to detect a solid burst of errors, i.e., an error pattern in which a group of contiguous bits are all in error (address generators in the direct-memory access units); and

(5) ability to be generated and monitored by a concatenation of identical devices (throughout the computer; this property can be regarded as a parallel version of the previously noted shift-register encoding properties of cyclic codes).

Even this list does not begin to exhaust the ways in which error-control coding techniques can improve computer reliability. We predict significantly increased use of these techniques as the properties of error-control codes become better understood.

Acknowledgment

This work was supported in part by AFOSR Contract No. F49620-79-C-0119.

References and bibliography

General references

1. Ball, M. and F. Hardie, "Effect on Detection of Intermittent Failure in Digital Systems," AFIPS Conf. Proc., Vol. 35, 1969 FJCC, pp. 329-335.

2. Berlekamp, E. R., Algebraic Coding Theory, McGraw-Hill, New York, 1968.

3. Peterson, W. W. and E. J. Weldon, Error-Correcting Codes, MIT Press, Cambridge, Mass., 1971.

4. Stiffler, J. J., "Architectural Design for Near-100% Fault Coverage," Proc. 1976 Int'l Symp. Fault-Tolerant Computing, Pittsburgh, Pa., June 1976, pp. 134-137.*

5. Tanenbaum, A. S., Structured Computer Organization, Prentice-Hall, Englewood Cliffs, N.J., 1976.

6. Tasar, O. and V. Tasar, "A Study of Intermittent Faults in Digital Computers," AFIPS Conf. Proc., 1977 NCC, pp. 807-811.

7. Wakerly, J. F., Error Detecting Codes, Self-Checking Circuits and Applications, Elsevier-North Holland, New York, 1978.

Codes for arithmetic operations

8. Avizienis, A., "Arithmetic Error Codes: Cost and Effectiveness Studies for Application in Digital Systems," IEEE Trans. Computers, Vol. C-20, No. 11, Nov. 1971, pp. 1322-1330.

9. Chien, R. T., S. J. Hong, and F. P. Preparata, "Some Results on the Theory of Arithmetic Codes," Information and Control, Vol. 19, 1971, pp. 246-264.

10. Diamond, J. L., "Checking Codes for Digital Computers," Proc. IRE, Apr. 1955, pp. 487-488.

11. Langdon, G. G., Jr., and C. K. Tang, "Concurrent Error Detection for Group-Carry-Look-Ahead in Binary Adders," IBM J. Research and Development, Vol. 14, No. 5, Sept. 1970, pp. 563-573.

12. Massey, J. L. and O. N. Garcia, "Error Correcting Codes in Computer Arithmetic," Chapter 5 in Advances in Information System Sciences, Vol. 4, Plenum Press, New York, 1971, pp. 273-326.

13. Pradhan, D. K., "Fault-Tolerant Carry Save Adders," IEEE Trans. Computers, Vol. C-23, No. 11, Nov. 1974, pp. 1320-1322.

14. Pradhan, D. K. and L. C. Chang, "Synthesis of Fault-Tolerant Arithmetic and Logic Processors by Using Nonbinary Codes," Digest of Papers, Fourth Ann. Int'l Symp. Fault-Tolerant Computing, Urbana, Ill., June 1974, pp. 4.22-4.28.*

15. Rao, T. R. N., Error Coding for Arithmetic Processors, Academic Press, New York, 1974.



Codes for logical operations

16. Eden, M., "A Note on Error Detection in Noisy Logical Computers," Information and Control, Vol. 2, Sept. 1959, pp. 310-313.

17. Elias, P., "Computation in the Presence of Noise," IBM J. Research and Development, Vol. 2, Oct. 1958, pp. 346-353.

18. Garcia, O. N. and T. R. N. Rao, "On the Methods of Checking Logical Operations," Proc. 2nd Ann. Princeton Conf. Information Science and Systems, 1968, pp. 89-95.

19. Monteiro, P. M. and T. R. N. Rao, "A Residue Checker for Arithmetic and Logical Operations," Digest of Papers, Int'l Symp. Fault-Tolerant Computing, Boston, Mass., June 1972.*

20. Peterson, W. W. and M. O. Rabin, "On Codes for Checking Logical Networks," IBM J. Research and Development, Vol. 3, No. 2, Apr. 1959, pp. 163-168.

21. Pradhan, D. K. and S. M. Reddy, "Error Control Techniques for Logical Processors," IEEE Trans. Computers, Vol. C-21, No. 12, Dec. 1972, pp. 1331-1336.

22. Winograd, S. and J. D. Cowan, Reliable Computation in the Presence of Noise, MIT Press, Cambridge, Mass., 1963.

Codes for memory

23. Black, C. J., C. E. Sundberg, and W. K. S. Walker, "Development of a Spaceborne Memory with a Single Error and Erasure Correction Scheme," Proc. Seventh Ann. Int'l Conf. Fault-Tolerant Computing, Los Angeles, Calif., June 1977, pp. 50-55.*

24. Bossen, D. C., "b-Adjacent Error Correction," IBM J. Research and Development, Vol. 14, July 1970, pp. 402-408.

25. Bossen, D. C., L. C. Chang, and C. L. Chen, "Measurement and Generation of Error Correcting Codes for Package Failures," IEEE Trans. Computers, Vol. C-27, No. 3, Mar. 1978, pp. 201-204.

26. Carter, W. C. and C. E. McCarthy, "Implementation of an Experimental Fault-Tolerant Memory System," IEEE Trans. Computers, Vol. C-25, No. 6, June 1976, pp. 557-568.

27. Brown, D. T. and F. F. Sellers, Jr., "Error Correction for IBM 800-bit-per-inch Magnetic Tape," IBM J. Research and Development, Vol. 14, July 1970, pp. 384-389.

28. Chien, R. T., "Memory Error Control: Beyond Parity," IEEE Spectrum, July 1973, pp. 18-23.

29. Hsiao, M. Y., "Optimum Odd-weight Column Codes," IBM J. Research and Development, Vol. 14, No. 4, July 1970.

30. Hsiao, M. Y. and D. C. Bossen, "Orthogonal Latin Square Configuration for LSI Memory Yield and Reliability Enhancement," IEEE Trans. Computers, Vol. C-24, No. 5, May 1975, pp. 512-517.

31. Reddy, S. M., "A Class of Linear Codes for Error Control in Byte-per-card Organized Digital Systems," IEEE Trans. Computers, Vol. C-27, No. 5, May 1978, pp. 455-459.

32. Stiffler, J. J., "Coding for Random-Access Memories," IEEE Trans. Computers, Vol. C-27, No. 6, June 1978, pp. 526-531.

Fault-tolerant logic using coding

33. Problems of Information Transmission, translated from Russian, Vols. 1-4, Faraday Press; Vol. 5, Consultants Bureau, Plenum Publishing Co., New York.

34. Nemsadze, N. I., Problems of Information Transmission, 1969 (No. 1), 1972 (No. 2), Consultants Bureau, Plenum Publishing Co., New York.

35. Nikanorov, A. A., Problems of Information Transmission, 1974 (No. 2), Consultants Bureau, Plenum Publishing Co., New York.

36. Nikanorov, A. A. and Y. L. Sagalovich, "Linear Codes for Automata," Int'l Symp. Design and Maintenance of Logical Systems, Toulouse, France, Sept. 27-28, 1972.

37. Sagalovich, Y. L., Problems of Information Transmission, 1960 (No. 2), 1965 (No. 2), 1967 (No. 2), 1972 (No. 3), 1973 (No. 1), 1976 (No. 4), 1978 (No. 2), Faraday Press and Consultants Bureau, Plenum Publishing Co., New York.

38. Sagalovich, Y. L., States Coding and Automata Reliability, Svjas, Moscow, 1975 (in Russian).

39. Sagalovich, Y. L., "Information Theoretical Methods in the Theory of Reliability for Discrete Automata," Proc. 1975 IEEE-USSR Joint Workshop on Information Theory, Moscow, Dec. 15-19, 1975.

40. Hsiao, M. Y., A. M. Patel, and D. K. Pradhan, "Store Address Generator with Built-in Fault-Detection Capabilities," IEEE Trans. Computers, Vol. C-26, No. 11, Nov. 1977, pp. 1144-1147.

41. Meyer, J. F., "Fault-Tolerant Sequential Machines," IEEE Trans. Computers, Vol. C-20, Oct. 1971, pp. 1167-1177.

42. Larsen, R. W. and I. Reed, "Redundancy by Coding Versus Redundancy by Replication for Failure-Tolerant Sequential Circuits," IEEE Trans. Computers, Vol. C-21, No. 2, Feb. 1972, pp. 130-137.

43. Pradhan, D. K. and S. M. Reddy, "Fault-Tolerant Asynchronous Networks," IEEE Trans. Computers, Vol. C-23, No. 7, July 1974, pp. 651-658.

44. Pradhan, D. K., "Fault-Tolerant Asynchronous Networks Using Read-Only Memories," IEEE Trans. Computers, Vol. C-27, No. 7, July 1978, pp. 674-679.

45. Reed, I. S. and A. C. L. Chiang, "Coding Techniques for Failure-Tolerant Counters," IEEE Trans. Computers, Vol. C-19, No. 11, Nov. 1970, pp. 1035-1038.

Self-checking circuits

46. Anderson, D. A., "Design of Self-Checking Digital Networks," CSL Report No. 527, University of Illinois, Urbana, Ill., 1971.

47. Anderson, D. A. and G. Metze, "Design of Totally Self-Checking Circuits for m-out-of-n Codes," IEEE Trans. Computers, Vol. C-22, No. 3, Mar. 1973, pp. 263-269.

48. Ashjee, M. J. and S. M. Reddy, "On Totally Self-Checking Checkers for Separable Codes," IEEE Trans. Computers, Vol. C-26, No. 8, Aug. 1977, pp. 737-744.

49. Carter, W. C. and P. R. Schneider, "Design of Dynamically Checked Computers," Proc. IFIP Congress 68, Vol. 2, Edinburgh, Scotland, pp. 878-883.

50. Carter, W. C., K. A. Duke, and D. C. Jessep, "A Simple Self-Testing Decoder Checking Circuit," IEEE Trans. Computers, Vol. C-20, No. 11, Nov. 1971, pp. 1413-1414.



51. Cook, R. W. et al., "Design of Self-Checking Microprogram Controls," IEEE Trans. Computers, Vol. C-22, No. 3, Mar. 1973, pp. 255-262.

52. David, R., "A Totally Self-Checking 1-out-of-3 Code," IEEE Trans. Computers, Vol. C-27, No. 6, June 1978, pp. 570-572.

53. Diaz, M., "Design of Totally Self-Checking and Fail-Safe Sequential Machines," Digest of Papers, Fourth Ann. Int'l Symp. Fault-Tolerant Computing, Urbana, Ill., June 1974, pp. 3.19-3.24.*

54. Diaz, M. and J. M. Desouza, "Design of Self-Checking Microprogrammed Controls," Digest of Papers, 1975 Int'l Symp. Fault-Tolerant Computing, Paris, France, June 1975, pp. 137-142.*

55. Marouf, M. A. and A. D. Friedman, "Design of Self-Checking Checkers for Berger Codes," Digest of Papers, Eighth Ann. Int'l Conf. Fault-Tolerant Computing, Toulouse, France, June 1978, pp. 179-184.*

56. Marouf, M. A. and A. D. Friedman, "Efficient Design of Self-Checking Checkers for m-out-of-n Codes," Proc. Seventh Ann. Int'l Conf. Fault-Tolerant Computing, Los Angeles, Calif., June 1977, pp. 143-149.

57. Pradhan, D. K., "Asynchronous State Assignments with Unateness Properties and Fault-Secure Design," IEEE Trans. Computers, Vol. C-27, No. 5, May 1978, pp. 396-404.

58. Pradhan, D. K. et al., "Shift Registers Designed for On-Line Fault Detection," Digest of Papers, Eighth Ann. Conf. Fault-Tolerant Computing, Toulouse, France, June 1978, pp. 173-178.*

59. Shedletsky, J. J. and E. J. McCluskey, "The Error Latency of a Fault in Combinational Digital Circuits," Digest of Papers, 1975 Int'l Symp. Fault-Tolerant Computing, Paris, France, June 1975, pp. 210-214.*

60. Smith, J. E., "The Design of Totally Self-Checking Combinational Circuits," CSL Report No. R-737, University of Illinois, Urbana, Ill., Aug. 1976.

61. Smith, J. E. and G. Metze, "Strongly Fault Secure Logic Networks," IEEE Trans. Computers, Vol. C-27, No. 6, June 1978, pp. 491-499.

62. Wang, S. L. and A. Avizienis, "The Design of Totally Self-Checking Circuits Using Programmable Logic Arrays," Digest of Papers, Ninth Ann. Int'l Symp. Fault-Tolerant Computing, Madison, Wis., June 1979, pp. 173-180.*

Codes for unidirectional errors

63. Berger, J. M., "A Note on Error Detection Codes for Asymmetric Channels," Information and Control, Vol. 4, Mar. 1961, pp. 68-73.

64. Bose, B., "Theory and Design of Unidirectional Error Codes," PhD dissertation, Computer Science and Engineering Dept., Southern Methodist University, Dallas, Tex., in progress.

65. Freiman, C. V., "Protective Block Codes for Asymmetric Binary Channels," PhD dissertation, Columbia University, New York, May 1961.

66. Parhami, B. and A. Avizienis, "Detection of Storage Errors in Mass Memories Using Low-Cost Arithmetic Codes," IEEE Trans. Computers, Vol. C-27, No. 4, Apr. 1978, pp. 302-308.

67. Pradhan, D. K. and S. M. Reddy, "Fault-Tolerant Fail-Safe Logic Networks," Proc. COMPCON Spring 77, pp. 361-363.


68. Pradhan, D. K., "A New Class of Error Correcting-Detecting Codes for Fault-Tolerant Computer Applications," to appear in IEEE Trans. Computers, Special Issue on Fault-Tolerant Computing, Vol. C-29, No. 6, June 1980.

69. Sahani, R. M., "Reliability of Integrated Circuits," Proc. IEEE Int'l Computer Group Conf., Washington, D.C., June 1970, pp. 213-219.

70. Wakerly, J. F., "Detection of Unidirectional Multiple Errors Using Low-Cost Arithmetic Codes," IEEE Trans. Computers, Vol. C-24, No. 2, Feb. 1975, pp. 210-212.

*This digest or proceedings is available from the IEEE Computer Society Publications Office, 5855 Naples Plaza, Suite 301, Long Beach, CA 90803.

D. K. Pradhan is the guest editor of this issue; his biographical sketch appears on p. 7.

J. J. Stiffler is a consulting engineer at the Raytheon Company, Sudbury, Massachusetts. From 1961 to 1967 he was on the technical staff of the Jet Propulsion Laboratory, Pasadena, California. The author of many papers in the field of communications, Stiffler wrote Theory of Synchronous Communications and contributed to two other books. His current interests include the design and analysis of highly reliable data-processing systems.

Stiffler received the AB in physics, magna cum laude, from Harvard University in 1956 and the MS in electrical engineering from the California Institute of Technology in 1957. After a year in Paris as a Fulbright scholar, he returned to Caltech, where he received the PhD in 1962. Stiffler is a member of Phi Beta Kappa and Sigma Xi.
