Algorithm-Based Fault Tolerance Theory of Check Placement

CS717


Greg Bronevetsky


So Far…

• Learned how certain computations could be checked using algorithm-specific checks.

• In any algorithm we can develop checks to verify any set of data items.

• How effective are these checks?

• How many faults can a given set of checks detect?

Abstract Checks

• Suppose we are given (g,h)-checks

• Check defined on g data elements

• If all elements correct, returns 0

• If more than 0 and at most h elements erroneous, returns 1

• If more than h elements erroneous, result undefined
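A minimal sketch of the abstract (g,h)-check described above, assuming the data elements are represented only by a list of "is erroneous" flags. The function name `gh_check` and the use of `None` for the undefined result are illustrative choices, not part of the papers.

```python
from typing import Optional, Sequence


def gh_check(errors: Sequence[bool], h: int) -> Optional[int]:
    """Model of a (g,h)-check over g = len(errors) data elements.

    errors[i] is True when data element i is erroneous.
    Returns 0 when no element is erroneous, 1 when between 1 and h
    elements are erroneous, and None when the check is overwhelmed
    (more than h erroneous elements, result undefined).
    """
    bad = sum(1 for e in errors if e)
    if bad == 0:
        return 0
    if bad <= h:
        return 1
    return None  # undefined: the check's answer cannot be trusted


# A (2,1) check: two elements, reliable for at most one error.
print(gh_check([False, False], h=1))
print(gh_check([True, False], h=1))
print(gh_check([True, True], h=1))
```

The third call models an overwhelmed check: it may report either value, which is why the later slides treat such checks as unusable for that error pattern.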

Checking Example

• Assume (2, 1) checks – 2 elements, 1-failure detect

• Both sets of checks can detect single errors

• Neither can locate individual errors

[Figure: two check placements over data elements d1 … dn and their sum. Left: n checks: ∀i. di and sum. Right: 1 check: sum.]

But with one more check…

• If also check sum
  – can detect any pair of errors
  – can locate single errors

• Need general theory of effective and efficient check placement

[Figure: data elements d1 … dn and their sum. n checks: ∀i. di and sum. 1 more check: sum.]

Goals

• Need models for correlating processor faults to data errors

• Given fault model and set of checks need to derive fault detectability and locatability


Papers covered

• V.S.S. Nair, J.A. Abraham, P. Banerjee. "Efficient techniques for the analysis of algorithm-based fault tolerance (ABFT) schemes", 1996.

• Choon-Sik Park and Mineo Kaneko, "An Efficient Technique for Design of ABFT Systems Based on Modified PD Graph".

• Choon-Sik Park, "Algorithm-Based Fault Tolerant Systems Based on Graph-Theoretic Error Occurence+Propagation Models", 2000. (PhD Thesis)

• V.S.S. Nair, J.A. Abraham. "Hierarchical design and analysis of fault-tolerant multiprocessor systems using concurrent error detection", 1990.


Outline

• Matrix-based formalism of Nair et al

• Dependence graph-based formalism of Park et al
  – Includes fault propagation models

• Framework for hierarchical fault tolerant systems by Nair et al
  – Building fault tolerant systems out of fault tolerant components

Basic Framework

• Each processor and check associated with set of elements

[Figure: processors P1–P4 linked to data elements d1–d7, which are in turn linked to checks C1–C3.]

Basic Framework

• Data(Pi) = set of data elements affected by processor i
  – If Pi fails, any subset of Data(Pi) may be erroneous
  – No notion of errors propagating based on data dependences

• Data() defines the Processor-Data (PD) Matrix

Associated PD Matrix

PD matrix (rows: processors, columns: data elements):

      d1 d2 d3 d4 d5 d6 d7
  P1   1  1  0  0  0  0  0
  P2   0  0  1  0  0  0  0
  P3   0  0  0  1  0  1  0
  P4   0  0  0  0  1  1  1

Basic Framework

• Check(di) = set of checks that check data element di
  – Must be non-empty if we expect to detect errors

• Check() defines the Data-Check (DC) Matrix

• Paper focuses on (g,1) checks
  – g data elements
  – can detect up to 1 fault

Associated DC Matrix

DC matrix (rows: data elements, columns: checks):

      C1 C2 C3
  d1   1  0  0
  d2   0  1  0
  d3   0  0  1
  d4   1  1  0
  d5   0  0  1
  d6   0  1  0
  d7   0  0  1

• C1 and C2 are (3,1) checks

• C3 is a (2,1) check

The PC Matrix

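• Finally, associate processors and checks:

• Processor-check (PC) matrix = PD × DC

• Each PC entry = # elements of a processor verified by a check

PD (rows: processors, columns: data elements):

  0 1 0 0 0 0 0
  0 0 1 0 0 0 0
  0 0 0 1 0 1 0
  0 0 0 0 1 1 1

DC (rows: data elements, columns: checks):

  1 0 0
  0 1 0
  0 0 1
  1 1 0
  0 0 1
  0 1 0
  0 0 1

PC (rows: processors, columns: checks), rows shown:

  1 2 0
  0 1 2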
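The product PC = PD × DC can be sketched in a few lines. This example assumes the PD rows from the "Associated PD Matrix" slide and the DC rows from the "Associated DC Matrix" slide; the helper name `mat_mul` is illustrative.

```python
# PD: rows are processors P1..P4, columns are data elements d1..d7.
PD = [
    [1, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 1, 1],
]

# DC: rows are data elements d1..d7, columns are checks C1..C3.
DC = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
]


def mat_mul(a, b):
    """Plain integer matrix product a x b."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# PC[p][c] = number of processor p's data elements covered by check c.
PC = mat_mul(PD, DC)
for name, row in zip(["P1", "P2", "P3", "P4"], PC):
    print(name, row)
```

With these inputs the rows for P3 and P4 come out as 1 2 0 and 0 1 2, the two PC rows that appear on the slide.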

Using the PC Matrix

• PC matrix shows if we can detect single-processor errors:

• Assume all checks are (g,h) checks

• If a processor's row of PC has all entries ≤ h ⇒ failure of that processor will be detected
  – Regardless of which entries actually become erroneous

[Figure: the processor / data element / check graph (P1–P4, d1–d7, C1–C3) next to the PC matrix (# elements verified by check), rows shown: 1 2 0 and 0 1 2.]
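A hedged sketch of this conservative row test. It assumes every data element of the processor is covered by at least one check (the non-empty Check(di) requirement), so that a row with no overwhelmed entry really does guarantee detection. The function name is illustrative.

```python
def conservatively_detectable(pc_row, h):
    """Conservative test on one processor's PC row.

    pc_row[c] is the number of the processor's data elements covered by
    check c. The failure is guaranteed detectable when the processor is
    covered at all and no check can see more than h of its elements.
    Assumes every data element of the processor is covered by some check.
    """
    touched = [n for n in pc_row if n > 0]
    return bool(touched) and all(n <= h for n in touched)


# Rows of the example PC matrix, with (g,1) checks:
print(conservatively_detectable([1, 1, 0], h=1))  # no entry above h
print(conservatively_detectable([1, 2, 0], h=1))  # one entry above h
```

A `False` result only means the test cannot give a guarantee; the following slides show that such a processor failure may still be detectable.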

Relaxing Detectability

• Condition is too conservative

• Suppose we have (3, 2) checks

• P1’s PD row (columns d1 … d5): 1 1 0 1 1

• There are 2 checks. DC matrix (rows d1 … d5, columns C1 C2):

  1 1
  1 0
  1 0
  0 1
  0 1

• PC matrix (row P1, columns C1 C2): 2 3

Relaxing Detectability

• C1 may be overwhelmed by errors
  – Will not notice error <d1, d2, d5>

• By above criterion system can’t detect failure in P1

[Figure: P1, data elements d1–d5, checks C1 and C2.]

Reaching New Detectability Definition

• But how could C1 be overwhelmed?

• When all 3 of its elements have errors
  – Recall, these are (3,2) checks

• But C1 and C2 overlap on d5

• Thus if C1 overwhelmed, C2 detects error
  – It is not overwhelmed

• Thus, for any error pattern can see if any check will notice

[Figure: P1, data elements d1–d5, checks C1 and C2.]

Trivial Algorithm 2

• Try every possible error pattern
  – Exponentially many of them

• For each pattern see if some check will detect it
  – Before: ensured that no check overwhelmed

• Pro: Correct and not conservative

• Con: Expensive
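The brute-force approach can be sketched as follows, assuming each check is given as the set of data elements it covers and that a check detects a pattern when it sees between 1 and h erroneous elements. The names `detects` and `brute_force_detectable` are illustrative. The example data follows the PD row and DC matrix of the (3,2)-check example.

```python
from itertools import combinations


def detects(check, pattern, h):
    """A (g,h) check notices a pattern iff it covers 1..h erroneous elements."""
    return 1 <= len(check & pattern) <= h


def brute_force_detectable(data, checks, h):
    """Try every non-empty subset of the processor's data elements."""
    elems = sorted(data)
    for size in range(1, len(elems) + 1):
        for combo in combinations(elems, size):
            pattern = set(combo)
            if not any(detects(c, pattern, h) for c in checks):
                return False  # found an error pattern no check notices
    return True


# P1 affects d1, d2, d4, d5; C1 covers d1..d3, C2 covers d1, d4, d5.
p1_data = {"d1", "d2", "d4", "d5"}
C1 = {"d1", "d2", "d3"}
C2 = {"d1", "d4", "d5"}
print(brute_force_detectable(p1_data, [C1, C2], h=2))
print(brute_force_detectable(p1_data, [C1, C2], h=1))
```

The loop visits 2^n - 1 patterns for n data elements, which is the cost the slide warns about.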

New Definition of Detectability

• Work with error patterns
  – Ex: <d1, d2, d5>, <d1, d3, d4>, <d3>, etc.

• If one check detects given error pattern, no problem if other checks overwhelmed

• Repeat until all error patterns detected:
  – If some check not overwhelmed, eliminate all error patterns it can detect from consideration
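One way to read this elimination procedure is sketched below. It assumes checks are sets of covered data elements and works on data elements rather than explicit patterns: once a non-overwhelmed check covers some remaining elements, every pattern containing one of them is treated as detected and those elements are dropped. The function name is illustrative, and this is a sketch of the idea rather than the exact algorithm of Nair et al.

```python
def detectable_by_elimination(data, checks, h):
    """Iteratively remove data elements covered by a non-overwhelmed check.

    A check is not overwhelmed when it covers between 1 and h of the
    remaining elements. Returns True when every element is eliminated.
    """
    remaining = set(data)
    progress = True
    while remaining and progress:
        progress = False
        for check in checks:
            hit = check & remaining
            if 1 <= len(hit) <= h:
                remaining -= hit  # patterns touching these are detected
                progress = True
    return not remaining


# Same (3,2)-check example: PC row (2, 3) fails the conservative test,
# yet the elimination succeeds.
p1_data = {"d1", "d2", "d4", "d5"}
C1 = {"d1", "d2", "d3"}
C2 = {"d1", "d4", "d5"}
print(detectable_by_elimination(p1_data, [C1, C2], h=2))
print(detectable_by_elimination(p1_data, [C1, C2], h=1))
```

Each pass either removes at least one element or stops, so the loop runs a polynomial number of times instead of enumerating all subsets.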

Example of Detectability Algorithm

• Is failure of P1 detectable?

• P1 fails ⇒ d1, d2 and/or d3 may have errors

• C1, C2 overwhelmed

• C3 not overwhelmed

[Figure: processors P1 and P2, data elements d1–d5, and (2,1) checks C1–C4. Data elements are removed from the figure as their error patterns are eliminated.]

• Look at errors C3 can detect: d3

• Remove them from consideration
  – Since any error pattern involving d3 will be detected

• Look at remaining error patterns: combinations of d1 and/or d2

• Now C2 not overwhelmed

• Remove any error patterns involving d2

• Look at remaining error patterns: d1

• C1 not overwhelmed

• Remove any of its error patterns

• All of P1’s error patterns detected

• We are done!

Failing Check Processors

• What if processor performing check fails?

• Add “pseudo” data elements to represent processors

• Each check will also check its processor’s pseudo-data element
  – New element has weight ∞, so error in it will overwhelm any check

Final System

• Check C3 is on P1

• Checks C1, C2 and C4 on P2

[Figure: processors P1 and P2, data elements d1–d7, and (2,1) checks C1–C4.]

The Infinities

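[Figure: the final system: processors P1 and P2, data elements d1–d7, and (2,1) checks C1–C4.]

PD (rows: processors, columns: data elements):

  011000
  000111

DC (rows: data elements, columns: checks):

  1011
  0100
  1000
  0100
  0110
  1011
  0001

PC (rows: processors, columns: checks):

  11
  1122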

The Infinities

[Figure: the final system (processors P1 and P2, data elements d1–d7, (2,1) checks C1–C4) next to the PC matrix (# elements verified by check). PC rows shown: 11 / 1122.]

• If P1 fails, C1 and C2 overwhelmed

• C3 also overwhelmed by ∞+1
  – Because C3 runs on failed P1

• Only C4 not overwhelmed

• Remove all error patterns detected by C4
  – Any that include d2
  – C4’s entry must become 0; others may go lower
  – PC rows shown: 11 / 0111

• C1 and C2 no longer overwhelmed

• Remove error patterns detected by C1 and C2
  – Any that include d1 or d3
  – C1’s and C2’s entries must become 0; others may go lower
  – PC rows shown: 11 / 000

• Now P1’s row is all 0’s and ∞’s

• All real data elements successfully checked

• Only pseudo-elements remain
  – Don’t care

• Note failure of P2 not detectable

• d5 only checked by C4, which runs on P2

• Thus, C4’s entry for P2 will never drop below ∞

Multi-Process Errors

• Want to know if system can detect failures of r processors

• For every subset of r processors
  – Take union of all data elements they touched
  – Pretend each r-set is single processor

• Use above algorithm to check if all resulting error patterns detectable
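The r-processor case can be sketched by merging the data sets of each r-subset and reusing a single-processor test. The elimination helper below is an assumed stand-in for "the above algorithm"; the processor names, check sets and function names are hypothetical.

```python
from itertools import combinations


def eliminate(data, checks, h):
    """Single-processor test: drop elements covered by non-overwhelmed checks."""
    remaining = set(data)
    progress = True
    while remaining and progress:
        progress = False
        for check in checks:
            hit = check & remaining
            if 1 <= len(hit) <= h:
                remaining -= hit
                progress = True
    return not remaining


def r_fault_detectable(proc_data, checks, h, r):
    """Treat every r-subset of processors as one merged processor."""
    for group in combinations(sorted(proc_data), r):
        merged = set().union(*(proc_data[p] for p in group))
        if not eliminate(merged, checks, h):
            return False
    return True


# Hypothetical system: three processors, one element each, three (2,1) checks.
proc_data = {"P1": {"d1"}, "P2": {"d2"}, "P3": {"d3"}}
checks = [{"d1", "d2"}, {"d2", "d3"}, {"d1", "d3"}]
for r in (1, 2, 3):
    print(r, r_fault_detectable(proc_data, checks, h=1, r=r))
```

The number of r-subsets grows as "n choose r", so this outer loop is itself expensive for large r.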

Fault Locatability

• We only see errors, not faults

• For each error pattern, want to know which fault caused it

• Given two fault patterns, are they distinguishable?

• Only if they have different patterns of failed checks

• Will give intuition for analysis

0-1 Disagreement

• Take rows Ri and Rj of rPC (faults Fi and Fj)

• For every possible error pattern in Ri and Rj look at what each check says on this pattern

• If check responses different on each pattern: Fi and Fj can be differentiated


1-0 Disagreement

• Want to differentiate faults Fi and Fi ∪ Fj, for all j

• Compare each error pattern of Fi and Fj: Eik and Ejl

• If some check meets Eik on ≥1 & ≤h spots and meets Ejl on 0 spots then Ejl and Eik ∪ Ejl distinguishable

• If this is true for all error patterns then Fi and Fi ∪ Fj distinguishable

1-0 Disagreement Example

Error(E_{i,k}) = 1 1 0

Error(E_{j,l}) = 1 0 1

DC (rows: data elements, columns: checks):

  0 0 1
  0 1 1
  1 0 1

Checks on E_{i,k} = 0 1 2

Checks on E_{j,l} = 1 0 2

• 1-0 disagreement in both directions
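The check responses in this example can be recomputed with a short sketch. It assumes an error pattern is a 0/1 vector over the three data elements and that a check's response is the count of erroneous elements it covers. The names `syndrome` and `one_zero_disagreement`, and the choice h = 1, are illustrative.

```python
# DC from the example: rows are data elements, columns are checks.
DC = [
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
]


def syndrome(error, dc):
    """Number of erroneous elements each check covers (error is a 0/1 vector)."""
    return [
        sum(error[d] * dc[d][c] for d in range(len(dc)))
        for c in range(len(dc[0]))
    ]


def one_zero_disagreement(s_a, s_b, h):
    """True if some check meets pattern a on 1..h spots and pattern b on 0."""
    return any(1 <= a <= h and b == 0 for a, b in zip(s_a, s_b))


s1 = syndrome([1, 1, 0], DC)
s2 = syndrome([1, 0, 1], DC)
print(s1, s2)
print(one_zero_disagreement(s1, s2, h=1), one_zero_disagreement(s2, s1, h=1))
```

Both calls report a disagreement, matching the slide's remark that it holds in both directions.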

1-0 Disagreement Example

• Clearly, Eik and Ejl look different

• Eik ∪ Ejl corresponds to fault pattern: 1 1 1

• Checks would say: 1 1 2

• Different from Eik or Ejl: Distinguishable!

Fault Locatability

• If can show 1-0 disagreement between every single-process fault and every r-process fault: System is r-fault locatable

• Algorithm for locatability is obscure

• Read the paper

Summary

• Presented matrix-based framework for evaluating error detectability & locatability

• Framework deals with arbitrary errors

• More work by V.S.S. Nair with other coauthors


Graph-Based Framework

• Developed by Choon-Sik Park

• Does in graphs what Nair et al work does in matrices

• Assumes (g,1) checks

• Differences:
  – Different definition of fault locatability
    • Unknown if equivalent
  – Presents more limited fault→error models
    • As opposed to “anything and everything”

• Will first present general view, then specific error models

Basic Picture

[Figure: layered graph with faults (Fi, Fj), errors (eiu, ejv), data elements, and checks (c, c').]

• Processor→Data, Data→Data dependence info maintained

k-Faults

• Faults may cause number of possible errors
  – For given fault, many errors possible
  – If given error happens, all associated data elements definitely corrupted

• k-Faults: faults generating errors that corrupt k data elements

[Figure: fault Fi linked to error eiu, which is linked to the data elements it corrupts.]

Fault Detectability

• System is k-fault detectable if ∀ error pattern eiu ∃ check c s.t. |c ∩ eiu| = 1
  – ∩ means intersection of affected data elements

• Proof:
  – If there exists such check then every error pattern induced by fault will be detected
  – If k-fault detectable then there must be some check that reliably yells for any possible error pattern
    • Can allow the check that yells to be the check in definition
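A minimal sketch of this detectability condition, assuming error patterns and (g,1) checks are both given as sets of data-element names. The function name and the example sets are hypothetical.

```python
def k_fault_detectable(error_patterns, checks):
    """Every error pattern must meet some (g,1) check in exactly one element."""
    return all(
        any(len(check & pattern) == 1 for check in checks)
        for pattern in error_patterns
    )


# Hypothetical checks over data elements a, b, c.
checks = [{"a", "b"}, {"b", "c"}]
print(k_fault_detectable([{"a"}, {"a", "b"}], checks))
print(k_fault_detectable([{"a", "b", "c"}], checks))
```

In the second call both checks see two erroneous elements, so neither can be relied on and the pattern counts as undetectable.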

Fault Management

• k-fault detectability: If a fault affects k data elements then checks will detect it

• k-fault locatability: For all faults that affect k data elements, can tell any pair of faults apart

• Will examine all fault patterns Fi that come from k data elements failing


Fault Locatability 1

• To locate faults, must ensure that different faults cause different errors

• Theorem 1: System k-fault locatable only if ∀ error patterns eiu, ejv (from faults Fi and Fj): eiu ⊕ ejv ≠ ∅
  – ⊕ means symmetric difference

• Proof clear: If two faults can show up as same error, can’t tell them apart


• Theorem 2: System k-fault locatable only if ∀ error patterns eiu, ejv ∃ checks c and c' s.t.

  – |c ∩ (eiu ∪ ejv)| = 1 (recall: all checks are (g,1))

  – |c ∩ (eiu ∩ ejv)| = 0

  – If |c ∩ (eiu − ejv)| = 1 then |c' ∩ ejv| = 1

  – If |c ∩ (ejv − eiu)| = 1 then |c' ∩ eiu| = 1

• Intuition: Trying to make tuple <c,c'> be different and ≠ <0,0> on errors eiu and ejv

Fault Locatability Illustration

[Figure: Venn diagram of error patterns eiu and ejv with regions (eiu ∪ ejv), (eiu ∩ ejv), (eiu − ejv) and (ejv − eiu); checks c and c' are overlaid step by step.]

• |c ∩ (eiu ∪ ejv)| = 1
  – i.e. c overlaps one element of (eiu ∪ ejv) (because of (g,1) checks)

• |c ∩ (eiu ∩ ejv)| = 0
  – i.e. c only touches on the part that is unique to ejv

• If |c ∩ (ejv − eiu)| = 1 then |c' ∩ eiu| = 1
  – If c notices ejv make sure that c' notices eiu

• Error eiu: <c,c'> = <0,1>

• Error ejv: <c,c'> = <1,?>

• Patterns distinguishable

• Either error detected


• Theorem 2: System k-fault locatable only if ∀ error patterns eiu, ejv ∃ checks c and c' s.t.

  – |c ∩ (eiu ∪ ejv)| = 1 (recall: all checks are (g,1))

  – |c ∩ (eiu ∩ ejv)| = 0

  – If |c ∩ (eiu − ejv)| = 1 then |c' ∩ ejv| = 1

  – If |c ∩ (ejv − eiu)| = 1 then |c' ∩ eiu| = 1

• Thus, if the above is true for every pair of error patterns, system k-fault detectable
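The pairwise condition of Theorem 2 can be tested mechanically. The sketch below assumes the first two conditions use the union and the intersection of the two error patterns (the set operators are not legible in the slide text), and that checks and patterns are sets of data-element names. All names and example sets are hypothetical.

```python
def pair_distinguishable(e_iu, e_jv, checks):
    """Search for checks c, c2 satisfying the four Theorem 2 conditions.

    Assumed reading: |c & (e_iu | e_jv)| == 1, |c & (e_iu & e_jv)| == 0,
    and whichever pattern c notices, c2 must notice the other one.
    """
    for c in checks:
        if len(c & (e_iu | e_jv)) != 1 or len(c & (e_iu & e_jv)) != 0:
            continue
        for c2 in checks:
            ok_i = len(c & (e_iu - e_jv)) != 1 or len(c2 & e_jv) == 1
            ok_j = len(c & (e_jv - e_iu)) != 1 or len(c2 & e_iu) == 1
            if ok_i and ok_j:
                return True
    return False


e1 = {"a", "b"}
e2 = {"b", "c"}
print(pair_distinguishable(e1, e2, [{"a", "x"}, {"c", "y"}]))
print(pair_distinguishable(e1, e2, [{"a", "x"}]))
```

In the first call the pair of checks responds <1,0> to e1 and <0,1> to e2, so the two patterns yield different, non-zero tuples. With a single check no suitable c' exists.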

Extra Fault Detectability

• Theorem: if system is k-fault locatable then it is 2k-fault detectable

• Must show: for any fault Fl in 2k processors, ∀ resulting errors elw, ∃ check c. |c ∩ elw| = 1

• Note: Failures of 2k processors result in 2 errors as failures of k data elements

• Thus, can break up elw = (eiu ∪ ejv), coming from k-fault patterns Fi and Fj

• Must show: ∀ eiu, ejv ∃ check c. |c ∩ (eiu ∪ ejv)| = 1

• If (eiu ∪ ejv) happens, both c and c' will notice

[Figure: Venn diagram of eiu and ejv with regions (eiu ∪ ejv), (eiu − ejv) and (ejv − eiu), and checks c and c'.]

Fault→Error Models

• So far trying to deal with arbitrary errors

• Actual model of how faults turn into errors not defined
  – i.e. arbitrary

• This is unnecessarily general

• Should focus on realistic models of error generation and propagation
  – Makes it easier to design reliable systems

Single-Input-Driven Model

• Output of computation erroneous if any input(s) are
  – Even if processor is faulty

• If processor is faulty, its computations may or may not be erroneous (this is where we use data dependence information)

• Will focus on how model treats single-processor failures

SID Model Picture

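• $D_{i1} \ldots D_{iW_i}$: data elements on Pi
  – Synonymous with sets of data elements on Pi

• Focus on single-processor failures

[Figure: processor Pi with its data elements $D_{i1}, D_{i2}, \ldots, D_{iw}, \ldots, D_{iW_i}$ feeding into the rest of the data.]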

Fault Model in Practice

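• If Pi fails, any subset of Diw’s may have error

• If Diw has error, any data depending on it has error
  – Bijection between Diw and errors Eiw

[Figure: processor Pi with data elements $D_{i1}, D_{i2}, \ldots, D_{iw}, \ldots, D_{iW_i}$ and the corresponding error sets $E_{i2}, \ldots, E_{iw}, \ldots, E_{iW_i}$ among the data.]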
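Under the single-input-driven model, the error set Eiw that corresponds to a data element Diw is everything reachable from it along data dependences. A minimal sketch, assuming the dependence information is given as a hypothetical successor map (element → elements computed from it):

```python
def error_set(start, successors):
    """All data elements corrupted when `start` is erroneous.

    successors[x] lists the elements computed from x, so the error set is
    the set of elements reachable from `start`, including `start` itself.
    """
    seen = {start}
    stack = [start]
    while stack:
        x = stack.pop()
        for y in successors.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen


# Hypothetical dependences: x1 is computed from D1, x2 from x1 and D2.
successors = {"D1": ["x1"], "x1": ["x2"], "D2": ["x2"]}
print(sorted(error_set("D1", successors)))
print(sorted(error_set("D2", successors)))
```

Each Diw yields exactly one such set, which is the bijection between Diw and Eiw that the slide refers to.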

Single-Fault Detectability in SID

• Brute-Force algorithm: ∀ sets of Eiw’s
  – If ∃ check c s.t. |c ∩ (∪ Eiw’s)| = 1 then this error pattern detectable
  – If all patterns detectable, system is single-fault detectable

[Figure: processor Pi with data elements $D_{i1} \ldots D_{iW_i}$ and a check c over the data.]

Too Conservative

• Like before, algorithm too conservative

• Examines exponentially many error patterns

• Suppose set of errors $E = \{E_1, E_2, \ldots, E_r\}$ detected via check c
  – i.e. |c ∩ E| = 1

• Look at the $E_j \in \{E_1, E_2, \ldots, E_r\}$ s.t. $|c \cap E_j| = 1$

• Clearly, all $E_j$’s overlap with c on one element
  – Thus, each one detectable
  – Similarly, all unions containing $E_j$’s detectable

• Therefore, if a set of errors detectable, all unions containing suberrors also detectable
  – And thus, no need to check them

[Figure: data elements D1, D2, D3 with their error sets and check c.]

• Can ignore: E1, E2, E1 ∪ E2, E1 ∪ E3, E2 ∪ E3, E1 ∪ E2 ∪ E3

• Can’t ignore: E3

New Definition of Detectability

• $E_i^0 = E_i$ (start with all possible errors)

• For each check $c_s$:
  – Check that $E_i^{s-1}$ detectable: $\exists E_{iw} \in E_i^{s-1}$ s.t. $|c_s \cap E_{iw}| = 1$
  – Now ignore detectable subsets of $E_i^{s-1}$
  – Remove detectable subsets: $E_i^s = E_i^{s-1} - \{E_{iw} : |c_s \cap E_{iw}| = 1\}$

• Repeat to ensure rest of $E_i^s$ also detectable

Detectability Example

[Figure: data elements D1–D6 on processor i (circles on the left) with their error sets, and checks c1–c4 added one per step.]

• Check $E_i^0$ (= $E_i$)
  – c1 meets E1 and E2
  – Remove them to get $E_i^1$

• Check $E_i^1 = \{E_3, E_4, E_5, E_6\}$
  – c2 meets E3 and E4
  – Also meets E2 but on error E2, c1 will ring
  – Remove them to get $E_i^2$

• Check $E_i^2 = \{E_5, E_6\}$
  – c3 meets E5
  – Remove it to get $E_i^3$

• Check $E_i^3 = \{E_6\}$
  – c4 meets E6
  – Recall: circles on left are data on processor i
  – Remove it to get $E_i^4 = \emptyset$

• DONE!
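The walkthrough can be replayed with a short sketch. It assumes (g,1) checks given as sets of data elements and, for simplicity, error sets that contain only their own data element (no propagation). The membership of each check is a hypothetical choice made to reproduce the steps above.

```python
def sid_single_fault_detectable(errors, checks):
    """Process checks in order, removing every error set a check meets once.

    errors: dict name -> set of corrupted data elements (the E_iw sets).
    checks: list of (name, set of covered data elements), all (g,1) checks.
    Returns (all_detected, trace), where trace lists what each check removed.
    """
    remaining = dict(errors)
    trace = []
    for name, check in checks:
        caught = sorted(w for w, e in remaining.items() if len(check & e) == 1)
        for w in caught:
            del remaining[w]
        trace.append((name, caught))
    return not remaining, trace


# Hypothetical instance mirroring the example: E_w = {D_w}.
errors = {f"E{w}": {f"D{w}"} for w in range(1, 7)}
checks = [
    ("c1", {"D1", "D2"}),
    ("c2", {"D2", "D3", "D4"}),
    ("c3", {"D5"}),
    ("c4", {"D6"}),
]
ok, trace = sid_single_fault_detectable(errors, checks)
print(ok)
for step in trace:
    print(step)
```

Only single error sets are examined at each step; unions are skipped on the strength of the slide's argument that unions containing a detectable error set are also detectable.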

Single-Fault Locatability in SID

• Basic definition: Must exist enough checks s.t. all error patterns produced by failure of Pi differentiable from error patterns of Pj

• Involves a lot of error patterns

• Start with brute-force definition

Brute-Force Definition

∀ error patterns Eq = {Ei1, Ei5, Eiw, …} from Pi ∃ checks $c_q$ and $c_1 \ldots c_r$ s.t.

  – $|c_q \cap E_q| = 1$
    • Detects error $E_q$

  – $|c_q \cap E_j| = 0$
    • Ignores any error from Pj

  – $c_1 \ldots c_r$ detect Ej and all subsets via above algorithm

  – And vice versa (since $c_k$’s may ring on Pi’s errors)

• Result:
  – Any error pattern in Ei, none in Ej will ring some $c_q$
  – Every pattern in Ej detectable

Responses of Checks

• On error pattern Eq (due to failure of Pi): $(c_q, c_1, \ldots, c_r) = (1, ?, \ldots, ?)$

• On any error Ej due to failure of Pj: $(c_q, c_1, \ldots, c_r) = (0, 1/0, \ldots, 1/0)$
  – At least one must be =1 (else Ej not detectable)

• Can brute-force evaluate test on every possible Eq

Brute Force Too Exhaustive

• Recall that if $|c \cap \{E_1, E_2, \ldots, E_r\}| = 1$ then same true for all sets containing E1, … Er

• Thus, can eliminate many of the steps above

New Definition of Locatability

• $E_i^0 = E_i$ (start with all possible Pi errors)

• For each check $c_s$:
  – Check $c_s$ detects $E_i^{s-1}$: $\exists E_{iw} \in E_i^{s-1}$ s.t. $|c_s \cap E_{iw}| = 1$
  – But not $E_j$: $|c_s \cap E_j| = 0$

• Ensure that Ej is detectable via above algorithm

• Syndrome of Ei and detectable subsets: $(c_q, c_1, \ldots, c_r) = (1, ?, \ldots, ?)$

• Syndrome of Ej and all subsets: $(c_q, c_1, \ldots, c_r) = (0, 1/0, \ldots, 1/0)$
  – At least one must be =1 (else Ej not detectable)

• Can now ignore detectable subsets of $E_i^{s-1}$

• Remove detectable subsets: $E_i^s = E_i^{s-1} - \{E_{iw} : |c_s \cap E_{iw}| = 1\}$

• Repeat until all of $E_i$ covered

• Do same for $E_j$
  – In paper, steps for $E_i$ and $E_j$ interleaved

Summary

• Presented graph-based framework for evaluating error detectability & locatability

• Framework deals with arbitrary errors

• Can be specialized to a simpler fault model: Single-Input Driven

• Choon-Sik Park’s thesis presents the Multiple-Input Driven model
  – More realistic but complex


Building Larger Systems

• Now know how to analyze systems for detectability & locatability

• For large systems this can be very hard/expensive

• Large systems typically made up of smaller components

• Simplifies fault tolerance design


Basic Idea

• Have component with known detectability (=t) & locatability (=l)

• Construct system S out of k components

• What is resulting fault tolerance?


Basic Idea

• System fault tolerance no better than for individual component

• If >t data elements fail in same component, error not detected

• If >l elements fail in component, will not locate

• Detectability & locatability ratio tends to 0 as system size increases!


Hierarchical Design

• To build fault tolerant systems must introduce checks with new components

• Will present hierarchical design scheme with specific detectability & locatability guarantees

• Assumptions:
  – All (g,h) checks have same h
    • No restriction on g
  – Every processor produces only one data element
    • Same true for blocks of processors
  – Checks are fault tolerant
    • Claims that this doesn’t change problem

Basic Component

• Start off with basic system B

• System has internal checks

• Fault detectability = t

• Fault locatability = l

Basic Component

• Then replicate it k-fold

• Assumptions:
  – Copies are independent
    • (i.e. do not affect each other’s data)
  – Each system produces one data element

• And add additional checks across all copies

• Process repeated d-1 times to get d-level hierarchical system

[Figure: k copies B1, B2, … Bk of the basic component, with additional checks c1, c2, … cr across all copies.]

Detectability 1 ≤ k ≤ h

• Theorem 1:
  – If 1 ≤ k ≤ h then hierarchical system can detect $|B|k^{d-1}$ errors

• Proof:
  – Base case: d=2
  – Suppose every element has error
  – Each check must deal with k ≤ h errors
  – But they are (g,h) checks and will detect such errors
  – Thus, system can detect |B|k errors
  – Inductive case: d+1
  – Components Bi each have $|B|k^{d-2}$ elements
  – By argument above, system detects $(|B|k^{d-2})k = |B|k^{d-1}$ errors
    • Argument works because sub-systems at each level produce one data element

[Figure: copies B1, B2, … Bk with second-level checks c1, c2, … cr.]

Detectability k>h

• Theorem 2:
  – If k>h then hierarchical system can detect $(t+1)(h+1)^{d-1} - 1$ errors

• Proof:
  – Base case: d=2
  – Suppose (t+1)(h+1) errors with h+1 copies of B having t+1 errors each
  – Detectability of B = t, so internal checks will not notice errors
  – 2nd level checks will get h+1 errors each: will not notice
  – Thus, error pattern of size (t+1)(h+1) that will not be detected
  – Now suppose (t+1)(h+1)-1 errors
  – By pigeonhole principle, some unit has ≤ t errors or some 2nd level check has ≤ h errors
  – Thus, some check at 1st or 2nd level will ring
  – Thus, system detectability = (t+1)(h+1)-1
  – Inductive case: d+1
  – Components Bi detect $T_d$ errors
  – By induction, $T_d = (t+1)(h+1)^{d-1} - 1$
  – By argument above, system detects $(T_d+1)(h+1) - 1$ errors
  – Thus, system detectability = $(t+1)(h+1)^{d} - 1$

[Figure: copies B1, B2, … Bk with second-level checks c1, c2, … cr.]

Locatability

• Theorem 3:
  – If k>1 then hierarchical system can locate $2^{d-1}(l+1) - 1$ errors

• Proof:
  – Base case: d=2
  – Suppose fault pattern of 2(l+1) errors, l+1 errors in each of two Bi’s
  – Bi & Bj can’t locate the errors
  – 2nd level checks may locate erroneous rows, not columns
  – Thus, unlocatable fault pattern of size 2(l+1)
  – Now suppose fault pattern of 2(l+1)-1 errors
  – At most one Bi may have ≥ l+1 errors
    • If none do, we’re done
  – Remaining ≤ l errors distributed among other Bj’s

• Let Bi have l+r errors (r ≥ 1)

• Remaining Bj’s share remaining l-r+1 errors
  – ⇒ at least (l+r)-(l-r+1) = 2r-1 rows only have errors in Bi
  – Exactly 2r-1 rows when all l-r+1 errors are in same Bj

[Figure: copies Bi, Bj, … Bk drawn as columns, with second-level checks c1, c2, … cr across rows.]

Finding Overwhelmed Unit

• First, find the Bi that have >l errors

• All but one sub-system detects and locates errors correctly

• Overwhelmed subsystem:
  – Detects correctly
    • Locatability = l ⇒ Detectability > 2*l
    • Citation of 1973 paper by Russel & Kime
  – Error location mistakes

• In 2r-1 rows only Bi has error
  – Thus, no other row will claim an error there

• 2nd-level checks will catch these errors
  – Bi’s checks can’t lie about it
  – Will definitely know these are errors

[Figure: grid over Bi, Bj, … Bk marking known errors, unknown errors and error-free cells; labels l+1 and 2r-1.]

• Number of errors in Bi = l+r

• Number of known errors ≥ 2r-1

• Number of unknown errors in Bi ≤ (l+r)-(2r-1) = l-r+1

• Since r ≥ 1, l-r+1 ≤ l

• Bi’s checks can identify l errors
  – Error patterns of size ≤ l produce unique check alert patterns
  – This data enough to identify remaining unknown errors

Locatability


• Proof:
  – Base case: d=2
  – Can locate errors of size 2(l+1)-1
  – Inductive case: d+1
  – Components Bi can locate $2^{d-1}(l+1) - 1$ errors
  – By argument above, system locates $2[(2^{d-1}(l+1) - 1) + 1] - 1 = 2^{d}(l+1) - 1$ errors

Summary

• Presented systematic way to build hierarchical systems with good fault-detection properties

• For d-level system composed of identical independent components
  – Component detectability = t, locatability = l

$T_d = |B|\,k^{d-1}$ for $1 \le k \le h$

$T_d = (t+1)(h+1)^{d-1} - 1$ for $k > h$

$L_d = 2^{d-1}(l+1) - 1$ for $k > 1$
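The closed forms for k > h and k > 1 can be cross-checked against the one-level recurrences used in the proofs, T_{d+1} = (T_d + 1)(h + 1) - 1 and L_{d+1} = 2(L_d + 1) - 1, taking T_1 = t and L_1 = l as the base component. A small sketch with illustrative function names:

```python
def detectability_closed(d, t, h):
    """T_d = (t+1)(h+1)^(d-1) - 1, the k > h case."""
    return (t + 1) * (h + 1) ** (d - 1) - 1


def detectability_recurrence(d, t, h):
    """Apply T -> (T+1)(h+1) - 1 once per added level, starting from T_1 = t."""
    T = t
    for _ in range(d - 1):
        T = (T + 1) * (h + 1) - 1
    return T


def locatability_closed(d, l):
    """L_d = 2^(d-1) (l+1) - 1, the k > 1 case."""
    return 2 ** (d - 1) * (l + 1) - 1


def locatability_recurrence(d, l):
    """Apply L -> 2(L+1) - 1 once per added level, starting from L_1 = l."""
    L = l
    for _ in range(d - 1):
        L = 2 * (L + 1) - 1
    return L


# Example: component with t = 1, l = 1 and (g,1) second-level checks.
for d in range(1, 5):
    print(d, detectability_closed(d, t=1, h=1), locatability_closed(d, l=1))
```

Both quantities grow geometrically with the number of levels d, which is the point of adding checks at every level of the hierarchy.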

Conclusion

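• Formalisms for analyzing fault detectability & locatability
  – Matrix-based formalism of Nair et al
  – Dependence graph-based formalism of Park et al
    • Includes fault propagation models
  – Framework for hierarchical fault tolerant systems by Nair et al
    • Building fault tolerant systems out of fault tolerant components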

Conclusion

• These schemes have complex rules for acceptable check placements

• Requires detailed analysis of system to place them manually

• More detailed analysis if checks are hand-designed– Likely since few known automatic techniques

• Overall, approach can support automatic solutions but currently very manual
