CS717: Algorithm-Based Fault Tolerance, Theory of Check Placement. Greg Bronevetsky

Algorithm-Based Fault Tolerance Theory of Check Placement


Page 1: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Algorithm-Based Fault Tolerance: Theory of Check Placement

Greg Bronevetsky

Page 2: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

So Far…

• Learned how certain computations could be checked using algorithm-specific checks.

• In any algorithm we can develop checks to verify any set of data items.

• How effective are these checks?
  – How many faults can a given set of checks detect?

Page 3: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Abstract Checks

• Suppose we are given (g,h)-checks
  – Check defined on g data elements
  – If all elements correct, returns 0
  – If > 0 and ≤ h elements erroneous, returns 1
  – If > h elements erroneous, result is undefined
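A minimal sketch of the abstract (g,h)-check described above, assuming a check is modeled as the set of data elements it covers. The function name `abstract_check` and the use of `None` for the undefined outcome are illustrative choices, not part of the original slides.

```python
from typing import Optional, Set


def abstract_check(covered: Set[str], erroneous: Set[str], h: int) -> Optional[int]:
    """Outcome of a (g,h)-check, where g = len(covered).

    Returns 0 if no covered element is erroneous, 1 if between 1 and h
    covered elements are erroneous, and None (undefined) otherwise.
    """
    hits = len(covered & erroneous)
    if hits == 0:
        return 0      # all covered elements are correct
    if hits <= h:
        return 1      # 1..h errors: the check is guaranteed to fire
    return None       # more than h errors: the check is overwhelmed


check = {"d1", "d2", "d3"}                       # a (3,2)-check
print(abstract_check(check, set(), h=2))
print(abstract_check(check, {"d1", "d9"}, h=2))
print(abstract_check(check, {"d1", "d2", "d3"}, h=2))
```

Errors outside the covered set (such as the hypothetical `d9`) do not affect the outcome.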

Page 4: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Checking Example

• Assume (2, 1) checks – 2 elements, 1-failure detect

• Both sets of checks can detect single errors
• Neither can locate individual errors

[Figure: two check placements over d1 … dn and their sum. Left, n checks: for each i, a check on di and sum. Right, 1 check: sum.]

Page 5: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

But with one more check…

• If also check sum
  – can detect any pair of errors
  – can locate single errors

• Need general theory of effective and efficient check placement

[Figure: d1 … dn and their sum. n checks: for each i, a check on di and sum; 1 more check: sum.]

Page 6: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Goals

• Need models for correlating processor faults to data errors

• Given fault model and set of checks need to derive fault detectability and locatability

Page 7: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Papers covered

• V.S.S. Nair, J.A. Abraham, P. Banerjee. "Efficient techniques for the analysis of algorithm-based fault tolerance (ABFT) schemes", 1996.

• Choon-Sik Park and Mineo Kaneko, "An Efficient Technique for Design of ABFT Systems Based on Modified PD Graph".

• Choon-Sik Park, "Algorithm-Based Fault Tolerant Systems Based on Graph-Theoretic Error Occurence+Propagation Models", 2000. (PhD Thesis)

• V.S.S. Nair, J.A. Abraham. "Hierarchical design and analysis of fault-tolerant multiprocessor systems using concurrent error detection", 1990.

Page 8: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Outline

• Matrix-based formalism of Nair et al

• Dependence graph-based formalism of Park et al
  – Includes fault propagation models

• Framework for hierarchical fault tolerant systems by Nair et al
  – Building fault tolerant systems out of fault tolerant components

Page 9: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Basic Framework

• Each processor and check associated with set of elements

[Figure: processors P1–P4 linked to the data elements d1–d7 they affect, and data elements linked to the checks C1–C3 that cover them.]

Page 10: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Basic Framework

• Data(Pi) = set of data elements affected by processor i
  – If Pi fails, any subset of Data(Pi) may be erroneous
  – No notion of errors propagating based on data dependences

• Data() defines the Processor-Data (PD) Matrix

Page 11: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Associated PD Matrix

PD matrix (rows: processors; columns: data elements)

      d1 d2 d3 d4 d5 d6 d7
P1     1  1  0  0  0  0  0
P2     0  0  1  0  0  0  0
P3     0  0  0  1  0  1  0
P4     0  0  0  0  1  1  1

Page 12: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Basic Framework

• Check(di) = set of checks that check data element di.– Must be non-empty if we expect to detect errors

• Check defines the Data-Check (DC) Matrix

• Paper focuses on (g,1) checks
  – g data elements
  – can detect up to 1 fault

Page 13: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Associated DC Matrix

DC matrix (rows: data elements; columns: checks)

      C1 C2 C3
d1     1  0  0
d2     0  1  0
d3     0  0  1
d4     1  1  0
d5     0  0  1
d6     0  1  0
d7     0  0  1

• C1 and C2 are (3,1) checks

• C3 is a (2,1) check

Page 14: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

The PC Matrix

• Finally, associate processors and checks:
• Processor-check (PC) matrix = PD × DC
  – (processors × data elements) × (data elements × checks) = (processors × checks)
  – Each PC entry = # elements of that processor verified by that check

PC rows shown (columns C1 C2 C3): 1 2 0 and 0 1 2

Page 15: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Using the PC Matrix

• PC matrix shows if we can detect single-processor errors:

• Assume all checks are (g,h) checks
• If a row of PC has all entries ≤ h ⇒ failure of that processor will be detected
  – Regardless of which entries actually become erroneous

PC (# elements verified by check; rows: processors): 1 2 0 and 0 1 2

Page 16: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Using the PC Matrix

• If a row of PC has all entries ≤ h ⇒ failure of that processor will be detected

[Figure: processors P1–P4, data elements d1–d7 and checks C1–C3, shown next to the PC matrix (# elements verified by check; rows: processors) with rows 1 2 0 and 0 1 2.]
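The conservative criterion can be sketched as a one-line test on a PC row. This is an illustration only: the requirement that at least one entry be nonzero is an added assumption (a processor touched by no check cannot be detected), and the PC rows below are the product of the PD and DC matrices shown on the earlier slides.

```python
def conservatively_detectable(pc_row, h):
    """True if no check sees more than h of the processor's elements
    and at least one check sees some element (assumed requirement)."""
    nonzero = [n for n in pc_row if n > 0]
    return bool(nonzero) and all(n <= h for n in nonzero)


# PC rows computed as PD x DC from the earlier slides; columns C1..C3
PC = {"P1": [1, 1, 0], "P2": [0, 0, 1], "P3": [1, 2, 0], "P4": [0, 1, 2]}

for proc, row in PC.items():
    print(proc, conservatively_detectable(row, h=1))
```

Under (g,1) checks, P3 and P4 fail this test because one check covers two of their elements, which is exactly the situation the relaxed definition addresses.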

Page 17: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Relaxing Detectability

• Condition is too conservative
• Suppose we have (3, 2) checks
• P1's PD row (d1 … d5): 1 1 0 1 1
• There are 2 checks. DC matrix (rows d1 … d5; columns C1 C2): 11, 10, 10, 01, 01
• PC matrix (row P1): 2 3

[Figure: P1 with data elements d1–d5 and checks C1, C2.]

Page 18: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Relaxing Detectability

• C1 may be overwhelmed by errors

  – Will not notice error <d1, d2, d5>

• By above criterion system can't detect failure in P1

[Figure: P1 with data elements d1–d5 and checks C1, C2.]

Page 19: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Reaching New Detectability Definition

• But how could C1 be overwhelmed?

• When all 3 of its elements have errors
  – Recall, these are (3,2) checks

[Figure: P1 with data elements d1–d5 and checks C1, C2.]

Page 20: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Reaching New Detectability Definition

• But C1 and C2 overlap on d5

• Thus if C1 is overwhelmed, C2 detects the error
  – C2 is not overwhelmed

• Thus, for any error pattern we can see whether some check will notice it

[Figure: P1 with data elements d1–d5 and checks C1, C2.]

Page 21: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Trivial Algorithm 2

• Try every possible error pattern
  – Exponentially many of them

• For each pattern see if some check will detect it
  – Before: ensured that no check overwhelmed

• Pro: Correct and not conservative
• Con: Expensive
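A brute-force sketch of this trivial algorithm, assuming checks are modeled as sets of covered data elements and a pattern is detected when some check meets it in 1 to h elements. The check placement (`C1`, `C2`) is hypothetical, chosen so that the two (3,2) checks overlap on d5 as in the preceding slides.

```python
from itertools import combinations


def detects(check, pattern, h):
    """A (g,h)-check reliably fires when it meets 1..h erroneous elements."""
    return 1 <= len(check & pattern) <= h


def brute_force_detectable(data, checks, h):
    """Try every non-empty error pattern over `data` (exponentially many).

    Returns (True, None) if each pattern is caught by some check,
    otherwise (False, first_undetected_pattern).
    """
    elems = sorted(data)
    for size in range(1, len(elems) + 1):
        for combo in combinations(elems, size):
            pattern = set(combo)
            if not any(detects(c, pattern, h) for c in checks):
                return False, pattern
    return True, None


# Hypothetical placement: two (3,2) checks overlapping on d5
C1 = {"d1", "d2", "d5"}
C2 = {"d3", "d4", "d5"}
p1_data = {"d1", "d2", "d4", "d5"}

print(brute_force_detectable(p1_data, [C1, C2], h=2))
print(brute_force_detectable(p1_data, [C1], h=2))
```

With both checks every pattern is caught, including <d1, d2, d5>, which overwhelms C1 but not C2; with C1 alone an undetected pattern is reported.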

Page 22: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

New Definition of Detectability

• Work with error patterns
  – Ex: <d1, d2, d5>, <d1, d3, d4>, <d3>, etc.

• If one check detects given error pattern, no problem if other checks overwhelmed

• Repeat until all error patterns detected:
  – If some check not overwhelmed, eliminate all detectable error patterns from consideration

Page 23: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Example of Detectability Algorithm

• Is failure of P1 detectable?

• P1 fails ⇒ d1, d2 and/or d3 may have errors

• C1, C2 overwhelmed

• C3 not overwhelmed

[Figure: P1 and P2 with data elements d1–d5 and (2,1) checks C1–C4.]

Page 24: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Example of Detectability Algorithm

• Look at errors C3 can detect: d3

• Remove them from consideration
  – Since any error pattern involving d3 will be detected

[Figure: P1 and P2 with data elements d1–d5 and (2,1) checks C1–C4; d3 handled.]

Page 25: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Example of Detectability Algorithm

• Look at remaining error patterns: combinations of d1 and/or d2

• Now C2 not overwhelmed

• Remove any error patterns involving d2

[Figure: P1 and P2 with (2,1) checks C1–C4; of P1's elements only d1 and d2 remain, alongside d4 and d5.]

Page 26: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Example of Detectability Algorithm

• Look at remaining error patterns: d1

• C1 not overwhelmed

• Remove any of its error patterns

[Figure: P1 and P2 with (2,1) checks C1–C4; of P1's elements only d1 remains, alongside d4 and d5.]

Page 27: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Example of Detectability Algorithm

• All of P1’s error patterns detected

• We are done!

[Figure: P1 and P2 with (2,1) checks C1–C4; none of P1's elements remain, only d4 and d5.]
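The walk-through above can be sketched as a small elimination loop. This is an illustrative reading of the slides, not the paper's exact procedure; the check placement is an assumption picked so that the run reproduces the order C3, C2, C1 from the example.

```python
def eliminate(data, checks, h):
    """Iteratively remove elements covered by a check that is not overwhelmed.

    `checks` maps a check name to the set of data elements it covers.
    Returns (detectable, steps), where steps lists (check, removed elements).
    """
    remaining = set(data)
    steps = []
    progress = True
    while remaining and progress:
        progress = False
        for name, covered in checks.items():
            hit = covered & remaining
            if 1 <= len(hit) <= h:          # check sees errors but is not overwhelmed
                steps.append((name, sorted(hit)))
                remaining -= hit            # every pattern touching `hit` is detected
                progress = True
                break
    return not remaining, steps


# Assumed (2,1) check placement consistent with the example slides
checks = {
    "C1": {"d1", "d2"},
    "C2": {"d2", "d3"},
    "C3": {"d3", "d4"},
    "C4": {"d4", "d5"},
}
print(eliminate({"d1", "d2", "d3"}, checks, h=1))
```

Each removal is safe because any error pattern that first touches the removed elements lies entirely inside the remaining set at that step, so the chosen check meets it in at most h elements.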

Page 28: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Failing Check Processors

• What if processor performing check fails?

• Add “pseudo” data elements to represent processors

• Each check will also check its processor's pseudo-data element
  – New element has weight ∞, so an error in it will overwhelm any check

Page 29: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Final System

• Check C3 is in P1

• Checks C1, C2 and C4 on P2

[Figure: P1 and P2 with data elements d1–d7 and (2,1) checks C1–C4.]

Page 30: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

The Infinities

[Figure: P1 and P2 with data elements d1–d7 and (2,1) checks C1–C4.]

PD (rows: processors; columns: data elements): 011000, 000111
DC (rows: data elements; columns: checks): 1011, 0100, 1000, 0100, 0110, 1011, 0001
PC (# elements verified by check; rows: processors): 11, 1122

Page 31: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

The Infinities

[Figure: P1 and P2 with data elements d1–d7 and (2,1) checks C1–C4.]

PC (# elements verified by check; rows: processors): 11, 1122

• If P1 fails, C1 and C2 overwhelmed
• C3 also overwhelmed by ∞+1
  – Because C3 runs on failed P1
• Only C4 not overwhelmed

Page 32: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

The Infinities

[Figure: P1 and P2 with data elements d1–d7 and (2,1) checks C1–C4.]

PC (# elements verified by check; rows: processors): 11, 1122

• Remove all error patterns detected by C4
  – Any that include d2

Page 33: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

The Infinities

[Figure: P1 and P2 with data elements d1, d3–d7 and (2,1) checks C1–C4; d2 removed.]

PC (# elements verified by check; rows: processors): 11, 0111
  – C4's entry must become 0; others may go lower

• C1 and C2 no longer overwhelmed

• Remove error patterns detected by C1 and C2
  – Any that include d1 or d3

Page 34: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

The Infinities

[Figure: P1 and P2 with data elements d4–d7 and (2,1) checks C1–C4; d1, d2 and d3 removed.]

PC (# elements verified by check; rows: processors): 11, 000
  – C1's and C2's entries must become 0; others may go lower

• Now P1's row is all 0's and ∞'s
• All real data elements successfully checked
• Only pseudo-elements remain
  – Don't care

Page 35: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

The Infinities

[Figure: P1 and P2 with data elements d4–d7 and (2,1) checks C1–C4.]

PC (# elements verified by check; rows: processors): 11, 000

• Note failure of P2 not detectable

• d5 only checked by C4, which runs on P2

• Thus, entry will never drop to ∞

Page 36: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Multi-Process Errors

• Want to know if system can detect failures of r processors

• For every subset of r processors
  – Take union of all data elements they touched
  – Pretend each r-set is a single processor

• Use above algorithm to check if all resulting error patterns detectable
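A sketch of this reduction, assuming the same set-based model as before: each r-subset of processors is merged into one pseudo-processor and passed to an elimination-style detectability test. The processor data and (2,1) check placement are hypothetical.

```python
from itertools import combinations


def detectable(data, checks, h):
    """Elimination test: repeatedly pick a check that sees 1..h remaining elements."""
    remaining = set(data)
    while remaining:
        hit = next(
            (c & remaining for c in checks if 1 <= len(c & remaining) <= h), None
        )
        if hit is None:
            return False        # every check is blind or overwhelmed
        remaining -= hit
    return True


def r_fault_detectable(proc_data, checks, h, r):
    """Treat every r-subset of processors as one processor owning the union."""
    for group in combinations(sorted(proc_data), r):
        merged = set().union(*(proc_data[p] for p in group))
        if not detectable(merged, checks, h):
            return False
    return True


# Hypothetical system with (2,1) checks
proc_data = {"P1": {"d1", "d2", "d3"}, "P2": {"d4", "d5"}}
checks = [{"d1", "d2"}, {"d2", "d3"}, {"d3", "d4"}, {"d4", "d5"}]

print(r_fault_detectable(proc_data, checks, h=1, r=1))
print(r_fault_detectable(proc_data, checks, h=1, r=2))
```

In this placement single-processor failures pass the test, while the merged pair P1 ∪ P2 leaves every check overwhelmed, so the two-processor case fails.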

Page 37: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Fault Locatability

• We only see errors, not faults
• For each error pattern, want to know which fault caused it

• Given two fault patterns, are they distinguishable?

• Only if they have different patterns of failed checks

• Will give intuition for analysis

Page 38: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

0-1 Disagreement

• Take rows Ri and Rj of rPC (faults Fi and Fj)

• For every possible error pattern in Ri and Rj look at what each check says on this pattern

• If check responses different on each pattern: Fi and Fj can be differentiated

Page 39: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

1-0 Disagreement

• Want to differentiate faults Fi and Fi ∪ Fj

• Compare each error pattern of Fi and Fj: Eik and Ejl

• If some check meets Eik on ≥ 1 and ≤ h spots and meets Ejl on 0 spots, then Ejl and Eik ∪ Ejl are distinguishable

• If this is true for all error patterns then Fi and Fi ∪ Fj are distinguishable

Page 40: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

1-0 Disagreement Example

DC (rows: data elements; columns: checks): 001, 011, 101

Error Ei,k = 110        Checks on Ei,k = 012
Error Ej,l = 101        Checks on Ej,l = 102

1-0 disagreement in both directions

Page 41: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

1-0 Disagreement Example

• Clearly, Eik and Ejl look different

• Eik ∪ Ejl corresponds to fault pattern: 111

• Checks would say: 112

• Different from Eik or Ejl: Distinguishable!

DC (rows: data elements; columns: checks): 001, 011, 101

Error Ei,k = 110        Checks on Ei,k = 012
Error Ej,l = 101        Checks on Ej,l = 102

Page 42: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Fault Locatability

• If can show 1-0 disagreement between every single-process fault and every r-process fault: system is r-fault locatable

• Algorithm for locatability is obscure

• Read the paper

Page 43: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Summary

• Presented matrix-based framework for evaluating error detectability & locatability

• Framework deals with arbitrary errors

• More work by V.S.S. Nair with other coauthors

Page 44: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Outline

• Matrix-based formalism of Nair et al

• Dependence graph-based formalism of Park et al
  – Includes fault propagation models

• Framework for hierarchical fault tolerant systems by Nair et al
  – Building fault tolerant systems out of fault tolerant components

Page 45: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Graph-Based Framework

• Developed by Choon-Sik Park
• Does in graphs what the work of Nair et al does in matrices
• Assumes (g,1) checks
• Differences:
  – Different definition of fault locatability
    • Unknown if equivalent
  – Presents more limited fault→error models
    • As opposed to "anything and everything"

• Will first present general view, then specific error models

Page 46: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Basic Picture

[Figure: layered graph with columns Faults (…, Fi, Fj, …), Errors (…, eiu, ejv, …), Data and Checks (…, c, c', …).]

• Processor→Data and Data→Data dependence info maintained

Page 47: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

k-Faults

• Faults may cause a number of possible errors
  – For given fault, many errors possible
  – If given error happens, all associated data elements definitely corrupted

• k-Faults: faults generating errors that corrupt ≤ k data elements

[Figure: columns Faults, Errors and Data, with fault Fi linked to error eiu and its data elements.]

Page 48: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Fault Detectability

• System is k-fault detectable if for every error pattern eiu ∃ check c s.t. |c ∩ eiu| = 1
  – ∩ means intersection of affected data elements

• Proof:
  – If there exists such a check then every error pattern induced by the fault will be detected
  – If k-fault detectable then there must be some check that reliably yells for any possible error pattern
    • Can allow the check that yells to be the check in the definition
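The condition above translates almost literally into code. A minimal sketch, assuming error patterns and (g,1) checks are both given as sets of data elements; the element names and the function name are illustrative.

```python
def k_fault_detectable(error_patterns, checks):
    """Every error pattern must meet some (g,1) check in exactly one element."""
    return all(
        any(len(c & e) == 1 for c in checks)
        for e in error_patterns
    )


checks = [{"a", "c"}, {"b"}]

# Each of these patterns meets the first check in exactly one element
print(k_fault_detectable([{"a"}, {"a", "b"}, {"b", "c"}], checks))

# {"a", "c"} meets the first check twice and the second not at all
print(k_fault_detectable([{"a"}, {"a", "c"}], checks))
```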

Page 49: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Fault Management

• k-fault detectability: If a fault affects k data elements then checks will detect it

• k-fault locatability: For all faults that affect k data elements, can tell any pair of faults apart

• Will examine all fault patterns Fi that come from k data elements failing

Page 50: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Fault Locatability 1

• To locate faults, must ensure that different faults cause different errors

• Theorem 1: System k-fault locatable only if ∀ error patterns eiu, ejv (from faults Fi and Fj): eiu ⊕ ejv ≠ ∅
  – ⊕ = symmetric difference

• Proof clear: if two faults can show up as the same error, can't tell them apart

Page 51: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Fault Locatability 2

• Theorem 2: System k-fault locatable only if ∀ error patterns eiu, ejv ∃ checks c and c' s.t.

  – |c ∩ (eiu ⊕ ejv)| = 1 (recall: all checks are (g,1))

  – |c ∩ (eiu ∩ ejv)| = 0

  – If |c ∩ (eiu − ejv)| = 1 then |c' ∩ ejv| = 1

  – If |c ∩ (ejv − eiu)| = 1 then |c' ∩ eiu| = 1

• Intuition: Trying to make tuple <c,c'> be different and ≠ <0,0> on errors eiu and ejv
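A hedged sketch of how the four conditions might be tested for one pair of error patterns. It reads the garbled set operators as symmetric difference (first condition), intersection (second), and set difference (third and fourth), following the illustration slides; allowing c' to be any check, including c itself, is an assumption of this sketch.

```python
def pair_distinguishable(e_iu, e_jv, checks):
    """Search for checks c, c' satisfying the Theorem 2 conditions (as read here)."""
    sym = e_iu ^ e_jv            # symmetric difference
    common = e_iu & e_jv         # intersection
    for c in checks:
        if len(c & sym) != 1 or len(c & common) != 0:
            continue
        for c2 in checks:
            ok = True
            if len(c & (e_iu - e_jv)) == 1 and len(c2 & e_jv) != 1:
                ok = False       # c notices e_iu, so c' must notice e_jv
            if len(c & (e_jv - e_iu)) == 1 and len(c2 & e_iu) != 1:
                ok = False       # c notices e_jv, so c' must notice e_iu
            if ok:
                return True
    return False


e_iu, e_jv = {"a", "b"}, {"b", "c"}
print(pair_distinguishable(e_iu, e_jv, [{"c", "x"}, {"a", "y"}]))
print(pair_distinguishable(e_iu, e_jv, [{"c", "x"}]))
```

In the first call c = {c, x} fires only on e_jv and c' = {a, y} fires on e_iu, giving syndromes <0,1> and <1,?> as on the illustration slides.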

Page 52: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Fault Locatability Illustration

[Figure: Venn diagram of eiu and ejv with regions (eiu − ejv), (eiu ∩ ejv), (ejv − eiu) and (eiu ⊕ ejv) labeled.]

Page 53: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Fault Locatability Illustration

• |c ∩ (eiu ⊕ ejv)| = 1

• i.e. c overlaps one element of (eiu ⊕ ejv)
  (because of (g,1) checks)

[Figure: Venn diagram of eiu and ejv with check c.]

Page 54: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Fault Locatability Illustration

• |c ∩ (eiu ∩ ejv)| = 0

• i.e. c only touches on the part that is unique to ejv

[Figure: Venn diagram of eiu and ejv with check c on (ejv − eiu).]

Page 55: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Fault Locatability Illustration

• If |c ∩ (ejv − eiu)| = 1 then |c' ∩ eiu| = 1

• If c notices ejv make sure that c' notices eiu

[Figure: Venn diagram of eiu and ejv with check c on (ejv − eiu) and check c' on either (eiu − ejv) or (eiu ∩ ejv).]

Page 56: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Fault Locatability Illustration

• Error eiu: <c,c'> = <0,1>
• Error ejv: <c,c'> = <1,?>
• Patterns distinguishable
• Either error detected

[Figure: Venn diagram of eiu and ejv with check c on (ejv − eiu) and check c' on either (eiu − ejv) or (eiu ∩ ejv).]

Page 57: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Fault Locatability 2

• Theorem 2: System k-fault locatable only if ∀ error patterns eiu, ejv ∃ checks c and c' s.t.

  – |c ∩ (eiu ⊕ ejv)| = 1 (recall: all checks are (g,1))

  – |c ∩ (eiu ∩ ejv)| = 0

  – If |c ∩ (eiu − ejv)| = 1 then |c' ∩ ejv| = 1

  – If |c ∩ (ejv − eiu)| = 1 then |c' ∩ eiu| = 1

• Thus, if the above is true for every pair of error patterns, the system is k-fault detectable

Page 58: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Extra Fault Detectability

• Theorem: if system is k-fault locatable then it is 2k-fault detectable

• Must show: ∀ fault Fl in 2k processors and resulting errors elw, ∃ check c. |c ∩ elw| = 1

• Note: Failures of 2k processors result in 2 errors as failures of k data elements

• Thus, can break up elw = (eiu ∪ ejv), coming from k-fault patterns Fi and Fj

Page 59: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Extra Fault Detectability

• Theorem: if system is k-fault locatable then it is 2k-fault detectable

• Must show: ∀ eiu, ejv ∃ check c. |c ∩ (eiu ∪ ejv)| = 1

• If (eiu ∪ ejv) happens, both c and c' will notice

[Figure: Venn diagram of eiu and ejv with check c on (ejv − eiu) and check c' on either (eiu − ejv) or (eiu ∩ ejv).]

Page 60: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Fault→Error Models

• So far trying to deal with arbitrary errors
• Actual model of how faults turn into errors not defined
  – i.e. arbitrary

• This is unnecessarily general

• Should focus on realistic models of error generation and propagation
  – Makes it easier to design reliable systems

Page 61: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Single-Input-Driven Model

• Output of computation erroneous if any input(s) are
  – Even if processor is faulty

• If processor is faulty, its computations may or may not be erroneous
  (this is where we use data dependence information)

• Will focus on how model treats single-processor failures

Page 62: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

SID Model Picture

• Di1 … DiWi: data elements on Pi
  – Synonymous with sets of data elements on Pi

• Focus on single-processor failures

[Figure: processor Pi with its data elements Di1, Di2, …, Diw, …, DiWi.]

Page 63: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Fault Model in Practice

• If Pi fails, any subset of Diw’s may have error

• If Diw has error, any data depending on it has error
  – Bijection between Diw and errors Eiw

[Figure: processor Pi with data elements Di1, Di2, …, Diw, …, DiWi and the corresponding errors Ei2, …, Eiw, …, EiWi over the data.]

Page 64: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Single-Fault Detectability in SID

• Brute-Force algorithm: ∀ sets of Eiw's
  – If ∃ check c s.t. |c ∩ (∪ Eiw's)| = 1 then this error pattern is detectable
  – If all patterns detectable, system is single-fault detectable

[Figure: processor Pi with data elements Di1, Di2, …, Diw, …, DiWi and a check c over the data.]

Page 65: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Too Conservative

• Like before, algorithm too conservative
• Examines exponentially many error patterns
• Suppose set of errors E = {E1, E2, …, Er} detected via check c
  – i.e. |c ∩ E| = 1
• Look at sets E' ⊇ {E1, E2, …, Er}

[Figure: data sets D1, D2, D3 with check c and error sets E, E'.]

Page 66: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Too Conservative

• Clearly, all overlap with c on one element
  – Thus, each one detectable
  – Similarly, all unions containing them detectable

• Therefore, if a set of errors is detectable, all unions containing suberrors are also detectable
  – And thus, no need to check them

[Figure: data sets D1, D2, D3 with check c and the Ej's it meets.]

Can ignore: E1, E2, E1 ∪ E2, E1 ∪ E3, E2 ∪ E3, E1 ∪ E2 ∪ E3

Can't ignore: E3

Page 67: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

New Definition of Detectability

• Ei^0 = Ei (start with all possible errors)

• For each check cs:
  – Check that some of Ei^(s-1) is detectable: ∃ Eiw ∈ Ei^(s-1). |cs ∩ Eiw| = 1
    • Now ignore detectable subsets of Ei^(s-1)
    • Remove detectable subsets: Ei^s = Ei^(s-1) − {Eiw : |cs ∩ Eiw| = 1}

• Repeat to ensure rest of Ei^s also detectable

Page 68: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Detectability Example

• Check Ei^0 (= Ei)

• c1 meets E1 and E2

[Figure: data sets D1–D6 on the processor with check c1.]

Page 69: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Detectability Example

• Check Ei^0 (= Ei)

• c1 meets E1 and E2

• Remove them to get Ei^1

[Figure: data sets D1–D6 on the processor with check c1.]

Page 70: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Detectability Example

• Check Ei^1 = {E3, E4, E5, E6}

• c2 meets E3 and E4
  – Also meets E2 but on error E2, c1 will ring

[Figure: data sets D1–D6 on the processor with checks c1 and c2.]

Page 71: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Detectability Example

• Check Ei^1 = {E3, E4, E5, E6}

• c2 meets E3 and E4
  – Also meets E2 but on error E2, c1 will ring

• Remove them to get Ei^2

[Figure: data sets D1–D6 on the processor with checks c1 and c2.]

Page 72: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Detectability Example

• Check Ei^2 = {E5, E6}

• c3 meets E5

[Figure: data sets D1–D6 on the processor with checks c1, c2 and c3.]

Page 73: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Detectability Example

• Check Ei^2 = {E5, E6}

• c3 meets E5

• Remove it to get Ei^3

[Figure: data sets D1–D6 on the processor with checks c1, c2 and c3.]

Page 74: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Detectability Example

• Check Ei^3 = {E6}

• c4 meets E6
  – Recall: circles on left are data on processor i

[Figure: data sets D1–D6 on the processor with checks c1–c4.]

Page 75: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Detectability Example

• Check Ei^3 = {E6}

• c4 meets E6
  – Recall: circles on left are data on processor i

• Remove it to get Ei^4 = ∅

[Figure: data sets D1–D6 on the processor with checks c1–c4.]

Page 76: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Detectability Example

DONE!

[Figure: data sets D1–D6 on the processor with checks c1–c4; all errors covered.]
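The example can be replayed with a short sketch of the elimination over error sets. The concrete data elements inside each Ew and each check are hypothetical, chosen only so that c1 meets E1 and E2, c2 meets E3 and E4 (and also E2), c3 meets E5 and c4 meets E6, as in the slides.

```python
def sid_detectable(errors, checks):
    """One pass over the checks, removing every remaining error set that a
    (g,1) check meets in exactly one element.

    Returns (all_removed, trace), where trace lists (check, removed errors).
    """
    remaining = dict(errors)
    trace = []
    for name, covered in checks.items():
        caught = sorted(w for w, e in remaining.items() if len(covered & e) == 1)
        for w in caught:
            del remaining[w]
        trace.append((name, caught))
    return not remaining, trace


# Hypothetical error sets and checks matching the slide example
errors = {
    "E1": {"x1"}, "E2": {"x2", "y2"}, "E3": {"x3"},
    "E4": {"x4"}, "E5": {"x5"}, "E6": {"x6"},
}
checks = {
    "c1": {"x1", "x2"},
    "c2": {"y2", "x3", "x4"},
    "c3": {"x5"},
    "c4": {"x6"},
}

ok, trace = sid_detectable(errors, checks)
print(ok)
for step in trace:
    print(step)
```

E2 is already gone by the time c2 is examined, which mirrors the remark that c1 will ring on E2.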

Page 77: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Single-Fault Locatability in SID

• Basic definition: Must exist enough checks s.t. all error patterns produced by failure of Pi are differentiable from error patterns of Pj

• Involves a lot of error patterns

• Start with brute-force definition

Page 78: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Brute-Force Definition

• ∀ error patterns Eq = {Ei1, Ei5, Eiw, …} from Pi ∃ checks cq and c1 … cr s.t.
  – |cq ∩ Eq| = 1
    • Detects error Eq
  – |cq ∩ Ej| = 0
    • Ignores any error from Pj
  – c1 … cr detect Ej and all subsets via above algorithm
  – And vice versa (since ck's may ring on Pi's errors)

• Result:
  – Any error pattern in Ei, none in Ej will ring some cq
  – Every pattern in Ej detectable

Page 79: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Responses of Checks

• On error pattern Eq (due to failure of Pi): cq = 1; c1 … cr = ? ? ?

• On any error Ej due to failure of Pj: cq = 0; c1 … cr = 0/1 0/1 0/1
  – At least one must be = 1 (else Ej not detectable)

• Can brute-force evaluate test on every possible Eq

Page 80: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Brute Force Too Exhaustive

• Recall that if |c ∩ {E1, E2, …, Er}| = 1 then same true for all sets containing E1, …, Er

• Thus, can eliminate many of the steps above

Page 81: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

New Definition of Locatability

• Ei^0 = Ei (start with all possible Pi errors)

• For each check cs:

  – Check cs detects some of Ei^(s-1): ∃ Eiw ∈ Ei^(s-1). |cs ∩ Eiw| = 1

  – But not Ej: |cs ∩ Ej| = 0

• Ensure that Ej is detectable via above algorithm

Page 82: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

New Definition of Locatability

• Syndrome of Ei and detectable subsets: cq = 1; c1 … cr = ? ? ?

• Syndrome of Ej and all subsets: cq = 0; c1 … cr = 0/1 0/1 0/1
  – At least one must be = 1 (else Ej not detectable)

• Can now ignore detectable subsets of Ei^(s-1)
• Remove detectable subsets: Ei^s = Ei^(s-1) − {Eiw : |cs ∩ Eiw| = 1}
• Repeat until all of Ei covered
• Do same for Ej
  – In paper, steps for Ei and Ej interleaved

Page 83: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Summary

• Presented graph-based framework for evaluating error detectability & locatability

• Framework deals with arbitrary errors
• Can be specialized to a simpler fault model: Single-Input Driven
• Choon-Sik Park's thesis presents the Multiple-Input Driven model
  – More realistic but complex

Page 84: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Outline

• Matrix-based formalism of Nair et al

• Dependence graph-based formalism of Park et al
  – Includes fault propagation models

• Framework for hierarchical fault tolerant systems by Nair et al
  – Building fault tolerant systems out of fault tolerant components

Page 85: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Building Larger Systems

• Now know how to analyze systems for detectability & locatability

• For large systems this can be very hard/expensive

• Large systems typically made up of smaller components

• Simplifies fault tolerance design

Page 86: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Basic Idea

• Have component with known detectability (=t) & locatability (=l)

• Construct system S out of k components

• What is resulting fault tolerance?

Page 87: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Basic Idea

• System fault tolerance no better than for individual component

• If >t data elements fail in same component, error not detected

• If >l elements fail in component, will not locate

• Detectability & locatability ratio tends to 0 as system size increases!

Page 88: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Hierarchical Design

• To build fault tolerant systems must introduce checks with new components

• Will present hierarchical design scheme with specific detectability & locatability guarantees

• Assumptions:
  – All (g,h) checks have same h
    • No restriction on g
  – Every processor produces only one data element
    • Same true for blocks of processors
  – Checks are fault tolerant
    • Claims that this doesn't change problem

Page 89: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Basic Component

• Start off with basic system:

• System has internal checks
• Fault detectability = t
• Fault locatability = l

[Figure: a single basic block B.]

Page 90: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Basic Component

• Then replicate it k-fold

• Assumptions:
  – Copies are independent
    • (i.e. do not affect each other's data)
  – Each system produces one data element

[Figure: k copies B1, B2, …, Bk.]

Page 91: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Basic Component

• Then replicate it k-fold

• And add additional checks across all copies
• Process repeated d-1 times to get d-level hierarchical system

[Figure: k copies B1, B2, …, Bk with cross-copy checks c1, c2, …, cr.]

Page 92: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Detectability 1 ≤ k ≤ h

• Theorem 1:
  – If 1 ≤ k ≤ h then hierarchical system can detect |B|k^(d-1) errors

• Proof:
  – Base case: d=2
  – Suppose every element has error
  – Each check must deal with k ≤ h errors
  – But they are (g,h) checks and will detect such errors
  – Thus, system can detect |B|k errors

[Figure: k copies B1, B2, …, Bk with cross-copy checks c1, c2, …, cr.]

Page 93: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Detectability 1 ≤ k ≤ h

• Theorem 1:
  – If 1 ≤ k ≤ h then hierarchical system can detect |B|k^(d-1) errors

• Proof:
  – Inductive case: d+1
  – Components Bi each have |B|k^(d-2) elements
  – By argument above, system detects (|B|k^(d-2))k = |B|k^(d-1) errors
    • Argument works because sub-systems at each level produce one data element

[Figure: k copies B1, B2, …, Bk with cross-copy checks c1, c2, …, cr.]

Page 94: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Detectability k>h

• Theorem 2:
  – If k>h then hierarchical system can detect (t+1)(h+1)^(d-1) − 1 errors

• Proof:
  – Base case: d=2
  – Suppose (t+1)(h+1) errors with h+1 copies of B having t+1 errors each
  – Detectability of B = t, so internal checks will not notice errors
  – 2nd level checks will get h+1 errors each: will not notice
  – Thus, ∃ error pattern of size (t+1)(h+1) that will not be detected

[Figure: k copies B1, B2, …, Bk with cross-copy checks c1, c2, …, cr.]

Page 95: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Detectability k>h

• Theorem 2:
  – If k>h then hierarchical system can detect (t+1)(h+1)^(d-1) − 1 errors

• Proof:
  – Base case: d=2
  – Suppose (t+1)(h+1) − 1 errors
  – By pigeonhole principle, some unit has ≤ t errors or some 2nd level check has ≤ h errors
  – Thus, some check at 1st or 2nd level will ring
  – Thus, system detectability = (t+1)(h+1) − 1

[Figure: k copies B1, B2, …, Bk with cross-copy checks c1, c2, …, cr.]

Page 96: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Detectability k>h

• Theorem 2:
  – If k>h then hierarchical system can detect (t+1)(h+1)^(d-1) − 1 errors

• Proof:
  – Inductive case: d+1
  – Components Bi detect Td errors
  – By induction, Td = (t+1)(h+1)^(d-1) − 1
  – By argument above, system detects (Td+1)(h+1) − 1 errors
  – Thus, system detectability = (t+1)(h+1)^d − 1

[Figure: k copies B1, B2, …, Bk with cross-copy checks c1, c2, …, cr.]
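The inductive step can be written out explicitly. This is only the substitution of the induction hypothesis into the base-case bound, with T_{d+1} used here as a name for the detectability of the (d+1)-level system:

```latex
\begin{aligned}
T_{d+1} &= (T_d + 1)(h+1) - 1 \\
        &= \bigl[(t+1)(h+1)^{d-1} - 1 + 1\bigr](h+1) - 1 \\
        &= (t+1)(h+1)^{d-1}(h+1) - 1 \\
        &= (t+1)(h+1)^{d} - 1 .
\end{aligned}
```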

Page 97: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Locatability

• Theorem 3:
  – If k>1 then hierarchical system can locate 2^(d-1)(l+1) − 1 errors

• Proof:
  – Base case: d=2
  – Suppose fault pattern of 2(l+1) errors, l+1 errors in each of two Bi's
  – Bi & Bj can't locate the errors
  – 2nd level checks may locate erroneous rows, not columns
  – Thus, ∃ unlocatable fault pattern of size 2(l+1)

[Figure: k copies B1, B2, …, Bk side by side with cross-copy checks c1, c2, …, cr along the rows.]

Page 98: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Locatability

• Theorem 3:
  – If k>1 then hierarchical system can locate 2^(d-1)(l+1) − 1 errors

• Proof:
  – Base case: d=2
  – Suppose fault pattern of 2(l+1) − 1 errors
  – At most one Bi may have ≥ l+1 errors
    • If none do, we're done
  – Remaining ≤ l errors distributed among other Bj's

[Figure: k copies B1, B2, …, Bk side by side with cross-copy checks c1, c2, …, cr along the rows.]

Page 99: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Locatability

• Let Bi have l+r errors (r ≥ 1)

[Figure: Bi, Bj, …, Bk with cross-copy checks c1, c2, …, cr.]

Page 100: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Locatability

• Let Bi have l+r errors (r ≥ 1)

• Remaining Bj's share remaining l-r+1 errors

• ⇒ At least (l+r)-(l-r+1) = 2r-1 rows only have errors in Bi
  – Exactly 2r-1 rows when all l-r+1 errors are in same Bj

[Figure: Bi, Bj, …, Bk with cross-copy checks c1, c2, …, cr.]

Page 101: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Finding Overwhelmed Unit

• First, find the Bi that have >l errors

• All but one sub-system detects and locates errors correctly

• Overwhelmed subsystem:
  – Detects correctly
    • Locatability = l ⇒ Detectability > 2*l
    • Citation of 1973 paper by Russell & Kime
  – Error location mistakes

Page 102: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Finding Overwhelmed Unit

• In 2r-1 rows only Bi has error
  – Thus, no other row will claim an error there

• 2nd-level checks will catch these errors
  – Bi's checks can't lie about it
  – Will definitely know these are errors

[Figure: grid with columns Bi, Bj, …, Bk and row-count labels l+1 and 2r-1. Legend: known errors, unknown errors, no error.]

Page 103: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Finding Overwhelmed Unit

• Number of errors in Bi = l+r

• Number of known errors 2r-1

• Number of unknown errors in Bi

(l+r)-(2r-1) = l-r+1

• Since r ≥ 1, l-r+1 ≤ l

• Bi's checks can identify l errors
  – Error patterns of size ≤ l produce unique check alert patterns
  – This data is enough to identify remaining unknown errors

Page 104: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Locatability

• Theorem 3:
  – If k>1 then hierarchical system can locate 2^(d-1)(l+1) − 1 errors

• Proof:
  – Base case: d=2
  – Can locate errors of size 2(l+1) − 1
  – Inductive case: d+1
  – Components Bi can locate 2^(d-1)(l+1) − 1 errors
  – By argument above, system locates 2*[(2^(d-1)(l+1) − 1) + 1] − 1 = 2^d(l+1) − 1 errors

Page 105: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Summary

• Presented systematic way to build hierarchical systems with good fault-detection properties

• For d-level system composed of identical independent components
  – Component detectability = t, locatability = l

  T_d = |B|k^(d-1)              for 1 ≤ k ≤ h
  T_d = (t+1)(h+1)^(d-1) − 1    for k > h
  L_d = 2^(d-1)(l+1) − 1        for k > 1
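As a quick sanity check, the closed forms for the k > h detectability bound and the k > 1 locatability bound can be compared against the per-level recurrences used in the proofs. The function names and the sample parameter values are illustrative assumptions.

```python
def detectability(d, t, h, k, base_size):
    """T_d from the summary: |B|k^(d-1) if 1 <= k <= h, else (t+1)(h+1)^(d-1) - 1."""
    if 1 <= k <= h:
        return base_size * k ** (d - 1)
    return (t + 1) * (h + 1) ** (d - 1) - 1


def locatability(d, l):
    """L_d from the summary (valid for k > 1): 2^(d-1)(l+1) - 1."""
    return 2 ** (d - 1) * (l + 1) - 1


def detectability_by_recurrence(d, t, h):
    """Apply T -> (T+1)(h+1) - 1 once per added level (k > h case)."""
    T = t
    for _ in range(d - 1):
        T = (T + 1) * (h + 1) - 1
    return T


def locatability_by_recurrence(d, l):
    """Apply L -> 2(L+1) - 1 once per added level."""
    L = l
    for _ in range(d - 1):
        L = 2 * (L + 1) - 1
    return L


# Sample parameters (hypothetical): t=2, h=1, l=1, k=4 copies, |B|=8, d=3 levels
print(detectability(3, t=2, h=1, k=4, base_size=8), detectability_by_recurrence(3, 2, 1))
print(locatability(3, l=1), locatability_by_recurrence(3, 1))
```

Both pairs agree, which matches the inductive arguments on the Theorem 2 and Theorem 3 slides.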

Page 106: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Conclusion

• Formalisms for analyzing fault detectability & locatability
  – Matrix-based formalism of Nair et al
  – Dependence graph-based formalism of Park et al
    • Includes fault propagation models

• Framework for hierarchical fault tolerant systems by Nair et al
  – Building fault tolerant systems out of fault tolerant components

Page 107: Algorithm-Based Fault Tolerance Theory of Check Placement

CS717

Conclusion

• These schemes have complex rules for acceptable check placements

• Requires detailed analysis of system to place them manually

• More detailed analysis if checks are hand-designed
  – Likely since few known automatic techniques

• Overall, approach can support automatic solutions but currently very manual