FTC (DS) - V - TT - 1 HUMBOLDT-UNIVERSITÄT ZU BERLIN INSTITUT FÜR INFORMATIK DEPENDABLE SYSTEMS Lecture 5 FAULT RECOVERY AND TOLERANCE TECHNIQUES (SYSTEM LEVEL) Winter semester 99/00 Lecturer: Prof. Dr. Miroslaw Malek www.informatik.hu-berlin.de/~rok/ftc


FTC (DS) - V - TT - 1

HUMBOLDT-UNIVERSITÄT ZU BERLIN
INSTITUT FÜR INFORMATIK

DEPENDABLE SYSTEMS

Lecture 5

FAULT RECOVERY AND TOLERANCE TECHNIQUES (SYSTEM LEVEL)

Winter semester 99/00

Lecturer: Prof. Dr. Miroslaw Malek

www.informatik.hu-berlin.de/~rok/ftc


FAULT RECOVERY AND TOLERANCE TECHNIQUES (SYSTEM LEVEL)

• OBJECTIVE:

– TO INTRODUCE MAIN FAULT RECOVERY AND FAULT TOLERANCE TECHNIQUES FOR COMPUTER SYSTEMS

• CONTENTS:

– DYNAMIC TECHNIQUES

– STATIC TECHNIQUES

– HYBRID TECHNIQUES


FAULT RECOVERY TECHNIQUES

FAULT RECOVERY IS INITIATED BY SUCCESSFUL FAULT DETECTION AND/OR FAULT LOCATION

• HARDWARE RECOVERY TECHNIQUES INCLUDE
– REPLACEMENT/REPAIR
– RECONFIGURATION, OR
– FAULT MASKING

• SOFTWARE RECOVERY TECHNIQUES INCLUDE
– EXCEPTION HANDLING
– RECOVERY BLOCKS
– MASKING (N-VERSION PROGRAMMING)
– ROLL-BACKWARD
– ROLL-FORWARD


SYSTEM REPLICATION METHODS

• DYNAMIC
– DUPLEX
– BACK-UP SPARING
– DUPLEX AND SPARE
– PAIR AND SPARE
– SOFTWARE-IMPLEMENTED FAULT TOLERANCE (SIFT)

• STATIC
– TRIPLE MODULAR REDUNDANCY (TMR)
– N MODULAR REDUNDANCY (NMR)
– (4-2) CONCEPT
– SPECIAL LOGIC
– TMR WITH DUPLEX MODULES

• HYBRID
– HYBRID REDUNDANCY (NMR WITH SPARES)
– TMR WITH TWO SPARES (SPACE SHUTTLE)
– SELF-PURGING REDUNDANCY
– SIFT-OUT MODULAR REDUNDANCY


DUPLEX SYSTEMS (1)

[Figure (from Siewiorek and Swarz): duplex system with INPUT, PRIMARY UNIT (P1), SECONDARY UNIT (P2), COMPARATOR, Test and Reconfigure, OUTPUT 1, OUTPUT 2 and OUTPUT SWITCH.]


DUPLEX SYSTEMS (2)

• If a mismatch occurs, the following methods can be used to identify the faulty system:

– Self-diagnostic program

– Self-checking logic (capabilities)

– Watchdog timer method (each processor periodically resets a timer of the other processor)

– Outside arbiter (may check signatures or run tests)
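The comparator only reports that the two units disagree; one of the methods above is still needed to decide which unit is faulty. A minimal sketch of the self-diagnostic variant, assuming each unit is a plain function and that `self_test` (a hypothetical helper, not from the lecture) checks it against a few inputs with known answers:

```python
from typing import Callable, Optional, Tuple

Unit = Callable[[int], int]


def self_test(unit: Unit) -> bool:
    """Hypothetical self-diagnostic: run the unit on inputs with known results."""
    known_answers = {0: 0, 3: 9, 5: 25}  # assumed test vectors for a squaring unit
    return all(unit(x) == y for x, y in known_answers.items())


def duplex_step(primary: Unit, secondary: Unit, x: int) -> Tuple[Optional[int], str]:
    """Run both units, compare, and diagnose on mismatch."""
    out1, out2 = primary(x), secondary(x)
    if out1 == out2:
        return out1, "match"
    # Mismatch: the comparator alone cannot tell which unit failed.
    primary_ok, secondary_ok = self_test(primary), self_test(secondary)
    if primary_ok and not secondary_ok:
        return out1, "secondary faulty"
    if secondary_ok and not primary_ok:
        return out2, "primary faulty"
    return None, "undiagnosed"  # both pass or both fail the self-test


if __name__ == "__main__":
    good = lambda v: v * v
    bad = lambda v: v * v + 1  # simulated permanent fault
    print(duplex_step(good, bad, 4))
```

Note that two units failing in the same way still produce a "match"; a duplex comparator cannot detect such common-mode faults.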


DUPLEX SYSTEMS (3)

SYNCHRONIZATION METHODS

• At the end of each clock period (cycle or microcycle) (e.g., ESS systems, UDET)

• Update and match unit (UPM) compares every bus cycle (e.g., AXE telephone switching system)

• At the end of program execution - program or subroutine level comparison (e.g., COMTRAC railway control system)

RELIABILITY OF DUPLEX SYSTEMS

C - coverage factor (represents the combined probability of successful fault detection and reconfiguration) 

Rk - reliability of the control, switching and matching circuitry

$$R = \left[ R_m^2 + 2 C R_m (1 - R_m) \right] R_k$$
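The duplex reliability expression is easy to evaluate numerically. The sketch below is illustrative only; the function and parameter names are assumptions, with `r_m` standing for the reliability of a single module, `c` for the coverage factor C and `r_k` for Rk:

```python
def duplex_reliability(r_m: float, c: float, r_k: float = 1.0) -> float:
    """R = [Rm^2 + 2*C*Rm*(1 - Rm)] * Rk for a duplex system."""
    both_units_work = r_m ** 2
    one_fails_and_is_covered = 2.0 * c * r_m * (1.0 - r_m)
    return (both_units_work + one_fails_and_is_covered) * r_k


if __name__ == "__main__":
    # Effect of coverage for Rm = 0.9 and ideal control circuitry (Rk = 1).
    for c in (0.0, 0.5, 0.9, 1.0):
        print(f"C = {c:.1f}: R = {duplex_reliability(0.9, c):.4f}")
```

With Rk = 1 the expression exceeds Rm only when C > 0.5, so a duplex system with poor coverage can be less reliable than a single unit.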


DUPLEX SYSTEMS (4)

Back-up Sparing

[Figure: back-up sparing with INPUT, MODULES 1, 2, ..., n, SWITCH and OUTPUT.]

HOT, WARM AND COLD SPARES


DUPLEX AND SPARE

[Figure: duplex and spare with INPUT, MODULES 1, 2 and 3, COMPARATOR, SWITCH and OUTPUT.]


PAIR AND SPARE

[Figure: pair and spare with INPUT, MODULES 1, 2, 3 and 4, two COMPARATORs, SWITCH/COMPARATOR and OUTPUT.]


TRIPLE MODULAR REDUNDANCY (TMR) (1)

• A method that incorporates static redundancy into system design

• The voter produces the correct output if the voter itself has not failed and at least two of the three modules are fault-free

[Figure: Triple Modular Redundancy (TMR) configuration. The Input feeds Module A, Module B and Module C; the Voter produces the Voter output.]
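The voting step can be sketched in a few lines. This is only an illustration, not the lecture's design: `majority_vote` works on whole output values, while `bitwise_vote` is the usual two-out-of-three logic applied to each bit of three words. Both names are assumptions made for this example.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by more than half of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value


def bitwise_vote(a: int, b: int, c: int) -> int:
    """Two-out-of-three vote applied independently to every bit."""
    return (a & b) | (b & c) | (a & c)


if __name__ == "__main__":
    print(majority_vote([7, 7, 9]))  # module C is faulty and gets masked
    print(bin(bitwise_vote(0b1010, 0b1010, 0b0110)))
```

A single faulty module is masked without any detection or reconfiguration step, which is what makes the technique static.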


TRIPLE MODULAR REDUNDANCY (TMR) (2)

• Reliability
– RTMR = RV × (reliability of 2 out of 3 modules)
– RV - Reliability of the voter
– Rm - Reliability of each module

$$R_{TMR} = R_V \left[ R_m^3 + 3 R_m^2 (1 - R_m) \right]$$

• When does a TMR system have a higher reliability than the original single module?
– Must have RTMR > Rm

$$R_V \left[ R_m^3 + 3 R_m^2 (1 - R_m) \right] > R_m$$
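To get a feel for the inequality, the TMR expression can be tabulated against the single-module reliability. A minimal sketch; the function name and the sample values of Rm are assumptions chosen for illustration:

```python
def tmr_reliability(r_m: float, r_v: float = 1.0) -> float:
    """RTMR = RV * [Rm^3 + 3*Rm^2*(1 - Rm)]."""
    all_three_work = r_m ** 3
    exactly_two_work = 3.0 * r_m ** 2 * (1.0 - r_m)
    return r_v * (all_three_work + exactly_two_work)


if __name__ == "__main__":
    print(" Rm     RTMR   better than single module?")
    for r_m in (0.3, 0.5, 0.7, 0.9, 0.99):
        r_tmr = tmr_reliability(r_m)
        print(f"{r_m:4.2f}  {r_tmr:7.4f}  {r_tmr > r_m}")
```

Passing a second argument, for example `tmr_reliability(0.9, 0.95)`, shows how quickly an imperfect voter erodes the gain.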


TRIPLE MODULAR REDUNDANCY (TMR) (3)

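• Assuming a perfect voter (RV = 1), TMR is more reliable than a single module only if Rm > 0.5

• The voter must also be very reliable: RTMR > Rm is possible only if RV > 8/9 (about 0.9), and that bound is reached only at the most favorable module reliability, Rm = 0.75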

• This technique can be generalized to any odd number of modules N
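Both thresholds follow from the TMR reliability expression. A short derivation, using only RV, Rm and RTMR as defined in the lecture and assuming 0 < Rm < 1:

```latex
\begin{align*}
R_{TMR} &= R_V\left[R_m^3 + 3R_m^2(1 - R_m)\right] = R_V\left(3R_m^2 - 2R_m^3\right) \\[4pt]
\text{Perfect voter } (R_V = 1):\quad
3R_m^2 - 2R_m^3 > R_m
  &\iff 3R_m - 2R_m^2 - 1 > 0 \\
  &\iff (2R_m - 1)(1 - R_m) > 0 \\
  &\iff R_m > \tfrac{1}{2} \\[4pt]
\text{Imperfect voter:}\quad
R_V\left(3R_m^2 - 2R_m^3\right) > R_m
  &\iff R_V > \frac{1}{3R_m - 2R_m^2} \\
\max_{R_m}\left(3R_m - 2R_m^2\right) &= \tfrac{9}{8} \quad \text{at } R_m = \tfrac{3}{4} \\
\Longrightarrow\quad R_V &> \tfrac{8}{9} \approx 0.89
\end{align*}
```

The bound on RV is a necessary condition: for module reliabilities away from 0.75, the voter has to be better still.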

[Figure: system reliability Rsys versus module reliability Rm, both from 0 to 1, with one curve for a Single Module and one for TMR.]
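The generalization to any odd number of modules N can be sketched as a binomial sum over the cases in which a majority of modules still works. This assumes independent module failures and identical module reliability; the function name is hypothetical:

```python
from math import comb


def nmr_reliability(n: int, r_m: float, r_v: float = 1.0) -> float:
    """Reliability of N-modular redundancy with majority voting (N odd)."""
    if n < 1 or n % 2 == 0:
        raise ValueError("N must be a positive odd number")
    majority = n // 2 + 1
    # Sum the probability of exactly k working modules for every k >= majority.
    at_least_majority = sum(
        comb(n, k) * r_m ** k * (1.0 - r_m) ** (n - k)
        for k in range(majority, n + 1)
    )
    return r_v * at_least_majority


if __name__ == "__main__":
    for n in (1, 3, 5, 7):
        print(f"N = {n}: Rm = 0.9 -> {nmr_reliability(n, 0.9):.5f}, "
              f"Rm = 0.4 -> {nmr_reliability(n, 0.4):.5f}")
```

For N = 3 this reduces to the TMR expression. More modules help when Rm > 0.5 and hurt when Rm < 0.5.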


TMR WITH DUPLEX MODULES (USED IN JAPANESE TRAIN SHINKANSEN)

[Figure: INPUT, MODULES 1 to 6, COMPARATORS / SWITCHERS 1 to 3, VOTER and OUTPUT.]


HYBRID REDUNDANT SYSTEM (1)

• One of the drawbacks of N-modular redundancy with voting (NMR) is that fault masking ability deteriorates as more copies fail.

• Hybrid redundancy combines NMR with backup sparing.

[Figure: Basic organization of a hybrid-redundant system (Siewiorek & Swarz). The N + S functional units M1, M2, M3, ..., MN+S feed a switch that selects N out of (N + S); the voter produces the voted output; a disagreement detector and control lines complete the Voter-Switch-Detector (VSD).]


HYBRID REDUNDANT SYSTEM (2)

• Assuming the same reliability of modules on-line and on standby, the system reliability is:

• P = ⌊N/2⌋ + S is the maximum number of modules that can fail without crashing the system (N/2 is rounded down, since N is odd)
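One way to evaluate hybrid redundancy numerically is to treat the system as surviving while at most P of the N + S modules have failed. The sketch below is an illustrative model, not necessarily the exact expression used in the lecture. It assumes independent failures, the same reliability for on-line and standby modules, and a voter-switch-detector with reliability `r_vsd` (a hypothetical parameter, ideal by default):

```python
from math import comb


def hybrid_reliability(n: int, s: int, r_m: float, r_vsd: float = 1.0) -> float:
    """Probability that at most P = floor(N/2) + S of the N + S modules fail."""
    total = n + s
    p = n // 2 + s  # maximum number of tolerated module failures
    survive = sum(
        comb(total, i) * (1.0 - r_m) ** i * r_m ** (total - i)
        for i in range(p + 1)
    )
    return r_vsd * survive


if __name__ == "__main__":
    # TMR core (N = 3) with a growing number of spares, Rm = 0.9.
    for s in (0, 1, 2, 4, 6):
        print(f"S = {s}: Rs = {hybrid_reliability(3, s, 0.9):.6f}")
```

With S = 0 the model reduces to plain TMR.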


[Figure: Plots of hybrid TMR system reliability (Rs) vs. individual module reliability (Rm); S is the number of spares (Siewiorek and Swarz). Panel a: system with standby failure rate equal to on-line failure rate. Panel b: system with standby failure rate 10% of on-line failure rate. Each panel shows curves for Simplex, S = 0 (TMR), and S = 1, 2, 4, 6, with RM on the horizontal axis and RS on the vertical axis.]


SELF-PURGING REDUNDANCY

[Figure: System using self-purging redundancy (Siewiorek and Swarz). Modules M1, M2, ..., Mp, each with an S-R flip-flop, feed a threshold gate (Threshold = M) that produces the voted system output; Clock, Delay and Initialize-and-retry signals are also shown.]

• Potentially more reliable than hybrid

• Threshold gates are analog circuit elements
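A behavioural sketch of the idea, under stated assumptions: module outputs are single bits, the threshold gate outputs 1 when at least `threshold` active modules output 1, and a module whose output differs from the voted output has its flag cleared and stops contributing. The names and the list-based model are illustrative, not taken from the lecture.

```python
from typing import List


def self_purging_step(outputs: List[int], active: List[bool], threshold: int) -> int:
    """One clock step: threshold vote over active modules, then purge dissenters."""
    ones = sum(bit for bit, is_active in zip(outputs, active) if is_active)
    voted = 1 if ones >= threshold else 0
    for i, bit in enumerate(outputs):
        if active[i] and bit != voted:
            active[i] = False  # models resetting the module's flip-flop
    return voted


if __name__ == "__main__":
    active = [True] * 5
    print(self_purging_step([1, 1, 1, 0, 1], active, threshold=2), active)
    print(self_purging_step([0, 0, 0, 1, 0], active, threshold=2), active)
```

In the second step the purged module outputs a wrong 1, but it no longer counts towards the threshold.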


SIFT-OUT MODULAR REDUNDANCY

(N-2)-fault-tolerant

• BASIC CONCEPT:

– COMPARE EACH PAIR AND ELIMINATE FAULTY UNITS

[Figure: Basic configuration for sift-out redundancy. N redundant modules M1, M2, ..., MN, operating synchronously, with D1, ..., DN, Clock, Comparator, Detector and Collector producing the Output. There are C(N,2) lines E12, E13, ..., E(N-1)N, each for signaling the disagreement of a pair of modules, and N lines F1, F2, ..., FN, where line Fi signals the failure of module i.]
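The pairwise-comparison idea can be sketched as follows. The decision rule is an assumption made for illustration: a module is flagged (Fi) when it disagrees with every other module still in service, and the collector passes on the common value of the survivors. The function name and the list-based state are hypothetical.

```python
from itertools import combinations
from typing import List, Optional


def sift_out_step(outputs: List[int], failed: List[bool]) -> Optional[int]:
    """Compare every pair of active modules, flag faulty ones, collect the output."""
    active = [i for i, is_failed in enumerate(failed) if not is_failed]
    # Comparator: one disagreement signal E[i, j] per pair of active modules.
    e = {(i, j): outputs[i] != outputs[j] for i, j in combinations(active, 2)}
    # Detector: with at least three active modules, a module that disagrees
    # with all the others is taken to be the faulty one.
    if len(active) >= 3:
        newly_failed = [
            i for i in active
            if all(e[(min(i, j), max(i, j))] for j in active if j != i)
        ]
        for i in newly_failed:
            failed[i] = True
    # Collector: forward the value on which all surviving modules agree.
    values = {outputs[i] for i in active if not failed[i]}
    return values.pop() if len(values) == 1 else None


if __name__ == "__main__":
    failed = [False] * 4
    print(sift_out_step([5, 5, 9, 5], failed), failed)  # module 3 sifted out
    print(sift_out_step([5, 8, 0, 5], failed), failed)  # module 2 sifted out
    print(sift_out_step([5, 8, 0, 6], failed), failed)  # two left, they disagree
```

With N = 4 the sketch survives two successive failures. Once only two modules remain, a disagreement can be detected but the faulty module can no longer be located, which matches the (N-2) figure in the slide title.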


TMR WITH TWO SPARES (USED IN SPACE SHUTTLE)

[Figure: INPUT, MODULES 1 to 5, VOTER / SWITCH and OUTPUT.]

PRIMARY MODULES 1, 2 and 3

“WARM” SPARE 4

“COLD” SPARE 5