FTC (DS) - V - TT - 1
HUMBOLDT-UNIVERSITÄT ZU BERLIN
INSTITUT FÜR INFORMATIK
DEPENDABLE SYSTEMS
Lecture 5
FAULT RECOVERY AND TOLERANCE TECHNIQUES (SYSTEM LEVEL)
Winter semester 1999/2000
Instructor: Prof. Dr. Miroslaw Malek
www.informatik.hu-berlin.de/~rok/ftc
FAULT RECOVERY AND TOLERANCE TECHNIQUES (SYSTEM LEVEL)
• OBJECTIVE:
– TO INTRODUCE MAIN FAULT RECOVERY AND FAULT
TOLERANCE TECHNIQUES FOR COMPUTER SYSTEMS
• CONTENTS:
– DYNAMIC TECHNIQUES
– STATIC TECHNIQUES
– HYBRID TECHNIQUES
FAULT RECOVERY TECHNIQUES
• FAULT RECOVERY IS INITIATED BY SUCCESSFUL FAULT DETECTION AND/OR FAULT LOCATION
• HARDWARE RECOVERY TECHNIQUES INCLUDE
– REPLACEMENT/REPAIR
– RECONFIGURATION, OR
– FAULT MASKING
• SOFTWARE RECOVERY TECHNIQUES INCLUDE
– EXCEPTION HANDLING
– RECOVERY BLOCKS
– MASKING (N-VERSION PROGRAMMING)
– ROLL-BACKWARD
– ROLL-FORWARD
SYSTEM REPLICATION METHODS
• DYNAMIC
– DUPLEX
– BACK-UP SPARING
– DUPLEX AND SPARE
– PAIR AND SPARE
– SOFTWARE-IMPLEMENTED FAULT TOLERANCE (SIFT)
• STATIC
– TRIPLE MODULAR REDUNDANCY (TMR)
– N MODULAR REDUNDANCY (NMR)
– (4-2) CONCEPT - SPECIAL LOGIC
– TMR WITH DUPLEX MODULES
• HYBRID
– HYBRID REDUNDANCY (NMR WITH SPARES)
– TMR WITH TWO SPARES (SPACE SHUTTLE)
– SELF-PURGING REDUNDANCY
– SIFT-OUT MODULAR REDUNDANCY
DUPLEX SYSTEMS (1)
[Figure: duplex system (from Siewiorek and Swarz). Labels: INPUT, primary unit P1, secondary unit P2, OUTPUT 1, OUTPUT 2, COMPARATOR, test and reconfigure, OUTPUT SWITCH.]
DUPLEX SYSTEMS (2)
• If a mismatch occurs, the following methods can be used to identify the faulty system:
– Self-diagnostic program
– Self-checking logic (capabilities)
– Watchdog timer method (periodically reset timer of another processor)
– Outside arbiter (may check signatures or run tests)
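The watchdog timer method from the list above can be sketched as follows. This is a minimal illustration, not the mechanism of any particular system: the names `Watchdog` and `identify_faulty`, the timeout value, and the use of explicit `now` timestamps (instead of a real clock) are all assumptions made for the example.

```python
class Watchdog:
    """Timer that the monitored processor must reset before `timeout` elapses."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout
        self.last_reset = now

    def reset(self, now):
        # Called periodically by the monitored processor while it is healthy.
        self.last_reset = now

    def expired(self, now):
        # True once the monitored processor has been silent for too long.
        return now - self.last_reset > self.timeout


def identify_faulty(watchdogs, now):
    """Return the names of the units whose watchdog has expired at time `now`."""
    return [name for name, wd in watchdogs.items() if wd.expired(now)]


# Hypothetical duplex pair: each unit has a timer held by its partner.
watchdogs = {"P1": Watchdog(timeout=1.0), "P2": Watchdog(timeout=1.0)}
watchdogs["P1"].reset(now=0.9)  # P1 is alive and resets its timer
# P2 hangs and never resets its timer.
suspects = identify_faulty(watchdogs, now=1.5)
print(suspects)
```

Passing the time in explicitly keeps the sketch deterministic; a real implementation would read a hardware timer or a monotonic clock.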
DUPLEX SYSTEMS (3)
SYNCHRONIZATION METHODS
• At the end of each clock period (cycle or microcycle) (e.g., ESS systems, UDET)
• Update and match unit (UPM) compares every bus cycle (e.g., AXE telephone switching system)
• At the end of program execution - program or subroutine level comparison (e.g., COMTRAC railway control system)
RELIABILITY OF DUPLEX SYSTEMS
C - coverage factor (represents the combined probability of successful fault detection and reconfiguration)
Rk - reliability of the control, switching and matching circuitry
$R = \left[ R_m^2 + 2 C R_m (1 - R_m) \right] R_k$
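As a numerical illustration of the duplex reliability formula, the sketch below evaluates it directly. The function name `duplex_reliability` and the sample values are assumptions chosen for the example; `r_m` is taken to be the reliability of a single unit.

```python
def duplex_reliability(r_m, c, r_k):
    """R = [Rm^2 + 2*C*Rm*(1 - Rm)] * Rk for a duplex system."""
    both_ok = r_m ** 2                          # both units work
    one_failed_covered = 2 * c * r_m * (1 - r_m)  # one unit fails and the fault is covered
    return (both_ok + one_failed_covered) * r_k


# Hypothetical values: 0.9 per unit, 95% coverage, 0.99 for the support circuitry.
r = duplex_reliability(r_m=0.9, c=0.95, r_k=0.99)
print(round(r, 5))
```

With perfect coverage and perfect support circuitry (`c = 1`, `r_k = 1`) the expression reduces to 1 - (1 - Rm)^2, the reliability of two units in parallel.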
DUPLEX SYSTEMS (4)
Back-up Sparing
[Figure: back-up sparing. Modules 1, 2, ..., n receive the INPUT; a SWITCH selects the OUTPUT.]
HOT, WARM AND COLD SPARES
DUPLEX AND SPARE
[Figure: duplex and spare. Modules 1, 2 and 3 between INPUT and OUTPUT, with a COMPARATOR and a SWITCH.]
PAIR AND SPARE
[Figure: pair and spare. Modules 1, 2, 3 and 4 between INPUT and OUTPUT, with two COMPARATOR blocks and a SWITCH/COMPARATOR.]
TRIPLE MODULAR REDUNDANCY (TMR) (1)
• A method that incorporates static redundancy into system design
• The voter produces the correct output if the voter itself has not failed and at least two of the three modules are fault-free
[Figure: Triple Modular Redundancy (TMR) configuration. The input feeds Module A, Module B and Module C; their outputs feed the Voter, which produces the voter output.]
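The voting step can be sketched in software as a simple majority vote over three module outputs. This is only an illustration: the name `tmr_vote` is hypothetical, module outputs are assumed to be hashable values, and raising an error when all three disagree is a design choice of the sketch (a hardware voter has no such option).

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value produced by at least two of the three modules."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # All three outputs differ: a single fault can no longer be masked.
        raise ValueError("no majority: all three modules disagree")
    return value


print(tmr_vote(7, 7, 9))  # module C is faulty, its output is masked
```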
TRIPLE MODULAR REDUNDANCY (TMR) (2)
• Reliability
– RTMR = RV × (reliability of 2 out of 3 modules)
– RV - Reliability of the voter
– Rm - Reliability of each module
$R_{TMR} = R_V \left[ R_m^3 + 3 R_m^2 (1 - R_m) \right]$
• When does a TMR system have a higher reliability than the original single module?
– Must have RTMR > Rm
$R_V \left[ R_m^3 + 3 R_m^2 (1 - R_m) \right] > R_m$
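The TMR reliability expression can be checked numerically with a few lines of Python. The function name `tmr_reliability` and the sample module reliabilities are assumptions made for this sketch; the default `r_v=1.0` corresponds to a perfect voter.

```python
def tmr_reliability(r_m, r_v=1.0):
    """R_TMR = Rv * [Rm^3 + 3*Rm^2*(1 - Rm)]."""
    all_three = r_m ** 3                     # all three modules work
    exactly_two = 3 * r_m ** 2 * (1 - r_m)   # exactly two of the three work
    return r_v * (all_three + exactly_two)


for r_m in (0.4, 0.5, 0.9):
    print(r_m, round(tmr_reliability(r_m), 4))
```

With a perfect voter, TMR improves on a module with reliability 0.9 (0.972), breaks even at 0.5, and is worse than a single module at 0.4 (0.352).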
TRIPLE MODULAR REDUNDANCY (TMR) (3)
• Assuming a perfect voter (RV = 1)
• TMR is more reliable only if Rm > 0.5
• Also, the voter must be very reliable. Must have RV > 8/9 ≈ 0.89 (about 0.9) for RTMR > Rm to hold at any Rm
• This technique can be generalized to any odd number of modules N
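Both thresholds quoted above follow from the TMR reliability expression $R_{TMR} = R_V [R_m^3 + 3R_m^2(1 - R_m)]$. A short worked derivation, using only that expression:

```latex
% Perfect voter (R_V = 1): when is TMR better than a single module?
\begin{aligned}
R_{TMR} - R_m &= 3R_m^2 - 2R_m^3 - R_m \\
              &= R_m\,(2R_m - 1)\,(1 - R_m) > 0
              \iff \tfrac{1}{2} < R_m < 1 .
\end{aligned}

% Imperfect voter: R_V (3R_m^2 - 2R_m^3) > R_m requires
\begin{aligned}
R_V &> \frac{1}{3R_m - 2R_m^2}, \\
\max_{R_m}\,(3R_m - 2R_m^2) &= \tfrac{9}{8} \quad \text{at } R_m = \tfrac{3}{4}
\;\Longrightarrow\; R_V > \tfrac{8}{9} \approx 0.89 .
\end{aligned}
```

The second bound is a necessary condition: below a voter reliability of 8/9, which is consistent with the figure of about 0.9 quoted above, no module reliability makes TMR better than a single module.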
[Figure: system reliability Rsys versus module reliability Rm, with one curve for a single module and one for TMR.]
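The generalization to an odd number of modules N can be sketched as a binomial sum over all outcomes in which a majority of modules still works. The function name `nmr_reliability` is hypothetical, and the sketch assumes independent module failures and a single voter with reliability `r_v`.

```python
from math import comb


def nmr_reliability(n, r_m, r_v=1.0):
    """Reliability of N-modular redundancy with majority voting (n odd)."""
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number")
    majority = n // 2 + 1
    # Probability that at least a majority of the n modules works.
    p_majority = sum(
        comb(n, k) * r_m ** k * (1 - r_m) ** (n - k)
        for k in range(majority, n + 1)
    )
    return r_v * p_majority


for n in (1, 3, 5):
    print(n, round(nmr_reliability(n, 0.9), 5))
```

For N = 3 the sum reduces to the TMR expression; for any odd N the break-even point with a perfect voter stays at Rm = 0.5.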
TMR WITH DUPLEX MODULES (USED IN JAPANESE TRAIN SHINKANSEN)
[Figure: modules 1 to 6 between INPUT and OUTPUT, three COMPARATORS / SWITCHERS (1, 2, 3), and a VOTER.]
HYBRID REDUNDANT SYSTEM (1)
• One of the drawbacks of N-modular redundancy with voting (NMR) is that fault masking ability deteriorates as more copies fail.
• Hybrid redundancy combines NMR with backup sparing.
[Figure: basic organization of a hybrid-redundant system (Siewiorek & Swarz). N + S functional units M1, M2, M3, ..., MN+S feed a switch that selects N out of (N + S); the N selected outputs go to a voter that produces the voted output, and a disagreement detector drives the switch through control lines. Voter, switch and detector together form the Voter-Switch-Detector (VSD).]
HYBRID REDUNDANT SYSTEM (2)
• Assuming the same reliability of modules on-line and on standby, the system reliability is:
• P = ⌊N/2⌋ + S = the maximum number of modules that can fail without crashing the system (N/2 is rounded down, since N is odd)
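One common way to evaluate such a system, given the definition of P above, is to sum the probabilities of all outcomes with at most P failed modules out of N + S. The sketch below does this; it is an illustration under stated assumptions, not necessarily the exact expression used in the lecture. Assumptions: the voter-switch-detector is perfect, module failures are independent, and standby and on-line modules share the same reliability `r_m`. The function name `hybrid_reliability` is hypothetical.

```python
from math import comb


def hybrid_reliability(n, s, r_m):
    """Probability that at most P = n // 2 + s of the n + s modules have failed."""
    total = n + s
    p = n // 2 + s  # maximum number of tolerated module failures
    return sum(
        comb(total, i) * (1 - r_m) ** i * r_m ** (total - i)
        for i in range(p + 1)
    )


for spares in (0, 1, 2):
    print(spares, round(hybrid_reliability(3, spares, 0.9), 5))
```

With S = 0 and N = 3 the result equals the TMR reliability with a perfect voter (0.972 at Rm = 0.9); one spare raises it to 0.9963.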
Plots of hybrid TMR system reliability (Rs) vs. individual module reliability (Rm). S is the number of spares. (Siewiorek and Swarz)
a. System with standby failure rate equal to on-line failure rate
b. System with standby failure rate 10% of on-line failure rate
[Figure: both panels plot RS (0.00 to 1.00) against RM (0.01 to 1.00), with curves for Simplex, S = 0 (TMR), and S = 1, 2, 4, 6.]
SELF-PURGING REDUNDANCY
[Figure: system using self-purging redundancy (Siewiorek and Swarz). Modules M1, M2, ..., MP, each with an S-R flip-flop, feed a threshold gate (threshold = M) that produces the voted system output. Other labels: clock, delay, initialize and retry.]
• Potentially more reliable than hybrid
• Threshold gates are analog circuit elements
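The idea of self-purging can be sketched in software as follows. This is a simplified model, not the circuit from the figure: it assumes single-bit module outputs, assumes that a purged module contributes 0 to the threshold gate, and purges any active module whose output differs from the voted output. The name `self_purging_step` and the sample values are hypothetical.

```python
def self_purging_step(outputs, active, threshold):
    """One clock step: threshold vote, then purge the modules that disagree."""
    # A purged (inactive) module contributes 0 to the threshold gate.
    ones = sum(bit for bit, on in zip(outputs, active) if on)
    voted = 1 if ones >= threshold else 0
    # Each still-active module that disagrees with the vote removes itself.
    new_active = [on and bit == voted for bit, on in zip(outputs, active)]
    return voted, new_active


active = [True] * 5
# Step 1: the third module outputs 0 while the correct value is 1.
voted1, active = self_purging_step([1, 1, 0, 1, 1], active, threshold=2)
# Step 2: the purged module now outputs 1, but it no longer counts.
voted2, active = self_purging_step([0, 0, 1, 0, 0], active, threshold=2)
print(voted1, voted2, active)
```

Unlike hybrid redundancy, there is no separate pool of spares here: all modules vote from the start and faulty ones drop out one by one.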
SIFT-OUT MODULAR REDUNDANCY
(N-2)-fault-tolerant
• BASIC CONCEPT:
– COMPARE EACH PAIR AND ELIMINATE FAULTY UNITS
[Figure: basic configuration for sift-out redundancy. N redundant modules M1, M2, ..., MN operate synchronously on a common clock and feed a comparator, a detector and a collector, which produces the output. There are C(N,2) lines E12, E13, ..., E(N-1)N, each signaling the disagreement of a pair of modules, and N lines F1, F2, ..., FN, where line Fi signals the failure of module i.]
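The pairwise-compare-and-eliminate concept can be sketched as below. The signals `e` and `f` mirror the E and F lines of the figure, but the detection rule is an assumption of this sketch: a module is flagged as failed when it disagrees with more than half of the other active modules. The function name `sift_out_step` and the sample values are hypothetical.

```python
from itertools import combinations


def sift_out_step(outputs, active):
    """Compare every pair of active modules and sift out the faulty ones."""
    ids = [i for i, on in enumerate(active) if on]
    # Comparator: e[(i, j)] is True when active modules i and j disagree.
    e = {(i, j): outputs[i] != outputs[j] for i, j in combinations(ids, 2)}
    # Detector (assumed rule): f[i] is True when module i disagrees with
    # more than half of the other active modules.
    f = {}
    for i in ids:
        others = [j for j in ids if j != i]
        disagreements = sum(e[(min(i, j), max(i, j))] for j in others)
        f[i] = len(others) > 0 and 2 * disagreements > len(others)
    # Collector: take the output of any module not flagged as failed.
    good = [i for i in ids if not f[i]]
    output = outputs[good[0]] if good else None
    new_active = [on and not f.get(i, False) for i, on in enumerate(active)]
    return output, new_active


active = [True, True, True, True]
out1, active = sift_out_step([5, 5, 9, 5], active)  # module 2 fails
out2, active = sift_out_step([5, 7, 9, 5], active)  # module 1 fails next
print(out1, out2, active)
```

Starting from N = 4 modules, the system keeps producing the correct output after two successive failures, which matches the (N-2)-fault-tolerant label. With only two active modules left, a disagreement can no longer be attributed to one of them, and the sketch returns `None`.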
TMR WITH TWO SPARES (USED IN SPACE SHUTTLE)
[Figure: modules 1 to 5 between INPUT and OUTPUT, with a VOTER / SWITCH.]
PRIMARY MODULES 1, 2 and 3
“WARM” SPARE 4
“COLD” SPARE 5