Error Detection in Hardware
VO Hardware-Software-Codesign
Philipp Jahn
6.6.2007
Error detection
- How to detect errors with hardware methods during system operation
- Conditions
  - Coverage (probability that an error is detected)
  - Latency (time between the start of an error and its detection)
  - Performance
Slide from VO „Echtzeitsysteme“, H. Kopetz
Hardware-based error detection
- Hardware redundancy
  - Passive (TMR, majority voting)
  - Active (duplication and comparison, standby)
  - Hybrid
- Information redundancy
  - Parity
  - Checksums
  - Arithmetic codes
- Time redundancy
- Watchdog timers
- Checking
  - Capability checking
  - Consistency checking
  - Control-flow checking
Information redundancy (1)
- Detection / Correction
- Hamming distance
  - X = (1001), Y = (0111)
  - d(X,Y) = 3
- SEC-DED (single error correction, double error detection)
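The distance in the example above can be checked with a few lines of Python. This is only a sketch: the function names `hamming_distance` and `capabilities` are made up for illustration, and `capabilities` uses the standard textbook bounds (a code with minimum distance d can detect up to d-1 errors, or alternatively correct up to floor((d-1)/2) errors).

```python
def hamming_distance(x: str, y: str) -> int:
    """Number of bit positions in which two equal-length code words differ."""
    if len(x) != len(y):
        raise ValueError("code words must have the same length")
    return sum(a != b for a, b in zip(x, y))


def capabilities(d_min: int) -> tuple:
    """(max errors detectable, max errors correctable) for minimum distance d_min.

    The two numbers are alternatives: detection only, or correction only.
    """
    return d_min - 1, (d_min - 1) // 2


print(hamming_distance("1001", "0111"))  # 3, as on the slide
print(capabilities(3))                   # (2, 1)
```

A SEC-DED code needs a minimum distance of 4, so that a single error can be corrected while a double error is still recognized as uncorrectable.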
Information redundancy (2)
- Parity
  - One extra bit (even / odd)
  - Decoding circuit (set of XOR gates)
  - Routine checking in buses, memory and registers
  - Detects single-bit errors (no stuck-at faults)
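A minimal software model of even parity, mirroring the XOR tree of the decoding circuit. The helper names (`parity_of`, `encode_even`, `check_even`) are illustrative assumptions, not part of the original slides.

```python
def parity_of(bits):
    """XOR of all bits: 1 if the number of ones is odd, else 0."""
    p = 0
    for b in bits:
        p ^= b
    return p


def encode_even(data):
    """Append one parity bit so that the whole word has an even number of ones."""
    return data + [parity_of(data)]


def check_even(word):
    """True if the word (data plus parity bit) still has even parity."""
    return parity_of(word) == 0


word = encode_even([1, 0, 1, 1])
single = word.copy()
single[2] ^= 1                    # one flipped bit
double = single.copy()
double[0] ^= 1                    # a second flipped bit

print(check_even(word), check_even(single), check_even(double))
```

The last check passes even though two bits are wrong: a single parity bit detects only an odd number of bit errors.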
Information redundancy (3)
- Overlapping parity
- m-of-n codes
- Duplication codes
- Cyclic redundancy checks (CRC)
  - Sender and receiver agree upon a generator polynomial G(x)
  - Append a checksum (n-k bits) to the end of the data frame (k bits)
  - Received frame mod G(x) = 0 → correct
  - Simple implementation (linear feedback shift register and XOR gates)
  - Detect single-bit errors, multiple adjacent bit errors affecting fewer than n-k bits, and burst transient errors
  - Highly successful in serial transmission (communication channels: Ethernet, Token Ring)
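The CRC procedure can be sketched as bitwise polynomial division over GF(2). The generator `1011` (x^3 + x + 1) and the data word are arbitrary illustration values, and the function names are assumptions; hardware would use a linear feedback shift register instead of this loop. Here r denotes the number of check bits, i.e. the degree of G(x).

```python
def crc_remainder(bits: str, generator: str) -> str:
    """Remainder of bits(x) divided by generator(x) over GF(2), as r bits."""
    r = len(generator) - 1
    work = [int(b) for b in bits]
    gen = [int(b) for b in generator]
    for i in range(len(work) - r):
        if work[i]:                       # leading 1: subtract (XOR) the generator
            for j, g in enumerate(gen):
                work[i + j] ^= g
    return "".join(str(b) for b in work[-r:])


def crc_encode(data: str, generator: str) -> str:
    """Sender: append r check bits so that the frame is divisible by the generator."""
    r = len(generator) - 1
    return data + crc_remainder(data + "0" * r, generator)


def crc_ok(frame: str, generator: str) -> bool:
    """Receiver: the frame is accepted if the remainder is all zeros."""
    return "1" not in crc_remainder(frame, generator)


G = "1011"
frame = crc_encode("11010011101100", G)
corrupted = frame[:5] + ("1" if frame[5] == "0" else "0") + frame[6:]
print(frame, crc_ok(frame, G), crc_ok(corrupted, G))
```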
Information redundancy (4)
- Checksums
Information redundancy (5)
- Arithmetic codes
  - Detect errors in arithmetic units (parity would not be preserved)
  - Separate or nonseparate
  - Examples
    - AN codes
    - Residue codes
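A small sketch of both example codes, assuming the check constant A = 3 and the modulus M = 3 (both chosen only for illustration). The AN code is nonseparate (data and check information are merged into one number); the residue code is separate (the check value travels next to the data).

```python
A = 3  # assumed AN-code constant
M = 3  # assumed residue-code modulus


def an_encode(x: int) -> int:
    """Nonseparate: the code word is A * x, so A*x + A*y = A*(x + y)."""
    return A * x


def an_valid(code_word: int) -> bool:
    """A valid AN code word is a multiple of A."""
    return code_word % A == 0


def residue_add(x: int, rx: int, y: int, ry: int):
    """Separate: add the data and, independently, the residues modulo M."""
    return x + y, (rx + ry) % M


def residue_valid(value: int, residue: int) -> bool:
    """The result is consistent if its residue matches the carried residue."""
    return value % M == residue


# AN code: the sum of two code words is again a code word.
s = an_encode(5) + an_encode(7)
print(an_valid(s), s // A)          # valid, decodes to 12
print(an_valid(s ^ 1))              # one flipped bit breaks divisibility by 3

# Residue code: the check value is computed next to the data.
total, r = residue_add(5, 5 % M, 7, 7 % M)
print(residue_valid(total, r), residue_valid(total ^ 4, r))
```

With A = 3, any single bit flip changes the value by a power of two, which is never a multiple of 3, so it is always detected.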
Time redundancy (1)
- Repeat the computation two or more times and compare the results (detection, or correction by majority)
- Error detected → possibly retry
- Good for detecting transient faults
- Does not protect against errors resulting from permanent faults
- No extra hardware needed, but longer processing time
  - Suited to non-time-critical applications
- Alternating logic also detects permanent faults (self-checking circuits, f(x) = f'(x'))
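Repetition with comparison can be modelled as follows. The fault model is an assumption made for the sketch: `make_square` returns a squaring function whose result has its lowest bit flipped on the calls selected by `is_faulty_call`. All names are hypothetical.

```python
from collections import Counter


def run_redundant(compute, x, repeats=3):
    """Run compute(x) several times; return (majority result, error_detected)."""
    results = [compute(x) for _ in range(repeats)]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= repeats // 2:
        raise RuntimeError("no majority, retry needed")
    return value, votes != repeats


def make_square(is_faulty_call):
    """Squaring unit; is_faulty_call(n) says whether the n-th call is corrupted."""
    calls = {"n": 0}

    def square(x):
        calls["n"] += 1
        result = x * x
        if is_faulty_call(calls["n"]):
            result ^= 1                  # simulated bit flip in the result
        return result

    return square


healthy = run_redundant(make_square(lambda n: False), 7)
transient = run_redundant(make_square(lambda n: n == 2), 7)
permanent = run_redundant(make_square(lambda n: True), 7)
print(healthy, transient, permanent)
```

The transient fault is detected and outvoted, while the permanent fault corrupts every run identically and passes unnoticed, which is the limitation named in the list above.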
Time redundancy (2)
- Handle permanent faults by encoding the second computation (the encoding must not alter the calculation)
- e.g. k-shift
  - Errors in k-1 consecutive bits of an arithmetic or logical operation are detected
  - Additional hardware (two shifters, storage register, comparator)
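The k-shift idea can be illustrated with a simulated adder whose output has one bit permanently stuck at 0. This is a sketch under that assumed fault model; `make_adder` and `add_with_shift_check` are hypothetical names. The second computation uses operands shifted left by k bits, so the faulty bit position hits a different bit of the true result.

```python
def make_adder(stuck_at_zero_bit=None):
    """Adder model; optionally one output bit is permanently stuck at 0."""
    def add(a, b):
        result = a + b
        if stuck_at_zero_bit is not None:
            result &= ~(1 << stuck_at_zero_bit)
        return result
    return add


def add_with_shift_check(add, a, b, k=1):
    """Compute a + b twice: plain, and with operands shifted left by k bits.

    Returns (first result, error_detected). The shift is undone before comparing.
    """
    first = add(a, b)
    second = add(a << k, b << k) >> k
    return first, first != second


good = add_with_shift_check(make_adder(), 0b1010, 0b0100)
bad = add_with_shift_check(make_adder(stuck_at_zero_bit=2), 0b1010, 0b0100)
print(good, bad)
```

Plain repetition would return the same wrong value twice for the faulty adder; the shifted recomputation produces a different value, so the comparator flags the mismatch.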
Watchdog timers
- Implemented in hardware (external timer) or software (process)
- If the timer expires → system reset or recovery
- Detect only a very specific type of error: control-flow errors
- If an error occurs but the timer is still reset → no detection
- Difficult to determine the runtime
- High detection latency
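A watchdog timer can be modelled with explicit time stamps, which keeps the sketch deterministic. The class and method names are assumptions, and time is passed in as a plain number rather than read from a real clock.

```python
class Watchdog:
    """Timer that must be reset (kicked) at least every `timeout` time units."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_kick = 0

    def kick(self, now):
        """Called by the monitored program to signal that it is still alive."""
        self.last_kick = now

    def expired(self, now):
        """True if the program failed to kick in time (reset or recovery needed)."""
        return now - self.last_kick > self.timeout


wd = Watchdog(timeout=10)
wd.kick(now=0)
wd.kick(now=8)             # program still runs through its main loop
print(wd.expired(now=15))  # within the timeout after the last kick
print(wd.expired(now=20))  # program hung after t = 8: timer expires
```

The sketch also shows both weaknesses from the list: an error that does not stop the kicks is never seen, and a hang at t = 8 is only reported after t = 18.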
Capability & Consistency Checking
- Capability checking limits access to objects (e.g. memory segments) to authorized users (processes)
  - Implemented in hardware (error traps) or software (firewall)
  - e.g. checking of address validity by the MMU
- Consistency checking determines whether states or results are reasonable
  - e.g. range checking, address checking, opcode checking
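Both kinds of check can be sketched in software. The capability table, the process and segment names, and the value range are invented for illustration; a hardware MMU would raise a trap instead of a Python exception.

```python
# Hypothetical capability table: process -> segment -> allowed operations.
CAPABILITIES = {
    "proc_a": {"seg0": {"read", "write"}, "seg1": {"read"}},
}


def check_access(process, segment, operation):
    """Capability check: raise if the process holds no right for this access."""
    allowed = CAPABILITIES.get(process, {}).get(segment, set())
    if operation not in allowed:
        raise PermissionError(f"{process} may not {operation} {segment}")


def in_range(value, low, high):
    """Consistency check: is the value within its physically reasonable range?"""
    return low <= value <= high


check_access("proc_a", "seg0", "write")       # permitted, no exception
try:
    check_access("proc_a", "seg1", "write")   # read-only segment
    trapped = False
except PermissionError:
    trapped = True

print(trapped, in_range(21.5, -40.0, 125.0), in_range(900.0, -40.0, 125.0))
```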
Control-Flow Checking (1)
- Hardware scheme
  - Divide the application program into blocks
  - Each block has a single entry and a single exit point
  - A reference signature represents an encoding of the correct execution
  - A watchdog processor validates the application program by comparing the run-time signature with the reference signature
  - 70% of transient faults lead to control-flow errors
- Limitations
  - Only suitable for processors running a single program (not multiple processes or threads)
  - Reduced coverage if transmission errors occur on the bus to the watchdog processor
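The signature comparison can be sketched as follows. The two-block program, the instruction strings and the use of CRC-32 (`zlib.crc32` from the Python standard library) as the signature function are assumptions for illustration; a real scheme computes the signature in the watchdog processor's hardware.

```python
import zlib


def signature(instructions):
    """Signature of one block: CRC-32 over its instruction stream."""
    sig = 0
    for ins in instructions:
        sig = zlib.crc32(ins.encode(), sig)
    return sig


# Hypothetical program, split into blocks with a single entry and exit point.
PROGRAM = {
    "B0": ["load r1", "add r1, r2", "jmp B1"],
    "B1": ["store r1", "ret"],
}

# Reference signatures, computed ahead of time (e.g. by the assembler).
REFERENCE = {name: signature(body) for name, body in PROGRAM.items()}


def watchdog_check(block, observed):
    """Compare the run-time signature of the observed stream with the reference."""
    return signature(observed) == REFERENCE[block]


ok = watchdog_check("B0", ["load r1", "add r1, r2", "jmp B1"])
corrupted = watchdog_check("B0", ["load r1", "add r1, r3", "jmp B1"])
print(ok, corrupted)
```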
Control-Flow Checking (2)
- Signatured Instruction Stream (SIS)
  - Hardware: watchdog processor with cyclic code signature generator
  - Software: modified assembler and loader
- Control-flow checking using shadow processing
Summary
- Hardware detection has low error latency
- Hardware is more expensive
- e.g. massively parallel multiprocessors
- Combining error detection mechanisms
References
- Ravishankar K. Iyer, Zbigniew Kalbarczyk - Hardware and Software Error Detection - Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign
- Hermann Kopetz - Real-Time Systems: Design Principles for Distributed Embedded Applications - 1997, 356 p., Hardcover, ISBN: 978-0-7923-9894-3
- Alireza Vahdatpour, Mahdi Fazeli, Seyed Ghassem Miremadi - Transient Error Detection in Embedded Systems Using Reconfigurable Components - IES, October 2006
- M. Dal Cin, W. Hohl, E. Michel, A. Pataricza - Error Detection Mechanisms for Massively Parallel Multiprocessors - IEEE Proceedings, 1993
- Evaluation of error detection coverage and fault-tolerance of digital plant protection system in nuclear power plants
- http://robotics.ee.uwa.edu.au/courses/faulttolerant/notes/FT2b.pdf
- A. Steininger, C. Scherrer - Identifying Efficient Combinations of Error Detection Mechanisms Based on Results of Fault Injection Experiments - IEEE Transactions on Computers, Vol. 51, No. 2, February 2002