A Case Study In Reliability Analysis
Lewis Sykalski
Background (cont.)
• Net Centric Warfare Data Collector
– Approximately 180 KLOC
– Written in Java; makes heavy use of JDBC and RMI from the J2EE package
– CMMI Level 1
– Uses Oracle 9.2 EE OTS DBMS
• Reliability Required: Moderate
[Diagram: GLOBAL VISION NETWORK (GVN). Sites: Integrated Warfare Development Center (Fort Worth, TX); Light House (Suffolk, VA); LM – Mission Sys (Colorado Springs, CO); LM – Sim & Training (Orlando, FL). Node labels: DC, FUSION, CAOC, WCS, JSAF, JIMM, JTAC, JABE, Other Simulators, Threat Sims, VBMS]
Background
Design Diversity (Part I)
• Part I: Oracle DBMS Design Diversity
– Acquire 20 bug reports each from Oracle 9.2 & Oracle 10.0
– Bugs had to be Date Independent, Easy To Reproduce, & Type Independent
– Results would then be classified by self-evidence & divergence
Design Diversity: Results 9.2 Bugs
(S.E. = self-evident failure; "-" = not marked)
Bug #    Type               9.2 S.E.  10.0 Fails?  10.0 S.E.  Divergent
2357784  Internal Error     X         NO           N/A        X
2299898  Performance/Hang   X         NO           N/A        X
2202561  Incorrect Results  -         NO           N/A        -
2221401  Incorrect Results  -         NO           N/A        -
2739068  Incorrect Results  -         NO           N/A        -
2683540  Incorrect Results  -         NO           N/A        -
2991842  Incorrect Results  -         NO           N/A        -
2200057  Internal Error     X         NO           N/A        -
2405258  Internal Error     X         NO           N/A        -
2716265  Internal Error     X         NO           N/A        -
2054241  Performance/Hang   X         NO           N/A        -
2485871  Internal Error     X         NO           N/A        -
2670497  Internal Error     X         NO           N/A        -
2659126  Internal Error     X         NO           N/A        X
2064478  Internal Error     X         NO           N/A        -
2624737  Internal Error     X         NO           N/A        X
1918751  Internal Error     X         NO           N/A        -
2286290  Incorrect Results  -         NO           N/A        X
2700474  Incorrect Results  -         NO           N/A        -
2576353  Internal Error     X         NO           N/A        -
Design Diversity: Results 10.0 Bugs
(S.E. = self-evident failure; "-" = not marked)
Bug #    Type               10.0 S.E.  9.2 Fails?  9.2 S.E.  Divergent
5731063  Internal Error     X          NO          N/A       -
3664284  Incorrect Results  -          NO          N/A       -
4582808  Incorrect Results  -          NO          N/A       -
3895678  Internal Error     X          YES         X         -
3893571  Internal Error     X          YES         X         -
3903063  Incorrect Results  -          YES         -         -
3912423  Internal Error     X          NO          N/A       -
4029857  Engine Crash       X          YES         X         -
4156695  Incorrect Results  -          YES         -         -
2929556  Internal Error     X          YES         X         X
3255350  Performance/Hang   X          NO          N/A       -
3887704  Internal Error     X          NO          N/A       -
3405237  Engine Crash       X          YES         X         -
3952322  Feature Unusable   X          YES         X         -
4033889  Incorrect Results  -          NO          N/A       -
4060997  Internal Error     X          YES         X         -
4134776  Internal Error     X          NO          N/A       -
4149779  Incorrect Results  -          NO          N/A       -
2964132  Internal Error     X          YES         X         -
3361118  Internal Error     X          YES         X         -
Design Diversity: More Analysis
(S.E. = self-evident; N.S.E. = not self-evident)
                            9.2 bug scripts            10.0 bug scripts
                            Oracle 9.2   Oracle 10.0   Oracle 10.0   Oracle 9.2
Total Bug Scripts           20           -             20            -
Failure Observed            20           -             20            11
Performance/Hang   S.E.     2            0             1             0
Internal Error     S.E.     11           0             10            6
Engine Crash       S.E.     0            0             2             2
Incorrect Result   S.E.     0            0             0             0
                   N.S.E.   7            0             6             2
Feature Unusable   S.E.     0            0             1             1
                   N.S.E.   0            0             0             0

Total Bug Scripts                                       40
Failures                                                40
1 out of 2 Bug Scripts Failing              S.E.        18
                                            N.S.E.      11
Both DBMS Products Failing, Non-Divergent   S.E.        8
                                            N.S.E.      2
Both DBMS Products Failing, Divergent       S.E.        1
                                            N.S.E.      0
Bottom Line:
• Not a Statistical Sample (Not Enough Time)
• 2/40 = 5% of Failures not detected across both products
• Out of the 20 failures for Oracle 10.0, 6 were N.S.E, and 4 out of 6 of these failures would be resolved by utilizing a past release in tandem with the future release
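The "4 out of 6" figure can be re-derived from the Oracle 10.0 results. The sketch below is illustrative only: the `ScriptResult` record and the `exposed_by_comparison` helper are hypothetical names, and it assumes that a non-self-evident failure is exposed whenever the other release does not fail on the same script, so that comparing the two outputs reveals the discrepancy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScriptResult:
    bug_id: int
    fails_on_9_2: bool  # the "9.2 Fails?" column of the 10.0 results


# The six Oracle 10.0 bug scripts of type "Incorrect Results",
# i.e. the failures that were not self-evident (N.S.E).
NSE_10_0 = [
    ScriptResult(3664284, False),
    ScriptResult(4582808, False),
    ScriptResult(3903063, True),
    ScriptResult(4156695, True),
    ScriptResult(4033889, False),
    ScriptResult(4149779, False),
]


def exposed_by_comparison(results):
    """Keep the scripts on which Oracle 9.2 does not fail (assumed detectable)."""
    return [r for r in results if not r.fails_on_9_2]


caught = exposed_by_comparison(NSE_10_0)
print(f"{len(caught)} of {len(NSE_10_0)} N.S.E failures exposed by running 9.2 alongside 10.0")
```

The two scripts left over (3903063 and 4156695) fail on both releases without divergence, which is the case design diversity cannot catch.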
Design Diversity: Even More Analysis
• Part II: CASRE Reliability Analysis of NCW Data Collector
1. Extract the following from Failure Logs using JavaScript: Time of Program Start, Time of Program Termination, Time of Thread Terminations, and Exception or Failure Messages
2. Parse failures manually into CASRE input format
3. Categorize by severity utilizing the chart on the next slide
4. Compare 2 consecutive events (CALOE08 & MAGTF08) as well as 2 consecutive lifecycle phases within the same event (Integration & Execution)
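Step 1 could look roughly like the sketch below. It is written in Python purely for illustration (the slides mention JavaScript), and the log layout, the message markers, and the sample lines are all hypothetical, since the actual failure log format is not shown here.

```python
import re
from datetime import datetime

# HYPOTHETICAL log line layout: "<YYYY-MM-DD HH:MM:SS> <LEVEL> <message>"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)$")

# HYPOTHETICAL message markers for the event kinds listed in step 1.
MARKERS = {
    "Program start": "start",
    "Program terminated": "termination",
    "Thread terminated": "thread_termination",
}


def extract_events(lines):
    """Return (timestamp, kind, message) tuples for the lines of interest."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed layout
        stamp, level, message = match.groups()
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        kind = next((k for marker, k in MARKERS.items() if marker in message), None)
        if kind is None and ("Exception" in message or level == "ERROR"):
            kind = "failure"
        if kind is not None:
            events.append((when, kind, message))
    return events


# Made-up sample lines, only to exercise the function.
sample = [
    "2008-01-01 08:00:00 INFO Program start",
    "2008-01-01 08:30:00 ERROR java.sql.SQLException: connection reset",
    "2008-01-01 08:30:01 WARN Thread terminated: writer-1",
    "2008-01-01 17:00:00 INFO Program terminated",
    "not a log line",
]
events = extract_events(sample)
for when, kind, message in events:
    print(when.isoformat(), kind, message)
```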
Reliability Analysis (Part II)
Severity Code  Failure Description
9              Failure Causes Machine to be Rebooted Causing Catastrophic Loss
8              Failure Causes Program Abort
7              Failure Causes Program Thread Abort
5              Failure Causes Record Not to be Written, Thread Continues
3              Failure Causes Incorrect Data to be Written, Thread Continues
1              Failure is Caught, Handled and Recovers Correctly
Severity
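For scripted categorization, the severity chart maps naturally onto an integer enumeration. This is only a sketch: the member names and the `worst` helper are hypothetical, and only the numeric codes come from the chart.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity codes from the chart; member names are illustrative."""
    MACHINE_REBOOT = 9
    PROGRAM_ABORT = 8
    THREAD_ABORT = 7
    RECORD_NOT_WRITTEN = 5
    INCORRECT_DATA_WRITTEN = 3
    HANDLED = 1


def worst(consequences):
    """Return the most severe code observed for one failure.

    An empty list is treated as a failure that was caught and handled.
    """
    return max(consequences, default=Severity.HANDLED)


print(worst([Severity.RECORD_NOT_WRITTEN, Severity.THREAD_ABORT]).name)
```

Using `IntEnum` keeps the codes directly comparable and writable as plain integers, which is what the CASRE severity column expects.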
Using CASRE
Using CASRE (cont.)
FAILURE COUNT FORMAT
Interval Number (int)  Number of Errors (float)  Interval Length (float)  Error Severity (int)
Example:
Hours
1 5.0 40.0 1
1 3.0 40.0 2
1 2.0 40.0 3
2 4.0 40.0 1
2 3.0 40.0 3
3 7.0 40.0 1
4 5.0 40.0 1
5 4.0 40.0 1
TIME BETWEEN FAILURES FORMAT: N/A
CASRE Input Format
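Generating the failure count layout from parsed data takes only a few lines. The sketch below follows the layout of the example on this slide (a time-unit line followed by one row per interval and severity); the function name is hypothetical, and the exact whitespace CASRE accepts is an assumption to check against its user's guide.

```python
def format_failure_counts(rows, time_unit="Hours"):
    """Render rows as failure count text.

    Each row is (interval_number, number_of_errors, interval_length, severity).
    """
    lines = [time_unit]
    for interval, errors, length, severity in rows:
        lines.append(f"{int(interval)} {float(errors):.1f} {float(length):.1f} {int(severity)}")
    return "\n".join(lines) + "\n"


# First three rows of the example on the slide.
rows = [(1, 5, 40, 1), (1, 3, 40, 2), (1, 2, 40, 3)]
print(format_failure_counts(rows), end="")
```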
CASRE Failure Counts [plots: CALOE+MAGTF Execution; MAGTF Integration + Execution]
CASRE Time Between Failures [plots: CALOE+MAGTF Execution; MAGTF Integration + Execution]
CASRE Failure Intensity [plots: CALOE+MAGTF Execution; MAGTF Integration + Execution]
CASRE Cumulative Failures [plots: CALOE+MAGTF Execution; MAGTF Integration + Execution]
CASRE Test Interval Length [plots: CALOE+MAGTF Execution; MAGTF Integration + Execution]
• Running Average:
– Not as Useful for Failure Count Data (unless test intervals are of equal length)
– Computes the running average of the time between successive failures for time between failures data, or the running average of the number of failures per interval for failure count data
– For failure count data, a running average that decreases with time (fewer failures per test interval) indicates reliability growth
• Laplace Test:
– Not as Useful for Failure Count Data (unless test intervals are of equal length)
– Null hypothesis: failures occur as a homogeneous Poisson process
– If the test statistic decreases with increasing failure number, the null hypothesis can be rejected in favor of reliability growth at an appropriate significance level; if it increases, the null hypothesis can be rejected in favor of reliability decay
Detecting Reliability Trends
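Both trend indicators are simple to compute outside CASRE when the test intervals have equal length. The sketch below uses the commonly cited Laplace factor for interval (failure count) data; it is not taken from CASRE itself and may differ from what the tool computes. The function names are hypothetical, and the sample counts are the per-interval totals of the input format example (10, 7, 7, 5, 4).

```python
import math


def running_average(counts):
    """Running average of failures per interval after each interval."""
    averages, total = [], 0.0
    for i, count in enumerate(counts, start=1):
        total += count
        averages.append(total / i)
    return averages


def laplace_factors(counts):
    """Laplace factor u(k) for k = 2..len(counts), equal-length intervals.

    u(k) = (sum((i-1)*n_i) - (k-1)/2 * N) / sqrt((k^2 - 1)/12 * N),
    where n_i is the count in interval i (1-based) and N the total so far.
    Negative values suggest reliability growth, positive values decay.
    """
    factors = []
    for k in range(2, len(counts) + 1):
        window = counts[:k]
        total = sum(window)
        if total == 0:
            factors.append(0.0)  # no failures yet, so no evidence of a trend
            continue
        weighted = sum(i * n for i, n in enumerate(window))  # i is (interval - 1)
        factors.append((weighted - (k - 1) / 2 * total) / math.sqrt((k * k - 1) / 12 * total))
    return factors


counts = [10, 7, 7, 5, 4]
print(running_average(counts))
print(laplace_factors(counts))
```

For these counts the running average falls and the Laplace factor ends negative, both pointing toward reliability growth; whether that is statistically significant depends on the chosen significance level.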
Running Average [plots: CALOE+MAGTF Execution; MAGTF Integration + Execution]
Laplace Test [plots: CALOE+MAGTF Execution; MAGTF Integration + Execution]
CASRE Cum Failure Predictions [plots: CALOE+MAGTF Execution; MAGTF Integration + Execution]
CASRE Prediction Setup [plots: CALOE+MAGTF Execution; MAGTF Integration + Execution]
CASRE Reliability Prediction [plots: CALOE+MAGTF Execution; MAGTF Integration + Execution]
CASRE Prequential Likelihood [plots: CALOE+MAGTF Execution; MAGTF Integration + Execution]
CASRE Model-Ranking [plots: CALOE+MAGTF Execution; MAGTF Integration + Execution]
• Haven’t been able to get the CASRE reliability models to run yet.
• The instruction manual says many of the built-in models only work with Time Between Failures Data.
• Doubt the models would have much utility with Failure Count Data
Reliability Models
• It would actually be QUITE easy to integrate auto-generation of Failure Count or Time Between Failures output into my environment
• This would facilitate quick trend analysis
• Reliability trends, not the actual numbers, are what is important
Conclusion/Follow-Up