A Case Study In Reliability Analysis

Lewis Sykalski

Background (cont.)

• Net Centric Warfare Data Collector
– Approximately 180 KLOC
– Written in Java; makes heavy use of JDBC and RMI from the J2EE package
– CMMI Level 1
– Uses Oracle 9.2 EE as an off-the-shelf (OTS) DBMS
• Reliability Required: Moderate

Background

[Diagram: GLOBAL VISION NETWORK (GVN). Sites: Integrated Warfare Development Center (Fort Worth, TX); Light House (Suffolk, VA); LM – Mission Sys (Colorado Springs, CO); LM – Sim & Training (Orlando, FL). Node labels: DC, FUSION, CAOC, WCS, JSAF, JIMM, JTAC, JABE, DC, Other Simulators, Threat Sims, VBMS, VBMS.]

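Design Diversity (Part I)

• Part I: Oracle DBMS Design Diversity
– Acquire 20 bug reports each from Oracle 9.2 & Oracle 10.0
– Bugs had to be Date Independent, Easy To Reproduce, & Type Independent
– Results would then be classified by self-evidence (S.E.: the failure announces itself, such as an internal error, hang, or crash; N.S.E.: it does not, such as silently incorrect results) & divergence (both products fail, but differently)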

Design Diversity: Results 9.2 Bugs

| Bug # | Type | 9.2 S.E. | 10.0 Fails? | 10.0 S.E. | Divergent |
|---|---|---|---|---|---|
| 2357784 | Internal Error | X | NO | N/A | X |
| 2299898 | Performance/Hang | X | NO | N/A | X |
| 2202561 | Incorrect Results | | NO | N/A | |
| 2221401 | Incorrect Results | | NO | N/A | |
| 2739068 | Incorrect Results | | NO | N/A | |
| 2683540 | Incorrect Results | | NO | N/A | |
| 2991842 | Incorrect Results | | NO | N/A | |
| 2200057 | Internal Error | X | NO | N/A | |
| 2405258 | Internal Error | X | NO | N/A | |
| 2716265 | Internal Error | X | NO | N/A | |
| 2054241 | Performance/Hang | X | NO | N/A | |
| 2485871 | Internal Error | X | NO | N/A | |
| 2670497 | Internal Error | X | NO | N/A | |
| 2659126 | Internal Error | X | NO | N/A | X |
| 2064478 | Internal Error | X | NO | N/A | |
| 2624737 | Internal Error | X | NO | N/A | X |
| 1918751 | Internal Error | X | NO | N/A | |
| 2286290 | Incorrect Results | | NO | N/A | X |
| 2700474 | Incorrect Results | | NO | N/A | |
| 2576353 | Internal Error | X | NO | N/A | |

Design Diversity: Results 10.0 Bugs

| Bug # | Type | 10.0 S.E. | 9.2 Fails? | 9.2 S.E. | Divergent |
|---|---|---|---|---|---|
| 5731063 | Internal Error | X | NO | N/A | |
| 3664284 | Incorrect Results | | NO | N/A | |
| 4582808 | Incorrect Results | | NO | N/A | |
| 3895678 | Internal Error | X | YES | X | |
| 3893571 | Internal Error | X | YES | X | |
| 3903063 | Incorrect Results | | YES | | |
| 3912423 | Internal Error | X | NO | N/A | |
| 4029857 | Engine Crash | X | YES | X | |
| 4156695 | Incorrect Results | | YES | | |
| 2929556 | Internal Error | X | YES | X | X |
| 3255350 | Performance/Hang | X | NO | N/A | |
| 3887704 | Internal Error | X | NO | N/A | |
| 3405237 | Engine Crash | X | YES | X | |
| 3952322 | Feature Unusable | X | YES | X | |
| 4033889 | Incorrect Results | | NO | N/A | |
| 4060997 | Internal Error | X | YES | X | |
| 4134776 | Internal Error | X | NO | N/A | |
| 4149779 | Incorrect Results | | NO | N/A | |
| 2964132 | Internal Error | X | YES | X | |
| 3361118 | Internal Error | X | YES | X | |
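The classification behind these rows can be tallied mechanically. The sketch below is an illustration, not the author's tooling: the tuple layout and the `tally` helper are hypothetical. It also assumes that a single X after YES belongs to the 9.2 S.E. column rather than the Divergent column, a reading chosen because it reproduces the summary counts given later in the deck (11 scripts also failing on 9.2, of which 1 is divergent).

```python
from collections import Counter

# One tuple per Oracle 10.0 bug script, transcribed from the table above:
# (bug number, self-evident on 10.0, fails on 9.2, self-evident on 9.2, divergent)
# A lone X after YES is read as "9.2 S.E." (an assumption, see the lead-in).
ROWS_10_0 = [
    (5731063, True, False, None, False),
    (3664284, False, False, None, False),
    (4582808, False, False, None, False),
    (3895678, True, True, True, False),
    (3893571, True, True, True, False),
    (3903063, False, True, False, False),
    (3912423, True, False, None, False),
    (4029857, True, True, True, False),
    (4156695, False, True, False, False),
    (2929556, True, True, True, True),
    (3255350, True, False, None, False),
    (3887704, True, False, None, False),
    (3405237, True, True, True, False),
    (3952322, True, True, True, False),
    (4033889, False, False, None, False),
    (4060997, True, True, True, False),
    (4134776, True, False, None, False),
    (4149779, False, False, None, False),
    (2964132, True, True, True, False),
    (3361118, True, True, True, False),
]


def tally(rows):
    """Count bug scripts per detection category."""
    counts = Counter()
    for _bug, se_own, fails_other, se_other, divergent in rows:
        if not fails_other:
            # Only the product the bug was reported against fails.
            se = "S.E." if se_own else "N.S.E."
            counts[f"one product fails, {se}"] += 1
        else:
            # Both products fail; treated as self-evident if either one
            # signals the failure (in these rows the two flags always agree).
            se = "S.E." if (se_own or se_other) else "N.S.E."
            kind = "divergent" if divergent else "non-divergent"
            counts[f"both fail, {kind}, {se}"] += 1
    return counts


for category, n in sorted(tally(ROWS_10_0).items()):
    print(f"{category}: {n}")
```

Non-divergent N.S.E. cases are the ones a two-version setup cannot catch by comparing outputs, since both products return the same wrong answer without signalling an error.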

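Design Diversity: More Analysis

| | 9.2 bug scripts on Oracle 9.2 | 9.2 bug scripts on Oracle 10.0 | 10.0 bug scripts on Oracle 10.0 | 10.0 bug scripts on Oracle 9.2 |
|---|---|---|---|---|
| Total Bug Scripts | 20 | - | 20 | - |
| Failure Observed | 20 | - | 20 | 11 |
| Performance/Hang, S.E. | 2 | 0 | 1 | 0 |
| Internal Error, S.E. | 11 | 0 | 10 | 6 |
| Engine Crash, S.E. | 0 | 0 | 2 | 2 |
| Incorrect Result, S.E. | 0 | 0 | 0 | 0 |
| Incorrect Result, N.S.E. | 7 | 0 | 6 | 2 |
| Feature Unusable, S.E. | 0 | 0 | 1 | 1 |
| Feature Unusable, N.S.E. | 0 | 0 | 0 | 0 |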

Design Diversity: Even More Analysis

| Category | Count |
|---|---|
| Total bug scripts | 40 |
| Failures | 40 |
| 1 out of 2 products failing, S.E. | 18 |
| 1 out of 2 products failing, N.S.E. | 11 |
| Both DBMS products failing, non-divergent, S.E. | 8 |
| Both DBMS products failing, non-divergent, N.S.E. | 2 |
| Both DBMS products failing, divergent, S.E. | 1 |
| Both DBMS products failing, divergent, N.S.E. | 0 |

Bottom Line:
• Not a statistical sample (not enough time)
• 2/40 = 5% of failures not detected across both products
• Of the 20 failures for Oracle 10.0, 6 were N.S.E., and 4 of these 6 would be resolved by using a past release in tandem with the newer release
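A quick check of the bottom-line arithmetic, using only counts that appear in the summary tables. The variable names are illustrative.

```python
# Counts taken from the summary tables above.
total_scripts = 40
both_fail_nondivergent_nse = 2    # both products fail the same way, no error signalled
nse_10_0 = 6                      # Oracle 10.0 failures that were not self-evident
nse_10_0_also_failing_on_9_2 = 2  # of those, the ones that also fail on Oracle 9.2

# Fraction of failures that the pair of products would not reveal.
undetected_fraction = both_fail_nondivergent_nse / total_scripts

# 10.0 N.S.E. failures that running 9.2 alongside would expose.
resolved_by_pairing = nse_10_0 - nse_10_0_also_failing_on_9_2

print(f"undetected across both products: {undetected_fraction:.0%}")
print(f"10.0 N.S.E. failures exposed by 9.2: {resolved_by_pairing} of {nse_10_0}")
```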

Reliability Analysis (Part II)

• Part II: CASRE Reliability Analysis of NCW Data Collector
1. Extract the following from the failure logs using JavaScript: time of program start, time of program termination, times of thread terminations, and exception or failure messages
2. Parse failures manually into CASRE input format
3. Categorize by severity using the chart on the next slide
4. Compare 2 consecutive events (CALOE08 & MAGTF08) as well as 2 consecutive lifecycle phases within the same event (Integration & Execution)
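Step 1 above can be sketched as follows. The deck says the extraction was done in JavaScript and does not show the log layout, so everything specific here is an assumption: the timestamp format, the marker phrases in `MARKERS`, and the `extract_events` helper are all hypothetical, and the sketch is in Python purely for illustration.

```python
import re
from datetime import datetime

# Hypothetical log layout: "YYYY-MM-DD HH:MM:SS LEVEL message".
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>\w+) (?P<msg>.*)$"
)

# Hypothetical marker phrases; the collector's real messages are not in the source.
MARKERS = {
    "program_start": "Program started",
    "program_end": "Program terminated",
    "thread_end": "Thread terminated",
}


def extract_events(lines):
    """Return (timestamp, kind, message) for every line of interest."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # e.g. stack-trace continuation lines
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
        msg = m["msg"]
        kind = next((k for k, marker in MARKERS.items() if marker in msg), None)
        if kind is None and ("Exception" in msg or m["level"] == "ERROR"):
            kind = "failure"
        if kind is not None:
            events.append((ts, kind, msg))
    return events


# Made-up sample lines, only to exercise the parser.
sample = [
    "2008-01-01 09:00:00 INFO Program started",
    "2008-01-01 09:15:02 ERROR java.sql.SQLException: connection reset",
    "    at com.example.Collector.write(Collector.java:42)",
    "2008-01-01 09:15:03 WARN Thread terminated: writer-3",
    "2008-01-01 17:00:00 INFO Program terminated",
]
for ts, kind, msg in extract_events(sample):
    print(ts.isoformat(), kind, msg)
```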

Severity

| Severity Code | Failure Description |
|---|---|
| 9 | Failure causes machine to be rebooted, causing catastrophic loss |
| 8 | Failure causes program abort |
| 7 | Failure causes program thread abort |
| 5 | Failure causes record not to be written; thread continues |
| 3 | Failure causes incorrect data to be written; thread continues |
| 1 | Failure is caught, handled, and recovers correctly |

Using CASRE

Using CASRE (cont.)

CASRE Input Format

FAILURE COUNT FORMAT

Columns: Interval Number (int), Number of Errors (float), Interval Length (float), Error Severity (int)

Example:
Hours
1 5.0 40.0 1
1 3.0 40.0 2
1 2.0 40.0 3
2 4.0 40.0 1
2 3.0 40.0 3
3 7.0 40.0 1
4 5.0 40.0 1
5 4.0 40.0 1

TIME BETWEEN FAILURES FORMAT: N/A
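Producing a file in the failure count layout shown on this slide is a small formatting job. The sketch below is one way to do it; the function name `to_casre_failure_counts` is hypothetical, and the single-space separator simply mirrors the slide's example rather than a confirmed CASRE requirement.

```python
def to_casre_failure_counts(records, time_unit="Hours"):
    """Format (interval, n_errors, interval_length, severity) records.

    The first line names the time unit; each following line holds one record
    as: int, float, float, int, matching the column types on the slide.
    """
    lines = [time_unit]
    for interval, n_errors, length, severity in records:
        lines.append(
            f"{int(interval)} {float(n_errors):.1f} {float(length):.1f} {int(severity)}"
        )
    return "\n".join(lines) + "\n"


# The example rows from the slide.
example = [
    (1, 5, 40, 1), (1, 3, 40, 2), (1, 2, 40, 3),
    (2, 4, 40, 1), (2, 3, 40, 3),
    (3, 7, 40, 1), (4, 5, 40, 1), (5, 4, 40, 1),
]
print(to_casre_failure_counts(example), end="")
```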

[Figures: CASRE plots, each with two panels (CALOE+MAGTF Execution; MAGTF Integration + Execution): CASRE Failure Counts, CASRE Time Between Failures, CASRE Failure Intensity, CASRE Cumulative Failures, CASRE Test Interval Length]

Detecting Reliability Trends

• Running Average:
– Not as useful for failure count data (unless test intervals are of equal length)
– Computes the running average of the time between successive failures for time-between-failures data, or the running average of the number of failures per interval for failure count data
– For failure count data, a running average that decreases with time (fewer failures per test interval) indicates reliability growth; for time-between-failures data, an increasing running average does
• Laplace Test:
– Not as useful for failure count data (unless test intervals are of equal length)
– Null hypothesis: failures occur as a homogeneous Poisson process
– If the test statistic decreases with increasing failure number, the null hypothesis can be rejected in favor of reliability growth at an appropriate significance level; if it increases with failure number, the null hypothesis can be rejected in favor of decreasing reliability
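Both trend checks are easy to compute by hand for failure count data with equal-length intervals. The sketch below is not CASRE's implementation. It uses the common grouped-data form of the Laplace factor, u(k) = [Σ (i−1)·n_i − ((k−1)/2)·Σ n_i] / sqrt(((k²−1)/12)·Σ n_i), with sums over intervals i = 1..k and n_i the failures in interval i; the function names are illustrative. The input is the slide's failure count example summed per 40-hour interval.

```python
import math


def running_average(counts):
    """Running mean of failures per interval (equal-length intervals assumed)."""
    out, total = [], 0.0
    for k, n in enumerate(counts, start=1):
        total += n
        out.append(total / k)
    return out


def laplace_factor(counts):
    """Grouped-data Laplace factor u(k) after each interval k.

    u(1) is undefined (NaN). Negative values point toward a decreasing
    failure intensity (reliability growth), positive values the opposite.
    """
    out = []
    weighted = 0.0  # running sum of (i - 1) * n_i
    total = 0.0     # running sum of n_i
    for k, n in enumerate(counts, start=1):
        weighted += (k - 1) * n
        total += n
        if k < 2 or total == 0:
            out.append(float("nan"))
            continue
        numerator = weighted - (k - 1) / 2 * total
        denominator = math.sqrt((k * k - 1) / 12 * total)
        out.append(numerator / denominator)
    return out


# Failures per interval from the slide's example (all severities summed):
# interval 1: 5+3+2, interval 2: 4+3, intervals 3-5: 7, 5, 4.
counts = [10, 7, 7, 5, 4]
print([round(v, 2) for v in running_average(counts)])
print([round(v, 3) for v in laplace_factor(counts)[1:]])
```

For this example both the running average and the Laplace factor fall from interval to interval, which is the pattern the bullets above associate with reliability growth.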

[Figures: CASRE plots, each with two panels (CALOE+MAGTF Execution; MAGTF Integration + Execution): Running Average, Laplace Test, CASRE Cumulative Failure Predictions, CASRE Prediction Setup, CASRE Reliability Prediction, CASRE Prequential Likelihood, CASRE Model-Ranking]

Reliability Models

• Haven't been able to get these to run yet
• The instruction manual says many of the built-in models only work with time-between-failures data
• Doubt there would be much utility with failure count data

Conclusion/Follow-Up

• It would be quite easy to integrate auto-generation of failure count or time-between-failures output into my environment
• This would facilitate quick trend analysis
• The reliability trends, not the actual numbers, are what matter
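As a rough idea of what auto-generating failure count output could involve, the sketch below bins timestamped failures into fixed-length intervals. It is an assumption-laden illustration, not the author's design: the `bin_failures` helper, the (elapsed hours, severity) input shape, and the 40-hour default are hypothetical, and intervals with no failures are simply omitted.

```python
from collections import Counter


def bin_failures(failures, interval_length=40.0):
    """Group (elapsed_hours, severity) failures into failure-count records.

    Returns sorted (interval_number, n_errors, interval_length, severity)
    tuples, with intervals numbered from 1.
    """
    counts = Counter()
    for elapsed, severity in failures:
        interval = int(elapsed // interval_length) + 1
        counts[(interval, severity)] += 1
    return [
        (interval, float(n), interval_length, severity)
        for (interval, severity), n in sorted(counts.items())
    ]


# Made-up failure times (hours since start) and severity codes.
failures = [(1.0, 1), (12.5, 1), (39.9, 3), (40.0, 1), (95.0, 7)]
for record in bin_failures(failures):
    print(record)
```

Records in this shape line up with the four columns of the failure count format, so trend plots could be refreshed after each run.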
