A Case Study In Reliability Analysis

Lewis Sykalski

Background (cont.)

• Net Centric Warfare Data Collector
– Approximately 180 KLOC
– Written in Java; makes heavy use of JDBC and RMI from the J2EE package
– CMMI Level 1
– Utilizes Oracle 9.2 EE OTS DBMS

• Reliability Required: Moderate

[Figure: GLOBAL VISION NETWORK (GVN) connecting Integrated Warfare Development Center (Fort Worth, TX); Light House (Suffolk, VA); LM – Mission Sys (Colorado Springs, CO); LM – Sim & Training (Orlando, FL); FUSION; CAOC; Other Simulators; Threat Sims]

Background

Design Diversity (Part I)

• Part I: Oracle DBMS Design Diversity
– Acquire 20 bug reports each from Oracle 9.2 & Oracle 10.0
– Bugs had to be Date Independent, Easy To Reproduce, & Type Independent
– Results would then be classified by self-evidence & divergence

Design Diversity: Results 9.2 Bugs

Bug # | Type | 9.2 S.E. | 10.0 Fails? | 10.0 S.E. | Divergent
2357784 | Internal Error | X | NO | N/A | X
2299898 | Performance/Hang | X | NO | N/A | X
2202561 | Incorrect Results | | NO | N/A |
2200057 | Internal Error | X | NO | N/A |
2054241 | Performance/Hang | X | NO | N/A |
2286290 | Incorrect Results | | NO | N/A | X

Design Diversity: Results 10.0 Bugs

Bug # | Type | 10.0 S.E. | 9.2 Fails? | 9.2 S.E. | Divergent
3895678 | Internal Error | X | YES | X |
3903063 | Incorrect Results | | YES | |
4029857 | Engine Crash | X | YES | X |
4156695 | Incorrect Results | | YES | |
2929556 | Internal Error | X | YES | X | X
3255350 | Performance/Hang | X | NO | N/A |
3405237 | Engine Crash | X | YES | X |
3952322 | Feature Unusable | X | YES | X |

Design Diversity: More Analysis

 | Oracle 9.2 | Oracle 10.0 | Oracle 10.0 | Oracle 9.2
Total Bug Scripts | 20 | - | 20 | -
Failure Observed | 20 | - | 20 | 11
Performance/Hang S.E | 2 | 0 | 1 | 0
Internal Error S.E | 11 | 0 | 10 | 6
Engine Crash S.E | 0 | 0 | 2 | 2
Incorrect Result S.E | 0 | 0 | 0 | 0
Incorrect Result N.S.E | 7 | 0 | 6 | 2
Feature Unusable S.E | 0 | 0 | 1 | 1
Feature Unusable N.S.E | 0 | 0 | 0 | 0

Total Bug Scripts: 40
Failures: 40
1 out of 2 Bug Scripts Failing: S.E 18, N.S.E 11
Both DBMS Products Failing, Non-Divergent: S.E 8, N.S.E 2
Both DBMS Products Failing, Divergent: S.E 1, N.S.E 0

Bottom Line:
• Not a Statistical Sample (Not Enough Time)
• 2/40 = 5% of Failures not detected across both products
• Out of the 20 failures for Oracle 10.0, 6 were N.S.E, and 4 of these 6 failures would be resolved by utilizing a past release in tandem with the future release
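The bottom-line figures can be recomputed from the aggregate counts. The sketch below is a minimal illustration in Python. It assumes that the "2" refers to the non-divergent, non-self-evident failures seen on both products, and that the "4 out of 6" comes from subtracting the Oracle 10.0 N.S.E failures that also fail on 9.2. The variable names are illustrative only.

```python
# Aggregate counts copied from the analysis tables (40 bug scripts in total).
total_scripts = 40

# Assumption: failures where both products fail identically and neither
# failure is self-evident are the ones that go undetected.
both_fail_non_divergent_nse = 2

# Oracle 10.0 non-self-evident failures, and how many of them also fail on 9.2.
nse_10 = 6
nse_10_also_fail_on_92 = 2

undetected_fraction = both_fail_non_divergent_nse / total_scripts
exposed_by_second_release = nse_10 - nse_10_also_fail_on_92

print(f"Undetected across both products: {undetected_fraction:.0%}")
print(f"10.0 N.S.E failures exposed by also running 9.2: "
      f"{exposed_by_second_release} of {nse_10}")
```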

Design Diversity: Even More Analysis

• Part II: CASRE Reliability Analysis of NCW Data Collector

1. Extract the following from Failure Logs using JavaScript: Time of Program Start, Time of Program Termination, Time of Thread Terminations, and Exception or Failure Messages

2. Parse failures manually into CASRE input format

3. Categorize by severity utilizing chart on next slide

4. Compare 2 consecutive events (CALOE08 & MAGTF08) as well as 2 consecutive lifecycles within the same event (Integration & Execution)
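Step 1 could be sketched as follows. The slides say the extraction was done with JavaScript and do not show the log layout, so this Python version is only an illustration. The line format, the event keywords (`PROGRAM_START`, `PROGRAM_END`, `THREAD_END`, `EXCEPTION`) and the sample lines are all hypothetical.

```python
import re
from datetime import datetime

# Hypothetical log line layout: "YYYY-MM-DD HH:MM:SS <KIND> <free text>".
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<kind>PROGRAM_START|PROGRAM_END|THREAD_END|EXCEPTION)\s*(?P<msg>.*)$"
)

def extract_events(lines):
    """Return (timestamp, kind, message) tuples for lines matching the layout."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
            events.append((ts, match["kind"], match["msg"]))
    return events

# Made-up sample lines, not taken from the real failure logs.
sample_log = [
    "2008-01-01 08:00:00 PROGRAM_START",
    "2008-01-01 09:30:00 EXCEPTION java.sql.SQLException: example message",
    "debug noise that should be skipped",
    "2008-01-01 09:30:05 THREAD_END writer-1",
    "2008-01-01 17:00:00 PROGRAM_END",
]

events = extract_events(sample_log)
for ts, kind, msg in events:
    print(ts.isoformat(), kind, msg)
```

The extracted tuples would still need the manual pass of step 2 before they can be loaded into CASRE.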

Reliability Analysis (Part II)

Severity Code | Failure Description
9 | Failure Causes Machine to be Rebooted Causing Catastrophic Loss
8 | Failure Causes Program Abort
7 | Failure Causes Program Thread Abort
5 | Failure Causes Record Not to be Written, Thread Continues
3 | Failure Causes Incorrect Data to be Written, Thread Continues
1 | Failure is Caught, Handled and Recovers Correctly
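The severity chart maps naturally onto an enumeration. The Python sketch below encodes the six codes from the chart. The `classify` helper and its boolean parameters are hypothetical, added only to show how observed consequences might be turned into a code, worst case first.

```python
from enum import IntEnum

class Severity(IntEnum):
    """Severity codes as listed in the chart."""
    REBOOT_CATASTROPHIC_LOSS = 9
    PROGRAM_ABORT = 8
    THREAD_ABORT = 7
    RECORD_NOT_WRITTEN = 5
    INCORRECT_DATA_WRITTEN = 3
    HANDLED_AND_RECOVERED = 1

def classify(reboot_required=False, program_aborted=False, thread_aborted=False,
             record_lost=False, data_incorrect=False):
    """Hypothetical helper: pick the highest severity whose condition holds."""
    if reboot_required:
        return Severity.REBOOT_CATASTROPHIC_LOSS
    if program_aborted:
        return Severity.PROGRAM_ABORT
    if thread_aborted:
        return Severity.THREAD_ABORT
    if record_lost:
        return Severity.RECORD_NOT_WRITTEN
    if data_incorrect:
        return Severity.INCORRECT_DATA_WRITTEN
    return Severity.HANDLED_AND_RECOVERED

print(classify(thread_aborted=True, record_lost=True).name)
print(int(classify()))
```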

Severity

Using CASRE

Using CASRE (cont.)

Interval Number (int) | Number of Errors (float) | Interval Length (float) | Error Severity (int)

Example:
Hours
1 | 5.0 | 40.0 | 1
1 | 3.0 | 40.0 | 2
1 | 2.0 | 40.0 | 3
2 | 4.0 | 40.0 | 1
2 | 3.0 | 40.0 | 3
3 | 7.0 | 40.0 | 1
4 | 5.0 | 40.0 | 1
5 | 4.0 | 40.0 | 1

FAILURE COUNT FORMAT

TIME BETWEEN FAILURES FORMAT: N/A
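Generating a failure count file from categorized failures could look roughly like the Python sketch below. It assumes the layout suggested by the example: the time unit on the first line, then one whitespace-separated row per interval and severity. The exact delimiter rules should be checked against the CASRE documentation. The function name `format_failure_counts` is illustrative, and the rows are the ones from the example.

```python
def format_failure_counts(rows, time_unit="Hours"):
    """Render (interval, errors, interval_length, severity) rows as text.

    Assumed layout: time unit on the first line, then one row per
    interval/severity pair with whitespace-separated columns.
    """
    lines = [time_unit]
    for interval, errors, length, severity in rows:
        lines.append(f"{interval} {errors:.1f} {length:.1f} {severity}")
    return "\n".join(lines) + "\n"

# Rows taken from the example on the slide.
rows = [
    (1, 5.0, 40.0, 1),
    (1, 3.0, 40.0, 2),
    (1, 2.0, 40.0, 3),
    (2, 4.0, 40.0, 1),
    (2, 3.0, 40.0, 3),
    (3, 7.0, 40.0, 1),
    (4, 5.0, 40.0, 1),
    (5, 4.0, 40.0, 1),
]

text = format_failure_counts(rows)
print(text, end="")
```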

CASRE Input Format

CALOE + MAGTF Execution; MAGTF Integration + Execution

CASRE Failure Counts

CASRE Time Between Failures

CASRE Failure Intensity

CASRE Cumulative Failures

CASRE Test Interval Length

• Running Average:
– Not as Useful for Failure Count Data (unless test intervals are equal length)
– Computes the running average of the time between successive failures for time between failures data, or the running average of number of failures per interval for failure count data.
– For failure count data, if the running average decreases with time (fewer failures per test interval), reliability growth is indicated.

• Laplace Test:
– Not as Useful for Failure Count Data (unless test intervals are equal length)
– Null hypothesis: occurrences of failures follow a homogeneous Poisson process
– If the test statistic decreases with increasing failure number, the null hypothesis can be rejected in favor of reliability growth at an appropriate significance level. If it increases with increasing failure number, the null hypothesis can be rejected in favor of decreasing reliability.
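Both trend tests can be approximated outside CASRE for a quick check. The Python sketch below assumes failure count data with equal-length intervals. It uses the standard grouped-data form of the Laplace factor, which may differ from what CASRE computes internally. The counts are made up for illustration.

```python
import math

def running_average(counts):
    """Running mean of failures per interval, one value per interval."""
    averages, total = [], 0.0
    for i, n in enumerate(counts, start=1):
        total += n
        averages.append(total / i)
    return averages

def laplace_statistic(counts):
    """Laplace factor for failure counts in k equal-length intervals.

    u = (sum((i-1)*n_i) - (k-1)/2 * N) / sqrt((k^2 - 1)/12 * N), with N = sum(n_i).
    Negative values point toward a decreasing failure intensity.
    """
    k = len(counts)
    total = sum(counts)
    if k < 2 or total == 0:
        raise ValueError("need at least two intervals and at least one failure")
    weighted = sum(i * n for i, n in enumerate(counts))  # i runs from 0 to k-1
    numerator = weighted - (k - 1) / 2 * total
    denominator = math.sqrt((k * k - 1) / 12 * total)
    return numerator / denominator

# Made-up failure counts per interval, not data from the case study.
counts = [10, 8, 7, 5, 4]
print([round(a, 2) for a in running_average(counts)])
print(round(laplace_statistic(counts), 3))
```

In the conventional reading, a falling running average and a negative Laplace factor both suggest reliability growth, while a flat series gives a factor near zero.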

Detecting Reliability Trends

Running Average

Laplace Test

CASRE Cum Failure Predictions

CASRE Prediction Setup

CASRE Reliability Prediction

CASRE Prequential Likelihood

CASRE Model-Ranking

• Haven’t been able to get these to run yet.

• Instruction manual says many of the built-in models only work with Time Between Failures Data.

• Doubt there would be much utility with Failure Count Data

Reliability Models

• It actually would be QUITE easy to integrate Failure Count or Time Between Failures Output Auto-Generation into my environment

• This would facilitate quick trend-analysis

• Reliability trends, not the actual numbers, are what is important

Conclusion/Follow-Up
