Justin W. Eggstaff, Thomas A. Mazzuchi, Shahram Sarkani
J. W. Eggstaff, T. A. Mazzuchi, and S. Sarkani. “The development of progress plans using a performance-based expert judgment model to assess technical performance and risk”. Systems Engineering Volume 16 Number 2 in 2014.
J. W. Eggstaff, T. A. Mazzuchi, and S. Sarkani. “The effect of the number of seed variables on the performance of Cooke’s Classical Model”, Reliability Engineering and Systems Safety. – 2nd Revision
Each year, annual costs of DoD research & development (R&D) are approximately 50% above original estimates
Typical delays in weapons systems initial operational capability (IOC) are in excess of 20 months
Weapons Systems Acquisition Reform Act of 2009
Overall program performance depends on three factors:
• Cost
• Schedule
• Technical
Technical performance is typically “assumed”
Poor cost and schedule performance are symptoms or effects that manifest from poor technical performance
Current methods for the prediction of
Technical Measurement
• Types of technical measures
• Attributes
• Technical reviews and audits
Designed to provide a numerical value of risk by comparing current TPM progress against a desired level of performance, or a performance threshold, predefined by the analyst
Category A – Smaller the Better (Software Errors)
Category B – Larger the Better (System Range)
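The slides do not reproduce the risk formula itself. Purely as an illustration of how the two categories differ, the sketch below uses a hypothetical linear normalization (not the authors' index): risk is the shortfall from the threshold, scaled by the distance between the threshold and an assumed worst-case value, and clipped to [0, 1]. The function name `tpm_risk` and the `worst_case` parameter are assumptions introduced here.

```python
def tpm_risk(current, threshold, worst_case, smaller_is_better):
    """Illustrative (hypothetical) TPM risk value in [0, 1].

    0 means the threshold is met; 1 means performance is at or beyond
    the assumed worst case. This linear scaling is an assumption made
    for illustration only.
    """
    if smaller_is_better:
        # Category A: risk grows as the value rises above the threshold.
        shortfall = current - threshold
        span = worst_case - threshold
    else:
        # Category B: risk grows as the value falls below the threshold.
        shortfall = threshold - current
        span = threshold - worst_case
    if span <= 0:
        raise ValueError("worst_case must lie on the 'bad' side of threshold")
    return min(1.0, max(0.0, shortfall / span))


# Category A example: 30 software errors, threshold 10, worst case 50.
risk_errors = tpm_risk(30, 10, 50, smaller_is_better=True)
# Category B example: range of 400, threshold 500, worst case 300.
risk_range = tpm_risk(400, 500, 300, smaller_is_better=False)
```

Both example calls give a risk of 0.5, since each measure sits halfway between its threshold and its assumed worst case.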
Overall Risk
• Probably the most widely used method for combining expert judgment in a variety of applications
• Uses a set of seed variables to calculate individual expert Calibration and Information scores which, in turn, are used to calculate an expert’s relative weight
• The experts’ predicted values for a target variable are combined using their individual weights to calculate the decision maker’s assessment of that variable
Experts express their uncertainty by specifying 5%, 50% and 95% quantiles for each unknown quantity: a set of seed variables (whose actual realizations are known to the analyst alone) and a set of variables of interest
The analyst determines the Intrinsic Range or bounds for the variable distributions
• By specifying the 5%, 50% and 95%-iles, the expert is specifying a 4-bin multinomial distribution with probabilities .05, .45, .45, and .05 for each seed variable response
• Let s_i denote the observed relative frequency of seed-variable realizations falling in bin i
• We may test how well the expert is calibrated by testing the hypothesis H0: s_i = p_i for all i vs. Ha: s_i ≠ p_i for some i
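Test Statistic: 2N·I(s, p) = 2N Σ_i s_i ln(s_i / p_i)
If N (the number of seed variables) is large enough, 2N·I(s, p) is approximately chi-square distributed with 3 degrees of freedom under H0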
Thus the calibration score for the expert is the probability of getting a relative information score worse (greater or equal to) than what was obtained
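The calibration steps can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' implementation: it assumes the standard form of the Classical Model statistic, 2N·I(s, p) with I(s, p) = Σ s_i ln(s_i / p_i), compared against a chi-square distribution with 3 degrees of freedom. All function names are illustrative, and the chi-square tail is computed in closed form so that only the standard library is needed.

```python
import math

# Bin probabilities implied by the 5%, 50% and 95% quantiles.
P = (0.05, 0.45, 0.45, 0.05)


def bin_frequencies(quantiles, realizations):
    """Relative frequency of realizations in each inter-quantile bin.

    quantiles: list of (q5, q50, q95) triples, one per seed variable.
    realizations: observed seed-variable values, in the same order.
    """
    counts = [0, 0, 0, 0]
    for (q5, q50, q95), x in zip(quantiles, realizations):
        if x <= q5:
            counts[0] += 1
        elif x <= q50:
            counts[1] += 1
        elif x <= q95:
            counts[2] += 1
        else:
            counts[3] += 1
    n = len(realizations)
    return [c / n for c in counts]


def relative_information(s, p=P):
    """I(s, p) = sum of s_i * ln(s_i / p_i); empty bins contribute 0."""
    return sum(si * math.log(si / pi) for si, pi in zip(s, p) if si > 0)


def chi2_sf_3df(x):
    """P(X >= x) for a chi-square variable with 3 degrees of freedom."""
    if x <= 0:
        return 1.0
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))


def calibration_score(quantiles, realizations):
    """Probability of a statistic at least as large as the one observed."""
    n = len(realizations)
    s = bin_frequencies(quantiles, realizations)
    return chi2_sf_3df(2 * n * relative_information(s))
```

An expert whose realizations fall in the four bins in exactly the proportions 5/45/45/5 has I(s, p) = 0 and a calibration score of 1; an expert whose realizations mostly fall outside the stated 5%-95% range scores close to 0.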
The relative information for expert e on a variable is
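The formula from the original slide did not survive extraction. A commonly used form, assumed here, measures the expert's piecewise-uniform density against a uniform background over the intrinsic range [L, U]: I = ln(U − L) + Σ_i p_i ln(p_i / (q_i − q_{i−1})), with q_0 = L and q_4 = U. The sketch below computes that quantity for one variable; the function name and argument names are illustrative.

```python
import math

# Bin probabilities implied by the 5%, 50% and 95% quantiles.
P = (0.05, 0.45, 0.45, 0.05)


def information_score(q5, q50, q95, lower, upper):
    """Relative information of one assessment against a uniform
    background measure on the intrinsic range [lower, upper].

    Assumes lower < q5 < q50 < q95 < upper.
    """
    edges = [lower, q5, q50, q95, upper]
    widths = [b - a for a, b in zip(edges, edges[1:])]
    if any(w <= 0 for w in widths):
        raise ValueError("quantiles must be strictly increasing within the range")
    return math.log(upper - lower) + sum(
        p * math.log(p / w) for p, w in zip(P, widths)
    )


# Quantiles that match the uniform background carry no information ...
flat = information_score(5, 50, 95, lower=0, upper=100)
# ... while a tighter assessment scores higher.
tight = information_score(45, 50, 55, lower=0, upper=100)
```

An expert's overall information score would then typically be the average of this quantity over the variables assessed.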
• total weight for the expert is the normalized product of calibration times information score
• the calibration score is optimized by choosing a minimum α value such that if C(e) < α, C(e) is set to 0 (the expert receives zero weight)
• α is selected so that a fictitious expert with a distribution equal to that of the weighted combination of expert distributions would be given the highest weight among experts
• Final uncertainty distribution = Σ_i w_i F_i(x)
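The weighting and combination steps can be sketched as follows. This is a minimal sketch under stated assumptions: experts whose calibration score falls below the cutoff α are given zero weight (the usual reading in the Classical Model), α is supplied by the caller rather than optimized, and each expert's distribution is interpolated linearly between the assessed quantiles. All names are illustrative.

```python
def performance_weights(calibration, information, alpha=0.0):
    """Normalized calibration x information, zero below the cutoff alpha."""
    raw = [c * i if c >= alpha else 0.0
           for c, i in zip(calibration, information)]
    total = sum(raw)
    if total == 0:
        raise ValueError("no expert passes the cutoff")
    return [r / total for r in raw]


def piecewise_cdf(q5, q50, q95, lower, upper):
    """CDF that is linear between the assessed quantiles (an assumption)."""
    xs = [lower, q5, q50, q95, upper]
    ps = [0.0, 0.05, 0.5, 0.95, 1.0]

    def cdf(x):
        if x <= lower:
            return 0.0
        if x >= upper:
            return 1.0
        for i in range(4):
            if x <= xs[i + 1]:
                frac = (x - xs[i]) / (xs[i + 1] - xs[i])
                return ps[i] + (ps[i + 1] - ps[i]) * frac

    return cdf


def decision_maker_cdf(weights, cdfs):
    """Weighted mixture of the experts' CDFs: sum of w_i * F_i(x)."""
    return lambda x: sum(w * f(x) for w, f in zip(weights, cdfs))


# Three hypothetical experts; the second falls below the cutoff.
w = performance_weights([0.5, 0.01, 0.2], [1.0, 2.0, 1.5], alpha=0.05)
experts = [piecewise_cdf(5, 50, 95, 0, 100),
           piecewise_cdf(10, 20, 30, 0, 100),
           piecewise_cdf(40, 50, 60, 0, 100)]
dm = decision_maker_cdf(w, experts)
```

With these made-up scores the weights come out as 0.625, 0 and 0.375, and because both weighted experts have a median of 50, `dm(50)` is 0.5.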
Three reasons for an iterative cross-validation analysis
• The Classical Model uses a set of seed variables to develop expert weights; an iterative approach is needed
• The question of the minimum number of seed variables required has not been answered
• The ongoing debate over the robustness of the Classical Model (performance weights versus equal weights)
Cooke and Goossens (2008)
• Examines 45 expert judgment studies compiled over 20 years
Clemen (2008)
• Asserts that “in-sample” analysis is biased toward the Classical Model; suggests the use of “out-of-sample/Remove-One-At-a-Time (ROAT)” analysis
• Selects 14 studies to compare the performance-weighted (PW) decision maker and the equally-weighted (EW) decision maker
Cooke (2008)
• Notes that a ROAT approach tends to favor or punish excluded experts and presents a “two-fold” cross validation
• In 20 of 26 validation runs, the PW outperformed the EW
Lin and Cheng (2008); (2009)
• Using out-of-sample analysis, examines the available 45 studies and finds that the PW outperforms the EW, but with degraded performance
Flandoli et al. (2010)
• Performs a modified “two-fold” cross validation with 500 combinations of 30-70 splits
• Results show that Cooke’s model gives the best indication of uncertainty when averaged
Analysis conducted
• Comprehensive “Out-of-Sample” analysis
• One-tailed sign test (Clemen, 2008)
Data used
• 55 expert judgment studies compiled over 20 years
• 63 data sets: 604 experts, 770 seed variables, ~68M judgments
Iteration   Seed Variables Used   Target Variables Evaluated
1           1                     2, 3, 4
2           2                     1, 3, 4
3           3                     1, 2, 4
4           4                     1, 2, 3
5           1, 2                  3, 4
6           1, 3                  2, 4
7           1, 4                  2, 3
8           2, 3                  1, 4
9           2, 4                  1, 3
10          3, 4                  1, 2
11          1, 2, 3               4
12          1, 2, 4               3
13          1, 3, 4               2
14          2, 3, 4               1
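The iteration scheme in the table (every non-empty proper subset of the variables used as seeds, with the remainder treated as targets) can be generated mechanically. The sketch below reproduces the 14 rows for a four-variable study; the function name is illustrative.

```python
from itertools import combinations


def seed_target_splits(n_variables):
    """Yield (seeds, targets) for every out-of-sample split.

    Seeds range over all subsets of size 1 .. n_variables - 1, so that
    at least one variable is always left over as a target.
    """
    variables = range(1, n_variables + 1)
    for k in range(1, n_variables):
        for seeds in combinations(variables, k):
            targets = tuple(v for v in variables if v not in seeds)
            yield seeds, targets


splits = list(seed_target_splits(4))
for iteration, (seeds, targets) in enumerate(splits, start=1):
    print(iteration, seeds, targets)
```

For n variables this gives 2^n − 2 splits, which is why the number of iterations grows quickly for studies with many seed variables.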
Extent of previous cross-validation studies
Mean Out-of-Sample Combination Scores (Calibration × Information)
Columns 1-8: No. of Variables Used to Determine Performance Measure

Study ID   No. of Experts  No. of Variables  DM Type  1       2       3       4       5       6       7       8
MVOSEEDS   77              5                 PWDM     0.3259  0.5579  0.6773  0.8414
                                             EWDM     0.0279  0.1154  0.3071  0.6963
A_SEED     7               6                 PWDM     0.1434  0.3312  0.3462  0.3332  0.4439
                                             EWDM     0.0072  0.0229  0.0580  0.1260  0.2508
AOTDAILY   7               6                 PWDM     0.0167  0.0294  0.0583  0.1199  0.2271
                                             EWDM     0.0164  0.0313  0.0586  0.1036  0.1565
FCEP       5               8                 PWDM     0.0028  0.5309  0.7328  0.8917  1.0556  1.0792  1.1396
                                             EWDM     0.0001  0.0008  0.0038  0.0135  0.0399  0.1059  0.2434
BSWAAL     6               8                 PWDM     0.3811  0.2538  0.3624  0.3958  0.3932  0.3665  0.4860
                                             EWDM     0.2697  0.3142  0.3458  0.3688  0.3862  0.3900  0.4406
DSM-1      10              8                 PWDM     0.1546  0.2075  0.2448  0.3224  0.4849  0.6048  0.6591
                                             EWDM     0.2637  0.2939  0.3105  0.3241  0.3403  0.3576  0.4508
MONT1      11              8                 PWDM     0.6249  0.6168  0.5673  0.5964  0.6656  0.6350  0.6423
                                             EWDM     0.2312  0.2880  0.3497  0.4158  0.4854  0.5734  0.7321
SO3EXPTS   4               9                 PWDM     0.0123  0.1847  0.3236  0.5801  0.7460  0.9834  1.0993  2.1950
                                             EWDM     2.9E-5  0.0002  0.0013  0.0063  0.0254  0.0856  0.2407  0.5700
WATERPOL   11              9                 PWDM     0.0115  0.1661  0.4032  0.5544  0.6987  0.8737  0.9985  1.0289
                                             EWDM     0.0033  0.0111  0.0313  0.0687  0.1195  0.1798  0.2624  0.4852
Single Decision Maker Dominates in 28 of 63 Cases
• PWDM: 21 Cases
• EWDM: 7 Cases
Single Modal Switching in 22 of 63 Cases
• EWDM gives way to PWDM: 10 Cases
• PWDM gives way to EWDM: 12 Cases
Dual Modal Switching (Parabolic) in 11 of 63 Cases
• PWDM at the extremes: 7 Cases
• EWDM at the extremes: 4 Cases
Somewhat Random Switching in 2 of 63 Cases
• BSWAAL
• ACNEXPTS
Accuracy Measures (One-tailed Sign Test)
Median p-value by data set
• PWDM is more accurate than EWDM (p ≥ 0.5): 42 of 63 cases
• EWDM is more accurate than PWDM (p < 0.5): 11 of 63 cases
• Overall median p-value: 0.74
Median p-value by number of seed variables
• PWDM is more accurate in ALL cases
• PWDM is significantly more accurate in all but two cases
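The slides do not spell out how the sign-test p-value is defined. The sketch below assumes a convention consistent with the reading above, in which larger values favor the PWDM: the p-value is taken as P(X ≤ number of PWDM wins) for X ~ Binomial(n, 1/2), with ties dropped. Both that convention and the example scores are assumptions made for illustration, not values from the study.

```python
from math import comb


def sign_test_p_value(pw_scores, ew_scores):
    """One-tailed sign test on paired scores (higher score = better).

    Returns P(X <= wins) under Binomial(n, 0.5), where wins counts the
    pairs in which the PWDM beats the EWDM and ties are discarded.
    Under this assumed convention, values of 0.5 or more favor the PWDM.
    """
    wins = sum(p > e for p, e in zip(pw_scores, ew_scores))
    losses = sum(p < e for p, e in zip(pw_scores, ew_scores))
    n = wins + losses
    if n == 0:
        raise ValueError("all pairs are tied; the sign test is undefined")
    return sum(comb(n, k) for k in range(wins + 1)) / 2 ** n


# Hypothetical paired scores: the PWDM wins 3 of 4 comparisons.
p_value = sign_test_p_value([0.4, 0.5, 0.6, 0.2], [0.3, 0.4, 0.5, 0.3])
```

With 3 wins out of 4 informative pairs, the value is 15/16 = 0.9375.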
Data Used
• The data set for this research comes from the unpublished white paper by Coleman, Kulick, and Pisano (1996) on the T45TS Cockpit-21 project
• Actual data were used to simulate inputs for the expert judgment model
Appendix B
Figure 4-3, p. 112: E-TRI Flow Diagram
Oct-93 Milestone Review: Nov-93 TPM value is predicted
Nov-93 Milestone Review: TPM value is realized
Expert weights are determined
Decision maker’s assessment is calculated
• Using weighted expert predictions
• Calculated for all remaining milestones
Nov-93 updated prediction is presented
E-TRI for final state (Feb-94) is calculated
Case Study Data
System E-TRI for final state (Feb-94) is calculated
QUESTIONS?
Garvey, P. R., & Cho, C.-C. (2003). An Index to Measure a System's Performance Risk. Acquisition Review Quarterly, Spring, 189-199.
Winkler, R. L. (1968). The consensus of subjective probability distributions. Management Science, 15(2), 61-75.
Cooke, R. M. (1991). Experts in uncertainty: opinion and subjective probability in science. New York: Oxford University Press.
Clemen, R. T. (2008). Comment on Cooke's classical method. Reliability Engineering & System Safety, 93(5), 760-765.
Coleman, C., Kulick, K., & Pisano, N. (1996). Technical performance measurement (TPM) retrospective implementation and concept validation on the T45TS Cockpit-21 program. Program Executive Office for Air Anti-Submarine Warfare, Assault, and Special Mission Programs, White Paper.