Adequate Evaluation of Quality Models in Software Engineering Studies

Yan Ma, Bojan Cukic
Lane Department of Computer Science and Electrical Engineering
West Virginia University
May 2007

CITeR: The Center for Identification Technology Research (www.citer.wvu.edu), an NSF I/UCR Center advancing integrative research
Evaluating Defect Models

• Hundreds of research papers.
  – Most offer very little that one can generalize and reapply.
  – The initial hurdle was the lack of data, but not any longer:
    • Open source repositories, NASA MDP, PROMISE datasets.
• How should defect prediction models be evaluated?
Software Defect Data: Class Imbalance

• Only a few modules are fault-prone.
  – This is a problem for supervised learning algorithms, which typically try to maximize overall accuracy: on PC1, for example, only 7% of modules are faulty, so a model that labels every module fault-free reaches 93% accuracy while detecting nothing.
Software Defect Data: Correlation

MDP-PC1: Pearson correlation coefficients

        LOC    TOpnd  V      B      LCC    N
LOC     1.000  0.908  0.937  0.931  0.545  0.924
TOpnd   0.908  1.000  0.976  0.971  0.464  0.996
V       0.937  0.976  1.000  0.995  0.468  0.987
B       0.931  0.971  0.995  1.000  0.468  0.982
LCC     0.545  0.464  0.468  0.468  1.000  0.473
N       0.924  0.996  0.987  0.982  0.473  1.000
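Such a matrix is easy to reproduce from the raw metric columns; a minimal sketch, using a random stand-in for the actual MDP-PC1 data:

```python
import numpy as np

# Stand-in metric matrix: one row per module, one column per metric
# (LOC, TOpnd, V, B, LCC, N in the table above).
rng = np.random.default_rng(0)
metrics = rng.random((1000, 6))

# Pearson correlation coefficients between the metric columns;
# rowvar=False treats columns as variables.
corr = np.corrcoef(metrics, rowvar=False)
print(np.round(corr, 3))
```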
Software Defect Data: Correlation (2)

MDP-KC2: Pearson correlation coefficients

        LOC    UOp    V      IV.G   TOp    LOB
LOC     1.000  0.632  0.986  0.968  0.991  0.909
UOp     0.632  1.000  0.536  0.577  0.615  0.636
V       0.986  0.536  1.000  0.970  0.990  0.887
IV.G    0.968  0.577  0.970  1.000  0.972  0.836
TOp     0.991  0.615  0.990  0.972  1.000  0.912
LOB     0.909  0.636  0.887  0.836  0.912  1.000
Software Defect Data: Correlation (3)

• Strong correlation persists even among the five "most informative attributes".
Software Defect Data: Module Size

• Defect-free modules are smaller.
• In MDP, modules are very small.

The 90th percentile of LOC for defective and defect-free modules:

             KC1  KC2  PC1  JM1  CM1
Defect-free   42   55   47   72   55
Defective     99  167  114  165  131
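A minimal sketch of the statistic, with illustrative LOC values rather than the actual MDP measurements:

```python
import numpy as np

# Illustrative LOC values per class (not the actual MDP data).
loc_defect_free = np.array([8, 12, 18, 25, 33, 40, 52])
loc_defective = np.array([30, 55, 80, 120, 150, 170])

# 90th percentile of module size for each class.
print(np.percentile(loc_defect_free, 90))
print(np.percentile(loc_defective, 90))
```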
Software Defect Data: Close Neighbors

• The "nearest neighbor" of most defective modules is a fault-free module.
  – Measured by Euclidean distance between module metrics (a sketch of this computation follows the table).

Percentage of defective modules whose nearest neighbors fall in the majority (fault-free) class:

Project   Nearest neighbor in      2 of 3 nearest neighbors
          majority class           in majority class
KC1       66.26%                   73.62%
KC2       58.33%                   58.33%
JM1       67.90%                   75.46%
PC1       75.32%                   85.71%
CM1       73.47%                   97.96%
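A minimal sketch of that neighbor analysis, assuming module metrics in a NumPy array X and labels y (1 = defective, 0 = fault-free); the function name is illustrative:

```python
import numpy as np

def neighbor_class_stats(X, y, k=3):
    """For defective modules (y == 1), report how often the nearest
    neighbor, and 2 of the 3 nearest neighbors, are fault-free (y == 0)."""
    # Pairwise Euclidean distances between all modules.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # a module is not its own neighbor

    defective = np.flatnonzero(y == 1)
    nn = d[defective].argsort(axis=1)[:, :k]   # indices of the k nearest neighbors
    labels = y[nn]                             # class labels of those neighbors

    frac_nn = np.mean(labels[:, 0] == 0)                 # nearest is fault-free
    frac_2of3 = np.mean((labels == 0).sum(axis=1) >= 2)  # 2 of 3 fault-free
    return frac_nn, frac_2of3
```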
Implications on Evaluation

• Many machine learning algorithms are ineffective on such data.
  – But one would never know it by reading the literature.
• Experimental results are rarely reported adequately.
Classification Success Measures

• Probability of detection (PD): the proportion of correctly classified faulty modules (also called sensitivity).
• Specificity: the proportion of correctly classified fault-free modules.
• Probability of false alarm (PF): the proportion of misclassified fault-free modules.
  – PF = 1 − Specificity
• Precision index: the proportion of faulty modules among those predicted as faulty.

All four follow directly from the confusion matrix; see the sketch below.
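A minimal sketch computing these measures from predicted and true labels (the function name and label encoding are illustrative):

```python
import numpy as np

def success_measures(y_true, y_pred):
    """PD, specificity, PF, and precision for binary labels
    (1 = faulty, 0 = fault-free)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))   # faulty, caught
    fn = np.sum((y_true == 1) & (y_pred == 0))   # faulty, missed
    tn = np.sum((y_true == 0) & (y_pred == 0))   # fault-free, correct
    fp = np.sum((y_true == 0) & (y_pred == 1))   # fault-free, false alarm

    pd_ = tp / (tp + fn)              # probability of detection (sensitivity)
    specificity = tn / (tn + fp)
    pf = fp / (fp + tn)               # false alarm rate, = 1 - specificity
    precision = tp / (tp + fp)
    return pd_, specificity, pf, precision
```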
Random Forests on PC1 (only 7% of modules faulty)
Success Measures (2)

• Accuracy, PD, specificity, and the precision index each tell a one-sided story.
• Combined indices capture several measures of interest at once:
G-mean1 = √(PD × Precision)      Higher precision leads to "cheaper" V&V.
G-mean2 = √(PD × Specificity)    The geometric mean of the two accuracies.
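A minimal sketch of the two indices, reusing the measures computed by the success_measures() helper above:

```python
import math

def g_mean1(pd_, precision):
    """Geometric mean of detection and precision."""
    return math.sqrt(pd_ * precision)

def g_mean2(pd_, specificity):
    """Geometric mean of the two class-wise accuracies."""
    return math.sqrt(pd_ * specificity)
```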
Success Measures (3)

• The F-measure, like G-mean1, combines PD and Precision.
  – It offers more flexibility: the weight β should reflect the project's "cost vs. risk aversion".
    • Choosing β may be difficult.
F-measure = ((β² + 1) × Precision × PD) / (β² × Precision + PD)
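The same formula as a minimal sketch:

```python
def f_measure(pd_, precision, beta=1.0):
    """Weighted harmonic combination of PD and precision;
    beta > 1 weights detection (PD) more heavily."""
    return ((beta**2 + 1) * precision * pd_) / (beta**2 * precision + pd_)
```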
Comparing Models

PC1: random forests at different voting cutoffs

                    Figure 1 (a)   Figure 1 (b)
G-mean1             0.399          0.426
G-mean2             0.519          0.783
F-measure (β = 1)   0.372          0.368
F-measure (β = 2)   0.305          0.527

• G-mean2 and F-measure (β = 2) reflect the difference between the two models.
• Interpretation is still in the domain of human understanding; a sketch of how voting cutoffs produce different operating points follows.
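A minimal sketch of varying a random forest's voting cutoff with scikit-learn (the dataset and cutoff values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative imbalanced dataset (~7% positives, like PC1).
X, y = make_classification(n_samples=2000, weights=[0.93], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
votes = rf.predict_proba(X_te)[:, 1]    # fraction of trees voting "faulty"

for cutoff in (0.5, 0.3, 0.1):          # lower cutoff -> higher PD, higher PF
    y_hat = (votes >= cutoff).astype(int)
    pd_ = np.mean(y_hat[y_te == 1])     # detected fraction of faulty modules
    pf = np.mean(y_hat[y_te == 0])      # false alarms among fault-free ones
    print(f"cutoff={cutoff:.1f}  PD={pd_:.3f}  PF={pf:.3f}")
```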
Comparing Performance: MDP-PC1

Indices             Naïve Bayes  Logistic  IB1    J48    Bagging
PD                  0.299        0.065     0.442  0.234  0.169
1 − PF              0.936        0.988     0.954  0.985  0.995
Precision           0.259        0.289     0.415  0.540  0.732
Overall Accuracy    0.892        0.924     0.918  0.933  0.938
G-mean1             0.278        0.137     0.428  0.356  0.352
G-mean2             0.529        0.253     0.649  0.480  0.410
F-measure (β = 1)   0.278        0.106     0.428  0.327  0.275
F-measure (β = 2)   0.290        0.077     0.436  0.264  0.200
ED (θ = 0.5)        0.498        0.661     0.396  0.542  0.588
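The combined indices in these tables follow directly from the first three rows; for example, for IB1 on PC1, reusing the helpers sketched earlier:

```python
pd_, precision = 0.442, 0.415          # IB1 on PC1
specificity = 0.954                    # the "1 - PF" row

print(g_mean1(pd_, precision))         # -> 0.428
print(g_mean2(pd_, specificity))       # -> 0.649
print(f_measure(pd_, precision, 2))    # -> 0.436
```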
Comparing Performance: KC2

Indices             Naïve Bayes  Logistic  IB1    J48    Bagging
PD                  0.398        0.389     0.509  0.546  0.472
1 − PF              0.950        0.932     0.858  0.896  0.931
Precision           0.674        0.599     0.483  0.578  0.639
Overall Accuracy    0.836        0.820     0.786  0.824  0.836
G-mean1             0.518        0.483     0.496  0.562  0.549
G-mean2             0.615        0.602     0.661  0.700  0.663
F-measure (β = 1)   0.501        0.472     0.496  0.562  0.543
F-measure (β = 2)   0.434        0.418     0.504  0.552  0.498
ED (θ = 0.5)        0.427        0.435     0.361  0.329  0.377
ED (θ = 0.67)       0.492        0.500     0.409  0.375  0.433
Visual Tools: Margin Plots
Visual Tools: ROC

• Classifiers that allow multiple operating points are more flexible.
• The Area Under the Curve (AUC) allows direct comparison when two curves are available.
• Distance from the ideal performance point (PD = 1, PF = 0):
ED = √(θ × (1 − PD)² + (1 − θ) × PF²)

The factor θ depends on the misclassification cost.
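A minimal sketch of both tools with scikit-learn, continuing from the votes and y_te of the random-forest sketch above:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# roc_curve returns false-positive rates (PF), true-positive rates (PD),
# and the cutoffs that produce them.
pf, pd_, cutoffs = roc_curve(y_te, votes)
print("AUC =", auc(pf, pd_))

theta = 0.5                            # equal misclassification costs
ed = np.sqrt(theta * (1 - pd_)**2 + (1 - theta) * pf**2)
best = ed.argmin()                     # operating point closest to ideal
print(f"cutoff={cutoffs[best]:.2f}  PD={pd_[best]:.3f}  "
      f"PF={pf[best]:.3f}  ED={ed[best]:.3f}")
```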
Summary
• Emerging data collection must be accompanied by mature evaluation.
• Research reports must reveal all aspects of achieved predictive performance fairly.
• The goodness of a model depends on the cost of misclassification.
Current Work

• Statistical significance calculation between ROC curves.
• ROC vs. PR (Precision-Recall) curves; cost curves.