Software Quality Analysis with Limited Prior Knowledge of Faults
Naeem (Jim) Seliya, Assistant Professor, CIS Department, University of Michigan – Dearborn
313 583 6669 | [email protected]
Oct. 3, 2006 Wayne State University CS Seminar
Overview
Introduction
Knowledge-Based Software Quality Analysis
Software Quality Analysis with Limited Prior Knowledge of Faults
Empirical Case Study: Software Measurement Data, Empirical Results
Conclusion
Introduction
Software quality assurance is vital during the software development process
Knowledge-based software quality models are useful for allocating limited resources to fault-prone programs
Software measurements are often treated as predictors of software quality, i.e., the working hypothesis
System operations and software maintenance benefit from targeting program modules that are likely to have defects
Introduction …
Software quality modeling has been addressed in the related literature:
software quality classification models
software fault prediction models
software module-order modeling
A supervised learning approach is typically taken for software quality modeling:
requires prior experience of developing systems relatively similar to the target system
requires complete knowledge of the defect data of previously developed program modules
Software Quality Analysis
[Diagram: Model Training uses the Previous Experience (software metrics, known defect data) to produce a Learnt Hypothesis; Model Application applies that hypothesis to the Target Project (software metrics, unknown defect data).]
Software Quality Analysis …
Practical software engineering problems:
the organization has limited software defect data from previous experiences with similar software projects
the organization does not have software defect data from previous experiences with similar software projects
the organization does not have experience with developing similar software projects
Two very likely problem scenarios:
software quality modeling with limited software defect data
software quality modeling without software defect data
Limited Defect Data Problem
[Diagram: the Previous Experience supplies software metrics with known defect data for only some modules and unknown defect data for the rest; Model Training produces a Learnt Hypothesis, which Model Application applies to the Target Project (software metrics, unknown defect data).]
No Software Defect Data Problem
[Diagram: the Previous Experience supplies software metrics but no defect data; Model Training produces a Learnt Hypothesis, which Model Application applies to the Target Project (software metrics, unknown defect data).]
Limited Defect Data Problem …
Some contributing issues:
the cost of metrics data collection may limit the subsystems for which software fault data is collected
software defect data collected for some modules may be error-prone due to data collection problems
defect data may be reliable for only some components
only some project components of a distributed software system may collect software fault data
in a multiple-release system, fault data may not be collected for all releases
Objectives
Developing solutions for software quality analysis when there is only limited a priori knowledge of defect data
Learning software quality trends from both the labeled (small) and unlabeled (large) components of the software measurement data
Providing empirical software engineering evidence for the effectiveness and practical appeal of the proposed solutions
Proposed Solutions
Constraint-Based Clustering with Expert Input
Semi-Supervised Classification with the Expectation-Maximization Algorithm
Constraint-Based Clustering with Expert Input
Clustering is an appropriate choice for software quality analysis based on program attributes alone
Clustering algorithms group program modules according to their software attributes
Program modules with similar attributes will likely have similar software quality characteristics
High-quality modules will likely group together into nfp (not fault-prone) clusters
Low-quality modules will likely group together into fp (fault-prone) clusters
Constraint-Based Clustering with Expert Input …
Labeled data instances are used to modify and enhance clustering results on the unlabeled data instances
Investigated unsupervised clustering with expert input for software quality classification
Constraint-based clustering can aid the expert in better labeling the clusters as fp or nfp
Identify difficult-to-classify modules or noisy instances in the software measurement data
Proposed Algorithm
A constraint-based clustering approach is implemented with the k-means algorithm
Labeled program modules are used to initialize the centroids of a certain number of clusters
The grouping of the labeled modules remains unchanged, acting as fixed constraints
The expert has the flexibility to inspect and label additional clusters as nfp or fp during the semi-supervised clustering process
Proposed Algorithm …
Let D contain the L_nfp, L_fp, and U sets of program modules
1. Obtain initial numbers of nfp and fp clusters:
Execute the Cg algorithm to obtain the optimal number (p) of nfp clusters among {1, 2, …, Cin_nfp}
Execute the Cg algorithm to obtain the optimal number (q) of fp clusters among {1, 2, …, Cin_fp}
Cg algorithm: Krzanowski and Lai, "A criterion for determining the number of groups in a data set using sums-of-squares clustering," Biometrics, 44(1):23-34, March 1988.
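The Krzanowski–Lai statistic behind the Cg step can be sketched as follows; this is a minimal illustration, assuming the within-cluster sums of squares W[g] for g = 1..g_max have already been computed by clustering at each g (the function name and data layout are ours, not from the original implementation):

```python
def kl_criterion(W, p):
    """Krzanowski-Lai statistic KL(g) = |DIFF(g) / DIFF(g+1)|, where
    DIFF(g) = (g-1)^(2/p) * W[g-1] - g^(2/p) * W[g], p is the number of
    features, and W[g] is the within-cluster sum of squares for g clusters.
    The g that maximizes KL(g) is chosen as the number of clusters."""
    def diff(g):
        return (g - 1) ** (2 / p) * W[g - 1] - g ** (2 / p) * W[g]

    kl = {}
    for g in range(2, max(W)):  # each KL(g) needs W[g-1] and W[g+1]
        d_next = diff(g + 1)
        kl[g] = abs(diff(g) / d_next) if d_next != 0 else float("inf")
    return kl
```

For a W curve that flattens once the true grouping is reached, the statistic peaks at that g, which is what the algorithm takes as the initial number of nfp (or fp) clusters.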
Proposed Algorithm …
2. Initialize centroids of clusters:
Centroids of p out of C_max clusters are initialized to the centroids of the nfp clusters
Centroids of q out of the remaining {C_max - p} clusters are initialized to the centroids of the fp clusters
Centroids of the remaining r (i.e., C_max - p - q) clusters are initialized to randomly selected modules from U
Randomly select 5 unique sets of modules for initializing the r unlabeled clusters
Proposed Algorithm …
3. Execute constraint-based clustering:
k-means with Euclidean distance is run on D with the initialized centroids of the C_max clusters
Clustering is run under the constraint that an existing membership of a module in a labeled cluster remains unchanged
Clustering is repeated for all 5 centroid initialization settings
The clustering associated with the median SSE value is selected for subsequent computation
Proposed Algorithm …
4. Expert-based labeling of clusters:
The expert is presented with descriptive statistics of the r unlabeled clusters and asked to label them as nfp or fp
The expert labels only those clusters for which he is very confident in the label estimation
If at least 1 of the r clusters is labeled, go to Step 2 and continue
Proposed Algorithm …
5. Stop semi-supervised clustering:
The iterative semi-supervised clustering process is stopped when the sets C_nfp, C_fp, and C_ul are unchanged
Program modules in the p (or q) clusters are labeled and recorded as nfp (or fp)
Program modules in the remaining r unlabeled clusters are not assigned any fault-proneness labels
Software Measurement Data
Software metrics datasets obtained from seven NASA software projects (MDP Initiative): JM1, KC1, KC2, KC3, CM1, MW1, and PC1
The projects are characterized by the same set of software product metrics and were built in similar software development environments
Defect data reflect changes made to the source code to correct errors recorded in problem reporting systems
Software Measurement Data …
The JM1 project dataset is used as the training data in our empirical case studies, as it is the largest dataset among the seven software projects
The remaining six datasets are used as test data to evaluate model generalization performance
Among the 21 product metrics, only the 13 basic metrics are used in our study
Software Metrics
1. Cyclomatic complexity
2. Essential complexity
3. Design complexity
4. Number of unique operators
5. Number of unique operands
6. Total number of operators
7. Total number of operands
8. Total number of lines of source code
9. Executable lines of code
10. Lines with code and comments
11. Lines with only comments
12. Blank lines of code
13. Branch count
Case Study Datasets
Constraint-Based Clustering Case Study
The JM1 dataset is pre-processed to yield a reduced dataset of 8850 modules, i.e., JM1-8850
program modules with identical software attributes but different fault-proneness labels were eliminated
JM1 is used for the training instances and to form the respective labeled & unlabeled datasets
KC1, KC2, KC3, CM1, MW1, & PC1 are used as test datasets to evaluate the knowledge learnt from the constraint-based clustering analysis of the software data
Constraint-Based Clustering Case Study …
Labeled datasets formed by random sampling: LP = {100, 250, 500, 1000, 1500, 2000, 2500, 3000} labeled modules
Each LP dataset is randomly selected to maintain an 80:20 proportion of nfp:fp program modules
3 samples were obtained for each LP value, and average results are reported in the paper (5 samples for LP = {100, 250, 500})
Parameter settings: C_max = {30, 40} clusters; Cin_nfp = {10, 20} and Cin_fp = {10, 20} for the Cg algorithm
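The construction of a labeled sample with the 80:20 nfp:fp proportion can be sketched as follows; a minimal illustration, with function and parameter names that are ours, not from the study:

```python
import random

def sample_labeled(nfp_modules, fp_modules, lp, seed=0):
    """Randomly draw lp labeled modules while keeping an 80:20
    nfp:fp split, as in the case study's LP datasets."""
    rng = random.Random(seed)
    n_nfp = int(round(0.8 * lp))      # 80% not fault-prone
    n_fp = lp - n_nfp                 # 20% fault-prone
    return (rng.sample(nfp_modules, n_nfp),
            rng.sample(fp_modules, n_fp))
```

Repeating the draw with different seeds gives the 3 (or 5) samples per LP value whose results are averaged.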
Initial Clusters of Labeled Modules
Expert-Based Labeling Results
Classification of Labeled Modules
Classification of Labeled Modules …
Classification of Labeled Modules … (C_max = 30 Clusters)
Classification of Labeled Modules … (C_max = 40 Clusters)
Classification of Test Datasets: Unsupervised Clustering
Classification of Test Datasets …: Constraint-Based Clustering (LP = 250)
Classification of Test Datasets …: Constraint-Based Clustering (LP = 1000)
Classification of Test Datasets …: Constraint-Based Clustering (LP = 3000)
Classification of Test Datasets …: Average Classification of Test Data Modules
Comparison with C4.5 Models
C4.5 decision tree implemented in Weka, an open-source data mining tool
Supervised decision tree models are built using 10-fold cross-validation
Decision tree parameters are tuned for an appropriate comparison with constraint-based clustering, i.e., tuned for similar Type I (false positive) error rates
The C4.5 models yielded very low false positive rates in conjunction with very high false negative rates
The performance of the C4.5 models generally remains unchanged as LP grows, compared to an improvement by constraint-based clustering
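Since the comparison tunes for similar Type I rates, it helps to be precise about the two error rates; a minimal sketch, taking nfp as the negative class and fp as the positive class:

```python
def error_rates(actual, predicted):
    """Type I: nfp modules misclassified as fp (false positive rate).
       Type II: fp modules misclassified as nfp (false negative rate)."""
    fp_count = sum(1 for a, p in zip(actual, predicted)
                   if a == "nfp" and p == "fp")
    fn_count = sum(1 for a, p in zip(actual, predicted)
                   if a == "fp" and p == "nfp")
    n_nfp = sum(1 for a in actual if a == "nfp")
    n_fp = sum(1 for a in actual if a == "fp")
    type1 = fp_count / n_nfp if n_nfp else 0.0
    type2 = fn_count / n_fp if n_fp else 0.0
    return type1, type2
```

The "very low false positives with very high false negatives" pattern of the untuned C4.5 models corresponds to a small Type I rate paired with a large Type II rate.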
Remaining Unlabeled Modules
do they constitute noisy data?
are they hard-to-model modules?
do they form new groups of program modules for the given system?
are their software measurements uniquely different from those of the other program modules?
did something go wrong in the software metrics data collection process?
did the project not collect other software metrics that may better represent the software quality?
Remaining Unlabeled Modules …
Ensemble Filter (EF) strategy: comparison with a majority EF
The EF consists of 25 classifiers from different learning theories and methodologies
Investigate the commonality between the modules detected by the EF and those that remain unlabeled after the constraint-based clustering process
About 40% to 50% were common with those considered noisy by the ensemble filter
A relatively large number of the same modules were consistently included in the pool of remaining unlabeled program modules
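A majority ensemble filter flags an instance as noisy when more than half of the classifiers misclassify it; a minimal sketch (the study's filter used 25 classifiers; here `classifiers` is any list of predict functions, and the names are illustrative):

```python
def ensemble_filter(instances, labels, classifiers):
    """Return the indices of instances misclassified by a majority
    of the classifiers, i.e., the candidates for noisy data."""
    noisy = []
    for i, (x, y) in enumerate(zip(instances, labels)):
        wrong = sum(1 for clf in classifiers if clf(x) != y)
        if wrong > len(classifiers) / 2:
            noisy.append(i)
    return noisy
```

Intersecting this index set with the modules left unlabeled by constraint-based clustering gives the 40%–50% overlap reported above.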
Constraint-Based Clustering Case Study … Summary
Improved estimation performance compared to unsupervised clustering with expert input
Better test data performance compared to a supervised learner trained on the labeled dataset
For larger labeled datasets, generally improved performance compared to EM-based semi-supervised classification
Several of the remaining modules are likely to constitute noisy data, providing insight into their attributes
Semi-Supervised Classification with the EM Algorithm
Learning from a small labeled and a large unlabeled software measurement dataset
The Expectation-Maximization (EM) algorithm is used to build semi-supervised software quality classification models
Improve the supervised learner with the knowledge stored in the software attributes of the unlabeled program modules
The labeled dataset is iteratively augmented with program modules from the unlabeled dataset
Semi-Supervised Classification with the EM Algorithm …
[Diagram: the labeled program modules (LP = {100, 250, 500, 1000, 1500, 2000, 2500, 3000}) and the unlabeled program modules from JM1-8850 feed the EM algorithm for estimating class labels; confidence-based selection moves the selected unlabeled modules into the labeled set.]
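The iterative augmentation loop can be sketched as follows: fit on the labeled modules, estimate labels for the unlabeled ones, and move only the high-confidence estimates into the labeled pool. The study used the EM algorithm with a probabilistic model; this generic self-training skeleton, with illustrative names, only mirrors that control flow:

```python
def semi_supervised(labeled, unlabeled, fit, predict_proba, threshold=0.95):
    """Iteratively augment `labeled` with confidently labeled modules.
    fit(labeled) -> model; predict_proba(model, x) -> (label, confidence).
    Stops when no unlabeled module clears the confidence threshold."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    while True:
        model = fit(labeled)
        keep, moved = [], []
        for x in unlabeled:
            label, conf = predict_proba(model, x)
            (moved if conf >= threshold else keep).append((x, label))
        if not moved:
            break  # confidence-based selection found nothing new
        labeled.extend(moved)
        unlabeled = [x for x, _ in keep]
    return labeled, unlabeled
```

Any probabilistic learner can plug into `fit`/`predict_proba`; the threshold controls how aggressively the labeled dataset grows per iteration.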
Semi-Supervised Classification with the EM Algorithm …
The proposed semi-supervised classification process improved generalization performance
Semi-supervised software quality classification models generally yielded better performance than C4.5 decision trees
About 40% to 50% of the remaining modules are likely to constitute noisy data
The number of unlabeled modules selected for augmentation was largest when LP = 1000
Conclusion
Practical solutions to the problem of software quality analysis with limited a priori knowledge of defect data
Empirical investigation with software measurement data from real-world projects
Constraint-based clustering vs. semi-supervised classification:
constraint-based clustering generally yielded better performance than EM-based semi-supervised classification, and has lower complexity
semi-supervised classification with EM allows for control of the relative balance between the Type I and Type II error rates
Some Future Work
Applying the limited defect data problem to quantitative software fault prediction models
A software engineering study on the characteristics of program modules that remain unlabeled, including the software development process
Exploring semi-supervised learning schemes for detecting noisy instances in a dataset
Investigating self-labeling heuristics for minimizing expert involvement in the unsupervised and constraint-based clustering approaches
Other SE Research Focus
Knowledge-based software security modeling and analysis
Studying the influence of diversity in pair programming teams
Software engineering measurements for agile development teams and projects
Software forensics with cyber security and education applications
Software Quality Analysis with Limited Prior Knowledge of Faults
Naeem (Jim) Seliya, Assistant Professor, CIS Department, University of Michigan – Dearborn
313 583 6669 | [email protected]
Questions!