Software Quality Analysis with Limited Prior Knowledge of Faults
Naeem (Jim) Seliya, Assistant Professor, CIS Department, University of Michigan – Dearborn
313 583 6669 | [email protected]
Oct. 3, 2006 Wayne State University CS Seminar
Overview
Introduction
Knowledge-Based Software Quality Analysis
Software Quality Analysis with Limited Prior Knowledge of Faults
Empirical Case Study: Software Measurement Data, Empirical Results
Conclusion
Introduction
Software quality assurance is vital during the software development process
Knowledge-based software quality models are useful for allocating limited resources to fault-prone programs
Software measurements are often treated as predictors of software quality, i.e., the working hypothesis
System operations and software maintenance benefit from targeting program modules that are likely to have defects
Introduction …
Software quality modeling has been addressed in the related literature:
software quality classification models
software fault prediction models
software module-order modeling
A supervised learning approach is typically taken for software quality modeling:
requires prior experience of developing systems relatively similar to the target system
requires complete knowledge of the defect data of previously developed program modules
Software Quality Analysis
[Diagram: Model Training uses the Previous Experience (software metrics, known defect data) to produce a Learnt Hypothesis; Model Application applies that hypothesis to the Target Project (software metrics, unknown defect data).]
Software Quality Analysis …
Practical software engineering problems:
the organization has limited software defect data from previous experiences with similar software projects
the organization does not have software defect data from previous experiences with similar software projects
the organization does not have experience with developing similar software projects
Two very likely problem scenarios:
software quality modeling with limited software defect data
software quality modeling without software defect data
Limited Defect Data Problem
[Diagram: the Previous Experience supplies software metrics with known defect data for only some modules and unknown defect data for the rest; Model Training produces a Learnt Hypothesis, which Model Application applies to the Target Project (software metrics, unknown defect data).]
No Software Defect Data Problem
[Diagram: the Previous Experience supplies software metrics but no defect data; Model Training produces a Learnt Hypothesis, which Model Application applies to the Target Project (software metrics, unknown defect data).]
Limited Defect Data Problem …
Some contributing issues:
the cost of metrics data collection may limit the subsystems for which software fault data is collected
software defect data collected for some modules may be error-prone due to data collection problems
defect data may be reliable for only some components
only some project components of a distributed software system may collect software fault data
in a multiple-release system, fault data may not be collected for all releases
Objectives
Developing solutions for software quality analysis when there is only limited a priori knowledge of defect data
Learning software quality trends from both the labeled (small) and unlabeled (large) components of the software measurement data
Providing empirical software engineering evidence for the effectiveness and practical appeal of the proposed solutions
Proposed Solutions
Constraint-Based Clustering with Expert Input
Semi-Supervised Classification with the Expectation-Maximization Algorithm
Constraint-Based Clustering with Expert Input
Clustering is an appropriate choice for software quality analysis based on program attributes alone
Clustering algorithms group program modules according to their software attributes
Program modules with similar attributes will likely have similar software quality characteristics
High-quality modules will likely group together into nfp (not fault-prone) clusters
Low-quality modules will likely group together into fp (fault-prone) clusters
Constraint-Based Clustering with Expert Input …
Labeled data instances are used to modify and enhance clustering results on the unlabeled data instances
Investigated unsupervised clustering with expert input for software quality classification
Constraint-based clustering can aid the expert in better labeling the clusters as fp or nfp
Identify difficult-to-classify modules or noisy instances in the software measurement data
Proposed Algorithm
A constraint-based clustering approach is implemented with the k-means algorithm
Labeled program modules are used to initialize the centroids of a certain number of clusters
The grouping of the labeled modules remains unchanged, acting as fixed constraints
The expert has the flexibility to inspect and label additional clusters as nfp or fp during the semi-supervised clustering process
Proposed Algorithm …
Let D contain the L_nfp, L_fp, and U sets of program modules
1. Obtain initial numbers of nfp and fp clusters:
Execute the Cg algorithm to obtain the optimal number (p) of nfp clusters among {1, 2, …, Cin_nfp}
Execute the Cg algorithm to obtain the optimal number (q) of fp clusters among {1, 2, …, Cin_fp}
Cg algorithm: Krzanowski and Lai, "A criterion for determining the number of groups in a data set using sums-of-squares clustering," Biometrics, 44(1):23-34, March 1988.
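The Krzanowski–Lai statistic behind the Cg step can be sketched as follows; this is a minimal illustration, assuming the within-cluster sums of squares W[g] for g = 1..g_max have already been computed by clustering at each g (the function name and data layout are ours, not from the original implementation):

```python
def kl_criterion(W, p):
    """Krzanowski-Lai statistic KL(g) = |DIFF(g) / DIFF(g+1)|, where
    DIFF(g) = (g-1)^(2/p) * W[g-1] - g^(2/p) * W[g], p is the number of
    features, and W[g] is the within-cluster sum of squares for g clusters.
    The g that maximizes KL(g) is chosen as the number of clusters."""
    def diff(g):
        return (g - 1) ** (2 / p) * W[g - 1] - g ** (2 / p) * W[g]

    kl = {}
    for g in range(2, max(W)):  # each KL(g) needs W[g-1] and W[g+1]
        d_next = diff(g + 1)
        kl[g] = abs(diff(g) / d_next) if d_next != 0 else float("inf")
    return kl
```

For a W curve that flattens once the true grouping is reached, the statistic peaks at that g, which is what the algorithm takes as the initial number of nfp (or fp) clusters.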
Proposed Algorithm …
2. Initialize centroids of clusters:
Centroids of p out of C_max clusters are initialized to the centroids of the nfp clusters
Centroids of q out of the remaining {C_max - p} clusters are initialized to the centroids of the fp clusters
Centroids of the remaining r (i.e., C_max - p - q) clusters are initialized to randomly selected modules from U
Randomly select 5 unique sets of modules for initializing the r unlabeled clusters
Proposed Algorithm …
3. Execute constraint-based clustering:
k-means with Euclidean distance is run on D with the initialized centroids of the C_max clusters
Clustering is run under the constraint that an existing membership of a module in a labeled cluster remains unchanged
Clustering is repeated for all 5 centroid initialization settings
The clustering associated with the median SSE value is selected for subsequent computation
Proposed Algorithm …
4. Expert-based labeling of clusters:
The expert is presented with descriptive statistics of the r unlabeled clusters and asked to label them as nfp or fp
The expert labels only those clusters for which he is very confident in the label estimation
If at least 1 of the r clusters is labeled, go to Step 2 and continue
Proposed Algorithm …
5. Stop semi-supervised clustering:
The iterative semi-supervised clustering process is stopped when the sets C_nfp, C_fp, and C_ul are unchanged
Program modules in the p (or q) clusters are labeled and recorded as nfp (or fp)
Program modules in the remaining r unlabeled clusters are not assigned any fault-proneness labels
Software Measurement Data
Software metrics datasets obtained from seven NASA software projects (MDP Initiative): JM1, KC1, KC2, KC3, CM1, MW1, and PC1
The projects are characterized by the same set of software product metrics and were built in similar software development environments
Defect data reflect changes made to the source code to correct errors recorded in problem reporting systems
Software Measurement Data …
The JM1 project dataset is used as the training data in our empirical case studies, as it is the largest dataset among the seven software projects
The remaining six datasets are used as test data to evaluate model generalization performance
Among the 21 product metrics, only the 13 basic metrics are used in our study
Software Metrics
1. Cyclomatic complexity
2. Essential complexity
3. Design complexity
4. Number of unique operators
5. Number of unique operands
6. Total number of operators
7. Total number of operands
8. Total number of lines of source code
9. Executable lines of code
10. Lines with code and comments
11. Lines with only comments
12. Blank lines of code
13. Branch count
Case Study Datasets
Constraint-Based Clustering Case Study
The JM1 dataset is pre-processed to yield a reduced dataset of 8850 modules, i.e., JM1-8850
program modules with identical software attributes but different fault-proneness labels were eliminated
JM1 is used for the training instances and to form the respective labeled & unlabeled datasets
KC1, KC2, KC3, CM1, MW1, & PC1 are used as test datasets to evaluate the knowledge learnt from the constraint-based clustering analysis of the software data
Constraint-Based Clustering Case Study …
Labeled datasets formed by random sampling: LP = {100, 250, 500, 1000, 1500, 2000, 2500, 3000} labeled modules
Each LP dataset is randomly selected to maintain an 80:20 proportion of nfp:fp program modules
3 samples were obtained for each LP value, and average results are reported in the paper (5 samples for LP = {100, 250, 500})
Parameter settings: C_max = {30, 40} clusters; Cin_nfp = {10, 20} and Cin_fp = {10, 20} for the Cg algorithm
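The construction of a labeled sample with the 80:20 nfp:fp proportion can be sketched as follows; a minimal illustration, with function and parameter names that are ours, not from the study:

```python
import random

def sample_labeled(nfp_modules, fp_modules, lp, seed=0):
    """Randomly draw lp labeled modules while keeping an 80:20
    nfp:fp split, as in the case study's LP datasets."""
    rng = random.Random(seed)
    n_nfp = int(round(0.8 * lp))      # 80% not fault-prone
    n_fp = lp - n_nfp                 # 20% fault-prone
    return (rng.sample(nfp_modules, n_nfp),
            rng.sample(fp_modules, n_fp))
```

Repeating the draw with different seeds gives the 3 (or 5) samples per LP value whose results are averaged.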
Initial Clusters of Labeled Modules
Expert-Based Labeling Results
Classification of Labeled Modules
Classification of Labeled Modules …
Classification of Labeled Modules … (C_max = 30 Clusters)
Classification of Labeled Modules … (C_max = 40 Clusters)
Classification of Test Datasets: Unsupervised Clustering
Classification of Test Datasets …: Constraint-Based Clustering (LP = 250)
Classification of Test Datasets …: Constraint-Based Clustering (LP = 1000)
Classification of Test Datasets …: Constraint-Based Clustering (LP = 3000)
Classification of Test Datasets …: Average Classification of Test Data Modules
Comparison with C4.5 Models
C4.5 decision tree implemented in Weka, an open-source data mining tool
Supervised decision tree models are built using 10-fold cross-validation
Decision tree parameters are tuned for an appropriate comparison with constraint-based clustering, i.e., tuned for similar Type I (false positive) error rates
The C4.5 models yielded very low false positive rates in conjunction with very high false negative rates
The performance of the C4.5 models generally remains unchanged as LP grows, compared to an improvement by constraint-based clustering
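Since the comparison tunes for similar Type I rates, it helps to be precise about the two error rates; a minimal sketch, taking nfp as the negative class and fp as the positive class:

```python
def error_rates(actual, predicted):
    """Type I: nfp modules misclassified as fp (false positive rate).
       Type II: fp modules misclassified as nfp (false negative rate)."""
    fp_count = sum(1 for a, p in zip(actual, predicted)
                   if a == "nfp" and p == "fp")
    fn_count = sum(1 for a, p in zip(actual, predicted)
                   if a == "fp" and p == "nfp")
    n_nfp = sum(1 for a in actual if a == "nfp")
    n_fp = sum(1 for a in actual if a == "fp")
    type1 = fp_count / n_nfp if n_nfp else 0.0
    type2 = fn_count / n_fp if n_fp else 0.0
    return type1, type2
```

The "very low false positives with very high false negatives" pattern of the untuned C4.5 models corresponds to a small Type I rate paired with a large Type II rate.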
Remaining Unlabeled Modules
do they constitute noisy data?
are they hard-to-model modules?
do they form new groups of program modules for the given system?
are their software measurements uniquely different from those of the other program modules?
did something go wrong in the software metrics data collection process?
did the project not collect other software metrics that may better represent the software quality?
Remaining Unlabeled Modules …
Ensemble Filter (EF) strategy: comparison with a majority EF
The EF consists of 25 classifiers from different learning theories and methodologies
Investigate the commonality between the modules detected by the EF and those that remain unlabeled after the constraint-based clustering process
About 40% to 50% were common with those considered noisy by the ensemble filter
A relatively large number of the same modules were consistently included in the pool of remaining unlabeled program modules
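A majority ensemble filter flags an instance as noisy when more than half of the classifiers misclassify it; a minimal sketch (the study's filter used 25 classifiers; here `classifiers` is any list of predict functions, and the names are illustrative):

```python
def ensemble_filter(instances, labels, classifiers):
    """Return the indices of instances misclassified by a majority
    of the classifiers, i.e., the candidates for noisy data."""
    noisy = []
    for i, (x, y) in enumerate(zip(instances, labels)):
        wrong = sum(1 for clf in classifiers if clf(x) != y)
        if wrong > len(classifiers) / 2:
            noisy.append(i)
    return noisy
```

Intersecting this index set with the modules left unlabeled by constraint-based clustering gives the 40%–50% overlap reported above.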
Constraint-Based Clustering Case Study … Summary
Improved estimation performance compared to unsupervised clustering with expert input
Better test data performance compared to a supervised learner trained on the labeled dataset
For larger labeled datasets, generally improved performance compared to EM-based semi-supervised classification
Several of the remaining modules are likely to constitute noisy data, providing insight into their attributes
Semi-Supervised Classification with the EM Algorithm
Learning from a small labeled and a large unlabeled software measurement dataset
The Expectation-Maximization (EM) algorithm is used to build semi-supervised software quality classification models
Improve the supervised learner with the knowledge stored in the software attributes of the unlabeled program modules
The labeled dataset is iteratively augmented with program modules from the unlabeled dataset
Semi-Supervised Classification with the EM Algorithm …
[Diagram: the labeled program modules (LP = {100, 250, 500, 1000, 1500, 2000, 2500, 3000}) and the unlabeled program modules from JM1-8850 feed the EM algorithm for estimating class labels; confidence-based selection moves the selected unlabeled modules into the labeled set.]
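The iterative augmentation loop can be sketched as follows: fit on the labeled modules, estimate labels for the unlabeled ones, and move only the high-confidence estimates into the labeled pool. The study used the EM algorithm with a probabilistic model; this generic self-training skeleton, with illustrative names, only mirrors that control flow:

```python
def semi_supervised(labeled, unlabeled, fit, predict_proba, threshold=0.95):
    """Iteratively augment `labeled` with confidently labeled modules.
    fit(labeled) -> model; predict_proba(model, x) -> (label, confidence).
    Stops when no unlabeled module clears the confidence threshold."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    while True:
        model = fit(labeled)
        keep, moved = [], []
        for x in unlabeled:
            label, conf = predict_proba(model, x)
            (moved if conf >= threshold else keep).append((x, label))
        if not moved:
            break  # confidence-based selection found nothing new
        labeled.extend(moved)
        unlabeled = [x for x, _ in keep]
    return labeled, unlabeled
```

Any probabilistic learner can plug into `fit`/`predict_proba`; the threshold controls how aggressively the labeled dataset grows per iteration.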
Semi-Supervised Classification with the EM Algorithm …
The proposed semi-supervised classification process improved generalization performance
Semi-supervised software quality classification models generally yielded better performance than C4.5 decision trees
About 40% to 50% of the remaining modules are likely to constitute noisy data
The number of unlabeled modules selected for augmentation was largest when LP = 1000
Conclusion
Practical solutions to the problem of software quality analysis with limited a priori knowledge of defect data
Empirical investigation with software measurement data from real-world projects
Constraint-based clustering vs. semi-supervised classification:
constraint-based clustering generally yielded better performance than EM-based semi-supervised classification, and has lower complexity
semi-supervised classification with EM allows for control of the relative balance between the Type I and Type II error rates
Some Future Work
Applying the limited defect data problem to quantitative software fault prediction models
A software engineering study on the characteristics of program modules that remain unlabeled, including the software development process
Exploring semi-supervised learning schemes for detecting noisy instances in a dataset
Investigating self-labeling heuristics for minimizing expert involvement in the unsupervised and constraint-based clustering approaches
Other SE Research Focus
Knowledge-based software security modeling and analysis
Studying the influence of diversity in pair programming teams
Software engineering measurements for agile development teams and projects
Software forensics with cyber security and education applications
Software Quality Analysis with Limited Prior Knowledge of Faults
Naeem (Jim) Seliya, Assistant Professor, CIS Department, University of Michigan – Dearborn
313 583 6669 | [email protected]
Questions!