
Page 1: Software Quality Analysis with Limited Prior Knowledge of Faults

Software Quality Analysis with Limited Prior Knowledge of Faults

Naeem (Jim) Seliya
Assistant Professor, CIS Department
University of Michigan – Dearborn

313 583 6669

[email protected]

Page 2: Software Quality Analysis with Limited Prior Knowledge of Faults

Oct. 3, 2006 Wayne State University CS Seminar


Overview

- Introduction
- Knowledge-Based Software Quality Analysis
- Software Quality Analysis with Limited Prior Knowledge of Faults
- Empirical Case Study
- Software Measurement Data
- Empirical Results
- Conclusion

Page 3: Software Quality Analysis with Limited Prior Knowledge of Faults


Introduction

- Software quality assurance is vital during the software development process
- Knowledge-based software quality models are useful for allocating limited resources to faulty programs
- Software measurements are often observed as predictors of software quality, i.e., the working hypothesis
- System operations and software maintenance benefit from targeting program modules that are likely to have defects

Page 4: Software Quality Analysis with Limited Prior Knowledge of Faults


Introduction …

- Software quality modeling has been addressed in the related literature:
  - software quality classification models
  - software fault prediction models
  - software module-order modeling
- A supervised learning approach is typically taken for software quality modeling:
  - requires prior experience developing systems relatively similar to the target system
  - requires complete knowledge of the defect data of previously developed program modules

Page 5: Software Quality Analysis with Limited Prior Knowledge of Faults


Software Quality Analysis

[Figure] Model training: software metrics with known defect data from previous experience are used to learn a hypothesis. Model application: the learnt hypothesis is applied to the target project's software metrics, whose defect data are unknown.

Page 6: Software Quality Analysis with Limited Prior Knowledge of Faults


Software Quality Analysis …

- Practical software engineering problems:
  - The organization has limited software defect data from previous experience with similar software projects
  - The organization does not have software defect data from previous experience with similar software projects
  - The organization does not have experience developing similar software projects
- Two very likely problem scenarios:
  - Software quality modeling with limited software defect data
  - Software quality modeling without software defect data

Page 7: Software Quality Analysis with Limited Prior Knowledge of Faults


Limited Defect Data Problem

[Figure] Model training: software metrics from previous experience, where defect data are known for only some modules and unknown for the rest, are used to learn a hypothesis. Model application: the learnt hypothesis is applied to the target project's software metrics, whose defect data are unknown.

Page 8: Software Quality Analysis with Limited Prior Knowledge of Faults


No Software Defect Data Problem

[Figure] The same training/application pipeline, but no defect data are available from previous experience: the hypothesis must be learnt from the software metrics alone and then applied to the target project's metrics, whose defect data are unknown.

Page 9: Software Quality Analysis with Limited Prior Knowledge of Faults


Limited Defect Data Problem …

Some contributing issues:
- the cost of metrics data collection may limit which subsystems software fault data is collected for
- software defect data collected for some modules may be error-prone due to data collection problems
- defect data may be reliable for only some components
- only some components of a distributed software system may collect software fault data
- in a multiple-release system, fault data may not be collected for all releases

Page 10: Software Quality Analysis with Limited Prior Knowledge of Faults


Objectives

- Developing solutions for software quality analysis when there is only limited a priori knowledge of defect data
- Learning software quality trends from both the labeled (small) and unlabeled (large) portions of the software measurement data
- Providing empirical software engineering evidence for the effectiveness and practical appeal of the proposed solutions

Page 11: Software Quality Analysis with Limited Prior Knowledge of Faults


Proposed Solutions

- Constraint-Based Clustering with Expert Input
- Semi-Supervised Classification with the Expectation-Maximization Algorithm

Page 12: Software Quality Analysis with Limited Prior Knowledge of Faults


Constraint-Based Clustering with Expert Input

- Clustering is an appropriate choice for software quality analysis based on program attributes alone
- Clustering algorithms group program modules according to their software attributes
- Program modules with similar attributes will likely have similar software quality characteristics
- High-quality modules will likely group together into nfp (not fault-prone) clusters
- Low-quality modules will likely group together into fp (fault-prone) clusters

Page 13: Software Quality Analysis with Limited Prior Knowledge of Faults


Constraint-Based Clustering with Expert Input …

- Labeled data instances are used to modify and enhance clustering results on the unlabeled data instances
- Investigated unsupervised clustering with expert input for software quality classification
- Constraint-based clustering can aid the expert in better labeling the clusters as fp or nfp
- Identify difficult-to-classify modules or noisy instances in the software measurement data

Page 14: Software Quality Analysis with Limited Prior Knowledge of Faults


Proposed Algorithm

- A constraint-based clustering approach is implemented with the k-means algorithm
- Labeled program modules are used to initialize the centroids of a certain number of clusters
- The grouping of the labeled modules remains unchanged, acting as fixed constraints
- The expert has the flexibility to inspect and label additional clusters as nfp or fp during the semi-supervised clustering process

Page 15: Software Quality Analysis with Limited Prior Knowledge of Faults

Oct. 3, 2006 Wayne State University CS Seminar

15

Proposed Algorithm …

Let D contain the L_nfp, L_fp, and U sets of program modules.

1. Obtain initial numbers of nfp and fp clusters:
- Execute the Cg algorithm to obtain the optimal number (p) of nfp clusters among {1, 2, …, Cin_nfp}
- Execute the Cg algorithm to obtain the optimal number (q) of fp clusters among {1, 2, …, Cin_fp}
- The Cg algorithm is the work of Krzanowski and Lai: "A criterion for determining the number of groups in a data set using sums-of-squares clustering," Biometrics, 44(1):23-34, March 1988.
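The Krzanowski-Lai criterion chooses the number of clusters g that maximizes KL(g) = |DIFF(g)| / |DIFF(g+1)|, where DIFF(g) = (g-1)^(2/p) * W(g-1) - g^(2/p) * W(g), W(g) is the within-cluster sum of squares with g clusters, and p is the number of features. A minimal sketch (the function name and input format are illustrative; the W(g) values would come from running k-means at each candidate g):

```python
def kl_criterion(wss, p):
    """Pick the number of clusters g maximizing the Krzanowski-Lai
    index KL(g) = |DIFF(g)| / |DIFF(g+1)|, where
    DIFF(g) = (g-1)^(2/p) * W(g-1) - g^(2/p) * W(g).
    wss : dict {g: within-cluster sum of squares W(g)} for g = 1..g_max
    p   : number of features (here, software metrics per module)"""
    def diff(g):
        return (g - 1) ** (2.0 / p) * wss[g - 1] - g ** (2.0 / p) * wss[g]

    best_g, best_kl = None, -1.0
    for g in range(2, max(wss)):       # KL(g) needs DIFF(g) and DIFF(g+1)
        kl = abs(diff(g)) / abs(diff(g + 1))
        if kl > best_kl:
            best_g, best_kl = g, kl
    return best_g
```

With the 13 basic metrics used in this study, p = 13; the candidate range {1, 2, …, Cin_nfp} (or Cin_fp) bounds the g values tried.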

Page 16: Software Quality Analysis with Limited Prior Knowledge of Faults


Proposed Algorithm …

2. Initialize centroids of clusters:
- Centroids of p of the C_max clusters are initialized to the centroids of the nfp clusters
- Centroids of q of the remaining {C_max - p} clusters are initialized to the centroids of the fp clusters
- Centroids of the remaining r (i.e., C_max - p - q) clusters are initialized to randomly selected modules from U
- Randomly select 5 unique sets of modules for initializing the r unlabeled clusters

Page 17: Software Quality Analysis with Limited Prior Knowledge of Faults


Proposed Algorithm …

3. Execute constraint-based clustering:
- k-means with Euclidean distance is run on D with the initialized centroids of the C_max clusters
- Clustering is run under the constraint that an existing membership of a module in a labeled cluster remains unchanged
- Clustering is repeated for all 5 centroid initialization settings
- The clustering associated with the median SSE value is selected for subsequent computation
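Steps 2-3 can be sketched as follows; this is an illustrative NumPy implementation, not the author's code. Here `fixed` maps the row indices of labeled modules to their fixed cluster indices, and `centroids` carries the step-2 initialization:

```python
import numpy as np

def constrained_kmeans(X, fixed, n_clusters, centroids, n_iter=100):
    """Illustrative sketch of constraint-based k-means with Euclidean
    distance: a module already assigned to a labeled cluster keeps
    that assignment on every iteration.
    X         : (n, p) array of software metrics
    fixed     : dict {row index -> cluster index} for labeled modules
    centroids : (n_clusters, p) initial centroids from step 2
    Returns final labels, centroids, and the SSE used to select among
    the 5 centroid-initialization settings."""
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment: unlabeled modules go to the nearest centroid;
        # labeled modules are forced back into their fixed cluster.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        for i, c in fixed.items():
            new_labels[i] = c
        if np.array_equal(new_labels, labels):
            break  # memberships stable -> converged
        labels = new_labels
        # Update: recompute each centroid from its current members.
        for c in range(n_clusters):
            members = X[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    sse = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, sse
```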

Page 18: Software Quality Analysis with Limited Prior Knowledge of Faults


Proposed Algorithm …

4. Expert-based labeling of clusters:
- The expert is presented with descriptive statistics of the r unlabeled clusters and asked to label them as nfp or fp
- The expert labels only those clusters for which he is very confident in the label estimate
- If at least 1 of the r clusters is labeled, go to Step 2 and continue

Page 19: Software Quality Analysis with Limited Prior Knowledge of Faults


Proposed Algorithm …

5. Stop semi-supervised clustering:
- The iterative semi-supervised clustering process stops when the sets C_nfp, C_fp, and C_ul are unchanged
- Program modules in the p (or q) clusters are labeled and recorded as nfp (or fp)
- Program modules in the remaining r unlabeled clusters are not assigned any fault-proneness labels

Page 20: Software Quality Analysis with Limited Prior Knowledge of Faults


Software Measurement Data

- Software metrics datasets obtained from seven NASA software projects (MDP Initiative): JM1, KC1, KC2, KC3, CM1, MW1, and PC1
- Projects characterized by the same set of software product metrics and built in similar software development environments
- Defect data reflect changes made to source code for correcting errors recorded in problem reporting systems

Page 21: Software Quality Analysis with Limited Prior Knowledge of Faults


Software Measurement Data …

- The JM1 project dataset is used as training data in our empirical case studies; it is the largest dataset among the seven software projects
- The remaining six datasets are used as test data for model evaluation and generalization performance
- Among the 21 product metrics, only the 13 basic metrics are used in our study

Page 22: Software Quality Analysis with Limited Prior Knowledge of Faults


Software Metrics

1. Cyclomatic complexity

2. Essential complexity

3. Design complexity

4. Number of unique operators

5. Number of unique operands

6. Total number of operators

7. Total number of operands

8. Total number of lines of source code

9. Executable lines of code

10. Lines with code and comments

11. Lines with only comments

12. Blank lines of code

13. Branch count

Page 23: Software Quality Analysis with Limited Prior Knowledge of Faults


Case Study Datasets

Page 24: Software Quality Analysis with Limited Prior Knowledge of Faults


Constraint-Based Clustering Case Study

- The JM1 dataset is pre-processed to yield a reduced dataset of 8850 modules, i.e., JM1-8850: program modules with identical software attributes but different fault-proneness labels were eliminated
- JM1 is used as the training instances and to form the respective labeled & unlabeled datasets
- KC1, KC2, KC3, CM1, MW1, & PC1 are used as test datasets to evaluate the knowledge learnt by the constraint-based clustering analysis of the software data

Page 25: Software Quality Analysis with Limited Prior Knowledge of Faults


Constraint-Based Clustering Case Study …

- Labeled datasets formed by random sampling: LP = {100, 250, 500, 1000, 1500, 2000, 2500, 3000} labeled modules
- Each LP dataset is randomly selected to maintain an 80:20 proportion of nfp:fp program modules
- 3 samples were obtained for each LP value, and average results are reported (5 samples for LP = {100, 250, 500})
- Parameter settings: C_max = {30, 40} clusters; Cin_nfp = {10, 20} and Cin_fp = {10, 20} for the Cg algorithm
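The 80:20 stratified sampling used to form each labeled dataset might look like this (the function name and pool representation are hypothetical):

```python
import random

def sample_labeled(nfp_pool, fp_pool, lp, seed=0):
    """Hypothetical sketch of forming one labeled dataset of size `lp`
    with an 80:20 nfp:fp proportion by stratified random sampling.
    `nfp_pool` / `fp_pool` hold the candidate module identifiers."""
    rng = random.Random(seed)      # fixed seed for repeatable samples
    n_nfp = int(round(lp * 0.8))   # 80% nfp modules
    n_fp = lp - n_nfp              # 20% fp modules
    return rng.sample(nfp_pool, n_nfp), rng.sample(fp_pool, n_fp)
```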

Page 26: Software Quality Analysis with Limited Prior Knowledge of Faults


Initial Clusters of Labeled Modules

Page 27: Software Quality Analysis with Limited Prior Knowledge of Faults


Expert-Based Labeling Results

Page 28: Software Quality Analysis with Limited Prior Knowledge of Faults


Classification of Labeled Modules

Page 29: Software Quality Analysis with Limited Prior Knowledge of Faults


Classification of Labeled …

Page 30: Software Quality Analysis with Limited Prior Knowledge of Faults


Classification of Labeled … (C_max = 30 Clusters)

Page 31: Software Quality Analysis with Limited Prior Knowledge of Faults


Classification of Labeled … (C_max = 40 Clusters)

Page 32: Software Quality Analysis with Limited Prior Knowledge of Faults


Classification of Test Datasets

Unsupervised Clustering

Page 33: Software Quality Analysis with Limited Prior Knowledge of Faults


Classification of Test Datasets …

Constraint-Based Clustering (LP = 250)

Page 34: Software Quality Analysis with Limited Prior Knowledge of Faults


Classification of Test Datasets …

Constraint-Based Clustering (LP = 1000)

Page 35: Software Quality Analysis with Limited Prior Knowledge of Faults


Classification of Test Datasets …

Constraint-Based Clustering (LP = 3000)

Page 36: Software Quality Analysis with Limited Prior Knowledge of Faults


Classification of Test Datasets …

Average Classification of Test Data Modules

Page 37: Software Quality Analysis with Limited Prior Knowledge of Faults


Comparison with C4.5 Models

- C4.5 decision tree implemented in Weka, an open-source data mining tool
- Supervised decision tree models built using 10-fold cross-validation
- Decision tree parameters tuned for an appropriate comparison with constraint-based clustering, i.e., tuning for similar Type I (false positive) error rates
- C4.5 models yielded very low false positives in conjunction with very high false negatives
- Performance of the C4.5 models generally remained unchanged as LP grew, compared to an improvement by constraint-based clustering
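The Type I and Type II error rates used for this tuning can be computed directly from the misclassification counts; a small sketch with labels 0 = nfp and 1 = fp:

```python
def error_rates(y_true, y_pred):
    """Type I rate: nfp modules misclassified as fp (false positives).
    Type II rate: fp modules misclassified as nfp (false negatives).
    Labels: 0 = nfp, 1 = fp."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n_nfp = sum(1 for t in y_true if t == 0)
    n_fp = sum(1 for t in y_true if t == 1)
    return fp / n_nfp, fn / n_fp
```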

Page 38: Software Quality Analysis with Limited Prior Knowledge of Faults


Remaining Unlabeled Modules

- Do they constitute noisy data?
- Are they hard-to-model modules?
- Do they form new groups of program modules for the given system?
- Are their software measurements uniquely different from those of the other program modules?
- Did something go wrong in the software metrics data collection process?
- Did the project fail to collect other software metrics that may better represent software quality?

Page 39: Software Quality Analysis with Limited Prior Knowledge of Faults


Remaining Unlabeled Modules …

- Ensemble Filter (EF) strategy: comparison with a majority EF consisting of 25 classifiers from different learning theories and methodologies
- Investigate the commonality between modules detected as noisy by the EF and those that remain unlabeled after the constraint-based clustering process
- About 40% to 50% were common with those considered noisy by the ensemble filter
- A relatively large number of the same modules were consistently included in the pool of remaining unlabeled program modules
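A majority ensemble filter flags an instance as noisy when more than half of the classifiers misclassify it. An illustrative sketch (not the 25-classifier configuration used in the study):

```python
def majority_ensemble_filter(predictions, y):
    """Flag instance i as noisy when more than half of the classifiers
    misclassify it.  `predictions[j][i]` is classifier j's predicted
    label for instance i; `y[i]` is its recorded label."""
    n_clf = len(predictions)
    noisy = []
    for i, true in enumerate(y):
        errors = sum(1 for j in range(n_clf) if predictions[j][i] != true)
        if errors > n_clf / 2:   # strict majority of classifiers disagree
            noisy.append(i)
    return noisy
```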

Page 40: Software Quality Analysis with Limited Prior Knowledge of Faults


Constraint-Based Clustering Case Study … Summary

- Improved estimation performance compared to unsupervised clustering with expert input
- Better test data performance compared to a supervised learner trained on the labeled dataset
- For larger labeled datasets, generally improved performance compared to EM-based semi-supervised classification
- Several of the remaining modules are likely to constitute noisy data, providing insight into their attributes

Page 41: Software Quality Analysis with Limited Prior Knowledge of Faults


Semi-Supervised Classification with the EM Algorithm

- Learning from a small labeled and a large unlabeled software measurement dataset
- Expectation-Maximization (EM) algorithm for building semi-supervised software quality classification models
- Improve the supervised learner with the knowledge stored in the software attributes of the unlabeled program modules
- The labeled dataset is iteratively augmented with program modules from the unlabeled dataset
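To make the EM idea concrete, here is a deliberately simplified sketch: a two-class Gaussian model over a single metric, where labeled modules keep their hard labels and EM iterates soft labels for the unlabeled ones. This is an assumption-laden illustration, not the paper's actual learner:

```python
import math

def em_semi_supervised(x_lab, y_lab, x_unlab, n_iter=50):
    """Simplified semi-supervised EM (illustrative only): two Gaussian
    classes over one software metric.  Labeled modules keep their hard
    labels; EM iterates soft labels for the unlabeled modules.
    Labels: 0 = nfp, 1 = fp.  Returns P(fp) per unlabeled module."""
    def gauss(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    # Initialize each class's mean/variance from the labeled data alone.
    mu, var = [0.0, 0.0], [1.0, 1.0]
    for c in (0, 1):
        vals = [x for x, y in zip(x_lab, y_lab) if y == c]
        mu[c] = sum(vals) / len(vals)
        var[c] = sum((x - mu[c]) ** 2 for x in vals) / len(vals) or 1e-6
    prior_fp = sum(y_lab) / len(y_lab)

    p_fp = []
    for _ in range(n_iter):
        # E-step: soft class memberships for the unlabeled modules.
        p_fp = []
        for x in x_unlab:
            a = prior_fp * gauss(x, mu[1], var[1])
            b = (1 - prior_fp) * gauss(x, mu[0], var[0])
            p_fp.append(a / (a + b))
        # M-step: refit from labeled (hard) plus unlabeled (soft) data.
        for c in (0, 1):
            w = [(x, 1.0) for x, y in zip(x_lab, y_lab) if y == c]
            w += [(x, p if c == 1 else 1 - p) for x, p in zip(x_unlab, p_fp)]
            tot = sum(wi for _, wi in w)
            mu[c] = sum(x * wi for x, wi in w) / tot
            var[c] = sum(wi * (x - mu[c]) ** 2 for x, wi in w) / tot or 1e-6
        prior_fp = (sum(y_lab) + sum(p_fp)) / (len(x_lab) + len(x_unlab))
    return p_fp
```

High-confidence unlabeled modules (P(fp) near 0 or 1) would then be moved into the labeled set, which is the confidence-based augmentation step described above.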

Page 42: Software Quality Analysis with Limited Prior Knowledge of Faults


Semi-Supervised Classification with the EM Algorithm …

[Figure] Labeled program modules (LP = {100, 250, 500, 1000, 1500, 2000, 2500, 3000}, drawn from JM1-8850) and unlabeled program modules feed the EM algorithm for estimating class labels; confidence-based selection then moves selected unlabeled modules into the labeled set.

Page 43: Software Quality Analysis with Limited Prior Knowledge of Faults


Semi-Supervised Classification with the EM Algorithm …

- The proposed semi-supervised classification process improved generalization performance
- Semi-supervised software quality classification models generally yielded better performance than C4.5 decision trees
- About 40% to 50% of the remaining modules are likely to constitute noisy data
- The number of unlabeled modules selected for augmentation was largest when LP = 1000

Page 44: Software Quality Analysis with Limited Prior Knowledge of Faults


Conclusion

- Practical solutions to the problem of software quality analysis with limited a priori knowledge of defect data
- Empirical investigation with software measurement data from real-world projects
- Constraint-based clustering vs. semi-supervised classification:
  - Constraint-based clustering generally yielded better performance than EM-based semi-supervised classification, and has lower complexity
  - Semi-supervised classification with EM allows control of the relative balance between the Type I and Type II error rates

Page 45: Software Quality Analysis with Limited Prior Knowledge of Faults


Some Future Work

- Applying the limited defect data problem to quantitative software fault prediction models
- A software engineering study on the characteristics of program modules that remain unlabeled, investigating the software development process
- Exploring semi-supervised learning schemes for detecting noisy instances in a dataset
- Investigating self-labeling heuristics for minimizing expert involvement in the unsupervised and constraint-based clustering approaches

Page 46: Software Quality Analysis with Limited Prior Knowledge of Faults


Other SE Research Focus

- Knowledge-based software security modeling and analysis
- Studying the influence of diversity in pair programming teams
- Software engineering measurements for agile development teams and projects
- Software forensics with cyber security and education applications

Page 47: Software Quality Analysis with Limited Prior Knowledge of Faults

Software Quality Analysis with Limited Prior Knowledge of Faults

Naeem (Jim) Seliya
Assistant Professor, CIS Department
University of Michigan – Dearborn
313 583 6669
[email protected]

Questions !!!