Static code metrics vs. process metrics for software fault prediction using Bayesian network learners

Mälardalen University, School of Innovation, Design and Technology
Author: Biljana Stanić
Thesis for the Degree of Master of Science in Software Engineering (30.0 credits)
Date: 28th October, 2015
Supervisor: Wasif Afzal
Examiner: Antonio Cicchetti
“The real question is not whether machines think but whether men do. The mystery which surrounds a thinking machine already surrounds a
thinking man.”
Burrhus Frederic Skinner
Acknowledgments

I would like to express my deep appreciation to my supervisor, Dr. Wasif Afzal, for the constructive, useful suggestions and guidelines throughout my research work. I would also like to thank:

the EUROWEB Project1, funded by the Erasmus Mundus Action II programme of the European Commission; and Lech Madeyski and Marian Jureczko for letting me use their metrics repository.
Finally, I wish to thank my dear ones for their support and for believing in me all this time.
1 http://www.mrtc.mdh.se/euroweb/
Abstract
Software fault prediction (SFP) plays an important role in improving software product quality by identifying fault-prone modules. Constructing quality models involves the use of metrics, numbers or attributes that describe real-world entities according to defined rules. Examining the nature of machine learning (ML), researchers have proposed its algorithms as suitable for fault prediction, using the information contained in software metrics as the statistical data needed to build models for a given ML algorithm. One of the most widely used ML algorithms is the Bayesian network (BN), which is represented as a graph with a set of variables and relations between them. This thesis focuses on the use of process and static code metrics with BN learners for SFP. First, we provide an informal review of non-static code metrics. We then create models containing different combinations of process and static code metrics and use them to conduct an experiment. The results of the experiment are statistically analyzed using a non-parametric test, the Kruskal-Wallis test. The informal review reports that non-static code metrics are beneficial for the prediction process and that their use is highly recommended for industrial projects. The experimental results, however, do not support a conclusion about which process metric gives a statistically significant result; therefore, further investigation is needed.
Contents
Abstract
List of Figures
List of Tables
Abbreviations
1. Introduction
   1.1 Motivation
   1.2 Research questions
2. Background
   2.1 Software fault prediction
   2.2 Software metrics
      2.2.1 Static code metrics
      2.2.2 Process metrics
   2.3 Machine learning
      2.3.1 Bayesian network
         2.3.1.1 The Naive Bayes Classifier
         2.3.1.2 Augmented Naive Bayes Classifier
3. Method
   3.1 Methodology for RQ1
   3.2 Methodology for RQ2
4. Informal review
   4.1 Process metrics
      4.1.1 Code churn metrics
      4.1.2 Developer metrics
      4.1.3 Other process metrics
5. Design
   5.1 Projects
   5.2 Extracted metrics
   5.3 Evaluation of results
   5.4 Experiment
6. Results
   6.1 Statistical analysis of results
   6.2.1 Results of the experiment for NB classifier
      6.2.1.1 Models with combined, static code and process metric
      6.2.1.2 Models with static code and 1 process metrics
      6.2.1.3 Models with a combination of 2 process and static code metrics
      6.2.1.4 Models with a combination of 3 process and static code metrics
   6.2.2 Results of the experiment for TAN classifier
      6.2.2.1 Models with combined, static code and process metric
      6.2.2.2 Models with static code and 1 process metrics
      6.2.2.3 Models with a combination of 2 process and static code metrics
      6.2.2.4 Models with a combination of 3 process and static code metrics
7. Result discussion
   7.1 Related work
8. Validity threats
   8.1 Internal validity
   8.2 External validity
   8.3 Statistical conclusion validity
9. Conclusion
   9.1 Future works
      9.1.1 Model investigation
      9.1.2 Industrial projects
      9.1.3 Data extraction
Reference
A Graphs for NB classifier
B Graphs for TAN classifier
List of Figures

Figure 1. Example of NB [6]
Figure 2. Example of STAN [6]
Figure 3. Methodology for the RQ1
Figure 4. Methodology for the RQ2
Figure 5. Weka Explorer
Figure 6. Set values for NB classifier
Figure 7. Classifier output with results
Figure 8. Graphs with comparison results for NB classifier
Figure 9. Graphs with comparison results for TAN classifier
Figure 10. Graphical comparison of combined, static code and process models using NB
Figure 11. Graphical comparison of models containing 1 process and static code metrics using NB
Figure 12. Graphical comparison of models containing the combination of 2 process and static code metrics using NB
Figure 13. Graphical comparison of models containing the combination of 3 process and static code metrics using NB
Figure 14. Graphical comparison of combined, static code and process models using TAN
Figure 15. Graphical comparison of models containing 1 process and static code metrics using TAN
Figure 16. Graphical comparison of models containing the combination of 2 process and static code metrics using TAN
Figure 17. Graphical comparison of models containing the combination of 3 process and static code metrics using TAN
List of Tables

Table 1. Selected projects for the experiment
Table 2. Results for combined, static code and process metric using NB classifier
Table 3. Comparison results for combined, static code and process models for NB classifier
Table 4. Results for models containing 1 process and static code metric using NB classifier
Table 5. Comparison results for models containing static code and 1 process metrics using NB classifier
Table 6. Results for models containing combination of 2 process and static code metrics using NB classifier
Table 7. Comparison results for models containing combination of 2 process and static code metrics using NB classifier
Table 8. Results for models containing combination of 3 process and static code metrics using NB classifier
Table 9. Comparison results for models containing combination of 3 process and static code metrics using NB classifier
Table 10. Results for combined, static code and process models using TAN classifier
Table 11. Comparison results for combined, static code and process models using TAN classifier
Table 12. Results for models containing 1 process and static code metrics using TAN classifier
Table 13. Comparison results for models containing static code and 1 process metrics using TAN classifier
Table 14. Results for models containing combination of 2 process and static code metrics using TAN classifier
Table 15. Comparison results for models containing combination of 2 process and static code metrics using TAN classifier
Table 16. Results for models containing combination of 3 process and static code metrics using TAN classifier
Table 17. Comparison results for models containing combination of 3 process and static code metrics using TAN classifier
Abbreviations
SFP Software Fault Prediction
ML Machine Learning
BN Bayesian Network
DAG Directed Acyclic Graph
NB Naive Bayes
ANB Augmented Naive Bayes
TAN Tree Augmented Naive Bayes
FAN Forest Augmented Naive Bayes
STAN Selective Tree Augmented Naive Bayes
STAND Selective Tree Augmented Naive Bayes with Discarding
SFAN Selective Forest Augmented Naive Bayes
SFAND Selective Forest Augmented Naive Bayes with Discarding
ANOVA Analysis of Variance
WMC Weighted Methods per Class
DIT Depth of Inheritance Tree
NOC Number Of Children
CBO Coupling Between Object class
RFC Response For a Class
LCOM Lack of Cohesion in Methods
LCOM3 Lack of Cohesion in Methods (normalized version of LCOM)
Ca Afferent Coupling
Ce Efferent Coupling
LOC Lines Of Code
NPM Number of Public Methods
DAM Data Access Metric
MOA Measure Of Aggregation
CAM Cohesion Among Methods of class
IC Inheritance Coupling
CBM Coupling Between Methods
AMC Average Method Complexity
CC McCabe’s Cyclomatic complexity
NR Number of Revisions
NDC Number of Distinct Committers
NML Number of Modified Lines
NDPV Number of Defects in the Past Version
ROC Receiver Operating Characteristic
AUC Area Under the Curve
RQ Research Question
RQ1 Research Question 1
RQ2 Research Question 2
1. Introduction

Introducing the term ‘software engineering measurement’ and its basic characteristics creates a logical step toward defining another one: software metrics. Software metrics represent numbers or certain attributes used for describing real-world entities, formed by definite rules. Moreover, certain software quality assurance activities use such metrics; one of them is constructing quality models based on metrics. In that manner, quality metrics are beneficial for determining whether the software system delivers the intended functionality [1]. A software fault prediction (SFP) model is one type of quality model that has attracted a lot of research in the past few decades [2]. The complexity of a system brings mistakes into the code, labeled as faults2 [16], and as the system grows, their discovery is not an easy task for the developers [3]. The process of detecting faults can be long and potentially endless, since it is hard to claim that a system is 100% fault-free. It also requires additional resources during testing, which increases the costs of software development. One solution is the creation of a prediction model that guides the development team towards quicker and more efficient fault detection [3]. With the help of metrics, it is possible to define a model that is responsible for predicting faults. Hall et al. [4] claim that there are many complex models that deal with the problem of fault prediction, but despite that, there is a visible lack of information about the actual state of this area. Song et al. [2] list the three most researched problems in SFP:
o Determining the number of faults that were not identified during testing;
o Finding connections between the remaining faults; and
o Classifying components that are fault-prone.
To solve the above-mentioned problems, a lot of attention has been paid to applying machine learning techniques. Machine learning (ML) is about programming machines to obtain optimized results using statistical data or previous experience. ML uses statistical rules to build different mathematical models necessary for drawing a conclusion from a sample [15]. The nature of ML, and of some of its algorithms, is suitable for creating fault prediction models, and this has shown good results. ML algorithms use patterns to identify and classify features, e.g., whether a software component contains a fault or not. One of the most widely used ML algorithms is the Bayesian network (BN) [2]. A BN is represented through a directed acyclic graph (DAG) containing a set of variables, a structure defined by the relations between variables, and a set of probability distributions. For software products, the use of a BN can be depicted as a relationship between software features and possible faults. The network can be used for computing the probabilities of the presence of different faults for specified features. That way, we will know which faults are likely to occur and how we can control and isolate them [5].
2 A software fault causes malfunctioning of a software product
1.1 Motivation

The main arguments for this thesis are found in recent publications [6],[7], where the usage of BNs has been found to perform “surprisingly well” [6]. Dejaeger et al. [6] conducted experiments in which different BN learners with static code metrics were compared. They made this choice of metrics because such metrics are easier to gather and widely used in practice. They recorded better prediction in cases where a smaller set of highly predictive features was used. For future work, they stated:
“Recently, several researchers turned their attention to another topic of interest, i.e., the inclusion of information other than static code features into fault prediction models such as information on intermodule relations and requirement metrics. The relation to the more commonly used static code features remains however unclear. Using, e.g., Bayesian network learners, important insights into these different information sources could be gained which is left as a topic for future research.” [6]
This conclusion represents the starting point and the main motivation for the thesis definition. Dejaeger et al. [6] indicated that one interesting direction for investigating BN learners for SFP is to include non-static code metrics. They propose information on intermodule relations and requirement metrics; these two are just examples of the many other non-static code metrics that can be used for SFP. Another paper that deals with BNs for SFP, and gives further motivation, is by Okutan et al. [7]. For their experiment, they reported the performance of some static code metrics. Namely, they stated that metrics carrying information about the number of lines of code (LOC), low quality of coding style and class response were the most effective. Moreover, they presented other conclusions of their experiment and possible steps for an extension of their research:
“As a future direction, we plan to refine our research to include other software and process metrics in our model to reveal the relationships among them and to determine the most useful ones in defect prediction. We believe that rather than dealing with a large set of software metrics, focusing on the most effective ones will improve the success rate in defect prediction studies.” [7]
Following the statement in [6], Okutan et al. propose the usage of process metrics. Radjenović et al. [8] have done a systematic literature review on SFP metrics and collected results that support the statement given in [7] about process metrics and their future usage for fault prediction. They came up with the following conclusions:
o Process metrics have shown better results than static code metrics in detecting faults at the post-release level;
o Process metrics, compared to static code metrics, contain more descriptive information related to fault distribution;
o Better prediction of faulty classes was achieved using process metrics;
o Unlike object-oriented metrics, which mostly use small datasets, process metrics are applicable to larger datasets, which is an advantage when it comes to the validity and maturity of research results [4];
o Process metrics can be very beneficial, but they are mostly used in the industrial domain. Therefore, it is a challenge for researchers to use process metrics in their future studies.
1.2 Research questions

Taking into account the collected statements about SFP, BNs and the usage of non-static code metrics, the structure of the thesis is defined by the following research questions:

o RQ1: What is the current state-of-the-art with respect to the use of non-static code metrics for SFP?
o RQ2: What is the impact of process metrics combined with static code metrics, in terms of performance, using BN learners for SFP?

For RQ1, we discuss and explain in detail the current state-of-the-art of non-static code metrics, with a focus on process metrics. The research is driven by the conclusions of Radjenović et al. [8]. For RQ2, we conduct experiments in which several process metrics are combined with static code metrics and used with BN learners. The results of this thesis can give new insight into software metrics that can be beneficial for SFP. The content of the thesis is organized as follows: Section 2 contains background information about SFP, software metrics, ML and BNs. Section 3 explains the method for collecting resources for the informal review and the experiment. In Section 4, the informal review of non-static code metrics is presented. Section 5 contains information about the design of the experiment, the used datasets and the response variable. In Section 6, the results of the experiment and the statistical analysis are presented. Section 7 provides a discussion and comparison of the experimental outcomes. Section 8 explains possible validity threats. Finally, Section 9 offers conclusions and lists areas of future work.
2. Background

In this section, we describe SFP, the purpose of software metrics, and how ML can be used for fault prediction (with a focus on BN learners).
2.1 Software fault prediction

SFP, as a quality model, is recognized as an important tool for improving a software product by identifying possible faults that can occur in the system. Faults directly jeopardize software, which decreases its performance and, furthermore, its quality [16]. Software systems are designed to handle complex activities in different domains. Because of their critical nature, every system is supposed to provide quality of service for its users, and as the system grows, the number of potential faults increases. Dejaeger et al. [6] found several cases where software faults were examined in terms of reliability and bug localization. In order to assess reliability, we need to create a stochastic model that outputs the probability of fault existence once a component is executed. Combining the remaining components, we are able to estimate the reliability of the whole system (see the illustration below). Bug localization is based on the usage of certain patterns that are associated with faulty components. Using this approach, we can discover faults that were not previously detected [6]. It is thus crucial to classify and identify fault-prone modules of the software in time. Different studies in this area have shown that faults, in the majority of cases, occur in just a few modules, causing malfunctioning of the parts of the system that are in direct relation with the faulty ones. This means that the costs of production and maintenance will increase due to the effect of faults in these modules [9]. Therefore, much research has been focused on making fault predictions for the system under development. Such predictions focus on locating modules that can be shown to be fault-prone [10]. Thus, creating accurate predictions is useful for increasing the quality of the system [9]. To achieve good results, software engineers have to find a suitable prediction technique that will help them detect faults and conduct experimental validation [10]. Since direct measurement of fault-proneness is impossible, we need to use metrics for the necessary estimation. Reliable prediction results depend on the selection of software metrics: one needs to select appropriate metrics, because an unsuitable choice can make prediction harder and might result in unsatisfactory and misleading results. Once the metrics are chosen, it is necessary to decide which technique will be used for making predictions. ML is a non-parametric3 technique that is frequently used for SFP. Sometimes datasets are heavily skewed, which results in inaccurate prediction. ML techniques have a mechanism to overcome such problems, because they have an ability to “learn from imbalanced datasets” [11]. In Sections 2.2 and 2.3, we describe software quality metrics and ML as key elements in the SFP process.
3 distribution-free
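As a small illustration of the reliability estimate mentioned above (our own sketch with made-up numbers, assuming a serial system of independently failing components): if component $i$ fails on execution with probability $p_i$, the system-level reliability $R$ is the probability that no component fails:

\[ R = \prod_{i=1}^{k} (1 - p_i), \qquad \text{e.g.}\quad (1 - 0.01)(1 - 0.02)(1 - 0.05) \approx 0.92 \]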
2.2 Software metrics

Software quality metrics deserve particular attention in SFP since they can be used for measuring the quality of the system. They are divided into in-process and end-process metrics: the first group is responsible for improving the development process, unlike end-process metrics, which focus on assessing the characteristics of the final product. Based on which parts of the system they measure, there are two types of software quality metrics:

o Static metrics;
o Dynamic metrics.

Static code metrics are suitable for checking attributes of the code, such as the complexity of the software and assessing the length of the code. Dynamic code metrics to a great extent examine the behavior of the system, presented as usability, reliability, maintainability and evaluation of the efficiency of the program [6]. In Sections 2.2.1 and 2.2.2, we briefly introduce static code and process metrics, which are relevant for the further content of this thesis.
2.2.1 Static code metrics

Static code metrics are a type of quality metrics. They are typically used for measuring:

o Size (through lines of code (LOC) counts);
o Complexity (using linearly independent path counts);
o Readability (through counts of operators and distinct operands).

The calculation of static code metrics is based on parsing the source code; therefore, the process of metrics collection is automated. Because of this, it is feasible to measure metrics for the whole system, regardless of the system's size. Moreover, it is possible to make predictions about the entire system based on the metrics: developers can easily find faulty modules since they have a clear image of the system's vulnerabilities [12]. Static code metrics are easy to collect and widely used in practice; therefore, they represent a safe choice for predicting faulty software [6].
2.2.2 Process metrics

Process metrics are also used for measuring the quality of the system [8]. They can be derived from various sources:

o Developer's experience [10];
o Software change history [8], etc.

Developer's experience concerns activities that indicate how a certain part of the code (or the whole system) was developed [10]. Metrics determined from the software change history are split into two groups:

o Delta metrics;
o Code churn metrics.
Delta metrics are defined as the difference between versions of the software [8]; the result is the change in a metric's value from one version to another. For example, when we add new lines of code and save those changes, the delta value will differ between the versions. However, when we add and at the same time remove the same number of lines of code, the metric value is unchanged and the delta between those two versions is zero. Code churn metrics, in contrast, report every activity on the code, making it possible to track all changes. An advantage of process metrics is that they contain more descriptive information about the faulty part of the code. They are also good at making fault predictions at the post-release level. Since process metrics are mostly used for industrial purposes, they are able to handle large datasets, which leads to better validity of the predictions [8].
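A small numeric illustration of the difference (our own example, not taken from [8]): suppose a revision of a 110-line file adds 10 new lines and deletes 10 existing ones.

\[ \Delta_{\mathrm{LOC}} = \mathrm{LOC}_{v2} - \mathrm{LOC}_{v1} = 110 - 110 = 0, \qquad \mathrm{churn} = 10 + 10 = 20 \]

The delta metric reports no change, while the churn metric still records the activity.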
2.3 Machine learning

ML is about programming machines to obtain optimized results using statistical data or previous experience. ML uses statistical rules to build different mathematical models necessary for drawing a conclusion from a sample [15]. Those models can predict future steps, form a description based on knowledge from different data, or combine both in particular cases [15]. The first step in building a model is to train on data using certain algorithms in order to optimize for the specified problem. The learned model has to be efficient in terms of time and space complexity. ML has several application areas:

o Learning associations;
o Supervised learning;
o Unsupervised learning;
o Regression;
o Reinforcement learning.
We will use real-life examples to explain the types of ML. Learning associations are suitable for “learning a conditional probability”. The probability is presented in equation 1:

\[ P(Y \mid X) \qquad (1) \]
where Y is a variable that is conditioned on X. Moreover, X can be a single variable or a set of variables of the same type as Y. Take the example of a bookstore: Y can be a book whose purchase we condition on X. Based on customers' behavior, we know that when a customer buys book X, there is a high probability that Y will be bought as well. In supervised learning, input features are related to corresponding outputs, and the machine has to learn the rules of the mapping between these two parameters. Supervised learning is applicable to prediction tasks, where it is necessary to identify connections between different measures. Unsupervised learning, on the other hand, does not require output data; instead, it examines the structure within some input dataset. Unsupervised learning is not suitable for predicting the existence of software faults in the
system because of its nature of creating the result without previously specified output data. Classification and regression are instances of supervised learning problems. In classification, we make predictions based on a rule learned from past data. Assuming that behavior is similar in the past and the future, we can easily produce predictions for every future case. The input contains the data that we need to analyze, whereas the output is represented as classes that have a descriptive value. Regression has the same approach to selecting the input data as the classification problem, but as output we get a numerical value. In classification and regression problems we create the model shown in equation 2:

\[ y = g(x \mid \theta) \qquad (2) \]

where y is the class, in classification, or the numerical value, in regression. The model is represented as $g(\cdot)$ and the model's parameters as $\theta$. The task of ML is to optimize the values of $\theta$ in order to minimize the approximation error. Reinforcement learning takes a set of actions as input. In such a system, it is important that all actions are part of a “good policy” [15], which will lead to obtaining correct results. In this particular problem, we are not observing a single action; the task of the program is to learn which characteristics form the good policy based on a past set of actions [15]. Finally, ML has a mechanism to overcome problems of datasets that are heavily skewed and that can cause inaccurate results [11]. This basically means that it is possible to create a prediction even if some data are missing, or there is a large number of variables, etc. [17]. The BN is a type of supervised learning paradigm and one type of algorithm suitable for SFP (see Section 2.3.1).
2.3.1 Bayesian network

The structure of a BN is presented as a directed acyclic graph (DAG) consisting of variables, a structure defined by the relations between variables, and a set of probability distributions [5]. Considering the nature of SFP, a BN can be used for calculating probabilities regarding the presence of faults given the software features. The BN, graphically presented as a graph, contains the following elements:

o Variables, presented as vertices or nodes;
o Conditional dependencies, presented as edges or arcs.

The graph cannot contain cycles and all edges inside it have to be directed. BNs provide a theoretical framework that, combined with statistical data, can give good fault predictions [5]. The model for SFP can be defined as (see equation 3):

\[ D_{\mathrm{trn}} = \{(x_n, y_n)\}_{n=1}^{N} \qquad (3) \]
where $D_{\mathrm{trn}}$ is a set with $N$ observations, $x_n \in \mathbb{R}^{d}$ contains all static and non-static code features, and $y_n \in \{0, 1\}$ indicates the presence of a fault. Using Bayes' theorem to calculate the probability of fault presence, we obtain equation 4:

\[ P(y_n = 1 \mid x_n) = \frac{P(x_n \mid y_n = 1)\, P(y_n = 1)}{P(x_n)} \qquad (4) \]
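As a small worked example of equation 4 (our own illustration with made-up numbers, not thesis data): suppose 20% of modules are faulty, and the observed feature vector $x_n$ occurs in 60% of faulty modules but in only 10% of non-faulty ones. Expanding the denominator over both classes gives

\[ P(y_n = 1 \mid x_n) = \frac{0.6 \times 0.2}{0.6 \times 0.2 + 0.1 \times 0.8} = \frac{0.12}{0.20} = 0.6 \]

so the module would be flagged as more likely faulty than not.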
The concept of the BN is based on a probability distribution over stochastic variables that can be both continuous and discrete. The individual variables $x^{(i)}$ are used to construct the graph, and dependencies between those variables are presented with directed arcs. In the same way, independence between different nodes $x^{(i)}$ and $x^{(j)}$ is shown by the absence of an arc [6]. Defining and building the BN consists of three steps [14]:
o “Set” and “field” variables have to be defined;
o Network topology has to be constructed;
o Probability distribution on the local level has to be identified.
Set and field variables have to be defined
The “set” variable inspects possible factors that cause software faults, while the “field” variable maps the range of all variables to the degree of software faults. The value of the “set” variable can vary depending on the system, organization and environment where it is used. The “field” variable can be stated as “high”, “middle” or “low” (or more precisely if required).

Network topology has to be constructed
The network topology is defined using event relations. The construction of a certain topology depends on studies or literature, and it can be changed if experiments demand it.

Probability distribution on the local level has to be identified
In order to identify the probability distribution on the local level, it is required to find marginal and conditional probabilities. The probability distribution can be used to emphasize the “affect degree of causality” [14].

BN classifiers are used for problems that require classification. Other attractive features of these classifiers are presented in the following list:
o Models can be created even if the knowledge is uncertain;
o The probabilistic model can be used for cost-sensitive problems;
o The nature of BN classifiers can deal with issues related to missing data;
o Classifiers can solve complex classification problems;
o Future work on BNs includes presenting models in a hierarchical manner based on their complexity;
o There is a possibility of using BN classifiers with algorithms that have linear time complexity;
o Models that use the BN have shown very good performance [4].
In Sections 2.3.1.1 and 2.3.1.2, we present the 2 types of BN learners (classifiers) that will be used in the experiment.
2.3.1.1 The Naive Bayes Classifier

The Naive Bayes (NB) classifier is based on “conditional independence between attributes given the class label” [6], which is represented with nodes in a directed acyclic graph, where one node is the parent and the rest are children. A node corresponds to a certain value in the dataset. Results from different studies have shown that NB classifiers give good results in fault prediction. In order to calculate the fault probability for classes, each of them will have a vector of input variables for every new code segment. The resulting probabilities are obtained using “frequency counts for the discrete variables and a normal or kernel density-based method for continuous variables”. Because of their simple nature, NB classifiers can be constructed very easily and are computationally efficient. The structure of the NB is shown in Figure 1 [6].
Figure 1. Example of NB [6]
As shown in Figure 1, the NB consists of one unobserved parent variable (node y) and a number of observed children variables (x nodes).
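This structure corresponds to the standard NB factorization (notation ours, consistent with equation 4): the conditional independence assumption lets the class-conditional likelihood decompose into a product over the individual features,

\[ P(y \mid x^{(1)}, \ldots, x^{(d)}) \;\propto\; P(y) \prod_{i=1}^{d} P(x^{(i)} \mid y). \]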
2.3.1.2 Augmented Naive Bayes Classifier

Augmented Naive Bayes (ANB) classifiers were created as a modification of the basic conditional independence assumption of the NB. The modifications change the graph by adding new arcs and removing unnecessary variables. One
example of an ANB classifier is the Tree Augmented Naive Bayes (TAN), where each variable has one more parent. The Semi-Naive Bayesian classifier, on the other hand, “partitions the variables into pairwise disjoint groups”. Finally, the Selective Naive Bayes omits some variables in order to cope with correlation between attributes. There are several ANB classifiers:
o Tree Augmented Naive Bayes (TAN);
o Forest Augmented Naive Bayes (FAN);
o Selective Tree Augmented Naive Bayes (STAN);
o Selective Tree Augmented Naive Bayes with Discarding (STAND);
o Selective Forest Augmented Naive Bayes (SFAN);
o Selective Forest Augmented Naive Bayes with Discarding (SFAND).
A graphical representation of the STAN is shown in Figure 2.
Figure 2. Example of STAN [6]
Figure 2 shows that each child node (x) is allowed to have one additional parent besides the class node. We will use the NB as a baseline for the BN classifiers, since it shows a clear image of all dependencies between attributes [6].
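For comparison with the NB factorization given above, a TAN-style model gives every attribute one extra attribute parent, written here as pa(i) (notation ours; the root attribute of the tree keeps only the class as its parent):

\[ P(y \mid x^{(1)}, \ldots, x^{(d)}) \;\propto\; P(y) \prod_{i=1}^{d} P\left(x^{(i)} \mid y,\, x^{(\mathrm{pa}(i))}\right). \]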
3. Method

This thesis consists of 2 parts, one per RQ. We answer RQ1 by conducting an informal review of the field, and for RQ2 we conduct an experiment.
3.1 Methodology for RQ1

In order to answer RQ1, we have to collect material related to:
o SFP;
o Metrics;
o ML; and
o BN.
Our goal is not to conduct a systematic literature review on the topics of SFP, software metrics or ML and its algorithms, since that has already been done in several studies. Instead, we use the collected sources to present the state-of-the-art for non-static code metrics in the process of SFP. In the following, we describe the procedure that was followed when selecting papers relevant for the thesis topic:

Step 1: Define keywords relevant for RQ1:
o SFP;
o Software defect prediction;
o Code metrics;
o ML;
o BN.
Step 2: Define filters in terms of publication years and paper types
Selected papers were published between 2010 and 2015, and each of them has to be a journal and/or conference paper. The 5-year range was set to review only recently published results, which also include some recent literature reviews on the topic.

Step 3: Use keywords and filters in several databases:
o IEEExplore4;
o ACM Digital Library5;
o Scopus6 (used for validation of papers that were found in the first two databases).
Step 4: Collect and select results
To get suitable results from the databases, the search string often contained a combination of terms using Boolean AND and OR operators, such as:

(TITLE-ABS-KEY AND TITLE-ABS-KEY) AND DOCTYPE(ar OR re) AND (LIMIT-TO(SUBJAREA) AND RECENT()7
4 http://ieeexplore.ieee.org/Xplore/home.jsp
5 http://dl.acm.org/
6 http://www.scopus.com/
7 The example of the search string is taken from the Scopus database.
Papers found in the Scopus database with the SFP and process metrics keywords were used for snowball sampling, collecting other papers relevant for this topic within the earlier mentioned 5-year range. During the selection process, papers that were irrelevant but appeared in the search results were rejected, and other new papers were taken into consideration. The majority of papers were found in the IEEExplore database.

Step 5: Present the informal review
Present all relevant information regarding non-static code metrics. Figure 3 illustrates the activities that are part of the methodology for answering RQ1.
Figure 3. Methodology for the RQ1
3.2 Methodology for RQ2

We identified the following activities, suggested in [15], for answering RQ2:
o Find suitable projects with datasets containing extracted static and process metrics;
o Decide upon a response variable;
o Choose the design of the experiment;
o Conduct the experiment for the projects, using the previously selected BN classifiers;
o Compare results using a statistical analysis;
o Publish conclusions about the results of the statistical analysis.
Step 1: Find suitable projects with datasets containing extracted static and process metrics
We will analyze datasets available in the Metrics Repository8, collected from different open-source projects. The datasets were used in the study of Madeyski et al. [18].

Step 2: Decide upon a response variable
As the quality measure, we will use the area under the curve, AUC, proposed in [6]. The AUC is specified by an average value of performance over all thresholds.

Step 3: Choose the design of the experiment
Our experimental design is based on replication [15], which means that the experiment is run a defined number of times in order to find an average value of the specific quality measure. More specifically, we will use cross-validation.

Step 4: Conduct the experiment for the projects, using the previously selected BN classifiers
As presented in the Background section, we will use the NB and TAN classifiers for the purposes of the experiment. Using the specified classifiers with the datasets, we create a learner. By training multiple times and then testing with the validation sets, we get results for the desired measure [15]. An implementation of the classifiers is available through Weka9, suggested in [6], which speeds up the process of collecting prediction results for the chosen datasets.

Step 5: Compare results using a statistical analysis
The statistical analysis of the data is conducted in order to provide objective and unbiased results regarding the comparison of different datasets. Since we are operating with 17 datasets and 2 classifiers, we will decide between a one-way analysis of variance (ANOVA) and the Kruskal-Wallis test once we collect the results from the experiment.

Step 6: Publish conclusions about the results of the statistical analysis
The statistical analysis will answer whether the samples of the datasets are significantly different or not. In case there are differences, we will discuss which datasets are superior to others; otherwise, we will give an explanation and possible improvements for future experiments.
8 http://purl.org/MarianJureczko/MetricsRepo
9 http://www.cs.waikato.ac.nz/ml/weka/documentation.html
In Figure 4, we illustrate the activities that are part of the methodology for answering RQ2.
Figure 4. Methodology for the RQ2
We will describe the experiment with detailed requirements in Section 5: Design.
4. Informal review

To answer RQ1, we collected recent papers that deal with SFP using non-static code metrics. The best sources of information were found in the form of a systematic literature review and studies whose objectives were to investigate the performance of process metrics for various open-source and industrial projects and software systems. We classified relevant findings into several groups based on their occurrence in the papers. The groups are the following:
o Process metrics in general;
o Code churn metrics;
o Developer metrics;
o Other process metrics.
4.1 Process metrics

In their systematic literature review, Radjenović et al. [8] analyzed 106 papers, concluding that process metrics represent only 24% of the metrics used for the purposes of fault prediction. Moreover, they identified that process metrics, such as the number of different developers that worked on the same file, the number of changes made to some file, the age of a module, etc., are better for fault detection in the post-release phase of software development than some static code metrics. These metrics are mainly produced by extracting the source code and its history from the repository. Results gained from different studies have pointed out several advantages of process over static code metrics:
o Process metrics provide a better description regarding the distribution of faults in the software;
o Used for Java-based applications, process metrics can provide better models in terms of cost-effectiveness;
o Process metrics performed best in cases of predicting faulty classes using the ROC area.
However, some studies have shown that process metrics did not perform well because they were used in the pre-release phase of software development. Results of the conducted experiments have confirmed that process metrics perform better, and have to be used, in the post-release phase, which explains the previously mentioned issues in prediction [8]. In [4], a combination of code, process and static metrics is suggested for better prediction.

Xia et al. [20] analyzed the performance of code and process metrics for TT&C software. They emphasize the benefits of process metrics during different development phases, such as requirements analysis, design and coding, and list 16 process metrics, each suitable for a specific phase. In the early stages of software development, the analysis of requirements and their maturity are metrics related to the detection of faulty software modules. Therefore, it is important to eliminate errors that occur in the requirements phase, since these can be transferred to other stages of development, i.e., the design and coding phases. Another important feature of process metrics is tracking historical
changes within a file or a version of the software, which could give insight into possible faults. They concluded that process metrics combined with code metrics increased the accuracy level, while the error rate decreased.
4.1.1 Code churn metrics

A reported advantage of process metrics is that they can be assessed using large datasets, which is an important feature when predictions are needed for industrial projects. However, studies have shown that process metrics are used for academic rather than industrial purposes. Code churn metrics are the ones that performed best in the industrial case. These metrics are used to calculate the change between different versions of the software. Taking into account the benefits of code churn metrics, it is suggested that researchers should turn their interest to industrial projects, combining and testing them using those metrics [8]. Moreover, according to [8],[19], code churn metrics performed better than cyclomatic complexity in the case of fault density prediction.
4.1.2 Developer metrics

Opinions about developers' data are divided. While some studies show that developer metrics improved prediction [4],[23], others claim that those metrics were not useful or had a minor impact [4],[8],[22]. Matsumoto et al. [21] conducted an experiment to determine whether developer metrics can be beneficial for software reliability. They defined two types of metrics, describing developers' activities and the modules that were inspected by developers. After performing the experiment, in order to test their hypothesis regarding the metrics, they offered several conclusions:
o The probability that a developer will introduce a fault in the newer version is higher if s(he) previously worked in the same manner;
o More faults can occur in subsystems that were modified by a larger number of developers;
o The usage of developer metrics is beneficial for fault prediction.
4.1.3 Other process metrics

Along with code churn, some of the metrics that provided the best results in fault prediction are the age of a module, the number of changes made to a file or module, and the changed set size. Their usage is recommended in large industrial systems because of their reliability and efficiency in predicting fault-prone modules [8].
5. Design

In this section, we define the selected projects, the collected datasets and the design of the experiment.

5.1 Projects

We investigated 5 projects that contain static code and process metrics. The projects are Java-based, which provides consistency in terms of language domain. In [18], the authors emphasize their inability to extract all process metrics for each project; during dataset selection, datasets that did not contain data for most of the process metrics were excluded. Table 1 presents a description of the projects. An explanation of the process metrics is provided in Section 5.2.

Table 1. Selected projects for the experiment
| Project   | Version(s)          | Description                                                                                | Extracted process metrics |
|-----------|---------------------|--------------------------------------------------------------------------------------------|---------------------------|
| Ant10     | 1.4; 1.5; 1.6; 1.7  | Java-based build tool                                                                       | NR, NDC, NML, NDPV        |
| jEdit11   | 4.0; 4.1            | Java-based cross-platform text editor                                                       | NR, NDC, NML, NDPV        |
| Synapse12 | 1.1; 1.2            | Enterprise Service Bus that supports different protocols and ways of information exchange   | NR, NDC, NML, NDPV        |
| Xalan13   | 2.5.0; 2.6.0; 2.7.0 | XSLT processor used for transforming XML into different file formats                        | NR, NDC, NML, NDPV        |
| Xerces14  | 1.2.0; 1.3.0; 1.4.4 | Parser that supports XML 1.0 and provides advanced parsing functionality                    | NR, NDC, NML, NDPV        |
5.2 Extracted metrics

Madeyski et al. [18] used 2 types of tools to collect metrics from the project repositories. The metrics were calculated using the ckjm15 program, and the BugInfo16 tool was used to detect bugs from the log history. In order to identify bugs from commits, regular expressions were formalized for each project; bugs were counted by comparing commit comments against the regular expressions. The static code metrics relevant for the experiment are:

10 http://ant.apache.org/
11 http://www.jedit.org/
12 http://synapse.apache.org/
13 http://xml.apache.org/xalan-j/
14 http://xerces.apache.org/xerces-j/
15 http://www.spinellis.gr/sw/ckjm/
16 https://kenai.com/projects/buginfo
o Weighted methods per class (WMC) returns the number of methods of the specific class;
o Depth of inheritance tree (DIT) gives the number of inheritance levels starting from the Object class;
o Number of children (NOC) calculates the number of descendants of the specific class;
o Coupling between Object class (CBO) indicates the number of classes that are linked to the specific class in the form of method calls and/or arguments, field declarations, inheritance, exceptions, etc.;
o Response for a class (RFC) represents the total number of class methods as well as the methods called within the bodies of the class methods;
o Lack of cohesion in methods (LCOM) returns the number of methods that do not share any class fields. For LCOM3, the result lies in the range of 0 to 2 (a worked example follows after this list). LCOM3 is presented in equation 5:

\[ \mathrm{LCOM3} = \frac{\frac{1}{a}\sum_{j=1}^{a}\mu(A_j) - m}{1 - m} \qquad (5) \]

where $a$ is the number of attributes of a class, $m$ is the total number of methods used in the class, and $\mu(A_j)$ is the number of methods that access the attribute $A_j$.
o Afferent couplings (Ca) returns the number of classes that are dependent on some observed class;
o Efferent couplings (Ce) indicates the number of classes that the observed class depends on;
o Lines of code (LOC) returns the total number of class fields, methods and code written in the body of methods;
o Number of public methods (NPM) represents the number of all methods whose access modifier has a value defined as public;
o Data access metric (DAM) is the ratio of private and protected attributes to the total number of attributes of the class;
o Measure of aggregation (MOA) gives the number of class fields whose types are user-defined classes;
o Cohesion among methods of class (CAM) returns the number of methods that are related to each other within the class based on the method’s parameter list;
o Inheritance coupling (IC) returns the number of base classes to which an examined class is connected;
o Coupling between methods (CBM) gives the number of inherited methods that are coupled to modified or new methods of the class;
o Average method complexity (AMC) returns the average size of the methods within a class;
o McCabe's cyclomatic complexity (CC) counts the number of linearly independent paths within a method, calculated from the edges, nodes and connected components of the method's control-flow graph. CC is defined in equation (6):

$$CC = E - N + P \qquad (6)$$
Mälardalen University Master thesis
29
where E is the number of edges, N the number of nodes and P the number of connected components of the graph [18].
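As a quick worked example of equation (5), take a hypothetical class with a = 2 attributes and m = 3 methods, where $\mu(A_1) = 2$ and $\mu(A_2) = 1$:

$$LCOM3 = \frac{\frac{1}{2}(2 + 1) - 3}{1 - 3} = \frac{-1.5}{-2} = 0.75$$

A fully cohesive class, where every method accesses every attribute, yields LCOM3 = 0, while a class whose methods access no attributes at all yields $m/(m-1)$, i.e., up to 2.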
Process metrics that were extracted (Table 1) and used for the experiment are:
o Number of Revisions (NR) gives the number of committed revisions of a Java class during the development process;
o Number of Distinct Committers (NDC) returns the number of different developers that committed changes to the specific Java class;
o Number of Modified Lines (NML) shows the number of added or removed lines of code for the specific Java class;
o Number of Defects in the Past Version (NDPV) gives the number of faults repaired in the previous version of the system [18].
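As a toy illustration of the bug-identification step described above (this is not the actual BugInfo implementation; the pattern and the commit comments are hypothetical), commit comments are matched against a project-specific regular expression and the matching commits are counted:

```java
import java.util.List;
import java.util.regex.Pattern;

public class BugCommitCounter {
    public static void main(String[] args) {
        // Hypothetical project-specific pattern for bug-fixing commits:
        // a "fix"/"fixed"/"fixes" or "bug" keyword followed by an issue number.
        Pattern bugFix = Pattern.compile("(?i)\\b(fix(e[sd])?|bug)\\b.*#?\\d+");

        List<String> commitComments = List.of(
                "Fixes bug #1042 in the XML parser",
                "Refactor build script",
                "fixed 311: NPE on empty input");

        // Count the commit comments that match the bug-fix pattern.
        long bugCommits = commitComments.stream()
                .filter(c -> bugFix.matcher(c).find())
                .count();
        System.out.println("Bug-fixing commits: " + bugCommits); // prints 2
    }
}
```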
5.3 Evaluation of results
The value of the output model can be presented using various performance measures. According to [2],[6], one of the most reliable measures is the receiver operating characteristic (ROC). The ROC is a curve that displays the trade-off between the true and false positive rates over all classification thresholds in the range 0 to 1. The true positive rate is given in equation (7):

$$TPR = \frac{TP}{TP + FN} \qquad (7)$$

where TP is the number of true positives and FN the number of false negatives. Although the ROC curve gives excellent results when comparing the performance of classifiers, practice has shown that experimenters prefer a single numeric value that clearly summarizes the end result. For that purpose, the area under the curve (AUC) was proposed as one of the solutions [6]. The AUC aggregates the performance over all thresholds; the closer the AUC is to 1, the better the prediction reported by the model. The AUC also makes it possible to compare the output model with random prediction without considering the ratio of defective files [6].
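The threshold-free nature of the AUC can be made concrete: it equals the probability that a randomly chosen faulty module receives a higher predicted score than a randomly chosen fault-free one (the Mann-Whitney formulation). The following minimal Java sketch, with hypothetical scores and labels, illustrates this; it is not the computation Weka performs internally:

```java
public class AucSketch {

    // AUC via the Mann-Whitney formulation: the fraction of
    // (faulty, fault-free) pairs in which the faulty module scores higher.
    // Ties count as half a win.
    static double auc(double[] scores, boolean[] faulty) {
        double pairs = 0, wins = 0;
        for (int i = 0; i < scores.length; i++) {
            if (!faulty[i]) continue;                // i ranges over faulty modules
            for (int j = 0; j < scores.length; j++) {
                if (faulty[j]) continue;             // j ranges over fault-free modules
                pairs++;
                if (scores[i] > scores[j]) wins++;
                else if (scores[i] == scores[j]) wins += 0.5;
            }
        }
        return pairs == 0 ? Double.NaN : wins / pairs;
    }

    public static void main(String[] args) {
        // Hypothetical predicted fault probabilities and true labels.
        double[] scores = {0.9, 0.8, 0.55, 0.4, 0.3, 0.1};
        boolean[] faulty = {true, false, true, false, false, false};
        System.out.printf("AUC = %.3f%n", auc(scores, faulty)); // prints AUC = 0.875
    }
}
```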
5.4 Experiment
We performed the experiment in Weka using the NB and TAN classifiers on the given static code and process metrics. Weka provides the Explorer interface, shown in Figure 5, where datasets can be tested and the performance of different classifiers compared.
Figure 5. Weka Explorer
In order to prevent errors that can occur in datasets containing String values, such as the project identifier and the name of the examined class, we used the meta.FilteredClassifier, which lets us combine the NaiveBayes and BayesNet classifiers with a preprocessing filter. Selecting BayesNet with the TAN search algorithm tells Weka to test the datasets using the TAN classifier, while filtering with the unsupervised attribute filter StringToWordVector handles the String values. Figure 6 shows the example settings for the NaiveBayes classifier.
Figure 6. Set values for NB classifier
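For reproducibility, the same setup can be scripted against Weka's Java API instead of clicked through the Explorer. The following is a minimal sketch under stated assumptions: the dataset file name is hypothetical, the class (bug) attribute is assumed to be the last one and already binarized to {0, 1}, and the class names are from the Weka 3 API:

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.bayes.net.search.local.TAN;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class SfpExperiment {
    public static void main(String[] args) throws Exception {
        // Hypothetical dataset; the bug attribute is assumed last and binary {0, 1}.
        Instances data = new DataSource("ant-1.7.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // NB wrapped in a FilteredClassifier so that String attributes
        // (project and class name) are vectorized before learning.
        FilteredClassifier nb = new FilteredClassifier();
        nb.setFilter(new StringToWordVector());
        nb.setClassifier(new NaiveBayes());

        // BayesNet with the TAN search algorithm, wrapped the same way.
        BayesNet bayesNet = new BayesNet();
        bayesNet.setSearchAlgorithm(new TAN());
        FilteredClassifier tan = new FilteredClassifier();
        tan.setFilter(new StringToWordVector());
        tan.setClassifier(bayesNet);

        for (Classifier c : new Classifier[] {nb, tan}) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1)); // 10-fold CV
            // Index 1 = the "faulty" class value, assuming 0 is listed first.
            System.out.printf("AUC = %.3f%n", eval.areaUnderROC(1));
        }
    }
}
```

Seeding the Random makes the fold assignment reproducible across runs.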
We ran the experiment using 10-fold cross-validation. This experimental design is suitable for small datasets: the experiment is repeated a number of times, and in each iteration the data is split differently. The idea is to generate training and test sets for the learners while ensuring the lowest possible overlap between the created sets. The final model value is the average of the results obtained in each iteration [15]. The classifier output, presented in Figure 7, contains several values produced by the prediction.
Figure 7. Classifier output with results
As can be seen in Figure 7, the output reports several values for the defined classes, such as correctly and incorrectly classified instances, true positive and false positive rates, precision, etc. In our evaluation of the results, we use the values of the ROC Area. We defined the values 0 and 1 for the binary classifier: in every file, each class is labeled either 0, indicating that the class is fault free, or 1 in case of fault existence; this is determined by checking the bug-fix value in the datasets. We built the following models for our experiment:

o 1 combined model created from all process and static code metrics;
o 1 model containing only static code metrics;
o 1 model containing only process metrics;
o 4 models built by adding 1 of the process metrics to the static code metrics (e.g., adding the NR values to a dataset that otherwise contains only static code metrics);
o 6 models combining 2 process metrics with all static code metrics;
o 4 models combining 3 process metrics with all static code metrics.

In total this yields 1 + 1 + 1 + 4 + 6 + 4 = 17 models, covering all possible combinations of the 4 process metrics.
6. Results
In our experiment, we had 18 static code metrics, 4 process metrics and the metric that contains information about bug fixes. We used the AUC as the quality measure and the response variable for each project, where the final value is the average over the 10 folds of cross-validation. We first present the statistical analysis used to compare the results of the experiment; the prediction results, along with the statistical analysis, are presented in Sections 6.2.1 and 6.2.2.
6.1 Statistical analysis of results
Every ML experiment, regardless of the type of experimentation, has to undergo certain steps related to data analysis. Before discussing the results of the experiment, we need to carefully choose an appropriate statistical test to compare the collected results, ensuring objective conclusions [15]. We introduce the term samples to refer to the results of the experiment [24]. Two types of statistical analysis deal with the comparison of two or more sample means and/or classifiers, which is what our case requires. The first is the analysis of variance (ANOVA) [15],[26],[27], a parametric test whose basic idea is to compare means by comparing the variances of the given samples [27]. The assumptions of normally distributed samples and equal variances have to be met, which sometimes makes ANOVA unsuitable for ML studies [24]. The second is the Kruskal-Wallis test [15],[25], a non-parametric test that requires no assumptions about the normal distribution of the data or equal variances [25]. According to [15], a non-parametric test is more applicable when more than 2 datasets have to be compared, because one classifier can show a different behavior for different datasets; in that situation we cannot claim that the error values for the tested datasets are normally distributed [15]. Considering all of these arguments, we decided to use the Kruskal-Wallis test and defined 2 hypotheses:
o H0: There is no significant difference in performance between datasets.
o H1: There is a significant difference in performance for at least 1 dataset.
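The test statistic on which the decision is based has the standard form: all observations from the k compared samples are ranked jointly, and the per-sample rank sums are compared:

$$H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1)$$

where $n_i$ is the size and $R_i$ the rank sum of sample i, and $N = \sum_{i=1}^{k} n_i$. Under the null hypothesis, H approximately follows a chi-squared distribution with k - 1 degrees of freedom, from which the p-value is obtained.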
Once we obtain the comparison results, we consider the p-value, representing the probability of the observed outcome under the null hypothesis, as the indicator of whether the null hypothesis should be rejected with regard to the significance criterion α. We set α = 0.05; if the p-value is smaller than α, we conclude that the null hypothesis is rejected and that there is a difference in performance for at least 1 dataset [26].
6.2.1 Results of the experiment for NB classifier
In this section, we present the experimental results for the NB classifier, along with the statistical analysis. Each table contains different models, chosen based on the structure of the datasets, and the comparison results are reported below each table. The models are grouped into 4 tables based on the number of process metrics in the dataset.
6.2.1.1 Models with combined, static code and process metrics
In Table 2, we present the AUC values of each version of the selected projects for the combined, static code and process models using the NB classifier.

Table 2. Results for the combined, static code and process models using the NB classifier

Project   Version   Combined   SC17    Process
Ant       1.4       0.805      0.736   0.827
          1.5       0.824      0.813   0.816
          1.6       0.847      0.842   0.816
          1.7       0.871      0.852   0.856
jEdit     4.0       0.896      0.855   0.915
          4.1       0.934      0.887   0.919
          4.3       0.78       0.795   0.922
Synapse   1.1       0.753      0.765   0.746
          1.2       0.785      0.774   0.793
Xalan     2.5.0     0.807      0.788   0.822
          2.6.0     0.843      0.846   0.764
          2.7.0     0.965      0.947   0.906
Xerces    1.2.0     0.878      0.861   0.936
          1.3.0     0.845      0.836   0.811
          1.4.4     0.857      0.856   0.876
Observing the raw data in Table 2, we can see that in 46.66% of cases the process model had the best AUC value, followed by the combined and static code models with 40% and 13.33%, respectively. In Table 3, we present the statistical results for the combined, static code and process models.

Table 3. Comparison results for the combined, static code and process models for the NB classifier
Method                  P-Value
Not adjusted for ties   0.7184
Adjusted for ties       0.7184
We report 2 p-values for the analyzed models, one not adjusted and one adjusted for ties. Ties occur when the same value appears in 2 or more of the tested samples. We observe the p-value of the method that is not adjusted for ties, since it reports the greater, more conservative, value. The Kruskal-Wallis test reported a statistical significance of 0.7184 (i.e., p = 0.7184) for the method that is not adjusted
17 The abbreviation “SC” (referring to static code metrics) will be used only in tables for purposes of a better utilization of a space in columns.
for ties (see Table 3). This is above 0.05 and, therefore, we can conclude that there is no basis to reject the null hypothesis.
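For completeness, the variant adjusted for ties divides H by the standard tie-correction factor:

$$H' = \frac{H}{1 - \sum_j (t_j^3 - t_j) / (N^3 - N)}$$

where $t_j$ is the number of tied observations in the j-th group of ties and N is the total number of observations. With few ties the correction is negligible, which is why the two p-values in Table 3 coincide.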
6.2.1.2 Models with static code and 1 process metric
In Table 4, we present the AUC values for the 4 models containing all static code metrics and exactly 1 process metric (NR, NDC, NML or NDPV) using the NB classifier.

Table 4. Results for models containing 1 process metric and static code metrics using the NB classifier

Project   Version   SC + NR   SC + NDC   SC + NML   SC + NDPV
Ant       1.4       0.753     0.76       0.772      0.732
          1.5       0.813     0.824      0.811      0.812
          1.6       0.841     0.85       0.841      0.842
          1.7       0.852     0.871      0.848      0.856
jEdit     4.0       0.844     0.907      0.842      0.86
          4.1       0.885     0.927      0.882      0.908
          4.3       0.769     0.828      0.776      0.788
Synapse   1.1       0.761     0.764      0.761      0.763
          1.2       0.78      0.782      0.773      0.778
Xalan     2.5.0     0.792     0.802      0.786      0.793
          2.6.0     0.844     0.853      0.832      0.852
          2.7.0     0.95      0.961      0.962      0.954
Xerces    1.2.0     0.871     0.859      0.855      0.873
          1.3.0     0.834     0.85       0.827      0.842
          1.4.4     0.855     0.867      0.849      0.855
In 80% of cases, the model with static code and NDC metrics had the best AUC value, followed by the models with the NML and NDPV metrics with 13.33% and 6.66%, respectively.

Table 5. Comparison results for models containing static code and 1 process metric using the NB classifier
Method                  P-Value
Not adjusted for ties   0.6460
Adjusted for ties       0.6459
The statistical test for the models built on static code and 1 process metric reported a statistical significance of 0.6460 (i.e., p = 0.6460), shown in Table 5. This is above 0.05 and, therefore, we can conclude that there is no basis to reject the null hypothesis.
6.2.1.3 Models with a combination of 2 process and static code metrics
In Table 6, we present the AUC values for the 6 models containing a combination of 2 process metrics and all static code metrics using the NB classifier.

Table 6. Results for models containing a combination of 2 process metrics and static code metrics using the NB classifier

Project   Version   SC+NR+NDC   SC+NR+NML   SC+NR+NDPV   SC+NDC+NML   SC+NDC+NDPV   SC+NML+NDPV
Ant       1.4       0.775       0.787       0.749        0.794        0.756         0.769
          1.5       0.825       0.812       0.812        0.823        0.822         0.81
          1.6       0.849       0.839       0.841        0.85         0.85          0.841
          1.7       0.872       0.846       0.855        0.869        0.873         0.852
jEdit     4.0       0.903       0.83        0.849        0.903        0.907         0.847
          4.1       0.929       0.877       0.901        0.925        0.938         0.903
          4.3       0.804       0.752       0.763        0.812        0.819         0.77
Synapse   1.1       0.759       0.757       0.759        0.759        0.762         0.759
          1.2       0.786       0.78        0.782        0.78         0.784         0.777
Xalan     2.5.0     0.804       0.79        0.796        0.799        0.806         0.792
          2.6.0     0.851       0.83        0.85         0.841        0.858         0.84
          2.7.0     0.959       0.962       0.954        0.966        0.964         0.965
Xerces    1.2.0     0.869       0.861       0.879        0.855        0.875         0.868
          1.3.0     0.847       0.824       0.839        0.843        0.854         0.833
          1.4.4     0.864       0.849       0.853        0.863        0.864         0.849
In 60% of cases, the model with static code, NDC and NDPV metrics had the best AUC value, followed by the models with static code, NDC and NML, and static code, NR and NDC metrics, each with 16.66%. The model with static code, NR and NDPV metrics had the best AUC value in 6.66% of cases.

Table 7. Comparison results for models containing a combination of 2 process metrics and static code metrics using the NB classifier
Method                  P-Value
Not adjusted for ties   0.6847
Adjusted for ties       0.6845
In Table 7, the Kruskal-‐Wallis test has shown the statistical significance of 0.6847 (i.e., p=0.6847). This is above 0.05 and, therefore, we can conclude that there is no basis to reject the null hypothesis.
6.2.1.4 Models with a combination of 3 process and static code metrics
In Table 8, we present the AUC values for the 4 models containing a combination of 3 process metrics and all static code metrics using the NB classifier.

Table 8. Results for models containing a combination of 3 process metrics and static code metrics using the NB classifier

Project   Version   SC+NR+NDC+NML   SC+NR+NDC+NDPV   SC+NR+NML+NDPV   SC+NDC+NML+NDPV
Ant       1.4       0.807           0.772            0.784            0.791
          1.5       0.824           0.824            0.812            0.822
          1.6       0.848           0.849            0.838            0.849
          1.7       0.869           0.874            0.849            0.871
jEdit     4.0       0.895           0.903            0.835            0.903
          4.1       0.926           0.936            0.894            0.936
          4.3       0.786           0.797            0.745            0.803
Synapse   1.1       0.755           0.757            0.755            0.757
          1.2       0.786           0.786            0.781            0.784
Xalan     2.5.0     0.803           0.807            0.795            0.805
          2.6.0     0.838           0.855            0.837            0.847
          2.7.0     0.964           0.962            0.963            0.968
Xerces    1.2.0     0.862           0.882            0.871            0.873
          1.3.0     0.84            0.852            0.829            0.848
          1.4.4     0.86            0.862            0.848            0.86
In 60% of cases, the model with static code, NR, NDC and NDPV metrics had the best AUC value, followed by the models with static code, NDC, NML and NDPV, and static code, NR, NDC and NML metrics with 26.66% and 10%, respectively.

Table 9. Comparison results for models containing a combination of 3 process metrics and static code metrics using the NB classifier
Method                  P-Value
Not adjusted for ties   0.6326
Adjusted for ties       0.6323
The Kruskal-‐Wallis test reported the statistical significance of 0.6326 (i.e., p=0.6326), shown in Table 9. This is above 0.05 and, therefore, we can conclude that there is no basis to reject the null hypothesis.
6.2.2 Results of the experiment for TAN classifier
In this section, we present the experimental results for the previously defined test cases using the TAN classifier. As mentioned before, the tables contain the different models along with the results of the statistical analysis.
6.2.2.1 Models with combined, static code and process metrics
In Table 10, we present the AUC values for the combined, static code and process models using the TAN classifier.

Table 10. Results for the combined, static code and process models using the TAN classifier

Project   Version   Combined   SC      Process
Ant       1.4       0.828      0.783   0.811
          1.5       0.819      0.795   0.82
          1.6       0.895      0.861   0.875
          1.7       0.893      0.859   0.883
jEdit     4.0       0.94       0.882   0.934
          4.1       0.935      0.886   0.927
          4.3       0.851      0.425   0.864
Synapse   1.1       0.802      0.802   0.606
          1.2       0.809      0.817   0.69
Xalan     2.5.0     0.846      0.812   0.842
          2.6.0     0.879      0.856   0.825
          2.7.0     0.982      0.971   0.965
Xerces    1.2.0     0.894      0.886   0.879
          1.3.0     0.871      0.838   0.8
          1.4.4     0.978      0.926   0.961
We can see that in 76.66% of cases the combined model had the best AUC value, followed by the process and static code models with 13.33% and 10%, respectively.

Table 11. Comparison results for the combined, static code and process models using the TAN classifier
Method                  P-Value
Not adjusted for ties   0.2833
Adjusted for ties       0.2833
After performing the Kruskal-Wallis test (see Table 11), we report a statistical significance of 0.2833 (i.e., p = 0.2833). This is above 0.05 and, therefore, we can conclude that there is no basis to reject the null hypothesis.
6.2.2.2 Models with static code and 1 process metric
In Table 12, we present the AUC values for the 4 models containing all static code metrics and exactly 1 process metric (NR, NDC, NML or NDPV) using the TAN classifier.

Table 12. Results for models containing 1 process metric and static code metrics using the TAN classifier

Project   Version   SC + NR   SC + NDC   SC + NML   SC + NDPV
Ant       1.4       0.775     0.832      0.782      0.783
          1.5       0.795     0.807      0.794      0.795
          1.6       0.862     0.886      0.863      0.86
          1.7       0.887     0.884      0.872      0.87
jEdit     4.0       0.904     0.939      0.892      0.892
          4.1       0.914     0.926      0.903      0.905
          4.3       0.426     0.64       0.837      0.425
Synapse   1.1       0.802     0.802      0.802      0.802
          1.2       0.812     0.822      0.814      0.817
Xalan     2.5.0     0.826     0.843      0.812      0.819
          2.6.0     0.86      0.87       0.856      0.868
          2.7.0     0.971     0.98       0.971      0.975
Xerces    1.2.0     0.881     0.899      0.885      0.88
          1.3.0     0.865     0.855      0.844      0.838
          1.4.4     0.93      0.969      0.926      0.927
The model with static code and NDC metrics had the best AUC value in 75% of cases, followed by the models with the NR and NML metrics with 15% and 8.33%, respectively.

Table 13. Comparison results for models containing static code and 1 process metric using the TAN classifier
Method                  P-Value
Not adjusted for ties   0.8512
Adjusted for ties       0.8512
In Table 13, the Kruskal-Wallis test shows a statistical significance of 0.8512 (i.e., p = 0.8512). This is above 0.05 and, therefore, we can conclude that there is no basis to reject the null hypothesis.
6.2.2.3 Models with a combination of 2 process and static code metrics
In Table 14, we present the AUC values of each version of the selected projects for the 6 models containing a combination of 2 process metrics and all static code metrics using the TAN classifier.

Table 14. Results for models containing a combination of 2 process metrics and static code metrics using the TAN classifier

Project   Version   SC+NR+NDC   SC+NR+NML   SC+NR+NDPV   SC+NDC+NML   SC+NDC+NDPV   SC+NML+NDPV
Ant       1.4       0.835       0.775       0.775        0.825        0.832         0.782
          1.5       0.82        0.794       0.795        0.809        0.807         0.794
          1.6       0.889       0.871       0.861        0.889        0.884         0.863
          1.7       0.898       0.88        0.891        0.885        0.891         0.878
jEdit     4.0       0.944       0.906       0.904        0.942        0.936         0.897
          4.1       0.933       0.916       0.916        0.933        0.935         0.918
          4.3       0.639       0.836       0.426        0.851        0.64          0.837
Synapse   1.1       0.802       0.802       0.802        0.802        0.802         0.802
          1.2       0.814       0.807       0.812        0.817        0.822         0.814
Xalan     2.5.0     0.841       0.826       0.831        0.843        0.849         0.819
          2.6.0     0.879       0.86        0.867        0.867        0.873         0.867
          2.7.0     0.982       0.971       0.976        0.98         0.98          0.975
Xerces    1.2.0     0.898       0.882       0.878        0.898        0.897         0.879
          1.3.0     0.866       0.87        0.865        0.861        0.855         0.844
          1.4.4     0.978       0.93        0.931        0.969        0.969         0.927
In 54.4% of cases, the model with static code, NR and NDC metrics had the best AUC value, followed by the models with static code, NDC and NDPV, and static code, NDC and NML metrics with 21.13% and 14.4%, respectively. The model with static code, NR and NML metrics had the best AUC value in 7.8% of cases.

Table 15. Comparison results for models containing a combination of 2 process metrics and static code metrics using the TAN classifier
Method                  P-Value
Not adjusted for ties   0.8685
Adjusted for ties       0.8683
The Kruskal-‐Wallis test reported the statistical significance of 0.8685 (i.e., p=0.8685), shown in Table 15. This is above 0.05 and, therefore, we can conclude that there is no basis to reject the null hypothesis.
6.2.2.4 Models with a combination of 3 process and static code metrics
In Table 16, we present the AUC values of each version of the selected projects for the 4 models containing a combination of 3 process metrics and all static code metrics using the TAN classifier.

Table 16. Results for models containing a combination of 3 process metrics and static code metrics using the TAN classifier

Project   Version   SC+NR+NDC+NML   SC+NR+NDC+NDPV   SC+NR+NML+NDPV   SC+NDC+NML+NDPV
Ant       1.4       0.828           0.835            0.775            0.825
          1.5       0.819           0.82             0.794            0.809
          1.6       0.895           0.889            0.872            0.888
          1.7       0.891           0.901            0.884            0.892
jEdit     4.0       0.942           0.941            0.904            0.938
          4.1       0.934           0.935            0.921            0.938
          4.3       0.851           0.639            0.836            0.851
Synapse   1.1       0.802           0.802            0.802            0.802
          1.2       0.809           0.814            0.807            0.817
Xalan     2.5.0     0.841           0.846            0.831            0.849
          2.6.0     0.875           0.882            0.866            0.87
          2.7.0     0.982           0.982            0.976            0.98
Xerces    1.2.0     0.898           0.895            0.878            0.895
          1.3.0     0.871           0.866            0.87             0.861
          1.4.4     0.978           0.978            0.931            0.969
The model with static code, NR, NDC and NML metrics was best in 38.33% of cases, followed by the models with static code, NR, NDC and NDPV, and static code, NDC, NML and NDPV metrics with 35% and 25%, respectively.

Table 17. Comparison results for models containing a combination of 3 process metrics and static code metrics using the TAN classifier
Method                  P-Value
Not adjusted for ties   0.8100
Adjusted for ties       0.8099
The Kruskal-‐Wallis test has shown, in Table 17, the statistical significance of 0.8100 (i.e., p=0.8100). This is above 0.05 and, therefore, we can conclude that there is no basis to reject the null hypothesis.
7. Result discussion
We investigated 17 models in order to determine whether, and which, process metric(s) can improve performance in detecting software faults. Graphical representations of the comparison results are shown in Figures 8 and 9 for the NB and TAN classifiers, respectively. The models are divided into 4 logical groups for an easier presentation of the performance results. Considering the models built using the NB classifier, the Kruskal-Wallis test showed that we have no basis to reject the null hypothesis in any of the cases; therefore, we can conclude that there is no statistical difference between the models. This is illustrated in Figure 8, where the samples for each model overlap.
Figure 8. Graphs with comparison results for NB classifier
However, observing the graphs, we can see that the models containing only the 4 process metrics slightly improved the experimental results compared to the combined and static code models. Although the difference was not substantial enough to reject the null hypothesis, we can recommend process metrics for further investigation. Similarly, the NDC metric, alone or combined with the NML and/or NDPV metrics, improved the models to a small degree and should be investigated further. The NR metric did not provide any notable improvement to the models.
The models built using the TAN classifier likewise showed no significant difference after statistical testing, i.e., we cannot claim that any model has superior performance in SFP. This statement is supported by the graphical representation (see Figure 9), with overlapping samples of the models. Nevertheless, based on the experimental results and the graphs with the statistical comparison, we can propose the combined model and the models containing the NDC and NR metrics for further investigation.
Figure 9. Graphs with comparison results for TAN classifier
The objective of our experiment, and the main purpose of RQ2, was to establish whether process metrics can improve SFP. After the conducted analysis, we cannot make a solid assertion or recommend the usage of a specific model, since no evident statistical difference between the samples was reported. We propose an additional investigation of the models containing certain combinations of the previously mentioned metrics. Furthermore, in Section 5 we stated that we used projects where some data about the process metrics were missing. Providing complete metrics data could benefit further research and possibly open new discussions regarding process metrics in SFP.
7.1 Related work
As related work, we considered publications that examine the process metrics available in the repository used for our research, with classifiers different from the ones we analyzed. The publication that fulfills these requirements is the one by Madeyski et al. [28]. They conducted an empirical study on which process metrics can be beneficial for defect prediction. Their research included open-source and industrial projects, which led to some statistically significant values. They examined models that contained all static code metrics and exactly 1 process metric. Using a stepwise linear regression method for building the models, they concluded that process metrics, because of the nature of the information they contain, can improve defect prediction and make a notable contribution. Namely, the metric indicating the number of developers that committed changes on the same class (NDC) can considerably improve defect prediction, and the NML metric can also be taken into consideration for prediction purposes. Moreover, statistical significance for at least one process metric was reported mainly in the industrial projects, which was not the case with the open-source projects.
8. Validity threats
The purpose of any experimental study is to estimate the usability and success rate of the proposed technique, algorithm or, in our case, model. Our task is to investigate to what extent the obtained results remain valid for similar experimental conditions and systems, and it is crucial to follow the steps appropriate for the type of study that is part of our research [15]. In spite of our detailed study, some validity threats should be taken into consideration.
8.1 Internal validity
Internal validity concerns wrong interpretations of the study results [28]. In this thesis, we defined 2 RQs with different methodologies. A potential threat for RQ1 relates to the selection of papers dealing with the usage of process metrics for SFP: a different selection of papers would give a new and different perspective for the review. For this reason, we strictly defined the steps, filters and techniques used, to mitigate this problem. Furthermore, the parameters of RQ2, such as the observed projects and their datasets, the response variable and the experimental design, can differ between experimenters, which can produce different results.
8.2 External validity
External validity defines how the obtained results can be applied to other populations or systems [28]. We investigated only open-source Java projects, while Madeyski et al. [28] reported that process metrics show statistically significant results mainly for industrial projects. This can be a basis for external threats, since the differences among projects can be considerable.
8.3 Statistical conclusion validity
Statistical conclusion validity refers to problems that can affect the results of the statistical analysis [28]. In order to ensure correct results, we used two tools for the model comparison, Minitab18 and SPSS19, and we used the non-parametric Kruskal-Wallis test, suitable for comparisons over multiple datasets, as suggested in [15].
18 https://www.minitab.com/en-‐us/ 19 http://www.spss.co.in/products.php?p=statistics
9. Conclusion
In this thesis, we investigated the importance of process metrics for SFP. We defined 2 RQs in order to cover the state of the art regarding the usage of different non-static code metrics and to conduct an experimental study. The informal review for RQ1 offered some interesting conclusions. We identified that process metrics such as code churn, developer metrics and metrics containing information about changes in a file or module are among the most frequently occurring in research papers. It is also reported that process metrics are suitable for the post-release phase. Furthermore, a combination of static code and process metrics can be beneficial for fault prediction in different phases of software development, such as requirements definition, system design and coding. Finally, the usage of process metrics is mostly recommended for industrial projects, because of their reliability in detecting faulty modules of the system. The experimental study for RQ2 involved 17 models (created from static code and process metrics) built with 2 BN classifiers. The experimental results and the statistical analysis indicated that none of the compared models showed a statistically significant difference regarding the improvement of SFP. However, observing the results and the boxplots with the comparison analysis, we can conclude that:
o The combined and process models, as well as the models containing the NDC metric, can slightly, though not statistically significantly, improve the experimental results;
o The NDC metric combined with the NML, NDPV or NR metric, depending on the classifier used, can provide a better model;
o It would be useful to have complete data about all 4 process metrics; this could give new insight into the impact of those metrics on the prediction process.
9.1 Future work
In this subsection, we propose possible steps for future work, based on our results and the conclusions of the discussion.
9.1.1 Model investigation
As mentioned before, the combined and process models, and the models containing the NDC metric, should be investigated further by choosing other representative projects that could provide statistically significant results.
9.1.2 Industrial projects
Our research could be extended to industrial projects, so that we could compare the differences between projects of different domains and thereby address the potential threats to external validity discussed in Section 8.
9.1.3 Data extraction
It would be worthwhile to replicate the study once the repository offers all extracted data about the process metrics for the projects examined here. We could then conclude how the missing data affected the results and whether it caused our inability to determine which process metric can be recommended as useful for the SFP process.
References
[1] N. E. Fenton and S. L. Pfleeger, "Software metrics: A rigorous and practical approach", Course Technology, Boston, MA, USA, 2nd edition, 1998.
[2] Q. Song, Z. Jia, M. Shepperd, S. Ying and J. Liu, "A general software defect-proneness prediction framework", IEEE Transactions on Software Engineering, Volume 37, Issue 3, pp 356-370, May-June 2011.
[3] R. Shatnawi, "Empirical study of fault prediction for open-source systems using the Chidamber and Kemerer metrics", IET Software, Volume 8, Issue 3, pp 113-119, June 2014.
[4] T. Hall, S. Beecham, D. Bowes, D. Gray and S. Counsell, "A systematic literature review on fault prediction performance in software engineering", IEEE Transactions on Software Engineering, Volume 38, Issue 6, pp 1276-1304, Nov.-Dec. 2012.
[5] E. Harahap, W. Sakamoto and H. Nishi, "Failure prediction method for network management system by using Bayesian network and shared database", Information and Telecommunication Technologies (APSITT), IEEE, pp 1-6, 15-18 June 2010.
[6] K. Dejaeger, T. Verbraken and B. Baesens, "Toward comprehensible software fault prediction models using Bayesian network classifiers", IEEE Transactions on Software Engineering, Volume 39, Issue 2, pp 237-257, Feb. 2013.
[7] A. Okutan and O. T. Yıldız, "Software defect prediction using Bayesian networks", Empirical Software Engineering, Volume 19, Issue 1, pp 154-181, 2014.
[8] D. Radjenović, M. Heričko, R. Torkar and A. Živkovič, "Software fault prediction metrics: A systematic literature review", Information and Software Technology, Volume 55, Issue 8, pp 1397-1418, Aug. 2013.
[9] C. Jin, S.-W. Jin and J.-M. Ye, "Artificial neural network-based metric selection for software fault-prone prediction model", IET Software, Volume 6, Issue 6, pp 479-487, Dec. 2012.
[10] M. Shepperd, D. Bowes and T. Hall, "Researcher bias: The use of machine learning in software defect prediction", IEEE Transactions on Software Engineering, Volume 40, Issue 6, pp 603-616, June 2014.
[11] L. Pelayo and S. Dick, "Evaluating stratification alternatives to improve software defect prediction", IEEE Transactions on Reliability, Volume 61, Issue 2, pp 516-525, June 2012.
[12] D. Gray, D. Bowes, N. Davey, Y. Sun and B. Christianson, "Software defect prediction using static code metrics underestimates defect-proneness", The 2010 International Joint Conference on Neural Networks (IJCNN), pp 1-7, July 2010.
[13] A. J. Stimpson and M. L. Cummings, "Assessing intervention timing in computer-based education using machine learning algorithms", IEEE Access, Volume 2, pp 78-87, 2014.
[14] C. Zheng, F. Peng, J. Wu and Z. Wu, "Software life cycle-based defects prediction and diagnosis technique research", 2010 International Conference on Computer Application and System Modeling (ICCASM), Volume 8, pp V8-192 - V8-195, Oct. 2010.
[15] E. Alpaydin, "Introduction to machine learning", Massachusetts Institute of Technology, 2010.
[16] N. E. Fenton and N. Ohlsson, "Quantitative analysis of faults and failures in a complex software system", IEEE Transactions on Software Engineering, 26(8):797-814, 2000.
[17] A. Gray and S. MacDonell, "A comparison of techniques for developing predictive models of software metrics", Information and Software Technology, 39(6):425-437, 1997.
[18] L. Madeyski and M. Jureczko, "Significance of different software metrics in defect prediction", Software Engineering: An International Journal, Volume 1, Number 1, pp 86-95, 2011.
[19] Y. Kamei, H. Sato, A. Monden, S. Kawaguchi, H. Uwano, M. Nagura, K.-i. Matsumoto and N. Ubayashi, "An empirical study of fault prediction with code clone metrics", Software Measurement, pp 55-61, 2011.
[20] Y. Xia, G. Yan and H. Zhang, "Analyzing the significance of process metrics for TT&C software defect prediction", Software Engineering and Service Science (ICSESS), pp 77-81, 2014.
[21] S. Matsumoto, Y. Kamei, A. Monden, K.-i. Matsumoto and M. Nakamura, "An analysis of developer metrics for fault prediction", Proceedings of the 6th International Conference on Predictive Models in Software Engineering, 2010.
[22] T. J. Ostrand, E. J. Weyuker and R. M. Bell, "Programmer-based fault prediction", Proceedings of the 6th International Conference on Predictive Models in Software Engineering, 2010.
[23] Y. Shin, A. Meneely, L. Williams and J. A. Osborne, "Evaluating complexity, code churn and developer activity metrics as indicators of software vulnerability", IEEE Transactions on Software Engineering, Volume 37, Issue 6, pp 772-782, 2011.
[24] J. Demšar, "Statistical comparison of classifiers over multiple data sets", Journal of Machine Learning Research, Volume 7, pp 1-30, 2006.
[25] T. Illes-Seifert and B. Paech, "Exploring the relationship of a file's history and its fault-proneness: An empirical method and its application to open source programs", Information and Software Technology, Volume 52, Issue 5, pp 539-558, May 2010.
[26] H. Lu, B. Cukic and M. Culp, "A semi-supervised approach to software defect prediction", Computer Software and Applications Conference (COMPSAC), pp 416-425, July 2014.
[27] M. J. Crawley, "The R book, second edition", Imperial College London at Silwood Park, UK, 2013.
[28] L. Madeyski and M. Jureczko, "Which process metrics can significantly improve defect prediction models? An empirical study", Software Quality Journal, Volume 23, Issue 3, pp 393-422, September 2015.
A Graphs for NB classifier
Figure 10. Graphical comparison of combined, static code and process models using NB
Figure 11. Graphical comparison of models containing 1 process and static code metrics using NB
Figure 12. Graphical comparison of models containing the combination of 2 process and static code metrics using NB
Figure 13. Graphical comparison of models containing the combination of 3 process and static code metrics using NB
B Graphs for TAN classifier
Figure 14. Graphical comparison of combined, static code and process models using TAN
Figure 15. Graphical comparison of models containing 1 process and static code metrics using TAN
Figure 16. Graphical comparison of models containing the combination of 2 process and static code metrics using TAN
Figure 17. Graphical comparison of models containing the combination of 3 process and static code metrics using TAN