Are Unit and Integration Test De nitions Still Valid for ... · de nitions still t to modern software development. Objective: We analyze, if the existing standard de nitions of unit

Are Unit and Integration Test Definitions Still Valid forModern Java Projects? An Empirical Study on

Open-Source Projects

Fabian Trautscha,∗, Steffen Herbolda, Jens Grabowskia

aInstitute of Computer Science, University of Goettingen, Germany

Abstract

Context: Unit and integration testing are popular testing techniques. How-

ever, while the software development context evolved over time, the definitions

remained unchanged. There is no empirical evidence, if these commonly used

definitions still fit to modern software development.

Objective: We analyze, if the existing standard definitions of unit and

integration tests are still valid in modern software development contexts. Hence,

we analyze if unit and integration tests detect different types of defects, as

expected from the standard literature.

Method: We classify 38782 test cases into unit and integration tests accord-

ing to the definition of the IEEE and use mutation testing to assess their defect

detection capabilities. All integrated mutations are classified into five different

defect types. Afterwards, we evaluate if there are any statistically significant

differences in the results between unit and integration tests.

Results: We could not find any evidence that certain defect types are more

effectively detected by either one of the test types. Our results suggest that the

currently used definitions do not fit modern software development contexts.

Conclusions: This finding implies that we need to reconsider the defini-

tions of unit and integration tests and suggest that the current property-based

definitions may be exchanged with usage-based definitions.

∗Corresponding authorEmail address: [email protected] (Fabian Trautsch)

Preprint submitted to Journal of Systems and Software May 16, 2019

Keywords: software testing; unit testing; integration testing; empirical

software engineering

1. Introduction

Unit testing as well as integration testing are two popular testing techniques

that are used in industry and academia. There exist a lot of scientific literature,

which is focused on unit testing [1, 2, 3] or integration testing [4, 5, 6]. Fur-

thermore, there exist several development models in which both practices are

integrated, e.g., the V- Model [7] or the waterfall model [8].

The definitions for unit and integration tests, like the ones of the Institute of

Electrical and Electronics Engineers (IEEE) [9] or International Software Test-

ing Qualification Board (ISTQB) [10], are decades old. However, the software

development context has evolved over time, while these definitions remained

unchanged. For example, major contributions in the field of software testing

that pushed this evolution were the development of xUnit frameworks (e.g.,

JUnit [11]), which are nowadays used to execute all types of tests including

unit, integration, and system tests, or Continuous Integration (CI) systems

(e.g., Travis CI [12], Jenkins [13]). Those frameworks and systems changed the

way in which developers test their software. This is highlighted by a proposal

that is spread within the development community [14]. The idea is to change a

software testing paradigm: instead of testing a software thoroughly on unit level

with a few integration tests, developers propose to test the software mostly on

integration level with only a few unit tests. The reasons for this are manifold

like, e.g., that integration tests are more realistic [14] or more effective [15, 16]

than unit tests.

Hence, we can clearly see an evolution of the development context, but the

definitions were not adapted over the time. Therefore, we hypothesize that

the current definitions, that are used in current textbooks and software testing

certifications (e.g., ISTQB certified tester [10]), do not fit to modern software

development contexts. Our prior work [17] presents us indications that support

2

this hypothesis. We have shown that developers do not classify their tests

according to the definitions of the IEEE or ISTQB. While this shows that the

definitions are not used in practice, it does not gives us any indications regarding

the merit of the current definitions. Since the main purpose of testing is to

reveal defects, we focus on comparing the defect detection capabilities of unit

and integration tests in this paper. Software testing textbooks state [18, 19, 20],

that there should be a difference between unit and integration tests in terms of

types of defects that they detect. Hence, we evaluate if unit or integration tests

are more effective in detecting certain program defect types. If we do not find a

difference, we would have another indication that the current definitions of unit

and integration tests do not fit the modern software development context.

To steer our research, we defined one central Research Question (RQ):

• RQ: Are unit or integration tests more effective in detecting certain pro-

gram defect types?

To answer our RQ, we have a look at the effectiveness of unit and integration

tests in terms of their defect detection capabilities. For this, we first classify tests

into unit and integration tests and afterwards collect data about their defect

detection capabilities using mutation testing. Then, we perform an in-depth

analysis by assessing which types of defects are found by the two different types

of testing and if there are any differences in their effectiveness of detecting these

defect types. The contributions of this paper can be summarized as follows:

• an in-depth analysis of the defect detection capabilities of unit and inte-

gration tests.

• an in-depth analysis of which defect types are found by unit and integra-

tion tests.

• a data set, which includes detailed information about the defect detection

capabilities of 38782 test cases for 17 projects.

3

• a software test analysis framework, which can be used for further research

and enable other researchers to contribute to the body of knowledge of

empirical software testing research.

The remainder of this paper is structured as follows. In Section 2 we lay

the foundations for this paper and discuss related work. Section 3 describes

our research methodology, including the approach for our research question,

our data collection and analysis procedures, our results, and the replication kit

of this paper. We then discuss the results in Section 4. Section 5 presents

the threats to validity to our study, including construct, internal and external

threats. Finally, we conclude our paper and describe future work in Section 6.

2. Foundations and Related Work

In this section, we present the terminology that is used along this paper

together with a summary of the literature that is related to our research.

2.1. Definitions

The IEEE standard ISO/IEC/IEEE 24765-2010 [9] defines the most impor-

tant vocabulary for the software engineering world. In the following, we present

the definitions that are important for this paper based on this standard.

Definition 1 (Unit). 1. a separately testable element specified in the design

of a computer software component. 2. a logically separable part of a computer

program. 3. a software component that is not subdivided into other components.

[...]. [9]

Definition 2 (Unit Test). [...] 3. test of individual hardware or software

units or groups of related units [9]

Definition 3 (Integration Test). 1. the progressive linking and testing of

programs or modules in order to ensure their proper functioning in the complete

system. [9].

4

2.2. Test Type Classification

There are several approaches present in the literature that try to classify tests

into different classes. In the short paper by Orellana et al. [21], the classification

is based on the usage of Maven plugins in project builds. If test classes are

executed by the Maven SureFire plugin [22] they are classified as unit tests,

but if they are executed by the Maven FailSafe plugin [23] they are classified as

integration tests. Hence, they reuse the classification of the developers. This can

be problematic, as our earlier work [17] highlights, that the classification into

unit and integration tests, as done by the developers, is not always in line with

the definitions. Furthermore, this kind of classification does not allow a fine-

grained analysis on the method level. Another problem is, that not all projects

differentiate their tests based on the executed Maven plugins. For example,

none of the projects that we have used in our case study use the Maven FailSafe

plugin, while some tests of the projects are indeed integration tests.

Another approach that classifies tests into different types was proposed by

Kanstren [24]. Kanstren proposes a dynamic approach that uses aspect oriented

programming to calculate the test granularity of each test, which refers “to the

number of units of production code included in a test case...”[24]. To achieve

this, his approach calculates the number of methods that are covered by each

test. Afterwards, the results can be summarized to determine, if, e.g., a system

was tested only on low-level (i.e., many tests that executed a small amount of

methods) or only on high-level.

In our earlier work [17] we determined how many tests are unit tests, ac-

cording to the definitions of the ISTQB and IEEE for 10 Python projects. We

proposed a static analysis approach in which we assessed the number of im-

ported modules in a test to find unit tests. Furthermore, we compared the

intention that the developers had (i.e., if they wanted the test to be a unit test)

with the outcome of the classification analysis.

We improved our test classification approach in several ways. We switched

from a static analysis approach to a dynamic one to resolve the drawbacks

of our former approach, i.e., we can now directly determine if a unit is used

5

(and not only imported) in a test case and the approach is robust against the

usage of mocking frameworks inside a project. Our new approach is using test

coverage data to classify tests into unit and integration tests, which was not

possible before. Hence, our approach only needs the coverage data of a test

execution run in contrast to the work of Orellana et al. [21]. Furthermore, our

approach makes a clear separation between unit and integration tests based on

the definition of the IEEE [9]. This is in contrast to the work of Kanstren [24],

which only provides a general view on the granularity of the tests. Furthermore,

our analysis is more fine-grained than before. Instead of classifying a whole test

class/module we are able to classify each method separately.

2.3. Assessment of Defect Detection Capabilities via Mutation Analysis

The assessment of the defect detection capabilities of a set of test cases is

done by checking how many defects the test suite can find. There exist several

approaches in the literature that can be used. Recently, Oreallana et al. [21]

assessed if unit tests expose more defects than integration tests by using the

Travis Torrent dataset [25]. This data set contains information about project

builds, including which test cases failed during the build. The authors of the

paper used this information to determine the number defects that were exposed.

The assumption in this kind of analysis is that every test that failed during

the build of a project exposed a defect. The problem with this approach is

that tests are often executed before committing changes to a Version Control

System (VCS). Hence, all defects that might be found before committing the

changes are not recorded. Moreover, this kind of technique is focused on defects

that broke the build and therefore the number and type of assessed defects is very

limited. First, because tests in a CI system could also fail due to wrong commit

behavior (e.g., the developer forgot to commit a certain change to a class).

Second, with this technique we could only assess defects that are detected by

the CI system. Defects that might get fixed before committing to the CI system

are missing. Hence, to answer our RQ we require a controlled environment,

where we are able to exclude as many outside influences as possible that might

6

have an influence on the results. We achieve this through mutation analysis.

Mutation analysis provides a systematic way for introducing defects into

a software [26]. In mutation analysis we integrate mutants (defects) into the

program and check if these are found by test cases. If a test case detects a

mutant, i.e. the test case fails, we call the mutant killed. Mutants that survive

the execution of the test case are called live [26]. The mutation score is defined

as a ratio of the number of killed mutants to the total number of mutants.

Unfortunately, some of the generated mutants produce a program which behaves

equivalent to the original one [27, 28] and can therefore not be killed. These

mutants are called equivalent. However, the detection of such mutants is an

instance of an undecidable problem [19]. Mutation testing can be very time

consuming, as a large number of defect-seeded programs must be tested.

The use of mutation analysis for the assessment of the defect detection capa-

bilities of tests, has the underlying assumption that the mutants can construct

program failures that are similar to the ones that are created through real de-

fects. There are several studies that provide indications that this is the case.

One of the first studies that investigated the relationship between mutants and

real defects was done by Daran et al. [29]. They found that injected mutants

could produce failures and program data states that are similar to those pro-

duced by real defects. Similar to that, Andrews et al. [30, 31] concluded in

their studies that the detection ratio of mutants are representative for defect

detection ratios, i.e., there is a correlation between them. One of the most re-

cent papers that highlight the relationship between mutants and real defects

was done by Just et al. [32]. They found a strong correlation between mutant

detection ratios and real defect detection ratios.

Nevertheless, there are some works that identified limitations to the con-

clusions presented above. Namin et al. [33] found that there is only a weak

correlation between mutants and defect detection ratios. Chekam et al. [34]

concluded that there is a strong connection between an increase of the mutation

score and defect detection, but only at higher mutation score levels. A very

recent paper by Papadakis et al. [35] highlight, that there is only a weak corre-

7

lation between the mutation score and defect detection, if the test suite size is

controlled for.

As we can see on the above papers, the question if mutation testing is an

appropriate tool to assess the defect detection capabilities of tests can not be

definitely answered, as there exist indications for a positive as well as a negative

answer to this question in the literature. While we also make use of mutation

testing in our work, we do not create test suites, but reuse the ones that are

provided by the projects. Hence, some limitations, e.g. that the test suite size

must be controlled for, are not applicable to our research.

2.4. Defect Classification

There are numerous studies on defect classification. One of the main dif-

ferences between these studies is the data on which the classification is based.

While some taxonomies need specification or design documents [36], others only

need source code [37], or defect reports [38, 39].

A commonly used cause-driven taxonomy was proposed by Chillarege et

al. [40] and is called Orthogonal Defect Classification (ODC). In ODC defects

are classified into eight types based on their description about the symptoms,

semantics, and root causes. Offutt et al. [41] proposed a model for the character-

ization of defects based on the syntactic and semantic size. Hayes et al. [36] pre-

sented a requirements-based defect analysis methodology, which the author also

applied to NASA projects. Later, Hayes et al. [42] developed two taxonomies,

where the first classifies code modules (e.g., into data-centric, controller, view,

...) and the second code defects (e.g., into data, interface, computation, ...).

Xia et al. [38] used the descriptions available in defect reports to classify de-

fects into two defect trigger categories (Bohrbug and Mandelbug) via natural

language processing techniques. Tan et al. [39] created a taxonomy in which de-

fects are classified based on three dimensions: root cause (of the defect), impact

(failure caused by the defect), and component (location of the defect).

The main problem with the approaches described above is that the data

that is needed to classify the defects are often not available, e.g. specifications

8

or design documents. This is especially true for open source projects, as these

projects follow a special development process [43]. Additionally, the creation of

a link between the classified defect and its representation in the source code is

often hard to achieve.

Recently, Zhao et al. [37] presented an approach that can overcome the prob-

lems described above. They classify defects based on the change that was made

to fix the defect. For this, they adapted the classification of [42] and created

a tool for the C language that needs two versions of the source code of the

program (defective and clean). Afterwards, the differences (changes) between

these two versions are extracted and change patterns are detected. Based on

these change patterns a change type classification is created, which is also the

classification of the defect. They created five different categories with overall

nine subcategories. The data category comprises of Changes on Data Dec-

laration/Definition (CDDI) statements. This type of change indicate that a

data-related defect occurred, e.g., if the declared type of a variable is changed

from int to float. The computation category includes Changes on Assign-

ment Statements (CAS). This includes the addition or deletion of assignment

statements, or the modification of equations. Computation-related defects are

defects that can lead to a wrong assignment of a variable, which are then often

fixed by CAS. Interface-related defects are caused “by wrong definition or

faulty function dependency on other functions.” [37]. For example, if a function

is called with an incorrect amount of parameters. Defects that are related to in-

terface issues are fixed by Changes on Function Declaration/Definition (CFDD)

or Changes on Function Call (CFC). The logic/control category comprises

of Changes on Loop Statements (CLS), Changes on Branch Statements (CBS),

and Changes on Return/Goto Statements (CRGS). Hence, defects that oc-

cur in these statements, e.g., changing a < to a >= in an if statement, “may

cause the incorrect execution sequence or an abnormal state” [37]. Therefore, a

defect-fixing change in these statements signals that a logic/control-related de-

fect occurred. Every other change that could not be classified into the categories

above is then subsumed under the Other category. For example, Changes on

9

Preprocessor Directives (CPD), which is a change that can occur in programs

using the C language.

Our approach reuses the classification scheme of Zhao et al. [37] as it is tightly

connected to the source code. This connection to the source code is needed to

classify our generated mutants, as these mutants are not regular defects that

exhibit, e.g., an issue report. While we reuse the logic behind every sub category

our tool only outputs the main category in which a classified defect resides in

(e.g., “Interface”). Furthermore, in contrast to [37], our tool is working for Java

projects.

3. Research Methodology

Within our RQ, we want to assess if unit or integration tests are more ef-

fective in detecting certain program defect types. Our expectation would be

that integration tests detect more interface problems, while unit tests are more

focused on finding computation defects or problems with the control flow of a

program. This is also stated in software testing textbooks [18, 19, 20]. Hence,

we evaluate if there are indeed defect types that are only (or mostly) detected

by a certain test type and assess how effective these test types can detect certain

defect types. These results could help to assess if the current definitions of the

IEEE still fit to modern software development contexts.

Overall, our approach is as follows. We classify all test cases into unit and

integration tests. Additionally, we collect data on the defect detection capabil-

ities of each test case by using mutation analysis. We generate many different

defective versions of the code and check which tests are able to detect the in-

tegrated defects. Afterwards, we classify our introduced defects based on the

changes that were made to fix the defect using the taxonomy from Zhao et

al. [37]. Finally, we evaluate which defect types are found by which test type.

Within this section, we describe our case study design, including the subject

selection, data collection, and analysis procedures.

10

3.1. Subject Selection

We defined the following inclusion criteria to select our study subjects.

• Projects must be a library or a framework. Libraries and frame-

works should be included or used by other programs and are not executed

by themselves. Therefore, tests that execute the whole system (system

tests) are less likely to occur. Hence, by focusing on libraries and frame-

works we are reducing the risk of misclassifying a system test as an in-

tegration test, as our tooling infrastructure is currently not capable to

differentiate between these test types.

• Projects must have a minimum of 1000 commits and are at least

2 years old. This ensures, that we only include mature projects.

• Projects must use Java as programming language, Maven [44] as

build system, and JUnit [11] or TestNG [45] as test driver. This

is a limitation of our current tooling infrastructure.

Moreover, one exclusion criterion is defined.

• Projects should not be focused on Android alone. This work con-

centrates on pure Java projects. While Android projects also use Java

as programming language, there are several differences in testing those

projects in contrast to pure Java projects.

We applied these criteria on two different data sources. First, the list of

Borges et al. [46], which classified the most popular 5000 GitHub repositories

(language-independent) into six different categories (application software, sys-

tem software, web libraries and frameworks, non-web libraries and frameworks,

software tools, and documentation). Second, we used a list of the most popular

projects created by the maven repository [47]1. Overall, 41 projects fit to our

criteria from which we randomly selected 17 projects for our study.

1As available on the time we performed our case study (January 2018).

11

TestLOC

Intercept Coverage Collection

Collect Test Metrics

Test Classification

Mutation Detection Capabilities

COMFORT Framework

SmartSHARK Environment

ProjectVCS

Project Release

2

Coverage

1

Figure 1 Logical overview of our data collection process. Step 1 is the collection

of meta- data using the SmartSHARK environment [48]; Step 2 is the automatic

collection of test metrics, including the mutation detection capabilities of all tests.

Table 1 gives an overview of the chosen projects together with their char-

acteristics. It shows the project name, the release of the project that was used

in the case study, the overall number of commits (till the stated release), the

number of Java classes for the release, and the number of test cases that are

executed by the build script of the release. All projects can be found on GitHub.

The number of classes and test cases in this table highlight that the projects

that we included in our case study are non-trivial projects.

From the sampled projects we included the latest minor release, if avail-

able. If this release was not available, we took the next possible release (e.g.,

release 1.11.1 of jsoup, as the release 1.11.0 is not available). For two projects

(i.e., commons-beanutils and fastjson) the latest minor release was longer than

four years ago. Hence, we decided to use the most recent release to prevent

compatibility issues with our infrastructure.

3.2. Data Collection Procedures

Figure 1 gives a logical overview of our data collection process for our case

study. As a first step, we are collecting meta-data about each project using the

SmartSHARK environment [48]. Hence, we collect data from the VCSs of the

projects, as these are later on used in our data collection process to integrate the

different collected data. Afterwards, in the second step, we calculate different

12

Project Release #Commits #Java Classes #Test Cases

commons-beanutils 1.9.3 1128 370 1197

commons-codec 1.11 1677 166 874

commons-collections 4.1 2852 903 6637

commons-io 2.5 1868 308 1150

commons-lang 3.7 5106 711 4064

commons-math 3.6 5819 2596 6488

druid 1.1.0 5178 3785 4132

fastjson 1.2.41 2579 5214 4176

google-gson 2.8.0 1306 719 1014

guice 4.1 1513 2078 701

HikariCP 2.7.0 2493 154 120

jackson-core 2.9.0 1323 268 774

jfreechart 1.5.0 3622 1041 2175

joda-time 2.9 1913 527 4176

jsoup 1.11.1 1106 304 593

mybatis-3 3.4.0 1816 1205 1053

zxing 3.3.0 3282 408 401

Table 1 Selected projects with their characteristics.

test metrics for each analyzed test case. This step is automated by our Collection

of Metrics for Tests (COMFORT) framework. In a first step, we execute the

tests of a project release and intercept the coverage collection to collect the

coverage on a per test case level. The resulting coverage is afterwards used to

calculate different test metrics for each test case: the test classification into unit

and integration test (see Section 3.2.1), the Test Lines of Code (TestLOC) (see

Section 3.2.2), and the mutation detection capabilities (see Section 3.2.3). In

the following, we describe each approach that we followed to calculate these

metrics and the concrete implementation that we designed. Furthermore, we

describe our approach for the collection of the defect type in Section 3.2.4.

13

3.2.1. Assigning the Test Classification

Within this section, we explain our approach to assign a test class (i.e., unit

or integration test) to the tests of the projects. We base our our separation on

the IEEE definitions presented in Section 2.1, because the ISTQB definition (i.e.,

a unit test tests only one unit) is a subset of it. For example, all ISTQB unit tests

are also IEEE unit tests by definition. Furthermore, our prior work [17] shows

that the ISTQB definitions, are too restrictive for modern software systems,

where logical software units often consist of several classes.

According to the definitions of the IEEE (see Section 2.1), a test in Java is a

class, which consists of several test methods. These test methods are test cases,

which are e.g., annotated with the “@Test”-JUnit annotation [11]. A unit is a

class in Java, as it is a logically separable part of a Java program, which is not

further subdivided into other components. This is in line with the literature

[49]. A unit test is a test case that tests “...software units or groups of related

units” [9]. In Java, related units are put into one package, as the official Java

documentation states: “A package is a namespace that organizes a set of related

classes and interfaces.” [50]. Hence, if we apply the IEEE definition to Java a

unit test is a test that tests only units from within one package (i.e., related

units). On the other hand, an integration test tests units from more than one

package. Hence, we do not classify the intent with which a test is created, but

the actual type of the test according to the IEEE definitions.

To foster the understanding of the IEEE definitions, we selected two real

world examples from the commons-io project [51]. Listing 1 shows an IEEE

unit test and an IEEE integration test. The upper part of this listing presents

a typical unit test, where the correct workings of a function is tested. Here,

the getPrefix function of the FilenameUtils class is tested. More precisely, this

test checks if null bytes are part of the string. As this tests only calls one unit

(i.e., the FilenameUtils class) and the FilenameUtils class does not call other

units, this test gets classified as a unit test. On the other hand, the bottom part

of the listing shows an IEEE integration test. In this case, the test checks if

14

the ByteArrayOutputStream class of commons-io works as intended, if it is used

within the copy function of the CopyUtils class. As the ByteArrayOutputStream

is from the org.apache.commons.io.output package and the CopyUtils class from

the org.apache.commons.io package, this tests gets classified as integration test.

1 // IEEE Unit Test

2 [...]

3 package org.apache.commons.io;

4 [...]

5

6 @Test

7 public void testGetPrefix_with_nullbyte() {

8 try {

9 assertEquals("˜user\\", FilenameUtils.getPrefix("˜u\

u0000ser\\a\\b\\c.txt"));

10 } catch (IllegalArgumentException ignore) {

11 }

12 }

13

14 /***************************************************/

15

16 // IEEE Integration Test

17 [...]

18 package org.apache.commons.io;

19 [...]

20 import org.apache.commons.io.output.ByteArrayOutputStream;

21 [...]

22

23 @Test

24 public void copy_byteArrayToOutputStream() throws Exception {

25 final ByteArrayOutputStream baout = new ByteArrayOutputStream

();

26 final OutputStream out = new YellOnFlushAndCloseOutputStream(

baout, false, true);

15

27

28 CopyUtils.copy(inData, out);

29

30 assertEquals("Sizes differ", inData.length, baout.size());

31 assertTrue("Content differs", Arrays.equals(inData, baout.

toByteArray()));

32 }

Listing 1 Examples for an unit test (top) and an integration test (bottom) from the

commons-io project [51].

On the implementation side, we use the recorded per test case coverage data

for the test class assignment. A unit counts as covered, if at least one method of

it was executed by the test. Furthermore, we only record the coverage of units

that are within the project, i.e., calls to other libraries or the Java standard

Application Programming Interface (API) or mocking frameworks are filtered

out. In addition, we filter out every test class that might be inside the coverage

data, as we want to base our test class assignment only on covered production

classes. Therefore, we check the file path for each covered class and if the path

has the term “test”2 in it, we filter it out from the data. This has another

advantage: self-made mocks that might exist in the project are filtered out

from the coverage data. Within this filtered coverage data, we check how many

“related units” (i.e., different packages) are covered by a test case. If only one

logical unit is covered, then the test case is classified as unit test. Otherwise, it

is classified as an integration test.

3.2.2. Collecting the TestLOC

We need to acquire the number of TestLOC for our RQ to normalize the

results. This number is calculated based on the coverage data that is also used

for the test classification. We sum up the number of covered lines for each

2In Maven projects (like we use in our study) all tests and test related data is in the

“src/test” folder.

16

test case, while we differentiate between covered production lines and covered

test lines. This differentiation is done based on the path in which the covered

class resides. If the path starts with “src/test” we know that the covered class

is a test class, because it resides in the test folder, and is a production class

otherwise.

In contrast to just counting Lines of Code (LOC) or Logical Lines of Code

(LLOC) this approach has the advantage that we consider all executed test

code than just the lines of the test case itself. Otherwise our results could

be biased. For example, tests that have longer set up methods (e.g., to bring

several units into a testable state) but less LOC would show better results after

the normalization than tests where the whole set up is integrated into the test

case itself.

3.2.3. Collecting the Mutation Detection Capabilities

We collect the mutation data for every test case that was executed (i.e., is

available in the coverage data) by using PIT [52] within our framework. We

execute PIT for every test case separately, i.e., for each test case that was

covered. Each test case is run against all generated mutations. Afterwards, the

results of PIT are parsed. If a test case failed when it is executed alone, e.g.,

because this test case depends on the execution of other test cases, we excluded

it and no mutation data is collected. As the collection of the mutation data

is very computation intensive (i.e., our case study experiment consumed about

35 months of CPU time) we executed it on the High Performance Computing

(HPC) cluster hosted by the datacenter of our university3.

We followed the best practices on using mutation testing in controlled exper-

iments by Papadakis et al. [26]. Hence, we also report on each decision regarding

the mutation analysis, as advised by Papadakis et al. [26] in the following:

• Mutation Testing Tool: We used the paper by Kintis et al. [53] as a

basis for our decision regarding the mutation testing tool. We decided

3See: http://gwdg.de.

17

http://gwdg.de

to use PIT [54] for our analysis due to several reasons. PIT possess a

good defect-revelation ability [53], which is an indicator of its suitability

for our experiments. Only PITRV [52, 55] was able to reveal more defects

(115:122). Nevertheless, we decided against PITRV, because the number of

generated equivalent mutants is substantially higher for PITRV in contrast

to PIT [53]. In fact in the experiment by Kintis et al. [53], PITRV generated

about 88.74% more equivalent mutants than PIT. As equivalent mutants

are a substantial threat to the validity of our study, we decided to use

PIT instead of PITRV. We even contributed to the development of PIT

by creating a pull request that got accepted and integrated into the version

1.3.2 of PIT. This addition was required for our work and allows PIT users

to filter the test cases of tests against which the mutations are challenged.

• Mutant Redundancy: Mutant redundancy might have a large impact

on the validity of our study. Hence, it is important to care for this kind

of thread. We generated a lot of mutants for our case study (> 500.000).

Checking them all by hand to detect duplicate mutants is a task that

would take a substantial amount of time and is error prone. To mitigate

this threat, we create the set of disjoint mutants, which only includes

mutants that are not killed collaterally by most of the test cases [53, 56].

• Mutant Selection: We decided to use all mutation operators that are

provided by PIT [54], including operators that are also used in integration

mutation testing approaches [57, 58]. The reason for this is that we wanted

to generate a huge set of mutants so that we reduce the possibility that

we only generate mutants of low quality or mutants that favor only one

specific type of test. There is also empirical evidence that supports this

approach [53].

• Test Suite Choice and Test Suite Size: The test suite choice and its

size is set, as we reuse the test suites that were created by the developers

of the projects.

18

• Clean Program Assumption: The general problem that the Clean Pro-

gram Assumption (CPA) describes is that test suites are assessed on the

mutated program instead the original one for which they were created [34].

While we know of the CPA, we do not need to take it into account for our

experiment, as we do not compare different testing techniques or rely on

the coverage measured on the clean program.

• Multiple Experimental Repetitions: The calculations for our case

study are only done once, as we do not have a component in it that make

stochastic choices. The mutants that are generated for each test case and

the result (i.e., mutant was killed or not) are not affected by any stochastic

process. An exception is the creation of the disjoint mutant set, as the

used algorithm is not deterministic. Hence, we repeated the generation of

the disjoint mutant set 10 times.

• Presentation of the Results: As described above, our results are pre-

sented on the test case level.

3.2.4. Collecting the Defect Classification

To gather the defect classification for each integrated mutation, we created

the BugFixClassifier [59]. The BugFixClassifier is able to extract changes be-

tween two files using CHANGEDISTILLER by Fluri et al. [60] and classify the

detected changes based on the approach by Zhao et al. [37]. CHANGEDIS-

TILLER [60] is a tool that extract changes between two source code files based

on the comparison of the Abstract Syntax Trees (ASTs) of both files. For each

change that CHANGEDISTILLER detects, it outputs the change type (as de-

fined by Fluri et al. [60]), which entity was changed (e.g., an if-statement or

a method), and the parent entity of the changed entity (e.g., the initialization

part of a for loop). We decided to reuse CHANGEDISTILLER, as it provides

us with fine-grained source code changes that can be mapped onto the defect

types as defined by Zhao et al. [37]. Hence, we do not only store each generated

mutation, but also the type of defect that was integrated.

19

Zhao et al. [37] determines the class of a defect based on the change that

was made to fix the corresponding defect. The only change that we made to

the classification schema is that we excluded the category CPD, because pre-

processor directives are not available in Java. We reuse the detailed change

types of Fluri et al. [60] and provide a mapping between them and the defect

classes defined by Zhao et al. [37]. This mapping is only provided for the upper

categories of our classification scheme (i.e., data, computation, interface, log-

ic/control, others), as this level of detail is sufficient for our purpose. Table A.7

in Appendix A shows the mapping between the change types of Fluri et al. [60]

and the defect classes of Zhao et al. [37]. Unfortunately, some change types of

Fluri et al. [60] cannot be directly mapped onto a defect class. For these change

types we need to consider the type of the changed entity and/or the type of the

parent entity (e.g., METHOD, FOR INIT). Table A.8 in Appendix A shows the

conditions that need to be fulfilled for a change together with the defect class

that is then assigned.

In general, the original non-defective and the defective source code is needed

for the defect classification. The changes between both are then extracted and

classified based on the rules above. Unfortunately, our used mutation testing

framework does not offer the possibility to get the mutated source code, as the

mutation is done on the byte-code of the original code [52]. Nevertheless, some

of the applied mutations can be directly mapped to a defect class. This mapping

is presented in Table 2. For mutation operators that can not be directly mapped

we used the following approach. The mutation testing framework outputs the

mutation operator that is used and the line in which it was used. Hence, we

integrate the corresponding defect (based on the mutation operator) into the

original source code in the specified line. Afterwards, we extract the changes

between the original and the now defective source code to get the class of the

defect that was integrated by the mutation operator.

20

3.3. Data Analysis and Results

In the following sections, we explain our analyzed data sets, our analysis

procedures, as well as our results for our RQ.

3.3.1. Data Sets

Table 3 shows the number of unit and integration tests for each project

release that we analyzed, together with the number of not classifiable tests,

the sum of the TestLOC for unit and integration tests, the number of unique

mutants that are generated, and the number of analyzed tests. Overall, this

table highlights that nearly all projects have more integration than unit tests.

Nevertheless, for some projects our approach was not able to classify all test

cases. We looked into all not classifiable test cases and found several reasons for

this, e.g., tests that are empty, tests that directly return without executing any

production code, or test cases that only test constants of classes. Furthermore,

Table 3 highlights the number of generated unique mutants4 together with the

number of analyzed tests. The number of analyzed tests can be lower than the

number of all test cases of the projects, as our mutation testing tool might not

be able to run the test alone (e.g., test cases that fail if they are executed alone).

4Not all mutants are generated for each test case as our mutation testing tool pre-selects

mutations against which the test case should run by using the coverage data of the test

case [52].

21

Mutation Operator Defect Class

AP: Argument Propagation Interface

BFR: Boolean False Return Logic/Control

BTR: Boolean True Return Logic/Control

CB: Conditionals Boundary -

CC: Constructor Calls Interface

EOR: Empty Object Return Logic/Control

I: Increments -

IC: Inline Constant Data

IN: Invert Negatives -

M: Math -

MV: Member Variable Computation

NR: Naked Receiver Interface

NC: Negate Conditionals -

NVMC: Non Void Method Calls Interface

NR: Null Return Logic/Control

PR: Primitive Return Logic/Control

RC: Remove Conditionals Logic/Control

RI: Remove Increments -

RS: Remove Switch Logic/Control

RV: Return Values Logic/Control

S: Switch Logic/Control

VMC: Void Method Calls Interface

Table 2 Mapping between the used mutation operators and the defect class.

22

Pro

ject

#U

T#

IT#

NC

Tes

tLO

C

(UT

)

Tes

tLO

C

(IT

)

#U

niq

ue

Mu

tants

#A

naly

zed

Tes

ts

com

mon

s-b

eanu

tils

285

890

22

10966

44510

11310

1175

com

mon

s-co

dec

707

146

21

6160

1136

9059

853

com

mon

s-co

llec

tion

s19

254005

707

63048

260958

26464

5930

com

mon

s-io

634

504

12

17345

13185

9593

1138

com

mon

s-la

ng

3507

471

86

36718

10841

36549

3978

com

mon

s-m

ath

775

5709

410812

157902

113469

6484

dru

id94

4037

1585

71456

123835

4127

fast

json

315

3842

19

1968

41312

48289

4147

goog

le-g

son

215

797

22185

11115

8816

1012

guic

e28

673

0265

13526

10851

688

Hik

ariC

P21

97

2291

5848

4967

117

jack

son

-cor

e47

727

0490

17567

34361

774

jfre

ech

art

228

1946

12642

31240

96104

2174

jod

a-ti

me

179

3975

22

2810

83739

32951

4153

jsou

p64

525

4705

11772

14171

588

myb

atis

-322

4824

51449

28577

17126

1043

zxin

g10

8293

01564

13809

29592

401

Ove

rall

9356

29461

908

160003

818493

627507

38782

Table

3P

roje

cts

wit

hth

eir

collec

ted

data

,in

cludin

gU

nit

Tes

ts(U

T),

Inte

gra

tion

Tes

ts(I

T),

and

not

class

ifiable

test

s(N

C).

23

We create two different data sets for the analysis of our research question

based on the collected data. These data sets represent different perspectives on

the questions at hand.

• ALL: This data set consists of the test results for all generated mutations.

With this data set we want to assess the defect detection capabilities of

unit and integration tests for a large data set with many different defects

that are integrated.

• DISJ: This data set consists of the test results for the set of disjoint

mutants (see Section 3.2.3). With the analysis of this data set we can

gain insights into the defect detection capabilities of unit and integration

tests for defects that are hard to kill [56].

3.3.2. Analysis Procedure

For all of the in Section 3.3.1 presented data sets we executed the following

analysis process.

1. Calculate the number of detected defects: We gather the number

of all detected defects by summing up the number of defects that are

detected by a unit or an integration test. We consider a mutation that

is killed as a detected defect. The results for test cases that are executed

with several different parameters, are combined and analyzed as one test

case. As the algorithm applied to create the set of disjoint mutants is not

deterministic, we repeat the analysis process using the disjoint mutant set

10 times and take the average of all 10 runs for the number of detected

defects.

2. Build the sum for each test type: We divide the detected defects

by their type and sum them up for unit and integration tests separately.

This way, we can assess if unit or integration tests are more effective in

detecting a certain kind of defect. We excluded the defect type Other

from our analysis, as it does not represent a real defect type, but more

a type of change that can not be classified as one of the other types (see

24

Section 2.4). Note, that we have prespecified the subgroup analysis for

this paper. Hence, it is not a post hoc analysis, which would threat the

validity of our study [61].

3. Normalize by the number of TestLOC: The resulting sums from the

previous step are normalized by the number of TestKLOC to create scores.

Hence, the result is the number of detected defects per 1000 TestLOC. We

perform this normalization step because we want to include the effort that

is done to create tests into our analysis. Hence, this normalization step

is needed, as we would otherwise just compare the number of detected

defects, which would ignore the effort (i.e., the TestLOC).

4. Statistical Testing: As a last step, we take the normalized values of

each project and perform statistical tests to check if the differences be-

tween the unit and integration test scores are significant. We first check

if our populations follows a normal distribution (applying the Shapiro-

Wilk test [62]) and have equal variances (applying the Levene test [63])

to choose an appropriate significance test. Based on the results, we either

perform a students t-test [64] or a Mann-Whitney-U test [65]. We use a

significance level of α = 0.05 for all of our statistical tests. As we are per-

forming multiple statistical tests, which could increase the overall change

of false discoveries (Type 1 Errors) [66], we need to apply corrections for

multiple comparisons. We decided to use the Bonferroni correction [66]

for all of our statistical hypothesis tests that we made. Overall, we use

8 statistical hypothesis tests: we use our ALL and DISJ data sets and

check for differences in the scores for each defect type (i.e., COMPUTA-

TION, DATA, INTERFACE, LOGIC/CONTROL), which results in eight

different tests (two data sets * 4 defect types). Therefore our adjusted

significance level is α∗ = 0.05/8 = 0.00625.

3.3.3. Results

Tables 4, and 5 show the scores (i.e., number of detected defects per TestK-

LOC) of unit and integration tests, separated by the type of defect that they

25

have detected. Additionally, the mean and standard deviation is shown for each

column. Additional tables including the number of defects found by unit tests,

integration tests, and both for each defect type are available in the supple-

mentary material (see Section 3.4). Furthermore, the supplementary material

includes Venn-Diagrams for each project to visualize these numbers.

Table 4 depicts the results for the data set ALL. It shows that if we separate

the integrated defects by type, the mean scores of integration tests are higher

than the mean scores of unit tests for each defect type except DATA defects,

while they have a higher standard deviation. Hence, on average it seems that

integration tests are more effective (i.e., scores are higher) for any defect type,

except DATA defects. Furthermore, Table 4 also highlights the differences be-

tween projects. For some projects the results are as expected (i.e., integration

tests detect interface defects more effectively, while other defects are more ef-

fectively detected by unit tests), while for other projects it is vice versa (e.g.,

jackson-core).

26

COMP. DATA INT. L/C

Project UT IT UT IT UT IT UT IT

commons-beanutils 3.83 2.29 6.29 3.30 38.39 32.44 69.31 44.69

commons-codec 71.27 22.01 134.42 66.02 252.44 344.19 389.12 448.94

commons-collections 2.24 1.25 2.98 1.11 8.31 7.05 20.08 13.62

commons-io 11.01 15.09 20.12 19.34 57.94 64.77 101.64 104.29

commons-lang 24.59 17.06 61.36 44.46 125.25 162.62 354.29 276.36

commons-math 28.39 38.68 41.90 76.47 86.57 129.81 195.15 215.08

druid 133.33 57.56 169.23 55.92 194.87 396.17 558.97 501.95

fastjson 31.50 50.47 66.57 109.65 63.01 213.72 263.72 328.86

google-gson 66.36 17.09 59.04 14.22 65.45 115.61 174.37 157.17

guice 11.32 37.93 18.87 31.05 226.42 223.94 362.26 264.82

HikariCP 37.80 30.61 65.29 28.73 151.20 125.68 109.97 163.13

jackson-core 87.76 153.07 136.73 154.32 293.88 195.42 561.22 589.97

jfreechart 36.34 76.47 71.16 92.93 75.32 290.01 281.98 593.41

joda-time 7.47 13.76 19.93 21.91 65.12 100.98 87.19 158.89

jsoup 5.67 32.45 19.86 58.78 41.13 280.33 114.89 372.07

mybatis-3 15.87 13.89 18.63 12.25 193.24 119.15 204.28 116.21

zxing 41.56 81.83 190.54 211.46 203.32 282.42 435.42 611.12

Mean 36.25 38.91 64.88 58.94 125.99 181.43 251.99 291.80

StDev 35.45 37.67 58.56 56.96 86.22 110.24 168.34 197.19

Table 4 Scores for unit and integration tests for the ALL data set, separated by

defect type.

Table 5 highlights the results for the DISJ data set. It shows a different

picture than Table 4. For the disjoint mutant set our results show that (on

average) integration tests are less effective in detecting any type of defect than

unit tests, i.e., the mean of integration test scores are lower than the mean of

unit test scores for any of the defect types. But, the standard deviation for

integration test scores are lower than the standard deviation of unit test scores

for any of the defect types. Nevertheless, Table 5 also highlights, that some

mutations are only detected by integration tests (e.g., interface defects for the

27

projects fastjson, google-gson, guice, HikariCP, jackson-core) or only unit tests

(e.g., computation defects for the projectscommons-beanutils, commons-lang).

COMP. DATA INT. L/C

Project UT IT UT IT UT IT UT IT

commons-beanutils 0.09 0.00 0.00 0.02 0.18 0.09 0.64 0.38

commons-codec 1.30 0.88 0.65 0.00 2.44 2.64 7.31 2.64

commons-collections 0.02 0.00 0.02 0.01 0.13 0.04 0.40 0.11

commons-io 0.35 0.15 0.23 0.08 1.56 0.15 1.50 0.76

commons-lang 0.44 0.00 0.93 0.18 2.31 0.46 8.14 0.46

commons-math 0.28 0.28 0.00 0.16 0.37 0.16 1.85 1.04

druid 0.00 0.64 6.84 0.34 6.84 1.75 17.09 2.70

fastjson 0.51 1.02 0.00 1.19 0.00 2.52 1.52 6.00

google-gson 0.92 0.09 0.00 0.27 0.00 0.09 0.92 0.81

guice 3.77 0.30 0.00 0.44 0.00 1.11 0.00 1.85

HikariCP 0.00 0.00 0.00 0.17 0.00 0.17 0.00 0.17

jackson-core 0.00 0.23 2.04 0.06 0.00 0.17 10.20 0.91

jfreechart 1.89 0.29 0.38 0.10 0.76 0.54 5.30 1.28

joda-time 0.00 0.12 0.00 0.16 0.71 0.57 1.42 2.67

jsoup 0.00 0.00 0.00 0.08 0.00 0.08 0.00 0.08

mybatis-3 0.69 0.14 0.69 0.00 1.38 0.07 0.00 0.03

zxing 0.64 0.00 0.00 0.14 0.00 0.07 0.64 0.29

Mean 0.64 0.24 0.69 0.20 0.98 0.63 3.35 1.31

StDev 0.97 0.31 1.67 0.28 1.72 0.86 4.78 1.52

Table 5 Scores for unit and integration tests for the DISJ data set, separated by

defect type.

Table 6 presents the p-values of the significance tests that were performed

between the scores of unit and integration tests for the data sets ALL and

DISJ and each defect type. It highlights, that there are no statistical significant

differences in the effectiveness of unit and integration tests for any defect type.

28

Defect Type ALL DISJ

COMPUTATION p = .352 p = .152

DATA p = .766 p = .220

INTERFACE p = .112 p = .278

LOGIC/CONTROL p = .531 p = .267

Table 6 P-values of the significance tests that were performed between the scores of

unit and integration tests for the data sets ALL and DISJ and each defect type.

Answer to our RQ: Tables 4 and 5 highlight, that there are differences in

the effectiveness between unit and integration tests for the different defect

types if we compare the means in the tables. The results which test type

is more effective for which defect type is not consistent over our data sets,

indicating that unit tests are more effective in detecting “hard to kill”

defects. Furthermore, the differences in the effectiveness are not significant,

as Table 6 highlights. Our results also vary from project to project: there

are some projects which seem to have more effective unit tests for certain

defect types, while in others integration tests are more effective for the

same defect type.

3.4. Replication Kit

To facilitate further insights and the replication of our study, we provide a

replication kit [67]. The replication kit contains the following data:

• all source code used for the data collection, including the used versions

of the COMFORT framework, the BugFixClassifier, and the used tools of

the SmartSHARK environment;

• all source code used for the analysis of the collected data;

• a MongoDB database dump with all raw results, except the developer

information; and

• additional visualizations of the results.

29

4. Discussion

The results of our research were rather unexpected for us, as it does not

represents what we have learned and what we teach at our university. Overall,

we could not find any significant differences in the effectiveness between unit

and integration tests for any of the data sets that we tested. These results are

interesting out of several perspectives.

Education: Academia, as well as organizations like the ISTQB, teach that

integration and unit tests are equally important and find different types of de-

fects (i.e., integration tests detect interface defects, whereas unit tests detect

other kind of defects). They also highlight that a separation between unit and

integration tests make sense and should be done. Nevertheless, our results

show that there is no significant difference in the effectiveness between unit

and integration tests for any of the defect types. Hence, it seems that unit

and integration tests are even equally effective in detecting interface defects.

In addition, we have seen that there are defects that are detected by both test

types. These results contradict the accepted belief that unit and integration

tests detect different types of defects in the software. This raises the question, if

those definitions still fit to modern development contexts or if we should apply

another distinction criterion to differentiate between unit and integration tests.

It might not be a good idea to distinguish tests based on their properties, but

on their usage as Ammann et al. [19] suggest. They reason that “most of the

literature emphasizes these levels in terms of when they are applied, a more

important distinction is on the types of faults that we are looking for.” [19].

Hence, Ammann et al. [19] suggest that tests should be categorized according to

their purpose or usage (i.e., types of defects targeted) and not according to their

properties or when they are applied. While they use a different terminology,

the core of the separation between unit and integration tests is similar to the

definitions of the IEEE. This paper provides empirical data that it might not

make sense to differentiate unit and integration tests in the way we nowadays

do. Instead, we should create new definitions that fit to the modern software

30

development contexts and should follow the proposal of Ammann et al. [19] to

separate tests based on which types of defects are targeted. Hence, we would

need to revise current valid definitions that we teach in academia and industrial

certifications. Our study results can provide valuable insights that could help

with this revision.

Practice: Software development should always be accompanied by testing

the developed parts. Our results show, that there seems to be no need to focus

on the type of test developed, at least if we only consider the effectiveness of

tests and classify them according to the IEEE definition. Nevertheless, other

influences could play a role, e.g., if a test should serve as documentation [68].

Our results highlight, that there are differences from project to project. Hence,

it seems that the context in which a project is developed has an influence on the

effectiveness of the test types, as for some projects unit tests are more effective

than integration tests, while for others it is vice versa. However, it seems that

it is more important that developers “just create tests” instead of caring about

mocking classes or which test type is developed.

To come back to the cited discussion from our introduction, where developers

discuss if they should change their testing habit and focus more on integration

than unit testing: our results show that this change would not have a negative

impact, as both test types are equally effective. Hence, if developers argue that

integration tests are more realistic and valuable in their daily developer life,

our results do not argue against this practice. However, we also found that

some defects were only detected by either unit or integration tests. Moreover,

our results highlight that mutants that are hard to kill are (on average) more

effectively killed by unit tests, as the results with the DISJ data set depict.

Hence, our results show that it is still favorable to test on both test levels.

31

5. Threats to Validity

In this section, we discuss the external, internal, and construct validity of

our study together with the validation procedures that we have taken to counter

or measure those threats.

5.1. Construct Validity

Threats to this type of validity are concerned with the degree to which our

analysis really measures what we intended it to measure. While we carefully

tested our tools and scripts via manually written tests and manually curated

samples of data, there can still be defects, which can influence our results.

For the collection of the mutation data we make use of PIT [52], which is a

mature mutation testing framework that is often used in research, e.g. [53, 55].

Hence, it is less likely that the mutation testing is not working correctly. Every

problem that occurred during the mutation testing was manually inspected.

We found that PIT sometimes reported that a test was not finishing without a

failure, but this only occurred for 35 out of 36435 test cases. Furthermore, PIT

is designed for mutation testing on unit level. While PIT offers some mutation

operators that are also offered by tools that are used for integration mutation

testing (e.g. [57, 58]), this still could have an influence on our results.

The threat of mutant redundancy is substantial for our kind of analysis. We

tried to reduce this thread by including the disjoint mutant set into our analysis

process, as proposed by Kintis et al. [53]. Nevertheless, this is an approximation

and it might not be the most fitting set of mutants.

Another threat is our choice of the defect classification scheme. A different

classification scheme might produce different results for our research question.

But, we have chosen the scheme by Zhao et al. [37] as it is close to the code,

which is needed to classify integrated mutations, and it provides a good overview

of different defect types.

Moreover, our approach to create a defect classification for an integrated

mutation can be flawed for some mutations. For example, if a mutation is in-

tegrated in a line which has several statements on it so that the re-integration

32

of the defect in the correct source code could be done in the wrong statement.

To measure this threat, we manually checked a sample of our defect type clas-

sifications. None of the defect type classifications had the problem mentioned

above.

We reuse the build file of the projects that we analyze to execute all of their

tests. Hence, it can be the case that some tests are filtered out, e.g., because

they should only be executed via an CI system as they are very long running.

It can be the case that these filtered tests are always tests from the same type

(e.g., integration tests, as they often have a higher execution time than unit

tests). Nevertheless, we assess this threat by counting the number of tests that

are excluded from execution. Overall, 67 tests are excluded. However, this

number is rather low in comparison to the overall number of executed tests (i.e.

38782).

5.2. Internal Validity

Internal validity threats are concerned with the ability to draw conclusions

from the relation between causes and effects. While we tried to create an isolated

environment where we have control over influencing variables (e.g., by using

mutation analysis to integrate defects into a software) it can be the case that

the defect detection capabilities are also influenced by other variables. We

countered the influences that we know of, e.g., by normalizing the results.

The current version of our framework can not differentiate integration from

system tests. Hence, it is possible that a system test is classified as an integra-

tion test. To mitigate this issue we included only projects that are libraries or

frameworks.

The statistical tests used in this paper rely on the accurate implementations

of the algorithms in the external library used. To ease this threat we are using

a well known public library, namely SciPy [69], for these algorithms.

5.3. External Validity

Threats to this type of validity are concerned with the ability to generalize

our results. We had a look at Java projects only. Hence, the results can be

33

different for projects written in other languages. Although we analyzed a larger

sample of projects than most other related work, the results can vary if other

projects are chosen. Especially, as our set of projects is limited to open-source

projects, even if some of the projects are developed by companies in an open-

source manner (e.g., google-gson). Furthermore, we only selected libraries or

frameworks which could potentially influence our results. Hence, our results

could be different for other project types (e.g., applications). But, our approach

needs a compilable release of a project, which is often not given as Tufano et

al. [70] highlight in their paper. This complicates a fully automatic analysis

with more data. Nevertheless, the replication of this work using other projects

(and other programming languages) is required in order to reach a more gen-

eral conclusion. Hence, we added a replication kit that includes all data and

programs of this paper to support such conceptual replications.

6. Conclusion and Future Work

In this paper, we reported an empirical study that was conducted on 17

open-source Java projects. The goal of this study was to investigate, if the ex-

isting standard definitions of unit and integration tests are still valid in modern

software development contexts. We created two different data sets to pursue this

goal: mutation testing data representing a large variety of defects that could be

introduced in the program (627507 unique mutations) and the disjoint mutation

testing data set that represents only “hard to kill” mutants. Additionally, we

classified all created mutations into different defect types to evaluate if there is

a defect type that is more effectively detected by a certain test type. Overall

we collected and analyzed the defect detection capabilities of 38782 test cases.

Our results highlight, that there are no significant differences in the defect

detection capabilities of unit and integration tests in either of our two data

sets. However, we also found that there are some defects that can only be

detected by one test type (i.e., either unit or integration tests). Furthermore,

we found that we can not state that one defect type can be more effectively

34

found by a unit or an integration test, as our tests found no statistical significant

differences. Hence, it seems that the current standard definitions do not fit to

modern software development contexts anymore, as there should be a difference

between those test types. Therefore, our results questions if the division between

unit and integration tests is reasonable for modern systems, like we develop

nowadays. These results suggest that we should create usage-based definitions

instead of property-based definitions of unit and integration tests so that they

fit to modern software development contexts.

Our future work includes but is not limited to the use of more projects with

different programming languages for the performed and additional analyses. As

additional analysis we plan a qualitative study on the defects and test cases that

we used in this study. This could help us to understand the differences between

unit and integration tests and their (not) detected defects. Furthermore, we

would like to perform a developer study on unit and integration testing practices

to get feedback from developers how they use unit and integration tests in their

daily work. In connection to this study, it would be interesting to assess the

usage of different test types in different development phases or development

models, e.g., by comparing the usage of unit and integration tests in an classical

and agile development environment.

Acknowledgements

The authors would like to thank the GWDG for the access to their HPC

resources.

[1] B. Van Rompaey, B. Du Bois, S. Demeyer, M. Rieger, On the detection of

test smells: A metrics-based approach for general fixture and eager test,

IEEE Transactions on Software Engineering 33 (12) (2007) 800–817.

[2] G. Fraser, A. Zeller, Generating parameterized unit tests, in: Proceedings

of the 20th ACM SIGSOFT International Symposium on Software Testing

and Analysis, ACM, 2011, pp. 364–374.

35

[3] A. Gambi, S. Kappler, J. Lampel, A. Zeller, Cut: automatic unit testing

in the cloud, in: Proceedings of the 26th ACM SIGSOFT International

Symposium on Software Testing and Analysis, ACM, 2017, pp. 364–367.

[4] R. Lachmann, S. Lity, M. Al-Hajjaji, F. Furchtegott, I. Schaefer, Fine-

grained test case prioritization for integration testing of delta-oriented soft-

ware product lines, in: Proceedings of the 7th International Workshop on

Feature-Oriented Software Development, ACM, 2016, pp. 1–10.

[5] D. Xu, W. Xu, M. Tu, N. Shen, W. Chu, C.-H. Chang, Automated inte-

gration testing using logical contracts, IEEE Transactions on Reliability

65 (3) (2016) 1205–1222.

[6] D. Holling, A. Hofbauer, A. Pretschner, M. Gemmar, Profiting from unit

tests for integration testing, in: IEEE International Conference on Software

Testing, Verification and Validation (ICST), IEEE, 2016, pp. 353–363.

[7] B. W. Boehm, Verifying and validating software requirements and design

specifications, IEEE Software 1 (1) (1984) 75.

[8] W. Royce, Software project management, Pearson Education India, 1999.

[9] IEEE, Systems and software engineering – vocabulary, ISO/IEC/IEEE

24765:2010(E) (2010) 1–418doi:10.1109/IEEESTD.2010.5733835.

[10] International Software Testing Qualification Board, International Soft-

ware Testing Qualification Board Glossary, http://www.astqb.org/

glossary/search/unit, [accessed 30-November-2018].

[11] JUnit Team, JUnit Homepage, http://junit.org/junit5/, [accessed

30-November-2018] (2017).

[12] Travis CI GmbH, Travis CI Homepage, https://travis-ci.org/, [ac-

cessed 30-November-2018] (2017).

[13] Jenkins Contributors, Jenkins, https://jenkins.io/, [accessed 30-

November-2018] (2018).

36

http://dx.doi.org/10.1109/IEEESTD.2010.5733835

http://www.astqb.org/glossary/search/unit

http://www.astqb.org/glossary/search/unit

http://junit.org/junit5/

https://travis-ci.org/

https://jenkins.io/

[14] K. C. Dodds, Write tests. Not too many. Mostly

integration., https://blog.kentcdodds.com/

write-tests-not-too-many-mostly-integration-5e8c7fff591c,

[accessed 30-November-2018] (2017).

[15] J. O. Coplien, Seque, http://rbcs-us.com/documents/Segue.pdf,

[accessed 30-November-2018] (2014).

[16] J. O. Coplien, Why Most Unit Testing is Waste, http://rbcs-us.com/

documents/Why-Most-Unit-Testing-is-Waste.pdf, [accessed

30-November-2018] (2014).

[17] F. Trautsch, J. Grabowski, Are there any unit tests? an empirical study

on unit testing in open source python projects, in: IEEE International

Conference on Software Testing, Verification and Validation (ICST), IEEE,

2017, pp. 207–218.

[18] A. Spillner, T. Linz, H. Schaefer, Software testing foundations: a study

guide for the certified tester exam, Rocky Nook, Inc., 2014.

[19] P. Ammann, J. Offutt, Introduction to software testing, Cambridge Uni-

versity Press, 2016.

[20] G. J. Myers, C. Sandler, T. Badgett, The art of software testing, John

Wiley & Sons, 2011.

[21] G. Orellana, G. Laghari, A. Murgia, S. Demeyer, On the differences be-

tween unit and integration testing in the travistorrent dataset, in: Proceed-

ings of the 14th International Conference on Mining Software Repositories,

IEEE Press, 2017, pp. 451–454.

[22] Apache Software Foundation, Maven Surefire Plugin Website, http:

//maven.apache.org/surefire/maven-surefire-plugin/, [ac-


37

https://blog.kentcdodds.com/write-tests-not-too-many-mostly-integration-5e8c7fff591c

https://blog.kentcdodds.com/write-tests-not-too-many-mostly-integration-5e8c7fff591c

http://rbcs-us.com/documents/Segue.pdf

http://rbcs-us.com/documents/Why-Most-Unit-Testing-is-Waste.pdf

http://rbcs-us.com/documents/Why-Most-Unit-Testing-is-Waste.pdf

http://maven.apache.org/surefire/maven-surefire-plugin/

http://maven.apache.org/surefire/maven-surefire-plugin/

[23] Apache Software Foundation, Maven FailSafe Plugin Website, http:

//maven.apache.org/surefire/maven-failsafe-plugin/, [ac-


[24] T. Kanstren, Towards a deeper understanding of test coverage, Journal of

Software: Evolution and Process 20 (1) (2008) 59–76.

[25] M. Beller, G. Gousios, A. Zaidman, Travistorrent: Synthesizing travis ci

and github for full-stack research on continuous integration, in: Proceedings

of the 14th International Conference on Mining Software Repositories, 2017.

[26] M. Papadakis, M. Kintis, J. Zhang, Y. Jia, Y. Le Traon, M. Harman, Mu-

tation testing advances: an analysis and survey, Advances in Computers.

[27] M. Papadakis, Y. Jia, M. Harman, Y. Le Traon, Trivial compiler equiva-

lence: A large scale empirical study of a simple, fast and effective equivalent

mutant detection technique, in: Proceedings of the 37th International Con-

ference on Software Engineering-Volume 1, IEEE Press, 2015, pp. 936–946.

[28] D. Schuler, A. Zeller, Covering and uncovering equivalent mutants, Soft-

ware Testing, Verification and Reliability 23 (5) (2013) 353–374.

[29] M. Daran, P. Thevenod-Fosse, Software error analysis: A real case study

involving real faults and mutations, in: ACM SIGSOFT Software Engi-

neering Notes, Vol. 21, ACM, 1996, pp. 158–171.

[30] J. H. Andrews, L. C. Briand, Y. Labiche, Is mutation an appropriate tool

for testing experiments?, in: Proceedings of the 27th International Confer-

ence on Software Engineering, ACM, 2005, pp. 402–411.

[31] J. H. Andrews, L. C. Briand, Y. Labiche, A. S. Namin, Using mutation

analysis for assessing and comparing testing coverage criteria, IEEE Trans-

actions on Software Engineering 32 (8) (2006) 608–624.

[32] R. Just, D. Jalali, L. Inozemtseva, M. D. Ernst, R. Holmes, G. Fraser, Are

mutants a valid substitute for real faults in software testing?, in: Proceed-

38

http://maven.apache.org/surefire/maven-failsafe-plugin/

http://maven.apache.org/surefire/maven-failsafe-plugin/

ings of the 22nd ACM SIGSOFT International Symposium on Foundations

of Software Engineering, ACM, 2014, pp. 654–665.

[33] A. S. Namin, S. Kakarla, The use of mutation in testing experiments and

its sensitivity to external threats, in: Proceedings of the 2011 International

Symposium on Software Testing and Analysis, ACM, 2011, pp. 342–352.

[34] T. T. Chekam, M. Papadakis, Y. L. Traon, M. Harman, An empirical

study on mutation, statement and branch coverage fault revelation that

avoids the unreliable clean program assumption, in: Proceedings of the

39th International Conference on Software Engineering, IEEE Press, 2017,

pp. 597–608.

[35] M. Papadakis, D. Shin, S. Yoo, D.-H. Bae, Are mutation scores correlated

with real fault detection? a large scale empirical study on the relation-

ship between mutants and real faults, in: 40th International Conference on

Software Engineering, May 27-3 June 2018, Gothenburg, Sweden, 2018.

[36] J. H. Hayes, Building a requirement fault taxonomy: Experiences from

a nasa verification and validation research project, in: 14th International

Symposium on Software Reliability Engineering, IEEE, 2003, pp. 49–59.

[37] Y. Zhao, H. Leung, Y. Yang, Y. Zhou, B. Xu, Towards an understanding

of change types in bug fixing code, Information and Software Technology

86 (2017) 37–53.

[38] X. Xia, D. Lo, X. Wang, B. Zhou, Automatic defect categorization based on

fault triggering conditions, in: Proceedings of the 19th International Con-

ference on Engineering of Complex Computer Systems (ICECCS), IEEE,

2014, pp. 39–48.

[39] L. Tan, C. Liu, Z. Li, X. Wang, Y. Zhou, C. Zhai, Bug characteristics in

open source software, Empirical Software Engineering 19 (6) (2014) 1665–

1705.

39

[40] R. Chillarege, I. S. Bhandari, J. K. Chaar, M. J. Halliday, D. S. Moebus,

B. K. Ray, M.-Y. Wong, Orthogonal defect classification-a concept for in-

process measurements, IEEE Transactions on Software Engineering 18 (11)

(1992) 943–956.

[41] A. J. Offutt, J. H. Hayes, A semantic model of program faults, in: ACM

SIGSOFT Software Engineering Notes, Vol. 21, ACM, 1996, pp. 195–200.

[42] J. H. Hayes, C. I. Raphael, V. K. Surisetty, A. Andrews, Fault links: ex-

ploring the relationship between module and fault types, in: European

Dependable Computing Conference, Springer, 2005, pp. 415–434.

[43] Q. Tu, et al., Evolution in open source software: A case study, in: Proceed-

ings of the Internation Conference on Software Maintenance, IEEE, 2000,

pp. 131–142.

[44] Apache Software Foundation, Maven Project Homepage, https://

maven.apache.org/, [accessed 30-November-2018] (2017).

[45] C. Beust, TestNG Documentation, http://testng.org/doc/, [ac-


[46] H. Borges, A. Hora, M. T. Valente, [dataset] improved list of popular github

repositories, https://doi.org/10.5281/zenodo.804473, [accessed

30-November-2018] (2017).

[47] MVN Repository, MVN Repository - Most Popular, https://

mvnrepository.com/popular, [accessed 30-November-2018] (2018).

[48] F. Trautsch, S. Herbold, P. Makedonski, J. Grabowski, Adressing problems

with external validity of repository mining studies through a smart data

platform, in: Proceedings of the 13th International Conference on Mining

Software Repositories, ACM, 2016, pp. 97–108.

[49] N. C. Borle, M. Feghhi, E. Stroulia, R. Greiner, A. Hindle, Analyzing the

effects of test driven development in github, Empirical Software Engineering

(2017) 1–28.

40

https://maven.apache.org/

https://maven.apache.org/

http://testng.org/doc/

https://doi.org/10.5281/zenodo.804473

https://mvnrepository.com/popular

https://mvnrepository.com/popular

[50] Oracle, What Is a Package?, https://docs.oracle.com/javase/

tutorial/java/concepts/package.html, [accessed 30-November-

2018] (2017).

[51] Apache Software Foundation, commons-io GitHub, https://github.

com/apache/commons-io, [accessed 30-November-2018] (2018).

[52] H. Coles, T. Laurent, C. Henard, M. Papadakis, A. Ventresque, Pit: a

practical mutation testing tool for java, in: Proceedings of the 25th Inter-

national Symposium on Software Testing and Analysis, ACM, 2016, pp.

449–452.

[53] M. Kintis, M. Papadakis, A. Papadopoulos, E. Valvis, N. Malevris,

Y. Le Traon, How effective are mutation testing tools? an empirical anal-

ysis of java mutation testing tools with manual analysis and real faults,

Empirical Software Engineering (2017) 1–38.

[54] H. Coles, PIT Project Homepage, http://pitest.org/, [accessed 30-

November-2018] (2017).

[55] T. Laurent, M. Papadakis, M. Kintis, C. Henard, Y. Le Traon, A. Ven-

tresque, Assessing and improving the mutation testing practice of pit, in:

IEEE International Conference on Software Testing, Verification and Vali-

dation (ICST), IEEE, 2017, pp. 430–435.

[56] M. Papadakis, C. Henard, M. Harman, Y. Jia, Y. Le Traon, Threats to

the validity of mutation-based test assessment, in: Proceedings of the 25th

International Symposium on Software Testing and Analysis, ACM, 2016,

pp. 354–365.

[57] M. E. Delamaro, J. Maidonado, A. P. Mathur, Interface mutation: An ap-

proach for integration testing, IEEE Transactions on Software Engineering

27 (3) (2001) 228–247.

41

https://docs.oracle.com/javase/tutorial/java/concepts/package.html

https://docs.oracle.com/javase/tutorial/java/concepts/package.html

https://github.com/apache/commons-io

https://github.com/apache/commons-io

http://pitest.org/

[58] M. Grechanik, G. Devanla, Mutation integration testing, in: IEEE Inter-

national Conference on Software Quality, Reliability and Security (QRS),

IEEE, 2016, pp. 353–364.

[59] F. Trautsch, BugFixClassifier GitHub, https://github.com/

ftrautsch/BugFixClassifier, [accessed 30-November-2018] (2018).

[60] B. Fluri, H. C. Gall, Classifying change types for qualifying change cou-

plings, in: 14th IEEE International Conference on Program Comprehen-

sion, IEEE, 2006, pp. 35–45.

[61] R. Wang, S. W. Lagakos, J. H. Ware, D. J. Hunter, J. M. Drazen, Statistics

in medicinereporting of subgroup analyses in clinical trials, New England

Journal of Medicine 357 (21) (2007) 2189–2194.

[62] S. S. Shapiro, M. B. Wilk, An analysis of variance test for normality (com-

plete samples), Biometrika 52 (3/4) (1965) 591–611.

[63] I. Olkin, Contributions to probability and statistics: essays in honor of

Harold Hotelling, Stanford University Press, 1960.

[64] Student, The probable error of a mean, Biometrika 6 (1) (1908) 1–25.

doi:10.1093/biomet/6.1.1.

[65] H. B. Mann, D. R. Whitney, On a test of whether one of two random

variables is stochastically larger than the other, The annals of mathematical

statistics (1947) 50–60.

[66] M. Aickin, H. Gensler, Adjusting for multiple testing when reporting re-

search results: the bonferroni vs holm methods., American journal of public

health 86 (5) (1996) 726–728.

[67] F. Trautsch, S. Herbold, J. Grabowski, [dataset] Replication Kit, https:

//doi.org/10.5281/zenodo.2267946, [accessed 30-November-2018]

(2018).

42

https://github.com/ftrautsch/BugFixClassifier

https://github.com/ftrautsch/BugFixClassifier

http://dx.doi.org/10.1093/biomet/6.1.1



[68] S. Demeyer, S. Ducasse, O. Nierstrasz, Object-oriented reengineering pat-

terns, Elsevier, 2002.

[69] SciPy developers, SciPy Website, https://www.scipy.org/, [accessed

30-November-2018] (2017).

[70] M. Tufano, F. Palomba, G. Bavota, M. Di Penta, R. Oliveto, A. De Lucia,

D. Poshyvanyk, There and back again: Can you compile that snapshot?,

Journal of Software: Evolution and Process 29 (4).

Appendix A. Change Type Mapping

Change Type Defect Class

ADDING ATTRIBUTE MODIFIABILITY Data

ADDING CLASS DERIVABILITY Interface

ADDING METHOD OVERRIDABILITY Interface

ADDITIONAL CLASS Interface

ADDITIONAL OBJECT STATE Data

ALTERNATIVE PART DELETE Logic/Control

ALTERNATIVE PART INSERT Logic/Control

ATTRIBUTE RENAMING Data

ATTRIBUTE TYPE CHANGE Data

CLASS RENAMING Interface

COMMENT DELETE Other

COMMENT INSERT Other

COMMENT MOVE Other

COMMENT UPDATE Other

CONDITION EXPRESSION CHANGE Logic/Control

DECREASING ACCESSIBILITY CHANGE Interface

DOC DELETE Other

DOC INSERT Other

43

https://www.scipy.org/

DOC UPDATE Other

INCREASING ACCESSIBILITY CHANGE Interface

METHOD RENAMING Interface

PARAMETER DELETE Interface

PARAMETER INSERT Interface

PARAMETER ORDERING CHANGE Interface

PARAMETER RENAMING Interface

PARAMETER TYPE CHANGE Interface

PARENT CLASS CHANGE Interface

PARENT CLASS DELETE Interface

PARENT CLASS INSERT Interface

PARENT INTERFACE CHANGE Interface

PARENT INTERFACE DELETE Interface

PARENT INTERFACE INSERT Interface

REMOVED CLASS Interface

REMOVED OBJECT STATE Data

REMOVING ATTRIBUTE MODIFIABILITY Data

REMOVING CLASS DERIVABILITY Interface

REMOVING METHOD OVERRIDABILITY Interface

RETURN TYPE CHANGE Interface

RETURN TYPE DELETE Interface

RETURN TYPE INSERT Interface

Table A.7 Mapping of the change types by Fluri et al. [60] that can be directly

mapped onto the defect classes by Zhao et al. [37].

44

Condition Defect

Class

CT ∈ {STATEMENT *} ∧

CE ∈ {ASSIGNMENT, POSTFIX EXPRESSION, PREFIX EXPRESSION} ∧

PE /∈ {FOR INCR}

Computation


CE ∈ {VARIABLE DECLARATION STATEMENT} ∧

PE /∈ {FOR INIT}

Data

CT ∈ {UNCLASSIFIED CHANGE} ∧

CE ∈ {MODIFIER}

Data


CE ∈ {METHOD INVOCATION, CONSTRUCTOR INVOCATION,

SYNCHRONIZED STATEMENT, CLASS INSTANCE CREATION}

Interface

CT ∈ {ADDING FUNCTIONALITY, REMOVING FUNCTIONALITY} ∧

CE ∈ {METHOD}

Interface

CT ∈ {UNCLASSIFIED CHANGE} ∧

CE ∈ {TYPE PARAMETER}

Interface


CE ∈ {IF STATEMENT, FOREACH STATEMENT, CONTINUE STATEMENT,

RETURN STATEMENT, THROW STATEMENT, SWITCH CASE,

SWITCH STATEMENT, BREAK STATEMENT, CATCH CLAUSE,

TRY STATEMENT, FOR STATEMENT, WHILE STATEMENT, DO STATEMENT,

LABELED STATEMENT}

Logic/Control


CE ∈ {ASSIGNMENT, POSTFIX EXPRESSION, PREFIX EXPRESSION} ∧

PE ∈ {FOR INCR}

Logic/Control


CE ∈ {VARIABLE DECLARATION STATEMENT} ∧

PE ∈ {FOR INIT}

Logic/Control


CE ∈ {ASSERT STATEMENT}

Other

45

Table A.8 Mapping of the change types (CT) by Fluri et al. [60], where the changed

entity (CE) and/or the parent entity (PE) needs to be taken into account to map

a change onto the defect classes by Zhao et al. [37]. The term STATEMENT *

includes the general change types, i.e., STATEMENT UPDATE, STATEMENT INSERT,

STATEMENT DELETE, STATEMENT PARENT CHANGE, STATEMENT ORDERING CHANGE.

46

Documents

Are Unit and Integration Test De nitions Still Valid for ... · de nitions still t to modern software development. Objective: We analyze, if the existing standard de nitions of unit