

Combining Spectrum-Based Fault Localization and Statistical Debugging: An Empirical Study

Jiajun Jiang†, Ran Wang†, Yingfei Xiong†, Xiangping Chen‡, Lu Zhang†
†Key Laboratory of High Confidence Software Technologies, Ministry of Education (PKU)
†Department of Computer Science and Technology, EECS, Peking University, Beijing, China
‡Guangdong Key Laboratory for Big Data Analysis and Simulation of Public Opinion
‡The School of Communication and Design, Sun Yat-sen University, Guangzhou, China
{jiajun.jiang, wangrancs, xiongyf, zhanglucs}@pku.edu.cn, [email protected]

Abstract—Program debugging is a time-consuming task, and researchers have proposed different kinds of automatic fault localization techniques to mitigate the burden of manual debugging. Among these techniques, two popular families are spectrum-based fault localization (SBFL) and statistical debugging (SD), both localizing faults by collecting statistical information at runtime. Though the ideas are similar, the two families have been developed independently and their combinations have not been systematically explored.

In this paper we perform a systematic empirical study on the combination of SBFL and SD. We first build a unified model of the two techniques, and systematically explore four types of variations: different predicates, different risk evaluation formulas, different granularities of data collection, and different methods of combining suspicious scores.

Our study leads to several findings. First, most of the effectiveness of the combined approach is contributed by a simple type of predicates: branch conditions. Second, the risk evaluation formulas of SBFL significantly outperform those of SD. Third, fine-grained data collection significantly outperforms coarse-grained data collection with little extra execution overhead. Fourth, a linear combination of SBFL and SD predicates outperforms both individual approaches.

Based on our empirical study, we propose a new fault localization approach, PREDFL (Predicate-based Fault Localization), with the best configuration for each dimension under the unified model. We then explore its complementarity to existing techniques by integrating PREDFL with a state-of-the-art fault localization framework. The experimental results show that PREDFL can further improve the effectiveness of state-of-the-art fault localization techniques. More concretely, integrating PREDFL results in an up to 20.8% improvement w.r.t. the faults successfully located at Top-1, which reveals that PREDFL complements existing techniques.

Index Terms—Software engineering, Fault localization, Program debugging

I. INTRODUCTION

Given the cost of manual debugging, a large number of fault localization approaches have been proposed in the past two decades. Among all these approaches, two families have received significant attention: spectrum-based fault localization (SBFL) [1]–[3] and statistical debugging (SD) [4]–[7]. Spectrum-based fault localization collects coverage information over different elements during test execution, and calculates the suspiciousness of each element using a risk

∗Yingfei Xiong is the corresponding author.

evaluation formula. On the other hand, statistical debugging seeds a set of predicates into the programs, collects whether they are covered and whether they are evaluated to true during test executions, and calculates the suspiciousness of each predicate using a risk evaluation formula.

Though similar in the sense of using runtime coverage information to find suspicious elements, the two families have been developed mostly independently, and researchers in each community have explored different aspects of the approaches. In the domain of SBFL, different risk evaluation formulas have been proposed and compared both theoretically and empirically [8], [9]. In the domain of SD, different classes of predicates have been designed and studied [5]–[7], [10]–[14]. Since the basic ideas behind the two families are similar, a question naturally arises: is it possible to combine the strengths of the two families, and how would the combination perform?

To answer this question, we build a unified model that captures the commonalities of the two approaches. In this model, SBFL is treated as a kind of predicate in SD, SBFL risk evaluation formulas are mapped to SD risk evaluation formulas, and the suspicious scores of predicates are mapped back to program elements. In this way, we can combine the two approaches into a unified one and explore different variation points. Based on this model, we perform a systematic study to empirically explore four variation points.

Predicates In SD, a large number of predicates have been

proposed. We explore the performance of different predicate groups, including three groups from SD and one group from SBFL. We also explore the performance of combinations of the groups.

Risk Evaluation Formulas Particularly in SBFL, a large number of risk evaluation formulas have been proposed. In this paper we evaluate how the different formulas perform over SD predicates.

Granularity of Data Collection Existing studies have seeded predicates at different granularities, e.g., at the method level or at the statement level. In this paper we explore the effect of seeding different predicates on the localization performance and the execution time.

Methods for Combining Suspicious Scores When mappingthe suspicious scores from predicates to elements, we


need to combine the suspicious scores of different predicates into one. Here we consider two ways of combining the scores: the maximum of different predicates and the weighted sum of different predicates.

Our empirical study is carried out over Defects4J [15], a collection of real-world faults that is widely used in recent studies [16]–[19]. The empirical study leads to several findings, as listed below.

• Among all predicates, the predicates from branch conditions contribute most to the Top-1 recall of fault localization accuracy.

• Using SBFL formulas for SD predicates significantly increases the localization accuracy: a 227.9% increase in Top-1 compared with using the original SD formula and a 52.7% increase compared with a revised SD formula.

• Collecting information at the statement level leads to a 69.9% increase in localization accuracy with respect to Top-1 and a 40.0% increase in execution time compared to the method level. However, the overall localization time is still small, i.e., less than three minutes.

• A linear combination of the suspicious scores from SBFL and SD predicates leads to the best results.

Inspired by the above findings, we propose a new fault localization technique, PREDFL, that combines the predicates of SBFL and SD under our unified model. The evaluation results show that PREDFL can improve the state of the art when further combined with other techniques. In summary, we make the following contributions in this paper.

• A unified model for combining two popular families of fault localization techniques, i.e., spectrum-based fault localization and statistical debugging.

• A systematic analysis that explores different combinations of SBFL and SD under the framework with respect to four variation points, which provides insights for future fault localization research.

• A new fault localization technique, PREDFL, derived from the systematic analysis and evaluated via a contrast experiment on a state-of-the-art fault localization framework. The results show that PREDFL complements existing techniques and can further improve their effectiveness. In particular, it achieves an up to 20.8% improvement w.r.t. Top-1.

II. BACKGROUND

A. Spectrum-Based Fault Localization

Spectrum-based fault localization (SBFL) is one of the most popular approaches in recent studies [20]–[23] because of its simplicity and efficiency. More concretely, given a faulty program and a set of test cases in which at least one test failed, a typical spectrum-based fault localization approach collects coverage information for each program element while running the test cases, and then employs a predefined risk evaluation formula to compute suspicious scores for program elements, which represent how likely the elements are faulty. The granularity of elements varies on demand, and common granularities include statement and method. Different spectrum-based fault localization approaches follow the same paradigm but use different formulas to compute the suspicious scores.

To make it more rigorous, we next give the general form of spectrum-based fault localization approaches formally. A program E is a set of elements. Given a program element e ∈ E, we define the following notations:

• failed(e) denotes the number of failed test cases that cover program element e.
• passed(e) denotes the number of passed test cases that cover program element e.
• totalfailed denotes the number of all failed test cases.
• totalpassed denotes the number of all passed test cases.

A risk evaluation formula is a function mapping each element to a suspicious score based on the above four values. For instance, a popular formula in the SBFL technique is Ochiai [24], which is defined as below.

Ochiai(e) = failed(e) / √(totalfailed · (failed(e) + passed(e)))    (1)

This formula reveals the basic idea of spectrum-based fault localization approaches: the more frequently an element is executed by failed test cases, the bigger its suspicious score; the more frequently it is executed by passed test cases, the smaller its suspicious score. The formula thus demonstrates the relationship between test cases and program elements.
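For illustration, Equation 1 can be sketched directly as a small function (a minimal sketch; the guard against a zero denominator is our addition, not part of the original definition):

```python
import math

def ochiai(failed_e, passed_e, total_failed):
    """Ochiai suspiciousness of one element (Equation 1).

    failed_e / passed_e: failed / passed tests covering the element;
    total_failed: total number of failed tests in the suite.
    """
    denom = math.sqrt(total_failed * (failed_e + passed_e))
    # Guard (our addition): an element covered by no test gets score 0.
    return failed_e / denom if denom > 0 else 0.0

# An element covered by the only failed test and no passed test is
# maximally suspicious.
assert ochiai(1, 0, 1) == 1.0
```

As the sketch shows, an element executed only by failed tests approaches the maximum score of 1, while passed executions inflate the denominator and pull the score down.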

An SBFL approach assigns suspicious scores to elements based on a risk evaluation formula. Given a risk evaluation formula r and a program E, a simple SBFL approach directly returns the suspicious score from r for each element, as follows.

SBFL_simple^r(E) = {(e, r(e)) | e ∈ E}

Recent approaches also consider collecting information at a granularity finer than the granularity of the target elements. For example, for method-level fault localization, several recent approaches [17], [18] first calculate suspicious scores at the statement level, and then take the highest suspicious score of the inner statements as the suspicious score of the method. To model this behavior, we introduce a granularity function g : E → 2^E that maps an element to a set of sub-elements, and then compute the suspicious score for each of them. If we do not calculate suspicious scores at the sub-element level, g just maps an element to itself. Given a risk evaluation formula r, a granularity function g, and a program E, an SBFL approach is defined as follows.

SBFL_{r,g}(E) = {(e, max_{e_i ∈ g(e)} r(e_i)) | e ∈ E}
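A minimal sketch of this definition on hypothetical data (the element names, statement-level scores, and granularity mapping below are invented for illustration; a dictionary's `get` method stands in for the formula r):

```python
# Hypothetical statement-level suspicious scores for two methods.
scores = {"m1.s1": 0.2, "m1.s2": 0.9, "m2.s1": 0.4}
# Granularity function g: each method maps to its statements.
g = {"m1": ["m1.s1", "m1.s2"], "m2": ["m2.s1"]}

def sbfl(elements, r, g):
    # SBFL_{r,g}: a method's score is the max over its sub-elements.
    return {e: max(r(ei) for ei in g[e]) for e in elements}

result = sbfl(["m1", "m2"], scores.get, g)
assert result == {"m1": 0.9, "m2": 0.4}
```

With the identity granularity function (each element mapping to itself), this degenerates to the simple SBFL approach above.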

B. Statistical Debugging

Statistical debugging (SD) was originally proposed for remote debugging by Liblit et al. [4], [5]. Given a faulty program, SD dynamically instruments the program with a set of predefined predicates. Then the program is deployed for execution, where the values of predicates during executions


are collected. Based on the information collected from many executions, SD identifies the important predicates that are related to the root cause of the failure.

More formally, given a program, we use a set P to denote the instances of predicates seeded into the program. Given a predicate instance p ∈ P, we use the following notations to denote the information related to the predicate instance.

• F(p) denotes the number of failed executions where p is evaluated to true at least once.
• S(p) denotes the number of successful executions where p is evaluated to true at least once.
• F0(p) denotes the number of failed executions where p is covered.
• S0(p) denotes the number of successful executions where p is covered.

Similar to SBFL, SD uses a formula to determine the suspiciousness of a predicate, known as the importance score in the SD context. An importance formula i for SD is a function that maps a predicate to an importance score based on the above values. A standard importance formula [5] is defined as follows.

Importance(p) = 2 / (1/Increase(p) + 1/Sensitive(p))    (2)

where

Increase(p) = F(p)/(S(p) + F(p)) − F0(p)/(S0(p) + F0(p))

Sensitive(p) = log(F(p)) / log(totalfailed)

In Equation 2, the importance formula is the harmonic mean of two components, Increase(p) and Sensitive(p), where Increase(p) distinguishes the value distributions of predicate p in failed executions among all executions that cover p, while Sensitive(p) captures how often failed test cases evaluate predicate p to true. The more frequently p is true in failed executions, the bigger its importance score; the less frequently p is true in failed executions, the smaller its importance score. Hence, the importance formula reflects the relationship between test cases and the values of predicates. In the rest of the paper we shall refer to the above importance formula as the SD formula. In particular, when totalfailed = 1, the score of a predicate is 0 to avoid a "divide by zero" error, according to the definition in the paper [5].
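Equation 2, including the totalfailed = 1 special case, can be sketched as follows (a minimal illustration; the guards against F(p) = 0 and non-positive components are our additions to keep the harmonic mean well defined, and are not spelled out in the source):

```python
import math

def importance(F, S, F0, S0, total_failed):
    """Standard SD importance score (Equation 2): harmonic mean of
    Increase(p) and Sensitive(p).

    F / S: failed / successful executions where p was true at least once;
    F0 / S0: failed / successful executions where p was covered.
    Returns 0 when total_failed == 1, following the original definition.
    """
    if total_failed == 1 or F == 0:
        return 0.0
    increase = F / (S + F) - F0 / (S0 + F0)
    sensitive = math.log(F) / math.log(total_failed)
    # Guard (our addition): the harmonic mean is undefined for
    # non-positive components, so treat them as score 0.
    if increase <= 0 or sensitive <= 0:
        return 0.0
    return 2.0 / (1.0 / increase + 1.0 / sensitive)
```

For example, a predicate true in both of two failed runs and never in the two successful runs that cover it gives Increase = 0.5 and Sensitive = 1, hence an importance of 2/3.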

A typical SD approach uses three groups of predefined predicates: branches, returns and scalar-pairs [5].

• The branches group includes all conditional expressions and their negations in the program, such as the expression in an if statement.
• The returns group compares the method return value (if it exists) with the constant 0 with respect to a set of comparators, i.e., >, <, ≥, ≤, == and ≠.
• The scalar-pairs group compares the left-value of an assignment with all in-scope variables or constants of the same type.

The three groups of predicates are inserted into different locations of the program. As described by the predicate definitions, the branches predicates are instrumented at the locations of conditional expressions, the returns predicates are instrumented at the locations of return statements, while the scalar-pairs predicates are instrumented at the locations of assignments or variable declarations.
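As a concrete illustration, the six predicate instances of the returns group for one observed return value can be enumerated as below (illustrative only; the actual approach instruments the Java source code rather than evaluating predicates after the fact):

```python
import operator

# The six comparators of the returns group: the return value is
# compared against the constant 0.
RETURN_COMPARATORS = {
    ">": operator.gt, "<": operator.lt, ">=": operator.ge,
    "<=": operator.le, "==": operator.eq, "!=": operator.ne,
}

def returns_predicates(value):
    """Evaluate every returns-group predicate for one return value."""
    return {name: op(value, 0) for name, op in RETURN_COMPARATORS.items()}

assert returns_predicates(3) == {
    ">": True, "<": False, ">=": True,
    "<=": False, "==": False, "!=": True,
}
```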

To model the seeding of predicates, we define the concept of a seeding function. A seeding function s maps a program element to a set of predicate instances. By using different seeding functions, we get different sets of predicate instances.

Given an importance formula i and a seeding function s, an SD approach is a function mapping a program (a set of elements) to a set of predicate instances with their importance scores.

SD_{i,s}(E) = {(p, i(p)) | p ∈ s(e), e ∈ E}

III. THE UNIFIED MODEL

As we can see from the previous section, there are many commonalities between the two families of approaches, e.g., both approaches use a formula to score elements. Therefore, it is possible to build a model that combines the two approaches.

The basic idea of our model is to treat SBFL as a kind of predicate in SD. Since SBFL concerns the coverage of an element, its behavior can be captured by a predicate "True", which evaluates to true whenever the element is covered. Formally, given an instance of predicate True at element e, True_e, we define its runtime information as follows.

F(True_e) = F0(True_e) = failed(e)

S(True_e) = S0(True_e) = passed(e)

Also, we would like to use the SBFL formulas for SD predicates. Since SBFL relies on the functions failed() and passed(), we need to make these functions work for predicate instances. Given a predicate instance p, we define the following.

failed(p) = F (p)

passed(p) = S(p)

Please note that here we use F(p) and S(p) instead of F0(p) and S0(p), because choosing the latter two would lead to the same suspicious score for all predicates inserted at the same element. In other words, we view each predicate as a sub-element that captures a subset of states at a specific program point. Based on this unification, we shall not distinguish risk evaluation formulas and importance formulas, and address them uniformly as risk evaluation formulas.

Furthermore, SBFL and SD have different results: while SBFL gives a list of suspicious program elements, SD gives a list of important predicates. To unify the results, we return a list of suspicious elements in the unified model. We assume the existence of a higher-order function, called a combining method, that aggregates the importance scores of the predicate instances to produce the suspicious score of a program element. A combining method c is a function that takes a seeding function s, a risk evaluation formula r and an element e, and produces the suspicious score of e. A basic combining


method is to take the maximum importance score of all predicates.

c(s, e, r) = max_{p ∈ s(e)} r(p)

Putting everything together, we can define the unified approach of SBFL and SD. Given a seeding function s, a risk evaluation formula r, a granularity function g, and a combining method c, a unified approach is defined as follows.

UNI_{s,r,g,c}(E) = {(e, max_{e_i ∈ g(e)} c(s, e_i, r)) | e ∈ E}    (3)

As we can see from Equation 3, the unified approach is parameterized on four components: s, r, g, and c. In the following sections we shall systematically explore the variations of the four components.
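Putting the pieces together, Equation 3 with the max combining method can be sketched on hypothetical data (the statement, predicate names and scores below are invented for illustration; "True@s1" plays the role of the SBFL predicate seeded at statement s1):

```python
def combine_max(seed, e, r):
    # Basic combining method c: maximum score over the predicates at e.
    return max((r(p) for p in seed[e]), default=0.0)

# Hypothetical data: one method m with two statements; each statement
# carries an SBFL "True" predicate, s1 also carries a branch predicate b1.
seed = {"s1": ["True@s1", "b1"], "s2": ["True@s2"]}
scores = {"True@s1": 0.3, "b1": 0.8, "True@s2": 0.1}
g = {"m": ["s1", "s2"]}

def unified(elements, seed, r, g, combine):
    # UNI_{s,r,g,c}: max over sub-elements of the combined predicate score.
    return {e: max(combine(seed, ei, r) for ei in g[e]) for e in elements}

assert unified(["m"], seed, scores.get, g, combine_max) == {"m": 0.8}
```

Note how the branch predicate b1, not the plain coverage predicate, ends up determining the method's score — exactly the kind of interaction the unified model is meant to expose.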

IV. EMPIRICAL STUDY SETUP

In this section, we systematically explore the variations of the four components in the unified model. We first explain the research questions, then introduce the experiment setup, including the subjects used in the experiment and the variations for each component. Finally, we present the evaluation metrics and the experiment implementation.

A. Research Questions

1) Which kinds of predicates are most important? After treating SBFL as a predicate in the unified model, we have an even richer set of predicates than SD approaches. In this question, we explore the effectiveness of different predicate groups and understand which kinds of predicates contribute most to the effectiveness.

2) How does the risk evaluation formula impact the effectiveness of fault localization? Existing research has both theoretically and empirically studied different risk evaluation formulas in the SBFL scenario [8], [9]. In this research question, we select a set of representative formulas to explore the effects of different ones under our unified model.

3) How does the granularity of data collection impact the fault localization result? Recently proposed method-level fault localization approaches collect program coverage data at two different granularities: statement level [17], [18] and method level [25]. Intuitively, the finer the granularity, the better the localization accuracy and the longer the execution time. However, this intuitive proposition has not been evaluated in existing studies, nor is it clear how large the increases in localization accuracy and execution time would be. Therefore, in this question, we aim to explore the impact of different granularity functions on the effectiveness and efficiency of a fault localization approach via a contrast experiment.

4) How does the combining method among different predicates impact the effectiveness of fault localization? In the unified model, we need to combine the suspicious scores of the predicates to form the scores of the program elements. In this research question we explore two basic

ways of combination: the maximum suspicious score of all predicates and the linear combination of all suspicious scores (details in Section IV-C4).

B. Subject Projects

We conducted the experiments on the Defects4J [15] (v1.0.1) benchmark, which is a commonly used dataset in automatic fault localization [17], [25] and program repair studies [21], [23], [26]–[33]. It contains 357 real-world faults from five large open-source Java projects. JFreeChart is a framework for creating charts, Google Closure is an optimizing JavaScript compiler, Apache commons-Lang and commons-Math are two libraries complementing the corresponding classes in the JDK, and Joda-Time is a standard time library. The subjects are diverse applications with different scales of source code (from 22k to 96k LOC), as listed in Table I.

TABLE I: Details of the experiment benchmark.

Project                #Bugs  #KLoC   #Tests

JFreeChart                26     96    2,205
Apache commons-Math      106     85    3,602
Apache commons-Lang       65     22    2,245
Joda-Time                 27     28    4,130
Closure compiler         133     90    7,927

Total                    357    321   20,109

In the table, column "#Bugs" denotes the total number of faults in the benchmark, column "#KLoC" denotes the average number of thousands of lines of code in a faulty program, and column "#Tests" denotes the total number of test cases for each project.

C. Experiment Configuration

To answer the research questions above, we first define a set of variations for each component of our model; together these variations form a four-dimensional design space of the combined approach. Since the whole space is too large to fully traverse, we first choose a default configuration, which was selected by a pilot study on 20 randomly sampled bugs. To ensure the representativeness of the selected bugs, we evenly and randomly sampled 4 bugs from each project in the Defects4J benchmark shown in Table I. Then, each time we change one component of the default configuration in the evaluation. In the following we introduce the variations we considered for the four components, as well as the default variation we chose for each component.

In our study we focus on method-level fault localization. That is, the fault localization method assigns a suspicious value to each method. As argued by several existing studies [25], [34], the method level is more suitable for developers than other granularities such as file level and statement level.

1) Predicates: To explore the performance of different predicates, we follow the original classification of statistical debugging, which divides the predicates into three groups, i.e., branches, returns and scalar-pairs. Additionally, we consider


the predicates of SBFL as a separate group. Therefore, in total there are four groups of predicates.

In particular, when we study a specific group of predicates, the scores of predicates in the other groups are always 0 since they are not seeded. For all other explorations, we use all the predicates (i.e., all four groups) by default.

2) Risk Evaluation Formulas: To compare the effectiveness of different risk evaluation formulas, we selected seven formulas in total: five from spectrum-based fault localization and two from statistical debugging. According to the definition in Section II-B, when totalfailed = 1, a prevalent case in the Defects4J benchmark, the score of a predicate is 0. To mitigate this influence, besides the original SD formula, we additionally employed a formula derived from it with minor changes, called NewSD. Table II presents the definitions of all formulas used in our evaluation except SD, which is defined by Equation 2.

In particular, we use Ochiai, which is commonly used in recent studies on fault localization [16] and program repair [21], [23], [30], as the default risk evaluation formula in our experiments.

TABLE II: Formulas employed in the experiment.

Name           Formula

Ochiai [24]    r(p) = failed(p) / √(totalfailed · (failed(p) + passed(p)))
Tarantula [2]  r(p) = (failed(p)/totalfailed) / (failed(p)/totalfailed + passed(p)/totalpassed)
Barinel [35]   r(p) = 1 − passed(p) / (passed(p) + failed(p))
DStar† [36]    r(p) = failed(p)^∗ / (passed(p) + (totalfailed − failed(p)))
Op2 [37]       r(p) = failed(p) − passed(p) / (totalpassed + 1)
NewSD‡         r(p) = 2 / (1/Increase(p) + log(totalfailed)/log(F(p) + 1))

† The exponent ∗ in the formula is greater than 0. Here, we set its value to 2, which is also employed by a previous study [16].
‡ The function Increase(p) in NewSD is the same as in Equation 2.

3) Granularity of Data Collection: To localize faults in methods, we explore two granularity functions that collect predicate coverage data at the method level and the statement level, respectively. The granularity function at the statement level maps each method to the set of statements inside the method, while the granularity function at the method level maps each method to a set containing itself.

The default variation for the granularity function is the oneat statement level.

4) Methods for Combining Suspicious Scores: In this paper, we use "max" or "linear" functions to combine all predicates. However, when using a linear combination, the number of components must be fixed, so we linearly combine the SBFL predicates with all the rest under our unified model. We call these two combining methods MAXPRED (i.e., MAX score of PREDicates) and LINPRED (i.e., LINear score combination of PREDicates), respectively. More formally, given a seeding function s and a risk evaluation function r, we define the combining methods as follows.

MAXPRED Given a program element e, we compute its suspicious score as the maximum score of all predicates related to it, i.e., c(s, e, r) = max_{p ∈ s(e)} r(p).

LINPRED Given a program element e, we partition the predicates related to it into two disjoint sets P1 and P2, where P1 ∪ P2 = s(e), P1 contains the one predicate from SBFL, and the others constitute P2. Then, the combining method is defined as c(s, e, r) = (1 − α) · max_{p ∈ P1} r(p) + α · max_{p ∈ P2} r(p), where α ∈ [0, 1].

In particular, we set the default variation for the combining method in the experiments to LINPRED, and the default coefficient is α = 0.5.
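LINPRED can be sketched as a small function (a minimal illustration; the scores in the example are hypothetical, and the SBFL side of the partition is a single score since P1 contains exactly one predicate):

```python
def linpred(sbfl_score, sd_scores, alpha=0.5):
    """LINPRED: linear combination of the SBFL predicate's score with
    the best score among the remaining (SD) predicates.

    alpha=0.5 is the paper's default coefficient.
    """
    best_sd = max(sd_scores, default=0.0)
    return (1 - alpha) * sbfl_score + alpha * best_sd

# With alpha = 0.5 this is the plain average of the SBFL score and
# the best SD predicate score: 0.5 * 0.6 + 0.5 * 0.8 = 0.7.
assert abs(linpred(0.6, [0.2, 0.8]) - 0.7) < 1e-12
```

Setting alpha to 0 recovers pure SBFL, and alpha = 1 keeps only the SD predicates, so the coefficient directly interpolates between the two families.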

D. Evaluation Metrics

To evaluate the effectiveness of fault localization techniques, we employed two metrics that are commonly used in existing studies.

Recall of Top-k. This metric measures how many faults can be located within the top k program elements among all candidates. In a survey conducted by Kochhar et al. [34], 73% of practitioners consider inspecting 5 program elements acceptable, and almost all practitioners agree that 10 elements are the upper bound for inspection within their acceptability level. Therefore, in this paper, we consider ranks within the top k locations, where k ∈ {1, 3, 5, 10}. Moreover, we take the mean rank of a program element to break ties when multiple elements have the same suspicious score, like existing approaches [16], [38]–[40]. Specifically, we use the ceiling function to normalize ranks to integers.

EXAM Score. This metric measures the percentage of program elements among all candidates that developers need to inspect before reaching the first faulty element; it reflects the relative ranks of faulty elements among all candidates and the overall effectiveness of a fault localization approach. Many previous studies [16], [41], [42] have employed this metric as well. The smaller the EXAM score, the better the result.
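The mean-rank tie-breaking with the ceiling function, and an EXAM score derived from the resulting rank, can be sketched as follows (a minimal illustration on hypothetical scores; the source does not spell out this computation, so the details are our reading of the mean-rank convention):

```python
import math

def mean_rank(scores, faulty):
    """Rank of `faulty` with mean-rank tie-breaking, rounded up.

    `scores` maps each candidate element to its suspicious score.
    """
    s = scores[faulty]
    higher = sum(1 for v in scores.values() if v > s)
    tied = sum(1 for v in scores.values() if v == s)
    # Mean rank over the tie block is higher + (tied + 1) / 2;
    # the ceiling normalizes it to an integer.
    return math.ceil(higher + (tied + 1) / 2)

def exam(scores, faulty):
    """EXAM score: fraction of candidates inspected to reach the fault."""
    return mean_rank(scores, faulty) / len(scores)

scores = {"a": 0.9, "b": 0.9, "c": 0.5, "faulty": 0.9}
# Three elements tie at 0.9: their mean rank is (1 + 2 + 3) / 3 = 2.
assert mean_rank(scores, "faulty") == 2
```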

E. Implementation

To collect coverage information for program elements at runtime, we implemented a program instrumentation framework that can statically instrument programs at both the statement and method level in the source code. This framework is built atop the Eclipse Java Development Tools (JDT) library1, a Java source code manipulation framework that provides rich operations on Java code, such as source code deletion, insertion and replacement. Besides, it also provides the ability to parse the types of variables and complex expressions, which is helpful for generating the predicates of statistical debugging, since those predefined predicates depend on the types of variables or expressions. Moreover, the instrumented program preserves the semantics of the original program and is free from side effects. Our implementation is publicly available at https://github.com/xgdsmileboy/StateCoverLocator.

1https://www.eclipse.org/jdt


V. RESULTS AND ANALYSIS

In this section, we present our experimental results in detail with respect to the research questions introduced in Section IV-A.

A. Effectiveness of Individual Predicates

In this section, we explore the effectiveness of the four groups of predicates explained in Section IV-C1. The results are shown in Figure 1, where the x-axis denotes the different groups of predicates and the y-axis denotes the percentage of faults successfully located within the corresponding Top-k recall. Additionally, we present the corresponding EXAM score beneath the group names. All following figures use the same notation, which we will not introduce again.

             SBFL    branches  returns  scalar-pairs
  Top-1      0.319   0.356     0.129    0.269
  Top-3      0.619   0.599     0.266    0.403
  Top-5      0.692   0.675     0.291    0.443
  Top-10     0.784   0.768     0.350    0.493
  EXAM score 0.212   0.259     0.547    0.443

Fig. 1: Fault localization results when only employing individual groups of predicates.

From the figure we can see that the predicates from branches achieved the best Top-1 result against any other group of predicates, but the predicates from SBFL are also effective, since they achieved the best Top-3 result and the best EXAM score. Compared with the other two groups of predicates, i.e., returns and scalar-pairs, the branches predicates perform significantly better, with about 0.3x-1.8x improvements in terms of Top-1. Besides, the EXAM scores of both SBFL and branches are better than those of the others. We further analyzed the reason and found that a majority of faults are related to existing conditional expressions, which are covered by both SBFL and branches; this was also observed by previous studies [43]. Furthermore, to investigate whether different groups of predicates complement each other, we analyzed the fault localization results of several sampled combinations of predicate groups and present them in Figure 2.

From the figure we can find that the combination of predicates from SBFL and branches achieves almost the same result as the approach using all predicates (the first column in the figure) and outperforms the other combinations, with improvements from 8.7% to 42.4% on Top-1. Moreover, comparing Figures 1 and 2, the combined approaches are superior to those with a standalone group of predicates in terms of both Top-1 and EXAM score, which indicates that different groups of predicates complement each

             All groups  SBFL+      SBFL+    SBFL+         returns+branches+
             (SBFL+SD)   branches   returns  scalar-pairs  scalar-pairs
  Top-1      0.423       0.423      0.297    0.389         0.356
  Top-3      0.658       0.653      0.557    0.611         0.611
  Top-5      0.720       0.717      0.639    0.697         0.678
  Top-10     0.790       0.790      0.742    0.768         0.773
  EXAM score 0.201       0.202      0.222    0.201         0.249

Fig. 2: Fault localization results when considering the combination among different groups of predicates.

other to some extent. Furthermore, the main contributors to the combined approach are the predicates from SBFL and the branches predicates of SD. The figure also shows that the combination of these two groups of predicates performs as well as using all predicates and better than the other combinations.

Finding 1. Among all predicates, those from existing branch conditions and the SBFL predicates are the two most important groups contributing to accurate fault localization in our experiment. Moreover, existing conditions contribute most to our combined approaches with respect to Top-1.

B. Different Risk Evaluation Formulas

Figure 3 presents the fault localization results of LINPRED when employing different risk evaluation formulas, showing the Top-k recall and EXAM score for each formula. In the figure, "SD" denotes the risk evaluation formula of the original statistical debugging defined by Equation 2; the other formulas are defined in Table II. As shown in the figure, there is no significant difference in either Top-k or EXAM score among the results, except for those of "SD" and "NewSD", which is consistent with previous studies reporting that different risk evaluation formulas do not produce significant differences in fault localization results. However, we can notice the difference between the two kinds of formulas, i.e., the formulas of SBFL and of statistical debugging, where the former significantly outperform the latter in our experiment, with an up to 227.9% increase at Top-1 against the original SD formula.

As explained in Section II-B, the SD formula was originally designed for remote sampling, where usually more than one failed test case exists. However, in our evaluation scenario, a bug is usually triggered by only one test case. The result comparison shows that this has a great impact on the effectiveness of the SD formula, since NewSD located double the number of faults (26.9% vs 12.9%) at Top-1 compared with the SD formula. However, it is still lower than the results of the SBFL formulas, which located on average 39.7% of faults at Top-1. This reveals that the formulas from SBFL are better at


             Barinel  DStar  Ochiai  Op2    Tarantula  SD     NewSD
  Top-1      0.406    0.359  0.423   0.401  0.395      0.129  0.277
  Top-3      0.630    0.641  0.658   0.616  0.619      0.269  0.501
  Top-5      0.709    0.706  0.720   0.686  0.697      0.347  0.566
  Top-10     0.784    0.793  0.790   0.754  0.768      0.426  0.639
  EXAM score 0.204    0.206  0.201   0.229  0.229      0.580  0.431

Fig. 3: Fault localization results when using different risk evaluation formulas.

distinguishing the differences of predicates between failed and passed executions in our experiment.

Finding 2. The risk evaluation formulas of SBFL significantly outperform the original SD formula in our experiment, with an up to 227.9% improvement in terms of the number of faults located at Top-1.
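For reference, the SBFL formulas compared above can be sketched with their standard definitions from the literature; the paper's Table II (not shown here) may use slightly different notation, so treat the exact forms as assumptions. Here ef/ep are the numbers of failed/passed tests covering a predicate or element, and nf/np the failed/passed tests not covering it; we assume at least one failed test exists.

```java
// Standard SBFL risk evaluation formulas (literature definitions).
public class Formulas {
    static double ochiai(int ef, int ep, int nf, int np) {
        double d = Math.sqrt((double) (ef + nf) * (ef + ep));
        return d == 0 ? 0 : ef / d;
    }
    static double tarantula(int ef, int ep, int nf, int np) {
        double f = (double) ef / (ef + nf), p = (double) ep / (ep + np);
        return (f + p) == 0 ? 0 : f / (f + p);
    }
    static double dstar2(int ef, int ep, int nf, int np) {  // D* with star = 2
        return (ep + nf) == 0 ? Double.POSITIVE_INFINITY
                              : (double) ef * ef / (ep + nf);
    }
    static double op2(int ef, int ep, int nf, int np) {
        return ef - (double) ep / (ep + np + 1);
    }
    static double barinel(int ef, int ep, int nf, int np) {
        return (ef + ep) == 0 ? 0 : 1.0 - (double) ep / (ef + ep);
    }
}
```

All five score a predicate higher the more it co-occurs with failures and the less with passes, which is why their Top-k results in Figure 3 cluster closely.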

C. Different Granularity of Data Collection

In this section, we explore the effectiveness of fault localization approaches under different granularity functions of data collection. Figure 4 presents the statistical results of our experiment, where the x-axis denotes the granularity of data collection, i.e., statement and method level in our evaluation.

             Statement  Method
  Top-1      0.423      0.249
  Top-3      0.658      0.412
  Top-5      0.720      0.499
  Top-10     0.790      0.608
  EXAM score 0.201      0.266

Fig. 4: Comparison of fault localization results under statement- and method-level data collection.

Based on the figure, the fault localization result when collecting predicate data at the statement level is much better than at the method level with respect to both the recall of Top-k (k=1,3,5,10) and the EXAM score. More specifically, the former improves Top-1 by up to 69.9% over the latter. This result reflects that fine-grained data collection has a more powerful capability to distinguish fault-related program elements, which is consistent with our intuition.

Finding 3. Fine-grained data collection (statement level) produces up to 69.9% better fault localization results than coarse-grained data collection (method level).

However, scalability is also essential for fault localization approaches to be practical. In the study conducted by Kochhar et al. [34], the satisfaction rate of practitioners increases with the scalability of fault localization approaches. Intuitively, finer granularity tends to cause larger data collection overhead, since more predicates will be considered and gathered, so we analyzed the execution overhead under the different data collection functions as well. We found that on average about 4.1 times as many predicates are collected at the statement level as at the method level, while the test execution time is only 1.4 times as long. Moreover, the execution time of either statement- or method-level data collection is still less than three minutes on average (mutation-based fault localization usually takes hours), which is relatively small and indicates that the increase in predicates does not heavily damage the efficiency of fault localization approaches.

Finding 4. Statement-level data collection results in 4.1 times as many predicates as method-level collection, while using 1.4 times the execution time. The overall execution time is still less than 3 minutes.

D. Different Combining Methods

In this section, we compare the effectiveness of fault localization approaches using the different combining methods under our unified model, i.e., MAXPRED (using the "max" function) and LINPRED (using the "linear" function) as introduced in Section IV-C4. We present our experimental results in Figure 5. To understand the improvement in effectiveness of the combined approaches over the individual ones, we also show the results of the traditional SBFL and SD approaches in the figure.

             SD     SBFL   MaxPred  LinPred
  Top-1      0.129  0.319  0.325    0.423
  Top-3      0.269  0.619  0.608    0.658
  Top-5      0.347  0.692  0.703    0.720
  Top-10     0.426  0.784  0.787    0.790
  EXAM score 0.580  0.212  0.208    0.201

Fig. 5: Result comparison among the traditional SBFL and SD approaches and the combined methods.

From the figure, the combined approaches can improve both the original SD and SBFL techniques with respect to Top-1. However, the improvements are quite different: the result of MAXPRED is almost the same as SBFL, while the result of LINPRED is much better than all the others, with 227.9% and 32.6% improvements over traditional SD and SBFL, respectively. The reason is simple. On the one hand, both the SBFL and SD groups of predicates contain existing conditions, which contribute most to Top-1 as concluded in the previous evaluation. On the other hand, both groups potentially contain predicates that damage fault localization accuracy, i.e., predicates with high suspicious scores at non-faulty locations; we call these predicates noises.

Take the fault Math-72 as an example. When only employing individual predicates (either from SBFL or SD), two candidate faulty methods (m1 and m2 for SBFL, m1 and m3 for SD) share the highest rank with the same suspicious score of 1.0, resulting in the faulty method being ranked 2nd. After taking the linear combination, LINPRED successfully ranked the faulty method m1 at top 1 by mitigating the noises from the individual sources, which cannot be achieved by MAXPRED. As a result, the improvement of LINPRED is significant while that of MAXPRED is not. Moreover, the result indicates that the two sources of predicates, from SBFL and SD, complement each other only when properly combined.
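The Math-72 tie can be worked through numerically. The scores of m2's SD predicates and m3's SBFL predicates below are assumptions (the paper only says m2 and m3 top one source each); the point is that the linear combination strictly separates m1 from both noises, while MAXPRED leaves all three at 1.0.

```java
public class Math72Example {
    // LINPRED with one score per source (simplified from the general form).
    static double linPred(double sbfl, double sd, double alpha) {
        return (1 - alpha) * sbfl + alpha * sd;
    }
    public static void main(String[] args) {
        double alpha = 0.5;
        double m1 = linPred(1.0, 1.0, alpha);  // faulty method: top in both sources
        double m2 = linPred(1.0, 0.6, alpha);  // SBFL noise: assumed SD score 0.6
        double m3 = linPred(0.6, 1.0, alpha);  // SD noise: assumed SBFL score 0.6
        System.out.println(m1 > m2 && m1 > m3); // prints true: m1 ranks 1st
    }
}
```

Under MAXPRED, in contrast, all three methods would keep the maximum score 1.0 and the tie would persist.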

As a matter of fact, in theory (see Sections II-A and II-B) both SBFL and SD predicates have limitations. For SBFL, since all the predicates related to different program elements are the constant True, they cannot distinguish program elements on the same execution path, even though the program states [44] at these elements can differ. On the other hand, for SD, since the predicates only exist at selective locations, e.g., at conditional expressions, they fail to identify faulty elements that do not contain them. However, by combining the predicates from these two sources, the SBFL predicates can alleviate the predicate absence of SD in some program elements, while the more concrete SD predicates can assist SBFL in better distinguishing program elements on the same path.

For example, the following code snippet comes from the Math project.

1   public double getSumSquaredErrors() {
2 -   return sumYY - sumXY * sumXY / sumXX;
3 +   return Math.max(0d, sumYY - sumXY * sumXY / sumXX);
4   }

Listing 1: Faulty code snippet of Math-105 in Defects4J.

In the code snippet, the return statement in line 2 is faulty, and its repaired code is shown in line 3. Based only on the predicates from SBFL, the faulty location was ranked 8th among all candidate locations. However, after involving the SD predicates, the faulty location ranked 1st. The key contributor is the predicate sumYY-sumXY*sumXY/sumXX<0 defined by returns in SD, which exactly captures the root cause of the fault. Similarly, as another example, for the fault Math-53, the faulty method does not contain any SD predicates, causing it to be ranked relatively low (6th). However, the combined approach successfully ranked the faulty method at Top-1.

  α          0      0.2    0.4    0.6    0.8    1
  Top-1      0.319  0.387  0.423  0.431  0.423  0.356
  Top-3      0.619  0.627  0.650  0.647  0.644  0.611
  Top-5      0.692  0.717  0.717  0.717  0.706  0.678
  Top-10     0.784  0.790  0.796  0.790  0.787  0.773
  EXAM score 0.212  0.204  0.201  0.203  0.209  0.249

Fig. 6: The result comparison of LINPRED with different coefficients (α).

From the above explanation, both the complementarity and the individual limitations should be considered to design a relatively good combining method. Therefore, we further explored different coefficient values (α) of the linear combination and present the results in Figure 6.

From this figure, LINPRED is always superior to the original SBFL approach (α=0) with respect to the recall of Top-1, though the improvement varies with the value of α. In particular, LINPRED performs best when α falls into the interval [0.4, 0.8]. Overall, the combining approaches are relatively effective. However, we only explored several simple combinations in this paper, i.e., MAXPRED and LINPRED; more thorough investigations of potential combinations are left for future work.

Finding 5. The linear combination of SBFL and SD predicates (LINPRED) performs 23.2% better w.r.t Top-1 than the combination with the "max" function (MAXPRED). In particular, LINPRED achieves the best result when the coefficient falls into [0.4, 0.8].

VI. PREDFL AND ITS PERFORMANCE

According to the previous analysis, the combined approach under the unified model is effective and better than each standalone technique. Additionally, from the experimental results, we can conclude that the combined approach with the default configuration for each variation is relatively good in contrast with the others. Therefore, we propose a new fault localization approach by instantiating the unified model with all default configurations, i.e., collecting all groups of predicates at the statement level and linearly combining predicate scores computed with the Ochiai formula. For convenient reference, we call this approach PREDFL.

Although standalone fault localization approaches have achieved great success [24], [44]–[46] in the last decade, researchers have recently noticed the benefit of combining different data sources or existing fault localization techniques to improve the state of the art. Based on the previous investigation, each standalone approach has advantages and limitations. Therefore, combining different kinds of techniques can potentially exert their strengths while mitigating their limitations. As a result, such combining techniques already benefit state-of-the-art fault localization approaches [17]–[19], [38]. Since different information has been utilized by these combining techniques, a question naturally arises: is our approach (PREDFL) already implicitly covered by existing techniques? In other words, is PREDFL complementary to existing techniques?

Therefore, in this section we explore the potential of PREDFL to improve state-of-the-art fault localization techniques, for which we integrated PREDFL with a state-of-the-art fault localization framework developed by Zou et al. [19], COMBINEFL for short. It has achieved better fault localization results than other state-of-the-art approaches, including MULTRIC [17], Savant [25], TraPT [17] and FLUCCS [18].

As a result, to check whether our approach is complementary to existing approaches, we combine it with this framework and check whether the combination brings any improvement. More concretely, the current implementation of COMBINEFL consists of spectrum-, mutation-, history- and IR-based fault localization techniques [24], [36], [45], [47]–[49], stack trace analysis [50], dynamic slicing [51]–[53] and the predicate switching technique [44]. Different techniques can be freely assembled. In particular, based on the time needed to perform fault localization (including data collection), the techniques are classified into four levels, listed in Table III. For example, to locate faulty elements within minutes, we can use all the techniques in Level 1 and Level 2 together. According to our empirical results in Section V, the execution time of PREDFL is less than three minutes. Therefore, in this comparison experiment, we selected the same execution-time configuration, i.e., integrating PREDFL with the techniques in Level 2 (techniques in Level 1 included). Additionally, we also employed Level 4, since it contains almost all existing techniques, feeding the results of PREDFL together with those of the other techniques into the framework. We performed method-level fault localization on the Defects4J benchmark.

TABLE III: Different integration levels of the COMBINEFL framework according to execution time.

  Time Level              Family               Technique
  Level 1 (Seconds)       history-based        Bugspots
                          stack trace          stack trace
                          IR-based             BugLocator
  Level 2 (Minutes)       slicing              union, intersection and frequency
                          spectrum-based       Ochiai and DStar
  Level 3 (~Ten minutes)  predicate switching  predicate switching
  Level 4 (Hours)         mutation-based       Metallaxis and MUSE

Table IV presents the results in detail. From the table, we can see that after integrating our approach, all the results are improved, from 4.8% (75.9% vs 79.6%) to 20.8% (48.7% vs 40.3%). Furthermore, PREDFL even improves the fault localization effectiveness on each individual project w.r.t Top-1. Notably, after integrating PREDFL, the fault localization results at Level 2 are even better than those at Level 4 without PREDFL. Please note that, since the current implementation of COMBINEFL is already superior to all existing techniques as evaluated in their paper, we do not redundantly compare against those techniques here, because the integrated approach is already more effective than the original one.

Finding 6. Our combined approach PREDFL, which linearly combines predicates from SBFL and SD, can further improve existing techniques, by up to 20.8% in terms of Top-1, which indicates that our approach complements existing techniques.

VII. IMPLICATIONS FOR FUTURE STUDY

A. Predicate Selection

From the experimental results we can see that the selection of predicates plays an important role in the performance of fault localization. On the one hand, not all predicates play an equal role in identifying faulty locations. As our results show, the predicate group branches contributes most to the accuracy of fault localization, while some predicates may even play negative roles: the performance of all predicates together is sometimes lower than using the predicates from one group. On the other hand, the performance of a predicate group may vary depending on where the predicates are seeded. For example, the code snippet in Listing 2 shows a patch for bug Chart-9 in the Defects4J benchmark.

1   if (start == null) {
2     throw new IllegalArgumentException();
3   }
4   ...
5 - if (endIndex < 0) {
6 + if ((endIndex < 0) || (endIndex < startIndex)) {

Listing 2: Patch of Chart-9 in Defects4J benchmark.

From the patch we can see that the root cause is the condition of the second if statement in line 5, where an additional condition expression endIndex<startIndex should be inserted, as shown in line 6. However, after calculating the suspiciousness scores of the predicates from the branches group, we notice that the predicate !(start==null) taken from line 1 has a higher suspiciousness than the predicate endIndex<0 from line 5, leading to an incorrect localization of line 1. That is, while the branches group generally plays a positive role, it may also have negative effects.

Based on this observation, one potential direction for improvement is developing a more effective approach for selecting predicates. Since the performance of predicates varies from location to location, the selection should be conditioned on the context of the predicates. Techniques such as machine learning or invariant generation may be used.


TABLE IV: Integration fault localization results with existing techniques on the Defects4J benchmark.

           Level 2                                    Level 4
           Top-1    Top-3    Top-5    Top-10          Top-1    Top-3    Top-5    Top-10
           -    +   -    +   -    +   -    +          -    +   -    +   -    +   -    +
  Chart    11   13  19   18  21   18  23   25         13   14  19   18  21   18  24   20
  Math     49   58  78   81  84   87  90   94         63   67  83   86  85   91  93   95
  Lang     42   45  55   55  57   56  60   60         47   51  59   57  61   61  61   61
  Time     12   12  15   16  17   19  18   20         10   12  12   17  16   19  19   20
  Closure  30   46  47   64  52   68  59   81         35   41  57   64  64   76  74   88
  Total    144  174 214  234 231  248 250  280        168  185 230  242 247  265 271  284
  Percent  40.3%/48.7%  59.9%/65.5%  64.7%/69.5%  70.0%/78.4%  47.1%/51.8%  64.4%/67.8%  69.2%/74.2%  75.9%/79.6%

In the table, the columns headed by "-" denote the original results of the fault localization framework COMBINEFL, while the columns headed by "+" denote the results after integrating PREDFL.

B. Better Formulas

In our study, we found that applying SBFL formulas to SD predicates achieves significantly better performance. However, SBFL formulas are designed for program elements, and thus may not be the best choice for SD predicates. For instance, SBFL formulas do not utilize the two values F0 and S0. Future studies could explore the design space of risk evaluation formulas and check whether better formulas can be discovered.

VIII. DISCUSSION

Threats to internal validity are related to errors and selection bias in the experiment. To mitigate this risk, two authors of the paper, each with at least five years of development experience, carefully checked the implementation of our experiment via code review. Moreover, we ran our approach multiple times to further guarantee the correctness of the implementation. Besides, the selection of metrics in our experiment may also affect the evaluation results and the corresponding conclusions. To tackle this problem, we selected two distinct metrics in our study, i.e., Top-k recall and EXAM score, both of which are prevalently employed by previous research.

Threats to external validity are related to the generalizability of our approach. In our experiment, we selected a commonly used benchmark, Defects4J, as our experimental dataset, which contains faults from diverse real-world projects. Moreover, we presented and discussed the final results against existing approaches to show the effectiveness of our combined approach. As studied by Just et al. [54], developer-provided triggering test cases could potentially overestimate the performance of fault localization techniques; this paper inherits that threat from the dataset as well. However, the results of our study are still informative, with comparisons along multiple dimensions. Moreover, the unified model and the implications learned from the study are free from this threat.

IX. RELATED WORK

In this section, we discuss related work with respect to four aspects. First, we explain the spectrum-based fault localization and statistical debugging techniques explored in this paper. Then we discuss recent fault localization approaches that combine different techniques to improve the state of the art. Finally, we discuss existing empirical analyses of fault localization.

Spectrum-based fault localization is one of the most prevalent fault localization techniques and is broadly utilized in program debugging, e.g., in automatic program repair [21], [23]. In the past decade, many SBFL approaches have been proposed. For example, Tarantula [2] is a representative early approach proposed by Jones and Harrold; Abreu and colleagues then studied it more thoroughly and proposed a superior formula, Ochiai [24], which is often used to measure similarity in the molecular biology domain. Recently, other formulas have been proposed [55]. Though many approaches exist, as introduced in Section II-A, different SBFL approaches follow the same paradigm of leveraging program syntax coverage to compute suspicious scores for candidate program elements with a predefined formula. Recently, SBFL has been employed effectively in different domains, such as SQL predicates [56], model transformations [57], logic-based reasoning [58], compiler bug isolation [59], and software product lines [60]. Besides, existing studies try to improve the SBFL technique with other optimized information, such as suspicious variables [61] and test cases reduced by delta debugging [62], or by incorporating techniques like data-augmented diagnosis [63] and feedback-based goodness estimates [64]. In contrast, in this paper we further explore the effectiveness of SBFL predicates and formulas by combining them with statistical debugging under a unified model. Like existing studies [16], we selected a set of representative formulas in our evaluation to mitigate selection bias. We leave a more thorough comparison among different formulas to future work.

Statistical debugging was proposed for remote program sampling by Liblit et al. [4], [5]. Over the years, many approaches have been proposed to improve statistical debugging. Arumuga Nainar et al. [10] proposed combining simple predicates into complex ones to improve their distinguishing ability. Zheng et al. [12] proposed using a machine learning model to isolate predicates into clusters, each of which ideally captures one single fault in a program that contains multiple faults. Similarly, Jiang and Su [65] proposed clustering selected predicates to reconstruct the candidate paths that are most likely to be faulty. Liu et al. [7], [13] proposed a new formula to compute the scores of predicates, which distinguishes the predicate coverage distributions between failed and passed tests. Arumuga Nainar and Liblit [6] proposed reducing the performance overhead of statistical debugging with adaptive binary instrumentation guided by a set of heuristics, and Zuo et al. [66] improved the performance of statistical debugging with abstraction refinement. Recently, Chilimbi and colleagues [11] proposed improving traditional statistical debugging with path profiling, while Yu et al. [67] utilized predicates to reduce test cases. In this paper, we further explore the effectiveness of the different predicates of SBFL and statistical debugging, and investigate their different combinations, aiming to boost current fault localization techniques, which is orthogonal to existing studies.

Fault localization combination denotes exploring different methods of combining fault localization techniques. Recently, many such approaches have been proposed and have improved the state of the art. Santelices et al. [68] explored three kinds of coverage information, namely statement, branch and define-use pair coverage, and found that the combination of the three kinds of coverage can improve the individual techniques. Recently, Tu et al. [69] proposed combining SBFL with slicing-hitting-set computation, which improves the effectiveness of SBFL. Xuan and Monperrus [38] proposed leveraging a learning-to-rank method to combine dozens of SBFL formulas, which obtained significant improvement over single formulas. Similarly, Lucia et al. [70] proposed fusing the diversity of existing spectrum-based fault localization techniques by considering dozens of variants of their combinations. Wang et al. [71] proposed employing genetic algorithms and simulated annealing to search for optimal combinations of different spectrum-based fault localization techniques. Both were evaluated to outperform individual approaches. Recently, several approaches have been proposed that leverage a learning-to-rank model to combine SBFL with supplementary information, such as invariants mined from executions [25], bug reports [72], code changes [18] and mutant coverage [17], [19], and have improved existing techniques. Additionally, Li et al. [73] leveraged deep learning models to combine various features from the fault localization, defect prediction and information retrieval areas to further improve state-of-the-art techniques. In this paper, we proposed a unified model of SBFL and statistical debugging, based on which we systematically explored four different variations. As a result, we found an effective configuration for each individual variation and proposed a simple combining method that incorporates their merits, which is evaluated to be effective and efficient. Moreover, our approach is orthogonal to existing techniques and is evaluated to have the potential to further improve the state of the art.

Empirical analysis on the performance of fault local-ization mainly focuses on analyzing existing fault localiza-tion techniques and investigating the essence of effectiveness.Harrold et al. [74] studied the relationships between programspectra and faults, which established the foundation of SBFL.Recently. Xie et al. [8] and Chen et al. [75] have theoreticallyrevisited different formulas of SBFL approaches and found

that there dose not exist the “best” formula that significantlyoutperforms the others. Pearson et al. [16] systematicallyexplored the effectiveness of SBFL approaches and mutationbased fault localization on both real-word faults and artificialfaults, where they found the artificial faults are not signifi-cantly useful for predicting which fault localization techniquesperform best on real faults. In this paper, we explored fourvariations that contributed to the improvement of combiningdifferent predicates, based on which we proposed a simpleand effective fault localization approach. Moreover, our anal-ysis provide several implications for future fault localizationstudies.

Empirical analysis on the usefulness of fault localization investigates the usability of automated fault localization techniques in practice. Parnin and Orso [76] and Xie et al. [77] studied the usefulness of automated fault localization techniques and observed that improvements in fault localization accuracy may not directly translate into improved manual debugging and may even weaken programmers' abilities in fault detection. However, a more rigorous study on larger-scale, real-world bugs performed by Xia et al. [78] reported that automated fault localization can positively impact debugging success and efficiency. Moreover, they found that the more accurate the fault localization results are, the more efficient the debugging process becomes, demonstrating the practicability of automated fault localization techniques. Similarly, Debroy and Wong [79] empirically showed that better fault localization can also potentially improve the effectiveness of fault-fixing processes. As a consequence, in this paper we target improving the accuracy of current fault localization techniques and explore the combination of spectrum-based fault localization and statistical debugging.

X. CONCLUSION

In this paper, we empirically investigated different combinations of spectrum-based fault localization and statistical debugging techniques, for which we proposed a unified model. In addition, we systematically explored four variations under the model, which led to several findings: (1) among all predicates, those derived from existing branch conditions contribute most to the Top-1 fault localization accuracy; (2) fine-grained data collection enables more effective fault localization with little extra execution overhead; (3) a linear combination of suspicious scores from SBFL and SD predicates leads to the best result. Based on these findings, we proposed a new fault localization technique, PREDFL, which is shown to be complementary to existing techniques as it can further improve the state of the art when combined with them.
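Finding (3) can be illustrated with a small sketch of the score-combination step; `alpha` is a hypothetical weight introduced here for illustration (the actual configuration of PREDFL is the one determined in the study):

```python
def rank_elements(sbfl_scores, sd_scores, alpha=0.5):
    """Rank program elements by a linear combination of SBFL and
    SD-predicate suspiciousness scores. `alpha` is a hypothetical
    weight; elements missing from one map default to a score of 0."""
    elements = set(sbfl_scores) | set(sd_scores)
    combined = {
        e: alpha * sbfl_scores.get(e, 0.0) + (1 - alpha) * sd_scores.get(e, 0.0)
        for e in elements
    }
    # Most suspicious elements first.
    return sorted(elements, key=lambda e: combined[e], reverse=True)
```

For example, an element ranked highly by only one of the two score sources can still surface near the top of the combined ranking, which is how the combination complements each individual approach.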

ACKNOWLEDGMENT

This work was partially supported by the National Key Research and Development Program of China under Grant No. 2017YFB1001803, and the National Natural Science Foundation of China under Grant Nos. 61672045, 61529201 and 6167254.


REFERENCES

[1] J. A. Jones, M. J. Harrold, and J. Stasko, “Visualization of test information to assist fault localization,” in ICSE, 2002.

[2] J. A. Jones and M. J. Harrold, “Empirical evaluation of the tarantula automatic fault-localization technique,” in ASE. New York, NY, USA: ACM, 2005, pp. 273–282.

[3] R. Abreu, P. Zoeteweij, and A. J. Van Gemund, “On the accuracy of spectrum-based fault localization,” in TAICPART-MUTATION, 2007, pp. 89–98.

[4] B. Liblit, A. Aiken, A. X. Zheng, and M. I. Jordan, “Bug isolation via remote program sampling,” in PLDI. New York, NY, USA: ACM, 2003, pp. 141–154.

[5] B. Liblit, M. Naik, A. X. Zheng, A. Aiken, and M. I. Jordan, “Scalable statistical bug isolation,” in PLDI. New York, NY, USA: ACM, 2005, pp. 15–26.

[6] P. Arumuga Nainar and B. Liblit, “Adaptive bug isolation,” ser. ICSE, New York, NY, USA, 2010, pp. 255–264.

[7] C. Liu, X. Yan, L. Fei, J. Han, and S. P. Midkiff, “Sober: Statistical model-based bug localization,” ser. ESEC/FSE-13. New York, NY, USA: ACM, 2005, pp. 286–295.

[8] X. Xie, T. Y. Chen, F.-C. Kuo, and B. Xu, “A theoretical analysis of the risk evaluation formulas for spectrum-based fault localization,” ACM Trans. Softw. Eng. Methodol., vol. 22, no. 4, pp. 31:1–31:40, 2013.

[9] T. D. B. Le, F. Thung, and D. Lo, “Theory and practice, do they match? A case with spectrum-based fault localization,” in 2013 IEEE International Conference on Software Maintenance, 2013, pp. 380–383.

[10] P. Arumuga Nainar, T. Chen, J. Rosin, and B. Liblit, “Statistical debugging using compound boolean predicates,” ser. ISSTA, New York, NY, USA, 2007, pp. 5–15.

[11] T. M. Chilimbi, B. Liblit, K. Mehra, A. V. Nori, and K. Vaswani, “Holmes: Effective statistical debugging via efficient path profiling,” ser. ICSE, Washington, DC, USA, 2009, pp. 34–44.

[12] A. X. Zheng, M. I. Jordan, B. Liblit, M. Naik, and A. Aiken, “Statistical debugging: Simultaneous identification of multiple bugs,” ser. ICML, New York, NY, USA, 2006, pp. 1105–1112.

[13] C. Liu, L. Fei, X. Yan, J. Han, and S. P. Midkiff, “Statistical debugging: A hypothesis testing-based approach,” TSE, vol. 32, no. 10, pp. 831–848, Oct 2006.

[14] L. Jiang and Z. Su, “Context-aware statistical debugging: From bug predictors to faulty control flow paths,” ser. ASE, New York, NY, USA, 2007, pp. 184–193.

[15] R. Just, D. Jalali, and M. D. Ernst, “Defects4J: A database of existing faults to enable controlled testing studies for Java programs,” in ISSTA, 2014, pp. 437–440.

[16] S. Pearson, J. Campos, R. Just, G. Fraser, R. Abreu, M. D. Ernst, D. Pang, and B. Keller, “Evaluating and improving fault localization,” ser. ICSE ’17, 2017, pp. 609–620.

[17] X. Li and L. Zhang, “Transforming programs and tests in tandem for fault localization,” pp. 92:1–92:30, 2017.

[18] J. Sohn and S. Yoo, “Fluccs: Using code and change metrics to improve fault localization,” ser. ISSTA, New York, NY, USA, 2017, pp. 273–283.

[19] D. Zou, J. Liang, Y. Xiong, M. D. Ernst, and L. Zhang, “An empirical study of fault localization families and their combinations,” IEEE Transactions on Software Engineering, 2019.

[20] J. Xuan, M. Martinez, F. Demarco, M. Clement, S. Lamelas, T. Durieux, D. Le Berre, and M. Monperrus, “Nopol: Automatic repair of conditional statement bugs in Java programs,” TSE, 2017.

[21] Y. Xiong, J. Wang, R. Yan, J. Zhang, S. Han, G. Huang, and L. Zhang, “Precise condition synthesis for program repair,” in ICSE, 2017.

[22] X.-B. D. Le, D. Lo, and C. Le Goues, “History driven program repair,” in SANER, 2016, pp. 213–224.

[23] Q. Xin and S. P. Reiss, “Leveraging syntax-related code for automated program repair,” ser. ASE, 2017. [Online]. Available: http://dl.acm.org/citation.cfm?id=3155562.3155644

[24] R. Abreu, P. Zoeteweij, and A. J. C. v. Gemund, “An evaluation of similarity coefficients for software fault localization,” ser. PRDC. Washington, DC, USA: IEEE Computer Society, 2006, pp. 39–46.

[25] T.-D. B. Le, D. Lo, C. Le Goues, and L. Grunske, “A learning-to-rank based fault localization approach using likely invariants,” in ISSTA, 2016, pp. 177–188.

[26] M. Martinez, T. Durieux, R. Sommerard, J. Xuan, and M. Monperrus, “Automatic repair of real bugs in Java: A large-scale experiment on the Defects4J dataset,” Empirical Software Engineering, pp. 1–29, 2016.

[27] L. Chen, Y. Pei, and C. A. Furia, “Contract-based program repair without the contracts,” in ASE, 2017.

[28] R. K. Saha, Y. Lyu, H. Yoshida, and M. R. Prasad, “Elixir: Effective object oriented program repair,” in ASE. IEEE Press, 2017. [Online]. Available: http://dl.acm.org/citation.cfm?id=3155562.3155643

[29] M. Wen, J. Chen, R. Wu, D. Hao, and S.-C. Cheung, “Context-aware patch generation for better automated program repair,” in ICSE, 2018.

[30] J. Jiang, Y. Xiong, H. Zhang, Q. Gao, and X. Chen, “Shaping program repair space with existing patches and similar code,” in ISSTA, 2018.

[31] J. Hua, M. Zhang, K. Wang, and S. Khurshid, “Towards practical program repair with on-demand candidate generation,” in ICSE, 2018.

[32] A. Ghanbari, S. Benton, and L. Zhang, “Practical program repair via bytecode mutation,” in ISSTA. New York, NY, USA: ACM, 2019, pp. 19–30.

[33] J. Jiang, Y. Xiong, and X. Xia, “A manual inspection of Defects4J bugs and its implications for automatic program repair,” Science China Information Sciences, vol. 62, p. 200102, Sep 2019.

[34] P. S. Kochhar, X. Xia, D. Lo, and S. Li, “Practitioners’ expectations on automated fault localization,” ser. ISSTA, New York, NY, USA, 2016, pp. 165–176.

[35] R. Abreu, P. Zoeteweij, and A. J. C. v. Gemund, “Spectrum-based multiple fault localization,” in ASE. Washington, DC, USA: IEEE Computer Society, 2009, pp. 88–99.

[36] W. E. Wong, V. Debroy, R. Gao, and Y. Li, “The DStar method for effective software fault localization,” IEEE Transactions on Reliability, no. 1, pp. 290–308, March 2014.

[37] L. Naish, H. J. Lee, and K. Ramamohanarao, “A model for spectra-based software diagnosis,” ACM Trans. Softw. Eng. Methodol., no. 3, pp. 11:1–11:32, 2011.

[38] J. Xuan and M. Monperrus, “Learning to combine multiple ranking metrics for fault localization,” in ICSME, 2014, pp. 191–200.

[39] F. Steimann, M. Frenkel, and R. Abreu, “Threats to the validity and value of empirical assessments of the accuracy of coverage-based fault locators,” in ISSTA, 2013, pp. 314–324.

[40] W. E. Wong, R. Gao, Y. Li, R. Abreu, and F. Wotawa, “A survey on software fault localization,” TSE, pp. 707–740, 2016.

[41] W. E. Wong, V. Debroy, R. Golden, X. Xu, and B. Thuraisingham, “Effective software fault localization using an RBF neural network,” IEEE Transactions on Reliability, pp. 149–169, 2012.

[42] E. Wong, T. Wei, Y. Qi, and L. Zhao, “A crosstab-based statistical method for effective fault localization,” in 2008 1st International Conference on Software Testing, Verification, and Validation, 2008, pp. 42–51.

[43] M. Soto, F. Thung, C.-P. Wong, C. Le Goues, and D. Lo, “A deeper look into bug fixes: Patterns, replacements, deletions, and additions,” in MSR. New York, NY, USA: ACM, 2016, pp. 512–515.

[44] X. Zhang, N. Gupta, and R. Gupta, “Locating faults through automated predicate switching,” in ICSE, 2006, pp. 272–281.

[45] S. Moon, Y. Kim, M. Kim, and S. Yoo, “Ask the mutants: Mutating faulty programs for fault localization,” in ICST, March 2014, pp. 153–162.

[46] X. Zhang, N. Gupta, and R. Gupta, “Pruning dynamic slices with confidence,” ser. PLDI, New York, NY, USA, 2006, pp. 169–180.

[47] M. Papadakis and Y. Le Traon, “Metallaxis-FL: Mutation-based fault localization,” Softw. Test. Verif. Reliab., pp. 605–628, Aug. 2015.

[48] F. Rahman, D. Posnett, A. Hindle, E. Barr, and P. Devanbu, “BugCache for inspections: Hit or miss?” ser. ESEC/FSE ’11, New York, NY, USA, 2011, pp. 322–331.

[49] J. Zhou, H. Zhang, and D. Lo, “Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports,” in ICSE, June 2012, pp. 14–24.

[50] A. Schroter, A. Schrter, N. Bettenburg, and R. Premraj, “Do stack traces help developers fix bugs?” in MSR 2010, May 2010, pp. 118–121.

[51] C. Hammacher, “Design and implementation of an efficient dynamic slicer for Java,” Bachelor’s Thesis, Nov. 2008.

[52] T. Wang and A. Roychoudhury, “Using compressed bytecode traces for slicing Java programs,” in ICSE, May 2004, pp. 512–521.

[53] H. Pan and E. H. Spafford, “Heuristics for automatic localization of software faults,” World Wide Web, 1992.

[54] R. Just, C. Parnin, I. Drosos, and M. D. Ernst, “Comparing developer-provided to user-provided tests for fault localization and automated program repair,” in ISSTA, 2018, pp. 287–297.

[55] L. Naish, H. J. Lee, and K. Ramamohanarao, “A model for spectra-based software diagnosis,” ACM Trans. Softw. Eng. Methodol., vol. 20, pp. 11:1–11:32, 2011.


[56] Y. Guo, N. Li, J. Offutt, and A. Motro, “Exoneration-based fault localization for SQL predicates,” Journal of Systems and Software, vol. 147, pp. 230–245, 2019.

[57] J. Troya, S. Segura, J. A. Parejo, and A. Ruiz-Cortes, “Spectrum-based fault localization in model transformations,” ACM Trans. Softw. Eng. Methodol., pp. 13:1–13:50, Sep. 2018. [Online]. Available: http://doi.acm.org/10.1145/3241744

[58] I. Pill and F. Wotawa, “Spectrum-based fault localization for logic-based reasoning,” in 2018 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), Oct 2018, pp. 192–199.

[59] J. Chen, J. Han, P. Sun, L. Zhang, D. Hao, and L. Zhang, “Compiler bug isolation via effective witness test program generation,” in ESEC/FSE. New York, NY, USA: ACM, 2019, pp. 223–234.

[60] A. Arrieta, S. Segura, U. Markiegi, G. Sagardui, and L. Etxeberria, “Spectrum-based fault localization in software product lines,” Information and Software Technology, vol. 100, pp. 18–31, 2018.

[61] J. Kim, J. Kim, and E. Lee, “VFL: Variable-based fault localization,” Information and Software Technology, vol. 107, pp. 179–191, 2019.

[62] A. Christi, M. L. Olson, M. A. Alipour, and A. Groce, “Reduce before you localize: Delta-debugging and spectrum-based fault localization,” in 2018 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), Oct 2018, pp. 184–191.

[63] A. Elmishali, R. Stern, and M. Kalech, “Data-augmented software diagnosis,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, ser. AAAI’16. AAAI Press, 2016, pp. 4003–4009. [Online]. Available: http://dl.acm.org/citation.cfm?id=3016387.3016470

[64] N. Cardoso and R. Abreu, “A kernel density estimate-based approach to component goodness modeling,” in Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, ser. AAAI’13. AAAI Press, 2013, pp. 152–158. [Online]. Available: http://dl.acm.org/citation.cfm?id=2891460.2891482

[65] L. Jiang and Z. Su, “Context-aware statistical debugging: From bug predictors to faulty control flow paths,” ser. ASE, New York, NY, USA, 2007, pp. 184–193.

[66] Z. Zuo, L. Fang, S.-C. Khoo, G. Xu, and S. Lu, “Low-overhead and fully automated statistical debugging with abstraction refinement,” ser. OOPSLA. New York, NY, USA: ACM, 2016, pp. 881–896. [Online]. Available: http://doi.acm.org/10.1145/2983990.2984005

[67] Y. Yu, J. Jones, and M. J. Harrold, “An empirical study of the effects of test-suite reduction on fault localization,” in ICSE, 2008, pp. 201–210.

[68] R. Santelices, J. A. Jones, Y. Yu, and M. J. Harrold, “Lightweight fault-localization using multiple coverage types,” ser. ICSE. Washington, DC, USA: IEEE Computer Society, 2009, pp. 56–66.

[69] J. Tu, X. Xie, T. Y. Chen, and B. Xu, “On the analysis of spectrum based fault localization using hitting sets,” Journal of Systems and Software, vol. 147, pp. 106–123, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0164121218302231

[70] Lucia, D. Lo, and X. Xia, “Fusion fault localizers,” in ASE. New York, NY, USA: ACM, 2014, pp. 127–138. [Online]. Available: http://doi.acm.org/10.1145/2642937.2642983

[71] S. Wang, D. Lo, L. Jiang, Lucia, and H. C. Lau, “Search-based fault localization,” in ASE 2011, Nov 2011, pp. 556–559.

[72] T.-D. B. Le, R. J. Oentaryo, and D. Lo, “Information retrieval and spectrum based bug localization: Better together,” in ESEC/FSE. New York, NY, USA: ACM, 2015, pp. 579–590.

[73] X. Li, W. Li, Y. Zhang, and L. Zhang, “DeepFL: Integrating multiple fault diagnosis dimensions for deep fault localization,” in ISSTA. New York, NY, USA: ACM, 2019, pp. 169–180.

[74] M. J. Harrold, G. Rothermel, K. Sayre, R. Wu, and L. Yi, “An empirical investigation of the relationship between spectra differences and regression faults,” Software Testing, Verification and Reliability, pp. 171–194.

[75] T. Y. Chen, X. Xie, F. C. Kuo, and B. Xu, “A revisit of a theoretical analysis on spectrum-based fault localization,” in 2015 IEEE 39th Annual Computer Software and Applications Conference, July 2015, pp. 17–22.

[76] C. Parnin and A. Orso, “Are automated debugging techniques actually helping programmers?” in ISSTA, 2011, pp. 199–209.

[77] X. Xie, Z. Liu, S. Song, Z. Chen, J. Xuan, and B. Xu, “Revisit of automatic debugging via human focus-tracking analysis,” in ICSE. ACM, 2016, pp. 808–819.

[78] X. Xia, L. Bao, D. Lo, and S. Li, “‘Automated debugging considered harmful’ considered harmful: A user study revisiting the usefulness of spectra-based fault localization techniques with professionals using real bugs from large systems,” in 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2016, pp. 267–278.

[79] V. Debroy and W. E. Wong, “Combining mutation and fault localization for automated program debugging,” Journal of Systems and Software, vol. 90, pp. 45–60, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0164121213002616